
#**Week 2 - Data Cleaning**





Data cleaning is a crucial step in data analysis as it ensures that the data used for reporting is accurate, reliable, and free from errors or inconsistencies. Data cleaning involves the identification and correction of errors, inconsistencies, and inaccuracies in data collected from various sources, such as governments, companies, and non-profit organizations.

One significant difference between cleaning numerical data and cleaning text data lies in the nature of the data itself. Numerical data consists of numbers, such as financial figures or numeric survey responses, while text data consists of written language, such as news articles or social media posts.

The cleaning process for numerical data typically involves identifying and removing outliers, inconsistencies, and errors in the data, such as missing or incorrect values. This process often involves statistical techniques to identify patterns or trends in the data and to remove any data points that do not fit these patterns. Once the data has been cleaned, it can be used for analysis and reporting.

Cleaning text data, on the other hand, involves identifying and correcting errors in the text, such as spelling and grammatical errors, and removing irrelevant or redundant information. This can be a more subjective process than cleaning numerical data, as it often requires human judgment about what information to include or exclude.

Another important difference between numerical data and text data is the type of analysis each supports. Numerical data is often used for statistical analysis, such as regression analysis or hypothesis testing, while text data is often used for sentiment analysis or other natural language processing tasks.

In conclusion, while both numerical data and text data require cleaning to ensure accuracy and reliability, the cleaning process for each type of data is different due to the nature of the data itself. Additionally, the types of analysis that can be conducted with each type of data may vary, making it important to understand the differences between the two.

##**Cleaning numerical data**

Data Cleaning for Numerical Data

Dataset:

The dataset you will be using is an IMDB movies dataset. The dataset contains the following columns:


1.   Rank
2.   Title
3.   Genre
4.   Description
5.   Director
6.   Actors
7.   Year
8.   Runtime (Minutes)
9.   Rating
10.   Votes
11.   Metascore

The dataset is saved in a CSV file named "IMDB-Movie-Data.csv".



###**Steps:**

1.   Load the dataset into a pandas DataFrame and display the first 10 rows of the dataset.


In [None]:
#import libraries
import csv
import pandas as pd
import numpy as np

In [None]:
dataset_url = "https://github.com/mathiasfls/Foundations-of-Cultural-and-Social-Data-Analysis/blob/acb7e7da0406a990115e1d0556f593f27500c046/data/IMDB-Movie-Data.csv?raw=true"
df = pd.read_csv(dataset_url,delimiter=",")
df.head(10)

Unnamed: 0,Rank,Title,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Metascore
0,1,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,76.0
1,2,Prometheus,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,65.0
2,3,Split,"Horror,Thriller",Three girls are kidnapped by a man with a diag...,M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,157606,62.0
3,8,Mindhorn,Comedy,A has-been actor best known for playing the ti...,Sean Foley,"Essie Davis, Andrea Riseborough, Julian Barrat...",2016,89,6.4,2490,71.0
4,4,Sing,"Animation,Comedy,Family","In a city of humanoid animals, a hustling thea...",Christophe Lourdelet,"Matthew McConaughey,Reese Witherspoon, Seth Ma...",2016,108,7.2,60545,59.0
5,5,Suicide Squad,"Action,Adventure,Fantasy",A secret government agency recruits some of th...,David Ayer,"Will Smith, Jared Leto, Margot Robbie, Viola D...",2016,123,6.2,393727,40.0
6,6,The Great Wall,"Action,Adventure,Fantasy",European mercenaries searching for black powde...,Yimou Zhang,"Matt Damon, Tian Jing, Willem Dafoe, Andy Lau",2016,103,6.1,56036,42.0
7,1,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,76.0
8,7,La La Land,"Comedy,Drama,Music",A jazz pianist falls for an aspiring actress i...,Damien Chazelle,"Ryan Gosling, Emma Stone, Rosemarie DeWitt, J....",2016,128,8.3,258682,93.0
9,8,Mindhorn,Comedy,A has-been actor best known for playing the ti...,Sean Foley,"Essie Davis, Andrea Riseborough, Julian Barrat...",2016,89,6.4,2490,71.0



2.   Check if there are any missing values in the dataset. If there are, replace them with the mean of the corresponding column.


In [None]:
# isnull() flags every missing value; sum() then counts the missing values in each column of the data frame.
df.isnull().sum()



Rank                  0
Title                 0
Genre                 0
Description           0
Director              0
Actors                0
Year                  0
Runtime (Minutes)     0
Rating                0
Votes                 0
Metascore            64
dtype: int64

In [None]:
# replace the missing Metascore values with the column mean
df['Metascore'] = df['Metascore'].fillna(df['Metascore'].mean())

# count the missing values again after replacing them
print(df.isna().sum())

Rank                 0
Title                0
Genre                0
Description          0
Director             0
Actors               0
Year                 0
Runtime (Minutes)    0
Rating               0
Votes                0
Metascore            0
dtype: int64
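
The same idea can be sketched on a toy table for the case where more than one numeric column has gaps. The values below are made up for illustration and are not rows from the IMDB file. `DataFrame.mean(numeric_only=True)` skips text columns, and `fillna` matches the resulting means to columns by name.

```python
import pandas as pd

# Toy data: illustrative values only, not rows from the IMDB dataset.
toy = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Rating": [8.0, 7.0, None, 6.0],
    "Metascore": [76.0, None, 62.0, 72.0],
})

# Per-column means; missing values are skipped when the mean is computed.
column_means = toy.mean(numeric_only=True)

# fillna with a Series fills each column with the value stored under its name.
filled = toy.fillna(column_means)

print(filled)
```

The mean is sensitive to extreme values, so the median may be a reasonable alternative for skewed columns; which one fits depends on the data.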


3.   Check if there are any duplicate entries in the dataset. If there are, remove them.

In [None]:
#checking the duplicates 
df.duplicated().sum()





5

In [None]:
#dropping the duplicates
df = df.drop_duplicates()
df.duplicated().sum()

0
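
A minimal sketch of what `duplicated()` and `drop_duplicates()` do, using a four-row toy table rather than the full dataset. It also shows that the original row labels survive the drop, which is why gaps appear in the index afterwards; `reset_index(drop=True)` is one optional way to renumber the rows.

```python
import pandas as pd

# Toy data: illustrative rows only.
toy = pd.DataFrame({
    "Rank": [1, 2, 1, 3],
    "Title": ["Guardians of the Galaxy", "Prometheus",
              "Guardians of the Galaxy", "Split"],
})

# duplicated() marks a row True when an identical row appeared earlier.
mask = toy.duplicated()

# drop_duplicates() keeps the first occurrence; the original index labels
# survive, so reset_index(drop=True) renumbers the rows from 0.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(mask.tolist())
print(deduped)
```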


4.   Detect outliers in the Votes column and filter them out using z-scores. A z-score measures how many standard deviations a value lies from the mean; with this method, any value that lies 3 or more standard deviations from the mean is treated as an outlier.

In [None]:
from scipy import stats

# keep only the rows whose Votes z-score is below 3 in absolute value
df = df[np.abs(stats.zscore(df['Votes'])) < 3]
df.head(10)

Unnamed: 0,Rank,Title,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Metascore
1,2,Prometheus,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,65.0
2,3,Split,"Horror,Thriller",Three girls are kidnapped by a man with a diag...,M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,157606,62.0
3,8,Mindhorn,Comedy,A has-been actor best known for playing the ti...,Sean Foley,"Essie Davis, Andrea Riseborough, Julian Barrat...",2016,89,6.4,2490,71.0
4,4,Sing,"Animation,Comedy,Family","In a city of humanoid animals, a hustling thea...",Christophe Lourdelet,"Matthew McConaughey,Reese Witherspoon, Seth Ma...",2016,108,7.2,60545,59.0
5,5,Suicide Squad,"Action,Adventure,Fantasy",A secret government agency recruits some of th...,David Ayer,"Will Smith, Jared Leto, Margot Robbie, Viola D...",2016,123,6.2,393727,40.0
6,6,The Great Wall,"Action,Adventure,Fantasy",European mercenaries searching for black powde...,Yimou Zhang,"Matt Damon, Tian Jing, Willem Dafoe, Andy Lau",2016,103,6.1,56036,42.0
8,7,La La Land,"Comedy,Drama,Music",A jazz pianist falls for an aspiring actress i...,Damien Chazelle,"Ryan Gosling, Emma Stone, Rosemarie DeWitt, J....",2016,128,8.3,258682,93.0
10,9,The Lost City of Z,"Action,Adventure,Biography","A true-life drama, centering on British explor...",James Gray,"Charlie Hunnam, Robert Pattinson, Sienna Mille...",2016,141,7.1,7188,78.0
11,10,Passengers,"Adventure,Drama,Romance",A spacecraft traveling to a distant colony pla...,Morten Tyldum,"Jennifer Lawrence, Chris Pratt, Michael Sheen,...",2016,116,7.0,192177,41.0
12,11,Fantastic Beasts and Where to Find Them,"Adventure,Family,Fantasy",The adventures of writer Newt Scamander in New...,David Yates,"Eddie Redmayne, Katherine Waterston, Alison Su...",2016,133,7.5,232072,66.0
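
To make the z-score filter less of a black box, here is a small sketch that computes the scores by hand with the standard library. The vote counts are invented for illustration. It uses the population standard deviation, which is what `scipy.stats.zscore` uses by default (`ddof=0`).

```python
from statistics import mean, pstdev

# Toy vote counts: illustrative values with one obviously extreme entry.
votes = [120, 95, 130, 110, 105, 98, 102, 115, 125, 108, 99, 101, 10000]

mu = mean(votes)
sigma = pstdev(votes)  # population standard deviation (divides by n)

# z-score: how many standard deviations a value lies from the mean.
z_scores = [(v - mu) / sigma for v in votes]

# Keep only the values whose absolute z-score is below the threshold.
THRESHOLD = 3
kept = [v for v, z in zip(votes, z_scores) if abs(z) < THRESHOLD]

print(kept)
```

The threshold of 3 is a convention rather than a rule; a stricter or looser cut-off may suit other datasets.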


5.   Save the cleaned dataset to a new CSV file named "cleaned_IMDB-Movie-Data.csv".

In [None]:
df.to_csv('cleaned_IMDB-Movie-Data.csv')
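
One detail worth knowing when saving: by default `to_csv` also writes the row index, which shows up as an extra `Unnamed: 0` column when the file is read back. The sketch below illustrates this on a toy table, writing to an in-memory buffer instead of a file; passing `index=False` is an optional choice, not something the step above requires.

```python
import io
import pandas as pd

# Toy data: illustrative values only.
toy = pd.DataFrame({"Title": ["Prometheus", "Split"], "Votes": [485820, 157606]})

# Default: the row index is written as an extra, unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False: only the data columns are written.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_without_index = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))
print(list(reloaded_without_index.columns))
```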

##**Cleaning text data**

Data Cleaning for Text Data

Dataset:
The dataset you will be using is a small English-language dataset of tweets about UFOs and Area 51. The dataset contains the following columns:

created_at: The datetime the tweet was created
text: The text of the tweet

### **Steps:**

1.   Load the dataset into a pandas DataFrame and display the first 10 rows of the dataset.

In [None]:
dataset_url = "https://github.com/mathiasfls/Foundations-of-Cultural-and-Social-Data-Analysis/blob/28d3700fa0ba87eeaf8cb05467350c2a5569e19f/data/UFO_2023.csv?raw=true"
df = pd.read_csv(dataset_url,delimiter=",")
df.head(10)


Unnamed: 0,created_at,text
0,2023-01-02T02:00:35.000Z,RT @anuragchugh: Area 51 | Aliens UFO &amp; Ad...
1,2023-01-01T21:45:03.000Z,RT @LatestUFOs: Could Jeremy Corbell's New Vid...
2,2023-01-01T17:21:02.000Z,RT @trishab777: BRAND NEW REAL UFO TAKE OFF FR...
3,2023-01-01T17:20:53.000Z,BRAND NEW REAL UFO TAKE OFF FROM AREA 51. STRA...
4,2023-01-01T08:50:31.000Z,Skeptics don't get that Chris Mellon is correc...
5,2023-01-01T06:21:19.000Z,UFO Model Cow Abduction Alien Decoration Area ...
6,2023-01-02T23:34:58.000Z,@SteveDeaceShow @JesseKellyDC https://t.co/guH...
7,2023-01-02T22:20:13.000Z,@AFlyonMikePense George Santos' mother died in...
8,2023-01-02T22:09:14.000Z,BRAND NEW REAL UFO TAKE OFF FROM AREA 51. STRA...
9,2023-01-02T21:14:17.000Z,@uhhhyanna So good! I went down a huge rabbit ...


2.   Convert all text to lowercase to standardize the text data.



In [None]:
df['text'] = df['text'].str.lower()
df.head(10)

Unnamed: 0,created_at,text
0,2023-01-02T02:00:35.000Z,rt anuragchugh area alien ufo amp advanc techn...
1,2023-01-01T21:45:03.000Z,rt latestufo could jeremi corbel s new video c...
2,2023-01-01T17:21:02.000Z,rt trishab brand new real ufo take off from ar...
3,2023-01-01T17:20:53.000Z,brand new real ufo take off from area strang s...
4,2023-01-01T08:50:31.000Z,skeptic don t get that chri mellon is correct ...
5,2023-01-01T06:21:19.000Z,ufo model cow abduct alien decor area ufo lamp...
6,2023-01-02T23:34:58.000Z,stevedeaceshow jessekellydc http t co guh tdcx...
7,2023-01-02T22:20:13.000Z,aflyonmikepens georg santo mother die in hi ar...
8,2023-01-02T22:09:14.000Z,brand new real ufo take off from area strang s...
9,2023-01-02T21:14:17.000Z,uhhhyanna so good i went down a huge rabbit ho...


3.   Remove punctuation, special characters, and numbers from the tweet text using regular expressions.

In [None]:

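# library to clean data
import re

# Replace every character that is not a letter with a space.
# The result is assigned back to the column: rows yielded by iterrows()
# may be copies, so changing them is not guaranteed to update df.
df['text'] = df['text'].apply(lambda text: re.sub('[^a-zA-Z]', ' ', text))

df.head(10)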





Unnamed: 0,created_at,text
0,2023-01-02T02:00:35.000Z,rt anuragchugh area alien ufo amp advanc techn...
1,2023-01-01T21:45:03.000Z,rt latestufo could jeremi corbel s new video c...
2,2023-01-01T17:21:02.000Z,rt trishab brand new real ufo take off from ar...
3,2023-01-01T17:20:53.000Z,brand new real ufo take off from area strang s...
4,2023-01-01T08:50:31.000Z,skeptic don t get that chri mellon is correct ...
5,2023-01-01T06:21:19.000Z,ufo model cow abduct alien decor area ufo lamp...
6,2023-01-02T23:34:58.000Z,stevedeaceshow jessekellydc http t co guh tdcx...
7,2023-01-02T22:20:13.000Z,aflyonmikepens georg santo mother die in hi ar...
8,2023-01-02T22:09:14.000Z,brand new real ufo take off from area strang s...
9,2023-01-02T21:14:17.000Z,uhhhyanna so good i went down a huge rabbit ho...
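
To see what the pattern `[^a-zA-Z]` does in isolation, here is a short sketch on a single made-up tweet. The helper name `clean_text` and the sample string are assumptions for illustration. Because each removed character becomes a space, runs of spaces are left behind; collapsing them with `split()` and `join()` is an extra step that is not part of the original cell.

```python
import re

def clean_text(text):
    """Replace non-letters with spaces, then collapse repeated whitespace."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    return " ".join(letters_only.split())

# Made-up example in the style of the dataset, already lowercased.
sample = "rt @latestufos: area 51 &amp; ufo!"
print(clean_text(sample))
```

Note that the digits in "51" are removed along with the punctuation, and the HTML entity `&amp;` leaves the fragment "amp" behind.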


4.  Remove stop words (commonly used words such as "the", "a", "an", "and", etc.) and reduce the remaining words to their stems.


In [None]:

# Natural Language Tool Kit
import nltk

nltk.download('stopwords')

# to remove stop words
from nltk.corpus import stopwords

# for stemming purposes
from nltk.stem.porter import PorterStemmer

# build the stop-word set and the stemmer once, not once per word
stop_words = set(stopwords.words('english'))
ps = PorterStemmer()

def remove_stopwords_and_stem(text):
    # split into words (default delimiter is whitespace)
    words = text.split()
    # drop stop words and take the stem of each remaining word
    stems = [ps.stem(word) for word in words if word not in stop_words]
    # rejoin the stems into a single string
    return ' '.join(stems)

# assign the result back to the column: rows yielded by iterrows()
# may be copies, so changing them is not guaranteed to update df
df['text'] = df['text'].apply(remove_stopwords_and_stem)

df.head(10)
   

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Unnamed: 0,created_at,text
0,2023-01-02T02:00:35.000Z,rt anuragchugh area alien ufo amp advanc techn...
1,2023-01-01T21:45:03.000Z,rt latestufo could jeremi corbel new video con...
2,2023-01-01T17:21:02.000Z,rt trishab brand new real ufo take area strang...
3,2023-01-01T17:20:53.000Z,brand new real ufo take area strang shape dayl...
4,2023-01-01T08:50:31.000Z,skeptic get chri mellon correct anunnaki even ...
5,2023-01-01T06:21:19.000Z,ufo model cow abduct alien decor area ufo lamp...
6,2023-01-02T23:34:58.000Z,stevedeaceshow jessekellydc http co guh tdcxor...
7,2023-01-02T22:20:13.000Z,aflyonmikepen georg santo mother die hi arm te...
8,2023-01-02T22:09:14.000Z,brand new real ufo take area strang shape dayl...
9,2023-01-02T21:14:17.000Z,uhhhyanna good went huge rabbit hole joe rogan...
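
The stop-word filter on its own can be sketched without NLTK. The set `STOP_WORDS` below is a hypothetical, deliberately tiny stand-in for NLTK's much longer English list, and stemming is left out so that only the filtering step is visible.

```python
# Hypothetical, deliberately tiny stop-word set; NLTK's English list is longer.
STOP_WORDS = {"the", "a", "an", "and", "off", "from", "is", "that"}

def remove_stop_words(text):
    """Return text with every word found in STOP_WORDS removed."""
    words = text.split()
    kept = [word for word in words if word not in STOP_WORDS]
    return " ".join(kept)

print(remove_stop_words("brand new real ufo take off from area"))
```

Using a set rather than a list keeps each membership check fast, which matters once the filter runs over every word of every tweet.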


5.  Save the cleaned dataset to a new CSV file named "cleaned_UFO_2023.csv".

In [None]:
df.to_csv('cleaned_UFO_2023.csv')