<a href="https://colab.research.google.com/github/mathiasfls/Foundations-of-Cultural-and-Social-Data-Analysis/blob/main/2_Hands_on.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


#**Week 2 - Basic Statistics and Data Cleaning**



# Basic statistics

Basic statistics is an essential part of data analysis, and Python is a powerful language for this purpose. In this notebook, we will explore how to use two popular libraries, NumPy and Pandas, for basic statistics in Python.

Pandas is a popular open-source data analysis and manipulation library for the Python programming language. It provides highly efficient data structures for working with structured and tabular data, such as data frames and series, and includes tools for cleaning, transforming, and analyzing data. Pandas is widely used in fields such as finance, economics, social sciences, and many others for its ease of use and powerful features. It can be used in conjunction with other Python libraries, such as NumPy and Matplotlib, to create complex data visualizations and statistical models. Overall, Pandas is an essential tool for anyone working with data in Python.

NumPy is a Python library that provides support for numerical operations and arrays. One of the primary benefits of using NumPy is its ability to perform mathematical operations on arrays with great speed and efficiency. Here are some of the basic statistical functions that NumPy provides:



**Some functions**

In [1]:
#import libraries
import csv
import pandas as pd
import numpy as np

1.   **Mean**: NumPy provides the mean() function, which can be used to calculate the average of an array. For example, the following code will calculate the mean of an array:

In [50]:
datav = np.array([1, 2, 6, 4, 5]) #try with other numbers
print(datav)

[1 2 6 4 5]


In [51]:
mean = np.mean(datav)
print(mean)

3.6
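To make the definition concrete, here is a minimal sketch that computes the same mean by hand, as the sum of the values divided by their count. The variable names `values` and `manual_mean` are illustrative and not part of the original notebook.

```python
import numpy as np

# Same values as in the example above
values = np.array([1, 2, 6, 4, 5])

# The mean is the sum of the values divided by how many values there are
manual_mean = values.sum() / len(values)

print(manual_mean)
# Check that the manual result agrees with NumPy's mean()
print(np.isclose(manual_mean, np.mean(values)))
```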


2.   **Median**: NumPy provides the median() function, which can be used to calculate the middle value of an array once its values are sorted. For example, the following code will calculate the median of an array:

In [53]:
datax = np.array([1, 2, 6, 4, 5])
median = np.median(datax)
print(median)

4.0


In [55]:
datax.sort()
print(datax)

[1 2 4 5 6]
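Sorting the array shows that the median is simply the middle value. When the array has an even number of values there is no single middle element, and NumPy averages the two central values instead. The sketch below illustrates this with an extra value (10) appended purely for demonstration.

```python
import numpy as np

odd = np.array([1, 2, 6, 4, 5])        # five values: one middle element
even = np.array([1, 2, 6, 4, 5, 10])   # six values: two middle elements

median_odd = np.median(odd)    # sorted: 1 2 4 5 6    -> middle value is 4
median_even = np.median(even)  # sorted: 1 2 4 5 6 10 -> average of 4 and 5

print(median_odd)
print(median_even)
```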


3.   **Mode**: NumPy does not provide a mode() function. Instead, we can use stats.mode() from SciPy, or np.unique() with return_counts=True, to find the most frequent value of an array. For example, the following code will calculate the mode of an array:

In [74]:
datay = np.array([1, 2, 6, 2, 5])

In [86]:
from scipy import stats

mode = stats.mode(datay)
print(mode)

ModeResult(mode=array([2]), count=array([2]))




In [88]:
vals,counts = np.unique(datay, return_counts=True)
mode = np.argmax(counts)
print(vals[mode])

2
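One caveat of the `np.argmax` approach is that it reports only one value when several values are tied for most frequent. The following sketch, using a made-up array with a tie, shows how to recover all modes with NumPy and, as an alternative, with `pandas.Series.mode()`.

```python
import numpy as np
import pandas as pd

# Hypothetical array in which 2 and 6 both appear twice
tied = np.array([1, 2, 6, 2, 5, 6])

vals, counts = np.unique(tied, return_counts=True)

# argmax returns the first maximum, so only the smallest tied value is reported
first_mode = vals[np.argmax(counts)]

# Keep every value whose count equals the highest count
all_modes = vals[counts == counts.max()]

# pandas returns all tied modes as well
pandas_modes = pd.Series(tied).mode().tolist()

print(first_mode)
print(all_modes)
print(pandas_modes)
```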


4. **Standard deviation**: NumPy provides the std() function, which can be used to calculate the standard deviation of an array. For example, the following code will calculate the standard deviation of an array:

In [None]:
data = np.array([5, 6, 7, 8, 9])
print(data)

[5 6 7 8 9]


In [None]:
std_dev = np.std(data)
print(std_dev)

1.4142135623730951
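By default, `np.std()` computes the population standard deviation, dividing by the number of values N. If the data is a sample from a larger population, the sample standard deviation (dividing by N - 1) is often preferred, and it can be requested with the `ddof=1` argument. A short sketch of the difference:

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5])

# Population standard deviation: divides by N (NumPy's default, ddof=0)
population_std = np.std(data)

# Sample standard deviation: divides by N - 1
sample_std = np.std(data, ddof=1)

print(population_std)
print(sample_std)
```

Note that pandas uses `ddof=1` by default in `Series.std()`, so the two libraries can give different results for the same data unless `ddof` is set explicitly.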


5. **Variance**: NumPy provides the var() function, which can be used to calculate the variance of an array. For example, the following code will calculate the variance of an array:

In [None]:
data = np.array([1, 2, 3, 4, 5])
variance = np.var(data)
print(variance)

2.0
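The variance is the average squared distance of each value from the mean, and the standard deviation is its square root. The sketch below recomputes the variance from that definition to show how the two measures relate; the names `deviations` and `manual_variance` are illustrative.

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5])

# Distance of each value from the mean
deviations = data - data.mean()

# Variance: the mean of the squared deviations
manual_variance = (deviations ** 2).mean()

print(manual_variance)
# The standard deviation is the square root of the variance
print(np.isclose(np.sqrt(manual_variance), np.std(data)))
```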


# Data cleaning

Data cleaning is a crucial step in data analysis as it ensures that the data used for reporting is accurate, reliable, and free from errors or inconsistencies. Data cleaning involves the identification and correction of errors, inconsistencies, and inaccuracies in data collected from various sources, such as governments, companies, and non-profit organizations.

One significant difference between clean numerical data and text data is the nature of the data itself. Numerical data refers to data that consists of numbers, such as financial data or survey responses, while text data refers to data that consists of text or written language, such as news articles or social media posts.

The cleaning process for numerical data typically involves identifying and removing outliers, inconsistencies, and errors in the data, such as missing or incorrect values. This process often involves statistical techniques to identify patterns or trends in the data and to remove any data points that do not fit these patterns. Once the data has been cleaned, it can be used for analysis and reporting.
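As one possible illustration of a statistical technique for spotting outliers, the sketch below applies the common interquartile range (IQR) rule to a small, made-up series of runtimes. The values, including the implausible 600, and the 1.5 multiplier are assumptions chosen for the example, not something prescribed by the dataset used later.

```python
import pandas as pd

# Hypothetical runtimes in minutes; 600 stands in for a data-entry error
runtimes = pd.Series([89, 103, 108, 117, 121, 123, 124, 128, 600])

# Quartiles and interquartile range
q1 = runtimes.quantile(0.25)
q3 = runtimes.quantile(0.75)
iqr = q3 - q1

# Values further than 1.5 * IQR from the quartiles are flagged as outliers
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = runtimes[(runtimes < lower) | (runtimes > upper)]
cleaned = runtimes[(runtimes >= lower) & (runtimes <= upper)]

print(outliers.tolist())
print(len(cleaned))
```

Whether a flagged value should be removed, corrected, or kept is still a judgment call that depends on what the data represents.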

On the other hand, cleaning text data involves identifying and correcting errors in the text, such as spelling and grammatical errors, and removing any irrelevant or redundant information. This can be a more subjective process than cleaning numerical data, as it requires a human editor to review the text and make decisions about what information to include or exclude.

Another important difference between clean numerical data and text data is the types of analysis that can be conducted with each type of data. Numerical data is often used for statistical analysis, such as regression analysis or hypothesis testing, while text data is often used for sentiment analysis or natural language processing.

In conclusion, while both numerical data and text data require cleaning to ensure accuracy and reliability, the cleaning process for each type of data is different due to the nature of the data itself. Additionally, the types of analysis that can be conducted with each type of data may vary, making it important to understand the differences between the two.

##**Cleaning numerical data**

Data Cleaning for Numerical Data

Dataset:

The dataset you will be using is an IMDB movies dataset. The dataset contains the following columns:


1.   Rank
2.   Title
3.   Genre
4.   Description
5.   Director
6.   Actors
7.   Year
8.   Runtime (Minutes)
9.   Rating
10.   Votes
11.   Metascore

The dataset is saved in a CSV file named "IMDB-Movie-Data.csv".



###**Steps:**

1.   Load the dataset into a pandas DataFrame and display the first 10 rows of the dataset.


In [94]:
#import libraries
import csv
import pandas as pd
import numpy as np

In [None]:
dataset_url = "https://github.com/mathiasfls/Foundations-of-Cultural-and-Social-Data-Analysis/blob/acb7e7da0406a990115e1d0556f593f27500c046/data/IMDB-Movie-Data.csv?raw=true"
df = pd.read_csv(dataset_url,delimiter=",")
df.head(10)

Unnamed: 0,Rank,Title,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Metascore
0,1,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,76.0
1,2,Prometheus,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,65.0
2,3,Split,"Horror,Thriller",Three girls are kidnapped by a man with a diag...,M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,157606,62.0
3,8,Mindhorn,Comedy,A has-been actor best known for playing the ti...,Sean Foley,"Essie Davis, Andrea Riseborough, Julian Barrat...",2016,89,6.4,2490,71.0
4,4,Sing,"Animation,Comedy,Family","In a city of humanoid animals, a hustling thea...",Christophe Lourdelet,"Matthew McConaughey,Reese Witherspoon, Seth Ma...",2016,108,7.2,60545,59.0
5,5,Suicide Squad,"Action,Adventure,Fantasy",A secret government agency recruits some of th...,David Ayer,"Will Smith, Jared Leto, Margot Robbie, Viola D...",2016,123,6.2,393727,40.0
6,6,The Great Wall,"Action,Adventure,Fantasy",European mercenaries searching for black powde...,Yimou Zhang,"Matt Damon, Tian Jing, Willem Dafoe, Andy Lau",2016,103,6.1,56036,42.0
7,1,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,76.0
8,7,La La Land,"Comedy,Drama,Music",A jazz pianist falls for an aspiring actress i...,Damien Chazelle,"Ryan Gosling, Emma Stone, Rosemarie DeWitt, J....",2016,128,8.3,258682,93.0
9,8,Mindhorn,Comedy,A has-been actor best known for playing the ti...,Sean Foley,"Essie Davis, Andrea Riseborough, Julian Barrat...",2016,89,6.4,2490,71.0



2.   Check if there are any missing values in the dataset. If there are, replace them with the mean of the corresponding column.


In [None]:
# Chaining .sum() after .isnull() returns the number of missing values in each column of the DataFrame.
df.isnull().sum()



Rank                  0
Title                 0
Genre                 0
Description           0
Director              0
Actors                0
Year                  0
Runtime (Minutes)     0
Rating                0
Votes                 0
Metascore            64
dtype: int64

In [None]:
# replace the missing Metascore values with the column mean
df['Metascore'] = df['Metascore'].fillna(df['Metascore'].mean())

# print the missing-value counts after replacing null values
print(df.isna().sum())

Rank                 0
Title                0
Genre                0
Description          0
Director             0
Actors               0
Year                 0
Runtime (Minutes)    0
Rating               0
Votes                0
Metascore            0
dtype: int64
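Filling with the mean is only one option. When a column contains extreme values, the median is often a more robust replacement because it is less affected by them. The sketch below compares both on a small, made-up `Metascore` column; the values are illustrative and not taken from the full dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical scores with two missing entries
scores = pd.DataFrame({"Metascore": [76.0, 65.0, np.nan, 40.0, 93.0, np.nan]})

# Fill missing values with the column mean
mean_filled = scores["Metascore"].fillna(scores["Metascore"].mean())

# Fill missing values with the column median instead
median_filled = scores["Metascore"].fillna(scores["Metascore"].median())

print(mean_filled.tolist())
print(median_filled.tolist())
```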


3.   Check if there are any duplicate entries in the dataset. If there are, remove them.

In [None]:
#checking the duplicates 
df.duplicated().sum()


5

In [None]:
#dropping the duplicates
df = df.drop_duplicates()
df.duplicated().sum()

0
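By default, `duplicated()` and `drop_duplicates()` only treat rows as duplicates when every column matches. If two rows describe the same movie but differ slightly in one column, the `subset` parameter can restrict the comparison to selected columns. A minimal sketch on made-up rows (the differing vote count is an assumption for illustration):

```python
import pandas as pd

movies = pd.DataFrame({
    "Rank": [1, 2, 1, 8, 8],
    "Title": ["Guardians of the Galaxy", "Prometheus",
              "Guardians of the Galaxy", "Mindhorn", "Mindhorn"],
    # The last row has a slightly different (hypothetical) vote count
    "Votes": [757074, 485820, 757074, 2490, 2500],
})

# Exact duplicates: all columns must match
exact_duplicates = movies.duplicated().sum()

# Duplicates judged by title only
title_duplicates = movies.duplicated(subset=["Title"]).sum()

# Keep the first occurrence of each title
deduplicated = movies.drop_duplicates(subset=["Title"], keep="first")

print(exact_duplicates, title_duplicates)
print(deduplicated["Title"].tolist())
```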

4.   Save the cleaned dataset to a new CSV file named "cleaned_IMDB-Movie-Data.csv".

In [None]:
df.to_csv('cleaned_IMDB-Movie-Data.csv')
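When a file saved this way is loaded again, the row index written by `to_csv()` comes back as an extra column named `Unnamed: 0`. Passing `index=False` avoids this. The sketch below demonstrates the difference with an in-memory buffer instead of a file on disk; the tiny DataFrame is made up for the example.

```python
import io
import pandas as pd

df_small = pd.DataFrame({"Title": ["Split", "Sing"], "Rating": [7.3, 7.2]})

# Save with the index (the default), then load it back
with_index = io.StringIO()
df_small.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# Save without the index, then load it back
without_index = io.StringIO()
df_small.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_clean = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))
print(list(reloaded_clean.columns))
```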

##**Cleaning textual data - Example 1**

Data Cleaning for Textual Data

Dataset:
The dataset you will be using is a small English-language dataset of tweets about UFOs and Area 51. The dataset contains the following columns:

created_at: The datetime the tweet was created

text: the text of the tweet

### **Steps:**

1.   Load the dataset into a pandas DataFrame and display the first 10 rows of the dataset.

In [None]:
dataset_url = "https://github.com/mathiasfls/Foundations-of-Cultural-and-Social-Data-Analysis/blob/28d3700fa0ba87eeaf8cb05467350c2a5569e19f/data/UFO_2023.csv?raw=true"
df = pd.read_csv(dataset_url,delimiter=",")
df.head(10)


Unnamed: 0,created_at,text
0,2023-01-02T02:00:35.000Z,RT @anuragchugh: Area 51 | Aliens UFO &amp; Ad...
1,2023-01-01T21:45:03.000Z,RT @LatestUFOs: Could Jeremy Corbell's New Vid...
2,2023-01-01T17:21:02.000Z,RT @trishab777: BRAND NEW REAL UFO TAKE OFF FR...
3,2023-01-01T17:20:53.000Z,BRAND NEW REAL UFO TAKE OFF FROM AREA 51. STRA...
4,2023-01-01T08:50:31.000Z,Skeptics don't get that Chris Mellon is correc...
5,2023-01-01T06:21:19.000Z,UFO Model Cow Abduction Alien Decoration Area ...
6,2023-01-02T23:34:58.000Z,@SteveDeaceShow @JesseKellyDC https://t.co/guH...
7,2023-01-02T22:20:13.000Z,@AFlyonMikePense George Santos' mother died in...
8,2023-01-02T22:09:14.000Z,BRAND NEW REAL UFO TAKE OFF FROM AREA 51. STRA...
9,2023-01-02T21:14:17.000Z,@uhhhyanna So good! I went down a huge rabbit ...


2.   Convert all text to lowercase to standardize the text data.



In [None]:
df['text'] = df['text'].str.lower()
df.head(10)

Unnamed: 0,created_at,text
0,2023-01-02T02:00:35.000Z,rt @anuragchugh: area 51 | aliens ufo &amp; ad...
1,2023-01-01T21:45:03.000Z,rt @latestufos: could jeremy corbell's new vid...
2,2023-01-01T17:21:02.000Z,rt @trishab777: brand new real ufo take off fr...
3,2023-01-01T17:20:53.000Z,brand new real ufo take off from area 51. stra...
4,2023-01-01T08:50:31.000Z,skeptics don't get that chris mellon is correc...
5,2023-01-01T06:21:19.000Z,ufo model cow abduction alien decoration area ...
6,2023-01-02T23:34:58.000Z,@stevedeaceshow @jessekellydc https://t.co/guh...
7,2023-01-02T22:20:13.000Z,@aflyonmikepense george santos' mother died in...
8,2023-01-02T22:09:14.000Z,brand new real ufo take off from area 51. stra...
9,2023-01-02T21:14:17.000Z,@uhhhyanna so good! i went down a huge rabbit ...


3.   Remove punctuation, special characters, and numbers from the tweet text using regular expressions.





> Regular expression is not a library nor is it a programming language. Instead, regular expression is a sequence of characters that specifies a search pattern in any given text (string).

Read more [here](https://towardsdatascience.com/regular-expressions-clearly-explained-with-examples-822d76b037b4).
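To give a feel for how search patterns work, here is a small sketch that cleans a single made-up tweet step by step with `re.sub()`. The tweet text and its shortened URL are invented for illustration, and the order of the steps is one reasonable choice rather than a fixed recipe.

```python
import html
import re

# Hypothetical tweet, already lowercased
tweet = "rt @latestufos: area 51 &amp; ufo video https://t.co/abc123"

# Turn HTML entities such as &amp; back into plain characters
text = html.unescape(tweet)

# Remove URLs: "http" or "https", then "://", then any non-space characters
text = re.sub(r"https?://\S+", " ", text)

# Remove @mentions: "@" followed by letters, digits, or underscores
text = re.sub(r"@\w+", " ", text)

# Replace everything that is not a lowercase letter with a space
text = re.sub(r"[^a-z]", " ", text)

# Collapse runs of whitespace into single spaces
cleaned = re.sub(r"\s+", " ", text).strip()

print(cleaned)
```

Removing URLs and mentions before stripping punctuation avoids leaving fragments such as "https", "t", and "co" in the text.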



In [None]:

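# library to clean data
import re  # this module provides regular expression support

# Replace every character that is not a letter with a space.
# The result is assigned back to the column: changing the rows yielded by
# df.iterrows() is not guaranteed to modify the DataFrame itself.
df['text'] = df['text'].apply(lambda text: re.sub('[^a-zA-Z]', ' ', text))

df.head(10)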


Unnamed: 0,created_at,text
0,2023-01-02T02:00:35.000Z,rt anuragchugh area aliens ufo amp ad...
1,2023-01-01T21:45:03.000Z,rt latestufos could jeremy corbell s new vid...
2,2023-01-01T17:21:02.000Z,rt trishab brand new real ufo take off fr...
3,2023-01-01T17:20:53.000Z,brand new real ufo take off from area stra...
4,2023-01-01T08:50:31.000Z,skeptics don t get that chris mellon is correc...
5,2023-01-01T06:21:19.000Z,ufo model cow abduction alien decoration area ...
6,2023-01-02T23:34:58.000Z,stevedeaceshow jessekellydc https t co guh...
7,2023-01-02T22:20:13.000Z,aflyonmikepense george santos mother died in...
8,2023-01-02T22:09:14.000Z,brand new real ufo take off from area stra...
9,2023-01-02T21:14:17.000Z,uhhhyanna so good i went down a huge rabbit ...


4.  Remove stop words (commonly used words such as "the", "a", "an", "and", etc.) and reduce the remaining words to their stems.


In [None]:

# Natural Language Toolkit
import nltk

nltk.download('stopwords')

# list of stop words to remove
from nltk.corpus import stopwords

# For stemming purposes.
# The Porter stemmer is widely used in data mining and information retrieval.
# It only supports English and works by stripping suffixes with a sequence
# of simple rules, which makes it known for its simplicity and speed.
from nltk.stem.porter import PorterStemmer

# build the stop word set and the stemmer once, not once per word or row
stop_words = set(stopwords.words('english'))
ps = PorterStemmer()

def remove_stopwords_and_stem(text):
    # split into words (default delimiter is whitespace)
    words = text.split()
    # drop stop words and take the stem of each remaining word
    stems = [ps.stem(word) for word in words if word not in stop_words]
    # rejoin the stems into a single string
    return ' '.join(stems)

# assign the result back to the column instead of modifying rows in a loop
df['text'] = df['text'].apply(remove_stopwords_and_stem)

df.head(10)
   

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Unnamed: 0,created_at,text
0,2023-01-02T02:00:35.000Z,rt anuragchugh area alien ufo amp advanc techn...
1,2023-01-01T21:45:03.000Z,rt latestufo could jeremi corbel new video con...
2,2023-01-01T17:21:02.000Z,rt trishab brand new real ufo take area strang...
3,2023-01-01T17:20:53.000Z,brand new real ufo take area strang shape dayl...
4,2023-01-01T08:50:31.000Z,skeptic get chri mellon correct anunnaki even ...
5,2023-01-01T06:21:19.000Z,ufo model cow abduct alien decor area ufo lamp...
6,2023-01-02T23:34:58.000Z,stevedeaceshow jessekellydc http co guh tdcxor...
7,2023-01-02T22:20:13.000Z,aflyonmikepen georg santo mother die hi arm te...
8,2023-01-02T22:09:14.000Z,brand new real ufo take area strang shape dayl...
9,2023-01-02T21:14:17.000Z,uhhhyanna good went huge rabbit hole joe rogan...
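The filtering logic itself does not depend on NLTK. As a minimal sketch, the function below removes stop words using a tiny hand-written list. The list is a hypothetical stand-in, far shorter than NLTK's English stop word list, so its result differs from the output above (NLTK also removes "off", for example).

```python
# Hypothetical, deliberately tiny stop word list for illustration only
stop_words = {"the", "a", "an", "and", "from", "is", "that"}

def remove_stop_words(text, stop_words):
    # Keep only the words that are not in the stop word set
    return " ".join(word for word in text.split() if word not in stop_words)

cleaned = remove_stop_words("brand new real ufo take off from area", stop_words)
print(cleaned)
```

Using a set rather than a list keeps each membership check fast, which matters when the same check runs for every word of every tweet.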


5.  Save the cleaned dataset to a new CSV file named "cleaned_UFO_2023.csv".

In [None]:
df.to_csv('cleaned_UFO_2023.csv')

##**Cleaning mixed data**


**Cleaning data with pandas**

This dataset was retrieved from [here](https://github.com/lamthuyvo/cuny-advanced-data-journalism)



### Steps:

**Check empty data**

In [97]:
import pandas as pd

dataset_url = "https://raw.githubusercontent.com/mathiasfls/Foundations-of-Cultural-and-Social-Data-Analysis/main/data/simple_data.csv?raw=true"


In [103]:
simple_data = pd.read_csv(dataset_url)
simple_data

Unnamed: 0,participant_id,empty_column,missing_values,missing_values_2,placeholder_values,placeholder_values_2
0,11202.0,,1.0,4.5,8,2
1,23423.0,,1.0,4.2,NO_DATA,5
2,,,,,,
3,43029.0,,0.0,3.7,4,ASK_LATER
4,30400.0,,1.0,,9,2
5,60495.0,,0.0,4.4,NO_DATA,2
6,12321.0,,1.0,3.2,3,ASK_LATER
7,23090.0,,1.0,2.1,NO_DATA,1
8,99230.0,,,3.2,2,4
9,23432.0,,0.0,,7,6


Checking for null values

In [104]:
simple_data.isna()

Unnamed: 0,participant_id,empty_column,missing_values,missing_values_2,placeholder_values,placeholder_values_2
0,False,True,False,False,False,False
1,False,True,False,False,False,False
2,True,True,True,True,True,True
3,False,True,False,False,False,False
4,False,True,False,True,False,False
5,False,True,False,False,False,False
6,False,True,False,False,False,False
7,False,True,False,False,False,False
8,False,True,True,False,False,False
9,False,True,False,True,False,False


Let's chain the `.sum()` function after the `.isna()` to get some aggregate counts.

In [105]:
simple_data.isna().sum()

participant_id           1
empty_column            13
missing_values           3
missing_values_2         3
placeholder_values       1
placeholder_values_2     1
dtype: int64
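Counts are easier to interpret next to the share of rows they represent. Since `True` counts as 1 and `False` as 0, chaining `.mean()` instead of `.sum()` gives the fraction of missing values per column. The sketch below uses a small made-up DataFrame that mimics a few of the columns above.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature version of the dataset
toy = pd.DataFrame({
    "participant_id": [11202.0, 23423.0, np.nan, 43029.0],
    "empty_column": [np.nan, np.nan, np.nan, np.nan],
    "missing_values": [1.0, 1.0, np.nan, 0.0],
})

# Fraction of missing values in each column (0.0 to 1.0)
missing_share = toy.isna().mean()

# Columns in which every single value is missing
fully_empty = missing_share[missing_share == 1.0].index.tolist()

print(missing_share)
print(fully_empty)
```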

**Dropping empty columns**

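There's a column, aptly titled **empty_column**, that has no values at all! We can drop it using the .dropna() function with the axis and how parameters: axis='columns' tells pandas to look at columns, and how='all' drops only those in which every value is missing.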

In [108]:
simple_data.dropna(axis='columns', how='all', inplace=True)
simple_data

Unnamed: 0,participant_id,missing_values,missing_values_2,placeholder_values,placeholder_values_2
0,11202.0,1.0,4.5,8,2
1,23423.0,1.0,4.2,NO_DATA,5
2,,,,,
3,43029.0,0.0,3.7,4,ASK_LATER
4,30400.0,1.0,,9,2
5,60495.0,0.0,4.4,NO_DATA,2
6,12321.0,1.0,3.2,3,ASK_LATER
7,23090.0,1.0,2.1,NO_DATA,1
8,99230.0,,3.2,2,4
9,23432.0,0.0,,7,6


**Dropping empty rows**

It also looks like there's a row with no data at all! We can drop that using **.dropna()** too, but this time we can set our axis to rows.

In [109]:
simple_data.dropna(axis='rows', how='all', inplace=True)
simple_data

Unnamed: 0,participant_id,missing_values,missing_values_2,placeholder_values,placeholder_values_2
0,11202.0,1.0,4.5,8,2
1,23423.0,1.0,4.2,NO_DATA,5
3,43029.0,0.0,3.7,4,ASK_LATER
4,30400.0,1.0,,9,2
5,60495.0,0.0,4.4,NO_DATA,2
6,12321.0,1.0,3.2,3,ASK_LATER
7,23090.0,1.0,2.1,NO_DATA,1
8,99230.0,,3.2,2,4
9,23432.0,0.0,,7,6
10,21233.0,1.0,2.1,NO_DATA,8
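The `how` parameter controls how strict `.dropna()` is: `how='all'` removes a row only if every value is missing, while `how='any'` removes it if at least one value is missing. For something in between, the `thresh` parameter keeps rows with at least a given number of non-missing values. A short sketch on a made-up DataFrame:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [1.0, np.nan, 2.0, 5.0],
    "c": [np.nan, np.nan, np.nan, 6.0],
})

# Drop rows where every value is missing
all_dropped = toy.dropna(how="all")

# Drop rows where at least one value is missing
any_dropped = toy.dropna(how="any")

# Keep rows that have at least two non-missing values
at_least_two = toy.dropna(thresh=2)

print(all_dropped.index.tolist())
print(any_dropped.index.tolist())
print(at_least_two.index.tolist())
```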


**Dropping rows with missing values**

Let's remove rows that have NaN, or a null value, in the missing_values column.

In [110]:
simple_data.dropna(subset=['missing_values'], inplace=True)
simple_data

Unnamed: 0,participant_id,missing_values,missing_values_2,placeholder_values,placeholder_values_2
0,11202.0,1.0,4.5,8,2
1,23423.0,1.0,4.2,NO_DATA,5
3,43029.0,0.0,3.7,4,ASK_LATER
4,30400.0,1.0,,9,2
5,60495.0,0.0,4.4,NO_DATA,2
6,12321.0,1.0,3.2,3,ASK_LATER
7,23090.0,1.0,2.1,NO_DATA,1
9,23432.0,0.0,,7,6
10,21233.0,1.0,2.1,NO_DATA,8
12,93904.0,1.0,2.8,NO_DATA,0


**Filling missing values**

Let's replace the **NaN** values in the missing_values_2 column with 0.0 using .fillna()!

In [112]:
values = {'missing_values_2': 0.0}
simple_data.fillna(value=values, inplace=True)
simple_data

Unnamed: 0,participant_id,missing_values,missing_values_2,placeholder_values,placeholder_values_2
0,11202.0,1.0,4.5,8,2
1,23423.0,1.0,4.2,NO_DATA,5
3,43029.0,0.0,3.7,4,ASK_LATER
4,30400.0,1.0,0.0,9,2
5,60495.0,0.0,4.4,NO_DATA,2
6,12321.0,1.0,3.2,3,ASK_LATER
7,23090.0,1.0,2.1,NO_DATA,1
9,23432.0,0.0,0.0,7,6
10,21233.0,1.0,2.1,NO_DATA,8
12,93904.0,1.0,2.8,NO_DATA,0


**Dropping placeholder values by condition**

Sometimes people put things that shouldn't be in the data at all. In our case, the **NO_DATA** entries in **placeholder_values** are not only unnecessary, but shouldn't be there at all! 
They're basically a **NaN** but worse--we can't drop rows with them using .dropna() like we were able to for our other columns.

In [113]:
simple_data = simple_data[simple_data['placeholder_values'] != 'NO_DATA'].copy()
simple_data

Unnamed: 0,participant_id,missing_values,missing_values_2,placeholder_values,placeholder_values_2
0,11202.0,1.0,4.5,8,2
3,43029.0,0.0,3.7,4,ASK_LATER
4,30400.0,1.0,0.0,9,2
6,12321.0,1.0,3.2,3,ASK_LATER
9,23432.0,0.0,0.0,7,6
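If the placeholder strings are known in advance, another option is to have pandas treat them as missing values while loading the file, using the `na_values` parameter of `read_csv()`. They then behave like any other **NaN** and can be handled with `.dropna()` or `.fillna()`. The sketch below reads a small made-up CSV from an in-memory string rather than from the dataset URL.

```python
import io
import pandas as pd

# Hypothetical CSV content with two kinds of placeholder text
raw = io.StringIO(
    "participant_id,placeholder_values,placeholder_values_2\n"
    "11202,8,2\n"
    "23423,NO_DATA,5\n"
    "43029,4,ASK_LATER\n"
)

# Treat both placeholder strings as missing values on load
loaded = pd.read_csv(raw, na_values=["NO_DATA", "ASK_LATER"])

print(loaded.isna().sum())
print(loaded.dtypes)
```

Because the placeholders never enter the DataFrame as text, the affected columns are read as numbers directly.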


**Replacing placeholder values**

Other times we want to replace placeholder values. To do this, we can use the **.replace()** function.

In [114]:
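# assign the result back to the column; calling replace(..., inplace=True)
# on a selected column may not update the DataFrame in recent pandas versions
simple_data['placeholder_values_2'] = simple_data['placeholder_values_2'].replace('ASK_LATER', 0)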
simple_data

Unnamed: 0,participant_id,missing_values,missing_values_2,placeholder_values,placeholder_values_2
0,11202.0,1.0,4.5,8,2
3,43029.0,0.0,3.7,4,0
4,30400.0,1.0,0.0,9,2
6,12321.0,1.0,3.2,3,0
9,23432.0,0.0,0.0,7,6


**Fixing column types**

Let's check the types of each column. You can do this by appending .dtypes to your dataframe's variable name.

In [115]:
simple_data.dtypes

participant_id          float64
missing_values          float64
missing_values_2        float64
placeholder_values       object
placeholder_values_2     object
dtype: object



Notice there are two different data types being used in this dataframe: float64, and object. Different types have different rules. These rules can help us create guardrails for ourselves.

For instance, we probably want to be able to do math on all the numbers in the **placeholder_values** and **placeholder_values_2** columns. So let's fix that! Use the **.astype()** function to convert **placeholder_values** and **placeholder_values_2** from **str** (shown by pandas as **object**) to **float**.


In [118]:
simple_data['placeholder_values'] = simple_data['placeholder_values'].astype(float)
simple_data['placeholder_values_2'] = simple_data['placeholder_values_2'].astype(float)
simple_data

Unnamed: 0,participant_id,missing_values,missing_values_2,placeholder_values,placeholder_values_2
0,11202.0,1.0,4.5,8.0,2.0
3,43029.0,0.0,3.7,4.0,0.0
4,30400.0,1.0,0.0,9.0,2.0
6,12321.0,1.0,3.2,3.0,0.0
9,23432.0,0.0,0.0,7.0,6.0


In [119]:
simple_data.dtypes

participant_id          float64
missing_values          float64
missing_values_2        float64
placeholder_values      float64
placeholder_values_2    float64
dtype: object
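The `.astype(float)` conversion works here because the placeholder text was removed or replaced first; it fails if any non-numeric string is still in the column. As an alternative sketch, `pd.to_numeric()` with `errors='coerce'` converts what it can and turns everything else into **NaN**. The small series below is made up for the example.

```python
import pandas as pd

# Hypothetical column that still contains a placeholder string
column = pd.Series(["8", "NO_DATA", "4", "9"])

# Convert to numbers; values that cannot be parsed become NaN
numeric = pd.to_numeric(column, errors="coerce")

print(numeric.tolist())
print(numeric.dtype)
```

The trade-off is that coercion silently hides unexpected values, so it is worth checking `numeric.isna().sum()` afterwards.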

One last thing! Right now, the column **participant_id** is a **float64**. We usually don't want or expect to do math on identification numbers, so let's convert that to a **str**.

In [120]:
simple_data['participant_id'] = simple_data['participant_id'].astype(int).astype(str)
simple_data

Unnamed: 0,participant_id,missing_values,missing_values_2,placeholder_values,placeholder_values_2
0,11202,1.0,4.5,8.0,2.0
3,43029,0.0,3.7,4.0,0.0
4,30400,1.0,0.0,9.0,2.0
6,12321,1.0,3.2,3.0,0.0
9,23432,0.0,0.0,7.0,6.0


In [121]:
simple_data.dtypes

participant_id           object
missing_values          float64
missing_values_2        float64
placeholder_values      float64
placeholder_values_2    float64
dtype: object
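The intermediate `.astype(int)` step matters: converting a float directly to a string keeps the trailing `.0`. A minimal sketch of the difference, using two of the identifiers from the table above:

```python
import pandas as pd

ids = pd.Series([11202.0, 43029.0])

# Direct conversion keeps the decimal part of the float
direct = ids.astype(str).tolist()

# Converting to int first gives clean identifiers
via_int = ids.astype(int).astype(str).tolist()

print(direct)
print(via_int)
```

This only works once the missing values have been dropped, since **NaN** cannot be converted to an integer.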

In [123]:
simple_data.to_csv('simple_clean_data.csv', index=False)