<h1 style="font-size:30px">Data Cleaning</h1>
<hr>

1. Drop unwanted observations
2. Fix structural errors
3. Remove unwanted outliers
4. Label missing categorical data
5. Flag and fill missing numerical data

<span style="font-size:18px">**Import libraries**</span>

In [2]:
# Numpy for numerical computing
import numpy as np

# Pandas for Dataframes
import pandas as pd
pd.set_option('display.max_columns',100)

# Matplotlib for visualization
from matplotlib import pyplot as plt
# display plots in the notebook
%matplotlib inline

# Seaborn for easier visualization
import seaborn as sns

<span style="font-size:18px">**Load dataset**</span>

In [3]:
df = pd.read_csv('mushrooms.csv')

<span style="font-size:18px">**1. Drop unwanted observations**</span><br>
* Duplicate observations<br>
* Irrelevant observations: observations that don't actually fit the **specific problem**

In [4]:
# Drop duplicates (drop_duplicates returns a new DataFrame, so assign the result)
df = df.drop_duplicates()
df.shape

(8124, 23)
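
As a minimal sketch of both bullet points, the toy frame below (hypothetical data and column names, not taken from `mushrooms.csv`) contains one duplicate row and one row that is assumed to fall outside the problem being studied:

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 1 are identical, row 2 is out of scope
toy = pd.DataFrame({
    "cap_color": ["red", "red", "brown", "white"],
    "habitat": ["woods", "woods", "urban", "woods"],
})

# Duplicate observations: keep the first copy of each identical row
deduped = toy.drop_duplicates()

# Irrelevant observations: assume the problem only concerns woodland mushrooms
relevant = deduped[deduped["habitat"] == "woods"]

print(deduped.shape)
print(relevant.shape)
```

What counts as irrelevant depends entirely on the problem, so the `habitat` filter here is only a stand-in for whatever rule applies to the real data.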

<span style="font-size:18px">**2. Fix structural errors**</span>

<span style="font-size:14px">**2.1. Wannabe indicator variables**<br></span>
Check variables that should actually be binary indicator variables.<br>
* These variables should be either 0 or 1
* They may be encoded differently, for example 1 when present and NaN when absent
* Fill the missing values (NaN) with 0 to turn the feature into a true indicator variable

In [None]:
# Display unique values of the feature
df.feature.unique()

In [None]:
# Missing feature value should be 0
df['feature'] = df['feature'].fillna(0)
print(df.feature.unique())
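
A small self-contained illustration of the same idea, assuming a hypothetical column `flag` that stores 1 when something is present and NaN otherwise:

```python
import numpy as np
import pandas as pd

# Hypothetical column: 1 where the property is present, NaN where it was never recorded
toy = pd.DataFrame({"flag": [1.0, np.nan, 1.0, np.nan]})

# Treat missing as "absent" and store the result as integers
toy["flag"] = toy["flag"].fillna(0).astype(int)

print(toy["flag"].unique())
```

This only makes sense when a missing value really does mean "absent"; if it could mean "unknown", the flag-and-fill approach in step 5 is probably the safer choice.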

<span style="font-size:14px">**2.2. Typos, capitalization and mislabeled classes**<br></span>
* Mostly a concern for **categorical features**<br>
* Check for typos or inconsistent capitalization<br>
* Check for classes that are labeled as separate classes when they should really be the same

In [None]:
# Plot class distribution for the 'categorical feature'
sns.countplot(y = 'categorical_feature', data = df)
plt.show()

In [None]:
# 'Categ' should be 'Categorical'
df['categorical_feature'] = df['categorical_feature'].replace('Categ', 'Categorical')

# 'categorical' should be 'Categorical'
df['categorical_feature'] = df['categorical_feature'].replace('categorical', 'Categorical')
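
When there are many variants, one possible shortcut is to normalize capitalization first and then map the remaining typos with a dictionary. The sketch below uses made-up values, including a hypothetical `Other` class:

```python
import pandas as pd

# Made-up labels containing a typo and inconsistent capitalization
toy = pd.DataFrame({
    "categorical_feature": ["Categorical", "categorical", "Categ", "Other", "other"]
})

# Step 1: make capitalization consistent; step 2: map known typos to the intended class
toy["categorical_feature"] = (
    toy["categorical_feature"]
    .str.capitalize()
    .replace({"Categ": "Categorical"})
)

print(toy["categorical_feature"].value_counts())
```

Note that `str.capitalize()` lowercases everything after the first character, so it is only appropriate when class names do not rely on internal capitals.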

<span style="font-size:18px">**3. Remove unwanted outliers**</span><br>
* Suspicious measurements that are unlikely to be real data<br>
* Outliers that belong in a different population<br>
* Outliers that belong to a different problem

* Is there any long and skinny tail?
* Is it a potential outlier?

In [None]:
# Violin plot of 'target' using the Seaborn library
sns.violinplot(x = df.target)
plt.show()

# Violin plot of 'feature'
sns.violinplot(x = df.feature)
plt.show()

* Check the 5 smallest and 5 largest values of the feature to confirm
* Use a boolean mask to filter only wanted observations

In [None]:
# Sort df.feature in descending order and display the 5 largest values
print(df.feature.sort_values(ascending = False).head())

# From the same descending sort, display the 5 smallest values
print(df.feature.sort_values(ascending = False).tail())

In [None]:
# Remove feature outliers with a boolean mask
# (size is a placeholder: set it to a numeric cutoff before running this cell)
df = df[df['feature'] < size]

# Print length of df
print(len(df))
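
The inspect-then-filter workflow can be sketched on made-up numbers as follows. The `cutoff` value is a hypothetical choice that would normally come from looking at the plot and the extreme values:

```python
import pandas as pd

# Made-up measurements with one suspicious value
toy = pd.DataFrame({"feature": [10, 12, 11, 13, 500]})

# Inspect the extremes before deciding on a cutoff
print(toy["feature"].nlargest(2))
print(toy["feature"].nsmallest(2))

# Hypothetical cutoff chosen after inspection
cutoff = 100

# Boolean mask: True for the rows to keep
mask = toy["feature"] < cutoff
filtered = toy[mask]

print(len(toy), len(filtered))
```

Keeping the mask in its own variable makes it easy to check how many rows will be removed (`(~mask).sum()`) before overwriting the dataframe.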

<span style="font-size:18px">**4. Label missing categorical data**</span><br><br>
Avoid:<br>
* **Dropping** observations that have missing values<br>
* **Imputing** the missing values based on values from other observations

In [None]:
# Display number of missing values by feature (categorical)
df.select_dtypes(include = ['object']).isnull().sum()

* Label missing data as 'Missing', adding a new class for the feature

In [None]:
# Fill missing categorical values
for column in df.select_dtypes(include = ['object']):
    df[column] = df[column].fillna('Missing')
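
A quick check on made-up data shows that only the categorical columns are touched. The column names are hypothetical, and the columns are created with an explicit `object` dtype so that `select_dtypes(include=['object'])` picks them up:

```python
import numpy as np
import pandas as pd

# Made-up frame: two categorical columns and one numeric column, each with a gap
toy = pd.DataFrame({
    "color": pd.Series(["red", np.nan, "brown"], dtype=object),
    "odor": pd.Series([np.nan, "none", "foul"], dtype=object),
    "weight": [1.0, 2.0, np.nan],
})

# Fill only the object (categorical) columns with the new 'Missing' class
categorical_cols = toy.select_dtypes(include=["object"]).columns
toy[categorical_cols] = toy[categorical_cols].fillna("Missing")

print(toy)
```

The numeric `weight` column still contains its NaN afterwards, which is what step 5 deals with.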

<span style="font-size:18px">**5. Flag and fill missing numeric data**</span><br>
* Best used for **cross-sectional** data, i.e. data collected for many subjects at the same point in time
* For time series, consider **interpolation** instead. Time series data is collected for one subject at many points in time
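
For the time series case, a minimal sketch of linear interpolation with pandas might look like this. The daily readings are made up for illustration:

```python
import numpy as np
import pandas as pd

# Made-up daily readings for a single subject, with two gaps
ts = pd.Series(
    [1.0, np.nan, 3.0, np.nan, 5.0],
    index=pd.date_range("2024-01-01", periods=5, freq="D"),
)

# Linear interpolation fills each gap from its neighbouring values
filled = ts.interpolate(method="linear")

print(filled.tolist())
```

With equally spaced observations, each gap is filled with the midpoint of its neighbours. For irregular timestamps, `method="time"` may be a better fit.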

In [None]:
# Display number of missing values by feature (numeric)
df.select_dtypes(exclude = ['object']).isnull().sum()

**Flag** the observation with an **indicator feature** of missingness:<br>
* 0 if not missing<br>
* 1 if missing

In [None]:
# Indicator variable for missing feature
df['missing_feature'] = df.feature.isnull().astype(int)

**Fill** in the original missing value with 0 just to meet Scikit-Learn's technical requirement of no missing values

In [None]:
# Fill missing values in feature with 0
df['feature'] = df['feature'].fillna(0)
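
The order of the two steps matters: the flag has to be created while the NaNs are still present. A small sketch on made-up values, using the hypothetical column name `feature_missing` for the indicator:

```python
import numpy as np
import pandas as pd

# Made-up numeric feature with two missing values
toy = pd.DataFrame({"feature": [3.5, np.nan, 1.2, np.nan]})

# Flag first, while the NaNs can still be detected
toy["feature_missing"] = toy["feature"].isnull().astype(int)

# Then fill the original column with 0
toy["feature"] = toy["feature"].fillna(0)

print(toy)
```

If the fill ran first, `isnull()` would find nothing and the indicator would be all zeros.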

<span style="font-size:18px">**6. Save the cleaned dataframe**</span><br>

In [None]:
# Save cleaned dataframe to a new file, without writing the row index
df.to_csv('cleaned_df.csv', index = False)