# Data Cleaning


In this lesson we will learn the basics of Data Cleaning and the Exploratory Data Analysis Pipeline.

<img src='figures/data_cleaning.png' width=300>

## Introduction
This lesson walks through a **comprehensive EDA workflow with Python**.

The examples use the Meta Kaggle and Kaggle Survey 2018 datasets. If you need to review them, please visit [meta-kaggle](https://www.kaggle.com/kaggle/meta-kaggle) and [kaggle survey 2018](https://www.kaggle.com/kaggle/kaggle-survey-2018).

## Loading Packages
In this kernel we are using the following packages:

 <img src="figures/packages.png" width=300>
 Now we import all of them:

In [None]:
# Import the libraries used throughout this lesson
import csv
import json
import os
import string
import sys
import warnings

import matplotlib
import matplotlib as mpl
import matplotlib.pylab as pylab
import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import scipy
import seaborn as sns
import sklearn
from nltk.corpus import stopwords
from pandas import get_dummies
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from wordcloud import WordCloud as wc

# Suppress library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

In [None]:
print('matplotlib: {}'.format(matplotlib.__version__))
print('sklearn: {}'.format(sklearn.__version__))
print('scipy: {}'.format(scipy.__version__))
print('seaborn: {}'.format(sns.__version__))
print('pandas: {}'.format(pd.__version__))
print('numpy: {}'.format(np.__version__))
print('Python: {}'.format(sys.version))
#print('wordcloud: {}'.format(wordcloud.version))

A few small adjustments to the plotting defaults for better **readability**:

In [None]:
sns.set(style='white', context='notebook', palette='deep')
pylab.rcParams['figure.figsize'] = 12,8
warnings.filterwarnings('ignore')
mpl.style.use('ggplot')
sns.set_style('white')
%matplotlib inline

## Exploratory Data Analysis (EDA)
In this section, you'll learn how to use graphical and numerical techniques to begin uncovering the structure of your data. 
 
* Which variables suggest interesting relationships?
* Which observations are unusual?

By the end of the section, you'll be able to answer these questions and more, while generating graphics that are both insightful and beautiful. We will then review the following analytical and statistical operations:

1. Data Collection
1. Visualization
1. Data Cleaning
1. Data Preprocessing

<img src="figures/EDA.png" width=350>

## Data Collection
**Data collection** is the process of gathering and measuring data.
<img src='figures/data-collection.jpg' width=300>

We start data collection by loading the Users and Kernels datasets, along with the forum messages and survey responses, into **Pandas DataFrames**.

In [None]:
# Load the Meta Kaggle and Kaggle Survey 2018 CSV files into DataFrames
users = pd.read_csv("data/kaggle_Users.csv")
kernels = pd.read_csv("data/kaggle_Kernels.csv")
messages = pd.read_csv("data/kaggle_ForumMessages.csv")
freeFormResponses = pd.read_csv("data/kaggle_freeFormResponses.csv")
multipleChoiceResponses = pd.read_csv("data/kaggle_multipleChoiceResponses.csv")

**Note 1**

* Each row is an observation (also known as : sample, example, instance, record)
* Each column is a feature (also known as: Predictor, attribute, Independent Variable, input, regressor, Covariate)

In [None]:
users.sample(1) 

In [None]:
kernels.sample(1) 

In [None]:
freeFormResponses.sample(1) 

In [None]:
multipleChoiceResponses.sample(1) 

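Select a user by username and look up the matching user id to use for the experiment.

In [None]:
username = "mjbahmani"
# Look up the numeric Id for this username (assumes the username exists in users)
userid = int(users.loc[users['UserName'] == username, 'Id'].iloc[0])
userid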

## Features
Features can be of the following types:
1. numeric
1. categorical
1. ordinal
1. datetime
1. coordinates
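
As a rough illustration of how these feature types can be represented in pandas, here is a minimal sketch. The toy DataFrame and its column names are hypothetical and are not taken from Meta Kaggle; coordinates are left out because they are usually just a pair of numeric columns.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
toy = pd.DataFrame({
    "votes": [3, 10, 7],                                    # numeric
    "language": ["Python", "R", "Python"],                  # categorical
    "tier": ["novice", "expert", "master"],                 # ordinal
    "created": ["2018-01-05", "2018-03-12", "2018-07-30"],  # datetime (still text here)
})

# Categorical: a fixed set of labels with no inherent order
toy["language"] = toy["language"].astype("category")

# Ordinal: a categorical whose labels have a meaningful order
toy["tier"] = pd.Categorical(
    toy["tier"], categories=["novice", "expert", "master"], ordered=True
)

# Datetime: parse the text column into real timestamps
toy["created"] = pd.to_datetime(toy["created"])

print(toy.dtypes)
```

Declaring the order of an ordinal feature lets comparisons and `min()`/`max()` follow that order rather than alphabetical order.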

What are the feature types in **Meta Kaggle**?
<br>
To get some information about a dataset, you can use the **info()** method:

In [None]:
print(users.info())

In [None]:
print(freeFormResponses.info())

## Explore the Dataset
1. Dimensions of the dataset.
1. Peek at the data itself.
1. Statistical summary of all attributes.
1. Breakdown of the data by the class variable.

Don’t worry, each look at the data is **one command**. These are useful commands that you can use again and again on future projects.
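
The four looks listed above can be sketched on a small made-up DataFrame. The data and column names below are assumptions chosen for illustration, with `medal` standing in for the class variable.

```python
import pandas as pd

# Hypothetical toy data standing in for a real dataset.
df = pd.DataFrame({
    "votes": [1, 4, 2, 9, 5, 3],
    "views": [10, 55, 20, 130, 61, 33],
    "medal": ["none", "bronze", "none", "gold", "bronze", "none"],
})

print(df.shape)                    # 1. dimensions as (rows, columns)
print(df.head(3))                  # 2. peek at the first rows
print(df.describe())               # 3. statistical summary of numeric columns
print(df.groupby("medal").size())  # 4. number of rows per class
```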

In [None]:
# shape
print(users.shape)

In [None]:
# shape
print(kernels.shape)

In [None]:
print(freeFormResponses.shape)

In [None]:
# size = number of rows * number of columns
users.size

In [None]:
# size = number of rows * number of columns
kernels.size


We can get a quick idea of how many instances (rows) and how many attributes (columns) the data contains with the shape property.
To get more information about the dataset, such as column types and non-null counts, use the **info()** method:

In [None]:
print(users.info())

In [None]:
print(kernels.info())

Check the unique values of the Medal column and how often each one occurs:

In [None]:
kernels['Medal'].unique()

In [None]:
kernels["Medal"].value_counts()


Check the first 5 rows of the data set:

In [None]:
kernels.head(5) 

Check the last 5 rows of the data set:

In [None]:
users.tail() 

Check 5 random rows from the data set:

In [None]:
kernels.sample(5) 

Statistical summary about the dataset:

In [None]:
kernels.describe() 

## Data Cleaning
When dealing with real-world data, dirty data is the norm rather than the exception. 
We continuously need to predict correct values in spite of missing ones, and find links between various data artefacts such as schemas and records. 
We need to stop treating data cleaning as a piecemeal exercise (resolving different types of errors in isolation), and instead leverage all signals and resources (such as constraints, available statistics, and dictionaries) to accurately predict corrective actions.

<img src='figures/Data_Cleansing_Cycle.png' height=300>

The primary goal of data cleaning is to detect and remove errors and **anomalies** to increase the value of data in analytics and decision making. 
While it has been the focus of many researchers for several years, individual problems have been addressed separately. 
These include missing value correction, outlier detection, transformations, detection and repair of integrity constraint violations, consistent query answering, deduplication, and many other related problems such as profiling and constraint mining.
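
Three of the problems named above (deduplication, missing value correction, and outlier detection) can be sketched in pandas as follows. The toy data, the choice of median imputation, and the 1.5 × IQR rule are assumptions made for this example, not the only reasonable choices.

```python
import numpy as np
import pandas as pd

# Hypothetical dirty data: one duplicated row, one missing value, one extreme value.
raw = pd.DataFrame({
    "user": ["a", "b", "b", "c", "d", "e"],
    "votes": [2.0, 3.0, 3.0, np.nan, 4.0, 250.0],
})

# Deduplication: drop rows that are exact copies of an earlier row
clean = raw.drop_duplicates().reset_index(drop=True)

# Missing value correction: fill gaps with the column median
median_votes = clean["votes"].median()
clean["votes"] = clean["votes"].fillna(median_votes)

# Outlier detection: flag values more than 1.5 * IQR outside the quartiles
q1, q3 = clean["votes"].quantile([0.25, 0.75])
iqr = q3 - q1
clean["is_outlier"] = (clean["votes"] < q1 - 1.5 * iqr) | (clean["votes"] > q3 + 1.5 * iqr)

print(clean)
```

Flagging outliers in a separate column, rather than deleting them, keeps the decision about what to do with them visible and reversible.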

Check how many nulls are in the dataset:

In [None]:
# How many NA elements are in every column
users.isnull().sum()

In [None]:
kernels.isnull().sum()

In [None]:
kernels.groupby('Medal').count()

Print dataset **columns**

In [None]:
kernels.columns

In [None]:
users.columns

**Note**
In Pandas you can filter rows with boolean conditions, much like a SQL "where" clause.
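
A minimal sketch of two equivalent ways to write such a "where" filter, using a made-up DataFrame (the `author` and `votes` columns are hypothetical):

```python
import pandas as pd

# Hypothetical toy data for illustration.
df = pd.DataFrame({
    "author": ["ann", "bob", "ann", "cid"],
    "votes": [5, 12, 30, 8],
})

# Boolean mask: combine conditions with & (and) or | (or), each in parentheses
by_mask = df[(df["author"] == "ann") & (df["votes"] > 10)]

# DataFrame.query: the same filter written as an expression string
by_query = df.query("author == 'ann' and votes > 10")

print(by_mask)
```

Both forms return the same rows; the mask form is what the following cells use.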

## Find yourself in the Users dataset

In [None]:
users[users['Id']==userid]

## Find your kernels in the Kernels dataset

In [None]:
yourkernels=kernels[kernels['AuthorUserId']==userid]
yourkernels

## Data Preprocessing
**Data preprocessing** refers to the transformations applied to our data before feeding it to the algorithm.
 
Data preprocessing is a technique used to convert raw data into a clean data set. In other words, data gathered from different sources arrives in a raw format that is not suitable for analysis.
There are many steps in data preprocessing; some general ones (not specific to this dataset) are:
* Removing the target column (id)
* Sampling (without replacement)
* Making part of a dataset unbalanced and balancing it (with undersampling and SMOTE)
* Introducing missing values and treating them (replacing by average values)
* Noise filtering
* Data discretization
* Normalization and standardization
* PCA analysis
* Feature selection (filter, embedded, wrapper)

**Note**
Preprocessing and generation pipelines depend on the model type.
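
A few of the steps listed above (treating missing values with the average, standardization, normalization, discretization, and sampling without replacement) can be sketched with pandas alone. The `views` column and its values are made up for this example, and the other steps (SMOTE, PCA, feature selection) are not shown.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing value.
df = pd.DataFrame({"views": [10.0, 20.0, np.nan, 40.0, 50.0]})

# Treat missing values by replacing them with the column average
df["views"] = df["views"].fillna(df["views"].mean())

# Standardization: rescale to zero mean and unit variance (population std, ddof=0)
df["views_std"] = (df["views"] - df["views"].mean()) / df["views"].std(ddof=0)

# Normalization: rescale to the [0, 1] range (min-max scaling)
df["views_norm"] = (df["views"] - df["views"].min()) / (df["views"].max() - df["views"].min())

# Discretization: cut the numeric range into three equal-width bins
df["views_bin"] = pd.cut(df["views"], bins=3, labels=["low", "mid", "high"])

# Sampling without replacement: every row can be drawn at most once
sample = df.sample(n=3, replace=False, random_state=0)

print(df)
```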

## Visualization
**Data visualization**  is the presentation of data in a graphical format. 
It enables decision makers to "see" analytics presented visually, so they can grasp difficult concepts or identify new patterns.

With interactive visualization, you can take the concept a step further by using technology to drill down into charts and graphs for more detail, interactively changing what data you see and how it’s processed.

In this section we will look at  **11 plots** with **matplotlib** and **seaborn**
 <img src="figures/visualization.jpg" width=350>


### Scatter plot

The purpose of a scatter plot is to identify the type of relationship (if any) between two quantitative variables.

In [None]:
yourkernels.columns

In [None]:
# Plot total votes against total views for your kernels
x = yourkernels["TotalVotes"]
y = yourkernels["TotalViews"]
plt.scatter(x, y, label="kernels")
plt.xlabel("TotalVotes")
plt.ylabel("TotalViews")
plt.legend()
plt.show()


In [None]:
# Bar chart of how many of your kernels earned each medal
f, ax = plt.subplots(figsize=(9, 8))
yourkernels['Medal'].value_counts().plot.bar(color=['#CD7F32','#FFDF00','#D3D3D3'], ax=ax)
ax.set_title('Number of Medals')
ax.set_ylabel('Count')
plt.show()

### Box plot
In descriptive statistics, a **box plot** or boxplot is a method for graphically depicting groups of numerical data through their quartiles. 
Box plots may also have lines extending vertically from the boxes (whiskers) indicating variability outside the upper and lower quartiles, hence the terms box-and-whisker plot and box-and-whisker diagram.
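
To make the quartiles and whiskers concrete, the sketch below computes them by hand with NumPy for a small made-up sample. It assumes the common 1.5 × IQR whisker convention, which is also matplotlib's default; other conventions exist.

```python
import numpy as np

# Hypothetical vote counts with one extreme value.
votes = np.array([1, 2, 2, 3, 4, 5, 6, 7, 30])

# The box spans the first to the third quartile, with a line at the median
q1, median, q3 = np.percentile(votes, [25, 50, 75])
iqr = q3 - q1

# Fences mark how far the whiskers are allowed to reach
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points that are still inside the fences
lower_whisker = votes[votes >= lower_fence].min()
upper_whisker = votes[votes <= upper_fence].max()

# Points beyond the fences are drawn individually as outliers
outliers = votes[(votes < lower_fence) | (votes > upper_fence)]

print(q1, median, q3, lower_whisker, upper_whisker, outliers)
```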

In [None]:
# A box plot gives a clearer idea of the distribution of a single attribute
yourkernels["TotalVotes"].plot(kind='box')
plt.show()



In [None]:
yourkernels["TotalComments"].plot(kind='box')
plt.show()


In [None]:
# Box plots of TotalVotes grouped by TotalComments:

sns.boxplot(x="TotalComments", y="TotalVotes", data=yourkernels )
plt.show()

In [None]:
# Use Seaborn's stripplot to add data points on top of the box plot.
# Set jitter=True so that the data points remain scattered and do not pile into a vertical line.
# Both calls draw on the same axes, so the strip plot is layered on top of the box plot.

ax= sns.boxplot(x="TotalViews", y="TotalVotes", data=yourkernels)
ax= sns.stripplot(x="TotalViews", y="TotalVotes", data=yourkernels, jitter=True, edgecolor="gray")
plt.show()

In [None]:
# Tweak the plot above to change the fill and border colors of individual boxes.
# Older matplotlib versions expose the boxes as ax.artists; from matplotlib 3.5 they may be in ax.patches.

ax = sns.boxplot(x="TotalViews", y="TotalVotes", data=yourkernels)
ax = sns.stripplot(x="TotalViews", y="TotalVotes", data=yourkernels, jitter=True, edgecolor="gray")

boxes = list(ax.artists) or list(ax.patches)
if len(boxes) > 2:  # only recolor when at least three boxes were drawn
    boxthree = boxes[2]
    boxthree.set_facecolor('red')
    boxthree.set_edgecolor('black')
    boxtwo = boxes[1]
    boxtwo.set_facecolor('yellow')
    boxtwo.set_edgecolor('black')

plt.show()

In [None]:
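# factorplot was renamed to catplot in seaborn 0.9; kind='point' matches its old default
sns.catplot(x='TotalViews', y='TotalVotes', hue='Medal', kind='point', data=yourkernels)
plt.show()

In [None]:
sns.catplot(x='TotalComments', y='TotalVotes', hue='Medal', kind='point', data=yourkernels)
plt.show()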

### Histogram
We can also create a **histogram** of each input variable to get an idea of the distribution.
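
What a histogram actually draws is a count of values per bin. The sketch below computes those counts directly with NumPy for a made-up sample; the values, the number of bins, and the range are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical view counts.
views = np.array([3, 7, 8, 12, 15, 18, 21, 22, 29, 40])

# Split the range 0..40 into four equal-width bins and count the values in each.
# Every bin is half-open [left, right) except the last, which includes its right edge.
counts, edges = np.histogram(views, bins=4, range=(0, 40))

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print("{:>4.0f} to {:<4.0f}: {}".format(left, right, "#" * count))
```

The bar heights in the plotted histograms below correspond to `counts`, and the bar boundaries correspond to `edges`.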


In [None]:
# histograms
yourkernels.hist(figsize=(15,20))
plt.show()

Check whether any of the variables look approximately Gaussian. This is useful to note, as some algorithms can exploit this assumption.



In [None]:
yourkernels["TotalViews"].hist();

In [None]:
yourkernels["TotalComments"].hist();

In [None]:
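# factorplot was renamed to catplot in seaborn 0.9; kind='point' matches its old default
sns.catplot(x='TotalViews', y='TotalVotes', kind='point', data=yourkernels)
plt.show()

In [None]:
sns.catplot(x='TotalComments', y='TotalVotes', kind='point', data=yourkernels)
plt.show()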

### violinplots

In [None]:
# Violin plots of TotalVotes for each value of TotalViews
sns.violinplot(data=yourkernels,x="TotalViews", y="TotalVotes")

In [None]:
# Violin plots of TotalVotes for each value of TotalComments
sns.violinplot(data=yourkernels,x="TotalComments", y="TotalVotes")

In [None]:
sns.violinplot(data=yourkernels,x="Medal", y="TotalVotes")

In [None]:
sns.violinplot(data=yourkernels,x="Medal", y="TotalComments")



###  kdeplot

In [None]:
# seaborn's kdeplot plots univariate or bivariate density estimates.
# The plot size can be changed by tweaking the height value (called size before seaborn 0.9).
sns.FacetGrid(yourkernels, hue="Medal", height=5).map(sns.kdeplot, "TotalComments").add_legend()
plt.show()

In [None]:
sns.FacetGrid(yourkernels, hue="Medal", height=5).map(sns.kdeplot, "TotalVotes").add_legend()
plt.show()

In [None]:
# distplot is deprecated in recent seaborn releases; histplot with kde=True
# (available from seaborn 0.11) is the closest replacement
f, ax = plt.subplots(1, 3, figsize=(20, 8))
for i, medal in enumerate([1, 2, 3]):
    sns.histplot(yourkernels[yourkernels['Medal'] == medal].TotalVotes, kde=True, ax=ax[i])
    ax[i].set_title('TotalVotes in Medal {}'.format(medal))
plt.show()

### jointplot

In [None]:
# Use seaborn's jointplot to make a hexagonal bin plot.
# Set the desired height (called size before seaborn 0.9) and ratio, and choose a color.
sns.jointplot(x="TotalVotes", y="TotalViews", data=yourkernels, height=10, ratio=10, kind='hex', color='green')
plt.show()

### jointplot with kernel density estimation

In [None]:
# seaborn's jointplot with kind='kde' shows a bivariate density estimate together with
# univariate density estimates on the margins in the same figure
sns.jointplot(x="TotalVotes", y="TotalViews", data=yourkernels, height=6, kind='kde', color='#800000', space=0)
plt.show()

### Heatmap

In [None]:
plt.figure(figsize=(10,7)) 
# Draw a heatmap of the correlation matrix of the numeric columns only
sns.heatmap(yourkernels.select_dtypes(include='number').corr(), annot=True, cmap='cubehelix_r')
plt.show()

## WordCloud
You have probably seen a cloud filled with words of different sizes, where the size of each word represents its frequency or importance. 
This is called a tag cloud or word cloud.
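
The frequencies behind such a cloud are plain word counts after stop words are removed. Here is a minimal sketch using only the standard library; the sample sentence and the tiny stop word set are made up, and the real cells below use NLTK's English stop word list instead.

```python
import re
from collections import Counter

# Hypothetical sample text and a tiny, made-up stop word list.
text = "Kaggle kernels are fun. Kernels teach data cleaning, and data cleaning matters."
stop = {"are", "and", "the", "a"}

# Lowercase, split into alphabetic words, and drop the stop words
words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in stop]

# Count how often each remaining word occurs
freq = Counter(words)
print(freq.most_common(3))
```

In a word cloud, a word with a higher count is drawn larger.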

In [None]:
import nltk
from wordcloud import WordCloud as wc
from nltk.corpus import stopwords

# Download the NLTK stop word list (only needed once per environment)
nltk.download('stopwords')

eng_stopwords = set(stopwords.words("english"))
messages.head(1)

In [None]:
def generate_wordcloud(text): 
    wordcloud = wc(relative_scaling = 1.0,stopwords = eng_stopwords).generate(text)
    fig,ax = plt.subplots(1,1,figsize=(10,10))
    ax.imshow(wordcloud, interpolation='bilinear')
    ax.axis("off")
    ax.margins(x=0, y=0)
    plt.show()

In [None]:
# Join all non-missing messages into one string for the word cloud
text = ' '.join(messages['Message'].dropna().astype(str))
generate_wordcloud(text)