<font color="blue">To use this notebook on Colaboratory, you will need to make a copy of it. Go to File > Save a Copy in Drive. You can then use the new copy that will appear in the new tab.</font>


# Practice Notebook: Data Cleaning with Python - Outliers


## 6. Outliers

#### <font color="blue">Pre-requisites</font>

In [None]:
# Pre-requisite 1
# ---
# Importing pandas library
# ---
# OUR CODE GOES BELOW
# 
import pandas as pd

In [None]:
# Pre-requisite 2
# ---
# Importing the seaborn library
# This is a visualisation library
# ---
# 
import seaborn as sns

#### <font color="blue">Examples</font>

##### <font color="blue">Example 1</font>

In [None]:
# Example 1
# --- 
# Finding outliers
# We can check for outliers with a box plot
# ---
# Dataset url = http://bit.ly/CountryDataset1
# ---
# OUR CODE GOES BELOW
#  

# Let's read data from url as dataframe
# 
outliers_df = pd.read_csv("http://bit.ly/CountryDataset1") 

# Let's preview our dataframe below
#
outliers_df.head()

# Then we will work with only the data for the year 2007
# ---
# 
outliers_df_2007 = outliers_df[outliers_df['year']==2007] 
outliers_df_2007.head()

# Seaborn offers several options for customizing a boxplot.
# Here we set the color palette to the colorblind-friendly "colorblind" scheme.
# Other available palettes include deep, muted, bright, pastel, and dark.
# The dots in the plot are outliers.
# ---
#
bplot = sns.boxplot(y='lifeExp', x='continent', data = outliers_df_2007, width=0.5, palette="colorblind")

# A boxplot alone is useful for summarizing data within and between groups.
# However, it is often good practice to overlay the actual data points on the boxplot.
# We do this with stripplot, using jitter=True to spread the data points horizontally.
# ---
#
bplot = sns.stripplot(y='lifeExp', x='continent',  data = outliers_df_2007, jitter=True, marker='o', alpha=0.5, color='black')

# Naming and sizing our graph and axes
# ---
#
bplot.axes.set_title("Life expectancy in the World", fontsize=13)
bplot.set_xlabel("Continents", fontsize=13)
bplot.set_ylabel("Life Expectancy", fontsize=13)
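
The dots drawn by the box plot can also be listed numerically. The sketch below assumes the plot uses the common default of flagging points more than 1.5 × IQR beyond the quartiles, and it applies that rule per group. The `toy_df` data is made up for illustration; only the column names `continent` and `lifeExp` come from the example above.

```python
import pandas as pd

# Hypothetical toy data standing in for the 2007 rows of the country dataset
toy_df = pd.DataFrame({
    "continent": ["A"] * 6 + ["B"] * 6,
    "lifeExp": [70.0, 71.0, 72.0, 73.0, 74.0, 40.0,
                50.0, 51.0, 52.0, 53.0, 54.0, 55.0],
})

# Quartiles of lifeExp computed separately for each continent,
# broadcast back to one value per row
grouped = toy_df.groupby("continent")["lifeExp"]
q1 = grouped.transform(lambda s: s.quantile(0.25))
q3 = grouped.transform(lambda s: s.quantile(0.75))
iqr = q3 - q1

# A point is flagged when it lies more than 1.5 * IQR beyond either quartile
is_outlier = (toy_df["lifeExp"] < q1 - 1.5 * iqr) | (toy_df["lifeExp"] > q3 + 1.5 * iqr)
boxplot_outliers = toy_df[is_outlier]
print(boxplot_outliers)
```

With the real data, replacing `toy_df` with `outliers_df_2007` should list the rows that correspond to the dots in each continent's box.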

##### <font color="blue">Example 2</font>

In [None]:
outliers_df.shape

In [None]:
# Example 2
# ---
# Dealing with outliers using the interquartile range
# ---
# 
 
# There are many ways of dealing with outliers; in this session we will
# use the interquartile range (IQR). This is the first quartile subtracted from the third quartile,
# i.e. the range covered by the middle 50% of the data.
# The first and third quartiles can be seen clearly as the edges of each box in the box plot above.
# The IQR is a measure of dispersion similar to the standard deviation or variance,
# but it is much more robust against outliers. We now calculate the IQR for each numeric column.
# 

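# numeric_only=True restricts the quantiles to numeric columns (text columns such as continent are skipped)
Q1 = outliers_df.quantile(0.25, numeric_only=True)
Q3 = outliers_df.quantile(0.75, numeric_only=True)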
IQR = Q3 - Q1
IQR

# We now filter out outliers, keeping only the rows with no value outside the IQR fences.
# Only the numeric columns (those present in IQR) are compared.
# ---
#
numeric_cols = outliers_df[IQR.index]
outliers_df_iqr = outliers_df[~((numeric_cols < (Q1 - 1.5 * IQR)) | (numeric_cols > (Q3 + 1.5 * IQR))).any(axis=1)]
outliers_df_iqr.shape

# Checking the size of the dataset with outliers for cleaning purposes
# ---
#
# outliers_df.shape
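
To make the IQR rule concrete, here is a minimal sketch on a single made-up series. The numbers and the names `values`, `lower_fence`, and `upper_fence` are illustrative assumptions, not part of the dataset. It also compares how the standard deviation and the IQR react when the extreme value is removed.

```python
import pandas as pd

# Made-up values; 100 is the obvious extreme point
values = pd.Series([10, 12, 13, 14, 15, 16, 18, 100])

q1 = values.quantile(0.25)   # first quartile
q3 = values.quantile(0.75)   # third quartile
iqr = q3 - q1                # spread of the middle 50% of the data

# Fences: anything outside them is treated as an outlier
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Keep only the values inside the fences
kept = values[(values >= lower_fence) & (values <= upper_fence)]

print("IQR:", iqr, "fences:", lower_fence, upper_fence)
print("kept:", kept.tolist())

# Robustness check: removing the extreme value changes the standard
# deviation far more than it changes the IQR
kept_iqr = kept.quantile(0.75) - kept.quantile(0.25)
print("std with / without 100:", values.std(), kept.std())
print("IQR with / without 100:", iqr, kept_iqr)
```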

#### <font color="green">Challenges</font> 

##### <font color="green">Challenge 1</font>

In [None]:
# Challenge 1
# ---
# Question: Find the outliers in the given dataset.
# ---
# Dataset url = http://bit.ly/SampleDataset
# ---
# OUR CODE GOES BELOW
# 

# Loading our dataset for outlier detection
# ---
# 
outlier_df = pd.read_csv('http://bit.ly/SampleDataset')
outlier_df

Unnamed: 0,NAME,CITY,COUNTRY,HEIGHT,WEIGHT,ACCOUNT A,ACCOUNT B,TOTAL ACCOUNT
0,Adi Dako,LISBON,PORTUGAL,56,132.0,2390.0,4340,6730
1,John Paul,LONDON,UNITED KINGDOM,62,165.0,4500.0,34334,38834
2,Cindy Jules,Stockholm,Sweden,48,117.0,,5504,8949
3,Arthur Kegels,BRUSSELS,BELGIUM,59,121.0,4344.0,8999,300
4,Freya Bismark,Berlin,GERMANYY,53,126.0,7000.0,19000,26000
5,Rena Filip,Brasilia,BRAZIL,50,167.0,4999.0,3999,3450
6,Cindy Jules,Stockholm,Sweden,48,117.0,3445.0,5504,8949
7,John Paul,LONDON,UNITED KINGDOM,62,,4500.0,2300,6800


In [None]:
# Finding the shape of our dataset
# ---
# 
outlier_df.shape

(8, 8)

In [None]:
# Defining our quantiles
# ---
# 
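# numeric_only=True skips the text columns (NAME, CITY, COUNTRY)
Q1 = outlier_df.quantile(0.25, numeric_only=True)
Q3 = outlier_df.quantile(0.75, numeric_only=True)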
IQR = Q3 - Q1
IQR

HEIGHT             10.25
WEIGHT             29.50
ACCOUNT A         855.00
ACCOUNT B        7244.50
TOTAL ACCOUNT    7301.75
dtype: float64

In [None]:
# Determining how many rows contain at least one outlier
# ---
# 
numeric_cols = outlier_df[IQR.index]
outlier_df_iqr = outlier_df[((numeric_cols < (Q1 - 1.5 * IQR)) | (numeric_cols > (Q3 + 1.5 * IQR))).any(axis=1)]
outlier_df_iqr.shape


(3, 8)

In [None]:
# Displaying our outliers
# ---
# 
outlier_df_iqr


Unnamed: 0,NAME,CITY,COUNTRY,HEIGHT,WEIGHT,ACCOUNT A,ACCOUNT B,TOTAL ACCOUNT
0,Adi Dako,LISBON,PORTUGAL,56,132.0,2390.0,4340,6730
1,John Paul,LONDON,UNITED KINGDOM,62,165.0,4500.0,34334,38834
4,Freya Bismark,Berlin,GERMANYY,53,126.0,7000.0,19000,26000
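
The row filter above says which rows contain an outlier, but not which column triggered the flag. The sketch below rebuilds the numeric columns from the dataset preview shown earlier and counts the flagged values per column. The names `sample_df`, `flags`, `outliers_per_column`, and `flagged_columns` are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

# Numeric columns copied from the dataset preview shown earlier in this notebook
sample_df = pd.DataFrame({
    "HEIGHT": [56, 62, 48, 59, 53, 50, 48, 62],
    "WEIGHT": [132.0, 165.0, 117.0, 121.0, 126.0, 167.0, 117.0, np.nan],
    "ACCOUNT A": [2390.0, 4500.0, np.nan, 4344.0, 7000.0, 4999.0, 3445.0, 4500.0],
    "ACCOUNT B": [4340, 34334, 5504, 8999, 19000, 3999, 5504, 2300],
    "TOTAL ACCOUNT": [6730, 38834, 8949, 300, 26000, 3450, 8949, 6800],
})

q1 = sample_df.quantile(0.25)
q3 = sample_df.quantile(0.75)
iqr = q3 - q1

# Boolean frame: True where a cell falls outside the fences of its own column
# (comparisons against missing values evaluate to False)
flags = (sample_df < (q1 - 1.5 * iqr)) | (sample_df > (q3 + 1.5 * iqr))

# Number of flagged values in each column
outliers_per_column = flags.sum()
print(outliers_per_column)

# For each flagged row, the columns responsible for the flag
flagged_columns = {
    idx: list(flags.columns[row.values])
    for idx, row in flags[flags.any(axis=1)].iterrows()
}
print(flagged_columns)
```

Seeing the responsible columns helps decide whether a flagged row is a data-entry error or a genuine extreme value before it is dropped.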


##### <font color="green">Challenge 2</font>

In [None]:
# Challenge 2
# ---
# Question: Deal with the outliers in the given dataset.
# ---
# Dataset url = http://bit.ly/SampleDataset
# ---
# OUR CODE GOES BELOW
# 


In [None]:
# Dropping the rows that contain outliers
# ---
# 
numeric_cols = outlier_df[IQR.index]
clean_df = outlier_df[~((numeric_cols < (Q1 - 1.5 * IQR)) | (numeric_cols > (Q3 + 1.5 * IQR))).any(axis=1)]
clean_df.shape

(5, 8)

In [None]:
# Displaying our clean dataset
# NB: Our dataset still needs to be cleaned in other ways...
# ---
# 
clean_df

Unnamed: 0,NAME,CITY,COUNTRY,HEIGHT,WEIGHT,ACCOUNT A,ACCOUNT B,TOTAL ACCOUNT
2,Cindy Jules,Stockholm,Sweden,48,117.0,,5504,8949
3,Arthur Kegels,BRUSSELS,BELGIUM,59,121.0,4344.0,8999,300
5,Rena Filip,Brasilia,BRAZIL,50,167.0,4999.0,3999,3450
6,Cindy Jules,Stockholm,Sweden,48,117.0,3445.0,5504,8949
7,John Paul,LONDON,UNITED KINGDOM,62,,4500.0,2300,6800
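
Dropping rows is only one way to deal with outliers, and on a dataset this small it removes three of the eight records. As an alternative, and not the approach used in this notebook, the sketch below caps extreme values at the IQR fences with `Series.clip`, so every row is kept. It uses the ACCOUNT B values from the dataset preview; the names `account_b` and `capped` are illustrative.

```python
import pandas as pd

# ACCOUNT B values copied from the dataset preview, stored as floats
account_b = pd.Series(
    [4340.0, 34334.0, 5504.0, 8999.0, 19000.0, 3999.0, 5504.0, 2300.0],
    name="ACCOUNT B",
)

q1 = account_b.quantile(0.25)
q3 = account_b.quantile(0.75)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Cap values at the fences instead of dropping the rows
capped = account_b.clip(lower=lower_fence, upper=upper_fence)

print("upper fence:", upper_fence)
print(capped.tolist())
print("rows kept:", len(capped), "of", len(account_b))
```

Whether capping or dropping is more appropriate depends on whether the extreme values are errors or real observations, so treat this as one option to weigh rather than a recommendation.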
