# Challenge
### Data Preparation 

#### Exploring a DataFrame

To explore the basic ways of inspecting a DataFrame, we will use the Python scikit-learn library to load an iconic dataset that every data scientist has seen hundreds of times: British biologist Ronald Fisher's Iris data set used in his 1936 paper "The use of multiple measurements in taxonomic problems":

In [5]:
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
iris_df = pd.DataFrame(data=iris['data'], columns=iris['feature_names'])

In [6]:
iris_df.shape

(150, 4)

In [7]:
iris_df.columns

Index(['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)',
       'petal width (cm)'],
      dtype='object')

In [8]:
iris_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 4 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   sepal length (cm)  150 non-null    float64
 1   sepal width (cm)   150 non-null    float64
 2   petal length (cm)  150 non-null    float64
 3   petal width (cm)   150 non-null    float64
dtypes: float64(4)
memory usage: 4.8 KB


In [9]:
iris_df.describe()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
count,150.0,150.0,150.0,150.0
mean,5.843333,3.057333,3.758,1.199333
std,0.828066,0.435866,1.765298,0.762238
min,4.3,2.0,1.0,0.1
25%,5.1,2.8,1.6,0.3
50%,5.8,3.0,4.35,1.3
75%,6.4,3.3,5.1,1.8
max,7.9,4.4,6.9,2.5


In [11]:
iris_df.head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2


#### Exercise:
From the example given above, it is clear that, by default, DataFrame.head returns the first five rows of a DataFrame. In the code cell below, can you figure out a way to display more than five rows?

In [12]:
iris_df.head(10)

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2
5,5.4,3.9,1.7,0.4
6,4.6,3.4,1.4,0.3
7,5.0,3.4,1.5,0.2
8,4.4,2.9,1.4,0.2
9,4.9,3.1,1.5,0.1


In [13]:
iris_df.tail()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
145,6.7,3.0,5.2,2.3
146,6.3,2.5,5.0,1.9
147,6.5,3.0,5.2,2.0
148,6.2,3.4,5.4,2.3
149,5.9,3.0,5.1,1.8


#### Missing Data

In [14]:
import numpy as np

example1 = np.array([2, None, 6, 8])
example1

array([2, None, 6, 8], dtype=object)

Because the array contains None, NumPy upcasts it to dtype=object. This upcasting carries two side effects. First, operations are carried out at the level of interpreted Python code rather than compiled NumPy code, so any operation involving a Series or DataFrame with None in it will be slower. You would probably not notice this performance hit on small data, but for large datasets it might become an issue.

The second side effect stems from the first. Because None essentially drags Series or DataFrames back into the world of vanilla Python, using NumPy/pandas aggregations like sum() or min() on arrays that contain a None value will generally produce an error:

In [15]:
example1.sum()

TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'

In contrast to None, NumPy (and therefore pandas) supports NaN for its fast, vectorized operations and ufuncs. The bad news is that any arithmetic performed on NaN always results in NaN. For example:

In [16]:
np.nan + 1

nan

In [17]:
np.nan * 0

nan

In [18]:
example2 = np.array([2, np.nan, 6, 8]) 
example2.sum(), example2.min(), example2.max()

(nan, nan, nan)
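
If the NaN entries should simply be skipped, NumPy also provides NaN-aware aggregations. The following is a minimal sketch, not part of the original notebook; the variable names are chosen only for illustration:

```python
import numpy as np

values = np.array([2, np.nan, 6, 8])

# A plain aggregation propagates NaN to the result.
plain_sum = values.sum()

# The nan-prefixed variants ignore NaN entries instead.
nan_sum = np.nansum(values)
nan_min = np.nanmin(values)
nan_max = np.nanmax(values)

print(plain_sum, nan_sum, nan_min, nan_max)
```

pandas aggregations such as Series.sum() skip NaN by default, which is one reason NaN is the more convenient missing-value marker for numeric data.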

#### Exercise:
What happens if you add np.nan and None together?

In [19]:
example1 = np.nan
example1+None

TypeError: unsupported operand type(s) for +: 'float' and 'NoneType'

Even though NaN and None can behave somewhat differently, pandas is nevertheless built to handle them interchangeably. To see what we mean, consider a Series of integers:

In [20]:
int_series = pd.Series([1, 2, 3], dtype=int)
int_series

0    1
1    2
2    3
dtype: int32

#### Exercise

In [21]:
# Now set an element of int_series equal to None.
# How does that element show up in the Series?
# What is the dtype of the Series?

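In [24]:
int_series[0] = None
int_series

0    NaN
1    2.0
2    3.0
dtype: float64

In [25]:
# The element shows up as NaN: pandas converts None to NaN.
# The dtype of the Series is upcast from int to float64, because NaN is a floating-point value.
# Note: recent pandas versions may warn about, or disallow, this implicit upcast.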
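
To illustrate the interchangeability mentioned above, here is a small sketch (the Series name is hypothetical) in which None and np.nan are mixed in one numeric Series. Both should end up as the same NaN marker:

```python
import numpy as np
import pandas as pd

# None and np.nan are both treated as missing in a numeric Series.
mixed = pd.Series([1, np.nan, 2, None])

print(mixed)
print(mixed.dtype)             # the Series is stored as floating point
print(mixed.isnull().sum())    # both missing entries are counted
```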

#### Detecting null values

In [26]:
example3 = pd.Series([0, np.nan, '', None])

In [27]:
example3.isnull()

0    False
1     True
2    False
3     True
dtype: bool

If we want the total number of missing values, we can just do a sum over the mask produced by the isnull() method.

In [28]:
example3.isnull().sum()

2
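
The same idea extends to DataFrames, where summing the mask gives one count per column. A minimal sketch, using a hypothetical frame that is not part of the notebook:

```python
import numpy as np
import pandas as pd

# Hypothetical frame used only for illustration.
frame = pd.DataFrame({"a": [1, np.nan, 3],
                      "b": [None, None, "x"]})

# isnull() gives a Boolean frame; sum() counts the True values per column.
missing_per_column = frame.isnull().sum()

# Summing again gives the total number of missing cells.
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print(total_missing)
```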

#### Exercise

In [29]:
# Try running example3[example3.notnull()].
# Before you do so, what do you expect to see?

In [30]:
# example3.notnull() is the inverse of example3.isnull(), so the mask itself should be:
# 0     True
# 1    False
# 2     True
# 3    False
# dtype: bool
# Indexing example3 with that mask should then keep only the entries at positions 0 and 2.

In [31]:
example3.notnull()

0     True
1    False
2     True
3    False
dtype: bool
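
Note that notnull() on its own only returns the mask. The expression from the exercise uses that mask to index the Series. A short sketch of the difference, rebuilding the Series under a hypothetical name so the block is self-contained:

```python
import numpy as np
import pandas as pd

series = pd.Series([0, np.nan, '', None])

# The mask is True wherever a value is present.
mask = series.notnull()

# Indexing with the mask keeps only the non-null entries.
kept = series[mask]

print(kept)
```

The empty string and 0 survive because pandas does not treat them as missing; only NaN and None are dropped.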

#### Dealing with missing data

There are primarily two ways of dealing with missing data:

- Drop the row containing the missing value
- Replace the missing value with some other value

##### Dropping null values

In [32]:
example3 = example3.dropna()
example3

0    0
2     
dtype: object

In [33]:
example4 = pd.DataFrame([[1,      np.nan, 7], 
                         [2,      5,      8], 
                         [np.nan, 6,      9]])
example4

Unnamed: 0,0,1,2
0,1.0,,7
1,2.0,5.0,8
2,,6.0,9


In [34]:
example4.dropna()

Unnamed: 0,0,1,2
1,2.0,5.0,8


If necessary, you can drop NA values from columns instead. Use axis=1 (or, equivalently, axis='columns') to do so:

In [35]:
example4.dropna(axis='columns')

Unnamed: 0,2
0,7
1,8
2,9


Notice that this can drop a lot of data that you might want to keep, particularly in smaller datasets. What if you only want to drop rows or columns that contain several, or even only, null values? You specify those settings in dropna with the how and thresh parameters.

By default, how='any' (if you would like to check for yourself or see what other parameters the method has, run example4.dropna? in a code cell). You could alternatively specify how='all' so as to drop only rows or columns that contain all null values. Let's expand our example DataFrame to see this in action in the next exercise.

In [36]:
example4[3] = np.nan
example4

Unnamed: 0,0,1,2,3
0,1.0,,7,
1,2.0,5.0,8,
2,,6.0,9,


Key takeaways:

- Dropping null values is a good idea only if the dataset is large enough to absorb the loss.
- Full rows or columns can be dropped if most of their data is missing.
- The DataFrame.dropna(axis=) method drops null values. The axis argument specifies whether rows or columns are dropped.
- The how argument can also be used. By default it is set to 'any', which drops every row/column that contains at least one null value. Setting it to 'all' drops only those rows/columns in which all values are null.

#### Exercise

In [38]:
# How might you go about dropping just column 3?
# Hint: remember that you will need to supply both the axis parameter and the how parameter.

In [39]:
example4.dropna(axis='columns',how='all')

Unnamed: 0,0,1,2
0,1.0,,7
1,2.0,5.0,8
2,,6.0,9


The thresh parameter gives you finer-grained control: you set the number of non-null values that a row or column needs to have in order to be kept:

In [40]:
example4.dropna(axis='rows', thresh=3)

Unnamed: 0,0,1,2,3
1,2.0,5.0,8,
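
The thresh parameter works along either axis. The sketch below rebuilds the expanded example frame under a hypothetical name and applies a threshold to columns as well as rows; the threshold values are arbitrary choices for illustration:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame([[1,      np.nan, 7, np.nan],
                      [2,      5,      8, np.nan],
                      [np.nan, 6,      9, np.nan]])

# Keep only the columns that have at least 2 non-null values.
kept_columns = frame.dropna(axis='columns', thresh=2)

# Keep only the rows that have at least 3 non-null values.
kept_rows = frame.dropna(axis='index', thresh=3)

print(kept_columns)
print(kept_rows)
```

Here the all-null column 3 is removed by the column threshold, while the row threshold keeps only the row with three valid entries.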


#### Filling null values

##### Categorical Data (Non-numeric)
First, let us consider non-numeric data. Datasets often have columns with categorical data, e.g. gender, or True/False values.

In most of these cases, we replace missing values with the mode of the column. Say we have 100 data points: 90 say True, 8 say False, and 2 are missing. We can then fill the 2 missing entries with True, the most frequent value in the column.

Domain knowledge can also guide the choice here. Let us consider an example of filling with the mode.

In [42]:
fill_with_mode = pd.DataFrame([[1,2,"True"],
                               [3,4,None],
                               [5,6,"False"],
                               [7,8,"True"],
                               [9,10,"True"]])

fill_with_mode

Unnamed: 0,0,1,2
0,1,2,True
1,3,4,
2,5,6,False
3,7,8,True
4,9,10,True


In [43]:
# Now, let's first find the mode before filling the None value with the mode.

fill_with_mode[2].value_counts()

True     3
False    1
Name: 2, dtype: int64

In [44]:
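# Assign the result back: fillna(inplace=True) on a selected column may not update the DataFrame in recent pandas versions.
fill_with_mode[2] = fill_with_mode[2].fillna('True')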
fill_with_mode

Unnamed: 0,0,1,2
0,1,2,True
1,3,4,True
2,5,6,False
3,7,8,True
4,9,10,True
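
Rather than reading the mode off value_counts() and typing it in by hand, it can be computed directly. This is a minimal sketch assuming the same data as above, rebuilt under a hypothetical name:

```python
import pandas as pd

frame = pd.DataFrame([[1, 2, "True"],
                      [3, 4, None],
                      [5, 6, "False"],
                      [7, 8, "True"],
                      [9, 10, "True"]])

# Series.mode() returns a Series because ties are possible; take the first entry.
most_common = frame[2].mode()[0]

# Fill the missing entry with the computed mode and assign the column back.
frame[2] = frame[2].fillna(most_common)

print(frame)
```

If several values tie for the most frequent, mode()[0] picks the first of them, so a tie may deserve a closer look before filling.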


##### Numeric Data
Now, coming to numeric data. Here, we have two common ways of replacing missing values:

- Replace with the median of the column
- Replace with the mean of the column

We replace with the median in the case of skewed data with outliers, because the median is robust to outliers.

When the data is roughly symmetric (for example, normally distributed), we can use the mean, as in that case the mean and median are close.
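
To see why outliers matter for this choice, consider a small made-up column with one extreme value. The numbers below are invented purely for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical skewed column: one outlier (100) and one missing value.
skewed = pd.Series([1, 2, 3, 4, 100, np.nan])

mean_value = skewed.mean()      # pulled upward by the outlier
median_value = skewed.median()  # stays near the bulk of the data

# Fill the gap with the median, which better represents a typical value here.
filled = skewed.fillna(median_value)

print(mean_value, median_value)
print(filled)
```

The mean comes out as 22.0 while the median is 3.0, so filling with the mean would insert a value unlike almost every observed entry.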

First, let us take a column which is normally distributed and let us fill the missing value with the mean of the column.

In [45]:
fill_with_mean = pd.DataFrame([[-2,0,1],
                               [-1,2,3],
                               [np.nan,4,5],
                               [1,6,7],
                               [2,8,9]])

fill_with_mean

Unnamed: 0,0,1,2
0,-2.0,0,1
1,-1.0,2,3
2,,4,5
3,1.0,6,7
4,2.0,8,9


In [46]:
np.mean(fill_with_mean[0])

0.0

In [47]:
# Assign the filled column back instead of relying on inplace=True on a selected column.
fill_with_mean[0] = fill_with_mean[0].fillna(np.mean(fill_with_mean[0]))
fill_with_mean

Unnamed: 0,0,1,2
0,-2.0,0,1
1,-1.0,2,3
2,0.0,4,5
3,1.0,6,7
4,2.0,8,9


As we can see, the missing value has been replaced with the mean of its column.

Now let us try another dataframe, and this time we will replace the NaN values with the median of the column.

In [48]:
fill_with_median = pd.DataFrame([[-2,0,1],
                                 [-1,2,3],
                                 [0,np.nan,5],
                                 [1,6,7],
                                 [2,8,9]])

fill_with_median

Unnamed: 0,0,1,2
0,-2,0.0,1
1,-1,2.0,3
2,0,,5
3,1,6.0,7
4,2,8.0,9


In [49]:
# The median of the second column is

fill_with_median[1].median()

4.0

In [50]:
# Filling with median

fill_with_median[1] = fill_with_median[1].fillna(fill_with_median[1].median())
fill_with_median

Unnamed: 0,0,1,2
0,-2,0.0,1
1,-1,2.0,3
2,0,4.0,5
3,1,6.0,7
4,2,8.0,9



As we can see, the NaN value has been replaced by the median of the column.

In [51]:
example5 = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))
example5

a    1.0
b    NaN
c    2.0
d    NaN
e    3.0
dtype: float64

In [53]:
# You can fill all of the null entries with a single value, such as 0:

example5.fillna(0)

a    1.0
b    0.0
c    2.0
d    0.0
e    3.0
dtype: float64

Key takeaways:

- Filling in missing values makes sense when there is too little data to afford dropping rows, or when there is a sound strategy for choosing the fill value.
- Domain knowledge can be used to fill in missing values by approximating them.
- For categorical data, missing values are usually substituted with the mode of the column.
- For numeric data, missing values are usually filled in with the mean (for roughly symmetric data) or the median of the column.

#### Exercise

In [54]:
# What happens if you try to fill null values with a string, like ''?

In [55]:
example5.fillna('')

a    1.0
b       
c    2.0
d       
e    3.0
dtype: object

You can forward-fill null values, that is, use the last valid value to fill each null:

In [56]:
example5.fillna(method='ffill')

a    1.0
b    1.0
c    2.0
d    2.0
e    3.0
dtype: float64
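
pandas also offers dedicated ffill() and bfill() methods for the same operations. Recent pandas releases deprecate the method= argument of fillna, so the sketch below (with a hypothetical Series name) may be the safer spelling on newer versions:

```python
import numpy as np
import pandas as pd

series = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))

# Forward-fill: each null takes the last valid value before it.
forward = series.ffill()

# Back-fill: each null takes the next valid value after it.
backward = series.bfill()

print(forward)
print(backward)
```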

As you might guess, this works the same with DataFrames, but you can also specify an axis along which to fill null values:

In [57]:
example4

Unnamed: 0,0,1,2,3
0,1.0,,7,
1,2.0,5.0,8,
2,,6.0,9,


In [58]:
example4.fillna(method='ffill', axis=1)

Unnamed: 0,0,1,2,3
0,1.0,1.0,7.0,7.0
1,2.0,5.0,8.0,8.0
2,,6.0,9.0,9.0


Notice that when a previous value is not available for forward-filling, the null value remains.

#### Exercise

In [60]:
# What output does example4.fillna(method='bfill', axis=1) produce?
# What about example4.fillna(method='ffill') or example4.fillna(method='bfill')?
# Can you think of a longer code snippet to write that can fill all of the null values in example4?

In [61]:
example4.fillna(method='bfill', axis=1)

Unnamed: 0,0,1,2,3
0,1.0,7.0,7.0,
1,2.0,5.0,8.0,
2,6.0,6.0,9.0,


In [62]:
example4.fillna(method='ffill')

Unnamed: 0,0,1,2,3
0,1.0,,7,
1,2.0,5.0,8,
2,2.0,6.0,9,


In [64]:
example4.fillna(method='bfill')

Unnamed: 0,0,1,2,3
0,1.0,5.0,7,
1,2.0,5.0,8,
2,,6.0,9,
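
None of the single fills above removes every null. One possible answer to the last question of the exercise is to chain a forward fill and a back fill along the rows. This is only a sketch under the assumption that borrowing values from neighboring columns is acceptable for your data; the frame is rebuilt under a hypothetical name:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame([[1,      np.nan, 7, np.nan],
                      [2,      5,      8, np.nan],
                      [np.nan, 6,      9, np.nan]], dtype=float)

# Forward-fill along each row first, then back-fill whatever is still
# missing at the start of a row.
filled = frame.ffill(axis=1).bfill(axis=1)

print(filled)
```

The forward fill populates column 3 from column 2, and the back fill then covers the leading null in the last row.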


You can be creative about how you use fillna. For example, let's look at example4 again, but this time let's fill the missing values in each column with the average of that column:

In [65]:
example4.fillna(example4.mean())

Unnamed: 0,0,1,2,3
0,1.0,5.5,7,
1,2.0,5.0,8,
2,1.5,6.0,9,


Notice that column 3 is still valueless: fillna matches each column to its own mean, and the mean of a column with no values is itself NaN.
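
If a single average over every valid cell in the DataFrame is wanted instead, it can be computed with NumPy and passed to fillna as a scalar. A minimal sketch, with the frame rebuilt under a hypothetical name:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame([[1,      np.nan, 7, np.nan],
                      [2,      5,      8, np.nan],
                      [np.nan, 6,      9, np.nan]], dtype=float)

# np.nanmean averages all cells while ignoring the NaN entries.
overall_mean = np.nanmean(frame.to_numpy())

# A scalar fill value is applied to every null, including the empty column.
filled = frame.fillna(overall_mean)

print(overall_mean)
print(filled)
```

Because the fill value is a scalar, the all-null column is filled as well. Whether that is meaningful depends on the data.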

#### Identifying duplicates: duplicated
You can easily spot duplicate values using the duplicated method in pandas, which returns a Boolean mask indicating whether an entry in a DataFrame is a duplicate of an earlier one. Let's create another example DataFrame to see this in action.



In [66]:
example6 = pd.DataFrame({'letters': ['A','B'] * 2 + ['B'],
                         'numbers': [1, 2, 1, 3, 3]})
example6

Unnamed: 0,letters,numbers
0,A,1
1,B,2
2,A,1
3,B,3
4,B,3


In [67]:
example6.duplicated()

0    False
1    False
2     True
3    False
4     True
dtype: bool

In [68]:
example6.drop_duplicates()

Unnamed: 0,letters,numbers
0,A,1
1,B,2
3,B,3
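
Both methods also accept a keep argument that controls which copy counts as the original. The sketch below rebuilds the example frame under a hypothetical name and shows keep=False, which flags every copy, and keep='last', which retains the final occurrence:

```python
import pandas as pd

frame = pd.DataFrame({'letters': ['A', 'B'] * 2 + ['B'],
                      'numbers': [1, 2, 1, 3, 3]})

# keep=False marks every row that has a duplicate anywhere in the frame.
all_copies = frame.duplicated(keep=False)

# keep='last' drops earlier copies and retains the last occurrence of each row.
keep_last = frame.drop_duplicates(keep='last')

print(all_copies)
print(keep_last)
```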



Both duplicated and drop_duplicates consider all columns by default, but you can specify that they examine only a subset of columns in your DataFrame:

In [69]:
example6.drop_duplicates(['letters'])

Unnamed: 0,letters,numbers
0,A,1
1,B,2


Takeaway: Removing duplicate data is an essential part of almost every data-science project. Duplicate data can skew your analyses and give you inaccurate results!