# Data Preparation

[Original Notebook source from *Data Science: Introduction to Machine Learning for Data Science Python and Machine Learning Studio by Lee Stott*](https://github.com/leestott/intro-Datascience/blob/master/Course%20Materials/4-Cleaning_and_Manipulating-Reference.ipynb)

## Exploring `DataFrame` information

> **Learning goal:** By the end of this subsection, you should be comfortable finding general information about the data stored in pandas DataFrames.

Once you have loaded your data into pandas, it will more likely than not be in a `DataFrame`. However, if the data set in your `DataFrame` has 60,000 rows and 400 columns, how do you even begin to get a sense of what you are working with? Fortunately, pandas provides some convenient tools to quickly look at overall information about a `DataFrame`, in addition to the first few and last few rows.

To explore this functionality, we will import the Python scikit-learn library and use an iconic dataset that every data scientist has seen many times: British biologist Ronald Fisher's *Iris* data set, used in his 1936 paper "The use of multiple measurements in taxonomic problems":


In [1]:
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
iris_df = pd.DataFrame(data=iris['data'], columns=iris['feature_names'])

### `DataFrame.shape`
We have loaded the Iris dataset into the variable `iris_df`. Before diving into the data, it is valuable to know how many datapoints we have and the overall size of the dataset.


In [2]:
iris_df.shape

(150, 4)

So, we are dealing with 150 rows and 4 columns of data. Each row represents one datapoint and each column represents a single feature of the data frame. In other words, there are 150 datapoints with 4 features each.

`shape` here is an attribute of the dataframe and not a function, which is why it does not end in a pair of parentheses.
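Because `shape` is a plain tuple, it can be unpacked directly into a row count and a column count. The following is a minimal sketch on a small hypothetical frame (`demo_df` is an illustrative name, not part of the notebook):

```python
import pandas as pd

# Hypothetical 3-row, 2-column frame, used only for illustration.
demo_df = pd.DataFrame({'a': [1, 2, 3], 'b': [4.0, 5.0, 6.0]})

# shape is an attribute holding a (rows, columns) tuple, so no parentheses.
n_rows, n_cols = demo_df.shape
print(f"{n_rows} rows, {n_cols} columns")
```

The same unpacking should work on `iris_df` from the cell above.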


### `DataFrame.columns`
Let us now move on to the 4 columns of data. What does each of them represent exactly? The `columns` attribute gives us the names of the columns in the dataframe.


In [3]:
iris_df.columns

Index(['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)',
       'petal width (cm)'],
      dtype='object')

As we can see, there are four (4) columns. The `columns` attribute tells us the names of the columns and nothing else. This attribute becomes important when we want to identify the features a dataset contains.


### `DataFrame.info`
The amount of data (given by the `shape` attribute) and the names of the features or columns (given by the `columns` attribute) tell us something about the dataset. Now, we want to look at it in more depth. The `DataFrame.info()` function is quite useful for this.


In [4]:
iris_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 4 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   sepal length (cm)  150 non-null    float64
 1   sepal width (cm)   150 non-null    float64
 2   petal length (cm)  150 non-null    float64
 3   petal width (cm)   150 non-null    float64
dtypes: float64(4)
memory usage: 4.8 KB


From here, we can make a few observations:  
1. The data type of each column: in this dataset, all of the data is stored as 64-bit floating-point numbers.  
2. The number of non-null values: dealing with null values is an important step in data preparation. We will deal with it later in the notebook.  


### `DataFrame.describe`
Say we have a lot of numerical data in our dataset. Univariate statistics such as the mean, median, and quartiles can be computed for each column individually. The `DataFrame.describe()` function provides a statistical summary of the numerical columns of a dataset.


In [5]:
iris_df.describe()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
count,150.0,150.0,150.0,150.0
mean,5.843333,3.057333,3.758,1.199333
std,0.828066,0.435866,1.765298,0.762238
min,4.3,2.0,1.0,0.1
25%,5.1,2.8,1.6,0.3
50%,5.8,3.0,4.35,1.3
75%,6.4,3.3,5.1,1.8
max,7.9,4.4,6.9,2.5


The output above shows the total number of data points, mean, standard deviation, minimum, lower quartile (25%), median (50%), upper quartile (75%) and maximum value of each column.


### `DataFrame.head`
With all of the functions and attributes above, we have a top-level view of the dataset. We know how many data points there are, how many features there are, the data type of each feature, and the number of non-null values for each feature.

Now it is time to look at the data itself. Let's see what the first few rows (the first few datapoints) of our `DataFrame` look like:


In [6]:
iris_df.head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2


In the output here, we can see five (5) entries of the dataset. Looking at the index on the left, we find that these are the first five rows.


### Exercise:

From the example above, it is clear that, by default, `DataFrame.head` returns the first five rows of a `DataFrame`. In the code cell below, can you figure out a way to display more than five rows?


In [7]:
# Hint: Consult the documentation by using iris_df.head?

### `DataFrame.tail`
Another way of looking at the data is from the end (instead of the beginning). The flipside of `DataFrame.head` is `DataFrame.tail`, which returns the last five rows of a `DataFrame`:


In [8]:
iris_df.tail()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
145,6.7,3.0,5.2,2.3
146,6.3,2.5,5.0,1.9
147,6.5,3.0,5.2,2.0
148,6.2,3.4,5.4,2.3
149,5.9,3.0,5.1,1.8


In practice, it is useful to be able to easily examine the first few rows or the last few rows of a `DataFrame`, particularly when you are looking for outliers in ordered datasets.

All of the functions and attributes shown in the code examples above help us get a look and feel for the data.

> **Takeaway:** Even just by looking at the metadata about the information in a DataFrame or the first and last few values in one, you can get an immediate idea about the size, shape, and content of the data you are dealing with.


### Missing Data
Let us turn to missing data. Missing data occurs when no value is stored in some of the columns.

Take an example: someone who is conscious of their weight does not fill in the weight field in a survey. The weight value for that person will then be missing.

Most of the time, real-world datasets contain missing values.

**How pandas handles missing data**

Pandas handles missing values in two ways. The first, which you have seen in previous sections, is `NaN`, or Not a Number. This is a special value that is part of the IEEE floating-point specification and it is only used to indicate missing floating-point values.

For missing values that are not floats, pandas uses the Python `None` object. While it might seem confusing that you will encounter two different kinds of values that say essentially the same thing, there are sound programmatic reasons for this design choice and, in practice, it lets pandas deliver a good compromise for the vast majority of cases. Even so, both `None` and `NaN` carry restrictions that you need to be mindful of with regard to how they can be used.


### `None`: non-float missing data
Because `None` comes from Python, it cannot be used in NumPy and pandas arrays that are not of data type `'object'`. Remember, NumPy arrays (and the data structures in pandas) can contain only one type of data. This is what gives them their tremendous power for large-scale data and computational work, but it also limits their flexibility. Such arrays have to upcast to the "lowest common denominator," the data type that can hold everything in the array. When `None` is in the array, it means you are working with Python objects.
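The upcasting rule can be observed directly by inspecting `dtype` on a few small arrays. This is an illustrative sketch; the array names are hypothetical and the exact integer width depends on the platform:

```python
import numpy as np

# Integers only: NumPy picks an integer dtype.
ints = np.array([1, 2, 3])
# A single float forces the whole array up to float64.
mixed = np.array([1, 2.5, 3])
# None has no numeric representation, so the array falls back to object.
with_none = np.array([1, None, 3])

print(ints.dtype, mixed.dtype, with_none.dtype)
```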

To see this in action, consider the following example array (note its `dtype`):


In [9]:
import numpy as np

example1 = np.array([2, None, 6, 8])
example1

array([2, None, 6, 8], dtype=object)

The reality of upcast data types carries two side effects with it. First, operations are carried out at the level of interpreted Python code rather than compiled NumPy code. In practice, this means that any operations involving `Series` or `DataFrame`s with `None` in them will be slower. While you would probably not notice this performance hit on small data, for large datasets it might become an issue.

The second side effect stems from the first. Because `None` essentially drags `Series` or `DataFrame`s back into the world of vanilla Python, using NumPy/pandas aggregations like `sum()` or `min()` on arrays that contain a `None` value will generally produce an error:


In [10]:
example1.sum()

TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'

**Key takeaway**: Addition (and other operations) between integers and `None` values is undefined, which can limit what you can do with datasets that contain them.


### `NaN`: missing float values

Unlike `None`, NumPy (and therefore pandas) supports `NaN` for its fast, vectorized operations and ufuncs. The bad news is that any arithmetic performed on `NaN` always results in `NaN`. For example:


In [11]:
np.nan + 1

nan

In [12]:
np.nan * 0

nan

The good news: aggregations run on arrays with `NaN` in them do not raise errors. The bad news: the results are not uniformly useful:


In [13]:
example2 = np.array([2, np.nan, 6, 8]) 
example2.sum(), example2.min(), example2.max()

(nan, nan, nan)
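When the missing entries should simply be skipped, NumPy offers NaN-aware counterparts of these aggregations, and pandas aggregations skip `NaN` by default. A short sketch (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

values = np.array([2, np.nan, 6, 8])

# The nan* variants skip NaN entries instead of propagating them.
total = np.nansum(values)
smallest = np.nanmin(values)
largest = np.nanmax(values)
print(total, smallest, largest)

# pandas aggregations use skipna=True by default, so NaN is ignored here too.
series_total = pd.Series(values).sum()
print(series_total)
```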

### Exercise:


In [14]:
# What happens if you add np.nan and None together?


Remember: `NaN` is only for missing floating-point values; there is no `NaN` equivalent for integers, strings, or Booleans.


### `NaN` and `None`: null values in pandas

Even though `NaN` and `None` can behave somewhat differently, pandas is nevertheless built to handle them interchangeably. To see what we mean, consider a `Series` of integers:


In [15]:
int_series = pd.Series([1, 2, 3], dtype=int)
int_series

0    1
1    2
2    3
dtype: int64

### Exercise:


In [16]:
# Now set an element of int_series equal to None.
# How does that element show up in the Series?
# What is the dtype of the Series?


In the process of upcasting data types to keep data homogeneous in `Series` and `DataFrame`s, pandas will willingly switch missing values between `None` and `NaN`. Because of this design feature, it can be helpful to think of `None` and `NaN` as two different flavors of "null" in pandas. Indeed, some of the core methods you will use to deal with missing values in pandas reflect this idea in their names:

- `isnull()`: Generates a Boolean mask indicating missing values
- `notnull()`: The opposite of `isnull()`
- `dropna()`: Returns a filtered version of the data
- `fillna()`: Returns a copy of the data with missing values filled or replaced

These are important methods to master and get comfortable with, so let's go over each of them in some depth.


### Detecting null values

Now that we understand why missing values matter, we need to detect them in our dataset before dealing with them.  
`isnull()` and `notnull()` are your primary methods for detecting null data. Both return Boolean masks over your data.


In [17]:
example3 = pd.Series([0, np.nan, '', None])

In [18]:
example3.isnull()

0    False
1     True
2    False
3     True
dtype: bool

Look closely at the output. Does any of it surprise you? While `0` is an arithmetic null, it is nevertheless a perfectly good integer and pandas treats it as such. `''` is a little more subtle. While we used it in Section 1 to represent an empty string value, it is nevertheless a string object and not a representation of null as far as pandas is concerned.

Now, let's turn this around and use these methods in a manner more like you will use them in practice. You can use Boolean masks directly as a `Series` or `DataFrame` index, which is useful when you want to work with isolated missing (or present) values.

If we want the total number of missing values, we can just sum over the mask produced by the `isnull()` method.


In [19]:
example3.isnull().sum()

2

### Exercise:


In [20]:
# Try running example3[example3.notnull()].
# Before you do so, what do you expect to see?


**Key takeaway**: Both the `isnull()` and `notnull()` methods produce similar results when you use them on DataFrames: they show the results and the index of those results, which will help you enormously as you wrestle with your data.
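On a `DataFrame`, the mask has the same shape and index as the data, and summing it counts the nulls per column. A minimal sketch on a hypothetical `survey` frame (the name and its values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical frame used only for illustration.
survey = pd.DataFrame({
    'age': [25, np.nan, 31],
    'weight': [np.nan, np.nan, 70.5],
})

# Boolean DataFrame with the same shape and index as survey.
mask = survey.isnull()
# True counts as 1, so summing the mask counts nulls in each column.
missing_per_column = mask.sum()

print(mask)
print(missing_per_column)
```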


### Dealing with missing data

> **Learning goal:** By the end of this subsection, you should know how and when to replace or remove null values from DataFrames.

Machine learning models cannot deal with missing data themselves. So, before passing the data into the model, we need to deal with these missing values.

How missing data is handled carries subtle tradeoffs, and it can affect your final analysis and real-world outcomes.

There are primarily two ways of dealing with missing data:

1.   Drop the row containing the missing value
2.   Replace the missing value with some other value

We will discuss both of these methods and their pros and cons in detail.


### Dropping null values

The amount of data we pass to our model has a direct effect on its performance. Dropping null values means reducing the number of datapoints, and hence the size of the dataset. So, dropping rows with null values is advisable mainly when the dataset is quite large.

Another case is when a certain row or column has a lot of missing values. It can then be dropped, because it would not add much value to our analysis when most of the data for that row or column is missing.
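One way to decide which columns have "a lot" of missing values is to compute the share of nulls per column and compare it with a cutoff. The sketch below is illustrative only: the frame, the column names, and the 50% cutoff are assumptions, not values from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical frame used only for illustration.
frame = pd.DataFrame({
    'mostly_missing': [np.nan, np.nan, np.nan, 4.0],
    'complete': [1, 2, 3, 4],
    'one_gap': [1.0, np.nan, 3.0, 4.0],
})

# The mean of a Boolean mask is the fraction of nulls in each column.
missing_share = frame.isnull().mean()

# Assumed cutoff: keep columns with at most half of their values missing.
max_missing_share = 0.5
columns_to_keep = missing_share[missing_share <= max_missing_share].index
trimmed = frame[columns_to_keep]

print(missing_share)
print(trimmed)
```

The right cutoff depends on the dataset and on how important the column is to the analysis.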

Beyond identifying missing values, pandas provides a convenient means to remove null values from `Series` and `DataFrame`s. To see this in action, let's return to `example3`. The `dropna()` method drops the entries (for a `DataFrame`, the rows) that contain null values.


In [21]:
example3 = example3.dropna()
example3

0    0
2     
dtype: object

Note that this should look like your output from `example3[example3.notnull()]`. The difference here is that, rather than just indexing on the masked values, `dropna` returned a new `Series` without the missing values, which we assigned back to `example3`.

Because DataFrames have two dimensions, they afford more options for dropping data.


In [22]:
example4 = pd.DataFrame([[1,      np.nan, 7], 
                         [2,      5,      8], 
                         [np.nan, 6,      9]])
example4

Unnamed: 0,0,1,2
0,1.0,,7
1,2.0,5.0,8
2,,6.0,9


(Did you notice that pandas upcast two of the columns to floats to accommodate the `NaN`s?)

You cannot drop a single value from a `DataFrame`, so you have to drop full rows or columns. Depending on what you are doing, you might want to do one or the other, and pandas gives you options for both. Because in data science columns generally represent variables and rows represent observations, you are more likely to drop rows of data; the default setting for `dropna()` is to drop all rows that contain any null values:


In [23]:
example4.dropna()

Unnamed: 0,0,1,2
1,2.0,5.0,8


If necessary, you can drop NA values from columns instead. Use `axis=1` (or, equivalently, `axis='columns'`) to do so:


In [24]:
example4.dropna(axis='columns')

Unnamed: 0,2
0,7
1,8
2,9


Notice that this can drop a lot of data that you might want to keep, particularly in smaller datasets. What if you just want to drop rows or columns that contain several or even just all null values? You specify those settings in `dropna` with the `how` and `thresh` parameters.

By default, `how='any'` (if you would like to check for yourself or see what other parameters the method has, run `example4.dropna?` in a code cell). You could alternatively specify `how='all'` so as to drop only rows or columns that contain all null values. Let's expand our example `DataFrame` to see this in action in the next exercise.


In [25]:
example4[3] = np.nan
example4

Unnamed: 0,0,1,2,3
0,1.0,,7,
1,2.0,5.0,8,
2,,6.0,9,


> Key takeaways: 
1. Dropping null values is a good idea only if the dataset is large enough.  
2. Full rows or columns can be dropped if most of their data is missing.  
3. The `DataFrame.dropna(axis=)` method helps in dropping null values. The `axis` argument specifies whether rows or columns are dropped.  
4. The `how` argument can also be used. By default it is set to `any`, so it drops every row or column that contains any null value. It can be set to `all` to drop only those rows or columns in which all values are null.  


### Exercise:


In [26]:
# How might you go about dropping just column 3?
# Hint: remember that you will need to supply both the axis parameter and the how parameter.


The `thresh` parameter gives you finer-grained control: you set the number of *non-null* values that a row or column needs to have in order to be kept:


In [27]:
example4.dropna(axis='rows', thresh=3)

Unnamed: 0,0,1,2,3
1,2.0,5.0,8,


Here, the first and last rows have been dropped, because they contain only two non-null values.


### Filling null values

Sometimes it makes sense to fill in missing values with ones that could be valid. There are a few techniques for filling null values. The first is to use domain knowledge (knowledge of the subject on which the dataset is based) to approximate the missing values.

You could use `isnull` to do this in place, but that can be laborious, particularly if you have a lot of values to fill. Because this is such a common task in data science, pandas provides `fillna`, which returns a copy of the `Series` or `DataFrame` with the missing values replaced with one of your choosing. The examples below show how this works in practice.


### Categorical Data (Non-numeric)
Let us first consider non-numeric data. Datasets often have columns with categorical data, for example gender or True/False values.

In most of these cases, we replace missing values with the `mode` of the column. Say we have 100 data points, of which 90 say True, 8 say False and 2 were left blank. We can then fill the 2 blanks with True, based on the column as a whole.

Again, domain knowledge can be used here. Let us look at an example of filling with the mode.


In [28]:
fill_with_mode = pd.DataFrame([[1,2,"True"],
                               [3,4,None],
                               [5,6,"False"],
                               [7,8,"True"],
                               [9,10,"True"]])

fill_with_mode

Unnamed: 0,0,1,2
0,1,2,True
1,3,4,
2,5,6,False
3,7,8,True
4,9,10,True


Let us first find the mode before filling the `None` value with it.


In [29]:
fill_with_mode[2].value_counts()

True     3
False    1
Name: 2, dtype: int64

So, we will replace `None` with `'True'`:


In [30]:
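# Assign the result back: fillna(inplace=True) on a selected column
# may not update the DataFrame in recent pandas versions.
fill_with_mode[2] = fill_with_mode[2].fillna('True')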

In [31]:
fill_with_mode

Unnamed: 0,0,1,2
0,1,2,True
1,3,4,True
2,5,6,False
3,7,8,True
4,9,10,True


As we can see, the null value has been replaced. Needless to say, we could have written anything in place of `'True'` and it would have been substituted.
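Rather than reading the mode off the `value_counts()` output and typing it by hand, it can also be computed with `Series.mode()`. The following is a sketch on a hypothetical `answers` series; taking the first entry of the result is an assumption about how to break ties:

```python
import pandas as pd

# Hypothetical column of survey answers with one missing entry.
answers = pd.Series(['True', None, 'False', 'True', 'True'])

# mode() ignores nulls and returns a Series, because ties are possible;
# here we simply take the first entry.
most_common = answers.mode()[0]
filled = answers.fillna(most_common)

print(most_common)
print(filled)
```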


### Numeric Data
Now, let us turn to numeric data. Here, there are two common ways of replacing missing values:

1. Replace with the median of the column  
2. Replace with the mean of the column  

We use the median when the data is skewed or contains outliers, because the median is robust to outliers.

When the data is roughly symmetric and free of outliers, we can use the mean, since in that case the mean and median are close to each other.
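The effect of a single outlier on the two statistics can be seen on a small made-up series (the `incomes` values below are invented for illustration):

```python
import pandas as pd

# Hypothetical values with one extreme entry.
incomes = pd.Series([30, 32, 35, 31, 1000])

mean_value = incomes.mean()      # pulled far upward by the value 1000
median_value = incomes.median()  # stays at the middle of the typical values

print(mean_value)    # 225.6
print(median_value)  # 32.0
```

Filling a gap in this column with the mean would insert a value unlike any typical observation, which is why the median is usually the safer choice for skewed data.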

First, let us take a column whose values are symmetrically distributed and fill its missing value with the mean of the column.


In [32]:
fill_with_mean = pd.DataFrame([[-2,0,1],
                               [-1,2,3],
                               [np.nan,4,5],
                               [1,6,7],
                               [2,8,9]])

fill_with_mean

Unnamed: 0,0,1,2
0,-2.0,0,1
1,-1.0,2,3
2,,4,5
3,1.0,6,7
4,2.0,8,9


The mean of the column is:


In [33]:
np.mean(fill_with_mean[0])

0.0

Filling with the mean:


In [34]:
# Assign the filled column back instead of relying on inplace=True.
fill_with_mean[0] = fill_with_mean[0].fillna(fill_with_mean[0].mean())
fill_with_mean

Unnamed: 0,0,1,2
0,-2.0,0,1
1,-1.0,2,3
2,0.0,4,5
3,1.0,6,7
4,2.0,8,9


As we can see, the missing value has been replaced with the mean of its column.


Now let us try another dataframe, and this time we will replace the `NaN` values with the median of the column.


In [35]:
fill_with_median = pd.DataFrame([[-2,0,1],
                               [-1,2,3],
                               [0,np.nan,5],
                               [1,6,7],
                               [2,8,9]])

fill_with_median

Unnamed: 0,0,1,2
0,-2,0.0,1
1,-1,2.0,3
2,0,,5
3,1,6.0,7
4,2,8.0,9


The median of the second column is:


In [36]:
fill_with_median[1].median()

4.0

Filling with the median:


In [37]:
# Assign the filled column back instead of relying on inplace=True.
fill_with_median[1] = fill_with_median[1].fillna(fill_with_median[1].median())
fill_with_median

Unnamed: 0,0,1,2
0,-2,0.0,1
1,-1,2.0,3
2,0,4.0,5
3,1,6.0,7
4,2,8.0,9


As we can see, the `NaN` value has been replaced by the median of the column. Next, let's create another example `Series` to look at `fillna` more generally:


In [38]:
example5 = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))
example5

a    1.0
b    NaN
c    2.0
d    NaN
e    3.0
dtype: float64

You can fill all of the null entries with a single value, such as `0`:


In [39]:
example5.fillna(0)

a    1.0
b    0.0
c    2.0
d    0.0
e    3.0
dtype: float64

> Key takeaways:
1. Fill in missing values when there is little data to spare or when there is a sound strategy for filling them.  
2. Domain knowledge can be used to fill in missing values by approximating them.  
3. For categorical data, missing values are mostly replaced with the mode of the column.  
4. For numeric data, missing values are usually filled with the mean (for roughly symmetric data) or the median of the column.  


### Exercise:


In [40]:
# What happens if you try to fill null values with a string, like ''?


You can **forward-fill** null values, which is to use the last valid value to fill a null:


In [41]:
# Equivalent to fillna(method='ffill'), which recent pandas versions deprecate.
example5.ffill()

a    1.0
b    1.0
c    2.0
d    2.0
e    3.0
dtype: float64

You can also **back-fill** to propagate the next valid value backward to fill a null:


In [42]:
# Equivalent to fillna(method='bfill'), which recent pandas versions deprecate.
example5.bfill()

a    1.0
b    2.0
c    2.0
d    3.0
e    3.0
dtype: float64

As you might guess, this works the same with DataFrames, but you can also specify an `axis` along which to fill null values:


In [43]:
example4

Unnamed: 0,0,1,2,3
0,1.0,,7,
1,2.0,5.0,8,
2,,6.0,9,


In [44]:
# Forward-fill along each row (axis=1), using the value to the left.
example4.ffill(axis=1)

Unnamed: 0,0,1,2,3
0,1.0,1.0,7.0,7.0
1,2.0,5.0,8.0,8.0
2,,6.0,9.0,9.0


Notice that when a previous value is not available for forward-filling, the null value remains.


### Exercise:


In [45]:
# What output does example4.bfill(axis=1) produce?
# What about example4.ffill() or example4.bfill()?
# Can you think of a longer code snippet to write that can fill all of the null values in example4?


You can be creative about how you use `fillna`. For example, let's look at `example4` again, but this time let's fill the missing values in each column with the mean of that column:


In [46]:
example4.fillna(example4.mean())

Unnamed: 0,0,1,2,3
0,1.0,5.5,7,
1,2.0,5.0,8,
2,1.5,6.0,9,


Notice that column 3 is still valueless. `example4.mean()` computes one mean per column and `fillna` applies each to the matching column, but column 3 has no non-null values, so its mean is itself `NaN` and there is nothing to fill with.

> **Takeaway:** There are multiple ways to deal with missing values in your datasets. The specific strategy you use (removing them, replacing them, or even how you replace them) should be dictated by the particulars of that data. You will develop a better sense of how to deal with missing values the more you handle and interact with datasets.


### Encoding Categorical Data

Machine learning models only deal with numbers and numeric data of some form. A model cannot tell the difference between Yes and No, but it can distinguish between 0 and 1. So, after filling in the missing values, we need to encode the categorical data in some numeric form that the model can understand.

Encoding can be done in two ways. We will discuss them next.


**LABEL ENCODING**

Label encoding converts each category to a number. For example, say we have a dataset of airline passengers with a column containing their class, one of ['business class', 'economy class', 'first class']. If label encoding is applied to it, the values are transformed to [0,1,2]. Let us see an example in code. Since we will be learning `scikit-learn` in upcoming notebooks, we won't use it here.


In [47]:
label = pd.DataFrame([
                      [10,'business class'],
                      [20,'first class'],
                      [30, 'economy class'],
                      [40, 'economy class'],
                      [50, 'economy class'],
                      [60, 'business class']
],columns=['ID','class'])
label

Unnamed: 0,ID,class
0,10,business class
1,20,first class
2,30,economy class
3,40,economy class
4,50,economy class
5,60,business class


To perform label encoding on the `class` column, we first have to describe a mapping from each class to a number, and then apply it.


In [48]:
class_labels = {'business class':0,'economy class':1,'first class':2}
label['class'] = label['class'].replace(class_labels)
label

Unnamed: 0,ID,class
0,10,0
1,20,2
2,30,1
3,40,1
4,50,1
5,60,0


As we can see, the output matches what we expected. So, when do we use label encoding? Label encoding is used in either or both of the following cases:  
1. When the number of categories is large  
2. When the categories have an order.  
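When the categories do have an order, the integer codes should follow that order rather than, say, alphabetical order. One way to sketch this in pandas is with an ordered `pd.Categorical`; the ordering chosen below (economy, then business, then first) is an assumption made for illustration:

```python
import pandas as pd

# Hypothetical passenger classes.
classes = pd.Series(['business class', 'first class',
                     'economy class', 'economy class'])

# Assumed ordering, from lowest to highest fare class.
order = ['economy class', 'business class', 'first class']

# .codes gives each value's position in the declared category order.
encoded = pd.Categorical(classes, categories=order, ordered=True).codes
print(list(encoded))  # [1, 2, 0, 0]
```

A value that is not listed in `categories` would receive the code `-1`, which is worth checking for before training a model.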


**ONE HOT ENCODING**

Another type of encoding is One Hot Encoding. In this type of encoding, each category of the column is added as a separate column, and each datapoint gets a 0 or a 1 based on whether it belongs to that category. So, if there are n different categories, n columns are appended to the dataframe.

For example, let us take the same airline class example. The categories were: ['business class', 'economy class', 'first class']. So, if we perform one hot encoding, the following three columns are added to the dataset: ['class_business class', 'class_economy class', 'class_first class'].


In [49]:
one_hot = pd.DataFrame([
                      [10,'business class'],
                      [20,'first class'],
                      [30, 'economy class'],
                      [40, 'economy class'],
                      [50, 'economy class'],
                      [60, 'business class']
],columns=['ID','class'])
one_hot

Unnamed: 0,ID,class
0,10,business class
1,20,first class
2,30,economy class
3,40,economy class
4,50,economy class
5,60,business class


Let us perform one hot encoding on the `class` column.


In [50]:
# dtype=int yields 0/1 columns; recent pandas versions default to booleans.
one_hot_data = pd.get_dummies(one_hot, columns=['class'], dtype=int)

In [51]:
one_hot_data

Unnamed: 0,ID,class_business class,class_economy class,class_first class
0,10,1,0,0
1,20,0,0,1
2,30,0,1,0
3,40,0,1,0
4,50,0,1,0
5,60,1,0,0


Each one hot encoded column contains 0 or 1, which specifies whether that category applies to that datapoint.


When do we use one hot encoding? One hot encoding is used in either or both of the following cases:

1. When the number of categories and the size of the dataset are small.
2. When the categories follow no particular order.


> Key Takeaways:
1. Encoding is done to convert non-numeric data to numeric data.  
2. There are two types of encoding: label encoding and one hot encoding, and either can be used depending on the demands of the dataset.


## Removing duplicate data

> **Learning goal:** By the end of this subsection, you should be comfortable identifying and removing duplicate values from DataFrames.

In addition to missing data, you will often encounter duplicated data in real-world datasets. Fortunately, pandas provides an easy means of detecting and removing duplicate entries.


### Identifying duplicates: `duplicated`

You can easily spot duplicate values using the `duplicated` method in pandas, which returns a Boolean mask indicating whether an entry in a `DataFrame` is a duplicate of an earlier one. Let's create another example `DataFrame` to see this in action.


In [52]:
example6 = pd.DataFrame({'letters': ['A','B'] * 2 + ['B'],
                         'numbers': [1, 2, 1, 3, 3]})
example6

Unnamed: 0,letters,numbers
0,A,1
1,B,2
2,A,1
3,B,3
4,B,3


In [53]:
example6.duplicated()

0    False
1    False
2     True
3    False
4     True
dtype: bool

### Dropping duplicates: `drop_duplicates`
`drop_duplicates` simply returns a copy of the data for which all of the `duplicated` values are `False`:


In [54]:
example6.drop_duplicates()

Unnamed: 0,letters,numbers
0,A,1
1,B,2
3,B,3


Both `duplicated` and `drop_duplicates` default to considering all columns, but you can specify that they examine only a subset of columns in your `DataFrame`:


In [55]:
example6.drop_duplicates(['letters'])

Unnamed: 0,letters,numbers
0,A,1
1,B,2


> **Takeaway:** Removing duplicate data is an essential part of almost every data-science project. Duplicate data can change the results of your analyses and give you inaccurate results!


## Real-World Data Quality Checks

> **Learning goal:** By the end of this section, you should be comfortable detecting and correcting common real-world data quality issues, including inconsistent categorical values, abnormal numeric values (outliers), and duplicate entities with slight variations.

While missing values and exact duplicates are common issues, real-world datasets often contain subtler problems:

1. **Inconsistent categorical values**: The same category spelled differently (e.g., "USA", "U.S.A", "United States")
2. **Abnormal numeric values**: Extreme outliers that point to data entry errors (e.g., age = 999)
3. **Near-duplicate rows**: Records that represent the same entity with slight variations

Let's explore techniques for detecting and handling these issues.


### Creating a Sample "Dirty" Dataset

First, let's create a sample dataset that contains the kinds of issues we commonly encounter in real-world data:


In [None]:
import pandas as pd
import numpy as np

# Create a sample dataset with quality issues
dirty_data = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
    'name': ['John Smith', 'Jane Doe', 'John Smith', 'Bob Johnson', 
             'Alice Williams', 'Charlie Brown', 'John  Smith', 'Eva Martinez',
             'Bob Johnson', 'Diana Prince', 'Frank Castle', 'Alice Williams'],
    'age': [25, 32, 25, 45, 28, 199, 25, 31, 45, 27, -5, 28],
    'country': ['USA', 'UK', 'U.S.A', 'Canada', 'USA', 'United Kingdom',
                'United States', 'Mexico', 'canada', 'USA', 'UK', 'usa'],
    'purchase_amount': [100.50, 250.00, 105.00, 320.00, 180.00, 90.00,
                       102.00, 275.00, 325.00, 195.00, 410.00, 185.00]
})

print("Sample 'Dirty' Dataset:")
print(dirty_data)

### 1. Detecting Inconsistent Categorical Values

Notice that the `country` column has multiple representations of the same countries. Let's identify these inconsistencies:


In [None]:
# Check unique values in the country column
print("Unique country values:")
print(dirty_data['country'].unique())
print(f"\nTotal unique values: {dirty_data['country'].nunique()}")

# Count occurrences of each variation
print("\nValue counts:")
print(dirty_data['country'].value_counts())

#### Standardizing Categorical Values

We can create a mapping to standardize these values. A simple approach is to convert the values to lowercase and then apply a mapping dictionary:


In [None]:
# Create a standardization mapping
country_mapping = {
    'usa': 'USA',
    'u.s.a': 'USA',
    'united states': 'USA',
    'uk': 'UK',
    'united kingdom': 'UK',
    'canada': 'Canada',
    'mexico': 'Mexico'
}

# Standardize the country column
dirty_data['country_clean'] = dirty_data['country'].str.lower().map(country_mapping)

print("Before standardization:")
print(dirty_data['country'].value_counts())
print("\nAfter standardization:")
print(dirty_data['country_clean'].value_counts())
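One caveat with `Series.map` and a dictionary is that any value missing from the dictionary becomes `NaN`, so it is worth checking for unmapped values after standardizing. The sketch below uses a small made-up series and mapping (`countries`, `mapping`) rather than the notebook's data, and keeping the original spelling as a fallback is just one possible policy:

```python
import pandas as pd

# Hypothetical data: 'Germany' is deliberately absent from the mapping.
countries = pd.Series(['USA', 'u.s.a', 'Germany', 'UK'])
mapping = {'usa': 'USA', 'u.s.a': 'USA', 'uk': 'UK'}

cleaned = countries.str.lower().map(mapping)

# Values the mapping did not cover show up as NaN in the cleaned column.
unmapped = countries[cleaned.isnull()].unique()
print(unmapped)

# One possible policy: fall back to the original spelling where unmapped.
cleaned_with_fallback = cleaned.fillna(countries)
print(cleaned_with_fallback)
```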

**Alternative: Using Fuzzy Matching**

For more complex cases, we can use fuzzy string matching with the `rapidfuzz` library to detect similar strings automatically:


In [None]:
try:
    from rapidfuzz import process, fuzz
except ImportError:
    print("rapidfuzz is not installed. Please install it with 'pip install rapidfuzz' to use fuzzy matching.")
    process = None
    fuzz = None

# Get unique countries
unique_countries = dirty_data['country'].unique()

# For each country, find similar matches
if process is not None and fuzz is not None:
    print("Finding similar country names (similarity > 70%):")
    for country in unique_countries:
        matches = process.extract(country, unique_countries, scorer=fuzz.ratio, limit=3)
        # Filter matches with similarity > 70 and not identical
        similar = [m for m in matches if m[1] > 70 and m[0] != country]
        if similar:
            print(f"\n'{country}' is similar to:")
            for match, score, _ in similar:
                print(f"  - '{match}' (similarity: {score:.1f}%)")
else:
    print("Skipping fuzzy matching because rapidfuzz is not available.")
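
If `rapidfuzz` is not available, the same idea can be approximated with `difflib` from the Python standard library. This is only a sketch: the `similarity` helper and the sample list are hypothetical, the comparison is lowercased first (the cell above compares the raw strings), and `difflib` scores will not necessarily equal the scores that `rapidfuzz` reports.

```python
from difflib import SequenceMatcher

# Hypothetical sample of country spellings
countries = ['USA', 'U.S.A', 'usa', 'Canada', 'canada', 'Mexico']

def similarity(a, b):
    """Return a 0-100 similarity score, ignoring case."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Compare every pair once and keep those above the threshold
similar_pairs = []
for i, first in enumerate(countries):
    for second in countries[i + 1:]:
        score = similarity(first, second)
        if score > 70:
            similar_pairs.append((first, second, round(score, 1)))

for first, second, score in similar_pairs:
    print(f"'{first}' is similar to '{second}' (similarity: {score}%)")
```

Comparing every pair grows quadratically with the number of unique values, so this approach suits short lists of categories rather than large free-text columns.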

### 2. Detecting Abnormal Numeric Values (Outliers)

Looking at the `age` column, we can see values that make no sense, such as 199 and -5. Let's use statistical methods to find these outliers.


In [None]:
# Display basic statistics
print("Age column statistics:")
print(dirty_data['age'].describe())

# Identify impossible values using domain knowledge
print("\nRows with impossible age values (< 0 or > 120):")
impossible_ages = dirty_data[(dirty_data['age'] < 0) | (dirty_data['age'] > 120)]
print(impossible_ages[['customer_id', 'name', 'age']])

#### Using the IQR (Interquartile Range) Method

The IQR method is a robust statistical technique for finding outliers, because it is not strongly affected by extremely high or extremely low values:


In [None]:
# Calculate IQR for age (excluding impossible values)
valid_ages = dirty_data[(dirty_data['age'] >= 0) & (dirty_data['age'] <= 120)]['age']

Q1 = valid_ages.quantile(0.25)
Q3 = valid_ages.quantile(0.75)
IQR = Q3 - Q1

# Define outlier bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

print(f"IQR-based outlier bounds for age: [{lower_bound:.2f}, {upper_bound:.2f}]")

# Identify outliers
age_outliers = dirty_data[(dirty_data['age'] < lower_bound) | (dirty_data['age'] > upper_bound)]
print("\nRows with age outliers:")
print(age_outliers[['customer_id', 'name', 'age']])
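
To make the arithmetic concrete, here is a small worked sketch on hypothetical ages that are all within the plausible 0 to 120 range. The helper name `iqr_bounds` and the sample values are illustrative assumptions; the quartiles use the default linear interpolation of `Series.quantile`.

```python
import pandas as pd

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) outlier bounds using the k * IQR rule."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical ages, all within the plausible 0-120 range
ages = pd.Series([25, 27, 28, 31, 32, 45, 110])

lower, upper = iqr_bounds(ages)
print(f"Bounds: [{lower}, {upper}]")  # Q1 = 27.5, Q3 = 38.5, IQR = 11.0

outliers = ages[(ages < lower) | (ages > upper)]
print(outliers.tolist())
```

Here an age of 110 is possible in principle, yet it falls outside the bounds of 11.0 and 55.0. The IQR rule marks values that are unusual for this sample, which is a different question from whether they are impossible.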

#### Using the Z-Score Method

The Z-score method identifies outliers by how many standard deviations a value lies from the mean:


In [None]:
try:
    from scipy import stats
except ImportError:
    print("scipy is required for Z-score calculation. Please install it with 'pip install scipy' and rerun this cell.")
else:
    # Calculate Z-scores for age, handling NaN values
    age_nonan = dirty_data['age'].dropna()
    zscores = np.abs(stats.zscore(age_nonan))
    dirty_data['age_zscore'] = np.nan
    dirty_data.loc[age_nonan.index, 'age_zscore'] = zscores

    # Typically, Z-score > 3 indicates an outlier
    print("Rows with age Z-score > 3:")
    zscore_outliers = dirty_data[dirty_data['age_zscore'] > 3]
    print(zscore_outliers[['customer_id', 'name', 'age', 'age_zscore']])

    # Clean up the temporary column
    dirty_data = dirty_data.drop('age_zscore', axis=1)
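
The same calculation can be sketched with pandas alone, which also makes it easier to see a limitation of Z-scores on small samples. The snippet reuses the twelve ages from the sample dataset and assumes the population standard deviation (`ddof=0`), which is the default used by `scipy.stats.zscore`.

```python
import numpy as np
import pandas as pd

# The ages from the sample dataset above
ages = pd.Series([25, 32, 25, 45, 28, 199, 25, 31, 45, 27, -5, 28])

# Population standard deviation (ddof=0), the default of scipy.stats.zscore
z_scores = (ages - ages.mean()) / ages.std(ddof=0)

flagged = ages[z_scores.abs() > 3]
print("Flagged by |Z| > 3:", flagged.tolist())

# With n values and ddof=0, no |Z| can exceed sqrt(n - 1)
max_possible = np.sqrt(len(ages) - 1)
print(f"Largest possible |Z| for n={len(ages)}: {max_possible:.2f}")
print(f"Z-score of the -5 entry: {z_scores[10]:.2f}")
```

Only 199 passes the threshold. The impossible value -5 is less than one standard deviation from the mean, partly because 199 itself inflates the standard deviation, so a domain rule such as "age must be between 0 and 120" is still needed alongside the statistical check.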

#### Handling Outliers

Once you have detected outliers, you can handle them in several ways:
1. **Remove**: Drop the rows that contain outliers (if they are errors)
2. **Cap**: Replace them with the boundary values
3. **Replace with NaN**: Treat them as missing data and use imputation methods
4. **Keep**: If they are legitimate extreme values


In [None]:
# Create a cleaned version by replacing impossible ages with NaN
dirty_data['age_clean'] = dirty_data['age'].where(dirty_data['age'].between(0, 120))

print("Age column before and after cleaning:")
print(dirty_data[['customer_id', 'name', 'age', 'age_clean']])
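
The cell above shows the "Replace with NaN" option. For comparison, the "Remove" and "Cap" options could be sketched as follows, using a short hypothetical series and the same 0 to 120 limits. Whether capping is appropriate depends on the data; for values that are clearly data-entry errors, capping would invent a plausible-looking age.

```python
import pandas as pd

# Hypothetical ages containing two impossible values
ages = pd.Series([25, 32, 45, 199, -5, 28])

# Remove: keep only the values inside the plausible range
removed = ages[ages.between(0, 120)]
print(removed.tolist())

# Cap: pull out-of-range values back to the nearest boundary
capped = ages.clip(lower=0, upper=120)
print(capped.tolist())
```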

### 3. Detecting Near-Duplicate Rows

Notice that our dataset contains several entries for "John Smith" with slightly different values. Let's try to identify potential duplicates based on name similarity.


In [None]:
# First, let's look at exact name matches (ignoring case and surrounding whitespace)
dirty_data['name_normalized'] = dirty_data['name'].str.strip().str.lower()

print("Checking for duplicate names:")
duplicate_names = dirty_data[dirty_data.duplicated(['name_normalized'], keep=False)]
print(duplicate_names.sort_values('name_normalized')[['customer_id', 'name', 'age', 'country']])

#### Finding Near-Duplicates with Fuzzy Matching

For more thorough duplicate detection, we can use fuzzy matching to find names that resemble each other:


In [None]:
try:
    from rapidfuzz import process, fuzz

    # Function to find potential duplicates
    def find_near_duplicates(df, column, threshold=90):
        """
        Find near-duplicate entries in a column using fuzzy matching.
        
        Parameters:
        - df: DataFrame
        - column: Column name to check for duplicates
        - threshold: Similarity threshold (0-100)
        
        Returns: List of potential duplicate groups
        """
        values = df[column].unique()
        duplicate_groups = []
        checked = set()
        
        for value in values:
            if value in checked:
                continue
                
            # Find similar values
            matches = process.extract(value, values, scorer=fuzz.ratio, limit=len(values))
            similar = [m[0] for m in matches if m[1] >= threshold]
            
            if len(similar) > 1:
                duplicate_groups.append(similar)
                checked.update(similar)
        
        return duplicate_groups

    # Find near-duplicate names
    duplicate_groups = find_near_duplicates(dirty_data, 'name', threshold=90)

    print("Potential duplicate groups:")
    for i, group in enumerate(duplicate_groups, 1):
        print(f"\nGroup {i}:")
        for name in group:
            matching_rows = dirty_data[dirty_data['name'] == name]
            print(f"  '{name}': {len(matching_rows)} occurrence(s)")
            for _, row in matching_rows.iterrows():
                print(f"    - Customer {row['customer_id']}: age={row['age']}, country={row['country']}")
except ImportError:
    print("rapidfuzz is not installed. Skipping fuzzy matching for near-duplicates.")

#### Handling Duplicates

Once you have identified duplicates, you need to decide how to handle them:
1. **Keep the first occurrence**: Use `drop_duplicates(keep='first')`
2. **Keep the last occurrence**: Use `drop_duplicates(keep='last')`
3. **Aggregate information**: Combine the information from the duplicate rows
4. **Manual review**: Flag them for a person to check


In [None]:
# Example: Remove duplicates based on normalized name, keeping first occurrence
cleaned_data = dirty_data.drop_duplicates(subset=['name_normalized'], keep='first')

print(f"Original dataset: {len(dirty_data)} rows")
print(f"After removing name duplicates: {len(cleaned_data)} rows")
print(f"Removed: {len(dirty_data) - len(cleaned_data)} duplicate rows")

print("\nCleaned dataset:")
print(cleaned_data[['customer_id', 'name', 'age', 'country_clean']])
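
The cell above demonstrates the first option. The "aggregate information" option could be sketched as below, assuming that for each column the first non-missing value within a group of duplicates is the one to keep. The `customers` frame is hypothetical; `GroupBy.first()` skips missing values by default, so gaps in one duplicate row are filled from the others.

```python
import numpy as np
import pandas as pd

# Hypothetical records: two rows describe the same person, each with a gap
customers = pd.DataFrame({
    'name': ['John Smith', 'john smith ', 'Jane Doe'],
    'age': [np.nan, 25, 32],
    'country': ['USA', None, 'UK'],
})

# Group on a normalized key so case and surrounding whitespace are ignored
customers['name_normalized'] = customers['name'].str.strip().str.lower()

# first() returns the first non-missing value of each column per group
merged = customers.groupby('name_normalized', as_index=False, sort=False).first()
print(merged)
```

When duplicate rows disagree on a non-missing value, this rule silently keeps the earlier one, so conflicting rows may still need the manual review option.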

### Summary: Complete Data Cleaning Pipeline

Let's put everything together into one complete cleaning pipeline:


In [None]:
def clean_dataset(df):
    """
    Comprehensive data cleaning function.
    """
    # Create a copy to avoid modifying the original
    cleaned = df.copy()
    
    # 1. Standardize categorical values (country)
    country_mapping = {
        'usa': 'USA', 'u.s.a': 'USA', 'united states': 'USA',
        'uk': 'UK', 'united kingdom': 'UK',
        'canada': 'Canada', 'mexico': 'Mexico'
    }
    cleaned['country'] = cleaned['country'].str.lower().map(country_mapping)
    
    # 2. Clean abnormal age values
    cleaned['age'] = cleaned['age'].where(cleaned['age'].between(0, 120))
    
    # 3. Remove duplicate names after trimming surrounding whitespace
    cleaned['name'] = cleaned['name'].str.strip()
    cleaned = cleaned.drop_duplicates(subset=['name'], keep='first')
    
    return cleaned

# Apply the cleaning pipeline
final_cleaned_data = clean_dataset(dirty_data)

print("Before cleaning:")
print(f"  Rows: {len(dirty_data)}")
print(f"  Unique countries: {dirty_data['country'].nunique()}")
print(f"  Invalid ages: {((dirty_data['age'] < 0) | (dirty_data['age'] > 120)).sum()}")

print("\nAfter cleaning:")
print(f"  Rows: {len(final_cleaned_data)}")
print(f"  Unique countries: {final_cleaned_data['country'].nunique()}")
print(f"  Invalid ages: {((final_cleaned_data['age'] < 0) | (final_cleaned_data['age'] > 120)).sum()}")

print("\nCleaned dataset:")
print(final_cleaned_data[['customer_id', 'name', 'age', 'country', 'purchase_amount']])

### 🎯 Challenge Exercise

Now it's your turn! The new row below has several data quality issues. Can you:

1. Identify all the issues in this row
2. Write code to fix each issue
3. Add the cleaned row to the dataset

Here is the problematic data:


In [None]:
# New problematic row
new_row = pd.DataFrame({
    'customer_id': [13],
    'name': ['  Diana  Prince  '],  # Extra whitespace
    'age': [250],  # Impossible age
    'country': ['U.S.A.'],  # Inconsistent format
    'purchase_amount': [150.00]
})

print("New row to clean:")
print(new_row)

# TODO: Your code here to clean this row
# Hints:
# 1. Strip whitespace from the name (and collapse the doubled space inside it)
# 2. Check if the name is a duplicate (Diana Prince already exists)
# 3. Handle the impossible age value
# 4. Standardize the country name

# Example solution (uncomment and modify as needed):
# new_row_cleaned = new_row.copy()
# new_row_cleaned['name'] = new_row_cleaned['name'].str.strip().str.replace(r'\s+', ' ', regex=True)
# new_row_cleaned['age'] = np.nan  # Invalid age
# new_row_cleaned['country'] = 'USA'  # Standardized
# print("\nCleaned row:")
# print(new_row_cleaned)

### Key Takeaways

1. **Inconsistent categories** are common in real-world data. Always check the unique values and use mappings or fuzzy matching to standardize them.

2. **Outliers** can strongly affect your analysis. Combine domain knowledge with statistical methods such as IQR or Z-score to find them.

3. **Near-duplicates** are harder to find than exact duplicates. Use fuzzy matching and normalize the data (for example, lowercase the text and strip whitespace) to detect them.

4. **Data cleaning is an iterative process**. You may need to apply several techniques and review the results carefully before you finalize your cleaned dataset.

5. **Document your decisions**. Keep a record of the cleaning steps you applied and why you applied them; this matters for reproducibility and transparency.

> **Best Practice:** Always keep a copy of your original "dirty" data. Never overwrite your source data files; create cleaned versions with clear names such as `data_cleaned.csv`.


---

<!-- CO-OP TRANSLATOR DISCLAIMER START -->
**Disclaimer**:  
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please be aware that machine translation may contain errors or inaccuracies. The original document in its native language should be considered the authoritative source. For critical information, professional human translation is recommended. We are not liable for any misunderstandings or misinterpretations arising from the use of this translation.
<!-- CO-OP TRANSLATOR DISCLAIMER END -->
