# Pandas: Data Cleaning

In [None]:
from util import np, pd

## 1. Methodology

### 1.1. Data quality
[Clean data](https://en.wikipedia.org/wiki/Data_cleansing) must satisfy a set of quality criteria: logical rules or constraints derived from business knowledge. These constraints fall into the following categories:
- Data-type constraints: Each column must be of a particular data type such as numeric, date or text.
- Accuracy: Data Scientists have to verify that the data is close to the true values, sometimes by using external sources.
- Range constraints: Typically, numbers or dates should fall within a certain range.
- Set-membership constraints: Values of a column must come from a pre-defined set.
- Pattern constraints: Certain text fields have to match regular expression patterns.
- Cross-field validation: For example, in a dataset of sales contracts, the delivery date cannot be earlier than the signature date.
- Uniqueness: A field or a combination of fields must be unique across the dataset. For example, two customers cannot have the same ID.
- Consistency: The same entity must not carry conflicting values across sources. For example, a customer recorded in two different tables with two different addresses violates consistency.
- Completeness: Certain columns cannot be empty.
- Uniformity: Each field can only have one unit of measure such as kg or lb, USD or EUR.
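
Several of these constraints can be checked mechanically. The sketch below is only an illustration: the `contracts` table, its column names and the allowed currency set are invented for this example and do not come from the datasets used later in this notebook.

```python
import pandas as pd

# Hypothetical sales-contract data, invented for illustration only.
contracts = pd.DataFrame({
    'contract_id': [1, 2, 2, 4],
    'signed': pd.to_datetime(['2020-01-05', '2020-02-10', '2020-02-10', '2020-03-01']),
    'delivered': pd.to_datetime(['2020-01-20', '2020-02-01', '2020-02-25', '2020-03-15']),
    'currency': ['USD', 'EUR', 'usd', 'USD'],
    'amount': [1200.0, -50.0, 300.0, None],
})

# Each check yields a boolean Series marking the rows that VIOLATE a rule.
violations = pd.DataFrame({
    'range': contracts['amount'] < 0,
    'set_membership': ~contracts['currency'].isin(['USD', 'EUR']),
    'cross_field': contracts['delivered'] < contracts['signed'],
    'uniqueness': contracts['contract_id'].duplicated(keep=False),
    'completeness': contracts['amount'].isna(),
})

# Number of violating rows per rule.
violation_counts = violations.sum()
print(violation_counts)
```

Keeping one boolean column per rule makes it easy to see which rows break which constraint before deciding how to fix them.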

### 1.2. The workflow

- Inspecting. The inspection can be done in the data exploration step. Here are the two most important methods to inspect the dataset:
    - Data profiling: Calculating summary statistics gives a general idea of the quality of the data. Questions to answer include *How many values are missing?*, *Does this field have a constraint with another field?* and *Which data type should this column have?*.
    - Data visualization: Visualization, especially when combined with statistical methods, helps answer *How is the data distributed?* and *Which points are outliers?*.
- Cleaning. In this step, all the criteria mentioned above are taken into account. Overall, incorrect data will be either removed, corrected or imputed.
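
As a rough sketch of the inspecting step, the snippet below profiles a small table and flags outliers. The data is invented for illustration, and the 1.5 × IQR rule is just one common convention for outliers, not the only option.

```python
import numpy as np
import pandas as pd

# Hypothetical data, invented for illustration only.
df = pd.DataFrame({
    'age': [25, 31, np.nan, 28, 95],
    'city': ['Hanoi', 'Hue', 'Hanoi', None, 'Hue'],
})

# One row per column: data type, share of missing values, distinct values.
profile = pd.DataFrame({
    'dtype': df.dtypes.astype(str),
    'missing_ratio': df.isna().mean(),
    'n_unique': df.nunique(),
})

# Flag outliers with the 1.5 * IQR rule (missing values are never flagged).
q1, q3 = df['age'].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (df['age'] < q1 - 1.5 * iqr) | (df['age'] > q3 + 1.5 * iqr)

print(profile)
print(df[is_outlier])
```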

## 2. Basic data cleaning

In [None]:
from util import np, pd

### 2.1. Common techniques

#### Selecting columns
There are two approaches: keep only the necessary columns, or drop the unnecessary ones.

In [1]:
from util._data import df_aqua
df_aqua

Unnamed: 0,Year,Month name,Month number,Product Name,PROFIT,Company Name
0,2020,Jan,1,Fish,7415,Pandas
1,2020,Jan,1,Shrimp,3239,Pandas
2,2020,Jun,6,Fish,7280,Pandas
3,2020,Jun,6,Shrimp,2007,Pandas
4,2020,Jul,7,Fish,3574,Pandas
5,2020,Jul,7,Shrimp,9285,Pandas


In [2]:
df_aqua[['Year', 'Month number', 'Product Name', 'PROFIT']]

Unnamed: 0,Year,Month number,Product Name,PROFIT
0,2020,1,Fish,7415
1,2020,1,Shrimp,3239
2,2020,6,Fish,7280
3,2020,6,Shrimp,2007
4,2020,7,Fish,3574
5,2020,7,Shrimp,9285


In [4]:
df_aqua.drop(columns=['Month name', 'Company Name'])

Unnamed: 0,Year,Month number,Product Name,PROFIT
0,2020,1,Fish,7415
1,2020,1,Shrimp,3239
2,2020,6,Fish,7280
3,2020,6,Shrimp,2007
4,2020,7,Fish,3574
5,2020,7,Shrimp,9285


#### Renaming columns
Column names should follow one consistent convention: `PascalCase`, `camelCase` or `snake_case`. In practice, `snake_case` is the most common choice.

In [5]:
from util._data import df_aqua
df_aqua

Unnamed: 0,Year,Month name,Month number,Product Name,PROFIT,Company Name
0,2020,Jan,1,Fish,7415,Pandas
1,2020,Jan,1,Shrimp,3239,Pandas
2,2020,Jun,6,Fish,7280,Pandas
3,2020,Jun,6,Shrimp,2007,Pandas
4,2020,Jul,7,Fish,3574,Pandas
5,2020,Jul,7,Shrimp,9285,Pandas


In [9]:
# PascalCase
df_pascal = df_aqua.copy()
df_pascal.columns = df_aqua.columns.str.title().str.split().str.join('')

# snake_case
df_snake = df_aqua.copy()
df_snake.columns = df_aqua.columns.str.lower().str.split().str.join('_')

df_snake

Unnamed: 0,year,month_name,month_number,product_name,profit,company_name
0,2020,Jan,1,Fish,7415,Pandas
1,2020,Jan,1,Shrimp,3239,Pandas
2,2020,Jun,6,Fish,7280,Pandas
3,2020,Jun,6,Shrimp,2007,Pandas
4,2020,Jul,7,Fish,3574,Pandas
5,2020,Jul,7,Shrimp,9285,Pandas


In [11]:
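# clean_names() is not a built-in pandas method; importing pyjanitor registers it on DataFrame
import janitor
df_aqua.clean_names()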

Unnamed: 0,year,month_name,month_number,product_name,profit,company_name
0,2020,Jan,1,Fish,7415,Pandas
1,2020,Jan,1,Shrimp,3239,Pandas
2,2020,Jun,6,Fish,7280,Pandas
3,2020,Jun,6,Shrimp,2007,Pandas
4,2020,Jul,7,Fish,3574,Pandas
5,2020,Jul,7,Shrimp,9285,Pandas


The `rename()` method allows renaming specific columns.

In [14]:
df_aqua.rename(columns={
    'Product Name': 'commodity',
    'Company Name': 'company'
})

Unnamed: 0,Year,Month name,Month number,commodity,PROFIT,company
0,2020,Jan,1,Fish,7415,Pandas
1,2020,Jan,1,Shrimp,3239,Pandas
2,2020,Jun,6,Fish,7280,Pandas
3,2020,Jun,6,Shrimp,2007,Pandas
4,2020,Jul,7,Fish,3574,Pandas
5,2020,Jul,7,Shrimp,9285,Pandas
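
`rename()` also accepts a function, which is applied to every column name. The sketch below uses a small helper, `to_snake_case`, which is a hypothetical name introduced here for illustration; the column names mirror the ones shown above.

```python
import pandas as pd

# Empty frame with the same style of column names as the example above.
df = pd.DataFrame(columns=['Year', 'Month name', 'Product Name', 'PROFIT'])

def to_snake_case(name):
    # Lowercase, then join whitespace-separated words with underscores.
    return '_'.join(name.lower().split())

renamed = df.rename(columns=to_snake_case)
print(list(renamed.columns))
```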


#### Correcting data types

In [1]:
from util._data import df_athlete
df_athlete

Unnamed: 0,year,date,time,medal,name,left_handed
0,2019.0,20191103,145509,Gold,Wayne,1
1,2019.0,20190812,135433,Bronze,Robert,0
2,2020.0,20200125,214412,Silver,Ashley,0
3,2020.0,20200129,124254,Bronze,Jamie,0
4,2020.0,20200412,123349,Silver,Jessie,1
5,2020.0,20200220,233517,Silver,Sergio,0


In [2]:
df_athlete.dtypes

year           float64
date            object
time            object
medal           object
name            object
left_handed      int64
dtype: object

Simple data types (string or numeric) can easily be corrected using the `astype()` method.

In [3]:
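# astype() returns a new DataFrame, so assign the result to keep the change
df_athlete = df_athlete.astype({
    'year': int,
    'left_handed': bool
})
df_athlete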

Unnamed: 0,year,date,time,medal,name,left_handed
0,2019,20191103,145509,Gold,Wayne,True
1,2019,20190812,135433,Bronze,Robert,False
2,2020,20200125,214412,Silver,Ashley,False
3,2020,20200129,124254,Bronze,Jamie,False
4,2020,20200412,123349,Silver,Jessie,True
5,2020,20200220,233517,Silver,Sergio,False


For more complex data types such as dates or categoricals, use the dedicated function: `pd.to_datetime()` or `pd.Categorical()`.

In [12]:
pd.to_datetime(df_athlete['date'], format='%Y%m%d')

0   2019-11-03
1   2019-08-12
2   2020-01-25
3   2020-01-29
4   2020-04-12
5   2020-02-20
Name: date, dtype: datetime64[ns]

In [13]:
pd.to_datetime(df_athlete['date'] + ' ' + df_athlete['time'], format='%Y%m%d %H%M%S')

0   2019-11-03 14:55:09
1   2019-08-12 13:54:33
2   2020-01-25 21:44:12
3   2020-01-29 12:42:54
4   2020-04-12 12:33:49
5   2020-02-20 23:35:17
dtype: datetime64[ns]

In [14]:
pd.Categorical(df_athlete['medal'], categories=['Bronze', 'Silver', 'Gold'])

['Gold', 'Bronze', 'Silver', 'Bronze', 'Silver', 'Silver']
Categories (3, object): ['Bronze', 'Silver', 'Gold']

In [15]:
df_athlete['date'] = pd.to_datetime(df_athlete['date'], format='%Y%m%d')
df_athlete['medal'] = pd.Categorical(df_athlete['medal'], categories=['Bronze', 'Silver', 'Gold'])
df_athlete

Unnamed: 0,year,date,time,medal,name,left_handed
0,2019,2019-11-03,145509,Gold,Wayne,True
1,2019,2019-08-12,135433,Bronze,Robert,False
2,2020,2020-01-25,214412,Silver,Ashley,False
3,2020,2020-01-29,124254,Bronze,Jamie,False
4,2020,2020-04-12,123349,Silver,Jessie,True
5,2020,2020-02-20,233517,Silver,Sergio,False


In [16]:
df_athlete.sort_values(by='medal')

Unnamed: 0,year,date,time,medal,name,left_handed
1,2019,2019-08-12,135433,Bronze,Robert,False
3,2020,2020-01-29,124254,Bronze,Jamie,False
2,2020,2020-01-25,214412,Silver,Ashley,False
4,2020,2020-04-12,123349,Silver,Jessie,True
5,2020,2020-02-20,233517,Silver,Sergio,False
0,2019,2019-11-03,145509,Gold,Wayne,True


#### Filtering

In [1]:
from util._data import df_job
df_job

Unnamed: 0,worker,age,job,years_on_job
0,Wayne,8,Student,0
1,Robert,37,Data Scientist,12
2,Ashley,25,DATA ANALYST,2
3,Jamie,26,data engineer,6
4,Jessie,80,Retired,0
5,Sergio,30,Business Intelligence,18
6,Harry,20,Student,12
7,Johnny,31,Data Analyst,2
8,Aaron,28,AI Engineer,8


In the dataset above, consider only people who are of legal working age (15 to 60) and work in the data industry. Note that `age` minus `years_on_job` (the age at which the worker started working) cannot be smaller than 15.

In [18]:
df_job[
    (df_job.job.str.lower().str.contains('data')) &
    (df_job['age'] >= 15) &
    (df_job['age'] <= 60) &
    (df_job['age'] - df_job['years_on_job'] >= 15)
]

Unnamed: 0,worker,age,job,years_on_job
1,Robert,37,Data Scientist,12
2,Ashley,25,DATA ANALYST,2
3,Jamie,26,data engineer,6
7,Johnny,31,Data Analyst,2


### 2.2. Text cleaning

#### Trimming
Stray spaces and newline characters often appear in text columns, typically as a result of manual data entry.

In [1]:
from util._data import df_trade
df_trade

Unnamed: 0,year,country,export,import
0,2017,United\nKingdom,5466,1546
1,2018,United\nKingdom,8558,3546
2,2019,United Kingdom,8435,2007
3,2020,United Kingdom\n,8435,3574


In [2]:
df_trade['country'].unique()

array(['United\nKingdom  ', '  United\nKingdom', 'United    Kingdom',
       ' United Kingdom\n'], dtype=object)

In [22]:
df_trade['country'].str.split().str.join(' ')

0    United Kingdom
1    United Kingdom
2    United Kingdom
3    United Kingdom
Name: country, dtype: object

In [23]:
df_trade['country'] = df_trade['country'].str.split().str.join(' ')
df_trade['country'].unique()

array(['United Kingdom'], dtype=object)
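
An alternative to `split()` followed by `join()` is a regular expression that collapses every run of whitespace into a single space. The sketch below rebuilds the four raw values shown by `unique()` above.

```python
import pandas as pd

country = pd.Series([
    'United\nKingdom  ',
    '  United\nKingdom',
    'United    Kingdom',
    ' United Kingdom\n',
])

# \s+ matches any run of spaces, tabs or newlines; strip() trims both ends.
cleaned = country.str.replace(r'\s+', ' ', regex=True).str.strip()
print(cleaned.unique())
```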

#### Standardization
The approach is to translate different naming conventions, abbreviations or formats into one unique value.

In [1]:
from util._data import df_shrimp
df_shrimp

Unnamed: 0,date,commodity,price,unit
0,2020-01-01,"Shrimp, frozen, chem free",10,usd/kg
1,2020-01-02,"Shrimp, frz, chemical-free",13,USD/KG
2,2020-01-03,"Prawn, frz, chemical-free",14,USD/kg


In [25]:
df_shrimp['commodity'] = df_shrimp['commodity'].str.replace('Prawn', 'Shrimp')
df_shrimp['commodity'] = df_shrimp['commodity'].str.replace('frz', 'frozen')
df_shrimp['commodity'] = df_shrimp['commodity'].str.replace('chem free', 'chemical-free')
df_shrimp['unit'] = df_shrimp['unit'].str.replace('usd', 'USD')
df_shrimp['unit'] = df_shrimp['unit'].str.replace('KG', 'kg')

In [26]:
df_shrimp

Unnamed: 0,date,commodity,price,unit
0,2020-01-01,"Shrimp, frozen, chemical-free",10,USD/kg
1,2020-01-02,"Shrimp, frozen, chemical-free",13,USD/kg
2,2020-01-03,"Shrimp, frozen, chemical-free",14,USD/kg
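
When the list of replacements grows, keeping them in one dictionary can be easier to maintain than one line per rule. A minimal sketch, where the `replacements` mapping is a name introduced for this example:

```python
import pandas as pd

commodity = pd.Series([
    'Shrimp, frozen, chem free',
    'Shrimp, frz, chemical-free',
    'Prawn, frz, chemical-free',
])

# Variant spelling -> canonical spelling.
replacements = {'Prawn': 'Shrimp', 'frz': 'frozen', 'chem free': 'chemical-free'}

standardized = commodity
for old, new in replacements.items():
    # regex=False treats the pattern as a literal string.
    standardized = standardized.str.replace(old, new, regex=False)

print(standardized.unique())
```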


#### Padding numbers

In [27]:
from util._data import df_info
df_info

Unnamed: 0,customer_id,phone,name,information
0,3,363334444,Jack,England Male
1,423,913334444,James,Colombia Male
2,5464,123334444,Gabriel,France Female


In [28]:
df_info = df_info.astype(str)
df_info.dtypes

customer_id    object
phone          object
name           object
information    object
dtype: object

In [29]:
df_info['customer_id'] = df_info['customer_id'].str.pad(width=4, fillchar='0')
df_info['phone'] = df_info['phone'].str.pad(width=10, fillchar='0')

In [30]:
df_info

Unnamed: 0,customer_id,phone,name,information
0,0003,0363334444,Jack,England Male
1,0423,0913334444,James,Colombia Male
2,5464,0123334444,Gabriel,France Female
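
For zero-padding specifically, `str.zfill()` is a shorter equivalent of `str.pad()` with `fillchar='0'`. A small sketch on a standalone Series:

```python
import pandas as pd

customer_id = pd.Series([3, 423, 5464])

# Convert to string first; zfill() left-pads with zeros up to the given width.
padded = customer_id.astype(str).str.zfill(4)

# The same result with pad(), which pads on the left by default.
padded_with_pad = customer_id.astype(str).str.pad(width=4, side='left', fillchar='0')

print(padded.tolist())
```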


#### Splitting a column

In [None]:
from util._data import df_info
df_info

Unnamed: 0,customer_id,phone,name,information
0,3,363334444,Jack,England Male
1,423,913334444,James,Colombia Male
2,5464,123334444,Gabriel,France Female


In [32]:
df_info['information'].str.split()

0     [England, Male]
1    [Colombia, Male]
2    [France, Female]
Name: information, dtype: object

In [33]:
# unpacking
df_info['nationality'] = df_info['information'].str.split().str[0]
df_info['gender'] = df_info['information'].str.split().str[1]

df_info.drop(columns=['information'])

Unnamed: 0,customer_id,phone,name,nationality,gender
0,3,363334444,Jack,England,Male
1,423,913334444,James,Colombia,Male
2,5464,123334444,Gabriel,France,Female
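
Splitting on every space breaks down when the first part contains a space itself. A possible variant is to split once from the right with `expand=True`, which returns the parts as separate columns. In the sketch below the fourth row ("Mia", "New Zealand Female") is invented to show that case.

```python
import pandas as pd

df_info = pd.DataFrame({
    'name': ['Jack', 'James', 'Gabriel', 'Mia'],  # 'Mia' is an invented row
    'information': ['England Male', 'Colombia Male', 'France Female',
                    'New Zealand Female'],
})

# Split once from the right so a multi-word nationality stays in one piece.
parts = df_info['information'].str.rsplit(n=1, expand=True)
parts.columns = ['nationality', 'gender']

result = pd.concat([df_info.drop(columns=['information']), parts], axis=1)
print(result)
```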


#### Concatenating columns

In [1]:
from util._data import df_football
df_football

Unnamed: 0,first_name,last_name,position
0,Wayne,Rooney,Second Striker
1,Cristiano,Ronaldo,Left Winger
2,Lionel,Messi,Right Winger


In [2]:
df_football['first_name'] + ' ' + df_football['last_name']

0         Wayne Rooney
1    Cristiano Ronaldo
2         Lionel Messi
dtype: object
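
If one of the columns may contain missing values, `str.cat()` with `na_rep` gives control over how they are handled. In this sketch the third row, with a missing last name, is invented for illustration.

```python
import pandas as pd

df_football = pd.DataFrame({
    'first_name': ['Wayne', 'Cristiano', 'Neymar'],
    'last_name': ['Rooney', 'Ronaldo', None],  # invented row with a missing last name
})

# na_rep='' replaces missing parts with an empty string; strip() removes
# the trailing separator left behind in that case.
full_name = (
    df_football['first_name']
    .str.cat(df_football['last_name'], sep=' ', na_rep='')
    .str.strip()
)
print(full_name.tolist())
```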

## 3. Low quality data

### 3.1. Handling missing data
Missing data arises through three mechanisms, illustrated in the example below. The table records the IQ scores of 9 people of different ages. Each of the last three columns shows the same scores with 3 values missing under one mechanism.

|Age   |IQ     |MCAR|MAR|MNAR|
|:----:|:-----:|:--:|:-:|:--:|
|**20**|**120**|120 |   |120 |
|**22**|**112**|    |   |112 |
|**24**|**127**|127 |   |127 |
|**29**|**97** |    |97 |    |
|**30**|**103**|103 |103|103 |
|**40**|**95** |95  |95 |    |
|**45**|**141**|    |141|141 |
|**47**|**92** |92  |92 |    |
|**52**|**115**|115 |115|115 |

- MCAR (Missing Completely At Random): The missingness is unrelated to any variable, observed or unobserved. This type of missingness does not introduce bias, so *deletion* and *imputation* are both suitable solutions.
- MAR (Missing At Random): The missingness in one feature depends on another observed feature. In the example above, IQ scores are missing for people under 25 years old. Deleting these records causes bias, which makes *imputation* the better choice.
- MNAR (Missing Not At Random): The missingness depends on the missing value itself. Here, assume people with an IQ score of 100 or less tend to refuse to answer the survey. The missing values cannot be inferred from the collected data alone. Both *deletion* and *imputation* bias the data, and Data Scientists may not even realize that they are facing an MNAR case.
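
The three columns of the table can be reproduced in code to see how each mechanism shifts the observed mean. This is only a sketch on the 9 values above: the MCAR positions are copied from the table rather than drawn at random, and with so few rows the numbers are illustrative, not statistical evidence.

```python
import pandas as pd

iq = pd.DataFrame({
    'age': [20, 22, 24, 29, 30, 40, 45, 47, 52],
    'iq': [120, 112, 127, 97, 103, 95, 141, 92, 115],
})

# MCAR: the missing positions are unrelated to age or IQ (copied from the table).
mcar = iq['iq'].where(~iq['age'].isin([22, 29, 45]))
# MAR: missingness depends on another observed column (age under 25).
mar = iq['iq'].where(iq['age'] >= 25)
# MNAR: missingness depends on the value itself (IQ of 100 or less).
mnar = iq['iq'].where(iq['iq'] > 100)

# mean() skips missing values, so these are means of the observed scores only.
true_mean = iq['iq'].mean()
mnar_mean = mnar.mean()

print(round(true_mean, 2), round(mcar.mean(), 2), round(mar.mean(), 2), round(mnar_mean, 2))
```

Under MNAR every low score is hidden, so the observed mean is necessarily higher than the true one, and nothing in the collected data reveals that.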

In [None]:
from util import np, pd
from util._data import df_covid_missing, df_report

In [3]:
df_covid_missing

Unnamed: 0,country,cases,deaths,recovered,area
0,USA,4169991,,1979617.0,North America
1,Brazil,2289951,84207.0,,South America
2,India,1288130,30645.0,,Asia
3,Russia,795038,12892.0,580330.0,
4,South Africa,408052,,,Africa
5,Peru,371096,17645.0,,South America
6,Mexico,370712,41908.0,236209.0,North America
7,Chile,338759,8838.0,,South America
8,Iran,284034,15074.0,247230.0,
9,Italy,245338,35029.0,,Europe


#### Columns removal
As a rule of thumb, a column with more than 50% missing values can be dropped. The threshold is a judgment call and depends on how important the column is.

In [4]:
df_covid_missing.isna().mean()

country      0.0
cases        0.0
deaths       0.2
recovered    0.6
area         0.2
dtype: float64

In [7]:
df_covid_missing.drop(columns='recovered')

Unnamed: 0,country,cases,deaths,area
0,USA,4169991,,North America
1,Brazil,2289951,84207.0,South America
2,India,1288130,30645.0,Asia
3,Russia,795038,12892.0,
4,South Africa,408052,,Africa
5,Peru,371096,17645.0,South America
6,Mexico,370712,41908.0,North America
7,Chile,338759,8838.0,South America
8,Iran,284034,15074.0,
9,Italy,245338,35029.0,Europe
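
Instead of naming the column by hand, the columns to drop can be derived from the missing ratios. A minimal sketch, where the 0.5 threshold mirrors the rule of thumb above and the table is rebuilt inline from the values shown:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'country': ['USA', 'Brazil', 'India', 'Russia', 'South Africa',
                'Peru', 'Mexico', 'Chile', 'Iran', 'Italy'],
    'deaths': [np.nan, 84207, 30645, 12892, np.nan,
               17645, 41908, 8838, 15074, 35029],
    'recovered': [1979617, np.nan, np.nan, 580330, np.nan,
                  np.nan, 236209, np.nan, 247230, np.nan],
})

threshold = 0.5  # assumed cut-off, adjust to the use case
missing_ratio = df.isna().mean()
to_drop = missing_ratio[missing_ratio > threshold].index

df_reduced = df.drop(columns=to_drop)
print(list(to_drop))
```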


#### Rows removal

In [10]:
df_covid_missing.dropna(subset=['deaths'])

Unnamed: 0,country,cases,deaths,recovered,area
1,Brazil,2289951,84207.0,,South America
2,India,1288130,30645.0,,Asia
3,Russia,795038,12892.0,580330.0,
5,Peru,371096,17645.0,,South America
6,Mexico,370712,41908.0,236209.0,North America
7,Chile,338759,8838.0,,South America
8,Iran,284034,15074.0,247230.0,
9,Italy,245338,35029.0,,Europe


#### Imputation
Common values used to fill missing data are the mean, the median, the mode and zero.

In [4]:
from sklearn.impute import KNNImputer
from feature_engine.imputation import MeanMedianImputer
from feature_engine.wrappers import SklearnTransformerWrapper

In [5]:
df_covid_missing.fillna({'deaths' : 0, 'recovered': 0, 'area': 'MISSING'})

Unnamed: 0,country,cases,deaths,recovered,area
0,USA,4169991,0.0,1979617.0,North America
1,Brazil,2289951,84207.0,0.0,South America
2,India,1288130,30645.0,0.0,Asia
3,Russia,795038,12892.0,580330.0,MISSING
4,South Africa,408052,0.0,0.0,Africa
5,Peru,371096,17645.0,0.0,South America
6,Mexico,370712,41908.0,236209.0,North America
7,Chile,338759,8838.0,0.0,South America
8,Iran,284034,15074.0,247230.0,MISSING
9,Italy,245338,35029.0,0.0,Europe


In [6]:
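# 'median' is the default imputation method; it is stated explicitly for clarity
imputer = MeanMedianImputer(imputation_method='median', variables=['cases', 'deaths', 'recovered'])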
imputer.fit_transform(df_covid_missing)

Unnamed: 0,country,cases,deaths,recovered,area
0,USA,4169991,24145.0,1979617.0,North America
1,Brazil,2289951,84207.0,413780.0,South America
2,India,1288130,30645.0,413780.0,Asia
3,Russia,795038,12892.0,580330.0,
4,South Africa,408052,24145.0,413780.0,Africa
5,Peru,371096,17645.0,413780.0,South America
6,Mexico,370712,41908.0,236209.0,North America
7,Chile,338759,8838.0,413780.0,South America
8,Iran,284034,15074.0,247230.0,
9,Italy,245338,35029.0,413780.0,Europe
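
The same median imputation can be sketched with pandas alone, which may be convenient when `feature_engine` is not installed. Only the two numeric columns with gaps are rebuilt here, from the values shown above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'deaths': [np.nan, 84207, 30645, 12892, np.nan,
               17645, 41908, 8838, 15074, 35029],
    'recovered': [1979617, np.nan, np.nan, 580330, np.nan,
                  np.nan, 236209, np.nan, 247230, np.nan],
})

# median() skips missing values; fillna() with a Series fills column by column.
medians = df.median()
imputed = df.fillna(medians)

print(medians)
```

Unlike a fitted imputer, this version does not remember the medians for later reuse on new data unless `medians` is stored explicitly.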


### 3.2. Handling duplicated values
Duplicated values are records that violate a unique constraint, defined on a single column or on a combination of columns. When duplicates occur, at most one of them holds the true value.

Depending on the context, there are several strategies to handle duplicated values:
- List and sort all duplicated values, then manually remove incorrect records.
- Remove duplicated values based on specific criteria, such as keeping the greatest value only.
- Calculate a value such as sum or mean representing all duplicated records.

In [6]:
from util import np, pd
from util._data import df_report

In [7]:
df_report

Unnamed: 0,year,company,sales,profit
0,2019,Pandas,5466,1546
1,2019,Numpy,8558,3546
2,2020,Pandas,8435,3574
3,2020,Numpy,7280,3352
4,2020,Numpy,9285,4678
5,2020,Pandas,6650,2007


In this example, the combination of `year` and `company` creates a unique constraint: in each year, a company cannot have two values of sales and profit.

In [57]:
subset = ['year', 'company']

#### Manual removal

In [58]:
df_report[df_report.duplicated(subset, keep=False)].sort_values(subset)

Unnamed: 0,year,company,sales,profit
3,2020,Numpy,7280,3352
4,2020,Numpy,9285,4678
2,2020,Pandas,8435,3574
5,2020,Pandas,6650,2007


In [59]:
df_report.drop(index=[4, 2])

Unnamed: 0,year,company,sales,profit
0,2019,Pandas,5466,1546
1,2019,Numpy,8558,3546
3,2020,Numpy,7280,3352
5,2020,Pandas,6650,2007


#### Conditional removal

In [60]:
# sort ascending by sales, then keep the last (largest) record of each group
(
    df_report
    .sort_values(by='sales')
    .drop_duplicates(subset=subset, keep='last')
)

Unnamed: 0,year,company,sales,profit
1,2019,Numpy,8558,3546
0,2019,Pandas,5466,1546
4,2020,Numpy,9285,4678
2,2020,Pandas,8435,3574
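
Another way to express "keep the row with the largest sales per group" is `groupby()` with `idxmax()`, which returns the index label of the maximum in each group. This is a sketch of an alternative, not the method used above; the table is rebuilt inline from the values shown.

```python
import pandas as pd

df_report = pd.DataFrame({
    'year': [2019, 2019, 2020, 2020, 2020, 2020],
    'company': ['Pandas', 'Numpy', 'Pandas', 'Numpy', 'Numpy', 'Pandas'],
    'sales': [5466, 8558, 8435, 7280, 9285, 6650],
    'profit': [1546, 3546, 3574, 3352, 4678, 2007],
})

# Index label of the largest sales value within each (year, company) group.
best_index = df_report.groupby(['year', 'company'])['sales'].idxmax()
best_rows = df_report.loc[best_index]
print(best_rows)
```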


#### Aggregating

In [61]:
df_report.groupby(by=subset).sum().reset_index()

Unnamed: 0,year,company,sales,profit
0,2019,Numpy,8558,3546
1,2019,Pandas,5466,1546
2,2020,Numpy,16565,8030
3,2020,Pandas,15085,5581
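
When aggregating, it can help to record how many rows were merged into each result. A sketch using named aggregation, where the `n_records` column is a name introduced for this example:

```python
import pandas as pd

df_report = pd.DataFrame({
    'year': [2019, 2019, 2020, 2020, 2020, 2020],
    'company': ['Pandas', 'Numpy', 'Pandas', 'Numpy', 'Numpy', 'Pandas'],
    'sales': [5466, 8558, 8435, 7280, 9285, 6650],
    'profit': [1546, 3546, 3574, 3352, 4678, 2007],
})

summary = (
    df_report
    .groupby(['year', 'company'], as_index=False)
    .agg(
        sales=('sales', 'sum'),
        profit=('profit', 'sum'),
        n_records=('sales', 'size'),  # number of rows merged into each group
    )
)
print(summary)
```

Groups with `n_records` greater than 1 are the ones where duplicates were combined, which keeps the aggregation auditable.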
