# Pandas: Data Cleaning

In [1]:
import numpy as np
import pandas as pd

## 1. Methodology

### 1.1. Data quality
[Clean data](https://en.wikipedia.org/wiki/Data_cleansing) must satisfy a set of quality criteria: logical rules or constraints based on business knowledge. These constraints fall into the following categories:
- Data-type constraints: Each column must be of a particular data type such as numeric, date or text.
- Accuracy: Data Scientists have to verify that the data is close to the true values, sometimes by using external sources.
- Range constraints: Typically, numbers or dates should fall within a certain range.
- Set-membership constraints: Values of a column must come from a pre-defined set.
- Pattern constraints: Certain text fields have to match regular expression patterns.
- Cross-field validation: For example, in a dataset of sales contracts, the delivery date cannot be earlier than the signature date.
- Uniqueness: A field or a combination of fields must be unique across the dataset. For example, two customers cannot have the same ID.
- Consistency: The same entity must not carry contradictory values. For example, a customer recorded in two different tables with two different addresses is inconsistent.
- Completeness: Certain columns cannot be empty.
- Uniformity: Each field can only have one unit of measure such as kg or lb, USD or EUR.
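
Several of these constraints can be expressed as boolean checks in pandas. The snippet below is a minimal sketch, not a complete validation framework: the `contracts` table, its column names, the allowed currencies and the `C` + three digits ID pattern are all hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical sales-contract table; every name and value here is made up.
contracts = pd.DataFrame({
    'customer_id': ['C001', 'C002', 'C002', 'C4'],
    'currency': ['USD', 'EUR', 'USD', 'GBP'],
    'amount': [120.0, -5.0, 80.0, 40.0],
    'signed': pd.to_datetime(['2020-01-05', '2020-02-01', '2020-03-10', '2020-04-01']),
    'delivered': pd.to_datetime(['2020-01-20', '2020-01-15', '2020-03-12', '2020-04-03']),
})

# Each check is a boolean Series that is True where the row violates the rule.
violations = pd.DataFrame({
    # Range constraint: amounts must lie in an assumed valid interval.
    'range': ~contracts['amount'].between(0, 1_000_000),
    # Set-membership constraint: only an assumed set of currencies is allowed.
    'set_membership': ~contracts['currency'].isin(['USD', 'EUR']),
    # Pattern constraint: IDs must match the assumed format 'C' + 3 digits.
    'pattern': ~contracts['customer_id'].str.fullmatch(r'C\d{3}'),
    # Cross-field validation: delivery cannot precede signature.
    'cross_field': contracts['delivered'] < contracts['signed'],
    # Uniqueness: flag every row whose ID occurs more than once.
    'uniqueness': contracts['customer_id'].duplicated(keep=False),
})

# Number of violating rows per rule.
print(violations.sum())
```

Keeping one boolean column per rule makes it easy to see which rows break which constraint before deciding how to clean them.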

### 1.2. The workflow

- Inspecting. The inspection can be done in the data exploration step. Here are the two most important methods to inspect the dataset:
    - Data profiling: Calculating summary statistics gives a general idea of the quality of the data. Questions to answer include *How many values are missing?*, *Does this field have a constraint with another field?* and *Which data type should this column have?*.
    - Data visualization: Visualization, especially when combined with statistical methods, helps answer *How is the data distributed?* and *Which points are outliers?*.
- Cleaning. In this step, all the criteria mentioned above are taken into account. Overall, incorrect data will be either removed, corrected or imputed.
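
The inspection step can be sketched in a few lines. The example below assumes a small hypothetical dataset (`df`, with made-up `age` and `group` columns) and uses the 1.5 × IQR rule as one possible outlier criterion; other rules may fit a given dataset better.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with missing values and one suspicious extreme value.
df = pd.DataFrame({
    'age': [23, 25, 31, 29, 27, 120, np.nan],
    'group': ['A', 'B', 'A', None, 'B', 'A', 'B'],
})

# Data profiling: data type and number of missing values per column.
profile = pd.DataFrame({
    'dtype': df.dtypes.astype(str),
    'n_missing': df.isna().sum(),
})
print(profile)

# Outlier rule (an assumption, not the only option):
# flag points farther than 1.5 * IQR from the quartiles.
q1, q3 = df['age'].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (df['age'] < q1 - 1.5 * iqr) | (df['age'] > q3 + 1.5 * iqr)
outliers = df[is_outlier]
print(outliers)
```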

## 2. Basic data cleaning

### 2.1. Common techniques


#### Selecting columns
Two approaches: selecting the necessary columns only or removing unnecessary ones.

In [2]:
dfAqua = pd.DataFrame({
    'year': pd.Series([2020, 2020, 2020, 2020, 2020, 2020]),
    'month_name': pd.Series(['Jan', 'Jan', 'Jun', 'Jun', 'Jul', 'Jul']),
    'month_number': pd.Series([1, 1, 6, 6, 7, 7]),
    'commodity': pd.Series(['Fish', 'Shrimp', 'Fish', 'Shrimp', 'Fish', 'Shrimp']),
    'profit': pd.Series([7415, 3239, 7280, 2007, 3574, 9285]),
    'company': pd.Series(['Pandas', 'Pandas', 'Pandas', 'Pandas', 'Pandas', 'Pandas'])
})

In [2]:
dfAqua

Unnamed: 0,year,month_name,month_number,commodity,profit,company
0,2020,Jan,1,Fish,7415,Pandas
1,2020,Jan,1,Shrimp,3239,Pandas
2,2020,Jun,6,Fish,7280,Pandas
3,2020,Jun,6,Shrimp,2007,Pandas
4,2020,Jul,7,Fish,3574,Pandas
5,2020,Jul,7,Shrimp,9285,Pandas


In [3]:
dfAqua[['year', 'month_number', 'commodity', 'profit']]

Unnamed: 0,year,month_number,commodity,profit
0,2020,1,Fish,7415
1,2020,1,Shrimp,3239
2,2020,6,Fish,7280
3,2020,6,Shrimp,2007
4,2020,7,Fish,3574
5,2020,7,Shrimp,9285


In [4]:
dfAqua.drop(columns=['month_name', 'company'])

Unnamed: 0,year,month_number,commodity,profit
0,2020,1,Fish,7415
1,2020,1,Shrimp,3239
2,2020,6,Fish,7280
3,2020,6,Shrimp,2007
4,2020,7,Fish,3574
5,2020,7,Shrimp,9285


#### Renaming columns
Column names should follow a single convention: `PascalCase`, `camelCase` or `snake_case`, with `snake_case` being the most common.

In [5]:
dfAqua = pd.DataFrame({
    'Year': pd.Series([2020, 2020, 2020, 2020, 2020, 2020]),
    'Month name': pd.Series(['Jan', 'Jan', 'Jun', 'Jun', 'Jul', 'Jul']),
    'Month number': pd.Series([1, 1, 6, 6, 7, 7]),
    'Product name': pd.Series(['Fish', 'Shrimp', 'Fish', 'Shrimp', 'Fish', 'Shrimp']),
    'Profit': pd.Series([7415, 3239, 7280, 2007, 3574, 9285]),
    'Company name': pd.Series(['Pandas', 'Pandas', 'Pandas', 'Pandas', 'Pandas', 'Pandas'])
})

In [5]:
dfAqua

Unnamed: 0,Year,Month name,Month number,Product name,Profit,Company name
0,2020,Jan,1,Fish,7415,Pandas
1,2020,Jan,1,Shrimp,3239,Pandas
2,2020,Jun,6,Fish,7280,Pandas
3,2020,Jun,6,Shrimp,2007,Pandas
4,2020,Jul,7,Fish,3574,Pandas
5,2020,Jul,7,Shrimp,9285,Pandas


In [6]:
# PascalCase
dfAquaPascal = dfAqua.copy()

dfAquaPascal.columns = dfAqua.columns.str.title().str.replace(' ', '')
dfAquaPascal

Unnamed: 0,Year,MonthName,MonthNumber,ProductName,Profit,CompanyName
0,2020,Jan,1,Fish,7415,Pandas
1,2020,Jan,1,Shrimp,3239,Pandas
2,2020,Jun,6,Fish,7280,Pandas
3,2020,Jun,6,Shrimp,2007,Pandas
4,2020,Jul,7,Fish,3574,Pandas
5,2020,Jul,7,Shrimp,9285,Pandas


In [7]:
# snake_case
dfAquaSnake = dfAqua.copy()

dfAquaSnake.columns = dfAqua.columns.str.lower().str.replace(' ', '_')
dfAquaSnake

Unnamed: 0,year,month_name,month_number,product_name,profit,company_name
0,2020,Jan,1,Fish,7415,Pandas
1,2020,Jan,1,Shrimp,3239,Pandas
2,2020,Jun,6,Fish,7280,Pandas
3,2020,Jun,6,Shrimp,2007,Pandas
4,2020,Jul,7,Fish,3574,Pandas
5,2020,Jul,7,Shrimp,9285,Pandas


The `rename()` method allows renaming specific columns.

In [8]:
dfAquaSnake.rename(columns={
    'product_name': 'commodity',
    'company_name': 'company'
})

Unnamed: 0,year,month_name,month_number,commodity,profit,company
0,2020,Jan,1,Fish,7415,Pandas
1,2020,Jan,1,Shrimp,3239,Pandas
2,2020,Jun,6,Fish,7280,Pandas
3,2020,Jun,6,Shrimp,2007,Pandas
4,2020,Jul,7,Fish,3574,Pandas
5,2020,Jul,7,Shrimp,9285,Pandas


#### Correcting data types

In [9]:
dfAthletes = pd.DataFrame({
    'year': [2019, 2019, 2020., 2020, 2020, 2020],
    'date': ['20191103', '20190812', '20200125', '20200129', '20200412', '20200220'],
    'time': ['145509', '135433', '214412', '124254', '123349', '233517'],
    'medal': ['Gold', 'Bronze', 'Silver', 'Bronze', 'Silver', 'Silver'],
    'name': ['Wayne', 'Robert', 'Ashley', 'Jamie', 'Jessie', 'Sergio'],
    'left_handed': [1, 0, 0, 0, 1, 0]
})

In [9]:
dfAthletes

Unnamed: 0,year,date,time,medal,name,left_handed
0,2019.0,20191103,145509,Gold,Wayne,1
1,2019.0,20190812,135433,Bronze,Robert,0
2,2020.0,20200125,214412,Silver,Ashley,0
3,2020.0,20200129,124254,Bronze,Jamie,0
4,2020.0,20200412,123349,Silver,Jessie,1
5,2020.0,20200220,233517,Silver,Sergio,0


In [10]:
dfAthletes.dtypes

year           float64
date            object
time            object
medal           object
name            object
left_handed      int64
dtype: object

Simple data types (string or numeric) can easily be corrected using the `astype()` method.

In [11]:
dfAthletes = dfAthletes.astype({
    'year': int,
    'left_handed': bool
})
dfAthletes

Unnamed: 0,year,date,time,medal,name,left_handed
0,2019,20191103,145509,Gold,Wayne,True
1,2019,20190812,135433,Bronze,Robert,False
2,2020,20200125,214412,Silver,Ashley,False
3,2020,20200129,124254,Bronze,Jamie,False
4,2020,20200412,123349,Silver,Jessie,True
5,2020,20200220,233517,Silver,Sergio,False


For more complex data types (date or categorical), the corresponding function has to be used.

In [12]:
pd.to_datetime(dfAthletes.date, format='%Y%m%d')

0   2019-11-03
1   2019-08-12
2   2020-01-25
3   2020-01-29
4   2020-04-12
5   2020-02-20
Name: date, dtype: datetime64[ns]

In [13]:
pd.to_datetime(dfAthletes.date + ' ' + dfAthletes.time, format='%Y%m%d %H%M%S')

0   2019-11-03 14:55:09
1   2019-08-12 13:54:33
2   2020-01-25 21:44:12
3   2020-01-29 12:42:54
4   2020-04-12 12:33:49
5   2020-02-20 23:35:17
dtype: datetime64[ns]

In [14]:
pd.Categorical(dfAthletes.medal, categories=['Bronze', 'Silver', 'Gold'])

['Gold', 'Bronze', 'Silver', 'Bronze', 'Silver', 'Silver']
Categories (3, object): ['Bronze', 'Silver', 'Gold']

In [15]:
dfAthletes.date = pd.to_datetime(dfAthletes.date, format='%Y%m%d')
dfAthletes.medal = pd.Categorical(dfAthletes.medal, categories=['Bronze', 'Silver', 'Gold'])
dfAthletes

Unnamed: 0,year,date,time,medal,name,left_handed
0,2019,2019-11-03,145509,Gold,Wayne,True
1,2019,2019-08-12,135433,Bronze,Robert,False
2,2020,2020-01-25,214412,Silver,Ashley,False
3,2020,2020-01-29,124254,Bronze,Jamie,False
4,2020,2020-04-12,123349,Silver,Jessie,True
5,2020,2020-02-20,233517,Silver,Sergio,False


In [16]:
dfAthletes.sort_values(by='medal')

Unnamed: 0,year,date,time,medal,name,left_handed
1,2019,2019-08-12,135433,Bronze,Robert,False
3,2020,2020-01-29,124254,Bronze,Jamie,False
2,2020,2020-01-25,214412,Silver,Ashley,False
4,2020,2020-04-12,123349,Silver,Jessie,True
5,2020,2020-02-20,233517,Silver,Sergio,False
0,2019,2019-11-03,145509,Gold,Wayne,True


#### Filtering

In [17]:
dfJob = pd.DataFrame({
    'worker': [
        'Wayne', 'Robert', 'Ashley',
        'Jamie', 'Jessie', 'Sergio',
        'Harry', 'Johnny', 'Aaron'
    ],
    'age': [8, 37, 25, 26, 80, 30, 20, 31, 28],
    'job': [
        'Student', 'Data Scientist', 'DATA ANALYST',
        'data engineer', 'Retired', 'Business Intelligence',
        'Student', 'Data Analyst', 'AI Engineer'
    ],
    'years_on_job': [0, 12, 2, 6, 0, 18, 12, 2, 8]
})

In [17]:
dfJob

Unnamed: 0,worker,age,job,years_on_job
0,Wayne,8,Student,0
1,Robert,37,Data Scientist,12
2,Ashley,25,DATA ANALYST,2
3,Jamie,26,data engineer,6
4,Jessie,80,Retired,0
5,Sergio,30,Business Intelligence,18
6,Harry,20,Student,12
7,Johnny,31,Data Analyst,2
8,Aaron,28,AI Engineer,8


In the dataset above, consider only people who are of legal working age (15 to 60) and work in the data industry. Note that `age` minus `years_on_job` (the worker's age when they started working) cannot be smaller than 15.

In [18]:
dfJob[
    (dfJob.job.str.lower().str.contains('data')) &
    (dfJob.age >= 15) &
    (dfJob.age <= 60) &
    (dfJob.age - dfJob.years_on_job >= 15)
]

Unnamed: 0,worker,age,job,years_on_job
1,Robert,37,Data Scientist,12
2,Ashley,25,DATA ANALYST,2
3,Jamie,26,data engineer,6
7,Johnny,31,Data Analyst,2
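
The same filter can be written with named masks, which some readers find easier to audit. This is a sketch of an equivalent formulation rather than a different technique: `between()` is inclusive on both ends by default, and `case=False` replaces the explicit `str.lower()` call. The mask names are introduced here for illustration.

```python
import pandas as pd

dfJob = pd.DataFrame({
    'worker': ['Wayne', 'Robert', 'Ashley', 'Jamie', 'Jessie',
               'Sergio', 'Harry', 'Johnny', 'Aaron'],
    'age': [8, 37, 25, 26, 80, 30, 20, 31, 28],
    'job': ['Student', 'Data Scientist', 'DATA ANALYST', 'data engineer',
            'Retired', 'Business Intelligence', 'Student', 'Data Analyst',
            'AI Engineer'],
    'years_on_job': [0, 12, 2, 6, 0, 18, 12, 2, 8]
})

# One named mask per business rule.
is_data_job = dfJob['job'].str.contains('data', case=False)
is_working_age = dfJob['age'].between(15, 60)
started_legally = (dfJob['age'] - dfJob['years_on_job']) >= 15

dfData = dfJob[is_data_job & is_working_age & started_legally]
print(dfData)
```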


### 2.2. Text cleaning


#### Trimming
Stray space and newline characters often appear in text columns because of users' typing habits.

In [20]:
dfTrade = pd.DataFrame({
    'year': pd.Series([2017, 2018, 2019, 2020]),
    'country': pd.Series([
        'United\nKingdom  ',
        '  United\nKingdom',
        'United    Kingdom',
        ' United Kingdom\n']),
    'export': pd.Series([5466, 8558, 8435, 8435]),
    'import': pd.Series([1546, 3546, 2007, 3574])
})

In [20]:
dfTrade

Unnamed: 0,year,country,export,import
0,2017,United\nKingdom,5466,1546
1,2018,United\nKingdom,8558,3546
2,2019,United Kingdom,8435,2007
3,2020,United Kingdom\n,8435,3574


In [21]:
dfTrade.country.unique()

array(['United\nKingdom  ', '  United\nKingdom', 'United    Kingdom',
       ' United Kingdom\n'], dtype=object)

In [22]:
dfTrade.country.str.split().str.join(' ')

0    United Kingdom
1    United Kingdom
2    United Kingdom
3    United Kingdom
Name: country, dtype: object

In [23]:
dfTrade.country = dfTrade.country.str.split().str.join(' ')
dfTrade.country.unique()

array(['United Kingdom'], dtype=object)
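
An alternative sketch uses a regular expression: every run of whitespace is collapsed into a single space, then the ends are stripped. It is shown on a standalone Series (`country`) with the same values as above, and should produce the same result as the `split()` / `join()` idiom.

```python
import pandas as pd

country = pd.Series([
    'United\nKingdom  ',
    '  United\nKingdom',
    'United    Kingdom',
    ' United Kingdom\n',
])

# \s+ matches one or more whitespace characters (spaces, tabs, newlines).
cleaned = country.str.replace(r'\s+', ' ', regex=True).str.strip()
print(cleaned.unique())
```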

#### Standardization
The approach is to map different naming conventions, abbreviations or formats to one canonical value.

In [24]:
dfShrimp = pd.DataFrame({
    'date': ['2020-01-01', '2020-01-02', '2020-01-03'],
    'commodity': ['Shrimp, frozen, chem free', 'Shrimp, frz, chemical-free', 'Prawn, frz, chemical-free'],
    'price': [10, 13, 14],
    'unit': ['usd/kg', 'USD/KG', 'USD/kg']
})

In [24]:
dfShrimp

Unnamed: 0,date,commodity,price,unit
0,2020-01-01,"Shrimp, frozen, chem free",10,usd/kg
1,2020-01-02,"Shrimp, frz, chemical-free",13,USD/KG
2,2020-01-03,"Prawn, frz, chemical-free",14,USD/kg


In [25]:
dfShrimp.commodity = dfShrimp.commodity.str.replace('Prawn', 'Shrimp')
dfShrimp.commodity = dfShrimp.commodity.str.replace('frz', 'frozen')
dfShrimp.commodity = dfShrimp.commodity.str.replace('chem free', 'chemical-free')
dfShrimp.unit = dfShrimp.unit.str.replace('usd', 'USD')
dfShrimp.unit = dfShrimp.unit.str.replace('KG', 'kg')

In [26]:
dfShrimp

Unnamed: 0,date,commodity,price,unit
0,2020-01-01,"Shrimp, frozen, chemical-free",10,USD/kg
1,2020-01-02,"Shrimp, frozen, chemical-free",13,USD/kg
2,2020-01-03,"Shrimp, frozen, chemical-free",14,USD/kg
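
When the list of variants grows, chaining `replace()` calls becomes hard to maintain. One possible alternative, sketched below, keeps the variants in a lookup dictionary (`commodity_map`, a hypothetical name) and normalizes the letter case of the unit parts instead of listing each spelling.

```python
import pandas as pd

dfShrimp = pd.DataFrame({
    'commodity': ['Shrimp, frozen, chem free',
                  'Shrimp, frz, chemical-free',
                  'Prawn, frz, chemical-free'],
    'unit': ['usd/kg', 'USD/KG', 'USD/kg']
})

# Hypothetical lookup table: variant -> canonical spelling.
commodity_map = {'Prawn': 'Shrimp', 'frz': 'frozen', 'chem free': 'chemical-free'}

commodity = dfShrimp['commodity']
for variant, canonical in commodity_map.items():
    # regex=False treats the variant as a literal substring.
    commodity = commodity.str.replace(variant, canonical, regex=False)

# Units differ only by letter case: upper-case the currency, lower-case the weight.
parts = dfShrimp['unit'].str.split('/', expand=True)
unit = parts[0].str.upper() + '/' + parts[1].str.lower()

print(commodity.unique(), unit.unique())
```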


#### Padding numbers

In [27]:
dfInfo = pd.DataFrame({
    'customer_id': [3, 423, 5464],
    'phone': [363334444, 913334444, 123334444],
    'name': ['Jack', 'James', 'Gabriel'],
    'information': ['England Male', 'Colombia Male', 'France Female']
})

In [27]:
dfInfo

Unnamed: 0,customer_id,phone,name,information
0,3,363334444,Jack,England Male
1,423,913334444,James,Colombia Male
2,5464,123334444,Gabriel,France Female


In [28]:
dfInfo = dfInfo.astype(str)
dfInfo.dtypes

customer_id    object
phone          object
name           object
information    object
dtype: object

In [29]:
dfInfo.customer_id = dfInfo.customer_id.str.pad(width=4, fillchar='0')
dfInfo.phone = dfInfo.phone.str.pad(width=10, fillchar='0')

In [30]:
dfInfo

Unnamed: 0,customer_id,phone,name,information
0,0003,0363334444,Jack,England Male
1,0423,0913334444,James,Colombia Male
2,5464,0123334444,Gabriel,France Female
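
For the specific case of left-padding with zeros, `str.zfill()` is a shorter equivalent of `str.pad(..., fillchar='0')`. A minimal sketch on a standalone Series:

```python
import pandas as pd

customer_id = pd.Series([3, 423, 5464]).astype(str)

# zfill pads on the left with '0' up to the given width.
padded = customer_id.str.zfill(4)

# The same result with pad(); side='left' is the default and is spelled out for clarity.
padded_with_pad = customer_id.str.pad(width=4, side='left', fillchar='0')

print(padded.tolist())
```

Values that are already at least as wide as the target width are left unchanged.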


#### Splitting a column

In [31]:
dfInfo = pd.DataFrame({
    'customer_id': [3, 423, 5464],
    'phone': [363334444, 913334444, 123334444],
    'name': ['Jack', 'James', 'Gabriel'],
    'information': ['England Male', 'Colombia Male', 'France Female']
})

In [31]:
dfInfo

Unnamed: 0,customer_id,phone,name,information
0,3,363334444,Jack,England Male
1,423,913334444,James,Colombia Male
2,5464,123334444,Gabriel,France Female


In [32]:
dfInfo['information'].str.split()

0     [England, Male]
1    [Colombia, Male]
2    [France, Female]
Name: information, dtype: object

In [33]:
# unpacking
dfInfo['nationality'] = dfInfo['information'].str.split().str[0]
dfInfo['gender'] = dfInfo['information'].str.split().str[1]

dfInfo.drop(columns=['information'])

Unnamed: 0,customer_id,phone,name,nationality,gender
0,3,363334444,Jack,England,Male
1,423,913334444,James,Colombia,Male
2,5464,123334444,Gabriel,France,Female
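
Splitting twice can be avoided with `expand=True`, which returns the parts as separate columns. The sketch below assumes every `information` value contains exactly one separator, as in the example data; `n=1` limits the split to the first whitespace.

```python
import pandas as pd

dfInfo = pd.DataFrame({
    'name': ['Jack', 'James', 'Gabriel'],
    'information': ['England Male', 'Colombia Male', 'France Female']
})

# expand=True yields a DataFrame with one column per part.
dfInfo[['nationality', 'gender']] = dfInfo['information'].str.split(n=1, expand=True)
dfInfo = dfInfo.drop(columns='information')
print(dfInfo)
```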


#### Concatenating columns

In [34]:
dfFootball = pd.DataFrame({
    'first_name': ['Wayne', 'Cristiano', 'Lionel'],
    'last_name': ['Rooney', 'Ronaldo', 'Messi'],
    'position': ['Second Striker', 'Left Winger', 'Right Winger']
})

In [34]:
dfFootball

Unnamed: 0,first_name,last_name,position
0,Wayne,Rooney,Second Striker
1,Cristiano,Ronaldo,Left Winger
2,Lionel,Messi,Right Winger


In [35]:
dfFootball['player'] = dfFootball.first_name + ' ' + dfFootball.last_name

In [36]:
dfFootball

Unnamed: 0,first_name,last_name,position,player
0,Wayne,Rooney,Second Striker,Wayne Rooney
1,Cristiano,Ronaldo,Left Winger,Cristiano Ronaldo
2,Lionel,Messi,Right Winger,Lionel Messi


## 3. Low quality data

### 3.1. Handling missing data
Missing data arises through three mechanisms, illustrated in the example below. The table records the IQ scores of 9 people of different ages; for each mechanism, we assume 3 values are missing.

|Age   |IQ     |MCAR|MAR|MNAR|
|:----:|:-----:|:--:|:-:|:--:|
|**20**|**120**|120 |   |120 |
|**22**|**112**|    |   |112 |
|**24**|**127**|127 |   |127 |
|**29**|**97** |    |97 |    |
|**30**|**103**|103 |103|103 |
|**40**|**95** |95  |95 |    |
|**45**|**141**|    |141|141 |
|**47**|**92** |92  |92 |    |
|**52**|**115**|115 |115|115 |

::::{tab-set}

:::{tab-item} MCAR

MCAR (Missing Completely At Random): As the name suggests, the missingness is unrelated to any variable, observed or not. This type of missingness does not introduce bias, so both *deletion* and *imputation* are suitable solutions.
:::

:::{tab-item} MAR

MAR (Missing At Random): The missingness in a feature depends on another, observed feature. In the example above, the IQ scores of people under 25 years old are missing. Deleting these records causes bias, making *imputation* the better choice.
:::

:::{tab-item} MNAR

MNAR (Missing Not At Random): The missingness depends on the missing value itself. Assume people with an IQ score of 100 or less tend to refuse to answer the survey. The missing data cannot be inferred from the collected data alone. Both *deletion* and *imputation* bias the data, and Data Scientists may not even realize they are facing an MNAR case.
:::

::::
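
The three mechanisms can be simulated on the IQ table. This is an illustrative sketch: the MAR and MNAR masks follow the rules stated above (age under 25, IQ of 100 or less), while the MCAR rows are drawn at random with a fixed seed, so they need not match the rows blanked in the table. The name `dfIQ` is introduced here for illustration.

```python
import numpy as np
import pandas as pd

dfIQ = pd.DataFrame({
    'age': [20, 22, 24, 29, 30, 40, 45, 47, 52],
    'iq': [120, 112, 127, 97, 103, 95, 141, 92, 115],
})

# MCAR: hide 3 scores chosen purely at random (seed fixed only for reproducibility).
rng = np.random.default_rng(0)
mcar_idx = rng.choice(dfIQ.index.to_numpy(), size=3, replace=False)
dfIQ['mcar'] = dfIQ['iq'].astype(float)
dfIQ.loc[mcar_idx, 'mcar'] = np.nan

# MAR: missingness depends on another observed column (age under 25).
dfIQ['mar'] = dfIQ['iq'].where(dfIQ['age'] >= 25)

# MNAR: missingness depends on the unobserved value itself (IQ of 100 or less).
dfIQ['mnar'] = dfIQ['iq'].where(dfIQ['iq'] > 100)

# Compare the mean of each incomplete column with the true mean.
print(dfIQ[['iq', 'mcar', 'mar', 'mnar']].mean())
```

In the MNAR column only scores above 100 remain, so its mean is necessarily higher than the true mean, and nothing in the observed data reveals that.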


In [38]:
# COVID-19 data, complete data
country = ['USA', 'Brazil', 'India', 'Russia', 'South Africa',
           'Peru', 'Mexico', 'Chile', 'Iran', 'Italy']
cases = [4169991, 2289951, 1288130, 795038, 408052, 371096, 370712, 338759, 284034, 245338]
deaths = [147333, 84207, 30645, 12892, 6093, 17645, 41908, 8838, 15074, 35029]
recovered = [1979617, 1570237, 817593, 580330, 236260, 255945, 236209, 311431, 247230, 197842]
area = ['North America', 'South America', 'Asia', 'Europe', 'Africa',
        'South America', 'North America', 'South America', 'Asia', 'Europe']

pd.DataFrame({
    'country': country,
    'cases': cases,
    'deaths': deaths,
    'recovered': recovered,
    'area': area
})

Unnamed: 0,country,cases,deaths,recovered,area
0,USA,4169991,147333,1979617,North America
1,Brazil,2289951,84207,1570237,South America
2,India,1288130,30645,817593,Asia
3,Russia,795038,12892,580330,Europe
4,South Africa,408052,6093,236260,Africa
5,Peru,371096,17645,255945,South America
6,Mexico,370712,41908,236209,North America
7,Chile,338759,8838,311431,South America
8,Iran,284034,15074,247230,Asia
9,Italy,245338,35029,197842,Europe


#### Columns removal
As a rule of thumb, a column with more than 50% missing data can be dropped.


In [40]:
country = ['USA', 'Brazil', 'India', 'Russia', 'South Africa', 'Peru', 'Mexico', 'Chile', 'Iran', 'Italy']
cases = [4169991, 2289951, 1288130, 795038, 408052, 371096, 370712, 338759, 284034, 245338]
deaths = [147333, 84207, 30645, 12892, 6093, 17645, 41908, 8838, 15074, 35029]
recovered = [1979617, None, None, 580330, None, None, 236209, None, 247230, None]

dfCovid = pd.DataFrame({
    'country': country,
    'cases': cases,
    'deaths': deaths,
    'recovered': recovered
})

In [40]:
dfCovid

Unnamed: 0,country,cases,deaths,recovered
0,USA,4169991,147333,1979617.0
1,Brazil,2289951,84207,
2,India,1288130,30645,
3,Russia,795038,12892,580330.0
4,South Africa,408052,6093,
5,Peru,371096,17645,
6,Mexico,370712,41908,236209.0
7,Chile,338759,8838,
8,Iran,284034,15074,247230.0
9,Italy,245338,35029,


In [41]:
dfCovid.isna().mean().map('{:.0%}'.format)

country       0%
cases         0%
deaths        0%
recovered    60%
dtype: object

In [42]:
dfCovid.drop(columns='recovered')

Unnamed: 0,country,cases,deaths
0,USA,4169991,147333
1,Brazil,2289951,84207
2,India,1288130,30645
3,Russia,795038,12892
4,South Africa,408052,6093
5,Peru,371096,17645
6,Mexico,370712,41908
7,Chile,338759,8838
8,Iran,284034,15074
9,Italy,245338,35029
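
With many columns, naming each one to drop is impractical. The sketch below applies the 50% rule to all columns at once; the `threshold` variable and the shortened five-row dataset are assumptions made for brevity.

```python
import numpy as np
import pandas as pd

dfCovid = pd.DataFrame({
    'country': ['USA', 'Brazil', 'India', 'Russia', 'South Africa'],
    'cases': [4169991, 2289951, 1288130, 795038, 408052],
    'recovered': [1979617, np.nan, np.nan, 580330, np.nan],
})

threshold = 0.5  # maximum tolerated share of missing values per column

# Share of missing values per column, then keep only the columns under the threshold.
missing_ratio = dfCovid.isna().mean()
dfKept = dfCovid.loc[:, missing_ratio <= threshold]
print(dfKept.columns.tolist())
```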


#### Rows removal

In [43]:
country = ['USA', 'Brazil', 'India', 'Russia', 'South Africa', 'Peru', 'Mexico', 'Chile', 'Iran', 'Italy']
cases = [4169991, 2289951, 1288130, 795038, 408052, 371096, 370712, 338759, 284034, 245338]
deaths = [147333, 84207, 30645, 12892, 6093, 17645, 41908, 8838, 15074, 35029]
recovered = [1979617, None, 817593, None, 236260, 255945, 236209, 311431, 247230, 197842]

dfCovid = pd.DataFrame({
    'country': country,
    'cases': cases,
    'deaths': deaths,
    'recovered': recovered
})

In [43]:
dfCovid

Unnamed: 0,country,cases,deaths,recovered
0,USA,4169991,147333,1979617.0
1,Brazil,2289951,84207,
2,India,1288130,30645,817593.0
3,Russia,795038,12892,
4,South Africa,408052,6093,236260.0
5,Peru,371096,17645,255945.0
6,Mexico,370712,41908,236209.0
7,Chile,338759,8838,311431.0
8,Iran,284034,15074,247230.0
9,Italy,245338,35029,197842.0


In [44]:
dfCovid.dropna(subset=['recovered'])

Unnamed: 0,country,cases,deaths,recovered
0,USA,4169991,147333,1979617.0
2,India,1288130,30645,817593.0
4,South Africa,408052,6093,236260.0
5,Peru,371096,17645,255945.0
6,Mexico,370712,41908,236209.0
7,Chile,338759,8838,311431.0
8,Iran,284034,15074,247230.0
9,Italy,245338,35029,197842.0


#### Filling
Common values used to fill missing data are the mean, median, mode and zero.


In [46]:
country = ['USA', 'Brazil', 'India', 'Russia', 'South Africa',
           'Peru', 'Mexico', 'Chile', 'Iran', 'Italy']
cases = [4169991, 2289951, 1288130, 795038, 408052, 371096, 370712, 338759, 284034, 245338]
deaths = [147333, 84207, 30645, 12892, 6093, 17645, 41908, 8838, 15074, 35029]
recovered = [1979617, None, 817593, None, 236260, 255945, 236209, 311431, 247230, 197842]
area = ['North America', 'South America', 'Asia', np.nan, 'Africa',
        'South America', 'North America', 'South America', np.nan, 'Europe']

dfCovid = pd.DataFrame({
    'country': country,
    'cases': cases,
    'deaths': deaths,
    'recovered': recovered,
    'area': area
})

In [46]:
dfCovid

Unnamed: 0,country,cases,deaths,recovered,area
0,USA,4169991,147333,1979617.0,North America
1,Brazil,2289951,84207,,South America
2,India,1288130,30645,817593.0,Asia
3,Russia,795038,12892,,
4,South Africa,408052,6093,236260.0,Africa
5,Peru,371096,17645,255945.0,South America
6,Mexico,370712,41908,236209.0,North America
7,Chile,338759,8838,311431.0,South America
8,Iran,284034,15074,247230.0,
9,Italy,245338,35029,197842.0,Europe


In [47]:
recovered_mean = dfCovid.recovered.mean()
recovered_mean

535265.875

In [48]:
area_mode = dfCovid.area.mode()[0]
area_mode

'South America'

In [49]:
dfCovid.recovered = dfCovid.recovered.fillna(recovered_mean)
dfCovid.area = dfCovid.area.fillna(area_mode)
dfCovid

Unnamed: 0,country,cases,deaths,recovered,area
0,USA,4169991,147333,1979617.0,North America
1,Brazil,2289951,84207,535265.875,South America
2,India,1288130,30645,817593.0,Asia
3,Russia,795038,12892,535265.875,South America
4,South Africa,408052,6093,236260.0,Africa
5,Peru,371096,17645,255945.0,South America
6,Mexico,370712,41908,236209.0,North America
7,Chile,338759,8838,311431.0,South America
8,Iran,284034,15074,247230.0,South America
9,Italy,245338,35029,197842.0,Europe
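
The mean is pulled upward by large values such as the USA figure, so the median may be a more representative fill value for skewed columns. A minimal sketch on the same `recovered` values, shown as a standalone Series:

```python
import numpy as np
import pandas as pd

recovered = pd.Series([1979617, np.nan, 817593, np.nan, 236260,
                       255945, 236209, 311431, 247230, 197842])

# mean() and median() both skip missing values by default.
recovered_mean = recovered.mean()      # 535265.875
recovered_median = recovered.median()  # 251587.5

filled_median = recovered.fillna(recovered_median)
print(recovered_mean, recovered_median)
```

Which statistic to use remains a judgment call that depends on the distribution of the column.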


#### Imputing
k-NN (k-Nearest Neighbors) is one of the Machine Learning algorithms that can be used to impute missing values. It uses the $k$ nearest observations (according to some distance metric) to predict each missing value.


In [51]:
country = ['USA', 'Brazil', 'India', 'Russia', 'South Africa',
           'Peru', 'Mexico', 'Chile', 'Iran', 'Italy']
cases = [4169991, 2289951, 1288130, 795038, 408052, 371096, 370712, 338759, 284034, 245338]
deaths = [147333, 84207, 30645, 12892, 6093, 17645, 41908, 8838, 15074, 35029]
recovered = [1979617, 1570237, 817593, 580330, 236260, 255945, 236209, 311431, 247230, 197842]
area = ['America', 'America', 'Asia', np.nan, 'Africa',
        'America', 'America', np.nan, 'Asia', 'Europe']

dfCovid = pd.DataFrame({
    'country': country,
    'cases': cases,
    'deaths': deaths,
    'recovered': recovered,
    'area': area
})

In [51]:
dfCovid

Unnamed: 0,country,cases,deaths,recovered,area
0,USA,4169991,147333,1979617,America
1,Brazil,2289951,84207,1570237,America
2,India,1288130,30645,817593,Asia
3,Russia,795038,12892,580330,
4,South Africa,408052,6093,236260,Africa
5,Peru,371096,17645,255945,America
6,Mexico,370712,41908,236209,America
7,Chile,338759,8838,311431,
8,Iran,284034,15074,247230,Asia
9,Italy,245338,35029,197842,Europe


In [52]:
dfTrain = dfCovid[~dfCovid.area.isna()]
xTrain = dfTrain[['cases', 'deaths', 'recovered']]
yTrain = dfTrain.area

dfPredict = dfCovid[dfCovid.area.isna()]
xPred = dfPredict[['cases', 'deaths', 'recovered']]

In [53]:
from sklearn.neighbors import KNeighborsClassifier as knn
clf = knn(3, weights='distance').fit(xTrain, yTrain)
yPred = clf.predict(xPred)
yPred

array(['America', 'America'], dtype=object)

In [54]:
# DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent
pd.concat([dfTrain, dfPredict.assign(area=yPred)]).sort_values('cases')

Unnamed: 0,country,cases,deaths,recovered,area
9,Italy,245338,35029,197842,Europe
8,Iran,284034,15074,247230,Asia
7,Chile,338759,8838,311431,America
6,Mexico,370712,41908,236209,America
5,Peru,371096,17645,255945,America
4,South Africa,408052,6093,236260,Africa
3,Russia,795038,12892,580330,America
2,India,1288130,30645,817593,Asia
1,Brazil,2289951,84207,1570237,America
0,USA,4169991,147333,1979617,America
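
Because k-NN relies on distances, a feature with a large scale (here `cases`) can dominate the result. The sketch below is a variation on the cells above, not a reproduction of them: it standardizes the features first and implements an unweighted majority vote directly in NumPy, so its predictions may differ from the distance-weighted scikit-learn classifier. The names `known`, `predictions` and `dfImputed` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

dfCovid = pd.DataFrame({
    'country': ['USA', 'Brazil', 'India', 'Russia', 'South Africa',
                'Peru', 'Mexico', 'Chile', 'Iran', 'Italy'],
    'cases': [4169991, 2289951, 1288130, 795038, 408052,
              371096, 370712, 338759, 284034, 245338],
    'deaths': [147333, 84207, 30645, 12892, 6093,
               17645, 41908, 8838, 15074, 35029],
    'recovered': [1979617, 1570237, 817593, 580330, 236260,
                  255945, 236209, 311431, 247230, 197842],
    'area': ['America', 'America', 'Asia', np.nan, 'Africa',
             'America', 'America', np.nan, 'Asia', 'Europe']
})
features = ['cases', 'deaths', 'recovered']

# Standardize each feature so that none dominates the Euclidean distance.
X = dfCovid[features].to_numpy(dtype=float)
X = (X - X.mean(axis=0)) / X.std(axis=0)

known = dfCovid['area'].notna().to_numpy()
X_train, y_train = X[known], dfCovid.loc[known, 'area'].to_numpy()
X_pred = X[~known]

k = 3
predictions = []
for row in X_pred:
    # Distance from this row to every labelled row.
    distances = np.linalg.norm(X_train - row, axis=1)
    nearest = y_train[np.argsort(distances)[:k]]
    # Unweighted majority vote; mode() breaks ties by sorted label order.
    predictions.append(pd.Series(nearest).mode()[0])

dfImputed = dfCovid.copy()
dfImputed.loc[~known, 'area'] = predictions
print(dfImputed[['country', 'area']])
```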


### 3.2. Handling duplicated values
Duplicated values usually violate a unique constraint on a column or on a combination of columns. When duplicates occur, at most one of the conflicting records can be correct.

Depending on the context, there are several strategies to handle duplicated values:
- List and sort all duplicated values, then manually remove incorrect records.
- Remove duplicated values based on specific criteria, such as keeping the greatest value only.
- Calculate a value such as sum or mean representing all duplicated records.


In [56]:
dfReport = pd.DataFrame({
    'year': pd.Series([2019, 2019, 2020, 2020, 2020, 2020]),
    'company': pd.Series(['Pandas', 'Numpy', 'Pandas', 'Numpy', 'Numpy', 'Pandas']),
    'sales': pd.Series([5466, 8558, 8435, 7280, 9285, 6650]),
    'profit': pd.Series([1546, 3546, 3574, 3352, 4678, 2007])
})

In [56]:
dfReport

Unnamed: 0,year,company,sales,profit
0,2019,Pandas,5466,1546
1,2019,Numpy,8558,3546
2,2020,Pandas,8435,3574
3,2020,Numpy,7280,3352
4,2020,Numpy,9285,4678
5,2020,Pandas,6650,2007


In this example, the combination of `year` and `company` creates a unique constraint: in each year, a company cannot have two values of sales and profit.

In [57]:
subset = ['year', 'company']

#### Manual removal

In [58]:
dfReport[dfReport.duplicated(subset, keep=False)].sort_values(subset)

Unnamed: 0,year,company,sales,profit
3,2020,Numpy,7280,3352
4,2020,Numpy,9285,4678
2,2020,Pandas,8435,3574
5,2020,Pandas,6650,2007


In [59]:
dfReport.drop(index=[4, 2])

Unnamed: 0,year,company,sales,profit
0,2019,Pandas,5466,1546
1,2019,Numpy,8558,3546
3,2020,Numpy,7280,3352
5,2020,Pandas,6650,2007


#### Conditional removal

In [60]:
# keep the biggest sales values only
dfReport\
    .sort_values(by=['year', 'company', 'sales'])\
    .drop_duplicates(subset=subset, keep='last')

Unnamed: 0,year,company,sales,profit
1,2019,Numpy,8558,3546
0,2019,Pandas,5466,1546
4,2020,Numpy,9285,4678
2,2020,Pandas,8435,3574


#### Aggregating

In [61]:
dfReport.groupby(by=['year', 'company']).sum().reset_index()

Unnamed: 0,year,company,sales,profit
0,2019,Numpy,8558,3546
1,2019,Pandas,5466,1546
2,2020,Numpy,16565,8030
3,2020,Pandas,15085,5581
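
When the duplicated records are repeated measurements rather than partial amounts, the mean may be a better representative than the sum. A sketch using named aggregation, where each output column is given an explicit source column and function (`dfMean` is a name introduced here):

```python
import pandas as pd

dfReport = pd.DataFrame({
    'year': [2019, 2019, 2020, 2020, 2020, 2020],
    'company': ['Pandas', 'Numpy', 'Pandas', 'Numpy', 'Numpy', 'Pandas'],
    'sales': [5466, 8558, 8435, 7280, 9285, 6650],
    'profit': [1546, 3546, 3574, 3352, 4678, 2007]
})

# One row per (year, company); duplicated records are averaged.
dfMean = (
    dfReport
    .groupby(['year', 'company'], as_index=False)
    .agg(sales=('sales', 'mean'), profit=('profit', 'mean'))
)
print(dfMean)
```

Named aggregation also allows a different function per column, for example the mean of `sales` and the maximum of `profit`.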
