# Data Cleaning with Pandas

A simplified data science process can be broken down into the following straightforward steps:

1. **Define the Problem**: Clearly state the problem you want to solve using data. Understand the business or research context. Examples:
    - Recommendation Systems:
        - Problem: An e-commerce platform wants to increase sales by recommending products to users.
        - Definition: Define the problem as creating a recommendation system that suggests products to users based on their preferences, browsing history, and purchase behavior.
        - Data Required: User behavior data (e.g., clicks, purchases, views), product information, user profiles, and possibly contextual data (location, time).
        - Data Sources: Website and App Usage Analytics, Product Catalogs, User Profiles, and External Contextual Data.
    - Healthcare Diagnosis:
        - Problem: A hospital wants to improve early disease detection.
        - Definition: Frame the problem as developing a predictive model for early diagnosis of diseases based on patient symptoms and medical history.
        - Data Required: Patient medical records (symptoms, test results, medical history), diagnostic codes, and potentially external data (research papers, clinical trials).
        - Data Sources: Electronic Health Records (EHR), Lab Results, Imaging Reports, and Medical Literature.


2. **Data Collection**: Gather the data required to address the problem. This might involve databases, spreadsheets, or other sources. Common challenges include:
    - Inaccurate, incomplete, or inconsistent data can lead to unreliable results. 
    - Data may not be readily available or may exist in various formats and locations, making it challenging to collect and consolidate.
    - When integrating data from different sources, ensuring data compatibility (e.g., consistent data formats, units, and identifiers) can be challenging.
    - Collecting and using data may be subject to legal restrictions, contracts, or industry regulations. Ensuring compliance is essential.

3. **Data Cleaning**: Clean and organize the data. Remove errors, handle missing values, and format it for analysis. (TODAY)

4. **Data Analysis**: Explore the data to understand its characteristics. This can involve basic statistics and simple visualizations. (NEXT WEEK)

5. **Model Building**: Choose a straightforward algorithm or method to build a predictive or descriptive model.

6. **Model Training**: Train your model using the cleaned data. Adjust model settings if needed.

7. **Model Evaluation**: Assess how well the model performs. Use simple metrics to measure its accuracy.

8. **Communication**: Share the findings and insights with others, making sure they understand the results.

9. **Iterate**: If necessary, repeat the process with additional data or make improvements based on feedback.

## 1. Data types and missing values

Data cleaning involves identifying and correcting errors, inconsistencies, and inaccuracies in datasets to ensure that the data is of high quality and reliable for analysis. Data cleaning is essential because the accuracy and quality of data significantly impact the validity and effectiveness of any data-driven project. 

This section uses the [Billionaires Statistics Dataset](https://www.kaggle.com/datasets/nelgiriyewithana/billionaires-statistics-dataset), which contains information about the world's billionaires.

Description:

- rank: The ranking of the billionaire in terms of wealth.
- finalWorth: The final net worth of the billionaire in U.S. dollars.
- category: The category or industry in which the billionaire's business operates.
- personName: The full name of the billionaire.
- age: The age of the billionaire.
- country: The country in which the billionaire resides.
- city: The city in which the billionaire resides.
- source: The source of the billionaire's wealth.
- industries: The industries associated with the billionaire's business interests.
- countryOfCitizenship: The country of citizenship of the billionaire.
- organization: The name of the organization or company associated with the billionaire.
- selfMade: Indicates whether the billionaire is self-made (True/False).
- status: "D" represents self-made billionaires (Founders/Entrepreneurs) and "U" indicates inherited or unearned wealth.
- gender: The gender of the billionaire.
- birthDate: The birthdate of the billionaire.
- lastName: The last name of the billionaire.
- firstName: The first name of the billionaire.
- title: The title or honorific of the billionaire.
- date: The date of data collection.
- state: The state in which the billionaire resides.
- residenceStateRegion: The region or state of residence of the billionaire.
- birthYear: The birth year of the billionaire.
- birthMonth: The birth month of the billionaire.
- birthDay: The birth day of the billionaire.
- cpi_country: Consumer Price Index (CPI) for the billionaire's country.
- cpi_change_country: CPI change for the billionaire's country.
- gdp_country: Gross Domestic Product (GDP) for the billionaire's country.
- gross_tertiary_education_enrollment: Enrollment in tertiary education in the billionaire's country.
- gross_primary_education_enrollment_country: Enrollment in primary education in the billionaire's country.
- life_expectancy_country: Life expectancy in the billionaire's country.
- tax_revenue_country_country: Tax revenue in the billionaire's country.
- total_tax_rate_country: Total tax rate in the billionaire's country.
- population_country: Population of the billionaire's country.
- latitude_country: Latitude coordinate of the billionaire's country.
- longitude_country: Longitude coordinate of the billionaire's country.

In [61]:
import numpy as np
import pandas as pd

In [62]:
# Read data from a csv file
df = pd.read_csv("../data/billionairs.csv")

In [63]:
# Check the shape of data
df.shape

(2640, 35)

In [64]:
# View first 3 rows
df.head(3)

Unnamed: 0,rank,finalWorth,category,personName,age,country,city,source,industries,countryOfCitizenship,...,cpi_change_country,gdp_country,gross_tertiary_education_enrollment,gross_primary_education_enrollment_country,life_expectancy_country,tax_revenue_country_country,total_tax_rate_country,population_country,latitude_country,longitude_country
0,1,211000,Fashion & Retail,Bernard Arnault & family,74.0,France,Paris,LVMH,Fashion & Retail,France,...,1.1,"$2,715,518,274,227",65.6,102.5,82.5,24.2,60.7,67059887.0,46.227638,2.213749
1,2,180000,Automotive,Elon Musk,51.0,United States,Austin,"Tesla, SpaceX",Automotive,United States,...,7.5,"$21,427,700,000,000",88.2,101.8,78.5,9.6,36.6,328239523.0,37.09024,-95.712891
2,3,114000,Technology,Jeff Bezos,59.0,United States,Medina,Amazon,Technology,United States,...,7.5,"$21,427,700,000,000",88.2,101.8,78.5,9.6,36.6,328239523.0,37.09024,-95.712891
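
Note that `gdp_country` in the preview above is stored as text such as `"$2,715,518,274,227"`, so it cannot be used in arithmetic directly. The following is a minimal sketch of one way to convert such strings to numbers. It uses a toy frame (`gdp_demo` and the new column `gdp_country_num` are hypothetical names, not part of the dataset) so that it runs without the CSV file.

```python
import pandas as pd

# Toy frame mimicking the gdp_country strings shown in the preview above
gdp_demo = pd.DataFrame({
    "gdp_country": ["$2,715,518,274,227", "$21,427,700,000,000", None]
})

# Remove the dollar sign and thousands separators
cleaned = gdp_demo["gdp_country"].str.replace(r"[$,]", "", regex=True)

# Convert to numbers; errors="coerce" turns unparseable entries into NaN
gdp_demo["gdp_country_num"] = pd.to_numeric(cleaned, errors="coerce")

print(gdp_demo.dtypes)
```

Writing the result to a new column keeps the original strings available for comparison while you verify the conversion.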


In [65]:
# Print all columns of the data
df.columns

Index(['rank', 'finalWorth', 'category', 'personName', 'age', 'country',
       'city', 'source', 'industries', 'countryOfCitizenship', 'organization',
       'selfMade', 'status', 'gender', 'birthDate', 'lastName', 'firstName',
       'title', 'date', 'state', 'residenceStateRegion', 'birthYear',
       'birthMonth', 'birthDay', 'cpi_country', 'cpi_change_country',
       'gdp_country', 'gross_tertiary_education_enrollment',
       'gross_primary_education_enrollment_country', 'life_expectancy_country',
       'tax_revenue_country_country', 'total_tax_rate_country',
       'population_country', 'latitude_country', 'longitude_country'],
      dtype='object')

In [72]:
# Check missing values by columns
missing_values_count = df.isna().sum()
na_cols = missing_values_count[missing_values_count > 0].sort_values(ascending=False)

print(f"{len(na_cols)} columns contain missing values.")
na_cols

23 columns contain missing values.


organization                                  2315
title                                         2301
residenceStateRegion                          1893
state                                         1887
cpi_country                                    184
cpi_change_country                             184
tax_revenue_country_country                    183
total_tax_rate_country                         182
life_expectancy_country                        182
gross_tertiary_education_enrollment            182
gross_primary_education_enrollment_country     181
longitude_country                              164
latitude_country                               164
population_country                             164
gdp_country                                    164
birthYear                                       77
birthDay                                        76
birthMonth                                      76
birthDate                                       76
city                           

In [75]:
df.isna().sum().sum() / df.size * 100

11.702380952380953

In [17]:
# how many total missing values do we have?
total_cells = df.size
total_missing = missing_values_count.sum()

# percent of data that is missing
percent_missing = (total_missing/total_cells) * 100
print(f"{percent_missing:.2f}% of data is missing.")

11.70% of data is missing.
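
The overall percentage hides which columns are responsible. A per-column percentage can be obtained in one line, because the mean of a boolean mask is the fraction of `True` values. This is a small sketch on a made-up frame (`toy` and its values are illustrative only); the same expression should work on `df`.

```python
import pandas as pd

# Hypothetical toy data with different amounts of missingness per column
toy = pd.DataFrame({
    "title": [None, None, None, "CEO"],
    "age": [74.0, None, 59.0, 51.0],
    "rank": [1, 2, 3, 4],
})

# isna() gives True/False per cell; mean() of booleans is the missing fraction
pct_missing = (toy.isna().mean() * 100).sort_values(ascending=False)
print(pct_missing)
```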


**Data Intuition**

Is the data missing because it does not exist, or because it was not recorded?
- If the value does not exist (for example, the current salary of a homemaker without a paid job, or the college graduation year of someone who never attended college), it is better to leave it as missing. Guessing a value would not be meaningful.
- If the value exists but was not recorded, and you can estimate it from other values in the same row or the same column, you can fill it in with that estimate. This is called `imputation`.

Some techniques to handle missing data

1. Drop missing values

In [76]:
# drop a row if there is any missing value in that row
dropped_na_df = df.dropna()
dropped_na_df.shape

(238, 35)

In [19]:
# drop a column if there is any missing value in that column
dropped_na_df = df.dropna(axis=1)
print(dropped_na_df.shape)
print(dropped_na_df.isna().sum().sum())
dropped_na_df.head()

(2640, 12)
0


Unnamed: 0,rank,finalWorth,category,personName,source,industries,countryOfCitizenship,selfMade,status,gender,lastName,date
0,1,211000,Fashion & Retail,Bernard Arnault & family,LVMH,Fashion & Retail,France,False,U,M,Arnault,23/07/2023
1,2,180000,Automotive,Elon Musk,"Tesla, SpaceX",Automotive,United States,True,D,M,Musk,22/06/2023
2,3,114000,Technology,Jeff Bezos,Amazon,Technology,United States,True,D,M,Bezos,06/10/2023
3,4,107000,Technology,Larry Ellison,Oracle,Technology,United States,True,U,M,Ellison,17/09/2023
4,5,106000,Finance & Investments,Warren Buffett,Berkshire Hathaway,Finance & Investments,United States,True,D,M,Buffett,21/06/2023
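
Dropping every row or column that has any missing value discards a lot of data (only 238 of 2640 rows survive above). `dropna` also accepts `subset` and `thresh`, which allow a more selective drop. The sketch below uses a hypothetical toy frame; the threshold of 50% is an arbitrary choice for illustration.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "personName": ["A", "B", "C", "D"],
    "age": [74.0, None, 59.0, None],
    "title": [None, None, "CEO", None],
    "city": ["Paris", "Austin", None, None],
})

# Drop a row only when 'age' is missing; other columns are ignored
has_age = toy.dropna(subset=["age"])

# Keep rows that have at least 3 non-missing values
mostly_complete = toy.dropna(thresh=3)

# Keep columns where at least half of the values are present
keep_cols = toy.dropna(axis=1, thresh=int(len(toy) * 0.5))

print(has_age.shape, mostly_complete.shape, keep_cols.shape)
```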


2. Fill missing values manually

In [77]:
df[["personName", "firstName", "lastName"]]

Unnamed: 0,personName,firstName,lastName
0,Bernard Arnault & family,Bernard,Arnault
1,Elon Musk,Elon,Musk
2,Jeff Bezos,Jeff,Bezos
3,Larry Ellison,Larry,Ellison
4,Warren Buffett,Warren,Buffett
...,...,...,...
2635,Yu Rong,Rong,Yu
2636,"Richard Yuengling, Jr.",Richard,Yuengling
2637,Zhang Gongyun,Gongyun,Zhang
2638,Zhang Guiping & family,Guiping,Zhang


In [20]:
# personName is the full name; we want to infer the missing firstName from it
# First, check why the first name is missing
df[df["firstName"].isna()][["personName", "firstName", "lastName"]]

Unnamed: 0,personName,firstName,lastName
676,Tahir & family,,Tahir
1240,Jay-Z,,Jay-Z
2095,Rihanna,,Rihanna


In [21]:
# Check where birthYear is missing but age is present
df[df["birthYear"].isna() & df["age"].notna()][["birthYear", "age"]]

Unnamed: 0,birthYear,age
423,,53.0
902,,53.0
904,,53.0
913,,53.0
1494,,53.0
1592,,53.0
1759,,53.0
1809,,53.0
2038,,53.0
2069,,53.0


In [79]:
idx = df[df["birthYear"].isna() & df["age"].notna()][["birthYear", "age"]].index
idx
df.loc[idx, "birthYear"]

423    NaN
902    NaN
904    NaN
913    NaN
1494   NaN
1592   NaN
1759   NaN
1809   NaN
2038   NaN
2069   NaN
2437   NaN
2625   NaN
Name: birthYear, dtype: float64

In [81]:
# Calculate the birthYear from age
# Since this data is collected in 2023, we use 2023 as the current year
def get_birthYear(row):
    return 2023 - row["age"]

df.loc[idx, "birthYear"] = df.loc[idx, :].apply(get_birthYear, axis=1)
df[df["birthYear"].isna() & df["age"].notna()][["birthYear", "age"]]

Unnamed: 0,birthYear,age
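
The same fill can be written without `apply`, since subtracting a column from a constant is already vectorized in pandas. This is a sketch on a hypothetical toy frame, using the same assumption as above that 2023 is the reference year. Keep in mind that `2023 - age` is only an estimate: it can be off by one year depending on whether the person's birthday had passed when the data was collected.

```python
import pandas as pd

# Hypothetical toy data: one row with age but no birthYear,
# one complete row, and one row where both are missing
toy = pd.DataFrame({
    "age": [53.0, 74.0, None],
    "birthYear": [None, 1949.0, None],
})

# Rows where birthYear is missing but age is available
mask = toy["birthYear"].isna() & toy["age"].notna()

# Vectorized fill: no row-wise function needed
toy.loc[mask, "birthYear"] = 2023 - toy.loc[mask, "age"]
print(toy)
```

The row with neither value stays missing, which is the desired behavior: there is nothing to infer it from.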


3. Fill missing values with the mean or median value

In [23]:
df1 = pd.DataFrame({
    "age": [13,19,12,16,15,13,None,14,13, None],
    "n_rounds": [5,6,7,8,3,4,5,6,6,8]
})
df1

Unnamed: 0,age,n_rounds
0,13.0,5
1,19.0,6
2,12.0,7
3,16.0,8
4,15.0,3
5,13.0,4
6,,5
7,14.0,6
8,13.0,6
9,,8


In [93]:
median_age = int(df1["age"].median())
mean_age = round(df1["age"].mean(),0)
print(median_age, mean_age)

13 14.0


In [94]:
df1["age"] = df1["age"].fillna(median_age)
df1

Unnamed: 0,age,n_rounds
0,13.0,5
1,19.0,6
2,12.0,7
3,16.0,8
4,15.0,3
5,13.0,4
6,13.0,5
7,14.0,6
8,13.0,6
9,13.0,8
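
A single median for the whole column can be a poor guess when the data contains distinct groups. One possible refinement is to fill each missing value with the median of its own group. The sketch below assumes a hypothetical grouping column `team`; neither that column nor `age_filled` exists in the examples above.

```python
import pandas as pd

# Hypothetical toy data with two clearly different age groups
toy = pd.DataFrame({
    "team": ["a", "a", "a", "b", "b", "b"],
    "age": [12.0, 14.0, None, 30.0, None, 40.0],
})

# transform("median") returns a Series aligned with the original rows,
# holding the median age of each row's team
group_median = toy.groupby("team")["age"].transform("median")

# Fill each gap with the median of its own team
toy["age_filled"] = toy["age"].fillna(group_median)
print(toy)
```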


4. Fill missing values with a default value

In [26]:
# Fill in missing organization with string "Unknown"
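# Assign the result back instead of using inplace=True on a selected column:
# recent pandas versions warn about that pattern, and it may not update df
df["organization"] = df["organization"].fillna("Unknown")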
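
When several columns need default values, `fillna` also accepts a dictionary that maps column names to fill values. A minimal sketch on a hypothetical toy frame (the choice of `"Unknown"` for `title` is only an illustration):

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "organization": ["LVMH", None],
    "title": [None, "CEO"],
    "age": [74.0, None],
})

# Columns not listed in the dictionary (here: age) are left untouched
filled = toy.fillna({"organization": "Unknown", "title": "Unknown"})
print(filled)
```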

## 2. Inconsistent data entries

In [95]:
# Get unique values of countries
df["country"].unique()

array(['France', 'United States', 'Mexico', 'India', 'Spain', 'China',
       'Canada', 'Germany', 'Switzerland', 'Belgium', 'Hong Kong',
       'Austria', 'Japan', 'United Kingdom', 'Australia', 'Indonesia',
       'United Arab Emirates', 'Russia', 'Chile', 'Monaco', 'RF',
       'Czech Republic', 'Sweden', nan, 'Thailand', 'Uzbekistan',
       'Singapore', 'Nigeria', 'Israel', 'Italy', 'South Africa',
       'Brazil', 'Malaysia', 'China ', 'South Korea', 'New Zealand',
       'Philippines', 'Taiwan', 'Norway', 'Egypt', 'Denmark',
       'Eswatini (Swaziland)', 'Colombia', 'Netherlands', 'Poland',
       'Bahamas', 'Ukraine', 'Cayman Islands', 'Greece', 'Turkey',
       'Argentina', 'Georgia', 'Portugal', 'Kazakhstan', 'Algeria',
       'Vietnam', 'Latvia', 'Finland', 'Bermuda', 'Luxembourg',
       'British Virgin Islands', 'Cambodia', 'Lebanon', 'Oman', 'Ireland',
       'Cyprus', 'Guernsey', 'Liechtenstein', 'P.R.C',
       'Turks and Caicos Islands', 'Romania', 'Qatar', 'Uruguay',

In [98]:
# How many rows have the country "RF"?
df[df['country']=="RF"].shape[0]

7

In [103]:
# There are 2 names for Russia: "Russia" and "RF"
# We need to change to "Russia"
indices = df[df['country']=="RF"].index
df.loc[indices, "country"] = "Russia"
df[df["country"]=="RF"]

Unnamed: 0,rank,finalWorth,category,personName,age,country,city,source,industries,countryOfCitizenship,...,cpi_change_country,gdp_country,gross_tertiary_education_enrollment,gross_primary_education_enrollment_country,life_expectancy_country,tax_revenue_country_country,total_tax_rate_country,population_country,latitude_country,longitude_country
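
The list of unique values above also contains `'China '` (with a trailing space) and `'P.R.C'`. Fixing such variants one at a time does not scale, so a possible approach is to strip whitespace first and then apply a mapping of aliases. The sketch below runs on a small hypothetical Series; treating `"P.R.C"` as an alias of `"China"` is an assumption that should be checked against the data.

```python
import pandas as pd

# Hypothetical sample of the inconsistent spellings seen above
countries = pd.Series(["China", "China ", "P.R.C", "RF", "Russia", None])

# Assumed alias table: variant spelling -> canonical name
aliases = {"RF": "Russia", "P.R.C": "China"}

# Strip surrounding whitespace, then replace known aliases
cleaned = countries.str.strip().replace(aliases)
print(cleaned.unique())
```

On the real data this would be applied as `df["country"] = df["country"].str.strip().replace(aliases)`.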


## 3. Parsing datetime

In [104]:
df["date"]

0       23/07/2023
1       22/06/2023
2       06/10/2023
3       17/09/2023
4       21/06/2023
           ...    
2635    05/05/2023
2636    23/09/2023
2637    01/09/2023
2638    13/02/2023
2639    18/06/2023
Name: date, Length: 2640, dtype: object

- Check dtype of column `date`

In [105]:
df["date"].dtype

dtype('O')

- Check for any inconsistent formats

In [107]:
# Get the normal length of the date string
df["date"].str.len()

0       10
1       10
2       10
3       10
4       10
        ..
2635    10
2636    10
2637    10
2638    10
2639    10
Name: date, Length: 2640, dtype: int64

In [109]:
# Find date strings whose length differs from the normal length
cond1 = df["date"].str.len() < 10
cond2 = df["date"].str.len() > 10
df[cond1 | cond2]["date"]

450    21/07/2023 22:08
523    17/04/2023 01:46
775    05/02/2023 02:45
Name: date, dtype: object

In [110]:
# Update date string to match
indices = df[cond1 | cond2].index
df.loc[indices, "date"] = df.loc[indices, "date"].str[:10]
df.loc[indices, "date"]

450    21/07/2023
523    17/04/2023
775    05/02/2023
Name: date, dtype: object

Parse to datetime

In [111]:
df["date"] = pd.to_datetime(df["date"], format="%d/%m/%Y")
df["date"].dtype

dtype('<M8[ns]')
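
The length check used earlier only catches strings that are too long or too short. A complementary way to find bad entries is to let `pd.to_datetime` try the expected format with `errors="coerce"`, which returns `NaT` for anything that does not match, and then look at the rows that failed. This is a sketch on a hypothetical Series of date strings:

```python
import pandas as pd

# Hypothetical date strings: one valid, one with a time suffix,
# one in a different format, one missing
dates = pd.Series(["23/07/2023", "21/07/2023 22:08", "2023-07-05", None])

# Entries that do not match the format become NaT instead of raising
parsed = pd.to_datetime(dates, format="%d/%m/%Y", errors="coerce")

# Entries that were present but failed to parse
bad = dates[parsed.isna() & dates.notna()]
print(bad)
```

Inspecting `bad` before overwriting the column makes it easier to decide how each malformed entry should be repaired.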

Get days, months, years of datetime

In [35]:
df["date"].dt.day

0       23
1       22
2        6
3       17
4       21
        ..
2635     5
2636    23
2637     1
2638    13
2639    18
Name: date, Length: 2640, dtype: int32

See the strftime directives [here](https://strftime.org/)
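
Months and years are available through the same `.dt` accessor, and `dt.strftime` formats dates back into strings using those directives. A short sketch on hypothetical dates (the column names in `parts` are illustrative):

```python
import pandas as pd

# Hypothetical dates in the same day/month/year format as the dataset
dates = pd.to_datetime(
    pd.Series(["23/07/2023", "06/10/2023"]), format="%d/%m/%Y"
)

parts = pd.DataFrame({
    "year": dates.dt.year,
    "month": dates.dt.month,
    "day": dates.dt.day,
    # Format back to text, here as ISO-style year-month-day
    "iso": dates.dt.strftime("%Y-%m-%d"),
})
print(parts)
```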

## 4. Some basic Pandas methods to explore data

- Get unique values of a column

In [112]:
df["gender"].unique()

array(['M', 'F'], dtype=object)

- Get a subset dataframe by a condition

In [113]:
df[df["city"]=="Hanoi"]

Unnamed: 0,rank,finalWorth,category,personName,age,country,city,source,industries,countryOfCitizenship,...,cpi_change_country,gdp_country,gross_tertiary_education_enrollment,gross_primary_education_enrollment_country,life_expectancy_country,tax_revenue_country_country,total_tax_rate_country,population_country,latitude_country,longitude_country
650,636,4300,Diversified,Pham Nhat Vuong,54.0,Vietnam,Hanoi,Diversified,Diversified,Vietnam,...,2.8,"$261,921,244,843",28.5,110.6,75.3,19.1,37.6,96462106.0,14.058324,108.277199
1708,1647,1800,Manufacturing,Tran Dinh Long,62.0,Vietnam,Hanoi,Steel,Manufacturing,Vietnam,...,2.8,"$261,921,244,843",28.5,110.6,75.3,19.1,37.6,96462106.0,14.058324,108.277199
1938,1905,1500,Finance & Investments,Ho Hung Anh,52.0,Vietnam,Hanoi,"Consumer products, banking",Finance & Investments,Vietnam,...,2.8,"$261,921,244,843",28.5,110.6,75.3,19.1,37.6,96462106.0,14.058324,108.277199


In [114]:
df[df["age"] > 95]

Unnamed: 0,rank,finalWorth,category,personName,age,country,city,source,industries,countryOfCitizenship,...,cpi_change_country,gdp_country,gross_tertiary_education_enrollment,gross_primary_education_enrollment_country,life_expectancy_country,tax_revenue_country_country,total_tax_rate_country,population_country,latitude_country,longitude_country
145,146,11800,Diversified,Robert Kuok,99.0,Hong Kong,Hong Kong,"Palm oil, shipping, property",Diversified,Malaysia,...,,,,,,,,,,
409,405,6300,Manufacturing,Stef Wertheimer & family,96.0,Israel,Tel Aviv,Metalworking tools,Manufacturing,Israel,...,0.8,"$395,098,666,122",63.4,104.9,82.8,23.1,25.3,9053300.0,31.046051,34.851612
592,591,4600,Media & Entertainment,Charles Dolan & family,96.0,United States,Oyster Bay,Cable television,Media & Entertainment,United States,...,7.5,"$21,427,700,000,000",88.2,101.8,78.5,9.6,36.6,328239500.0,37.09024,-95.712891
1216,1217,2500,Food & Beverage,S. Daniel Abraham,98.0,United States,Palm Beach,Slim-Fast,Food & Beverage,United States,...,7.5,"$21,427,700,000,000",88.2,101.8,78.5,9.6,36.6,328239500.0,37.09024,-95.712891
1342,1312,2300,Food & Beverage,David Murdock,99.0,United States,Ventura,"Dole, real estate",Food & Beverage,United States,...,7.5,"$21,427,700,000,000",88.2,101.8,78.5,9.6,36.6,328239500.0,37.09024,-95.712891
1398,1368,2200,Finance & Investments,Charles Munger,99.0,United States,Los Angeles,Berkshire Hathaway,Finance & Investments,United States,...,7.5,"$21,427,700,000,000",88.2,101.8,78.5,9.6,36.6,328239500.0,37.09024,-95.712891
1409,1368,2200,Healthcare,Alice Schwartz,96.0,United States,El Cerrito,Biotech,Healthcare,United States,...,7.5,"$21,427,700,000,000",88.2,101.8,78.5,9.6,36.6,328239500.0,37.09024,-95.712891
1941,1905,1500,Finance & Investments,Stephen Jarislowsky,97.0,Canada,Montreal,Money management,Finance & Investments,Canada,...,1.9,"$1,736,425,629,520",68.9,100.9,81.9,12.8,24.5,36991980.0,56.130366,-106.346771
2033,2020,1400,Manufacturing,John Farber,97.0,United States,New York,Chemicals,Manufacturing,United States,...,7.5,"$21,427,700,000,000",88.2,101.8,78.5,9.6,36.6,328239500.0,37.09024,-95.712891
2065,2020,1400,Fashion & Retail,Bernard Lewis & family,97.0,United Kingdom,London,Fashion retailer,Fashion & Retail,United Kingdom,...,1.7,"$2,827,113,184,696",60.0,101.2,81.3,25.5,30.6,66834400.0,55.378051,-3.435973


- Get a subset dataframe by multiple conditions

In [115]:
# Female billionaires who have citizenship of 'Japan'
df[(df["gender"]=="F") & (df["countryOfCitizenship"]=="Japan")]

Unnamed: 0,rank,finalWorth,category,personName,age,country,city,source,industries,countryOfCitizenship,...,cpi_change_country,gdp_country,gross_tertiary_education_enrollment,gross_primary_education_enrollment_country,life_expectancy_country,tax_revenue_country_country,total_tax_rate_country,population_country,latitude_country,longitude_country
2081,2020,1400,Real Estate,Yoshiko Mori,82.0,Japan,Tokyo,Real estate,Real Estate,Japan,...,0.5,"$5,081,769,542,380",63.2,98.8,84.2,11.9,46.7,126226568.0,36.204824,138.252924
2083,2020,1400,Healthcare,Tadako Nakatani,,Japan,Kobe,Medical diagnostic equipment,Healthcare,Japan,...,0.5,"$5,081,769,542,380",63.2,98.8,84.2,11.9,46.7,126226568.0,36.204824,138.252924
2164,2133,1300,Media & Entertainment,Keiko Erikawa,74.0,Japan,Yokohama City,Video games,Media & Entertainment,Japan,...,0.5,"$5,081,769,542,380",63.2,98.8,84.2,11.9,46.7,126226568.0,36.204824,138.252924


- Get unique values of a subset dataframe

In [116]:
# In which cities do Russian billionaires in the "Fashion & Retail" industry reside?
df[(df["country"]=="Russia") & (df["industries"]=="Fashion & Retail")]["city"].unique()

array(['Moscow region', 'Moscow', 'Krasnodar'], dtype=object)
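
The same kind of selection can be written with `.loc`, which takes the row condition and the column name in one step, and with `isin` when a column should match any value from a list. The sketch below uses a hypothetical toy frame and a hypothetical list `wanted`:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "country": ["Russia", "Russia", "Russia", "Japan"],
    "industries": ["Fashion & Retail", "Energy", "Technology", "Fashion & Retail"],
    "city": ["Moscow", "Moscow", "Krasnodar", "Tokyo"],
})

# Match any of several industries
wanted = ["Fashion & Retail", "Technology"]
mask = (toy["country"] == "Russia") & toy["industries"].isin(wanted)

# Select rows by mask and a single column in one .loc call
cities = toy.loc[mask, "city"].unique()
print(cities)
```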

- Get min, max, mean values of a subset

In [117]:
# Age of the youngest billionaire on the list
df["age"].min()

18.0

In [118]:
# Age of the oldest billionaire residing in "Canada"
df[df["country"]=="Canada"]["age"].max()

97.0
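
To get these statistics for every country at once instead of filtering one country at a time, `groupby` can be combined with `agg`. A minimal sketch on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data; missing ages are skipped by the aggregations
toy = pd.DataFrame({
    "country": ["Canada", "Canada", "Japan", "Japan", "Japan"],
    "age": [97.0, 60.0, 82.0, None, 74.0],
})

# One row per country, one column per statistic
summary = toy.groupby("country")["age"].agg(["min", "max", "mean"])
print(summary)
```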