# Checkpoint Three: Cleaning Data

Now you are ready to clean your data. Before starting coding, provide the link to your dataset below.

My dataset:

Import the necessary libraries and create your dataframe(s).

In [1]:
import pandas as pd
import matplotlib as mpl
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# boilerplate copy-in for various EDA libraries - can optimize later as needed

df22 = pd.read_csv(r'C:\Users\chris\PycharmProjects\eda-checkpoint\BACI_HS22_V202501 Data\BACI_HS22_Y2022_V202501.csv', dtype={'k':'str'}) # data for 2022

df23 = pd.read_csv(r'C:\Users\chris\PycharmProjects\eda-checkpoint\BACI_HS22_V202501 Data\BACI_HS22_Y2023_V202501.csv', dtype={'k':'str'}) # data for 2023

country_codes = pd.read_csv(r'C:\Users\chris\PycharmProjects\eda-checkpoint\BACI_HS22_V202501 Data\country_codes_V202501.csv') # country code keys

product_codes = pd.read_csv(r'C:\Users\chris\PycharmProjects\eda-checkpoint\BACI_HS22_V202501 Data\product_codes_HS22_V202501.csv', dtype=str) # product code keys

# used absolute paths for absolute certainty

In [2]:
df = pd.concat([df22, df23])

rename_map = {
    "t": "year",
    "i": "exporter",
    "j": "importer",
    "k": "product",
    "v": "value",
    "q": "quantity"
}

df = df.rename(columns=rename_map)

df.info()
df.head()

# boilerplate column renames for clarity

<class 'pandas.core.frame.DataFrame'>
Index: 22192308 entries, 0 to 11232738
Data columns (total 6 columns):
 #   Column    Dtype  
---  ------    -----  
 0   year      int64  
 1   exporter  int64  
 2   importer  int64  
 3   product   object 
 4   value     float64
 5   quantity  float64
dtypes: float64(2), int64(3), object(1)
memory usage: 1.2+ GB


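|   | year | exporter | importer | product | value | quantity |
|---|------|----------|----------|---------|-------|----------|
| 0 | 2022 | 4 | 20 | 210610 | 0.412 | 0.002 |
| 1 | 2022 | 4 | 20 | 210690 | 0.07 | 0.001 |
| 2 | 2022 | 4 | 20 | 271000 | 6.985 | 8.103 |
| 3 | 2022 | 4 | 20 | 843131 | 0.354 | 0.022 |
| 4 | 2022 | 4 | 31 | 080211 | 2.25 | 0.5 |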
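
The `df.info()` output above reports 22,192,308 entries while the index labels only run from 0 to 11,232,738, which suggests the concatenated frame reuses row labels from the two yearly files. The sketch below uses two small made-up frames (not the BACI data) to show how `ignore_index=True` would renumber the rows. Whether that matters depends on whether later steps select rows by label.

```python
import pandas as pd

# Toy stand-ins for df22 and df23 (hypothetical values, not the BACI data)
a = pd.DataFrame({"t": [2022, 2022], "v": [1.0, 2.0]})
b = pd.DataFrame({"t": [2023, 2023], "v": [3.0, 4.0]})

# Default concat keeps each frame's own labels: 0, 1, 0, 1
stacked = pd.concat([a, b])

# ignore_index=True assigns fresh labels: 0, 1, 2, 3
renumbered = pd.concat([a, b], ignore_index=True)

print(stacked.index.is_unique)     # False
print(renumbered.index.is_unique)  # True
```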


## Missing Data

Test your dataset for missing data and handle it as needed. Make notes in the form of code comments as to your thought process.

In [3]:
print("CEPI Data:")
for col in df.columns:
    pct_missing = np.mean(df[col].isnull())
    print('{} - {}%'.format(col, round(pct_missing*100)))

print('\nProduct Codes:')
for col in product_codes.columns:
    pct_missing = np.mean(product_codes[col].isnull())
    print('{} - {}%'.format(col, round(pct_missing*100)))

print('\nCountry Codes:')
for col in country_codes.columns:
    pct_missing = np.mean(country_codes[col].isnull())
    print('{} - {}%'.format(col, round(pct_missing*100)))

# Used for loops to iterate through all three dataframes to identify the % of nulls per column - a cleaner way to view nulls than .info()

# Nothing alarming: only quantity (4%) and country_iso2 (2%) show a non-zero share of nulls after rounding

CEPI Data:
year - 0%
exporter - 0%
importer - 0%
product - 0%
value - 0%
quantity - 4%

Product Codes:
code - 0%
description - 0%

Country Codes:
country_code - 0%
country_name - 0%
country_iso2 - 2%
country_iso3 - 0%


## Irregular Data

Detect outliers in your dataset and handle them as needed. Use code comments to make notes about your thought process.

In [18]:
df.describe() # checking the distribution of the data

df['value'].describe() # check for insane values

df['quantity'].describe() # check for insane values

df['year'].unique() # confirming there are no irregular years

missing_mask = ~df['product'].isin(product_codes['code'])
missing_rows = df[missing_mask]
print(missing_rows) # no product codes are unmapped

Empty DataFrame
Columns: [year, exporter, importer, product, value, quantity, product length]
Index: []
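
The `describe()` calls above check the distribution by eye. One common way to make that check explicit is the interquartile-range (IQR) rule, sketched here on a made-up series. This is a swapped-in technique for illustration, not the method used in this notebook, and `iqr_outliers` is a hypothetical helper name. Trade values are often heavily skewed, so flagged rows are candidates for review rather than automatic removal.

```python
import pandas as pd

# Toy series with one obviously large value (hypothetical values)
values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 1000.0], name="value")

def iqr_outliers(series, k=1.5):
    """Flag values more than k * IQR below Q1 or above Q3."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (series < lower) | (series > upper)

flags = iqr_outliers(values)
print(values[flags])
```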


## Unnecessary Data

Look for the different types of unnecessary data in your dataset and address it as needed. Make sure to use code comments to illustrate your thought process.

In [5]:
num_duplicate_occurrences = df.duplicated(keep='first').sum()
print("Duplicate rows (excluding first occurrences), CEPI data:", int(num_duplicate_occurrences))

num_duplicate_occurrences = product_codes.duplicated(keep='first').sum()
print("Duplicate rows (excluding first occurrences, product codes):", int(num_duplicate_occurrences))

num_duplicate_occurrences = country_codes.duplicated(keep='first').sum()
print("Duplicate rows (excluding first occurrences, country codes):", int(num_duplicate_occurrences))

# checking for entire-row-level duplicates

# from the results of my EDA, I'm initially going to focus only on the following product codes,
# which makes all the other product codes unnecessary:
# keycodes = ['381800', '848610', '848620', '848640', '851419', '852351', '852352', '852359', '854110', '854121',
#             '854129', '854190', '854231', '854232', '854233', '854239', '854290', '903082', '903141']

Duplicate rows (excluding first occurrences), CEPI data: 0
Duplicate rows (excluding first occurrences, product codes): 0
Duplicate rows (excluding first occurrences, country codes): 0
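
Dropping the product codes that fall outside `keycodes` could look like the sketch below. It runs on a made-up frame with an abbreviated code list. On the real data, the same `isin` filter would presumably be applied to `df['product']` with the full list from the comment above.

```python
import pandas as pd

# Abbreviated subset of the keycodes list, for illustration only
keycodes = ["381800", "848610", "854231"]

# Toy frame (hypothetical values); product is stored as a string
toy = pd.DataFrame({
    "product": ["381800", "080211", "854231", "271000"],
    "value": [1.0, 2.0, 3.0, 4.0],
})

# Keep only rows whose product code is in keycodes
subset = toy[toy["product"].isin(keycodes)].copy()
print(subset)
```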


## Inconsistent Data

Check for inconsistent data and address any that arises. As always, use code comments to illustrate your thought process.

In [6]:
# During my EDA, I discovered that pandas drops leading zeroes when it reads numeric-looking columns as integers, so I had pandas read those columns as strings to preserve my precious leading zeroes. That was the only thing that was inconsistent. Here's the code I used to double-check that the leading zeroes were in place and that my relevant columns contained 6 digits, as specified by the readme.

product_codes['code length'] = product_codes['code'].str.len()

unique_pc_lengths = product_codes['code length'].unique()

print(unique_pc_lengths) # this was to confirm that all my product codes were 6 digits long

df['product length'] = df['product'].str.len()

unique_pc_lengths_df = df['product length'].unique()

print(unique_pc_lengths_df) # this was to confirm that all my product codes were 6 digits long in the df

[6]
[6]
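
The leading-zero behavior can be reproduced on a tiny in-memory CSV. The sketch below uses made-up rows and the column names `k` and `v` from the raw files. It compares the default read with a `dtype` override, and shows `str.zfill` as one possible repair when a column has already been read as integers.

```python
import io
import pandas as pd

# Two made-up rows; the first product code starts with a zero
csv_text = "k,v\n080211,2.25\n210610,0.412\n"

as_default = pd.read_csv(io.StringIO(csv_text))                  # k inferred as integer
as_string = pd.read_csv(io.StringIO(csv_text), dtype={"k": str})  # k kept as text

print(as_default["k"].tolist())  # [80211, 210610]
print(as_string["k"].tolist())   # ['080211', '210610']

# Possible repair after the fact: pad back to 6 characters
repaired = as_default["k"].astype(str).str.zfill(6)
print(repaired.tolist())         # ['080211', '210610']
```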


## Summarize Your Results

Make note of your answers to the following questions.

1. Did you find all four types of dirty data in your dataset? No, 3 out of 4 found. Missing: found and safely ignored; Irregular: not found; Unnecessary: found and cleaned; Inconsistent: found and cleaned.
2. Did the process of cleaning your data give you new insights into your dataset? Yes, it taught me the importance of the leading zeroes in this dataset's product codes. It also helped me understand more fully that the dataset is near-spotless and distributed as expected for the type of data it represents.
3. Is there anything you would like to make note of when it comes to manipulating the data and making visualizations? I'll need to split the concatenated dataframe into the whole vs. the relevant part (Taiwan) for the product codes we've deemed relevant in 'keycodes'. I'm fairly certain I'll be able to do all of that in Tableau once I make a smaller .csv that contains strictly the data relevant to this analysis, which should be easy to do thanks to our EDA and cleaning.
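
The smaller .csv described in the last answer could be produced along these lines. This is a sketch on made-up data: `exporter_code = 999` is a hypothetical placeholder, and the real code for Taiwan would need to be looked up in `country_codes`. The example writes to an in-memory buffer, whereas a real export would pass a file path to `to_csv`.

```python
import io
import pandas as pd

keycodes = ["854231", "854232"]  # abbreviated subset, for illustration only
exporter_code = 999              # hypothetical placeholder; look up the real code in country_codes

# Toy frame (hypothetical values)
toy = pd.DataFrame({
    "year": [2022, 2022, 2023, 2023],
    "exporter": [999, 4, 999, 999],
    "importer": [20, 20, 31, 31],
    "product": ["854231", "854231", "080211", "854232"],
    "value": [1.0, 2.0, 3.0, 4.0],
})

# Keep rows for the chosen exporter and the relevant product codes
mask = toy["product"].isin(keycodes) & (toy["exporter"] == exporter_code)
subset = toy[mask]

# Write without the index column so the file holds only the data fields
buffer = io.StringIO()
subset.to_csv(buffer, index=False)

# Read back with product as a string so the codes stay intact
roundtrip = pd.read_csv(io.StringIO(buffer.getvalue()), dtype={"product": str})
print(roundtrip)
```

After importing the file elsewhere, it may be worth confirming that the product column is still treated as text rather than as a number.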