# BAIS:3250 - Final Project
### Data Transformation

**Author(s):** Natalie Brown, Max Kaiser

**Date Modified:** 11-21-2024 (*date created:* 11-21-2024)


**Description:** Transforming columns and dropping / imputing nulls for final dataset

---

### Import Libraries
* **pandas:** for data frames and data cleaning functions

In [3]:
import pandas as pd

---
### Load Data
* **05_country_merged_clean.csv**

In [5]:
# load data
country_df=pd.read_csv('05_country_merged_clean.csv',sep=',',encoding='utf-8')

# display
country_df.head()

Unnamed: 0,country,birth_rt,currency_code,fertility_rt,infant_mortality_rt,life_expectancy,official_language,physicians_per_thousand,lat,long,...,gross_domestic_product_usd(b),primary_education_enrollment_pct,secondary_education_enrollment_pct,tax_revenue_pct,unemployment_rt,continent,covid_cases,covid_deaths,total_area_km2,average_population
0,Afghanistan,32.49,AFN,4.47,47.9,64.5,Pashto,0.28,33.93911,67.709953,...,19.1014,104.0,9.7,9.3,11.12,Asia,46498.0,1774.0,652000.0,38673974
1,Albania,11.78,ALL,1.62,7.8,78.5,Albanian,1.2,41.153332,20.168331,...,15.2781,107.0,55.0,18.6,12.33,Europe,37625.0,798.0,28748.0,2865340
2,Algeria,24.28,DZD,3.02,20.1,76.7,Arabic,1.72,28.033886,1.659626,...,169.9882,109.9,51.4,37.2,11.7,Africa,83199.0,2431.0,2381741.0,43621542
3,Andorra,7.2,EUR,1.27,2.7,,Catalan,3.33,42.506285,1.521801,...,3.1541,106.4,,,,Europe,6712.0,76.0,468.0,77230
4,Angola,40.73,AOA,5.52,51.6,60.8,Portuguese,0.21,-11.202692,17.873887,...,94.6354,113.5,9.3,9.2,6.89,Africa,15139.0,348.0,1246620.0,32569069


In [6]:
# List of countries to drop 
countries_to_drop = ['North Macedonia', 'Palestinian National Authority', 'Vatican City'] 

# Drop rows with these countries 
country_df=country_df[~country_df['country'].isin(countries_to_drop)]
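Because `isin` silently ignores names that do not appear in the column, a misspelled country would simply not be dropped. A minimal sketch of a pre-check, using a toy frame (`demo_df`, `names_to_drop`, and the other names here are illustrative, not part of the project data):

```python
import pandas as pd

# Toy frame standing in for country_df (illustrative rows only)
demo_df = pd.DataFrame({'country': ['Albania', 'Vatican City', 'Angola']})
names_to_drop = ['Vatican City', 'North Macedonia']

# Names absent from the country column would be silently skipped by isin
unmatched = sorted(set(names_to_drop) - set(demo_df['country']))
print("Not found:", unmatched)

# Drop the matching rows and count how many were actually removed
filtered_df = demo_df[~demo_df['country'].isin(names_to_drop)]
rows_removed = len(demo_df) - len(filtered_df)
print("Rows removed:", rows_removed)
```

Comparing `rows_removed` with the length of the drop list is a quick way to notice a name that did not match.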

---
### Null Display
* determine how many nulls are in each column and decide whether to impute or drop them
    * we will look up missing values on the WHO (World Health Organization) website and other reputable sources, then build a map of values to impute
* columns with nulls:
  - [x] lat
  - [x] long
  - [x] infant_mortality_rt
  - [x] birth_rt
  - [x] ~ag_land_pct~
      * we can drop this because we have data to compute a new column ourselves
  - [x] physicians_per_thousand
  - [x] fertility_rt
  - [x] primary_education_enrollment_pct
  - [x] life_expectancy
  - [x] secondary_education_enrollment_pct
  - [x] ~currency_code~
      * drop 
  - [x] ~consumer_price_index~
      * drop
  - [x] unemployment_rt
  - [x] tax_revenue_pct
  - [x] continent
  - [x] ~official_language~
      * drop
  - [x] covid_deaths
  - [x] covid_cases
        

In [8]:
# function for nulls, displays a dataframe of null counts and percentages
def df_nulls(df):
    # use the df argument (not the global country_df) so the function works on any frame
    null_counts=df.isna().sum()
    null_percentages=round((df.isna().sum()/df.shape[0])*100,2)

    # data frame for results
    nulls_df=pd.DataFrame({
        'Nulls':null_counts,
        'Nulls (%)':null_percentages
    })
    
    # display
    print("Country Data Nulls")
    display(nulls_df.sort_values(by='Nulls',ascending=False))

df_nulls(country_df)

Country Data Nulls


Unnamed: 0,Nulls,Nulls (%)
tax_revenue_pct,23,11.98
unemployment_rt,16,8.33
consumer_price_index,14,7.29
currency_code,14,7.29
secondary_education_enrollment_pct,9,4.69
covid_cases,6,3.12
continent,6,3.12
covid_deaths,6,3.12
total_area_km2,6,3.12
life_expectancy,5,2.6


In [9]:
# function for displaying all nulls in a column
def find_nulls(df,column):
    return df[df[column].isnull()][['country',column]]

def null_list(df,column):
    return df[df[column].isnull()]['country'].tolist()
    
# function to display column format
def col_format(df,col):
    format_list=df[col].unique()
    return format_list[0:5]
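The imputation cells that follow all repeat the same loop over a country-to-value dictionary. As a rough sketch, that loop could be wrapped in one more helper that also reports dictionary keys with no matching row. `impute_from_map` is a hypothetical name, not a function defined in this notebook, and the frame below is a toy example:

```python
import pandas as pd

def impute_from_map(df, column, values_by_country):
    """Set `column` for each country in the mapping; return names with no matching row."""
    missing = []
    for country, value in values_by_country.items():
        mask = df['country'] == country
        if mask.any():
            df.loc[mask, column] = value
        else:
            # a key that matches no row usually means a spelling mismatch
            missing.append(country)
    return missing

# Toy frame standing in for country_df
demo_df = pd.DataFrame({
    'country': ['Eswatini', 'Nauru', 'Albania'],
    'birth_rt': [float('nan'), float('nan'), 11.78],
})
not_found = impute_from_map(demo_df, 'birth_rt', {'Eswatini': 22.3, 'Nauru': 20.2, 'Tuvalu': 22})
print(not_found)
```

Returning the unmatched names instead of raising keeps the helper usable interactively; whether that is preferable to a hard error is a judgment call.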

---
**latitude**

In [11]:
# find countries with null
find_nulls(country_df,'lat')

Unnamed: 0,country,lat
150,Sao Tome and Principe,


In [12]:
# find format
col_format(country_df,'lat')

array([ 33.93911 ,  41.153332,  28.033886,  42.506285, -11.202692])

**latitude found on [GPS Coordinates](https://gps-coordinates.org/sao-tome-and-principe-latitude.php)**
- 0.203237

In [14]:
# impute
country_df.loc[country_df['country'] == 'Sao Tome and Principe', 'lat'] = 0.203237

# review
find_nulls(country_df,'lat')

Unnamed: 0,country,lat


---
**longitude**

In [16]:
# find countries with null
find_nulls(country_df,'long')

Unnamed: 0,country,long
150,Sao Tome and Principe,


**longitude found on [GPS Coordinates](https://gps-coordinates.org/sao-tome-and-principe-latitude.php)**
- 6.608357

In [18]:
# impute
country_df.loc[country_df['country'] == 'Sao Tome and Principe', 'long'] = 6.608357

# review
find_nulls(country_df,'long')

Unnamed: 0,country,long


---
**infant mortality rate found on [CIA World Factbook](https://www.cia.gov/the-world-factbook/field/infant-mortality-rate/country-comparison/)**

In [20]:
# find nulls
find_nulls(country_df,'infant_mortality_rt')

Unnamed: 0,country,infant_mortality_rt
56,Eswatini,
98,Liechtenstein,
120,Nauru,


In [21]:
# get current format
col_format(country_df,'infant_mortality_rt')

array([47.9,  7.8, 20.1,  2.7, 51.6])

In [22]:
# define infant mortality rates
impute_mortality_rts={
    'Eswatini':36.7,
    'Liechtenstein':3.9,
    'Nauru':7.6
}

# impute
for country,rt in impute_mortality_rts.items():
    country_df.loc[country_df['country']==country,'infant_mortality_rt']=rt

# review
find_nulls(country_df,'infant_mortality_rt')

Unnamed: 0,country,infant_mortality_rt
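The `.loc` loop above assigns a value for every country in the dictionary, whether or not the existing cell is null. One possible alternative, sketched here on a toy frame, is to combine `Series.map` with `fillna` so that only missing cells are filled. The `99.0` entry for Albania is a deliberate placeholder to show that a non-null value is left alone; it is not real data:

```python
import pandas as pd

# Toy frame standing in for country_df
demo_df = pd.DataFrame({
    'country': ['Eswatini', 'Liechtenstein', 'Albania'],
    'infant_mortality_rt': [float('nan'), float('nan'), 7.8],
})
# 99.0 is a placeholder to demonstrate that existing values are not overwritten
rates = {'Eswatini': 36.7, 'Liechtenstein': 3.9, 'Albania': 99.0}

# map() builds a Series aligned to the rows; fillna() uses it only where the column is null
demo_df['infant_mortality_rt'] = demo_df['infant_mortality_rt'].fillna(
    demo_df['country'].map(rates)
)
print(demo_df)
```

This avoids the explicit loop, at the cost of not flagging dictionary keys that match no country.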


---
**birth rate found on [CIA World Factbook](https://www.cia.gov/the-world-factbook/field/birth-rate/country-comparison/)**

In [24]:
# find nulls
find_nulls(country_df,'birth_rt')

Unnamed: 0,country,birth_rt
56,Eswatini,
120,Nauru,
181,Tuvalu,


In [25]:
# get format
col_format(country_df,'birth_rt')

array([32.49, 11.78, 24.28,  7.2 , 40.73])

In [26]:
# define rates
impute_birth_rts={
    'Eswatini':22.3,
    'Nauru':20.2,
    'Tuvalu':22,
}

# impute
for country,rt in impute_birth_rts.items():
    country_df.loc[country_df['country']==country,'birth_rt']=rt

# review
find_nulls(country_df,'birth_rt')

Unnamed: 0,country,birth_rt


---
**fertility rate found on [CIA World Factbook](https://www.cia.gov/the-world-factbook/field/total-fertility-rate/country-comparison/)**

In [28]:
# find nulls
find_nulls(country_df,'fertility_rt')

Unnamed: 0,country,fertility_rt
56,Eswatini,
113,Monaco,
120,Nauru,
181,Tuvalu,


In [29]:
# get format
col_format(country_df,'fertility_rt')

array([4.47, 1.62, 3.02, 1.27, 5.52])

In [30]:
# define rates
impute_fertility_rts={
    'Eswatini':2.37,
    'Monaco':1.54,
    'Nauru':2.55,
    'Tuvalu':2.78
}

# impute
for country,rt in impute_fertility_rts.items():
    country_df.loc[country_df['country']==country,'fertility_rt']=rt

# review
find_nulls(country_df,'fertility_rt')

Unnamed: 0,country,fertility_rt


---
**unemployment rate found on [CIA World Factbook](https://www.cia.gov/the-world-factbook/field/unemployment-rate/country-comparison/)**

In [32]:
# find nulls
find_nulls(country_df,'unemployment_rt')

Unnamed: 0,country,unemployment_rt
3,Andorra,
5,Antigua and Barbuda,
48,Dominica,
56,Eswatini,
67,Grenada,
89,Kiribati,
98,Liechtenstein,
107,Marshall Islands,
111,Micronesia,
113,Monaco,


In [33]:
# get format
col_format(country_df,'unemployment_rt')

array([11.12, 12.33, 11.7 ,   nan,  6.89])

In [34]:
# dictionary
impute_ue_rts={
    'Andorra': 3.7,
    'Antigua and Barbuda': 11,
    'Dominica': 11,
    'Eswatini': 37.64,
    'Grenada': 24,
    'Kiribati': 8.6,
    'Liechtenstein': 2.4,
    'Marshall Islands': 9.8,
    'Micronesia': 8.9, 
    'Monaco': 6.3,
    'Nauru': 5.1,
    'Palau': 1.7, 
    'San Marino': 8.1, 
    'Seychelles': 3, 
    'Tuvalu': 11,
    'Saint Kitts and Nevis':5.09
}

# impute
for country,rt in impute_ue_rts.items():
    country_df.loc[country_df['country']==country,'unemployment_rt']=rt

# review
find_nulls(country_df,'unemployment_rt')

Unnamed: 0,country,unemployment_rt


---
**tax revenue percentage found on [CIA World Factbook](https://www.cia.gov/the-world-factbook/field/taxes-and-other-revenues/)**

In [36]:
# find nulls
find_nulls(country_df,'tax_revenue_pct')

Unnamed: 0,country,tax_revenue_pct
3,Andorra,
24,Brunei,
34,Chad,
38,Comoros,
42,Cuba,
47,Djibouti,
50,Ecuador,
54,Eritrea,
71,Guyana,
72,Haiti,


In [37]:
# null list
null_list(country_df,'tax_revenue_pct')

['Andorra',
 'Brunei',
 'Chad',
 'Comoros',
 'Cuba',
 'Djibouti',
 'Ecuador',
 'Eritrea',
 'Guyana',
 'Haiti',
 'Libya',
 'Liechtenstein',
 'Mauritania',
 'Monaco',
 'Montenegro',
 'Nauru',
 'North Korea',
 'Panama',
 'South Sudan',
 'Turkmenistan',
 'Tuvalu',
 'Venezuela',
 'Yemen']

In [38]:
# get format
col_format(country_df,'tax_revenue_pct')

array([ 9.3, 18.6, 37.2,  nan,  9.2])

In [39]:
# define tax revenues
impute_tax_revenue_percentages={
    'Andorra': 69,
    'Brunei': 18.5,
    'Chad': 13.08,
    'Comoros': 14.22,
    'Cuba': 58.1,
    'Djibouti': 18.95,
    'Ecuador': 13.0,
    'Eritrea': 34.9,
    'Guyana': 15.37,
    'Haiti': 6.24,
    'Libya': 51.6,
    'Liechtenstein': 52.3,
    'Mauritania': 27.4,
    'Monaco': 14.4,
    'Montenegro': 37.2,
    'Nauru': 44.3,
    'Panama': 7.5,
    'South Sudan': 8.5,
    'Turkmenistan': 14.9,
    'Tuvalu': 18.5,
    'Venezuela': 5.96,
    'Yemen': 9.58,
    'North Korea':0 # taxation officially abolished
    }

# impute
for country,pct in impute_tax_revenue_percentages.items():
    country_df.loc[country_df['country']==country,'tax_revenue_pct']=pct

# review
find_nulls(country_df,'tax_revenue_pct')


Unnamed: 0,country,tax_revenue_pct


---
**life expectancy found on [CIA World Factbook](https://www.cia.gov/the-world-factbook/field/life-expectancy-at-birth/country-comparison/)**

In [41]:
# find nulls
find_nulls(country_df,'life_expectancy')

Unnamed: 0,country,life_expectancy
3,Andorra,
56,Eswatini,
113,Monaco,
120,Nauru,
181,Tuvalu,


In [42]:
# list of countries
null_list(country_df,'life_expectancy')

['Andorra', 'Eswatini', 'Monaco', 'Nauru', 'Tuvalu']

In [43]:
# get format
col_format(country_df,'life_expectancy')

array([64.5, 78.5, 76.7,  nan, 60.8])

In [44]:
# define list to impute
impute_life_expectancies={
    'Andorra':83.8,
    'Eswatini':60.7,
    'Monaco':89.8,
    'Nauru':68.6,
    'Tuvalu':69
}

# impute
for country,expectancy in impute_life_expectancies.items():
    country_df.loc[country_df['country']==country,'life_expectancy']=expectancy

# review
find_nulls(country_df,'life_expectancy')

Unnamed: 0,country,life_expectancy


---
**physicians per thousand found on CIA World Factbook**

In [46]:
# find nulls
find_nulls(country_df,'physicians_per_thousand')

Unnamed: 0,country,physicians_per_thousand
56,Eswatini,
98,Liechtenstein,
120,Nauru,
163,South Sudan,


In [47]:
# list of nulls
null_list(country_df,'physicians_per_thousand')

['Eswatini', 'Liechtenstein', 'Nauru', 'South Sudan']

In [48]:
# get format
col_format(country_df,'physicians_per_thousand')

array([0.28, 1.2 , 1.72, 3.33, 0.21])

In [49]:
# define dictionary to impute
impute_phys_per_thousand={
    'Eswatini':0.14,
    'Liechtenstein':2.5,
    'Nauru':1.35,
    'South Sudan':0.04
}

# impute
for country,stat in impute_phys_per_thousand.items():
    country_df.loc[country_df['country']==country,'physicians_per_thousand']=stat

# review
find_nulls(country_df,'physicians_per_thousand')


Unnamed: 0,country,physicians_per_thousand


---
**primary education enrollment percentage found on [Our World in Data](https://ourworldindata.org/grapher/total-net-enrollment-rate-in-primary-education?tab=table&time=2021..latest)**

In [51]:
# find nulls
find_nulls(country_df,'primary_education_enrollment_pct')

Unnamed: 0,country,primary_education_enrollment_pct
21,Bosnia and Herzegovina,
56,Eswatini,
113,Monaco,
120,Nauru,


In [52]:
# list
null_list(country_df,'primary_education_enrollment_pct')

['Bosnia and Herzegovina', 'Eswatini', 'Monaco', 'Nauru']

In [53]:
# format
col_format(country_df,'primary_education_enrollment_pct')

array([104. , 107. , 109.9, 106.4, 113.5])

In [54]:
# define dictionary to impute
prim_enrollment_pct={
    'Bosnia and Herzegovina':84.4,
    'Eswatini':89.5,
    'Monaco':97.9,
    'Nauru':96.3
}

# impute
for country,pct in prim_enrollment_pct.items():
    country_df.loc[country_df['country']==country,'primary_education_enrollment_pct']=pct

# review
find_nulls(country_df,'primary_education_enrollment_pct')

Unnamed: 0,country,primary_education_enrollment_pct


---
**secondary education enrollment percentage found on [Our World in Data](https://ourworldindata.org/grapher/total-net-enrollment-rate-in-primary-education?tab=table&time=2021..latest)**

In [56]:
# find nulls
find_nulls(country_df,'secondary_education_enrollment_pct')

Unnamed: 0,country,secondary_education_enrollment_pct
3,Andorra,
20,Bolivia,
56,Eswatini,
89,Kiribati,
113,Monaco,
120,Nauru,
159,Solomon Islands,
163,South Sudan,
181,Tuvalu,


In [57]:
# get list
null_list(country_df,'secondary_education_enrollment_pct')

['Andorra',
 'Bolivia',
 'Eswatini',
 'Kiribati',
 'Monaco',
 'Nauru',
 'Solomon Islands',
 'South Sudan',
 'Tuvalu']

In [58]:
# define dictionary to impute
impute_secd_enrollment_pct={
    'Andorra':97.5,
    'Bolivia':91.5,
    'Eswatini':86,
    'Kiribati':81.6,
    'Monaco':154.2,
    'Nauru':86.4,
    'Solomon Islands':48.3,
    'South Sudan':11.2,
    'Tuvalu':91.3
}

# impute
for country,pct in impute_secd_enrollment_pct.items():
    country_df.loc[country_df['country']==country,'secondary_education_enrollment_pct']=pct

# review
find_nulls(country_df,'secondary_education_enrollment_pct')

Unnamed: 0,country,secondary_education_enrollment_pct


---
**continent**

In [60]:
# find nulls
find_nulls(country_df,'continent')

Unnamed: 0,country,continent
29,Cape Verde,
122,Netherlands,
127,North Korea,
145,Saint Kitts and Nevis,
162,South Korea,
174,East Timor,


In [61]:
# get list
null_list(country_df,'continent')

['Cape Verde',
 'Netherlands',
 'North Korea',
 'Saint Kitts and Nevis',
 'South Korea',
 'East Timor']

In [62]:
# define dictionary to impute
impute_continent={
    'Cape Verde':'Africa',
    'Netherlands':'Europe',
    'North Korea':'Asia',
    'Saint Kitts and Nevis':'North America',
    'South Korea':'Asia',
    'East Timor':'Asia'
    
}

# impute
for country,continent in impute_continent.items():
    country_df.loc[country_df['country']==country,'continent']=continent

# review
find_nulls(country_df,'continent')

Unnamed: 0,country,continent


---
**covid deaths**

In [64]:
# find nulls
find_nulls(country_df,'covid_deaths')

Unnamed: 0,country,covid_deaths
29,Cape Verde,
122,Netherlands,
127,North Korea,
145,Saint Kitts and Nevis,
162,South Korea,
174,East Timor,


In [65]:
# list countries with nulls
null_list(country_df,'covid_deaths')

['Cape Verde',
 'Netherlands',
 'North Korea',
 'Saint Kitts and Nevis',
 'South Korea',
 'East Timor']

In [66]:
# define dictionary to impute
impute_deaths={
    'Cape Verde':417,
    'Netherlands':22986,
    'North Korea':74,
    'Saint Kitts and Nevis':0,
    'South Korea':20000,
    'East Timor':0
}

# impute
for country,death in impute_deaths.items():
    country_df.loc[country_df['country']==country,'covid_deaths']=death

# review
find_nulls(country_df,'covid_deaths')

Unnamed: 0,country,covid_deaths


---
**covid cases**


In [68]:
# find nulls
find_nulls(country_df,'covid_cases')

Unnamed: 0,country,covid_cases
29,Cape Verde,
122,Netherlands,
127,North Korea,
145,Saint Kitts and Nevis,
162,South Korea,
174,East Timor,


In [69]:
# list countries with nulls
null_list(country_df,'covid_cases')

['Cape Verde',
 'Netherlands',
 'North Korea',
 'Saint Kitts and Nevis',
 'South Korea',
 'East Timor']

In [70]:
# define dictionary to impute
impute_cases={
    'Cape Verde':64474,
    'Netherlands':8300000,
    'North Korea':4772813,
    'Saint Kitts and Nevis':0,
    'South Korea':14500000,
    'East Timor':0

    
}

# impute
for country,case in impute_cases.items():
    country_df.loc[country_df['country']==country,'covid_cases']=case

# review
find_nulls(country_df,'covid_cases')

Unnamed: 0,country,covid_cases


---
**drop columns**

In [72]:
country_df=country_df.drop(columns=['ag_land_pct','official_language','consumer_price_index','currency_code','total_area_km2','land_area_km2'])

---
**check nulls before saving**

In [74]:
df_nulls(country_df)

Country Data Nulls


Unnamed: 0,Nulls,Nulls (%)
country,0,0.0
primary_education_enrollment_pct,0,0.0
covid_deaths,0,0.0
covid_cases,0,0.0
continent,0,0.0
unemployment_rt,0,0.0
tax_revenue_pct,0,0.0
secondary_education_enrollment_pct,0,0.0
gross_domestic_product_usd(b),0,0.0
birth_rt,0,0.0


---
### Save Data
* **06_country_merged_imputed.csv**

In [76]:
# save csv file
country_df.to_csv('06_country_merged_imputed.csv',sep=',',encoding='utf-8',index=False)