# BAIS:3250 - Final Project
### Merge Datasets

**Author(s):** Natalie Brown, Max Kaiser

**Date Modified:** 11-15-2024 (*date created:* 11-15-2024)


**Description:** Merge the country data from Kaggle with the country data from the API

---

### Import Libraries
* **pandas:** for data frames and data cleaning functions

In [3]:
import pandas as pd

---
### Load Data
* **03_world_data_clean.csv** - Kaggle data
* **01_api_country_data_clean.csv** - API data

In [5]:
# load data frames
kaggle_df = pd.read_csv('03_world_data_clean.csv', sep=',', encoding='utf-8')
api_df = pd.read_csv('01_api_country_data_clean.csv', sep=',', encoding='utf-8')

In [6]:
# display kaggle data
kaggle_df.head()

Unnamed: 0,country,birth_rt,currency_code,fertility_rt,infant_mortality_rt,life_expectancy,official_language,physicians_per_thousand,lat,long,ag_land_pct,land_area_km2,consumer_price_index,gross_domestic_product_USD(b),primary_education_enrollment_pct,secondary_education_enrollment_pct,population,tax_revenue_pct,unemployment_rt
0,Afghanistan,32.49,AFN,4.47,47.9,64.5,Pashto,0.28,33.93911,67.709953,58.1,652230.0,149.9,19.1014,104.0,9.7,38041754.0,9.3,11.12
1,Albania,11.78,ALL,1.62,7.8,78.5,Albanian,1.2,41.153332,20.168331,43.1,28748.0,119.05,15.2781,107.0,55.0,2854191.0,18.6,12.33
2,Algeria,24.28,DZD,3.02,20.1,76.7,Arabic,1.72,28.033886,1.659626,17.4,2381741.0,151.36,169.9882,109.9,51.4,43053054.0,37.2,11.7
3,Andorra,7.2,EUR,1.27,2.7,,Catalan,3.33,42.506285,1.521801,40.0,468.0,,3.1541,106.4,,77142.0,,
4,Angola,40.73,AOA,5.52,51.6,60.8,Portuguese,0.21,-11.202692,17.873887,47.5,1246700.0,261.73,94.6354,113.5,9.3,31825295.0,9.2,6.89


In [7]:
# display api data
api_df.head()

Unnamed: 0,country,continent,covid_cases,covid_deaths,area_km2,population
0,Afghanistan,Asia,46498,1774,652000,39306195
1,Albania,Europe,37625,798,28748,2876490
2,Algeria,Africa,83199,2431,2381741,44190030
3,Andorra,Europe,6712,76,468,77317
4,Angola,Africa,15139,348,1246620,33312843


---
#### Prepare Data Frames for Merging
- [x] Get unique country names for both and ensure the formatting matches

In [9]:
# create list of unique countries in kaggle df
kaggle_country_list=kaggle_df['country'].unique()

# display
print(kaggle_country_list)

['Afghanistan' 'Albania' 'Algeria' 'Andorra' 'Angola'
 'Antigua and Barbuda' 'Argentina' 'Armenia' 'Australia' 'Austria'
 'Azerbaijan' 'The Bahamas' 'Bahrain' 'Bangladesh' 'Barbados' 'Belarus'
 'Belgium' 'Belize' 'Benin' 'Bhutan' 'Bolivia' 'Bosnia and Herzegovina'
 'Botswana' 'Brazil' 'Brunei' 'Bulgaria' 'Burkina Faso' 'Burundi'
 'Ivory Coast' 'Cape Verde' 'Cambodia' 'Cameroon' 'Canada'
 'Central African Republic' 'Chad' 'Chile' 'China' 'Colombia' 'Comoros'
 'Republic of the Congo' 'Costa Rica' 'Croatia' 'Cuba' 'Cyprus'
 'Czech Republic' 'Democratic Republic of the Congo' 'Denmark' 'Djibouti'
 'Dominica' 'Dominican Republic' 'Ecuador' 'Egypt' 'El Salvador'
 'Equatorial Guinea' 'Eritrea' 'Estonia' 'Eswatini' 'Ethiopia' 'Fiji'
 'Finland' 'France' 'Gabon' 'The Gambia' 'Georgia' 'Germany' 'Ghana'
 'Greece' 'Grenada' 'Guatemala' 'Guinea' 'Guinea-Bissau' 'Guyana' 'Haiti'
 'Vatican City' 'Honduras' 'Hungary' 'Iceland' 'India' 'Indonesia' 'Iran'
 'Iraq' 'Republic of Ireland' 'Israel' 'Italy' '

In [10]:
# create list of unique countries in api df
api_country_list=api_df['country'].unique()

# display
print(api_country_list)

['Afghanistan' 'Albania' 'Algeria' 'Andorra' 'Angola'
 'Antigua And Barbuda' 'Argentina' 'Armenia' 'Australia' 'Austria'
 'Azerbaijan' 'Bahamas' 'Bahrain' 'Bangladesh' 'Barbados' 'Belarus'
 'Belgium' 'Belize' 'Benin' 'Bhutan' 'Bolivia' 'Bosnia and Herzegovina'
 'Botswana' 'Brazil' 'Brunei' 'Bulgaria' 'Burkina Faso' 'Burundi'
 'Cambodia' 'Cameroon' 'Canada' 'Central African Republic' 'Chad' 'Chile'
 'China' 'Colombia' 'Comoros' 'Congo' 'Costa Rica' "Cote D'Ivoire"
 'Croatia' 'Cuba' 'Cyprus' 'Czech Republic' 'Denmark' 'Djibouti'
 'Dominica' 'Dominican Republic' 'Ecuador' 'Egypt' 'El Salvador'
 'Equatorial Guinea' 'Eritrea' 'Estonia' 'Ethiopia' 'Fiji' 'Finland'
 'France' 'Gabon' 'Gambia' 'Georgia' 'Germany' 'Ghana' 'Greece' 'Grenada'
 'Guatemala' 'Guinea' 'Guinea-Bissau' 'Guyana' 'Haiti' 'Honduras'
 'Hungary' 'Iceland' 'India' 'Indonesia' 'Iran' 'Iraq' 'Ireland' 'Israel'
 'Italy' 'Jamaica' 'Japan' 'Jordan' 'Kazakhstan' 'Kenya' 'Kiribati'
 'Kuwait' 'Kyrgyzstan' 'Laos' 'Latvia' 'Lebanon' 'L

**There is an encoding error in one of the country names in kaggle_df**
* **error:** Sï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½
- [x] Compare the error row in kaggle_df with the countries in api_df
- [x] No exact match was found. The row is most likely Sao Tome and Principe: it comes right after San Marino in alphabetical order, its currency code is STN, and its population (215,056) is close to the API value (220,914), with the small gap plausibly due to different data collection times. Replace the garbled name with Sao Tome and Principe.
- [x] Create a map to replace names in kaggle_df with the format used in api_df
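
The garbled name appears to be the UTF-8 bytes of the Unicode replacement character (U+FFFD) read back as Latin-1, which would mean the original letters were already lost before this file was written and cannot be recovered by re-decoding. A minimal sketch of how such names could be flagged, in plain Python; the helper `has_replacement_debris` and the toy name list are hypothetical and not part of the notebook:

```python
# U+FFFD is the Unicode replacement character; its UTF-8 bytes are EF BF BD.
# Decoding those three bytes as Latin-1 yields the three-character debris.
DEBRIS = "\ufffd".encode("utf-8").decode("latin-1")


def has_replacement_debris(name: str) -> bool:
    """Return True if a name contains U+FFFD or its Latin-1 mis-decoding."""
    return "\ufffd" in name or DEBRIS in name


# Toy list: the middle entry imitates the garbled row (repeat count is arbitrary).
names = ["San Marino", "S" + DEBRIS * 3, "Saudi Arabia"]
flagged = [n for n in names if has_replacement_debris(n)]
print(flagged)
```

Running a check like this over `kaggle_df['country']` would list every affected row, rather than relying on spotting one by eye.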

In [12]:
# display api
api_df[api_df['country']=='Sao Tome and Principe']

Unnamed: 0,country,continent,covid_cases,covid_deaths,area_km2,population
141,Sao Tome and Principe,Africa,991,17,1001,220914


In [13]:
# display error
kaggle_df[kaggle_df['country']=='Sï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½']

Unnamed: 0,country,birth_rt,currency_code,fertility_rt,infant_mortality_rt,life_expectancy,official_language,physicians_per_thousand,lat,long,ag_land_pct,land_area_km2,consumer_price_index,gross_domestic_product_USD(b),primary_education_enrollment_pct,secondary_education_enrollment_pct,population,tax_revenue_pct,unemployment_rt
150,Sï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½,31.54,STN,4.32,24.4,70.2,,0.05,,,50.7,964.0,185.09,0.429,106.8,13.4,215056.0,14.6,13.37


In [14]:
# loop through countries and find those that don't have match 
unique_in_api=[country for country in api_country_list if country not in kaggle_country_list]

# Find countries in kaggle_country_list not in api_country_list
unique_in_kaggle=[country for country in kaggle_country_list if country not in api_country_list]

# Print the countries that do not have a match
print("Countries in API list not in Kaggle list:")
for country in unique_in_api:
    print(country)

print("\nCountries in Kaggle list not in API list:")
for country in unique_in_kaggle:
    print(country)


Countries in API list not in Kaggle list:
Antigua And Barbuda
Bahamas
Congo
Cote D'Ivoire
Gambia
Ireland
Micronesia
Saint Vincent And The Grenadines
Sao Tome and Principe
Swaziland
Taiwan
Trinidad And Tobago

Countries in Kaggle list not in API list:
Antigua and Barbuda
The Bahamas
Ivory Coast
Cape Verde
Republic of the Congo
Democratic Republic of the Congo
Eswatini
The Gambia
Vatican City
Republic of Ireland
Federated States of Micronesia
Netherlands
North Korea
North Macedonia
Palestinian National Authority
Saint Kitts and Nevis
Saint Vincent and the Grenadines
Sï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½
South Korea
East Timor
Trinidad and Tobago
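
The same comparison can be written with set differences, which avoids scanning one list for every element of the other. A minimal sketch on a handful of names taken from the lists above, not the full data:

```python
# A few names from each source; 'Angola' is spelled the same in both.
api_names = {"Bahamas", "Gambia", "Ireland", "Taiwan", "Angola"}
kaggle_names = {"The Bahamas", "The Gambia", "Republic of Ireland", "Angola"}

# Set difference keeps only the names missing from the other source.
only_in_api = sorted(api_names - kaggle_names)
only_in_kaggle = sorted(kaggle_names - api_names)

print("Only in API:", only_in_api)
print("Only in Kaggle:", only_in_kaggle)
```

With the real data, the sets could be built as `set(api_df['country'])` and `set(kaggle_df['country'])`.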


In [15]:
# map to apply to kaggle; note both Congo rows map to 'Congo' here and are handled in a later step
country_map_kaggle={
    'The Bahamas':'Bahamas',
    'Republic of the Congo':'Congo',
    'Democratic Republic of the Congo':'Congo',
    'The Gambia':'Gambia',
    'Sï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½':'Sao Tome and Principe',
    'Republic of Ireland':'Ireland',
    'Federated States of Micronesia':'Micronesia',
}
    
# map to apply to api
country_map_api={
    'Cote D\'Ivoire':'Ivory Coast',
    'Antigua And Barbuda':'Antigua and Barbuda',
    'Trinidad And Tobago':'Trinidad and Tobago',
    'Saint Vincent And The Grenadines':'Saint Vincent and the Grenadines',
    'Swaziland':'Eswatini'
}
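
Because two Kaggle names map to the same target, the renamed column no longer identifies rows uniquely. A small sketch of how such collisions could be detected automatically, using a shortened copy of the map (the variable names here are illustrative):

```python
from collections import Counter

# Shortened copy of the Kaggle map for illustration.
country_map = {
    "The Bahamas": "Bahamas",
    "Republic of the Congo": "Congo",
    "Democratic Republic of the Congo": "Congo",
    "The Gambia": "Gambia",
}

# Count how many source names point at each target name.
target_counts = Counter(country_map.values())
collisions = {target: n for target, n in target_counts.items() if n > 1}
print(collisions)
```

Any target listed in `collisions` would produce duplicate keys after the rename and needs a decision before merging.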


In [16]:
# apply map to kaggle
kaggle_df['country_edited']=kaggle_df['country'].replace(country_map_kaggle)

# review
kaggle_df[['country_edited','country']]

Unnamed: 0,country_edited,country
0,Afghanistan,Afghanistan
1,Albania,Albania
2,Algeria,Algeria
3,Andorra,Andorra
4,Angola,Angola
...,...,...
190,Venezuela,Venezuela
191,Vietnam,Vietnam
192,Yemen,Yemen
193,Zambia,Zambia
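
`Series.replace` with a dict suits this step because names that are not in the map pass through unchanged, whereas `Series.map` with the same dict would turn them into `NaN`. A minimal sketch on a two-row toy series:

```python
import pandas as pd

names = pd.Series(["The Bahamas", "Angola"])
mapping = {"The Bahamas": "Bahamas"}

replaced = names.replace(mapping)  # unmapped values are kept as-is
mapped = names.map(mapping)        # unmapped values become NaN

print(replaced.tolist())
print(mapped.isna().tolist())
```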


In [17]:
# apply map to api
api_df['country_edited']=api_df['country'].replace(country_map_api)

# review
api_df[['country_edited','country']]

Unnamed: 0,country_edited,country
0,Afghanistan,Afghanistan
1,Albania,Albania
2,Algeria,Algeria
3,Andorra,Andorra
4,Angola,Angola
...,...,...
181,Venezuela,Venezuela
182,Vietnam,Vietnam
183,Yemen,Yemen
184,Zambia,Zambia


**run loops to check country names again for mismatches**

In [19]:
# create lists for country edited  
api_country_list_edited=api_df['country_edited'].unique()
kaggle_country_list_edited=kaggle_df['country_edited'].unique()

# loop through countries and find those that don't have match 
unique_in_api_edited=[country for country in api_country_list_edited if country not in kaggle_country_list_edited]

# Find countries in kaggle_country_list not in api_country_list
unique_in_kaggle_edited=[country for country in kaggle_country_list_edited if country not in api_country_list_edited]

# Print the countries that do not have a match
print("Countries in API list not in Kaggle list:")
for country in unique_in_api_edited:
    print(country)

print("\nCountries in Kaggle list not in API list:")
for country in unique_in_kaggle_edited:
    print(country)


Countries in API list not in Kaggle list:
Taiwan

Countries in Kaggle list not in API list:
Cape Verde
Vatican City
Netherlands
North Korea
North Macedonia
Palestinian National Authority
Saint Kitts and Nevis
South Korea
East Timor


**Compare the number of unique countries in kaggle_df and api_df**

- The difference in unique country counts (194 − 186 = 8) equals the 9 Kaggle-only names minus the 1 API-only name
- These countries will not have matches in the final df and will therefore be dropped by the inner join
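
This check rests on a general set identity: the gap between two set sizes equals the gap between their unmatched counts, because the shared names cancel out. A short sketch on placeholder sets (the letters stand in for country names):

```python
# Placeholder names; 'A' and 'B' are shared by both sets.
kaggle_names = {"A", "B", "C", "D", "E"}
api_names = {"A", "B", "X"}

# Gap in total sizes.
size_gap = len(kaggle_names) - len(api_names)

# Gap in unmatched names on each side.
unmatched_gap = len(kaggle_names - api_names) - len(api_names - kaggle_names)

print(size_gap, unmatched_gap)
```

If the two numbers ever disagree on the real data, one of the columns contains something other than plain unique names, such as stray whitespace or missing values.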

In [21]:
# get length
num_unique_countries_kaggle=len(kaggle_df['country_edited'].unique())
num_unique_countries_api=len(api_df['country_edited'].unique())

# display
print(f'Unique Countries in Kaggle: {num_unique_countries_kaggle}\nUnique Countries in API: {num_unique_countries_api}')

Unique Countries in Kaggle: 194
Unique Countries in API: 186


**Fix the Congo occurrence in kaggle_df**

In [23]:
# filter
congo_df=kaggle_df[kaggle_df['country_edited']=='Congo']

# display 
congo_df

Unnamed: 0,country,birth_rt,currency_code,fertility_rt,infant_mortality_rt,life_expectancy,official_language,physicians_per_thousand,lat,long,ag_land_pct,land_area_km2,consumer_price_index,gross_domestic_product_USD(b),primary_education_enrollment_pct,secondary_education_enrollment_pct,population,tax_revenue_pct,unemployment_rt,country_edited
39,Republic of the Congo,32.86,XAF,4.43,36.2,64.3,French,0.12,-0.228021,15.827659,31.1,342000.0,124.74,10.8206,106.6,12.7,5380508.0,9.0,9.47,Congo
45,Democratic Republic of the Congo,41.18,CDF,5.92,68.2,60.4,French,0.07,-4.038333,21.758664,11.6,2344858.0,133.85,47.3196,108.0,6.6,86790567.0,10.7,4.24,Congo


In [24]:
# display congo in the api df to see which to drop
congo_df_2=api_df[api_df['country_edited']=='Congo']

congo_df_2

Unnamed: 0,country,continent,covid_cases,covid_deaths,area_km2,population,country_edited
37,Congo,Africa,5774,94,2345410,5576861,Congo


In [25]:
# double check if there are any other occurrences of congo
congo_countries = api_df[api_df['country'].str.contains('congo', case=False, na=False)]

congo_countries


Unnamed: 0,country,continent,covid_cases,covid_deaths,area_km2,population,country_edited
37,Congo,Africa,5774,94,2345410,5576861,Congo


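**The single Congo row in api_df cannot be matched reliably to either Kaggle row (its area is close to the Democratic Republic of the Congo's, its population to the Republic of the Congo's), so the Congo rows are dropped from both data frames**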
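
The comparison behind this decision can be sketched with the values from the rows displayed above. The helper `rel_diff` is an illustrative choice of measure, not something used in the notebook, and the Kaggle `land_area_km2` column is treated here as comparable to the API `area_km2` column, which is an assumption:

```python
# Values copied from the displayed Congo rows.
api_congo = {"area_km2": 2345410, "population": 5576861}
kaggle_rows = {
    "Republic of the Congo": {"area_km2": 342000.0, "population": 5380508.0},
    "Democratic Republic of the Congo": {"area_km2": 2344858.0, "population": 86790567.0},
}


def rel_diff(a: float, b: float) -> float:
    """Relative difference between two positive numbers, between 0 and 1."""
    return abs(a - b) / max(a, b)


for name, row in kaggle_rows.items():
    area_gap = rel_diff(api_congo["area_km2"], row["area_km2"])
    pop_gap = rel_diff(api_congo["population"], row["population"])
    print(f"{name}: area gap {area_gap:.1%}, population gap {pop_gap:.1%}")

# Which Kaggle row is closest on each measure?
closest_by_area = min(
    kaggle_rows, key=lambda n: rel_diff(api_congo["area_km2"], kaggle_rows[n]["area_km2"])
)
closest_by_population = min(
    kaggle_rows, key=lambda n: rel_diff(api_congo["population"], kaggle_rows[n]["population"])
)
print(closest_by_area, "|", closest_by_population)
```

The two measures point at different countries, which is why neither Kaggle row can be paired with the API row with confidence.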

In [27]:
# exclude from api
api_df=api_df[api_df['country_edited']!='Congo']

# exclude from kaggle
kaggle_df=kaggle_df[kaggle_df['country_edited']!='Congo']

In [28]:
# code to ensure they were dropped
congo_occurences_api=len(api_df[api_df['country_edited']=='Congo'])
congo_occurences_kaggle=len(kaggle_df[kaggle_df['country_edited']=='Congo'])

# display
congo_occurences_api,congo_occurences_kaggle

(0, 0)

### Merge data frames with an inner join

- [x] review join

In [30]:
# merge
merged_df=pd.merge(kaggle_df,api_df,on='country_edited',how='inner')

# display
merged_df.head()

Unnamed: 0,country_x,birth_rt,currency_code,fertility_rt,infant_mortality_rt,life_expectancy,official_language,physicians_per_thousand,lat,long,...,population_x,tax_revenue_pct,unemployment_rt,country_edited,country_y,continent,covid_cases,covid_deaths,area_km2,population_y
0,Afghanistan,32.49,AFN,4.47,47.9,64.5,Pashto,0.28,33.93911,67.709953,...,38041754.0,9.3,11.12,Afghanistan,Afghanistan,Asia,46498,1774,652000,39306195
1,Albania,11.78,ALL,1.62,7.8,78.5,Albanian,1.2,41.153332,20.168331,...,2854191.0,18.6,12.33,Albania,Albania,Europe,37625,798,28748,2876490
2,Algeria,24.28,DZD,3.02,20.1,76.7,Arabic,1.72,28.033886,1.659626,...,43053054.0,37.2,11.7,Algeria,Algeria,Africa,83199,2431,2381741,44190030
3,Andorra,7.2,EUR,1.27,2.7,,Catalan,3.33,42.506285,1.521801,...,77142.0,,,Andorra,Andorra,Europe,6712,76,468,77317
4,Angola,40.73,AOA,5.52,51.6,60.8,Portuguese,0.21,-11.202692,17.873887,...,31825295.0,9.2,6.89,Angola,Angola,Africa,15139,348,1246620,33312843


In [31]:
# review shape
print(f'Merged df rows: {merged_df.shape[0]}\nMerged df columns: {merged_df.shape[1]}')

Merged df rows: 184
Merged df columns: 26
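
One way to review a join is to run it once as an outer join with `indicator=True`, which adds a `_merge` column showing which side each row came from, and with `validate='one_to_one'`, which raises an error if a key is duplicated on either side. A minimal sketch on toy frames; the value columns are placeholders, not real data:

```python
import pandas as pd

# Toy frames: 'Cape Verde' exists only on the left, 'Taiwan' only on the right.
left = pd.DataFrame(
    {"country_edited": ["Albania", "Angola", "Cape Verde"], "left_value": [1, 2, 3]}
)
right = pd.DataFrame(
    {"country_edited": ["Albania", "Angola", "Taiwan"], "right_value": [10, 20, 30]}
)

# The outer join keeps every row; _merge is 'both', 'left_only' or 'right_only'.
review = pd.merge(
    left, right, on="country_edited", how="outer", indicator=True, validate="one_to_one"
)

# Rows that an inner join would drop.
dropped_by_inner = review.loc[review["_merge"] != "both", ["country_edited", "_merge"]]
print(dropped_by_inner)
```

Applied to `kaggle_df` and `api_df`, the rows listed this way should correspond to the unmatched names printed earlier.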


---
### Save Data
* **04_country_merged_raw.csv** - merged Kaggle and API data

In [34]:
# save
merged_df.to_csv('04_country_merged_raw.csv',sep=',',encoding='utf-8',index=False)