# BAIS:3250 - Final Project
### Clean Merged Countries Data Frame

**Author(s):** Natalie Brown, Max Kaiser

**Date Modified:** 11-19-2024 (*Date Created:* 11-19-2024)


**Description:** Clean the merged Countries data frame by reconciling duplicate columns, dropping and renaming columns, identifying nulls and imputing them where appropriate, and standardizing column names.

---

### Import Libraries
* **pandas:** for data frames and data cleaning functions

In [108]:
import pandas as pd

---
### Load Data
* **04_country_merged_raw.csv**

In [111]:
# load
country_df=pd.read_csv('04_country_merged_raw.csv',sep=',',encoding='utf-8')

# display header
country_df.head()

Unnamed: 0,country,birth_rt,currency_code,fertility_rt,infant_mortality_rt,life_expectancy,official_language,physicians_per_thousand,lat,long,...,primary_education_enrollment_pct,secondary_education_enrollment_pct,population_x,tax_revenue_pct,unemployment_rt,continent,covid_cases,covid_deaths,area_km2,population_y
0,Afghanistan,32.49,AFN,4.47,47.9,64.5,Pashto,0.28,33.93911,67.709953,...,104.0,9.7,38041754.0,9.3,11.12,Asia,46498.0,1774.0,652000.0,39306195.0
1,Albania,11.78,ALL,1.62,7.8,78.5,Albanian,1.2,41.153332,20.168331,...,107.0,55.0,2854191.0,18.6,12.33,Europe,37625.0,798.0,28748.0,2876490.0
2,Algeria,24.28,DZD,3.02,20.1,76.7,Arabic,1.72,28.033886,1.659626,...,109.9,51.4,43053054.0,37.2,11.7,Africa,83199.0,2431.0,2381741.0,44190030.0
3,Andorra,7.2,EUR,1.27,2.7,,Catalan,3.33,42.506285,1.521801,...,106.4,,77142.0,,,Europe,6712.0,76.0,468.0,77317.0
4,Angola,40.73,AOA,5.52,51.6,60.8,Portuguese,0.21,-11.202692,17.873887,...,113.5,9.3,31825295.0,9.2,6.89,Africa,15139.0,348.0,1246620.0,33312843.0


---
### Fix Columns
- [ ] Review all columns
- [ ] Review duplicate columns
- [ ] Drop unnecessary columns
- [ ] Rename columns (if necessary)


- Duplicate Kaggle columns carry the `_x` suffix
- Duplicate API columns carry the `_y` suffix

# Reviewing all Columns

In [115]:

# Display column names and data types
# (info() prints its report directly and returns None, so no print() wrapper is needed)
country_df.info()

# List all column names
print(country_df.columns.tolist())


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 195 entries, 0 to 194
Data columns (total 24 columns):
 #   Column                              Non-Null Count  Dtype  
---  ------                              --------------  -----  
 0   country                             195 non-null    object 
 1   birth_rt                            189 non-null    float64
 2   currency_code                       180 non-null    object 
 3   fertility_rt                        188 non-null    float64
 4   infant_mortality_rt                 189 non-null    float64
 5   life_expectancy                     187 non-null    float64
 6   official_language                   190 non-null    object 
 7   physicians_per_thousand             188 non-null    float64
 8   lat                                 194 non-null    float64
 9   long                                194 non-null    float64
 10  ag_land_pct                         188 non-null    float64
 11  land_area_km2                       194 non-n
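
The non-null counts above can also be turned into an explicit null summary, which makes it easier to see which columns need imputing. The following is a minimal sketch on a made-up three-row frame; `demo_df` and its values are illustrative assumptions, not the project data.

```python
import pandas as pd

# Toy frame standing in for country_df (values are made up for illustration)
demo_df = pd.DataFrame({
    "country": ["A", "B", "C"],
    "life_expectancy": [64.5, None, 78.5],
    "currency_code": ["AFN", None, None],
})

# Count nulls per column, add a percentage, and list the worst columns first
null_summary = (
    demo_df.isna().sum()
    .to_frame("null_count")
    .assign(null_pct=lambda t: (t["null_count"] / len(demo_df) * 100).round(1))
    .sort_values("null_count", ascending=False)
)
print(null_summary)
```

Running the same chain on `country_df` should give one row per column, sorted by how many values are missing.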

# Reviewing Duplicate Columns 

In [118]:

# List of duplicate columns
duplicate_columns = {
    'country': ['country_x', 'country_y'],
    'population': ['population_x', 'population_y'],
    'land_area': ['land_area_km2', 'area_km2']
}

print("Duplicate Columns Identified:")
for key, cols in duplicate_columns.items():
    print(f"{key}: {cols}")


Duplicate Columns Identified:
country: ['country_x', 'country_y']
population: ['population_x', 'population_y']
land_area: ['land_area_km2', 'area_km2']
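
The dictionary above is written by hand. As a cross-check, merge-suffixed pairs could also be detected from the column names themselves. This is a rough sketch; the helper `find_suffix_pairs` and the toy frame are hypothetical and not part of the original notebook.

```python
import pandas as pd

def find_suffix_pairs(df, suffixes=("_x", "_y")):
    """Return base names that appear with both merge suffixes."""
    left, right = suffixes
    left_bases = {c[: -len(left)] for c in df.columns if c.endswith(left)}
    right_bases = {c[: -len(right)] for c in df.columns if c.endswith(right)}
    return sorted(left_bases & right_bases)

# Toy frame standing in for the merged data (values are made up)
demo_df = pd.DataFrame({
    "country": ["A", "B"],
    "population_x": [100.0, 200.0],
    "population_y": [110.0, None],
    "area_km2": [1.0, 2.0],
})

suffix_pairs = find_suffix_pairs(demo_df)
print(suffix_pairs)
```

Only columns that exist with both suffixes are reported, so pairs with different base names (such as `land_area_km2` and `area_km2`) still have to be identified manually.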


In [127]:
# Replace null and zero values in 'population_y' with values from 'population_x'
country_df['population_y'] = country_df['population_y'].fillna(country_df['population_x'])
country_df.loc[country_df['population_y'] == 0, 'population_y'] = country_df['population_x']

# Calculate the average population, handling NaN values
country_df['average_population'] = country_df[['population_x', 'population_y']].mean(axis=1)

# Optionally, round the population to integers, filling NaN with 0 before conversion
country_df['average_population'] = country_df['average_population'].fillna(0).round(0).astype(int)

# Drop the original population columns
country_df = country_df.drop(columns=['population_x', 'population_y'])
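
One design note on the cell above: filling missing averages with 0 makes a country with no population data look like it has zero inhabitants. A possible alternative, sketched here under the assumption that keeping missing values visible is preferred, is pandas' nullable `Int64` dtype. The frame `demo_df` and its values are made up for illustration.

```python
import pandas as pd

# Toy frame: one row with both sources, one with a single source, one with none
demo_df = pd.DataFrame({
    "population_x": [100.0, 200.0, None],
    "population_y": [110.0, None, None],
})

# mean(axis=1) skips NaN by default, so a row with one source keeps that value
avg = demo_df[["population_x", "population_y"]].mean(axis=1)

# The nullable integer dtype keeps fully missing rows as <NA> instead of 0
demo_df["average_population"] = avg.round(0).astype("Int64")
print(demo_df["average_population"])
```

Because `mean(axis=1)` already ignores nulls, the row-wise average falls back to whichever source is present without an explicit `fillna` step.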

In [129]:
# rename area to total area for clarification
country_df = country_df.rename(columns={'area_km2': 'total_area_km2'})

# Verifying Changes

In [132]:
# Checking changes made
# Display updated column names
print("Updated Columns:")
print(country_df.columns.tolist())

# Display the first few rows to verify changes
country_df.head()


Updated Columns:
['country', 'birth_rt', 'currency_code', 'fertility_rt', 'infant_mortality_rt', 'life_expectancy', 'official_language', 'physicians_per_thousand', 'lat', 'long', 'ag_land_pct', 'land_area_km2', 'consumer_price_index', 'gross_domestic_product_USD(b)', 'primary_education_enrollment_pct', 'secondary_education_enrollment_pct', 'tax_revenue_pct', 'unemployment_rt', 'continent', 'covid_cases', 'covid_deaths', 'total_area_km2', 'average_population']


Unnamed: 0,country,birth_rt,currency_code,fertility_rt,infant_mortality_rt,life_expectancy,official_language,physicians_per_thousand,lat,long,...,gross_domestic_product_USD(b),primary_education_enrollment_pct,secondary_education_enrollment_pct,tax_revenue_pct,unemployment_rt,continent,covid_cases,covid_deaths,total_area_km2,average_population
0,Afghanistan,32.49,AFN,4.47,47.9,64.5,Pashto,0.28,33.93911,67.709953,...,19.1014,104.0,9.7,9.3,11.12,Asia,46498.0,1774.0,652000.0,38673974
1,Albania,11.78,ALL,1.62,7.8,78.5,Albanian,1.2,41.153332,20.168331,...,15.2781,107.0,55.0,18.6,12.33,Europe,37625.0,798.0,28748.0,2865340
2,Algeria,24.28,DZD,3.02,20.1,76.7,Arabic,1.72,28.033886,1.659626,...,169.9882,109.9,51.4,37.2,11.7,Africa,83199.0,2431.0,2381741.0,43621542
3,Andorra,7.2,EUR,1.27,2.7,,Catalan,3.33,42.506285,1.521801,...,3.1541,106.4,,,,Europe,6712.0,76.0,468.0,77230
4,Angola,40.73,AOA,5.52,51.6,60.8,Portuguese,0.21,-11.202692,17.873887,...,94.6354,113.5,9.3,9.2,6.89,Africa,15139.0,348.0,1246620.0,32569069


# Final Cleanup and Standardization

In [135]:
# Standardize column names
country_df.columns = country_df.columns.str.lower().str.replace(' ', '_').str.replace('-', '_')

print("Standardized Column Names:")
print(country_df.columns.tolist())


Standardized Column Names:
['country', 'birth_rt', 'currency_code', 'fertility_rt', 'infant_mortality_rt', 'life_expectancy', 'official_language', 'physicians_per_thousand', 'lat', 'long', 'ag_land_pct', 'land_area_km2', 'consumer_price_index', 'gross_domestic_product_usd(b)', 'primary_education_enrollment_pct', 'secondary_education_enrollment_pct', 'tax_revenue_pct', 'unemployment_rt', 'continent', 'covid_cases', 'covid_deaths', 'total_area_km2', 'average_population']
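
Note that the standardized list still contains `gross_domestic_product_usd(b)`, since only spaces and hyphens are replaced. If parentheses-free names are wanted, a regular expression could collapse every non-alphanumeric run into a single underscore. This is a sketch only; the second column name, `Land Area-km2`, is a hypothetical example and does not appear in the data.

```python
import pandas as pd

# Empty frame with two sample column names (the second one is hypothetical)
demo_df = pd.DataFrame(columns=["gross_domestic_product_USD(b)", "Land Area-km2"])

clean_cols = (
    demo_df.columns
    .str.lower()
    # collapse any run of characters other than a-z and 0-9 into one underscore
    .str.replace(r"[^0-9a-z]+", "_", regex=True)
    # drop underscores left at the start or end (e.g. from a closing parenthesis)
    .str.strip("_")
)
print(clean_cols.tolist())
```

Renaming this column would change the schema that downstream notebooks read, so it is a choice to make deliberately rather than a required fix.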


**review columns**

- duplicate columns
    - country_x - country_y

In [138]:
# get list
columns_list=country_df.columns

# display
for column in columns_list:
    print(column)

country
birth_rt
currency_code
fertility_rt
infant_mortality_rt
life_expectancy
official_language
physicians_per_thousand
lat
long
ag_land_pct
land_area_km2
consumer_price_index
gross_domestic_product_usd(b)
primary_education_enrollment_pct
secondary_education_enrollment_pct
tax_revenue_pct
unemployment_rt
continent
covid_cases
covid_deaths
total_area_km2
average_population


---
### Save Data
* **05_country_merged_clean.csv**

In [141]:
# Save the cleaned DataFrame to a new CSV
country_df.to_csv('05_country_merged_clean.csv', index=False, encoding='utf-8')
print("Cleaned dataset saved as '05_country_merged_clean.csv'")


Cleaned dataset saved as '05_country_merged_clean.csv'
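
A quick way to confirm the save is to reload the file and compare its shape and columns with the frame in memory. The sketch below uses a made-up two-row frame and a temporary directory so it does not touch the project files; the file name `demo_clean.csv` is an assumption for illustration.

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for the cleaned country_df (values are made up)
demo_df = pd.DataFrame({"country": ["A", "B"], "average_population": [105, 200]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_clean.csv")
    # index=False keeps the row index out of the file, as in the save cell
    demo_df.to_csv(path, index=False, encoding="utf-8")
    reloaded = pd.read_csv(path, encoding="utf-8")

same_columns = reloaded.columns.tolist() == demo_df.columns.tolist()
same_shape = reloaded.shape == demo_df.shape
print(same_columns, same_shape)
```

Writing with `index=False` is what prevents an extra `Unnamed: 0` column from appearing when the file is read back.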
