# BAIS:3250 - Final Project
### Cleaning API Data

**Author(s):** Natalie Brown, Max Kaiser

**Date Modified:** 11-13-2024 (*date created:* 11-13-2024)


**Description:** Clean the country data that was scraped from the RESTcountries API.

---

### Import Libraries
* **pandas:** for data frames and data cleaning functions

In [3]:
import pandas as pd

---
### Load Data
* **api_country_data_raw.csv**

In [5]:
# example load code
country_df=pd.read_csv('api_country_data_raw.csv',sep=',',encoding='utf-8',header=0)

# display head to ensure properly loaded
country_df.head()

Unnamed: 0,country,capital,covid_cases,covid_deaths,president_name,president_gender,president_appointment_start_date,president_appointment_end_date,continent,size,population
0,Afghanistan,Kabul,46498,1774,Ashraf Ghani,Male,2020-03-09,,Asia,"652,000 km²",39306195
1,Albania,Tirana,37625,798,Ilir Rexhep Meta,Male,2017-07-24,,Europe,"28,748 km²",2876490
2,Algeria,Algiers,83199,2431,,,,,Africa,"2,381,741 km²",44190030
3,Andorra,Andorra la Vella,6712,76,,,,,Europe,468 km²,77317
4,Angola,Luanda,15139,348,,,,,Africa,"1,246,620 km²",33312843
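
The comma-formatted numbers cleaned later in this notebook could also be parsed while loading. The sketch below assumes the raw file stores those values as quoted strings such as `"46,498"`; the two-row sample and the names `sample_csv` and `sample_df` are made up for illustration and are not part of the project data.

```python
import io
import pandas as pd

# hypothetical two-row sample mimicking comma-formatted numbers in the raw file
sample_csv = io.StringIO(
    'country,covid_cases,population\n'
    'Afghanistan,"46,498","39,306,195"\n'
    'Andorra,"6,712","77,317"\n'
)

# thousands=',' asks the parser to treat commas inside numeric fields as separators
sample_df = pd.read_csv(sample_csv, thousands=',')
print(sample_df.dtypes)
```

A column such as **size** would still load as text because of its `km²` suffix, so it needs the string cleaning shown further down either way.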


---
### Perform initial discovery
* display datatypes
* display nulls
* display data frame size

In [7]:
# data frame size
shape=country_df.shape
print(f'Country Data Shape\n------------------\nColumns: {shape[1]}\nRows: {shape[0]}')

Country Data Shape
------------------
Columns: 11
Rows: 186


In [8]:
# display data types
print(f'Country Data Types\n------------------\n{country_df.dtypes}')

Country Data Types
------------------
country                              object
capital                              object
covid_cases                          object
covid_deaths                         object
president_name                       object
president_gender                     object
president_appointment_start_date     object
president_appointment_end_date      float64
continent                            object
size                                 object
population                           object
dtype: object


In [9]:
# determine nulls counts and percentages
null_counts=country_df.isna().sum()
null_percentages=round((country_df.isna().sum()/shape[0])*100,2)

# data frame for results
nulls_df=pd.DataFrame({
    'Nulls':null_counts,
    'Nulls (%)':null_percentages
})

# display
print("Country Data Nulls")
nulls_df

Country Data Nulls


Unnamed: 0,Nulls,Nulls (%)
country,0,0.0
capital,0,0.0
covid_cases,0,0.0
covid_deaths,0,0.0
president_name,183,98.39
president_gender,183,98.39
president_appointment_start_date,183,98.39
president_appointment_end_date,186,100.0
continent,2,1.08
size,0,0.0
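
The null summary is computed twice in this notebook, so it may be worth wrapping in a function. This is a minimal sketch; `null_summary` and `demo_df` are hypothetical names, and the small frame is made up for illustration.

```python
import pandas as pd

def null_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return null counts and percentages per column (hypothetical helper)."""
    null_counts = df.isna().sum()
    # isna().mean() is the fraction of nulls in each column
    null_percentages = (df.isna().mean() * 100).round(2)
    return pd.DataFrame({'Nulls': null_counts, 'Nulls (%)': null_percentages})

# small made-up frame for illustration
demo_df = pd.DataFrame({
    'country': ['A', 'B', 'C', 'D'],
    'continent': ['Asia', None, 'Europe', 'Africa'],
})
summary = null_summary(demo_df)
print(summary)
```

Because the percentage comes from the frame passed in, the helper does not depend on a `shape` variable captured earlier.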


---
### Perform Initial Cleaning
- [x] remove columns with over 50% nulls (**capital** is dropped as well)
- [x] fix data types
    * displaying values first to get formatting
    * features to be fixed (all need to be integer):
        - [x] **covid_cases**
        - [x] **covid_deaths**
        - [x] **size** renamed to **land_area_km2**
        - [x] **population**
- [x] fill continent for the two countries missing

#### Remove Columns

In [12]:
# remove the president columns (over 50% nulls); capital has no nulls but is dropped as well
country_df=country_df.drop(columns=['president_name','president_gender','president_appointment_start_date','president_appointment_end_date','capital'])
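
The drop above lists the columns by hand. As a rough sketch, the "over 50% nulls" rule could also be applied programmatically; the `threshold` value and the made-up `demo_df` are assumptions for illustration.

```python
import pandas as pd

# made-up frame: 'mostly_null' is 75% null, 'some_null' is 25% null
demo_df = pd.DataFrame({
    'country': ['A', 'B', 'C', 'D'],
    'mostly_null': [None, None, None, 'x'],
    'some_null': ['y', None, 'y', 'y'],
})

threshold = 0.5  # assumed cutoff matching the "over 50% nulls" rule
null_fraction = demo_df.isna().mean()

# keep the names of columns whose null fraction exceeds the cutoff
cols_to_drop = null_fraction[null_fraction > threshold].index.tolist()
trimmed_df = demo_df.drop(columns=cols_to_drop)
print(cols_to_drop)
```

A column like **capital**, which has no nulls, would not be caught by this rule and would still need to be dropped explicitly.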

#### Fix data types

In [14]:
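# function to display current data format
# note: .unique is referenced without being called, so the outputs below show the
# bound method's repr (which embeds a preview of the Series), not the distinct values
def data_format(col):
    print(f'Data Format for {col}\n-----------------\n{country_df[col].unique}')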
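
The helper above references `.unique` without calling it, so it prints a preview of the Series rather than the distinct values. A variant that calls the method might look like the sketch below; passing the frame as a parameter is a design choice made here for illustration, not the notebook's original signature, and `demo_df` is made up.

```python
import pandas as pd

def data_format(df: pd.DataFrame, col: str) -> None:
    """Print the distinct values of one column (illustrative variant)."""
    # unique() is called here, so an array of distinct values is printed
    print(f'Data Format for {col}\n-----------------\n{df[col].unique()}')

# made-up sample with a repeated value
demo_df = pd.DataFrame({'covid_cases': ['46,498', '0', '0', '6,712']})
data_format(demo_df, 'covid_cases')
distinct_values = demo_df['covid_cases'].unique().tolist()
```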

**covid_cases**

In [16]:
data_format('covid_cases')

Data Format for covid_cases
-----------------
<bound method Series.unique of 0      46,498
1      37,625
2      83,199
3       6,712
4      15,139
        ...  
181         0
182         0
183     2,081
184    17,647
185    10,034
Name: covid_cases, Length: 186, dtype: object>


In [17]:
# remove commas and fix to integer
country_df['covid_cases_edited']=country_df['covid_cases'].str.replace(',', '').astype(int)

# display changes to confirm
country_df[['covid_cases_edited','covid_cases']]

Unnamed: 0,covid_cases_edited,covid_cases
0,46498,46498
1,37625,37625
2,83199,83199
3,6712,6712
4,15139,15139
...,...,...
181,0,0
182,0,0
183,2081,2081
184,17647,17647


In [18]:
# remove the original column
country_df=country_df.drop(columns='covid_cases')

# rename edited
country_df=country_df.rename(columns={'covid_cases_edited':'covid_cases'})

# review changes
country_df['covid_cases']

0      46498
1      37625
2      83199
3       6712
4      15139
       ...  
181        0
182        0
183     2081
184    17647
185    10034
Name: covid_cases, Length: 186, dtype: int32

***covid_cases** complete*
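
The create, drop, and rename steps above can be collapsed by assigning the converted values back to the same column. This is a sketch under assumptions: `comma_string_to_int` is a hypothetical helper, and the sample values are copied from the outputs above.

```python
import pandas as pd

def comma_string_to_int(series: pd.Series) -> pd.Series:
    """Strip thousands separators and convert to 64-bit integers (hypothetical helper)."""
    cleaned = series.str.replace(',', '', regex=False)
    # pd.to_numeric raises if any value is still non-numeric after cleaning
    return pd.to_numeric(cleaned, errors='raise').astype('int64')

demo_df = pd.DataFrame({'covid_cases': ['46,498', '37,625', '0']})

# assigning back to the same column avoids the temporary *_edited column
demo_df['covid_cases'] = comma_string_to_int(demo_df['covid_cases'])
print(demo_df['covid_cases'])
```

Requesting `int64` explicitly keeps the result the same across platforms, whereas `astype(int)` can yield `int32` in some environments, as in the output above.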

**covid_deaths**

In [21]:
data_format('covid_deaths')

Data Format for covid_deaths
-----------------
<bound method Series.unique of 0      1,774
1        798
2      2,431
3         76
4        348
       ...  
181        0
182        0
183      606
184      357
185      277
Name: covid_deaths, Length: 186, dtype: object>


In [22]:
# remove commas and fix to integer
country_df['covid_deaths_edited']=country_df['covid_deaths'].str.replace(',', '').astype(int)

# display changes to confirm
country_df[['covid_deaths_edited','covid_deaths']]

Unnamed: 0,covid_deaths_edited,covid_deaths
0,1774,1774
1,798,798
2,2431,2431
3,76,76
4,348,348
...,...,...
181,0,0
182,0,0
183,606,606
184,357,357


In [23]:
# remove the original column
country_df=country_df.drop(columns='covid_deaths')

# rename edited
country_df=country_df.rename(columns={'covid_deaths_edited':'covid_deaths'})

# review changes
country_df['covid_deaths']

0      1774
1       798
2      2431
3        76
4       348
       ... 
181       0
182       0
183     606
184     357
185     277
Name: covid_deaths, Length: 186, dtype: int32

***covid_deaths** complete*

**size**

In [26]:
data_format('size')

Data Format for size
-----------------
<bound method Series.unique of 0        652,000 km²
1         28,748 km²
2      2,381,741 km²
3            468 km²
4      1,246,620 km²
           ...      
181      916,445 km²
182      331,212 km²
183      527,968 km²
184      752,618 km²
185      390,757 km²
Name: size, Length: 186, dtype: object>


In [27]:
# remove commas and the km² unit, then convert to integer
country_df['size_edited']=country_df['size'].str.replace(',', '').str.replace(' km²', '').astype(int)

# display changes to confirm
country_df[['size_edited','size']]

Unnamed: 0,size_edited,size
0,652000,"652,000 km²"
1,28748,"28,748 km²"
2,2381741,"2,381,741 km²"
3,468,468 km²
4,1246620,"1,246,620 km²"
...,...,...
181,916445,"916,445 km²"
182,331212,"331,212 km²"
183,527968,"527,968 km²"
184,752618,"752,618 km²"


In [28]:
# remove the original column
country_df=country_df.drop(columns='size')

# rename edited column and include the unit of measure in the name
country_df=country_df.rename(columns={'size_edited':'land_area_km2'})

# review changes
country_df['land_area_km2']

0       652000
1        28748
2      2381741
3          468
4      1246620
        ...   
181     916445
182     331212
183     527968
184     752618
185     390757
Name: land_area_km2, Length: 186, dtype: int32

***size** completed, renamed to **land_area_km2***
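
As an alternative to chaining two `str.replace` calls, a single regular expression can strip everything that is not a digit. This is only a sketch: it assumes every value is a whole number, and the sample rows are copied from the outputs above.

```python
import pandas as pd

demo_df = pd.DataFrame({'size': ['652,000 km²', '468 km²', '2,381,741 km²']})

# keep only ASCII digits, dropping separators, spaces, and the unit suffix
digits_only = demo_df['size'].str.replace(r'[^0-9]', '', regex=True)
demo_df['land_area_km2'] = digits_only.astype('int64')
print(demo_df)
```

The pattern would also remove a decimal point, so a value like `1.5 km²` would become `15`; the explicit replacements used in the notebook are safer if fractional areas can occur.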

**population**

In [31]:
data_format('population')

Data Format for population
-----------------
<bound method Series.unique of 0      39,306,195
1       2,876,490
2      44,190,030
3          77,317
4      33,312,843
          ...    
181    28,402,272
182    97,702,766
183    30,110,883
184    18,609,335
185    14,955,711
Name: population, Length: 186, dtype: object>


In [32]:
# remove commas and convert to integer
country_df['population_edited']=country_df['population'].str.replace(',', '').astype(int)

# display changes to confirm
country_df[['population_edited','population']]

Unnamed: 0,population_edited,population
0,39306195,39306195
1,2876490,2876490
2,44190030,44190030
3,77317,77317
4,33312843,33312843
...,...,...
181,28402272,28402272
182,97702766,97702766
183,30110883,30110883
184,18609335,18609335


In [33]:
# remove the original column
country_df=country_df.drop(columns='population')

# rename edited
country_df=country_df.rename(columns={'population_edited':'population'})

# review changes
country_df['population']

0      39306195
1       2876490
2      44190030
3         77317
4      33312843
         ...   
181    28402272
182    97702766
183    30110883
184    18609335
185    14955711
Name: population, Length: 186, dtype: int32

***population** completed*

**fill missing continents**

In [36]:
# first find the occurrences
null_continent_rows=country_df.loc[country_df['continent'].isna()]

null_continent_rows

Unnamed: 0,country,continent,covid_cases,covid_deaths,land_area_km2,population
168,Trinidad And Tobago,,0,0,0,0
169,Tunisia,,96769,3260,0,0


In [37]:
# fill the missing continent for each of the two countries
country_df.loc[country_df['country']=='Trinidad And Tobago', 'continent']='North America'
country_df.loc[country_df['country']=='Tunisia', 'continent']='Africa'
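
If more countries needed a fix, the two assignments above could be driven by a lookup instead. A minimal sketch, assuming a hypothetical `continent_fixes` dictionary and a made-up three-row frame:

```python
import pandas as pd

demo_df = pd.DataFrame({
    'country': ['Tunisia', 'Trinidad And Tobago', 'Albania'],
    'continent': [None, None, 'Europe'],
})

# the two values the notebook fills by hand, collected in one lookup
continent_fixes = {'Trinidad And Tobago': 'North America', 'Tunisia': 'Africa'}

# fillna only touches missing entries, so continents already present are kept
demo_df['continent'] = demo_df['continent'].fillna(
    demo_df['country'].map(continent_fixes)
)
print(demo_df)
```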

In [38]:
# review changes
country_df[country_df['country']=='Trinidad And Tobago']

Unnamed: 0,country,continent,covid_cases,covid_deaths,land_area_km2,population
168,Trinidad And Tobago,North America,0,0,0,0


In [39]:
country_df[country_df['country']=='Tunisia']

Unnamed: 0,country,continent,covid_cases,covid_deaths,land_area_km2,population
169,Tunisia,Africa,96769,3260,0,0
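
Both rows above show 0 for **land_area_km2** and **population**, which look more like placeholders than real measurements (this is an assumption; the notebook does not say). A quick check for such rows could be sketched as follows, with `check_cols` and `demo_df` as illustrative names and Andorra added for contrast.

```python
import pandas as pd

# rows copied from the outputs above, plus Andorra for contrast
demo_df = pd.DataFrame({
    'country': ['Trinidad And Tobago', 'Tunisia', 'Andorra'],
    'land_area_km2': [0, 0, 468],
    'population': [0, 0, 77317],
})

check_cols = ['land_area_km2', 'population']

# a row is flagged if any of the checked columns is exactly zero
zero_mask = (demo_df[check_cols] == 0).any(axis=1)
suspect_rows = demo_df.loc[zero_mask, 'country'].tolist()
print(suspect_rows)
```

Zeros are not counted by `isna()`, so a check like this would be needed in addition to the null summary to surface them.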


In [40]:
# display nulls again
null_counts=country_df.isna().sum()
null_percentages=round((country_df.isna().sum()/shape[0])*100,2)

# data frame for results
nulls_df=pd.DataFrame({
    'Nulls':null_counts,
    'Nulls (%)':null_percentages
})

# display
print("Country Data Nulls (cleaned)")
nulls_df

Country Data Nulls (cleaned)


Unnamed: 0,Nulls,Nulls (%)
country,0,0.0
continent,0,0.0
covid_cases,0,0.0
covid_deaths,0,0.0
land_area_km2,0,0.0
population,0,0.0


In [41]:
# display head one last time
country_df.head()

Unnamed: 0,country,continent,covid_cases,covid_deaths,land_area_km2,population
0,Afghanistan,Asia,46498,1774,652000,39306195
1,Albania,Europe,37625,798,28748,2876490
2,Algeria,Africa,83199,2431,2381741,44190030
3,Andorra,Europe,6712,76,468,77317
4,Angola,Africa,15139,348,1246620,33312843


---
### Save Data
* **api_country_data_clean.csv**

In [43]:
# save to csv
country_df.to_csv('api_country_data_clean.csv',sep=',',encoding='utf-8',header=True,index=False)
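
To confirm that the saved file reloads as expected, a round-trip check could be added. The sketch below writes a made-up frame to a temporary directory; the file name `demo_clean.csv` is hypothetical and the real output file is not touched.

```python
import os
import tempfile
import pandas as pd

demo_df = pd.DataFrame({
    'country': ['Afghanistan', 'Albania'],
    'continent': ['Asia', 'Europe'],
    'population': [39306195, 2876490],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo_clean.csv')  # hypothetical file name
    demo_df.to_csv(path, encoding='utf-8', index=False)
    reloaded_df = pd.read_csv(path, encoding='utf-8')

# compare column names and values after the round trip
same_columns = reloaded_df.columns.tolist() == demo_df.columns.tolist()
same_population = reloaded_df['population'].tolist() == demo_df['population'].tolist()
print(same_columns, same_population)
```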