# BAIS:3250 - Final Project
### Cleaning World Data

**Author(s):** Natalie Brown, Max Kaiser

**Date Modified:** 11-14-2024 (*date created:* 11-13-2024)


**Description:** Clean the world data CSV file downloaded from Kaggle.

---

### Import Libraries
* **pandas:** for data frames and data cleaning functions
* **numpy:** for null handling

In [3]:
import pandas as pd
import numpy as np

---
### Load Data
* **world_data_raw.csv** (Kaggle file *world-data-2023.csv*)

In [5]:
# load file
world_df=pd.read_csv('world_data_raw.csv',sep=',',encoding='utf-8',header=0)

In [6]:
# display header
world_df.head()

Unnamed: 0,Country,Density\n(P/Km2),Abbreviation,Agricultural Land( %),Land Area(Km2),Armed Forces size,Birth Rate,Calling Code,Capital/Major City,Co2-Emissions,...,Out of pocket health expenditure,Physicians per thousand,Population,Population: Labor force participation (%),Tax revenue (%),Total tax rate,Unemployment rate,Urban_population,Latitude,Longitude
0,Afghanistan,60,AF,58.10%,652230,323000.0,32.49,93.0,Kabul,8672,...,78.40%,0.28,38041754,48.90%,9.30%,71.40%,11.12%,9797273,33.93911,67.709953
1,Albania,105,AL,43.10%,28748,9000.0,11.78,355.0,Tirana,4536,...,56.90%,1.2,2854191,55.70%,18.60%,36.60%,12.33%,1747593,41.153332,20.168331
2,Algeria,18,DZ,17.40%,2381741,317000.0,24.28,213.0,Algiers,150006,...,28.10%,1.72,43053054,41.20%,37.20%,66.10%,11.70%,31510100,28.033886,1.659626
3,Andorra,164,AD,40.00%,468,,7.2,376.0,Andorra la Vella,469,...,36.40%,3.33,77142,,,,,67873,42.506285,1.521801
4,Angola,26,AO,47.50%,1246700,117000.0,40.73,244.0,Luanda,34693,...,33.40%,0.21,31825295,77.50%,9.20%,49.10%,6.89%,21061025,-11.202692,17.873887


---
### Perform Initial Discovery
- [x] data frame size
- [x] columns
- [x] nulls
- [x] data types

In [8]:
# data frame size
shape=world_df.shape
print(f'World Data Shape\n------------------\nColumns: {shape[1]}\nRows: {shape[0]}')

World Data Shape
------------------
Columns: 35
Rows: 195


In [9]:
# columns
print(f'World Data Columns\n-------------------\n{world_df.columns}')

World Data Columns
-------------------
Index(['Country', 'Density\n(P/Km2)', 'Abbreviation', 'Agricultural Land( %)',
       'Land Area(Km2)', 'Armed Forces size', 'Birth Rate', 'Calling Code',
       'Capital/Major City', 'Co2-Emissions', 'CPI', 'CPI Change (%)',
       'Currency-Code', 'Fertility Rate', 'Forested Area (%)',
       'Gasoline Price', 'GDP', 'Gross primary education enrollment (%)',
       'Gross tertiary education enrollment (%)', 'Infant mortality',
       'Largest city', 'Life expectancy', 'Maternal mortality ratio',
       'Minimum wage', 'Official language', 'Out of pocket health expenditure',
       'Physicians per thousand', 'Population',
       'Population: Labor force participation (%)', 'Tax revenue (%)',
       'Total tax rate', 'Unemployment rate', 'Urban_population', 'Latitude',
       'Longitude'],
      dtype='object')


In [10]:
# data types
print(f'World Data Types\n------------------------\n{world_df.dtypes}')

World Data Types
------------------------
Country                                       object
Density\n(P/Km2)                              object
Abbreviation                                  object
Agricultural Land( %)                         object
Land Area(Km2)                                object
Armed Forces size                             object
Birth Rate                                   float64
Calling Code                                 float64
Capital/Major City                            object
Co2-Emissions                                 object
CPI                                           object
CPI Change (%)                                object
Currency-Code                                 object
Fertility Rate                               float64
Forested Area (%)                             object
Gasoline Price                                object
GDP                                           object
Gross primary education enrollment (%)        object
Gros

In [11]:
# nulls
# count nulls and compute the percentage of nulls per column
null_counts=world_df.isna().sum()
null_percentages=round((world_df.isna().sum()/shape[0])*100,2)

# data frame for results
nulls_df=pd.DataFrame({
    'Nulls':null_counts,
    'Nulls (%)':null_percentages
})

# display
print("World Data Nulls")
nulls_df

World Data Nulls


Unnamed: 0,Nulls,Nulls (%)
Country,0,0.0
Density\n(P/Km2),0,0.0
Abbreviation,7,3.59
Agricultural Land( %),7,3.59
Land Area(Km2),1,0.51
Armed Forces size,24,12.31
Birth Rate,6,3.08
Calling Code,1,0.51
Capital/Major City,3,1.54
Co2-Emissions,7,3.59


---
### Perform Initial Cleaning
- [x] remove unnecessary columns
- [x] rename columns (snake case)
- [x] fix data types

*null imputation will be completed in a different notebook*

---
**Remove Unnecessary Columns**
- [x] Abbreviation - do not need abbreviation, have country name
- [x] Armed Forces size - unnecessary
- [x] Calling Code - do not need calling code, have country name
- [x] Largest city - unnecessary
- [x] Density\n(P/Km2) - unnecessary
- [x] Total tax rate - tax rates are not a good comparative measure, will use tax revenue %
- [x] Minimum wage - the cost of living varies by country so this is a bad measure
- [x] Gasoline Price - not a good measure as cost of living varies
- [x] Population: Labor force participation (%) - using unemployment rate for this measure instead
- [x] Out of pocket health expenditure - bad measure as many variables impact this across different countries
- [x] Co2-Emissions - unnecessary measure
- [x] Forested Area (%) - bad measure
- [x] Urban_population - unnecessary measure
- [x] CPI Change (%) - unnecessary measure
- [x] Maternal mortality ratio - unnecessary measure
- [x] Capital/Major City - unnecessary measure

In [14]:
# drop the columns listed in the checklist above (land area is kept)
world_df=world_df.drop(columns=['Abbreviation','Density\n(P/Km2)','Armed Forces size',
                                'Calling Code','Largest city','Total tax rate','Minimum wage',
                               'Gasoline Price','Population: Labor force participation (%)','Out of pocket health expenditure',
                               'Co2-Emissions','Forested Area (%)','Urban_population','CPI Change (%)',
                               'Maternal mortality ratio','Capital/Major City'])
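`DataFrame.drop` raises a `KeyError` if any listed column is missing, which is easy to trigger with names like `Density\n(P/Km2)`. A small sketch of checking the list first, assuming a hypothetical `sample_df` and `cols_to_drop` rather than the real data:

```python
import pandas as pd

# hypothetical stand-in for world_df
sample_df = pd.DataFrame({
    'Country': ['Afghanistan'],
    'Abbreviation': ['AF'],
    'Calling Code': [93.0],
})

cols_to_drop = ['Abbreviation', 'Calling Code', 'Largest city']

# report any names that do not match a column before dropping
missing = [c for c in cols_to_drop if c not in sample_df.columns]
print(f'Not found: {missing}')

# drop only the columns that are actually present
present = [c for c in cols_to_drop if c in sample_df.columns]
sample_df = sample_df.drop(columns=present)
print(sample_df.columns.tolist())
```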

In [15]:
display(world_df.head())

Unnamed: 0,Country,Agricultural Land( %),Land Area(Km2),Birth Rate,CPI,Currency-Code,Fertility Rate,GDP,Gross primary education enrollment (%),Gross tertiary education enrollment (%),Infant mortality,Life expectancy,Official language,Physicians per thousand,Population,Tax revenue (%),Unemployment rate,Latitude,Longitude
0,Afghanistan,58.10%,652230,32.49,149.9,AFN,4.47,"$19,101,353,833",104.00%,9.70%,47.9,64.5,Pashto,0.28,38041754,9.30%,11.12%,33.93911,67.709953
1,Albania,43.10%,28748,11.78,119.05,ALL,1.62,"$15,278,077,447",107.00%,55.00%,7.8,78.5,Albanian,1.2,2854191,18.60%,12.33%,41.153332,20.168331
2,Algeria,17.40%,2381741,24.28,151.36,DZD,3.02,"$169,988,236,398",109.90%,51.40%,20.1,76.7,Arabic,1.72,43053054,37.20%,11.70%,28.033886,1.659626
3,Andorra,40.00%,468,7.2,,EUR,1.27,"$3,154,057,987",106.40%,,2.7,,Catalan,3.33,77142,,,42.506285,1.521801
4,Angola,47.50%,1246700,40.73,261.73,AOA,5.52,"$94,635,415,870",113.50%,9.30%,51.6,60.8,Portuguese,0.21,31825295,9.20%,6.89%,-11.202692,17.873887


---
**Rename Columns**
- [x] Country : country
- [x] Agricultural Land( %) : ag_land_pct
- [x] Land Area(Km2) : land_area_km2
- [x] Birth Rate : birth_rt
- [x] CPI : consumer_price_index
- [x] Currency-Code : currency_code
- [x] Fertility Rate : fertility_rt
- [x] GDP : gross_domestic_product
- [x] Gross primary education enrollment (%) : primary_education_enrollment_pct
- [x] Gross tertiary education enrollment (%) : secondary_education_enrollment_pct
- [x] Infant mortality : infant_mortality_rt
- [x] Life expectancy : life_expectancy
- [x] Official language : official_language
- [x] Physicians per thousand : physicians_per_thousand
- [x] Population : population
- [x] Tax revenue (%) : tax_revenue_pct
- [x] Unemployment rate : unemployment_rt
- [x] Latitude : lat
- [x] Longitude : long

In [17]:
# rename
world_df=world_df.rename(columns={
    'Country':'country',
    'Agricultural Land( %)':'ag_land_pct',
    'Land Area(Km2)':'land_area_km2',
    'Birth Rate':'birth_rt',
    'CPI':'consumer_price_index',
    'Currency-Code':'currency_code',
    'Fertility Rate':'fertility_rt',
    'GDP':'gross_domestic_product',
    'Gross primary education enrollment (%)':'primary_education_enrollment_pct',
    'Gross tertiary education enrollment (%)':'secondary_education_enrollment_pct',
    'Infant mortality':'infant_mortality_rt',
    'Life expectancy':'life_expectancy',
    'Official language':'official_language',
    'Physicians per thousand':'physicians_per_thousand',
    'Population':'population',
    'Tax revenue (%)':'tax_revenue_pct',
    'Unemployment rate':'unemployment_rt',
    'Latitude':'lat',
    'Longitude':'long'
})
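One way to confirm that a rename left nothing behind is to test every column name against a snake case pattern. This is only a sketch: the regular expression and the `sample_df` columns are assumptions chosen for illustration.

```python
import re
import pandas as pd

# hypothetical columns: two renamed, one missed
sample_df = pd.DataFrame(columns=['country', 'ag_land_pct', 'Land Area(Km2)'])

# lowercase letters, digits and underscores only, starting with a letter
snake_case = re.compile(r'^[a-z][a-z0-9_]*$')

not_snake = [c for c in sample_df.columns if not snake_case.match(c)]
print(not_snake)
```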

In [18]:
# review changes
world_df.head()

Unnamed: 0,country,ag_land_pct,land_area_km2,birth_rt,consumer_price_index,currency_code,fertility_rt,gross_domestic_product,primary_education_enrollment_pct,secondary_education_enrollment_pct,infant_mortality_rt,life_expectancy,official_language,physicians_per_thousand,population,tax_revenue_pct,unemployment_rt,lat,long
0,Afghanistan,58.10%,652230,32.49,149.9,AFN,4.47,"$19,101,353,833",104.00%,9.70%,47.9,64.5,Pashto,0.28,38041754,9.30%,11.12%,33.93911,67.709953
1,Albania,43.10%,28748,11.78,119.05,ALL,1.62,"$15,278,077,447",107.00%,55.00%,7.8,78.5,Albanian,1.2,2854191,18.60%,12.33%,41.153332,20.168331
2,Algeria,17.40%,2381741,24.28,151.36,DZD,3.02,"$169,988,236,398",109.90%,51.40%,20.1,76.7,Arabic,1.72,43053054,37.20%,11.70%,28.033886,1.659626
3,Andorra,40.00%,468,7.2,,EUR,1.27,"$3,154,057,987",106.40%,,2.7,,Catalan,3.33,77142,,,42.506285,1.521801
4,Angola,47.50%,1246700,40.73,261.73,AOA,5.52,"$94,635,415,870",113.50%,9.30%,51.6,60.8,Portuguese,0.21,31825295,9.20%,6.89%,-11.202692,17.873887


---
**Fix Data Types**
- [x] **ag_land_pct:** object - float
- [x] **land_area_km2:** object - float
- [x] **consumer_price_index:** object - float
- [x] **gross_domestic_product:** object - float
    - renamed to **gross_domestic_product_USD(b)** to show that the values are in billions of USD
- [x] **primary_education_enrollment_pct:** object - float
- [x] **secondary_education_enrollment_pct:** object - float
- [x] **population:** object - float (the column has one null, so a plain int column is not possible)
- [x] **tax_revenue_pct:** object - float
- [x] **unemployment_rt:** object - float

In [20]:
# display data types again, now that we have removed columns
print(f'World Data Types\n------------------------\n{world_df.dtypes}')

World Data Types
------------------------
country                                object
ag_land_pct                            object
land_area_km2                          object
birth_rt                              float64
consumer_price_index                   object
currency_code                          object
fertility_rt                          float64
gross_domestic_product                 object
primary_education_enrollment_pct       object
secondary_education_enrollment_pct     object
infant_mortality_rt                   float64
life_expectancy                       float64
official_language                      object
physicians_per_thousand               float64
population                             object
tax_revenue_pct                        object
unemployment_rt                        object
lat                                   float64
long                                  float64
dtype: object


In [21]:
# helper to display the current format and null count of a column (note: this name shadows the built-in format())
def format(col):
    print(f'Nulls in {col}\n-----------------------\n{world_df[col].isna().sum()}\n\nCurrent Format of {col}\n--------------------\n{world_df[col].unique()}')

In [22]:
# helper to display the data type before and after a change
def type_change(col):
    edited=col+'_edited'
    print(f'Data Type Before: {world_df[col].dtype}\nData Type After: {world_df[edited].dtype}')
    display(world_df[[edited,col]].head())

In [23]:
# helper to display the head of a column after the drop and rename
def name_change(col):
    display(world_df[col].head())
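The conversions in this notebook all follow the same pattern: strip a symbol from a text column, then cast to float. A possible way to express that once is sketched below. The helper name `to_float` and its `symbols` parameter are hypothetical and are not used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

def to_float(series, symbols=('%', ',', '$')):
    """Strip the given symbols from a text column and convert it to float."""
    cleaned = series
    for symbol in symbols:
        # regex=False treats '$' as a literal character, not a regex anchor
        cleaned = cleaned.str.replace(symbol, '', regex=False)
    # strip stray whitespace, then cast; nulls stay NaN
    return cleaned.str.strip().astype(float)

# illustrative values in the three formats seen in the data
sample = pd.Series(['58.10%', '1,246,700', '$3,154,057,987 ', np.nan])
result = to_float(sample)
print(result)
```

Nulls pass through the `.str` methods untouched, so they remain NaN after the cast.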

**ag_land_pct**

object - float

In [25]:
format('ag_land_pct')

Nulls in ag_land_pct
-----------------------
7

Current Format of ag_land_pct
--------------------
['58.10%' '43.10%' '17.40%' '40.00%' '47.50%' '20.50%' '54.30%' '58.90%'
 '48.20%' '32.40%' '57.70%' '1.40%' '11.10%' '70.60%' '23.30%' '42.00%'
 '44.60%' '7.00%' '33.30%' '13.60%' '34.80%' '45.60%' '33.90%' '2.70%'
 '46.30%' '44.20%' '79.20%' '64.80%' '19.60%' '30.90%' '20.60%' '6.90%'
 '8.20%' '39.70%' '21.20%' '56.20%' '40.30%' '71.50%' '31.10%' '34.50%'
 '27.60%' '59.90%' '12.20%' '45.20%' '11.60%' '62.00%' '73.40%' '48.70%'
 '22.20%' '3.80%' '76.40%' '10.10%' '75.20%' '23.10%' nan '36.30%' '7.50%'
 '52.40%' '20.00%' '59.80%' '47.70%' '69.00%' '47.60%' '23.50%' '36.00%'
 '59.00%' '58.00%' '8.60%' '66.80%' '28.90%' '58.40%' '18.70%' '60.40%'
 '31.50%' '28.20%' '21.40%' '64.50%' '24.60%' '43.20%' '41.00%' '12.30%'
 '12.00%' '80.40%' '48.50%' '8.40%' '55.00%' '10.30%' '64.30%' '77.60%'
 '28.00%' '8.70%' '32.20%' '47.20%' '53.70%' '71.20%' '61.40%' '26.30%'
 '33.80%' '63.90%' '38.50%' '42

In [26]:
# remove percentage and convert to float
world_df['ag_land_pct_edited']=world_df['ag_land_pct'].str.replace('%','').astype(float)

type_change('ag_land_pct')

Data Type Before: object
Data Type After: float64


Unnamed: 0,ag_land_pct_edited,ag_land_pct
0,58.1,58.10%
1,43.1,43.10%
2,17.4,17.40%
3,40.0,40.00%
4,47.5,47.50%


In [27]:
# rename and drop
world_df=world_df.drop(columns='ag_land_pct').rename(columns={'ag_land_pct_edited':'ag_land_pct'})

# review drop and rename
name_change('ag_land_pct')

0    58.1
1    43.1
2    17.4
3    40.0
4    47.5
Name: ag_land_pct, dtype: float64

***ag_land_pct** completed*

**land_area_km2**

object - float

In [30]:
# display format
format('land_area_km2')

Nulls in land_area_km2
-----------------------
1

Current Format of land_area_km2
--------------------
['652,230' '28,748' '2,381,741' '468' '1,246,700' '443' '2,780,400'
 '29,743' '7,741,220' '83,871' '86,600' '13,880' '765' '148,460' '430'
 '207,600' '30,528' '22,966' '112,622' '38,394' '1,098,581' '51,197'
 '581,730' '8,515,770' '5,765' '110,879' '274,200' '27,830' '322,463'
 '4,033' '181,035' '475,440' '9,984,670' '622,984' '1,284,000' '756,096'
 '9,596,960' '1,138,910' '2,235' '342,000' '51,100' '56,594' '110,860'
 '9,251' '78,867' '2,344,858' '43,094' '23,200' '751' '48,670' '283,561'
 '1,001,450' '21,041' '28,051' '117,600' '45,228' '17,364' '1,104,300'
 '18,274' '338,145' '643,801' '267,667' '11,300' '69,700' '357,022'
 '238,533' '131,957' '349' '108,889' '245,857' '36,125' '214,969' '27,750'
 '0' '112,090' '93,028' '103,000' '3,287,263' '1,904,569' '1,648,195'
 '438,317' '70,273' '20,770' '301,340' '10,991' '377,944' '89,342'
 '2,724,900' '580,367' '811' '17,818' '199,951' '23

In [31]:
# remove comma and convert to float
world_df['land_area_km2_edited']=world_df['land_area_km2'].str.replace(',','').astype(float)

# display changes
type_change('land_area_km2')

Data Type Before: object
Data Type After: float64


Unnamed: 0,land_area_km2_edited,land_area_km2
0,652230.0,652230
1,28748.0,28748
2,2381741.0,2381741
3,468.0,468
4,1246700.0,1246700


In [32]:
# rename and drop
world_df=world_df.drop(columns='land_area_km2').rename(columns={'land_area_km2_edited':'land_area_km2'})

# review drop and rename
name_change('land_area_km2')

0     652230.0
1      28748.0
2    2381741.0
3        468.0
4    1246700.0
Name: land_area_km2, dtype: float64

***land_area_km2** completed*

**consumer_price_index**

object - float

In [35]:
# display format and nulls
format('consumer_price_index')

Nulls in consumer_price_index
-----------------------
17

Current Format of consumer_price_index
--------------------
['149.9' '119.05' '151.36' nan '261.73' '113.81' '232.75' '129.18' '119.8'
 '118.06' '156.32' '116.22' '117.59' '179.68' '134.09' '117.11' '105.68'
 '110.71' '167.18' '148.32' '104.9' '149.75' '167.4' '99.03' '114.42'
 '106.58' '182.11' '111.61' '110.5' '127.63' '118.65' '116.76' '186.86'
 '117.7' '131.91' '125.08' '140.95' '103.62' '124.74' '128.85' '109.82'
 '102.51' '116.48' '133.85' '110.35' '120.25' '103.87' '135.5' '124.14'
 '288.57' '111.23' '124.35' '122.14' '143.86' '132.3' '112.33' '110.05'
 '122.19' '172.73' '133.61' '112.85' '268.36' '101.87' '107.43' '142.92'
 '262.95' '111.65' '116.19' '179.29' '150.34' '121.64' '129' '180.44'
 '151.18' '550.93' '119.86' '108.15' '110.62' '162.47' '105.48' '125.6'
 '182.75' '180.51' '99.55' '126.6' '155.68' '135.87' '116.86' '130.02'
 '155.86' '223.13' '125.71' '118.38' '115.09' '184.33' '418.34' '121.46'
 '99.7' '108.73' 

In [36]:
# remove commas and convert to float
world_df['consumer_price_index_edited']=world_df['consumer_price_index'].str.replace(',','').astype(float)

# review change
type_change('consumer_price_index')

Data Type Before: object
Data Type After: float64


Unnamed: 0,consumer_price_index_edited,consumer_price_index
0,149.9,149.9
1,119.05,119.05
2,151.36,151.36
3,,
4,261.73,261.73


In [37]:
# rename and drop
world_df=world_df.drop(columns='consumer_price_index').rename(columns={'consumer_price_index_edited':'consumer_price_index'})

# review drop and rename
name_change('consumer_price_index')

0    149.90
1    119.05
2    151.36
3       NaN
4    261.73
Name: consumer_price_index, dtype: float64

***consumer_price_index** completed*
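`astype(float)` raises a `ValueError` if any value is still not numeric after the commas are removed. If such values should become nulls instead, `pd.to_numeric` with `errors='coerce'` is one option. A short sketch with made-up values (the `'n/a'` entry is an assumption and does not come from the data set):

```python
import pandas as pd

# illustrative values: one with a comma, one that is not numeric, one null
sample = pd.Series(['149.9', '1,119.05', 'n/a', None])

# unparseable values are converted to NaN rather than raising an error
converted = pd.to_numeric(sample.str.replace(',', '', regex=False), errors='coerce')
print(converted)
```

Coercing hides bad values, so it is worth comparing the null count before and after the conversion.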

**gross_domestic_product**

object - float

In [40]:
# review format and nulls
format('gross_domestic_product')

Nulls in gross_domestic_product
-----------------------
2

Current Format of gross_domestic_product
--------------------
['$19,101,353,833 ' '$15,278,077,447 ' '$169,988,236,398 '
 '$3,154,057,987 ' '$94,635,415,870 ' '$1,727,759,259 '
 '$449,663,446,954 ' '$13,672,802,158 ' '$1,392,680,589,329 '
 '$446,314,739,528 ' '$39,207,000,000 ' '$12,827,000,000 '
 '$38,574,069,149 ' '$302,571,254,131 ' '$5,209,000,000 '
 '$63,080,457,023 ' '$529,606,710,418 ' '$1,879,613,600 '
 '$14,390,709,095 ' '$2,446,674,101 ' '$40,895,322,865 '
 '$20,047,848,435 ' '$18,340,510,789 ' '$1,839,758,040,766 '
 '$13,469,422,941 ' '$86,000,000,000 ' '$15,745,810,235 '
 '$3,012,334,882 ' '$58,792,205,642 ' '$1,981,845,741 ' '$27,089,389,787 '
 '$38,760,467,033 ' '$1,736,425,629,520 ' '$2,220,307,369 '
 '$11,314,951,343 ' '$282,318,159,745 ' '$19,910,000,000,000 '
 '$323,802,808,108 ' '$1,185,728,677 ' '$10,820,591,131 '
 '$61,773,944,174 ' '$60,415,553,039 ' '$100,023,000,000 '
 '$24,564,647,935 ' '$246,489,245,49

In [41]:
# remove commas and $, convert to float and then convert to billions
world_df['gross_domestic_product_edited']=round((world_df['gross_domestic_product'].str.replace(',','').str.replace('$','',regex=False).astype(float)/1000000000),4)

# review change
type_change('gross_domestic_product')

Data Type Before: object
Data Type After: float64


Unnamed: 0,gross_domestic_product_edited,gross_domestic_product
0,19.1014,"$19,101,353,833"
1,15.2781,"$15,278,077,447"
2,169.9882,"$169,988,236,398"
3,3.1541,"$3,154,057,987"
4,94.6354,"$94,635,415,870"


In [42]:
# drop and rename; the new name notes the unit (billions of USD)
world_df=world_df.drop(columns='gross_domestic_product').rename(columns={'gross_domestic_product_edited':'gross_domestic_product_USD(b)'})

# review change
name_change('gross_domestic_product_USD(b)')

0     19.1014
1     15.2781
2    169.9882
3      3.1541
4     94.6354
Name: gross_domestic_product_USD(b), dtype: float64

***gross_domestic_product_USD(b)** completed*
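As a check on the conversion to billions, the first GDP value can be worked through by hand in plain Python, following the same steps as the pandas code:

```python
# first raw GDP value from the data (note the trailing space)
raw = '$19,101,353,833 '

# strip the symbols, convert to float, then scale to billions
value_usd = float(raw.replace(',', '').replace('$', ''))
value_billions = round(value_usd / 1_000_000_000, 4)

print(value_billions)  # 19.1014
```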

**primary_education_enrollment_pct**

object - float

In [45]:
# display format
format('primary_education_enrollment_pct')

Nulls in primary_education_enrollment_pct
-----------------------
7

Current Format of primary_education_enrollment_pct
--------------------
['104.00%' '107.00%' '109.90%' '106.40%' '113.50%' '105.00%' '109.70%'
 '92.70%' '100.30%' '103.10%' '99.70%' '81.40%' '99.40%' '116.50%'
 '100.50%' '103.90%' '111.70%' '122.00%' '100.10%' '98.20%' nan '103.20%'
 '115.40%' '89.30%' '96.10%' '121.40%' '99.80%' '107.40%' '103.40%'
 '100.90%' '102.00%' '86.80%' '101.40%' '100.20%' '114.50%' '99.50%'
 '106.60%' '113.30%' '96.50%' '101.90%' '99.30%' '100.70%' '108.00%'
 '101.30%' '75.30%' '114.70%' '105.70%' '103.30%' '106.30%' '94.80%'
 '61.80%' '68.40%' '97.20%' '101.00%' '102.50%' '139.90%' '98.00%'
 '98.60%' '104.80%' '99.60%' '106.90%' '91.50%' '118.70%' '97.80%'
 '113.60%' '100.80%' '100.40%' '113.00%' '110.70%' '108.70%' '104.90%'
 '91.00%' '98.80%' '81.50%' '104.40%' '92.40%' '107.60%' '102.40%'
 '95.10%' '120.90%' '85.10%' '109.00%' '104.70%' '102.30%' '142.50%'
 '105.30%' '97.10%' '75.60%' '8

In [46]:
# remove %
world_df['primary_education_enrollment_pct_edited']=world_df['primary_education_enrollment_pct'].str.replace('%','').astype(float)

# review changes
type_change('primary_education_enrollment_pct')

Data Type Before: object
Data Type After: float64


Unnamed: 0,primary_education_enrollment_pct_edited,primary_education_enrollment_pct
0,104.0,104.00%
1,107.0,107.00%
2,109.9,109.90%
3,106.4,106.40%
4,113.5,113.50%


In [47]:
# change names
world_df=world_df.drop(columns='primary_education_enrollment_pct').rename(columns={
    'primary_education_enrollment_pct_edited':'primary_education_enrollment_pct'})

# review changes
name_change('primary_education_enrollment_pct')

0    104.0
1    107.0
2    109.9
3    106.4
4    113.5
Name: primary_education_enrollment_pct, dtype: float64

***primary_education_enrollment_pct** completed*

**secondary_education_enrollment_pct**

object - float

In [50]:
# display format
format('secondary_education_enrollment_pct')

Nulls in secondary_education_enrollment_pct
-----------------------
12

Current Format of secondary_education_enrollment_pct
--------------------
['9.70%' '55.00%' '51.40%' nan '9.30%' '24.80%' '90.00%' '54.60%'
 '113.10%' '85.10%' '27.70%' '15.10%' '50.50%' '20.60%' '65.40%' '87.40%'
 '79.70%' '24.70%' '12.30%' '15.60%' '23.30%' '24.90%' '51.30%' '31.40%'
 '71.00%' '6.50%' '6.10%' '23.60%' '13.70%' '12.80%' '68.90%' '3.00%'
 '3.30%' '88.50%' '50.60%' '55.30%' '9.00%' '12.70%' '55.20%' '67.90%'
 '41.40%' '75.90%' '64.10%' '6.60%' '80.60%' '5.30%' '7.20%' '59.90%'
 '44.90%' '35.20%' '29.40%' '1.90%' '3.40%' '69.60%' '8.10%' '16.10%'
 '88.20%' '65.60%' '8.30%' '2.70%' '63.90%' '70.20%' '15.70%' '136.60%'
 '104.60%' '21.80%' '11.60%' '2.60%' '1.10%' '26.20%' '48.50%' '71.80%'
 '28.10%' '36.30%' '68.10%' '16.20%' '77.80%' '63.40%' '61.90%' '27.10%'
 '63.20%' '34.40%' '61.70%' '11.50%' '54.40%' '41.30%' '15.00%' '88.10%'
 '26.30%' '10.20%' '11.90%' '60.50%' '35.60%' '72.40%' '19.20%' '5.40%

In [51]:
# remove %
world_df['secondary_education_enrollment_pct_edited']=world_df['secondary_education_enrollment_pct'].str.replace('%','').astype(float)

# review changes
type_change('secondary_education_enrollment_pct')

Data Type Before: object
Data Type After: float64


Unnamed: 0,secondary_education_enrollment_pct_edited,secondary_education_enrollment_pct
0,9.7,9.70%
1,55.0,55.00%
2,51.4,51.40%
3,,
4,9.3,9.30%


In [52]:
# drop and rename
world_df=world_df.drop(columns='secondary_education_enrollment_pct').rename(columns={
    'secondary_education_enrollment_pct_edited':'secondary_education_enrollment_pct'})

# review
name_change('secondary_education_enrollment_pct')

0     9.7
1    55.0
2    51.4
3     NaN
4     9.3
Name: secondary_education_enrollment_pct, dtype: float64

***secondary_education_enrollment_pct** completed*

**population**

object - float

In [55]:
# display format
format('population')

Nulls in population
-----------------------
1

Current Format of population
--------------------
['38,041,754' '2,854,191' '43,053,054' '77,142' '31,825,295' '97,118'
 '44,938,712' '2,957,731' '25,766,605' '8,877,067' '10,023,318' '389,482'
 '1,501,635' '167,310,838' '287,025' '9,466,856' '11,484,055' '390,353'
 '11,801,151' '727,145' '11,513,100' '3,301,000' '2,346,179' '212,559,417'
 '433,285' '6,975,761' '20,321,378' '11,530,580' '25,716,544' '483,628'
 '16,486,542' '25,876,380' '36,991,981' '4,745,185' '15,946,876'
 '18,952,038' '1,397,715,000' '50,339,443' '850,886' '5,380,508'
 '5,047,561' '4,067,500' '11,333,483' '1,198,575' '10,669,709'
 '86,790,567' '5,818,553' '973,560' '71,808' '10,738,958' '17,373,662'
 '100,388,073' '6,453,553' '1,355,986' '6,333,135' '1,331,824' '1,093,238'
 '112,078,730' '889,953' '5,520,314' '67,059,887' '2,172,579' '2,347,706'
 '3,720,382' '83,132,799' '30,792,608' '10,716,322' '112,003' '16,604,026'
 '12,771,246' '1,920,922' '782,766' '11,263,077' '83

In [56]:
# remove commas and convert to float (the column contains a null, so int is not possible)
world_df['population_edited']=world_df['population'].str.replace(',','').astype(float)

# review change
type_change('population')

Data Type Before: object
Data Type After: float64


Unnamed: 0,population_edited,population
0,38041754.0,38041754
1,2854191.0,2854191
2,43053054.0,43053054
3,77142.0,77142
4,31825295.0,31825295


In [57]:
# drop, rename
world_df=world_df.drop(columns='population').rename(columns={'population_edited':'population'})

# review
name_change('population')

0    38041754.0
1     2854191.0
2    43053054.0
3       77142.0
4    31825295.0
Name: population, dtype: float64

***population** completed*
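Population ends up as float64 because a NumPy integer column cannot hold a null. If whole numbers are preferred, pandas' nullable `Int64` dtype is one alternative. The sketch below uses a small sample series and is not applied to `world_df`:

```python
import numpy as np
import pandas as pd

# illustrative values, including one null
sample = pd.Series(['38,041,754', '77,142', np.nan])

# same cleaning step as above
as_float = sample.str.replace(',', '', regex=False).astype(float)

# nullable integer dtype: NaN becomes <NA> and the values stay whole numbers
as_int = as_float.astype('Int64')
print(as_int)
```

Note that a CSV file does not store dtypes, so the column would need to be cast again after it is read back.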

**tax_revenue_pct**

object - float

In [60]:
# display format
format('tax_revenue_pct')

Nulls in tax_revenue_pct
-----------------------
26

Current Format of tax_revenue_pct
--------------------
['9.30%' '18.60%' '37.20%' nan '9.20%' '16.50%' '10.10%' '20.90%' '23.00%'
 '25.40%' '13.00%' '14.80%' '4.20%' '8.80%' '27.50%' '14.70%' '24.00%'
 '26.30%' '10.80%' '16.00%' '17.00%' '20.40%' '19.50%' '14.20%' '20.20%'
 '15.00%' '13.60%' '11.80%' '20.10%' '17.10%' '12.80%' '8.60%' '18.20%'
 '9.40%' '14.40%' '9.00%' '22.00%' '24.50%' '14.90%' '10.70%' '32.40%'
 '22.10%' '12.50%' '18.10%' '6.10%' '28.60%' '7.50%' '24.20%' '20.80%'
 '10.20%' '21.70%' '11.50%' '12.60%' '26.20%' '19.40%' '10.60%' '10.30%'
 '17.30%' '23.30%' '11.20%' '7.40%' '2.00%' '18.30%' '23.10%' '24.30%'
 '26.80%' '11.90%' '15.10%' '11.70%' '1.40%' '18.00%' '12.90%' '22.90%'
 '15.30%' '31.60%' '16.90%' '26.50%' '12.00%' '11.60%' '17.80%' '19.10%'
 '13.10%' '25.20%' '17.70%' '16.80%' '21.90%' '0.00%' '5.40%' '27.10%'
 '20.70%' '29.00%' '15.60%' '1.50%' '23.90%' '2.50%' '21.30%' '10.00%'
 '14.30%' '14.00%' '17.40%' 

In [61]:
# remove % and convert
world_df['tax_revenue_pct_edited']=world_df['tax_revenue_pct'].str.replace('%','').astype(float)

# review
type_change('tax_revenue_pct')

Data Type Before: object
Data Type After: float64


Unnamed: 0,tax_revenue_pct_edited,tax_revenue_pct
0,9.3,9.30%
1,18.6,18.60%
2,37.2,37.20%
3,,
4,9.2,9.20%


In [62]:
# drop and rename
world_df=world_df.drop(columns='tax_revenue_pct').rename(columns={'tax_revenue_pct_edited':'tax_revenue_pct'})

# review
name_change('tax_revenue_pct')

0     9.3
1    18.6
2    37.2
3     NaN
4     9.2
Name: tax_revenue_pct, dtype: float64

***tax_revenue_pct** completed*

**unemployment_rt**

object - float

In [65]:
# display format
format('unemployment_rt')

Nulls in unemployment_rt
-----------------------
19

Current Format of unemployment_rt
--------------------
['11.12%' '12.33%' '11.70%' nan '6.89%' '9.79%' '16.99%' '5.27%' '4.67%'
 '5.51%' '10.36%' '0.71%' '4.19%' '10.33%' '4.59%' '5.59%' '6.41%' '2.23%'
 '2.34%' '3.50%' '18.42%' '18.19%' '12.08%' '9.12%' '4.34%' '6.26%'
 '1.43%' '3.32%' '12.25%' '0.68%' '3.38%' '5.56%' '3.68%' '1.89%' '7.09%'
 '4.32%' '9.71%' '9.47%' '11.85%' '6.93%' '1.64%' '7.27%' '1.93%' '4.24%'
 '4.91%' '10.30%' '5.84%' '3.97%' '10.76%' '4.11%' '6.43%' '5.14%' '5.11%'
 '2.08%' '4.10%' '6.59%' '8.43%' '20.00%' '9.06%' '14.40%' '3.04%' '4.33%'
 '17.24%' '2.46%' '4.30%' '2.47%' '13.78%' '5.39%' '3.40%' '2.84%' '5.36%'
 '4.69%' '11.38%' '12.82%' '4.93%' '3.86%' '9.89%' '8.00%' '2.29%'
 '14.72%' '2.64%' '2.18%' '6.33%' '0.63%' '6.52%' '6.23%' '23.41%' '2.81%'
 '18.56%' '6.35%' '1.76%' '5.65%' '6.14%' '7.22%' '3.47%' '9.55%' '6.67%'
 '3.42%' '5.47%' '6.01%' '14.88%' '9.02%' '3.24%' '1.58%' '20.27%' '1.41%'
 '3.20%' '4.

In [66]:
# remove %
world_df['unemployment_rt_edited']=world_df['unemployment_rt'].str.replace('%','').astype(float)

# review
type_change('unemployment_rt')

Data Type Before: object
Data Type After: float64


Unnamed: 0,unemployment_rt_edited,unemployment_rt
0,11.12,11.12%
1,12.33,12.33%
2,11.7,11.70%
3,,
4,6.89,6.89%


In [67]:
# drop and rename
world_df=world_df.drop(columns='unemployment_rt').rename(columns={'unemployment_rt_edited':'unemployment_rt'})

# review
name_change('unemployment_rt')

0    11.12
1    12.33
2    11.70
3      NaN
4     6.89
Name: unemployment_rt, dtype: float64

***unemployment_rt** completed*

In [69]:
# last data type check on whole data frame
print(f'Data Types (after cleaning)\n-----------------------------\n{world_df.dtypes}')

Data Types (after cleaning)
-----------------------------
country                                object
birth_rt                              float64
currency_code                          object
fertility_rt                          float64
infant_mortality_rt                   float64
life_expectancy                       float64
official_language                      object
physicians_per_thousand               float64
lat                                   float64
long                                  float64
ag_land_pct                           float64
land_area_km2                         float64
consumer_price_index                  float64
gross_domestic_product_USD(b)         float64
primary_education_enrollment_pct      float64
secondary_education_enrollment_pct    float64
population                            float64
tax_revenue_pct                       float64
unemployment_rt                       float64
dtype: object
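The final check can also be automated by listing every column that is neither numeric nor an expected text column. A minimal sketch, assuming the three text columns above and a hypothetical `sample_df` with one column deliberately left uncleaned:

```python
import pandas as pd

# hypothetical stand-in for world_df
sample_df = pd.DataFrame({
    'country': ['Afghanistan', 'Albania'],
    'currency_code': ['AFN', 'ALL'],
    'official_language': ['Pashto', 'Albanian'],
    'population': [38041754.0, 2854191.0],
    'tax_revenue_pct': ['9.30%', '18.60%'],  # deliberately left uncleaned
})

# columns that are expected to stay as text
text_cols = {'country', 'currency_code', 'official_language'}

# anything else that is not numeric still needs cleaning
unexpected = [
    c for c in sample_df.columns
    if c not in text_cols and not pd.api.types.is_numeric_dtype(sample_df[c])
]
print(unexpected)
```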


---
### Save Data
* **world_data_clean.csv**

In [71]:
# display head one last time to ensure correct df
world_df.head()

Unnamed: 0,country,birth_rt,currency_code,fertility_rt,infant_mortality_rt,life_expectancy,official_language,physicians_per_thousand,lat,long,ag_land_pct,land_area_km2,consumer_price_index,gross_domestic_product_USD(b),primary_education_enrollment_pct,secondary_education_enrollment_pct,population,tax_revenue_pct,unemployment_rt
0,Afghanistan,32.49,AFN,4.47,47.9,64.5,Pashto,0.28,33.93911,67.709953,58.1,652230.0,149.9,19.1014,104.0,9.7,38041754.0,9.3,11.12
1,Albania,11.78,ALL,1.62,7.8,78.5,Albanian,1.2,41.153332,20.168331,43.1,28748.0,119.05,15.2781,107.0,55.0,2854191.0,18.6,12.33
2,Algeria,24.28,DZD,3.02,20.1,76.7,Arabic,1.72,28.033886,1.659626,17.4,2381741.0,151.36,169.9882,109.9,51.4,43053054.0,37.2,11.7
3,Andorra,7.2,EUR,1.27,2.7,,Catalan,3.33,42.506285,1.521801,40.0,468.0,,3.1541,106.4,,77142.0,,
4,Angola,40.73,AOA,5.52,51.6,60.8,Portuguese,0.21,-11.202692,17.873887,47.5,1246700.0,261.73,94.6354,113.5,9.3,31825295.0,9.2,6.89


In [72]:
# save to csv
world_df.to_csv('world_data_clean.csv',sep=',',encoding='utf-8',index=False,header=True)
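A quick round trip can confirm that the saved file reads back with the expected shape, types, and nulls. The sketch below writes to an in-memory buffer instead of a file so it does not touch `world_data_clean.csv`. `sample_df` is a hypothetical stand-in.

```python
import io
import numpy as np
import pandas as pd

# hypothetical stand-in for the cleaned world_df
sample_df = pd.DataFrame({
    'country': ['Afghanistan', 'Andorra'],
    'consumer_price_index': [149.9, np.nan],
})

# write to an in-memory buffer, then read it back
buffer = io.StringIO()
sample_df.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(reloaded.shape)
print(reloaded.dtypes)
```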