# 1. Data to be used

All data is from the CIA World Factbook (https://www.cia.gov/library/publications/resources/the-world-factbook/)

1. 'emissions' reports the metric tons (Mt) of carbon dioxide emitted nationally from consumption of energy; most values are stated in millions or billions of Mt
2. 'urban' is the percent of total population living in urban areas
3. 'gdp' is the gross domestic product per capita (PPP) in US dollars
4. 'population' reports the number of people

Links to the tables of data:

1. emissions = https://www.cia.gov/library/publications/resources/the-world-factbook/fields/274.html
2. urban = https://www.cia.gov/library/publications/resources/the-world-factbook/fields/349.html
3. gdp = https://www.cia.gov/library/publications/resources/the-world-factbook/fields/211.html
4. pop = https://www.cia.gov/library/publications/resources/the-world-factbook/fields/335.html

# 2. Reading the data

In [94]:
import pandas as pd

In [95]:
#creating a dataframe of CO2 emissions
link1="https://www.cia.gov/library/publications/resources/the-world-factbook/fields/274.html"
emissions=pd.read_html(link1,header=0,flavor='bs4',attrs={'id': 'fieldListing'})[0]
emissions.head()

Unnamed: 0,Country,Carbon dioxide emissions from consumption of energy
0,Afghanistan,9.067 million Mt (2017 est.)
1,Albania,4.5 million Mt (2017 est.)
2,Algeria,135.9 million Mt (2017 est.)
3,American Samoa,"361,100 Mt (2017 est.)"
4,Angola,20.95 million Mt (2017 est.)


In [96]:
#creating a dataframe of percent urbanization
link2="https://www.cia.gov/library/publications/resources/the-world-factbook/fields/349.html"
urban=pd.read_html(link2,header=0,flavor='bs4',attrs={'id': 'fieldListing'})[0]
urban.head()

Unnamed: 0,Country,Urbanization
0,Afghanistan,urban population: 25.5% of total population ...
1,Albania,urban population: 60.3% of total population ...
2,Algeria,urban population: 72.6% of total population ...
3,American Samoa,urban population: 87.2% of total population ...
4,Andorra,urban population: 88.1% of total population ...


In [97]:
#creating a dataframe of GDP per capita
link3="https://www.cia.gov/library/publications/resources/the-world-factbook/fields/211.html"
gdp=pd.read_html(link3,header=0,flavor='bs4',attrs={'id': 'fieldListing'})[0]
gdp.shape

(232, 2)

In [98]:
#creating a data frame of population
link4="https://www.cia.gov/library/publications/resources/the-world-factbook/fields/335.html"
pop=pd.read_html(link4,header=0,flavor='bs4',attrs={'id': 'fieldListing'})[0]
pop.shape

(262, 2)

# 3. Merging data sets

In [99]:
#1st merge (inner join on Country, the pd.merge default). Confirm that country data is lining up.
join1=pd.merge(emissions,urban,on='Country')
join1.head()

Unnamed: 0,Country,Carbon dioxide emissions from consumption of energy,Urbanization
0,Afghanistan,9.067 million Mt (2017 est.),urban population: 25.5% of total population ...
1,Albania,4.5 million Mt (2017 est.),urban population: 60.3% of total population ...
2,Algeria,135.9 million Mt (2017 est.),urban population: 72.6% of total population ...
3,American Samoa,"361,100 Mt (2017 est.)",urban population: 87.2% of total population ...
4,Angola,20.95 million Mt (2017 est.),urban population: 65.5% of total population ...


In [100]:
#18 countries that didn't show up in both dataframes being merged were dropped.
join1.shape

(214, 3)
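
The row count alone does not say which countries were lost in the merge. A minimal sketch of one way to list them, using plain Python sets on two short, made-up name lists (the helper `unmatched` and the sample lists `left_names` and `right_names` are illustrative and not part of the notebook):

```python
def unmatched(left_names, right_names):
    """Return (only_in_left, only_in_right) as sorted lists."""
    left, right = set(left_names), set(right_names)
    return sorted(left - right), sorted(right - left)

# Hypothetical sample names, not the real Factbook tables
left_names = ["Afghanistan", "Albania", "Angola"]
right_names = ["Afghanistan", "Albania", "Andorra"]

only_left, only_right = unmatched(left_names, right_names)
print(only_left)   # names an inner merge would drop from the left table
print(only_right)  # names an inner merge would drop from the right table
```

With the real dataframes the inputs would be `emissions.Country` and `urban.Country`.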

In [101]:
#2nd merge. Confirm that country data is lining up. 
join2=pd.merge(join1,gdp,on='Country')
join2.shape

(214, 4)

In [102]:
#3rd merge. Confirm that country data is lining up. 
data=pd.merge(join2,pop,on='Country')
data.head()

Unnamed: 0,Country,Carbon dioxide emissions from consumption of energy,Urbanization,GDP - per capita (PPP),Population
0,Afghanistan,9.067 million Mt (2017 est.),urban population: 25.5% of total population ...,"$2,000 (2017 est.) $2,000 (2016 est.) $2,0...","34,940,837 (July 2018 est.)"
1,Albania,4.5 million Mt (2017 est.),urban population: 60.3% of total population ...,"$12,500 (2017 est.) $12,100 (2016 est.) $1...","3,057,220 (July 2018 est.)"
2,Algeria,135.9 million Mt (2017 est.),urban population: 72.6% of total population ...,"$15,200 (2017 est.) $15,200 (2016 est.) $1...","41,657,488 (July 2018 est.)"
3,American Samoa,"361,100 Mt (2017 est.)",urban population: 87.2% of total population ...,"$11,200 (2016 est.) $11,300 (2015 est.) $1...","50,826 (July 2018 est.)"
4,Angola,20.95 million Mt (2017 est.),urban population: 65.5% of total population ...,"$6,800 (2017 est.) $7,200 (2016 est.) $7,6...","30,355,880 (July 2017 est.) (July 2018 est.)..."


In [103]:
data.shape

(214, 5)

# 4. Renaming columns

In [104]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 214 entries, 0 to 213
Data columns (total 5 columns):
Country                                                214 non-null object
Carbon dioxide emissions from consumption of energy    214 non-null object
Urbanization                                           214 non-null object
GDP - per capita (PPP)                                 214 non-null object
Population                                             214 non-null object
dtypes: object(5)
memory usage: 10.0+ KB


In [105]:
data.columns

Index(['Country', 'Carbon dioxide emissions from consumption of energy',
       'Urbanization', 'GDP - per capita (PPP)', 'Population'],
      dtype='object')

In [106]:
#Population already has a usable name, so only the first four columns get new names
newNames=['Country','CO2_Emissions','Urbanization','GDP_Per_Capita']

In [107]:
nameChanges={old:new for old,new in zip(data.columns,newNames)}

In [108]:
data.rename(nameChanges,axis=1,inplace=True)
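
The rename works with four new names for five columns because `zip` stops at the shorter sequence, and `rename` leaves any label missing from the mapping untouched. A small sketch of the mapping that results, with the column labels copied from the output above (the snake_case variable names are only for this illustration):

```python
columns = ["Country",
           "Carbon dioxide emissions from consumption of energy",
           "Urbanization", "GDP - per capita (PPP)", "Population"]
new_names = ["Country", "CO2_Emissions", "Urbanization", "GDP_Per_Capita"]

# zip stops at the shorter sequence, so "Population" gets no entry
name_changes = dict(zip(columns, new_names))
print(len(name_changes))             # 4
print("Population" in name_changes)  # False
```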

In [109]:
data.head()

Unnamed: 0,Country,CO2_Emissions,Urbanization,GDP_Per_Capita,Population
0,Afghanistan,9.067 million Mt (2017 est.),urban population: 25.5% of total population ...,"$2,000 (2017 est.) $2,000 (2016 est.) $2,0...","34,940,837 (July 2018 est.)"
1,Albania,4.5 million Mt (2017 est.),urban population: 60.3% of total population ...,"$12,500 (2017 est.) $12,100 (2016 est.) $1...","3,057,220 (July 2018 est.)"
2,Algeria,135.9 million Mt (2017 est.),urban population: 72.6% of total population ...,"$15,200 (2017 est.) $15,200 (2016 est.) $1...","41,657,488 (July 2018 est.)"
3,American Samoa,"361,100 Mt (2017 est.)",urban population: 87.2% of total population ...,"$11,200 (2016 est.) $11,300 (2015 est.) $1...","50,826 (July 2018 est.)"
4,Angola,20.95 million Mt (2017 est.),urban population: 65.5% of total population ...,"$6,800 (2017 est.) $7,200 (2016 est.) $7,6...","30,355,880 (July 2017 est.) (July 2018 est.)..."


In [110]:
data.dtypes

Country           object
CO2_Emissions     object
Urbanization      object
GDP_Per_Capita    object
Population        object
dtype: object

# 5. Cleaning the Country Column

In [111]:
#Getting rid of every space, including spaces inside multi-word names
import re
pattern='\\ '
nothing=''
testString='World '
re.sub(pattern,nothing,testString)

'World'

In [112]:
pattern='\\ '
nothing=''
newValues=[re.sub(pattern,nothing,oldValue) for oldValue in data.Country]

In [113]:
data=data.assign(Country=newValues)
data.head()

Unnamed: 0,Country,CO2_Emissions,Urbanization,GDP_Per_Capita,Population
0,Afghanistan,9.067 million Mt (2017 est.),urban population: 25.5% of total population ...,"$2,000 (2017 est.) $2,000 (2016 est.) $2,0...","34,940,837 (July 2018 est.)"
1,Albania,4.5 million Mt (2017 est.),urban population: 60.3% of total population ...,"$12,500 (2017 est.) $12,100 (2016 est.) $1...","3,057,220 (July 2018 est.)"
2,Algeria,135.9 million Mt (2017 est.),urban population: 72.6% of total population ...,"$15,200 (2017 est.) $15,200 (2016 est.) $1...","41,657,488 (July 2018 est.)"
3,AmericanSamoa,"361,100 Mt (2017 est.)",urban population: 87.2% of total population ...,"$11,200 (2016 est.) $11,300 (2015 est.) $1...","50,826 (July 2018 est.)"
4,Angola,20.95 million Mt (2017 est.),urban population: 65.5% of total population ...,"$6,800 (2017 est.) $7,200 (2016 est.) $7,6...","30,355,880 (July 2017 est.) (July 2018 est.)..."
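
Removing every space also fuses multi-word names ("AmericanSamoa"). If readable names are wanted later, an alternative is to trim only the ends and collapse repeated inner spaces. This is a sketch of that alternative, not what the notebook does; `tidy_name` is a hypothetical helper:

```python
import re

def tidy_name(name):
    """Trim the ends and collapse inner runs of whitespace to one space."""
    return re.sub(r"\s+", " ", name).strip()

print(tidy_name("World "))            # World
print(tidy_name("American  Samoa "))  # American Samoa
```

The later cells and the saved CSV would then carry names such as "American Samoa" and "West Bank" rather than the fused forms.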


# 6. Cleaning the CO2 Emissions Column

In [114]:
#Split at ' Mt' to keep the number and any unit word ('million' or 'billion')
emissionsnumber=[element.split(' Mt')[0] for element in data.CO2_Emissions]

#Making the above list a new column:
data=data.assign(CO2_Emissions_Number=emissionsnumber)
data.head()

Unnamed: 0,Country,CO2_Emissions,Urbanization,GDP_Per_Capita,Population,CO2_Emissions_Number
0,Afghanistan,9.067 million Mt (2017 est.),urban population: 25.5% of total population ...,"$2,000 (2017 est.) $2,000 (2016 est.) $2,0...","34,940,837 (July 2018 est.)",9.067 million
1,Albania,4.5 million Mt (2017 est.),urban population: 60.3% of total population ...,"$12,500 (2017 est.) $12,100 (2016 est.) $1...","3,057,220 (July 2018 est.)",4.5 million
2,Algeria,135.9 million Mt (2017 est.),urban population: 72.6% of total population ...,"$15,200 (2017 est.) $15,200 (2016 est.) $1...","41,657,488 (July 2018 est.)",135.9 million
3,AmericanSamoa,"361,100 Mt (2017 est.)",urban population: 87.2% of total population ...,"$11,200 (2016 est.) $11,300 (2015 est.) $1...","50,826 (July 2018 est.)",361100
4,Angola,20.95 million Mt (2017 est.),urban population: 65.5% of total population ...,"$6,800 (2017 est.) $7,200 (2016 est.) $7,6...","30,355,880 (July 2017 est.) (July 2018 est.)...",20.95 million


In [115]:
# the new values may include a unit word ('million' or 'billion'); keep it in a separate list

units=[] #empty list

for element in data.CO2_Emissions_Number:
    result=element.split(' ')
    if len(result)>1:
        units.append(result[1])  # keep the unit word
    else:
        units.append(1) # no unit word, so the multiplier is 1

In [116]:
#Making the above list a new column:
data=data.assign(units=units)
data.head()

Unnamed: 0,Country,CO2_Emissions,Urbanization,GDP_Per_Capita,Population,CO2_Emissions_Number,units
0,Afghanistan,9.067 million Mt (2017 est.),urban population: 25.5% of total population ...,"$2,000 (2017 est.) $2,000 (2016 est.) $2,0...","34,940,837 (July 2018 est.)",9.067 million,million
1,Albania,4.5 million Mt (2017 est.),urban population: 60.3% of total population ...,"$12,500 (2017 est.) $12,100 (2016 est.) $1...","3,057,220 (July 2018 est.)",4.5 million,million
2,Algeria,135.9 million Mt (2017 est.),urban population: 72.6% of total population ...,"$15,200 (2017 est.) $15,200 (2016 est.) $1...","41,657,488 (July 2018 est.)",135.9 million,million
3,AmericanSamoa,"361,100 Mt (2017 est.)",urban population: 87.2% of total population ...,"$11,200 (2016 est.) $11,300 (2015 est.) $1...","50,826 (July 2018 est.)",361100,1
4,Angola,20.95 million Mt (2017 est.),urban population: 65.5% of total population ...,"$6,800 (2017 est.) $7,200 (2016 est.) $7,6...","30,355,880 (July 2017 est.) (July 2018 est.)...",20.95 million,million


In [117]:
#Now I can keep the first element (number):
emissionsnumber=[element.split(' ')[0] for element in data.CO2_Emissions_Number]

#Making the above list a new column:
data=data.assign(CO2_Emissions_Number=emissionsnumber)
data.head()

Unnamed: 0,Country,CO2_Emissions,Urbanization,GDP_Per_Capita,Population,CO2_Emissions_Number,units
0,Afghanistan,9.067 million Mt (2017 est.),urban population: 25.5% of total population ...,"$2,000 (2017 est.) $2,000 (2016 est.) $2,0...","34,940,837 (July 2018 est.)",9.067,million
1,Albania,4.5 million Mt (2017 est.),urban population: 60.3% of total population ...,"$12,500 (2017 est.) $12,100 (2016 est.) $1...","3,057,220 (July 2018 est.)",4.5,million
2,Algeria,135.9 million Mt (2017 est.),urban population: 72.6% of total population ...,"$15,200 (2017 est.) $15,200 (2016 est.) $1...","41,657,488 (July 2018 est.)",135.9,million
3,AmericanSamoa,"361,100 Mt (2017 est.)",urban population: 87.2% of total population ...,"$11,200 (2016 est.) $11,300 (2015 est.) $1...","50,826 (July 2018 est.)",361100.0,1
4,Angola,20.95 million Mt (2017 est.),urban population: 65.5% of total population ...,"$6,800 (2017 est.) $7,200 (2016 est.) $7,6...","30,355,880 (July 2017 est.) (July 2018 est.)...",20.95,million


In [118]:
data.units.value_counts() # we need to turn this into numbers

million    162
1           46
billion      6
Name: units, dtype: int64

In [119]:
newUnits=[10**6 if x=='million' else x for x in data.units] # first the millions
newUnits=[10**9 if x=='billion' else x for x in newUnits] #then the billions
# rewriting column
data=data.assign(units=newUnits)
data.head()

Unnamed: 0,Country,CO2_Emissions,Urbanization,GDP_Per_Capita,Population,CO2_Emissions_Number,units
0,Afghanistan,9.067 million Mt (2017 est.),urban population: 25.5% of total population ...,"$2,000 (2017 est.) $2,000 (2016 est.) $2,0...","34,940,837 (July 2018 est.)",9.067,1000000
1,Albania,4.5 million Mt (2017 est.),urban population: 60.3% of total population ...,"$12,500 (2017 est.) $12,100 (2016 est.) $1...","3,057,220 (July 2018 est.)",4.5,1000000
2,Algeria,135.9 million Mt (2017 est.),urban population: 72.6% of total population ...,"$15,200 (2017 est.) $15,200 (2016 est.) $1...","41,657,488 (July 2018 est.)",135.9,1000000
3,AmericanSamoa,"361,100 Mt (2017 est.)",urban population: 87.2% of total population ...,"$11,200 (2016 est.) $11,300 (2015 est.) $1...","50,826 (July 2018 est.)",361100.0,1
4,Angola,20.95 million Mt (2017 est.),urban population: 65.5% of total population ...,"$6,800 (2017 est.) $7,200 (2016 est.) $7,6...","30,355,880 (July 2017 est.) (July 2018 est.)...",20.95,1000000


In [120]:
data.dtypes # checking data type

Country                 object
CO2_Emissions           object
Urbanization            object
GDP_Per_Capita          object
Population              object
CO2_Emissions_Number    object
units                    int64
dtype: object

In [121]:
# some values contain commas, which must be removed before converting to numbers:
import re
pattern='\\,'
nothing=''
testString='1,073,002'
re.sub(pattern,nothing,testString)

'1073002'

In [122]:
pattern='\\,'
nothing=''

newValues=[re.sub(pattern,nothing,oldValue) for oldValue in data.CO2_Emissions_Number]

In [123]:
# the column now holds clean numeric strings (its dtype is still object)
data=data.assign(CO2_Emissions_Number=newValues)
data.head()

Unnamed: 0,Country,CO2_Emissions,Urbanization,GDP_Per_Capita,Population,CO2_Emissions_Number,units
0,Afghanistan,9.067 million Mt (2017 est.),urban population: 25.5% of total population ...,"$2,000 (2017 est.) $2,000 (2016 est.) $2,0...","34,940,837 (July 2018 est.)",9.067,1000000
1,Albania,4.5 million Mt (2017 est.),urban population: 60.3% of total population ...,"$12,500 (2017 est.) $12,100 (2016 est.) $1...","3,057,220 (July 2018 est.)",4.5,1000000
2,Algeria,135.9 million Mt (2017 est.),urban population: 72.6% of total population ...,"$15,200 (2017 est.) $15,200 (2016 est.) $1...","41,657,488 (July 2018 est.)",135.9,1000000
3,AmericanSamoa,"361,100 Mt (2017 est.)",urban population: 87.2% of total population ...,"$11,200 (2016 est.) $11,300 (2015 est.) $1...","50,826 (July 2018 est.)",361100.0,1
4,Angola,20.95 million Mt (2017 est.),urban population: 65.5% of total population ...,"$6,800 (2017 est.) $7,200 (2016 est.) $7,6...","30,355,880 (July 2017 est.) (July 2018 est.)...",20.95,1000000


In [124]:
data.dtypes #CO2_Emissions_Number is still object, not yet float64

Country                 object
CO2_Emissions           object
Urbanization            object
GDP_Per_Capita          object
Population              object
CO2_Emissions_Number    object
units                    int64
dtype: object

In [125]:
data.CO2_Emissions_Number=pd.to_numeric(data.CO2_Emissions_Number) #to convert to float64 numbers

In [126]:
data.dtypes #make sure it worked

Country                  object
CO2_Emissions            object
Urbanization             object
GDP_Per_Capita           object
Population               object
CO2_Emissions_Number    float64
units                     int64
dtype: object

In [127]:
data.head()

Unnamed: 0,Country,CO2_Emissions,Urbanization,GDP_Per_Capita,Population,CO2_Emissions_Number,units
0,Afghanistan,9.067 million Mt (2017 est.),urban population: 25.5% of total population ...,"$2,000 (2017 est.) $2,000 (2016 est.) $2,0...","34,940,837 (July 2018 est.)",9.067,1000000
1,Albania,4.5 million Mt (2017 est.),urban population: 60.3% of total population ...,"$12,500 (2017 est.) $12,100 (2016 est.) $1...","3,057,220 (July 2018 est.)",4.5,1000000
2,Algeria,135.9 million Mt (2017 est.),urban population: 72.6% of total population ...,"$15,200 (2017 est.) $15,200 (2016 est.) $1...","41,657,488 (July 2018 est.)",135.9,1000000
3,AmericanSamoa,"361,100 Mt (2017 est.)",urban population: 87.2% of total population ...,"$11,200 (2016 est.) $11,300 (2015 est.) $1...","50,826 (July 2018 est.)",361100.0,1
4,Angola,20.95 million Mt (2017 est.),urban population: 65.5% of total population ...,"$6,800 (2017 est.) $7,200 (2016 est.) $7,6...","30,355,880 (July 2017 est.) (July 2018 est.)...",20.95,1000000


In [128]:
#Multiply units and CO2_Emissions_Number to get the entire number
data.CO2_Emissions_Number=data.CO2_Emissions_Number*data.units

In [129]:
data.head()

Unnamed: 0,Country,CO2_Emissions,Urbanization,GDP_Per_Capita,Population,CO2_Emissions_Number,units
0,Afghanistan,9.067 million Mt (2017 est.),urban population: 25.5% of total population ...,"$2,000 (2017 est.) $2,000 (2016 est.) $2,0...","34,940,837 (July 2018 est.)",9067000.0,1000000
1,Albania,4.5 million Mt (2017 est.),urban population: 60.3% of total population ...,"$12,500 (2017 est.) $12,100 (2016 est.) $1...","3,057,220 (July 2018 est.)",4500000.0,1000000
2,Algeria,135.9 million Mt (2017 est.),urban population: 72.6% of total population ...,"$15,200 (2017 est.) $15,200 (2016 est.) $1...","41,657,488 (July 2018 est.)",135900000.0,1000000
3,AmericanSamoa,"361,100 Mt (2017 est.)",urban population: 87.2% of total population ...,"$11,200 (2016 est.) $11,300 (2015 est.) $1...","50,826 (July 2018 est.)",361100.0,1
4,Angola,20.95 million Mt (2017 est.),urban population: 65.5% of total population ...,"$6,800 (2017 est.) $7,200 (2016 est.) $7,6...","30,355,880 (July 2017 est.) (July 2018 est.)...",20950000.0,1000000
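
The split, unit lookup, comma removal and multiplication above can also be expressed as one parsing function. The following is a minimal sketch under the assumption that every value looks like the samples shown (a number, an optional "million" or "billion", then "Mt"); `parse_emissions`, `MULTIPLIERS` and `EMISSIONS` are hypothetical names:

```python
import re

MULTIPLIERS = {None: 1, "million": 10**6, "billion": 10**9}
EMISSIONS = re.compile(r"\s*([\d,]+(?:\.\d+)?)\s*(million|billion)?\s*Mt")

def parse_emissions(text):
    """Return the emissions value in Mt as a float, or None if no match."""
    match = EMISSIONS.match(text)
    if match is None:
        return None
    number = float(match.group(1).replace(",", ""))
    # group(2) is None when the value carries no unit word
    return number * MULTIPLIERS[match.group(2)]

print(parse_emissions("9.067 million Mt (2017 est.)"))
print(parse_emissions("361,100 Mt (2017 est.)"))
```

Returning `None` for unmatched text makes unexpected formats visible instead of raising partway through the column.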


In [130]:
#dropping the helper units column and the original CO2_Emissions column
data=data.drop(['units','CO2_Emissions'],axis=1)

In [131]:
data.head()

Unnamed: 0,Country,Urbanization,GDP_Per_Capita,Population,CO2_Emissions_Number
0,Afghanistan,urban population: 25.5% of total population ...,"$2,000 (2017 est.) $2,000 (2016 est.) $2,0...","34,940,837 (July 2018 est.)",9067000.0
1,Albania,urban population: 60.3% of total population ...,"$12,500 (2017 est.) $12,100 (2016 est.) $1...","3,057,220 (July 2018 est.)",4500000.0
2,Algeria,urban population: 72.6% of total population ...,"$15,200 (2017 est.) $15,200 (2016 est.) $1...","41,657,488 (July 2018 est.)",135900000.0
3,AmericanSamoa,urban population: 87.2% of total population ...,"$11,200 (2016 est.) $11,300 (2015 est.) $1...","50,826 (July 2018 est.)",361100.0
4,Angola,urban population: 65.5% of total population ...,"$6,800 (2017 est.) $7,200 (2016 est.) $7,6...","30,355,880 (July 2017 est.) (July 2018 est.)...",20950000.0


# 7. Cleaning the Urbanization Column

In [132]:
#Keep everything before the first '%' sign
urbanizationnumber=[element.split('%')[0] for element in data.Urbanization]
data=data.assign(Urbanization1=urbanizationnumber)
data.head()

Unnamed: 0,Country,Urbanization,GDP_Per_Capita,Population,CO2_Emissions_Number,Urbanization1
0,Afghanistan,urban population: 25.5% of total population ...,"$2,000 (2017 est.) $2,000 (2016 est.) $2,0...","34,940,837 (July 2018 est.)",9067000.0,urban population: 25.5
1,Albania,urban population: 60.3% of total population ...,"$12,500 (2017 est.) $12,100 (2016 est.) $1...","3,057,220 (July 2018 est.)",4500000.0,urban population: 60.3
2,Algeria,urban population: 72.6% of total population ...,"$15,200 (2017 est.) $15,200 (2016 est.) $1...","41,657,488 (July 2018 est.)",135900000.0,urban population: 72.6
3,AmericanSamoa,urban population: 87.2% of total population ...,"$11,200 (2016 est.) $11,300 (2015 est.) $1...","50,826 (July 2018 est.)",361100.0,urban population: 87.2
4,Angola,urban population: 65.5% of total population ...,"$6,800 (2017 est.) $7,200 (2016 est.) $7,6...","30,355,880 (July 2017 est.) (July 2018 est.)...",20950000.0,urban population: 65.5


In [133]:
#Keep the number that follows the colon
urbanizationnumber=[element.split(':')[1] for element in data.Urbanization1]
data=data.assign(Urbanization_Percentage=urbanizationnumber)
data.head()

Unnamed: 0,Country,Urbanization,GDP_Per_Capita,Population,CO2_Emissions_Number,Urbanization1,Urbanization_Percentage
0,Afghanistan,urban population: 25.5% of total population ...,"$2,000 (2017 est.) $2,000 (2016 est.) $2,0...","34,940,837 (July 2018 est.)",9067000.0,urban population: 25.5,25.5
1,Albania,urban population: 60.3% of total population ...,"$12,500 (2017 est.) $12,100 (2016 est.) $1...","3,057,220 (July 2018 est.)",4500000.0,urban population: 60.3,60.3
2,Algeria,urban population: 72.6% of total population ...,"$15,200 (2017 est.) $15,200 (2016 est.) $1...","41,657,488 (July 2018 est.)",135900000.0,urban population: 72.6,72.6
3,AmericanSamoa,urban population: 87.2% of total population ...,"$11,200 (2016 est.) $11,300 (2015 est.) $1...","50,826 (July 2018 est.)",361100.0,urban population: 87.2,87.2
4,Angola,urban population: 65.5% of total population ...,"$6,800 (2017 est.) $7,200 (2016 est.) $7,6...","30,355,880 (July 2017 est.) (July 2018 est.)...",20950000.0,urban population: 65.5,65.5


In [134]:
data.dtypes

Country                     object
Urbanization                object
GDP_Per_Capita              object
Population                  object
CO2_Emissions_Number       float64
Urbanization1               object
Urbanization_Percentage     object
dtype: object

In [135]:
data.Urbanization_Percentage=pd.to_numeric(data.Urbanization_Percentage)

In [136]:
data.dtypes

Country                     object
Urbanization                object
GDP_Per_Capita              object
Population                  object
CO2_Emissions_Number       float64
Urbanization1               object
Urbanization_Percentage    float64
dtype: object

In [137]:
#dropping old Urbanization columns
data=data.drop(["Urbanization","Urbanization1"],axis=1)

In [138]:
data.head()

Unnamed: 0,Country,GDP_Per_Capita,Population,CO2_Emissions_Number,Urbanization_Percentage
0,Afghanistan,"$2,000 (2017 est.) $2,000 (2016 est.) $2,0...","34,940,837 (July 2018 est.)",9067000.0,25.5
1,Albania,"$12,500 (2017 est.) $12,100 (2016 est.) $1...","3,057,220 (July 2018 est.)",4500000.0,60.3
2,Algeria,"$15,200 (2017 est.) $15,200 (2016 est.) $1...","41,657,488 (July 2018 est.)",135900000.0,72.6
3,AmericanSamoa,"$11,200 (2016 est.) $11,300 (2015 est.) $1...","50,826 (July 2018 est.)",361100.0,87.2
4,Angola,"$6,800 (2017 est.) $7,200 (2016 est.) $7,6...","30,355,880 (July 2017 est.) (July 2018 est.)...",20950000.0,65.5
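
The two splits above (at '%' and then at ':') can be collapsed into one regular-expression search for the first number followed by a percent sign. A minimal sketch, assuming the text always has the "urban population: N% ..." shape seen in the output; `urban_percent` is a hypothetical helper:

```python
import re

def urban_percent(text):
    """Return the first percentage in the text as a float, or None."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*%", text)
    return float(match.group(1)) if match else None

print(urban_percent("urban population: 25.5% of total population ..."))  # 25.5
```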


# 8. Cleaning the GDP Per Capita Column

In [139]:
#Split at '(' to keep only the most recent estimate
gdppercapita=[element.split('(')[0] for element in data.GDP_Per_Capita]

#Making the above list a new column:
data=data.assign(GDPPerCapita1=gdppercapita)
data.head()

Unnamed: 0,Country,GDP_Per_Capita,Population,CO2_Emissions_Number,Urbanization_Percentage,GDPPerCapita1
0,Afghanistan,"$2,000 (2017 est.) $2,000 (2016 est.) $2,0...","34,940,837 (July 2018 est.)",9067000.0,25.5,"$2,000"
1,Albania,"$12,500 (2017 est.) $12,100 (2016 est.) $1...","3,057,220 (July 2018 est.)",4500000.0,60.3,"$12,500"
2,Algeria,"$15,200 (2017 est.) $15,200 (2016 est.) $1...","41,657,488 (July 2018 est.)",135900000.0,72.6,"$15,200"
3,AmericanSamoa,"$11,200 (2016 est.) $11,300 (2015 est.) $1...","50,826 (July 2018 est.)",361100.0,87.2,"$11,200"
4,Angola,"$6,800 (2017 est.) $7,200 (2016 est.) $7,6...","30,355,880 (July 2017 est.) (July 2018 est.)...",20950000.0,65.5,"$6,800"


In [140]:
#Split at '$' to drop the dollar sign
gdppercapita=[element.split('$')[1] for element in data.GDPPerCapita1]

#Making the above list a new column:
data=data.assign(GDPPerCapita2=gdppercapita)
data.head()

Unnamed: 0,Country,GDP_Per_Capita,Population,CO2_Emissions_Number,Urbanization_Percentage,GDPPerCapita1,GDPPerCapita2
0,Afghanistan,"$2,000 (2017 est.) $2,000 (2016 est.) $2,0...","34,940,837 (July 2018 est.)",9067000.0,25.5,"$2,000",2000
1,Albania,"$12,500 (2017 est.) $12,100 (2016 est.) $1...","3,057,220 (July 2018 est.)",4500000.0,60.3,"$12,500",12500
2,Algeria,"$15,200 (2017 est.) $15,200 (2016 est.) $1...","41,657,488 (July 2018 est.)",135900000.0,72.6,"$15,200",15200
3,AmericanSamoa,"$11,200 (2016 est.) $11,300 (2015 est.) $1...","50,826 (July 2018 est.)",361100.0,87.2,"$11,200",11200
4,Angola,"$6,800 (2017 est.) $7,200 (2016 est.) $7,6...","30,355,880 (July 2017 est.) (July 2018 est.)...",20950000.0,65.5,"$6,800",6800


In [141]:
pattern='\\,'
nothing=''
testString='2,100'
re.sub(pattern,nothing,testString)

'2100'

In [142]:
#to get rid of the commas
pattern='\\,'
nothing=''
newValues=[re.sub(pattern,nothing,oldValue) for oldValue in data.GDPPerCapita2]

In [143]:
data=data.assign(GDP_Per_Capita_Number=newValues)
data.head()

Unnamed: 0,Country,GDP_Per_Capita,Population,CO2_Emissions_Number,Urbanization_Percentage,GDPPerCapita1,GDPPerCapita2,GDP_Per_Capita_Number
0,Afghanistan,"$2,000 (2017 est.) $2,000 (2016 est.) $2,0...","34,940,837 (July 2018 est.)",9067000.0,25.5,"$2,000",2000,2000
1,Albania,"$12,500 (2017 est.) $12,100 (2016 est.) $1...","3,057,220 (July 2018 est.)",4500000.0,60.3,"$12,500",12500,12500
2,Algeria,"$15,200 (2017 est.) $15,200 (2016 est.) $1...","41,657,488 (July 2018 est.)",135900000.0,72.6,"$15,200",15200,15200
3,AmericanSamoa,"$11,200 (2016 est.) $11,300 (2015 est.) $1...","50,826 (July 2018 est.)",361100.0,87.2,"$11,200",11200,11200
4,Angola,"$6,800 (2017 est.) $7,200 (2016 est.) $7,6...","30,355,880 (July 2017 est.) (July 2018 est.)...",20950000.0,65.5,"$6,800",6800,6800


In [144]:
#dropping old GDP columns
data=data.drop(["GDP_Per_Capita","GDPPerCapita1","GDPPerCapita2"],axis=1)

In [145]:
data.dtypes

Country                     object
Population                  object
CO2_Emissions_Number       float64
Urbanization_Percentage    float64
GDP_Per_Capita_Number       object
dtype: object

In [146]:
#NA values must become empty strings so the column can be converted to numbers later
pattern='NA'
nothing=''
testString='NA'
re.sub(pattern,nothing,testString)

''

In [147]:
#to get rid of the NA's
pattern='NA'
nothing=''
newValues=[re.sub(pattern,nothing,oldValue) for oldValue in data.GDP_Per_Capita_Number]

In [148]:
data=data.assign(GDP_Per_Capita_Number=newValues)
data.head()

Unnamed: 0,Country,Population,CO2_Emissions_Number,Urbanization_Percentage,GDP_Per_Capita_Number
0,Afghanistan,"34,940,837 (July 2018 est.)",9067000.0,25.5,2000
1,Albania,"3,057,220 (July 2018 est.)",4500000.0,60.3,12500
2,Algeria,"41,657,488 (July 2018 est.)",135900000.0,72.6,15200
3,AmericanSamoa,"50,826 (July 2018 est.)",361100.0,87.2,11200
4,Angola,"30,355,880 (July 2017 est.) (July 2018 est.)...",20950000.0,65.5,6800


In [149]:
pattern=' '
nothing=''
testString=' NA '
re.sub(pattern,nothing,testString)

'NA'

In [150]:
#to get rid of the leftover spaces
pattern=' '
nothing=''
newValues=[re.sub(pattern,nothing,oldValue) for oldValue in data.GDP_Per_Capita_Number]

In [151]:
data=data.assign(GDP_Per_Capita_Number=newValues)
data.head()

Unnamed: 0,Country,Population,CO2_Emissions_Number,Urbanization_Percentage,GDP_Per_Capita_Number
0,Afghanistan,"34,940,837 (July 2018 est.)",9067000.0,25.5,2000
1,Albania,"3,057,220 (July 2018 est.)",4500000.0,60.3,12500
2,Algeria,"41,657,488 (July 2018 est.)",135900000.0,72.6,15200
3,AmericanSamoa,"50,826 (July 2018 est.)",361100.0,87.2,11200
4,Angola,"30,355,880 (July 2017 est.) (July 2018 est.)...",20950000.0,65.5,6800


In [152]:
data.GDP_Per_Capita_Number=pd.to_numeric(data.GDP_Per_Capita_Number)

In [153]:
data.head()

Unnamed: 0,Country,Population,CO2_Emissions_Number,Urbanization_Percentage,GDP_Per_Capita_Number
0,Afghanistan,"34,940,837 (July 2018 est.)",9067000.0,25.5,2000.0
1,Albania,"3,057,220 (July 2018 est.)",4500000.0,60.3,12500.0
2,Algeria,"41,657,488 (July 2018 est.)",135900000.0,72.6,15200.0
3,AmericanSamoa,"50,826 (July 2018 est.)",361100.0,87.2,11200.0
4,Angola,"30,355,880 (July 2017 est.) (July 2018 est.)...",20950000.0,65.5,6800.0
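
The GDP steps (split at '(', split at '$', remove commas, blank out NA, remove spaces) all serve to pull out the first dollar amount. A sketch of the same idea in one function, assuming amounts are written like "$12,500"; `first_dollar_amount` is a hypothetical helper, and missing values are returned as NaN:

```python
import math
import re

def first_dollar_amount(text):
    """Return the first '$1,234'-style amount as a float, or NaN if absent."""
    match = re.search(r"\$\s*([\d,]+)", text)
    if match is None:
        return math.nan
    return float(match.group(1).replace(",", ""))

print(first_dollar_amount("$12,500 (2017 est.) $12,100 (2016 est.)"))  # 12500.0
print(first_dollar_amount("NA"))  # nan
```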


# 9. Cleaning the Population Column

In [154]:
#split values at (
popnumber=[element.split(' (')[0] for element in data.Population]

#Making the above list a new column:
data=data.assign(Population1=popnumber)
data.head()

Unnamed: 0,Country,Population,CO2_Emissions_Number,Urbanization_Percentage,GDP_Per_Capita_Number,Population1
0,Afghanistan,"34,940,837 (July 2018 est.)",9067000.0,25.5,2000.0,34940837
1,Albania,"3,057,220 (July 2018 est.)",4500000.0,60.3,12500.0,3057220
2,Algeria,"41,657,488 (July 2018 est.)",135900000.0,72.6,15200.0,41657488
3,AmericanSamoa,"50,826 (July 2018 est.)",361100.0,87.2,11200.0,50826
4,Angola,"30,355,880 (July 2017 est.) (July 2018 est.)...",20950000.0,65.5,6800.0,30355880


In [155]:
# getting rid of commas
pattern='\\,'
nothing=''
testString='198,450'
re.sub(pattern,nothing,testString)

'198450'

In [156]:
# getting rid of commas in all values and creating newValues container
pattern='\\,'
nothing=''

newValues=[re.sub(pattern,nothing,oldValue) for oldValue in data.Population1]

In [157]:
#replacing old column, Population1 with newValues
data=data.assign(Population1=newValues)
data.head()

Unnamed: 0,Country,Population,CO2_Emissions_Number,Urbanization_Percentage,GDP_Per_Capita_Number,Population1
0,Afghanistan,"34,940,837 (July 2018 est.)",9067000.0,25.5,2000.0,34940837
1,Albania,"3,057,220 (July 2018 est.)",4500000.0,60.3,12500.0,3057220
2,Algeria,"41,657,488 (July 2018 est.)",135900000.0,72.6,15200.0,41657488
3,AmericanSamoa,"50,826 (July 2018 est.)",361100.0,87.2,11200.0,50826
4,Angola,"30,355,880 (July 2017 est.) (July 2018 est.)...",20950000.0,65.5,6800.0,30355880


In [158]:
#some of the values still aren't just numbers so going to split again at spaces
popnumber=[element.split(' ')[0] for element in data.Population1]

#Making the above list a new column:
data=data.assign(Population_Number=popnumber)
data.head()

Unnamed: 0,Country,Population,CO2_Emissions_Number,Urbanization_Percentage,GDP_Per_Capita_Number,Population1,Population_Number
0,Afghanistan,"34,940,837 (July 2018 est.)",9067000.0,25.5,2000.0,34940837,34940837
1,Albania,"3,057,220 (July 2018 est.)",4500000.0,60.3,12500.0,3057220,3057220
2,Algeria,"41,657,488 (July 2018 est.)",135900000.0,72.6,15200.0,41657488,41657488
3,AmericanSamoa,"50,826 (July 2018 est.)",361100.0,87.2,11200.0,50826,50826
4,Angola,"30,355,880 (July 2017 est.) (July 2018 est.)...",20950000.0,65.5,6800.0,30355880,30355880


In [159]:
data.dtypes

Country                     object
Population                  object
CO2_Emissions_Number       float64
Urbanization_Percentage    float64
GDP_Per_Capita_Number      float64
Population1                 object
Population_Number           object
dtype: object

In [160]:
#converting Population_Number from object to a numeric dtype (whole numbers, so integers rather than float64)
data.Population_Number=pd.to_numeric(data.Population_Number)

In [161]:
#dropping old population columns
data=data.drop(["Population","Population1"],axis=1)

In [162]:
data.head()

Unnamed: 0,Country,CO2_Emissions_Number,Urbanization_Percentage,GDP_Per_Capita_Number,Population_Number
0,Afghanistan,9067000.0,25.5,2000.0,34940837
1,Albania,4500000.0,60.3,12500.0,3057220
2,Algeria,135900000.0,72.6,15200.0,41657488
3,AmericanSamoa,361100.0,87.2,11200.0,50826
4,Angola,20950000.0,65.5,6800.0,30355880


# 10. Creating an Emissions Per Capita Column

In [163]:
#only needed on Python 2; true division is already the default on Python 3
from __future__ import division
emissionspercap=(data.CO2_Emissions_Number/data.Population_Number)

In [164]:
data=data.assign(CO2_Emissions_PerCapita=emissionspercap)

In [165]:
data.head()

Unnamed: 0,Country,CO2_Emissions_Number,Urbanization_Percentage,GDP_Per_Capita_Number,Population_Number,CO2_Emissions_PerCapita
0,Afghanistan,9067000.0,25.5,2000.0,34940837,0.259496
1,Albania,4500000.0,60.3,12500.0,3057220,1.471925
2,Algeria,135900000.0,72.6,15200.0,41657488,3.262319
3,AmericanSamoa,361100.0,87.2,11200.0,50826,7.104631
4,Angola,20950000.0,65.5,6800.0,30355880,0.690146
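
The new column is total emissions divided by population, so its unit is Mt of CO2 per person. A small worked check of the first row, using Afghanistan's values from the table above; `per_capita` is a hypothetical helper that also guards against a missing or zero population:

```python
def per_capita(total_emissions, population):
    """Emissions per person; None when the population is missing or zero."""
    if not population:
        return None
    return total_emissions / population

# Afghanistan's values from the table above
print(round(per_capita(9067000.0, 34940837), 6))  # 0.259496
```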


# 11. Dropping the World Row

In [166]:
#the World row is an aggregate, not a country; it sits at position 210
data.iloc[210,:]

Country                         World
CO2_Emissions_Number        3.362e+10
Urbanization_Percentage          55.3
GDP_Per_Capita_Number           17500
Population_Number          7405107650
CO2_Emissions_PerCapita       4.54011
Name: 210, dtype: object

In [170]:
data.drop(210,axis=0,inplace=True)

In [171]:
#checking
data.iloc[208:212,:]

Unnamed: 0,Country,CO2_Emissions_Number,Urbanization_Percentage,GDP_Per_Capita_Number,Population_Number,CO2_Emissions_PerCapita
208,WestBank,3113000.0,76.2,4300.0,2798494,1.112384
209,WesternSahara,268400.0,86.7,2500.0,619551,0.433217
211,Yemen,13680000.0,36.6,2500.0,28667230,0.4772
212,Zambia,3777000.0,43.5,4000.0,16445079,0.229674
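
Dropping by the hard-coded position 210 depends on the row order of this particular download. Filtering on the name is less fragile. A sketch of the idea on a few rows copied from the outputs above, held in a plain list of dictionaries for illustration:

```python
rows = [
    {"Country": "WesternSahara", "Population_Number": 619551},
    {"Country": "World", "Population_Number": 7405107650},
    {"Country": "Yemen", "Population_Number": 28667230},
]

# Keep every row whose name is not the aggregate "World" entry
countries_only = [row for row in rows if row["Country"] != "World"]
print([row["Country"] for row in countries_only])  # ['WesternSahara', 'Yemen']
```

On the dataframe itself, a boolean mask such as `data[data.Country != 'World']` expresses the same filter.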


# 12. Saving File for R

In [172]:
data.to_csv("Cleaned_Data.csv",index=False) #index=False leaves the row index out of the file
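
A CSV file stores only text, so whatever reads it back has to infer the column types again. The sketch below illustrates this with the standard `csv` module and an in-memory buffer instead of a real file; the two sample rows are taken from the outputs above:

```python
import csv
import io

rows = [{"Country": "Afghanistan", "Population_Number": 34940837},
        {"Country": "Albania", "Population_Number": 3057220}]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["Country", "Population_Number"])
writer.writeheader()
writer.writerows(rows)

# Read the text back: the header is intact and every value is now a string
buffer.seek(0)
read_back = list(csv.DictReader(buffer))
print(read_back[0]["Population_Number"])  # 34940837 (as text)
```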