# Pandas Project

## Select Data Set
I selected the **Countries of the World** data set from [Kaggle Data Sets](https://www.kaggle.com/fernandol/countries-of-the-world).

## Why Countries of the World?
This data set contains data for every country in the world.
I selected it because I want to explore the global relationship between education and GDP per capita in 2010.

## Create Jupyter Notebook and Import Pandas Library
I launched Jupyter Notebook from the terminal:
```
$ jupyter notebook
```

I created a new Python 3 notebook called **data-wrangling.ipynb**.

I imported the pandas library.

In [1]:
import pandas as pd

## Import the Data Set using Pandas
I downloaded the data set as a CSV file from Kaggle and saved it in the same folder as the Jupyter notebook.

In the notebook, I read the CSV file into a data frame and assigned it to the variable 'data'.

In [2]:
data = pd.read_csv('countries of the world.csv')
print(type(data))

<class 'pandas.core.frame.DataFrame'>


## Examine Data Set
I printed the first five rows of 'data' to get a general idea of the data set.

In [3]:
data.head()

Unnamed: 0,Country,Region,Population,Area (sq. mi.),Pop. Density (per sq. mi.),Coastline (coast/area ratio),Net migration,Infant mortality (per 1000 births),GDP ($ per capita),Literacy (%),Phones (per 1000),Arable (%),Crops (%),Other (%),Climate,Birthrate,Deathrate,Agriculture,Industry,Service
0,Afghanistan,ASIA (EX. NEAR EAST),31056997,647500,480,0,2306,16307,700.0,360,32,1213,22,8765,1,466,2034,38.0,24.0,38.0
1,Albania,EASTERN EUROPE,3581655,28748,1246,126,-493,2152,4500.0,865,712,2109,442,7449,3,1511,522,232.0,188.0,579.0
2,Algeria,NORTHERN AFRICA,32930091,2381740,138,4,-39,31,6000.0,700,781,322,25,9653,1,1714,461,101.0,6.0,298.0
3,American Samoa,OCEANIA,57794,199,2904,5829,-2071,927,8000.0,970,2595,10,15,75,2,2246,327,,,
4,Andorra,WESTERN EUROPE,71201,468,1521,0,66,405,19000.0,1000,4972,222,0,9778,3,871,625,,,


I printed the summary information to further understand the composition of the data set.

In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 227 entries, 0 to 226
Data columns (total 20 columns):
Country                               227 non-null object
Region                                227 non-null object
Population                            227 non-null int64
Area (sq. mi.)                        227 non-null int64
Pop. Density (per sq. mi.)            227 non-null object
Coastline (coast/area ratio)          227 non-null object
Net migration                         224 non-null object
Infant mortality (per 1000 births)    224 non-null object
GDP ($ per capita)                    226 non-null float64
Literacy (%)                          209 non-null object
Phones (per 1000)                     223 non-null object
Arable (%)                            225 non-null object
Crops (%)                             225 non-null object
Other (%)                             225 non-null object
Climate                               205 non-null object
Birthrate                 

## Clean and Manipulate the Data Set

### Correct Data Types
Most of the columns in the data set had a data type of 'object'. Therefore, I started by reviewing whether this was correct. I concluded that 15 of the 20 columns needed their data type changed from 'object' to 'float'.

While trying to correct the data types, I discovered that the decimal values were written with commas instead of periods. Therefore, I had to replace the commas with periods before I could change the data types.

#### Replace Commas with Periods
I executed this action for the 15 columns, replacing the commas with periods.

In [5]:
data['Pop. Density (per sq. mi.)'] = data['Pop. Density (per sq. mi.)'].str.replace(',', '.')
data['Coastline (coast/area ratio)'] = data['Coastline (coast/area ratio)'].str.replace(',', '.')
data['Net migration'] = data['Net migration'].str.replace(',', '.')
data['Infant mortality (per 1000 births)'] = data['Infant mortality (per 1000 births)'].str.replace(',', '.')
data['Literacy (%)'] = data['Literacy (%)'].str.replace(',', '.')
data['Phones (per 1000)'] = data['Phones (per 1000)'].str.replace(',', '.')
data['Arable (%)'] = data['Arable (%)'].str.replace(',', '.')
data['Crops (%)'] = data['Crops (%)'].str.replace(',', '.')
data['Other (%)'] = data['Other (%)'].str.replace(',', '.')
data['Climate'] = data['Climate'].str.replace(',', '.')
data['Birthrate'] = data['Birthrate'].str.replace(',', '.')
data['Deathrate'] = data['Deathrate'].str.replace(',', '.')
data['Agriculture'] = data['Agriculture'].str.replace(',', '.')
data['Industry'] = data['Industry'].str.replace(',', '.')
data['Service'] = data['Service'].str.replace(',', '.')

With this change I could proceed to change the data types. I executed this action for the 15 columns and corrected the data type from 'object' to 'float64'.

In [6]:
data['Pop. Density (per sq. mi.)'] = data['Pop. Density (per sq. mi.)'].astype('float64')
data['Coastline (coast/area ratio)'] = data['Coastline (coast/area ratio)'].astype('float64')
data['Net migration'] = data['Net migration'].astype('float64')
data['Infant mortality (per 1000 births)'] = data['Infant mortality (per 1000 births)'].astype('float64')
data['Literacy (%)'] = data['Literacy (%)'].astype('float64')
data['Phones (per 1000)'] = data['Phones (per 1000)'].astype('float64')
data['Arable (%)'] = data['Arable (%)'].astype('float64')
data['Crops (%)'] = data['Crops (%)'].astype('float64')
data['Other (%)'] = data['Other (%)'].astype('float64')
data['Climate'] = data['Climate'].astype('float64')
data['Birthrate'] = data['Birthrate'].astype('float64')
data['Deathrate'] = data['Deathrate'].astype('float64')
data['Agriculture'] = data['Agriculture'].astype('float64')
data['Industry'] = data['Industry'].astype('float64')
data['Service'] = data['Service'].astype('float64')
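
The two steps above repeat the same pair of operations for every column, so they could also be written as a loop. The following is a minimal sketch on a made-up two-row data frame, not the notebook's actual code; `df` and `comma_cols` are names assumed for illustration, and only the column names come from the data set.

```python
import pandas as pd

# Made-up stand-in for the real data frame; only the column names are
# taken from the data set.
df = pd.DataFrame({
    'Country': ['Exampleland', 'Samplestan'],
    'Literacy (%)': ['36,0', None],
    'Climate': ['1,5', '3'],
})

# Text columns that should be numeric (15 of them in the real data set).
comma_cols = ['Literacy (%)', 'Climate']

for col in comma_cols:
    # regex=False treats ',' as a literal character, not a pattern.
    # Missing values stay missing and become NaN after the cast.
    df[col] = df[col].str.replace(',', '.', regex=False).astype('float64')

print(df.dtypes)
```

Keeping the column names in one list means a column added or removed later only has to be changed in one place.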

I proceeded to confirm that the data types were actually changed.

In [7]:
data.dtypes

Country                                object
Region                                 object
Population                              int64
Area (sq. mi.)                          int64
Pop. Density (per sq. mi.)            float64
Coastline (coast/area ratio)          float64
Net migration                         float64
Infant mortality (per 1000 births)    float64
GDP ($ per capita)                    float64
Literacy (%)                          float64
Phones (per 1000)                     float64
Arable (%)                            float64
Crops (%)                             float64
Other (%)                             float64
Climate                               float64
Birthrate                             float64
Deathrate                             float64
Agriculture                           float64
Industry                              float64
Service                               float64
dtype: object
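
A possible shortcut for this kind of file is the `decimal` argument of `pd.read_csv`, which tells the parser which character marks the decimal point. The sketch below uses two made-up rows held in memory rather than the Kaggle file, so `sample_csv` and `sample` are illustrative names, and it assumes the decimal values are quoted in the file.

```python
import io

import pandas as pd

# Two made-up rows in the same style as the source file: decimals use a
# comma, and the field is quoted so the comma is not read as a separator.
sample_csv = io.StringIO(
    'Country,Population,Literacy (%),Climate\n'
    'Exampleland,1000,"36,0","1,5"\n'
    'Samplestan,2000,"86,5","3"\n'
)

# decimal=',' makes the parser treat the comma as the decimal mark, so the
# columns arrive as floats without a separate replace/astype step.
sample = pd.read_csv(sample_csv, decimal=',')
print(sample.dtypes)
```

With this approach the replace and `astype` cells would probably not be needed, although checking `dtypes` afterwards is still worthwhile.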

### Find Missing Values
I created and printed the variable 'null_cols' to understand which columns had missing values.

In [8]:
null_cols = data.isnull().sum()
null_cols[null_cols > 0]

Net migration                          3
Infant mortality (per 1000 births)     3
GDP ($ per capita)                     1
Literacy (%)                          18
Phones (per 1000)                      4
Arable (%)                             2
Crops (%)                              2
Other (%)                              2
Climate                               22
Birthrate                              3
Deathrate                              4
Agriculture                           15
Industry                              16
Service                               15
dtype: int64

The column missing the most values is 'Climate', and it is only missing 22 values out of 227. Therefore, I concluded that dropping columns based on null data was not necessary.
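
The same judgment can be made on percentages instead of raw counts (22 of 227 is roughly 9.7%). This is a small sketch on a made-up four-row data frame; `df` and `missing_share` are names assumed for illustration.

```python
import numpy as np
import pandas as pd

# Made-up data; only the column names mirror the data set.
df = pd.DataFrame({
    'Country': ['A', 'B', 'C', 'D'],
    'Climate': [1.0, np.nan, 2.0, np.nan],
    'Literacy (%)': [36.0, 86.5, np.nan, 70.0],
})

# isnull() yields True/False per cell; the mean of booleans is the share of
# True values, so this is the percentage of missing entries per column.
missing_share = df.isnull().mean() * 100

# Show only the columns that have gaps, worst first.
print(missing_share[missing_share > 0].sort_values(ascending=False))
```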

### Drop Columns
Analyzing the data further, I discovered that 'Pop. Density (per sq. mi.)' comes from dividing 'Population' by 'Area (sq. mi.)'. Since this value can be recomputed later if it is needed for analysis, I decided to drop the column.

In [9]:
data = data.drop('Pop. Density (per sq. mi.)', axis=1)

I proceeded to confirm that the column was actually dropped by printing the columns.

In [10]:
data.columns

Index(['Country', 'Region', 'Population', 'Area (sq. mi.)',
       'Coastline (coast/area ratio)', 'Net migration',
       'Infant mortality (per 1000 births)', 'GDP ($ per capita)',
       'Literacy (%)', 'Phones (per 1000)', 'Arable (%)', 'Crops (%)',
       'Other (%)', 'Climate', 'Birthrate', 'Deathrate', 'Agriculture',
       'Industry', 'Service'],
      dtype='object')
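
If the density is needed again, it can be rebuilt from the two columns that were kept. A minimal sketch with made-up numbers follows; `df` and `density` are illustrative names.

```python
import pandas as pd

# Made-up rows; the column names match the data set.
df = pd.DataFrame({
    'Country': ['Exampleland', 'Samplestan'],
    'Population': [1000, 5000],
    'Area (sq. mi.)': [50, 200],
})

# Population density = people per square mile.
density = df['Population'] / df['Area (sq. mi.)']
print(density)
```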

### Changing Column Order
I decided to reorder the columns so that related data sits together. To do so, I created a list 'column_order' containing the data frame’s column names in the order I wanted.

In [11]:
column_order = ['Country', 'Region', 'Area (sq. mi.)', 'Coastline (coast/area ratio)', 'Population', 'Net migration', 'Birthrate', 'Deathrate','Infant mortality (per 1000 births)', 
       'GDP ($ per capita)', 'Literacy (%)', 'Phones (per 1000)', 'Agriculture', 'Industry', 'Service', 'Climate', 'Arable (%)',
       'Crops (%)', 'Other (%)']

I then recreated the data frame with the customized order.

In [12]:
data = data[column_order]
data.head()

Unnamed: 0,Country,Region,Area (sq. mi.),Coastline (coast/area ratio),Population,Net migration,Birthrate,Deathrate,Infant mortality (per 1000 births),GDP ($ per capita),Literacy (%),Phones (per 1000),Agriculture,Industry,Service,Climate,Arable (%),Crops (%),Other (%)
0,Afghanistan,ASIA (EX. NEAR EAST),647500,0.0,31056997,23.06,46.6,20.34,163.07,700.0,36.0,3.2,0.38,0.24,0.38,1.0,12.13,0.22,87.65
1,Albania,EASTERN EUROPE,28748,1.26,3581655,-4.93,15.11,5.22,21.52,4500.0,86.5,71.2,0.232,0.188,0.579,3.0,21.09,4.42,74.49
2,Algeria,NORTHERN AFRICA,2381740,0.04,32930091,-0.39,17.14,4.61,31.0,6000.0,70.0,78.1,0.101,0.6,0.298,1.0,3.22,0.25,96.53
3,American Samoa,OCEANIA,199,58.29,57794,-20.71,22.46,3.27,9.27,8000.0,97.0,259.5,,,,2.0,10.0,15.0,75.0
4,Andorra,WESTERN EUROPE,468,0.0,71201,6.6,8.71,6.25,4.05,19000.0,100.0,497.2,,,,3.0,2.22,0.0,97.78


### Incorrect Values
While analyzing the data I noticed that one value in 'Climate' had a decimal point, while the rest of the data points seemed to be whole numbers. Therefore, I decided to get the value counts for 'Climate' to see whether this was an error.

In [13]:
data.Climate.value_counts()

2.0    111
3.0     48
1.0     29
1.5      8
4.0      6
2.5      3
Name: Climate, dtype: int64

Since 1.5 appears 8 times and 2.5 appears 3 times in the output above, I concluded that the decimal values were not errors.

### Filtering Records
I decided to filter the 'Literacy (%)' null records to get an idea of which countries were missing the values.

In [14]:
null_literacy = data[(data['Literacy (%)'].isnull()==True)]
null_literacy.head()

Unnamed: 0,Country,Region,Area (sq. mi.),Coastline (coast/area ratio),Population,Net migration,Birthrate,Deathrate,Infant mortality (per 1000 births),GDP ($ per capita),Literacy (%),Phones (per 1000),Agriculture,Industry,Service,Climate,Arable (%),Crops (%),Other (%)
25,Bosnia & Herzegovina,EASTERN EUROPE,51129,0.04,4498976,0.31,8.77,8.27,21.05,6100.0,,215.4,0.142,0.308,0.55,4.0,13.6,2.96,83.44
66,Faroe Islands,WESTERN EUROPE,1399,79.84,47246,1.41,14.05,8.7,6.24,22000.0,,503.8,0.27,0.11,0.62,,2.14,0.0,97.86
74,Gaza Strip,NEAR EAST,360,11.11,1428757,1.6,39.45,3.8,22.93,600.0,,244.3,0.03,0.283,0.687,3.0,28.95,21.05,50.0
78,Gibraltar,WESTERN EUROPE,7,171.43,27928,0.0,10.74,9.31,5.13,17500.0,,877.7,,,,,0.0,0.0,100.0
80,Greenland,NORTHERN AMERICA,2166086,2.04,56361,-8.37,15.93,7.84,15.82,20000.0,,448.9,,,,1.0,0.0,0.0,100.0


I then counted the missing values by region.

In [15]:
null_literacy['Region'].value_counts()

WESTERN EUROPE                         5
OCEANIA                                4
EASTERN EUROPE                         3
NEAR EAST                              2
NORTHERN AFRICA                        1
LATIN AMER. & CARIB                    1
SUB-SAHARAN AFRICA                     1
NORTHERN AMERICA                       1
Name: Region, dtype: int64

I proceeded to filter the 'Literacy (%)' records that actually had a value.

In [16]:
filtered = data[data['Literacy (%)'] >= 0.0]
filtered.head()

Unnamed: 0,Country,Region,Area (sq. mi.),Coastline (coast/area ratio),Population,Net migration,Birthrate,Deathrate,Infant mortality (per 1000 births),GDP ($ per capita),Literacy (%),Phones (per 1000),Agriculture,Industry,Service,Climate,Arable (%),Crops (%),Other (%)
0,Afghanistan,ASIA (EX. NEAR EAST),647500,0.0,31056997,23.06,46.6,20.34,163.07,700.0,36.0,3.2,0.38,0.24,0.38,1.0,12.13,0.22,87.65
1,Albania,EASTERN EUROPE,28748,1.26,3581655,-4.93,15.11,5.22,21.52,4500.0,86.5,71.2,0.232,0.188,0.579,3.0,21.09,4.42,74.49
2,Algeria,NORTHERN AFRICA,2381740,0.04,32930091,-0.39,17.14,4.61,31.0,6000.0,70.0,78.1,0.101,0.6,0.298,1.0,3.22,0.25,96.53
3,American Samoa,OCEANIA,199,58.29,57794,-20.71,22.46,3.27,9.27,8000.0,97.0,259.5,,,,2.0,10.0,15.0,75.0
4,Andorra,WESTERN EUROPE,468,0.0,71201,6.6,8.71,6.25,4.05,19000.0,100.0,497.2,,,,3.0,2.22,0.0,97.78


I again counted the values by region, but this time for the ones with values.

In [17]:
filtered['Region'].value_counts()

SUB-SAHARAN AFRICA                     50
LATIN AMER. & CARIB                    44
ASIA (EX. NEAR EAST)                   28
WESTERN EUROPE                         23
OCEANIA                                17
NEAR EAST                              14
C.W. OF IND. STATES                    12
EASTERN EUROPE                          9
NORTHERN AFRICA                         5
NORTHERN AMERICA                        4
BALTICS                                 3
Name: Region, dtype: int64
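
The two filters in this section can also be expressed with `isna()` and `notna()`, which state the intent directly and do not depend on the values being non-negative. This is a sketch on made-up rows; `df`, `missing`, and `present` are names assumed for illustration.

```python
import numpy as np
import pandas as pd

# Made-up rows; 'B' has no literacy value.
df = pd.DataFrame({
    'Country': ['A', 'B', 'C'],
    'Literacy (%)': [36.0, np.nan, 100.0],
})

# Rows where the value is missing, and rows where it is present.
missing = df[df['Literacy (%)'].isna()]
present = df[df['Literacy (%)'].notna()]

print(missing)
print(present)
```

Together the two subsets cover every row exactly once, which makes it easy to check that nothing was lost in the split.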

### Replacing Values
Since I want to analyze the relationship between 'Literacy (%)' and 'GDP ($ per capita)', and 'Literacy (%)' had 18 nulls, I decided to replace the nulls with the mean.

I found the mean for 'Literacy (%)' first.

In [18]:
(data['Literacy (%)']).mean()

82.83827751196175

I then replaced the nulls with the mean, rounded to 82.8.

In [19]:
data['Literacy (%)'] = data['Literacy (%)'].fillna(82.8)

I ensured the nulls were actually changed.

In [20]:
null_lit = data['Literacy (%)'].isnull().sum()
null_lit

0
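
Instead of typing the rounded value by hand, the computed mean can be passed to `fillna` directly, which avoids rounding and stays correct if the data changes. A minimal sketch with made-up values follows; `df` and `literacy_mean` are illustrative names.

```python
import numpy as np
import pandas as pd

# Made-up values with one gap.
df = pd.DataFrame({'Literacy (%)': [80.0, np.nan, 90.0]})

# mean() skips NaN by default, so the gap does not affect the result.
literacy_mean = df['Literacy (%)'].mean()

# Fill the gap with the computed mean rather than a hard-coded number.
df['Literacy (%)'] = df['Literacy (%)'].fillna(literacy_mean)
print(df)
```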

### Binning Numeric Variables
I decided to build equal-width bins from 'GDP ($ per capita)'. To do so, I first found the min and max values in the column to understand the bin cutoffs.

In [21]:
data['GDP ($ per capita)'].min()

500.0

In [22]:
data['GDP ($ per capita)'].max()

55100.0

I then created the labels for the bins and cut the column into five equal-width bins.

In [23]:
labels_bins = ['Very Low','Low','Moderate','High','Very High']

In [24]:
bins = pd.cut(data['GDP ($ per capita)'],5, labels=labels_bins)
bins.head(10)

0    Very Low
1    Very Low
2    Very Low
3    Very Low
4         Low
5    Very Low
6    Very Low
7    Very Low
8    Very Low
9    Very Low
Name: GDP ($ per capita), dtype: category
Categories (5, object): [Very Low < Low < Moderate < High < Very High]
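
With a minimum of 500 and a maximum of 55100, five equal-width bins each span (55100 - 500) / 5 = 10920. To see the cutoffs pandas actually uses, `pd.cut` can return them through `retbins=True`. The sketch below uses a made-up series that shares only the min and max with the real column; `gdp`, `binned`, and `edges` are names assumed for illustration.

```python
import pandas as pd

# Made-up values; only the min (500) and max (55100) match the real column.
gdp = pd.Series([500.0, 4500.0, 19000.0, 30000.0, 55100.0],
                name='GDP ($ per capita)')
labels_bins = ['Very Low', 'Low', 'Moderate', 'High', 'Very High']

# retbins=True returns the bin edges alongside the binned values.
binned, edges = pd.cut(gdp, 5, labels=labels_bins, retbins=True)

print(edges)                            # six edges delimit five bins
print(binned.value_counts(sort=False))  # rows per bin, in label order
```

Because the bins have equal width rather than equal size, most rows can end up in the lowest bin, as the 'Very Low' labels in the output above suggest.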

## Export the Data Set using Pandas
After cleaning the data set, I exported the new clean version.

In [33]:
data.to_csv('countries of the world_clean.csv', index=False)
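
One way to check an export is to read the file back and compare it with the data frame in memory. This is a sketch under assumptions: it writes a made-up one-row data frame to a temporary folder, and `df`, `path`, and `reloaded` are illustrative names rather than part of the notebook.

```python
import os
import tempfile

import pandas as pd

# Made-up one-row data frame standing in for the cleaned data.
df = pd.DataFrame({'Country': ['Exampleland'], 'Literacy (%)': [82.8]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'example_clean.csv')
    # index=False keeps the row index out of the file, so no extra
    # unnamed column appears when the file is read back.
    df.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# True when the reloaded data matches the original in values and dtypes.
print(reloaded.equals(df))
```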