# CIA Country Analysis and Clustering


Source: The data come from the CIA World Factbook, which is published by the US government.
https://www.cia.gov/library/publications/the-world-factbook/docs/faqs.html

## Goal:

### Gain insight into similarities between countries and regions of the world by experimenting with different numbers of clusters. What do these clusters represent? *Note: There is no 100% right answer.*

----

## Imports and Data

**TASK: Run the following cells to import libraries and read in data.**

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
df = pd.read_csv('../DATA/CIA_Country_Facts.csv')

## Exploratory Data Analysis

**TASK: Explore the rows and columns of the data as well as the data types of the columns.**

In [None]:
# CODE HERE


Let's create some visualizations. Please feel free to expand on these with your own analysis and charts!

**TASK: Create a histogram of the Population column.**

In [None]:
# CODE HERE

**TASK: You should notice that the histogram is skewed by a few very large countries. Reset the X axis to show only countries with fewer than 0.5 billion people.**

In [None]:
#CODE HERE
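
One possible approach to the two histogram tasks, sketched on a small made-up DataFrame (the `toy` name and its population figures are invented for illustration and are not taken from the CIA file). It filters the rows instead of only changing the axis limits, so the bins are recomputed on the remaining countries:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up populations for illustration only (not the CIA data)
toy = pd.DataFrame({
    "Country": ["A", "B", "C", "D", "E"],
    "Population": [1_300_000_000, 1_100_000_000, 80_000_000, 5_000_000, 300_000],
})

# Keep only countries below 0.5 billion people so the bins describe
# the bulk of the data rather than the few outliers
small = toy[toy["Population"] < 500_000_000]

fig, ax = plt.subplots()
ax.hist(small["Population"], bins=10)
ax.set_xlabel("Population")
ax.set_ylabel("Number of countries")
```

With the real data, the same filter would be applied to `df['Population']`. Calling `ax.set_xlim(0, 500_000_000)` on an unfiltered histogram is an alternative, but it keeps the original (coarse) bins.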


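**TASK: Now let's explore GDP and Regions. Create a bar chart showing the mean GDP per Capita per region. (Note that seaborn's `barplot` draws a 95% confidence interval as the black error bar by default; it shows the standard deviation only if you explicitly request it.)**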

In [None]:
# CODE HERE
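
If you want error bars that are guaranteed to be the standard deviation, one option is to compute the mean and standard deviation per region yourself and pass them to the plot. This is a minimal sketch on invented numbers (the `toy` DataFrame and its values are assumptions for illustration), using pandas' own plotting rather than seaborn:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up GDP values for two hypothetical regions
toy = pd.DataFrame({
    "Region": ["X", "X", "Y", "Y"],
    "GDP ($ per capita)": [1000.0, 3000.0, 20000.0, 30000.0],
})

# One row per region with the mean and the sample standard deviation
summary = toy.groupby("Region")["GDP ($ per capita)"].agg(["mean", "std"])

fig, ax = plt.subplots()
# yerr draws the standard deviation as the error bar on each bar
summary["mean"].plot(kind="bar", yerr=summary["std"], ax=ax, capsize=4)
ax.set_ylabel("Mean GDP ($ per capita)")
```

The `summary` table is also useful on its own for checking which regions have the widest spread in GDP.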


**TASK: Create a scatterplot showing the relationship between Phones per 1000 people and the GDP per Capita. Color these points by Region.**

In [None]:
#CODE HERE

**TASK: Create a scatterplot showing the relationship between GDP per Capita and Literacy (color the points by Region). What conclusions do you draw from this plot?**

In [None]:
#CODE HERE

**TASK: Create a Heatmap of the Correlation between columns in the DataFrame.**

In [None]:
#CODE HERE
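
A sketch of one way to build the correlation matrix, assuming the DataFrame still contains string columns such as Country and Region. Recent pandas versions raise an error if `corr()` meets non-numeric columns, so the example selects numeric columns first. The `toy` data is invented for illustration, and the plot uses plain matplotlib; passing `corr` to `sns.heatmap` would be the seaborn equivalent.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up values for illustration only (not the CIA data)
toy = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "Region": ["X", "X", "Y", "Y"],
    "GDP ($ per capita)": [1000.0, 3000.0, 20000.0, 30000.0],
    "Phones (per 1000)": [10.0, 40.0, 400.0, 550.0],
    "Birthrate": [40.0, 35.0, 12.0, 10.0],
})

# Drop the string columns before computing pairwise correlations
corr = toy.select_dtypes(include="number").corr()

fig, ax = plt.subplots()
image = ax.imshow(corr.to_numpy(), cmap="viridis", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=90)
ax.set_yticks(range(len(corr.columns)))
ax.set_yticklabels(corr.columns)
fig.colorbar(image, ax=ax)
```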

-----

## Data Preparation and Model Discovery

Let's now prepare our data for KMeans clustering!

### Missing Data

**TASK: Report the number of missing elements per column.**

In [None]:
#CODE HERE

**TASK: Which countries have NaN for Agriculture? What do these countries have in common?**

In [None]:
df[df['Agriculture'].isnull()]['Country']

3            American Samoa
4                   Andorra
78                Gibraltar
80                Greenland
83                     Guam
134                 Mayotte
140              Montserrat
144                   Nauru
153      N. Mariana Islands
171            Saint Helena
174    St Pierre & Miquelon
177              San Marino
208       Turks & Caicos Is
221       Wallis and Futuna
223          Western Sahara
Name: Country, dtype: object

**TASK: You should have noticed that most of these countries are tiny islands, with the exception of Greenland and Western Sahara. Since their agricultural output is small or essentially non-existent, fill the missing (NaN) values for these countries with 0. There should be 15 countries in total. For a hint on how to do this, recall that you can select the affected rows as follows:**

    df[df['feature'].isnull()]
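
Building on that hint, here is a minimal sketch of the fill step on a made-up DataFrame (the `toy` name, the country names, and the values are invented for illustration). It fills every NaN with 0, but only in the rows where Agriculture is missing:

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only (not the CIA data)
toy = pd.DataFrame({
    "Country": ["Tiny Island", "Big Country", "Other"],
    "Agriculture": [np.nan, 0.2, 0.1],
    "Industry": [np.nan, 0.3, np.nan],
})

# Boolean mask of the rows where Agriculture is missing
missing_agri = toy["Agriculture"].isnull()

# Fill all NaN values in those rows with 0; other rows are left untouched
toy.loc[missing_agri] = toy.loc[missing_agri].fillna(0)
```

Note that the third row keeps its missing Industry value, because its Agriculture value is present and the row is therefore not selected by the mask.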
    

**TASK: Now check to see what is still missing by counting number of missing elements again per feature:**

In [None]:
#CODE HERE

**TASK: Notice climate is missing for a few countries, but not the Region! Let's use this to our advantage. Fill in the missing Climate values based on the mean climate value for its region.**

Hints on how to do this: https://stackoverflow.com/questions/19966018/pandas-filling-missing-values-by-mean-in-each-group


In [None]:
# CODE HERE

In [None]:
df['Climate'] = df['Climate'].fillna(df.groupby('Region')['Climate'].transform('mean'))

**TASK: Check again how many elements are missing:**

**TASK: It looks like Literacy percentage is missing. Use the same tactic as we did with Climate missing values and fill in any missing Literacy % values with the mean Literacy % of the Region.**

In [None]:
#CODE HERE
df[df['Literacy (%)'].isnull()]

In [None]:
df['Literacy (%)'] = df['Literacy (%)'].fillna(df.groupby('Region')['Literacy (%)'].transform('mean'))

Let's break down what's happening:

`df['Literacy (%)']` selects the 'Literacy (%)' column of the DataFrame `df`.

`.fillna(...)` is a pandas method that fills missing (NaN) values in a DataFrame or Series. Here it fills the missing values in the 'Literacy (%)' column.

`df.groupby('Region')['Literacy (%)'].transform('mean')` does the following:

1. `df.groupby('Region')` groups the rows of `df` by the values in the 'Region' column, creating one group per unique region.
2. `['Literacy (%)']` restricts the calculation to the 'Literacy (%)' column within each group.
3. `.transform('mean')` calculates the mean of the 'Literacy (%)' values within each group, skipping NaN values, and returns a Series with the same length and index as the original DataFrame. Each row of that Series therefore holds the mean literacy percentage of that row's region, not one value per region.

Because the Series is aligned with the index of `df`, `fillna` replaces each missing value with the mean for that row's region and leaves existing values unchanged. Filling with a group mean is a common strategy for handling missing data when a suitable grouping column is available.
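
To make the alignment concrete, here is a small worked example on invented data (the `toy` DataFrame and its values are assumptions for illustration only):

```python
import numpy as np
import pandas as pd

# Two hypothetical regions, each with one missing literacy value
toy = pd.DataFrame({
    "Region": ["X", "X", "X", "Y", "Y"],
    "Literacy (%)": [90.0, np.nan, 70.0, 50.0, np.nan],
})

# One value per row: the mean of that row's region (NaN values are skipped)
region_mean = toy.groupby("Region")["Literacy (%)"].transform("mean")
# region_mean is 80, 80, 80 for region X and 50, 50 for region Y

# Only the NaN entries are replaced; existing values stay as they are
toy["Literacy (%)"] = toy["Literacy (%)"].fillna(region_mean)
# The column is now 90, 80, 70, 50, 50
```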







**TASK: Check again on the remaining missing values:**

In [None]:
df.isnull().sum()

Country                               0
Region                                0
Population                            0
Area (sq. mi.)                        0
Pop. Density (per sq. mi.)            0
Coastline (coast/area ratio)          0
Net migration                         1
Infant mortality (per 1000 births)    1
GDP ($ per capita)                    0
Literacy (%)                          0
Phones (per 1000)                     2
Arable (%)                            1
Crops (%)                             1
Other (%)                             1
Climate                               0
Birthrate                             1
Deathrate                             2
Agriculture                           0
Industry                              1
Service                               1
dtype: int64

**TASK: Optional: We are now missing values for only a few countries. Go ahead and drop these countries OR feel free to fill in these last few remaining values with any preferred methodology. For simplicity, we will drop these.**

In [None]:
# CODE HERE

In [None]:
df = df.dropna()

## Data Feature Preparation

**TASK: It is now time to prepare the data for clustering. The Country column is a unique identifier string, so it won't be useful for clustering, since it is different for every point. Go ahead and drop the Country column.**

In [None]:
#CODE HERE

In [None]:
X = df.drop("Country",axis=1)

**TASK: Now let's create the X array of features. The Region column still contains categorical strings, so use pandas to create dummy variables from it. The finalized X matrix should contain the continuous features along with the dummy variables for the Regions.**

In [None]:
#CODE here
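
A minimal sketch of the dummy-variable step on a made-up DataFrame (the `toy`, `features`, and `X_toy` names and all values are invented for illustration). `pd.get_dummies` converts string columns into indicator columns and leaves numeric columns unchanged:

```python
import pandas as pd

# Made-up rows for illustration only (not the CIA data)
toy = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Region": ["ASIA", "OCEANIA", "ASIA"],
    "Population": [100, 200, 300],
})

# Drop the identifier column, then one-hot encode the remaining string column
features = toy.drop("Country", axis=1)
X_toy = pd.get_dummies(features)
# Columns are now: Population, Region_ASIA, Region_OCEANIA
```

Depending on the pandas version, the indicator columns are stored as booleans or as 0/1 integers; both work with the scaling step that follows.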

### Scaling

**TASK: Because some measurements are percentages and others are total counts (such as population), we should scale the data first. Use Sklearn to scale the X feature matrix.**

In [None]:
#CODE HERE
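
One common choice for this step is `StandardScaler`, which rescales each column to zero mean and unit variance. The task does not prescribe a particular scaler, so treat this as a sketch under that assumption; the `X_toy` data is invented for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up features on very different scales (not the CIA data)
X_toy = pd.DataFrame({
    "Population": [1_000_000, 50_000_000, 1_300_000_000],
    "Literacy (%)": [60.0, 85.0, 99.0],
})

scaler = StandardScaler()
# fit_transform learns each column's mean and standard deviation,
# then returns the standardized values as a NumPy array
scaled_X_toy = scaler.fit_transform(X_toy)
```

Without scaling, the Population column would dominate the distance calculations in KMeans simply because its values are so much larger.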

### Creating and Fitting KMeans Model

**TASK: Use a for loop to create and fit multiple KMeans models, testing from K=2-30 clusters. Keep track of the Sum of Squared Distances for each K value, then plot this out to create an "elbow" plot of K versus SSD. Optional: You may also want to create a bar plot showing the SSD difference from the previous cluster.**

In [None]:
#CODE HERE
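
The loop could look roughly like the sketch below. It runs on synthetic blobs rather than the scaled country data, and tests K=2 to K=10 instead of 2 to 30 to keep the toy example small; the `scaled_X_toy` name, the blob data, and the `random_state` values are assumptions for illustration. The sum of squared distances is read from the fitted model's `inertia_` attribute.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Three synthetic, well-separated blobs standing in for the scaled features
scaled_X_toy = np.vstack([
    rng.normal(loc=center, scale=0.3, size=(30, 2))
    for center in (-3.0, 0.0, 3.0)
])

k_values = list(range(2, 11))
ssd = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=42)
    model.fit(scaled_X_toy)
    # inertia_ is the sum of squared distances to the closest centroid
    ssd.append(model.inertia_)

# Elbow plot of K versus SSD
fig, ax = plt.subplots()
ax.plot(k_values, ssd, "o--")
ax.set_xlabel("K value")
ax.set_ylabel("Sum of squared distances")

# Optional: how much the SSD dropped relative to the previous K
ssd_drop = -np.diff(ssd)
```

On this toy data the largest drop is expected when going from K=2 to K=3, because the data was generated from three blobs. Real data rarely shows such a clean elbow.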

-----

## Model Interpretation


**TASK: What K value do you think is a good choice? Are there multiple reasonable choices? Which features help define these clusters? As this is unsupervised learning, there is no 100% correct answer here.**

In [None]:
# Nothing to really code here, but choose a K value and see what features
# are most correlated to belonging to a particular cluster!

# Remember, there is no 100% correct answer here!

-----


#### Example Interpretation: Choosing K=3

**One could say that there is a significant drop off in SSD difference at K=3 (although we can see it continues to drop off past this). What would an analysis look like for K=3? Let's explore which features are important in the decision of 3 clusters!**

In [None]:
X.corr()['K=3 Clusters'].sort_values()
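
The cell above assumes that `X` already contains a 'K=3 Clusters' column holding each row's cluster label. A sketch of how such a column could be created is shown below on synthetic data; the `X_toy` DataFrame, its feature names, and the `random_state` are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic features: feature_a has three distinct levels, feature_b is noise
X_toy = pd.DataFrame({
    "feature_a": np.concatenate([
        rng.normal(0, 1, 20), rng.normal(10, 1, 20), rng.normal(20, 1, 20)
    ]),
    "feature_b": rng.normal(0, 1, 60),
})

# Scale first, then fit K=3 and store the labels next to the features
scaled = StandardScaler().fit_transform(X_toy)
model = KMeans(n_clusters=3, n_init=10, random_state=0)
X_toy["K=3 Clusters"] = model.fit_predict(scaled)

# Correlation of every column with the cluster label, sorted ascending
correlations = X_toy.corr()["K=3 Clusters"].sort_values()
```

Keep in mind that the label numbers 0, 1, and 2 are arbitrary, so these correlations are only a rough heuristic. Comparing per-cluster feature means, for example with `X_toy.groupby('K=3 Clusters').mean()`, is a complementary check.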

---