# Business Problem:

HELP International is an international humanitarian NGO that is committed to fighting poverty and providing the people of backward countries with basic amenities and relief during the time of disasters and natural calamities. It runs a lot of operational projects from time to time along with advocacy drives to raise awareness as well as for funding purposes.

 

After the recent funding programmes, they have been able to raise around $10 million. Now the CEO of the NGO needs to decide how to use this money strategically and effectively. The main difficulty in this decision is choosing the countries that are in the direst need of aid.

 

And this is where you come in as a data analyst. Your job is to categorise the countries using some socio-economic and health factors that determine the overall development of the country. Then you need to suggest the countries which the CEO needs to focus on the most. 

Your main task is to cluster the countries by the factors mentioned above and then present your solution and recommendations to the CEO in a presentation (PPT).

# Steps

1. Data understanding

   - Import Python libraries
   - Read and inspect data for understanding
   - Summary statistics
   - Analyse the columns


2. Exploratory Data Analysis

  - Null Value Analysis
  - Derive data – convert exports, imports, health from % of GDPP to absolute values
  - Data Visualization
    - Visualising Bivariate Distributions using pair plot
    - Visualising Correlation using heatmap
    - Visualising univariate distributions using Histogram and density plots
    - Visualising GDPP, income, child mortality and other parameters across countries using point plots and bar plots
    - Visualising univariate distributions using boxplots to identify outliers


3. Prepare data for modelling

  - Outlier Treatment
  - Hopkins test
  - Scaling


4. Build model using K-means

   - Metrics to choose the value of K
     - Elbow curve
     - Silhouette Analysis
   - Iterating with different values of k and choosing the optimal number of clusters
   - Country Segmentation
   - Cluster Profiling based on GDPP, Income and Child Mortality Rate
     - Scatter Plots
     - Box plots
     - Bar plots on the mean of the columns in each cluster
   - Identification of top 10 countries in need of aid

5. Build model using Hierarchical clustering

   - Single linkage
   - Complete linkage
   - Country Segmentation
   - Cluster Profiling based on GDPP, Income and Child Mortality Rate
     - Scatter Plots
     - Box plots
     - Bar plots on the mean of the columns in each cluster
   - Identification of top 10 countries in need of aid

6. Suggestions 

## 1.1 Import libraries

Ignore warnings and import necessary libraries:

In [1]:
import os
import warnings
warnings.filterwarnings('ignore')

In [2]:
import numpy as np
import pandas as pd

# To display data dictionary fully
pd.set_option('display.max_columns', None)  
pd.set_option('display.expand_frame_repr', False)
pd.set_option('max_colwidth', 1000)

# For Visualisation
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set_palette("Set1") # set_palette() applies the palette; color_palette() only returns it

# To Scale our data
from sklearn.preprocessing import StandardScaler

# To perform KMeans clustering 
from sklearn.cluster import KMeans

# To perform Hierarchical clustering
from scipy.cluster.hierarchy import linkage
from scipy.cluster.hierarchy import dendrogram
from scipy.cluster.hierarchy import cut_tree

## 1.2 Read and inspect data for understanding

Let's read the data dictionary into the `data_dictionary` dataframe to understand the columns.

In [3]:

data_dictionary = pd.read_csv('data-dictionary.csv')
data_dictionary

There are a total of `10 columns` in the actual dataset.

From the data dictionary, we can see that `exports`, `health` and `imports` are given as a percentage of GDP per capita (`gdpp`). Hence we have to convert these columns to absolute values in the EDA section.

Let's read the actual dataset into `country_df`.

In [None]:
country_df = pd.read_csv('Country-data.csv')
country_df.head() # Checking the top 5 rows of the dataframe

In [None]:
# Checking the bottom 5 rows of the dataframe
country_df.tail()

### Data Inspection:

Inspect the various aspects of the `country_df` dataframe, such as:
- `shape` for the number of rows and columns
- `size` for the total number of elements (rows × columns)
- `info()` for column data types, non-null counts and memory usage
- `nunique()` for the number of unique entries in each column, to spot any categorical columns
- `nunique()` and `duplicated()` for duplicates analysis
- `describe()` for summary statistics

In [None]:
# Checking the shape of the dataframe
country_df.shape

There are `167` rows and `10` columns in the given dataset.

In [None]:
# Checking the size of the dataframe
country_df.size

In [None]:
# Inspecting type
print(country_df.dtypes)

In [None]:
# How many types of each data type column exists and total memory usage
country_df.info()

`info()` shows that there are no null values. We will still confirm this in the EDA section using `isnull()`.

`info()` also shows that, apart from the `country` column, all the columns are numerical.

### Duplicates analysis

In [None]:
country_df.duplicated().sum()

In [None]:
# Checking the number of unique values each column possess to identify categorical columns
country_df.nunique().sort_values()

`nunique()` shows that there are no categorical variables in the dataframe: all variables other than `country` are numerical and continuous.

There are `167` rows and `country` has `167` unique values, so every row has a distinct **country** and there are no duplicate records.

### Statistical summary

In [None]:
# Checking the numerical columns data distribution
country_df.describe()

Summary statistics from `describe()` indicate the presence of outliers. We will visualise the distribution and percentiles of each column using boxplots in Section 2.3.

# 2. Exploratory Data Analysis (EDA)
## 2.1 Null Values Analysis

In [None]:
# Looking for any null value in any column 
print(country_df.isnull().sum())

Both `info()` and `isnull()` output indicate that there are no null values in the given dataset.

## 2.2 Derived Data

From the data dictionary in section 1, we saw that exports, health and imports are given as percentage of GDP. Hence we have to convert these columns to absolute values as clustering uses `Euclidean Distance` between values to group the countries
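
As a quick illustration of the conversion described above, here is a minimal sketch on a made-up two-row frame. The frame `toy`, its values and the `exports_abs` column are hypothetical and are not taken from the country dataset; only the formula (percentage times `gdpp`, divided by 100) matches what the notebook applies.

```python
import pandas as pd

# Hypothetical two-country frame; the values are made up for illustration.
toy = pd.DataFrame({
    "country": ["A", "B"],
    "exports": [10.0, 50.0],   # given as % of GDP per capita
    "gdpp": [500.0, 40000.0],  # GDP per capita
})

# Same formula as the notebook: percentage * gdpp / 100
toy["exports_abs"] = toy["exports"] * toy["gdpp"] / 100
print(toy)
```

Country B exports a larger share of a much larger GDP per capita, so the gap between A and B is far wider in absolute terms than the percentages alone suggest.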

In [None]:
country_df.head() # Lets check data before conversion

In [None]:
# Converting exports,imports and health spending percentages to absolute values.

country_df['exports'] = country_df['exports'] * country_df['gdpp']/100
country_df['imports'] = country_df['imports'] * country_df['gdpp']/100
country_df['health'] = country_df['health'] * country_df['gdpp']/100

country_df.head() # Lets check data after conversion

Now all the columns have absolute values and we can visualize the data for more understanding

## 2.3 Visualizing the dataset

Let's start with bivariate analysis, as all the variables are numerical, and look at the relationship between each pair of features.

Univariate and outlier analysis will also help us decide how to treat the outliers.

Let's first get the list of numerical columns. As we have seen above, `info()` shows that all columns except `country` are numerical.

In [None]:
numerical_cols = list(country_df.columns) # Get all column names
numerical_cols.remove('country') # Remove country as its not numerical
numerical_cols

## 2.3.1 Visualising Bivariate Distributions using pairplot

In [None]:
sns.pairplot(country_df[numerical_cols], corner=True)
plt.show()

### Inference

The diagonal of the pairplot shows the distribution of each column, and the other plots show the pairwise relationships.

An initial look at the distributions across the columns gives the following picture:
- The columns have values in very different ranges, e.g. `total_fer` is plotted from 0 to 10 and `child_mort` from -50 to 250 (the axis is padded; the rate itself cannot be negative), while `gdpp`, `income`, `exports` and `imports` have values in the thousands. Hence we need to scale the data.

Some patterns we can see are:

- `child_mort` and `total_fer` increase together.
- `child_mort` seems to be high where `health` spending is low and `inflation` is high.
- When `health` spending is higher, `life_expec` is higher.
- `gdpp` increases as `exports`, `imports` and `income` increase.


## 2.3.2 Visualising Correlation using heatmap

Let's check the correlation between the features using a heatmap. Strong positive and strong negative correlations are shown in darker shades (green and red respectively); lighter shades indicate weak or mild correlation:

In [None]:
plt.figure(figsize=(14,8))
# Restrict to the numeric columns: corr() fails on the string column `country` in recent pandas versions
sns.heatmap(country_df[numerical_cols].corr(), annot=True, linewidths=0.5, cmap=sns.diverging_palette(380, 150, n=8))

### Inferences:
- `imports` and `exports` are highly correlated, with a correlation of 0.99
- `health` and `gdpp` are highly correlated, with a correlation of 0.92
- `income` and `gdpp` are highly correlated, with a correlation of 0.9
- `child_mort` and `life_expec` are highly and inversely correlated, with a correlation of -0.89
- `child_mort` and `total_fer` are highly correlated, with a correlation of 0.85
- `life_expec` and `total_fer` are highly and inversely correlated, with a correlation of -0.76
- `gdpp` and `exports` are highly correlated, with a correlation of 0.77
- `gdpp` and `imports` are highly correlated, with a correlation of 0.76
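
Reading strongly correlated pairs off a heatmap by eye is error-prone, so the list above could also be produced programmatically. The following is a minimal sketch, not part of the original notebook: the helper `top_correlated_pairs`, the `threshold` parameter and the `toy` data are all hypothetical names and made-up values.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(df, threshold=0.75):
    """Return column pairs whose absolute Pearson correlation is >= threshold."""
    corr = df.corr()
    # Keep only the strict upper triangle so that each pair appears once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    pairs = pairs[pairs.abs() >= threshold]
    # Strongest relationships first, regardless of sign
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False)

# Made-up numeric data, not the country dataset
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.1, 5.9, 8.2, 9.9],    # rises with a
    "c": [5.0, 3.9, 3.1, 2.0, 1.1],    # falls as a rises
    "d": [1.0, -1.0, 0.0, -1.0, 1.0],  # unrelated to a
})
pairs = top_correlated_pairs(toy)
print(pairs)
```

On the notebook's data, a call such as `top_correlated_pairs(country_df[numerical_cols])` would be the analogous usage, assuming `numerical_cols` holds only numeric column names.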

## 2.3.3 Visualising univariate distributions using Histogram and density plots


In [None]:
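plt.figure(figsize=[16,12])
for i, col in enumerate(numerical_cols, start=1): # i tracks the position in the 3x3 grid
    plt.subplot(3,3,i)
    # histplot with kde=True draws a histogram plus a density curve (distplot is deprecated)
    sns.histplot(country_df[col], kde=True, stat='density')
    plt.title(col)
    plt.xlabel('')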

### Inference:

None of the columns is evenly distributed, and a closer look at the columns shows the presence of outliers:

- `child_mort` is right skewed: most values lie between 0 and 50, with a tail of countries that have much higher child mortality.

- `exports`, `imports` and `health` are right skewed and have a large number of outliers.

- `income` has a few outliers, with the majority of countries having a per-person income between 0 and 25000.

- `inflation` has some high outliers and is right skewed.

- `life_expec` has some low outliers and is left skewed.

- `gdpp` is right skewed, indicating that a few countries have a very high GDP per capita and are unlikely to need aid.

However, we will perform K-means and hierarchical clustering to identify the clusters, and the countries in need of aid, more systematically.

## 2.3.4 Visualising univariate distributions across countries using point plots and barplots:

We will plot `gdpp`, `income` and `child_mort` for all countries to see how the data is distributed. A point plot shows the mean of the y variable for each category; since each country appears in exactly one row, each point is simply that country's value, so point plots can be used here to compare a country's value with those of other countries.

The following generic function creates a point plot and draws a horizontal line to indicate the countries that might need help:

In [None]:
def point_plots_with_line(y_column, line_cutoff, ylog=False):
    #plt.figure(figsize=(18,6))
    ax = sns.pointplot(x="country", y=y_column, data=country_df)
    plt.xticks(rotation = 90,fontsize =7)
    if ylog:
        plt.yscale('log')
    ax.axhline(line_cutoff, ls='--',color='red')
    plt.title("%s vs countries" %y_column)
    plt.ylabel(y_column)
    plt.xlabel('')
    #plt.show()

In [None]:
def plot_bottom10_countries(y_column, sort_order=True, truncate_string=False):
    sorted_df = country_df[['country',y_column]].sort_values(y_column, ascending = sort_order).head(10) # get bottom 10
    sorted_df[y_column] = sorted_df[y_column].round(2) # roundoff to 2 decimals
    if truncate_string: # truncate only for subplots proper visualization purpose
        sorted_df.loc[sorted_df['country'].str.contains('Central African Republic'),'country'] = 'Cent.Afr.Repub.'
        sorted_df.loc[sorted_df['country'].str.contains('Congo'),'country'] = 'Congo'
        sorted_df.loc[sorted_df['country'].str.contains('Equatorial Guinea'),'country'] = 'Eq. Guinea' # keep distinct from the separate country Guinea
        
    ax = sns.barplot(x='country', y=y_column, data= sorted_df)
    for each_bar in ax.patches:
        ax.annotate(str(each_bar.get_height()), (each_bar.get_x() * 1.01 , each_bar.get_height() * 1.01))
    plt.ylabel(y_column)
    plt.xlabel('10 Countries which have poor %s' %y_column)
    ax.set_xticklabels(sorted_df['country'], rotation=45, ha='center')

### Visualize GDPP across countries

In [None]:
plt.figure(figsize = (18,20))
plt.subplot(3,1,1)
point_plots_with_line("gdpp", 4660) # 4660 is median of GDPP
plt.subplot(3,1,2)
point_plots_with_line("gdpp", 1330, True) # 1330 is 25th percentile of GDPP
plt.subplot(3,1,3)
plot_bottom10_countries("gdpp")
plt.show()

### INFERENCES:

- The first plot is a little difficult to interpret because a few countries have a very large GDPP compared with the rest.
- The red line in the first plot marks the median, i.e. countries above the red line have a GDPP higher than 4660 and countries below it have a GDPP lower than 4660.
- The second plot shows the same data with a log-scaled y-axis, which is a bit easier to interpret.
- The red line in the second plot marks the 25th percentile.
- By definition, about a quarter of the 167 countries (roughly 40) lie below the red line in the second plot, and these are the countries of our focus.
- The third plot shows the 10 countries that have the lowest GDPP.
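
The cutoffs passed to the plotting function above are typed in by hand. As a hedged alternative, they can be computed from the data so that they stay correct if the data changes. The sketch below uses a made-up series `toy_gdpp`; the variable names are hypothetical and the values are not from the country dataset.

```python
import pandas as pd

# Made-up GDP-per-capita values, not the real dataset
toy_gdpp = pd.Series([300, 800, 1500, 4000, 12000, 45000], name="gdpp")

median_cutoff = toy_gdpp.quantile(0.50)  # value for the median line
q25_cutoff = toy_gdpp.quantile(0.25)     # value for the 25th-percentile line

# Share of rows strictly below the 25th percentile
share_below = (toy_gdpp < q25_cutoff).mean()
print(median_cutoff, q25_cutoff, share_below)
```

In the notebook, the analogous call would be something like `point_plots_with_line("gdpp", country_df["gdpp"].quantile(0.25), True)`.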

### Visualize Income per person across countries

In [None]:
plt.figure(figsize = (18,20))
plt.subplot(3,1,1)
point_plots_with_line("income", 9960) # 9960 is median of income
plt.subplot(3,1,2)
point_plots_with_line("income", 3350, True) # 3350 is 25th percentile of income
plt.subplot(3,1,3)
plot_bottom10_countries("income")
plt.show()

### INFERENCES:

- These plots give similar inferences to those of GDPP.
- The red line in the first plot marks the median, i.e. countries above the red line have an income per person higher than 9960 and countries below it have an income per person lower than 9960.
- The second plot shows the same data with a log-scaled y-axis, which is a bit easier to interpret.
- The red line in the second plot marks the 25th percentile.
- By definition, about a quarter of the 167 countries (roughly 40) lie below the red line in the second plot, and these are the countries of our focus.
- The third plot shows the 10 countries that have the lowest income per person.

- Comparing these with the previous GDPP plots, we can see that some countries fall below the red line in both, such as Afghanistan, Congo, Liberia, Niger etc.

### Visualize Child Mortality Rate per 1000 lives across countries

In [None]:
plt.figure(figsize = (18,12))
plt.subplot(2,1,1)
point_plots_with_line("child_mort", 62) # 62 is 75th percentile of child_mort
plt.subplot(2,1,2)
plot_bottom10_countries("child_mort", False)
plt.show()

### INFERENCES:

- This plot shows the opposite pattern to GDPP and income: countries that are doing well have lower child mortality, plausibly because more money is spent on citizens' health and nutrition.
- We have chosen the 75th percentile, i.e. 62, to highlight the countries whose child mortality rate is higher than 62 per 1000 live births.
- We can see that some of the countries with low GDPP and low income per person also suffer from a high child mortality rate, such as Congo, Haiti etc.
- The second plot shows the 10 countries with the highest child mortality rate; these appear to be mostly in Africa, where healthcare facilities are poor.

### Visualize bottom 10 countries w.r.t exports, imports, health, inflation, life_expectancy and total fertility rate

In [None]:
plt.figure(figsize = (18,16))
plt.subplot(2,3,1)
plot_bottom10_countries("exports", sort_order=True, truncate_string=True)
plt.subplot(2,3,2)
plot_bottom10_countries("health")
plt.subplot(2,3,3)
plot_bottom10_countries("imports")
plt.subplot(2,3,4)
plot_bottom10_countries("inflation", sort_order=False, truncate_string=True)
plt.subplot(2,3,5)
plot_bottom10_countries("life_expec", sort_order=True, truncate_string=True)
plt.subplot(2,3,6)
plot_bottom10_countries("total_fer", sort_order=False, truncate_string=True)
plt.show()

### INFERENCES:

- `exports` and `health` follow the pattern of the GDPP and income plots: the countries that had low GDPP also tend to have low exports and low health spending.
- `imports` shows a somewhat different trend, which may indicate that these countries meet more of their needs through domestic production.
- `inflation` highlights countries that may be economically unstable, not self-sufficient, and affected by other political and social issues.

These plots give us an overall picture of how the countries are doing in terms of GDPP, income and child mortality. However, we need a more systematic approach to identify the group of countries most in need and to prioritise the countries within it. We will build an unsupervised model using clustering techniques, which are sensitive to outliers, so let's first identify the outliers and handle them to get better results.

## 2.3.5 Visualising univariate distributions using boxplots to identify outliers

Boxplots are a great way to visualise univariate data because they show statistics such as the 25th percentile, the 50th percentile (median) and the 75th percentile. We will use boxplots to analyse the outliers. Let's first look at the summary statistics with additional percentiles:
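
For reference, seaborn's boxplot by default draws as individual outlier points the values that lie more than 1.5 times the interquartile range (IQR) beyond the quartiles. The same rule can be sketched numerically as below; the helper `iqr_outliers`, its `whis` parameter name and the `toy` values are hypothetical and are shown only to make the rule concrete.

```python
import pandas as pd

def iqr_outliers(series, whis=1.5):
    """Return the values lying outside [Q1 - whis*IQR, Q3 + whis*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - whis * iqr, q3 + whis * iqr
    return series[(series < lower) | (series > upper)]

# Made-up values with one obviously extreme entry
toy = pd.Series([10, 12, 13, 15, 16, 18, 95])
print(iqr_outliers(toy))
```

Here Q1 is 12.5 and Q3 is 17, so the upper fence is 23.75 and only the value 95 is flagged.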

In [None]:
country_df.describe(percentiles=[.1,.5,.25,.75,.90,.95,.99])

In [None]:
def boxplot_for_outlier_analysis():
    plt.figure(figsize=[16,12])
    i=1 # to track the ith plot in the subplot
    for col in numerical_cols:
        plt.subplot(3,3,i)
        sns.boxplot(y=country_df[col])
        plt.title(col)
        plt.ylabel('')
        i+=1

In [None]:
boxplot_for_outlier_analysis()

### Outlier Analysis:

Let's not remove the outliers, as they carry information about the countries' needs and deleting them would remove a lot of countries. Instead, let's find the outliers in the dataset and then decide, column by column, whether to keep them as they are or to cap them at the corresponding high or low percentile values.


- `child_mort` has only high outliers, and since we need to identify the countries where `child_mort` is high, let's **not cap** these outliers.

- `exports` and `imports` have only high outliers; this means that these countries trade heavily and are doing fine, so we can **cap these outliers** at the column's 99th percentile.

- `health` has a large number of high outliers; this means that these countries spend a relatively large amount on health, so we can **cap these outliers** at the column's 99th percentile.

- `income` has some high outliers; this means that incomes in these countries are much higher than in other countries, so we can **cap these outliers** at the column's 99th percentile.

- `inflation` has some high outliers, and these countries might be in need of aid, so let's **not cap** these outliers.

- `life_expec` has some low outliers, and these are the countries that need our attention, so let's **not cap** these outliers.

- `total_fer` has some high outliers; let's **cap these outliers** at the 99th percentile.

- `gdpp` has a large number of high outliers; this indicates that these countries are doing well by themselves, so we can **cap these outliers** at the column's 99th percentile.


# 3. Prepare the data for modelling
## 3.1 Outlier Treatment

Outliers can be treated in two ways:
1. Statistical treatment, where all outliers are either removed or capped. Since our dataset has only 167 rows, deleting outliers would remove countries that might be in need of aid.

2. Domain-based outlier treatment, where we cap or remove outliers based on the data's relevance to the business need. We will cap the outliers as discussed in the section above.

There are different ranges for capping outliers:
- Soft range: 1st and 99th percentile.
- Mid range: 5th and 95th percentile.
- Hard range: 25th and 75th percentile.

We will use **soft capping**, as the data points are few and the capping should not influence the clusters much.
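
To make the effect of soft capping concrete, here is a minimal sketch on a made-up series. It uses `Series.clip`, which gives the same result as assigning the cap to every value at or above it; the names `toy`, `upper_cap` and `capped` are hypothetical.

```python
import pandas as pd

toy = pd.Series([1.0, 2.0, 3.0, 4.0, 1000.0])  # made-up values with one extreme entry

upper_cap = toy.quantile(0.99)      # 99th percentile of the series
capped = toy.clip(upper=upper_cap)  # values above the cap are replaced by the cap

print(upper_cap, capped.max())
```

Only the extreme value is pulled down to the cap; the other values, and the number of rows, are unchanged. This is why capping is preferred here over deleting rows.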

In [None]:
higher_outlier_cols = ['exports','imports','health','income','total_fer','gdpp']

for col in higher_outlier_cols:
    upper_cap = country_df[col].quantile(0.99) # 99th percentile of the column
    country_df.loc[country_df[col] >= upper_cap, col] = upper_cap # cap values at the 99th percentile

In [None]:
boxplot_for_outlier_analysis()

In [None]:
country_df.describe(percentiles=[.1,.5,.25,.75,.90,.95,.99])

Some outliers remain in the data after treatment because we used soft capping. Let's proceed and cluster the countries based on the prepared data.

## 3.2 Hopkins test to understand cluster tendency

- Before we apply any clustering algorithm to the data, it's important to check whether the data contains meaningful clusters at all, i.e. whether it is non-random.

- The process of evaluating whether the data is suitable for clustering is known as assessing the **clustering tendency**.

- To check cluster tendency, we use **Hopkins test.**

- `Hopkins test` examines whether data points differ significantly from uniformly distributed data in the multidimensional space.
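
One common way to write the statistic is sketched below; it is meant as an aid to reading the code, not as a definitive definition. Draw $m$ synthetic points uniformly within the range of the data and $m$ real points at random from the data. Let $u_j$ be the distance from the $j$-th synthetic point to a neighbouring real point, and $w_j$ the distance from the $j$-th sampled real point to its nearest other real point. The symbols $m$, $u_j$ and $w_j$ are notation introduced here for illustration.

```latex
H = \frac{\sum_{j=1}^{m} u_j}{\sum_{j=1}^{m} u_j + \sum_{j=1}^{m} w_j}
```

If the data is clustered, real points sit close to one another, so the $w_j$ are small relative to the $u_j$ and $H$ approaches 1. Note that the implementation in this notebook takes neighbour index 1 for both kinds of point, which for synthetic points is the second-nearest real point.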

In [None]:
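# Calculating the Hopkins statistic
from sklearn.neighbors import NearestNeighbors
from random import sample
from numpy.random import uniform
import numpy as np
from math import isnan

def hopkins(X):
    d = X.shape[1] # number of columns
    n = len(X) # number of rows
    m = int(0.1 * n) # sample 10% of the rows
    nbrs = NearestNeighbors(n_neighbors=1).fit(X.values)

    rand_X = sample(range(0, n, 1), m) # indices of m randomly chosen real points

    ujd = [] # distances from synthetic uniform points to the data
    wjd = [] # distances from sampled real points to their nearest other point
    for j in range(0, m):
        # Synthetic point drawn uniformly within the per-column min/max of the data.
        # Note: index 1 is its second-nearest data point; the nearest one is at index 0.
        u_dist, _ = nbrs.kneighbors(uniform(np.amin(X,axis=0),np.amax(X,axis=0),d).reshape(1, -1), 2, return_distance=True)
        ujd.append(u_dist[0][1])
        # Real point: index 0 is the point itself (distance 0), so index 1 is its nearest neighbour
        w_dist, _ = nbrs.kneighbors(X.iloc[rand_X[j]].values.reshape(1, -1), 2, return_distance=True)
        wjd.append(w_dist[0][1])

    H = sum(ujd) / (sum(ujd) + sum(wjd))
    if isnan(H):
        print(ujd, wjd)
        H = 0

    return H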

In [None]:
hopkins(country_df[numerical_cols])

### Interpretation of Hopkins score:

- A Hopkins statistic over 0.70 is a good score and indicates that the data is suitable for cluster analysis.
- A Hopkins statistic close to 1 indicates that the data is highly clustered; randomly (uniformly) distributed data tends to give values around 0.5, and regularly spaced (grid-like) data tends to give values close to 0.


The Hopkins statistic varies between runs because it draws a new random sample each time. Running it multiple times on this dataset gives values in the range of 0.86 to 0.97, so our dataset is suitable for clustering and we can proceed with the analysis.

## 3.3 Scaling

- Feature scaling is essential for machine learning algorithms that calculate distances between data points.
- Most distance-based models, e.g. k-means and hierarchical clustering, need scaling so that features with large values don't dominate the variation.
- If we do not scale, the feature with the larger value range dominates the distance calculation.
- We have chosen `StandardScaler`, which rescales every feature to mean 0 and standard deviation 1, as clustering does not work well when the variances differ a lot.
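
The effect of standard scaling can be checked by hand on a small made-up array. The sketch below compares `StandardScaler` with a manual z-score, (x - mean) / std, computed per column using the population standard deviation; the array `X` and the names `scaled` and `manual` are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values: two features on very different scales
X = np.array([[1.0, 1000.0],
              [2.0, 2000.0],
              [3.0, 6000.0]])

scaled = StandardScaler().fit_transform(X)

# Manual z-score per column; np.std uses the population formula (ddof=0) by default
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(scaled)
```

After scaling, both columns have mean 0 and standard deviation 1, so neither dominates a Euclidean distance simply because of its units.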

In [None]:
# Scaling on numerical features

scaler = StandardScaler() # instantiate scaler

country_df_scaled = scaler.fit_transform(country_df[numerical_cols]) # fit parameters to have mean 0 and SD as 1 and transform data accordingly
country_df_scaled = pd.DataFrame(country_df_scaled, columns = numerical_cols) # convert to dataframe
country_df_scaled

# 4. Build model using K-means algorithm for clustering

Now that all numerical features are scaled, let's build the unsupervised model using clustering. There are many clustering algorithms available. We will use two common ones:
1. K-means algorithm
2. Hierarchical clustering

We will build a model with each method, cluster the countries, and identify the countries in need.

**K-means** is an iterative algorithm that partitions the dataset into K pre-defined, distinct, non-overlapping subgroups (clusters), where each data point belongs to exactly one cluster.


## 4.1 Metrics to choose the value of K
The main challenge in this algorithm is to find the optimal value of k or `number of clusters`. There are two common approaches that help to find k:

1. Elbow method
2. Silhouette Analysis

### 4.1.1 Elbow Method

The elbow method gives us an idea of what a good number of clusters k would be, based on the `sum of squared distances (SSD)` between data points and their assigned clusters’ centroids. We pick k at the spot where the SSD starts to flatten out, forming an elbow.

Let's use `KMeans()` from `sklearn` to form 2, 3, 4 and so on up to 10 clusters, calculate the SSD for each, and plot the SSD against the number of clusters to see where the elbow forms.
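
To make the SSD concrete, the sketch below recomputes it by hand on made-up data and compares it with the `inertia_` attribute that `KMeans` reports. The blob data, the seed and the variable names are assumptions for illustration and are unrelated to the country dataset.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated made-up blobs of 20 points each
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, size=(20, 2)),
               rng.normal(5, 0.5, size=(20, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# For every point, look up the centroid of the cluster it was assigned to
assigned_centroids = km.cluster_centers_[km.labels_]

# SSD: squared Euclidean distance to the assigned centroid, summed over all points
manual_ssd = ((X - assigned_centroids) ** 2).sum()

print(manual_ssd, km.inertia_)
```

The two numbers agree, which is why the elbow plot can use `inertia_` directly as the SSD.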

In [None]:
# Elbow curve-ssd
ssd = []
for k in range(2, 11):
    kmean = KMeans(n_clusters = k).fit(country_df_scaled)
    ssd.append([k, kmean.inertia_])
    
temp = pd.DataFrame(ssd)
ax = plt.axes()
ax.plot(temp[0], temp[1]) # plot the SSDs for each n_clusters
ax.axvline(3, ls='dotted',color='red') # elbow formed as 3
plt.xlabel('Number of clusters')
plt.ylabel('SSD')
plt.show()

### INFERENCE:

SSD flattens and forms an elbow at 3 indicating that 3 is optimal value of k.

### 4.1.2 Silhouette Analysis

The silhouette score is a measure of how similar an object is to its own cluster (`cohesion`) compared to other clusters (`separation`).
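
For a single point, the silhouette value is (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the mean distance to the points of the nearest other cluster. The sketch below computes this by hand on four made-up one-dimensional points and compares the average with `silhouette_score`; the helper `silhouette_by_hand` is hypothetical and assumes every cluster has at least two points.

```python
import numpy as np
from sklearn.metrics import silhouette_score

X = np.array([[0.0], [1.0], [10.0], [11.0]])  # made-up points
labels = np.array([0, 0, 1, 1])               # two obvious clusters

def silhouette_by_hand(X, labels):
    scores = []
    for i in range(len(X)):
        dist = np.linalg.norm(X - X[i], axis=1)  # Euclidean distance to every point
        own = labels == labels[i]
        own[i] = False                           # exclude the point itself
        a = dist[own].mean()                     # cohesion: mean distance within own cluster
        b = min(dist[labels == other].mean()     # separation: nearest other cluster
                for other in np.unique(labels) if other != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

manual = silhouette_by_hand(X, labels)
library = silhouette_score(X, labels)
print(manual, library)
```

Values close to 1 mean that points are much closer to their own cluster than to any other, which is the case for this toy example.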

Let's use `KMeans()` from `sklearn` to form 2, 3, 4 and so on up to 10 clusters, calculate the `silhouette_score` for each, and plot the `silhouette_score` against the `number of clusters`.

In [None]:
# Silhouette score

from sklearn.metrics import silhouette_score
silhouette_scores_list = []
for k in range(2, 11):
    kmean = KMeans(n_clusters = k).fit(country_df_scaled) # initialise and fit kmeans
    silhouette_avg = silhouette_score(country_df_scaled, kmean.labels_) # silhouette score
    silhouette_scores_list.append([k, silhouette_avg])
    print("For k_clusters={0}, the silhouette score is {1:.2f}".format(k, silhouette_avg))
    
temp = pd.DataFrame(silhouette_scores_list)    
ax = plt.axes()
ax.plot(temp[0], temp[1])
ax.axvline(3, ls='dotted',color='green') # candidate k=3
ax.axvline(4, ls='dotted',color='blue') # candidate k=4
ax.axvline(5, ls='dotted',color='maroon') # candidate k=5
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette score')
plt.show()

### INFERENCES:

- The silhouette score is highest when k is 2, at 0.47.
- However, 2 is a very small number of clusters, and the countries within each of the 2 clusters might be very different from each other.

So let's look at the next best silhouette scores.

- k=3, 4 and 5 have reasonably good silhouette scores. The silhouette score decreases as k increases here, so these are all lower than the score for k=2.
- Although the elbow curve indicates 3 as the optimal number and k=3 has the best silhouette score after k=2, let's run the K-means algorithm for k=3, 4 and 5 and see which value of k gives us better `cluster profiling`.

## 4.2 Iterating with k=3,4 and 5

In [None]:
# Function for all steps of Kmean Clustering; Call with K=3,4,5
def K_means_model(k):
    kmean = KMeans(n_clusters = k, random_state = 50+k)
    kmean.fit(country_df_scaled)
    country_df_kmean = country_df.copy() # copy the actual data into a new dataframe to explain the cluster profiling
    label  = pd.DataFrame(kmean.labels_, columns= ['k_means_cluster_label'])
    country_df_kmean = pd.concat([country_df_kmean, label], axis =1) # assign the countries with the cluster labels.
    print("Number of countries in each cluster(k=%s):" %k)
    print(country_df_kmean.k_means_cluster_label.value_counts())# shows how many countries are in each cluster
    return(country_df_kmean) # returns clustered labelled dataset for further analysis

In [None]:
# Created Models are available globally to access inside cluster profiling functions
k_3_model = K_means_model(3) # K means model with 3 clusters
k_4_model = K_means_model(4) # K means model with 4 clusters
k_5_model = K_means_model(5) # K means model with 5 clusters

### Cluster Analysis:

We can see that with 3 or 4 clusters the countries are spread across the clusters. With 5 clusters, one cluster contains a single country, which would make the results harder to act on. We will proceed with these 3 models and profile the clusters created by each model on 3 important parameters, i.e. **GDPP, Income and Child_mortality**, to see which is a good value of k.

In [None]:
# Function for Profiling Clusters to plot scatter plots
def clusters_scatter_plots(col1, col2):
    plt.figure(figsize=(18,8))
    plt.subplot(2,2,1)
    sns.scatterplot(x = col1, y = col2, hue = 'k_means_cluster_label', data = k_3_model, palette=['blue','green','red'])
    plt.subplot(2,2,2)
    sns.scatterplot(x = col1, y = col2, hue = 'k_means_cluster_label', data = k_4_model, palette=['orange','blue','green','red'])
    plt.subplot(2,2,3)
    sns.scatterplot(x = col1, y = col2, hue = 'k_means_cluster_label', data = k_5_model, palette=['red','orange','maroon','green','blue'])

### Visualization of GDPP vs Income when k=3,4,5

In [None]:
clusters_scatter_plots('gdpp','income')

### INFERENCES:

- When there are 3 or 4 clusters, the clusters are distinctly separated.
- When there are 5 clusters, one of the clusters has just one country, as we saw previously, and it cannot be seen clearly here.

Let's visualise the clusters further and see which value of k helps in understanding the clusters and identifying the countries in need.

### Visualization of GDPP vs Child mortality when k=3,4,5

In [None]:
clusters_scatter_plots('gdpp','child_mort')

### INFERENCES:

This plot is very helpful for seeing the clustered groups. For example, when k=3, we see that cluster 2 requires aid, as its GDPP is low and its child mortality is high.

As the cluster labels are assigned arbitrarily, a similar group is labelled 3 when k=4 and k=5.
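
Because the label numbers themselves carry no meaning, it can help to renumber the clusters by a profiling variable so that, for example, label 0 always denotes the poorest cluster. The following is a minimal sketch under that assumption; the helper `relabel_by_mean`, the `ordered_label` column and the toy values are hypothetical, and only the column name `k_means_cluster_label` is taken from the notebook.

```python
import pandas as pd

def relabel_by_mean(df, label_col, value_col):
    """Renumber clusters so that label 0 has the lowest mean of value_col."""
    order = df.groupby(label_col)[value_col].mean().sort_values().index
    mapping = {old: new for new, old in enumerate(order)}
    return df[label_col].map(mapping)

# Made-up rows: cluster 2 is the poorest, cluster 0 the richest
toy = pd.DataFrame({
    "gdpp": [500, 700, 40000, 42000, 5000, 6000],
    "k_means_cluster_label": [2, 2, 0, 0, 1, 1],
})
toy["ordered_label"] = relabel_by_mean(toy, "k_means_cluster_label", "gdpp")
print(toy)
```

With labels ordered this way, plots for different values of k can be compared without having to work out which label corresponds to the low-GDPP group each time.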

### Visualization of Child mortality vs Income when k=3,4,5

In [None]:
clusters_scatter_plots('income','child_mort')

### INFERENCES:

As in the previous two scatter plots, the same cluster that had a high child mortality rate and low GDPP also has low income in this plot.

### Visualization of univariate distributions when k=3,4,5

Let's also draw some boxplots to understand the distributions of `gdpp`, `income` and `child_mort` within each cluster:

In [None]:
# Function for profiling clusters with box plots
def clusters_box_plots(column_name, logy=False):
    """Box plots of column_name per cluster for the k=3, 4 and 5 models.

    With logy=True, each model gets a linear-scale panel (left) and a
    log-scale panel (right). The caller creates the figure and calls plt.show().
    """
    models = [
        (k_3_model, ['red', 'green', 'blue']),
        (k_4_model, ['orange', 'blue', 'green', 'red']),
        (k_5_model, ['red', 'blue', 'maroon', 'green', 'orange']),
    ]
    n_rows = 3 if logy else 2  # 3x2 grid with log panels, 2x2 grid otherwise

    for idx, (model_df, palette) in enumerate(models):
        position = 2 * idx + 1 if logy else idx + 1
        plt.subplot(n_rows, 2, position)
        sns.boxplot(x='k_means_cluster_label', y=column_name, data=model_df, palette=palette)
        if logy:
            # Same data again on a log-scaled y axis
            plt.subplot(n_rows, 2, position + 1)
            sns.boxplot(x='k_means_cluster_label', y=column_name, data=model_df, palette=palette)
            plt.yscale('log')

### Visualization of GDPP distribution when k=3,4,5

In [None]:
plt.figure(figsize = (18,16))
clusters_box_plots('gdpp',True) # log scaled
plt.show()

### INFERENCES:

- The three plots on the left show the distribution on a linear scale; the three plots on the right show the same GDPP data on a log scale.
- The GDPP of the developed countries is so high that the GDPP of the poor countries is barely visible on the linear-scale boxplots.
- From the three log-scaled plots on the right, GDPP is in the range of 10,000 (10^4) for cluster 0 and 100,000 (10^5) for cluster 1, whereas cluster 2 is in the range of 10^3, which indicates that cluster 2 needs help.
- The clusters overlap slightly when k=4 and k=5.

### Visualization of Income distribution when k=3,4,5

In [None]:
plt.figure(figsize = (18,16))
clusters_box_plots('income',True) # log scaled
plt.show()

### INFERENCES:

- Income follows the same pattern as that of GDPP.
- The clusters are well segregated when k=3 and overlap slightly when k=4 and k=5.
- k=3 seems to be a good choice, as cohesion within each cluster is good and the clusters are well separated.

### Visualization of Child mortality distribution when k=3,4,5

In [None]:
plt.figure(figsize = (16,10))
clusters_box_plots('child_mort')
plt.show()

### INFERENCES:

- Child mortality follows the opposite pattern to GDPP and income.
- Clusters with high GDPP and income have low child mortality, which suggests that these countries have enough money to address child mortality and health issues.
- k=3 again gives good clusters: cohesion within each cluster is good and the clusters are well separated.
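Cohesion and separation can also be quantified rather than read off the boxplots. Below is a minimal sketch using the silhouette score from scikit-learn. The three synthetic blobs are made up for illustration and only stand in for the scaled country features; they are not the country data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Three well-separated synthetic blobs (illustrative data only)
centers = np.array([[0, 0], [10, 10], [-10, 10]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

# Silhouette score per candidate k: values closer to 1 mean tight,
# well-separated clusters
scores = {}
for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=50).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
```

On the real data the same loop would run on `country_df_scaled`; the k with the highest score is a candidate, to be weighed against the business interpretation above.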

### Visualization of Mean of GDPP, income and Child mortality when k=3,4,5

In [None]:
plt.figure(figsize=(18,8))
grouped_df_k3 = k_3_model[['gdpp', 'income', 'child_mort','k_means_cluster_label']].groupby('k_means_cluster_label').mean()
axes = grouped_df_k3.plot.bar(subplots=True)
plt.show()

### INFERENCES:

When K=3, the clusters can be profiled as
- 0 : Medium GDPP, medium Income and mild child mortality rate.
- 1 : High GDPP, High income and very low child mortality rate.
- 2 : Low GDPP, Low income and very high mortality rate.

In [None]:
# Profiling GDP, INCOME AND CHILD_MORT together

grouped_df_k3.plot(kind='bar', colormap='Accent')    
grouped_df_k3.plot(kind='bar',logy=True, colormap='Accent')    
plt.show()

### INFERENCES:

The combined plots give a clear profile of each cluster:

- 0 : Medium GDPP, medium income and mild child mortality rate.
- 1 : High GDPP, high income and very low child mortality rate.
- 2 : Low GDPP, low income and very high child mortality rate; this cluster has to be the focus.

In [None]:
plt.figure(figsize=(18,8))
grouped_df_k4 = k_4_model[['gdpp', 'income', 'child_mort','k_means_cluster_label']].groupby('k_means_cluster_label').mean()
axes = grouped_df_k4.plot.bar(subplots=True)
plt.show()

### INFERENCES:

When K=4, clusters 1 and 2 are fairly similar and appear to fall into a single cluster when k=3. Let's profile the clusters as
- 0 : Medium GDPP, medium Income and mild child mortality rate.
- 1 : High GDPP, High income and very low child mortality rate.
- 2 : Very high GDPP, Very high income and very low child mortality rate.
- 3 : Low GDPP, Low income and very high mortality rate

Since we are concerned with the countries that are not doing well, this finer segregation of the countries that are doing well is not our primary concern.

In [None]:
# Profiling GDP, INCOME AND CHILD_MORT together
grouped_df_k4.plot(kind='bar', colormap='Accent')    
grouped_df_k4.plot(kind='bar',logy=True, colormap='Accent')    
plt.show()

### INFERENCES:

There is not much difference between clusters 1 and 2, so they can be represented as a single cluster, which again makes k=3 a good choice.

In [None]:
plt.figure(figsize=(18,8))
grouped_df_k5 = k_5_model[['gdpp', 'income', 'child_mort','k_means_cluster_label']].groupby('k_means_cluster_label').mean()
axes = grouped_df_k5.plot.bar(subplots=True)
plt.show()

### INFERENCES:

- Clusters 1 and 3 are fairly similar and appear to fall into a single cluster when k=3.
- Cluster 4 is not useful because it contains just one country, so it cannot be compared with the other clusters. k=5 is therefore not effective for identifying the countries that are in need of aid.

In [None]:
# Profiling GDP, INCOME AND CHILD_MORT together
grouped_df_k5.plot(kind='bar', colormap='Accent')    
grouped_df_k5.plot(kind='bar',logy=True, colormap='Accent')    
plt.show()

In [None]:
k_5_model[k_5_model['k_means_cluster_label']==4]

### INFERENCES:

When K=5, there is only one country in cluster 4.

- Clusters 1 and 3 are similar.
- Clusters 0 and 4 are also similar.

Our concern is with clusters 2 and 3. But if the countries of interest are spread across several clusters, it is difficult to tell which country needs aid more than another. So let's go with k=3 as the optimal number of clusters.

### 4.3 Final Model: K-means clustering with K=3

Now that we have solved the biggest challenge in the K-means algorithm, i.e. finding the optimal value of k, we can build our final model using k=3. Let's run the `K-means` algorithm on the scaled data set, because the clustering uses Euclidean distance as its measure and unscaled features with large ranges would dominate it.

In [None]:
kmean = KMeans(n_clusters = 3, random_state = 50)
kmean.fit(country_df_scaled)

#### Creating Cluster labels using K-means
Since scaled data would be confusing to explain to business stakeholders, we copy the original data into a new dataframe, `country_df_kmean`, and use it for cluster profiling. Let's create a column called `k_means_cluster_label` and concatenate it to `country_df_kmean` to assign each country its cluster label.

In [None]:
country_df_kmean = country_df.copy() # copy df into new df, as the same df will be used for hierarchical clustering too.
label  = pd.DataFrame(kmean.labels_, columns= ['k_means_cluster_label'])
label.head()

In [None]:
country_df_kmean = pd.concat([country_df_kmean, label], axis =1)
country_df_kmean.head()

### 4.4 INITIAL CLUSTER PROFILING

- `value_counts` shows how many countries are clustered under each cluster label. 
- Let's analyse these 3 clusters and see if we can profile them by comparing their `gdpp`, `child_mort` and `income`.
- Let's visualize these clusters using `scatter plots`, `barplots` and `boxplots`.
- We also need to analyse the clusters and see if k=3 helps us identify the countries that are in dire need of aid.

In [None]:
country_df_kmean.k_means_cluster_label.value_counts()

In [None]:
# Profiling GDP, INCOME AND CHILD_MORT in separate plots

grouped_df = country_df_kmean[['gdpp', 'income', 'child_mort','k_means_cluster_label']].groupby('k_means_cluster_label').mean()
axes = grouped_df.plot.bar(subplots=True)
plt.show()

In [None]:
# Profiling GDP, INCOME AND CHILD_MORT together
grouped_df.plot(kind='bar', colormap='Accent')
grouped_df.plot(kind='bar',logy=True, colormap='Accent')

### INFERENCES:

From the above three plots, we can see that the clusters are grouped as
- 0 : Medium GDPP, medium Income and mild child mortality rate.
- 1 : High GDPP, High income and very low child mortality rate.
- 2 : Low GDPP, Low income and very high mortality rate.

### 4.5 Countries Segmentation

We can rename the labels for better business understanding, as the cluster labels 0, 1 and 2 are not meaningful on their own. Then we will perform the cluster profiling with the new labels.

Let's rename the cluster labels as
- 0 : Developing Countries
- 1 : Developed Countries
- 2 : Under-developed Countries

This will help the NGO to distinguish the cluster of developed countries from the cluster of under-developed countries and to focus on **Cluster 2: Under-developed Countries**.
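The renaming can also be written as a single dictionary lookup with `Series.map`, which leaves the numeric label column intact. This is a minimal sketch on a toy dataframe: the `cluster_name` column and the toy rows are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

# Mapping from the K-means label numbers to the business-friendly names
label_names = {
    0: 'Developing Countries',
    1: 'Developed Countries',
    2: 'Under-Developed Countries',
}

# Toy rows (made up) standing in for country_df_kmean
toy = pd.DataFrame({
    'country': ['A', 'B', 'C', 'D'],
    'k_means_cluster_label': [2, 0, 1, 2],
})

# Add the names as a new column instead of overwriting the integer labels
toy['cluster_name'] = toy['k_means_cluster_label'].map(label_names)
```

Keeping the integer column avoids mixing strings and numbers in one column, at the cost of carrying an extra column through the profiling plots.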

In [None]:
# Developing Countries: Medium income, Medium GDP and Slightly high Child_mort
# Filter the data for that cluster

country_df_kmean.loc[country_df_kmean['k_means_cluster_label'] == 0,'k_means_cluster_label'] ='Developing Countries'
country_df_kmean[country_df_kmean['k_means_cluster_label'] == 'Developing Countries']

In [None]:
country_df_kmean[country_df_kmean['k_means_cluster_label'] == 'Developing Countries'].describe()

Summary statistics show that the variation within the group is small and that the mean and median are close, so this clustering looks good.

In [None]:
# Developed Countries: High income, High GDP and Low Child_mort
# Filter the data for that cluster
country_df_kmean.loc[country_df_kmean['k_means_cluster_label'] == 1,'k_means_cluster_label'] ='Developed Countries'
country_df_kmean[country_df_kmean['k_means_cluster_label'] == 'Developed Countries']

In [None]:
country_df_kmean[country_df_kmean['k_means_cluster_label'] == 'Developed Countries'].describe()

Summary statistics show that the variation within the group is small and that the mean and median are close, so this clustering looks good.

The statistics of this cluster also differ widely from those of the previous cluster.

In [None]:
# Under-Developed Countries: Low income, Low GDP and High Child_mort
# Filter the data for that cluster

country_df_kmean.loc[country_df_kmean['k_means_cluster_label'] == 2,'k_means_cluster_label'] ='Under-Developed Countries'
country_df_kmean[country_df_kmean['k_means_cluster_label'] == 'Under-Developed Countries']

In [None]:
country_df_kmean[country_df_kmean['k_means_cluster_label'] == 'Under-Developed Countries'].describe()

Summary statistics show that the variation within the group is small and that the mean and median are close, so this clustering looks good.

The statistics of this cluster also differ widely from those of the previous clusters.

Hence cohesion and separation are well preserved in this clustering.
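One way to put a number on "variation within the group" is the coefficient of variation (standard deviation divided by mean) per cluster. The sketch below uses a hypothetical helper, `within_cluster_cv`, and made-up toy values; lower values indicate a tighter cluster.

```python
import pandas as pd

def within_cluster_cv(df, label_col, cols):
    """Coefficient of variation (sample std / mean) of each column in cols,
    computed separately for every cluster in label_col."""
    grouped = df.groupby(label_col)[cols]
    return grouped.std() / grouped.mean()

# Toy data (made up): cluster 'a' is perfectly tight, cluster 'b' is spread out
toy = pd.DataFrame({
    'k_means_cluster_label': ['a', 'a', 'b', 'b'],
    'gdpp': [100.0, 100.0, 10.0, 30.0],
})
cv = within_cluster_cv(toy, 'k_means_cluster_label', ['gdpp'])
```

Applied to `country_df_kmean` with the three profiling columns, this gives one table that can be compared across clusters instead of reading three separate `describe()` outputs.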

Now that we have segmented the clusters and renamed them properly, the plots will be easier to interpret. Let's do the cluster profiling with the new cluster labels.

### 4.6 CLUSTER PROFILING WITH NEW LABELS

In [None]:
profiling_cols = ['gdpp','child_mort','income'] # create a list to store profiling variables

In [None]:
# Plot the cluster
plt.figure(figsize=(18,8))
i=0
for i in range(len(profiling_cols)):
    plt.subplot(2,2,i+1)
    sns.scatterplot(x = profiling_cols[i], y = profiling_cols[(i+1)%len(profiling_cols)], hue = 'k_means_cluster_label', data = country_df_kmean, palette=['red','blue','darkgreen'])

### INFERENCES:

These new labels help us to interpret the plots in a better fashion. We can see that 
- Developing countries have Medium GDPP, medium Income and mild child mortality rate.
- Developed countries have High GDPP, High income and very low child mortality rate.
- Under-Developed countries have Low GDPP, Low income and very high mortality rate and should be our primary focus.

In [None]:
# Plot the cluster
plt.figure(figsize=(18,8))
i=0
for i in range(len(profiling_cols)):
    plt.subplot(2,2,i+1)
    sns.boxplot(x = 'k_means_cluster_label', y = profiling_cols[i], data = country_df_kmean, palette=['red','blue','darkgreen'])
    plt.xlabel('')

### INFERENCES:

The boxplots show the same pattern as the scatter plots. The clusters are grouped as
- Developing countries have Medium GDPP, medium Income and mild child mortality rate.
- Developed countries have High GDPP, High income and very low child mortality rate.
- Under-Developed countries have Low GDPP, Low income and very high mortality rate and should be our primary focus.

The GDPP and income of the under-developed countries are so low that they are barely visible on the same scale as those of the developed countries.

In [None]:
# Profiling GDP, INCOME AND CHILD_MORT in sub-plots
plt.figure(figsize=(18,8))
grouped_df = country_df_kmean[['gdpp', 'income', 'child_mort','k_means_cluster_label']].groupby('k_means_cluster_label').mean()
axes = grouped_df.plot.bar(subplots=True)
plt.show()

### INFERENCES:

The mean GDPP and income of the under-developed countries are very low compared with the developing and developed countries, so we need to look further into this cluster to find the countries most in need of aid.

In [None]:
# Profiling GDP, INCOME AND CHILD_MORT together from the above grouped_df
grouped_df.plot(kind='bar', colormap='Accent')
grouped_df.plot(kind='bar',logy=True, colormap='Accent')

### INFERENCES:

The cluster means show the same pattern, and the grouping lets us focus on the **Under-Developed Countries** cluster, as it has **low GDPP, low income and very high child mortality rate.**

### 4.7 Identification of Top 10 countries that require aid on priority using K-means algorithm:

In [None]:
K_top10 = country_df_kmean[country_df_kmean['k_means_cluster_label'] =='Under-Developed Countries'].sort_values(['gdpp', 'child_mort', 'income'], ascending = [True, False, True]).head(10)
K_top10

In [None]:
K_top10.country

These are the countries that require aid, as identified by the K-means algorithm. Let's use another technique, `Hierarchical clustering`, to see whether any other country requires aid more than these countries, and then present our final analysis.
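A note on the ranking in 4.7: `sort_values` with several keys sorts lexicographically, so `child_mort` and `income` only break ties in `gdpp`. The sketch below contrasts this with one possible alternative, an average of per-column ranks. The alternative is an assumption added for illustration, not the method used in this analysis, and the toy values are made up.

```python
import pandas as pd

# Toy data (made up): A and B tie on gdpp, C has the lowest gdpp
toy = pd.DataFrame({
    'country': ['A', 'B', 'C'],
    'gdpp': [300, 300, 250],
    'child_mort': [90, 120, 60],
    'income': [800, 700, 900],
})

# Lexicographic sort, as in 4.7: gdpp decides first; child_mort and
# income are used only when gdpp values are tied.
lexicographic = list(
    toy.sort_values(['gdpp', 'child_mort', 'income'],
                    ascending=[True, False, True])['country']
)

# Hypothetical alternative: rank each indicator so that rank 1 is the
# neediest, then average the three ranks per country.
ranks = pd.DataFrame({
    'gdpp': toy['gdpp'].rank(ascending=True),
    'child_mort': toy['child_mort'].rank(ascending=False),
    'income': toy['income'].rank(ascending=True),
})
composite = list(
    toy.assign(need_rank=ranks.mean(axis=1)).sort_values('need_rank')['country']
)
```

In this toy case the two orderings differ: the lexicographic sort puts C first on GDPP alone, while the averaged ranks put B first because it is worst on two of the three indicators.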

# 5 Build unsupervised model using Hierarchical Clustering


- We have built an unsupervised model using the K-means algorithm. Now let's create a model to cluster the countries using `Hierarchical Clustering`.
- Hierarchical clustering starts by treating each observation as a separate cluster.
- Then, it repeatedly executes the following two steps:
  1. identify the two clusters that are closest together
  2. merge these two most similar clusters.

This iterative process continues until all the clusters are merged together. The main output of Hierarchical Clustering is a `dendrogram`, which shows the hierarchical relationship between the clusters.
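The two-step loop above can be sketched in a few lines. This is a deliberately naive illustration (single linkage, brute-force search), not how `scipy` implements it; the function name `naive_agglomerative` and the toy points are assumptions for this example.

```python
import numpy as np

def naive_agglomerative(points, n_clusters):
    """Merge the two closest clusters (single linkage) until n_clusters remain.
    Returns the clusters as sorted lists of row indices."""
    points = np.asarray(points, dtype=float)
    clusters = [[i] for i in range(len(points))]  # every observation starts alone
    while len(clusters) > n_clusters:
        best = None
        # Step 1: find the pair of clusters with the smallest distance
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(points[i] - points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        # Step 2: merge that pair into one cluster
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]

# Toy one-dimensional data (made up)
toy = [[0.0], [0.1], [5.0], [5.2], [9.0]]
result = naive_agglomerative(toy, 3)
```

Stopping at `n_clusters` corresponds to cutting the dendrogram at the height where that many branches remain.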

### Linkage Criteria

- There are multiple linkage criteria, which determine between which points of two clusters the `Euclidean distance` is computed.
- It can be computed between
  - The two closest points of the two clusters in **single-linkage**
  - The two farthest points of the two clusters in **complete-linkage**
  - All pairs of points across the two clusters, averaged, in **average-linkage** (centroid linkage, by contrast, uses the distance between the cluster centers)

We will do both single and complete linkages here and try to interpret their dendrograms.
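To make the difference between the criteria concrete, the sketch below computes each linkage distance between two small clusters directly with NumPy. The helper `linkage_distances` and the four points are made up for illustration.

```python
import numpy as np

def linkage_distances(cluster_a, cluster_b):
    """Distance between two clusters under several linkage criteria."""
    a = np.asarray(cluster_a, dtype=float)
    b = np.asarray(cluster_b, dtype=float)
    # All pairwise Euclidean distances between a point in a and a point in b
    pairwise = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    return {
        'single': pairwise.min(),     # closest pair
        'complete': pairwise.max(),   # farthest pair
        'average': pairwise.mean(),   # mean over all pairs
        'centroid': np.linalg.norm(a.mean(axis=0) - b.mean(axis=0)),
    }

d = linkage_distances([[0, 0], [0, 2]], [[3, 0], [3, 2]])
```

For these points the single and centroid distances are both 3, the complete distance is the diagonal (about 3.61), and the average falls in between.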

## 5.1 Single linkage

In [None]:
country_df_scaled

In [None]:
# single linkage
mergings = linkage(country_df_scaled, method="single", metric='euclidean')
dendrogram(mergings)
plt.show()

### INTERPRETATION OF DENDROGRAM:

The single-linkage dendrogram is not readable or interpretable, so we cannot use it for our problem.

Let's try complete linkage and see if it helps.

## 5.2 Complete Linkage:

In [None]:
# complete linkage
mergings = linkage(country_df_scaled, method="complete", metric='euclidean')
dendrogram(mergings)
plt.show()

### INTERPRETATION OF DENDROGRAM:

The complete-linkage dendrogram is readable and easier to interpret than the single-linkage dendrogram.

We can see the merging of clusters represented in different colors.

If we cut the dendrogram at a height (distance) of 5 or 6, we get 4 clusters. However, the dissimilarity between the 4-cluster and 3-cluster solutions is small, since 3 clusters already form at a height of 8. Only at a greater height of about 12 are there just 2 clusters.

This indicates that 3 clusters is a good choice, as there will be good dissimilarity between clusters and good similarity within clusters.
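Cutting at a height and asking for a number of clusters are two views of the same operation. A minimal sketch with `scipy` on made-up one-dimensional data (the toy points and the cut height of 1.0 are assumptions for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cut_tree

# Toy data (made up): two tight pairs and one isolated point
toy = np.array([[0.0], [0.1], [5.0], [5.2], [9.0]])
Z = linkage(toy, method='complete', metric='euclidean')

# Column 2 of the linkage matrix holds the height of each merge
merge_heights = Z[:, 2]

# Cut by height: every merge above 1.0 is undone
by_height = cut_tree(Z, height=1.0).reshape(-1)

# Cut by count: ask directly for 3 clusters
by_count = cut_tree(Z, n_clusters=3).reshape(-1)
```

A large gap between consecutive values in `merge_heights` is the numeric counterpart of the long vertical branches seen in the dendrogram.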

In [None]:
# 3 clusters
cluster_labels = cut_tree(mergings, n_clusters=3).reshape(-1, )
cluster_labels

In [None]:
# assign cluster labels
country_df['cluster_labels'] = cluster_labels
country_df.head()

## 5.3 Initial Cluster Profiling Using Hierarchical Clustering Model

In [None]:
country_df.cluster_labels.value_counts()

We can see that, although the number of clusters is the same as with the K-means algorithm (3), the number of countries in each cluster differs. Let's profile the clusters and label them.
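A cross-tabulation shows exactly where two sets of cluster labels agree and disagree. The sketch below uses a toy dataframe with made-up labels; the column names `kmeans` and `hier` are placeholders, not columns of the original data.

```python
import pandas as pd

# Toy labels (made up) from two clustering methods for seven observations
toy = pd.DataFrame({
    'kmeans': [0, 0, 1, 1, 2, 2, 2],
    'hier':   [1, 1, 2, 2, 0, 0, 1],
})

# Rows: K-means label, columns: hierarchical label, cells: number of observations
agreement = pd.crosstab(toy['kmeans'], toy['hier'])
```

A table with one dominant cell per row means the two methods found essentially the same groups under different label numbers; off-diagonal counts point to the observations that moved.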

In [None]:
# Profiling GDP, INCOME AND CHILD_MORT in separate plots
grouped_df = country_df[['gdpp', 'income', 'child_mort','cluster_labels']].groupby('cluster_labels').mean()
axes = grouped_df.plot.bar(subplots=True)  # one subplot per variable; the combined plots follow in the next cell
plt.show()

In [None]:
# Profiling GDP, INCOME AND CHILD_MORT together
grouped_df.plot(kind='bar', colormap='Accent')
grouped_df.plot(kind='bar',logy=True, colormap='Accent')

From the above plots, it is evident that the cluster labels correspond to
- 0 : Under-developed countries having low GDPP, low income and high child mortality rate.
- 1 : Developing countries having medium GDPP, medium income and mild child mortality rate.
- 2 : Developed countries having high GDPP, high income and very low child mortality rate.

## 5.4 Countries Segmentation

Similar to our approach with the K-means algorithm, we can rename the labels for better business understanding, as the cluster labels 0, 1 and 2 are not meaningful on their own. Then we will perform the cluster profiling with the new labels.

Let's rename the cluster labels as
- 0 : Under-developed Countries
- 1 : Developing Countries
- 2 : Developed Countries
    
This will help the NGO to distinguish the cluster of developed countries from the cluster of under-developed countries and to focus on **Cluster 0: Under-developed Countries**.

In [None]:
# Under-Developed Countries: Low income, Low GDP and High Child_mort
# Filter the data for that cluster

country_df.loc[country_df['cluster_labels'] == 0,'cluster_labels'] ='Under-Developed Countries'
country_df[country_df['cluster_labels'] == 'Under-Developed Countries']

In [None]:
country_df[country_df['cluster_labels'] == 'Under-Developed Countries'].describe()

In [None]:
# Developing Countries: Medium income, Medium GDP and Mild Child_mort
# Filter the data for that cluster

country_df.loc[country_df['cluster_labels'] == 1,'cluster_labels'] ='Developing Countries'
country_df[country_df['cluster_labels'] == 'Developing Countries']

In [None]:
country_df[country_df['cluster_labels'] == 'Developing Countries'].describe()

In [None]:
# Developed Countries: High income, High GDP and Low Child_mort
# Filter the data for that cluster

country_df.loc[country_df['cluster_labels'] == 2,'cluster_labels'] ='Developed Countries'
country_df[country_df['cluster_labels'] == 'Developed Countries']

In [None]:
country_df[country_df['cluster_labels'] == 'Developed Countries'].describe()

The summary statistics of each cluster show that the clustering works well in this case too.

## 5.5 Final Cluster Profiling with new labels

In [None]:
# Plot the cluster
plt.figure(figsize=(18,8))
i=0
for i in range(len(profiling_cols)):
    plt.subplot(2,2,i+1)
    sns.scatterplot(x = profiling_cols[i], y = profiling_cols[(i+1)%len(profiling_cols)], hue = 'cluster_labels', data = country_df, palette=['red','blue','green'])

### INFERENCES:

These new labels help us to interpret the plots in a better fashion. We can see that 
- Developing countries have Medium GDPP, medium Income and mild child mortality rate.
- Developed countries have High GDPP, High income and very low child mortality rate.
- Under-Developed countries have Low GDPP, Low income and very high mortality rate and should be our primary focus.

In [None]:
# Plot the cluster
plt.figure(figsize=(18,8))
i=0
for i in range(len(profiling_cols)):
    plt.subplot(2,2,i+1)
    sns.boxplot(x = 'cluster_labels', y = profiling_cols[i], data = country_df, palette=['red','blue','green'])

### INFERENCES:

The boxplots show the same pattern as the scatter plots. The clusters are grouped as
- Developing countries have Medium GDPP, medium Income and mild child mortality rate.
- Developed countries have High GDPP, High income and very low child mortality rate.
- Under-Developed countries have Low GDPP, Low income and very high mortality rate and should be our primary focus.

The GDPP and income of the under-developed countries are so low that they are barely visible on the same scale as those of the developed countries.

These plots are similar to the K-means plots, which supports our analysis.

In [None]:
# Profiling GDP, INCOME AND CHILD_MORT in sub-plots
grouped_df = country_df[['gdpp', 'income', 'child_mort','cluster_labels']].groupby('cluster_labels').mean()
grouped_df.plot(kind='bar', subplots=True)
plt.show()

In [None]:
# Profiling GDP, INCOME AND CHILD_MORT together
grouped_df.plot(kind='bar', colormap='Accent')
grouped_df.plot(kind='bar',logy=True, colormap='Accent')

### INFERENCES:

The cluster averages show the same pattern, and the grouping lets us focus on the **Under-Developed Countries** cluster, as it has **low GDPP, low income and very high child mortality rate.**

## 5.6 Identification of Top 10 countries that require aid on priority using Hierarchical clustering:

In [None]:
H_top10 = country_df[country_df['cluster_labels'] =='Under-Developed Countries'].sort_values(by = ['gdpp','child_mort','income'], ascending = [True, False, True]).head(10)
H_top10

In [None]:
H_top10.country

In [None]:
list(K_top10.country)==list(H_top10.country)

This indicates that both K-means and Hierarchical Clustering returned the same list of 10 countries in need of aid.
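Comparing two Python lists with `==` also requires the same order. When only membership matters, comparing sets is more forgiving and also shows which entries differ. A small sketch with placeholder names (the lists below are made up and are not the actual results):

```python
# Placeholder lists (made up) standing in for K_top10.country and H_top10.country
k_list = ['A', 'B', 'C']
h_list = ['B', 'A', 'C']

same_order = k_list == h_list              # identical members in identical order
same_members = set(k_list) == set(h_list)  # identical members, order ignored

# Entries found by one method but not the other
only_in_k = sorted(set(k_list) - set(h_list))
only_in_h = sorted(set(h_list) - set(k_list))
```

If `same_order` is false but `same_members` is true, the two methods agree on the countries and differ only in ranking.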

We can choose our final suggestions based on K-means clustering, as its within-cluster similarity and between-cluster dissimilarity are good, as can be seen from the summary statistics after segmentation. K-means also placed around 48 countries in the under-developed cluster, so it looks like the better clustering mechanism here.

Let's break this into two lists: the top 5 countries as the `priority 1 list` and the next 5 countries as the `priority 2 list`.

In [None]:
Priority_1_countries = K_top10.head(5).copy() # copy so that adding a column does not trigger SettingWithCopyWarning
Priority_1_countries['Aid Priority'] = "Aid Requirement Priority 1"
Priority_1_countries

In [None]:
Priority_2_countries = K_top10.tail(5).copy() # copy so that adding a column does not trigger SettingWithCopyWarning
Priority_2_countries['Aid Priority'] = "Aid Requirement Priority 2"
Priority_2_countries

# 6. Presenting countries that are in need of help to HELP International
   

In [None]:
def results_plots(df_name):
    plt.figure(figsize=[18,6])
    for i,column_name in enumerate(profiling_cols):
        plt.subplot(2,2,i+1)
        ax = sns.barplot(x='country', y=column_name, data= df_name)
        for each_bar in ax.patches:
            ax.annotate(str(each_bar.get_height()), (each_bar.get_x() * 1.01 , each_bar.get_height() * 1.01))
        plt.ylabel(column_name)
        plt.xlabel('Countries which have poor %s' %column_name)

First set of countries that require aid immediately: 

In [None]:
Priority_1_countries

In [None]:
Priority_1_countries.set_index('country').plot(kind='bar')
plt.xlabel('')
plt.show()

In [None]:
results_plots(Priority_1_countries)

Once the above countries have been helped, the following set of countries could be provided aid next:

In [None]:
Priority_2_countries

In [None]:
Priority_2_countries.set_index('country').plot(kind='bar')
plt.xlabel('')
plt.show()

In [None]:
results_plots(Priority_2_countries)

# Suggestions to HELP International - Countries that are in need of aid:
The following 5 countries should be provided aid first:

1. Burundi
2. Liberia
3. Congo, Dem. Rep.
4. Niger
5. Sierra Leone

Once the above countries have been provided with aid, the following countries are next in line for aid to reduce their child mortality rate and improve their GDPP and income per person:

6. Madagascar
7. Mozambique
8. Central African Republic
9. Malawi
10. Eritrea