# <center>HELP International NGO</center>

## Problem Statement
HELP International is an international humanitarian NGO committed to fighting poverty and to providing people in underdeveloped countries with basic amenities and relief during disasters and natural calamities. <br>
The key problems to address are:
<ul>
    <li>How to use the newly received $10 million in funding strategically and effectively.</li>
    <li>Categorise the countries using socio-economic and health factors.</li>
    <li>Choose the countries that are in the direst need of aid.</li>
</ul>

### Load Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

### Load Data and Perform EDA

In [None]:
data_dict = pd.read_csv('data-dictionary.csv', index_col='Column Name')
pd.set_option('display.max_colwidth', None)  #None = do not truncate; -1 was deprecated in pandas 1.0
data_dict

In [None]:
countryDf = pd.read_csv('Country-data.csv')
countryDf.head()

In [None]:
countryDf.info()

In [None]:
countryDf.isnull().sum()

**No Missing Data**

In [None]:
countryDf.shape

In [None]:
countryDf[countryDf.duplicated()]

In [None]:
sns.pairplot(countryDf)

In [None]:
countryDf.describe()

In [None]:
countryDf[['country']].describe()

**The data has 167 rows covering 167 unique countries, <br>
so there are no duplicate rows.**

**Outlier Analysis**

In [None]:
plt.figure(figsize=(18,10))
numericCol = countryDf.columns.drop('country')
index = 1
for col in numericCol:
    plt.subplot(3,3,index)
    sns.boxplot(data=countryDf, y=col)
    index = index + 1

In [None]:
plt.figure(figsize=(18,10))
numericCol = countryDf.columns.drop('country')
index = 1
for col in numericCol:
    plt.subplot(3,3,index)
    sns.histplot(countryDf[col], kde=True, stat="density")  #distplot is deprecated since seaborn 0.11
    index = index + 1

**Analysing outliers feature by feature**

In [None]:
countryDf[countryDf['child_mort'] > 100]

In [None]:
countryDf[countryDf['life_expec'] < 50]

A **high child mortality** rate (`child_mort`) and a **low life expectancy** (`life_expec`) are major concerns.<BR>
The other features show that the countries facing these problems are underdeveloped: their GDP per capita, their health expenditure as a share of GDP, and their income per person are all low.

In [None]:
countryDf[countryDf['exports'] > 100]

In [None]:
countryDf[countryDf['income'] > 50000]

In [None]:
countryDf[countryDf['gdpp'] > 40000]

In [None]:
countryDf[countryDf['inflation'] > 20]

There are a few outliers in the data, but the data set is very small, with only 167 observations.<BR>
So we will keep all the data as is and continue with the analysis.
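
The threshold filters above were chosen by eye from the box plots. As a rough cross-check, the sketch below counts values outside the usual 1.5 × IQR whiskers for each numeric column. The helper name `iqr_outlier_counts` and the toy frame are assumptions made for illustration; they are not part of the original analysis.

```python
import pandas as pd

def iqr_outlier_counts(df, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    numeric = df.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return outside.sum()

# Made-up toy frame, NOT the country data
toy = pd.DataFrame({
    "country": list("ABCDEFGH"),
    "x": [1, 2, 3, 4, 5, 6, 7, 100],
    "y": [10, 11, 12, 13, 14, 15, 16, 17],
})
counts = iqr_outlier_counts(toy)
print(counts)
```

Calling `iqr_outlier_counts(countryDf)` should give one count per numeric feature, which can be compared against the box plots.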

In [None]:
countryDf.corr(numeric_only=True)  #skip the non-numeric 'country' column (needed on pandas 2.x)

In [None]:
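#Check correlation with heatmap
corr = countryDf.corr(numeric_only=True)  #skip the non-numeric 'country' column
mask = np.zeros_like(corr, dtype=bool)  #np.bool was removed in NumPy 1.24; use the builtin bool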
mask[np.triu_indices_from(mask)] = True

plt.figure(figsize=(10,5))
sns.heatmap(corr, cmap='RdBu', annot=True,  mask=mask, center=0, linewidths= 0.1)
plt.show()

A few features are highly correlated.
We could drop some of them, such as **total_fer, income, exports**. <BR>
Instead, we will keep all the features as is and use **PCA** to deal with the **multicollinearity**.
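
Reading strongly correlated pairs off a heatmap gets tedious as the number of features grows. A minimal sketch of listing them programmatically follows; the helper `correlated_pairs`, the 0.8 threshold and the toy columns are all illustrative assumptions.

```python
import numpy as np
import pandas as pd

def correlated_pairs(df, threshold=0.8):
    """Return feature pairs whose absolute Pearson correlation is >= threshold."""
    corr = df.corr(numeric_only=True)
    # Keep only the strict upper triangle so each pair appears once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    strong = pairs[pairs.abs() >= threshold]
    return strong.sort_values(key=np.abs, ascending=False)

# Made-up toy data: 'b' is almost a copy of 'a', 'c' is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "label": ["row"] * 200,
    "a": a,
    "b": a + 0.1 * rng.normal(size=200),
    "c": rng.normal(size=200),
})
pairs = correlated_pairs(toy)
print(pairs)
```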

In [None]:
countryDf.describe()

Since several features are correlated with each other, we will apply PCA instead of checking and removing features manually.

### PCA Implementation
Before performing PCA, we scale the data with `StandardScaler`, because the `income` and `gdpp` features have a much wider spread than the other features.
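
To see why scaling matters, the sketch below standardises two made-up columns whose spreads differ by orders of magnitude. The values are synthetic and only loosely imitate a percentage column next to a dollar column.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic columns on very different scales (NOT the country data)
small = rng.normal(loc=7.0, scale=2.0, size=200)
large = rng.normal(loc=17000.0, scale=19000.0, size=200)
X = np.column_stack([small, large])

X_scaled = StandardScaler().fit_transform(X)
print(X.std(axis=0))          # raw spreads differ by orders of magnitude
print(X_scaled.mean(axis=0))  # approximately 0 for both columns
print(X_scaled.std(axis=0))   # approximately 1 for both columns
```

Without this step, PCA would be dominated by whichever column has the largest raw variance.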

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

In [None]:
trainDf = countryDf.drop(labels='country', axis=1)
sc = StandardScaler()
scalledData = sc.fit_transform(trainDf)
trainDf_scalled = pd.DataFrame(data=scalledData, columns=trainDf.columns)

In [None]:
pca = PCA(svd_solver='randomized', random_state=100)

In [None]:
pca.fit(trainDf_scalled)

In [None]:
pca.components_

In [None]:
features = pd.Series(trainDf_scalled.columns, name='Feature')
pcaDF = pd.DataFrame(data=pca.components_.T, columns=['PC1','PC2','PC3','PC4','PC5','PC6','PC7','PC8','PC9'])
pcaDF = pd.concat([features, pcaDF], axis=1)
pcaDF

In [None]:
#Cumulative variance explained by PCA component
np.cumsum(pca.explained_variance_ratio_)

In [None]:
#Plot scree plot to check Cumulative Variance Explaining by PCA component
plt.plot(np.cumsum(pca.explained_variance_ratio_))

According to the scree plot above, **5** principal components explain **94.5%** of the variance; adding a **6th** component gains barely **1.5%** more.<BR>
So we will keep **5** components and build the clustering models on them.
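
The same decision can be made programmatically by taking the smallest number of components whose cumulative explained variance reaches a chosen threshold. The sketch below assumes a 90% threshold and uses synthetic data; the helper `n_components_for` is hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def n_components_for(cum_var, threshold):
    """Smallest number of components whose cumulative variance ratio reaches threshold."""
    k = int(np.searchsorted(cum_var, threshold)) + 1
    return min(k, len(cum_var))

# Synthetic data: 8 observed features driven by 3 latent factors plus noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 8))
X = latent @ mixing + 0.3 * rng.normal(size=(200, 8))

pca_all = PCA().fit(StandardScaler().fit_transform(X))
cum_var = np.cumsum(pca_all.explained_variance_ratio_)
k = n_components_for(cum_var, 0.90)
print(k, cum_var.round(3))
```

In this notebook the equivalent call would be `n_components_for(np.cumsum(pca.explained_variance_ratio_), 0.90)`.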

In [None]:
from sklearn.decomposition import IncrementalPCA
final_pca = IncrementalPCA(n_components=5)
trainDf_pca = final_pca.fit_transform(trainDf_scalled)

In [None]:
final_pca.components_

In [None]:
trainDf_pca.shape

In [None]:
#Check correlation with heatmap
corr = np.corrcoef(trainDf_pca.transpose()).round(2)

mask = np.zeros_like(corr, dtype=bool)  #np.bool was removed in NumPy 1.24; use the builtin bool
mask[np.triu_indices_from(mask)] = True

#plt.figure(figsize=(5,5))
sns.heatmap(corr, cmap='RdBu', annot=True,  mask=mask, center=0, linewidths= 0.1)
plt.show()

There is no correlation between any two principal components.
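
This is expected, because principal component scores are uncorrelated by construction. A small sketch on synthetic, strongly correlated data makes the point; it uses plain `PCA` rather than `IncrementalPCA`, which is an assumption made to keep the check exact.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
base = rng.normal(size=(300, 2))
# Synthetic data in which the first two columns are almost collinear
X = np.column_stack([base[:, 0], base[:, 0] + 0.1 * base[:, 1], base[:, 1]])
print(np.corrcoef(X.T).round(2))   # large off-diagonal entries before PCA

scores = PCA(n_components=2).fit_transform(X)
corr = np.corrcoef(scores.T)
off_diag = corr[~np.eye(corr.shape[0], dtype=bool)]
print(np.abs(off_diag).max())      # numerically zero after PCA
```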

In [None]:
features = pd.Series(trainDf_scalled.columns, name='Feature')
final_pcaDF = pd.DataFrame(data=final_pca.components_.T, columns=['PC1','PC2','PC3','PC4','PC5'])
final_pcaDF = pd.concat([features, final_pcaDF], axis=1)
final_pcaDF

In [None]:
trainDf_pca = pd.DataFrame(data=trainDf_pca)
trainDf_pca.columns = ['PC1','PC2','PC3','PC4','PC5']
trainDf_pca.head()

### Model Building : Using Clustering Algorithms

First, compute the **Hopkins statistic** to check whether the data has enough cluster tendency for clustering to be meaningful.

In [None]:
from sklearn.neighbors import NearestNeighbors
from random import sample
from math import isnan
 
def HopkinsStats(data):
    d = data.shape[1]
    n = len(data) # rows
    m = int(0.1 * n)
    
    nbrs = NearestNeighbors(n_neighbors=1).fit(data.values)
    
    rand_data = sample(range(0, n, 1), m)
    ujd = []
    wjd = []
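    for j in range(0, m):
        # u: distance from a uniformly drawn point to its nearest real observation
        u_dist, _ = nbrs.kneighbors(np.random.uniform(np.amin(data,axis=0),np.amax(data,axis=0),d).reshape(1, -1), 1, return_distance=True)
        ujd.append(u_dist[0][0])
        # w: distance from a sampled observation to its nearest neighbour other than itself
        w_dist, _ = nbrs.kneighbors(data.iloc[rand_data[j]].values.reshape(1, -1), 2, return_distance=True)
        wjd.append(w_dist[0][1])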

    H = sum(ujd) / (sum(ujd) + sum(wjd))
    if isnan(H):
        H = 0

    return H

In [None]:
HopkinsStats(trainDf_pca)

A Hopkins statistic between **0.7 and 0.99** indicates that the data is well suited to clustering. <BR>
Our Hopkins statistic is **> 0.7**, which shows a good clustering tendency. <BR><BR>
We will apply the **K-Means and hierarchical clustering** algorithms to the data set.
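
To get a feel for these thresholds, the sketch below computes a simplified Hopkins statistic (raw distances, no dimension exponent) on two synthetic data sets: uniformly scattered points and three tight blobs. The function `hopkins`, the sample size and the data are illustrative assumptions, separate from the `HopkinsStats` function above.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def hopkins(X, m, rng):
    """Simplified Hopkins statistic: about 0.5 for uniform data, near 1 for clustered data."""
    n, d = X.shape
    nbrs = NearestNeighbors(n_neighbors=2).fit(X)
    # u: uniform points in the bounding box -> distance to the nearest observation
    uniform_pts = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, d))
    u = nbrs.kneighbors(uniform_pts, n_neighbors=1)[0][:, 0]
    # w: sampled observations -> distance to the nearest OTHER observation
    idx = rng.choice(n, size=m, replace=False)
    w = nbrs.kneighbors(X[idx], n_neighbors=2)[0][:, 1]
    return u.sum() / (u.sum() + w.sum())

rng = np.random.default_rng(42)
uniform_data = rng.uniform(0, 1, size=(300, 2))
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
clustered_data = np.repeat(centers, 100, axis=0) + rng.normal(scale=0.05, size=(300, 2))

h_uniform = hopkins(uniform_data, 30, rng)
h_clustered = hopkins(clustered_data, 30, rng)
print(round(h_uniform, 2), round(h_clustered, 2))
```

Because the statistic depends on random draws, it is worth repeating it a few times and looking at the average rather than a single value.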

### K-Means Clustering

In [None]:
from sklearn.cluster import KMeans

For the K-Means algorithm we first have to find the number of clusters required for the analysis.

We start with the **sum of squared errors (SSE)** method to find the number of clusters.

In [None]:
sse = []
for k in range(2,15):
    kMean_model = KMeans(n_clusters=k, random_state= 0, max_iter=50)
    kMean_model.fit(trainDf_pca)
    sse.append([k,kMean_model.inertia_])

In [None]:
plt.plot(pd.DataFrame(sse)[0], pd.DataFrame(sse)[1]);
plt.ylabel("Sum Of Squared Error")
plt.xlabel("No Of Cluster")
plt.show()

Reading the number of clusters off the SSE curve above is the **elbow method**.<BR>
We will also use **silhouette analysis**.

In [None]:
from sklearn.metrics import silhouette_score
ss = []
for k in range(2, 15):
    kmeans = KMeans(n_clusters=k, random_state=0).fit(trainDf_pca)
    ss.append([k, silhouette_score(trainDf_pca, kmeans.labels_)])

plt.plot(pd.DataFrame(ss)[0], pd.DataFrame(ss)[1], '-gD')
plt.show()

The **sum of squared errors** keeps decreasing as clusters are added, with a significant drop in SSE until approximately 8 clusters, but 8 clusters would be too many to interpret usefully. <BR><BR>
**Silhouette analysis** shows that 5 clusters are a good choice for modelling.
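
Instead of reading the best value off the plot, the number of clusters with the highest silhouette score can be picked directly. The sketch below does this on synthetic blobs generated with `make_blobs`; the data and the range of `k` are assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups (NOT the country data)
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k, {k: round(s, 3) for k, s in scores.items()})
```

With the `ss` list built above, the equivalent would be `max(ss, key=lambda pair: pair[1])[0]`.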

In [None]:
kMeans_Clust5 = KMeans(n_clusters=5, max_iter=50, random_state=0)
kMeans_Clust5.fit(trainDf_pca)

In [None]:
print(kMeans_Clust5.labels_.shape)
kMeans_Clust5.labels_

In [None]:
kMean_DF = pd.concat([pd.Series(kMeans_Clust5.labels_, name="ClusterId"), countryDf, trainDf_pca], axis=1)
kMean_DF.head()

In [None]:
kMean_DF['ClusterId'].value_counts()

In [None]:
grpMean = kMean_DF.groupby(by=['ClusterId']).mean(numeric_only=True)  #skip the 'country' text column
grpMean = grpMean.reset_index()
grpMean

In [None]:
plt.figure(figsize=(12,5))
sns.scatterplot(data=kMean_DF, x='PC1', y='PC2', hue='ClusterId', style='ClusterId', legend="full")

All clusters are clearly separated except Cluster 4. <BR>
PC1 and PC2 separate the data points well, so we will pick some of the original features to analyse the clusters.<BR>
We will choose them based on their weights (loadings) in the linear combinations that define PC1 and PC2.
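
One way to make this choice less subjective is to rank the features by the absolute value of their loadings on each component. The sketch below does so on synthetic data with made-up column names `a`, `b` and `c`.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: 'a' and 'b' move together, 'c' is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + 0.1 * rng.normal(size=200),
    "c": rng.normal(size=200),
})

pca_toy = PCA(n_components=2).fit(StandardScaler().fit_transform(toy))
loadings = pd.DataFrame(pca_toy.components_.T, index=toy.columns, columns=["PC1", "PC2"])

# Features with the largest absolute weight dominate the component
top_pc1 = loadings["PC1"].abs().sort_values(ascending=False).head(2).index.tolist()
top_pc2 = loadings["PC2"].abs().idxmax()
print(loadings.round(2))
print(top_pc1, top_pc2)
```

Applied to `final_pcaDF`, the same ranking would use `final_pcaDF.set_index('Feature')['PC1'].abs().sort_values(ascending=False)`.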

In [None]:
final_pcaDF

In [None]:
plt.figure(figsize=(12,10))
plt.subplot(2,2,1)
sns.scatterplot(data=kMean_DF, x='gdpp', y='exports', hue='ClusterId', style='ClusterId', legend="full")
plt.subplot(2,2,2)
sns.scatterplot(data=kMean_DF, x='gdpp', y='life_expec', hue='ClusterId', style='ClusterId', legend="full")
plt.subplot(2,2,3)
sns.scatterplot(data=kMean_DF, x='gdpp', y='child_mort', hue='ClusterId', style='ClusterId', legend="full")
plt.subplot(2,2,4)
sns.scatterplot(data=kMean_DF, x='gdpp', y='income', hue='ClusterId', style='ClusterId', legend="full")

In [None]:
plt.figure(figsize=(15,10))
plt.subplot(3,3,1)
sns.barplot(data=grpMean, x='ClusterId', y='child_mort')
plt.subplot(3,3,2)
sns.barplot(data=grpMean, x='ClusterId', y='exports')
plt.subplot(3,3,3)
sns.barplot(data=grpMean, x='ClusterId', y='health')
plt.subplot(3,3,4)
sns.barplot(data=grpMean, x='ClusterId', y='imports')
plt.subplot(3,3,5)
sns.barplot(data=grpMean, x='ClusterId', y='income')
plt.subplot(3,3,6)
sns.barplot(data=grpMean, x='ClusterId', y='inflation')
plt.subplot(3,3,7)
sns.barplot(data=grpMean, x='ClusterId', y='life_expec')
plt.subplot(3,3,8)
sns.barplot(data=grpMean, x='ClusterId', y='total_fer')
plt.subplot(3,3,9)
sns.barplot(data=grpMean, x='ClusterId', y='gdpp')
plt.show()

Key features of Developed Countries are: High Income, High GDPP, Low Inflation, High Life Expectancy, High Health Spending, Low Child Mortality Rate, etc.  <BR>
In contrast, Under Developed Countries have: Low Income, Low GDPP, High Inflation, High Child Mortality Rate, Low Life Expectancy.

Based on the graphs above, the countries in **Cluster Id - 0** are underdeveloped countries. <BR>
The countries in **Cluster Id - 4** are doing better than those in **Cluster Id - 0**.<BR>
The countries in **Cluster Id - 1 & 2** look like developed countries.<BR><BR>

We will inspect each cluster, and for our analysis we will focus on **Cluster 0 & 4**.
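
The reading of the bar charts can be cross-checked by ranking the cluster means. The sketch below averages the ranks of three indicators on a made-up table; the rank-averaging rule and the toy numbers are assumptions for illustration, not the method used in this notebook.

```python
import pandas as pd

# Made-up toy table, NOT the country data
toy = pd.DataFrame({
    "ClusterId": [0, 0, 1, 1, 2, 2],
    "country": ["A", "B", "C", "D", "E", "F"],
    "child_mort": [110.0, 90.0, 20.0, 30.0, 5.0, 4.0],
    "income": [1500, 2500, 12000, 10000, 45000, 40000],
    "gdpp": [600, 900, 6000, 5000, 42000, 38000],
})
profile = toy.groupby("ClusterId").mean(numeric_only=True)

# Rank 1 = worst on each indicator (highest mortality, lowest income and GDPP)
need_rank = pd.DataFrame({
    "child_mort": profile["child_mort"].rank(ascending=False),
    "income": profile["income"].rank(ascending=True),
    "gdpp": profile["gdpp"].rank(ascending=True),
}).mean(axis=1)

neediest = need_rank.idxmin()
print(need_rank)
print(neediest)
```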

In [None]:
kMean_DF[(kMean_DF['ClusterId'] == 0)].describe()

In [None]:
#Under Developed Countries
kMean_DF[(kMean_DF['ClusterId'] == 0)].sort_values(by=['child_mort'], ascending=False).head(10)

In [None]:
#Under Developed Countries
kMean_DF[(kMean_DF['ClusterId'] == 0)].sort_values(by=['life_expec']).head(5)

In [None]:
#Under Developed Countries
kMean_DF[(kMean_DF['ClusterId'] == 0)].sort_values(by=['inflation'], ascending=False).head(5)

In [None]:
kMean_DF[(kMean_DF['ClusterId'] == 4)]

Based on the analysis above and key features such as **Child Mortality, GDPP, Inflation, Life Expectancy**, we can cherry-pick the countries that are in direst need of aid:<BR>
<ul>
    <li>**Haiti** : Very high child mortality rate (**208**) and very low life expectancy (**32**).</li>
    <li>**Sierra Leone** : Like Haiti, this country has a very high child mortality rate and low life expectancy. Its GDPP is very low (**399**) and its inflation is very high (**17.20**, higher than in 75% of the countries in that cluster).</li>
    <li>**Chad & Central African Republic** : Both countries show the same trend: high child mortality with low life expectancy and low GDPP. **Chad** also has a very high total fertility rate.</li>
    <li>**Nigeria** : This country, from cluster 4, has a very high inflation rate (**104**) and high child mortality (**130**).</li>
    <li>**Mali** : The total fertility rate is very high here (**6.55** children per woman), which may contribute to the high child mortality (**137**).</li>
    <li>**Niger** : With a very high total fertility rate and low GDPP, this country also faces a high child mortality problem.</li>
</ul>

### Hierarchical Clustering

In [None]:
from scipy.cluster.hierarchy import linkage
from scipy.cluster.hierarchy import dendrogram
from scipy.cluster.hierarchy import cut_tree

In [None]:
plt.figure(figsize=(18,6))
mergings = linkage(trainDf_pca, method = "complete", metric='euclidean')
dendrogram(mergings)
plt.show()

In [None]:
#Will use same cluster size as K-Mean Algo
clusterCut = pd.Series(cut_tree(mergings, n_clusters = 5).reshape(-1,), name="ClusterId")
hierar_DF = pd.concat([clusterCut, countryDf, trainDf_pca], axis=1)
hierar_DF.head()

In [None]:
hierar_DF['ClusterId'].value_counts()

In [None]:
grpHierarClust = hierar_DF.groupby(by=['ClusterId']).mean(numeric_only=True)  #skip the 'country' text column
grpHierarClust = grpHierarClust.reset_index()
grpHierarClust

**Cluster Id - 4** looks like a group of underdeveloped countries with high child mortality, a very high inflation rate, low GDPP and low health expenditure, but only 1 country is assigned to that group. <BR>
**Cluster Id - 0** is another group that shows a similar trend, with 38 countries assigned to it.
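
A single-member cluster is a typical outcome when complete linkage meets an extreme observation. The sketch below reproduces the effect on synthetic data: two compact groups plus one far-away point, cut into three clusters. All values are made up.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cut_tree

rng = np.random.default_rng(0)
blob_a = rng.normal(loc=[0, 0], scale=0.3, size=(20, 2))
blob_b = rng.normal(loc=[5, 5], scale=0.3, size=(20, 2))
outlier = np.array([[50.0, 50.0]])   # one extreme, made-up observation
X = np.vstack([blob_a, blob_b, outlier])

mergings_toy = linkage(X, method="complete", metric="euclidean")
labels = cut_tree(mergings_toy, n_clusters=3).reshape(-1)
sizes = np.bincount(labels)
print(sizes)   # the extreme point ends up alone in its own cluster
```

Checking `value_counts()` on the cluster labels, as done above, is therefore a useful sanity check before interpreting cluster means.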

In [None]:
plt.figure(figsize=(12,5))
sns.scatterplot(data=hierar_DF, x='PC1', y='PC2', hue='ClusterId', style='ClusterId', legend="full")

In [None]:
plt.figure(figsize=(12,10))
plt.subplot(2,2,1)
sns.scatterplot(data=hierar_DF, x='gdpp', y='exports', hue='ClusterId', style='ClusterId', legend="full")
plt.subplot(2,2,2)
sns.scatterplot(data=hierar_DF, x='gdpp', y='life_expec', hue='ClusterId', style='ClusterId', legend="full")
plt.subplot(2,2,3)
sns.scatterplot(data=hierar_DF, x='gdpp', y='child_mort', hue='ClusterId', style='ClusterId', legend="full")
plt.subplot(2,2,4)
sns.scatterplot(data=hierar_DF, x='gdpp', y='income', hue='ClusterId', style='ClusterId', legend="full")

In these plots, hierarchical clustering segregates the data more precisely than K-Means.

In [None]:
plt.figure(figsize=(17,10))
plt.subplot(3,3,1)
sns.barplot(data=grpHierarClust, x='ClusterId', y='child_mort')
plt.subplot(3,3,2)
sns.barplot(data=grpHierarClust, x='ClusterId', y='exports')
plt.subplot(3,3,3)
sns.barplot(data=grpHierarClust, x='ClusterId', y='health')

plt.subplot(3,3,4)
sns.barplot(data=grpHierarClust, x='ClusterId', y='imports')
plt.subplot(3,3,5)
sns.barplot(data=grpHierarClust, x='ClusterId', y='income')
plt.subplot(3,3,6)
sns.barplot(data=grpHierarClust, x='ClusterId', y='inflation')
plt.subplot(3,3,7)
sns.barplot(data=grpHierarClust, x='ClusterId', y='life_expec')
plt.subplot(3,3,8)
sns.barplot(data=grpHierarClust, x='ClusterId', y='total_fer')
plt.subplot(3,3,9)
sns.barplot(data=grpHierarClust, x='ClusterId', y='gdpp')
plt.show()


In [None]:
hierar_DF[(hierar_DF['ClusterId'] == 0)].describe()

In [None]:
hierar_DF[(hierar_DF['ClusterId'] == 4)]

In [None]:
hierar_DF[(hierar_DF['ClusterId'] == 0)].sort_values(by=['child_mort'], ascending=False).head(10)

In [None]:
hierar_DF[(hierar_DF['ClusterId'] == 0)].sort_values(by=['life_expec']).head(5)

In [None]:
hierar_DF[(hierar_DF['ClusterId'] == 0)].sort_values(by=['gdpp']).head(5)

### Conclusion

Both clustering algorithms generate similar clusters.<BR>
Both methods show that **Cluster 0 & Cluster 4** contain underdeveloped countries that face similar problems.
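
Agreement between two clusterings can also be quantified, keeping in mind that cluster ids are arbitrary and need not match between methods. The sketch below compares K-Means and complete-linkage labels on synthetic blobs with a cross-tabulation and the adjusted Rand index; the data is an assumption for illustration.

```python
import pandas as pd
from scipy.cluster.hierarchy import linkage, cut_tree
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic data with three well-separated groups (NOT the country data)
X, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)

km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
hc_labels = cut_tree(linkage(X, method="complete"), n_clusters=3).reshape(-1)

# Rows: K-Means ids, columns: hierarchical ids
print(pd.crosstab(km_labels, hc_labels, rownames=["kmeans"], colnames=["hierarchical"]))
ari = adjusted_rand_score(km_labels, hc_labels)
print(round(ari, 3))   # 1.0 means identical partitions, about 0 means chance-level agreement
```

For this notebook the equivalent call would be `adjusted_rand_score(kMean_DF['ClusterId'], hierar_DF['ClusterId'])`.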

We have finalized the following list of countries that are in the direst need of aid:
<ol>
    <li>**Haiti**</li>
    <li>**Sierra Leone**</li>
    <li>**Chad**</li>
    <li>**Central African Republic**</li>
    <li>**Nigeria**</li>
    <li>**Mali**</li>
    <li>**Niger**</li>
</ol>

In [None]:
kMean_DF.to_csv('kMean_Clustering.csv')