# Machine Learning
### Datasets Used:
> 2015 UNData: (2015 is the only year where all variables are reported, since the World Happiness Report started in 2012) 
> - Population, surface area and density  
> - Population in the capital city, urban and rural areas    
> - GDP and GDP per capita    
> - GVA by kind of economic activity  
> - Education at the primary, secondary and tertiary levels  
> - Employment by economic activity  
> - Water supply and sanitation coverage  
> - Internet usage  
> - Population growth, fertility, life expectancy and mortality
>
> Source: https://data.un.org/

> 2017 UNData:
> - Country Statistics - UNData  
>
> Source: https://www.kaggle.com/datasets/sudalairajkumar/undata-country-profiles

> World Happiness Report 2015 & 2017:  
Source: https://www.kaggle.com/datasets/unsdsn/world-happiness  

### Essential Libraries

Importing the essential Python Libraries.

> NumPy : Library for Numeric Computations in Python  
> Pandas : Library for Data Acquisition and Preparation  
> Matplotlib : Low-level library for Data Visualization  
> Seaborn : Higher-level library for Data Visualization  

In [9]:
# Basic Libraries
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt
%matplotlib inline
sb.set()

### Import Dataset
> Import dataset created after data processing & cleaning.

In [10]:
Data = pd.read_csv('Dataset.csv')
Data.head()

FileNotFoundError: [Errno 2] No such file or directory: 'Dataset.csv'

In this machine learning section, we predict the happiness score of a country, which is a continuous numerical value. Our choice of models follows from this.

We will be exploring `3` machine learning models.
<br>

For the supervised learning, we choose `ElasticNet Regression` and `Random Forest Regressor`.

For the unsupervised learning, we choose `KMeans Clustering (Centroid Based)`.


---
## 1.  ElasticNet Regression

>- ElasticNet combines Least Squares with the regularization penalties of both Lasso and Ridge Regression.
>- This lets us combine the strengths of Lasso and Ridge Regression in one model.
>- Cross-validation is used to tune the hyperparameters to make more accurate predictions.
>- Least Squares is the basic linear regression model.
>- Lasso (L1 Regularization) and Ridge Regression (L2 Regularization) are similar in that both introduce a small amount of bias in an attempt to reduce the variance.
>- The key difference is that Lasso Regression can shrink the coefficients of less important features all the way to zero, which helps with feature selection, whereas Ridge Regression only shrinks them towards zero.
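
As a minimal sketch of the last point, the example below fits Lasso and Ridge on synthetic data in which only the first two of five features influence the response. The data, the `alpha` values and all variable names are assumptions made for illustration; none of them come from the happiness dataset.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data (assumption): only features 0 and 1 influence the response
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

# Same regularization strength for both models (illustrative value)
lasso = Lasso(alpha=0.5).fit(X_demo, y_demo)
ridge = Ridge(alpha=0.5).fit(X_demo, y_demo)

# Lasso sets the irrelevant coefficients to exactly 0; Ridge keeps them small but non-zero
print("Lasso coefficients:", np.round(lasso.coef_, 3))
print("Ridge coefficients:", np.round(ridge.coef_, 3))
```

ElasticNet mixes the two penalties, so how many coefficients end up at exactly zero depends on the chosen `l1_ratio` and `alpha`.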

>Why did we choose this model? 
>- We have a small sample size of ~300.
>- Only a few features that are highly correlated were chosen as predictors.


>We will use the top 6 variables that are most highly correlated with the happiness 
score as the independent variables.

>Response Variable: `happiness.score`  
>Predictor Features: `Employment: Services (% of employed)`, `Life expectancy at birth (females, years)`,    
`Life expectancy at birth (males, years)`, `Education: Secondary gross enrol. ratio (male per 100 pop.)`,     
`Education: Secondary gross enrol. ratio (female per 100 pop.)`,     
`Individuals using the Internet (per 100 inhabitants)`  

In [None]:
y = pd.DataFrame(Data["happiness.score"])
X = pd.DataFrame(Data[["Employment: Services (% of employed)", 
                       "Life expectancy at birth (females, years)", 
                       "Life expectancy at birth (males, years)", 
                       "Education: Secondary gross enrol. ratio (male per 100 pop.)", 
                       "Education: Secondary gross enrol. ratio (female per 100 pop.)", 
                       "Individuals using the Internet (per 100 inhabitants)"]])
y

In [None]:
X

Split the dataset into train and test sets at an 80:20 ratio.

In [None]:
# Import the required function from sklearn
from sklearn.model_selection import train_test_split

# Split the Dataset into random Train and Test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2,random_state=0)



# Check the sample sizes
print("Train Set :", X_train.shape, y_train.shape)
print("Test Set  :", X_test.shape, y_test.shape)

### Basic Exploration
Basic statistical exploration and visualization on the Train Set.

In [None]:
X_train.describe()

In [None]:
y_train.describe()

In [None]:
# Draw the distribution of Response
f, axes = plt.subplots(1, 3, figsize=(15, 5))
sb.boxplot(data = y_train, orient = "h", ax = axes[0])
sb.histplot(data = y_train, ax = axes[1])
sb.violinplot(data = y_train, orient = "h", ax = axes[2])

In [None]:
# Draw the distributions of all Predictors
f, axes = plt.subplots(6, 3, figsize=(18, 25))

count = 0
for var in X_train:
    sb.boxplot(data = X_train[var], orient = "h", ax = axes[count,0])
    sb.histplot(data = X_train[var], ax = axes[count,1])
    sb.violinplot(data = X_train[var], orient = "h", ax = axes[count,2])
    count += 1

In [None]:
# Correlation between Response and the Predictors
train_df = pd.concat([y_train, X_train], axis = 1).reindex(y_train.index)

# Relationship between Response and the Predictors
sb.pairplot(data = train_df)

### ElasticNet Regression model

In [None]:
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import RepeatedKFold
from numpy import arange

# Recreate the same random 80:20 Train/Test split as above
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2,random_state=0)

# Repeated 10-fold cross-validation used to tune the hyperparameters
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=0)
# Candidate L1/L2 mixing ratios from 0.00 to 0.99 (arange excludes 1)
ratios = arange(0, 1, 0.01)
# Candidate regularization strengths
alphas = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 0.0, 1.0, 10.0, 100.0]

# ElasticNet Regression with cross-validated l1_ratio and alpha
elastic = ElasticNetCV(l1_ratio=ratios, alphas=alphas, cv=cv, n_jobs=-1,random_state=0)

# Pass the response as a 1-D array, as expected for a single response variable
elastic.fit(X_train, y_train.values.ravel())
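
After fitting, the hyperparameters selected by cross-validation can be read from the `alpha_` and `l1_ratio_` attributes of `ElasticNetCV`. The sketch below shows this on synthetic data with a deliberately small search grid; the data, the grid values and the `demo_` names are assumptions for illustration, not the values used in this notebook.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

# Synthetic regression data (assumption, for illustration only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 4))
y_demo = 2.0 * X_demo[:, 0] + X_demo[:, 1] + rng.normal(scale=0.5, size=120)

# A small search grid (assumption) to keep the example fast
demo_ratios = [0.1, 0.5, 0.9]
demo_alphas = [0.01, 0.1, 1.0]
demo_model = ElasticNetCV(l1_ratio=demo_ratios, alphas=demo_alphas, cv=5)
demo_model.fit(X_demo, y_demo)

# Hyperparameters selected by cross-validation
print("Chosen alpha    :", demo_model.alpha_)
print("Chosen l1_ratio :", demo_model.l1_ratio_)
```

The same two attributes on the fitted `elastic` model would show which combination was picked from `alphas` and `ratios`.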

In [None]:
print("Intercept: ", elastic.intercept_)
print("Coefficients: ")
pd.DataFrame(list(zip(X_train.columns, elastic.coef_)), columns = ["Predictors", "Coefficients"])

> Two of the predictors have coefficients of 0 because the L1 (Lasso) penalty in ElasticNet can shrink the coefficients of less important features all the way to 0. 

In [None]:
# Predict the happiness score from the Predictors
y_train_pred = elastic.predict(X_train)
y_test_pred = elastic.predict(X_test)

In [None]:
plt.figure(figsize = (16,4))
x_ax = range(len(X_test))
plt.scatter(x_ax, y_test, s=5, color="blue", label="original")
plt.plot(x_ax, y_test_pred, lw=0.8, color="red", label="predicted")
plt.xlabel('id')
plt.ylabel('happiness score')
plt.legend()
plt.show()


In [None]:
#Plot the measured and predicted result in a graph
fig, ax = plt.subplots()
ax.scatter(y_train, y_train_pred, edgecolors=(0, 0, 0),color = 'blue')
ax.plot([y.min(), y.max()], [y.min(), y.max()], 'k--', lw=4)
ax.set_xlabel('Measured')
ax.set_ylabel('Predicted')
plt.show()


fig, ax = plt.subplots()
ax.scatter(y_test, y_test_pred, edgecolors=(0, 0, 0), color = 'red')
ax.plot([y.min(), y.max()], [y.min(), y.max()], 'k--', lw=4)
ax.set_xlabel('Measured')
ax.set_ylabel('Predicted')
plt.show()

In [None]:
from sklearn.metrics import mean_squared_error

# Check the Goodness of Fit (on Train Data)
print("Goodness of Fit of Model \tTrain Dataset")
print("Explained Variance (R^2) \t:", elastic.score(X_train, y_train))
print("Mean Squared Error (MSE) \t:", mean_squared_error(y_train, y_train_pred))
print("Root Mean Squared Error (RMSE) \t:", np.sqrt(mean_squared_error (y_train, y_train_pred)))
print()

# Check the Goodness of Fit (on Test Data)
print("Goodness of Fit of Model \tTest Dataset")
print("Explained Variance (R^2) \t:", elastic.score(X_test, y_test))
print("Mean Squared Error (MSE) \t:", mean_squared_error(y_test, y_test_pred))
print("Root Mean Squared Error (RMSE) \t:", np.sqrt(mean_squared_error (y_test, y_test_pred)))
print()

>- The RMSE on the test dataset is still high, at 0.683.
>- The R^2, which measures the proportion of variance in the happiness score explained by the model, is only 59%.
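
To make the two metrics concrete, here is a small worked example that computes RMSE and R^2 by hand and compares them with scikit-learn. The four values in `y_true` and `y_pred` are made up for illustration and are unrelated to the results above.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Toy values (assumption, not taken from the happiness dataset)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 7.5, 9.5])

# RMSE: square root of the mean squared residual
mse_manual = np.mean((y_true - y_pred) ** 2)
rmse_manual = np.sqrt(mse_manual)

# R^2: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print("Manual  RMSE:", rmse_manual, " R^2:", r2_manual)
print("sklearn RMSE:", np.sqrt(mean_squared_error(y_true, y_pred)),
      " R^2:", r2_score(y_true, y_pred))
```

RMSE is expressed in the same units as the happiness score, while R^2 is unitless; the `score` method of scikit-learn regressors returns R^2.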


> <b>Pros of Elastic Net Regression :</b>
>- Combine the strengths of lasso and ridge regression into one.

><b>Cons of Elastic Net Regression :</b>
>-  Low Accuracy and High RMSE value
>-  Computationally more expensive than Lasso or Ridge regression.
 

## 2. Random Forest Regression

The Random Forest Regressor is a supervised learning algorithm that uses an ensemble learning method for regression. 

<b>Ensemble learning</b> is a technique that combines the predictions of multiple machine learning models to make a more accurate prediction than any single model.

This model uses bootstrap aggregation (bagging). Bootstrapping refers to random sampling with replacement from the data set. Each tree is trained on a different bootstrap sample, and averaging the trees' predictions reduces the variance of the final prediction.

<b>Steps: </b>
1. Construct a large number of decision trees at training time.
2. Each tree is created from a different sample of data and at each node, a different sample of features is selected for splitting. 
3. Each of the trees makes its own individual prediction. 
4. These predictions are then averaged to produce a single result. 
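
The steps above can be sketched by hand with individual decision trees. This is a simplified illustration on synthetic data, not how `RandomForestRegressor` is implemented internally; the data, the number of trees and `max_features = 2` are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption, for illustration only)
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(200, 4))
y_demo = np.sin(X_demo[:, 0]) + 0.5 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

n_trees = 25
trees = []
for i in range(n_trees):
    # Steps 1-2: draw a bootstrap sample (with replacement) for each tree
    idx = rng.integers(0, len(X_demo), size=len(X_demo))
    # max_features=2 makes each split consider a random subset of 2 features
    tree = DecisionTreeRegressor(max_features=2, random_state=i)
    tree.fit(X_demo[idx], y_demo[idx])
    trees.append(tree)

# Step 3: every tree makes its own prediction for the first 5 samples
per_tree = np.array([t.predict(X_demo[:5]) for t in trees])

# Step 4: the forest prediction is the average over all trees
forest_pred = per_tree.mean(axis=0)
print(forest_pred)
```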


In [None]:
df = pd.read_csv("Dataset.csv")
df.head()

### A. Model with All Independent Variables
#### Larger tree (max_depth = 20)

In [None]:
# Extract Response and Predictors
df = df.drop('country', axis = 1)
df_copy = pd.get_dummies(df)

y = df['happiness.score']
X = df_copy.drop('happiness.score', axis=1)

X_list = list(X.columns)

X = np.array(X)
y = np.array(y)

# Split the Dataset into Train and Test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)

In [None]:
# Import RandomForestRegressor model from Scikit-Learn
from sklearn.ensemble import RandomForestRegressor 
from sklearn.model_selection import train_test_split

#Fitting the data
regressor = RandomForestRegressor (n_estimators = 100, # n_estimators denote number of trees
                                   random_state = 0, # random_state parameter initialize the internal random number generator 
                                  max_depth = 20
                                  )
# Fit Random Forest on Train Data
regressor.fit(X_train,y_train)
                                   
# Predict the Response corresponding to Predictors
y_train_pred = regressor.predict (X_train)
y_test_pred = regressor.predict(X_test)

In [None]:
# Plot one trained Decision Tree from the forest
# Import tools needed for visualization
from sklearn.tree import export_graphviz
import graphviz

# Pull out one tree from the forest
tree = regressor.estimators_[5]

# Export the tree as a DOT-format string
tree = export_graphviz(tree,
                     feature_names = X_list, 
                     rounded = True, 
                     filled = True)

# Render the DOT string in the notebook
graphviz.Source(tree)

In [None]:
plt.figure(figsize = (16,4))
prediction, = plt.plot(y_test_pred, label = "prediction")
original, = plt.plot(y_test, label = 'original')
first_legend= plt.legend(handles=[prediction])
ax = plt.gca().add_artist(first_legend)
plt.xlabel('id')
plt.ylabel('happiness score')
plt.legend(handles=[original],loc=4)
plt.show()

In [11]:
y_test

NameError: name 'y_test' is not defined

In [None]:
#Error checking
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

# Check the Goodness of Fit (on Train Data)
print("Goodness of Fit of Model \tTrain Dataset")
print("Regression Accuracy (R^2) \t:", regressor.score(X_train, y_train))
print("Mean Squared Error (MSE) \t:", mean_squared_error (y_train, y_train_pred))
print("Root Mean Squared Error (RMSE) \t:", np.sqrt(mean_squared_error (y_train, y_train_pred)))
print()

# Check the Goodness of Fit (on Test Data)
print("Goodness of Fit of Model \tTest Dataset")
print("Regression Accuracy (R^2) \t:", regressor.score(X_test, y_test))
print("Mean Squared Error (MSE) \t:", mean_squared_error (y_test, y_test_pred))
print("Root Mean Squared Error (RMSE) \t:", np.sqrt(mean_squared_error (y_test, y_test_pred)))
print()


#### Smaller tree (max_depth = 3)

In [None]:
y = df['happiness.score']
X = df_copy.drop('happiness.score', axis=1)

X_list = list(X.columns)

X = np.array(X)
y = np.array(y)

# Split the Dataset into Train and Test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)
# Import RandomForestRegressor model from Scikit-Learn
from sklearn.ensemble import RandomForestRegressor 
from sklearn.model_selection import train_test_split

#Fitting the data
regressor = RandomForestRegressor (n_estimators = 100, # n_estimators denote number of trees
                                   random_state = 0, # random_state parameter initialize the internal random number generator 
                                  max_depth = 3)
# Fit Random Forest on Train Data
regressor.fit(X_train,y_train)
                                   
# Predict the Response corresponding to Predictors
y_train_pred = regressor.predict (X_train)
y_test_pred = regressor.predict(X_test)

In [None]:

# Extract the small tree
tree_small = regressor.estimators_[5]

# Export the small tree as a DOT-format string
tree_small = export_graphviz(tree_small, 
                             feature_names = X_list, 
                             rounded = True,  
                             filled = True)

# Render the DOT string in the notebook
import graphviz 
graphviz.Source(tree_small)

In [None]:
plt.figure(figsize = (16,4))
prediction, = plt.plot(y_test_pred, label = "prediction")
original, = plt.plot(y_test, label = 'original')
first_legend= plt.legend(handles=[prediction])
ax = plt.gca().add_artist(first_legend)
plt.xlabel('id')
plt.ylabel('happiness score')
plt.legend(handles=[original],loc=4)
plt.show()

In [None]:
# Check the Goodness of Fit (on Train Data)
print("Goodness of Fit of Model \tTrain Dataset")
print("Regression Accuracy (R^2) \t:", regressor.score(X_train, y_train))
print("Mean Squared Error (MSE) \t:", mean_squared_error (y_train, y_train_pred))
print("Root Mean Squared Error (RMSE) \t:", np.sqrt(mean_squared_error (y_train, y_train_pred)))
print()

# Check the Goodness of Fit (on Test Data)
print("Goodness of Fit of Model \tTest Dataset")
print("Regression Accuracy (R^2) \t:", regressor.score(X_test, y_test))
print("Mean Squared Error (MSE) \t:", mean_squared_error (y_test, y_test_pred))
print("Root Mean Squared Error (RMSE) \t:", np.sqrt(mean_squared_error (y_test, y_test_pred)))
print()


> From the two models above, we can see that, on this dataset, the larger max_depth gives a higher accuracy (R^2) and a lower RMSE.
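
A single train/test split can be noisy, so one way to check a depth comparison like this is k-fold cross-validation. The sketch below does so on synthetic data; the data, `n_estimators = 50` and the 5 folds are assumptions, and the resulting scores say nothing about the happiness dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic data (assumption, for illustration only)
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(150, 3))
y_demo = np.sin(X_demo[:, 0]) + 0.5 * X_demo[:, 1] + rng.normal(scale=0.2, size=150)

# Mean cross-validated R^2 for each candidate depth
cv_r2 = {}
for depth in [3, 20]:
    forest = RandomForestRegressor(n_estimators=50, max_depth=depth, random_state=0)
    scores = cross_val_score(forest, X_demo, y_demo, cv=5, scoring="r2")
    cv_r2[depth] = scores.mean()

for depth, score in cv_r2.items():
    print(f"max_depth = {depth:2d} -> mean CV R^2 = {score:.3f}")
```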

#### Variable Importances
>Feature importances are computed from the accumulated impurity decrease within each tree: the mean over all trees gives the importance, and the standard deviation shows how much it varies between trees.
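
As a sketch of that computation, the per-tree importances can be collected from `estimators_` and summarized with their mean and standard deviation. The example uses synthetic data in which feature 0 dominates; the data and variable names are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (assumption): feature 0 has the strongest effect
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = 4.0 * X_demo[:, 0] + X_demo[:, 1] + rng.normal(scale=0.1, size=200)

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# One row of impurity-based importances per tree
per_tree = np.array([tree.feature_importances_ for tree in forest.estimators_])
mean_importance = per_tree.mean(axis=0)   # matches forest.feature_importances_
std_importance = per_tree.std(axis=0)     # spread of each importance across trees

for i, (m, s) in enumerate(zip(mean_importance, std_importance)):
    print(f"feature {i}: importance = {m:.3f} +/- {s:.3f}")
```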

In [None]:
# Get numerical feature importances
importances = list(regressor.feature_importances_)

# List of tuples with variable and importance
feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(X_list, importances)]

# Sort the feature importances by most important first
feature_importances = sorted(feature_importances, key = lambda x: x[1], reverse = True)

# Print out the feature and importances 
[print('Variable: {:20} Importance: {}'.format(*pair)) for pair in feature_importances];

#### Variable Importances Visualizations

In [None]:
# Import matplotlib for plotting and use magic command for Jupyter Notebooks
import matplotlib.pyplot as plt
%matplotlib inline

# Set the style
plt.style.use('fivethirtyeight')

# list of x locations for plotting
x_values = list(range(len(importances)))

# Make a bar chart
plt.bar(x_values, importances, orientation = 'vertical')

# Tick labels for x axis
plt.xticks(x_values, X_list, rotation='vertical')

# Axis labels and title
plt.ylabel('Importance'); plt.xlabel('Variable'); plt.title('Variable Importances');

>- Only 14 variables are used for splitting. Among those, GDP per capita is the most important variable compared to the others. 
>- We will choose only the top 7 independent variables with importance >=0.02 in our new random forest model.

### B. Model with Most Important Features

In [None]:
df = df.drop('Region', axis = 1)

In [None]:
# Extract features and labels 
y = df['happiness.score']
X = df[['GDP per capita (current US$)', 
        'Life expectancy at birth (males, years)',
        'Life expectancy at birth (females, years)',
        'Education: Secondary gross enrol. ratio (female per 100 pop.)',
        'Population age distribution (0-14 years, %)',
        'Infant mortality rate (per 1000 live births', 
        'Urban population (% of total population)']]
              
X_list = list(X.columns)
X = np.array(X)
y = np.array(y)
#Split the dataset into training and test set 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)

#Fitting the data
regressor = RandomForestRegressor (n_estimators = 100, random_state = 0, max_depth = 20)
regressor.fit (X_train,y_train)

#Predicting the test set results 
y_train_pred = regressor.predict(X_train)
y_test_pred = regressor.predict(X_test)



#Error checking
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

# Check the Goodness of Fit (on Train Data)
print("Goodness of Fit of Model \tTrain Dataset")
print("Regression Accuracy (R^2) \t:", regressor.score(X_train, y_train))
print("Mean Squared Error (MSE) \t:", mean_squared_error (y_train, y_train_pred))
print("Root Mean Squared Error (RMSE) \t:", np.sqrt(mean_squared_error (y_train, y_train_pred)))
print()

# Check the Goodness of Fit (on Test Data)
print("Goodness of Fit of Model \tTest Dataset")
print("Regression Accuracy (R^2) \t:", regressor.score(X_test, y_test))
print("Mean Squared Error (MSE) \t:", mean_squared_error (y_test, y_test_pred))
print("Root Mean Squared Error (RMSE) \t:", np.sqrt(mean_squared_error (y_test, y_test_pred)))
print()



>The RMSE has decreased from 0.478 to 0.468 and the accuracy (R^2) has improved from 80.1% to 80.8%. This shows that removing insignificant variables can improve the splitting of the decision trees, even though the improvement is small.

### C. Model Based on Variables with Highest Correlation Values
>Our initial approach was to use the top 6 independent variables with the highest correlation with the happiness score. Here, we test whether this idea improves the accuracy of the random forest model. 

In [None]:
# Extract features and labels 
y = df['happiness.score']
X = df[['Employment: Agriculture (% of employed)',
       'Life expectancy at birth (females, years)',
       'Employment: Services (% of employed)',
       'Population age distribution (60+ years, %)',
       'Education: Primary gross enrol. ratio (female per 100 pop.)']]

                               
X_list = list(X.columns)
X = np.array(X)
y = np.array(y)
#Split the dataset into training and test set 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)

#Fitting the data
regressor = RandomForestRegressor (n_estimators = 100, random_state = 0, max_depth =20)
regressor.fit (X_train,y_train)


#Predicting the test set results 
y_train_pred = regressor.predict(X_train)
y_test_pred = regressor.predict(X_test)

#Error checking
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

# Check the Goodness of Fit (on Train Data)
print("Goodness of Fit of Model \tTrain Dataset")
print("Regression Accuracy (R^2) \t:", regressor.score(X_train, y_train))
print("Mean Squared Error (MSE) \t:", mean_squared_error (y_train, y_train_pred))
print("Root Mean Squared Error (RMSE) \t:", np.sqrt(mean_squared_error (y_train, y_train_pred)))
print()

# Check the Goodness of Fit (on Test Data)
print("Goodness of Fit of Model \tTest Dataset")
print("Regression Accuracy (R^2) \t:", regressor.score(X_test, y_test))
print("Mean Squared Error (MSE) \t:", mean_squared_error (y_test, y_test_pred))
print("Root Mean Squared Error (RMSE) \t:", np.sqrt(mean_squared_error (y_test, y_test_pred)))
print()

In [None]:
Accuracy= pd.read_excel ('Error checking.xlsx', sheet_name = 'Sheet4')
Accuracy.head()

In [None]:
error = pd.read_excel ('Error checking.xlsx', sheet_name = 'Sheet3')
error.head()

>Neither the RMSE nor the accuracy improved; both got worse. Hence, we can conclude that selecting independent variables based on the highest correlation values is not the best approach for the random forest model. 


><b>Pros of random forest regressor: </b>
>- Lower risk of overfitting due to the random sampling of data and features
>- Improved accuracy
>- Robust to outliers.

><b>Cons of random forest regressor: </b>
>- Slow training, since building a large number of trees requires a lot of computational power and resources

## 3. K Means Clustering (Centroid Based)

K-means is a centroid-based, or distance-based, algorithm. It calculates distances to assign each point to a cluster, and each cluster is associated with a centroid. Clustering lets the user understand how happy a country is at a glance from its group category.


In [12]:
# Import related libraries
from sklearn.cluster import KMeans
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn import metrics
from scipy.spatial.distance import cdist

In [13]:
#Drop non-float data
data = pd.read_csv('Dataset.csv')
data.drop("country", axis=1, inplace=True)
data.drop("Region", axis=1, inplace=True)
data.info()

FileNotFoundError: [Errno 2] No such file or directory: 'Dataset.csv'

### Scaling
>Scaling is necessary because K-means groups samples according to their distance from the cluster centroids, so variables on larger scales would otherwise dominate the distance.
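
As a small illustration, min-max scaling maps each column to the range 0 to 1 using (x - min) / (max - min). The toy frame below, with its `gdp` and `internet` columns, is an assumption for illustration only.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy frame (assumption): two columns on very different scales
demo = pd.DataFrame({"gdp": [500.0, 20000.0, 60000.0],
                     "internet": [10.0, 55.0, 95.0]})

# Manual min-max scaling: (x - min) / (max - min), column by column
manual = (demo - demo.min()) / (demo.max() - demo.min())

# MinMaxScaler applies the same formula to every column in one call
scaled = pd.DataFrame(MinMaxScaler().fit_transform(demo), columns=demo.columns)
print(scaled)
```

Calling `fit_transform` on the whole frame gives the same result as scaling one column at a time.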

In [None]:
#Scale all variables in dataset
scaler = MinMaxScaler()
for column in data:
    scaler.fit(data[[column]])
    data[column] = scaler.transform(data[[column]])

### Correlation

In [None]:
#Find variables most related to happiness
f = plt.figure(figsize=(30, 30))
sb.heatmap(data.corr(), vmin = -1, vmax = 1, linewidths = 1,
           annot = True, fmt = ".2f", annot_kws = {"size": 18}, cmap = "RdBu")

In [None]:
# Pick the 2 variables with the highest correlation to happiness.score for clustering, dropping the labelled happiness data
# to simulate unlabeled data and perform unsupervised learning
ml = pd.DataFrame(data[['Individuals using the Internet (per 100 inhabitants)', 'Education: Secondary gross enrol. ratio (female per 100 pop.)']])

In [None]:
# Plot the graph
plt.scatter(data['Individuals using the Internet (per 100 inhabitants)'], data['Education: Secondary gross enrol. ratio (female per 100 pop.)'] )
plt.xlabel('Individuals using the Internet (per 100 inhabitants)')
plt.ylabel('Secondary Education ratio (female per 100 pop.)')

### Finding K
> K must be chosen manually before performing K-means clustering; it decides how many clusters the algorithm splits the data into.

> There are primarily two variants of the elbow method: one based on distortion and one based on inertia.

In [None]:
distortions = []
inertias = []
mapping1 = {}
mapping2 = {}
K = range(1, 10)
 
for k in K:
    # Build and fit the model once for each value of k
    kmeanModel = KMeans(n_clusters=k).fit(ml)
 
    # Distortion: average Euclidean distance from each sample to its nearest centroid
    distortion = sum(np.min(cdist(ml, kmeanModel.cluster_centers_,
                                  'euclidean'), axis=1)) / ml.shape[0]
    distortions.append(distortion)
    inertias.append(kmeanModel.inertia_)
 
    mapping1[k] = distortion
    mapping2[k] = kmeanModel.inertia_

In [None]:
for key, val in mapping1.items():
    print(f'{key} : {val}')

In [None]:
#Elbow method using distortion
plt.plot(K, distortions, 'bx-')
plt.xlabel('Values of K')
plt.ylabel('Distortion')
plt.title('The Elbow Method using Distortion')
plt.show()

> Distortion is calculated here as the average Euclidean distance from each sample to the center of its cluster. Some definitions use the squared distance instead.

In [None]:
for key, val in mapping2.items():
    print(f'{key} : {val}')

In [None]:
#Elbow method using inertia
plt.plot(K, inertias, 'bx-')
plt.xlabel('Values of K')
plt.ylabel('Inertia')
plt.title('The Elbow Method using Inertia')
plt.show()

> Inertia is the sum of squared distances of samples to their closest cluster center.
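
Both quantities can be reproduced by hand from the fitted centroids. The sketch below does this for three synthetic blobs; the blob centers, the spread and the `_demo` names are assumptions, and `n_init = 10` is set explicitly only to keep the behaviour the same across scikit-learn versions.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

# Three synthetic, well-separated blobs in the unit square (assumption)
rng = np.random.default_rng(0)
blob_centers = ([0.2, 0.2], [0.5, 0.8], [0.8, 0.3])
points = np.vstack([rng.normal(loc=c, scale=0.05, size=(30, 2)) for c in blob_centers])

km_demo = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)

# Distance from every sample to its nearest centroid
nearest = cdist(points, km_demo.cluster_centers_, "euclidean").min(axis=1)

distortion_demo = nearest.mean()         # average (unsquared) distance
inertia_demo = np.sum(nearest ** 2)      # sum of squared distances

print("Distortion      :", distortion_demo)
print("Inertia (manual):", inertia_demo)
print("Inertia (KMeans):", km_demo.inertia_)
```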

>In both the inertia graph and the distortion graph, the point where the curve starts to turn linear is around K = 5.

>Therefore, using the elbow method, our K value is 5 this time round.

### Happiness Conversion

In [None]:
#Splits up happiness score based on k groups to test for cluster later on
og = list()
for score in data['happiness.score']:
    if score <= 0.20:
        og.append(0)
    elif score <= 0.4:
        og.append(1)
    elif score <= 0.6:
        og.append(2)
    elif score <= 0.8:
        og.append(3)
    else:
        og.append(4)
data['cluster'] = pd.Series(og)
data.head()
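
The same binning can be written without an explicit loop using `pd.cut`. This is an alternative sketch, not the code used above; the `demo_scores` values are made up to cover the bin edges.

```python
import numpy as np
import pandas as pd

# Toy scaled scores (assumption) that include the bin edges
demo_scores = pd.Series([0.0, 0.2, 0.21, 0.4, 0.6, 0.79, 0.8, 1.0])

# Right-inclusive bins reproduce the "<=" comparisons of the if/elif chain
bins = [-np.inf, 0.2, 0.4, 0.6, 0.8, np.inf]
demo_groups = pd.cut(demo_scores, bins=bins, labels=False)

print(demo_groups.tolist())
```

With `labels=False`, `pd.cut` returns the integer index of the bin, which plays the same role as the values appended to `og`.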

### Plotting

In [None]:
km = KMeans(n_clusters = 5)
predict = km.fit_predict(ml)
ml['group'] = predict
center = km.cluster_centers_
print(center)
# Print the fitted centroids; they are used in the next block to relabel the clusters,
# as the cluster numbering is arbitrary each time the algorithm is run

In [None]:
# Relabel the clusters so that group 0 is always the lowest tier
newgroup = list()
newcluster = list()
# Assign each cluster a tier based on the first coordinate of its centroid
for i in range(len(center)):
    if center[i][0] <= 0.20:
        newgroup.append(0)
    elif center[i][0] <= 0.4:
        newgroup.append(1)
    elif center[i][0] <= 0.6:
        newgroup.append(2)
    elif center[i][0] <= 0.8:
        newgroup.append(3)
    else:
        newgroup.append(4)
print(newgroup)
# Map every sample's original cluster number to its new tier
for cluster in ml['group']:
    newcluster.append(newgroup[cluster])
ml['new'] = pd.Series(newcluster)
ml.head()
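
Fixed thresholds can map two centroids to the same tier if they fall in the same interval. A different technique, shown here only as a sketch, is to rank the centroids by their first coordinate so that every cluster gets a distinct tier. The `demo_centers` and `demo_labels` values are hypothetical.

```python
import numpy as np

# Hypothetical centroids (first column: scaled internet usage) and cluster labels
demo_centers = np.array([[0.70, 0.80],
                         [0.10, 0.30],
                         [0.45, 0.60]])
demo_labels = np.array([0, 1, 2, 1, 0])

# Cluster ids ordered from lowest to highest first coordinate
order = np.argsort(demo_centers[:, 0])

# rank[c] is the tier of original cluster c (0 = lowest)
rank = np.empty_like(order)
rank[order] = np.arange(len(order))

# Replace each original label by its tier
relabelled = rank[demo_labels]
print(relabelled)
```

This ranking differs from the threshold approach above, so the resulting tiers are not directly comparable with the fixed 0.2-wide happiness bins.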

In [None]:
# Ensure the data is sorted according to group
ml1 =  ml[ml.new == 4]
ml2 =  ml[ml.new == 3]
ml3 =  ml[ml.new == 2]
ml4 =  ml[ml.new == 1]
ml5 =  ml[ml.new == 0]

In [None]:
# Color each group and plot the centroids for visualization
plt.scatter(ml1['Individuals using the Internet (per 100 inhabitants)'], ml1['Education: Secondary gross enrol. ratio (female per 100 pop.)'], color ='green')
plt.scatter(ml2['Individuals using the Internet (per 100 inhabitants)'], ml2['Education: Secondary gross enrol. ratio (female per 100 pop.)'], color ='yellow')
plt.scatter(ml3['Individuals using the Internet (per 100 inhabitants)'], ml3['Education: Secondary gross enrol. ratio (female per 100 pop.)'], color ='orange')
plt.scatter(ml4['Individuals using the Internet (per 100 inhabitants)'], ml4['Education: Secondary gross enrol. ratio (female per 100 pop.)'], color ='red')
plt.scatter(ml5['Individuals using the Internet (per 100 inhabitants)'], ml5['Education: Secondary gross enrol. ratio (female per 100 pop.)'], color ='blue')
plt.scatter(km.cluster_centers_[:,0],km.cluster_centers_[:,1], color = 'purple', marker='*', label= 'centroid')
plt.xlabel('Internet percentage')
plt.ylabel('Secondary Education ratio (female per 100 pop.)')

### Error checking

In [None]:
ml['cluster'] = pd.Series(og)

In [None]:
ml.head()

In [None]:
#Comparing previously dropped happiness cluster with predicted cluster results
wrong = 0
for index, row in ml.iterrows():
    if row['new'] != row['cluster']:
        wrong += 1
print("Total rightly predicted : " + str(len(ml)-wrong) + " out of " + str(len(ml)))
print("Total accuracy is : " + str((len(ml)-wrong)/len(ml)*100))
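
The comparison above depends on how cluster numbers are matched to happiness groups. A label-independent alternative, shown here as a sketch and not used for the results in this notebook, is the adjusted Rand index from scikit-learn, which scores how similar two groupings are regardless of the numbering. The label lists below are made up.

```python
from sklearn.metrics import adjusted_rand_score

# Made-up groupings (assumption, for illustration only)
true_groups = [0, 0, 1, 1, 2, 2]
same_partition = [2, 2, 0, 0, 1, 1]   # same grouping, different cluster numbers
different = [0, 1, 0, 1, 0, 1]        # splits every true group in two

ari_same = adjusted_rand_score(true_groups, same_partition)
ari_diff = adjusted_rand_score(true_groups, different)

print("Same grouping, renamed labels:", ari_same)
print("Different grouping           :", ari_diff)
```

A value of 1 means the two groupings are identical up to renaming, while values near or below 0 indicate agreement no better than chance.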

> <b>Pros of KMeans Clustering :</b>
>- Relatively simple to implement. This is especially important for new analysts such as ourselves, who need to understand how the algorithm works in order to adapt it to our needs
>- Ease of interpretation and visualization

><b>Cons of KMeans Clustering :</b>
>-  No automatic choice of the optimal number of clusters; K must be chosen manually, as seen with the distortion and inertia methods
>- Sensitive to outliers, as centroids can be dragged towards outliers instead of ignoring them, so outliers need to be removed before clustering
>-  Inconsistent results can occur because the K-Means algorithm uses random centroid initialization to develop the clusters

### Conclusion
> K-means clustering is clearly the weakest approach here, as it specializes in unsupervised learning with unlabeled data. In this case, however, the happiness score is provided, so we are not limited to unsupervised learning. This shows in the K-means result, which has a strikingly low accuracy this time round.

> Moving on to the other 2 supervised learning models, if we compare the random forest with the ElasticNet regression model, the accuracy of the random forest (with most important variables) is much higher (59.3% to 80.4%) and has a significantly lower RMSE value (0.543 to 0.469). 

>Hence, the random forest regressor is a better model than ElasticNet Regression for predicting the happiness score in this dataset.