# Project Overview

Many studies have confirmed that GDP and the amount of waste generated are closely linked. I therefore build two models, one with linear regression and one with KNN regression, to predict a country's total waste from its GDP. Finally, MAE (mean absolute error) is used to compare the models' accuracy and decide which model produces the better result.
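
Since MAE is the yardstick for the whole comparison, here is a minimal sketch of what it computes: the average absolute difference between observed and predicted values. The function name and the numbers are made up for illustration and do not come from the datasets.

```python
import numpy as np

def mean_absolute_error_manual(y_true, y_pred):
    """Average absolute difference between observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

# Hypothetical waste figures (tons), for illustration only.
observed = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 330.0]

# Absolute errors are 10, 10 and 30, so the result is their mean.
mae_example = mean_absolute_error_manual(observed, predicted)
print(mae_example)
```

Because MAE is expressed in the same unit as the target (tons of waste here), it is easy to interpret, but it is also dominated by the largest countries.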


# Data Source

I have two tables of data, both downloaded from the World Bank data catalog. The first contains mainly waste composition, total solid waste, and GDP at country level. The second records each country's GDP in 2019/2020 (the 2019 column is used below). I had to find a second dataset for GDP because the GDP values in the first table are somewhat inaccurate. That said, this analysis may yield unexpected results because the data sources are inconsistent. In this project I therefore emphasize the application of basic machine learning models rather than trying to build the most accurate model possible.



# Data Cleaning


In [105]:
import pandas as pd
w = pd.read_csv(r'C:\Users\user\Desktop\project\GDP and Waste\country_level_data_0.csv')
g = pd.read_csv(r'C:\Users\user\Desktop\project\GDP and Waste\API_NY.GDP.MKTP.CD_DS2_en_csv_v2_2593330.csv')
print(w.info())
print(g.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 217 entries, 0 to 216
Data columns (total 51 columns):
 #   Column                                                                                 Non-Null Count  Dtype  
---  ------                                                                                 --------------  -----  
 0   iso3c                                                                                  217 non-null    object 
 1   region_id                                                                              217 non-null    object 
 2   country_name                                                                           216 non-null    object 
 3   income_id                                                                              217 non-null    object 
 4   gdp                                                                                    216 non-null    float64
 5   composition_food_organic_waste_percent                                        


After a quick look at both tables, I select the columns I need and merge the two dataframes.


In [255]:
w1 = w[['country_name', 'income_id', 'total_msw_total_msw_generated_tons_year']].rename(columns={'country_name':'Country Name'})
g1= g[['Country Name', '2019']]
df = w1.merge(g1, on='Country Name', how='inner')
df.info()
df = df.rename(columns={'total_msw_total_msw_generated_tons_year':'waste', '2019':'gdp'})
df_new = df.dropna(subset=['waste', 'gdp'])
df_new.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 212 entries, 0 to 211
Data columns (total 4 columns):
 #   Column                                   Non-Null Count  Dtype  
---  ------                                   --------------  -----  
 0   Country Name                             212 non-null    object 
 1   income_id                                212 non-null    object 
 2   total_msw_total_msw_generated_tons_year  210 non-null    float64
 3   2019                                     195 non-null    float64
dtypes: float64(2), object(2)
memory usage: 6.6+ KB
<class 'pandas.core.frame.DataFrame'>
Int64Index: 194 entries, 1 to 211
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Country Name  194 non-null    object 
 1   income_id     194 non-null    object 
 2   waste         194 non-null    float64
 3   gdp           194 non-null    float64
dtypes: float64(2), object(2)
memory usage: 6.1+ KB
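
The inner merge leaves 212 rows out of the 217 in the waste table, and matching on country names is fragile because the two sources may spell a name differently. A minimal sketch of one way to see which names fail to match, using an outer merge with `indicator=True`. The toy tables and country names below are invented; with the real data one would merge `w1` and `g1` instead.

```python
import pandas as pd

# Toy stand-ins for the two tables; names and values are made up.
waste_toy = pd.DataFrame({
    'Country Name': ['Aland', 'Borduria', 'Syldavia'],
    'waste': [1.0, 2.0, 3.0],
})
gdp_toy = pd.DataFrame({
    'Country Name': ['Aland', 'Borduria', 'Syldavia, Rep.'],
    'gdp': [10.0, 20.0, 30.0],
})

# An outer merge with indicator=True adds a '_merge' column that says
# whether each row was found in the left table, the right table, or both.
check = waste_toy.merge(gdp_toy, on='Country Name', how='outer', indicator=True)

# Rows that are not 'both' are the names that failed to match.
unmatched = check[check['_merge'] != 'both']
print(unmatched[['Country Name', '_merge']])
```

If the GDP table also carries an ISO country code column (an assumption, since that column is not shown above), merging on the code instead of the name would likely be more robust.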


In [257]:
df_new.head()

   Country Name          income_id        waste             gdp
1  Afghanistan           LIC         5628525.37    19291100000.0
2  Angola                LMC         4213643.585   89417190000.0
3  Albania               UMC         1087446.75    15286610000.0
4  Andorra               HIC           43000.0      3155065000.0
5  United Arab Emirates  HIC         5617682.0    421142000000.0



Now I have a decent dataframe. Let's make a scatter plot to see the distribution of the two variables.


In [176]:
import plotly.express as px
fig = px.scatter(df_new, 
                 x='gdp',
                 y='waste', 
                 color='income_id',
                 labels=dict(gdp='GDP (US$)', waste='Total Solid Waste (Ton)', income_id='Income Level'),
                 hover_name='Country Name')
fig.show()



I make the graph a little different by passing 'income_id' to the argument "color". Though it has nothing to do with the model building in this project, it is still fascinating to find that the relationship looks nearly linear for the countries within each income level.
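
One way to put a number on that visual impression is the Pearson correlation between GDP and waste within each income group. This is a minimal sketch on made-up data, not a result from the notebook; with the real data one would group `df_new` instead of the toy frame.

```python
import pandas as pd

# Toy data: two income groups with three hypothetical countries each.
toy = pd.DataFrame({
    'income_id': ['HIC', 'HIC', 'HIC', 'LIC', 'LIC', 'LIC'],
    'gdp':   [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
    'waste': [2.0, 4.0, 6.0, 3.0, 1.0, 2.0],
})

# Pearson correlation of gdp and waste, computed separately per group.
corr_by_income = {
    name: group['gdp'].corr(group['waste'])
    for name, group in toy.groupby('income_id')
}
print(corr_by_income)
```

A value close to 1 within a group would support the "nearly linear" reading; a much lower value would suggest the pattern is mostly visual.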



In [271]:
# Treat the three countries with the most waste as outliers
out = df_new.sort_values('waste', ascending=False).iloc[0:3]
print(out)
# Keep their index labels and coordinates for annotating the plot
out_ind = out.index.tolist()
out_x = out['gdp'].tolist()
out_y = out['waste'].tolist()
print(out_ind)
print(out_x)
print(out_y)

      Country Name income_id        waste           gdp
36           China       UMC  395081376.0  1.427990e+13
198  United States       HIC  265224528.0  2.143320e+13
87           India       LMC  189750000.0  2.870500e+12
[36, 198, 87]
[14279900000000.0, 21433200000000.0, 2870500000000.0]
[395081376.0, 265224528.0, 189750000.0]


In [272]:
for i in range(len(out_ind)):
    fig.add_annotation(x=out_x[i],
                   y=out_y[i],
                   text='Outlier',
                   font=dict(size=16, color="red"),
                   showarrow=True,
                   arrowcolor="#636363",
                   arrowhead=1)
fig.show()

Visually, there are some outliers that may significantly affect model accuracy. The next step is therefore to remove them.
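
Picking the three largest values is a manual choice. A rule-based alternative (not the method used in this notebook) is the 1.5 × IQR fence, sketched here on made-up numbers. On strongly skewed data such as country-level waste, this rule may flag many more than three points, so it is shown only as a point of comparison.

```python
import pandas as pd

# Hypothetical waste values; the last one is deliberately extreme.
waste_toy = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 100.0])

# Upper fence: third quartile plus 1.5 times the interquartile range.
q1 = waste_toy.quantile(0.25)
q3 = waste_toy.quantile(0.75)
upper_fence = q3 + 1.5 * (q3 - q1)

# Index labels of the values above the fence.
flagged = waste_toy[waste_toy > upper_fence].index.tolist()
print(upper_fence, flagged)
```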


In [273]:
df_f = df_new.drop(index=out_ind)
fig_f = px.scatter(df_f, 
                 x='gdp',
                 y='waste', 
                 color='income_id',
                 labels=dict(gdp='GDP (US$)', waste='Total Solid Waste (Ton)', income_id='Income Level'),
                 hover_name='Country Name')
fig_f.show()


It looks better! It's time to train and test the model.


# Model Building and Validating

### Linear Regression Model

In [261]:
import numpy as np
from sklearn.model_selection import train_test_split
import sklearn.linear_model
x, y = np.c_[df_f['gdp']], np.c_[df_f['waste']]
train_x, test_x, train_y, test_y = train_test_split(x, y, random_state=42) 
linear_model = sklearn.linear_model.LinearRegression()
linear_model.fit(train_x, train_y)

LinearRegression()
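
After fitting, the slope and intercept can be read from the `coef_` and `intercept_` attributes of the model. A minimal sketch on made-up points that lie exactly on y = 2x + 1; in this notebook the slope would be interpreted as tons of waste per US$ of GDP.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy points on the line y = 2x + 1 (illustration only).
x_toy = np.array([[1.0], [2.0], [3.0]])
y_toy = np.array([3.0, 5.0, 7.0])

toy_model = LinearRegression().fit(x_toy, y_toy)

# With a 1-D target, coef_ has one entry per feature.
# (With a 2-D target, as in the cell above, coef_ is 2-D instead.)
slope = float(toy_model.coef_[0])
intercept = float(toy_model.intercept_)
print(slope, intercept)
```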

In [262]:
import plotly.graph_objects as go

# Evenly spaced GDP values across the training range, for drawing the fitted line
x_range = np.linspace(train_x.min(), train_x.max(), num=100)
y_r_range = linear_model.predict(x_range.reshape(-1, 1))

# Flatten the (n, 1) arrays to 1-D for plotting
plot_train_x = train_x.ravel()
plot_train_y = train_y.ravel()
plot_test_x = test_x.ravel()
plot_test_y = test_y.ravel()

fig_model = go.Figure([
    go.Scatter(x=plot_train_x, y=plot_train_y, name='Train Data', mode='markers'),
    go.Scatter(x=plot_test_x, y=plot_test_y, name='Test Data', mode='markers'),
    go.Scatter(x=x_range, y=y_r_range.ravel(), name='Linear Regression Model')])
fig_model.update_xaxes(title_text='GDP (US$)')
fig_model.update_yaxes(title_text='Total Solid Waste (Ton)')
fig_model.show()

In [263]:
from sklearn.metrics import mean_absolute_error
pred_y = linear_model.predict(test_x)
mae_linear = mean_absolute_error(test_y, pred_y)
mae_linear

4147392.1371088163

### KNN Regression Model

In [264]:
import sklearn.neighbors

# Try k = 1..49 and record the test-set MAE for each value
mae_dict = {}
for i in range(1, 50):
    model = sklearn.neighbors.KNeighborsRegressor(n_neighbors=i)
    model.fit(train_x, train_y)
    pred_y = model.predict(test_x)
    mae_dict[i] = mean_absolute_error(test_y, pred_y)

df_dict = pd.DataFrame(list(mae_dict.items()), columns=['k', 'mae'])
print(df_dict[df_dict.mae == df_dict.mae.min()])

# Refit with k = 27, the value with the lowest MAE in the search above
knn_model = sklearn.neighbors.KNeighborsRegressor(n_neighbors=27)
knn_model.fit(train_x, train_y)

# Add the KNN prediction curve to the existing figure
y_k_range = knn_model.predict(x_range.reshape(-1, 1))
fig_model.add_trace(go.Scatter(x=x_range.ravel(), y=y_k_range.ravel(), name='KNN Regression Model'))


     k           mae
26  27  3.317104e+06


In [265]:
pred_y_knn = knn_model.predict(test_x)
mae_knn = mean_absolute_error(test_y, pred_y_knn)
mae_knn

3317104.4938168353
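
Note that k = 27 was chosen by minimizing MAE on the test set, so the KNN score above is likely optimistic compared with the linear model, which received no such tuning. A sketch of one alternative, under stated assumptions: choose k by cross-validation on the training data only with `GridSearchCV`, then evaluate once on the test set. The data here is synthetic and the grid of k values is arbitrary.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data: a noisy linear relationship.
rng = np.random.default_rng(0)
x_toy = rng.uniform(0, 10, size=(120, 1))
y_toy = 3.0 * x_toy.ravel() + rng.normal(0, 1, size=120)

x_tr, x_te, y_tr, y_te = train_test_split(x_toy, y_toy, random_state=42)

# 5-fold cross-validation over k, using only the training split.
# scikit-learn maximizes scores, hence the negated MAE.
search = GridSearchCV(
    KNeighborsRegressor(),
    param_grid={'n_neighbors': list(range(1, 21))},
    scoring='neg_mean_absolute_error',
    cv=5,
)
search.fit(x_tr, y_tr)

# The test set is touched once, after k has been fixed.
best_k = search.best_params_['n_neighbors']
test_mae = mean_absolute_error(y_te, search.predict(x_te))
print(best_k, test_mae)
```

With this setup the test MAE is no longer part of the model selection, which makes the comparison with linear regression fairer.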

# Conclusion

In [267]:
print('mae_linear: {}'.format(mae_linear))
print('mae_knn: {}'.format(mae_knn))

mae_linear: 4147392.1371088163
mae_knn: 3317104.4938168353


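On the numbers alone, KNN wins: its MAE is lower than that of linear regression. Yet the visualization of the models shows that KNN performs poorly when GDP is high. A likely reason is that KNN predicts by averaging the nearest training points, so it cannot extrapolate, and there are few high-GDP countries to average over. I therefore conclude that KNN is the better model at lower GDP, while linear regression gives more accurate results at higher GDP.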
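
This conclusion rests on reading the plot. It could be checked numerically by computing MAE separately below and above a GDP threshold. A minimal sketch with made-up arrays; the helper `mae_by_segment` and the threshold are hypothetical and not part of the notebook.

```python
import numpy as np

def mae_by_segment(x, y_true, y_pred, threshold):
    """MAE computed separately for x below the threshold and x at or above it."""
    x = np.asarray(x, dtype=float).ravel()
    err = np.abs(np.asarray(y_true, dtype=float).ravel()
                 - np.asarray(y_pred, dtype=float).ravel())
    low = x < threshold
    return {
        'low': float(err[low].mean()) if low.any() else float('nan'),
        'high': float(err[~low].mean()) if (~low).any() else float('nan'),
    }

# Made-up values: predictions are close at low GDP and far off at high GDP.
gdp_toy = [1.0, 2.0, 8.0, 9.0]
waste_true = [10.0, 20.0, 80.0, 90.0]
waste_pred = [11.0, 19.0, 60.0, 70.0]

segments = mae_by_segment(gdp_toy, waste_true, waste_pred, threshold=5.0)
print(segments)
```

With the notebook's variables this would be called as `mae_by_segment(test_x, test_y, pred_y_knn, threshold)` for KNN. For the linear model the predictions would need to be recomputed with `linear_model.predict(test_x)`, because `pred_y` is reassigned inside the KNN loop.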