# Simple scikit-learn model

Does money make people happier? A simple version with a single train/test split and cross-validation on the training data.

## Data

### Import data

In [1]:
import pandas as pd

# Load the data from GitHub
LINK = "https://raw.githubusercontent.com/kirenz/datasets/master/oecd_gdp.csv"
df = pd.read_csv(LINK)

### Data structure

In [2]:
df

Unnamed: 0,Country,GDP per capita,Life satisfaction
0,Russia,9054.914,6.0
1,Turkey,9437.372,5.6
2,Hungary,12239.894,4.9
3,Poland,12495.334,5.8
4,Slovak Republic,15991.736,6.1
5,Estonia,17288.083,5.6
6,Greece,18064.288,4.8
7,Portugal,19121.592,5.1
8,Slovenia,20732.482,5.7
9,Spain,25864.721,6.5


In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29 entries, 0 to 28
Data columns (total 3 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Country            29 non-null     object 
 1   GDP per capita     29 non-null     float64
 2   Life satisfaction  29 non-null     float64
dtypes: float64(2), object(1)
memory usage: 824.0+ bytes


### Data corrections

In [3]:
# Change column names (lower case and spaces to underscore)
df.columns = df.columns.str.lower().str.replace(' ', '_')

# show the first 5 rows
df.head()

Unnamed: 0,country,gdp_per_capita,life_satisfaction
0,Russia,9054.914,6.0
1,Turkey,9437.372,5.6
2,Hungary,12239.894,4.9
3,Poland,12495.334,5.8
4,Slovak Republic,15991.736,6.1
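The renaming step relies on pandas' vectorized string methods. As a minimal sketch, the same transformation can be reproduced on a plain Python list holding the original column names; the helper name `clean_name` is hypothetical and not part of pandas.

```python
# Original column names as listed in the df.info() output
columns = ["Country", "GDP per capita", "Life satisfaction"]

def clean_name(name: str) -> str:
    """Lower-case a column name and replace spaces with underscores."""
    return name.lower().replace(" ", "_")

cleaned = [clean_name(name) for name in columns]
print(cleaned)
```

Names without spaces are easier to use later, for example as encoding fields in a chart or as attribute-style column access.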


### Variable lists

Define the outcome variable and the feature matrix for later use:

In [4]:
# define outcome variable as y_label
y_label = 'life_satisfaction'

# select features
X = df[["gdp_per_capita"]]

# create response
y = df[y_label]

## Data splitting

Split the data into training and test sets:

In [5]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Investigate the data:

In [6]:
X_train.shape, y_train.shape

((23, 1), (23,))

In [7]:
X_train.head(2)

Unnamed: 0,gdp_per_capita
21,43724.031
0,9054.914


In [8]:
X_test.shape, y_test.shape

((6, 1), (6,))
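The shapes of the two sets follow from how a fractional `test_size` is turned into row counts: scikit-learn rounds the number of test rows up and assigns the remaining rows to the training set. The sketch below only reproduces that arithmetic for the 29 rows of this data set; the variable names are illustrative.

```python
import math

n_samples = 29   # rows reported by df.info()
test_size = 0.2  # fraction passed to train_test_split

# The test set size is rounded up; the remaining rows form the training set
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```

With so few observations, the test set holds only six countries, so any error measured on it will be noisy.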

We make a copy of the training data so that data exploration cannot alter the data used for modeling. The copy combines the feature and the response and is the basis for our exploratory data analysis.

In [9]:
# Copy the training features and add the response (joined on the row index)
df_train = X_train.copy()
df_train = df_train.join(y_train)

In [10]:
df_train.head(2)

Unnamed: 0,gdp_per_capita,life_satisfaction
21,43724.031,6.9
0,9054.914,6.0


### Data exploration

In [11]:
%matplotlib inline
import altair as alt

# Visualize the data
alt.Chart(df_train).mark_circle(size=100).encode(
    x='gdp_per_capita:Q',
    y='life_satisfaction:Q',
).interactive()

## Linear regression model

In [12]:
from sklearn.linear_model import LinearRegression

In [13]:
# Select a linear regression model
reg = LinearRegression()

### Training & validation

In [14]:
from sklearn.model_selection import cross_val_score

In [26]:
# Cross-validation with 5 folds; scikit-learn returns the negative MSE, so we flip the sign
scores = -cross_val_score(reg, X_train, y_train, cv=5, scoring='neg_mean_squared_error')

In [27]:
scores

array([0.33518676, 0.10178157, 0.0971665 , 0.25295494, 0.44389327])
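With `cv=5` and a regressor, `cross_val_score` uses unshuffled K-fold splitting: each fold serves once as the validation set while the model is fitted on the other four. The sketch below only reproduces the fold-size arithmetic for the 23 training rows; the function name `fold_sizes` is hypothetical.

```python
def fold_sizes(n_samples: int, n_splits: int) -> list[int]:
    """Validation-fold sizes: the first n_samples % n_splits folds get one extra row."""
    base, extra = divmod(n_samples, n_splits)
    return [base + 1 if i < extra else base for i in range(n_splits)]

sizes = fold_sizes(23, 5)
print(sizes)
```

Each score is therefore based on only four or five observations, which helps explain why the fold-wise errors differ so much.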

In [28]:
# Mean squared error over all folds
scores.mean()

0.24619660790356263

The standard deviation of the fold scores shows how much the error varies between folds:

In [23]:
# Standard deviation over all folds
scores.std()

0.13424944530888838
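As a cross-check, the two summary statistics can be recomputed from the printed fold scores using only the standard library. Note that NumPy's `.std()` defaults to the population standard deviation (`ddof=0`), which corresponds to `statistics.pstdev`; the rounded inputs mean the results agree with the notebook output only up to rounding.

```python
import statistics

# Fold-wise MSE values copied from the cross-validation output
scores = [0.33518676, 0.10178157, 0.0971665, 0.25295494, 0.44389327]

mean_mse = statistics.fmean(scores)
# Population standard deviation, matching NumPy's default ddof=0
std_mse = statistics.pstdev(scores)

print(round(mean_mse, 4), round(std_mse, 4))
```

The standard deviation is more than half the mean, so the cross-validated error estimate is rather uncertain for a data set of this size.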

Let's assume we are satisfied with the results and want to use this model. 

### Final training

Train the model on the complete training data (without cross-validation).

In [33]:
# Fit the model
reg.fit(X_train, y_train)

In [34]:
# Model intercept
reg.intercept_

4.867383809184242

In [30]:
# Model coefficient
reg.coef_

array([4.9184703e-05])
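The intercept and the coefficient define the fitted line: predicted life satisfaction = intercept + coefficient × GDP per capita. A minimal sketch that evaluates this line by hand, using the parameter values printed by the model; the function name `predict_life_satisfaction` is hypothetical.

```python
# Fitted parameters copied from the model output
intercept = 4.867383809184242
slope = 4.9184703e-05  # change in life satisfaction per unit of GDP per capita

def predict_life_satisfaction(gdp_per_capita: float) -> float:
    """Evaluate the fitted line: intercept + slope * gdp_per_capita."""
    return intercept + slope * gdp_per_capita

# GDP per capita of Russia, taken from the first row of the data
prediction = predict_life_satisfaction(9054.914)
print(round(prediction, 2))
```

Read this way, an increase of 10,000 in GDP per capita corresponds to roughly 0.49 additional points of predicted life satisfaction.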

Prediction with training data:

In [38]:
# Prediction for our training data
y_pred_train = reg.predict(X_train)

In [36]:
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error

In [40]:
# Mean squared error on the training data
mean_squared_error(y_train, y_pred_train)

0.20471900918125246

In [41]:
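# Root mean squared error (squared=False was removed in scikit-learn 1.6;
# newer versions provide sklearn.metrics.root_mean_squared_error instead)
mean_squared_error(y_train, y_pred_train, squared=False)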

0.4524588480527842

In [42]:
# Mean absolute error on the training data
mean_absolute_error(y_train, y_pred_train)

0.38519097158215426
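The three error metrics are closely related. The sketch below spells out their definitions in plain Python (the helper names `mse`, `rmse` and `mae` are hypothetical, and the toy values are made up) and checks that the square root of the training MSE reported by the model reproduces the reported RMSE.

```python
import math

def mse(y_true, y_pred):
    """Mean of the squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean of the absolute residuals."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Square root of the MSE, expressed in the units of the response."""
    return math.sqrt(mse(y_true, y_pred))

# Made-up toy values, not taken from the data set
y_true = [3.0, 5.0, 7.0]
y_hat = [2.0, 5.0, 9.0]
print(mse(y_true, y_hat), rmse(y_true, y_hat), mae(y_true, y_hat))

# Consistency check with the training MSE reported by the model
train_rmse = math.sqrt(0.20471900918125246)
print(train_rmse)
```

Because RMSE and MAE are in the units of the response, they are easier to interpret than the MSE: here the typical training error is roughly 0.4 to 0.45 points of life satisfaction.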

### Test error

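Evaluate the final model on the test data, which was used neither for training nor for cross-validation.

In [30]:
# Predict on the held-out test data with the model fitted above
y_pred = reg.predict(X_test)

In [33]:
# Mean squared error
mean_squared_error(y_test, y_pred)

In [34]:
# Root mean squared error (squared=False was removed in scikit-learn 1.6;
# newer versions provide sklearn.metrics.root_mean_squared_error instead)
mean_squared_error(y_test, y_pred, squared=False)

In [32]:
# Mean absolute error
mean_absolute_error(y_test, y_pred)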

## K-nearest neighbors model

In [40]:
from sklearn.neighbors import KNeighborsRegressor

In [46]:
reg2 = KNeighborsRegressor(n_neighbors=2)

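In [47]:
# Fit on the full data set (X, y), not only on the training split
reg2.fit(X, y)

In [48]:
# Predictions for the same data the model was fitted on
y_pred2 = reg2.predict(X)

In [49]:
# Prediction for a new observation; X_new is not defined in this notebook
# and must hold one gdp_per_capita value in the same format as X
reg2.predict(X_new)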

array([7.35])
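With `n_neighbors=2` and the default uniform weights, a k-nearest neighbors regressor predicts the average response of the two fitted observations whose feature value is closest to the query. The sketch below illustrates this idea in plain Python; it uses only the ten rows displayed at the top of the notebook and a hypothetical query value, so it does not reproduce the prediction of 7.35.

```python
# (gdp_per_capita, life_satisfaction) pairs from the ten rows displayed earlier
data = [
    (9054.914, 6.0), (9437.372, 5.6), (12239.894, 4.9), (12495.334, 5.8),
    (15991.736, 6.1), (17288.083, 5.6), (18064.288, 4.8), (19121.592, 5.1),
    (20732.482, 5.7), (25864.721, 6.5),
]

def knn_predict(query: float, points, k: int = 2) -> float:
    """Average the targets of the k points whose feature value is closest to query."""
    nearest = sorted(points, key=lambda point: abs(point[0] - query))[:k]
    return sum(target for _, target in nearest) / k

# Hypothetical query value, chosen only for illustration
prediction = knn_predict(20000.0, data)
print(prediction)
```

Unlike linear regression, this model has no coefficients: every prediction is a local average, which makes it flexible but prone to following noise when `k` is small.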

In [50]:
# Mean squared error on the fitted data (a training error, not a test error)
mean_squared_error(y, y_pred2)

0.06181034482758619

In [51]:
# Mean absolute error on the fitted data (a training error, not a test error)
mean_absolute_error(y, y_pred2)

0.20517241379310344