# Simple Machine Learning Example

### Using the same dataset as the Exploratory Data Analysis Example
### Dataset comes from Kaggle 
#### https://www.kaggle.com/mirichoi0218/insurance

A good worked example of EDA and regression on this dataset:
https://www.kaggle.com/hely333/eda-regression

##### Columns
**age:** age of primary beneficiary

**sex:** gender of the insurance contractor (female, male)

**bmi:** Body mass index, an objective index of body weight relative to height, computed as weight divided by height squared (kg / m ^ 2); it indicates whether weight is relatively high or low for a given height, ideally 18.5 to 24.9

**children:** Number of children covered by health insurance / Number of dependents

**smoker:** whether the beneficiary smokes (yes, no)

**region:** the beneficiary's residential area in the US (northeast, southeast, southwest, northwest)

**charges:** Individual medical costs billed by health insurance

##### For this example we would like to predict a beneficiary's charges based on multiple factors (age, sex, bmi, children, smoker, region)

**Dependent Variable** = Charges

**Independent Variables** = age, sex, bmi, children, smoker, region

Because **Charges** is a continuous variable, we will be performing a **multiple linear regression**. (See exploratory analysis.)
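
Written out, the model we are about to fit has the form below. This is a sketch in notation introduced here (the $\beta$, $\gamma_r$ and $\varepsilon$ symbols are not from the original notebook). The categorical columns enter as 0/1 indicator variables once they are encoded, and region contributes one indicator per non-baseline level $r$.

$$
\text{charges} = \beta_0 + \beta_1\,\text{age} + \beta_2\,\text{bmi} + \beta_3\,\text{children} + \beta_4\,\text{sex\_male} + \beta_5\,\text{smoker\_yes} + \sum_{r} \gamma_r\,\text{region}_r + \varepsilon
$$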

In [1]:
#Import libraries to load data and manipulate dataframes
import pandas as pd
import numpy as np
import requests
import io

In [26]:
#import model from scikit-learn
import sklearn
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import sklearn.metrics as metrics

In [5]:
# Downloading the csv file from your GitHub account

url = "https://raw.githubusercontent.com/maulcait/Python-Practice-Applications/main/insurance.csv" 
# Make sure the url is the raw version of the file on GitHub
download = requests.get(url).content

# Reading the downloaded content and turning it into a pandas dataframe

df = pd.read_csv(io.StringIO(download.decode('utf-8')))
df.head()

,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552
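
The `io.StringIO` wrapper is what lets `pd.read_csv` treat downloaded text as if it were a file. Here is a minimal offline sketch of that step, assuming the CSV text is already in memory. The names `csv_text` and `sample` are illustrative, and the three rows are copied from the output above.

```python
import io

import pandas as pd

# CSV text that would normally come from requests.get(url).content.decode('utf-8')
csv_text = """age,sex,bmi,children,smoker,region,charges
19,female,27.9,0,yes,southwest,16884.924
18,male,33.77,1,no,southeast,1725.5523
28,male,33.0,3,no,southeast,4449.462
"""

# StringIO exposes the string through a file-like interface that read_csv accepts
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)
print(sample.dtypes)
```

Recent pandas versions can also read a URL directly with `pd.read_csv(url)`, which would make the `requests` and `io` steps optional.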


In [18]:
# Linear regression requires all inputs to be numeric
# We need to convert the categorical sex, smoker, and region columns to dummy (0/1) variables
# pandas has a function to create these columns
# drop_first is set to True to avoid perfect multicollinearity (the dummy variable trap)
df_clean = pd.get_dummies(df, drop_first=True)

In [17]:
#display columns
df_clean.columns

Index(['age', 'bmi', 'children', 'charges', 'sex_male', 'smoker_yes',
       'region_northwest', 'region_southeast', 'region_southwest'],
      dtype='object')
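
To see what `drop_first=True` does, here is a small sketch on a made-up frame (the `toy` data is invented for illustration and is not taken from the insurance dataset). It compares the columns produced with and without dropping the first level.

```python
import pandas as pd

# Tiny made-up frame, not rows from the insurance dataset
toy = pd.DataFrame({
    "bmi": [22.0, 31.5, 27.3, 29.9],
    "smoker": ["yes", "no", "no", "yes"],
    "region": ["northeast", "southeast", "southwest", "northwest"],
})

# Without drop_first, every category level gets its own indicator column
full = pd.get_dummies(toy)

# With drop_first=True, the first (sorted) level of each categorical column is
# dropped and becomes the baseline that the other levels are compared against
reduced = pd.get_dummies(toy, drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

The dropped levels (`smoker_no` and `region_northeast` here) are implied when all remaining indicators for that column are 0, which is why keeping them would add redundant information.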

In [19]:
# Break out the data into independent and dependent dataframes
# We are going to be solving the equation
# y = mX + b
# where y is the dependent column,
# X is all the independent columns,
# m holds one coefficient per column of X, and b is the intercept
y = df['charges']
X = df_clean.drop('charges', axis=1)  

In [20]:
# We need to split the data into a training set and a test set
# Scoring the model on data it has not seen shows how well it generalizes and helps detect overfitting
# We will reserve about 1/3 of the data set (test_size=0.33) to test our predictive model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
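
As a quick illustration of how `train_test_split` behaves, the sketch below splits a made-up array of 100 rows (the `X_toy` and `y_toy` names and data are assumptions for the demo). It checks that the two parts do not overlap and that a fixed `random_state` reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 made-up rows with two features; y_toy doubles as a row identifier
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.arange(100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.33, random_state=42
)
# Repeating the call with the same random_state gives the same partition
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_toy, y_toy, test_size=0.33, random_state=42
)

print("train rows:", len(X_tr), "test rows:", len(X_te))
print("no overlap:", set(y_tr).isdisjoint(set(y_te)))
print("reproducible:", np.array_equal(X_te, X_te2))
```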

In [21]:
lr = LinearRegression()
lr.fit(X_train, y_train)

LinearRegression()

In [28]:
# Predict charges for the held-out test rows
y_pred = lr.predict(X_test)
print("The first five predictions {}".format(y_pred[:5]))
print("The real first five labels {}".format(y_test[:5]))

# Mean squared error between the actual and predicted charges
mse = metrics.mean_squared_error(y_test, y_pred)
print("Mean Squared Error {}".format(mse))

The first five predictions [ 8826.06227121  7070.49034864 37007.2387042   9438.74358115
 27105.41944988]
The real first five labels 764      9095.06825
887      5272.17580
890     29330.98315
1293     9301.89355
259     33750.29180
Name: charges, dtype: float64
Mean Squared Error 35090225.72562567


In [57]:
#To retrieve the intercept:
print('Intercept: ' ,lr.intercept_)


Intercept:  -12426.214137670124


In [55]:
coeff_df = pd.DataFrame(lr.coef_, X.columns, columns=['Coefficient'])
coeff_df

,Coefficient
age,261.568395
bmi,347.09729
children,371.762169
sex_male,121.123686
smoker_yes,23700.983287
region_northwest,-339.618396
region_southeast,-886.499581
region_southwest,-803.884788
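
The intercept and coefficients are all the model needs to make a prediction. As a rough sketch, the block below rebuilds one prediction by hand using the rounded values printed above and the first row of `df.head()`. The `row` encoding is written out manually here, so treat it as an illustration rather than the notebook's own code.

```python
# Intercept and coefficients copied from the output above (rounded as printed)
intercept = -12426.214137670124
coefficients = {
    "age": 261.568395,
    "bmi": 347.09729,
    "children": 371.762169,
    "sex_male": 121.123686,
    "smoker_yes": 23700.983287,
    "region_northwest": -339.618396,
    "region_southeast": -886.499581,
    "region_southwest": -803.884788,
}

# First row of df.head(): 19-year-old female smoker, BMI 27.9, no children, southwest
row = {
    "age": 19,
    "bmi": 27.9,
    "children": 0,
    "sex_male": 0,
    "smoker_yes": 1,
    "region_northwest": 0,
    "region_southeast": 0,
    "region_southwest": 1,
}

# prediction = intercept + sum(coefficient * feature value)
predicted = intercept + sum(coefficients[name] * row[name] for name in coefficients)
print(f"Predicted charges: {predicted:.2f} (actual charges in the data: 16884.92)")
```

The `smoker_yes` term dominates this prediction, which matches the size of its coefficient relative to the others.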


In [38]:
#Create new dataframe with the test data and predicted values
y_results = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred.flatten()})

In [58]:
y_results['error'] = y_results['Actual'] - y_results['Predicted']
y_results.head()

,Actual,Predicted,error
764,9095.06825,8826.062271,269.005979
887,5272.1758,7070.490349,-1798.314549
890,29330.98315,37007.238704,-7676.255554
1293,9301.89355,9438.743581,-136.850031
259,33750.2918,27105.41945,6644.87235


In [43]:
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

Mean Absolute Error: 4193.463021932145
Mean Squared Error: 35090225.72562567
Root Mean Squared Error: 5923.700340633857
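
For reference, the three metrics can be computed by hand. This sketch uses only the five actual/predicted pairs displayed earlier, so its numbers will differ from the full test-set values above; it is meant to show the formulas, not to reproduce the results.

```python
import numpy as np

# Actual and predicted charges for the five test rows shown in y_results.head()
actual = np.array([9095.06825, 5272.1758, 29330.98315, 9301.89355, 33750.2918])
predicted = np.array([8826.062271, 7070.490349, 37007.238704, 9438.743581, 27105.41945])

errors = actual - predicted

mae = np.mean(np.abs(errors))   # average size of the error, in charge units
mse = np.mean(errors ** 2)      # squaring penalizes large errors more heavily
rmse = np.sqrt(mse)             # back in charge units, comparable to MAE

print(f"MAE:  {mae:.2f}")
print(f"MSE:  {mse:.2f}")
print(f"RMSE: {rmse:.2f}")
```

RMSE is never smaller than MAE, and a large gap between them points to a few big misses, such as the errors of roughly 7,700 and 6,600 in this sample.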


In [64]:
# An R-squared of about 0.76 on the test set is a reasonable baseline given that we did no feature engineering
print("R-squared: ",lr.score(X_test,y_test))

R-squared:  0.7605492639270064


In [71]:
# Let's try to improve the model a bit by adding an interaction variable
# between smoking and BMI, based on what we saw in our exploratory data analysis.
# Work on explicit copies so pandas does not raise a SettingWithCopyWarning.
X_train = X_train.copy()
X_test = X_test.copy()
X_train['smoker_x_bmi'] = X_train['smoker_yes'] * X_train['bmi']
X_test['smoker_x_bmi'] = X_test['smoker_yes'] * X_test['bmi']


In [73]:
lr_2 = LinearRegression()
lr_2.fit(X_train, y_train)
# Including the interaction between smoking and BMI raises the test R-squared from about 0.76 to about 0.85
print("R-squared: ",lr_2.score(X_test,y_test))

R-squared:  0.8504045611520333


In [78]:
# The RMSE also drops (from about 5,924 to about 4,682), suggesting this model's predictions are more accurate
y_pred_w_interaction = lr_2.predict(X_test)
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred_w_interaction))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred_w_interaction))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred_w_interaction)))

Mean Absolute Error: 2777.2099644623736
Mean Squared Error: 21922412.111938484
Root Mean Squared Error: 4682.1375579897785
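
Because both models are scored on the same test set, the total sum of squares is identical for both, so the ratio of their unexplained variance (1 - R-squared) should equal the ratio of their MSEs. A small consistency check using the figures printed in this notebook (variable names are illustrative):

```python
# Values copied from the outputs above
r2_base, r2_interaction = 0.7605492639270064, 0.8504045611520333
mse_base, mse_interaction = 35090225.72562567, 21922412.111938484

# Share of unexplained variance left after adding the interaction term
unexplained_ratio = (1 - r2_interaction) / (1 - r2_base)
# The same comparison expressed through mean squared error
mse_ratio = mse_interaction / mse_base

print(f"unexplained-variance ratio: {unexplained_ratio:.6f}")
print(f"MSE ratio:                  {mse_ratio:.6f}")
```

Both ratios come out near 0.62, meaning the interaction term removes a bit over a third of the squared error left by the first model.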


In [75]:
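# The interaction term has a large positive coefficient: for smokers, each additional
# BMI unit adds roughly 1,495 to the predicted charges on top of the base bmi effect.
# The smoker_yes coefficient is now negative because it describes a smoker with a BMI of 0;
# at realistic BMI values the interaction term far outweighs it.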
coeff_df_with_interaction = pd.DataFrame(lr_2.coef_, X_test.columns, columns=['Coefficient'])
coeff_df_with_interaction

,Coefficient
age,267.226658
bmi,31.69175
children,426.844631
sex_male,-463.227177
smoker_yes,-21760.225224
region_northwest,-511.068193
region_southeast,-898.157575
region_southwest,-1065.973588
smoker_x_bmi,1494.94378
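
With an interaction term, the effect of smoking is no longer a single number: it depends on BMI. The sketch below uses the rounded coefficients printed above to compute the predicted smoker versus non-smoker difference at a few BMI values (the helper `smoking_effect` is a name made up for this illustration).

```python
# Coefficients copied from the table above (rounded as printed)
coef_bmi = 31.69175
coef_smoker = -21760.225224
coef_interaction = 1494.94378


def smoking_effect(bmi):
    """Predicted difference in charges between a smoker and a non-smoker at this BMI."""
    return coef_smoker + coef_interaction * bmi


# Change in predicted charges per extra BMI unit, by smoking status
bmi_slope_nonsmoker = coef_bmi
bmi_slope_smoker = coef_bmi + coef_interaction

# BMI at which the model predicts no difference between smokers and non-smokers
break_even_bmi = -coef_smoker / coef_interaction

for bmi in (20, 25, 30, 35):
    print(f"BMI {bmi}: smoking adds {smoking_effect(bmi):,.0f} to predicted charges")
print(f"BMI slope: {bmi_slope_nonsmoker:.2f} (non-smoker) vs {bmi_slope_smoker:.2f} (smoker)")
print(f"Break-even BMI: {break_even_bmi:.1f}")
```

The break-even point falls well below the healthy BMI range listed at the top, so for practically every beneficiary the model predicts higher charges for smokers, and increasingly so as BMI rises.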


##### We could continue to improve this model by trying other regression models, such as Decision Trees, Random Forest, or Gradient Boosting, and combining them into an ensemble model
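
As a starting point, here is a hedged sketch of that comparison. It runs on synthetic data from `make_regression` rather than the insurance data, and the model settings are arbitrary choices for illustration. `VotingRegressor` is one simple way to build an ensemble in scikit-learn (it averages the predictions of its members); it is not necessarily the best option for this dataset.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (
    GradientBoostingRegressor,
    RandomForestRegressor,
    VotingRegressor,
)
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; swap in X and y from the notebook to try it for real
X_syn, y_syn = make_regression(n_samples=300, n_features=6, noise=15.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.33, random_state=42)

base_models = [
    ("linear", LinearRegression()),
    ("random_forest", RandomForestRegressor(n_estimators=50, random_state=42)),
    ("gradient_boosting", GradientBoostingRegressor(random_state=42)),
]

# The ensemble averages the predictions of the three base models
candidates = dict(base_models)
candidates["ensemble"] = VotingRegressor(estimators=base_models)

scores = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    scores[name] = model.score(X_te, y_te)  # R-squared on the held-out rows
    print(f"{name}: R-squared = {scores[name]:.3f}")
```

On this synthetic data the target is linear by construction, so the plain linear model is expected to do well; on the insurance data, tree-based models may pick up interactions such as smoker x BMI without being told about them.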