<h4>Installing & Importing Libraries</h4>

 Installing Libraries

In [4]:
# !pip install -q datascience                   # Package that is required by pandas profiling
# !pip install -q pandas-profiling              # Library to generate basic statistics about data
# !pip install -q yellowbrick                   # Toolbox for Measuring Machine Learning Performance

Upgrading Libraries
- After upgrading the libraries, restart the runtime so that the upgraded versions are loaded.

- After restarting, do not run the install cell above (3.1) or the upgrade cell below (3.2) again.

In [5]:
# !pip install -q --upgrade pandas-profiling
# !pip install -q --upgrade yellowbrick
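
After restarting the runtime, one way to confirm that the upgrades took effect is to print the installed versions. This is a minimal sketch using the standard-library `importlib.metadata`; the package list is an assumption and can be adjusted.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as they appear on PyPI (assumed list; adjust as needed)
packages = ['pandas', 'scikit-learn', 'pandas-profiling', 'yellowbrick']

installed = {}
for name in packages:
    try:
        installed[name] = version(name)
    except PackageNotFoundError:
        installed[name] = None      # Package is not installed in this runtime

for name, ver in installed.items():
    print(name, ':', ver if ver is not None else 'not installed')
```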

Importing Libraries

In [6]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import metrics
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
%matplotlib inline

In [7]:
data = pd.read_csv('https://raw.githubusercontent.com/stedy/Machine-Learning-with-R-datasets/master/insurance.csv')
print('Data Shape:', data.shape)
data.head()

Data Shape: (1338, 7)


Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552


In [8]:
data.describe()

Unnamed: 0,age,bmi,children,charges
count,1338.0,1338.0,1338.0,1338.0
mean,39.207025,30.663397,1.094918,13270.422265
std,14.04996,6.098187,1.205493,12110.011237
min,18.0,15.96,0.0,1121.8739
25%,27.0,26.29625,0.0,4740.28715
50%,39.0,30.4,1.0,9382.033
75%,51.0,34.69375,2.0,16639.912515
max,64.0,53.13,5.0,63770.42801


<h4>Data Wrangling</h4>

In [9]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB


**Observation:**

- There are no missing values: all 7 columns have 1338 non-null entries.
- Every column has an appropriate data type.
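
The same check can be made explicit by counting missing values per column. The sketch below uses a small made-up frame (not the insurance data) so that the counts are non-zero and easy to read.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the insurance data (values are made up)
toy = pd.DataFrame({
    'age': [19, 18, 28, 33],
    'bmi': [27.9, np.nan, 33.0, 22.705],
    'smoker': ['yes', 'no', None, 'no'],
})

missing_per_column = toy.isna().sum()   # Count of missing values per column
print(missing_per_column)
print('Any missing values?:', toy.isna().any().any())
print(toy.dtypes)                       # Data type of each column
```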

In [10]:
print('Does data contain duplicate rows?:', data.duplicated().any(), data.duplicated().sum())

# Dropping duplicate rows
data.drop_duplicates(inplace=True)

print('Dropping duplicates Success!')

Does data contain duplicate rows?: True 1
Dropping duplicates Success!
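
Note that `drop_duplicates` keeps the original index labels, so the index has a gap where the duplicate row was. A minimal sketch on a made-up frame, with an optional `reset_index` step that is an addition here and not part of the cell above:

```python
import pandas as pd

# Toy frame in which row 2 repeats row 0 (values are made up)
toy = pd.DataFrame({
    'age': [19, 18, 19, 28],
    'bmi': [27.9, 33.77, 27.9, 33.0],
})

n_duplicates = toy.duplicated().sum()                    # Rows identical to an earlier row
deduped = toy.drop_duplicates().reset_index(drop=True)   # Keep first occurrence, renumber rows

print('Duplicate rows:', n_duplicates)
print(deduped)
```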


**Observation:**

- There are 3 categorical features: sex, smoker and region.
- Before encoding them, we need to check the cardinality of these features.

<h4>Feature Encoding</h4>

In [11]:
print('Cardinality of sex:', len(data['sex'].unique()))
print('Cardinality of smoker:', len(data['smoker'].unique()))
print('Cardinality of region:', len(data['region'].unique()))

Cardinality of sex: 2
Cardinality of smoker: 2
Cardinality of region: 4


**Observation:**
    
- The cardinality of all three features is low.
- So we can go ahead with dummification (one-hot encoding).
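
As an illustration on a made-up frame, one-hot encoding creates one indicator column per category, and the indicators of a feature always sum to 1. For linear models, passing `drop_first=True` is a common variant that removes this redundancy; it is shown here as an option, not as what this notebook does.

```python
import pandas as pd

# Toy frame (values are made up)
toy = pd.DataFrame({
    'age': [19, 18, 28, 33],
    'smoker': ['yes', 'no', 'no', 'yes'],
    'region': ['southwest', 'southeast', 'northwest', 'southeast'],
})

# One indicator column per category
full = pd.get_dummies(data=toy, columns=['smoker', 'region'])

# Variant: drop the first category of each feature to avoid redundant columns
reduced = pd.get_dummies(data=toy, columns=['smoker', 'region'], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))

# The two smoker indicators always add up to 1
smoker_sum = full[['smoker_no', 'smoker_yes']].astype(int).sum(axis=1)
print(smoker_sum.tolist())
```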

In [12]:
data = pd.get_dummies(data=data, columns=['sex', 'smoker', 'region'])
data.head()

Unnamed: 0,age,bmi,children,charges,sex_female,sex_male,smoker_no,smoker_yes,region_northeast,region_northwest,region_southeast,region_southwest
0,19,27.9,0,16884.924,1,0,0,1,0,0,0,1
1,18,33.77,1,1725.5523,0,1,1,0,0,0,1,0
2,28,33.0,3,4449.462,0,1,1,0,0,0,1,0
3,33,22.705,0,21984.47061,0,1,1,0,0,1,0,0
4,32,28.88,0,3866.8552,0,1,1,0,0,1,0,0


<h4>Feature Scaling</h4>

In [13]:
X = data.drop(labels='charges', axis=1)
y = data['charges']

In [14]:
scaler = StandardScaler()
scaled_X = scaler.fit_transform(X)
scaled_X = pd.DataFrame(data=scaled_X, columns=X.columns)
scaled_X.head()

Unnamed: 0,age,bmi,children,sex_female,sex_male,smoker_no,smoker_yes,region_northeast,region_northwest,region_southeast,region_southwest
0,-1.440418,-0.45316,-0.909234,1.009771,-1.009771,-1.96966,1.96966,-0.565546,-0.565546,-0.611638,1.764609
1,-1.511647,0.509422,-0.079442,-0.990324,0.990324,0.507702,-0.507702,-0.565546,-0.565546,1.634955,-0.566698
2,-0.79935,0.383155,1.580143,-0.990324,0.990324,0.507702,-0.507702,-0.565546,-0.565546,1.634955,-0.566698
3,-0.443201,-1.305052,-0.909234,-0.990324,0.990324,0.507702,-0.507702,-0.565546,1.768203,-0.611638,-0.566698
4,-0.514431,-0.292456,-0.909234,-0.990324,0.990324,0.507702,-0.507702,-0.565546,1.768203,-0.611638,-0.566698
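
What `StandardScaler` computes can be checked by hand: each column has its mean subtracted and is divided by its population standard deviation. A minimal sketch on made-up values:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values for two columns (age, bmi)
X_toy = np.array([[19, 27.9], [18, 33.77], [28, 33.0], [33, 22.705]])

scaled = StandardScaler().fit_transform(X_toy)

# Same computation by hand: subtract the column mean, divide by the population std (ddof=0)
manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print('Matches manual computation:', np.allclose(scaled, manual))
print('Column means:', scaled.mean(axis=0).round(6))
print('Column stds :', scaled.std(axis=0).round(6))
```

In this notebook the scaler is fitted on the full data before the train/test split. A common alternative is to fit it on the training rows only and then transform the test rows with the same fitted scaler.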


<h4>Feature Selection using Random Forest</h4>

- In the **real world**, it is **rare** that **all** the features are important when developing a model.

- So we **analyze** the **impact** of each **input feature on the target**.

- We do so either by performing **statistical tests** (Pearson, ANOVA, Chi-Square) or by using a **Random Forest**.

- **Random forests** are one of the most **popular machine learning algorithms** because they provide:
    - **good predictive performance**,
    - **low overfitting**, and
    - **easy interpretability**.

- This **interpretability** comes from the **importance of each feature**, measured by how much the splits on that feature **reduce impurity** across the trees.

- In other words, it is **easy to compute** how much **each feature contributes** to the **decision**.

- Below we use **SelectFromModel**, a meta-transformer available in **Sklearn**, which fits a **base estimator** and uses its feature importances to **identify important features**.

- A feature is kept when its **importance** is at or above a **threshold**; with this estimator the default threshold is the **mean** of the feature importances.

In [16]:
# Have some patience, may take some time :)
selector = SelectFromModel(RandomForestRegressor(n_estimators = 100, random_state = 42, n_jobs = -1))
selector.fit(X, y)

# Extracting list of important features
selected_feat = X.columns[(selector.get_support())].tolist()

print('Total Features Selected are', len(selected_feat))

# Estimated by taking mean(default) of feature importance
print('Threshold set by Model:', np.round(selector.threshold_, decimals = 2))
print('Features:', selected_feat)

Total Features Selected are 4
Threshold set by Model: 0.09
Features: ['age', 'bmi', 'smoker_no', 'smoker_yes']
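
The threshold of 0.09 follows from the default rule: the impurity-based importances of a random forest are normalized to sum to 1, so their mean is 1 divided by the number of features, and with 11 features that is about 0.09. The sketch below checks this relationship on synthetic data (the data and the number of trees are assumptions made for illustration).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

# Synthetic data: 5 features, only the first two drive the target
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(200, 5))
y_syn = 3.0 * X_syn[:, 0] - 2.0 * X_syn[:, 1] + rng.normal(scale=0.1, size=200)

selector = SelectFromModel(RandomForestRegressor(n_estimators=50, random_state=42))
selector.fit(X_syn, y_syn)

# Importances of the fitted forest, one value per feature
importances = selector.estimator_.feature_importances_

print('Importances:', np.round(importances, 3))
print('Threshold  :', selector.threshold_)      # Default: mean of the importances
print('Mean       :', importances.mean())
print('Selected   :', selector.get_support())   # True where importance >= threshold
```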


<h4>Data Preparation</h4>

In [17]:
X = data[selected_feat]
y = data['charges']

In [18]:
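# Note: scaled_X holds all 11 encoded columns, so the selected features in X are not used here.
# To train on the selected features only, pass scaled_X[selected_feat] instead.
X_train, X_test, y_train, y_test = train_test_split(scaled_X, y, test_size=0.2, random_state=0)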

print('Training Data Shape:', X_train.shape, y_train.shape)
print('Testing Data Shape:', X_test.shape, y_test.shape)

Training Data Shape: (1069, 11) (1069,)
Testing Data Shape: (268, 11) (268,)
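
The shapes above show 11 columns, which means the split was made on all scaled features and not on the 4 selected ones. If the intent is to train on the selected features only, the scaled frame can be subset before splitting. A minimal sketch with made-up stand-ins (`scaled_toy`, `y_toy` and `selected_toy` are hypothetical names):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-ins for scaled_X, y and selected_feat (values are made up)
rng = np.random.default_rng(0)
scaled_toy = pd.DataFrame(rng.normal(size=(10, 4)),
                          columns=['age', 'bmi', 'children', 'smoker_yes'])
y_toy = pd.Series(rng.normal(size=10), name='charges')
selected_toy = ['age', 'smoker_yes']

# Subset to the selected columns, then split
X_train, X_test, y_train, y_test = train_test_split(
    scaled_toy[selected_toy], y_toy, test_size=0.2, random_state=0
)

print('Training Data Shape:', X_train.shape, y_train.shape)
print('Testing Data Shape:', X_test.shape, y_test.shape)
```

Training on a different set of columns would change the coefficients and metrics reported below.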


<h4>Model Development & Evaluation</h4>

In [19]:
regressor = LinearRegression()
regressor.fit(X_train, y_train)

y_pred_test = regressor.predict(X_test)
y_pred_train = regressor.predict(X_train)

# Interpreting model coefficients
print("Intercept:", regressor.intercept_)
print("Coefficients:", regressor.coef_)

Intercept: 13108.525903378584
Coefficients: [ 3428.53452251  1902.42183541   593.01732727    50.42896917
   -50.42896917 -4801.39592159  4801.39592159   274.0389709
    81.92296734  -157.37696405  -192.27810045]


In [20]:
# MAE (Mean Absolute Error)
MAE_train = metrics.mean_absolute_error(y_train, y_pred_train)
MAE_test = metrics.mean_absolute_error(y_test, y_pred_test)

# MSE (Mean Squared Error)
MSE_train = metrics.mean_squared_error(y_train, y_pred_train)
MSE_test = metrics.mean_squared_error(y_test, y_pred_test)

# RMSE (Root Mean Squared Error) is the square root of the MSE
RMSE_train = np.sqrt(MSE_train)
RMSE_test = np.sqrt(MSE_test)

# R-Squared
r2_score_train = r2_score(y_train, y_pred_train)
r2_score_test = r2_score(y_test, y_pred_test)

print('MAE for training set is {}'.format(MAE_train))
print('MAE for test set is {}'.format(MAE_test))
print('MSE for training set is {}'.format(MSE_train))
print('MSE for test set is {}'.format(MSE_test))
print('RMSE for training set is {}'.format(RMSE_train))
print('RMSE for test set is {}'.format(RMSE_test))
print('R-Squared for training set:', r2_score_train)
print('R-Squared for test set:', r2_score_test)

MAE for training set is 4017.4459975187024
MAE for test set is 4396.031406695097
MSE for training set is 35391932.94623181
MSE for test set is 41546216.6608684
RMSE for training set is 5949.111946016129
RMSE for test set is 6445.63547378134
R-Squared for training set: 0.748860461829179
R-Squared for test set: 0.7530385567240127
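
For reference, the four metrics can be written out directly from their definitions. The sketch below computes them with NumPy on made-up values and compares the results with the scikit-learn functions used above.

```python
import numpy as np
from sklearn import metrics

# Made-up true and predicted values
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

errors = y_true - y_pred
mae = np.mean(np.abs(errors))        # Mean absolute error
mse = np.mean(errors ** 2)           # Mean squared error
rmse = np.sqrt(mse)                  # Root mean squared error, in the units of the target
# R-Squared: 1 minus (residual sum of squares / total sum of squares)
r2 = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print('MAE matches sklearn:', np.isclose(mae, metrics.mean_absolute_error(y_true, y_pred)))
print('MSE matches sklearn:', np.isclose(mse, metrics.mean_squared_error(y_true, y_pred)))
print('R2 matches sklearn :', np.isclose(r2, metrics.r2_score(y_true, y_pred)))
```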


**Observation:**

- Training and test R-Squared are nearly identical (about 0.75), so the model is not overfitting; the modest score on both sets suggests it is underfitting.

- In such a case, we can try more complex models such as Decision Tree and Random Forest.
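
A comparison of this kind could be sketched as follows. The data here is synthetic, with a made-up interaction between smoking and BMI, and the hyperparameters are assumptions, so the scores say nothing about the insurance data itself.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data with an interaction that a linear model cannot represent
rng = np.random.default_rng(0)
n = 500
age = rng.uniform(18, 64, n)
bmi = rng.uniform(16, 53, n)
smoker = rng.integers(0, 2, n)
y_syn = 250 * age + 20000 * smoker * (bmi > 30) + rng.normal(scale=1000, size=n)
X_syn = np.column_stack([age, bmi, smoker])

X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=0)

models = {
    'Linear Regression': LinearRegression(),
    'Decision Tree': DecisionTreeRegressor(max_depth=4, random_state=0),
    'Random Forest': RandomForestRegressor(n_estimators=100, random_state=0),
}

# Fit each model and record R-Squared on the training and test sets
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = (r2_score(y_tr, model.predict(X_tr)),
                    r2_score(y_te, model.predict(X_te)))
    print(name, '| train R2:', round(scores[name][0], 3), '| test R2:', round(scores[name][1], 3))
```

A large gap between the training and test scores of a tree-based model would point to overfitting, in which case limiting the tree depth is one option.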