<a href="https://colab.research.google.com/github/mdkamrulhasan/data_mining_kdd/blob/main/notebooks/Regression_Insurance.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

What will we cover today?


1.   Feature encoding (One-hot)
2.   Feature normalization (MinMaxScaler)



In [1]:
import numpy as np
import pandas as pd
# Models (Sklearn)
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
# Data and Evaluation packages
from sklearn import datasets
from sklearn.metrics import mean_squared_error
# visualization
import plotly.express as px

[Data description](https://www.kaggle.com/datasets/mirichoi0218/insurance/)

In [5]:
# Load the dataset

data_url = 'https://raw.githubusercontent.com/mdkamrulhasan/data_mining_kdd/main/data/medical-cost/insurance.csv'
df = pd.read_csv(data_url)
df.keys()

Index(['age', 'sex', 'bmi', 'children', 'smoker', 'region', 'charges'], dtype='object')

In [6]:
df.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552


Features and labels

In [77]:
features_df = df[df.columns[:-1]]
y = df[df.columns[-1]]

In [39]:
features_df.head()

Unnamed: 0,age,sex,bmi,children,smoker,region
0,19,female,27.9,0,yes,southwest
1,18,male,33.77,1,no,southeast
2,28,male,33.0,3,no,southeast
3,33,male,22.705,0,no,northwest
4,32,male,28.88,0,no,northwest


In [49]:
fig = px.scatter(x=features_df.bmi, y=y)
fig.show()

Numeric vs categorical features

In [40]:
features_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
dtypes: float64(1), int64(2), object(3)
memory usage: 62.8+ KB


In [41]:
features_numeric = []
features_categorical = []
for colx in features_df.columns:
  if features_df[colx].dtype == 'object':
    #print(colx, df[colx].nunique())
    features_categorical.append(colx)
  else:
    features_numeric.append(colx)
print('numeric features:', features_numeric)
print('categorical features:', features_categorical)

numeric features: ['age', 'bmi', 'children']
categorical features: ['sex', 'smoker', 'region']
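The same split can be sketched without an explicit loop. This is a minimal example that assumes pandas' `select_dtypes`; `toy_df` is a hypothetical stand-in built from the first rows shown above, not the full dataset.

```python
import pandas as pd

# Hypothetical stand-in frame with the same column types as the insurance features.
toy_df = pd.DataFrame({
    "age": [19, 18, 28],
    "sex": ["female", "male", "male"],
    "bmi": [27.9, 33.77, 33.0],
    "children": [0, 1, 3],
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "southeast", "southeast"],
})

# Numeric columns are picked by dtype; everything else is treated as categorical.
numeric_cols = toy_df.select_dtypes(include="number").columns.tolist()
categorical_cols = toy_df.select_dtypes(exclude="number").columns.tolist()

print("numeric features:", numeric_cols)
print("categorical features:", categorical_cols)
```

Selecting on "number" rather than on the object dtype keeps the split working even if the string columns are stored with a dedicated string dtype.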


One-hot-encoding

In [42]:
f_categorical = features_categorical[0]
one_hot_transform = pd.get_dummies(
    features_df[f_categorical], prefix=f_categorical+'_')
one_hot_transform.head(3)

Unnamed: 0,sex__female,sex__male
0,1,0
1,0,1
2,0,1


Let's do the encoding for all categorical features

In [43]:
one_hot_dataframes = []
for featx in features_categorical:
  one_hot_dataframes.append(
      pd.get_dummies(features_df[featx], prefix=featx+'_')
  )

Concatenate all DataFrames

In [73]:
df_all_numeric = pd.concat([features_df[features_numeric]]
                           + one_hot_dataframes, axis=1)
df_all_numeric.head()

Unnamed: 0,age,bmi,children,sex__female,sex__male,smoker__no,smoker__yes,region__northeast,region__northwest,region__southeast,region__southwest
0,19,27.9,0,1,0,0,1,0,0,0,1
1,18,33.77,1,0,1,1,0,0,0,1,0
2,28,33.0,3,0,1,1,0,0,0,1,0
3,33,22.705,0,0,1,1,0,0,1,0,0
4,32,28.88,0,0,1,1,0,0,1,0,0
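The encode-then-concatenate steps above can probably be collapsed into a single call. The sketch below assumes the `columns` and `dtype` arguments of `pd.get_dummies`; `toy_df` is a hypothetical stand-in. With the default separator the new columns are named `sex_female` rather than `sex__female`.

```python
import pandas as pd

# Hypothetical stand-in frame (values taken from the first rows shown earlier).
toy_df = pd.DataFrame({
    "age": [19, 18, 28],
    "sex": ["female", "male", "male"],
    "smoker": ["yes", "no", "no"],
})

# Encode the listed columns in place: untouched columns come first,
# followed by one 0/1 indicator column per category.
encoded = pd.get_dummies(toy_df, columns=["sex", "smoker"], dtype=int)

print(encoded)
```

Passing `dtype=int` asks for 0/1 integers; some pandas versions return booleans by default.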


Feature Normalization

In [45]:
from sklearn.preprocessing import MinMaxScaler
feature_scaler = MinMaxScaler()
df_all_numeric_scaled = feature_scaler.fit_transform(df_all_numeric.to_numpy())
df_preprocessed = pd.DataFrame(df_all_numeric_scaled, columns=df_all_numeric.columns)

In [46]:
df_preprocessed.head()

Unnamed: 0,age,bmi,children,sex__female,sex__male,smoker__no,smoker__yes,region__northeast,region__northwest,region__southeast,region__southwest
0,0.021739,0.321227,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
1,0.0,0.47915,0.2,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0
2,0.217391,0.458434,0.6,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0
3,0.326087,0.181464,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0
4,0.304348,0.347592,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0
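Min-max scaling maps each column onto [0, 1] with x' = (x - min) / (max - min). As a small check, the sketch below applies the formula by hand to the first five ages. The minimum (18) and maximum (64) are inferred from the scaled output above (18 maps to 0.0 and 19 maps to 1/46), not read from the data.

```python
import numpy as np

# Ages from the first five rows of the dataset.
ages = np.array([19, 18, 28, 33, 32], dtype=float)

# Assumed column-wide minimum and maximum of 'age', inferred from the scaled output.
age_min, age_max = 18.0, 64.0

# Min-max scaling: x' = (x - min) / (max - min)
ages_scaled = (ages - age_min) / (age_max - age_min)

print(np.round(ages_scaled, 6))
```

The results agree with the age column of the scaled table above to six decimal places.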


Now everything is ready for modeling

In [78]:
X = df_preprocessed.to_numpy()
X.shape, y.shape

((1338, 11), (1338,))

Our wrapper class (can take any model as input)

In [79]:
from sklearn.model_selection import cross_val_score

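class myRegressionModel:
  """Thin wrapper around any scikit-learn style regressor."""

  def __init__(self, model):
    self.model = model
    self.nb_cv_splits = 3  # number of cross-validation folds
    # scikit-learn maximizes scores, so MSE is reported negated
    self.evaluation_metrics = 'neg_mean_squared_error'

  def train(self, X, y):
    # Fit the wrapped model on the given data
    self.model.fit(X, y)

  def evaluate(self, X, y):
    # Plain (positive) mean squared error on the given data
    y_predict = self.model.predict(X)
    return mean_squared_error(y, y_predict)

  def cv_error(self, X, y):
    # One negated-MSE score per fold; values closer to 0 are better
    return cross_val_score(self.model,
                           X,
                           y, scoring=self.evaluation_metrics,
                           cv=self.nb_cv_splits)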
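One caveat: the notebook fits the scaler on the whole dataset before cross-validating, so each held-out fold has already influenced the min/max values. A possible alternative is to put the scaler and the model in a pipeline, so the scaler is refit on the training part of every fold. This is a minimal sketch on synthetic stand-in data (`X_demo`, `y_demo` are hypothetical), not the notebook's own method.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (hypothetical): 120 rows, 3 numeric features.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 100, size=(120, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 0.5]) + rng.normal(0, 1.0, size=120)

# The scaler is refit on the training part of every fold,
# so the held-out fold never influences the min/max values.
pipeline = make_pipeline(MinMaxScaler(), LinearRegression())
scores = cross_val_score(pipeline, X_demo, y_demo,
                         scoring="neg_mean_squared_error", cv=3)

print("negated MSE per fold:", scores)
```

Because the wrapper class accepts any estimator, a pipeline like this could be passed to it directly, for example myRegressionModel(pipeline), with the unscaled features as X.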



Linear Regression (LR)

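- Feature encoding is required, because the model needs numeric inputs. Normalization does not change the predictions of ordinary least squares, but it puts the coefficients on a comparable scale.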

In [80]:
my_model = myRegressionModel(LinearRegression())
my_model.train(X, y)
cv_scores = my_model.cv_error(X, y)
print('cross validation score (mean):', np.mean(cv_scores))
print('cross validation score (std):', np.std(cv_scores))

cross validation score (mean): -37868776.44922774
cross validation score (std): 2667272.428684553


Random Forest (RF)
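- Does not need data normalization, since tree splits depend only on the ordering of feature values. scikit-learn's implementation still requires numeric input, so categorical features must be encoded first.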

In [81]:
my_model = myRegressionModel(RandomForestRegressor())
my_model.train(X, y)
cv_scores = my_model.cv_error(X, y)
print('cross validation score (mean):', np.mean(cv_scores))
print('cross validation score (std):', np.std(cv_scores))

cross validation score (mean): -23780351.99929453
cross validation score (std): 1128018.7524548355


Support Vector Regression (SVR)
- Feature encoding and data normalization are both required, because kernel methods are sensitive to feature scale

In [82]:
my_model = myRegressionModel(SVR())
my_model.train(X, y)
cv_scores = my_model.cv_error(X, y)
print('cross validation score (mean):', np.mean(cv_scores))
print('cross validation score (std):', np.std(cv_scores))

cross validation score (mean): -161042373.77048707
cross validation score (std): 6140241.114155319
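SVR scores far worse than the other models here. One plausible reason, offered as an assumption rather than a diagnosis, is that the target (charges, in the thousands) is left unscaled while SVR runs with its default settings. The sketch below uses synthetic stand-in data (hypothetical) to show one way to standardize the target with scikit-learn's `TransformedTargetRegressor` and compare the result against a plain SVR.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (hypothetical): features in [0, 1], target in the thousands.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1, size=(150, 2))
y_demo = 20000.0 * X_demo[:, 0] + 5000.0 * X_demo[:, 1] + rng.normal(0, 500.0, size=150)

models = {
    "plain SVR": SVR(),
    # Standardize y before fitting and map predictions back to the original units.
    "SVR with scaled target": TransformedTargetRegressor(
        regressor=SVR(), transformer=StandardScaler()),
}

mse = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo,
                             scoring="neg_mean_squared_error", cv=3)
    mse[name] = -scores.mean()  # flip the sign back to a positive MSE
    print(name, "mean MSE:", mse[name])
```

Tuning C and epsilon would be another option. The sketch only illustrates the effect of the target's scale.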


Boosting
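- Whether feature encoding and data normalization are needed depends on the underlying weak regressors. scikit-learn's GradientBoostingRegressor uses decision trees, so normalization is not needed, but inputs must be numeric.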

In [83]:
my_model = myRegressionModel(GradientBoostingRegressor())
my_model.train(X, y)
cv_scores = my_model.cv_error(X, y)
print('cross validation score (mean):', np.mean(cv_scores))
print('cross validation score (std):', np.std(cv_scores))

cross validation score (mean): -21512751.45538403
cross validation score (std): 1350151.7747861133
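The scores above are negated mean squared errors, so values closer to zero are better. To compare the models in the units of charges, the sign can be flipped and the square root taken. The sketch below does this for the four mean scores printed in this notebook. Note that the square root of a mean MSE is not identical to the mean of per-fold RMSEs, so treat the numbers as a rough guide.

```python
import math

# Mean negated-MSE scores copied from the outputs above.
mean_neg_mse = {
    "LinearRegression": -37868776.44922774,
    "RandomForestRegressor": -23780351.99929453,
    "SVR": -161042373.77048707,
    "GradientBoostingRegressor": -21512751.45538403,
}

# Flip the sign to get MSE, then take the square root to get RMSE.
rmse = {name: math.sqrt(-score) for name, score in mean_neg_mse.items()}

# Print models from lowest to highest error.
for name, value in sorted(rmse.items(), key=lambda item: item[1]):
    print(f"{name}: RMSE ~ {value:,.0f}")

best_model = min(rmse, key=rmse.get)
print("lowest RMSE:", best_model)
```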
