In this notebook, I'm going to try L1 regularization using the Lasso model from the linear_model module of the sklearn library.

The data I am using here is about how much patients pay in insurance fees. Each row includes the patient's age, sex, BMI, number of children, smoking status, and region, along with the target, the insurance fee ('charges').
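
To make the idea concrete, here is a minimal sketch of what L1 regularization does, on made-up data rather than the insurance data. Lasso minimizes the squared error plus alpha times the sum of the absolute coefficient values, and that penalty can push the coefficients of unhelpful features to exactly zero. The synthetic arrays and the names toy_X, toy_y and toy_model are my own assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data (an assumption for illustration): 5 features, only the first 2 matter
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(200, 5))
toy_y = 3 * toy_X[:, 0] - 2 * toy_X[:, 1] + rng.normal(scale=0.1, size=200)

# alpha controls the strength of the L1 penalty
toy_model = Lasso(alpha=0.5)
toy_model.fit(toy_X, toy_y)

# The informative features keep shrunken, non-zero coefficients;
# the three noise features are expected to be driven to zero
print(toy_model.coef_)
```

A larger alpha shrinks the coefficients more and removes more features; a very small alpha behaves close to ordinary linear regression.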

In [1]:
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures
from math import sqrt

import numpy as np
import pandas as pd

# Path to the data file
INSURANCE_FILE_PATH = 'insurance.csv'

insurance_df = pd.read_csv(INSURANCE_FILE_PATH)  
insurance_df.head() # Always check the data.


Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552


In [2]:
# One-hot encode the categorical columns: sex, smoker and region.
insurance_df = pd.get_dummies(data=insurance_df, columns=['sex', 'smoker', 'region'])
insurance_df

Unnamed: 0,age,bmi,children,charges,sex_female,sex_male,smoker_no,smoker_yes,region_northeast,region_northwest,region_southeast,region_southwest
0,19,27.900,0,16884.92400,1,0,0,1,0,0,0,1
1,18,33.770,1,1725.55230,0,1,1,0,0,0,1,0
2,28,33.000,3,4449.46200,0,1,1,0,0,0,1,0
3,33,22.705,0,21984.47061,0,1,1,0,0,1,0,0
4,32,28.880,0,3866.85520,0,1,1,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...
1333,50,30.970,3,10600.54830,0,1,1,0,0,1,0,0
1334,18,31.920,0,2205.98080,1,0,1,0,1,0,0,0
1335,18,36.850,0,1629.83350,1,0,1,0,0,0,1,0
1336,21,25.800,0,2007.94500,1,0,1,0,0,0,0,1
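
As a small sketch of what the one-hot encoding above does, here is pd.get_dummies applied to a tiny made-up dataframe. The name toy_df and its values are assumptions for illustration only; the point is that each category of the encoded column becomes its own indicator column, while the other columns are left as they are.

```python
import pandas as pd

# Hypothetical three-row dataframe with one categorical column
toy_df = pd.DataFrame({"age": [19, 18, 28], "smoker": ["yes", "no", "no"]})

# Only the columns listed in `columns` are encoded
encoded_df = pd.get_dummies(data=toy_df, columns=["smoker"])

# The 'smoker' column is replaced by 'smoker_no' and 'smoker_yes'
print(list(encoded_df.columns))
print(encoded_df)
```

Depending on the pandas version, the indicator columns may be displayed as 0/1 or as False/True.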


In [3]:
# Build the feature dataframe by dropping the target column 'charges' (the insurance fee)
X = insurance_df.drop(['charges'], axis=1)

In [4]:
polynomial_transformer = PolynomialFeatures(4)  # Define a degree-4 polynomial transformer
polynomial_features = polynomial_transformer.fit_transform(X.values)  # Expand the features into all polynomial terms up to degree 4

features = polynomial_transformer.get_feature_names(X.columns)  # Create new variable names (newer scikit-learn versions use get_feature_names_out instead)

X = pd.DataFrame(polynomial_features, columns=features)  # Create dataframe
y = insurance_df[['charges']]  # Define Target Variable


In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=5)

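# Lasso hyperparameters: alpha=1 (regularization strength), max_iter=2000, normalize=True
# Note: the normalize parameter was removed in scikit-learn 1.2, so this cell assumes an older version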
model = Lasso(alpha=1, max_iter=2000, normalize=True)


In [6]:
# Fit the model on the training set
model.fit(X_train, y_train)

Lasso(alpha=1, max_iter=2000, normalize=True)

In [7]:
y_train_predict = model.predict(X_train)
y_test_predict = model.predict(X_test)

In [10]:
# Evaluate the model: root mean squared error (RMSE) on the training and testing sets

mse = mean_squared_error(y_train, y_train_predict)

print("Performance : training set")
print("-----------------------")
print(f'error: {sqrt(mse)}')

mse = mean_squared_error(y_test, y_test_predict)
print("------------------------------")

print("Performance : testing set")
print("-----------------------")
print(f'error: {sqrt(mse)}')

Performance : training set
-----------------------
error: 4726.636439607448
------------------------------
Performance : testing set
-----------------------
error: 4692.232442526967
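
The error printed above is the root mean squared error (RMSE): the square root of the mean squared error, which puts the error back in the same units as the target. A tiny worked sketch with made-up numbers (the two lists are assumptions, not values from the insurance data):

```python
from math import sqrt

from sklearn.metrics import mean_squared_error

# Hypothetical true values and predictions
toy_true = [10, 20, 30, 40]
toy_pred = [13, 16, 30, 40]

# Squared errors are 9, 16, 0, 0, so the mean squared error is 25 / 4 = 6.25
toy_mse = mean_squared_error(toy_true, toy_pred)
toy_rmse = sqrt(toy_mse)

print(toy_rmse)  # 2.5
```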


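Using a high-degree (degree-4) polynomial regression model with Lasso regularization, the error on the training set (about 4727) and on the testing set (about 4692) are nearly the same, which suggests the model is not overfitting despite the high polynomial degree.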
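
The normalize=True option used in this notebook is not available in scikit-learn 1.2 and later. A rough sketch of one common alternative is to scale the features with StandardScaler inside a pipeline. This is not numerically identical to normalize=True (which centered each feature and divided by its L2 norm), so alpha would likely need re-tuning. The synthetic data and all demo_ names below are assumptions for illustration, not the insurance data.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data (an assumption; not the insurance data)
rng = np.random.default_rng(5)
demo_X = rng.normal(size=(150, 3))
demo_y = 50 * demo_X[:, 0] + 20 * demo_X[:, 1] ** 2 + rng.normal(scale=5, size=150)

demo_X_train, demo_X_test, demo_y_train, demo_y_test = train_test_split(
    demo_X, demo_y, test_size=0.3, random_state=5
)

# Scaling happens inside the pipeline, so it is fitted on the training data only
demo_pipeline = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    Lasso(alpha=1, max_iter=2000),
)
demo_pipeline.fit(demo_X_train, demo_y_train)

# Count how many polynomial terms the L1 penalty kept
demo_coef = demo_pipeline.named_steps["lasso"].coef_
n_nonzero = int(np.sum(demo_coef != 0))
print(f"{n_nonzero} of {demo_coef.size} coefficients are non-zero")
```

Counting the non-zero coefficients this way is also a quick check of how many polynomial terms the Lasso penalty actually kept.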