<a href="https://colab.research.google.com/github/mkjubran/Fundamentals-of-AI-and-Machine-Learning/blob/main/FITTING_AND_EVALUATING_LINEAR_REGRESSION_MODELS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## FITTING AND EVALUATING LINEAR REGRESSION MODELS


In this notebook, we will demonstrate how to build and evaluate linear regression models. We will work on part of the modified version of the cardiovascular dataset from Kaggle (https://www.kaggle.com/code/sulianova/eda-cardiovascular-data/data). 


# Import Libraries

First, we need to import some libraries that will be used during the creation and evaluation of linear regression models.

In [None]:
import pandas as pd
import seaborn as sns
import statsmodels.api as sm

# Data Preparation

**Clone the dataset Repository**

The modified dataset can be cloned from the GitHub repository https://github.com/mkjubran/AIData.git as below

In [None]:
!rm -rf ./AIData
!git clone https://github.com/mkjubran/AIData.git

**Read the dataset**

The data is stored in the MedicalCostPersonalDatasets.csv file. Read it into a dataframe using the Pandas library (https://pandas.pydata.org/).

In [None]:
df = pd.read_csv("/content/AIData/MedicalCostPersonalDatasets.csv",sep=",")
df.head()

**Display Data Info**

Display some information about the dataset using the info() method

In [None]:
df.info()

The dataset contains 1338 records with 7 columns each: 6 input features and the target 'charges'. Four columns are numeric and the rest are objects (strings).

# Clean Data and Remove Outliers

**Check Missing Values**

Check if there are any missing values in the dataset

In [None]:
df.isnull().sum()

As can be observed, there are no missing values in the dataset.

**Remove Outliers**

Let us get the summary statistics of the dataset and check whether anything looks unusual.

In [None]:
df.describe()

The minimum age is 18 years, which is the age at which a person can get an insurance plan, and the maximum age in the records is 64 years. The ideal value of the bmi feature is between 18.5 and 24.9, so the dataset includes records for persons with non-ideal bmi values. The number of children is between 0 (no children) and 5. The charges feature, which is the target feature, is always positive.

Let's use the box plot to check for any outliers in the dataset. As for the 'children' feature, its value is between 0 and 5, and thus no outliers. Let us check for the 'age' and 'bmi' features (independent variables).

In [None]:
sns.boxplot(data=df[["age", "bmi"]])

There are no outliers for the 'age' feature and only a few outliers for the 'bmi' feature. These outliers have values close to the third quartile, so we will not remove them. Let us check the outliers in the 'charges' feature.

In [None]:
sns.boxplot(data=df[["charges"]])

There are many outliers above the third quartile. Before handling them, let us check the distribution of the 'charges' feature.

In [None]:
sns.set_style('whitegrid')
# histplot replaces distplot, which is deprecated in recent seaborn releases
sns.histplot(df['charges'], kde=False, color='blue', bins=30)

The outliers appeared in the box plot because the 'charges' feature has a skewed distribution: most of the records are for medication with low or moderate costs, and only a few records are for high costs. We should keep these high charges so that the regression model can learn to predict them.
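To make the link between skewness and box-plot outliers concrete, here is a minimal sketch of the 1.5 x IQR rule that box plots use by default to flag points. The sample values are hypothetical (they are not taken from the dataset) and the helper name `iqr_fences` is introduced only for illustration.

```python
import numpy as np

def iqr_fences(values, k=1.5):
    """Return the (lower, upper) limits beyond which points are flagged as outliers."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical right-skewed "charges": many small values, a few very large ones
sample = np.array([1200.0, 2500.0, 4000.0, 6500.0, 9000.0,
                   12000.0, 14000.0, 39000.0, 48000.0])

lower, upper = iqr_fences(sample)
outliers = sample[(sample < lower) | (sample > upper)]

print("fences:", lower, upper)
print("flagged as outliers:", outliers)
# In a right-skewed sample the mean is pulled above the median
print("mean > median:", sample.mean() > np.median(sample))
```

The flagged points sit in the long right tail, which suggests they reflect the shape of the distribution rather than recording errors.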

# Encode Categorical Data and Check the Significance of Features

**Encode Categorical Features**

'sex', 'smoker', and 'region' are three categorical features that we need to encode. We will encode them using one-hot encoding.

In [None]:
# dtype=int yields 0/1 columns (recent pandas versions return booleans by default)
df = pd.get_dummies(df, dtype=int)
df.head()

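Remember to drop one of the columns produced by the one-hot encoding of each feature; otherwise the encoded columns of a feature always sum to one and are perfectly collinear with the constant term of the regression. Note that get_dummies() has already replaced the original features ('sex', 'smoker', and 'region') with their encoded columns.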

In [None]:
df.drop(['sex_male','region_northeast','smoker_no'],axis=1,inplace=True)
df.head()
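As an alternative sketch, `pd.get_dummies` can drop one column per categorical feature in a single step through its `drop_first` argument. The toy dataframe below is hypothetical. Note that `drop_first` removes the first category in sorted order, so on the real data it would drop 'sex_female' rather than 'sex_male' as done above.

```python
import pandas as pd

# Hypothetical records, used only to illustrate the encoding
toy = pd.DataFrame({
    "age": [19, 33, 45],
    "smoker": ["yes", "no", "no"],
    "region": ["northeast", "southwest", "northwest"],
})

# drop_first=True removes the first category of every encoded feature
encoded = pd.get_dummies(toy, drop_first=True, dtype=int)
print(encoded)
```

Here 'smoker_no' and 'region_northeast' are the dropped reference categories: a row with zeros in all remaining 'region_*' columns belongs to the northeast region.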

**Check the significance of features for the regression model**

Next, we will use the statsmodels library to check the significance of every feature for the regression model.

In [None]:
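# Separate the independent variables (X) from the target (Y)
X = df.drop('charges', axis=1)
Y = df.charges
# statsmodels does not add an intercept automatically, so add a constant column
X = sm.add_constant(X, prepend=True)
# Fit ordinary least squares and print the statistical summary
lm = sm.OLS(endog=Y, exog=X)
lm = lm.fit()
print(lm.summary())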

The model achieves an R-squared of 0.751, which means that it explains 75.1% of the variability observed in the charges. The Adj. R-squared, which penalizes R-squared for the number of predictors, is 0.749; as a rough rule of thumb, a value above 0.5 indicates a good fit. Also, the p-values of all the features except 'sex_female' and 'region_northwest' are significant (a low p-value means rejecting the null hypothesis that the feature does not influence the target feature).
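The relation between the two reported scores can be checked by hand. The sketch below applies the standard adjusted R-squared formula; the count of 8 predictors (excluding the constant) is an assumption based on the columns left after encoding, and the helper name `adjusted_r2` is introduced only for illustration.

```python
def adjusted_r2(r2, n_samples, n_predictors):
    """Adjusted R-squared: penalize R-squared for the number of predictors."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)

# R-squared and record count from the text; 8 predictors is an assumption
adj = adjusted_r2(r2=0.751, n_samples=1338, n_predictors=8)
print(round(adj, 4))
```

Because the input R-squared is already rounded to three decimals, the result can differ from the summary table in the last digit. With 1338 records and only a handful of predictors the penalty is small, which is why the two scores are so close.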

# Perform And Evaluate Linear Regression

**Performing Linear Regression**

We will start by splitting the dataset into training and testing splits. A common split ratio is 80% training and 20% testing.

In [None]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X,Y,test_size=0.2, random_state=200)
print('Size of the dataset = {}'.format(len(X)))
print('Size of the training dataset = {} ({}%)'.format(len(x_train), 100*len(x_train)/len(X)))
print('Size of the testing dataset = {} ({}%)'.format(len(x_test), 100*len(x_test)/len(X)))

Notice that we used a random_state so that the results are reproducible. You should avoid setting this argument in your production code so that the split is random at every run.
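A small sketch of what random_state does, using a hypothetical list of ten items in place of the dataset: two calls with the same seed return the same split.

```python
from sklearn.model_selection import train_test_split

# Hypothetical data: ten items standing in for the records
data = list(range(10))

# Two calls with the same seed produce identical splits
a_train, a_test = train_test_split(data, test_size=0.2, random_state=200)
b_train, b_test = train_test_split(data, test_size=0.2, random_state=200)

print(a_test, b_test)
print("identical splits:", a_train == b_train and a_test == b_test)
```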

Now, we will import the regression model from sklearn and train the model using the training split of the dataset.

In [None]:
from sklearn import linear_model
lm = linear_model.LinearRegression()
lm.fit(x_train,y_train)

**Evaluate Linear Regression**

To evaluate the model, we will compute the R2-score using the training and testing splits of the dataset

In [None]:
R2Score_train = lm.score(x_train, y_train)
R2Score_test = lm.score(x_test, y_test)

from prettytable import PrettyTable
t = PrettyTable(['R2-Score', 'Linear Regression (%)'])
t.add_row(['Training', R2Score_train*100])
t.add_row(['Testing', R2Score_test*100])
print(t)
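For reference, the score() method of LinearRegression returns the coefficient of determination, R2 = 1 - SS_res / SS_tot. The sketch below computes it by hand on hypothetical target and predicted values; the helper name `r2_score_manual` is introduced only for illustration.

```python
def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    # Residual sum of squares: error left after the model's predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical values, not taken from the dataset
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 6.5, 9.5]
score = r2_score_manual(y_true, y_pred)
print(score)
```

A score of 1 means perfect predictions, while a score of 0 means the model does no better than always predicting the mean of the target.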

Let us repeat the linear regression without the less significant features: 'sex_female' and 'region_northwest'.

In [None]:
X2=df.drop(['sex_female', 'region_northwest','charges'], axis=1)
Y2=df.charges
x2_train, x2_test, y2_train, y2_test = train_test_split(X2,Y2,test_size=0.2, random_state=200)
lm.fit(x2_train,y2_train)
R2Score_train = lm.score(x2_train, y2_train)
R2Score_test = lm.score(x2_test, y2_test)

from prettytable import PrettyTable
t = PrettyTable(['R2-Score', 'Linear Regression (%)'])
t.add_row(['Training', R2Score_train*100])
t.add_row(['Testing', R2Score_test*100])
print(t)

So dropping 'sex_female' and 'region_northwest' did not improve the R2-score.

**Polynomial Regression**

Let us try polynomial regression to improve the performance of linear regression.

In [None]:
from sklearn.preprocessing import PolynomialFeatures
poly_reg  = PolynomialFeatures(degree=2)
Xp = poly_reg.fit_transform(X)
xp_train, xp_test, yp_train, yp_test = train_test_split(Xp,Y,test_size=0.2, random_state=200)
lm.fit(xp_train,yp_train)
R2Score_train_poly = lm.score(xp_train, yp_train)
R2Score_test_poly = lm.score(xp_test, yp_test)

from prettytable import PrettyTable
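# Note: R2Score_train and R2Score_test hold the scores of the most recent linear fit (the reduced-feature model)
t = PrettyTable(['R2-Score', 'Linear Regression (%)', 'Polynomial Regression - 2nd order (%)'])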
t.add_row(['Training', R2Score_train*100, R2Score_train_poly*100])
t.add_row(['Testing', R2Score_test*100, R2Score_test_poly*100])
print(t)

As can be observed, the polynomial regression provides a better R2-score than the linear regression.
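To see what the degree-2 transformation produces, the sketch below expands a single row by hand. For two features a and b the result is 1, a, b, a^2, ab, b^2, which is the same order PolynomialFeatures uses. The helper name `poly_expand` is introduced only for illustration and is not part of sklearn.

```python
from itertools import combinations_with_replacement
from math import comb, prod

def poly_expand(row, degree=2):
    """Return all products of the row's values up to the given total degree."""
    terms = []
    for d in range(degree + 1):
        for combo in combinations_with_replacement(row, d):
            terms.append(prod(combo))  # the empty product (d = 0) is the bias term 1
    return terms

# Two hypothetical feature values a = 2 and b = 3
print(poly_expand([2, 3]))

# For n input columns, a degree-2 expansion has comb(n + 2, 2) output columns
n = 9  # assumption: the constant plus 8 features, as in X above
print(comb(n + 2, 2), len(poly_expand(list(range(1, n + 1)))))
```

The number of columns grows quickly with the number of input features, so a higher training score should always be checked against the testing score to make sure the model is not overfitting.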

# Saving and Loading Models

We will learn how to save and load models using two methods: Pickle and Joblib.

Option #1: we will save the regression model using the pickle library (https://docs.python.org/3/library/pickle.html).

In [None]:
import pickle
with open('./Model.pickle','wb') as f:
  pickle.dump(lm,f)

with open('./poly_reg.pickle','wb') as f:
  pickle.dump(poly_reg,f)

The linear model and the transformation are saved in your current directory (/content in Colab). The saved files contain only the fitted objects; they do not include the dataframes or the libraries needed to use them.

We will load the models using the load() method from the pickle library as follows.

In [None]:
with open('./Model.pickle','rb') as f:
  lm_pickle = pickle.load(f)

with open('./poly_reg.pickle','rb') as f:
  poly_reg_pickle = pickle.load(f)

Option #2: another option is to save the models using the joblib library, which is recommended in the sklearn documentation (https://scikit-learn.org/stable/modules/model_persistence.html), as follows.

In [None]:
import joblib as jb
jb.dump(lm, './Model.joblib') 
jb.dump(poly_reg, './poly_reg.joblib') 

To load these models, we will use the load() method.

In [None]:
lm_joblib = jb.load('./Model.joblib')
poly_reg_joblib = jb.load('./poly_reg.joblib')
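A quick way to confirm that a saved model survives the round trip is to compare its predictions before and after loading. The sketch below does this in memory with pickle on synthetic data; the data and variable names are hypothetical and stand in for the models trained above.

```python
import pickle

import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data, used only to obtain a fitted model for the check
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + 4.0

model = LinearRegression().fit(X_demo, y_demo)

# Serialize to bytes and restore, which mirrors saving to and loading from a file
restored = pickle.loads(pickle.dumps(model))

same = np.allclose(model.predict(X_demo), restored.predict(X_demo))
print("same predictions after reload:", same)
```

Pickle and joblib files can execute arbitrary code when loaded, so only load files from a trusted source, and preferably with the same sklearn version that was used to save them.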


# Predict New Values Using Models

To predict the target values for new data, we will use the loaded models

In [None]:
x_test.head()

In [None]:
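# Treat the test features as "new" data; copy so that x_test is not modified
x_new = x_test.copy()
# Apply the same polynomial transformation that was used during training
xp_new = poly_reg_pickle.transform(x_new)
y_predict = lm_pickle.predict(xp_new)
# Attach the predictions to the new records
dfnew = x_new
dfnew['charges_predict'] = y_predict
dfnew.head()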