<a href="https://colab.research.google.com/github/mohammad0alfares/MachineLearningNotebooks/blob/master/RegressionBasics_Part2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Basics of Regression Techniques - Part2

In this tutorial, we will continue exploring the basics of regression techniques. We will be discussing Saving and Loading Training Models, One Hot Encoding, and Model Accuracy. 

**By the end of this tutorial**, you will be able to:
-	Save and load inference models.
-	Apply multiple linear regression to categorical data.
-	Measure the accuracy of an inference model.

**Before Session**:
-	Read about how to save and load Machine Learning Models (reading source 2.1)
-	Read about how to encode labeled data for machine learning algorithms (reading sources 2.2 – 2.5)
-	Read about how to evaluate the accuracy of inference models (reading sources 2.6 and 2.7)
-	Watch a video about how to evaluate the accuracy of inference models (video source 2.8)

**Resources:**

2.1	Save and Load Machine Learning Models in Python with scikit-learn (reading): https://machinelearningmastery.com/save-load-machine-learning-models-python-scikit-learn/

2.2	What is One Hot Encoding and How to Do It (reading): https://medium.com/@michaeldelsole/what-is-one-hot-encoding-and-how-to-do-it-f0ae272f1179

2.3	Categorical encoding using Label-Encoding and One-Hot-Encoder (reading): https://towardsdatascience.com/categorical-encoding-using-label-encoding-and-one-hot-encoder-911ef77fb5bd

2.4	One Hot Encoding (reading): https://hackernoon.com/what-is-one-hot-encoding-why-and-when-do-you-have-to-use-it-e3c6186d008f

2.5	One Hot Encoding with Python | Handling Categorical Data (video): https://youtu.be/YOR6rQTTEAQ

2.6	Difference between Loss, Accuracy, Validation loss, Validation accuracy (reading): https://www.javacodemonk.com/difference-between-loss-accuracy-validation-loss-validation-accuracy-in-keras-ff358faa

2.7	Train/Test Split (reading): https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6

2.8	The 7 Steps of Machine Learning (video): https://youtu.be/nKW8Ndu7Mjw



**Clone the Source GitHub Repository**

Before we start the procedure of this tutorial, we need to clone some source files that will be used throughout this tutorial from a GitHub repository.

In [0]:
!rm -rf ./MachineLearning
!git clone https://github.com/mkjubran/MachineLearning.git

# Saving and Loading Training Models

In this section, we will learn how to save and load trained models. We will do that using two methods: pickle and joblib.

First, we will create and train the linear regression model used in the linear regression section.

In [0]:
import pandas as pd
from sklearn import linear_model

df = pd.read_csv("./MachineLearning/1_Regression/Mall_Customers_Logitic_short.csv")

reg = linear_model.LinearRegression()
reg.fit(df[['Annual Income']],df[['Spendings']])

print(reg.coef_) ## print the coefficient
print(reg.intercept_) ## print the intercept
df.head()

Now, we will save the **reg** linear model using the **pickle** library (https://docs.python.org/3/library/pickle.html).

In [0]:
import pickle
with open('./SpendingsLinearModel.pickle','wb') as f:
  pickle.dump(reg,f)

The file **SpendingsLinearModel.pickle** is saved in your current directory (/content in Colab) and contains the linear regression model. It does not include the dataframes or the libraries themselves, so scikit-learn must still be installed wherever the model is loaded.

To load the model, we will use another method from the pickle library as:

In [0]:
with open('./SpendingsLinearModel.pickle','rb') as f:
  reg_pickle = pickle.load(f)

We can now use the loaded model to predict the spendings of new customers based on their annual income, as done before.

In [0]:
df2 = pd.read_csv("./MachineLearning/1_Regression/Mall_Customers_short_new.csv")
print(df2)
p=reg_pickle.predict(df2)
df2['Spendings_new']=p
df2.head()

Another approach to saving the **reg** linear model is to use the joblib library (see https://scikit-learn.org/stable/modules/model_persistence.html):

In [0]:
import joblib as jb
jb.dump(reg, './SpendingsLinearModel.joblib') 

In this case, there is no need to open a file before dumping the model. The same applies to loading it.

In [0]:
reg_joblib = jb.load('./SpendingsLinearModel.joblib')

df2 = pd.read_csv("./MachineLearning/1_Regression/Mall_Customers_short_new.csv")
print(df2)
p=reg_joblib.predict(df2)
df2['Spendings']=p
df2.head()

To learn more about pickle and joblib, refer to https://scikit-learn.org/stable/modules/model_persistence.html.
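
The two approaches above can be compared side by side in a small, self-contained sketch. It does not use the tutorial's CSV files: the data is synthetic, and the file names `toy_model.pickle` and `toy_model.joblib` are hypothetical names chosen for illustration. The sketch saves one model both ways in a temporary directory and checks that each reloaded copy predicts the same values as the original.

```python
import os
import pickle
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration): y = 3x + 2 exactly.
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 3.0 * X.ravel() + 2.0
model = LinearRegression().fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    pickle_path = os.path.join(tmp, "toy_model.pickle")  # hypothetical file name
    joblib_path = os.path.join(tmp, "toy_model.joblib")  # hypothetical file name

    # pickle needs an explicit file handle opened in binary mode.
    with open(pickle_path, "wb") as f:
        pickle.dump(model, f)
    with open(pickle_path, "rb") as f:
        model_pickle = pickle.load(f)

    # joblib accepts a file path directly.
    joblib.dump(model, joblib_path)
    model_joblib = joblib.load(joblib_path)

# Both reloaded models should reproduce the original predictions.
X_new = np.array([[12.0], [20.0]])
same_pickle = np.allclose(model.predict(X_new), model_pickle.predict(X_new))
same_joblib = np.allclose(model.predict(X_new), model_joblib.predict(X_new))
print(same_pickle, same_joblib)
```

Two practical points are worth keeping in mind: unpickling can execute arbitrary code, so only load model files from sources you trust, and a saved model is best reloaded with the same scikit-learn version that created it.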

# One Hot Encoding
**Introduction**

In this section, we will apply multiple linear regression to categorical data.

We will use **one-hot encoding** to convert nominal categorical variables into a form that can be provided to ML algorithms for linear regression.
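
As a minimal sketch of the idea, the toy dataframe below (made-up rows, not the tutorial's dataset) holds a single `City` column. `pd.get_dummies` turns it into one 0/1 column per city, with exactly one 1 in each row.

```python
import pandas as pd

# Toy data (an assumption for illustration): four families and their cities.
toy = pd.DataFrame({"City": ["Ramallah", "Nablus", "Jerusalem", "Nablus"]})

# One column per distinct city; dtype=int gives 0/1 instead of True/False.
dummies = pd.get_dummies(toy["City"], dtype=int)
print(dummies)

# Every row contains exactly one 1, so the columns always sum to 1.
row_sums = dummies.sum(axis=1)
print(row_sums.tolist())
```

Because every row sums to 1, any one of the columns can be recovered from the others. This is why one dummy column is usually left out when the encoded data is fed to a linear regression model.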

**Implementation**

Read the input data from a CSV file called "FamilyCitySpendings.csv".
To read the data in the file, we will use the pandas library (https://pandas.pydata.org/).

In [0]:
import pandas as pd
df = pd.read_csv("./MachineLearning/1_Regression/FamilyCitySpendings.csv")
df.head()

In [0]:
df.groupby(['City'])['City'].count()

In [0]:
df['Kids'].max()

In [0]:
df.loc[df['Kids'].idxmax()]

As can be seen, one of the fields (City) contains a nominal categorical variable. Thus, we need to encode this field into numeric values using one-hot encoding. We will use the pd.get_dummies(df.City) method as follows:

In [0]:
dm = pd.get_dummies(df.City)
dm.head()

After executing the above command, we get a table with a code per city. Now we need to concatenate these columns to the original (df) dataframe.

In [0]:
df_merge = pd.concat([df,dm],axis='columns')
df_merge.head()

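Now we need to fit the multiple regression model. Note that we pass 'Annual Income', 'Working', 'Kids', and only two of the three city dummy variables ('Jerusalem' and 'Ramallah') to train the model. The third dummy variable ('Nablus') is redundant: a family that lives in neither Jerusalem nor Ramallah must live in Nablus.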

In [0]:
from sklearn import linear_model
regm = linear_model.LinearRegression()
regm.fit(df_merge[['Annual Income','Working','Kids','Jerusalem','Ramallah']],df_merge[['Spendings']])
print(regm.coef_) ## print the coefficients
print(regm.intercept_) ## print the intercept

Alternatively, we could clean the dataframe by dropping the fields that are not needed and then define the input variables of the model as follows:

In [0]:
x= df_merge.drop(['Nablus','City','Spendings'],axis=1)
print(x.head())
y = df_merge[['Spendings']]
print(y.head())

To train the model using x and y:

In [0]:
regm.fit(x,y)
print(regm.coef_) ## print the coefficients
print(regm.intercept_) ## print the intercept

The model is now ready. Suppose we want to estimate the spendings of a family from Ramallah with an annual income of \$20000, two persons working, and only one kid. We first create a new dataframe:

In [0]:
x_ = pd.DataFrame(index=None,columns=None)
x_['Annual Income'] = [20000]
x_['Working'] = [2]
x_['Kids'] = [1]
x_['Jerusalem'] = [0]
x_['Ramallah'] = [1]
x_.head()

Then pass the new dataframe to the regression model:

In [0]:
regm.predict(x_)

Alternatively, we could pass the family data directly to the regressor as a list of values:

In [0]:
regm.predict([[20000,2,1,0,1]])

**Is the order we apply the data to the regressor important?**
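
One way to explore this question is with a toy model. The sketch below uses synthetic data (an assumption for illustration, unrelated to the family dataset) in which the target is exactly `1*a + 100*b`. The same two numbers are then passed to `predict` in two different orders.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data (an assumption for illustration): columns are [a, b], target = 1*a + 100*b.
X_toy = np.array([[1, 1], [2, 0], [3, 1], [4, 0]], dtype=float)
y_toy = 1.0 * X_toy[:, 0] + 100.0 * X_toy[:, 1]
toy_reg = LinearRegression().fit(X_toy, y_toy)

# Same two values, passed in the training order and in swapped order.
right_order = toy_reg.predict([[5.0, 1.0]])[0]  # a=5, b=1
wrong_order = toy_reg.predict([[1.0, 5.0]])[0]  # values swapped: a=1, b=5
print(round(right_order, 2), round(wrong_order, 2))
```

The two predictions differ widely (about 105 versus about 501) because each coefficient is tied to a column position. Values passed as a plain list must therefore follow the same column order that was used for training.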

The spendings of a family living in Ramallah with \$20000 annual income, 2 persons working, and 1 kid is about \$187543. Let us next compare this with the same family living in the other cities. Use the city codes from the one-hot encoding produced by pd.get_dummies above.

In [0]:
## City codes as [Jerusalem Ramallah]:
## Ramallah = [0 1]
## Jerusalem = [1 0]
## Nablus = [0 0]

x_=[[20000,2,1,0,1],[20000,2,1,1,0],[20000,2,1,0,0]]
regm.predict(x_)

What about the accuracy of the model? We can view the accuracy of the model by printing its score as follows:

In [0]:
regm.score(x,y)

The accuracy of the model is about $98.899\%$. Note that for regression models the score method returns the coefficient of determination ($R^2$), which this tutorial refers to as accuracy. This is called the training accuracy because the same data used for training is used to compute it. A high training accuracy means that the model fits the training data very well and that the relationship in the training data is close to linear. However, we also need to measure how accurately the model predicts the spendings for new data not used for training. This will be discussed in the Model Accuracy section.
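
To make the meaning of the score concrete, the following sketch computes it by hand on synthetic data (an assumption for illustration) and compares it with the value returned by `score`. For a regression model, the score is $R^2 = 1 - SS_{res}/SS_{tot}$, where $SS_{res}$ is the sum of squared prediction errors and $SS_{tot}$ is the sum of squared deviations of the target from its mean.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration): a noisy straight line.
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(50, 1))
y_toy = 2.0 * X_toy.ravel() + 1.0 + rng.normal(0, 1.0, size=50)

toy_reg = LinearRegression().fit(X_toy, y_toy)
y_hat = toy_reg.predict(X_toy)

# R^2 = 1 - (sum of squared residuals) / (total sum of squares)
ss_res = np.sum((y_toy - y_hat) ** 2)
ss_tot = np.sum((y_toy - y_toy.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

r2_sklearn = toy_reg.score(X_toy, y_toy)
print(r2_manual, r2_sklearn)
```

The two printed values should agree up to floating-point rounding. A value close to 1 means the model explains most of the variation in the target.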

**Exercise 2.1:**

Use multiple linear regression to estimate the prices of the following cars:

Specifications | Car #1 | Car #2 | Car #3
-- | --- | --- | ---
Make    |  BMW | Audi | Nissan
Model    | 1 Series M | 100 | 370z
Year      |      2011 | 1992 | 2016
Engine Fuel Type|  premium unleaded (required) | regular unleaded | premium unleaded (required)
Engine HP        |   335 |172 | 332
Engine Cylinders  |   6 | 6 | 6
Transmission Type  |   MANUAL | MANUAL | MANUAL
Driven_Wheels      |  rear wheel drive | all wheel drive | rear wheel drive
Number of Doors    |    2 | 4 | 2
Market Category    |  Factory Tuner,Luxury,High-Performance | Luxury | High-Performance
Vehicle Size       |   Compact | Midsize | Compact
Vehicle Style      |    Coupe | Sedan | Coupe
highway MPG        |     26 | 21 | 26
city mpg           |     19 | 16 | 18
Popularity         |   3916 | 3105 | 2009

You may use a subset of the car features to train and predict prices. We will use the data set in the 'carsdataset.csv' file in the GitHub repository to train the model. This data set was downloaded from Kaggle. $^{[1]}$


[1] https://www.kaggle.com/CooperUnion/cardataset/data

To read and view a specific row of the data set, use the following code:

In [0]:
import pandas as pd
df_cars = pd.read_csv("./MachineLearning/1_Regression/carsdataset.csv")
row=100
print(df_cars.loc[row,:])

In [0]:
## check null values 
pd.isnull(df_cars).sum()

In [0]:
print('dim before drop null ',df_cars.shape)
df_cars= df_cars.dropna()
print('dim after drop null ',df_cars.shape)

In [0]:
col = 'Make'
df_cars.groupby([col])[col].count()

In [0]:
df_cars.info()

In [0]:
df_cars.select_dtypes(include='object').columns

In [0]:
cat_cols = df_cars.select_dtypes(include='object').columns ## categorical (text) columns
dm_cars = pd.get_dummies(data=df_cars[cat_cols], columns=cat_cols)
dm_cars.columns

In [0]:
dm_cars

In [0]:
df_cars_ex = df_cars.select_dtypes(exclude=['object'])
df_merge_cars = pd.concat([df_cars_ex,dm_cars],axis='columns')
df_merge_cars.shape

In [0]:
df_merge_cars.columns


In [0]:
df_merge_cars.select_dtypes(include='object').columns

In [0]:
x= df_merge_cars.drop(['MSRP'], axis=1)
y= df_merge_cars[['MSRP']]

In [0]:
y

In [0]:
from sklearn import linear_model
regm = linear_model.LinearRegression()
regm.fit(x,y)
print(regm.coef_) ## print the coefficients
print(regm.intercept_) ## print the intercept

In [0]:
regm.score(x,y)

In [0]:
data = [['BMW','1 Series M',2011,'premium unleaded (required)',335,6,'MANUAL','rear wheel drive',2,'Factory Tuner,Luxury,High-Performance','Compact','Coupe',26,19,3916],
        ['Audi','100',1992,'regular unleaded',172,6,'MANUAL','all wheel drive',4,'Luxury','Midsize','Sedan',21,16,3105],
        ['Nissan','370z',2016,'premium unleaded (required)',332,6,'MANUAL','rear wheel drive',2,'High-Performance','Compact','Coupe',26,18,2009]]

df_data = pd.DataFrame(data, columns = ['Make','Model','Year','Engine Fuel Type','Engine HP','Engine Cylinders','Transmission Type','Driven_Wheels','Number of Doors','Market Category','Vehicle Size','Vehicle Style','highway MPG','city mpg','Popularity'])
df_data
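
A new sample must be encoded into the same columns, in the same order, as the training data before it can be passed to the model. One possible way to do this is sketched below on a toy dataset (the values are made up and only two features are used; this is an illustration under those assumptions, not the exercise solution). After one-hot encoding the new rows, `reindex` adds the missing dummy columns as zeros and restores the training column order.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy training data (made-up values, an assumption for illustration).
toy_cars = pd.DataFrame({
    "Make": ["BMW", "Audi", "Nissan", "BMW", "Audi", "Nissan"],
    "Engine HP": [300, 200, 330, 320, 170, 350],
    "MSRP": [50000, 30000, 40000, 52000, 28000, 42000],
})
x_toy = pd.get_dummies(toy_cars.drop(columns="MSRP"), columns=["Make"], dtype=int)
y_toy = toy_cars["MSRP"]
toy_reg = LinearRegression().fit(x_toy, y_toy)

# A new car: encoding it alone produces only the 'Make_Nissan' dummy column.
new_car = pd.DataFrame({"Make": ["Nissan"], "Engine HP": [332]})
x_new = pd.get_dummies(new_car, columns=["Make"], dtype=int)

# Add the dummy columns seen in training (filled with 0) and match their order.
x_new = x_new.reindex(columns=x_toy.columns, fill_value=0)
print(x_new)

pred = toy_reg.predict(x_new)
print(pred)
```

Note that a category that never appeared in the training data (for example, an unseen car model) has no training column, so `reindex` silently drops it and the prediction ignores that information.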


# Model Accuracy

In this section, we will learn how to measure the accuracy of a model. This requires splitting the available dataset into a training dataset and a testing dataset. The training dataset will be used to derive the coefficients of the model, whereas the testing dataset will be used to measure the model accuracy, which is sometimes referred to as the testing accuracy.

In this section, we will use part of the cars dataset used in the exercise above. This dataset is stored in the 'carsdataset_short.csv' in the repository.

To load the dataset, we will use the pandas library as before.

In [0]:
import pandas as pd
cars = pd.read_csv("./MachineLearning/1_Regression/carsdataset_short.csv")
print(cars)

Now we will use one-hot encoding to represent the car make as follows:

In [0]:
CarMake = pd.get_dummies(cars.Make)
cars_merge = pd.concat([cars, CarMake], axis=1)
x = cars_merge.drop(['Make','price','Mercedes-Benz'],axis=1)
y = cars_merge.price
print(x)
print(y)

Now, we need to split the dataset into training and testing datasets. We will use 80% of the dataset for training and the rest will be used for testing.

In [0]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.2)
print(len(x))
print(len(x_train))
print(len(x_test))


Next, we will train the linear regression model using the training dataset as follows:

In [0]:
from sklearn import linear_model
regm = linear_model.LinearRegression()
regm.fit(x_train,y_train)
print(regm.coef_) ## print the coefficients
print(regm.intercept_) ## print the intercept

We will now use the model to predict the prices of the test dataset:

In [0]:
price_test = regm.predict(x_test)
print(price_test)

To combine the actual prices of the test dataset and the predicted prices for inspection, use the following:

In [0]:
y_test_pred = y_test.to_frame().copy() ## Series -> DataFrame with a 'price' column; the copy leaves y_test unchanged
y_test_pred['pprice'] = price_test
y_test_pred['difference'] = y_test_pred['price'] - y_test_pred['pprice']
print(y_test_pred)

The model accuracy can be obtained as follows:

In [0]:
#training accuracy
Acc_train = regm.score(x_train,y_train)
print(Acc_train)
#testing accuracy
Acc_test = regm.score(x_test,y_test)
print(Acc_test)

Typically, the training accuracy is greater than the testing accuracy, although with a small dataset the result depends on which rows the random split places in the test set. A high training accuracy means that the model fits the training data very well, and a high testing accuracy means that the model generalizes to other samples or datasets.
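
Because `train_test_split` shuffles the rows randomly, the reported accuracies change from run to run. The sketch below uses synthetic data (an assumption for illustration, not the cars dataset) and a hypothetical helper named `split_and_score` to repeat the 80/20 split with several fixed `random_state` values and print the training and testing scores for each.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration): 3 features, linear target plus noise.
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(40, 3))
y_toy = X_toy @ np.array([1.5, -2.0, 0.5]) + rng.normal(0, 2.0, size=40)

def split_and_score(seed):
    """Split 80/20 with a fixed seed, fit, and return (train score, test score)."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_toy, y_toy, test_size=0.2, random_state=seed
    )
    model = LinearRegression().fit(X_tr, y_tr)
    return model.score(X_tr, y_tr), model.score(X_te, y_te)

# Repeat the split with five different seeds and compare the scores.
results = {seed: split_and_score(seed) for seed in range(5)}
for seed, (train_score, test_score) in results.items():
    print(f"seed={seed}  train R^2={train_score:.3f}  test R^2={test_score:.3f}")
```

Fixing `random_state` makes a particular split reproducible, while comparing several seeds gives a feel for how much the testing score depends on the split itself.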