<a href="https://colab.research.google.com/github/mkjubran/MachineLearning/blob/master/Regression_Part2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Clone the Source GitHub Repository
We need to clone the source files used throughout this tutorial from a GitHub repository.

In [0]:
!rm -rf ./MachineLearning
!git clone https://github.com/mkjubran/MachineLearning.git

# One Hot Encoding
**Introduction**

In this section, we will apply multiple linear regression to categorical data.

We will use **one hot encoding** to convert nominal categorical variables into a form that can be provided to ML algorithms for linear regression.

**Theory** \\

One hot encoding is a process by which nominal categorical variables are converted into a form that could be provided to ML algorithms to do a better job in prediction.[1]

Suppose the dataset is as follows:

City | Area| Price
--- | --- | ---
Jerusalem | 160 | 550000
Jerusalem | 200 | 600000
Jerusalem | 250 | 620000
Ramallah | 160 | 200000
Ramallah | 200 | 220000
Ramallah | 240 | 300000
Nablus | 160 | 150000
Nablus | 230 | 180000
Bethlehem | 160 | 160000
Bethlehem | 210 | 180000


We need to encode the names of the cities before passing this data into a machine learning model. This can be achieved through integer encoding as follows:

City | Code | Area| Price
--- | --- | --- | ---
Jerusalem |0| 160 | 550000
Jerusalem |0| 200 | 600000
Jerusalem |0| 250 | 620000
Ramallah  |1| 160 | 200000
Ramallah  |1| 200 | 220000
Ramallah  |1| 240 | 300000
Nablus    |2| 160 | 150000
Nablus    |2| 230 | 180000
Bethlehem |3| 160 | 160000
Bethlehem |3| 210 | 180000
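
The integer codes in the table above can be reproduced with a short pandas sketch. It assumes the city column is held in a `pd.Series` (the name `cities` is illustrative and not part of the tutorial files) and uses `pd.factorize`, which numbers the categories in order of first appearance.

```python
import pandas as pd

# City column from the example table (illustrative stand-in for a real dataset)
cities = pd.Series(
    ["Jerusalem"] * 3 + ["Ramallah"] * 3 + ["Nablus"] * 2 + ["Bethlehem"] * 2
)

# factorize returns one integer code per row plus the unique labels
codes, labels = pd.factorize(cities)
print(codes.tolist())  # integer code for each row
print(list(labels))    # label that each code stands for
```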


However, an ML model might interpret these codes as magnitudes, for example that Nablus (2) is double Ramallah (1) or that Bethlehem (3) is triple Ramallah. The city variable is nominal (its values don't exhibit any order, unlike ordinal variables), so such comparisons are meaningless. Instead, we use **one hot coding** as follows:

City | Jerusalem | Ramallah | Nablus| Bethlehem | Area| Price
--- | --- | --- | --- | --- | --- | ---
Jerusalem |1|0|0|0| 160 | 550000
Jerusalem |1|0|0|0| 200 | 600000
Jerusalem |1|0|0|0| 250 | 620000
Ramallah  |0|1|0|0| 160 | 200000
Ramallah  |0|1|0|0| 200 | 220000
Ramallah  |0|1|0|0| 240 | 300000
Nablus    |0|0|1|0| 160 | 150000
Nablus    |0|0|1|0| 230 | 180000
Bethlehem |0|0|0|1| 160 | 160000
Bethlehem |0|0|0|1| 210 | 180000

As can be seen, four independent variables (dummy variables) are created: Jerusalem, Ramallah, Nablus, and Bethlehem. Each dummy variable is "1" for rows that belong to its city and "0" otherwise.
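
A one hot table like the one above can be produced with `pd.get_dummies`. The sketch below rebuilds the example rows by hand so that it is self-contained (the name `df_example` is illustrative). Note that pandas orders the dummy columns alphabetically, so the column order differs from the table.

```python
import pandas as pd

# Example rows copied from the table in this section
df_example = pd.DataFrame({
    "City": ["Jerusalem"] * 3 + ["Ramallah"] * 3 + ["Nablus"] * 2 + ["Bethlehem"] * 2,
    "Area": [160, 200, 250, 160, 200, 240, 160, 230, 160, 210],
    "Price": [550000, 600000, 620000, 200000, 220000, 300000,
              150000, 180000, 160000, 180000],
})

# One 0/1 column per city; dtype=int gives 0/1 instead of False/True
dummies = pd.get_dummies(df_example["City"], dtype=int)

# Place the dummy columns next to the original columns
encoded = pd.concat([df_example, dummies], axis="columns")
print(encoded)
```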

Before passing this table to the ML model, we need to remove one of the city columns because it is redundant and causes what is called the **dummy variable trap**$^{[2]}$. The dummy variable trap occurs when one dummy variable can be predicted from the other dummy variables. Say we remove the dummy variable Ramallah: if none of the remaining dummy variables (Jerusalem, Nablus, and Bethlehem) is "1", then the model infers that the city is Ramallah. Reducing the dimensionality of the dataset also reduces the complexity and time of training the model. To read further about one hot coding you may refer to [1].

[1] https://hackernoon.com/what-is-one-hot-encoding-why-and-when-do-you-have-to-use-it-e3c6186d008f
[2] https://analyticstraining.com/understanding-dummy-variable-traps-regression/
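
The dummy variable trap can be checked numerically. Because the four dummy columns always sum to 1, they are perfectly collinear with the intercept column of a linear model. The sketch below, which uses illustrative names and the city column of the example table, compares the number of columns with the matrix rank before and after dropping Ramallah.

```python
import numpy as np
import pandas as pd

# City column from the example table
cities = pd.Series(
    ["Jerusalem"] * 3 + ["Ramallah"] * 3 + ["Nablus"] * 2 + ["Bethlehem"] * 2
)
dummies = pd.get_dummies(cities, dtype=float)

# Column of ones that plays the role of the regression intercept
intercept = np.ones((len(dummies), 1))

# Design matrix with all four dummies, and with Ramallah removed
X_all = np.hstack([intercept, dummies.to_numpy()])
X_dropped = np.hstack([intercept, dummies.drop(columns="Ramallah").to_numpy()])

rank_all = np.linalg.matrix_rank(X_all)
rank_dropped = np.linalg.matrix_rank(X_dropped)

# A rank below the column count means one column is redundant
print("all dummies:     ", X_all.shape[1], "columns, rank", rank_all)
print("Ramallah dropped:", X_dropped.shape[1], "columns, rank", rank_dropped)
```

With all four dummies the rank is one less than the number of columns. After dropping one dummy the matrix has full column rank.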

**Implementation**

Read the input data from a csv file called "homeprices_OHE.csv" \\
To read the data in the file, we will be using the pandas library (https://pandas.pydata.org/).

In [0]:
import pandas as pd
df = pd.read_csv("./MachineLearning/1_Regression/homeprices_OHE.csv")
print(df)

As can be seen, one of the fields (city) contains a nominal categorical variable. Thus we need to encode this field into numeric values using one-hot coding. We will use the pd.get_dummies(df.city) method as follows:

In [0]:
dm = pd.get_dummies(df.city)
dm

After executing the above command, we get a table with one dummy column per city. Now we need to concatenate these columns to the original (df) dataframe.

In [0]:
df_merge = pd.concat([df,dm],axis='columns')
print(df_merge)

Now we need to get the multiple regression model. Note we pass the area and the three city dummy variables to train the model. 

In [0]:
from sklearn import linear_model
regm = linear_model.LinearRegression()
regm.fit(df_merge[['area','Bethlehem','Jerusalem','Nablus']],df_merge.price)
print(regm.coef_) ## print the coefficients
print(regm.intercept_) ## print the intercept
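
To make the fitted numbers easier to read, the coefficients can be paired with their feature names. The sketch below is self-contained: it rebuilds the ten rows of the Theory table as a stand-in for `homeprices_OHE.csv` (an assumption; the lowercase column names mirror those used in the cells above, and `df_demo`, `features`, and `model` are illustrative names). Since Ramallah is the omitted dummy, each city coefficient is that city's price offset relative to Ramallah at the same area.

```python
import pandas as pd
from sklearn import linear_model

# Stand-in data: the ten rows of the Theory table (assumed to match the CSV)
df_demo = pd.DataFrame({
    "city": ["Jerusalem"] * 3 + ["Ramallah"] * 3 + ["Nablus"] * 2 + ["Bethlehem"] * 2,
    "area": [160, 200, 250, 160, 200, 240, 160, 230, 160, 210],
    "price": [550000, 600000, 620000, 200000, 220000, 300000,
              150000, 180000, 160000, 180000],
})
df_demo = pd.concat(
    [df_demo, pd.get_dummies(df_demo.city, dtype=int)], axis="columns"
)

# Ramallah is left out, so it becomes the baseline city
features = ["area", "Bethlehem", "Jerusalem", "Nablus"]
model = linear_model.LinearRegression()
model.fit(df_demo[features], df_demo.price)

# Label each coefficient with the feature it belongs to
coefs = pd.Series(model.coef_, index=features)
print(coefs)
print("intercept:", model.intercept_)
```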

Alternatively, we could clean the data frame by dropping the fields that are not needed and then define the input variables to the model as follows:

In [0]:
x = df_merge.drop(['Ramallah','city','price'],axis=1) ## keep area and the three city dummies
print(x)
y = df_merge.price ## target variable
print(y)

To train the model using x and y:

In [0]:
regm.fit(x,y)
print(regm.coef_) ## print the coefficients
print(regm.intercept_) ## print the intercept

The model is now ready. To estimate the price of a new house in Ramallah with an area of 190 $m^2$, we apply it to the model as follows:

In [0]:
regm.predict([[190,0,0,0]])

So the price of such a house is about $232112. Let us next compare the prices of houses of the same size (area) in different cities. Use the city codes from the one hot coding shown in output cell [37]; the inputs are ordered as area, Bethlehem, Jerusalem, Nablus.

In [0]:
x_new=[[190,1,0,0],[190,0,1,0],[190,0,0,1],[190,0,0,0]]
regm.predict(x_new)

So the prices of houses with an area of 190 $m^2$ are as follows:

City | Price
--- | ---
Bethlehem | 173943.76899694 
Jerusalem | 579483.28267475
Nablus | 161056.23100302
Ramallah | 232112.46200606
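
Writing the dummy codes by hand is easy to get wrong if the column order changes. One alternative, sketched below under the assumption that the training data matches the ten rows of the Theory table, is to build the new inputs as a DataFrame and reindex the dummy columns to the training features. The names `df_demo`, `features`, `model`, and `new_houses` are illustrative.

```python
import pandas as pd
from sklearn import linear_model

# Stand-in training data: the ten rows of the Theory table
df_demo = pd.DataFrame({
    "city": ["Jerusalem"] * 3 + ["Ramallah"] * 3 + ["Nablus"] * 2 + ["Bethlehem"] * 2,
    "area": [160, 200, 250, 160, 200, 240, 160, 230, 160, 210],
    "price": [550000, 600000, 620000, 200000, 220000, 300000,
              150000, 180000, 160000, 180000],
})
df_demo = pd.concat(
    [df_demo, pd.get_dummies(df_demo.city, dtype=int)], axis="columns"
)

features = ["area", "Bethlehem", "Jerusalem", "Nablus"]  # Ramallah is the baseline
model = linear_model.LinearRegression()
model.fit(df_demo[features], df_demo.price)

# New houses described by city name instead of hand-written codes
new_houses = pd.DataFrame({
    "city": ["Bethlehem", "Jerusalem", "Nablus", "Ramallah"],
    "area": [190, 190, 190, 190],
})
new_x = pd.concat(
    [new_houses[["area"]], pd.get_dummies(new_houses.city, dtype=int)],
    axis="columns",
)
# Keep only the training features, in the training order
new_x = new_x.reindex(columns=features, fill_value=0)

predicted = pd.Series(model.predict(new_x), index=new_houses["city"].tolist())
print(predicted.round(2))
```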

**Exercise**

Use multiple linear regression to estimate the prices of the following cars:

Specifications | Car #1 | Car #2 | Car #3
-- | --- | --- | ---
Make    |  BMW | Audi | Nissan
Model    | 1 Series M | 100 | 370z
Year      |      2011 | 1992 | 2016
Engine Fuel Type|  premium unleaded (required) | regular unleaded | premium unleaded (required)
Engine HP        |   335 |172 | 332
Engine Cylinders  |   6 | 6 | 6
Transmission Type  |   MANUAL | MANUAL | MANUAL
Driven_Wheels      |  rear wheel drive | all wheel drive | rear wheel drive
Number of Doors    |    2 | 4 | 2
Market Category    |  Factory Tuner,Luxury,High-Performance | Luxury | High-Performance
Vehicle Size       |   Compact | Midsize | Compact
Vehicle Style      |    Coupe | Sedan | Coupe
highway MPG        |     26 | 21 | 26
city mpg           |     19 | 16 | 18
Popularity         |   3916 | 3105 | 2009

You may use only a subset of the car features to train and predict prices. We will use the data set in the 'carsdataset.csv' file in the GitHub repository to train the model. This data set was downloaded from Kaggle.$^{[1]}$


[1] https://www.kaggle.com/CooperUnion/cardataset/data

To read and view a specific row of the data set, use the following code:

In [0]:
import pandas as pd
df_cars = pd.read_csv("./MachineLearning/1_Regression/carsdataset.csv")
row=100
print(df_cars.loc[row,:])