<a href="https://colab.research.google.com/github/mkjubran/MachineLearningNotebooks/blob/master/Regression_Part2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Clone the Source GitHub Repository
We need to clone some source files from a GitHub repository; they are used throughout this tutorial.

In [0]:
!rm -rf ./MachineLearning
!git clone https://github.com/mkjubran/MachineLearning.git

# One Hot Encoding
**Introduction**

In this section, we will apply multiple linear regression to categorical data.

We will use **one-hot encoding** to convert nominal categorical variables into a numeric form that can be provided to ML algorithms such as linear regression.

**Reading and Resources** \\

[1] https://hackernoon.com/what-is-one-hot-encoding-why-and-when-do-you-have-to-use-it-e3c6186d008f

[2] https://analyticstraining.com/understanding-dummy-variable-traps-regression/

**Implementation**

Read the input data from a csv file called "FamilyCitySpendings.csv" \\
To read the data in the file, we will be using the pandas library (https://pandas.pydata.org/).

In [0]:
import pandas as pd
df = pd.read_csv("./MachineLearning/1_Regression/FamilyCitySpendings.csv")
df.head()

As can be seen, one of the fields (`City`) contains a nominal categorical variable, so we need to encode this field as numeric values using one-hot encoding. We will use `pd.get_dummies(df.City)` as follows:

In [0]:
dm = pd.get_dummies(df.City)
dm.head()

After executing the above command, we get a table with one indicator column per city: each row has a 1 in the column of its city and 0 elsewhere. Now we need to concatenate these columns to the original dataframe (`df`).

In [0]:
df_merge = pd.concat([df,dm],axis='columns')
df_merge.head()
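The two steps above can be tried on a small table first. The following is a minimal sketch on hypothetical toy data (the rows and the `Kids` values are made up for illustration; only the city names mirror the tutorial's dataset).

```python
import pandas as pd

# Hypothetical toy data; only the city names mirror the tutorial's dataset
toy = pd.DataFrame({
    "City": ["Ramallah", "Jerusalem", "Nablus", "Ramallah"],
    "Kids": [1, 2, 0, 3],
})

# dtype=int forces 0/1 columns (recent pandas versions default to True/False)
dummies = pd.get_dummies(toy["City"], dtype=int)

# Append the indicator columns to the right of the original columns
toy_merged = pd.concat([toy, dummies], axis="columns")
print(toy_merged)

# Every row has exactly one "hot" column
print(dummies.sum(axis=1).tolist())
```

Note that `get_dummies` orders the new columns alphabetically by category, not by order of appearance in the data.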

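Now we fit the multiple regression model. Note that we pass 'Annual Income', 'Working', 'Kids', and only two of the three city dummy variables ('Jerusalem' and 'Ramallah') to train the model. The third city (Nablus) is implied when both dummies are 0; including all three would make the dummy columns redundant with the intercept, which is the dummy variable trap discussed in [2].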

In [0]:
from sklearn import linear_model
regm = linear_model.LinearRegression()
regm.fit(df_merge[['Annual Income', 'Working', 'Kids', 'Jerusalem', 'Ramallah']], df_merge[['Spendings']])
print(regm.coef_) ## print the coefficients
print(regm.intercept_) ## print the intercept

Alternatively, we could clean the dataframe by dropping the fields that are not needed and then define the input variables of the model as follows:

In [0]:
x= df_merge.drop(['Nablus','City','Spendings'],axis=1)
print(x.head())
y = df_merge[['Spendings']]
print(y.head())

To train the model using x and y:

In [0]:
regm.fit(x,y)
print(regm.coef_) ## print the coefficients
print(regm.intercept_) ## print the intercept
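Only two of the three city columns are used as model inputs, and the 'Nablus' column is dropped. A rough way to see why, sketched here on a hypothetical list of cities: the three dummy columns always sum to 1, which duplicates the constant column that the intercept represents, so one column carries no new information.

```python
import numpy as np
import pandas as pd

# Hypothetical city column
cities = pd.Series(["Ramallah", "Jerusalem", "Nablus", "Jerusalem", "Ramallah"])
d = pd.get_dummies(cities, dtype=float)

# Constant column that corresponds to the model's intercept
intercept = np.ones((len(d), 1))

# Intercept plus all three dummies: 4 columns, but they are linearly dependent
full = np.hstack([intercept, d.to_numpy()])
rank_full = np.linalg.matrix_rank(full)

# Dropping one dummy (Nablus becomes the baseline) removes the dependency
reduced = np.hstack([intercept, d[["Jerusalem", "Ramallah"]].to_numpy()])
rank_reduced = np.linalg.matrix_rank(reduced)

print(full.shape[1], rank_full)
print(reduced.shape[1], rank_reduced)
```

When the rank is lower than the number of columns, the coefficients are not uniquely determined, which makes them hard to interpret.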

The model is now ready. Suppose we want to estimate the spendings of a family from Ramallah with an annual income of \$20000, two persons working, and one kid. We first create a new dataframe:

In [0]:
x_ = pd.DataFrame(index=None,columns=None)
x_['Annual Income'] = [20000]
x_['Working'] = [2]
x_['Kids'] = [1]
x_['Jerusalem'] = [0]
x_['Ramallah'] = [1]
x_.head()

Then pass the new dataframe to the regression model:

In [0]:
regm.predict(x_)

Alternatively, we could pass the family data directly to the regressor as a nested list:

In [0]:
regm.predict([[20000,2,1,0,1]])

**Is the order in which we pass the data to the regressor important?**
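One way to explore this question is with a tiny model whose coefficients are known in advance. The sketch below uses hypothetical features `a` and `b` and a made-up target `y = 2*a + 10*b`; none of these names come from the tutorial's dataset.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data in which y = 2*a + 10*b holds exactly
train = pd.DataFrame({"a": [1, 2, 3, 4], "b": [1, 0, 1, 1]})
target = 2 * train["a"] + 10 * train["b"]

feature_order = list(train.columns)  # ['a', 'b']
model = LinearRegression().fit(train.to_numpy(), target.to_numpy())

# A nested list is matched to the coefficients by position only
right = model.predict([[5, 1]])[0]  # a=5, b=1
wrong = model.predict([[1, 5]])[0]  # same numbers, swapped positions
print(round(right, 2), round(wrong, 2))

# Safer: name the values, then arrange them in the training column order
sample = {"b": 1, "a": 5}
row = [[sample[name] for name in feature_order]]
print(round(model.predict(row)[0], 2))
```

The two predictions differ even though the same numbers are supplied, so values in a plain list must follow the column order used during training.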

The spendings of a family from Ramallah with \$20000 annual income, 2 persons working, and 1 kid are about \$187543. Let us next compare this with the same family living in other cities. Use the city codes from the one-hot encoding shown in output cell [37].

In [0]:
## City codes in the column order [Jerusalem, Ramallah]
## Ramallah = [0 1]
## Jerusalem = [1 0]
## Nablus = [0 0]

x_=[[20000,2,1,0,1],[20000,2,1,1,0],[20000,2,1,0,0]]
regm.predict(x_)

What about the accuracy of the model? We can view it by printing the score, which for `LinearRegression` is the coefficient of determination $R^2$:

In [0]:
regm.score(x,y)

The accuracy of the model is about $98.899\%$. This is called the training accuracy because the same data used for training is used to compute it. A high training accuracy means the model fits the training data very well, which suggests that the relationship in the training data is close to linear. However, we also need to measure how well the model predicts for new data not used for training. This will be discussed in the next section.
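The value returned by `score` for a `LinearRegression` model is the coefficient of determination, $R^2 = 1 - SS_{res}/SS_{tot}$, where $SS_{res}$ is the sum of squared residuals and $SS_{tot}$ is the sum of squared deviations of the target from its mean. The sketch below reproduces it by hand on synthetic data (the data and coefficients are made up for illustration).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: a linear relationship plus noise (hypothetical values)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 1.0, size=50)

model = LinearRegression().fit(X, y)
y_hat = model.predict(X)

ss_res = np.sum((y - y_hat) ** 2)     # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)  # total sum of squares
r2_manual = 1.0 - ss_res / ss_tot

# Both values should agree up to floating-point error
print(r2_manual, model.score(X, y))
```

Because it is a ratio of sums of squares, the score is unitless and at most 1.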

**Exercise**

Use multiple linear regression to estimate the prices of the following cars:

Specifications | Car #1 | Car #2 | Car #3
-- | --- | --- | ---
Make    |  BMW | Audi | Nissan
Model    | 1 Series M | 100 | 370z
Year      |      2011 | 1992 | 2016
Engine Fuel Type|  premium unleaded (required) | regular unleaded | premium unleaded (required)
Engine HP        |   335 |172 | 332
Engine Cylinders  |   6 | 6 | 6
Transmission Type  |   MANUAL | MANUAL | MANUAL
Driven_Wheels      |  rear wheel drive | all wheel drive | rear wheel drive
Number of Doors    |    2 | 4 | 2
Market Category    |  Factory Tuner,Luxury,High-Performance | Luxury | High-Performance
Vehicle Size       |   Compact | Midsize | Compact
Vehicle Style      |    Coupe | Sedan | Coupe
highway MPG        |     26 | 21 | 26
city mpg           |     19 | 16 | 18
Popularity         |   3916 | 3105 | 2009

You may use a subset of the car features to train the model and predict prices. We will use the data set in the 'carsdataset.csv' file in the GitHub repository to train the model. This data set was downloaded from Kaggle. $^{[1]}$


[1] https://www.kaggle.com/CooperUnion/cardataset/data

To read and view a specific row of the data set, use the following code:

In [0]:
import pandas as pd
df_cars = pd.read_csv("./MachineLearning/1_Regression/carsdataset.csv")
row=100
print(df_cars.loc[row,:])
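Several of the car features are categorical, so more than one column may need one-hot encoding. As a hedged hint rather than a solution, the sketch below encodes two columns at once on hypothetical rows; the column names follow the exercise table, but the values are made up.

```python
import pandas as pd

# Hypothetical rows; column names follow the exercise table
toy_cars = pd.DataFrame({
    "Make": ["BMW", "Audi", "Nissan", "Audi"],
    "Transmission Type": ["MANUAL", "AUTOMATIC", "MANUAL", "MANUAL"],
    "Engine HP": [335, 172, 332, 200],
})

encoded = pd.get_dummies(
    toy_cars,
    columns=["Make", "Transmission Type"],  # columns to encode
    drop_first=True,  # drop one dummy per encoded column as the baseline
    dtype=int,
)
print(encoded.columns.tolist())
print(encoded)
```

With `drop_first=True`, pandas drops the alphabetically first category of each encoded column, which plays the same role as dropping one city column by hand.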

# Model Accuracy

In this section, we will learn how to measure the accuracy of a model. This requires splitting the available dataset into a training dataset and a testing dataset. The training dataset is used to derive the coefficients of the model, whereas the testing dataset is used to measure the model accuracy, which is sometimes referred to as the testing accuracy.

In this section, we will use part of the cars dataset used in the exercise above. This dataset is stored in the 'carsdataset_short.csv' file in the repository.

To load the dataset, we will use the pandas library as before.

In [0]:
import pandas as pd
cars = pd.read_csv("./MachineLearning/1_Regression/carsdataset_short.csv")
print(cars)

Now we will use one-hot encoding to represent the car make as follows:

In [0]:
CarMake = pd.get_dummies(cars.Make)
cars_merge = pd.concat([cars, CarMake], axis=1)
x = cars_merge.drop(['Make','price','Mercedes-Benz'],axis=1)
y = cars_merge.price
print(x)
print(y)

Now, we need to split the dataset into training and testing datasets. We will use 80% of the dataset for training and the rest for testing.

In [0]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.2)
print(len(x))
print(len(x_train))
print(len(x_test))
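Since `train_test_split` shuffles the rows randomly, each run of the cell above produces a different split and therefore slightly different results. A minimal sketch of making the split repeatable with the `random_state` parameter, using hypothetical arrays in place of the cars data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # hypothetical features, 10 samples
y = np.arange(10)                 # hypothetical targets

# The same random_state yields the same split on every call
a_train, a_test, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
b_train, b_test, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)

print(len(a_train), len(a_test))
print(np.array_equal(a_test, b_test))
```

The value 42 is arbitrary; any fixed integer gives a reproducible split.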


Next, we will train the linear regression model using the training dataset:

In [0]:
from sklearn import linear_model
regm = linear_model.LinearRegression()
regm.fit(x_train,y_train)
print(regm.coef_) ## print the coefficients
print(regm.intercept_) ## print the intercept

We will now use the model to predict the prices of the test dataset:

In [0]:
price_test = regm.predict(x_test)
print(price_test)

To view the actual and predicted prices of the test dataset side by side, use the following:

In [0]:
y_test_pred = y_test.copy()  # copy so the original test labels stay untouched
y_test_pred = y_test_pred.to_frame()  # Series -> one-column dataframe named 'price'
y_test_pred['pprice'] = price_test
y_test_pred['difference'] = y_test_pred['price'] - y_test_pred['pprice']
print(y_test_pred)
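The 'difference' column can also be condensed into a single number. One common choice, shown here as an optional sketch on hypothetical prices, is the mean absolute error, which is expressed in the same units as the price.

```python
import pandas as pd
from sklearn.metrics import mean_absolute_error

# Hypothetical actual ('price') and predicted ('pprice') values
results = pd.DataFrame({
    "price": [20000, 35000, 15000],
    "pprice": [22000, 34000, 15500],
})
results["difference"] = results["price"] - results["pprice"]

# Average size of the error, ignoring its sign
mae_manual = results["difference"].abs().mean()
mae_sklearn = mean_absolute_error(results["price"], results["pprice"])
print(mae_manual, mae_sklearn)
```

Taking the absolute value first keeps positive and negative errors from cancelling each other out.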

The model accuracy can be obtained as follows

In [0]:
#training accuracy
Acc_train = regm.score(x_train,y_train)
print(Acc_train)
#testing accuracy
Acc_test = regm.score(x_test,y_test)
print(Acc_test)

The training accuracy is usually greater than the testing accuracy, although with a small dataset and a random split the result varies from run to run. A high training accuracy means that the model fits the training data very well, and a high testing accuracy means that the model generalizes to other samples or datasets.