## One_Hot_Encoding
Most of the time a dataset contains some text data. ML models work on numbers, not text, so the text has to be converted into numbers first. There are several ways to do this. One is simply to assign an integer (1, 2, 3, 4, ...) to each category. That approach is not ideal for categories with no natural order, because the model may read the integers as a ranking or magnitude that does not exist.

A better method is one hot encoding. You create a new column for each category and fill it with binary values: 1 where the row belongs to that category and 0 otherwise.
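As a minimal sketch of the idea (plain Python, not the method used in the rest of this notebook), the snippet below builds one 0/1 column per category by hand. The `towns` list is a made-up sample for illustration.

```python
# Hypothetical sample of a categorical column (illustration only).
towns = ["monroe township", "west windsor", "robinsville", "west windsor"]

# One new column per distinct category, in sorted order.
categories = sorted(set(towns))

# Each row gets a 1 in the column of its own category and 0 everywhere else.
one_hot = [[1 if town == category else 0 for category in categories]
           for town in towns]

for town, row in zip(towns, one_hot):
    print(f"{town:16} -> {row}")
```

Every row contains exactly one 1, which is where the name "one hot" comes from.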

* So let's do One Hot Encoding ...

In [127]:
# Required modules ...
import pandas as pd

In [128]:
# Let's read the dataset
df = pd.read_csv("homeprices.csv")
df

Unnamed: 0,town,area,price
0,monroe township,2600,550000
1,monroe township,3000,565000
2,monroe township,3200,610000
3,monroe township,3600,680000
4,monroe township,4000,725000
5,west windsor,2600,585000
6,west windsor,2800,615000
7,west windsor,3300,650000
8,west windsor,3600,710000
9,robinsville,2600,575000


In [129]:
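# As we can see, the 'town' column contains text data, so it needs conversion.
# Pandas has a 'get_dummies()' function that returns dummy variable columns.
# dtype=int keeps the output as 0/1 (newer pandas versions return True/False by default).
dummies = pd.get_dummies(df.town, dtype=int)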
dummies

Unnamed: 0,monroe township,robinsville,west windsor
0,1,0,0
1,1,0,0
2,1,0,0
3,1,0,0
4,1,0,0
5,0,0,1
6,0,0,1
7,0,0,1
8,0,0,1
9,0,1,0


The 'town' column has three distinct towns, so we get three dummy columns. If it had 10 distinct towns, get_dummies would return 10 columns.

In [130]:
# The next step is to concatenate the dummies DataFrame with the original DataFrame:
merged = pd.concat([df, dummies], axis = "columns")
merged

Unnamed: 0,town,area,price,monroe township,robinsville,west windsor
0,monroe township,2600,550000,1,0,0
1,monroe township,3000,565000,1,0,0
2,monroe township,3200,610000,1,0,0
3,monroe township,3600,680000,1,0,0
4,monroe township,4000,725000,1,0,0
5,west windsor,2600,585000,0,0,1
6,west windsor,2800,615000,0,0,1
7,west windsor,3300,650000,0,0,1
8,west windsor,3600,710000,0,0,1
9,robinsville,2600,575000,0,1,0


Now that the dummy variables are in the original DataFrame, we no longer need the text column 'town', so we drop it. We also need to drop one of the dummy columns. The reason: whenever one variable can be derived from the rest, the variables are said to be multicollinear, and multicollinearity among the dummy columns creates the dummy variable trap, which can affect an ML model. So the rule is to drop one of the dummy columns. If you have five dummy variables, you drop one of them and use the remaining four.
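To make the redundancy concrete, here is a small NumPy sketch. The five-row `dummies` array is an illustrative sample in the column order monroe township, robinsville, west windsor. It checks that the third column can be derived from the other two and compares the matrix rank with and without that column.

```python
import numpy as np

# Dummy columns in the order: monroe township, robinsville, west windsor.
dummies = np.array([
    [1, 0, 0],
    [1, 0, 0],
    [0, 0, 1],
    [0, 0, 1],
    [0, 1, 0],
])

# The third column is fully determined by the other two.
derived = 1 - dummies[:, 0] - dummies[:, 1]
print(np.array_equal(derived, dummies[:, 2]))

# With an intercept column, keeping all three dummies gives 4 columns
# but fewer independent ones; dropping one dummy removes the redundancy.
intercept = np.ones((dummies.shape[0], 1))
with_all = np.hstack([intercept, dummies])
with_one_dropped = np.hstack([intercept, dummies[:, :2]])

rank_all = np.linalg.matrix_rank(with_all)
rank_dropped = np.linalg.matrix_rank(with_one_dropped)
print(with_all.shape[1], rank_all)
print(with_one_dropped.shape[1], rank_dropped)
```

When the rank is lower than the number of columns, at least one column carries no new information, which is exactly the situation the dummy variable trap describes.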

In [131]:
# We have three dummy variables and drop one of them; any of the three will do.
# So we drop two columns, 'town' and 'west windsor':
final_df = merged.drop(["town", "west windsor"], axis = "columns")
final_df

Unnamed: 0,area,price,monroe township,robinsville
0,2600,550000,1,0
1,3000,565000,1,0
2,3200,610000,1,0
3,3600,680000,1,0
4,4000,725000,1,0
5,2600,585000,0,0
6,2800,615000,0,0
7,3300,650000,0,0
8,3600,710000,0,0
9,2600,575000,0,1


* **Yes!** Now our DataFrame is looking pretty good.
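As an aside, pandas can encode and drop in a single call. The sketch below uses a three-row subset of the data above and assumes `drop_first=True` is acceptable, which drops the first category in sorted order (monroe township) rather than west windsor. `dtype=int` is passed so the dummies print as 0/1.

```python
import pandas as pd

# Three rows taken from the dataset above, for illustration.
sample = pd.DataFrame({
    "town": ["monroe township", "west windsor", "robinsville"],
    "area": [2600, 2800, 2600],
    "price": [550000, 615000, 575000],
})

# Encode 'town', drop the original text column and the first dummy column.
encoded = pd.get_dummies(sample, columns=["town"], drop_first=True, dtype=int)
print(encoded)
```

The resulting dummy columns are prefixed with the source column name, for example `town_robinsville`.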


### Dummy Variable Trap

When one variable can be derived from the other variables, they are said to be multicollinear. Here, if you know the values of monroe township and robinsville, you can infer the value of west windsor: when monroe township=0 and robinsville=0, west windsor must be 1. These town variables are therefore multicollinear. In this situation the coefficients of a linear regression are not uniquely determined, so the model may not behave as expected. Hence you need to drop one column.

**NOTE:** sklearn's LinearRegression still works even if you don't drop one of the town columns. It does not drop a column for you; its least-squares solver tolerates the redundant column and returns one of the equivalent solutions. It is still good practice to handle the dummy variable trap yourself, in case the library you are using does not cope with it.
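A quick way to check this behaviour is to fit LinearRegression twice, once with all three dummy columns and once with one dropped, and compare predictions. This is a sketch using the areas, prices and town codes that appear in this notebook's arrays (0 = monroe township, 1 = robinsville, 2 = west windsor); the variable names are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Town codes, areas and prices as they appear in this notebook's arrays.
town = np.array([0, 0, 0, 0, 0, 2, 2, 2, 2, 1, 1, 1, 1])
area = np.array([2600, 3000, 3200, 3600, 4000, 2600, 2800, 3300, 3600,
                 2600, 2900, 3100, 3600])
price = np.array([550000, 565000, 610000, 680000, 725000, 585000, 615000,
                  650000, 710000, 575000, 600000, 620000, 695000])

dummies = np.eye(3)[town]                            # one column per town
X_all = np.column_stack([dummies, area])             # all three dummies kept
X_dropped = np.column_stack([dummies[:, 1:], area])  # first dummy dropped

model_all = LinearRegression().fit(X_all, price)
model_dropped = LinearRegression().fit(X_dropped, price)

# Same query in both encodings: a home of area 2800 in robinsville.
pred_all = model_all.predict([[0, 1, 0, 2800]])
pred_dropped = model_dropped.predict([[1, 0, 2800]])
print(pred_all, pred_dropped)

# The predictions agree, but the individual coefficients differ.
print(model_all.coef_)
print(model_dropped.coef_)
```

The two models should predict the same price, while the per-town coefficients differ, which is why coefficients from the undropped version are harder to interpret.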

In [98]:
# Now, let's create a linear regression model:
from sklearn.linear_model import LinearRegression
reg = LinearRegression()

In [132]:
# Now we prepare 'X' and 'Y' for training. 'X' is every column except price, because price is the dependent variable.
# To create 'X', we drop the price column:
X = final_df.drop(["price"], axis = "columns")
X

Unnamed: 0,area,monroe township,robinsville
0,2600,1,0
1,3000,1,0
2,3200,1,0
3,3600,1,0
4,4000,1,0
5,2600,0,0
6,2800,0,0
7,3300,0,0
8,3600,0,0
9,2600,0,1


In [133]:
# 'Y' is just the price column:
Y = final_df.price
Y

0     550000
1     565000
2     610000
3     680000
4     725000
5     585000
6     615000
7     650000
8     710000
9     575000
10    600000
11    620000
12    695000
Name: price, dtype: int64

In [134]:
# Now 'X' and 'Y' are ready. The next step is to train the model:
reg.fit(X, Y)

LinearRegression()

In [136]:
# Now the model is trained, let's make a prediction.
# The columns of X are area, 'monroe township' and 'robinsville'. To predict a price in robinsville,
# the first value is the area, the 2nd is 0 and the 3rd is 1.
reg.predict([[2800, 0, 1]])



array([590775.63964739])

In [103]:
# Next we predict a price in 'west windsor'. Its dummy column was dropped, so both dummy values are zero:
reg.predict([[3400, 0, 0]])



array([681241.66845839])

In [104]:
# To see how well the model fits, call the score method with 'X' and 'Y'. It returns R^2, here computed on the training data.
reg.score(X, Y)

0.9573929037221873
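For reference, `score` on a regression model returns the coefficient of determination R^2 rather than a classification-style accuracy. Below is a minimal sketch of that formula with NumPy. The helper name `r_squared` is made up for illustration, and the five prices are the first five from the dataset.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    return 1.0 - ss_res / ss_tot

y = [550000, 565000, 610000, 680000, 725000]

# Perfect predictions explain all of the variation.
print(r_squared(y, y))

# Always predicting the mean explains none of it.
print(r_squared(y, [np.mean(y)] * len(y)))
```

A score close to 1 means the model explains most of the variation in price. Because the score above is computed on the same rows the model was trained on, it says little about how the model performs on unseen data.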

### Using sklearn OneHotEncoder
* Now we'll use sklearn's OneHotEncoder to do the same thing. In this walkthrough we first apply label encoding to the town column with LabelEncoder, so the town categories are converted into integers. (Recent scikit-learn versions of OneHotEncoder also accept string categories directly.)

In [115]:
# So our original DataFrame was:
df

Unnamed: 0,town,area,price
0,monroe township,2600,550000
1,monroe township,3000,565000
2,monroe township,3200,610000
3,monroe township,3600,680000
4,monroe township,4000,725000
5,west windsor,2600,585000
6,west windsor,2800,615000
7,west windsor,3300,650000
8,west windsor,3600,710000
9,robinsville,2600,575000


In [117]:
# Import 'LabelEncoder' and create an encoder:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

In [118]:
# First we make a copy of the DataFrame (so 'df' stays unchanged), then use LabelEncoder to turn the 'town' labels into integer codes:
dfle = df.copy()
dfle.town = le.fit_transform(dfle.town)
dfle

Unnamed: 0,town,area,price
0,0,2600,550000
1,0,3000,565000
2,0,3200,610000
3,0,3600,680000
4,0,4000,725000
5,2,2600,585000
6,2,2800,615000
7,2,3300,650000
8,2,3600,710000
9,1,2600,575000
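To see which integer stands for which town, you can inspect the fitted encoder. A small sketch on a made-up list of towns; `classes_` and `inverse_transform` are part of LabelEncoder, while the other names are illustrative.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of town labels (illustration only).
towns = ["monroe township", "west windsor", "robinsville", "west windsor"]

encoder = LabelEncoder()
codes = encoder.fit_transform(towns)

# classes_ lists the original labels in sorted order;
# the position of a label in that list is its integer code.
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)
print(codes)

# inverse_transform maps integer codes back to the original labels.
print(encoder.inverse_transform([0, 1, 2]))
```

Because the classes are sorted, robinsville gets code 1 and west windsor gets code 2, which matches the table above.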


In [119]:
# Once we have this DataFrame, the next step is to create the 'X' and 'Y' variables.
# We use 'values' so that 'X' is a two-dimensional NumPy array.
X = dfle[["town", "area"]].values
X

array([[   0, 2600],
       [   0, 3000],
       [   0, 3200],
       [   0, 3600],
       [   0, 4000],
       [   2, 2600],
       [   2, 2800],
       [   2, 3300],
       [   2, 3600],
       [   1, 2600],
       [   1, 2900],
       [   1, 3100],
       [   1, 3600]], dtype=int64)

In [120]:
# To create 'y':
Y = dfle.price
Y

0     550000
1     565000
2     610000
3     680000
4     725000
5     585000
6     615000
7     650000
8     710000
9     575000
10    600000
11    620000
12    695000
Name: price, dtype: int64

In [121]:
# Now we create three dummy variable columns from the 'town' codes in column 0; 'area' is passed through unchanged:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

ct = ColumnTransformer(
    [('town', OneHotEncoder(categories='auto'), [0])],
    remainder='passthrough' 
)
X = ct.fit_transform(X)
X

array([[1.0e+00, 0.0e+00, 0.0e+00, 2.6e+03],
       [1.0e+00, 0.0e+00, 0.0e+00, 3.0e+03],
       [1.0e+00, 0.0e+00, 0.0e+00, 3.2e+03],
       [1.0e+00, 0.0e+00, 0.0e+00, 3.6e+03],
       [1.0e+00, 0.0e+00, 0.0e+00, 4.0e+03],
       [0.0e+00, 0.0e+00, 1.0e+00, 2.6e+03],
       [0.0e+00, 0.0e+00, 1.0e+00, 2.8e+03],
       [0.0e+00, 0.0e+00, 1.0e+00, 3.3e+03],
       [0.0e+00, 0.0e+00, 1.0e+00, 3.6e+03],
       [0.0e+00, 1.0e+00, 0.0e+00, 2.6e+03],
       [0.0e+00, 1.0e+00, 0.0e+00, 2.9e+03],
       [0.0e+00, 1.0e+00, 0.0e+00, 3.1e+03],
       [0.0e+00, 1.0e+00, 0.0e+00, 3.6e+03]])

In [122]:
# Now we drop one of the dummy columns. We drop the first column (index 0); with a 2D array this is done by slicing:
X = X[:, 1:]
X

array([[0.0e+00, 0.0e+00, 2.6e+03],
       [0.0e+00, 0.0e+00, 3.0e+03],
       [0.0e+00, 0.0e+00, 3.2e+03],
       [0.0e+00, 0.0e+00, 3.6e+03],
       [0.0e+00, 0.0e+00, 4.0e+03],
       [0.0e+00, 1.0e+00, 2.6e+03],
       [0.0e+00, 1.0e+00, 2.8e+03],
       [0.0e+00, 1.0e+00, 3.3e+03],
       [0.0e+00, 1.0e+00, 3.6e+03],
       [1.0e+00, 0.0e+00, 2.6e+03],
       [1.0e+00, 0.0e+00, 2.9e+03],
       [1.0e+00, 0.0e+00, 3.1e+03],
       [1.0e+00, 0.0e+00, 3.6e+03]])

In [124]:
# Train the model on the encoded features:
reg.fit(X,Y)

LinearRegression()

In [125]:
# The columns are now robinsville, west windsor, area. Predict a home of area 2800 in robinsville:
reg.predict([[1,0,2800]])

array([590775.63964739])

In [126]:
# Predict a home of area 3400 in west windsor:
reg.predict([[0,1,3400]])

array([681241.6684584])

So we have seen two methods of creating dummy variables: the pandas 'get_dummies' function and the 'OneHotEncoder' class from sklearn.preprocessing.
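The sklearn route can also be shortened. The sketch below is one possible variant, not the approach used above: it passes the string column straight to OneHotEncoder and lets `drop="first"` remove one dummy column. It uses a three-row sample of the data. `sparse_threshold=0` is set on the ColumnTransformer so the result is always a dense array.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Three rows taken from the dataset above, for illustration.
sample = pd.DataFrame({
    "town": ["monroe township", "west windsor", "robinsville"],
    "area": [2600, 2800, 2600],
})

ct = ColumnTransformer(
    [("town", OneHotEncoder(drop="first"), ["town"])],
    remainder="passthrough",  # keep 'area' unchanged
    sparse_threshold=0,       # always return a dense array
)

# Resulting columns: robinsville, west windsor, area.
X_sample = ct.fit_transform(sample)
print(X_sample)
```

With `drop="first"` the dropped category is the first one in sorted order (monroe township), so a monroe township row has zeros in both dummy columns.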

### Exercise
At the same level as this notebook on GitHub, there is an Exercise folder that contains carprices.csv. This file has car sale prices for 3 different models. First plot the data points on a scatter plot to see whether a linear regression model can be applied. If yes, then build a model that can answer the following questions:

    1) Predict price of a mercedez benz that is 4 yr old with mileage 45000
    2) Predict price of a BMW X5 that is 7 yr old with mileage 86000
    3) Tell me the score (accuracy) of your model. (Hint: use LinearRegression().score())