## Categorical Variables

Categorical variables come in two types: nominal and ordinal.

Nominal variables have categories with no inherent order.

eg: Male, Female
eg: Red, green, blue

Here Male cannot be ranked above or below Female, and the same goes for the colors.


Ordinal variables have categories with a natural order.

eg: Satisfied, neutral, dissatisfied
eg: graduate, masters, phd
eg: high, medium, low

Here we can compare: phd > masters > graduate, and likewise high > medium > low.
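The ordering can also be made explicit in pandas. This is only a small sketch: the `education` values are made up for illustration, and an ordered `pd.Categorical` is one possible way to record the order, not something the rest of this notebook relies on.

```python
import pandas as pd

# Illustrative values only (not taken from the home price dataset).
education = pd.Series(["masters", "graduate", "phd", "graduate"])

# List the categories from lowest to highest and mark them as ordered.
levels = ["graduate", "masters", "phd"]
ordered = pd.Categorical(education, categories=levels, ordered=True)

# Integer codes follow the declared order: graduate=0, masters=1, phd=2.
codes = ordered.codes.tolist()

# Ordered categoricals can be compared against a category label.
above_graduate = (ordered > "graduate").tolist()

print(codes)
print(above_graduate)
```

A nominal variable such as town has no such order, which is why it is handled differently below.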


## One Hot Encoding

The dataset used here has a nominal variable (the town), so the question is how to represent it.
ML models generally expect numeric input, so text categories have to be converted to numbers.

<b>One Hot Encoding</b> addresses this problem.

In [1]:
import pandas as pd

In [2]:
home_price = pd.read_csv('./homeprices.csv')

In [3]:
home_price

Unnamed: 0,town,area,price
0,monroe township,2600,550000
1,monroe township,3000,565000
2,monroe township,3200,610000
3,monroe township,3600,680000
4,monroe township,4000,725000
5,west windsor,2600,585000
6,west windsor,2800,615000
7,west windsor,3300,650000
8,west windsor,3600,710000
9,robinsville,2600,575000
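If `homeprices.csv` is not at hand, the frame can be rebuilt inline. The sketch below uses only the ten rows displayed above; that the file contains nothing else is an assumption, and if the real file holds more rows, the fitted numbers later on will differ.

```python
import pandas as pd

# Rebuild the displayed rows by hand (assumption: these ten rows are
# all we need; the real CSV is not read here).
home_price = pd.DataFrame({
    "town": ["monroe township"] * 5 + ["west windsor"] * 4 + ["robinsville"],
    "area": [2600, 3000, 3200, 3600, 4000, 2600, 2800, 3300, 3600, 2600],
    "price": [550000, 565000, 610000, 680000, 725000,
              585000, 615000, 650000, 710000, 575000],
})

print(home_price)
```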


### Intuition

The idea is to create one dummy variable (a 0/1 column) for each category of a particular column and append these columns to our data frame.

The result looks something like this:

<table>
  <thead>
    <tr>
      <th>town</th>
      <th>area</th>
      <th>price</th>
      <th>monroe township</th>
      <th>robinsville</th>
      <th>west windsor</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>monroe township</td>
      <td>2600</td>
      <td>550000</td>
      <td>1</td>
      <td>0</td>
      <td>0</td>
    </tr>
    <tr>
      <td>robinsville</td>
      <td>3000</td>
      <td>565000</td>
      <td>0</td>
      <td>1</td>
      <td>0</td>
    </tr>
    <tr>
      <td>west windsor</td>
      <td>3200</td>
      <td>610000</td>
      <td>0</td>
      <td>0</td>
      <td>1</td>
    </tr>
  </tbody>
</table>


In [7]:
dummy_table = pd.get_dummies(home_price.town)

In [8]:
dummy_table

Unnamed: 0,monroe township,robinsville,west windsor
0,1,0,0
1,1,0,0
2,1,0,0
3,1,0,0
4,1,0,0
5,0,0,1
6,0,0,1
7,0,0,1
8,0,0,1
9,0,1,0


In [10]:
merged_df = pd.concat([home_price, dummy_table], axis='columns' )

In [11]:
merged_df

Unnamed: 0,town,area,price,monroe township,robinsville,west windsor
0,monroe township,2600,550000,1,0,0
1,monroe township,3000,565000,1,0,0
2,monroe township,3200,610000,1,0,0
3,monroe township,3600,680000,1,0,0
4,monroe township,4000,725000,1,0,0
5,west windsor,2600,585000,0,0,1
6,west windsor,2800,615000,0,0,1
7,west windsor,3300,650000,0,0,1
8,west windsor,3600,710000,0,0,1
9,robinsville,2600,575000,0,1,0
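The two steps above (build the dummies, then concat) can also be done in one call. A minimal sketch, using three rows copied from the data above: passing `columns=["town"]` makes `pd.get_dummies` replace the text column with prefixed dummy columns, and `dtype=int` asks for 0/1 instead of the boolean values newer pandas versions return by default.

```python
import pandas as pd

# Three rows copied from the home price data, one per town.
df = pd.DataFrame({
    "town": ["monroe township", "west windsor", "robinsville"],
    "area": [2600, 2600, 2600],
    "price": [550000, 585000, 575000],
})

# The original "town" column is removed and the dummy columns are
# appended with a "town_" prefix.
encoded = pd.get_dummies(df, columns=["town"], dtype=int)

print(encoded)
```

Note that the column names differ from the ones used in this notebook because of the `town_` prefix.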


## Dummy variable trap


The dummy variable trap occurs when the number of dummy variables created equals the number of values the categorical variable can take and the model also includes an intercept. The dummy columns then always sum to 1, so any one of them can be predicted exactly from the others. This is perfect multicollinearity, and it can make the regression coefficients and their p-values unreliable.


For example, political affiliation is a categorical variable that might take three values: Republican, Democrat, or Independent. Two dummy variables are enough to represent it, because the third category is implied when both dummies are 0.
To avoid the dummy variable trap, drop one dummy variable for each categorical variable.
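A small sketch of what this means for the town column. The four values below are illustrative. `drop_first=True` is a real `pd.get_dummies` option, but note that it drops the first category in sorted order, which is not necessarily the column dropped by hand in this notebook.

```python
import pandas as pd

towns = pd.Series(
    ["monroe township", "west windsor", "robinsville", "monroe township"],
    name="town",
)

# One column per category: every row contains exactly one 1,
# so the columns always add up to 1.
all_dummies = pd.get_dummies(towns, dtype=int)
row_sums = all_dummies.sum(axis=1)

# drop_first=True removes the first category in sorted order
# ("monroe township" here), which then acts as the baseline.
reduced = pd.get_dummies(towns, drop_first=True, dtype=int)

print(row_sums.tolist())
print(list(reduced.columns))
```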

### Multicollinearity

Multicollinearity occurs when two or more independent variables in a model are correlated, so that one independent variable can be predicted, at least partly, from the others. If two variables have a correlation coefficient of +/- 1.0, they are perfectly collinear.
Multicollinearity can make statistical inferences less reliable, make a model hard to interpret, and contribute to overfitting. It can also limit the research conclusions that can be drawn.
Examples of multicollinearity include entering the same information twice (such as weight in pounds and weight in kilograms) and using dummy variables incorrectly.
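The pounds and kilograms case is easy to check numerically. A minimal sketch with made-up weights: the correlation comes out as 1, and a design matrix holding an intercept plus both weight columns has rank 2 rather than 3, meaning one column adds no new information.

```python
import numpy as np

# Made-up weights, for illustration only.
kg = np.array([50.0, 60.0, 72.5, 80.0, 95.0])
lb = kg * 2.20462  # same information in a different unit

# Correlation between the two columns.
corr = np.corrcoef(kg, lb)[0, 1]

# Intercept + kg + lb: three columns, but only two are independent.
X = np.column_stack([np.ones_like(kg), kg, lb])
rank = np.linalg.matrix_rank(X)

print(round(corr, 6), rank)
```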

In [12]:
# drop the original 'town' column and one dummy column ('west windsor')

In [14]:
final = merged_df.drop(['town','west windsor'],axis='columns')

In [15]:
final

Unnamed: 0,area,price,monroe township,robinsville
0,2600,550000,1,0
1,3000,565000,1,0
2,3200,610000,1,0
3,3600,680000,1,0
4,4000,725000,1,0
5,2600,585000,0,0
6,2800,615000,0,0
7,3300,650000,0,0
8,3600,710000,0,0
9,2600,575000,0,1


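<b>Note:</b> scikit-learn's LinearRegression does not drop a dummy column for you. Its least-squares solver still returns usable predictions when all dummy columns are kept, but the individual coefficients are then not uniquely determined and are harder to interpret. So it is good practice to drop one column ourselves.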
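To see how least squares behaves when all dummy columns are kept, here is a rough check. It uses NumPy's `lstsq` instead of scikit-learn and only the ten rows displayed earlier, both of which are assumptions of this sketch. It fits once with all three dummies and once with 'west windsor' dropped, then compares the fitted values.

```python
import numpy as np

# The ten rows displayed earlier (assumed to be enough for this check).
area = np.array([2600, 3000, 3200, 3600, 4000,
                 2600, 2800, 3300, 3600, 2600], dtype=float)
price = np.array([550000, 565000, 610000, 680000, 725000,
                  585000, 615000, 650000, 710000, 575000], dtype=float)
# 0 = monroe township, 1 = robinsville, 2 = west windsor
town = np.array([0, 0, 0, 0, 0, 2, 2, 2, 2, 1])

dummies = np.eye(3)[town]            # one 0/1 column per town
intercept = np.ones((len(area), 1))

# All three dummies plus an intercept: the dummy variable trap.
X_full = np.hstack([intercept, area[:, None], dummies])
# 'west windsor' dropped, as in the notebook.
X_drop = np.hstack([intercept, area[:, None], dummies[:, :2]])

coef_full, *_ = np.linalg.lstsq(X_full, price, rcond=None)
coef_drop, *_ = np.linalg.lstsq(X_drop, price, rcond=None)

# The full matrix has 5 columns but only rank 4.
rank_full = np.linalg.matrix_rank(X_full)

# Fitted values agree even though the coefficients differ.
same_fit = np.allclose(X_full @ coef_full, X_drop @ coef_drop)

print(rank_full, same_fit)
```

The fitted values match because both matrices span the same column space; what changes is how the effect is split across the coefficients.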

In [16]:
from sklearn.linear_model import LinearRegression

In [17]:
model = LinearRegression()

In [18]:
X = final.drop('price',axis='columns')

In [19]:
X

Unnamed: 0,area,monroe township,robinsville
0,2600,1,0
1,3000,1,0
2,3200,1,0
3,3600,1,0
4,4000,1,0
5,2600,0,0
6,2800,0,0
7,3300,0,0
8,3600,0,0
9,2600,0,1


In [20]:
y = final.price

In [21]:
model.fit(X,y)

LinearRegression()

In [23]:
model.predict([[2800,0,1]])



array([590775.63964739])

In [26]:
# west windsor is the dropped (baseline) category, so both dummy columns are 0

model.predict([[3400,0,0]])



array([681241.66845839])
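Typing the 0/1 values by hand is easy to get wrong. A small helper can build the input row from a town name. This is only a sketch: `make_row` is a hypothetical function, not part of pandas or scikit-learn, and it assumes the feature order used above (area, monroe township, robinsville) with 'west windsor' as the dropped baseline.

```python
# Hypothetical helper; column order must match the X used for fitting.
DUMMY_COLUMNS = ("monroe township", "robinsville")
BASELINE = "west windsor"


def make_row(area, town):
    """Return [area, monroe township, robinsville] for one house."""
    if town != BASELINE and town not in DUMMY_COLUMNS:
        raise ValueError(f"unknown town: {town!r}")
    # The baseline town gets 0 in every dummy column.
    return [area] + [1 if town == col else 0 for col in DUMMY_COLUMNS]


print(make_row(2800, "robinsville"))
print(make_row(3400, "west windsor"))
```

The returned lists are the same rows passed to `model.predict` above, so `model.predict([make_row(2800, "robinsville")])` would be the equivalent call.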