### Discuss the concepts of One-Hot Encoding, Multicollinearity, and the Dummy Variable Trap.   
### What are Nominal and Ordinal Variables?
---



### One Hot Encoding  
#### What is one hot encoding?  
A one hot encoding is a representation of categorical variables as binary vectors.

This first requires that the categorical values be mapped to integer values.

Then, each integer value is represented as a binary vector that is all zero values except the index of the integer, which is marked with a 1.  
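
The two steps above can be sketched with NumPy. This is a minimal illustration, not a reference implementation; the colour labels and the variable names are hypothetical.

```python
import numpy as np

# Hypothetical example labels; any list of categories would do.
labels = ["red", "green", "blue", "green"]

# Step 1: map each distinct category to an integer index (sorted for determinism).
categories = sorted(set(labels))                      # ['blue', 'green', 'red']
to_index = {cat: i for i, cat in enumerate(categories)}
indices = np.array([to_index[label] for label in labels])

# Step 2: each integer becomes a vector of zeros with a single 1 at its index.
one_hot = np.eye(len(categories), dtype=int)[indices]
print(one_hot)
```

Each row contains exactly one 1, in the column that belongs to that row's category.
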
#### Why Use a One Hot Encoding?
A one hot encoding allows the representation of categorical data to be more expressive.

Many machine learning algorithms cannot work with categorical data directly. The categories must be converted into numbers. This is required for both input and output variables that are categorical.

We could use an integer encoding directly, rescaled where needed. This may work for problems where there is a natural ordinal relationship between the categories, and in turn the integer values, such as the temperature labels ‘cold’, ‘warm’, and ‘hot’.
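
As a rough sketch of such an integer encoding, pandas can assign codes that follow an explicit category order. The sample values and the `order` list are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical temperature readings.
temps = pd.Series(["hot", "cold", "warm", "cold"])

# Spell out the natural order so the integer codes respect it.
order = ["cold", "warm", "hot"]
codes = pd.Categorical(temps, categories=order, ordered=True).codes

print(codes.tolist())  # [2, 0, 1, 0]
```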

Problems can arise when there is no ordinal relationship: letting the representation lean on an order that does not exist may damage learning. An example is the labels ‘dog’ and ‘cat’.

In these cases, we would like to give the network more expressive power to learn a probability-like number for each possible label value. This can make the problem easier for the network to model. When a one hot encoding is used for the output variable, it may also offer a more nuanced set of predictions than a single label.
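
To make "a probability-like number for each possible label" concrete, here is a small sketch. It assumes the network ends in a softmax layer, which the text does not state, and the scores are made up.

```python
import numpy as np

classes = ["cat", "dog"]

# Hypothetical raw scores a network might emit for one input.
logits = np.array([1.2, 0.3])

# Softmax (an assumed output layer) turns the scores into one
# probability-like number per label; the numbers sum to 1.
exp = np.exp(logits - logits.max())
probs = exp / exp.sum()

# A single predicted label can still be recovered when needed.
predicted = classes[int(np.argmax(probs))]
print(dict(zip(classes, probs.round(3).tolist())), predicted)
```

The vector of per-label numbers carries more information than the single label: it also shows how close the alternatives were.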



### Multicollinearity  

#### What Is Multicollinearity?
Multicollinearity is the occurrence of high intercorrelations among two or more independent variables in a multiple regression model. It can lead to skewed or misleading results when a researcher or analyst tries to determine how well each independent variable predicts or explains the dependent variable in a statistical model.


In general, multicollinearity can lead to wider confidence intervals and less reliable probabilities for the effects of the independent variables in a model. That is, the statistical inferences from a model with multicollinearity may not be dependable.  
>KEY TAKEAWAYS  

- Multicollinearity is a statistical concept where independent variables in a model are correlated.  

- Multicollinearity among independent variables will result in less reliable statistical inferences.  

- It is better to use independent variables that are not correlated or redundant when building multiple regression models that use two or more variables.
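
One common way to look for multicollinearity, though not one this text prescribes, is to inspect the correlation matrix and the variance inflation factor (VIF) of each variable. The sketch below uses synthetic data, and the helper `vif` is a hypothetical name written for this example.

```python
import numpy as np

# Synthetic data (an assumption for illustration): x2 is nearly a copy of x1.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = 0.95 * x1 + 0.05 * rng.normal(size=n)
x3 = rng.normal(size=n)                     # unrelated to the other two
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    """VIF of column j: 1 / (1 - R^2) from regressing it on the other columns."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])   # add an intercept
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1 - resid.var() / y.var()
    return 1 / (1 - r2)

print(np.corrcoef(X, rowvar=False).round(2))
print([round(float(vif(X, j)), 1) for j in range(X.shape[1])])
```

A VIF close to 1 suggests a variable is barely explained by the others, while a very large value suggests it is nearly redundant.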

### Dummy Variable Trap
The dummy variable trap is a scenario in which the independent variables are multicollinear: two or more variables are so highly correlated that, in simple terms, one can be predicted from the others.

To demonstrate the dummy variable trap, take gender (male/female) as an example. Including a dummy variable for each category is redundant: if male is 0 then female is 1, and vice versa, so either column can be derived from the other. Keeping only one of the two dummies avoids the trap.  
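
The redundancy can be checked numerically. The sketch below uses a hypothetical four-row `sx` column and assumes the model includes an intercept; with both dummies present, the design matrix loses a rank.

```python
import numpy as np
import pandas as pd

# Hypothetical four-row sample.
sex = pd.Series(["male", "female", "female", "male"], name="sx")

full = pd.get_dummies(sex, dtype=int)                       # columns: female, male
reduced = pd.get_dummies(sex, dtype=int, drop_first=True)   # column: male

# With an intercept column, female + male always equals the intercept.
intercept = np.ones((len(sex), 1))
X_full = np.hstack([intercept, full.to_numpy()])
X_reduced = np.hstack([intercept, reduced.to_numpy()])

print(np.linalg.matrix_rank(X_full), "of", X_full.shape[1])        # 2 of 3
print(np.linalg.matrix_rank(X_reduced), "of", X_reduced.shape[1])  # 2 of 2
```

Dropping one dummy per categorical variable (`drop_first=True`) restores full column rank.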
  
### Nominal Variables  
The nominal scale, also called the categorical variable scale, labels variables with distinct classifications and involves no quantitative value or order. It is the simplest of the four variable measurement scales.

When numbers are attached to the categories of a nominal variable, they serve only as tags for classification or division. Calculations done on these numbers are futile because they have no quantitative significance.  
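
A nominal variable can be sketched in pandas as an unordered `Categorical`. The pet labels are hypothetical; the point is that counting works while order-based operations do not.

```python
import pandas as pd

# Hypothetical nominal data; pandas treats a Categorical as unordered by default.
pets = pd.Categorical(["dog", "cat", "dog", "bird"])
print(pets.ordered)   # False

# Counting how often each label occurs is meaningful.
counts = pd.Series(pets).value_counts().to_dict()
print(counts)

# Asking for the smallest category needs an order; pandas is expected
# to raise TypeError for an unordered Categorical.
try:
    pets.min()
    has_order = True
except TypeError:
    has_order = False
print(has_order)
```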
  
### Ordinal Variables  

The ordinal scale is a variable measurement scale that depicts the order of the values but not the size of the difference between them. It is generally used for non-mathematical ideas such as frequency, satisfaction, happiness, or degree of pain. The name is easy to remember because ‘ordinal’ sounds like ‘order’, which is exactly the purpose of this scale.

The ordinal scale keeps the descriptive (tagging) qualities of the nominal scale and adds an intrinsic order, so each value has a relative position. It has no fixed origin or “true zero”, and the distance between values cannot be calculated.
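
An ordinal variable can be sketched as an ordered `Categorical` in pandas. The satisfaction levels below are hypothetical.

```python
import pandas as pd

# Hypothetical satisfaction survey with an explicit order.
levels = ["low", "medium", "high"]
satisfaction = pd.Categorical(
    ["medium", "high", "low", "medium"], categories=levels, ordered=True
)

# Order-based questions are now valid.
print(satisfaction.min(), satisfaction.max())   # low high
print((satisfaction > "low").tolist())          # [True, True, False, True]

# The integer codes are ranks only; the gap between codes has no meaning.
print(satisfaction.codes.tolist())              # [1, 2, 0, 1]
```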

![Types of measurement scales](https://www.questionpro.com/blog/wp-content/uploads/2018/05/Types-of-measurements-scales.jpg)

### Salary Dataset of 52 professors having categorical columns.  
### Apply dummy variables concept and one-hot-encoding on categorical columns.



In [2]:
import pandas as pd

In [26]:
df=pd.read_csv(r'E:\Goeduhub_ML_Program_May_20\data\salary.csv')
df.head()

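|   | sx | rk | yr | dg | yd | sl |
|---|---|---|---|---|---|---|
| 0 | male | full | 25 | doctorate | 35 | 36350 |
| 1 | male | full | 13 | doctorate | 22 | 35350 |
| 2 | male | full | 10 | doctorate | 23 | 28200 |
| 3 | female | full | 7 | doctorate | 27 | 26775 |
| 4 | male | full | 19 | masters | 30 | 33696 |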


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52 entries, 0 to 51
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   sx      52 non-null     object
 1   rk      52 non-null     object
 2   yr      52 non-null     int64 
 3   dg      52 non-null     object
 4   yd      52 non-null     int64 
 5   sl      52 non-null     int64 
dtypes: int64(3), object(3)
memory usage: 2.6+ KB


In [27]:
pd.get_dummies(df['sx']).head()

|   | female | male |
|---|---|---|
| 0 | 0 | 1 |
| 1 | 0 | 1 |
| 2 | 0 | 1 |
| 3 | 1 | 0 |
| 4 | 0 | 1 |


In [30]:
pd.get_dummies(df,columns=['sx','rk','dg'])

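|   | yr | yd | sl | sx_female | sx_male | rk_assistant | rk_associate | rk_full | dg_doctorate | dg_masters |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | 35 | 36350 | 0 | 1 | 0 | 0 | 1 | 1 | 0 |
| 1 | 13 | 22 | 35350 | 0 | 1 | 0 | 0 | 1 | 1 | 0 |
| 2 | 10 | 23 | 28200 | 0 | 1 | 0 | 0 | 1 | 1 | 0 |
| 3 | 7 | 27 | 26775 | 1 | 0 | 0 | 0 | 1 | 1 | 0 |
| 4 | 19 | 30 | 33696 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |
| 5 | 16 | 21 | 28516 | 0 | 1 | 0 | 0 | 1 | 1 | 0 |
| 6 | 0 | 32 | 24900 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |
| 7 | 16 | 18 | 31909 | 0 | 1 | 0 | 0 | 1 | 1 | 0 |
| 8 | 13 | 30 | 31850 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |
| 9 | 13 | 31 | 32850 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |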
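
The encoding above keeps every dummy column, which is exactly the situation the dummy variable trap describes. A possible remedy is `drop_first=True`, sketched here on a small frame rebuilt from the first five rows shown earlier and restricted to three columns so the example runs without the CSV file; `sample` and `encoded` are names chosen for this sketch.

```python
import pandas as pd

# First five rows of the salary data, restricted to three columns.
sample = pd.DataFrame({
    "sx": ["male", "male", "male", "female", "male"],
    "dg": ["doctorate", "doctorate", "doctorate", "doctorate", "masters"],
    "sl": [36350, 35350, 28200, 26775, 33696],
})

# drop_first=True removes the first level of each encoded column,
# leaving k-1 dummies for a variable with k categories.
encoded = pd.get_dummies(sample, columns=["sx", "dg"], drop_first=True, dtype=int)
print(encoded)
```

Here `sx_female` and `dg_doctorate` are dropped; a row with `sx_male == 0` is understood to be female.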


### Using OneHotEncoder

In [24]:
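from sklearn.preprocessing import OneHotEncoder

# One-hot encode the degree column; the result is a SciPy sparse matrix by default.
one_hot_enc = OneHotEncoder()
r = one_hot_enc.fit_transform(df[["dg"]])
print(r)        # (row, column) positions of the 1.0 entries
print(r.dtype)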

  (0, 0)	1.0
  (1, 0)	1.0
  (2, 0)	1.0
  (3, 0)	1.0
  (4, 1)	1.0
  (5, 0)	1.0
  (6, 1)	1.0
  (7, 0)	1.0
  (8, 1)	1.0
  (9, 1)	1.0
  (10, 0)	1.0
  (11, 0)	1.0
  (12, 0)	1.0
  (13, 1)	1.0
  (14, 0)	1.0
  (15, 0)	1.0
  (16, 0)	1.0
  (17, 1)	1.0
  (18, 1)	1.0
  (19, 1)	1.0
  (20, 1)	1.0
  (21, 1)	1.0
  (22, 0)	1.0
  (23, 0)	1.0
  (24, 0)	1.0
  :	:
  (27, 0)	1.0
  (28, 1)	1.0
  (29, 1)	1.0
  (30, 1)	1.0
  (31, 1)	1.0
  (32, 1)	1.0
  (33, 1)	1.0
  (34, 1)	1.0
  (35, 0)	1.0
  (36, 0)	1.0
  (37, 0)	1.0
  (38, 0)	1.0
  (39, 0)	1.0
  (40, 0)	1.0
  (41, 1)	1.0
  (42, 0)	1.0
  (43, 0)	1.0
  (44, 0)	1.0
  (45, 0)	1.0
  (46, 0)	1.0
  (47, 0)	1.0
  (48, 0)	1.0
  (49, 0)	1.0
  (50, 0)	1.0
  (51, 0)	1.0
float64
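
The sparse printout lists only the positions of the 1.0 entries. As a sketch, the same encoding can be viewed as a dense array, and `drop="first"` can be used to avoid the dummy variable trap. The `sample` frame holds the `dg` values of the first five rows shown earlier so the example runs without the CSV file.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# dg values of the first five rows of the salary data.
sample = pd.DataFrame(
    {"dg": ["doctorate", "doctorate", "doctorate", "doctorate", "masters"]}
)

encoder = OneHotEncoder()
dense = encoder.fit_transform(sample[["dg"]]).toarray()   # sparse -> dense
print(encoder.categories_)   # column order of the encoding
print(dense)

# drop="first" omits the first category of each feature.
reduced = OneHotEncoder(drop="first").fit_transform(sample[["dg"]]).toarray()
print(reduced.ravel())
```

Column 0 corresponds to `doctorate` and column 1 to `masters`, which matches the `(row, column)` pairs in the sparse output above.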
