## One-Hot Encoding

One-hot encoding is a data preprocessing technique that represents categorical data as binary vectors. It is particularly useful for machine learning models that cannot handle categorical data directly.

For a categorical variable with $n$ distinct categories, one-hot encoding creates $n$ binary features, one per category. If a sample belongs to a given category, the corresponding feature is set to $1$ and all the others are set to $0$.

#### Example:

Suppose a categorical variable has three categories: $A$, $B$, and $C$. One-hot encoding transforms these categories as follows:

$$
A \rightarrow [1, 0, 0] \\
B \rightarrow [0, 1, 0] \\
C \rightarrow [0, 0, 1]
$$

#### Mathematical Representation:

If the categorical variable $x$ has $n$ categories and the dataset has $m$ samples, the encoding is a matrix $O$ of size $m \times n$. Row $i$ of $O$ is the one-hot vector of sample $i$, so each row contains exactly one $1$. For three samples whose categories are $A$, $B$, and $C$, in that order:

$$
O =
\begin{bmatrix}
1 & 0 & 0 \\
0 & 1 & 0 \\
0 & 0 & 1
\end{bmatrix}
$$
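The construction of $O$ can be sketched in a few lines of NumPy. This is a minimal illustration, not part of the loan dataset workflow: the `samples` list and the `column_of` lookup are made-up names for the example.

```python
import numpy as np

# Hypothetical toy data: n = 3 categories, m = 4 samples.
categories = ["A", "B", "C"]
samples = ["A", "B", "C", "A"]

# Map each category to its column index in O.
column_of = {category: j for j, category in enumerate(categories)}

# Start from an all-zero m x n matrix and set one entry per row.
O = np.zeros((len(samples), len(categories)), dtype=int)
for i, sample in enumerate(samples):
    O[i, column_of[sample]] = 1

print(O)
```

Each row sums to 1 because a sample belongs to exactly one category.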

#### Advantages:

1. Avoids imposing an artificial order on nominal categories, which an integer (label) encoding would imply.
2. Makes the data suitable for models requiring numerical input.

#### Disadvantages:

1. Increases dimensionality, especially for variables with many categories.
2. May lead to sparsity in the dataset.
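
Both disadvantages are easy to see on a high-cardinality column. The sketch below uses a synthetic `city` column with 1000 distinct values (an assumption made purely for illustration) and measures how wide and how sparse the encoded result becomes.

```python
import pandas as pd

# Synthetic column with 1000 distinct categories, one per row.
cities = pd.Series([f"city_{i}" for i in range(1000)], name="city")

# One column is created per distinct category.
dummies = pd.get_dummies(cities, dtype=int)
n_rows, n_cols = dummies.shape

# Fraction of entries that are 1; every row has a single 1.
nonzero_fraction = dummies.to_numpy().mean()

print(n_rows, n_cols, nonzero_fraction)
```

A single column turns into 1000 columns, and only one entry in a thousand is nonzero.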


In [12]:
import pandas as pd

In [13]:
loan_data = pd.read_csv("loan.csv")

In [14]:
loan_data.head(5)

|   | Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LP001002 | Male | No | 0 | Graduate | No | 5849 | 0.0 | NaN | 360.0 | 1.0 | Urban | Y |
| 1 | LP001003 | Male | Yes | 1 | Graduate | No | 4583 | 1508.0 | 128.0 | 360.0 | 1.0 | Rural | N |
| 2 | LP001005 | Male | Yes | 0 | Graduate | Yes | 3000 | 0.0 | 66.0 | 360.0 | 1.0 | Urban | Y |
| 3 | LP001006 | Male | Yes | 0 | Not Graduate | No | 2583 | 2358.0 | 120.0 | 360.0 | 1.0 | Urban | Y |
| 4 | LP001008 | Male | No | 0 | Graduate | No | 6000 | 0.0 | 141.0 | 360.0 | 1.0 | Urban | Y |


In [15]:
loan_data.isnull().sum()

Loan_ID               0
Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64

#### Let's work on the **Gender** and **Married** columns

In [16]:
loan_data['Gender'].isnull().sum(), loan_data['Married'].isnull().sum()

(13, 3)

#### Filling null values with the mode

In [20]:
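# Assign the result back: calling fillna(inplace=True) on a selected column
# is chained assignment and may not update loan_data in newer pandas versions.
loan_data['Gender'] = loan_data['Gender'].fillna(loan_data['Gender'].mode()[0])
loan_data['Married'] = loan_data['Married'].fillna(loan_data['Married'].mode()[0])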

In [21]:
loan_data['Gender'].isnull().sum(), loan_data['Married'].isnull().sum()

(0, 0)
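
For readers without `loan.csv` at hand, the same mode imputation can be tried on a small made-up frame. The `toy` DataFrame below is hypothetical; only the column names are borrowed from the dataset.

```python
import pandas as pd

# Hypothetical toy frame with one missing value per column.
toy = pd.DataFrame({
    "Gender": ["Male", None, "Female", "Male"],
    "Married": ["Yes", "Yes", None, "No"],
})

for col in ["Gender", "Married"]:
    # mode() ignores missing values and returns a Series of the most
    # frequent values in sorted order; [0] picks the first one.
    most_frequent = toy[col].mode()[0]
    toy[col] = toy[col].fillna(most_frequent)

print(toy)
```

If two categories are tied for most frequent, `mode()[0]` picks the first in sorted order, which may or may not be the desired choice.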

### 1. Using **get_dummies**

This approach illustrates the use of pandas' `get_dummies` function for one-hot encoding:

1. Select specific categorical columns (e.g., `Gender` and `Married`) from the dataset to form a new DataFrame for encoding.

2. Apply `get_dummies` to the selected DataFrame, which generates binary columns for each unique category within the specified columns. Each category becomes a separate column (e.g., `Gender_Male`, `Married_Yes`).

3. Check the structure and details of the encoded DataFrame to confirm successful transformation, including column names, data types, and non-null counts.


In [22]:
encoded_data = loan_data[['Gender', 'Married']]
encoded_data.head()

|   | Gender | Married |
|---|---|---|
| 0 | Male | No |
| 1 | Male | Yes |
| 2 | Male | Yes |
| 3 | Male | Yes |
| 4 | Male | No |


In [24]:
pd.get_dummies(encoded_data).head()

|   | Gender_Female | Gender_Male | Married_No | Married_Yes |
|---|---|---|---|---|
| 0 | False | True | True | False |
| 1 | False | True | False | True |
| 2 | False | True | False | True |
| 3 | False | True | False | True |
| 4 | False | True | True | False |


In [25]:
pd.get_dummies(encoded_data).info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   Gender_Female  614 non-null    bool 
 1   Gender_Male    614 non-null    bool 
 2   Married_No     614 non-null    bool 
 3   Married_Yes    614 non-null    bool 
dtypes: bool(4)
memory usage: 2.5 KB
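
The columns above are stored as `bool`. As a minimal sketch on a made-up three-row frame (not `loan.csv`), the `dtype` parameter of `get_dummies` requests 0/1 integers instead, and `drop_first=True` drops the first category of each variable.

```python
import pandas as pd

# Hypothetical toy frame reusing the two column names from the dataset.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Married": ["No", "Yes", "Yes"],
})

# Integer 0/1 columns instead of the default booleans.
as_int = pd.get_dummies(toy, dtype=int)

# Drop the first category of each variable (Female and No here).
reduced = pd.get_dummies(toy, dtype=int, drop_first=True)

print(as_int)
print(reduced)
```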


### 2. Using OneHotEncoder from Scikit-Learn

This process demonstrates how to perform one-hot encoding for categorical variables using Scikit-Learn's `OneHotEncoder`:

1. **One-Hot Encoding with All Categories**  
   - Create an instance of `OneHotEncoder`.
   - Learn the encoding from the data and transform the categorical variables into a one-hot encoded array.
   - Convert the resulting array into a DataFrame with columns representing all categories (e.g., `Gender_Female`, `Gender_Male`, `Married_No`, `Married_Yes`).

2. **One-Hot Encoding with Dropped First Category**  
   - Use `drop='first'` in `OneHotEncoder` to exclude the first category of each variable, which avoids perfect multicollinearity (the dummy variable trap) in linear regression models.
   - Transform the data and create a DataFrame with columns representing only the remaining categories (e.g., `Gender_Male` and `Married_Yes`).

The first approach includes all categories, which is suitable for scenarios where no category should be omitted. The second approach is commonly used in statistical modeling to eliminate redundancy by dropping one category from each variable.
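
The redundancy is easy to check numerically. In the sketch below, built on made-up values, the two `Gender` columns always sum to 1, so together with an intercept column they are linearly dependent; dropping one of them removes the dependence.

```python
import numpy as np

# Hypothetical one-hot columns for a binary variable over five samples.
gender_female = np.array([0, 1, 0, 1, 0])
gender_male = 1 - gender_female  # the two columns always sum to 1
intercept = np.ones(5)

# Design matrix with an intercept and both dummy columns.
X_full = np.column_stack([intercept, gender_female, gender_male])

# Design matrix with an intercept and one dummy column.
X_drop = np.column_stack([intercept, gender_male])

rank_full = np.linalg.matrix_rank(X_full)  # 3 columns, rank 2
rank_drop = np.linalg.matrix_rank(X_drop)  # 2 columns, rank 2

print(rank_full, rank_drop)
```

`X_full` has three columns but rank 2, which is the situation that makes ordinary least squares coefficients non-unique; `X_drop` has full column rank.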


In [26]:
from sklearn.preprocessing import OneHotEncoder

In [28]:
ohe = OneHotEncoder()
# fit_transform returns a SciPy sparse matrix by default; toarray() makes it dense
arr = ohe.fit_transform(encoded_data).toarray()
arr

array([[0., 1., 1., 0.],
       [0., 1., 0., 1.],
       [0., 1., 0., 1.],
       ...,
       [0., 1., 0., 1.],
       [0., 1., 0., 1.],
       [1., 0., 1., 0.]])

In [29]:
data1 = pd.DataFrame(arr, columns=['Gender_Female', 'Gender_Male', 'Married_No', 'Married_Yes'])

In [30]:
data1.head()

|   | Gender_Female | Gender_Male | Married_No | Married_Yes |
|---|---|---|---|---|
| 0 | 0.0 | 1.0 | 1.0 | 0.0 |
| 1 | 0.0 | 1.0 | 0.0 | 1.0 |
| 2 | 0.0 | 1.0 | 0.0 | 1.0 |
| 3 | 0.0 | 1.0 | 0.0 | 1.0 |
| 4 | 0.0 | 1.0 | 1.0 | 0.0 |


In [32]:
ohe = OneHotEncoder(drop='first')
arr = ohe.fit_transform(encoded_data).toarray()

In [35]:
data2 = pd.DataFrame(arr, columns=['Gender_Male','Married_Yes'])
data2.head()

|   | Gender_Male | Married_Yes |
|---|---|---|
| 0 | 1.0 | 0.0 |
| 1 | 1.0 | 1.0 |
| 2 | 1.0 | 1.0 |
| 3 | 1.0 | 1.0 |
| 4 | 1.0 | 0.0 |
