# Label Encoding

Label encoding is a technique used to transform categorical data into numerical values by assigning a unique integer to each category.

#### Concept
Each unique category in a column is mapped to an integer. For instance, categories like Red, Green, and Blue might be encoded as 0, 1, and 2, respectively.
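
As a minimal sketch of this mapping, the snippet below assigns integers in order of first appearance so that it reproduces the Red, Green, Blue example. The `categories` list and the variable names are illustrative assumptions; note that scikit-learn's `LabelEncoder` assigns integers by sorted order instead, which would give Blue 0, Green 1, and Red 2.

```python
# Hypothetical sample column; not taken from the notebook's data
categories = ["Red", "Green", "Blue", "Green", "Red"]

# Build the category -> integer mapping in order of first appearance
mapping = {}
for category in categories:
    if category not in mapping:
        mapping[category] = len(mapping)

# Replace every value with its integer code
encoded = [mapping[category] for category in categories]

print(mapping)   # {'Red': 0, 'Green': 1, 'Blue': 2}
print(encoded)   # [0, 1, 2, 1, 0]
```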

#### Usage
Label encoding is commonly used for categorical variables, especially when the data is ordinal (the categories have a natural order) or when a compact single-column representation is preferred over wider encodings such as one-hot encoding.
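
For ordinal data the integer codes should follow the real order of the categories rather than alphabetical order. One possible way to do this, sketched here with a hypothetical `sizes` column and an assumed order of Small < Medium < Large, is a pandas ordered `Categorical`:

```python
import pandas as pd

# Hypothetical ordinal column; the values and their order are assumptions
sizes = pd.Series(["Medium", "Small", "Large", "Small"])
order = ["Small", "Medium", "Large"]

# An ordered Categorical numbers the values by their position in `order`
ordered = pd.Categorical(sizes, categories=order, ordered=True)
codes = ordered.codes.tolist()

print(codes)  # [1, 0, 2, 0]
```

Here the codes carry meaning: a larger integer really does correspond to a larger size.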

#### Considerations
- It is efficient and straightforward, but the integers imply an order between categories that may not exist, which can mislead algorithms that treat the values as numeric (for example, linear or distance-based models).
- For non-ordinal categories, one-hot encoding is often a better choice because it avoids creating false hierarchies in the data.
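
To illustrate the one-hot alternative mentioned above, here is a small sketch using `pd.get_dummies`. The `colors` DataFrame is a made-up example, not part of the notebook's data.

```python
import pandas as pd

# Hypothetical non-ordinal column
colors = pd.DataFrame({"Color": ["Red", "Green", "Blue", "Green"]})

# One indicator column per category; no order is implied between them
one_hot = pd.get_dummies(colors["Color"], prefix="Color", dtype=int)

print(one_hot)
```

Each row contains a single 1 in the column for its category, so no category is treated as larger or smaller than another.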


In [14]:
import pandas as pd

In [15]:
data = pd.DataFrame({'Name':['Dog', 'Cat', 'Tiger', 'Cow', 'Chita']})
data

|   | Name |
|---|---|
| 0 | Dog |
| 1 | Cat |
| 2 | Tiger |
| 3 | Cow |
| 4 | Chita |


In [16]:
from sklearn.preprocessing import LabelEncoder

In [17]:
label_encoder = LabelEncoder()
data['Encoded_Name'] = label_encoder.fit_transform(data['Name'])
data

|   | Name | Encoded_Name |
|---|---|---|
| 0 | Dog | 3 |
| 1 | Cat | 0 |
| 2 | Tiger | 4 |
| 3 | Cow | 2 |
| 4 | Chita | 1 |


In [18]:
loan_data = pd.read_csv('loan.csv')

loan_data.head()

|   | Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LP001002 | Male | No | 0 | Graduate | No | 5849 | 0.0 | NaN | 360.0 | 1.0 | Urban | Y |
| 1 | LP001003 | Male | Yes | 1 | Graduate | No | 4583 | 1508.0 | 128.0 | 360.0 | 1.0 | Rural | N |
| 2 | LP001005 | Male | Yes | 0 | Graduate | Yes | 3000 | 0.0 | 66.0 | 360.0 | 1.0 | Urban | Y |
| 3 | LP001006 | Male | Yes | 0 | Not Graduate | No | 2583 | 2358.0 | 120.0 | 360.0 | 1.0 | Urban | Y |
| 4 | LP001008 | Male | No | 0 | Graduate | No | 6000 | 0.0 | 141.0 | 360.0 | 1.0 | Urban | Y |


#### Let's work with `Property_Area`

### Label Encoding using Scikit-Learn

This example demonstrates how to apply label encoding to a categorical column using Scikit-Learn's `LabelEncoder`.

1. **Data Preparation**  
   - A `DataFrame` named `data` is created with a `Name` column containing animal names like 'Dog', 'Cat', 'Tiger', 'Cow', and 'Chita'.

2. **Label Encoder Setup**  
   - `label_encoder = LabelEncoder()`: Initializes a `LabelEncoder` object from Scikit-Learn. This encoder will convert categorical string values into unique numerical labels.

3. **Fit and Transform**  
   - `data['Encoded_Name'] = label_encoder.fit_transform(data['Name'])`: The `fit_transform` method is applied to the `Name` column of the `data` DataFrame. It learns the unique categories and replaces each one with an integer label. `LabelEncoder` numbers the categories in sorted order, so here 'Cat' becomes 0, 'Chita' 1, 'Cow' 2, 'Dog' 3, and 'Tiger' 4. The encoded values are stored in the `Encoded_Name` column.

4. **Loan Data Encoding**  
   - A `loan_data` DataFrame is loaded by reading a CSV file (`loan.csv`).
   - `loan_data['Property_Area'].head()`: Displays the first five values of the `Property_Area` column, which contains categorical values such as 'Urban', 'Semiurban', and 'Rural'.
   - `loan_data['Property_Area'].unique()`: Displays the unique values in the `Property_Area` column.
   - `fillna(loan_data['Property_Area'].mode()[0])`: Replaces any missing values (NaN) in the `Property_Area` column with the most frequent value (the mode).
   - `label_encoder.fit(...)` followed by `label_encoder.transform(...)`: Learns the categories and returns the encoded values as the array `arr`; `fit_transform` would do both in a single call. Because the categories are numbered in sorted order, 'Rural' becomes 0, 'Semiurban' 1, and 'Urban' 2.
   - `loan_data['Property_Area'] = arr`: Overwrites the `Property_Area` column with the encoded values in `arr`, replacing the category names with numerical labels.
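
The loan-data steps above can be sketched end to end on a small stand-in DataFrame. The `sample` data below is an assumption made for illustration (it is not read from `loan.csv`); the calls themselves are the same ones the notebook uses, plus `classes_` and `inverse_transform` to inspect and reverse the mapping.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for loan_data, with one missing value
sample = pd.DataFrame(
    {"Property_Area": ["Urban", "Rural", None, "Semiurban", "Urban"]}
)

# Fill the missing value with the most frequent category ('Urban' here)
most_frequent = sample["Property_Area"].mode()[0]
sample["Property_Area"] = sample["Property_Area"].fillna(most_frequent)

# Learn the categories and encode them in one call
encoder = LabelEncoder()
codes = encoder.fit_transform(sample["Property_Area"])

# classes_ holds the categories in sorted order; the position is the code
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)          # {'Rural': 0, 'Semiurban': 1, 'Urban': 2}
print(codes.tolist())   # [2, 0, 2, 1, 2]

# inverse_transform recovers the original category names
decoded = encoder.inverse_transform(codes)
print(list(decoded))
```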


In [19]:
loan_data['Property_Area'].head()

0    Urban
1    Rural
2    Urban
3    Urban
4    Urban
Name: Property_Area, dtype: object

In [20]:
loan_data['Property_Area'].unique()

array(['Urban', 'Rural', 'Semiurban'], dtype=object)

In [21]:
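# Assign the result back; inplace=True on a column selection triggers warnings in recent pandas
loan_data['Property_Area'] = loan_data['Property_Area'].fillna(loan_data['Property_Area'].mode()[0])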
loan_data['Property_Area'].unique()

array(['Urban', 'Rural', 'Semiurban'], dtype=object)

In [22]:
label_encoder = LabelEncoder()
label_encoder.fit(loan_data['Property_Area'])
arr = label_encoder.transform(loan_data['Property_Area'])  # fit + transform is equivalent to a single fit_transform call

In [23]:
arr

array([2, 0, 2, 2, 2, 2, 2, 1, 2, 1, 2, 2, 2, 0, 2, 2, 2, 2, 0, 2, 2, 2,
       1, 0, 1, 1, 1, 2, 2, 1, 2, 2, 0, 1, 0, 2, 2, 1, 2, 1, 2, 2, 2, 1,
       2, 2, 2, 2, 2, 1, 1, 1, 1, 2, 2, 1, 1, 0, 2, 2, 2, 2, 0, 0, 1, 1,
       2, 2, 2, 1, 2, 1, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1,
       2, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 2, 1, 2, 1, 2, 2, 2, 0, 2, 1,
       2, 1, 0, 1, 1, 0, 1, 2, 0, 2, 0, 1, 1, 1, 0, 0, 0, 0, 2, 0, 2, 2,
       1, 1, 1, 1, 0, 2, 1, 0, 0, 2, 1, 1, 2, 1, 2, 2, 0, 1, 0, 0, 2, 0,
       2, 1, 0, 2, 0, 1, 1, 2, 1, 0, 2, 0, 0, 0, 1, 1, 0, 2, 0, 1, 1, 0,
       0, 1, 1, 2, 2, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 2, 1, 0, 1, 0, 2,
       1, 2, 1, 1, 2, 2, 1, 1, 2, 0, 2, 1, 1, 1, 2, 0, 2, 1, 0, 1, 1, 1,
       2, 1, 1, 1, 1, 0, 2, 1, 1, 0, 1, 0, 0, 1, 1, 0, 2, 2, 0, 1, 0, 2,
       2, 0, 1, 2, 2, 2, 1, 2, 1, 2, 0, 1, 2, 0, 0, 2, 0, 1, 2, 1, 1, 0,
       1, 0, 1, 2, 0, 2, 2, 2, 0, 1, 1, 1, 1, 2, 1, 0, 2, 1, 2, 2, 0, 0,
       1, 0, 1, 0, 0, 1, 2, 2, 1, 2, 1, 2, 0, 2, 2,

In [24]:
loan_data['Property_Area'] = arr

In [25]:
loan_data.head()

|   | Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LP001002 | Male | No | 0 | Graduate | No | 5849 | 0.0 | NaN | 360.0 | 1.0 | 2 | Y |
| 1 | LP001003 | Male | Yes | 1 | Graduate | No | 4583 | 1508.0 | 128.0 | 360.0 | 1.0 | 0 | N |
| 2 | LP001005 | Male | Yes | 0 | Graduate | Yes | 3000 | 0.0 | 66.0 | 360.0 | 1.0 | 2 | Y |
| 3 | LP001006 | Male | Yes | 0 | Not Graduate | No | 2583 | 2358.0 | 120.0 | 360.0 | 1.0 | 2 | Y |
| 4 | LP001008 | Male | No | 0 | Graduate | No | 6000 | 0.0 | 141.0 | 360.0 | 1.0 | 2 | Y |


In [26]:
loan_data['Property_Area'].head()

0    2
1    0
2    2
3    2
4    2
Name: Property_Area, dtype: int32
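
Calling `fit` and `transform` separately, as in cell `In [22]`, is useful when the same mapping has to be reused on data seen later. The sketch below assumes a hypothetical set of new values; it also shows that `transform` raises a `ValueError` for a category the encoder did not see during `fit`.

```python
from sklearn.preprocessing import LabelEncoder

# Fit once on the known categories
encoder = LabelEncoder()
encoder.fit(["Urban", "Rural", "Semiurban"])

# Reuse the same mapping on new values (hypothetical data)
new_codes = encoder.transform(["Rural", "Urban"])
print(new_codes.tolist())  # [0, 2]

# A category that was not present during fit cannot be encoded
unseen_error = None
try:
    encoder.transform(["Suburban"])
except ValueError as err:
    unseen_error = str(err)

print(unseen_error)
```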