
**Q1.** What is data encoding? How is it useful in data science?

**Ans:**

## **What is Data Encoding?**

**Data Encoding** is the process of **transforming data from one format or type into another**, usually into a **numerical format** that can be easily processed by machine learning algorithms.

* Many algorithms cannot work with **categorical/text data** directly
* Encoding converts these features into **numeric values** while preserving their **meaning or relationships**

---

### **Types of Data Encoding**

1. **Label Encoding**

   * Assigns a unique **integer** to each category
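   * Example: `Color = ['Red', 'Blue', 'Green'] → [0, 1, 2]` (the integers are arbitrary; scikit-learn's `LabelEncoder` assigns them in sorted order, giving Blue = 0, Green = 1, Red = 2)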

2. **One-Hot Encoding**

   * Creates **binary columns** for each category
   * Example: `Color = ['Red', 'Blue', 'Green']` →

| Red | Blue | Green |
| --- | ---- | ----- |
| 1   | 0    | 0     |
| 0   | 1    | 0     |
| 0   | 0    | 1     |

3. **Binary Encoding / Target Encoding / Frequency Encoding**

   * Other advanced techniques for high-cardinality categorical data
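
As an illustration of one of these techniques, here is a minimal sketch of frequency encoding using pandas. The `City` column and its values are hypothetical sample data, not part of the original exercise.

```python
import pandas as pd

# Hypothetical sample data: a categorical column with repeated values
df = pd.DataFrame({"City": ["Pune", "Mumbai", "Pune", "Delhi", "Pune", "Mumbai"]})

# Relative frequency of each category (the frequencies sum to 1.0)
freq = df["City"].value_counts(normalize=True)

# Replace each category with its frequency; the result is one numeric column
df["City_Freq"] = df["City"].map(freq)
print(df)
```

One limitation of this approach is that two categories with the same frequency receive the same encoded value, so the model can no longer tell them apart.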

---

### **Why Data Encoding is Useful in Data Science**

1. **Makes data machine-readable**

   * Most ML algorithms require numeric input and cannot process text labels directly

2. **Preserves information**

   * Encoded features retain the **categorical meaning**

3. **Enables model building**

   * Without encoding, **categorical features cannot be used** in most implementations of regression, classification, or clustering models

4. **Improves performance**

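   * An encoding that matches the variable type (nominal vs. ordinal) avoids giving the model a false order or distance between categories, which can **increase accuracy** and **reduce errors**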


**One-Hot Encoding output:**

```
   Color_Blue  Color_Green  Color_Red
0           0            0          1
1           1            0          0
2           0            1          0
3           1            0          0
4           0            0          1
```

**Conclusion:**
Data encoding is a **critical preprocessing step** in data science, especially when dealing with **categorical variables**, because it **transforms data into a numeric format**, making it compatible with **machine learning algorithms**.


In [1]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Sample dataset
data = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red']})

# Label Encoding
le = LabelEncoder()
data['Color_Label'] = le.fit_transform(data['Color'])
print(data)

# One-Hot Encoding
data_encoded = pd.get_dummies(data['Color'], prefix='Color')
print(data_encoded)

   Color  Color_Label
0    Red            2
1   Blue            0
2  Green            1
3   Blue            0
4    Red            2
   Color_Blue  Color_Green  Color_Red
0       False        False       True
1        True        False      False
2       False         True      False
3        True        False      False
4       False        False       True


**Q2.** What is nominal encoding? Provide an example of how you would use it in a real-world scenario.

**Ans:**

## **What is Nominal Encoding?**

**Nominal encoding** is a type of **data encoding** used to convert **categorical variables with no inherent order** (nominal variables) into a **numeric format** suitable for machine learning algorithms.

* Nominal variables are **labels** or **categories** that cannot be ranked
* Examples: Color, Gender, Country, Product Type

**Nominal encoding methods:**

1. **One-Hot Encoding** – most commonly used
2. **Dummy Variables** – similar to one-hot encoding but drops one column to avoid multicollinearity
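
The difference between the two methods can be sketched with pandas. This is a minimal example on the same `Color` values used elsewhere in this notebook; `drop_first=True` and `dtype=int` are standard `pd.get_dummies` parameters.

```python
import pandas as pd

data = pd.DataFrame({"Color": ["Red", "Blue", "Green", "Blue", "Red"]})

# Full one-hot encoding: one column per category (3 columns)
one_hot = pd.get_dummies(data["Color"], prefix="Color", dtype=int)

# Dummy encoding: the first category in sorted order ("Blue") is dropped (2 columns)
dummies = pd.get_dummies(data["Color"], prefix="Color", drop_first=True, dtype=int)

print(one_hot.columns.tolist())  # ['Color_Blue', 'Color_Green', 'Color_Red']
print(dummies.columns.tolist())  # ['Color_Green', 'Color_Red']
```

In the dummy-encoded version, a row with zeros in every column represents the dropped category (`Blue`), so no information is lost.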

---

### **Key Characteristics**

* Each category is represented as a **separate column**
* Encoded values **do not imply any order**
* Prevents algorithms from **misinterpreting categorical data as ordinal**

---

## **Real-World Example**

**Scenario:**
A company is analyzing **customer preferences for product colors**.

* Feature: `Color` = ["Red", "Blue", "Green"]
* Goal: Use color as input in a machine learning model

**Nominal encoding using One-Hot Encoding:**

| Red | Blue | Green |
| --- | ---- | ----- |
| 1   | 0    | 0     |
| 0   | 1    | 0     |
| 0   | 0    | 1     |
| 0   | 1    | 0     |

**Explanation:**

* Each color gets a separate binary column
* The model can now use **color information** without assuming any **order**


**Output (`pd.get_dummies` applied to the values Red, Blue, Green, Blue, Red):**

```
   Color_Blue  Color_Green  Color_Red
0           0            0          1
1           1            0          0
2           0            1          0
3           1            0          0
4           0            0          1
```

---

### **Use Cases of Nominal Encoding**

1. Predicting customer preferences (e.g., product color, brand)
2. Classifying countries or cities in a geographic analysis
3. Encoding payment methods in e-commerce transactions

---

✅ **Conclusion:**
Nominal encoding transforms **categorical variables without order** into a numeric format while **preserving the categorical meaning**. This is essential for machine learning algorithms that require **numeric input**.

In [2]:
import pandas as pd

# Sample dataset
data = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red']})

# One-Hot Encoding (Nominal Encoding)
data_encoded = pd.get_dummies(data['Color'], prefix='Color')
print(data_encoded)

   Color_Blue  Color_Green  Color_Red
0       False        False       True
1        True        False      False
2       False         True      False
3        True        False      False
4       False        False       True


**Q3.** In what situations is nominal encoding preferred over one-hot encoding? Provide a practical example.

**Ans:**

---

## **Nominal Encoding vs One-Hot Encoding**

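* **Nominal Encoding** in this answer refers to **label (integer) encoding** of a nominal (unordered categorical) variable, where each category is assigned a **unique integer**. (Q2 used the term more broadly, with one-hot encoding as its most common method.)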
* **One-Hot Encoding** creates **binary columns** for each category.

**Key Difference:**

* One-hot encoding increases **dimensionality** (more columns)
* Nominal encoding keeps **single-column representation**

---

## **When Nominal Encoding is Preferred**

1. **High-cardinality categorical variables**

   * If a feature has **many unique categories**, one-hot encoding creates one column per category → high dimensionality, mostly-zero (sparse) data, and slower computation.
   * Nominal (integer) encoding keeps the feature in a single column.

2. **Tree-based models (Decision Trees, Random Forest, XGBoost, LightGBM)**

   * These models split on thresholds, so they can usually work with **integer-encoded categories** even though the integer order is arbitrary
   * One-hot encoding is often **not required** for these models

3. **Memory or computational constraints**

   * Nominal encoding reduces the number of columns, saving memory and computation time

---

## **Practical Example**

**Scenario:**

* A telecom company wants to predict **customer churn**
* Feature: `City` = 1,000 unique cities
* Using one-hot encoding would create 1,000 additional columns → inefficient

**Solution:**

* Use **Nominal Encoding (Label Encoding)**
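
The size difference can be sketched as follows. The city names (`City_0` … `City_999`) and the three customers per city are made-up data for illustration; `astype("category").cat.codes` is one pandas way to get integer codes.

```python
import pandas as pd

# Hypothetical data: 1,000 distinct city names, 3 customers per city
cities = [f"City_{i}" for i in range(1000)]
df = pd.DataFrame({"City": cities * 3})

# One-hot encoding: one column per distinct city
one_hot = pd.get_dummies(df["City"])

# Integer (label) encoding: a single column of codes 0..999
df["City_Code"] = df["City"].astype("category").cat.codes

print(one_hot.shape)              # (3000, 1000)
print(df["City_Code"].nunique())  # 1000
```

The codes follow the sorted order of the category names, which has no real-world meaning, so this representation is best paired with models that do not rely on numeric distance.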


**Why this works:**

* Tree-based models can split on these integer labels
* No need for 1-hot encoding → memory-efficient

---

## ✅ **Summary Table**

| Scenario                               | Use Nominal Encoding | Use One-Hot Encoding |
| -------------------------------------- | -------------------- | -------------------- |
| High-cardinality categorical variable  | ✅                    | ❌                    |
| Tree-based models                      | ✅                    | ❌                    |
| Linear models or distance-based models | ❌                    | ✅                    |
| Few categories                         | ❌                    | ✅                    |

---

**Conclusion:**
Nominal encoding is preferred when the categorical variable has **many unique categories** or when using **tree-based models**, as it reduces memory usage, usually with little loss in model performance for such models.

In [3]:
from sklearn.preprocessing import LabelEncoder
import pandas as pd

data = pd.DataFrame({'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix']})

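# Label encoding (LabelEncoder is designed for target labels;
# OrdinalEncoder is scikit-learn's equivalent for feature columns)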
le = LabelEncoder()
data['City_Encoded'] = le.fit_transform(data['City'])
print(data)

          City  City_Encoded
0     New York             3
1  Los Angeles             2
2      Chicago             0
3      Houston             1
4      Phoenix             4


**Q4.** Suppose you have a dataset containing categorical data with 5 unique values. Which encoding
technique would you use to transform this data into a format suitable for machine learning algorithms?
Explain why you made this choice.

**Ans:**

## **Problem**

* Dataset: Categorical feature with **5 unique values**
* Goal: Transform it into a **machine-learning-friendly numeric format**

---

## **Step 1: Analyze the Options**

**1. Label (Nominal) Encoding**

* Assigns integers 0–4 to the 5 categories
* Pros: Simple, single column
* Cons: Implies an **ordinal relationship**, which may mislead algorithms like linear regression or KNN

**2. One-Hot Encoding**

* Creates **5 binary columns**, one for each category
* Pros: No ordinal relationship implied; compatible with most algorithms
* Cons: Slightly increases dimensionality (5 columns)
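
The ordinal-relationship problem can be made concrete by comparing distances, which is what KNN and other distance-based models rely on. This is a small sketch with three hypothetical categories and NumPy.

```python
import numpy as np

# Label encoding: three categories mapped to arbitrary integers
red, blue, green = 0, 1, 2
print(abs(red - blue), abs(red - green))  # 1 2 -> Red looks "closer" to Blue than to Green

# One-hot encoding: each category gets its own axis
red_oh = np.array([1, 0, 0])
blue_oh = np.array([0, 1, 0])
green_oh = np.array([0, 0, 1])

d_rb = np.linalg.norm(red_oh - blue_oh)
d_rg = np.linalg.norm(red_oh - green_oh)
print(d_rb, d_rg)  # both equal sqrt(2): every pair of categories is equally far apart
```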

---

## **Step 2: Recommended Technique**

**For 5 unique categories → Use One-Hot Encoding**

**Reasoning:**

1. **Number of categories is small** → 5 additional columns are manageable
2. **Avoids misleading ordinal interpretation**
3. Compatible with most machine learning algorithms (linear models, distance-based models, tree-based models)

---

### **Step 3: Example in Python**
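
A minimal sketch with a feature that has exactly 5 unique values. The `Fruit` column and its values are hypothetical sample data.

```python
import pandas as pd

# Hypothetical feature with exactly 5 unique values
data = pd.DataFrame(
    {"Fruit": ["Apple", "Banana", "Cherry", "Mango", "Orange", "Cherry", "Banana"]}
)

# One indicator column per unique value
encoded = pd.get_dummies(data["Fruit"], prefix="Fruit", dtype=int)

print(encoded.shape)  # (7, 5)
print(encoded)
```

Each row contains exactly one `1`, in the column of that row's category.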


### **Step 4: Notes**

* If the categorical variable had **many unique values** (high cardinality, e.g., 1,000 categories), then **Nominal/Label Encoding** might be preferred for efficiency
* For a **small number of categories (≤ 10)**, One-Hot Encoding is **safer and standard**

---

✅ **Conclusion:**

* **One-Hot Encoding** is the best choice for a categorical feature with 5 unique values.
* It ensures the **model does not assume any ordinal relationship** and maintains **compatibility with most ML algorithms**.


In [4]:
import pandas as pd

# Sample categorical data (3 unique values here; the same call works for 5)
data = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red']})

# One-Hot Encoding
data_encoded = pd.get_dummies(data['Color'], prefix='Color')
print(data_encoded)

   Color_Blue  Color_Green  Color_Red
0       False        False       True
1        True        False      False
2       False         True      False
3        True        False      False
4       False        False       True


**Q4.** Suppose you have a dataset containing categorical data with 5 unique values. Which encoding
technique would you use to transform this data into a format suitable for machine learning algorithms?
Explain why you made this choice.

**Ans:**

---

## **Problem**

* Dataset: Categorical feature with **5 unique values**
* Goal: Transform it into a **numeric format** suitable for machine learning algorithms.

---

## **Step 1: Consider Encoding Options**

1. **Label (Nominal) Encoding**

   * Assigns integers (0 to 4) to the 5 categories.
   * **Pros:** Simple, uses only one column.
   * **Cons:** Implies an **ordinal relationship** (0 < 1 < 2 …), which may mislead algorithms like linear regression, KNN, or distance-based models.

2. **One-Hot Encoding**

   * Creates **binary columns** for each category.
   * **Pros:** Does **not imply any order**; compatible with most algorithms.
   * **Cons:** Adds extra columns (5 in this case), but manageable since the number of categories is small.

---

## **Step 2: Recommended Technique**

**Use One-Hot Encoding.**

**Reasoning:**

* Only **5 unique categories** → creating 5 columns is computationally cheap.
* Avoids **introducing false ordinal relationships**.
* Works well with **most types of machine learning algorithms**.

---

## **Step 3: Example in Python**


---

### **Step 4: Notes**

* If the categorical variable had **high cardinality** (hundreds or thousands of unique values), **Label Encoding or Target Encoding** may be preferred to reduce dimensionality.
* For **small sets of categories (like 5)**, One-Hot Encoding is **safer and standard**.
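
For the high-cardinality case, target encoding can be sketched in a few lines of pandas. The `City` and `Churn` columns and their values are hypothetical, and this simplified version omits the smoothing and cross-validation that are normally used in practice.

```python
import pandas as pd

# Hypothetical data: a categorical feature and a binary target
df = pd.DataFrame({
    "City": ["A", "A", "B", "B", "B", "C"],
    "Churn": [1, 0, 1, 1, 0, 0],
})

# Mean target value per category
city_means = df.groupby("City")["Churn"].mean()

# Replace each category with its mean target value (one numeric column)
df["City_Target"] = df["City"].map(city_means)
print(df)
```

The means should be computed on the training data only; computing them on the full dataset leaks target information into the feature.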

---

✅ **Conclusion:**
For a categorical variable with 5 unique values, **One-Hot Encoding** is preferred because it preserves the **non-ordinal nature** of the data and ensures **compatibility with machine learning algorithms**.

In [5]:
import pandas as pd

# Sample categorical data (3 unique values here; the same call works for 5)
data = pd.DataFrame({'Category': ['A', 'B', 'C', 'B', 'A']})

# One-Hot Encoding
data_encoded = pd.get_dummies(data['Category'], prefix='Category')
print(data_encoded)

   Category_A  Category_B  Category_C
0        True       False       False
1       False        True       False
2       False       False        True
3       False        True       False
4        True       False       False


**Q5.** In a machine learning project, you have a dataset with 1000 rows and 5 columns. Two of the columns
are categorical, and the remaining three columns are numerical. If you were to use nominal encoding to
transform the categorical data, how many new columns would be created? Show your calculations.

**Ans:**

## **Problem**

* Dataset: 1000 rows × 5 columns
* Columns: 2 categorical, 3 numerical
* Task: Apply **nominal encoding** (One-Hot Encoding) to categorical columns

---

## **Step 1: Identify the unique values in categorical columns**

The question does not state how many unique values each categorical column has, so assume the following for illustration:

| Column     | Unique Values |
| ---------- | ------------- |
| Category_1 | 4             |
| Category_2 | 3             |

> Note: The number of unique values determines how many new columns will be created for each categorical column.

---

## **Step 2: Apply Nominal Encoding (One-Hot Encoding)**

* **Category_1** → 4 unique values → 4 new columns
* **Category_2** → 3 unique values → 3 new columns

**Total new columns created = 4 + 3 = 7**

---

## **Step 3: New Dataset Shape**

* Original numerical columns: 3
* Encoded categorical columns: 7

**Total columns after encoding = 3 + 7 = 10**

* Rows remain unchanged: 1000

**New dataset shape: (1000, 10)**

---

## **Step 4: Example in Python**
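
The calculation can be checked with a small sketch that builds a 1000-row dataset matching the assumption of 4 and 3 unique values. The column values are generated placeholders, not real data.

```python
import pandas as pd

n_rows = 1000

# Hypothetical data matching the assumption: 4 and 3 unique values
data = pd.DataFrame({
    "Category_1": (["A", "B", "C", "D"] * 250)[:n_rows],
    "Category_2": (["X", "Y", "Z"] * 334)[:n_rows],
    "Num_1": list(range(n_rows)),
    "Num_2": list(range(n_rows)),
    "Num_3": list(range(n_rows)),
})

# Full one-hot encoding: 4 + 3 = 7 encoded columns
encoded = pd.get_dummies(data, columns=["Category_1", "Category_2"])

# Dummy encoding (drop_first=True): (4 - 1) + (3 - 1) = 5 encoded columns
dummy = pd.get_dummies(data, columns=["Category_1", "Category_2"], drop_first=True)

print(encoded.shape)  # (1000, 10)
print(dummy.shape)    # (1000, 8)
```

In general, a column with k unique values contributes k columns under one-hot encoding and k - 1 columns when one category is dropped.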





✅ **Answer:**

* **New columns created by nominal encoding (assuming 4 and 3 unique values):** **7**
* **Total columns after encoding:** **10** (3 numeric + 7 encoded)



In [6]:
import pandas as pd

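# Sample dataset (Category_1 has only 3 unique values in this small sample,
# so it yields 3 + 3 = 6 encoded columns and 9 columns in total)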
data = pd.DataFrame({
    'Category_1': ['A', 'B', 'C', 'A', 'B'],
    'Category_2': ['X', 'Y', 'X', 'Z', 'Y'],
    'Num_1': [1, 2, 3, 4, 5],
    'Num_2': [5, 4, 3, 2, 1],
    'Num_3': [10, 20, 30, 40, 50]
})

# One-Hot Encoding
data_encoded = pd.get_dummies(data, columns=['Category_1', 'Category_2'])
print(data_encoded)
print("New dataset shape:", data_encoded.shape)

   Num_1  Num_2  Num_3  Category_1_A  Category_1_B  Category_1_C  \
0      1      5     10          True         False         False   
1      2      4     20         False          True         False   
2      3      3     30         False         False          True   
3      4      2     40          True         False         False   
4      5      1     50         False          True         False   

   Category_2_X  Category_2_Y  Category_2_Z  
0          True         False         False  
1         False          True         False  
2          True         False         False  
3         False         False          True  
4         False          True         False  
New dataset shape: (5, 9)


**Q6.** You are working with a dataset containing information about different types of animals, including their
species, habitat, and diet. Which encoding technique would you use to transform the categorical data into
a format suitable for machine learning algorithms? Justify your answer.

**Ans:**


## **Problem**

* Dataset features:

  * `Species` (e.g., Lion, Tiger, Elephant)
  * `Habitat` (e.g., Forest, Savanna, Jungle)
  * `Diet` (e.g., Carnivore, Herbivore, Omnivore)

* All three are **categorical variables with no inherent order** (nominal).

* Goal: Transform the categorical data into a **machine-learning-friendly numeric format**.

---

## **Step 1: Analyze the data**

* **Nominal variables** → Categories **do not have a ranking or order**

  * Example: encoding Lion = 0 and Tiger = 1 does not mean that Lion < Tiger
* Using **integer/label encoding** could mislead the algorithm into thinking there is an order.

---

## **Step 2: Choose the Encoding Technique**

**Use One-Hot Encoding (Nominal Encoding).**

**Reasons:**

1. **No ordinal relationship**: One-hot encoding ensures the model does not assume any ranking between categories.
2. **Compatible with most algorithms**: Works well with regression, classification, and clustering models.
3. **Number of unique values is manageable**:

   * Species, Habitat, and Diet usually have a limited number of categories.
   * One-hot encoding won’t create too many additional columns.

---

## **Step 3: Example in Python**
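
A small sample may not contain every category (for example, no `Omnivore`), in which case `pd.get_dummies` produces no column for it. One way to handle this, sketched below, is to declare the full category list with `pd.Categorical` before encoding. The list `diet_levels` is an assumption taken from the example values in the problem statement.

```python
import pandas as pd

data = pd.DataFrame({
    "Diet": ["Carnivore", "Carnivore", "Herbivore", "Carnivore", "Carnivore"],
})

# Declare the full set of categories up front (assumed known from domain knowledge)
diet_levels = ["Carnivore", "Herbivore", "Omnivore"]
data["Diet"] = pd.Categorical(data["Diet"], categories=diet_levels)

# Unobserved categories still get a column, filled with zeros
encoded = pd.get_dummies(data, columns=["Diet"], dtype=int)
print(encoded.columns.tolist())  # ['Diet_Carnivore', 'Diet_Herbivore', 'Diet_Omnivore']
```

The same approach applies to `Species` and `Habitat` when the full list of possible values is known in advance.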





## ✅ **Conclusion**

For categorical features like `Species`, `Habitat`, and `Diet`:

* **Encoding Technique:** **One-Hot Encoding (Nominal Encoding)**
* **Reason:** Preserves the **nominal nature** of the data, avoids introducing false order, and is compatible with most machine learning algorithms.


In [7]:
import pandas as pd

# Sample dataset
data = pd.DataFrame({
    'Species': ['Lion', 'Tiger', 'Elephant', 'Tiger', 'Lion'],
    'Habitat': ['Savanna', 'Jungle', 'Forest', 'Jungle', 'Savanna'],
    'Diet': ['Carnivore', 'Carnivore', 'Herbivore', 'Carnivore', 'Carnivore']
})

# One-Hot Encoding
data_encoded = pd.get_dummies(data, columns=['Species', 'Habitat', 'Diet'])
print(data_encoded)

   Species_Elephant  Species_Lion  Species_Tiger  Habitat_Forest  \
0             False          True          False           False   
1             False         False           True           False   
2              True         False          False            True   
3             False         False           True           False   
4             False          True          False           False   

   Habitat_Jungle  Habitat_Savanna  Diet_Carnivore  Diet_Herbivore  
0           False             True            True           False  
1            True            False            True           False  
2           False            False           False            True  
3            True            False            True           False  
4           False             True            True           False  


**Q7.** You are working on a project that involves predicting customer churn for a telecommunications
company. You have a dataset with 5 features, including the customer's gender, age, contract type,
monthly charges, and tenure. Which encoding technique(s) would you use to transform the categorical
data into numerical data? Provide a step-by-step explanation of how you would implement the encoding.

**Ans:**

## **Problem**

Dataset features:

| Feature         | Type        | Example Values                     |
| --------------- | ----------- | ---------------------------------- |
| Gender          | Categorical | Male, Female                       |
| Age             | Numerical   | 25, 34, 45                         |
| Contract Type   | Categorical | Month-to-month, One year, Two year |
| Monthly Charges | Numerical   | 50, 75, 100                        |
| Tenure          | Numerical   | 1, 12, 24                          |

**Goal:** Transform **categorical features** (`Gender` and `Contract Type`) into numeric format for machine learning models.

---

## **Step 1: Identify the Encoding Technique**

1. **Gender** → 2 unique categories (Male, Female)

   * **Binary nominal variable** → Can use **Label Encoding (0 and 1)** or **One-Hot Encoding**.
   * **Recommended:** With only 2 categories, a single 0/1 column (label encoding) carries the same information as one-hot encoding, so label encoding is sufficient.

2. **Contract Type** → 3 unique categories (Month-to-month, One year, Two year)

   * Treated as nominal here → Use **One-Hot Encoding** to avoid implying order. (Contract length could also be treated as ordinal, since the terms increase in duration.)

---

## **Step 2: Implement Encoding in Python**
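
A minimal sketch of the two encoding steps with pandas, using a five-row sample built from the example values in the table above. Dropping the original `Gender` text column and using `dtype=int` are choices made here so that every remaining column is numeric.

```python
import pandas as pd

data = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male", "Male"],
    "Age": [25, 34, 45, 23, 30],
    "Contract": ["Month-to-month", "One year", "Two year", "Month-to-month", "Two year"],
    "MonthlyCharges": [50, 75, 100, 60, 80],
    "Tenure": [1, 12, 24, 2, 18],
})

# Binary feature: map the two categories to 0/1, then drop the original text column
data["Gender_Encoded"] = data["Gender"].map({"Male": 0, "Female": 1})
data = data.drop(columns="Gender")

# Multi-class nominal feature: one-hot encode into integer indicator columns
model_input = pd.get_dummies(data, columns=["Contract"], dtype=int)

# Every remaining column should now be numeric
print(model_input.dtypes)
```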

---

## **Step 3: Notes**

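* `Gender_Encoded` → 0 or 1 for the binary category
* `Contract` → 3 separate columns, one per category, avoiding ordinal assumptions
* Numerical columns (`Age`, `MonthlyCharges`, `Tenure`) are left as-is
* The original text column `Gender` should be dropped (for example with `data_encoded.drop(columns='Gender')`) once `Gender_Encoded` exists; after that, the dataset contains only numeric columns and is ready for ML algorithms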

---

## ✅ **Summary of Steps**

1. **Identify categorical features** → `Gender`, `Contract`
2. **Select encoding technique**

   * Binary → Label Encoding
   * Multi-class nominal → One-Hot Encoding
3. **Apply encoding in Python**
4. **Combine with numerical features** → Ready for ML models

In [9]:
import pandas as pd

# Sample dataset
data = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Female', 'Male', 'Male'],
    'Age': [25, 34, 45, 23, 30],
    'Contract': ['Month-to-month', 'One year', 'Two year', 'Month-to-month', 'Two year'],
    'MonthlyCharges': [50, 75, 100, 60, 80],
    'Tenure': [1, 12, 24, 2, 18]
})

# Step 1: Encode Gender (Binary)
data['Gender_Encoded'] = data['Gender'].map({'Male': 0, 'Female': 1})

# Step 2: Encode Contract Type (One-Hot Encoding)
data_encoded = pd.get_dummies(data, columns=['Contract'], drop_first=False)

print(data_encoded)

   Gender  Age  MonthlyCharges  Tenure  Gender_Encoded  \
0    Male   25              50       1               0   
1  Female   34              75      12               1   
2  Female   45             100      24               1   
3    Male   23              60       2               0   
4    Male   30              80      18               0   

   Contract_Month-to-month  Contract_One year  Contract_Two year  
0                     True              False              False  
1                    False               True              False  
2                    False              False               True  
3                     True              False              False  
4                    False              False               True  
