#### Q1. What is data encoding? How is it useful in data science?

Ans.

Data encoding is the process of converting categorical or textual data into numerical format so that machine learning models can process and analyze it effectively. Since most models work with numerical data, categorical features (such as names, colors, or labels) need to be transformed into a numerical representation.

**How Data Encoding Is Useful in Data Science:**  
- Makes Data Machine-Readable: Most ML algorithms require numerical input, and encoding converts categorical features into numerical form.
- Improves Model Performance: An encoding that matches the feature type (for example, one that does not impose a false order on unordered categories) helps models learn the relationship between categories and the target.
- Simplifies Processing: Numeric codes are cheaper to store and compute on than raw strings, although some encodings (such as one-hot) add columns.
- Handles Categorical Data: Many datasets contain categorical variables, and encoding lets them be used in ML workflows alongside numerical features.
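
As a minimal sketch of the idea, the snippet below converts a text column into integer codes with `pd.factorize` and into indicator columns with `pd.get_dummies`. The `color` column and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data, not taken from any real dataset
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Integer codes: each distinct string gets an arbitrary integer ID
# (assigned in order of first appearance)
codes, uniques = pd.factorize(df["color"])
df["color_code"] = codes

# Indicator (one-hot) columns: one 0/1 column per distinct value
dummies = pd.get_dummies(df["color"], prefix="color", dtype=int)

print(df)
print(dummies)
```

The integer codes are compact but carry an arbitrary order, while the indicator columns avoid that order at the cost of extra columns.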

---

#### Q2. What is nominal encoding? Provide an example of how you would use it in a real-world scenario.

Ans.

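Nominal encoding is a method of converting categorical (nominal) data into a numerical format for machine learning models. Nominal variables are unordered categories, meaning there is no inherent ranking or priority among them. Examples include colors, country names, or product types.

A real-world example is an e-commerce catalog whose `Product_Category` column must be encoded before modeling. The cells below show two options: one-hot encoding with `pd.get_dummies`, and integer labels with `LabelEncoder`. Integer labels imply an order (here Clothing < Electronics < Home & Kitchen, from alphabetical sorting) that does not exist in the data, so one-hot encoding is usually the safer choice for nominal features.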

In [1]:
import pandas as pd

data = pd.DataFrame({"Product_ID": [101, 102, 103, 104, 105],
                     "Product_Category": ["Electronics", "Clothing", "Home & Kitchen", "Electronics", "Clothing"]})

encoded_data = pd.get_dummies(data, columns=["Product_Category"])

print(encoded_data)

   Product_ID  Product_Category_Clothing  Product_Category_Electronics  \
0         101                      False                          True   
1         102                       True                         False   
2         103                      False                         False   
3         104                      False                          True   
4         105                       True                         False   

   Product_Category_Home & Kitchen  
0                            False  
1                            False  
2                             True  
3                            False  
4                            False  


In [2]:
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
data["Product_Category_Encoded"] = encoder.fit_transform(data["Product_Category"])
print(data)

   Product_ID Product_Category  Product_Category_Encoded
0         101      Electronics                         1
1         102         Clothing                         0
2         103   Home & Kitchen                         2
3         104      Electronics                         1
4         105         Clothing                         0


---

#### Q3. In what situations is nominal encoding preferred over one-hot encoding? Provide a practical example.

Ans.

Although One-Hot Encoding (OHE) is the most common technique for handling nominal data, it can lead to high-dimensionality issues when dealing with categorical features with many unique values. In such cases, alternative nominal encoding techniques (e.g., Label Encoding, Target Encoding, Frequency Encoding) are preferred.

**Situations Where Nominal Encoding is Preferred Over One-Hot Encoding:**  

1. When There Are Too Many Unique Categories (High Cardinality)  
- One-Hot Encoding creates one column per category, which increases dimensionality and memory usage.
- Example: Encoding 1,000 cities using One-Hot Encoding would create 1,000 columns, making the dataset sparse.

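2. When the Categorical Variable Is Used in Tree-Based Models  
- Tree-based models (e.g., decision trees, Random Forest, XGBoost) split on thresholds, so they can often work with integer labels even when the numeric order is arbitrary. This makes Label Encoding a compact, workable choice.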
- Example: "Job Titles" in an HR dataset.

3. When There Is an Established Ordinal Relationship  
- If the categories have a meaningful order, the feature is ordinal rather than nominal, and Ordinal Encoding is preferred because it preserves that order.
- Example: Movie Ratings → ["Bad", "Average", "Good", "Excellent"] → [0, 1, 2, 3]

4. When Using Target Encoding (Mean Encoding) for Supervised Learning  
- Instead of OHE, each category is replaced with the mean of the target variable for that category, computed on the training data only to avoid target leakage.
- Example: Predicting house prices based on neighborhood.
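
For the ordinal case in item 3, the order has to be stated explicitly, because `LabelEncoder` sorts classes alphabetically (Average, Bad, Excellent, Good) and would not reproduce the intended ranking. A minimal sketch using scikit-learn's `OrdinalEncoder` with a hypothetical ratings column:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ratings column; values are illustrative only
ratings = pd.DataFrame({"Rating": ["Good", "Bad", "Excellent", "Average"]})

# The intended order, from lowest to highest
order = [["Bad", "Average", "Good", "Excellent"]]

encoder = OrdinalEncoder(categories=order)
encoded = encoder.fit_transform(ratings[["Rating"]])  # 2D float array, shape (4, 1)
ratings["Rating_Encoded"] = encoded.ravel().astype(int)

print(ratings)
```

With the order passed in, "Bad" maps to 0 and "Excellent" to 3, matching the mapping given in the example above.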

In [3]:
import pandas as pd

data = pd.DataFrame({
    "City": ["New York", "San Francisco", "Los Angeles", "Chicago", "New York", "Los Angeles"],
    "Total_Rides": [12000, 8000, 9500, 7200, 11000, 9700]
})

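# Frequency encoding: replace each city with the number of rows it appears in.
# Cities with equal counts (e.g., New York and Los Angeles) receive the same code.
city_counts = data["City"].value_counts()
data["City_Encoded"] = data["City"].map(city_counts)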

print(data)

            City  Total_Rides  City_Encoded
0       New York        12000             2
1  San Francisco         8000             1
2    Los Angeles         9500             2
3        Chicago         7200             1
4       New York        11000             2
5    Los Angeles         9700             2


---

#### Q4. Suppose you have a dataset containing categorical data with 5 unique values. Which encoding technique would you use to transform this data into a format suitable for machine learning algorithms? Explain why you made this choice.

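Ans.

Assuming the five values are nominal (no inherent order), one-hot encoding is a reasonable default: five categories add only five binary columns, and no artificial order is imposed on the categories. The cells below show one-hot encoding first, then label encoding and target encoding for comparison.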

In [4]:
import pandas as pd

data = pd.DataFrame({"Product_Category": ["Electronics", "Clothing", "Home", "Sports", "Toys"]})

encoded_data = pd.get_dummies(data, columns=["Product_Category"])
print(encoded_data)

   Product_Category_Clothing  Product_Category_Electronics  \
0                      False                          True   
1                       True                         False   
2                      False                         False   
3                      False                         False   
4                      False                         False   

   Product_Category_Home  Product_Category_Sports  Product_Category_Toys  
0                  False                    False                  False  
1                  False                    False                  False  
2                   True                    False                  False  
3                  False                     True                  False  
4                  False                    False                   True  


In [5]:
from sklearn.preprocessing import LabelEncoder

data = pd.DataFrame({"Product_Category": ["Electronics", "Clothing", "Home", "Sports", "Toys"]})

encoder = LabelEncoder()
data["Category_Encoded"] = encoder.fit_transform(data["Product_Category"])
print(data)

  Product_Category  Category_Encoded
0      Electronics                 1
1         Clothing                 0
2             Home                 2
3           Sports                 3
4             Toys                 4


In [6]:
data = pd.DataFrame({
    "Neighborhood": ["A", "B", "C", "D", "E"],
    "House_Price": [500000, 700000, 600000, 800000, 550000]
})

mean_target = data.groupby("Neighborhood")["House_Price"].mean()

data["Neighborhood_Encoded"] = data["Neighborhood"].map(mean_target)
print(data)

  Neighborhood  House_Price  Neighborhood_Encoded
0            A       500000              500000.0
1            B       700000              700000.0
2            C       600000              600000.0
3            D       800000              800000.0
4            E       550000              550000.0
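
In the cell above, every neighborhood appears exactly once, so the encoded value is simply the row's own target, which leaks the label into the feature. A hedged sketch of one common mitigation follows: compute the category means on training rows only and shrink them toward the global mean. The data, the train/test split, and the smoothing weight `m` are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical training and test rows; prices are invented for illustration
train = pd.DataFrame({
    "Neighborhood": ["A", "A", "B", "B", "B", "C"],
    "House_Price": [500000, 520000, 700000, 720000, 680000, 600000],
})
test = pd.DataFrame({"Neighborhood": ["A", "C", "D"]})  # "D" is unseen in training

m = 2  # smoothing weight (assumed value; tune it for real data)
global_mean = train["House_Price"].mean()
stats = train.groupby("Neighborhood")["House_Price"].agg(["mean", "count"])

# Shrink each category mean toward the global mean; rare categories shrink more
smoothed = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)

# Apply the training-derived mapping; unseen categories fall back to the global mean
test["Neighborhood_Encoded"] = test["Neighborhood"].map(smoothed).fillna(global_mean)

print(test)
```

Because the mapping is built from `train` alone, the test rows never see their own targets.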


---

#### Q5. In a machine learning project, you have a dataset with 1000 rows and 5 columns. Two of the columns are categorical, and the remaining three columns are numerical. If you were to use nominal encoding to transform the categorical data, how many new columns would be created? Show your calculations.

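Ans.

The number of new columns depends on how many unique values each categorical column has, which the question does not state. With one-hot encoding, a column with k unique values produces k new columns (k - 1 with `drop="first"`); label encoding produces one integer column per categorical column. In the sample data below, `Category_A` has 4 unique values and `Category_B` has 3, so one-hot encoding creates 4 + 3 = 7 columns, or 3 + 2 = 5 with `drop="first"` as used in the cell below, while label encoding creates 2.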

In [7]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

data = pd.DataFrame({
    "Category_A": ["Red", "Blue", "Green", "Red", "Blue", "Green", "Red", "Yellow", "Green", "Yellow"],
    "Category_B": ["Small", "Medium", "Large", "Small", "Large", "Medium", "Small", "Large", "Medium", "Small"],
    "Feature_1": [10.5, 20.3, 15.8, 18.2, 11.7, 22.1, 13.4, 17.6, 19.8, 12.5],
    "Feature_2": [100, 200, 150, 180, 110, 220, 130, 170, 190, 120],
    "Feature_3": [5.5, 7.8, 6.3, 8.0, 5.9, 7.5, 6.1, 7.2, 7.0, 5.8]
})

print("\nOriginal Dataset:")
print(data)


Original Dataset:
  Category_A Category_B  Feature_1  Feature_2  Feature_3
0        Red      Small       10.5        100        5.5
1       Blue     Medium       20.3        200        7.8
2      Green      Large       15.8        150        6.3
3        Red      Small       18.2        180        8.0
4       Blue      Large       11.7        110        5.9
5      Green     Medium       22.1        220        7.5
6        Red      Small       13.4        130        6.1
7     Yellow      Large       17.6        170        7.2
8      Green     Medium       19.8        190        7.0
9     Yellow      Small       12.5        120        5.8


In [8]:
# OPTION 1: One-Hot Encoding

# Step 2: Apply One-Hot Encoding
ohe = OneHotEncoder(drop="first", sparse_output=False)  # Drop first to avoid dummy variable trap (optional)
encoded_ohe = ohe.fit_transform(data[["Category_A", "Category_B"]])

# Convert OHE result to a DataFrame with proper column names
ohe_columns = ohe.get_feature_names_out(["Category_A", "Category_B"])
encoded_df_ohe = pd.DataFrame(encoded_ohe, columns=ohe_columns)

# Step 3: Concatenate with the original dataset (dropping original categorical columns)
final_data_ohe = pd.concat([data.drop(columns=["Category_A", "Category_B"]), encoded_df_ohe], axis=1)

print("\nDataset After One-Hot Encoding:")
print(final_data_ohe)

# OPTION 2: Label Encoding

# Step 4: Apply Label Encoding
label_encoder_A = LabelEncoder()
label_encoder_B = LabelEncoder()

data["Category_A_Label"] = label_encoder_A.fit_transform(data["Category_A"])
data["Category_B_Label"] = label_encoder_B.fit_transform(data["Category_B"])

print("\nDataset After Label Encoding:")
print(data.drop(columns=["Category_A", "Category_B"]))  # Drop original categorical columns


Dataset After One-Hot Encoding:
   Feature_1  Feature_2  Feature_3  Category_A_Green  Category_A_Red  \
0       10.5        100        5.5               0.0             1.0   
1       20.3        200        7.8               0.0             0.0   
2       15.8        150        6.3               1.0             0.0   
3       18.2        180        8.0               0.0             1.0   
4       11.7        110        5.9               0.0             0.0   
5       22.1        220        7.5               1.0             0.0   
6       13.4        130        6.1               0.0             1.0   
7       17.6        170        7.2               0.0             0.0   
8       19.8        190        7.0               1.0             0.0   
9       12.5        120        5.8               0.0             0.0   

   Category_A_Yellow  Category_B_Medium  Category_B_Small  
0                0.0                0.0               1.0  
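1                0.0                1.0               0.0  
2                0.0                0.0               0.0  
3                0.0                0.0               1.0  
4                0.0                0.0               0.0  
5                0.0                1.0               0.0  
6                0.0                0.0               1.0  
7                1.0                0.0               0.0  
8                0.0                1.0               0.0  
9                1.0                0.0               1.0  

Dataset After Label Encoding:
   Feature_1  Feature_2  Feature_3  Category_A_Label  Category_B_Label
0       10.5        100        5.5                 2                 2
1       20.3        200        7.8                 0                 1
2       15.8        150        6.3                 1                 0
3       18.2        180        8.0                 2                 2
4       11.7        110        5.9                 0                 0
5       22.1        220        7.5                 1                 1
6       13.4        130        6.1                 2                 2
7       17.6        170        7.2                 3                 0
8       19.8        190        7.0                 1                 1
9       12.5        120        5.8                 3                 2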
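
The column counts for this sample data can be checked programmatically. The sketch below reuses the two categorical columns from the cells above and counts unique values per column; the variable names are illustrative.

```python
import pandas as pd

# Same categorical values as the sample data in this answer
data = pd.DataFrame({
    "Category_A": ["Red", "Blue", "Green", "Red", "Blue", "Green", "Red", "Yellow", "Green", "Yellow"],
    "Category_B": ["Small", "Medium", "Large", "Small", "Large", "Medium", "Small", "Large", "Medium", "Small"],
})

k = data.nunique()  # unique values per categorical column

ohe_full = int(k.sum())              # one-hot: one column per category
ohe_drop_first = int((k - 1).sum())  # one-hot with drop="first": one fewer per feature
label_cols = len(k)                  # label encoding: one integer column per feature

print(k.to_dict())
print(ohe_full, ohe_drop_first, label_cols)
```

For a real dataset, the same `nunique()` call on the categorical columns gives the count before any encoding is applied.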

---

#### Q6. You are working with a dataset containing information about different types of animals, including their species, habitat, and diet. Which encoding technique would you use to transform the categorical data into a format suitable for machine learning algorithms? Justify your answer.

Ans.

**1. One-Hot Encoding (OHE):**  
- Use when: The categorical variables are nominal (no inherent order), which is likely the case for species, habitat, and diet.
- Why OHE?
  - It creates binary columns for each category, ensuring that the algorithm doesn’t assume any ordinal relationship.
  - Works well with most machine learning algorithms (e.g., linear regression, decision trees, random forests).
- Downside: Can lead to a high-dimensional feature space if there are many unique categories (e.g., many species).

**2. Label Encoding:**  
- Use when: The categorical variables are ordinal (with a meaningful order), and the integer codes are assigned to follow that order.
- Why not here?
  - For non-ordinal categories like species or habitat, label encoding can introduce a false sense of order, potentially misleading algorithms that assume numerical order is meaningful.

---

#### Q7. You are working on a project that involves predicting customer churn for a telecommunications company. You have a dataset with 5 features, including the customer's gender, age, contract type, monthly charges, and tenure. Which encoding technique(s) would you use to transform the categorical data into numerical data? Provide a step-by-step explanation of how you would implement the encoding.

Ans.

**1. Gender (Binary Category):**  
- Encoding Method: Label Encoding
- Why: Since gender has only two categories (Male/Female), label encoding will assign 0 and 1 without introducing any unintended ordinal relationships.


**2. Contract Type (Multiple Categories, No Order):**  
- Encoding Method: One-Hot Encoding
- Why: The contract type has multiple nominal categories without any natural order. One-hot encoding prevents the model from assuming any ordinal relationship between them.

In [9]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer

data = {
    'gender': ['Male', 'Female', 'Female', 'Male'],
    'age': [34, 45, 23, 56],
    'contract_type': ['Month-to-month', 'Two year', 'One year', 'Month-to-month'],
    'monthly_charges': [70.5, 89.2, 55.3, 99.8],
    'tenure': [5, 24, 12, 2]
}

df = pd.DataFrame(data)
df

   gender  age   contract_type  monthly_charges  tenure
0    Male   34  Month-to-month             70.5       5
1  Female   45        Two year             89.2      24
2  Female   23        One year             55.3      12
3    Male   56  Month-to-month             99.8       2


In [10]:
label_encoder = LabelEncoder()
df['gender'] = label_encoder.fit_transform(df['gender'])  # Male:1, Female:0
df

   gender  age   contract_type  monthly_charges  tenure
0       1   34  Month-to-month             70.5       5
1       0   45        Two year             89.2      24
2       0   23        One year             55.3      12
3       1   56  Month-to-month             99.8       2


In [11]:
contract_ohe = pd.get_dummies(df['contract_type'], prefix='contract', drop_first=False)
df = pd.concat([df.drop('contract_type', axis=1), contract_ohe], axis=1)
df

   gender  age  monthly_charges  tenure  contract_Month-to-month  contract_One year  contract_Two year
0       1   34             70.5       5                     True              False              False
1       0   45             89.2      24                    False              False               True
2       0   23             55.3      12                    False               True              False
3       1   56             99.8       2                     True              False              False
