#### Q1. What is the difference between Ordinal Encoding and Label Encoding? Provide an example of when you might choose one over the other.

Ans.

**1. Ordinal Encoding:**  
- Purpose: Used when the categorical variable has an inherent order or ranking.
- How it works: Each category is assigned an integer value based on the order of the categories.
- Example: Suppose we have a feature Size with categories:
  - Small → 1
  - Medium → 2
  - Large → 3
- When to use: Choose ordinal encoding when the order of the categories matters. For example, in a dataset predicting delivery time, Size of a package may affect delivery speed, and "Large" might logically take longer than "Small."

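**2. Label Encoding:**  
- Purpose: Converts category labels to integers without using any ranking information.
- How it works: Each unique category is assigned a unique integer. scikit-learn's `LabelEncoder` numbers the categories in sorted (alphabetical) order, so the codes identify a category but do not rank it.
- Example: Suppose we have a feature Color with categories:
  - Blue → 0
  - Green → 1
  - Red → 2
- When to use: Choose label encoding when the categories are nominal (i.e., no order or rank) and the model will not read the codes as magnitudes, as with tree-based models or a target column. For instance, if you're predicting car prices based on color, the order of colors doesn’t carry meaning. For linear or distance-based models, one-hot encoding is usually safer for nominal features, because integer codes can be mistaken for an ordering.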
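
To make the contrast concrete, here is a minimal sketch using scikit-learn's `OrdinalEncoder` and `LabelEncoder` on a small `Size` column. The sample values are made up for illustration, and `OrdinalEncoder` starts counting at 0 rather than the 1 used in the ordinal example above.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Made-up sample values for illustration
sizes = pd.DataFrame({"Size": ["Small", "Large", "Medium", "Small"]})

# Ordinal encoding: the category order is stated explicitly
ordinal = OrdinalEncoder(categories=[["Small", "Medium", "Large"]])
sizes["Size_Ordinal"] = ordinal.fit_transform(sizes[["Size"]]).ravel().astype(int)

# Label encoding: codes follow the sorted (alphabetical) order of the labels
label = LabelEncoder()
sizes["Size_Label"] = label.fit_transform(sizes["Size"])

print(sizes)
```

The ordinal codes respect Small < Medium < Large, while the label codes put Large first only because it sorts first alphabetically.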

---

#### Q2. Explain how Target Guided Ordinal Encoding works and provide an example of when you might use it in a machine learning project.

Ans.

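Target Guided Ordinal Encoding encodes a categorical variable using its relationship with the target variable. Categories are ordered by the mean (or median) of the target within each category and then assigned integers in that order. It is useful when a nominal feature has many categories and a clear relationship with the target, for example encoding a neighborhood by its average sale price when predicting house prices. Because the encoding uses the target, the ordering should be learned from the training data only to avoid leakage.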

**How It Works:**  
- Calculate the mean (or median) of the target variable for each category.
- Sort the categories based on the calculated mean/median.
- Assign integers to the categories according to their rank.

In [1]:
import pandas as pd

df = pd.DataFrame({
    'Neighborhood': ['A', 'B', 'C', 'A', 'B', 'C'],
    'Price': [250000, 160000, 120000, 210000, 140000, 90000]
})

mean_price = df.groupby('Neighborhood')['Price'].mean().sort_values()
encoding = {category: idx + 1 for idx, category in enumerate(mean_price.index)}

df['Neighborhood_Encoded'] = df['Neighborhood'].map(encoding)

print(df)

  Neighborhood   Price  Neighborhood_Encoded
0            A  250000                     3
1            B  160000                     2
2            C  120000                     1
3            A  210000                     3
4            B  140000                     2
5            C   90000                     1
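
As a hedged follow-up, the sketch below separates learning the mapping from applying it, so the ordering can be fitted on training rows and reused on new rows. The helper `fit_target_ordinal` and the `new_rows` frame (including the unseen neighborhood "D") are hypothetical names introduced here; the training values are the ones from the cell above.

```python
import pandas as pd

def fit_target_ordinal(frame, column, target):
    """Return {category: rank}, ranked by mean target (rank 1 = lowest mean)."""
    means = frame.groupby(column)[target].mean().sort_values()
    return {category: rank for rank, category in enumerate(means.index, start=1)}

train = pd.DataFrame({
    "Neighborhood": ["A", "B", "C", "A", "B", "C"],
    "Price": [250000, 160000, 120000, 210000, 140000, 90000],
})

# Hypothetical new rows; "D" never appears in the training data
new_rows = pd.DataFrame({"Neighborhood": ["C", "A", "D"]})

mapping = fit_target_ordinal(train, "Neighborhood", "Price")
new_rows["Neighborhood_Encoded"] = new_rows["Neighborhood"].map(mapping)

print(mapping)
print(new_rows)
```

Categories that were not seen during fitting map to `NaN`, so they need an explicit fallback (for example a reserved code) before modeling.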


---

#### Q3. Define covariance and explain why it is important in statistical analysis. How is covariance calculated?

Ans.

**Covariance:**  
- Covariance measures how two random variables vary together. Its sign gives the direction of the linear relationship:
  - Positive Covariance: When one variable increases, the other tends to increase.
  - Negative Covariance: When one variable increases, the other tends to decrease.
  - Zero Covariance: No linear relationship between the variables (a nonlinear relationship may still exist).

**Why Covariance Is Important in Statistical Analysis:**  
- Understanding Relationships: Covariance helps in identifying the direction of the relationship between variables.
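- Feature Selection: In machine learning, a feature that covaries with the target variable may be predictive. Because covariance depends on the units of both variables, correlation (covariance scaled by the two standard deviations) is the better measure of how strong the relationship is.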
- Portfolio Theory (Finance): Covariance is used to measure how different asset returns move together, which is essential for diversification.
- Principal Component Analysis (PCA): PCA uses the covariance matrix to reduce dimensionality by identifying the directions (principal components) with the most variance.
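
**How Covariance Is Calculated:**  
For a sample of $n$ paired observations, the sample covariance is

$$\operatorname{Cov}(X, Y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$$

where $\bar{x}$ and $\bar{y}$ are the sample means. The population version divides by $n$ instead of $n-1$. The sketch below works through the formula in plain Python on the same study-hours data used in the NumPy cell that follows; the variable names are chosen here for illustration.

```python
X = [2, 4, 6, 8]      # study hours
Y = [65, 70, 75, 80]  # exam scores

n = len(X)
mean_x = sum(X) / n   # 5.0
mean_y = sum(Y) / n   # 72.5

# Sum of the products of paired deviations from the means
cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y))  # 50.0

sample_cov = cross / (n - 1)   # divides by n - 1, like np.cov(..., bias=False)
population_cov = cross / n     # divides by n, like np.cov(..., bias=True)

print(f"Sample covariance: {sample_cov:.2f}")
print(f"Population covariance: {population_cov:.2f}")
```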

In [2]:
import numpy as np

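X = [2, 4, 6, 8]      # study hours
Y = [65, 70, 75, 80]  # exam scores

# np.cov returns the 2x2 covariance matrix; bias=False divides by n - 1 (sample covariance)
cov_matrix = np.cov(X, Y, bias=False)
covariance = cov_matrix[0, 1]  # off-diagonal entry is Cov(X, Y)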

print(f"Covariance between Study Hours and Exam Scores: {covariance:.2f}")

Covariance between Study Hours and Exam Scores: 16.67


---

#### Q4. For a dataset with the following categorical variables: Color (red, green, blue), Size (small, medium, large), and Material (wood, metal, plastic), perform label encoding using Python's scikit-learn library. Show your code and explain the output.

Ans.

In [3]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

data = {
    'Color': ['red', 'green', 'blue', 'green', 'red'],
    'Size': ['small', 'medium', 'large', 'small', 'large'],
    'Material': ['wood', 'metal', 'plastic', 'wood', 'metal']
}

df = pd.DataFrame(data)
print("Original DataFrame:\n", df,"\n\n")

le = LabelEncoder()

# Refit the encoder on each column; codes follow the sorted order of that column's labels
for column in df.columns:
    df[column + '_Encoded'] = le.fit_transform(df[column])

print("DataFrame after Label Encoding:\n", df)

Original DataFrame:
    Color    Size Material
0    red   small     wood
1  green  medium    metal
2   blue   large  plastic
3  green   small     wood
4    red   large    metal 


DataFrame after Label Encoding:
    Color    Size Material  Color_Encoded  Size_Encoded  Material_Encoded
0    red   small     wood              2             2                 2
1  green  medium    metal              1             1                 0
2   blue   large  plastic              0             0                 1
3  green   small     wood              1             2                 2
4    red   large    metal              2             0                 0
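
To read the output: `LabelEncoder` sorts each column's labels alphabetically and numbers them from 0, which is why red is 2 and blue is 0 in `Color_Encoded`. The sketch below is a small variation on the loop above that fits one encoder per column and prints the label-to-code mapping from each encoder's `classes_` attribute; the `mappings` dictionary is a name introduced here for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "Color": ["red", "green", "blue", "green", "red"],
    "Size": ["small", "medium", "large", "small", "large"],
    "Material": ["wood", "metal", "plastic", "wood", "metal"],
})

mappings = {}
for column in df.columns:
    encoder = LabelEncoder().fit(df[column])
    # classes_ holds the labels in sorted order; a label's position is its code
    mappings[column] = {str(label): code for code, label in enumerate(encoder.classes_)}

for column, mapping in mappings.items():
    print(column, mapping)
```

Note that Size comes out as large = 0, medium = 1, small = 2, which reverses the natural order. For a column like Size, an ordinal encoder with an explicit category order is the better fit.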


---

#### Q5. Calculate the covariance matrix for the following variables in a dataset: Age, Income, and Education level. Interpret the results.

Ans.

In [4]:
import numpy as np
import pandas as pd

data = {
    'Age': [25, 30, 45, 35, 50],
    'Income': [40, 50, 80, 60, 90],
    # Education codes: 1 = High School, 2 = Bachelor's, 3 = Master's, 4 = Ph.D.
    'Education_Level': [2, 3, 4, 2, 4]
}

df = pd.DataFrame(data)

cov_matrix = df.cov()
print("Covariance Matrix:\n", cov_matrix)


Covariance Matrix:
                     Age  Income  Education_Level
Age              107.50   215.0             8.75
Income           215.00   430.0            17.50
Education_Level    8.75    17.5             1.00
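
One way to read the matrix: the diagonal holds the variances (Age 107.5, Income 430, Education_Level 1.0) and the off-diagonal entries hold the covariances. All covariances are positive, so in this sample older people tend to have higher income and higher education codes. The magnitudes depend on each variable's units, so 215 versus 8.75 does not by itself mean Age is tied more strongly to Income than to Education_Level. As a hedged follow-up that is not part of the original answer, converting to correlations with `DataFrame.corr()` puts the pairs on a common scale.

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [25, 30, 45, 35, 50],
    "Income": [40, 50, 80, 60, 90],
    "Education_Level": [2, 3, 4, 2, 4],
})

# Pearson correlation = covariance divided by the product of the standard deviations
corr_matrix = df.corr()
print(corr_matrix.round(2))
```

In this toy data Income equals 2 × Age − 10 in every row, so the Age-Income correlation is exactly 1, while Education_Level correlates at about 0.84 with both.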


---

#### Q6. You are working on a machine learning project with a dataset containing several categorical variables, including "Gender" (Male/Female), "Education Level" (High School/Bachelor's/Master's/PhD), and "Employment Status" (Unemployed/Part-Time/Full-Time). Which encoding method would you use for each variable, and why?

Ans.

- Gender (Male/Female): a binary nominal variable. Label encoding to 0/1 is sufficient, because with only two categories no misleading order is introduced.
- Education Level (High School/Bachelor's/Master's/PhD): an ordinal variable. Ordinal encoding with an explicitly stated order preserves the ranking.
- Employment Status (Unemployed/Part-Time/Full-Time): treated here as nominal. One-hot encoding gives each status its own column, so no ranking among the statuses is implied.

In [5]:
from sklearn.preprocessing import LabelEncoder
import pandas as pd

df = pd.DataFrame({'Gender': ['Male', 'Female', 'Female', 'Male']})
le = LabelEncoder()
df['Gender_Encoded'] = le.fit_transform(df['Gender'])
print(df)

   Gender  Gender_Encoded
0    Male               1
1  Female               0
2  Female               0
3    Male               1


In [6]:
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({'Education_Level': ['High School', "Bachelor's", "Master's", 'PhD']})
edu_order = [['High School', "Bachelor's", "Master's", 'PhD']]
oe = OrdinalEncoder(categories=edu_order)
# fit_transform returns a 2-D array of shape (n, 1); ravel() flattens it to one column
df['Education_Encoded'] = oe.fit_transform(df[['Education_Level']]).ravel()
print(df)

  Education_Level  Education_Encoded
0     High School                0.0
1      Bachelor's                1.0
2        Master's                2.0
3             PhD                3.0


In [7]:
from sklearn.preprocessing import OneHotEncoder
import pandas as pd

df = pd.DataFrame({'Employment_Status': ['Unemployed', 'Part-Time', 'Full-Time', 'Full-Time']})

# sparse_output requires scikit-learn >= 1.2 (older versions use sparse=False)
ohe = OneHotEncoder(sparse_output=False)
encoded = ohe.fit_transform(df[['Employment_Status']])

encoded_df = pd.DataFrame(encoded, columns=ohe.get_feature_names_out(['Employment_Status']))

df = pd.concat([df, encoded_df], axis=1)
print(df)

  Employment_Status  Employment_Status_Full-Time  Employment_Status_Part-Time  \
0        Unemployed                          0.0                          0.0   
1         Part-Time                          0.0                          1.0   
2         Full-Time                          1.0                          0.0   
3         Full-Time                          1.0                          0.0   

   Employment_Status_Unemployed  
0                           1.0  
1                           0.0  
2                           0.0  
3                           0.0  


---

#### Q7. You are analyzing a dataset with two continuous variables, "Temperature" and "Humidity", and two categorical variables, "Weather Condition" (Sunny/Cloudy/Rainy) and "Wind Direction" (North/South/East/West). Calculate the covariance between each pair of variables and interpret the results.

Ans.

In [8]:
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

data = {
    'Temperature': [30, 25, 20, 28, 32],
    'Humidity': [45, 60, 80, 55, 40],
    'Weather_Condition': ['Sunny', 'Cloudy', 'Rainy', 'Cloudy', 'Sunny'],
    'Wind_Direction': ['North', 'East', 'South', 'West', 'North']
}

df = pd.DataFrame(data)

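# Integer codes follow the listed order: Sunny=0, Cloudy=1, Rainy=2 and North=0, East=1, South=2, West=3.
# Wind direction is nominal, so its numeric order is arbitrary.
encoder = OrdinalEncoder(categories=[['Sunny', 'Cloudy', 'Rainy'], ['North', 'East', 'South', 'West']])
df[['Weather_Condition_Encoded', 'Wind_Direction_Encoded']] = encoder.fit_transform(df[['Weather_Condition', 'Wind_Direction']])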

cov_matrix = df[['Temperature', 'Humidity', 'Weather_Condition_Encoded', 'Wind_Direction_Encoded']].cov()
print("Covariance Matrix:\n", cov_matrix)

Covariance Matrix:
                            Temperature  Humidity  Weather_Condition_Encoded  \
Temperature                      22.00    -72.50                      -3.75   
Humidity                        -72.50    242.50                      12.75   
Weather_Condition_Encoded        -3.75     12.75                       0.70   
Wind_Direction_Encoded           -3.25     12.25                       0.80   

                           Wind_Direction_Encoded  
Temperature                                 -3.25  
Humidity                                    12.25  
Weather_Condition_Encoded                    0.80  
Wind_Direction_Encoded                       1.70  
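
Reading the matrix: Temperature and Humidity have a covariance of -72.5, so in this sample warmer readings come with lower humidity. With Weather Condition coded Sunny=0, Cloudy=1, Rainy=2, its covariance is negative with Temperature (-3.75) and positive with Humidity (12.75), which is only meaningful if Sunny < Cloudy < Rainy is accepted as an ordering. Wind Direction has no natural order, so its covariances (-3.25, 12.25, 0.80) reflect the arbitrary coding rather than the data. The sketch below illustrates this by recomputing the Temperature covariance under a second, equally arbitrary ordering; the helper `wind_covariance` is a hypothetical name introduced here.

```python
import pandas as pd

df = pd.DataFrame({
    "Temperature": [30, 25, 20, 28, 32],
    "Wind_Direction": ["North", "East", "South", "West", "North"],
})

def wind_covariance(order):
    """Sample covariance of Temperature with Wind_Direction coded by position in `order`."""
    codes = df["Wind_Direction"].map({name: code for code, name in enumerate(order)})
    return df["Temperature"].cov(codes)

cov_original = wind_covariance(["North", "East", "South", "West"])
cov_reordered = wind_covariance(["South", "East", "North", "West"])

print(f"Original order:  {cov_original:.2f}")
print(f"Reordered codes: {cov_reordered:.2f}")
```

The sign flips from -3.25 to 4.25 with nothing changed but the labels' numbering, which is why covariance with an integer-coded nominal variable should not be interpreted.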
