Q1. What is the difference between Ordinal Encoding and Label Encoding? Provide an example of when you
might choose one over the other.

Label Encoding:

Converts each category of a categorical variable into an integer code.
The codes are arbitrary identifiers, so no order among the categories is intended.
Example: For a categorical variable "Fruit" with categories ["Apple", "Banana", "Cherry"], label encoding might assign: Apple: 0, Banana: 1, Cherry: 2.

Ordinal Encoding:

Also maps categories to integers, but the integers follow the inherent order of the categories, so a ranking such as Small < Medium < Large is preserved in the codes.

Example: For a categorical variable "Size" with categories ["Small", "Medium", "Large"], ordinal encoding might assign: Small: 0, Medium: 1, Large: 2.

Choosing Between Them:

Use Label Encoding when the categorical variable has no inherent order.
Use Ordinal Encoding when the categorical variable has a natural order.

Example:

For a variable "Quality" with categories ["Low", "Medium", "High"], use ordinal encoding.
For a variable "Color" with categories ["Red", "Green", "Blue"], use label encoding (or one-hot encoding, if the model could misread the integer codes as a ranking).
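
The distinction can be sketched in plain Python. This is an illustration only: `label_encode` and `ordinal_encode` are hypothetical helper names, and the alphabetical code assignment mimics the sorted order in which scikit-learn's `LabelEncoder` numbers its classes.

```python
def label_encode(values):
    """Assign arbitrary integer codes (here: alphabetical order of the categories)."""
    codes = {category: i for i, category in enumerate(sorted(set(values)))}
    return [codes[v] for v in values]


def ordinal_encode(values, order):
    """Assign integer codes that follow an explicitly supplied ranking."""
    codes = {category: i for i, category in enumerate(order)}
    return [codes[v] for v in values]


quality = ["Low", "High", "Medium", "Low"]
color = ["Red", "Green", "Blue", "Green"]

# The ranking Low < Medium < High is supplied by hand, so it is preserved.
quality_codes = ordinal_encode(quality, order=["Low", "Medium", "High"])

# Alphabetical codes: Blue=0, Green=1, Red=2; the numbers carry no meaning.
color_codes = label_encode(color)

print(quality_codes)  # [0, 2, 1, 0]
print(color_codes)    # [2, 1, 0, 1]
```

Applying `label_encode` to `quality` would give High=0, Low=1, Medium=2, which scrambles the ranking; that is why the explicit order matters for ordinal data.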

Q2. Explain how Target Guided Ordinal Encoding works and provide an example of when you might use it in
a machine learning project.

Target Guided Ordinal Encoding:

This technique assigns numerical values to categories based on the relationship between the category and the target variable.
The categories are sorted by the mean of the target variable, and then numerical values are assigned based on this order.

#Example:

Suppose you have a dataset with a categorical feature "City" and a target variable "House Price." The steps are:

Calculate the mean house price for each city.

Sort the cities by the mean house price.

Assign ordinal values based on the sorted order.

#When to Use:

Use when the categorical variable has no inherent order, but you want to incorporate the relationship between the categorical variable and the target variable into the encoding process.

Example: Predicting house prices based on the "City" feature.
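
The three steps above can be sketched as follows. The cities and prices are made-up illustration data, and the variable names are assumptions rather than part of any library.

```python
from collections import defaultdict

# Hypothetical training rows: (city, house price). Values are made up.
rows = [
    ("Austin", 300_000), ("Austin", 340_000),
    ("Boston", 620_000), ("Boston", 580_000),
    ("Dayton", 150_000), ("Dayton", 170_000),
]

# Step 1: mean house price per city.
prices_by_city = defaultdict(list)
for city, price in rows:
    prices_by_city[city].append(price)
mean_price = {city: sum(p) / len(p) for city, p in prices_by_city.items()}

# Step 2: sort cities by mean price, lowest first.
ranked_cities = sorted(mean_price, key=mean_price.get)

# Step 3: assign ordinal codes in that order.
city_code = {city: rank for rank, city in enumerate(ranked_cities)}

print(city_code)  # {'Dayton': 0, 'Austin': 1, 'Boston': 2}
```

Because the codes are derived from the target, the means should be computed on the training data only; computing them on the full dataset would leak target information into the feature.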

Q3. Define covariance and explain why it is important in statistical analysis. How is covariance calculated?

#Covariance:

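Covariance measures the direction of the linear relationship between two variables. Its sign shows the direction; its magnitude depends on the units of the variables, so it is not a standardized measure of strength.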

Positive covariance indicates that the variables tend to increase or decrease together.

Negative covariance indicates that as one variable increases, the other tends to decrease.

#Importance in Statistical Analysis:

Helps in understanding the relationship between variables.

Used in portfolio theory to quantify how the returns of different assets move together, which determines how much diversification reduces risk.
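
#Calculation:

For n paired observations, the sample covariance is the sum of the products of each variable's deviations from its mean, divided by n - 1:

cov(X, Y) = sum((x_i - mean(X)) * (y_i - mean(Y))) / (n - 1)

A minimal sketch of that formula follows. `sample_covariance` is a hypothetical helper name and the data are made up for illustration.

```python
def sample_covariance(x, y):
    """Sample covariance: sum of deviation products divided by (n - 1)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, at least 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)


# Made-up data: the second variable is exactly twice the first.
hours_studied = [1, 2, 3, 4, 5]
exam_score = [2, 4, 6, 8, 10]

print(sample_covariance(hours_studied, exam_score))  # 5.0
```

This is the same n - 1 normalization that `np.cov` applies by default.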

Q4. For a dataset with the following categorical variables: Color (red, green, blue), Size (small, medium,
large), and Material (wood, metal, plastic), perform label encoding using Python's scikit-learn library.
Show your code and explain the output.

In [3]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder


data = {
    'Color': ['red', 'green', 'blue'],
    'Size': ['small', 'medium', 'large'],
    'Material': ['wood', 'metal', 'plastic']
}

df = pd.DataFrame(data)

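# One encoder per column, kept so each mapping can be inspected or reversed later.
label_encoders = {
    'Color': LabelEncoder(),
    'Size': LabelEncoder(),
    'Material': LabelEncoder()
}

# Fit each encoder on its column and replace the strings with integer codes.
for column, encoder in label_encoders.items():
    df[column] = encoder.fit_transform(df[column])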

print(df)

   Color  Size  Material
0      2     2         2
1      1     1         0
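2      0     0         1

Explanation: LabelEncoder sorts the distinct values of each column alphabetically and numbers them from 0.

Color: blue -> 0, green -> 1, red -> 2
Size: large -> 0, medium -> 1, small -> 2
Material: metal -> 0, plastic -> 1, wood -> 2

Each output row holds the codes of the corresponding input row; for example, row 0 (red, small, wood) becomes (2, 2, 2). Note that the alphabetical codes for Size do not follow the natural order small < medium < large.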


Q5. Calculate the covariance matrix for the following variables in a dataset: Age, Income, and Education
level. Interpret the results.

In [4]:
import numpy as np

age = [25, 32, 47, 51, 62]
income = [35000, 45000, 62000, 78000, 91000]
education_level = [12, 14, 16, 16, 18]

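# Each row of `data` is one variable; np.cov treats rows as variables by default.
data = np.array([age, income, education_level])

# 3x3 sample covariance matrix (normalized by n - 1), ordered Age, Income, Education level.
cov_matrix = np.cov(data)
print(cov_matrix)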


[[2.213e+02 3.379e+05 3.340e+01]
 [3.379e+05 5.287e+08 5.020e+04]
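 [3.340e+01 5.020e+04 5.200e+00]]

#Interpretation:

Rows and columns are ordered Age, Income, Education level.

The diagonal holds the sample variances: about 221.3 for Age, 5.287e+08 for Income, and 5.2 for Education level.

The off-diagonal entries are the pairwise covariances: Age and Income (3.379e+05), Age and Education level (33.4), Income and Education level (5.020e+04).

All three covariances are positive, so in this sample each pair of variables tends to increase together.

The magnitudes are not comparable across pairs because they depend on each variable's units; the Income entries are largest simply because Income is measured in much larger numbers.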
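
One way to put the three pairwise entries of the covariance matrix printed above on a common scale is to divide each covariance by the product of the two standard deviations, which gives a correlation coefficient. The sketch below does this by hand; it copies the rounded values from the printed matrix, so the results are approximate.

```python
import math

# Entries copied from the covariance matrix printed above (rounded as displayed).
cov = [
    [2.213e+02, 3.379e+05, 3.340e+01],
    [3.379e+05, 5.287e+08, 5.020e+04],
    [3.340e+01, 5.020e+04, 5.200e+00],
]
names = ["age", "income", "education_level"]
size = len(cov)

# Standard deviations are the square roots of the diagonal entries (the variances).
std = [math.sqrt(cov[i][i]) for i in range(size)]

# Correlation = covariance / (std_i * std_j); unit-free, normally within [-1, 1].
corr = [[cov[i][j] / (std[i] * std[j]) for j in range(size)] for i in range(size)]

for i in range(size):
    for j in range(i + 1, size):
        print(f"{names[i]} vs {names[j]}: {corr[i][j]:.3f}")
```

All three correlations come out close to 1, which indicates a strong positive linear relationship in this five-row sample; with so few observations, this should be read as descriptive only.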


Q6. You are working on a machine learning project with a dataset containing several categorical
variables, including "Gender" (Male/Female), "Education Level" (High School/Bachelor's/Master's/PhD),
and "Employment Status" (Unemployed/Part-Time/Full-Time). Which encoding method would you use for
each variable, and why?

#Gender:

Use label encoding or one-hot encoding.

Reason: Binary category, label encoding is simple and sufficient.

Example: Male -> 0, Female -> 1

#Education Level:

Use ordinal encoding.

Reason: The categories have an inherent order.

Example: High School -> 0, Bachelor's -> 1, Master's -> 2, PhD -> 3

#Employment Status:

Use one-hot encoding.

Reason: No inherent order and multiple categories.

Example: Unemployed -> [1, 0, 0], Part-Time -> [0, 1, 0], Full-Time -> [0, 0, 1]
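
The three choices above can be sketched in plain Python. The dictionaries, the `one_hot` helper, and the sample record are illustrative assumptions, not part of any library.

```python
# Gender: binary, so a single 0/1 column is enough.
gender_code = {"Male": 0, "Female": 1}

# Education Level: codes follow the stated ranking.
education_order = ["High School", "Bachelor's", "Master's", "PhD"]
education_code = {level: rank for rank, level in enumerate(education_order)}

# Employment Status: one indicator column per category (one-hot).
employment_levels = ["Unemployed", "Part-Time", "Full-Time"]


def one_hot(value, levels):
    """Return a list with 1 at the position of `value` and 0 elsewhere."""
    if value not in levels:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if level == value else 0 for level in levels]


# One hypothetical record, encoded field by field.
record = {
    "Gender": "Female",
    "Education Level": "Master's",
    "Employment Status": "Part-Time",
}
encoded = (
    [gender_code[record["Gender"]]]
    + [education_code[record["Education Level"]]]
    + one_hot(record["Employment Status"], employment_levels)
)

print(encoded)  # [1, 2, 0, 1, 0]
```

The encoded row has five columns: one for Gender, one for Education Level, and three indicator columns for Employment Status.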

Q7. You are analyzing a dataset with two continuous variables, "Temperature" and "Humidity", and two
categorical variables, "Weather Condition" (Sunny/Cloudy/Rainy) and "Wind Direction" (North/South/
East/West). Calculate the covariance between each pair of variables and interpret the results.

In [5]:
import numpy as np

temperature = [70, 75, 80, 85, 90]
humidity = [65, 60, 70, 75, 80]

# np.cov returns a 2x2 matrix: variances on the diagonal, covariance off the diagonal.
covariance = np.cov(temperature, humidity)
print(covariance)


[[62.5  56.25]
 [56.25 62.5 ]]


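The covariance between Temperature and Humidity is 56.25 (the off-diagonal entry). The diagonal entries, 62.5 each, are the sample variances of Temperature and Humidity.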

Positive covariance indicates that as temperature increases, humidity also tends to increase.

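Covariance is not meaningful for the categorical variables "Weather Condition" and "Wind Direction": both are nominal, so any integer codes assigned to them would be arbitrary. Relationships involving them are usually assessed with other methods, such as a Chi-Square test of independence between the two categorical variables.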
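
As a rough sketch of the Chi-Square approach for the two categorical variables, the statistic can be computed by hand from a table of counts. The counts below are hypothetical (the question provides no data), and `chi_square_statistic` is an illustrative helper name.

```python
# Hypothetical day counts: rows = Weather Condition, columns = Wind Direction.
weather = ["Sunny", "Cloudy", "Rainy"]
wind = ["North", "South", "East", "West"]
observed = [
    [12, 8, 10, 10],  # Sunny
    [6, 9, 7, 8],     # Cloudy
    [4, 11, 5, 10],   # Rainy
]


def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            # Count expected in this cell if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (count - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


stat, dof = chi_square_statistic(observed)
print(f"chi-square = {stat:.2f}, degrees of freedom = {dof}")
```

A statistic that is large relative to the chi-square distribution with `dof` degrees of freedom suggests the two variables are not independent. In practice a library routine such as `scipy.stats.chi2_contingency` would also return the p-value.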