Q1. What is the difference between Ordinal Encoding and Label Encoding? Provide an example of when you
might choose one over the other.


In [None]:
# Ordinal Encoding is used when the categories have a meaningful order or ranking. Each category is replaced with an integer that reflects its position in that order. For example, "Size" (Small, Medium, Large) can be encoded as (1, 2, 3), so the encoding preserves Small < Medium < Large.

# Label Encoding assigns each category a unique integer without regard to any ranking. For example, "Color" (Red, Green, Blue) can be encoded as (0, 1, 2), but the numbers are arbitrary identifiers. In scikit-learn, LabelEncoder assigns the integers in sorted (alphabetical) order and is intended mainly for encoding the target variable.

# Example:

# You would use Ordinal Encoding for a feature like "Size" (Small, Medium, Large), where the order carries information.
# You would use Label Encoding for a feature like "Color" (Red, Green, Blue), where the order doesn't matter. Because linear and distance-based models can misread the arbitrary integers as a ranking, One-Hot Encoding is often the safer choice for such features.
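
# A minimal sketch of the difference, assuming scikit-learn is available. The sample values are made up for illustration. Note that scikit-learn codes start at 0 rather than 1.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical sample of an ordered feature
sizes = np.array(["Small", "Large", "Medium", "Small"])

# OrdinalEncoder with an explicit category list preserves Small < Medium < Large
ordinal = OrdinalEncoder(categories=[["Small", "Medium", "Large"]])
size_ordinal = ordinal.fit_transform(sizes.reshape(-1, 1)).ravel()

# LabelEncoder sorts the categories alphabetically, so its codes ignore the ranking
label = LabelEncoder()
size_label = label.fit_transform(sizes)

print(size_ordinal)  # codes follow the declared order
print(size_label)    # codes follow alphabetical order (Large, Medium, Small)
```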

Q2. Explain how Target Guided Ordinal Encoding works and provide an example of when you might use it in
a machine learning project.


In [None]:
# Target Guided Ordinal Encoding assigns ordinal labels to categories based on their relationship with the target variable. For each category we compute a summary of the target (typically the mean), rank the categories by that summary, and use the rank as the encoded value.

# Example: Suppose we are predicting house prices and have a categorical feature called "Neighborhood." If we calculate the mean house price for each neighborhood, we can rank the neighborhoods by their average prices and assign ordinal values accordingly (lowest mean price = 1, next = 2, and so on).

# Use this when a categorical variable has no natural order of its own but is strongly related to the target, especially when it has too many categories for One-Hot Encoding to be practical. The ranking should be computed on the training data only, otherwise information about the target leaks into the features.
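
# The neighborhood example can be sketched with pandas as follows. The neighborhood names and prices are invented for illustration, and in a real project the mapping would be fitted on the training split only.

```python
import pandas as pd

# Hypothetical training data
df = pd.DataFrame({
    "Neighborhood": ["A", "B", "C", "A", "B", "C"],
    "Price": [300000, 150000, 220000, 320000, 170000, 200000],
})

# Mean target value per category, sorted from cheapest to most expensive
mean_price = df.groupby("Neighborhood")["Price"].mean().sort_values()

# Rank 1 goes to the category with the lowest mean price
rank_map = {name: rank for rank, name in enumerate(mean_price.index, start=1)}

df["Neighborhood_encoded"] = df["Neighborhood"].map(rank_map)
print(rank_map)
print(df)
```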

Q3. Define covariance and explain why it is important in statistical analysis. How is covariance calculated?


In [None]:

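# Covariance is a measure of how two variables vary together. A positive covariance means the variables tend to increase or decrease together; a negative covariance means they tend to move in opposite directions; a covariance near zero means there is no linear relationship.

# Covariance is important because it shows whether changes in one variable correspond to changes in another, and it is the building block for correlation, the covariance matrix, and techniques such as PCA.

# Covariance is calculated by averaging the products of each variable's deviations from its mean. For a sample of n pairs:
# cov(X, Y) = sum((x_i - mean(X)) * (y_i - mean(Y))) / (n - 1)
# (divide by n instead of n - 1 for a population). The sign gives the direction of the relationship. The magnitude depends on the units of both variables, so it does not measure strength on its own; dividing by the two standard deviations gives the correlation coefficient, which does.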
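
# A small sketch of the calculation, comparing a hand-written sample covariance with NumPy's np.cov. The two arrays are made-up values and sample_covariance is a helper name chosen for this example.

```python
import numpy as np

# Hypothetical paired observations
x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 6.0])

def sample_covariance(a, b):
    """Sum of products of deviations from the means, divided by n - 1."""
    return np.sum((a - a.mean()) * (b - b.mean())) / (len(a) - 1)

manual = sample_covariance(x, y)

# np.cov returns the 2x2 covariance matrix; the off-diagonal entry is cov(x, y)
library = np.cov(x, y)[0, 1]

# Dividing by both sample standard deviations gives the unit-free correlation
correlation = manual / (x.std(ddof=1) * y.std(ddof=1))

print(manual, library, correlation)
```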



Q4. For a dataset with the following categorical variables: Color (red, green, blue), Size (small, medium,
large), and Material (wood, metal, plastic), perform label encoding using Python's scikit-learn library.
Show your code and explain the output.


In [None]:
from sklearn.preprocessing import LabelEncoder

# Data
data = {
    'Color': ['red', 'green', 'blue', 'green', 'red'],
    'Size': ['small', 'medium', 'large', 'small', 'large'],
    'Material': ['wood', 'metal', 'plastic', 'metal', 'wood']
}

# One LabelEncoder per column, so each fitted mapping stays available afterwards
encoders = {col: LabelEncoder() for col in data}

# Encoding each column
encoded_data = {col: encoders[col].fit_transform(values) for col, values in data.items()}

print(encoded_data)


In [None]:
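# Explanation of the Output: The LabelEncoder assigns a unique integer to each category in a column, numbering the categories in sorted (alphabetical) order. For instance:

# Color: blue = 0, green = 1, red = 2
# Size: large = 0, medium = 1, small = 2
# Material: metal = 0, plastic = 1, wood = 2
# The output is therefore Color = [2, 1, 0, 1, 2], Size = [2, 1, 0, 2, 0], and Material = [2, 0, 1, 0, 2]. Note that the codes for Size do not follow small < medium < large, because LabelEncoder knows nothing about the ranking.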
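
# One way to check the category-to-integer mapping instead of reading it off the output is to inspect each fitted encoder's classes_ attribute, as in this sketch. The mappings dictionary is a name introduced here for illustration.

```python
from sklearn.preprocessing import LabelEncoder

data = {
    'Color': ['red', 'green', 'blue', 'green', 'red'],
    'Size': ['small', 'medium', 'large', 'small', 'large'],
    'Material': ['wood', 'metal', 'plastic', 'metal', 'wood']
}

mappings = {}
for col, values in data.items():
    encoder = LabelEncoder().fit(values)
    # classes_ holds the sorted unique categories; the index is the assigned code
    mappings[col] = {str(category): code for code, category in enumerate(encoder.classes_)}

print(mappings)
```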

Q5. Calculate the covariance matrix for the following variables in a dataset: Age, Income, and Education
level. Interpret the results.


In [None]:
# The covariance matrix shows the pairwise covariance between the variables in a dataset. It helps in understanding how different variables interact with each other.

# Suppose we have the following variables:

# Age
# Income
# Education Level (represented as an ordinal variable)

# The covariance matrix can be computed using Python's NumPy or Pandas libraries:

In [None]:
import numpy as np
import pandas as pd

# Example data
data = {
    'Age': [25, 30, 35, 40, 45],
    'Income': [40000, 50000, 60000, 70000, 80000],
    'Education_Level': [1, 2, 2, 3, 3]  # 1: High School, 2: Bachelor's, 3: Master's
}

df = pd.DataFrame(data)

# Calculating covariance matrix
cov_matrix = df.cov()
print(cov_matrix)


In [None]:
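# Interpretation:

# The diagonal holds the variances: Age = 62.5, Income = 250,000,000, Education_Level = 0.7.
# Cov(Age, Income) = 125,000 is positive, so as age increases, income tends to increase as well.
# Cov(Income, Education_Level) = 12,500 and Cov(Age, Education_Level) = 6.25 are also positive, indicating that higher education is associated with higher income and higher age in this sample.
# The magnitudes are not comparable with each other because they depend on the units of each variable (years, currency, education codes); only the signs can be read directly.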
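
# Because covariance values depend on the units of each variable, a correlation matrix is often easier to compare across pairs. A minimal sketch using the same example data and pandas' DataFrame.corr:

```python
import pandas as pd

data = {
    'Age': [25, 30, 35, 40, 45],
    'Income': [40000, 50000, 60000, 70000, 80000],
    'Education_Level': [1, 2, 2, 3, 3]  # 1: High School, 2: Bachelor's, 3: Master's
}
df = pd.DataFrame(data)

# Pearson correlation rescales each covariance to the range [-1, 1]
corr_matrix = df.corr()
print(corr_matrix)
```

# In this toy data Income is an exact linear function of Age, so their correlation is 1.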

Q6. You are working on a machine learning project with a dataset containing several categorical
variables, including "Gender" (Male/Female), "Education Level" (High School/Bachelor's/Master's/PhD),
and "Employment Status" (Unemployed/Part-Time/Full-Time). Which encoding method would you use for
each variable, and why?


In [None]:
# Gender: Since there are only two categories (Male/Female), I would use Label Encoding or Binary Encoding, which both produce a single 0/1 column. This is sufficient because with only two values no spurious order is introduced.

# Education Level: Use Ordinal Encoding because education levels have a natural order (High School < Bachelor's < Master's < PhD).

# Employment Status: Use One-Hot Encoding because the statuses Unemployed, Part-Time, and Full-Time are best treated as distinct groups rather than evenly spaced steps on a scale.
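
# The three choices above can be sketched together as follows. The rows are invented for illustration, and the output column names (Gender_encoded, Education_encoded, the Employment prefix) are arbitrary choices for this example.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical rows
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "Education Level": ["Bachelor's", "High School", "PhD", "Master's"],
    "Employment Status": ["Full-Time", "Unemployed", "Part-Time", "Full-Time"],
})

# Gender: a single binary column (1 = Male, 0 = Female)
df["Gender_encoded"] = (df["Gender"] == "Male").astype(int)

# Education Level: ordinal codes that follow the declared order
education_order = [["High School", "Bachelor's", "Master's", "PhD"]]
education_encoder = OrdinalEncoder(categories=education_order)
df["Education_encoded"] = education_encoder.fit_transform(df[["Education Level"]]).ravel()

# Employment Status: one indicator column per category
employment_dummies = pd.get_dummies(df["Employment Status"], prefix="Employment", dtype=int)

encoded = pd.concat([df[["Gender_encoded", "Education_encoded"]], employment_dummies], axis=1)
print(encoded)
```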

Q7. You are analyzing a dataset with two continuous variables, "Temperature" and "Humidity", and two
categorical variables, "Weather Condition" (Sunny/Cloudy/Rainy) and "Wind Direction" (North/South/
East/West). Calculate the covariance between each pair of variables and interpret the results.

In [None]:
# For continuous variables like "Temperature" and "Humidity," we can compute the covariance directly.

# For categorical variables like "Weather Condition" and "Wind Direction," covariance is not defined on the raw labels because it requires numerical data. You can convert these variables into numerical representations (e.g., using one-hot encoding) and then compute covariances, but each value then describes a single 0/1 indicator column (such as "is Sunny") rather than the categorical variable as a whole.

# For continuous variables:

In [None]:
# Example data for continuous variables
import pandas as pd
data = {
    'Temperature': [20, 22, 23, 19, 21],
    'Humidity': [30, 45, 50, 35, 40]
}

df = pd.DataFrame(data)
cov_matrix = df.cov()
print(cov_matrix)


# Interpretation:

# For this data the variances are 2.5 (Temperature) and 62.5 (Humidity), and Cov(Temperature, Humidity) = 11.25.
# The covariance is positive, which means that as temperature increases, humidity tends to increase (and vice versa) in this sample.
# A negative covariance would mean that an increase in one variable corresponds to a decrease in the other.
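
# For the categorical variables, one possible approach is to one-hot encode them and compute the covariance matrix over all resulting columns. This is a sketch only: the "Weather Condition" and "Wind Direction" values below are invented, since the question gives no data for them.

```python
import pandas as pd

df = pd.DataFrame({
    "Temperature": [20, 22, 23, 19, 21],
    "Humidity": [30, 45, 50, 35, 40],
    # Hypothetical categorical observations
    "Weather Condition": ["Cloudy", "Sunny", "Sunny", "Rainy", "Cloudy"],
    "Wind Direction": ["North", "South", "East", "West", "North"],
})

# Each category becomes its own 0/1 indicator column
encoded = pd.get_dummies(df, columns=["Weather Condition", "Wind Direction"], dtype=float)

cov_matrix = encoded.cov()
print(cov_matrix.round(2))
```

# A positive covariance between "Temperature" and the "Weather Condition_Sunny" indicator means that sunny observations tend to have above-average temperatures in this sample.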