# Q1. What is the difference between Ordinal Encoding and Label Encoding? Provide an example of when you might choose one over the other.
Ordinal Encoding and Label Encoding are both techniques used to convert categorical data into numerical values, but they are used in different contexts:

Ordinal Encoding: This is used when the categorical variable has a natural order or ranking. The numerical values assigned to the categories are based on the order or rank of the categories, preserving the relationship between them.

Example: A "Level of Education" feature with values ["High School", "Bachelor's", "Master's", "PhD"] can be encoded as:
High School = 0
Bachelor's = 1
Master's = 2
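PhD = 3

This encoding is suitable because the categories have an inherent order (each level represents more formal education than the one before it).

Label Encoding: This technique assigns an arbitrary integer to each category, without regard to any order between the categories. It is typically applied to nominal data (categories with no natural order) or to target labels. scikit-learn's LabelEncoder, for instance, numbers the classes in sorted order and is intended for the target column. Because a model may read the integers as a ranking, One-Hot Encoding is often the safer choice for nominal input features.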

Example: A "Color" feature with values ["Red", "Green", "Blue"] could be encoded as:
Red = 0
Green = 1
Blue = 2

The categories have no meaningful order, so each integer is simply a unique identifier for its category.

When to choose one over the other:

Ordinal Encoding is preferred when the categorical variable has a meaningful order or ranking, as it preserves this relationship.
Label Encoding is suitable when the categories are arbitrary and have no intrinsic order, and the relationship between them is not important.
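
The contrast can be sketched with scikit-learn. This is a minimal illustration on made-up sample values, not a prescribed workflow: `OrdinalEncoder` is given the category order explicitly, while `LabelEncoder` picks its own integers by sorting the class names, so its codes differ from the hand-assigned Red = 0, Green = 1, Blue = 2 above.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Illustrative sample values (assumed for this sketch)
education = np.array([["Master's"], ["High School"], ["PhD"], ["Bachelor's"]])
education_order = ["High School", "Bachelor's", "Master's", "PhD"]

# Ordinal encoding: the integers follow the order we pass in
ordinal = OrdinalEncoder(categories=[education_order])
education_codes = ordinal.fit_transform(education).ravel()
print(education_codes)

# Label encoding: the integers follow the sorted class names, not any ranking
colors = ["Red", "Green", "Blue", "Green"]
label = LabelEncoder()
color_codes = label.fit_transform(colors)
print(list(label.classes_), color_codes)
```

Passing `categories` explicitly is what preserves the ranking; without it, `OrdinalEncoder` would also fall back to sorted order.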


# Q2. Explain how Target Guided Ordinal Encoding works and provide an example of when you might use it in a machine learning project.
Target Guided Ordinal Encoding uses the target variable to decide how the categories are numbered. For each category, a statistic of the target (usually the mean) is computed. The categories are then either ranked by that statistic and given integer ranks, or replaced by the statistic itself; the latter variant is usually called Mean Encoding or Target Encoding. Either way, the encoding captures the relationship between the categorical variable and the target variable.

How it works:

For each category in the feature, calculate the mean (or median, or another statistic) of the target variable for that category.
Replace each categorical value with its category's statistic, or with the integer rank of that category when the categories are sorted by the statistic.

Example: Imagine you are working on a dataset for a binary classification problem where the target variable is whether a customer will churn or not, and one of the features is "Region" (with categories "North," "South," "East," and "West").

Calculate the average churn rate for each region:
North: 0.2
South: 0.5
East: 0.3
West: 0.1

Replace the categories with the corresponding mean churn rate:
North -> 0.2
South -> 0.5
East -> 0.3
West -> 0.1

When to use:

Target Guided Ordinal Encoding is useful when you want to incorporate the relationship between a categorical feature and the target variable. It works best when there are many categories in the feature and when there's a noticeable relationship between the feature and the target.
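It should be used with caution because it can cause data leakage: the encoding is computed from the target itself, so the per-category statistics should be estimated on the training data only and then applied unchanged to validation and test data. Categories with very few rows give noisy statistics and are prone to overfitting.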
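
The churn example can be sketched in pandas as follows. The rows are synthetic, constructed only so that the per-region churn rates match the figures above, and the column names `Region_mean_encoded` and `Region_rank_encoded` are hypothetical. In a real project the means would be computed on the training split only.

```python
import pandas as pd

# Synthetic rows chosen to reproduce the churn rates 0.2, 0.5, 0.3 and 0.1
rows = (
    [("North", 1)] * 1 + [("North", 0)] * 4
    + [("South", 1)] * 1 + [("South", 0)] * 1
    + [("East", 1)] * 3 + [("East", 0)] * 7
    + [("West", 1)] * 1 + [("West", 0)] * 9
)
df = pd.DataFrame(rows, columns=["Region", "Churn"])

# Step 1: mean of the target per category
region_means = df.groupby("Region")["Churn"].mean()

# Step 2a: mean (target) encoding replaces each category with its mean
df["Region_mean_encoded"] = df["Region"].map(region_means)

# Step 2b: target guided ordinal encoding replaces it with its rank by mean
ordered_regions = region_means.sort_values().index
rank_map = {region: rank for rank, region in enumerate(ordered_regions)}
df["Region_rank_encoded"] = df["Region"].map(rank_map)

print(region_means)
print(rank_map)
```

The rank variant keeps only the ordering of the categories by churn rate, while the mean variant also keeps the size of the gaps between them.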


# Q3. Define covariance and explain why it is important in statistical analysis. How is covariance calculated?
Covariance measures the degree to which two random variables change together. It is an indicator of the relationship between the variables: whether they tend to increase or decrease together (positive covariance) or whether one increases while the other decreases (negative covariance).

Importance: Covariance is important because it gives the direction of the linear relationship between two variables. It is a fundamental concept in statistics, particularly in correlation analysis and regression analysis. Its magnitude depends on the units of the variables, so it indicates direction rather than strength.
Positive covariance indicates that the variables tend to increase or decrease together.
Negative covariance indicates that as one variable increases, the other tends to decrease.
A covariance of 0 indicates that there is no linear relationship between the variables, although a nonlinear relationship may still exist.
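
Calculation: the sample covariance of x and y is the sum of (x_i - mean(x)) * (y_i - mean(y)) over all n observations, divided by n - 1. The sketch below implements that formula directly and checks it against `np.cov`; the helper name `sample_covariance` and the input values are made up for illustration.

```python
import numpy as np

def sample_covariance(x, y):
    """Sample covariance: sum of products of deviations, divided by n - 1."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.shape != y.shape or x.size < 2:
        raise ValueError("x and y must have the same length, at least 2")
    return float(np.sum((x - x.mean()) * (y - y.mean())) / (x.size - 1))

# Illustrative values (assumed): y rises whenever x rises
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

manual = sample_covariance(x, y)
reference = np.cov(x, y)[0, 1]  # np.cov also divides by n - 1 by default
print(manual, reference)
```

Dividing by n instead of n - 1 gives the population covariance, which `np.cov` returns when called with `ddof=0`.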

In [None]:
# Q4. For a dataset with the following categorical variables: Color (red, green, blue), Size (small, medium, large), and Material (wood, metal, plastic), perform label encoding using Python's scikit-learn library. Show your code and explain the output.

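import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Sample data
data = {
    'Color': ['red', 'green', 'blue', 'red', 'blue'],
    'Size': ['small', 'medium', 'large', 'small', 'large'],
    'Material': ['wood', 'metal', 'plastic', 'wood', 'metal']
}

# Create DataFrame
df = pd.DataFrame(data)

# Fit a separate LabelEncoder per column so each fitted mapping stays available
# (reusing one encoder would overwrite its classes_ on every fit)
encoders = {}
for col in ['Color', 'Size', 'Material']:
    encoders[col] = LabelEncoder()
    df[col + '_encoded'] = encoders[col].fit_transform(df[col])

# Show the DataFrame with encoded columns
print(df)

# Show the category -> integer mapping learned for each column
for col, le in encoders.items():
    print(col, dict(zip(le.classes_, range(len(le.classes_)))))

# Explanation of the output:
# LabelEncoder sorts the categories alphabetically and numbers them from 0, so
#   Color:    blue = 0, green = 1, red = 2
#   Size:     large = 0, medium = 1, small = 2
#   Material: metal = 0, plastic = 1, wood = 2
# The integers are identifiers only. Size comes out as large < medium < small,
# which does not match its natural order; an ordinal encoding with an explicit
# category order would be needed to preserve it.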


In [None]:
# Q5. Calculate the covariance matrix for the following variables in a dataset: Age, Income, and Education level. Interpret the results.

import numpy as np

# Sample dataset: Age, Income, Education Level (numerical encoding)
data = np.array([
    [25, 50000, 1],  # Age: 25, Income: 50000, Education Level: 1 (High School)
    [30, 60000, 2],  # Age: 30, Income: 60000, Education Level: 2 (Bachelor's)
    [35, 80000, 3],  # Age: 35, Income: 80000, Education Level: 3 (Master's)
    [40, 90000, 3],  # Age: 40, Income: 90000, Education Level: 3 (Master's)
    [45, 100000, 4], # Age: 45, Income: 100000, Education Level: 4 (PhD)
])

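# Calculate covariance matrix (rowvar=False: each column is a variable)
cov_matrix = np.cov(data, rowvar=False)
print(cov_matrix)

# Interpretation (rows/columns are Age, Income, Education Level):
# - Diagonal entries are the sample variances: 62.5, 4.3e8 and 1.3.
# - All off-diagonal entries are positive (Age-Income 162500,
#   Age-Education 8.75, Income-Education 23000), so in this sample the three
#   variables tend to increase together.
# - Magnitudes depend on the units (Income is in the tens of thousands), so
#   they cannot be compared directly; correlation is the scale-free measure.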
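
Because the covariances above are dominated by the scale of Income, a scale-free view can help with interpretation. The following sketch, using the same sample rows, rescales the covariance matrix into a correlation matrix by dividing each entry by the product of the two standard deviations, and compares the result with `np.corrcoef`.

```python
import numpy as np

# Same sample rows as above: Age, Income, Education Level
data = np.array([
    [25, 50000, 1],
    [30, 60000, 2],
    [35, 80000, 3],
    [40, 90000, 3],
    [45, 100000, 4],
])

cov = np.cov(data, rowvar=False)

# corr[i, j] = cov[i, j] / (std[i] * std[j])
std = np.sqrt(np.diag(cov))
corr = cov / np.outer(std, std)

# Cross-check against NumPy's built-in correlation
reference = np.corrcoef(data, rowvar=False)
print(np.round(corr, 3))
```

With only five hand-picked rows, the resulting coefficients describe this toy sample and should not be read as evidence about real populations.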


# Q6. You are working on a machine learning project with a dataset containing several categorical variables, including "Gender" (Male/Female), "Education Level" (High School/Bachelor's/Master's/PhD), and "Employment Status" (Unemployed/Part-Time/Full-Time). Which encoding method would you use for each variable, and why?
Gender: Since "Gender" has only two categories in this dataset (Male/Female), Label Encoding works well: it produces a single 0/1 column, and with only two values the integer codes cannot imply a misleading ranking.

Education Level: This has an inherent ordinal relationship (High School < Bachelor's < Master's < PhD), so Ordinal Encoding is the most appropriate choice.

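Employment Status: Treating "Employment Status" as a nominal variable (no order is assumed between the statuses), One-Hot Encoding would be ideal to prevent the model from inferring any order between categories. If the analysis does assume the ordering Unemployed < Part-Time < Full-Time, Ordinal Encoding is a reasonable alternative.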
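
The three choices can be sketched together as follows. The sample rows and the `*_encoded` column names are hypothetical, and `pd.get_dummies` is used here as one simple way to one-hot encode; scikit-learn's `OneHotEncoder` would be the usual alternative inside a pipeline.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Illustrative rows (assumed for this sketch)
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "Education Level": ["PhD", "High School", "Master's", "Bachelor's"],
    "Employment Status": ["Full-Time", "Unemployed", "Part-Time", "Full-Time"],
})

# Gender: label encoding gives one 0/1 column (Female = 0, Male = 1)
df["Gender_encoded"] = LabelEncoder().fit_transform(df["Gender"])

# Education Level: ordinal encoding with the order stated explicitly
education_order = ["High School", "Bachelor's", "Master's", "PhD"]
education_encoder = OrdinalEncoder(categories=[education_order])
df["Education_encoded"] = (
    education_encoder.fit_transform(df[["Education Level"]]).ravel().astype(int)
)

# Employment Status: one indicator column per category
employment_dummies = pd.get_dummies(
    df["Employment Status"], prefix="Employment", dtype=int
)

encoded = pd.concat(
    [df[["Gender_encoded", "Education_encoded"]], employment_dummies], axis=1
)
print(encoded)
```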

In [None]:
# Q7. You are analyzing a dataset with two continuous variables, "Temperature" and "Humidity", and two categorical variables, "Weather Condition" (Sunny/Cloudy/Rainy) and "Wind Direction" (North/South/East/West). Calculate the covariance between each pair of variables and interpret the results.

import pandas as pd

# Sample data
data = {
    'Temperature': [25, 30, 22, 28, 35],
    'Humidity': [60, 70, 65, 55, 75],
    'Weather Condition': ['Sunny', 'Cloudy', 'Rainy', 'Sunny', 'Cloudy'],
    'Wind Direction': ['North', 'South', 'East', 'West', 'North']
}

df = pd.DataFrame(data)

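# Covariance is only defined for numeric variables, so it is computed for
# Temperature and Humidity; the nominal columns are left out.
cov_matrix = df[['Temperature', 'Humidity']].cov()
print(cov_matrix)

# Interpretation:
# - Var(Temperature) = 24.5, Var(Humidity) = 62.5
# - Cov(Temperature, Humidity) = 23.75 > 0: in this sample, higher
#   temperatures tend to come with higher humidity.
# - Weather Condition and Wind Direction have no natural numeric scale, so a
#   covariance with them is not meaningful; a group-wise comparison (for
#   example, mean Temperature per Weather Condition) is a better fit.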
