Q1. What is the difference between Ordinal Encoding and Label Encoding? Provide an example of when you might choose one over the other.

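ANS  =  Ordinal Encoding maps categories to integers that follow a meaningful order, so the numeric values preserve the ranking among categories. Label Encoding assigns each category a unique integer without considering any order; scikit-learn's LabelEncoder, for instance, numbers the classes in sorted (alphabetical) order and is intended mainly for target labels. You might choose Ordinal Encoding when there is a clear ranking among categories, such as "low," "medium," and "high." Label Encoding can be chosen when there is no inherent order among categories, like "red," "green," and "blue" for colors, with the caveat that linear and distance-based models may read the integers as a ranking, so it is usually safer for targets or tree-based models.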
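The difference can be seen on a small example. This is a minimal sketch, assuming scikit-learn is installed; the `sizes` list is made up for illustration.

```python
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Illustrative values only
sizes = ["low", "high", "medium", "low"]

# OrdinalEncoder with an explicit category order: low < medium < high
ordinal = OrdinalEncoder(categories=[["low", "medium", "high"]])
ordinal_codes = ordinal.fit_transform([[s] for s in sizes]).ravel()
print(ordinal_codes)   # [0. 2. 1. 0.]

# LabelEncoder sorts the classes alphabetically: high=0, low=1, medium=2
label = LabelEncoder()
label_codes = label.fit_transform(sizes)
print(label.classes_)  # ['high' 'low' 'medium']
print(label_codes)     # [1 0 2 1]
```

With the explicit order, "high" gets the largest code. With LabelEncoder it gets the smallest, which is why the alphabetical codes should not be read as a ranking.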

Q2. Explain how Target Guided Ordinal Encoding works and provide an example of when you might use it in a machine learning project.

ANS  =  Target Guided Ordinal Encoding orders the categories of a variable by a statistic of the target, typically the mean or median of the target within each category, and then assigns integer ranks in that order. It is useful when there is a strong relationship between the categorical variable and the target variable. For example, in a credit risk assessment project, you might use Target Guided Ordinal Encoding to encode customer risk levels by ranking them on their historical default rates. The ranking should be learned from the training data only; otherwise information about the target leaks into the features.
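The steps can be sketched with pandas. The column names (`segment`, `default`) and every value below are hypothetical, chosen only to show the mechanics.

```python
import pandas as pd

# Hypothetical credit data: 'default' is the binary target (1 = defaulted)
df = pd.DataFrame({
    "segment": ["retail", "student", "corporate", "retail",
                "student", "corporate", "retail", "corporate"],
    "default": [1, 0, 0, 1, 1, 0, 1, 0],
})

# 1. Mean of the target within each category
target_means = df.groupby("segment")["default"].mean()

# 2. Rank categories by that mean (lowest default rate gets rank 1)
ordered = target_means.sort_values().index
mapping = {category: rank for rank, category in enumerate(ordered, start=1)}

# 3. Replace each category with its rank
df["segment_encoded"] = df["segment"].map(mapping)

print(mapping)  # {'corporate': 1, 'student': 2, 'retail': 3}
```

In a real project the mapping would be fitted on the training split and then applied unchanged to validation and test data.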

Q3. Define covariance and explain why it is important in statistical analysis. How is covariance calculated?

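ANS  =  Covariance measures how two variables vary together: it is positive when they tend to move in the same direction, negative when they tend to move in opposite directions, and close to zero when there is no linear relationship. It is important in statistical analysis because it describes the direction of the relationship between variables and underlies correlation and regression. Its magnitude depends on the units of the variables, so it indicates direction rather than strength. The sample covariance is the sum of the products of each variable's deviations from its mean, divided by n - 1 (divide by n for the population covariance): cov(X, Y) = Σ (x_i - x̄)(y_i - ȳ) / (n - 1).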
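As a quick check of the calculation, the sample covariance can be computed by hand and compared with NumPy. The two arrays are made-up values used only for illustration.

```python
import numpy as np

# Hypothetical paired observations
x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 6.0])

# Sample covariance: sum of products of deviations, divided by n - 1
manual_cov = np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)

# np.cov uses the same n - 1 denominator by default;
# the off-diagonal entry [0, 1] is cov(x, y)
numpy_cov = np.cov(x, y)[0, 1]

print(manual_cov, numpy_cov)
```

Both values agree (14 / 3 for this sample), and the positive sign says that y tends to increase with x.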



Q4. For a dataset with the following categorical variables: Color (red, green, blue), Size (small, medium, large), and Material (wood, metal, plastic), perform label encoding using Python's scikit-learn library. Show your code and explain the output.

In [None]:
from sklearn.preprocessing import LabelEncoder
import pandas as pd

# Create a DataFrame
df = pd.DataFrame({
    'Color': ['red', 'green', 'blue', 'green'],
    'Size': ['small', 'medium', 'large', 'medium'],
    'Material': ['wood', 'metal', 'plastic', 'metal']
})

# Initialize LabelEncoder
encoder = LabelEncoder()

# Encoding each categorical variable
df['Color_encoded'] = encoder.fit_transform(df['Color'])
df['Size_encoded'] = encoder.fit_transform(df['Size'])
df['Material_encoded'] = encoder.fit_transform(df['Material'])

print(df)
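To explain the output, it helps to look at the mapping each encoder learned. The sketch below is one possible variant, not the only way to do it: it keeps a separate LabelEncoder per column (the `encoders` and `mappings` dictionaries are names introduced here), so the fitted classes remain available after encoding.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "Color": ["red", "green", "blue", "green"],
    "Size": ["small", "medium", "large", "medium"],
    "Material": ["wood", "metal", "plastic", "metal"],
})

# One encoder per column, so each fitted mapping stays available
# (for inverse_transform or for encoding new data later)
encoders = {}
mappings = {}
for column in ["Color", "Size", "Material"]:
    encoders[column] = LabelEncoder()
    df[column + "_encoded"] = encoders[column].fit_transform(df[column])
    # classes_ is sorted; the position of a class is its integer code
    mappings[column] = {
        str(label): code for code, label in enumerate(encoders[column].classes_)
    }

print(mappings)
```

Each column is coded alphabetically: blue=0, green=1, red=2; large=0, medium=1, small=2; metal=0, plastic=1, wood=2. Note that the Size codes do not follow small < medium < large, so they should not be interpreted as a ranking.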


Q5. Calculate the covariance matrix for the following variables in a dataset: Age, Income, and Education level. Interpret the results.

In [None]:
import numpy as np

# Simulated data
data = np.array([
    [25, 40000, 12],  # Age, Income, Education level
    [30, 45000, 16],
    [35, 50000, 20],
    [40, 55000, 21],
    [45, 60000, 22]
])

# Using numpy to calculate the covariance matrix
cov_matrix = np.cov(data, rowvar=False)

print(cov_matrix)
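For the interpretation, a hedged sketch on the same simulated data: the diagonal of the covariance matrix holds the variances, the off-diagonal entries hold the pairwise covariances, and converting to correlations with `np.corrcoef` removes the dependence on units.

```python
import numpy as np

# Same simulated data as above: Age, Income, Education level
data = np.array([
    [25, 40000, 12],
    [30, 45000, 16],
    [35, 50000, 20],
    [40, 55000, 21],
    [45, 60000, 22],
])

cov_matrix = np.cov(data, rowvar=False)

# Diagonal entries are the variances of Age, Income, and Education level
variances = np.diag(cov_matrix)

# Correlation rescales each covariance by the two standard deviations,
# which makes the pairs comparable despite their different units
corr_matrix = np.corrcoef(data, rowvar=False)

print(variances)
print(corr_matrix)
```

All off-diagonal covariances are positive, so the three variables increase together in this sample. The Age-Income covariance (62500) is far larger than the Age-Education covariance (31.25) only because income is measured in much larger units; in correlation terms Age and Income are perfectly linear here (1.0), and Age and Education level are also strongly related (about 0.95).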


Q6. You are working on a machine learning project with a dataset containing several categorical 
variables, including "Gender" (Male/Female), "Education Level" (High School/Bachelor's/Master's/PhD), 
and "Employment Status" (Unemployed/Part-Time/Full-Time). Which encoding method would you use for 
each variable, and why?

Choosing Encoding Methods for Categorical Variables
When deciding which encoding method to use for categorical variables in a machine learning project, consider the nature of the data and the requirements of the learning algorithm:

Gender (Male/Female): For binary categorical variables like gender, you can use label encoding or binary encoding. Label encoding assigns 0 or 1 to the two categories, while binary encoding writes each category's integer code in binary digits, one column per digit. Since gender has only two categories, both give a single 0/1 column, and label encoding is simple and sufficient.
Education Level (High School/Bachelor's/Master's/PhD): For ordinal categorical variables like education level, ordinal encoding or target encoding is appropriate. Ordinal encoding assigns integer values based on the order of the categories, while target encoding replaces each category with the mean of the target variable for that category. Since education level has a clear order, ordinal encoding preserves it effectively, provided the order is passed to the encoder explicitly rather than inferred from the data.
Employment Status (Unemployed/Part-Time/Full-Time): Treated here as a nominal variable, employment status can be one-hot encoded or target encoded. One-hot encoding creates one binary column per category, while target encoding computes the mean target value for each category. The categories could be read as ordered by hours worked, but the steps between them are not obviously comparable, so one-hot encoding is the safer choice: it lets the model treat each category independently.

In [None]:
import pandas as pd
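# Note: scikit-learn's OrdinalEncoder is used here in place of the
# category_encoders one, so the category order can be stated explicitly
from sklearn.preprocessing import OrdinalEncoder

# Sample DataFrame
df = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Male', 'Female'],
    'Education Level': ['High School', "Bachelor's", "Master's", 'PhD'],
    'Employment Status': ['Unemployed', 'Part-Time', 'Full-Time', 'Part-Time']
})

# Binary (0/1) encoding for 'Gender'; which category gets 0 is arbitrary
df['Gender_encoded'] = df['Gender'].map({'Female': 0, 'Male': 1})

# Ordinal encoding for 'Education Level', lowest to highest
education_order = ['High School', "Bachelor's", "Master's", 'PhD']
ordinal_encoder = OrdinalEncoder(categories=[education_order])
df['Education_Level_encoded'] = ordinal_encoder.fit_transform(df[['Education Level']]).ravel()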

# One-hot encoding for 'Employment Status'
df = pd.get_dummies(df, columns=['Employment Status'])

print(df)


Q7. You are analyzing a dataset with two continuous variables, "Temperature" and "Humidity", and two categorical variables, "Weather Condition" (Sunny/Cloudy/Rainy) and "Wind Direction" (North/South/East/West). Calculate the covariance between each pair of variables and interpret the results.

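Covariance measures the joint variability of two numeric variables, so only the pair Temperature and Humidity has a directly meaningful covariance. For any pair that involves "Weather Condition" or "Wind Direction", the usual covariance calculation is not meaningful: both variables are nominal, and any integer codes assigned to them would be arbitrary. The relationship between a categorical and a continuous variable is better analyzed with techniques like ANOVA (Analysis of Variance) or visualization methods like grouped box plots, and the relationship between the two categorical variables with a contingency table or a chi-square test.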
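As a rough illustration of the alternatives for the pairs that involve a categorical variable, the sketch below compares group means and builds a contingency table with pandas. All values in the DataFrame are made up for illustration; the question does not supply data.

```python
import pandas as pd

# Hypothetical observations; every value is invented for this example
df = pd.DataFrame({
    "Temperature": [22, 25, 24, 18, 20, 19],
    "Humidity": [0.80, 0.60, 0.70, 0.90, 0.85, 0.75],
    "Weather Condition": ["Cloudy", "Sunny", "Sunny", "Rainy", "Rainy", "Cloudy"],
    "Wind Direction": ["North", "South", "South", "West", "West", "East"],
})

# Continuous vs categorical: compare the continuous variable across groups
temp_by_weather = df.groupby("Weather Condition")["Temperature"].mean()

# Categorical vs categorical: count how often each pair of levels co-occurs
weather_by_wind = pd.crosstab(df["Weather Condition"], df["Wind Direction"])

print(temp_by_weather)
print(weather_by_wind)
```

Large differences between the group means suggest that the continuous variable depends on the category (ANOVA tests this formally), while a contingency table with counts concentrated in a few cells suggests the two categorical variables are associated.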

To calculate covariance between continuous variables like 'Temperature' and 'Humidity', you can use numpy's cov function:

In [None]:
import numpy as np

# Sample data
temperature = np.array([22, 25, 24, 18, 20])
humidity = np.array([0.80, 0.60, 0.70, 0.90, 0.85])

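# Calculate the sample covariance; np.cov returns a 2x2 matrix and
# the off-diagonal entry [0, 1] is cov(temperature, humidity)
covariance = np.cov(temperature, humidity)[0, 1]

# For this sample the value is negative (about -0.33), meaning humidity
# tends to fall as temperature rises
print("Covariance between Temperature and Humidity:", covariance)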
