In [None]:
'''
Q1. What is the difference between Ordinal Encoding and Label Encoding? Provide an example of when you
might choose one over the other.
'''

Ordinal Encoding and Label Encoding are techniques used to convert categorical data into numerical values.

Ordinal Encoding: assigns integers to categories following their meaningful order, making it suitable
for ordinal data where the order matters
(e.g., rating scales: "low" = 1, "medium" = 2, "high" = 3).

Label Encoding: assigns an arbitrary integer to each category with no regard to order, making it suitable
for nominal data where the order does not matter
(e.g., colors: "red" = 1, "blue" = 2, "green" = 3).
Because the integers are arbitrary, a model may still read them as a ranking, so the codes should not be
interpreted as magnitudes.

Choose Ordinal Encoding when the categorical data has a meaningful order, and Label Encoding when the
order of categories is irrelevant.
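
The contrast above can be sketched with scikit-learn. This is a minimal illustration on made-up data; note that both encoders number categories from 0 rather than from 1 as in the examples above, and that LabelEncoder numbers the sorted class names.

```python
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Ordinal data: the order is stated explicitly through `categories`.
ratings = [["low"], ["high"], ["medium"], ["low"]]  # illustrative sample
ordinal = OrdinalEncoder(categories=[["low", "medium", "high"]])
ordinal_codes = ordinal.fit_transform(ratings).ravel()
print(ordinal_codes)  # [0. 2. 1. 0.]

# Nominal data: LabelEncoder numbers the sorted class names,
# so the resulting integers carry no meaningful order.
colors = ["red", "blue", "green", "red"]  # illustrative sample
label = LabelEncoder()
label_codes = label.fit_transform(colors)
print(list(label.classes_))  # ['blue', 'green', 'red']
print(label_codes)           # [2 0 1 2]
```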




In [None]:
'''
Q2. Explain how Target Guided Ordinal Encoding works and provide an example of when you might use it in
a machine learning project
'''


Target Guided Ordinal Encoding works by ordering the categories based on the relationship between 
each category and the target variable, then encoding them with ordinal numbers based on this order.

For example, if you are predicting house prices and have a categorical feature "neighborhood," 
you can calculate the mean house price for each neighborhood and then encode the neighborhoods 
based on these mean prices. This way, neighborhoods with higher average prices get higher numerical values.

Use Target Guided Ordinal Encoding when the categorical feature has a significant impact
on the target variable and you want to capture this relationship to improve model performance.
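
The neighborhood example can be sketched with pandas as follows. The neighborhood names and prices are made up for illustration, and ranking by the mean target is only one possible ordering rule.

```python
import pandas as pd

# Hypothetical data: three neighborhoods with two sales each.
df = pd.DataFrame({
    "neighborhood": ["A", "B", "C", "A", "B", "C"],
    "price": [300, 150, 220, 320, 170, 200],
})

# Mean target per category, sorted from cheapest to most expensive.
mean_price = df.groupby("neighborhood")["price"].mean().sort_values()

# Rank 1 goes to the lowest mean price, rank 3 to the highest.
rank_map = {name: rank for rank, name in enumerate(mean_price.index, start=1)}
df["neighborhood_encoded"] = df["neighborhood"].map(rank_map)

print(rank_map)  # {'B': 1, 'C': 2, 'A': 3}
print(df)
```

In a real project the mapping would normally be learned from the training split only and then applied to validation and test data, since it is derived from the target.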

In [None]:
'''
Q3. Define covariance and explain why it is important in statistical analysis. How is covariance calculated?
'''
Covariance is a measure of the degree to which two variables change together.
If the variables tend to increase together, the covariance is positive; 
if one increases while the other decreases, the covariance is negative.

Covariance is important in statistical analysis because its sign indicates the direction
of the linear relationship between two variables. This helps in understanding how
variables move together and underlies techniques such as correlation and linear regression.

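The sample covariance is calculated using the formula:

$$\mathrm{Cov}(X, Y) = \frac{1}{n - 1} \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})$$

where $n$ is the number of observations and $\bar{X}$ and $\bar{Y}$ are the means of the variables X and Y respectively.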
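
As a quick check of the formula, the sketch below computes the covariance by hand on made-up numbers and compares it with `np.cov`, which also divides by n - 1 by default.

```python
import numpy as np

# Illustrative values, not taken from any dataset in this notebook.
x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 6.0])
n = len(x)

# Sum of the products of deviations from the means, divided by n - 1.
manual_cov = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)

# np.cov returns a 2x2 matrix; the off-diagonal entry is Cov(x, y).
numpy_cov = np.cov(x, y)[0, 1]

print(manual_cov, numpy_cov)
```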

In [1]:
'''
Q4. For a dataset with the following categorical variables: Color (red, green, blue), Size (small, medium,
large), and Material (wood, metal, plastic), perform label encoding using Python's scikit-learn library.
Show your code and explain the output.
'''

from sklearn.preprocessing import LabelEncoder
import pandas as pd

# Sample data
data = {
    'Color': ['red', 'green', 'blue', 'green', 'red'],
    'Size': ['small', 'medium', 'large', 'small', 'large'],
    'Material': ['wood', 'metal', 'plastic', 'metal', 'wood']
}

# Creating a DataFrame
df = pd.DataFrame(data)

# Fit a separate LabelEncoder per column so that each fitted mapping
# (encoder.classes_) stays available for inverse_transform later
encoders = {}
for column in df.columns:
    encoders[column] = LabelEncoder()
    df[column] = encoders[column].fit_transform(df[column])

print(df)


   Color  Size  Material
0      2     2         2
1      1     1         0
2      0     0         1
3      1     2         0
4      2     0         2

Explanation: LabelEncoder sorts the category names alphabetically and numbers them from 0, so
Color: blue = 0, green = 1, red = 2
Size: large = 0, medium = 1, small = 2
Material: metal = 0, plastic = 1, wood = 2
Each cell in the output is the code of the original category in that row. Note that the Size codes
do not follow the natural order small < medium < large.


In [3]:
'''
Q5. Calculate the covariance matrix for the following variables in a dataset: Age, Income, and Education
level. Interpret the results.
'''

import numpy as np

# Sample data
data = {
    'Age': [25, 32, 47, 51, 62],
    'Income': [50000, 64000, 120000, 150000, 98000],
    'Education_Level': [12, 16, 18, 20, 16]
}

# Convert the data into a NumPy array
data_matrix = np.array([data['Age'], data['Income'], data['Education_Level']])

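# Calculate the covariance matrix (each row of data_matrix is one variable).
# bias=True divides by n (population covariance); the default divides by n - 1.
cov_matrix = np.cov(data_matrix, bias=True)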

print(cov_matrix)



[[1.77040e+02 3.49040e+05 2.22400e+01]
 [3.49040e+05 1.32704e+09 8.94400e+04]
 [2.22400e+01 8.94400e+04 7.04000e+00]]

Interpretation: the rows and columns are in the order Age, Income, Education_Level.
The diagonal holds the variances (Age 177.04, Income about 1.33e9, Education_Level 7.04).
All off-diagonal covariances are positive, so in this sample age, income and education level
tend to rise together. The magnitudes reflect the units of each variable and cannot be compared
directly; a correlation matrix would be needed to compare the strength of the relationships.


In [None]:
'''
Q6. You are working on a machine learning project with a dataset containing several categorical
variables, including "Gender" (Male/Female), "Education Level" (High School/Bachelor's/Master's/PhD),
and "Employment Status" (Unemployed/Part-Time/Full-Time). Which encoding method would you use for
each variable, and why?
'''


Gender (Male/Female): Use Label Encoding
because there are only two categories. 
Encode "Male" as 0 and "Female" as 1 (or vice versa).

Education Level (High School/Bachelor's/Master's/PhD): Use Ordinal Encoding 
because there is a clear order from lower to higher education levels.
Encode "High School" as 1, "Bachelor's" as 2, "Master's" as 3, and "PhD" as 4.

Employment Status (Unemployed/Part-Time/Full-Time): Use One-Hot Encoding
because there is no inherent order among employment statuses. 
Create binary columns where each category has its own column, indicating its presence or absence in each observation.
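
The three choices above could be applied with pandas roughly as follows. The rows are made up for illustration, and the output column names are simply what `get_dummies` produces with the chosen prefix.

```python
import pandas as pd

# Hypothetical rows covering every category.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "Education Level": ["PhD", "High School", "Master's", "Bachelor's"],
    "Employment Status": ["Full-Time", "Unemployed", "Part-Time", "Full-Time"],
})

# Gender: two categories mapped to 0 and 1.
gender_codes = df["Gender"].map({"Male": 0, "Female": 1})

# Education Level: explicit ordinal mapping from lowest to highest.
education_order = {"High School": 1, "Bachelor's": 2, "Master's": 3, "PhD": 4}
education_codes = df["Education Level"].map(education_order)

# Employment Status: one binary column per category.
employment_dummies = pd.get_dummies(
    df["Employment Status"], prefix="Employment", dtype=int
)

encoded = pd.concat([gender_codes, education_codes, employment_dummies], axis=1)
print(encoded)
```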

In [None]:
'''
Q7. You are analyzing a dataset with two continuous variables, "Temperature" and "Humidity", and two
categorical variables, "Weather Condition" (Sunny/Cloudy/Rainy) and "Wind Direction" (North/South/
East/West). Calculate the covariance between each pair of variables and interpret the results.
'''
For the dataset with variables Temperature, Humidity, Weather Condition (Sunny/Cloudy/Rainy), 
and Wind Direction (North/South/East/West):

Temperature and Humidity: Covariance measures the direction of their linear relationship.
A positive value indicates they tend to increase or decrease together,
while a negative value suggests an inverse relationship. The magnitude depends on the units of
both variables, so the correlation coefficient should be used to judge the strength of the relationship.

Weather Condition and Wind Direction: Covariance is not suitable for categorical variables.
Use methods such as the chi-square test of association instead.
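
Since the question supplies no data, the sketch below uses made-up observations to show how both analyses might be run: covariance and correlation for the continuous pair, and a chi-square test on a contingency table for the categorical pair. It assumes SciPy is available.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical observations; the values are for illustration only.
df = pd.DataFrame({
    "Temperature": [30.0, 25.0, 20.0, 28.0, 18.0, 22.0],
    "Humidity": [40.0, 55.0, 80.0, 45.0, 85.0, 70.0],
    "Weather Condition": ["Sunny", "Cloudy", "Rainy", "Sunny", "Rainy", "Cloudy"],
    "Wind Direction": ["North", "South", "East", "West", "East", "South"],
})

# Continuous pair: the sign of the covariance gives the direction,
# the correlation coefficient gives a unit-free strength.
cov_th = df["Temperature"].cov(df["Humidity"])
corr_th = df["Temperature"].corr(df["Humidity"])
print(cov_th, corr_th)

# Categorical pair: chi-square test on the table of counts.
# With a sample this small the test is not reliable; it only shows the call.
contingency = pd.crosstab(df["Weather Condition"], df["Wind Direction"])
chi2, p_value, dof, expected = chi2_contingency(contingency)
print(chi2, p_value, dof)
```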