#### `Q1`. What is the difference between Ordinal Encoding and Label Encoding? Provide an example of when you might choose one over the other.


* Ordinal encoding and label encoding both convert categorical variables into integers. The difference lies in where the integers come from: ordinal encoding assigns them according to a natural ordering or ranking of the categories, whereas label encoding assigns each category an arbitrary unique integer and is used when the categories have no inherent order.

* Ordinal encoding assigns a numerical value to each category based on its rank or order. 
    >  For example, if we have a categorical variable representing levels of education ("high school," "bachelor's degree," "master's degree," etc.), we can assign numerical values based on the hierarchy of education levels, such as 1 for "high school," 2 for "bachelor's degree," and 3 for "master's degree."

* Label encoding assigns a unique integer to each category in the variable, without any inherent ranking or ordering. Because the integers are arbitrary, their numeric order carries no meaning.
    > For example, if we have a categorical variable representing different colors ("red," "blue," "green," etc.), we can assign a unique numerical value to each color, such as 1 for "red," 2 for "blue," and 3 for "green."
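
The contrast above can be sketched with scikit-learn. This is a minimal illustration, not a definitive recipe: the sample values are hypothetical, and scikit-learn numbers the categories from 0 rather than from 1 as in the examples above. Note also that scikit-learn documents `LabelEncoder` as a tool for target labels, so using it on a feature column is a convenience rather than its intended use.

```python
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical sample values, chosen only for illustration.
education = [["master's degree"], ["high school"], ["bachelor's degree"]]
colors = ["red", "green", "blue"]

# OrdinalEncoder with an explicit category list: the codes follow the
# ranking we supply (high school -> 0, bachelor's -> 1, master's -> 2).
ordinal = OrdinalEncoder(
    categories=[["high school", "bachelor's degree", "master's degree"]]
)
education_codes = ordinal.fit_transform(education).ravel().astype(int)

# LabelEncoder: the codes follow the sorted class names
# (blue -> 0, green -> 1, red -> 2), an order with no real meaning.
label = LabelEncoder()
color_codes = label.fit_transform(colors)

print(education_codes.tolist())  # [2, 0, 1]
print(color_codes.tolist())      # [2, 1, 0]
```

The explicit `categories` argument is what makes the first encoding ordinal; without it, `OrdinalEncoder` would also fall back to sorted order.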

#### `Q2`. Explain how Target Guided Ordinal Encoding works and provide an example of when you might use it in a machine learning project.


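Target Guided Ordinal Encoding ranks the categories of a feature by their mean target value and uses that rank as the encoded value. The steps to perform it are: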

1. Calculate the mean target for each category level in the training dataset.
2. Sort the categories based on their mean target value.
3. Assign an ordinal value to each category level based on the sorting order.
4. Replace the categorical values with the assigned ordinal values.

* For example, let's say you have a dataset of employees in a company, and one of the features is the department in which the employee works. You want to predict whether an employee will stay with the company or leave based on various features including their department. In this case, you can use Target Guided Ordinal Encoding to transform the department feature into an ordinal value based on the probability of an employee leaving in each department.

* To perform Target Guided Ordinal Encoding, you first calculate the mean target (employee churn probability) for each department level in the training dataset. Let's assume you have the following mean target values for each department:

  > * Sales: 0.2
  > * Marketing: 0.3
  > * Engineering: 0.1
  > * Operations: 0.4

* You then sort the departments based on their mean target value and assign an ordinal value to each department level based on the sorting order:

  > * Engineering: 1
  > * Sales: 2
  > * Marketing: 3
  > * Operations: 4

* Finally, you replace the department names with their assigned ordinal values.
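
The four steps can be sketched in pandas as follows. The rows are hypothetical and were chosen so that the departments sort in the same order as in the example above (the churn rates themselves differ); the column names `department` and `left` are assumptions made for this illustration.

```python
import pandas as pd

# Hypothetical training data: 1 = employee left, 0 = employee stayed.
churn_by_department = {
    "Sales": [0, 0, 0, 1],
    "Marketing": [0, 0, 1, 1],
    "Engineering": [0, 0, 0, 0],
    "Operations": [0, 1, 1, 1],
}
train = pd.DataFrame(
    [(dept, left) for dept, flags in churn_by_department.items() for left in flags],
    columns=["department", "left"],
)

# Step 1: mean target per category, computed on the training data only.
mean_target = train.groupby("department")["left"].mean()

# Step 2: sort the categories by mean target (ascending).
ordered = mean_target.sort_values().index

# Step 3: assign ranks 1..k in the sorted order.
mapping = {dept: rank for rank, dept in enumerate(ordered, start=1)}

# Step 4: replace the category names with their ranks.
train["department_encoded"] = train["department"].map(mapping)

print(mapping)
# {'Engineering': 1, 'Sales': 2, 'Marketing': 3, 'Operations': 4}
```

The same `mapping` would then be applied to validation or test data, so that the target values of those rows never influence the encoding.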


#### `Q3`. Define covariance and explain why it is important in statistical analysis. How is covariance calculated?


* **Covariance** is a measure of the extent to which two random variables are linearly related. Specifically, it measures the degree to which the values of one variable change in relation to the values of another variable. 
* If two variables have a positive covariance, it means that when one variable increases, the other variable tends to increase as well. 
* If they have a negative covariance, it means that when one variable increases, the other variable tends to decrease. If the covariance is zero, it means that the variables are not linearly related.

* **Covariance is important in statistical analysis** because its sign shows the direction of the linear relationship between two variables. Its magnitude depends on the units of the variables, so it is not a standardized measure of strength; correlation is used for that. Covariance can still provide insight into the data and inform decisions about modeling and prediction. For example, in finance, the covariance between the returns on two different stocks is used to help investors diversify their portfolios.

* Covariance is calculated as the sum of the product of the deviations of each variable from its mean, divided by the sample size minus one. The formula for the covariance between two variables X and Y with sample size n is:

  > cov(X,Y) = 1/(n-1) * ∑ [(X_i - X_mean) * (Y_i - Y_mean)]

* where X_i and Y_i are the values of the variables at the ith observation, X_mean and Y_mean are the means of the variables, and the summation is taken over all n observations.
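
As a quick check of the formula, the sketch below computes the sample covariance by hand and compares it with `np.cov`. The two arrays are hypothetical values picked for easy arithmetic.

```python
import numpy as np

# Hypothetical paired observations, for illustration only.
x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 6.0])

n = len(x)

# Sum of the products of deviations from the means, divided by n - 1.
manual_cov = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)

# np.cov returns the 2x2 covariance matrix; the off-diagonal entry is cov(x, y).
numpy_cov = np.cov(x, y)[0, 1]

print(manual_cov, numpy_cov)
```

Both values agree because `np.cov` uses the same `n - 1` denominator by default.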

#### `Q4`. For a dataset with the following categorical variables: Color (red, green, blue), Size (small, medium, large), and Material (wood, metal, plastic), perform label encoding using Python's scikit-learn library. Show your code and explain the output.


In [1]:
import pandas as pd

df = pd.DataFrame({
    'Color': ['Red', 'Green', 'Blue'],
    'Size': ['Small', 'Medium', 'Large'],
    'Material': ['Wood', 'Metal', 'Plastic']
})

df.head()

Unnamed: 0,Color,Size,Material
0,Red,Small,Wood
1,Green,Medium,Metal
2,Blue,Large,Plastic


In [2]:
from sklearn.preprocessing import LabelEncoder

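# A single encoder is reused for all three columns. Each fit_transform call
# refits it, so label.classes_ ends up holding only the Material classes.
label = LabelEncoder()

n_color = label.fit_transform(df['Color'])
n_size = label.fit_transform(df['Size'])
n_material = label.fit_transform(df['Material'])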

df1= pd.DataFrame({'n_color': list(n_color),
                  'n_size' : list(n_size),
                  'n_material' : list(n_material) })
df1

Unnamed: 0,n_color,n_size,n_material
0,2,2,2
1,1,1,0
2,0,0,1
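
To explain the output: `LabelEncoder` sorts the distinct values of a column alphabetically and numbers them from 0. That gives Blue = 0, Green = 1, Red = 2 for Color; Large = 0, Medium = 1, Small = 2 for Size; and Metal = 0, Plastic = 1, Wood = 2 for Material, which matches the table above. The codes for Size follow the alphabet, not the small-to-large ranking. The sketch below is one way to make these mappings visible, assuming a separate encoder per column; the `encoders` and `mappings` dictionaries are names introduced here for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    'Color': ['Red', 'Green', 'Blue'],
    'Size': ['Small', 'Medium', 'Large'],
    'Material': ['Wood', 'Metal', 'Plastic']
})

encoders = {}   # one fitted encoder per column, kept for later inverse_transform
mappings = {}   # category -> integer code, per column

for column in ['Color', 'Size', 'Material']:
    encoder = LabelEncoder()
    df[f'n_{column.lower()}'] = encoder.fit_transform(df[column])
    encoders[column] = encoder
    # classes_ is sorted, and a class's position in it is its code.
    mappings[column] = {str(name): code for code, name in enumerate(encoder.classes_)}

for column, mapping in mappings.items():
    print(column, mapping)
```

Keeping one encoder per column also makes it possible to recover the original labels later with `inverse_transform`.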


#### `Q5`. Calculate the covariance matrix for the following variables in a dataset: Age, Income, and Education level. Interpret the results.


In [3]:
import numpy as np

# create a DataFrame with six observations of Age, Income, and Education level
data = pd.DataFrame({
    'Age': [28,35,42,30,38,45],
    'Income' : [50000,70000,90000,55000,75000,95000],
    'Education level': [12,16,18,14,16,20]
})

data

Unnamed: 0,Age,Income,Education level
0,28,50000,12
1,35,70000,16
2,42,90000,18
3,30,55000,14
4,38,75000,16
5,45,95000,20


In [4]:
cov_matrix = np.cov(data, rowvar=False)
cov_matrix

array([[4.42666667e+01, 1.20000000e+05, 1.84000000e+01],
       [1.20000000e+05, 3.27500000e+08, 5.00000000e+04],
       [1.84000000e+01, 5.00000000e+04, 8.00000000e+00]])
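
To interpret the matrix: the diagonal entries are the variances of Age (about 44.27), Income (3.275e+08), and Education level (8.0). The off-diagonal entries are the pairwise covariances, and all three are positive (Age and Income 120000, Age and Education level 18.4, Income and Education level 50000), so in this sample the three variables tend to rise together. The magnitudes are not comparable with one another because they depend on the units of each variable; Income is measured in much larger numbers than the others. A correlation matrix removes the units, as in the sketch below, which assumes the same data as above.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    'Age': [28, 35, 42, 30, 38, 45],
    'Income': [50000, 70000, 90000, 55000, 75000, 95000],
    'Education level': [12, 16, 18, 14, 16, 20]
})

# DataFrame.cov() gives the same sample covariances as
# np.cov(data, rowvar=False), with row and column labels attached.
cov_matrix = data.cov()

# Correlation rescales each covariance by the two standard deviations,
# so every entry lies between -1 and 1.
corr_matrix = data.corr()

print(cov_matrix)
print(corr_matrix.round(3))
```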

#### `Q6`. You are working on a machine learning project with a dataset containing several categorical variables, including "Gender" (Male/Female), "Education Level" (High School/Bachelor's/Master's/PhD), and "Employment Status" (Unemployed/Part-Time/Full-Time). Which encoding method would you use for each variable, and why?


* For the **Gender** variable, I would use Label Encoding, as there are only two categories, Male and Female, and there is no inherent order or hierarchy between the categories.

* For the **Education Level** variable, I would use Ordinal Encoding, as there is an inherent order or hierarchy between the categories, with a higher education level being "better" than a lower education level. I would assign an ordinal value to each category based on its level of education, with High School being the lowest and PhD being the highest.

* For the **Employment Status** variable, I would use One-Hot Encoding, as there are three categories and no clear order between them that I want the model to assume. One-Hot Encoding creates a binary variable for each category, with a value of 1 if the observation belongs to that category and 0 otherwise. This approach ensures that there is no implied ranking or order between the categories.
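
The three choices can be sketched together as follows. The four rows are hypothetical, and `pd.get_dummies` is used for the one-hot step as one convenient option among several.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical rows, for illustration only.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "Education Level": ["Bachelor's", "PhD", "High School", "Master's"],
    "Employment Status": ["Full-Time", "Unemployed", "Part-Time", "Full-Time"],
})

# Gender: two categories, so a single 0/1 column is enough (Female -> 0, Male -> 1).
df["Gender"] = LabelEncoder().fit_transform(df["Gender"])

# Education Level: pass the ranking explicitly so the codes respect it.
education_order = [["High School", "Bachelor's", "Master's", "PhD"]]
df["Education Level"] = (
    OrdinalEncoder(categories=education_order)
    .fit_transform(df[["Education Level"]])
    .ravel()
    .astype(int)
)

# Employment Status: one indicator column per category.
df = pd.get_dummies(df, columns=["Employment Status"], dtype=int)

print(df)
```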

#### `Q7`. You are analyzing a dataset with two continuous variables, "Temperature" and "Humidity", and two categorical variables, "Weather Condition" (Sunny/Cloudy/Rainy) and "Wind Direction" (North/South/East/West). Calculate the covariance between each pair of variables and interpret the results.

* Covariance is calculated as the sum of the product of the deviations of each variable from its mean, divided by the sample size minus one. The formula for the covariance between two variables X and Y with sample size n is:

  > cov(X,Y) = 1/(n-1) * ∑ [(X_i - X_mean) * (Y_i - Y_mean)]

In [5]:
import pandas as pd

df = pd.DataFrame({
    'Temperature': [50,35,75,34,31],
    'Humidity': [45,43,65,76,22],
    'Weather Condition': ['Sunny', 'Cloudy', 'Rainy','Sunny', 'Cloudy'],
    'Wind Direction': ['North', 'South', 'East','South', 'West']
})

df.head()

Unnamed: 0,Temperature,Humidity,Weather Condition,Wind Direction
0,50,45,Sunny,North
1,35,43,Cloudy,South
2,75,65,Rainy,East
3,34,76,Sunny,South
4,31,22,Cloudy,West


In [6]:
from sklearn.preprocessing import LabelEncoder

label = LabelEncoder()

weather_list = label.fit_transform(df['Weather Condition'])
wind_list = label.fit_transform(df['Wind Direction'])

encoded_df = pd.DataFrame({'weather Condition':list(weather_list),
                          'wind Direction': list(wind_list) 
                          })
encoded_df

Unnamed: 0,weather Condition,wind Direction
0,2,1
1,0,2
2,1,0
3,2,2
4,0,3


In [7]:
# Append the encoded columns, then drop the original text columns in one call.
new_df = pd.concat([df, encoded_df], axis=1)
new_df = new_df.drop(columns=['Weather Condition', 'Wind Direction'])

In [8]:
new_df

Unnamed: 0,Temperature,Humidity,weather Condition,wind Direction
0,50,45,2,1
1,35,43,0,2
2,75,65,1,0
3,34,76,2,2
4,31,22,0,3


In [9]:
#covariance between Temperature and Humidity
cov_1 = new_df['Temperature'].cov(new_df['Humidity'])

#covariance between Weather Condition and Wind Direction
cov_2 = new_df['weather Condition'].cov(new_df['wind Direction'])

print(f"Covariance between Temperature and Humidity: {cov_1}")
print(f"Covariance between Weather Condition and Wind Direction: {cov_2}")

Covariance between Temperature and Humidity: 150.25
Covariance between Weather Condition and Wind Direction: -0.5


* **The covariance between temperature and humidity is 150.25**, indicating a positive relationship in this sample: observations with a higher temperature tend to have a higher humidity, and vice versa. The magnitude depends on the units of both variables, so it does not by itself show how strong the relationship is.

* **The covariance between weather condition and wind direction is -0.5**, but this number should not be read as a relationship between the two variables. Both are nominal, and the integer codes (Cloudy = 0, Rainy = 1, Sunny = 2; East = 0, North = 1, South = 2, West = 3) come from alphabetical order, so a different labelling would change the value and could even flip its sign. Covariance is only meaningful for variables whose numeric values carry real magnitude.
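
Since the question asks for every pair, the sketch below computes the full covariance matrix in one call with `DataFrame.cov()`, using the encoded values from the table above. Only the Temperature and Humidity entry has a clear interpretation; every entry that involves an encoded column depends on the arbitrary integer codes.

```python
import pandas as pd

# Same values as new_df above, with the label-encoded categorical columns.
new_df = pd.DataFrame({
    "Temperature": [50, 35, 75, 34, 31],
    "Humidity": [45, 43, 65, 76, 22],
    "weather Condition": [2, 0, 1, 2, 0],
    "wind Direction": [1, 2, 0, 2, 3],
})

# Sample covariance (n - 1 denominator) for all pairs of columns.
pairwise_cov = new_df.cov()

print(pairwise_cov)
```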