In [None]:
Q1. What is the difference between Ordinal Encoding and Label Encoding? Provide an example of when you
might choose one over the other.

In [None]:

1. Nature of Categorical Data:
   - Ordinal Encoding: Ordinal encoding is used when the categories have a natural order or hierarchy, meaning they can be ranked by some inherent property. The integer codes are chosen to preserve that order.
   - Label Encoding: Label encoding assigns each category a unique integer without regard to any order among the categories. In scikit-learn, `LabelEncoder` is intended for encoding the target variable `y` rather than input features.

2. Representation:
   - Ordinal Encoding: Categories are mapped to integers according to a predefined ranking, so a larger code means a higher (or, if descending, lower) rank.
   - Label Encoding: Each category receives a unique integer that carries no ranking information. scikit-learn's `LabelEncoder` sorts the class names and numbers them from 0, so the codes follow alphabetical order.

3. Application:
   - Ordinal Encoding: Ordinal encoding is commonly used for categorical variables with a clear order or hierarchy, such as education level (e.g., high school, college, graduate school), income level (e.g., low, medium, high), or customer satisfaction rating (e.g., poor, fair, good, excellent).
   - Label Encoding: Label encoding suits class labels with no inherent order, such as gender (e.g., male, female), country names, or product categories, particularly when the variable is the prediction target. When such a variable is an input feature, a model may read the integers as an ordering, so one-hot encoding is often the safer choice.

4. Handling Missing Values:
   - Ordinal Encoding: Missing values can be handled by reserving a dedicated code for them, for example an explicit "missing" category placed at a chosen position in the predefined order.
   - Label Encoding: Label encoding has no natural place for a missing value, because every distinct value is simply given the next integer. Missing values generally need to be imputed or replaced with a placeholder label before encoding.
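
A minimal sketch of the contrast described above, assuming scikit-learn is installed. The `ratings` values are made-up illustration data, not part of the assignment. `OrdinalEncoder` accepts an explicit category order, whereas `LabelEncoder` numbers the classes in sorted order:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical illustration data: a satisfaction rating with a natural order.
ratings = np.array(["good", "poor", "excellent", "fair", "good"])

# OrdinalEncoder expects a 2D feature array and lets us state the order explicitly.
ordinal_encoder = OrdinalEncoder(categories=[["poor", "fair", "good", "excellent"]])
ordinal_codes = ordinal_encoder.fit_transform(ratings.reshape(-1, 1)).ravel()

# LabelEncoder takes a 1D array and assigns integers by sorted class name.
label_encoder = LabelEncoder()
label_codes = label_encoder.fit_transform(ratings)

print(ordinal_codes)           # codes follow poor < fair < good < excellent
print(label_codes)             # codes follow alphabetical order of the names
print(label_encoder.classes_)  # the sorted class names
```

With the ordinal encoder, "poor" gets the lowest code and "excellent" the highest. With the label encoder, "excellent" gets code 0 simply because it sorts first alphabetically.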

In [None]:
Q2. Explain how Target Guided Ordinal Encoding works and provide an example of when you might use it in
a machine learning project.

In [None]:

1. Purpose:
   - Target Guided Ordinal Encoding is useful for nominal categorical features in supervised tasks such as regression and classification, especially when a feature has many categories and one-hot encoding would add one column per category.
   - It leverages information from the target variable to create an ordinal encoding for the categories.

2. How It Works:
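   - The process involves sorting the categories based on the **mean of the target variable** for each category.
   - We then assign numerical values to each category based on its rank in this sorted order. The mapping should be computed from the training data only, so that target information from validation or test rows does not leak into the feature.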

3. Example:
   - Suppose we have a dataset of employees with the following columns: `Employee Id`, `City`, `Highest Qualification`, and `Salary`.
   - Our goal is to predict an employee's salary based on other details.
   - Let's focus on encoding the `City` column using Target Guided Ordinal Encoding.
   - Here's the initial data:

     | Employee Id | City     | Highest Qualification | Salary |
     |-------------|----------|-----------------------|--------|
     | A100        | Delhi    | Phd                   | 50000  |
     | A101        | Delhi    | Bsc                   | 30000  |
     | A102        | Mumbai   | Msc                   | 45000  |
     | B101        | Pune     | Bsc                   | 25000  |
     | B102        | Kolkata  | Phd                   | 48000  |
     | C100        | Pune     | Msc                   | 30000  |
     | D103        | Kolkata  | Msc                   | 44000  |

   - Step 1: Calculate the mean salary for each city:
     - Delhi: (50000 + 30000) / 2 = 40000
     - Mumbai: 45000
     - Pune: (25000 + 30000) / 2 = 27500
     - Kolkata: (48000 + 44000) / 2 = 46000

   - Step 2: Sort the cities based on mean salary:
     - Kolkata > Mumbai > Delhi > Pune

   - Step 3: Assign ranks to the cities:
     - Kolkata: Rank 4
     - Mumbai: Rank 3
     - Delhi: Rank 2
     - Pune: Rank 1

   - Step 4: Encode the `City` column:
     - Update the dataset with the city ranks:

       | Employee Id | City | Highest Qualification | Salary |
       |-------------|------|-----------------------|--------|
       | A100        | 2    | Phd                   | 50000  |
       | A101        | 2    | Bsc                   | 30000  |
       | A102        | 3    | Msc                   | 45000  |
       | B101        | 1    | Bsc                   | 25000  |
       | B102        | 4    | Phd                   | 48000  |
       | C100        | 1    | Msc                   | 30000  |
       | D103        | 4    | Msc                   | 44000  |
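
The four steps above can be sketched with pandas as follows. This is one possible implementation, not the only way to do it; the names `mean_salary`, `city_rank`, and `City_encoded` are chosen here for illustration.

```python
import pandas as pd

# The employee table from the example above.
df = pd.DataFrame({
    "Employee Id": ["A100", "A101", "A102", "B101", "B102", "C100", "D103"],
    "City": ["Delhi", "Delhi", "Mumbai", "Pune", "Kolkata", "Pune", "Kolkata"],
    "Highest Qualification": ["Phd", "Bsc", "Msc", "Bsc", "Phd", "Msc", "Msc"],
    "Salary": [50000, 30000, 45000, 25000, 48000, 30000, 44000],
})

# Step 1: mean salary per city.
mean_salary = df.groupby("City")["Salary"].mean()

# Steps 2-3: rank the cities by mean salary (1 = lowest mean).
city_rank = mean_salary.rank(method="dense").astype(int)

# Step 4: replace each city with its rank.
df["City_encoded"] = df["City"].map(city_rank)

print(city_rank.sort_values())
print(df)
```

In a real project the `city_rank` mapping would be fitted on the training split and then applied unchanged to the other splits.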

In [None]:
Q3. Define covariance and explain why it is important in statistical analysis. How is covariance calculated?

In [None]:
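Covariance is a measure of how two variables vary together. It is positive when the variables tend to increase together, negative when one tends to decrease as the other increases, and close to zero when there is no linear association. It is important in statistical analysis for the following reasons: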

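1. Measuring Relationship: Covariance indicates the direction of the linear relationship between two variables, helping analysts understand how changes in one variable are associated with changes in another. Its magnitude depends on the units of the variables, so the strength of the relationship is usually judged from the correlation coefficient, which is covariance divided by the product of the two standard deviations.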

2. Predictive Power: Covariance is used in various statistical models and techniques, such as linear regression, where understanding the relationship between predictor variables and the response variable is crucial for making predictions.

3. Portfolio Analysis: In finance, covariance is used to measure the relationship between the returns of different assets in a portfolio. Positive covariance indicates that the assets tend to move together, while negative covariance suggests that they move in opposite directions. This information is essential for diversifying risk in investment portfolios.

4. Dimensionality Reduction: Covariance plays a role in techniques like principal component analysis (PCA), where it helps identify the directions of maximum variance in a dataset. By understanding the covariance structure of the data, PCA can reduce the dimensionality of the dataset while retaining most of the variability.

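The population covariance between two variables \( X \) and \( Y \) is calculated using the following formula:

\[ \text{Cov}(X, Y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n} \]

Where:
- \( x_i \) and \( y_i \) are individual data points of variables \( X \) and \( Y \) respectively.
- \( \bar{x} \) and \( \bar{y} \) are the means (averages) of variables \( X \) and \( Y \) respectively.
- \( n \) is the number of data points.

When the data are a sample rather than the whole population, the sum is divided by \( n - 1 \) instead of \( n \). This sample covariance is what NumPy's `np.cov` returns by default, and it is the version computed in the answers below.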
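
As a quick check of the formula, the sketch below computes the covariance by hand with both divisors, \( n \) and \( n - 1 \), and compares each with `np.cov`. The arrays `x` and `y` are made-up illustration data.

```python
import numpy as np

# Hypothetical illustration data.
x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 5.0])

n = len(x)
# Products of deviations from the means, one per data point.
dev_products = (x - x.mean()) * (y - y.mean())

cov_population = dev_products.sum() / n        # divide by n
cov_sample = dev_products.sum() / (n - 1)      # divide by n - 1

# np.cov returns a 2x2 matrix; the off-diagonal entry is Cov(X, Y).
print(cov_population, np.cov(x, y, bias=True)[0, 1])
print(cov_sample, np.cov(x, y)[0, 1])
```

Passing `bias=True` makes `np.cov` divide by \( n \); the default divides by \( n - 1 \).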

In [None]:
Q4. For a dataset with the following categorical variables: Color (red, green, blue), Size (small, medium,
large), and Material (wood, metal, plastic), perform label encoding using Python's scikit-learn library.
Show your code and explain the output.

In [1]:
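from sklearn.preprocessing import LabelEncoder
import pandas as pd

data = pd.DataFrame({
    'Color': ['red', 'green', 'blue', 'red', 'blue'],
    'Size': ['small', 'medium', 'large', 'medium', 'small'],
    'Material': ['wood', 'metal', 'plastic', 'wood', 'metal']
})

# Snapshot the column names first, because the loop adds new columns.
for col in list(data.columns):
    # Use a fresh encoder per column so each column keeps its own mapping.
    # LabelEncoder sorts the category names and numbers them from 0.
    label_encoder = LabelEncoder()
    data[col + '_encoded'] = label_encoder.fit_transform(data[col])

print(data)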


   Color    Size Material  Color_encoded  Size_encoded  Material_encoded
0    red   small     wood              2             2                 2
1  green  medium    metal              1             1                 0
2   blue   large  plastic              0             0                 1
3    red  medium     wood              2             1                 2
4   blue   small    metal              0             2                 0
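
To explain the output: `LabelEncoder` sorts each column's category names alphabetically and numbers them from 0, so `blue`, `green`, `red` become 0, 1, 2 and `metal`, `plastic`, `wood` become 0, 1, 2. For `Size`, the codes are `large` = 0, `medium` = 1, `small` = 2, which follows the alphabet and not the natural order small < medium < large. A short sketch that recovers these mappings from the fitted encoders (the `mappings` dictionary is a name chosen here for illustration):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

data = pd.DataFrame({
    'Color': ['red', 'green', 'blue', 'red', 'blue'],
    'Size': ['small', 'medium', 'large', 'medium', 'small'],
    'Material': ['wood', 'metal', 'plastic', 'wood', 'metal']
})

# Fit one encoder per column and record which integer each category received.
mappings = {}
for col in data.columns:
    encoder = LabelEncoder().fit(data[col])
    # The position of a category in classes_ is its integer code.
    mappings[col] = {str(category): code for code, category in enumerate(encoder.classes_)}

print(mappings)
```

If the order of `Size` matters to the model, an ordinal encoding with an explicitly stated order would be a better fit for that column.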


In [None]:
Q5. Calculate the covariance matrix for the following variables in a dataset: Age, Income, and Education
level. Interpret the results.

In [2]:
import numpy as np

age = [30, 40, 50, 35, 45]
income = [50000, 60000, 70000, 55000, 65000]
education_level = [12, 16, 18, 14, 20]

# Each inner list is one variable (row); np.cov returns the sample
# covariance matrix, dividing by n - 1 by default.
covariance_matrix = np.cov([age, income, education_level])

print("Covariance Matrix:")
print(covariance_matrix)


Covariance Matrix:
[[6.25e+01 6.25e+04 2.25e+01]
 [6.25e+04 6.25e+07 2.25e+04]
 [2.25e+01 2.25e+04 1.00e+01]]
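
Interpretation: the diagonal entries are the sample variances of Age (62.5), Income (6.25e+07), and Education level (10), and the off-diagonal entries are the pairwise covariances. All covariances are positive, so in this sample the three variables tend to increase together. The magnitudes are not directly comparable because they depend on the units of each variable (Income is on a much larger scale). One way to put them on a common scale, sketched here as an addition to the original answer, is the correlation matrix:

```python
import numpy as np

age = [30, 40, 50, 35, 45]
income = [50000, 60000, 70000, 55000, 65000]
education_level = [12, 16, 18, 14, 20]

# Each row is one variable; the result is a 3x3 matrix of Pearson correlations.
correlation_matrix = np.corrcoef([age, income, education_level])

print(np.round(correlation_matrix, 2))
```

In this toy data, Income is an exact linear function of Age, so their correlation is 1, while Education level correlates at 0.9 with both.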


In [None]:
Q6. You are working on a machine learning project with a dataset containing several categorical
variables, including "Gender" (Male/Female), "Education Level" (High School/Bachelor's/Master's/PhD),
and "Employment Status" (Unemployed/Part-Time/Full-Time). Which encoding method would you use for
each variable, and why?

In [None]:
1. Gender (Male/Female):
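    - Since gender has no inherent order or ranking, we should use **one-hot encoding**. This technique creates a separate binary column for each category (Male and Female). For instance:
        - A Male row is encoded as Male: 1, Female: 0
        - A Female row is encoded as Male: 0, Female: 1
    - One-hot encoding treats both labels equally and avoids any unintended ordinality. Because there are only two categories, one of the two columns is redundant and a single 0/1 column carries the same information.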

2. Education Level (High School/Bachelor's/Master's/PhD):
    - Education level exhibits a natural rank ordering (High School < Bachelor's < Master's < PhD).
    - Therefore, we can use ordinal encoding for this variable. Ordinal encoding assigns numeric values based on the order of the categories.
        - High School: 1, Bachelor's: 2, Master's: 3, PhD: 4
    - This approach captures the inherent hierarchy in education levels.

3. Employment Status (Unemployed/Part-Time/Full-Time):
    - Here we treat employment status, like gender, as having no inherent order. (One could argue for an ordering by hours worked, in which case ordinal encoding would also be defensible.)
    - Hence, we again use one-hot encoding to represent the three categories:
        - Unemployed: 1, 0, 0
        - Part-Time: 0, 1, 0
        - Full-Time: 0, 0, 1
    - One-hot encoding ensures equal treatment of all employment statuses.
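
A minimal sketch of these three choices using pandas. The four rows are made-up illustration data, and `education_order` is a name chosen here for the mapping described above.

```python
import pandas as pd

# Hypothetical illustration rows.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "Education Level": ["High School", "PhD", "Bachelor's", "Master's"],
    "Employment Status": ["Full-Time", "Unemployed", "Part-Time", "Full-Time"],
})

# Ordinal encoding: an explicit mapping that follows the stated order.
education_order = {"High School": 1, "Bachelor's": 2, "Master's": 3, "PhD": 4}
df["Education Level"] = df["Education Level"].map(education_order)

# One-hot encoding: one 0/1 indicator column per category.
encoded = pd.get_dummies(df, columns=["Gender", "Employment Status"], dtype=int)

print(encoded)
```

`pd.get_dummies` names each indicator column `<original column>_<category>`, for example `Gender_Male`.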

In [None]:
Q7. You are analyzing a dataset with two continuous variables, "Temperature" and "Humidity", and two
categorical variables, "Weather Condition" (Sunny/Cloudy/Rainy) and "Wind Direction" (North/South/
East/West). Calculate the covariance between each pair of variables and interpret the results.

In [3]:
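import numpy as np

temperature = [25, 30, 22, 28, 27]
humidity = [60, 55, 70, 65, 75]
# Covariance is only defined for numeric variables, so these nominal
# labels are not used in the calculation below.
weather_condition = ['Sunny', 'Cloudy', 'Rainy', 'Cloudy', 'Sunny']
wind_direction = ['North', 'South', 'East', 'West', 'North']

# Sample covariance (divides by n - 1) between the two continuous variables.
covariance_matrix = np.cov([temperature, humidity])

print("Covariance Matrix for Temperature and Humidity:")
print(covariance_matrix)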


Covariance Matrix for Temperature and Humidity:
[[  9.3  -11.25]
 [-11.25  62.5 ]]
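
Interpretation: the sample variance of Temperature is 9.3, the sample variance of Humidity is 62.5, and their covariance is -11.25. The negative sign means that in this sample higher temperatures tend to go with lower humidity. Covariance cannot be computed directly on the text labels of "Weather Condition" and "Wind Direction". One possible workaround, sketched here as an assumption and not part of the original answer, is to one-hot encode them and compute covariances with the resulting 0/1 indicator columns:

```python
import pandas as pd

df = pd.DataFrame({
    "Temperature": [25, 30, 22, 28, 27],
    "Humidity": [60, 55, 70, 65, 75],
    "Weather Condition": ["Sunny", "Cloudy", "Rainy", "Cloudy", "Sunny"],
    "Wind Direction": ["North", "South", "East", "West", "North"],
})

# Turn each category into a 0/1 indicator column so covariance is defined.
indicators = pd.get_dummies(
    df, columns=["Weather Condition", "Wind Direction"], dtype=float
)

# Pairwise sample covariances (DataFrame.cov divides by n - 1, like np.cov).
cov_matrix = indicators.cov()

print(cov_matrix.round(2))
```

A negative covariance between Temperature and an indicator such as `Weather Condition_Rainy` means that rows in that category have a below-average temperature in this sample. With only five rows, these values are illustrative only.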
