# Feature Engineering-5

#### Q1. What is the difference between Ordinal Encoding and Label Encoding? Provide an example of when you might choose one over the other.

* Ordinal Encoding: Used for categorical variables that have an inherent order or rank. Each category is assigned an integer that reflects its position in that order. For example, a variable "Education Level" with categories "High School," "Bachelor's," "Master's," and "PhD" can be encoded as 1, 2, 3, and 4.
* Label Encoding: Assigns each category a unique integer without regard to rank, so it suits nominal variables with no inherent order. scikit-learn's `LabelEncoder` numbers the categories in sorted (alphabetical) order. For example, a variable "Color" with categories "red," "green," and "blue" is encoded as blue = 0, green = 1, and red = 2.

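Example: Suppose we have a dataset of clothing sizes: "Small," "Medium," and "Large." Since the sizes have a clear order, we would use ordinal encoding so that the codes preserve their rank (scikit-learn's `OrdinalEncoder` assigns 0, 1, and 2 when given the categories in that order). For an unordered variable such as "Color," label encoding is sufficient because there is no rank to preserve.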

In [1]:
from sklearn.preprocessing import OrdinalEncoder, LabelEncoder
# Ordinal Encoding
oe = OrdinalEncoder(categories=[['Small', 'Medium', 'Large']])
ordinal_data = oe.fit_transform([['Medium'], ['Small'], ['Large']])
print("Ordinal Encoded Data:", ordinal_data)
# Label Encoding
le = LabelEncoder()
label_data = le.fit_transform(['red', 'green', 'blue'])
print("Label Encoded Data:", label_data)

Ordinal Encoded Data: [[1.]
 [0.]
 [2.]]
Label Encoded Data: [2 1 0]
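
As a rough illustration of why the choice matters, the sketch below applies both encoders to the same size labels. It assumes scikit-learn's documented behavior that `LabelEncoder` sorts its classes; the variable names are illustrative only.

```python
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

sizes = ['Small', 'Medium', 'Large']

# LabelEncoder sorts classes alphabetically: Large=0, Medium=1, Small=2,
# so the codes do not follow the Small < Medium < Large rank
label_codes = LabelEncoder().fit_transform(sizes)

# OrdinalEncoder with an explicit category list keeps the intended rank
oe = OrdinalEncoder(categories=[['Small', 'Medium', 'Large']])
ordinal_codes = oe.fit_transform([[s] for s in sizes]).ravel()

print("LabelEncoder codes:  ", label_codes.tolist())
print("OrdinalEncoder codes:", ordinal_codes.tolist())
```

With label encoding, "Small" receives the largest code, which would mislead any model that treats the codes as magnitudes.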


#### Q2. Explain how Target Guided Ordinal Encoding works and provide an example of when you might use it in a machine learning project.

Target Guided Ordinal Encoding orders the categories of a nominal variable by a statistic of the target variable, typically the target mean within each category, and then assigns integer ranks in that order. This can be useful when there is a clear relationship between the category and the target variable's behavior. For instance, in a credit risk prediction project, we might encode credit score bands as ordinal values ranked by their default rate.

Example: In a marketing campaign dataset, we could use target guided ordinal encoding to encode customer segments based on their response rate to previous campaigns. Segments with higher response rates might receive higher encoded values.

In [2]:
import pandas as pd
df = pd.DataFrame({
    'Segment': ['High', 'Medium', 'Low', 'High', 'Low'],
    'Response_Rate': [0.7, 0.3, 0.2, 0.6, 0.1]
})
# Calculate mean response rate for each segment
mean_response = df.groupby('Segment')['Response_Rate'].mean().sort_values()
# Create mapping based on mean response rate
mapping = {segment: i for i, segment in enumerate(mean_response.index)}
# Apply mapping to the original data
df['Encoded_Segment'] = df['Segment'].map(mapping)
print(df)

  Segment  Response_Rate  Encoded_Segment
0    High            0.7                2
1  Medium            0.3                1
2     Low            0.2                0
3    High            0.6                2
4     Low            0.1                0
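
The same idea can be sketched with a binary target, where the per-category mean is a response rate. The data below is hypothetical, and the column names `Responded` and `Encoded_Segment` are assumptions made for illustration. The mapping is learned from the training rows only and then applied to new rows; a segment that never appeared in training has no rank and maps to NaN.

```python
import pandas as pd

# Hypothetical training data: 'Responded' is a binary target (1 = responded)
train = pd.DataFrame({
    'Segment':   ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'C'],
    'Responded': [0,   1,   1,   0,   0,   1,   1,   1],
})

# Rank segments by mean target (response rate), lowest first
response_rate = train.groupby('Segment')['Responded'].mean().sort_values()
mapping = {segment: rank for rank, segment in enumerate(response_rate.index)}

# Apply the mapping learned on the training data to new rows;
# the unseen segment 'D' has no entry in the mapping and becomes NaN
new_rows = pd.DataFrame({'Segment': ['C', 'A', 'D']})
new_rows['Encoded_Segment'] = new_rows['Segment'].map(mapping)

print(mapping)
print(new_rows)
```

Fitting the mapping on training rows only is a design choice: it keeps target information from evaluation data out of the encoded feature.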


#### Q3. Define covariance and explain why it is important in statistical analysis. How is covariance calculated?

Covariance is a statistical measure that indicates the extent to which two variables change together. It describes the relationship between two variables and whether they move in the same direction (positive covariance) or in opposite directions (negative covariance).

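The sign of the covariance shows the direction of the linear relationship, but its magnitude depends on the units of the variables, so it is not a standardized measure of strength; the correlation coefficient serves that purpose. Covariance is important in statistical analysis because it underlies the correlation coefficient and the covariance matrix used in multivariate analysis.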

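* Calculation: The sample covariance between two variables X and Y, with sample means $\bar{X}$ and $\bar{Y}$ over n observations, is calculated as follows:

$$\operatorname{cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$$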

In [3]:
import numpy as np
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 3, 4, 5, 6])
# np.cov returns the 2x2 covariance matrix; [0, 1] is cov(x, y)
cov = np.cov(x, y)[0, 1]
print("Covariance:", cov)

Covariance: 2.5
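
To connect the formula to the library call, the sketch below computes the sample covariance directly from its definition, then shows that rescaling one variable changes the covariance but not the correlation. The factor of 100 is an arbitrary stand-in for a change of units.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 3, 4, 5, 6])

# Sample covariance from the definition:
# sum of products of deviations from the means, divided by (n - 1)
manual_cov = np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)

# Rescaling y (for example, a change of units) rescales the covariance
cov_scaled = np.cov(x, 100 * y)[0, 1]

# The Pearson correlation is unaffected by the rescaling
corr = np.corrcoef(x, y)[0, 1]
corr_scaled = np.corrcoef(x, 100 * y)[0, 1]

print("Manual covariance:", manual_cov)
print("Covariance after scaling y by 100:", cov_scaled)
print("Correlation before and after scaling:", corr, corr_scaled)
```

The manual value matches `np.cov`, which also divides by n - 1 by default.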


#### Q4. For a dataset with the following categorical variables: Color (red, green, blue), Size (small, medium, large), and Material (wood, metal, plastic), perform label encoding using Python's scikit-learn library. Show your code and explain the output.

In [4]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
df = pd.DataFrame({
    'Color': ['red', 'green', 'blue', 'red', 'blue'],
    'Size': ['small', 'medium', 'large', 'small', 'medium'],
    'Material': ['wood', 'metal', 'plastic', 'metal', 'wood']
})
le = LabelEncoder()
# Refit the encoder on each column and replace the labels with integer codes
for c in df.columns:
    df[c] = le.fit_transform(df[c])
print(df)

   Color  Size  Material
0      2     2         2
1      1     1         0
2      0     0         1
3      2     2         0
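4      0     1         2

Explanation: `LabelEncoder` sorts each column's categories alphabetically and numbers them from 0. Color is encoded as blue = 0, green = 1, red = 2; Size as large = 0, medium = 1, small = 2; and Material as metal = 0, plastic = 1, wood = 2. The codes are identifiers only: for Size, the alphabetical numbering does not follow the natural small < medium < large order.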
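
Reusing a single encoder across columns discards each earlier mapping once the next column is fitted. One possible alternative, sketched here with a hypothetical `encoders` dictionary, keeps one fitted encoder per column so the code-to-category mapping can be inspected and reversed.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    'Color': ['red', 'green', 'blue', 'red', 'blue'],
    'Size': ['small', 'medium', 'large', 'small', 'medium'],
    'Material': ['wood', 'metal', 'plastic', 'metal', 'wood'],
})

encoders = {}  # one fitted encoder per column, so each mapping is kept
for col in df.columns:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])

# classes_[k] is the category that was assigned code k
for col, enc in encoders.items():
    print(col, dict(enumerate(enc.classes_.tolist())))

# inverse_transform recovers the original labels from the codes
original_colors = encoders['Color'].inverse_transform(df['Color'])
print(original_colors.tolist())
```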


#### Q5. Calculate the covariance matrix for the following variables in a dataset: Age, Income, and Education level. Interpret the results.

In [5]:
import numpy as np
age = [25, 30, 40, 28, 35]
income = [50000, 60000, 80000, 55000, 70000]
education_level = [1, 2, 3, 2, 4]  # Assuming encoded values (1: HS, 2: Bachelor's, ...)
# Each row is one variable, which is the layout np.cov expects by default
data = np.array([age, income, education_level])
cov_matrix = np.cov(data)
print("Covariance Matrix:\n", cov_matrix)

Covariance Matrix:
 [[3.53e+01 7.15e+04 5.45e+00]
 [7.15e+04 1.45e+08 1.10e+04]
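 [5.45e+00 1.10e+04 1.30e+00]]

Interpretation: The rows and columns are ordered Age, Income, Education level. The diagonal entries are the variances (35.3 for age, 1.45e+08 for income, 1.3 for education level). All off-diagonal entries are positive, so in this sample age, income, and education level tend to increase together. The magnitudes are not comparable with one another because income is measured on a much larger scale than the other two variables; judging the strength of each relationship requires a standardized measure such as correlation.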
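
Because the entries of a covariance matrix depend on each variable's scale, a labeled correlation matrix can make the same data easier to read. The sketch below uses pandas instead of NumPy so that rows and columns carry names; the column name `Education_Level` is an illustrative choice.

```python
import pandas as pd

df = pd.DataFrame({
    'Age': [25, 30, 40, 28, 35],
    'Income': [50000, 60000, 80000, 55000, 70000],
    'Education_Level': [1, 2, 3, 2, 4],  # same assumed encoding as above
})

# Sample covariance (divides by n - 1), with labeled rows and columns
cov_matrix = df.cov()

# Pearson correlation: unit-free, always between -1 and 1
corr_matrix = df.corr()

print(cov_matrix)
print(corr_matrix.round(3))
```

`DataFrame.cov` treats each column as a variable, so no transposing is needed, unlike `np.cov`, which treats each row as a variable by default.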


#### Q6. You are working on a machine learning project with a dataset containing several categorical variables, including "Gender" (Male/Female), "Education Level" (High School/Bachelor's/Master's/PhD), and "Employment Status" (Unemployed/Part-Time/Full-Time). Which encoding method would you use for each variable, and why?

Encoding Categorical Variables:
* "Gender" (Male/Female): Use label encoding since it's a binary nominal variable. scikit-learn's `LabelEncoder` numbers categories alphabetically, giving 0 for Female and 1 for Male.
* "Education Level" (High School/Bachelor's/Master's/PhD): Use ordinal encoding, since the levels have an inherent order that the codes should preserve.
* "Employment Status" (Unemployed/Part-Time/Full-Time): Use label encoding (0, 1, 2), treating the categories as nominal with no rank that must be preserved.

In [6]:
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder, LabelEncoder
df = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Male', 'Female'],
    'Education_Level': ['Bachelors', 'Masters', 'PhD', 'High School'],
    'Employment_Status': ['Unemployed', 'Full-Time', 'Part-Time', 'Full-Time']
})
# Label Encoding (LabelEncoder expects 1-D input, so pass a Series, not a DataFrame)
le = LabelEncoder()
df['Gender_Encoded'] = le.fit_transform(df['Gender'])
df['Employment_Status_Encoded'] = le.fit_transform(df['Employment_Status'])
# Ordinal Encoding with an explicit category order (OrdinalEncoder expects 2-D input)
el = ['High School', 'Bachelors', 'Masters', 'PhD']
oe = OrdinalEncoder(categories=[el])
df['Education_Level_Encoded'] = oe.fit_transform(df[['Education_Level']]).ravel()
df.head()

   Gender  Education_Level  Employment_Status  Gender_Encoded  Employment_Status_Encoded  Education_Level_Encoded
0    Male        Bachelors         Unemployed               1                          2                      1.0
1  Female          Masters          Full-Time               0                          0                      2.0
2    Male              PhD          Part-Time               1                          1                      3.0
3  Female      High School          Full-Time               0                          0                      0.0


#### Q7. You are analyzing a dataset with two continuous variables, "Temperature" and "Humidity", and two categorical variables, "Weather Condition" (Sunny/Cloudy/Rainy) and "Wind Direction" (North/South/East/West). Calculate the covariance between each pair of variables and interpret the results.

In [7]:
# Example
import numpy as np
temp = np.array([75, 80, 70, 85, 72])
humidity = np.array([60, 55, 65, 50, 58])
weather_condition = np.array([1, 2, 1, 3, 2])  # Assuming encoded values
wind_direction = np.array([0, 180, 270, 90, 0])  # Assuming encoded values
data = np.array([temp, humidity, weather_condition, wind_direction])
# Calculate covariance matrix
cov_matrix = np.cov(data)
print("Covariance Matrix:\n", cov_matrix)

Covariance Matrix:
 [[ 3.730e+01 -3.180e+01  4.100e+00 -7.650e+01]
 [-3.180e+01  3.130e+01 -4.350e+00  2.115e+02]
 [ 4.100e+00 -4.350e+00  7.000e-01 -1.800e+01]
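 [-7.650e+01  2.115e+02 -1.800e+01  1.377e+04]]

Interpretation: The rows and columns are ordered Temperature, Humidity, Weather Condition, Wind Direction, and the diagonal entries are the variances of each variable. Temperature and humidity have a negative covariance (-31.8), so in this sample humidity tends to fall as temperature rises. The entries involving weather condition and wind direction depend entirely on the numeric codes assigned to the categories, so they should not be read as meaningful relationships; covariance is only well defined for numeric variables.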
