# Assignment

### Ans1)

Ordinal encoding and label encoding are both techniques used in machine learning for transforming categorical data into numerical data, but there are some differences between them.

Ordinal encoding involves assigning each unique category in a categorical variable a numerical value based on the order or rank of that category. For example, in a variable with the categories "low," "medium," and "high," we might assign the values 1, 2, and 3, respectively. This technique is useful when there is a natural order to the categories.

Label encoding, on the other hand, involves assigning a unique numerical value to each category in a categorical variable, without regard for any order or ranking among the categories. For example, in a variable with the categories "red," "green," and "blue," we might assign the values 1, 2, and 3, respectively.
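The difference can be sketched with pandas. This is a minimal illustration rather than a definitive recipe: the DataFrame, its column names, and the `level_order` mapping are made up for the example, and the label codes start at 0 rather than 1 because that is how pandas numbers categories.

```python
import pandas as pd

# Hypothetical toy data for illustration
df = pd.DataFrame({
    "level": ["low", "high", "medium", "low"],
    "color": ["red", "green", "blue", "green"],
})

# Ordinal encoding: the mapping is written by hand so the integers follow the rank.
level_order = {"low": 1, "medium": 2, "high": 3}
df["level_encoded"] = df["level"].map(level_order)

# Label encoding: any unique integer per category will do. Here pandas numbers
# the categories in sorted order (blue=0, green=1, red=2).
df["color_encoded"] = df["color"].astype("category").cat.codes

print(df)
```

The integers for "level" can be compared meaningfully (3 > 1 means high > low), while the integers for "color" are only identifiers.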

### Ans2)

Target Guided Ordinal Encoding is a technique that involves encoding categorical variables based on the target variable, rather than just the categories themselves. This technique can be useful when we have a categorical variable with many categories and we want to capture the relationship between the categories and the target variable in a more meaningful way.

The general idea behind Target Guided Ordinal Encoding is to replace each category with a number that reflects the relationship between that category and the target variable. For example, if we have a binary target variable and a categorical variable with three categories, we might calculate the mean of the target variable for each category, and then assign each category a number based on its mean value.

Here's a step-by-step process for performing Target Guided Ordinal Encoding:

1) Group the categorical variable by its unique values.

2) Calculate the mean of the target variable for each group.

3) Sort the groups by their mean target value.

4) Assign each group a unique integer value based on its rank.

For example, suppose we have a dataset with a categorical variable called "city" and a binary target variable called "is_fraud." We can perform Target Guided Ordinal Encoding on the "city" variable as follows:

1) Group the dataset by the unique values of "city": New York, Los Angeles, Chicago, and Houston.

2) Calculate the mean of "is_fraud" for each group:

New York: 0.1

Los Angeles: 0.05

Chicago: 0.02

Houston: 0.01

3) Sort the groups by their mean target value, from highest to lowest:

New York

Los Angeles

Chicago

Houston

4) Assign each group a unique integer value based on its rank:

New York: 4

Los Angeles: 3

Chicago: 2

Houston: 1

In this example, we've assigned a higher value to cities with a higher mean rate of fraud, so the encoding reflects the relationship between the "city" variable and the target variable in a meaningful way.
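The four steps can be sketched in pandas as follows. The data here is a hypothetical toy set (100 rows per city, with fraud counts chosen only to reproduce the fraud rates quoted in the example), and the variable names are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy data: 100 transactions per city, with fraud counts chosen
# to reproduce the mean fraud rates used in the example (0.1, 0.05, 0.02, 0.01).
fraud_counts = {"New York": 10, "Los Angeles": 5, "Chicago": 2, "Houston": 1}
rows = []
for city, n_fraud in fraud_counts.items():
    rows += [(city, 1)] * n_fraud + [(city, 0)] * (100 - n_fraud)
df = pd.DataFrame(rows, columns=["city", "is_fraud"])

# Steps 1-2: group by category and take the mean of the target.
mean_by_city = df.groupby("city")["is_fraud"].mean()

# Step 3: sort the categories by mean target value (ascending).
ordered = mean_by_city.sort_values().index

# Step 4: rank 1 = lowest fraud rate, rank 4 = highest.
encoding = {city: rank for rank, city in enumerate(ordered, start=1)}

df["city_encoded"] = df["city"].map(encoding)
print(encoding)
```

In practice the means would normally be computed on the training data only and then applied to validation or test data, so that the target does not leak into the encoded feature.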

### Ans3)

Covariance is a statistical measure that describes the relationship between two variables. Specifically, it measures how much two variables change together: when one variable increases, does the other variable tend to increase or decrease as well? A positive covariance means that the variables tend to move in the same direction, while a negative covariance means that they tend to move in opposite directions.

Covariance is important in statistical analysis because it shows the direction of the linear relationship between two variables. A positive covariance indicates that the variables tend to increase or decrease together, while a negative covariance indicates that they tend to move in opposite directions. Its magnitude depends on the units of the variables, so a large covariance does not by itself mean a strong relationship. To compare strength, covariance is rescaled by the standard deviations of the two variables, which gives the correlation coefficient.

Covariance is calculated using the following formula:

$$\mathrm{Cov}(X, Y) = \frac{\sum_{i=1}^{n} (x_i - \mu_x)(y_i - \mu_y)}{n - 1}$$

where $x_i$ and $y_i$ are the individual observations of the variables X and Y, $\mu_x$ and $\mu_y$ are their respective sample means, and n is the sample size. The formula subtracts the mean of each variable from each observation, multiplies the paired differences together, sums the products, and divides by n - 1 (rather than n), which gives the sample covariance. A positive result indicates a positive covariance, while a negative result indicates a negative covariance.
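As a worked check of the formula, here is a minimal plain-Python sketch. The function name `sample_covariance` and the toy data are assumptions made for illustration only.

```python
def sample_covariance(xs, ys):
    """Sample covariance of two equal-length sequences (divides by n - 1)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of the products of deviations from the means
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return total / (n - 1)

# Hypothetical toy data
x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]    # rises with x
y_down = [10, 8, 6, 4, 2]  # falls as x rises

cov_up = sample_covariance(x, y_up)
cov_down = sample_covariance(x, y_down)
print(cov_up, cov_down)  # 5.0 -5.0
```

The two results have the same size and opposite signs, which matches the interpretation of positive and negative covariance.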

### Ans4)

In [2]:
import pandas as pd 
from sklearn.preprocessing import LabelEncoder

In [33]:
df = pd.DataFrame({'Color':['red','green','blue'],
        'Size':['small','medium','large'],
        'Material':['wood','metal','plastic']
       })

In [34]:
df

|   | Color | Size | Material |
|---|-------|------|----------|
| 0 | red | small | wood |
| 1 | green | medium | metal |
| 2 | blue | large | plastic |


In [35]:
encoder=LabelEncoder()

In [36]:
df['Color']=encoder.fit_transform(df['Color'])

In [37]:
df['Color'].to_numpy()

array([2, 1, 0])

In [38]:
df['Size']=encoder.fit_transform(df['Size'])

In [39]:
df['Material'] = encoder.fit_transform(df['Material'])

In [41]:
print(df)

   Color  Size  Material
0      2     2         2
1      1     1         0
2      0     0         1


The LabelEncoder object is used to encode each categorical variable as an integer. The fit_transform() method is called on each variable to both fit the encoder on the variable and transform it into encoded integers.

The resulting encoded integers range from 0 to n-1, where n is the number of unique categories in each variable. Each of 'Color', 'Size', and 'Material' has three unique categories, so each is encoded as 0, 1, and 2. LabelEncoder assigns the integers in sorted (alphabetical) order of the category names, which is why 'blue' becomes 0 and 'red' becomes 2, and why 'Size' is encoded as large=0, medium=1, small=2 rather than in its natural order.
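One way to see which integer belongs to which category is to keep a separate fitted encoder per column and read its `classes_` attribute. The sketch below is one possible arrangement, not the only one. The `encoders` and `mappings` dictionaries are names introduced here for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    'Color': ['red', 'green', 'blue'],
    'Size': ['small', 'medium', 'large'],
    'Material': ['wood', 'metal', 'plastic'],
})

# One fitted encoder per column, so each mapping can be inspected or inverted later.
encoders = {}
for column in df.columns:
    encoders[column] = LabelEncoder()
    df[column] = encoders[column].fit_transform(df[column])

# classes_ lists the categories in sorted order; the position is the integer code.
mappings = {
    column: {str(label): code for code, label in enumerate(enc.classes_)}
    for column, enc in encoders.items()
}
print(mappings['Size'])
```

Reusing a single encoder for every column, as in the cells above, also works for the transformation itself, but each new `fit_transform` call overwrites the previously learned categories.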

### Ans5)

In [42]:
import numpy as np

# create a dataset with Age, Income, and Education level
age = [30, 40, 25, 35, 28]
income = [50000, 70000, 40000, 60000, 45000]
education = [16, 18, 14, 16, 15]

# stack the variables into a matrix
data = np.stack([age, income, education], axis=0)

# calculate the covariance matrix
covariance_matrix = np.cov(data)

print(covariance_matrix)

[[3.53e+01 7.15e+04 8.40e+00]
 [7.15e+04 1.45e+08 1.70e+04]
 [8.40e+00 1.70e+04 2.20e+00]]


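The diagonal elements are the variances of each variable: the variance of Age is 35.3, the variance of Income is 1.45e+08 (1.45 * 10^8), and the variance of Education level is 2.2.

The off-diagonal elements are the covariances between each pair of variables. The covariance between Age and Income is 7.15e+04, between Age and Education level is 8.4, and between Income and Education level is 1.70e+04. All three are positive, so in this sample the variables tend to increase together. The magnitudes differ mainly because the variables are measured on very different scales.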
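The same matrix can be computed with row and column labels, which makes individual entries easier to read. This is a sketch using `DataFrame.cov()` on the same numbers. The column labels and the variable names `age_variance` and `age_income_cov` are chosen here for illustration.

```python
import pandas as pd

data = pd.DataFrame({
    'Age': [30, 40, 25, 35, 28],
    'Income': [50000, 70000, 40000, 60000, 45000],
    'Education': [16, 18, 14, 16, 15],
})

# DataFrame.cov() returns the sample covariance matrix (divides by n - 1),
# labelled with the column names on both axes.
cov = data.cov()
print(cov)

# Individual entries can be looked up by label instead of by position.
age_variance = cov.loc['Age', 'Age']
age_income_cov = cov.loc['Age', 'Income']
```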

### Ans6)


For encoding categorical variables in a machine learning project, there are several methods to choose from. The choice of method depends on the nature of the variable, the number of categories, and the algorithm being used. Here are some encoding methods that could be used for the given categorical variables:

1) Gender: Since this is a binary categorical variable with two categories (Male and Female), we could use binary encoding. Binary encoding assigns a unique binary code (0 or 1) to each category of the variable. For example, we could encode Male as 0 and Female as 1.

2) Education Level: Since this is an ordinal categorical variable with multiple categories that have an inherent order (High School < Bachelor's < Master's < PhD), we could use ordinal encoding. Ordinal encoding assigns a unique numerical code to each category of the variable based on its order. For example, we could encode High School as 1, Bachelor's as 2, Master's as 3, and PhD as 4.

3) Employment Status: Since this is a nominal categorical variable with multiple categories that have no inherent order, we could use one-hot encoding. One-hot encoding creates a new binary variable for each category of the variable, with a value of 1 indicating the presence of that category and 0 indicating the absence. For example, we could create three binary variables: Unemployed (encoded as 1 for unemployed observations and 0 otherwise), Part-Time (encoded as 1 for part-time observations and 0 otherwise), and Full-Time (encoded as 1 for full-time observations and 0 otherwise).
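The three choices above could be applied with pandas roughly as follows. The sample rows are hypothetical and exist only to make the sketch runnable. The mappings follow the codes suggested in the text.

```python
import pandas as pd

# Hypothetical sample rows for illustration only
df = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Female'],
    'Education Level': ["Bachelor's", 'PhD', 'High School'],
    'Employment Status': ['Full-Time', 'Unemployed', 'Part-Time'],
})

# Gender: two categories, so a single 0/1 column is enough.
df['Gender'] = df['Gender'].map({'Male': 0, 'Female': 1})

# Education Level: ordinal, so the integers follow the natural order.
education_order = {'High School': 1, "Bachelor's": 2, "Master's": 3, 'PhD': 4}
df['Education Level'] = df['Education Level'].map(education_order)

# Employment Status: nominal, so one 0/1 column per category (one-hot).
df = pd.get_dummies(df, columns=['Employment Status'], dtype=int)

print(df)
```

With `get_dummies`, the new columns are named after the original column and the category, for example `Employment Status_Full-Time`.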

### Ans7)

In [2]:
import pandas as pd
import numpy as np

df=pd.DataFrame({
    'Tempreture':[50,32,43,55,34],
    'Humidity':[50,90,40,55,28],
    'Weather Condition': ['Sunny', 'Cloudy', 'Rainy','Sunny', 'Cloudy'],
    'Wind Direction': ['North', 'South', 'East','South', 'West']
})
df.head()

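|   | Tempreture | Humidity | Weather Condition | Wind Direction |
|---|------------|----------|-------------------|----------------|
| 0 | 50 | 50 | Sunny | North |
| 1 | 32 | 90 | Cloudy | South |
| 2 | 43 | 40 | Rainy | East |
| 3 | 55 | 55 | Sunny | South |
| 4 | 34 | 28 | Cloudy | West |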


In [3]:
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()

# Encode each text column as integer codes (categories are numbered in sorted order)
weather_encoded = encoder.fit_transform(df['Weather Condition'])
wind_encoded = encoder.fit_transform(df['Wind Direction'])

df_encoded = pd.DataFrame({
    'Weather Con Encoded': weather_encoded,
    'Wind Dir Encoded': wind_encoded
})

# Append the encoded columns and drop the original text columns
df_new = pd.concat([df, df_encoded], axis=1)
df_new = df_new.drop(['Weather Condition', 'Wind Direction'], axis=1)

df_new


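|   | Tempreture | Humidity | Weather Con Encoded | Wind Dir Encoded |
|---|------------|----------|---------------------|------------------|
| 0 | 50 | 50 | 2 | 1 |
| 1 | 32 | 90 | 0 | 2 |
| 2 | 43 | 40 | 1 | 0 |
| 3 | 55 | 55 | 2 | 2 |
| 4 | 34 | 28 | 0 | 3 |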


In [9]:
# Covariance between Temperature and Humidity
cov_1 = df_new['Tempreture'].cov(df_new['Humidity'])
# Covariance between the encoded Weather Condition and Wind Direction
cov_2 = df_new['Weather Con Encoded'].cov(df_new['Wind Dir Encoded'])

print(f"Covariance between Temperature & Humidity: {cov_1}")
print(f"Covariance between Weather Condition & Wind Direction: {cov_2}")

Covariance between Temperature & Humidity: -44.85
Covariance between Weather Condition & Wind Direction: -0.5

The negative covariance between temperature and humidity means that, in this sample, humidity tends to be lower when temperature is higher. The covariance between the encoded Weather Condition and Wind Direction should be read with caution: both variables are nominal, and the integers come from the alphabetical order of the category names, so the value would change under a different numbering and does not describe a real relationship.
