In [1]:
# Q1 ## Difference between Ordinal Encoding and Label Encoding

In [None]:
Label Encoding
1. Each category is mapped to an integer, so the column stays a single column.
2. No new columns are added, so it needs little memory and computation.
3. Integers are assigned in sorted (alphabetical) order of the category names, so the numbers carry no meaningful rank.
4. scikit-learn's LabelEncoder takes 1-D input and is intended for the target variable (y), not for input features.

In [2]:
# example 
import pandas as pd 
from sklearn.preprocessing import LabelEncoder

In [3]:
df = pd.DataFrame({
    'color': ['red','blue','green','green','red','blue']
})

In [4]:
df

Unnamed: 0,color
0,red
1,blue
2,green
3,green
4,red
5,blue


In [5]:
encoder=LabelEncoder()

In [6]:
encoder.fit_transform(df['color'])

array([2, 0, 1, 1, 2, 0])
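To see why 'red' became 2 and 'blue' became 0, the fitted encoder can be inspected. The following is a minimal sketch that reuses the same data; the names `codes`, `mapping` and `decoded` are introduced here only for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'color': ['red', 'blue', 'green', 'green', 'red', 'blue']})

encoder = LabelEncoder()
codes = encoder.fit_transform(df['color'])

# classes_ lists the categories in sorted order; the position is the code
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the integer codes
decoded = encoder.inverse_transform(codes)
print(list(decoded))
```

The codes follow alphabetical order of the labels, not any property of the colors themselves.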

In [None]:
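Ordinal Encoding
	1. Each category is mapped to an integer that follows an order we specify (for example small < medium < large).
	2. No new columns are added, unlike one-hot encoding, which splits a categorical column into one 0/1 column per category.
	3. The integers carry rank information, so it suits features whose categories have a natural order.
	4. scikit-learn's OrdinalEncoder takes 2-D input, so it can encode several feature columns at once.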

In [7]:
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# create a sample df with ordinal
df = pd.DataFrame({
    'size': ['small','medium','large','medium','small','large']
})

In [8]:
encoder= OrdinalEncoder(categories=[['small','medium','large']])

In [9]:
encoder.fit_transform(df[['size']])

array([[0.],
       [1.],
       [2.],
       [1.],
       [0.],
       [2.]])
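The `categories` argument is what makes the codes respect the intended order. As a rough comparison (variable names here are illustrative), the sketch below fits one encoder without `categories` and one with it:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({'size': ['small', 'medium', 'large', 'medium', 'small', 'large']})

# Without categories, the encoder sorts the labels alphabetically:
# large=0, medium=1, small=2
default_codes = OrdinalEncoder().fit_transform(df[['size']]).ravel()

# With an explicit order, the codes follow small < medium < large
ordered_encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
ordered_codes = ordered_encoder.fit_transform(df[['size']]).ravel()

print(default_codes)
print(ordered_codes)
```

Only the second result preserves the size ranking, which is the practical difference from plain label encoding.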

In [10]:
# Q2 ##  Target Guided Ordinal Encoding

It is a technique used to encode categorical variables based on their relationship with the target variable. This encoding technique is useful when we have a categorical variable with a large number of unique categories, and we want to use this variable as a feature in our machine learning model.

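In Target Guided Ordinal Encoding, we compute the mean or median of the target variable for each category, then replace each category with a numerical value based on that statistic: either the statistic itself, as in the example below, or the category's rank when the categories are sorted by it. This creates a monotonic relationship between the encoded variable and the target variable, which can improve the predictive power of our model. The statistics should be computed on the training data only; otherwise the encoding leaks the target into the feature.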

In [12]:
# example 
import pandas as pd

# create a sample dataframe with a categorical variable and a target variable
df = pd.DataFrame({
    'city': ['New York', 'London', 'Paris', 'Tokyo', 'New York', 'Paris'],
    'price': [200, 150, 300, 250, 180, 320]
})

In [13]:
df

Unnamed: 0,city,price
0,New York,200
1,London,150
2,Paris,300
3,Tokyo,250
4,New York,180
5,Paris,320


In [14]:
mean_price=df.groupby('city')['price'].mean().to_dict()
mean_price

{'London': 150.0, 'New York': 190.0, 'Paris': 310.0, 'Tokyo': 250.0}

In [15]:
df['city_encoded']=df['city'].map(mean_price)

In [16]:
df

Unnamed: 0,city,price,city_encoded
0,New York,200,190.0
1,London,150,150.0
2,Paris,300,310.0
3,Tokyo,250,250.0
4,New York,180,190.0
5,Paris,320,310.0
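The example above maps each city to its mean price directly. A rank-based variant, sketched below under the assumption that integer ranks are wanted instead of raw means, sorts the cities by mean price and assigns 0, 1, 2, ... in that order. The names `ordered_cities`, `rank_map` and `city_rank` are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['New York', 'London', 'Paris', 'Tokyo', 'New York', 'Paris'],
    'price': [200, 150, 300, 250, 180, 320]
})

# Mean target value per category
mean_price = df.groupby('city')['price'].mean()

# Sort categories by mean target and assign ranks starting at 0
ordered_cities = mean_price.sort_values().index
rank_map = {city: rank for rank, city in enumerate(ordered_cities)}

df['city_rank'] = df['city'].map(rank_map)
print(rank_map)
print(df)
```

The cheapest city on average gets rank 0 and the most expensive gets the highest rank, so the encoded column increases with the target.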


In [17]:
# Q3 # covariance

Covariance is a measure of the relationship between two random variables: it describes to what extent they change together.

Covariance indicates whether two variables tend to move in the same direction (positive covariance) or in opposite directions (negative covariance). Its magnitude depends on the units of the variables, so it is hard to interpret on its own; the sign is the most directly readable part, and correlation is used when a unit-free measure of strength is needed.

For example, a positive covariance between two asset prices means they tend to move in the same general direction; a negative covariance means they tend to move in opposite directions.

In [None]:
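# formula
$$\mathrm{Cov}(X, Y) = \frac{1}{n} \sum_{i=1}^{n} (X_i - \mu)(Y_i - \nu)$$
where µ and ν are the means of X and Y, and both factors use the same index i. Dividing by n gives the population covariance; pandas' df.cov() divides by n - 1 (sample covariance).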
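The covariance formula can be evaluated by hand with NumPy. This is a small sketch on made-up numbers (the arrays `x` and `y` are hypothetical), showing both the divide-by-n version and the divide-by-(n - 1) version that `np.cov` uses by default.

```python
import numpy as np

# Hypothetical data: y is exactly 2 * x, so they move together
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])

# Product of paired deviations from the means
products = (x - x.mean()) * (y - y.mean())

cov_population = products.sum() / len(x)        # divide by n
cov_sample = products.sum() / (len(x) - 1)      # divide by n - 1

print(cov_population)
print(cov_sample)

# np.cov returns a 2x2 matrix; the off-diagonal entry is Cov(x, y)
print(np.cov(x, y)[0, 1])           # sample covariance (default)
print(np.cov(x, y, ddof=0)[0, 1])   # population covariance
```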

In [18]:
# Q 4 
import pandas as pd 
from sklearn.preprocessing import LabelEncoder

In [19]:
df = pd.DataFrame({
    'color': ['red','blue','green','green','red','blue'],
    'size': ['small','medium','large','medium','small','large'],
    'material': ['wood','metal','plastic','metal','plastic','wood']
})

In [20]:
df

Unnamed: 0,color,size,material
0,red,small,wood
1,blue,medium,metal
2,green,large,plastic
3,green,medium,metal
4,red,small,plastic
5,blue,large,wood


In [21]:
encoder= LabelEncoder()

In [23]:
# LabelEncoder expects a 1-D Series, so select the column with single brackets
encoder.fit_transform(df['color'])

array([2, 0, 1, 1, 2, 0])

In [24]:
encoder.fit_transform(df['material'])

array([2, 0, 1, 0, 1, 2])

In [25]:
encoder.fit_transform(df['size'])

array([2, 1, 0, 1, 2, 0])
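Refitting one encoder on each column in turn overwrites its learned classes, so only the last column can be decoded afterwards. One possible way around this, sketched here with illustrative names (`encoders`, `encoded`), is to keep a separate fitted encoder per column:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'green', 'red', 'blue'],
    'size': ['small', 'medium', 'large', 'medium', 'small', 'large'],
    'material': ['wood', 'metal', 'plastic', 'metal', 'plastic', 'wood']
})

encoders = {}                          # one fitted encoder per column
encoded = pd.DataFrame(index=df.index)

for column in df.columns:
    encoders[column] = LabelEncoder()
    encoded[column] = encoders[column].fit_transform(df[column])

print(encoded)

# Each encoder remembers its own mapping, so any column can be decoded
print(list(encoders['size'].classes_))
```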

In [26]:
# Q5 
import pandas as pd 

In [27]:
df = pd.DataFrame({
    'age': [24,26,28,30,32,35],
    'income': ['50k','100k','80k','120k','90k','110k'],
    'education': ['B.E','PHD','Masters','PHD','B.E','Masters']
})

In [28]:
df

Unnamed: 0,age,income,education
0,24,50k,B.E
1,26,100k,PHD
2,28,80k,Masters
3,30,120k,PHD
4,32,90k,B.E
5,35,110k,Masters


In [33]:
# income and education are strings, so restrict the calculation to numeric columns
df.cov(numeric_only=True)

Unnamed: 0,age
age,16.166667
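Only `age` appears in the result because `income` and `education` are stored as strings. To get a full covariance matrix, they first need numeric values. The sketch below assumes that income strings always end in 'k' (thousands) and that education can be ranked B.E < Masters < PHD; both are assumptions made for illustration, as is the name `numeric`.

```python
import pandas as pd

df = pd.DataFrame({
    'age': [24, 26, 28, 30, 32, 35],
    'income': ['50k', '100k', '80k', '120k', '90k', '110k'],
    'education': ['B.E', 'PHD', 'Masters', 'PHD', 'B.E', 'Masters']
})

numeric = pd.DataFrame({
    'age': df['age'],
    # Assumption: '50k' means 50 * 1000
    'income': df['income'].str.rstrip('k').astype(int) * 1000,
    # Assumption: ordinal ranking of education levels
    'education': df['education'].map({'B.E': 0, 'Masters': 1, 'PHD': 2}),
})

cov_matrix = numeric.cov()
print(cov_matrix)
```

Because income is in currency units and education is a rank, the entries have very different scales; the signs are easier to compare than the magnitudes.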


#Q6  You are working on a machine learning project with a dataset containing several categorical 
variables, including "Gender" (Male/Female), "Education Level" (High School/Bachelor's/Master's/PhD), 
and "Employment Status" (Unemployed/Part-Time/Full-Time). Which encoding method would you use for each variable, and why?

In [None]:
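Gender (Male/Female) is nominal: the categories are just names with no order or rank, so we use nominal (one-hot) encoding.
Education Level (High School < Bachelor's < Master's < PhD) has a natural order, so we use ordinal encoding that preserves that order.
Employment Status (Unemployed < Part-Time < Full-Time) can be treated as ordered by working hours, so ordinal encoding also fits; if that order is not meaningful for the model, one-hot encoding is the safer choice.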
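As a rough illustration of applying a different encoder to each variable, the sketch below combines `OneHotEncoder` and `OrdinalEncoder` in a `ColumnTransformer`. The four sample rows are made up, and treating Employment Status as ordered is an assumption rather than a rule.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical sample rows for illustration
df = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Female', 'Male'],
    'Education Level': ["High School", "PhD", "Bachelor's", "Master's"],
    'Employment Status': ['Unemployed', 'Full-Time', 'Part-Time', 'Full-Time'],
})

education_order = ["High School", "Bachelor's", "Master's", "PhD"]
employment_order = ["Unemployed", "Part-Time", "Full-Time"]

transformer = ColumnTransformer(
    transformers=[
        # Nominal: one 0/1 column per gender category
        ('gender', OneHotEncoder(), ['Gender']),
        # Ordinal: integer ranks following the orders given above
        ('ordered',
         OrdinalEncoder(categories=[education_order, employment_order]),
         ['Education Level', 'Employment Status']),
    ],
    sparse_threshold=0,  # always return a dense array
)

encoded = transformer.fit_transform(df)
print(encoded)
```

The output has two one-hot columns for Gender (Female, Male) followed by one rank column each for Education Level and Employment Status.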

In [35]:
# Q7 
import pandas as pd 

In [36]:
df = pd.DataFrame({
    "temp": [25,45,35,34,27,26,29,40],
    "humidity": [30,15,16,20,23,30,14,10]
})

In [37]:
df

Unnamed: 0,temp,humidity
0,25,30
1,45,15
2,35,16
3,34,20
4,27,23
5,26,30
6,29,14
7,40,10


In [38]:
df.cov()

Unnamed: 0,temp,humidity
temp,51.696429,-40.392857
humidity,-40.392857,55.071429
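The negative covariance says temperature and humidity tend to move in opposite directions in this sample, but its size depends on the units. A unit-free view comes from dividing by the two standard deviations, which gives the Pearson correlation. A minimal sketch (the names `corr_manual` and `corr_pandas` are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "temp": [25, 45, 35, 34, 27, 26, 29, 40],
    "humidity": [30, 15, 16, 20, 23, 30, 14, 10]
})

cov_matrix = df.cov()

# Pearson correlation = covariance / (std of temp * std of humidity)
# Both df.cov() and Series.std() use the n - 1 denominator by default
corr_manual = cov_matrix.loc['temp', 'humidity'] / (
    df['temp'].std() * df['humidity'].std()
)

# pandas computes the same quantity directly
corr_pandas = df['temp'].corr(df['humidity'])

print(corr_manual)
print(corr_pandas)
```

The correlation lies between -1 and 1, so it indicates how strong the inverse relationship is, which the raw covariance cannot.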


In [39]:
import pandas as pd 
from sklearn.preprocessing import OneHotEncoder

In [45]:
df = pd.DataFrame({
    'weather_condition': ['sunny','cloudy','rainy'],
    'wind_direction': ['north','south','east']
})

In [46]:
df

Unnamed: 0,weather_condition,wind_direction
0,sunny,north
1,cloudy,south
2,rainy,east


In [47]:
encoder= OneHotEncoder()

In [50]:
encoder.fit_transform(df[['weather_condition']])

<3x3 sparse matrix of type '<class 'numpy.float64'>'
	with 3 stored elements in Compressed Sparse Row format>