# One-hot encoding demo


## Part 1: Home price dataset dummy variables

In [None]:
import pandas as pd
df = pd.read_csv("home price.csv")  # load the dataset
dummies = pd.get_dummies(df.town)  # create one dummy column per town
merged = pd.concat([df, dummies], axis=1)  # append the dummy columns to the original data
final = merged.drop(['town'], axis=1)  # drop the original 'town' column
final = final.drop(['west windsor'], axis=1)  # drop any one dummy column (here 'west windsor') to avoid the dummy variable trap
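
The same steps can be collapsed into a single `get_dummies` call. The following is a minimal sketch, not the notebook's own method: the toy rows are invented stand-ins for `home price.csv`, and `drop_first=True` drops the first category in sorted order rather than 'west windsor'.

```python
import pandas as pd

# Invented toy rows standing in for "home price.csv" (assumption, not real data)
toy = pd.DataFrame({
    "town": ["monroe township", "west windsor", "robinsville", "west windsor"],
    "area": [2600, 3000, 2900, 3200],
    "price": [550000, 565000, 610000, 680000],
})

# columns=["town"] encodes and removes the original column in one step;
# drop_first=True removes one dummy (the first category in sorted order)
encoded = pd.get_dummies(toy, columns=["town"], drop_first=True, dtype=int)
print(encoded.columns.tolist())
```

A row with zeros in every remaining dummy column belongs to the dropped category, so no information is lost.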

## Part 2: Handling Multi-categorical data

In [None]:
import pandas as pd
import numpy as np
df = pd.read_csv('mercedesbenz.csv', usecols=['X1', 'X2'])
df.head()
counts = df['X1'].nunique()  # number of distinct labels in the 'X1' column
top_10_labels = df['X1'].value_counts().head(10).index.tolist()  # the 10 most frequent labels
df['X1'].value_counts().head(10)  # frequencies of the top 10 labels, in descending order
df = pd.get_dummies(df['X1']).sample(10)  # apply one-hot encoding and keep a random sample of 10 rows
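
The cell above computes `top_10_labels` but then encodes every label. One common way to use such a list is to create dummy columns only for the most frequent labels, so that rare labels collapse into an all-zero row. The sketch below assumes that approach; `df_demo`, its values, and the `X1_<label>` column names are hypothetical, and `k = 2` is used instead of 10 to keep the toy example small.

```python
import pandas as pd

# Hypothetical stand-in for the 'X1' column of mercedesbenz.csv
df_demo = pd.DataFrame({"X1": list("aaaabbbccd")})

k = 2  # the notebook uses 10; 2 keeps this toy example readable
# value_counts() sorts by frequency in descending order by default
top_k = df_demo["X1"].value_counts().head(k).index.tolist()

# One binary column per frequent label; all other labels get 0 everywhere
for label in top_k:
    df_demo[f"X1_{label}"] = (df_demo["X1"] == label).astype(int)

print(df_demo.head())
```

This keeps the number of new columns fixed at `k` regardless of how many distinct labels the column contains.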

## Part 3: Full Demo


This code sample preprocesses the categorical feature in the 'home price' dataset, the same example used in Part 1, so the results are easy to relate to. Both one-hot encoding and label encoding are applied. For one-hot encoding, the steps are:
- Dummy variables are created for the `town` column
- The dummy variables are merged with the original dataset
- The original `town` column and one dummy column are dropped; dropping one dummy avoids multicollinearity

Label encoding is applied using scikit-learn's `LabelEncoder`. The steps are:
- A new `town_encoded` column is introduced
- The data is split into dependent and independent variables (for both encoding methods)
- A train/test split is performed (for both one-hot encoding and label encoding)
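
To make the label-encoding step concrete, here is a small sketch of what `LabelEncoder` produces. The town list is an invented example, not the contents of the dataset. The encoder assigns integers in sorted order of the labels, which is worth keeping in mind because a linear model will treat those integers as ordered quantities.

```python
from sklearn.preprocessing import LabelEncoder

# Invented example values (assumption); the real column comes from the CSV
towns = ["monroe township", "west windsor", "robinsville", "west windsor"]

encoder = LabelEncoder()
codes = encoder.fit_transform(towns)

# classes_ holds the labels in sorted order; a label's position is its code
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)
print(codes.tolist())
```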

To compare the two encodings, a linear regression model is trained on each of the one-hot encoded and label-encoded datasets, and the predictive performance of each model on its test set is evaluated using the R-squared score.
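
For reference, the R-squared score is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it by hand on invented numbers (the arrays are assumptions for illustration only) and checks the result against `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Invented targets and predictions, for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 8.0])

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_pred))
```

A score of 1 means perfect predictions, 0 means the model does no better than predicting the mean, and negative values are possible on test data.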

In [None]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Read the dataset
df = pd.read_csv("home price.csv")

# One-Hot Encoding
dummies = pd.get_dummies(df['town'])
merged_one_hot = pd.concat([df, dummies], axis=1)
final_one_hot = merged_one_hot.drop(['town'], axis=1)
final_one_hot = final_one_hot.drop(['robinsville'], axis=1)

# Label Encoding
label_encoder = LabelEncoder()
df['town_encoded'] = label_encoder.fit_transform(df['town'])
final_label_encoded = df.drop(['town'], axis=1)

# Separate features and target for One-Hot Encoding
X_one_hot = final_one_hot.drop('price', axis=1)
y_one_hot = final_one_hot['price']

# Separate features and target for Label Encoding
X_label_encoded = final_label_encoded.drop('price', axis=1)
y_label_encoded = final_label_encoded['price']

# Split the data into train and test sets for One-Hot Encoding
X_train_one_hot, X_test_one_hot, y_train_one_hot, y_test_one_hot = train_test_split(X_one_hot, y_one_hot, test_size=0.2, random_state=42)

# Split the data into train and test sets for Label Encoding
X_train_label_encoded, X_test_label_encoded, y_train_label_encoded, y_test_label_encoded = train_test_split(X_label_encoded, y_label_encoded, test_size=0.2, random_state=42)

# Model using One-Hot Encoded data
model_one_hot = LinearRegression()
model_one_hot.fit(X_train_one_hot, y_train_one_hot)
predictions_one_hot = model_one_hot.predict(X_test_one_hot)
score_one_hot = r2_score(y_test_one_hot, predictions_one_hot)

# Model using Label Encoded data
model_label_encoded = LinearRegression()
model_label_encoded.fit(X_train_label_encoded, y_train_label_encoded)
predictions_label_encoded = model_label_encoded.predict(X_test_label_encoded)
score_label_encoded = r2_score(y_test_label_encoded, predictions_label_encoded)

# Compare results
print("R-squared score for One-Hot Encoded data:", score_one_hot)
print("R-squared score for Label Encoded data:", score_label_encoded)