<h1>Lab 24: Creating Dummy Variables</h1>
<h2>Objective:</h2>
<p>The objective of this lab is to teach students how to create dummy variables for categorical data. Students will learn about one-hot encoding and other methods for transforming categorical variables to prepare data for machine learning.</p>
<h2>Expected Outcome:</h2>
<p>By the end of this lab, students will be able to:</p>
<ul>
<li>Understand the purpose of dummy variables in data preparation.</li>
<li>Apply one-hot encoding to categorical variables.</li>
<li>Identify when and why to drop one dummy variable to avoid multicollinearity.</li>
</ul>

<h3>Step 1: Import Required Libraries</h3>

In [1]:
# Import pandas for data manipulation and seaborn for dataset loading
import pandas as pd
import seaborn as sns


<h3>Step 2: Load Dataset with Categorical Data</h3>


In [2]:
# Load the Titanic dataset from seaborn's built-in datasets
df = sns.load_dataset('titanic')

# Display the first five rows of the dataset to get an overview
df.head()


,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


<p><strong>Note:</strong> The Titanic dataset contains data about passengers, including both numerical and categorical variables.</p>


In [3]:
# Check for missing values in the dataset
print("\nMissing values in each column:")
print(df.isnull().sum())


Missing values in each column:
survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64


<h3>Step 3: Identify Categorical Variables</h3>
<h4>Concept:</h4>
<p>Categorical variables can be transformed into dummy variables to allow machine learning algorithms to process them.</p>


In [4]:
# Display data types to identify categorical columns
print("Data Types in the Titanic Dataset:")
print(df.dtypes)

Data Types in the Titanic Dataset:
survived          int64
pclass            int64
sex              object
age             float64
sibsp             int64
parch             int64
fare            float64
embarked         object
class          category
who              object
adult_male         bool
deck           category
embark_town      object
alive            object
alone              bool
dtype: object


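<p>From the output, the columns with data type <code>object</code> (<code>sex</code>, <code>embarked</code>, <code>who</code>, <code>embark_town</code>, <code>alive</code>) hold string categories. The <code>class</code> and <code>deck</code> columns use the <code>category</code> dtype and are categorical as well, although the next cell selects only the <code>object</code> columns.</p>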
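<p>To pick up string and <code>category</code> columns in one pass, one option is to exclude the numeric and boolean dtypes instead of including <code>object</code>. The sketch below uses a small made-up frame (the name <code>toy</code> and its values are assumptions for illustration, not the Titanic data).</p>

```python
import pandas as pd

# Tiny made-up frame that mimics the dtype mix seen in the Titanic data
toy = pd.DataFrame({
    "pclass": [3, 1, 3],
    "sex": ["male", "female", "female"],
    "deck": pd.Categorical([None, "C", None]),
    "alone": [False, False, True],
})

# Exclude numeric and boolean columns; string and category columns remain
categorical_cols = toy.select_dtypes(exclude=["number", "bool"]).columns.tolist()
print(categorical_cols)
```

<p>Whether boolean columns such as <code>alone</code> should count as categorical is a modeling choice; they are already binary, so they usually need no further encoding.</p>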


In [6]:
# Select columns with data type 'object' to get categorical columns
categorical_cols = df.select_dtypes(include=['object']).columns

# Display the list of categorical columns
print("\nCategorical Columns:", list(categorical_cols))

# Display the unique values in each categorical column
for col in categorical_cols:
    print(f"\nUnique values in '{col}': {df[col].unique()}")



Categorical Columns: ['sex', 'embarked', 'who', 'embark_town', 'alive']

Unique values in 'sex': ['male' 'female']

Unique values in 'embarked': ['S' 'C' 'Q' nan]

Unique values in 'who': ['man' 'woman' 'child']

Unique values in 'embark_town': ['Southampton' 'Cherbourg' 'Queenstown' nan]

Unique values in 'alive': ['no' 'yes']


<h3>Step 4: Creating Dummy Variables Using One-Hot Encoding</h3>
<h4>Concept:</h4>
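<p>One-hot encoding transforms a categorical variable into a series of binary columns (dummy variables). Each category becomes a separate column whose value indicates whether the row belongs to that category. Recent pandas versions return these columns as booleans (<code>True</code>/<code>False</code>), as in the output below; pass <code>dtype=int</code> to <code>get_dummies()</code> to get 0/1 instead.</p>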
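<p>As a minimal sketch of the idea, the made-up frame below (names and values are assumptions for illustration) is encoded with <code>dtype=int</code> so the dummies appear as 0/1. It also shows how a missing value is treated: by default the row gets 0 in every dummy column, while <code>dummy_na=True</code> adds an extra indicator column for missing values.</p>

```python
import numpy as np
import pandas as pd

# Made-up data with one missing category value in the last row
toy = pd.DataFrame({
    "fare": [7.25, 71.28, 8.46, 80.0],
    "embarked": ["S", "C", "Q", np.nan],
})

# One column per category, stored as 0/1 integers
encoded = pd.get_dummies(toy, columns=["embarked"], dtype=int)

# Same encoding plus an indicator column for missing values
encoded_with_na = pd.get_dummies(toy, columns=["embarked"], dtype=int, dummy_na=True)

print(encoded)
print(encoded_with_na)
```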


In [8]:
# Apply one-hot encoding to the 'sex' and 'embarked' columns
# Using pandas' get_dummies() function for convenience
df_dummies = pd.get_dummies(df, columns=['sex', 'embarked'], prefix=['sex', 'embarked'])

# Display the dataset with dummy variables
print("\nData with Dummy Variables for 'sex' and 'embarked' Columns:")
df_dummies.head()


Data with Dummy Variables for 'sex' and 'embarked' Columns:


,survived,pclass,age,sibsp,parch,fare,class,who,adult_male,deck,embark_town,alive,alone,sex_female,sex_male,embarked_C,embarked_Q,embarked_S
0,0,3,22.0,1,0,7.25,Third,man,True,,Southampton,no,False,False,True,False,False,True
1,1,1,38.0,1,0,71.2833,First,woman,False,C,Cherbourg,yes,False,True,False,True,False,False
2,1,3,26.0,0,0,7.925,Third,woman,False,,Southampton,yes,True,True,False,False,False,True
3,1,1,35.0,1,0,53.1,First,woman,False,C,Southampton,yes,False,True,False,False,False,True
4,0,3,35.0,0,0,8.05,Third,man,True,,Southampton,no,True,False,True,False,False,True


<h3>Step 5: Dropping One Dummy Variable to Avoid Multicollinearity</h3>
<h4>Concept:</h4>
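<p>Dropping one dummy variable per categorical feature avoids the "dummy variable trap". When all dummy columns for a feature are kept, they always sum to 1, so together they duplicate the intercept term and are perfectly collinear with it. This perfect multicollinearity makes the coefficients of an unregularized linear model, such as ordinary linear regression, non-unique and unstable. The dropped category becomes the baseline against which the remaining coefficients are interpreted.</p>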
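<p>The collinearity can be checked numerically. The sketch below builds a design matrix with an intercept column from made-up data (all names are assumptions for illustration) and compares its rank with the number of columns: a rank lower than the column count means at least one column is a linear combination of the others.</p>

```python
import numpy as np
import pandas as pd

# Made-up categorical feature with three levels
embarked = pd.Series(["S", "C", "Q", "S", "C", "Q"], name="embarked")

all_dummies = pd.get_dummies(embarked, dtype=float)
dropped = pd.get_dummies(embarked, dtype=float, drop_first=True)

# Add an intercept column of ones, as a linear model with an intercept would
intercept = np.ones((len(embarked), 1))
X_full = np.hstack([intercept, all_dummies.to_numpy()])
X_dropped = np.hstack([intercept, dropped.to_numpy()])

rank_full = np.linalg.matrix_rank(X_full)
rank_dropped = np.linalg.matrix_rank(X_dropped)

print(f"All dummies: {X_full.shape[1]} columns, rank {rank_full}")
print(f"One dropped: {X_dropped.shape[1]} columns, rank {rank_dropped}")
```

<p>With all dummies the matrix has four columns but rank three; after dropping one dummy the three remaining columns are linearly independent.</p>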


In [10]:
# Apply one-hot encoding with drop_first=True to drop one column for each categorical feature
df_dummies_dropped = pd.get_dummies(df, columns=['sex', 'embarked'], prefix=['sex', 'embarked'], drop_first=True)

# Display the dataset with dropped dummy variables
print("\nData with Dummy Variables (One Column Dropped for Each Categorical Feature):")
df_dummies_dropped.head()


Data with Dummy Variables (One Column Dropped for Each Categorical Feature):


,survived,pclass,age,sibsp,parch,fare,class,who,adult_male,deck,embark_town,alive,alone,sex_male,embarked_Q,embarked_S
0,0,3,22.0,1,0,7.25,Third,man,True,,Southampton,no,False,True,False,True
1,1,1,38.0,1,0,71.2833,First,woman,False,C,Cherbourg,yes,False,False,False,False
2,1,3,26.0,0,0,7.925,Third,woman,False,,Southampton,yes,True,False,False,True
3,1,1,35.0,1,0,53.1,First,woman,False,C,Southampton,yes,False,False,False,True
4,0,3,35.0,0,0,8.05,Third,man,True,,Southampton,no,True,True,False,True


<h3>Step 6: Using One-Hot Encoding with scikit-learn</h3>
<h4>Concept:</h4>
<p>We can also use scikit-learn's <code>OneHotEncoder</code>, which provides more flexibility, especially for pipeline integration.</p>


In [13]:
# Import OneHotEncoder from scikit-learn
from sklearn.preprocessing import OneHotEncoder

# Initialize OneHotEncoder with dense output and drop='first' to drop one category.
# scikit-learn >= 1.2 uses sparse_output=False; older versions use sparse=False instead.
onehot_encoder = OneHotEncoder(sparse_output=False, drop='first')

# Apply OneHotEncoder to the 'class' column as an example
class_encoded = onehot_encoder.fit_transform(df[['class']])

# Convert the result to a DataFrame for clarity, keeping the original row index
class_encoded_df = pd.DataFrame(
    class_encoded,
    columns=onehot_encoder.get_feature_names_out(['class']),
    index=df.index,
)

# Combine the encoded columns with the original DataFrame
df = pd.concat([df, class_encoded_df], axis=1)

# Display the updated DataFrame with one-hot encoded 'class' column
print("\nData with One-Hot Encoded 'class' Column Using scikit-learn:")
df.head()


Data with One-Hot Encoded 'class' Column Using scikit-learn:




,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone,class_Second,class_Third
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False,0.0,1.0
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False,0.0,0.0
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True,0.0,1.0
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False,0.0,0.0
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True,0.0,1.0


<h3>Step 7: Discussion on Dummy Variable Trap and One-Hot Encoding</h3>
<h4>Discussion:</h4>
<ol>
<li>
<p><strong>Why might we drop one dummy variable for each categorical feature? What issue does this prevent?</strong></p>
<p>Dropping one dummy variable for each categorical feature prevents perfect multicollinearity. Multicollinearity occurs when predictor variables are highly correlated; with dummy variables the relationship is exact, because the categories are mutually exclusive and collectively exhaustive, so any one dummy column can be computed from the others.</p>
</li>
<li>
<p><strong>In what situations could the inclusion of all dummy variables be beneficial despite the risk of multicollinearity?</strong></p>
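<p>In non-linear models, such as tree-based models (e.g., decision trees, random forests), multicollinearity is less of a concern because each split considers one feature at a time. Keeping all dummy variables lets the model split directly on any category, including the one that would otherwise become the implicit baseline, and treats all categories symmetrically. The same can apply to regularized linear models, where the penalty term keeps the solution well defined.</p>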
</li>
<li>
<p><strong>How could one-hot encoding potentially increase the complexity of a machine learning model?</strong></p>
<p>One-hot encoding can significantly increase the dimensionality of the dataset, especially when dealing with categorical variables with many categories. This can lead to the "curse of dimensionality," increasing computational complexity and potentially leading to overfitting.</p>
</li>
</ol>
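<p>The growth in dimensionality is easy to estimate before encoding: each categorical column contributes one new column per distinct level (or one fewer with <code>drop_first=True</code>). The sketch below uses made-up data with a hypothetical high-cardinality <code>ticket_id</code> column to illustrate the effect.</p>

```python
import pandas as pd

# Made-up data: a two-level column and a column with 100 distinct values
toy = pd.DataFrame({
    "sex": ["male", "female"] * 50,
    "ticket_id": [f"T{i:03d}" for i in range(100)],
})

# Number of distinct levels per column predicts the number of dummy columns
levels = toy.nunique()
encoded = pd.get_dummies(toy)
encoded_dropped = pd.get_dummies(toy, drop_first=True)

print(levels)
print("Columns after encoding:", encoded.shape[1])
print("Columns with drop_first=True:", encoded_dropped.shape[1])
```

<p>Two input columns turn into 102 dummy columns for only 100 rows, which is the kind of expansion that raises the risk of overfitting.</p>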

<h3>Step 8: Practice Task</h3>
<h4>Practice:</h4>
<ul>
<li>Choose another categorical column (e.g., <code>'who'</code> or <code>'deck'</code>) and apply one-hot encoding with <code>drop_first=True</code>.</li>
<li>Experiment with creating dummy variables for multiple columns at once and observe the changes in the dataset.</li>
<li>Reflect on how creating dummy variables might affect the training of a machine learning model.</li>
</ul>


In [14]:
# Apply one-hot encoding to the 'who' and 'deck' columns with drop_first=True
df_encoded = pd.get_dummies(df, columns=['who','deck'], drop_first=True)

# Display the updated DataFrame with new dummy variables
print("\nData with Dummy Variables for 'who' and 'deck' Columns:")
df_encoded.head()


Data with Dummy Variables for 'who' and 'deck' Columns:


,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,adult_male,...,class_Second,class_Third,who_man,who_woman,deck_B,deck_C,deck_D,deck_E,deck_F,deck_G
0,0,3,male,22.0,1,0,7.25,S,Third,True,...,0.0,1.0,True,False,False,False,False,False,False,False
1,1,1,female,38.0,1,0,71.2833,C,First,False,...,0.0,0.0,False,True,False,True,False,False,False,False
2,1,3,female,26.0,0,0,7.925,S,Third,False,...,0.0,1.0,False,True,False,False,False,False,False,False
3,1,1,female,35.0,1,0,53.1,S,First,False,...,0.0,0.0,False,True,False,True,False,False,False,False
4,0,3,male,35.0,0,0,8.05,S,Third,True,...,0.0,1.0,True,False,False,False,False,False,False,False


<h3>Optional: Encoding for Machine Learning Models</h3>
<h4>Training a Logistic Regression Model</h4>


In [16]:
# Import necessary libraries
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# For simplicity, we'll predict 'survived' using some features
# 'sex' and 'embarked' are still text columns in df_encoded, so encode them first
df_model = pd.get_dummies(df_encoded, columns=['sex', 'embarked'], drop_first=True)

# Drop rows with missing values in the columns we use, to simplify the model training
df_model = df_model.dropna(subset=['age', 'fare', 'survived'])

# Define feature columns and target variable
feature_cols = ['age', 'fare'] + [
    col for col in df_model.columns
    if col.startswith(('sex_', 'embarked_', 'who_'))
]
x = df_model[feature_cols]
y = df_model['survived']

# Split the dataset into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# Initialize and train the Logistic Regression model
model = LogisticRegression(max_iter=1000)
model.fit(x_train, y_train)

# Predict on the test set
y_pred = model.predict(x_test)

# Calculate and print the accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"\nModel Accuracy: {accuracy:.2f}")

<h2>Lab Explanation</h2>
<p>This lab introduces the concept of dummy variables and their importance in encoding categorical data for machine learning. Using the Titanic dataset, we explored multiple methods for one-hot encoding categorical variables, including both pandas and scikit-learn approaches.</p>
<p>By applying one-hot encoding, we transformed categorical variables into a numerical format suitable for machine learning algorithms. We discussed the potential issue of multicollinearity when including all dummy variables and demonstrated how dropping one dummy variable can prevent this.</p>
<p>The lab also provided an opportunity to practice encoding additional categorical variables and understand the impact on model training by building a simple logistic regression model.</p>

---

# Submission
Submit all files to myConnexion.