<h2>Lab 23: Transforming Categorical Data to Numerical Data</h2>
<h3>Objective</h3>
<p>The objective of this lab is to introduce techniques for transforming categorical data into numerical data. You will learn how to apply various encoding methods, such as label encoding and one-hot encoding.</p>
<h3>Expected Outcomes</h3>
<p>By the end of this lab, you will be able to:</p>
<ul>
<li>Identify categorical variables in a dataset.</li>
<li>Apply label encoding and one-hot encoding to categorical data.</li>
<li>Understand when to use each encoding technique.</li>
</ul>

<h3>Table of Contents</h3>
<ol>
<li><a href="#step1" rel="noopener">Import Required Libraries</a></li>
<li><a href="#step2" rel="noopener">Load Dataset with Categorical Data</a></li>
<li><a href="#step3" rel="noopener">Identify Categorical Variables</a></li>
<li><a href="#step4" rel="noopener">Applying Label Encoding</a></li>
<li><a href="#step5" rel="noopener">Applying One-Hot Encoding</a></li>
<li><a href="#step6" rel="noopener">One-Hot Encoding with scikit-learn's OneHotEncoder</a></li>
<li><a href="#step7" rel="noopener">Discussion on Choosing Encoding Techniques</a></li>
<li><a href="#step8" rel="noopener">Practice Task</a></li>
</ol>

<h2>Step 1: Import Required Libraries</h2>
<p>In this step, we will import the necessary libraries for data manipulation and encoding.</p>

In [30]:
# Import pandas for data manipulation
import pandas as pd

# Import LabelEncoder and OneHotEncoder for encoding categorical variables
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Import numpy for numerical operations
import numpy as np

# Import seaborn for accessing built-in datasets
import seaborn as sns


<h2 id="Step-2:-Load-Dataset-with-Categorical-Data">Step 2: Load Dataset with Categorical Data</h2>
<p>We will use the <strong>Titanic</strong> dataset as it contains both categorical and numerical variables.</p>
<p><strong>Load the dataset from seaborn's built-in datasets:</strong></p>
<p><strong>Display the first few rows of the dataset:</strong></p>

![image.png](attachment:image.png)

In [31]:
# Load the Titanic dataset using seaborn's load_dataset function
df = sns.load_dataset('titanic')

# Display the first five rows of the dataset to understand its structure
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


<h2>Step 3: Identify Categorical Variables</h2>
<h3>Concept</h3>
<p>Categorical variables can be <strong>nominal</strong> (no inherent order) or <strong>ordinal</strong> (with order). It's essential to identify which variables are categorical to apply appropriate encoding.</p>
<h3>Step 3.1: Check the Data Types</h3>
<p>We can check the data types of the columns to identify categorical variables.</p>

![image.png](attachment:image.png)

In [32]:
# Print a message to indicate the start of data type inspection
print("Data Types in the Titanic Dataset:")

# Display the data types of each column in the dataset
print(df.dtypes)


Data Types in the Titanic Dataset:
survived          int64
pclass            int64
sex              object
age             float64
sibsp             int64
parch             int64
fare            float64
embarked         object
class          category
who              object
adult_male         bool
deck           category
embark_town      object
alive            object
alone              bool
dtype: object


<h3>Step 3.2: Select Only Categorical Columns</h3>
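<p>We can select the columns with the data type 'object', which pandas uses for string columns and which usually indicates categorical variables. Note that this filter does not pick up 'class' and 'deck', which are stored with the 'category' dtype, or integer-coded categories such as 'pclass'.</p>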

![image.png](attachment:image.png)

In [33]:
categorical_cols = df.select_dtypes(include=['object']).columns
print("\nCategorical Columns:", list(categorical_cols))


Categorical Columns: ['sex', 'embarked', 'who', 'embark_town', 'alive']


<p>Alternatively, to display the categorical columns in a more readable format:</p>

![image.png](attachment:image.png)

In [34]:
# Print each categorical column individually for better readability
print("\nCategorical Columns:")
for col in categorical_cols:
    print(f"- {col}")


Categorical Columns:
- sex
- embarked
- who
- embark_town
- alive
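
<p>The 'object'-only filter above skips columns that pandas stores with the 'category' dtype. The following is a minimal sketch of selecting both dtypes at once. It uses a small hand-made DataFrame (<code>toy_df</code>, a hypothetical stand-in for a few Titanic columns) so that it runs without downloading the dataset.</p>

```python
import pandas as pd

# Hypothetical toy data standing in for a few Titanic columns
toy_df = pd.DataFrame({
    "sex": pd.Series(["male", "female", "female"], dtype="object"),
    "class": pd.Categorical(["Third", "First", "Third"]),
    "age": [22.0, 38.0, 26.0],
    "alone": [False, False, True],
})

# Select columns stored as either 'object' or 'category'
toy_categorical_cols = (
    toy_df.select_dtypes(include=["object", "category"]).columns.tolist()
)
print(toy_categorical_cols)
```

<p>Integer-coded categories such as 'pclass' still need to be spotted by inspection, since their dtype is numeric.</p>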


<h2>Step 4: Applying Label Encoding</h2>
<h3>Concept</h3>
<p>Label Encoding assigns each unique category a numerical value. This method is simple but can introduce an ordinal relationship where none exists.</p>
<h3>Step 4.1: Apply Label Encoding to the 'sex' Column</h3>
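<p>We will use <code>LabelEncoder</code> from scikit-learn to encode the 'sex' column. Note that scikit-learn documents <code>LabelEncoder</code> as a tool for target labels; for input features in a modeling pipeline, <code>OrdinalEncoder</code> is the documented counterpart.</p>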

![image.png](attachment:image.png)

In [35]:
# Initialize the LabelEncoder
label_encoder = LabelEncoder()

# Fit and transform the 'sex' column, then create a new encoded column 'sex_encoded'
df['sex_encoded'] = label_encoder.fit_transform(df['sex'])


<p>Display the original and encoded columns for comparison:</p>

![image.png](attachment:image.png)

In [36]:
# Print the first few entries of the original 'sex' column
print("\nOriginal 'sex' Column:")
print(df['sex'].head())
# Print the first few entries of the encoded 'sex_encoded' column
print("\nLabel Encoded 'sex_encoded' Column:")
print(df['sex_encoded'].head())


Original 'sex' Column:
0      male
1    female
2    female
3    female
4      male
Name: sex, dtype: object

Label Encoded 'sex_encoded' Column:
0    1
1    0
2    0
3    0
4    1
Name: sex_encoded, dtype: int32


<p>Display the mapping from labels to numbers:</p>

![image.png](attachment:image.png)

In [37]:
# Create a dictionary mapping of original labels to their encoded numerical values
label_mapping = dict(zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_)))

# Print the label mapping for the 'sex' column
print("\nLabel Mapping for 'sex' Column:")
print(label_mapping)


Label Mapping for 'sex' Column:
{'female': 0, 'male': 1}
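
<p>The mapping above follows from how <code>LabelEncoder</code> works: it sorts the unique values and numbers them in that order. A small sketch on hypothetical values (not taken from the dataset) shows the sorted classes and how <code>inverse_transform</code> recovers the original labels:</p>

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical example values
sexes = ["male", "female", "female", "male"]

encoder = LabelEncoder()
codes = encoder.fit_transform(sexes)

# classes_ holds the unique labels in sorted order; a label's position is its code
print(list(encoder.classes_))
print(codes.tolist())

# inverse_transform maps the integer codes back to the original labels
decoded = encoder.inverse_transform(codes)
print(decoded.tolist())
```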


<h3>Step 4.2: Encode Another Column with More Categories (e.g., 'embarked')</h3>
<p>The 'embarked' column has more categories and missing values. We'll handle missing data by filling with 'Unknown'.</p>

![image.png](attachment:image.png)

In [38]:
# Fill missing values in 'embarked' with 'Unknown' to handle missing data
df['embarked_filled'] = df['embarked'].fillna('Unknown')

# Fit and transform the filled 'embarked' column, then create a new encoded column 'embarked_encoded'
# Note: refitting the same encoder replaces its stored 'sex' mapping with the 'embarked' one
df['embarked_encoded'] = label_encoder.fit_transform(df['embarked_filled'])


<p>Display the original and encoded columns:</p>

![image.png](attachment:image.png)

In [39]:
# Print the first few entries of the original 'embarked' column
print("\nOriginal 'embarked' Column:")
print(df['embarked'].head())
# Print the first few entries of the filled 'embarked_filled' column
print("\nFilled 'embarked_filled' Column:")
print(df['embarked_filled'].head())
# Print the first few entries of the encoded 'embarked_encoded' column
print("\nLabel Encoded 'embarked_encoded' Column:")
print(df['embarked_encoded'].head())


Original 'embarked' Column:
0    S
1    C
2    S
3    S
4    S
Name: embarked, dtype: object

Filled 'embarked_filled' Column:
0    S
1    C
2    S
3    S
4    S
Name: embarked_filled, dtype: object

Label Encoded 'embarked_encoded' Column:
0    2
1    0
2    2
3    2
4    2
Name: embarked_encoded, dtype: int32


<p>Display the mapping for 'embarked' column:</p>

![image.png](attachment:image.png)

In [40]:
# Create a dictionary mapping of original labels to their encoded numerical values for 'embarked'
label_mapping = dict(zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_)))

# Print the label mapping for the 'embarked' column
print("\nLabel Mapping for 'embarked' Column:")
print(label_mapping)


Label Mapping for 'embarked' Column:
{'C': 0, 'Q': 1, 'S': 2, 'Unknown': 3}
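
<p>Filling with a placeholder such as 'Unknown' is one way to handle missing categories; another common choice is to fill with the most frequent value. The sketch below compares the two on a hypothetical Series (<code>ports</code>), not on the Titanic data itself:</p>

```python
import numpy as np
import pandas as pd

# Hypothetical port codes with one missing entry
ports = pd.Series(["S", "C", np.nan, "S", "Q"], dtype="object")

# Count the missing values before deciding how to fill them
n_missing = int(ports.isna().sum())
print(n_missing)

# Option 1: explicit placeholder category, as in the lab
filled_placeholder = ports.fillna("Unknown")

# Option 2: replace missing values with the most frequent category
most_frequent = ports.mode()[0]
filled_mode = ports.fillna(most_frequent)

print(filled_placeholder.tolist())
print(filled_mode.tolist())
```

<p>The placeholder keeps "missing" visible as its own category, while the most-frequent fill avoids adding a category at the cost of hiding which rows were originally empty.</p>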


<h2>Step 5: Applying One-Hot Encoding</h2>
<h3>Concept</h3>
<p>One-Hot Encoding creates binary columns for each category, avoiding ordinal relationships. It is particularly useful for nominal variables without a natural ordering.</p>
<h3>Step 5.1: Apply One-Hot Encoding to the 'embarked' Column</h3>
<p>We can use pandas' <code>get_dummies()</code> function for convenience.</p>

![image.png](attachment:image.png)

In [41]:
# Apply One-Hot Encoding to the 'embarked' column using get_dummies
# 'prefix' adds a prefix to the new columns, and 'drop_first=True' avoids multicollinearity by dropping the first category
# Rows with a missing 'embarked' value get False in every dummy column
df_onehot = pd.get_dummies(df, columns=['embarked'], prefix='embarked', drop_first=True)


<p>Display the updated DataFrame with one-hot encoded columns:</p>

![image.png](attachment:image.png)

In [42]:
# Display the first few rows of the DataFrame with one-hot encoded 'embarked' columns
df_onehot.head()


Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,class,who,adult_male,deck,embark_town,alive,alone,sex_encoded,embarked_filled,embarked_encoded,embarked_Q,embarked_S
0,0,3,male,22.0,1,0,7.25,Third,man,True,,Southampton,no,False,1,S,2,False,True
1,1,1,female,38.0,1,0,71.2833,First,woman,False,C,Cherbourg,yes,False,0,C,0,False,False
2,1,3,female,26.0,0,0,7.925,Third,woman,False,,Southampton,yes,True,0,S,2,False,True
3,1,1,female,35.0,1,0,53.1,First,woman,False,C,Southampton,yes,False,0,S,2,False,True
4,0,3,male,35.0,0,0,8.05,Third,man,True,,Southampton,no,True,1,S,2,False,True


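<p><strong>Explanation:</strong> By dropping one of the dummy variables (using <code>drop_first=True</code>), we avoid the dummy variable trap. When every category has its own column, the dummy columns always sum to 1, so together with an intercept they are perfectly collinear (multicollinearity), which causes problems in models such as unregularized linear regression. Here the dropped category is 'C', so a row with both 'embarked_Q' and 'embarked_S' set to False represents Cherbourg (or, in this DataFrame, a missing port).</p>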
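
<p>The dummy variable trap can be checked numerically. The sketch below, on a hypothetical five-row Series, builds the feature matrix with an intercept column and compares its column count with its rank, with and without <code>drop_first=True</code>:</p>

```python
import numpy as np
import pandas as pd

# Hypothetical port codes covering all three categories
ports = pd.Series(["S", "C", "Q", "S", "C"], name="embarked")

full = pd.get_dummies(ports, prefix="embarked").astype(float)
reduced = pd.get_dummies(ports, prefix="embarked", drop_first=True).astype(float)

# Add an intercept column of ones, as a linear model would
intercept = np.ones((len(ports), 1))
X_full = np.hstack([intercept, full.to_numpy()])
X_reduced = np.hstack([intercept, reduced.to_numpy()])

# With all dummies, the matrix has more columns than its rank (redundant column)
print(X_full.shape[1], np.linalg.matrix_rank(X_full))

# After dropping one dummy, the column count equals the rank (full column rank)
print(X_reduced.shape[1], np.linalg.matrix_rank(X_reduced))
```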

<h2>Step 6: One-Hot Encoding with scikit-learn's OneHotEncoder</h2>
<p>Using scikit-learn's <code>OneHotEncoder</code> provides more flexibility and is useful for pipelines.</p>
<h3>Apply One-Hot Encoding to the 'class' Column</h3>
<p><code>OneHotEncoder</code> expects 2D input, so a single column must be passed as a one-column DataFrame (<code>df[['class']]</code>) rather than as a Series.</p>

![image.png](attachment:image.png)

In [43]:
# Initialize the OneHotEncoder with parameters:
# 'sparse_output=False' returns a dense array instead of a sparse matrix
#   (this parameter is named 'sparse' in scikit-learn versions before 1.2)
# drop='first' drops the first category to avoid multicollinearity
onehot_encoder = OneHotEncoder(sparse_output=False, drop='first')

# Fit and transform the 'class' column; the double brackets select it as a 2D DataFrame
class_onehot = onehot_encoder.fit_transform(df[['class']])




<p>Convert the result to a DataFrame for clarity:</p>

![image.png](attachment:image.png)

In [44]:
# Create a DataFrame from the one-hot encoded array with appropriate column names
class_onehot_df = pd.DataFrame(class_onehot, columns=onehot_encoder.get_feature_names_out(['class']))


<p>Concatenate the new columns to the original DataFrame:</p>

![image.png](attachment:image.png)

In [45]:
# Concatenate the one-hot encoded 'class' columns to the original DataFrame
# The result goes into a new variable; assigning it to 'pd' would overwrite the pandas module
df_class = pd.concat([df, class_onehot_df], axis=1)


<p>Display the updated DataFrame with one-hot encoded 'class' column:</p>

![image.png](attachment:image.png)

In [46]:
# Display the first few rows of the updated DataFrame with one-hot encoded 'class' columns
df_class.head()


Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone,sex_encoded,embarked_filled,embarked_encoded,class_Second,class_Third
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False,1,S,2,0.0,1.0
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False,0,C,0,0.0,0.0
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True,0,S,2,0.0,1.0
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False,0,S,2,0.0,0.0
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True,1,S,2,0.0,1.0


<p><strong>Explanation of Parameters:</strong></p>
<ul>
<li><code>sparse_output=False</code> (named <code>sparse</code> in scikit-learn versions before 1.2) ensures that the output is a dense array (not a sparse matrix).</li>
<li><code>drop='first'</code> drops the first category to avoid multicollinearity.</li>
</ul>

<h2>Step 7: Discussion on Choosing Encoding Techniques</h2>
<h3>Discussion Questions</h3>
<ol>
<li>
<p><strong>When would you prefer to use Label Encoding over One-Hot Encoding?</strong></p>
<ul>
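<li><strong>Answer:</strong> Label Encoding is preferred when the categorical variable is ordinal, meaning the categories have a natural order (e.g., 'low', 'medium', 'high'), provided the assigned integers follow that order. <code>LabelEncoder</code> numbers categories in sorted (alphabetical) order, which would give 'high' = 0, 'low' = 1, 'medium' = 2, so an ordinal variable needs an explicit mapping. Label Encoding also keeps a single column, whereas One-Hot Encoding adds one column per category, which matters for models sensitive to the number of features.</li>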
</ul>
</li>
<li>
<p><strong>What potential issues could arise when using Label Encoding on nominal data?</strong></p>
<ul>
<li><strong>Answer:</strong> Label Encoding introduces an implicit ordinal relationship between categories, which may mislead the model into treating higher encoded values as larger quantities (for example, 'S' = 2 as "twice" 'Q' = 1). This can negatively affect the model's performance on nominal data where no such order exists, especially for linear and distance-based models.</li>
</ul>
</li>
<li>
<p><strong>Why do we drop one of the one-hot encoded columns (e.g., 'embarked_C') when using One-Hot Encoding?</strong></p>
<ul>
<li><strong>Answer:</strong> Dropping one of the dummy variables helps to avoid multicollinearity, known as the dummy variable trap. Since one category can be inferred from the others, dropping one prevents redundancy and ensures that the feature matrix is full rank, which is important for certain algorithms like linear regression.</li>
</ul>
</li>
</ol>
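
<p>For ordinal variables, the order of the integer codes matters. One way to control it is scikit-learn's <code>OrdinalEncoder</code> with an explicit <code>categories</code> list. The sketch below uses a hypothetical 'level' column, since the lab's dataset is not needed to illustrate the idea:</p>

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordinal feature
levels = pd.DataFrame({"level": ["medium", "low", "high", "low"]}, dtype="object")

# Pass the intended order explicitly; by default categories are sorted alphabetically
ordinal_encoder = OrdinalEncoder(categories=[["low", "medium", "high"]])
level_codes = ordinal_encoder.fit_transform(levels).ravel()

# low -> 0.0, medium -> 1.0, high -> 2.0
print(level_codes.tolist())
```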

<h2>Step 8: Practice Task</h2>
<h3>Practice</h3>
<ul>
<li><strong>Task:</strong> Choose another categorical column (e.g., 'who') and apply both Label Encoding and One-Hot Encoding.</li>
<li><strong>Instructions:</strong>
<ul>
<li>Encode the 'who' column using Label Encoding and add it to the DataFrame.</li>
<li>Apply One-Hot Encoding to the 'who' column and add the new columns to the DataFrame.</li>
<li>Compare the results and note any differences in the structure of the dataset.</li>
<li>Experiment with converting numerical columns (e.g., 'pclass') to categorical format and apply encoding.</li>
</ul>
</li>
</ul>
<h3>Practice Solution Example (Optional)</h3>
<h4>Label Encoding 'who' Column</h4>

![image.png](attachment:image.png)

In [47]:
# Initialize the LabelEncoder
label_encoder = LabelEncoder()

# Fit and transform the 'who' column, then create a new encoded column 'who_encoded'
df['who_encoded'] = label_encoder.fit_transform(df['who'])

# Create a dictionary mapping of original labels to their encoded numerical values
label_mapping = dict(zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_)))

# Print the label mapping for the 'who' column
print("\nLabel Mapping for 'who' Column:")
print(label_mapping)


Label Mapping for 'who' Column:
{'child': 0, 'man': 1, 'woman': 2}


<h4>One-Hot Encoding 'who' Column</h4>

![image.png](attachment:image.png)

In [50]:
# Apply One-Hot Encoding to the 'who' column using get_dummies
# 'prefix' adds a prefix to the new columns, and 'drop_first=True' avoids multicollinearity by dropping the first category
import pandas as pd
df = pd.get_dummies(df, columns=['who'], prefix='who', drop_first=True)

# Display the first few rows of the DataFrame with one-hot encoded 'who' columns
df.head()


Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,adult_male,deck,embark_town,alive,alone,sex_encoded,embarked_filled,embarked_encoded,who_encoded,who_man,who_woman
0,0,3,male,22.0,1,0,7.25,S,Third,True,,Southampton,no,False,1,S,2,1,True,False
1,1,1,female,38.0,1,0,71.2833,C,First,False,C,Cherbourg,yes,False,0,C,0,2,False,True
2,1,3,female,26.0,0,0,7.925,S,Third,False,,Southampton,yes,True,0,S,2,2,False,True
3,1,1,female,35.0,1,0,53.1,S,First,False,C,Southampton,yes,False,0,S,2,2,False,True
4,0,3,male,35.0,0,0,8.05,S,Third,True,,Southampton,no,True,1,S,2,1,True,False


<h4>Converting 'pclass' to Categorical and Encoding</h4>

![image.png](attachment:image.png)

In [52]:
# Convert 'pclass' to a categorical variable
df['pclass_cat'] = df['pclass'].astype('category')

# Initialize the LabelEncoder
label_encoder = LabelEncoder()

# Fit and transform the categorical 'pclass_cat' column, then create a new encoded column 'pclass_encoded'
df['pclass_encoded'] = label_encoder.fit_transform(df['pclass_cat'])

# Apply One-Hot Encoding to the 'pclass_cat' column using get_dummies
# 'prefix' adds a prefix to the new columns, and 'drop_first=True' avoids multicollinearity by dropping the first category
df = pd.get_dummies(df, columns=['pclass_cat'], prefix='pclass', drop_first=True)

# Display the first few rows of the updated DataFrame with encoded 'pclass' columns
df.head()


Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,adult_male,...,alone,sex_encoded,embarked_filled,embarked_encoded,who_encoded,who_man,who_woman,pclass_encoded,pclass_2,pclass_3
0,0,3,male,22.0,1,0,7.25,S,Third,True,...,False,1,S,2,1,True,False,2,False,True
1,1,1,female,38.0,1,0,71.2833,C,First,False,...,False,0,C,0,2,False,True,0,False,False
2,1,3,female,26.0,0,0,7.925,S,Third,False,...,True,0,S,2,2,False,True,2,False,True
3,1,1,female,35.0,1,0,53.1,S,First,False,...,False,0,S,2,2,False,True,0,False,False
4,0,3,male,35.0,0,0,8.05,S,Third,True,...,True,1,S,2,1,True,False,2,False,True


<p><strong>Questions to Consider:</strong></p>
<ul>
<li>
<p><strong>How does encoding 'pclass' as a categorical variable affect the dataset?</strong></p>
<ul>
<li><strong>Answer:</strong> Encoding 'pclass' as a categorical variable allows the model to interpret it appropriately, either as an ordinal feature if label encoded or as separate binary features if one-hot encoded. This can improve the model's ability to capture patterns related to passenger classes.</li>
</ul>
</li>
<li>
<p><strong>What are the implications of treating numerical variables as categorical?</strong></p>
<ul>
<li><strong>Answer:</strong> Treating numerical variables as categorical can lead to loss of ordinal information and may increase the dimensionality of the dataset if one-hot encoded. However, it can be beneficial if the numerical variable actually represents categories without a natural order or if the relationship between categories and the target variable is non-linear.</li>
</ul>
</li>
</ul>

<h2>Lab Summary</h2>
<p>In this lab, we explored techniques for transforming categorical data into numerical data, focusing on Label Encoding and One-Hot Encoding. We applied these techniques using both pandas and scikit-learn, and discussed when each method is appropriate.</p>

---

# Submission
Submit all files to myConnexionA