
# Assignment 1: scikit-learn machine learning preprocessing pipeline

This notebook contains a set of exercises that will guide you through the different steps of this assignment. Solutions must be code-based, _i.e._, hard-coded or manually computed results will not be accepted. Write your solution to each exercise in its dedicated cell, and do not modify or remove the test cells. Once you have completed all the exercises, submit this same notebook to Moodle in **.ipynb** format.

<div class="alert alert-success">

<b>About the datasets used in this assignment</b>

<u>Context</u>

Superheroes have been part of popular culture for a long time, and now more than ever. Since their creation, superheroes have not been diverse, but this is changing rapidly. The two datasets aim to provide an overview of heroes and their physical and power characteristics, helping curious people to identify trends and patterns. In this case, we want to understand how physical attributes and powers define a superhero's alignment (superhero, supervillain).

<u>Content</u>
    
The columns included in both datasets are: 
- **Dataset part 1**: Name, Gender, Eye color, Race, Hair color, Height, Publisher, Skin color, Alignment, Weight, Has Superpowers, Power Level, Intelligence Level
- **Dataset part 2**: All previous columns, plus Agility, Accelerated Healing, Lantern Power Ring, Dimensional Awareness, Cold Resistance, Durability, Stealth, Energy Absorption, Flight, Danger Sense, Underwater breathing, Marksmanship, Weapons Master, Power Augmentation, Animal Attributes, Longevity, Intelligence, Super Strength, Cryokinesis, Telepathy, Energy Armor, Energy Blasts, Duplication, Size Changing, Density Control, Stamina, Astral Travel, Audio Control, Dexterity, Omnitrix, Super Speed, Possession, Animal Oriented Powers, Weapon-based Powers, Electrokinesis, Darkforce Manipulation, Death Touch, Teleportation, Enhanced Senses, Telekinesis, Energy Beams, Magic, Hyperkinesis, Jump, Clairvoyance, Dimensional Travel, Power Sense, Shapeshifting, Peak Human Condition, Immortality, Camouflage, Element Control, Phasing, Astral Projection, Electrical Transport, Fire Control, Projection, Summoning, Enhanced Memory, Reflexes, Invulnerability, Energy Constructs, Force Fields, Self-Sustenance, Anti-Gravity, Empathy, Power Nullifier, Radiation Control, Psionic Powers, Elasticity, Substance Secretion, Elemental Transmogrification, Technopath/Cyberpath, Photographic Reflexes, Seismic Power, Animation, Precognition, Mind Control, Fire Resistance, Power Absorption, Enhanced Hearing, Nova Force, Insanity, Hypnokinesis, Animal Control, Natural Armor, Intangibility, Enhanced Sight, Molecular Manipulation, Heat Generation, Adaptation, Gliding, Power Suit, Mind Blast, Probability Manipulation, Gravity Control, Regeneration, Light Control, Echolocation, Levitation, Toxin and Disease Control, Banish, Energy Manipulation, Heat Resistance, Natural Weapons, Time Travel, Enhanced Smell, Illusions, Thirstokinesis, Hair Manipulation, Illumination, Omnipotent, Cloaking, Changing Armor, Power Cosmic, Biokinesis, Water Control, Radiation Immunity, Vision - Telescopic, Toxin and Disease Resistance, Spatial Awareness, Energy Resistance, Telepathy Resistance, Molecular Combustion, Omnilingualism, Portal Creation, Magnetism, Mind Control Resistance, Plant Control, Sonar, Sonic Scream, Time Manipulation, Enhanced Touch, Magic Resistance, Invisibility, Sub-Mariner, Radiation Absorption, Intuitive aptitude, Vision - Microscopic, Melting, Wind Control, Super Breath, Wallcrawling, Vision - Night, Vision - Infrared, Grim Reaping, Matter Absorption, The Force, Resurrection, Terrakinesis, Vision - Heat, Vitakinesis, Radar Sense, Qwardian Power Ring, Weather Control, Vision - X-Ray, Vision - Thermal, Web Creation, Reality Warping, Odin Force, Symbiote Costume, Speed Force, Phoenix Force, Molecular Dissipation, Vision - Cryo, Omnipresent, Omniscient.

Column names are self-explanatory. Physical attributes are numerical or categorical, while superpowers are dummy (one-hot) variables.
    
 <u>Inspiration</u>

What are the characteristics of your favorite superheroes? Do these characteristics affect a superhero's alignment? Let's shed some light on this important business question.

</div>

<div class="alert alert-danger"><b>Submission deadline:</b> Friday, October 18th, 23:55</div>


In [1]:
# DO NOT MODIFY NOR ADD CODE TO THIS CELL
import pandas as pd
from sklearn import set_config

set_config(transform_output="pandas")

df = pd.read_csv('https://raw.githubusercontent.com/jnin/information-systems/refs/heads/main/data/superheroes_%20physical_traits.csv')

<div class="alert alert-info"><b>Exercise 1: Creating the Feature Matrix and Target Array</b>

Write the code to create the feature matrix ```X``` and the target array ```y``` from the dataframe ```df```. When creating ```X```, make sure to drop or ignore the irrelevant columns: ```['Name', 'Publisher']```.

The target variable for this problem is ```Alignment```.

<br><i>[0.5 points]</i>
</div>
<div class="alert alert-warning">
    
Python is case-sensitive, so ensure your code matches the required capitalization.
Do **not** download the dataset manually. Instead, run the previous cell to load the data directly from the provided link.

</div>

In [2]:
X = df.drop(columns=['Name', 'Publisher', 'Alignment'])
y = df['Alignment']
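
The same split can be tried on a tiny frame first. The sketch below uses a hypothetical `toy` DataFrame with made-up values (only the column names mirror the assignment) to show that `drop` returns a new frame and leaves the original untouched.

```python
import pandas as pd

# Hypothetical toy data: column names mirror the assignment, values are made up
toy = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Publisher": ["P1", "P2", "P1"],
    "Height": [180.0, 165.0, 190.0],
    "Alignment": ["good", "bad", "good"],
})

# Features: everything except the identifier columns and the target
X_toy = toy.drop(columns=["Name", "Publisher", "Alignment"])
# Target: a single column, returned as a pandas Series
y_toy = toy["Alignment"]

print(X_toy.columns.tolist())
print(y_toy.shape)
```

Because `drop` is not in place, `toy` still contains `Alignment` afterwards, which is why the target can be extracted before or after building the feature matrix.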

In [36]:
# LEAVE BLANK

In [37]:
# LEAVE BLANK

<div class="alert alert-info"><b>Exercise 2: Imputing Missing Values </b> 

The first step in our preprocessing routine is to handle missing values in the feature matrix ```X```. Write the code to instantiate a ```SimpleImputer``` with a ```most_frequent``` strategy, naming it ```imputer```. Then, test the imputer by transforming ```X```, and store the transformed data in a new DataFrame called ```X_imputed```.

<br><i>[0.5 points]</i>
</div>


In [3]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy = 'most_frequent')
X_imputed = imputer.fit_transform(X)

In [39]:
# LEAVE BLANK

In [40]:
# LEAVE BLANK

<div class="alert alert-info"><b>Exercise 3: Encoding Categorical Features</b> 

Now that our dataset is free of missing values, let's handle the categorical columns. Create a `OneHotEncoder` object named `one_hot_encoder`. Next, create a DataFrame called `X_categorical` containing the following columns from `X_imputed`: `['Gender', 'Eye color', 'Race', 'Hair color', 'Skin color']`. 

Test the encoder by transforming the features of `X_categorical`, and store the transformed data in a new DataFrame named `X_categorical_encoded`.

<br><i>[0.75 points]</i>
</div>

In [4]:
from sklearn.preprocessing import OneHotEncoder
one_hot_encoder = OneHotEncoder(sparse_output = False)
X_categorical = X_imputed[['Gender', 'Eye color', 'Race', 'Hair color', 'Skin color']]
X_categorical_encoded = one_hot_encoder.fit_transform(X_categorical)

In [42]:
# LEAVE BLANK

In [43]:
# LEAVE BLANK

In [44]:
# LEAVE BLANK

<div class="alert alert-info"><b>Exercise 4: Encoding Ordinal Features </b> 

Next, repeat the process for the ordinal features `['Power Level', 'Intelligence Level']`. First, create a new DataFrame called `X_ordinal` containing these two columns. Then, instantiate an `OrdinalEncoder` and name it `ordinal_encoder`. 

Test the encoder by transforming the `X_ordinal` DataFrame, and store the transformed data in a new DataFrame called `X_ordinal_encoded`.

<br><i>[0.75 points]</i>
</div>

<div class="alert alert-warning">
    
The integer assigned to each label should reflect the label's rank, so that lower levels map to smaller integers and higher levels to larger ones.

</div>

In [5]:
from sklearn.preprocessing import OrdinalEncoder

X_ordinal = X_imputed[['Power Level', 'Intelligence Level']]

# Categories are listed from lowest to highest, one list per column, in column order
ordinal_encoder = OrdinalEncoder(categories=[['Weak', 'Below Average', 'Average', 'Above Average', 'Extremely Powerful'],
                                             ['Low Intelligence', 'Average Intelligence', 'Smart', 'Genius', 'Super-Genius']])
X_ordinal_encoded = ordinal_encoder.fit_transform(X_ordinal)

In [46]:
# LEAVE BLANK

In [47]:
# LEAVE BLANK

<div class="alert alert-info"><b>Exercise 5: Combining Feature Transformations </b> 

Now that we have confirmed the transformations for categorical and ordinal columns, let's use a `ColumnTransformer` to apply them in parallel. Instantiate a `ColumnTransformer` named `transformer`, including both the `OneHotEncoder` and `OrdinalEncoder`. Be sure to specify the correct column names for each transformer.

Test your `transformer` by applying it to the `X_imputed` DataFrame, and store the transformed data in a new DataFrame called `X_encoded`.

<br><i>[1 point]</i>
</div>


In [6]:
from sklearn.compose import ColumnTransformer

onehot_encoder = OneHotEncoder(sparse_output=False)
ordinalencoder = OrdinalEncoder(categories=[['Weak', 'Below Average', 'Average', 'Above Average', 'Extremely Powerful'],
                                             ['Low Intelligence', 'Average Intelligence', 'Smart', 'Genius', 'Super-Genius']])

transformer = ColumnTransformer([('onehotencoder', onehot_encoder, ['Gender', 'Eye color', 'Race', 'Hair color', 'Skin color']),
                                 ('ordinalencoder', ordinalencoder, ['Power Level', 'Intelligence Level'])],
                                 remainder= 'passthrough')

X_encoded = transformer.fit_transform(X_imputed)

In [49]:
# LEAVE BLANK

In [50]:
# LEAVE BLANK

In [51]:
# LEAVE BLANK

<div class="alert alert-info"><b>Exercise 6: Standardizing the Features </b> 

To prevent potential issues with feature scaling, we will standardize the features using a `StandardScaler`. First, instantiate a `StandardScaler` and assign it to the variable `scaler`. Then, test it by transforming the `X_encoded` DataFrame. Store the scaled data in a new DataFrame called `X_scaled`.

 <br><i>[0.5 points]</i>
</div>

In [7]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_encoded)

In [53]:
# LEAVE BLANK

In [54]:
# LEAVE BLANK

<div class="alert alert-info"><b>Exercise 7: Building the Preprocessing Pipeline </b> 

To complete this part of the assignment, create a `Pipeline` named `pipe` that includes the imputer, transformer, and scaler from the previous exercises. Test the pipeline by transforming the original feature matrix `X`, and store the preprocessed data in a new DataFrame called `X_pipe`.

<br><i>[1 point]</i>
</div>

<div class='alert alert-warning'>

Be sure you apply the data transformations in the correct order.

</div>

In [8]:
from sklearn.pipeline import Pipeline

steps = [('imputer', imputer),
         ('transformer', transformer),
         ('scaler', scaler)]

pipe = Pipeline(steps)

X_pipe = pipe.fit_transform(X)


In [56]:
# LEAVE BLANK

In [57]:
# LEAVE BLANK

In [58]:
# LEAVE BLANK

In [59]:
# LEAVE BLANK

<div class="alert alert-info"><b>Exercise 8: End-to-End Preprocessing on a Complex Dataset </b> 

Now, apply everything you’ve learned to preprocess a more complex dataset. Execute the next cell to load a new dataset, `df`, which includes both numerical and categorical features, some of which are ordinal. This dataset also contains missing values and features that require scaling.

**Your tasks are:**

1. **Feature Selection and Engineering:**  
   Create a feature matrix `X` and a target array `y` (using the `Alignment` variable). Drop any irrelevant columns and explain your reasoning for each column you choose to exclude. If you find it relevant, consider combining existing columns or creating new ones based on the dataset's features.

2. **Encoding Categorical and Ordinal Features:**  
   Identify the categorical and ordinal columns, and encode them using a `ColumnTransformer` to apply the transformations in parallel.

3. **Handling Missing Data:**  
   If there are missing values in your feature matrix, decide on an appropriate method to handle them (e.g., imputation).

4. **Standardizing Features:**  
   Assess whether standardization is necessary for your numerical features, and apply it if needed.

5. **Building a Pipeline:**  
   Create a `Pipeline` that integrates all the preprocessing steps you have applied.

6. **Documentation:**
   Thoroughly documenting your code, clearly explaining why you made each decision, and justifying why you rejected the alternatives will be highly valued.
   



<br><i>[5 points]</i>
</div>
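
Before deciding which columns to drop or impute, it can help to quantify missingness and variability. The sketch below is one possible way to do so on a hypothetical `toy` frame with made-up values. The 50% threshold is an arbitrary assumption, not a rule from this assignment.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: values are made up, column names echo the dataset
toy = pd.DataFrame({
    "Skin color": [np.nan, np.nan, np.nan, "green", "blue"],
    "Height": [180.0, np.nan, 165.0, 190.0, 175.0],
    "Flight": [True, False, True, False, True],
    "Omnitrix": [False, False, False, False, False],
})

# Share of missing values per column
missing_share = toy.isna().mean()
# Number of distinct non-missing values per column
unique_counts = toy.nunique(dropna=True)

# Assumed threshold: flag columns with more than half of their values missing
mostly_missing = missing_share[missing_share > 0.5].index.tolist()
# Columns with a single distinct value carry no information for a model
constant = unique_counts[unique_counts <= 1].index.tolist()

print(mostly_missing)  # ['Skin color']
print(constant)        # ['Omnitrix']
```

Reporting these numbers alongside each dropped column is one way to back up the reasoning requested in task 1.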



In [9]:
# DO NOT MODIFY NOR ADD CODE TO THIS CELL

df = pd.read_csv('https://raw.githubusercontent.com/jnin/information-systems/refs/heads/main/data/superheroes_complete.csv')

In [10]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

# Target array
y = df['Alignment']

# Dropping irrelevant columns and 'Alignment', which is the dependent variable
X = df.drop(columns = ['Name', 'Publisher', 'Alignment'])

# Dropping columns that have only one data point and 'Skin color', which is almost empty
X = X.drop(columns = ['Skin color', 'Anti-Gravity', 'Banish', 'Biokinesis', 'Changing Armor', 'Electrical Transport',
                      'Hair Manipulation', 'Hyperkinesis', 'Intuitive aptitude', 'Molecular Dissipation', 'Omnitrix',
                      'Phoenix Force', 'Spatial Awareness', 'Speed Force', 'Thirstokinesis'])

categorical_columns = ['Gender', 'Race', 'Hair color', 'Eye color']
# Column names are case-sensitive: the dataset uses 'Intelligence Level' and 'Power Level'
ordinal_columns = ['Intelligence Level', 'Power Level']
numerical_columns = ['Height', 'Weight']

# Categorical features: fill gaps with the most frequent value, then one-hot encode
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(sparse_output= False, handle_unknown='ignore'))
])

# Defining the ordered categories (lowest to highest) for the ordinal encoder.
# Each list must match its column, in the same order as ordinal_columns.
intelligence_level = ['Low Intelligence', 'Average Intelligence', 'Smart', 'Genius', 'Super-Genius']
power_level = ['Weak', 'Below Average', 'Average', 'Above Average', 'Extremely Powerful']

ordinal_transformer = Pipeline(steps=[
    ('ordinal', OrdinalEncoder(categories=[intelligence_level, power_level]))
])

# Numerical features: fill gaps with the column mean, then standardize
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

# Merging all pipelines through a column transformer.
# The remaining (superpower) columns are passed through unchanged.
transformer = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_columns),
        ('ord', ordinal_transformer, ordinal_columns),
        ('cat', categorical_transformer, categorical_columns)
    ], remainder='passthrough')

transformer
