# Using a Decision Tree Algorithm to Sort Witches and Wizards into Hogwarts Houses

![machine-learning-03.jpg](images/machine-learning-03.jpg)


# 1. Introduction to the Magical Task

In the world of magic, every young witch and wizard eagerly awaits the moment they set foot in the grand and ancient halls of Hogwarts School of Witchcraft and Wizardry. Their hearts brim with anticipation as they approach the Sorting Hat, that wise and venerable artifact, which will determine their house—Gryffindor, Hufflepuff, Ravenclaw, or Slytherin. 🧙‍♂️🧙‍♀️

But what if, dear reader, we could harness the power of Muggle technology to predict the Sorting Hat’s decisions? What if a spell of a different sort, known as the Decision Tree algorithm, could guide us in sorting our beloved characters into their rightful houses? Imagine a realm where the lines between magic and Muggle ingenuity blur, where the enchantment of Hogwarts meets the precision of mathematics! 🏰✨

In this enchanting tale, we shall embark on a journey to uncover how such a marvelous feat can be achieved. Our quest begins with a dataset—a collection of characters as diverse and magical as the pages of a wizard’s spellbook. Each character, from the brave Harry Potter to the cunning Draco Malfoy, carries with them a unique set of traits and characteristics. These traits, much like the intricate patterns of a spell, will be the key to our algorithm’s success.

We shall delve into the world of data, treating each attribute with the care and attention of Hermione Granger poring over a particularly challenging potion recipe. From the form of their Patronus to their Quidditch position, every detail will play its part in this magical process.

Our task is not merely a matter of data and numbers, but a celebration of the wonder and whimsy that makes the wizarding world so captivating. So, with wands at the ready and a sprinkle of Muggle knowledge, let us set forth on this magical adventure. Who knows what wonders we might uncover with a dash of magic and a touch of Muggle science? 🎩🪄✨

---

# 2. Gathering the Data

In the hushed, candle-lit confines of the Hogwarts library, where the scent of ancient parchment mingles with the faint aroma of Madam Pince's restorative potions, we begin our quest for knowledge. 📚🕯️ Here, in this repository of magical wisdom, we gather the ingredients for our enchanted dataset—a veritable cauldron of information about our favorite witches and wizards.

Our first task is to conjure a list of some 50 magical individuals (the file loaded below holds 52 rows), each brimming with their own unique attributes and quirks. Much like the Sorting Hat, which perceives the innermost qualities of every student, we shall examine our characters through the lens of various magical features. Each entry in our dataset is a tapestry of details, woven together to tell the story of its subject.

## 2.1 Dataset Features

Let us explore these features, each as significant as a spell component in a well-crafted incantation:

- **Name**: The given name of our witch or wizard, from the illustrious Harry Potter to the enigmatic Luna Lovegood. 🌟
- **Gender**: Whether they are a young wizard or witch, reflecting the diversity of Hogwarts.
- **Age**: Their age as recorded in the dataset (values from 11 to 16 appear in the rows shown later), for even the youngest students have their place in the castle's storied history.
- **Origin**: The place they hail from, be it the rolling hills of England, the rugged highlands of Scotland, or the enchanting isles of Ireland. 🏞️
- **Specialty**: Their area of magical expertise, such as Potions, Transfiguration, or Defense Against the Dark Arts, much like Professor Snape’s mastery of the subtle art of potion-making.
- **House**: The revered house to which they belong—Gryffindor, Hufflepuff, Ravenclaw, or Slytherin—each with its own rich traditions and values.
- **Blood Status**: Whether they are Pure-blood, Half-blood, or Muggle-born, a detail that, while significant in the wizarding world, never diminishes their magical potential.
- **Pet**: Their chosen magical companion, be it an owl, a cat, or a toad, reminiscent of Harry's loyal Hedwig or Hermione's clever Crookshanks. 🦉🐈
- **Wand Type**: The wood and core of their wand, the very tool of their magical prowess.
- **Patronus**: The form their Patronus takes, a magical manifestation of their innermost self, like Harry's proud stag or Snape's ethereal doe. 🦌
- **Quidditch Position**: Their role in the beloved wizarding sport, whether Seeker, Chaser, Beater, or Keeper, or perhaps no position at all.
- **Boggart**: The form their Boggart takes, a glimpse into their deepest fears.
- **Favorite Class**: The subject they excel in or enjoy the most, akin to Hermione's love for Arithmancy or Neville's talent in Herbology.
- **House Points**: Points they have contributed to their house, reflecting their achievements and misadventures alike.

With this compendium of magical features, we craft our dataset with the precision of a spell-wright composing a new enchantment. Each character's details are meticulously recorded, ensuring that our data is as rich and detailed as the tapestry of Hogwarts itself. 🧙‍♂️🏰

As we assemble this treasure trove of information, we prepare ourselves for the next step in our magical journey—transforming these attributes into the foundations upon which our Decision Tree algorithm will cast its spell. Let us proceed, dear reader, for the magic is only just beginning! ✨🌟

## 2.2 The Python Code

In [14]:
# Importing necessary libraries
import pandas as pd

def gather_data(file_path):
    """
    Function to load and explore the Hogwarts students dataset from a CSV file.
    
    Parameters:
    - file_path (str): Path to the CSV file containing the dataset.
    
    Returns:
    - df (DataFrame): Pandas DataFrame containing the loaded dataset.
    """
    try:
        # Load the dataset into a pandas DataFrame
        df = pd.read_csv(file_path)
        
        # Display basic information about the dataset
        print("Dataset loaded successfully.")
        print(f"Number of rows: {df.shape[0]}")
        print(f"Number of columns: {df.shape[1]}")
        print("\nColumns:")
        print(df.columns)
        
        # Display the first few rows of the dataset
        print("\nFirst few rows:")
        print(df.head())
        
        # Display summary statistics
        print("\nSummary statistics:")
        print(df.describe())
        
        return df
        
    except FileNotFoundError:
        print(f"Error: The file '{file_path}' was not found.")
        return None
    except Exception as e:
        print(f"Error: An unexpected error occurred - {str(e)}")
        return None

# Path to the dataset file
file_path = "data/hogwarts-students.csv"

# Call the gather_data function to load and explore the dataset
hogwarts_data = gather_data(file_path)

# Example usage: Accessing specific columns
if hogwarts_data is not None:
    # Accessing 'name' and 'house' columns
    names = hogwarts_data['name']
    houses = hogwarts_data['house']
    
    # Displaying the first 10 names and their corresponding houses
    print("\nFirst 10 names and houses:")
    for name, house in zip(names[:10], houses[:10]):
        print(f"Name: {name}, House: {house}")

Dataset loaded successfully.
Number of rows: 52
Number of columns: 14

Columns:
Index(['name', 'gender', 'age', 'origin', 'specialty', 'house', 'blood_status',
       'pet', 'wand_type', 'patronus', 'quidditch_position', 'boggart',
       'favorite_class', 'house_points'],
      dtype='object')

First few rows:
               name  gender  age   origin                      specialty  \
0      Harry Potter    Male   11  England  Defense Against the Dark Arts   
1  Hermione Granger  Female   11  England                Transfiguration   
2       Ron Weasley    Male   11  England                          Chess   
3      Draco Malfoy    Male   11  England                        Potions   
4     Luna Lovegood  Female   11  Ireland                      Creatures   

        house blood_status  pet wand_type              patronus  \
0  Gryffindor   Half-blood  Owl     Holly                  Stag   
1  Gryffindor  Muggle-born  Cat      Vine                 Otter   
2  Gryffindor   Pure-blood  R

## 2.3 Explanation:

1. **Imports**: We import the pandas library (`import pandas as pd`) to work with DataFrames, which are used to handle structured data efficiently in Python.

2. **gather_data Function**:
   - **Purpose**: This function loads the dataset from the specified file path (`file_path`), prints basic information about the dataset (number of rows, number of columns, column names, first few rows, and summary statistics from `df.describe()`, which covers only the numeric columns by default), and returns the loaded DataFrame (`df`).
   - **Parameters**: `file_path` (str) - Path to the CSV file containing the dataset.
   - **Returns**: `df` (DataFrame) - Pandas DataFrame containing the loaded dataset.

3. **Error Handling**: The function includes error handling to manage scenarios where the file is not found (`FileNotFoundError`) or any other unexpected errors (`Exception`).

4. **Main Execution**:
   - Defines `file_path` as `"data/hogwarts-students.csv"`, assuming the dataset is located in a subfolder named `data`.
   - Calls `gather_data(file_path)` to load and explore the dataset, assigning the result to `hogwarts_data`.

5. **Example Usage**:
   - Demonstrates accessing specific columns (`'name'` and `'house'`) from the loaded DataFrame (`hogwarts_data`).
   - Prints the first 10 names and their corresponding houses as an example of data exploration.

### Notes:
- Ensure the `pandas` library is installed (`pip install pandas`) before running the script.
- Adjust `file_path` if the dataset file is located in a different directory.
- This script provides a foundational approach to loading and initial exploration of the dataset, facilitating further data analysis or machine learning tasks as needed.
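
Before moving on, it can help to peek at how many students each house holds and where values are missing. The following is a minimal sketch on a tiny stand-in table; the rows and the `Student A` style names are invented for illustration and are not taken from `hogwarts-students.csv`.

```python
import pandas as pd

# Hypothetical miniature stand-in for the real dataset (values are made up)
sample = pd.DataFrame({
    "name": ["Student A", "Student B", "Student C", "Student D"],
    "house": ["Gryffindor", "Gryffindor", "Slytherin", "Ravenclaw"],
    "pet": ["Owl", "Cat", None, None],
})

# Count students per house to spot class imbalance early
house_counts = sample["house"].value_counts()
print(house_counts)

# Count missing values per column
missing_per_column = sample.isna().sum()
print(missing_per_column)
```

The same two calls should work unchanged on `hogwarts_data`, assuming it loaded successfully.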

---

# 3. Preparing the Magical Ingredients (Data Preparation)

In the heart of Hogwarts, where the walls whisper secrets of spells long past, we embark on the meticulous task of preparing our magical ingredients. Just as Professor Snape would demand precision in the art of potion-making, we must ensure that our data is impeccably prepared for the Decision Tree algorithm. 🧪📜✨

First, we must cleanse our dataset, ensuring every detail is as pristine as a freshly polished broomstick. No missing values or inconsistencies can be allowed, for even the slightest error could skew our magical predictions. Each entry, whether it be the courageous Harry Potter or the enigmatic Luna Lovegood, must be as accurate as the records in the Hogwarts library. 📚✨

**Data Cleaning**: Imagine the meticulous care of Professor McGonagall overseeing a Transfiguration lesson. We must remove any duplicates, fill in any gaps, and correct any errors. Each name, age, and specialty must be verified, ensuring that our dataset shines with the clarity of a well-cast Lumos spell. 💡
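
The cleaning steps named above (removing duplicates and filling gaps) might look like the sketch below. The table is a made-up stand-in, and the choice of median for numbers and an `"Unknown"` label for text is an assumption for illustration, not the strategy used later in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical raw records: one exact duplicate row and two gaps
raw = pd.DataFrame({
    "name": ["Student A", "Student B", "Student B", "Student C"],
    "age": [11, 12, 12, np.nan],
    "pet": [None, "Toad", "Toad", "Cat"],
})

# 1. Drop exact duplicate rows and renumber the index
cleaned = raw.drop_duplicates().reset_index(drop=True)

# 2. Fill numeric gaps with the column median
cleaned["age"] = cleaned["age"].fillna(cleaned["age"].median())

# 3. Fill categorical gaps with an explicit placeholder label
cleaned["pet"] = cleaned["pet"].fillna("Unknown")

print(cleaned)
```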

**Feature Selection**: Next, we delve into the enchanted process of selecting the most relevant features. Not every detail may be necessary for our spell to work; we must choose wisely, much like selecting the right ingredients for a complex potion. Here, we focus on attributes that hold the key to unlocking the secrets of the Sorting Hat:

- **Age**: The age at which the young witch or wizard arrives at Hogwarts.
- **Origin**: The geographical background, which may influence certain traits.
- **Specialty**: The magical prowess that defines their talents.
- **Blood Status**: An aspect that, while contentious, can provide insight into house tendencies.
- **Favorite Class**: The subject they excel in, hinting at their intellectual inclinations.
- **House Points**: Reflecting their achievements and contributions to their house.
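
Restricting the table to the six features listed above could be sketched as follows. The column names match those printed when the dataset was loaded; the three rows themselves are invented placeholders.

```python
import pandas as pd

# Column names as they appear in the loaded dataset
feature_columns = [
    "age", "origin", "specialty",
    "blood_status", "favorite_class", "house_points",
]
target_column = "house"

# Hypothetical rows, for illustration only
students = pd.DataFrame({
    "name": ["Student A", "Student B", "Student C"],
    "age": [11, 12, 13],
    "origin": ["England", "Scotland", "Ireland"],
    "specialty": ["Potions", "Charms", "Herbology"],
    "blood_status": ["Half-blood", "Pure-blood", "Muggle-born"],
    "favorite_class": ["Potions", "Charms", "Herbology"],
    "house_points": [50, 70, 90],
    "house": ["Slytherin", "Ravenclaw", "Hufflepuff"],
})

X = students[feature_columns]   # selected features only
y = students[target_column]     # the house we want to predict

print(X.columns.tolist())
```

Leaving out `name` is a deliberate choice in this sketch: a value that is unique to each student cannot generalise to new students.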

**Encoding Categorical Variables**: Much like translating ancient runes, we must convert our categorical variables into a form that our Decision Tree can understand. For instance, turning the houses—Gryffindor, Hufflepuff, Ravenclaw, and Slytherin—into numerical codes. This step is crucial, akin to a wizard learning the precise incantation for a spell. 🧙‍♀️🔢
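
As a small sketch of this translation step, the example below encodes two text columns and then decodes the house codes again. It keeps one `LabelEncoder` per column in a dictionary (the `encoders` name is hypothetical) so the mapping back to house names is not lost; the rows are invented.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical rows, for illustration only
students = pd.DataFrame({
    "blood_status": ["Half-blood", "Muggle-born", "Pure-blood", "Pure-blood"],
    "house": ["Gryffindor", "Gryffindor", "Slytherin", "Ravenclaw"],
})

encoders = {}              # one fitted encoder per column
encoded = students.copy()
for column in students.columns:
    encoder = LabelEncoder()
    encoded[column] = encoder.fit_transform(students[column])
    encoders[column] = encoder

# Classes are sorted alphabetically: Gryffindor -> 0, Ravenclaw -> 1, Slytherin -> 2
print(encoded["house"].tolist())

# Translate the numeric codes back into house names
house_labels = encoders["house"].inverse_transform(encoded["house"])
print(list(house_labels))
```

The integer codes carry an arbitrary alphabetical order. scikit-learn documents `LabelEncoder` as intended for target labels and offers `OrdinalEncoder` for feature columns, so that may be the tidier choice for everything except `house`.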

**Splitting the Dataset**: Now, with our data cleansed and encoded, we must divide it into two parts: the training set and the testing set. The training set is our practice ground, where the Decision Tree learns the intricate patterns of our data. The testing set, on the other hand, is where our spell’s true power is revealed, predicting the houses of unseen witches and wizards. This division is akin to practicing a charm repeatedly before demonstrating it in front of Professor Flitwick. 🎓🔮

As we complete this stage, our dataset is now a well-prepared potion, ready for the next step in our magical journey. The ingredients are measured, the cauldron is simmering, and the enchantment is ready to begin. With wands at the ready and hearts filled with anticipation, we move forward to cast our Decision Tree spell, predicting the houses of Hogwarts with Muggle precision and magical flair. 🌟✨

## 3.2 The Python Code 

In [15]:
# Importing necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer

def prepare_data(file_path):
    """
    Function to prepare the Hogwarts students dataset for analysis and modeling.
    
    Parameters:
    - file_path (str): Path to the CSV file containing the dataset.
    
    Returns:
    - X_train (DataFrame): Training features DataFrame.
    - X_test (DataFrame): Testing features DataFrame.
    - y_train (Series): Training target Series.
    - y_test (Series): Testing target Series.
    """
    try:
        # Load the dataset into a pandas DataFrame
        df = pd.read_csv(file_path)
        
        # Display basic information about the dataset
        print("Dataset loaded successfully.")
        print(f"Number of rows: {df.shape[0]}")
        print(f"Number of columns: {df.shape[1]}")
        
        # Handling missing values
        print("\nHandling missing values...")
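        imputer = SimpleImputer(strategy='most_frequent')  # Impute with most frequent value
        df_filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
        # The imputer returns an object array; restore numeric dtypes for columns like 'age'
        df_filled = df_filled.infer_objects()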
        
        # Encoding categorical variables
        print("Encoding categorical variables...")
        label_encoder = LabelEncoder()
        df_encoded = df_filled.copy()
        for col in df.select_dtypes(include=['object']).columns:
            df_encoded[col] = label_encoder.fit_transform(df_filled[col])
        
        # Splitting the dataset into training and testing sets
        print("Splitting the dataset into training and testing sets...")
        X = df_encoded.drop('house', axis=1)  # Features (excluding the target 'house')
        y = df_encoded['house']  # Target variable ('house')
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
        
        # Display the shapes of the training and testing sets
        print(f"X_train shape: {X_train.shape}, y_train shape: {y_train.shape}")
        print(f"X_test shape: {X_test.shape}, y_test shape: {y_test.shape}")
        
        return X_train, X_test, y_train, y_test
    
    except FileNotFoundError:
        print(f"Error: The file '{file_path}' was not found.")
        return None, None, None, None
    except Exception as e:
        print(f"Error: An unexpected error occurred - {str(e)}")
        return None, None, None, None

# Path to the dataset file
file_path = "data/hogwarts-students.csv"

# Call the prepare_data function to prepare the dataset for analysis and modeling
X_train, X_test, y_train, y_test = prepare_data(file_path)

# Example usage: Displaying the first few rows of X_train and y_train
if X_train is not None and y_train is not None:
    print("\nFirst few rows of X_train:")
    print(X_train.head())
    print("\nFirst few rows of y_train:")
    print(y_train.head())


Dataset loaded successfully.
Number of rows: 52
Number of columns: 14

Handling missing values...
Encoding categorical variables...
Splitting the dataset into training and testing sets...
X_train shape: (41, 13), y_train shape: (41,)
X_test shape: (11, 13), y_test shape: (11,)

First few rows of X_train:
    name  gender age  origin  specialty  blood_status  pet  wand_type  \
8      7       0  14       6          2             0    4         13   
26    28       1  16       1          5             3    4         10   
6     14       0  11       1          6             3    4         27   
34    13       1  15       1         15             0    4          5   
4     29       0  11       5          4             0    4         11   

    patronus  quidditch_position  boggart  favorite_class house_points  
8         13                   4        3               3        110.0  
26         8                   4        3               5         90.0  
6          5                   2    

## 3.3 Explanation:

1. **Imports**: We import necessary libraries including pandas (`import pandas as pd`), `train_test_split` from `sklearn.model_selection`, `LabelEncoder` from `sklearn.preprocessing`, and `SimpleImputer` from `sklearn.impute`.

2. **prepare_data Function**:
   - **Purpose**: This function prepares the Hogwarts students dataset for analysis and modeling by handling missing values, encoding categorical variables, and splitting the dataset into training and testing sets.
   - **Parameters**: `file_path` (str) - Path to the CSV file containing the dataset.
   - **Returns**: `X_train` (DataFrame), `X_test` (DataFrame), `y_train` (Series), `y_test` (Series) - Training and testing sets for features (`X_train`, `X_test`) and target (`y_train`, `y_test`).

3. **Error Handling**: The function includes error handling to manage scenarios where the file is not found (`FileNotFoundError`) or any other unexpected errors (`Exception`).

4. **Data Preparation Steps**:
   - **Loading and Basic Information**: Loads the dataset into a pandas DataFrame (`df`) and prints basic information about its dimensions.
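   - **Handling Missing Values**: Uses `SimpleImputer` with `strategy='most_frequent'` to fill missing values in the dataset with the most frequent value in each column.
   - **Encoding Categorical Variables**: Uses `LabelEncoder` to transform each text column into integer codes, making them suitable for machine learning algorithms. A single encoder instance is refit for every column, so only the mapping of the last column survives; keep one encoder per column if the codes need to be translated back into house names.
   - **Splitting the Dataset**: Splits the dataset into training (`X_train`, `y_train`) and testing (`X_test`, `y_test`) sets using `train_test_split`, with a test size of 20% and a fixed random state for reproducibility. All 13 non-target columns, including `name`, are passed on as features; the six-feature selection described earlier is not applied in this function.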

5. **Main Execution**:
   - Defines `file_path` as `"data/hogwarts-students.csv"`, assuming the dataset is located in a subfolder named `data`.
   - Calls `prepare_data(file_path)` to prepare the dataset for analysis and modeling, assigning the returned values (`X_train`, `X_test`, `y_train`, `y_test`) to variables for further use.

6. **Example Usage**:
   - Demonstrates accessing and displaying the first few rows of `X_train` (features) and `y_train` (target) after preparation.

### Notes:

- Ensure the `pandas` and `scikit-learn` libraries are installed (`pip install pandas scikit-learn`) before running the script.
- Adjust `file_path` if the dataset file is located in a different directory.
- This script provides a foundational approach to preparing data for machine learning tasks, ensuring data cleanliness and suitability for training predictive models. Adjustments can be made based on specific requirements or additional preprocessing steps needed for the dataset.
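
One possible refinement, sketched here under assumptions: `prepare_data` imputes and encodes the whole table before splitting, so the test rows influence the fitted imputer. Fitting the preprocessing on the training rows only avoids that. The sketch uses scikit-learn's `ColumnTransformer`, `Pipeline`, and `OrdinalEncoder` on an invented eight-row table with just two features; it is not the notebook's own method.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical rows with one numeric gap and one categorical gap
students = pd.DataFrame({
    "age": [11, 12, np.nan, 14, 11, 13, 15, 12],
    "origin": pd.Series(
        ["England", "Scotland", "England", np.nan,
         "Ireland", "England", "Scotland", "Ireland"],
        dtype=object,
    ),
    "house": ["Gryffindor", "Slytherin", "Ravenclaw", "Hufflepuff",
              "Gryffindor", "Slytherin", "Ravenclaw", "Hufflepuff"],
})
X = students[["age", "origin"]]
y = students["house"]

# Split first, so nothing is learned from the test rows
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Text columns: fill gaps, then map categories to integers
# (categories unseen during fitting become -1)
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)),
])
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["age"]),
    ("cat", categorical, ["origin"]),
])

X_train_ready = preprocess.fit_transform(X_train)  # learn from training rows only
X_test_ready = preprocess.transform(X_test)        # reuse what was learned

print(X_train_ready.shape, X_test_ready.shape)
```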

---

# 4. Splitting the Dataset

In the echoing halls of Hogwarts, where the portraits murmur and the suits of armor stand vigil, we proceed to the next crucial step of our magical journey: splitting the dataset. This task, as delicate as crafting a Philosopher's Stone, will ensure our Decision Tree spell learns and predicts with the wisdom of an ancient seer. 🏰✨

Imagine, if you will, Professor Flitwick guiding his students through a complex charm. Much like his careful tutelage, we must divide our dataset with precision. We begin by separating our collected data into two distinct sets: the **Training Set** and the **Testing Set**. These sets are the foundation upon which our magical prediction will be built.

**The Training Set**: This set, dear reader, is where our Decision Tree algorithm will first spread its wings. Comprising roughly 80% of our dataset (41 of the 52 rows), it includes the names and traits of characters we already know, like Hermione Granger’s keen intellect and Neville Longbottom’s brave heart. The algorithm will study these patterns, learning to associate specific attributes with the respective houses. It’s akin to a young witch or wizard practicing their wand movements before casting their first spell. 🧙‍♀️📘

**The Testing Set**: The remaining 20% or so of our dataset (11 rows) forms the Testing Set, the proving ground for our spell. Here, the Decision Tree will face characters it has not encountered during training, much like a wizard facing an unknown challenge. The true measure of our spell’s accuracy will be revealed as it predicts the houses of these unseen students. This step is as thrilling as Harry’s first encounter with a dragon during the Triwizard Tournament. 🐉✨

To ensure our split is as precise as Professor Snape’s potion measurements, we use a method known in the Muggle world as **random sampling**. This technique shuffles the characters and gives each one the same chance as any other of landing in the Testing Set, preventing any ordering in the file from biasing our results. Fixing the `random_state` makes the shuffle repeatable. It’s a bit like ensuring each house at Hogwarts has a fair chance during the House Cup, though we know Gryffindor often has the edge! 🦁🏆
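
To see where the 41 and 11 come from, and what a fixed `random_state` buys us, here is a small sketch that splits 52 stand-in row numbers (not the real students) twice with the same seed.

```python
from sklearn.model_selection import train_test_split

# Stand-ins for the 52 rows of the dataset
student_ids = list(range(52))

# Two splits with the same seed
train_a, test_a = train_test_split(student_ids, test_size=0.2, random_state=42)
train_b, test_b = train_test_split(student_ids, test_size=0.2, random_state=42)

# 20% of 52 is 10.4, which is rounded up to 11 test rows; the other 41 train
print(len(train_a), len(test_a))  # 41 11

# Same seed, same shuffle
print(test_a == test_b)  # True
```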

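**Maintaining Balance**: As we split the dataset, we should also ensure that each house is fairly represented in both sets, a practice known as stratification. Plain random sampling does not guarantee this, especially with only 52 students; scikit-learn's `train_test_split` offers it through its `stratify` argument. This balance matters, for just as the four houses maintain harmony within Hogwarts, a balanced split lets the Decision Tree learn fairly about each house’s unique qualities. 🏰⚖️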
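
A minimal sketch of a house-balanced split, assuming an invented table of 40 students with ten per house. The `stratify=y` argument of `train_test_split` asks for the house proportions to be preserved in both sets.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

houses = ["Gryffindor", "Hufflepuff", "Ravenclaw", "Slytherin"]

# Hypothetical table: 40 students, exactly 10 per house
students = pd.DataFrame({
    "student_id": range(40),
    "house": [houses[i % 4] for i in range(40)],
})
X = students[["student_id"]]
y = students["house"]

# stratify=y keeps each house's share the same in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Each house appears twice in the 8-row test set
print(y_test.value_counts())
```

Stratification needs at least two students per house; scikit-learn raises an error otherwise, so it is worth checking the house counts first.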

With our dataset now meticulously divided, we are ready to move to the next stage of our magical endeavor. The Training Set will impart its wisdom, and the Testing Set will reveal the accuracy of our predictions. The anticipation is as palpable as waiting for the Sorting Hat’s pronouncement on a new student’s first night at Hogwarts. 🎩✨

Let us proceed with confidence, for the foundation is set, and the path to magical prediction lies before us. The Hogwarts houses await, ready to welcome their new members, as we cast our Decision Tree spell with Muggle ingenuity and wizarding wonder. 🌟🧙‍♂️🧙‍♀️


## 4.1 Python Code 

In [16]:
# Importing necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split

def split_dataset(file_path, test_size=0.2, random_state=42):
    """
    Function to split the Hogwarts students dataset into training and testing sets.
    
    Parameters:
    - file_path (str): Path to the CSV file containing the dataset.
    - test_size (float): Proportion of the dataset to include in the test split (default is 0.2).
    - random_state (int): Controls the shuffling applied to the data before applying the split (default is 42).
    
    Returns:
    - X_train (DataFrame): Training features DataFrame.
    - X_test (DataFrame): Testing features DataFrame.
    - y_train (Series): Training target Series.
    - y_test (Series): Testing target Series.
    """
    try:
        # Load the dataset into a pandas DataFrame
        df = pd.read_csv(file_path)
        
        # Display basic information about the dataset
        print("Dataset loaded successfully.")
        print(f"Number of rows: {df.shape[0]}")
        print(f"Number of columns: {df.shape[1]}")
        
        # Splitting the dataset into training and testing sets
        print("Splitting the dataset into training and testing sets...")
        X = df.drop('house', axis=1)  # Features (excluding the target 'house')
        y = df['house']  # Target variable ('house')
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state)
        
        # Display the shapes of the training and testing sets
        print(f"X_train shape: {X_train.shape}, y_train shape: {y_train.shape}")
        print(f"X_test shape: {X_test.shape}, y_test shape: {y_test.shape}")
        
        return X_train, X_test, y_train, y_test
    
    except FileNotFoundError:
        print(f"Error: The file '{file_path}' was not found.")
        return None, None, None, None
    except Exception as e:
        print(f"Error: An unexpected error occurred - {str(e)}")
        return None, None, None, None

# Path to the dataset file
file_path = "data/hogwarts-students.csv"

# Call the split_dataset function to split the dataset into training and testing sets
X_train, X_test, y_train, y_test = split_dataset(file_path)

# Example usage: Displaying the shapes of X_train, X_test, y_train, and y_test
if X_train is not None and y_train is not None:
    print("\nShapes of the split datasets:")
    print(f"X_train shape: {X_train.shape}, y_train shape: {y_train.shape}")
    print(f"X_test shape: {X_test.shape}, y_test shape: {y_test.shape}")


Dataset loaded successfully.
Number of rows: 52
Number of columns: 14
Splitting the dataset into training and testing sets...
X_train shape: (41, 13), y_train shape: (41,)
X_test shape: (11, 13), y_test shape: (11,)

Shapes of the split datasets:
X_train shape: (41, 13), y_train shape: (41,)
X_test shape: (11, 13), y_test shape: (11,)


## 4.2 Explanation:

1. **Imports**: We import necessary libraries including pandas (`import pandas as pd`) for data handling and `train_test_split` from `sklearn.model_selection` for splitting the dataset.

2. **split_dataset Function**:
   - **Purpose**: This function splits the Hogwarts students dataset into training and testing sets.
   - **Parameters**: 
     - `file_path` (str): Path to the CSV file containing the dataset.
     - `test_size` (float): Proportion of the dataset to include in the test split (default is 0.2).
     - `random_state` (int): Controls the shuffling applied to the data before applying the split (default is 42).
   - **Returns**: `X_train` (DataFrame), `X_test` (DataFrame), `y_train` (Series), `y_test` (Series) - Training and testing sets for features (`X_train`, `X_test`) and target (`y_train`, `y_test`).

3. **Error Handling**: The function includes error handling to manage scenarios where the file is not found (`FileNotFoundError`) or any other unexpected errors (`Exception`).

4. **Data Splitting**:
   - Loads the dataset into a pandas DataFrame (`df`) from the specified `file_path`.
   - Displays basic information about the dataset, including its dimensions.
   - Uses `train_test_split` to split the dataset into training (`X_train`, `y_train`) and testing (`X_test`, `y_test`) sets, based on the specified `test_size` and `random_state`.
   - Prints the shapes of the resulting training and testing sets for verification.

5. **Main Execution**:
   - Defines `file_path` as `"data/hogwarts-students.csv"`, assuming the dataset is located in a subfolder named `data`.
   - Calls `split_dataset(file_path)` to split the dataset into training and testing sets, assigning the returned values (`X_train`, `X_test`, `y_train`, `y_test`) to variables for further use.

6. **Example Usage**:
   - Demonstrates accessing and displaying the shapes of `X_train`, `X_test`, `y_train`, and `y_test` after the dataset splitting process.

### Notes:
- Ensure the `pandas` and `scikit-learn` libraries are installed (`pip install pandas scikit-learn`) before running the script.
- Adjust `file_path`, `test_size`, and `random_state` parameters as needed based on specific requirements.
- This script provides a foundational approach to splitting data into training and testing sets, essential for building and evaluating machine learning models on the Hogwarts students dataset.

---

# 5. Casting the Decision Tree Spell (Building the Model)

In the enchanted heart of Hogwarts, where the very walls seem to hum with ancient magic, we now embark on the most exciting part of our journey: casting the Decision Tree spell. This spell, much like a Patronus, will illuminate the path, predicting the rightful house for each new witch and wizard. 🪄✨

**Choosing the Algorithm**: Our journey begins in the mystical realm of algorithms, where we select the Decision Tree, a spell as wise as Dumbledore and as precise as Professor McGonagall. The Decision Tree algorithm is a magical construct that makes decisions based on the attributes of our witches and wizards, much like the Sorting Hat itself. 🎩🌟

**Training the Model**: Picture our algorithm as a young student, eager to learn. We feed it the Training Set, a collection of 80% of our dataset, filled with the known traits and house assignments of various characters. This step is akin to studying Hogwarts: A History before exams—every detail, every pattern is crucial.

- **Learning the Patterns**: The algorithm examines each character’s attributes—age, origin, specialty, and more—learning how these features align with their house. It’s a bit like the Sorting Hat delving into the minds of first-years, sensing bravery, cunning, intelligence, and loyalty. 🧠✨
- **Creating the Tree**: As it learns, the algorithm constructs a Decision Tree. At each node of the tree, it asks a question based on the attributes (e.g., "Is this character's specialty Potions?"). Depending on the answer, it follows a branch to the next node, asking another question, until it reaches a leaf node that predicts the house. This tree, with its branches and leaves, grows much like the Whomping Willow—complex and precise. 🌳🔮

**Visualizing the Tree**: Imagine, if you will, a grand tapestry unfurling, each thread representing a decision, each knot a question answered. This visualization helps us see the wisdom within our algorithm, much like the Marauder’s Map revealing the secrets of Hogwarts. 🗺️✨
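
One lightweight way to look inside a fitted tree is scikit-learn's `export_text`, which prints each node's question as indented text. The sketch below fits a tree on six invented students with two numeric features; the data and the resulting rules are illustrative only.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical training data, for illustration only
X = pd.DataFrame({
    "age": [11, 11, 12, 13, 14, 15],
    "house_points": [50, 60, 20, 30, 90, 80],
})
y = ["Gryffindor", "Gryffindor", "Slytherin", "Slytherin", "Ravenclaw", "Ravenclaw"]

clf = DecisionTreeClassifier(random_state=42)
clf.fit(X, y)

# Each indented line is a question at a node; "class:" lines are the leaves
tree_rules = export_text(clf, feature_names=list(X.columns))
print(tree_rules)
```

For a graphical view, `sklearn.tree.plot_tree` draws the same structure with matplotlib.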

**Fine-Tuning the Spell**: Just as Hermione would refine her spells for maximum effectiveness, we can adjust the parameters of our Decision Tree. We might change the depth of the tree or the criteria for splitting nodes, ensuring our model is as sharp as Godric Gryffindor’s sword. 🗡️🔧
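
The two adjustments mentioned above correspond to the `max_depth` and `criterion` parameters of `DecisionTreeClassifier` (`criterion` defaults to `"gini"`). The sketch below compares a default tree with a shallow, entropy-based one on randomly generated stand-in data, so only the tree sizes are meaningful, not the predictions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Random stand-in data: 60 "students", 4 integer features, 4 "houses"
rng = np.random.default_rng(0)
X = rng.integers(0, 10, size=(60, 4))
y = rng.integers(0, 4, size=60)

# Default settings: Gini criterion, no depth limit
default_tree = DecisionTreeClassifier(random_state=42).fit(X, y)

# Tuned settings: entropy criterion, at most two levels of questions
shallow_tree = DecisionTreeClassifier(
    max_depth=2, criterion="entropy", random_state=42
).fit(X, y)

print("default depth:", default_tree.get_depth(), "leaves:", default_tree.get_n_leaves())
print("shallow depth:", shallow_tree.get_depth(), "leaves:", shallow_tree.get_n_leaves())
```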

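**Avoiding Overfitting**: One must be wary, however, of the dangers of overfitting: a spell that is too tailored to the Training Set might falter when faced with new data. It’s like a wizard relying too heavily on a single spell without mastering others. To guard against this, we can limit or prune the tree, for example with scikit-learn's `max_depth`, `min_samples_leaf`, or `ccp_alpha` settings, so that it remains general enough to handle new students. The baseline model built in this section keeps the default settings, which grow the tree without pruning. 🌿✨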
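
As a sketch of pruning in practice, the example below grows one tree fully and a second one with cost-complexity pruning via `ccp_alpha`. The data is synthetic noise around a simple rule, and the value `0.02` is an arbitrary assumption chosen for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: the label depends on the first feature plus noise
rng = np.random.default_rng(1)
X = rng.normal(size=(80, 3))
y = (X[:, 0] + rng.normal(scale=0.5, size=80) > 0).astype(int)

# Fully grown tree: keeps splitting until the leaves are pure
full_tree = DecisionTreeClassifier(random_state=42).fit(X, y)

# Pruned tree: branches that add too little are cut back
pruned_tree = DecisionTreeClassifier(random_state=42, ccp_alpha=0.02).fit(X, y)

print("nodes before pruning:", full_tree.tree_.node_count)
print("nodes after pruning: ", pruned_tree.tree_.node_count)
```

A suitable `ccp_alpha` is usually chosen by comparing scores on held-out data rather than fixed in advance.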

With our model trained, it stands ready, a blend of Muggle ingenuity and magical wonder. The Decision Tree, much like the enchanted Sorting Hat, is now prepared to predict the house of any witch or wizard. How accurate those predictions are remains to be measured on the Testing Set, and with only 11 test students the estimate will be a rough one.

As we stand at the cusp of discovery, the echoes of Hogwarts’ rich history surround us. The magic is palpable, the excitement tangible, and the possibilities endless. The Decision Tree spell is cast, and with it, we step into a new era of magical prediction, blending the charm of the wizarding world with the precision of Muggle science. 🪄🌟🧙‍♂️🧙‍♀️



## 5.1 Python Code

In [17]:
# Importing necessary libraries
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer

def load_data(file_path):
    """
    Function to load and preprocess the Hogwarts students dataset.
    
    Parameters:
    - file_path (str): Path to the CSV file containing the dataset.
    
    Returns:
    - df (DataFrame): Preprocessed pandas DataFrame containing the dataset.
    """
    try:
        # Load the dataset into a pandas DataFrame
        df = pd.read_csv(file_path)
        
        # Handling missing values
        imputer = SimpleImputer(strategy='most_frequent')  # Impute with most frequent value
        df_filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
        
        # Encoding categorical variables
        label_encoder = LabelEncoder()
        df_encoded = df_filled.copy()
        for col in df.select_dtypes(include=['object']).columns:
            df_encoded[col] = label_encoder.fit_transform(df_filled[col])
        
        return df_encoded
    
    except FileNotFoundError:
        print(f"Error: The file '{file_path}' was not found.")
        return None
    except Exception as e:
        print(f"Error: An unexpected error occurred - {str(e)}")
        return None

def build_decision_tree_model(file_path):
    """
    Function to build a Decision Tree model on the Hogwarts students dataset.
    
    Parameters:
    - file_path (str): Path to the CSV file containing the dataset.
    
    Returns:
    - clf (DecisionTreeClassifier): Trained Decision Tree classifier model.
    """
    try:
        # Load and preprocess the dataset
        df = load_data(file_path)
        if df is None:
            return None
        
        # Split the dataset into training and testing sets
        X = df.drop('house', axis=1)  # Features (excluding the target 'house')
        y = df['house']  # Target variable ('house')
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
        
        # Build the Decision Tree model
        clf = DecisionTreeClassifier(random_state=42)
        clf.fit(X_train, y_train)
        
        # Make predictions on the test set
        y_pred = clf.predict(X_test)
        
        # Evaluate the model
        accuracy = accuracy_score(y_test, y_pred)
        print(f"Accuracy of the Decision Tree model: {accuracy:.2f}")
        print("\nClassification Report:")
        # zero_division=1 reports undefined (0/0) metrics as 1.0, which can inflate the averages
        print(classification_report(y_test, y_pred, zero_division=1))
        
        return clf
    
    except Exception as e:
        print(f"Error: An unexpected error occurred - {str(e)}")
        return None

# Path to the dataset file
file_path = "data/hogwarts-students.csv"

# Call the build_decision_tree_model function to build and evaluate the Decision Tree model
decision_tree_model = build_decision_tree_model(file_path)

Accuracy of the Decision Tree model: 0.27

Classification Report:
              precision    recall  f1-score   support

           0       0.00      0.00      1.00         1
           1       1.00      0.00      0.00         1
           2       0.50      0.50      0.50         2
           3       0.25      1.00      0.40         1
           4       0.00      0.00      1.00         2
           5       1.00      0.25      0.40         4

    accuracy                           0.27        11
   macro avg       0.46      0.29      0.55        11
weighted avg       0.57      0.27      0.55        11



## 5.2 Explanation:

1. **Imports**: We import necessary libraries including pandas (`import pandas as pd`), `DecisionTreeClassifier` from `sklearn.tree`, `accuracy_score` and `classification_report` from `sklearn.metrics`, `train_test_split` from `sklearn.model_selection`, `LabelEncoder` from `sklearn.preprocessing`, and `SimpleImputer` from `sklearn.impute`.

2. **load_data Function**:
   - **Purpose**: This function loads and preprocesses the Hogwarts students dataset, handling missing values and encoding categorical variables.
   - **Parameters**: `file_path` (str) - Path to the CSV file containing the dataset.
   - **Returns**: `df_encoded` (DataFrame) - Preprocessed pandas DataFrame containing the dataset, or `None` if loading fails.

3. **build_decision_tree_model Function**:
   - **Purpose**: This function builds a Decision Tree model on the Hogwarts students dataset, evaluates its performance, and prints the accuracy and classification report.
   - **Parameters**: `file_path` (str) - Path to the CSV file containing the dataset.
   - **Returns**: `clf` (DecisionTreeClassifier) - Trained Decision Tree classifier model, or `None` if an error occurs.

4. **Error Handling**: `load_data` catches `FileNotFoundError` when the file is missing, and both functions catch any other unexpected error with a generic `except Exception` handler. In every case the function prints a message and returns `None`.

5. **Data Loading and Preprocessing**:
   - **In `load_data` function**: Loads the dataset into a pandas DataFrame (`df`), fills missing values using `SimpleImputer`, and encodes categorical variables using `LabelEncoder`.
   - **In `build_decision_tree_model` function**: Calls `load_data(file_path)` to preprocess the dataset, splits it into training (`X_train`, `y_train`) and testing (`X_test`, `y_test`) sets using `train_test_split`.

6. **Building the Decision Tree Model**:
   - Initializes a `DecisionTreeClassifier` with a fixed `random_state`.
   - Trains the classifier (`clf`) on the training data (`X_train`, `y_train`).

7. **Model Evaluation**:
   - Makes predictions (`y_pred`) on the test set (`X_test`).
   - Computes and prints the accuracy score and classification report to evaluate the model's performance.

8. **Main Execution**:
   - Defines `file_path` as `"data/hogwarts-students.csv"`, assuming the dataset is located in a subfolder named `data`.
   - Calls `build_decision_tree_model(file_path)` to build and evaluate the Decision Tree model, assigning the returned trained model (`decision_tree_model`) for further use.

### Notes:

- Ensure the `pandas` and `scikit-learn` libraries are installed (`pip install pandas scikit-learn`) before running the script.
- Adjust `file_path` parameter if the dataset file is located in a different directory.
- This script provides a foundational approach to building and evaluating a Decision Tree model on the Hogwarts students dataset, demonstrating steps from data preprocessing to model training and evaluation. Adjustments can be made based on specific requirements or additional preprocessing steps needed for the dataset.

---

# 6. Testing the Spell (Evaluating the Model)

As the first light of dawn casts a golden glow over the turrets of Hogwarts, we stand ready to test our Decision Tree spell. The air is thick with anticipation, much like the moments before a Quidditch match, where every second counts and the stakes are high. 🏰✨

**Making Predictions**: Our model, now trained and refined, is poised to reveal its magic. We present it with the Testing Set, the 20% of our dataset it has never seen before. This is its true test, akin to a young witch or wizard facing their O.W.L.s. The algorithm, like the Sorting Hat placed on a new student's head, will predict the house for each character based on their attributes. 🎩🔮

**Comparing Predictions to Actual Houses**: As the predictions unfurl, we compare them to the actual house assignments, much like comparing a prophecy from Sybill Trelawney to the events it foretold. Each correct prediction brings a thrill of validation, a testament to the spell’s accuracy. 🧙‍♂️📜

**Accuracy and Metrics**: To measure our spell’s potency, we turn to a set of magical metrics:

- **Accuracy**: The proportion of correctly predicted houses to the total predictions. This gives us a sense of the spell’s overall effectiveness, much like a well-brewed potion.
- **Precision**: For each house, we calculate how many of the characters predicted to be in that house truly belong there. It’s as if we’re assessing the accuracy of Professor Snape’s potion ingredients.
- **Recall**: This metric reveals how well our model identifies all the characters of a particular house. Imagine it as Madam Pomfrey ensuring she hasn’t missed any detail in her healing spells.
- **F1 Score**: A harmonic mean of precision and recall, providing a balanced measure of our model’s performance. Think of it as balancing the scales in a wizard’s duel, ensuring fairness and accuracy. ⚖️✨
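
To make these definitions concrete, the sketch below computes the metrics by hand for a single house and checks them against scikit-learn. The six students and their predictions are invented for illustration and do not come from the Hogwarts dataset.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Hypothetical toy labels, invented for this example
y_true = ["Gryffindor", "Gryffindor", "Slytherin", "Ravenclaw", "Hufflepuff", "Slytherin"]
y_pred = ["Gryffindor", "Slytherin", "Slytherin", "Ravenclaw", "Ravenclaw", "Slytherin"]

house = "Slytherin"
pairs = list(zip(y_true, y_pred))
tp = sum(t == house and p == house for t, p in pairs)  # correctly sorted into the house
fp = sum(t != house and p == house for t, p in pairs)  # wrongly sorted into the house
fn = sum(t == house and p != house for t, p in pairs)  # members of the house that were missed

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = accuracy_score(y_true, y_pred)

# Cross-check the hand computation against scikit-learn
sk_p, sk_r, sk_f1, _ = precision_recall_fscore_support(
    y_true, y_pred, labels=[house], average=None, zero_division=0)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Here every true Slytherin is found (a recall of 1.0), but one Gryffindor is mistakenly sorted into Slytherin, so precision drops to 2/3 and the F1 score lands at 0.8.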

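**Confusion Matrix**: To further delve into our spell’s performance, we use a confusion matrix. This grid, much like the Marauder’s Map, reveals where our predictions have strayed. In scikit-learn's convention, each row corresponds to a student's actual house and each column to the predicted house, so the diagonal counts correct sortings while every off-diagonal cell counts one specific mix-up. Reading a house's column gives its false positives, and reading its row gives its false negatives, offering insights into where our spell might need refinement. 🗺️📊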
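
A minimal sketch of such a grid, assuming six invented students rather than the real dataset, could look like this. The `labels` argument fixes the order of rows and columns, and the pandas wrapper is only there to make the printout readable.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Hypothetical toy labels, invented for this example
y_true = ["Gryffindor", "Gryffindor", "Slytherin", "Ravenclaw", "Hufflepuff", "Slytherin"]
y_pred = ["Gryffindor", "Slytherin", "Slytherin", "Ravenclaw", "Ravenclaw", "Slytherin"]

houses = ["Gryffindor", "Hufflepuff", "Ravenclaw", "Slytherin"]

# Rows are actual houses, columns are predicted houses
cm = confusion_matrix(y_true, y_pred, labels=houses)
cm_df = pd.DataFrame(cm,
                     index=[f"actual {h}" for h in houses],
                     columns=[f"predicted {h}" for h in houses])
print(cm_df)
```

The diagonal holds the four correct sortings; the two off-diagonal entries show a Gryffindor sent to Slytherin and a Hufflepuff sent to Ravenclaw.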

**Visualizing the Results**: Just as Harry might consult the Pensieve to review memories, we use visualizations to understand our model’s performance. Graphs and charts, akin to enchanted diagrams in a spellbook, illuminate our spell’s strengths and weaknesses, guiding us in our next steps. 📈✨

**Reflecting on the Spell’s Performance**: As we evaluate the results, we reflect on our model’s journey, much like reflecting on a year’s worth of adventures at Hogwarts. Each success, each misstep, offers a lesson, a path to greater accuracy. We consider adjustments, perhaps pruning the tree further or adding new features, ensuring our spell grows stronger with each iteration.

With the Testing Set evaluated, we stand at the threshold of a new understanding. Our Decision Tree spell, born from the marriage of Muggle science and magical wonder, has shown both its promise and its limits: an accuracy of 0.27 on a test set of only 11 students leaves ample room for refinement. The Sorting Hat’s wisdom is not yet fully mirrored in our model, but the path toward guiding future generations of witches and wizards to their rightful houses is now clearer. 🌟🧙‍♀️🏰

The journey has been as thrilling as a ride on a Nimbus 2000, filled with discovery, learning, and magic. And so, with our spell tested and our hearts full of pride, we look forward to the future, where the enchantment of Hogwarts continues to blend seamlessly with the ingenuity of Muggle technology. 🪄✨


## 6.1 Python Code

In [18]:
# Importing necessary libraries
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer

def load_and_preprocess_data(file_path):
    """
    Function to load and preprocess the Hogwarts students dataset.
    
    Parameters:
    - file_path (str): Path to the CSV file containing the dataset.
    
    Returns:
    - df (DataFrame): Preprocessed pandas DataFrame containing the dataset.
    """
    try:
        # Load the dataset into a pandas DataFrame
        df = pd.read_csv(file_path)
        
        # Handle missing values
        imputer = SimpleImputer(strategy='most_frequent')  # Impute with most frequent value
        df_filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
        
        # Encode categorical variables
        label_encoder = LabelEncoder()
        df_encoded = df_filled.copy()
        for col in df.select_dtypes(include=['object']).columns:
            df_encoded[col] = label_encoder.fit_transform(df_filled[col])
        
        return df_encoded
    
    except FileNotFoundError:
        print(f"Error: The file '{file_path}' was not found.")
        return None
    except Exception as e:
        print(f"Error: An unexpected error occurred - {str(e)}")
        return None

def build_and_evaluate_decision_tree_model(file_path):
    """
    Function to build and evaluate a Decision Tree model on the Hogwarts students dataset.
    
    Parameters:
    - file_path (str): Path to the CSV file containing the dataset.
    
    Returns:
    - clf (DecisionTreeClassifier): Trained Decision Tree classifier model.
    """
    try:
        # Load and preprocess the dataset
        df = load_and_preprocess_data(file_path)
        if df is None:
            return None
        
        # Split the dataset into training and testing sets
        X = df.drop('house', axis=1)  # Features (excluding the target 'house')
        y = df['house']  # Target variable ('house')
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
        
        # Initialize the Decision Tree classifier
        clf = DecisionTreeClassifier(random_state=42)
        
        # Train the classifier
        clf.fit(X_train, y_train)
        
        # Make predictions on the test set
        y_pred = clf.predict(X_test)
        
        # Evaluate the model's performance
        accuracy = accuracy_score(y_test, y_pred)
        print(f"Accuracy of the Decision Tree model: {accuracy:.2f}")
        print("\nClassification Report:")
        # zero_division=1 reports undefined (0/0) metrics as 1.0, which can inflate the averages
        print(classification_report(y_test, y_pred, zero_division=1))
        
        return clf
    
    except Exception as e:
        print(f"Error: An unexpected error occurred - {str(e)}")
        return None

# Path to the dataset file
file_path = "data/hogwarts-students.csv"

# Build and evaluate the Decision Tree model
decision_tree_model = build_and_evaluate_decision_tree_model(file_path)


Accuracy of the Decision Tree model: 0.27

Classification Report:
              precision    recall  f1-score   support

           0       0.00      0.00      1.00         1
           1       1.00      0.00      0.00         1
           2       0.50      0.50      0.50         2
           3       0.25      1.00      0.40         1
           4       0.00      0.00      1.00         2
           5       1.00      0.25      0.40         4

    accuracy                           0.27        11
   macro avg       0.46      0.29      0.55        11
weighted avg       0.57      0.27      0.55        11



## 6.2 Explanation:

1. **Imports**:
   - `pandas` for data manipulation.
   - `DecisionTreeClassifier` for building the decision tree model.
   - `accuracy_score` and `classification_report` for evaluating the model.
   - `train_test_split` for splitting the dataset into training and testing sets.
   - `LabelEncoder` for encoding categorical variables.
   - `SimpleImputer` for handling missing values.

2. **load_and_preprocess_data Function**:
   - **Purpose**: Loads and preprocesses the Hogwarts students dataset.
   - **Steps**:
     - Loads the dataset into a pandas DataFrame.
     - Handles missing values using `SimpleImputer` with the most frequent value strategy.
     - Encodes categorical variables using `LabelEncoder`.
   - **Returns**: Preprocessed DataFrame.

3. **build_and_evaluate_decision_tree_model Function**:
   - **Purpose**: Builds and evaluates a Decision Tree model on the Hogwarts students dataset.
   - **Steps**:
     - Calls `load_and_preprocess_data` to load and preprocess the dataset.
     - Splits the dataset into features (`X`) and target (`y`), and then into training and testing sets using `train_test_split`.
     - Initializes a `DecisionTreeClassifier`.
     - Trains the classifier on the training data.
     - Makes predictions on the test data.
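     - Evaluates the model’s performance by calculating the accuracy and printing a classification report. Setting `zero_division=1` makes any undefined metric (a 0/0 division) appear as 1.0 rather than the default 0.0 with a warning, which can inflate the averages: in the output above, classes 0 and 4 show an F1 score of 1.00 despite a precision and recall of 0.00.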
   - **Returns**: Trained `DecisionTreeClassifier` model.

4. **Main Execution**:
   - Defines the `file_path` as `"data/hogwarts-students.csv"`.
   - Calls `build_and_evaluate_decision_tree_model(file_path)` to build and evaluate the Decision Tree model, storing the trained model in `decision_tree_model`.

### Notes:
- Ensure the `pandas` and `scikit-learn` libraries are installed (`pip install pandas scikit-learn`) before running the script.
- Adjust the `file_path` parameter if the dataset file is located in a different directory.
- This script provides a clear and structured approach to building and evaluating a Decision Tree model, making it easy to follow and adapt for further use or additional preprocessing steps.

---

# 7. Fine-Tuning the Spell (Improving the Model)

As the sun sets over the Forbidden Forest and the first stars twinkle above the spires of Hogwarts, we embark on the delicate task of fine-tuning our Decision Tree spell. This process, as intricate as crafting a new wand, requires both skill and patience. Our goal is to perfect our model, ensuring it performs with the grace and precision of a well-cast Patronus. 🌳✨

**Reviewing the Results**: Much like Professor Dumbledore reflecting on the events of the past year, we begin by reviewing the performance of our model. The accuracy, precision, recall, and F1 scores give us a glimpse into its strengths and weaknesses. We consult our confusion matrix, which, like a magical map, reveals where our predictions went awry. 🗺️📊

**Identifying Areas for Improvement**: Our analysis reveals certain patterns—perhaps our model struggles with certain houses or misclassifies characters with specific attributes. These insights are as enlightening as discovering a hidden room in Hogwarts, guiding us toward potential improvements. 🔍✨

**Pruning the Tree**: One of the first steps in fine-tuning is pruning our Decision Tree. By reducing its complexity, we prevent it from overfitting to the Training Set, ensuring it generalizes better to new data. This process is akin to Professor Sprout trimming the Whomping Willow, keeping it healthy and manageable. 🌳✂️
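
One way to do the trimming in scikit-learn is cost-complexity pruning through the `ccp_alpha` parameter. The sketch below assumes synthetic stand-in data and picks a mid-range alpha from the pruning path purely for illustration; in practice the value would be chosen by validation.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with noisy labels (not the Hogwarts dataset)
X, y = make_classification(n_samples=300, n_features=10, n_informative=4,
                           n_classes=4, flip_y=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Grow the full tree, then ask which alphas would prune it step by step
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# Arbitrary illustrative choice: an alpha from the middle of the path
alpha = max(float(path.ccp_alphas[len(path.ccp_alphas) // 2]), 0.0)
pruned_tree = DecisionTreeClassifier(random_state=42, ccp_alpha=alpha)
pruned_tree.fit(X_train, y_train)

print(f"leaves: full={full_tree.get_n_leaves()}, pruned={pruned_tree.get_n_leaves()}")
print(f"test accuracy: full={full_tree.score(X_test, y_test):.2f}, "
      f"pruned={pruned_tree.score(X_test, y_test):.2f}")
```

Larger values of `ccp_alpha` remove more branches; a value of 0 leaves the tree untouched.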

**Adjusting Parameters**: Next, we delve into the parameters that guide our model. We might adjust the maximum depth of the tree, the minimum samples required to split a node, or the minimum samples required at a leaf node. Each tweak is like adjusting the settings on a magical artifact, fine-tuning it for optimal performance. 🛠️🔮

**Feature Engineering**: Sometimes, the key to a better model lies in the features themselves. We might create new features that capture deeper insights into our characters, such as combining age and favorite class to better understand academic inclinations. This step is as creative as inventing new spells, blending knowledge and imagination. ✨📚

**Cross-Validation**: To ensure our improvements are effective, we employ cross-validation, a technique that divides our Training Set into multiple folds. The model is trained on all folds but one and validated on the fold held out, rotating until every fold has served once as the validation set; the scores are then averaged. This process, much like practicing a spell under different conditions, ensures our model is robust and reliable. 🧙‍♀️📜
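
As a rough sketch of the idea, the snippet below runs 5-fold stratified cross-validation with `cross_val_score`. It assumes the iris dataset as a stand-in; the number of folds is an illustrative choice, and a dataset with very small classes (as ours appears to have) may need fewer.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Stand-in data: iris replaces the Hogwarts dataset for illustration only
X, y = load_iris(return_X_y=True)

# Stratified folds keep the class proportions similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(DecisionTreeClassifier(random_state=42), X, y, cv=cv)

print("Accuracy per fold:", scores.round(2))
print(f"Mean accuracy: {scores.mean():.2f} (std {scores.std():.2f})")
```

The spread of the per-fold scores is as informative as the mean: a wide spread suggests the estimate depends heavily on which students happen to land in each fold.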

**Hyperparameter Tuning**: For the most meticulous fine-tuning, we explore hyperparameter tuning, using methods like Grid Search or Random Search to find the best combination of parameters. This step is as detailed as brewing a complex potion, where each ingredient must be measured to perfection. 🧪✨

**Ensemble Methods**: To further enhance our model’s accuracy, we might employ ensemble methods, combining the predictions of multiple Decision Trees. Techniques like Random Forest or Gradient Boosting can create a more powerful and accurate model, much like the combined efforts of the Hogwarts professors in times of crisis. 🌟🌳🌳🌳
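
A minimal comparison of a single tree with a Random Forest might look like the sketch below. It assumes synthetic stand-in data, and the choice of 100 trees is simply scikit-learn's default made explicit; whether the forest actually helps on the Hogwarts dataset would have to be measured.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with noisy labels (not the Hogwarts dataset)
X, y = make_classification(n_samples=300, n_features=10, n_informative=4,
                           n_classes=4, flip_y=0.2, random_state=42)

models = {
    "single tree": DecisionTreeClassifier(random_state=42),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# Score both models with the same 5-fold cross-validation
mean_scores = {name: cross_val_score(model, X, y, cv=5).mean()
               for name, model in models.items()}

for name, score in mean_scores.items():
    print(f"{name}: mean CV accuracy = {score:.2f}")
```

The forest averages the votes of many trees, each trained on a bootstrap sample with a random subset of features considered at each split, which tends to reduce the variance of a single tree.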

**Evaluating the Improved Model**: With our spell fine-tuned, we once again evaluate its performance using the Testing Set. The results, we hope, will show marked improvement, much like the transformation of Neville Longbottom from a timid first-year to a courageous hero. 🦁✨

**Reflecting on the Journey**: As we complete this stage, we reflect on the journey we’ve undertaken. From gathering the data to casting and fine-tuning our spell, each step has been a blend of magical wonder and Muggle ingenuity. Our model, now more accurate and robust, stands ready to predict the houses of Hogwarts with newfound confidence. 🌟🏰

And so, dear reader, with our spell refined and our hearts full of pride, we look forward to the future. The enchantment of Hogwarts and the precision of machine learning have combined to create something truly magical. Let us continue to explore, learn, and grow, for the world of magic holds endless possibilities. 🪄✨🧙‍♂️🧙‍♀️


In [21]:
# Importing necessary libraries
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer

def load_and_preprocess_data(file_path):
    """
    Function to load and preprocess the Hogwarts students dataset.
    
    Parameters:
    - file_path (str): Path to the CSV file containing the dataset.
    
    Returns:
    - df (DataFrame): Preprocessed pandas DataFrame containing the dataset.
    """
    try:
        # Load the dataset into a pandas DataFrame
        df = pd.read_csv(file_path)
        
        # Handle missing values
        imputer = SimpleImputer(strategy='most_frequent')  # Impute with most frequent value
        df_filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
        
        # Encode categorical variables
        label_encoder = LabelEncoder()
        df_encoded = df_filled.copy()
        for col in df.select_dtypes(include=['object']).columns:
            df_encoded[col] = label_encoder.fit_transform(df_filled[col])
        
        return df_encoded
    
    except FileNotFoundError:
        print(f"Error: The file '{file_path}' was not found.")
        return None
    except Exception as e:
        print(f"Error: An unexpected error occurred - {str(e)}")
        return None

def build_and_evaluate_decision_tree_model(file_path):
    """
    Function to build and evaluate a Decision Tree model on the Hogwarts students dataset.
    
    Parameters:
    - file_path (str): Path to the CSV file containing the dataset.
    
    Returns:
    - clf (DecisionTreeClassifier): Trained Decision Tree classifier model.
    - X_train, X_test, y_train, y_test (arrays): Training and testing sets.
    """
    try:
        # Load and preprocess the dataset
        df = load_and_preprocess_data(file_path)
        if df is None:
            return None, None, None, None, None
        
        # Split the dataset into training and testing sets using stratified split
        X = df.drop('house', axis=1)  # Features (excluding the target 'house')
        y = df['house']  # Target variable ('house')
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
        
        # Initialize the Decision Tree classifier
        clf = DecisionTreeClassifier(random_state=42)
        
        # Train the classifier
        clf.fit(X_train, y_train)
        
        # Make predictions on the test set
        y_pred = clf.predict(X_test)
        
        # Evaluate the model's performance
        accuracy = accuracy_score(y_test, y_pred)
        print(f"Accuracy of the Decision Tree model: {accuracy:.2f}")
        print("\nClassification Report:")
        # zero_division=1 reports undefined (0/0) metrics as 1.0, which can inflate the averages
        print(classification_report(y_test, y_pred, zero_division=1))
        
        return clf, X_train, X_test, y_train, y_test
    
    except Exception as e:
        print(f"Error: An unexpected error occurred - {str(e)}")
        return None, None, None, None, None

def fine_tune_decision_tree_model(file_path):
    """
    Function to fine-tune the Decision Tree model using GridSearchCV.
    
    Parameters:
    - file_path (str): Path to the CSV file containing the dataset.
    
    Returns:
    - best_clf (DecisionTreeClassifier): Best Decision Tree classifier model after fine-tuning.
    """
    try:
        # Build and evaluate the initial Decision Tree model
        clf, X_train, X_test, y_train, y_test = build_and_evaluate_decision_tree_model(file_path)
        if clf is None:
            return None
        
        # Define the parameter grid for hyperparameter tuning
        param_grid = {
            'criterion': ['gini', 'entropy'],
            'splitter': ['best', 'random'],
            'max_depth': [None, 10, 20, 30],
            'min_samples_split': [2, 5, 10],
            'min_samples_leaf': [1, 2, 4]
        }
        
        # Initialize GridSearchCV with the Decision Tree classifier
        # Use StratifiedKFold with n_splits=3 to handle the case with very few samples in some classes
        skf = StratifiedKFold(n_splits=3)
        grid_search = GridSearchCV(estimator=clf, param_grid=param_grid, cv=skf, n_jobs=-1, verbose=2)
        
        # Perform the grid search to find the best parameters
        grid_search.fit(X_train, y_train)
        
        # Get the best estimator (Decision Tree model with the best parameters)
        best_clf = grid_search.best_estimator_
        
        # Make predictions on the test set using the best model
        y_pred_best = best_clf.predict(X_test)
        
        # Evaluate the best model's performance
        accuracy_best = accuracy_score(y_test, y_pred_best)
        print(f"Accuracy of the fine-tuned Decision Tree model: {accuracy_best:.2f}")
        print("\nClassification Report (fine-tuned model):")
        # zero_division=1 reports undefined (0/0) metrics as 1.0, which can inflate the averages
        print(classification_report(y_test, y_pred_best, zero_division=1))
        
        return best_clf
    
    except Exception as e:
        print(f"Error: An unexpected error occurred - {str(e)}")
        return None

# Path to the dataset file
file_path = "data/hogwarts-students.csv"

# Fine-tune the Decision Tree model
fine_tuned_model = fine_tune_decision_tree_model(file_path)

Accuracy of the Decision Tree model: 0.64

Classification Report:
              precision    recall  f1-score   support

           0       0.50      1.00      0.67         1
           2       1.00      0.25      0.40         4
           3       0.50      1.00      0.67         1
           4       1.00      1.00      1.00         2
           5       0.50      0.67      0.57         3

    accuracy                           0.64        11
   macro avg       0.70      0.78      0.66        11
weighted avg       0.77      0.64      0.60        11

Fitting 3 folds for each of 144 candidates, totalling 432 fits

[CV] END criterion=gini, max_depth=30, min_samples_leaf=2, min_samples_split=2, splitter=random; total time=   0.0s
[... remaining per-fold log lines omitted; the output is truncated in the source ...]

# 8. Applying the Spell (Using the Model)

As the golden rays of dawn pierce through the ancient windows of Hogwarts, casting ethereal patterns on the stone floors, we reach the moment we've all been eagerly anticipating: applying our Decision Tree spell. The halls buzz with excitement, much like the anticipation before a new term feast. It's time to put our magical creation to use and witness the Sorting Hat's wisdom in a new, innovative form. 🏰✨

**Gathering New Students**: Imagine the Great Hall filled with first-years, their eyes wide with wonder and hearts pounding with anticipation. Our task is to predict the houses for these new witches and wizards, each bringing their unique traits and potential. Just as Professor McGonagall calls each name, our model will analyze each student's attributes—age, origin, specialty, and more—to determine their rightful house. 📜🌟

**Loading the Model**: We start by summoning our meticulously crafted model, stored safely in the digital equivalent of a Gringotts vault. With a few incantations (or lines of code, for the Muggles among us), we bring our model to life, ready to cast its predictive magic. 🧙‍♂️🔮
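
The notebook does not show how the model is stored, so the following is only one plausible incantation: a sketch that saves and restores a fitted tree with `joblib`. The file name `sorting_hat_model.joblib`, the temporary directory, and the iris stand-in data are all assumptions for illustration.

```python
import os
import tempfile

import joblib
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Stand-in model trained on iris, for illustration only
X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(random_state=42).fit(X, y)

# Hypothetical vault location: a temporary directory and an invented file name
model_path = os.path.join(tempfile.mkdtemp(), "sorting_hat_model.joblib")
joblib.dump(model, model_path)

# Later (or in another session), summon the model back
restored_model = joblib.load(model_path)
same_predictions = bool((restored_model.predict(X) == model.predict(X)).all())
print("Restored model gives identical predictions:", same_predictions)
```

A saved model should be loaded with the same scikit-learn version that created it, and the label encoders used during preprocessing need to be saved alongside it so that new students can be encoded consistently.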

**Making Predictions**: As each new student steps forward, we feed their attributes into our model. The algorithm, like the Sorting Hat, delves into the depths of their data, evaluating each feature with precision. The result is a prediction, as clear and confident as the Hat’s pronouncement of "Gryffindor!" or "Ravenclaw!" 🎩✨

- **Example**: Consider a young witch named Elara Moonshadow, with a specialty in Charms, a love for Herbology, and a background from a small wizarding village. Our model examines her traits, traversing the branches of the Decision Tree, and finally, it declares, "Hufflepuff!" The excitement in Elara's eyes mirrors the pride we feel in our model’s accuracy. 🌟🧙‍♀️
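
To sketch how such a prediction might run end to end, the example below trains a tiny tree on six invented students and then sorts a newcomer. The column names (`specialty`, `favorite_class`) and all values are hypothetical; the real dataset's columns may differ.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy records; columns and values are invented for illustration
students = pd.DataFrame({
    "specialty": ["Charms", "Potions", "Herbology", "Defense", "Charms", "Potions"],
    "favorite_class": ["Herbology", "Potions", "Herbology", "Defense", "Charms", "Transfiguration"],
    "house": ["Hufflepuff", "Slytherin", "Hufflepuff", "Gryffindor", "Ravenclaw", "Slytherin"],
})

# One encoder per column, kept so new students can be encoded the same way
label_encoders = {}
encoded = students.copy()
for col in students.columns:
    label_encoders[col] = LabelEncoder()
    encoded[col] = label_encoders[col].fit_transform(students[col])

feature_names = ["specialty", "favorite_class"]
clf = DecisionTreeClassifier(random_state=42)
clf.fit(encoded[feature_names], encoded["house"])

# A new arrival, encoded with the encoders fitted on the training data
new_student = pd.DataFrame([{"specialty": "Charms", "favorite_class": "Herbology"}])
for col in feature_names:
    new_student[col] = label_encoders[col].transform(new_student[col])

# Predict the encoded house, then translate it back to a house name
predicted_code = clf.predict(new_student[feature_names])
predicted_house = label_encoders["house"].inverse_transform(predicted_code)[0]
print(f"The model declares: {predicted_house}!")
```

Note that `LabelEncoder.transform` raises a `ValueError` for a value it never saw during fitting, so a student with an unfamiliar specialty would need special handling before prediction.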

**Real-Time Feedback**: As each prediction is made, we compare it with the known assignments (if available) or await feedback from the students themselves. This step is as dynamic and interactive as a Defense Against the Dark Arts class with Professor Lupin, where each spell cast is immediately evaluated. 🛡️✨

**Handling Uncertainty**: Occasionally, our model might hesitate, much like the Sorting Hat’s famous deliberation over Harry Potter. In such cases, we can explore additional features or even employ ensemble methods to ensure our prediction is as accurate as possible. This adaptability is akin to consulting the wisdom of multiple professors to reach a consensus. 🧙‍♂️🧙‍♀️🌟
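
One way to notice when the model hesitates is to inspect class probabilities with `predict_proba`. The sketch below assumes the iris dataset as a stand-in and a confidence threshold of 0.95, an arbitrary value chosen only for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Stand-in data: iris replaces the Hogwarts dataset for illustration only
X, y = load_iris(return_X_y=True)

# A depth-limited tree has impure leaves, so its probabilities are not all 0 or 1
clf = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X, y)

proba = clf.predict_proba(X)      # one row per sample, one column per class
confidence = proba.max(axis=1)    # probability of the predicted class

threshold = 0.95                  # assumed cut-off for a "hesitant" prediction
uncertain = np.flatnonzero(confidence < threshold)
print(f"{len(uncertain)} of {len(X)} predictions fall below {threshold} confidence")
```

For a tree, these probabilities are simply the class proportions of the training samples in the leaf that a student lands in, so a fully grown tree with pure leaves will rarely admit any doubt.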

**Continuous Learning**: Our model, much like the magical creatures in Hagrid’s care, can continue to grow. A scikit-learn Decision Tree does not update itself with each prediction, so as newly sorted students provide more labeled data, we periodically retrain the model on the enlarged dataset. This ongoing cycle keeps our spell sharp, ready to sort future generations with increasing precision. 📚✨

**Integration into Hogwarts Life**: Imagine a future where our model is seamlessly integrated into Hogwarts’ magical tapestry. From assisting the Sorting Hat during the Welcoming Feast to providing insights into students’ strengths and potential, the possibilities are endless. This fusion of magic and technology heralds a new era in Hogwarts history, one where tradition and innovation coexist harmoniously. 🌟🏰

As we stand back and marvel at the application of our Decision Tree spell, we feel a profound sense of accomplishment. The journey from data gathering to model deployment has been as enchanting as a journey through the Forbidden Forest, filled with discovery and wonder. With our model in place, the future of sorting at Hogwarts shines brighter than ever. 🪄✨

And so, dear reader, with hearts full of joy and minds brimming with knowledge, we look forward to the adventures that lie ahead. The magic of Hogwarts, intertwined with the ingenuity of machine learning, promises a future as bright and boundless as the skies over the castle. 🌌🧙‍♂️🧙‍♀️


In [24]:
# Importing necessary libraries
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV

def load_and_preprocess_data(file_path):
    """
    Function to load and preprocess the Hogwarts students dataset.
    
    Parameters:
    - file_path (str): Path to the CSV file containing the dataset.
    
    Returns:
    - df (DataFrame): Preprocessed pandas DataFrame containing the dataset.
    - feature_names (list): List of feature names used for training.
    - label_encoders (dict): Dictionary of label encoders for each categorical feature.
    """
    try:
        # Load the dataset into a pandas DataFrame
        df = pd.read_csv(file_path)
        
        # Handle missing values
        imputer = SimpleImputer(strategy='most_frequent')  # Impute with most frequent value
        df_filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
        
        # Encode categorical variables
        label_encoders = {}
        df_encoded = df_filled.copy()
        for col in df.select_dtypes(include=['object']).columns:
            label_encoders[col] = LabelEncoder()
            df_encoded[col] = label_encoders[col].fit_transform(df_filled[col])
        
        feature_names = df_encoded.columns.tolist()
        feature_names.remove('house')  # Exclude the target variable from features
        
        return df_encoded, feature_names, label_encoders
    
    except FileNotFoundError:
        print(f"Error: The file '{file_path}' was not found.")
        return None, None, None
    except Exception as e:
        print(f"Error: An unexpected error occurred - {str(e)}")
        return None, None, None

def fine_tune_decision_tree_model(file_path):
    """
    Function to fine-tune the Decision Tree model using GridSearchCV.
    
    Parameters:
    - file_path (str): Path to the CSV file containing the dataset.
    
    Returns:
    - best_clf (DecisionTreeClassifier): Best Decision Tree classifier model after fine-tuning.
    - feature_names (list): List of feature names used for training.
    - label_encoders (dict): Dictionary of label encoders for each categorical feature.
    """
    try:
        # Load and preprocess the dataset
        df, feature_names, label_encoders = load_and_preprocess_data(file_path)
        if df is None:
            return None, None, None
        
        # Split the dataset into features and target
        X = df[feature_names]  # Features (excluding the target 'house')
        y = df['house']  # Target variable ('house')
        
        # Initialize the Decision Tree classifier
        clf = DecisionTreeClassifier(random_state=42)
        
        # Define the parameter grid for hyperparameter tuning
        param_grid = {
            'criterion': ['gini', 'entropy'],
            'splitter': ['best', 'random'],
            'max_depth': [None, 10, 20, 30, 40, 50],
            'min_samples_split': [2, 5, 10],
            'min_samples_leaf': [1, 2, 4]
        }
        
        # Initialize GridSearchCV with the Decision Tree classifier
        grid_search = GridSearchCV(estimator=clf, param_grid=param_grid, cv=5, n_jobs=-1, verbose=2)
        
        # Perform the grid search to find the best parameters
        grid_search.fit(X, y)
        
        # Get the best estimator (Decision Tree model with the best parameters)
        best_clf = grid_search.best_estimator_
        
        return best_clf, feature_names, label_encoders
    
    except Exception as e:
        print(f"Error: An unexpected error occurred - {str(e)}")
        return None, None, None

def apply_model(model, feature_names, label_encoders, new_data):
    """
    Function to apply the trained Decision Tree model to new data.
    
    Parameters:
    - model (DecisionTreeClassifier): Trained Decision Tree model.
    - feature_names (list): List of feature names used for training.
    - label_encoders (dict): Dictionary of label encoders for each categorical feature.
    - new_data (DataFrame): New data for which predictions are to be made.
    
    Returns:
    - predictions (array): Predicted houses for the new data.
    """
    try:
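        # Work on a copy so the caller's DataFrame is left untouched
        new_data = new_data.copy()

        # Encode new data using the same encoders as the training data
        for col, encoder in label_encoders.items():
            if col == 'house':
                continue  # The target is not an input feature
            if col in new_data.columns:
                # LabelEncoder.transform raises ValueError on labels it has not seen
                # (e.g. a new student's name), so map known labels to their codes
                # and fall back to -1 for anything unseen
                codes = {label: code for code, label in enumerate(encoder.classes_)}
                new_data[col] = new_data[col].map(codes).fillna(-1).astype(int)
            else:
                new_data[col] = 0  # Adding missing features with default value

        # Ensure new data has the same columns as the training data
        for feature in feature_names:
            if feature not in new_data.columns:
                new_data[feature] = 0  # Adding missing features with default value

        # Align new data with feature names
        new_data = new_data[feature_names]

        # Make predictions using the trained model
        predictions = model.predict(new_data)

        # Decode the numeric predictions back into house names
        if 'house' in label_encoders:
            predictions = label_encoders['house'].inverse_transform(predictions)

        return predictions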
    
    except Exception as e:
        print(f"Error: An unexpected error occurred - {str(e)}")
        return None

# Path to the dataset file
file_path = "data/hogwarts-students.csv"

# Fine-tune the Decision Tree model
fine_tuned_model, feature_names, label_encoders = fine_tune_decision_tree_model(file_path)

# Example new data (you can replace this with actual new data)
new_data = pd.DataFrame({
    'name': ['New Wizard'],
    'gender': ['Male'],
    'age': [12],
    'origin': ['Muggle-born'],
    'specialty': ['Defense Against the Dark Arts'],
    'wand_type': ['Phoenix Feather'],
    'patronus': ['Stag'],
    'broomstick': ['Firebolt'],
    'favourite_subject': ['Charms'],
    'quidditch_position': ['Seeker']
})

# Apply the model to the new data
# (fine_tune_decision_tree_model returns None for all three values on failure)
if fine_tuned_model is not None:
    predictions = apply_model(fine_tuned_model, feature_names, label_encoders, new_data)
    if predictions is not None:
        print("Predicted House for the new wizard:", predictions)

Fitting 5 folds for each of 216 candidates, totalling 1080 fits

[CV] END criterion=entropy, max_depth=None, min_samples_leaf=1, min_samples_split=10, splitter=random; total time=   0.0s
[CV] END criterion=entropy, max_depth=None, min_samples_leaf=1, min_samples_split=10, splitter=random; total time=   0.0s
[CV] END criterion=entropy, max_depth=None, min_samples_leaf=2, min_samples_split=10, splitter=best; total time=   0.0s
[CV] END criterion=entropy, max_depth=None, min_samples_leaf=2, min_samples_split=10, splitter=best; total time=   0.0s
[CV] END criterion=entropy, max_depth=None, min_samples_leaf=2, min_samples_split=10, splitter=best; total time=   0.0s
[CV] END criterion=entropy, max_depth=None, min_samples_leaf=2, min_samples_split=10, splitter=random; total time=   0.0s
[CV] END criterion=entropy, max_depth=None, min_samples_leaf=2, min_samples_split=10, splitter=random; total time=   0.0s
[CV] END criterion=entropy, max_depth=None, min_samples_leaf=2, min_samples_split=10, splitter=random; total time=   0.0s
[CV] END criterion=entropy, m
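
A new student may well arrive with a value the training data never contained, such as an unfamiliar name or wand type, and scikit-learn's `LabelEncoder.transform` raises a `ValueError` for any label it has not seen. The sketch below shows one possible alternative, not the approach used in the cell above: an `OrdinalEncoder` with `handle_unknown='use_encoded_value'` inside a `Pipeline`, so that unseen categories are mapped to `-1` instead of stopping the spell mid-cast. The toy rows, the `'Squib'` and `'Beater'` values, and the reduced column set are all invented for illustration and are assumptions, not taken from the real dataset.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy rows, invented for illustration only (not the real dataset)
train = pd.DataFrame({
    'origin': ['Muggle-born', 'Pure-blood', 'Half-blood', 'Pure-blood'],
    'quidditch_position': ['Seeker', 'Chaser', 'Keeper', 'Seeker'],
    'age': [11, 12, 11, 13],
    'house': ['Gryffindor', 'Slytherin', 'Ravenclaw', 'Slytherin'],
})

categorical = ['origin', 'quidditch_position']

# Encode categorical columns; categories unseen during fit become -1.
# Remaining columns (here 'age') are passed through unchanged.
encoder = ColumnTransformer(
    transformers=[
        ('cat',
         OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1),
         categorical),
    ],
    remainder='passthrough',
)

pipeline = Pipeline([
    ('encode', encoder),
    ('tree', DecisionTreeClassifier(random_state=42)),
])

# The target stays as plain strings, so predictions come back as house names
X_train = train.drop(columns='house')
pipeline.fit(X_train, train['house'])

# A new student with categories the encoder has never seen
new_student = pd.DataFrame({
    'origin': ['Squib'],
    'quidditch_position': ['Beater'],
    'age': [12],
})
predicted = pipeline.predict(new_student)
print("Predicted house:", predicted[0])
```

Because the encoder and the tree live in one pipeline, the same object could also be handed to `GridSearchCV` (with parameter names prefixed by `tree__`), which would keep the encoding fitted only on each training fold. Whether `-1` is a sensible stand-in for unknown values depends on the data, so treat it as a starting point rather than a rule.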

## 9. Conclusion

As the last light of day fades over the sprawling grounds of Hogwarts, the castle stands as a beacon of timeless magic and boundless discovery. In these hallowed halls, where every stone whispers tales of ancient wisdom and every corridor hums with the promise of adventure, we conclude our remarkable journey through the art of magical prediction. 🏰✨

**Reflecting on Our Journey**: Our expedition began with a simple yet profound question: could we, using the marvels of Muggle science, create a model to emulate the Sorting Hat’s centuries-old wisdom? Like young Harry, Ron, and Hermione setting out to unravel the mysteries of the Philosopher’s Stone, we embarked with curiosity and determination. And oh, what a journey it has been! 🚂✨

**From Data Gathering to Spell Casting**: We meticulously gathered data, much like Professor Binns compiling the chronicles of wizarding history. Each character’s attributes, from their age and origin to their favorite subjects and magical abilities, were carefully documented. With these magical ingredients in hand, we crafted our Decision Tree spell, a wondrous blend of Muggle technology and wizarding intuition. 📜🔮

- **The Training Phase**: Our model learned from the rich tapestry of data, discerning patterns and making connections. It was akin to Hermione mastering complex spells through relentless study and practice. 📚✨
- **Testing and Evaluation**: We tested our spell with the rigour of a Triwizard Tournament challenge, ensuring its accuracy and reliability. Each prediction, each metric, was a step towards perfection, reminiscent of Harry honing his skills for the final confrontation with Voldemort. 🐉⚡
- **Fine-Tuning and Application**: Through fine-tuning, we enhanced our model’s prowess, preparing it to face real-world sorting scenarios. And when the moment arrived, our model performed with the grace and precision of a perfectly executed Patronus charm. 🌟🪄

**A New Dawn at Hogwarts**: The integration of our Decision Tree model into the sorting process represents a new dawn at Hogwarts. It is a testament to the harmony between tradition and innovation, a bridge between the timeless magic of the wizarding world and the cutting-edge advancements of the Muggle realm. The Sorting Hat, ever wise and ever patient, welcomes this partnership, embracing the future with open arms. 🎩✨

**The Promise of Future Adventures**: As we stand on the threshold of this new era, we are filled with a sense of wonder and anticipation. The lessons we’ve learned, the spells we’ve cast, and the predictions we’ve made are but the beginning. The world of magic is vast, and the possibilities are endless. With our newfound knowledge, who knows what other mysteries we might unravel, what other spells we might cast? The future beckons, bright and full of promise. 🌌🧙‍♂️🧙‍♀️

And so, dear reader, as the stars twinkle above and the gentle hum of magic fills the air, we close this chapter of our adventure. But remember, at Hogwarts, every ending is but a new beginning. The magic of the castle, the wisdom of its inhabitants, and the spirit of discovery live on, ever ready to guide us on our next great journey.

Until then, keep the magic alive in your heart, and may your days be filled with wonder and enchantment. 🌟🏰✨

# The End

Or perhaps, just the beginning... 🪄🌟🧙‍♂️🧙‍♀️
