
# Pokémon MLOps Project – Master Level Analytical Report

## Introduction

This notebook represents the final analytical document of the Pokémon MLOps project.  
It is written from the perspective of a Master's student in Data Science and MLOps, aiming to demonstrate:

- Technical understanding of the dataset  
- Ability to perform exploratory data analysis  
- Knowledge of machine learning workflows  
- Understanding of MLOps principles  
- Professional communication of results  

### Dataset Context

The Pokémon dataset used in this project comes from Kaggle and contains detailed information about more than 800 Pokémon.  
Each Pokémon is described by:

- Combat statistics (attack, defense, speed, etc.)
- Physical characteristics (height, weight)
- Types and elemental weaknesses
- Abilities
- Legendary status

This dataset is well suited to a binary classification problem: predicting whether a Pokémon is **legendary or not** from its characteristics.

### Project Goal

The objective of this project is not only to train a model, but to design a complete **MLOps pipeline** that includes:

- Data ingestion  
- Data cleaning  
- Feature engineering  
- Modeling  
- Experiment tracking with MLflow  

This notebook documents all these stages in a structured and pedagogical way.
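
The first three stages listed above can be sketched as small functions that each take and return a DataFrame, so they chain cleanly. This is a minimal illustration under stated assumptions, not the project's actual code: the function names, the median-imputation rule, the `attack_defense_ratio` feature, and the toy CSV rows are all hypothetical.

```python
import io

import pandas as pd

# Hypothetical in-memory stand-in for the raw CSV file (toy values, not real data).
RAW_CSV = """name,attack,defense,speed,weight_kg,is_legendary
Alpha,50,40,60,10.0,0
Beta,120,100,90,,1
Gamma,70,,55,30.5,0
"""


def ingest(source) -> pd.DataFrame:
    """Data ingestion: read the raw CSV into a DataFrame."""
    return pd.read_csv(source)


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Data cleaning: fill missing numeric values with the column median (assumed rule)."""
    df = df.copy()
    numeric_cols = df.select_dtypes(include="number").columns
    df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())
    return df


def engineer(df: pd.DataFrame) -> pd.DataFrame:
    """Feature engineering: add one illustrative derived feature (hypothetical)."""
    df = df.copy()
    df["attack_defense_ratio"] = df["attack"] / df["defense"]
    return df


def run_pipeline(source) -> pd.DataFrame:
    """Chain the stages so each one consumes the previous stage's output."""
    return engineer(clean(ingest(source)))


features = run_pipeline(io.StringIO(RAW_CSV))
print(features)
```

Keeping each stage as a pure function of its input makes it straightforward to test stages in isolation and to log their outputs as separate artifacts in an experiment tracker.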


## Imports and Configuration

In [None]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use("ggplot")


## Loading the Raw Dataset

In [None]:

raw = pd.read_csv("../data/raw.csv")
raw.head()


### General Information About the Dataset

In [None]:
raw.info()

## Descriptive Statistics

In [None]:
raw.describe()


## Ability-Based Analysis

Abilities are a key aspect of Pokémon.  
Understanding which abilities are most common and which are associated with powerful Pokémon can provide valuable insights.


### Most Common Abilities

In [None]:

abilities = raw["abilities"].value_counts().head(10)

plt.figure(figsize=(10,5))
abilities.plot(kind="bar")
plt.title("Top 10 Most Common Abilities")
plt.ylabel("Number of Pokémon")
plt.show()
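
One caveat on the count above: if each `abilities` value holds a stringified list of all the abilities of one Pokémon (an assumption about the file format; the notebook does not show the column's contents), then `value_counts()` ranks ability combinations rather than individual abilities. A possible way to count individual abilities in that case is sketched below on toy rows, not on the real dataset.

```python
import ast

import pandas as pd

# Toy rows standing in for `raw`; the stringified-list format is an assumption.
toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "abilities": [
        "['Overgrow', 'Chlorophyll']",
        "['Blaze', 'Chlorophyll']",
        "['Blaze']",
    ],
})

# Parse each string into a real Python list, then give every ability its own row.
parsed = toy["abilities"].apply(ast.literal_eval)
ability_counts = parsed.explode().value_counts()
print(ability_counts)
```

If the column already contains one ability per row, this step is unnecessary and the original cell can be used as is.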


### Top Abilities Among Legendary Pokémon

In [None]:

legendary = raw[raw["is_legendary"] == 1]
legendary["abilities"].value_counts().head(10)



### Abilities Associated With High Base Total

We now analyze which abilities appear most frequently among the strongest Pokémon.


In [None]:

strong = raw.sort_values(by="base_total", ascending=False).head(50)
strong["abilities"].value_counts().head(10)


## Top Pokémon Rankings

### Top 15 Pokémon by Base Total

In [None]:

top_total = raw.sort_values(by="base_total", ascending=False).head(15)

plt.figure(figsize=(12,6))
plt.barh(top_total["name"], top_total["base_total"], color="darkblue")
plt.title("Top 15 Pokémon by Base Total")
plt.xlabel("Base Total")
plt.gca().invert_yaxis()
plt.show()


### Top 15 Pokémon by Speed

In [None]:

top_speed = raw.sort_values(by="speed", ascending=False).head(15)

plt.figure(figsize=(12,6))
plt.barh(top_speed["name"], top_speed["speed"], color="purple")
plt.title("Top 15 Fastest Pokémon")
plt.xlabel("Speed")
plt.gca().invert_yaxis()
plt.show()


### Top 15 Pokémon by Attack

In [None]:

top_attack = raw.sort_values(by="attack", ascending=False).head(15)

plt.figure(figsize=(12,6))
plt.barh(top_attack["name"], top_attack["attack"], color="red")
plt.title("Top 15 Pokémon by Attack")
plt.xlabel("Attack")
plt.gca().invert_yaxis()
plt.show()


### Top 15 Pokémon by Defense

In [None]:

top_def = raw.sort_values(by="defense", ascending=False).head(15)

plt.figure(figsize=(12,6))
plt.barh(top_def["name"], top_def["defense"], color="green")
plt.title("Top 15 Pokémon by Defense")
plt.xlabel("Defense")
plt.gca().invert_yaxis()
plt.show()


## Correlation Analysis

In [None]:

# Select every numeric column, whatever its integer or float width
numeric = raw.select_dtypes(include="number")

plt.figure(figsize=(14,10))
sns.heatmap(numeric.corr(), cmap="coolwarm", annot=False)
plt.title("Correlation Matrix of Numerical Attributes")
plt.show()
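
Because the heatmap is drawn without annotations, the relationship between each feature and the target can be hard to read off. A possible complement, sketched here on a toy frame with made-up values (the column names mirror those used in this notebook, but the numbers are purely illustrative), is to rank features by the absolute value of their correlation with `is_legendary`.

```python
import pandas as pd

# Toy data for illustration only; these are not values from the real dataset.
toy = pd.DataFrame({
    "attack": [50, 60, 130, 55, 140, 65],
    "speed": [70, 40, 90, 60, 45, 80],
    "is_legendary": [0, 0, 1, 0, 1, 0],
})

# Take the target's column of the correlation matrix, drop its self-correlation,
# and order the remaining features by absolute correlation strength.
target_corr = (
    toy.corr()["is_legendary"]
    .drop("is_legendary")
    .sort_values(key=abs, ascending=False)
)
print(target_corr)
```

On the real data, the same expression applied to `numeric` would give a ranked list of candidate predictors to inspect before modeling.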


## Legendary vs Non-Legendary Comparison

In [None]:

plt.figure(figsize=(8,6))
sns.boxplot(x="is_legendary", y="base_total", data=raw)
plt.title("Base Total: Legendary vs Non-Legendary")
plt.show()


## Feature Engineering Stage

In [None]:

engineered = pd.read_csv("../data/engineered.csv")
engineered.head()


## Model Evaluation

In [None]:

results = pd.DataFrame({
    "Model": ["Random Forest", "Logistic Regression", "SVM", "Gradient Boosting"],
    "Accuracy": [0.9853, 0.9706, 1.0, 0.9853],
    "Precision": [0.0, 0.3333, 1.0, 0.0],
    "Recall": [0.0, 1.0, 1.0, 0.0],
    "F1": [0.0, 0.5, 1.0, 0.0],
    "AUC": [1.0, 1.0, 1.0, 1.0]
})
results


## Selecting the Best Model

In [None]:

best = results.sort_values(by=["F1","AUC"], ascending=[False, False]).iloc[0]
best



### Final Model Decision

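From a methodological perspective, the F1-score is the most appropriate primary metric because the dataset is imbalanced: legendary Pokémon are a small minority, so a model can reach high accuracy while never correctly identifying the positive class. In the results table, Random Forest and Gradient Boosting both reach 0.9853 accuracy with an F1-score of 0.0.

Using the F1-score as the primary criterion and AUC as the secondary criterion, the best model is:

👉 **Support Vector Machine (SVM)**

This model achieved perfect precision and recall (both 1.0) on the evaluation set. Since every model reaches an AUC of 1.0, AUC does not discriminate between the candidates here, and the choice rests on the F1-score alone.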
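
To make the accuracy versus F1 argument concrete, the sketch below computes the four threshold-based metrics from confusion-matrix counts. The counts are hypothetical: the notebook does not report the confusion matrices, and these values were chosen only because they reproduce two rows of the results table (68 evaluation examples with a single legendary Pokémon).

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts.

    Precision, recall and F1 are set to 0.0 when their denominator is zero.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical case 1: the model never predicts "legendary" and misses the one positive.
always_negative = classification_metrics(tp=0, fp=0, fn=1, tn=67)

# Hypothetical case 2: the model finds the positive but raises two false alarms.
two_false_alarms = classification_metrics(tp=1, fp=2, fn=0, tn=65)

for label, metrics in [("always negative", always_negative),
                       ("two false alarms", two_false_alarms)]:
    print(label, {name: round(value, 4) for name, value in metrics.items()})
```

Under these assumed counts, the first case matches the Random Forest and Gradient Boosting rows (accuracy 0.9853, F1 0.0) and the second matches the Logistic Regression row (accuracy 0.9706, precision 0.3333, recall 1.0, F1 0.5). If the real evaluation set does contain this few positives, a single prediction decides the F1-score, so it may be worth confirming the ranking with stratified cross-validation.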



## Final Reflections

As a Master's level project, this work demonstrates:

- Ability to analyze a complex dataset  
- Capacity to design a modular Python project  
- Understanding of machine learning evaluation  
- Practical application of MLOps principles  

Integrating MLflow for experiment tracking moves the project beyond one-off modeling and toward reproducible, professional machine learning systems.
