# Marvel Character Alignment Predictor

### Project Goal
The goal of this project is to build a machine learning model that predicts whether a Marvel character is 'good' or 'bad' from their physical statistics (height and weight).

This notebook will walk through the entire process:
1.  **Data Loading and Initial Cleaning**
2.  **Data Cleaning and Preprocessing**
3.  **Feature Selection and Final Preparation**
4.  **Model Training and Evaluation**

### Libraries Used
*   Pandas for data manipulation
*   Scikit-learn for building the predictive model

In [1]:
import pandas as pd

### 1. Data Loading and Initial Cleaning

First, we load the dataset from the CSV file. The file marks missing values with the placeholders `-` and `-99`, so we pass both to the `na_values` parameter and let pandas convert them to `NaN` while loading. This is simpler than replacing them column by column afterwards.

In [2]:
na_vals = ['-99', '-']
df = pd.read_csv('HeroesList.csv', na_values=na_vals)
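
The effect of `na_values` can be sketched on a tiny in-memory CSV. The three rows and the variable names `csv_text`, `raw`, and `clean` are hypothetical stand-ins for `HeroesList.csv`; only the `na_values` list matches the notebook.

```python
import io
import pandas as pd

# Hypothetical three-row CSV standing in for HeroesList.csv
csv_text = """Name,SkinColor,Height,Weight
A-Bomb,-,203.0,441.0
Abraxas,-,-99,-99
Yoda,green,66.0,17.0
"""

# Default parsing: the placeholders survive as ordinary values
raw = pd.read_csv(io.StringIO(csv_text))
print(raw["Height"].min())

# With na_values: the placeholders are parsed as NaN
clean = pd.read_csv(io.StringIO(csv_text), na_values=["-99", "-"])
print(clean.isna().sum())
```

Comparing `raw` and `clean` shows why the placeholders matter: left in place, `-99` would be treated as a real height or weight by any later statistic.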

In [3]:
df

Unnamed: 0,ID,Name,Alignment,Gender,EyeColor,Race,HairColor,Publisher,SkinColor,Height,Weight
0,0,A-Bomb,good,Male,yellow,Human,No Hair,Marvel Comics,,203.0,441.0
1,1,Abe Sapien,good,Male,blue,Icthyo Sapien,No Hair,Dark Horse Comics,blue,191.0,65.0
2,2,Abin Sur,good,Male,blue,Ungaran,No Hair,DC Comics,red,185.0,90.0
3,3,Abomination,bad,Male,green,Human / Radiation,No Hair,Marvel Comics,,203.0,441.0
4,4,Abraxas,bad,Male,blue,Cosmic Entity,Black,Marvel Comics,,,
...,...,...,...,...,...,...,...,...,...,...,...
729,729,Yellowjacket II,good,Female,blue,Human,Strawberry Blond,Marvel Comics,,165.0,52.0
730,730,Ymir,good,Male,white,Frost Giant,No Hair,Marvel Comics,white,304.8,
731,731,Yoda,good,Male,brown,Yoda's species,White,George Lucas,green,66.0,17.0
732,732,Zatanna,good,Female,blue,Human,Black,DC Comics,,170.0,57.0


In [4]:
df['Alignment'].value_counts()

Alignment
good       496
bad        207
neutral     24
Name: count, dtype: int64

### 2. Data Cleaning and Preprocessing

The raw data is not ready for a machine learning model. We start by dropping rows that have no `Alignment` label and inspecting the remaining missing values. We then filter the data to Marvel characters, fill in missing numeric values, and encode the target as numbers.

In [5]:
# Keep only rows that have an Alignment label (the target cannot be missing).
# The Marvel filter itself is applied in section 2.1.
df = df[df['Alignment'].notna()]

In [6]:
df.isna()

Unnamed: 0,ID,Name,Alignment,Gender,EyeColor,Race,HairColor,Publisher,SkinColor,Height,Weight
0,False,False,False,False,False,False,False,False,True,False,False
1,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,True,False,False
4,False,False,False,False,False,False,False,False,True,True,True
...,...,...,...,...,...,...,...,...,...,...,...
729,False,False,False,False,False,False,False,False,True,False,False
730,False,False,False,False,False,False,False,False,False,False,True
731,False,False,False,False,False,False,False,False,False,False,False
732,False,False,False,False,False,False,False,False,True,False,False
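
The full boolean table is hard to scan for a dataset of this size. A common alternative, sketched here on a hypothetical three-row frame named `sample`, is to count the missing values per column.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame mirroring a few columns of the dataset
sample = pd.DataFrame({
    "Name": ["A-Bomb", "Abraxas", "Ymir"],
    "SkinColor": [np.nan, np.nan, "white"],
    "Height": [203.0, np.nan, 304.8],
    "Weight": [441.0, np.nan, np.nan],
})

# Number of missing values in each column
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Share of missing values in each column, as a fraction between 0 and 1
print(sample.isna().mean())
```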


#### 2.1 - Filtering for the Marvel Universe

The original dataset contains characters from multiple publishers. Since our goal is to focus on Marvel, we will filter the DataFrame to keep only the rows where the 'Publisher' is 'Marvel Comics'.

**Important Note:** We add `.copy()` at the end of the filter. This ensures that our new `df` is an independent DataFrame, which prevents a common `SettingWithCopyWarning` in Pandas later on.

In [7]:
df = df[df['Publisher'] == 'Marvel Comics'].copy()
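
The role of `.copy()` can be illustrated with a minimal sketch. The frame `heroes` and the unit conversion are hypothetical; the point is only that writes to the filtered copy do not touch the original table.

```python
import pandas as pd

# Hypothetical mini-frame; the notebook filters the full heroes table
heroes = pd.DataFrame({
    "Name": ["A-Bomb", "Abe Sapien", "Abomination"],
    "Publisher": ["Marvel Comics", "Dark Horse Comics", "Marvel Comics"],
    "Height": [203.0, 191.0, 203.0],
})

# Boolean mask plus .copy() yields an independent DataFrame
marvel = heroes[heroes["Publisher"] == "Marvel Comics"].copy()

# Writing to the copy is unambiguous and leaves the original untouched
marvel["Height"] = marvel["Height"] / 100  # centimetres to metres

print(marvel)
print(heroes)
```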

In [8]:
df

Unnamed: 0,ID,Name,Alignment,Gender,EyeColor,Race,HairColor,Publisher,SkinColor,Height,Weight
0,0,A-Bomb,good,Male,yellow,Human,No Hair,Marvel Comics,,203.0,441.0
3,3,Abomination,bad,Male,green,Human / Radiation,No Hair,Marvel Comics,,203.0,441.0
4,4,Abraxas,bad,Male,blue,Cosmic Entity,Black,Marvel Comics,,,
5,5,Absorbing Man,bad,Male,blue,Human,No Hair,Marvel Comics,,193.0,122.0
8,8,Agent 13,good,Female,blue,,Blond,Marvel Comics,,173.0,61.0
...,...,...,...,...,...,...,...,...,...,...,...
726,726,X-Man,good,Male,blue,,Brown,Marvel Comics,,175.0,61.0
727,727,Yellow Claw,bad,Male,blue,,No Hair,Marvel Comics,,188.0,95.0
728,728,Yellowjacket,good,Male,blue,Human,Blond,Marvel Comics,,183.0,83.0
729,729,Yellowjacket II,good,Female,blue,Human,Strawberry Blond,Marvel Comics,,165.0,52.0


#### 2.2 - Handling Missing Numerical Data

The `Height` and `Weight` columns contain missing (`NaN`) values. Most scikit-learn estimators, including the `LogisticRegression` model we use later, reject `NaN` inputs, so we need to "impute" them, or fill them in with a reasonable guess. We will use the **median** value for each column. We choose the median over the mean because it is less sensitive to extreme outliers (like the Hulk's weight).

In [9]:
median_height = df["Height"].median()
df['Height'] = df['Height'].fillna(median_height)
df

Unnamed: 0,ID,Name,Alignment,Gender,EyeColor,Race,HairColor,Publisher,SkinColor,Height,Weight
0,0,A-Bomb,good,Male,yellow,Human,No Hair,Marvel Comics,,203.0,441.0
3,3,Abomination,bad,Male,green,Human / Radiation,No Hair,Marvel Comics,,203.0,441.0
4,4,Abraxas,bad,Male,blue,Cosmic Entity,Black,Marvel Comics,,183.0,
5,5,Absorbing Man,bad,Male,blue,Human,No Hair,Marvel Comics,,193.0,122.0
8,8,Agent 13,good,Female,blue,,Blond,Marvel Comics,,173.0,61.0
...,...,...,...,...,...,...,...,...,...,...,...
726,726,X-Man,good,Male,blue,,Brown,Marvel Comics,,175.0,61.0
727,727,Yellow Claw,bad,Male,blue,,No Hair,Marvel Comics,,188.0,95.0
728,728,Yellowjacket,good,Male,blue,Human,Blond,Marvel Comics,,183.0,83.0
729,729,Yellowjacket II,good,Female,blue,Human,Strawberry Blond,Marvel Comics,,165.0,52.0


In [10]:
median_weight = df['Weight'].median()
df['Weight'] = df['Weight'].fillna(median_weight)
df

Unnamed: 0,ID,Name,Alignment,Gender,EyeColor,Race,HairColor,Publisher,SkinColor,Height,Weight
0,0,A-Bomb,good,Male,yellow,Human,No Hair,Marvel Comics,,203.0,441.0
3,3,Abomination,bad,Male,green,Human / Radiation,No Hair,Marvel Comics,,203.0,441.0
4,4,Abraxas,bad,Male,blue,Cosmic Entity,Black,Marvel Comics,,183.0,83.0
5,5,Absorbing Man,bad,Male,blue,Human,No Hair,Marvel Comics,,193.0,122.0
8,8,Agent 13,good,Female,blue,,Blond,Marvel Comics,,173.0,61.0
...,...,...,...,...,...,...,...,...,...,...,...
726,726,X-Man,good,Male,blue,,Brown,Marvel Comics,,175.0,61.0
727,727,Yellow Claw,bad,Male,blue,,No Hair,Marvel Comics,,188.0,95.0
728,728,Yellowjacket,good,Male,blue,Human,Blond,Marvel Comics,,183.0,83.0
729,729,Yellowjacket II,good,Female,blue,Human,Strawberry Blond,Marvel Comics,,165.0,52.0
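
The difference between the two statistics is easy to see on a small example. The weights below are hypothetical values chosen to include one extreme entry and one gap; they are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical weights in kg: four typical values, one extreme, one missing
weights = pd.Series([52.0, 61.0, 83.0, 95.0, 630.0, np.nan])

# Both statistics skip NaN by default
print("mean:  ", weights.mean())
print("median:", weights.median())

# Fill the gap with the median, as the notebook does for Height and Weight
filled = weights.fillna(weights.median())
print(filled.tolist())
```

The single extreme value pulls the mean far above every typical entry, while the median stays in the middle of the bulk of the data. Note that the notebook computes its medians on the full table before the train/test split; a stricter workflow would compute them from the training rows only.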


In [11]:
# Spot-check a single character; the name 'Spider-Man' matches several rows
filt = df['Name'] == 'Spider-Man'
df.loc[filt]

Unnamed: 0,ID,Name,Alignment,Gender,EyeColor,Race,HairColor,Publisher,SkinColor,Height,Weight
622,622,Spider-Man,good,Male,hazel,Human,Brown,Marvel Comics,,178.0,74.0
623,623,Spider-Man,good,,red,Human,Brown,Marvel Comics,,178.0,77.0
624,624,Spider-Man,good,Male,brown,Human,Black,Marvel Comics,,157.0,56.0


#### 2.3 - Cleaning and Encoding the Target Variable ('Alignment')

Our target variable, 'Alignment', needs to be cleaned before we can use it.
1.  First, we remove all rows where 'Alignment' is 'neutral' to simplify our problem into a binary classification (good vs. bad).
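2.  Next, we convert the text values 'good' and 'bad' into numbers (1 and 0, respectively) so the machine learning model can understand them. We do this using a dictionary with the `.map()` method. Any value missing from the dictionary would become `NaN`, which is why the 'neutral' rows are removed first.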

In [12]:
notneutral = (df["Alignment"] != "neutral")
df = df[notneutral].copy()
df

Unnamed: 0,ID,Name,Alignment,Gender,EyeColor,Race,HairColor,Publisher,SkinColor,Height,Weight
0,0,A-Bomb,good,Male,yellow,Human,No Hair,Marvel Comics,,203.0,441.0
3,3,Abomination,bad,Male,green,Human / Radiation,No Hair,Marvel Comics,,203.0,441.0
4,4,Abraxas,bad,Male,blue,Cosmic Entity,Black,Marvel Comics,,183.0,83.0
5,5,Absorbing Man,bad,Male,blue,Human,No Hair,Marvel Comics,,193.0,122.0
8,8,Agent 13,good,Female,blue,,Blond,Marvel Comics,,173.0,61.0
...,...,...,...,...,...,...,...,...,...,...,...
726,726,X-Man,good,Male,blue,,Brown,Marvel Comics,,175.0,61.0
727,727,Yellow Claw,bad,Male,blue,,No Hair,Marvel Comics,,188.0,95.0
728,728,Yellowjacket,good,Male,blue,Human,Blond,Marvel Comics,,183.0,83.0
729,729,Yellowjacket II,good,Female,blue,Human,Strawberry Blond,Marvel Comics,,165.0,52.0


In [13]:
replacement_map = {'good': 1, 'bad': 0}
df['Alignment'] = df['Alignment'].map(replacement_map)
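
A minimal sketch of how `.map()` behaves with a dictionary, using a hypothetical four-element Series that deliberately still contains a 'neutral' label:

```python
import pandas as pd

# Hypothetical labels, including one the dictionary does not cover
alignment = pd.Series(["good", "bad", "good", "neutral"])

replacement_map = {"good": 1, "bad": 0}
encoded = alignment.map(replacement_map)

# 'neutral' has no entry in the dictionary, so it becomes NaN
print(encoded)

# Counting NaN afterwards shows whether any label was left unmapped
print("unmapped labels:", encoded.isna().sum())
```

Checking for `NaN` after the mapping is a cheap way to confirm that every label was covered before the column is used as a target.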

### 3. Feature Selection and Final Preparation

Now that our data is clean, we need to formally separate it into our features and our target.
*   **X (Features):** These are the "clues" or inputs for our model. We will start with just `Height` and `Weight`.
*   **y (Target):** This is the "answer" we want our model to predict. This is the numeric `Alignment` column.

In [14]:
y = df["Alignment"]
X = df[['Height', 'Weight']]

### 4. Model Training and Evaluation

With our data prepared, we can now use the Scikit-learn library to build our predictive model.

#### 4.1 - Splitting Data into Training and Testing Sets

Holding out test data is what lets us measure how the model performs on characters it has not seen. We split our data into two parts:
-   **Training Set (80%):** The data the model will learn from.
-   **Testing Set (20%):** The data the model has never seen, which we will use to get an honest evaluation of its performance.

We use `random_state=42` to ensure that we get the same "random" split every time we run the code, making our results reproducible.

In [15]:
from sklearn.model_selection import train_test_split

In [16]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [17]:
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)

X_train shape: (299, 2)
X_test shape: (75, 2)
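
Because the two classes are not equally frequent, one option worth considering is a stratified split, which keeps the class ratio the same in both parts. The sketch below uses synthetic data (`X_demo`, `y_demo`) rather than the notebook's `X` and `y`, and the `stratify` argument is an addition that the notebook itself does not use.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 rows, 70 labelled 1 ("good") and 30 labelled 0 ("bad")
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "Height": rng.normal(180, 15, size=100),
    "Weight": rng.normal(85, 25, size=100),
})
y_demo = pd.Series([1] * 70 + [0] * 30)

# stratify=y_demo keeps the 70/30 class ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
print(y_tr.value_counts().to_dict())
print(y_te.value_counts().to_dict())

# Repeating the call with the same random_state reproduces the same split
_, X_te_again, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
print(X_te.index.equals(X_te_again.index))
```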


#### 4.2 - Training the Model

We will use a `LogisticRegression` model, which is a great baseline for binary classification problems. The `.fit()` method is the "learning" step, where the model analyzes the training data (`X_train` and `y_train`) to find the patterns.

In [18]:
from sklearn.linear_model import LogisticRegression

In [19]:
model = LogisticRegression()

In [20]:
print(y_train.value_counts(dropna=False))

Alignment
1    210
0     89
Name: count, dtype: int64


In [21]:
model.fit(X_train, y_train)

Parameter,Value
penalty,'l2'
dual,False
tol,0.0001
C,1.0
fit_intercept,True
intercept_scaling,1
class_weight,None
random_state,None
solver,'lbfgs'
max_iter,100

#### 4.3 - Making Predictions and Evaluating Performance

Now we use our trained model to make predictions on the unseen test data (`X_test`). We then compare these predictions to the true answers (`y_test`) to calculate our model's accuracy.

In [22]:
from sklearn.metrics import accuracy_score

In [23]:
predictions = model.predict(X_test)

In [24]:
accuracy = accuracy_score(y_test, predictions)

In [25]:
print(f"Model Accuracy: {accuracy * 100:.2f}%")

Model Accuracy: 62.67%
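
An accuracy figure is easier to interpret next to a trivial baseline. The sketch below uses scikit-learn's `DummyClassifier`, which always predicts the most frequent training class. The training labels reuse the 210/89 counts printed for `y_train`, but the test labels (52 good, 23 bad) are a hypothetical mix, since the notebook does not print the composition of `y_test`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Training labels with the 210/89 counts shown for y_train
y_train_demo = np.array([1] * 210 + [0] * 89)
# Hypothetical test labels; the real y_test composition is not shown
y_test_demo = np.array([1] * 52 + [0] * 23)

# The "most_frequent" strategy ignores the features, so zeros suffice
X_train_demo = np.zeros((len(y_train_demo), 2))
X_test_demo = np.zeros((len(y_test_demo), 2))

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train_demo, y_train_demo)

baseline_accuracy = accuracy_score(y_test_demo, baseline.predict(X_test_demo))
print(f"Baseline accuracy: {baseline_accuracy * 100:.2f}%")
```

Running the same comparison with the notebook's real `X_train`, `X_test`, `y_train`, and `y_test` would show whether the logistic regression actually improves on always predicting the majority class.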


### 5. Conclusion

Our initial baseline model achieved an accuracy of **62.67%**.

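At first glance this beats a random 50/50 guess, but that is not the right baseline here because the classes are imbalanced. In the training set, 210 of 299 characters (about 70%) are 'good', so a model that always predicts 'good' would be expected to score close to 70% on a test set with a similar mix. An accuracy of 62.67% therefore does not show that height and weight alone carry a learnable signal for alignment.

The notebook still provides a working end-to-end pipeline to build on. Natural next steps are adding more features, experimenting with more powerful models, and reporting results against a majority-class baseline.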
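
As one possible starting point for adding more features, a categorical column such as `Gender` could be one-hot encoded with `pd.get_dummies`. This is only a sketch on a hypothetical four-row frame; the choice of column and of `dummy_na=True` are assumptions, not steps taken in the notebook.

```python
import pandas as pd

# Hypothetical mini-frame with one categorical and two numeric columns
heroes = pd.DataFrame({
    "Gender": ["Male", "Female", None, "Male"],
    "Height": [203.0, 165.0, 178.0, 188.0],
    "Weight": [441.0, 52.0, 77.0, 95.0],
})

# One indicator column per category; dummy_na=True adds one for missing values
X_extended = pd.get_dummies(heroes, columns=["Gender"], dummy_na=True)
print(X_extended.columns.tolist())
print(X_extended)
```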