Data source: https://www.kaggle.com/datasets/jonathanbesomi/superheroes-nlp-dataset

Note: I did some of the data cleaning in Excel before importing the file into the Python project, so load the Excel file provided rather than the one downloaded directly from Kaggle. The code below was written for the partially cleaned version and won't properly clean the raw Kaggle file.

Specifically, I modified the Excel file as follows before importing it (in no particular order):
1. I filled in many of the missing values in the gender column, specifically the ones where the name made the gender
   obvious and/or I was familiar with the character.
2. I filled in some of the missing values for race and alignment, going by either personal knowledge of particular
   characters or by Googling the character.
3. I did the same for the creator column.
4. I replaced both the cells containing only "-" and the empty cells with cells containing "NaN".
5. I put both the weight and height data into common units, i.e. I converted any meter measurements to cm and any ton measurements to kg.
6. I deleted several rows that contained only character descriptions and no variable values, as well as empty
   rows.
7. I deleted a few rows that contained very little information (i.e. most of the cells were blank).
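
For anyone who would rather reproduce steps 4 and 5 in code than in Excel, here is a minimal sketch in pandas. The column names (`height_value`, `height_unit`) and the rows are made up for illustration and do not match the real file, and it produces real missing values instead of the text "NaN":

```python
import numpy as np
import pandas as pd

# Hypothetical raw rows; column names and values are invented for illustration.
raw = pd.DataFrame({
    "name": ["Hero A", "Hero B", "Hero C"],
    "race": ["Human", "-", ""],
    "height_value": [1.88, 203.0, 15.2],
    "height_unit": ["meters", "cm", "meters"],
})

# Step 4: treat "-" and empty cells as missing values.
raw["race"] = raw["race"].replace({"-": np.nan, "": np.nan})

# Step 5: convert meter measurements to centimeters so every row uses one unit.
in_meters = raw["height_unit"] == "meters"
raw.loc[in_meters, "height_value"] = raw.loc[in_meters, "height_value"] * 100
raw.loc[in_meters, "height_unit"] = "cm"
```

The same pattern (boolean mask, then multiply) would apply to converting tons to kg with a factor of 1000.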

In [16]:
import pandas as pd
pd.options.mode.chained_assignment = None

import numpy as np
import sklearn as sk
from sklearn.impute import KNNImputer
from sklearn.impute import SimpleImputer
from sklearn.decomposition import PCA

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

Read in and clean data:

In [17]:
dropRows = [55, 76, 78, 81, 410, 481, 486, 490, 645, 653, 657, 693,
            697, 840, 827, 830, 842, 844, 949, 950, 1059, 1063, 1077, 1079, 1081]
#original data set had nonsense rows

superheroData = \
    pd.read_excel("Superhero data.xlsx").\
    drop(dropRows).\
    drop(["full_name"], axis = 1).\
    dropna(how = "all").\
    reset_index()#.\
    #drop(["index"], axis = 1)

#Fix format of height and weight variables:
columnLength = len(superheroData)

for i in range(0, columnLength-1): #final row already has data in the correct form
    #height data is of the form "6'8 • 203 cm" which would be very difficult to work with
    #hence I'm extracting just the value in centimeters and converting to a float
    #similar for weight (the value in kg is the fourth token)

    height = str(superheroData.loc[i, "height"])
    if height != "nan":
        #strip thousands separators (e.g. "1,000") so float() can parse the number
        superheroData.loc[i, "height"] = float(height.replace(',', '').split()[2])

    weight = str(superheroData.loc[i, "weight"])
    if weight != "nan":
        superheroData.loc[i, "weight"] = float(weight.replace(',', '').split()[3])

superheroData;
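
The extraction above depends on the metric value sitting at a fixed token position. As a hedged alternative, the sketch below pulls out the number in front of "cm" or "kg" with a regular expression, wherever it appears. The helper name `metric_value` is hypothetical, and the weight format "980 lb • 441 kg" is an assumption inferred from the token positions used in the loop:

```python
import math
import re

# A number (optionally with thousands separators) followed by "cm" or "kg".
_METRIC = re.compile(r"(\d[\d,]*(?:\.\d+)?)\s*(?:cm|kg)\b")

def metric_value(cell):
    """Return the metric number in a cell such as "6'8 • 203 cm", or NaN if there is none."""
    match = _METRIC.search(str(cell))
    if match is None:
        # Missing cells (NaN) and unparseable text both end up as NaN.
        return math.nan
    return float(match.group(1).replace(",", ""))
```

Applied to a column, this would look like `superheroData["height"].map(metric_value)`, which also avoids the row-by-row loop.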

Check for and impute missing data values:

In [18]:
superheroData.isnull().sum();

In [19]:
quantitativeVariables = superheroData.iloc[:, [3, 4, 5, 6, 7, 8, 14, 15]]

knnImputer = KNNImputer(n_neighbors = 10)

quantitativeVariablesFilled = \
    pd.DataFrame(knnImputer.fit_transform(quantitativeVariables)).\
    rename(columns = {0:"intelligence score", 1:"strength score",
                      2:"speed score", 3:"durability score",
                      4:"power score", 5: "combat score",
                      6:"height (cm)", 7:"weight (kg)" })

quantitativeVariablesFilled;

In [20]:
temp = list(range(16, 68))

categoricalVariables = superheroData.iloc[:, [1, 2, 9, 10, 11, 12, 13] + temp]

simpleImputer = SimpleImputer(strategy = "most_frequent")

categoricalVariablesFilled = pd.DataFrame(simpleImputer.fit_transform(categoricalVariables))

categoricalVariablesFilled.columns = categoricalVariables.columns

categoricalVariablesFilled;

In [21]:
superheroDataFilled = pd.concat([categoricalVariablesFilled, quantitativeVariablesFilled], axis = 1)
superheroDataFilled;
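
To make the two imputation steps easier to follow, here is a small sketch on a toy table (all values are made up). It uses the same two scikit-learn imputers as above, but passes `columns` and `index` straight to `pd.DataFrame` so no renaming step is needed; that is one possible arrangement, not the only one:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

# Toy stand-in for the superhero table; every value is invented for illustration.
toy = pd.DataFrame({
    "strength score": [10.0, 80.0, np.nan, 95.0],
    "speed score": [20.0, np.nan, 60.0, 90.0],
    "durability score": [15.0, 85.0, 55.0, 90.0],
    "gender": pd.Series(["Male", "Female", np.nan, "Male"], dtype=object),
})

numeric_cols = ["strength score", "speed score", "durability score"]
categorical_cols = ["gender"]

# Numeric gaps: average the value over the 2 most similar rows.
numeric_filled = pd.DataFrame(
    KNNImputer(n_neighbors=2).fit_transform(toy[numeric_cols]),
    columns=numeric_cols, index=toy.index)

# Categorical gaps: use the most frequent category in the column.
categorical_filled = pd.DataFrame(
    SimpleImputer(strategy="most_frequent").fit_transform(toy[categorical_cols]),
    columns=categorical_cols, index=toy.index)

toy_filled = pd.concat([categorical_filled, numeric_filled], axis=1)
```

Keeping the original index on both pieces means `pd.concat` lines the rows up correctly even if rows were dropped earlier.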

Explore data:

In [22]:
color1 = "cyan"

color2 = "deeppink"

color3 = "lightgreen"

color_discrete_map = {"Male":color1, "Female":color2, "None":color3}

px.bar(superheroDataFilled, x = "creator", color = "gender", color_discrete_map = color_discrete_map)

As the bar graph above shows, the vast majority of characters in the data set are from DC, Marvel, or
Shueisha, and almost all characters are marked as male or female. I'll therefore drop the rows for other creators and the "None" gender rows, since they're unlikely to contribute
meaningfully to the overall patterns and will only complicate the analysis:

In [8]:
superheroDataFilled = \
    superheroDataFilled[
        superheroDataFilled['creator'].
        isin(['Marvel Comics','DC Comics', 'Shueisha'])]

superheroDataFilled = \
    superheroDataFilled[
        superheroDataFilled['gender'].
        isin(['Male', 'Female'])]

In [15]:
px.bar(superheroDataFilled, x = "creator", color = "gender", color_discrete_map = color_discrete_map).\
    update_layout(plot_bgcolor = "white").show();

px.bar(superheroDataFilled, x = "creator", color = "alignment").\
    update_layout(plot_bgcolor = "white").show()


Correlation matrix:

In [10]:
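#restrict to the numeric columns; recent pandas versions raise an error on text columns otherwise
superheroDataFilled.corr(numeric_only = True)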

                    intelligence score  strength score  speed score  durability score  power score  combat score  height (cm)  weight (kg)
intelligence score                 1.0        0.271467     0.416613          0.424041     0.465784       0.62699    -0.028673     0.033853
strength score                0.271467             1.0     0.689558          0.776493     0.600363      0.350889     0.137117     0.119172
speed score                   0.416613        0.689558          1.0          0.722717     0.642498      0.506413     0.032817     0.100869
durability score              0.424041        0.776493     0.722717               1.0     0.709419      0.493157     0.099295     0.095255
power score                   0.465784        0.600363     0.642498          0.709419          1.0      0.470842     0.090958     0.069777
combat score                   0.62699        0.350889     0.506413          0.493157     0.470842           1.0    -0.028433     0.016393
height (cm)                  -0.028673        0.137117     0.032817          0.099295     0.090958     -0.028433          1.0      0.42255
weight (kg)                   0.033853        0.119172     0.100869          0.095255     0.069777      0.016393      0.42255          1.0
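
Rather than scanning the table by eye, the strongest pairs can be ranked in code. The sketch below is one way to do it; `top_correlations` is a hypothetical helper, and the small matrix only repeats three of the rows and columns from the table above:

```python
import numpy as np
import pandas as pd

def top_correlations(corr, n=3):
    """Return the n largest off-diagonal correlations, counting each pair once."""
    # Keep only the cells above the diagonal so (a, b) and (b, a) are not both listed.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    return pairs.sort_values(ascending=False).head(n)

names = ["strength score", "speed score", "durability score"]
corr = pd.DataFrame(
    [[1.0, 0.689558, 0.776493],
     [0.689558, 1.0, 0.722717],
     [0.776493, 0.722717, 1.0]],
    index=names, columns=names)

top = top_correlations(corr, n=2)
```

On the full matrix the call would be `top_correlations(superheroDataFilled.corr(numeric_only=True))`, assuming a pandas version that supports `numeric_only`.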


Plot two of the most highly correlated variables (speed and durability):

In [11]:
px.scatter(superheroDataFilled, x = "speed score", y = "durability score", color = "gender")

In [12]:
px.scatter(superheroDataFilled, x = "speed score", y = "durability score", color = "alignment")

Run PCA to further narrow down which variables are the most relevant/important:

In [13]:
scoreColumns = ["intelligence score", "strength score", "speed score",
                "durability score", "power score", "combat score"]

#take the scores from the filtered data set so they have the same length as the alignment column
#(quantitativeVariablesFilled still contains the rows dropped earlier, which causes a length mismatch)
features = superheroDataFilled[scoreColumns].reset_index(drop = True)
alignment = superheroDataFilled["alignment"].reset_index(drop = True)

px.scatter_matrix(features, dimensions = scoreColumns, color = alignment).show()

pca = PCA()

components = pca.fit_transform(features)

labels = {
    str(i): f"{i} ({var:.1f}%)"
    for i, var in enumerate(pca.explained_variance_ratio_ * 100)
}

fig = px.scatter_matrix(components, labels = labels, color = alignment,
                        dimensions = range(6))

fig.update_traces(diagonal_visible=False)

fig.update_layout({'xaxis2':{'side': 'top'},
                   'xaxis4':{'side': 'top'},
                   'xaxis6':{'side': 'top'},
                   })

fig.show()

In [None]:
total_var = pca.explained_variance_ratio_
total_var
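
One common way to read the explained-variance ratios is to accumulate them and see how many components are needed to reach a chosen share of the variance. The sketch below shows the idea; the ratios are made-up numbers, not results from this data set, and `components_needed` is a hypothetical helper:

```python
import numpy as np

# Made-up explained-variance ratios for six components (illustration only).
explained = np.array([0.60, 0.20, 0.10, 0.05, 0.03, 0.02])

# Running total: share of variance captured by the first k components.
cumulative = np.cumsum(explained)

def components_needed(cumulative_ratios, threshold):
    """Smallest number of leading components whose cumulative ratio reaches the threshold."""
    count = int(np.searchsorted(cumulative_ratios, threshold) + 1)
    # Cap at the number of components in case the threshold is never reached.
    return min(count, len(cumulative_ratios))

n_components = components_needed(cumulative, 0.85)
```

With the fitted model above, the input would be `np.cumsum(pca.explained_variance_ratio_)`.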