# Data Science and the Marvel Universe

In this IPython notebook, we'll use Marvel Wikia data from fivethirtyeight.com to explore and explain some basic data science and machine learning concepts taught in General Assembly's Data Science course.

In [1]:
# First, import packages we'll use
import pandas as pd

In [3]:
# Read data into a pandas DataFrame
df = pd.read_csv('marvel-wikia-data.csv')

Let's get to know the dataset a little by exploring its characteristics and unique values.

In [4]:
# How many characters are in this version of the dataset?
len(df)

16376

In [14]:
# What features does this dataset include?

# df.columns lists the column names on their own; head() previews
#   the first five rows along with them.
df.head()

Unnamed: 0,page_id,name,urlslug,ID,ALIGN,EYE,HAIR,SEX,GSM,ALIVE,APPEARANCES,FIRST APPEARANCE,Year
0,1678,Spider-Man (Peter Parker),/Spider-Man_(Peter_Parker),Secret Identity,Good Characters,Hazel Eyes,Brown Hair,Male Characters,,Living Characters,4043,Aug-62,1962
1,7139,Captain America (Steven Rogers),/Captain_America_(Steven_Rogers),Public Identity,Good Characters,Blue Eyes,White Hair,Male Characters,,Living Characters,3360,Mar-41,1941
2,64786,"Wolverine (James ""Logan"" Howlett)",/Wolverine_(James_%22Logan%22_Howlett),Public Identity,Neutral Characters,Blue Eyes,Black Hair,Male Characters,,Living Characters,3061,Oct-74,1974
3,1868,"Iron Man (Anthony ""Tony"" Stark)",/Iron_Man_(Anthony_%22Tony%22_Stark),Public Identity,Good Characters,Blue Eyes,Black Hair,Male Characters,,Living Characters,2961,Mar-63,1963
4,2460,Thor (Thor Odinson),/Thor_(Thor_Odinson),No Dual Identity,Good Characters,Blue Eyes,Blond Hair,Male Characters,,Living Characters,2258,Nov-50,1950
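To pull out just the feature names, pandas exposes them through the `columns` attribute of a DataFrame. Here is a minimal sketch, assuming a two-row stand-in built from the preview above rather than the full CSV (`sample_csv` and `sample_df` are names made up for this illustration):

```python
import io

import pandas as pd

# Two rows copied from the preview above, used as a stand-in for the full CSV.
sample_csv = io.StringIO(
    "page_id,name,urlslug,ID,ALIGN,EYE,HAIR,SEX,GSM,ALIVE,APPEARANCES,FIRST APPEARANCE,Year\n"
    "1678,Spider-Man (Peter Parker),/Spider-Man_(Peter_Parker),Secret Identity,"
    "Good Characters,Hazel Eyes,Brown Hair,Male Characters,,Living Characters,4043,Aug-62,1962\n"
    "2460,Thor (Thor Odinson),/Thor_(Thor_Odinson),No Dual Identity,"
    "Good Characters,Blue Eyes,Blond Hair,Male Characters,,Living Characters,2258,Nov-50,1950\n"
)
sample_df = pd.read_csv(sample_csv)

# .columns is an Index of column labels; tolist() turns it into a plain list.
column_names = sample_df.columns.tolist()
print(column_names)

# .dtypes shows how pandas parsed each column.
print(sample_df.dtypes)
```

On the real dataset the same two calls would be `df.columns.tolist()` and `df.dtypes`.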


What are the unique values and counts of each feature?

In [20]:
# feature: ID
df.ID.value_counts()

Secret Identity                  6275
Public Identity                  4528
No Dual Identity                 1788
Known to Authorities Identity      15
dtype: int64

In [21]:
# feature: ALIGN
df.ALIGN.value_counts()

Bad Characters        6720
Good Characters       4636
Neutral Characters    2208
dtype: int64

In [22]:
# feature: EYE
df.EYE.value_counts()

Blue Eyes          1962
Brown Eyes         1924
Green Eyes          613
Black Eyes          555
Red Eyes            508
White Eyes          400
Yellow Eyes         256
Grey Eyes            95
Hazel Eyes           76
Variable Eyes        49
Purple Eyes          31
Orange Eyes          25
Pink Eyes            21
One Eye              21
Gold Eyes            14
Silver Eyes          12
Violet Eyes          11
Amber Eyes           10
No Eyes               7
Multiple Eyes         7
Yellow Eyeballs       6
Black Eyeballs        3
Magenta Eyes          2
Compound Eyes         1
dtype: int64

In [23]:
# feature: HAIR
df.HAIR.value_counts()

Black Hair               3755
Brown Hair               2339
Blond Hair               1582
No Hair                  1176
Bald                      838
White Hair                754
Red Hair                  620
Grey Hair                 531
Green Hair                117
Auburn Hair                78
Blue Hair                  56
Purple Hair                47
Strawberry Blond Hair      47
Orange Hair                43
Variable Hair              32
Pink Hair                  31
Yellow Hair                20
Silver Hair                16
Gold Hair                   8
Light Brown Hair            6
Reddish Blond Hair          6
Magenta Hair                5
Orange-brown Hair           3
Bronze Hair                 1
Dyed Hair                   1
dtype: int64

In [24]:
# feature: SEX
df.SEX.value_counts()

Male Characters           11638
Female Characters          3837
Agender Characters           45
Genderfluid Characters        2
dtype: int64

In [25]:
# feature: ALIVE
df.ALIVE.value_counts()

Living Characters      12608
Deceased Characters     3765
dtype: int64
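One thing the tallies above hide: `value_counts()` skips missing values by default. The ID counts, for example, sum to 12,606, well short of the 16,376 rows in the dataset. The sketch below shows one way to make the gaps visible, using a small made-up frame (`toy` is hypothetical, not the real data):

```python
import pandas as pd

# Hypothetical toy frame; None marks a field that was left blank.
toy = pd.DataFrame({
    'ID': ['Secret Identity', 'Public Identity', None, 'Secret Identity'],
    'ALIVE': ['Living Characters', 'Deceased Characters', 'Living Characters', None],
})

counts = {}
for column in ['ID', 'ALIVE']:
    # dropna=False keeps missing values as their own row in the tally.
    counts[column] = toy[column].value_counts(dropna=False)
    print(counts[column])

# isnull().sum() counts the missing entries in each column directly.
missing = toy.isnull().sum()
print(missing)
```

Running `df.isnull().sum()` on the full DataFrame would show how many characters lack a value for each feature.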

Let's make some comparisons and unfounded correlations, just for fun!

In [27]:
# At a glance, how likely is a character to be dead based on his/her/its sex?
df.groupby('SEX').ALIVE.value_counts()

SEX                                        
Agender Characters      Living Characters        39
                        Deceased Characters       6
Female Characters       Living Characters      3074
                        Deceased Characters     763
Genderfluid Characters  Living Characters         2
Male Characters         Living Characters      8769
                        Deceased Characters    2869
dtype: int64

In [32]:
count_male_living = len(df[(df.SEX == 'Male Characters') & (df.ALIVE == 'Living Characters')])
count_male_total = len(df[df.SEX == 'Male Characters'])

print(float(count_male_living) / count_male_total)

0.753479979378


You have a three in four chance of staying alive if you're male. Nice! What about if you're female?

In [33]:
count_female_living = len(df[(df.SEX == 'Female Characters') & (df.ALIVE == 'Living Characters')])
count_female_total = len(df[df.SEX == 'Female Characters'])

print(float(count_female_living) / count_female_total)

0.801146729216


You've got an even better chance if you're female! What about if you're agender?

In [34]:
count_agender_living = len(df[(df.SEX == 'Agender Characters') & (df.ALIVE == 'Living Characters')])
count_agender_total = len(df[df.SEX == 'Agender Characters'])

print(float(count_agender_living) / count_agender_total)

0.866666666667


Sorry, boys and girls - looks like your luck is even better if you're neither male nor female.
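The three cells above repeat the same filter-and-divide pattern. As a rough sketch of a shorter route, `pd.crosstab` with `normalize='index'` computes every group's proportions in one call. To keep the example self-contained, it rebuilds a frame from the counts printed in the groupby output above; `counts`, `chars`, and `rates` are names invented for this illustration:

```python
import pandas as pd

# Counts copied from the groupby output above: (SEX, ALIVE) -> number of characters.
counts = {
    ('Agender Characters', 'Living Characters'): 39,
    ('Agender Characters', 'Deceased Characters'): 6,
    ('Female Characters', 'Living Characters'): 3074,
    ('Female Characters', 'Deceased Characters'): 763,
    ('Genderfluid Characters', 'Living Characters'): 2,
    ('Male Characters', 'Living Characters'): 8769,
    ('Male Characters', 'Deceased Characters'): 2869,
}

# Expand the counts back into one row per character.
rows = [(sex, alive) for (sex, alive), n in counts.items() for _ in range(n)]
chars = pd.DataFrame(rows, columns=['SEX', 'ALIVE'])

# normalize='index' divides each row by its total, giving per-sex proportions.
rates = pd.crosstab(chars['SEX'], chars['ALIVE'], normalize='index')
print(rates['Living Characters'].round(3))
```

On the real data the equivalent call would be `pd.crosstab(df.SEX, df.ALIVE, normalize='index')`.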

Who should you have sided with during the Civil War - Captain America or Iron Man? Pro- or anti-registration? Let's take a look at survival rates based on identity status.

In [36]:
# At a glance, how likely is a character to be dead based on 
#   his/her/its identity status?
df.groupby('ID').ALIVE.value_counts()

ID                                                
Known to Authorities Identity  Living Characters        14
                               Deceased Characters       1
No Dual Identity               Living Characters      1345
                               Deceased Characters     443
Public Identity                Living Characters      3484
                               Deceased Characters    1044
Secret Identity                Living Characters      4647
                               Deceased Characters    1628
dtype: int64

In [38]:
count_secret_living = len(df[(df.ID == 'Secret Identity') & (df.ALIVE == 'Living Characters')])
count_secret_total = len(df[df.ID == 'Secret Identity'])

print(float(count_secret_living) / count_secret_total)

0.740557768924


Male characters and those with secret identities seem to have similar rates of survival. What about those living (or dying) publicly?

In [39]:
count_public_living = len(df[(df.ID == 'Public Identity') & (df.ALIVE == 'Living Characters')])
count_public_total = len(df[df.ID == 'Public Identity'])

print(float(count_public_living) / count_public_total)

0.769434628975


Sorry, Cap - looks like Stark edges this one out. You know his dad built your shield, right?

Okay, Jarvis: what about heroes like Thor, who don't have dual identities?

In [40]:
count_nodual_living = len(df[(df.ID == 'No Dual Identity') & (df.ALIVE == 'Living Characters')])
count_nodual_total = len(df[df.ID == 'No Dual Identity'])

print(float(count_nodual_living) / count_nodual_total)

0.752237136465


Great beard of Odin - Stark takes another! At least he can't lift your hammer?
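Before crowning a winner, it helps to see each rate next to the size of the group behind it. The sketch below tabulates both from the counts in the groupby output above (`summary` and its column names are made up for this example):

```python
import pandas as pd

# Counts copied from the groupby output above.
summary = pd.DataFrame({
    'ID': ['Known to Authorities Identity', 'No Dual Identity',
           'Public Identity', 'Secret Identity'],
    'living': [14, 1345, 3484, 4647],
    'deceased': [1, 443, 1044, 1628],
}).set_index('ID')

# Group size and share of living characters for each identity status.
summary['total'] = summary['living'] + summary['deceased']
summary['survival_rate'] = summary['living'] / summary['total']

# Highest survival rate first.
print(summary.sort_values('survival_rate', ascending=False))
```

The "Known to Authorities Identity" group tops the table, but it rests on only 15 characters, so its rate is far less stable than those of the other three groups.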

### The Good, The Bad, and The Neutral: Prediction by Physical Traits

In this section, we'll use the <i>K-Nearest Neighbors</i> and <i>Naive Bayes</i> classification methods to see if we can use physical features to predict whether a character will be <b>Good</b>, <b>Bad</b>, or <b>Neutral</b>.

Yes, we're <i>kind of</i> racially profiling Marvel characters. But it's for educational purposes, so maybe it's not so bad?
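To give a feel for the K-Nearest Neighbors idea before using any library, here is a from-scratch sketch in plain Python. It is not scikit-learn and not necessarily how the course implements it. The training rows are the five characters from the preview earlier in the notebook, and the distance is a simple mismatch count (Hamming distance), which is an assumption chosen because the features are categorical. `training`, `hamming`, and `knn_predict` are illustrative names only.

```python
from collections import Counter

# Training rows taken from the five-character preview earlier in the notebook:
# (EYE, HAIR, SEX) -> ALIGN
training = [
    (('Hazel Eyes', 'Brown Hair', 'Male Characters'), 'Good Characters'),    # Spider-Man
    (('Blue Eyes', 'White Hair', 'Male Characters'), 'Good Characters'),     # Captain America
    (('Blue Eyes', 'Black Hair', 'Male Characters'), 'Neutral Characters'),  # Wolverine
    (('Blue Eyes', 'Black Hair', 'Male Characters'), 'Good Characters'),     # Iron Man
    (('Blue Eyes', 'Blond Hair', 'Male Characters'), 'Good Characters'),     # Thor
]


def hamming(a, b):
    """Number of features on which two characters differ."""
    return sum(x != y for x, y in zip(a, b))


def knn_predict(features, k=3):
    """Majority alignment among the k training rows closest to `features`."""
    # sorted() is stable, so ties keep their order in the training list.
    nearest = sorted(training, key=lambda row: hamming(row[0], features))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical new character: blue eyes, black hair, male.
prediction = knn_predict(('Blue Eyes', 'Black Hair', 'Male Characters'))
print(prediction)
```

With five training rows this is only a toy. The point is the mechanism: measure how far a new character is from each known one, then let the closest few vote on the label.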