The Avengers are a well-known and widely loved team of superheroes in the Marvel universe, originally introduced in a 1960s comic book series. The recent Disney movies repopularized them as part of the Marvel Cinematic Universe.

Because the writers killed off and revived many of the superheroes, the team at FiveThirtyEight was curious to explore data from the Marvel Wikia site further.

### Challenge

While the FiveThirtyEight team did a wonderful job acquiring the data, it still has some inconsistencies. Your mission, if you choose to accept it, is to clean up their data set so it can be more useful for analysis in pandas. Let's read it into pandas as a dataframe and preview the first five rows to get a better sense of it.



In [14]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

avengers = pd.read_csv("avengers.csv")
avengers.head(5)

Unnamed: 0,URL,Name/Alias,Appearances,Current?,Gender,Probationary Introl,Full/Reserve Avengers Intro,Year,Years since joining,Honorary,...,Return1,Death2,Return2,Death3,Return3,Death4,Return4,Death5,Return5,Notes
0,http://marvel.wikia.com/Henry_Pym_(Earth-616),"Henry Jonathan ""Hank"" Pym",1269,YES,MALE,,Sep-63,1963,52,Full,...,NO,,,,,,,,,Merged with Ultron in Rage of Ultron Vol. 1. A...
1,http://marvel.wikia.com/Janet_van_Dyne_(Earth-...,Janet van Dyne,1165,YES,FEMALE,,Sep-63,1963,52,Full,...,YES,,,,,,,,,Dies in Secret Invasion V1:I8. Actually was se...
2,http://marvel.wikia.com/Anthony_Stark_(Earth-616),"Anthony Edward ""Tony"" Stark",3068,YES,MALE,,Sep-63,1963,52,Full,...,YES,,,,,,,,,"Death: ""Later while under the influence of Imm..."
3,http://marvel.wikia.com/Robert_Bruce_Banner_(E...,Robert Bruce Banner,2089,YES,MALE,,Sep-63,1963,52,Full,...,YES,,,,,,,,,"Dies in Ghosts of the Future arc. However ""he ..."
4,http://marvel.wikia.com/Thor_Odinson_(Earth-616),Thor Odinson,2402,YES,MALE,,Sep-63,1963,52,Full,...,YES,YES,NO,,,,,,,Dies in Fear Itself brought back because that'...


## Filtering out bad data

Because the data came from a crowdsourced community site, it could contain errors. If you plot a histogram of the values in the Year column, which describes the year Marvel introduced each Avenger, you'll immediately notice some oddities. For example, quite a few Avengers appear to have been introduced in 1900, which is suspicious: the Avengers weren't introduced in the comic series until the 1960s!

This is obviously a mistake in the data. As a result, you should remove all of the Avengers introduced before 1960 from the dataframe.

In [None]:
avengers['Year'].hist()

In [5]:
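# Keep only the Avengers introduced in 1960 or later.
# .copy() gives an independent dataframe, so columns can be added to it later
# without pandas raising a SettingWithCopyWarning.
true_avengers = avengers[avengers['Year'] >= 1960].copy()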

true_avengers.head()

Unnamed: 0,URL,Name/Alias,Appearances,Current?,Gender,Probationary Introl,Full/Reserve Avengers Intro,Year,Years since joining,Honorary,...,Return1,Death2,Return2,Death3,Return3,Death4,Return4,Death5,Return5,Notes
0,http://marvel.wikia.com/Henry_Pym_(Earth-616),"Henry Jonathan ""Hank"" Pym",1269,YES,MALE,,Sep-63,1963,52,Full,...,NO,,,,,,,,,Merged with Ultron in Rage of Ultron Vol. 1. A...
1,http://marvel.wikia.com/Janet_van_Dyne_(Earth-...,Janet van Dyne,1165,YES,FEMALE,,Sep-63,1963,52,Full,...,YES,,,,,,,,,Dies in Secret Invasion V1:I8. Actually was se...
2,http://marvel.wikia.com/Anthony_Stark_(Earth-616),"Anthony Edward ""Tony"" Stark",3068,YES,MALE,,Sep-63,1963,52,Full,...,YES,,,,,,,,,"Death: ""Later while under the influence of Imm..."
3,http://marvel.wikia.com/Robert_Bruce_Banner_(E...,Robert Bruce Banner,2089,YES,MALE,,Sep-63,1963,52,Full,...,YES,,,,,,,,,"Dies in Ghosts of the Future arc. However ""he ..."
4,http://marvel.wikia.com/Thor_Odinson_(Earth-616),Thor Odinson,2402,YES,MALE,,Sep-63,1963,52,Full,...,YES,YES,NO,,,,,,,Dies in Fear Itself brought back because that'...
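
To make the filtering step easier to inspect, here is a minimal sketch on a small made-up dataframe. The names `toy`, `mask`, `kept`, and `removed` and the sample values are hypothetical and are not taken from avengers.csv. It shows how a boolean mask splits the rows, which lets you check how many rows a filter drops before committing to it.

```python
import pandas as pd

# Hypothetical toy data standing in for the real avengers dataframe
toy = pd.DataFrame({
    "Name/Alias": ["A", "B", "C", "D"],
    "Year": [1900, 1963, 1959, 2010],
})

# Boolean mask: True for rows introduced in 1960 or later
mask = toy["Year"] >= 1960

# Rows that pass the filter; .copy() makes the result independent of 'toy'
kept = toy[mask].copy()

# Rows that fail the filter, useful for reviewing what is being discarded
removed = toy[~mask]

print(len(kept), "rows kept,", len(removed), "rows removed")
```

Looking at `removed` before discarding it is a cheap way to confirm that the filter only drops the rows you consider bad.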


## Consolidating deaths

We're interested in the total number of deaths each character experienced, so we'd like to have a single field containing that information. Right now, there are five fields (Death1 to Death5), each of which contains a binary value representing whether a superhero experienced that death or not. For example, a superhero could experience Death1, then Death2, and so on until the writers decided not to bring the character back to life.

We'd like to combine that information in a single field so we can perform numerical analysis on it more easily.



In [46]:
# Copy the five death columns into a separate dataframe
aux_avengers = true_avengers[['Death1', 'Death2', 'Death3', 'Death4', 'Death5']].copy()

# Change 'NO' to NaN so that only 'YES' entries remain non-null
aux_avengers.replace('NO', np.nan, inplace = True)

# Count the non-null entries in each row (axis = 1) and store the total in 'true_avengers'
true_avengers['Deaths'] = aux_avengers.notnull().sum(axis = 1)

# Display the death columns alongside the new 'Deaths' column
true_avengers[['Death1', 'Death2', 'Death3', 'Death4', 'Death5', 'Deaths']]

Unnamed: 0,Death1,Death2,Death3,Death4,Death5,Deaths
0,YES,,,,,1
1,YES,,,,,1
2,YES,,,,,1
3,YES,,,,,1
4,YES,YES,,,,2
...,...,...,...,...,...,...
168,NO,,,,,0
169,NO,,,,,0
170,NO,,,,,0
171,NO,,,,,0
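
As an alternative to replacing 'NO' with NaN, the deaths can probably also be counted by comparing each cell directly with 'YES'. The sketch below uses a small made-up dataframe with only three death columns; `toy`, `death_cols`, and `deaths` are hypothetical names, and it assumes the columns contain nothing other than 'YES', 'NO', and missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: three death columns instead of five
toy = pd.DataFrame({
    "Death1": ["YES", "NO", "YES"],
    "Death2": ["YES", np.nan, np.nan],
    "Death3": ["NO", np.nan, np.nan],
})

death_cols = ["Death1", "Death2", "Death3"]

# .eq("YES") yields True only for 'YES' cells; 'NO' and NaN both become False.
# Summing the booleans across each row (axis=1) counts the deaths per character.
deaths = toy[death_cols].eq("YES").sum(axis=1)

print(deaths.tolist())
```

Comparing against 'YES' explicitly means that an unexpected value in a death column is not counted as a death, whereas counting non-null entries would include it.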


## Verifying years since joining

For our final task, we want to verify that the Years since joining field accurately reflects the Year column. For example, if an Avenger was introduced in 1960, the Years since joining value for that Avenger should be 55, because the field counts years up to 2015.

In [52]:
# If Years since joining is accurate, Year + Years since joining equals 2015 in every row.
sum_year = true_avengers['Year'] + true_avengers['Years since joining']

# List the rows where the sum differs from 2015; an empty result means no mismatches.
sum_year[sum_year != 2015]

Series([], dtype: int64)

The result is an empty Series, so Years since joining is consistent with Year in every row.
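
The same check can be sketched on made-up data where one row is deliberately wrong, to show what a mismatch would look like. Here `REFERENCE_YEAR`, `toy`, `expected`, `mismatches`, and `all_ok` are hypothetical names, and the reference year of 2015 is an assumption inferred from the 1960 and 55 example above.

```python
import pandas as pd

# Assumption: 'Years since joining' is measured relative to 2015
REFERENCE_YEAR = 2015

# Hypothetical toy data; the last row is intentionally inconsistent (2010 + 4 = 2014)
toy = pd.DataFrame({
    "Year": [1963, 1988, 2010],
    "Years since joining": [52, 27, 4],
})

# The value each row should have if the field is accurate
expected = REFERENCE_YEAR - toy["Year"]

# Rows where the recorded value differs from the expected one
mismatches = toy[toy["Years since joining"] != expected]

# Single True/False summary of whether every row is consistent
all_ok = bool((toy["Years since joining"] == expected).all())

print(mismatches)
print("All rows consistent:", all_ok)
```

On the real data, an empty `mismatches` dataframe would correspond to the empty Series shown above.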