# Cleaning Up Data about Avengers

This notebook works through a DataQuest data-cleaning challenge.

In [1]:
%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.style as style
import seaborn as sns
from pathlib import Path

In [2]:
data_path = Path.home() / "datasets" / "tabular_practice"

avengers = pd.read_csv(data_path / "avengers.csv", encoding="ISO-8859-1")
avengers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 173 entries, 0 to 172
Data columns (total 21 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   URL                          173 non-null    object
 1   Name/Alias                   163 non-null    object
 2   Appearances                  173 non-null    int64 
 3   Current?                     173 non-null    object
 4   Gender                       173 non-null    object
 5   Probationary Introl          15 non-null     object
 6   Full/Reserve Avengers Intro  159 non-null    object
 7   Year                         173 non-null    int64 
 8   Years since joining          173 non-null    int64 
 9   Honorary                     173 non-null    object
 10  Death1                       173 non-null    object
 11  Return1                      69 non-null     object
 12  Death2                       17 non-null     object
 13  Return2                      16 non

Description of columns:

* "URL":	The URL of the comic character on the Marvel Wikia
* "Name/Alias":	The full name or alias of the character
* "Appearances":	The number of comic books that character appeared in as of April 30
* "Current?":	Whether the member is currently active on an Avengers-affiliated team
* "Gender":	The recorded gender of the character
* "Probationary":	The date on which the character was given probationary status as an Avenger, where applicable
* "Full/Reserve":	The month and year the character was introduced as a full or reserve member of the Avengers
* "Year":	The year the character was introduced as a full or reserve member of the Avengers
* "Years since joining":	2015 minus the year
* "Honorary":	The status of the Avenger: "Honorary" if they were given honorary Avenger status, "Academy" if they are only in the Academy, or "Full" otherwise
* "Death1":	Yes if the Avenger died, No if not.
* "Return1":	Yes if the Avenger returned from their first death, No if they did not, blank if not applicable
* "Death2":	Yes if the Avenger died a second time after their revival, No if they did not, blank if not applicable
* "Return2":	Yes if the Avenger returned from their second death, No if they did not, blank if not applicable
* "Death3":	Yes if the Avenger died a third time after their second revival, No if they did not, blank if not applicable
* "Return3":	Yes if the Avenger returned from their third death, No if they did not, blank if not applicable
* "Death4":	Yes if the Avenger died a fourth time after their third revival, No if they did not, blank if not applicable
* "Return4":	Yes if the Avenger returned from their fourth death, No if they did not, blank if not applicable
* "Death5":	Yes if the Avenger died a fifth time after their fourth revival, No if they did not, blank if not applicable
* "Return5":	Yes if the Avenger returned from their fifth death, No if they did not, blank if not applicable
* "Notes":	Descriptions of deaths and resurrections.

In [3]:
avengers.describe(include="all")

Unnamed: 0,URL,Name/Alias,Appearances,Current?,Gender,Probationary Introl,Full/Reserve Avengers Intro,Year,Years since joining,Honorary,...,Return1,Death2,Return2,Death3,Return3,Death4,Return4,Death5,Return5,Notes
count,173,163,173.0,173,173,15,159,173.0,173.0,173,...,69,17,16,2,2,1,1,1,1,75
unique,173,162,,2,2,12,93,,,4,...,2,2,2,1,2,1,1,1,1,71
top,http://marvel.wikia.com/Henry_Pym_(Earth-616),Vance Astrovik,,NO,MALE,Jul-75,Sep-63,,,Full,...,YES,YES,NO,YES,NO,YES,YES,YES,YES,Died in New_Avengers_Vol_3_32
freq,1,2,,91,115,2,6,,,138,...,46,16,8,2,1,1,1,1,1,4
mean,,,414.052023,,,,,1988.445087,26.554913,,...,,,,,,,,,,
std,,,677.99195,,,,,30.374669,30.374669,,...,,,,,,,,,,
min,,,2.0,,,,,1900.0,0.0,,...,,,,,,,,,,
25%,,,58.0,,,,,1979.0,5.0,,...,,,,,,,,,,
50%,,,132.0,,,,,1996.0,19.0,,...,,,,,,,,,,
75%,,,491.0,,,,,2010.0,36.0,,...,,,,,,,,,,


First, we replace the column names with more manageable ones.

In [4]:
new_columns = list(avengers.columns.str.lower().str.strip().str.replace(" ", "_").str.replace("/", "_"))
new_columns

['url',
 'name_alias',
 'appearances',
 'current?',
 'gender',
 'probationary_introl',
 'full_reserve_avengers_intro',
 'year',
 'years_since_joining',
 'honorary',
 'death1',
 'return1',
 'death2',
 'return2',
 'death3',
 'return3',
 'death4',
 'return4',
 'death5',
 'return5',
 'notes']

In [5]:
new_columns[0] = "URL"
new_columns[3] = "is_current"
avengers.columns = new_columns

In [6]:
avengers["is_current"].value_counts()

is_current
NO     91
YES    82
Name: count, dtype: int64

In [7]:
avengers.loc[:, "is_current"] = avengers["is_current"] == "YES"

In [8]:
avengers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 173 entries, 0 to 172
Data columns (total 21 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   URL                          173 non-null    object
 1   name_alias                   163 non-null    object
 2   appearances                  173 non-null    int64 
 3   is_current                   173 non-null    object
 4   gender                       173 non-null    object
 5   probationary_introl          15 non-null     object
 6   full_reserve_avengers_intro  159 non-null    object
 7   year                         173 non-null    int64 
 8   years_since_joining          173 non-null    int64 
 9   honorary                     173 non-null    object
 10  death1                       173 non-null    object
 11  return1                      69 non-null     object
 12  death2                       17 non-null     object
 13  return2                      16 non
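Note that `info()` still reports "is_current" as `object` after the assignment through `.loc`. If a true boolean column is wanted, one option is plain column assignment, which replaces the column together with its dtype. A minimal sketch on a made-up frame (`toy` and its values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy stand-in for the "is_current" column (values are illustrative).
toy = pd.DataFrame({"is_current": ["YES", "NO", "YES"]})

# Plain column assignment replaces the column, so the result is a bool column.
toy["is_current"] = toy["is_current"] == "YES"

print(toy["is_current"].dtype)
```

The same pattern should carry over to `avengers["is_current"]`, although it was not run against the real data here.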

In [9]:
avengers.loc[:, "URL"] = avengers["URL"].str.strip()
avengers["URL"].str.startswith("http").sum(), avengers.shape

(173, (173, 21))

In [10]:
avengers.iloc[:, 0:10].describe(include="all")

Unnamed: 0,URL,name_alias,appearances,is_current,gender,probationary_introl,full_reserve_avengers_intro,year,years_since_joining,honorary
count,173,163,173.0,173,173,15,159,173.0,173.0,173
unique,173,162,,2,2,12,93,,,4
top,http://marvel.wikia.com/Henry_Pym_(Earth-616),Vance Astrovik,,False,MALE,Jul-75,Sep-63,,,Full
freq,1,2,,91,115,2,6,,,138
mean,,,414.052023,,,,,1988.445087,26.554913,
std,,,677.99195,,,,,30.374669,30.374669,
min,,,2.0,,,,,1900.0,0.0,
25%,,,58.0,,,,,1979.0,5.0,
50%,,,132.0,,,,,1996.0,19.0,
75%,,,491.0,,,,,2010.0,36.0,


In [11]:
avengers.query("year < 1960")

Unnamed: 0,URL,name_alias,appearances,is_current,gender,probationary_introl,full_reserve_avengers_intro,year,years_since_joining,honorary,...,return1,death2,return2,death3,return3,death4,return4,death5,return5,notes
75,http://marvel.wikia.com/Elvin_Haliday_(Earth-6...,Elvin Haliday,158,False,MALE,Feb-91,,1900,115,Probationary,...,,,,,,,,,,
76,http://marvel.wikia.com/William_Baker_(Earth-6...,William Baker,355,False,MALE,Feb-91,,1900,115,Probationary,...,YES,NO,,,,,,,,Died in Identity_Disc_Vol_1_1. Later was revea...
122,http://marvel.wikia.com/James_Santini_(Earth-6...,James Santini,40,True,MALE,,,1900,115,Academy,...,,,,,,,,,,
123,http://marvel.wikia.com/Emery_Schaub_(Earth-616)#,Emery Schaub,26,True,MALE,,,1900,115,Academy,...,,,,,,,,,,
125,http://marvel.wikia.com/Fiona_(Inhuman)_(Earth...,Fiona,2,True,FEMALE,,,1900,115,Academy,...,,,,,,,,,,
127,http://marvel.wikia.com/Hollow_(Earth-616)#,Yvette,22,True,FEMALE,,,1900,115,Academy,...,,,,,,,,,,
128,http://marvel.wikia.com/Julie_Power_(Earth-616)#,Julie Power,153,True,FEMALE,,,1900,115,Academy,...,,,,,,,,,,
129,http://marvel.wikia.com/Alani_Ryan_(Earth-616)#,Alani Ryan,73,True,FEMALE,,,1900,115,Academy,...,,,,,,,,,,
132,http://marvel.wikia.com/Johnathon_Gallo_(Earth...,Johnny Gallo,43,True,MALE,,,1900,115,Academy,...,,,,,,,,,,
133,http://marvel.wikia.com/Lyra_(Earth-8009)#,Lyra,55,True,FEMALE,,,1900,115,Academy,...,,,,,,,,,,


The Avengers first appeared in 1963, so a year of 1900 can only be a placeholder for a missing value. Let us remove these 14 rows. Alternatively, we could impute the "year" values from the characters' web pages.

In [12]:
# Take an explicit copy so that later assignments do not target a view
avengers = avengers.query("year >= 1960").copy()
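Another option, not used in this notebook, is to keep the rows and mark the placeholder years as missing. The sketch below shows the idea on a made-up frame; the 1960 cutoff mirrors the query above, while `toy` and its values are assumptions for illustration:

```python
import pandas as pd

# Toy data: 1900 plays the role of the placeholder year.
toy = pd.DataFrame({"name": ["A", "B", "C"], "year": [1963, 1900, 2010]})

# Keep years from 1960 onwards, turn everything else into a missing value,
# and use the nullable integer dtype so the valid years stay integers.
toy["year"] = toy["year"].where(toy["year"] >= 1960).astype("Int64")

print(toy)
```

This keeps all characters in the table, at the cost of having to handle missing years in every later step.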

In [13]:
avengers.iloc[:, 10:].describe(include="all")

Unnamed: 0,death1,return1,death2,return2,death3,return3,death4,return4,death5,return5,notes
count,159,68,16,16,2,2,1,1,1,1,74
unique,2,2,1,2,1,2,1,1,1,1,70
top,NO,YES,YES,NO,YES,NO,YES,YES,YES,YES,Died in New_Avengers_Vol_3_32
freq,91,45,16,8,2,1,1,1,1,1,4


In [14]:
# Work on an explicit copy; note that the name `slice` shadows the Python built-in
slice = avengers.iloc[:, 10:20].copy()
slice

Unnamed: 0,death1,return1,death2,return2,death3,return3,death4,return4,death5,return5
0,YES,NO,,,,,,,,
1,YES,YES,,,,,,,,
2,YES,YES,,,,,,,,
3,YES,YES,,,,,,,,
4,YES,YES,YES,NO,,,,,,
...,...,...,...,...,...,...,...,...,...,...
168,NO,,,,,,,,,
169,NO,,,,,,,,,
170,NO,,,,,,,,,
171,NO,,,,,,,,,


We convert the "death*" and "return*" values from "YES" and "NO" to `True` and `False`, leaving missing values untouched.

In [15]:
slice[slice.notnull()] = False

In [16]:
slice

Unnamed: 0,death1,return1,death2,return2,death3,return3,death4,return4,death5,return5
0,False,False,,,,,,,,
1,False,False,,,,,,,,
2,False,False,,,,,,,,
3,False,False,,,,,,,,
4,False,False,False,False,,,,,,
...,...,...,...,...,...,...,...,...,...,...
168,False,,,,,,,,,
169,False,,,,,,,,,
170,False,,,,,,,,,
171,False,,,,,,,,,


In [17]:
slice[avengers.iloc[:, 10:20] == "YES"] = True
slice

Unnamed: 0,death1,return1,death2,return2,death3,return3,death4,return4,death5,return5
0,True,False,,,,,,,,
1,True,True,,,,,,,,
2,True,True,,,,,,,,
3,True,True,,,,,,,,
4,True,True,True,False,,,,,,
...,...,...,...,...,...,...,...,...,...,...
168,False,,,,,,,,,
169,False,,,,,,,,,
170,False,,,,,,,,,
171,False,,,,,,,,,


In [18]:
avengers.iloc[:, 10:20] = slice
avengers.iloc[:, 10:20]

Unnamed: 0,death1,return1,death2,return2,death3,return3,death4,return4,death5,return5
0,True,False,,,,,,,,
1,True,True,,,,,,,,
2,True,True,,,,,,,,
3,True,True,,,,,,,,
4,True,True,True,False,,,,,,
...,...,...,...,...,...,...,...,...,...,...
168,False,,,,,,,,,
169,False,,,,,,,,,
170,False,,,,,,,,,
171,False,,,,,,,,,
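The masking steps above can also be written as a single mapping per column. This is a sketch on a made-up frame rather than the notebook's method; `toy` and `yes_no` are illustrative names:

```python
import numpy as np
import pandas as pd

# Toy stand-in for a pair of "death*"/"return*" columns.
toy = pd.DataFrame(
    {
        "death1": ["YES", "NO", "YES"],
        "return1": ["NO", np.nan, "YES"],
    }
)

# Map each column through the dictionary; values not found in it
# (here the missing ones) come out as NaN.
yes_no = {"YES": True, "NO": False}
converted = toy.apply(lambda col: col.map(yes_no))

print(converted)
```

One difference from the masking approach: any unexpected string (for example a lowercase "yes") would become missing instead of `False`, which may or may not be what is wanted.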


One name appears twice in "name_alias". We can tell the two characters apart by looking at "URL".

In [19]:
print(avengers.query("name_alias == 'Vance Astrovik'")["URL"])

28     http://marvel.wikia.com/Vance_Astro_(Earth-691)#
84    http://marvel.wikia.com/Vance_Astrovik_(Earth-...
Name: URL, dtype: object


In [20]:
avengers.query("name_alias == 'Vance Astrovik'")

Unnamed: 0,URL,name_alias,appearances,is_current,gender,probationary_introl,full_reserve_avengers_intro,year,years_since_joining,honorary,...,return1,death2,return2,death3,return3,death4,return4,death5,return5,notes
28,http://marvel.wikia.com/Vance_Astro_(Earth-691)#,Vance Astrovik,156,False,MALE,,Feb-78,1978,37,Honorary,...,,,,,,,,,,
84,http://marvel.wikia.com/Vance_Astrovik_(Earth-...,Vance Astrovik,302,False,MALE,,May-98,1998,17,Full,...,,,,,,,,,,


In [21]:
print(avengers.loc[28, "URL"])

http://marvel.wikia.com/Vance_Astro_(Earth-691)#


In [22]:
avengers.loc[28, "name_alias"] = "Vance Astro"
avengers["name_alias"].value_counts()

name_alias
Henry Jonathan "Hank" Pym         1
Thomas "Tommy" Shepherd           1
Doctor Stephen Vincent Strange    1
Ares                              1
James Buchanan Barnes             1
                                 ..
Scott Edward Harris Lang          1
Anthony Ludgate Druid             1
Marrina Smallwood                 1
Ravonna Lexus Renslayer           1
Kaluu                             1
Name: count, Length: 149, dtype: int64
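Instead of scanning `value_counts()`, repeated names can be listed directly. A minimal sketch on a made-up series (the values are illustrative); missing names are excluded because `duplicated` treats them as equal to each other:

```python
import pandas as pd

# Toy names, including one repeat and two missing entries.
names = pd.Series(
    ["Vance Astrovik", "Peter Benjamin Parker", "Vance Astrovik", None, None]
)

# keep=False flags every occurrence of a repeated value, not just the later ones.
dupes = names[names.duplicated(keep=False) & names.notna()]

print(dupes)
```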

In [23]:
avengers["appearances"].sort_values()

39        2
68        3
65        4
117       6
67        7
       ... 
4      2402
2      3068
92     3130
6      3458
73     4333
Name: appearances, Length: 159, dtype: int64

In [24]:
# 73 is an index label from the sorted output, so look the row up by label
avengers.loc[73]

URL                            http://marvel.wikia.com/Peter_Parker_(Earth-616)#
name_alias                                                 Peter Benjamin Parker
appearances                                                                 4333
is_current                                                                  True
gender                                                                      MALE
probationary_introl                                                          NaN
full_reserve_avengers_intro                                               Apr-90
year                                                                        1990
years_since_joining                                                           25
honorary                                                                    Full
death1                                                                      True
return1                                                                     True
death2                      
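Since rows were dropped earlier, index labels and row positions no longer have to agree, and the sorted output above shows labels. The sketch below illustrates the difference between label-based and position-based lookup on a made-up frame (`toy` and `filtered` are illustrative names):

```python
import pandas as pd

toy = pd.DataFrame({"appearances": [10, 20, 30, 40]})

# Dropping the row labelled 1 leaves the labels 0, 2, 3.
filtered = toy[toy["appearances"] != 20]

by_label = filtered["appearances"].loc[2]      # the row labelled 2
by_position = filtered["appearances"].iloc[2]  # the third remaining row (label 3)

print(by_label, by_position)
```

The two lookups agree only for rows that come before the first dropped one.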

We also convert "probationary_introl" and "full_reserve_avengers_intro" into `datetime`.

In [25]:
name = "probationary_introl"
avengers.loc[:, name] = pd.to_datetime(avengers[name])


In [26]:
name = "full_reserve_avengers_intro"
avengers[name].value_counts(dropna=False)

full_reserve_avengers_intro
Sep-63    6
13-Feb    6
10-Aug    6
Feb-78    6
10-May    6
         ..
May-93    1
Jun-92    1
Sep-92    1
Apr-91    1
15-Jan    1
Name: count, Length: 93, dtype: int64

The formats are inconsistent ("Sep-63" versus "13-Feb"), so we clean "full_reserve_avengers_intro" by extracting the three-letter month abbreviation and combining it with "year".

In [27]:
month = avengers[name].str.extract("([A-Z][a-z][a-z])")[0]
# Impute missing months with "Jan"
month = month.fillna("Jan")
avengers[name + "2"] = month + "-" + avengers["year"].astype(str)

In [28]:
avengers.loc[:, name] = avengers[name + "2"]
avengers.drop(name + "2", axis=1, inplace=True)

In [29]:
avengers.loc[:, name] = pd.to_datetime(avengers[name])


In [30]:
avengers[name].value_counts()

full_reserve_avengers_intro
1963-09-01    6
2013-02-01    6
2010-08-01    6
1978-02-01    6
2010-05-01    6
             ..
1993-05-01    1
1992-06-01    1
1992-09-01    1
1991-04-01    1
2015-01-01    1
Name: count, Length: 93, dtype: int64
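The month-extraction and parsing steps above can be tried out on a small example. This is a sketch under stated assumptions: the series `intro` and `year` are made up, and passing an explicit `format` to `pd.to_datetime` is a choice made here, not something the notebook does:

```python
import pandas as pd

# Toy values in the two formats seen above, plus one missing entry.
intro = pd.Series(["Sep-63", "13-Feb", None])
year = pd.Series([1963, 2013, 1990])

# Pull out the three-letter month abbreviation; missing months become "Jan".
month = intro.str.extract(r"([A-Z][a-z]{2})")[0].fillna("Jan")

# Combine month and four-digit year, then parse with an explicit format.
parsed = pd.to_datetime(month + "-" + year.astype(str), format="%b-%Y")

print(parsed)
```

An explicit format avoids any guessing about whether the number is a day or a year. Note that `%b` depends on the locale, so this assumes English month abbreviations.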

In [31]:
avengers.info()

<class 'pandas.core.frame.DataFrame'>
Index: 159 entries, 0 to 172
Data columns (total 21 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   URL                          159 non-null    object
 1   name_alias                   149 non-null    object
 2   appearances                  159 non-null    int64 
 3   is_current                   159 non-null    object
 4   gender                       159 non-null    object
 5   probationary_introl          13 non-null     object
 6   full_reserve_avengers_intro  159 non-null    object
 7   year                         159 non-null    int64 
 8   years_since_joining          159 non-null    int64 
 9   honorary                     159 non-null    object
 10  death1                       159 non-null    object
 11  return1                      68 non-null     object
 12  death2                       16 non-null     object
 13  return2                      16 non-null

In [32]:
(avengers["year"] + avengers["years_since_joining"] == 2015).sum()

159

Next, we fill in the missing "name_alias" values from the URLs.

In [33]:
avengers.loc[avengers["name_alias"].isnull(), "URL"]

56        http://marvel.wikia.com/Gilgamesh_(Earth-616)#
61       http://marvel.wikia.com/Dinah_Soar_(Earth-616)#
67       http://marvel.wikia.com/Monkey_Joe_(Earth-616)#
69        http://marvel.wikia.com/Tippy-Toe_(Earth-616)#
80     http://marvel.wikia.com/Melissa_Darrow_(Earth-...
81         http://marvel.wikia.com/Deathcry_(Earth-616)#
148    http://marvel.wikia.com/Captain_Universe_(Eart...
153       http://marvel.wikia.com/Ex_Nihilo_(Earth-616)#
154    http://marvel.wikia.com/Abyss_(Ex_Nihilo%27s)_...
166    http://marvel.wikia.com/Doombot_(Avenger)_(Ear...
Name: URL, dtype: object

In [34]:
# Capture the page name that precedes the first "_(" in the URL
pattern = r"^http://marvel\.wikia\.com/([\w-]+)_\("
nm_null = avengers["name_alias"].isnull()
# Turn underscores and hyphens into spaces, e.g. "Dinah_Soar" -> "Dinah Soar"
new_names = avengers.loc[nm_null, "URL"].str.extract(pattern)[0].str.replace("[_-]", " ", regex=True)
avengers.loc[nm_null, "name_alias"] = new_names
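To see what the pattern captures, it helps to run it on single URLs. The sketch below uses the standard `re` module on two URLs from the output above; `name_from_url` is a hypothetical helper introduced only for illustration:

```python
import re

# Same idea as the pandas pattern, with the dots escaped.
PATTERN = re.compile(r"^http://marvel\.wikia\.com/([\w-]+)_\(")


def name_from_url(url):
    """Hypothetical helper: return a readable name from a wiki URL, or None."""
    match = PATTERN.match(url)
    if match is None:
        return None
    # Underscores and hyphens both become spaces.
    return re.sub(r"[_-]", " ", match.group(1))


print(name_from_url("http://marvel.wikia.com/Dinah_Soar_(Earth-616)#"))
print(name_from_url("http://marvel.wikia.com/Tippy-Toe_(Earth-616)#"))
```

Replacing hyphens means that "Tippy-Toe" ends up as "Tippy Toe"; restricting the replacement to underscores would keep hyphenated names intact.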

Let us check for consistency between "URL" and "name_alias" in general.

In [35]:
all_names = avengers["URL"].str.extract(pattern)[0].str.replace("[_-]", " ", regex=True)
diff_ind = all_names != avengers["name_alias"]
avengers.loc[diff_ind, ["URL", "name_alias"]]

Unnamed: 0,URL,name_alias
0,http://marvel.wikia.com/Henry_Pym_(Earth-616),"Henry Jonathan ""Hank"" Pym"
2,http://marvel.wikia.com/Anthony_Stark_(Earth-616),"Anthony Edward ""Tony"" Stark"
5,http://marvel.wikia.com/Richard_Jones_(Earth-616),Richard Milhouse Jones
7,http://marvel.wikia.com/Clint_Barton_(Earth-616),Clinton Francis Barton
11,http://marvel.wikia.com/Hercules_(Earth-616),Heracles
...,...,...
149,http://marvel.wikia.com/Isabel_Kane_(Earth-616)#,Izzy Kane
151,http://marvel.wikia.com/Rogue_(Anna_Marie)_(Ea...,Anna Marie
155,http://marvel.wikia.com/Nightmask_(Earth-616)#,Adam
156,http://marvel.wikia.com/Kevin_Connor_(Earth-616)#,Kevin Kale Connor


In [36]:
avengers.info()

<class 'pandas.core.frame.DataFrame'>
Index: 159 entries, 0 to 172
Data columns (total 21 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   URL                          159 non-null    object
 1   name_alias                   159 non-null    object
 2   appearances                  159 non-null    int64 
 3   is_current                   159 non-null    object
 4   gender                       159 non-null    object
 5   probationary_introl          13 non-null     object
 6   full_reserve_avengers_intro  159 non-null    object
 7   year                         159 non-null    int64 
 8   years_since_joining          159 non-null    int64 
 9   honorary                     159 non-null    object
 10  death1                       159 non-null    object
 11  return1                      68 non-null     object
 12  death2                       16 non-null     object
 13  return2                      16 non-null

This completes the basic cleaning. The information encoded by the "death*" and "return*" columns remains clunky: a single string or list column could hold the same information. As a simple summary, we count the number of deaths per Avenger.

In [37]:
names = [x for x in avengers.columns if x.startswith("death")]
avengers["num_deaths"] = avengers[names].sum(axis=1).astype(int)
avengers["num_deaths"].value_counts(dropna=False)

num_deaths
0    91
1    52
2    14
3     1
5     1
Name: count, dtype: int64

In [38]:
avengers[names + ["num_deaths"]]

Unnamed: 0,death1,death2,death3,death4,death5,num_deaths
0,True,,,,,1
1,True,,,,,1
2,True,,,,,1
3,True,,,,,1
4,True,True,,,,2
...,...,...,...,...,...,...
168,False,,,,,0
169,False,,,,,0
170,False,,,,,0
171,False,,,,,0
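The single list column mentioned earlier could look roughly as follows. This is only a sketch of one possible encoding on a made-up frame; `life_events` is a hypothetical helper and the values in `toy` are illustrative:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the first two death/return pairs.
toy = pd.DataFrame(
    {
        "death1": [True, True, False],
        "return1": [True, False, np.nan],
        "death2": [True, np.nan, np.nan],
        "return2": [False, np.nan, np.nan],
    },
    dtype=object,
)


def life_events(row):
    """Hypothetical helper: names of the columns flagged True, in column order."""
    return [col for col, value in row.items() if pd.notna(value) and bool(value)]


# One list per character, aligned with the original index.
events = pd.Series([life_events(row) for _, row in toy.iterrows()], index=toy.index)

print(events)
```

With this encoding the number of deaths is the count of entries starting with "death", and an empty list means the character never died.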


In [39]:
avengers.to_csv(data_path / "avengers_cleaned.csv")
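The file written above includes the index as its first column, and CSV has no notion of date types. A sketch of reading such a file back, using an in-memory buffer and a made-up frame in place of the real file (`toy` and the column name "intro" are assumptions for illustration):

```python
import io

import pandas as pd

# Toy frame with a non-contiguous index and a datetime column.
toy = pd.DataFrame(
    {
        "name_alias": ["A", "B"],
        "intro": pd.to_datetime(["1963-09-01", "2013-02-01"]),
    },
    index=[0, 2],
)

# Round trip through CSV text instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)

# index_col=0 restores the saved index; parse_dates restores the datetime column.
restored = pd.read_csv(buffer, index_col=0, parse_dates=["intro"])

print(restored.dtypes)
```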