# Avengers Database Exploration
***
### Description

As an employee of The Walt Disney Company with a great interest in the Avengers franchise, I was excited to find an extensive dataset on the Avengers published by [FiveThirtyEight](https://raw.githubusercontent.com/fivethirtyeight/data/master/avengers/avengers.csv). The goal of this project is to practice fundamental exploratory data analysis, draw some conclusions from visualizations, and see whether it's possible to build a machine learning model from the dataset's attributes.

As a note, I kept this file local to minimize the requests I send to the site. Given that this is a personal project, I don't intend to update it frequently.
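
One way to keep a local copy while still being able to fetch it on a fresh machine is sketched below. The helper name `load_avengers` and its parameters are illustrative assumptions and are not part of the original notebook; only the URL comes from the link above.

```python
from pathlib import Path
from urllib.request import urlretrieve

import pandas as pd

# Source URL taken from the description above.
AVENGERS_URL = (
    "https://raw.githubusercontent.com/fivethirtyeight/data/"
    "master/avengers/avengers.csv"
)


def load_avengers(local_path="avengers.csv", url=AVENGERS_URL, **read_kwargs):
    """Hypothetical helper: read the local copy, downloading it only if missing."""
    path = Path(local_path)
    if not path.exists():
        # Only reached when no local copy exists, so repeated runs
        # do not send further requests to the site.
        urlretrieve(url, str(path))
    # Extra keyword arguments (for example an encoding) go to read_csv.
    return pd.read_csv(path, **read_kwargs)
```

Calling `load_avengers()` a second time reads straight from disk. If pandas raises a decode error on the raw file, an `encoding=...` argument can be passed through `read_kwargs`.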
***

### Loading Necessary Libraries

In [314]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Exploring the Data

In [315]:
df = pd.read_csv('avengers.csv')
df.head()

Unnamed: 0,URL,Name/Alias,Appearances,Current?,Gender,Probationary Introl,Full/Reserve Avengers Intro,Year,Years since joining,Honorary,...,Return1,Death2,Return2,Death3,Return3,Death4,Return4,Death5,Return5,Notes
0,http://marvel.wikia.com/Henry_Pym_(Earth-616),"Henry Jonathan ""Hank"" Pym",1269,YES,MALE,,Sep-63,1963,52,Full,...,NO,,,,,,,,,Merged with Ultron in Rage of Ultron Vol. 1. A...
1,http://marvel.wikia.com/Janet_van_Dyne_(Earth-...,Janet van Dyne,1165,YES,FEMALE,,Sep-63,1963,52,Full,...,YES,,,,,,,,,Dies in Secret Invasion V1:I8. Actually was se...
2,http://marvel.wikia.com/Anthony_Stark_(Earth-616),"Anthony Edward ""Tony"" Stark",3068,YES,MALE,,Sep-63,1963,52,Full,...,YES,,,,,,,,,"Death: ""Later while under the influence of Imm..."
3,http://marvel.wikia.com/Robert_Bruce_Banner_(E...,Robert Bruce Banner,2089,YES,MALE,,Sep-63,1963,52,Full,...,YES,,,,,,,,,"Dies in Ghosts of the Future arc. However ""he ..."
4,http://marvel.wikia.com/Thor_Odinson_(Earth-616),Thor Odinson,2402,YES,MALE,,Sep-63,1963,52,Full,...,YES,YES,NO,,,,,,,Dies in Fear Itself brought back because that'...


In [316]:
df.info()
df['Death1'].value_counts()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 173 entries, 0 to 172
Data columns (total 21 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   URL                          173 non-null    object
 1   Name/Alias                   163 non-null    object
 2   Appearances                  173 non-null    int64 
 3   Current?                     173 non-null    object
 4   Gender                       173 non-null    object
 5   Probationary Introl          15 non-null     object
 6   Full/Reserve Avengers Intro  159 non-null    object
 7   Year                         173 non-null    int64 
 8   Years since joining          173 non-null    int64 
 9   Honorary                     173 non-null    object
 10  Death1                       173 non-null    object
 11  Return1                      69 non-null     object
 12  Death2                       17 non-null     object
 13  Return2                      16 non

NO     104
YES     69
Name: Death1, dtype: int64
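
To put those counts in proportion, the short sketch below rebuilds a stand-in `Death1` series from the printed totals (104 `NO`, 69 `YES`) and asks for shares instead of counts. It assumes `Death1 == 'YES'` means the Avenger died at least once; the variable names are illustrative.

```python
import pandas as pd

# Stand-in series rebuilt from the counts printed above (104 NO, 69 YES).
death1 = pd.Series(["NO"] * 104 + ["YES"] * 69, name="Death1")

# normalize=True returns each value's share of the total instead of a raw count.
shares = death1.value_counts(normalize=True)
died_share = shares["YES"]

print(f"Died at least once: {died_share:.1%}")  # 69 / 173, about 39.9%
```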

#### Observations
Right off the bat, we notice a couple of things about each column...
- `URL`: Not necessary. **We can omit this from the sorted df.**
- `Name/Alias`: Categorical variable. Will be a pseudo-index.
- `Appearances`: Quantitative variable. Likely useful in a regression model.
- `Current?`: Categorical variable. Describes whether they are a current member. **Should change this to boolean.**
- `Gender`: Categorical variable. **Should dummy this variable for modelling.**
- `Probationary Introl`: Categorical date variable. Date they appeared in comics as a probationary member. **Fix this name.**
- `Full/Reserve Avengers Intro`: Categorical date variable. Date they appeared in comics as a full member.
- `Year`: Quantitative variable. Year the Avenger was introduced.
- `Years since joining`: Quantitative variable. 2015 minus the year they were introduced. **Will need to add 6 years to bring this up to the current year.**
- `Honorary`: Categorical variable. Describes the status of the Avenger. **Will need to rename this to `Status`.**
- `Death(1-5)`: Categorical variable. "YES" if the Avenger dies. **Should combine into a number-of-deaths column.**
- `Return(1-5)`: Categorical variable. "YES" if the Avenger comes back from the dead. **Should combine into a number-of-returns column.**
- `Notes`: Categorical variable. **We can omit this from the sorted df.**

### Cleaning

In [317]:
# Current?
# Comparing against 'YES' yields a boolean column directly
df['Current?'] = df['Current?'] == 'YES'

In [318]:
# Gender
# dtype=int keeps the dummies as 0/1 (newer pandas versions default to booleans)
gender = pd.get_dummies(data=df['Gender'], dtype=int)
df = pd.concat([df, gender], axis=1)
df.drop(columns = ['Gender'], inplace=True)
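
The same encode, concatenate, and drop sequence can also be written as a single `get_dummies` call on the whole frame. This is a minimal sketch on a toy frame rather than the real dataset; the two rows mirror values shown in the `head()` output above.

```python
import pandas as pd

# Toy frame; names and genders match rows from the head() output above.
toy = pd.DataFrame({
    "Name/Alias": ["Janet van Dyne", "Thor Odinson"],
    "Gender": ["FEMALE", "MALE"],
})

# columns=[...] encodes 'Gender' and drops the original column in one call.
# The empty prefix keeps the bare level names, and dtype=int gives 0/1 values.
encoded = pd.get_dummies(
    toy, columns=["Gender"], prefix="", prefix_sep="", dtype=int
)
print(encoded)
```

The result keeps `Name/Alias` and appends one 0/1 column per gender level.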

In [319]:
# Years since joining
# The dataset counts years up to 2015; adding 6 brings the figure up to 2021
df['Years since joining'] = df['Years since joining'] + 6
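
An alternative that avoids the hard-coded offset is to recompute the column from `Year`. The sketch below assumes `Year` holds the joining year for every row and uses 2021 as the reference year (2015 plus the 6 years added above); `REFERENCE_YEAR` is an illustrative name.

```python
import pandas as pd

# Assumption: 2015 (the year the dataset counts up to) + 6 = 2021.
REFERENCE_YEAR = 2021

# Toy frame; 1963 matches the founding members shown above, 2012 is made up.
toy = pd.DataFrame({"Year": [1963, 2012]})

# Vectorised subtraction, no lambda or apply needed.
toy["Years since joining"] = REFERENCE_YEAR - toy["Year"]
print(toy)
```

For the 1963 rows this gives 58, the same value the `+ 6` adjustment produces.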

In [320]:
# Honorary (Status)
df.rename(columns = {'Honorary': 'Status'}, inplace = True)
# dtype=int keeps the dummies as 0/1 (newer pandas versions default to booleans)
status = pd.get_dummies(df['Status'], dtype=int)
df = pd.concat([df, status], axis=1)
df.drop(columns = ['Status'], inplace = True)

In [321]:
# Deaths and Returns
for i in range(1, 6):
    death_col = f'Death{i}'
    return_col = f'Return{i}'
    # Encode 'YES' as 1 and anything else (including NaN) as 0
    df[death_col] = (df[death_col] == 'YES').astype(int)
    df[return_col] = (df[return_col] == 'YES').astype(int)

df['total_deaths'] = df['Death1'] + df['Death2'] + df['Death3'] + df['Death4'] + df['Death5']
df['total_returns'] = df['Return1'] + df['Return2'] + df['Return3'] + df['Return4'] + df['Return5']
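
The tallies can also be computed without the intermediate 0/1 encoding, by comparing whole groups of columns against `'YES'` and summing across each row. This is a sketch on a toy frame with only two death/return pairs; the first row mirrors Thor's values from the `head()` output and the other two are made up.

```python
import pandas as pd

# Toy frame with two death/return pairs instead of five.
toy = pd.DataFrame({
    "Death1": ["YES", "YES", "NO"],
    "Return1": ["YES", "NO", None],
    "Death2": ["YES", None, None],
    "Return2": ["NO", None, None],
})

# Collect the column groups by name prefix before adding any new columns.
death_cols = [c for c in toy.columns if c.startswith("Death")]
return_cols = [c for c in toy.columns if c.startswith("Return")]

# eq('YES') is False for 'NO' and for missing values, so summing the
# boolean frame across each row counts only the 'YES' entries.
toy["total_deaths"] = toy[death_cols].eq("YES").sum(axis=1)
toy["total_returns"] = toy[return_cols].eq("YES").sum(axis=1)
print(toy[["total_deaths", "total_returns"]])
```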

In [322]:
# NEW COLUMN - Alive
# An Avenger counts as alive when every recorded death has a matching return
df['Alive'] = (df['total_deaths'] <= df['total_returns'])
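
To see how the `Alive` rule behaves, the sketch below applies the same comparison to four hypothetical tallies. The row labels are illustrative, and the rule compares counts only, so it assumes each return follows the death it undoes.

```python
import pandas as pd

# Hypothetical death/return tallies covering the four basic cases.
toy = pd.DataFrame(
    {"total_deaths": [0, 1, 1, 2], "total_returns": [0, 0, 1, 1]},
    index=[
        "never died",
        "died once",
        "died and returned",
        "died twice, returned once",
    ],
)

# Same comparison as the cell above: alive unless deaths outnumber returns.
toy["Alive"] = toy["total_deaths"] <= toy["total_returns"]
print(toy)
```

The second and fourth cases come out as not alive, which matches Hank Pym (1 death, 0 returns) and Thor (2 deaths, 1 return) in the cleaned table.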

In [323]:
# Renaming
df.rename(columns = {'Current?':'Current',
                     'Probationary Introl': 'Probationary_Intro',
                     'Years since joining': 'Years_since_joining',
                     'Full/Reserve Avengers Intro':'Full_Avengers_Intro'},
          inplace=True)

In [324]:
# Dropping URL, Notes, and the individual Death/Return columns
df.drop(columns = ['URL','Notes',
                   'Death1','Death2', 'Death3', 'Death4', 'Death5',
                   'Return1', 'Return2', 'Return3', 'Return4', 'Return5'], 
        inplace=True)

In [325]:
df.head()

Unnamed: 0,Name/Alias,Appearances,Current,Probationary_Intro,Full_Avengers_Intro,Year,Years_since_joining,FEMALE,MALE,Academy,Full,Honorary,Probationary,total_deaths,total_returns,Alive
0,"Henry Jonathan ""Hank"" Pym",1269,True,,Sep-63,1963,58,0,1,0,1,0,0,1,0,False
1,Janet van Dyne,1165,True,,Sep-63,1963,58,1,0,0,1,0,0,1,1,True
2,"Anthony Edward ""Tony"" Stark",3068,True,,Sep-63,1963,58,0,1,0,1,0,0,1,1,True
3,Robert Bruce Banner,2089,True,,Sep-63,1963,58,0,1,0,1,0,0,1,1,True
4,Thor Odinson,2402,True,,Sep-63,1963,58,0,1,0,1,0,0,2,1,False
