# Do Wrestlers Live Shorter Lives Than Other Athletes?

This is still very much a work in progress. I'm currently scraping data for NBA and NHL players, along with wrestling data.

Scraping the data was the easy part. 

Luckily, Wikipedia has a great export tool [here](https://en.wikipedia.org/wiki/Special:Export) that lets you download the XML, including the raw wikitext, for multiple pages at a time. Wikipedia also has lists of players for every major sports league:

* [NBA](https://en.wikipedia.org/wiki/Lists_of_National_Basketball_Association_players)
* [NHL](https://en.wikipedia.org/wiki/List_of_NHL_players)
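
The export step could be scripted rather than done through the web form. Below is a minimal sketch in Python 3, assuming the `title`, `pages`, `curonly`, and `action=submit` form fields behave as described in the MediaWiki manual for Special:Export. The function names, the User-Agent string, and the example titles are hypothetical and not part of this project's code.

```python
# Sketch only: builds (and optionally sends) a Special:Export request.
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumed endpoint; Special:Export is addressed through index.php
EXPORT_URL = "https://en.wikipedia.org/w/index.php"


def build_export_params(titles):
    """Return form fields asking for the current revision of each title."""
    return {
        "title": "Special:Export",
        "pages": "\n".join(titles),  # one page title per line
        "curonly": "1",              # latest revision only, no history
        "action": "submit",
    }


def fetch_export_xml(titles, user_agent="athlete-lifespan-sketch/0.1"):
    """POST the titles to Special:Export and return the XML as text."""
    data = urlencode(build_export_params(titles)).encode("utf-8")
    request = Request(EXPORT_URL, data=data,
                      headers={"User-Agent": user_agent})
    with urlopen(request) as response:
        return response.read().decode("utf-8")


# Example titles, chosen only for illustration
params = build_export_params(["Michael Jordan", "Wayne Gretzky"])
```

`fetch_export_xml` is defined but not called here because it hits the network. Sending the player lists in small batches would keep each response a manageable size.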

I didn't need to scrape NFL or MLB data because, luckily, someone had already done that:

* [MLB](http://www.seanlahman.com/baseball-archive/statistics/)
* [NFL](http://nflsavant.com/about.php)

I'm still trying to figure out what to do with the wrestlers. With so many different federations, I'm going to have to narrow it down to possibly WWE/WWF, WCW, ECW, AWA, NWA, and perhaps some of the even older feds to account for the classic wrestlers.

So great, I have the data, but parsing is an absolute mess. Unfortunately, many athletes on Wikipedia are missing infobox data even though the information is listed right on the page, so parsing is not automatic. It works for probably 75% of the data. For the rest, I have to sift through the wikitext and add lines for the missing fields so that the program can parse it properly. Fun!
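
To make the infobox problem concrete, here is a rough sketch of how birth and death dates might be pulled out of the wikitext, assuming the page uses the common `{{birth date|...}}` and `{{death date and age|...}}` templates. The function names are hypothetical and this is not the parser used in this project. Pages that give the dates only in prose would still return nothing and need manual fixes.

```python
import re
from datetime import date

# Matches e.g. {{birth date|1950|1|2}} or
# {{death date and age|df=yes|2010|3|4|1950|1|2}} (case-insensitive).
TEMPLATE_RE = re.compile(
    r"\{\{\s*(birth|death)[ _]date(?:[ _]and[ _]age)?\s*\|([^{}]*)\}\}",
    re.IGNORECASE,
)


def extract_dates(wikitext):
    """Return a dict with 'birth' and/or 'death' dates found in the text."""
    found = {}
    for kind, args in TEMPLATE_RE.findall(wikitext):
        kind = kind.lower()
        # Keep positional numeric arguments only; skips df=yes, mf=y, etc.
        parts = (p.strip() for p in args.split("|"))
        numbers = [int(p) for p in parts if p.isdigit()]
        if kind in found or len(numbers) < 3:
            continue
        try:
            # The first three numbers are the year, month, day of the event
            found[kind] = date(numbers[0], numbers[1], numbers[2])
        except ValueError:
            # Impossible date such as month 13; treat as missing
            continue
    return found


def age_in_days(wikitext):
    """Return age at death in days, or None if either date is missing."""
    dates = extract_dates(wikitext)
    if "birth" in dates and "death" in dates:
        return (dates["death"] - dates["birth"]).days
    return None
```

A page where `age_in_days` returns `None` is one that needs its wikitext patched by hand before the age can be computed.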

In [324]:
import pandas as pd
import numpy as np

from datetime import datetime

mlb = pd.read_csv('csvs/mlb.csv')
nfl = pd.read_csv('csvs/nfl.csv')

# Cutoff year: only players born after this point are kept
start_year = 1900

In [219]:
# Helper methods

def print_full(x):
    pd.set_option('display.max_rows', len(x))
    print(x)
    pd.reset_option('display.max_rows')

### MLB Data

In [327]:
# Keep only the name and birth/death date columns
mlb = mlb[['name_first', 'name_last', 'birth_year', 'birth_month', 'birth_day', 'death_year', 'death_month', 'death_day']]

# Drop players with no known birth or death year
mlb = mlb[pd.notnull(mlb['birth_year'])]
mlb = mlb[pd.notnull(mlb['death_year'])]

# .copy() avoids chained-assignment warnings in the steps below
mlb = mlb[mlb['birth_year'] > start_year].copy()

# A missing day or month defaults to 1 so a full date can still be built
for col in ['birth_day', 'birth_month', 'death_day', 'death_month']:
    mlb[col] = mlb[col].fillna(1)

# Columns with missing values are read as floats; datetime() needs ints
for col in ['birth_year', 'birth_month', 'birth_day', 'death_year', 'death_month', 'death_day']:
    mlb[col] = mlb[col].astype(int)

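# Age at death in days, built from the separate year/month/day columns
def calculate_age(x):
    birth = datetime(x['birth_year'], x['birth_month'], x['birth_day'])
    death = datetime(x['death_year'], x['death_month'], x['death_day'])
    return (death - birth).days

mlb['age'] = mlb.apply(calculate_age, axis=1)
# Convert days to years for the average
print("Average Age: {}".format(np.mean(mlb['age']) / 365.25))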

Average Age: 70.6959346871


### NFL Data

In [330]:
# Add recent deaths that are missing from the source data
def add_death(name, death_date):
    idx = nfl[nfl['name'] == name].index
    if len(idx) == 0:
        print("Not found: {}".format(name))
        return
    # .at replaces set_value, which was removed in pandas 1.0
    nfl.at[idx[0], 'death_date'] = death_date
    nfl.to_csv('csvs/nfl.csv', index=False)

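# Clean NFL dates: normalize MM/DD/YYYY values to YYYY-MM-DD
def convert_date(x):
    other_format = '%m/%d/%Y'
    proper_format = '%Y-%m-%d'
    try:
        return datetime.strptime(x, other_format).strftime(proper_format)
    except (ValueError, TypeError):
        # Not in the other format (or not a string); leave the value as-is
        return x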

def calculate_age(x):
    proper_format = '%Y-%m-%d'
    birth_date = datetime.strptime(x['birth_date'], proper_format)
    death_date = datetime.strptime(x['death_date'], proper_format)
    return (death_date - birth_date).days

nfl = nfl[['name', 'birth_date', 'death_date']]
nfl = nfl[pd.notnull(nfl['birth_date'])]
nfl = nfl[pd.notnull(nfl['death_date'])]

nfl['death_date'] = nfl['death_date'].apply(convert_date)
nfl['birth_date'] = nfl['birth_date'].apply(convert_date)

# YYYY-MM-DD strings sort chronologically, so a string comparison works here
nfl = nfl[nfl['birth_date'] > str(start_year) + "-01-01"]

nfl['age'] = nfl.apply(calculate_age, axis=1)
print("Average Age: {}".format(np.mean(nfl['age']) / 365.25))

Average Age: 68.3376989158


### NBA Data

## Resources

[http://www.wrestlingdata.com/](http://www.wrestlingdata.com/)