### Survival Analysis

* This notebook applies survival analysis to cricket in order to analyze the career lengths of players.

* Survival analysis has been described at length in various resources. The following are the major sources I used to
understand the concept:
    * Allen B. Downey's book on exploratory data analysis in Python includes a great chapter on survival curves, hazard functions, Kaplan–Meier estimators etc.
    
    http://greenteapress.com/thinkstats2/thinkstats2.pdf 
    
    * Econometrics Academy's notes on survival analysis http://sites.google.com/site/econometricsacademy/econometrics-models/survival-analysis
     
    * Cam Davidson-Pilon's documentation for lifelines, a Python library for survival analysis.
    
         http://lifelines.readthedocs.org/en/latest/Quickstart.html

* Survival analysis is used in areas where, for a sample of observations, the time that elapses until an event of "death" occurs is analyzed.
* It is widely used in mechanical engineering, where the lifetime of a tool or product is analyzed, and in the medical sciences, where, for example, the lifetime of cancer patients is analyzed.
* This is an attempt to extend this statistical concept to the field of cricket by analyzing the career lengths of players.

* The event of death in this case is a player's retirement from active cricket.
* I have tried to analyze all the players who have played ODI cricket.
* Test cricket wasn't chosen because its data is too noisy: many careers were marred by the World Wars, the apartheid crisis, Kerry Packer's cricket series etc.

* There is no readily available cricket data yet. ESPNcricinfo still doesn't provide an API for its Statsguru database, so I had to scrape the data from the Statsguru web pages.
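
For reference, the survival curve and the Kaplan–Meier estimator mentioned in the resources above can be written compactly. This is standard textbook notation rather than anything specific to this notebook, and the symbols are my own choice: $T$ is the career length of a player, $t_i$ are the distinct career lengths at which at least one retirement is observed, $d_i$ is the number of retirements at $t_i$, and $n_i$ is the number of players whose careers have lasted at least $t_i$.

$$
S(t) = P(T > t), \qquad \hat{S}(t) = \prod_{i:\, t_i \le t} \left(1 - \frac{d_i}{n_i}\right)
$$

$S(t)$ is the survival function, i.e. the probability that a career lasts longer than $t$ years, and $\hat{S}(t)$ is its Kaplan–Meier estimate from the observed careers.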

In [4]:
%matplotlib inline
from bs4 import BeautifulSoup
import requests
import pandas as pd
import lifelines
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
matplotlib.style.use('ggplot')

#### Scraping
The following function scrapes the required data from the web pages.

In [14]:
statsguru_query_url = 'http://stats.espncricinfo.com/ci/engine/stats/index.html?class=2;filter=advanced;orderby=runs;page=1;size=200;template=results;type=batting'

In [15]:
def scrape_data(page_count):
    """
    scrapes the required data present in the form of a table from the Statsguru query url
    :param page_count: number of the results page to fetch
    :return: the raw unicode text of the results table
    """
    # substitute the requested page number for 'page=1' in the query url
    complete_url = statsguru_query_url.replace('page=1', 'page=' + str(page_count), 1)
    r = requests.get(complete_url)
    data = r.text
    soup = BeautifulSoup(data)
    tables = soup.find_all('table')
    # the third table on the page is the one whose text is used
    return tables[2].text
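
`scrape_data` builds the URL for each page by substituting the page number into the query string. A slightly more defensive way to do that substitution is sketched below. The helper name `build_page_url` is hypothetical, and the sketch assumes that the query parameters are separated by `;` (as in the URL used here), `&` or `?`.

```python
import re

# Same query URL as used in this notebook.
BASE_URL = ('http://stats.espncricinfo.com/ci/engine/stats/index.html?'
            'class=2;filter=advanced;orderby=runs;page=1;size=200;'
            'template=results;type=batting')


def build_page_url(base_url, page):
    """Return base_url with its 'page=<n>' parameter set to `page`."""
    # The lookbehind only matches a parameter named exactly 'page',
    # whatever page number the base URL currently contains.
    return re.sub(r'(?<=[?;&])page=\d+', 'page={}'.format(page), base_url, count=1)


urls = [build_page_url(BASE_URL, p) for p in range(1, 4)]
```

Unlike a plain string search for `page=1`, this does not depend on the base URL pointing at the first page.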

#### Data Cleaning
Once the data is scraped, it has to be cleaned: whitespace and other noise are stripped to get it into a proper structure.

In [16]:
def clean_data(text):
    """
    splits the table text into lines and drops the 'Overall figures' line
    :param text: table text of each page
    :return: list of lines (empty strings are removed later, in create_data)
    """
    text = text.split('\n')
    text.remove(u'Overall figures')
    return text
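
To see what this cleaning step does, here is a small sketch on a hand-written sample. The sample string only imitates the layout of the scraped table text and is an assumption, and `clean_lines` is a hypothetical variant of `clean_data` that also strips surrounding whitespace and tolerates a missing 'Overall figures' line.

```python
def clean_lines(text, header='Overall figures'):
    """Split table text into stripped lines, dropping one header line if present."""
    lines = [line.strip() for line in text.split('\n')]
    if header in lines:
        # list.remove would raise ValueError if the header were absent
        lines.remove(header)
    return lines


# Hand-written sample, not real scraped output.
sample = 'Overall figures\nPlayer\nSpan\n\n SR Tendulkar (India) \n1989-2012\n'
cleaned = clean_lines(sample)
```

The empty strings are left in place on purpose, since the transformation step that follows removes them.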

#### Data Transformation
The data is now transformed into a structure that can be modelled effectively.
A series of helper functions and transformations is applied to build a dataframe fit for modelling.

In [73]:
def create_data():
    """
    scrapes the data, cleans it and transforms the data to load into a pandas dataframe
    :return: dataframe with one row per player
    """
    page_count = 12  # number of result pages to fetch

    df = pd.DataFrame()

    # column names: the cleaned lines converted to str, then sliced
    get_list_columns = lambda text, start_index, end_index: [str(unicode_text)
                                                                for unicode_text in text][start_index:end_index]

    # everything from start_index onwards is table data
    get_data_rows = lambda text, start_index: text[start_index:]

    remove_all_occurences = lambda data, item: [x for x in data if x != item]

    # regroup the flat list of cells into rows of 13 columns each
    get_list_rows = lambda data: [data[index: index + 13] for index in range(0, len(data), 13)]

    for page in range(1, page_count + 1):
        raw_text = scrape_data(page)
        clean_text = clean_data(raw_text)
        list_columns = get_list_columns(clean_text, 3, 16)
        data_rows = get_data_rows(clean_text, 16)
        data_rows = remove_all_occurences(data_rows, u'')
        list_rows = get_list_rows(data_rows)
        df_new = pd.DataFrame(list_rows, columns=list_columns)
        if len(df) == 0:
            df = df_new
        else:
            df = pd.concat([df, df_new])
    return df
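
The key step in `create_data` is regrouping a flat list of cell values into rows of 13 columns. The sketch below shows the same idea with 3 columns for brevity. The helper name `chunk` is hypothetical, and the length check is an addition that the notebook itself does not perform.

```python
def chunk(values, width):
    """Split a flat list into consecutive rows of `width` items each."""
    if len(values) % width != 0:
        # A leftover partial row usually means a cell went missing upstream.
        raise ValueError('expected a multiple of {} values, got {}'.format(width, len(values)))
    return [values[i:i + width] for i in range(0, len(values), width)]


# Player, span and matches for two players, taken from the table in this notebook.
flat = ['SR Tendulkar (India)', '1989-2012', '463',
        'RT Ponting (Aus/ICC)', '1995-2012', '375']
rows = chunk(flat, 3)
```

Failing loudly on a partial row is a design choice: a silently truncated row would otherwise show up as missing values in the dataframe.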

In [21]:
df_full = create_data()
df_full = df_full.reset_index(drop=True)

* The scraped data is stored in the form of a pandas dataframe.

In [25]:
len(df_full)

2244

* In total, 2244 players have played ODI cricket since its inception in the 1970s.

* The following displays the first five rows of the data.
* The data is by default sorted by the number of runs scored.

In [23]:
df_full.head()

| | Player | Span | Mat | Inns | NO | Runs | HS | Ave | BF | SR | 100 | 50 | 0 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | SR Tendulkar (India) | 1989-2012 | 463 | 452 | 41 | 18426 | 200* | 44.83 | 21367 | 86.23 | 49 | 96 | 20 |
| 1 | KC Sangakkara (Asia/ICC/SL) | 2000-2015 | 404 | 380 | 41 | 14234 | 169 | 41.98 | 18048 | 78.86 | 25 | 93 | 15 |
| 2 | RT Ponting (Aus/ICC) | 1995-2012 | 375 | 365 | 39 | 13704 | 164 | 42.03 | 17046 | 80.39 | 30 | 82 | 20 |
| 3 | ST Jayasuriya (Asia/SL) | 1989-2011 | 445 | 433 | 18 | 13430 | 189 | 32.36 | 14725 | 91.2 | 28 | 68 | 34 |
| 4 | DPMD Jayawardene (Asia/SL) | 1998-2015 | 448 | 418 | 39 | 12650 | 144 | 33.37 | 16020 | 78.96 | 19 | 77 | 28 |


Rename the columns to make them easier to access.

In [28]:
df_full.columns = ['player', 'span', 'mat', 'inns', 'not_outs', 'runs', 'high_score', 'ave', 'bf', 'sr', 'n_100', 'n_50', 'n_0']

In [29]:
df_span = df_full[['player','span']].copy()  # work on a copy so that new columns are not set on a slice of df_full

In [30]:
calc_career_length = lambda span: [ int(each_span.partition('-')[-1]) - int(each_span.partition('-')[0]) for each_span in span]
df_span['career_length'] = calc_career_length(df_span.span)

In [40]:
df_span['censor'] = 1  # the same flag for every player

In [54]:
df_span['career_start_date'] = [int(span.partition('-')[0]) for span in df_span.span]
df_span['career_end_date'] = [int(span.partition('-')[-1]) for span in df_span.span]
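
The three span-derived columns all come from the same split of the `span` string. A minimal sketch of that parsing as a single helper is given below. The name `parse_span` is hypothetical, and the sketch assumes that every span has the form `YYYY-YYYY`.

```python
def parse_span(span):
    """Split a 'YYYY-YYYY' span into (start_year, end_year, career_length)."""
    start, _, end = span.partition('-')
    start_year, end_year = int(start), int(end)
    if end_year < start_year:
        raise ValueError('span ends before it starts: {!r}'.format(span))
    return start_year, end_year, end_year - start_year


result = parse_span('1989-2012')  # (1989, 2012, 23)
```

Parsing each span once keeps the start year, end year and career length consistent with one another.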


In [71]:
df_span

| | player | span | career_length | censor | career_start_date | career_end_date |
|---|---|---|---|---|---|---|
| 0 | SR Tendulkar (India) | 1989-2012 | 23 | 1 | 1989 | 2012 |
| 1 | KC Sangakkara (Asia/ICC/SL) | 2000-2015 | 15 | 1 | 2000 | 2015 |
| 2 | RT Ponting (Aus/ICC) | 1995-2012 | 17 | 1 | 1995 | 2012 |
| 3 | ST Jayasuriya (Asia/SL) | 1989-2011 | 22 | 1 | 1989 | 2011 |
| 4 | DPMD Jayawardene (Asia/SL) | 1998-2015 | 17 | 1 | 1998 | 2015 |
| 5 | Inzamam-ul-Haq (Asia/Pak) | 1991-2007 | 16 | 1 | 1991 | 2007 |
| 6 | JH Kallis (Afr/ICC/SA) | 1996-2014 | 18 | 1 | 1996 | 2014 |
| 7 | SC Ganguly (Asia/India) | 1992-2007 | 15 | 1 | 1992 | 2007 |
| 8 | R Dravid (Asia/ICC/India) | 1996-2011 | 15 | 1 | 1996 | 2011 |
| 9 | BC Lara (ICC/WI) | 1990-2007 | 17 | 1 | 1990 | 2007 |
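
The notebook gives every player the same `censor` value of 1. A player whose span ends in the most recent season of the data may still be active, in which case the retirement has not been observed yet (right-censoring). The sketch below shows one way this could be handled, and it is not what the notebook does. It assumes that 2015 is the last season in the data and that a span ending in 2015 means the career may be ongoing. `LAST_SEASON` and `kaplan_meier` are hypothetical names, and the estimator is a plain-Python version of the Kaplan–Meier product, not the lifelines implementation.

```python
from collections import Counter


def kaplan_meier(durations, observed):
    """Return [(t, S(t))] at each duration where at least one event was observed."""
    events = Counter(d for d, o in zip(durations, observed) if o)
    leaving = Counter(durations)   # observed or censored, everyone leaves the risk set
    at_risk = len(durations)
    survival, curve = 1.0, []
    for t in sorted(leaving):
        d = events.get(t, 0)
        if d:
            survival *= 1.0 - float(d) / at_risk
            curve.append((t, survival))
        at_risk -= leaving[t]
    return curve


LAST_SEASON = 2015  # assumption: the last season covered by the scraped data

# (start, end) years of the first five players in the table above.
spans = [(1989, 2012), (2000, 2015), (1995, 2012), (1989, 2011), (1998, 2015)]
durations = [end - start for start, end in spans]
observed = [end < LAST_SEASON for _, end in spans]  # False means censored
curve = kaplan_meier(durations, observed)
```

Five careers are far too few for a meaningful estimate, so this only illustrates the mechanics. With lifelines, the same two lists could be passed as `KaplanMeierFitter().fit(durations, event_observed=observed)`.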
