# FIBA Europe Machine Learning Project

## Part 2: Processing Data

As a reminder, the raw XML data looks like this:

In [7]:
# Let's look at a sample again
from itertools import islice

example_match_id = '60636'

# Print the first 40 lines of the raw match file; the context manager closes the file
with open("fiba_europe_example/" + str(example_match_id) + ".xml", "r") as f:
    for line in islice(f, 40):
        print(line)

<FE>

  <HEADER competition="EuroChallenge" round="Last Sixteen, Group L" quarter="4" time="FINAL" logo="http://live.fibaeurope.com/www/gallery/C61B71F3-EC5A-4315-A20B-3AAEA3751891.jpg" duration="15,6432">

    <TEAM name="TS Medical Park" logo="http://www.fibaeurope.com/files/{6A8BCF6E-9977-4706-989E-B582089D3D40}logo_big.gif" pts="76" fouls="7" />

    <TEAM name="Mons-Hainaut" logo="http://www.fibaeurope.com/files/{ECF8A606-44AF-4D55-B2B6-91F54B3977F8}logo_big.gif" pts="68" fouls="4" />

    <QUARTERS>

      <QUARTER n="1" scoreA="23" scoreB="12" time="100" />

      <QUARTER n="2" scoreA="21" scoreB="23" time="100" />

      <QUARTER n="3" scoreA="14" scoreB="10" time="100" />

      <QUARTER n="4" scoreA="18" scoreB="23" time="100,00" />

    </QUARTERS>

  </HEADER>

  <TICKER text="J. Love [MON] - 10 rebounds" duration="0">

    <ITEM text="TS Medical Park - 47,6% FT (10/21)" />

  </TICKER>

  <OVERVIEW duration="46,8509" />

  <PLAYBYPLAY homeTeamImg="http://www.fibaeurope.co

### What to Do

We want to parse the contents of the raw XML file(s) into a flat table with one row per event in the game. Each piece of information attached to an event gets its own column.

For example, a personal foul committed by the home team on an away team player would yield the following values in the following columns:
* stat_action_awayteam: 'Foul Drawn'   
* stat_action_hometeam: 'Foul Committed'
* foul_drawn_player_hometeam: 'H. Owens'
* foul_committed_player_awayteam: 'M. Delaney'
* player_personal_fouls_committed_awayteam: 2 (+1)
* player_personal_fouls_committed_hometeam: 1 (+1)
* team_fouls_committed_awayteam: 4 (+1)
* team_fouls_committed_hometeam: 3 (+1)
* scoring_stat_awayteam_full: '[ELA] M. Delaney - personal shooting foul (3 PF, 7th team foul). 2 free throws awarded. Foul drawn: H. Owens'
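
Most of the values in the list above are embedded in the full-text string of the last bullet. The following is a minimal sketch of how such a string could be split with a regular expression. The pattern and the `parse_foul` helper are illustrative assumptions inferred from this single example, not the parsing used later in this notebook, and real files may contain variants the pattern does not cover.

```python
import re

# Illustrative pattern inferred from the single example string above;
# real match files may contain variants that it does not cover.
FOUL_PATTERN = re.compile(
    r"^(?:\[(?P<team>[A-Z]+)\]\s*)?"                      # optional team tag, e.g. [ELA]
    r"(?P<player>.+?) - (?P<foul_type>[^(]*?foul)\s*"     # player and foul description
    r"\((?P<personal>\d+) PF, (?P<team_fouls>\d+)(?:st|nd|rd|th) team foul\)"
    r"(?:.*?Foul drawn: (?P<drawn_by>.+))?"               # optional fouled player
)


def parse_foul(text):
    """Return the foul fields found in `text`, or None if it is not a foul line."""
    match = FOUL_PATTERN.match(text)
    if match is None:
        return None
    drawn_by = match.group("drawn_by")
    return {
        "team_tag": match.group("team"),
        "foul_committed_player": match.group("player").strip(),
        "foul_type": match.group("foul_type").strip(),
        "player_personal_fouls_committed": int(match.group("personal")),
        "team_fouls_committed": int(match.group("team_fouls")),
        "foul_drawn_player": drawn_by.strip() if drawn_by else None,
    }


example = ("[ELA] M. Delaney - personal shooting foul (3 PF, 7th team foul). "
           "2 free throws awarded. Foul drawn: H. Owens")
parsed = parse_foul(example)
print(parsed)
```

Lines that are not fouls (for example `START OF GAME`) make `parse_foul` return `None`, so a caller can fall through to other parsers.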


Also, while match files don't include some important metadata like date, age, or sex, they do include some basic values that I want to keep track of, such as:
* team_name_hometeam
* team_name_awayteam
* competition_name
* league_name



### 2.1 - Extract match metadata

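Each match file (usually) contains the competition and team names for the match, as well as the score of each period. In the sample above, the `scoreA`/`scoreB` values of the four quarters add up to the final 76-68, so they are points scored within each period rather than running totals.

The metadata will later be appended to each row of the big flat dataframe.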
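
As a quick check of how the header values relate to each other, here is a minimal sketch that parses a trimmed copy of the sample `HEADER` block with the standard library. The snippet drops the logo, time and duration attributes for brevity, and the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the HEADER block from the sample file shown earlier
# (logo, time and duration attributes removed for brevity).
SAMPLE_HEADER = """
<FE>
  <HEADER competition="EuroChallenge" round="Last Sixteen, Group L">
    <TEAM name="TS Medical Park" pts="76" fouls="7" />
    <TEAM name="Mons-Hainaut" pts="68" fouls="4" />
    <QUARTERS>
      <QUARTER n="1" scoreA="23" scoreB="12" />
      <QUARTER n="2" scoreA="21" scoreB="23" />
      <QUARTER n="3" scoreA="14" scoreB="10" />
      <QUARTER n="4" scoreA="18" scoreB="23" />
    </QUARTERS>
  </HEADER>
</FE>
""".strip()

root = ET.fromstring(SAMPLE_HEADER)
header = root.find("HEADER")

# Map period number -> (points for team A, points for team B)
quarter_points = {
    int(q.get("n")): (int(q.get("scoreA")), int(q.get("scoreB")))
    for q in header.iter("QUARTER")
}

total_a = sum(a for a, _ in quarter_points.values())
total_b = sum(b for _, b in quarter_points.values())
team_points = [int(team.get("pts")) for team in header.findall("TEAM")]

print(quarter_points)
print(total_a, total_b, team_points)
```

The per-quarter sums match the `pts` attributes of the two `TEAM` elements, which is a cheap consistency check to run on files where some attributes might be missing.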

In [39]:
class ProcessFibaEuropeMatch:
    
    def __init__(self, match_id, root):
        """ 
        Args:
            match_id (int): unique identifier for match
            root (parsed xml): content of match, parsed with xml.etree        
        """
        self.match_id = match_id
        self.root = root

        
    def extract_metadata_from_root(self):

        """ The XML file has metadata in a separate section for each match, including team names,
        scores for each period, competition name etc.
            This scrapes a match file and extracts metadata

        Returns:
            Dictionary with extracted metadata values that can be found in match file 

        """
        match_metadata_dict = {
                             'match_id' : None
                             ,'competition_name' : None
                             ,'competition_round' : None
                             ,'team_name_hometeam' : None
                             ,'team_name_awayteam' : None
                              ,'final_score_hometeam' : None
                              ,'final_score_awayteam' : None
                              ,'total_fouls_hometeam' : None
                              ,'total_fouls_awayteam' : None
                              ,'ending_score_period1_hometeam' : None
                              ,'ending_score_period1_awayteam' : None                                         
                              ,'ending_score_period2_hometeam' : None
                              ,'ending_score_period2_awayteam' : None
                              ,'ending_score_period3_hometeam' : None
                              ,'ending_score_period3_awayteam' : None
                              ,'ending_score_period4_hometeam' : None
                              ,'ending_score_period4_awayteam' : None                                          
                              ,'ending_score_period5_hometeam' : None
                              ,'ending_score_period5_awayteam' : None
                              ,'starting_five_hometeam'        : None                                   
                              ,'starting_five_awayteam'        : None                                   

                }

        match_metadata_dict['match_id'] = self.match_id

        # Team names are attributes of the PLAYBYPLAY element
        for child in self.root.findall('.//PLAYBYPLAY'):
            try:
                match_metadata_dict['team_name_hometeam'] = child.attrib['homeTeamName']
                match_metadata_dict['team_name_awayteam'] = child.attrib['awayTeamName']
            except KeyError:
                # Attribute missing in this file; keep the None defaults
                pass

        # Competition name and round are attributes of the HEADER element
        for child in self.root.findall('.//HEADER'):

            try:
                match_metadata_dict['competition_name'] = child.attrib['competition']
                match_metadata_dict['competition_round'] = child.attrib['round']
            except KeyError:
                pass



            for header_child in child:
                # TEAM elements carry the final score and total fouls of one team
                if header_child.tag == 'TEAM':
                    try:

                        if  match_metadata_dict['team_name_hometeam'] == header_child.attrib['name']:

                            match_metadata_dict['final_score_hometeam'] = header_child.attrib['pts']
                            match_metadata_dict['total_fouls_hometeam'] = header_child.attrib['fouls']

                        if  match_metadata_dict['team_name_awayteam'] == header_child.attrib['name']:

                            match_metadata_dict['final_score_awayteam'] = header_child.attrib['pts']
                            match_metadata_dict['total_fouls_awayteam'] = header_child.attrib['fouls']

                    except KeyError:
                        pass

                # QUARTER elements carry the score of each period
                if header_child.tag == 'QUARTERS':
                    for quarter in header_child:
                        if quarter.tag == 'QUARTER':
                            try:
                                quartername_hometeam = 'ending_score_period' + str(quarter.attrib['n']) + '_hometeam'
                                quartername_awayteam = 'ending_score_period' + str(quarter.attrib['n']) + '_awayteam'

                                match_metadata_dict[quartername_hometeam] = int(quarter.attrib['scoreA'])
                                match_metadata_dict[quartername_awayteam] = int(quarter.attrib['scoreB'])
                            except (KeyError, ValueError):
                                # Missing attribute or non-numeric score; keep the None default
                                pass


        return  match_metadata_dict  


    


In [55]:
# Load an example match and extract the metadata

from xml.etree import ElementTree as ET

def load_match_from_local_file(directory,match_id):
    return ET.parse(directory + "/" + str(match_id) + '.xml').getroot()

match_content = load_match_from_local_file(example_local_destination_directory,example_match_id)

# Instantiate the match processor
match_processor = ProcessFibaEuropeMatch(example_match_id,match_content)

# Extract metadata
match_processor.extract_metadata_from_root()

{'match_id': '60636',
 'competition_name': 'EuroChallenge',
 'competition_round': 'Last Sixteen, Group L',
 'team_name_hometeam': 'TS Medical Park',
 'team_name_awayteam': 'Mons-Hainaut',
 'final_score_hometeam': '76',
 'final_score_awayteam': '68',
 'total_fouls_hometeam': '7',
 'total_fouls_awayteam': '4',
 'ending_score_period1_hometeam': 23,
 'ending_score_period1_awayteam': 12,
 'ending_score_period2_hometeam': 21,
 'ending_score_period2_awayteam': 23,
 'ending_score_period3_hometeam': 14,
 'ending_score_period3_awayteam': 10,
 'ending_score_period4_hometeam': 18,
 'ending_score_period4_awayteam': 23,
 'ending_score_period5_hometeam': None,
 'ending_score_period5_awayteam': None,
 'starting_five_hometeam': None,
 'starting_five_awayteam': None}

### 2.2 - Create dataframe with granular play-by-play data

Iterate through the match, creating a dataframe with the following info:

* current score home/away team
* the scoring and/or assist action that just occurred
* full text string noting who scored and who assisted
* period
* time remaining in period

We will then iterate through this dataframe, splitting the various actions associated with each event into separate columns. Lastly, we append the metadata dictionary from step 2.1.
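
Before the full method, here is a minimal sketch of the per-`LINE` extraction on a small snippet. The attribute names (`num`, `quarter`, `text`, `time`, `scoreA`, `scoreB`) are the ones read by the code in this section. The XML snippet itself is a made-up reconstruction, because the sample file shown earlier is cut off before the first `LINE` element, and the `to_int` and `line_to_row` helpers are hypothetical. The sketch uses `Element.get`, which returns `None` for a missing attribute, as one possible alternative to wrapping lookups in `try`/`except`.

```python
import xml.etree.ElementTree as ET

# Made-up snippet for illustration; only the attribute names come from this section.
SAMPLE = """
<PLAYBYPLAY homeTeamName="TS Medical Park" awayTeamName="Mons-Hainaut">
  <LINE num="1" quarter="1" team="0" s1="" s2="" text="START OF GAME" time="10:00" />
  <LINE num="2" quarter="1" team="1" s1="Substitution - D. Bost IN" s2=""
        text="[TRA] Substitution - D. Bost IN" time="10:00" />
</PLAYBYPLAY>
""".strip()


def to_int(value):
    """Return int(value), or None when the value is missing or not numeric."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


def line_to_row(line):
    """Turn one LINE element into a flat dict; missing attributes become None."""
    time_text = line.get("time")
    return {
        "row_number": to_int(line.get("num")),
        "period": line.get("quarter"),
        "scoring_stat_hometeam_full": line.get("s1"),
        "scoring_stat_awayteam_full": line.get("s2"),
        "full_text": line.get("text"),
        "time_remaining_in_period": time_text,
        # Minutes are the part of the clock before the colon
        "minutes_remaining_in_period": to_int(time_text.split(":")[0]) if time_text else None,
        "current_score_hometeam": to_int(line.get("scoreA")),
        "current_score_awayteam": to_int(line.get("scoreB")),
    }


rows = [line_to_row(line) for line in ET.fromstring(SAMPLE).iter("LINE")]
for row in rows:
    print(row)
```

A list of dicts like `rows` can be turned into a dataframe in one step at the end, which avoids growing the dataframe one row at a time.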


In [56]:
class ProcessFibaEuropeMatch(ProcessFibaEuropeMatch):

    def extract_granular_data_from_root(self):
        """ Extract each line from the "play by play" block. Each line includes:
        * current score home/away team
        * the scoring and/or assist action that just occurred
        * full text string noting who scored and who assisted
        * period
        * time remaining in period

        We will then iterate through this play by play granular data, further splitting out info from each line. 
        The final result is one big flat table with columns for each type of scoring action and more.
        """

        df_match_raw = pd.DataFrame(columns=["match_id"
                                   ,"row_number"
                                   ,"period"
                                   ,"current_team_performing_stat_action"
                                   ,"scoring_stat_hometeam_full"
                                   ,"scoring_stat_awayteam_full"
                                   ,"assist_stat_hometeam_full"
                                   ,"assist_stat_awayteam_full"
                                   ,"full_text"
                                   ,"time_remaining_in_period"
                                   ,"minutes_remaining_in_period"
                                   ,"current_score_hometeam"
                                   ,"current_score_awayteam"                           
                                   ])


        for child in self.root.findall('.//PLAYBYPLAY'):
            if child.tag == 'PLAYBYPLAY':
                for line in child:
                    if line.tag == 'LINE':
                        row_number = None
                        period = None
                        current_team_performing_stat_action = None
                        scoring_stat_hometeam_full = None
                        scoring_stat_awayteam_full = None
                        assist_stat_hometeam_full = None
                        assist_stat_awayteam_full = None
                        full_text = None
                        time_remaining_in_period = None
                        minutes_remaining_in_period = None                
                        current_score_hometeam = None
                        current_score_awayteam = None

                        # Row number and period identify the event
                        try:
                            row_number = int(line.attrib['num'])
                            period = line.attrib['quarter']
                        except (KeyError, ValueError):
                            pass

                        # Action text and game clock; if an attribute is missing,
                        # the fields after it keep their None defaults
                        try:
                            current_team_performing_stat_action = line.attrib['team']
                            scoring_stat_hometeam_full = line.attrib['s1']
                            scoring_stat_awayteam_full = line.attrib['s2']
                            assist_stat_hometeam_full = line.attrib['s1Assists']
                            assist_stat_awayteam_full = line.attrib['s2Assists']
                            full_text = line.attrib['text']
                            time_remaining_in_period = line.attrib['time']
                            minutes_remaining_in_period = int(line.attrib['time'][:2])
                        except (KeyError, ValueError):
                            pass

                        # Running score (empty on some lines, e.g. the opening substitutions)
                        try:
                            current_score_hometeam = int(line.attrib['scoreA'])
                            current_score_awayteam = int(line.attrib['scoreB'])
                        except (KeyError, ValueError):
                            pass

                        df_match_raw.loc[row_number] = [self.match_id
                                               ,row_number
                                               ,period
                                               ,current_team_performing_stat_action
                                               ,scoring_stat_hometeam_full
                                               ,scoring_stat_awayteam_full
                                               ,assist_stat_hometeam_full
                                               ,assist_stat_awayteam_full
                                               ,full_text
                                               ,time_remaining_in_period
                                               ,minutes_remaining_in_period
                                               ,current_score_hometeam
                                               ,current_score_awayteam]  

        df_match_raw = df_match_raw.sort_index()  

        return df_match_raw




In [60]:
# Here's what it looks like
import pandas as pd

# Re-instantiate the match processor so it picks up the updated class
match_processor = ProcessFibaEuropeMatch(example_match_id,match_content)

df_match_raw = match_processor.extract_granular_data_from_root()
df_match_raw.head(10)

| | match_id | row_number | period | current_team_performing_stat_action | scoring_stat_hometeam_full | scoring_stat_awayteam_full | assist_stat_hometeam_full | assist_stat_awayteam_full | full_text | time_remaining_in_period | minutes_remaining_in_period | current_score_hometeam | current_score_awayteam |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 60636 | 1 | 1 | 0 | | | | | START OF GAME | 10:00 | 10 | | |
| 2 | 60636 | 2 | 1 | 1 | Substitution - D. Bost IN | | | | [TRA] Substitution - D. Bost IN | 10:00 | 10 | | |
| 3 | 60636 | 3 | 1 | 1 | Substitution - D. Hardy IN | | | | [TRA] Substitution - D. Hardy IN | 10:00 | 10 | | |
| 4 | 60636 | 4 | 1 | 1 | Substitution - A. Stipanovic IN | | | | [TRA] Substitution - A. Stipanovic IN | 10:00 | 10 | | |
| 5 | 60636 | 5 | 1 | 1 | Substitution - A. Saruhan IN | | | | [TRA] Substitution - A. Saruhan IN | 10:00 | 10 | | |
| 6 | 60636 | 6 | 1 | 1 | Substitution - N. Yildirim IN | | | | [TRA] Substitution - N. Yildirim IN | 10:00 | 10 | | |
| 7 | 60636 | 7 | 1 | 2 | | Substitution - M. Smith IN | | | [MON] Substitution - M. Smith IN | 10:00 | 10 | | |
| 8 | 60636 | 8 | 1 | 2 | | Substitution - J. Love IN | | | [MON] Substitution - J. Love IN | 10:00 | 10 | | |
| 9 | 60636 | 9 | 1 | 2 | | Substitution - J. Cage IN | | | [MON] Substitution - J. Cage IN | 10:00 | 10 | | |
| 10 | 60636 | 10 | 1 | 2 | | Substitution - T. Battle IN | | | [MON] Substitution - T. Battle IN | 10:00 | 10 | | |


### 2.3 - Process starting fives

At the start of the match, there will (usually) be five "Substitution - ... IN" actions listed for each team, designating the players who will start the match.

I want to keep track of these players so I can assess their actions (substitution status, fouls, points, etc.) during the game. I collect the starting five of each team so that later I can iterate through them and gather their stats during the game.
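
The selection rule can be sketched independently of pandas: a player counts as a starter if they are subbed IN during period 1 while no score has been recorded yet. In the sketch below, the row dictionaries, their key names and the `collect_starting_five` helper are illustrative assumptions. The first five rows mirror the home-team lines in the table above, and the last row is a made-up later substitution.

```python
PREFIX = "Substitution - "
SUFFIX = " IN"


def collect_starting_five(rows):
    """Collect players subbed IN during period 1 before any score is recorded."""
    starters = []
    for row in rows:
        text = row["stat"]
        before_first_score = (
            int(row["period"]) == 1
            and row["score_home"] is None
            and row["score_away"] is None
        )
        # Only single "X IN" entries qualify, not combined "X IN, Y OUT" entries
        is_single_in = (
            text.startswith(PREFIX) and text.endswith(SUFFIX) and "," not in text
        )
        if before_first_score and is_single_in:
            starters.append(text[len(PREFIX):-len(SUFFIX)].strip())
    return starters


home_rows = [
    {"period": "1", "score_home": None, "score_away": None, "stat": "Substitution - D. Bost IN"},
    {"period": "1", "score_home": None, "score_away": None, "stat": "Substitution - D. Hardy IN"},
    {"period": "1", "score_home": None, "score_away": None, "stat": "Substitution - A. Stipanovic IN"},
    {"period": "1", "score_home": None, "score_away": None, "stat": "Substitution - A. Saruhan IN"},
    {"period": "1", "score_home": None, "score_away": None, "stat": "Substitution - N. Yildirim IN"},
    # Made-up later substitution: a score exists, so it must be ignored
    {"period": "1", "score_home": 12, "score_away": 9, "stat": "Substitution - X. Placeholder IN"},
]

starting_five_home = collect_starting_five(home_rows)
print(starting_five_home)
```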

In [62]:
class ProcessFibaEuropeMatch(ProcessFibaEuropeMatch):
    
    def process_substitutions_and_starting_fives(self,df_match_raw):
        """ Extracts the starting five players from each team from the "df_match_raw"(granular play by play dataframe)
            
        Args:
            df_match_raw (pandas dataframe): dataframe with raw play-by-play data

        Returns:
            List of two strings: the comma-separated names of the starting home-team players, then the comma-separated names of the starting away-team players
        
        """

        startingfive_hometeam = list()
        startingfive_awayteam = list()

        for index, row in df_match_raw.iterrows():


            ## Hometeam
            if len(row['scoring_stat_hometeam_full']) > 1 and 'Substitution - ' in row['scoring_stat_hometeam_full']:
                df_match_raw.loc[index,'stat_action_hometeam'] = 'Substitution'
                substitution_block1 = ""
                substitution_block2 = ""
                substitution_block_full = row['scoring_stat_hometeam_full'].split('Substitution - ')[1].replace("<br />","").strip()
                if ',' in substitution_block_full:
                    substitution_block1 = substitution_block_full.split(',')[0]
                    substitution_block2 = substitution_block_full.split(',')[1]
                    if ' IN' in substitution_block1:
                        df_match_raw.loc[index,'substitution_player_in_hometeam'] = substitution_block1.split(' IN')[0].strip()
                    elif ' OUT' in substitution_block1:
                        df_match_raw.loc[index,'substitution_player_out_hometeam'] = substitution_block1.split(' OUT')[0].strip()

                    if ' IN' in substitution_block2:
                        df_match_raw.loc[index,'substitution_player_in_hometeam'] = substitution_block2.split(' IN')[0].strip()
                    elif ' OUT' in substitution_block2:
                        df_match_raw.loc[index,'substitution_player_out_hometeam'] = substitution_block2.split(' OUT')[0].strip()
                else:
                    if ' IN' in substitution_block_full:
                        df_match_raw.loc[index,'substitution_player_in_hometeam'] = substitution_block_full.split(' IN')[0].strip()

                        ### Add starting 5 if applicable
                        if int(row['period']) == 1 and pd.isnull(row['current_score_hometeam']) and pd.isnull(row['current_score_awayteam']):
                            startingfive_hometeam.append(substitution_block_full.split(' IN')[0].strip())

                    # No comma here, so the whole block (not substitution_block1) holds the OUT entry
                    elif ' OUT' in substitution_block_full:
                        df_match_raw.loc[index,'substitution_player_out_hometeam'] = substitution_block_full.split(' OUT')[0].strip()


            ## Awayteam
            if len(row['scoring_stat_awayteam_full']) > 1 and 'Substitution - ' in row['scoring_stat_awayteam_full']:
                df_match_raw.loc[index,'stat_action_awayteam'] = 'Substitution'         
                substitution_block_full = row['scoring_stat_awayteam_full'].split('Substitution - ')[1].replace("<br />","").strip()
                substitution_block1 = ""
                substitution_block2 = ""

                if ',' in substitution_block_full:
                    substitution_block1 = substitution_block_full.split(',')[0]
                    substitution_block2 = substitution_block_full.split(',')[1]
                    if ' IN' in substitution_block1:
                        df_match_raw.loc[index,'substitution_player_in_awayteam'] = substitution_block1.split(' IN')[0].strip()
                    elif ' OUT' in substitution_block1:
                        df_match_raw.loc[index,'substitution_player_out_awayteam'] = substitution_block1.split(' OUT')[0].strip()

                    if ' IN' in substitution_block2:
                        df_match_raw.loc[index,'substitution_player_in_awayteam'] = substitution_block2.split(' IN')[0].strip()
                    elif ' OUT' in substitution_block2:
                        df_match_raw.loc[index,'substitution_player_out_awayteam'] = substitution_block2.split(' OUT')[0].strip()

                else:
                    if ' IN' in substitution_block_full:
                        df_match_raw.loc[index,'substitution_player_in_awayteam'] = substitution_block_full.split(' IN')[0].strip()

                        ### Add starting 5 if applicable
                        if int(row['period']) == 1 and pd.isnull(row['current_score_hometeam']) and pd.isnull(row['current_score_awayteam']):
                            startingfive_awayteam.append(substitution_block_full.split(' IN')[0].strip())

                    # No comma here, so the whole block (not substitution_block1) holds the OUT entry
                    elif ' OUT' in substitution_block_full:
                        df_match_raw.loc[index,'substitution_player_out_awayteam'] = substitution_block_full.split(' OUT')[0].strip()

        starting_five_hometeam_string = ""
        starting_five_awayteam_string = ""
        try:
            starting_five_hometeam_string = ','.join(startingfive_hometeam) 
            starting_five_awayteam_string = ','.join(startingfive_awayteam) 
        except:
            pass
        return [starting_five_hometeam_string,starting_five_awayteam_string]

In [63]:
# Re-instantiate the match processor so it picks up the updated class
match_processor = ProcessFibaEuropeMatch(example_match_id,match_content)  

# Return an array of the starting five players from each team
match_processor.process_substitutions_and_starting_fives(df_match_raw)


['D. Bost,D. Hardy,A. Stipanovic,A. Saruhan,N. Yildirim',
 'M. Smith,J. Love,J. Cage,T. Battle,I. Lasisi']

### 2.4 - Extract action information from each row of the granular match data

The `df_match_raw` dataframe holds information for each event during the game. I want to split out the information relevant to those actions into separate columns.

We will do this for:

* fouls
* turnovers
* steals
* blocks
* shots made
* shots missed
* assists
* other (free throws, timeouts etc)
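
To illustrate the kind of splitting involved, here is a minimal sketch for two of these action types. It uses the example strings quoted in the docstrings of this section's code (`W. Carter - turnover (bad pass, 2 TO)` and `W. Carter - steal (0 ST)`). The regular expressions and the `classify` helper are assumptions made for illustration and are not the string-splitting approach used in the class itself.

```python
import re

# Patterns inferred from the two docstring examples; other variants may exist.
TURNOVER = re.compile(
    r"^(?P<player>.+?) - turnover \((?P<kind>[^,)]+)(?:, (?P<count>\d+) TO)?\)"
)
STEAL = re.compile(r"^(?P<player>.+?) - steal \((?P<count>\d+) ST\)")


def classify(text):
    """Return a dict describing the action in `text`, or None if unrecognised.

    The first matching pattern wins, similar to how each row is only
    labelled once in the per-row functions.
    """
    match = TURNOVER.match(text)
    if match:
        count = match.group("count")
        return {
            "stat_action": "Turnover Committed",
            "player": match.group("player").strip(),
            "type": match.group("kind").strip(),
            "count": int(count) if count else None,
        }
    match = STEAL.match(text)
    if match:
        return {
            "stat_action": "Steal",
            "player": match.group("player").strip(),
            "count": int(match.group("count")),
        }
    return None


turnover = classify("W. Carter - turnover (bad pass, 2 TO)")
steal = classify("W. Carter - steal (0 ST)")
print(turnover)
print(steal)
```

Named groups keep each extracted field tied to a label, which makes it easier to see which part of the pattern failed when a new text variant turns up.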

The following functions are applied to **each row** of the `df_match_raw` dataframe:

In [66]:
class ProcessFibaEuropeMatch(ProcessFibaEuropeMatch):
    
    ## The following functions are applied to each row of the "df_match_raw" dataset,
    ## extracting relevant data into separate columns (to be later joined together in one big flat dataframe)

    def process_fouls_for_row(self,df_match_raw,index,row):
        """ PROCESS FOULS - extract information related to fouls and place said info (current
        number of fouls for player, for team, etc) into separate columns of the input dataframe

        Args:
            df_match_raw (pandas dataframe): dataframe with raw play-by-play data
            index (int): index of dataframe row
            row (pandas dataframe row): pandas dataframe row        
        """

        ## Hometeam
        if pd.isnull(row['stat_action_hometeam'])  and len(row['scoring_stat_hometeam_full']) > 1 and 'Foul drawn: ' in row['scoring_stat_hometeam_full']:
            df_match_raw.loc[index,'stat_action_hometeam'] = 'Foul Drawn'                  
            df_match_raw.loc[index,'foul_drawn_player_hometeam'] = row['scoring_stat_hometeam_full'].split('Foul drawn: ')[1].strip()
        elif len(row['scoring_stat_hometeam_full']) > 1 and '  foul (' in row['scoring_stat_hometeam_full']:
            df_match_raw.loc[index,'stat_action_hometeam'] = 'Foul Committed'
            df_match_raw.loc[index,'foul_committed_player_hometeam'] = row['scoring_stat_hometeam_full'].split(' - ')[0].strip()
            # Team foul count: '7th team foul' -> '7t' -> 7, '12th team foul' -> '12' -> 12
            try:
                foulnumber_raw = row['scoring_stat_hometeam_full'].split(' - ')[1].split(', ')[1].strip()[:2]
                if foulnumber_raw[1].isdigit():
                    df_match_raw.loc[index,'team_fouls_committed_hometeam'] = int(foulnumber_raw)
                else:
                    df_match_raw.loc[index,'team_fouls_committed_hometeam'] = int(foulnumber_raw[0])
            except (IndexError, ValueError):
                pass

            # Personal foul count: '(3 PF' -> 3
            try:
                personal_foulnumber_raw = row['scoring_stat_hometeam_full'].split(' - ')[1].split(', ')[0].split('(')[1].replace('PF','').strip()
                df_match_raw.loc[index,'player_personal_fouls_committed_hometeam'] = int(personal_foulnumber_raw)
            except (IndexError, ValueError):
                pass


        ## Awayteam
        if pd.isnull(row['stat_action_awayteam'])  and len(row['scoring_stat_awayteam_full']) > 1 and 'Foul drawn: ' in row['scoring_stat_awayteam_full']:
            df_match_raw.loc[index,'stat_action_awayteam'] = 'Foul Drawn'         
            df_match_raw.loc[index,'foul_drawn_player_awayteam'] = row['scoring_stat_awayteam_full'].split('Foul drawn: ')[1].strip()
        elif len(row['scoring_stat_awayteam_full']) > 1 and 'personal  foul' in row['scoring_stat_awayteam_full']:
            df_match_raw.loc[index,'stat_action_awayteam'] = 'Foul Committed'
            df_match_raw.loc[index,'foul_committed_player_awayteam'] = row['scoring_stat_awayteam_full'].split(' - ')[0].strip()
            # Team foul count: '7th team foul' -> '7t' -> 7, '12th team foul' -> '12' -> 12
            try:
                foulnumber_raw = row['scoring_stat_awayteam_full'].split(' - ')[1].split(', ')[1].strip()[:2]
                if foulnumber_raw[1].isdigit():
                    df_match_raw.loc[index,'team_fouls_committed_awayteam'] = int(foulnumber_raw)
                else:
                    df_match_raw.loc[index,'team_fouls_committed_awayteam'] = int(foulnumber_raw[0])
            except (IndexError, ValueError):
                pass

            # Personal foul count: '(3 PF' -> 3
            try:
                personal_foulnumber_raw = row['scoring_stat_awayteam_full'].split(' - ')[1].split(', ')[0].split('(')[1].replace('PF','').strip()
                df_match_raw.loc[index,'player_personal_fouls_committed_awayteam'] = int(personal_foulnumber_raw)
            except (IndexError, ValueError):
                pass




    def process_turnovers_for_row(self,df_match_raw,index,row):

        """ PROCESS TURNOVERS - extract information related to turnovers and place said info 
        into separate columns of the input dataframe

        example: W. Carter - turnover (bad pass, 2 TO)

        Args:
            df_match_raw (pandas dataframe): dataframe with raw play-by-play data
            index (int): index of dataframe row
            row (pandas dataframe row): pandas dataframe row
        """
        
        ## Hometeam
        if pd.isnull(row['stat_action_hometeam']) and len(row['scoring_stat_hometeam_full']) > 1 and ' - turnover ' in row['scoring_stat_hometeam_full']:
            df_match_raw.loc[index,'stat_action_hometeam'] = 'Turnover Committed'                  
            df_match_raw.loc[index,'turnover_committed_player_hometeam'] = row['scoring_stat_hometeam_full'].split(' - turnover ')[0].strip()
            df_match_raw.loc[index,'turnover_committed_type_hometeam'] = row['scoring_stat_hometeam_full'].split(' - turnover ')[1].strip().split(', ')[0].replace("(","").strip()
            if row['scoring_stat_hometeam_full'].rfind('TO') > 0:
                try:
                    df_match_raw.loc[index,'turnover_committed_by_player_count_hometeam'] = int(row['scoring_stat_hometeam_full'].split(' - turnover ')[1].strip().split(', ')[1].replace(" TO)","").strip())
                except:
                    pass

        if pd.isnull(row['stat_action_hometeam']) and len(row['scoring_stat_hometeam_full']) > 1 and 'Team turnover (24 seconds)' in row['scoring_stat_hometeam_full']:
            df_match_raw.loc[index,'stat_action_hometeam'] = 'Turnover Committed'                  

        ## Awayteam
        if pd.isnull(row['stat_action_awayteam']) and len(row['scoring_stat_awayteam_full']) > 1 and ' - turnover ' in row['scoring_stat_awayteam_full']:

            df_match_raw.loc[index,'stat_action_awayteam'] = 'Turnover Committed'                  
            df_match_raw.loc[index,'turnover_committed_player_awayteam'] = row['scoring_stat_awayteam_full'].split(' - turnover ')[0].strip()
            df_match_raw.loc[index,'turnover_committed_type_awayteam'] = row['scoring_stat_awayteam_full'].split(' - turnover ')[1].strip().split(', ')[0].replace("(","").strip()
            if row['scoring_stat_awayteam_full'].rfind('TO') > 0:
                try:
                    df_match_raw.loc[index,'turnover_committed_by_player_count_awayteam'] = int(row['scoring_stat_awayteam_full'].split(' - turnover ')[1].strip().split(', ')[1].replace(" TO)","").strip())
                except:
                    pass

        if pd.isnull(row['stat_action_awayteam']) and len(row['scoring_stat_awayteam_full']) > 1 and 'Team turnover (24 seconds)' in row['scoring_stat_awayteam_full']:
            df_match_raw.loc[index,'stat_action_awayteam'] = 'Turnover Committed'    



    def process_steals_for_row(self,df_match_raw,index,row):

        """ PROCESS STEALS - extract information related to steals and place said info
        into separate columns of the input dataframe

        Example: W. Carter - steal (0 ST) 

        Args:
            df_match_raw (pandas dataframe): dataframe with raw play-by-play data
            index (int): index of dataframe row
            row (pandas dataframe row): pandas dataframe row        
        """
        
        ## Hometeam
        if pd.isnull(row['stat_action_hometeam'])  and len(row['scoring_stat_hometeam_full']) > 1 and ' - steal ' in row['scoring_stat_hometeam_full']:
            df_match_raw.loc[index,'stat_action_hometeam'] = 'Steal'                  
            df_match_raw.loc[index,'steal_player_hometeam'] = row['scoring_stat_hometeam_full'].split(' - steal ')[0].strip()
            if row['scoring_stat_hometeam_full'].rfind('ST') > 0:
                try:
                    df_match_raw.loc[index,'ball_stolen_by_player_count_hometeam'] = int(row['scoring_stat_hometeam_full'].split(' - steal ')[1].strip().replace(" ST)","").replace("(","").strip())
                except:
                    pass


        ## Awayteam
        if pd.isnull(row['stat_action_awayteam'])  and len(row['scoring_stat_awayteam_full']) > 1 and ' - steal ' in row['scoring_stat_awayteam_full']:
            df_match_raw.loc[index,'stat_action_awayteam'] = 'Steal'                  
            df_match_raw.loc[index,'steal_player_awayteam'] = row['scoring_stat_awayteam_full'].split(' - steal ')[0].strip()
            if row['scoring_stat_awayteam_full'].rfind('ST') > 0:
                try:
                    df_match_raw.loc[index,'ball_stolen_by_player_count_awayteam'] = int(row['scoring_stat_awayteam_full'].split(' - steal ')[1].strip().replace(" ST)","").replace("(","").strip())
                except:
                    pass


    def process_blocks_for_row(self,df_match_raw,index,row):

        """ PROCESS BLOCKS - extract information related to blocks and place said info
        into separate columns of the input dataframe

        Args:
            df_match_raw (pandas dataframe): dataframe with raw play-by-play data
            index (int): index of dataframe row
            row (pandas dataframe row): pandas dataframe row        
        """
        
        
        ## Hometeam
        if pd.isnull(row['stat_action_hometeam'])  and len(row['scoring_stat_hometeam_full']) > 1 and ' blocked' in row['scoring_stat_hometeam_full']:
            df_match_raw.loc[index,'stat_action_hometeam'] = 'Shot Blocked'                  
            df_match_raw.loc[index,'shot_blocked_hometeam'] = row['scoring_stat_hometeam_full'].split(' - ')[0].strip()
            df_match_raw.loc[index,'shot_block_detail'] = row['scoring_stat_hometeam_full'].split(' - ')[1].strip()


        ## Awayteam
        if pd.isnull(row['stat_action_awayteam'])  and len(row['scoring_stat_awayteam_full']) > 1 and ' blocked' in row['scoring_stat_awayteam_full']:
            df_match_raw.loc[index,'stat_action_awayteam'] = 'Shot Blocked'                  
            df_match_raw.loc[index,'shot_blocked_awayteam'] = row['scoring_stat_awayteam_full'].split(' - ')[0].strip()
            df_match_raw.loc[index,'shot_block_detail'] = row['scoring_stat_awayteam_full'].split(' - ')[1].strip()



    def process_shots_made_for_row(self,df_match_raw,index,row):

        """ SHOTS MADE - extract information related to shots made and place said info
        into separate columns of the input dataframe

        Args:
            df_match_raw (pandas dataframe): dataframe with raw play-by-play data
            index (int): index of dataframe row
            row (pandas dataframe row): pandas dataframe row        
        """
        
        ## Hometeam
        if pd.isnull(row['stat_action_hometeam'])  and len(row['scoring_stat_hometeam_full']) > 1 and ' made (' in row['scoring_stat_hometeam_full']:
            df_match_raw.loc[index,'stat_action_hometeam'] = 'Points Scored'                  
            df_match_raw.loc[index,'points_scored_player_hometeam'] = row['scoring_stat_hometeam_full'].split(' - ')[0].strip()
            df_match_raw.loc[index,'points_scored_type_hometeam'] = row['scoring_stat_hometeam_full'].split(' - ')[1].split('(')[0].strip().replace(' made','').strip()

            if ',' in row['scoring_stat_hometeam_full'].split(' - ')[1].split('(')[1]:
                df_match_raw.loc[index,'points_scored_subtype_hometeam'] = row['scoring_stat_hometeam_full'].split(' - ')[1].split('(')[1].split(',')[0].strip()
                df_match_raw.loc[index,'points_scored_by_player_hometeam'] = row['scoring_stat_hometeam_full'].split(' - ')[1].split('(')[1].split(',')[1].strip().split(' ')[0].strip()             
            else:
                df_match_raw.loc[index,'points_scored_by_player_hometeam'] = row['scoring_stat_hometeam_full'].split(' - ')[1].split('(')[1].split(' ')[0].strip()


        ## Awayteam
        if pd.isnull(row['stat_action_awayteam'])  and len(row['scoring_stat_awayteam_full']) > 1 and ' made (' in row['scoring_stat_awayteam_full']:
            df_match_raw.loc[index,'stat_action_awayteam'] = 'Points Scored'                  
            df_match_raw.loc[index,'points_scored_player_awayteam'] = row['scoring_stat_awayteam_full'].split(' - ')[0].strip()
            df_match_raw.loc[index,'points_scored_type_awayteam'] = row['scoring_stat_awayteam_full'].split(' - ')[1].split('(')[0].strip().replace(' made','').strip()

            if ',' in row['scoring_stat_awayteam_full'].split(' - ')[1].split('(')[1]:
                df_match_raw.loc[index,'points_scored_subtype_awayteam'] = row['scoring_stat_awayteam_full'].split(' - ')[1].split('(')[1].split(',')[0].strip()
                df_match_raw.loc[index,'points_scored_by_player_awayteam'] = row['scoring_stat_awayteam_full'].split(' - ')[1].split('(')[1].split(',')[1].strip().split(' ')[0].strip()             
            else:
                df_match_raw.loc[index,'points_scored_by_player_awayteam'] = row['scoring_stat_awayteam_full'].split(' - ')[1].split('(')[1].split(' ')[0].strip()


    def process_shots_missed_for_row(self,df_match_raw,index,row):

        """ SHOTS MISSED - extract information related to shots missed and place said info
        into separate columns of the input dataframe

        Args:
            df_match_raw (pandas dataframe): dataframe with raw play-by-play data
            index (int): index of dataframe row
            row (pandas dataframe row): pandas dataframe row        
        """

        ## Hometeam
        if pd.isnull(row['stat_action_hometeam'])  and len(row['scoring_stat_hometeam_full']) > 1 and ' missed' in row['scoring_stat_hometeam_full']:
            df_match_raw.loc[index,'stat_action_hometeam'] = 'Shot Missed'                  
            df_match_raw.loc[index,'shot_missed_player_hometeam'] = row['scoring_stat_hometeam_full'].split(' - ')[0].strip()
            df_match_raw.loc[index,'shot_missed_type_hometeam'] = row['scoring_stat_hometeam_full'].split(' - ')[1].strip().replace(' missed','').strip()

        ## Awayteam
        if pd.isnull(row['stat_action_awayteam'])  and len(row['scoring_stat_awayteam_full']) > 1 and ' missed' in row['scoring_stat_awayteam_full']:
            df_match_raw.loc[index,'stat_action_awayteam'] = 'Shot Missed'                  
            df_match_raw.loc[index,'shot_missed_player_awayteam'] = row['scoring_stat_awayteam_full'].split(' - ')[0].strip()
            df_match_raw.loc[index,'shot_missed_type_awayteam'] = row['scoring_stat_awayteam_full'].split(' - ')[1].strip().replace(' missed','').strip()



    def process_assists_for_row(self,df_match_raw,index,row):

        """ ASSISTS AND REBOUNDS - extract information related to assists and rebounds and place said info
        into separate columns of the input dataframe

        Args:
            df_match_raw (pandas dataframe): dataframe with raw play-by-play data
            index (int): index of dataframe row
            row (pandas dataframe row): pandas dataframe row        
        """

        ## Hometeam
        ## Offensive Rebound
        if pd.isnull(row['stat_action_hometeam'])  and len(row['assist_stat_hometeam_full']) > 1 and '- offensive rebound' in row['assist_stat_hometeam_full']:
            df_match_raw.loc[index,'stat_action_assist_hometeam'] = 'Offensive Rebound'                  
            df_match_raw.loc[index,'offensive_rebound_player_hometeam'] = row['assist_stat_hometeam_full'].split(' - ')[0].strip()

            try:
                df_match_raw.loc[index,'offensive_rebounds_by_player_hometeam'] = row['assist_stat_hometeam_full'].split(' - ')[1].split('(')[1].split(' ')[0].strip()                 
            except IndexError:  # no "(N ..." running count in the text
                df_match_raw.loc[index,'offensive_rebounds_by_player_hometeam'] = '1'
        ## Defensive Rebound
        if pd.isnull(row['stat_action_hometeam'])  and len(row['assist_stat_hometeam_full']) > 1 and '- defensive rebound' in row['assist_stat_hometeam_full']:
            df_match_raw.loc[index,'stat_action_assist_hometeam'] = 'Defensive Rebound'                  
            df_match_raw.loc[index,'defensive_rebound_player_hometeam'] = row['assist_stat_hometeam_full'].split(' - ')[0].strip()
            try:
                df_match_raw.loc[index,'defensive_rebounds_by_player_hometeam'] = row['assist_stat_hometeam_full'].split(' - ')[1].split('(')[1].split(' ')[0].strip()                 
            except IndexError:  # no "(N ..." running count in the text
                df_match_raw.loc[index,'defensive_rebounds_by_player_hometeam'] = '1'

        ## Assists
        if pd.isnull(row['stat_action_hometeam'])  and len(row['assist_stat_hometeam_full']) > 1 and '- assist ' in row['assist_stat_hometeam_full']:
            df_match_raw.loc[index,'stat_action_assist_hometeam'] = 'Assist'                  
            df_match_raw.loc[index,'scoring_assist_player_hometeam'] = row['assist_stat_hometeam_full'].split(' - ')[0].strip()
            try:
                df_match_raw.loc[index,'scoring_assists_by_player_hometeam'] = row['assist_stat_hometeam_full'].split(' - ')[1].split('(')[1].split(' ')[0].strip()                 
            except IndexError:  # no "(N ..." running count in the text
                df_match_raw.loc[index,'scoring_assists_by_player_hometeam'] = '1'

        ## Awayteam
        ## Offensive Rebound
        if pd.isnull(row['stat_action_awayteam'])  and len(row['assist_stat_awayteam_full']) > 1 and '- offensive rebound' in row['assist_stat_awayteam_full']:
            df_match_raw.loc[index,'stat_action_assist_awayteam'] = 'Offensive Rebound'                  
            df_match_raw.loc[index,'offensive_rebound_player_awayteam'] = row['assist_stat_awayteam_full'].split(' - ')[0].strip()
            try:            
                df_match_raw.loc[index,'offensive_rebounds_by_player_awayteam'] = row['assist_stat_awayteam_full'].split(' - ')[1].split('(')[1].split(' ')[0].strip()                 
            except IndexError:  # no "(N ..." running count in the text
                df_match_raw.loc[index,'offensive_rebounds_by_player_awayteam'] = '1'

        ## Defensive Rebound
        if pd.isnull(row['stat_action_awayteam'])  and len(row['assist_stat_awayteam_full']) > 1 and '- defensive rebound' in row['assist_stat_awayteam_full']:
            df_match_raw.loc[index,'stat_action_assist_awayteam'] = 'Defensive Rebound'                  
            df_match_raw.loc[index,'defensive_rebound_player_awayteam'] = row['assist_stat_awayteam_full'].split(' - ')[0].strip()
            try:
                df_match_raw.loc[index,'defensive_rebounds_by_player_awayteam'] = row['assist_stat_awayteam_full'].split(' - ')[1].split('(')[1].split(' ')[0].strip()                 
            except IndexError:  # no "(N ..." running count in the text
                df_match_raw.loc[index,'defensive_rebounds_by_player_awayteam'] = '1'

        ## Assists
        if pd.isnull(row['stat_action_awayteam'])  and len(row['assist_stat_awayteam_full']) > 1 and '- assist ' in row['assist_stat_awayteam_full']:
            df_match_raw.loc[index,'stat_action_assist_awayteam'] = 'Assist'                  
            df_match_raw.loc[index,'scoring_assist_player_awayteam'] = row['assist_stat_awayteam_full'].split(' - ')[0].strip()
            try:
                df_match_raw.loc[index,'scoring_assists_by_player_awayteam'] = row['assist_stat_awayteam_full'].split(' - ')[1].split('(')[1].split(' ')[0].strip()                 
            except IndexError:  # no "(N ..." running count in the text
                df_match_raw.loc[index,'scoring_assists_by_player_awayteam'] = '1'    


    def process_other_misc_for_row(self,df_match_raw,index,row):
        
        """ MISC - extract information related to Free Throws, Time Outs etc and place said info
        into separate columns of the input dataframe

        Args:
            df_match_raw (pandas dataframe): dataframe with raw play-by-play data
            index (int): index of dataframe row
            row (pandas dataframe row): pandas dataframe row        
        """

        
        ## Hometeam
        ## Free Throws
        if pd.isnull(row['stat_action_hometeam'])  and len(row['scoring_stat_hometeam_full']) > 1 and ' free throws' in row['scoring_stat_hometeam_full']:
            df_match_raw.loc[index,'stat_action_hometeam'] = 'Free Throws Awarded'                      
        if pd.isnull(row['stat_action_hometeam'])  and len(row['scoring_stat_hometeam_full']) > 1 and ' Timeout' in row['scoring_stat_hometeam_full']:
            df_match_raw.loc[index,'stat_action_hometeam'] = 'Timeout'                      

        if pd.isnull(row['stat_action_hometeam'])  and len(row['scoring_stat_hometeam_full']) > 1 and 'End of quarter' in row['scoring_stat_hometeam_full']:
            df_match_raw.loc[index,'stat_action_hometeam'] = 'End of Quarter'                      
        if pd.isnull(row['stat_action_hometeam'])  and len(row['scoring_stat_hometeam_full']) > 1 and 'Start of quarter' in row['scoring_stat_hometeam_full']:
            df_match_raw.loc[index,'stat_action_hometeam'] = 'Start of Quarter'                      


        ## Awayteam
        if pd.isnull(row['stat_action_awayteam'])  and len(row['scoring_stat_awayteam_full']) > 1 and ' free throws' in row['scoring_stat_awayteam_full']:
            df_match_raw.loc[index,'stat_action_awayteam'] = 'Free Throws Awarded'                      
        if pd.isnull(row['stat_action_awayteam'])  and len(row['scoring_stat_awayteam_full']) > 1 and ' Timeout' in row['scoring_stat_awayteam_full']:
            df_match_raw.loc[index,'stat_action_awayteam'] = 'Timeout'                      

        if pd.isnull(row['stat_action_awayteam'])  and len(row['scoring_stat_awayteam_full']) > 1 and 'End of quarter' in row['scoring_stat_awayteam_full']:
            df_match_raw.loc[index,'stat_action_awayteam'] = 'End of Quarter'                      
        if pd.isnull(row['stat_action_awayteam'])  and len(row['scoring_stat_awayteam_full']) > 1 and 'Start of quarter' in row['scoring_stat_awayteam_full']:
            df_match_raw.loc[index,'stat_action_awayteam'] = 'Start of Quarter'             
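
The shots-made parsing above chains several `split` calls on the same text. As a compact illustration, the same fields could be pulled out with a single regular expression. This is only a sketch: the names `SHOT_MADE_RE` and `parse_shot_made` and the sample strings are hypothetical, and the text layout is inferred from the `split` logic rather than taken from the raw feed.

```python
import re

# Assumed layout: "<player> - <shot type> made (<optional subtype>, <points> pts)"
SHOT_MADE_RE = re.compile(
    r"^(?P<player>.+?) - (?P<type>.+?) made \((?:(?P<subtype>[^,]+),\s*)?(?P<points>\S+)"
)

def parse_shot_made(text):
    """Return player, shot type, optional subtype and running points, or None if no match."""
    match = SHOT_MADE_RE.match(text)
    if match is None:
        return None
    # Strip whitespace from captured groups; a missing optional group stays None
    return {key: (value.strip() if value else None)
            for key, value in match.groupdict().items()}

# Hypothetical sample strings, with and without a subtype
with_subtype = parse_shot_made("M. Delaney - 2pt jump shot made (fast break, 6 pts)")
without_subtype = parse_shot_made("H. Owens - layup made (2 pts)")
```

A single pattern keeps the expected layout in one place, and text that does not fit yields `None` instead of an `IndexError` from a failed `split`.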



### 2.5 - Add back the metadata, and cumulatives, time deltas, and top fives
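
As a small, self-contained illustration of the time-delta averaging used in this part, the sketch below averages the gaps between scoring events while skipping missing values. The function name `mean_timedelta` and the sample gaps are hypothetical, and `None` stands in for a null gap.

```python
import datetime

def mean_timedelta(deltas):
    """Average the non-null timedeltas; return None when there is nothing to average."""
    valid = [d for d in deltas if d is not None]
    if not valid:
        return None
    # Start the sum from a zero timedelta so no element is counted twice
    return sum(valid, datetime.timedelta(0)) / len(valid)

# Two real gaps (30s and 90s) and one missing gap
gaps = [datetime.timedelta(seconds=30), None, datetime.timedelta(seconds=90)]
print(mean_timedelta(gaps))  # 0:01:00
```

Dividing by the number of non-null gaps, not by the length of the input, keeps missing values from dragging the average down.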

In [75]:
import datetime
import numpy as np

class ProcessFibaEuropeMatch(ProcessFibaEuropeMatch):

    ## Apply the metadata collection to each row in the table
    def add_metadata_for_row(self,df_match_raw,index,row,match_metadata_dict):                          
        """ Add the metadata extracted earlier (into a dictionary) to the dataframe as columns

        Args:
            df_match_raw (pandas dataframe): dataframe with raw play-by-play data
            index (int): index of dataframe row
            row (pandas dataframe row): pandas dataframe row        
            match_metadata_dict (dict): match metadata extracted earlier from file
        """

        df_match_raw.loc[index,'metadata_competition_name'] = match_metadata_dict['competition_name']                      
        df_match_raw.loc[index,'competition_round'] = match_metadata_dict['competition_round']                      
        df_match_raw.loc[index,'team_name_hometeam'] = match_metadata_dict['team_name_hometeam']                      
        df_match_raw.loc[index,'team_name_awayteam'] = match_metadata_dict['team_name_awayteam']                      
        df_match_raw.loc[index,'ending_score_period1_hometeam'] = match_metadata_dict['ending_score_period1_hometeam']                      
        df_match_raw.loc[index,'ending_score_period1_awayteam'] = match_metadata_dict['ending_score_period1_awayteam']                      
        df_match_raw.loc[index,'ending_score_period2_hometeam'] = match_metadata_dict['ending_score_period2_hometeam']                      
        df_match_raw.loc[index,'ending_score_period2_awayteam'] = match_metadata_dict['ending_score_period2_awayteam']                      
        df_match_raw.loc[index,'ending_score_period3_hometeam'] = match_metadata_dict['ending_score_period3_hometeam']                      
        df_match_raw.loc[index,'ending_score_period3_awayteam'] = match_metadata_dict['ending_score_period3_awayteam']                      
        df_match_raw.loc[index,'ending_score_period4_hometeam'] = match_metadata_dict['ending_score_period4_hometeam']                      
        df_match_raw.loc[index,'ending_score_period4_awayteam'] = match_metadata_dict['ending_score_period4_awayteam']                      
        df_match_raw.loc[index,'ending_score_period5_hometeam'] = match_metadata_dict['ending_score_period5_hometeam']                      
        df_match_raw.loc[index,'ending_score_period5_awayteam'] = match_metadata_dict['ending_score_period5_awayteam']

        df_match_raw.loc[index,'starting_five_hometeam'] = match_metadata_dict['starting_five_hometeam']
        df_match_raw.loc[index,'starting_five_awayteam'] = match_metadata_dict['starting_five_awayteam']

    

    def add_cumulatives_fill_in_current_score(self,df_match_raw):
        """ Make fields with cumulative values, like 'current score hometeam', etc

        Args:
            df_match_raw (pandas dataframe): dataframe with raw play-by-play data
        """

        current_score_hometeam = None
        current_score_awayteam = None

        for index, row in df_match_raw.iterrows():

            ## Carry the last known score forward onto rows that have no score of their own
            if pd.isnull(row['current_score_hometeam']):
                df_match_raw.loc[index,'current_score_hometeam'] = current_score_hometeam
            else:
                current_score_hometeam = row['current_score_hometeam']

            if pd.isnull(row['current_score_awayteam']):
                df_match_raw.loc[index,'current_score_awayteam'] = current_score_awayteam
            else:
                current_score_awayteam = row['current_score_awayteam']

    def evaluate_mean_of_timedeltas(self,timedeltas):
        """ Finds mean of an array of time deltas, ignoring any that are null
        Used to find mean between events (like avg seconds between scoring events, etc)

        Args:
            timedeltas (array of timedeltas): array of timedeltas between events
        
        Returns:
            average elapsed time (timedelta format), or NaN if there are no non-null deltas
        """
        
        filtered_timedeltas = [x for x in timedeltas if pd.notnull(x)]
        if len(filtered_timedeltas) == 0:
            return np.nan

        ## Start the sum from a zero timedelta so that the first delta is not counted twice,
        ## and divide by the number of non-null deltas only
        return sum(filtered_timedeltas, datetime.timedelta(0)) / len(filtered_timedeltas)

    def add_cumulatives_elapsed_time_between_scores(self,df_match_raw):
        """ Sums up (cumulatively) average elapsed time between scoring events

        Args:
            df_match_raw (pandas dataframe): dataframe with raw play-by-play data
        """

        prev_time_remaining_in_period_overall = datetime.datetime.strptime('10:00', "%M:%S")
        prev_quarter_overall = 1
        prev_points_scored_type_overall = None


        prev_time_remaining_in_period_hometeam = datetime.datetime.strptime('10:00', "%M:%S")
        prev_quarter_hometeam = 1
        prev_points_scored_type_hometeam = None

        prev_time_remaining_in_period_awayteam = datetime.datetime.strptime('10:00', "%M:%S")
        prev_quarter_awayteam = 1
        prev_points_scored_type_awayteam = None

        timedeltas_overall = []
        timedeltas_overall_hometeam = []
        timedeltas_overall_awayteam = []

        timedeltas_quarter = []
        timedeltas_quarter_hometeam = []
        timedeltas_quarter_awayteam = []

        for index, row in df_match_raw.iterrows():

            if len(row['time_remaining_in_period'])>0:
                current_time_remaining_in_period = datetime.datetime.strptime(row['time_remaining_in_period'], "%M:%S")
                current_quarter = int(row['period'])

                ## Elapsed time between scoring events -- hometeam
                if row['stat_action_hometeam'] == 'Points Scored':
                    current_points_scored_type_hometeam = row['points_scored_type_hometeam']
                    if prev_quarter_hometeam == current_quarter and (prev_points_scored_type_hometeam != current_points_scored_type_hometeam or prev_time_remaining_in_period_hometeam >current_time_remaining_in_period or prev_quarter_hometeam < current_quarter):
                        df_match_raw.loc[index,'elapsed_time_since_last_scoring_event_hometeam'] = prev_time_remaining_in_period_hometeam - current_time_remaining_in_period
                        timedeltas_overall_hometeam.append(prev_time_remaining_in_period_hometeam - current_time_remaining_in_period)
                        timedeltas_quarter_hometeam.append(prev_time_remaining_in_period_hometeam - current_time_remaining_in_period)

                        df_match_raw.loc[index,'avg_time_between_scoring_events_overall_hometeam'] = self.evaluate_mean_of_timedeltas(timedeltas_overall_hometeam)
                        df_match_raw.loc[index,'avg_time_between_scoring_events_quarter_hometeam'] = self.evaluate_mean_of_timedeltas(timedeltas_quarter_hometeam)

                    elif int(current_quarter)-(prev_quarter_hometeam) >= 1:
                        df_match_raw.loc[index,'elapsed_time_since_last_scoring_event_hometeam'] = prev_time_remaining_in_period_hometeam + datetime.timedelta(minutes=10) - current_time_remaining_in_period
                        timedeltas_overall_hometeam.append(prev_time_remaining_in_period_hometeam + datetime.timedelta(minutes=10) - current_time_remaining_in_period)
                        timedeltas_quarter_hometeam = []

                    prev_time_remaining_in_period_hometeam  = current_time_remaining_in_period
                    prev_quarter_hometeam = current_quarter
                    prev_points_scored_type_hometeam = current_points_scored_type_hometeam

                ## Elapsed time between scoring events -- awayteam
                if row['stat_action_awayteam'] == 'Points Scored':
                    current_points_scored_type_awayteam = row['points_scored_type_awayteam']
                    if prev_quarter_awayteam == current_quarter and (prev_points_scored_type_awayteam != current_points_scored_type_awayteam or prev_time_remaining_in_period_awayteam >current_time_remaining_in_period or prev_quarter_awayteam < current_quarter):
                        df_match_raw.loc[index,'elapsed_time_since_last_scoring_event_awayteam'] = prev_time_remaining_in_period_awayteam - current_time_remaining_in_period
                        timedeltas_overall_awayteam.append(prev_time_remaining_in_period_awayteam - current_time_remaining_in_period)
                        timedeltas_quarter_awayteam.append(prev_time_remaining_in_period_awayteam - current_time_remaining_in_period)

                        df_match_raw.loc[index,'avg_time_between_scoring_events_overall_awayteam'] = self.evaluate_mean_of_timedeltas(timedeltas_overall_awayteam)
                        df_match_raw.loc[index,'avg_time_between_scoring_events_quarter_awayteam'] = self.evaluate_mean_of_timedeltas(timedeltas_quarter_awayteam)


                    elif int(current_quarter)-(prev_quarter_awayteam) >= 1:
                        df_match_raw.loc[index,'elapsed_time_since_last_scoring_event_awayteam'] = prev_time_remaining_in_period_awayteam + datetime.timedelta(minutes=10) - current_time_remaining_in_period
                        timedeltas_overall_awayteam.append(prev_time_remaining_in_period_awayteam + datetime.timedelta(minutes=10) - current_time_remaining_in_period)
                        timedeltas_quarter_awayteam = []
                    prev_time_remaining_in_period_awayteam  = current_time_remaining_in_period
                    prev_quarter_awayteam = current_quarter
                    prev_points_scored_type_awayteam = current_points_scored_type_awayteam



                if row['stat_action_hometeam'] == 'Points Scored' or row['stat_action_awayteam'] == 'Points Scored': 
                    if row['stat_action_hometeam'] == 'Points Scored':
                        current_points_scored_type_overall = row['points_scored_type_hometeam']
                    else:
                        current_points_scored_type_overall = row['points_scored_type_awayteam']

                    if prev_quarter_overall == current_quarter and (prev_points_scored_type_overall != current_points_scored_type_overall or prev_time_remaining_in_period_overall >current_time_remaining_in_period or prev_quarter_overall < current_quarter):
                        df_match_raw.loc[index,'elapsed_time_since_last_scoring_event_overall'] = prev_time_remaining_in_period_overall - current_time_remaining_in_period
                        timedeltas_overall.append(prev_time_remaining_in_period_overall - current_time_remaining_in_period)
                        timedeltas_quarter.append(prev_time_remaining_in_period_overall - current_time_remaining_in_period)

                        df_match_raw.loc[index,'avg_time_between_scoring_events_overall'] = self.evaluate_mean_of_timedeltas(timedeltas_overall)
                        df_match_raw.loc[index,'avg_time_between_scoring_events_quarter'] = self.evaluate_mean_of_timedeltas(timedeltas_quarter)


                    elif int(current_quarter)-(prev_quarter_overall) == 1:
                        df_match_raw.loc[index,'elapsed_time_since_last_scoring_event_overall'] = prev_time_remaining_in_period_overall + datetime.timedelta(minutes=10) - current_time_remaining_in_period
                        timedeltas_overall.append(prev_time_remaining_in_period_overall + datetime.timedelta(minutes=10) - current_time_remaining_in_period)
                        timedeltas_quarter = []

                    prev_time_remaining_in_period_overall  = current_time_remaining_in_period
                    prev_quarter_overall = current_quarter
                    prev_points_scored_type_overall = current_points_scored_type_overall    


    ## Helpers are static methods so that they can be called through self inside the class
    @staticmethod
    def pos_elements_in_array(lst):
        """ Returns the positive elements of a list, or None if there are none """
        return [x for x in lst if x > 0] or None

    @staticmethod
    def neg_elements_in_array(lst):
        """ Returns the negative elements of a list, or None if there are none """
        return [x for x in lst if x < 0] or None

    @staticmethod
    def abs_elements_in_array(lst):
        """ Returns the absolute values of a list, or None if the list is empty or None """
        return [abs(x) for x in (lst or [])] or None

    @staticmethod
    def apply_or_none(func, values):
        """ Applies func (e.g. np.mean or np.max) to values, or returns None if there are no values """
        return func(values) if values else None

    def add_cumulatives_lead_sizes(self,df_match_raw):
        """ Tracks (cumulatively) the size of the lead from the hometeam's point of view,
        along with lead changes, for the game so far and for the current quarter

        Args:
            df_match_raw (pandas dataframe): dataframe with raw play-by-play data
        """

        prev_lead = None
        cumulative_lead_changes = 0
        leads = []
        prev_quarter = None

        leads_in_quarter = []
        lead_changes_in_quarter = 0

        for index, row in df_match_raw.iterrows():

            current_quarter = row['period']
            scoring_event = row['stat_action_hometeam'] == 'Points Scored' or row['stat_action_awayteam'] == 'Points Scored'
            if pd.notnull(row['current_score_hometeam']) and pd.notnull(row['current_score_awayteam']) and scoring_event:

                ## Reset the per-quarter trackers when a new quarter starts
                if current_quarter != prev_quarter:
                    leads_in_quarter = []
                    lead_changes_in_quarter = 0
                prev_quarter = current_quarter

                ## Positive lead = hometeam ahead, negative lead = awayteam ahead
                current_lead = row['current_score_hometeam'] - row['current_score_awayteam']
                leads.append(current_lead)
                leads_in_quarter.append(current_lead)

                ## A lead change is the first lead of the game, or the lead switching sides
                sign_flipped = prev_lead is not None and ((prev_lead < 0 and current_lead > 0) or (prev_lead > 0 and current_lead < 0))
                if current_lead != 0 and (prev_lead is None or sign_flipped):
                    cumulative_lead_changes += 1
                    lead_changes_in_quarter += 1

                abs_leads = self.abs_elements_in_array(leads)
                home_leads = self.pos_elements_in_array(leads)
                away_leads = self.abs_elements_in_array(self.neg_elements_in_array(leads))

                abs_leads_in_quarter = self.abs_elements_in_array(leads_in_quarter)
                home_leads_in_quarter = self.pos_elements_in_array(leads_in_quarter)
                away_leads_in_quarter = self.abs_elements_in_array(self.neg_elements_in_array(leads_in_quarter))

                ## Game so far
                df_match_raw.loc[index,'current_lead_hometeam'] = current_lead
                df_match_raw.loc[index,'cumulative_lead_changes_game'] = cumulative_lead_changes
                df_match_raw.loc[index,'cumulative_avg_abs_size_of_lead_game'] = self.apply_or_none(np.mean, abs_leads)
                df_match_raw.loc[index,'cumulative_max_abs_size_of_lead_game'] = self.apply_or_none(np.max, abs_leads)
                df_match_raw.loc[index,'cumulative_max_size_of_lead_game_hometeam'] = self.apply_or_none(np.max, home_leads)
                df_match_raw.loc[index,'cumulative_max_size_of_lead_game_awayteam'] = self.apply_or_none(np.max, away_leads)

                ## Current quarter
                df_match_raw.loc[index,'cumulative_lead_changes_quarter'] = lead_changes_in_quarter
                df_match_raw.loc[index,'avg_abs_size_of_lead_quarter'] = self.apply_or_none(np.mean, abs_leads_in_quarter)
                df_match_raw.loc[index,'cumulative_avg_abs_size_of_lead_quarter'] = self.apply_or_none(np.mean, abs_leads_in_quarter)
                df_match_raw.loc[index,'cumulative_max_abs_size_of_lead_quarter'] = self.apply_or_none(np.max, abs_leads_in_quarter)
                df_match_raw.loc[index,'cumulative_max_size_of_lead_quarter_hometeam'] = self.apply_or_none(np.max, home_leads_in_quarter)
                df_match_raw.loc[index,'cumulative_max_size_of_lead_quarter_awayteam'] = self.apply_or_none(np.max, away_leads_in_quarter)

                prev_lead = current_lead


    def add_cumulatives_possessions(self,df_match_raw):
        """ Sums up (cumulatively) total and quarterly possessions per team

        Args:
            df_match_raw (pandas dataframe): dataframe with raw play-by-play data
        """

        cumulative_possessions_overall_hometeam = 0 
        cumulative_possessions_quarter_hometeam = 0

        cumulative_possessions_overall_awayteam = 0 
        cumulative_possessions_quarter_awayteam = 0

        prev_quarter = 1

        action_set = set(['Points Scored', 'Shot Missed','Turnover Comitted','Free Throws Awarded'])

        for index, row in df_match_raw.iterrows():

            if row['period'] != None:
                current_quarter = row['period']   

            if (row['stat_action_hometeam'] != None and row['stat_action_hometeam'] in action_set) or (row['stat_action_awayteam'] != None and row['stat_action_awayteam'] in action_set):


                if current_quarter != prev_quarter:
                    cumulative_possessions_quarter_hometeam = 0
                    cumulative_possessions_quarter_awayteam = 0

                possession_just_ended = row['current_team_performing_stat_action']

                if possession_just_ended.strip() == '1':
                    cumulative_possessions_overall_hometeam += 1
                    cumulative_possessions_quarter_hometeam += 1
                if possession_just_ended.strip() == '2':
                    cumulative_possessions_overall_awayteam += 1
                    cumulative_possessions_quarter_awayteam += 1
                prev_quarter = current_quarter

            df_match_raw.loc[index,'cumulative_possessions_overall_hometeam'] = cumulative_possessions_overall_hometeam
            df_match_raw.loc[index,'cumulative_possessions_overall_awayteam'] = cumulative_possessions_overall_awayteam
            df_match_raw.loc[index,'cumulative_possessions_quarter_hometeam'] = cumulative_possessions_quarter_hometeam
            df_match_raw.loc[index,'cumulative_possessions_quarter_awayteam'] = cumulative_possessions_quarter_awayteam


    def evaluate_player_dict(self,player_dict,starting_five):
        """ Takes input of dictionary with each player's stats for the game so far,
        and outputs a dictionary with metrics of interest concerning these stats, like
        'starting_five_in_play', 'percent_of_total_points_scored_by_players_in_play', etc

        Args:
            player_dict (dict): dictionary with each player's stats for the game so far
            starting_five: names of the team's starting five players

        Returns:
            dict with metrics of interest concerning these stats
        """
        
        top_scorers = sorted(player_dict, key=lambda x: (int(player_dict[x]['points']), int(player_dict[x]['assists'])),reverse = True)
        array_len = len(top_scorers)
        current_guy = 0
        top_five_scorers_in_play = 0
        points_scored_by_players_in_play = 0
        total_points_scored = 0
        starting_five_in_play = 0

        for x in range(array_len):
            dict_value = player_dict.get(top_scorers[x])
            if dict_value['in']:
                points_scored_by_players_in_play += int(dict_value['points'])
                if current_guy <5:
                    top_five_scorers_in_play += 1
                if top_scorers[x] in starting_five:
                    starting_five_in_play += 1
            current_guy +=1                
            total_points_scored += int(dict_value['points'])


        top_players = sorted(player_dict, key=lambda x: (int(player_dict[x]['points'])+int(player_dict[x]['assists']) + int(player_dict[x]['rebounds'])),reverse = True)
        array_len = len(top_players)
        current_guy = 0
        top_five_players_in_play = 0
        total_stat_count_players_in_play = 0
        total_stat_count = 0

        for x in range(array_len):
            dict_value = player_dict.get(top_players[x])
            if dict_value['in']:
                total_stat_count_players_in_play += (int(dict_value['points']) + int(dict_value['assists']) + int(dict_value['rebounds']))
                if current_guy <5:
                    top_five_players_in_play += 1
            current_guy +=1        
            total_stat_count += (int(dict_value['points']) + int(dict_value['assists']) + int(dict_value['rebounds']))

        percent_of_total_points_scored_by_players_in_play = 0
        percent_of_total_stat_count_by_players_in_play = 0
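        try:
            percent_of_total_points_scored_by_players_in_play = points_scored_by_players_in_play/total_points_scored
        except ZeroDivisionError:
            # no points scored yet, so leave the share at 0
            pass
        try:
            percent_of_total_stat_count_by_players_in_play = total_stat_count_players_in_play/total_stat_count
        except ZeroDivisionError:
            # no stats recorded yet, so leave the share at 0
            pass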

        result_set = {'starting_five_in_play':starting_five_in_play,
                'top_five_scorers_in_play':top_five_scorers_in_play,
                'points_scored_by_players_in_play':points_scored_by_players_in_play,
                'percent_of_total_points_scored_by_players_in_play':percent_of_total_points_scored_by_players_in_play,
                'top_five_players_in_play':top_five_players_in_play,
                'total_stat_count_players_in_play':total_stat_count_players_in_play,
                'percent_of_total_stat_count_by_players_in_play':percent_of_total_stat_count_by_players_in_play            
                }
        return result_set    


    def process_top_fives(self,df_match_raw):
        
        """ Iterates through dataframe with play-by-play data, calculating metrics relating
        to the top five players for each team, and adding them as fields in the dataframe.

        Essentially we want to answer questions related to star players, and bench play, like
        'How many of the top 5 scorers are in play right now?' or 'what % of total points can be
        attributed to the top five scorers in this match?'
        
        Args:
            df_match_raw (pandas dataframe): dataframe with raw play-by-play data
        """
        
        players_dict_hometeam = {}
        players_dict_awayteam = {}

        for index, row in df_match_raw.iterrows():
            if row['stat_action_hometeam'] == 'Substitution' and 'substitution_player_in_hometeam' in df_match_raw.columns and pd.notnull(row['substitution_player_in_hometeam']):
                player = row['substitution_player_in_hometeam']
                player_dict_value = players_dict_hometeam.get(player, {'team':'hometeam','in':True,'points':0,'fouls':0,'rebounds':0,'assists':0}) 
                player_dict_value['in'] = True
                players_dict_hometeam[player] = player_dict_value       

        #    # Player OUT
            if row['stat_action_hometeam'] == 'Substitution' and 'substitution_player_out_hometeam' in df_match_raw.columns and pd.notnull(row['substitution_player_out_hometeam']):
                player = row['substitution_player_out_hometeam']
                player_dict_value = players_dict_hometeam.get(player, {'team':'hometeam','in':False,'points':0,'fouls':0,'rebounds':0,'assists':0}) 
                player_dict_value['in'] = False
                players_dict_hometeam[player] = player_dict_value 

        #    # Points Scored
            if row['stat_action_hometeam'] == 'Points Scored' and 'points_scored_player_hometeam' in df_match_raw.columns and pd.notnull(row['points_scored_player_hometeam']):
                player = row['points_scored_player_hometeam']
                player_dict_value = players_dict_hometeam.get(player, {'team':'hometeam','in':True,'points':0,'fouls':0,'rebounds':0,'assists':0}) 
                player_dict_value['in'] = True
                player_dict_value['points'] = row['points_scored_by_player_hometeam']        
                players_dict_hometeam[player] = player_dict_value 

        #    # Foul Committed
            if row['stat_action_hometeam'] == 'Foul Committed'  and 'foul_committed_player_hometeam' in df_match_raw.columns and pd.notnull(row['foul_committed_player_hometeam']):
                player = row['foul_committed_player_hometeam']
                player_dict_value = players_dict_hometeam.get(player, {'team':'hometeam','in':True,'points':0,'fouls':0,'rebounds':0,'assists':0}) 
                player_dict_value['in'] = True
                player_dict_value['fouls'] = row['player_personal_fouls_committed_hometeam']             
                players_dict_hometeam[player] = player_dict_value 

        #    # Rebounds
            if 'offensive_rebound_player_hometeam' in df_match_raw.columns and pd.notnull(row['offensive_rebound_player_hometeam']):
                player = row['offensive_rebound_player_hometeam']
                player_dict_value = players_dict_hometeam.get(player, {'team':'hometeam','in':True,'points':0,'fouls':0,'rebounds':0,'assists':0}) 
                player_dict_value['in'] = True
                player_dict_value['rebounds'] = row['offensive_rebounds_by_player_hometeam']             
                players_dict_hometeam[player] = player_dict_value 
            if  'defensive_rebound_player_hometeam' in df_match_raw.columns and pd.notnull(row['defensive_rebound_player_hometeam']):
                player = row['defensive_rebound_player_hometeam']
                player_dict_value = players_dict_hometeam.get(player, {'team':'hometeam','in':True,'points':0,'fouls':0,'rebounds':0,'assists':0}) 
                player_dict_value['in'] = True
                player_dict_value['rebounds'] = row['defensive_rebounds_by_player_hometeam']             
                players_dict_hometeam[player] = player_dict_value 

        #    # Assists
            if 'scoring_assist_player_hometeam' in df_match_raw.columns  and pd.notnull(row['scoring_assist_player_hometeam']):
                player = row['scoring_assist_player_hometeam']
                # home team assists belong in the home team dictionary
                player_dict_value = players_dict_hometeam.get(player, {'team':'hometeam','in':True,'points':0,'fouls':0,'rebounds':0,'assists':0}) 
                player_dict_value['in'] = True
                player_dict_value['assists'] = row['scoring_assists_by_player_hometeam']             
                players_dict_hometeam[player] = player_dict_value 

        #    ## AWAYTEAM
        #    # Player IN
            if row['stat_action_awayteam'] == 'Substitution'  and 'substitution_player_in_awayteam' in df_match_raw.columns and pd.notnull(row['substitution_player_in_awayteam']):
                player = row['substitution_player_in_awayteam']
                player_dict_value = players_dict_awayteam.get(player, {'team':'awayteam','in':True,'points':0,'fouls':0,'rebounds':0,'assists':0}) 
                player_dict_value['in'] = True
                players_dict_awayteam[player] = player_dict_value 


        #    # Player OUT
            if row['stat_action_awayteam'] == 'Substitution' and 'substitution_player_out_awayteam' in df_match_raw.columns and pd.notnull(row['substitution_player_out_awayteam']):
                player = row['substitution_player_out_awayteam']
                player_dict_value = players_dict_awayteam.get(player, {'team':'awayteam','in':False,'points':0,'fouls':0,'rebounds':0,'assists':0}) 
                player_dict_value['in'] = False
                players_dict_awayteam[player] = player_dict_value 

        #    # Points Scored
            if row['stat_action_awayteam'] == 'Points Scored' and 'points_scored_player_awayteam' in df_match_raw.columns and pd.notnull(row['points_scored_player_awayteam']):
                player = row['points_scored_player_awayteam']
                player_dict_value = players_dict_awayteam.get(player, {'team':'awayteam','in':True,'points':0,'fouls':0,'rebounds':0,'assists':0}) 
                player_dict_value['in'] = True
                player_dict_value['points'] = row['points_scored_by_player_awayteam']        
                players_dict_awayteam[player] = player_dict_value 

        #    # Foul Committed
            if row['stat_action_awayteam'] == 'Foul Committed'  and 'foul_committed_player_awayteam' in df_match_raw.columns and pd.notnull(row['foul_committed_player_awayteam']):
                player = row['foul_committed_player_awayteam']
                player_dict_value = players_dict_awayteam.get(player, {'team':'awayteam','in':True,'points':0,'fouls':0,'rebounds':0,'assists':0}) 
                player_dict_value['in'] = True
                player_dict_value['fouls'] = row['player_personal_fouls_committed_awayteam']             
                players_dict_awayteam[player] = player_dict_value 

        #    # Rebounds
            if  'offensive_rebound_player_awayteam' in df_match_raw.columns and pd.notnull(row['offensive_rebound_player_awayteam']):
                player = row['offensive_rebound_player_awayteam']
                player_dict_value = players_dict_awayteam.get(player, {'team':'awayteam','in':True,'points':0,'fouls':0,'rebounds':0,'assists':0}) 
                player_dict_value['in'] = True
                player_dict_value['rebounds'] = row['offensive_rebounds_by_player_awayteam']             
                players_dict_awayteam[player] = player_dict_value 
            if  'defensive_rebound_player_awayteam' in df_match_raw.columns and pd.notnull(row['defensive_rebound_player_awayteam']):
                player = row['defensive_rebound_player_awayteam']
                player_dict_value = players_dict_awayteam.get(player, {'team':'awayteam','in':True,'points':0,'fouls':0,'rebounds':0,'assists':0}) 
                player_dict_value['in'] = True
                player_dict_value['rebounds'] = row['defensive_rebounds_by_player_awayteam']             
                players_dict_awayteam[player] = player_dict_value 

        #    # Assists
            if  'scoring_assist_player_awayteam' in df_match_raw.columns  and pd.notnull(row['scoring_assist_player_awayteam']):
                player = row['scoring_assist_player_awayteam']
                player_dict_value = players_dict_awayteam.get(player, {'team':'awayteam','in':True,'points':0,'fouls':0,'rebounds':0,'assists':0}) 
                player_dict_value['in'] = True
                player_dict_value['assists'] = row['scoring_assists_by_player_awayteam']             
                players_dict_awayteam[player] = player_dict_value 

            hometeam_result_set = self.evaluate_player_dict(players_dict_hometeam,row['starting_five_hometeam'])
            awayteam_result_set = self.evaluate_player_dict(players_dict_awayteam,row['starting_five_awayteam'])

            df_match_raw.loc[index,'starting_five_in_play_hometeam'] = int(hometeam_result_set.get('starting_five_in_play'))
            df_match_raw.loc[index,'top_five_scorers_in_play_hometeam'] = int(hometeam_result_set.get('top_five_scorers_in_play'))
            df_match_raw.loc[index,'points_scored_by_players_in_play_hometeam'] = int(hometeam_result_set.get('points_scored_by_players_in_play'))
            df_match_raw.loc[index,'percent_of_total_points_scored_by_players_in_play_hometeam'] = float(hometeam_result_set.get('percent_of_total_points_scored_by_players_in_play'))
            df_match_raw.loc[index,'top_five_players_in_play_hometeam'] = int(hometeam_result_set.get('top_five_players_in_play'))
            df_match_raw.loc[index,'total_stat_count_players_in_play_hometeam'] = int(hometeam_result_set.get('total_stat_count_players_in_play'))
            df_match_raw.loc[index,'percent_of_total_stat_count_by_players_in_play_hometeam'] = float(hometeam_result_set.get('percent_of_total_stat_count_by_players_in_play'))

            df_match_raw.loc[index,'starting_five_in_play_awayteam'] = int(awayteam_result_set.get('starting_five_in_play'))
            df_match_raw.loc[index,'top_five_scorers_in_play_awayteam'] = int(awayteam_result_set.get('top_five_scorers_in_play'))
            df_match_raw.loc[index,'points_scored_by_players_in_play_awayteam'] = int(awayteam_result_set.get('points_scored_by_players_in_play'))
            df_match_raw.loc[index,'percent_of_total_points_scored_by_players_in_play_awayteam'] = float(awayteam_result_set.get('percent_of_total_points_scored_by_players_in_play'))
            df_match_raw.loc[index,'top_five_players_in_play_awayteam'] = int(awayteam_result_set.get('top_five_players_in_play'))
            df_match_raw.loc[index,'total_stat_count_players_in_play_awayteam'] = int(awayteam_result_set.get('total_stat_count_players_in_play'))
            df_match_raw.loc[index,'percent_of_total_stat_count_by_players_in_play_awayteam'] = float(awayteam_result_set.get('percent_of_total_stat_count_by_players_in_play'))



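To make the "star players versus bench" metrics concrete, here is a minimal sketch on a toy roster. It is not the notebook's `evaluate_player_dict`: the helper name `share_of_points_in_play` and the player names are hypothetical, and it assumes only the `in`, `points` and `assists` keys used in the per-player dictionaries above.

```python
def share_of_points_in_play(player_dict):
    """Return (number of top-five scorers on court, share of team points scored by on-court players)."""
    # rank players by points, breaking ties on assists (same ordering idea as above)
    ranked = sorted(
        player_dict,
        key=lambda p: (player_dict[p]['points'], player_dict[p]['assists']),
        reverse=True,
    )
    top_five = set(ranked[:5])
    in_play = {p for p, stats in player_dict.items() if stats['in']}

    total_points = sum(stats['points'] for stats in player_dict.values())
    points_in_play = sum(player_dict[p]['points'] for p in in_play)

    # guard against division by zero before anyone has scored
    share = points_in_play / total_points if total_points else 0.0
    return len(top_five & in_play), share


# hypothetical three-player roster: B is currently on the bench
toy_roster = {
    'A. Guard':   {'in': True,  'points': 10, 'assists': 2},
    'B. Forward': {'in': False, 'points': 6,  'assists': 0},
    'C. Center':  {'in': True,  'points': 4,  'assists': 1},
}
top_five_on_court, share = share_of_points_in_play(toy_roster)
print(top_five_on_court, share)
```

With the highest scorer's backup on the bench, two of the top scorers are on court and they account for 14 of the team's 20 points.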

### 2.6 - Extra Processing, Fill in Columns, Converting Columns etc

In [68]:
class ProcessFibaEuropeMatch(ProcessFibaEuropeMatch):
    
    def create_clean_dataset_chunk(self,df,show_process_details=True):
        """ The previous steps have resulted in a dataframe with a large number of columns, a
        number which may vary depending on the match (imagine a match with no blocked shots,
        for example). This function applies a standard set of columns and a data type for each column,
        as well as some extra cleanup, like filling in any gaps which may occur for
        some cumulative metrics.

        Args:
            df (pandas dataframe): dataframe with raw play-by-play data, and many newly appended columns based on that play-by-play data
            show_process_details (boolean): print process details or hide

        """

        def add_cumulatives_fill_in_current_fouls(df_match_raw):
            """
            The first pass of processing the data ended up with some gaps in the `team_fouls_committed_hometeam`
            and `team_fouls_committed_awayteam` columns. 

            Here we fill in the gaps.
            """
            team_fouls_committed_hometeam = None
            team_fouls_committed_awayteam = None
            current_match_id = None

            for index, row in df_match_raw.iterrows():
                if current_match_id is None or row['match_id'] == current_match_id:
                    # pd.isnull catches both None and NaN gaps
                    if pd.isnull(row['team_fouls_committed_hometeam']):
                        df_match_raw.loc[index,'team_fouls_committed_hometeam'] = team_fouls_committed_hometeam
                    else:
                        team_fouls_committed_hometeam = row['team_fouls_committed_hometeam']

                    if pd.isnull(row['team_fouls_committed_awayteam']):
                        df_match_raw.loc[index,'team_fouls_committed_awayteam'] = team_fouls_committed_awayteam
                    else:
                        team_fouls_committed_awayteam = row['team_fouls_committed_awayteam']
                else:
                    # new match: restart the running foul counts
                    team_fouls_committed_hometeam = row['team_fouls_committed_hometeam']
                    team_fouls_committed_awayteam = row['team_fouls_committed_awayteam']

                current_match_id = row['match_id']


        def show_full_mem_usage(df):
            """
            Show memory usage info by column
            """
            df.info(memory_usage='deep')
            for dtype in ['float','int','object']:
                selected_dtype = df.select_dtypes(include=[dtype])
                mean_usage_b = selected_dtype.memory_usage(deep=True).mean()
                mean_usage_mb = mean_usage_b / 1024 ** 2
                print("Average memory usage for {} columns: {:03.2f} MB".format(dtype,mean_usage_mb))

        ## Define downcast conversion functions
        def convert_to_categories(dataframe,column_array):
            """
            Convert category columns to 'category'
            """
            converted_column_counter = 0    
            for column in column_array:
                if column in dataframe.columns:
                    try:
                        dataframe[[column]] = dataframe[[column]].astype('category')
                        converted_column_counter += 1
                    except Exception:
                        print("Could not convert column: '" + str(column) + "' to category")
            if(show_process_details):        
                print("Converted " + str(converted_column_counter) + " to type: category")            


        def convert_timedeltas_to_seconds(dataframe,column_array):
            """
            Creating metrics like 'avg time of possession' or 'avg seconds per point scored' requires measurements of time to be compared between rows.
            I do these initially as time deltas. Here we convert time deltas to seconds.
            """
            converted_column_counter = 0    
            for column in column_array:
                if column in dataframe.columns:
                    try:
                        dataframe[[column]] = dataframe[[column]].applymap(lambda x: pd.to_timedelta(x).seconds)
                        converted_column_counter += 1
                    except Exception:
                        print("Could not convert column: '" + str(column) + "' to timedelta-->seconds")
            if(show_process_details):        
                print("Converted " + str(converted_column_counter) + " to type: timedelta-->seconds")            


        def convert_int64_to_int32(dataframe,column_array):
            """
            clean up some data type issues by converting all int64 to int32
            """
            converted_column_counter = 0    
            for column in column_array:
                if column in dataframe.columns:
                    try:
                        dataframe[[column]] = dataframe[[column]].astype('int32')
                        converted_column_counter += 1
                    except Exception:
                        print("Could not convert column: '" + str(column) + "' to int32")
            if(show_process_details):        
                print("Converted " + str(converted_column_counter) + " to type: int32")    

        def convert_float64_to_float32(dataframe,column_array):
            converted_column_counter = 0    
            for column in column_array:
                if column in dataframe.columns:
                    try:
                        dataframe[[column]] = dataframe[[column]].astype('float32')
                        converted_column_counter += 1
                    except Exception:
                        print("Could not convert column: '" + str(column) + "' to float32")
            if(show_process_details):        
                print("Converted " + str(converted_column_counter) + " to type: float32")    



        add_cumulatives_fill_in_current_fouls(df)


        if(show_process_details):
            print('Raw Mem Usage:')
            show_full_mem_usage(df)


        # Define set of columns we wish to keep and use for machine learning
        columns_to_use = [
         'match_id',
         'row_number',
         'period',
         'current_team_performing_stat_action',
         'time_remaining_in_period',
         'minutes_remaining_in_period',
         'current_score_hometeam',
         'current_score_awayteam',
         'metadata_competition_name',
         'team_name_hometeam',
         'team_name_awayteam',
         'final_score_hometeam',
         'final_score_awayteam',
         'ending_score_period1_hometeam',
         'ending_score_period1_awayteam',
         'ending_score_period2_hometeam',
         'ending_score_period2_awayteam',
         'ending_score_period3_hometeam',
         'ending_score_period3_awayteam',
         'ending_score_period4_hometeam',
         'ending_score_period4_awayteam',
         'ending_score_period5_hometeam',
         'ending_score_period5_awayteam',
         'team_fouls_committed_hometeam',
         'player_personal_fouls_committed_hometeam',
         'team_fouls_committed_awayteam',
         'player_personal_fouls_committed_awayteam',
         'avg_time_between_scoring_events_overall_hometeam',
         'avg_time_between_scoring_events_quarter_hometeam',
         'avg_time_between_scoring_events_overall',
         'avg_time_between_scoring_events_quarter',
         'avg_time_between_scoring_events_overall_awayteam',
         'avg_time_between_scoring_events_quarter_awayteam',
         'current_lead_hometeam',
         'cumulative_lead_changes_game',
         'cumulative_avg_abs_size_of_lead_game',
         'cumulative_max_abs_size_of_lead_game',
         'cumulative_max_size_of_lead_game_hometeam',
         'cumulative_max_size_of_lead_game_awayteam',
         'cumulative_lead_changes_quarter',
         'avg_abs_size_of_lead_quarter',
         'cumulative_avg_abs_size_of_lead_quarter',
         'cumulative_max_abs_size_of_lead_quarter',
         'cumulative_max_size_of_lead_quarter_hometeam',
         'cumulative_max_size_of_lead_quarter_awayteam',
         'cumulative_possessions_overall_hometeam',
         'cumulative_possessions_overall_awayteam',
         'cumulative_possessions_quarter_hometeam',
         'cumulative_possessions_quarter_awayteam',
         'starting_five_in_play_hometeam',
         'top_five_scorers_in_play_hometeam',
         'points_scored_by_players_in_play_hometeam',
         'percent_of_total_points_scored_by_players_in_play_hometeam',
         'top_five_players_in_play_hometeam',
         'total_stat_count_players_in_play_hometeam',
         'percent_of_total_stat_count_by_players_in_play_hometeam',
         'starting_five_in_play_awayteam',
         'top_five_scorers_in_play_awayteam',
         'points_scored_by_players_in_play_awayteam',
         'percent_of_total_points_scored_by_players_in_play_awayteam',
         'top_five_players_in_play_awayteam',
         'total_stat_count_players_in_play_awayteam',
         'percent_of_total_stat_count_by_players_in_play_awayteam'
         ]

        # Drop unwanted columns
        columns_to_drop = []

        for column_name in df.columns:
            if column_name not in columns_to_use:
                columns_to_drop.append(column_name)

        df = df.drop(columns=columns_to_drop)

        if(show_process_details):
            print('Mem Usage after dropping columns:')
            show_full_mem_usage(df)    

        # Downcast columns as necessary
        columns_to_convert_from_object_to_category = ['match_id','metadata_competition_name','team_name_hometeam','team_name_awayteam']    
        columns_to_convert_from_object_to_timedelta_to_seconds = ['avg_time_between_scoring_events_overall_hometeam'
                                                                  ,'avg_time_between_scoring_events_quarter_hometeam'
                                                                  ,'avg_time_between_scoring_events_overall'
                                                                  ,'avg_time_between_scoring_events_quarter'
                                                                  ,'avg_time_between_scoring_events_overall_awayteam'
                                                                  ,'avg_time_between_scoring_events_quarter_awayteam']    


        convert_to_categories(df,columns_to_convert_from_object_to_category)
        convert_timedeltas_to_seconds(df,columns_to_convert_from_object_to_timedelta_to_seconds)    


        column_int64 = list(df.select_dtypes(include=['int64']).columns)    
        column_float64 = list(df.select_dtypes(include=['float64']).columns)    
        convert_int64_to_int32(df,column_int64)
        convert_float64_to_float32(df,column_float64)

        if(show_process_details):
            print('Mem Usage After Downcasting Columns:')
            show_full_mem_usage(df)

        def forward_fill_spotty_rows(df,column_to_fill):
            """
            As with the `add_cumulatives_fill_in_current_fouls` function, some columns
            have gaps in them that need to be filled in. This iterates through the dataframe
            and fills these gaps
            """
            tmp_name = column_to_fill + '_tmp'
            df.rename(columns={column_to_fill: tmp_name}, inplace=True)
            # forward fill within each match so values never leak across matches
            df[column_to_fill] = df[['match_id',tmp_name]].groupby(['match_id']).ffill()[tmp_name]

            if column_to_fill in ['avg_time_between_scoring_events_overall_hometeam'
                                  ,'avg_time_between_scoring_events_quarter_hometeam'
                                  ,'avg_time_between_scoring_events_overall'
                                  ,'avg_time_between_scoring_events_quarter'
                                  ,'avg_time_between_scoring_events_overall_awayteam'
                                  ,'avg_time_between_scoring_events_quarter_awayteam']:
                try:
                    df[column_to_fill] = df[column_to_fill].combine_first(datetime.datetime.strptime('10:00', "%M:%S") - df['time_remaining_in_period'].apply(lambda x: datetime.datetime.strptime(x, "%M:%S")))
                except Exception:
                    # fall back to a backward fill within the match if the time strings cannot be parsed
                    df[column_to_fill] = df[column_to_fill].combine_first(df[['match_id',tmp_name]].groupby(['match_id']).bfill()[tmp_name])

                if df[column_to_fill].dtype == 'm8[ns]':
                    df[column_to_fill] = df[column_to_fill] / np.timedelta64(1, 's')


            else:
                df[column_to_fill] = df[column_to_fill].fillna(0.0)


        # Forward fill the cumulative columns that may still have gaps.
        # These calls live in the body of create_clean_dataset_chunk, not inside
        # forward_fill_spotty_rows, otherwise they would never run.
        columns_to_forward_fill = [
            'avg_time_between_scoring_events_overall_hometeam',
            'avg_time_between_scoring_events_quarter_hometeam',
            'avg_time_between_scoring_events_overall',
            'avg_time_between_scoring_events_quarter',
            'avg_time_between_scoring_events_overall_awayteam',
            'avg_time_between_scoring_events_quarter_awayteam',
            'current_lead_hometeam',
            'cumulative_lead_changes_game',
            'cumulative_avg_abs_size_of_lead_game',
            'cumulative_max_abs_size_of_lead_game',
            'cumulative_max_size_of_lead_game_hometeam',
            'cumulative_max_size_of_lead_game_awayteam',
            'cumulative_lead_changes_quarter',
            'avg_abs_size_of_lead_quarter',
            'cumulative_avg_abs_size_of_lead_quarter',
            'cumulative_max_abs_size_of_lead_quarter',
            'cumulative_max_size_of_lead_quarter_hometeam',
            'cumulative_max_size_of_lead_quarter_awayteam',
        ]

        for column_to_fill in columns_to_forward_fill:
            # skip columns that this particular match never produced
            if column_to_fill in df.columns:
                forward_fill_spotty_rows(df,column_to_fill)
                print(column_to_fill + " -- FINISHED")

        return df

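The gap filling in `forward_fill_spotty_rows` relies on forward filling within each match. As a minimal sketch of that idea (the toy dataframe and its values are made up for illustration, and it uses pandas' `groupby(...).ffill()` rather than the exact calls above):

```python
import numpy as np
import pandas as pd

# two hypothetical matches, each with gaps in a cumulative column
toy = pd.DataFrame({
    'match_id': ['m1', 'm1', 'm1', 'm2', 'm2'],
    'current_lead_hometeam': [np.nan, 2.0, np.nan, np.nan, -3.0],
})

# forward fill inside each match, then treat anything still missing
# (gaps before the first recorded value) as 0
filled = (
    toy.groupby('match_id')['current_lead_hometeam']
       .ffill()
       .fillna(0.0)
)
toy['current_lead_hometeam'] = filled
print(toy)
```

Grouping by `match_id` matters once several matches are stacked in one dataframe: a plain forward fill would carry the last lead of `m1` into the first row of `m2`.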

In [None]:
# Add a fix
import numpy as np

def fix_possibly_screwy_fields(df_train):
    """
    Some of the columns got a bit screwy in earlier versions of this part, 
    so I am recalculating them here. May not be a problem anymore, but it can't hurt.
    """

    print("initial column count: " + str(len(df_train.columns)))
    print("adjusting bad columns  -- grouping")

    #df_grouped = dftest.groupby(['match_id']).fillna(method='bfill')[tmp_name])

    df_grouped = df_train.groupby('match_id').agg({'current_score_hometeam': ['max'],'current_score_awayteam': ['max']})
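    # flatten the two-level column names, e.g. ('current_score_hometeam', 'max') -> 'current_score_hometeam_max'
    df_grouped.columns = ["_".join(x) for x in df_grouped.columns]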
    #df_grouped.columns
    print("adjusting bad columns  -- merging")

    df_train = df_train.merge(df_grouped, on='match_id', how='left')

    # df_train.columns

    df_train['final_score_hometeam'] = df_train['current_score_hometeam_max']
    df_train['final_score_awayteam'] = df_train['current_score_awayteam_max']
    df_train['final_score_combined'] = df_train['final_score_hometeam'] + df_train['final_score_awayteam']
    df_train['ending_lead_final_hometeam'] = df_train['final_score_hometeam'] - df_train['final_score_awayteam']
    df_train['winner_hometeam'] = np.where(df_train['final_score_hometeam']>df_train['final_score_awayteam'], 1, 0)
    df_train['current_lead_hometeam'] = df_train['current_score_hometeam'] - df_train['current_score_awayteam']

    print("updated column count: " + str(len(df_train.columns)))
    return df_train

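The same recalculation can also be written without the intermediate merge. The following is only a sketch of an alternative, not what the notebook runs: it uses `groupby(...).transform('max')` on a made-up two-match dataframe and, like the function above, assumes the running score never decreases, so the per-match maximum equals the final score.

```python
import pandas as pd

# hypothetical play-by-play rows for two matches
toy = pd.DataFrame({
    'match_id': ['m1', 'm1', 'm2', 'm2'],
    'current_score_hometeam': [2, 5, 3, 3],
    'current_score_awayteam': [0, 7, 1, 2],
})

# transform('max') broadcasts each match's maximum back onto every row of that match
toy['final_score_hometeam'] = toy.groupby('match_id')['current_score_hometeam'].transform('max')
toy['final_score_awayteam'] = toy.groupby('match_id')['current_score_awayteam'].transform('max')

toy['winner_hometeam'] = (toy['final_score_hometeam'] > toy['final_score_awayteam']).astype(int)
toy['current_lead_hometeam'] = toy['current_score_hometeam'] - toy['current_score_awayteam']
print(toy)
```

This avoids leaving helper columns such as `current_score_hometeam_max` in the result, at the cost of grouping once per column.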

### 2.7 - Combine all of the above, and run it

In [73]:
class ProcessFibaEuropeMatch(ProcessFibaEuropeMatch):


    def process_fiba_europe_match(self):
        """
        Here's the whole process from start to finish
        """
        import time
        #record elapsed time to process the match
        start_time = time.time()
        orig_start_time = start_time

        #extract metadata. We will append it later
        match_metadata_dict = self.extract_metadata_from_root()
        
        # extract granular play-by-play data from the match file
        df_match_raw = self.extract_granular_data_from_root()
        
        # extract list of starting five players for home and away teams
        list_of_starting_fives = self.process_substitutions_and_starting_fives(df_match_raw)

        # add these starting fives to the 'match_metadata_dict'
        # this will be used to create metrics for this set of players which
        # will be added to the final dataframe
        match_metadata_dict['starting_five_hometeam'] = list_of_starting_fives[0]
        match_metadata_dict['starting_five_awayteam'] = list_of_starting_fives[1]

        # iterate through dataframe, parse relevant data for each event into
        # its own column
        for index, row in df_match_raw.iterrows():
            self.process_fouls_for_row(df_match_raw,index,row)
            self.process_turnovers_for_row(df_match_raw,index,row)
            self.process_steals_for_row(df_match_raw,index,row)
            self.process_blocks_for_row(df_match_raw,index,row)
            self.process_shots_made_for_row(df_match_raw,index,row)
            self.process_shots_missed_for_row(df_match_raw,index,row)
            self.process_assists_for_row(df_match_raw,index,row)
            self.process_other_misc_for_row(df_match_raw,index,row)
            self.add_metadata_for_row(df_match_raw,index,row,match_metadata_dict)

        # iterate again (several times; inefficient, I know), adding cumulative metrics 
        self.add_cumulatives_fill_in_current_score(df_match_raw)
        self.add_cumulatives_elapsed_time_between_scores(df_match_raw)
        self.add_cumulatives_lead_sizes(df_match_raw)
        self.add_cumulatives_possessions(df_match_raw)
        # add metrics related to the top five players for each team
        self.process_top_fives(df_match_raw)

        # clean up dataset, dropping some columns, downcasting columns, etc
        df_match_raw = self.create_clean_dataset_chunk(df_match_raw)
        df_match_raw = fix_possibly_screwy_fields(df_match_raw)
        
        # record elapsed time to process match
        final_time = time.time() - orig_start_time
        print(str(self.match_id) + " -  DONE")
        print(final_time)


        return df_match_raw                
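
The cumulative passes above are flagged as inefficient, and some of them could possibly be replaced with vectorized pandas operations instead of row-by-row iteration. The following is a minimal sketch under stated assumptions, not the implementation used in this project: it assumes the score columns are populated only on scoring events, that the score before the first basket is 0, and it introduces a hypothetical `lead_size_hometeam` column for illustration.

```python
import pandas as pd


def fill_scores_and_lead(df):
    """Return a copy of df with scores carried forward and a lead column added."""
    out = df.copy()
    score_cols = ["current_score_hometeam", "current_score_awayteam"]
    # carry the last known score forward; assume 0 before the first basket
    out[score_cols] = out[score_cols].ffill().fillna(0)
    # hypothetical column name: positive when the home team is ahead
    out["lead_size_hometeam"] = (
        out["current_score_hometeam"] - out["current_score_awayteam"]
    )
    return out


# toy data: scores are only recorded on the rows where they change
toy = pd.DataFrame({
    "current_score_hometeam": [None, 2, None, 5, None],
    "current_score_awayteam": [None, None, 2, None, 3],
})
result = fill_scores_and_lead(toy)
print(result)
```

Each column is handled in a single pass, so the cost no longer depends on a Python-level loop over events.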


In [76]:
# instantiate the updated match processor class
match_processor = ProcessFibaEuropeMatch(example_match_id, match_content)

# process the match
df_match_processed = match_processor.process_fiba_europe_match()

Raw Mem Usage:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 301 entries, 1 to 301
Columns: 117 entries, match_id to percent_of_total_stat_count_by_players_in_play_awayteam
dtypes: float64(46), object(62), timedelta64[ns](9)
memory usage: 1.2 MB
Average memory usage for float columns: 0.00 MB
Average memory usage for int columns: 0.01 MB
Average memory usage for object columns: 0.02 MB
Mem Usage after dropping columns:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 301 entries, 1 to 301
Data columns (total 61 columns):
match_id                                                      301 non-null object
row_number                                                    301 non-null object
period                                                        301 non-null object
current_team_performing_stat_action                           301 non-null object
time_remaining_in_period                                      301 non-null object
minutes_remaining_in_period                               

Average memory usage for float columns: 0.01 MB
Average memory usage for int columns: 0.01 MB
Average memory usage for object columns: 0.01 MB
60636 -  DONE
89.07523107528687
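
The memory report above comes from the clean-up step, which drops and downcasts columns. The body of `create_clean_dataset_chunk` is not reproduced here, so the following is only a rough sketch of how downcasting might look in pandas. The helper name `downcast_frame`, the `category_threshold` parameter, and the toy columns are all hypothetical.

```python
import pandas as pd
from pandas.api import types as ptypes


def downcast_frame(df, category_threshold=0.5):
    """Return a copy of df with smaller numeric dtypes and categorical strings."""
    out = df.copy()
    for col in out.columns:
        series = out[col]
        if ptypes.is_float_dtype(series):
            out[col] = pd.to_numeric(series, downcast="float")
        elif ptypes.is_integer_dtype(series):
            out[col] = pd.to_numeric(series, downcast="integer")
        elif ptypes.is_object_dtype(series) or ptypes.is_string_dtype(series):
            # convert only low-cardinality text columns (assumed heuristic)
            if series.nunique() / max(len(series), 1) < category_threshold:
                out[col] = series.astype("category")
    return out


# toy frame standing in for a processed match
toy = pd.DataFrame({
    "period": [1, 1, 2, 2, 3, 3, 4, 4],
    "lead_size": [0.0, 2.0, 2.0, -1.0, 3.0, 3.0, 5.0, 8.0],
    "team_name_hometeam": ["TS Medical Park"] * 8,
})
small = downcast_frame(toy)
before = toy.memory_usage(deep=True).sum()
after = small.memory_usage(deep=True).sum()
print(before, after)
print(small.dtypes)
```

Columns that repeat the same value on every row, such as team and competition names, tend to benefit most from the categorical conversion.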


In [77]:
df_match_processed.head(10)

| | match_id | row_number | period | current_team_performing_stat_action | time_remaining_in_period | minutes_remaining_in_period | current_score_hometeam | current_score_awayteam | metadata_competition_name | team_name_hometeam | ... | top_five_players_in_play_hometeam | total_stat_count_players_in_play_hometeam | percent_of_total_stat_count_by_players_in_play_hometeam | starting_five_in_play_awayteam | top_five_scorers_in_play_awayteam | points_scored_by_players_in_play_awayteam | percent_of_total_points_scored_by_players_in_play_awayteam | top_five_players_in_play_awayteam | total_stat_count_players_in_play_awayteam | percent_of_total_stat_count_by_players_in_play_awayteam |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 60636 | 1 | 1 | 0 | 10:00 | 10 | | | EuroChallenge | TS Medical Park | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 60636 | 2 | 1 | 1 | 10:00 | 10 | | | EuroChallenge | TS Medical Park | ... | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 60636 | 3 | 1 | 1 | 10:00 | 10 | | | EuroChallenge | TS Medical Park | ... | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 60636 | 4 | 1 | 1 | 10:00 | 10 | | | EuroChallenge | TS Medical Park | ... | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 5 | 60636 | 5 | 1 | 1 | 10:00 | 10 | | | EuroChallenge | TS Medical Park | ... | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 6 | 60636 | 6 | 1 | 1 | 10:00 | 10 | | | EuroChallenge | TS Medical Park | ... | 5.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 7 | 60636 | 7 | 1 | 2 | 10:00 | 10 | | | EuroChallenge | TS Medical Park | ... | 5.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 8 | 60636 | 8 | 1 | 2 | 10:00 | 10 | | | EuroChallenge | TS Medical Park | ... | 5.0 | 0.0 | 0.0 | 2.0 | 2.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 |
| 9 | 60636 | 9 | 1 | 2 | 10:00 | 10 | | | EuroChallenge | TS Medical Park | ... | 5.0 | 0.0 | 0.0 | 3.0 | 3.0 | 0.0 | 0.0 | 3.0 | 0.0 | 0.0 |
| 10 | 60636 | 10 | 1 | 2 | 10:00 | 10 | | | EuroChallenge | TS Medical Park | ... | 5.0 | 0.0 | 0.0 | 4.0 | 4.0 | 0.0 | 0.0 | 4.0 | 0.0 | 0.0 |


#### Next Steps:

* I processed each of these matches, combining them into chunks of about 200 matches each.

* Before training machine learning algorithms on this mishmash of matches, I wanted to separate them into four groups:
  1. Male - Adult
  2. Male - Youth (any leagues or competitions specified as under 21)
  3. Female - Adult
  4. Female - Youth (any leagues or competitions specified as under 21)

* I also wanted to add, if I could, metadata like the date of the match.
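
The chunking and grouping steps listed above could be sketched roughly as follows. This is an illustration under assumptions rather than the code actually used: the helper names `chunked` and `classify_competition`, the regular expressions, and the example competition names are hypothetical. It also assumes that a competition is a youth competition when its name carries an age label of 21 or lower, and that any competition not marked as a women's event is a men's event.

```python
import re
from itertools import islice


def chunked(items, size=200):
    """Yield successive lists of at most `size` items."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


# hypothetical patterns: age labels such as "U20" or "Under 18",
# and words that mark a women's competition
YOUTH_PATTERN = re.compile(r"\bU(?:nder)?[\s-]?(\d{2})\b", re.IGNORECASE)
FEMALE_PATTERN = re.compile(r"\b(women|female|girls)\b", re.IGNORECASE)


def classify_competition(name):
    """Map a competition name to one of the four groups."""
    gender = "Female" if FEMALE_PATTERN.search(name) else "Male"
    match = YOUTH_PATTERN.search(name)
    # assumption: an age label of 21 or lower means a youth competition
    is_youth = match is not None and int(match.group(1)) <= 21
    return f"{gender} - {'Youth' if is_youth else 'Adult'}"


# illustrative names only; real competition names may need other patterns
for name in ["EuroChallenge", "U20 European Championship Women"]:
    print(name, "->", classify_competition(name))

# 450 placeholder match ids split into chunks of at most 200
chunk_sizes = [len(chunk) for chunk in chunked(range(450), 200)]
print(chunk_sizes)
```

Name-based rules like these are likely to miss some competitions, so the resulting groups would need a manual check before training.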