In this notebook, we use line profiling to find the bottleneck in time-on-ice (TOI) parsing. (See [this post](http://mortada.net/easily-profile-python-code-in-jupyter.html) for an example of line profiling in Jupyter.)

In [1]:
import re
import pandas as pd
from scrapenhl2.scrape import parse_toi, schedules, players, scrape_toi, general_helpers as helpers
%load_ext line_profiler

The first thing to try is timing and profiling the overall parse of a single game.

In [2]:
%timeit parse_toi.parse_game_toi_from_html(2012, 20001, True)

1 loop, best of 3: 4.6 s per loop
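
For reference, the same "1 loop, best of 3" measurement can be reproduced outside a notebook with the standard library's `timeit` module. The sketch below is only illustrative: `parse_stand_in` is a hypothetical placeholder, not part of scrapenhl2, and would be replaced by the real call to `parse_toi.parse_game_toi_from_html`.

```python
import timeit


def parse_stand_in():
    """Hypothetical stand-in for the parse call; swap in the real function."""
    return sum(i * i for i in range(10_000))


# number=1 runs the call once per trial ("1 loop");
# repeat=3 runs three trials, and we keep the fastest ("best of 3").
trials = timeit.repeat(parse_stand_in, repeat=3, number=1)
best = min(trials)
print(f"1 loop, best of {len(trials)}: {best:.4f} s per loop")
```

Taking the minimum rather than the mean is the usual choice here, since slower trials mostly reflect interference from other processes.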


In [3]:
%lprun -f parse_toi.parse_game_toi_from_html parse_toi.parse_game_toi_from_html(2012, 20001, True)

![alt text](parse_game_toi_from_html.png "Parse Game TOI")
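
If `line_profiler` is not available, a coarser function-level view can be had from the standard library's `cProfile`. This is a different technique from the line profiling used in this notebook: it reports time per function rather than per line. The sketch below uses hypothetical stand-in functions (`slow_step`, `fast_step`, `parse_stand_in`), not the scrapenhl2 code.

```python
import cProfile
import io
import pstats


def slow_step():
    """Hypothetical expensive step."""
    return sorted(range(200_000), key=lambda x: -x)


def fast_step():
    """Hypothetical cheap step."""
    return sum(range(1_000))


def parse_stand_in():
    """Hypothetical stand-in for the function being profiled."""
    fast_step()
    return slow_step()


# Profile a single call
profiler = cProfile.Profile()
profiler.enable()
parse_stand_in()
profiler.disable()

# Write the five entries with the largest cumulative time into a string
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats('cumulative').print_stats(5)
report = buffer.getvalue()
print(report)
```

Once the slow function is identified this way, `%lprun -f` on that function narrows it down to individual lines.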

Next, we'll profile `read_shifts_from_html_pages`.

In [4]:
season = 2012
game = 20001
gameinfo = schedules.get_game_data_from_schedule(season, game)  # schedules was imported in the first cell

In [5]:
%lprun -f parse_toi.read_shifts_from_html_pages parse_toi.read_shifts_from_html_pages(scrape_toi.get_raw_html_toi(season, game, 'H'), scrape_toi.get_raw_html_toi(season, game, 'R'), gameinfo['Home'], gameinfo['Road'], season, game)

![alt text](read_shifts_from_html_pages1.png "Read shifts 1")

![alt text](read_shifts_from_html_pages2.png "Read shifts 2")

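It looks like the HTML parsing itself is not the problem. The bigger bottleneck is the manipulations at the end, so the next cell rebuilds the intermediate shift data by hand, which lets us profile `_finish_toidf_manipulations` on its own.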

In [6]:
from html_table_extractor.extractor import Extractor

# Raw HTML of the home and road TOI pages, with the matching team IDs
rawtoi1 = scrape_toi.get_raw_html_toi(season, game, 'H')
rawtoi2 = scrape_toi.get_raw_html_toi(season, game, 'R')
teamid1 = gameinfo['Home']
teamid2 = gameinfo['Road']

dflst = []  # one DataFrame of shifts per team
for rawtoi, teamid in zip((rawtoi1, rawtoi2), (teamid1, teamid2)):
    extractor = Extractor(rawtoi)
    extractor.parse()
    tables = extractor.return_list()

    # One slot per extracted row; rows that are not shifts stay None and are filtered out below
    ids = [None] * len(tables)
    periods = [None] * len(tables)
    starts = [None] * len(tables)
    ends = [None] * len(tables)
    durationtime = [None] * len(tables)
    teams = [None] * len(tables)
    i = 0
    while i < len(tables):
        # A convenient artefact of this package: search for [p, p, p, p, p, p, p, p]
        if len(tables[i]) == 8 and helpers.check_number_last_first_format(tables[i][0]):
            pname = helpers.remove_leading_number(tables[i][0])
            pname = helpers.flip_first_last(pname)
            pid = players.player_as_id(pname, teamid)
            i += 2  # skip the header row
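            # Raw string avoids an invalid-escape warning; the bounds check guards against running past the last row
            while i < len(tables) and re.match(r'\d{1,2}', tables[i][0]):  # First entry is shift number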
                # print(tables[i])
                shiftnum, per, start, end, dur, ev = tables[i]
                # print(pname, pid, shiftnum, per, start, end)
                ids[i] = pid
                if per == 'OT':
                    per = 4
                periods[i] = int(per)
                starts[i] = start[:start.index('/')].strip()
                ends[i] = end[:end.index('/')].strip()
                durationtime[i] = helpers.mmss_to_secs(dur)
                teams[i] = teamid
                i += 1
            i += 1
        else:
            i += 1

    ids = [x for x in ids if x is not None]
    periods = [x for x in periods if x is not None]
    starts = [x for x in starts if x is not None]
    ends = [x for x in ends if x is not None]
    durationtime = [x for x in durationtime if x is not None]
    teams = [x for x in teams if x is not None]

    startmin = [x[:x.index(':')] for x in starts]
    startsec = [x[x.index(':') + 1:] for x in starts]
    starttimes = [1200 * (p - 1) + 60 * int(m) + int(s) + 1 for p, m, s in zip(periods, startmin, startsec)]
    # starttimes = [0 if x == 1 else x for x in starttimes]
    endmin = [x[:x.index(':')] for x in ends]
    endsec = [x[x.index(':') + 1:] for x in ends]
    # The +1 is applied to starttimes above (not a -1 here) to avoid overlapping start/end
    endtimes = [1200 * (p - 1) + 60 * int(m) + int(s) for p, m, s in zip(periods, endmin, endsec)]

    durationtime = [e - s for s, e in zip(starttimes, endtimes)]

    df = pd.DataFrame({'PlayerID': ids, 'Period': periods, 'Start': starttimes, 'End': endtimes,
                       'Team': teams, 'Duration': durationtime})
    dflst.append(df)
dflst = pd.concat(dflst)
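
The time arithmetic in the cell above may be easier to follow with one worked value. The sketch below repeats the same formula (1200 seconds per period, the part of the cell before the `/`, and the `+ 1` on start times) for a single made-up shift. The helper name `elapsed_seconds` and the raw strings are hypothetical and chosen only for illustration.

```python
PERIOD_SECONDS = 1200  # 20-minute periods, as in the cell above


def elapsed_seconds(period, clock):
    """Convert an 'M:SS' time within a period to seconds since the start of the game."""
    minutes, seconds = clock.split(':')
    return PERIOD_SECONDS * (period - 1) + 60 * int(minutes) + int(seconds)


# Hypothetical raw cells; the part before the '/' is what the parsing cell keeps
raw_start = '5:30 / 14:30'
raw_end = '6:15 / 13:45'
start_clock = raw_start[:raw_start.index('/')].strip()  # '5:30'
end_clock = raw_end[:raw_end.index('/')].strip()        # '6:15'

# A shift in the 2nd period
start = elapsed_seconds(2, start_clock) + 1  # the cell adds 1 to every start time
end = elapsed_seconds(2, end_clock)
print(start, end, end - start)  # 1531 1575 44
```

Note that `end - start` comes out as 44 for a shift that spans 45 seconds on the clock, because of the `+ 1` applied to the start time.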

In [7]:
%lprun -f parse_toi._finish_toidf_manipulations parse_toi._finish_toidf_manipulations(dflst, season, game)

![alt text](finish_toidf_manipulations1.png "Finish manipulations1")

![alt text](finish_toidf_manipulations2.png "Finish manipulations2")

![alt text](finish_toidf_manipulations3.png "Finish manipulations3")

![alt text](finish_toidf_manipulations4.png "Finish manipulations4")

![alt text](finish_toidf_manipulations5.png "Finish manipulations5")

It looks like the two lines that compute ranks are the primary culprits here, so they are the first candidates for optimization.