# Chapter 2 - Analyzing Game Scores

Squares is a time-honored Super Bowl tradition. The board has 10 rows and 10 columns, for a total of 100 squares. One team is assigned the rows and the other team is assigned the columns. Players claim squares until they are all spoken for. After that, the digits zero through nine are randomly assigned to the rows and to the columns, so every possible combination of ones digits appears on the board.

A player wins a quarter if the ones digits of the score at the end of that quarter match their square. The board is checked at the end of the first, second, third, and fourth quarters (we are ignoring overtime here). So a player holding 4, 6 (Team A, Team B) wins if the score at the end of a quarter is, for example, 14-6 in favor of Team A or 26-24 in favor of Team B.
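
To make the rule concrete, here is a minimal sketch of the check. The function name `square_wins` is our own invention for illustration and is not used elsewhere in this chapter.

```python
def square_wins(square, team_a_score, team_b_score):
    """Return True if the ones digits of the score match the square.

    square is a (Team A digit, Team B digit) tuple, such as (4, 6).
    """
    digit_a, digit_b = square
    return team_a_score % 10 == digit_a and team_b_score % 10 == digit_b


print(square_wins((4, 6), 14, 6))   # True: 14-6 in favor of Team A
print(square_wins((4, 6), 24, 26))  # True: 26-24 in favor of Team B
print(square_wins((4, 6), 16, 4))   # False: the digits are reversed
```

Note that order matters on the board itself: the square 4, 6 does not win when Team A's score ends in 6 and Team B's score ends in 4.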

Although there is no skill involved in Squares, we still want to know the relative value of the squares. To get our answers, we'll use a dataset that has the scores of every NFL game from 2009-2017.

In [88]:
# load common libraries
import re

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [3]:
# dataset is in a csv file
# although it's called 'csv' the separator is actually a semi-colon
# we use the pandas method 'read_csv' to read data
# we get back a dataframe, which is similar to a spreadsheet
# we'll talk about the similarities and differences below
df = pd.read_csv('game_scores.csv', sep=';')

In [6]:
# a dataframe has an index, rows, and columns
# we didn't specify an index column when we read the csv file
# so we have the default index, which is a sequence of integers starting with zero
df.index

RangeIndex(start=0, stop=2292, step=1)

In [8]:
# any unique value (or combinations) can serve as an index
# here, we could use gsis_id as the index because it is unique to a game 
# and there is only one row per game in our dataset
# in a jupyter notebook, the index is displayed in bold
# and the index name is below the column names
df.set_index('gsis_id').head()

gsis_id,year,home_team,away_team,home_score,away_score,home_score_q1,home_score_q2,home_score_q3,home_score_q4,away_score_q1,away_score_q2,away_score_q3,away_score_q4
2009091000,2009,PIT,TEN,13,10,0,7,0,3,0,7,0,3
2009091300,2009,ATL,MIA,19,7,0,10,6,3,0,0,0,7
2009091301,2009,BAL,KC,38,24,10,0,7,21,0,7,7,10
2009091302,2009,CAR,PHI,10,38,7,3,0,0,3,28,7,0
2009091303,2009,CIN,DEN,7,12,0,0,0,7,0,3,3,6


In [9]:
# most pandas operations are not 'in-place' by default, meaning that we didn't actually set the index for df
# instead, we created (but did not assign to a variable) a new dataframe with the specified index
# to set the index on df, use the statement df = df.set_index('gsis_id')
df.head()

,gsis_id,year,home_team,away_team,home_score,away_score,home_score_q1,home_score_q2,home_score_q3,home_score_q4,away_score_q1,away_score_q2,away_score_q3,away_score_q4
0,2009091000,2009,PIT,TEN,13,10,0,7,0,3,0,7,0,3
1,2009091300,2009,ATL,MIA,19,7,0,10,6,3,0,0,0,7
2,2009091301,2009,BAL,KC,38,24,10,0,7,21,0,7,7,10
3,2009091302,2009,CAR,PHI,10,38,7,3,0,0,3,28,7,0
4,2009091303,2009,CIN,DEN,7,12,0,0,0,7,0,3,3,6
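
The difference between calling `set_index` and assigning its result can be checked on a tiny dataframe. This is only a sketch: `toy` and `indexed` are names made up for the example, and the two rows are copied from the output above.

```python
import pandas as pd

# a two-row stand-in for the real dataset
toy = pd.DataFrame({
    'gsis_id': [2009091000, 2009091300],
    'home_team': ['PIT', 'ATL'],
})

# set_index returns a new dataframe and leaves toy untouched
indexed = toy.set_index('gsis_id')

print(toy.index.name)        # None: toy still has the default index
print(indexed.index.name)    # gsis_id
print(list(toy.columns))     # ['gsis_id', 'home_team']
print(list(indexed.columns)) # ['home_team']
```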


In [86]:
# the raw columns hold the points scored within each quarter
# we need the running score at the end of each quarter
# so the q2 score is the sum of q1 and q2, not just q2
# this is the most straightforward way to do it
df['q1_score_home'] = df['home_score_q1']
df['q2_score_home'] = df['home_score_q1'] + df['home_score_q2']
df['q3_score_home'] = df['home_score_q1'] + df['home_score_q2'] + df['home_score_q3']
df['q4_score_home'] = df['home_score_q1'] + df['home_score_q2'] + df['home_score_q3'] + df['home_score_q4']

df['q1_score_away'] = df['away_score_q1']
df['q2_score_away'] = df['away_score_q1'] + df['away_score_q2']
df['q3_score_away'] = df['away_score_q1'] + df['away_score_q2'] + df['away_score_q3']
df['q4_score_away'] = df['away_score_q1'] + df['away_score_q2'] + df['away_score_q3'] + df['away_score_q4']
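
# A more compact alternative is a cumulative sum across the quarter columns.
# The sketch below is not part of the notebook's pipeline: it uses a small
# stand-in dataframe (named toy, with the home quarter scores of the first
# three games shown above) and assumes the same column naming scheme.

```python
import pandas as pd

toy = pd.DataFrame({
    'home_score_q1': [0, 0, 10],
    'home_score_q2': [7, 10, 0],
    'home_score_q3': [0, 6, 7],
    'home_score_q4': [3, 3, 21],
})

# select the per-quarter columns in order, then sum across each row
quarter_columns = [f'home_score_q{q}' for q in range(1, 5)]
running = toy[quarter_columns].cumsum(axis=1)

# rename to match the q{n}_score_home names used in this chapter
running.columns = [f'q{q}_score_home' for q in range(1, 5)]
print(running)
```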
        

In [93]:
# keep only the running score columns we just created (q1_score_home ... q4_score_away)
keep_columns = [c for c in df.columns if re.match(r'q\d+_score_', c)]
df = df.loc[:, keep_columns]
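
The same selection can be sketched with the dataframe's `filter` method, which takes a regular expression directly. Because `filter` searches anywhere in the column name, the pattern is anchored with `^` so that it behaves like `re.match`. The dataframe `toy` below is a made-up stand-in with only a few columns.

```python
import pandas as pd

toy = pd.DataFrame({
    'year': [2009],
    'home_score_q1': [0],
    'q1_score_home': [0],
    'q1_score_away': [0],
})

# keep columns whose names start with q<digits>_score_
kept = toy.filter(regex=r'^q\d+_score_')
print(list(kept.columns))  # ['q1_score_home', 'q1_score_away']
```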

In [94]:
## now we need to create a dataframe with the ones digit of the scores after each quarter
## to get the ones digit, we use % 10 (the modulus operator), which gives us the remainder after dividing a number by 10
## 0, 1 is the same as 1, 0 for analyzing the value of a square
## so we are going to create a tuple of the sorted ones digits, such as (0, 1)
## then we will count how many times this tuple appears in the dataframe (overall and by quarter)
## there are more efficient ways to do this but this will be easier to follow

def score_pairs(row, quarter):
    """Gets the score pair from a row"""
    score_home = row[f'q{quarter}_score_home'] % 10
    score_away = row[f'q{quarter}_score_away'] % 10
    return tuple(sorted([score_home, score_away]))

                 
def score_pair_dataframe(df, quarter):
    """Create a dataframe of score pairs
    
    Args:
        df (DataFrame): a dataframe with scores
        quarter (int): the quarter of the game
        
    Returns:
        DataFrame with columns quarter, score_pair
    """
    # we want the function to fail if the quarter is invalid
    # our dataframe only has columns for quarters 1 through 4
    if not 1 <= quarter <= 4:
        raise ValueError(f'There are not {quarter} quarters in a football game')

    # pandas allows us to 'apply' a function to every row in the dataframe
    # somewhat confusingly, you have to specify the axis as 1
    # even though you want to apply it to each row
    # args passes the quarter along as the second argument to score_pairs
    score_column = df.apply(score_pairs, axis=1, args=(quarter,))

    # this creates a series where every row = quarter
    # it shares the index of score_column so the two line up when concatenated
    quarter_column = pd.Series(quarter, index=score_column.index)

    # if you want to concatenate side-by-side, use axis=1
    combined = pd.concat([quarter_column, score_column], axis=1)

    # we want to specify the column names
    combined.columns = ['quarter', 'score_pair']
    
    return combined

# this uses what is called a list comprehension
# it creates a list of the result of calling score_pair_dataframe
# with i having the value 1, 2, 3, and 4 (in python, the last value in the range is one less than the stop value)
scoredf = pd.concat([score_pair_dataframe(df, i) 
                     for i in range(1, 5)])

Let's take a closer look at what we just did:

1. We defined a function called 'score_pair_dataframe' with the 'def' keyword. The function takes two arguments: a dataframe and a quarter (1-4)
2. We called that function for all 4 quarters, with i being 1, 2, 3, and 4
3. We concatenated the dataframes for each quarter into one dataframe called 'scoredf'
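
The comments above mention that there are more efficient ways to build the score pairs. One possibility, sketched here under our own assumptions rather than taken from the notebook, is to skip the row-by-row `apply` and let NumPy take the ones digits and sort each row in one pass. The dataframe `toy` is a stand-in holding three first-quarter scores.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'q1_score_home': [0, 7, 10],
    'q1_score_away': [0, 3, 0],
})

# ones digits for both teams as a 2-column array
digits = toy[['q1_score_home', 'q1_score_away']].to_numpy() % 10

# sort within each row so (7, 3) and (3, 7) become the same pair
sorted_digits = np.sort(digits, axis=1)

# convert each row to a tuple of plain Python integers
pairs = [tuple(map(int, row)) for row in sorted_digits]
print(pairs)  # [(0, 0), (3, 7), (0, 0)]
```

The row-wise `apply` version is easier to read, which is why the chapter uses it; the array version mainly matters when the dataframe is large.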

In [95]:
# we've got two columns
# quarter is an integer
# score_pair is a tuple of integers
scoredf.head()

,quarter,score_pair
0,1,"(0, 0)"
1,1,"(0, 0)"
2,1,"(0, 0)"
3,1,"(3, 7)"
4,1,"(0, 0)"


In [96]:
# first we will count the score pairs regardless of quarter
# pandas has a handy function called value_counts
# which gives us a count for every unique value in a column
total_score_pairs = scoredf['score_pair'].value_counts()
total_score_pairs.name = 'n_all'
total_score_pairs.head()

(0, 7)    1200
(0, 3)     913
(0, 0)     750
(3, 7)     664
(0, 4)     537
Name: n_all, dtype: int64
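
Raw counts are hard to compare at a glance, so it can help to turn them into shares. As a sketch, `value_counts` accepts `normalize=True`, which divides each count by the total. The series `toy_pairs` below is a made-up example with four entries, not data from our dataset.

```python
import pandas as pd

# four hypothetical score pairs
toy_pairs = pd.Series([(0, 0), (0, 7), (0, 7), (3, 7)])

# share of entries for each unique pair, most common first
shares = toy_pairs.value_counts(normalize=True)
print(shares)

# shares always add up to 1
print(shares.sum())
```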

In [97]:
# we can also do value counts on groups
# here, we will group by quarter and see
# the most common score pairs for each quarter
quarter_score_pairs = (scoredf.groupby('quarter')['score_pair']
                       .value_counts()
                       .rename('n_quarter')
                       .reset_index()
                      )

quarter_score_pairs.head()

,quarter,score_pair,n_quarter
0,1,"(0, 7)",558
1,1,"(0, 0)",408
2,1,"(0, 3)",386
3,1,"(3, 7)",235
4,1,"(0, 4)",151


In [98]:
# now we want to add a column for total
quarter_score_pairs.join(total_score_pairs, on='score_pair', how='left')

,quarter,score_pair,n_quarter,n_all
0,1,"(0, 7)",558,1200
1,1,"(0, 0)",408,750
2,1,"(0, 3)",386,913
3,1,"(3, 7)",235,664
4,1,"(0, 4)",151,537
...,...,...,...,...
181,4,"(5, 9)",11,15
182,4,"(8, 9)",11,15
183,4,"(2, 5)",10,11
184,4,"(5, 5)",5,6
