## Which conference is the best at winning tight games?



### How can we find out which conference is the best at winning tight games?

### We need to look at play-by-play data to figure this out, because if we just look at final scores we won't truly know whether a game was close. 

## Step 1
### Let's start by looking at [a game we know was close](https://www.youtube.com/watch?v=SdFB3OGUaGU) to get a sense of the play-by-play data.

##### (Sorry for the reminder, Longhorn fans!)

In [0]:
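from pandas.io import gbq  # BigQuery helper bundled with pandas at the time of writing

# Replace with your own Google Cloud project ID (queries are billed to it)
project_id = '[YOUR_PROJECT_ID]'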

In [3]:
n_iowa_vs_texas_q = """
SELECT 
  away_market,
  home_market,
  away_pts,
  home_pts,
  elapsed_time_sec,
  period,
  game_clock,
  team_market,
  jersey_num,
  event_type,
  points_scored
FROM `bigquery-public-data.ncaa_basketball.mbb_pbp_ncaa`
WHERE game_id = "703-504-2016-03-18"
AND (home_pts != 0 AND away_pts != 0)
ORDER BY elapsed_time_sec DESC
"""

n_iowa_vs_texas = gbq.read_gbq(query=n_iowa_vs_texas_q, dialect='standard', project_id=project_id)
n_iowa_vs_texas.head(25)



Requesting query... ok.
Job ID: job_ArZH0ybbYdqRoNs4aOX-Gu6cXfs9
Query running...
Query done.
Cache hit.

Retrieving results...
Got 81 rows.

Total time taken 0.64 s.
Finished at 2018-03-05 16:00:54.


| | away_market | home_market | away_pts | home_pts | elapsed_time_sec | period | game_clock | team_market | jersey_num | event_type | points_scored |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Northern Iowa | Texas | 75 | 72 | 2400 | 2 | 00:00 | Northern Iowa | 4 | GOOD | 3 |
| 1 | Northern Iowa | Texas | 72 | 72 | 2397 | 2 | 00:03 | Texas | 1 | GOOD | 2 |
| 2 | Northern Iowa | Texas | 72 | 70 | 2389 | 2 | 00:11 | Northern Iowa | 11 | GOOD | 1 |
| 3 | Northern Iowa | Texas | 71 | 70 | 2374 | 2 | 00:26 | Northern Iowa | 2 | GOOD | 2 |
| 4 | Northern Iowa | Texas | 69 | 70 | 2355 | 2 | 00:45 | Texas | 44 | GOOD | 1 |
| 5 | Northern Iowa | Texas | 69 | 69 | 2264 | 2 | 02:16 | Texas | 21 | GOOD | 2 |
| 6 | Northern Iowa | Texas | 69 | 67 | 2225 | 2 | 02:55 | Northern Iowa | 5 | GOOD | 1 |
| 7 | Northern Iowa | Texas | 68 | 67 | 2197 | 2 | 03:23 | Northern Iowa | 20 | GOOD | 1 |
| 8 | Northern Iowa | Texas | 67 | 67 | 2197 | 2 | 03:23 | Northern Iowa | 20 | GOOD | 1 |
| 9 | Northern Iowa | Texas | 66 | 66 | 2182 | 2 | 03:38 | Texas | 21 | GOOD | 1 |



### What a shot!
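
To get a feel for how tight those final minutes were, the ten rows shown above can be turned into a running margin. This is only a sketch in plain Python: the tuples are copied by hand from the output, and the variable names (`rows`, `margins`, `largest_lead`) are illustrative, not part of the dataset.

```python
# (elapsed_time_sec, away_pts, home_pts), copied from the rows shown above.
# Northern Iowa is the away team, Texas the home team.
rows = [
    (2400, 75, 72),
    (2397, 72, 72),
    (2389, 72, 70),
    (2374, 71, 70),
    (2355, 69, 70),
    (2264, 69, 69),
    (2225, 69, 67),
    (2197, 68, 67),
    (2197, 67, 67),
    (2182, 66, 66),
]

# Positive margin: Northern Iowa leads. Negative margin: Texas leads.
margins = [away - home for _, away, home in rows]
largest_lead = max(abs(m) for m in margins)

print(margins)
print("largest lead in these rows:", largest_lead)
```

In these rows neither team ever leads by more than three points.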

## Step 2

### Down to business. Let's decide that a "close game" is any game where the score difference is:

### (1) Less than 4 points at any point in the last 5 minutes OR
### (2) Less than 7 points for more than 2 of the last 5 minutes OR
### (3) Less than 10 points for more than 3 of the last 5 minutes
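
Before turning this into SQL, the three rules can be sketched in plain Python. This is an illustration of the definition under stated assumptions, not a port of the query: `seconds_within` and `is_close` are hypothetical helper names, the score timelines are made up, and regulation is assumed to end at 2400 elapsed seconds with the last five minutes starting at 2100. Rule (1) is treated as "more than 0 seconds spent within 4 points".

```python
def seconds_within(events, threshold, start=2100, end=2400):
    """Seconds between `start` and `end` during which the margin is below `threshold`.

    `events` is a list of (elapsed_time_sec, away_pts, home_pts) score updates.
    """
    events = sorted(events)
    # Score at the start of the window: the last update at or before `start`.
    away, home = 0, 0
    for t, a, h in events:
        if t <= start:
            away, home = a, h
    total = 0
    clock = start
    for t, a, h in events:
        if t <= start or t > end:
            continue
        # The previous score held from `clock` until this update at `t`.
        if abs(away - home) < threshold:
            total += t - clock
        clock, away, home = t, a, h
    # The last score holds until the end of the window.
    if abs(away - home) < threshold:
        total += end - clock
    return total


def is_close(events):
    return (seconds_within(events, 4) > 0
            or seconds_within(events, 7) > 120
            or seconds_within(events, 10) > 180)


# Hypothetical timelines, for illustration only.
blowout = [(2000, 60, 80), (2200, 62, 82), (2400, 62, 85)]
comeback = [(2050, 70, 60), (2150, 70, 65), (2300, 70, 68), (2400, 72, 68)]

print(is_close(blowout))
print(is_close(comeback))
```

The made-up comeback game finishes with a 4-point margin, which on its own would not satisfy rule (1), yet it spends 100 seconds within 4 points, so the time-weighted rules classify it as close.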

##### Note that the NCAA started collecting play-by-play data in the middle of the 2009 season. Because teams started collecting at different times, it's cleaner to start with the 2010 season (`AND season > 2009`).

### Here is how we can find how many games there are like this:

In [34]:
close_games_q = """
  SELECT
    game_id,
    away_conf_alias,
    home_conf_alias,
    MAX(away_pts) as final_away_pts,
    MAX(home_pts) as final_home_pts
  FROM `bigquery-public-data.ncaa_basketball.mbb_pbp_ncaa`
  WHERE game_id IN (
    SELECT game_id
    FROM (
      SELECT 
        game_id,
        SUM(IF(diff < 4, time_before_change, 0)) as a,
        SUM(IF(diff < 7, time_before_change, 0)) as b,
        SUM(IF(diff < 10, time_before_change, 0)) as c,
        IF((SUM(IF(diff < 4, time_before_change, 0)) > 0
          OR SUM(IF(diff < 7, time_before_change, 0)) > 120
          OR SUM(IF(diff < 10, time_before_change, 0)) > 180), TRUE, FALSE) as close
      FROM (
        SELECT
          game_id,
          away_pts,
          home_pts,
          elapsed_time_sec,
          ABS(away_pts - home_pts) AS diff,
          next_update - elapsed_time_sec as time_before_change
        FROM (
          SELECT
            game_id,
            away_pts,
            home_pts,
            ABS(away_pts - home_pts) AS diff,
            elapsed_time_sec,
            MAX(elapsed_time_sec) OVER (PARTITION BY game_id ORDER BY elapsed_time_sec DESC, away_pts + home_pts DESC ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) AS next_update
          #FROM below shows all scoring plays, and also entries for the end of regulation time and the five-minutes remaining mark
          FROM (
            SELECT
              season,
              game_id,
              home_division_alias,
              away_division_alias,
              away_pts,
              home_pts,
              elapsed_time_sec
            FROM `bigquery-public-data.ncaa_basketball.mbb_pbp_ncaa`
            UNION ALL
            SELECT
              season,
              game_id,
              home_division_alias,
              away_division_alias,
              MAX(away_pts),
              MAX(home_pts),
              2100 as elapsed_time_sec
            FROM `bigquery-public-data.ncaa_basketball.mbb_pbp_ncaa`
            WHERE elapsed_time_sec <= 2100
            GROUP BY season, game_id, home_division_alias, away_division_alias
            UNION ALL
            SELECT
              season,
              game_id,
              home_division_alias,
              away_division_alias,
              MAX(away_pts),
              MAX(home_pts),
              2400 as elapsed_time_sec
            FROM `bigquery-public-data.ncaa_basketball.mbb_pbp_ncaa`
            WHERE  elapsed_time_sec <= 2400
            GROUP BY season, game_id, home_division_alias, away_division_alias
          )
          WHERE home_division_alias = "D1" 
          AND away_division_alias = "D1"
          AND season > 2009
          AND (home_pts != 0 AND away_pts != 0)
          AND elapsed_time_sec >= 2100
          AND elapsed_time_sec <= 2400
          ORDER BY elapsed_time_sec DESC, away_pts + home_pts DESC
        )
      )
      GROUP BY game_id
    )
    WHERE close = TRUE
  )
  GROUP BY game_id, away_conf_alias, home_conf_alias
  ORDER BY final_away_pts + final_home_pts DESC
"""

close_games = gbq.read_gbq(query=close_games_q, dialect='standard', project_id=project_id)
close_games.head(25)

Requesting query... ok.
Job ID: job_zDPN4_nLLv7CzS2BwZwQOnTCUGbR
Query running...
Query done.
Processed: 1.7 GB
Standard price: $0.01 USD

Retrieving results...
Got 21674 rows.

Total time taken 7.25 s.
Finished at 2018-03-05 17:23:48.


| | game_id | away_conf_alias | home_conf_alias | final_away_pts | final_home_pts |
|---|---|---|---|---|---|
| 0 | 2915-625-2017-02-09 | SOUTHERN | SOUTHERN | 127 | 131 |
| 1 | 207-550-2017-02-04 | BIGSKY | BIGSKY | 124 | 130 |
| 2 | 285-590-2013-11-13 | NE | PATRIOT | 118 | 122 |
| 3 | 654-454-2014-03-01 | OVC | OVC | 115 | 118 |
| 4 | 391-141-2016-12-19 | SOUTHERN | AE | 111 | 120 |
| 5 | 30-32-2014-03-14 | SUNBELT | SUNBELT | 114 | 116 |
| 6 | 678-141-2016-11-18 | SOUTHERN | AS | 116 | 112 |
| 7 | 365-2711-2015-12-02 | AS | SEC | 108 | 119 |
| 8 | 287-184-2017-11-19 | HORIZON | SOUTHLAND | 116 | 109 |
| 9 | 735-406-2013-11-29 | SOUTHERN | MVC | 117 | 108 |



## Step 3
### Let's do a sanity check. How many games are there in total?

In [6]:
total_games_q = """
SELECT
  DISTINCT game_id
FROM `bigquery-public-data.ncaa_basketball.mbb_pbp_ncaa`
WHERE home_division_alias = "D1" 
AND away_division_alias = "D1"
AND season > 2009
"""

total_games = gbq.read_gbq(query=total_games_q, dialect='standard', project_id=project_id)
total_games.head(25)

Requesting query... ok.
Job ID: job_fSAT54VPXJCA0TVDzGCqGO8yOuq3
Query running...
Query done.
Processed: 836.9 MB
Standard price: $0.00 USD

Retrieving results...
Got 42568 rows.

Total time taken 3.79 s.
Finished at 2018-03-05 16:14:52.


| | game_id |
|---|---|
| 0 | 180-167-2011-11-22 |
| 1 | 172-167-2014-03-01 |
| 2 | 129-172-2011-11-26 |
| 3 | 172-554-2013-02-15 |
| 4 | 172-81-2011-11-19 |
| 5 | 172-380-2012-11-10 |
| 6 | 172-540-2013-02-16 |
| 7 | 172-80-2011-02-11 |
| 8 | 172-167-2011-01-28 |
| 9 | 172-81-2015-12-31 |



## Step 4

### So what percentage of games are close?

In [19]:
print("close games: " + str(len(close_games)))
print("all games: " + str(len(total_games)))
print("% close = " + str(float(len(close_games))/float(len(total_games))))

close games: 21674
all games: 42568
% close = 0.50916181169
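
The value printed above is a fraction rather than a percentage. A small sketch of the same arithmetic with percent formatting, assuming Python 3 division (under Python 2, wrap the counts in `float()` as the cell above does); the variable names are illustrative:

```python
close_count = 21674   # rows returned by the close-games query
total_count = 42568   # rows returned by the all-games query

share = close_count / total_count
print("{:.1%} of games were close".format(share))
```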


### That seems reasonable!

## Step 5

### Now we need to add up how many of these close games each conference won and lost.
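
The counting itself is simple and can be sketched in plain Python before it is written as SQL. This is an illustration only: `tally` is a hypothetical helper, and the sample tuples are the first five rows of the `close_games` output shown in Step 2.

```python
from collections import defaultdict

# (away_conf, home_conf, final_away_pts, final_home_pts),
# copied from the first rows of the close_games output.
games = [
    ("SOUTHERN", "SOUTHERN", 127, 131),
    ("BIGSKY", "BIGSKY", 124, 130),
    ("NE", "PATRIOT", 118, 122),
    ("OVC", "OVC", 115, 118),
    ("SOUTHERN", "AE", 111, 120),
]


def tally(games):
    """Count close-game wins and appearances per conference."""
    record = defaultdict(lambda: {"won": 0, "total": 0})
    for away_conf, home_conf, away_pts, home_pts in games:
        # Every game adds one entry for the home side and one for the away side.
        record[home_conf]["total"] += 1
        record[home_conf]["won"] += int(home_pts > away_pts)
        record[away_conf]["total"] += 1
        record[away_conf]["won"] += int(away_pts > home_pts)
    return dict(record)


result = tally(games)
for conf, rec in sorted(result.items()):
    print(conf, rec["won"], rec["total"])
```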

In [29]:
close_wins_by_conf_q = """
WITH close_games AS  (
  SELECT
    game_id,
    away_conf_alias,
    home_conf_alias,
    MAX(away_pts) as final_away_pts,
    MAX(home_pts) as final_home_pts
  FROM `bigquery-public-data.ncaa_basketball.mbb_pbp_ncaa`
  WHERE game_id IN (
    SELECT game_id
    FROM (
      SELECT 
        game_id,
        SUM(IF(diff < 4, time_before_change, 0)) as a,
        SUM(IF(diff < 7, time_before_change, 0)) as b,
        SUM(IF(diff < 10, time_before_change, 0)) as c,
        IF((SUM(IF(diff < 4, time_before_change, 0)) > 0
          OR SUM(IF(diff < 7, time_before_change, 0)) > 120
          OR SUM(IF(diff < 10, time_before_change, 0)) > 180), TRUE, FALSE) as close
      FROM (
        SELECT
          game_id,
          away_pts,
          home_pts,
          elapsed_time_sec,
          ABS(away_pts - home_pts) AS diff,
          next_update - elapsed_time_sec as time_before_change
        FROM (
          SELECT
            game_id,
            away_pts,
            home_pts,
            ABS(away_pts - home_pts) AS diff,
            elapsed_time_sec,
            MAX(elapsed_time_sec) OVER (PARTITION BY game_id ORDER BY elapsed_time_sec DESC, away_pts + home_pts DESC ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) AS next_update
          #FROM below shows all scoring plays, and also entries for the end of regulation time and the five-minutes remaining mark
          FROM (
            SELECT
              season,
              game_id,
              home_division_alias,
              away_division_alias,
              away_pts,
              home_pts,
              elapsed_time_sec
            FROM `bigquery-public-data.ncaa_basketball.mbb_pbp_ncaa`
            UNION ALL
            SELECT
              season,
              game_id,
              home_division_alias,
              away_division_alias,
              MAX(away_pts),
              MAX(home_pts),
              2100 as elapsed_time_sec
            FROM `bigquery-public-data.ncaa_basketball.mbb_pbp_ncaa`
            WHERE elapsed_time_sec <= 2100
            GROUP BY season, game_id, home_division_alias, away_division_alias
            UNION ALL
            SELECT
              season,
              game_id,
              home_division_alias,
              away_division_alias,
              MAX(away_pts),
              MAX(home_pts),
              2400 as elapsed_time_sec
            FROM `bigquery-public-data.ncaa_basketball.mbb_pbp_ncaa`
            WHERE  elapsed_time_sec <= 2400
            GROUP BY season, game_id, home_division_alias, away_division_alias
          )
          WHERE home_division_alias = "D1" 
          AND away_division_alias = "D1"
          AND season > 2009
          AND (home_pts != 0 AND away_pts != 0)
          AND elapsed_time_sec >= 2100
          AND elapsed_time_sec <= 2400
          ORDER BY elapsed_time_sec DESC, away_pts + home_pts DESC
        )
      )
      GROUP BY game_id
    )
    WHERE close = TRUE
  )
  GROUP BY game_id, away_conf_alias, home_conf_alias
)

SELECT 
  conf,
  MAX(IF(home_or_away='home' AND won,cnt,null)) home_won,
  MAX(IF(home_or_away='home' AND NOT won,cnt,null)) home_lost,
  MAX(IF(home_or_away='away' AND won,cnt,null)) away_won,
  MAX(IF(home_or_away='away' AND NOT won,cnt,null)) away_lost,
  SUM(IF(won,cnt,null)) won,
  SUM(cnt) total,
  SUM(IF(won,cnt,null))/SUM(cnt) as pct
FROM (
  SELECT 
    home_conf_alias AS conf, 
    'home' AS home_or_away, 
    final_home_pts > final_away_pts AS won,
    COUNT(*) cnt
  FROM close_games
  GROUP BY conf, home_or_away, won
  UNION ALL
  SELECT 
    away_conf_alias AS conf, 
    'away' AS home_or_away, 
    final_away_pts > final_home_pts AS won,
    COUNT(*) cnt
  FROM  close_games
  GROUP BY conf, home_or_away, won
)
GROUP BY conf
ORDER BY pct DESC
"""

close_wins_by_conf = gbq.read_gbq(query=close_wins_by_conf_q, dialect='standard', project_id=project_id)
print(close_wins_by_conf)

Requesting query... ok.
Job ID: job_dvg5I4GizKtAF26E6LEwJzb1NfRk
Query running...
Query done.
Processed: 1.7 GB
Standard price: $0.01 USD

Retrieving results...
Got 32 rows.

Total time taken 5.74 s.
Finished at 2018-03-05 17:16:54.
         conf  home_won  home_lost  away_won  away_lost   won  total       pct
0         AAC       501        323       330        360   831   1514  0.548877
1     BIGEAST       392        274       287        289   679   1242  0.546699
2         ACC       641        409       385        461  1026   1896  0.541139
3       PAC12       545        298       275        398   820   1516  0.540897
4       BIG12       421        264       247        315   668   1247  0.535686
5       BIG10       582        389       360        443   942   1774  0.531003
6         MWC       438        276       298        381   736   1393  0.528356
7         SEC       609        375       330        486   939   1800  0.521667
8         A10       555        396       394        482 

### That looks OK, but something is still off: we're counting games between two teams in the same conference. Each of those games gives that conference exactly one win and one loss. So if a conference wins more than 50% of its close games against other conferences, its close games within the conference pull its overall winning percentage down toward 50%.
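
A quick numeric sketch of that dilution, using made-up numbers rather than anything from the dataset. `combined_pct` is a hypothetical helper, and Python 3 division is assumed:

```python
def combined_pct(inter_won, inter_total, intra_games):
    """Overall win share once intra-conference games are mixed in."""
    # Each intra-conference game adds one win and one loss for the conference,
    # so it contributes 1 to the wins and 2 to the total.
    won = inter_won + intra_games
    total = inter_total + 2 * intra_games
    return won / total


# Hypothetical conference: 60 wins in 100 close inter-conference games.
print(combined_pct(60, 100, 0))     # inter-conference games only
print(combined_pct(60, 100, 200))   # plus 200 close intra-conference games
```

With these made-up inputs the share drops from 0.60 to 0.52, even though the conference's record against outside opponents has not changed.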

## Step 6
### So let's take those games out.

In [35]:
close_wins_by_conf2_q = """
WITH close_games AS  (
  SELECT
    game_id,
    away_conf_alias,
    home_conf_alias,
    MAX(away_pts) as final_away_pts,
    MAX(home_pts) as final_home_pts
  FROM `bigquery-public-data.ncaa_basketball.mbb_pbp_ncaa`
  WHERE game_id IN (
    SELECT game_id
    FROM (
      SELECT 
        game_id,
        SUM(IF(diff < 4, time_before_change, 0)) as a,
        SUM(IF(diff < 7, time_before_change, 0)) as b,
        SUM(IF(diff < 10, time_before_change, 0)) as c,
        IF((SUM(IF(diff < 4, time_before_change, 0)) > 0
          OR SUM(IF(diff < 7, time_before_change, 0)) > 120
          OR SUM(IF(diff < 10, time_before_change, 0)) > 180), TRUE, FALSE) as close
      FROM (
        SELECT
          game_id,
          away_pts,
          home_pts,
          elapsed_time_sec,
          ABS(away_pts - home_pts) AS diff,
          next_update - elapsed_time_sec as time_before_change
        FROM (
          SELECT
            game_id,
            away_pts,
            home_pts,
            ABS(away_pts - home_pts) AS diff,
            elapsed_time_sec,
            MAX(elapsed_time_sec) OVER (PARTITION BY game_id ORDER BY elapsed_time_sec DESC, away_pts + home_pts DESC ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) AS next_update
          #FROM below shows all scoring plays, and also entries for the end of regulation time and the five-minutes remaining mark
          FROM (
            SELECT
              season,
              game_id,
              home_division_alias,
              away_division_alias,
              away_pts,
              home_pts,
              elapsed_time_sec
            FROM `bigquery-public-data.ncaa_basketball.mbb_pbp_ncaa`
            UNION ALL
            SELECT
              season,
              game_id,
              home_division_alias,
              away_division_alias,
              MAX(away_pts),
              MAX(home_pts),
              2100 as elapsed_time_sec
            FROM `bigquery-public-data.ncaa_basketball.mbb_pbp_ncaa`
            WHERE elapsed_time_sec <= 2100
            GROUP BY season, game_id, home_division_alias, away_division_alias
            UNION ALL
            SELECT
              season,
              game_id,
              home_division_alias,
              away_division_alias,
              MAX(away_pts),
              MAX(home_pts),
              2400 as elapsed_time_sec
            FROM `bigquery-public-data.ncaa_basketball.mbb_pbp_ncaa`
            WHERE  elapsed_time_sec <= 2400
            GROUP BY season, game_id, home_division_alias, away_division_alias
          )
          WHERE home_division_alias = "D1" 
          AND away_division_alias = "D1"
          AND season > 2009
          AND (home_pts != 0 AND away_pts != 0)
          AND elapsed_time_sec >= 2100
          AND elapsed_time_sec <= 2400
          ORDER BY elapsed_time_sec DESC, away_pts + home_pts DESC
        )
      )
      GROUP BY game_id
    )
    WHERE close = TRUE
  )
  AND home_conf_alias != away_conf_alias # <<<<<<<<<<<<<<<<<<<< this is the new line
  GROUP BY game_id, away_conf_alias, home_conf_alias
)

SELECT 
  conf,
  MAX(IF(home_or_away='home' AND won,cnt,null)) home_won,
  MAX(IF(home_or_away='home' AND NOT won,cnt,null)) home_lost,
  MAX(IF(home_or_away='away' AND won,cnt,null)) away_won,
  MAX(IF(home_or_away='away' AND NOT won,cnt,null)) away_lost,
  SUM(IF(won,cnt,null)) won,
  SUM(cnt) total,
  SUM(IF(won,cnt,null))/SUM(cnt) as pct
FROM (
  SELECT 
    home_conf_alias AS conf, 
    'home' AS home_or_away, 
    final_home_pts > final_away_pts AS won,
    COUNT(*) cnt
  FROM close_games
  GROUP BY conf, home_or_away, won
  UNION ALL
  SELECT 
    away_conf_alias AS conf, 
    'away' AS home_or_away, 
    final_away_pts > final_home_pts AS won,
    COUNT(*) cnt
  FROM  close_games
  GROUP BY conf, home_or_away, won
)
GROUP BY conf
ORDER BY pct DESC
"""

close_wins_by_conf2 = gbq.read_gbq(query=close_wins_by_conf2_q, dialect='standard', project_id=project_id)
print(close_wins_by_conf2)

Requesting query... ok.
Job ID: job_oqVLKjN5_x-gFQkgZ461LYLZZmCf
Query running...
Query done.
Processed: 1.7 GB
Standard price: $0.01 USD

Retrieving results...
Got 32 rows.

Total time taken 5.01 s.
Finished at 2018-03-05 17:25:47.
         conf  home_won  home_lost  away_won  away_lost  won  total       pct
0       PAC12       251        116        93        104  344    564  0.609929
1         ACC       322        166       142        142  464    772  0.601036
2       BIG12       202        109        92         96  294    499  0.589178
3     BIGEAST       244        134       147        141  391    666  0.587087
4         AAC       330        181       188        189  518    888  0.583333
5       BIG10       281        149       120        142  401    692  0.579480
6         MWC       204        115       137        147  341    603  0.565506
7         SEC       272        154       109        149  381    684  0.557018
8         A10       285        191       189        212  474    8

## Once we remove intra-conference games, the result is clear: the PAC-12 is the best at winning close games (against other conferences)!
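
As a consistency check on the two printed tables, subtracting the inter-conference counts from the all-games counts should leave only intra-conference entries, and those should split exactly 50/50. The sketch below does this for the eight conferences whose rows are complete in both outputs. The numbers are copied by hand from those outputs, the variable names are illustrative, and Python 3 division is assumed.

```python
# conf: (won_all, total_all, won_inter, total_inter)
rows = {
    "PAC12": (820, 1516, 344, 564),
    "ACC": (1026, 1896, 464, 772),
    "BIG12": (668, 1247, 294, 499),
    "BIGEAST": (679, 1242, 391, 666),
    "AAC": (831, 1514, 518, 888),
    "BIG10": (942, 1774, 401, 692),
    "MWC": (736, 1393, 341, 603),
    "SEC": (939, 1800, 381, 684),
}

intra_pct = {}
for conf, (won_all, total_all, won_inter, total_inter) in rows.items():
    intra_won = won_all - won_inter        # wins in intra-conference entries
    intra_total = total_all - total_inter  # intra-conference entries (two per game)
    intra_pct[conf] = intra_won / intra_total

print(intra_pct)
```

Every conference comes out at exactly 0.5, which is what one win and one loss per intra-conference game implies.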