### Script for pulling the shot chart data
This script takes the "box_score_links.txt" file as input and goes over each link, generating the shot chart data in the following format:

| Player key | Quarter | Make/Miss | Distance from offensive baseline (px) | Distance from left baseline (px) | Time remaining in quarter | Distance to hoop (ft) | Game score after shot | Player's team | Away team | Home team | Season year |
|------|------|------|------|------|------|------|------|------|------|------|------|
| jamesle01 | 1 | make | 312 | 414 | MM:SS.0 | 24 | 3-2 | Chicago | CHI | ORL | YYYY |

Each row corresponds to one shot by a player in a particular game, with the rows of a game ordered by quarter.
The script finishes by saving all rows, in the format above, as comma-separated lines in a series of text files using "saveAsTextFile()".
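
As a rough illustration of the row layout in the table above, the sketch below splits one saved line back into named fields. The field names (`player_key`, `top_px`, and so on) and the `ShotRow` type are hypothetical labels chosen here for readability; the script itself only writes positional, comma-separated values. The sample line reuses the values of the first shot returned for the test link later in this notebook.

```python
from collections import namedtuple

# Hypothetical field names for the 12 columns described in the table.
ShotRow = namedtuple("ShotRow", [
    "player_key", "qtr", "result", "top_px", "left_px", "time_remaining",
    "dist_ft", "score", "player_team", "away_team", "home_team", "season",
])

def parse_shot_line(line):
    """Split one comma-separated output line into a ShotRow (all values stay strings)."""
    fields = line.strip().split(",")
    if len(fields) != len(ShotRow._fields):
        raise ValueError("expected 12 fields, got {}".format(len(fields)))
    return ShotRow(*fields)

row = parse_shot_line("marbust01,1,make,283,318,11:23.0,25,3-0,New Jersey,NJN,CHH,2000")
print(row.player_key, row.player_team, row.season)
```

Every value comes back as a string, so numeric columns would still need converting before any analysis.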

The first 


**Note:**
- Not all shot locations for a given game are recorded. In the future, the ratio of recorded shot locations to field goal attempts (FGA) could be estimated for each year.
- The first link with shot chart data is https://www.basketball-reference.com/boxscores/199611010BOS.html
- This link is at index 31847 (counting from 0) in the list of box score links.


In [15]:
!pip install --upgrade pip
!pip install lxml



In [16]:
%%shell
apt-get update -qq > /dev/null
apt-get install openjdk-8-jdk-headless -qq > /dev/null
wget -q https://downloads.apache.org/spark/spark-2.4.7/spark-2.4.7-bin-hadoop2.7.tgz
tar xf spark-2.4.7-bin-hadoop2.7.tgz
pip install -q findspark



In [17]:
%%shell
pip install --upgrade pip
pip install lxml
git clone https://github.com/pmcwhannel/NBA-analytics.git
mv NBA-analytics NBAanalytics # So importing functions is easy

Cloning into 'NBA-analytics'...
remote: Enumerating objects: 38, done.
remote: Counting objects: 100% (38/38), done.
remote: Compressing objects: 100% (34/34), done.
remote: Total 38 (delta 15), reused 14 (delta 3), pack-reused 0
Unpacking objects: 100% (38/38), done.




In [18]:
# The cloned repo was renamed NBA-analytics -> NBAanalytics so that its functions can be imported
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.7-bin-hadoop2.7"

import findspark
findspark.init()

from pyspark import SparkContext
sc = SparkContext(appName="YourTest", master="local[*]")

ValueError: ignored

The code below imports the required packages.

In [None]:
import requests
from lxml import html
import re

The code below defines the functions for extracting the shot chart data.

In [21]:
def go_to_shot_chart(boxScoreLink):
    '''
    Return a list containing the relative path to the shot chart page.
    The list is empty if the box score has no shot chart.
    '''
    page = requests.get(boxScoreLink)
    tree = html.fromstring(page.content)
    # Collect the filter links and keep only those that point at a shot chart
    finds = tree.xpath('//*[@class="filter"]/div/a/@href')
    return [l for l in finds if re.search('shot-chart', l)]


def clean_tip_string(string):
    '''
    Takes the tooltip string of a shot (tip = "TEXT"), e.g.
    TEXT = '1st quarter, 11:25.0 remaining<br>Darrell Armstrong made 2-pointer from 17 ft<br>Orlando now tied 2-2'
    and extracts:
      - time remaining
      - distance to hoop
      - game score
      - player's team
    Returns them in a list [time, dist, game score, team].
    '''
    # Words that can follow the team name in the score sentence
    stopwords = {'leads', 'lead', 'now', 'trails', 'trail', 'tied'}
    time_remain = string.split()[2]
    game_score = string.split()[-1]
    shot_dist = re.search('(?<=from ).*(?=ft)', string)[0].strip()
    players_team = string.split('<br>')[-1].split()
    if players_team[1].lower() in stopwords:
        players_team = players_team[0]
    else:
        # Two-word team name (i.e. 'Golden' + ' ' + 'State')
        players_team = players_team[0] + ' ' + players_team[1]
    return [time_remain, shot_dist, game_score, players_team]

def extract_shot_data(shotChartLink):
    '''
    Return the data pulled from the shot chart as a list of lists:
    [[player key, ...], ...]
    '''
    page = requests.get(shotChartLink)
    tree = html.fromstring(page.content)

    # create [player key, qtr, make/miss]
    # ['tooltip', 'q-1', 'p-armstda01', 'make']
    shot_data = [] # player metadata
    for md in tree.xpath('//*[@class="shot-area"]/div/@class'):
        temp = md.split()
        shot_data.append([temp[2][2:],int(temp[1][-1]),temp[3]])

    # [TOP, LEFT] position of each shot in px
    shoot_pos = [re.findall(r'\d+', pos) for pos in tree.xpath('//*[@class="shot-area"]/div/@style')]

    # [time (minutes:seconds.0), shot distance (ft), game score, player's team]
    game_data = [clean_tip_string(string) for string in tree.xpath('//*[@class="shot-area"]/div/@tip')]

    # The year is the number that represents the season, so 1968-1969 -> 1969.
    temp = tree.xpath('//*[@class="scorebox"]/div/div/strong/a/@href')
    year = temp[0].split('/')[-1][:4]
    team_1 = temp[0].split('/')[2]
    team_2 = temp[1].split('/')[2]
    # Create output [[features],...,]
    
    output = []
    for i in range(0,len(shoot_pos)):
        # [player key, qtr, make/miss, TOP_dist, LEFT_dist, time remaining, dist shot,
        # game score, players_team ,team_1=away, team_2=home, year]
        output.append(shot_data[i] + shoot_pos[i] + game_data[i] + [team_1, team_2, year])
    
    return output

def get_single_shot_chart_data(boxScoreLink):
    '''
    Get all shot chart data for a single game.
    '''
    base_path = "https://www.basketball-reference.com"
    shot_path = go_to_shot_chart(boxScoreLink)
    
    if len(shot_path) == 1:
        shotChartLink = base_path + shot_path[0] # full link to the shot chart page
        return extract_shot_data(shotChartLink)
    else:
        return [] # empty list if no shot chart is available
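
To make the tooltip parsing concrete without any network access, here is a small self-contained sketch that pulls the same four values out of the example string from the `clean_tip_string` docstring. It uses a single regular expression rather than the split-based approach of that function, so treat it as an alternative illustration and not as the notebook's method. The name `parse_tip` and the pattern are assumptions based on that one example, and other tooltip wordings may not match.

```python
import re

# Pattern inferred from the single docstring example; an assumption, not a site spec.
TIP_PATTERN = re.compile(
    r"(?P<time>\d+:\d+\.\d)\s+remaining<br>"                  # time left in the quarter
    r".*?from\s+(?P<dist>\d+)\s+ft<br>"                        # shot distance in feet
    r"(?P<team>.+?)\s+(?:now\s+)?(?:leads?|trails?|tied)\s+"   # team name before the score verb
    r"(?P<score>\d+-\d+)$"                                     # score after the shot
)

def parse_tip(tip):
    """Return [time, dist, game score, team], or None if the tip does not match."""
    m = TIP_PATTERN.search(tip)
    if m is None:
        return None
    return [m.group("time"), m.group("dist"), m.group("score"), m.group("team")]

tip = ('1st quarter, 11:25.0 remaining<br>Darrell Armstrong made 2-pointer '
       'from 17 ft<br>Orlando now tied 2-2')
print(parse_tip(tip))
```

Returning `None` for an unmatched tooltip lets a caller skip or log unusual entries instead of failing on an index error.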
    
    

In [22]:
# Use this cell for testing specific boxscore links.

test_link = 'https://www.basketball-reference.com/boxscores/199911020CHH.html' # The parsing code was designed mainly around this link
#test_link = 'https://www.basketball-reference.com/boxscores/196910140NYK.html' # boxscore without a shot chart
#test_link = 'https://www.basketball-reference.com/boxscores/shot-chart/199911120CHH.html'

# Get all shot chart data for a single game based on a boxscore link.
get_single_shot_chart_data(test_link)[:1]

[['marbust01',
  1,
  'make',
  '283',
  '318',
  '11:23.0',
  '25',
  '3-0',
  'New Jersey',
  'NJN',
  'CHH',
  '2000']]
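
The pixel positions, distance, and clock value in this output are all strings. For later analysis it may help to convert them; the sketch below shows one possible conversion, including turning the `MM:SS.0` clock value into seconds remaining in the quarter. The helper name `clock_to_seconds` is hypothetical and is not part of the notebook.

```python
def clock_to_seconds(clock):
    """Convert a 'MM:SS.0' game-clock string into seconds remaining."""
    minutes, seconds = clock.split(":")
    return int(minutes) * 60 + float(seconds)

# First row returned for the test link
row = ['marbust01', 1, 'make', '283', '318', '11:23.0', '25',
       '3-0', 'New Jersey', 'NJN', 'CHH', '2000']

seconds_left = clock_to_seconds(row[5])   # 11 * 60 + 23.0
top_px, left_px, dist_ft = int(row[3]), int(row[4]), int(row[6])
print(seconds_left, top_px, left_px, dist_ft)
```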

In [23]:
!wget https://raw.githubusercontent.com/pmcwhannel/NBA-analytics/main/box_score_links.txt

--2020-12-04 08:04:27--  https://raw.githubusercontent.com/pmcwhannel/NBA-analytics/main/box_score_links.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4036305 (3.8M) [text/plain]
Saving to: ‘box_score_links.txt.1’


2020-12-04 08:04:39 (22.7 MB/s) - ‘box_score_links.txt.1’ saved [4036305/4036305]



### Code to find first game with shot chart data

In [24]:
# This was just used for finding the year when shot charts begin.
# 30500 + 1347 = 31847 is the index of the first link with shot chart data.
import timeit as tt
boxScoreLinks = open("box_score_links.txt").readlines()

start = tt.default_timer()
for i, link in enumerate(boxScoreLinks[30500:]): # try some subset
    data = get_single_shot_chart_data(link.strip())
    if i % 100 == 0:
        print('Took {} seconds to process up to link {}.'.format(tt.default_timer() - start, i+1))

    # Stop at the first game that has shot chart data
    if len(data) >= 1:
        print(link)
        break
print(30500 + i, data)

Took 0.1553727489990706 seconds to process up to link 1.
Took 13.27405905699925 seconds to process up to link 101.
Took 26.467339170998457 seconds to process up to link 201.
Took 39.093649572001596 seconds to process up to link 301.
Took 51.51036883299821 seconds to process up to link 401.
Took 64.3902643299989 seconds to process up to link 501.
Took 77.0859194279983 seconds to process up to link 601.
Took 89.62399142399954 seconds to process up to link 701.
Took 101.71742716199878 seconds to process up to link 801.
Took 113.80503285899977 seconds to process up to link 901.
Took 126.68332838100105 seconds to process up to link 1001.
Took 139.1865779360014 seconds to process up to link 1101.
Took 151.4090116840016 seconds to process up to link 1201.
Took 163.60906616599823 seconds to process up to link 1301.
https://www.basketball-reference.com/boxscores/199611010BOS.html

31847 [['jordami01', 1, 'miss', '198', '285', '11:06.0', '15', '3-2', 'Chicago', 'CHI', 'BOS', '1997'], ['pippesc01
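
The box score URLs themselves appear to encode the game date and the home team: `199611010BOS` belongs to a game on 1996-11-01 whose home team in the output above is `BOS`. Assuming that pattern holds for every link (it is inferred from the few examples in this notebook, not from any site documentation), a sketch like the following could skip links from earlier seasons without requesting them. `parse_boxscore_link` is a hypothetical helper.

```python
import re
from datetime import date

# Assumed URL shape: .../boxscores/YYYYMMDD0TTT.html (inferred from the examples here)
LINK_PATTERN = re.compile(r"/boxscores/(\d{4})(\d{2})(\d{2})\d([A-Z]{3})\.html$")

def parse_boxscore_link(link):
    """Return (game_date, team_code), or None if the link does not match the pattern."""
    m = LINK_PATTERN.search(link.strip())
    if m is None:
        return None
    year, month, day, team = m.groups()
    return date(int(year), int(month), int(day)), team

links = [
    "https://www.basketball-reference.com/boxscores/196910140NYK.html\n",
    "https://www.basketball-reference.com/boxscores/199611010BOS.html\n",
    "https://www.basketball-reference.com/boxscores/199911020CHH.html\n",
]
first_chart_date = date(1996, 11, 1)  # date of the first game found with a shot chart

recent = []
for link in links:
    parsed = parse_boxscore_link(link)
    # Keep only links that match the pattern and are not older than the first shot chart
    if parsed is not None and parsed[0] >= first_chart_date:
        recent.append(link.strip())
print(recent)
```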

In [31]:
!rm -r all_shot_chart_data

In [32]:
import timeit as tt

def toCSVLine(data):
    return ','.join(str(d) for d in data)


boxScoreLinks = open("box_score_links.txt").readlines()
boxscore_URLS = sc.parallelize(boxScoreLinks[31847:]) # start at the first link with shot chart data
textFileName = 'all_shot_chart_data'
start = tt.default_timer()
# Scrape each game, flatten the per-game lists into single shots, and write one CSV line per shot
(boxscore_URLS
    .map(lambda x: get_single_shot_chart_data(x.strip()))
    .flatMap(lambda x: x)
    .map(toCSVLine)
    .saveAsTextFile(textFileName))
stop = tt.default_timer()
print('It took {} seconds to extract all shot chart data.'.format(stop - start))

It took 4900.576215703 seconds to extract all shot chart data.
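
The Spark pipeline in this cell maps each link to a list of shots, flattens those per-game lists into one collection of shots with `flatMap`, and then turns each shot into a CSV line. The plain-Python sketch below mirrors those three steps on made-up data so the flattening step is easier to see. `fake_scrape` and its rows are placeholders standing in for `get_single_shot_chart_data`; no Spark or network access is involved.

```python
from itertools import chain

def toCSVLine(data):
    return ','.join(str(d) for d in data)

def fake_scrape(link):
    """Placeholder scraper: returns a list of shot rows (possibly empty) for a link."""
    made_up = {
        "game_a": [["player1", 1, "make"], ["player2", 1, "miss"]],
        "game_b": [],                       # a game without a shot chart
        "game_c": [["player3", 2, "make"]],
    }
    return made_up[link]

links = ["game_a\n", "game_b\n", "game_c\n"]

per_game = map(lambda x: fake_scrape(x.strip()), links)   # like the first .map(...)
shots = chain.from_iterable(per_game)                     # like .flatMap(lambda x: x)
lines = list(map(toCSVLine, shots))                       # like the final .map(...)
print(lines)
```

Games without a shot chart return an empty list, so they simply contribute no lines after flattening.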


In [33]:
!cp -r all_shot_chart_data /content/drive/MyDrive/CS631-Project
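
`saveAsTextFile` writes a directory rather than a single file: the copied `all_shot_chart_data` folder holds `part-NNNNN` files (one per partition) next to a `_SUCCESS` marker. A minimal sketch for reading those parts back into rows is shown below. The helper name `read_shot_lines` is hypothetical, and the demo builds a throwaway directory with two sample lines taken from outputs shown earlier instead of touching the real output.

```python
import glob
import os
import tempfile

def read_shot_lines(output_dir):
    """Collect all non-empty lines from the part-* files in an output directory."""
    lines = []
    for path in sorted(glob.glob(os.path.join(output_dir, "part-*"))):
        with open(path) as f:
            lines.extend(line.rstrip("\n") for line in f if line.strip())
    return lines

# Demo on a temporary directory that mimics the layout (two sample part files)
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "part-00000"), "w") as f:
    f.write("marbust01,1,make,283,318,11:23.0,25,3-0,New Jersey,NJN,CHH,2000\n")
with open(os.path.join(demo_dir, "part-00001"), "w") as f:
    f.write("jordami01,1,miss,198,285,11:06.0,15,3-2,Chicago,CHI,BOS,1997\n")
open(os.path.join(demo_dir, "_SUCCESS"), "w").close()  # marker file, ignored by the reader

rows = [line.split(",") for line in read_shot_lines(demo_dir)]
print(len(rows), rows[0][0], rows[1][0])
```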

In [26]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
