# Player Insurance Scraper

This notebook will generate the CSV files for each player season. It will also run the hedging analysis of our proposed contracts.

## Step 1: scraping the data

<b>Goal: get usable player data that we can feed into our analysis in subsequent steps</b>

To run our analysis, we first need a table of available player data. Each row is keyed on (player, season) and contains the following variables:
<li>Counting stats: points, rebounds, assists, turnovers, steals, blocks</li>
<li>Team played for during that season: list with possibly several values</li>
<li>Salary for that year: float</li>
<li>Salary rank for that year: int </li>
<li>Games played that year: list of 82 games, which will be represented as booleans (false if inactive, true if active)</li>
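
As a rough illustration of the row layout described above, one record per (player, season) could be sketched as a dataclass. The class name `PlayerSeason` and its field names are assumptions made for this sketch, not names used elsewhere in the notebook, and the salary rank shown is a placeholder.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

GAMES_PER_SEASON = 82  # length of the games-played list described above


@dataclass
class PlayerSeason:
    """Hypothetical container for one (player, season) row."""
    player: str
    season: int
    points: float
    rebounds: float
    assists: float
    turnovers: float
    steals: float
    blocks: float
    teams: List[str]   # possibly several teams in one season
    salary: float
    salary_rank: int
    # True if active for that game, False if inactive
    games_played: List[bool] = field(
        default_factory=lambda: [True] * GAMES_PER_SEASON
    )

    def key(self) -> Tuple[str, int]:
        # The (player, season) pair that identifies the row
        return (self.player, self.season)


example = PlayerSeason(
    player="DeMarcus Cousins", season=2012,
    points=18.1, rebounds=11.0, assists=1.6,
    turnovers=2.7, steals=1.5, blocks=1.2,
    teams=["SAC"], salary=3627720.0,
    salary_rank=1,  # placeholder value for illustration only
)
print(example.key())
```

Keeping `teams` and `games_played` as lists mirrors the variable descriptions in the list above; a flat CSV export would need to serialize them in some agreed way.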

### A: pre-processing
Import the relevant libraries.

In [1]:
import requests
from bs4 import BeautifulSoup, Comment
import pandas as pd
import os
import lxml

In [25]:
url = 'https://www.basketball-reference.com/teams/SAC/2012.html'
page = requests.get(url)
page

<Response [200]>
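
Since the same request will eventually be needed for other teams and seasons, the URL can be built from a team code and a year. The sketch below assumes every team page follows the pattern of the URL above; the names `BASE_URL` and `team_page_url` are hypothetical.

```python
# Assumption: all team-season pages share the pattern of the SAC/2012 URL
BASE_URL = "https://www.basketball-reference.com/teams/{team}/{year}.html"


def team_page_url(team: str, year: int) -> str:
    """Build the team-season URL, e.g. team='SAC', year=2012."""
    return BASE_URL.format(team=team.upper(), year=year)


print(team_page_url("SAC", 2012))
```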

In [4]:
page.content



In [26]:
soup = BeautifulSoup(page.text, 'lxml')
print(soup.prettify())

<!DOCTYPE html>
<html class="no-js" data-root="/home/bbr/build" data-version="klecko-" itemscope="" itemtype="https://schema.org/WebSite" lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="ie=edge" http-equiv="x-ua-compatible"/>
  <meta content="width=device-width, initial-scale=1.0, maximum-scale=2.0" name="viewport"/>
  <link href="https://d2p3bygnnzw9w3.cloudfront.net/req/202203081" rel="dns-prefetch"/>
  <!-- Quantcast Choice. Consent Manager Tag v2.0 (for TCF 2.0) -->
  <script async="true" type="text/javascript">
   (function() {
	var host = window.location.hostname;
	var element = document.createElement('script');
	var firstScript = document.getElementsByTagName('script')[0];
	var url = 'https://quantcast.mgr.consensu.org'
	    .concat('/choice/', 'XwNYEpNeFfhfr', '/', host, '/choice.js')
	var uspTries = 0;
	var uspTriesLimit = 3;
	element.async = true;
	element.type = 'text/javascript';
	element.src = url;
	
	firstScript.parentNode.insertBefore(element, firstScript);

In [58]:
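# This turns a salary string such as "$7,000,000" into a number
from decimal import Decimal
def makeDollarFloat(salary_str):
    # Strip the dollar sign and the thousands separators before parsing
    return Decimal(salary_str.strip('$').replace(',', ''))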

In [58]:
# This builds the initial dataframe of players based off of the /teams/[code]/year page
# Initially, we scrape the basic counting stats, with an index of (player, year)
per_game = soup.find(id="div_per_game")
team='SAC'
year=2012
stats = ['player', 'pts_per_g', 'trb_per_g', 'ast_per_g', 'stl_per_g', 'blk_per_g', 'tov_per_g']

stats_list = [[td.getText() for td in per_game.findAll('td', {'data-stat': stat})] for stat in stats]

df_index = [(x, year) for x in stats_list[0]]
all_stats = pd.DataFrame(stats_list).T
all_stats.columns = stats
all_stats['team'] = team  # a scalar is broadcast to every row
all_stats['year'] = year
all_stats = all_stats[['player', 'team', 'year', 'pts_per_g', 'trb_per_g', 'ast_per_g', 'stl_per_g', 'blk_per_g', 'tov_per_g']]
all_stats

Unnamed: 0,player,team,year,pts_per_g,trb_per_g,ast_per_g,stl_per_g,blk_per_g,tov_per_g
0,Marcus Thornton,SAC,2012,18.7,3.7,1.9,1.4,0.2,1.6
1,Tyreke Evans,SAC,2012,16.5,4.6,4.5,1.3,0.5,2.7
2,DeMarcus Cousins,SAC,2012,18.1,11.0,1.6,1.5,1.2,2.7
3,John Salmons,SAC,2012,7.5,2.9,2.0,0.8,0.2,1.0
4,Jason Thompson,SAC,2012,9.1,6.9,1.2,0.7,0.7,1.1
5,Isaiah Thomas,SAC,2012,11.5,2.6,4.1,0.8,0.1,1.6
6,Terrence Williams,SAC,2012,8.8,4.1,3.1,0.9,0.3,1.8
7,Chuck Hayes,SAC,2012,3.2,4.3,1.4,0.7,0.3,0.9
8,Jimmer Fredette,SAC,2012,7.6,1.2,1.8,0.5,0.0,1.1
9,J.J. Hickson,SAC,2012,4.7,5.1,0.6,0.5,0.5,1.1


In [59]:
# This gets the salaries for each player

# The salary table sits inside an HTML comment, so we need some pre-processing.
# Comments have no clean structure, so we loop through all of them to find the one
# that contains the id "salaries2", then keep that element as its own soup object.

comments = soup.find_all(string=lambda text: isinstance(text, Comment))
salaries = None
# Loop through all comments
for c in comments:
    commentsoup = BeautifulSoup(c, 'lxml')
    # Check whether the salaries2 id is in this comment; if so, store it in salaries
    salary_comment = commentsoup.find_all(id="salaries2")
    if salary_comment:
        salaries = salary_comment[0]
        break
#Now, we make a second data frame that has "player, salary"
stats = ['player', 'salary']
stats_list = [[td.getText() for td in salaries.findAll('td', {'data-stat': stat})] for stat in stats]
salary_stats = pd.DataFrame(stats_list).T
salary_stats.columns = stats
all_stats = all_stats.join(salary_stats.set_index('player'), on='player')
all_stats = all_stats.fillna("$0")
all_stats

Unnamed: 0,player,team,year,pts_per_g,trb_per_g,ast_per_g,stl_per_g,blk_per_g,tov_per_g,salary
0,Marcus Thornton,SAC,2012,18.7,3.7,1.9,1.4,0.2,1.6,"$7,000,000"
1,Tyreke Evans,SAC,2012,16.5,4.6,4.5,1.3,0.5,2.7,"$4,151,640"
2,DeMarcus Cousins,SAC,2012,18.1,11.0,1.6,1.5,1.2,2.7,"$3,627,720"
3,John Salmons,SAC,2012,7.5,2.9,2.0,0.8,0.2,1.0,"$8,500,000"
4,Jason Thompson,SAC,2012,9.1,6.9,1.2,0.7,0.7,1.1,"$3,001,284"
5,Isaiah Thomas,SAC,2012,11.5,2.6,4.1,0.8,0.1,1.6,"$473,604"
6,Terrence Williams,SAC,2012,8.8,4.1,3.1,0.9,0.3,1.8,$0
7,Chuck Hayes,SAC,2012,3.2,4.3,1.4,0.7,0.3,0.9,"$5,250,000"
8,Jimmer Fredette,SAC,2012,7.6,1.2,1.8,0.5,0.0,1.1,"$2,238,360"
9,J.J. Hickson,SAC,2012,4.7,5.1,0.6,0.5,0.5,1.1,"$2,354,537"


Now that we have salaries alongside the basic counting stats, the next step is to determine whether any player who is top 5 in salary also missed 41 consecutive games. To narrow the search space, we only need to look at the top 5 salaries for this season.
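
The check described above can be sketched in plain Python, assuming each player's season is stored as the list of booleans from Step 1 (true if active, false if inactive). The helper names `longest_missed_streak` and `top_paid_long_absences`, and the sample data, are hypothetical.

```python
from typing import Dict, List


def longest_missed_streak(games_played: List[bool]) -> int:
    """Return the longest run of consecutive inactive (False) games."""
    longest = current = 0
    for active in games_played:
        current = 0 if active else current + 1
        longest = max(longest, current)
    return longest


def top_paid_long_absences(salaries: Dict[str, float],
                           games: Dict[str, List[bool]],
                           top_n: int = 5,
                           threshold: int = 41) -> List[str]:
    """Players in the top_n by salary whose longest absence reaches threshold."""
    top_paid = sorted(salaries, key=salaries.get, reverse=True)[:top_n]
    return [p for p in top_paid if longest_missed_streak(games[p]) >= threshold]


# Made-up example: "A" misses 45 games in a row, "B" plays all 82
salaries = {"A": 8_000_000.0, "B": 5_000_000.0}
games = {
    "A": [True] * 10 + [False] * 45 + [True] * 27,
    "B": [True] * 82,
}
print(top_paid_long_absences(salaries, games))
```

Restricting the scan to the `top_n` salaries first keeps the streak computation to a handful of players per team and season.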

In [60]:

all_stats.sort_values(by='salary')

Unnamed: 0,player,team,year,pts_per_g,trb_per_g,ast_per_g,stl_per_g,blk_per_g,tov_per_g,salary
6,Terrence Williams,SAC,2012,8.8,4.1,3.1,0.9,0.3,1.8,$0
10,Francisco García,SAC,2012,4.8,2.0,0.6,0.7,0.8,0.4,$0
11,Donté Greene,SAC,2012,5.4,2.5,0.6,0.3,0.5,0.6,$0
8,Jimmer Fredette,SAC,2012,7.6,1.2,1.8,0.5,0.0,1.1,"$2,238,360"
9,J.J. Hickson,SAC,2012,4.7,5.1,0.6,0.5,0.5,1.1,"$2,354,537"
12,Travis Outlaw,SAC,2012,4.3,1.6,0.4,0.5,0.5,0.5,"$3,000,000"
4,Jason Thompson,SAC,2012,9.1,6.9,1.2,0.7,0.7,1.1,"$3,001,284"
2,DeMarcus Cousins,SAC,2012,18.1,11.0,1.6,1.5,1.2,2.7,"$3,627,720"
1,Tyreke Evans,SAC,2012,16.5,4.6,4.5,1.3,0.5,2.7,"$4,151,640"
5,Isaiah Thomas,SAC,2012,11.5,2.6,4.1,0.8,0.1,1.6,"$473,604"
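
Note that the `salary` column holds strings, so the sort above is lexicographic: "$473,604" lands after "$4,151,640" because the strings are compared character by character. Below is a minimal sketch of sorting by numeric value instead, using only the standard library. The helper name `parse_salary` is an assumption, and the rows are copied from the output above.

```python
from decimal import Decimal


def parse_salary(salary_str: str) -> Decimal:
    """Convert a string such as "$4,151,640" to a Decimal."""
    return Decimal(salary_str.replace('$', '').replace(',', ''))


rows = [
    ("Tyreke Evans", "$4,151,640"),
    ("Isaiah Thomas", "$473,604"),
    ("Terrence Williams", "$0"),
    ("Jimmer Fredette", "$2,238,360"),
]

# Sort on the parsed value rather than on the raw string
by_salary = sorted(rows, key=lambda row: parse_salary(row[1]))
for player, salary in by_salary:
    print(player, salary)
```

With pandas, the same idea would be to convert the column first, for example with `all_stats['salary'].map(parse_salary)`, and sort on the converted values.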


So now we have the basic counting stats for a player. In order to get the salaries, we have to use a different service called Spotrac.

Spotrac is structured a bit differently than Basketball Reference. For each year, we can get the top five players on each team as defined by cap space.

In [13]:
# This loads the Spotrac page
url = 'https://www.spotrac.com/nba/sacramento-kings/cap/2011/'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'lxml')
print (soup.prettify())


<!DOCTYPE html>
<!--[if lt IE 7]> <html class="no-js ie6 oldie" lang="en"> <![endif]-->
<!--[if IE 7]>    <html class="no-js ie7 oldie" lang="en"> <![endif]-->
<!--[if IE 8]>    <html class="no-js ie8 oldie" lang="en"> <![endif]-->
<!--[if IE 9]>    <html class="no-js ie9 oldie" lang="en"> <![endif]-->
<!--[if gt IE 8]><!-->
<html class="no-js" lang="en-US">
 <!--<![endif]-->
 <head>
  <!-- start:global -->
  <meta charset="utf-8"/>
  <!-- end:global -->
  <!-- start:page title -->
  <title>
   Premium Sign In | Spotrac.com
  </title>
  <!-- end:page title -->
  <!-- start:meta info -->
  <meta content="premium, tools, subscription,sports, contract, contracts,salary cap, NFL, MLB, NHL, NBA, salaries, bonuses, compensation" name="keywords"/>
  <meta content="Sign in to your Spotrac Premium account to gain access to complete info, unique tools, and more!" name="description"/>
  <!-- end:meta info -->
  <!-- start:responsive web design -->
  <meta content="width=device-width, initial-scal

In [22]:

# This looks for the salaries table on the Spotrac page, using the same comment search as before
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
for c in comments:
    commentsoup = BeautifulSoup(c, 'lxml')
    salary_comment = commentsoup.find_all(id="salaries2")
    # Print the result for every comment so we can see whether anything matched
    print(salary_comment)

[]
...
[]
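
Every comment comes back empty here. One likely reason, judging from the `<title>` in the page dump above ("Premium Sign In | Spotrac.com"), is that the request returned a sign-in page rather than the cap table. In addition, `salaries2` is the id found on the Basketball Reference page, and nothing in the dump suggests Spotrac uses it. A small sanity check on the page title before parsing might look like the following sketch. The names `TitleParser`, `page_title`, and `looks_like_sign_in` are hypothetical, and the sketch uses the standard-library `html.parser` instead of BeautifulSoup so that it runs on its own.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        # Only the first <title> is recorded
        if tag == "title" and self.title is None:
            self._in_title = True
            self.title = ""

    def handle_data(self, data):
        if self._in_title:
            self.title += data

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False


def page_title(html: str) -> str:
    parser = TitleParser()
    parser.feed(html)
    return (parser.title or "").strip()


def looks_like_sign_in(html: str) -> bool:
    # Assumption: a sign-in page announces itself in its title
    return "sign in" in page_title(html).lower()


sample = "<html><head><title>\n Premium Sign In | Spotrac.com\n</title></head></html>"
print(page_title(sample))
print(looks_like_sign_in(sample))
```

In the notebook, this could be called as `looks_like_sign_in(page.text)` right after the request, before building the soup.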
