# Scouting Report Scraping

The purpose of this notebook is to scrape scouting report data from [Baseball Prospectus' database](https://legacy.baseballprospectus.com/prospects/eyewitness.php).

## Setup/Initiation:

First, importing needed packages:

In [1]:
import pickle
import pandas as pd
import matplotlib.pyplot as plt
from pylab import rcParams
%matplotlib inline
rcParams['figure.figsize'] = 20,10
import numpy as np
import glob
from scipy import stats
from bs4 import BeautifulSoup
import requests
import re
from IPython.display import display, HTML  # render HTML output inline in Jupyter

In [2]:
home_url = 'https://legacy.baseballprospectus.com/prospects/eyewitness.php'

Setting up beautifulsoup object:

In [3]:
home_response = requests.get(home_url)
page = home_response.text
soup_object = BeautifulSoup(page, 'lxml')

## Initial Dataframe
First, I need a dataframe with each player's initial information (name, links, etc.) to further scrape their reports.

Now, getting the table headers:

In [4]:
header = soup_object.find('tr', class_='header').find_all('td')

In [5]:
header_list = [col.get_text() for col in header]

In [6]:
header_list

['Player', 'Primary Pos', 'Evaluator', 'Report Date', 'OFP']

Getting the rows of the table from the site, one per report, with each player's name, position, evaluator, report date, and OFP.

In [7]:
rows = soup_object.find('tbody').find_all('tr',class_=lambda x: x!= 'header')

In [8]:
player_rows = [row.find_all('td') for row in rows]

In [9]:
final_player_data = []
for player in player_rows:
    player_list = []
    for item in player:
        player_list.append(item.get_text())
    final_player_data.append(player_list)
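
The header and row steps above can be folded into a single helper. What follows is a minimal sketch, not the notebook's actual code: `extract_table` and `SAMPLE_HTML` are hypothetical names, the HTML is a made-up stand-in for the live page, and the built-in `html.parser` is used instead of `lxml` so the example needs no extra parser.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the live table; the real page has more columns and rows.
SAMPLE_HTML = """
<table>
  <tbody>
    <tr class="header"><td>Player</td><td>Primary Pos</td><td>OFP</td></tr>
    <tr><td><a href="eyewitness_bat.php?reportid=1">Sample Hitter</a></td><td>SS</td><td>50</td></tr>
    <tr><td><a href="eyewitness_bat.php?reportid=2">Other Hitter</a></td><td>CF</td><td>60</td></tr>
  </tbody>
</table>
"""


def extract_table(html):
    """Return (column names, list of row dicts) from the first table in html."""
    soup = BeautifulSoup(html, "html.parser")
    header_cells = soup.find("tr", class_="header").find_all("td")
    columns = [cell.get_text(strip=True) for cell in header_cells]

    records = []
    for row in soup.find("tbody").find_all("tr"):
        # Skip the header row; bs4 exposes the class attribute as a list.
        if "header" in (row.get("class") or []):
            continue
        cells = row.find_all("td")
        if not cells:
            continue
        record = dict(zip(columns, (cell.get_text(strip=True) for cell in cells)))
        # Keep the relative link from the first cell, if the row has one.
        anchor = cells[0].find("a")
        record["href"] = anchor.get("href") if anchor else None
        records.append(record)
    return columns, records


columns, records = extract_table(SAMPLE_HTML)
```

Returning dictionaries keyed by column name keeps each value tied to its header, which makes a misaligned row easier to spot than in a plain list of lists.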

Now, creating a Pandas dataframe with that info:

In [10]:
initial_df = pd.DataFrame(final_player_data, columns=header_list)

The only thing left to add is the link to each player's personal page, which allows scraping of the individual scouting reports. Based on the website, each player's link is 'https://legacy.baseballprospectus.com/prospects/' followed by the relative href from the player's row; the results are stored in a Player_Page_Link column.

In [11]:
base_link = 'https://legacy.baseballprospectus.com/prospects/'

In [12]:
player_links = [base_link + row[0].find('a').get('href') for row in player_rows]
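
String concatenation works here because `base_link` ends in a slash and every `href` is relative. As a hedged alternative, the standard library's `urllib.parse.urljoin` resolves a relative link against a base and makes the dependence on that trailing slash explicit. `build_player_link` is a hypothetical helper name, not something from the notebook.

```python
from urllib.parse import urljoin

BASE_LINK = "https://legacy.baseballprospectus.com/prospects/"


def build_player_link(href, base=BASE_LINK):
    """Resolve a relative href from the table against the base URL."""
    if href is None:
        return None  # the row had no link to follow
    return urljoin(base, href)


full_link = build_player_link("eyewitness_bat.php?reportid=544")

# Without the trailing slash, urljoin treats "prospects" as a file name
# and replaces it, so the resulting URL loses that path segment.
no_slash_link = build_player_link(
    "eyewitness_bat.php?reportid=544",
    base="https://legacy.baseballprospectus.com/prospects",
)
```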

In [13]:
len(player_links)

1074

In [14]:
initial_df.shape

(1074, 5)

Shapes match up, adding this on:

In [15]:
initial_df['Player_Page_Link'] = player_links
initial_df.head()

|   | Player | Primary Pos | Evaluator | Report Date | OFP | Player_Page_Link |
|---|---|---|---|---|---|---|
| 0 | CJ Abrams | SS | Keanan Lamb | 06/01/2019 | 50 | https://legacy.baseballprospectus.com/prospect... |
| 1 | Albert Abreu | P | Mauricio Rubio Jr. | 08/27/2016 | 60 | https://legacy.baseballprospectus.com/prospect... |
| 2 | Osvaldo Abreu | SS | Tucker Blair | 05/04/2015 | 40 | https://legacy.baseballprospectus.com/prospect... |
| 3 | Albert Abreu | P | Grant Jones | 00/00/0000 | 60 | https://legacy.baseballprospectus.com/prospect... |
| 4 | Albert Abreu | P | John Eshleman | 12/13/2017 | 60 | https://legacy.baseballprospectus.com/prospect... |


From here, I can use the links in the Player_Page_Link column to scrape each player's report page.

First, I'll pickle out this dataframe so I don't have to continue to scrape the site for the same data.

In [59]:
pwd

'/Users/patrickbovard/Documents/GitHub/MLB-Scouting-Reports-Analysis'

In [16]:
with open('./Data_Files/initial_full_df_links.pickle', 'wb') as to_write:
    pickle.dump(initial_df, to_write)
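
The same caching idea can be wrapped in a small helper that only rebuilds the data when no pickle exists yet. This is a sketch under assumptions: `load_or_build` is a hypothetical name, and `build` stands for whatever function performs the scrape.

```python
import pickle
from pathlib import Path


def load_or_build(path, build):
    """Load a pickled object from path, or call build() once and cache it."""
    path = Path(path)
    if path.exists():
        with path.open("rb") as to_read:
            return pickle.load(to_read)

    result = build()  # e.g. the scraping step; only runs on a cache miss
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as to_write:
        pickle.dump(result, to_write)
    return result
```

A call such as `load_or_build('./Data_Files/initial_full_df_links.pickle', build_initial_df)` would then hit the site only on the first run, where `build_initial_df` is again a hypothetical wrapper around the cells above. Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.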

## Scraping of Player Reports

The links collected above point to two basic types of scouting reports: hitters and pitchers.  Knowing the type up front lets me tailor the scraping functions to each kind of player.

### Hitters
The hitters have 'eyewitness_bat' in their link.  Based on the scouting reports ([CJ Abrams for example](https://legacy.baseballprospectus.com/prospects/eyewitness_bat.php?reportid=544)), there are a few main fields to capture:
- Personal Info (Born, Bats, Throws, Height, Weight, Primary Position, Secondary Position)
- Physical/Health
- First Table (MLB ETA, Risk Factor)
- Scouting Table - Tools (for Hit, Power, Baserunning/Speed, Glove, Arm):
    - Future Grade
    - Report
- Overall Report

These fields are consistent across all hitters, so a corresponding set of functions can scrape them.

I'll start by testing it out with CJ Abrams, as an example, since the reports have a similar format:

In [16]:
abrams_link = 'https://legacy.baseballprospectus.com/prospects/eyewitness_bat.php?reportid=544'

In [17]:
home_response = requests.get(abrams_link)
page = home_response.text
abrams_soup = BeautifulSoup(page, 'lxml')

Basic Information:

Name:

In [18]:
name = abrams_soup.find('table', class_='info').find('p', class_='name').get_text()

In [19]:
name

'CJ Abrams'

Other personal info:

In [20]:
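# Take the text after the player's name and split it into one line per field
info_orig = abrams_soup.find('table', class_='info').find('td').get_text().split(name)[1][1:].split('\n')
# Keep the text between the first and second colon; note that str.strip(' (Age') removes any of the
# characters ' ', '(', 'A', 'g', 'e' from both ends (a character set, not a suffix match)
info_final = [item.split(':')[1][1:].strip(' (Age') for item in info_orig if item]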

In [21]:
info_final

['10/03/2000', 'Left', 'Right', '6\' 2"', '185', 'SS', 'CF']

In [22]:
info_final.insert(0, name)

In [23]:
info_final

['CJ Abrams', '10/03/2000', 'Left', 'Right', '6\' 2"', '185', 'SS', 'CF']
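
One caveat about the parsing above: `str.strip(' (Age')` strips a set of characters from both ends rather than removing the literal suffix, so it could eat into a value that happens to start or end with `A`, `g`, or `e`. A more explicit version might look like the sketch below. It assumes each raw line looks like `Born: 10/03/2000 (Age: 19)`, which is inferred from the code rather than confirmed against the page, and `parse_info_line` is a hypothetical name.

```python
def parse_info_line(line):
    """Split one 'Label: value' line into a (label, value) pair."""
    label, _, value = line.partition(":")
    # Cut any trailing parenthetical such as "(Age: 19)" before trimming.
    # Assumes real values never contain an opening parenthesis themselves.
    value = value.split("(")[0].strip()
    return label.strip(), value


born = parse_info_line("Born: 10/03/2000 (Age: 19)")
height = parse_info_line("Height: 6' 2\"")

# Illustration of the pitfall: strip() works on a character set, so an
# unrelated word loses letters from both ends.
over_stripped = "Average".strip(" (Age")
```

Keeping the label alongside the value also makes it possible to build a dictionary per player instead of relying on field order.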

Evaluator Info:

In [24]:
#Evaluator, Report Date, Date Seen, Affiliate:
eval_info = [item.get_text().replace('\n', '').replace('\t', '') for item in abrams_soup.find('table', class_='evaluator').find('tbody').find_all('td',class_=lambda x: x!= 'header')]

In [25]:
eval_info

['Keanan Lamb', '06/01/2019', '4/3-6/19', ' (, ) ']

In [26]:
info_final + eval_info

['CJ Abrams',
 '10/03/2000',
 'Left',
 'Right',
 '6\' 2"',
 '185',
 'SS',
 'CF',
 'Keanan Lamb',
 '06/01/2019',
 '4/3-6/19',
 ' (, ) ']

Physical/Health:

In [27]:
abrams_soup.find('table', class_='mechanics').find_all('td')[-1].get_text()

'A pure athlete with quick twitch muscles, very skinny and needs to add good weight without sacrificing dynamic movement. His premium athleticism is the upside on all his projected future tools.'

MLB ETA, Risk, OFP:

In [28]:
rep_stuff = [item.get_text().strip(' ') for item in abrams_soup.find('table', class_='repertoire').find_all('td')[4:7]]

In [29]:
rep_stuff

['2025', 'Extreme', '50']

Tools and Reports:

In [30]:
#First need to find the rows for hit, power, speed, glove, and arm:
tools_full = abrams_soup.find('table', class_='tool').find_all('tr', class_=lambda x: x!= 'header')

#Then, can get grades:
tools_grades = [tool.find_all('td', class_='mid') for tool in tools_full]
tools_grades_text = [tool[0].get_text().strip(' ') for tool in tools_grades]

In [31]:
tools_grades_text

['50', '40', '70', '55', '60']

In [32]:
#Now, Reports:
tools_reports = [tool.find_all('td')[-1].get_text().strip('\t')[1:] for tool in tools_full] #Need to get rid of leading spaces

In [33]:
len(tools_reports)

5
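
Since the grades and reports come back as two parallel lists, it can help to attach tool names before they go into a row. The sketch below assumes the page always lists the tools in the order Hit, Power, Running, Glove, Arm, as in the field list above; `TOOL_NAMES` and `label_tools` are hypothetical names.

```python
TOOL_NAMES = ("Hit", "Power", "Running", "Glove", "Arm")


def label_tools(grades, reports):
    """Pair each tool name with its grade and report text."""
    if len(grades) != len(TOOL_NAMES) or len(reports) != len(TOOL_NAMES):
        # Fail loudly if a page is missing a tool row instead of misaligning.
        raise ValueError("expected one grade and one report per tool")

    labeled = {}
    for tool, grade, report in zip(TOOL_NAMES, grades, reports):
        labeled[f"{tool}_Grade"] = grade
        labeled[f"{tool}_Report"] = report
    return labeled


# Grades taken from the output above; the report strings are placeholders.
example = label_tools(
    ["50", "40", "70", "55", "60"],
    ["hit text", "power text", "run text", "glove text", "arm text"],
)
```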

Overall Report:

In [34]:
overall_report = abrams_soup.find('table', class_='overall').find('p').get_text()[1:]

In [35]:
overall_report

"Every year there are players drafted purely based on athletic ability and not necessarily on baseball skills, and Abrams would fall into that category as he is still plenty raw defensively and especially offensively. The team that selects him will have to be patient and work with him to help develop his skills in accordance with his physicality. It's a boom or bust profile that is reliant on everything clicking."

Next, I'll test the functions compiled in hitter_scraping.py, which package the steps above.  This starts from the full list of links, filtered down to hitters.

In [36]:
link_list = list(initial_df.Player_Page_Link.values)

In [37]:
hitter_link_list = [link for link in link_list if 'eyewitness_bat' in link]

In [38]:
from hitter_scraping import *

In [39]:
scraped_data = hitter_puller_list(hitter_link_list)
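
The contents of `hitter_scraping.py` are not shown here, so the following is not its implementation. It is a hedged sketch of one way to loop over many links: per-link error handling so a single bad page does not abort the run, and a pause between requests. `scrape_links`, `fetch`, and `parse` are hypothetical names; in practice `fetch` might wrap `requests.get(url, timeout=10).text` and `parse` might wrap the BeautifulSoup steps above.

```python
import time


def scrape_links(links, fetch, parse, delay_seconds=1.0):
    """Fetch and parse each link, collecting rows and failures separately."""
    rows, failures = [], []
    for index, link in enumerate(links):
        try:
            rows.append(parse(fetch(link)))
        except Exception as exc:  # record the problem and keep going
            failures.append((link, repr(exc)))
        # Pause between requests to avoid hammering the site.
        if delay_seconds and index < len(links) - 1:
            time.sleep(delay_seconds)
    return rows, failures
```

Passing `fetch` and `parse` in as arguments keeps the loop testable without network access, and the `failures` list shows which links need a second look.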

With the data scraped, these are the columns that will be needed for the dataframe.  I'll then just quickly check that the number of columns in all players' data is the same.

In [52]:
report_cols = ['Name', 'Born', 'Bats', 'Throws', 'Height', 'Weight', 'Physical_Health', 'MLB_ETA', 'Risk_Factor', 'OFP', 'Hit_Grade', 'Power_Grade', 'Running_Grade', 'Glove_Grade', 'Arm_Grade', 'Hit_Report','Power_Report','Running_Report','Glove_Report','Arm_Report', 'Overall_Report']

In [53]:
len(report_cols)

21

In [54]:
uneven_reports = [report for report in scraped_data if len(report) != len(report_cols)]
len(uneven_reports)

0

Great, every row has the same length as the column list.
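
If the check ever does find mismatches, knowing which rows are off is more useful than a count. A small sketch, with `find_uneven_rows` as a hypothetical helper name:

```python
def find_uneven_rows(rows, expected_len):
    """Return (index, actual length) for every row of the wrong length."""
    return [
        (index, len(row))
        for index, row in enumerate(rows)
        if len(row) != expected_len
    ]


# Toy data: the second row is one field short.
sample_rows = [["a", "b", "c"], ["a", "b"], ["a", "b", "c"]]
problems = find_uneven_rows(sample_rows, expected_len=3)
```

The returned indices line up with the order of the link list, so a short row can be traced back to the page that produced it.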

Now, building the hitters' dataframe.

In [55]:
hitter_df = pd.DataFrame(scraped_data, columns=report_cols)

In [56]:
hitter_df.head()

Unnamed: 0,Name,Born,Bats,Throws,Height,Weight,Physical_Health,MLB_ETA,Risk_Factor,OFP,...,Power_Grade,Running_Grade,Glove_Grade,Arm_Grade,Hit_Report,Power_Report,Running_Report,Glove_Report,Arm_Report,Overall_Report
0,CJ Abrams,10/03/2000,Left,Right,"6' 2""",185,"A pure athlete with quick twitch muscles, very...",2025,Extreme,50,...,40,70,55,60,Very raw as a hitter with a lot of mechanical ...,Beyond needing to gain strength to add to his ...,"Easily his best tool, runs like a wide receive...","Good hand-eye coordination and fundamentals, c...","On pure arm strength, it's average to slightly...",Every year there are players drafted purely ba...
1,Osvaldo Abreu,06/13/1994,Right,Right,"6' 0""",195,Smaller frame with a little room to add streng...,2018,Moderate,45,...,30,55,55,55,"Open stance, steady at the plate, hands are ti...","Frame translates to below-average raw power, l...","4.15 clock, above-average foot speed speed, qu...","Quick transfer, flashes soft hands, immature g...",Enough arm strength to make any throw from sho...,Abreu has the defensive chops and arm strength...
2,Osvaldo Abreu,06/13/1994,Right,Right,"6' 0""",195,Small frame; toned body with muscular definiti...,2019,High,40,...,30,45,50,50,Above-average bat speed; noisy hands; minor hi...,Average raw power; slight leverage and above-a...,4.45 home to first; slow out of the box; more ...,Extremely inconsistent currently; footwork can...,Above-average arm strength; moderate carry and...,Abreu was signed as an international free agen...
3,Ronald Acuna,12/18/1997,Right,Right,"6' 0""",180,"Quick-twitch movement, explosive hands; presen...",2019,High,70,...,60,55,45,55,"High hands stay quiet through load, attacks ba...",Plus-plus raw power; outsized raw despite aver...,"Times range 4.2-4.3, 4.15 on a leaner out of b...",Uncertain center-field future; above-average r...,Above-average arm for center; would play to av...,Acuna flashes all five tools with quick-twitch...
4,Ronald Acuna,12/18/1997,Right,Right,"6' 0""",180,"Impressive, proportioned athlete with plus str...",2018,Low,80,...,70,55,55,70,High hands (above ear) in set-up. Moderate leg...,Wrist strength and quickness is major asset. G...,4.23 home to first. Faster underway with longe...,Demonstrates good reads and solid first step; ...,One-hopper on line from RF warning track. Thro...,Impressive player ready to contribute now desp...


In [57]:
hitter_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 524 entries, 0 to 523
Data columns (total 21 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Name             524 non-null    object
 1   Born             524 non-null    object
 2   Bats             524 non-null    object
 3   Throws           524 non-null    object
 4   Height           524 non-null    object
 5   Weight           524 non-null    object
 6   Physical_Health  524 non-null    object
 7   MLB_ETA          524 non-null    object
 8   Risk_Factor      524 non-null    object
 9   OFP              524 non-null    object
 10  Hit_Grade        524 non-null    object
 11  Power_Grade      524 non-null    object
 12  Running_Grade    524 non-null    object
 13  Glove_Grade      524 non-null    object
 14  Arm_Grade        524 non-null    object
 15  Hit_Report       524 non-null    object
 16  Power_Report     524 non-null    object
 17  Running_Report   524 non-null    object
 ...

Pickling out this dataframe to save for later, and to avoid having to continuously scrape the site.

In [59]:
pwd

'/Users/patrickbovard/Documents/GitHub/MLB-Scouting-Reports-Analysis'

In [61]:
with open('./Data_Files/initial_hitter_data.pickle', 'wb') as to_write:
    pickle.dump(hitter_df, to_write)