# Scouting Report Scraping

The purpose of this notebook is to scrape scouting report data from [Baseball Prospectus' database](https://legacy.baseballprospectus.com/prospects/eyewitness.php).

## Setup/Initiation:

First, importing needed packages:

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
from pylab import rcParams
%matplotlib inline
rcParams['figure.figsize'] = 20,10
import numpy as np
import glob
from scipy import stats
from bs4 import BeautifulSoup
import requests
import re
from IPython.display import display, HTML    # make sure Jupyter knows to display it as HTML

In [2]:
home_url = 'https://legacy.baseballprospectus.com/prospects/eyewitness.php'

Setting up beautifulsoup object:

In [3]:
home_response = requests.get(home_url)
page = home_response.text
soup_object = BeautifulSoup(page, 'lxml')
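
The request above assumes the page loads successfully. A slightly more defensive version might look like the sketch below. The `fetch_soup` helper, its `session` parameter, and the 30-second timeout are illustrative choices, not part of the original notebook, and the built-in `html.parser` is used so the example does not depend on `lxml`:

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, session=None, timeout=30, parser="html.parser"):
    """Download a page and return it as a BeautifulSoup object.

    `session` can be any object with a requests-style `.get()` method;
    it defaults to the `requests` module itself.
    """
    getter = session if session is not None else requests
    response = getter.get(url, timeout=timeout)
    # Raise an HTTPError on 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return BeautifulSoup(response.text, parser)
```

With this in place, `fetch_soup(home_url)` would stand in for the three lines in the cell above.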

## Initial Dataframe
First, I need a dataframe with each player's initial information (name, links, etc.) to further scrape their reports.

Now, getting the table headers:

In [33]:
header = soup_object.find('tr', class_='header').find_all('td')

In [36]:
header_list = [col.get_text() for col in header]

In [37]:
header_list

['Player', 'Primary Pos', 'Evaluator', 'Report Date', 'OFP']

Next, getting the table rows from the site, each holding a player, position, evaluator, report date, and OFP:

In [41]:
rows = soup_object.find('tbody').find_all('tr',class_=lambda x: x!= 'header')

In [60]:
player_rows = [row.find_all('td') for row in rows]

In [73]:
final_player_data = []
for player in player_rows:
    player_list = []
    for item in player:
        player_list.append(item.get_text())
    final_player_data.append(player_list)
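
The header and row extraction can be tried out on a small made-up table. The HTML below is a hypothetical stand-in for the real page (only the `header` class and the column names come from the notebook), and `html.parser` is used so the sketch runs without `lxml`:

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the eyewitness table, for illustration only.
sample_html = """
<table>
  <tbody>
    <tr class="header"><td>Player</td><td>Primary Pos</td><td>OFP</td></tr>
    <tr><td><a href="eyewitness_bat.php?reportid=544">CJ Abrams</a></td><td>SS</td><td>50</td></tr>
    <tr><td><a href="eyewitness_bat.php?reportid=0">Example Player</a></td><td>CF</td><td>60</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Column names come from the single header row.
header_cells = soup.find("tr", class_="header").find_all("td")
header_list = [td.get_text(strip=True) for td in header_cells]

# Every other row is data. Rows without a class attribute are passed to the
# lambda as None, which is != 'header', so they are kept.
rows = soup.find("tbody").find_all("tr", class_=lambda c: c != "header")

# One list of cell texts per row: the same idea as the nested loop,
# written as a comprehension and with surrounding whitespace stripped.
table_data = [[td.get_text(strip=True) for td in row.find_all("td")] for row in rows]

print(header_list)
print(table_data)
```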

Now, creating a Pandas dataframe with that info:

In [75]:
initial_df = pd.DataFrame(final_player_data, columns=header_list)

The only thing left to add is the link to each player's report page, which will allow scraping of the individual scouting reports. Based on the website, each link is 'https://legacy.baseballprospectus.com/prospects/' with the `href` from the player's name cell appended. These links will go into a Player_Page_Link column.

In [101]:
base_link = 'https://legacy.baseballprospectus.com/prospects/'

In [102]:
player_links = [base_link + row[0].find('a').get('href') for row in player_rows]
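
String concatenation works as long as `base_link` ends with a slash and the `href` values are relative. As an alternative sketch (not what the notebook uses), `urllib.parse.urljoin` from the standard library resolves relative links and leaves already-absolute ones alone. The `hrefs` list below is a made-up sample in the shape of the report links:

```python
from urllib.parse import urljoin

base_link = 'https://legacy.baseballprospectus.com/prospects/'

# Sample href values for illustration; both point at the CJ Abrams report.
hrefs = [
    'eyewitness_bat.php?reportid=544',
    'https://legacy.baseballprospectus.com/prospects/eyewitness_bat.php?reportid=544',
]

# urljoin resolves relative hrefs against the base and keeps absolute URLs as-is.
player_links = [urljoin(base_link, href) for href in hrefs]
print(player_links)
```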

In [103]:
len(player_links)

1074

In [104]:
initial_df.shape

(1074, 6)

The row counts match (1074 in each), so adding this on:

In [105]:
initial_df['Player_Page_Link'] = player_links
initial_df.head()

| Unnamed: 0 | Player | Primary Pos | Evaluator | Report Date | OFP | Player_Page_Link |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | CJ Abrams | SS | Keanan Lamb | 06/01/2019 | 50 | https://legacy.baseballprospectus.com/prospect... |
| 1 | Albert Abreu | P | Mauricio Rubio Jr. | 08/27/2016 | 60 | https://legacy.baseballprospectus.com/prospect... |
| 2 | Osvaldo Abreu | SS | Tucker Blair | 05/04/2015 | 40 | https://legacy.baseballprospectus.com/prospect... |
| 3 | Albert Abreu | P | John Eshleman | 12/13/2017 | 60 | https://legacy.baseballprospectus.com/prospect... |
| 4 | Albert Abreu | P | Jeffrey Paternostro | 05/28/2019 | 60 | https://legacy.baseballprospectus.com/prospect... |


From here, I can use the links in the Player_Page_Link column to scrape each player's report page.

## Scraping of Reports

Based on the links collected above, there are two basic types of scouting reports: hitters and pitchers. This lets me tailor the scraping functions to the player type.

### Hitters
The hitters have 'eyewitness_bat' in their link.  Based on the scouting reports ([CJ Abrams for example](https://legacy.baseballprospectus.com/prospects/eyewitness_bat.php?reportid=544)), there are a few main fields to capture:
- Personal Info (Born, Bats, Throws, Height, Weight, Primary Position, Secondary Position)
- Physical/Health
- First Table (MLB ETA, Risk Factor)
- Scouting Table - Tools (for Hit, Power, Baserunning/Speed, Glove, Arm):
    - Future Grade
    - Report
- Overall Report

These fields are consistent for all hitters, so the corresponding functions will work to scrape this data.
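
Since the link itself identifies hitter reports, the links can be split by report type before any page is fetched. A minimal sketch, assuming that any link without 'eyewitness_bat' belongs to a pitcher (the notebook has not yet shown what pitcher links look like, so the second sample link is hypothetical):

```python
def report_type(link):
    """Label a report link as 'hitter' or 'pitcher' based on its URL."""
    # 'eyewitness_bat' is the hitter marker noted above; treating every
    # other link as a pitcher report is an assumption.
    return 'hitter' if 'eyewitness_bat' in link else 'pitcher'


sample_links = [
    'https://legacy.baseballprospectus.com/prospects/eyewitness_bat.php?reportid=544',
    # Hypothetical non-hitter link, for illustration only.
    'https://legacy.baseballprospectus.com/prospects/hypothetical_pitcher_page.php?reportid=0',
]

link_types = [report_type(link) for link in sample_links]
print(link_types)
```

On the dataframe, the same function could be applied with `initial_df['Player_Page_Link'].apply(report_type)`.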

I'll start by testing it out with CJ Abrams, as an example:

In [109]:
abrams_link = 'https://legacy.baseballprospectus.com/prospects/eyewitness_bat.php?reportid=544'

In [110]:
home_response = requests.get(abrams_link)
page = home_response.text
abrams_soup = BeautifulSoup(page, 'lxml')

Basic Information:

Name:

In [113]:
abrams_soup.find('table', class_='info').find('p', class_='name').get_text()

'CJ Abrams'

Born (this pulls the text of the whole info cell, so the individual fields still need to be separated):

In [125]:
abrams_soup.find('table', class_='info').find('td').get_text().replace('\n', '')

'CJ AbramsBorn: 10/03/2000 (Age: 18)Bats: LeftThrows: RightHeight: 6\' 2" Weight: 185Primary Position: SSSecondary Position: CF '
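
Because the cell text comes back as one run-together string, the individual fields have to be split out. One possible approach, sketched here with a hypothetical `parse_player_info` helper, is to split on the known field labels with `re.split`. The label list is taken from the personal-info fields listed earlier:

```python
import re

# The flattened text returned by the cell above.
info_text = ('CJ AbramsBorn: 10/03/2000 (Age: 18)Bats: LeftThrows: Right'
             'Height: 6\' 2" Weight: 185Primary Position: SS'
             'Secondary Position: CF ')

FIELD_LABELS = ['Born', 'Bats', 'Throws', 'Height', 'Weight',
                'Primary Position', 'Secondary Position']


def parse_player_info(text, labels=FIELD_LABELS):
    """Split the flattened info string into a {label: value} dict."""
    pattern = '(' + '|'.join(re.escape(label) for label in labels) + '):'
    # With a capturing group, re.split returns
    # [text before first label, label1, value1, label2, value2, ...].
    parts = re.split(pattern, text)
    info = {'Name': parts[0].strip()}
    for label, value in zip(parts[1::2], parts[2::2]):
        info[label] = value.strip()
    return info


player_info = parse_player_info(info_text)
print(player_info)
```

Note that "Age" is deliberately not in the label list, so the age stays attached to the `Born` value.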

Evaluator Info:

In [137]:
#Evaluator, Report Date, Date Seen, Affiliate:
eval_info = [item.get_text().replace('\n', '').replace('\t', '') for item in abrams_soup.find('table', class_='evaluator').find('tbody').find_all('td',class_=lambda x: x!= 'header')]

In [138]:
eval_info

['Keanan Lamb', '06/01/2019', '4/3-6/19', ' (, ) ']
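
The four values line up with the labels in the code comment, so they can be paired into a record. In the sketch below, the `clean_value` helper and the choice to store placeholder-only values such as `' (, ) '` as `None` are assumptions for illustration:

```python
# Values as returned by the cell above.
eval_info = ['Keanan Lamb', '06/01/2019', '4/3-6/19', ' (, ) ']
eval_labels = ['Evaluator', 'Report Date', 'Date Seen', 'Affiliate']


def clean_value(value):
    """Strip whitespace; return None when no letters or digits remain."""
    value = value.strip()
    return value if any(ch.isalnum() for ch in value) else None


# Pair each label with its cleaned value.
eval_record = {label: clean_value(value) for label, value in zip(eval_labels, eval_info)}
print(eval_record)
```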

Physical/Health:

In [141]:
abrams_soup.find('table', class_='mechanics').find_all('td')[-1].get_text()

'A pure athlete with quick twitch muscles, very skinny and needs to add good weight without sacrificing dynamic movement. His premium athleticism is the upside on all his projected future tools.'

MLB ETA, Risk, OFP:

In [149]:
rep_stuff = [item.get_text().strip(' ') for item in abrams_soup.find('table', class_='repertoire').find_all('td')[4:7]]

In [150]:
rep_stuff

['2025', 'Extreme', '50']

Tools and Reports:

In [155]:
#First need to find the rows for hit, power, speed, glove, and arm:
tools_full = abrams_soup.find('table', class_='tool').find_all('tr', class_=lambda x: x!= 'header')

#Then, can get grades. find_all returns a ResultSet (a list of rows), so search within each row:
tools_grades = [cell.get_text().strip(' ') for row in tools_full for cell in row.find_all('td', class_='mid')]

In [158]:
[row.find_all('td') for row in tools_full]
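
`find_all` returns a `ResultSet`, which behaves like a list of tags, so any further `find` or `find_all` call has to be made on each row rather than on the result as a whole. A minimal sketch on a made-up tools table follows. Only the `tool`, `header`, and `mid` class names come from the notebook; the table layout and grades are hypothetical, and `html.parser` is used to avoid the `lxml` dependency:

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the tools table, for illustration only.
sample_html = """
<table class="tool">
  <tr class="header"><td>Tool</td><td>Future Grade</td><td>Report</td></tr>
  <tr><td>Hit</td><td class="mid"> 55 </td><td>Example report text.</td></tr>
  <tr><td>Power</td><td class="mid"> 45 </td><td>Example report text.</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, 'html.parser')
tools_full = soup.find('table', class_='tool').find_all('tr', class_=lambda c: c != 'header')

# tools_full is a ResultSet (a list of <tr> tags), so search inside each row.
tools_grades = [cell.get_text().strip()
                for row in tools_full
                for cell in row.find_all('td', class_='mid')]

# A single-cell lookup per row uses find(), which returns one Tag (or None).
tool_names = [row.find('td').get_text().strip() for row in tools_full]

print(dict(zip(tool_names, tools_grades)))
```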