# Predicting Baseball Salaries, Part II

I previously worked on a simple regression model to predict baseball player salaries where I collected player team, position and salary for the 2019 season. It was a simple exercise in using decision trees and different ensemble methods to predict salaries, but I wanted to go back adn see if I included more stats, specifically the previous season's hitting statistics, could I get closer to predicting salary?

My first step was to load in the libraries necessary and then my previously collected salary data from [USA Today](https://www.usatoday.com/sports/mlb/salaries/).

In [1]:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
import time

In [57]:
%%capture

from tqdm import tqdm_notebook as tqdm
from tqdm import tnrange
tqdm().pandas()

In [2]:
pd.set_option('display.max_columns', None)

In [3]:
salary = pd.read_pickle('salary_info')

In [4]:
salary.head()

Unnamed: 0,name,team,position,salary
0,MaxScherzer,WSH,SP,42142857
1,StephenStrasburg,WSH,SP,36428571
2,MikeTrout,LAA,CF,34083333
3,ZackGreinke,ARI,SP,32421884
4,DavidPrice,BOS,SP,31000000


Now that I have those salaraies, I want to get the 2018 hitting stats. I'm going to collect my data from [MLB's site](https://www.mlb.com/stats/2018) for each player and then use BeautifulSoup to parse the HTML.

In [5]:
url = 'https://www.mlb.com/stats/2018'
html = requests.get(url)
html

<Response [200]>

In [6]:
soup = BeautifulSoup(html.content, 'html.parser')

In [7]:
name_container = soup.find('th', {'data-row' : 0})
name_container

<th class="pinned-col-3lxtuFnc col-group-start-sa9unvY0 number-aY5arzrB first-col-3aGPCzvr is-table-pinned-1WfPW2jT" data-col="0" data-row="0" id="tb-6799-body-row0" scope="row"><div class="custom-cell-wrapper-34Cjf9P0"><div class="index-3cdMSKi7">1</div><div class="value-wrapper-1W5GYs5E"><div class="top-wrapper-1NLTqKbE"><div><a aria-label="Mike Trout" class="bui-link" href="/player/545361"><span class="full-3fV3c9pF">Mike</span><span class="short-3OJ0bTju">M Trout</span><span class="full-3fV3c9pF">Trout</span></a></div><div class="position-28TbwVOg">CF</div></div></div></div><div class="placeholder-wrapper-bEG1UFFP"><div class="index-3cdMSKi7">1</div><div><span class="bui-skeleton"><span class="skeleton-row-2cL12jX9" style="background-color:#eee;background-image:linear-gradient(90deg, #eee, #F5F5F5, #eee);border-radius:50%;width:42px;height:42px">‌</span></span></div><div class="placeholder-content-2l2UMerJ"><div><span class="bui-skeleton"><span class="skeleton-row-2cL12jX9" style="

In [8]:
names = name_container.findAll('span', class_ = 'full-3fV3c9pF')
names

[<span class="full-3fV3c9pF">Mike</span>,
 <span class="full-3fV3c9pF">Trout</span>]

In [9]:
for i in range(len(names)):
    print(names[i].get_text())

Mike
Trout


Get Name

In [10]:
first = names[0].get_text()
last = names[1].get_text()
print(first, last)

Mike Trout


In [11]:
full_name = first + " " + last
full_name

'Mike Trout'

Get Position

In [12]:
name_container.find('div', class_ = 'position-28TbwVOg').get_text()

'CF'

Get Team

In [13]:
soup.find('td', {'data-col': '1', 'data-row': '0'}).get_text()

'LAA'

Get Games

In [14]:
soup.find('td', {'data-col': '2', 'data-row': '0'}).get_text()

'140'

Get At Bats

In [15]:
soup.find('td', {'data-col': '3', 'data-row': '0'}).get_text()

'471'

Get Runs

In [16]:
soup.find('td', {'data-col': '4', 'data-row': '0'}).get_text()

'101'

Get Hits

In [17]:
soup.find('td', {'data-col': '5', 'data-row': '0'}).get_text()

'147'

Get Doubles

In [18]:
soup.find('td', {'data-col': '6', 'data-row': '0'}).get_text()

'24'

Get Triples

In [19]:
soup.find('td', {'data-col': '7', 'data-row': '0'}).get_text()

'4'

Get Homeruns

In [20]:
soup.find('td', {'data-col': '8', 'data-row': '0'}).get_text()

'39'

Get RBIs

In [21]:
soup.find('td', {'data-col': '9', 'data-row': '0'}).get_text()

'79'

Get Walks

In [22]:
soup.find('td', {'data-col': '10', 'data-row': '0'}).get_text()

'122'

Get Strikeouts

In [23]:
soup.find('td', {'data-col': '11', 'data-row': '0'}).get_text()

'124'

Get Stolen Bases

In [24]:
soup.find('td', {'data-col': '12', 'data-row': '0'}).get_text()

'24'

Get Caught Stealing

In [25]:
soup.find('td', {'data-col': '13', 'data-row': '0'}).get_text()

'2'

Get Batting Average

In [26]:
soup.find('td', {'data-col': '14', 'data-row': '0'}).get_text()

'.312'

Get On Base Percentage

In [27]:
soup.find('td', {'data-col': '15', 'data-row': '0'}).get_text()

'.460'

Get Slugging Percentage

In [28]:
soup.find('td', {'data-col': '16', 'data-row': '0'}).get_text()

'.628'

Get On Base Plus Slugging Percentage

In [29]:
soup.find('td', {'data-col': '17', 'data-row': '0'}).get_text()

'1.088'

## Scraping MLB.com for 2018 Stats

In [62]:
df = pd.DataFrame(columns = ['FirstName', 'LastName', 'FullName', 'Position', 'Team', 'Games', 'At_Bats', 'Runs', 
                             'Hits', 'Doubles', 'Triples', 'Homeruns', 'RBIs', 'Walks', 'Strikeouts', 'StolenBases', 
                            'CaughtStealing', 'BattingAverage', 'OnBasePercentage', 'SluggingPercentage', 
                            'OnBaseSluggingPercent'])
df

Unnamed: 0,FirstName,LastName,FullName,Position,Team,Games,At_Bats,Runs,Hits,Doubles,Triples,Homeruns,RBIs,Walks,Strikeouts,StolenBases,CaughtStealing,BattingAverage,OnBasePercentage,SluggingPercentage,OnBaseSluggingPercent


### Functions

In [34]:
def get_first_name(soup, data_row):
    container = soup.find('th', {'data-row' : data_row})
    names = container.findAll('span', class_ = 'full-3fV3c9pF')
    first_name = names[0].get_text()
    last_name = names[1].get_text()
    full_name = first_name + ' ' + last_name
    return first_name, last_name, full_name

In [35]:
def get_position(soup, data_row):
    container = soup.find('th', {'data-row' : data_row})
    position = container.find('div', class_ = 'position-28TbwVOg').get_text()
    return position

In [36]:
def get_team(soup, data_row):
    team = soup.find('td', {'data-col': '1', 'data-row': data_row}).get_text()
    return team

In [37]:
def get_games(soup, data_row):
    games = soup.find('td', {'data-col': '2', 'data-row': data_row}).get_text()
    return games

In [38]:
def get_atbats(soup, data_row):
    at_bats = soup.find('td', {'data-col': '3', 'data-row': data_row}).get_text()
    return at_bats

In [39]:
def get_runs(soup, data_row):
    runs = soup.find('td', {'data-col': '4', 'data-row': data_row}).get_text()
    return runs

In [40]:
def get_hits(soup, data_row):
    hits = soup.find('td', {'data-col': '5', 'data-row': data_row}).get_text()
    return hits

In [41]:
def get_doubles(soup, data_row):
    doubles = soup.find('td', {'data-col': '6', 'data-row': data_row}).get_text()
    return doubles

In [42]:
def get_triples(soup, data_row):
    triples = soup.find('td', {'data-col': '7', 'data-row': data_row}).get_text()
    return triples

In [43]:
def get_homeruns(soup, data_row):
    homeruns = soup.find('td', {'data-col': '8', 'data-row': data_row}).get_text()
    return homeruns

In [44]:
def get_rbis(soup, data_row):
    rbis = soup.find('td', {'data-col': '9', 'data-row': data_row}).get_text()
    return rbis

In [45]:
def get_walks(soup, data_row):
    walks = soup.find('td', {'data-col': '10', 'data-row': data_row}).get_text()
    return walks

In [46]:
def get_strikeouts(soup, data_row):
    strikeouts = soup.find('td', {'data-col': '11', 'data-row': data_row}).get_text()
    return strikeouts

In [47]:
def get_stolenbases(soup, data_row):
    stolenbases = soup.find('td', {'data-col': '12', 'data-row': data_row}).get_text()
    return stolenbases

In [48]:
def get_caughtstealing(soup, data_row):
    caughtstealing = soup.find('td', {'data-col': '13', 'data-row': data_row}).get_text()
    return caughtstealing

In [49]:
def get_battingavg(soup, data_row):
    battingavg = soup.find('td', {'data-col': '14', 'data-row': data_row}).get_text()
    return battingavg

In [50]:
def get_onbasepercent(soup, data_row):
    onbasepercent = soup.find('td', {'data-col': '15', 'data-row': data_row}).get_text()
    return onbasepercent

In [51]:
def get_sluggingpercent(soup, data_row):
    sluggingpercent = soup.find('td', {'data-col': '16', 'data-row': data_row}).get_text()
    return sluggingpercent

In [52]:
def get_onbaseslugging(soup, data_row):
    onbaseslugging = soup.find('td', {'data-col': '17', 'data-row': data_row}).get_text()
    return onbaseslugging

In [61]:
players_on_page = range(25)

def get_contents(webpage):
    html = requests.get(webpage)
    soup = BeautifulSoup(html.content, 'html.parser')
    
    for i in players_on_page:
        try:
            first_name, last_name, full_name = get_first_name(soup, i)
            position = get_position(soup, i)
            team = get_team(soup, i)
            games = get_games(soup, i)
            at_bats = get_atbats(soup, i)
            runs = get_runs(soup, i)
            hits = get_hits(soup, i)
            doubles = get_doubles(soup, i)
            triples = get_triples(soup, i)
            homeruns = get_homeruns(soup, i)
            rbis = get_rbis(soup, i)
            walks = get_walks(soup, i)
            strikeouts = get_strikeouts(soup, i)
            stolenbases = get_stolenbases(soup, i)
            caughtstealing = get_caughtstealing(soup, i)
            battingavg = get_battingavg(soup, i)
            onbasepercent = get_onbasepercent(soup, i)
            slugging = get_sluggingpercent(soup, i)
            onbaseslug = get_onbaseslugging(soup, i)
            
            time.sleep(.5)
        
            global df
            df = df.append({'FirstName': first_name, 
                           'LastName': last_name,
                           'FullName': full_name, 
                           'Position': position, 
                           'Team': team,
                           'Games': games,
                           'At_Bats': at_bats,
                           'Runs': runs,
                           'Hits': hits,
                           'Doubles': doubles,
                           'Triples': triples,
                           'Homeruns': homeruns,
                           'RBIs': rbis,
                           'Walks': walks,
                           'Strikeouts': strikeouts,
                           'StolenBases': stolenbases,
                           'CaughStealing': caughtstealing,
                           'BattingAverage': battingavg,
                           'OnBasePercentage': onbasepercent,
                           'SluggingPercentage': slugging,
                           'OnBaseSluggingPercent': onbaseslug}, ignore_index=True)
        
        
        except AttributeError:
            continue

In [63]:
first_url = 'https://www.mlb.com/stats/2018?playerPool=ALL'

In [64]:
get_contents(first_url)

In [65]:
print(df.shape)
df.head()

(25, 22)


Unnamed: 0,FirstName,LastName,FullName,Position,Team,Games,At_Bats,Runs,Hits,Doubles,Triples,Homeruns,RBIs,Walks,Strikeouts,StolenBases,CaughtStealing,BattingAverage,OnBasePercentage,SluggingPercentage,OnBaseSluggingPercent,CaughStealing
0,Enny,Romero,Enny Romero,P,KC,4,1,1,1,1,0,0,0,0,0,0,,1.0,1.0,2.0,3.0,0
1,Kolby,Allard,Kolby Allard,P,ATL,3,1,1,1,0,0,0,0,0,0,0,,1.0,1.0,1.0,2.0,0
2,Kyle,Gibson,Kyle Gibson,P,MIN,1,2,2,2,0,0,0,0,0,0,0,,1.0,1.0,1.0,2.0,0
3,Derek,Law,Derek Law,P,SF,7,1,1,1,0,0,0,0,0,0,0,,1.0,1.0,1.0,2.0,0
4,Vidal,Nuno,Vidal Nuno,P,TB,1,2,0,2,0,0,0,1,0,0,0,,1.0,1.0,1.0,2.0,0


In [66]:
url_list = []

for i in range(2, 52):
    url_list.append('https://www.mlb.com/stats/2018?page=' + str(i) + '&playerPool=ALL')

url_list

['https://www.mlb.com/stats/2018?page=2&playerPool=ALL',
 'https://www.mlb.com/stats/2018?page=3&playerPool=ALL',
 'https://www.mlb.com/stats/2018?page=4&playerPool=ALL',
 'https://www.mlb.com/stats/2018?page=5&playerPool=ALL',
 'https://www.mlb.com/stats/2018?page=6&playerPool=ALL',
 'https://www.mlb.com/stats/2018?page=7&playerPool=ALL',
 'https://www.mlb.com/stats/2018?page=8&playerPool=ALL',
 'https://www.mlb.com/stats/2018?page=9&playerPool=ALL',
 'https://www.mlb.com/stats/2018?page=10&playerPool=ALL',
 'https://www.mlb.com/stats/2018?page=11&playerPool=ALL',
 'https://www.mlb.com/stats/2018?page=12&playerPool=ALL',
 'https://www.mlb.com/stats/2018?page=13&playerPool=ALL',
 'https://www.mlb.com/stats/2018?page=14&playerPool=ALL',
 'https://www.mlb.com/stats/2018?page=15&playerPool=ALL',
 'https://www.mlb.com/stats/2018?page=16&playerPool=ALL',
 'https://www.mlb.com/stats/2018?page=17&playerPool=ALL',
 'https://www.mlb.com/stats/2018?page=18&playerPool=ALL',
 'https://www.mlb.com/

In [67]:
for i in tqdm(url_list):
    get_contents(i)

Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`
  """Entry point for launching an IPython kernel.


HBox(children=(FloatProgress(value=0.0, max=50.0), HTML(value='')))




In [58]:
df.shape

(1270, 22)

In [197]:
salary.shape

(877, 4)

In [198]:
df.FullName.value_counts()

Anthony Rendon      51
Khris Davis         51
David Peralta       51
Francisco Lindor    51
Christian Yelich    51
Matt Chapman        51
Javier Baez         51
Charlie Blackmon    51
Manny Machado       51
Mookie Betts        51
Trevor Story        51
Jesus Aguilar       51
Paul Goldschmidt    51
Bryce Harper        51
Mitch Haniger       51
Matt Carpenter      51
Freddie Freeman     51
J.D. Martinez       51
Mike Trout          51
Eugenio Suarez      51
Brandon Nimmo       51
Alex Bregman        51
Nolan Arenado       51
Xander Bogaerts     51
Jose Ramirez        51
Name: FullName, dtype: int64

In [72]:
df.head()

Unnamed: 0,FirstName,LastName,FullName,Position,Team,Games,At_Bats,Runs,Hits,Doubles,...,RBIs,Walks,Strikeouts,StolenBases,CaughtStealing,BattingAverage,OnBasePercentage,SluggingPercentage,OnBaseSluggingPercent,CaughStealing
0,Enny,Romero,Enny Romero,P,KC,4,1,1,1,1,...,0,0,0,0,,1.0,1.0,2.0,3.0,0
1,Kolby,Allard,Kolby Allard,P,ATL,3,1,1,1,0,...,0,0,0,0,,1.0,1.0,1.0,2.0,0
2,Kyle,Gibson,Kyle Gibson,P,MIN,1,2,2,2,0,...,0,0,0,0,,1.0,1.0,1.0,2.0,0
3,Derek,Law,Derek Law,P,SF,7,1,1,1,0,...,0,0,0,0,,1.0,1.0,1.0,2.0,0
4,Vidal,Nuno,Vidal Nuno,P,TB,1,2,0,2,0,...,1,0,0,0,,1.0,1.0,1.0,2.0,0


In [74]:
df.shape

(1270, 22)

In [75]:
df.FullName.value_counts()

Jose Ramirez         2
Javy Guerra          2
Kevin Shackelford    1
Alcides Escobar      1
Hector Rondon        1
                    ..
Addison Reed         1
Jake Barrett         1
Jonny Venters        1
Brock Stewart        1
Bryse Wilson         1
Name: FullName, Length: 1268, dtype: int64

In [76]:
df.to_pickle('hitting_data_v1')