# Introduction

#### At the start of this project, we set out to discover how offensive MLB statistics contribute to wins.
#### We sourced the statistics from baseballreference.com and used data from the 2017 MLB season.
#### After computing a correlation matrix, we sorted through the data and decided which statistics to keep and which to remove from our analysis.
#### Through this process, we discovered some interesting relationships and removed statistics that were either insignificant or simply not applicable (such as RBIs).
#### Since using RBIs would almost be cheating, we removed that statistic, along with other unimportant stats such as sacrifice flies, sacrifice hits, stolen bases, and caught stealing.
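
The screening step described above can be sketched roughly as follows. This is a minimal illustration rather than the original analysis: it uses only a ten-team excerpt of the 2017 data and a handful of columns, and the names `sample`, `corr_with_runs`, and `ranking` are introduced here purely for the example.

```python
import pandas as pd

# Ten-team excerpt of the 2017 data (values copied from the table shown later).
sample = pd.DataFrame({
    "R":   [812, 732, 743, 785, 822, 706, 753, 818, 824, 735],
    "RBI": [776, 706, 713, 735, 785, 670, 715, 780, 793, 699],
    "HR":  [220, 165, 232, 168, 223, 186, 219, 212, 192, 187],
    "SB":  [103, 77, 32, 106, 62, 71, 120, 88, 59, 65],
    "SH":  [39, 59, 10, 9, 48, 35, 50, 23, 62, 11],
})

# Pearson correlation matrix, then keep only each column's correlation with runs.
corr_with_runs = sample.corr()["R"].drop("R")

# Rank by absolute strength so strong negative relationships are not overlooked.
ranking = corr_with_runs.abs().sort_values(ascending=False)
print(ranking)
```

With only ten teams the individual coefficients are noisy, so the ranking should be read as an illustration of the procedure, not as a result.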


# From Wins to Runs

##### While continuing to search for the best way to approach this problem, we discovered a major flaw in our design: we were trying to predict Wins using only offensive statistics, which ignore pitching and defense. This led us to change our approach and instead predict the total Runs a team would produce. We used machine learning, training on our most important offensive statistics, to successfully predict total Runs.

In [27]:
%matplotlib notebook
import csv

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sklearn.metrics
from sklearn import linear_model
from sklearn import neighbors
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split

#load the dataset and drop rows with missing values
AH_data=pd.read_csv("2017.csv")
data_clean=AH_data.dropna()

In [3]:
#inspecting the column types
print(data_clean.dtypes)

#using describe function to get a good overview of the data
data_clean.describe()

Tm       object
R         int64
H         int64
2B        int64
3B        int64
HR        int64
RBI       int64
SB        int64
CS        int64
BB        int64
SO        int64
BA      float64
OBP     float64
SLG     float64
TB        int64
GDP       int64
HBP       int64
SH        int64
SF        int64
LOB       int64
BB%      object
SO%      object
WINS      int64
dtype: object


Unnamed: 0,R,H,2B,3B,HR,RBI,SB,CS,BB,SO,BA,OBP,SLG,TB,GDP,HBP,SH,SF,LOB,WINS
count,30.0,30.0,30.0,30.0,30.0,30.0,30.0,30.0,30.0,30.0,30.0,30.0,30.0,30.0,30.0,30.0,30.0,30.0,30.0,30.0
mean,752.733333,1407.166667,279.9,26.5,203.5,718.6,84.233333,31.133333,527.633333,1336.8,0.2549,0.324333,0.4259,2350.566667,126.8,58.766667,30.833333,38.933333,1098.066667,81.0
std,65.195691,70.040916,28.037352,8.661249,27.680131,63.07937,23.869882,7.006073,65.530453,122.869822,0.010159,0.011034,0.02065,125.362284,14.278993,14.194908,15.874689,8.300118,43.849219,11.531067
min,604.0,1251.0,226.0,5.0,128.0,576.0,32.0,13.0,390.0,1087.0,0.234,0.299,0.38,2105.0,99.0,31.0,9.0,26.0,1005.0,64.0
25%,707.0,1351.0,261.5,20.25,187.5,672.75,65.5,28.0,488.75,1234.75,0.249,0.3155,0.41325,2303.5,116.5,50.0,17.0,33.5,1076.0,72.75
50%,746.5,1403.5,282.5,28.0,209.0,713.5,88.0,31.0,535.5,1335.5,0.2555,0.3245,0.4295,2353.0,126.5,54.5,29.0,37.5,1100.0,79.0
75%,808.75,1458.0,292.25,31.75,223.75,771.0,97.25,34.0,569.5,1411.0,0.26,0.3335,0.43675,2411.5,137.75,69.25,42.75,43.5,1129.75,90.0
max,896.0,1581.0,346.0,39.0,241.0,854.0,136.0,44.0,649.0,1571.0,0.282,0.346,0.478,2681.0,160.0,88.0,62.0,61.0,1184.0,104.0
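
The `dtypes` output above lists `BB%` and `SO%` as `object`, which is why they are missing from the `describe()` summary: they are stored as strings such as `9.30%`. If those columns were needed numerically, one possible conversion is sketched below. The small frame `df` is a stand-in built for this example, not the full dataset.

```python
import pandas as pd

# Three-row stand-in with the same string format as the BB% and SO% columns.
df = pd.DataFrame({
    "Tm": ["ARI", "ATL", "BAL"],
    "BB%": ["9.30%", "7.60%", "6.40%"],
    "SO%": ["23.40%", "19.10%", "23.00%"],
})

for col in ["BB%", "SO%"]:
    # Strip the trailing percent sign and convert to a fraction between 0 and 1.
    df[col] = df[col].str.rstrip("%").astype(float) / 100

print(df.dtypes)
```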


In [4]:
data_clean

#printing the data

Unnamed: 0,Tm,R,H,2B,3B,HR,RBI,SB,CS,BB,...,SLG,TB,GDP,HBP,SH,SF,LOB,BB%,SO%,WINS
0,ARI,812,1405,314,39,220,776,103,30,578,...,0.445,2457,106,54,39,27,1118,9.30%,23.40%,93
1,ATL,732,1467,289,26,165,706,77,31,474,...,0.412,2303,137,66,59,32,1127,7.60%,19.10%,72
2,BAL,743,1469,269,12,232,713,32,13,392,...,0.435,2458,138,50,10,37,1041,6.40%,23.00%,75
3,BOS,785,1461,302,19,168,735,106,31,571,...,0.407,2305,141,53,9,36,1134,9.00%,19.30%,93
4,CHC,822,1402,274,29,223,785,62,31,622,...,0.437,2403,134,82,48,32,1147,9.90%,22.30%,92
5,CHW,706,1412,256,37,186,670,71,31,401,...,0.417,2300,124,76,35,33,1055,6.60%,23.10%,67
6,CIN,753,1390,249,38,219,715,120,39,565,...,0.433,2372,116,72,50,42,1135,9.10%,21.40%,68
7,CLE,818,1449,333,29,212,780,88,23,604,...,0.449,2476,125,50,23,45,1158,9.70%,18.50%,102
8,COL,824,1510,293,38,192,793,59,34,519,...,0.444,2455,143,44,62,41,1088,8.40%,22.70%,87
9,DET,735,1435,289,35,187,699,65,34,503,...,0.424,2355,128,52,11,27,1104,8.20%,21.40%,64


# Trying to predict wins from runs

In [5]:
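#target: team wins; single feature: runs scored
target_Wins = data_clean.WINS
features_Runs = ['R']
X_Runs = data_clean[features_Runs]
y_Wins = target_Wins
#fit a simple linear regression of wins on runs
lm = linear_model.LinearRegression()
model = lm.fit(X_Runs,y_Wins)
y_Wins_predict = lm.predict(X_Runs)
#R-squared on the same data the model was fit to
lm.score(X_Runs, y_Wins)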

0.52256415241691878

We created a simple linear regression model to see how well team runs predict team wins. We chose a linear regression model because Runs and Wins have a fairly strong positive linear correlation. The R-squared value for our model is approximately .523, meaning that variation in runs scored explains about fifty-two percent of the variation in team wins.

In [6]:
data_clean.plot.scatter('R', 'WINS')
plt.plot(X_Runs, y_Wins_predict)

#plotting runs versus wins
#with linear regression model

[<matplotlib.lines.Line2D at 0x109066908>]
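
For a regression with a single feature, R-squared is simply the square of the Pearson correlation between the feature and the target, so the .523 reported above corresponds to a correlation of roughly .72. The sketch below checks that identity on the ten teams shown in the data preview. Because it uses ten teams instead of all thirty, its numbers will differ from the full-season result.

```python
import numpy as np
from sklearn import linear_model

# Runs and wins for the ten teams shown in the data preview above.
runs = np.array([812, 732, 743, 785, 822, 706, 753, 818, 824, 735])
wins = np.array([93, 72, 75, 93, 92, 67, 68, 102, 87, 64])

X = runs.reshape(-1, 1)  # scikit-learn expects a 2-D feature array
lm = linear_model.LinearRegression().fit(X, wins)

r_squared = lm.score(X, wins)       # R^2 of the fitted line
r = np.corrcoef(runs, wins)[0, 1]   # Pearson correlation coefficient

# The two printed values agree up to floating-point error.
print(r_squared, r ** 2)
```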

# Initial thoughts: Statistics we thought would be undervalued but weren't

### Sacrifice flies and sacrifice bunts:
We thought these would be important because teams playing small ball might gain a strategic advantage over others, but it turns out that neither statistic is strongly correlated with runs scored.

In [7]:
data_clean.plot.scatter('SH', 'R')

#plotting sacrifice hits versus runs

<matplotlib.axes._subplots.AxesSubplot at 0x10b01f400>

In [None]:
data_clean.plot.scatter('SF', 'R')

#plotting sacrifice flies versus runs

### Baserunning: Stolen Bases and Caught Stealing
We thought that baserunning would be undervalued because it isn't talked about as much and because stealing a base can put a team in a much better position to score a run.

In [8]:
data_clean.plot.scatter('SB', 'R')

<matplotlib.axes._subplots.AxesSubplot at 0x10b0894a8>

In [None]:
data_clean.plot.scatter('CS', 'R')

# Obvious correlation that can't be used: RBIs
RBIs might be important for individual players, but at the team level you can't use them because almost every run comes with an RBI. A run scores without an RBI only in a few situations, such as when it scores on an error, on a wild pitch or passed ball, or while the batter grounds into a double play.

In [9]:
data_clean.plot.scatter('RBI', 'R')

<matplotlib.axes._subplots.AxesSubplot at 0x10b0c4860>
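
To put a number on "almost every run", the per-team means from the `describe()` output above can be compared directly. This is a back-of-the-envelope check; the variable names are introduced here for illustration.

```python
# Per-team season means, taken from the describe() output above.
mean_runs = 752.733333
mean_rbi = 718.6

# Fraction of runs that were credited with an RBI.
rbi_share = mean_rbi / mean_runs
print(f"{rbi_share:.1%} of runs came with an RBI")
```

Since roughly 95 percent of runs carry an RBI, the statistic is close to a restatement of the target itself and tells us little about how the runs were produced.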

# Traditional Statistics

### Batting average: Percentage of time a player gets a hit as opposed to getting out
We know that batting average is important, but as the initial wave of Sabermetrics tells us, hits aren't everything: there are other ways to get on base, and not all hits are equal. We still expect to see a distinct correlation.

In [10]:
data_clean.plot.scatter('BA', 'R')

<matplotlib.axes._subplots.AxesSubplot at 0x1a14b14128>

### Strikeouts: Outs where the ball doesn't get put in play
If a player strikes out, they are generally unable to move over a runner, whereas if a player grounds out or flies out, they may be able to be productive and advance a runner.

In [11]:
data_clean.plot.scatter('SO', 'R')

<matplotlib.axes._subplots.AxesSubplot at 0x1a15531cc0>

### Home Runs: Instant runs
Homers are important because they give the team runs instantly, and the defense can't do anything about it once the ball has been hit. Because they're physically impressive, they are an often-cited statistic, but they may not be as correlated with scoring as initially thought.

In [12]:
data_clean.plot.scatter('HR', 'R')

<matplotlib.axes._subplots.AxesSubplot at 0x1a15f48d68>

### Triples: Difficult and productive
Triples are hard hits to get. If a ball is hit hard, it often becomes a home run. If it stays in the park, it's usually no more than a double, because parks are small enough that the defense can usually get to the ball in time to keep the batter from advancing past second base.

In [13]:
data_clean.plot.scatter('3B', 'R')

<matplotlib.axes._subplots.AxesSubplot at 0x1a1695e278>

# Important Relevant Statistics

### On Base Percentage: Hitting and plate discipline

On-base percentage is now more commonly used than batting average because getting a hit is not the only way to reach base. A batter can also walk or be hit by a pitch, and some players are very good at drawing walks by not swinging at pitches outside of the zone. Although a walk moves runners a maximum of one base, it still puts a runner on base, which gets a team that much closer to a run.

In [14]:
data_clean.plot.scatter('OBP', 'R')

<matplotlib.axes._subplots.AxesSubplot at 0x1a17372b00>
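
For reference, the standard on-base percentage formula counts hits, walks, and hit-by-pitches as times on base. The sketch below applies it to a made-up season line. The function name and the input values are hypothetical, and at-bats (AB) is not a column in our dataset, so this cannot be reproduced from the table above.

```python
def on_base_percentage(h, bb, hbp, ab, sf):
    """Standard OBP: times on base divided by the plate appearances that count."""
    return (h + bb + hbp) / (ab + bb + hbp + sf)

# Hypothetical season line, chosen only to illustrate the arithmetic.
obp = on_base_percentage(h=150, bb=50, hbp=5, ab=500, sf=5)
print(round(obp, 3))
```

The same batter's batting average would be 150 / 500 = .300, which shows how walks and hit-by-pitches lift OBP above batting average.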

### Slugging: How powerful a player's hits are

Batting average treats all hits as the same. However, a single is much less helpful than a triple. Obviously, the further a player makes it around the bases, the better the hit was, and the more likely he is to have driven in players already on base. In addition, being further around the basepaths makes it easier for upcoming batters to drive him in.

In [15]:
data_clean.plot.scatter('SLG', 'R')

<matplotlib.axes._subplots.AxesSubplot at 0x1a17d76f98>
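
Slugging percentage is total bases divided by at-bats, where each hit is weighted by the number of bases it is worth. At-bats are not in our dataset, but the total-bases part can be checked against the `TB` column using the ARI and ATL rows from the data preview. The helper name `total_bases` is introduced here for illustration.

```python
def total_bases(h, doubles, triples, hr):
    """Weight each hit by the number of bases it is worth."""
    singles = h - doubles - triples - hr
    return singles + 2 * doubles + 3 * triples + 4 * hr

# ARI and ATL rows from the data preview: H, 2B, 3B, HR.
ari_tb = total_bases(1405, 314, 39, 220)
atl_tb = total_bases(1467, 289, 26, 165)
print(ari_tb, atl_tb)  # 2457 2303, matching the TB column
```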

# Machine Learning
### Nearest Neighbor

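We initially tried a nearest-neighbor model to account for the different approaches teams take. The model had an accuracy score of .167, measured on the same data it was fit to. It did not perform well because we do not have enough data points (only 30 teams) for an accurate model. In addition, a classifier treats each distinct run total as its own class, so it gets credit only when it predicts a team's run total exactly.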

In [34]:
target = data_clean.R
features = data_clean[['BA', 'HR', 'SO', 'BB', 'OBP', 'SLG']]
X = features.values
y = target.values
classifier = neighbors.KNeighborsClassifier()
classifier.fit(X, y)
classifier.score(X, y)

0.16666666666666666
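
One way to see why a classifier struggles with a numeric target is to compare it with a nearest-neighbor regressor on the same data. The sketch below is not part of our analysis: it uses a synthetic 30-sample stand-in in which every target value is unique, and it swaps in `KNeighborsRegressor`, which averages the neighbors' targets and reports R-squared instead of accuracy.

```python
import numpy as np
from sklearn import neighbors

# Synthetic stand-in for the team data: 30 samples, each with a unique target.
X = np.arange(30, dtype=float).reshape(-1, 1)
y = 600 + 10 * np.arange(30)

# A classifier treats each distinct run total as a separate class label.
clf = neighbors.KNeighborsClassifier().fit(X, y)
clf_accuracy = clf.score(X, y)   # fraction of exact label matches

# A regressor averages the targets of the nearest neighbors instead.
reg = neighbors.KNeighborsRegressor().fit(X, y)
reg_r2 = reg.score(X, y)         # R^2, not accuracy

print(clf_accuracy, reg_r2)
```

The two scores are on different scales, so they are not directly comparable, but the contrast suggests that the low accuracy above may partly reflect the choice of a classifier and not only the small sample.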

We used a linear regression model to predict the number of runs scored based on batting average, home runs, strikeouts, walks, on-base percentage, and slugging percentage. We used cross validation to test our model. Across the five folds, the model had a mean R-squared score of approximately .80 and a standard deviation of approximately .13. The model is fairly accurate, but its scores vary widely from fold to fold. We could reduce this variation by including more years, but other factors such as rule changes, changing team approaches, and physical changes in equipment mean the data differ from year to year.

In [35]:
#linear regression is a regressor, so each fold is scored with R-squared
regressor = linear_model.LinearRegression()
scores = cross_val_score(regressor, X, y, cv = 5)
print(scores)
print('mean', np.mean(scores))
print('standard deviation', np.std(scores))

[ 0.6995515   0.91531813  0.68013133  0.98084108  0.70349659]
mean 0.79586772562
standard deviation 0.126243967788
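
To make the cross-validation step concrete, the sketch below writes out what `cross_val_score` does for a regressor with `cv = 5`: split the samples into five consecutive folds, fit on four, and score R-squared on the fifth. It runs on synthetic data generated for this example (the coefficients and noise level are arbitrary assumptions), not on the 2017 statistics.

```python
import numpy as np
from sklearn import linear_model
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the team data: 30 samples, 3 features, noisy linear target.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
y = X @ np.array([40.0, 25.0, -10.0]) + 750 + rng.normal(scale=15, size=30)

# Manual 5-fold loop: fit on four folds, score R^2 on the held-out fold.
manual_scores = []
for train_idx, test_idx in KFold(n_splits=5).split(X):
    model = linear_model.LinearRegression().fit(X[train_idx], y[train_idx])
    manual_scores.append(model.score(X[test_idx], y[test_idx]))

# The helper gives the same numbers for a regressor with cv=5.
auto_scores = cross_val_score(linear_model.LinearRegression(), X, y, cv=5)
print(np.allclose(manual_scores, auto_scores))
```

With 30 teams, each held-out fold contains only six teams, which helps explain why the fold scores above range so widely.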
