<center> <h2> DS 3000 - Fall 2021</h2> </center>
<center> <h3> DS Report </h3> </center>


<center> <h3>Beat the Streak: Predicting Baseball Hits</h3> </center>
<center><h4>Melis Akinci, Kevin Schmitt, Rachel Rouff</h4></center>


<hr style="height:2px; border:none; color:black; background-color:black;">

#### Executive Summary:

In this project, we attempted to predict whether an MLB player would get a hit in a given game based on their historical statistics against the pitcher they were facing. Our goal was to correctly pick a player who gets a hit on 57 straight days, which would win the MLB's Beat the Streak competition. Our data set included more than 11,000 rows, with columns representing batters, pitchers, historical statistics between the two, and whether the batter got a hit in that specific matchup. We classified each matchup into one of two categories, hit or no hit, and did initial testing with k-Nearest Neighbor, Support Vector Machine, Gaussian Naive Bayes, and Decision Tree classifiers. Our final two models (Gaussian Naive Bayes and a Support Vector Machine) used hyperparameters found with a grid search and three features found through recursive feature elimination, and each reached 68.8% accuracy on the test set.

<hr style="height:2px; border:none; color:black; background-color:black;">

## Outline
1. <a href='#1'>INTRODUCTION</a>
2. <a href='#2'>METHOD</a>
3. <a href='#3'>RESULTS</a>
4. <a href='#4'>DISCUSSION</a>

<a id="1"></a>
<hr style="height:2px; border:none; color:black; background-color:black;">

## 1. INTRODUCTION

<h4>Problem Statement</h4>

Each baseball season, the MLB hosts an online game called Beat The Streak in which participants select a player each day who they think will get a hit in that day's MLB game. The participant attempts “to establish a virtual ‘hitting streak’ and keep it going for as long as possible” (Beat the Streak Official Rules). If the chosen player gets a hit, the participant's streak continues. If the player does not get a hit, the streak resets to zero. The goal is to reach a streak of 57 days in order to beat Joe DiMaggio’s 56-game hitting streak. We want to create a program that uses data on MLB players and games to output a list of the players who are most likely to get a hit on a given day.

<h4>Significance of the Problem</h4>

The winner of this competition wins $5.6 million, and no one has ever successfully completed the 57-day objective, so we want to use predictions based on the provided MLB data in order to try to beat the game. The insights from this problem could end up being a helpful and applicable tool for team managers in the MLB who are trying to choose which players to put in the lineup based on their probability of getting a hit in that game. Some previous work on this topic can be found on the following websites.
* https://math.stackexchange.com/questions/513627/baseball-batting-average-and-probability
* http://www.baseballmusings.com/

<h4>Questions</h4>

* Is it possible to effectively predict whether a player is going to get a hit in their game?
* Are there any consistent patterns or indicators of whether a player is going to get a hit?
* Which machine learning algorithm will be the most effective when it comes to predicting hits?
* What factors are most closely related to whether a player gets a hit or not?

<a id="2"></a>
<hr style="height:2px; border:none; color:black; background-color:black;">

## 2. METHOD

### 2.1. Data Acquisition

Our data came from the following websites. The Kaggle link is a dataset on the outcomes of batter versus pitcher match-ups. The Rotowire dataset is made up of data on match-up statistics for batters versus pitchers.
* https://www.rotowire.com/baseball/stats-bvp.php
* https://www.kaggle.com/josephvm/mlb-game-data

Our dataset includes several statistical values for the performance of many Major League Baseball players in individual games during the 2019 and 2021 MLB seasons. While many people may believe that a player's batting average is the only indicator needed to determine whether they will get a hit, several other statistics are important in determining whether a given player will get a hit, especially their statistics against the opposing pitcher. Our dataset has 16 variable columns and over 11,000 sample rows. Our features include whether the game is a home or away game (Home?), the date the game took place (Date), the hitter's career at bats against the pitcher they are facing (ABVP), and the hitter's career hits against that pitcher (HVP). Other feature variables that we believe may have an impact on hit likelihood, all measured against the opposing pitcher, include the number of extra-base hits (XBH), the number of home runs (HR), the number of runs batted in (RBI), the number of walks (BB), the batting average (AVG), and the on base percentage (OBP). From the hitter's actual result in each game, we derive a binary target variable indicating whether the player got a hit, which is used to train the model.

### 2.2. Data Analysis

Our predictive model is a binary classification algorithm that uses the feature variables ("Home?", "ABVP", "HVP", "XBH", "HR", "RBI", "BB", and "AVG") from previous batter versus pitcher match-ups in order to predict the outcome variable of whether the batter will get a hit in their upcoming match-up against the given pitcher. We consider these feature variables important predictors because each one relates to whether a player gets a hit in a given game. Playing at home is commonly thought to influence a player's ability to get a hit and their overall performance in a game. The pitcher, along with their past success against the batter, has a strong influence on the hitter's ability to get a hit, since pitchers vary in style and skill level. The number of home runs, the number of extra-base hits, and the batting average are some of the best representations of the general strength of the player and their ability to get a hit. Furthermore, the number of runs batted in, the number of walks, and the on base percentage give us more detail on how effective each player is.

Our project is tackling a classification problem as it is trying to predict whether a batter will get a hit against a given pitcher. We considered using a regression algorithm and determining the batter's average number of hits per at bat in a game, but chose the classification approach as a player's average number of hits per at bat is not a great indicator of their probability of getting a single hit which is all that is necessary for the game.

We are going to try all of the main classification algorithms discussed in class (the k-Nearest Neighbor, Support Vector Machine, Gaussian Naive Bayes, and Decision Tree classifiers) so that we can make an informed decision about which works best for this data. We will use recursive feature elimination (RFE) to choose the most important features, because it is hard to tell from looking at the data which features are most directly correlated with the binary target variable. Comparing the features to the target with scatterplots and a violin plot will also help us decide how many features to keep when running RFE. For example, if 5 features appear to be correlated with the target, we will set n = 5 when running RFE.
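As a minimal sketch of the RFE step described above, the example below runs scikit-learn's `RFE` on synthetic data generated with `make_classification` (not our MLB dataset) and keeps n = 5 features. The use of `DecisionTreeClassifier` as the ranking estimator and the names `X_demo`, `y_demo`, `selector`, and `kept` are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a feature table: 200 rows, 8 features (not MLB data)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=3, random_state=0
)

# Repeatedly drop the least important feature until 5 remain
selector = RFE(DecisionTreeClassifier(random_state=0), n_features_to_select=5)
selector.fit(X_demo, y_demo)

# support_ is a boolean mask over the columns; ranking_ is 1 for kept features
kept = [i for i, keep in enumerate(selector.support_) if keep]
print("Indices of the features kept:", kept)
print("Ranking (1 = kept):", selector.ranking_)
```

With a real feature table, the boolean mask in `selector.support_` can be applied to the column names to see which statistics were kept.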

<a id="3"></a>
<hr style="height:2px; border:none; color:black; background-color:black;">

## 3. RESULTS

### 3.1. Data Wrangling

## Data Import, Merging, Cleaning, and Formatting

In [1]:
import pandas as pd
hitters = pd.read_csv('https://raw.githubusercontent.com/kevschmitt/Beat-the-Streak/main/hittersByGame.csv', low_memory=False)
hitters = hitters[['Hitters', 'AB', 'H', 'Game']]
games = pd.read_csv('https://raw.githubusercontent.com/kevschmitt/Beat-the-Streak/main/games.csv', low_memory=False)
games = games[['Game', 'Date']]
hitters.head()

Unnamed: 0,Hitters,AB,H,Game
0,M. Carpenter,4,1,360403123
1,T. Pham,1,0,360403123
2,M. Adams,4,0,360403123
3,M. Holliday,3,1,360403123
4,R. Grichuk,4,1,360403123


In [2]:
b_v_p = pd.read_csv('https://raw.githubusercontent.com/kevschmitt/Beat-the-Streak/main/batter_vs_pitcher.csv')
b_v_p

Unnamed: 0.1,Unnamed: 0,Name,Team,Opp,Home?,Date,Pitcher,AB,H,XBH,HR,RBI,BB,AVG,OBP,SLG,OPS
0,0,Ian Happ,CHC,MIA,No,3/29/2019,Jose Urena,11,5,3,2,2,1,0.455,0.500,1.091,1.591
1,1,Jonathan Schoop,DET,PIT,Yes,3/29/2019,Ivan Nova,18,6,4,2,3,0,0.333,0.333,0.778,1.111
2,2,Jonathan Villar,NYM,STL,Yes,3/29/2019,Carlos Martinez,30,11,2,1,3,2,0.367,0.406,0.500,0.906
3,3,Nolan Arenado,STL,NYM,No,3/29/2019,Noah Syndergaard,10,4,1,1,1,2,0.400,0.500,0.700,1.200
4,4,Paul DeJong,STL,NYM,No,3/29/2019,Noah Syndergaard,11,6,1,0,0,0,0.545,0.545,0.636,1.182
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15759,15759,Carlos Correa,HOU,SEA,No,8/31/2021,Yusei Kikuchi,18,9,3,0,1,1,0.500,0.526,0.722,1.249
15760,15760,Jose Altuve,HOU,SEA,No,8/31/2021,Yusei Kikuchi,20,4,4,1,3,5,0.200,0.370,0.500,0.870
15761,15761,Mookie Betts,LAD,ATL,Yes,8/31/2021,Charlie Morton,24,7,3,1,1,4,0.292,0.414,0.500,0.914
15762,15762,Ty France,SEA,HOU,Yes,8/31/2021,Lance McCullers,12,6,2,0,1,2,0.500,0.571,0.667,1.238


In [3]:
# Reformats the date of a game to the MM/DD/YYYY format so that they are standardized
def format_dates_bvp(date):
    parts = date.split('/')
    ret = ''
    for p in range(0,2):
        if len(parts[p]) == 1:
            ret = ret + '0' + parts[p] + '/'
        else:
            ret = ret + parts[p] + '/'
    ret = ret + parts[2]
    return(ret)

In [4]:
b_v_p['Date'] = b_v_p['Date'].map(format_dates_bvp)
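The manual zero-padding above could also be written with the standard library's `datetime` module, which additionally rejects malformed dates. This is only a sketch of an alternative; the function name `to_mm_dd_yyyy` is hypothetical and is not part of our notebook.

```python
from datetime import datetime

def to_mm_dd_yyyy(date_str):
    """Return a date written as M/D/YYYY in zero-padded MM/DD/YYYY form."""
    # strptime accepts one- or two-digit months and days and raises
    # ValueError if the string is not a valid date in this layout
    parsed = datetime.strptime(date_str, "%m/%d/%Y")
    return parsed.strftime("%m/%d/%Y")

print(to_mm_dd_yyyy("3/29/2019"))   # single-digit month gets padded
print(to_mm_dd_yyyy("08/31/2021"))  # already padded, returned unchanged
```

It could be applied the same way as the original helper, for example with `b_v_p['Date'].map(to_mm_dd_yyyy)`.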

In [5]:
games_stats = pd.merge(hitters, games, on='Game', how='left')

In [6]:
# Reformats the date of a game to the MM/DD/YYYY format
def format_date(date):
    temp = date.split('T')[0]
    parts = temp.split('-')
    return parts[1] + '/' + parts[2] + '/' + parts[0]

In [7]:
# Reformats the names to be the hitter's first initial and last name which simplifies the data
def format_name(name):
    parts = name.split(' ')
    fi = parts[0][:1]
    return fi + '. ' + parts[1]

In [8]:
games_stats['Date'] = games_stats['Date'].map(format_date)
games_stats.head()

Unnamed: 0,Hitters,AB,H,Game,Date
0,M. Carpenter,4,1,360403123,04/03/2016
1,T. Pham,1,0,360403123,04/03/2016
2,M. Adams,4,0,360403123,04/03/2016
3,M. Holliday,3,1,360403123,04/03/2016
4,R. Grichuk,4,1,360403123,04/03/2016


In [9]:
# Abbreviates each hitter's name to first initial and last name (matching the game data), then renames
# the "Name" column to "Hitters", which is more specific since the dataframe also has a "Pitcher" name column
b_v_p['Name'] = b_v_p['Name'].map(format_name)
b_v_p = b_v_p.rename(columns = {'Name': 'Hitters'})
b_v_p.head()

Unnamed: 0.1,Unnamed: 0,Hitters,Team,Opp,Home?,Date,Pitcher,AB,H,XBH,HR,RBI,BB,AVG,OBP,SLG,OPS
0,0,I. Happ,CHC,MIA,No,03/29/2019,Jose Urena,11,5,3,2,2,1,0.455,0.5,1.091,1.591
1,1,J. Schoop,DET,PIT,Yes,03/29/2019,Ivan Nova,18,6,4,2,3,0,0.333,0.333,0.778,1.111
2,2,J. Villar,NYM,STL,Yes,03/29/2019,Carlos Martinez,30,11,2,1,3,2,0.367,0.406,0.5,0.906
3,3,N. Arenado,STL,NYM,No,03/29/2019,Noah Syndergaard,10,4,1,1,1,2,0.4,0.5,0.7,1.2
4,4,P. DeJong,STL,NYM,No,03/29/2019,Noah Syndergaard,11,6,1,0,0,0,0.545,0.545,0.636,1.182


In [10]:
# Merges the two dataframes above
df = pd.merge(b_v_p, games_stats, on=['Hitters', 'Date'], how='left')
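Because the merge keys are an abbreviated name and a date, it can be worth checking how many rows fail to match and whether any key appears twice in the game table. The sketch below shows one way to do this with the `indicator` option of `pd.merge`. The two small dataframes are made-up stand-ins for `b_v_p` and `games_stats`, and the names `left`, `right`, `unmatched`, and `dup_keys` are hypothetical.

```python
import pandas as pd

# Toy stand-ins for b_v_p and games_stats (values are made up for illustration)
left = pd.DataFrame({
    "Hitters": ["M. Betts", "J. Altuve", "T. France"],
    "Date": ["08/31/2021", "08/31/2021", "08/31/2021"],
    "AVG": [0.292, 0.200, 0.500],
})
right = pd.DataFrame({
    "Hitters": ["M. Betts", "J. Altuve"],
    "Date": ["08/31/2021", "08/31/2021"],
    "H": [1, 1],
})

# indicator=True adds a _merge column recording where each row found a match
merged = pd.merge(left, right, on=["Hitters", "Date"], how="left", indicator=True)
unmatched = merged[merged["_merge"] == "left_only"]
print(f"{len(unmatched)} of {len(merged)} rows had no game record")

# Duplicate keys in the game table would multiply rows in the merged result
dup_keys = int(right.duplicated(subset=["Hitters", "Date"]).sum())
print(f"{dup_keys} duplicate (Hitters, Date) keys in the game table")
```

Rows marked `left_only` are the ones that end up with NaN values after a left merge.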

In [11]:
# Drops the rows with NaN values (such as match-ups with no matching game record) and renumbers the index
df = df.dropna()
df = df.reset_index(drop=True)

In [12]:
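# Keeps the columns used later and renames the merged columns so that the career-vs-pitcher values
# (ABVP, HVP) are distinct from the actual game results (AB_actual, H_actual)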
df = df[['Hitters', 'Pitcher', 'Team', 'Opp', 'Home?', 'Date', 'AB_x', 'H_x', 'XBH', 'HR', 'RBI', 'BB', 'AVG',
               'OBP', 'AB_y', 'H_y']]
df = df.rename(columns = {'Hitters': 'Hitter', 'AB_x': 'ABVP', 'H_x': 'HVP', 'AB_y': 'AB_actual', 'H_y': 'H_actual'})

In [13]:
df['Pitcher'] = df['Pitcher'].map(format_name)

In [14]:
# Used to turn the values in "Home?" column into binary values where 
# 1 = yes and 0 = no so that they can be used in the model
def reformat_home(home):
    if home == 'Yes':
        return 1
    return 0

In [15]:
df['Home?'] = df['Home?'].map(reformat_home)

In [16]:
# Changes all number values in the dataframe to be floats to use in the model
for i in ['ABVP', 'HVP', 'XBH', 'HR', 'RBI', 'BB', 'AVG', 'OBP', 'AB_actual', 'H_actual']:    
    df[i] = df[i].astype(float)
df.head()

Unnamed: 0,Hitter,Pitcher,Team,Opp,Home?,Date,ABVP,HVP,XBH,HR,RBI,BB,AVG,OBP,AB_actual,H_actual
0,N. Arenado,N. Syndergaard,STL,NYM,0,03/29/2019,10.0,4.0,1.0,1.0,1.0,2.0,0.4,0.5,5.0,2.0
1,A. Bregman,C. Hamels,HOU,TEX,0,03/29/2019,19.0,6.0,3.0,1.0,1.0,1.0,0.316,0.381,4.0,0.0
2,T. Pham,C. Anderson,SD,MIL,1,03/29/2019,14.0,4.0,3.0,2.0,3.0,2.0,0.286,0.375,3.0,1.0
3,Y. Diaz,C. Sale,TB,BOS,1,03/29/2019,12.0,6.0,3.0,0.0,3.0,1.0,0.5,0.538,4.0,2.0
4,T. Story,P. Corbin,COL,ARI,0,03/29/2019,30.0,10.0,7.0,3.0,6.0,6.0,0.333,0.444,4.0,1.0


## Creating Target Variable

In [17]:
# Used to create a new column called "Hit?" that has a binary value of whether a hitter got a hit or not 
# (1 = yes, 0 = no)
def float_to_binary(num):
    if (num > 0):
        return 1
    else:
        return 0

In [18]:
df['Hit?'] = df['H_actual'].map(float_to_binary)
df

Unnamed: 0,Hitter,Pitcher,Team,Opp,Home?,Date,ABVP,HVP,XBH,HR,RBI,BB,AVG,OBP,AB_actual,H_actual,Hit?
0,N. Arenado,N. Syndergaard,STL,NYM,0,03/29/2019,10.0,4.0,1.0,1.0,1.0,2.0,0.400,0.500,5.0,2.0,1
1,A. Bregman,C. Hamels,HOU,TEX,0,03/29/2019,19.0,6.0,3.0,1.0,1.0,1.0,0.316,0.381,4.0,0.0,0
2,T. Pham,C. Anderson,SD,MIL,1,03/29/2019,14.0,4.0,3.0,2.0,3.0,2.0,0.286,0.375,3.0,1.0,1
3,Y. Diaz,C. Sale,TB,BOS,1,03/29/2019,12.0,6.0,3.0,0.0,3.0,1.0,0.500,0.538,4.0,2.0,1
4,T. Story,P. Corbin,COL,ARI,0,03/29/2019,30.0,10.0,7.0,3.0,6.0,6.0,0.333,0.444,4.0,1.0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11056,C. Correa,Y. Kikuchi,HOU,SEA,0,08/31/2021,18.0,9.0,3.0,0.0,1.0,1.0,0.500,0.526,4.0,1.0,1
11057,J. Altuve,Y. Kikuchi,HOU,SEA,0,08/31/2021,20.0,4.0,4.0,1.0,3.0,5.0,0.200,0.370,5.0,1.0,1
11058,M. Betts,C. Morton,LAD,ATL,1,08/31/2021,24.0,7.0,3.0,1.0,1.0,4.0,0.292,0.414,4.0,1.0,1
11059,T. France,L. McCullers,SEA,HOU,1,08/31/2021,12.0,6.0,2.0,0.0,1.0,2.0,0.500,0.571,4.0,2.0,1
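The same target column can be built without a helper function, and it is often useful to look at the share of each class at this point. The sketch below uses a small made-up series in place of `df['H_actual']`; the names `h_actual`, `hit`, and `class_share` are hypothetical.

```python
import pandas as pd

# Toy H_actual values (made up); in the notebook this would be df['H_actual']
h_actual = pd.Series([2.0, 0.0, 1.0, 2.0, 1.0], name="H_actual")

# Vectorised equivalent of mapping float_to_binary over the column
hit = (h_actual > 0).astype(int)

# Share of each class; the larger share is the accuracy obtained by
# always guessing that class
class_share = hit.value_counts(normalize=True)
print(hit.tolist())
print(class_share)
```

The larger of the two shares gives a reference point for judging classifier accuracy later on.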


## Creating Feature and Target Sets

In [19]:
# Features: Home?, ABVP, HVP, XBH, HR, RBI, AVG, OBP (BB is not included in this selection)
features = df[['Home?', 'ABVP', 'HVP', 'XBH','HR', 'RBI', 'AVG', 'OBP']]
features.head()

Unnamed: 0,Home?,ABVP,HVP,XBH,HR,RBI,AVG,OBP
0,0,10.0,4.0,1.0,1.0,1.0,0.4,0.5
1,0,19.0,6.0,3.0,1.0,1.0,0.316,0.381
2,1,14.0,4.0,3.0,2.0,3.0,0.286,0.375
3,1,12.0,6.0,3.0,0.0,3.0,0.5,0.538
4,0,30.0,10.0,7.0,3.0,6.0,0.333,0.444


In [20]:
# Target: Hit?
target = df['Hit?']
target.head()

0    1
1    0
2    1
3    1
4    1
Name: Hit?, dtype: int64

### Feature Selection

From our knowledge of the dataset, baseball in general, and basic visualizations, we expected that the on base percentage, batting average, and runs batted in against the opposing pitcher were most likely to be significant features, so we were not surprised by the outcome of the recursive feature elimination shown below.

In [21]:
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

In [22]:
X_train, X_test, y_train, y_test = train_test_split(features, target, random_state=3000) 

In [23]:
def RFE_feature_selection(features, x_train, x_test, y_train):
    
    #instantiate
    feature_selection = RFE(DecisionTreeRegressor(random_state = 3000), n_features_to_select = 3)
    feature_selection.fit(x_train, y_train)
    
    X_train_selected = feature_selection.transform(x_train)
    X_test_selected = feature_selection.transform(x_test)
    
    #determine selected features
    selected_features = list(features.columns[feature_selection.get_support()])

    return X_train_selected, X_test_selected, selected_features

In [24]:
X_train_selected, X_test_selected, selected = RFE_feature_selection(features, X_train, X_test, y_train)
features_selected = features[selected]

In [25]:
features_selected

Unnamed: 0,RBI,AVG,OBP
0,1.0,0.400,0.500
1,1.0,0.316,0.381
2,3.0,0.286,0.375
3,3.0,0.500,0.538
4,6.0,0.333,0.444
...,...,...,...
11056,1.0,0.500,0.526
11057,3.0,0.200,0.370
11058,1.0,0.292,0.414
11059,1.0,0.500,0.571


### 3.2. Data Exploration


### Scatter Matrix of Random Sample of Player Games
This scatter matrix looks for relationships among on base percentage, batting average, and runs batted in against the opposing pitcher for a random sample of 50 player games. The results surprised us because no clear pattern emerged: some players with low averages got hits while some with high averages did not, and players with both a high average and a high on base percentage were split between hits and no hits.

In [26]:
import plotly.express as px
sample = df.sample(50)
fig = px.scatter_matrix(sample, dimensions = ["OBP","AVG","RBI"], color = "Hit?", title = "Relationships Between OBP, AVG, and RBI")

<a href="https://imgbb.com/"><img src="https://i.ibb.co/G02KnJk/newplot-2.png" alt="newplot-2" border="0"></a>

### Scatter Matrix of an Individual Player's Games
This is a similar scatter matrix to the one above but is specific to the given player, Mookie Betts, in order to see any patterns that may exist on a single player basis.

In [34]:
df_mookie = df[df['Hitter'] == 'M. Betts']
fig = px.scatter_matrix(df_mookie, dimensions = ["OBP","AVG","RBI"], color = "Hit?", title = "Mookie Betts Statistics")

<a href="https://imgbb.com/"><img src="https://i.ibb.co/jv7bDP8/newplot-1.png" alt="newplot-1" border="0"></a>

### Violin Plot of Batting Average vs. Pitcher, Grouped by Whether the Player Got a Hit
This plot shows that hitters who got a hit typically had slightly better historical performance against the pitcher than hitters who did not.
A hit is represented as 1 and no hit is represented as 0.

In [35]:
fig = px.violin(df, x='Hit?', y='AVG', color = "Hit?", box=True, template='ggplot2', 
                        points="all", width = 500, violinmode="overlay")

<a href="https://imgbb.com/"><img src="https://i.ibb.co/g7RgzGq/newplot.png" alt="newplot" border="0"></a>

### 3.3. Model Training

In [29]:
# Classification
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

estimators = {
    'k-Nearest Neighbor': KNeighborsClassifier(), 
    'Support Vector Machine': LinearSVC(max_iter=1000000),
    'Gaussian Naive Bayes': GaussianNB(),
    'Decision Tree': DecisionTreeClassifier()}

print('INITIAL TESTS ON CLASSIFICATION ALGORITHMS')

for estimator_name, estimator_object in estimators.items():
    # base data
    model = estimator_object
    model.fit(X=X_train, y=y_train)
    predicted = model.predict(X=X_test)
    print(estimator_name)
    print('\tClassification accuracy on the base train data:', 
          format(model.score(X_train, y_train)*100, ".2f") + '%')
    print('\tClassification accuracy on the base test data:', 
          format(model.score(X_test, y_test)*100, ".2f") + '%')
    
    # selected data
    model_selected = estimator_object
    model_selected.fit(X=X_train_selected, y=y_train)
    predicted = model_selected.predict(X=X_test_selected)
    print('\tClassification accuracy on the selected train data:', 
          format(model_selected.score(X_train_selected, y_train)*100, ".2f") + '%')
    print('\tClassification accuracy on the selected test data:', 
          format(model_selected.score(X_test_selected, y_test)*100, ".2f") + '%')
    

INITIAL TESTS ON CLASSIFICATION ALGORITHMS
k-Nearest Neighbor
	Classification accuracy on the base train data: 74.74%
	Classification accuracy on the base test data: 61.97%
	Classification accuracy on the selected train data: 69.28%
	Classification accuracy on the selected test data: 62.00%
Support Vector Machine
	Classification accuracy on the base train data: 68.70%
	Classification accuracy on the base test data: 68.80%
	Classification accuracy on the selected train data: 68.70%
	Classification accuracy on the selected test data: 68.80%
Gaussian Naive Bayes
	Classification accuracy on the base train data: 68.70%
	Classification accuracy on the base test data: 68.80%
	Classification accuracy on the selected train data: 68.70%
	Classification accuracy on the selected test data: 68.80%
Decision Tree
	Classification accuracy on the base train data: 80.87%
	Classification accuracy on the base test data: 59.22%
	Classification accuracy on the selected train data: 73.27%
	Classification acc

Based on the results above, the Decision Tree and k-Nearest Neighbor classifiers are less accurate on the test data and appear to overfit, since their training accuracy is much higher than their test accuracy. We therefore chose the Support Vector Machine and Gaussian Naive Bayes classifiers and applied a grid search to each to see whether hyperparameter tuning makes either more accurate.
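One check that may help interpret the identical Support Vector Machine and Gaussian Naive Bayes scores above is a majority-class baseline. The sketch below uses scikit-learn's `DummyClassifier` on made-up labels (69 hits and 31 no-hits, chosen only to resemble a class balance near 69%); the names `X_demo`, `y_demo`, and `baseline` are hypothetical. If a model's test accuracy equals this kind of baseline, the model may be predicting the same class for every row.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up labels: 69 hits and 31 no-hits (illustrative only, not our data)
y_demo = np.array([1] * 69 + [0] * 31)
X_demo = np.zeros((len(y_demo), 1))  # features are ignored by this baseline

# Always predicts the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_demo, y_demo)
baseline_accuracy = baseline.score(X_demo, y_demo)
print(f"Majority-class baseline accuracy: {baseline_accuracy:.2%}")
```

In the notebook, the same baseline could be fit on `X_train` and `y_train` and scored on `X_test` and `y_test` for a direct comparison.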

In [30]:
param_grid = {'multi_class': ['ovr', 'crammer_singer'], 'loss': ['hinge', 'squared_hinge']}
    
from sklearn.model_selection import GridSearchCV
grid_search = GridSearchCV(LinearSVC(max_iter=1000000), param_grid)

grid_search.fit(X=X_train_selected, y=y_train)

# result of grid search
print("Best estimator: ", grid_search.best_estimator_)
print("Best parameters: ", grid_search.best_params_)
print("Best cross-validation score: ", grid_search.best_score_)

Best estimator:  LinearSVC(loss='hinge', max_iter=1000000)
Best parameters:  {'loss': 'hinge', 'multi_class': 'ovr'}
Best cross-validation score:  0.6870403857745628


In [31]:
param_grid = {'var_smoothing': [0.00000001, .000000001, 0.0000000001]}

grid_search = GridSearchCV(GaussianNB(), param_grid)

grid_search.fit(X=X_train_selected, y=y_train)

# result of grid search
print("Best estimator: ", grid_search.best_estimator_)
print("Best parameters: ", grid_search.best_params_)
print("Best cross-validation score: ", grid_search.best_score_)

Best estimator:  GaussianNB(var_smoothing=1e-08)
Best parameters:  {'var_smoothing': 1e-08}
Best cross-validation score:  0.6870403857745628


### 3.4. Model Optimization
We refit each model with the best hyperparameters found by its grid search. The two tuned models (Support Vector Machine and Gaussian Naive Bayes) produced the same cross-validation score, so we kept both rather than choosing between them. We also trained on the three selected features instead of the full feature set because the accuracy was the same with either.

In [32]:
# Support Vector Machine
model1 = LinearSVC(max_iter=1000000, loss = 'hinge')
model1.fit(X=X_train_selected, y=y_train)

#Gaussian Naive Bayes
model2 = GaussianNB(var_smoothing = .00000001)
model2.fit(X=X_train_selected, y=y_train)

GaussianNB(var_smoothing=1e-08)

### 3.5. Model Testing

In [33]:
print('TESTING SET WITH HYPERPARAMETERS')
print('\tSupport Vector Machine performance on testing set:', 
          format(model1.score(X_test_selected, y_test)*100, ".2f") + '%')
print('\tGaussian Naive Bayes performance on testing set:', 
          format(model2.score(X_test_selected, y_test)*100, ".2f") + '%')

TESTING SET WITH HYPERPARAMETERS
	Support Vector Machine performance on testing set: 68.80%
	Gaussian Naive Bayes performance on testing set: 68.80%
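Accuracy alone does not show which kinds of mistakes a model makes. A confusion matrix is one way to see whether a classifier predicts both classes or only one. The sketch below uses made-up labels and predictions rather than our results; the names `y_true`, `y_pred`, and `cm` are hypothetical.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels and predictions (illustrative only)
y_true = [1, 0, 1, 1, 0, 1, 0, 1]
y_pred = [1, 1, 1, 1, 1, 1, 1, 1]  # a model that always predicts "hit"

# Rows are true classes (0, 1); columns are predicted classes (0, 1)
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
print(cm)
```

A first column of zeros, as in this toy case, means the model never predicted "no hit". In the notebook the call would take `y_test` and the output of `model1.predict(X_test_selected)` or `model2.predict(X_test_selected)`.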


<a id="4"></a>
<hr style="height:2px; border:none; color:black; background-color:black;">

## 4. DISCUSSION

For our model we chose to compare the k-Nearest Neighbor, Support Vector Machine, Gaussian Naive Bayes, and Decision Tree classification algorithms. Based on their individual accuracies on the training and testing sets, we found that the Support Vector Machine and Naive Bayes classification algorithms had the least risk of overfitting and the best results on the test set. From there we used grid search to determine the best hyperparameters for each of these algorithms. The grid search selected hinge loss and the one-vs-rest multi-class strategy for the Support Vector Machine classifier (the maximum number of iterations was fixed at 1,000,000 rather than tuned) and a smoothing coefficient of $1 \times 10^{-8}$ for the Naive Bayes classifier. Both of these methods performed equally well with the variables found in the feature selection and with the complete list of features, so we chose to go with the selected features in order to lower model complexity and the risk of overfitting. Both algorithms also yielded the same 68.80% test accuracy after hyperparameter tuning as before it, so they can be used interchangeably.

Based on our findings, we believe that we can predict whether a player will get a hit in a single game with moderate accuracy (68.8% on the test set). We are unsure whether this is enough for our original goal, because a model that is right about two times in three on any one day has very low odds of being right 57 days in a row.
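To give a rough sense of those odds, the sketch below multiplies out the chance of 57 correct picks in a row. It rests on two simplifying assumptions that are not established by our results: that every daily pick is correct with the same probability, and that picks are independent of one another. The 0.688 figure is our test accuracy, used here only as a stand-in for the per-pick probability.

```python
# Assumption: each daily pick is correct with the same probability,
# independently of the others. 0.688 is our test accuracy, used as a
# rough stand-in for that probability.
p_correct = 0.688
streak_length = 57

# Probability that all 57 independent picks are correct
p_streak = p_correct ** streak_length
print(f"P({streak_length} correct picks in a row) = {p_streak:.2e}")

# Per-pick probability that would be needed for a 1% chance of a full streak
p_needed = 0.01 ** (1 / streak_length)
print(f"Per-pick probability needed for a 1% chance: {p_needed:.3f}")
```

Under these assumptions, even a modest chance of completing the streak would require per-pick reliability well above what our models achieved.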

There are a few ethical implications of our project that should be addressed. Baseball has a history of ethical issues. One potential issue for this project is how it leads us to view players. Judging players solely on their ability to produce hits may not be the best representation of their skill set or of their ability to contribute to the team's success. In addition, some people argue that computerized algorithms and analysis have overtaken the sport, taking away from its true spirit and leading to a sole focus on improving outcomes. A project like ours could add to that trend and raise further questions about how much we should rely on computed statistics.

Results should be taken at face value in the context of the Beat the Streak competition, but they should not be the sole deciding factor in judging a player's ability. In all team sports, each player holds a different value and contributes to the team in a different manner. Historically, not all of the best players have had the highest hit rates, so hit rate may not be the best predictor of either individual or team success.

We believe that there is some bias against younger players who have played very few games. Because they have likely faced a given pitcher only a few times, they will have either a very high or a very low average that is not fully representative of their matchups against that pitcher.
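The effect of a small number of at bats can be illustrated with the binomial standard error of a batting average. This is a sketch under the simplifying assumption that each at bat is an independent trial with the same hit probability; the function name `avg_standard_error` and the example numbers are made up for illustration.

```python
import math

def avg_standard_error(hits, at_bats):
    """Binomial standard error of a batting average from `at_bats` trials."""
    avg = hits / at_bats
    return math.sqrt(avg * (1 - avg) / at_bats)

# Same .500 average, very different amounts of evidence (numbers are made up)
for hits, at_bats in [(2, 4), (15, 30)]:
    se = avg_standard_error(hits, at_bats)
    print(f"{hits}-for-{at_bats}: AVG = {hits / at_bats:.3f}, standard error = {se:.3f}")
```

The two averages are identical, but the one built on fewer at bats carries much more uncertainty, which is the concern raised above for players with short histories against a pitcher.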

In future work, we would want to add a section that considers the pitcher's overall success since not all pitchers are on the same skill level. Furthermore, player performance changes over years and there are several confounding variables which may influence their performance that are not taken into account.


<a id="5"></a>
<hr style="height:2px; border:none; color:black; background-color:black;">

### CONTRIBUTIONS

In FP4, part 1 was done by Rachel and Melis. Part 2 was done by Melis and Kevin. Part 3 was done by Rachel and Kevin. Part 4 was a group effort between all 3 of us. Overall, we all worked together to complete the discussion and executive summary, and we aided each other throughout the data wrangling and acquisition processes.