<a href="https://colab.research.google.com/github/maxsauers13/linear-regression-baseball/blob/main/DMAP_SP21_Final_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Keep this code cell here
# Project title will be used by the reviewer
PROJECT_TITLE = "Analysis of Runs Scored in a Baseball Game"
NOTEBOOK_ID   = "https://colab.research.google.com/drive/19mW0Odcz5VPLRNczVzRA5sSidRnhgDnq?authuser=1"
VERSION = "SP21"

# Introduction

Since the MLB season is in full swing, I want to use statistics to answer a long-standing question of mine about baseball: what are the greatest factors in determining the number of runs scored in a game? To answer it, I will look at several possible components:

  1. Weather
    1. Wind speed
    2. Temperature
  2. Pitching
    1. Earned Run Average (ERA)
      - This is the average number of earned runs a pitcher gives up for every 9 innings pitched.
  3. Batting
    1. On Base Percentage (OBP)
      - This is the rate at which batters reach base per plate appearance, computed here for an entire team.

I will collect data on these topics and explore how they contribute to the number of runs scored in a given baseball game. I will find which factors contribute the most and build a model accordingly. Lastly, I can test this model against statistics from other games and compare how accurate my model turned out to be.
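As a quick illustration of the two baseball statistics defined above, here is a minimal sketch using the standard ERA and OBP formulas. The function names and the input numbers are hypothetical and are not taken from the project's data.

```python
def earned_run_average(earned_runs, innings_pitched):
    """Earned runs allowed, scaled to a 9-inning game."""
    return 9 * earned_runs / innings_pitched


def on_base_percentage(hits, walks, hit_by_pitch, at_bats, sacrifice_flies):
    """Times on base divided by the plate appearances that count toward OBP."""
    reached = hits + walks + hit_by_pitch
    opportunities = at_bats + walks + hit_by_pitch + sacrifice_flies
    return reached / opportunities


# Hypothetical season totals, for illustration only
era = earned_run_average(earned_runs=20, innings_pitched=60)
obp = on_base_percentage(hits=150, walks=50, hit_by_pitch=5,
                         at_bats=550, sacrifice_flies=5)
print("ERA: {:.2f}, OBP: {:.3f}".format(era, obp))
```

A lower ERA means stronger pitching, while a higher OBP means stronger batting, so the two features are expected to push runs scored in opposite directions.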


In [None]:
%%html
<!-- this should be the ONLY html cell in the notebook: use markdown -->
<div style="font-size:36px; max-width:800px; font-family:Times, serif;">
 Analysis of Runs Scored in a Baseball Game
<!--<video width="600" controls>
  <source src="https://www.youtube.com/embed/AkkLzJGXlXc"
  type="video/mp4">
</video>-->
<iframe width="560" height="315" src="https://www.youtube.com/embed/AkkLzJGXlXc" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
</div>

# Data Acquisition, Selection, Cleaning
*Data description shown in the "Data Preparation" Section below.*

In [None]:
# add your imports here for your entire project
import pandas as pd
import numpy as np
import requests

In [None]:
# downloading my data

def install_data():
  # Google Drive file IDs, keyed by the local file name each one is saved under
  files = {
      'data-2021.csv': '1HhBn-Ir3M4sRYBiebGHxVtIHWbLK3VfZ',
      'data-2020.csv': '1EWhiGKImrr2HQREVi6UzZDpaJg8wZEsZ',
  }

  for file_name, file_id in files.items():
    d_url = "https://drive.google.com/uc?export=download&id={:s}".format(file_id)
    response = requests.get(d_url)
    response.raise_for_status()  # stop early if a download failed
    with open(file_name, 'wb') as fd:
      fd.write(response.content)

def get_training_df():
  import pandas as pd
  df = pd.read_csv('data-2020.csv')
  return df

def get_testing_df():
  import pandas as pd
  df = pd.read_csv('data-2021.csv')
  return df

install_data()

# Data Preparation

The data used in this project was mainly prepared before downloading it into this notebook. 

The data came from three JSON files containing game statistics, player statistics, and team statistics. One sample instance from each file is shown below:


In [None]:
# Game Instance:

# '{"GameID":59622,"Season":2020,"SeasonType":1,"Status":"Final","Day":"2020-07-23T00:00:00","DateTime":"2020-07-23T19:00:00","AwayTeam":"NYY","HomeTeam":"WSH",
# "AwayTeamID":29,"HomeTeamID":35,"RescheduledGameID":null,"StadiumID":53,"Channel":"ESPN","Inning":6,"InningHalf":"T","AwayTeamRuns":9,"HomeTeamRuns":2,
# "AwayTeamHits":14,"HomeTeamHits":2,"AwayTeamErrors":0,"HomeTeamErrors":0,"WinningPitcherID":10000970,"LosingPitcherID":10000426,"SavingPitcherID":null,"Attendance":null,
# "AwayTeamProbablePitcherID":10000970,"HomeTeamProbablePitcherID":10000426,"Outs":null,"Balls":null,"Strikes":null,"CurrentPitcherID":null,"CurrentHitterID":null,
# "AwayTeamStartingPitcherID":10000970,"HomeTeamStartingPitcherID":10000426,"CurrentPitchingTeamID":null,"CurrentHittingTeamID":null,"PointSpread":3.4,"OverUnder":12.0,
# "AwayTeamMoneyLine":-252,"HomeTeamMoneyLine":219,"ForecastTempLow":112,"ForecastTempHigh":117,"ForecastDescription":"Scrambled","ForecastWindChill":112,
# "ForecastWindSpeed":9,"ForecastWindDirection":386,"RescheduledFromGameID":null,"RunnerOnFirst":null,"RunnerOnSecond":null,"RunnerOnThird":null,
# "AwayTeamStartingPitcher":"Scrambled","HomeTeamStartingPitcher":"Scrambled","CurrentPitcher":"Scrambled","CurrentHitter":"Scrambled","WinningPitcher":"Scrambled",
# "LosingPitcher":"Scrambled","SavingPitcher":"Scrambled","DueUpHitterID1":null,"DueUpHitterID2":null,"DueUpHitterID3":null,"GlobalGameID":10059622,
# "GlobalAwayTeamID":10000029,"GlobalHomeTeamID":10000035,"PointSpreadAwayTeamMoneyLine":153,"PointSpreadHomeTeamMoneyLine":-184,"LastPlay":"Scrambled",
# "IsClosed":true,"Updated":"2020-10-17T11:13:05","GameEndDateTime":"2020-07-23T23:08:38","HomeRotationNumber":1353,"AwayRotationNumber":1352,"NeutralVenue":false,
# "InningDescription":null,"OverPayout":null,"UnderPayout":null,"Innings":[]}'

# Player Instance:

# '{"StatID":2990653,"TeamID":3,"PlayerID":10000001,"SeasonType":1,"Season":2020,"Name":"Chase Anderson","Team":"TOR","Position":"SP","PositionCategory":"P",
# "Started":7,"BattingOrder":null,"GlobalTeamID":10000003,"AverageDraftPosition":null,"AuctionValue":null,"Updated":"2020-12-31T03:20:53","Games":10,"FantasyPoints":5.8,
# "AtBats":0.0,"Runs":0.0,"Hits":0.0,"Singles":0.0,"Doubles":0.0,"Triples":0.0,"HomeRuns":0.0,"RunsBattedIn":0.0,"BattingAverage":0.0,"Outs":0.0,"Strikeouts":0.0,
# "Walks":0.0,"HitByPitch":0.0,"Sacrifices":0.0,"SacrificeFlies":0.0,"GroundIntoDoublePlay":0.0,"StolenBases":0.0,"CaughtStealing":0.0,"PitchesSeen":0.0,
# "OnBasePercentage":0.0,"SluggingPercentage":0.0,"OnBasePlusSlugging":0.0,"Errors":0.0,"Wins":0.5,"Losses":1.1,"Saves":0.0,"InningsPitchedDecimal":24.8,
# "TotalOutsPitched":74.4,"InningsPitchedFull":24.3,"InningsPitchedOuts":1.1,"EarnedRunAverage":3.9,"PitchingHits":33.2,"PitchingRuns":21.4,"PitchingEarnedRuns":19.9,
# "PitchingWalks":5.4,"PitchingStrikeouts":28.0,"PitchingHomeRuns":6.0,"PitchesThrown":480.5,"PitchesThrownStrikes":294.1,"WalksHitsPerInningsPitched":0.9,
# "PitchingBattingAverageAgainst":0.2,"GrandSlams":0.0,"FantasyPointsFanDuel":103.2,"FantasyPointsDraftKings":50.2,"FantasyPointsYahoo":56.5,"PlateAppearances":0.0,
# "TotalBases":0.0,"FlyOuts":0.0,"GroundOuts":0.0,"LineOuts":0.0,"PopOuts":0.0,"IntentionalWalks":0.0,"ReachedOnError":0.0,"BallsInPlay":0.0,
# "BattingAverageOnBallsInPlay":0.0,"WeightedOnBasePercentage":0.0,"PitchingSingles":17.0,"PitchingDoubles":6.0,"PitchingTriples":0.0,"PitchingGrandSlams":0.0,
# "PitchingHitByPitch":0.5,"PitchingSacrifices":0.0,"PitchingSacrificeFlies":0.0,"PitchingGroundIntoDoublePlay":1.1,"PitchingCompleteGames":0.0,"PitchingShutOuts":0.0,
# "PitchingNoHitters":0.0,"PitchingPerfectGames":0.0,"PitchingPlateAppearances":113.5,"PitchingTotalBases":65.6,"PitchingFlyOuts":6.5,"PitchingGroundOuts":18.4,
# "PitchingLineOuts":4.3,"PitchingPopOuts":4.3,"PitchingIntentionalWalks":0.0,"PitchingReachedOnError":0.0,"PitchingCatchersInterference":0.0,"PitchingBallsInPlay":69.3,
# "PitchingOnBasePercentage":0.2,"PitchingSluggingPercentage":0.3,"PitchingOnBasePlusSlugging":0.5,"PitchingStrikeoutsPerNineInnings":5.5,"PitchingWalksPerNineInnings":1.5,
# "PitchingBattingAverageOnBallsInPlay":0.2,"PitchingWeightedOnBasePercentage":0.2,"DoublePlays":0.0,"PitchingDoublePlays":1.1,"BattingOrderConfirmed":true,
# "IsolatedPower":0.0,"FieldingIndependentPitching":3.3,"PitchingQualityStarts":0.0,"PitchingInningStarted":null,"LeftOnBase":0.0,"PitchingHolds":0.0,
# "PitchingBlownSaves":0.0,"SubstituteBattingOrder":null,"SubstituteBattingOrderSequence":null,"FantasyPointsFantasyDraft":50.2}'

# Team Instance:

# {"StatID":2934732,"TeamID":14,"SeasonType":1,"Season":2020,"Name":"Arizona Diamondbacks","Team":"ARI","GlobalTeamID":10000014,"Updated":"2020-12-31T03:22:17",
# "Games":60,"FantasyPoints":784.0,"AtBats":812.8,"Runs":109.5,"Hits":196.2,"Singles":126.6,"Doubles":41.1,"Triples":2.0,"HomeRuns":23.6,"RunsBattedIn":103.8,
# "BattingAverage":0.0,"Outs":616.6,"Strikeouts":187.6,"Walks":73.7,"HitByPitch":14.7,"Sacrifices":4.0,"SacrificeFlies":3.8,"GroundIntoDoublePlay":14.7,"StolenBases":3.8,
# "CaughtStealing":1.2,"PitchesSeen":3533.2,"OnBasePercentage":0.1,"SluggingPercentage":0.1,"OnBasePlusSlugging":0.1,"Errors":14.2,"Wins":10.2,"Losses":14.2,"Saves":2.2,
# "InningsPitchedDecimal":211.0,"TotalOutsPitched":632.9,"InningsPitchedFull":210.8,"InningsPitchedOuts":0.2,"EarnedRunAverage":0.8,"PitchingHits":205.9,"PitchingRuns":120.1,
# "PitchingEarnedRuns":113.6,"PitchingWalks":95.6,"PitchingStrikeouts":213.3,"PitchingHomeRuns":37.9,"PitchesThrown":3628.8,"PitchesThrownStrikes":2275.9,
# "WalksHitsPerInningsPitched":0.2,"PitchingBattingAverageAgainst":0.0,"GrandSlams":0.0,"FantasyPointsFanDuel":1988.3,"FantasyPointsDraftKings":1510.4,
# "FantasyPointsYahoo":1276.2,"PlateAppearances":910.5,"TotalBases":317.9,"FlyOuts":0.0,"GroundOuts":173.8,"LineOuts":55.4,"PopOuts":37.9,"IntentionalWalks":0.8,
# "ReachedOnError":3.3,"BallsInPlay":610.9,"BattingAverageOnBallsInPlay":0.0,"WeightedOnBasePercentage":0.1,"PitchingSingles":115.6,"PitchingDoubles":48.0,
# "PitchingTriples":1.8,"PitchingGrandSlams":0.2,"PitchingHitByPitch":3.8,"PitchingSacrifices":0.5,"PitchingSacrificeFlies":2.2,"PitchingGroundIntoDoublePlay":20.8,
# "PitchingCompleteGames":0.0,"PitchingShutOuts":0.0,"PitchingNoHitters":0.0,"PitchingPerfectGames":0.0,"PitchingPlateAppearances":929.2,"PitchingTotalBases":376.5,
# "PitchingFlyOuts":107.4,"PitchingGroundOuts":131.1,"PitchingLineOuts":55.4,"PitchingPopOuts":34.6,"PitchingIntentionalWalks":3.3,"PitchingReachedOnError":10.2,
# "PitchingCatchersInterference":0.2,"PitchingBallsInPlay":571.4,"PitchingOnBasePercentage":0.1,"PitchingSluggingPercentage":0.1,"PitchingOnBasePlusSlugging":0.1,
# "PitchingStrikeoutsPerNineInnings":1.5,"PitchingWalksPerNineInnings":0.7,"PitchingBattingAverageOnBallsInPlay":0.0,"PitchingWeightedOnBasePercentage":0.1,
# "DoublePlays":16.3,"PitchingDoublePlays":21.6,"BattingOrderConfirmed":true,"IsolatedPower":0.0,"FieldingIndependentPitching":0.8,"PitchingQualityStarts":2.2,
# "PitchingInningStarted":null,"LeftOnBase":313.0,"PitchingHolds":3.1,"PitchingBlownSaves":1.5,"SubstituteBattingOrder":null,"SubstituteBattingOrderSequence":null,
# "FantasyPointsFantasyDraft":1504.7}

The important data was then taken out of each instance and written to a CSV file. 

This includes:
1. Runs scored in a game
2. Combined OBP of both teams
3. Combined ERA of the pitchers on both teams
4. Temperature
5. Wind Speed

This was done using the four functions shown below:

*Note that these will not run in this notebook. I wrote and ran them in Visual Studio Code, because the JSON files are too large to import into this notebook.*

In [None]:
# takes a JSON file name, a dictionary mapping pitcher IDs to ERAs (created in
# getPitcherDict), and a dictionary mapping team abbreviations to OBPs (created
# in getTeamsDict)
# extracts the valuable information from each game instance
# returns a list of lists, each game becoming a list of data
def getCSVdata(fileName, players, teams_dict):
    output = []
    with open(fileName, 'r') as file:
        teams = file.read().split('}')
    teams.pop(len(teams) - 1)
    for team in teams:
        team_data = []
        team_list = team.split(',')
        for data in team_list:
            data_split = data.split(':')
            if '"AwayTeam"' == data_split[0]:
                team_data.append(float(teams_dict[data_split[1][1:-1]]))
            elif '"HomeTeam"' == data_split[0]:
                team_data[0] += float(teams_dict[data_split[1][1:-1]])
                team_data[0] = "{:.3f}".format(team_data[0])
            elif '"AwayTeamRuns"' == data_split[0]:
                if data_split[1] == 'null':
                    continue
                team_data.append(int(data_split[1]))
            elif '"HomeTeamRuns"' == data_split[0]:
                if data_split[1] == 'null':
                    continue
                team_data[1] += (int(data_split[1]))
                team_data[1] = str(team_data[1])
                temp = team_data[1]
                team_data[1] = team_data[0]
                team_data[0] = temp
            elif '"AwayTeamStartingPitcherID"' == data_split[0]:
                if data_split[1] == 'null':
                    continue
                if data_split[1] in players:
                    team_data.append(float(players[data_split[1]]))
            elif '"HomeTeamStartingPitcherID"' == data_split[0]:
                if data_split[1] == 'null':
                    continue
                if data_split[1] in players and len(team_data) > 2:
                    team_data[2] += float(players[data_split[1]])
                    team_data[2] = "{:.1f}".format(team_data[2])
            elif '"ForecastTempHigh"' == data_split[0]:
                if data_split[1] == 'null':
                    continue
                team_data.append((data_split[1]))
            elif '"ForecastWindSpeed"' == data_split[0]:
                if data_split[1] == 'null':
                    continue
                team_data.append((data_split[1]))
        
        num_points = 5
        if len(team_data) == num_points:
            output.append(team_data)

    return output

# returns a dictionary mapping each pitcher's player ID to their ERA
# players with an ERA of 0.0 are skipped
# this is used to find the ERA of a pitcher from a game instance
def getPitcherDict(fileName):
    output = []
    with open(fileName, 'r') as file:
        teams = file.read().split('}')
    teams.pop(len(teams) - 1)
    for team in teams:
        team_data = []
        team_list = team.split(',')
        for data in team_list:
            data_split = data.split(':')
            if '"PlayerID"' == data_split[0]:
                team_data.append(data_split[1])
            elif '"EarnedRunAverage"' == data_split[0]:
                if data_split[1] != '0.0':
                    team_data.append(data_split[1])
        
        num_points = 2
        if len(team_data) == num_points:
            output.append(team_data)
    
    output_dict = {}
    for player in output:
        output_dict[player[0]] = player[1]

    return output_dict

# returns a dictionary of all the teams and their corresponding OBPs
# OBP is approximated here as (Hits + Walks) / AtBats
def getTeamsDict(fileName):
    output = []
    with open(fileName, 'r') as file:
        teams = file.read().split('}')
    teams.pop(len(teams) - 1)
    for team in teams:
        team_data = []
        team_list = team.split(',')
        for data in team_list:
            data_split = data.split(':')
            if '"Team"' == data_split[0]:
                team_data.append(data_split[1][1:-1])
            elif '"AtBats"' == data_split[0]:
                team_data.append(data_split[1])
            elif '"Hits"' == data_split[0]:
                team_data.append(data_split[1])
            elif '"Walks"' == data_split[0]:
                team_data.append(data_split[1])
        
        num_points = 4
        if len(team_data) == num_points:
            output.append(team_data)
    
    output_dict = {}
    for team in output:
        output_dict[team[0]] = "{:.3f}".format((float(team[2]) + float(team[3])) / float(team[1]))

    return output_dict

# takes a list of lists and writes it into a csv file
# each list is a line in the csv file
# note: the spaces after the commas in the header become part of the column
# names (' OBP', ' ERA', ...), which is how the analysis code refers to them
def writeCSV(to_write):
    with open('2021_stats.csv', 'w') as f:
        f.write('Runs, OBP, ERA, Temperature, Wind\n')
        for data in to_write:
            f.write(','.join(str(value) for value in data) + '\n')

# DO NOT run these functions

Additionally, the getCSVdata function only keeps games for which all five values were found, so no row in the CSV file written by writeCSV has missing data.
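The functions above parse the files by splitting text on `}` and `,`. As an alternative, here is a minimal sketch of the same extraction step using Python's `json` module, assuming each game record can be isolated as a standalone JSON object. The function name `extract_game_row` and the OBP and ERA lookup values are hypothetical; the game fields are a trimmed copy of the sample game instance shown earlier.

```python
import json

# Trimmed copy of the sample game instance (only the fields used here)
game_json = ('{"AwayTeam":"NYY","HomeTeam":"WSH","AwayTeamRuns":9,"HomeTeamRuns":2,'
             '"AwayTeamStartingPitcherID":10000970,"HomeTeamStartingPitcherID":10000426,'
             '"ForecastTempHigh":117,"ForecastWindSpeed":9}')

# Hypothetical lookup tables standing in for the getTeamsDict / getPitcherDict output
team_obp = {"NYY": 0.330, "WSH": 0.320}
pitcher_era = {10000970: 2.8, 10000426: 3.7}


def extract_game_row(game, team_obp, pitcher_era):
    """Return [runs, OBP, ERA, temperature, wind], or None if any value is missing."""
    try:
        row = [
            game["AwayTeamRuns"] + game["HomeTeamRuns"],
            round(team_obp[game["AwayTeam"]] + team_obp[game["HomeTeam"]], 3),
            round(pitcher_era[game["AwayTeamStartingPitcherID"]]
                  + pitcher_era[game["HomeTeamStartingPitcherID"]], 1),
            game["ForecastTempHigh"],
            game["ForecastWindSpeed"],
        ]
    except (KeyError, TypeError):
        # a field is absent, null, or refers to an unknown team or pitcher
        return None
    return None if any(value is None for value in row) else row


row = extract_game_row(json.loads(game_json), team_obp, pitcher_era)
print(row)
```

Like getCSVdata, this sketch drops any game that is missing one of the five values, so the resulting rows never contain empty fields.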

# Data Analysis

### Regression

The data analysis technique I will be using is linear regression. I chose it because it is a simple and efficient way to measure the linear relationship between two variables, which works well for my purposes.
- Linear regression is essentially the process of modeling data points as the equation of a line.
- I will plot each of the features against Runs Scored and use linear regression to find the best fit line.

### Pearson Coefficient

Specifically, I will use linear regression to find the squared Pearson correlation coefficient (R²) between runs scored and each feature.
- This gives the strength of the linear relationship between two variables, here the number of runs and each other column in the data.
- The squared coefficient ranges from 0 to 1 and can be read as the fraction of variation in one variable explained by its linear relationship with the other.

The equation for the squared Pearson coefficient is as follows:
- R² = (SST - SSE) / SST
- SSE is the sum of squared residuals, the squared differences between each observed y and the value predicted by the regression line.
- SST is the total sum of squares, the squared differences between each observed y and the average y.
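To make the formula concrete, here is a small worked sketch on hypothetical toy data (not from the project's CSV files). It fits a line with `np.polyfit`, computes SSE and SST, and compares the result with the square of the Pearson coefficient returned by `np.corrcoef`.

```python
import numpy as np

# Hypothetical toy data, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Least-squares best fit line y = slope * x + intercept
slope, intercept = np.polyfit(x, y, 1)
predicted = slope * x + intercept

sse = np.sum((y - predicted) ** 2)    # variation left unexplained by the line
sst = np.sum((y - np.mean(y)) ** 2)   # total variation around the mean of y
r2 = (sst - sse) / sst

pearson_r = np.corrcoef(x, y)[0, 1]
print("R^2 from SSE/SST: {:.4f}".format(r2))
print("Pearson r squared: {:.4f}".format(pearson_r ** 2))
```

For a single feature, the two numbers agree, which is why the regression-based value can be described as the squared Pearson coefficient.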

### Goals

1. I can use the coefficients to find which feature affects the number of runs scored the most.

2. After training on 2020 MLB data to find the correlation coefficients, I will test my model on 2021 MLB data by predicting runs scored from the given features and scoring its performance.

*The training does not require a significant amount of time so it will be included in the notebook.*

In [None]:
import numpy as np

# returns a dictionary of feature names mapped to their r2 values
def get_r2(df):
  r2_values = {}
  cols = list(df.columns.values)[1:]
  x = df.loc[:,'Runs'].values

  for col in cols:
    y = df.loc[:,col].values
    
    n = np.size(x)
    dx = (n * np.sum(x * y) - np.sum(x) * np.sum(y))
    dy = (n * np.sum(np.square(x)) - np.sum(x) * np.sum(x))
    slope = dx / dy
    y0 = np.mean(y) - slope * np.mean(x)
    predicted = x * slope + y0

    SSE = np.sum(np.square(y - predicted))
    SST = np.sum(np.square(y - np.mean(y)))
    r2 = (SST - SSE) / SST

    r2_values[col] = r2
  
  return r2_values
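Note that `get_r2` fits each feature as a function of `Runs` rather than the other way around. The sketch below, on hypothetical values, suggests why this does not change the result: for a single-variable linear fit, R² is the same in both directions and matches the squared Pearson coefficient. The helper name `r_squared` is an assumption made for this example.

```python
import numpy as np


def r_squared(x, y):
    """R^2 of the least-squares line that predicts y from x."""
    n = np.size(x)
    slope = ((n * np.sum(x * y) - np.sum(x) * np.sum(y))
             / (n * np.sum(np.square(x)) - np.sum(x) ** 2))
    intercept = np.mean(y) - slope * np.mean(x)
    predicted = slope * x + intercept
    sse = np.sum(np.square(y - predicted))
    sst = np.sum(np.square(y - np.mean(y)))
    return (sst - sse) / sst


# Hypothetical values, not taken from the project's CSV files
runs = np.array([3.0, 7.0, 9.0, 12.0, 5.0, 10.0])
era = np.array([5.1, 7.4, 8.0, 9.9, 6.5, 7.2])

forward = r_squared(runs, era)    # feature predicted from runs, as in get_r2
backward = r_squared(era, runs)   # runs predicted from the feature
print(forward, backward, np.corrcoef(runs, era)[0, 1] ** 2)
```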

In [None]:
# test the precision of using the r2 values to predict runs scored
# returns a dictionary that shows the accuracy of the model and an accuracy score
def test_r2(df, r2_dict):
  runs = df.loc[:,'Runs'].values
  runs_predicted = []

  # get a list of each features percent of the total r2
  r2_value_percentages = []
  total = 0
  for key in r2_dict:
    total += r2_dict[key]
  for key in r2_dict:
    r2_value_percentages.append(r2_dict[key] / total)
  
  for i in df.index:
    prediction = 0
    prediction += r2_value_percentages[0] * df[' OBP'][i]
    prediction += r2_value_percentages[1] * df[' ERA'][i]
    prediction += r2_value_percentages[2] * df[' Temperature'][i]
    prediction += r2_value_percentages[3] * df[' Wind'][i]
    runs_predicted.append(int(prediction.round()))
  runs_predicted = np.asarray(runs_predicted)

  accuracy = {'Correct':0, 'Within 1 Run':0, 'Within 2 Runs':0,
              'Within 3 Runs':0, '4 or Greater':0}
  for i in range(len(runs)):
    if abs(runs[i] - runs_predicted[i]) == 0:
      accuracy['Correct'] += 1
    elif abs(runs[i] - runs_predicted[i]) == 1:
      accuracy['Within 1 Run'] += 1
    elif abs(runs[i] - runs_predicted[i]) == 2:
      accuracy['Within 2 Runs'] += 1
    elif abs(runs[i] - runs_predicted[i]) == 3:
      accuracy['Within 3 Runs'] += 1
    else:
      accuracy['4 or Greater'] += 1

  accuracy_score_values = {'Correct':1, 'Within 1 Run':0.75, 'Within 2 Runs':0.5,
              'Within 3 Runs':0.25, '4 or Greater':0}
  accuracy_score = (accuracy_score_values['Correct'] * accuracy['Correct'] + 
                    accuracy_score_values['Within 1 Run'] * accuracy['Within 1 Run'] + 
                    accuracy_score_values['Within 2 Runs'] * accuracy['Within 2 Runs'] + 
                    accuracy_score_values['Within 3 Runs'] * accuracy['Within 3 Runs']) / len(runs)

  return accuracy, accuracy_score

# return list of features sorted from most to least important
def correlation_order(r2_dict):
  # sort the keys by their r2 value; features with equal values are each kept once
  return sorted(r2_dict, key=r2_dict.get, reverse=True)

# outputs the results of the testing and training
def output_results():
  df_training = get_training_df()
  df_testing = get_testing_df()

  r2_values = get_r2(df_training[:550])
  keys_sorted = correlation_order(r2_values)
  accuracy, accuracy_score = test_r2(df_testing, r2_values)

  print("Pearson Coefficients (As percentages out of 100):")
  for key in keys_sorted:
    print("{0}: {1:.4f}%".format(key, r2_values[key] * 100))

  print()
  print("Accuracy of Model Predictions:")
  for key in accuracy:
    print(" {0}: {1}".format(key, accuracy[key]))

  print()
  print("Accuracy Score (As percentage out of 100):")
  print(" {:.4f}%".format(accuracy_score * 100))

output_results()

Pearson Coefficients (As percentages out of 100):
 ERA: 8.4838%
 OBP: 0.2266%
 Temperature: 0.0934%
 Wind: 0.0023%

Accuracy of Model Predictions:
 Correct: 47
 Within 1 Run: 104
 Within 2 Runs: 71
 Within 3 Runs: 55
 4 or Greater: 107

Accuracy Score (As percentage out of 100):
 45.3776%
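To make the prediction step in `test_r2` concrete, here is a minimal sketch of how one prediction is formed: each feature's R² is divided by the total R² to get a weight, and the prediction is the weighted sum of the feature values. The R² values are the rounded figures printed above; the game's feature values are hypothetical.

```python
# Rounded R^2 values from the printed output, in the CSV column order
r2_values = {' OBP': 0.002266, ' ERA': 0.084838,
             ' Temperature': 0.000934, ' Wind': 0.000023}

# Each feature's share of the total R^2 becomes its weight
total = sum(r2_values.values())
weights = {name: value / total for name, value in r2_values.items()}

# Hypothetical game: combined OBP, combined starter ERA, temperature, wind speed
game = {' OBP': 0.650, ' ERA': 8.0, ' Temperature': 70, ' Wind': 10}

prediction = sum(weights[name] * game[name] for name in game)
print("Weights:", {name: round(w, 4) for name, w in weights.items()})
print("Predicted runs: {:.2f} (rounded to {})".format(prediction, round(prediction)))
```

Because ERA carries most of the total R², the prediction is driven almost entirely by the combined ERA of the two starting pitchers.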


# Conclusion

If the cells above are run correctly, the output should look like:

### Pearson Coefficients (As percentages out of 100):
 - ERA: 8.4838%
 - OBP: 0.2266%
 - Temperature: 0.0934%
 - Wind: 0.0023%

### Accuracy of Model Predictions:
 - Correct: 47
 - Within 1 Run: 104
 - Within 2 Runs: 71
 - Within 3 Runs: 55
 - 4 or Greater: 107

### Accuracy Score (As percentage out of 100):
 - 45.3776%

### My analysis of this data has led me to these conclusions:
1. Earned Run Average was the only feature with a noticeable relationship to the number of runs scored (about 8.5% of the variation explained); every other feature explained less than 1%.
2. Pitching appears to be much more important than batting when it comes to runs scored: OBP is a strong measure of a team's batting strength, yet it had little relationship with the runs.
3. Wind and temperature are almost irrelevant, as the number of runs seems to remain consistent in all conditions.

### In addition to this, my model was modest in its prediction strength
1. It predicted about 12% of the games exactly correct.
2. About 72% of predicted games were within 3 runs of their actual result.
3. I created the accuracy score myself, so how much weight to give it is up to the reader.
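The percentages above follow directly from the printed counts. As a check, this short sketch recomputes the accuracy score and the two shares using the same partial-credit weights as `test_r2`:

```python
# Counts from the model output and the partial-credit weights used in test_r2
counts = {'Correct': 47, 'Within 1 Run': 104, 'Within 2 Runs': 71,
          'Within 3 Runs': 55, '4 or Greater': 107}
weights = {'Correct': 1, 'Within 1 Run': 0.75, 'Within 2 Runs': 0.5,
           'Within 3 Runs': 0.25, '4 or Greater': 0}

total_games = sum(counts.values())

# Weighted share of games, with less credit the further off the prediction was
accuracy_score = sum(weights[key] * counts[key] for key in counts) / total_games

exact_share = counts['Correct'] / total_games
within_three_share = (total_games - counts['4 or Greater']) / total_games

print("Games tested: {}".format(total_games))
print("Accuracy score: {:.4f}%".format(accuracy_score * 100))
print("Exactly correct: {:.1f}%".format(exact_share * 100))
print("Within 3 runs: {:.1f}%".format(within_three_share * 100))
```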

Overall, my experience with this project was great and I learned a lot about data analysis techniques. I hope you did as well.

### Thanks!


---
# Submission Guidelines (keep this section here)
---


When you are ready to submit your project, part of the submission process will be to register your notebook for reviewing.  

You will also receive the links and instructions to do the peer reviews.

Please review the metadata:

In [None]:
def get_metadata():
  meta = {
          "title": PROJECT_TITLE, # keep this as is
          "nb_id": NOTEBOOK_ID,   # keep this as is

          "data" : ["https://drive.google.com/uc?export=download&id=1HhBn-Ir3M4sRYBiebGHxVtIHWbLK3VfZ", 
                    "https://drive.google.com/uc?export=download&id=1EWhiGKImrr2HQREVi6UzZDpaJg8wZEsZ"],

          # permissions
          # do you give the instructor the permission to copy this project
          # and allow others to view it in the class gallery?
          "allow_gallery": True,
          
          # if your project is made viewable to others,
          # do you want to include your name (first/last)?
          "allow_name_release": True
          }
  return meta

Specific instructions will come for what to submit for the various milestones.

If necessary, you can download the Python version of this notebook by using the `File->Download .py` as well as the notebook itself `File->Download .ipynb`.

