# Baseball Analysis

##### The data in this analysis was taken from http://www.seanlahman.com/baseball-archive/statistics/

In [1]:
# Let's start by importing all the necessary libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 

### My first question concerning this dataset is:
### 1) What possible factors are strongly correlated with receiving more votes for awards?

##### (I will be looking for correlations of at least <u>0.7</u> based on https://www.andrews.edu/~calkins/math/edrm611/edrm05.htm)
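
As a minimal sketch of how that 0.7 cut-off could be applied, the snippet below computes Pearson correlations against `pointsWon` and keeps only the factors that reach the threshold. The `toy` frame, its `wins` and `games` columns, and the `THRESHOLD` name are made up for illustration and are not taken from the Lahman data.

```python
import pandas as pd

# Made-up illustrative numbers, NOT taken from the Lahman data.
toy = pd.DataFrame({
    "pointsWon": [10, 20, 30, 40, 50],
    "wins":      [60, 70, 85, 90, 100],      # rises with pointsWon
    "games":     [162, 160, 162, 161, 162],  # roughly flat
})

THRESHOLD = 0.7  # cut-off for a "strong" correlation (hypothetical constant name)

# Pearson correlation of every other column against pointsWon
corr_with_points = toy.corr()["pointsWon"].drop("pointsWon")

# Keep only the factors whose absolute correlation meets the threshold
strong = corr_with_points[corr_with_points.abs() >= THRESHOLD]
print(strong)
```

Taking the absolute value means a strongly negative relationship would also be kept, which seems reasonable when the goal is to find any strong association.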

#### Since the number of votes received already describes who won the award, I decided to use only the AwardsShareManagers and AwardsSharePlayers datasets for analysis.

In [11]:
# Manager awards and player awards probably have different correlating factors, so let's analyze them separately.

manager_awards = pd.read_csv('baseball/AwardsShareManagers.csv')
player_awards = pd.read_csv('baseball/AwardsSharePlayers.csv')

print("First 5 rows of Manager Award Points Data...\n")
print(manager_awards.head())
print("\nFirst 5 rows of Player Award Points Data...\n")
print(player_awards.head())

First 5 rows of Manager Award Points Data...

           awardID  yearID lgID   playerID  pointsWon  pointsMax  votesFirst
0  Mgr of the year    1983   AL  altobjo01          7         28           7
1  Mgr of the year    1983   AL    coxbo01          4         28           4
2  Mgr of the year    1983   AL  larusto01         17         28          17
3  Mgr of the year    1983   NL  lasorto01         10         24          10
4  Mgr of the year    1983   NL  lillibo01          9         24           9

First 5 rows of Player Award Points Data...

    awardID  yearID lgID   playerID  pointsWon  pointsMax  votesFirst
0  Cy Young    1956   ML   fordwh01        1.0         16         1.0
1  Cy Young    1956   ML  maglisa01        4.0         16         4.0
2  Cy Young    1956   ML  newcodo01       10.0         16        10.0
3  Cy Young    1956   ML  spahnwa01        1.0         16         1.0
4  Cy Young    1957   ML  donovdi01        1.0         16         1.0


#### After looking at https://en.wikipedia.org/wiki/Major_League_Baseball_Manager_of_the_Year_Award, it looks like the final award score is calculated from first-place votes together with lower-place votes. "pointsWon" is most likely this score, since "votesFirst" is the only other vote-related column. Either way, it reflects a candidate's popularity better than first-place votes alone.

#### However, it seems that the maximum number of voting points varies by league and year. Therefore, I will address the above question using both "pointsWon" and the ratio "pointsWon" / "pointsMax".
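
A small sketch of that ratio, using three of the 1983 manager rows printed earlier. The `sample` frame and the `pointsRatio` column name are assumptions introduced here for illustration.

```python
import pandas as pd

# Three rows copied from the manager output shown earlier (1983 season)
sample = pd.DataFrame({
    "playerID":  ["altobjo01", "larusto01", "lasorto01"],
    "lgID":      ["AL", "AL", "NL"],
    "pointsWon": [7, 17, 10],
    "pointsMax": [28, 28, 24],
})

# "pointsRatio" is a hypothetical column name: the share of available points won
sample["pointsRatio"] = sample["pointsWon"] / sample["pointsMax"]
print(sample)
```

Because the AL ballot allowed 28 points and the NL ballot 24, the ratio puts candidates from different leagues and years on the same 0 to 1 scale.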

In [16]:
# Let's check whether pointsWon is ever less than votesFirst

print("The number of instances where votesFirst is greater than pointsWon")
print(" - For managers:  {}".format(len(manager_awards.loc[manager_awards["votesFirst"] > manager_awards["pointsWon"]])))
print(" - For players:  {}".format(len(player_awards.loc[player_awards["votesFirst"] > player_awards["pointsWon"]])))

The number of instances where votesFirst is greater than pointsWon
 - For managers:  0
 - For players:  1


### After checking this one instance, there does not seem to be a significant difference. Therefore, we will keep the data as is.
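
One way such an instance could be inspected is to filter with the same condition and look at the size of the gap. The rows below are made up (the player IDs are hypothetical) and only mimic the shape of the player data; they are not the actual record found in the dataset.

```python
import pandas as pd

# Made-up rows for illustration only; the second row mimics the anomaly.
toy_players = pd.DataFrame({
    "playerID":   ["aaaaaa01", "bbbbbb01", "cccccc01"],  # hypothetical IDs
    "pointsWon":  [10.0, 3.0, 5.0],
    "votesFirst": [4.0, 4.0, 0.0],
})

# Same condition as the count above, but keeping the rows instead of counting them
mask = toy_players["votesFirst"] > toy_players["pointsWon"]
anomalies = toy_players.loc[mask].copy()

# How far votesFirst exceeds pointsWon in each flagged row
anomalies["difference"] = anomalies["votesFirst"] - anomalies["pointsWon"]
print(anomalies)
```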
### Here are some plausible factors: 
<li>For Managers: Number of wins, cumulative team rank, number of games managed </li>
<li>For Players: Batters can be measured by their doubles, triples, and home runs. Pitchers can be measured by outs pitched and saves. </li>
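
A rough sketch of how the batting factors could be tested: join the award points to per-season batting totals on `playerID` and `yearID`, then correlate each factor with `pointsWon`. All rows here are made up, and the column names `2B`, `3B`, and `HR` are assumed to match the Lahman Batting table.

```python
import pandas as pd

# Made-up rows for illustration; NOT real award or batting data.
toy_awards = pd.DataFrame({
    "playerID":  ["p1", "p2", "p3", "p4"],   # hypothetical IDs
    "yearID":    [2000, 2000, 2000, 2000],
    "pointsWon": [5.0, 20.0, 12.0, 30.0],
})
toy_batting = pd.DataFrame({
    "playerID": ["p1", "p2", "p3", "p4"],
    "yearID":   [2000, 2000, 2000, 2000],
    "2B":       [20, 35, 15, 40],   # doubles (assumed column name)
    "3B":       [5, 1, 4, 2],       # triples (assumed column name)
    "HR":       [8, 30, 18, 45],    # home runs (assumed column name)
})

# Match each award candidate to that player's batting line for the same season
merged = toy_awards.merge(toy_batting, on=["playerID", "yearID"], how="inner")

# Pearson correlation of each batting factor with pointsWon
factor_corr = merged[["2B", "3B", "HR"]].corrwith(merged["pointsWon"])
print(factor_corr)
```

With the real tables, a player may appear more than once per season (for example after a mid-season trade), so the batting rows would probably need to be summed per player and year before merging.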

