# Baseball Stats

In this notebook, I load a dataset of Major League Baseball statistics from [Kaggle](https://www.kaggle.com/) and use the pandas library to look for interesting patterns. The dataset is located [here](https://www.kaggle.com/kaggle/the-history-of-baseball).

## Loading libraries and data

In [7]:
import pandas as pd
import numpy as np

batting_data = pd.read_csv('baseball/batting.csv').set_index('player_id')
player_data = pd.read_csv('baseball/player.csv').set_index('player_id')
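
To make the effect of `set_index('player_id')` concrete, here is a minimal sketch on a made-up miniature of the batting file. The IDs and numbers are hypothetical, and the assumption is that the real file has one row per player-season, so the resulting index is not unique.

```python
import io
import pandas as pd

# Hypothetical miniature of batting.csv: one row per player-season.
csv_text = """player_id,year,hr,sb
playera01,2001,10,2
playera01,2002,15,
playerb01,2001,3,20
"""

toy_batting = pd.read_csv(io.StringIO(csv_text)).set_index('player_id')

# A player appears once per season, so the index contains repeats.
print(toy_batting.index.is_unique)

# Selecting by ID returns every season for that player.
print(toy_batting.loc['playera01'])
```

Because the index repeats, career totals need an aggregation step rather than a simple lookup. The empty `sb` field also shows how missing values arrive as `NaN`.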

I'll start with something simple: using a pandas pivot table to aggregate each player's seasons and total their home runs.

In [4]:
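# Recent pandas versions return a DataFrame here, so select the 'hr' column to get a Series.
hrs = batting_data.pivot_table(index=batting_data.index, values='hr', aggfunc='sum')['hr'].sort_values(ascending=False).dropna()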
print(hrs.head())

player_id
bondsba01    762
aaronha01    755
ruthba01     714
rodrial01    687
mayswi01     660
Name: hr, dtype: float64
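
The same per-player total can also be written as a group-by, which some readers may find easier to follow than a pivot table. This is a sketch on hypothetical toy data, not the notebook's own method. Passing `min_count=1` is one way to keep players with no recorded values as `NaN` so that `dropna()` removes them.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; playerc01 has no recorded home run value.
toy = pd.DataFrame(
    {'hr': [10.0, 15.0, 3.0, np.nan]},
    index=pd.Index(
        ['playera01', 'playera01', 'playerb01', 'playerc01'],
        name='player_id',
    ),
)

# Group rows that share a player_id and sum their home runs.
# min_count=1 makes an all-NaN group sum to NaN instead of 0.
toy_hrs = (
    toy.groupby(level='player_id')['hr']
    .sum(min_count=1)
    .sort_values(ascending=False)
    .dropna()
)
print(toy_hrs)
```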


Now for something a bit more complicated: finding the players with the highest and lowest ratios of home runs to stolen bases. I'll only consider players with more than 300 home runs.

In [48]:
# As with hrs, select the 'sb' column so the result is a Series.
sbs = batting_data.pivot_table(index=batting_data.index, values='sb', aggfunc='sum')['sb'].sort_values(ascending=False).dropna()
hrs_sbs = pd.concat((hrs, sbs), axis=1).sort_values('hr', ascending=False)
hrs_sbs['hr_to_sb_ratio'] = hrs_sbs['hr'] / hrs_sbs['sb']
hr_sb_ratios = hrs_sbs[hrs_sbs['hr'] > 300].sort_values('hr_to_sb_ratio', ascending=False)
print(pd.concat((hr_sb_ratios.head(), hr_sb_ratios.tail())))

            hr   sb  hr_to_sb_ratio
fieldce01  319    2      159.500000
buhneja01  310    6       51.666667
konerpa01  439    9       48.777778
mcgwima01  583   12       48.583333
howarfr01  382    8       47.750000
beltrca01  392  311        1.260450
baylodo01  338  285        1.185965
sandere02  305  304        1.003289
finlest01  304  320        0.950000
bondsbo01  332  461        0.720174


The winner, with a whopping 159.5:1 ratio of home runs to stolen bases, is Cecil Fielder. At the opposite end is Bobby Bonds, with a 0.72:1 ratio.
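
One edge case worth keeping in mind with a ratio like this is a player with zero stolen bases, since dividing by zero in pandas yields `inf` and would put that player at the top. The sketch below uses hypothetical toy data and shows one possible way to handle it, by masking zeros as `NaN` before dividing. Nothing in the output above suggests this occurred in the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy totals; playerb01 never stole a base.
toy = pd.DataFrame(
    {'hr': [320, 310, 305], 'sb': [2, 0, 304]},
    index=pd.Index(['playera01', 'playerb01', 'playerc01'], name='player_id'),
)

# Replace zero stolen bases with NaN so the ratio is NaN rather than inf.
toy['hr_to_sb_ratio'] = toy['hr'] / toy['sb'].replace(0, np.nan)

# Drop players without a defined ratio, then rank the rest.
ranked = (
    toy.dropna(subset=['hr_to_sb_ratio'])
    .sort_values('hr_to_sb_ratio', ascending=False)
)
print(ranked)
```

Whether such players should be excluded or reported separately is a judgment call; this sketch simply excludes them.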

I'm going to look at the player data now, which lists information such as birth city, height, weight, handedness, and other details.
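
Since both tables are indexed by `player_id`, player details can be attached to per-player totals with an index join. The sketch below uses hypothetical stand-in frames, and the `name_first` and `name_last` column names are assumptions about the player file rather than something shown in this notebook.

```python
import pandas as pd

# Hypothetical per-player results, indexed by player_id.
toy_ratios = pd.DataFrame(
    {'hr_to_sb_ratio': [2.5, 0.5]},
    index=pd.Index(['playera01', 'playerb01'], name='player_id'),
)

# Hypothetical stand-in for player.csv; column names are assumed.
toy_players = pd.DataFrame(
    {
        'name_first': ['Alex', 'Blake', 'Casey'],
        'name_last': ['Example', 'Sample', 'Placeholder'],
    },
    index=pd.Index(['playera01', 'playerb01', 'playerc01'], name='player_id'),
)

# A left join on the shared index keeps only players present in toy_ratios.
named = toy_ratios.join(toy_players[['name_first', 'name_last']])
named['full_name'] = named['name_first'] + ' ' + named['name_last']
print(named[['full_name', 'hr_to_sb_ratio']])
```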

First, which city is the birthplace of the largest number of baseball players throughout history?

In [22]:
print(player_data['birth_city'].value_counts().head())

Chicago         375
Philadelphia    355
St. Louis       298
New York        267
Brooklyn        240
Name: birth_city, dtype: int64
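
Counting by city name alone merges different cities that share a name. A possible refinement, sketched here on hypothetical toy rows, is to count by city and state together using the `birth_state` column.

```python
import pandas as pd

# Hypothetical toy rows: two different cities share the name Springfield.
toy_players = pd.DataFrame({
    'birth_city': ['Springfield', 'Springfield', 'Springfield', 'Chicago'],
    'birth_state': ['IL', 'MA', 'IL', 'IL'],
})

# Counting by name alone lumps both Springfields together.
by_city = toy_players['birth_city'].value_counts()

# Counting by (city, state) keeps them apart.
by_city_state = (
    toy_players.groupby(['birth_city', 'birth_state'])
    .size()
    .sort_values(ascending=False)
)
print(by_city)
print(by_city_state)
```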


Which states are the birthplace of the most and fewest baseball players?

In [39]:
print(player_data[player_data['birth_country'] == 'USA']['birth_state'].value_counts())

CA    2160
PA    1417
NY    1207
IL    1054
OH    1035
TX     891
MA     665
MO     604
FL     497
MI     432
NJ     427
NC     399
IN     373
GA     344
AL     322
MD     309
TN     296
VA     286
KY     281
OK     259
LA     247
WI     244
IA     218
KS     211
CT     204
MS     200
WA     195
SC     183
MN     165
AR     152
OR     127
WV     120
NE     113
DC     102
AZ      99
CO      89
RI      78
ME      78
DE      54
NH      53
HI      40
UT      39
VT      38
SD      38
NV      36
ID      29
NM      28
MT      24
ND      16
WY      15
AK      11
Name: birth_state, dtype: int64
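
When only the extremes matter, the full listing can be trimmed with `nlargest` and `nsmallest`. This is a small sketch that uses a handful of the counts printed above rather than the full data.

```python
import pandas as pd

# A few of the state counts copied from the output above.
state_counts = pd.Series(
    {'CA': 2160, 'PA': 1417, 'NY': 1207, 'ND': 16, 'WY': 15, 'AK': 11},
    name='birth_state',
)

# The three states with the most and the fewest players in this subset.
most = state_counts.nlargest(3)
fewest = state_counts.nsmallest(3)
print(most)
print(fewest)
```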


This all makes sense. According to the [2014 census data](https://simple.wikipedia.org/wiki/List_of_U.S._states_by_population), California has the largest population in the United States, while Alaska is in the bottom five.