### Scraping MLB stats from ESPN go ###

In this homework we will analyze the batting performance of the teams in Major League Baseball, using the data available at http://espn.go.com/mlb/stats/team/_/stat/batting

We will first scrape the page for the 2015 season.

In [None]:
url="http://espn.go.com/mlb/stats/team/_/stat/batting/year/2015"

We will now scrape the table found at the above link. You can follow the class notebook to see how to capture HTML table tags.
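
As a refresher on the markup involved: an HTML table is a `<table>` element whose rows are `<tr>` tags and whose cells are `<td>` tags. The sketch below walks a small made-up table with the standard library's `html.parser` (Python 3), used here only so that the example runs without extra installs; the homework itself uses BeautifulSoup. The HTML snippet, the `TableRows` class, and the stat values are all hypothetical.

```python
from html.parser import HTMLParser  # Python 3 standard library

# Made-up snippet shaped like a stats table: one header row, two team rows.
HTML = """
<table>
  <tr align="right"><td>RK</td><td>TEAM</td><td>HR</td></tr>
  <tr align="right"><td>1</td><td>Team A</td><td>200</td></tr>
  <tr align="right"><td>2</td><td>Team B</td><td>150</td></tr>
</table>
"""


class TableRows(HTMLParser):
    """Collect the text of every <td>, grouped by its enclosing <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []        # finished rows, each a list of cell strings
        self.in_cell = False  # True while the parser is inside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])      # start a new row
        elif tag == "td":
            self.in_cell = True
            self.rows[-1].append("")  # start a new, empty cell

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        # Text outside a <td> (e.g. whitespace between tags) is ignored.
        if self.in_cell:
            self.rows[-1][-1] += data


parser = TableRows()
parser.feed(HTML)
header, team_rows = parser.rows[0], parser.rows[1:]
print(header)
print(team_rows)
```

The first row plays the role of the header and the remaining rows hold one team each, which is the same split a scraper makes after collecting the `<tr>` tags.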

Q1) Write a function that takes the above URL and returns a pandas dataframe corresponding to the table found at that link.

In [None]:
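from bs4 import BeautifulSoup
try:
    import urllib2  # Python 2
except ImportError:
    import urllib.request as urllib2  # Python 3: provides the same urlopen()
import pandas as pd
%matplotlib inline

import matplotlib
import numpy as np
import matplotlib.pyplot as plt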

In [None]:
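def SCRAPE_ESPN_MLB_STATS(url):
    """Scrape the team batting table at `url` into a dataframe indexed by TEAM."""
    page = urllib2.urlopen(url).read()
    soup = BeautifulSoup(page, 'html.parser')

    # Header and team rows are the <tr align="right"> rows of the first table.
    rows = soup.table.findAll('tr', {"align": "right"})
    header_row = rows[0]
    team_rows = rows[1:-1]  # skip the header row and exclude the last row

    # One converter per column, in the same order as the table's columns.
    types = [int, str, int, int, int, int, int, int, int, int, int,
             float, float, float, float]

    headers = [c.text for c in header_row.findAll("td")]

    # Build one list of converted values per column header.
    t = {header: [] for header in headers}
    for tr in team_rows:
        cols = tr.findAll('td')
        for i, header in enumerate(headers):
            t[header].append(types[i](cols[i].text))

    df = pd.DataFrame(t, columns=headers)  # keep the table's column order
    df = df.set_index(["TEAM"])
    return df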
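
The `types` list in the scraper pairs each column with a converter that is applied cell by cell. A minimal sketch of that idea on a single made-up row follows; the values and the shortened four-column layout are hypothetical.

```python
# One converter per column, in column order (shortened, hypothetical layout).
types = [int, str, int, float]

# Cell text as it would come out of the <td> tags of a single row.
cells = ["1", "Team A", "200", ".269"]

# Pair each converter with its cell and apply it.
row = [convert(text) for convert, text in zip(types, cells)]
print(row)
```

Note that `int` raises `ValueError` on text such as `"1,234"`, so a column containing thousands separators would need them stripped before conversion.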
        

Run the above function to scrape season 2015 stats.

In [None]:
df_2015 = SCRAPE_ESPN_MLB_STATS(url)
df_2015

We will now produce plots comparing the teams on different batting statistics.

Q2) Write a function that takes the above dataframe and a list of column names as input and produces one plot for each of the column names provided.

In [None]:
def produce_plots(df, col_names):
    for col_name in col_names:
        # One figure per column: teams sorted from lowest to highest value.
        fig, ax = plt.subplots()
        df[col_name].sort_values().plot(kind="bar", ax=ax, title=col_name)
    


Call the above function for the following columns:

1. HR: Home Runs
2. TB: Total Bases
3. RBI: Runs Batted In

In [None]:
produce_plots(df_2015, ["HR", "TB", "RBI"])

Q3) We will now use the above functions to scrape more seasons and analyze performance over the 6 years from 2010 to 2015.

In [None]:
dfs={}
# year is the last digit of the season: 0..5 maps to 2010..2015
for year in range(0,6):
    link = 'http://espn.go.com/mlb/stats/team/_/stat/batting/year/201'+str(year)
    dfs[year]=SCRAPE_ESPN_MLB_STATS(link)
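
One way to avoid rebuilding the season label from a single digit later on is to key the dictionary by the full four-digit season. The sketch below only builds the URLs and makes no network requests; `BASE_URL`, `SEASONS`, and `season_urls` are illustrative names, not part of the homework.

```python
# Base of the per-season batting pages, as used in the cells above.
BASE_URL = "http://espn.go.com/mlb/stats/team/_/stat/batting/year/"

# The six seasons from 2010 through 2015, inclusive.
SEASONS = range(2010, 2016)

# Map each full season (e.g. 2010) to the URL of its stats page.
season_urls = {season: BASE_URL + str(season) for season in SEASONS}

for season in sorted(season_urls):
    print(season, season_urls[season])
```

Each URL could then be passed to `SCRAPE_ESPN_MLB_STATS`, with the dictionary key used directly as the season value.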
    


In order to analyze the performance of teams across seasons, we need all the data in a single dataframe.

Q4) Use appropriate pandas functions to combine the above dictionary of year:dataframe pairs into one dataframe that has a new column for the year/season.

In [None]:
years=[]
combined_df=None

for y,df in dfs.items():
    season = "201"+str(y)  # keys 0..5 correspond to seasons 2010..2015
    df["SEASON"] = season
    years.append(df)

combined_df = pd.concat(years)
combined_df
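
An alternative to adding the column by hand is to let `pd.concat` build it from the dictionary keys. The sketch below assumes a dictionary keyed by full season and uses two tiny made-up tables; the team names and HR values are hypothetical.

```python
import pandas as pd

# Two tiny made-up season tables indexed by team (values are hypothetical).
teams = pd.Index(["Team A", "Team B"], name="TEAM")
dfs = {
    2014: pd.DataFrame({"HR": [150, 120]}, index=teams),
    2015: pd.DataFrame({"HR": [200, 130]}, index=teams),
}

# Passing a dict to pd.concat uses its keys as an outer index level;
# `names` labels that level SEASON and the inner level TEAM.
combined = pd.concat(dfs, names=["SEASON", "TEAM"])

# Move SEASON out of the index into an ordinary column, keeping TEAM as the index.
combined = combined.reset_index(level="SEASON")
print(combined)
```

The result has one row per team and season, with `SEASON` as a regular integer column next to the stats.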

Q5) Now write a function that takes the above dataframe and a list of column names and produces a set of plots for each of the columns provided. Each set contains one line graph per team in the dataframe, showing that team's values for the column over the 6 years.

In [None]:
def produce_plots_over_seasons(combined_df, col_names):
    for col_name in col_names:
        # One figure per team, in alphabetical order.
        for team in sorted(set(combined_df.index)):
            # Rows for this team, ordered by season (SEASON is stored as a string).
            team_df = combined_df.loc[[team]].sort_values("SEASON")
            seasons = team_df["SEASON"].astype(int)

            fig, ax = plt.subplots()
            ax.plot(seasons, team_df[col_name].values, marker="o", label=col_name)
            ax.set_xticks(seasons.values)  # one tick per season
            ax.set_title(team)
            ax.legend(loc='best')

Call the above function for the following columns:

1. HR: Home Runs
2. TB: Total Bases
3. RBI: Runs Batted In

In [None]:
produce_plots_over_seasons(combined_df, ["HR", "TB", "RBI"])