# 01-all-sec-scraping

>Data scraping for all-SEC teams dating back to 2003

This notebook scrapes historical all-SEC teams from Wikipedia for the 2003 through 2020 seasons. The data will be used for further analysis, including developing a better understanding of where Vanderbilt has seen success on the recruiting trail. Season, player, and school are scraped directly from each page, while position was collected manually (and remains free of errors). The position data is then joined column-wise with the scraped data to produce the final dataframe used in analysis.

### Data Scraping

In [1]:
#import relevant packages
import requests as rq
from bs4 import BeautifulSoup as bs
import pandas as pd
import numpy as np
import janitor
import unidecode

In [2]:
#initialize range of seasons to loop over
seasons = range(2003,2021)

In [3]:
#initialize dataframe to store all-SEC data over range of seasons
all_sec_df = pd.DataFrame({'season': [],
                           'player': [],
                           'school': []})

In [4]:
#iterate over each season
for s in seasons:
    
    #build the url for the current season
    url = f'https://en.wikipedia.org/wiki/{s}_All-SEC_football_team'
    #use get to access the url and save the page; stop early if the request failed
    page = rq.get(url)
    page.raise_for_status()
    #save the html content of the page
    soup = bs(page.content, 'html.parser')
    
    #grab the player data from the html content using css selectors and find number of observations
    player_data = soup.select('h3+ ul li')
    n_players = len(player_data)
    
    #parse html data to find name and school data (remove accents from name data)
    name_data = [unidecode.unidecode(p.text.split(', ')[0]) for p in player_data]
    school_data = [p.text.split(', ')[1] for p in player_data]
    
    #construct cleaned datasets for season, name, and school for each observation
    season_data_clean = np.repeat(s, n_players)
    name_data_clean = [n if n[-1].isalpha() else n[:-1] for n in name_data]
    school_data_clean = [sch.split(' (')[0] for sch in school_data]
    
    #save constructed datasets for current season to temporary dataframe and append to the initialized dataframe
    all_sec_df_temp = pd.DataFrame({'season': season_data_clean,
                                    'player': name_data_clean,
                                    'school': school_data_clean})
    #DataFrame.append was removed in pandas 2.0, so use pd.concat instead
    all_sec_df = pd.concat([all_sec_df, all_sec_df_temp], ignore_index = True)

#change season datatype to integer
all_sec_df = all_sec_df.astype({'season': 'int64'})
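
The parsing inside the loop above assumes every list item looks like `Name, School (selectors)`. As a minimal sketch of the same splitting logic, the hypothetical helper `parse_entry` below (not part of the original notebook) returns `None` for items without a `', '` separator instead of raising an `IndexError`. The sample strings in the usage lines are made up for illustration and are not copied from Wikipedia; accent removal with `unidecode` is left out to keep the sketch dependency-free.

```python
def parse_entry(text):
    """Split 'Name, School (selectors)' into (name, school).

    Hypothetical helper: returns None when the text has no ', ' separator.
    """
    name, sep, rest = text.partition(', ')
    if not sep:
        return None
    # strip trailing markers (asterisks, footnote brackets, digits) from the name,
    # keeping a final period so suffixes such as 'Jr.' survive
    while name and not (name[-1].isalpha() or name[-1] == '.'):
        name = name[:-1]
    # keep only the text before the first ' (' as the school
    school = rest.split(' (')[0].strip()
    return name, school


# illustrative inputs only
print(parse_entry('Eli Manning, Ole Miss (AP-1, Coaches-1)'))
print(parse_entry('Matt Mauck*, LSU (AP-2)'))
print(parse_entry('Key'))
```

Filtering out the `None` results before building the temporary dataframe would keep stray list items (legends, keys) out of `all_sec_df`, at the cost of changing the row count that the position data is later matched against.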

In [5]:
#display all-SEC dataframe
all_sec_df

Unnamed: 0,season,player,school
0,2003,Eli Manning,Ole Miss
1,2003,Matt Mauck,LSU
2,2003,David Greene,Georgia
3,2003,Carnell Williams,Auburn
4,2003,Cedric Cobbs,Arkansas
...,...,...,...
1214,2020,Max Duffy,Kentucky
1215,2020,Zach Von Rosenberg,LSU
1216,2020,Kadarius Toney,Florida
1217,2020,Jerrion Ealy,Ole Miss


### Including Position Data

In [8]:
#read in manually collected data in order to add in player positions later on
pos_totals = pd.read_csv('preliminary-data/all-sec-position-totals.csv').clean_names()

In [9]:
#repeat each index label by the count in the value column, select rows with the repeated indices, reset the index, and keep only the position column
pos_totals = pos_totals.loc[pos_totals.index.repeat(pos_totals['value'])].reset_index(drop=True)[['position']]
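
To make the expansion step above concrete, here is a small sketch on a toy dataframe. The column names `position` and `value` match the ones used in the cell above; the three rows of data are invented for illustration and do not come from `all-sec-position-totals.csv`.

```python
import pandas as pd

# toy stand-in for the manually collected totals (invented values)
toy = pd.DataFrame({'position': ['QB', 'RB', 'P'],
                    'value': [2, 3, 1]})

# repeat each index label `value` times, then select those labels with .loc
expanded = toy.loc[toy.index.repeat(toy['value'])].reset_index(drop=True)[['position']]

print(expanded['position'].tolist())
# ['QB', 'QB', 'RB', 'RB', 'RB', 'P']
```

Each totals row becomes `value` consecutive rows, so the length of the expanded frame equals the sum of the `value` column.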

In [10]:
#display new position dataframe to be bound column-wise to `all_sec_df`
pos_totals

Unnamed: 0,position
0,QB
1,QB
2,RB
3,RB
4,RB
...,...
1214,P
1215,P
1216,P
1217,ST


### Final Merge and Export

In [11]:
#bind the two dataframes (column-wise)
all_sec_data = pd.concat([all_sec_df, pos_totals], axis = 1)
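
Because `pd.concat(..., axis = 1)` aligns rows on the index, the bind above pairs row `i` of the scraped data with row `i` of the position data. That is only correct if both frames have the same length and the same ordering. A minimal sketch of a guard for the length condition follows; `bind_columns` is a hypothetical helper, not part of the original notebook, and the toy frames are invented.

```python
import pandas as pd


def bind_columns(left, right):
    """Column-bind two frames, refusing to proceed if row counts differ."""
    if len(left) != len(right):
        raise ValueError(f'row mismatch: {len(left)} vs {len(right)}')
    # reset both indexes so rows are paired strictly by position
    return pd.concat([left.reset_index(drop=True),
                      right.reset_index(drop=True)], axis=1)


# toy frames with matching lengths (invented values)
players = pd.DataFrame({'player': ['A', 'B', 'C']})
positions = pd.DataFrame({'position': ['QB', 'RB', 'RB']})

bound = bind_columns(players, positions)
print(bound.shape)
# (3, 2)
```

A length check cannot detect rows that are in a different order, so spot-checking a few known players against their positions is still worthwhile.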

In [12]:
#display final dataframe `all_sec_data`
all_sec_data

Unnamed: 0,season,player,school,position
0,2003,Eli Manning,Ole Miss,QB
1,2003,Matt Mauck,LSU,QB
2,2003,David Greene,Georgia,RB
3,2003,Carnell Williams,Auburn,RB
4,2003,Cedric Cobbs,Arkansas,RB
...,...,...,...,...
1214,2020,Max Duffy,Kentucky,P
1215,2020,Zach Von Rosenberg,LSU,P
1216,2020,Kadarius Toney,Florida,P
1217,2020,Jerrion Ealy,Ole Miss,ST


In [14]:
#write dataframe into a csv
all_sec_data.to_csv('preliminary-data/all-sec-data.csv', index = False)
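
As a quick illustration of why `index = False` is passed above, the sketch below writes a one-row frame to an in-memory buffer both ways and reads it back. The in-memory buffer stands in for the CSV file, and the single row reuses the first record displayed earlier.

```python
import io

import pandas as pd

df = pd.DataFrame({'season': [2003], 'player': ['Eli Manning'],
                   'school': ['Ole Miss'], 'position': ['QB']})

# without the index: the columns survive the round trip unchanged
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
clean = pd.read_csv(buf)

# with the index: read_csv sees a header-less first column and names it 'Unnamed: 0'
buf_idx = io.StringIO()
df.to_csv(buf_idx, index=True)
buf_idx.seek(0)
with_index = pd.read_csv(buf_idx)

print(list(clean.columns))
print(list(with_index.columns))
```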