<center> <h1>Prime DFS </h1></center>
<center> <h1>Web Scraping NFL Data </h1></center>


<center><h1> Notebook Goal: </h1></center>
<center><H4>Create a CSV file of relevant NFL stats for the 2020 Season</H4></center>

## 1. Introduction to BeautifulSoup
* A Web Scraping Library that organizes messy HTML and presents us with manageable Python objects
* **Need to install the library before you can use it in Python**
    * `pip install beautifulsoup4` - run this in your terminal
* Full documentation available at https://www.crummy.com/software/BeautifulSoup/bs4/doc/

### 1.1. Importing the libraries
* Urllib - Module allows us to read data from URLs
    * Full urlopen documentation at https://docs.python.org/3/library/urllib.request.html
* BeautifulSoup - Module allows us to transform the html into manageable Python objects
    * Full documentation at https://www.crummy.com/software/BeautifulSoup/bs4/doc/
* Pandas - Module provides easy-to-use data manipulation and analysis methods
    * Full documentation at https://pandas.pydata.org/docs/

In [46]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd

### 1.2. Creating the BeautifulSoup Object
* Copy/Paste the url link to the football site we want to grab data from
    * In this case we use string formatting to manipulate the year
    * Link in this use case can be found at https://www.pro-football-reference.com/years/2020/fantasy.htm
* Assign the html variable the result of passing our url link to the urlopen function
    * urlopen opens the page and returns a response object that holds the html
* Assign the soup variable the result of passing our html variable to the BeautifulSoup constructor
    * This will be the main variable we will manipulate going forward
    * BeautifulSoup parses the html and transforms it into manageable Python objects
        * Try printing the soup variable below the next block to see what it returns

In [39]:
year = 2020
url = "https://www.pro-football-reference.com/years/{}/fantasy.htm".format(year)
html = urlopen(url)
soup = BeautifulSoup(html, "html.parser") #Name the parser explicitly; without it bs4 picks one itself and emits a warning
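
To see what the soup object gives you without hitting the website, here is a minimal sketch that parses a small hand-written HTML string. The names `sample_html`, `demo_soup`, `first_table` and `first_cell`, and the table contents, are made up for illustration and are not taken from the real page.

```python
from bs4 import BeautifulSoup

# A tiny hand-written page (hypothetical data) standing in for the real download.
sample_html = """
<html><body>
  <table id="demo">
    <tr><th>Player</th><th>Tm</th></tr>
    <tr><td>Player A</td><td>AAA</td></tr>
  </table>
</body></html>
"""

# Python's built-in parser is named explicitly so no extra install is needed.
demo_soup = BeautifulSoup(sample_html, "html.parser")

first_table = demo_soup.find("table")  # first <table> tag, returned as a Tag object
first_cell = demo_soup.find("td")      # first <td> tag in the document

print(first_table["id"])      # tag attributes are read like dictionary keys
print(first_cell.get_text())  # the text inside the tag
```

The same `find` and `get_text` calls work on the soup built from the real page; only the input differs.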

### 1.3. Getting the Header Data for our CSV file
* Use a list comprehension to find the text within the table header tag inside the second table row tag
    * Use list slicing to remove the first column header
    * Print the first five headers as proof

In [72]:
headers = [th.getText() for th in soup.findAll('tr')[1].findAll('th')] #Find the second table row tag, then every table header cell within it, and extract its text via getText (an alias of get_text)
headers = headers[1:] #Do not need the first (0 index) column header
print(headers[:5])

['Player', 'Tm', 'FantPos', 'Age', 'G']
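
The header step can be tried offline on a miniature table. This is only a sketch: the layout below (a grouping row first, the real header row second, and a rank cell in the first column) is an assumption made to mirror the steps above, and `sample_html`, `demo_soup`, `header_row` and `demo_headers` are hypothetical names.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the assumed layout: grouping row, header row, data row.
sample_html = """
<table>
  <tr><th></th><th>Games</th></tr>
  <tr><th>Rk</th><th>Player</th><th>Tm</th><th>G</th></tr>
  <tr><th>1</th><td>Player A</td><td>AAA</td><td>10</td></tr>
</table>
"""
demo_soup = BeautifulSoup(sample_html, "html.parser")

# The second <tr> (index 1) holds the column names we want.
header_row = demo_soup.find_all("tr")[1]
demo_headers = [th.get_text() for th in header_row.find_all("th")]

# Drop the first header so the names line up with the <td> cells of each data row.
demo_headers = demo_headers[1:]
print(demo_headers)
```

`find_all` and `get_text` are the current spellings of `findAll` and `getText`; both pairs behave the same.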


### 1.4. Getting the Table Row Data for our CSV file
* Use a list comprehension to find the text within each table data cell within each table row
    * Find all table rows not classed as thead
        * This football-reference page repeats a table header row every 30 rows
    * Perform the list comprehension to grab player stats
    * Remove the first two empty rows from the player stats list


In [93]:
rows = soup.findAll('tr', class_ = lambda table_rows: table_rows != "thead") #Grab all rows that are not classed as table header rows - football reference inserts a table header row every 30 rows
player_stats = [[td.getText() for td in rows[i].findAll('td')] # get the table data cell text from each table data cell
                for i in range(len(rows))] #for each row
player_stats = player_stats[2:]
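
As an alternative to the lambda filter and the `[2:]` slice, the row step can be sketched with a plain list comprehension that skips `thead` rows and drops any row that yields no `<td>` cells. The sample table and the names `data_rows` and `demo_stats` are hypothetical; this is one possible variant, not the only way to do it.

```python
from bs4 import BeautifulSoup

# Hypothetical table: one unclassed header row, plus a repeated "thead" row mid-table.
sample_html = """
<table>
  <tr><th>Rk</th><th>Player</th><th>Tm</th></tr>
  <tr><th>1</th><td>Player A</td><td>AAA</td></tr>
  <tr class="thead"><th>Rk</th><th>Player</th><th>Tm</th></tr>
  <tr><th>2</th><td>Player B</td><td>BBB</td></tr>
</table>
"""
demo_soup = BeautifulSoup(sample_html, "html.parser")

# Keep rows whose class list does not include "thead"; rows with no class get [].
data_rows = [tr for tr in demo_soup.find_all("tr")
             if "thead" not in tr.get("class", [])]

# Header rows contain only <th> cells, so they produce an empty list and are skipped.
demo_stats = []
for tr in data_rows:
    cells = [td.get_text() for td in tr.find_all("td")]
    if cells:
        demo_stats.append(cells)

print(demo_stats)
```

Checking `if cells` removes empty rows wherever they occur, so it does not depend on exactly two of them sitting at the top of the list.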

### 1.5. Creating a Pandas DataFrame object
* Simply place the player_stats list as the data parameter and the headers as the columns parameter

In [94]:
stats = pd.DataFrame(player_stats, columns = headers)
stats.head()

Unnamed: 0,Player,Tm,FantPos,Age,G,GS,Cmp,Att,Yds,TD,...,TD.1,2PM,2PP,FantPt,PPR,DKPt,FDPt,VBD,PosRank,OvRank
0,Dalvin Cook,MIN,RB,25,10,10,0,0,0,0,...,14,3.0,,223,251.5,260.5,237.0,133,1,1
1,Derrick Henry,TEN,RB,26,11,11,0,0,0,0,...,12,,,207,221.0,224.0,214.0,118,2,2
2,Tyreek Hill,KAN,WR,26,11,11,0,0,0,0,...,14,,,192,260.1,263.1,226.1,106,1,3
3,Alvin Kamara,NOR,RB,25,11,6,0,0,0,0,...,12,,,195,263.1,269.1,229.1,106,3,4
4,Kyler Murray,ARI,QB,23,11,11,264,387,2814,19,...,10,,,292,291.6,308.6,300.6,100,1,5


### 1.7. Data Manipulation
* Replace empty strings with zeroes, N/A values, column averages, etc.
    * **It's your data you decide what works best for you**
* Create a year column as the year variable

In [77]:
stats = stats.replace('', 0) #Replace cells that are exactly an empty string with 0 (an empty regex pattern would match every string)
stats['Year'] = year
stats.head()

Unnamed: 0,Player,Tm,FantPos,Age,G,GS,Cmp,Att,Yds,TD,...,2PM,2PP,FantPt,PPR,DKPt,FDPt,VBD,PosRank,OvRank,Year
0,Dalvin Cook,MIN,RB,25,10,10,0,0,0,0,...,3.0,,223,251.5,260.5,237.0,133,1,1,2020
1,Derrick Henry,TEN,RB,26,11,11,0,0,0,0,...,,,207,221.0,224.0,214.0,118,2,2,2020
2,Tyreek Hill,KAN,WR,26,11,11,0,0,0,0,...,,,192,260.1,263.1,226.1,106,1,3,2020
3,Alvin Kamara,NOR,RB,25,11,6,0,0,0,0,...,,,195,263.1,269.1,229.1,106,3,4,2020
4,Kyler Murray,ARI,QB,23,11,11,264,387,2814,19,...,,,292,291.6,308.6,300.6,100,1,5,2020
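
Scraped cells arrive as strings, so a cleanup step often pairs the empty-string replacement with a numeric conversion. The sketch below uses a two-row made-up frame; `demo`, `demo_rows`, `demo_headers` and `numeric_cols` are hypothetical names, and which columns to treat as numeric is an assumption you would adjust for your own data.

```python
import pandas as pd

# Hypothetical rows shaped like the scraped data: every value is a string.
demo_headers = ["Player", "Tm", "G", "2PM"]
demo_rows = [
    ["Player A", "AAA", "10", "3"],
    ["Player B", "BBB", "11", ""],
]
demo = pd.DataFrame(demo_rows, columns=demo_headers)

# Exact-match replacement: only cells that are exactly "" are changed.
demo = demo.replace("", "0")

# Convert the stat columns from strings to numbers so arithmetic works on them.
numeric_cols = ["G", "2PM"]
demo[numeric_cols] = demo[numeric_cols].apply(pd.to_numeric)

# Add the season as a constant column.
demo["Year"] = 2020
print(demo)
```

After the conversion, calls such as `demo["G"].sum()` operate on numbers rather than concatenating strings.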


### 1.8. Creating the CSV file
* Enter your unique file path as the parameter string to the to_csv method

In [96]:
stats.to_csv('2020playerstats.csv')
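
By default `to_csv` also writes the DataFrame index as an unnamed first column, which pandas labels `Unnamed: 0` when the file is read back. If you do not want that column, a minimal sketch using `index=False` looks like this. It writes to an in-memory buffer instead of a file so it can be run anywhere; `demo`, `buffer`, `csv_text` and `round_trip` are hypothetical names.

```python
import io
import pandas as pd

# Small made-up frame standing in for the scraped stats.
demo = pd.DataFrame({"Player": ["Player A", "Player B"], "G": [10, 11]})

# Default behaviour: the index is written as an extra, unnamed first column.
with_index = demo.to_csv()

# index=False leaves the index out, so only the real columns are written.
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# Reading the text back gives the same columns we started with.
round_trip = pd.read_csv(io.StringIO(csv_text))
print(round_trip)
```

Passing a file path instead of `buffer` writes the same text to disk.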

## 2. Putting it all together
* Create a function that writes a CSV file of player data
* The only parameter is the year whose data you want to export

In [99]:
def player_csv(year):

    url = "https://www.pro-football-reference.com/years/{}/fantasy.htm".format(year)
    html = urlopen(url)
    soup = BeautifulSoup(html, "html.parser") #Name the parser explicitly; without it bs4 picks one itself and emits a warning

    headers = [th.getText() for th in soup.findAll('tr')[1].findAll('th')] #Find the second table row tag, then every table header cell within it, and extract its text via getText (an alias of get_text)
    headers = headers[1:] #Do not need the first (0 index) column header
    
    rows = soup.findAll('tr', class_ = lambda table_rows: table_rows != "thead") #Grab all rows that are not classed as table header rows - football reference inserts a table header row every 30 rows
    player_stats = [[td.getText() for td in rows[i].findAll('td')] # get the table data cell text from each table data cell
                    for i in range(len(rows))] #for each row
    player_stats = player_stats[2:]

    stats = pd.DataFrame(player_stats, columns = headers)
    
    stats = stats.replace('', 'N/A') #Replace cells that are exactly an empty string with 'N/A'
    stats['Year'] = year
    
    stats.to_csv('{}playerstats.csv'.format(year))
    
    print("Player data for the year {} has been created.".format(year))


In [100]:
player_csv(2019)

Player data for the year 2019 has been created.
