# Gathering NFL players' data using Scrapy

This document is a short walkthrough of how to use Scrapy spiders to gather an NFL dataset (or any other kind of dataset) from the web, and then use the QGIS framework to run a geospatial analysis of the players' birthplaces and college careers.

**OBSERVATION**: This document only explains how the code works. It is not meant to be run in a Jupyter Notebook, since running the spiders requires all of the auxiliary files of a Scrapy project.


## Objective

Collect NFL players' personal data from the NFL stats page and the personal bio available on the [NFL main page](nfl.com), and correlate their birthplaces and colleges with the average income in the U.S.

**Hypothesis**: The birthplaces of Hall of Fame players are directly connected to the income of their state/city.

The first step is to know how to get the players' personal info.

To access any stats, open the [Players' stats](http://www.nfl.com/stats/player) page, which should look like the image below.

![Players stats for the 2019 Season](images/players.png)

Then click on any player's name to access their personal info and their stats for the current NFL season. The amount of personal info varies depending on the year you are looking at, but there are basically two structures available:

1. Players active on the NFL
    - Height   
    - Weight
    - Age
    - Born (Date and place of birth)
    - College
    - Experience
    - High School

![Personal Info](images/tb12.png)

2. Retired Players 

    - Height
    - Weight
    - Age
    - Born (Date and place of birth)
    - College
    - Experience
    - Hall of Fame Induction (if inducted)
    
![Personal Info](images/dm13.png)


### Complementary Data

To complement the player information, data about the NFL teams' stadiums was also collected from [Sport League Maps](https://sportleaguemaps.com/football/nfl/). The following fields were collected from that website:

- Team
- Stadium Name
- Location (City, State)

![Stadium Info](images/stadiums.png)


## Scrapy

Scrapy is an open-source framework that eases the process of 'crawling' and 'scraping' the web, providing useful functions and tools that speed up data collection.

To install Scrapy with the command below, it is necessary to have [Anaconda Navigator](http://docs.continuum.io/anaconda/install/) installed. Once it is installed, type this command in a terminal to get the Scrapy framework:

    conda install -c conda-forge scrapy

### Creating a new project 

To create a new project, type in the terminal:

    scrapy startproject nameofyourproject

After that you should have a new folder containing a *spiders* subfolder. That is where you put the spider files that crawl the web and gather your data.

### Spider's Structure

The Spider is divided into two parts:

- start_requests: where you list all the links the spider is going to crawl. Depending on the *parse* algorithm, this can be many pages or just a single home page.

- parse: where the rules to follow on each web page are defined, such as following a certain link, collecting some text or downloading images. It works like a human browsing the site, but automated and at a large scale, saving time and manual work.
    - External parser: handles the access to the first web page (Stats) and redirects to the second page (Personal Bio).
    - Internal parser: finds the personal info and stores it in an appropriate place.

The same applies to the stadium website, but without an External parser: all the data is located on the home page, so there is no need to follow other hyperlinks.

### HTML Inspection

Before scraping a website, it is necessary to study it: find which elements you want to capture and how to reach them. It also helps to know a thing or two about HTML and CSS, in order to identify the tags, classes and IDs where the data you want is located. To do that, you can use any browser that has an *"Inspect Element"* option.


![HTML Inspection](images/inspect.png)


## Methodology

The first step is to create the class that shapes the spider's behaviour. Each spider name must be unique within a project, so take care to name your spiders correctly. The code below shows how it is created:

```python
import scrapy


class NFLSpider(scrapy.Spider):

    # Name used to launch the spider with `scrapy crawl`
    name = 'nfl-spider'

```

Next, define the pages to access inside the NFLSpider class. In this case, a generator builds the URL of the post-season passing stats for each season from 1933 up to the last season. On each iteration, a season page is passed to the *parser*, which applies the rules and collects the data.

```python
def start_requests(self):
        urls = (f'http://www.nfl.com/stats/categorystats?tabSeq=0&statisticCategory=PASSING&conference=null&season={year}&seasonType=POST&d-447263-s=PASSING_YARDS&d-447263-o=2&d-447263-n=1'
            for year in range(1933, datetime.datetime.now().year))
        
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)
```
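As a rough illustration of what that generator produces, the sketch below rebuilds the same URLs outside Scrapy with the standard library. The helper name `season_urls` and the `BASE_URL` constant are assumptions made for this example; the query parameters are copied from the URL above.

```python
import datetime
from urllib.parse import urlencode

# Assumed constant for this sketch: the part of the URL before the query string
BASE_URL = "http://www.nfl.com/stats/categorystats"


def season_urls(first_year=1933, last_year=None):
    """Yield one post-season passing-stats URL per season (hypothetical helper)."""
    if last_year is None:
        # Same upper bound as range(1933, now().year): stop at the previous year
        last_year = datetime.datetime.now().year - 1
    for year in range(first_year, last_year + 1):
        query = {
            "tabSeq": 0,
            "statisticCategory": "PASSING",
            "conference": "null",
            "season": year,
            "seasonType": "POST",
            "d-447263-s": "PASSING_YARDS",
            "d-447263-o": 2,
            "d-447263-n": 1,
        }
        yield f"{BASE_URL}?{urlencode(query)}"


urls = list(season_urls(1933, 1935))
print(len(urls))  # 3
print(urls[0])
```

Because the URLs come from a generator, each one is built only when Scrapy asks for the next request.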

### External Parser

The External Parser visits each page provided by *start_requests* and sends each player's link to the Internal parser, which collects the data. To do this it is necessary to know "XPath", a query language that describes the path from the document root to the desired element (in this case, the players' bio hyperlink). The code below shows how it was done:


```python
def parse(self, response):
        
        # Extracting all Player links
        Player_links = response.xpath('//table[@id="result"]//tbody[1]//tr//td[2]/a/@href').extract()  
        for player in Player_links:
            info = response.urljoin(player)
            yield scrapy.Request(info, callback=self.parse_playerInfo)
```
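To see the idea without running a crawl, here is a minimal sketch that applies a simplified version of that XPath to a small, made-up HTML table. It uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath, as a stand-in for Scrapy's selectors. The sample markup, `PAGE_URL` and the helper `player_links` are assumptions for illustration, not the real NFL.com page.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Made-up page URL and markup, used only for this sketch
PAGE_URL = "http://www.nfl.com/stats/categorystats?season=2018"
SAMPLE_HTML = """
<html><body>
<table id="result">
  <tbody>
    <tr><td>1</td><td><a href="/player/examplea/1/profile">Example A</a></td></tr>
    <tr><td>2</td><td><a href="/player/exampleb/2/profile">Example B</a></td></tr>
  </tbody>
</table>
</body></html>
"""


def player_links(html, page_url):
    """Return absolute URLs of the links found in the second column of the table."""
    root = ET.fromstring(html.strip())
    # Simplified path: table with id "result" -> rows -> second cell -> link
    anchors = root.findall(".//table[@id='result']/tbody/tr/td[2]/a")
    # urljoin plays the role of response.urljoin: relative href -> absolute URL
    return [urljoin(page_url, a.get("href")) for a in anchors]


links = player_links(SAMPLE_HTML, PAGE_URL)
print(links)
```

Each absolute URL is what the spider would pass to `scrapy.Request` with the internal parser as callback.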

### Internal Parser

The Internal Parser is where the magic happens: for each attribute (Name, Bornplace, etc.), an XPath expression tells the spider where to find the value. Each *yield* hands the collected record to Scrapy, and the output format is chosen when the spider is run.

```python

def parse_playerInfo(self,response):
        yield {
                    'Name' : response.xpath('//div[@class="player-info"]//span[1]/text()').get(default='NA'),
                    'Bornplace': response.xpath('//div[@class="player-info"]//p[contains(.,"Born")]/text()[2]').get(default='NA'),
                    'College':response.xpath('//div[@class="player-info"]//p[contains(.,"College")]/text()').get(default='NA'),
                    'Experience': response.xpath('//div[@class="player-info"]//p[contains(.,"Experience")]/text()').get(default='NA'),
                    'HS':response.xpath('//div[@class="player-info"]//p[contains(.,"School")]/text()').get(default='NA'),
                    'HOF':response.xpath('//div[@class="player-info"]//p[contains(.,"Fame")]/text()').get(default='NA')
                    
                    }
```
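The sketch below imitates, with the standard library only, what an expression such as `p[contains(.,"College")]/text()` followed by `.get(default='NA')` returns: the first text node directly under the matching paragraph, or the default when nothing matches. The bio markup is made up for this example and the helper `first_text` is hypothetical; the real page structure may differ.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed stand-in for a player bio page
SAMPLE_BIO = """
<html><body>
<div class="player-info">
  <p><span>Example Player</span></p>
  <p><strong>College</strong>: Example State</p>
  <p><strong>Experience</strong>: 5th season</p>
</div>
</body></html>
"""


def first_text(root, label, default="NA"):
    """Return the first direct text node of a <p> whose full text contains `label`."""
    for p in root.findall(".//div[@class='player-info']//p"):
        if label in "".join(p.itertext()):
            # Direct text nodes of <p>: its leading text plus the tail of each child
            direct = [p.text] + [child.tail for child in p]
            texts = [t for t in direct if t is not None]
            if texts:
                return texts[0]
    return default


root = ET.fromstring(SAMPLE_BIO.strip())
print(first_text(root, "College"))  # -> : Example State
print(first_text(root, "Fame"))     # -> NA
```

With markup shaped like this sample, the extracted value starts with ": ", which would explain why such prefixes are stripped in the tidying step.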

## Full Code

The complete code is available below:

```python

import scrapy
import datetime
from scrapy.shell import inspect_response


class NFLSpider(scrapy.Spider): 
    
    name = 'nfl-spider'
    
    def start_requests(self):
        urls = (f'http://www.nfl.com/stats/categorystats?tabSeq=0&statisticCategory=PASSING&conference=null&season={year}&seasonType=POST&d-447263-s=PASSING_YARDS&d-447263-o=2&d-447263-n=1'
            for year in range(1933, datetime.datetime.now().year))
        
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)
        
    def parse_playerInfo(self,response):
        # player_info = response.xpath('//div[@class="player-info"]//p/text()').getall()
        # born = //div[@class="player-info"]//p[contains(.,"Born")]
        yield {
                    'Name' : response.xpath('//div[@class="player-info"]//span[1]/text()').get(default='NA'),
                    'Bornplace': response.xpath('//div[@class="player-info"]//p[contains(.,"Born")]/text()[2]').get(default='NA'),
                    'College':response.xpath('//div[@class="player-info"]//p[contains(.,"College")]/text()').get(default='NA'),
                    'Experience': response.xpath('//div[@class="player-info"]//p[contains(.,"Experience")]/text()').get(default='NA'),
                    'HS':response.xpath('//div[@class="player-info"]//p[contains(.,"School")]/text()').get(default='NA'),
                    'HOF':response.xpath('//div[@class="player-info"]//p[contains(.,"Fame")]/text()').get(default='NA')
                    
                    }
        
    def parse(self, response):
        
        # Extracting all Player links
        Player_links = response.xpath('//table[@id="result"]//tbody[1]//tr//td[2]/a/@href').extract()  
        for player in Player_links:
            info = response.urljoin(player)
            yield scrapy.Request(info, callback=self.parse_playerInfo)

```

Now the code for the stadiums:

```python
import scrapy


class StadiumSpider(scrapy.Spider):

    name = 'stadiums'
    
    def start_requests(self):
        url = 'https://sportleaguemaps.com/football/nfl/'
        
        yield scrapy.Request(url=url, callback=self.parse)
        
        
    def parse(self, response):
        
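        # Extracting the text of every table cell, skipping rows that contain "Logo" (the header)
        stadium_names = response.xpath('//table//tr[not(contains(., "Logo"))]//text()').getall()
        
        # Cells come in groups of three (team, stadium, location); skip an incomplete trailing group
        for i in range(0, len(stadium_names) - 2, 3):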
            yield {
                    'Team' : stadium_names[i],
                    'Name': stadium_names[i+1],
                    'Location':stadium_names[i+2]
                    }
            
```
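The stadium parser relies on the cell texts arriving as one flat list, three values per stadium. A minimal sketch of that grouping step, on made-up values and with a hypothetical helper `chunk_rows`, could look like this:

```python
def chunk_rows(cells, size=3):
    """Group a flat list of cell texts into complete rows of `size` items."""
    # The upper bound guarantees that every slice has exactly `size` elements
    return [cells[i:i + size] for i in range(0, len(cells) - size + 1, size)]


# Made-up cell texts; the trailing "stray" value mimics an incomplete row
cells = ["Team A", "Stadium A", "City A, ST",
         "Team B", "Stadium B", "City B, ST",
         "stray"]

rows = [
    {"Team": team, "Name": name, "Location": location}
    for team, name, location in chunk_rows(cells)
]
print(rows)
```

Incomplete trailing groups are dropped instead of raising an `IndexError`. If the table also yields whitespace-only text nodes, they would need to be filtered out before grouping.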

## Running the Spider


To start collecting the info, go to the Scrapy project folder and type:

    scrapy crawl nfl-spider -o output.csv

Where:
- **nfl-spider** is the spider's name
- **output.csv** is the file to store the data

# Tidying the Data (RStudio)

In RStudio, the data was cleaned and organized so that it could later be geocoded with the *geopy* library. The steps taken are listed below:

**Observation**: This process can equally be implemented in Python.

1. Read the spider's output file
2. Remove leftover characters from the collector, such as ':', '\n' and '\t'
3. Convert the date of birth to the European format (DD/MM/YYYY)
4. Add a comma between the date and the birthplace, to split on afterwards
5. For birthplaces outside the U.S., add a comma between the city and the country
6. Remove text in parentheses to ease the GPS queries
7. Split the Bornplace field into Date of Birth, Place of Birth and State
8. Save a new CSV file named "NFL-player-bio.csv"

All the code is available below (in R):

```r

library(tidyverse)

NFL <- read.csv("1.csv")

NFL$Bornplace <- gsub(": ", "", NFL$Bornplace  )
NFL$Bornplace <- gsub("(\\n|\\t)", "", NFL$Bornplace  )
NFL$Bornplace <- gsub(": ", "", NFL$Bornplace  )
NFL$Bornplace <- gsub("(^[0-9]{1,2})\\/([0-9]{1,2})\\/([0-9]{4})", "\\2\\/\\1\\/\\3 ,", NFL$Bornplace)
NFL$Bornplace <- gsub("(.*?) , (.*) {2,}(.*)$", "\\1 , \\2 , \\3",  NFL$Bornplace)

NFL$College <- gsub(": ", "", NFL$College  )
NFL$College <- gsub("\\(.*\\)", "", NFL$College )
NFL$Experience <- gsub(": ", "", NFL$Experience  )
NFL$HS <- gsub(": ", "", NFL$HS  )
NFL$HOF <- gsub(": ", "", NFL$HOF  )

NFL <- NFL %>% separate(Bornplace, c('Date_Birth', 'Place_Birth', "State"), sep = " , " )

write.csv(NFL, file = "NFL-player-bio.csv", row.names = F)


```
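Since the same cleaning can be done in Python, here is a minimal sketch that ports the `Bornplace` steps of the R script using only the `re` module. The function name `clean_bornplace` and the raw input strings are assumptions for illustration; the real scraped values may be shaped differently.

```python
import re


def clean_bornplace(raw):
    """Apply the same substitutions as the R script, then split into fields."""
    text = raw.replace(": ", "")
    text = re.sub(r"[\n\t]", "", text)
    text = text.replace(": ", "")
    # Swap month and day (MM/DD/YYYY -> DD/MM/YYYY) and add a separator after the date
    text = re.sub(r"^([0-9]{1,2})/([0-9]{1,2})/([0-9]{4})", r"\2/\1/\3 ,", text)
    # Two or more spaces between city and country become a separator
    text = re.sub(r"(.*?) , (.*) {2,}(.*)$", r"\1 , \2 , \3", text)
    # Equivalent of separate(..., sep = " , ")
    return text.split(" , ")


# Made-up raw values in the shape this sketch assumes
us_born = clean_bornplace(": 8/3/1977\n\t San Mateo , CA")
foreign_born = clean_bornplace(": 7/1/1980 Sydney  Australia")
print(us_born)       # ['3/8/1977', 'San Mateo', 'CA']
print(foreign_born)  # ['1/7/1980', 'Sydney', 'Australia']
```

The three resulting fields correspond to `Date_Birth`, `Place_Birth` and `State` in the R version.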

# Connecting the data to geospatial providers

After collecting the data, it is necessary to transform the addresses, which are mostly plain text, into GPS coordinates (latitude and longitude). For that, a library called [Geopy](https://geopy.readthedocs.io/en/stable/) was used: a simple library that offers many geocoders (some paid, some free) for this transformation.

The process (for both the player bios and the stadium locations) is summarised as:

1. Load the preprocessed CSV file
2. Join the city and state (cities with the same name in different states may cause problems)
3. Append "University" to the college name (names like 'Florida' are ambiguous on their own)
4. Set the geocoder (ArcGIS was used)
5. Get the full coordinates for birthplace and college in new columns, and split each into latitude and longitude
6. Remove redundant columns (Location, Coordinate tuple)
7. Save to a new CSV file

**Important**: [It is necessary to limit the queries to 1 per second to avoid being blocked by the geocoders](https://geopy.readthedocs.io/en/stable/#usage-with-pandas)

The full code is available below:

```python
import pandas as pd
from functools import partial
from tqdm import tqdm

from geopy.geocoders import ArcGIS
from geopy.extra.rate_limiter import RateLimiter


def to_coords(location):
    # geocode returns None when nothing is found, so guard before reading the coordinates
    return (location.latitude, location.longitude) if location is not None else (None, None)


NFL = pd.read_csv('NFL-player-bio.csv')
NFL['Full Address'] = NFL['Place_Birth'] + ', ' + NFL['State']
NFL['College'] = NFL['College'] + " University"

geolocator = ArcGIS(user_agent="NFL-GPS")

# Wait at least one second between queries to avoid being blocked
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)

# Enable progress_apply (apply with a progress bar) on pandas objects
tqdm.pandas()

NFL['Coords_Birth'] = NFL['Full Address'].progress_apply(partial(geocode, timeout=5)).apply(to_coords)
NFL['Coords_College'] = NFL['College'].progress_apply(partial(geocode, timeout=5)).apply(to_coords)

# Split each (latitude, longitude) tuple into two columns
NFL[['Latitude_Birth', 'Longitude_Birth']] = pd.DataFrame(NFL['Coords_Birth'].tolist(), index=NFL.index)
NFL[['Latitude_College', 'Longitude_College']] = pd.DataFrame(NFL['Coords_College'].tolist(), index=NFL.index)

NFL.drop(columns=['Coords_Birth', 'Coords_College', 'Full Address'], inplace=True)

NFL.to_csv("NFL-GPS.csv", index=False)


# Stadium transformation

Stadium = pd.read_csv('Stadium.csv')

Stadium['Coords'] = Stadium['Name'].progress_apply(partial(geocode, timeout=5)).apply(to_coords)
Stadium[['Latitude', 'Longitude']] = pd.DataFrame(Stadium['Coords'].tolist(), index=Stadium.index)

Stadium.drop(columns=['Coords'], inplace=True)

Stadium.to_csv("NFL-GPS-Stadiums.csv", index=False)

```
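The one-query-per-second limit used above can be pictured with the small sketch below. It is not geopy's implementation, only an illustration of the idea: remember when the last call happened and sleep until the minimum delay has passed. `MinDelayLimiter` and `fake_geocode` are hypothetical names, and the coordinates are made up.

```python
import time


class MinDelayLimiter:
    """Call `func`, sleeping as needed so calls are at least `min_delay_seconds` apart."""

    def __init__(self, func, min_delay_seconds=1.0):
        self.func = func
        self.min_delay_seconds = min_delay_seconds
        self._last_call = None

    def __call__(self, *args, **kwargs):
        if self._last_call is not None:
            wait = self.min_delay_seconds - (time.monotonic() - self._last_call)
            if wait > 0:
                time.sleep(wait)
        self._last_call = time.monotonic()
        return self.func(*args, **kwargs)


def fake_geocode(address):
    """Hypothetical stand-in for a geocoder; returns made-up coordinates."""
    return (0.0, 0.0)


# A short delay keeps the demo fast; the document uses one second
limited = MinDelayLimiter(fake_geocode, min_delay_seconds=0.2)
start = time.monotonic()
results = [limited(place) for place in ["Place A", "Place B", "Place C"]]
elapsed = time.monotonic() - start
print(results, f"{elapsed:.2f}s")
```

Three calls with a 0.2 second delay take at least about 0.4 seconds, because only the second and third calls have to wait.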

## QGIS

QGIS is the tool responsible for generating maps from datasets. With these two new datasets, it is possible to show on the U.S. map the relationship between NFL players, stadiums and the [US Household Income Statistics](https://www.kaggle.com/goldenoakresearch/us-household-income-stats-geo-locations).

To add a base map, first install the OpenLayers plugin in the QGIS app, then drag and drop the CSV files onto the application. The income dataset was transformed into a heatmap.

Here is the individual visualization of each type:

### Birth place of Hall of Famers

![Birth-HOF](images/HOF-Birth.png)

### Hall of Famers Colleges

![College-HOF](images/HOF-College.png)

### NFL Stadium Locations

![Stadiums](images/Stadiums-QGIS.png)


# Analysis

The first point to observe is the East Coast's dominance in Hall of Famers, even though the West Coast also has big cities, like Los Angeles and Seattle.

Even though the East is crowded with stadiums, few Hall of Famers were born in the far eastern areas (Boston, Washington and New York), maybe because people born in these places tend to focus on business rather than sports.

![East coast](images/East-Coast.png)

The most crowded areas are near Philadelphia and the Texas/Louisiana region, which have less economic power than the biggest cities (New York and Los Angeles).

![West coast](images/West-Coast.png)


It is worth noting that most HOFers were born far from the big cities; a simpler life outside the metropolis may have helped them develop a love for the sport and pursue a career in it.

Another interesting fact is that, during college, players tend to move farther away from the NFL stadium locations to play at more distant colleges, and then finally come back to the city centers to pursue a professional career in the NFL.





# Conclusion

This shows that the combination of Scrapy + Geopy + QGIS works well for transforming textual data from the web into geospatial elements, providing some new insights into how players come to know the sport and how they develop up to a professional career. Due to the lack of time, it was not possible to make more complex hypotheses or analyses, but the Hypothesis

The final map is here:

![NFL](images/USA.png)
