

# ETL Project Final Report

## Brent Thomas
## Emmanuel Olofinkua
## Matt Houser




### BeautifulSoup 
BeautifulSoup is a Python library for pulling data out of HTML and XML files.
https://www.crummy.com/software/BeautifulSoup/bs4/doc/

### Splinter
Splinter lets you automate browser actions, such as visiting URLs and interacting with their items.
https://splinter.readthedocs.io/en/latest/

### DataSet
We extracted data from two separate websites, bleacherreport.com and USAtoday.com. Both articles discuss all-time NFL draft picks, specifically the best and worst first-round pick for each team. The Bleacher Report article covers the Super Bowl era (since 1966), while the USA Today article covers the NFL's modern era (since 1970).

Each team’s best and worst NFL draft picks (1st round, Super Bowl era)
https://bleacherreport.com/articles/2767040-every-nfl-teams-best-and-worst-1st-round-draft-pick-of-the-super-bowl-era#slide4

Each team’s best and worst NFL draft picks (1st round, modern era since 1970)
https://ftw.usatoday.com/2018/04/nfl-best-worst-picks-team




In [89]:

# import dependencies

import requests
import pymongo
import pandas as pd
from splinter import Browser
from bs4 import BeautifulSoup as bs
from pprint import pprint
import time
from lxml import html

# Extract

####  Extract HTML File and display

In [92]:
# open chrome driver browser
executable_path = {'executable_path': 'chromedriver.exe'}
browser = Browser('chrome', **executable_path, headless=False)

NFL News Best and Worst URL

In [93]:
# define url
NFL_BW_url = "https://ftw.usatoday.com/2018/04/nfl-best-worst-picks-team"
browser.visit(NFL_BW_url)
# create beautiful soup object 
html = browser.html
soup = bs(html, 'html.parser')



In [94]:
# open chrome driver browser
executable_path = {'executable_path': 'chromedriver.exe'}
browser = Browser('chrome', **executable_path, headless=False)

In [95]:
# define url
first_round_url = "https://bleacherreport.com/articles/2767040-every-nfl-teams-best-and-worst-1st-round-draft-pick-of-the-super-bowl-era#slide5"
browser.visit(first_round_url)
# create beautiful soup object 
html = browser.html
first_round_soup = bs(html, 'html.parser')

## Transform

#### To clean and transform our datasets within Jupyter Notebook, we first located each article's main content container with `find`. We then used `find_all` to display all the paragraphs, titles, and headers within each article for reference.

#### When scraping the Bleacher Report article, we ran into blank strings interrupting the data, which added an extra entry between players and threw off the grouping of each team with its best and worst pick. We needed to filter out the blank lines that occurred wherever there was a break in the HTML.
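The blank-string filtering described above can be sketched on plain strings. This is a minimal illustration, not the notebook's exact code: the helper name `drop_blank` and the sample list are hypothetical, and the sketch also drops whitespace-only entries, which a plain length check would keep.

```python
def drop_blank(texts):
    """Return the entries that still contain text after whitespace is stripped."""
    return [t.strip() for t in texts if t.strip()]


# Hypothetical sample mimicking tag text with blank entries around an HTML break
sample = ["Arizona Cardinals", "Best: Larry Fitzgerald", "", "Worst: Steve Little", "  "]
cleaned = drop_blank(sample)
print(cleaned)
```

With the blanks removed, every group of three consecutive entries lines up again as team, best pick, worst pick.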

In [96]:
# find the main article container, which holds the team headings
NFL_title = soup.body.find("div", class_="entry__content")
# the same container also holds the best/worst pick paragraphs
NFL_paragraph = soup.body.find("div", class_="entry__content")
# display both results for reference
print(f"The title is: \n{NFL_title}")
print()
print(f"The descriptive paragraph is:  \n{NFL_paragraph}")

The title is: 
<div class="entry__content">
<div class="articleBody" itemprop="articleBody"><p>Despite all the enthusiastic hyperbole of draft night, the law of averages says we’ll see as many first-round busts as first-round superstars. Will any be good or bad enough to make our list of the best and worst picks for all 32 NFL teams?</p>
<p>We looked at every draft from the NFL’s modern era (since 1970) and made our selections, only considering what a player did for the team that drafted them (for instance, Deion Sanders’ titles in San Francisco and Dallas don’t affect his time in Atlanta).</p>
<p>And draft position matters – a really bad No. 2 pick is worse than a horrible No. 32 pick.</p>
<h3>Dallas Cowboys</h3>
<p><b>Best pick: Emmitt Smith, RB (No. 17, 1990)</b><br/>
<b>Worst pick: Bill Thomas, RB (No. 26, 1972) </b></p>
<p>In 1988, the Cowboys used their first-round pick on Michael Irvin. The next year, Troy Aikman went No. 1 overall to Dallas. Both are in the Hall of Fame. But th

In [97]:
# find the slideshow container, which holds the team headings
NFL_t = first_round_soup.body.find("div", class_="organism contentStream slideshow")
# look for an "entry__content" div; this page has none, so the result is None
NFL_p = first_round_soup.body.find("div", class_="entry__content")
# display both results for reference
print(f"The title is: \n{NFL_t}")
print()
print(f"The descriptive paragraph is:  \n{NFL_p}")

The title is: 

The descriptive paragraph is:  
None


In [98]:
NFL_t.find_all('h1')

[<h1>Every NFL Team's Best (and Worst) 1st-Round Draft Pick of the Super Bowl Era</h1>,
 <h1>Arizona Cardinals</h1>,
 <h1>Atlanta Falcons</h1>,
 <h1>Baltimore Ravens</h1>,
 <h1>Buffalo Bills</h1>,
 <h1>Carolina Panthers</h1>,
 <h1>Chicago Bears</h1>,
 <h1>Cincinnati Bengals</h1>,
 <h1>Cleveland Browns</h1>,
 <h1>Dallas Cowboys</h1>,
 <h1>Denver Broncos</h1>,
 <h1>Detroit Lions</h1>,
 <h1>Green Bay Packers</h1>,
 <h1>Houston Texans</h1>,
 <h1>Indianapolis Colts</h1>,
 <h1>Jacksonville Jaguars</h1>,
 <h1>Kansas City Chiefs</h1>,
 <h1>Los Angeles Chargers</h1>,
 <h1>Los Angeles Rams</h1>,
 <h1>Miami Dolphins</h1>,
 <h1>Minnesota Vikings</h1>,
 <h1>New England Patriots</h1>,
 <h1>New Orleans Saints</h1>,
 <h1>New York Giants</h1>,
 <h1>New York Jets</h1>,
 <h1>Oakland Raiders</h1>,
 <h1>Philadelphia Eagles</h1>,
 <h1>Pittsburgh Steelers</h1>,
 <h1>San Francisco 49ers</h1>,
 <h1>Seattle Seahawks</h1>,
 <h1>Tampa Bay Buccaneers</h1>,
 <h1>Tennessee Titans</h1>,
 <h1>Washington Redskins</h1>]

In [99]:
NFL_title.find_all('h3')

[<h3>Dallas Cowboys</h3>,
 <h3>New York Giants</h3>,
 <h3>Philadelphia Eagles</h3>,
 <h3>Washington Redskins</h3>,
 <h3>Chicago Bears</h3>,
 <h3>Detroit Lions</h3>,
 <h3>Green Bay Packers</h3>,
 <h3>Minnesota Vikings</h3>,
 <h3>Atlanta Falcons</h3>,
 <h3>Carolina Panthers</h3>,
 <h3>New Orleans Saints</h3>,
 <h3>Tampa Bay Buccaneers</h3>,
 <h3>Arizona Cardinals</h3>,
 <h3>Los Angeles Rams</h3>,
 <h3>San Francisco 49ers</h3>,
 <h3>Seattle Seahawks</h3>,
 <h3>Buffalo Bills</h3>,
 <h3>Miami Dolphins</h3>,
 <h3>New England Patriots</h3>,
 <h3>New York Jets</h3>,
 <h3>Baltimore Ravens</h3>,
 <h3>Cincinnati Bengals</h3>,
 <h3>Cleveland Browns</h3>,
 <h3>Pittsburgh Steelers</h3>,
 <h3>Houston Texans</h3>,
 <h3>Indianapolis Colts</h3>,
 <h3>Jacksonville Jaguars</h3>,
 <h3>Tennessee Titans</h3>,
 <h3>Denver Broncos</h3>,
 <h3>Kansas City Chiefs</h3>,
 <h3>Los Angeles Chargers</h3>,
 <h3>Oakland Raiders</h3>,
 <h3 class="related__title">
 <span>
 			More			<a class="related__link" href="https://

In [100]:
# Find all paragraphs
NFL_t.find_all('p')

[<p></p>,
 <p>The <a href="http://bleacherreport.com/nfl">NFL</a> draft is intoxicating because the event provides hope for every fanbase. Each first-round pick is a future Hall of Famer...until he takes the field.</p>,
 <p>Once the games begin, each of the top selections undertakes a divergent path based on numerous factors including situation, coaching staff, fit and everything that encompasses being a professional athlete. </p>,
 <p>Organizations hope they find a quality starter. None expect a Hall of Fame-caliber performer. Yet a lucky few emerge. Each of the 32 franchises made at least one of these selections in the opening frame since the Super Bowl era began in 1966. </p>,
 <p>Then, there are those everyone wants to forget. Amazing talents that, for whatever reason, never clicked and live in infamy. The league's worst draft picks won't find their busts in Canton. They're simply busts. </p>,
 <p></p>,
 <p><strong>Best: <a href="http://bleacherreport.com/larry-fitzgerald">Larry Fi

In [101]:
NFL_paragraph.find_all('b')

[<b>Best pick: Emmitt Smith, RB (No. 17, 1990)</b>,
 <b>Worst pick: Bill Thomas, RB (No. 26, 1972) </b>,
 <b>Best pick: Lawrence Taylor, LB (No. 2, 1981)</b>,
 <b>Worst pick: Cedric Jones, DE (No. 5, 1996)</b>,
 <b>Best pick: Jerome Brown, DT (No. 9, 1987)</b>,
 <b>Worst pick: Mike Mamula, DE (No. 7, 1995)</b>,
 <b>Best pick: Darrell Green, CB (No. 28, 1983)</b>,
 <b>Worst pick: Heath Shuler, QB (No. 3, 1995) </b>,
 <b>Best pick: Walter Payton, RB (No. 4, 1975)</b>,
 <b>Worst pick: Cade McNown, QB (No. 12, 1999)</b>,
 <b>Best pick: Barry Sanders, RB (No. 3, 1989) </b>,
 <b>Worst pick: Charles Rogers, WR (No. 2, 2003)</b>,
 <b>Best pick: Aaron Rodgers, QB (No. 24, 2005)</b>,
 <b>Worst pick: Rich Campbell, QB (No. 6, 1981)</b>,
 <b>Best pick: Randall McDaniel, G (No. 19, 1988)</b>,
 <b>Worst pick: Troy Williams, WR (No. 7, 2005)</b>,
 <b>Best pick: Matt Ryan , QB (No. 3, 2008)</b>,
 <b>Worst pick: Bruce Pickens, CB (No. 3, 1991)</b>,
 <b>Best pick: Cam Newton, QB (No. 1, 2011)</b>,
 <b>W

In [102]:
# Look for NFL teams in h1 tags (this article uses h3 for teams, so the list is empty)
NFL_title.find_all('h1')

[]

In [103]:
# Extract title text
title = soup.title.text
print(title)

Every NFL team’s best and worst first round draft picks of all time | For The Win


In [104]:
# Extract title text
title = first_round_soup.title.text
print(title)

Every NFL Team's Best (and Worst) 1st-Round Draft Pick of the Super Bowl Era | Bleacher Report | Latest News, Videos and Highlights
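
The USA Today pick strings shown earlier, such as `Best pick: Emmitt Smith, RB (No. 17, 1990)`, pack several fields into one line. As a possible further transform step, here is a minimal sketch that splits them with a regular expression. The pattern, the `parse_pick` helper, and the output field names are assumptions made for illustration; the pattern targets only the USA Today format, not the shorter Bleacher Report strings.

```python
import re

# Assumed pattern for strings like "Best pick: Emmitt Smith, RB (No. 17, 1990)"
PICK_RE = re.compile(
    r"^(?P<kind>Best|Worst) pick:\s*(?P<player>.+?)\s*,\s*(?P<position>[A-Z]+)\s*"
    r"\(No\.\s*(?P<pick>\d+),\s*(?P<year>\d{4})\)\s*$"
)


def parse_pick(text):
    """Split one pick string into fields, or return None if it does not match."""
    match = PICK_RE.match(text.strip())
    if match is None:
        return None
    return {
        "kind": match.group("kind").lower(),
        "player": match.group("player"),
        "position": match.group("position"),
        "pick": int(match.group("pick")),
        "year": int(match.group("year")),
    }


parsed = parse_pick("Best pick: Emmitt Smith, RB (No. 17, 1990)")
print(parsed)
```

Storing the pick number and year as integers would make it possible to sort or filter the picks later, rather than keeping each entry as one opaque string.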


# Load

####  We created a local connection to MongoDB, used a for-loop to iterate through all results, and loaded both collections into our NFL_Best_Worst_db database in Mongo.
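
The grouping step behind this load can be sketched without a database connection. This is a minimal illustration under assumptions: the helper name `build_documents` is hypothetical, and the length check is an added safeguard that is not part of the notebook's loop.

```python
def build_documents(texts):
    """Group a flat [team, best, worst, team, best, worst, ...] list into dicts."""
    if len(texts) % 3 != 0:
        raise ValueError("expected a multiple of three entries (team, best, worst)")
    return [
        {"team": texts[i], "best": texts[i + 1], "worst": texts[i + 2]}
        for i in range(0, len(texts), 3)
    ]


# Sample entries taken from the USA Today article
sample = [
    "Dallas Cowboys",
    "Best pick: Emmitt Smith, RB (No. 17, 1990)",
    "Worst pick: Bill Thomas, RB (No. 26, 1972)",
    "New York Giants",
    "Best pick: Lawrence Taylor, LB (No. 2, 1981)",
    "Worst pick: Cedric Jones, DE (No. 5, 1996)",
]
docs = build_documents(sample)
print(docs[0]["team"], len(docs))
```

A list built this way could be passed to a PyMongo collection's `insert_many` in a single call instead of calling `insert_one` once per team.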

In [None]:
# Initialize PyMongo to work with MongoDB
conn = 'mongodb://localhost:27017'
client = pymongo.MongoClient(conn)

In [None]:
# Define database and collection
db = client.NFL_Best_Worst_db
collection = db.picks
draft_collection = db.draft_picks

In [105]:
import re
results = soup.find_all(['h3','b'])    # all h3 and b tags


In [106]:
# keep the 96 tags for the 32 teams (team heading, best pick, worst pick)
results = results[2:98]

In [108]:
for i in range(0,len(results),3):
    
    print(results[i].text)
    print('  '+results[i+1].text)
    print('  '+results[i+2].text)
    print('------------------------------------------------')
    
    post = {
        'team': results[i].text,
        'best':' ' +results[i+1].text,
        'worst':' ' +results[i+2].text,
    }
    collection.insert_one(post)


Dallas Cowboys
  Best pick: Emmitt Smith, RB (No. 17, 1990)
  Worst pick: Bill Thomas, RB (No. 26, 1972) 
------------------------------------------------
New York Giants
  Best pick: Lawrence Taylor, LB (No. 2, 1981)
  Worst pick: Cedric Jones, DE (No. 5, 1996)
------------------------------------------------
Philadelphia Eagles
  Best pick: Jerome Brown, DT (No. 9, 1987)
  Worst pick: Mike Mamula, DE (No. 7, 1995)
------------------------------------------------
Washington Redskins
  Best pick: Darrell Green, CB (No. 28, 1983)
  Worst pick: Heath Shuler, QB (No. 3, 1995) 
------------------------------------------------
Chicago Bears
  Best pick: Walter Payton, RB (No. 4, 1975)
  Worst pick: Cade McNown, QB (No. 12, 1999)
------------------------------------------------
Detroit Lions
  Best pick: Barry Sanders, RB (No. 3, 1989) 
  Worst pick: Charles Rogers, WR (No. 2, 2003)
------------------------------------------------
Green Bay Packers
  Best pick: Aaron Rodgers, QB (No. 24, 200

In [109]:
results = db.picks.find()


In [113]:
import re
results = first_round_soup.find_all(['h1','strong'])
results = results[2:105]

In [115]:
# filter blank lines
values = []
for result in results:
    if len(result.text) > 0:
        values.append(result)
        
# print teams
for i in range(0,len(values),3):
    print('----------------')
    print(f'{values[i].text:40}')
    print(f'  {values[i+1].text:40}')
    print(f'  {values[i+2].text:40}')
    print('----------------')

    # build the document from the filtered list (values), not the unfiltered results
    post = {
        'team': values[i].text,
        'best':' ' +values[i+1].text,
        'worst':' ' +values[i+2].text,
    }

    draft_collection.insert_one(post)

----------------
Arizona Cardinals                       
  Best: Larry Fitzgerald                  
  Worst: Steve Little                     
----------------
----------------
Atlanta Falcons                         
  Best: Mike Kenn                         
  Worst: Aundray Bruce                    
----------------
----------------
Baltimore Ravens                        
  Best: Ray Lewis                         
  Worst: Kyle Boller                      
----------------
----------------
Buffalo Bills                           
  Best: Bruce Smith                       
  Worst: Tom Cousineau                    
----------------
----------------
Carolina Panthers                       
  Best: Julius Peppers                    
  Worst: Rae Carruth                      
----------------
----------------
Chicago Bears                           
  Best: Walter Payton                     
  Worst: Curtis Enis                      
----------------
----------------
Cincinnati Bengal

In [117]:
results = db.draft_picks.find()