# Data Science - Web Scraping

## Tasks Today:

1) <b>Requests</b> <br>
 &nbsp;&nbsp;&nbsp;&nbsp; a) Importing <br>
 &nbsp;&nbsp;&nbsp;&nbsp; b) Using Requests <br>
2) <b>Beautiful Soup</b> <br>
 &nbsp;&nbsp;&nbsp;&nbsp; a) Importing <br>
 &nbsp;&nbsp;&nbsp;&nbsp; b) Using Beautiful Soup <br>
 &nbsp;&nbsp;&nbsp;&nbsp; c) .prettify() <br>
 &nbsp;&nbsp;&nbsp;&nbsp; d) Converting to a List <br>
 &nbsp;&nbsp;&nbsp;&nbsp; e) Extracting Beautiful Soup Elements <br>
 &nbsp;&nbsp;&nbsp;&nbsp; f) Assigning Variables from Beautiful Soup <br>
 &nbsp;&nbsp;&nbsp;&nbsp; g) .find() <br>
 &nbsp;&nbsp;&nbsp;&nbsp; h) .find_all() <br>
3) <b>Exercise</b> <br>

## Requests

In [1]:
# Install Beautiful Soup and Requests
!pip install beautifulsoup4
!pip install requests



### Importing

In [2]:
import requests

### Using Requests

In [40]:
# Connect to URL - https://www.arthurleej.com/e-language.html

page = requests.get('https://www.arthurleej.com/e-language.html')

In [41]:
# display result response
page

<Response [200]>

##### .content

In [42]:
# View the raw response body as bytes (use page.status_code to check the status of the request)
page.content

b'<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">\r<html>\r<head>\r\t<title>Essay on Language by Arthur Lee Jacobson</title>\r<meta name="description" content="Trees,gardening, wild and domestic plant life are the specialty of author Arthur Lee Jacobson.">\r<meta name="keywords" content="trees, gardening, wild plants, domestic plants, gardening author, gardening books, Arthur Lee Jacobson, A L J, A L Jacobson, Arthur Jacobson, arthur lee, plants, flowers, seattle, washington">\r<meta name="resource-type" content="document">\r<meta name="generator" content="BBEdit 4.5">\r<meta name="robots" content="all">\r<meta name="classification" content="Gardening">\r<meta name="distribution" content="global">\r<meta name="rating" content="general">\r<meta name="copyright" content="2001 Arthur Lee Jacobson">\r<meta name="author" content="eriktyme@eriktyme.com">\r<meta name="language" content="en-us">\r</head>\r<body background="images/background1a.jpg" bgcolor="#FFFFCC" text="#000000" link=
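The response object carries more than the raw bytes shown above. The following is a minimal sketch, assuming a hypothetical helper named `fetch_html` that is not part of the original notebook: it checks the HTTP status with `raise_for_status()` and returns the decoded `.text` rather than the raw `.content` bytes. The `timeout` value and the injectable `get` parameter are illustrative choices.

```python
import requests


def fetch_html(url, get=requests.get, timeout=10):
    """Hypothetical helper: download a page and return its decoded HTML.

    `get` defaults to requests.get; it is a parameter only so the function
    can be exercised without network access.
    """
    response = get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx responses instead of silently
    # handing an error page to the parser.
    response.raise_for_status()
    # .content is the raw body as bytes; .text is the same body decoded to str.
    return response.text
```

Calling `fetch_html('https://www.arthurleej.com/e-language.html')` should return the same markup as `page.text`, and the result can be passed to `BeautifulSoup` in the same way as `page.content`.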

## Beautiful Soup

### Importing

In [43]:
from bs4 import BeautifulSoup

### Using Beautiful Soup

In [44]:
# Instantiate BeautifulSoup class
soup = BeautifulSoup(page.content, 'html.parser')

soup

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
 <html> <head> <title>Essay on Language by Arthur Lee Jacobson</title> <meta content="Trees,gardening, wild and domestic plant life are the specialty of author Arthur Lee Jacobson." name="description"/> <meta content="trees, gardening, wild plants, domestic plants, gardening author, gardening books, Arthur Lee Jacobson, A L J, A L Jacobson, Arthur Jacobson, arthur lee, plants, flowers, seattle, washington" name="keywords"/> <meta content="document" name="resource-type"/> <meta content="BBEdit 4.5" name="generator"/> <meta content="all" name="robots"/> <meta content="Gardening" name="classification"/> <meta content="global" name="distribution"/> <meta content="general" name="rating"/> <meta content="2001 Arthur Lee Jacobson" name="copyright"/> <meta content="eriktyme@eriktyme.com" name="author"/> <meta content="en-us" name="language"/> </head> <body alink="#33CC33" background="images/background1a.jpg" bgcolor="#FFFFCC" link="#0000F

### .prettify()

In [45]:
# NOTE: .prettify() works on the full document or on a single tag (e.g. the result of .find()), not on the list returned by .find_all()
print(soup.prettify())

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<html>
 <head>
  <title>
   Essay on Language by Arthur Lee Jacobson
  </title>
  <meta content="Trees,gardening, wild and domestic plant life are the specialty of author Arthur Lee Jacobson." name="description"/>
  <meta content="trees, gardening, wild plants, domestic plants, gardening author, gardening books, Arthur Lee Jacobson, A L J, A L Jacobson, Arthur Jacobson, arthur lee, plants, flowers, seattle, washington" name="keywords"/>
  <meta content="document" name="resource-type"/>
  <meta content="BBEdit 4.5" name="generator"/>
  <meta content="all" name="robots"/>
  <meta content="Gardening" name="classification"/>
  <meta content="global" name="distribution"/>
  <meta content="general" name="rating"/>
  <meta content="2001 Arthur Lee Jacobson" name="copyright"/>
  <meta content="eriktyme@eriktyme.com" name="author"/>
  <meta content="en-us" name="language"/>
 </head>
 <body alink="#33CC33" background="images/background1a.jp
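To make the note about `.prettify()` concrete, here is a small sketch on a stand-in document (the `html_doc` string is invented for illustration, not taken from the essay page). It shows that a single tag can be prettified, while the result of `.find_all()` has to be handled one element at a time.

```python
from bs4 import BeautifulSoup

# Small stand-in document; the real page is much larger.
html_doc = "<html><body><p>One</p><p>Two</p></body></html>"
soup = BeautifulSoup(html_doc, "html.parser")

# .prettify() is defined on the soup and on every individual Tag.
first_p = soup.find("p")
print(first_p.prettify())

# .find_all() returns a list-like ResultSet, which has no .prettify();
# call the method on each element instead.
all_p = soup.find_all("p")
pretty_each = [p.prettify() for p in all_p]
print(len(pretty_each))
```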

### Converting to a List

In [46]:
# Tags may contain strings and other tags. These elements are the tag’s children.
# print(list(soup.children))

print(list(soup.children)[2])

<html> <head> <title>Essay on Language by Arthur Lee Jacobson</title> <meta content="Trees,gardening, wild and domestic plant life are the specialty of author Arthur Lee Jacobson." name="description"/> <meta content="trees, gardening, wild plants, domestic plants, gardening author, gardening books, Arthur Lee Jacobson, A L J, A L Jacobson, Arthur Jacobson, arthur lee, plants, flowers, seattle, washington" name="keywords"/> <meta content="document" name="resource-type"/> <meta content="BBEdit 4.5" name="generator"/> <meta content="all" name="robots"/> <meta content="Gardening" name="classification"/> <meta content="global" name="distribution"/> <meta content="general" name="rating"/> <meta content="2001 Arthur Lee Jacobson" name="copyright"/> <meta content="eriktyme@eriktyme.com" name="author"/> <meta content="en-us" name="language"/> </head> <body alink="#33CC33" background="images/background1a.jpg" bgcolor="#FFFFCC" link="#0000FF" text="#000000" vlink="#FF00FF"> <!-- -- #include virtu
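The index `[2]` works here because the doctype and the whitespace after it occupy the first two positions in the child list. As a sketch on an invented stand-in document, the snippet below lists what each top-level child is and shows two ways to reach the `<html>` element without counting positions.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Stand-in document with a doctype, a newline, and then the <html> element.
html_doc = (
    "<!DOCTYPE html>\n"
    "<html><head><title>Demo</title></head><body><p>Hi</p></body></html>"
)
soup = BeautifulSoup(html_doc, "html.parser")

# List each top-level child together with its type.
for index, child in enumerate(soup.children):
    print(index, type(child).__name__)

# Keeping only Tag children removes the dependence on whitespace positions.
tag_children = [child for child in soup.children if isinstance(child, Tag)]
html = tag_children[0]

# Dot access reaches the same element without any indexing.
print(html is soup.html)
```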

### Extracting Beautiful Soup Elements

In [10]:
# We can traverse an HTML page and extract tags and text from it.
# The object created from the HTML document contains several node types: Tag, NavigableString, Comment, and Doctype.
# A Tag lets us dive deeper into the document, i.e. we can read HTML attributes such as class and, if needed, continue into its children from there.
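Since this cell has no code of its own, here is a hedged sketch of what those node types look like in practice. The `html_doc` string is invented for illustration. It reads a tag's name and attributes, then classifies each child of the tag.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Invented snippet containing a comment, a nested tag, and plain text.
html_doc = (
    '<div class="card main" id="intro">'
    "<!-- note --><p>Hello <i>world</i></p></div>"
)
soup = BeautifulSoup(html_doc, "html.parser")

div = soup.div
print(div.name)          # the tag name
print(div["id"])         # single-valued attribute, returned as a str
print(div.get("class"))  # multi-valued attribute, returned as a list of str

for child in div.children:
    # Comment is a subclass of NavigableString, so test for it first.
    if isinstance(child, Comment):
        kind = "comment"
    elif isinstance(child, NavigableString):
        kind = "text"
    elif isinstance(child, Tag):
        kind = "tag"
    else:
        kind = "other"
    print(kind, repr(child))
```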


### Assigning Variables from Beautiful Soup

In [55]:
html = list(soup.children)[2]     # select only the html element
body = list(html.children)[3]     # select just the body from the html children
center = list(body.children)[4]   # select the fifth child of the body
table = list(center.children)[0]  # select the first child of that element
print(table.prettify())

<table border="0" cellpadding="1" cellspacing="2">
 <tr>
  <td align="center" valign="top" width="480">
   <table border="0" cellpadding="1" cellspacing="2">
    <tr>
     <td align="center" valign="top" width="480">
      <font size="5">
       <b>
        Language
       </b>
      </font>
     </td>
    </tr>
    <tr>
     <td align="left" valign="top" width="480">
      <font size="3">
       <b>
        Humanity's highly developed ability to communicate verbally is our
        <i>
         essence,
        </i>
        I believe. Without our tremendous vocabulary, we'd perhaps be not much better off than gorillas and monkeys. Language is taken for granted since it is a basic characteristic. But it is, for all its universality, among the most powerful of human tools. "The pen is mightier than the sword."
       </b>
      </font>
     </td>
    </tr>
    <tr>
     <td align="left" valign="top" width="480">
      <font size="3">
       <b>
        Language informs, persuades, querie
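Positional indexes such as `[3]` and `[4]` depend on every text node and tag that comes before the target, so they break easily when a page changes. As a sketch on an invented stand-in document, the same kind of nested table can also be reached by tag name or by searching.

```python
from bs4 import BeautifulSoup

# Invented stand-in for a page whose table sits inside <body><center>.
html_doc = """
<html><body>
<p>intro</p>
<center><table border="0"><tr><td><b>Language</b></td></tr></table></center>
</body></html>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Positional route: whitespace text nodes are counted as children too.
body_children = list(soup.body.children)
print([getattr(child, "name", None) for child in body_children])

# Name-based route: follow tag names from the root.
table_by_name = soup.body.center.table

# Search route: the first <table> anywhere under <body>.
table_by_find = soup.body.find("table")

print(table_by_name is table_by_find)
print(table_by_name.b.get_text())
```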

### .find() <br>
<p>Returns the first element that matches the tag name and attributes passed in, or None if nothing matches</p>

In [48]:
# looking for the first instance of the 'b' tag - gets whole tag
table.find('b')

# just getting the inner text
table.find('b').text

# specificity
# syntax soup.find('tagName', attrs={'attribute':'attrValue'})

new_p = table.find('font', attrs={'size':'3'})

# tag with child elements - simply use dot notation to traverse inward
new_p.b
new_p.b.text

# grabbing attributes within a selected tag - dictionary key syntax element['attribute']
print(new_p)
new_p['size']


<font size="3"><b>    Humanity's highly developed ability to communicate verbally is our <i>essence,</i> I believe. Without our tremendous vocabulary, we'd perhaps be not much better off than gorillas and monkeys. Language is taken for granted since it is a basic characteristic. But it is, for all its universality, among the most powerful of human tools. "The pen is mightier than the sword."</b></font>


'3'
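The calls above can be tried without a network connection. This sketch uses a cut-down, invented version of the essay's table and also covers the case where `.find()` matches nothing, which returns `None` and needs a guard before chaining.

```python
from bs4 import BeautifulSoup

# Cut-down stand-in for the essay table used above.
html_doc = """
<table>
  <tr><td><font size="5"><b>Language</b></font></td></tr>
  <tr><td><font size="3"><b>  Humanity's <i>essence,</i> I believe.</b></font></td></tr>
</table>
"""
table = BeautifulSoup(html_doc, "html.parser").table

heading = table.find("b")                             # first <b> in the table
body_font = table.find("font", attrs={"size": "3"})   # first <font size="3">
missing = table.find("font", attrs={"size": "7"})     # no match, so None

print(heading.get_text(strip=True))
print(body_font["size"])
print(body_font.b.get_text(" ", strip=True))
print(missing is None)

# Guard before chaining, because None has no .get_text() method.
missing_text = missing.get_text(strip=True) if missing is not None else ""
```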

### .find_all() <br>
<p>Similar to .find(), except it returns every matching element (as a list-like ResultSet) instead of only the first</p>

In [58]:
# .find_all() accepts mostly the same keyword arguments as .find()
# .find_all('div', attrs={'class':'btn btn-success'})

text_body = []
for b in html.find_all('b'):
    text_body.append(b.text)

text_body

['Language',
 '\xa0\xa0\xa0\xa0Humanity\'s highly developed ability to communicate verbally is our essence, I believe. Without our tremendous vocabulary, we\'d perhaps be not much better off than gorillas and monkeys. Language is taken for granted since it is a basic characteristic. But it is, for all its universality, among the most powerful of human tools. "The pen is mightier than the sword."',
 "\xa0\xa0\xa0\xa0Language informs, persuades, queries, expresses emotions, allows transmission of complex ideas and data, and its usage is often artful, whether prosaic or in verse. Of course, so far my remarks are regarding vocalization and writing. The broadest definition of language includes much more. For example, we have codes, such as Morse and flag, smoke signals, body language, and to an extent even music. Computer programs include special coding that can in some sense be called language. In my essay I choose to restrict the word's meaning to its root: tongue-based communication, and

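The `\xa0` characters in the output above are non-breaking spaces from the page. As a sketch on invented stand-in markup, `get_text(strip=True)` removes them from the ends of each string, and `attrs` and `limit` narrow what `.find_all()` returns.

```python
from bs4 import BeautifulSoup

# Stand-in markup with leading non-breaking spaces, as on the essay page.
html_doc = (
    "<div><b>Language</b>"
    "<b>\xa0\xa0\xa0\xa0Humanity's ability</b>"
    '<b class="note">\xa0\xa0Language informs</b></div>'
)
soup = BeautifulSoup(html_doc, "html.parser")

# strip=True trims leading and trailing whitespace, including \xa0.
text_body = [b.get_text(strip=True) for b in soup.find_all("b")]
print(text_body)

# Filter by attribute, exactly as with .find().
notes = soup.find_all("b", attrs={"class": "note"})

# limit stops the search after the given number of matches.
first_two = soup.find_all("b", limit=2)
print(len(notes), len(first_two))
```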
## Exercise <br>
<p>Using the Beautiful Soup library, grab the data from the following link: https://www.nbastuffer.com/2020-2021-nba-player-stats/. After getting the data, display each player's name and team inside a pandas DataFrame.</p>

In [61]:
# Hint: Use the .get_text() method

import pandas as pd


nba_page = requests.get('https://www.nbastuffer.com/2020-2021-nba-player-stats/')

nba_soup = BeautifulSoup(nba_page.text, 'html.parser')


In [72]:
names = []
teams = []
position = []

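for row in nba_soup.find_all('tr'):
    # Collect every text node in the row once (string=True replaces the deprecated text=True)
    cells = row.find_all(string=True)
    names.append(cells[1])
    teams.append(cells[2])
    position.append(cells[3])

# Drop the header row's labels from each list
names.pop(0)
teams.pop(0)
position.pop(0)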


'TEAM'

In [73]:
players_20_21 = {
    'Player Name': names,
    'TEAM': teams,
    'POS': position
}

nba_df = pd.DataFrame.from_dict(players_20_21)
nba_df

|  | Player Name | TEAM | POS |
|---|---|---|---|
| 0 | Pat Connaughton | Mil | G |
| 1 | Bryn Forbes | Mil | G |
| 2 | Jrue Holiday | Mil | G |
| 3 | Brook Lopez | Mil | C |
| 4 | Khris Middleton | Mil | F |
| ... | ... | ... | ... |
| 861 | Delon Wright | Sac | G |
| 862 | Thaddeus Young | Chi | F |
| 863 | Trae Young | Atl | G |
| 864 | Cody Zeller | Cha | F-C |
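Indexing into the text nodes of a row works, but it is sensitive to stray whitespace. An alternative sketch reads the `<td>` cells of each row instead. The column layout in `html_doc` (rank, name, team, position) is an assumption made for illustration and may not match the live page.

```python
from bs4 import BeautifulSoup

# Invented stand-in table; the column order is an assumption.
html_doc = """
<table>
  <thead><tr><th>RANK</th><th>NAME</th><th>TEAM</th><th>POS</th></tr></thead>
  <tbody>
    <tr><td>1</td><td>Pat Connaughton</td><td>Mil</td><td>G</td></tr>
    <tr><td>2</td><td>Brook Lopez</td><td>Mil</td><td>C</td></tr>
  </tbody>
</table>
"""
soup = BeautifulSoup(html_doc, "html.parser")

records = []
for row in soup.find_all("tr"):
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    if len(cells) < 4:
        continue  # the header row uses <th>, so it yields no <td> cells
    records.append({"Player Name": cells[1], "TEAM": cells[2], "POS": cells[3]})

print(records)
```

Passing `records` to `pd.DataFrame(records)` would build the frame directly from the list of dictionaries, with no need to pop the header labels afterwards.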


# Bonus Example: Pulling Vegas Odds from PFR.com

<h3>Use this example for further reference</h3>
<p>This example shows what Beautiful Soup returns when we parse an HTML document from a live site</p>

In [103]:
page = requests.get('https://www.pro-football-reference.com/boxscores/201810140nwe.htm')
# print(page.status_code)

soup = BeautifulSoup(page.content, 'html.parser')

In [104]:

# Get HTML
html = list(soup.children)[3]
html


<html class="no-js" data-root="/home/pfr/build" data-version="klecko-" itemscope="" itemtype="https://schema.org/WebSite" lang="en">
<head>
<meta charset="utf-8"/>
<meta content="ie=edge" http-equiv="x-ua-compatible"/>
<meta content="width=device-width, initial-scale=1.0, maximum-scale=2.0" name="viewport">
<link href="https://d2p3bygnnzw9w3.cloudfront.net/req/202108231" rel="dns-prefetch"/>
<!-- Quantcast Choice. Consent Manager Tag v2.0 (for TCF 2.0) -->
<script async="true" type="text/javascript">
    (function() {
	var host = window.location.hostname;
	var element = document.createElement('script');
	var firstScript = document.getElementsByTagName('script')[0];
	var url = 'https://quantcast.mgr.consensu.org'
	    .concat('/choice/', 'XwNYEpNeFfhfr', '/', host, '/choice.js')
	var uspTries = 0;
	var uspTriesLimit = 3;
	element.async = true;
	element.type = 'text/javascript';
	element.src = url;
	
	firstScript.parentNode.insertBefore(element, firstScript);
	
	function makeStub() {
	  

In [106]:
outer_div = html.find('div', attrs={'id':'wrap'})
outer_div

<div id="wrap">
<div id="header" role="banner">
<ul class="notranslate" id="subnav">
<li><a href="https://www.sports-reference.com/"><svg height="15px" width="20px"><use xlink:href="#ic-sr-pennant"></use></svg> Sports Reference</a></li>
<li><a href="https://www.baseball-reference.com/">Baseball</a></li>
<li class="current"><a href="https://www.pro-football-reference.com/">Football</a> <a href="https://www.sports-reference.com/cfb/">(college)</a></li>
<li><a href="https://www.basketball-reference.com/">Basketball</a> <a href="https://www.sports-reference.com/cbb/">(college)</a></li>
<li><a href="https://www.hockey-reference.com/">Hockey</a></li>
<li><a href="https://fbref.com/it/">Calcio</a></li>
<li><a href="https://www.sports-reference.com/blog/">Blog</a></li>
<li><a href="https://stathead.com/?utm_source=web&amp;utm_medium=pfr&amp;utm_campaign=sr-nav-bar-top-link">Stathead</a></li>
<li><a href="https://widgets.sports-reference.com/">Widgets</a></li>
<li><a href="#" onclick="Freshwork

In [107]:
table = outer_div.find('div', attrs={'id':'content'})
table

<div class="box" id="content" role="main">
<h1>Kansas City Chiefs at New England Patriots - October 14th, 2018</h1>
<div class="section_wrapper setup_commented commented" id="all_other_scores">
<div class="section_heading assoc_other_scores" id="other_scores_sh">
<span class="section_anchor" data-label="All Week 6 Games" id="other_scores_link"></span>
</div><div class="placeholder"></div>
<!--     <div class="section_content" id="div_other_scores">
	    <div class="game_summaries compressed">
   <h2>NFL Scores &mdash; <a href="/years/2018/week_6.htm">Week 6</a></h2>
   
      <div class="game_summary nohover">
	<table class="teams">
		<tbody>       
		<tr class="">
			<td><strong><a href="/teams/phi/2018.htm">PHI</a></strong></td>
			<td class="right">34</td>
			<td class="right gamelink">
				<a href="/boxscores/201810110nyg.htm">F<span class="no_mobile">inal</span></a>
				
			</td>
		</tr>
		<tr class="">
			<td><a href="/teams/nyg/2018.htm">NYG</a></td>
			<td class="right">13</td>

In [117]:
grid = table.find('div', attrs={'class':'content_grid'})
grid.find('div', attrs={'id': 'all_game_info'})

<div class="table_wrapper setup_commented commented" id="all_game_info">
<div class="section_heading assoc_game_info" id="game_info_sh">
<span class="section_anchor" data-label="Game Info" id="game_info_link"></span><h2>Game Info</h2> <div class="section_heading_text">
<ul>
</ul>
</div>
</div><div class="placeholder"></div>
<!--

<div class="table_container" id="div_game_info">
    
    <table class="suppress_all sortable stats_table" id="game_info" data-cols-to-freeze="0">
    <caption>Game Info Table</caption>
    <tr class="thead onecell" ><td class="right center" data-stat="onecell" colspan="2" >Game Info</td></tr>
<tr ><th scope="row" class="center " data-stat="info" >Won Toss</th><td class="center " data-stat="stat" >Chiefs (deferred)</td></tr>
<tr ><th scope="row" class="center " data-stat="info" >Roof</th><td class="center " data-stat="stat" >outdoors</td></tr>
<tr ><th scope="row" class="center " data-stat="info" >Surface</th><td class="center " data-stat="stat" >fieldturf </t

['\n',
 <div class="section_heading assoc_expected_points" id="expected_points_sh">
 <span class="section_anchor" data-label="Expected Points Summary" id="expected_points_link"></span><h2>Expected Points Summary</h2> <div class="section_heading_text">
 <ul>
 </ul>
 </div>
 </div>,
 <div class="placeholder"></div>,
 '\n',
 '\n\n<div class="table_container" id="div_expected_points">\n    \n    <table class="sortable stats_table" id="expected_points" data-cols-to-freeze=",1">\n    <caption>Expected Points Summary Table</caption>\n    \n   <colgroup><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col></colgroup>\n   <thead>\n      \n      <tr class="over_header">\n         <th aria-label="" data-stat="" colspan="2" class=" over_header center" ></th>\n         <th aria-label="" data-stat="pbp_exp_points_off" colspan="4" class=" over_header center" >Offense</th>\n         <th aria-label="" data-stat="pbp_exp_points_def" colspan="4" class=" over_header center" >Def
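In the outputs above, the table markup appears inside an HTML comment (`<!-- ... -->`), so a plain `.find('table')` on the wrapper would not see it. One common workaround, sketched here on an invented miniature of the `all_game_info` block, is to pull out the comment node and parse its contents as a second document.

```python
from bs4 import BeautifulSoup, Comment

# Invented miniature of the wrapper shown above: the table lives in a comment.
html_doc = """
<div id="all_game_info"><div class="placeholder"></div>
<!--
<table id="game_info">
<tr><th>Won Toss</th><td>Chiefs (deferred)</td></tr>
<tr><th>Roof</th><td>outdoors</td></tr>
</table>
-->
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")
wrapper = soup.find("div", attrs={"id": "all_game_info"})

# The parser treats the commented-out markup as a single Comment string.
print(wrapper.find("table"))

# Locate the comment node, then parse its text as HTML in its own right.
comment = wrapper.find(string=lambda node: isinstance(node, Comment))
inner = BeautifulSoup(comment, "html.parser")

rows = {
    tr.th.get_text(strip=True): tr.td.get_text(strip=True)
    for tr in inner.find_all("tr")
}
print(rows)
```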

### Selenium


In [118]:
import os
import sys
os.path.dirname(sys.executable)

'/opt/anaconda3/bin'

In [119]:
!pip install selenium
!pip install geckodriver-autoinstaller



In [121]:
from selenium import webdriver
from time import sleep

In [131]:
from selenium.webdriver.common.keys import Keys
import geckodriver_autoinstaller


geckodriver_autoinstaller.install() 

from selenium.webdriver.firefox.options import Options

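options = Options()
options.add_argument('-headless')  # run Firefox without a window (options.headless is deprecated in newer Selenium 4 releases)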


In [132]:
from selenium.webdriver.common.by import By

driver = webdriver.Firefox(options=options)
driver.get('https://kekambas-bs.herokuapp.com/login')

# Selenium 4 removed find_element_by_name; use find_element(By.NAME, ...) instead
username = driver.find_element(By.NAME, 'username')
username.clear()
username.send_keys('natew')

password = driver.find_element(By.NAME, 'password')
password.clear()
password.send_keys('codingisfun')
sleep(2)
password.send_keys(Keys.RETURN)  # submit the login form

sleep(2)  # give the login request time to finish

driver.get('https://kekambas-bs.herokuapp.com/createpost')

title = driver.find_element(By.NAME, 'title')
title.clear()
title.send_keys('TESTING HEADLESS VERSION')
sleep(2)
content = driver.find_element(By.NAME, 'content')
content.clear()
content.send_keys('I am Robot.... beep boop')
sleep(2)
submit = driver.find_element(By.NAME, 'submit')
submit.click()

driver.get('https://kekambas-bs.herokuapp.com/index')
# Call driver.quit() when finished to close the headless browser
