# Data Science - Web Scraping

## Tasks Today:

1) <b>Requests</b> <br>
 &nbsp;&nbsp;&nbsp;&nbsp; a) Importing <br>
 &nbsp;&nbsp;&nbsp;&nbsp; b) Using Requests <br>
2) <b>Beautiful Soup</b> <br>
 &nbsp;&nbsp;&nbsp;&nbsp; a) Importing <br>
 &nbsp;&nbsp;&nbsp;&nbsp; b) Using Beautiful Soup <br>
 &nbsp;&nbsp;&nbsp;&nbsp; c) .prettify() <br>
 &nbsp;&nbsp;&nbsp;&nbsp; d) Converting to a List <br>
 &nbsp;&nbsp;&nbsp;&nbsp; e) Extracting Beautiful Soup Elements <br>
 &nbsp;&nbsp;&nbsp;&nbsp; f) Assigning Variables from Beautiful Soup <br>
 &nbsp;&nbsp;&nbsp;&nbsp; g) .find() <br>
 &nbsp;&nbsp;&nbsp;&nbsp; h) .find_all() <br>
3) <b>Exercise</b> <br>

## Requests

### Importing

In [1]:
import requests

### Using Requests

In [2]:
# perform a request on 'http://www.arthurleej.com/e-love.html'

page = requests.get('http://www.arthurleej.com/e-love.html')

In [3]:
# display page result

print(page)

<Response [200]>
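`<Response [200]>` means the server replied with HTTP status 200 (OK). A minimal sketch of checking the status before parsing is shown here. The helper name `fetch_html` and the 10-second timeout are assumptions for illustration and do not appear in the original notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its raw bytes (hypothetical helper)."""
    # timeout=10 is an arbitrary choice; without one, requests can wait forever
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError on 4xx/5xx responses
    response.raise_for_status()
    return response.content


# Example call (needs network access):
# html_bytes = fetch_html('http://www.arthurleej.com/e-love.html')
```

Failing early this way avoids handing an error page to the parser by mistake.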


##### .content

In [4]:
print(page.content)

b'<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">\r<html>\r<head>\r\t<title>Essay on Love by Arthur Lee Jacobson</title>\r<meta name="description" content="Trees,gardening, wild and domestic plant life are the specialty of author Arthur Lee Jacobson.">\r<meta name="keywords" content="trees, gardening, wild plants, domestic plants, gardening author, gardening books, Arthur Lee Jacobson, A L J, A L Jacobson, Arthur Jacobson, arthur lee, plants, flowers, seattle, washington">\r<meta name="resource-type" content="document">\r<meta name="generator" content="BBEdit 4.5">\r<meta name="robots" content="all">\r<meta name="classification" content="Gardening">\r<meta name="distribution" content="global">\r<meta name="rating" content="general">\r<meta name="copyright" content="2001 Arthur Lee Jacobson">\r<meta name="author" content="eriktyme@eriktyme.com">\r<meta name="language" content="en-us">\r</head>\r<body background="images/background1a.jpg" bgcolor="#FFFFCC" text="#000000" link="#00
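The `b'...'` prefix in the output shows that `page.content` is a `bytes` object, which is why `\r` and `\t` show up as escape sequences. A small sketch of turning bytes into text follows, using a short made-up byte string instead of the real page and assuming UTF-8 encoding.

```python
# A short stand-in for page.content (assumed sample, not the full page)
raw = b'<title>Essay on Love by Arthur Lee Jacobson</title>\r'

# bytes must be decoded to get a str; UTF-8 is an assumption here
text = raw.decode('utf-8')

print(type(raw).__name__)
print(type(text).__name__)
print(text.strip())
```

requests also exposes an already decoded version of the body as `page.text`.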

## Beautiful Soup

### Importing

In [5]:
from bs4 import BeautifulSoup

### Using Beautiful Soup

In [7]:
# parse the raw bytes with Python's built-in HTML parser
soup = BeautifulSoup(page.content, 'html.parser')

print(type(soup))

<class 'bs4.BeautifulSoup'>
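To see what the `BeautifulSoup` object offers without a network call, here is a minimal sketch that parses a tiny inline document. The HTML string and the `sample_` names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Tiny stand-in document (assumed sample, not the scraped page)
sample_html = (
    '<html><head><title>Essay on Love</title></head>'
    '<body><p>Hello</p></body></html>'
)

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# Tags can be reached by name as attributes of the soup
print(sample_soup.title.string)
print(sample_soup.p.get_text())
```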


### .prettify()

In [9]:
print(soup.prettify())

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<html>
 <head>
  <title>
   Essay on Love by Arthur Lee Jacobson
  </title>
  <meta content="Trees,gardening, wild and domestic plant life are the specialty of author Arthur Lee Jacobson." name="description"/>
  <meta content="trees, gardening, wild plants, domestic plants, gardening author, gardening books, Arthur Lee Jacobson, A L J, A L Jacobson, Arthur Jacobson, arthur lee, plants, flowers, seattle, washington" name="keywords"/>
  <meta content="document" name="resource-type"/>
  <meta content="BBEdit 4.5" name="generator"/>
  <meta content="all" name="robots"/>
  <meta content="Gardening" name="classification"/>
  <meta content="global" name="distribution"/>
  <meta content="general" name="rating"/>
  <meta content="2001 Arthur Lee Jacobson" name="copyright"/>
  <meta content="eriktyme@eriktyme.com" name="author"/>
  <meta content="en-us" name="language"/>
 </head>
 <body alink="#33CC33" background="images/background1a.jpg" b
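A short sketch of what `.prettify()` does, run on a one-line document so the effect is easy to see. The input string is an assumed sample.

```python
from bs4 import BeautifulSoup

compact = '<html><head><title>Demo</title></head></html>'
pretty = BeautifulSoup(compact, 'html.parser').prettify()

# prettify() returns a str with one tag per line, indented by nesting depth
print(pretty)
```

Because the result is a plain string, it is meant for reading the structure, not for further parsing.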

### Converting to a List

In [10]:
list(soup.children)

['HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN"',
 ' ',
 <html> <head> <title>Essay on Love by Arthur Lee Jacobson</title> <meta content="Trees,gardening, wild and domestic plant life are the specialty of author Arthur Lee Jacobson." name="description"/> <meta content="trees, gardening, wild plants, domestic plants, gardening author, gardening books, Arthur Lee Jacobson, A L J, A L Jacobson, Arthur Jacobson, arthur lee, plants, flowers, seattle, washington" name="keywords"/> <meta content="document" name="resource-type"/> <meta content="BBEdit 4.5" name="generator"/> <meta content="all" name="robots"/> <meta content="Gardening" name="classification"/> <meta content="global" name="distribution"/> <meta content="general" name="rating"/> <meta content="2001 Arthur Lee Jacobson" name="copyright"/> <meta content="eriktyme@eriktyme.com" name="author"/> <meta content="en-us" name="language"/> </head> <body alink="#33CC33" background="images/background1a.jpg" bgcolor="#FFFFCC" link="#0000FF" te
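`.children` yields only the direct children of a node, which is why the list above holds the doctype, whitespace, and a single `<html>` tag. The sketch below contrasts it with `.descendants` on a small assumed sample document.

```python
from bs4 import BeautifulSoup

doc = BeautifulSoup(
    '<html><head><title>Demo</title></head><body><p>Hi</p></body></html>',
    'html.parser',
)

top_level = list(doc.children)        # only the <html> tag
html_kids = list(doc.html.children)   # <head> and <body>
everything = list(doc.descendants)    # every nested tag and string

print(len(top_level), [kid.name for kid in html_kids], len(everything))
```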

### Extracting Beautiful Soup Elements

In [12]:
# Tag objects let you traverse the HTML tree and extract nested tags and text

sections = [type(item) for item in soup.children]

for section in sections:
    print(section)

<class 'bs4.element.Doctype'>
<class 'bs4.element.NavigableString'>
<class 'bs4.element.Tag'>
<class 'bs4.element.NavigableString'>
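Since only `Tag` objects can be traversed further, it can help to filter out the doctype and whitespace strings first. A minimal sketch, using an assumed sample document:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

doc = BeautifulSoup(
    '<!DOCTYPE html>\n<html><body><p>Hi</p></body></html>\n',
    'html.parser',
)

# Keep only Tag objects; the doctype and whitespace strings are skipped
tags_only = [item for item in doc.children if isinstance(item, Tag)]

print([tag.name for tag in tags_only])
```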


### Assigning Variables from Beautiful Soup

In [22]:
html = list(soup.children)[2]
body = list(html.children)[3]
center = list(body.children)[4]
table = list(center.children)[0]

for el in table:
    print('&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&')
    print(el)
    print('&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&')

&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&
 
&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&
&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&
<tr> <td align="center" valign="top" width="480"> <table border="0" cellpadding="1" cellspacing="2"> <tr> <td align="center" valign="top" width="480"><font size="5"><b>Love</b></font></td> </tr> <tr> <td align="left" valign="top" width="480"><font size="3"><b>    Of the fourteen essays I'm writing, only this one treats an emotion. That love is the most important emotion is the deduction. I think other emotions may be as important, but are not so powerfully moving or interesting to most of us. Love is exciting. There is no need to justify choosing to write about it. Are not most songs love songs? Are not most novels stories featuring love?</b></font></td> </tr> <tr> <td align="left" valign="top" width="480"><font size="3"><b>    Love in its broad sense is the feeling of strong attraction, and often attachment and protection. It is felt towards other people, towards p
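Hard-coded positions such as `[2]`, `[3]`, and `[4]` break as soon as the page adds or removes a whitespace node. One alternative is to navigate by tag name. The sketch below uses a simplified, assumed stand-in for the essay page layout.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the essay page layout (assumed structure)
sample = BeautifulSoup(
    '<html><body><center><table><tr><td><b>Love</b></td></tr></table>'
    '</center></body></html>',
    'html.parser',
)

# Dotted access returns the first matching descendant tag at each step
table_by_name = sample.body.center.table

# find() searches the whole subtree, so intermediate levels can be skipped
table_by_find = sample.find('table')

print(table_by_name is table_by_find)
print(table_by_find.td.get_text())
```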

### .find() <br>
<p>Returns the first element that matches the tag name (or other filter) passed in.</p>

In [34]:
soup.find('b').text

'Love'
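When nothing matches, `.find()` returns `None`, so calling `.text` on the result raises an `AttributeError`. A small sketch of guarding against that, on an assumed sample snippet:

```python
from bs4 import BeautifulSoup

sample = BeautifulSoup('<p><b>Love</b> is exciting.</p>', 'html.parser')

first_bold = sample.find('b')
missing = sample.find('i')   # there is no <i> tag in the sample

print(first_bold.text)
print(missing)

# Guard before touching .text, because None has no such attribute
italic_text = missing.text if missing is not None else ''
```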

### .find_all() <br>
<p>Works like .find(), but returns a list of every matching element instead of only the first.</p>

In [33]:
soup.find_all('b')

[<b>Love</b>,
 <b>    Of the fourteen essays I'm writing, only this one treats an emotion. That love is the most important emotion is the deduction. I think other emotions may be as important, but are not so powerfully moving or interesting to most of us. Love is exciting. There is no need to justify choosing to write about it. Are not most songs love songs? Are not most novels stories featuring love?</b>,
 <b>    Love in its broad sense is the feeling of strong attraction, and often attachment and protection. It is felt towards other people, towards pets, towards inanimate objects, towards abstractions such as patriotism, religious matters, hobbies, and I suppose nearly everything. It is multifaceted, and includes ordinary self-love, chivalrous love, carnal or sexual love, friendly love, family love. It is an emotion that is closely related to certain others, such as hope. At its simplest level it is what we strongly like.</b>,
 <b>    I have a hunch that love, like the rose, owes muc
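`.find_all()` also accepts filters. The sketch below shows `attrs` and `limit`, and uses `get_text(strip=True)` to trim the leading spaces visible in the output above. The table snippet is a made-up sample.

```python
from bs4 import BeautifulSoup

sample = BeautifulSoup(
    '<table><tr>'
    '<td align="center"><b>Love</b></td>'
    '<td align="left"><b>    First paragraph.</b></td>'
    '<td align="left"><b>    Second paragraph.</b></td>'
    '</tr></table>',
    'html.parser',
)

# strip=True removes leading and trailing whitespace from each string
paragraphs = [b.get_text(strip=True) for b in sample.find_all('b')]

# attrs narrows the match by attribute; limit caps the number of results
left_cells = sample.find_all('td', attrs={'align': 'left'}, limit=1)

print(paragraphs)
print(len(left_cells))
```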

## Exercise - Together <br>
<p>Using the Beautiful Soup library, grab the data from the following link: https://www.baseball-reference.com/teams/BOS/batteam.shtml. After getting the data, display only the year and batting average for each year (2017: .276). Lastly, plot the data on a preferred matplotlib chart.</p>

In [37]:
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.baseball-reference.com/teams/BOS/batteam.shtml')

soup = BeautifulSoup(page.content, 'html.parser')

print(soup.prettify())

<!DOCTYPE html>
<html class="no-js" data-root="/home/br/build" data-version="klecko-" itemscope="" itemtype="https://schema.org/WebSite" lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="ie=edge" http-equiv="x-ua-compatible"/>
  <meta content="width=device-width, initial-scale=1.0, maximum-scale=2.0" name="viewport">
   <link href="https://d2p3bygnnzw9w3.cloudfront.net/req/201907232" rel="dns-prefetch"/>
   <!-- no:cookie fast load the css.           -->
   <link crossorigin="" href="https://d2p3bygnnzw9w3.cloudfront.net" rel="preconnect"/>
   <link crossorigin="" href="https://d3k2oh6evki4b7.cloudfront.net" rel="preconnect"/>
   <style>
   </style>
   <link as="style" crossorigin="" href="https://d2p3bygnnzw9w3.cloudfront.net/req/201908081/css/br/sr-min.css" rel="preload"/>
   <link crossorigin="" href="https://d2p3bygnnzw9w3.cloudfront.net/req/201908081/css/br/sr-min.css" media="print" onload="this.media='all'" rel="stylesheet"/>
   <noscript>
    <link href="https://d2p3

In [83]:
hrs = soup.find_all('td', attrs={"data-stat" : 'HR'})
years = soup.find_all('th', attrs={"data-stat" : 'year_ID'})

# print(hrs[0].text)

years.pop(0)             # first item is the header text

print(years[39].text)

data = {
    'years' : [],
    'hrs' : []
}

# pair each season with its home-run total and store both as integers
for year, hr in zip(years, hrs):
    data['years'].append(int(year.get_text()))
    data['hrs'].append(int(hr.get_text()))
    
    
print(data)

1980
{'years': [2019, 2018, 2017, 2016, 2015, 2014, 2013, 2012, 2011, 2010, 2009, 2008, 2007, 2006, 2005, 2004, 2003, 2002, 2001, 2000, 1999, 1998, 1997, 1996, 1995, 1994, 1993, 1992, 1991, 1990, 1989, 1988, 1987, 1986, 1985, 1984, 1983, 1982, 1981, 1980, 1979, 1978, 1977, 1976, 1975, 1974, 1973, 1972, 1971, 1970, 1969, 1968, 1967, 1966, 1965, 1964, 1963, 1962, 1961, 1960, 1959, 1958, 1957, 1956, 1955, 1954, 1953, 1952, 1951, 1950, 1949, 1948, 1947, 1946, 1945, 1944, 1943, 1942, 1941, 1940, 1939, 1938, 1937, 1936, 1935, 1934, 1933, 1932, 1931, 1930, 1929, 1928, 1927, 1926, 1925, 1924, 1923, 1922, 1921, 1920, 1919, 1918, 1917, 1916, 1915, 1914, 1913, 1912, 1911, 1910, 1909, 1908, 1907, 1906, 1905, 1904, 1903, 1902, 1901], 'hrs': [200, 208, 168, 208, 161, 123, 178, 165, 203, 211, 212, 173, 166, 192, 199, 222, 238, 177, 198, 167, 176, 205, 185, 209, 175, 120, 114, 84, 126, 106, 108, 124, 174, 144, 162, 181, 142, 136, 90, 162, 194, 172, 213, 134, 134, 109, 147, 124, 161, 203, 197, 125, 158
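Collecting years and home runs in two separate lists relies on both lists lining up by position. A sketch of a row-by-row alternative is shown here on a hypothetical, heavily simplified imitation of the table markup. Only the `data-stat` values `year_ID` and `HR` come from the code above; the rest of the structure is assumed.

```python
from bs4 import BeautifulSoup

# Hypothetical, heavily simplified imitation of the stats table markup
sample_table = """
<table>
  <tr><th data-stat="year_ID">Year</th><th data-stat="HR">HR</th></tr>
  <tr><th data-stat="year_ID">2019</th><td data-stat="HR">200</td></tr>
  <tr><th data-stat="year_ID">2018</th><td data-stat="HR">208</td></tr>
</table>
"""

sample = BeautifulSoup(sample_table, 'html.parser')

rows = {'years': [], 'hrs': []}
for tr in sample.find_all('tr'):
    year_cell = tr.find('th', attrs={'data-stat': 'year_ID'})
    hr_cell = tr.find('td', attrs={'data-stat': 'HR'})
    # the header row has no <td>, so it is skipped without needing pop(0)
    if year_cell is None or hr_cell is None:
        continue
    rows['years'].append(int(year_cell.get_text()))
    rows['hrs'].append(int(hr_cell.get_text()))

print(rows)
```

Reading both cells from the same `<tr>` keeps each year attached to its own value even if some rows lack a cell.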

#### In-class exercise: convert data into a pandas DataFrame, then find the average number of HRs hit per season from 1960 up to (but not including) 1980


In [94]:
df = pd.DataFrame.from_dict(data)


df[(df['years'] >= 1960) & (df['years'] < 1980)].mean()

years    1969.5
hrs       156.0
dtype: float64
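The filter above uses `< 1980`, so the 1980 season is excluded (hence the mean year of 1969.5). The sketch below contrasts that with an inclusive range, using made-up numbers chosen only to show the effect of the boundary.

```python
import pandas as pd

# Made-up numbers, chosen only to show the effect of the boundary
demo = pd.DataFrame({'years': [1959, 1960, 1970, 1980, 1981],
                     'hrs': [100, 110, 120, 130, 140]})

# Strict upper bound: 1980 is left out
excl = demo[(demo['years'] >= 1960) & (demo['years'] < 1980)]['hrs'].mean()

# Series.between() includes both endpoints by default
incl = demo[demo['years'].between(1960, 1980)]['hrs'].mean()

print(excl, incl)
```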

#### In-class exercise: using the converted DataFrame, find the year with the largest year-over-year change in home runs
<p>Example: if they hit 30 more home runs in 1981 than in 1980, but only 12 more in 1991 than in 1990, then 1981 has the higher differential.</p>

In [95]:
# df.head(5)

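# rows run newest to oldest, so shift(-1) pulls in the previous season's total
df['prev_hr'] = df['hrs'].shift(-1)

# absolute change in home runs from the previous season
df['hr_diff'] = abs(df['hrs'] - df['prev_hr'])

# largest changes first
df = df.sort_values('hr_diff', ascending=False)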

df.head(5)

| | years | hrs | prev_hr | hr_diff |
|---|---|---|---|---|
| 42 | 1977 | 213 | 134.0 | 79.0 |
| 50 | 1969 | 197 | 125.0 | 72.0 |
| 38 | 1981 | 90 | 162.0 | 72.0 |
| 16 | 2003 | 238 | 177.0 | 61.0 |
| 73 | 1946 | 109 | 50.0 | 59.0 |
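An alternative sketch of the same idea uses `Series.diff()` after sorting oldest first, which removes the need to reason about the direction of `shift()`. The three seasons below are taken from the scraped output earlier in this notebook; the variable names are illustrative.

```python
import pandas as pd

# Three seasons taken from the scraped output earlier in the notebook
demo = pd.DataFrame({'years': [1981, 1980, 1979], 'hrs': [90, 162, 194]})

# Sort oldest first so diff() compares each season with the one before it
ordered = demo.sort_values('years').reset_index(drop=True)
ordered['hr_diff'] = ordered['hrs'].diff().abs()

# idxmax() ignores the NaN produced for the first season
biggest = ordered.loc[ordered['hr_diff'].idxmax()]

print(int(biggest['years']), biggest['hr_diff'])
```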


## Exercise #1
<p>https://www.bostonglobe.com/</p>
<p>Print the top 5 words used on The Boston Globe's home page</p>

In [96]:
page = requests.get('https://www.bostonglobe.com/')

soup = BeautifulSoup(page.content, 'html.parser')

In [97]:
for el in soup.children:
    print(el)

html
<html lang="en-US"><head><title>The Boston Globe</title><meta content="New England’s best source for news, sports, opinion and entertainment. The Globe brings you breaking news, Spotlight Team investigations, year-round coverage of the Red Sox, Patriots, Celtics and Bruins, sharp editorials, stunning photography, and engaging arts, food and lifestyle journalism." name="description"/><meta content="News, Massachusetts, New England" name="keywords"/><meta name="title" value="The Boston Globe"/><script type="application/javascript"> if ( !('scrollingElement' in document) || !(Element.prototype.matches) || !(window.URLSearchParams) || !(String.prototype.includes) ) { document.write('<scr'+'ipt type="application/javascript" src="/pf/resources/dist/all-polyfills.js?d=83" defer=""></sc'+'ript>') }  </script><script type="application/javascript">if(!Array.prototype.includes||!(window.Object && window.Object.assign)||!window.Promise||!window.Symbol||!window.fetch){document.write('<script t

In [124]:
from bs4.element import Comment

def tagVisible(element):
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    return True

texts = soup.find_all(string=True)
result = filter(tagVisible, texts)

def removeWaste(word):
    bad_words = ['the', 'this', 'not', '—', '-', '_', 'A', 'as', 'a', 'in', 'of', 'to', 'The', 'and', 'at', 'on', 'from', 'for', 'is', 'that', 'are', 'be', 'by', '|', 'with']
    
    if word in bad_words:
        return False
    else:
        return True

word_count = {}

for text in result:
    words = text.replace('\n', '').replace('\t', '').split(' ')
    
    words = filter(removeWaste, words)
    
    for word in words:
        if word != '':
            if word in word_count:
                word_count[word] += 1
            else:
                word_count[word] = 1
                
word_count = sorted(word_count.items(), key=lambda k: k[1], reverse=True)
word_count

[('Sox', 17),
 ('Red', 15),
 ('Boston', 14),
 ('Globe', 12),
 ('Opinion', 7),
 ('game', 7),
 ('will', 7),
 ('says', 7),
 ('an', 7),
 ('were', 7),
 ('has', 7),
 ('up', 7),
 ('his', 6),
 ('next', 6),
 ('no', 6),
 ('struck', 6),
 ('their', 6),
 ('after', 6),
 ('was', 6),
 ('Local', 5),
 ('&', 5),
 ('Marijuana', 5),
 ('State', 5),
 ('its', 5),
 ('running', 5),
 ('truck', 5),
 ('ice', 5),
 ('How', 5),
 ('time', 5),
 ('had', 5),
 ('former', 5),
 ('South', 5),
 ('Obituaries', 4),
 ('Business', 4),
 ('Politics', 4),
 ('Real', 4),
 ('my', 4),
 ('Bruce', 4),
 ('Springsteen', 4),
 ('Police', 4),
 ('arrest', 4),
 ('Suspended', 4),
 ('other', 4),
 ('end', 4),
 ('more', 4),
 ('than', 4),
 ('president', 4),
 ('presidential', 4),
 ('ago', 4),
 ('Quincy', 4),
 ('most', 4),
 ('Chestnut', 4),
 ('In', 4),
 ('get', 4),
 ('against', 4),
 ('My', 4),
 ('do', 4),
 ('day', 4),
 ('seeks', 4),
 ('Alex', 4),
 ('Cora,', 4),
 ('swept', 4),
 ('costly', 4),
 ('decision', 4),
 ('loss', 4),
 ('Phillies', 4),
 ('business
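The counting step can also be sketched with `collections.Counter` from the standard library, which returns the top N entries directly. In this sketch the stop-word set, the lowercasing, and the punctuation stripping are assumptions that go beyond the original cell (they would merge entries such as `'Cora,'` and `'My'`/`'my'`), and `top_words` is a hypothetical helper name.

```python
import string
from collections import Counter

# Assumed stop-word set, adapted from the bad_words list above
STOP_WORDS = {'the', 'this', 'not', 'a', 'as', 'in', 'of', 'to', 'and', 'at',
              'on', 'from', 'for', 'is', 'that', 'are', 'be', 'by', 'with'}

# Characters trimmed from both ends of each word (ASCII punctuation + a dash)
TRIM_CHARS = string.punctuation + '\u2014'


def top_words(texts, n=5):
    """Return the n most common words across an iterable of strings."""
    counts = Counter()
    for text in texts:
        # split() with no argument handles spaces, tabs, and newlines alike
        for raw in text.split():
            word = raw.strip(TRIM_CHARS).lower()
            if word and word not in STOP_WORDS:
                counts[word] += 1
    return counts.most_common(n)


print(top_words(['Red Sox win', 'The Red Sox, again', 'sox'], n=2))
```

With the notebook's variables, the call would be `top_words(filter(tagVisible, texts))`.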