# Introduction to Web Scraping

To begin, we will examine the Reddit page devoted to machine learning. Our goal is to scrape the basic information for each post. We first practice the basic tools on a small HTML snippet and a Wikipedia page.

![](images/reddit.png)

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

In [3]:
%%HTML
<h1>This is a header</h1>
<p class = 'super-paragraph'>This would be a paragraph. <strong>Strong Words</strong> here.</p>
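
The snippet above is ordinary HTML, so it can also be parsed directly from a string. The following is a minimal sketch, assuming `bs4` is installed; the variable names (`snippet`, `header_text`, and so on) are illustrative and not part of the original notebook.

```python
from bs4 import BeautifulSoup

# The same markup as the %%HTML cell, held in a plain string.
html = """
<h1>This is a header</h1>
<p class='super-paragraph'>This would be a paragraph. <strong>Strong Words</strong> here.</p>
"""

snippet = BeautifulSoup(html, 'html.parser')

# find() returns the first matching tag; .text returns its text content.
header_text = snippet.find('h1').text

# Tags can be selected by attribute, here the class on the paragraph.
paragraph = snippet.find('p', {'class': 'super-paragraph'})

# Searches can be nested: look for <strong> only inside the paragraph.
strong_text = paragraph.find('strong').text

print(header_text)
print(paragraph.text)
print(strong_text)
```

The same `find` calls are used on real pages in the cells that follow.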

In [8]:
url = 'https://en.wikipedia.org/wiki/List_of_21_Jump_Street_episodes'

In [9]:
response = requests.get(url)

In [10]:
response

<Response [200]>
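
A status of 200 means the request succeeded. Other pages may respond with an error status, so it can help to check before parsing. The helper below is a sketch under that assumption; `fetch_html` is a hypothetical name, not something defined in the original notebook.

```python
import requests

def fetch_html(url, timeout=10):
    """Return the body of `url` as text, raising on HTTP error statuses."""
    # A timeout keeps the call from hanging indefinitely on a slow server.
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses.
    response.raise_for_status()
    return response.text

# Example call (requires network access):
# html = fetch_html('https://en.wikipedia.org/wiki/List_of_21_Jump_Street_episodes')
```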

In [12]:
response.text[:1000]

'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>List of 21 Jump Street episodes - Wikipedia</title>\n<script>document.documentElement.className = document.documentElement.className.replace( /(^|\\s)client-nojs(\\s|$)/, "$1client-js$2" );</script>\n<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"List_of_21_Jump_Street_episodes","wgTitle":"List of 21 Jump Street episodes","wgCurRevisionId":844038329,"wgRevisionId":844038329,"wgArticleId":35403829,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles needing additional references from May 2012","All articles needing additional references","21 Jump Street","Lists of American crime television series episodes"],"wgBreakFrames":false,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparat

In [13]:
soup = BeautifulSoup(response.text, 'html.parser')

In [14]:
soup.find('h2')

<h2>Contents</h2>

In [20]:
all_h2s = soup.find_all('h2')

for h2 in all_h2s:
    print(h2.text)

Contents
Series Overview[edit]
Season 1 (1987)[edit]
Season 2 (1987-88)[edit]
Season 3 (1988-89)[edit]
Season 4 (1989-90)[edit]
Season 5 (1990-91)[edit]
References[edit]
Navigation menu
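
Several of the headings printed above end in `[edit]`, which is text from the page's edit links rather than part of the heading. One way to remove it is sketched below on a simplified toy copy of the headings; the real page nests the marker in extra tags, and `clean_heading` is a hypothetical helper.

```python
from bs4 import BeautifulSoup

# Toy markup reproducing three of the heading texts shown above (an assumption:
# the real page wraps the "[edit]" link in additional tags).
html = """
<h2>Contents</h2>
<h2>Season 1 (1987)[edit]</h2>
<h2>References[edit]</h2>
"""

def clean_heading(text, marker='[edit]'):
    """Strip whitespace and a trailing edit-link marker from a heading."""
    text = text.strip()
    if text.endswith(marker):
        text = text[:-len(marker)]
    return text.strip()

toy = BeautifulSoup(html, 'html.parser')
headings = [clean_heading(h2.text) for h2 in toy.find_all('h2')]
print(headings)
```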


In [21]:
soup.find('p')

<p><i><a href="/wiki/21_Jump_Street" title="21 Jump Street">21 Jump Street</a></i> is an American <a href="/wiki/Police_procedural" title="Police procedural">police procedural</a> <a class="mw-redirect" href="/wiki/Crime_drama" title="Crime drama">crime drama</a> <a class="mw-redirect" href="/wiki/Television_series" title="Television series">television series</a> that aired on the <a href="/wiki/Fox_Broadcasting_Company" title="Fox Broadcasting Company">Fox Network</a> and in first run syndication from April 12, 1987, to April 27, 1991, with a total of 103 <a href="/wiki/Episode" title="Episode">episodes</a>. The series focuses on a squad of youthful-looking undercover police officers investigating crimes in high schools, colleges, and other teenage venues.<sup class="reference" id="cite_ref-1"><a href="#cite_note-1">[1]</a></sup>
</p>

In [22]:
len(soup.find_all('p'))

1

In [27]:
table_1 = soup.find('table',{'class':'wikitable plainrowheaders'})

In [31]:
season_1_titles = table_1.find_all('td',{'class':'summary'})

season_1_titles

[<td class="summary" style="text-align:left">"Pilot"</td>,
 <td class="summary" style="text-align:left">"America, What a Town"</td>,
 <td class="summary" style="text-align:left">"Don't Pet the Teacher"</td>,
 <td class="summary" style="text-align:left">"My Future's So Bright, I Gotta Wear Shades"</td>,
 <td class="summary" style="text-align:left">"The Worst Night of Your Life"</td>,
 <td class="summary" style="text-align:left">"Gotta Finish the Riff"</td>,
 <td class="summary" style="text-align:left">"Bad Influence"</td>,
 <td class="summary" style="text-align:left">"Blindsided"</td>,
 <td class="summary" style="text-align:left">"Next Generation"</td>,
 <td class="summary" style="text-align:left">"Low and Away"<br/>"Running on Ice"</td>,
 <td class="summary" style="text-align:left">"16 Blown to 35"</td>,
 <td class="summary" style="text-align:left">"Mean Streets and Pastel Houses"</td>]
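
One cell in the list above holds two titles separated by a `<br/>` tag. Because `.text` joins the strings inside a tag with no separator, those two titles run together when printed. The sketch below shows two ways to keep them apart, using that cell's markup copied into a string; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# The double-episode cell from the output above, as a string.
cell_html = '<td class="summary" style="text-align:left">"Low and Away"<br/>"Running on Ice"</td>'
cell = BeautifulSoup(cell_html, 'html.parser').find('td')

# .text concatenates the two strings with nothing in between.
joined = cell.text

# get_text() accepts a separator placed between the strings.
separated = cell.get_text(separator=' / ')

# stripped_strings yields each string on its own; strip('"') drops the quotes.
parts = [s.strip('"') for s in cell.stripped_strings]

print(joined)
print(separated)
print(parts)
```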

In [32]:
for title in season_1_titles:
    print(title.text)

"Pilot"
"America, What a Town"
"Don't Pet the Teacher"
"My Future's So Bright, I Gotta Wear Shades"
"The Worst Night of Your Life"
"Gotta Finish the Riff"
"Bad Influence"
"Blindsided"
"Next Generation"
"Low and Away""Running on Ice"
"16 Blown to 35"
"Mean Streets and Pastel Houses"


In [None]:
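links = []
# NOTE: `soup` must hold the parsed Reddit listing page here,
# not the Wikipedia page parsed in the cells above.
for i in soup.find_all('a', {'data-click-id': 'body'}):
    # Each href is relative, so prepend the site root to build a full URL.
    url_link = 'https://www.reddit.com' + i['href']
    links.append(url_link)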

In [None]:
links

In [None]:
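links = []
titles = []
bodys = []
# As before, `soup` must be the parsed Reddit listing page.
for i in soup.find_all('a', {'data-click-id': 'body'}):
    url_link = 'https://www.reddit.com' + i['href']
    links.append(url_link)
    # Fetch each post's own page and parse it separately.
    response = requests.get(url_link)
    soup2 = BeautifulSoup(response.text, 'html.parser')
    title = soup2.find('h2')    # first <h2> tag, or None if there is none
    body = soup2.find_all('p')  # list of every <p> tag on the post page
    titles.append(title)
    bodys.append(body)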

In [None]:
import pandas as pd

In [None]:
df = pd.DataFrame({'links': links, 'title': titles, 'body': bodys})

In [None]:
df.head()
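
The lists built above hold BeautifulSoup tag objects (or `None` when `find` matches nothing) rather than plain strings. A possible cleanup step before building the DataFrame is sketched below on a toy page; `tag_text` and `paragraphs_text` are hypothetical helpers, and the markup is invented for illustration.

```python
from bs4 import BeautifulSoup

def tag_text(tag):
    """Return the stripped text of a tag, or None when find() found nothing."""
    return tag.get_text(strip=True) if tag is not None else None

def paragraphs_text(tags):
    """Join the text of several tags, one per line."""
    return '\n'.join(t.get_text(strip=True) for t in tags)

# Toy post page (an assumption, not real Reddit markup).
page = BeautifulSoup('<h2> A post title </h2><p>First.</p><p>Second.</p>', 'html.parser')

title = tag_text(page.find('h2'))
body = paragraphs_text(page.find_all('p'))
missing = tag_text(page.find('h3'))  # no <h3> on the page, so this is None

print(title)
print(body)
print(missing)
```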

### Wikipedia Exercise

Scraping Wikipedia tables and adding information found through links.

![](images/wiki_table.png)

Problem:

1. Create a dataframe that contains the information displayed on the Wikipedia page "List of 2018 Albums".
2. What is Sub Pop releasing in 2018?
3. Did Drake put anything out?
4. Which label is putting out the most music? Visualize this.

In [None]:
url = 'https://en.wikipedia.org/wiki/List_of_2018_albums'

In [None]:
response = requests.get(url)

In [None]:
soup = BeautifulSoup(response.text, 'html.parser')

In [None]:
soup.find('table', {'class':'wikitable'})
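
As a starting point for the exercise, the sketch below turns a small table into a DataFrame and counts rows per label. The table is a toy stand-in with placeholder values and assumed column names (`Artist`, `Album`, `Label`). The real page may use merged cells, and rows whose cell count does not match the header are skipped here.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Toy table with placeholder data; column names are assumptions.
html = """
<table class="wikitable">
  <tr><th>Artist</th><th>Album</th><th>Label</th></tr>
  <tr><td>Artist A</td><td>Album 1</td><td>Label X</td></tr>
  <tr><td>Artist B</td><td>Album 2</td><td>Label Y</td></tr>
  <tr><td>Artist C</td><td>Album 3</td><td>Label X</td></tr>
</table>
"""

table = BeautifulSoup(html, 'html.parser').find('table', {'class': 'wikitable'})
rows = table.find_all('tr')

# The first row supplies the column names.
columns = [th.get_text(strip=True) for th in rows[0].find_all('th')]

records = []
for row in rows[1:]:
    cells = [cell.get_text(strip=True) for cell in row.find_all(['td', 'th'])]
    # Keep only rows that line up with the header.
    if len(cells) == len(columns):
        records.append(dict(zip(columns, cells)))

albums = pd.DataFrame(records, columns=columns)

# value_counts() sorts labels from most to least frequent.
label_counts = albums['Label'].value_counts()
print(label_counts)
```

A call such as `label_counts.plot(kind='bar')` would then draw the counts as a bar chart.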

### Tweepy

- Sign in to Twitter apps (https://apps.twitter.com/).
- Create an application and retrieve its `consumer_key`, `consumer_secret`, `access_token`, and `access_token_secret`.
- Follow the example below, filling in your own credentials. For more information, see the Tweepy documentation [here](http://tweepy.readthedocs.io/en/v3.5.0/getting_started.html#introduction).

In [34]:
# Replace each placeholder with the value from your own Twitter application.
# Real keys should never be published in a shared notebook.
consumer_key = 'YOUR_CONSUMER_KEY'
consumer_secret = 'YOUR_CONSUMER_SECRET'
access_token = 'YOUR_ACCESS_TOKEN'
access_token_secret = 'YOUR_ACCESS_TOKEN_SECRET'

In [36]:
import tweepy

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

api = tweepy.API(auth)
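
Credentials typed directly into a notebook are easy to leak when the notebook is shared. One common alternative is to read them from environment variables. The sketch below assumes variable names such as `TWITTER_CONSUMER_KEY`; these names and the `load_twitter_credentials` helper are hypothetical, not part of Tweepy.

```python
import os

# Assumed environment variable names (chosen for this sketch).
REQUIRED_KEYS = (
    'TWITTER_CONSUMER_KEY',
    'TWITTER_CONSUMER_SECRET',
    'TWITTER_ACCESS_TOKEN',
    'TWITTER_ACCESS_TOKEN_SECRET',
)

def load_twitter_credentials(env=None):
    """Return the four credentials from `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise KeyError('Missing environment variables: ' + ', '.join(missing))
    return {key: env[key] for key in REQUIRED_KEYS}

# Usage with the cell above (requires the variables to be set):
# creds = load_twitter_credentials()
# auth = tweepy.OAuthHandler(creds['TWITTER_CONSUMER_KEY'],
#                            creds['TWITTER_CONSUMER_SECRET'])
# auth.set_access_token(creds['TWITTER_ACCESS_TOKEN'],
#                       creds['TWITTER_ACCESS_TOKEN_SECRET'])
```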

In [37]:
user = api.get_user('qz')

In [38]:
for tweet in user.timeline():
    print(tweet.text)

Is it bad to shop on Amazon? https://t.co/O4cvuvMUWL https://t.co/qXRMeXaGtv
Cash is pouring into tech startups from every source https://t.co/CGLQZPNZmY
Donald Trump already told us why he’s shouting at Iran https://t.co/K2pOvo8B60
Donald Trump already told us why he’s shouting at Iran https://t.co/Fw4GF0xvSY
American cheese is no longer the most popular cheese in America https://t.co/nA5dP7VhxJ
In the race to a $1 trillion valuation, analysts are still betting on Apple over Amazon https://t.co/PgTdINZJob
From elephants to eagles: the evolving brands of Nigeria’s unsuccessful national airlines https://t.co/wBbgV4v3bO
Why the world is so excited about electric cars https://t.co/knCcMPwpfF
This album documents the lasting impact of Sudan on Africa’s music scene https://t.co/AuNqzifeUN
It’s not all about advertising at Alphabet anymore https://t.co/Mjv4dmXBIV
Africa is now the world’s epicenter of modern-day slavery https://t.co/4Z2YqoDsUJ
It's impossible to lead a totally ethical life—b

In [39]:
print(user.followers_count)

359396


In [40]:
tweets = []
for tweet in user.timeline(count = 200):
    tweets.append(tweet.text)

In [41]:
tweets[:5]

['Is it bad to shop on Amazon? https://t.co/O4cvuvMUWL https://t.co/qXRMeXaGtv',
 'Cash is pouring into tech startups from every source https://t.co/CGLQZPNZmY',
 'Donald Trump already told us why he’s shouting at Iran https://t.co/K2pOvo8B60',
 'Donald Trump already told us why he’s shouting at Iran https://t.co/Fw4GF0xvSY',
 'American cheese is no longer the most popular cheese in America https://t.co/nA5dP7VhxJ']
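
Each tweet above mixes a headline with one or more shortened `t.co` links. If the two need to be analyzed separately, a regular expression can split them. This is a rough sketch that assumes every link has the form `https://t.co/` followed by letters and digits; `split_tweet` is a hypothetical helper.

```python
import re

# Matches shortened links of the form seen in the output above.
LINK_PATTERN = re.compile(r'https://t\.co/\w+')

def split_tweet(text):
    """Return (headline, links) for one tweet's text."""
    found_links = LINK_PATTERN.findall(text)
    headline = LINK_PATTERN.sub('', text).strip()
    return headline, found_links

# First tweet from the output above.
sample = 'Is it bad to shop on Amazon? https://t.co/O4cvuvMUWL https://t.co/qXRMeXaGtv'
headline, found_links = split_tweet(sample)
print(headline)
print(found_links)
```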

In [42]:
karen = api.get_user('_karenhao')

In [43]:
for tweet in karen.timeline():
    print(tweet.text)

RT @Wolfe321: A meme for our time. https://t.co/Hq5K5fijfY
@jennygzhang i'm reading abe lincoln's biography by goodwin right now and he used to write out the letter he wanted… https://t.co/4LGty29I70
RT @WhenWeAllVote: Your vote is your voice. #WhenWeAllVote, we all do better. Register and volunteer at https://t.co/TgXnKAE7g8. https://t.…
@AkshatRathi You're not giving enough credit to the rocket guy https://t.co/oB1iqNxLzp
Wow interesting https://t.co/SmtXA3oFxY
RT @missanabeem: This is a wonderful opportunity https://t.co/VAOL7GlULF
Uncannily similar to my own process. https://t.co/RKpd9wW9PO
Being told by the instructor of my machine learning class “well yeah, it’s all a black box that’s fine 🤷🏼‍♂️” while… https://t.co/RRjNGel10m
@dancow Oh yikes, that Mother Jones interview was done by me—it didn't occur to me how powerful an identifier a tweet reference could be
Does @Refinery29 realize how much hate this will get. https://t.co/asr7x4dvID
RT @ChappellTracker: 6 These reviews of th

In [44]:
karen.followers_count

1426

### Open Table

![](images/open_table.png)

Finding restaurants in New York City (https://www.opentable.com/new-york-restaurant-listings). Is there good Indian food on the Upper West Side? Where? What are people saying is good?