# Web Crawling

- A web scraping technique for collecting data from a large number of webpages
- Combines **XPath** with a **for loop** statement
- Knowledge of **regular expressions** is a plus
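
As one illustration of where regular expressions help, scraped text often needs cleaning before analysis. This is a minimal sketch, assuming the year column comes back as strings such as `(1994)` or `(I) (2012)`; the sample strings and the helper name `extract_year` are hypothetical.

```python
import re

def extract_year(text):
    """Return the first four-digit number found in text, or None."""
    match = re.search(r"\d{4}", text)
    return int(match.group()) if match else None

# hypothetical scraped values, not real output
samples = ["(1994)", "(I) (2012)", "n/a"]
years = [extract_year(s) for s in samples]
print(years)
```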

# Copyright Laws
* Using crawled data for a **commercial service** may violate copyright laws.
* Web crawling in this course is for **educational purposes** only.

# Example: IMDb
- Previously, we scraped http://www.imdb.com/search/title?at=0&sort=num_votes,desc&start=1&title_type=feature&year=1950,2012
- The movie information is spread over more than 200,000 webpages. To collect data from all of them, you need **web crawling**.

## Web Scraping (first webpage only) 

In [2]:
# import python packages
import requests
from lxml import html
import csv
import pandas as pd

r = requests.get('http://www.imdb.com/search/title?at=0&sort=num_votes,desc&start=1&title_type=feature&year=1950,2012')
data = html.fromstring(r.text)

# Xpath
title = data.xpath("//h3/a/text()")  
url = data.xpath("//h3[@class='lister-item-header']/a/@href")        
year = data.xpath("//h3/span[2]/text()")    

# combine the three columns into rows of (title, url, year)
imdb = list(zip(title, url, year))
# convert to a table (DataFrame), similar to a spreadsheet
imdb = pd.DataFrame(imdb)
# save as CSV (the data/ folder must already exist)
imdb.to_csv("data/output_imdb.csv", encoding='utf-8')
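
`zip` pairs the lists by position and stops at the shortest one, so if one XPath matches fewer elements than the others, rows can be dropped or misaligned without any error. The sketch below uses made-up lists (not scraped data) to show this behaviour and a simple length check.

```python
# made-up values standing in for XPath results
title = ["Movie A", "Movie B", "Movie C"]
url = ["/title/a/", "/title/b/", "/title/c/"]
year = ["(2001)", "(2002)"]          # one value is missing

rows = list(zip(title, url, year))
print(len(rows))                     # zip stops at the shortest list

# check that every column has the same number of items before combining
same_length = len(title) == len(url) == len(year)
print(same_length)
```

If the lengths differ, it is usually worth revisiting the XPath expressions before saving the data.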

## Web Crawling

### Review: for loop

In [3]:
# loop 4 times: range(1, 5) yields 1, 2, 3, 4 (the end value is excluded)
for i in range(1,5):
    print(i)

1
2
3
4


When you visit the website (IMDb), the URLs of the result pages follow a pattern:

- http://www.imdb.com/search/title?at=0&sort=num_votes,desc&start=1&title_type=feature&year=1950,2012
- http://www.imdb.com/search/title?at=0&sort=num_votes,desc&start=51&title_type=feature&year=1950,2012
- http://www.imdb.com/search/title?at=0&sort=num_votes,desc&start=101&title_type=feature&year=1950,2012
 
The number after **start=** increases by 50. Then, we can try something like this:

In [4]:
a = "http://www.imdb.com/search/title?at=0&sort=num_votes,desc&start="
b = "&title_type=feature&year=1950,2012"

print(a + b)

http://www.imdb.com/search/title?at=0&sort=num_votes,desc&start=&title_type=feature&year=1950,2012


Okay, let's take one more step. How about this?

In [5]:
a = "http://www.imdb.com/search/title?at=0&sort=num_votes,desc&start="
b = "&title_type=feature&year=1950,2012"

for i in range(1,5):
    print(a + b)

http://www.imdb.com/search/title?at=0&sort=num_votes,desc&start=&title_type=feature&year=1950,2012
http://www.imdb.com/search/title?at=0&sort=num_votes,desc&start=&title_type=feature&year=1950,2012
http://www.imdb.com/search/title?at=0&sort=num_votes,desc&start=&title_type=feature&year=1950,2012
http://www.imdb.com/search/title?at=0&sort=num_votes,desc&start=&title_type=feature&year=1950,2012


Let's add a number to each URL.

In [6]:
a = "http://www.imdb.com/search/title?at=0&sort=num_votes,desc&start="
b = "&title_type=feature&year=1950,2012"
count = 1
for i in range(1,5):
    print('%s%s%s' % (a, count, b))
    count = count + 50    

http://www.imdb.com/search/title?at=0&sort=num_votes,desc&start=1&title_type=feature&year=1950,2012
http://www.imdb.com/search/title?at=0&sort=num_votes,desc&start=51&title_type=feature&year=1950,2012
http://www.imdb.com/search/title?at=0&sort=num_votes,desc&start=101&title_type=feature&year=1950,2012
http://www.imdb.com/search/title?at=0&sort=num_votes,desc&start=151&title_type=feature&year=1950,2012
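
The separate `count` variable can also be folded into the loop itself, because `range` accepts a step as its third argument. Here is a sketch of the same four URLs built that way (the names `urls` and `start` are only for illustration).

```python
a = "http://www.imdb.com/search/title?at=0&sort=num_votes,desc&start="
b = "&title_type=feature&year=1950,2012"

# range(1, 200, 50) yields 1, 51, 101, 151 (the end value 200 is excluded)
urls = [a + str(start) + b for start in range(1, 200, 50)]

for u in urls:
    print(u)
```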


Now, we will add XPaths into the loop statement.

In [7]:
a = "http://www.imdb.com/search/title?at=0&sort=num_votes,desc&start="
b = "&title_type=feature&year=1950,2012"
count = 1

for i in range(1,5):
    page_url = '%s%s%s' % (a, count, b)   # URL of the page to fetch
    count = count + 50
    data = html.fromstring(requests.get(page_url).text)

    # XPath
    title = data.xpath("//h3/a/text()")
    url = data.xpath("//h3[@class='lister-item-header']/a/@href")
    year = data.xpath("//h3/span[2]/text()")

    print(title, url, year)

['The Shawshank Redemption', 'The Dark Knight', 'Inception', 'Fight Club', 'Pulp Fiction', 'Forrest Gump', 'The Lord of the Rings: The Fellowship of the Ring', 'The Matrix', 'The Lord of the Rings: The Return of the King', 'The Godfather', 'The Dark Knight Rises', 'The Lord of the Rings: The Two Towers', 'Se7en', 'Gladiator', 'Batman Begins', 'Django Unchained', 'The Avengers', 'Star Wars: Episode IV - A New Hope', 'The Silence of the Lambs', 'Inglourious Basterds', 'Saving Private Ryan', 'The Departed', "Schindler's List", 'Avatar', 'The Prestige', 'Star Wars: Episode V - The Empire Strikes Back', 'Memento', 'American Beauty', 'Pirates of the Caribbean: The Curse of the Black Pearl', 'Shutter Island', 'The Green Mile', 'The Godfather: Part II', 'V for Vendetta', 'Titanic', 'American History X', 'Back to the Future', 'The Usual Suspects', 'Terminator 2: Judgment Day', 'Braveheart', 'Kill Bill: Vol. 1', u'L\xe9on: The Professional', 'Goodfellas', u'WALL\xb7E', 'Finding Nemo', 'The Sixth

['The Matrix Reloaded', 'The Bourne Identity', '21 Jump Street', 'X2', 'Juno', 'Rise of the Planet of the Apes', 'Men in Black', "Ocean's Eleven", 'Harry Potter and the Chamber of Secrets', 'The Notebook', 'Minority Report', 'Blood Diamond', 'Cast Away', 'Harry Potter and the Goblet of Fire', 'Shaun of the Dead', 'Harry Potter and the Prisoner of Azkaban', 'I, Robot', 'Despicable Me', 'Toy Story 2', 'Zombieland', 'Watchmen', 'Rocky', 'Spider-Man 3', 'X-Men: The Last Stand', 'Monty Python and the Holy Grail', 'Source Code', 'Troy', 'There Will Be Blood', 'Rain Man', 'Pirates of the Caribbean: On Stranger Tides', '(500) Days of Summer', 'Crazy, Stupid, Love.', 'The Hangover Part II', 'Oldeuboi', 'X-Men Origins: Wolverine', 'Harry Potter and the Order of the Phoenix', 'Children of Men', 'American Psycho', 'Mission: Impossible - Ghost Protocol', 'The Matrix Revolutions', 'The Butterfly Effect', 'The Perks of Being a Wallflower', 'Hot Fuzz', 'Edward Scissorhands', 'Little Miss Sunshine', 'T
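
When a loop requests many pages, it is common courtesy to pause between requests so the server is not overloaded. This is a minimal sketch, assuming a hypothetical helper `crawl` that accepts any `fetch` function (for example one wrapping `requests.get`). A stand-in `fake_fetch` is used here so the example runs without network access.

```python
import time

def crawl(urls, fetch, delay=1.0):
    """Fetch each URL in turn, pausing `delay` seconds between requests.

    fetch is any function that takes a URL and returns the page text,
    for example: lambda u: requests.get(u).text
    """
    pages = []
    for i, u in enumerate(urls):
        if i > 0:
            time.sleep(delay)      # wait before the next request
        pages.append(fetch(u))
    return pages

# stand-in fetch function (hypothetical), so no real request is made
def fake_fetch(u):
    return "<html>%s</html>" % u

pages = crawl(["page1", "page2"], fake_fetch, delay=0)
print(len(pages))
```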

Let's improve the code to save the data.

In [11]:
finaldata = []

a = "http://www.imdb.com/search/title?at=0&sort=num_votes,desc&start="
b = "&title_type=feature&year=1950,2012"
count = 1

for i in range(1,5):
    page_url = '%s%s%s' % (a, count, b)   # URL of the page to fetch
    count = count + 50
    data = html.fromstring(requests.get(page_url).text)

    # XPath
    title = data.xpath("//h3/a/text()")
    url = data.xpath("//h3[@class='lister-item-header']/a/@href")
    year = data.xpath("//h3/span[2]/text()")

    # add this page's rows of (title, url, year) to the full list
    finaldata.extend(zip(title, url, year))

finaldata = pd.DataFrame(finaldata)
print(finaldata)
finaldata.to_csv("data/output_imdb_crawling.csv", encoding="utf-8")

                                                     0  \
0                             The Shawshank Redemption   
1                                      The Dark Knight   
2                                            Inception   
3                                           Fight Club   
4                                         Pulp Fiction   
5                                         Forrest Gump   
6    The Lord of the Rings: The Fellowship of the Ring   
7                                           The Matrix   
8        The Lord of the Rings: The Return of the King   
9                                        The Godfather   
10                               The Dark Knight Rises   
11               The Lord of the Rings: The Two Towers   
12                                               Se7en   
13                                           Gladiator   
14                                       Batman Begins   
15                                    Django Unchained   
16            
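
The table above has columns named 0, 1, and 2. As an alternative sketch, Python's built-in `csv` module can write the same kind of rows with a header line. The rows and column names below are illustrative, and an in-memory buffer stands in for a real file.

```python
import csv
import io

# illustrative rows in the same (title, url, year) shape as finaldata
rows = [
    ("Movie A", "/title/a/", "(2001)"),
    ("Movie B", "/title/b/", "(2002)"),
]

buffer = io.StringIO()   # for a real file: open("file.csv", "w", newline="")
writer = csv.writer(buffer)
writer.writerow(["title", "url", "year"])   # header row
writer.writerows(rows)

print(buffer.getvalue())
```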

# OpenCorporates (The Open Database of the Corporate World)

- We're interested in businesses in Kansas. The URL is https://opencorporates.com/companies/us_ks?current_status=Active+And+In+Good+Standing&page=1&q=
- There are many more webpages (over 1,000).
- For example, the URL of the second webpage looks like this: https://opencorporates.com/companies/us_ks?current_status=Active+And+In+Good+Standing&page=2&q=

## Web Scraping (first webpage)

In [None]:
r = requests.get('https://opencorporates.com/companies/us_ks?current_status=Active+And+In+Good+Standing&page=1&q=')
data = html.fromstring(r.text)

# Xpath
title = 
url =     
address =  

# aggregate three columns

# changing the data more like Excel format

# then save


In [12]:
a = "https://opencorporates.com/companies/us_ks?current_status=Active+And+In+Good+Standing&page="
b = "&q="

for i in range(1,5):
    print(a + b)

https://opencorporates.com/companies/us_ks?current_status=Active+And+In+Good+Standing&page=&q=
https://opencorporates.com/companies/us_ks?current_status=Active+And+In+Good+Standing&page=&q=
https://opencorporates.com/companies/us_ks?current_status=Active+And+In+Good+Standing&page=&q=
https://opencorporates.com/companies/us_ks?current_status=Active+And+In+Good+Standing&page=&q=


In [13]:
a = "https://opencorporates.com/companies/us_ks?current_status=Active+And+In+Good+Standing&page="
b = "&q="
count = 1
for i in range(1,5):
    url = '%s%s%s' % (a, count, b)
    print(url)
    count = count + 1
    

https://opencorporates.com/companies/us_ks?current_status=Active+And+In+Good+Standing&page=1&q=
https://opencorporates.com/companies/us_ks?current_status=Active+And+In+Good+Standing&page=2&q=
https://opencorporates.com/companies/us_ks?current_status=Active+And+In+Good+Standing&page=3&q=
https://opencorporates.com/companies/us_ks?current_status=Active+And+In+Good+Standing&page=4&q=
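
Instead of gluing string pieces together, the query part of the URL can be built from a dictionary. This sketch assumes Python 3's `urllib.parse.urlencode`; the helper name `page_url` is hypothetical. Spaces in the values are encoded as `+`, which matches the URLs above.

```python
from urllib.parse import urlencode

base = "https://opencorporates.com/companies/us_ks"

def page_url(page):
    """Build the search URL for one results page."""
    query = urlencode({
        "current_status": "Active And In Good Standing",  # spaces become '+'
        "page": page,
        "q": "",
    })
    return base + "?" + query

for page in range(1, 5):
    print(page_url(page))
```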


## Web Crawling

# Rotten Tomatoes Movie Reviews

- Now that we're familiar with how XPath works, we will write the code without using Google Sheets.
- Go to https://www.rottentomatoes.com/m/interstellar_2014/reviews/?page=1&sort=
- Collect the reviewer name, fresh/rotten rating, review text, and date.
- There are 15 more webpages of reviews for this movie.

## Web Scraping

In [None]:
r = requests.get('https://www.rottentomatoes.com/m/interstellar_2014/reviews/?page=1&sort=')
data = html.fromstring(r.text)

name = 
rotten_fresh = 
date = 
review = 
movie = list(zip(name, rotten_fresh, date, review))
movie

## Web Crawling: Your Turn!

In [None]:
# renaming the columns 0: 'reviewer', 1: 'sentiment', 2: 'date', 3:'review'
finaldata = finaldata.rename(columns={0: 'reviewer', 1: 'sentiment', 2: 'date', 3:'review'})
finaldata

In [None]:
# pivot table
finaldata.groupby('date').count()
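
`groupby('date').count()` counts how many rows share each date. The same idea can be sketched in plain Python with `collections.Counter`; the rows below are made-up placeholders, not scraped reviews.

```python
from collections import Counter

# placeholder rows: (reviewer, sentiment, date, review)
rows = [
    ("Reviewer A", "fresh", "Nov 5, 2014", "text one"),
    ("Reviewer B", "rotten", "Nov 5, 2014", "text two"),
    ("Reviewer C", "fresh", "Nov 6, 2014", "text three"),
]

reviews_per_date = Counter(row[2] for row in rows)
reviews_per_sentiment = Counter(row[1] for row in rows)

print(reviews_per_date)
print(reviews_per_sentiment)
```

Counting by a different column, such as sentiment, only requires changing which position of the row is used as the key.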

In [None]:
# pivot table


# Word Frequency Analysis

In [None]:
import nltk
from nltk import FreqDist, word_tokenize
from nltk.corpus import stopwords
import matplotlib.pyplot as plt
%matplotlib inline

# note: the tokenizer models and the 'stopwords' corpus must be
# downloaded once with nltk.download() before this cell will run

# join the list of scraped reviews into one string
tokens = ' '.join(review)

# lowercase
tokens = tokens.lower()

# tokenization
tokens = word_tokenize(tokens)

# remove stopwords (build the set once instead of once per word)
stop_words = set(stopwords.words('english'))
tokens = (word for word in tokens if word not in stop_words)

# keep only purely alphabetic tokens (drops punctuation and numbers)
tokens = (word for word in tokens if word.isalpha())

# remove short words (fewer than 2 characters)
tokens = [word for word in tokens if len(word) >= 2]

# create bigrams (pairs of adjacent tokens)
#bgs = nltk.bigrams(tokens)

# compute the frequency distribution of the tokens
fdist_h = FreqDist(tokens)
fdist_h.most_common(20)
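
To see what this pipeline computes without installing NLTK data, here is a rough standard-library sketch on a made-up sentence. The tiny stopword set and the regex tokenizer are simplifications and do not reproduce NLTK's behaviour; the last lines illustrate the bigram idea from the commented-out step.

```python
import re
from collections import Counter

text = "The film is long but the film is bold and the ending is bold"

stop = {"the", "is", "but", "and"}            # tiny illustrative stopword set
tokens = re.findall(r"[a-z]+", text.lower())  # lowercase, letters only
tokens = [w for w in tokens if w not in stop and len(w) >= 2]

freq = Counter(tokens)
print(freq.most_common(2))

# bigrams: pairs of neighbouring tokens
bigrams = list(zip(tokens, tokens[1:]))
print(Counter(bigrams).most_common(1))
```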