# Web Crawling

- A web scraping technique applied to large collections of pages
- Combines **XPath** with a **for loop**
- Knowledge of **regular expressions** is a plus

# Copyright laws
* Using crawled data for a **commercial service** can violate copyright laws
* Web crawling in this course is for **educational purposes** only.

# Example: IMDb
- Previously, we scraped http://www.imdb.com/search/title?at=0&sort=num_votes,desc&start=1&title_type=feature&year=1950,2012
- This movie information spans more than 200,000 webpages. To collect data from all of them, you need **web crawling**.

## Web Scraping (first webpage only) 

In [None]:
# import python packages
import requests
from lxml import html
import csv
import pandas as pd

r = requests.get('http://www.imdb.com/search/title?at=0&sort=num_votes,desc&start=1&title_type=feature&year=1950,2012')
data = html.fromstring(r.text)

# XPath: each movie sits in an <h3 class="lister-item-header"> element
alldata = []

for i in data.xpath("//h3[@class='lister-item-header']"):
    title = i.xpath('a/text()')
    url = i.xpath('a/@href')
    year = i.xpath('span[2]/text()')
    print(title, url, year)
    alldata.append([title, url, year])
    
len(alldata)
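
Each XPath call returns a list of strings, and the year text usually carries extra characters. As a minimal sketch of where regular expressions help, the snippet below pulls a four-digit year out of sample strings. The sample values and the helper name `extract_year` are assumptions for illustration, not guaranteed IMDb output.

```python
import re

def extract_year(parts):
    """Return the first four-digit year found in a list of strings, or None."""
    for text in parts:
        match = re.search(r"\b(\d{4})\b", text)
        if match:
            return int(match.group(1))
    return None

# Hypothetical values shaped like the `year` lists scraped above
print(extract_year(["(1994)"]))
print(extract_year(["(I) (2008)"]))
print(extract_year([]))
```

Returning `None` for an empty list keeps the loop from failing on a movie entry that has no year element.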

## Crawling multiple pages

#### Review: for loop

In [None]:
# loop 4 times: range(1, 5) yields 1, 2, 3, 4 (the stop value is excluded)
for i in range(1, 5):
    print(i)

When you browse the IMDb site, the page URLs follow a pattern:

- http://www.imdb.com/search/title?at=0&sort=num_votes,desc&start=1&title_type=feature&year=1950,2012
- http://www.imdb.com/search/title?at=0&sort=num_votes,desc&start=51&title_type=feature&year=1950,2012
- http://www.imdb.com/search/title?at=0&sort=num_votes,desc&start=101&title_type=feature&year=1950,2012
 
The number after **start=** increases by 50 on each page, so we can generate the URLs with a loop:

In [None]:
web = "http://www.imdb.com/search/title?at=0&sort=num_votes,desc&start=%s&title_type=feature&year=1950,2012"

for page in range(1,150,50):
    print(web % page)
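
The stop value of 150 above covers only three pages. A minimal sketch, assuming you know the total number of results and that every page holds 50 items, computes the `start=` values for any total. The helper name `page_starts` and the total of 120 are hypothetical.

```python
def page_starts(total_results, per_page=50):
    """Return the `start=` value for every page needed to cover total_results."""
    # range() excludes its stop value, so add 1 to include the last item
    return list(range(1, total_results + 1, per_page))

web = "http://www.imdb.com/search/title?at=0&sort=num_votes,desc&start=%s&title_type=feature&year=1950,2012"

# Hypothetical total of 120 results: pages start at 1, 51, and 101
for start in page_starts(120):
    print(web % start)
```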

Now we run the XPath queries inside the loop, once for each page.

In [None]:
alldata =[]

web = "http://www.imdb.com/search/title?at=0&sort=num_votes,desc&start=%s&title_type=feature&year=1950,2012"

for page in range(1, 150, 50):
    page_url = web % page
    data = html.fromstring(requests.get(page_url).text)
    # XPath: extract title, link, and year from each movie header
    for i in data.xpath("//h3[@class='lister-item-header']"):
        title = i.xpath('a/text()')
        url = i.xpath('a/@href')
        year = i.xpath('span[2]/text()')
        print(title, url, year)
        alldata.append([title, url, year])
    
len(alldata)

In [None]:
df = pd.DataFrame(alldata)       
df.to_csv("data/output_imdb_crawling.csv", index=False, encoding="utf-8")

# OpenCorporates (The Open Database of the Corporate World)

- We're interested in businesses in Kansas. The URL is https://opencorporates.com/companies/us_ks?current_status=Active+And+In+Good+Standing&page=1&q=
- The results span more than 1,000 webpages.
- For example, the URL of the second page looks like this: https://opencorporates.com/companies/us_ks?current_status=Active+And+In+Good+Standing&page=2&q=
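
The page pattern above can be sketched as a URL generator. This is a minimal sketch: the variable names and the choice of three pages are assumptions, and it only builds the URLs without sending any requests.

```python
base = ("https://opencorporates.com/companies/us_ks"
        "?current_status=Active+And+In+Good+Standing&page=%s&q=")

# Build the URLs for pages 1 to 3; widen the range to cover more pages
urls = [base % page for page in range(1, 4)]

for u in urls:
    print(u)
```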

## Web Scraping (first webpage)

In [None]:
r = requests.get('https://opencorporates.com/companies/us_ks?current_status=Active+And+In+Good+Standing&page=1&q=')
data = html.fromstring(r.text)

# XPath: each company is an <li> whose class contains 'search-result'
for i in data.xpath("//li[contains(@class,'search-result')]"):
    title = i.xpath('a[2]/text()')
    url = i.xpath('span[@class="address"]/a/@href')
    address = i.xpath('span[@class="address"]/text()')
    print(title, url, address)

## Crawling multiple pages

In [None]:
# generating urls



# Rotten Tomatoes Movie Reviews

- Now that we're familiar with how XPath works, we will write the code without using Google Sheets.
- Go to https://www.rottentomatoes.com/m/interstellar_2014/reviews/?page=1&sort=
- Collect the reviewer name, fresh/rotten rating, review text, and date.
- There are 15 more webpages of reviews for this movie.

## Web Scraping

In [None]:
r = requests.get('https://www.rottentomatoes.com/m/interstellar_2014/reviews/?page=1&sort=')
data = html.fromstring(r.text)

for i in data.xpath("//div[@class='row review_table_row']"):
    name = i.xpath('div/div/a[contains(@href, "critic")]/text()')
    sentiment = i.xpath('div[@class="col-xs-16 review_container"]/div[1]/@class')
    date = i.xpath('div[@class="col-xs-16 review_container"]/div[2]/div[1]/text()')
    review = i.xpath('div[@class="col-xs-16 review_container"]/div[2]/div[2]/div[1]/text()')
    print(name, sentiment, date, review)
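
Because every XPath call returns a list, the printed values arrive wrapped in brackets and quotes, and the sentiment still carries its CSS class prefix. A minimal plain-Python sketch of cleaning one row follows. The sample row and the helpers `first_text` and `clean_row` are hypothetical; real pages may return empty lists or different class names.

```python
def first_text(values):
    """Return the first item of a list, stripped, or an empty string."""
    return values[0].strip() if values else ""

def clean_row(name, sentiment, date, review):
    """Flatten the four XPath result lists into plain strings."""
    # Drop the shared class prefix so only the label (e.g. 'fresh') remains
    label = first_text(sentiment).replace("review_icon icon small", "").strip()
    return [first_text(name), label, first_text(date), first_text(review)]

# Hypothetical row shaped like the output of the loop above
row = clean_row(
    ["Jane Doe"],
    ["review_icon icon small fresh"],
    [" November 5, 2014"],
    [" A bold, ambitious film. "],
)
print(row)
```

The same cleanup can be done on a whole data frame with pandas string methods, which is what the exercise cells that follow ask for.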

## Crawling multiple pages: Your Turn!

In [None]:
# data frame



In [None]:
# remove brackets






In [None]:
# remove single quotation







In [None]:
# remove "review_icon icon small"






In [None]:
# renaming columns 0: 'reviewer', 1: 'sentiment', 2: 'date', 3:'review'



In [None]:
# pivot table by date



In [None]:
# pivot table by sentiment



# Word frequency analysis

In [None]:
# data frame to list

review = df['review'].tolist()
review

In [None]:
for i in review:
    print(i)

In [None]:
import nltk
from nltk import FreqDist, word_tokenize
from nltk.corpus import stopwords
import matplotlib.pyplot as plt
%matplotlib inline

# join the reviews into one string (str(review) would keep the list's brackets and quotes)
tokens = " ".join(str(r) for r in review)

# lowercase
tokens = tokens.lower()

#tokenization
tokens = word_tokenize(tokens)

# remove stopwords (build the set once instead of once per token)
stop_words = set(stopwords.words('english'))
tokens = (word for word in tokens if word not in stop_words)

# keep only purely alphabetic tokens (drops punctuation and numbers)
tokens = (word for word in tokens if word.isalpha())

# remove one-letter words
tokens = [word for word in tokens if len(word) >= 2]

# optional: create bigrams (pairs of adjacent tokens)
#bgs = nltk.bigrams(tokens)

# compute the frequency distribution of the single tokens
fdist_h = nltk.FreqDist(tokens)
fdist_h.most_common(20)
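
The commented-out bigram step can also be tried without NLTK. This is a minimal sketch, assuming the input is already a cleaned list of words like `tokens` above. The sample words are made up, and `zip` plus `collections.Counter` stands in for `nltk.bigrams` and `FreqDist`.

```python
from collections import Counter

# Hypothetical cleaned tokens; in the notebook these come from the reviews
sample_tokens = ["space", "epic", "visually", "stunning", "space", "epic"]

# Pair each token with its successor to form bigrams
bigrams = list(zip(sample_tokens, sample_tokens[1:]))

# Count the bigrams and show the most frequent ones
bigram_counts = Counter(bigrams)
print(bigram_counts.most_common(3))
```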