# Web Scraping in Python

This notebook walks through an example of each of the following:
<ul>
<li>IPython Notebook</li>
<li>Parsing HTML using lxml</li>
<li>Data manipulation with pandas</li>
<li>Classification with Multinomial Naive Bayes using scikit-learn</li>
</ul>

Now let's see if we can teach our computer to tell the difference between a Wall Street Journal headline and a Gawker headline.

Inspiration for this notebook/presentation was provided by [Neal Caren's classification workshop notebook](http://nbviewer.ipython.org/github/nealcaren/workshop_2014/blob/master/notebooks/6_Classification.ipynb) and [the CS109 HW3 solutions](http://nbviewer.ipython.org/github/cs109/content/blob/master/HW3_solutions.ipynb). If you want to run this notebook, or one like it, it'll be helpful to check out [Anaconda](https://store.continuum.io/cshop/anaconda/).


## Get the Data!
If we want to build a model with a discerning taste for quality news, we're going to have to find it some examples. Lucky for us, Gawker and the Wall Street Journal both have websites.
### To the internet (with dev tools)!
#### [Gawker](http://gawker.com)
After poking around the Gawker homepage with the help of command + option + i (Mac) and a Chrome plugin, [XPath Helper](https://chrome.google.com/webstore/detail/xpath-helper/hgimnogjllphhhkhlmebbmlgjoejdpjl?hl=en), we can see that all the article titles can be grabbed with the XPath `"//header/h1/a/text()"`. lxml lets us retrieve all the text that matches our XPath.

In [1]:
from lxml import html 
x = html.parse('http://gawker.com')
titles = x.xpath('//header/h1/a/text()')
print("We got {} titles. Here are the first 5:".format(len(titles)))
titles[:5]

We got 17 titles. Here are the first 5:


['College Kids Not So Into the Free Press. Whatever',
 "Today's Best Deals: Running Shoes, Outdoor Clothes, Anker Jump Starter, and More",
 "Who's Named in the Panama Papers?",
 "NYPD's Blowhard Union Boss Is Feeling a Little Scared and a Little Confused",
 'This Is Bad, Even For Sarah Palin']
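
To see what that path expression selects without touching the network, here is a minimal sketch on a made-up HTML snippet. It is an illustration only: the snippet and its titles are invented, and it uses the standard library's `xml.etree.ElementTree` rather than lxml so that it runs without extra dependencies. ElementTree supports only a subset of XPath (no `text()` step), so the sketch selects the `a` elements and reads their `.text` instead.

```python
import xml.etree.ElementTree as ET

# A made-up, well-formed snippet that imitates the structure described above.
SNIPPET = """
<html><body>
  <article><header><h1><a href="/story-1">First made-up headline</a></h1></header></article>
  <article><header><h1><a href="/story-2">Second made-up headline</a></h1></header></article>
  <aside><h1><a href="/promo">Not inside a header, so not matched</a></h1></aside>
</body></html>
""".strip()

root = ET.fromstring(SNIPPET)

# Select every <a> that sits directly inside an <h1> inside a <header>,
# anywhere in the document, then read its text.
# With lxml, the XPath '//header/h1/a/text()' returns the strings directly.
example_titles = [a.text for a in root.findall('.//header/h1/a')]
print(example_titles)
```

The link inside the `aside` is skipped because its `h1` is not a child of a `header`, which is the same reason the real XPath picks up only article titles.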

The landing page gave us only 17 titles. If we want to properly train our model, we're going to need more examples. There's a "More Stories" button at the bottom of the page, which leads to another, similarly structured page with new titles and its own "More Stories" button. To get more examples, we'll repeat the process above in a loop, with each iteration fetching the page that the "More Stories" button points to. To figure out what that link is, go do some more investigating! You can also just look at the code below.

In [2]:
headlines = x.xpath('//header/h1/a/text()')

In [3]:
# These are the xpaths we determined from snooping 
next_button_xpath = "//div[@class='row load-more']//a[contains(@href, 'startTime')]/@href"
headline_xpath = '//header/h1/a/text()'

# We'll use sleep to add some time in between requests
# so that we're not bombarding Gawker's server too hard. 
from time import sleep

# Now we'll fill this list of gawker titles by starting
# at the landing page and following "More Stories" links
gawker_titles = []
base_url = 'http://gawker.com/{}'
next_page = "http://gawker.com/"
while len(gawker_titles) < 2000 and next_page:
    dom = html.parse(next_page)
    headlines = dom.xpath(headline_xpath)
    print("Retrieved {} titles from url: {}".format(len(headlines), next_page))
    gawker_titles += headlines
    next_pages = dom.xpath(next_button_xpath)
    if next_pages: 
        next_page = base_url.format(next_pages[0]) 
    else:
        print("No next button found")
        next_page = None
    sleep(5)

Retrieved 17 titles from url: http://gawker.com/
Retrieved 18 titles from url: http://gawker.com/?startTime=1459814100852
Retrieved 19 titles from url: http://gawker.com/?startTime=1459777920161
Retrieved 18 titles from url: http://gawker.com/?startTime=1459616727760
Retrieved 16 titles from url: http://gawker.com/?startTime=1459518120384
Retrieved 17 titles from url: http://gawker.com/?startTime=1459440720292
Retrieved 21 titles from url: http://gawker.com/?startTime=1459378619355
Retrieved 18 titles from url: http://gawker.com/?startTime=1459344905458
Retrieved 19 titles from url: http://gawker.com/?startTime=1459282860784
Retrieved 20 titles from url: http://gawker.com/?startTime=1459224087623
Retrieved 19 titles from url: http://gawker.com/?startTime=1459180800411
Retrieved 19 titles from url: http://gawker.com/?startTime=1459096920497
Retrieved 19 titles from url: http://gawker.com/?startTime=1458942097476
Retrieved 18 titles from url: http://gawker.com/?startTime=1458873001132
...
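
The control flow of that loop can be tried out offline. The sketch below swaps the real site for a hypothetical in-memory dictionary of pages (the URLs, titles, and the `fetch` and `collect_titles` helpers are all invented for illustration) and keeps only the "follow the next link until we have enough" logic.

```python
# Hypothetical in-memory "site": each URL maps to
# (titles on that page, relative href of the next page or None).
FAKE_PAGES = {
    'http://example.com/': (['a', 'b'], '?startTime=3'),
    'http://example.com/?startTime=3': (['c', 'd'], '?startTime=2'),
    'http://example.com/?startTime=2': (['e'], None),
}

def fetch(url):
    """Stand-in for html.parse(url) followed by the two xpath() calls."""
    return FAKE_PAGES[url]

def collect_titles(start_url, base_url, limit):
    """Follow next-page links until `limit` titles are collected or links run out."""
    collected = []
    next_page = start_url
    while len(collected) < limit and next_page:
        titles, next_href = fetch(next_page)
        collected += titles
        # The next link is relative, so it is joined onto the base URL.
        next_page = base_url.format(next_href) if next_href else None
    return collected

some_titles = collect_titles('http://example.com/', 'http://example.com/{}', limit=3)
print(some_titles)
```

Because the limit is checked only between pages, the loop can overshoot it. Here a limit of 3 yields 4 titles, and the same effect explains how a loop with a limit of 2000 can end up with slightly more.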

In [4]:
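import io
# writelines() adds no line breaks, so join the titles with newlines ourselves
with io.open('gawker_titles.txt', 'w', encoding='utf-8') as out:
    out.write(u'\n'.join(gawker_titles))
# with io.open('gawker_titles.txt', encoding='utf-8') as f:
#     gawker_titles = f.read().splitlines()

print("Holy smokes, we got {} Gawker headlines!".format(len(gawker_titles)))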

Holy smokes, we got 2001 Gawker headlines!


#### [Wall Street Journal](http://online.wsj.com/public/page/archive-2014-1-1.html)
Now we'll do something similar with the WSJ. Their site has an archive section that lists the articles for each day in the past year. Links to the different archive dates appear all over the page, and they all share the same structure, differing only in the date in the URL. Let's iterate over a bunch of dates. I grabbed the articles from the first day of each of the first ten months of 2014.

In [5]:
wsj_url = "http://online.wsj.com/public/page/archive-2014-{}-1.html"
wsj_headline_xpath = "//h2/a/text()"
wsj_headlines = []
for i in range(1, 11): 
    dom = html.parse(wsj_url.format(i))
    titles = dom.xpath(wsj_headline_xpath)
    wsj_headlines += titles
    print("Retrieved {} WSJ headlines from url: {}".format(len(titles), wsj_url.format(i)))

Retrieved 106 WSJ headlines from url: http://online.wsj.com/public/page/archive-2014-1-1.html
Retrieved 21 WSJ headlines from url: http://online.wsj.com/public/page/archive-2014-2-1.html
Retrieved 31 WSJ headlines from url: http://online.wsj.com/public/page/archive-2014-3-1.html
Retrieved 284 WSJ headlines from url: http://online.wsj.com/public/page/archive-2014-4-1.html
Retrieved 386 WSJ headlines from url: http://online.wsj.com/public/page/archive-2014-5-1.html
Retrieved 120 WSJ headlines from url: http://online.wsj.com/public/page/archive-2014-6-1.html
Retrieved 310 WSJ headlines from url: http://online.wsj.com/public/page/archive-2014-7-1.html
Retrieved 300 WSJ headlines from url: http://online.wsj.com/public/page/archive-2014-8-1.html
Retrieved 162 WSJ headlines from url: http://online.wsj.com/public/page/archive-2014-9-1.html
Retrieved 388 WSJ headlines from url: http://online.wsj.com/public/page/archive-2014-10-1.html
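
If you want dates other than the first of each month, the URLs can be built from `datetime.date` objects. This is a sketch under an assumption: the notebook only confirms the pattern for the first day of a month, so treating the last number as a day of the month without zero padding is a guess based on the description above. The `archive_url` helper is hypothetical.

```python
from datetime import date

# Pattern taken from the cell above, with year, month and day left as placeholders.
# Assumption: the last number is the day of the month, without zero padding.
ARCHIVE_URL = "http://online.wsj.com/public/page/archive-{}-{}-{}.html"

def archive_url(day):
    """Build the archive URL for one calendar day."""
    return ARCHIVE_URL.format(day.year, day.month, day.day)

# Reproduce the ten URLs used in the notebook.
first_of_months = [archive_url(date(2014, month, 1)) for month in range(1, 11)]
print(first_of_months[0])
print(first_of_months[-1])
```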


In [6]:
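import io
# writelines() adds no line breaks, so join the titles with newlines ourselves
with io.open('wsj_titles.txt', 'w', encoding='utf-8') as out:
    out.write(u'\n'.join(wsj_headlines))
# with io.open('wsj_titles.txt', encoding='utf-8') as f:
#     wsj_headlines = f.read().splitlines()

print("Jeez, Louise! We got {} WSJ headlines!".format(len(wsj_headlines)))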

Jeez, Louise! We got 2108 WSJ headlines!


Now we'll use pandas to build a data frame with two columns: "title", which holds the headline text, and "gawker", which contains a boolean value indicating whether the value in the "title" column came from Gawker's website.

In [7]:
import pandas as pd
gawk_records = [{'gawker': True, 'title': title} for title in gawker_titles]
wsj_records = [{'gawker': False, 'title': title} for title in wsj_headlines]
df = pd.DataFrame.from_records(gawk_records + wsj_records)
df.tail()

|      | gawker | title |
|------|--------|-------|
| 4104 | False | Australia to Fly Support Missions in Iraq |
| 4105 | False | Zalando Stock Debut Disappoints |
| 4106 | False | China September Official Manufacturing PMI Hol... |
| 4107 | False | Asian Shares Mixed in Quiet Trading |
| 4108 | False | BOJ Tankan: Japan Corporate Sentiment Improves |


In [18]:
# abstracts = ''.join([x for x in df.Abstract])
# import markovify
# text_model = markovify.Text(abstracts)
# for i in range(5):
#     print(text_model.make_sentence())
#     print('')

In [19]:
import markovify
# Keep one headline per line so each title is treated as its own sentence
abstracts = u'\n'.join(df[df.gawker].title)

In [23]:
# Drop non-ASCII characters such as curly quotes. Encode first, then decode:
# calling .decode() on a unicode string raised a UnicodeEncodeError here.
ascii_text = abstracts.encode('ascii', 'ignore').decode('ascii')
text_model = markovify.NewlineText(ascii_text)

In [14]:
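# make_sentence() returns None when it cannot build a sentence
# that differs enough from the original text
for i in range(5):
    print(text_model.make_sentence())
    print('')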

None

None

None

None

None



## Teach the Machine!

The basic goal of machine learning is to learn a model, or function, that maps our inputs (observations) to outputs (predictions). The model we'll build to make our predictions is known as Naive Bayes. By counting, we can estimate the probability that any given word shows up in a Gawker title, or P(Word | Gawker). What we actually want is the probability that a given body of text is a Gawker title, or P(Gawker | Words). Bayes' theorem, combined with the "naive" assumption that the words in a title occur independently of one another given the source, connects the two:

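$$P(\text{Gawker} \mid \text{Words}) \propto P(\text{Word}_1 \mid \text{Gawker}) \cdot P(\text{Word}_2 \mid \text{Gawker}) \cdots P(\text{Word}_n \mid \text{Gawker}) \cdot P(\text{Gawker})$$

The relation is a proportionality rather than an equality because the denominator P(Words) is dropped; it is the same for both sources, so it does not affect which one wins.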

All of the probabilities on the right side of the equation are estimated empirically from the text data. Luckily, instead of having to write all of that code ourselves, sklearn does the dirty work for us: CountVectorizer counts the words, and MultinomialNB turns those counts into probabilities.
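
To make the formula concrete, here is a from-scratch sketch of the same idea using only the standard library. It is not what sklearn runs internally line for line; the four training titles are invented, the helper names are hypothetical, and the two sources are given equal prior probability (which is what `fit_prior=False` asks for later in this notebook).

```python
import math
from collections import Counter

# Tiny invented training set; label 1 means "Gawker-like", 0 means "WSJ-like".
toy_train = [
    ("you will not believe this celebrity feud", 1),
    ("this celebrity is the worst", 1),
    ("investors cheer as profit growth beats forecast", 0),
    ("china sales growth lifts market", 0),
]

ALPHA = 0.5  # smoothing strength

# Count how often each word appears in each class.
word_counts = {0: Counter(), 1: Counter()}
for title, label in toy_train:
    word_counts[label].update(title.split())
vocab = set(word_counts[0]) | set(word_counts[1])

def log_p_word(word, label):
    """Smoothed log P(word | class), so unseen words never get probability zero."""
    total = sum(word_counts[label].values())
    return math.log((word_counts[label][word] + ALPHA) / (total + ALPHA * len(vocab)))

def p_gawker(title):
    """P(Gawker | words) with equal priors for the two classes."""
    words = [w for w in title.split() if w in vocab]
    scores = {label: math.log(0.5) + sum(log_p_word(w, label) for w in words)
              for label in (0, 1)}
    # Turn the two unnormalized log scores into a probability.
    top = max(scores.values())
    weights = {label: math.exp(s - top) for label, s in scores.items()}
    return weights[1] / (weights[0] + weights[1])

print(p_gawker("celebrity feud") > 0.5)
print(p_gawker("profit growth") > 0.5)
```

Working in log space and normalizing at the end avoids multiplying many tiny probabilities together, and the final division is where the dropped P(Words) term is restored.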

In [8]:
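import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import naive_bayes
# sklearn.cross_validation was removed in scikit-learn 0.20;
# model_selection provides the same train_test_split
from sklearn import model_selection as cross_validation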

In [9]:
# def make_xy(df, vectorizer=None):
#     #Your code here    
#     if vectorizer is None:
#         vectorizer = CountVectorizer(min_df=5, max_df=.3, ngram_range=(1,2))
#     X = vectorizer.fit_transform(df.title)
#     X = X.tocsc()  # some versions of sklearn return COO format
#     Y = df.gawker.values.astype(np.int)
#     return X, Y

In [10]:
vectorizer = CountVectorizer(min_df=5, max_df=.3, ngram_range=(1,2))
# X, Y = make_xy(df)
X = vectorizer.fit_transform(df.title)
X = X.tocsc()  # some versions of sklearn return COO format
Y = df.gawker.values.astype(int)  # need numbers instead of bools

Now we have a data array, X, whose rows correspond to titles and whose columns correspond to words and two-word phrases (because of `ngram_range=(1,2)`). The value X[i, j] is the number of times term j shows up in article title i. Each row has a corresponding entry in the vector Y: if the headline associated with X[i] came from Gawker, Y[i] == 1; otherwise, Y[i] == 0. Our data are now in the format we want. Time to separate them into training and testing sets before building and evaluating our model.
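
For intuition about what such a count matrix looks like, here is a small standard-library sketch that builds one by hand for three invented titles. It is a simplification: it splits on whitespace and ignores the `min_df` and `max_df` filtering, so it will not match CountVectorizer's output exactly, and the `ngrams` helper is hypothetical.

```python
from collections import Counter

toy_titles = ["stocks rise", "stocks fall", "you will not believe stocks"]

def ngrams(title):
    """Single words plus adjacent word pairs, in the spirit of ngram_range=(1, 2)."""
    words = title.lower().split()
    pairs = [" ".join(pair) for pair in zip(words, words[1:])]
    return words + pairs

# Give each distinct term its own column, in sorted order.
vocabulary = sorted({gram for title in toy_titles for gram in ngrams(title)})
column = {gram: j for j, gram in enumerate(vocabulary)}

# counts_matrix[i][j] is the number of times term j appears in title i.
counts_matrix = [[0] * len(vocabulary) for _ in toy_titles]
for i, title in enumerate(toy_titles):
    for gram, n in Counter(ngrams(title)).items():
        counts_matrix[i][column[gram]] = n

print(vocabulary)
print(counts_matrix[0])
```

Most entries are zero because any one title uses only a few of the terms, which is why sklearn stores the real matrix in a sparse format.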

In [11]:
X_train, X_test, Y_train, Y_test = cross_validation.train_test_split(X,Y)
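
With no extra arguments, `train_test_split` shuffles the rows and holds out a quarter of them for testing. A rough standard-library sketch of that idea follows; the `split_rows` helper and its fixed seed are assumptions made for illustration and do not mirror sklearn's implementation.

```python
import random

def split_rows(rows, test_fraction=0.25, seed=0):
    """Shuffle row indices and hold out a fraction of them for testing."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(rows) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]

toy_rows = list(range(20))
train_rows, test_rows = split_rows(toy_rows)
print("train: {}, test: {}".format(len(train_rows), len(test_rows)))
```

Shuffling matters here because the data frame lists every Gawker title before every WSJ title, so an unshuffled split would leave the test set with only one kind of headline.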

In [12]:
clf = naive_bayes.MultinomialNB(fit_prior=False, alpha=0.5)
clf.fit(X_train, Y_train)

MultinomialNB(alpha=0.5, class_prior=None, fit_prior=False)

## Test the machine!

We'll see how good our model is by testing the accuracy of its predictions on articles it hasn't seen before. The accuracy metric reported below is simply the percentage of titles it classifies correctly.

In [13]:
print("Accuracy: %0.2f%%" % (100 * clf.score(X_test, Y_test)))

Accuracy: 84.91%


Not bad for a stupid machine! We can take a closer look at how the model is making predictions. Let's look at which words are the Gawkeriest and which ones are the Wall Street Journaliest.

In [14]:
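import numpy as np
# get_feature_names() was removed in scikit-learn 1.2 in favor of get_feature_names_out()
words = np.array(vectorizer.get_feature_names_out())

# Each row of the identity matrix is a "title" containing exactly one term
x = np.eye(X_test.shape[1])
# Column 0 holds the log-probability of the WSJ class (label 0)
probs = clf.predict_log_proba(x)[:, 0]
ind = np.argsort(probs)

# Lowest P(WSJ) comes first, so the head of the list is the most Gawker-like
good_words = words[ind[:10]]
bad_words = words[ind[-10:]]

good_prob = probs[ind[:10]]
bad_prob = probs[ind[-10:]]

print("Gawker words\t P(gawker| word)")
for w, p in zip(good_words, good_prob):
    print("%20s %0.2f" % (w, 1 - np.exp(p)))

print("WSJ words\t P(gawker | word)")
for w, p in zip(bad_words, bad_prob):
    print("%20s %0.2f" % (w, 1 - np.exp(p)))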

Gawker words	 P(gawker| word)
            one week 0.99
              titles 0.99
            get free 0.99
              choose 0.99
   ingredient recipe 0.99
             week of 0.99
         apron fresh 0.99
            month of 0.99
         choose from 0.99
           of oyster 0.99
WSJ words	 P(gawker | word)
              market 0.02
           what news 0.02
       manufacturing 0.01
           investors 0.01
              growth 0.01
                 ipo 0.01
                 ceo 0.01
               sales 0.01
               china 0.01
              profit 0.01
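
The per-word numbers above come from asking the model about a "title" made of a single term. With equal priors for the two sources, that probability reduces to a ratio of smoothed word frequencies, which the following sketch works out by hand. The counts are invented and the helper names are hypothetical.

```python
# Invented counts of how often each word appears in each source's titles.
gawker_counts = {"celebrity": 30, "profit": 1, "the": 200}
wsj_counts = {"celebrity": 2, "profit": 40, "the": 210}
SMOOTHING = 0.5

def p_word(word, counts):
    """Smoothed P(word | source)."""
    total = sum(counts.values())
    return (counts.get(word, 0) + SMOOTHING) / (total + SMOOTHING * len(counts))

def p_gawker_given_word(word):
    """P(Gawker | word); the equal priors cancel out of the ratio."""
    g = p_word(word, gawker_counts)
    w = p_word(word, wsj_counts)
    return g / (g + w)

for word in ("celebrity", "profit", "the"):
    print("%20s %0.2f" % (word, p_gawker_given_word(word)))
```

A word that both sources use about equally often, like "the" in this toy data, lands near 0.5 and tells the model very little.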


You might be complaining that our model should do better. But some titles can be tough. Would you have classified these mis-predicted article titles correctly? 

In [15]:
prob = clf.predict_proba(X)[:, 0]
predict = clf.predict(X)

bad_wsj = np.argsort(prob[Y == 0])[:5]
bad_gawker = np.argsort(prob[Y == 1])[-5:]

print("Mis-predicted WSJ quotes")
print('---------------------------')
for row in bad_wsj:
    # .iloc replaces the removed Series.irow() for positional lookup
    print(df[Y == 0].title.iloc[row])
    print('')

print('')
print("Mis-predicted Gawker quotes")
print('--------------------------')
for row in bad_gawker:
    print(df[Y == 1].title.iloc[row])
    print('')

Mis-predicted WSJ quotes
---------------------------
Study Finds Over One Million Caring for Iraq, Afghan War Veterans

Photo of the Week

Body of Ferry Victim Found by Fishermen

From Florida Boy to Alleged Suicide Bomber in Syria

How the 'Jesus' Wife' Hoax Fell Apart


Mis-predicted Gawker quotes
--------------------------
Benetton Will Contribute to Fund For Rana Plaza Victims in Bangladesh

NYU Urges Staffers to Help Pay Students' Outrageously Expensive Tuition

Nobody In China Wants Tibetan Mastiffs Anymore

Ukraine Cease-Fire Under Threat in Key Town, Holding Elsewhere

Leaders Reach Shaky Cease-fire Deal to End War in Ukraine 



Make your own headline and see if it belongs in Gawker or WSJ!

In [16]:
probs = clf.predict_proba(vectorizer.transform(["Your title here"]))
print("P(gawker) = {}".format(probs[0][1]))

P(gawker) = 0.971367221347
