# Assignment 5

In this assignment, you'll scrape text from [The California Aggie](https://theaggie.org/) and then analyze the text.

The Aggie is organized by category into article lists. For example, there's a [Campus News](https://theaggie.org/campus/) list, [Arts & Culture](https://theaggie.org/arts/) list, and [Sports](https://theaggie.org/sports/) list. Notice that each list has multiple pages, with a maximum of 15 articles per page.

The goal of exercises 1.1 - 1.3 is to scrape articles from the Aggie for analysis in exercise 1.4.

__Exercise 1.1.__ Write a function that extracts all of the links to articles in an Aggie article list. The function should:

* Have a parameter `url` for the URL of the article list.

* Have a parameter `page` for the number of pages to fetch links from. The default should be `1`.

* Return a list of aricle URLs (each URL should be a string).

Test your function on 2-3 different categories to make sure it works.

Hints:

* Be polite to The Aggie and save time by setting up [requests_cache](https://pypi.python.org/pypi/requests-cache) before you write your function.

* Start by getting your function to work for just 1 page. Once that works, have your function call itself to get additional pages.

* You can use [lxml.html](http://lxml.de/lxmlhtml.html) or [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) to scrape HTML. Choose one and use it throughout the entire assignment.

In [1]:
#worked with Nivi Achanta and Richard Safran

import requests
import requests_ftp
import requests_cache
import lxml
import re
from bs4 import BeautifulSoup
from collections import Counter
from matplotlib import pyplot as plt
import pandas as pd
plt.style.use('ggplot')
requests_cache.install_cache('coll_cache')
%matplotlib inline

requests_cache.install_cache('aggie_cache')
aggie = "https://theaggie.org/sports/"
page = requests.get(aggie)
page_html = page.text
soup = BeautifulSoup(page_html, 'lxml')
next_buttons = soup.find_all(name='a',attrs={'class':'page-numbers'})
print next_buttons #gives list of next buttons

[<a class="page-numbers" href="https://theaggie.org/sports/page/2/">2</a>, <a class="page-numbers" href="https://theaggie.org/sports/page/3/">3</a>, <a class="page-numbers" href="https://theaggie.org/sports/page/4/">4</a>, <a class="page-numbers" href="https://theaggie.org/sports/page/206/">206</a>, <a class="next page-numbers" href="https://theaggie.org/sports/page/2/"><i class="icon-angle-right"></i></a>]


In [4]:
def get_links(url, pages=1):
    """
    accept the list url as an argument and 
    return a corresponding list of article urls (for 0 items, return an empty list)
    """
    urls = []
    ag_req = requests.get(url)
    #if we hit bad page, return empty list
    if ag_req.status_code != requests.codes.ok:
        return urls
    aghtml = ag_req.text
    ag = BeautifulSoup(aghtml, 'lxml')
    #use h2 html link to extract relevant links 
    urls = [x.findNext().get('href') for x in ag.findAll('h2')]
    if pages > 1:
        # if more than one page requested, fing the link to the next page and recurse
        next_url = ag.find(name='a',attrs={'rel':'next'}).get('href')
        # print "Next URL is ", next_url
        if (next_url):
            urls += get_links(next_url, pages - 1)
    return urls

print "Links from 5 sports pages :"
print get_links("http://theaggie.org/sports", 5) 

print "Links from bad url:"
print get_links("http://theaggie.org/badurl", 3) 



Links from 5 sports pages :
['https://theaggie.org/2017/02/20/a-day-in-the-life-of-an-athletic-trainer/', 'https://theaggie.org/2017/02/19/a-lifetime-of-tennis/', 'https://theaggie.org/2017/02/17/uc-davis-womens-basketball-team-holds-off-uc-irvine-62-42/', 'https://theaggie.org/2017/02/17/nba-awards-at-the-all-star-break/', 'https://theaggie.org/2017/02/17/womens-lacrosse-soars-over-the-blackbirds/', 'https://theaggie.org/2017/02/17/uc-davis-womens-lacrosse-falls-in-tough-season-opener/', 'https://theaggie.org/2017/02/16/uc-davis-softball-falls-in-home-opener/', 'https://theaggie.org/2017/02/16/ags-compensate-for-recent-loss-to-highlanders-with-decisive-win/', 'https://theaggie.org/2017/02/13/uc-davis-athletes-excel-academically/', 'https://theaggie.org/2017/02/12/a-beginners-guide-to-womens-gymnastics/', 'https://theaggie.org/2017/02/10/inside-the-game-chima-moneke/', 'https://theaggie.org/2017/02/10/uc-davis-victorious-in-clash-with-titans/', 'https://theaggie.org/2017/02/09/uc-davis

__Exercise 1.2.__ Write a function that extracts the title, text, and author of an Aggie article. The function should:

* Have a parameter `url` for the URL of the article.

* For the author, extract the "Written By" line that appears at the end of most articles. You don't have to extract the author's name from this line.

* Return a dictionary with keys "url", "title", "text", and "author". The values for these should be the article url, title, text, and author, respectively.

For example, for [this article](https://theaggie.org/2017/02/14/project-toto-aims-to-address-questions-regarding-city-finances/) your function should return something similar to this:
```
{
    'author': u'Written By: Bianca Antunez \xa0\u2014\xa0city@theaggie.org',
    'text': u'Davis residents create financial model to make city\'s financial state more transparent To increase transparency between the city\'s financial situation and the community, three residents created a model called Project Toto which aims to improve how the city communicates its finances in an easily accessible design. Jeff Miller and Matt Williams, who are members of Davis\' Finance and Budget Commission, joined together with Davis entrepreneur Bob Fung to create the model plan to bring the project to the Finance and Budget Commission in February, according to Kelly Stachowicz, assistant city manager. "City staff appreciate the efforts that have gone into this, and the interest in trying to look at the city\'s potential financial position over the long term," Stachowicz said in an email interview. "We all have a shared goal to plan for a sound fiscal future with few surprises. We believe the Project Toto effort will mesh well with our other efforts as we build the budget for the next fiscal year and beyond." Project Toto complements the city\'s effort to amplify the transparency of city decisions to community members. The aim is to increase the understanding about the city\'s financial situation and make the information more accessible and easier to understand. The project is mostly a tool for public education, but can also make predictions about potential decisions regarding the city\'s financial future. Once completed, the program will allow residents to manipulate variables to see their eventual consequences, such as tax increases or extensions and proposed developments "This really isn\'t a budget, it is a forecast to see the intervention of these decisions," Williams said in an interview with The Davis Enterprise. "What happens if we extend the sales tax? What does it do given the other numbers that are in?" Project Toto enables users, whether it be a curious Davis resident, a concerned community member or a city leader, with the ability to project city finances with differing variables. The online program consists of the 400-page city budget for the 2016-2017 fiscal year, the previous budget, staff reports and consultant analyses. All of the documents are cited and accessible to the public within Project Toto. "It\'s a model that very easily lends itself to visual representation," Mayor Robb Davis said. "You can see the impacts of decisions the council makes on the fiscal health of the city." Complementary to this program, there is also a more advanced version of the model with more in-depth analyses of the city\'s finances. However, for an easy-to-understand, simplistic overview, Project Toto should be enough to help residents comprehend Davis finances. There is still more to do on the project, but its creators are hard at work trying to finalize it before the 2017-2018 fiscal year budget. "It\'s something I have been very much supportive of," Davis said. "Transparency is not just something that I have been supportive of but something we have stated as a city council objective [ ] this fits very well with our attempt to inform the public of our challenges with our fiscal situation." ',
    'title': 'Project Toto aims to address questions regarding city finances',
    'url': 'https://theaggie.org/2017/02/14/project-toto-aims-to-address-questions-regarding-city-finances/'
}
```

Hints:

* The author line is always the last line of the last paragraph.

*   Python 2 displays some Unicode characters as `\uXXXX`. For instance, `\u201c` is a left-facing quotation mark.
    You can convert most of these to ASCII characters with the method call (on a string)
    ```
    .translate({ 0x2018:0x27, 0x2019:0x27, 0x201C:0x22, 0x201D:0x22, 0x2026:0x20 })
    ```
    If you're curious about these characters, you can look them up on [this page](http://unicode.org/cldr/utility/character.jsp), or read 
    more about [what Unicode is](http://unicode.org/standard/WhatIsUnicode.html).

In [2]:
def get_text(url):
    """
    Takes parameter url, extracts "written by" line
    Return a dictionary with keys "url", "title", "text", and "author".
    """

    url_req = requests.get(url)
    html = url_req.text
    art = BeautifulSoup(html, 'lxml')

    adict = {}
    adict['title'] = art.find_all('h1', attrs={'class': 'entry-title'})[0].get_text()
    adict['url'] = url

    body = art.find_all("div", {'itemprop': 'articleBody'})[0]
    paragraph = body.find_all('p')

    text = " ".join([x.getText() for x in paragraph[:-1]])
    text = text.translate({0x2018: 0x27, 0x2019: 0x27, 0x201C: 0x22, 0x201D: 0x22, 0x2026: 0x20})
    adict['text'] = text

    auth = paragraph[len(paragraph) - 1].getText()
    author = re.split("Written [Bb]y", auth, flags=re.IGNORECASE)[-1]
    adict['author'] = author
    return adict

get_text("https://theaggie.org/2017/02/24/2017-winter-quarter-election-results/")


{'author': u': Alyssa Vandenberg \xa0\u2014 campus@theaggie.org',
 'text': u"Six senators, new executive team elected Current ASUCD Vice President Abhay Sandhu announced the ASUCD election results on Feb. 24 in the Memorial Union's Mee room. Six senators were elected: Sam Chiang, Michael Gofman, Khadeja Ibrahim, Rahi Suryawanshi, Marcos Rodriguez and Yajaira Ramirez Sigala. Chiang and Ibrahim ran on the BASED slate, while Suryawanshi, Rodriguez and Ramirez Sigala ran on the Bespoke slate. Gofman ran independently. The new ASUCD president and vice president will be Josh Dalavai and Adilla Jamaludin. Dalavai and Jamaludin ran on the BASED slate.  The results will also be posted online at elections.ucdavis.edu.",
 'title': u'2017 Winter Quarter election results',
 'url': 'https://theaggie.org/2017/02/24/2017-winter-quarter-election-results/'}

__Exercise 1.3.__ Use your functions from exercises 1.1 and 1.2 to get a data frame of 60 [Campus News](https://theaggie.org/campus/) articles and a data frame of 60 [City News](https://theaggie.org/city/) articles. Add a column to each that indicates the category, then combine them into one big data frame.

The "text" column of this data frame will be your corpus for natural language processing in exercise 1.4.

In [5]:
campuslist = get_links("http://theaggie.org/campus", 6)
frame_dict = {'author': [], 'text': [], 'title': [], 'url': []}
for item in campuslist[0:60]: #I only want first 60 
    dict = get_text(item)
    #print dict
    frame_dict['url'].append(dict['url'])
    frame_dict['title'].append(dict['title'])
    frame_dict['author'].append(dict['author'])
    frame_dict['text'].append(dict['text'])
    #print(item)
#print(frame_dict)
campus_df = pd.DataFrame.from_dict(frame_dict)
#print(df[df.columns[0]])
#print(df[df.columns[1]])
#print(df[df.columns[2]])
#print(df[df.columns[3]])
print campus_df.head()

                                              author  \
0  : Aaron Liss and Raul Castellanos  — campus@th...   
1               : Kimia Akbari — campus@theaggie.org   
2             : Kenton Goldsby — campus@theaggie.org   
3            : Ivan Valenzuela — campus@theaggie.org   
4         : Alyssa Vandenberg  — campus@theaggie.org   

                                                text  \
0  Wells Fargo faces fraud, predatory lending cha...   
1  Faculty, students recount personal tales of im...   
2  Opening date pushed back to May 1 Students hav...   
3  Veto included revision abandoning creation of ...   
4  Shaheen's name to remain on ballot, his votes ...   

                                               title  \
0  University of California, Davis City Council s...   
1  Academics unite in peaceful rally against immi...   
2            Memorial Union to reopen Spring Quarter   
3  ASUCD President Alex Lee vetoes amendment for ...   
4  Senate candidate Zaki Shaheen withdraws fro

In [6]:
citylist = get_links("https://theaggie.org/city", 6)
#print citylist
frame_dict = {'author': [], 'text': [], 'title': [], 'url': []}
#print frame_dict
for item in citylist[0:60]: #I only want first 60 
    dict = get_text(item)
    #print dict
    frame_dict['url'].append(dict['url'])
    frame_dict['title'].append(dict['title'])
    frame_dict['author'].append(dict['author'])
    frame_dict['text'].append(dict['text'])
    #print frame_dict
    #print(item)
#print(frame_dict)
city_df = pd.DataFrame.from_dict(frame_dict)
#print(df[df.columns[0]])
#print(df[df.columns[1]])
#print(df[df.columns[2]])
#print(df[df.columns[3]])
#print city_df.head()
print city_df['text'][0] #need this to be corpus

Local Whole Foods closes Feb. 12 After five years of providing business to students and locals of Davis, the Whole Foods Market on 1st Street closed on Sunday, Feb. 12. The Davis location was one of nine Whole Foods Markets across the country to permanently close. The Whole Foods Market in Davis was located in a small shopping center along with several other eateries. The closure was part of an evaluation nationwide to determine which Whole Food's locations were underperforming. The local market faced competition from other supermarkets and stores in the city, such as Safeway, Trader Joe's and the Davis Food Co-op. Whole Foods markets itself on its organic food that does not use artificial preservatives, colors, flavors, sweeteners or hydrogenated fats. Although this is a healthy benefit, many students could not afford the price tag associated with the products. "I think it affects students for the most part," said John Tuquero, a Verizon Wireless employee in the shopping center in whi

__Exercise 1.4.__ Use the Aggie corpus to answer the following questions. Use plots to support your analysis.

* What topics does the Aggie cover the most? Do city articles typically cover different topics than campus articles?

* What are the titles of the top 3 pairs of most similar articles? Examine each pair of articles. What words do they have in common?

* Do you think this corpus is representative of the Aggie? Why or why not? What kinds of inference can this corpus support? Explain your reasoning.

Hints:

*   The [nltk book](http://www.nltk.org/book/) and [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction) may be helpful here.

*   You can determine whether city articles are "near" campus articles from the similarity matrix or with k-nearest neighbors.

*   If you want, you can use the [wordcloud](http://amueller.github.io/word_cloud/) package to plot a word cloud. To install the package, run
    ```
    conda install -c https://conda.anaconda.org/amueller wordcloud
    ```
    in a terminal. Word clouds look nice and are easy to read, but are less precise than bar plots.

In [7]:
import nltk
from nltk.corpus import brown

In [8]:
city_docs = city_df['text'].tolist()
campus_docs  = campus_df['text'].tolist()
import numpy as np
import nltk
from nltk import corpus
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

# 'lemmatize' : stemming tokenizer from the lesson 11 NB
stemmer = PorterStemmer().stem
# word_tokenize bring ',' '.' etc. as tokens, so used different one below
# tokenize = nltk.word_tokenize

def lemmatize(text):
    """
    Extract simple lemmas based on tokenization and stemming
    Input: string
    Output: list of strings (lemmata)
    """
    #  More detailed tokenizer from http://brandonrose.org/clustering
    # first tokenize by sentence, then by word to ensure that punctuation is caught as it's own token
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    stems = [stemmer(t.lower()) for t in filtered_tokens]
    return stems   


In [9]:
# Reuters suff from the Lesson 11, modified for brown
#(better categories)

fids = corpus.brown.fileids()
# words are tagged so need this
brown = [" ".join(corpus.brown.words(f)) for f in fids]

len(brown)

500

In [10]:
extra_stopwords=set(["trump","will","UC","Davis","year","campus","city","police","day", "really","also", "going", "come","don","don't", "people", "food"])

my_stopwords = set(nltk.corpus.stopwords.words('english')).union(extra_stopwords)

vectorizer = TfidfVectorizer(tokenizer=lemmatize,stop_words=my_stopwords,smooth_idf=True,norm=None)

# Train Vectorizer with Reuters articles first 
brown_tfs = vectorizer.fit_transform(brown)

print brown_tfs.shape

(500, 30589)


In [11]:
# transform (not fit) docs in the same Vectorizer
city_tfs = vectorizer.transform(city_docs)
print city_tfs.shape


(60, 30589)


In [12]:
from sklearn.neighbors import KNeighborsClassifier
nnclassifier = KNeighborsClassifier()
nnclassifier.fit(brown_tfs,[corpus.brown.categories(fid)[0] for fid in fids])

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

In [13]:
predicted = nnclassifier.predict(city_tfs)

In [14]:
print predicted

[u'adventure' u'belles_lettres' u'adventure' u'belles_lettres'
 u'belles_lettres' u'belles_lettres' u'adventure' u'adventure' u'adventure'
 u'belles_lettres' u'belles_lettres' u'belles_lettres' u'adventure'
 u'belles_lettres' u'belles_lettres' u'adventure' u'belles_lettres'
 u'belles_lettres' u'adventure' u'belles_lettres' u'belles_lettres'
 u'belles_lettres' u'belles_lettres' u'belles_lettres' u'belles_lettres'
 u'belles_lettres' u'belles_lettres' u'belles_lettres' u'adventure'
 u'belles_lettres' u'adventure' u'adventure' u'belles_lettres'
 u'belles_lettres' u'belles_lettres' u'belles_lettres' u'adventure'
 u'belles_lettres' u'belles_lettres' u'belles_lettres' u'belles_lettres'
 u'belles_lettres' u'belles_lettres' u'belles_lettres' u'belles_lettres'
 u'belles_lettres' u'belles_lettres' u'belles_lettres' u'adventure'
 u'belles_lettres' u'adventure' u'belles_lettres' u'belles_lettres'
 u'belles_lettres' u'belles_lettres' u'adventure' u'belles_lettres'
 u'belles_lettres' u'belles_lettres

In [15]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(brown_tfs,[corpus.brown.categories(fid)[0] for fid in fids])

In [16]:
 predicted = clf.predict(city_tfs)

In [17]:
print predicted

[u'news' u'lore' u'news' u'news' u'government' u'hobbies' u'news' u'news'
 u'government' u'news' u'belles_lettres' u'belles_lettres' u'editorial'
 u'news' u'news' u'belles_lettres' u'news' u'news' u'news' u'news'
 u'editorial' u'news' u'news' u'news' u'belles_lettres' u'learned' u'news'
 u'government' u'government' u'news' u'editorial' u'belles_lettres' u'news'
 u'news' u'news' u'belles_lettres' u'belles_lettres' u'news' u'news'
 u'news' u'reviews' u'news' u'news' u'news' u'learned' u'news' u'news'
 u'news' u'belles_lettres' u'belles_lettres' u'belles_lettres' u'news'
 u'news' u'belles_lettres' u'news' u'news' u'news' u'news' u'news' u'news']


In [18]:
city_sim = city_tfs.dot(city_tfs.T) #Creat the similarity matrix
city_sim.mean() #The mean of the similarity matrix. Higher the number the more similar the pieces are

1173.6872683050003

In [19]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics.pairwise import linear_kernel
from nltk.corpus import stopwords
from scipy.sparse import csr_matrix

cosine_similarities = linear_kernel(city_tfs[0:1], city_tfs).flatten() # RETURN COSINE SIMILARITIES

# 3 Pairs most closely related articles - city
related_docs_indices = cosine_similarities.argsort()[:-7:-1] #Grab the top 3 pairs which equates to last 6 elements of the sorted list.
print "The top 3 most related article pairs from city are: "
print city_df["title"][related_docs_indices[0]],"+",city_df["title"][related_docs_indices[1]]
print city_df["title"][related_docs_indices[2]],"+",city_df["title"][related_docs_indices[3]]
print city_df["title"][related_docs_indices[4]],"+",city_df["title"][related_docs_indices[5]]

cosine_similarities_campus = linear_kernel(city_tfs[0:1], city_tfs).flatten() # RETURN COSINE SIMILARITIES

# 3 Pairs most closely related articles - campus
related_docs_indices = cosine_similarities_campus.argsort()[:-7:-1] #Grab the top 3 pairs which equates to last 6 elements of the sorted list.
print "The top 3 most related article pairs from campus are: "
print campus_df["title"][related_docs_indices[0]],"+",campus_df["title"][related_docs_indices[1]]
print campus_df["title"][related_docs_indices[2]],"+",campus_df["title"][related_docs_indices[3]]
print campus_df["title"][related_docs_indices[4]],"+",campus_df["title"][related_docs_indices[5]]

The top 3 most related article pairs from city are: 
Davis’ Whole Foods Market shuts down + Holding the Light
Local residents attend Davis town hall meeting + City of Davis to retain sanctuary city status
Children’s Candlelight Parade lights up downtown Davis + Davis’s Historic City Hall building to be put up for sale
The top 3 most related article pairs from campus are: 
University of California, Davis City Council sever Wells Fargo contracts + UC-wide walkout, teach-ins on Trump’s inauguration day
Sierra Nevada Brewing owners gift $2 million to UC Davis program + UC President selects Gary May as new UC Davis chancellor
Last week in Senate + Memorial Union to reopen Spring Quarter


In [20]:
# This part does similarity between documents, instead of grouping by topic.
# It does not use Reuters - perhaps, a better clustering would be by set of topicssimilarity

# pairwise similarity
from sklearn.metrics.pairwise import cosine_similarity
# we can do multi-dim and ward transfromations with dist later
dist = 1 - cosine_similarity(city_tfs)
print dist.shape
# feature vocabulary
terms = vectorizer.get_feature_names()

(60, 60)


In [21]:
# from http://brandonrose.org/clustering

from sklearn.cluster import KMeans
km = KMeans()

# Option: use AffinityPropagation (NearestNeigbour) instead of K-means
# from sklearn.cluster import  AffinityPropagation
# kM=AffinityPropagation()

%time km.fit(city_tfs)

# add cluster number as a column to the frame
clusters = km.labels_.tolist()
city_df["cluster"]=clusters
city_df.set_index("cluster")

from __future__ import print_function

print("Top terms per cluster:")
print()
#sort cluster centers by proximity to centroid
order_centroids = km.cluster_centers_.argsort()[:, ::-1] 

for i in range(len(clusters)-1):
    print("Cluster %d words:" % i, end='')
    
    for ind in order_centroids[i, :5]: #replace 5 with n words per cluster
        print(' %s' % terms[ind])
    print() #add whitespace
    print() #add whitespace
    
    print("Cluster %d titles:" % i, end='')
    for title in city_df['title'].tolist():
        print(' %s,' % title, end='\n')
        
    print() #add whitespace
    print() #add whitespace
    
print()
print()

CPU times: user 431 ms, sys: 19.8 ms, total: 451 ms
Wall time: 290 ms
Top terms per cluster:

Cluster 0 words: davi
 said
 's
 commun
 student


Cluster 0 titles: Davis’ Whole Foods Market shuts down,
 Protest against Planned Parenthood in Woodland is met with counter protests,
 Davis’s Historic City Hall building to be put up for sale,
 Davis stands with Muslim residents,
 City of Davis awarded funds for new recycling bins,
 Police Logs,
 City of Davis to retain sanctuary city status,
 Suspect in Davis Islamic Center vandalism arrested,
 Project Toto aims to address questions regarding city finances,
 Police Logs,
 News in Brief: A Valentine’s Day for everybody,
 The musical train to memory lane,
 Police Logs,
 Davis owls face eviction at Marriott Residence Inn,
 Davis Celebrates MLK Day,
 Local author, cyclist rides 2,300 miles,
 City leaders address panhandling issue in Davis,
 Toastmasters help members conquer fear of public speaking,
 Sacramento’s new public transportation,
 A sym

IndexError: index 8 is out of bounds for axis 0 with size 8

I think that this corpus is representative of the aggie since it includes lots of info about community events, social activism, and campus happenings that always encompass most writing in the Aggie. 