# 11 - Web Information Retrieval 

by [Alejandro Correa Bahnsen](http://albahnsen.com/)

version 0.1, Apr 2016

## Part of the class [Practical Machine Learning](https://github.com/albahnsen/PracticalMachineLearningClass)



This notebook is licensed under a [Creative Commons Attribution-ShareAlike 3.0 Unported License](http://creativecommons.org/licenses/by-sa/3.0/deed.en_US). 

## Introduction

Information retrieval (IR) is the activity of obtaining information resources relevant to an information need from a collection of information resources. Searches can be based on metadata or on full-text (or other content-based) indexing. (Wikipedia)

In [1]:
url_ = 'http://mashable.com/2016/03/07/apple-ebook-case/#6KXWVluVqmqg'

In [2]:
from IPython.display import IFrame
IFrame(url_, 600, 600)

## Get the webpage into a Python object

If we want to collect information on hundreds or thousands of webpages, doing it manually is a no-go. Instead, let's load the content of the webpage into Python using web scraping.

### Download the HTML code of the webpage

In [3]:
import urllib.request
response = urllib.request.urlopen(url_)
html = response.read()

In [4]:
html[0:800]

b'<!DOCTYPE html>\n<!--\no o     o     +              o\n+   +     +             o     +       +\n            +\no  +    +        o  +           +        +\n     __  __           _           _     _\n~_,-|  \\/  | __ _ ___| |__   __ _| |__ | | ___\n    | |\\/| |/ _` / __| \'_ \\ / _` | \'_ \\| |/ _ \\,-~_,- - - ,\n~_,-| |  | | (_| \\__ \\ | | | (_| | |_) | |  __/    |   /\\_/\\\n    |_|  |_|\\__,_|___/_| |_|\\__,_|_.__/|_|\\___|  ~=|__( ^ .^)\n~_,-~_,-~_,-~_,-~_,-~_,-~_,-~_,-~_,-~_,-~_,-~_,-~_,""   ""\no o     o     +              o\n+   +     +             o     +       +\n            +\no  +    +        o  +           +        +\n-->\n<html data-env=\'production\' lang=\'en\' xml:lang=\'en\'>\n<head>\n<script>\n  window.__o = {"channel":"business","content_type":"article","v_buy":null,"v_buy_i":null,"h_pub":582.0,"h_buy":null,"h'
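The download above returns raw bytes, as the `b'...'` prefix shows. A minimal sketch of a slightly more defensive fetch follows. The helper name `fetch_html`, the 10-second timeout, and the User-Agent string are assumptions for illustration and are not part of the original notebook.

```python
import urllib.request


def fetch_html(url, timeout=10, user_agent="Mozilla/5.0 (teaching example)"):
    """Download a page and return it as text (hypothetical helper)."""
    # Some sites reject requests without a browser-like User-Agent header.
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Use the charset announced by the server, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

`BeautifulSoup` accepts either bytes or text, so the rest of the notebook would work with the decoded string as well.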

### Let's parse the input into a more readable form

In [5]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')

print(soup.prettify())

<!DOCTYPE html>
<!--
o o     o     +              o
+   +     +             o     +       +
            +
o  +    +        o  +           +        +
     __  __           _           _     _
~_,-|  \/  | __ _ ___| |__   __ _| |__ | | ___
    | |\/| |/ _` / __| '_ \ / _` | '_ \| |/ _ \,-~_,- - - ,
~_,-| |  | | (_| \__ \ | | | (_| | |_) | |  __/    |   /\_/\
    |_|  |_|\__,_|___/_| |_|\__,_|_.__/|_|\___|  ~=|__( ^ .^)
~_,-~_,-~_,-~_,-~_,-~_,-~_,-~_,-~_,-~_,-~_,-~_,-~_,""   ""
o o     o     +              o
+   +     +             o     +       +
            +
o  +    +        o  +           +        +
-->
<html data-env="production" lang="en" xml:lang="en">
 <head>
  <script>
   window.__o = {"channel":"business","content_type":"article","v_buy":null,"v_buy_i":null,"h_pub":582.0,"h_buy":null,"h_pub_buy":null,"v_cur":1.5,"v_max":1.8,"v_cur_i":1,"v_max_i":1,"events":"event51,event61","top_channel":"business","content_source_type":"Internal","content_source_name":"Internal","author_name":"

### Page title

In [6]:
title = soup.title.string
title

'The Supreme Court smacked down Apple today'

### Author name

In [7]:
author = soup.find_all("span", { "class" : "author_name"})
author

[<span class="author_name">By Seth Fiegerman</span>]

In [8]:
# If author is empty try this:
if author == []:
    author = soup.find_all("span", { "class" : "byline basic"})
author

[<span class="author_name">By Seth Fiegerman</span>]

In [9]:
str(author).split('>')

['[<span class="author_name"', 'By Seth Fiegerman</span', ']']

In [10]:
author = str(author).split('>')[1]
author

'By Seth Fiegerman</span'

In [11]:
author = author.split('By ')[1]
author

'Seth Fiegerman</span'

In [12]:
author = author.split('<')[0]
author

'Seth Fiegerman'
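The chain of `split` calls above works on the string form of the tag list. As an alternative sketch, the same name can be read from the tag itself with `get_text`. The variable names and the shortened sample markup below are assumptions for illustration; the markup is modeled on the byline shown in this notebook.

```python
from bs4 import BeautifulSoup

# Shortened sample markup, modeled on the byline in the article above.
sample_html = (
    '<a class="byline " href="/people/seth-fiegerman/">'
    '<span class="author_name">By Seth Fiegerman</span></a>'
)
sample_soup = BeautifulSoup(sample_html, "html.parser")

# find() returns the first matching tag, or None when there is no match.
author_tag = sample_soup.find("span", class_="author_name")
author_name = author_tag.get_text(strip=True) if author_tag else ""

# Drop the leading "By " only when it is actually present.
if author_name.startswith("By "):
    author_name = author_name[len("By "):]

print(author_name)
```

Reading the text from the tag avoids depending on the exact position of `>` and `<` characters in the serialized HTML.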

### Total shares

In [13]:
shares = soup.find_all("div", { "class" : "total-shares"})
shares

[<div class="total-shares" data-index="0">
 <em>4.9k</em>
 <div class="caption">Shares</div>
 </div>]

In [14]:
shares = str(shares).split('<em>')[1].split('</em>')[0]
shares

'4.9k'

In [15]:
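if 'k' in shares:
    # '4.9k' -> '4900'; going through float() also handles counts
    # without a decimal point, such as '5k' -> '5000'
    shares = str(int(round(float(shares[:-1]) * 1000)))
shares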

'4900'
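The conversion of an abbreviated count can be sketched as a small reusable function that returns an integer. The name `parse_count` is hypothetical, and the sketch assumes English-style formatting (a period as the decimal mark and a lowercase or uppercase `k` for thousands).

```python
def parse_count(text):
    """Turn a share count such as '4.9k' or '762' into an int (hypothetical helper)."""
    # Assumption: commas are thousands separators, as in '1,234'.
    text = text.strip().lower().replace(",", "")
    if text.endswith("k"):
        # round() guards against floating-point noise before truncating to int.
        return int(round(float(text[:-1]) * 1000))
    return int(text)


print(parse_count("4.9k"))
print(parse_count("762"))
```

Returning an integer rather than a string makes the value directly usable as a numeric feature.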

### Author webpage

In [16]:
author_web = soup.find_all("a", { "class" : "byline"})
author_web

[<a class="byline " href="/people/seth-fiegerman/"><img alt="Headshot_2015_sethfiegerman_1" class="author_image" src="http://rack.3.mshcdn.com/media/ZgkyMDE1LzA2LzE2LzBjL0hlYWRzaG90XzIwLjkzNDQxLmpwZwpwCXRodW1iCTkweDkwIwplCWpwZw/43e49f19/5ff/Headshot_2015_SethFiegerman_1.jpg"/><div class="author_and_date"><span class="author_name">By Seth Fiegerman</span><time datetime="Mon, 07 Mar 2016 16:36:45 +0000">2016-03-07 16:36:45 UTC</time></div></a>]

In [17]:
if author_web != []:
    author_web = str(author_web).split('href="')[1]
author_web

'/people/seth-fiegerman/"><img alt="Headshot_2015_sethfiegerman_1" class="author_image" src="http://rack.3.mshcdn.com/media/ZgkyMDE1LzA2LzE2LzBjL0hlYWRzaG90XzIwLjkzNDQxLmpwZwpwCXRodW1iCTkweDkwIwplCWpwZw/43e49f19/5ff/Headshot_2015_SethFiegerman_1.jpg"/><div class="author_and_date"><span class="author_name">By Seth Fiegerman</span><time datetime="Mon, 07 Mar 2016 16:36:45 +0000">2016-03-07 16:36:45 UTC</time></div></a>]'

In [18]:
if author_web != []:
    author_web = author_web.split('">')[0]
author_web

'/people/seth-fiegerman/'

In [19]:
if author_web != []:
    author_web = 'http://mashable.com' + author_web
author_web

'http://mashable.com/people/seth-fiegerman/'
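Building the absolute address by concatenating `'http://mashable.com'` assumes the link is always site-relative. A minimal sketch using the standard library's `urljoin`, which resolves a link against the page it was found on, is shown below. The variable names are assumptions for illustration.

```python
from urllib.parse import urljoin

# The article URL used earlier in this notebook and the relative link from its byline.
article_url = 'http://mashable.com/2016/03/07/apple-ebook-case/#6KXWVluVqmqg'
relative_link = '/people/seth-fiegerman/'

# urljoin keeps the scheme and host of the base URL and replaces the path.
author_url = urljoin(article_url, relative_link)
print(author_url)

# A link that is already absolute is returned unchanged.
print(urljoin(article_url, 'https://twitter.com/sfiegerman'))
```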

### Get news text

In [20]:
print(soup.get_text())






  window.__o = {"channel":"business","content_type":"article","v_buy":null,"v_buy_i":null,"h_pub":582.0,"h_buy":null,"h_pub_buy":null,"v_cur":1.5,"v_max":1.8,"v_cur_i":1,"v_max_i":1,"events":"event51,event61","top_channel":"business","content_source_type":"Internal","content_source_name":"Internal","author_name":"Seth Fiegerman","age":"24","pub_day":7,"pub_month":3,"pub_year":2016,"pub_date":"03/07/2016","sourced_from":"Internal","isPostView":true,"post_lead_type":"Alt Image Lead","topics":"Business","campaign":null,"display_mode":null,"viral_video_type":null,"b_flag":true};
  window._gaq = window._gaq || [];
  window._gaq.push(['_setAccount', 'UA-92124-1']);
  window._geo = "US";
  window.__domStart = (new Date().getTime())

The Supreme Court smacked down Apple today
































{"@context":"http://schema.org","headline":"The Supreme Court smacked down Apple today","url":"http://mashable.com/2016/03/07/apple-ebook-case/","image":"http://rack.3.mshcdn.com/media

In [21]:
try:
    text = str(soup.get_text()).split("UTC\n\n\n")[1]
except IndexError:
    text = str(soup.get_text()).split("Analysis\n\n")[1]

text = text.split('Have something to add to this story?')[0]

In [22]:
print(text)


Apple's long and controversial ebook case has reached its final chapter — and it's not the happy ending the company wanted.
The Supreme Court on Monday rejected an appeal filed by Apple to overturn a stinging ruling that it led a broad conspiracy with several major publishers to fix the price of e-books sold through its online bookstore.
The court's decision means Apple now has no choice but to pay out $400 million to consumers and an additional $50 million in legal fees, according to the original settlement in 2014.
SEE ALSO: Here's how Apple marshalled the entire tech industry in its fight with the FBI
For Apple, the final verdict is more damaging to its reputation as a consumer-friendly brand, not to mention the legacy of its beloved founder Steve Jobs, than to its actual bottom line.
To put the fine in context, the total $450 million payout is equal to about a little more than half the sales Apple generates on average each day, based on the $75.9 billion in revenue it reported in 
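The marker-based extraction above can also be sketched with `str.partition`, which never raises when a marker is missing. The function name `extract_body` and the toy page text are assumptions for illustration. Unlike the `try`/`except` version, this sketch falls back to the full text when no start marker is found.

```python
def extract_body(page_text, start_markers, end_marker):
    """Return the text between the first start marker found and the end marker."""
    for marker in start_markers:
        _, found, after = page_text.partition(marker)
        if found:
            # Keep only what follows the first marker that is present.
            page_text = after
            break
    # partition() returns the whole string first if end_marker is absent.
    return page_text.partition(end_marker)[0]


toy_page = "header 16:36:45 UTC\n\n\nBody line.\nHave something to add to this story? footer"
body = extract_body(
    toy_page,
    start_markers=["UTC\n\n\n", "Analysis\n\n"],
    end_marker="Have something to add to this story?",
)
print(body)
```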

## Author information

In [23]:
# Only if author_web != []:
from IPython.display import IFrame
IFrame(author_web, 600, 600)

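Binary features indicating whether the author is on Facebook, LinkedIn, Twitter, or Google+. Each one is stored here as the profile URL, with an empty string meaning the author is not on that network.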

In [24]:
author_networks = {'facebo': '',
                   'linked': '',
                   'twitte': '',
                   'google': ''}

In [25]:
response = urllib.request.urlopen(author_web)
html = response.read()
soup = BeautifulSoup(html, 'html.parser')

### Get all the networks that the author is in

In [26]:
networks = soup.find_all("div", { "class" : "profile-networks"})
networks

[<div class="profile-networks">
 <a class="network-badge network-badge-facebook network-badge-round" href="http://www.facebook.com/sfiegerman" target="_blank"></a>
 <a class="network-badge network-badge-linkedin network-badge-round" href="http://www.linkedin.com/in/sfiegerman" target="_blank"></a>
 <a class="network-badge network-badge-round network-badge-twitter" href="https://twitter.com/sfiegerman" target="_blank"></a>
 </div>]

In [27]:
networks = str(networks).replace('network-badge-round', '')
networks

'[<div class="profile-networks">\n<a class="network-badge network-badge-facebook " href="http://www.facebook.com/sfiegerman" target="_blank"></a>\n<a class="network-badge network-badge-linkedin " href="http://www.linkedin.com/in/sfiegerman" target="_blank"></a>\n<a class="network-badge  network-badge-twitter" href="https://twitter.com/sfiegerman" target="_blank"></a>\n</div>]'

In [28]:
networks = networks.split('network-badge-')
networks  # Note networks is now a list of strings

['[<div class="profile-networks">\n<a class="network-badge ',
 'facebook " href="http://www.facebook.com/sfiegerman" target="_blank"></a>\n<a class="network-badge ',
 'linkedin " href="http://www.linkedin.com/in/sfiegerman" target="_blank"></a>\n<a class="network-badge  ',
 'twitter" href="https://twitter.com/sfiegerman" target="_blank"></a>\n</div>]']

In [29]:
for network in networks:
    if network[:6] in author_networks.keys():
        author_networks[network[:6]] = network.split('href="')[1].split('" target')[0]

In [30]:
author_networks

{'facebo': 'http://www.facebook.com/sfiegerman',
 'google': '',
 'linked': 'http://www.linkedin.com/in/sfiegerman',
 'twitte': 'https://twitter.com/sfiegerman'}
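The same dictionary can be built without converting the tags to a string, by reading each link's `class` and `href` attributes. This is a sketch under assumptions: the function name `extract_networks` and the full-length keys are hypothetical, and the sample markup is copied from the output above.

```python
from bs4 import BeautifulSoup

profile_html = """
<div class="profile-networks">
 <a class="network-badge network-badge-facebook network-badge-round" href="http://www.facebook.com/sfiegerman" target="_blank"></a>
 <a class="network-badge network-badge-linkedin network-badge-round" href="http://www.linkedin.com/in/sfiegerman" target="_blank"></a>
 <a class="network-badge network-badge-round network-badge-twitter" href="https://twitter.com/sfiegerman" target="_blank"></a>
</div>
"""


def extract_networks(html, known=("facebook", "linkedin", "twitter", "google")):
    """Map each known network name to the author's profile URL ('' if absent)."""
    links = dict.fromkeys(known, "")
    container = BeautifulSoup(html, "html.parser").find("div", class_="profile-networks")
    if container is None:
        return links
    prefix = "network-badge-"
    for anchor in container.find_all("a"):
        # BeautifulSoup exposes the class attribute as a list of class names.
        for css_class in anchor.get("class", []):
            if not css_class.startswith(prefix):
                continue
            suffix = css_class[len(prefix):]
            for name in known:
                # Prefix match, similar in spirit to the six-character keys above.
                if suffix.startswith(name):
                    links[name] = anchor.get("href", "")
    return links


print(extract_networks(profile_html))
```

Because the class names are inspected one at a time, the order of `network-badge-round` relative to the network class does not matter.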

### Get the number of Twitter followers

In [31]:
author_networks['twitter_followers'] = 0
author_networks['twitte']

'https://twitter.com/sfiegerman'

In [32]:
response = urllib.request.urlopen(author_networks['twitte'])
html = response.read()
soup = BeautifulSoup(html, 'html.parser')

In [33]:
followers = str(soup.find_all("span", { "class" : "ProfileNav-value"})[2])
followers

'<span class="ProfileNav-value" data-is-compact="true">14,2\xa0mil</span>'

In [34]:
followers = followers.split('">')[1]
followers

'14,2\xa0mil</span>'

In [35]:
if ('K' in followers) or ('mil' in followers):
    followers = followers.split('\xa0')[0]
    if ',' in followers:
        followers = followers.replace(',', '') + '00'
    else:
        followers = followers + '000'
else:
    followers = followers.split('</span')[0].replace('.', '')
followers = int(followers)
followers

14200
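The follower count above arrived in a Spanish-style format (`14,2 mil`, with a comma as the decimal mark and a non-breaking space before `mil`). A minimal sketch of a parser for that format follows. The name `parse_followers` is hypothetical, and the sketch assumes the Spanish conventions seen in this output, including a period as the thousands separator.

```python
def parse_followers(text):
    """Parse counts such as '14,2\\xa0mil' or '2.991' into an int (hypothetical helper)."""
    # Normalize the non-breaking space that precedes 'mil'.
    text = text.replace("\xa0", " ").strip()
    if text.lower().endswith("mil"):
        # '14,2' -> '14.2': comma is the decimal mark in this locale.
        number = text[:-3].strip().replace(".", "").replace(",", ".")
        return int(round(float(number) * 1000))
    # Otherwise a period is a thousands separator, as in '2.991'.
    return int(text.replace(".", ""))


print(parse_followers("14,2\xa0mil"))
print(parse_followers("2.991"))
```

The page language depends on where the request comes from, so a different locale would need different rules.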

In [36]:
author_networks['twitter_followers'] = followers

## Merge everything into a function

In [37]:
def news_info(url):
    # Download HTML
    response = urllib.request.urlopen(url)  # use the function argument, not the global url_
    html = response.read()
    soup = BeautifulSoup(html, 'html.parser')
    
    # Title, author, text
    title = soup.title.string
    
    author = soup.find_all("span", { "class" : "author_name"})
    # If author is empty try this:
    if author == []:
        author = soup.find_all("span", { "class" : "byline basic"})
    author = str(author).split('>')[1].split('By ')[1].split('<')[0]
    
    # Number of shares
    shares = soup.find_all("div", { "class" : "total-shares"})
    shares = str(shares).split('<em>')[1].split('</em>')[0]
    if 'k' in shares:
        # '4.9k' -> '4900'; float() also handles counts such as '5k'
        shares = str(int(round(float(shares[:-1]) * 1000)))
    
    # Get text
    try:
        text = str(soup.get_text()).split("UTC\n\n\n")[1]
    except IndexError:
        text = str(soup.get_text()).split("Analysis\n\n")[1]
        
    text = text.split('Have something to add to this story?')[0]
    
    author_web = soup.find_all("a", { "class" : "byline"})
    if author_web != []:
        author_web = 'http://mashable.com' + str(author_web).split('href="')[1].split('">')[0]
    
        # Author networks
        author_networks = {'facebo': '',
                           'linked': '',
                           'twitte': '',
                           'google': ''}

        response = urllib.request.urlopen(author_web)
        html = response.read()
        soup = BeautifulSoup(html, 'html.parser')

        networks = str(soup.find_all("div", { "class" : "profile-networks"})).replace('network-badge-round', '').split('network-badge-')

        for network in networks:
            if network[:6] in author_networks.keys():
                author_networks[network[:6]] = network.split('href="')[1].split('" target')[0]

        # Author twitter followers
        author_networks['twitter_followers'] = 0
        if author_networks['twitte'] != '':

            response = urllib.request.urlopen(author_networks['twitte'])
            html = response.read()
            soup = BeautifulSoup(html, 'html.parser')

            followers = str(soup.find_all("span", { "class" : "ProfileNav-value"})[2]).split('">')[1]
            if ('K' in followers) or ('mil' in followers):
                followers = followers.split('\xa0')[0]
                if ',' in followers:
                    followers = followers.replace(',', '') + '00'
                else:
                    followers = followers + '000'
            else:
                followers = followers.split('</span')[0].replace('.', '')

            author_networks['twitter_followers'] = int(followers)
        
    else:
        author_networks = {'facebo': '',
                           'linked': '',
                           'twitte': '',
                           'google': '', 
                           'twitter_followers': 0}
        
    return {'title': title, 'author': author, 'shares': shares, 
            'author_web': author_web, 'text':text, 
            'author_networks': author_networks}


In [38]:
url_ = 'http://mashable.com/2016/03/04/the-who-50th-anniversary'
news_info(url_)

{'author': 'Yohana Desta',
 'author_networks': {'facebo': '',
  'google': 'https://plus.google.com/102665910331841447889?rel=author',
  'linked': '',
  'twitte': 'https://twitter.com/YohanaDesta',
  'twitter_followers': 2991},
 'author_web': 'http://mashable.com/people/yohana-desta/',
 'shares': '762',
 'text': '\nFifty years later, the Who can still rule a crowd.\nRoger Daltrey’s still got the howl and Pete Townshend can make the entirety of Madison Square Garden lose its mind over a swivel of his right arm. The band, currently on its 50th anniversary tour, packed Manhattan’s famed stadium on Wednesday night with thousands of fans eager to sing along to hits like “My Generation” and "Won\'t Get Fooled Again."\xa0\nThe show was even more celebratory because it was rescheduled from last fall, when Daltrey was diagnosed with viral meningitis. At the end of the show, he thanked the crowd for sticking around while he was "really having a tough time with the reaper."\n"Here were are, still 

In [39]:
url_ = 'http://mashable.com/2016/03/08/scotland-giant-rabbit-home'
news_info(url_)

{'author': 'Davina Merchant',
 'author_networks': {'facebo': '',
  'google': 'https://plus.google.com/105525238342980116477?rel=author',
  'linked': '',
  'twitte': '',
  'twitter_followers': 0},
 'author_web': 'http://mashable.com/people/568bdab3519840193100211f/',
 'shares': '7000',
 'text': 'LONDON - Last month we reported on a dog-sized rabbit in desperate need of a new home — now just under a month later Atlas the rabbit now has a permanent home. \nAfter his story went global people from all over the world, including the U.S., Canada and France started reaching out to the Scottish Society for the Prevention of Cruelty to Animals to re-home the rabbit.   \nThanks to Jen Hislop from Ayrshire, the adorable bunny will get to stay in his native Scotland. \n\nHe even has a new buggy.Image: Facebook Scottish SPCAJen, a financial fraud investigator, told the charity, "I burst into tears when I got the phone call saying I had been chosen to re-home Atlas and I cried again when I collected 

In [40]:
url_ = 'http://mashable.com/2016/03/08/15-skills-digital-marketers'
news_info(url_)

{'author': 'Scott Gerber',
 'author_networks': {'facebo': '',
  'google': '',
  'linked': '',
  'twitte': '',
  'twitter_followers': 0},
 'author_web': [],
 'shares': '4900',
 'text': 'Today\'s digital marketing experts must have a diverse skill set, including a sophisticated grasp of available media channels, the ability to identify up-and-coming opportunities, on top of having the basic skills of a brilliant marketer. What\'s more, they have to possess a balance of critical and creative thinking skills in order to drive measurable success for their company.\nThat\'s why I asked 15 members of Young Entrepreneur Council (YEC) what they look for when hiring digital marketers. Their best answers are below.\n1. Paid social media advertising expertise\n\nA new digital marketing hire should be well-versed in paid social media advertising, especially through Facebook or a similar social platform that our company uses regularly. They need to be able to understand and implement Facebook analyt

In [41]:
url_ = 'http://mashable.com/2016/03/31/donald-trump-gaslighting-women'
news_info(url_)

{'author': 'Rebecca Ruiz',
 'author_networks': {'facebo': '',
  'google': '',
  'linked': '',
  'twitte': 'https://twitter.com/rebecca_ruiz',
  'twitter_followers': 3734},
 'author_web': 'http://mashable.com/people/rebecca-ruiz/',
 'shares': '1400',
 'text': 'Analysis\n\n\n\n\n\nThere is a reason that\xa0Donald Trump\'s\xa0outrageous statements and behavior feel familiar to many women.\xa0\nIt\'s not because they know his declarative style and trademark shrug from reality television or political debates. Nor is it because his outsized role in American business made an unforgettable impression on them.\nSEE ALSO: As Donald Trump targets women, Republicans say he could cost the party everything\nThe eerie familiarity is more personal than that. They know Trump because they\'ve encountered a man like him at home, work, on social media or in a relationship.\xa0\nThis man extols the virtues of women, but has no problem reducing them to sex objects. He casts himself as unflappable, but blame
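To apply a scraper such as `news_info` to many pages, the calls above can be wrapped in a loop that pauses between requests and records failures instead of stopping. This is a sketch: the name `scrape_many`, the one-second default delay, and the stand-in scraper used in the demonstration are all assumptions for illustration.

```python
import time


def scrape_many(urls, scraper, delay_seconds=1.0):
    """Run scraper(url) for each URL; return (results, failures)."""
    results, failures = [], {}
    for position, url in enumerate(urls):
        # Pause between requests to avoid hammering the site.
        if position > 0 and delay_seconds > 0:
            time.sleep(delay_seconds)
        try:
            results.append(scraper(url))
        except Exception as error:
            # A page with an unexpected layout should not abort the whole run.
            failures[url] = repr(error)
    return results, failures


def fake_scraper(url):
    """Stand-in for news_info so the example runs without network access."""
    if "bad" in url:
        raise IndexError("list index out of range")
    return {"url": url, "title": "ok"}


results, failures = scrape_many(["page-1", "bad-page", "page-2"], fake_scraper, delay_seconds=0)
print(len(results), list(failures))
```

In practice `scraper` would be `news_info`, and the list of failed URLs shows which page layouts the parsing rules do not yet cover.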