# CS 429: Information Retrieval
<br>

## Lecture 26: HITS

<br>

### Dr. Aron Culotta
### Illinois Institute of Technology
### Spring 2015

The **Hubs and Authorities** algorithm (HITS, for Hyperlink-Induced Topic Search) is a simple procedure that assigns each page two scores:

- **hub score:** how likely is this page to be a directory?
- **authority score:** how likely is this page to be a trustworthy resource on a topic?

Let $F_u$ be the set of *forward* links of $u$ (the pages that $u$ links to).

Let $B_u$ be the set of *back* links of $u$ (the pages that link to $u$).

$$ \begin{align}
h(u) &= \sum_{v \in F_u} a(v)\\
a(u) &= \sum_{v \in B_u} h(v)
\end{align}$$

In words:

- The hub score for $u$ is the sum of the authority scores for all pages linked from $u$.
- The authority score for $u$ is the sum of the hub scores for all pages linking to $u$.
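
The same two sums can be written in matrix form. This is a sketch under one assumption that is not in the notes above: $A$ denotes the adjacency matrix of the link graph, with $A_{uv} = 1$ if $v \in F_u$ and $0$ otherwise, and $\mathbf{h}$, $\mathbf{a}$ are the vectors of hub and authority scores.

$$ \begin{align}
\mathbf{h} &= A\,\mathbf{a}\\
\mathbf{a} &= A^{T}\mathbf{h}
\end{align}$$

Substituting one equation into the other gives $\mathbf{a} \propto A^{T}A\,\mathbf{a}$ and $\mathbf{h} \propto AA^{T}\mathbf{h}$ once the scores are rescaled after each step, so repeating the updates amounts to power iteration on $A^{T}A$ and $AA^{T}$.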

As with PageRank, we can compute these iteratively until convergence:

1. Initialize all hub/authority scores to 1.0
2. Loop until convergence
  1. update authority scores
  2. update hub scores
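
The loop above can be sketched on a tiny graph before moving to crawled data. Everything in this block is an assumption made for illustration: the four-page graph, the names `toy_outlinks`, `invert`, `l2_normalize`, and `hits`, the tolerance `tol`, and the choice to rescale the scores to unit length after each update so that they stay bounded.

```python
import math

# Hypothetical toy graph (not from the lecture data): page -> pages it links to.
toy_outlinks = {
    'a': {'c', 'd'},
    'b': {'c', 'd'},
    'c': set(),
    'd': {'c'},
}

def invert(outlinks):
    """Build page -> set of pages linking to it (the back links)."""
    inlinks = {u: set() for u in outlinks}
    for u, targets in outlinks.items():
        for v in targets:
            inlinks[v].add(u)
    return inlinks

def l2_normalize(scores):
    """Rescale scores to unit Euclidean length (left unchanged if all are zero)."""
    norm = math.sqrt(sum(s * s for s in scores.values()))
    if norm == 0:
        return dict(scores)
    return {u: s / norm for u, s in scores.items()}

def hits(outlinks, tol=1e-8, max_iter=100):
    """Iterate the hub/authority updates until the scores stop changing."""
    inlinks = invert(outlinks)
    hubs = {u: 1.0 for u in outlinks}
    auths = {u: 1.0 for u in outlinks}
    for _ in range(max_iter):
        # a(u) = sum of hub scores of pages linking to u
        new_auths = l2_normalize({u: sum(hubs[v] for v in inlinks[u]) for u in outlinks})
        # h(u) = sum of authority scores of pages u links to
        new_hubs = l2_normalize({u: sum(new_auths[v] for v in outlinks[u]) for u in outlinks})
        change = max(abs(new_hubs[u] - hubs[u]) + abs(new_auths[u] - auths[u])
                     for u in outlinks)
        hubs, auths = new_hubs, new_auths
        if change < tol:
            break
    return hubs, auths

toy_hubs, toy_auths = hits(toy_outlinks)
print('hubs', sorted(toy_hubs.items(), key=lambda x: -x[1]))
print('authorities', sorted(toy_auths.items(), key=lambda x: -x[1]))
```

In this toy graph, `c` is linked from every other page and ends up as the top authority, while `a` and `b` link to the same two pages and so receive identical hub scores.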

Let's try this out on some real data:

In [16]:
# See http://stackoverflow.com/questions/1657570/google-search-from-a-python-app
import json
import urllib.parse, urllib.request
from google import search  # pip install google

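def clean_url(url):
    # Strip one trailing slash so equivalent URLs compare equal.
    # endswith() also handles an empty string, where url[-1] would raise IndexError.
    if url.endswith('/'):
        return url[:-1]
    else:
        return url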
    
def search_google(query_str, limit):
    return set([clean_url(u) for u in search(query_str, stop=limit)][:limit])

In [18]:
# Get some search results.
urls = search_google('chicago technical universities', 10)
print('top', len(urls), 'results:\n', '\n'.join(urls))

top 10 results:
 https://web.iit.edu/about/academic-programs
http://web.iit.edu
https://en.wikipedia.org/wiki/List_of_colleges_and_universities_in_Chicago
http://engineering.uic.edu/COE/UndergraduatePrograms
http://admissions.iit.edu/graduate
http://www.engineering.uic.edu
http://www.iit.edu
https://admissions.iit.edu/undergraduate
http://web.iit.edu/directory
http://ortchicagotech.edu


In [22]:
def get_domain(s):
    return '_'.join(urllib.parse.urlparse(s).netloc.split('.')[-2:])
print(get_domain('http://colleges.usnews.rankingsandreviews.com/best-colleges/iit-1691'))

rankingsandreviews_com


In [26]:
# Download links for each url. Store inlinks/outlinks for each page in original set.
from collections import defaultdict
import requests
import time
from bs4 import BeautifulSoup  # pip install beautifulsoup4

    
def crawl(toprocess, depth=2):
    """ Crawl pages from seed set."""
    processed = set()
    inlinks = defaultdict(set)   # url -> set of inlinks
    outlinks = defaultdict(set)  # url -> set of outlinks

    for i in range(depth):  # depth to crawl
        toprocess -= processed
        new_urls = set()
        for url in toprocess:
            print('url=', url)
            domain = get_domain(url)
            text = ''
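            try:
                time.sleep(.2)  # pause briefly between requests
                text = requests.get(url).text
            except requests.RequestException:
                print('CANNOT FETCH %s' % url)
                continue
            # Name the parser explicitly; otherwise BeautifulSoup warns and picks one itself.
            soup = BeautifulSoup(text, 'html.parser')
            processed.add(url)
            # Keep absolute http: links that point to a different domain.
            links = set(clean_url(a['href']) for a in soup.find_all('a')
                        if a.get('href')
                        and a['href'][:5] == 'http:'
                        and get_domain(a['href']) != domain)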
            print('found %d links' % len(links))
            outlinks[url] = links
            for link in links:
                inlinks[link].add(url)
            # Queue at most 20 not-yet-seen links from this page for the next round.
            links = [l for l in links if l not in processed and l not in toprocess][:20]
            new_urls |= set(links)
        toprocess = new_urls
    return inlinks, outlinks
    
inlinks, outlinks = crawl(urls)

url= https://web.iit.edu/about/academic-programs






found 6 links
url= http://web.iit.edu
found 7 links
url= https://en.wikipedia.org/wiki/List_of_colleges_and_universities_in_Chicago
found 4 links
url= http://engineering.uic.edu/COE/UndergraduatePrograms
found 4 links
url= http://admissions.iit.edu/graduate
found 10 links
url= http://www.engineering.uic.edu
found 4 links
url= http://www.iit.edu
found 7 links
url= https://admissions.iit.edu/undergraduate
found 12 links
url= http://web.iit.edu/directory
found 13 links
url= http://ortchicagotech.edu
found 5 links
url= http://www.youtube.com/user/IITToday
found 1 links
url= http://nces.ed.gov/collegenavigator/?s=IL
found 3 links
url= http://myillinoistech.tumblr.com
found 10 links
url= http://www.pinterest.com/ortchicago
found 2 links
url= http://www.bexleyseabury.edu/seabury-in-chicago
found 4 links
url= http://www.flickr.com/photos/iitugadmission/sets
found 1 links
url= http://www.youvisit.com
found 4 links
url= http://www.kentlaw.edu/students/organizations.html
found 17 links
url= http:

In [28]:
urls = set(inlinks) | set(outlinks)
print('%d total urls' % len(urls))

215 total urls


In [29]:
# Print outlinks.
for url in outlinks:
    if len(outlinks[url]) > 0:
        print('\n', url, '->\n', '\n'.join(outlinks[url]))


 http://www.youtube.com/user/IITToday ->
 http://www.iit.edu

 http://www.pinterest.com/ortchicago ->
 http://enable-javascript.com
http://ortchicagotech.edu

 http://www.bexleyseabury.edu/seabury-in-chicago ->
 http://www.twitter.com/bexleyseabury
http://www.facebook.com/bexleyseabury
http://mail.google.com/a/bexleyseabury.edu
http://www.anglicantheologicalreview.org

 http://www.flickr.com/photos/iitugadmission/sets ->
 http://flickr.tumblr.com

 http://www.kentlaw.edu ->
 http://calendar.kentlaw.iit.edu/EventList.aspx?view=EventDetails&eventidn=4528&information_id=7192&type=&rss=rss
http://www.facebook.com/ChicagoKentLaw
http://www.iit.edu
http://calendar.kentlaw.iit.edu/EventList.aspx?view=EventDetails&eventidn=4153&information_id=6391&type=&rss=rss
http://instagram.com/chicagokentlaw
http://calendar.kentlaw.iit.edu/EventList.aspx?view=EventDetails&eventidn=4518&information_id=7172&type=&rss=rss
http://www.iit.edu/about/department_sites.shtml
http://www.iit.edu/cdr
http://www.nxtb

In [30]:
for url in outlinks:
    print(url, len(outlinks[url]))

http://www.youtube.com/user/IITToday 1
http://www.pinterest.com/ortchicago 2
http://www.bexleyseabury.edu/seabury-in-chicago 4
http://www.flickr.com/photos/iitugadmission/sets 1
http://www.kentlaw.edu 21
http://www.youvisit.com 4
http://twitter.com/UICEngineering 1
http://galvinlibrary.wordpress.com 11
http://www.flickr.com/photos/uic_engineering/sets 1
http://ortchicago.blogspot.com 10
http://web.iit.edu 7
http://www.facebook.com/IITScarletHawks 0
http://www.universitytechnologypark.com 4
http://www.iitri.org 1
http://beonair.com/about 0
http://engineering.uic.edu/COE/UndergraduatePrograms 4
http://academicresourcecenter.blogspot.com 9
http://www.kentlaw.edu/record 2
http://www.illinoistechathletics.com 17
http://www.linkedin.com/company/chicago-ort-technical-institute 0
http://blogs.kentlaw.edu/faculty 44
http://web.iit.edu/directory 13
https://web.iit.edu/about/academic-programs 6
http://nces.ed.gov/collegenavigator/?s=IL 3
http://myillinoistech.tumblr.com 10
http://www.kentlaw.edu/

In [31]:
# Print inlinks
for url in inlinks:
    if len(inlinks[url]) > 0:
        print('\n', url, '<-\n', '\n'.join(inlinks[url]))


 http://www.pinterest.com/ortchicago <-
 http://ortchicagotech.edu

 http://www.bexleyseabury.edu/seabury-in-chicago <-
 https://en.wikipedia.org/wiki/List_of_colleges_and_universities_in_Chicago

 http://www.flickr.com/photos/iitugadmission/sets <-
 https://admissions.iit.edu/undergraduate

 http://facebook.com/sharer.php?u=http%3A%2F%2Fmyillinoistech.tumblr.com%2Fpost%2F142045013856%2Fyou-can-really-tell-when-the-architects-come-to&t=My%20Illinois%20Tech <-
 http://myillinoistech.tumblr.com

 http://www.prestosports.com <-
 http://www.illinoistechathletics.com

 http://www.kentlaw.iit.edu/academics/jd-program/practical-skills-training/externships <-
 http://www.kentlaw.edu/record

 http://t.co/fTZJPTLz7Z <-
 http://twitter.com/UICEngineering

 http://www.youtube.com/wildwestwebvideo <-
 http://wildwestonlineproductions.com

 http://www.facebook.com/bexleyseabury <-
 http://www.bexleyseabury.edu/seabury-in-chicago

 http://www.bls.gov/oco <-
 http://nces.ed.gov/collegenavigator/?s=IL

In [32]:
print('iit')
print('\n'.join(inlinks['http://web.iit.edu']))
print('uic')
print('\n'.join(inlinks['http://www.engineering.uic.edu']))

iit
http://www.illinoistechathletics.com
uic



In [33]:
# Initialize hubs and authorities scores and print.
hubs = {u: 1.0 for u in urls}
authorities = {u: 1.0 for u in urls}
def print_top(hubs, authorities):
    print('Top hubs\n', '\n'.join('%s %.6f' % (u[0], u[1]) for u in sorted(hubs.items(), key=lambda x: x[1], reverse=True)[:5]))
    print('Top authorities\n', '\n'.join('%s %.6f' % (u[0], u[1]) for u in sorted(authorities.items(), key=lambda x: x[1], reverse=True)[:5]))
    print()

print_top(hubs, authorities)

Top hubs
 http://blogs.kentlaw.iit.edu/facultynews/feed 1.000000
http://www.bexleyseabury.edu/seabury-in-chicago 1.000000
http://www.flickr.com/photos/iitugadmission/sets 1.000000
http://facebook.com/sharer.php?u=http%3A%2F%2Fmyillinoistech.tumblr.com%2Fpost%2F142045013856%2Fyou-can-really-tell-when-the-architects-come-to&t=My%20Illinois%20Tech 1.000000
http://www.huffingtonpost.com/2016/04/13/dennis-hastert-sentencing-lies_n_9697304.html?utm_hp_ref=chicago&ir=Chicago 1.000000
Top authorities
 http://blogs.kentlaw.iit.edu/facultynews/feed 1.000000
http://www.bexleyseabury.edu/seabury-in-chicago 1.000000
http://www.flickr.com/photos/iitugadmission/sets 1.000000
http://facebook.com/sharer.php?u=http%3A%2F%2Fmyillinoistech.tumblr.com%2Fpost%2F142045013856%2Fyou-can-really-tell-when-the-architects-come-to&t=My%20Illinois%20Tech 1.000000
http://www.huffingtonpost.com/2016/04/13/dennis-hastert-sentencing-lies_n_9697304.html?utm_hp_ref=chicago&ir=Chicago 1.000000



In [34]:
# Update hub and authority scores.
import math

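def update(hubs, authorities, inlinks, outlinks):
    # Note: `+=` keeps each page's previous score and adds the new sum to it,
    # whereas the equations above replace the score with the sum.
    for u in authorities:
        authorities[u] += sum(hubs[inlink] for inlink in inlinks[u])
    normalize(authorities)  # rescale to unit length so scores stay bounded
    for u in hubs:
        hubs[u] += sum(authorities[outlink] for outlink in outlinks[u])
    normalize(hubs)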

def normalize(d):
    norm = math.sqrt(sum([v * v for v in d.values()]))
    for k in d:
        d[k] /= norm

In [35]:
update(hubs, authorities, inlinks, outlinks)
print_top(hubs, authorities)

Top hubs
 http://blogs.kentlaw.edu/faculty 0.261201
http://blogs.kentlaw.edu 0.261201
http://www.kentlaw.edu 0.153157
http://www.kentlaw.edu/students/organizations.html 0.141476
https://admissions.iit.edu/undergraduate 0.141476
Top authorities
 http://iit.bncollege.com/webapp/wcs/stores/servlet/BNCBHomePage?storeId=45055&catalogId=10001&langId=-1 0.179961
http://www.universitytechnologypark.com 0.179961
http://www.illinoistechathletics.com 0.179961
http://www.iitri.org 0.179961
http://www.kentlaw.edu 0.179961



In [36]:
for i in range(10):
    update(hubs, authorities, inlinks, outlinks)
    print_top(hubs, authorities)

Top hubs
 http://blogs.kentlaw.edu/faculty 0.610643
http://blogs.kentlaw.edu 0.610643
http://www.kentlaw.edu 0.209587
http://www.kentlaw.edu/students/organizations.html 0.190651
https://admissions.iit.edu/undergraduate 0.174405
Top authorities
 http://iit.bncollege.com/webapp/wcs/stores/servlet/BNCBHomePage?storeId=45055&catalogId=10001&langId=-1 0.179312
http://www.universitytechnologypark.com 0.179312
http://www.illinoistechathletics.com 0.179312
http://www.iitri.org 0.179312
http://www.kentlaw.edu 0.179312

Top hubs
 http://blogs.kentlaw.edu/faculty 0.669915
http://blogs.kentlaw.edu 0.669915
http://www.kentlaw.edu 0.160470
http://www.kentlaw.edu/students/organizations.html 0.148194
https://admissions.iit.edu/undergraduate 0.104936
Top authorities
 http://www.facebook.com/ChicagoKentLaw 0.183271
http://blogs.kentlaw.iit.edu 0.183271
http://twitter.com/ChicagoKentLaw 0.183271
http://www.youtube.com/ChicagoKentLaw 0.183271
http://blogs.kentlaw.iit.edu/facultynews/feed 0.135899

Top hub

## Expanding the set of URLs

- How does restricting the crawl to only 20 links from each URL limit these results?
- For a given query, begin with the *root* set of the top $k$ matching documents.
- Expand the root set to include every page that a root page links to (forward links) and every page that links to a root page (back links).
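
The expansion step can be sketched as follows. This is a minimal illustration, assuming link dictionaries shaped like the `inlinks` and `outlinks` built by `crawl` (url -> set of urls); the function name `expand_root_set` and the toy data are hypothetical. Note that a crawl only records back links from pages it actually fetched, so a full expansion would need some other source of back links.

```python
def expand_root_set(root, inlinks, outlinks):
    """Return the root pages plus every page they link to and every page linking to them."""
    base = set(root)
    for u in root:
        base |= set(outlinks.get(u, ()))  # forward links from u
        base |= set(inlinks.get(u, ()))   # back links into u
    return base

# Hypothetical toy link data, for illustration only.
toy_outlinks = {'r1': {'x', 'y'}, 'r2': {'y'}, 'z': {'r1'}, 'w': {'q'}}
toy_inlinks = {'x': {'r1'}, 'y': {'r1', 'r2'}, 'r1': {'z'}, 'q': {'w'}}

base = expand_root_set({'r1', 'r2'}, toy_inlinks, toy_outlinks)
print(sorted(base))
```

Here `z` joins the set because it links to a root page, while `w` and `q` stay out because neither is connected to the root set directly.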


When would this help?
