# CS 429: Information Retrieval
<br>

## Lecture 26: HITS

<br>

### Dr. Aron Culotta
### Illinois Institute of Technology
### Spring 2015

The **Hubs and Authorities** algorithm (HITS, Hyperlink-Induced Topic Search) assigns each page two scores:

- **hub score:** how likely is this page to be a directory, i.e., a page that links to many good resources?
- **authority score:** how likely is this page to be a trustworthy resource on a topic?

Let $F_u$ be the set of pages reached by *forward* links (going out from $u$).

Let $B_u$ be the set of pages with *back* links (going in to $u$).

$$ \begin{align}
h(u) &= \sum_{v \in F_u} a(v)\\
a(u) &= \sum_{v \in B_u} h(v)
\end{align}$$

In words:

- The hub score for $u$ is the sum of the authority scores for all pages linked from $u$.
- The authority score for $u$ is the sum of the hub scores for all pages linking to $u$.
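
The same two updates can be written in matrix form. This is a sketch under notation that is not part of the lecture: assume $A$ is the adjacency matrix with $A_{uv} = 1$ if $u$ links to $v$ (and 0 otherwise), and let $\mathbf{h}$ and $\mathbf{a}$ be the vectors of hub and authority scores.

$$ \begin{align}
\mathbf{h} &= A\,\mathbf{a}\\
\mathbf{a} &= A^T\,\mathbf{h}
\end{align}$$

Substituting one update into the other gives $\mathbf{a} \leftarrow A^T A\,\mathbf{a}$ and $\mathbf{h} \leftarrow A A^T\,\mathbf{h}$, so, once the scores are rescaled after each step, the procedure behaves like power iteration on $A^T A$ and $A A^T$.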

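As with PageRank, we can compute these iteratively until convergence:

1. Initialize all hub/authority scores to 1.0
2. Loop until convergence
   1. update authority scores, then normalize them
   2. update hub scores, then normalize them

Normalizing after each update keeps the scores from growing without bound; only their relative sizes matter.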
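
The steps above can be sketched on a tiny hand-made graph before turning to crawled data. Everything in this block is an assumption made for illustration: the function name `hits`, the `tol` and `max_iter` parameters, and the four toy pages are hypothetical and do not come from the lecture. The code is written to run on both Python 2 and Python 3.

```python
import math

def normalize(scores):
    """Scale scores in place to unit Euclidean length (no-op if all are zero)."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    if norm > 0:
        for k in scores:
            scores[k] /= norm

def hits(outlinks, tol=1e-8, max_iter=100):
    """Run hubs-and-authorities on a dict: page -> set of pages it links to."""
    pages = set(outlinks)
    for targets in outlinks.values():
        pages |= set(targets)
    # Invert the forward links to obtain the back links.
    inlinks = dict((p, set()) for p in pages)
    for u, targets in outlinks.items():
        for v in targets:
            inlinks[v].add(u)
    hubs = dict((p, 1.0) for p in pages)
    auths = dict((p, 1.0) for p in pages)
    for _ in range(max_iter):
        old_hubs, old_auths = hubs, auths
        # a(u) = sum of hub scores of the pages linking to u
        auths = dict((u, sum(old_hubs[v] for v in inlinks[u])) for u in pages)
        normalize(auths)
        # h(u) = sum of authority scores of the pages u links to
        hubs = dict((u, sum(auths[v] for v in outlinks.get(u, ()))) for u in pages)
        normalize(hubs)
        # Stop once the scores have (nearly) stopped changing.
        change = sum(abs(hubs[p] - old_hubs[p]) + abs(auths[p] - old_auths[p])
                     for p in pages)
        if change < tol:
            break
    return hubs, auths

# Toy graph (made up): two directory pages pointing at two sites.
toy = {'dir1': set(['siteA', 'siteB']), 'dir2': set(['siteA']),
       'siteA': set(), 'siteB': set()}
hubs, auths = hits(toy)
for page in sorted(toy):
    print('%s hub=%.3f authority=%.3f' % (page, hubs[page], auths[page]))
```

In this toy graph `siteA` should end up with the highest authority score because both directories link to it, and `dir1` should be the strongest hub because it links to both sites.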

Let's try this out on some real data:

In [16]:
# See http://stackoverflow.com/questions/1657570/google-search-from-a-python-app
import json
import urllib

def clean_url(url):
    """Strip one trailing slash so that equivalent urls compare equal."""
    if url.endswith('/'):
        return url[:-1]
    else:
        return url
    
def search_google(query_str):
    """Return the set of result urls for query_str."""
    query = urllib.urlencode({'q': query_str})
    # rsz=4 asks for four results per request.
    url = 'http://ajax.googleapis.com/ajax/services/search/web?v=1.0&%s&rsz=4' % query
    search_response = urllib.urlopen(url)
    search_results = search_response.read()
    results = json.loads(search_results)
    data = results['responseData']
    hits = data['results']
    return set([clean_url(h['url']) for h in hits])

In [17]:
# Get some search results.
urls = search_google('chicago technical universities')
print 'top', len(urls), 'results:\n', '\n'.join(urls)

top 4 results:
http://en.wikipedia.org/wiki/List_of_colleges_and_universities_in_Chicago
http://www.engineering.uic.edu
http://web.iit.edu
http://colleges.usnews.rankingsandreviews.com/best-colleges/iit-1691


In [19]:
from urlparse import urlparse
def get_domain(s):
    """Return the last two labels of the url's host, joined by '_' (e.g. 'uic_edu')."""
    return '_'.join(urlparse(s).netloc.split('.')[-2:])
print get_domain('http://colleges.usnews.rankingsandreviews.com/best-colleges/iit-1691')

rankingsandreviews_com
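
Keeping only the last two host labels is a heuristic, and it is worth seeing where it gives a surprising answer. The sketch below repeats `get_domain` with an import that works on both Python 2 and Python 3; the `bbc.co.uk` url is an example chosen for illustration and is not part of the crawl.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2, as used in this notebook

def get_domain(s):
    """Same heuristic as above: last two labels of the host, joined by '_'."""
    return '_'.join(urlparse(s).netloc.split('.')[-2:])

print(get_domain('http://www.engineering.uic.edu'))  # uic_edu
print(get_domain('http://www.bbc.co.uk/news'))       # co_uk
```

Under this heuristic every site below `.co.uk` maps to the same key, so any code that compares these keys would treat unrelated sites as one domain.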


In [20]:
# Download links for each url. Store inlinks/outlinks for each page in original set.
from collections import defaultdict
import requests
from BeautifulSoup import BeautifulSoup


def crawl(toprocess, depth=2):
    """ Crawl pages from seed set, following only http: links that leave the page's domain."""
    processed = set()
    inlinks = defaultdict(set)   # url -> set of inlinks
    outlinks = defaultdict(set)  # url -> set of outlinks

    for i in range(depth):  # depth to crawl
        toprocess = toprocess - processed  # not in place, so the caller's set is untouched
        new_urls = set()
        for url in toprocess:
            print 'url=', url
            domain = get_domain(url)
            soup = BeautifulSoup(requests.get(url).text)
            processed.add(url)
            links = set([clean_url(a['href']) for a in soup.findAll('a')
                         if a.get('href') and
                         a['href'][:5] == 'http:' and         # plain http only; https links are skipped
                         get_domain(a['href']) != domain])    # skip links that stay on the same domain
            print 'found %d links' % len(links)
            outlinks[url] = links
            for link in links:
                inlinks[link].add(url)
            # Add links to be processed: at most 20 new ones per page, in arbitrary (set) order.
            links = [l for l in links if l not in processed and l not in toprocess][:20]
            new_urls |= set(links)
        toprocess = new_urls
    return inlinks, outlinks
    
inlinks, outlinks = crawl(urls)

url= http://en.wikipedia.org/wiki/List_of_colleges_and_universities_in_Chicago
found 2 links
url= http://www.engineering.uic.edu
found 3 links
url= http://web.iit.edu
found 6 links
url= http://colleges.usnews.rankingsandreviews.com/best-colleges/iit-1691
found 37 links
url= http://travel.usnews.com/cruises
found 3 links
url= http://bestlawfirms.usnews.com/rankings.aspx
found 6 links
url= http://iit.bncollege.com/webapp/wcs/stores/servlet/BNCBHomePage?storeId=45055&catalogId=10001&langId=-1
found 8 links
url= http://twitter.com/UICEngineering
found 21 links
url= http://money.usnews.com/funds/mutual-funds
found 5 links




url= http://www.youtube.com/user/UICengineering
found 1 links
url= http://www.universitytechnologypark.com
found 4 links
url= http://health.usnews.com
found 4 links
url= http://www.usnews.com/education/online-education/illinois-institute-of-technology-145725
found 11 links
url= http://www.iitri.org
found 1 links
url= http://mediakit.usnews.com/index.php
found 1 links
url= http://nces.ed.gov/collegenavigator/?s=IL
found 3 links
url= http://beonair.com/about
found 2 links
url= http://twitter.com/#!/usnewseducation
found 0 links
url= http://travel.usnews.com/Rankings
found 8 links
url= http://travel.usnews.com/Hotels
found 6 links
url= http://travel.usnews.com
found 3 links
url= http://www.iitmicrogrid.net
found 0 links




url= http://www.flickr.com/photos/uic_engineering/sets
found 7 links
url= http://money.usnews.com/money/careers
found 5 links
url= http://www.usnews.com
found 3 links
url= http://www.kentlaw.edu
found 20 links
url= http://money.usnews.com/529s
found 3 links
url= http://health.usnews.com/doctors
found 3 links
url= http://health.usnews.com/best-nursing-homes
found 3 links
url= http://www.illinoistechathletics.com
found 17 links
url= http://health.usnews.com/health-news/best-hospitals
found 6 links
url= http://health.usnews.com/health-products
found 4 links
url= http://www.usnews.com/education/best-colleges/paying-for-college
found 9 links
url= http://www.usnews.com/rss/education
found 0 links
url= http://health.usnews.com/health-insurance
found 4 links


In [21]:
urls = set(inlinks) | set(outlinks)
print '%d total urls' % len(urls)

165 total urls


In [22]:
# Print outlinks.
for url in outlinks:
    if len(outlinks[url]) > 0:
        print '\n', url, '->\n', '\n'.join(outlinks[url])


http://travel.usnews.com/cruises ->
http://usnews.rankingsandreviews.com/cars-trucks/used-cars
http://usnews.rankingsandreviews.com/cars-trucks
http://usnews.rankingsandreviews.com/cars-trucks/rankings

http://www.usnews.com/education/best-colleges/paying-for-college ->
http://www.usnewsuniversitydirectory.com/masters-mba.aspx
http://www.usnewsuniversitydirectory.com/bachelors.aspx
http://usnews.rankingsandreviews.com/cars-trucks
http://www.usnewsuniversitydirectory.com/Colleges-Universities/financialaid/?mcid=53158
http://usnews.rankingsandreviews.com/cars-trucks/rankings
http://www.usnewsuniversitydirectory.com/certificates.aspx
http://www.usnewsuniversitydirectory.com
http://www.usnewsuniversitydirectory.com/associates.aspx
http://usnews.rankingsandreviews.com/cars-trucks/used-cars

http://money.usnews.com/funds/mutual-funds ->
http://www.interactivedata.com
http://usnews.rankingsandreviews.com/cars-trucks/used-cars
http://usnews.rankingsandreviews.com/cars-trucks/rankings
http://u

In [24]:
for url in outlinks:
    print url, len(outlinks[url])

http://travel.usnews.com/cruises 3
http://www.usnews.com/education/best-colleges/paying-for-college 9
http://money.usnews.com/funds/mutual-funds 5
http://twitter.com/UICEngineering 21
http://beonair.com/about 2
http://health.usnews.com 4
http://www.usnews.com/education/online-education/illinois-institute-of-technology-145725 11
http://www.iitri.org 1
http://health.usnews.com/best-nursing-homes 3
http://nces.ed.gov/collegenavigator/?s=IL 3
http://travel.usnews.com/Hotels 6
http://twitter.com/#!/usnewseducation 0
http://travel.usnews.com/Rankings 8
http://travel.usnews.com 3
http://www.iitmicrogrid.net 0
http://www.flickr.com/photos/uic_engineering/sets 7
http://money.usnews.com/money/careers 5
http://www.usnews.com 3
http://www.kentlaw.edu 20
http://money.usnews.com/529s 3
http://health.usnews.com/doctors 3
http://www.usnews.com/rss/education 0
http://health.usnews.com/health-news/best-hospitals 6
http://health.usnews.com/health-products 4
http://mediakit.usnews.com/index.php 1
http://h

In [25]:
# Print inlinks
for url in inlinks:
    if len(inlinks[url]) > 0:
        print '\n', url, '<-\n', '\n'.join(inlinks[url])


http://plus.google.com/+usnewsworldreport <-
http://bestlawfirms.usnews.com/rankings.aspx

http://t.co/XeBvC03gDZ <-
http://twitter.com/UICEngineering

http://www.theuscaa.com/landing/index <-
http://www.illinoistechathletics.com

http://www.iubenda.com/privacy-policy/264605 <-
http://www.iitri.org

http://forms.bncollegemail.com/email/form2.htm <-
http://iit.bncollege.com/webapp/wcs/stores/servlet/BNCBHomePage?storeId=45055&catalogId=10001&langId=-1

http://www.prestosports.com <-
http://www.illinoistechathletics.com

http://www.facebook.com/pages/University-Technology-Park-at-IIT/265227113561775 <-
http://www.universitytechnologypark.com

http://beonair.com/about <-
http://en.wikipedia.org/wiki/List_of_colleges_and_universities_in_Chicago

http://health.usnews.com <-
http://colleges.usnews.rankingsandreviews.com/best-colleges/iit-1691

http://www.usnews.com/education/online-education/illinois-institute-of-technology-145725 <-
http://colleges.usnews.rankingsandreviews.com/best-colleg

In [28]:
print 'iit'
print '\n'.join(inlinks['http://web.iit.edu'])
print 'uic'
print '\n'.join(inlinks['http://www.engineering.uic.edu'])

iit
http://www.illinoistechathletics.com
uic



In [29]:
# Initialize hubs and authorities scores and print.
hubs = dict([(u, 1.0) for u in urls])
authorities = dict([(u, 1.0) for u in urls])
def print_top(hubs, authorities, n=5):
    """Print the n highest-scoring hubs and authorities."""
    def top(scores):
        ranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)[:n]
        return '\n'.join('%s %.6f' % (u, score) for u, score in ranked)
    print 'Top hubs\n', top(hubs)
    print 'Top authorities\n', top(authorities)
    print

print_top(hubs, authorities)

Top hubs
http://plus.google.com/+usnewsworldreport 1.000000
http://t.co/XeBvC03gDZ 1.000000
http://www.theuscaa.com/landing/index 1.000000
http://www.iubenda.com/privacy-policy/264605 1.000000
http://forms.bncollegemail.com/email/form2.htm 1.000000
Top authorities
http://plus.google.com/+usnewsworldreport 1.000000
http://t.co/XeBvC03gDZ 1.000000
http://www.theuscaa.com/landing/index 1.000000
http://www.iubenda.com/privacy-policy/264605 1.000000
http://forms.bncollegemail.com/email/form2.htm 1.000000



In [30]:
# Update hub and authority scores.
import math

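def update(hubs, authorities, inlinks, outlinks):
    """One round of updates: authorities first, then hubs, normalizing after each.

    Note that += adds the new sum to the previous score, whereas the equations
    for h(u) and a(u) above replace the score with the sum.
    """
    for u in authorities:
        authorities[u] += sum([hubs[inlink] for inlink in inlinks[u]])
    normalize(authorities)
    for u in hubs:
        hubs[u] += sum([authorities[outlink] for outlink in outlinks[u]])
    normalize(hubs)

def normalize(d):
    """Scale the values of d in place so the score vector has unit Euclidean length."""
    norm = math.sqrt(sum([v * v for v in d.values()]))
    for k in d:
        d[k] /= norm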
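
The `update` function above adds each new sum to the score from the previous round. For comparison, here is a minimal sketch of a variant that overwrites the scores, which is what the equations for $h(u)$ and $a(u)$ describe. The name `update_replace` and the two-page toy graph are hypothetical, introduced only for this illustration, and running this variant on the crawled data would not reproduce the outputs shown in this notebook.

```python
import math

def normalize(d):
    """Scale the values of d in place to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in d.values()))
    if norm > 0:  # guard against a graph with no links at all
        for k in d:
            d[k] /= norm

def update_replace(hubs, authorities, inlinks, outlinks):
    """Hypothetical variant of update(): overwrite each score with the new sum."""
    for u in authorities:
        authorities[u] = sum(hubs[v] for v in inlinks.get(u, ()))
    normalize(authorities)
    for u in hubs:
        hubs[u] = sum(authorities[v] for v in outlinks.get(u, ()))
    normalize(hubs)

# Toy graph (made up): p links to q, and nothing links to p.
toy_outlinks = {'p': set(['q']), 'q': set()}
toy_inlinks = {'p': set(), 'q': set(['p'])}
toy_hubs = {'p': 1.0, 'q': 1.0}
toy_auths = {'p': 1.0, 'q': 1.0}
update_replace(toy_hubs, toy_auths, toy_inlinks, toy_outlinks)
print(toy_auths['p'])  # 0.0: p has no inlinks, so it gets no authority
print(toy_hubs['q'])   # 0.0: q has no outlinks, so it is not a hub
```

With the accumulating `+=` form, a page with no inlinks keeps part of its initial authority score instead of dropping to zero, which is one visible difference between the two variants.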

In [31]:
update(hubs, authorities, inlinks, outlinks)
print_top(hubs, authorities)

Top hubs
http://colleges.usnews.rankingsandreviews.com/best-colleges/iit-1691 0.180879
http://www.usnews.com/education/online-education/illinois-institute-of-technology-145725 0.174946
http://www.usnews.com/education/best-colleges/paying-for-college 0.158629
http://health.usnews.com/health-news/best-hospitals 0.157146
http://travel.usnews.com/Rankings 0.155662
Top authorities
http://usnews.rankingsandreviews.com/cars-trucks 0.464105
http://usnews.rankingsandreviews.com/cars-trucks/rankings 0.439679
http://usnews.rankingsandreviews.com/cars-trucks/used-cars 0.415252
http://www.linkedin.com/company/u.s.-news-&-world-report 0.170986
http://twitter.com/#!/usnewseducation 0.073280



In [32]:
for i in range(10):
    update(hubs, authorities, inlinks, outlinks)
    print_top(hubs, authorities)

Top hubs
http://www.usnews.com/education/online-education/illinois-institute-of-technology-145725 0.288601
http://health.usnews.com/health-news/best-hospitals 0.251467
http://money.usnews.com/money/careers 0.246434
http://www.usnews.com/education/best-colleges/paying-for-college 0.244805
http://colleges.usnews.rankingsandreviews.com/best-colleges/iit-1691 0.241600
Top authorities
http://usnews.rankingsandreviews.com/cars-trucks 0.534201
http://usnews.rankingsandreviews.com/cars-trucks/rankings 0.514471
http://usnews.rankingsandreviews.com/cars-trucks/used-cars 0.489577
http://www.linkedin.com/company/u.s.-news-&-world-report 0.198547
http://twitter.com/#!/usnewseducation 0.074698

Top hubs
http://www.usnews.com/education/online-education/illinois-institute-of-technology-145725 0.296570
http://health.usnews.com/health-news/best-hospitals 0.258423
http://money.usnews.com/money/careers 0.253477
http://www.usnews.com/education/best-colleges/paying-for-college 0.248751
http://health.usnews.

## Expanding the set of urls

- How does restricting to only 20 links from each url limit these results?
- For a given query, begin with the *root* set: the top $k$ matching documents.
- Expand the root set into a *base* set by adding every page that a root page links to (forward links) and every page that links to a root page (back links).
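
The expansion step in the list above can be sketched as follows. This is an illustration under stated assumptions, not the lecture's implementation: `expand_root_set` and its `max_back` parameter are hypothetical names, and the link graph is assumed to be available already as two dicts (in practice, finding back links requires a link index or a search engine that can report them). Capping the number of back links per page is a precaution against very popular pages swamping the set.

```python
def expand_root_set(root, inlinks, outlinks, max_back=None):
    """Return the base set: root pages, pages they link to, and pages linking to them.

    max_back (hypothetical) optionally caps how many back links are added per root page.
    """
    base = set(root)
    for u in root:
        base |= set(outlinks.get(u, ()))   # forward links from u
        back = sorted(inlinks.get(u, ()))  # sorted so the cap is deterministic
        if max_back is not None:
            back = back[:max_back]
        base |= set(back)                  # back links into u
    return base

# Toy link graph (made up): a links to b; c and d link to a.
toy_outlinks = {'a': set(['b']), 'c': set(['a']), 'd': set(['a'])}
toy_inlinks = {'a': set(['c', 'd']), 'b': set(['a'])}
print(sorted(expand_root_set(set(['a']), toy_inlinks, toy_outlinks)))
print(sorted(expand_root_set(set(['a']), toy_inlinks, toy_outlinks, max_back=1)))
```

The first call returns all four pages; the second keeps only one of the two pages linking to `a`.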


When would this help?
