# Information Retrieval

## Creating Documents

The Reuters-21578 files are in [SGML format](http://kdd.ics.uci.edu/databases/reuters21578/README.txt). We need to split them into individual documents. The file used here, `reut2-000.sgm`, should contain 1,000.
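
As a rough illustration of the splitting step, here is a minimal sketch that uses only the standard library on a hypothetical two-article sample (the `sample` string and the regular expressions are assumptions made for illustration, not part of the notebook, which uses BeautifulSoup instead). It also shows how the number of extracted bodies can fall short of the number of articles when an article has no `<BODY>` element.

```python
import re

# Hypothetical two-article sample shaped like the Reuters SGML files.
sample = '''<REUTERS NEWID="1">
<TEXT><TITLE>COCOA REVIEW</TITLE>
<BODY>Showers continued throughout the week.</BODY></TEXT>
</REUTERS>
<REUTERS NEWID="2">
<TEXT TYPE="BRIEF"><TITLE>HEADLINE ONLY</TITLE></TEXT>
</REUTERS>'''

# Each article is wrapped in <REUTERS ...> ... </REUTERS>.
article_re = re.compile(r'<REUTERS\b.*?</REUTERS>', re.DOTALL)
body_re = re.compile(r'<BODY>(.*?)</BODY>', re.DOTALL)

raw_articles = article_re.findall(sample)

# Keep the body text only for articles that actually have one.
bodies = []
for article in raw_articles:
    match = body_re.search(article)
    if match:
        bodies.append(match.group(1))

print(len(raw_articles), len(bodies))
```

A regular expression is enough for this toy input; for the real files a parser such as BeautifulSoup is the more forgiving choice.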

In [66]:
from bs4 import BeautifulSoup
from collections import defaultdict
import re

In [10]:
test = '''Showers continued throughout the week in
the Bahia cocoa zone, alleviating the drought since early
January and improving prospects for the coming temporao,
although normal humidity levels have not been restored,
Comissaria Smith said in its weekly review.
    The dry period means the temporao will be late this year.
    Arrivals for the week ended February 22 were 155,221 bags
of 60 kilos making a cumulative total for the season of 5.93
mln against 5.81 at the same stage last year. Again it seems
that cocoa delivered earlier on consignment was included in the
arrivals figures.
    Comissaria Smith said there is still some doubt as to how
much old crop cocoa is still available as harvesting has
practically come to an end. With total Bahia crop estimates
around 6.4 mln bags and sales standing at almost 6.2 mln there
are a few hundred thousand bags still in the hands of farmers,
middlemen, exporters and processors.
    There are doubts as to how much of this cocoa would be fit
for export as shippers are now experiencing dificulties in
obtaining +Bahia superior+ certificates.'''
# re.split(r'\W+', test)  # quick check of the tokenizer on a sample body

with open('./data/reut2-000.sgm') as f:
    corpus = f.read()


In [37]:
soup = BeautifulSoup(corpus, 'html.parser')
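articles = soup.find_all('body') # finds 925 <body> elements rather than the expected 1000
# Raw string avoids an invalid-escape warning; get_text() also works when a body has several child nodes
documents = [re.split(r'\W+', a.get_text()) for a in articles]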


We now turn each document into a vector of W dimensions, where W is the number of unique words in the corpus. Each dimension corresponds to one word.
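
To make the idea concrete, here is a minimal sketch on a hypothetical two-document corpus (`toy_docs`, `vocab`, `index`, and `count_vector` are illustrative names, not part of the notebook). Each document becomes a list with one slot per vocabulary word.

```python
# Hypothetical toy corpus: each document is already a list of tokens.
toy_docs = [['cocoa', 'bags', 'cocoa'], ['bags', 'export']]

# The vocabulary is the set of unique words; sorting gives a stable order.
vocab = sorted({word for doc in toy_docs for word in doc})
index = {word: i for i, word in enumerate(vocab)}

def count_vector(doc):
    """Return a list of raw word counts, one entry per vocabulary word."""
    vector = [0] * len(vocab)
    for word in doc:
        vector[index[word]] += 1
    return vector

vectors = [count_vector(doc) for doc in toy_docs]
print(vocab)
print(vectors)
```

With three unique words, W is 3 here, so every document maps to a vector of length 3 regardless of its own length.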

In [58]:
words = []
for doc in documents:
    words += doc
words = [w for w in set(words) if len(w) > 0]
numWords = len(words)

print('There are {} unique words over {} documents.'.format(numWords, len(documents)))

There are 11234 unique words over 925 documents.
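
Before running this on the full corpus, it may help to see term frequency and document frequency on a toy example. The sketch below is an assumption-laden illustration: `toy_docs` and `term_frequencies` are hypothetical names, and the inverse document frequency is taken to be `log(N / df)`, one common formulation. The notebook itself never defines its commented-out `idfVector`.

```python
import math
from collections import Counter

# Hypothetical toy corpus: each document is a list of tokens.
toy_docs = [['cocoa', 'bags', 'cocoa'], ['bags', 'export']]

def term_frequencies(doc):
    """Map each word in doc to its count divided by the document length."""
    counts = Counter(doc)
    return {word: count / len(doc) for word, count in counts.items()}

# Document frequency: number of documents containing each word at least once.
doc_freq = Counter(word for doc in toy_docs for word in set(doc))

# Assumed idf definition: log(N / df), where N is the number of documents.
n_docs = len(toy_docs)
idf = {word: math.log(n_docs / df) for word, df in doc_freq.items()}

# Weight each document's term frequencies by the idf of the word.
tf_idf = [
    {word: tf * idf[word] for word, tf in term_frequencies(doc).items()}
    for doc in toy_docs
]
print(doc_freq)
print(tf_idf)
```

Under this definition a word that appears in every document, such as `bags` here, gets a weight of zero, while words unique to one document are weighted most heavily.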


In [67]:
wordToLoc = {words[i]: i for i in range(numWords)} # a word to index lookup

globalCounts = [0 for _ in range(numWords)]

def tfVector(doc):
    vector = [0 for _ in range(numWords)]
    isGloballyCounted = defaultdict(bool) # keeps track of whether we've counted the word towards document frequency
    for word in doc:
        if len(word) == 0:
            continue
        loc = wordToLoc[word]
        vector[loc] += 1
        if not isGloballyCounted[word]:
            globalCounts[loc] += 1
            isGloballyCounted[word] = True
    # normalize raw counts by document length; the original never returned the vector
    return [c / len(doc) for c in vector]

documents = [tfVector(d) for d in documents] # calculate term frequencies while building document frequencies
print(globalCounts)
# documents = [idfVector(d) for d in documents]

[1, 3, 30, 1, 2, 2, 3, 1, 4, 4, 4, 5, 7, 3, 1, 1, 1, 4, 5, 1, 2, 2, 2, 3, 1, 1, 1, 1, 2, 1, 5, 9, 1, 3, 3, 4, 1, 1, 6, 10, 1, 1, 6, 2, 6, 20, 1, 1, 1, 1, 3, 7, 3, 1, 203, 4, 23, 16, 5, 1, 1, 1, 3, 1, 4, 2, 1, 1, 13, 26, 12, 1, 2, 2, 6, 1, 8, 4, 1, 2, 1, 3, 3, 2, 16, 3, 1, 1, 1, 3, 1, 1, 17, 2, 1, 3, 252, 1, 2, 1, 1, 4, 3, 2, 672, 2, 1, 1, 1, 8, 1, 2, 1, 1, 3, 1, 2, 4, 2, 2, 1, 5, 8, 2, 2, 1, 4, 1, 1, 1, 3, 1, 5, 1, 2, 49, 24, 3, 3, 18, 1, 1, 2, 1, 7, 1, 6, 49, 1, 8, 1, 16, 2, 1, 1, 5, 1, 1, 1, 1, 1, 3, 4, 2, 2, 1, 1, 1, 1, 6, 1, 1, 1, 2, 1, 1, 1, 1, 3, 2, 9, 1, 1, 1, 1, 3, 24, 1, 5, 1, 1, 1, 3, 1, 49, 7, 7, 1, 1, 1, 2, 1, 2, 1, 6, 1, 1, 5, 1, 2, 1, 1, 1, 7, 1, 1, 1, 2, 6, 7, 1, 3, 3, 1, 2, 3, 4, 1, 1, 21, 2, 1, 3, 1, 1, 1, 6, 1, 9, 2, 3, 2, 17, 2, 3, 7, 1, 2, 8, 8, 6, 1, 1, 3, 1, 6, 1, 41, 1, 39, 1, 246, 1, 1, 1, 6, 1, 4, 1, 1, 5, 2, 2, 1, 4, 2, 3, 3, 1, 3, 25, 7, 28, 6, 3, 1, 2, 1, 4, 1, 1, 2, 5, 1, 1, 1, 5, 10, 1, 5, 13, 4, 2, 1, 2, 5, 2, 2, 24, 1, 1, 1, 1, 2, 1, 5, 1, 1, 1, 3, 2, 1,