# Week 4: Document Retrieval

### Loading and exploring Wikipedia data

In [1]:
import pandas as pd
import re
from collections import Counter

In [2]:
df = pd.read_csv("data/people_wiki.csv")
df.head()

Unnamed: 0,URI,name,text
0,<http://dbpedia.org/resource/Digby_Morrell>,Digby Morrell,digby morrell born 10 october 1979 is a former...
1,<http://dbpedia.org/resource/Alfred_J._Lewy>,Alfred J. Lewy,alfred j lewy aka sandy lewy graduated from un...
2,<http://dbpedia.org/resource/Harpdog_Brown>,Harpdog Brown,harpdog brown is a singer and harmonica player...
3,<http://dbpedia.org/resource/Franz_Rottensteiner>,Franz Rottensteiner,franz rottensteiner born in waidmannsfeld lowe...
4,<http://dbpedia.org/resource/G-Enka>,G-Enka,henry krvits born 30 december 1974 in tallinn ...


In [3]:
len(df)

59071

In [4]:
# Compute a word count for every document

# Row-wise helper for DataFrame.apply(): returns a Counter of all words in the row's text
def count(row):
    sentence = row['text']
    
    # Convert the sentence to lower case and strip basic punctuation (. , ! ? and " -")
    sentence = re.sub(r'\.|,|!|\?| -', '', sentence.lower())
    
    # Count words in sentence
    word_count = Counter(sentence.split())
    
    return word_count

# Store the per-document word counts in a new column
df['word_count'] = df.apply(count, axis=1)
df

Unnamed: 0,URI,name,text,word_count
0,<http://dbpedia.org/resource/Digby_Morrell>,Digby Morrell,digby morrell born 10 october 1979 is a former...,"{'digby': 1, 'morrell': 5, 'born': 1, '10': 1,..."
1,<http://dbpedia.org/resource/Alfred_J._Lewy>,Alfred J. Lewy,alfred j lewy aka sandy lewy graduated from un...,"{'alfred': 1, 'j': 1, 'lewy': 3, 'aka': 1, 'sa..."
2,<http://dbpedia.org/resource/Harpdog_Brown>,Harpdog Brown,harpdog brown is a singer and harmonica player...,"{'harpdog': 2, 'brown': 2, 'is': 7, 'a': 7, 's..."
3,<http://dbpedia.org/resource/Franz_Rottensteiner>,Franz Rottensteiner,franz rottensteiner born in waidmannsfeld lowe...,"{'franz': 1, 'rottensteiner': 3, 'born': 1, 'i..."
4,<http://dbpedia.org/resource/G-Enka>,G-Enka,henry krvits born 30 december 1974 in tallinn ...,"{'henry': 1, 'krvits': 1, 'born': 1, '30': 1, ..."
...,...,...,...,...
59066,<http://dbpedia.org/resource/Olari_Elts>,Olari Elts,olari elts born april 27 1971 in tallinn eston...,"{'olari': 2, 'elts': 3, 'born': 1, 'april': 1,..."
59067,<http://dbpedia.org/resource/Scott_F._Crago>,Scott F. Crago,scott francis crago born july 26 1963 twin bro...,"{'scott': 1, 'francis': 1, 'crago': 5, 'born':..."
59068,<http://dbpedia.org/resource/David_Cass_(footb...,David Cass (footballer),david william royce cass born 27 march 1962 in...,"{'david': 1, 'william': 1, 'royce': 1, 'cass':..."
59069,<http://dbpedia.org/resource/Keith_Elias>,Keith Elias,keith hector elias born february 3 1972 in lac...,"{'keith': 1, 'hector': 1, 'elias': 4, 'born': ..."
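
To make the cleaning step easier to follow, here is a minimal sketch that applies the same regular expression to a single made-up sentence. The helper name `count_words` and the sample sentence are assumptions for illustration and are not part of the notebook above.

```python
import re
from collections import Counter

def count_words(sentence):
    """Hypothetical helper mirroring the cleaning step used in count()."""
    # Lower-case the text and strip . , ! ? and " -" (same pattern as above)
    cleaned = re.sub(r'\.|,|!|\?| -', '', sentence.lower())
    # Split on whitespace and count each word
    return Counter(cleaned.split())

# Made-up sample sentence, not taken from the dataset
counts = count_words("Hello, world! Hello - again.")
print(counts)  # Counter({'hello': 2, 'world': 1, 'again': 1})
```

Note that this pattern only covers a handful of punctuation marks; characters such as semicolons or parentheses would stay attached to their words.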


### Computing TF-IDFs
To compute TF-IDF weights, we need a normalizer that depends on the entire corpus, not just on a single document.

We therefore need word counts across the entire corpus, in particular the number of documents that contain each word.

![1_1pTLnoOPJKKcKIcRi3q0WA.jpeg](attachment:1_1pTLnoOPJKKcKIcRi3q0WA.jpeg)

Note: TF is short for term frequency, and IDF is short for inverse document frequency. *df* is the number of documents containing term *t*, and *N* is the total number of documents.
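
The definitions in the note can be sketched on a toy corpus as follows. This assumes the common weighting tf × log(N / df) with the natural logarithm; other variants (smoothing, different log bases) exist, and the formula in the image may differ. The corpus and the names `docs`, `doc_freq`, and `tf_idf` are made up for illustration.

```python
import math
from collections import Counter

# Toy corpus standing in for the 'text' column (made-up data)
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cats and the dogs",
]

# Term frequency: raw word counts per document
word_counts = [Counter(doc.split()) for doc in docs]

# Document frequency: number of documents containing each term
doc_freq = Counter()
for counts in word_counts:
    doc_freq.update(counts.keys())  # each document contributes at most 1 per word

n_docs = len(docs)  # N in the note above

def tf_idf(counts):
    """Weight each word by tf * log(N / df) (assumed formulation)."""
    return {
        word: tf * math.log(n_docs / doc_freq[word])
        for word, tf in counts.items()
    }

tfidf = [tf_idf(counts) for counts in word_counts]
print(tfidf[0])
```

Under this weighting, a word that appears in every document (here "the") gets a weight of zero, while a word unique to one document (here "cat") gets the largest weight per occurrence.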