# Session 4: TF-IDF

Now that we know a little more about how different kinds of Python objects work, we're ready to do something more complicated. Since our original impetus was to think about linguistic distinctiveness, we can turn our attention to the most traditional method for finding out whether a term is *unusual* in a corpus. **TF-IDF**, which stands for term frequency--inverse document frequency, is a classic method for measuring whether one text's use of a word is distinctive compared to its use across a given corpus.

There is a great [Programming Historian tutorial](https://programminghistorian.org/en/lessons/analyzing-documents-with-tfidf#how-to-run-it-in-python-3) on this topic. The author of that tutorial uses a ready-made tool from a library called scikit-learn to run the TF-IDF algorithm. We are not going to do that, as it skips over many of the Python skills we are trying to learn.

Instead, we'll create our own functions for determining TF-IDF, based on the descriptions of the algorithm in that tutorial.

*n.b. The dataset we're using---a few Spenser texts from EEBO-TCP and Paradise Lost---is not ideal for this exercise. Ideally you'd want a large corpus of different authors, to get a sense of how a particular text fits within the larger lexicon. But we just want to practice our Python skills, so this smaller set of texts works well for us. Except for the first step of getting basic wordcounts, the rest of the code would be the same regardless of the kinds of files you begin with.*

In [2]:
# First we import all the libraries we'll need

import csv, glob, math
from collections import Counter

In [5]:
# Get all of our filenames
filenames = glob.glob('data/tfidf_texts/*')

all_wordcounts = {} # We'll put the results in a dictionary

# Now we loop through, open each one, and count the regularized tokens.
for filename in filenames:
    # Let's make a clean version of the file name to help us keep track
    clean_filename = filename.split('/')[-1].split('.')[0]
    with open(filename, 'r') as csvfile:
        reader = csv.reader(csvfile, delimiter="\t") # Create a reader object
        # We just want the regs, so we only need to loop once, but let's get rid of punctuation and capitalization
        punct = list(".,!?():;")
        reg_tokens = [row[3].lower() for row in reader if row[3] not in punct]
        counted_reg_tokens = Counter(reg_tokens) # Now we can count them
        # Finally, let's put them in our dictionary by their cleaned filename
        all_wordcounts[clean_filename] = counted_reg_tokens
        
print(list(all_wordcounts.items())[0])

('am_ep', Counter({'the': 496, 'and': 439, 'to': 377, 'her': 298, 'of': 290, 'that': 282, 'in': 231, 'with': 224, 'my': 208, 'i': 170, 'but': 152, 'you': 142, 'which': 141, 'all': 134, 'is': 126, 'for': 122, 'so': 122, 'your': 113, 'it': 101, 'do': 99, 'she': 99, 'me': 96, 'does': 94, 'a': 94, 'his': 93, 'be': 93, 'sonnet': 89, 'then': 79, 'love': 77, 'their': 76, 'when': 75, 'as': 66, 'may': 64, 'fair': 64, 'not': 63, 'like': 62, 'will': 59, 'shall': 58, 'let': 57, 'this': 54, 'on': 54, 'did': 53, 'sweet': 50, 'self': 50, 'or': 49, 'more': 46, 'they': 45, 'now': 43, 'eyes': 43, 'by': 41, 'at': 40, 'make': 40, 'from': 38, 'day': 38, 'nor': 37, 'thy': 37, 'one': 37, 'heart': 36, 'yet': 36, 'unto': 36, 'he': 36, 'long': 35, 'such': 33, 'have': 33, 'no': 32, 'can': 32, 'ne': 32, 'them': 31, 'are': 30, 'most': 30, 'thou': 29, 'him': 29, 'how': 28, 'was': 28, 'sing': 28, 'goodly': 27, 'see': 26, 'those': 26, 'light': 26, 'if': 26, 'woods': 25, 'ring': 25, 'still': 24, 'through': 24, 'whose'
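To see what the counting step does on a small scale, here is a minimal sketch that uses a few made-up rows in place of a real file. The rows, and the assumption that the regularized token sits in the fourth column, are for illustration only.

```python
from collections import Counter

# Hypothetical rows standing in for one tab-separated file;
# as in the cell above, the regularized token is assumed to be at index 3
toy_rows = [
    ['1', 'The', 'the', 'The'],
    ['2', 'loue', 'love', 'love'],
    ['3', ',', ',', ','],
    ['4', 'the', 'the', 'the'],
    ['5', 'Loue', 'love', 'Love'],
    ['6', '.', '.', '.'],
]

punct = list(".,!?():;")
# Keep the lowercased token unless it is a punctuation mark
toy_tokens = [row[3].lower() for row in toy_rows if row[3] not in punct]
toy_counts = Counter(toy_tokens)

print(toy_counts)
```

Lowercasing folds `The` and `the` into one entry, and the two punctuation rows never reach the `Counter`.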

## That takes care of the term frequency

**Term frequency** is just another name for the wordcount: the number of times a particular word appears in a document. **Inverse document frequency** is a bit more complicated. Lavin (our ProgHist author) correctly tells us that it can take different forms. The form we'll use takes the number of documents plus one, divides that by the number of documents in which the word appears, takes the natural log of the result, and then adds one. The easiest way to do this for each word is to write the calculation as a function.
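Written out as an equation, the description above reads as follows. The symbols are introduced here for illustration: $N$ is the total number of documents, $\mathrm{df}(t)$ is the number of documents containing the term $t$, and $\mathrm{tf}(t, d)$ is the count of $t$ in document $d$.

$$
\mathrm{idf}(t) = \ln\left(\frac{N + 1}{\mathrm{df}(t)}\right) + 1
$$

$$
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \times \mathrm{idf}(t)
$$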

In [20]:
# We'll create a function with two arguments: the total number of documents
# and the number of documents in which a word appears
def calculate_idf(number_of_documents, document_frequency):
    # This equation is identical to the one in ProgHist
    idf = math.log((number_of_documents+1)/document_frequency) + 1
    return idf
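As a quick sanity check, here is a small worked example of the same formula on a hypothetical corpus of four documents. The function is repeated so the snippet runs on its own, and the document counts are made up.

```python
import math

def calculate_idf(number_of_documents, document_frequency):
    # Same formula as the function above
    return math.log((number_of_documents + 1) / document_frequency) + 1

# Hypothetical corpus of 4 documents
idf_everywhere = calculate_idf(4, 4)  # word appears in all 4 documents
idf_rare = calculate_idf(4, 1)        # word appears in only 1 document

print(round(idf_everywhere, 3))  # 1.223
print(round(idf_rare, 3))        # 2.609
```

A word found in every document gets a low weight and a word found in only one gets a higher weight, so the same raw count scores higher for the rarer word.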

In [21]:
# First we need to know the number of documents
number_of_documents = len(all_wordcounts)

# In order to get the document frequency, we need to know what tokens 
# are in what documents. We already have that, but we should flatten it out
# into list form.

# This is a bit complex, so we'll talk through it together.
all_tokens = [list(wordcounts.keys()) for wordcounts in all_wordcounts.values()]

# Now we want to loop through every word in every document, calculating as we go

all_tfidf = {} # A dictionary to store our result
for filename, wordcounts in all_wordcounts.items():
    all_tfidf[filename] = {} # A separate sub-dictionary for each termset
    for word, tf in wordcounts.items():
        document_frequency = 0
        for token_list in all_tokens: # Loop through our lists of tokens
            if word in token_list: # If the word is there, add it to the document frequency
                document_frequency += 1
        # Now we can calculate idf
        idf = calculate_idf(number_of_documents, document_frequency)
        
        # Now we have all the pieces we need!
        # We simply multiply term frequency by idf to get our final result
        tfidf = tf * idf
        
        # Let's put all that into our dictionaries
        all_tfidf[filename][word] = tfidf
        
print(all_tfidf)
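The loop above rescans every token list for every word, which is easy to follow but gets slow on a larger corpus. As one possible alternative (a sketch on made-up data, not the method from the ProgHist tutorial), the document frequencies can be tallied once with a `Counter` and then reused. All of the `toy_` names and counts below are hypothetical.

```python
import math
from collections import Counter

# Hypothetical wordcounts for three tiny "documents"
toy_wordcounts = {
    'doc_a': Counter({'the': 5, 'love': 3}),
    'doc_b': Counter({'the': 4, 'paradise': 3}),
    'doc_c': Counter({'the': 6, 'love': 1}),
}
n_docs = len(toy_wordcounts)

# Tally document frequency: each word counts once per document it appears in
toy_document_frequency = Counter()
for wordcounts in toy_wordcounts.values():
    toy_document_frequency.update(wordcounts.keys())

# Multiply each term frequency by the idf of that word
toy_tfidf = {}
for name, wordcounts in toy_wordcounts.items():
    toy_tfidf[name] = {
        word: tf * (math.log((n_docs + 1) / toy_document_frequency[word]) + 1)
        for word, tf in wordcounts.items()
    }

# Print each document's words, highest score first
for name, scores in toy_tfidf.items():
    print(name, sorted(scores.items(), key=lambda pair: pair[1], reverse=True))
```

In `doc_b`, `paradise` outranks `the` even though it occurs less often, because it appears in only one of the three documents.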



In [30]:
# Now let's put all of this into a csv

# The first thing we need is a list of all words
# We can use a special flattening technique using the sum() function
all_words = list(set(sum(all_tokens, [])))
# We should add an empty string as the first column header (that column will hold the filenames)
all_words.insert(0, '')

# Now we should flatten out our tfidf to include filenames
flattened_tfidf = []
for filename, tfidf_words in all_tfidf.items():
    # Add filename to the dictionary with an empty string as the key
    tfidf_words[''] = filename
    flattened_tfidf.append(tfidf_words)

# Now we can write our CSV
with open('data/all_tfidf.csv', 'w', newline='') as newfile: # Create a new file object (newline='' avoids blank rows on Windows)
    writer = csv.DictWriter(newfile, fieldnames=all_words) # Create a writer object with all words as fieldnames
    writer.writeheader() # Write the header
    writer.writerows(flattened_tfidf) # Write our new list of dictionaries
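Not every word appears in every document, so most rows are missing some of the fieldnames. The sketch below shows how `csv.DictWriter` handles that, writing to an in-memory buffer instead of a file. The scores and names are made up for illustration.

```python
import csv, io

# Hypothetical scores; 'paradise' is missing from doc_a and 'love' from doc_b
toy_rows = [
    {'': 'doc_a', 'the': 6.44, 'love': 5.08},
    {'': 'doc_b', 'the': 5.15, 'paradise': 7.16},
]
toy_fieldnames = ['', 'love', 'paradise', 'the']

buffer = io.StringIO()  # An in-memory stand-in for a file on disk
writer = csv.DictWriter(buffer, fieldnames=toy_fieldnames)
writer.writeheader()
writer.writerows(toy_rows)

print(buffer.getvalue())
```

Words absent from a document come out as empty cells, because `DictWriter` fills missing keys with its `restval` value, which is an empty string by default. Passing `restval=0` when creating the writer would write zeros instead.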