# TFIDF Exercise

### Now that we can run tf-idf with plain Python, we can learn how to run it using other libraries

We'll cover how to run exactly the same process as in the Session 4 notebook, but using scikit-learn (SKLearn) and pandas instead. You'll need to install both from the command line before you begin, for example with `pip install scikit-learn pandas`.

We'll mostly follow the techniques of [the Programming Historian tutorial on tf-idf](https://programminghistorian.org/en/lessons/analyzing-documents-with-tfidf).

Wrapping your head around these libraries takes a little work, but it's ultimately much easier than doing the calculations yourself. As we move to more advanced calculations, working them out by hand would be almost impossible, so it's worth figuring out SKLearn now.

In [1]:
import csv, glob
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer # import only the TfidfVectorizer


In [2]:
# For this to work, we need each text file as a string
# If these were normal text files, this step would be
# very easy: just read each file and put it in a list.

# But because we have MorphAdorner CSVs, we need to first
# get the regularized words and then join them together
# into strings.

filenames = glob.glob('data/tfidf_texts/*')

all_files_as_strings = []
for filename in filenames:
    with open(filename, 'r') as csvfile:
        reader = csv.reader(csvfile, delimiter="\t")
        # If we had regular text files, SKLearn would eliminate punctuation for us
        # But because of spacing issues, it's easier to do this ourselves at the outset
        punct = list(".,!?():;")
        reg_tokens = [row[3].lower() for row in reader if row[3] not in punct]
        # Now we can join the list of tokens back into a string:
        reg_string = ' '.join(reg_tokens)
        all_files_as_strings.append(reg_string)
        
print(all_files_as_strings)

["amoretti and epithalamion written not long since by edmunde spenser printed for william ponsonby 1695. to the right worshipful sir robart needham knight sir to gratulate your safe return from ireland i had nothing so ready nor thought any thing so meet as these sweet conceited sonnets the deed of that weld serving gentleman master edmond spenser whose name sufficiently warranting the worthiness of the work i do more confidently presume to publish it in his absence under your name to whom in my poor opinion the patronage thereof does in some respects properly appertain for beside your judgement and delight in learned poesy this gentle muse for her former perfection long wished for in englande now at the length crossing the seas in your happy company though to your self unknown seems to make choice of you as meetest to give her deserved countenance after her return entertain her then right worshipful in sort best beseeming your gentle mind and her merit and take in worth my good will h
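For comparison, here is a minimal sketch of the "normal text files" case mentioned in the comments above. It assumes UTF-8 plain-text files; the folder, filenames, and sample sentences are made up for illustration and are not part of our corpus. Sorting the filenames is an extra precaution, since `glob` does not promise any particular order.

```python
import glob
import os
import tempfile

# Build a throwaway folder of plain-text files (hypothetical stand-ins for a real corpus)
with tempfile.TemporaryDirectory() as folder:
    samples = {"a.txt": "The quick brown fox.", "b.txt": "Jumps over the lazy dog!"}
    for name, text in samples.items():
        with open(os.path.join(folder, name), "w", encoding="utf-8") as f:
            f.write(text)

    # The plain-text version of the loop above: read each file into one string
    plain_filenames = sorted(glob.glob(os.path.join(folder, "*.txt")))
    plain_strings = []
    for filename in plain_filenames:
        with open(filename, "r", encoding="utf-8") as f:
            plain_strings.append(f.read())

print(plain_strings)
```

With files like these there is no need to strip punctuation or lowercase by hand, because SKLearn's vectorizer does both by default.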

## Now we can use SKLearn to do all the hard counting work for us!

SKLearn follows the pattern of other libraries we've used so far. First we create an instance of a *class*, which in this case represents a **model**. Then we run that **model** on a *dataset*, and it gives us back **output**.
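To make the model, dataset, and output pattern concrete, here is a minimal sketch on a made-up corpus of three short sentences. The names `toy_docs`, `toy_model`, and `toy_output` are only illustrative and have nothing to do with our texts.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# The dataset: a list with one string per document (invented sentences)
toy_docs = ["the cat sat", "the dog sat", "the bird flew"]

# The model: an instance of the TfidfVectorizer class
toy_model = TfidfVectorizer(norm=None)

# The output: one row per document, one column per distinct word
toy_output = toy_model.fit_transform(toy_docs)

print(type(toy_output))               # a SciPy sparse matrix type
print(toy_output.shape)               # (number of documents, number of distinct words)
print(sorted(toy_model.vocabulary_))  # the words the model found
```

The shape is the thing to look at here: three rows for three documents, and one column for each distinct word in the corpus.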

In [3]:
# vectorizer is what we'll call our model
# There are lots of options, all described in the Programming Historian (PH) tutorial
vectorizer = TfidfVectorizer(max_df=.65, min_df=1, stop_words=None, use_idf=True, norm=None)

# transformed_documents are what we call the output
transformed_documents = vectorizer.fit_transform(all_files_as_strings)

# As Lavin mentions in the tutorial, this returns something called a sparse matrix, which stores only the nonzero values.
# To get the full array, we need an extra method: .toarray()

transformed_documents_all = transformed_documents.toarray()
print(transformed_documents_all)

[[0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 1.69314718 0.         0.        ]
 [0.         0.         0.         ... 0.         0.         2.09861229]
 [5.07944154 6.29583687 6.29583687 ... 1.69314718 0.         0.        ]
 [1.69314718 0.         0.         ... 0.         2.09861229 0.        ]]
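The numbers in this array can be checked by hand. With the default `smooth_idf=True`, scikit-learn's documentation gives idf as ln((1 + n) / (1 + df)) + 1, where n is the number of documents and df is the number of documents containing the word. Because we set `norm=None`, each score is simply the raw count multiplied by that idf. The sketch below assumes five documents (the array has five rows); the document frequencies and the count of 3 are inferred from the printed values, not taken from the texts themselves.

```python
import math

def sklearn_idf(n_docs, doc_freq):
    """Smoothed idf as documented for TfidfVectorizer with smooth_idf=True."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_docs = 5  # assumption: one row per document in the array above

idf_two_docs = sklearn_idf(n_docs, 2)  # a word found in 2 of the 5 texts
idf_one_doc = sklearn_idf(n_docs, 1)   # a word found in only 1 text

print(round(idf_two_docs, 8))      # 1.69314718
print(round(idf_one_doc, 8))       # 2.09861229
print(round(3 * idf_two_docs, 8))  # 5.07944154 (the word appears 3 times in one text)
print(round(3 * idf_one_doc, 8))   # 6.29583687
```

So a score of 1.69314718 means "appears once in this text, and in two texts overall", while 6.29583687 means "appears three times here and nowhere else".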


## As you can see, the output is impossible to read!

### But we can reconstruct this using pandas.

Lavin does this a little differently in his tutorial, but because we have a small number of texts, this can all become a single pandas DataFrame.

In [13]:
# First let's clean up the filenames

clean_filenames = [f.split('/')[-1].split('.')[0] for f in filenames]

# Because SKLearn has kept track of the order for us, the rows will be in the same order as this list.

# Pandas lets us convert SKLearn output (in this case, technically a "numpy array") into a DataFrame directly.
# It will ask us for indices (row names) and column names
# We want row names to be the files and column names to be the words

# We already have the filenames, and SKLearn will let us get the words easily

# On scikit-learn versions older than 1.0, use vectorizer.get_feature_names() instead
all_words = vectorizer.get_feature_names_out()

# Now we can make our DataFrame
# The first argument is our data, then we
# can enter columns and an index as named arguments

df = pd.DataFrame(transformed_documents_all, columns=all_words, index=clean_filenames)

# But this has our rows and columns backwards!
# We can fix this with a simple transpose command: .T

df = df.T
df

| | am_ep | paradiselost | fq1596 | fowre_hymnes | axiochus | prothalamion |
|---|---|---|---|---|---|---|
| 10 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2.252763 |
| 1591 | 0.000000 | 0.000000 | 0.000000 | 2.252763 | 0.000000 | 0.000000 |
| 1592 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2.252763 | 0.000000 |
| 1596 | 0.000000 | 0.000000 | 3.119232 | 4.678847 | 0.000000 | 1.559616 |
| 1674 | 0.000000 | 2.252763 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 1695 | 2.252763 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| aaron | 0.000000 | 2.252763 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| aarons | 0.000000 | 2.252763 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| aback | 0.000000 | 0.000000 | 15.769341 | 0.000000 | 0.000000 | 0.000000 |
| aband | 0.000000 | 0.000000 | 2.252763 | 0.000000 | 0.000000 | 0.000000 |
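With words as rows and texts as columns, finding the highest-scoring words for one text is a matter of sorting a single column. Here is a minimal sketch on a hand-built DataFrame; `small_df` is a hypothetical stand-in that copies just a few values from the table above, so the same call should work on the full `df`.

```python
import pandas as pd

# A tiny stand-in for df: words as rows, texts as columns
small_df = pd.DataFrame(
    {
        "fq1596": [3.119232, 15.769341, 2.252763],
        "fowre_hymnes": [4.678847, 0.0, 0.0],
    },
    index=["1596", "aback", "aband"],
)

# Sort one text's column from highest to lowest score and keep the top two words
top_terms = small_df["fq1596"].sort_values(ascending=False).head(2)
print(top_terms)
```

Here "aback" comes out on top for `fq1596`, followed by "1596".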


In [12]:
# Pandas makes it dead simple to write a new csv

df.to_csv('data/sklearn_tfidf.csv')
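When reading a CSV like this back in, it helps to tell pandas that the first column holds the row names. Otherwise pandas treats the words as ordinary data and labels that column `Unnamed: 0`. The sketch below uses a temporary file and a tiny made-up DataFrame (`scores`) rather than our real output file.

```python
import os
import tempfile

import pandas as pd

# A tiny stand-in DataFrame with words as the index
scores = pd.DataFrame(
    {"am_ep": [2.252763, 0.0], "paradiselost": [0.0, 2.252763]},
    index=["1695", "aaron"],
)

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "example_tfidf.csv")  # hypothetical filename
    scores.to_csv(path)

    naive = pd.read_csv(path)                  # words become a regular column
    restored = pd.read_csv(path, index_col=0)  # words become the index again

print(naive.columns.tolist())
print(restored.index.tolist())
```

One thing to watch: if every row label looks like a number, pandas will read the restored index as numbers rather than strings.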