## Counting Words, Part 1: TF/IDF ##

*Based on tutorials by [Matthew Lavin](https://programminghistorian.org/en/lessons/analyzing-documents-with-tfidf) and [Kavita Ganesan](https://kavita-ganesan.com/tfidftransformer-tfidfvectorizer-usage-differences/#.XZVlcOdKhSw)*

Topic modelling can be a powerful technique for identifying the themes that recur in a corpus. But in many cases, just counting words can also tell you a lot. 

To begin, we're going to explore a method called Term Frequency - Inverse Document Frequency (tf-idf). Tf-idf comes up a lot in text analysis projects because it’s both a corpus exploration method and a pre-processing step for many other text-mining measures and models.

The procedure was introduced in a 1972 paper by Karen Spärck Jones under the name “term specificity,” and the basic idea is this:

Instead of representing a term in a document by its raw frequency (its number of occurrences) or its relative frequency (the term count divided by the document length), each term is *weighted* by dividing the term frequency by the number of documents in the corpus containing the word. 

The overall effect of this weighting scheme is to avoid a common problem when conducting text analysis: the most frequently used words in a document are often the most frequently used words in all of the documents. We encountered this very problem in our topic model of the CCP Corpus.

By contrast, the terms with the highest tf-idf scores are the terms that are distinctively frequent in a document when that document is compared to the other documents in the corpus. When you sort by tf-idf score, these distinctive terms rise to the top.
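To make the basic idea concrete, here's a minimal sketch in plain Python. The three-sentence `toy_docs` corpus and the `simple_weights` function are made up for illustration, and the weighting is the bare "term count divided by the number of documents containing the word" version described above, not the exact formula that scikit-learn uses.

```python
from collections import Counter

# a toy corpus, invented for illustration (not from the CCP corpus)
toy_docs = [
    "the convention met in the city",
    "the committee met in the morning",
    "the delegates from the city spoke",
]

tokenized = [doc.split() for doc in toy_docs]

# document frequency: in how many documents does each word appear?
doc_freq = Counter()
for tokens in tokenized:
    doc_freq.update(set(tokens))  # set() so each doc counts a word only once

def simple_weights(tokens):
    """Weight each term by (count in this doc) / (number of docs containing it)."""
    counts = Counter(tokens)
    return {word: counts[word] / doc_freq[word] for word in counts}

# weights for the first document, highest first
weights = simple_weights(tokenized[0])
for word, score in sorted(weights.items(), key=lambda pair: pair[1], reverse=True):
    print(word, round(score, 2))
```

Even though "the" is the most frequent word in the first toy document, "convention" comes out on top, because it appears in only one of the three documents.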

### An Analogy ###
    
If this explanation doesn’t quite resonate, a brief analogy might help. 

Say you've decided to leave campus to get dinner on Buford Highway. Since leaving campus takes a lot of effort (and also, crucially, access to a car), the food better be worth it! That means you'll need to balance two competing goals:

1) The food has to be really tasty; and also, crucially: 
2) If you're going to go all the way out to Buford Highway, it better be something that you can't also get in Emory Village. Otherwise, why go to all the trouble of getting there?!

Or, to give an example involving actual food: you don't want to go all the way out to Buford Highway to get pizza. Even if the pizza on Buford Highway is pretty tasty, you can get pizza anywhere in town. How can you find out what is distinctively tasty on Buford Highway?

If you looked up the Yelp reviews for all the restaurants on Buford Highway and sorted by score, you would get an answer to the question of what's the tastiest. But it still won't help solve the problem of what's *distinctively tasty* on Buford Highway--like hot pot, for example, which is something that you can't get in Emory Village.

So you need a way to tell the difference between what's tasty and what's distinctively tasty. To do so, you need to distinguish between four categories of food. Food that, on Buford Highway, is:

- both tasty and distinctive (e.g. hot pot)
- tasty but not distinctive (e.g. pizza) 
- distinctive but not tasty (e.g. tacos-- tho I'm open to disagreement here)
- neither tasty nor distinctive (e.g. Taco Bell).

These categories are what TF/IDF helps you measure. Term frequencies can be assessed according to the same criteria. A term might be:

- Frequently used in a language like English, and especially frequent (or infrequent) in one document
- Frequently used in a language like English, but used to a typical degree in one document
- Infrequently used in a language like English, but distinctly frequent (or infrequent) in one document
- Infrequently used in a language like English, and used to a typical degree in one document

It's the words that are especially frequent in one document that are most interesting to us, and the ones that TF/IDF helps us identify. To see how, let's turn back to the CCP example.

### TF-IDF: How to do it ### 

As always, to start, we need to perform some pre-processing of our corpus to get it into the format that TF/IDF requires.

For TF/IDF, this format is much simpler than what was required for our topic model. Here, we just need to create a list in which each doc is a single string. 

**Note**: This logic (iterate through all the documents in a corpus, and do something to each text file one at a time) is the same logic we used to write the `iter_docs` function in our last class. You'll get very familiar with writing document and text pre-processing code like this by the end of the semester.

In [17]:
import os

base_dir = "./2019-09-ccp-corpus-0.3/ccprecords/" # NOTE: Your path may be different!!!

all_docs = [] # our list which will store the text of each doc; empty for now

docs = os.listdir(base_dir) # get a list of all the files in the directory

for doc in docs: # iterate through the docs
    if not doc.startswith('.'): # skip hidden files (e.g. .DS_Store); the rest are assumed to be .txt files
        with open(base_dir + doc, "r", encoding="utf-8") as file: # force unicode conversion to keep PCs happy
            text = file.read() # read in the file as a single text string
            all_docs.append(text) # append it to the all_docs list

# lastly, just take a look at the first list item to be sure it worked
all_docs[0]

'THE JERSEY CONVENTION .\n\nWe have received a call for a New Citizens\' State Convention in New Jersey, to meet at New Brunswick, June 3, 1873. Despite the fact that this New Citizens\' convention , is to be held at New Brunswick in New Jersey, we find ourselves altogether unable to appreciate the necessity for it. WE are to be pitied, possibly, but we must confess to the fact that we find ourselves unable to appreciate four-fifths of the political conventions held by the colored men of the country. Without an exception, scarcely, they are simply axe-grinding assemblies. Managed chiefly by men who are unfit to lead, and too ambitious to follow, they but tend to compromise our interests with the nation. Called ostensibly to advance that interest, according to our way of thinking, they greatly retard it. And yet much depends upon the aim in view. Like unto what class of the American people is it purposed to mould the American Negro. That he must receive some mould is plain. A slave in t

You can calculate TF/IDF manually using addition and division. But we're going to use scikit-learn's TF/IDF modules because they have built-in tokenization. 

Note that we're now encountering a *third* library that does tokenization for us. The takeaway in this case is that after iterating through your docs and getting your text into whatever format is required, as in the code cell above, the next step is (usually) to tokenize it.

In [3]:
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer

#instantiate CountVectorizer()
cv=CountVectorizer(stop_words='english') # using stopwords this time
 
# this step returns word counts for the words in your docs
word_count_vector=cv.fit_transform(all_docs)

# check shape
word_count_vector.shape


(147, 24930)

That last line tells us that we have 147 rows, one for each document in the corpus, and 24,930 columns, one for each word (minus single character words, which the tokenizer excludes, as well as the default stopwords, which we've indicated with the `stop_words='english'` parameter above). 
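If you want to see what the tokenizer and the stopword list are doing, it can help to run the same step on a corpus small enough to read. This is only a sketch: the two sentences in `toy_docs` are invented, and the names `toy_cv`, `toy_counts`, and `no_stop_vocab` are just for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer

# two made-up sentences, just for illustration
toy_docs = [
    "We meet in a convention of the citizens",
    "The convention is a committee of the people",
]

# same settings as above: default tokenizer plus English stopwords
toy_cv = CountVectorizer(stop_words='english')
toy_counts = toy_cv.fit_transform(toy_docs)

print(toy_counts.shape)            # (number of docs, number of distinct words kept)
print(sorted(toy_cv.vocabulary_))  # the words that survived
print(toy_counts.toarray())        # one row per doc, one column per word

# even without stopwords, "a" is still missing: the default tokenizer
# only keeps tokens of two or more characters
no_stop_vocab = CountVectorizer().fit(toy_docs).vocabulary_
print('a' in no_stop_vocab)
```

The shape here is two rows by five columns: words like "we," "the," and "of" are dropped as stopwords, and "a" never makes it past the tokenizer in the first place.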

We can also look at the whole vocabulary like this:

In [6]:
cv.vocabulary_

{'jersey': 12967,
 'convention': 5702,
 'received': 18547,
 'new': 15557,
 'citizens': 4668,
 'state': 21273,
 'meet': 14673,
 'brunswick': 3731,
 'june': 13101,
 '1873': 307,
 'despite': 6960,
 'fact': 9219,
 'held': 11228,
 'altogether': 1699,
 'unable': 23119,
 'appreciate': 2014,
 'necessity': 15469,
 'pitied': 17090,
 'possibly': 17392,
 'confess': 5341,
 'fifths': 9535,
 'political': 17286,
 'conventions': 5706,
 'colored': 4970,
 'men': 14712,
 'country': 5929,
 'exception': 8933,
 'scarcely': 19922,
 'simply': 20628,
 'axe': 2593,
 'grinding': 10726,
 'assemblies': 2282,
 'managed': 14263,
 'chiefly': 4541,
 'unfit': 23323,
 'lead': 13506,
 'ambitious': 1723,
 'follow': 9744,
 'tend': 22261,
 'compromise': 5225,
 'interests': 12597,
 'nation': 15401,
 'called': 3973,
 'ostensibly': 16175,
 'advance': 1348,
 'according': 1122,
 'way': 24257,
 'thinking': 22422,
 'greatly': 10665,
 'retard': 19236,
 'depends': 6819,
 'aim': 1547,
 'view': 23926,
 'like': 13753,
 'unto': 23573,
 '

The numbers above are the indices for each feature, not the word counts.
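Here's a small sketch of what those indices are for: each one is the column number for that word in the count matrix. The two-document `toy_docs` corpus is invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# made-up documents, just for illustration
toy_docs = [
    "convention convention citizens",
    "citizens of the state convention",
]

toy_cv = CountVectorizer(stop_words='english')
toy_counts = toy_cv.fit_transform(toy_docs)

# vocabulary_ maps each word to its column index, not to its count
col = toy_cv.vocabulary_['convention']
print(col)

# use the index to look up the actual counts
print(toy_counts[0, col])        # occurrences of 'convention' in the first doc
print(toy_counts[:, col].sum())  # occurrences of 'convention' in the whole toy corpus
```

So to get a count you always need two things: a row (which document) and the column index that `vocabulary_` gives you (which word).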

But we can sort the vocabulary like this:

In [18]:
sum_words = word_count_vector.sum(axis=0) # sum_words is a vector that contains
                                            # the sum of each word occurrence in all 
                                            # texts in the corpus. In other words, 
                                            # we are adding the elements for each column of
                                            # the word_count_vector matrix

# then sort the list of tuples that contain the word and their occurrence in the corpus.
words_freq = [(word, sum_words[0, idx]) for word, idx in cv.vocabulary_.items()]

words_freq = sorted(words_freq, key = lambda x: x[1], reverse=True)

# display the top 10
words_freq[:10]

[('convention', 5296),
 ('colored', 4217),
 ('committee', 4183),
 ('people', 3990),
 ('state', 3830),
 ('mr', 3767),
 ('shall', 3069),
 ('men', 2481),
 ('resolved', 2321),
 ('president', 2043)]

We can already see some words that don't seem too distinctive: "convention" and "colored," for example. It's not surprising that those are the most frequently occurring words since the corpus is about the colored conventions. 

So now let's calculate the IDF values so that we can balance them out.

In [19]:
import pandas as pd # this will help us keep track of our data; 
# we'll talk about pandas and dataframes in more detail on Tuesday

# Call tfidf_transformer.fit on the word count vector we computed earlier.
tfidf_transformer=TfidfTransformer(smooth_idf=True,use_idf=True)
tfidf_transformer.fit(word_count_vector)

# print idf values
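# note: get_feature_names_out() needs scikit-learn 1.0 or later; older versions call it get_feature_names()
df_idf = pd.DataFrame(tfidf_transformer.idf_, index=cv.get_feature_names_out(),columns=["idf_weights"])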
 
# sort ascending
df_idf.sort_values(by=['idf_weights'])

| | idf_weights |
|---|---|
| convention | 1.034368 |
| people | 1.034368 |
| state | 1.041385 |
| colored | 1.062738 |
| men | 1.099372 |
| ... | ... |
| imaginative | 5.304065 |
| imagined | 5.304065 |
| imagining | 5.304065 |
| imbibe | 5.304065 |


In the table above, the words at the top are those that appear in the largest number of documents across the corpus, and the words at the bottom are those that appear in the smallest number of documents.



But what are these numbers that we're looking at?

The most direct formula would be **N/df<sub>i</sub>**, where N represents the total number of documents in the corpus, and df<sub>i</sub> is the number of documents in which term *i* appears.

However, many implementations of tf-idf, including scikit-learn, which we are using, normalize the results with additional operations. 

In tf-idf, normalization is generally used in two ways, and for two reasons: first, to prevent bias in term frequency from terms in shorter or longer documents; and second, as above, to calculate each term’s idf value. 


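Scikit-learn’s implementation of tf-idf (with `smooth_idf=True`, as in the code above) adds 1 to both N and df<sub>i</sub>, calculates the natural logarithm of **(N+1)/(df<sub>i</sub>+1)**, and then adds **1** to the final result. For example, a word that appears in only one of our 147 documents gets ln(148/2) + 1 ≈ 5.304, which matches the values at the bottom of the table above.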

![tf-idf](http://lklein.lmc.gatech.edu/wp-content/uploads/2019/10/Screen-Shot-2019-10-02-at-11.52.31-PM.png)
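If you'd like to check this formula for yourself, here's a minimal sketch that computes the IDF values by hand and compares them to scikit-learn's. It assumes the smoothed version that scikit-learn uses when `smooth_idf=True`, which adds 1 to both the document count and the document frequency. The three-document `toy_docs` corpus is invented for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# made-up documents, just for illustration
toy_docs = [
    "convention citizens state",
    "convention committee",
    "convention citizens people",
]

toy_counts = CountVectorizer().fit_transform(toy_docs)

n_docs = toy_counts.shape[0]  # N
# df: for each word (column), the number of docs in which it appears at least once
doc_freq = (toy_counts.toarray() > 0).sum(axis=0)

# the smoothed formula: ln((N + 1) / (df + 1)) + 1
manual_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# scikit-learn's values for the same counts
sk_idf = TfidfTransformer(smooth_idf=True, use_idf=True).fit(toy_counts).idf_

print(manual_idf)
print(sk_idf)
print(np.allclose(manual_idf, sk_idf))  # do the two agree?

# the same arithmetic for a word found in 1 of 147 documents
print(round(np.log(148 / 2) + 1, 6))
```

The last line uses the size of our real corpus (147 documents), and should give the same number you see for the rarest words in the IDF table.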

**Important note!** This is only one way to calculate TF-IDF. There are many, many versions. The number itself isn't important. It's the ranking that the number enables that's most interesting to us. Once you have the IDF values, you can compute the tf-idf scores for any document or set of documents.

So now let’s compute tf-idf scores for the documents in our corpus.

In [20]:
# tf-idf scores
tf_idf_vector=tfidf_transformer.transform(word_count_vector)

And let’s print the tf-idf values of the first document to see if it makes sense. 

Below, we place the tf-idf scores from the first document into a pandas data frame and sort it in descending order of score.

In [21]:
feature_names = cv.get_feature_names_out() # get_feature_names() on scikit-learn versions before 1.0
 
#get tfidf vector for first document
first_document_vector=tf_idf_vector[0]
 
#print the scores for the first doc
df = pd.DataFrame(first_document_vector.T.todense(), index=feature_names, columns=["tfidf"])
df.sort_values(by=["tfidf"],ascending=False).head(10)

| | tfidf |
|---|---|
| mould | 0.58465 |
| survival | 0.163179 |
| don | 0.147537 |
| brunswick | 0.146163 |
| say | 0.142156 |
| new | 0.140073 |
| jersey | 0.120605 |
| lost | 0.120424 |
| receive | 0.110154 |
| politician | 0.109566 |


You can actually do the TF-IDF calculation in one step, so now we're going to do the very same thing again using scikit-learn's all-in-one TF-IDF vectorizer.

In [22]:
from sklearn.feature_extraction.text import TfidfVectorizer 
 
# settings that you use for count vectorizer will go here
tfidf_vectorizer=TfidfVectorizer(stop_words='english', use_idf=True)
 
# just send in all your docs here
tfidf_vectorizer_vectors=tfidf_vectorizer.fit_transform(all_docs)

In [23]:
# as above, get the first vector out (for the first document) to see what it looks like
first_vector_tfidfvectorizer=tfidf_vectorizer_vectors[0]
 
# place tf-idf values in a pandas data frame
df = pd.DataFrame(first_vector_tfidfvectorizer.T.todense(), index=tfidf_vectorizer.get_feature_names_out(), columns=["tfidf"])
df.sort_values(by=["tfidf"],ascending=False).head(10)

| | tfidf |
|---|---|
| mould | 0.58465 |
| survival | 0.163179 |
| don | 0.147537 |
| brunswick | 0.146163 |
| say | 0.142156 |
| new | 0.140073 |
| jersey | 0.120605 |
| lost | 0.120424 |
| receive | 0.110154 |
| politician | 0.109566 |
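The two tables match because the two routes do the same work. If you want to confirm that for a whole matrix rather than by eyeballing one document, here's a quick sketch on an invented three-document corpus (`toy_docs`, `two_step`, and `one_step` are names made up for this example):

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

# made-up documents, just for illustration
toy_docs = [
    "the convention met in new jersey",
    "the committee of the convention resolved",
    "citizens of the state met",
]

# route 1: count the words, then weight the counts
counts = CountVectorizer(stop_words='english').fit_transform(toy_docs)
two_step = TfidfTransformer(smooth_idf=True, use_idf=True).fit_transform(counts)

# route 2: the all-in-one vectorizer
one_step = TfidfVectorizer(stop_words='english', use_idf=True).fit_transform(toy_docs)

# compare every score in the two matrices
print(np.allclose(two_step.toarray(), one_step.toarray()))
```

The two-step version is handy when you also want the raw counts (as we did for the most frequent words above); the one-step version is less code when you only need the tf-idf scores.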


Finally, let's print our TF-IDF vectors and store them to csv files so we can explore the scores using Excel.

In [37]:
base_dir = "./2019-09-ccp-corpus-0.3/ccprecords/"

# make a directory to store them in
# os.mkdir("./tf_idf_output")

docs = os.listdir(base_dir)

csvs = []

for doc in docs:
    if not doc.startswith('.'): # skip hidden files (e.g. .DS_Store); the rest are assumed to be .txt files
        csv = doc.replace(".txt",".csv")
        csvs.append(csv)

# convert the sparse matrix to a regular (dense) numpy array:
# one row per doc, one column per term, with all of the zeros filled in
tfidf_vectors_as_array = tfidf_vectorizer_vectors.toarray()

# loop each item in tfidf_vectors_as_array, 
for counter, doc in enumerate(tfidf_vectors_as_array): # note enumerate. useful! 
    # construct a dataframe
    tf_idf_tuples = list(zip(tfidf_vectorizer.get_feature_names_out(), doc))
    one_doc_as_df = pd.DataFrame.from_records(tf_idf_tuples, columns=['term', 'score']).sort_values(by='score', ascending=False).reset_index(drop=True)

    print("\n Doc " + str(counter) + " top 5 terms: ")
    print(one_doc_as_df.head())
    
    # output to a csv using the enumerated value for the filename
    # one_doc_as_df.to_csv(csvs[counter])





 Doc 0 top 5 terms: 
        term     score
0      mould  0.584650
1   survival  0.163179
2        don  0.147537
3  brunswick  0.146163
4        say  0.142156

 Doc 1 top 5 terms: 
         term     score
0     florida  0.296138
1      county  0.278028
2  conference  0.242936
3       state  0.204529
4      public  0.198948

 Doc 2 top 5 terms: 
       term     score
0     haven  0.249525
1  resolved  0.237184
2     beman  0.201325
3  hartford  0.182991
4      1855  0.169296

 Doc 3 top 5 terms: 
     term     score
0      00  0.308812
1    iowa  0.308045
2  moines  0.281295
3     des  0.259792
4    alex  0.249078

 Doc 4 top 5 terms: 
         term     score
0    nebraska  0.422682
1  convention  0.342318
2     orleans  0.243819
3  williamson  0.194031
4    douglass  0.182945

 Doc 5 top 5 terms: 
           term     score
0            mr  0.362022
1         state  0.210594
2        league  0.191628
3    convention  0.178186
4  pennsylvania  0.170460

 Doc 6 top 5 terms: 
       term 


 Doc 50 top 5 terms: 
      term     score
0   ruffin  0.240711
1  downing  0.229288
2       mr  0.227927
3  colored  0.216491
4   grimes  0.195428

 Doc 51 top 5 terms: 
           term     score
0       comrade  0.580778
1       sailors  0.249652
2  pennsylvania  0.235096
3      soldiers  0.201752
4      underdue  0.193593

 Doc 52 top 5 terms: 
        term     score
0  tennessee  0.356774
1    colored  0.227636
2     rights  0.210291
3   citizens  0.169796
4      civil  0.153688

 Doc 53 top 5 terms: 
        term     score
0     advise  0.386822
1   maryland  0.290741
2   exertion  0.170982
3  dependent  0.170982
4       1866  0.161154

 Doc 54 top 5 terms: 
         term     score
0          mr  0.255193
1  convention  0.174814
2         men  0.168010
3      league  0.160567
4       shall  0.146885

 Doc 55 top 5 terms: 
          term     score
0  leavenworth  0.222077
1        right  0.185493
2        state  0.175548
3        class  0.155554
4     equality  0.151689

 Doc 56 t


 Doc 102 top 5 terms: 
         term     score
0      watson  0.277596
1  convention  0.263644
2          mr  0.229749
3    langston  0.191887
4  resolution  0.176484

 Doc 103 top 5 terms: 
         term     score
0  resolution  0.257054
1  convention  0.241276
2    resolved  0.174905
3     colored  0.164048
4   cleveland  0.157531

 Doc 104 top 5 terms: 
      term     score
0    taney  0.228246
1  watkins  0.189194
2     dred  0.178406
3   garnet  0.175036
4    party  0.163027

 Doc 105 top 5 terms: 
         term     score
0        kagi  0.404292
1          mr  0.311896
2     whipple  0.281167
3      delany  0.245791
4  convention  0.244412

 Doc 106 top 5 terms: 
         term     score
0  centennial  0.235067
1          mr  0.230670
2  convention  0.210930
3     colored  0.204213
4   pinchback  0.198903

 Doc 107 top 5 terms: 
       term     score
0  hamburgh  0.431631
1    butler  0.284860
2   company  0.255909
3    rivers  0.255355
4     drill  0.196196

 Doc 108 top 5 terms:

## Sorting Chunks of Documents ##

You can also sort *chunks* of documents by score. [Here is an example of this from 538](https://fivethirtyeight.com/features/these-are-the-phrases-each-gop-candidate-repeats-most/).

Note that their TF-IDF scores seem wildly different than ours. That's because they haven't normalized their scores. Again, it's not the raw number that matters but the ranking that the scores enable.
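Here's a sketch of that last point. It assumes the kind of normalization in question is the one scikit-learn applies by default (`norm='l2'`, which rescales each document's scores so the row has length 1); we can't know exactly what 538 did. The `toy_docs` corpus is invented for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# made-up documents, just for illustration
toy_docs = [
    "convention convention convention citizens state",
    "committee people state",
    "citizens people people convention",
]

# the same scores, with and without scikit-learn's default normalization
raw = TfidfVectorizer(norm=None).fit_transform(toy_docs).toarray()
normed = TfidfVectorizer(norm='l2').fit_transform(toy_docs).toarray()

print(raw[0].round(3))     # un-normalized scores for the first doc
print(normed[0].round(3))  # normalized scores for the first doc

# rank the terms in the first doc both ways (highest score first)
raw_rank = np.argsort(-raw[0], kind='stable')
normed_rank = np.argsort(-normed[0], kind='stable')
print(np.array_equal(raw_rank, normed_rank))
```

The numbers change, but within a document every score is divided by the same amount, so the order of the terms stays the same.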

## Cosine Similarity ##

Another thing you can do is calculate the cosine similarity between documents. We'll talk more about this in the future, but here's a quick example:

In [38]:
# CALCULATE SIMILARITY TO FIRST DOC 

from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity(tfidf_vectorizer_vectors[0:1], tfidf_vectorizer_vectors)

array([[1.        , 0.06465099, 0.05610237, 0.04711738, 0.06962622,
        0.09625273, 0.08169905, 0.07551052, 0.05405891, 0.10062727,
        0.07003329, 0.07941162, 0.08219579, 0.09451411, 0.04486977,
        0.02077095, 0.02486525, 0.04318454, 0.03951665, 0.10057563,
        0.07303898, 0.14159949, 0.13675211, 0.08527692, 0.07197963,
        0.10856095, 0.03973465, 0.07398188, 0.04941485, 0.0436834 ,
        0.07951614, 0.04175132, 0.03781364, 0.01266459, 0.09214135,
        0.06510835, 0.08726553, 0.05812651, 0.09211778, 0.055983  ,
        0.07041372, 0.08540176, 0.00807353, 0.01479589, 0.10073539,
        0.1284685 , 0.05295548, 0.07301611, 0.06173464, 0.08242942,
        0.07501834, 0.04647728, 0.08177462, 0.04023591, 0.13104857,
        0.09392625, 0.05611693, 0.10734953, 0.06743812, 0.06938436,
        0.06165032, 0.07661391, 0.06796959, 0.05065519, 0.09440454,
        0.01808766, 0.08056023, 0.02029955, 0.05239651, 0.10815004,
        0.02802066, 0.04574778, 0.009497  , 0.01
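Until we come back to this, here's a small sketch of what that function is computing and how you might use the result. Cosine similarity is the dot product of two vectors divided by the product of their lengths. The `toy_docs` corpus is invented, and `manual` and `ranked` are names made up for this example.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# made-up documents, just for illustration
toy_docs = [
    "convention citizens state",
    "convention citizens people",
    "committee railroad",
]

vectors = TfidfVectorizer().fit_transform(toy_docs)

# similarity of the first doc to every doc (including itself)
sims = cosine_similarity(vectors[0:1], vectors).ravel()

# the same thing by hand: dot product divided by the product of the lengths
dense = vectors.toarray()
manual = dense @ dense[0] / (np.linalg.norm(dense, axis=1) * np.linalg.norm(dense[0]))

print(sims.round(3))
print(np.allclose(sims, manual))

# doc indices sorted from most to least similar to the first doc
ranked = np.argsort(-sims)
print(ranked)
```

A score of 1 means the documents use words in identical proportions (every doc scores 1 against itself), and a score of 0 means they share no words at all, like the first and third toy documents here.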

## How to chunk documents by file name ##

In [None]:
base_dir = "./2019-09-ccp-corpus-0.3/ccprecords/"

war_docs = []

pre_1865 = ""
post_1865 = ""

docs = os.listdir(base_dir) # get a list of all the files in the directory

for doc in docs: # iterate through them
    if not doc.startswith('.'): # skip hidden files (e.g. .DS_Store); the rest are assumed to be .txt files
        with open(base_dir + doc, "r", encoding="utf-8") as file: # force unicode conversion for pcs
            text = file.read() # read in the file as a single text string
            
            # HERE IS WHERE THE CHUNKING HAPPENS
            year = doc[:4] # how you access/slice characters in a string
            
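            if int(year) < 1865:
                pre_1865 += text + "\n" # DO A THING: here, add the text to the pre-1865 chunk
            else:
                post_1865 += text + "\n" # DO ANOTHER THING: here, add the text to the post-1865 chunk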
            
