<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/CC_BY.png"><br />

Adapted by [Jen Ferguson](https://library.northeastern.edu/about/library-staff-directory/jen-ferguson) from notebooks created by [Nathan Kelber](http://nkelber.com) and Ted Lawless for [JSTOR Labs](https://labs.jstor.org/) under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/)<br /> See [here](https://docs.tdm-pilot.org/tag/intermediate-lessons/) for the originals and additional text analysis notebooks. 

# About JSTOR/Portico and tdm-pilot.org
*What is this thing I searched when creating my dataset?*

The text mining platform at tdm-pilot.org provides access to:
* nearly all of JSTOR 
* content from selected Portico publishers (currently 40+ publishers)
* content from Chronicling America - historic newspapers
* CORD - a COVID-19 collection
* DocSouth - primary source collection about the history & culture of the American South

Discussions with other providers are underway.

By the numbers: All told, the above includes content from 2000+ publishers, 5500+ journals, and 19 million articles. 

Note: Northeastern has an agreement with JSTOR/Portico, so we have permission to do this.

If you generated your own dataset, the slick interface that you used at tdm-pilot.org is mostly intended to give you a peek into what's in your data. For more detailed work, you can pull that dataset you generated into a Jupyter Notebook. That's what we'll do next.

## Finding significant words within a curated dataset

This notebook demonstrates how to find the significant words in your dataset using a model called TF-IDF. 

*Fun fact: TF-IDF was used in early search engines to do relevance ranking, until clever folks figured out how to break that by 'keyword stuffing'.* 
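
To make the idea concrete, here is a minimal sketch of one common TF-IDF formulation, worked by hand on a made-up three-document corpus. The `toy_docs` data and the `tf_idf` helper are illustrative assumptions, not part of this notebook, and gensim's own weighting and normalization may differ in detail.

```python
import math

# Hypothetical toy corpus: each document is a list of tokens.
toy_docs = [
    ["forest", "camp", "forest", "work"],
    ["camp", "work", "relief"],
    ["work", "wages", "relief"],
]

def tf_idf(term, doc, docs):
    """Score = (count of term in doc) * log2(total docs / docs containing term)."""
    tf = doc.count(term)
    df = sum(1 for d in docs if term in d)
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log2(len(docs) / df)

# Score three terms against the first toy document
for term in ["forest", "camp", "work"]:
    print(term, round(tf_idf(term, toy_docs[0], toy_docs), 3))
```

A word that appears in every document ('work' here) scores zero, while a word concentrated in a single document ('forest') scores highest. That is the sense in which TF-IDF finds "significant" words.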

As you work through this notebook, you'll take the following steps:

* Import a dataset
* Find the query used to build the dataset within the dataset's metadata
* Write a helper function to help clean up a single token
* Clean each document of your dataset, one token at a time
* Use a dictionary of English words to remove words with poor OCR (optical character recognition)
* Compute the most significant words in your corpus using TF-IDF and a library called gensim

**What's a token?**  It's a string of text. For our purposes, think of a token as a single word.

A quick note before we get started. As you work through this notebook you may see a cell or two marked ***'optional'***. These are opportunities for you to try modifying and applying Python code to see what happens. I encourage you to try them, but you can also just run the notebook as written.

First we'll import gensim and the tdm_client library.  The tdm_client library contains functions for connecting to the JSTOR server that contains our corpus dataset.

In [None]:
import warnings
warnings.filterwarnings('ignore')

import gensim

import tdm_client

Next we'll pull in our datasets. 


**Did you build your own dataset?**  In the next code cell, you'll supply the dataset ID provided when you created your dataset. Now's a good time to make sure you have it handy.

**Didn't create a dataset?**  Here are a couple to choose from, with dataset IDs in <font color=red> red </font>:


* Documents published in African American Review, Black American Literature Forum, and Negro American Literature Forum (from JSTOR): <font color=red> b4668c50-a970-c4d7-eb2c-bb6d04313542 </font>


* 'Civilian Conservation Corps' from Chronicling America:<font color=red> 9fa82dbc-9269-6deb-9720-179b4ba5e451</font>



We create a new variable **dataset_id** and set its value to the **dataset ID**, written as a string inside quotation marks. A sample **dataset ID** of data derived from searching JSTOR for 'Civilian Conservation Corps' is provided here ('e2a07be0-39f4-4b9f-b3d1-680bb04dc580'). 

Pasting your unique **dataset ID** in place of the ID in the code cell below will point the notebook at your dataset on the JSTOR server. Keep the ID on a single line, inside the quotation marks. (No output will show.)

In [None]:
dataset_id = '6ef4b79b-73a2-7590-afcd-0b22e64a2a46'
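
A stray space or line break inside the quotation marks is an easy mistake to make when pasting an ID. The sketch below is one way to catch it early. It assumes dataset IDs are always UUID-shaped, which holds for the examples in this notebook but is not something the tdm_client library is known to guarantee. The helper name `clean_dataset_id` is hypothetical.

```python
import uuid

def clean_dataset_id(raw_id):
    """Strip surrounding whitespace and check that the result parses as a UUID."""
    candidate = raw_id.strip()
    uuid.UUID(candidate)  # raises ValueError if the text is not UUID-shaped
    return candidate

# A pasted ID with an accidental trailing line break
print(clean_dataset_id("6ef4b79b-73a2-7590-afcd-0b22e64a2a46\n"))
```

If the pasted text is not a well-formed ID, the call raises a `ValueError` right away instead of failing later when the dataset is requested.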

In [None]:
dataset_info = tdm_client.get_description(dataset_id)

Find the total number of documents in the dataset by looking up `num_documents` in the dataset description. 

In [None]:
dataset_info["num_documents"]

Let's double-check to make sure we have the correct dataset. 
We can look at the original query by viewing the `search_description`.

In [None]:
dataset_info["search_description"]

Now that we've verified that we have the correct corpus/dataset, let's download it.

In [None]:
dataset_file = tdm_client.get_dataset(dataset_id)

Next, let's create a helper function that can standardize and clean up the tokens in it. The function will:

* Change all tokens (aka words) to lower case.  This will make 'Cats' and 'cats' be counted as the same token.
* Discard tokens with non-alphabetical characters
* Discard any tokens fewer than 4 characters in length

*Question to ponder:* Why do you think we want to discard tokens that are fewer than 4 characters long?



In [None]:
def process_token(token): #defines a function `process_token` that takes the argument `token`
    token = token.lower() #changes the token to lower case
    if len(token) < 4: #discards any token that is fewer than 4 characters long
        return None
    if not token.isalpha(): #discards any token with non-alphabetic characters
        return None
    return token #returns the cleaned, lower-case token
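
To see the helper in action, the sketch below repeats the function so that the cell runs on its own, then applies it to a few made-up tokens. The `samples` list is illustrative and not drawn from any dataset.

```python
def process_token(token):
    token = token.lower()       # 'Cats' and 'cats' become the same token
    if len(token) < 4:          # too short: discard
        return None
    if not token.isalpha():     # contains digits or punctuation: discard
        return None
    return token

# Made-up tokens covering each rule
samples = ["Cats", "the", "1935", "C.C.C.", "Forest", "re-enlist"]
for word in samples:
    print(repr(word), "->", process_token(word))
```

Only 'Cats' and 'Forest' survive, as 'cats' and 'forest'. Note that hyphenated words such as 're-enlist' are discarded too, because the hyphen is not an alphabetic character.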

Now let's cycle through each document in the corpus with our helper function.  This may take a while to run; recall that if it's in process, you'll see this: In [ * ]. (No output will show.)

In [None]:
reader = tdm_client.dataset_reader(dataset_file)

#Creates a new variable `documents`, a list that will contain all of our documents.
documents = []

for n, document in enumerate(reader):
    this_doc = []
    _id = document["id"]
    for token, count in document["unigramCount"].items():
        clean_token = process_token(token)
        if clean_token is None:
            continue
        this_doc += [clean_token] * count
    documents.append((_id, this_doc))
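
Each document arrives as a table of word counts rather than as running text, so the loop rebuilds a list of tokens by repeating each cleaned word once per occurrence. Here is a small sketch of that step on a single made-up record. The `toy_document` dictionary and the `clean` helper are hypothetical stand-ins, and real records contain more fields than the two shown.

```python
# Hypothetical stand-in for one record from the dataset reader
toy_document = {
    "id": "example-doc-1",
    "unigramCount": {"Camp": 2, "camp": 1, "of": 7, "1933": 1, "forestry": 3},
}

def clean(token):
    """Same rules as process_token: lower-case, 4+ characters, alphabetic only."""
    token = token.lower()
    return token if len(token) >= 4 and token.isalpha() else None

this_doc = []
for token, count in toy_document["unigramCount"].items():
    clean_token = clean(token)
    if clean_token is None:
        continue
    this_doc += [clean_token] * count  # repeat the token once per occurrence

print(this_doc)
```

'Camp' and 'camp' end up counted together, while 'of' and '1933' are dropped no matter how often they occur.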

In [None]:
dictionary = gensim.corpora.Dictionary([d[1] for d in documents])

In [None]:
bow_corpus = [dictionary.doc2bow(doc) for doc in [d[1] for d in documents]]

In [None]:
model = gensim.models.TfidfModel(bow_corpus)

In [None]:
corpus_tfidf = model[bow_corpus]
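
The four cells above do a lot in a few lines. The gensim dictionary assigns each unique token an integer ID, `doc2bow` turns each document into a "bag of words" (a list of ID and count pairs), and the TF-IDF model converts those counts into weights. The sketch below imitates the first two steps in plain Python on made-up data so you can see the shape of the result. It is not gensim's implementation, and the IDs gensim assigns may differ.

```python
from collections import Counter

# Hypothetical cleaned documents
toy_docs = [
    ["camp", "camp", "forestry"],
    ["forestry", "wages"],
]

# Assign each unique token an integer ID, in order of first appearance
token_to_id = {}
for doc in toy_docs:
    for token in doc:
        token_to_id.setdefault(token, len(token_to_id))

def to_bow(doc):
    """Return sorted (token_id, count) pairs for one document."""
    counts = Counter(token_to_id[token] for token in doc)
    return sorted(counts.items())

toy_bow = [to_bow(doc) for doc in toy_docs]
print(token_to_id)
print(toy_bow)
```

Word order is discarded at this point: each document is reduced to which words it contains and how many times.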

Now that we have those pieces in place, we can run the following code cells to find the most significant terms, by TF-IDF, in our dataset. 

In [None]:
td = {
        dictionary.get(_id): value for doc in corpus_tfidf
        for _id, value in doc
    }
sorted_td = sorted(td.items(), key=lambda kv: kv[1], reverse=True)

In [None]:
for term, weight in sorted_td[:20]:
    print(term, weight)
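
One thing to be aware of: a Python dictionary holds a single value per key, so when a term occurs in several documents, `td` ends up with that term's weight from the last document in which it appears, not its highest weight. If you would rather rank terms by their highest weight in any document, a sketch like the one below is one option. The `toy_tfidf` data and `id_to_term` mapping are made up so that the cell runs without gensim.

```python
# Hypothetical TF-IDF output: one list of (term_id, weight) pairs per document
toy_tfidf = [
    [(0, 0.9), (1, 0.2)],
    [(0, 0.1), (2, 0.5)],
]
id_to_term = {0: "forestry", 1: "camp", 2: "wages"}

# Last value wins, as in a dictionary comprehension
last_seen = {id_to_term[i]: w for doc in toy_tfidf for i, w in doc}

# Alternative: keep the highest weight seen for each term
best = {}
for doc in toy_tfidf:
    for i, w in doc:
        term = id_to_term[i]
        best[term] = max(w, best.get(term, 0.0))

print(last_seen["forestry"], best["forestry"])
```

Which version is more useful depends on your question; neither is the one "correct" way to summarize a term across a corpus.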

Print the most significant word, by TF-IDF, for each of the first 20 documents in the corpus. 

In [None]:
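for n, doc in enumerate(corpus_tfidf):
    if n >= 20: #stops once the first 20 documents have been checked
        break
    if len(doc) < 1: #skips documents with no tokens left after cleaning
        continue
    word_id, score = max(doc, key=lambda x: x[1]) #finds the highest-scoring word
    print(documents[n][0], dictionary.get(word_id), score)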

*Optional:  How would you print the most significant word for the **first 8 documents**? Copy the code block above, paste it into the code block below, and then modify it.*

Want to keep going? [This notebook](https://hub.binder.tdm-pilot.org/user/jasf--tdm-nbs-dxsy8b0z/notebooks/Count%20and%20Visualize.ipynb/) will let you count and visualize some characteristics of the documents in your dataset.