# Project 3a: Goals and Deliverables

The goals of this assignment are:
* To analyze corpora with metadata.
* To make some basic corpus-level visualizations.

Here are the steps you should do to successfully complete this project:
1. From Moodle, accept the assignment. Open and set up a code space (install a Python kernel and select it).
2. Complete the notebook and commit it to GitHub. Make sure to answer all questions, and to commit the notebook in a "run" state!
3. I wrote the comments; you write the code! Complete and run `spacy_on_corpus.py` following the instructions in this notebook.
4. Edit the README.md file. Provide your name, your class year, links to/descriptions of any extensions and a list of resources. 
5. Commit your code often. We will take the last commit before the deadline as your submission of the project.

Possible extensions (from least points to most points):
* Make word count plots for the top 100 words and entities. Look at the labels on the y axis of each plot. Where do you think spaCy is making mistakes?
* Augment the `wordcount` functionality so that it displays relative frequencies of entity label pairs and token part of speech pairs.
* Augment the `wordcloud` functionality so that it also makes an entity cloud.
* Make the bar plots and/or word clouds more beautiful.
* Learn about the useful Python `collections` package, especially the [Counter data type](https://docs.python.org/3/library/collections.html#collections.Counter). Copy `spacy_on_corpus.py` and name the copy `spacy_on_corpus_counter.py`. Change `get_token_counts` and `get_entity_counts` to use counters.
* Add in the analyses from project 2c as functions `make_doc_markdown`, `make_doc_tables` and `make_doc_stats`; make sure to ask the user for a document before running any of these!
* Your other ideas are welcome! If you'd like to discuss one with Dr Stent, feel free.
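
For the `Counter` extension listed above, here is a minimal sketch of how a `Counter` behaves. The token list is made up for illustration; it is not taken from the project corpus, and this is not a drop-in replacement for `get_token_counts`.

```python
from collections import Counter

# Hypothetical token list, standing in for tokens pulled from a spaCy doc
tokens = ["Colby", "College", "is", "a", "college", "in", "Maine", "in", "in"]

counts = Counter(tokens)            # count every item in one pass
counts.update(["Maine", "Colby"])   # add more observations later

print(counts["in"])            # count for a key that was seen
print(counts["missing"])       # absent keys count as zero, no KeyError
print(counts.most_common(1))   # the single most frequent item and its count
```

A `Counter` is a dictionary subclass, so anything you already do with a plain dictionary of counts (looping over `.items()`, looking up a key) still works.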

# Setup

## Install Our Packages

On the command line (in the terminal), type:

% `pip install -r requirements.txt`

## Upload Our Data

From Moodle, download `files.jsonl.zip`. 

Then, upload `files.jsonl.zip` to the code space.

## Make Sure We Can Work With .py Files We Are Editing

Run the code cell below.

In [None]:
# Automatically reload your external source code
%load_ext autoreload
%autoreload 2

# Corpus Metadata

Last week we wrote functions to load and manipulate a corpus consisting of plain text files.

This week we will extend that program to load and manipulate a corpus where each document also has **metadata**.

# Python Dictionaries in Files

There is a format called ['JSON'](https://www.json.org/json-en.html) that looks like a Python dictionary printed out.

If you see a filename that has the extension `.json`, you can think of it as containing a Python dictionary.

If you see a filename that has the extension `.jsonl`, you can think of it as containing a Python dictionary *per line*.

Constellate data sets actually consist of `.jsonl` files. Each line contains the metadata (and, if the fulltext is available, the data) for a single document. So a single file can contain all the data about a whole corpus!


Open the file `test.jsonl` in a text editor (Visual Studio is fine) and take a look at it. How many keys does each dictionary have?

Python has a package for reading `.json` and `.jsonl` files. It is called ... `json`. We will use it from now on to read corpora.
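
To make the "one dictionary per line" idea concrete, here is a small sketch that parses JSONL text held in a string. The records and their keys (`id`, `title`, `pageCount`) are invented for illustration and are not the contents of `test.jsonl`; your `load_jsonl` will need to follow the instructions in `spacy_on_corpus.py`.

```python
import io
import json

# Two made-up records, one JSON object per line, as in a .jsonl file.
# The keys "id", "title" and "pageCount" are illustrative only.
jsonl_text = (
    '{"id": "doc-a", "title": "First", "pageCount": 4}\n'
    '{"id": "doc-b", "title": "Second", "pageCount": 7}\n'
)

records = []
# io.StringIO lets us loop over the string line by line, like an open file
for line in io.StringIO(jsonl_text):
    line = line.strip()
    if line:  # skip blank lines
        records.append(json.loads(line))  # each line becomes one dictionary

print(len(records))
print(sorted(records[0].keys()))
```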

# Testing the Functions in spacy_on_corpus.py

For this project, you will be extending your code in `spacy_on_corpus.py`. You will fill in the functions and test them in this section.

First, we will need a test corpus. I give you one here (the text for each document comes from the Wikipedia page for the named college or university).

In [None]:
# import spacy
import spacy
# import pprint
import pprint

# make a spacy engine
nlp = spacy.load('en_core_web_sm')

# make a corpus
corpus = {'doc1': {'text': 'Colby College is a private liberal arts college in Waterville, Maine. Founded in 1813 as the Maine Literary and Theological Institution, it was renamed Waterville College in 1821. The donations of Christian philanthropist Gardner Colby saw the institution renamed again to Colby University before settling on its current title, reflecting its liberal arts college curriculum, in 1899. Approximately 2,000 students from more than 60 countries are enrolled annually. The college offers 54 major fields of study and 30 minors. Located in central Maine, the 714-acre Neo-Georgian campus sits atop Mayflower Hill and overlooks downtown Waterville and the Kennebec River Valley. Along with fellow Maine institutions Bates College and Bowdoin College, Colby competes in the New England Small College Athletic Conference (NESCAC) and the Colby-Bates-Bowdoin Consortium.'},
          'doc2': {'text': 'Columbia University, officially titled as Columbia University in the City of New York, is a private Ivy League research university in New York City. Established in 1754 as King\'s College on the grounds of Trinity Church in Manhattan, it is the oldest institution of higher education in New York and the fifth-oldest in the United States.'}}

# run spacy on each text in the corpus
for key in corpus:
    corpus[key]['doc'] = nlp(corpus[key]['text'])

# print the corpus


## Test `get_token_counts`

Complete the implementation of `get_token_counts` in `spacy_on_corpus.py`. 

Then, in the code cell below import `spacy_on_corpus` and run `get_token_counts` on the provided corpus.

In [None]:
#import spacy_on_corpus


# call get_token_counts on corpus


The output should start with:
```
[('Colby', 5),
 ('College', 6),
 ('is', 3),
 ('a', 2),
 ('private', 2),
 ('liberal', 2),
 ('arts', 2),
 ('college', 3),
 ('in', 12),
 ('Waterville', 3),
 ('Maine', 4),
 ('Founded', 1),
 ```

This function has some **optional arguments**. Look at the function **signature** (the line that starts with `def`). See that there are two arguments, but one of them has a value assigned to it already (using `=`). That means, if you don't want to say what the tags to exclude should be, you can take the default ones specified in the signature. 
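
Here is a small sketch of how a default argument behaves. The function `count_words` and its parameter `words_to_exclude` are hypothetical stand-ins, not the signature of `get_token_counts`.

```python
def count_words(words, words_to_exclude=("the", "a")):
    """Hypothetical helper: count words, skipping any in words_to_exclude."""
    counts = {}
    for word in words:
        if word in words_to_exclude:
            continue
        counts[word] = counts.get(word, 0) + 1
    return counts

words = ["the", "college", "a", "college"]

# No second argument: the default exclusions from the signature are used
default_counts = count_words(words)

# Second argument given by name: the default is replaced by the empty list
all_counts = count_words(words, words_to_exclude=[])

print(default_counts)
print(all_counts)
```

With the default, only `college` is counted; with the empty list, nothing is excluded, so `the` and `a` are counted too.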

Let's try changing this. Let's make *no* tags excluded.

In the code cell below, run `get_token_counts` on the corpus provided, also specifying `tags_to_exclude = []` (the empty list).

In [None]:
# call get_token_counts with no tags to exclude


The output should start with:
```
[('Colby', 5),
 ('College', 6),
 ('is', 3),
 ('a', 2),
 ('private', 2),
 ('liberal', 2),
 ('arts', 2),
 ('college', 3),
 ('in', 12),
 ('Waterville', 3),
 (',', 9),
 ('Maine', 4),
 ('.', 9),
 ```

Now, referring to [the coarse-grained tag list](https://universaldependencies.org/u/pos/all.html), what do you have to do to make `get_token_counts` *only* give you counts of (proper or regular) nouns, verbs, adjectives and adverbs?

In [None]:
# call get_token_counts excluding all tags but those corresponding to nouns, verbs, adjectives and adverbs


The output should start with:
```
[('Colby', 5),
 ('College', 6),
 ('private', 2),
 ('liberal', 2),
 ('arts', 2),
 ('college', 3),
 ('Waterville', 3),
 ('Maine', 4),
 ('Founded', 1),
 ```

## Test `get_entity_counts`

Complete the implementation of `get_entity_counts` in `spacy_on_corpus.py`. 

Then, in the code cell below import `spacy_on_corpus` and run `get_entity_counts` on the provided corpus.

In [None]:
# import

# call get_entity_counts


The output should start with:
```
[('Colby College', 1),
 ('Waterville', 2),
 ('Maine', 3),
 ('1813', 1),
 ('the Maine Literary and Theological Institution', 1),
 ('Waterville College', 1),
 ('1821', 1),
 ```

Now, referring to [the spaCy model docs](https://spacy.io/models/en), what do you have to do to make `get_entity_counts` *only* give you organizations, persons and locations?

In [None]:
# call get_entity_counts so as to get only organizations, persons and locations


The output should start with:
```
[('Colby College', 1),
 ('Waterville', 2),
 ('Maine', 3),
 ('the Maine Literary and Theological Institution', 1),
 ('Waterville College', 1),
 ```

## Test `reduce_to_top_k`

Complete the implementation of `reduce_to_top_k` in `spacy_on_corpus.py`. 

Then, in the code cell below import `spacy_on_corpus` and run `reduce_to_top_k` on the output of `get_token_counts` on the provided corpus.

In [None]:
# import

# get the token counts on corpus; assign token_counts to the returned result
token_counts = 

# call reduce_to_top_k on token_counts to get the top 5


The output should look like:
```
[('of', 5), ('College', 6), ('and', 7), ('the', 11), ('in', 12)]
```

## Test `load_textfile`

Complete the implementation of `load_textfile` in `spacy_on_corpus.py`. 

Then, in the code cell below import `spacy_on_corpus` and run `load_textfile` on 'colby_college.txt'.

In [None]:
# import

# initialize corpus to the empty dictionary

# call load_textfile

# print corpus


The output should look like:
```
{'colby_college.txt': {'doc': Colby College is a private liberal arts college in Waterville, Maine. Founded in 1813 as the Maine Literary and Theological Institution, it was renamed Waterville College in 1821. The donations of Christian philanthropist Gardner Colby saw the institution renamed again to Colby University before settling on its current title, reflecting its liberal arts college curriculum, in 1899. Approximately 2,000 students from more than 60 countries are enrolled annually. The college offers 54 major fields of study and 30 minors.
 
 Located in central Maine, the 714-acre Neo-Georgian campus sits atop Mayflower Hill and overlooks downtown Waterville and the Kennebec River Valley. Along with fellow Maine institutions Bates College and Bowdoin College, Colby competes in the New England Small College Athletic Conference (NESCAC) and the Colby-Bates-Bowdoin Consortium.
}}
```

## Test `load_compressed`

Complete the implementation of `load_compressed` in `spacy_on_corpus.py`. 

Then, in the code cell below import `spacy_on_corpus` and run `load_compressed` on 'files.zip'. (This may take a little while!)
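
If zip archives are new to you, the sketch below shows how Python's standard `zipfile` module lists and reads the files inside an archive. It builds a tiny archive in memory with a made-up member name, so it does not touch `files.zip`, and it is only one possible approach rather than the required implementation of `load_compressed`.

```python
import io
import zipfile

# Build a tiny archive in memory so the example needs no files on disk.
# The member name "temp/example.txt" is invented for illustration.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("temp/example.txt", "Colby College is in Waterville.")

# Open the archive for reading, list its members, and read the first one
with zipfile.ZipFile(buffer) as archive:
    names = archive.namelist()
    with archive.open(names[0]) as member:
        text = member.read().decode("utf-8")  # members are read as bytes

print(names)
print(text)
```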

In [None]:
# import

# initialize corpus to the empty dictionary

# call load_compressed

# print corpus keys


The output should look like:
```
dict_keys(['temp/ark:__27927_pjb5s37cx32', 'temp/ark:__27927_phx1wcjq0tm', 'temp/ark:__27927_phzmmfj893c', 'temp/ark:__27927_phzkfzqzs41', 'temp/ark:__27927_phzq8c34ggp', 'temp/ark:__27927_pjb3ptfm8xd', 'temp/ark:__27927_phz8qhfbxzm', 'temp/ark:__27927_pjb1wn175cv', 'temp/ark:__27927_phznswfkrxz', 'temp/ark:__27927_pjb65xt4m6r', 'temp/ark:__27927_phzq26wnjzn', 'temp/ark:__27927_phzbjns29gn', 'temp/ark:__27927_phzpdcpvdnb', 'temp/ark:__27927_pjb1z8505hp', 'temp/ark:__27927_phz35174v0z', 'temp/ark:__27927_phzjj6kfdxp', 'temp/ark:__27927_pjb16g9m9r7', 'temp/ark:__27927_pjb1z5xzrx7'])
```

## Test `load_jsonl` (New for project 3a!)

Complete the implementation of `load_jsonl` in `spacy_on_corpus.py`.

Then, in the code cell below import `spacy_on_corpus` and run `load_jsonl` on `test.jsonl`. (This may take a little while!)

In [None]:
# Import 

# call load_jsonl


There should be two entries in the resulting corpus.

## Test `build_corpus` (Modified for project 3a!)

Complete the implementation of `build_corpus` in `spacy_on_corpus.py`. 

Then, in the code cell below import `spacy_on_corpus` and run `build_corpus` on the pattern 'f*.jsonl.zip'. (This may take a little while!)

Note, `build_corpus` _returns a corpus_, so you want to assign that return value to a variable (like `my_corpus`).
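
The `*` in the pattern is a wildcard that matches any run of characters. As a quick illustration, assuming the pattern follows the shell-style rules used by Python's `glob` and `fnmatch` modules, the sketch below checks a few filenames against it. The list of candidate names is made up (only `files.jsonl.zip` and `test.jsonl` appear in this project).

```python
from fnmatch import fnmatch

pattern = "f*.jsonl.zip"

# Candidate filenames; "files.zip" and "final.jsonl.zip" are hypothetical
candidates = ["files.jsonl.zip", "files.zip", "test.jsonl", "final.jsonl.zip"]

# Keep only the names that match the wildcard pattern
matches = [name for name in candidates if fnmatch(name, pattern)]
print(matches)
```

Only names that start with `f` and end with `.jsonl.zip` match.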

In [None]:
# import

# call build_corpus

# print corpus keys


The output should start with:
```
dict_keys(['ark://27927/phz35174v0z', 'ark://27927/phzmmfj893c', 'ark://27927/pjb1wn175cv',
```

Now let's look at one document and its metadata. Run the code cell below to inspect the types of metadata available from Constellate.

In [None]:
# what types of metadata does each document have?
print(my_corpus['ark://27927/pjb1z5xzrx7']['metadata'].keys())

## Test `get_metadata_counts` (New for project 3a!)

Complete the implementation of `get_metadata_counts` in `spacy_on_corpus.py`. 

Then, in the code cell below import `spacy_on_corpus` and run `get_metadata_counts` on the provided corpus using the key 'pageCount'.

In [None]:
# import

# call get_metadata_counts


The output should look like:
```
[(4, 1),
 (7, 3),
 (9, 1),
 (23, 1),
 (21, 1),
 (24, 1),
 (11, 1),
 (22, 2),
 (5, 2),
 (12, 1),
 (32, 1),
 (39, 1),
 (16, 1),
 (37, 1)]
```

## Test `get_basic_statistics` (Modified for project 3a!)

Complete the implementation of `get_basic_statistics` in `spacy_on_corpus.py`. 

Then, in the code cell below import `spacy_on_corpus` and run `get_basic_statistics` on the given corpus.

In [None]:
# import

# call load_compressed

# call get_basic_statistics on my_corpus
spacy_on_corpus.get_basic_statistics(my_corpus)

Your output should look like:
```
Documents: 18

Tokens: 170315

Unique tokens: 19741

Entities: 170315

Unique entities: 7337

Publication year range: 2015-2022

Page count year range: 4-39
```

## Test `plot_word_entity_frequencies`

Complete the implementation of `plot_word_entity_frequencies` in `spacy_on_corpus.py`. 

Then, in the code cell below import `spacy_on_corpus` and run `plot_word_entity_frequencies` on the given corpus.

In [None]:
# import

# call load_compressed

# call plot_word_entity_frequencies on my_corpus


The resulting file `token_counts.png` should look like:

![token_counts.png](answer_token_counts.png)

The resulting file `entity_counts.png` should look like:

![entity_counts.png](answer_entity_counts.png)

## Test `plot_word_cloud`

Complete the implementation of `plot_word_cloud` in `spacy_on_corpus.py`. 

Then, in the code cell below import `spacy_on_corpus` and run `plot_word_cloud` on the given corpus.

In [None]:
# import

# call load_compressed

# call plot_word_cloud on my_corpus


The resulting file `token_wordcloud.png` should look like:

![token_wordcloud.png](answer_token_wordcloud.png)

## Make Wordcount and Wordcloud Plots Informative

Modify `plot_word_entity_frequencies` and `plot_word_cloud` functions so they produce *informative* plots (excluding function words like *and* and *a*, punctuation, and numbers).

## Test `plot_metadata_frequencies` (New for project 3a!)

Complete the implementation of `plot_metadata_frequencies` in `spacy_on_corpus.py`. 

Then, in the code cell below import `spacy_on_corpus` and run `plot_metadata_frequencies` on the given corpus using the key 'publicationYear'.

In [None]:
# import

# call load_compressed

# call plot_metadata_frequencies on my_corpus to get the publicationYear frequencies


The resulting file `publicationYear_counts.png` should look like:

![pubYear_counts.png](publicationYear_counts.png)

# Running `spacy_on_corpus.py` from the Terminal (Modified for project 3a!)

Complete the implementation of `main` in `spacy_on_corpus.py`. 

Now run this in the terminal:
% `python spacy_on_corpus.py`

Give it `files.jsonl.zip` as the pattern. Get all of 'statistics', 'wordcount' and 'wordcloud' as well as bar charts for 'publicationYear' and 'pageCount'.

Insert the images generated when you run it.

## Token count plot


## Entity count plot


## Word cloud


## Publication year plot


## Page count plot



# Questions

Answer these questions with respect to the corpus defined by `files.jsonl.zip`.

1. *How many tokens and unique tokens are in this corpus?*
2. *What is the average number of pages of documents in this corpus?*
3. *What is the most frequent year in which articles in this corpus were published?* 
4. *What happens when you try to get metadata counts for the metadata key `tdmCategory`? Why?*
5. *What is the structure of the return value from `get_metadata_counts`?*
6. *What does the function `loads()` in the package `json` do? Is it more suited to `.json` or `.jsonl` files?*
7. *Name and describe one function in the package `json` other than `loads()`.*
8. *List three metadata keys in this corpus:* 
9. *One of the metadata keys in this corpus is 'doi'. What is a DOI?*
10. *Could we get a good understanding of the themes of this corpus just from the metadata (not the fulltext)? Why or why not?*