## Import your dataset

We'll use the `constellate` client to retrieve the dataset automatically as a gzipped JSON Lines file.

Enter a [dataset ID](https://docs.constellate.org/key-terms/#dataset-ID) in the next code cell.

If you don't have a dataset ID, you can:
* Use the sample dataset ID already in the code cell
* [Create a new dataset](https://constellate.org/builder)
* [Use a dataset ID from other pre-built sample datasets](https://constellate.org/dataset/dashboard)

In [1]:
# Creating a variable `dataset_id` to hold our dataset ID
# The default dataset is Shakespeare Quarterly, 1950-present
dataset_id = "3237accb-90e0-01d2-b4e6-5fa527ab6244"

Next, import the `constellate` client and pass `dataset_id` to its `get_dataset` method.

In [3]:
# Importing your dataset with a dataset ID
import constellate
# Pull in the sampled dataset (1500 documents) that matches `dataset_id`
# in the form of a gzipped JSON lines file.
# The .get_dataset() method downloads the gzipped JSONL file
# to the /data folder and returns a string for the file name and location
dataset_file = constellate.get_dataset(dataset_id)

# To download the full dataset (up to a limit of 25,000 documents),
# request it first in the builder environment. See the Constellate Client
# documentation at: https://constellate.org/docs/constellate-client
# Then use the `constellate.download` method shown below.
#dataset_file = constellate.download(dataset_id, 'jsonl')

AttributeError: module 'constellate' has no attribute 'get_dataset'

## Extract Unigram Counts from the JSON file (No cleaning)

We pulled in our dataset using a `dataset_id`. The file, which resides in the datasets/ folder, is a compressed JSON Lines file (jsonl.gz) that contains all the metadata found in the metadata CSV *plus* the textual data necessary for analysis, including:

* Unigram Counts
* Bigram Counts
* Trigram Counts
* Full-text (if available)
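
To make the file format concrete: a gzipped JSON Lines file stores one JSON object per line, so it can be read with the standard library alone. The sketch below writes a tiny stand-in file and reads it back. The file name `sample.jsonl.gz`, the helper `read_jsonl_gz`, the two sample records, and the `unigramCount` field name are all assumptions made for illustration, not values taken from the real dataset.

```python
import gzip
import json
import os
import tempfile

# Two made-up records standing in for dataset documents (illustrative only;
# the field names here are assumptions, not the confirmed dataset schema)
sample_docs = [
    {"id": "doc-1", "title": "First", "unigramCount": {"the": 3, "play": 1}},
    {"id": "doc-2", "title": "Second", "unigramCount": {"the": 2}},
]

# Write one JSON object per line into a gzip-compressed file
sample_path = os.path.join(tempfile.mkdtemp(), "sample.jsonl.gz")
with gzip.open(sample_path, "wt", encoding="utf-8") as f:
    for doc in sample_docs:
        f.write(json.dumps(doc) + "\n")


def read_jsonl_gz(path):
    """Yield one dictionary per non-empty line of a gzipped JSON Lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


docs = list(read_jsonl_gz(sample_path))
print([doc["id"] for doc in docs])
```

Reading lazily with a generator means a large file never has to sit in memory all at once; wrapping it in `list()` is only convenient for small samples like this one.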

To complete our analysis, we will pull out the unigram counts for each document and store them in a `Counter()` object. First we import `Counter`, then we initialize an empty `Counter()` object, `word_frequency`, to hold all of our unigram counts.

We can read in each document using `constellate.dataset_reader`.
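
The counting step described above might look like the following minimal sketch. It uses made-up documents in place of the real dataset, and the `unigramCount` key is an assumption about the field name; check `document.keys()` on a real document before relying on it.

```python
from collections import Counter

# Made-up documents; `unigramCount` is an assumed field name
documents = [
    {"id": "doc-1", "unigramCount": {"the": 3, "play": 1}},
    {"id": "doc-2", "unigramCount": {"the": 2, "stage": 4}},
    {"id": "doc-3"},  # a document with no unigram counts
]

# Start with an empty Counter and add each document's counts to it
word_frequency = Counter()
for document in documents:
    # .get() with an empty default skips documents that lack the field
    word_frequency.update(document.get("unigramCount", {}))

# The two most frequent unigrams across all documents
print(word_frequency.most_common(2))
```

`Counter.update` adds the incoming counts to the existing totals rather than overwriting them, which is why a single `word_frequency` object can accumulate counts across the whole dataset.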

In [4]:
import sys

!$sys.executable -m pip install termcolor

Collecting termcolor
  Downloading termcolor-2.0.1-py3-none-any.whl (5.4 kB)
Installing collected packages: termcolor
Successfully installed termcolor-2.0.1
You should consider upgrading via the '/usr/local/bin/python -m pip install --upgrade pip' command.

In [5]:
from matcher import Text, Matcher
import json
import pandas as pd
from IPython.display import clear_output
%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = [16, 6]

INFO:numexpr.utils:NumExpr defaulting to 8 threads.
INFO:matplotlib.font_manager:generated new fontManager


In [6]:
# Read every document in the dataset into a list of dictionaries
data = list(constellate.dataset_reader(dataset_file))
# To inspect the fields available in each document:
# for document in constellate.dataset_reader(dataset_file):
#     print(document.keys())

In [7]:
# Load the full text of Middlemarch, which each article is matched against
with open('middlemarch.txt') as f:
    rawMM = f.read()

mm = Text(rawMM, 'Middlemarch')

In [8]:
for i, article in enumerate(data):
    clear_output()
    # Show progress; len(data) replaces the hard-coded article count
    print('\r', 'Matching article %s of %s' % (i, len(data)), end='')
    # Skip articles that were already matched on a previous run
    if 'numMatches' not in article:
        articleText = Text(article['fullText'], article['id'])
        article['numMatches'], article['Locations in A'], article['Locations in B'] = \
            Matcher(mm, articleText).match()

 Matching article 32 of 33

In [9]:
# Save the documents, now annotated with their match results, to a JSON file
with open('e2a.json', 'w') as outfile:
    json.dump(data, outfile)
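
Once the results are saved, they can be loaded again and ranked without rerunning the matcher. The sketch below is one possible way to do that, using made-up records and a temporary file named `example_results.json` rather than the real `e2a.json`; only the `id` and `numMatches` keys mirror the cells above.

```python
import json
import os
import tempfile

# Made-up results standing in for the saved articles (illustrative only)
results = [
    {"id": "article-a", "numMatches": 2},
    {"id": "article-b", "numMatches": 0},
    {"id": "article-c", "numMatches": 7},
]

# Save to a temporary file so the sketch does not touch real output
path = os.path.join(tempfile.mkdtemp(), "example_results.json")
with open(path, "w") as outfile:
    json.dump(results, outfile)

# Load the saved results back into a list of dictionaries
with open(path) as infile:
    loaded = json.load(infile)

# Rank articles by number of matches, highest first
ranked = sorted(loaded, key=lambda article: article["numMatches"], reverse=True)

# Keep only the IDs of articles with at least one match
with_matches = [article["id"] for article in ranked if article["numMatches"] > 0]
print(with_matches)
```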