# LOAD PLAIN TEXT DATA

This notebook enables you to upload a zip archive containing plain text files as data. The files are converted into valid JSON manifests, allowing the rest of the Virtual Workspace to work with them.

Run the **Python Imports**, **Configuration**, and **Basic Setup** cells _before_ uploading your data and metadata files.

You must provide metadata in either JSON or CSV format. For formatting requirements, see the **Configuration** and **Basic Setup** sections below.
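
The metadata requirements can be checked before any files are converted. The following is a minimal sketch, not part of the notebook itself: the helper name `find_metadata_problems` and the constant `REQUIRED_FIELDS` are hypothetical, and the list of required fields is assumed from the CSV header and JSON example shown in **Basic Setup**.

```python
import re
from datetime import datetime

# Assumed required fields, taken from the header filename,pub_date,title,author
REQUIRED_FIELDS = ('filename', 'pub_date', 'title', 'author')


def find_metadata_problems(records):
    """Return a list of human-readable problems found in the metadata records."""
    problems = []
    for i, record in enumerate(records):
        # Every record needs a non-empty value for each required field
        for field in REQUIRED_FIELDS:
            if not record.get(field):
                problems.append(f'record {i}: missing {field}')

        # The date must be zero-padded YYYY-MM-DD and a real calendar date
        pub_date = record.get('pub_date', '')
        if pub_date:
            well_formed = re.fullmatch(r'\d{4}-\d{2}-\d{2}', pub_date) is not None
            if well_formed:
                try:
                    datetime.strptime(pub_date, '%Y-%m-%d')
                except ValueError:
                    well_formed = False
            if not well_formed:
                problems.append(f'record {i}: pub_date {pub_date!r} is not YYYY-MM-DD')

        # Filenames are expected to carry a .txt extension
        filename = record.get('filename', '')
        if filename and not filename.endswith('.txt'):
            problems.append(f'record {i}: filename {filename!r} does not end in .txt')
    return problems
```

Calling `find_metadata_problems(metadata)` after the metadata has been loaded would return an empty list when every record passes these checks.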

## SETTINGS

In [None]:
import csv
import json
import os
import re
from IPython.display import display, HTML

sourcepath = 'caches/external_sources'
textfilepath = 'caches/txt'

## CONFIGURATION

In [1]:
zipfile = 'myzip.zip' # The name of the zip archive containing your plain text documents
metadata_format = 'json' # 'csv' or 'json'

# Add Custom Metadata Fields
custom_metadata_fields = [] # e.g. ['word_count', 'sentiment']

## BASIC SETUP

This cell deletes any existing upload and text cache directories and recreates them empty, so re-running it removes files you uploaded earlier. Run this cell once, then follow the instructions it displays for uploading your data and metadata files.

In [None]:
!rm -rf caches/external_sources
!mkdir -p caches/external_sources
!rm -rf caches/txt
!mkdir -p caches/txt
msg = '''
<h3>You are now ready to upload your material.</h3>
<p>Note that you are required to provide metadata with a date of publication that must be in 'YYYY-MM-DD' format.</p>
<h3>Using JSON metadata</h3>
<p>Navigate to your <code>caches/external_sources</code> folder and click the "Upload" button to upload a zipped archive of documents in plain text format. If your metadata format is <code>json</code>, you must also upload a JSON file called <code>metadata.json</code> with the following format:</p>
<pre><code>
[
  {
    "filename": "emma.txt",
    "pub_date": "1815-01-01",
    "title": "Emma",
    "author": "Jane Austen"
  },
  {
    "filename": "jane_eyre.txt",
    "pub_date": "1847-01-01",
    "title": "Jane Eyre",
    "author": "Charlotte Brontë"
  }  
]
</code></pre>
<p>You can also paste a copy of the JSON metadata into the <code>metadata</code> variable in the <strong>Add Metadata</strong> section below, but make sure that <code>metadata_format</code> is set to 'json' in your configuration above. When you are finished, continue running the cells below the <strong>Add Metadata</strong> section.</p>
<h3>Using CSV metadata</h3>
<p>Alternatively, if your metadata is in CSV format, upload a file called <code>metadata.csv</code> with the following header:
<code>filename,pub_date,title,author</code>.</p>
<p>Make sure your date of publication is in <code>YYYY-MM-DD</code> format and that your filenames have a <code>.txt</code> extension. Set <code>metadata_format</code> to 'csv' in the configuration cell above.</p>
'''
output = HTML(msg)

display(output)

## METADATA: Add Metadata

Leave the following list empty if you are uploading your own JSON or CSV metadata file, but run the cell anyway. It initializes the list that the metadata file will populate. Each record in the list is later written to its own JSON file.
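
Whichever format is used, the list ends up holding one dictionary per document. The sketch below illustrates this with small in-memory samples rather than uploaded files; the variable names `json_text`, `csv_text`, `from_json`, and `from_csv` are invented for the illustration.

```python
import csv
import io
import json

# The same single record expressed in both supported formats
json_text = '[{"filename": "emma.txt", "pub_date": "1815-01-01", "title": "Emma", "author": "Jane Austen"}]'
csv_text = 'filename,pub_date,title,author\nemma.txt,1815-01-01,Emma,Jane Austen\n'

# JSON parses directly to a list of dictionaries
from_json = json.loads(json_text)

# csv.DictReader yields one mapping per row, keyed by the header line
from_csv = [dict(row) for row in csv.DictReader(io.StringIO(csv_text))]

print(from_json == from_csv)
```

One difference to keep in mind is that `csv.DictReader` returns every value as a string, so a numeric custom field such as `word_count` would arrive from CSV as text rather than as a number.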

In [None]:
metadata = []

## PREPARE DATA: Prepare data and metadata for import
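
Once the cell in this section has extracted the archive, it can be useful to confirm that every metadata record points at a file that actually exists. The following is a sketch under assumptions: the helper name `missing_text_files` is hypothetical, and it relies on the text files sitting directly in a single folder, which is what the `-j` option of `unzip` produces.

```python
import os


def missing_text_files(records, text_dir):
    """Return metadata filenames that have no matching file in text_dir."""
    # A missing directory is treated as containing no files at all
    available = set(os.listdir(text_dir)) if os.path.isdir(text_dir) else set()
    return [r['filename'] for r in records if r['filename'] not in available]
```

In this notebook the call would look like `missing_text_files(metadata, textfilepath)`. An empty list means each record has a matching text file.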

In [None]:
%%time 

# Load the metadata from file
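if metadata == [] and metadata_format == 'json':
    # The metadata file is assumed to be UTF-8 encoded
    with open(os.path.join(sourcepath, 'metadata.json'), 'r', encoding='utf-8') as f:
        metadata = json.load(f)
elif metadata == [] and metadata_format == 'csv':
    # newline='' is what the csv module expects when reading from a file
    with open(os.path.join(sourcepath, 'metadata.csv'), 'r', newline='', encoding='utf-8') as f:
        metadata = [dict(row) for row in csv.DictReader(f)]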

# Now unzip the archive
datapath = os.path.join(sourcepath, zipfile)
!unzip -j -o -u "{datapath}" -d caches/txt

!ls caches/txt | wc -l
    
print('\n\n----------Time----------')

## CREATE JSON: Create the JSON data files
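
The cell in this section derives each manifest's `name` and `pub_date` from the metadata record. The sketch below isolates those two transformations so they can be tried on sample values; the function names `manifest_name` and `manifest_date` are hypothetical and do not appear in the notebook.

```python
import re


def manifest_name(filename):
    """Lowercase the filename, drop '.txt', and collapse other characters to '_'."""
    name = filename.lower().replace('.txt', '')
    # Runs of non-word characters and underscores become a single underscore
    return re.sub(r'[\W_]+', '_', name)


def manifest_date(pub_date):
    """Extend a YYYY-MM-DD date to a midnight UTC timestamp."""
    return pub_date + 'T00:00:00Z'


print(manifest_name('Jane Eyre.txt'))
print(manifest_date('1847-01-01'))
```

Because the manifest is saved under its normalized name, two filenames that normalize to the same value (for example `Jane Eyre.txt` and `jane-eyre.txt`) would be written to the same JSON file, with the later one overwriting the earlier.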

In [None]:
%%time 

!rm -rf caches/json
!mkdir -p caches/json
    
for i, meta in enumerate(metadata):
    # Build the basic manifest
    doc_id = str(i)
    name = meta['filename'].lower().replace('.txt', '')
    name = re.sub(r'[\W_]+', '_', name)
    date = meta['pub_date'] + 'T00:00:00Z'
    author = meta['author']
    pub = meta.get('pub', 'Unknown')
    doc = {
        'doc_id': doc_id,
        'name': name,
        'namespace': 'we1sv2.0',
        'metapath': '',
        'pub': pub,
        'title': meta['title'],
        'pub_date': date,
        'author': author
    }
    # Add any custom_metadata_fields
    for item in custom_metadata_fields:
        doc[item] = meta[item]
    # Read the source file (assumed to be UTF-8) and copy the content to the manifest
    with open(os.path.join(textfilepath, meta['filename']), 'r', encoding='utf-8') as f:
        doc['content'] = f.read()
        doc['length'] = str(len(doc['content']))
    # Save the document
    with open(os.path.join('caches/json', name + '.json'), 'w') as f:
        f.write(json.dumps(doc))

print('\n\n----------Time----------')
            
msg = '''
<h3>Your data has been imported.</h3>

<p>You should now run the <code>1_import_data</code> notebook to continue your project.</p>

<p>Do not run the cells in the following sections.</p>

<ul>
<li>BROWSE: search zip filenames for keywords</li>
<li>LIST: define which zips will be used to import JSON files</li>
<li>IMPORT: copy JSON from zip files to cache</li>
</ul>

<p>After these sections, run the cells starting with <strong>FILTER: delete non-matching JSON</strong>.</p>
'''
output = HTML(msg)

display(output)