# Introduction to Jupyter Notebooks
This lesson will introduce the Jupyter Notebook interface. We will use the interface to run and write, yes, write, some Python code for text data analysis.

By the end of this lesson, learners should be able to:
1. Explain the difference between markdown and code blocks in Jupyter Notebooks
2. Execute pre-written Python code to analyze newspaper text
3. Modify Python code to change the settings of the analysis

And just an aside, all the code for this fun stuff is available on GitHub at [https://github.com/jcoliver/dig-coll-borderlands](https://github.com/jcoliver/dig-coll-borderlands).

## What is this Jupyter Notebook thing?

Jupyter Notebooks are effectively made up of "cells". We can start by thinking of each cell as being equivalent to a paragraph on a page. There is an order in which paragraphs and cells appear, and that order matters. In Jupyter Notebooks, the cells come in two flavors, and a single notebook (like the one we are working in now) will have both types of cells.
+ The first is called "markdown", which is text, like you are reading now. We can use some syntax in the text to format the cells in particular ways. For example, we can create italic text by using the underscore symbol ("\_") at the beginning and ending of the text we want to italicize. So when we write "\_italic\_" in a markdown block, it will show up as _italic_.
+ The second kind of cell is a "code" cell, which contains computer code in a language like Python or R. This is where the fun comes in.


So let's try this out. Click your cursor in the box below on the word "Data" and run the cell. On a Windows machine, you can run the cell by holding down the Control (Ctrl) key and pressing Enter. If you are on a Mac, hold down the Command (Cmd) key instead of the Control key and press Enter. You can also click the button labeled "Run" at the top of the screen.

In [None]:
print("Collections as Data")

So what we are going to do today is work with some text files to answer some text data mining questions.
### Looking at one file
We will start with a single text file.

The code block below sets up the name of the file we want to use. There are a couple of important pieces we need to provide:
1. The title of the paper, here in a machine-readable form
2. The date of the paper, with four-digit year, two-digit month, and two-digit day

In [2]:
title = "border-vidette"
year = "1919"
month = "01"
day = "04"
# We stitch all those pieces of information together, along with the folder 
# information about where data for an entire day's paper is located
filename = "data/" + title + "/volumes/" + year + month + day + ".txt"

When we run that code block, nothing will visibly happen. We haven't asked Python to print anything, and there were no errors (yay!). But we might want to check our work to make sure the file name was specified correctly. So we can use our `print` command again:

In [None]:
print(filename)

Note that this time we did not enter a phrase enclosed in quotation marks, but instead provided the word `filename`. But it didn't print "filename". Rather, it printed the value stored in the _variable_ called `filename`. If you can think back to high school algebra, this is a similar sort of concept: we use a variable, in this case `filename`, to store information, much like we would use the variable "x" in a mathematical equation.
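
As a small sketch of that difference, the cell below prints a quoted phrase and then a variable. It rebuilds the same file name from the pieces defined earlier, and also shows an f-string, which is an alternative way of stitching the pieces together (the name `same_filename` is only for illustration).

```python
# The same pieces used earlier in this lesson
title = "border-vidette"
year = "1919"
month = "01"
day = "04"

# Stitch the pieces together with "+", as in the lesson
filename = "data/" + title + "/volumes/" + year + month + day + ".txt"

# An f-string is another way to write the same thing
same_filename = f"data/{title}/volumes/{year}{month}{day}.txt"

print("filename")   # quotation marks: prints the word itself
print(filename)     # no quotation marks: prints the stored value
```

Both approaches produce the same text, so which one to use is mostly a matter of readability.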

At this point, we are ready to read the file and do some work with it. Before we do so, we will need to tell Python about some additional programs to use. By default, Python does not come with text data mining tools, so those are installed separately and we make them available with the `import` command. Run the code block below to load those packages.

In [11]:
# Load additional packages
import pandas  # for data tables
import os      # for file navigation
import re      # for pattern matching in filenames
import nltk    # for text data mining
from nltk.corpus import stopwords           # for stopword corpora
from nltk.tokenize import RegexpTokenizer   # for splitting data into individual words
import digcol.cleantext as dc               # for automated text cleaning

Now we are ready to read in the data and start looking around. The code block below will read in all the text from the day's paper and clean it up. By "clean it up", we mean that `CleanText` does the following:
1. Removes stop words (here we use English stop words)
2. Removes words that are one character long
3. Removes punctuation
4. "Tokenizes" the data. In this case, that means it breaks the text into individual words
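
To make those four steps concrete, here is a simplified stand-in written with only Python's built-in `re` module. It is not the actual `CleanText` code: the function name `simple_clean`, the tiny stop word list, and the sample sentence are all made up for illustration, and the real tool uses a much longer list of English stop words.

```python
import re

# Hypothetical, very short stop word list (illustration only)
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "was"}

def simple_clean(text):
    """Return the words in text, minus punctuation, stop words,
    and words that are one character long."""
    # Tokenize: "\w+" grabs runs of letters and digits, which also
    # leaves the punctuation behind
    tokens = re.findall(r"\w+", text)
    # Drop one-character words and stop words (compared in lowercase)
    return [t for t in tokens if len(t) > 1 and t.lower() not in STOP_WORDS]

sample = "The sheriff of Nogales, Arizona, was in a hurry."
cleaned = simple_clean(sample)
print(cleaned)
```

Running this prints `['sheriff', 'Nogales', 'Arizona', 'hurry']`: the commas and the period are gone, and so are the short, common words.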

In [4]:
newdata = dc.CleanText(filename)

Again, nothing visibly happened, so we can check our work by looking at the first 20 words. Run the code block below (remember: click the box and press Ctrl+Enter or Cmd+Enter).

In [None]:
# Python counts positions from zero, so [:20] gives the first 20 words
print(newdata.clean_list[:20])

We can use this list of words to calculate the relative frequency of each word. Relative frequencies in this case are relative to the length of the issue: we count the number of times a word occurs and divide that by the total number of words in the issue.
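
The arithmetic can be sketched by hand with a short, made-up word list. This example uses Python's built-in `Counter` rather than pandas, only so that each step of the calculation is visible; the words themselves are invented.

```python
from collections import Counter

# Made-up list standing in for the cleaned words of one issue
words = ["arizona", "nogales", "arizona", "border", "arizona", "nogales"]

# Count how many times each word occurs
counts = Counter(words)

# Divide each count by the total number of words
total = len(words)
rel_freqs = {word: count / total for word, count in counts.items()}

print(rel_freqs["arizona"])   # 3 occurrences out of 6 words
```

Here "arizona" has a relative frequency of 0.5, and the relative frequencies of all the words add up to one.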

In [8]:
# Create a table with all the words
word_table = pandas.Series(newdata.clean_list)

# Calculate relative frequency of each word
word_freqs = word_table.value_counts(normalize = True)

To check our work, we look at the first 10 rows of the `word_freqs` table:

In [None]:
print(word_freqs.head(n = 10))

It should come as no big surprise that "Arizona" and "Nogales" are the most frequent words, given that the paper was printed in Nogales, Arizona.
## Beyond counting
Now we can broaden our focus to look at trends over time. We are going to look at multiple years of papers to track how the frequency of influenza coverage changes over time. We will stick with _The Border Vidette_, but instead of looking at a single issue, we will look at all the issues from 1917 through 1919.

Here's where it gets fun. We could try to do this file-by-file, but that would be extremely tedious. So we are going to give Python a little bit of information and let the computer look at every single file. But first we need to tell Python _which_ files to use.

In [13]:
# Create a pattern that will match the dates of interest. In this case, papers from 
# 1917, 1918, and 1919
date_pattern = re.compile("((1917)|(1918)|(1919))([0-9]{4})*")

Wait, what the hell does that even mean? What we have in the code block above is something called a "regular expression". Regular expressions are a very powerful pattern matching tool with a very terrible name. What we are saying above is that we want any files whose names:
+ start with "1917", "1918", or "1919"
+ which is followed by four digits (any number between zero and nine)
    + `[0-9]` matches any single digit
    + `{4}` indicates there will be four of them, i.e. the two-digit month and two-digit day, so the first of May is represented as "0501"
    + the asterisk "*" after the parentheses means "zero or more" of that four-digit group; it is a repetition marker rather than a wild card
+ and end with anything: we use the pattern through `date_pattern.match`, which only checks the beginning of the file name, so any letters, numbers, or symbols may follow (such as ".txt")

So be sure you run the code block above (Ctrl+Enter or Cmd+Enter) before moving on. You will know that the code block has been run when you see a number show up in between the square brackets to the left of the code block (`In [ ]:`).
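
One way to get a feel for a pattern is to try it on a few names. The sketch below uses the same pattern as above on a short list; apart from "19170113.txt", the file names are made up for illustration.

```python
import re

# Same pattern as in the lesson
date_pattern = re.compile("((1917)|(1918)|(1919))([0-9]{4})*")

# A few names to test; most are hypothetical
candidates = ["19161230.txt", "19170113.txt", "19181102.txt", "19200103.txt"]

# Keep only the names where the pattern matches at the start
matches = [name for name in candidates if date_pattern.match(name)]
print(matches)
```

Only the 1917 and 1918 names are kept; the names starting with "1916" and "1920" do not match any of the three years, so they are left out.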

We are now ready to start reading in the files. We need to start by listing _all_ the _Border Vidette_ issues, then filtering only those that are in the date range of interest.

In [14]:
# List all the Border Vidette files and store in bv_volumes variable
bv_volumes = os.listdir("data/border-vidette/volumes")

# Use date pattern from above to restrict to dates of interest
bv_volumes = list(filter(date_pattern.match, bv_volumes))

# Sort them for easier bookkeeping
bv_volumes.sort()

Here is another opportunity for a reality check, so we ask Python to print out a few of the files that we will ask it to read.

In [16]:
print(bv_volumes[1:5])

['19170113.txt', '19170120.txt', '19170127.txt', '19170203.txt']
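
Only four names are printed because of how Python counts positions in a list: counting starts at zero, and a range like `[1:5]` stops just before position five. A minimal sketch with a made-up list shows the difference between `[1:5]` and `[:5]`.

```python
# Made-up list standing in for a list of file names
files = ["a.txt", "b.txt", "c.txt", "d.txt", "e.txt", "f.txt"]

second_to_fifth = files[1:5]   # positions 1, 2, 3, 4: skips "a.txt"
first_five = files[:5]         # positions 0, 1, 2, 3, 4

print(second_to_fifth)
print(first_five)
```

So to see the first five files in `bv_volumes`, `bv_volumes[:5]` would be the range to use.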


Now we know which files to look at, so we can instruct Python to do so. We are ultimately going to want to create a table that has two columns of data:
1. The date the paper was published. Thankfully, this is stored in the filename: the file 19170113 is the paper that was published on January 13, 1917.
2. The relative frequency of the words we are interested in for that date's paper
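
As a rough sketch of how those two columns could be filled in, the example below pulls the date out of a file name and calculates the share of words that match a set of words of interest. Everything here is an assumption for illustration: the helper names `date_from_filename` and `relative_frequency`, the words in `targets`, and the tiny word lists standing in for cleaned issues.

```python
from datetime import datetime

def date_from_filename(name):
    """Turn a name like '19170113.txt' into a date."""
    # The first eight characters are the year, month, and day
    return datetime.strptime(name[:8], "%Y%m%d").date()

def relative_frequency(words, targets):
    """Share of the words in the list that appear in targets."""
    if not words:
        return 0.0
    hits = sum(1 for w in words if w.lower() in targets)
    return hits / len(words)

# Made-up word lists standing in for two cleaned issues
issues = {
    "19170113.txt": ["border", "cattle", "mine", "sheriff"],
    "19181102.txt": ["influenza", "epidemic", "school", "influenza"],
}
targets = {"influenza", "flu"}   # hypothetical words of interest

# One row per issue: (date, relative frequency)
rows = [
    (date_from_filename(name).isoformat(), relative_frequency(words, targets))
    for name, words in sorted(issues.items())
]
print(rows)
```

With these invented lists, the 1917 issue gets a relative frequency of 0.0 and the 1918 issue gets 0.5. A list of rows like this could then be handed to `pandas.DataFrame` to make the table.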

In [None]:
# Ultimately creating a table with dates and relative word frequencies
# Build a YYYY-MM-DD date from each file name, e.g. "19170113.txt" becomes "1917-01-13"
dates = [d[0:4] + "-" + d[4:6] + "-" + d[6:8] for d in bv_volumes]
# Add those dates to a data frame
flu_table = pandas.DataFrame(dates, columns = ["Date"])

# May want to outsource this, need:
# bv_volumes
# words of interest
# file locations, in this case, "data/border-vidette/volumes"