# Introduction to Jupyter Notebooks
This lesson will introduce the Jupyter Notebook interface. We will use the interface to run and write, yes, write, some Python code for text data analysis.

By the end of this lesson, learners should be able to:
1. Explain the difference between markdown and code blocks in Jupyter Notebooks
2. Execute pre-written Python code to analyze newspaper text
3. Modify Python code to change the settings of the analysis

## What is this Jupyter Notebook thing?

Jupyter Notebooks are effectively made up of "cells". We can start by thinking of each cell as equivalent to a paragraph on a page. There is an order in which paragraphs and cells appear, and that order matters. In Jupyter Notebooks, the cells come in two flavors, and a single notebook (like the one we are working in now) will have both types of cells.
+ The first is called "markdown", which is text, like you are reading now. We can use some syntax in the text to format the cells in particular ways. For example, we can create italic text by using the underscore symbol ("\_") at the beginning and end of the text we want to italicize. So when we write "\_italic\_" in a markdown block, it will show up as _italic_.
+ The second kind of cell is a "code" cell, which contains computer code in a language like Python or R. This is where the fun comes in.

**Do some markdown stuff?**

In [1]:
print("Collections as Data")

Collections as Data


## So what is Python then?

Add a brief explanation of Python?

In [2]:
print("Hello World")

Hello World


**talk about what we are going to do**

In [3]:
# This may need to happen first, to get stopwords downloaded to all learners' home folder?
# import nltk
# nltk.download('stopwords')

In [4]:
# import the libraries we need for the analysis
from scipy import stats
import pandas
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer

Get the data and lay out a roadmap of where we want to go. Maybe whiteboard the entire process?

In [19]:
# build the path to one issue's text file and do a little reality check
title_1 = "border-vidette"
year = "1919"
month = "01"
day = "04"
filename = "data/" + title_1 + "/volumes/" + year + month + day + ".txt"
print(filename)

data/border-vidette/volumes/19190104.txt
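
The same path can also be assembled with a small helper, which makes it easier to switch to another issue later. This is only a sketch: `issue_path` is a hypothetical name, not part of the lesson's code, and it assumes every issue is stored as `data/<title>/volumes/<YYYYMMDD>.txt`, as in the cell above.

```python
def issue_path(title, year, month, day):
    """Build the path to one issue's text file (hypothetical helper)."""
    # assumes the layout data/<title>/volumes/<YYYYMMDD>.txt
    return f"data/{title}/volumes/{year}{month}{day}.txt"

print(issue_path("border-vidette", "1919", "01", "04"))
```

Changing any one argument (for example, the day) then gives the path for a different issue without rewriting the string concatenation.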


In [20]:
# open the file so we can read the text from it; "with" closes the file for us
with open(filename, "r") as file:
    # read the file and store in variable issue_text
    issue_text = file.read()
# print the first 200 characters of the text
print(issue_text[0:200])

rVrritoria' Library State House
NTY-SEVENTH YEAR.
NOGALES, SANTA CRUZ COUNTY. ARIZONA, JANUARY 4, 1919.
No. 1.
I
- ..;..n.
ANGLO-AMERICAN
COAT POCKET FLASHLIGHTS
FLAT OPENING
OR
CIGARETTE CASE STYLE
i


Word tokenizing

In [21]:
# convert everything to lower case (otherwise "House" and "house" are considered different words)
issue_text = issue_text.lower()

# remove punctuation and "tokenize"
tokenizer = RegexpTokenizer(r'\w+')
issue_text = tokenizer.tokenize(issue_text)

# look at the second through tenth words in the output (index 0 is skipped)
print(issue_text[1:10])

['library', 'state', 'house', 'nty', 'seventh', 'year', 'nogales', 'santa', 'cruz']
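
To see what the pattern `\w+` keeps and what it throws away, here is a sketch that uses Python's built-in `re` module on one line from the issue. For a simple pattern like this one, `re.findall` should return the same tokens as `RegexpTokenizer`; the variable names are for illustration only.

```python
import re

line = "NOGALES, SANTA CRUZ COUNTY. ARIZONA, JANUARY 4, 1919."
# lower-case first, then keep every run of letters, digits, or underscores
tokens = re.findall(r"\w+", line.lower())
print(tokens)
```

Commas and periods disappear because they do not match `\w`, while numbers such as "4" and "1919" are kept as tokens.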


some word counting before removing stopwords

Could leave line 4 with an error in it (value_count instead of value_counts)

In [22]:
# make a table with words in it
word_table = pandas.Series(issue_text)
# count how many times each word occurs
word_counts = word_table.value_counts()
# print the top ten most common words and their respective counts
print(word_counts.head(n = 10))

the    570
of     417
and    279
a      223
to     195
in     174
at     107
is      96
for     89
i       77
dtype: int64
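
If it helps to see the counting step without pandas, the sketch below does the same kind of tally with `collections.Counter` from the standard library. The word list is made up for illustration and is not taken from the newspaper.

```python
from collections import Counter

# a tiny, made-up list of tokens
words = ["the", "state", "of", "the", "house", "of", "the"]
counts = Counter(words)
# most_common(2) returns the two most frequent words with their counts
print(counts.most_common(2))
```

Like `value_counts()`, `most_common()` orders words from most to least frequent.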


removal of stopwords and maybe single characters?

In [23]:
# load the appropriate corpora
stop_words = set(stopwords.words('english'))

# remove stopwords
filtered_words = []
for word in issue_text:
    if word not in stop_words:
        filtered_words.append(word)

# Recalculate word counts
word_table = pandas.Series(filtered_words)
# count how many times each word occurs
word_counts = word_table.value_counts()
# print the top ten most common words and their respective counts
print(word_counts.head(n = 10))

arizona    71
nogales    62
j          55
w          34
r          34
f          33
state      27
e          25
p          23
year       23
dtype: int64


In [24]:
# remove stop words AND single letter words
filtered_words = []
for word in issue_text:
    if word not in stop_words:
        if len(word) > 1:
            filtered_words.append(word)

# Recalculate word counts
word_table = pandas.Series(filtered_words)
# count how many times each word occurs
word_counts = word_table.value_counts()
# print the top ten most common words and their respective counts
print(word_counts.head(n = 10))

arizona     71
nogales     62
state       27
year        23
one         21
business    21
ed          21
day         20
co          20
people      20
dtype: int64
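
The nested `if` statements above can also be written as a single condition inside a list comprehension. The sketch below assumes a tiny, made-up stopword set and token list so that it runs on its own; the lesson itself uses NLTK's full English list.

```python
# a tiny, made-up stopword set; the lesson uses NLTK's full English list
example_stop_words = {"the", "of", "and", "a"}
tokens = ["the", "state", "of", "arizona", "j", "w", "and", "nogales"]

# keep a word only if it is not a stopword AND has more than one letter
filtered = [w for w in tokens if w not in example_stop_words and len(w) > 1]
print(filtered)
```

Both versions keep exactly the same words; which one reads better is a matter of taste.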


In [27]:
# instead of counts, return relative frequencies
word_freqs = word_table.value_counts(normalize = True)
# print the top ten most frequent words
print(word_freqs.head(n = 10))

arizona     0.011356
nogales     0.009917
state       0.004319
year        0.003679
one         0.003359
business    0.003359
ed          0.003359
day         0.003199
co          0.003199
people      0.003199
dtype: float64
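
A relative frequency is a word's count divided by the total number of words that were counted. The sketch below works through that arithmetic by hand on a made-up four-word list, using only the standard library.

```python
from collections import Counter

# a tiny, made-up list of tokens
words = ["arizona", "nogales", "arizona", "state"]
counts = Counter(words)
total = sum(counts.values())
# relative frequency = count of a word / total number of words
freqs = {word: count / total for word, count in counts.items()}
print(freqs)
```

Because every count is divided by the same total, the relative frequencies always add up to 1.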


## Beyond counting
There is a lot more we can do than just count words. For example, we can look for specific words and see how their frequency changes over time. Given the publication dates of the newspapers we are looking at and the events of that period, we can look at the frequency of the words "flu" and "influenza" and see how it changes over time.

In [31]:
# Search for "flu" and "influenza" in one volume of interest (middle of influenza pandemic)
influenza_words = ['flu', 'influenza']
influenza_freq = word_counts.filter(influenza_words)
print(influenza_freq)
# print the combined count of both words
print(influenza_freq.sum())

flu          4
influenza    3
dtype: int64
7


Now get relative frequency for all volumes in a year

In [None]:
# Loop over all volumes in a single year, calculating frequency of flu and influenza for each volume
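
One possible shape for this loop is sketched below. It is not the lesson's final code: the function names are hypothetical, it assumes the `data/<title>/volumes/<YYYYMMDD>.txt` layout used earlier, and it skips stopword removal, so its frequencies are relative to all words in an issue.

```python
import re
from pathlib import Path

def influenza_frequency(text, targets=("flu", "influenza")):
    """Return the share of all words in text that are one of the target words."""
    words = re.findall(r"\w+", text.lower())
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in targets)
    return hits / len(words)

def frequencies_for_year(title, year, data_dir="data"):
    """Map each issue date (YYYYMMDD) to its influenza word frequency."""
    folder = Path(data_dir) / title / "volumes"
    results = {}
    # file names are assumed to start with the four-digit year
    for path in sorted(folder.glob(f"{year}*.txt")):
        results[path.stem] = influenza_frequency(path.read_text())
    return results
```

Calling `frequencies_for_year("border-vidette", "1919")` would then return one frequency per issue file found for that year, or an empty dictionary if no files match.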

Comparative analysis - two papers

In [None]:
# Loop over all volumes of second title in a single year, calculating rel. freq. of flu & influenza
# Second title is Bisbee Daily Review?

T-test comparing the two papers
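
As a sketch of what such a comparison computes, the code below works out Welch's t statistic (one common form of the two-sample t-test, which does not assume equal variances) with the standard library. The numbers are made-up placeholders, not results from either newspaper, and `welch_t` is a hypothetical helper name. In the notebook itself, `stats.ttest_ind(paper_a, paper_b, equal_var=False)` from the SciPy import above should give the same statistic along with a p-value.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Welch's t statistic for two independent samples (unequal variances)."""
    standard_error = sqrt(variance(sample_a) / len(sample_a)
                          + variance(sample_b) / len(sample_b))
    return (mean(sample_a) - mean(sample_b)) / standard_error

# made-up placeholder frequencies, NOT results from the newspapers
paper_a = [0.001, 0.002, 0.003, 0.004, 0.005]
paper_b = [0.002, 0.003, 0.004, 0.005, 0.006]
print(welch_t(paper_a, paper_b))
```

A t statistic far from zero suggests the two papers' average frequencies differ by more than the issue-to-issue variation would explain.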