# Introduction to Jupyter Notebooks
This lesson introduces the Jupyter Notebook interface. We will use it to run and (yes!) write some Python code for text data analysis.

By the end of this lesson, learners should be able to:
1. Explain the difference between markdown and code blocks in Jupyter Notebooks
2. Execute pre-written Python code to analyze newspaper text
3. Modify Python code to change the settings of the analysis

## What is this Jupyter Notebook thing?

Jupyter Notebooks are effectively made up of "cells". We can start by thinking of each cell as being equivalent to a paragraph on a page. There is an order in which paragraphs and cells appear, and that order matters. In Jupyter Notebooks, the cells come in two flavors, and a single notebook (like the one we are working in now) will have both types of cells.
+ The first is called "markdown", which is text, like you are reading now. We can use some syntax in the text to format the cells in particular ways. For example, we can create italic text by using the underscore symbol ("\_") at the beginning and ending of the text we want to italicize. So when we write "\_italic\_" in a markdown block, it will show up as _italic_.
+ The second kind of cell is a "code" cell, which contains computer code in a language like Python or R. This is where the fun comes in.

**Do some markdown stuff?**

In [9]:
print("Collections as Data")

Collections as Data


## So what is Python then?

Add brief explanation of Python?

In [10]:
print("Hello World")

Hello World


**talk about what we are going to do**

In [11]:
# This may need to happen first, to get stopwords downloaded to all learners' home folder?
# import nltk
# nltk.download('stopwords')

In [12]:
# import the libraries we need for reading, tokenizing, and counting words
import pandas
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer

Get the data and give a roadmap of where we want to go. Maybe whiteboard the entire process?

In [13]:
# build the path to one issue's text file and do a little reality check
title_1 = "border-vidette"
year = "1919"
month = "01"
day = "04"
filename = "data/" + title_1 + "/volumes/" + year + month + day + ".txt"
print(filename)

data/border-vidette/volumes/19190104.txt
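
Because the path is assembled from a handful of variables, pointing the analysis at a different issue only means changing those values. A minimal sketch of the same idea wrapped in a function, using an f-string instead of `+`; the function name `issue_path` is hypothetical and not part of the lesson's code:

```python
def issue_path(title, year, month, day):
    """Build the path to one issue's text file from its title and date parts."""
    # issue_path is a hypothetical helper, shown only for illustration
    return f"data/{title}/volumes/{year}{month}{day}.txt"

# the same issue as in the cell above
print(issue_path("border-vidette", "1919", "01", "04"))
```

Changing any one argument (for example `day`) changes which file would be read.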


In [14]:
# open the file so we can read the text from it
file = open(filename, "r")
# read the file and store in variable issue_text
issue_text = file.read()
# close the file after reading in the text
file.close()
# print the first 200 characters of the text
print(issue_text[0:200])

rVrritoria' Library State House
NTY-SEVENTH YEAR.
NOGALES, SANTA CRUZ COUNTY. ARIZONA, JANUARY 4, 1919.
No. 1.
I
- ..;..n.
ANGLO-AMERICAN
COAT POCKET FLASHLIGHTS
FLAT OPENING
OR
CIGARETTE CASE STYLE
i


Word tokenizing

In [15]:
# convert everything to lower case (otherwise "House" and "house" are considered different words)
issue_text = issue_text.lower()

# remove punctuation and "tokenize"
tokenizer = RegexpTokenizer(r'\w+')
issue_text = tokenizer.tokenize(issue_text)

# look at the second through tenth words in the output (index 0, the first word, is skipped)
print(issue_text[1:10])

['library', 'state', 'house', 'nty', 'seventh', 'year', 'nogales', 'santa', 'cruz']
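
To see what the `\w+` pattern does, here is a small sketch that uses Python's built-in `re` module in place of NLTK. For this pattern, `re.findall` should behave roughly like `RegexpTokenizer(r'\w+').tokenize`, but treat that equivalence as an assumption rather than a guarantee. The sample text is the first two lines of the issue printed earlier.

```python
import re

# first two lines of the issue text shown earlier in this lesson
header = "rVrritoria' Library State House\nNTY-SEVENTH YEAR."

# lower-case the text, then keep every run of letters, digits, or underscores
tokens = re.findall(r"\w+", header.lower())
print(tokens)
```

Punctuation such as the apostrophe, hyphen, and period is not matched by `\w`, so it disappears and "NTY-SEVENTH" is split into two tokens.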


some word counting before removing stopwords

Could leave line 4 with an error in it (value_count instead of value_counts)

In [16]:
# make a table with words in it
word_table = pandas.Series(issue_text)
# count how many times each word occurs
word_counts = word_table.value_counts()
# print the top ten most common words and their respective counts
print(word_counts.head(n = 10))

the    570
of     417
and    279
a      223
to     195
in     174
at     107
is      96
for     89
i       77
dtype: int64
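
If it helps to see the counting step without pandas, the sketch below does the same kind of tally with `collections.Counter` from the standard library. The word list is made up for illustration and is not taken from the newspaper.

```python
from collections import Counter

# a made-up list of tokens, standing in for issue_text
words = ["the", "state", "of", "the", "state", "the"]

# count how many times each word occurs
counts = Counter(words)

# show the two most common words and their counts
for word, count in counts.most_common(2):
    print(word, count)
```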


removal of stopwords and maybe single characters?

In [17]:
# load the appropriate corpora
stop_words = set(stopwords.words('english'))

# remove stopwords
filtered_words = []
for word in issue_text:
    if word not in stop_words:
        filtered_words.append(word)

# Recalculate word counts
word_table = pandas.Series(filtered_words)
# count how many times each word occurs
word_counts = word_table.value_counts()
# print the top ten most common words and their respective counts
print(word_counts.head(n = 10))

arizona    71
nogales    62
j          55
w          34
r          34
f          33
state      27
e          25
year       23
p          23
dtype: int64


In [18]:
# remove stop words AND single letter words
filtered_words = []
for word in issue_text:
    if word not in stop_words:
        if len(word) > 1:
            filtered_words.append(word)

# Recalculate word counts
word_table = pandas.Series(filtered_words)
# count how many times each word occurs
word_counts = word_table.value_counts()
# print the top ten most common words and their respective counts
print(word_counts.head(n = 10))

arizona     71
nogales     62
state       27
year        23
one         21
ed          21
business    21
co          20
people      20
day         20
dtype: int64
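
The two nested `if` statements in the cell above can also be written as a single condition inside a list comprehension. This is only an alternative style, sketched here with a made-up token list and a tiny, hypothetical stop word set instead of the NLTK list:

```python
# made-up tokens and a tiny stand-in stop word set (not the NLTK list)
tokens = ["the", "state", "of", "j", "w", "arizona"]
stop_words = {"the", "of"}

# keep a word only if it is not a stop word AND is longer than one character
filtered_words = [word for word in tokens
                  if word not in stop_words and len(word) > 1]
print(filtered_words)
```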


In [19]:
# instead of counts, return relative frequencies
word_freqs = word_table.value_counts(normalize = True)
# print the top ten most frequent words
print(word_freqs.head(n = 10))

arizona     0.011356
nogales     0.009917
state       0.004319
year        0.003679
one         0.003359
ed          0.003359
business    0.003359
co          0.003199
people      0.003199
day         0.003199
dtype: float64
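
Setting `normalize = True` divides each word's count by the total number of words, so the values sum to one. A minimal sketch of that arithmetic with a made-up word list, using only the standard library:

```python
from collections import Counter

# made-up tokens, standing in for filtered_words
words = ["arizona", "nogales", "arizona", "state"]

counts = Counter(words)
total = sum(counts.values())

# relative frequency = count for one word / count of all words
freqs = {word: count / total for word, count in counts.items()}
print(freqs["arizona"])
```

Relative frequencies make issues of different lengths comparable, which matters once we start comparing across dates and titles.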


## Beyond counting
There is a lot more we can do than just count words. For example, we can look for specific words and see how their frequency changes over time. Given the publication dates of the newspapers we are looking at and current events, we can look at the frequency of the words "flu" and "influenza" and see how it changes over time.

In [20]:
# Search for "flu" and "influenza" in one volume of interest (middle of influenza pandemic)
influenza_words = ['flu', 'influenza']
influenza_freq = word_counts.filter(influenza_words)
print(influenza_freq)

flu          4
influenza    3
dtype: int64


Now get relative frequency for all volumes in a year

In [21]:
# Loop over all volumes in a single year, calculating frequency of flu and influenza for each volume

## Comparative analysis
We can also make comparisons between different titles. Here we are going to compare the Bisbee Daily Review and the Border Vidette to see if there is a difference in the coverage of the [mine strike of 1917](https://en.wikipedia.org/wiki/Bisbee_Deportation#Strike).

In [22]:
# Loop over all volumes of second title in a single year, calculating rel. freq. of 'strike' and 'strikes'
# Second title is Bisbee Daily Review?
# Look at 'deportation' and 'strike', in June - October of 1917 for Bisbee Daily Review

# Get volumes for each title 191706** through 191710**
# bv = Border Vidette
# bdr = Bisbee Daily Review
import os
import re

# Create a pattern that will match papers June - October, 1917
date_pattern = re.compile("1917((06)|(07)|(08)|(09)|(10))([0-9]{2})*")

# List all the Border Vidette files
bv_volumes = os.listdir("data/border-vidette/volumes")

# Use date pattern from above to restrict to dates of interest
bv_volumes = list(filter(date_pattern.match, bv_volumes))

# Do a little reality check to make sure we only see files in desired date range.
print("Border Vidette:")
print(bv_volumes)

# List and filter files for bisbee-daily-review (like above)
bdr_volumes = os.listdir("data/bisbee-daily-review/volumes")
bdr_volumes = list(filter(date_pattern.match, bdr_volumes))
print("Bisbee Daily Review:")
print(bdr_volumes[0:9])


Border Vidette:
['19170915.txt', '19170804.txt', '19170811.txt', '19171027.txt', '19170901.txt', '19170818.txt', '19170707.txt', '19170728.txt', '19171013.txt', '19170714.txt', '19170825.txt', '19170908.txt', '19170630.txt', '19171006.txt', '19170922.txt', '19170609.txt', '19170602.txt', '19170929.txt', '19171020.txt', '19170616.txt', '19170623.txt']
Bisbee Daily Review:
['19170701.txt', '19170915.txt', '19170921.txt', '19170817.txt', '19170804.txt', '19171004.txt', '19170811.txt', '19170815.txt', '19170824.txt']
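
To make the filtering step easier to follow, here is a small sketch of how `match` decides which file names to keep. It uses a more compact form of the date pattern (an assumption on our part, written as a raw string) and a short list of file names; two of the names appear in the listing above and the rest are hypothetical.

```python
import re

# compact variant of the date pattern: 1917, then a month from 06 to 10, then a two-digit day
date_pattern = re.compile(r"1917(06|07|08|09|10)[0-9]{2}")

# file names to test; only 19170602.txt and 19171027.txt come from the listing above
names = ["19170530.txt", "19170602.txt", "19171027.txt", "19171103.txt", "19180601.txt"]

# match() only succeeds when the pattern fits at the start of the name
kept = [name for name in names if date_pattern.match(name)]
print(kept)
```

Only the June and October 1917 names survive; May, November, and anything from 1918 are dropped.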


Calculate relative frequency of "strike" for each issue for each paper

In [32]:
# A tool we can re-use to break up words
tokenizer = RegexpTokenizer(r'\w+')

# A list of the words of interest
strike_words = ['strike', 'strikers', 'strikes']

# For all Border Vidette volumes, calculate the relative frequency of 'strike'
bv_strike_freq = []
bv_file_locations = ("data/border-vidette/volumes/")
for one_issue in bv_volumes:
    # Read in text
    issue_location = bv_file_locations + one_issue
    issue_file = open(issue_location, "r")
    issue_text = issue_file.read()
    issue_file.close()
    
    # convert to lower case
    issue_text = issue_text.lower()
    issue_text = tokenizer.tokenize(issue_text)
    
    # remove stopwords
    filtered_words = []
    for word in issue_text:
        if word not in stop_words:
            if len(word) > 1:
                filtered_words.append(word)

    # make a table with words in it
    word_table = pandas.Series(filtered_words)
    # calculate relative frequencies
    word_freqs = word_table.value_counts(normalize = True)

    # pull out only values for 'strike', 'strikers', or 'strikes'
    strike_freqs = word_freqs.filter(strike_words)
    
    # add those frequencies to our list of values for Border Vidette
    bv_strike_freq.append(strike_freqs.sum())

print(bv_strike_freq)

[0.0, 0.00033101621979476995, 0.0, 0.0, 0.0, 0.0, 0.0004702931493964571, 0.0, 0.0001719986240110079, 0.0006744225257123588, 0.0, 0.0, 0.0001563232765358762, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0001857700167193015, 0.0, 0.0]


In [35]:
# Do the same calculations for the Bisbee Daily Review
bdr_strike_freq = []
bdr_file_locations = ("data/bisbee-daily-review/volumes/")
for one_issue in bdr_volumes:
    # Read in text
    issue_location = bdr_file_locations + one_issue
    issue_file = open(issue_location, "r")
    issue_text = issue_file.read()
    issue_file.close()
    
    # convert to lower case
    issue_text = issue_text.lower()
    issue_text = tokenizer.tokenize(issue_text)
    
    # remove stopwords
    filtered_words = []
    for word in issue_text:
        if word not in stop_words:
            if len(word) > 1:
                filtered_words.append(word)

    # make a table with words in it
    word_table = pandas.Series(filtered_words)
    # calculate relative frequencies
    word_freqs = word_table.value_counts(normalize = True)

    # pull out only values for 'strike', 'strikers', or 'strikes'
    strike_freqs = word_freqs.filter(strike_words)
    
    # add those frequencies to our list of values for Bisbee Daily Review
    bdr_strike_freq.append(strike_freqs.sum())

print(bdr_strike_freq)

[0.0018363939899833056, 0.00013264358668258389, 0.00036768568126904083, 0.0008564393533882881, 0.001246555570135153, 0.000501336898395722, 0.0003936078091789341, 0.00040257648953301127, 0.0006528835690968445, 0.00016173378618793466, 0.0009913070001525087, 0.0004993065187239945, 0.0006412124744972312, 0.0009314781393724166, 0.0002230400356864057, 0.0017501093818363647, 0.0005339598462195643, 0.0009174311926605505, 0.0013420362349783445, 0.0007610832751950276, 0.0005429864253393665, 0.0008604864616796695, 0.00011014428901861439, 0.0009204470742932282, 7.042253521126761e-05, 0.00012219710392863688, 0.0006555793250983369, 0.0007020231029421151, 6.477522995206633e-05, 0.0008223684210526315, 0.0004931395584950541, 0.00039691990156386444, 0.00041564088627425906, 0.00047024400438894405, 0.00043896673982779, 0.0004172876304023845, 0.0, 0.0006618620385350787, 0.0029359953024075164, 0.0005826487210860572, 0.0034113447044823485, 5.990893841361131e-05, 0.0011521369397048335, 0.00038031076822775184,
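
The two cells above repeat the same steps for each newspaper, which is how copy-and-paste slips creep in. One possible way to avoid the repetition is to put the per-issue calculation in a function. The sketch below is an illustration only: the name `term_frequency` is hypothetical, it uses `re` and `Counter` in place of NLTK and pandas, and the stop word set is a tiny stand-in for the NLTK list.

```python
import re
from collections import Counter

def term_frequency(text, terms, stop_words):
    """Relative frequency of `terms` among the filtered words of `text`."""
    # lower-case and split into word tokens
    words = re.findall(r"\w+", text.lower())
    # drop stop words and single-character tokens
    kept = [word for word in words if word not in stop_words and len(word) > 1]
    if not kept:
        return 0.0
    counts = Counter(kept)
    # Counter returns 0 for terms that never occur
    return sum(counts[term] for term in terms) / len(kept)

# made-up sentence and stand-in stop word set, for illustration only
sample = "The strike ended. The strikers went home."
tiny_stop_words = {"the"}
print(term_frequency(sample, ["strike", "strikers", "strikes"], tiny_stop_words))
```

With a function like this, each newspaper's loop would only need to read a file and append the returned value to its list.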

In [51]:
# Calculate mean relative frequencies for each of the papers
from statistics import mean
# Border Vidette
bv_mean = mean(bv_strike_freq)
# print the mean, using format instead of str to avoid scientific notation
print(format(bv_mean, 'f') + " Border Vidette")
# Bisbee Daily Review
bdr_mean = mean(bdr_strike_freq)
print(format(bdr_mean, 'f') + " Bisbee Daily Review")

0.000095 Border Vidette
0.000840 Bisbee Daily Review


But are these _significantly_ different?

In [49]:
from scipy import stats

# run the test
compare_strike = stats.ttest_ind(bv_strike_freq, bdr_strike_freq, equal_var = False)

# extract values of interest: the t statistic and the p-value
# (equal_var = False makes this Welch's t-test rather than Student's)
t_value = compare_strike[0]
p_value = compare_strike[1]

# print test statistics
print("t = " + format(t_value, '.4f')) # fixed-point, four decimal places
print("p = " + format(p_value, '.3e')) # scientific notation

t = -8.8751
p = 4.137e-15
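
With `equal_var = False`, `stats.ttest_ind` runs Welch's t-test, which does not assume the two groups have the same variance. The t statistic is the difference in means divided by a standard error built from each group's own variance and size. A minimal sketch of that calculation with the standard library, using made-up numbers rather than the newspaper frequencies; `welch_t` is a hypothetical helper name:

```python
from math import sqrt
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """t statistic for two samples without assuming equal variances."""
    # each sample contributes its own sample variance divided by its own size
    standard_error = sqrt(variance(sample_a) / len(sample_a)
                          + variance(sample_b) / len(sample_b))
    return (mean(sample_a) - mean(sample_b)) / standard_error

# made-up samples, for illustration only
a = [1, 2, 3, 4]
b = [2, 4, 6, 8, 10]
print(format(welch_t(a, b), '.4f'))
```

The sign of t depends on the order of the samples: it is negative when the first sample has the smaller mean, as with the Border Vidette above.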
