# Introduction to Jupyter Notebooks

This lesson will introduce the Jupyter Notebook interface. We will use the interface to run and write, yes, write, some Python code for text data analysis.

By the end of this lesson, learners should be able to:

1. Explain the difference between markdown and code blocks in Jupyter Notebooks
2. Execute pre-written Python code to analyze newspaper text
3. Modify Python code to change the settings of the analysis

As an aside, all the code for this lesson is available on GitHub at [https://github.com/jcoliver/dig-coll-borderlands](https://github.com/jcoliver/dig-coll-borderlands).

If you are not familiar with text data mining, take a look at this nice [StoryMap](https://storymaps.arcgis.com/stories/cd7e273c42cd4ab6b6ce3fa89c13132c) that introduces the idea of text data mining and what we can do with it.

## What is this Jupyter Notebook thing?

Jupyter Notebooks are effectively made up of "cells". We can start by thinking of each cell being equivalent to a paragraph on a page. There is an order in which paragraphs and cells appear, and that order matters. In Jupyter Notebooks, the cells come in two flavors and a single notebook (like the one we are working in now) will have both types of cells.

+ The first is called "markdown", which is text, like you are reading now. We can use some syntax in the text to format the cells in particular ways. For example, we can create italic text by using the underscore symbol ("\_") at the beginning and ending of the text we want to italicize. So when we write "\_italic\_" in a markdown block, it will show up as _italic_.
+ The second kind of cell is a "code" cell, which contains computer code in a language like Python or R. This is where the fun comes in.

So let's try this out. Click your cursor in the box below on the word "Data" and run the cell. You can run the cell by holding down the Control (Ctrl) key and pressing Enter, or by clicking the button labeled "Run" at the top of the screen.

In [None]:
print('Collections as Data')

So what we are going to do today is work with some text files on some text data mining questions.

### Looking at one file

We will start with a single text file.

The code block below sets up the name of the file we want to use. There are a couple of important pieces we need to provide:

1. The title of the paper, here in a machine-readable form
2. The date of the paper, with four-digit year, two-digit month, and two-digit day

In [None]:
title = 'border-vidette'
year = '1919'
month = '01'
day = '04'

Now that we know which date and which title we are interested in, we need to tell Python where the text files are located.

In [None]:
# We also need to indicate where the data are stored (i.e. which folder 
# are they in)
datapath = 'data/sample/'

We can define another variable, `filename`, that contains all the information we need to read the file. That is, `filename` includes the folder location and the name of the file we are interested in.

In [None]:
# We stitch all those pieces of information together, along with the folder 
# information about where data for an entire day's paper is located
filename = datapath + title + '/volumes/' + year + month + day + '.txt'

When we run that code block, nothing will visibly happen. We haven't asked Python to print anything, and there were no errors (yay!). But we might want to check our work to make sure the file name was specified correctly. So we can use our `print` command again:

In [None]:
print(filename)

Note that this time we did not enter a phrase enclosed in quotation marks, but instead provided the word `filename`. But it didn't print "filename". Rather, it printed the value stored in the _variable_ called `filename`. If you can think back to high school algebra, this is a similar sort of concept: we use a variable, in this case `filename`, to store information, much like we would use the variable "x" in a mathematical equation.
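
As a quick illustration of the difference between a quoted phrase and a variable name, here is a minimal sketch. The variable `greeting` is made up for this example and is not used anywhere else in the lesson.

```python
# 'greeting' is a hypothetical variable used only for this illustration
greeting = 'Collections as Data'

# With quotation marks, Python treats the word as literal text
literal_text = 'greeting'

# Without quotation marks, Python looks up the value stored in the variable
stored_value = greeting

print(literal_text)
print(stored_value)
```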

At this point, we are ready to read the file and do some work with it. Before we do so, we will need to tell Python about some additional programs to use. By default, Python does not come with text data mining tools, so those are installed separately and we make them available for use using the `import` command. Run the code block below to load those packages.

In [None]:
# Load additional packages
# for data tables
import pandas

# for file navigation
import os

# for pattern matching in filenames
import re

# for text data mining
import nltk

# for stopword corpora for a variety of languages
from nltk.corpus import stopwords

# for splitting data into individual words
from nltk.tokenize import RegexpTokenizer

# for automated text cleaning
import digcol as dc

We also need to download the stopwords. There are _a lot_ of recognized stopwords (e.g. "y", "a", "el", "la", "del", "que"), so we don't want to enter them by hand.

In [None]:
# download the stopwords for several languages
nltk.download('stopwords')

Now we are ready to read in the data and start looking around. The code block below will read in all the text from the day's paper and clean it up. By "clean it up", we mean that `CleanText` does the following:

1. Removes stop words (here we use English stop words)
2. Removes words that are one character long
3. Removes punctuation
4. "Tokenizes" the data. In this case, that means it breaks the text into individual words
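
To give a rough feel for those four steps, here is a simplified sketch that uses only Python's built-in `re` module. It is not the actual `digcol` code: the function name `simple_clean`, the tiny stop word list, the sample sentence, and the choice to lowercase everything are all assumptions made for illustration.

```python
import re

# A tiny, hand-picked stop word list for illustration only; the real
# NLTK stop word lists are much longer
example_stopwords = {'the', 'a', 'of', 'in', 'and', 'was'}

def simple_clean(text, stop_words):
    """Lowercase the text, split it into words, and drop unwanted words."""
    # Tokenize: keep runs of letters, digits, or underscores, which also
    # discards punctuation
    tokens = re.findall(r'\w+', text.lower())
    # Remove stop words and words that are only one character long
    return [word for word in tokens if word not in stop_words and len(word) > 1]

# A made-up sentence standing in for newspaper text
sample = 'The mayor of Nogales was in Tucson, and a crowd cheered!'
cleaned = simple_clean(sample, example_stopwords)
print(cleaned)
```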

In [None]:
newdata = dc.CleanText(filename)

Again, nothing visibly happened, so we can check our work by looking at the first 20 words. Run the code block below (remember: click the box and press Ctrl+Enter or Cmd+Enter).

In [None]:
print(newdata.clean_list[0:20])

We can use this list of words to calculate the relative frequency of each word. Relative frequencies in this case are relative to the length of the issue: we count the number of times a word occurs, and divide that by the total number of words in the issue.
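
To make the arithmetic concrete, here is a small hand-worked sketch on a made-up list of words, using Python's built-in `Counter` rather than the pandas approach used in this lesson. The list `example_words` is invented for illustration.

```python
from collections import Counter

# A made-up list of cleaned words, standing in for one issue of a paper
example_words = ['nogales', 'arizona', 'nogales', 'border', 'nogales', 'arizona']

# Count how many times each word occurs
counts = Counter(example_words)

# Divide each count by the total number of words in the list
total_words = len(example_words)
relative_freqs = {word: count / total_words for word, count in counts.items()}

# 'nogales' occurs 3 times out of 6 words
print(relative_freqs['nogales'])
```

Because every word is divided by the same total, the relative frequencies of all the words in an issue add up to one.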

In [None]:
# Create a table with all the words
word_table = pandas.Series(newdata.clean_list)

# Calculate relative frequency of each word
word_freqs = word_table.value_counts(normalize = True)

To check our work, we look at the first 10 rows of the `word_freqs` table:

In [None]:
print(word_freqs.head(n = 10))

It should come as no big surprise that "Arizona" and "Nogales" are the most frequent words, given that the paper was printed in Nogales, Arizona.

## Beyond counting

Now we can broaden our focus to look at trends over time. We are going to look at multiple years of papers to track how the frequency of influenza coverage changes over time. We will stick with _The Border Vidette_ but instead of looking at a single issue, we will look at all the issues from 1917 to 1919.

Here's where it gets fun. We could try to do this file-by-file, but that would be extremely tedious. So we are going to give Python a little bit of information and let the computer look at every single file. But first we need to tell Python _which_ files to use.

In [None]:
# Create a pattern that will match the dates of interest. In this case, 
# papers from 1917, 1918, and 1919
date_pattern = re.compile(r'(1917|1918|1919)([0-9]{4})*')

Wait, what the hell does that even mean? What we have in the code block above is something called a "regular expression". Regular expressions are a very powerful pattern matching tool with a very terrible name. What we are saying above is that we want any files that:

+ start with "1917", "1918", or "1919",
+ followed by four digits, i.e. the two-digit month and two-digit day, so May 1 is represented as "0501"
    + `[0-9]` matches any single digit between 0 and 9
    + `{4}` means that there are four consecutive digits. `[0-9][0-9][0-9][0-9]` is equivalent to `[0-9]{4}`, we just don't have to write as much
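+ and end with anything. Strictly speaking, the asterisk "\*" is not a wild card: it means "zero or more repetitions of whatever comes right before it" (here, the group of four digits). The rest of the filename, such as ".txt", is allowed because the `match` function we use below only checks the beginning of each filename
+ that extra "r" right before the quotation marks makes this a "raw" string, which tells Python to leave any backslashes in the expression alone instead of treating them as special characters. There are no backslashes in this particular pattern, so it is not strictly necessary here, but it is a good habit when writing regular expressions.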

So be sure you run the code block above (Ctrl-Enter or Cmd-Enter) before moving on. You will know that the code block has been run when you see a number show up in between the square brackets to the left of the code block (`In [ ]:`).
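
If you would like to see the pattern in action before using it on the real files, here is a minimal sketch. The filenames in `example_files` are made up for this test and are not necessarily files in the data folder.

```python
import re

date_pattern = re.compile(r'(1917|1918|1919)([0-9]{4})*')

# Hypothetical filenames used only to test the pattern
example_files = ['19161230.txt', '19170113.txt', '19181116.txt', '19200103.txt']

# match() only looks at the beginning of each name, so the pattern acts
# like a "starts with" test
matching = [name for name in example_files if date_pattern.match(name)]
print(matching)
```

Only the names that begin with one of the three years should be kept.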

We are now ready to start reading in the files. We need to start by listing _all_ the _Border Vidette_ issues, then filtering only those that are in the date range of interest.

In [None]:
# List all the Border Vidette files and store in bv_volumes variable
volume_path = datapath + title + '/volumes/'
bv_volumes = os.listdir(volume_path)

# Use date pattern from above to restrict to dates of interest
bv_volumes = list(filter(date_pattern.match, bv_volumes))

# Sort them for easier bookkeeping
bv_volumes.sort()

Here is another opportunity for a reality check, so we ask Python to print out the first five files that we will ask Python to read.

In [None]:
print(bv_volumes[0:5])

Now we know which files to look at, so we can instruct Python to do so. We are ultimately going to want to create a table that has two columns of data:

1. The date the paper was published. Thankfully, this is stored in the filename: the file 19170113.txt is the paper that was published on January 13, 1917.
2. The relative frequency of the words we are interested in for that date's paper

We will start by creating a table that will hold that information. We need to extract dates for each paper. While the filenames have that information, we need to convert it to an actual date, in the form of YYYY-MM-DD, so the date for 19170113.txt is 1917-01-13.

In [None]:
# Create a table that will hold the relative frequency for each date
dates = []
for one_file in bv_volumes:
    one_date = str(one_file[0:4]) + '-' + str(one_file[4:6]) + '-' + str(one_file[6:8])
    dates.append(one_date)

Wait. No. What?

So there's a bit going on there up above.

+ `dates = []` creates an empty list of dates. There is nothing in there to start with.
+ `for one_file in bv_volumes:` says we are going to cycle through all of the files that are listed in the `bv_volumes` variable; each cycle, the value of `one_file` changes to the next value. If you look at the output we created above when running `print(bv_volumes[0:5])`, the first time through the cycle, `one_file` will have the value '19170113.txt'. The second time, `one_file` will have the value '19170120.txt'.
+ `one_date = str(one_file[0:4]) + '-' + str(one_file[4:6]) + '-' + str(one_file[6:8])` is creating a - no, we just need to run some code to explain this one.

Let us do a little test, seeing what this code does on an example of `one_file`. We start by pulling out the very first value in `bv_volumes`, as if it was the first cycle through the code above.

In [None]:
one_file = bv_volumes[0]
print(one_file)

Cool. Looking good. What we are doing with the `one_date` line is pulling out parts of that filename using indexing. An index is basically an address for each character, and Python starts counting at zero. For the year, we pull out the characters at indexes 0 through 3 of the file name via `one_file[0:4]` (the slice stops just before index 4):

In [None]:
# Look at first four characters
print(one_file[0:4])

In [None]:
# Look at the fifth and sixth characters (indexes 4 and 5)
print(one_file[4:6])

Looking at the entirety of the filename, these are the indexes of each character:

| Index:      | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
|-------------|---|---|---|---|---|---|---|---|---|---|----|----|
| Characters: | 1 | 9 | 1 | 7 | 0 | 1 | 1 | 3 | . | t | x  | t  |
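
The same table can be produced with a short sketch. The filename in `example_file` is typed in by hand here so the example runs on its own; in the notebook it would come from `bv_volumes`.

```python
# A hypothetical filename in the same format as the files in this lesson
example_file = '19170113.txt'

# Print each character next to its index
for index, character in enumerate(example_file):
    print(index, character)

# A slice starts at the first index and stops just before the second one
year_part = example_file[0:4]   # indexes 0, 1, 2, 3
month_part = example_file[4:6]  # indexes 4, 5
day_part = example_file[6:8]    # indexes 6, 7
print(year_part, month_part, day_part)
```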

If we run the piece of code that stitches all the pieces together,

In [None]:
str(one_file[0:4]) + '-' + str(one_file[4:6]) + '-' + str(one_file[6:8])

we see the date formatted like we want (i.e. YYYY-MM-DD). 
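
As an alternative sketch (not the approach used in this lesson), Python's built-in `datetime` module can do the same conversion, and it will raise an error if the first eight characters are not a real date. The filename here is a made-up example.

```python
from datetime import datetime

# Hypothetical filename in the YYYYMMDD.txt format
example_file = '19170113.txt'

# Read the first eight characters as a date, then write it back out
# in the YYYY-MM-DD form
parsed = datetime.strptime(example_file[0:8], '%Y%m%d')
formatted = parsed.strftime('%Y-%m-%d')
print(formatted)
```

Stitching strings together, as the lesson does, is simpler to read for beginners; parsing the date is a reasonable choice when the filenames might contain mistakes.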

Looking back to the cycle:

```
for one_file in bv_volumes:
    one_date = str(one_file[0:4]) + "-" + str(one_file[4:6]) + "-" + str(one_file[6:8])
    dates.append(one_date)
```

the last line (`dates.append(one_date)`) will add that one date to our list of dates. Now that we have all those dates, we can set up our table (finally!).

In [None]:
# Add those dates to a data frame
flu_table = pandas.DataFrame(dates, columns = ['Date'])

# Set all frequencies to zero
flu_table['Frequency'] = 0.0

And as a reality check, let's look at the first six rows.

In [None]:
flu_table.head(n = 6)

Now we can use Python to cycle over every file, calculate the relative frequency of flu and influenza, and store the result in the table.

In [None]:
flu_words = ['flu', 'influenza']

for issue in bv_volumes:
    issue_text = dc.CleanText(volume_path + issue)
    issue_text = issue_text.clean_list
    
    # Create a table with words
    word_table = pandas.Series(issue_text)

    # Calculate relative frequencies of all words in the issue
    word_freqs = word_table.value_counts(normalize = True)
    
    # Pull out only values for flu or influenza
    flu_freqs = word_freqs.filter(flu_words)
    
    # Get the total frequency for flu and influenza
    total_flu_freq = flu_freqs.sum()
    
    # Format the date from the name of the file so we know where to put
    # the data in our table
    issue_date = str(issue[0:4]) + '-' + str(issue[4:6]) + '-' + str(issue[6:8])
    
    # Add the date & relative frequency to our data table
    flu_table.loc[flu_table['Date'] == issue_date, 'Frequency'] = total_flu_freq

Look again at the first six rows of our table.

In [None]:
flu_table.head(n = 6)

Hmm...they are all still zeros. But maybe that isn't surprising, since the influenza pandemic did not really get going until late 1918, and we are just looking at the early 1917 issues here. When doing this sort of quality assurance, we can pick a paper that we _know_ will have at least some occurrences of influenza. The November 16 issue from 1918 had at least some mention of influenza, so we can look at the corresponding row for that date via:

In [None]:
# Look at a 1918 date where we know there should be non-zero values
flu_table.loc[flu_table['Date'] == '1918-11-16', ]

Alright! We have our data ready to go. All we need to do now is graph it.

In [None]:
# One more package is needed for plotting
import plotly.express as px
flu_figure = px.line(flu_table, x = 'Date', y = 'Frequency')
flu_figure.show()

The graph should show a peak in relative frequency of flu/influenza over the winter of 1918-1919.

## Your turn

The code block below includes all the code necessary to make a graph like the one above, but you get to determine what it shows. You'll need to provide:

1. The title of the newspaper you want to analyze
2. The range of years to include (note, consult the table below for available date ranges for each paper)
3. The words you are interested in including

Run the code block below to display a table with relevant information regarding available titles and dates.

In [None]:
# Run to display table with newspaper information
titles = pandas.read_csv('data/sample/sample-titles.csv')
display(titles)

Edit the first four variables in the code block below then run the code block. Your graph should appear below the block once it has finished running. You'll know it finished when the asterisk in the square brackets is replaced by a number (e.g. `In [*]` -> `In [31]`)

In [None]:
# Change this to directory of the title of interest, be sure to use lowercase 
# and no spaces (it may be easiest to copy & paste from the "directory" 
# column in the table above)
title = 'el-tucsonense' 

# List the years of interest, each enclosed in quotation marks (') and separated
# by commas
year_list = ['1917', '1918', '1919']

# What words are you interested in? You can add as many as you like, 
# just be sure to enclose each in quotation marks (') and separate with a comma
# Also, keep them lower case, even if they are proper nouns
my_words = ['alemania', 'alemana', 'alemán'] # germany, german (f.), german (m.)

# Specify the language of the title you are looking at (all lowercase)
# Possible values: english, spanish, arabic, turkish, etc.
language = 'spanish'

################################################################################
# No need to edit anything below here
################################################################################

# Creating the pattern of filenames based on years to match
years = '|'
years = years.join(year_list)
pattern = '(' + years + ')([0-9]{4})*'
date_pattern = re.compile(pattern)

# Location of files with text for a day's paper
volume_path = datapath + title + '/volumes/'
my_volumes = os.listdir(volume_path)

# Use date pattern from above to restrict to dates of interest
my_volumes = list(filter(date_pattern.match, my_volumes))

# Sort them for easier bookkeeping
my_volumes.sort()

# Create a table that will hold the relative frequency for each date
dates = []
for one_file in my_volumes:
    one_date = str(one_file[0:4]) + '-' + str(one_file[4:6]) + '-' + str(one_file[6:8])
    dates.append(one_date)

# Add those dates to a data frame
results_table = pandas.DataFrame(dates, columns = ['Date'])

# Set all frequencies to zero
results_table['Frequency'] = 0.0

# Cycle over all issues and do relative frequency calculations
for issue in my_volumes:
    issue_text = dc.CleanText(filename = volume_path + issue, language = language)
    issue_text = issue_text.clean_list
    
    # Create a table with words
    word_table = pandas.Series(issue_text)

    # Calculate relative frequencies of all words in the issue
    word_freqs = word_table.value_counts(normalize = True)
    
    # Pull out only values that match words of interest
    my_freqs = word_freqs.filter(my_words)
    
    # Get the total frequency for words of interest
    total_my_freq = my_freqs.sum()
    
    # Format the date from the name of the file so we know where to put
    # the data in our table
    issue_date = str(issue[0:4]) + '-' + str(issue[4:6]) + '-' + str(issue[6:8])
    
    # Add the date & relative frequency to our data table
    results_table.loc[results_table['Date'] == issue_date, 'Frequency'] = total_my_freq
    
# Analyses are all done, plot the figure
my_figure = px.line(results_table, x = 'Date', y = 'Frequency')
my_figure.show()
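
The pattern-building lines in the cell above can also be tried on their own. This sketch repeats those lines with the same `year_list` and prints the resulting pattern; the two filenames used to test it are made up.

```python
import re

year_list = ['1917', '1918', '1919']

# join() places the '|' character (meaning "or") between the years
years = '|'.join(year_list)
pattern = '(' + years + ')([0-9]{4})*'
print(pattern)

date_pattern = re.compile(pattern)

# Hypothetical filenames to check the compiled pattern
in_range = bool(date_pattern.match('19180501.txt'))
out_of_range = bool(date_pattern.match('19210501.txt'))
print(in_range, out_of_range)
```

Printing `pattern` is a handy reality check after editing `year_list`, since a missing quotation mark or comma will show up there first.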

## More than one line?

If you want to plot more than one line on a plot, you can use the code below. The code below is set up for plotting two lines for one title of one language, but it could be extended to multiple lines, titles, and languages (but probably not today).

In [None]:
# Change this to directory of the title of interest, be sure to use lowercase 
# and no spaces (it may be easiest to copy & paste from the "directory" 
# column in the table above)
title = 'el-tucsonense' 

# List the years of interest, each enclosed in quotation marks (') and separated
# by commas
year_list = ['1917', '1918', '1919']

# What words are you interested in? You can add as many as you like, 
# just be sure to enclose each in quotation marks (') and separate with a comma
# Also, keep them lower case, even if they are proper nouns
words_1 = ['alemania', 'alemana', 'alemán'] # germany, german (f.), german (m.)
words_1_name = 'German'
words_2 = ['japona', 'japón']
words_2_name = 'Japan'

# Specify the language of the title you are looking at (all lowercase)
# Possible values: english, spanish, arabic, turkish, etc.
language = 'spanish'

################################################################################
# No need to edit anything below here
################################################################################

# Creating the pattern of filenames based on years to match
years = '|'
years = years.join(year_list)
pattern = '(' + years + ')([0-9]{4})*'
date_pattern = re.compile(pattern)


# Location of files with text for a day's paper
volume_path = datapath + title + '/volumes/'
my_volumes = os.listdir(volume_path)

# Use date pattern from above to restrict to dates of interest
my_volumes = list(filter(date_pattern.match, my_volumes))

# Sort them for easier bookkeeping
my_volumes.sort()

# Create a table that will hold the relative frequency for each date
dates = []
for one_file in my_volumes:
    one_date = str(one_file[0:4]) + '-' + str(one_file[4:6]) + '-' + str(one_file[6:8])
    dates.append(one_date)

# Add those dates to a data frame
results_table = pandas.DataFrame(dates, columns = ['Date'])

# Set all frequencies to zero
results_table[words_1_name] = 0.0
results_table[words_2_name] = 0.0

# Cycle over all issues and do relative frequency calculations
for issue in my_volumes:
    issue_text = dc.CleanText(filename = volume_path + issue, language = language)
    issue_text = issue_text.clean_list
    
    # Create a table with words
    word_table = pandas.Series(issue_text)

    # Calculate relative frequencies of all words in the issue
    word_freqs = word_table.value_counts(normalize = True)
    
    # Pull out only values that match words of interest
    words_1_freqs = word_freqs.filter(words_1)
    words_2_freqs = word_freqs.filter(words_2)

    # Get the total frequency for words of interest
    total_words_1 = words_1_freqs.sum()
    total_words_2 = words_2_freqs.sum()
    
    # Format the date from the name of the file so we know where to put
    # the data in our table
    issue_date = str(issue[0:4]) + '-' + str(issue[4:6]) + '-' + str(issue[6:8])
    
    # Add the date & relative frequency to our data table
    results_table.loc[results_table['Date'] == issue_date, words_1_name] = total_words_1
    results_table.loc[results_table['Date'] == issue_date, words_2_name] = total_words_2
    
# Analyses are all done, but we need to transform data to "long" format
results_melt = results_table.melt(id_vars = 'Date', value_vars = [words_1_name, words_2_name])

# By default, two columns created are called "value" and "variable", we want 
# to rename them
results_melt.rename(columns = {'value':'Frequency', 'variable':'Words'}, inplace = True)

# plot the figure
my_figure = px.line(results_melt, x = 'Date' , y = 'Frequency' , color = 'Words')
my_figure.show()

## Comparative analysis

We can also make comparisons between different titles. Here we are going to compare the _Bisbee Daily Review_ and the _Border Vidette_ to see if there is a difference in the coverage of the [mine strike of 1917](https://en.wikipedia.org/wiki/Bisbee_Deportation#Strike).

We start as we did before to filter papers by dates of interest. We are going to focus only on those issues published June through October of 1917.

In [None]:
# Create a pattern that will match papers June - October, 1917
date_pattern = re.compile(r'1917(06|07|08|09|10)([0-9]{2})*')

More regular expressions! We are only looking at papers published in 1917, in the months June (06) through October (10). Similar to the regular expression we saw earlier, this allows us to match those files that:

+ start with "1917",
+ followed by either "06", "07", "08", "09", or "10", (the month specification)
+ followed by two digits (the day specification)
    + `[0-9]` matches any single digit between 0 and 9
    + `{2}` means that there are two consecutive digits, so in this case we are using `[0-9]{2}` as a shortcut for `[0-9][0-9]`
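+ and end with anything. Strictly speaking, the asterisk "\*" is not a wild card: it means "zero or more repetitions of whatever comes right before it" (here, the group of two digits). The rest of the filename is allowed because `match` only checks the beginning of each filename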
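
A quick way to check a pattern like this is to try it on a handful of names. In the sketch below, the filenames in `example_files` are invented, one for each month from May through November 1917.

```python
import re

date_pattern = re.compile(r'1917(06|07|08|09|10)([0-9]{2})*')

# Hypothetical filenames, one per month from May through November 1917
example_files = ['19170505.txt', '19170602.txt', '19170707.txt',
                 '19170811.txt', '19170908.txt', '19171027.txt',
                 '19171103.txt']

# Keep only the names whose beginning matches the pattern
in_range = [name for name in example_files if date_pattern.match(name)]
print(in_range)
```

The May and November names should be dropped, leaving the five names from June through October.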

We match only those files, for each title, that were published during the time period of interest.

In [None]:
# To keep the data from each title separate, we will use a convention where 
# the prefix of each variable name indicates which title the information 
# comes from:
# bv = Border Vidette
# bdr = Bisbee Daily Review

# List all the Border Vidette files
bv_volumes = os.listdir('data/sample/border-vidette/volumes')

# Use date pattern from above to restrict to dates of interest
bv_volumes = list(filter(date_pattern.match, bv_volumes))

# Do a little reality check to make sure we only see files in 
# desired date range.
print('Border Vidette: ' + str(len(bv_volumes)) + ' issues')
print(bv_volumes[0:5])

# List and filter files for Bisbee Daily Review (like above)
bdr_volumes = os.listdir('data/sample/bisbee-daily-review/volumes')
bdr_volumes = list(filter(date_pattern.match, bdr_volumes))

# Another reality check, reporting total number of issues and the 
# first five filenames
print('Bisbee Daily Review: ' + str(len(bdr_volumes)) + ' issues')
print(bdr_volumes[0:5])

As we did above with the search for influenza terms, we will look for words related to the strike in Bisbee, and calculate the relative frequency for each issue of each title. We start by defining the terms to look for.

In [None]:
# A list of the words of interest
strike_words = ['strike', 'strikes', 'striker', 'strikers']

We start by looking at issues of _Border Vidette_.

In [None]:
# For all Border Vidette volumes that matched our date criteria, calculate 
# the relative frequency of 'strike' and related words

# This variable will hold relative frequency for each day's paper
bv_strike_freq = []

# The location where text for each issue is stored
bv_file_locations = datapath + 'border-vidette/volumes/'

# Loop over each issue, calculating relative frequency
for one_issue in bv_volumes:
    # Read in the cleaned text (stopwords and punctuation removed)
    issue_text = dc.CleanText(filename = bv_file_locations + one_issue)
    issue_text = issue_text.clean_list
    
    # Create a table with all the words in the issue
    word_table = pandas.Series(issue_text)
    
    # Calculate relative frequency of each word in the issue
    word_freqs = word_table.value_counts(normalize = True)
    
    # Pull out only values for strike related words
    strike_freqs = word_freqs.filter(strike_words)
    
    # Add those frequencies to our list of values for Border Vidette
    bv_strike_freq.append(strike_freqs.sum())

# Do a reality check to look at first five values
print(bv_strike_freq[0:5])

Now repeat the process of calculating relative frequencies for _Bisbee Daily Review_.

In [None]:
# For all Bisbee Daily Review volumes that matched our date criteria, 
# calculate the relative frequency of 'strike' and related words

# This variable will hold relative frequency for each day's paper
bdr_strike_freq = []

# The location where text for each issue is stored
bdr_file_locations = datapath + 'bisbee-daily-review/volumes/'

# Loop over each issue, calculating relative frequency
for one_issue in bdr_volumes:
    # Read in the cleaned text (stopwords and punctuation removed)
    issue_text = dc.CleanText(filename = bdr_file_locations + one_issue)
    issue_text = issue_text.clean_list
    
    # Create a table with all the words in the issue
    word_table = pandas.Series(issue_text)
    
    # Calculate relative frequency of each word in the issue
    word_freqs = word_table.value_counts(normalize = True)
    
    # Pull out only values for strike related words
    strike_freqs = word_freqs.filter(strike_words)
    
    # Add those frequencies to our list of values for Bisbee Daily Review
    bdr_strike_freq.append(strike_freqs.sum())

# Do a reality check to look at first five values
print(bdr_strike_freq[0:5])

Now that we have the relative frequencies for each of the titles, we can calculate some summary statistics, including the average relative frequency in each issue for each title.

In [None]:
# Import the mean function from the statistics package
from statistics import mean

# Calculate average relative frequency of Border Vidette issues
bv_mean = mean(bv_strike_freq)

# Print the value, using format instead of str to avoid scientific notation
print(format(bv_mean, 'f') + ' Border Vidette')

# Calculate average relative frequency of Bisbee Daily Review issues and print 
# to screen
bdr_mean = mean(bdr_strike_freq)
print(format(bdr_mean, 'f') + ' Bisbee Daily Review')

Comparing these means, we see that the relative frequency of strike-related words was higher in issues of the _Bisbee Daily Review_ than in issues of the _Border Vidette_. In fact, it looks like the relative frequency of strike words in the _Bisbee Daily Review_ was ten-fold higher than in the _Border Vidette_.

Finally, we need to run a statistical test to see if those means are significantly different. For our purposes, we can use a two-sample _t_-test.

In [None]:
# The scipy package has a stats module that allows us to run a t-test
from scipy import stats

# Run the test, assuming unequal variances
compare_strike = stats.ttest_ind(bv_strike_freq, bdr_strike_freq, equal_var = False)

# Extract values of interest: the t statistic and the p-value
t_value = compare_strike[0]
p_value = compare_strike[1]

# Print test statistics
print('t = ' + format(t_value, '.3f')) # normal formatting
print('p = ' + format(p_value, '.3e')) # scientific notation

In this case we can conclude the relative word frequency of strike-related words _was_ significantly higher in the _Bisbee Daily Review_ than in the _Border Vidette_.
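
For the curious, the _t_ statistic from a test with `equal_var = False` (usually called Welch's _t_-test) can be worked out by hand: it is the difference between the two means divided by a combined standard error. The sketch below does this with Python's built-in `statistics` module on made-up numbers; `group_a` and `group_b` are invented for illustration and are not values from the newspapers.

```python
from math import sqrt
from statistics import mean, variance

# Made-up relative frequencies, for illustration only
group_a = [0.001, 0.002, 0.003]
group_b = [0.004, 0.006, 0.008, 0.006]

# Combined standard error: each group's sample variance divided by its
# size, added together, then square-rooted
standard_error = sqrt(variance(group_a) / len(group_a) +
                      variance(group_b) / len(group_b))

# Welch's t: difference in means divided by the combined standard error
example_t = (mean(group_a) - mean(group_b)) / standard_error
print(format(example_t, '.3f'))
```

A negative value here simply means the first group has the smaller mean. Turning the statistic into a _p_-value requires the _t_ distribution, which is the part `scipy` takes care of.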

Whew. We're done.

If you want more practice or want to do some analyses with a larger data set, head over to the lesson at [https://mybinder.org/v2/gh/jcoliver/dig-coll-borderlands/master?filepath=Text-Mining-Template.ipynb](https://mybinder.org/v2/gh/jcoliver/dig-coll-borderlands/master?filepath=Text-Mining-Template.ipynb).

If you want even _more_ resources, check these out:

+ Examples of text data mining:
    + Benjamin Schmidt and Mitch Fraas. [The Language of the State of the Union](https://www.theatlantic.com/politics/archive/2015/01/the-language-of-the-state-of-the-union/384575/). _The Atlantic_ (January 18, 2015). 
    + Lincoln Mullen. [America’s Public Bible: Biblical Quotations in U.S. Newspapers](https://americaspublicbible.org/). (2016).
+ Text data mining in Python:
    + Quinn Dombrowski, Tassie Gniady, and David Kloster, "Introduction to Jupyter Notebooks," The Programming Historian 8 (2019), [https://doi.org/10.46430/phen0087](https://doi.org/10.46430/phen0087).
    + William J. Turkel and Adam Crymble, "Counting Word Frequencies with Python," The Programming Historian 1 (2012), [https://doi.org/10.46430/phen0003](https://doi.org/10.46430/phen0003).

If you have any questions or comments on this lesson, look at the [project's GitHub page](https://github.com/jcoliver/dig-coll-borderlands) and open a new issue if you don't find an answer there.

[![](https://mirrors.creativecommons.org/presskit/buttons/88x31/svg/by.svg)](https://creativecommons.org/licenses/by/4.0/legalcode)

This lesson is licensed under a [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/legalcode) 2020 to Jeffrey C. Oliver. 