# Introduction to text mining

After this lesson, students will be able to:

1. Explain the difference between a code block and a markdown block in Jupyter Notebooks
2. Write a short command in Python and execute the command
3. Execute pre-written Python code to create a word frequency through time plot
4. Explain what a file path is
5. Describe how to choose words for searching
6. Update Python code to perform word frequency calculations on a different newspaper title
7. Create a plot showing word frequency over time on a chosen newspaper and time frame

## 1. Explain the difference between a code block and a markdown block in Jupyter Notebooks
Jupyter notebooks allow us to mix computer programming code with readable text. Each notebook is broken into several "blocks", or sections. A block can have _either_ code (a code block) or text (a markdown block). The text you are reading right now is in a markdown block, but we will start by looking at code blocks. We can tell the difference because code blocks are shown with gray shading. We can _run_, or execute, a block by clicking the gray box and pressing the &#x25B6; Run button at the top of the window.

In [None]:
print("Welcome to Python")

We will work more with code blocks in a bit, but what about markdown blocks? Markdown blocks are where we can write text and use some simple styling. You can change the text of a markdown block by clicking on the block, pressing "Enter", and then editing the content. Try this out on this markdown block; when you have finished adding the text, press the Run button again to have the text formatted.

We will not be doing too much work with markdown blocks, but they can be very useful if you want to write a report and include the code and results of your work.

## 2. Write a short command in Python and execute the command
Click on the code block below and type in the command `print(My name is Jeff)`, but replace "Jeff" with your name. When you finish writing the command, execute the code block by pressing the Run button.

In [None]:
# Write a print command to print your name

You probably got a syntax error. Why? Compare your command to the first print command that we ran at the beginning of the lesson. What is different? In the code block below, re-write the command to print your name with the syntax we used at the beginning of the lesson.

In [None]:
# Re-write the command to print your name

This exercise is meant to demonstrate that when we pass computers text information, like "My name is Jeff", we need to wrap the text in quotation marks. Missing quotation marks are a common mistake when doing text data mining in Python.

You also probably noticed the instructions in the code block start with a pound sign (`#`) at the beginning of the line. This is known as the comment character, and is how we can write human-readable notes in a code block.
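
As a small sketch of both ideas together, the block below stores quoted text in a variable and uses comments to explain each step. The variable name `message` is only an illustration and is not used elsewhere in the lesson.

```python
# Lines that start with a pound sign are comments; Python ignores them
message = "My name is Jeff"  # the quotation marks mark this as text

# Print the text stored in the variable
print(message)

# Without the quotation marks, Python tries to read the words as code,
# which is what causes the syntax error:
# print(My name is Jeff)
```

The last line is left as a comment so the block runs without an error; removing the pound sign in front of it should bring the syntax error back.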

## 3. Execute pre-written Python code to create a word frequency through time plot

We are going to have more opportunities to play with Python code, but let us pause for a moment to consider what we are going to be doing in terms of text data mining. For this project, we have three newspaper titles published in southern Arizona. Those newspapers were scanned and the text was digitized by the Library of Congress. (More information about the newspapers and the digitization process can be found at [https://chroniclingamerica.loc.gov/](https://chroniclingamerica.loc.gov/)).

What _we_ are going to do is look at how often certain words appear in these papers over time. This allows us to identify trends without actually reading four years' worth of newspapers.

There are many more things you can do with text data mining in Python, but they are beyond the scope of this workshop. If you are interested, Library Carpentry has an introduction to text mining lesson at [http://librarycarpentry.org/lc-tdm/index.html](http://librarycarpentry.org/lc-tdm/index.html).

But for now, let us start by looking for the frequency of the word "influenza" in the _Bisbee Daily Review_ in the years right before and during the global influenza pandemic.

We need to provide the following pieces of information:

+ The name of the folder to look in corresponding to the newspaper of interest
+ The years to consider, here 1917 and 1918
+ The words to look for, here "influenza"
+ The language the newspaper is written in, in this case English

In [None]:
# Store information in variables
title = "bisbee-daily-review"
year_list = ['1917', '1918']
my_words = ['influenza']
language = "english"

We can check our work by writing `print` commands on all of these variables. The first code block is done for you; fill in the remaining three.

In [None]:
# Print value stored in title variable
print(title)

In [None]:
# Print value stored in year_list

In [None]:
# Print value stored in my_words

In [None]:
# Print value stored in language

We also need to tell Python that there are additional packages to load. What does this mean? We will need to use additional programs, written in Python, to perform our text data mining. No need to change anything in the code block, but you will need to execute the block (press the Run button at the top of the window).

In [None]:
# Load additional packages
# for data tables
import pandas

# for file navigation
import os

# for pattern matching in filenames
import re

# for text data mining
import nltk

# for stopword corpora for a variety of languages
from nltk.corpus import stopwords

# for splitting data into individual words
from nltk.tokenize import RegexpTokenizer

# for automated text cleaning
import digcol as dc

# download the stopwords for several languages
nltk.download('stopwords')

# for drawing the plot
import plotly.express as px

The next code block does all the heavy lifting of reading in the text of each day's newspaper, calculating the relative frequency of the word influenza, and finally drawing a plot of the frequency over time. You do not need to change anything in the code block, but you should run it by clicking on the code block and pressing the Run button (or clicking on it, holding down the Control key, and pressing "Enter").

In [None]:
################################################################################
# No need to edit anything in this code block
################################################################################

# Creating the pattern of filenames based on years to match
years = ")|("
years = years.join(year_list)
pattern = "((" + years + "))([0-9]{4})*"
date_pattern = re.compile(pattern)

# Location of files with text for a day's paper
volume_path = "data/sample/" + title + "/volumes/"
my_volumes = os.listdir(volume_path)

# Use date pattern from above to restrict to dates of interest
my_volumes = list(filter(date_pattern.match, my_volumes))

# Sort them for easier bookkeeping
my_volumes.sort()

# Create a table that will hold the relative frequency for each date
dates = []
for one_file in my_volumes:
    one_date = str(one_file[0:4]) + "-" + str(one_file[4:6]) + "-" + str(one_file[6:8])
    dates.append(one_date)

# Add those dates to a data frame
results_table = pandas.DataFrame(dates, columns = ["Date"])

# Set all frequencies to zero
results_table["Frequency"] = 0.0

# Cycle over all issues and do relative frequency calculations
for issue in my_volumes:
    issue_text = dc.CleanText(filename = volume_path + issue, language = language)
    issue_text = issue_text.clean_list
    
    # Create a table with words
    word_table = pandas.Series(issue_text)

    # Calculate relative frequencies of all words in the issue
    word_freqs = word_table.value_counts(normalize = True)
    
    # Pull out only values that match words of interest
    my_freqs = word_freqs.filter(my_words)
    
    # Get the total frequency for words of interest
    total_my_freq = my_freqs.sum()
    
    # Format the date from the name of the file so we know where to put
    # the data in our table
    issue_date = str(issue[0:4]) + "-" + str(issue[4:6]) + "-" + str(issue[6:8])
    
    # Add the date & relative frequency to our data table
    results_table.loc[results_table["Date"] == issue_date, "Frequency"] = total_my_freq
    
# Analyses are all done, plot the figure
my_figure = px.line(results_table, x = "Date", y = "Frequency")
my_figure.show()
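
To see what the filename pattern in the code block above does, here is a minimal sketch that runs the same pattern on a few made-up filenames. The names in `example_files` are hypothetical; the only assumption taken from the code above is that each filename starts with an eight-digit date (year, month, day).

```python
import re

# Hypothetical filenames, assumed to start with a YYYYMMDD date
example_files = ["19161231.txt", "19170104.txt", "19181111.txt", "19190101.txt"]
year_list = ['1917', '1918']

# Same construction as above; the result is "((1917)|(1918))([0-9]{4})*"
years = ")|(".join(year_list)
pattern = "((" + years + "))([0-9]{4})*"
date_pattern = re.compile(pattern)

# match() only looks at the start of each name, so names that begin
# with a year outside year_list are dropped
kept = list(filter(date_pattern.match, example_files))
print(kept)

# Slice the first eight characters into a YYYY-MM-DD date
dates = [name[0:4] + "-" + name[4:6] + "-" + name[6:8] for name in kept]
print(dates)
```

Only the 1917 and 1918 files are kept, and their dates are printed in the same format used for the "Date" column of the results table.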

## 4. Explain what a file path is

Need to have a nice graphic showing file paths?
```
data  
  + sample  
      + border-vidette  
      + bisbee-daily-review  
      + el-tucsonense  
```
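
A file path is the list of folders Python has to walk through, in order, to reach a file or folder. The sketch below rebuilds the path used in the word frequency code and shows an alternative using `os.path.join`; the variable name `joined` is only an illustration. Whether the folder actually exists depends on where the notebook is running, so the last line may print `False` on your computer.

```python
import os

title = "bisbee-daily-review"

# The lesson code builds the path by gluing text together with "/"
volume_path = "data/sample/" + title + "/volumes/"
print(volume_path)

# os.path.join builds a path from its parts, using the separator
# that is appropriate for the operating system
joined = os.path.join("data", "sample", title, "volumes")
print(joined)

# Check whether that folder can be found from the current location
print(os.path.isdir(volume_path))
```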

## 5. Describe how to choose words for searching

Need to have students recognize that it is doing _exact_ matching and what the ramifications are for this.
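
The sketch below shows exact matching on a tiny, made-up list of words standing in for one cleaned newspaper issue. It uses the same two pandas steps as the word frequency code (`value_counts(normalize = True)` and `filter`); the word list and the variable names `only_influenza`, `with_flu`, and `with_typo` are invented for illustration.

```python
import pandas

# A made-up list of cleaned words standing in for one issue
issue_words = ["influenza", "closes", "schools", "flu", "cases", "rise",
               "influenza", "spreads", "city", "council"]

# Relative frequency: count of each word divided by the total word count
word_freqs = pandas.Series(issue_words).value_counts(normalize = True)

# filter() keeps only the words that are identical to a search word
only_influenza = word_freqs.filter(["influenza"]).sum()
with_flu = word_freqs.filter(["influenza", "flu"]).sum()
with_typo = word_freqs.filter(["influenzas"]).sum()

print(only_influenza)
print(with_flu)
print(with_typo)
```

Searching for "influenza" alone misses the one occurrence of "flu", so the frequency goes up when "flu" is added to the list. A search word that never appears in exactly that form, such as "influenzas", contributes nothing.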

In [None]:
# A copy of the word frequency code
# During the lesson, add the word "flu" to the list of words being searched for

## 6. Update Python code to perform word frequency calculations on a different newspaper title

In [None]:
# Code for plotting, but will need to change the newspaper title 
# For notebook, leave it as a copy of what we did before, but during 
# the lesson, change:
#    + newspaper title to el-tucsonense
#    + language to spanish

## 7. Create a plot showing word frequency over time on a chosen newspaper and time frame

In [None]:
# Code for plotting, with variables for students to change: title, language, years, words

Something to indicate the close of the lesson