# Lab 4: Python and Jupyter Notebooks
## ENGL 6701 Spring 2024

Contact:
Lindsay Thomas, lthomas@cornell.edu

For more information about this lab and the context in which it was used in ENGL 6701, see [the Lab 4 page on the ENGL 6701 Spring 2024 course website](https://lindsaythomas.net/engl6701s24/labs/lab-4.html).

### 0. Create and Save a Notes Document

As with Lab 3, you will want to create and save a notes document before beginning this lab. Starting with section 5, you will be asked to write down your answers to specific questions in your lab notebook entry. As you complete these portions of the lab, take notes on your response in your notes document, and you can use these notes later in writing your lab notebook entry.

Since we are using Binder to work with this notebook, please note that none of the changes you make will be saved after your browser session. This includes outputs that are displayed to the notebook, as well as things like inserting filenames. When you shut this tab down or your laptop loses its connection to the internet or server, you will need to restart this notebook, and it will be like restarting from the beginning. That's ok, as this notebook doesn't require you to save anything to complete the lab. Just know that if you want to return to it in completing your lab, you may need to rerun cells in order to see their outputs again.

### 1. Running Code in a Jupyter Notebook

In a Jupyter notebook (what you are looking at now), code, outputs of the code such as visualizations, and writing about the code exist together in one place. A Jupyter notebook is organized via cells; if you double-click on this text right now, you will see that it turns into a cell containing Markdown text. You can "run" this cell (i.e., display the formatted text again instead of the plain-text Markdown) by selecting the cell and clicking the "Run" button at the top of the notebook (an icon that looks like a "play" button), by selecting the cell and then going to the Run menu and selecting "Run Selected Cells," or by selecting the cell and pressing Control + return/enter.

You can tell if a cell contains executable code because, apart from the fact that it will include Python code, it will also include a pair of brackets `[]` to the left of the cell. If you see brackets and they are empty, this means that cell hasn't yet been run. If you see brackets next to a cell and the brackets include a number, this means the cell has been run. You can run these cells in any of the ways described above: the "Run" button at the top of the notebook, the "Run Selected Cells" option in the Run menu, or Control + return/enter. Try it out on the next cell:

In [None]:
print("Yay I ran a cell!")

Make sure to run every cell with code in this notebook as you proceed through it. You can tell if a cell is still running if an asterisk appears between the brackets to the left of the cell, like this: `[*]`.

You can tell when your Jupyter notebook is connected to the server and ready to execute code because the dot in the upper right-hand corner of the notebook (next to the words "Python 3 (ipykernel)") will be transparent. If you try to run a cell and it is taking a long time, or if it doesn't appear to be working, look up at this dot. If it's filled in and grey, this may mean you have lost connection with the server (or the "kernel"). To restart your session, go to Kernel > Restart Kernel. If you have to restart the kernel (for example, if you shut down your browser tab or your laptop goes to sleep), you lose any variables you have assigned during your session.

### 2. Variables and Python Syntax

Portions of this section are drawn from the ["Variables" section of Chapter 2, "Python Basics,"](https://melaniewalsh.github.io/Intro-Cultural-Analytics/02-Python/04-Variables.html) from Melanie Walsh's free online textbook, *Introduction to Cultural Analytics & Python*.

A variable is how you store data and information in Python. A variable is like a little box that you use to store filenames, words, numbers, collections of words and numbers, and more. Basically any values or data that you want your code to interact with or produce, you need to store in variables.

There are rules about what characters variables can include in Python and about what words you can use for variables (see the ["Variables" section of Walsh's textbook](https://melaniewalsh.github.io/Intro-Cultural-Analytics/02-Python/04-Variables.html#variable-names) for more on these rules). In general, you should strive to give your variables clear and precise names that you (or others) will understand later.

You assign variables with an equals `=` sign. In Python, a single equals sign `=` is the “assignment operator.” A double equals sign `==` is the “real” equals sign: it checks whether two values are equal and gives back `True` or `False`. Run the cells below to see what I mean.

In [None]:
new_variable = 100
print(new_variable)

In [None]:
newer_variable = 2 * 2 == 4
print(newer_variable)

We can check to see what’s “inside” variables by running a cell with the variable’s name. This is one of the handiest features of a Jupyter notebook (although a notebook cell will only print the *last* value listed in the cell; for other values, you still need to use the `print()` function). Outside the Jupyter environment, you would need to use the `print()` function, as I have done above, to display the variable. Try it by running the below cell:

In [None]:
another_variable = "I'm another variable!"
another_variable

You can also use variables in Python to store the *output* of code so that you can access that output later. For example, run the 2 cells below in sequence:

In [None]:
output = 2 * 2
output

In [None]:
final_result = output + 2
final_result

### 3. Python Data Types and Structures

As emphasized above, variables can store all kinds of information types. 

#### Strings

The variable below stores what's called a "string" in Python, which is textual data.

In [None]:
string_variable = "I'm textual data"
string_variable

#### Integers and Booleans

Other data types include integers (whole numbers) and Booleans (True/False) (there are more data types than this, but this is fine for today):

In [None]:
integer_example = 37
integer_example

In [None]:
boolean_example = 2 * 2 == 4
print(boolean_example)

boolean_example_2 = 2 * 2 == 6
print(boolean_example_2)
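If you are ever unsure which data type a value is, Python's built-in `type()` function will tell you. The cell below is a minimal sketch; the variable names are made up for illustration.

```python
# type() reports the data type of whatever value you give it
string_type = type("I'm textual data")
integer_type = type(37)
boolean_type = type(2 * 2 == 4)

print(string_type)
print(integer_type)
print(boolean_type)
```

Python abbreviates the names of these types: a string is `str`, an integer is `int`, and a Boolean is `bool`.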

#### Lists and For Loops

There are also a variety of data structures in Python, or ways to organize and format data. The ones you will encounter today include "lists" and "dictionaries."

In [None]:
list_example = [1, 2, 3]
list_example

As you can see, a list in Python is a data structure that contains an ordered sequence of values. You know a list is a list in Python because it's contained within brackets `[]`. When you see something like this in a Python script:

`list_example = []`

This means that this variable, `list_example` in this case, is being "defined" as a list; later on in the script, values will be placed in that list in a particular order.

You can iterate through the contents of a list using what's called a "for loop." For loops allow you to operate on each item in the list one at a time. The syntax is in the cell below; what happens when you run it?

In [None]:
list_example = [1, 2, 3]

for item in list_example:
    print(item)

As we can see, this code is saying, "for each item in the variable `list_example` (which is a list), iterate through and print each one."
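For loops are also how an empty list like `list_example = []` usually gets filled in later in a script. Here is a minimal sketch of that pattern; the variable name `doubled_numbers` is hypothetical and used only for illustration.

```python
# Start with an empty list, then fill it one item at a time
doubled_numbers = []

for item in [1, 2, 3]:
    # append() adds a new value to the end of the list
    doubled_numbers.append(item * 2)

print(doubled_numbers)
```

Because lists are ordered, the values come out in the same order in which they were appended.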

#### Dictionaries

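A dictionary is another way to organize data in Python; it is a collection of (usually) paired values that you look up by name rather than by position. (In current versions of Python, a dictionary also keeps its pairs in the order in which they were added.) Dictionaries are frequently used in text analysis to store word frequency data. For example (run the below cell):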

In [None]:
dictionary_example = {"color": 3, "me": 2, "your": 2, "baby": 1, "car": 1} 
dictionary_example

The dictionary above, `dictionary_example`, contains word frequency data for the first 2 lines of Blondie's song "Call Me" ("Color me your color baby, Color me your car"). As we can see, the word "color" appears 3 times in that string, "me" appears 2 times, "your" appears 2 times, "baby" once, and "car" once.

You know a dictionary is a dictionary in Python because it's contained within curly brackets `{}`. When you see something like this in a Python script:

`dictionary_example = {}`

This means that this variable, `dictionary_example` in this case, is being "defined" as a dictionary; later on in the script, values will be placed in that dictionary in a particular order. 
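To make that concrete, here is one possible way (a sketch, not the only approach) to start from an empty dictionary and fill it with word counts. The list `lyric_words` is an assumption: it holds the words of the two "Call Me" lines, lowercased and split up by hand.

```python
# The words from the two lines, lowercased and with punctuation removed by hand
lyric_words = ["color", "me", "your", "color", "baby", "color", "me", "your", "car"]

# Start with an empty dictionary
word_counts = {}

for word in lyric_words:
    # get() returns the current count for this word, or 0 if it isn't a key yet
    word_counts[word] = word_counts.get(word, 0) + 1

print(word_counts)
```

Each time the loop sees a word, it either creates a new entry for that word or adds 1 to the count already stored there.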

You can also create a dictionary in Python from two already existing lists using the built-in `dict()` and `zip()` functions, like this:

`dictionary_example = dict(zip(list1, list2))`

Run the below code to see what I mean:

In [None]:
list1 = ["color", "me", "your", "baby", "car"]
list2 = [3, 2, 2, 1, 1]

dictionary_example = dict(zip(list1, list2))
dictionary_example

The pairs in a dictionary are often called "key:value" pairs. In the example above, the keys are the words, while the values are the numbers, and the first key:value pair is `"color": 3`. You can use methods built into Python dictionaries to access just the keys or just the values of any given dictionary. For example:

In [None]:
terms = dictionary_example.keys()
print(terms)

counts = dictionary_example.values()
print(counts)
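You can also look up the value stored under a single key, or loop through keys and values together. The cell below is a short sketch of both; `color_count`, `term`, and `count` are names chosen just for this example.

```python
dictionary_example = {"color": 3, "me": 2, "your": 2, "baby": 1, "car": 1}

# Put a key in square brackets to get the value stored under it
color_count = dictionary_example["color"]
print(color_count)

# items() hands back each key together with its value
for term, count in dictionary_example.items():
    print(term, count)
```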

### 4. File Handling in Python

If you have a file containing data you want to interact with using Python, you need to "read" that file "into" Python (i.e., you need to open it using Python so your code can access its contents). In Python 3 (the version of Python we are using here), opening up a file to read its contents looks like this:

In [None]:
example_file = "example_file.txt"

with open(example_file, 'r') as fin:
    for line in fin:
        print(line)

Let's break this process down. 

1. If you look in [this lab's repository](https://github.com/lcthomas/engl6701s24-lab4) (and/or to the left of this notebook in the JupyterLab file interface), you will find a file there named `example_file.txt`. This is the file we are opening with this code. The first line above assigns the filepath of the file we want our code to read to the variable `example_file`. This tells our code where to look for the file we want to open.
2. The next line of code (line 3) uses Python's `open()` function to open that file. A few things are happening here:
    - The syntax `with open()` is a short-hand way to open the file and then automatically close it once Python has read its contents. In older versions of Python, you had to make sure to both open files and then close them when you were done with them, but this syntax automatically closes files.
    - We can see that the thing being opened is assigned to the `example_file` variable.
    - The `r` after the variable name tells Python to "read" the contents of the file. There are various options you can use here; to explore others, Google something like "Python file open options." In fact, we don't need to explicitly include the `r` here, as it's the default option with the `open()` function.
    - The `as fin` portion of the code is assigning the opened file, and so access to the data contained within "example_file.txt," to the variable `fin` (this variable is often used by people writing Python to mean "file-input").
3. Line 4, which must be indented, tells the code to read the input file line by line. As you can see if you take a look at "example_file.txt" in this lab's GitHub repo, there is only one line in this file. Note: we could have used anything for the `line` variable here; the code would work just as well if it read `for x in fin:`. But I've used the variable `line` to remind myself that what I am doing here is reading the file line by line.
4. Line 5, which again must be indented, tells Python to print the contents of each line in sequence (again, there is only one line).
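The steps above can be sketched in a way that runs anywhere, without relying on the files in this lab's repository. In the cell below, the filename `sample_file.txt` and its contents are made up for illustration, and the `'w'` ("write") option is used first to create the file.

```python
import os
import tempfile

# Create a small stand-in file in a temporary folder
temp_dir = tempfile.mkdtemp()
sample_file = os.path.join(temp_dir, "sample_file.txt")

with open(sample_file, 'w') as fout:
    fout.write("first line\nsecond line\n")

# Read the file back in, line by line
lines_read = []
with open(sample_file, 'r') as fin:
    for line in fin:
        # strip() removes the newline character at the end of each line
        lines_read.append(line.strip())

print(lines_read)

# Once the indented block under "with" ends, the file has been closed for us
print(fin.closed)
```

Each line in a file ends with an invisible newline character. Using `strip()` removes it, which is one way to avoid the extra blank line you may notice when you print lines straight from a file.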

#### Now you try it

There is another plain-text file in this repository; the filepath is `"another_example_file.txt"`. The code in the cell below, like the code in the cell above, will read the contents of the file and print them line by line. What do you need to do to the code in the cell below to read in that file?

In [None]:
example_file = # delete this comment and insert the filename you want here, put it in quotes

with open(example_file, 'r') as fin:
    for line in fin:
        print(line)

If you got a `SyntaxError` after running the above cell, this means you didn't delete the comment and insert the filename you want in its place. If you got a `NameError`, this may be because you didn't put the name of the file you want to open in quotes.

### 5. Anatomy of a Python Function

If we have a chunk of code that we are going to re-use or package inside another chunk of code, it's often useful to write that code as a function. Let's break down the different parts of the below function:

In [None]:
import re

def split_into_words(any_chunk_of_text):
    # split the string into individual terms and make them all lowercase
    words = re.split("\W+", any_chunk_of_text.lower())
    return words 

any_chunk_of_text = "Okay, okay, ladies, now let's get in formation, cause I slay"

all_the_words = split_into_words(any_chunk_of_text)
print(all_the_words)

#### Import packages

In the first line, we are telling Python to `import` code that other people have written into our code so that we can use some of that other code's functions (make sure to run all of the code cells in this section, or the cell where you call the function won't work):

In [None]:
import re

We call the code written and packaged up by other people a “library,” “package,” or “module.” For now it's enough to know that you import libraries/packages/modules at the very top of a Python script for later use. Importing existing libraries/packages/modules will save you time and do a lot of work behind the scenes.

In the above function, we are importing the [`re` package, which handles regular expressions](https://docs.python.org/3/library/re.html). 

#### Define the function

Next we define the function:

In [None]:
def split_into_words(any_chunk_of_text):
    # split the string into individual terms and make them all lowercase
    words = re.split("\W+", any_chunk_of_text.lower())
    return words 

In this case, it's a function called `split_into_words`, which takes in any chunk of text, transforms that text to lower-case, and splits the text into a list of "clean" words without punctuation or spaces. It then returns that list as the output of the function. When you ran the above cell, you probably noticed that, while you can see a number in the brackets to the left of the cell, indicating the cell ran, nothing happened. This is because we’re not actually using the function yet, just defining it (i.e., telling the computer what the function does).

You may also notice that the above code contains a line that is commented out with a hashtag `#`. This line is colored green. Comments are non-executable; the hashtag (`#`) tells the computer to skip that line. Comments are usually used to quickly explain what the code below is doing. If you want to write a long comment, you can place it between a series of three apostrophes on either side of the comment, as below (when this kind of long comment appears directly under the line that defines a function, it is usually called a "docstring"):

In [None]:
def split_into_words(any_chunk_of_text):
    '''A long comment a very long comment quite long indeed this is really long I'm writing 
    so much I need another line'''
    words = re.split("\W+", any_chunk_of_text.lower())
    return words
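One practical difference between the two kinds of comment: a triple-quoted string placed directly under the `def` line is stored with the function, so you can ask Python to show it to you later. The cell below is a small sketch of this. The wording of the docstring is my own, and the `r` in front of the regular expression is an optional addition explained in the code comment.

```python
import re

def split_into_words(any_chunk_of_text):
    '''Splits a chunk of text into a list of lowercase terms.'''
    # the r before the quotation marks makes this a "raw" string, so Python
    # passes the backslash in \W straight through to the re package
    words = re.split(r"\W+", any_chunk_of_text.lower())
    return words

# a docstring is stored with the function; a # comment is not
print(split_into_words.__doc__)
```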

A final thing to note about this function is the indentation of lines 2-4, or all of the lines following the first line defining the function. In Python, indentation is very important. Try to run the below code and see what happens:

In [None]:
def split_into_words(any_chunk_of_text):
# split the string into individual terms and make them all lowercase
words = re.split("\W+", any_chunk_of_text.lower())
return words 

You should see an `IndentationError`, which alerts you to the fact that when writing functions, you need to indent the lines after the first line of the function. In Python, you also indent `for` loops and `if...else` clauses, both of which you will see in the final section of this notebook.
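Here is a minimal sketch of how indentation works with a `for` loop and an `if...else` clause together. The list of numbers and the variable names are invented for this example.

```python
word_lengths = [5, 2, 4]
labels = []

for length in word_lengths:
    # everything indented under the for line runs once per item
    if length > 3:
        # indented again: this runs only when the condition is True
        labels.append("longer")
    else:
        # this runs only when the condition is False
        labels.append("shorter")

# back at the left margin: this line runs once, after the loop has finished
print(labels)
```

Each level of indentation tells Python which lines "belong" to the line above them.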

#### Assign variables

Next we assign the variables we need to make the function work:

In [None]:
any_chunk_of_text = "Okay, okay, ladies, now let's get in formation, cause I slay"

This is the text that we will use as input for our function. But notice how this specific input is separated out from the operations of the code within the function (i.e., we don't *have* to use this specific text in order to run the function). We could use any chunk of text in this function. In fact, that is the utility of writing functions: they allow you to separate out specific inputs from the operations performed on those inputs, and hence to repeat code in different contexts.
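To see this separation of inputs from operations in a simpler setting, here is a sketch using a different, hypothetical function, `count_characters`, which is not part of this lab's code. The same function runs on two different inputs without any change to the function itself.

```python
def count_characters(any_chunk_of_text):
    # len() returns the number of characters in a string, including spaces
    return len(any_chunk_of_text)

# the same operations, two different inputs
first_count = count_characters("Call me")
second_count = count_characters("Okay, okay")

print(first_count, second_count)
```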

#### Call the function

Now we call the function, i.e., tell the computer to run it. Notice here that we are storing the results of the function in the variable `all_the_words` and then printing the value of that variable (i.e., the output of the function) in the second line.

In [None]:
all_the_words = split_into_words(any_chunk_of_text)
print(all_the_words)

If you got a `NameError` that indicates, for example, that `name 'split_into_words' is not defined`, this means you need to define the function using the cell under "Define the function" above.

#### Question 1
What do you notice about how this function has split the string "Okay, okay, ladies, now let's get in formation, cause I slay"? What has it done that isn't quite right, and why has it done this? Write down your response in your notes document.

Hint: Look at the regular expression being used in this line of code from the function:<br/>
`words = re.split("\W+", any_chunk_of_text.lower())`

What does the regular expression `"\W+"` mean in Python? To find this out, you can [search the `re` package documentation](https://docs.python.org/3/library/re.html). Search the page (command/control + F in a web browser) for the term `\W`. What does it mean? Do the same for the `+` sign. What does it do? So what do these symbols mean when used together in a regular expression?

#### Now you try it

Let's run the `split_into_words` function with a chunk of text of your choosing.

In [None]:
import re

def split_into_words(any_chunk_of_text):
    # split the string into individual terms and make them all lowercase
    words = re.split("\W+", any_chunk_of_text.lower())
    return words 

any_chunk_of_text = # insert your text here, making sure to put it in quotation marks

all_the_words = split_into_words(any_chunk_of_text)
print(all_the_words)

If you got a `SyntaxError` after running the above cell, this means you didn't delete the comment and insert the string (or chunk of text) you want in its place. If you got a `NameError`, this may be because you didn't put your string in quotes.

#### Question 2
What happened? Did it work as you expected? If not, what happened that you *didn't* expect? Write down your response in your notes document.

### 6. The Function I Wrote to Count the Use of "Ver" vs "Vir" Spellings in EEBO Texts

Now we are ready to look at the function I wrote to answer the question described on the ["Lab 4" page on our course website](https://lindsaythomas.net/engl6701s24/labs/lab-4.html). I have modified this code slightly to save the output as a dataframe rather than a .csv file.


The first thing to know is that this function is exceedingly ugly! Don't be intimidated by its length; I've included a lot of comments below describing what almost every line of the function does. One thing to note about this function is that it's really *not* very efficient -- it's not the best possible way to perform this task -- and so it's slow. I wrote it only about six months after learning Python. Looking back at it now, I can identify places in the code where I could be less verbose and more efficient, or where I could structure the steps differently in order to speed up processing. But you know what? Oh well! Although the code is slow, it works, I understand it, and it accomplished the task I needed it to accomplish.

All you need to do next is: 
1. Run the below cell. 
2. Read through each line of the function. If you don't understand the specific Python syntax of each line, that's ok. Instead, the goal is to understand the "transformations" (to use Schmidt's vocabulary) wrought by each line. As you read, try to understand what is happening in each step.
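Before you read the full function, it may help to see its central step on its own: searching a chunk of text for two lists of words and counting the hits. The cell below is a simplified sketch, not the function itself. The sentence stored in `text` is made up, and the two short word lists are assumptions standing in for the longer lists of virtu* and vertu* words.

```python
import re

# A made-up sentence standing in for the text of one EEBO file
text = "Of vertue and virtue: the vertuous man is virtuous in his Virtue".lower()

# Hypothetical short word lists; the lists used with the real function may differ
include_a = ["virtue", "virtuous"]
include_b = ["vertue", "vertuous"]

list_a = []
list_b = []

for item in include_a:
    # \b marks a word boundary, so "virtue" will not match inside "virtues"
    list_a.extend(re.findall(r'\b{0}\b'.format(item), text))

for item in include_b:
    list_b.extend(re.findall(r'\b{0}\b'.format(item), text))

# the length of each list is the total number of hits for that spelling
print(len(list_a), len(list_b))
```

The full function below repeats this step for every file in a directory and then gathers the author, title, and date of each file that contains a hit.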


In [None]:
# Import the packages we need

# BeautifulSoup is a package for parsing encoded text such as TEI, other kinds of XML, HTML, etc
from bs4 import BeautifulSoup
# os is a very commonly used package for miscellaneous utility functions like listing contents of
# a directory
import os
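# you should recognize this package (see earlier in this notebook)!
import re
# string is a package of helpers for working with text; the date handling below
# uses its list of punctuation characters (string.punctuation)
import string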
# this is a package for creating and storing dataframes, or spreadsheet-like data structures
import pandas as pd

# Define the function and include a comment that succinctly describes what it does
# The variables we need are:
# xml_dir = the directory of xml files
# include_a = the first list of words to find, i.e., the list of virtu* words to find
# include_b = the second list of words to find, i.e., the list of vertu* words to find
def compare_counts_specific(xml_dir, include_a, include_b):
    '''Produces dataframe with total counts of EEBO-TCP xml files containing words in include_a and 
    include_b lists. Saves filename, author, title, publication date, word1 total, and word2 total 
    to pandas dataframe. Prints total number of documents processed and total number of documents
    containing either virtu* or vertu* words.'''
    # assign an empty list to variable hits_list; this variable will store total number of 
    # documents that contain virtu* and/or vertu* words
    hits_list = []
    # assign an empty list to variable rows; this variable will store each row in the dataframe
    # as they are built by the function
    rows = []
    # list the contents of the directory with the xml files by filename (the os.listdir() function
    # turns the contents of a directory into a list of filenames)
    # then, iterate through each filename and do all of the following steps
    # i.e., for each file in that directory, do the following:
    for xmlfile in os.listdir(xml_dir):
        # assign variables to hold lists of virtu* and vertu* words
        list_a = [] 
        list_b = [] 
        # assign a variable to hold the specific filepath of each file the code is iterating through
        file_path = xml_dir + '/' + xmlfile
        # check that the file is an xml file and not a directory
        if os.path.isdir(file_path) == False and file_path.endswith('.xml'):
            # open the xml file
            with open(file_path, 'rb') as fin:
                # assign all of the contents of the file to contents variable
                contents = fin.read()
            # use BeautifulSoup to process the file contents so that we can grab information
            # from particular tags
            soup = BeautifulSoup(contents,'xml')
            # if there is a <text> tag:
            if soup.find('text'):
                # retrieve the value of the <text> tag and make it lowercase
                text = soup.find('text').get_text().lower()
                # for each word in the first list of words to search for:
                for item in include_a:
                    # use regular expressions to search for each word in the first list
                    # re.findall returns a list of every match found in the file's <text> tag
                    # (an empty list if there are none); store that list in variable a
                    a = re.findall(r'\b{0}\b'.format(item), text, flags=re.IGNORECASE)
                    # if the term exists in the file (i.e., if the length of the list is not 0):
                    if len(a) != 0:
                        # add a to list_a to keep a list of all of the virtu* hits for this file
                        list_a.extend(a)
                # for each word in the second list of words to search for:
                for item in include_b:
                    # use regular expressions to search for each word in the second list
                    # re.findall returns a list of every match found in the file's <text> tag
                    # (an empty list if there are none); store that list in variable b
                    b = re.findall(r'\b{0}\b'.format(item), text, flags=re.IGNORECASE)
                    # if the term exists in the file (i.e., if the length of the list is not 0):
                    if len(b) != 0:
                        # add b to list_b to keep a list of all of the vertu* hits for this file
                        list_b.extend(b)
                # if there is nothing in list_a and list_b after reading the contents of this file
                # i.e., if the file does not contain either virtu* or vertu* words
                # go to the next file
                if len(list_a) == 0 and len(list_b) == 0: 
                    pass
                # otherwise...
                else:
                    # add the name of this file to the hits_list variable
                    # because it represents a "hit," or a file that contains virtu* and/or vertu* words
                    hits_list.append(xmlfile)
                    # use regular expressions to delete all non-word characters in each list of virtu* and
                    # vertu* words, in case any got through
                    list_a = [re.sub('\W+','', token) for token in list_a]
                    list_b = [re.sub('\W+','', token) for token in list_b]
                    # count the number of times each term in each list occurs
                    a_wf = [list_a.count(token) for token in list_a if token.isalnum()==True]
                    b_wf = [list_b.count(token) for token in list_b if token.isalnum()==True]
                    # create a dictionary including each term with the number of times it occurs
                    # in each list
                    a_freqdict = dict(zip(list_a, a_wf))
                    b_freqdict = dict(zip(list_b, b_wf))
                    # get just the counts from each list
                    a_total = sum(a_freqdict.values())
                    b_total = sum(b_freqdict.values())
                    # look for the <author> tag in this file
                    author = soup.find('author')
                    # if there is an <author> tag
                    if author:
                        # grab the value
                        author = author.get_text()
                    # look for the <title> tag in this file
                    title = soup.find('title')
                    # if there is a <title> tag
                    if title:
                        # grab the value
                        title = title.get_text()
                    # now we do some truly terrible date handling
                    # look for the <date> tag in this file
                    date = soup.find('date')
                    # if there is a <date> tag:
                    if date:
                        # grab the value
                        date = date.get_text()
                        # The publication dates in eebo xml files are irregular because the dates 
                        # of publication included in early modern books are often irregular.
                        # EEBO-TCP files also represent years of transcription work with slightly varying 
                        # schemas in some instances, and sometimes the date of transcription
                        # is encoded in the file using the '<date>' tag as well.
                        # What we want is the date of "publication" of each text, which sometimes
                        # occurs in the first '<date>' tag in a file, and sometimes in the second.
                        # ALL of the remaining code indented under `if date:` above is about
                        # using regular expressions to manage various kinds of irregular 
                        # publication dates and put them into a standard YYYY format,
                        # this includes all of the code until the next commented line.
                        # As per our discussions of data "cleaning" in class, we are definitely
                        # losing information about the "publication" dates of some of these texts via
                        # this process. We are also, in some instances, deciding to give a text a firm
                        # "publication" date when in fact none exists (i.e., scholars don't exactly know
                        # what specific year a text was printed). In this context, for this research
                        # question, that loss of information and false precision doesn't matter so much,
                        # as long as we recognize that is what is happening. For other questions,
                        # however, it might matter very much.
                        if re.search(r'(\d\d\d\d)', date):
                            date = re.search(r'(\d\d\d\d)', date).group(0)
                        date = ''.join(i for i in date if i.isdigit() or i in string.punctuation)
                        date = re.sub('[.,:>\(\)]', '', date)
                        if re.search(r'^20\d\d', date): 
                            date = soup.find_all('date')[1]
                            if date:
                                date = date.get_text()
                                # causes some false precision for dates with brackets
                                if re.search(r'(\d\d\d\d)', date):
                                    date = re.search(r'(\d\d\d\d)', date).group(0)
                                date = ''.join(i for i in date if i.isdigit() or i in string.punctuation)
                                date = re.sub('[.,:>\(\)]', '', date)
                        if re.search(r'^20', date):
                            date = re.sub('20', '', date)
                        if re.search(r'--(\d\d\d\d)--', date):
                            date = re.search('--(\d\d\d\d)--',date).group(0)
                        if re.search(r'----(\d\d\d\d)', date):
                            date = re.search('----(\d\d\d\d)',date).group(0)
                        if re.search(r'-(\d\d\d\d)', date):
                            date = re.search('-(\d\d\d\d)',date).group(0)
                        if re.search(r'--(\d\d\d\d)', date):
                            date = re.search('--(\d\d\d\d)',date).group(0)
                        if re.search(r'(\d\d\d\d)--', date):
                            date = re.search('(\d\d\d\d)--',date).group(0)
                        if re.search('17411742', date):
                            date = re.sub('17411742', '1741-1742', date)
                        if re.search(r'^10', date):
                            date = re.sub('10', '', date)
                        if re.search(r'^25', date):
                            date = re.sub('25', '', date)
                        if re.search(r'^26--', date):
                            date = re.sub('26--', '', date)
                        if re.search('17811782', date):
                            date = re.sub('17811782', '1781-1782', date)
                        if re.search('1800424', date):
                            date = re.sub('424', '', date)
                        if re.search('18004', date):
                            date = re.sub('4', '', date)
                        if re.search('17661767?', date):
                            date = re.sub('17661767?', '1766-1767?', date)
                            date = re.sub('57911791', '1791', date)
                        if re.search('17121711', date):
                            date = re.sub('17121711', '1712, 1711', date)
                        if re.search('17626', date):
                            date = re.sub('17626', '1762', date)
                        if re.search('175710', date):
                            date = re.sub('175710', '1757', date)
                        if re.search('175610', date):
                            date = re.sub('175610', '1756', date)
                        if re.search(r'^5\d\d\d', date):
                            date = re.sub(r'^5', '1', date)
                        if re.search('1767215', date):
                            date = re.sub('1767215', '1767', date)
                    # ok, we have finally finished with various date issues
                    # now, create a row (like in a spreadsheet) containing the following info:
                    # xmlfile = the filename of this particular xml file
                    # author = the value we found for the <author> tag
                    # title = the value we found for the <title> tag
                    # date = the massaged <date> value
                    # a_total = the total number of virtu* words in the file
                    # b_total = the total number of vertu* words in the file
                    # add this row of values to the rows variable, which holds the values for
                    # the other files in this directory as well
                    rows.append([xmlfile, author, title, date, a_total, b_total])
    # Ok, now you can tell by this indentation that we have exited the for loop, meaning we've looped 
    # through all of the files in the directory.
    # Now we can create the pandas dataframe from the rows list, providing column names in the
    # correct order.
    df = pd.DataFrame(rows, columns = ['filename', 'author', 'title', 'date', 'virtu*', 'vertu*'])             
    # print the number of documents that contain either virtu* or vertu* words
    # and then print the total number of xml documents processed by the script
    print(str(len(hits_list)) + ' documents with hits\n',
        str(len([x for x in os.listdir(xml_dir) if x.endswith('.xml')])) + ' total documents processed')
    # return the dataframe, meaning recognize it as the output of the function, which can be
    # saved in a variable for further manipulation.
    return df
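
The last step of the function, turning the `rows` list into a dataframe, can be tried in isolation. The following is a minimal sketch with two made-up rows: the filenames, authors, titles, dates, and counts are invented for illustration and do not come from the EEBO files. The name `demo_df` is also hypothetical, chosen so that it does not overwrite anything else in this notebook.

```python
import pandas as pd

# Hypothetical rows, in the same order the function uses:
# [xmlfile, author, title, date, a_total, b_total]
rows = [
    ['example1.xml', 'Author One', 'A Sample Title', '1650', 3, 12],
    ['example2.xml', 'Author Two', 'Another Sample Title', '1712', 7, 0],
]

# Each inner list becomes one row; column names are matched by position
demo_df = pd.DataFrame(rows, columns=['filename', 'author', 'title', 'date', 'virtu*', 'vertu*'])

print(demo_df.shape)            # (number of rows, number of columns)
print(demo_df['vertu*'].sum())  # total of the vertu* column
```

Because the column names are matched by position, the order of values in each row has to agree with the order of names in `columns`.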

                

Now we are going to call the function and pass it some data to use. The cell below won't work unless you have run the cell above: if you see a `NameError` appear below the cell, this likely means the cell above hasn't been run. The cell below may take a few moments to run; while it does, you will see an asterisk in the brackets to the left of the cell, which tells you the code is still running.

There is a folder of 10 EEBO xml files in this repository (`eebo-test`), randomly selected from the data Jessica and I used with this code. We are going to tell the `compare_counts_specific` function to use the files in that directory as input. We are also going to give the function two lists of words to count: `include_a` is a list of "virtu*" words, and `include_b` is a list of "vertu*" words.

In [None]:
# usage

# the directory with the files we want to use
xml_dir = 'eebo-test'
# list of virtu* words
include_a = ['virtu', 'virtus', 'virtue', 'virtues', 'virtuous', 'virtuouse', 'virtuosity', 'virtuositie']
# list of vertu* words
include_b = ['vertu', 'vertus', 'vertue', 'vertues', 'vertuous', 'vertuouse', 'vertuosity', 'vertuositie']

# call the function and assign its output to the df variable
df = compare_counts_specific(xml_dir, include_a, include_b)

# display the value of the df variable
df
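
To make concrete what it means to count the words in a list, here is a toy sketch. It is not the code inside `compare_counts_specific`: the sample sentence, the shortened word lists, the simple rule for splitting text into words, and all of the variable names are assumptions made for illustration only.

```python
import re
from collections import Counter

# Made-up sentence and shortened word lists, for illustration only
text = 'Vertue is its own reward, and virtue and vertues are praised.'
toy_a = ['virtue', 'virtues']
toy_b = ['vertue', 'vertues']

# Lowercase the text and split it into words made up of letters
words = re.findall(r'[a-z]+', text.lower())
counts = Counter(words)

# Add up the counts of every word in each list
toy_a_total = sum(counts[w] for w in toy_a)
toy_b_total = sum(counts[w] for w in toy_b)
print(toy_a_total, toy_b_total)  # 1 2
```

A `Counter` returns 0 for a word it has never seen, so words from a list that do not appear in the text simply add nothing to the total.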



#### Question 3
Describe the output of this script (the dataframe that displays after the above cell finishes running). Remember that this is the same kind of output as the "vir-ver-counts-specific" spreadsheet in our Lab 4 folder on Canvas, but for only 10 texts. What is this dataframe showing us? Write down your response in your notes document.

#### Question 4
Look at the lines below, which come from the `compare_counts_specific` function above. These lines use regular expressions to do something to the value of the `<date>` field in an xml file (if the contents of the `<date>` field meet certain conditions, that is). What are these lines doing? Here are the lines (you don't need to run this code, just examine it; it won't work if you try to run it on its own):

In [None]:
if re.search(r'^20', date):
    date = re.sub('20', '', date)

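Hint: In Python, it is good practice to put an `r` before a regular expression, as in `if re.search(r'^20', date):`. The `r` marks a "raw" string, which tells Python to leave backslashes (as in `\d`) alone so that they reach the regular expression engine unchanged. A caret `^` is a special character in regular expression syntax, but it does not itself require the `r`. If you don't know what the regular expression `^20` means in Python regex syntax, how can you find out?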
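
If you would like to see what the `r` prefix changes, or to experiment with a pattern, you can try a few lines in a fresh cell. The sketch below is illustrative only: the sample string is made up, and the pattern `\d\d\d\d` is borrowed from other lines of the function rather than from the lines in this question.

```python
import re

# The r prefix creates a "raw" string: Python leaves backslashes alone
print(len('\n'))   # 1: \n is read as a single newline character
print(len(r'\n'))  # 2: a backslash followed by the letter n

# With no backslash in the pattern, the prefix changes nothing
print(r'^20' == '^20')  # True

# Trying a pattern on a made-up test string shows what it matches
sample = 'printed in --1741--'
found = re.search(r'\d\d\d\d', sample)
print(found.group(0))  # 1741
```

Testing a pattern on short strings that you write yourself is a quick way to check your reading of it before applying it to real data.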