# Topic 3 - Diving into files and data formats

This week, we will learn how to deal with files and data formats in Python. You probably have heard of (or are already quite familiar with) different data formats, such as plain text, tables (CSV/TSV), XML, JSON and RDF. These formats are simply the result of agreements that were made between people on how to organize and store data. Some of these formats, such as XML and RDF, have a high degree of structure, whereas plain text is a typical example of unstructured data. Structuring data according to predefined specifications allows information in the data to be easily ordered and processed by machines. You can compare highly structured data with a perfectly organized filing cabinet where everything is identified, labeled and easy to access. 

This notebook introduces some of the existing data formats, and explains how you can work with files stored locally on your computer.

**At the end of this week, you will be able to:**
- open and read the contents of one or multiple `text` and `csv/tsv` files;
- manipulate the content of files (e.g. sentence splitting, tokenization, POS-tagging, and lemmatization);
- write new or manipulated content to new (or existing) files;
- read and write JSON data; 
- import and use modules like `csv`, `json`, `nltk`, `os` and `glob`;
- write a function.

**This requires that you already have (some) knowledge about:**
- basic objects: strings, lists and dictionaries;
- for-loops;
- boolean expressions;
- if-statements.

**If you want to learn more about these topics, you might find the following links useful:**
- [Chapter 2 of this course: Control flow tools and files](https://github.com/evanmiltenburg/python-for-text-analysis/blob/master/Python-chapters/chapter-2.md)
- [Video: File Objects - Reading and Writing to Files](https://www.youtube.com/watch?v=Uh2ebFW8OYM)
- [Video: Automate Parsing and Renaming of Multiple Files](https://www.youtube.com/watch?v=ve2pmm5JqmI)
- [Video: OS Module - Use Underlying Operating System Functionality](https://www.youtube.com/watch?v=tJxcKyFMTGo)
- [Video: Working With JSON](https://www.youtube.com/watch?v=Kf0q4Tf5M3c)
- [Tutorial: The import statement](https://www.tutorialspoint.com/python/python_modules.htm)
- [Tutorial: Defining Functions of your Own](http://anh.cs.luc.edu/python/hands-on/3.1/handsonHtml/functions.html)
- [Tutorial: Reading and Manipulating CSV Files](https://newcircle.com/s/post/1572/python_for_beginners_reading_and_manipulating_csv_files)
- [Blog post: 6 Ways the Linux File System is Different From the Windows File System](http://www.howtogeek.com/137096/6-ways-the-linux-file-system-is-different-from-the-windows-file-system/) 
- [Blog post: Gotcha — backslashes in Windows filenames](https://pythonconquerstheuniverse.wordpress.com/2008/06/04/gotcha-%E2%80%94-backslashes-in-windows-filenames/)

## 1. Working with plain text files: Charlie & The Chocolate Factory

When doing text analysis, you will often work with files that contain plain text. These files typically end with the `.txt` extension. In Python, you can read the content of a file, store it as the type of object that you need (string, list, etc.) and manipulate it (e.g. replacing or removing words). You can also write new content to an existing or a new file. Refer to [Chapter 2](https://github.com/evanmiltenburg/python-for-text-analysis/blob/master/Python-chapters/chapter-2.md) of this course for a summary about reading and writing files.

### 1.1 Reading (and closing) files
Let's start with opening a file in Python. To do this, we need to associate the file on disk with a variable in Python. First, we tell Python where the file is stored on your disk. The location of your file is often referred to as the file path. Python will start looking in the 'working' or 'current' directory (which often will be where your Python script is). If it's in the working directory, you only have to tell Python the name of the file (e.g. `charlie.txt`). If it's not in the working directory, as in our case, you have to tell Python the exact path to your file. We will create a string variable to store this information:

In [None]:
filename = "../Data/Charlie/charlie.txt"  

Note the double dots in the beginning of the file path; this means 'the parent of the current directory'. When writing a file path, you can use the following:
- /     means the root of the current drive; 
- ./    means the current directory;
- ../   means the parent of the current directory.
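To see what `./` and `../` resolve to on your own machine, you can ask Python for the corresponding absolute paths. This is only a small sketch using `os.path.abspath()`; the variable names `current_dir` and `parent_dir` are ours, and the output depends on where you run the code.

```python
import os

# Absolute path of the current working directory ("./")
current_dir = os.path.abspath(".")

# Absolute path of the parent of the current directory ("../")
parent_dir = os.path.abspath("..")

print("Current directory:", current_dir)
print("Parent directory: ", parent_dir)
```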

Also note that the formatting of file paths is different across operating systems. The file path as specified above should work on any UNIX platform (Linux, Mac). If you are using Windows, however, you might run into problems when formatting file paths in this way outside of this notebook, because Windows uses backslashes instead of forward slashes (Jupyter Notebook should already have taken care of these problems for you). In that case, it might be useful to have a look at [this page](http://www.howtogeek.com/137096/6-ways-the-linux-file-system-is-different-from-the-windows-file-system/) about the differences between the file systems, and at [this page](https://pythonconquerstheuniverse.wordpress.com/2008/06/04/gotcha-%E2%80%94-backslashes-in-windows-filenames/) about solving this problem in Python. In short, it's probably best if you use the following (we will talk about the `os` module in more detail later in this notebook):

In [None]:
import os
windows_file_path = os.path.normpath("C:/somePath/someFilename") # Use forward slashes

#### The `open()` function
As soon as Python knows where your file is stored, we can open the file by using the built-in function `open()`:

In [None]:
filename = "../Data/Charlie/charlie.txt"  
infile = open(filename, "r")

We could also write:

In [None]:
infile = open("../Data/Charlie/charlie.txt" , "r")

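Do you get an error (typically a `UnicodeDecodeError`) when reading from a file opened like this? Then you are probably using Windows, where Python's default text encoding is often not UTF-8. In that case, we need to specify the encoding as follows (just to be sure, we will do this in the rest of this notebook as well):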

In [None]:
infile = open(filename, "r", encoding="utf8")

The `open()` function requires the file path as its first argument. The second (optional) argument specifies the *mode* in which the file is opened. The (also optional) keyword argument `encoding` specifies the encoding of the file.
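Why does the encoding matter? The following sketch writes a short string to a temporary demo file (a hypothetical file, not part of the course data) and reads it back twice, once with the matching encoding and once with a different one. The names `demo_path`, `same_encoding` and `other_encoding` are made up for this illustration.

```python
import os
import tempfile

# Hypothetical demo file in a temporary folder
demo_path = os.path.join(tempfile.mkdtemp(), "encoding_demo.txt")

# Write a string with a non-ASCII character, encoded as UTF-8
with open(demo_path, "w", encoding="utf8") as outfile:
    outfile.write("café")

# Reading with the same encoding gives the original text back
with open(demo_path, "r", encoding="utf8") as infile:
    same_encoding = infile.read()

# Reading with a different encoding produces garbled text without any error
with open(demo_path, "r", encoding="latin-1") as infile:
    other_encoding = infile.read()

print(same_encoding)
print(other_encoding)
```

Depending on the encoding pair, a mismatch either raises an error or, as here, silently changes the text, which is why we state the encoding explicitly.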

The mode you choose will depend on what you wish to do with the file. Here are some of our mode options:

| Character | Meaning |
| --------- | ------- |
|'r' |	open for reading (default)|
|'w' |	open for writing, truncating the file first|
|'x' |	open for exclusive creation, failing if the file already exists|
|'a' |	open for writing, appending to the end of the file if it exists|
|'b' |	binary mode|
|'t' |	text mode (default)|
|'+' |	open a disk file for updating (reading and writing)|
|'U' |	universal newlines mode (deprecated; removed in Python 3.11)|
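The difference between the `'w'`, `'a'` and `'x'` modes is easiest to see by trying them on a throwaway file. The sketch below uses a temporary demo file (hypothetical, not part of the course data); all variable names are our own.

```python
import os
import tempfile

demo_path = os.path.join(tempfile.mkdtemp(), "modes_demo.txt")

# "w" creates the file (or empties it if it already exists)
with open(demo_path, "w", encoding="utf8") as outfile:
    outfile.write("first line\n")

# "a" keeps the existing content and adds to the end
with open(demo_path, "a", encoding="utf8") as outfile:
    outfile.write("second line\n")

with open(demo_path, "r", encoding="utf8") as infile:
    after_append = infile.read()

# "w" again throws away everything that was in the file
with open(demo_path, "w", encoding="utf8") as outfile:
    outfile.write("overwritten\n")

with open(demo_path, "r", encoding="utf8") as infile:
    after_overwrite = infile.read()

# "x" refuses to open a file that already exists
try:
    with open(demo_path, "x", encoding="utf8") as outfile:
        outfile.write("never written\n")
    exclusive_failed = False
except FileExistsError:
    exclusive_failed = True

print(repr(after_append))
print(repr(after_overwrite))
print("Did 'x' fail?", exclusive_failed)
```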

Let's now print `infile`. What do you think will happen?

In [None]:
print(infile)

"Hey! That's not what I expected to happen!", you might think. Python is not printing the contents of the file but only some mysterious mention of some `TextIOWrapper`. This `TextIOWrapper` thing is Python's way of saying it has *opened* a connection to the file `charlie.txt`. To actually see its content, we need to do some more.

####  The `read()`, `readlines()` and `readline()` methods
In order to *read* the contents of the file, Python provides three related operations. The first operation is `read()`:

In [None]:
content = infile.read()
print(content)

The variable `content` now holds the entire content of the file `charlie.txt` as a single string and we can access and manipulate it just like any other string. 

The second operation is `readlines()`, which returns a list of the lines in the file, where each item of the list represents a single line:

In [None]:
lines = infile.readlines()
print(lines)

Oops, why does this return an empty list? Something to keep in mind when you are reading from files is that Python keeps track of its position in the file: once a file has been read to the end using one of the `read` operations, the next read starts at the end and finds nothing. The simplest way to read the content again is to open the file anew. Let's try again:

In [None]:
infile = open(filename, "r", encoding="utf8")
lines = infile.readlines()
print(lines)
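Reopening the file is not the only option. As a sketch of what happens behind the scenes, the code below reads a small temporary demo file (hypothetical, created just for this example) twice and then uses the file method `seek(0)` to jump back to the start. The names `first_read`, `second_read` and `demo_lines` are ours.

```python
import os
import tempfile

demo_path = os.path.join(tempfile.mkdtemp(), "position_demo.txt")
with open(demo_path, "w", encoding="utf8") as outfile:
    outfile.write("line one\nline two\n")

with open(demo_path, "r", encoding="utf8") as infile:
    first_read = infile.read()       # reads everything; the position is now at the end
    second_read = infile.read()      # nothing left to read: an empty string
    infile.seek(0)                   # move the position back to the start of the file
    demo_lines = infile.readlines()  # now we can read the content again

print(repr(first_read))
print(repr(second_read))
print(demo_lines)
```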

Now you can, for example, use a for-loop to print each line in the file (note that the second line is just a newline character):

In [None]:
for line in lines:
    print("LINE:", line)

The third operation `readline()` returns the next line of the file, returning the text up to and including the next newline character (*\n*, or *\r\n* on Windows). More simply put, this operation will read a file line-by-line. So if you call this operation again, it will return the next line in the file. Try it out below!

In [None]:
infile = open(filename, "r", encoding="utf8")
next_line = infile.readline()
print(next_line)

In [None]:
next_line = infile.readline()
print(next_line)

In [None]:
next_line = infile.readline()
print(next_line)
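Instead of calling `readline()` over and over, you can also loop over the file object itself, which hands you one line at a time. A minimal sketch on a temporary demo file (hypothetical, not part of the course data):

```python
import os
import tempfile

demo_path = os.path.join(tempfile.mkdtemp(), "lines_demo.txt")
with open(demo_path, "w", encoding="utf8") as outfile:
    outfile.write("first\nsecond\nthird\n")

stripped_lines = []
infile = open(demo_path, "r", encoding="utf8")
# A file object can be used directly in a for-loop: each step yields the next line
for line in infile:
    stripped_lines.append(line.rstrip("\n"))  # drop the newline at the end of the line
infile.close()

print(stripped_lines)
```

This is handy for large files, because only one line needs to be held in memory at a time.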

#### Closing the file: `close()` vs. context manager
After reading the contents of a file, the `TextIOWrapper` no longer needs to be open since we have stored the content as a variable. In fact, it is good practice to close the file as soon as you do not need it anymore. Now, lo and behold, we can achieve that with the following:

In [None]:
infile.close()

There is actually an easier (and preferred) way to make sure that the file is closed as soon as you don't need it anymore, namely using what is called a `context manager`:

In [None]:
with open(filename, "r", encoding="utf8") as infile:
    content = infile.read()
    
print(content)

The main advantage of using the with-statement is that it automatically closes the file once you leave the local context defined by the indentation level. If you 'manually' open and close the file, you risk forgetting to close the file. Therefore, context managers are considered a best-practice, and we will use the with-statement in all of our following code. 
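You can check for yourself that the with-statement closes the file by inspecting the `closed` attribute of the file object. The sketch below does this on a temporary demo file (hypothetical); the names `closed_inside` and `closed_outside` are made up for this example.

```python
import os
import tempfile

demo_path = os.path.join(tempfile.mkdtemp(), "closing_demo.txt")
with open(demo_path, "w", encoding="utf8") as outfile:
    outfile.write("some text\n")

with open(demo_path, "r", encoding="utf8") as demo_file:
    demo_content = demo_file.read()
    closed_inside = demo_file.closed   # still open inside the with-block

closed_outside = demo_file.closed      # closed automatically after the block

print("Closed inside the block? ", closed_inside)
print("Closed outside the block?", closed_outside)
```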

**Exercise:** Write a program that opens `RedCircle.txt` in the `../Data/RedCircle` folder and prints its content as a single string:

In [None]:
# Write your program here

**Exercise:** Write a program that opens `RedCircle.txt` in the `../Data/RedCircle` folder and prints a list containing all lines in the file:

In [None]:
# Write your program here

### 1.2 Manipulating the content of text files: importing and using the NLTK package
Last week, we did several exercises on manipulating strings. Let's recap. We learned that some of the most common preprocessing steps are casefolding/lowercasing, punctuation removal and stemming/lemmatization. Did you know that there are some very useful NLP packages and modules to do some of these steps? One that is often used in text analysis is the Python package NLTK (the Natural Language Toolkit). However, to be able to use the modules that are part of NLTK, we first need to *import* NLTK into our Python script. Let's quickly talk about importing packages and modules first.

#### Importing modules

Some things in Python, like `int`, `float` or `list.count()` are built-in and can be used whenever you want. But many things you will want to do need a little more than that. Try running the following code:


In [None]:
current_time = datetime.datetime.now()
print(current_time)

As you can see, you get a `NameError`. This is Python's way of telling you: "I don't know what 'datetime' means, please tell me first." We can make the `datetime` module accessible by using a suitable `import` statement. For example:

In [None]:
import datetime
current_time = datetime.datetime.now()
print(current_time)

In the second line, we basically say: "from the module `datetime`, take the class `datetime` (that is, `datetime.datetime`), and call its method `now()`." Strictly speaking, `datetime` is a module rather than a package, but packages and modules are imported in the same way. The [documentation](https://docs.python.org/3/reference/import.html) for the `import` statement gives more details about packages and modules. For example:

> Python has only one type of module object, and all modules are of this type, regardless of whether the module is implemented in Python, C, or something else. To help organize modules and provide a naming hierarchy, Python has a concept of packages.

> You can think of packages as the directories on a file system and modules as files within directories, but don’t take this analogy too literally since packages and modules need not originate from the file system. For the purposes of this documentation, we’ll use this convenient analogy of directories and files. Like file system directories, packages are organized hierarchically, and packages may themselves contain subpackages, as well as regular modules.

> It’s important to keep in mind that all packages are modules, but not all modules are packages. Or put another way, packages are just a special kind of module. Specifically, any module that contains a __path__ attribute is considered a package.

> All modules have a name. Subpackage names are separated from their parent package name by dots, akin to Python’s standard attribute access syntax. Thus you might have a module called sys and a package called `email`, which in turn has a subpackage called `email.mime` and a module within that subpackage called `email.mime.text`.

Optionally, we can also give the module a convenient name while importing it, which works as follows:

In [None]:
import datetime as dt
current_time = dt.datetime.now()
print(current_time)

Python's `from...import` statement lets you import specific attributes from a module:

In [None]:
from datetime import datetime
current_time = datetime.now()
print(current_time)

#### NLTK: Tokenization and sentence splitting
Now that we know how to import packages and modules, we can import NLTK!

In [None]:
import nltk

Amongst other things, the NLTK toolkit allows you to tokenize texts with the function `word_tokenize()`. To be able to use this function, we first need to download the NLTK Tokenizer Models. Run the following command to download a collection of NLTK data and models: 

In [None]:
nltk.download("book")

Now, let's try tokenizing our Charlie story! First, we will open and read the file again and assign the file contents to the variable `content`. Then, we can call the `word_tokenize()` function from the `nltk` module as follows:

In [None]:
with open("../Data/Charlie/charlie.txt", encoding="utf8") as infile:
    content = infile.read()

tokens = nltk.word_tokenize(content)
print(tokens)

As you can see, we now have a list of all words in the text. The punctuation marks are also in the list, but as separate tokens. Another thing that NLTK can do for you is to split a text into sentences by using the `sent_tokenize()` function:

In [None]:
sentences = nltk.sent_tokenize(content)
print(sentences)

We can now do all sorts of cool things with these lists. For example, we can search for all words that have certain letters in them and add them to a list. Let's say we want to find all present participles in the text. We know that present participles end with *-ing*, so we can do something like this:

In [None]:
present_participles = []
for token in tokens:
    if token.endswith("ing"):
        present_participles.append(token)
print(present_participles)

This looks good! We now have a list of words like *boiling*, *sizzling*, etc. But wait... Oops, there is one word in the list that actually is not a present participle! Of course, other words can end with *-ing* as well. So if we want to find all present participles, we have to come up with a smarter solution.

#### NLTK: POS tagging
Once again, NLTK comes to the rescue. Using the function `pos_tag()`, we can label each word in the text with its part of speech:

In [None]:
tagged_tokens = nltk.pos_tag(tokens)
print(tagged_tokens)

We now have a list of tuples. The first element of the tuple is the token, the second element indicates the part of speech of the token. This POS tagger uses the POS tag set of the Penn Treebank Project, which can be found [here](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html). In this tag set, the `VBG` tag is used for present participles and gerunds. 
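Because each item is a tuple of two elements, you can unpack it into two variables directly in the for-loop. The sketch below counts how often each tag occurs. Note that `example_tagged` is a small hand-written list in the same (token, tag) format, not actual output of `pos_tag()`.

```python
# Hand-written example in the (token, tag) format; not real tagger output
example_tagged = [
    ("The", "DT"), ("water", "NN"), ("is", "VBZ"), ("boiling", "VBG"),
    ("in", "IN"), ("the", "DT"), ("pot", "NN"), (".", "."),
]

tag_counts = {}
# Unpack each tuple into the token and its tag
for token, tag in example_tagged:
    if tag in tag_counts:
        tag_counts[tag] += 1
    else:
        tag_counts[tag] = 1

print(tag_counts)
```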

**Exercise:** Now let's try to make a list of all present participles in `charlie.txt` using the POS tags. Finish the following code:

In [None]:
# Finish the following code:
present_participles = []
for token in tagged_tokens:
    if token[1] == "VBG":
        # ??
print(present_participles)

In [None]:
# Test your code here! If your code is correct, you should get a compliment :-)
assert len(present_participles) == 11 and type(present_participles[0]) == str
print("Well done!")

You should get the following list: ['boiling', 'bubbling', 'hissing', 'sizzling', 'clanking', 'running', 'hopping', 'knowing', 'rubbing', 'cackling', 'going']

**Exercise:** Finish the following code to get *all* verbs. We already provided you with the full set of verb tags.

In [None]:
# Finish the following code:
verb_tags = ("VBD", "VBG", "VBN", "VBP", "VBZ")
verbs = []
# Use a for-loop! 

print(verbs)

In [None]:
# Test your code here! If your code is correct, you should get a compliment :-)
assert len(verbs) == 39 and type(verbs[0]) == str    
print("Well done!")

#### NLTK: Lemmatization
We now have a list of all inflected forms of the verbs. We can also use NLTK to lemmatize words. We will use the WordNetLemmatizer for this. In the code below, we loop through the list of verbs, lemmatize each of the verbs, and add them to a new list called `verb_lemmas`.

In [None]:
lmtzr = nltk.stem.wordnet.WordNetLemmatizer()
verb_lemmas = []
for verb in verbs:
    # For this lemmatizer, we need to indicate the POS of the word (in this case, v = verb)
    lemma = lmtzr.lemmatize(verb, "v")
    verb_lemmas.append(lemma)
print(verb_lemmas)

**Exercise:** The resulting list contains a lot of duplicates. Do you remember how you can get rid of these duplicates? Create a set in which each verb occurs only once and name it `unique_verbs`. Then print it.

In [None]:
# Write your code here
          

In [None]:
# Test your code here! If your code is correct, you should get a compliment :-)
assert len(unique_verbs) == 28    
print("Well done!")

**Exercise:** Now use a for-loop to count the number of times that each of these verb lemmas occurs in the text! For each verb in the list you just created, get the count of this verb in `charlie.txt` using the `count()` method. Create a dictionary that contains the lemmas of the verbs as keys, and the counts of these verbs as values. Refer to the notebook about Topic 1 if you forgot how to use the `count()` method or how to create dictionary entries!

In [None]:
verb_counts = {}
# Finish this for-loop
for verb in unique_verbs:
    # ??

print(verb_counts) 

In [None]:
# Test your code here! If your code is correct, you should get a compliment :-)
assert len(verb_counts) == 28 and verb_counts["bubble"] == 1 and verb_counts["be"] == 9
print("Well done!")

### 1.3 Writing files

So far, we have seen how to open a file, and how to read and manipulate its content. But you can also use Python to write files. Let's first slightly adapt our Charlie story by replacing the names in the text:

In [None]:
your_name = "" #type in your name 
friends_name = "" #type in the name of a friend 
new_content = content.replace("Charlie Bucket", your_name)
content = new_content.replace("Mr Wonka", friends_name)

Now, we can open a new file and write the text to this file by using `write()` as follows (remember, we need to specify the mode in which we open the file, in this case the writing mode):

In [None]:
filename = "../Data/Charlie/charlie_new.txt"
with open(filename, "w", encoding="utf8") as outfile:
    outfile.write(content)

Open the file `charlie_new.txt` in the folder `../Data/Charlie` in any text editor and read a personalized version of the story!

Let's try something else. Remember that we have a list of verb lemmas that occurred in `charlie.txt`, which we assigned to the variable `verb_lemmas`. Let's say we want to write these lemmas to a file, with each lemma on a separate line. What do you think will happen if you run the following code?

In [None]:
filename = "../Data/Charlie/charlie_verbs.txt"
with open(filename, "w", encoding="utf8") as outfile:
    outfile.write(verb_lemmas)

Do you understand why you get this error?

... exactly, you can only write strings to (text) files. If you want to write the verbs that are stored within the list `verb_lemmas` to a file, you'll need to use a for-loop or create one string from the list using the `join()` method.

In [None]:
filename = "../Data/Charlie/charlie_verbs.txt"

with open(filename, "w", encoding="utf8") as outfile:
    
    # Writing example 1
    for verb in verb_lemmas:
        outfile.write(verb)
    outfile.write("\n\n")

    # Writing example 2 
    for verb in verb_lemmas:
        s = verb + "\n"
        outfile.write(s)
    outfile.write("\n\n")

    # Writing example 3
    s = " ".join(verb_lemmas)
    outfile.write(s)

Investigate the output of the following code in `charlie_verbs.txt` and try to understand the differences between the three writing examples:

In [None]:
# Read the file and print its content 
with open(filename, "r", encoding="utf8") as infile:
    content = infile.read()
    print(content)
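A related pattern is to join the list with the newline character, so that every item ends up on its own line, and to turn the file content back into a list with `splitlines()`. This is just a sketch on a temporary demo file with a made-up word list; `demo_words` and `words_read_back` are hypothetical names.

```python
import os
import tempfile

demo_path = os.path.join(tempfile.mkdtemp(), "words_demo.txt")
demo_words = ["boil", "bubble", "hiss"]

# Join the items with a newline so that each one ends up on a separate line
with open(demo_path, "w", encoding="utf8") as outfile:
    outfile.write("\n".join(demo_words))

# Read the file and split the content back into a list of lines
with open(demo_path, "r", encoding="utf8") as infile:
    words_read_back = infile.read().splitlines()

print(words_read_back)
```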

**Exercise:** Create a list containing 10 color names. Write each of these colors on a separate line to a file called `colors.txt` in the `../Data` folder.

In [None]:
# Write your program here

## 2. CSV and TSV (tables): The 2016 Presidential Debate
Now let's move on to another data format. The *table* is probably one of the most common data formats. A table represents a set of data points as a series of rows, with a column for each of the data points' properties. Tabular data can be encoded as CSV (comma-separated values) or TSV (tab-separated values). CSV and TSV files are simply plain text files in which each line represents a row and, within each line, a comma (for CSV) or a tab character (for TSV) separates the cells in the row (the columns).
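To make this concrete, here is the same row written once as CSV and once as TSV, and split into cells with the matching separator. The row itself is a made-up example, and the variable names are ours.

```python
# The same (made-up) row in both formats
csv_line = "AK,F,1910,Mary,14"
tsv_line = "AK\tF\t1910\tMary\t14"

# Split each line into cells using the separator of its format
csv_cells = csv_line.split(",")
tsv_cells = tsv_line.split("\t")

print(csv_cells)
print(tsv_cells)
```

Both lines yield the same list of five cells; only the separator differs.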

### 2.1 Reading CSV files

In the following, we will have a look at two CSV files: `AK.TXT` in the folder `Data/baby_names/names_by_state` and `debate.csv` in the folder `Data/Debate`. If you'd like, you can open these files in a text editor or Excel (convert text to columns by using the comma as delimiter) to see their content. We will show you how you can read these files in Python with the very useful `csv` module. But first, let's see how it works without using this module.

#### Without `csv` module
In Python we could read a CSV file in a similar way as we have seen with plain text files:

In [None]:
# Read the file and print its content 
filename = "../Data/baby_names/names_by_state/AK.TXT"
with open(filename, "r", encoding="utf8") as csvfile:
    content = csvfile.read()
    print(content)

This file contains a list of names given to children in the state of Alaska from 1910 to 2015. Each line in this file has five elements: the state abbreviation (AK for Alaska), gender (F/M), year, name, and frequency of that name in the given year and state. These elements are all separated by commas. So even though the extension of this file is not `.csv`, the data is still in CSV format.

Let's say we want to create a list that contains each row of the CSV file, where each row itself is a list representing the different columns in the CSV file. We could do that by using the `readlines()` method that we have seen before, and then split each row into columns using the `split()` method:

In [None]:
# Read the file and get all lines
filename = "../Data/baby_names/names_by_state/AK.TXT"
with open(filename, "r", encoding="utf8") as csvfile:
    csv_data = []
    rows = csvfile.readlines()
    for row in rows:
        columns = row.split(",")
        csv_data.append(columns)

The variable `csv_data` now contains a list of all rows in the file. Let's have a look at an example:

In [None]:
example_row = csv_data[18]
print(example_row)

We see that this worked, but that `\n` is included in the last item of the list (representing the last column). If we don't want this to be included, we need to remove it somehow, for example by using the `strip()` method as shown below:

In [None]:
# Read the file and get all lines
filename = "../Data/baby_names/names_by_state/AK.TXT"
with open(filename, "r", encoding="utf8") as csvfile:
    csv_data = []
    rows = csvfile.readlines()
    for row in rows:
        row = row.strip("\n") # remove the newline at the end of the row
        columns = row.split(",")
        csv_data.append(columns)

# Print an example row
example_row = csv_data[18]
print(example_row)

Now we can, for example, write a program that prints all rows containing the names given in 1912:

In [None]:
# Example: print all names given in 1912
for row in csv_data:
    year = row[2]
    if year == "1912":
        print(row)
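Keep in mind that all cell values obtained this way are strings, including the year and the frequency; that is why the year is compared to `"1912"` and not to `1912`. If you want to calculate with a value, convert it first. A minimal sketch with a few made-up rows in the same five-column format (`sample_rows` is hypothetical, not taken from `AK.TXT`):

```python
# Made-up rows in the same format: state, gender, year, name, frequency
sample_rows = [
    ["AK", "F", "1912", "Mary", "9"],
    ["AK", "F", "1912", "Annie", "6"],
    ["AK", "F", "1913", "Mary", "12"],
]

total_1912 = 0
for row in sample_rows:
    year = row[2]
    if year == "1912":
        # Convert the frequency from a string to an integer before adding it up
        total_1912 += int(row[4])

print(total_1912)
```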

Even though you have to perform some steps to split the file into lines, split the lines into columns, and remove newlines, this method works. But let's take a look at a slightly more complicated CSV file: the `debate.csv` file in the `Data/Debate` folder. This file contains transcripts of the 2016 (vice-)presidential debates from 26 September to 9 October. We open the file and print its content again:

In [None]:
# Read the file and print its content 
filename = "../Data/Debate/debate.csv"
with open(filename, "r", encoding="utf8") as csvfile:
    content = csvfile.read()
    print(content)

We see that the first line in this CSV file is the header indicating the names of the columns: `Line`, `Speaker`, `Text` and `Date`. The remaining lines represent the transcripts of the different speakers in chronological order. We split the file into rows and the rows into columns again, and print an example row:

In [None]:
# Read the file and get all lines
filename = "../Data/Debate/debate.csv"
with open(filename, "r", encoding="utf8") as csvfile:
    csv_data = []
    rows = csvfile.readlines()
    for row in rows:
        row = row.strip("\n") # remove the newline at the end of the row
        columns = row.split(",")
        csv_data.append(columns)

example_row = rows[18]
columns = example_row.split(",")
print(columns)

We immediately see a problem: we get 10 columns instead of 4. This is because Python has split the string by all occurrences of the comma, including the ones that are part of the transcripts (inside the `Text` column). In the CSV file, double quotation marks are used to surround the different cells (saying: "treat the part between quotation marks as one unit"), but the `split()` function does not take this into account and splits these units anyway. In addition, the quotation marks are also included in the strings representing the data inside the columns, even though these were just there to indicate that whatever is inside these quotation marks should be treated as one data point.

You may wonder: well, why don't we just split the data using `split('","')` then? Try it out to see if this would work:

In [None]:
# Read the file and get all lines
filename = "../Data/Debate/debate.csv"
with open(filename, "r", encoding="utf8") as csvfile:
    csv_data = []
    rows = csvfile.readlines()
    for row in rows:
        row = row.strip("\n") # remove the newline at the end of the row
        columns = row.split(",")
        csv_data.append(columns)

# Print example row and value
example_row = rows[18]
columns = example_row.split('","')
first_column = columns[0]
print(columns)
print(first_column)

Unfortunately, this does not work. The quotation marks in the CSV file indicating units are not used for every column. The first column only contains the `Line` number; this will never contain a comma, so according to CSV specifications, it is not strictly necessary to surround these values with quotation marks. Therefore, splitting on `","` will not work on this file. It is possible to clean your data even further, but we recommend using the `csv` module instead.

#### With the `csv` module: the `reader()` function
We have tried to show you how you can, in principle, read a CSV file just like any plain text file. A much better and easier approach, however, is to use the `csv` module, which will simplify the parsing of CSV and TSV files. First, we import it:

In [None]:
import csv

Using the `csv` module, we can open and read the file by using either the `csv.reader()` function, or the `csv.DictReader()` function. The `csv.reader()` function works as follows: 

In [None]:
filename = "../Data/Debate/debate.csv"
with open(filename, "r", encoding="utf8") as csvfile:
    csvreader = csv.reader(csvfile, delimiter=",")
print(csvreader)

As you can see from the output, we created a `Reader` object that we assigned to the variable `csvreader`. A `Reader` object lets you iterate over lines in the CSV file:

In [None]:
# Read the file and print each row 
filename = "../Data/Debate/debate.csv"
with open(filename, "r", encoding="utf8") as csvfile:
    csvreader = csv.reader(csvfile, delimiter=",")
    for row in csvreader:
        print(row)

As you can see, using the `csv` module we are able to split the rows into columns using the comma separator, while keeping the units surrounded by double quotation marks intact! The quotation marks and the newline characters are also not part of the string anymore.
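The difference with `split(",")` is easy to see on a single line. The sketch below uses a made-up line in the same style as `debate.csv` (not an actual row from the file) and relies on the fact that `csv.reader()` accepts any iterable of strings, such as a list with one line.

```python
import csv

# Made-up line: the third cell contains commas and is therefore quoted
demo_line = '1,Speaker A,"Well, thank you, and good evening.",9/26/16'

# Naive splitting also cuts the quoted cell into pieces
split_cells = demo_line.split(",")

# csv.reader() keeps the quoted cell together and removes the quotation marks
csv_cells = next(csv.reader([demo_line], delimiter=","))

print(len(split_cells), split_cells)
print(len(csv_cells), csv_cells)
```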

In a way, the `Reader` object is similar to a list (rows) of lists (columns). We can also make this explicit by changing the type of `csvreader` to a list and assign that to a new variable `rows`, as shown in the code below. 

In [None]:
# Read the file and convert to a list of lists (each list representing a row)
filename = "../Data/Debate/debate.csv"
with open(filename, "r", encoding="utf8") as csvfile:
    csvreader = csv.reader(csvfile, delimiter=",")
    rows = list(csvreader)

In contrast to the `Reader` object, this list will still be available after you close the file (in this case, outside of the context manager: the with-statement). Compare the following pieces of code and see what happens:

In [None]:
for row in csvreader:
    print(row)

In [None]:
for row in rows:
    print(row)
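The same behaviour can be reproduced in a self-contained way. This sketch writes a tiny temporary CSV file (hypothetical content), converts the reader to a list inside the with-statement, and then catches the error that Python raises when we try to use the reader after the file has been closed. All names are our own.

```python
import csv
import os
import tempfile

demo_path = os.path.join(tempfile.mkdtemp(), "reader_demo.csv")
with open(demo_path, "w", encoding="utf8", newline="") as outfile:
    outfile.write("Line,Speaker\n1,A\n")

with open(demo_path, "r", encoding="utf8", newline="") as csvfile:
    demo_reader = csv.reader(csvfile, delimiter=",")
    demo_rows = list(demo_reader)  # stored in memory, so it survives closing the file

# The list is still available after the file has been closed
print(demo_rows)

# The reader still needs the (now closed) file, so iterating over it fails
try:
    for row in demo_reader:
        print(row)
    error_message = None
except ValueError as error:
    error_message = str(error)

print("Error:", error_message)
```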

However, as you can see from the output above, the first row actually is the header of the table. And if we want to access any specific cell values, we need to use the specific column index. For example, the speaker is represented in the second column, which corresponds to the item with index 1 in each list representing the row:

In [None]:
# Print the speaker for the first 5 rows
for row in rows[0:5]:
    speaker = row[1]
    print(speaker)

#### With the `csv` module: the `DictReader()` function
As was mentioned in the notebook of Topic 1, we can also use a *list of dictionaries* to represent a spreadsheet or database like this, instead of a *list of lists*. In such a list, each dictionary constitutes one row. You can think of the keys as the column headers, and the values as the cell values. 

To create such a list of dictionaries, we can use `csv.DictReader()`. In the [documentation](https://docs.python.org/3.6/library/csv.html) of the `csv` module (here for Python 3.6) we can read the following about this function:

> Create an object that operates like a regular `reader` but maps the information in each row to an `OrderedDict` whose keys are given by the optional fieldnames parameter. The `fieldnames` parameter is a sequence. If `fieldnames` is omitted, the values in the first row of the csvfile will be used as the fieldnames. Regardless of how the fieldnames are determined, the ordered dictionary preserves their original ordering.

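So in the following code, which looks exactly the same as the code we have seen before except that it uses `DictReader()` instead of `reader()`, we create a list of dictionaries, using the values of the first row in the CSV file as keys. (Since Python 3.8, each row is a regular `dict` rather than an `OrderedDict`; you access the values in exactly the same way.)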

In [None]:
# Read the file and convert to a list of dictionaries (each dictionary representing a row)
filename = "../Data/Debate/debate.csv"
with open(filename, "r", encoding="utf8") as csvfile:
    csvreader = csv.DictReader(csvfile, delimiter=",")
    rows = list(csvreader)

for row in rows:
    print(row)
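To see on a small scale what `DictReader()` does with the first row, here is a sketch on a made-up table held in memory (the values are invented for illustration). Note that the quoted documentation describes Python 3.6; in Python 3.8 and later, the rows are plain `dict` objects rather than `OrderedDict` objects, but they are used in the same way:

```python
import csv
import io

# Hypothetical table with a header and a single data row
sample = io.StringIO("Line,Speaker,Text\n1,Holt,Good evening\n")

rows = list(csv.DictReader(sample))
first_row = rows[0]

# The header is not part of the rows; it has become the keys of each dictionary
print(len(rows))
print(list(first_row.keys()))
print(first_row["Speaker"])
```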

We can now easily access and manipulate each of (the properties of) the data points in the CSV file. For example, we can select and print only the first row:

In [None]:
print(rows[0])

Or print only the last 5 rows in a for-loop:

In [None]:
last_5_rows = rows[-5:]
for row in last_5_rows:
    print(row)

Accessing specific cell values is now also more intuitive, because the keys in the dictionary correspond to the column names (refer to Topic 1 if you don't understand how we access the value of the dictionary here):

In [None]:
# Print the text of the last 5 rows
for row in last_5_rows:
    print(row["Text"])

In [None]:
# Print the speaker of the last 5 rows
for row in last_5_rows:
    print(row["Speaker"])

Or print only those rows where Trump is the speaker:

In [None]:
for row in rows:
    if row["Speaker"] == "Trump":
        print(row)

Or print only those texts of the transcripts where Trump mentions Obama:

In [None]:
for row in rows:
    if row["Speaker"] == "Trump" and "Obama" in row["Text"]:
        print(row["Text"])

**Exercise:** Write a program to get a list of dictionaries representing the first 10 rows.

In [None]:
first_10_rows = #?

**Exercise:** Write a program that iterates over all rows in the file and prints the speaker.

In [None]:
# Write your code here

**Exercise:** Print only those rows that have as date: 9/26/16.

In [None]:
# Write your code here

**Exercise:** Now write a program to get all rows from the debate on the 9th of October 2016 that come from a different speaker than Trump or Clinton. Print both the speaker and the text.

In [None]:
# Write your code here

### 2.2 Writing CSV files

Writing CSV files using the `csv` module is just as easy as reading them. Similar to the reading functions, we can use either `writer()` or `DictWriter()` to create an object for writing. Then, you can use `writerow()` or `writerows()` to write data to the file. 

#### The `writer()` function
In the following code, we first read the csv data from `debate.csv` using the `reader()` function, just like we did before. Then, we use the `writer()` function to write the rows to a new file `debate.tsv` using a tab (`\t`) as delimiter instead of a comma (check out the file that is now in the `Data/Debate` folder):

In [None]:
# Read the file and convert to a list of lists (each list representing a row)
filename = "../Data/Debate/debate.csv"
with open(filename, "r", encoding="utf8") as csvfile:
    csvreader = csv.reader(csvfile, delimiter=",")
    rows = list(csvreader)

# Write the list of lists to a new output file
outfilename = "../Data/Debate/debate.tsv"
with open(outfilename, "w", encoding="utf8") as outfile:
    csvwriter = csv.writer(outfile, delimiter='\t', quotechar='"', quoting=csv.QUOTE_ALL)
    for row in rows:
        csvwriter.writerow(row)

The code below does exactly the same as the one above, but uses `writerows()` instead of `writerow()`. Whereas `writerow()` takes a list of cells to write, `writerows()` takes a list of lists of cells to write. In other words, `writerow()` takes 1-dimensional data (one row), and `writerows()` takes 2-dimensional data (multiple rows).

In [None]:
outfilename = "../Data/Debate/debate.tsv"
with open(outfilename, "w", encoding="utf8") as outfile:
    csvwriter = csv.writer(outfile, delimiter='\t', quotechar='"', quoting=csv.QUOTE_ALL)
    csvwriter.writerows(rows)

As you can see, we have passed several keyword arguments into the `csv.writer()` function. We have set the `delimiter` to a tab character (the default is a comma), the `quotechar` to double quotes (which is also the default), and the `quoting` option to `csv.QUOTE_ALL` (the default is `csv.QUOTE_MINIMAL`). The `csv` module contains the following quoting options:

| Option | Explanation |
|--------|-------------|
| `csv.QUOTE_ALL` | Quote everything, regardless of type |
| `csv.QUOTE_MINIMAL` | Quote fields with special characters |
| `csv.QUOTE_NONNUMERIC` | Quote all fields that are not integers or floats |
| `csv.QUOTE_NONE` | Do not quote anything on output |
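The effect of these options is easiest to see by writing the same row four times. The following is only a sketch: `render_row` is a hypothetical helper (not part of the `csv` module) that writes one row to an in-memory buffer, and the row itself is made up. With `csv.QUOTE_NONE`, the writer needs an `escapechar` as soon as a field contains the delimiter, so the helper sets one in that case:

```python
import csv
import io

def render_row(row, quoting):
    """Returns ROW as a single comma-separated line, written with the given quoting option."""
    buffer = io.StringIO()
    # Without quotes, a comma inside a field has to be escaped instead
    escapechar = "\\" if quoting == csv.QUOTE_NONE else None
    csvwriter = csv.writer(buffer, delimiter=",", quotechar='"',
                           quoting=quoting, escapechar=escapechar)
    csvwriter.writerow(row)
    return buffer.getvalue().strip()

# Hypothetical row: two strings (one containing a comma) and an integer
row = ["Trump", "Hello, world", 42]

quote_all = render_row(row, csv.QUOTE_ALL)
quote_minimal = render_row(row, csv.QUOTE_MINIMAL)
quote_nonnumeric = render_row(row, csv.QUOTE_NONNUMERIC)
quote_none = render_row(row, csv.QUOTE_NONE)

for line in [quote_all, quote_minimal, quote_nonnumeric, quote_none]:
    print(line)
```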

#### The `DictWriter()` function
So how does this work with the `DictWriter()` function? We have seen with `reader()` and `DictReader()` that the former gave us each row as a list, whereas the latter gave us each row as a dictionary. Similarly, whereas `writer()` expects a list of lists as input to write to a file, `DictWriter()` expects a list of dictionaries as input. Let's try this out by simply taking the code that we used above and replacing `csv.reader()` by `csv.DictReader()`, and `csv.writer()` by `csv.DictWriter()`:

In [None]:
# Read the file and convert to a list of dictionaries (each dictionary representing a row)
filename = "../Data/Debate/debate.csv"
with open(filename, "r", encoding="utf8") as csvfile:
    csvreader = csv.DictReader(csvfile, delimiter=",")
    rows = list(csvreader)

# Write the list of dictionaries to a new output file
outfilename = "../Data/Debate/debate.tsv"
with open(outfilename, "w", encoding="utf8") as outfile:
    csvwriter = csv.DictWriter(outfile, delimiter='\t', quotechar='"', quoting=csv.QUOTE_ALL)
    csvwriter.writerows(rows)

Oops! We get an error that `DictWriter()` is "missing 1 required positional argument: 'fieldnames'". The dictionaries alone do not tell `DictWriter()` which columns the file should contain or in which order they should be written, so we need to specify this ourselves. The `fieldnames` argument should be a list of strings, representing the names of the columns (the header). Consider the following code:

In [None]:
# Read the file and convert to a list of dictionaries (each dictionary representing a row)
filename = "../Data/Debate/debate.csv"
with open(filename, "r", encoding="utf8") as csvfile:
    csvreader = csv.DictReader(csvfile, delimiter=",")
    rows = list(csvreader)

# Write the list of dictionaries to a new output file
outfilename = "../Data/Debate/debate.tsv"
with open(outfilename, "w", encoding="utf8") as outfile:
    fieldnames = ["Line", "Speaker", "Text", "Date"] # Specify the fieldnames
    csvwriter = csv.DictWriter(outfile, delimiter='\t', quotechar='"', quoting=csv.QUOTE_ALL, fieldnames=fieldnames)
    csvwriter.writerows(rows)

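However, if you open the file in a text editor, you will see that it does not start with the header, but with the first transcript. You can also see this when you read the file again as shown below: because the header is missing, `DictReader()` uses the values of the first transcript as keys: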

In [None]:
# Open the file we just created (the TSV file)
filename = "../Data/Debate/debate.tsv"
with open(filename, "r", encoding="utf8") as csvfile:
    csvreader = csv.DictReader(csvfile, delimiter="\t")
    rows = list(csvreader)

# Print the first row
first_row = rows[0]
print(first_row)

If we want to include the header in the file as well, we need to use `writeheader()`:

In [None]:
# Read the file and convert to a list of dictionaries (each dictionary representing a row)
filename = "../Data/Debate/debate.csv"
with open(filename, "r", encoding="utf8") as csvfile:
    csvreader = csv.DictReader(csvfile, delimiter=",")
    rows = list(csvreader)

# Write the list of dictionaries to a new output file
outfilename = "../Data/Debate/debate.tsv"
with open(outfilename, "w", encoding="utf8") as outfile:
    fieldnames = ["Line", "Speaker", "Text", "Date"] # Specify the fieldnames
    csvwriter = csv.DictWriter(outfile, delimiter='\t', quotechar='"', quoting=csv.QUOTE_ALL, fieldnames=fieldnames)
    csvwriter.writeheader() # Write the header
    csvwriter.writerows(rows)
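To check that nothing gets lost, you can write a list of dictionaries and read it back again. The sketch below assumes two made-up rows and a throwaway folder created with `tempfile`, so it does not touch the `Data` folder. It takes the fieldnames from the keys of the first dictionary, and it opens the file with `newline=""`, which the `csv` documentation recommends for files used with this module:

```python
import csv
import os
import tempfile

# Hypothetical rows; the values are made up for illustration
rows = [
    {"Line": "1", "Speaker": "Holt", "Text": "Good evening"},
    {"Line": "2", "Speaker": "Clinton", "Text": "Thank you"},
]
fieldnames = list(rows[0].keys())  # use the keys of the first row as the header

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "sample.tsv")

    # Write the header and the rows
    with open(path, "w", encoding="utf8", newline="") as outfile:
        csvwriter = csv.DictWriter(outfile, fieldnames=fieldnames, delimiter="\t")
        csvwriter.writeheader()
        csvwriter.writerows(rows)

    # Read the file back in as a list of dictionaries
    with open(path, "r", encoding="utf8", newline="") as infile:
        rows_read_back = list(csv.DictReader(infile, delimiter="\t"))

print(rows_read_back == rows)
```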

**Exercise:** Now try to write a new `csv` file yourself! Create a CSV file called `friends.csv` in the `../Data/Friends` folder that contains the names of 5 of your friends, their gender, and their favorite animal. Do not quote anything on the output and use semicolons as separators. Include the header with the column names in the file. First, use the `writer()` function. Then do the same using the `DictWriter()` function.

In [None]:
# Finish the following list
csvdata = [
    ["First Name", "Last Name", "Gender", "Favorite animal"],
    ["Chantal", "van Son", "female", "cat"],
    # add 5 of your friends 
]

# Create a CSV file called `friends.csv` and write the csv data to this file using the `writer()` function

In [None]:
# Finish the following list
csvdata = [
    {"First Name":"Chantal", "Last Name": "van Son", "Gender": "female", "Favorite animal": "cat"}, 
    # add 5 of your friends 
]

# This will create a new directory 

# Create a CSV file called `friends.csv` and write the csv data to this file using the `DictWriter()` function

## 3. JSON

Let's have a look at another data format. You probably have heard about JSON before. JSON is a lightweight data-interchange format that is easy for humans to read and write, and easy for machines to parse and generate. Although it is completely language independent, data formatted in JSON looks a lot like a Python dictionary! The `json` module provides an easy way to encode and decode data in JSON. This can be done with the following functions:

- `json.load()` and `json.loads()` for reading JSON
- `json.dump()` and `json.dumps()` for writing JSON 

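The functions without an s (`load()` and `dump()`) work with file objects. The functions with an s (`loads()` and `dumps()`) work with strings: `loads()` reads JSON from a string, and `dumps()` returns JSON as a string.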

We will show what JSON looks like and how to use these functions by creating a dictionary in Python called `dictionary_friends`. Recall from Topic 1 that dictionaries consist of keys and values. In this case, we have 4 keys ("Chantal", "Jean", "Laura" and "Patrick"), and the values of these keys are dictionaries themselves. These dictionaries have 7 keys ("first name", "gender", etc.) with strings, boolean values, integers, lists and `None` as values. 

In [None]:
dictionary_friends = {
    "Chantal": {
        "first name": "Chantal", 
        "last name": "van Son", 
        "gender": "female", 
        "age": 26, 
        "favorite_animal": "cat",
        "single": False,
        "siblings": ["Dennis", "Kelly"]},
    "Jean": {
        "first name": "Jean", 
        "last name": "van der Sluijs", 
        "gender": "male", 
        "age": 30, 
        "favorite_animal": "dog",
        "single": False,
        "siblings": ["Leo"]},
    "Laura": {
        "first name": "Laura", 
        "last name": "Kamphuis", 
        "gender": "female", 
        "age": 25, 
        "favorite_animal": "platypus",
        "single": False,
        "siblings": ["Danique", "Lisa"]},
    "Patrick": {
        "first name": "Patrick", 
        "last name": "van der Plas", 
        "gender": "male", 
        "age": 26, 
        "favorite_animal": "giraffe",
        "single": True,
        "siblings": None}}

Now, let's first import the `json` module:

In [None]:
import json

#### The `dump()` and `dumps()` functions
We can very easily write our dictionary to a file in JSON format by using `json.dump()` as follows:

In [None]:
with open("../Data/Friends/friends.json", "w", encoding="utf8") as outfile:
    json.dump(dictionary_friends, outfile)

We used `json.dump()` and not `json.dumps()` because we wanted to write to a file; `json.dumps()` would have returned the JSON as a string instead. What happened here is that Python has turned the dictionary into a string in JSON format, and wrote this string to the file `friends.json`. We can read the file again to see what this string looks like:

In [None]:
with open("../Data/Friends/friends.json", "r", encoding="utf8") as infile:
    json_string = infile.read()
    print(json_string)

As you can see, this does not look very pretty. We can solve that by using the keyword arguments `indent` (set it to 4, for example) for pretty-printing and `sort_keys` (set to `True`) to sort the keys alphabetically:

In [None]:
# Create the JSON file
with open("../Data/Friends/friends.json", "w", encoding="utf8") as outfile:
    json.dump(dictionary_friends, outfile, indent=4, sort_keys=True)

# Read in the JSON file again
with open("../Data/Friends/friends.json", "r", encoding="utf8") as infile:
    json_string = infile.read()
    print(json_string)

It looks almost exactly like a Python dictionary! However, it really is a string. Remember, you can check the type of a Python object as follows:

In [None]:
print(type(json_string))

If you compare the JSON-encoded string to the original dictionary, there are some small differences. The boolean values, for example, are written as `true` and `false` instead of `True` and `False`, and `null` is equivalent to `None`.
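These differences can be checked on a small made-up dictionary. The sketch below also shows two conversions that are easy to overlook: a tuple is written as a JSON list, and a key that is not a string (here the integer `1`) comes back as a string after decoding:

```python
import json

# Hypothetical dictionary; the keys and values are chosen for illustration only
person = {"single": True, "siblings": None, "pair": (1, 2), 1: "integer key"}

encoded = json.dumps(person)   # Python dictionary -> JSON string
decoded = json.loads(encoded)  # JSON string -> Python dictionary

print(encoded)
print(decoded)
```

So a dictionary that went through JSON and back is not always identical to the original one.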

#### The `load()` and `loads()` functions
Here is how you turn a JSON-encoded string back into a Python dictionary (now we use a string as argument, so we use `json.loads()`):

In [None]:
dictionary_friends = json.loads(json_string)

Execute the following code to check the output of `json.loads()`: 

In [None]:
print(dictionary_friends)
print(type(dictionary_friends))
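To recap, the four functions differ only in where the JSON comes from or goes to: a string or a file. The following sketch puts them side by side on a small made-up dictionary. It uses an in-memory `io.StringIO` object as a stand-in for a real file, which is an assumption made only to keep the example self-contained:

```python
import io
import json

data = {"name": "Chantal", "age": 26}  # hypothetical example data

# With an s: JSON as a string
json_string = json.dumps(data)
from_string = json.loads(json_string)

# Without an s: JSON in a file (here an in-memory stand-in for a file)
fake_file = io.StringIO()
json.dump(data, fake_file)
fake_file.seek(0)  # go back to the start of the "file" before reading
from_file = json.load(fake_file)

print(json_string)
print(from_string == data, from_file == data)
```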

Well, that was easy, wasn't it? Next week, we will practice some more with JSON. 

**Exercises:**
For now, let's practice a bit more with accessing the values of dictionaries:

In [None]:
# Example: This will print the gender of "Chantal"
print(dictionary_friends["Chantal"]["gender"])

In [None]:
# Print the age of Jean


In [None]:
# Print the first sibling of Laura


In [None]:
# Write a for-loop to print the favorite animal of each person


## 4. Reading and writing multiple files: Vickie's dream reports

So far, we have practiced with reading and writing single files containing data in different formats, including text, CSV/TSV and JSON. But you will often have multiple files to work with. The folder `../Data/Dreams` contains 10 text files describing dreams of Vickie, a 10-year-old girl. These texts are extracted from [DreamBank](http://www.dreambank.net/). We want to get a general idea of what Vickie's dreams are about. Therefore, we will extract all nouns from her dreams.

### 4.1 The `glob` and `os` modules
To be able to process multiple files, we need to *iterate* over a list of files. These files are usually stored in one or multiple directories on your computer. In this case, we want to iterate over all the files in the directory `../Data/Dreams`. So we need to find a way to tell Python: "I want to do something with all these files at this location!" There are two modules that you can use for this: `glob` and `os`. 

#### The `glob` module
Let's first look at the `glob` module:

In [None]:
import glob

The `glob` module is very useful to find all the pathnames matching a specified pattern according to the rules used by the Unix shell. The two wildcards you will use most often are the asterisk and the question mark. An asterisk (\*) matches zero or more characters in a segment of a name. For example, the following code gives all filenames in the directory `../Data/Dreams`:

In [None]:
for filename in glob.glob("../Data/Dreams/*"):
    print(filename)

Oops! There is one file that, apparently, we should ignore. So actually, we only want to have those filenames in the directory that have the extension `.txt`. We can do that as follows: 

In [None]:
for filename in glob.glob("../Data/Dreams/*.txt"):
    print(filename)

A question mark (?) matches any single character in that position in the name. For example, the following code prints all filenames in the directory `../Data/Dreams` that start with 'vickie' followed by exactly 1 character and end with the extension `.txt` (note that this will not print `vickie10.txt`):

In [None]:
for filename in glob.glob("../Data/Dreams/vickie?.txt"):
    print(filename)

You can also find filenames recursively by using the pattern `**` (the keyword argument `recursive` should be set to `True`), which will match any files and zero or more directories and subdirectories. The following code prints all files with the extension `.txt` in the directory `../Data` and in all its subdirectories:

In [None]:
for filename in glob.glob("../Data/**/*.txt", recursive=True):
    print(filename)
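If you want to experiment with these patterns without touching the `Data` folder, you can build a small throwaway folder first. The following is a sketch under that assumption: the file names are made up, `matching_names` is a hypothetical helper, and `os` and `tempfile` are only used here to create and clean up the folder:

```python
import glob
import os
import tempfile

def matching_names(pattern):
    """Returns the sorted filenames (without directories) of all paths matching PATTERN."""
    names = []
    for path in glob.glob(pattern, recursive=True):
        names.append(os.path.basename(path))
    return sorted(names)

with tempfile.TemporaryDirectory() as tmpdir:
    # Create a few empty-ish files, one of them in a subdirectory
    os.makedirs(os.path.join(tmpdir, "sub"))
    for name in ["vickie1.txt", "vickie10.txt", "notes.md", os.path.join("sub", "other.txt")]:
        with open(os.path.join(tmpdir, name), "w", encoding="utf8") as outfile:
            outfile.write("dream")

    all_txt = matching_names(os.path.join(tmpdir, "*.txt"))
    one_character = matching_names(os.path.join(tmpdir, "vickie?.txt"))
    recursive_txt = matching_names(os.path.join(tmpdir, "**", "*.txt"))

print(all_txt)
print(one_character)
print(recursive_txt)
```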

#### The `os` module

Another module that you will frequently see being used in examples is the `os` module. Let's have a look at this module to compare it with `glob`:

In [None]:
import os

The `listdir()` function of the `os` module is similar to `glob.glob()`, but it does not support wildcards. The following code will therefore raise an error:

In [None]:
for filename in os.listdir("../Data/Dreams/*.txt"):
    print(filename)

Instead, we just use the directory without wildcards as argument. Then, we can use for example the `splitext()` function of the module `os.path` to split the filename into the filename and its extension (returning a tuple). Then, we can use an `if-statement` if we want to search for only those filenames that have `.txt` as an extension. In the code below, we first print these tuples of filenames and their extensions, and then add the `.txt` files to a list called `txt_files`:

In [None]:
txt_files = []
for filename in os.listdir("../Data/Dreams"):
    
    # Split into filename and extension; print the tuples
    split_filename = os.path.splitext(filename)
    print(split_filename)
    
    # Check if extension is .txt; if so, add to list
    extension = os.path.splitext(filename)[1]
    if extension == ".txt":
        txt_files.append(filename)

print("\nThis is the list of filenames with the extension `.txt`:")
print(txt_files)

In addition, whereas `glob.glob()` returns a list of paths that include the directory, `os.listdir()` returns only a list of the filenames. If we want to have the full path, we need to join the path of the directory where the file is stored and the filename using the `join()` function of the `os.path` module as follows:

In [None]:
for filename in os.listdir("../Data/Dreams"):
    extension = os.path.splitext(filename)[1]
    if extension == ".txt":
        path_file = os.path.join("../Data/Dreams", filename)
        print(path_file)

This all seems a lot of extra work if you compare it to the `glob` module, and in a way it is :-) If you want to quickly find all the pathnames matching a specified pattern, the simple but powerful `glob` module is the way to go. However, the `os` module has many more features that can be very useful and which are not supported by the `glob` module. We will not go over each and every useful method here, but here's a list of some of the things that you can do (some of which we have seen above): 
- creating a single directory or a directory including all missing parent directories: `os.mkdir()`, `os.makedirs()`;
- removing a single empty directory or a directory including its empty parent directories: `os.rmdir()`, `os.removedirs()`;
- checking whether something is a file or a directory: `os.path.isfile()`, `os.path.isdir()`;
- split a path and return a tuple containing the directory and filename: `os.path.split()`;
- construct a pathname out of one or more partial pathnames: `os.path.join()`;
- split a filename and return a tuple containing the filename and the file extension: `os.path.splitext()`;
- get only the basename or the directory path: `os.path.basename()`, `os.path.dirname()`.
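Here is a short sketch that tries out several of these functions. It assumes a throwaway folder created with `tempfile` (so nothing in the `Data` folder is changed), and the directory names `a` and `b` are made up for illustration:

```python
import os
import tempfile

# Take a path apart (the file does not need to exist for this)
path = os.path.join("..", "Data", "Dreams", "vickie1.txt")
directory, filename = os.path.split(path)
name, extension = os.path.splitext(filename)
print(directory, filename, name, extension)

with tempfile.TemporaryDirectory() as tmpdir:
    nested = os.path.join(tmpdir, "a", "b")

    # Create the directory "b" together with its missing parent "a"
    os.makedirs(nested, exist_ok=True)
    is_dir = os.path.isdir(nested)
    is_file = os.path.isfile(nested)

    # Remove the (empty) directory "b" again
    os.rmdir(nested)
    still_exists = os.path.isdir(nested)

print(is_dir, is_file, still_exists)
```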

### 4.2 Writing a function

We have seen that Python has several built-in functions. But you can also create your own function. A function is a reusable block of code that performs a specific task. Once you have defined a function, you can use it at any place in your Python script. You can even import a function in one script into another one in a similar way as we have seen before with importing modules. Therefore, they are very useful for tasks that you will perform more often. Plus, functions are a convenient way to order your code and make it more readable!

Whenever you are writing a function, you need to think of the following things:
- What is the purpose of the function?
- How should I name the function?
- What input does the function need?
- What output should the function generate?

We will use an example from [this website](http://anh.cs.luc.edu/python/hands-on/3.1/handsonHtml/functions.html) to show you some of the basics of writing a function. Also have a look at [Chapter 2](https://github.com/evanmiltenburg/python-for-text-analysis/blob/master/Python-chapters/chapter-2.md) for a summary about writing functions.

#### Purpose and name
Let's say we want to sing a birthday song to Emily. Then we print the following lines:

In [None]:
print("Happy Birthday to you!")
print("Happy Birthday to you!")
print("Happy Birthday, dear Emily.")
print("Happy Birthday to you!")

This could be the purpose of a function: to print the lines of a birthday song for Emily. 
Now, we define a function to do this. Here is how you define a function:

- write `def`;
- the name you would like to call your function;
- a set of parentheses containing the argument(s) of your function;
- a colon;
- a docstring describing what your function does;
- the body of the function (the indented code that does the work);
- optionally, a `return` statement (without one, the function returns `None`).

We give the function a clear name, `happy_birthday_to_emily` and we define the function as shown below. Note that we specify exactly what it does in the docstring in the beginning of the function:

In [None]:
def happy_birthday_to_emily():
    """
    Prints a birthday song to Emily
    """
    print("Happy Birthday to you!")
    print("Happy Birthday to you!")
    print("Happy Birthday, dear Emily.")
    print("Happy Birthday to you!")
    return

If we execute the code above, we don't get any output. That's because we only told Python: "Here's a function to do this, please remember it." If we actually want Python to execute everything inside this function, we have to *call* it:

In [None]:
happy_birthday_to_emily()

#### Input: arguments
But Emily is not the only one who celebrates her birthday once a year. To not exclude any of our friends, let's make a more generic function that can sing the song to anyone. This function will need as input the name of the person. The following function takes the name of the person (string) as an argument and then sings the song with the person’s name inserted at the end of the third line:

In [None]:
def happy_birthday(name):
    """
    Prints a birthday song with the "name" of the person inserted
    """
    print("Happy Birthday to you!")
    print("Happy Birthday to you!")
    print("Happy Birthday, dear " + name + ".")
    print("Happy Birthday to you!")
    return

Let's try to call this function:

In [None]:
happy_birthday()

Oops! We didn't specify the required positional argument `name`. Let's try again:

In [None]:
happy_birthday("James")

#### Output: the `return` statement
If we call the function above, the print statements are executed. We can also create a string variable for the song within the function:

In [None]:
def happy_birthday(name):
    song = """Happy Birthday to you!
Happy Birthday to you!
Happy Birthday, dear %s.
Happy Birthday to you!""" % name 
    return

# Did you know you can format a string like this, using a percentage sign to insert a variable within a string?

However, if we want to sing the song by printing the variable...

In [None]:
print(song)

That does not work! We get an error (a `NameError`) saying that the variable `song` is not defined. That is because it is only defined *within* the function. Outside the scope of this function, the variable is not stored. If the variable is the *output* that we would like to get back from the function, we need to use the `return` statement as follows:

In [None]:
def happy_birthday(name):
    """
    Returns a birthday song as a string with the "name" of the person inserted
    """
    song = """Happy Birthday to you!
Happy Birthday to you!
Happy Birthday, dear %s.
Happy Birthday to you!""" % name
    return song

Now, we can call the function and assign its output to the variable `song`:

In [None]:
song = happy_birthday("James")
print(song)

We do not have to give this variable the same name as the variable defined in the scope of the function. This works fine as well:

In [None]:
wohworhwffwk = happy_birthday("James")
print(wohworhwffwk)

A function can take multiple arguments as input, and can return multiple values as output (Python packs them into a tuple):

In [None]:
def sum_and_diff_len_strings(string1, string2):
    """
    Returns the sum of and difference between the lengths of two strings
    """
    sum_strings = len(string1) + len(string2)
    diff_strings = len(string1) - len(string2)
    return sum_strings, diff_strings

sum_strings, diff_strings = sum_and_diff_len_strings("horse", "dog")
print("Sum:", sum_strings)
print("Difference:", diff_strings)
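Behind the scenes, a function that returns several values returns a single tuple, which we unpacked into two variables above. The sketch below repeats the function so that it can be run on its own, and stores the output in one variable (`result` is just a name chosen for illustration):

```python
def sum_and_diff_len_strings(string1, string2):
    """
    Returns the sum of and difference between the lengths of two strings
    """
    sum_strings = len(string1) + len(string2)
    diff_strings = len(string1) - len(string2)
    return sum_strings, diff_strings

# Without unpacking, the output is one tuple containing both values
result = sum_and_diff_len_strings("horse", "dog")
print(type(result))
print(result)
print("Sum:", result[0])
```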

**Exercise:** Now write a function that opens a file and returns its content as a single string:

In [None]:
def get_content_file(filename):
    # Finish this function


In [None]:
# Test your code here! If your code is correct, you should get a compliment :-)
content = get_content_file("../Data/Dreams/vickie1.txt")
assert(len(content) == 751)
print("Well done!")

### 4.3 Putting it all together
Now we know how to process multiple files and how to write a function. Let's try to use these skills to get all nouns from Vickie's dream reports! Remember how we tagged all tokens with their POS tags on a single text file? We had to open the file, read the contents, tokenize the text, and use the POS tagger (remember, we first needed to import `nltk` to be able to use it). We can now write a single function that does all that for us. The following function reads the specified file and returns the tokens with their POS tags:

In [None]:
import nltk

def tag_tokens_file(filename):
    "Reads the contents of FILENAME and returns a list of its tokens with their POS tags."
    with open(filename, "r", encoding="utf8") as infile:
        content = infile.read()
        tokens = nltk.word_tokenize(content)
        tagged_tokens = nltk.pos_tag(tokens)
    return tagged_tokens

Now, instead of having to open a file, read the contents and close the file, we can just call the function `tag_tokens_file` to do this: 

In [None]:
filename = "../Data/Dreams/vickie1.txt"
tagged_tokens = tag_tokens_file(filename)
print(tagged_tokens)

We can also do this for each of the files in the `../Data/Dreams` directory by using a for-loop:

In [None]:
import glob

# Iterate over the `.txt` files in the directory and perform POS tagging on each of them
for filename in glob.glob("../Data/Dreams/*.txt"): 
    tagged_tokens = tag_tokens_file(filename)
    print(filename, "\n", tagged_tokens, "\n")

Now, we extend this code a bit so that we don't print all POS-tagged tokens of each file, but we get all singular nouns and singular proper nouns from the texts (the tags `NN` and `NNP`) and add them to a list called `nouns_in_dreams`. Then, we print the set of nouns:

In [None]:
# Create a list that will contain all nouns
nouns_in_dreams = []

# Iterate over the `.txt` files in the directory and perform POS tagging on each of them
for filename in glob.glob("../Data/Dreams/*.txt"): 
    tagged_tokens = tag_tokens_file(filename)
        
    # Get all (proper) nouns in the text ("NN" and "NNP") and add them to the list
    for token in tagged_tokens:
        if token[1] in ["NN", "NNP"]:
            nouns_in_dreams.append(token[0])

# Print the set of nouns in all dreams
print(set(nouns_in_dreams))
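A set tells us which nouns occur, but not how often. One possible extension is to count the nouns with `Counter` from the `collections` module. The following is only a sketch: instead of calling the tagger, it assumes a small hand-made list of `(token, tag)` pairs in the same shape as the output of `tag_tokens_file()`. It also checks whether the tag starts with `NN`, so that the plural tags `NNS` and `NNPS` are included as well:

```python
from collections import Counter

# Hypothetical tagger output: a list of (token, POS tag) tuples
tagged_tokens = [
    ("dog", "NN"), ("dogs", "NNS"), ("Vickie", "NNP"),
    ("ran", "VBD"), ("dog", "NN"),
]

noun_counts = Counter()
for token, tag in tagged_tokens:
    # All noun tags (NN, NNS, NNP, NNPS) start with "NN"
    if tag.startswith("NN"):
        noun_counts[token] += 1

# Print the most frequent noun together with its count
print(noun_counts.most_common(1))
```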


Now we have an idea what Vickie dreams about!