# Steps 4 and 5: Analysing Data and Visualising Results
---
---
First, we need to reload the cleaned list of word tokens that we saved to a file in the [previous notebook](4-cleaning-and-exploring.ipynb). (You don't need to understand what is happening here in detail to follow the rest of the notebook.)

In [None]:
# Import a module that helps with filepaths
from pathlib import Path

# Create a filepath for the file
tokens_file = Path('data', 'CLEAN-2199-0.txt')

# Create an empty list to hold the tokens
tokens = []

# Open the text file and append all the words to a list of tokens
with open(tokens_file, encoding='utf-8') as file:
    for token in file.read().split():
        tokens.append(token)

tokens[0:20]

---
---
## Step 4: Analysing Data with Frequency Analysis
Let's take a moment to remember our research question:

> What are the top 10 words used in Homer's Iliad in English translation?

To answer this question we need to count how many times each unique word occurs in the text. Then we can see which 10 words are the most frequent. This set of counts is called a *frequency distribution*.

### English Stopwords
Before we start, we need to take a moment to think about what sort of words we are actually interested in counting. 

We are not interested in common words in English that carry little meaning, such as "the", "a" and "its". These are called *stopwords*. There is no definitive list of stopwords, but most Python packages used for Natural Language Processing provide one as a starting point, and spaCy is no exception.

In [None]:
# Import the spaCy standard stopwords list
from spacy.lang.en.stop_words import STOP_WORDS
stopwords = [stop for stop in STOP_WORDS]

# Sort the stopwords in alphabetical order to make them easier to inspect
sorted(stopwords)

In [None]:
# Write code here to count the number of stopwords

> **Exercise**: What do you notice about these stopwords?

For your own research you will need to consider which stopwords are most appropriate:
* Will standard stopword lists for modern languages be suitable for that language written 10, 50, 200 years ago?
* Are there special stopwords specific to the topic or style of literature?
* How might you find or create your own stopword list?
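
As a minimal sketch of the last point, one way to build your own list is to start from a standard list and add words that suit your text. The word lists and sample tokens below are invented for illustration (they are not spaCy's list and not taken from the Iliad file), and all the variable names are hypothetical.

```python
# Hypothetical base list standing in for a standard modern-English stopword list
base_stopwords = {"the", "a", "its", "and", "of"}

# Hypothetical additions for an older translation: archaic pronouns and verb forms
archaic_stopwords = {"thou", "thee", "thy", "hath", "ye"}

# Combine the two sets; using a set also makes membership tests fast
custom_stopwords = base_stopwords | archaic_stopwords

# Toy tokens, invented for illustration
sample_tokens = ["thou", "hast", "the", "spear", "of", "achilles"]

# Keep only the tokens that are not in the custom stopword set
filtered = [token for token in sample_tokens if token not in custom_stopwords]
print(filtered)
```

Notice that "hast" survives because it was not added to the list: a custom stopword list usually needs several rounds of inspection and revision.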

Now we can filter out the tokens that match spaCy's stopword list:

In [None]:
tokens_nostops = [token for token in tokens if token not in stopwords]
tokens_nostops

### Creating a Frequency Distribution
At last, we are ready to create a frequency distribution by counting the frequency of each unique word in the text.

We can do this with `Counter` from Python's built-in `collections` module:

In [None]:
# Import a module that helps with counting
from collections import Counter

# Count the frequency of words
word_freq = Counter(tokens_nostops)
word_freq

This `Counter` maps each word to the number of times it appears in the text, e.g. `'coward': 17`. By scrolling down the list you can inspect what look like common and infrequent words.
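
To get a feel for how a `Counter` behaves, here is a small sketch on a toy list of tokens. The tokens and the names `sample_tokens` and `sample_freq` are invented for illustration and are not taken from the Iliad data.

```python
from collections import Counter

# Toy tokens, invented for illustration
sample_tokens = ["spear", "shield", "spear", "ship", "spear", "ship"]
sample_freq = Counter(sample_tokens)

# Look up the count for a word that occurs in the list
print(sample_freq["spear"])

# A word that never occurs has a count of 0 rather than raising an error
print(sample_freq["horse"])

# The number of unique words, and the total number of tokens counted
print(len(sample_freq))
print(sum(sample_freq.values()))
```

You can look up any word in `word_freq` in the same way, for example `word_freq['coward']`.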

Now we can get precisely the 10 most common words using the `Counter` method `most_common()`:

In [None]:
common_words = word_freq.most_common(10)
common_words

> **Exercise**: Investigate what is further down the list of top words.

---
---
## Step 5: Presenting Results of the Analysis Visually
There are many options for displaying both simple charts and very complex data in Jupyter notebooks. We are going to use the best-known plotting library, [Matplotlib](https://matplotlib.org/), although it is perhaps not the easiest to use.

To create a Matplotlib plot we need to:

* Import Matplotlib's `pyplot` plotting module
* Arrange our data into a pair of lists: one for the x-axis, one for the y-axis
* Set the appearance of titles, labels, ticks and gridlines
* Pass the data into the plot function
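
The second step, arranging the data into a pair of lists, can be sketched on toy data. The pairs below are invented for illustration but have the same shape as the list returned by `most_common()`. This sketch uses `zip()`, which is one of several ways to split the pairs.

```python
# Toy (word, count) pairs in the same shape that most_common() returns
sample_common = [("spear", 3), ("ship", 2), ("shield", 1)]

# zip(*pairs) regroups the pairs into one tuple of words and one tuple of counts
sample_words, sample_counts = zip(*sample_common)

# Convert the tuples to lists: one for the x-axis, one for the y-axis
x_values = list(sample_words)
y_values = list(sample_counts)
print(x_values)
print(y_values)
```

This approach assumes the list of pairs is not empty; with an empty list the unpacking on the left-hand side fails.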

Let's display our results as a simple line plot:

In [None]:
# Display the plot inline in the notebook with interactive controls
%matplotlib notebook

# Import the matplotlib plotting module
import matplotlib.pyplot as plt

# Get a list of the most common words
words = [word for word, _ in common_words]

# Get a list of the frequency counts for these words
freqs = [count for _, count in common_words]

# Set titles, labels, ticks and gridlines
plt.title("Top 10 Words used in Homer's Iliad in English translation")
plt.xlabel("Word")
plt.ylabel("Count")
plt.xticks(range(len(words)), [str(s) for s in words], rotation=90)
plt.grid(True, which='major', color='#333333', linestyle='--', alpha=0.2)

# Plot the frequency counts
plt.plot(freqs)

# Show the plot
plt.show()

With this interactive plot you can:

* Resize the plot by dragging the bottom right-hand corner.
* Pan across the plot to see values further to the right (if there are any to display).
* Zoom into the plot.

> **Exercise**: Change the code to explore different data and ways of displaying your data. 

There are also lots of other graphs that Matplotlib can create, and alternative plotting libraries to use instead, but these are beyond the scope of our course.
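
As one small illustration, word counts are separate categories rather than points on a continuous scale, so a bar chart is often a natural alternative to a line plot. The sketch below uses toy data invented for illustration and Matplotlib's `ax.bar()`; the variable names are hypothetical, and you could swap in `words` and `freqs` from the cell above.

```python
import matplotlib.pyplot as plt

# Toy (word, count) pairs, invented for illustration
sample_common = [("spear", 3), ("ship", 2), ("shield", 1)]
sample_words = [word for word, _ in sample_common]
sample_counts = [count for _, count in sample_common]

# Create a figure and axes, then draw one bar per word
fig, ax = plt.subplots()
bars = ax.bar(sample_words, sample_counts)

# Set titles, labels, ticks and gridlines
ax.set_title("Word counts (toy data)")
ax.set_xlabel("Word")
ax.set_ylabel("Count")
ax.tick_params(axis="x", labelrotation=90)
ax.grid(True, axis="y", linestyle="--", alpha=0.2)

# Show the plot
plt.show()
```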

---
---
## Review and Reflection
Now that you have seen the data and graph we have generated, no doubt you can see many ways we could improve them.

Text-mining a corpus (or an individual text) is an iterative process. As you clean and explore the data, you will go back over your workflow again and again: from the collection stage, through to cleaning, analysis and presentation.

> **Exercise**: List the ways you think we should improve the pipeline, from collection to plot.

Fortunately, when you do your text-mining in code (and write explanatory text to document it) you know exactly what you did and can rerun and modify the process.
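
One way to make rerunning easier is to wrap the filtering and counting steps in a function, so that you can try different stopword lists or a different number of top words. This is only a sketch: the function name `top_words` and the toy tokens are hypothetical and not part of the course code.

```python
from collections import Counter


def top_words(tokens, stopwords, n=10):
    """Return the n most frequent tokens that are not stopwords."""
    # Convert to a set so that membership tests are fast
    stopword_set = set(stopwords)
    kept = [token for token in tokens if token not in stopword_set]
    return Counter(kept).most_common(n)


# Toy tokens, invented for illustration
sample_tokens = ["the", "spear", "of", "the", "ship", "spear"]

# Rerun the same analysis with two different stopword lists
print(top_words(sample_tokens, ["the", "of"], n=2))
print(top_words(sample_tokens, ["the", "of", "spear"], n=2))
```

With the real data you could call `top_words(tokens, stopwords)` and compare the results as you revise your stopword list.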

---
### Going Further: Libraries, Libraries, Libraries

By now, you will be getting the idea that much of what you want to do in Python involves importing libraries to help you. Remember, libraries are _just code that someone else has written_.

As a reminder, here are some of the useful libraries we have used or mentioned in these notebooks:
* [Requests](http://docs.python-requests.org/en/master/) - HTTP (web) requests library
* [spaCy](https://spacy.io/) - natural language processing library
* [Matplotlib](https://matplotlib.org/) - 2D plotting library

---

---
---
## Summary

In this notebook, we have:

* **Loaded** clean token data from a file into a list.
* Removed English **stopwords** from the list of tokens.
* Created a **frequency distribution** and found the 10 most frequent words.
* Visualised the frequency distribution in a **line plot**.

---
---
## What's Next?
You will get the most out of this course if you can follow up on the learning over the next few days and weeks before you forget it all. This is particularly important when learning to code. Abstract concepts need to be reinforced little and often.

### Install Python on your Computer

If you don't already have Python installed on your computer, perhaps the easiest way is with Anaconda:

* [Installing Anaconda on Windows](https://www.datacamp.com/community/tutorials/installing-anaconda-windows)
* [Installing Anaconda on Mac](https://www.datacamp.com/community/tutorials/installing-anaconda-mac-os-x)

### Running Jupyter Notebooks on your Computer

Learn how to run and write Jupyter notebooks on your own computer (rather than using Binder): [Jupyter Notebook Tutorial: The Definitive Guide](https://www.datacamp.com/community/tutorials/tutorial-jupyter-notebook).

### Recommended Books on Python

* Sweigart, A., 2019. _Automate the Boring Stuff with Python_ (2nd ed.). San Francisco: No Starch Press. [Available online](https://automatetheboringstuff.com/)
* Kazil, J. & Jarmul, K., 2016. _Data Wrangling with Python_. Sebastopol: O'Reilly Media.

### Text-mining and NLP in General

* Work through this series of [Programming Historian tutorials](https://programminghistorian.org/en/lessons/working-with-text-files) to get some more practice with basic text files and basic text-mining techniques.
* Follow a more in-depth set of Jupyter notebooks with [The Art of Literary Text Analysis](https://github.com/sgsinclair/alta/blob/master/ipynb/ArtOfLiteraryTextAnalysis.ipynb).
* Read a practical and well-explained approach to Natural Language Processing (NLP) in Python: Srinivasa-Desikan, B., 2018. _Natural Language Processing and Computational Linguistics: A practical guide to text analysis with Python, Gensim, spaCy, and Keras._ Birmingham: Packt Publishing. [Available online](https://idiscover.lib.cam.ac.uk/primo-explore/fulldisplay?docid=44CAM_NPLD_MARC018975982&context=L&vid=44CAM_PROD&search_scope=SCOP_CAM_ALL&tab=cam_lib_coll&lang=en_US). This book has chapters on text pre-processing steps and various NLP techniques, and comes with Jupyter notebooks to follow.

### Python for Digital Humanities

* Work through Chapters 1 - 4 (online Jupyter notebooks) of [Python Programming for the Humanities](http://www.karsdorp.io/python-course/).
* Browse a big list of resources for [Teaching Yourself to Code in DH](http://scottbot.net/teaching-yourself-to-code-in-dh/).