# Introduction to Topic Modelling with Python

---
---
## What is Topic Modelling?

Topic modelling is a _distant reading_ technique for finding structure in large collections of text, without actually reading everything by eye. If you have hundreds or thousands of documents and want to understand roughly what your corpus contains, then topic modelling may be for you.

A topic modelling programme finds the words that appear frequently together in a document and groups them together to form 'topics'. A **topic** is a mixture of words that is supposed to characterise (part of) the content of a document — its theme or underlying ideas. For example, one topic of this [Wikipedia article](https://en.wikipedia.org/wiki/Black_hole) is:

* black, hole, mass, star

![First picture of a supermassive black hole, captured in 2019](https://upload.wikimedia.org/wikipedia/commons/thumb/c/cf/Black_hole_-_Messier_87.jpg/320px-Black_hole_-_Messier_87.jpg "First picture of a supermassive black hole, captured in 2019")

Not too surprising, you may think. We could say the topic seems pretty accurate from our perspective. What about a document that we are less familiar with? Here is a topic of a [speech](https://er.jsc.nasa.gov/seh/ricetalk.htm) made by John F. Kennedy at Rice University in 1962:

* space, new, year, man

![Charles Conrad Jr., Apollo 12 Commander, examines the unmanned Surveyor III spacecraft on the Moon](https://upload.wikimedia.org/wikipedia/commons/thumb/4/4e/Surveyor_3-Apollo_12.jpg/274px-Surveyor_3-Apollo_12.jpg "Charles Conrad Jr., Apollo 12 Commander, examines the unmanned Surveyor III spacecraft on the Moon")

This is Kennedy's famous 'we choose to go to the moon' speech. Notice that 'moon' is not in this topic, even though the speech is famous for it. The speech does, however, cover the history of humankind's ("man's") endeavours and emphasises a forward-looking perspective (the "new"-ness of advancements).

From these simplified examples, we can see that human intervention is still required to interpret what topics might 'mean'. Topic modelling is not magic; it is a tool that requires informed use and careful review, just like any other.
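To make the idea of 'words that appear frequently together' a little more concrete, here is a minimal sketch that simply counts which pairs of words share a document. The four tiny 'documents' are invented for illustration, and names such as `pair_counts` are our own. Real topic models are statistical and considerably more sophisticated than this, so treat it as an intuition rather than the actual method. Don't worry if some of the code is unfamiliar at this point.

```python
from collections import Counter
from itertools import combinations

# Four made-up 'documents', each already split into words (toy data)
documents = [
    ['black', 'hole', 'mass', 'star'],
    ['star', 'mass', 'gravity', 'black', 'hole'],
    ['space', 'moon', 'rocket', 'new'],
    ['space', 'new', 'year', 'moon'],
]

# Count how many documents each pair of words appears in together
pair_counts = Counter()
for words in documents:
    unique_words = sorted(set(words))
    for pair in combinations(unique_words, 2):
        pair_counts[pair] += 1

# Keep only the pairs that share more than one document
frequent_pairs = [pair for pair, count in pair_counts.items() if count > 1]
print(frequent_pairs)
```

The pairs that survive fall into two groups, one about black holes and one about space travel, which is roughly the kind of grouping a topic model looks for on a much larger scale.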

### So... Why Do Topic Modelling?
In the humanities, topic modelling may be used to support different approaches to large text corpora, such as:

* Survey a collection that is too big to read closely e.g. [Computational Historiography: Data Mining in a Century of Classics Journals](http://www.perseus.tufts.edu/publications/02-jocch-mimno.pdf) (PDF)
* Look at thematic trends over time in an archive e.g. [Topic Modeling Martha Ballard's Diary](http://www.cameronblevins.org/posts/topic-modeling-martha-ballards-diary/)
* Create metadata for an archive to improve accessibility e.g. [Topic modelling for the valorisation of digitised archives of the European Commission](https://ieeexplore.ieee.org/abstract/document/7840981)
* Understand current trends in social media relevant to your discipline e.g. [Mining the Open Web with ‘Looted Heritage’](https://electricarchaeology.ca/2012/06/08/mining-the-open-web-with-looted-heritage-draft/)

### Alternatives to Topic Modelling in Python
If you are looking to explore the topics of a few documents in a casual way, you can use the online digital texts environment Voyant, which allows you to upload or copy-and-paste texts and explore a corpus with a number of graphical tools, including topics.

For serious research, a well-known tool for topic modelling is [MALLET](http://mallet.cs.umass.edu/topics.php), a programme (written in Java) that you download to your computer. You have to type commands to use MALLET, but otherwise it does a great deal of the work for you. [Getting Started with Topic Modeling and MALLET](https://programminghistorian.org/en/lessons/topic-modeling-and-mallet) from Programming Historian gives a step-by-step tutorial on MALLET.

There is a graphical interface for MALLET called [Topic Modeling Tool](https://github.com/senderle/topic-modeling-tool) that is a bit easier to use. The [Quickstart Guide](https://senderle.github.io/topic-modeling-tool/documentation/2017/01/06/quickstart.html) will get you up and running.

If you are looking to use R rather than Python, then `tidytext` is a popular NLP library that will help you work with the `topicmodels` package. The book _Text Mining with R_ devotes [chapter 6](https://www.tidytextmining.com/topicmodeling.html) to this.

---
**With the alternatives out of the way, let's see how we can do topic modelling in Python!**

---
---

## How to Join In with Coding

* **Edit** any cell and try changing the code, or delete it and write your own.

* Before running a cell, try to **guess** what the output will be by thinking through what will happen.

* If you encounter an **error**, realise this is normal. Errors happen all the time and by reading the error message you will learn something new.

* Remember: you cannot break the notebook or your computer, so **don't be afraid to experiment**.

**Let's get coding!**

---
---
## Recap of Python Basics
Welcome back! Let's recap the Python that we learnt last time. Any questions?
### Strings
Create a _string_ and store it with a _name_:

In [21]:
my_sentence = 'The Moon formed 4.51 billion years ago.'
my_sentence

'The Moon formed 4.51 billion years ago.'

_Slice_ a string. Remember that indexing in Python starts at 0.

In [22]:
my_sentence[0:20]

'The Moon formed 4.51'
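A few more slices may help to show how the numbers work. This is a small sketch using the same sentence; the names `first_word`, `last_word` and `one_character` are just ones we have chosen for illustration.

```python
my_sentence = 'The Moon formed 4.51 billion years ago.'

first_word = my_sentence[:3]     # leaving out the start means 'from the beginning'
last_word = my_sentence[-4:-1]   # negative numbers count back from the end
one_character = my_sentence[4]   # a single index gives a single character

print(first_word)
print(last_word)
print(one_character)
```

Notice that the end of a slice is never included, which is why `[-4:-1]` leaves off the final full stop.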

Transform a string with string methods. Important: the original string `my_sentence` is unchanged. Instead, a string method _returns_ a new string.

In [23]:
my_sentence.swapcase()

'tHE mOON FORMED 4.51 BILLION YEARS AGO.'
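To check for yourself that the original string is left alone, you could store the result under a new name and then look at both. A minimal sketch (the name `shouted` is our own):

```python
my_sentence = 'The Moon formed 4.51 billion years ago.'

# upper() returns a new string; it does not change my_sentence
shouted = my_sentence.upper()

print(shouted)
print(my_sentence)
```

If you want to keep the transformed version, you need to store it with a name like this, otherwise it is lost.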

Test a string with string methods:

In [24]:
my_sentence.islower()

False

Test a string to see if it contains another string:

In [25]:
'f' in my_sentence

True
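Be aware that this test is case-sensitive. The sketch below shows one common way around that, which is to lowercase the string before testing it (the names `found_exact` and `found_any_case` are invented for this example):

```python
my_sentence = 'The Moon formed 4.51 billion years ago.'

# 'moon' with a lowercase 'm' does not appear in the original string
found_exact = 'moon' in my_sentence

# Lowercasing the sentence first makes the test ignore capital letters
found_any_case = 'moon' in my_sentence.lower()

print(found_exact)
print(found_any_case)
```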

Create a _list_ of strings:

In [26]:
my_list = ['The Moon formed 4.51 billion years ago',
           "The Moon is Earth's only permanent natural satellite",
           'The Moon was first reached in September 1959']
my_list

['The Moon formed 4.51 billion years ago',
 "The Moon is Earth's only permanent natural satellite",
 'The Moon was first reached in September 1959']

Get a single item from a list by its index (a negative index counts back from the end):

In [27]:
my_list[-1]

'The Moon was first reached in September 1959'
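Square brackets can also take a range with a colon, just as with strings. In that case you get back a new, shorter list rather than a single item. A small sketch, with `first_two` and `last_two` as names of our own choosing:

```python
my_list = ['The Moon formed 4.51 billion years ago',
           "The Moon is Earth's only permanent natural satellite",
           'The Moon was first reached in September 1959']

first_two = my_list[0:2]   # items 0 and 1 (the end index is not included)
last_two = my_list[-2:]    # from the second-to-last item to the end

print(first_two)
print(last_two)
```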

Create a transformed list of strings with a _list comprehension_:

In [28]:
new_list = [string.upper() for string in my_list if 'Earth' in string]
new_list

["THE MOON IS EARTH'S ONLY PERMANENT NATURAL SATELLITE"]

### Imports
`import` a _module_ and use it. A module is simply code 'written by someone else' in another file (or files).

In [41]:
import requests
response = requests.get('http://www.mirrorservice.org/sites/ftp.ibiblio.org/pub/docs/books/gutenberg/1/0/1/1013/1013.txt')
text = response.text
text[681:900]

'THE FIRST MEN IN THE MOON\r\n\r\nby H.G. Wells\r\n\r\n\r\n\r\n\r\nChapter 1\r\n\r\n\r\n\r\n\r\nMr. Bedford Meets Mr. Cavor at Lympne\r\n\r\nAs I sit down to write here amidst the shadows of vine-leaves under the\r\nblue sky of southern Italy, it com'

`import` [Natural Language Tool Kit](http://www.nltk.org/) (NLTK) to help with natural language processing (NLP):

In [42]:
import nltk
nltk.download('punkt')

from nltk import word_tokenize

tokens = word_tokenize(text)
tokens[126:146]

[nltk_data] Downloading package punkt to /usr/local/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


['THE',
 'FIRST',
 'MEN',
 'IN',
 'THE',
 'MOON',
 'by',
 'H.G',
 '.',
 'Wells',
 'Chapter',
 '1',
 'Mr.',
 'Bedford',
 'Meets',
 'Mr.',
 'Cavor',
 'at',
 'Lympne',
 'As']
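To see why a proper tokenizer is worth the import, compare it with the simplest alternative: splitting a string on whitespace with the built-in `split()` method. This is only a rough sketch on a made-up sentence based on the chapter title, not a replacement for `word_tokenize`.

```python
sentence = 'Mr. Bedford meets Mr. Cavor at Lympne.'

# split() with no arguments cuts the string wherever there is whitespace
rough_tokens = sentence.split()
print(rough_tokens)

# The final full stop is still stuck to the last word
print('Lympne' in rough_tokens)
```

Because punctuation stays attached to words, 'Lympne.' and 'Lympne' would be counted as two different tokens. Separating punctuation sensibly is one of the jobs a tokenizer does for us.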

### Functions
_Call_ a _function_ with _arguments_. For example, here the function `most_common()` takes a single argument `10`, to give us the ten most common tokens.

In [31]:
from nltk.probability import FreqDist

freqdist = FreqDist(tokens)
freqdist.most_common(10)

[(',', 4612),
 ('.', 3851),
 ('the', 3639),
 ('and', 2538),
 ('of', 2483),
 ('I', 2190),
 ('a', 1809),
 ('to', 1658),
 ("''", 1214),
 ('``', 1160)]
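The list above is dominated by punctuation and very common words. One possible way to tidy it up is to keep only alphabetic tokens and lowercase them before counting. The sketch below uses a short, invented list of tokens, and the standard library's `Counter` stands in for `FreqDist` so that it runs without NLTK; `Counter` also has a `most_common()` method.

```python
from collections import Counter

# A short, made-up list of tokens (toy data)
tokens = ['The', 'Moon', ',', 'the', 'Earth', 'and', 'the', 'Sun', '.']

# Keep only tokens made of letters, and make them all lowercase
words = [token.lower() for token in tokens if token.isalpha()]

counts = Counter(words)
print(counts.most_common(1))
```

We will look at preparing text more carefully in the next notebook.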

---
---
## More Python Essentials
Before we go on to the next notebook `2-collecting-and-preparing-data-for-topic-modelling.ipynb` we need to cover a few more Python essentials.

### Looping with `for` loops

A **`for` loop** goes over every item in a list in turn — and runs some code for every item in that list. It makes sure that every item is visited, and then it stops when it gets to the end. We call this **iteration**; the loop _iterates_ over the list. 

NB: Loops also work for many things other than lists, like strings, but here we stick to lists as an example.

We have already seen `for` loops in passing when we create new lists using list comprehensions:

In [32]:
game = ['rock', 'paper', 'scissors']
new_list = [move for move in game]
new_list

['rock', 'paper', 'scissors']

In the example above, `for move in game` is a `for` loop that loops over every `move` in `game`.

This is a special form of `for` loop for comprehensions — but it essentially works the same as the normal kind.

The normal kind of `for` loop looks like this:

In [33]:
for move in game:
    print(move)

rock
paper
scissors


The syntax goes `for item in list:`

* `for` is a _keyword_ that starts the loop.
* `item` could be any _name_ you give to each item in the list. Name it something that makes sense, e.g. if it's a list of fruit, name it 'fruit', or if it's a list of words, name it 'word'.
* `in` is a _keyword_ that goes before the name of the list.
* `list` could be any _name_ for the list. If your list is a list of novels, for example, it might make sense to name your list 'library'.
* `:` is a colon that starts the _block_ of code that you want to run for every item in the list.

Note that a _block_ of code in Python is indicated by _indenting_ the code by several spaces (typically four spaces).
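To see how the two kinds of loop relate, here is the list comprehension from the recap written out as a normal `for` loop. This is one possible way to write it; it uses the list method `append()` to add each new item to the end of a list.

```python
my_list = ['The Moon formed 4.51 billion years ago',
           "The Moon is Earth's only permanent natural satellite",
           'The Moon was first reached in September 1959']

# Start with an empty list, then add to it one item at a time
new_list = []
for string in my_list:
    if 'Earth' in string:
        new_list.append(string.upper())

print(new_list)
```

The result is the same as `[string.upper() for string in my_list if 'Earth' in string]`; the comprehension is simply a more compact way of writing it.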

You can also put loops inside other loops if you want to iterate over items in _nested_ lists (i.e. lists inside of lists):

In [34]:
lists = [
    [0, 1, 2],
    [True, False, True],
    ['straw', 'twigs', 'bricks']
]

for lst in lists:
    for item in lst:
        print(item)

0
1
2
True
False
True
straw
twigs
bricks
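Nested loops come up often when working with text, for example when each document or sentence is itself a list of words. Here is a small sketch with two made-up sentences that collects every word longer than three letters (the names `sentences` and `long_words` are our own):

```python
# Each inner list is one sentence, already split into words (toy data)
sentences = [
    ['we', 'choose', 'to', 'go', 'to', 'the', 'moon'],
    ['the', 'moon', 'formed', 'long', 'ago'],
]

long_words = []
for sentence in sentences:        # the outer loop visits each sentence
    for word in sentence:         # the inner loop visits each word in it
        if len(word) > 3:
            long_words.append(word)

print(long_words)
```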


### Dictionaries
Dictionaries are a form of _mapping_. They map **keys** to **values**. You can think of it like the index at the back of a book, where the key is a word and its value is the page number where you can find that word in the book. To find the page number of a word, you look through the index and find the word you want (the key) and then look at the number (the value).

```
agriculture, 228 
air freight, 46 
airplane food, 19 
alcohol, 165 
alfalfa, 242 
```

_etc._


The Python dictionary is called a `dict` and it can hold (almost) any type of key and value: strings, numbers, Booleans (`True`, `False`) and more.

To create a new `dict` we use curly braces `{}` and inside put each key and value separated by a colon `:`

In [35]:
my_dict = {
    'agriculture': 228,
    'air freight': 46,
    'airplane food': 19,
    'alcohol': 165,
    'alfalfa': 242
}
my_dict

{'agriculture': 228,
 'air freight': 46,
 'airplane food': 19,
 'alcohol': 165,
 'alfalfa': 242}

So now we can find the page number (value) of any of these words (keys) by putting the key in square brackets `[]`:

In [36]:
my_dict['agriculture']

228
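If you ask for a key that is not in the dictionary using square brackets, Python raises a `KeyError`. Two common ways to avoid that are to test for the key with `in` first, or to use the `get()` method, which returns `None` (or a default value you supply) when the key is missing. A short sketch with a cut-down version of the dictionary:

```python
my_dict = {'agriculture': 228, 'air freight': 46}

has_alcohol = 'alcohol' in my_dict        # test whether the key exists
page = my_dict.get('alcohol')             # None, because the key is missing
page_or_zero = my_dict.get('alcohol', 0)  # 0 is the default we chose here

print(has_alcohol)
print(page)
print(page_or_zero)
```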

To add a new key-value pair to the dictionary we can use the key in square brackets `[]` and assign the new value to it with the assignment operator `=`. In this example, the new key is `'allergies'` and the new value is `210`.

In [37]:
my_dict['allergies'] = 210
my_dict

{'agriculture': 228,
 'air freight': 46,
 'airplane food': 19,
 'alcohol': 165,
 'alfalfa': 242,
 'allergies': 210}
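Putting loops and dictionaries together gives us a simple way to count words, which is the sort of bookkeeping that text analysis relies on. The following is a minimal sketch with an invented list of words; in practice a library would usually do this for you.

```python
words = ['the', 'moon', 'and', 'the', 'stars', 'and', 'the', 'sun']

word_counts = {}
for word in words:
    if word in word_counts:
        # We have seen this word before, so add one to its count
        word_counts[word] = word_counts[word] + 1
    else:
        # First time we see this word, so add a new key with the value 1
        word_counts[word] = 1

print(word_counts)
```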

### Going Further: Tuples
You will see tuples in the topic modelling example in the next two notebooks, but you should be able to follow along without understanding tuples fully. If you're running short of time, skip over this section and come back to it later.

A tuple is a bit like a list, except unlike a list, tuples cannot be changed. You cannot add or remove items from a tuple once you have created it. Tuples are known as _immutable_.

NB: Tuple is often pronounced 'toople' if you are from the UK, or 'tupple' if you are from the US, but it doesn't really matter.

To create a new tuple we use parentheses `()`:

In [38]:
my_tuple = (1, 5.0, 'ten-thousand')
my_tuple

(1, 5.0, 'ten-thousand')

You might be a little confused as we also use parentheses `()` to call a function! However, you can recognise a tuple because the parentheses don't have a function name immediately before them. The use of `()` in tuples is totally unrelated to the use of `()` to call functions. They are merely using the same sort of brackets!

Like a list, you can use an index (or a slice) to access the items of a tuple:

In [39]:
my_tuple[2]

'ten-thousand'

But unlike a list, you cannot assign a new value to any of its items:

In [40]:
my_tuple[2] = 'rainbows and unicorns'

TypeError: 'tuple' object does not support item assignment

You should get an error above that says `TypeError: 'tuple' object does not support item assignment`. This means you cannot assign a new value to any of the items in a tuple.
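You have in fact already met tuples in this notebook: `most_common()` gave us a list of `(token, count)` tuples. A handy trick with tuples is _unpacking_, where each item is given its own name in one step. Here is a sketch using the first few values from that earlier output (the variable names are our own):

```python
common_tokens = [('the', 3639), ('and', 2538), ('of', 2483)]

# Unpack one tuple into two names
most_common_token, highest_count = common_tokens[0]
print(most_common_token, highest_count)

# Unpack every tuple in turn inside a for loop
for token, count in common_tokens:
    print(token, count)
```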

---
---
## Summary

In this notebook we have covered:

* The basics of what topic modelling is
* How topic modelling can be used in the humanities
* Recap of Python basics from the last workshop:
  * Strings and lists
  * Imports
  * Functions
* More Python essentials:
  * Loops
  * Dictionaries
  * Tuples

👌👌👌

In the next notebook `2-collecting-and-preparing-data-for-topic-modelling.ipynb` we will start to walk through a full example of topic modelling using Gensim and the speeches we have prepared.