## Exploring the Polytechnic Magazine: Topic Modelling




The Polytechnic Magazine was the in-house magazine of the [Regent Street Polytechnic](https://westminster-atom.arkivum.net/index.php/rsp), one of the predecessor institutions of the [University of Westminster](https://www.westminster.ac.uk/). Since 2011, a digitised run of more than 1,700 issues of the magazine covering the years 1879 to 1960 has been made available by the [University Archive](http://recordsandarchives.westminster.ac.uk/). This has proved an invaluable resource for academic researchers, family historians and university staff. You can search and read the digitised magazines themselves via the University Archive's [Polytechnic Magazine website](https://polymags.westminster.ac.uk/).

This project aims to complement this resource by opening up computational methods of access to the collection. This notebook uses text extracted from the digitised magazines to enable topic modelling of the Polytechnic Magazine corpus.

For more information on the project and how the text was processed, see the [project website](https://github.com/jakebickford/PolyMags).
Copyright for the Polytechnic Magazine is held by the University of Westminster archive; for further information, see the [Polytechnic Magazine website](https://polymags.westminster.ac.uk/). Please note that this is a prototype research project and it may be taken down at any time.

## Preparatory steps

These steps install the necessary Python modules for topic modelling and download the Polytechnic Magazine corpus. You do not need to do this every time you create a topic model, but if your session becomes inactive you may need to run the cells again.

In [None]:
#@title Install modules for topic modelling
# Import modules for building the document-term matrix and the topic model
import warnings

import pandas as pd
import scipy.sparse
import gensim
from gensim import corpora, matutils, models
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer

warnings.filterwarnings("ignore", category=DeprecationWarning)  # suppresses deprecation warnings generated by pyLDAvis

# Install and import the pinned version of pyLDAvis used for visualisation
%pip install pyLDAvis==2.1.2
import pyLDAvis
import pyLDAvis.gensim

Collecting pyLDAvis==2.1.2
  Downloading pyLDAvis-2.1.2.tar.gz (1.6 MB)


In [None]:
#@title Download the Polytechnic Magazine corpus { display-mode: "form" }
# Download the corpus CSV from Google Drive, then load it into a DataFrame
import gdown
url = "https://drive.google.com/uc?id=1855DjLlxVI-k3vexxdgEIDtck_PjvV3v"
output = "corpus_11.csv"
gdown.download(url, output, quiet=False)

df = pd.read_csv(output)

Downloading...
From: https://drive.google.com/uc?id=1855DjLlxVI-k3vexxdgEIDtck_PjvV3v
To: /content/corpus_11.csv
966MB [00:08, 119MB/s] 


# Create the document-term matrix #

During this preparatory step you can set parameters for your topic model, specifying to what extent you want to include rare and common words, and selecting a date range of issues to be included. You only need to run these cells once but if you want to come back and change the parameters, be sure to run the cells again.
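
The effect of rare-word and common-word thresholds can be sketched in plain Python. This is only an illustration of the filtering rule, not scikit-learn's implementation; the function `filter_vocabulary` and the toy documents are hypothetical. It assumes that a whole number is treated as an absolute document count and a decimal as a proportion of the corpus.

```python
from collections import Counter

def filter_vocabulary(docs, min_df=1, max_df=1.0):
    """Return the sorted terms whose document frequency lies within [min_df, max_df]."""
    n_docs = len(docs)
    # Document frequency: the number of documents containing each term at least once
    doc_freq = Counter(term for doc in docs for term in set(doc.split()))
    # Whole numbers are absolute counts; decimals are proportions of the corpus
    low = min_df if isinstance(min_df, int) else min_df * n_docs
    high = max_df if isinstance(max_df, int) else max_df * n_docs
    return sorted(t for t, df in doc_freq.items() if low <= df <= high)

# Hypothetical toy corpus of four short "documents"
toy_docs = [
    "polytechnic cricket match",
    "polytechnic rowing regatta",
    "polytechnic cricket club",
    "polytechnic chess club",
]
print(filter_vocabulary(toy_docs, min_df=2, max_df=0.5))
```

Here "polytechnic" appears in every document and is dropped by the upper threshold, while words that appear only once are dropped by the lower threshold, leaving "club" and "cricket".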

In [None]:
#@title Set parameters for including common and rare words { display-mode: "form" }
#@markdown These parameters will allow you to control what words are included in your topic model. 

#@markdown **MinDf** is the minimum number of documents a word must occur in to be included, while **MaxDf** is
#@markdown the maximum number of documents that a word can occur in before it is excluded.
#@markdown So a high **MinDf** will exclude less common words, while a high **MaxDf** will allow
#@markdown more common words from across the corpus, potentially resulting in more general topics.



#@markdown You can enter either a whole number (e.g. 10) or a decimal, which will be understood as a
#@markdown proportion of the corpus (e.g. 0.5 = 50%).
#@markdown So a MinDf of 10 would mean that each word must appear in at least 10 documents to be included,
#@markdown while a MaxDf of 0.5 (i.e. 50%) would mean that each word may not appear in more than half of
#@markdown the documents.

#@markdown For a further discussion of how this works, see Adel Rahmani's excellent notebook on [Topic Modelling of Australian Parliamentary Press Releases](https://nbviewer.jupyter.org/github/adelr/trove-refugee/blob/master/Analyses.ipynb).


MinDf = 10 #@param {type:"number"}
MaxDf = 0.5 #@param {type:"number"}





In [None]:
#@title Select the date range you would like to model { display-mode: "form" }
#@markdown The digitised run of the Polytechnic Magazine covers the years 1879 to 1960.  
#@markdown You can select a date range to model, or use the entire corpus by selecting 1879 as the **StartDate**
#@markdown and 1960 as the **EndDate**.
df.index = pd.to_datetime(df.Date)#change index to date
StartDate = "1913" #@param ["1879", "1880", "1881", "1882", "1883", "1884", "1885", "1886", "1887", "1888", "1889", "1890", "1891", "1892", "1893", "1894", "1895", "1896", "1897", "1898", "1899", "1900", "1901", "1902", "1903", "1904", "1905", "1906", "1907", "1908", "1909", "1910", "1911", "1912", "1913", "1914", "1915", "1916", "1917", "1918", "1919", "1920", "1921", "1922", "1923", "1924", "1925", "1926", "1927", "1928", "1929", "1930", "1931", "1932", "1933", "1934", "1935", "1936", "1937", "1938", "1939", "1940", "1941", "1942", "1943", "1944", "1945", "1946", "1947", "1948", "1949", "1950", "1951", "1952", "1953", "1954", "1955", "1956", "1957", "1958", "1959", "1960"]
EndDate = "1945" #@param ["1879", "1880", "1881", "1882", "1883", "1884", "1885", "1886", "1887", "1888", "1889", "1890", "1891", "1892", "1893", "1894", "1895", "1896", "1897", "1898", "1899", "1900", "1901", "1902", "1903", "1904", "1905", "1906", "1907", "1908", "1909", "1910", "1911", "1912", "1913", "1914", "1915", "1916", "1917", "1918", "1919", "1920", "1921", "1922", "1923", "1924", "1925", "1926", "1927", "1928", "1929", "1930", "1931", "1932", "1933", "1934", "1935", "1936", "1937", "1938", "1939", "1940", "1941", "1942", "1943", "1944", "1945", "1946", "1947", "1948", "1949", "1950", "1951", "1952", "1953", "1954", "1955", "1956", "1957", "1958", "1959", "1960"]
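df_timerange = df.loc[StartDate:EndDate]  # keep issues dated from StartDate to EndDate, both years inclusive
df_extractedDates = df_timerange.lemmatized_bigrams_trigrams_as_string  # the text column used for modelling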
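
As a rough illustration of how selecting by year works, the sketch below slices a date-indexed table using year strings. The miniature table and its `text` column are hypothetical stand-ins for the corpus; the sketch also sorts the index first, on the assumption that slicing by partial date strings is most reliable on a sorted index.

```python
import pandas as pd

# Hypothetical miniature corpus; the values are illustrative only
toy = pd.DataFrame({
    "Date": ["1912-12-20", "1913-01-15", "1930-06-01", "1945-12-31", "1946-01-10"],
    "text": ["a", "b", "c", "d", "e"],
})
toy.index = pd.to_datetime(toy["Date"])  # use the issue date as the index
toy = toy.sort_index()                   # make sure the dates are in order

# Year strings select whole years, and both endpoints are included
selected = toy.loc["1913":"1945"]
print(selected["text"].tolist())
```

The issue dated 31 December 1945 is kept, while those from December 1912 and January 1946 are excluded.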


In [None]:
#@title Create document-term matrix { display-mode: "form" }
#@markdown When you have finished setting the parameters above, run this cell to create a document-term matrix,
#@markdown which is used to create your topic model. You do not need to run this cell again
#@markdown to create new topic models below. However, if you later change the parameters above,
#@markdown re-run those cells and then re-run this cell to create a new matrix.

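# Create the document-term matrix (rows = documents, columns = terms)
cv = CountVectorizer(min_df=MinDf, max_df=MaxDf)  # build the count vectorizer with the parameters set above
data_cv = cv.fit_transform(df_extractedDates)
# Note: scikit-learn 1.0+ provides get_feature_names_out(); get_feature_names() was removed in 1.2
dtm = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names())
dtm.index = df_extractedDates.index

tdm = dtm.transpose()  # transpose to a term-document matrix (rows = terms, columns = documents)

# Put the term-document matrix into a gensim corpus
sparse_counts = scipy.sparse.csr_matrix(tdm)
corpus = matutils.Sparse2Corpus(sparse_counts)

# Create a gensim dictionary from the vectorizer's vocabulary
id2word = dict((v, k) for k, v in cv.vocabulary_.items())  # column index -> term
word2id = dict((k, v) for k, v in cv.vocabulary_.items())  # term -> column index
d = corpora.Dictionary()
d.id2token = id2word
d.token2id = word2id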
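
To give a sense of the kind of data this step produces, the plain-Python sketch below builds a vocabulary and a bag-of-words corpus by hand for two toy documents. It is an illustration only and does not call scikit-learn or gensim; the names `toy_docs`, `vocabulary` and `bow_corpus` are hypothetical. Each document ends up as a list of (term id, count) pairs, which is the general shape of corpus that the topic model reads.

```python
from collections import Counter

# Hypothetical toy documents, already tokenised by whitespace
toy_docs = ["cricket club cricket", "chess club"]

# Assign each distinct term an integer id, in alphabetical order
all_terms = sorted({term for doc in toy_docs for term in doc.split()})
vocabulary = {term: i for i, term in enumerate(all_terms)}  # term -> id
id2word = {i: term for term, i in vocabulary.items()}       # id -> term

# Each document becomes a sorted list of (term id, count) pairs
bow_corpus = [
    sorted((vocabulary[term], count) for term, count in Counter(doc.split()).items())
    for doc in toy_docs
]
print(vocabulary)
print(bow_corpus)
```

The two mappings play the same role as `word2id` and `id2word` in the cell above: they let the model work with integer ids while still reporting topics as readable words.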

# Create the topic model #

In [None]:
#@title Create the topic model { display-mode: "form" }
#@markdown This cell will create your topic model. 

#@markdown Latent Dirichlet Allocation requires the user to specify the number of topics for the model to look for. Select the number of topics below.
NumberOfTopics = 5 #@param {type:"integer"}
#@markdown You can also specify how many times you would like the algorithm to pass over the corpus. A higher number of passes may result in a more accurate analysis, but will take longer to run.
NumberOfPasses = 10 #@param {type:"integer"}

#@markdown When you have finished setting your parameters run this cell to generate the topic model. Note that it may take some time (5 minutes+) to run.
#@markdown When the model has been generated, it will display the topics it has found and the words most closely associated with them.

warnings.filterwarnings("ignore",category=DeprecationWarning)#suppresses deprecation warnings generated by pyLDAvis
lda = models.LdaModel(corpus=corpus, id2word=d, num_topics=NumberOfTopics, passes=NumberOfPasses, alpha='auto')  # alpha='auto' learns the document-topic prior from the corpus

lda.print_topics()

[(0,
  '0.005*"battalion" + 0.004*"wound" + 0.004*"regiment" + 0.004*"trench" + 0.003*"london_regiment" + 0.003*"corp" + 0.003*"lieut" + 0.003*"corps" + 0.002*"military" + 0.002*"soldier"'),
 (1,
  '0.010*"war_comfort" + 0.002*"unit" + 0.002*"cadet" + 0.002*"wartime" + 0.002*"cake" + 0.002*"regiment" + 0.002*"guard" + 0.001*"overseas" + 0.001*"enemy" + 0.001*"correspondent"'),
 (2,
  '0.032*"dist" + 0.030*"mathematic" + 0.020*"geometry" + 0.015*"physics" + 0.015*"grade" + 0.014*"prac_mathematic" + 0.011*"inter" + 0.010*"mechanic" + 0.010*"heat_engine" + 0.007*"bookkeeping"'),
 (3,
  '0.002*"motion" + 0.002*"portland_hall" + 0.001*"film" + 0.001*"regatta" + 0.001*"dancing" + 0.001*"toast" + 0.001*"wicket" + 0.001*"chess" + 0.001*"barnet" + 0.001*"floor"'),
 (4,
  '0.082*"construction" + 0.022*"theory" + 0.022*"mathematic" + 0.016*"bookkeeping" + 0.013*"prac_mathematic" + 0.009*"geometry" + 0.008*"economic" + 0.007*"heat_engine" + 0.006*"surveying_field" + 0.006*"grade"')]
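
Each topic above is printed as a single string of weighted terms. If you would like to work with the terms and weights separately, one possible approach is to split the string with a regular expression, as sketched below. The helper `parse_topic` is hypothetical, and the example string is a shortened copy of the first topic shown above.

```python
import re

def parse_topic(topic_string):
    """Split a printed topic string into (term, weight) pairs."""
    # Each entry looks like 0.005*"battalion", with entries joined by " + "
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(term, float(weight)) for weight, term in pairs]

example = '0.005*"battalion" + 0.004*"wound" + 0.004*"regiment"'
print(parse_topic(example))
```

The weights are the model's estimate of how strongly each term is associated with the topic, so higher values indicate more characteristic terms. gensim's `show_topic` method on the model should offer a similar list of pairs directly, if you prefer not to parse the text.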

In [None]:
#@title Visualise your topic model { display-mode: "form" }
#@markdown Run this cell to visualise your topic model with [pyLDAvis](https://pypi.org/project/pyLDAvis/). 
#@markdown Topics will be displayed as circles, with the size of each circle showing the prominence of the topic in
#@markdown the corpus. You can select each topic to explore the terms it is composed of in more detail.


pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda, corpus, d)
vis