References: I used module 4.6, the Wikipedia/Trains starter code provided for the "LDA" class, and https://collab.its.virginia.edu/access/lessonbuilder/item/1926886/group/d0d365cf-c603-49dc-a1e3-be22940a5921/Textbooks/blei03a.pdf to complete this problem. I also consulted with classmate Geoff Hansen to compare our posterior distributions in P1a., and adjusted my answer because I had previously calculated the posterior distribution of words given particular topics.

## Problem 1a. ##

![](HW5_P1_a.jpg)

The desired posterior distribution is $P(\theta, z \mid w, \alpha, \beta)$. As described in https://collab.its.virginia.edu/access/lessonbuilder/item/1926886/group/d0d365cf-c603-49dc-a1e3-be22940a5921/Textbooks/blei03a.pdf, exact inference of this posterior over the latent variables is intractable: the normalizing constant $P(w \mid \alpha, \beta)$ requires summing over every possible topic assignment for every word, and $\theta$ and $\beta$ are coupled inside that sum. Variational approximation yields a tractable solution by replacing the true posterior with a simpler, factorized distribution and optimizing its free parameters, which avoids the computational cost of sampling. Conjugacy alone does not resolve the problem, because the coupling prevents a closed-form Dirichlet update.
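For reference, the intractability can be written out in the notation of the Blei et al. paper linked above. The symbols $K$ (number of topics), $N$ (words in the document), $V$ (vocabulary size), and the indicator $w_n^j$ (1 if word $n$ is vocabulary item $j$) follow that paper and are not defined elsewhere in this notebook.

$$
p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta) = \frac{p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)}{p(\mathbf{w} \mid \alpha, \beta)}
$$

$$
p(\mathbf{w} \mid \alpha, \beta) = \frac{\Gamma\left(\sum_{i} \alpha_i\right)}{\prod_{i} \Gamma(\alpha_i)} \int \left( \prod_{i=1}^{K} \theta_i^{\alpha_i - 1} \right) \left( \prod_{n=1}^{N} \sum_{i=1}^{K} \prod_{j=1}^{V} (\theta_i \beta_{ij})^{w_n^j} \right) d\theta
$$

The product over words of a sum over topics is what couples $\theta$ and $\beta$. The variational family used in the paper drops that coupling by giving each latent variable its own free parameter:

$$
q(\theta, \mathbf{z} \mid \gamma, \phi) = q(\theta \mid \gamma) \prod_{n=1}^{N} q(z_n \mid \phi_n)
$$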

## Problem 1b. ##

![](HW5_P1_a.jpg)

## Latent Dirichlet Allocation (LDA) Examples

This notebook explores using LDA for pages in Wikipedia and for analysis of the narratives in train accident reports. These examples show how the LDA method is made practical by variational approximation.


In [2]:
#!pip install nltk wikipedia
import numpy as np
import pandas as pd
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
import wikipedia
import nltk
import json
nltk.download("punkt")
nltk.download("stopwords")
from nltk.corpus import stopwords
# Set stop words
stopWords = set(stopwords.words('english'))


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ngk3pf\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ngk3pf\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Wikipedia Pages

In [20]:
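# This preprocessing step just removes stopwords.
# Note: the comparison is case-sensitive and the NLTK stop word list is lowercase,
# so upper-case text passes through unchanged. CountVectorizer(stop_words='english')
# lowercases and filters stop words again later.

def preprocessor(text):
    tokens = nltk.word_tokenize(text)
    return " ".join([word for word in tokens if word not in stopWords])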
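Because the train narratives later in this notebook are upper-case while stop word lists are usually lowercase, a case-insensitive filter may be worth considering. The following is a minimal sketch, not the notebook's method: `preprocessor_ci` and `DEMO_STOP_WORDS` are hypothetical names, the stop list is a tiny stand-in for NLTK's English list, and a regular expression replaces `nltk.word_tokenize` so the example runs without downloads.

```python
import re

# Assumption: a tiny illustrative stop word list (the notebook uses NLTK's full English list)
DEMO_STOP_WORDS = {"the", "of", "to", "and", "was", "were", "in", "it", "a", "that", "be"}

def preprocessor_ci(text, stop_words=DEMO_STOP_WORDS):
    """Hypothetical case-insensitive variant of the stop word filter."""
    # Split into word tokens and single punctuation marks
    tokens = re.findall(r"\w+|[^\w\s]", text)
    # Compare the lowercased token so upper-case stop words are also dropped
    return " ".join(tok for tok in tokens if tok.lower() not in stop_words)

sample = "THE CAUSE WAS DETERMINED TO BE THE TRACK"
cleaned = preprocessor_ci(sample)
print(cleaned)
```

With `CountVectorizer(stop_words='english')` applied afterwards, the fitted topics should not depend much on this choice; the difference is mainly in what the preprocessed text looks like when printed.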

In [4]:

class LDA_wikipedia:
    """Creates a class for Latent Dirichlet Allocation using summaries from Wikipedia
    Input:
        title_list = list of titles for Wikipedia pages
        N_topics = number of topics for LDA to produce
        N_words = the number of words to show in a topic
        new_title = title for a new page not in the training set
    Methods:
        Topics = Outputs the list of topics in the selected Wikipedia pages as a dataframe
        Predict_Topics
            Input: New titles for Wikipedia pages
            Output: A dataframe with the probabilities for topics for each new page"""
    
    def __init__(self, title_list, N_topics=3, N_words = 10):
        # initialize variables
        self.title_list = title_list
        self.N_topics = N_topics
        self.N_words = N_words
        # start with an empty corpus
        self.corpus = list()
    
        # Get the summary pages for the given titles
        # then preprocess
        for title in self.title_list:
            page = wikipedia.page(title)
            self.corpus.append(preprocessor(page.summary))
        
        # Get the matrix of word counts for the pages
        # this will be the input to the LDA
        self.countVectorizer = CountVectorizer(stop_words='english')
        self.termFrequency = self.countVectorizer.fit_transform(self.corpus)
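        # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available since 1.0
        self.Words = self.countVectorizer.get_feature_names_out()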
        
    def Topics(self):
        # Obtain the estimates for the LDA model 
        self.lda = LatentDirichletAllocation(n_components=self.N_topics)
        self.lda.fit(self.termFrequency)
        
        # Obtain the list of the top N_words in the topics
        topics = list()
        for topic in self.lda.components_:
            topics.append([self.Words[i] for i in topic.argsort()[:-self.N_words - 1:-1]])
            
        # Create a list of column names, Words, for the dataframe output
        cols = list()
        for i in range(self.N_words):
            cols.append("Word "+(str(i)))
        
        # Create a dataframe with the topic no. and the words in each topic 
        # output this dataframe
        Topics_df = pd.DataFrame(topics, columns = cols)
        Topics_df.index.name = "Topics"
        return Topics_df  
    
    def Predict_Topics(self, new_title_list):
        # Get the new titles for the new pages
        # and the number of new pages 
        self.new_title_list = new_title_list
        N_new_docs = len(new_title_list)
        
        # For each of the new titles get the summary page in Wikipedia
        # then obtain the estimate probabilities for each of the topics
        # discovered in the training set for each of the new pages
        new_doc_topics = list()
        for title in self.new_title_list:
            new_page = wikipedia.page(title)
            new_doc = preprocessor(new_page.summary)
            new_doc_topics.append(self.lda.transform(self.countVectorizer.transform([new_doc])))
            
        # Recast the list of topic probabilities as an array of size number of no. pages X no. of topics
        new_doc_topics = np.array(new_doc_topics).reshape(N_new_docs, self.N_topics)
        # Create labels for the columns in the output dataframe
        cols = list()
        for i in range(self.N_topics):
            cols.append("Topic "+(str(i)))
            
        # Create the dataframe whose rows contain the topic probabilities for specific Wikipedia pages
        New_Page_df = pd.DataFrame(new_doc_topics, columns = cols )
        New_Page_df.insert(0, 'Page Name', self.new_title_list)
        return New_Page_df
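The slice `topic.argsort()[:-self.N_words - 1:-1]` used in `Topics` is compact but easy to misread. The sketch below shows what it selects on a made-up weight vector; the vocabulary and weights are hypothetical and stand in for one row of `lda.components_`.

```python
import numpy as np

# Hypothetical vocabulary and topic-word weights (one row of lda.components_)
vocab = np.array(["track", "switch", "crew", "rail", "train", "yard"])
topic_weights = np.array([4.0, 0.5, 2.5, 9.0, 7.0, 1.0])
n_words = 3

# argsort sorts ascending; the slice walks backward from the last index
# and stops after n_words entries, giving the largest weights first
top_idx = topic_weights.argsort()[:-n_words - 1:-1]
top_words = vocab[top_idx].tolist()

# An equivalent and arguably more readable form (when there are no ties)
alt_idx = np.argsort(-topic_weights)[:n_words]

print(top_words)
```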

In [5]:
# Example with famous authors

authors = ['"Charles Dickens"', '"Graham Greene"', '"Jane Eyre"', '"Jane Austen"', '"George Orwell"',
          '"Charlotte Bronte"', '"Virginia Woolf"', '"Evelyn Waugh"',
           '"Mark Twain"', '"Scott Fitzgerald"','"Ernest Hemingway"', '"William Faulkner"', 
          '"Kurt Vonnegut"','"Harper Lee"', '"Edgar Allen Poe"', '"John Steinbeck"' ]

# This is a small data set, so try 3 topics
ld_authors = LDA_wikipedia(title_list = authors, N_topics =3)
ld_authors.Topics()

Topics,Word 0,Word 1,Word 2,Word 3,Word 4,Word 5,Word 6,Word 7,Word 8,Word 9
0,published,novel,war,works,story,writer,novels,short,american,world
1,woolf,english,london,literature,work,published,literary,known,women,works
2,novels,novel,american,published,literary,dickens,fitzgerald,faulkner,known,short


In [6]:
# See how it does with two famous contemporary authors
ld_authors.Predict_Topics(['"Toni Morrison"', '"Stephen King"'])

,Page Name,Topic 0,Topic 1,Topic 2
0,"""Toni Morrison""",0.424357,0.006877,0.568766
1,"""Stephen King""",0.314499,0.089272,0.596229


## Train Accident Narratives

In [10]:
# Train accident narratives are in a json file
# Read the JSON file with the narratives and convert to a list for the LDA analysis


with open('TrainNarratives.txt') as json_file:  
    Narrative_dict = json.load(json_file)
    
train_reports = list(Narrative_dict.values())

In [22]:

class LDA_trains:
    """Creates a class for Latent Dirichlet Allocation using narratives from train accident reports
    Input:
        reports = list of narratives from accident reports
        N_topics = number of topics for LDA to produce
        N_words = the number of words to show in a topic
    Methods:
        Topics = Outputs the list of topics in the selected narratives as a dataframe
        Predict_Topics
            Input: New accident report narratives not in the training set
            Output: A dataframe with the probabilities for topics for each new narrative"""
    
    def __init__(self, reports, N_topics=3, N_words = 10):
        # the narrative reports
        self.reports = reports
        # initialize variables
        self.N_topics = N_topics
        self.N_words = N_words
        
        # Get the word counts in the reports
        self.countVectorizer = CountVectorizer(stop_words='english')
        self.termFrequency = self.countVectorizer.fit_transform(self.reports)
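        # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available since 1.0
        self.Words = self.countVectorizer.get_feature_names_out()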
        
    def Topics(self):
        # Obtain the estimates for the LDA model
        self.lda = LatentDirichletAllocation(n_components=self.N_topics)
        self.lda.fit(self.termFrequency)
        
        # For each of the topics in the model, add the top N_words to the list of topics
        topics = list()
        for topic in self.lda.components_:
            topics.append([self.Words[i] for i in topic.argsort()[:-self.N_words - 1:-1]])
        
        # Create column names for the output matrix
        cols = list()
        for i in range(self.N_words):
            cols.append("Word "+(str(i)))
            
        # Create a dataframe with the topic no. and the words in each topic 
        # output this dataframe 
        Topics_df = pd.DataFrame(topics, columns = cols)
        Topics_df.index.name = "Topics"
        return Topics_df
    
    def Predict_Topics(self, new_reports):
        self.new_reports = new_reports
        
        # Get the list of new accident report narratives
        # and the number of new narratives
        N_new_reports = len(self.new_reports)
        
        
        # For each of the new narratives 
        # obtain the estimated probabilities for each of the topics
        # in each of the new narratives as estimated by the LDA results
        # on the training set 
        new_report_topics = list()
        for report in self.new_reports:
            new_doc = preprocessor(report)
            new_report_topics.append(self.lda.transform(self.countVectorizer.transform([new_doc])))
        
        
        # Recast the list of probabilities for topics as an array 
        # of size no. of new reports X no. of topics
        new_report_topics = np.array(new_report_topics).reshape(N_new_reports, self.N_topics)
        
        # Create column names for the output dataframe
        cols = list()
        for i in range(self.N_topics):
            cols.append("Topic "+(str(i)))
        
        # Create the dataframe whose rows contain topic probabilities for 
        # specified narratives/reports  
        New_Reports_df = pd.DataFrame(new_report_topics, columns = cols )
        New_Reports_df.insert(0, 'Report Name', self.new_reports)
        
        return New_Reports_df
                

In [21]:
# Inspect what the preprocessor does to the first narrative:
# print the preprocessed text, then the original.
# (The fitted model only exists inside an LDA_trains instance,
# so the transform step is not run in this cell.)
for report in train_reports[:1]:
    new_doc = preprocessor(report)
    print(new_doc)
    print(report)

UNITS 231-281 ( BACK TO BACK ) WERE COMING INTO UP DEISEL SHOP WHEN THE LEFT WHEEL OF 281 RODE OVER RECENTLY REPAIRED SWITCH PLATE AND DERAILED . THE CAUSE WAS DETERMINED TO BE THE TRACK TELEMETRY IN THAT IT WAS TOO SHARP OF A CURVE .
UNITS 231-281(BACK TO BACK)  WERE COMING INTO UP DEISEL SHOP  WHEN THE LEFT WHEEL OF 281 RODE OVER RECENTLY REPAIRED SWITCH PLATE AND DERAILED. THE CAUSE WAS DETERMINED TO BE THE TRACK TELEMETRY IN THAT IT WAS TOO SHARP OF A CURVE.


In [23]:
lda_train = LDA_trains(reports = train_reports, N_topics = 10, N_words = 10)
lda_train.Topics()

Topics,Word 0,Word 1,Word 2,Word 3,Word 4,Word 5,Word 6,Word 7,Word 8,Word 9
0,derailed,hazardous,track,cars,materials,released,yard,pulling,railcars,leaking
1,cars,track,car,crew,cut,train,end,derailed,shoving,conductor
2,derailed,wide,track,cars,gauge,gage,fuel,rail,pulling,wheel
3,struck,track,damage,locomotive,unit,operator,vehicle,equipment,humping,crossing
4,rail,cars,switch,derailed,point,broken,lead,broke,causing,car
5,switch,track,car,crew,cars,derailed,yard,lined,lead,movement
6,derailed,cars,track,loads,pulling,car,ns,head,empties,east
7,train,derailed,car,cars,track,mph,curve,speed,rail,excessive
8,track,bnsf,cars,damage,train,car,derailed,end,job,pulling
9,train,car,derailed,emergency,cars,mp,crew,damage,went,rail


In [25]:
lda_train.Predict_Topics(train_reports[:10])

,Report Name,Topic 0,Topic 1,Topic 2,Topic 3,Topic 4,Topic 5,Topic 6,Topic 7,Topic 8,Topic 9
0,UNITS 231-281(BACK TO BACK) WERE COMING INTO ...,0.229142,0.004546,0.004547,0.004546,0.464354,0.274677,0.004547,0.004547,0.004546,0.004547
1,"ENGINE 286 CAUGHT FIRE AT THE SPRINGFIELD, MA ...",0.009092,0.114051,0.207441,0.009093,0.009092,0.009093,0.009093,0.009093,0.009092,0.614861
2,TRAIN NO.#4 WITH ENGS 83/11/90/44 AND 11 CARS ...,0.002326,0.002326,0.002326,0.002326,0.002326,0.453532,0.002326,0.527859,0.002326,0.002326
3,WHILE SHOVING TRAIN 624 SOUTH ON #30 TRACK AT ...,0.006252,0.006252,0.006251,0.006251,0.006251,0.943737,0.006252,0.006251,0.006251,0.006252
4,TRAIN 786 WAS STRUCK BY A FALLING TREE SOUTH O...,0.132671,0.010002,0.35457,0.442744,0.010001,0.010002,0.010001,0.010003,0.010002,0.010005
5,ENGINE 4403 OF NJT TRAIN 3204 HAD 90% OLD BREA...,0.005,0.005,0.005,0.005,0.005001,0.005,0.005,0.005,0.005,0.954998
6,AGR CREW DELIVERED CARS TO BNSF AT BNSF YARD A...,0.002273,0.753667,0.002273,0.002273,0.002273,0.002274,0.002273,0.002273,0.131328,0.099093
7,TRAIN #263 CAME INTO RUTHLAND YARD AND THEY WE...,0.002084,0.981245,0.002084,0.002084,0.002084,0.002084,0.002084,0.002084,0.002084,0.002084
8,CREW WAS SHOVING A CUT OF CARS EASTWARD TOWARD...,0.003847,0.288334,0.003847,0.003847,0.155399,0.529338,0.003847,0.003848,0.003847,0.003847
9,"WHILE BUILDING TAIN, 1130 TRIMMER DERAILED FOU...",0.007694,0.007694,0.514267,0.007693,0.424184,0.007694,0.007696,0.007693,0.007693,0.007693


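The probabilities displayed here are estimates of the per-document topic proportions ($\theta$) for each document, conditioned on the words observed in that document (w) and on the per-topic word distributions ($\beta$) learned from the training set. The per-word topic assignments (z) are latent; they are handled inside the variational inference and are not reported. Each row sums to 1 across the ten topics.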
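One way to read the small values that repeat within a row (for example 0.004546 in report 0) is as a floor set by the Dirichlet prior. The sketch below rests on two assumptions that the notebook does not state: that scikit-learn's default `doc_topic_prior` of `1 / n_components` was in effect, and that `transform` returns the variational parameters normalized to sum to 1. Under those assumptions, a topic that explains none of a document's words gets probability `alpha / (K * alpha + N)`, where `N` is the number of tokens the vectorizer counted. The function names are hypothetical.

```python
def floor_probability(n_counted_words, n_topics=10, alpha=None):
    """Probability of a topic that is assigned none of the document's words."""
    if alpha is None:
        alpha = 1.0 / n_topics  # assumed scikit-learn default prior
    return alpha / (n_topics * alpha + n_counted_words)

def implied_word_count(floor_prob, n_topics=10, alpha=None):
    """Invert the formula: token count implied by an observed floor value."""
    if alpha is None:
        alpha = 1.0 / n_topics
    return alpha / floor_prob - n_topics * alpha

# Floor values read from the table for reports 0, 5 and 7
for floor in (0.004546, 0.005, 0.002084):
    print(floor, round(implied_word_count(floor)))
```

If the assumptions hold, shorter narratives have a higher floor, so a value such as 0.01 carries no more evidence for a topic than 0.002 does in a longer report.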

### d. ###

A safety engineer could use this information in several ways. For example, they could track how often each topic appears in the most recent reports to see whether there are trending maintenance or user-error issues that need to be addressed. In the second table, reports 6-8 each assign more than 25% probability to topic 1. In the first table, the words that appear only in topic 1 are "shoving," "cut," and "conductor"; "crew" and "derailed" are also among its top words but are shared with other topics. This might lead the safety engineer to reexamine the number of personnel supporting conductors or train crews during shoving movements, to avoid future derailments due to user error.
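The screening step described above can be sketched in a few lines. The probabilities are the Topic 1 column copied from the second table; the 25% threshold is the one used in the text, and the variable names are illustrative only.

```python
# Topic 1 probabilities for reports 0-9, copied from the Predict_Topics output
topic1 = [0.004546, 0.114051, 0.002326, 0.006252, 0.010002,
          0.005, 0.753667, 0.981245, 0.288334, 0.007694]

THRESHOLD = 0.25  # cut-off from the discussion above; a judgment call, not a standard

# Indices of the reports whose topic 1 probability meets the threshold
flagged = [i for i, p in enumerate(topic1) if p >= THRESHOLD]

# Share of the sampled reports that were flagged
share = len(flagged) / len(topic1)

print(flagged, share)
```

On the full data set, the same comparison could be run per time window to see whether the share of a topic is rising.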