## Research Question¶
Through this project, we aim to explore the newspaper articles that The Emory Wheel (referred to as 'the newspaper') published in its archive for the past 5 years, and the time range for our data spans from 10/02/2014 to 10/01/2019.

We aim to explore any possible preferences that the newspaper have in covering the news. For example, what kind of wording do they use more frequently than others? What kind of topics they like to write about?

In [2]:
import logging
import itertools
import numpy as np
import gensim
import os
import re
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS



## Loading the files

First, we load all the files using default methods.

In [3]:
import os
base_dir = "C:\\Users\\cahis\\Downloads\\corpus\\"
all_docs = []
docs = os.listdir(base_dir)

for doc in docs:
    if not doc.startswith('.'): # get only the .txt files
        with open(base_dir + doc, "r", encoding = "utf-8") as file:
            text = file.read()
            all_docs.append(text)

Then we split each .txt file into separate articles, and this is achieved by slicing them with the year information at the end of each article.

In [4]:
for i in range(len(all_docs)):
    all_docs[i]=re.split("20 1[0-9]{1}", all_docs[i])

After that, we expand the sublist and replace all the unnecessary words and phrases that might pollute our corpus using regular expressions.

In [5]:
alltext= []
for sublist in all_docs:
    for item in sublist:
        alltext.append(item)
    
for i in range(len(alltext)):
    alltext[i]=alltext[i].replace("\n", "")
    alltext[i]=alltext[i].replace("\x0c", "")
    alltext[i]=alltext[i].replace("The article has been updated.Co pyrig ht ( c) ", "")
    alltext[i]=alltext[i].replace(" The Emo ry Whe e l, All rig hts re s e rve d.", "")
    alltext[i]=alltext[i].replace("Co pyrig ht ( c) ", "")    

len(alltext)

4263

Here, we create a new pandas dataframe for more convenient analyses. We then move on to transforming the articles into the dataframe using regular expressions.

In [42]:
import pandas as pd 
from pandas import DataFrame

df= pd.DataFrame(index=np.arange(4262),columns= ["Title", "Author", "Date", "Content"])

In [43]:
#making a loop to fill each part of the news into the column of the dataframe
for i in range(len(alltext)):
    #check whether the author's name is written in front of URL or after
    tempindex = alltext[i].find("LinkBy")
    #when there is no "by" after URL but has "author"
    if tempindex == -1:
        temp = alltext[i][0:alltext[i].find("OpenURL")]
        df["Content"][i] = alltext[i][alltext[i].find("OpenURL")+12:]
        #when there is "author"
        if alltext[i].find("Author:") != 0:
            df["Author"][i] = temp[temp.find("Author")+8:temp.find("Section")-1]
        #when there is no "author"
        dateindex = max(temp.find("January"),temp.find("February"),temp.find("March"),temp.find("April"),temp.find("May"),temp.find("June"),temp.find("July"),temp.find("August"),temp.find("September"),temp.find("October"),temp.find("November"),temp.find("December"))
        df["Title"][i] = temp[0:dateindex]
        df["Date"][i] = temp[dateindex:temp.find("Emory Wheel, The")-3] 
    #when author is written after URL Link
    else:
        regexname=re.search("(By)\s[A-Z]{1}[a-z]+\s[A-Z]{1}[a-z]+",alltext[i])
        #This is for the names that can not be detected by regex
        if regexname == None:
            temp=alltext[i][0:alltext[i].find("OpenURL")] 
            df["Content"][i] = alltext[i][alltext[i].find("OpenURL")+12:]
            dateindex = max(temp.find("January"),temp.find("February"),temp.find("March"),temp.find("April"),temp.find("May"),temp.find("June"),temp.find("July"),temp.find("August"),temp.find("September"),temp.find("October"),temp.find("November"),temp.find("December"))
            df["Title"][i] = temp[0:dateindex]
            df["Date"][i] = temp[dateindex:temp.find("Emory Wheel, The")-3] 
        #This is for the name that can be detect by regex
        else:
            nameend = regexname.end()
            temp = alltext[i][0:nameend] 
            df["Author"][i] = temp[nameend-len(regexname.group(0))+3:]  
            Contenttemp = alltext[i][nameend:]
            checkwriter = Contenttemp.find("Writer")
            checkeditor = Contenttemp.find("Editor")
            if checkwriter != -1 and checkwriter < 50:
                df["Content"][i] = Contenttemp[checkwriter+6:]
            elif checkeditor != -1 and checkeditor < 50:
                df["Content"][i] = Contenttemp[checkeditor+6:]
            else:
                df["Content"][i] = Contenttemp
            dateindex = max(temp.find("January"),temp.find("February"),temp.find("March"),temp.find("April"),temp.find("May"),temp.find("June"),temp.find("July"),temp.find("August"),temp.find("September"),temp.find("October"),temp.find("November"),temp.find("December"))
            df["Title"][i] = temp[0:dateindex]
            df["Date"][i] = temp[dateindex:temp.find("Emory Wheel, The")-3]
            
# Replace all the missing strings with NA 
df.replace('', np.nan, inplace=True)
# Drop the missing values
df = df.dropna()
# Obtain the final clean dataframe
df["Date"][2933] = 'March 24, 2015'
df["Date"][3994] = 'April 8, 2017'
df["Date"][1858] = 'April 25, 2018'
df["Date"][1386] = 'September 12, 2018' 
df["Date"][752] = 'March 28, 2018'
df["Date"][231] = 'April 3, 2019'
df["Date"][187] = 'April 18, 2018'
df["Date"][1102] = 'April 3, 2019'
df["Date"][1462] = 'September 12, 2018'

df.head(10)

Unnamed: 0,Title,Author,Date,Content
0,Students Participate in National Climate Strike,Bisma Punjani,"September 25, 2019","Emory students, alumni and faculty gathered on..."
1,University Must Clarify Appropriate Uses of Ra...,The Editorial Board,"September 25, 2019","In spaces of higher education, language is one..."
2,Students Deserve Access to 24/7 Sexual Assault...,Brammhi Balarajan,"September 25, 2019","CORRECTION (Oct. 8, 12:13 am) This article has..."
3,Anti-Hazing Advocate Tells Story of Son's Death,Musa Ya-Sin,"September 25, 2019",Hazing-prevention advocate Lianne Kowiak warne...
4,Eagles Blank Third Straight Opponent,Daniel Kekes-Szabo,"September 25, 2019",The Emory women's soccer team increased their ...
5,Asian Joint Flaunts Southern Charm,Varun Gupta,"September 25, 2019",A mere 10-minute walk away from Krog Street Ma...
6,Aisle 5 Sees the Bright Side of Christian French,Cailen Chinn,"September 25, 2019",If you've ever loved an up-and-coming musician...
7,News Roundup | 9.25.19,Musa Ya-Sin,"September 25, 2019",Nancy Pelosi Announces Impeachment InquiryHous...
8,"Round Table: Emory Faculty Discuss Vaping, Vit...",Thomas Kreutz,"September 25, 2019",According to the Centers for Disease Control a...
9,"Beauty Influencer Amasses 63,000 Fans",Bonny Minn,"September 25, 2019",As a burgeoning Instagram and YouTube beauty i...


## Topic Modelling

In [8]:
logging.basicConfig(format='%(levelname)s : %(message)s', level=logging.INFO)
logging.root.level = logging.INFO  

def head(stream, n=10):
    return list(itertools.islice(stream, n))
def tokenize(text):
    return [token for token in simple_preprocess(text) if token not in STOPWORDS]

In [9]:
#turning the pandas dataframe into a list
df1 = DataFrame(df, columns= ['Title', 'Content'])
newlist = df1.values.tolist()

# A function to yield each doc in CCP Corpus as a `(filename, tokens)` tuple.
def iter_docs(base_list):
    for item in base_list:
        text = item[1]
        tokens = tokenize(text) 
        yield item[0], tokens
    
stream = iter_docs(newlist)
for title, tokens in itertools.islice(stream, 20):
    print(title, tokens[:10]) 

Students Participate in National Climate Strike ['emory', 'students', 'alumni', 'faculty', 'gathered', 'cox', 'bridge', 'friday', 'participate', 'global']
University Must Clarify Appropriate Uses of Racially-ChargedTerms ['spaces', 'higher', 'education', 'language', 'fundamental', 'tools', 'enriching', 'tool', 'comes', 'responsibility']
Students Deserve Access to 24/7 Sexual Assault Care ['correction', 'oct', 'article', 'updated', 'clarify', 'medical', 'sexual', 'assault', 'victims', 'campus']
Anti-Hazing Advocate Tells Story of Son's Death ['hazing', 'prevention', 'advocate', 'lianne', 'kowiak', 'warned', 'roughly', 'greek', 'life', 'members']
Eagles Blank Third Straight Opponent ['emory', 'women', 'soccer', 'team', 'increased', 'win', 'streak', 'shut', 'tenn', 'sept']
Asian Joint Flaunts Southern Charm ['mere', 'minute', 'walk', 'away', 'krog', 'street', 'market', 'jenchan', 'restaurant', 'market']
Aisle 5 Sees the Bright Side of Christian French ['ve', 'loved', 'coming', 'musician',

In [13]:
doc_stream = (tokens for _,tokens in iter_docs(newlist))
              
id2word_emorywheel = gensim.corpora.Dictionary(doc_stream) 

print(id2word_emorywheel)

INFO : adding document #0 to Dictionary(0 unique tokens: [])
INFO : built Dictionary(118817 unique tokens: ['according', 'action', 'alot', 'alumni', 'analysis']...) from 4025 documents (total 1465837 corpus positions)


Dictionary(118817 unique tokens: ['according', 'action', 'alot', 'alumni', 'analysis']...)


In [14]:
# print(id2word_emorywheel.token2id)

In [21]:
# filter out words in only 1 doc, keeping the rest
id2word_emorywheel.filter_extremes(no_below=2, no_above=1.0)
print(id2word_emorywheel)

INFO : discarding 75427 tokens: [('coalplants', 1), ('emphasizedhow', 1), ('gogreen', 1), ('thestrike', 1), ('todemand', 1), ('tousa', 1), ('whichit', 1), ('aremeant', 1), ('defineand', 1), ('earlieruse', 1)]...
INFO : keeping 43390 tokens which were in no less than 2 and no more than 4025 (=100.0%) documents
INFO : resulting dictionary: Dictionary(43390 unique tokens: ['according', 'action', 'alot', 'alumni', 'analysis']...)


Dictionary(43390 unique tokens: ['according', 'action', 'alot', 'alumni', 'analysis']...)


In [22]:
# a class we need; this is the same for every topic model you create with gensim. 
# no need to modify it here

class Corpus(object):
    def __init__(self, dump_file, dictionary, clip_docs=None):
        self.dump_file = dump_file
        self.dictionary = dictionary
        self.clip_docs = clip_docs
    
    def __iter__(self):
        self.titles = []
        for title, tokens in itertools.islice(iter_docs(self.dump_file), self.clip_docs):
            self.titles.append(title)
            yield self.dictionary.doc2bow(tokens)
    
    def __len__(self):
        return self.clip_docs

# create a stream of bag-of-words vectors
emorywheel_corpus = Corpus(newlist, id2word_emorywheel)

# print the first vector in the stream to see what it looks like; 
# this is in the format (word_id, count in first doc)

vector = next(iter(emorywheel_corpus))
print(vector)  

[(0, 1), (1, 3), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 9), (31, 2), (32, 1), (33, 14), (34, 1), (35, 1), (36, 1), (37, 1), (38, 1), (39, 1), (40, 1), (41, 1), (42, 1), (43, 1), (44, 1), (45, 1), (46, 1), (47, 1), (48, 1), (49, 1), (50, 1), (51, 1), (52, 1), (53, 2), (54, 1), (55, 1), (56, 2), (57, 1), (58, 9), (59, 1), (60, 3), (61, 1), (62, 1), (63, 1), (64, 1), (65, 1), (66, 1), (67, 1), (68, 1), (69, 1), (70, 1), (71, 1), (72, 1), (73, 1), (74, 5), (75, 1), (76, 1), (77, 1), (78, 2), (79, 1), (80, 1), (81, 1), (82, 1), (83, 1), (84, 1), (85, 1), (86, 1), (87, 1), (88, 1), (89, 1), (90, 1), (91, 1), (92, 1), (93, 1), (94, 1), (95, 1), (96, 1), (97, 1), (98, 1), (99, 1), (100, 1), (101, 1), (102, 1), (103, 2), (104, 1), (105, 1), (106, 1), (107, 1), (108, 3), (109, 1), (110, 1)

In [24]:
%time lda_model = gensim.models.LdaModel(emorywheel_corpus, num_topics=15, id2word=id2word_emorywheel, passes=100) 

INFO : using symmetric alpha at 0.06666666666666667
INFO : using symmetric eta at 0.06666666666666667
INFO : using serial LDA version on this node
INFO : running online (multi-pass) LDA training, 15 topics, 100 passes over the supplied corpus of 4025 documents, updating model once every 2000 documents, evaluating perplexity every 4025 documents, iterating 50x with a convergence threshold of 0.001000
INFO : PROGRESS: pass 0, at document #2000/4025
INFO : merging changes from 2000 documents into a model of 4025 documents
INFO : topic #1 (0.067): 0.022*"emory" + 0.007*"university" + 0.007*"said" + 0.005*"students" + 0.004*"according" + 0.004*"student" + 0.004*"wheel" + 0.004*"atlanta" + 0.003*"campus" + 0.003*"people"
INFO : topic #0 (0.067): 0.008*"said" + 0.006*"emory" + 0.004*"team" + 0.004*"university" + 0.004*"time" + 0.004*"people" + 0.003*"like" + 0.003*"according" + 0.003*"atlanta" + 0.003*"college"
INFO : topic #8 (0.067): 0.016*"said" + 0.009*"emory" + 0.006*"student" + 0.004*"s

INFO : topic #2 (0.067): 0.021*"game" + 0.019*"team" + 0.017*"emory" + 0.014*"said" + 0.013*"eagles" + 0.011*"brereton" + 0.009*"great" + 0.008*"win" + 0.007*"season" + 0.007*"britain"
INFO : topic diff=0.806172, rho=0.446656
INFO : PROGRESS: pass 2, at document #4000/4025
INFO : merging changes from 2000 documents into a model of 4025 documents
INFO : topic #9 (0.067): 0.010*"album" + 0.008*"song" + 0.008*"like" + 0.007*"people" + 0.006*"music" + 0.006*"way" + 0.004*"love" + 0.004*"songs" + 0.004*"audience" + 0.004*"track"
INFO : topic #0 (0.067): 0.009*"said" + 0.005*"time" + 0.005*"people" + 0.005*"atlanta" + 0.005*"year" + 0.004*"like" + 0.004*"poetry" + 0.003*"history" + 0.003*"new" + 0.003*"american"
INFO : topic #12 (0.067): 0.016*"band" + 0.012*"song" + 0.010*"tew" + 0.009*"way" + 0.009*"music" + 0.008*"years" + 0.008*"definitely" + 0.008*"ck" + 0.008*"kind" + 0.008*"lot"
INFO : topic #11 (0.067): 0.028*"said" + 0.023*"students" + 0.016*"sga" + 0.014*"student" + 0.011*"college"

INFO : topic diff=0.271261, rho=0.377627
INFO : -7.018 per-word bound, 129.6 perplexity estimate based on a held-out corpus of 25 documents with 7973 words
INFO : PROGRESS: pass 4, at document #4025/4025
INFO : merging changes from 25 documents into a model of 4025 documents
INFO : topic #3 (0.067): 0.017*"season" + 0.016*"game" + 0.009*"win" + 0.009*"quarterback" + 0.008*"backup" + 0.007*"week" + 0.007*"games" + 0.007*"year" + 0.007*"nfl" + 0.006*"allen"
INFO : topic #9 (0.067): 0.015*"album" + 0.011*"song" + 0.011*"jpegmafia" + 0.007*"parade" + 0.007*"track" + 0.006*"peggy" + 0.006*"beat" + 0.006*"people" + 0.005*"like" + 0.005*"event"
INFO : topic #12 (0.067): 0.028*"song" + 0.024*"band" + 0.018*"way" + 0.016*"tew" + 0.016*"ck" + 0.014*"definitely" + 0.012*"music" + 0.012*"album" + 0.012*"try" + 0.010*"kind"
INFO : topic #0 (0.067): 0.011*"said" + 0.008*"people" + 0.006*"time" + 0.006*"like" + 0.004*"think" + 0.004*"atlanta" + 0.004*"year" + 0.004*"american" + 0.004*"new" + 0.003*"w

INFO : topic #6 (0.067): 0.011*"food" + 0.010*"election" + 0.007*"elections" + 0.005*"coffee" + 0.005*"carter" + 0.005*"party" + 0.004*"vote" + 0.004*"candidates" + 0.004*"candidate" + 0.004*"voting"
INFO : topic #7 (0.067): 0.026*"team" + 0.020*"emory" + 0.015*"said" + 0.013*"place" + 0.012*"eagles" + 0.012*"sophomore" + 0.011*"men" + 0.010*"time" + 0.010*"university" + 0.009*"junior"
INFO : topic #10 (0.067): 0.032*"eagles" + 0.020*"game" + 0.020*"goal" + 0.016*"forward" + 0.013*"ball" + 0.013*"team" + 0.012*"said" + 0.012*"minute" + 0.011*"midfielder" + 0.010*"university"
INFO : topic #3 (0.067): 0.018*"game" + 0.017*"season" + 0.009*"win" + 0.008*"games" + 0.007*"points" + 0.007*"year" + 0.007*"quarterback" + 0.006*"week" + 0.006*"point" + 0.006*"nfl"
INFO : topic #8 (0.067): 0.019*"said" + 0.016*"student" + 0.012*"work" + 0.009*"course" + 0.009*"year" + 0.008*"council" + 0.007*"honor" + 0.007*"assignment" + 0.007*"life" + 0.007*"emory"
INFO : topic diff=0.221386, rho=0.316030
INFO

INFO : topic #14 (0.067): 0.017*"film" + 0.006*"story" + 0.005*"like" + 0.004*"musical" + 0.004*"characters" + 0.004*"time" + 0.004*"man" + 0.004*"world" + 0.003*"character" + 0.003*"movie"
INFO : topic #12 (0.067): 0.025*"song" + 0.023*"band" + 0.016*"way" + 0.015*"music" + 0.013*"tew" + 0.013*"definitely" + 0.011*"ck" + 0.011*"album" + 0.010*"try" + 0.010*"kind"
INFO : topic #10 (0.067): 0.032*"eagles" + 0.021*"goal" + 0.021*"game" + 0.017*"forward" + 0.013*"team" + 0.013*"ball" + 0.012*"said" + 0.011*"minute" + 0.011*"university" + 0.010*"midfielder"
INFO : topic #13 (0.067): 0.030*"emory" + 0.026*"said" + 0.017*"campus" + 0.015*"epd" + 0.015*"complainant" + 0.014*"sept" + 0.011*"report" + 0.011*"library" + 0.010*"student" + 0.008*"responded"
INFO : topic diff=0.144052, rho=0.288525
INFO : -6.806 per-word bound, 111.9 perplexity estimate based on a held-out corpus of 25 documents with 7973 words
INFO : PROGRESS: pass 9, at document #4025/4025
INFO : merging changes from 25 documents

INFO : topic #1 (0.067): 0.020*"emory" + 0.012*"students" + 0.009*"university" + 0.008*"according" + 0.006*"college" + 0.006*"said" + 0.005*"education" + 0.004*"community" + 0.004*"president" + 0.004*"student"
INFO : topic #5 (0.067): 0.019*"emory" + 0.016*"climate" + 0.015*"falcons" + 0.015*"students" + 0.014*"facility" + 0.010*"action" + 0.008*"said" + 0.007*"like" + 0.006*"research" + 0.006*"high"
INFO : topic #4 (0.067): 0.024*"yak" + 0.023*"yik" + 0.015*"speech" + 0.015*"resolution" + 0.013*"emory" + 0.012*"think" + 0.011*"people" + 0.010*"sga" + 0.008*"posts" + 0.008*"world"
INFO : topic diff=0.233863, rho=0.267142
INFO : PROGRESS: pass 12, at document #2000/4025
INFO : merging changes from 2000 documents into a model of 4025 documents
INFO : topic #3 (0.067): 0.019*"game" + 0.019*"season" + 0.009*"win" + 0.009*"games" + 0.008*"points" + 0.007*"year" + 0.007*"quarterback" + 0.007*"week" + 0.006*"point" + 0.006*"nfl"
INFO : topic #13 (0.067): 0.030*"emory" + 0.027*"said" + 0.018*"

INFO : topic #1 (0.067): 0.020*"emory" + 0.012*"students" + 0.009*"university" + 0.009*"according" + 0.007*"said" + 0.005*"college" + 0.005*"president" + 0.004*"community" + 0.004*"student" + 0.004*"school"
INFO : topic #14 (0.067): 0.018*"film" + 0.006*"story" + 0.005*"musical" + 0.004*"nancy" + 0.004*"characters" + 0.004*"set" + 0.004*"man" + 0.004*"like" + 0.004*"david" + 0.003*"factory"
INFO : topic diff=0.141110, rho=0.242447
INFO : PROGRESS: pass 14, at document #4000/4025
INFO : merging changes from 2000 documents into a model of 4025 documents
INFO : topic #6 (0.067): 0.013*"food" + 0.011*"election" + 0.007*"coffee" + 0.007*"elections" + 0.007*"carter" + 0.006*"party" + 0.005*"vote" + 0.005*"candidates" + 0.005*"candidate" + 0.004*"restaurant"
INFO : topic #12 (0.067): 0.027*"song" + 0.025*"band" + 0.016*"music" + 0.016*"way" + 0.014*"tew" + 0.012*"definitely" + 0.012*"ck" + 0.011*"album" + 0.010*"try" + 0.009*"kind"
INFO : topic #5 (0.067): 0.019*"emory" + 0.015*"climate" + 0.

INFO : topic diff=0.103965, rho=0.229340
INFO : -6.734 per-word bound, 106.4 perplexity estimate based on a held-out corpus of 25 documents with 7973 words
INFO : PROGRESS: pass 16, at document #4025/4025
INFO : merging changes from 25 documents into a model of 4025 documents
INFO : topic #0 (0.067): 0.011*"like" + 0.009*"people" + 0.008*"said" + 0.007*"time" + 0.006*"think" + 0.005*"way" + 0.004*"work" + 0.004*"know" + 0.004*"new" + 0.004*"want"
INFO : topic #10 (0.067): 0.033*"eagles" + 0.021*"game" + 0.018*"goal" + 0.016*"forward" + 0.013*"team" + 0.013*"ball" + 0.012*"said" + 0.011*"defense" + 0.011*"university" + 0.011*"minute"
INFO : topic #1 (0.067): 0.020*"emory" + 0.012*"students" + 0.009*"university" + 0.008*"according" + 0.007*"said" + 0.006*"college" + 0.005*"president" + 0.004*"community" + 0.004*"education" + 0.004*"student"
INFO : topic #7 (0.067): 0.027*"team" + 0.020*"emory" + 0.014*"said" + 0.014*"place" + 0.013*"sophomore" + 0.012*"men" + 0.011*"eagles" + 0.010*"time

INFO : topic #4 (0.067): 0.022*"yak" + 0.021*"yik" + 0.016*"speech" + 0.014*"resolution" + 0.013*"emory" + 0.012*"think" + 0.011*"sga" + 0.010*"people" + 0.008*"posts" + 0.008*"world"
INFO : topic #8 (0.067): 0.019*"student" + 0.017*"said" + 0.014*"work" + 0.012*"course" + 0.011*"year" + 0.010*"assignment" + 0.010*"honor" + 0.009*"council" + 0.008*"life" + 0.007*"emory"
INFO : topic #11 (0.067): 0.037*"said" + 0.034*"students" + 0.021*"sga" + 0.018*"student" + 0.014*"college" + 0.011*"campus" + 0.011*"meeting" + 0.010*"emory" + 0.009*"president" + 0.008*"zoberman"
INFO : topic #12 (0.067): 0.028*"song" + 0.025*"band" + 0.016*"way" + 0.016*"tew" + 0.015*"music" + 0.014*"ck" + 0.012*"definitely" + 0.012*"album" + 0.010*"try" + 0.009*"kind"
INFO : topic #13 (0.067): 0.030*"emory" + 0.028*"said" + 0.018*"campus" + 0.017*"complainant" + 0.015*"sept" + 0.014*"epd" + 0.012*"report" + 0.011*"library" + 0.009*"student" + 0.008*"university"
INFO : topic diff=0.118110, rho=0.213140
INFO : PROGRES

INFO : topic #4 (0.067): 0.021*"yak" + 0.020*"yik" + 0.016*"speech" + 0.013*"resolution" + 0.012*"emory" + 0.011*"think" + 0.010*"people" + 0.010*"sga" + 0.008*"world" + 0.007*"posts"
INFO : topic #2 (0.067): 0.027*"game" + 0.025*"team" + 0.022*"emory" + 0.016*"brereton" + 0.015*"great" + 0.012*"eagles" + 0.011*"britain" + 0.011*"runs" + 0.010*"baseball" + 0.010*"said"
INFO : topic #11 (0.067): 0.037*"said" + 0.034*"students" + 0.021*"sga" + 0.018*"student" + 0.015*"college" + 0.011*"campus" + 0.010*"emory" + 0.010*"meeting" + 0.010*"president" + 0.008*"zoberman"
INFO : topic #8 (0.067): 0.019*"student" + 0.017*"said" + 0.012*"work" + 0.012*"course" + 0.011*"year" + 0.009*"life" + 0.009*"honor" + 0.008*"house" + 0.008*"assignment" + 0.008*"council"
INFO : topic diff=0.090352, rho=0.204071
INFO : -6.712 per-word bound, 104.8 perplexity estimate based on a held-out corpus of 25 documents with 7973 words
INFO : PROGRESS: pass 21, at document #4025/4025
INFO : merging changes from 25 docum

INFO : topic #8 (0.067): 0.020*"student" + 0.016*"work" + 0.014*"said" + 0.014*"course" + 0.013*"assignment" + 0.011*"honor" + 0.011*"year" + 0.009*"online" + 0.008*"council" + 0.008*"reported"
INFO : topic #0 (0.067): 0.011*"like" + 0.008*"people" + 0.007*"said" + 0.006*"time" + 0.006*"think" + 0.005*"way" + 0.004*"work" + 0.004*"know" + 0.004*"life" + 0.004*"new"
INFO : topic #10 (0.067): 0.033*"eagles" + 0.021*"game" + 0.018*"goal" + 0.016*"forward" + 0.013*"team" + 0.013*"ball" + 0.012*"said" + 0.011*"university" + 0.011*"defense" + 0.011*"minute"
INFO : topic diff=0.159476, rho=0.196069
INFO : PROGRESS: pass 24, at document #2000/4025
INFO : merging changes from 2000 documents into a model of 4025 documents
INFO : topic #5 (0.067): 0.019*"emory" + 0.015*"climate" + 0.015*"students" + 0.014*"falcons" + 0.013*"facility" + 0.009*"action" + 0.008*"said" + 0.007*"like" + 0.006*"research" + 0.006*"high"
INFO : topic #10 (0.067): 0.034*"eagles" + 0.021*"game" + 0.019*"goal" + 0.016*"forw

INFO : topic #5 (0.067): 0.019*"emory" + 0.015*"climate" + 0.015*"students" + 0.014*"falcons" + 0.013*"facility" + 0.009*"action" + 0.008*"said" + 0.007*"like" + 0.006*"research" + 0.006*"high"
INFO : topic #4 (0.067): 0.022*"yak" + 0.021*"yik" + 0.016*"speech" + 0.014*"resolution" + 0.013*"emory" + 0.012*"think" + 0.011*"sga" + 0.010*"people" + 0.008*"posts" + 0.008*"world"
INFO : topic diff=0.098954, rho=0.185655
INFO : PROGRESS: pass 26, at document #4000/4025
INFO : merging changes from 2000 documents into a model of 4025 documents
INFO : topic #1 (0.067): 0.021*"emory" + 0.012*"students" + 0.009*"university" + 0.008*"according" + 0.008*"said" + 0.006*"college" + 0.005*"president" + 0.005*"community" + 0.004*"school" + 0.004*"people"
INFO : topic #8 (0.067): 0.020*"student" + 0.016*"said" + 0.013*"work" + 0.012*"course" + 0.011*"year" + 0.009*"honor" + 0.009*"life" + 0.008*"assignment" + 0.008*"house" + 0.008*"council"
INFO : topic #12 (0.067): 0.028*"song" + 0.025*"band" + 0.017*"

INFO : topic #3 (0.067): 0.022*"game" + 0.019*"season" + 0.010*"games" + 0.009*"win" + 0.009*"points" + 0.008*"year" + 0.007*"week" + 0.007*"point" + 0.006*"quarterback" + 0.006*"hawks"
INFO : topic diff=0.078023, rho=0.179569
INFO : -6.694 per-word bound, 103.6 perplexity estimate based on a held-out corpus of 25 documents with 7973 words
INFO : PROGRESS: pass 28, at document #4025/4025
INFO : merging changes from 25 documents into a model of 4025 documents
INFO : topic #12 (0.067): 0.030*"song" + 0.026*"band" + 0.017*"way" + 0.016*"tew" + 0.015*"music" + 0.015*"ck" + 0.013*"definitely" + 0.012*"album" + 0.011*"try" + 0.010*"songwriting"
INFO : topic #6 (0.067): 0.014*"food" + 0.010*"election" + 0.007*"carter" + 0.007*"coffee" + 0.006*"party" + 0.005*"elections" + 0.005*"vote" + 0.005*"candidates" + 0.004*"restaurant" + 0.004*"candidate"
INFO : topic #13 (0.067): 0.029*"emory" + 0.029*"said" + 0.019*"campus" + 0.018*"complainant" + 0.017*"sept" + 0.014*"epd" + 0.013*"report" + 0.012*"

INFO : topic diff=0.139150, rho=0.174045
INFO : PROGRESS: pass 31, at document #2000/4025
INFO : merging changes from 2000 documents into a model of 4025 documents
INFO : topic #0 (0.067): 0.011*"like" + 0.008*"people" + 0.007*"said" + 0.006*"time" + 0.005*"think" + 0.005*"way" + 0.004*"work" + 0.004*"life" + 0.004*"know" + 0.004*"new"
INFO : topic #7 (0.067): 0.027*"team" + 0.021*"emory" + 0.015*"said" + 0.013*"place" + 0.012*"sophomore" + 0.012*"eagles" + 0.011*"men" + 0.010*"university" + 0.010*"time" + 0.010*"junior"
INFO : topic #5 (0.067): 0.019*"emory" + 0.015*"climate" + 0.015*"students" + 0.014*"falcons" + 0.013*"facility" + 0.009*"action" + 0.008*"said" + 0.007*"like" + 0.006*"research" + 0.006*"high"
INFO : topic #14 (0.067): 0.019*"film" + 0.006*"story" + 0.006*"musical" + 0.005*"nancy" + 0.004*"set" + 0.004*"characters" + 0.004*"man" + 0.004*"david" + 0.004*"factory" + 0.004*"character"
INFO : topic #3 (0.067): 0.021*"game" + 0.020*"season" + 0.010*"games" + 0.010*"win" + 

INFO : topic #13 (0.067): 0.028*"emory" + 0.028*"said" + 0.018*"campus" + 0.016*"complainant" + 0.015*"epd" + 0.015*"sept" + 0.012*"report" + 0.012*"library" + 0.009*"student" + 0.008*"responded"
INFO : topic #8 (0.067): 0.020*"student" + 0.015*"said" + 0.013*"work" + 0.013*"course" + 0.011*"year" + 0.010*"honor" + 0.009*"assignment" + 0.009*"life" + 0.008*"house" + 0.008*"council"
INFO : topic #0 (0.067): 0.011*"like" + 0.008*"people" + 0.006*"time" + 0.006*"said" + 0.006*"think" + 0.005*"way" + 0.004*"work" + 0.004*"life" + 0.004*"know" + 0.004*"new"
INFO : topic #12 (0.067): 0.028*"song" + 0.025*"band" + 0.017*"music" + 0.016*"way" + 0.015*"tew" + 0.013*"ck" + 0.012*"definitely" + 0.012*"album" + 0.010*"try" + 0.009*"kind"
INFO : topic diff=0.071750, rho=0.166638
INFO : -6.686 per-word bound, 103.0 perplexity estimate based on a held-out corpus of 25 documents with 7973 words
INFO : PROGRESS: pass 33, at document #4025/4025
INFO : merging changes from 25 documents into a model of 40

INFO : topic #4 (0.067): 0.023*"yak" + 0.022*"yik" + 0.015*"speech" + 0.014*"resolution" + 0.013*"emory" + 0.012*"sga" + 0.012*"think" + 0.010*"people" + 0.008*"posts" + 0.008*"world"
INFO : topic #7 (0.067): 0.027*"team" + 0.020*"emory" + 0.015*"said" + 0.013*"place" + 0.013*"sophomore" + 0.012*"eagles" + 0.011*"men" + 0.010*"university" + 0.010*"junior" + 0.010*"time"
INFO : topic #6 (0.067): 0.014*"food" + 0.010*"election" + 0.007*"carter" + 0.007*"coffee" + 0.006*"party" + 0.005*"vote" + 0.005*"candidates" + 0.004*"elections" + 0.004*"restaurant" + 0.004*"atlanta"
INFO : topic diff=0.128461, rho=0.162195
INFO : PROGRESS: pass 36, at document #2000/4025
INFO : merging changes from 2000 documents into a model of 4025 documents
INFO : topic #2 (0.067): 0.027*"game" + 0.026*"team" + 0.021*"emory" + 0.020*"brereton" + 0.017*"great" + 0.014*"britain" + 0.011*"baseball" + 0.011*"runs" + 0.010*"eagles" + 0.009*"said"
INFO : topic #10 (0.067): 0.034*"eagles" + 0.021*"game" + 0.019*"goal" + 

INFO : topic #9 (0.067): 0.018*"album" + 0.012*"song" + 0.011*"jpegmafia" + 0.008*"track" + 0.008*"parade" + 0.007*"beat" + 0.006*"crowd" + 0.006*"peggy" + 0.006*"event" + 0.006*"post"
INFO : topic #8 (0.067): 0.021*"student" + 0.015*"work" + 0.015*"said" + 0.014*"course" + 0.011*"year" + 0.011*"assignment" + 0.011*"honor" + 0.009*"council" + 0.008*"online" + 0.008*"life"
INFO : topic diff=0.080137, rho=0.156150
INFO : PROGRESS: pass 38, at document #4000/4025
INFO : merging changes from 2000 documents into a model of 4025 documents
INFO : topic #11 (0.067): 0.038*"said" + 0.035*"students" + 0.020*"sga" + 0.018*"student" + 0.014*"college" + 0.011*"campus" + 0.010*"emory" + 0.010*"president" + 0.009*"meeting" + 0.008*"zoberman"
INFO : topic #1 (0.067): 0.021*"emory" + 0.012*"students" + 0.009*"university" + 0.008*"said" + 0.008*"according" + 0.006*"college" + 0.005*"president" + 0.005*"community" + 0.004*"school" + 0.004*"people"
INFO : topic #13 (0.067): 0.028*"emory" + 0.028*"said" + 

INFO : topic #2 (0.067): 0.027*"game" + 0.026*"team" + 0.021*"emory" + 0.018*"brereton" + 0.016*"great" + 0.012*"britain" + 0.011*"runs" + 0.011*"baseball" + 0.010*"eagles" + 0.009*"said"
INFO : topic diff=0.065049, rho=0.152476
INFO : -6.678 per-word bound, 102.4 perplexity estimate based on a held-out corpus of 25 documents with 7973 words
INFO : PROGRESS: pass 40, at document #4025/4025
INFO : merging changes from 25 documents into a model of 4025 documents
INFO : topic #9 (0.067): 0.018*"album" + 0.012*"song" + 0.012*"jpegmafia" + 0.009*"parade" + 0.008*"track" + 0.007*"beat" + 0.007*"peggy" + 0.006*"event" + 0.006*"crowd" + 0.006*"lantern"
INFO : topic #14 (0.067): 0.018*"film" + 0.007*"musical" + 0.006*"story" + 0.005*"nancy" + 0.004*"set" + 0.004*"factory" + 0.004*"david" + 0.004*"emory" + 0.004*"man" + 0.004*"characters"
INFO : topic #7 (0.067): 0.026*"team" + 0.021*"emory" + 0.015*"said" + 0.013*"place" + 0.013*"sophomore" + 0.012*"eagles" + 0.011*"men" + 0.010*"university" + 

INFO : topic diff=0.117048, rho=0.149050
INFO : PROGRESS: pass 43, at document #2000/4025
INFO : merging changes from 2000 documents into a model of 4025 documents
INFO : topic #12 (0.067): 0.029*"song" + 0.026*"band" + 0.016*"way" + 0.016*"music" + 0.016*"tew" + 0.014*"ck" + 0.012*"definitely" + 0.012*"album" + 0.010*"try" + 0.009*"kind"
INFO : topic #2 (0.067): 0.027*"game" + 0.026*"team" + 0.021*"brereton" + 0.021*"emory" + 0.017*"great" + 0.014*"britain" + 0.012*"baseball" + 0.011*"runs" + 0.009*"eagles" + 0.009*"said"
INFO : topic #9 (0.067): 0.018*"album" + 0.012*"song" + 0.011*"jpegmafia" + 0.008*"track" + 0.008*"parade" + 0.007*"beat" + 0.006*"crowd" + 0.006*"peggy" + 0.006*"event" + 0.006*"post"
INFO : topic #5 (0.067): 0.019*"emory" + 0.015*"climate" + 0.015*"students" + 0.014*"falcons" + 0.013*"facility" + 0.009*"action" + 0.008*"said" + 0.007*"like" + 0.006*"research" + 0.006*"high"
INFO : topic #0 (0.067): 0.011*"like" + 0.008*"people" + 0.006*"time" + 0.006*"said" + 0.005

INFO : topic #1 (0.067): 0.021*"emory" + 0.012*"students" + 0.009*"university" + 0.009*"said" + 0.008*"according" + 0.006*"college" + 0.005*"president" + 0.005*"community" + 0.004*"school" + 0.004*"people"
INFO : topic #11 (0.067): 0.038*"said" + 0.035*"students" + 0.020*"sga" + 0.018*"student" + 0.014*"college" + 0.011*"campus" + 0.010*"emory" + 0.010*"president" + 0.009*"meeting" + 0.008*"zoberman"
INFO : topic #5 (0.067): 0.019*"emory" + 0.015*"climate" + 0.014*"students" + 0.013*"falcons" + 0.013*"facility" + 0.009*"action" + 0.007*"said" + 0.007*"like" + 0.006*"research" + 0.006*"football"
INFO : topic #14 (0.067): 0.019*"film" + 0.007*"story" + 0.005*"musical" + 0.004*"characters" + 0.004*"set" + 0.004*"nancy" + 0.004*"man" + 0.004*"character" + 0.004*"emory" + 0.004*"movie"
INFO : topic diff=0.061280, rho=0.144319
INFO : -6.674 per-word bound, 102.1 perplexity estimate based on a held-out corpus of 25 documents with 7973 words
INFO : PROGRESS: pass 45, at document #4025/4025
INF

INFO : topic #9 (0.067): 0.018*"album" + 0.012*"song" + 0.012*"jpegmafia" + 0.009*"parade" + 0.008*"track" + 0.007*"beat" + 0.006*"peggy" + 0.006*"event" + 0.006*"crowd" + 0.006*"lantern"
INFO : topic #13 (0.067): 0.029*"said" + 0.028*"emory" + 0.019*"campus" + 0.018*"complainant" + 0.017*"sept" + 0.014*"epd" + 0.013*"report" + 0.012*"library" + 0.008*"student" + 0.008*"university"
INFO : topic #7 (0.067): 0.026*"team" + 0.021*"emory" + 0.015*"said" + 0.013*"place" + 0.013*"sophomore" + 0.012*"eagles" + 0.011*"men" + 0.010*"junior" + 0.010*"university" + 0.010*"time"
INFO : topic diff=0.110484, rho=0.141404
INFO : PROGRESS: pass 48, at document #2000/4025
INFO : merging changes from 2000 documents into a model of 4025 documents
INFO : topic #7 (0.067): 0.026*"team" + 0.021*"emory" + 0.016*"said" + 0.013*"place" + 0.012*"sophomore" + 0.012*"eagles" + 0.010*"men" + 0.010*"university" + 0.010*"time" + 0.010*"junior"
INFO : topic #12 (0.067): 0.029*"song" + 0.026*"band" + 0.016*"way" + 0.0

INFO : topic #1 (0.067): 0.021*"emory" + 0.012*"students" + 0.009*"university" + 0.009*"said" + 0.009*"according" + 0.005*"college" + 0.005*"president" + 0.004*"community" + 0.004*"school" + 0.004*"student"
INFO : topic diff=0.068878, rho=0.137344
INFO : PROGRESS: pass 50, at document #4000/4025
INFO : merging changes from 2000 documents into a model of 4025 documents
INFO : topic #6 (0.067): 0.015*"food" + 0.009*"election" + 0.007*"carter" + 0.007*"coffee" + 0.006*"party" + 0.005*"vote" + 0.005*"restaurant" + 0.005*"atlanta" + 0.004*"candidates" + 0.004*"candidate"
INFO : topic #4 (0.067): 0.021*"yak" + 0.020*"yik" + 0.016*"speech" + 0.013*"resolution" + 0.012*"emory" + 0.011*"think" + 0.011*"sga" + 0.010*"people" + 0.008*"world" + 0.008*"posts"
INFO : topic #10 (0.067): 0.035*"eagles" + 0.021*"game" + 0.019*"goal" + 0.016*"forward" + 0.013*"team" + 0.013*"ball" + 0.013*"said" + 0.011*"university" + 0.010*"minute" + 0.010*"defense"
INFO : topic #12 (0.067): 0.028*"song" + 0.025*"band"

INFO : topic diff=0.056950, rho=0.134825
INFO : -6.670 per-word bound, 101.8 perplexity estimate based on a held-out corpus of 25 documents with 7973 words
INFO : PROGRESS: pass 52, at document #4025/4025
INFO : merging changes from 25 documents into a model of 4025 documents
INFO : topic #14 (0.067): 0.018*"film" + 0.006*"musical" + 0.006*"story" + 0.005*"nancy" + 0.005*"emory" + 0.004*"set" + 0.004*"factory" + 0.004*"david" + 0.004*"man" + 0.004*"characters"
INFO : topic #12 (0.067): 0.030*"song" + 0.026*"band" + 0.017*"way" + 0.016*"tew" + 0.016*"music" + 0.015*"ck" + 0.013*"definitely" + 0.012*"album" + 0.011*"try" + 0.010*"songwriting"
INFO : topic #11 (0.067): 0.038*"said" + 0.037*"students" + 0.018*"sga" + 0.018*"student" + 0.014*"college" + 0.012*"campus" + 0.010*"emory" + 0.010*"zoberman" + 0.010*"meeting" + 0.009*"president"
INFO : topic #13 (0.067): 0.029*"said" + 0.027*"emory" + 0.019*"campus" + 0.018*"complainant" + 0.017*"sept" + 0.014*"epd" + 0.013*"report" + 0.012*"libr

INFO : merging changes from 2000 documents into a model of 4025 documents
INFO : topic #9 (0.067): 0.018*"album" + 0.012*"song" + 0.011*"jpegmafia" + 0.008*"track" + 0.008*"parade" + 0.007*"beat" + 0.006*"crowd" + 0.006*"peggy" + 0.006*"event" + 0.006*"post"
INFO : topic #13 (0.067): 0.028*"said" + 0.027*"emory" + 0.018*"campus" + 0.017*"complainant" + 0.016*"sept" + 0.014*"epd" + 0.012*"report" + 0.012*"library" + 0.009*"student" + 0.008*"left"
INFO : topic #5 (0.067): 0.019*"emory" + 0.015*"climate" + 0.015*"students" + 0.014*"falcons" + 0.013*"facility" + 0.009*"action" + 0.008*"said" + 0.007*"like" + 0.006*"research" + 0.006*"high"
INFO : topic #6 (0.067): 0.014*"food" + 0.009*"election" + 0.007*"carter" + 0.007*"coffee" + 0.005*"party" + 0.005*"atlanta" + 0.005*"restaurant" + 0.004*"vote" + 0.004*"candidates" + 0.004*"abrams"
INFO : topic #12 (0.067): 0.029*"song" + 0.026*"band" + 0.016*"way" + 0.016*"music" + 0.016*"tew" + 0.014*"ck" + 0.012*"definitely" + 0.012*"album" + 0.010*"

INFO : topic #0 (0.067): 0.011*"like" + 0.008*"people" + 0.006*"time" + 0.006*"think" + 0.005*"said" + 0.005*"way" + 0.004*"work" + 0.004*"life" + 0.004*"know" + 0.004*"world"
INFO : topic #6 (0.067): 0.015*"food" + 0.009*"election" + 0.007*"carter" + 0.007*"coffee" + 0.006*"party" + 0.005*"atlanta" + 0.005*"restaurant" + 0.005*"vote" + 0.004*"candidates" + 0.004*"clinton"
INFO : topic #3 (0.067): 0.022*"game" + 0.020*"season" + 0.010*"games" + 0.009*"win" + 0.009*"points" + 0.008*"year" + 0.007*"week" + 0.007*"quarterback" + 0.006*"point" + 0.006*"players"
INFO : topic #5 (0.067): 0.019*"emory" + 0.015*"climate" + 0.014*"students" + 0.013*"falcons" + 0.013*"facility" + 0.009*"action" + 0.007*"said" + 0.007*"like" + 0.006*"research" + 0.006*"football"
INFO : topic diff=0.054379, rho=0.129086
INFO : -6.667 per-word bound, 101.6 perplexity estimate based on a held-out corpus of 25 documents with 7973 words
INFO : PROGRESS: pass 57, at document #4025/4025
INFO : merging changes from 25 do

INFO : topic #10 (0.067): 0.034*"eagles" + 0.022*"game" + 0.018*"goal" + 0.016*"forward" + 0.013*"team" + 0.013*"ball" + 0.013*"said" + 0.011*"university" + 0.011*"defense" + 0.010*"minute"
INFO : topic #6 (0.067): 0.015*"food" + 0.009*"election" + 0.007*"carter" + 0.007*"coffee" + 0.006*"party" + 0.005*"atlanta" + 0.005*"restaurant" + 0.005*"vote" + 0.004*"candidates" + 0.004*"clinton"
INFO : topic #11 (0.067): 0.038*"said" + 0.037*"students" + 0.018*"sga" + 0.018*"student" + 0.014*"college" + 0.011*"campus" + 0.010*"emory" + 0.010*"zoberman" + 0.010*"meeting" + 0.009*"president"
INFO : topic diff=0.098370, rho=0.126987
INFO : PROGRESS: pass 60, at document #2000/4025
INFO : merging changes from 2000 documents into a model of 4025 documents
INFO : topic #2 (0.067): 0.027*"game" + 0.026*"team" + 0.021*"brereton" + 0.020*"emory" + 0.018*"great" + 0.014*"britain" + 0.012*"baseball" + 0.011*"runs" + 0.009*"said" + 0.009*"eagles"
INFO : topic #8 (0.067): 0.021*"student" + 0.015*"work" + 0.

INFO : topic #3 (0.067): 0.021*"game" + 0.020*"season" + 0.010*"games" + 0.010*"win" + 0.008*"points" + 0.008*"year" + 0.008*"quarterback" + 0.007*"week" + 0.006*"nfl" + 0.006*"atlanta"
INFO : topic diff=0.061306, rho=0.124023
INFO : PROGRESS: pass 62, at document #4000/4025
INFO : merging changes from 2000 documents into a model of 4025 documents
INFO : topic #1 (0.067): 0.021*"emory" + 0.012*"students" + 0.009*"university" + 0.009*"said" + 0.008*"according" + 0.006*"college" + 0.005*"president" + 0.005*"community" + 0.004*"school" + 0.004*"people"
INFO : topic #10 (0.067): 0.035*"eagles" + 0.021*"game" + 0.019*"goal" + 0.016*"forward" + 0.013*"team" + 0.013*"ball" + 0.013*"said" + 0.011*"university" + 0.010*"minute" + 0.010*"defense"
INFO : topic #13 (0.067): 0.028*"said" + 0.027*"emory" + 0.018*"campus" + 0.017*"complainant" + 0.015*"sept" + 0.015*"epd" + 0.012*"report" + 0.012*"library" + 0.009*"student" + 0.008*"responded"
INFO : topic #5 (0.067): 0.019*"emory" + 0.015*"climate" +

INFO : topic diff=0.051307, rho=0.122158
INFO : -6.664 per-word bound, 101.4 perplexity estimate based on a held-out corpus of 25 documents with 7973 words
INFO : PROGRESS: pass 64, at document #4025/4025
INFO : merging changes from 25 documents into a model of 4025 documents
INFO : topic #8 (0.067): 0.021*"student" + 0.016*"work" + 0.015*"course" + 0.013*"assignment" + 0.012*"said" + 0.012*"honor" + 0.011*"year" + 0.009*"online" + 0.009*"council" + 0.008*"reported"
INFO : topic #7 (0.067): 0.026*"team" + 0.021*"emory" + 0.015*"said" + 0.013*"place" + 0.013*"sophomore" + 0.012*"eagles" + 0.011*"men" + 0.010*"junior" + 0.010*"university" + 0.010*"time"
INFO : topic #9 (0.067): 0.018*"album" + 0.012*"song" + 0.012*"jpegmafia" + 0.009*"parade" + 0.008*"track" + 0.007*"beat" + 0.006*"peggy" + 0.006*"crowd" + 0.006*"event" + 0.006*"lantern"
INFO : topic #13 (0.067): 0.029*"said" + 0.027*"emory" + 0.019*"campus" + 0.018*"complainant" + 0.016*"sept" + 0.014*"epd" + 0.013*"report" + 0.012*"lib

INFO : topic diff=0.092914, rho=0.120375
INFO : PROGRESS: pass 67, at document #2000/4025
INFO : merging changes from 2000 documents into a model of 4025 documents
INFO : topic #1 (0.067): 0.021*"emory" + 0.012*"students" + 0.009*"university" + 0.009*"said" + 0.009*"according" + 0.005*"college" + 0.005*"president" + 0.004*"community" + 0.004*"school" + 0.004*"student"
INFO : topic #6 (0.067): 0.014*"food" + 0.009*"election" + 0.007*"carter" + 0.007*"coffee" + 0.005*"party" + 0.005*"atlanta" + 0.005*"restaurant" + 0.004*"vote" + 0.004*"candidates" + 0.004*"abrams"
INFO : topic #14 (0.067): 0.019*"film" + 0.006*"story" + 0.006*"musical" + 0.005*"nancy" + 0.005*"emory" + 0.004*"set" + 0.004*"characters" + 0.004*"man" + 0.004*"david" + 0.004*"factory"
INFO : topic #5 (0.067): 0.019*"emory" + 0.015*"climate" + 0.015*"students" + 0.014*"falcons" + 0.013*"facility" + 0.009*"action" + 0.008*"said" + 0.007*"like" + 0.006*"research" + 0.006*"high"
INFO : topic #8 (0.067): 0.021*"student" + 0.015

INFO : topic #12 (0.067): 0.028*"song" + 0.025*"band" + 0.017*"music" + 0.016*"way" + 0.015*"tew" + 0.013*"ck" + 0.012*"definitely" + 0.012*"album" + 0.010*"try" + 0.009*"kind"
INFO : topic #8 (0.067): 0.021*"student" + 0.015*"work" + 0.014*"course" + 0.013*"said" + 0.011*"year" + 0.011*"honor" + 0.010*"assignment" + 0.009*"house" + 0.009*"life" + 0.008*"council"
INFO : topic #9 (0.067): 0.018*"album" + 0.012*"song" + 0.010*"jpegmafia" + 0.008*"track" + 0.008*"parade" + 0.006*"beat" + 0.006*"crowd" + 0.006*"event" + 0.005*"peggy" + 0.005*"post"
INFO : topic #1 (0.067): 0.021*"emory" + 0.012*"students" + 0.009*"university" + 0.009*"said" + 0.008*"according" + 0.006*"college" + 0.005*"president" + 0.005*"community" + 0.004*"school" + 0.004*"people"
INFO : topic #11 (0.067): 0.038*"said" + 0.035*"students" + 0.020*"sga" + 0.018*"student" + 0.014*"college" + 0.011*"campus" + 0.010*"emory" + 0.010*"president" + 0.009*"meeting" + 0.008*"zoberman"
INFO : topic diff=0.049418, rho=0.117841
INFO

INFO : topic #5 (0.067): 0.019*"emory" + 0.016*"climate" + 0.015*"students" + 0.015*"falcons" + 0.014*"facility" + 0.010*"action" + 0.008*"said" + 0.007*"like" + 0.006*"research" + 0.006*"high"
INFO : topic #14 (0.067): 0.018*"film" + 0.006*"story" + 0.006*"musical" + 0.005*"nancy" + 0.005*"emory" + 0.004*"set" + 0.004*"david" + 0.004*"factory" + 0.004*"man" + 0.004*"characters"
INFO : topic #13 (0.067): 0.029*"said" + 0.027*"emory" + 0.019*"campus" + 0.018*"complainant" + 0.016*"sept" + 0.014*"epd" + 0.013*"report" + 0.012*"library" + 0.008*"student" + 0.008*"left"
INFO : topic #2 (0.067): 0.027*"game" + 0.026*"team" + 0.023*"brereton" + 0.019*"emory" + 0.019*"great" + 0.016*"britain" + 0.012*"baseball" + 0.012*"runs" + 0.009*"said" + 0.009*"boden"
INFO : topic diff=0.089525, rho=0.116238
INFO : PROGRESS: pass 72, at document #2000/4025
INFO : merging changes from 2000 documents into a model of 4025 documents
INFO : topic #6 (0.067): 0.014*"food" + 0.009*"election" + 0.007*"carter" + 

INFO : topic #11 (0.067): 0.038*"said" + 0.035*"students" + 0.020*"sga" + 0.018*"student" + 0.013*"college" + 0.011*"campus" + 0.010*"emory" + 0.010*"president" + 0.009*"meeting" + 0.009*"zoberman"
INFO : topic #13 (0.067): 0.028*"said" + 0.027*"emory" + 0.018*"campus" + 0.017*"complainant" + 0.016*"sept" + 0.014*"epd" + 0.012*"report" + 0.012*"library" + 0.009*"student" + 0.008*"left"
INFO : topic #0 (0.067): 0.011*"like" + 0.008*"people" + 0.006*"time" + 0.005*"think" + 0.005*"said" + 0.005*"way" + 0.004*"work" + 0.004*"life" + 0.004*"know" + 0.004*"world"
INFO : topic diff=0.055762, rho=0.113951
INFO : PROGRESS: pass 74, at document #4000/4025
INFO : merging changes from 2000 documents into a model of 4025 documents
INFO : topic #5 (0.067): 0.019*"emory" + 0.015*"climate" + 0.014*"students" + 0.013*"falcons" + 0.013*"facility" + 0.009*"action" + 0.008*"said" + 0.007*"like" + 0.006*"research" + 0.006*"football"
INFO : topic #13 (0.067): 0.028*"said" + 0.027*"emory" + 0.018*"campus" +

INFO : topic #11 (0.067): 0.038*"said" + 0.035*"students" + 0.020*"sga" + 0.018*"student" + 0.014*"college" + 0.011*"campus" + 0.010*"emory" + 0.010*"president" + 0.009*"meeting" + 0.008*"zoberman"
INFO : topic #5 (0.067): 0.019*"emory" + 0.015*"climate" + 0.014*"students" + 0.013*"falcons" + 0.013*"facility" + 0.009*"action" + 0.008*"said" + 0.007*"like" + 0.006*"research" + 0.006*"football"
INFO : topic diff=0.047072, rho=0.112500
INFO : -6.660 per-word bound, 101.1 perplexity estimate based on a held-out corpus of 25 documents with 7973 words
INFO : PROGRESS: pass 76, at document #4025/4025
INFO : merging changes from 25 documents into a model of 4025 documents
INFO : topic #0 (0.067): 0.011*"like" + 0.008*"people" + 0.006*"time" + 0.005*"think" + 0.005*"said" + 0.005*"way" + 0.004*"work" + 0.004*"life" + 0.004*"know" + 0.004*"world"
INFO : topic #14 (0.067): 0.018*"film" + 0.006*"story" + 0.006*"musical" + 0.005*"nancy" + 0.005*"emory" + 0.004*"set" + 0.004*"david" + 0.004*"factory

INFO : topic #9 (0.067): 0.018*"album" + 0.012*"song" + 0.012*"jpegmafia" + 0.009*"parade" + 0.008*"track" + 0.007*"beat" + 0.006*"peggy" + 0.006*"crowd" + 0.006*"event" + 0.005*"lantern"
INFO : topic diff=0.085334, rho=0.111103
INFO : PROGRESS: pass 79, at document #2000/4025
INFO : merging changes from 2000 documents into a model of 4025 documents
INFO : topic #5 (0.067): 0.019*"emory" + 0.015*"climate" + 0.015*"students" + 0.014*"falcons" + 0.013*"facility" + 0.009*"action" + 0.008*"said" + 0.007*"like" + 0.006*"research" + 0.006*"high"
INFO : topic #4 (0.067): 0.022*"yak" + 0.021*"yik" + 0.015*"speech" + 0.014*"resolution" + 0.013*"emory" + 0.011*"think" + 0.011*"sga" + 0.010*"people" + 0.008*"posts" + 0.008*"world"
INFO : topic #10 (0.067): 0.035*"eagles" + 0.021*"game" + 0.018*"goal" + 0.016*"forward" + 0.013*"team" + 0.013*"ball" + 0.013*"said" + 0.011*"university" + 0.010*"minute" + 0.010*"defense"
INFO : topic #3 (0.067): 0.021*"game" + 0.020*"season" + 0.010*"games" + 0.010*"

INFO : topic diff=0.053130, rho=0.109101
INFO : PROGRESS: pass 81, at document #4000/4025
INFO : merging changes from 2000 documents into a model of 4025 documents
INFO : topic #11 (0.067): 0.038*"said" + 0.035*"students" + 0.020*"sga" + 0.018*"student" + 0.014*"college" + 0.011*"campus" + 0.010*"emory" + 0.010*"president" + 0.009*"meeting" + 0.008*"zoberman"
INFO : topic #2 (0.067): 0.027*"game" + 0.026*"team" + 0.020*"emory" + 0.019*"brereton" + 0.017*"great" + 0.013*"britain" + 0.012*"runs" + 0.011*"baseball" + 0.009*"eagles" + 0.009*"said"
INFO : topic #12 (0.067): 0.028*"song" + 0.025*"band" + 0.017*"music" + 0.016*"way" + 0.015*"tew" + 0.013*"ck" + 0.012*"definitely" + 0.012*"album" + 0.010*"try" + 0.009*"kind"
INFO : topic #6 (0.067): 0.015*"food" + 0.009*"election" + 0.007*"carter" + 0.007*"coffee" + 0.006*"party" + 0.005*"atlanta" + 0.005*"restaurant" + 0.004*"vote" + 0.004*"candidates" + 0.004*"menu"
INFO : topic #3 (0.067): 0.022*"game" + 0.020*"season" + 0.010*"games" + 0.0

INFO : PROGRESS: pass 83, at document #4025/4025
INFO : merging changes from 25 documents into a model of 4025 documents
INFO : topic #0 (0.067): 0.011*"like" + 0.008*"people" + 0.006*"time" + 0.005*"think" + 0.005*"said" + 0.005*"way" + 0.004*"work" + 0.004*"life" + 0.004*"know" + 0.004*"world"
INFO : topic #12 (0.067): 0.029*"song" + 0.026*"band" + 0.017*"way" + 0.016*"tew" + 0.016*"music" + 0.014*"ck" + 0.013*"definitely" + 0.012*"album" + 0.011*"try" + 0.010*"songwriting"
INFO : topic #13 (0.067): 0.029*"said" + 0.027*"emory" + 0.019*"campus" + 0.018*"complainant" + 0.016*"sept" + 0.014*"epd" + 0.013*"report" + 0.012*"library" + 0.008*"student" + 0.008*"left"
INFO : topic #8 (0.067): 0.021*"student" + 0.017*"work" + 0.015*"course" + 0.013*"assignment" + 0.012*"honor" + 0.012*"said" + 0.011*"year" + 0.009*"online" + 0.009*"council" + 0.008*"house"
INFO : topic #6 (0.067): 0.015*"food" + 0.009*"election" + 0.007*"carter" + 0.007*"coffee" + 0.006*"party" + 0.005*"atlanta" + 0.005*"res

INFO : topic #14 (0.067): 0.019*"film" + 0.006*"story" + 0.006*"musical" + 0.005*"emory" + 0.005*"nancy" + 0.004*"set" + 0.004*"characters" + 0.004*"man" + 0.004*"david" + 0.004*"factory"
INFO : topic #13 (0.067): 0.028*"said" + 0.027*"emory" + 0.018*"campus" + 0.017*"complainant" + 0.016*"sept" + 0.014*"epd" + 0.013*"report" + 0.012*"library" + 0.009*"student" + 0.008*"left"
INFO : topic #6 (0.067): 0.015*"food" + 0.009*"election" + 0.007*"carter" + 0.007*"coffee" + 0.005*"party" + 0.005*"atlanta" + 0.005*"restaurant" + 0.004*"vote" + 0.004*"abrams" + 0.004*"menu"
INFO : topic #3 (0.067): 0.021*"game" + 0.020*"season" + 0.010*"games" + 0.010*"win" + 0.008*"points" + 0.008*"year" + 0.008*"quarterback" + 0.007*"week" + 0.006*"nfl" + 0.006*"atlanta"
INFO : topic diff=0.051463, rho=0.105992
INFO : PROGRESS: pass 86, at document #4000/4025
INFO : merging changes from 2000 documents into a model of 4025 documents
INFO : topic #7 (0.067): 0.026*"team" + 0.021*"emory" + 0.016*"said" + 0.012*"

INFO : topic #1 (0.067): 0.021*"emory" + 0.012*"students" + 0.009*"said" + 0.009*"university" + 0.008*"according" + 0.006*"college" + 0.005*"president" + 0.005*"community" + 0.004*"school" + 0.004*"people"
INFO : topic #5 (0.067): 0.019*"emory" + 0.015*"climate" + 0.014*"students" + 0.013*"falcons" + 0.013*"facility" + 0.009*"action" + 0.008*"said" + 0.007*"like" + 0.006*"research" + 0.006*"football"
INFO : topic #9 (0.067): 0.018*"album" + 0.012*"song" + 0.010*"jpegmafia" + 0.008*"track" + 0.008*"parade" + 0.006*"beat" + 0.006*"crowd" + 0.006*"event" + 0.006*"peggy" + 0.005*"post"
INFO : topic diff=0.043737, rho=0.104821
INFO : -6.657 per-word bound, 100.9 perplexity estimate based on a held-out corpus of 25 documents with 7973 words
INFO : PROGRESS: pass 88, at document #4025/4025
INFO : merging changes from 25 documents into a model of 4025 documents
INFO : topic #2 (0.067): 0.027*"game" + 0.026*"team" + 0.023*"brereton" + 0.019*"great" + 0.019*"emory" + 0.016*"britain" + 0.012*"bas

INFO : topic #1 (0.067): 0.021*"emory" + 0.012*"students" + 0.009*"university" + 0.009*"said" + 0.008*"according" + 0.006*"college" + 0.005*"president" + 0.005*"community" + 0.004*"school" + 0.004*"people"
INFO : topic #2 (0.067): 0.027*"game" + 0.026*"team" + 0.023*"brereton" + 0.019*"great" + 0.019*"emory" + 0.016*"britain" + 0.012*"baseball" + 0.012*"runs" + 0.009*"said" + 0.009*"boden"
INFO : topic diff=0.079342, rho=0.103688
INFO : PROGRESS: pass 91, at document #2000/4025
INFO : merging changes from 2000 documents into a model of 4025 documents
INFO : topic #13 (0.067): 0.028*"said" + 0.027*"emory" + 0.018*"campus" + 0.017*"complainant" + 0.016*"sept" + 0.014*"epd" + 0.013*"report" + 0.012*"library" + 0.009*"student" + 0.008*"left"
INFO : topic #2 (0.067): 0.027*"game" + 0.026*"team" + 0.021*"brereton" + 0.019*"emory" + 0.018*"great" + 0.015*"britain" + 0.012*"baseball" + 0.012*"runs" + 0.009*"said" + 0.009*"eagles"
INFO : topic #8 (0.067): 0.021*"student" + 0.016*"work" + 0.014*

INFO : topic #5 (0.067): 0.019*"emory" + 0.015*"climate" + 0.015*"students" + 0.014*"falcons" + 0.013*"facility" + 0.009*"action" + 0.008*"said" + 0.007*"like" + 0.006*"research" + 0.006*"high"
INFO : topic diff=0.049360, rho=0.102055
INFO : PROGRESS: pass 93, at document #4000/4025
INFO : merging changes from 2000 documents into a model of 4025 documents
INFO : topic #9 (0.067): 0.018*"album" + 0.012*"song" + 0.010*"jpegmafia" + 0.008*"track" + 0.008*"parade" + 0.006*"beat" + 0.006*"crowd" + 0.006*"event" + 0.006*"peggy" + 0.005*"post"
INFO : topic #3 (0.067): 0.022*"game" + 0.020*"season" + 0.010*"games" + 0.009*"win" + 0.008*"points" + 0.008*"year" + 0.007*"week" + 0.007*"quarterback" + 0.006*"point" + 0.006*"players"
INFO : topic #12 (0.067): 0.028*"song" + 0.025*"band" + 0.017*"music" + 0.016*"way" + 0.015*"tew" + 0.013*"ck" + 0.012*"definitely" + 0.012*"album" + 0.010*"try" + 0.009*"kind"
INFO : topic #11 (0.067): 0.038*"said" + 0.035*"students" + 0.020*"sga" + 0.018*"student" + 

INFO : topic diff=0.042081, rho=0.101009
INFO : -6.656 per-word bound, 100.8 perplexity estimate based on a held-out corpus of 25 documents with 7973 words
INFO : PROGRESS: pass 95, at document #4025/4025
INFO : merging changes from 25 documents into a model of 4025 documents
INFO : topic #10 (0.067): 0.034*"eagles" + 0.022*"game" + 0.018*"goal" + 0.016*"forward" + 0.013*"team" + 0.013*"ball" + 0.013*"said" + 0.011*"university" + 0.010*"defense" + 0.010*"minute"
INFO : topic #5 (0.067): 0.019*"emory" + 0.016*"climate" + 0.015*"students" + 0.014*"falcons" + 0.013*"facility" + 0.009*"action" + 0.008*"said" + 0.007*"like" + 0.006*"research" + 0.006*"high"
INFO : topic #6 (0.067): 0.015*"food" + 0.008*"election" + 0.007*"carter" + 0.007*"coffee" + 0.005*"party" + 0.005*"atlanta" + 0.005*"restaurant" + 0.004*"vote" + 0.004*"menu" + 0.004*"candidates"
INFO : topic #7 (0.067): 0.026*"team" + 0.021*"emory" + 0.015*"said" + 0.013*"place" + 0.012*"sophomore" + 0.012*"eagles" + 0.011*"men" + 0.01

INFO : merging changes from 2000 documents into a model of 4025 documents
INFO : topic #12 (0.067): 0.029*"song" + 0.026*"band" + 0.016*"way" + 0.016*"music" + 0.016*"tew" + 0.014*"ck" + 0.012*"definitely" + 0.012*"album" + 0.010*"try" + 0.009*"kind"
INFO : topic #5 (0.067): 0.019*"emory" + 0.015*"climate" + 0.015*"students" + 0.014*"falcons" + 0.013*"facility" + 0.009*"action" + 0.008*"said" + 0.007*"like" + 0.006*"research" + 0.006*"high"
INFO : topic #13 (0.067): 0.028*"said" + 0.027*"emory" + 0.018*"campus" + 0.017*"complainant" + 0.016*"sept" + 0.014*"epd" + 0.013*"report" + 0.012*"library" + 0.009*"student" + 0.008*"left"
INFO : topic #3 (0.067): 0.021*"game" + 0.020*"season" + 0.010*"games" + 0.010*"win" + 0.008*"points" + 0.008*"year" + 0.008*"quarterback" + 0.007*"week" + 0.006*"nfl" + 0.006*"atlanta"
INFO : topic #14 (0.067): 0.019*"film" + 0.006*"story" + 0.006*"musical" + 0.005*"emory" + 0.005*"nancy" + 0.004*"set" + 0.004*"characters" + 0.004*"man" + 0.004*"david" + 0.004*

Wall time: 13min 34s


In [17]:
lda_model.show_topics(15, 20)

[(0,
  '0.034*"said" + 0.026*"complainant" + 0.023*"sept" + 0.021*"epd" + 0.019*"emory" + 0.015*"library" + 0.011*"left" + 0.011*"report" + 0.011*"responded" + 0.010*"camera" + 0.009*"student" + 0.009*"water" + 0.009*"laptop" + 0.009*"subject" + 0.009*"officer" + 0.008*"room" + 0.007*"campus" + 0.007*"reference" + 0.006*"locker" + 0.006*"officers"'),
 (1,
  '0.037*"album" + 0.024*"jpegmafia" + 0.023*"song" + 0.016*"track" + 0.013*"peggy" + 0.011*"beat" + 0.008*"political" + 0.008*"heroes" + 0.007*"post" + 0.007*"lyrics" + 0.007*"features" + 0.007*"way" + 0.006*"rap" + 0.006*"emotional" + 0.006*"hook" + 0.006*"release" + 0.006*"sounds" + 0.005*"songs" + 0.005*"beats" + 0.005*"like"'),
 (2,
  '0.010*"musical" + 0.010*"emory" + 0.010*"film" + 0.008*"nancy" + 0.006*"factory" + 0.006*"david" + 0.005*"set" + 0.005*"story" + 0.005*"chocolate" + 0.005*"charlie" + 0.004*"white" + 0.004*"solomon" + 0.004*"production" + 0.004*"scenes" + 0.004*"man" + 0.004*"second" + 0.004*"act" + 0.004*"room" + 

In [25]:
# let's format the words a little more nicely; 
# the formatted=False parameter returns tuples of (word, probability)

topics = lda_model.show_topics(15, 20, formatted=False)

for topic in topics:
    topic_num = topic[0]
    topic_words = ""
    
    topic_pairs = topic[1]
    for pair in topic_pairs:
        topic_words += pair[0] + ", "
    
    print("T" + str(topic_num) + ": " + topic_words)

T0: like, people, time, think, said, way, work, life, world, know, new, feel, want, different, ve, good, going, best, years, love, 
T1: emory, students, said, university, according, college, president, community, school, people, student, education, campus, year, trump, atlanta, georgia, program, department, wheel, 
T2: game, team, brereton, great, emory, britain, baseball, runs, said, boden, solomon, innings, eagles, final, tournament, petrels, run, year, inning, win, 
T3: game, season, win, games, quarterback, year, points, week, backup, nfl, point, players, atlanta, new, allen, hawks, team, teams, play, ryan, 
T4: yak, yik, speech, resolution, emory, sga, think, people, posts, world, university, thunberg, hate, media, support, empathy, student, members, person, abuse, 
T5: emory, climate, students, falcons, facility, action, said, like, research, high, problems, football, injury, activism, strike, building, care, change, people, new, 
T6: food, election, coffee, carter, party, atlant

In [26]:
from gensim.models.coherencemodel import CoherenceModel

cm = CoherenceModel(model=lda_model, corpus=emorywheel_corpus, coherence='u_mass')

coherence = cm.get_coherence()  # get coherence value

coherence

INFO : CorpusAccumulator accumulated stats from 1000 documents
INFO : CorpusAccumulator accumulated stats from 2000 documents
INFO : CorpusAccumulator accumulated stats from 3000 documents
INFO : CorpusAccumulator accumulated stats from 4000 documents


-2.452951560815258

## Topics at the Start of Each Semester in Each School Year (Spring 2015 to Fall 2019)

In [27]:
spring_2016 = DataFrame(df[(df['Date'].str[-4:] == "2016") & (df['Date'].str[0:7] == "January")], columns = ['Title', 'Content']).values.tolist()
fall_2016 = DataFrame(df[(df['Date'].str[-4:] == "2016") & (df['Date'].str[0:9] == "September")], columns = ['Title', 'Content']).values.tolist()
spring_2017 = DataFrame(df[(df['Date'].str[-4:] == "2017") & (df['Date'].str[0:7] == "January")], columns = ['Title', 'Content']).values.tolist()
fall_2017 = DataFrame(df[(df['Date'].str[-4:] == "2017") & (df['Date'].str[0:9] == "September")], columns = ['Title', 'Content']).values.tolist()
# spring_2018 = DataFrame(df[(df['Date'].str[-4:] == "2018") & (df['Date'].str[0:7] == "January")], columns = ['Title', 'Content']).values.tolist()
# fall_2018 = DataFrame(df[(df['Date'].str[-4:] == "2018") & (df['Date'].str[0:9] == "September")], columns = ['Title', 'Content']).values.tolist()
# spring_2019 = DataFrame(df[(df['Date'].str[-4:] == "2019") & (df['Date'].str[0:7] == "January")], columns = ['Title', 'Content']).values.tolist()
# fall_2019 = DataFrame(df[(df['Date'].str[-4:] == "2019") & (df['Date'].str[0:9] == "September")], columns = ['Title', 'Content']).values.tolist()

sp2016 = iter_docs(spring_2016)
fa2016 = iter_docs(fall_2016)
sp2017 = iter_docs(spring_2017)
fa2017 = iter_docs(fall_2017)
# sp2018 = iter_docs(spring_2018)
# fa2018 = iter_docs(fall_2018)
# sp2019 = iter_docs(spring_2019)
# fa2019 = iter_docs(fall_2019)

doc_stream_sp2016 = (tokens for _,tokens in sp2016)
doc_stream_fa2016 = (tokens for _,tokens in fa2016)
doc_stream_sp2017 = (tokens for _,tokens in sp2017)
doc_stream_fa2017 = (tokens for _,tokens in fa2017)
# doc_stream_sp2018 = (tokens for _,tokens in sp2018)
# doc_stream_fa2018 = (tokens for _,tokens in fa2018)
# doc_stream_sp2019 = (tokens for _,tokens in sp2019)
# doc_stream_fa2019 = (tokens for _,tokens in fa2019)
            
id2word_sp2016 = gensim.corpora.Dictionary(doc_stream_sp2016)
id2word_sp2016.filter_extremes(no_below = 2, no_above = 1.0)
id2word_fa2016 = gensim.corpora.Dictionary(doc_stream_fa2016)             
id2word_sp2017 = gensim.corpora.Dictionary(doc_stream_sp2017)
id2word_sp2017.filter_extremes(no_below=2, no_above=1.0) 
id2word_fa2017 = gensim.corpora.Dictionary(doc_stream_fa2017)
id2word_sp2017.filter_extremes(no_below=2, no_above=1.0) 
# id2word_sp2018 = gensim.corpora.Dictionary(doc_stream_sp2018).filter_extremes(no_below=2, no_above=1.0) 
# id2word_fa2018 = gensim.corpora.Dictionary(doc_stream_fa2018).filter_extremes(no_below=2, no_above=1.0) 
# id2word_sp2019 = gensim.corpora.Dictionary(doc_stream_sp2019).filter_extremes(no_below=2, no_above=1.0) 
# id2word_fa2019 = gensim.corpora.Dictionary(doc_stream_fa2019).filter_extremes(no_below=2, no_above=1.0)

sp2016_corpus = Corpus(spring_2016, id2word_sp2016)
fa2016_corpus = Corpus(fall_2016, id2word_fa2016)
sp2017_corpus = Corpus(spring_2017, id2word_sp2017)
fa2017_corpus = Corpus(fall_2017, id2word_fa2017)
# sp2018_corpus = Corpus(spring_2018, id2word_sp2018)
# fa2018_corpus = Corpus(fall_2018, id2word_fa2018)
# sp2019_corpus = Corpus(spring_2019, id2word_sp2019)
# fa2019_corpus = Corpus(fall_2019, id2word_fa2019)

  """Entry point for launching an IPython kernel.
  
  This is separate from the ipykernel package so we can avoid doing imports until
  after removing the cwd from sys.path.
INFO : adding document #0 to Dictionary(0 unique tokens: [])
INFO : built Dictionary(7904 unique tokens: ['abilities', 'absorbed', 'absurdities', 'acharacter', 'act']...) from 52 documents (total 21971 corpus positions)
INFO : discarding 5481 tokens: [('absorbed', 1), ('absurdities', 1), ('acharacter', 1), ('aimed', 1), ('amodern', 1), ('amusing', 1), ('andcharacters', 1), ('andgenos', 1), ('andthus', 1), ('anime', 1)]...
INFO : keeping 2423 tokens which were in no less than 2 and no more than 52 (=100.0%) documents
INFO : resulting dictionary: Dictionary(2423 unique tokens: ['abilities', 'act', 'acting', 'action', 'actually']...)
INFO : adding document #0 to Dictionary(0 unique tokens: [])
INFO : built Dictionary(11235 unique tokens: ['act', 'actswouldn', 'actually', 'advertised', 'alarm']...) from 106 documents 

In [28]:
%time lda_model_sp2016 = gensim.models.LdaModel(sp2016_corpus, num_topics=10, id2word=id2word_sp2016, passes=75)

INFO : using symmetric alpha at 0.1
INFO : using symmetric eta at 0.1
INFO : using serial LDA version on this node
INFO : running online (multi-pass) LDA training, 10 topics, 75 passes over the supplied corpus of 52 documents, updating model once every 52 documents, evaluating perplexity every 52 documents, iterating 50x with a convergence threshold of 0.001000
INFO : -10.163 per-word bound, 1146.5 perplexity estimate based on a held-out corpus of 52 documents with 15236 words
INFO : PROGRESS: pass 0, at document #52/52
INFO : topic #8 (0.100): 0.011*"film" + 0.006*"story" + 0.006*"said" + 0.005*"time" + 0.005*"students" + 0.005*"like" + 0.004*"emory" + 0.004*"best" + 0.004*"think" + 0.004*"year"
INFO : topic #2 (0.100): 0.025*"film" + 0.007*"like" + 0.007*"game" + 0.005*"best" + 0.005*"films" + 0.005*"emory" + 0.004*"real" + 0.004*"story" + 0.004*"said" + 0.004*"time"
INFO : topic #0 (0.100): 0.014*"students" + 0.011*"emory" + 0.009*"said" + 0.008*"film" + 0.006*"like" + 0.006*"game" 

INFO : topic diff=0.300075, rho=0.353553
INFO : -7.501 per-word bound, 181.1 perplexity estimate based on a held-out corpus of 52 documents with 15236 words
INFO : PROGRESS: pass 7, at document #52/52
INFO : topic #2 (0.100): 0.035*"film" + 0.010*"speech" + 0.008*"free" + 0.008*"films" + 0.008*"like" + 0.007*"hated" + 0.007*"yak" + 0.006*"yik" + 0.006*"authenticity" + 0.005*"great"
INFO : topic #4 (0.100): 0.013*"best" + 0.011*"emory" + 0.008*"page" + 0.007*"film" + 0.006*"library" + 0.006*"email" + 0.005*"service" + 0.005*"university" + 0.005*"information" + 0.005*"end"
INFO : topic #5 (0.100): 0.019*"house" + 0.011*"website" + 0.011*"said" + 0.010*"students" + 0.010*"program" + 0.010*"think" + 0.010*"emory" + 0.009*"book" + 0.009*"according" + 0.008*"warren"
INFO : topic #8 (0.100): 0.015*"said" + 0.011*"students" + 0.011*"lounge" + 0.008*"harris" + 0.008*"class" + 0.008*"student" + 0.008*"project" + 0.008*"school" + 0.008*"library" + 0.007*"according"
INFO : topic #6 (0.100): 0.022*

INFO : PROGRESS: pass 14, at document #52/52
INFO : topic #4 (0.100): 0.013*"best" + 0.013*"emory" + 0.008*"page" + 0.006*"library" + 0.006*"email" + 0.006*"service" + 0.006*"university" + 0.005*"information" + 0.005*"end" + 0.005*"little"
INFO : topic #3 (0.100): 0.028*"business" + 0.019*"said" + 0.016*"group" + 0.013*"school" + 0.012*"businesses" + 0.010*"goizueta" + 0.010*"provide" + 0.009*"areas" + 0.009*"trying" + 0.008*"help"
INFO : topic #7 (0.100): 0.039*"film" + 0.014*"story" + 0.008*"time" + 0.008*"films" + 0.008*"like" + 0.007*"year" + 0.006*"way" + 0.006*"best" + 0.005*"ve" + 0.005*"character"
INFO : topic #6 (0.100): 0.021*"game" + 0.011*"man" + 0.011*"star" + 0.009*"punch" + 0.009*"wars" + 0.009*"vote" + 0.008*"candidate" + 0.007*"students" + 0.007*"new" + 0.007*"like"
INFO : topic #5 (0.100): 0.019*"house" + 0.012*"said" + 0.011*"website" + 0.011*"think" + 0.010*"students" + 0.010*"program" + 0.010*"emory" + 0.009*"book" + 0.009*"according" + 0.008*"warren"
INFO : topic 

INFO : topic #6 (0.100): 0.021*"game" + 0.011*"man" + 0.011*"star" + 0.010*"punch" + 0.010*"wars" + 0.009*"vote" + 0.008*"candidate" + 0.007*"students" + 0.007*"new" + 0.007*"like"
INFO : topic #0 (0.100): 0.031*"students" + 0.028*"emory" + 0.016*"said" + 0.009*"black" + 0.009*"class" + 0.008*"college" + 0.007*"campus" + 0.007*"americans" + 0.007*"course" + 0.007*"student"
INFO : topic #8 (0.100): 0.015*"said" + 0.012*"students" + 0.012*"lounge" + 0.012*"student" + 0.009*"harris" + 0.009*"building" + 0.009*"class" + 0.008*"hall" + 0.008*"according" + 0.008*"project"
INFO : topic #7 (0.100): 0.040*"film" + 0.014*"story" + 0.008*"films" + 0.008*"time" + 0.008*"like" + 0.007*"year" + 0.007*"best" + 0.006*"way" + 0.005*"character" + 0.005*"ve"
INFO : topic diff=0.008934, rho=0.208514
INFO : -7.445 per-word bound, 174.3 perplexity estimate based on a held-out corpus of 52 documents with 15236 words
INFO : PROGRESS: pass 22, at document #52/52
INFO : topic #9 (0.100): 0.022*"team" + 0.016*"h

INFO : topic #9 (0.100): 0.022*"team" + 0.016*"hawks" + 0.014*"game" + 0.014*"time" + 0.011*"university" + 0.009*"second" + 0.009*"emory" + 0.009*"meter" + 0.009*"said" + 0.008*"senior"
INFO : topic #5 (0.100): 0.019*"house" + 0.012*"said" + 0.011*"website" + 0.011*"think" + 0.010*"program" + 0.010*"students" + 0.009*"emory" + 0.009*"book" + 0.009*"according" + 0.008*"warren"
INFO : topic diff=0.003619, rho=0.182574
INFO : -7.442 per-word bound, 173.9 perplexity estimate based on a held-out corpus of 52 documents with 15236 words
INFO : PROGRESS: pass 29, at document #52/52
INFO : topic #9 (0.100): 0.022*"team" + 0.016*"hawks" + 0.014*"game" + 0.014*"time" + 0.011*"university" + 0.009*"second" + 0.009*"emory" + 0.009*"meter" + 0.008*"said" + 0.008*"senior"
INFO : topic #8 (0.100): 0.015*"said" + 0.012*"students" + 0.012*"student" + 0.012*"lounge" + 0.009*"building" + 0.009*"harris" + 0.009*"class" + 0.009*"hall" + 0.008*"according" + 0.008*"project"
INFO : topic #7 (0.100): 0.041*"film

INFO : topic #0 (0.100): 0.031*"students" + 0.027*"emory" + 0.016*"said" + 0.009*"black" + 0.009*"class" + 0.008*"college" + 0.007*"campus" + 0.007*"americans" + 0.007*"course" + 0.007*"student"
INFO : topic diff=0.001951, rho=0.164399
INFO : -7.440 per-word bound, 173.7 perplexity estimate based on a held-out corpus of 52 documents with 15236 words
INFO : PROGRESS: pass 36, at document #52/52
INFO : topic #6 (0.100): 0.020*"game" + 0.011*"man" + 0.011*"star" + 0.010*"punch" + 0.010*"wars" + 0.009*"vote" + 0.009*"candidate" + 0.007*"students" + 0.007*"new" + 0.007*"like"
INFO : topic #2 (0.100): 0.036*"film" + 0.010*"speech" + 0.008*"free" + 0.008*"films" + 0.008*"like" + 0.007*"hated" + 0.007*"yak" + 0.006*"yik" + 0.006*"authenticity" + 0.005*"great"
INFO : topic #9 (0.100): 0.022*"team" + 0.016*"hawks" + 0.014*"game" + 0.014*"time" + 0.011*"university" + 0.009*"second" + 0.009*"emory" + 0.009*"meter" + 0.008*"said" + 0.008*"senior"
INFO : topic #0 (0.100): 0.031*"students" + 0.027*"e

INFO : topic diff=0.001343, rho=0.150756
INFO : -7.439 per-word bound, 173.6 perplexity estimate based on a held-out corpus of 52 documents with 15236 words
INFO : PROGRESS: pass 43, at document #52/52
INFO : topic #6 (0.100): 0.020*"game" + 0.012*"man" + 0.011*"star" + 0.010*"punch" + 0.010*"wars" + 0.009*"vote" + 0.009*"candidate" + 0.007*"students" + 0.007*"new" + 0.007*"like"
INFO : topic #0 (0.100): 0.031*"students" + 0.027*"emory" + 0.016*"said" + 0.009*"black" + 0.009*"class" + 0.008*"college" + 0.007*"campus" + 0.007*"americans" + 0.007*"course" + 0.007*"student"
INFO : topic #2 (0.100): 0.036*"film" + 0.010*"speech" + 0.008*"free" + 0.008*"films" + 0.008*"like" + 0.007*"hated" + 0.007*"yak" + 0.006*"yik" + 0.006*"authenticity" + 0.005*"great"
INFO : topic #5 (0.100): 0.019*"house" + 0.012*"said" + 0.011*"website" + 0.011*"think" + 0.010*"program" + 0.010*"students" + 0.009*"emory" + 0.009*"book" + 0.009*"according" + 0.008*"going"
INFO : topic #4 (0.100): 0.014*"emory" + 0.013

INFO : topic #6 (0.100): 0.020*"game" + 0.012*"man" + 0.011*"star" + 0.010*"punch" + 0.010*"wars" + 0.009*"vote" + 0.009*"candidate" + 0.007*"students" + 0.007*"new" + 0.007*"like"
INFO : topic #3 (0.100): 0.028*"business" + 0.018*"said" + 0.016*"group" + 0.012*"school" + 0.012*"businesses" + 0.010*"goizueta" + 0.010*"provide" + 0.009*"areas" + 0.009*"help" + 0.009*"trying"
INFO : topic #7 (0.100): 0.041*"film" + 0.014*"story" + 0.008*"films" + 0.008*"like" + 0.008*"time" + 0.007*"best" + 0.006*"year" + 0.006*"way" + 0.006*"character" + 0.005*"ve"
INFO : topic #5 (0.100): 0.019*"house" + 0.012*"said" + 0.011*"website" + 0.011*"think" + 0.010*"program" + 0.010*"students" + 0.009*"emory" + 0.009*"according" + 0.009*"book" + 0.008*"going"
INFO : topic #0 (0.100): 0.031*"students" + 0.027*"emory" + 0.016*"said" + 0.009*"black" + 0.009*"class" + 0.008*"college" + 0.007*"campus" + 0.007*"americans" + 0.007*"course" + 0.007*"student"
INFO : topic diff=0.000977, rho=0.138675
INFO : -7.438 per-

INFO : topic #9 (0.100): 0.023*"team" + 0.016*"hawks" + 0.014*"game" + 0.014*"time" + 0.011*"university" + 0.009*"second" + 0.009*"emory" + 0.009*"meter" + 0.008*"said" + 0.008*"senior"
INFO : topic #7 (0.100): 0.041*"film" + 0.014*"story" + 0.008*"films" + 0.008*"like" + 0.008*"time" + 0.007*"best" + 0.006*"year" + 0.006*"way" + 0.006*"character" + 0.005*"ve"
INFO : topic #3 (0.100): 0.028*"business" + 0.018*"said" + 0.016*"group" + 0.012*"school" + 0.012*"businesses" + 0.010*"goizueta" + 0.010*"provide" + 0.009*"areas" + 0.009*"help" + 0.009*"trying"
INFO : topic diff=0.000764, rho=0.130189
INFO : -7.438 per-word bound, 173.4 perplexity estimate based on a held-out corpus of 52 documents with 15236 words
INFO : PROGRESS: pass 58, at document #52/52
INFO : topic #9 (0.100): 0.023*"team" + 0.016*"hawks" + 0.014*"game" + 0.014*"time" + 0.011*"university" + 0.009*"second" + 0.009*"emory" + 0.009*"meter" + 0.008*"said" + 0.008*"senior"
INFO : topic #5 (0.100): 0.019*"house" + 0.012*"said"

INFO : topic #2 (0.100): 0.036*"film" + 0.010*"speech" + 0.008*"free" + 0.008*"films" + 0.008*"like" + 0.007*"hated" + 0.007*"yak" + 0.006*"yik" + 0.006*"authenticity" + 0.005*"real"
INFO : topic diff=0.000620, rho=0.123091
INFO : -7.437 per-word bound, 173.3 perplexity estimate based on a held-out corpus of 52 documents with 15236 words
INFO : PROGRESS: pass 65, at document #52/52
INFO : topic #2 (0.100): 0.036*"film" + 0.010*"speech" + 0.008*"free" + 0.008*"films" + 0.008*"like" + 0.007*"hated" + 0.007*"yak" + 0.006*"yik" + 0.006*"authenticity" + 0.005*"director"
INFO : topic #8 (0.100): 0.015*"said" + 0.012*"students" + 0.012*"student" + 0.012*"lounge" + 0.010*"building" + 0.009*"harris" + 0.009*"hall" + 0.009*"class" + 0.008*"according" + 0.008*"project"
INFO : topic #0 (0.100): 0.031*"students" + 0.027*"emory" + 0.016*"said" + 0.009*"black" + 0.009*"class" + 0.008*"college" + 0.007*"campus" + 0.007*"americans" + 0.007*"course" + 0.007*"student"
INFO : topic #5 (0.100): 0.019*"hous

INFO : topic diff=0.000512, rho=0.117041
INFO : -7.437 per-word bound, 173.3 perplexity estimate based on a held-out corpus of 52 documents with 15236 words
INFO : PROGRESS: pass 72, at document #52/52
INFO : topic #5 (0.100): 0.019*"house" + 0.012*"said" + 0.011*"website" + 0.011*"think" + 0.010*"program" + 0.010*"students" + 0.009*"emory" + 0.009*"according" + 0.009*"book" + 0.008*"going"
INFO : topic #1 (0.100): 0.021*"eagles" + 0.015*"game" + 0.012*"points" + 0.010*"jan" + 0.009*"lead" + 0.009*"people" + 0.008*"think" + 0.008*"know" + 0.008*"team" + 0.008*"point"
INFO : topic #6 (0.100): 0.020*"game" + 0.012*"man" + 0.011*"star" + 0.010*"punch" + 0.010*"wars" + 0.009*"vote" + 0.009*"candidate" + 0.007*"students" + 0.007*"new" + 0.007*"like"
INFO : topic #0 (0.100): 0.031*"students" + 0.027*"emory" + 0.016*"said" + 0.009*"black" + 0.009*"class" + 0.008*"college" + 0.007*"campus" + 0.007*"americans" + 0.007*"course" + 0.007*"student"
INFO : topic #2 (0.100): 0.036*"film" + 0.010*"spe

Wall time: 14.9 s


In [29]:
%time lda_model_fa2016 = gensim.models.LdaModel(fa2016_corpus, num_topics=10, id2word=id2word_fa2016, passes=75)

INFO : using symmetric alpha at 0.1
INFO : using symmetric eta at 0.1
INFO : using serial LDA version on this node
INFO : running online (multi-pass) LDA training, 10 topics, 75 passes over the supplied corpus of 106 documents, updating model once every 106 documents, evaluating perplexity every 106 documents, iterating 50x with a convergence threshold of 0.001000
INFO : -13.277 per-word bound, 9927.5 perplexity estimate based on a held-out corpus of 106 documents with 37345 words
INFO : PROGRESS: pass 0, at document #106/106
INFO : topic #6 (0.100): 0.006*"emory" + 0.005*"said" + 0.004*"students" + 0.003*"kumar" + 0.003*"people" + 0.003*"new" + 0.002*"life" + 0.002*"like" + 0.002*"time" + 0.002*"game"
INFO : topic #4 (0.100): 0.012*"said" + 0.008*"emory" + 0.005*"students" + 0.004*"year" + 0.004*"people" + 0.003*"college" + 0.003*"time" + 0.003*"university" + 0.003*"team" + 0.003*"campus"
INFO : topic #0 (0.100): 0.009*"emory" + 0.007*"said" + 0.005*"college" + 0.004*"new" + 0.003*"un

INFO : topic #5 (0.100): 0.007*"said" + 0.006*"emory" + 0.005*"lutz" + 0.004*"ocean" + 0.004*"new" + 0.004*"time" + 0.004*"year" + 0.004*"university" + 0.004*"people" + 0.004*"students"
INFO : topic diff=0.187106, rho=0.353553
INFO : -8.735 per-word bound, 426.0 perplexity estimate based on a held-out corpus of 106 documents with 37345 words
INFO : PROGRESS: pass 7, at document #106/106
INFO : topic #9 (0.100): 0.011*"emory" + 0.007*"said" + 0.007*"good" + 0.005*"according" + 0.004*"apple" + 0.004*"hong" + 0.004*"kong" + 0.004*"duo" + 0.003*"students" + 0.003*"people"
INFO : topic #5 (0.100): 0.007*"said" + 0.006*"emory" + 0.005*"lutz" + 0.005*"ocean" + 0.004*"new" + 0.004*"time" + 0.004*"year" + 0.004*"university" + 0.004*"people" + 0.004*"students"
INFO : topic #7 (0.100): 0.006*"state" + 0.006*"distefano" + 0.005*"audience" + 0.004*"virtue" + 0.004*"greater" + 0.004*"emory" + 0.003*"student" + 0.003*"sept" + 0.003*"epd" + 0.003*"performance"
INFO : topic #6 (0.100): 0.011*"kumar" + 

INFO : topic diff=0.017012, rho=0.258199
INFO : -8.713 per-word bound, 419.7 perplexity estimate based on a held-out corpus of 106 documents with 37345 words
INFO : PROGRESS: pass 14, at document #106/106
INFO : topic #2 (0.100): 0.010*"said" + 0.010*"emory" + 0.008*"chai" + 0.008*"students" + 0.005*"kaldi" + 0.005*"meal" + 0.005*"house" + 0.005*"chang" + 0.004*"student" + 0.004*"food"
INFO : topic #6 (0.100): 0.011*"kumar" + 0.006*"game" + 0.006*"said" + 0.005*"new" + 0.005*"film" + 0.004*"parents" + 0.004*"india" + 0.004*"life" + 0.004*"emory" + 0.004*"want"
INFO : topic #5 (0.100): 0.007*"said" + 0.006*"emory" + 0.005*"lutz" + 0.005*"ocean" + 0.004*"new" + 0.004*"time" + 0.004*"year" + 0.004*"university" + 0.004*"people" + 0.004*"students"
INFO : topic #8 (0.100): 0.012*"emory" + 0.011*"eagles" + 0.009*"team" + 0.007*"said" + 0.005*"university" + 0.005*"weekend" + 0.005*"college" + 0.004*"people" + 0.004*"game" + 0.004*"year"
INFO : topic #7 (0.100): 0.006*"state" + 0.006*"distefano

INFO : topic #0 (0.100): 0.018*"emory" + 0.009*"football" + 0.006*"college" + 0.005*"trump" + 0.005*"said" + 0.005*"sterk" + 0.005*"people" + 0.005*"university" + 0.005*"new" + 0.004*"sports"
INFO : topic #7 (0.100): 0.006*"state" + 0.006*"distefano" + 0.005*"audience" + 0.004*"virtue" + 0.004*"greater" + 0.003*"student" + 0.003*"emory" + 0.003*"performance" + 0.003*"dun" + 0.003*"good"
INFO : topic #5 (0.100): 0.007*"said" + 0.006*"emory" + 0.005*"lutz" + 0.005*"ocean" + 0.004*"new" + 0.004*"time" + 0.004*"year" + 0.004*"university" + 0.004*"people" + 0.004*"students"
INFO : topic #3 (0.100): 0.016*"emory" + 0.010*"said" + 0.006*"university" + 0.004*"campus" + 0.004*"individual" + 0.004*"students" + 0.004*"student" + 0.004*"mandl" + 0.004*"kaldi" + 0.003*"koval"
INFO : topic #2 (0.100): 0.010*"said" + 0.010*"emory" + 0.008*"chai" + 0.008*"students" + 0.005*"kaldi" + 0.005*"meal" + 0.005*"house" + 0.005*"chang" + 0.004*"student" + 0.004*"food"
INFO : topic diff=0.002615, rho=0.208514
I

INFO : topic #8 (0.100): 0.012*"emory" + 0.011*"eagles" + 0.010*"team" + 0.008*"said" + 0.005*"university" + 0.005*"weekend" + 0.005*"college" + 0.004*"game" + 0.004*"season" + 0.004*"people"
INFO : topic #2 (0.100): 0.010*"said" + 0.010*"emory" + 0.008*"chai" + 0.008*"students" + 0.005*"kaldi" + 0.005*"meal" + 0.005*"house" + 0.005*"chang" + 0.004*"student" + 0.004*"food"
INFO : topic #0 (0.100): 0.018*"emory" + 0.009*"football" + 0.006*"college" + 0.005*"trump" + 0.005*"said" + 0.005*"sterk" + 0.005*"people" + 0.005*"university" + 0.005*"new" + 0.004*"sports"
INFO : topic #4 (0.100): 0.011*"said" + 0.008*"emory" + 0.006*"year" + 0.005*"students" + 0.004*"people" + 0.004*"season" + 0.004*"spc" + 0.004*"college" + 0.004*"rehab" + 0.004*"team"
INFO : topic diff=0.001003, rho=0.182574
INFO : -8.708 per-word bound, 418.2 perplexity estimate based on a held-out corpus of 106 documents with 37345 words
INFO : PROGRESS: pass 29, at document #106/106
INFO : topic #8 (0.100): 0.012*"emory" + 0

INFO : topic #0 (0.100): 0.018*"emory" + 0.009*"football" + 0.006*"college" + 0.005*"trump" + 0.005*"said" + 0.005*"sterk" + 0.005*"people" + 0.005*"university" + 0.005*"new" + 0.004*"sports"
INFO : topic #6 (0.100): 0.011*"kumar" + 0.006*"game" + 0.006*"said" + 0.005*"new" + 0.005*"film" + 0.004*"parents" + 0.004*"india" + 0.004*"life" + 0.004*"emory" + 0.004*"want"
INFO : topic #2 (0.100): 0.010*"said" + 0.010*"emory" + 0.008*"chai" + 0.008*"students" + 0.005*"kaldi" + 0.005*"meal" + 0.005*"house" + 0.005*"chang" + 0.005*"student" + 0.004*"food"
INFO : topic diff=0.000589, rho=0.164399
INFO : -8.707 per-word bound, 417.9 perplexity estimate based on a held-out corpus of 106 documents with 37345 words
INFO : PROGRESS: pass 36, at document #106/106
INFO : topic #8 (0.100): 0.012*"emory" + 0.011*"eagles" + 0.010*"team" + 0.008*"said" + 0.005*"university" + 0.005*"college" + 0.005*"weekend" + 0.004*"season" + 0.004*"game" + 0.004*"people"
INFO : topic #5 (0.100): 0.007*"said" + 0.006*"em

INFO : topic #5 (0.100): 0.007*"said" + 0.006*"emory" + 0.005*"lutz" + 0.005*"ocean" + 0.004*"new" + 0.004*"time" + 0.004*"year" + 0.004*"university" + 0.004*"people" + 0.004*"students"
INFO : topic #0 (0.100): 0.018*"emory" + 0.009*"football" + 0.006*"college" + 0.005*"trump" + 0.005*"said" + 0.005*"sterk" + 0.005*"people" + 0.005*"university" + 0.005*"new" + 0.004*"sports"
INFO : topic diff=0.000418, rho=0.150756
INFO : -8.707 per-word bound, 417.8 perplexity estimate based on a held-out corpus of 106 documents with 37345 words
INFO : PROGRESS: pass 43, at document #106/106
INFO : topic #6 (0.100): 0.011*"kumar" + 0.006*"game" + 0.006*"said" + 0.005*"new" + 0.005*"film" + 0.004*"parents" + 0.004*"india" + 0.004*"life" + 0.004*"emory" + 0.004*"want"
INFO : topic #7 (0.100): 0.006*"state" + 0.006*"distefano" + 0.005*"audience" + 0.004*"virtue" + 0.004*"greater" + 0.003*"performance" + 0.003*"dun" + 0.003*"good" + 0.003*"emory" + 0.003*"student"
INFO : topic #2 (0.100): 0.010*"said" + 0

INFO : topic #9 (0.100): 0.010*"emory" + 0.007*"said" + 0.007*"good" + 0.005*"according" + 0.004*"apple" + 0.004*"hong" + 0.004*"duo" + 0.004*"kong" + 0.003*"students" + 0.003*"people"
INFO : topic diff=0.000322, rho=0.140028
INFO : -8.706 per-word bound, 417.7 perplexity estimate based on a held-out corpus of 106 documents with 37345 words
INFO : PROGRESS: pass 50, at document #106/106
INFO : topic #3 (0.100): 0.016*"emory" + 0.009*"said" + 0.006*"university" + 0.004*"campus" + 0.004*"individual" + 0.004*"student" + 0.004*"students" + 0.004*"mandl" + 0.004*"kaldi" + 0.003*"koval"
INFO : topic #8 (0.100): 0.012*"emory" + 0.011*"eagles" + 0.011*"team" + 0.008*"said" + 0.005*"university" + 0.005*"college" + 0.005*"weekend" + 0.004*"season" + 0.004*"game" + 0.004*"people"
INFO : topic #0 (0.100): 0.018*"emory" + 0.009*"football" + 0.006*"college" + 0.005*"trump" + 0.005*"said" + 0.005*"sterk" + 0.005*"people" + 0.005*"university" + 0.005*"new" + 0.004*"sports"
INFO : topic #9 (0.100): 0.0

INFO : topic diff=0.000256, rho=0.131306
INFO : -8.706 per-word bound, 417.6 perplexity estimate based on a held-out corpus of 106 documents with 37345 words
INFO : PROGRESS: pass 57, at document #106/106
INFO : topic #4 (0.100): 0.011*"said" + 0.007*"emory" + 0.005*"year" + 0.005*"students" + 0.004*"people" + 0.004*"spc" + 0.004*"season" + 0.004*"rehab" + 0.004*"college" + 0.004*"fernández"
INFO : topic #8 (0.100): 0.012*"emory" + 0.011*"eagles" + 0.011*"team" + 0.008*"said" + 0.005*"university" + 0.005*"college" + 0.005*"weekend" + 0.004*"season" + 0.004*"game" + 0.004*"year"
INFO : topic #6 (0.100): 0.011*"kumar" + 0.006*"game" + 0.006*"said" + 0.005*"film" + 0.005*"new" + 0.004*"parents" + 0.004*"india" + 0.004*"life" + 0.004*"emory" + 0.004*"want"
INFO : topic #3 (0.100): 0.016*"emory" + 0.009*"said" + 0.006*"university" + 0.004*"campus" + 0.004*"individual" + 0.004*"student" + 0.004*"students" + 0.004*"mandl" + 0.004*"kaldi" + 0.003*"koval"
INFO : topic #5 (0.100): 0.007*"said" +

INFO : PROGRESS: pass 64, at document #106/106
INFO : topic #3 (0.100): 0.016*"emory" + 0.009*"said" + 0.006*"university" + 0.004*"campus" + 0.004*"individual" + 0.004*"student" + 0.004*"students" + 0.004*"mandl" + 0.004*"kaldi" + 0.003*"koval"
INFO : topic #0 (0.100): 0.018*"emory" + 0.009*"football" + 0.006*"college" + 0.005*"trump" + 0.005*"said" + 0.005*"sterk" + 0.005*"people" + 0.005*"university" + 0.005*"new" + 0.004*"sports"
INFO : topic #8 (0.100): 0.012*"emory" + 0.011*"eagles" + 0.011*"team" + 0.008*"said" + 0.005*"university" + 0.005*"college" + 0.005*"weekend" + 0.004*"season" + 0.004*"game" + 0.004*"year"
INFO : topic #1 (0.100): 0.006*"students" + 0.006*"staples" + 0.006*"emory" + 0.005*"scott" + 0.005*"campus" + 0.004*"safe" + 0.004*"need" + 0.004*"album" + 0.004*"student" + 0.004*"chen"
INFO : topic #7 (0.100): 0.006*"state" + 0.006*"distefano" + 0.005*"audience" + 0.004*"greater" + 0.004*"virtue" + 0.003*"good" + 0.003*"performance" + 0.003*"dun" + 0.003*"pilots" + 0.

INFO : topic #8 (0.100): 0.012*"emory" + 0.011*"eagles" + 0.011*"team" + 0.008*"said" + 0.005*"university" + 0.005*"college" + 0.005*"weekend" + 0.004*"season" + 0.004*"game" + 0.004*"year"
INFO : topic #3 (0.100): 0.016*"emory" + 0.009*"said" + 0.006*"university" + 0.004*"campus" + 0.004*"individual" + 0.004*"student" + 0.004*"students" + 0.004*"mandl" + 0.004*"kaldi" + 0.003*"koval"
INFO : topic #9 (0.100): 0.010*"emory" + 0.007*"said" + 0.007*"good" + 0.005*"according" + 0.004*"apple" + 0.004*"hong" + 0.004*"duo" + 0.004*"kong" + 0.003*"students" + 0.003*"people"
INFO : topic #2 (0.100): 0.010*"said" + 0.010*"emory" + 0.008*"chai" + 0.008*"students" + 0.005*"kaldi" + 0.005*"house" + 0.005*"meal" + 0.005*"student" + 0.005*"chang" + 0.004*"epd"
INFO : topic diff=0.000161, rho=0.117041
INFO : -8.706 per-word bound, 417.5 perplexity estimate based on a held-out corpus of 106 documents with 37345 words
INFO : PROGRESS: pass 72, at document #106/106
INFO : topic #0 (0.100): 0.018*"emory" 

Wall time: 29 s


In [31]:
%time lda_model_sp2017 = gensim.models.LdaModel(sp2017_corpus, num_topics=10, id2word=id2word_sp2017, passes=75)

INFO : using symmetric alpha at 0.1
INFO : using symmetric eta at 0.1
INFO : using serial LDA version on this node
INFO : running online (multi-pass) LDA training, 10 topics, 75 passes over the supplied corpus of 66 documents, updating model once every 66 documents, evaluating perplexity every 66 documents, iterating 50x with a convergence threshold of 0.001000
INFO : -10.298 per-word bound, 1259.2 perplexity estimate based on a held-out corpus of 66 documents with 15454 words
INFO : PROGRESS: pass 0, at document #66/66
INFO : topic #3 (0.100): 0.010*"emory" + 0.008*"time" + 0.007*"said" + 0.006*"women" + 0.006*"junior" + 0.006*"atlanta" + 0.005*"took" + 0.005*"eagles" + 0.005*"people" + 0.005*"team"
INFO : topic #1 (0.100): 0.011*"trump" + 0.009*"said" + 0.008*"emory" + 0.006*"students" + 0.005*"team" + 0.005*"public" + 0.004*"schools" + 0.004*"rights" + 0.004*"participation" + 0.004*"jan"
INFO : topic #2 (0.100): 0.025*"emory" + 0.014*"students" + 0.012*"university" + 0.011*"said" + 

INFO : topic #6 (0.100): 0.033*"epd" + 0.032*"jan" + 0.021*"responded" + 0.020*"emory" + 0.015*"assigned" + 0.014*"student" + 0.014*"case" + 0.013*"reported" + 0.010*"center" + 0.010*"officers"
INFO : topic diff=0.276935, rho=0.353553
INFO : -7.405 per-word bound, 169.5 perplexity estimate based on a held-out corpus of 66 documents with 15454 words
INFO : PROGRESS: pass 7, at document #66/66
INFO : topic #2 (0.100): 0.027*"emory" + 0.019*"students" + 0.015*"said" + 0.013*"sga" + 0.013*"president" + 0.012*"student" + 0.010*"university" + 0.008*"graduate" + 0.007*"trump" + 0.007*"campus"
INFO : topic #3 (0.100): 0.019*"time" + 0.015*"junior" + 0.013*"took" + 0.010*"emory" + 0.009*"placed" + 0.009*"events" + 0.009*"team" + 0.008*"sophomore" + 0.008*"women" + 0.008*"people"
INFO : topic #7 (0.100): 0.018*"emory" + 0.014*"team" + 0.013*"eagles" + 0.011*"game" + 0.010*"time" + 0.010*"said" + 0.010*"university" + 0.009*"lead" + 0.008*"quarter" + 0.008*"forward"
INFO : topic #8 (0.100): 0.013*

INFO : topic diff=0.037573, rho=0.258199
INFO : -7.361 per-word bound, 164.4 perplexity estimate based on a held-out corpus of 66 documents with 15454 words
INFO : PROGRESS: pass 14, at document #66/66
INFO : topic #7 (0.100): 0.017*"emory" + 0.014*"team" + 0.014*"eagles" + 0.012*"game" + 0.011*"said" + 0.010*"time" + 0.010*"university" + 0.009*"lead" + 0.008*"quarter" + 0.008*"forward"
INFO : topic #2 (0.100): 0.027*"emory" + 0.019*"students" + 0.015*"said" + 0.013*"sga" + 0.013*"president" + 0.012*"student" + 0.011*"university" + 0.008*"graduate" + 0.007*"campus" + 0.007*"government"
INFO : topic #0 (0.100): 0.011*"time" + 0.011*"life" + 0.010*"day" + 0.010*"like" + 0.009*"people" + 0.009*"emory" + 0.009*"ew" + 0.007*"sun" + 0.007*"come" + 0.006*"way"
INFO : topic #4 (0.100): 0.013*"love" + 0.011*"students" + 0.009*"album" + 0.009*"drop" + 0.009*"swap" + 0.009*"film" + 0.009*"add" + 0.006*"student" + 0.006*"time" + 0.005*"music"
INFO : topic #9 (0.100): 0.014*"album" + 0.008*"like" +

INFO : -7.351 per-word bound, 163.2 perplexity estimate based on a held-out corpus of 66 documents with 15454 words
INFO : PROGRESS: pass 21, at document #66/66
INFO : topic #8 (0.100): 0.014*"immigration" + 0.013*"image" + 0.011*"rochester" + 0.011*"immigrant" + 0.011*"trump" + 0.011*"games" + 0.010*"americans" + 0.010*"american" + 0.008*"america" + 0.008*"immigrants"
INFO : topic #7 (0.100): 0.017*"emory" + 0.015*"eagles" + 0.015*"team" + 0.012*"game" + 0.011*"said" + 0.010*"time" + 0.010*"university" + 0.009*"lead" + 0.009*"quarter" + 0.008*"forward"
INFO : topic #9 (0.100): 0.014*"album" + 0.008*"like" + 0.008*"man" + 0.008*"song" + 0.007*"character" + 0.006*"film" + 0.005*"makes" + 0.005*"series" + 0.005*"fans" + 0.005*"audience"
INFO : topic #3 (0.100): 0.021*"time" + 0.015*"junior" + 0.014*"took" + 0.010*"placed" + 0.010*"events" + 0.009*"team" + 0.009*"emory" + 0.009*"sophomore" + 0.009*"women" + 0.009*"freestyle"
INFO : topic #0 (0.100): 0.011*"time" + 0.011*"life" + 0.010*"da

INFO : PROGRESS: pass 28, at document #66/66
INFO : topic #7 (0.100): 0.017*"emory" + 0.016*"eagles" + 0.015*"team" + 0.012*"game" + 0.011*"said" + 0.010*"time" + 0.010*"university" + 0.009*"lead" + 0.009*"quarter" + 0.008*"points"
INFO : topic #6 (0.100): 0.035*"epd" + 0.033*"jan" + 0.022*"responded" + 0.020*"emory" + 0.015*"assigned" + 0.015*"student" + 0.014*"case" + 0.014*"reported" + 0.010*"center" + 0.010*"officers"
INFO : topic #0 (0.100): 0.011*"time" + 0.011*"life" + 0.010*"day" + 0.010*"like" + 0.010*"ew" + 0.010*"people" + 0.009*"emory" + 0.007*"come" + 0.007*"sun" + 0.006*"way"
INFO : topic #2 (0.100): 0.028*"emory" + 0.019*"students" + 0.015*"said" + 0.013*"sga" + 0.013*"president" + 0.012*"student" + 0.011*"university" + 0.008*"graduate" + 0.007*"campus" + 0.007*"government"
INFO : topic #5 (0.100): 0.015*"said" + 0.014*"center" + 0.012*"emory" + 0.010*"year" + 0.009*"director" + 0.009*"arts" + 0.008*"film" + 0.007*"vaught" + 0.007*"brooks" + 0.007*"patients"
INFO : topic

INFO : topic #5 (0.100): 0.015*"said" + 0.014*"center" + 0.012*"emory" + 0.011*"year" + 0.009*"director" + 0.009*"arts" + 0.008*"film" + 0.007*"vaught" + 0.007*"brooks" + 0.007*"patients"
INFO : topic #9 (0.100): 0.014*"album" + 0.008*"like" + 0.007*"man" + 0.007*"song" + 0.007*"character" + 0.006*"film" + 0.005*"makes" + 0.005*"series" + 0.005*"fans" + 0.005*"audience"
INFO : topic #7 (0.100): 0.017*"emory" + 0.016*"eagles" + 0.015*"team" + 0.012*"game" + 0.011*"said" + 0.010*"time" + 0.010*"university" + 0.009*"lead" + 0.009*"quarter" + 0.008*"forward"
INFO : topic #4 (0.100): 0.013*"love" + 0.011*"students" + 0.009*"album" + 0.009*"drop" + 0.009*"swap" + 0.009*"film" + 0.009*"add" + 0.006*"student" + 0.006*"time" + 0.005*"music"
INFO : topic diff=0.002821, rho=0.164399
INFO : -7.343 per-word bound, 162.3 perplexity estimate based on a held-out corpus of 66 documents with 15454 words
INFO : PROGRESS: pass 36, at document #66/66
INFO : topic #9 (0.100): 0.014*"album" + 0.008*"like" + 

INFO : topic #1 (0.100): 0.023*"trump" + 0.016*"women" + 0.013*"rights" + 0.011*"participation" + 0.011*"public" + 0.010*"schools" + 0.009*"march" + 0.006*"civil" + 0.006*"education" + 0.005*"losing"
INFO : topic #7 (0.100): 0.017*"emory" + 0.016*"eagles" + 0.015*"team" + 0.013*"game" + 0.011*"said" + 0.010*"university" + 0.009*"lead" + 0.009*"time" + 0.009*"quarter" + 0.009*"forward"
INFO : topic #0 (0.100): 0.011*"time" + 0.011*"life" + 0.010*"day" + 0.010*"like" + 0.010*"ew" + 0.010*"people" + 0.009*"emory" + 0.007*"come" + 0.007*"sun" + 0.006*"way"
INFO : topic diff=0.001959, rho=0.150756
INFO : -7.341 per-word bound, 162.1 perplexity estimate based on a held-out corpus of 66 documents with 15454 words
INFO : PROGRESS: pass 43, at document #66/66
INFO : topic #9 (0.100): 0.014*"album" + 0.008*"like" + 0.007*"man" + 0.007*"song" + 0.007*"character" + 0.006*"film" + 0.005*"makes" + 0.005*"series" + 0.005*"fans" + 0.005*"audience"
INFO : topic #7 (0.100): 0.017*"emory" + 0.016*"eagles

INFO : topic #1 (0.100): 0.023*"trump" + 0.016*"women" + 0.013*"rights" + 0.011*"participation" + 0.011*"public" + 0.010*"march" + 0.010*"schools" + 0.006*"civil" + 0.006*"education" + 0.005*"losing"
INFO : topic #9 (0.100): 0.014*"album" + 0.008*"like" + 0.007*"man" + 0.007*"song" + 0.007*"character" + 0.006*"film" + 0.005*"makes" + 0.005*"series" + 0.005*"fans" + 0.005*"audience"
INFO : topic diff=0.001627, rho=0.140028
INFO : -7.339 per-word bound, 161.9 perplexity estimate based on a held-out corpus of 66 documents with 15454 words
INFO : PROGRESS: pass 50, at document #66/66
INFO : topic #2 (0.100): 0.028*"emory" + 0.019*"students" + 0.015*"said" + 0.013*"sga" + 0.013*"president" + 0.012*"student" + 0.011*"university" + 0.008*"graduate" + 0.007*"campus" + 0.007*"government"
INFO : topic #9 (0.100): 0.014*"album" + 0.008*"like" + 0.007*"man" + 0.007*"song" + 0.007*"character" + 0.006*"film" + 0.005*"makes" + 0.005*"series" + 0.005*"fans" + 0.005*"audience"
INFO : topic #5 (0.100): 

INFO : topic #4 (0.100): 0.013*"love" + 0.011*"students" + 0.009*"drop" + 0.009*"album" + 0.009*"swap" + 0.009*"film" + 0.009*"add" + 0.006*"student" + 0.006*"time" + 0.005*"course"
INFO : topic diff=0.001413, rho=0.131306
INFO : -7.338 per-word bound, 161.7 perplexity estimate based on a held-out corpus of 66 documents with 15454 words
INFO : PROGRESS: pass 57, at document #66/66
INFO : topic #6 (0.100): 0.035*"epd" + 0.033*"jan" + 0.022*"responded" + 0.020*"emory" + 0.015*"assigned" + 0.015*"reported" + 0.015*"case" + 0.014*"student" + 0.010*"center" + 0.010*"officers"
INFO : topic #5 (0.100): 0.016*"said" + 0.014*"center" + 0.012*"emory" + 0.011*"year" + 0.009*"director" + 0.009*"arts" + 0.008*"film" + 0.007*"brooks" + 0.007*"vaught" + 0.007*"patients"
INFO : topic #1 (0.100): 0.023*"trump" + 0.016*"women" + 0.013*"rights" + 0.011*"participation" + 0.011*"public" + 0.010*"march" + 0.010*"schools" + 0.006*"civil" + 0.006*"education" + 0.005*"losing"
INFO : topic #3 (0.100): 0.026*"ti

INFO : topic #7 (0.100): 0.018*"emory" + 0.017*"eagles" + 0.015*"team" + 0.013*"game" + 0.011*"said" + 0.010*"university" + 0.010*"lead" + 0.009*"quarter" + 0.009*"forward" + 0.009*"points"
INFO : topic diff=0.001222, rho=0.124035
INFO : -7.336 per-word bound, 161.6 perplexity estimate based on a held-out corpus of 66 documents with 15454 words
INFO : PROGRESS: pass 64, at document #66/66
INFO : topic #3 (0.100): 0.027*"time" + 0.016*"junior" + 0.015*"took" + 0.013*"freestyle" + 0.011*"women" + 0.011*"placed" + 0.011*"sophomore" + 0.010*"senior" + 0.010*"place" + 0.010*"team"
INFO : topic #7 (0.100): 0.018*"emory" + 0.017*"eagles" + 0.015*"team" + 0.013*"game" + 0.011*"said" + 0.010*"university" + 0.010*"lead" + 0.009*"quarter" + 0.009*"forward" + 0.009*"points"
INFO : topic #0 (0.100): 0.011*"time" + 0.011*"life" + 0.010*"day" + 0.010*"like" + 0.010*"ew" + 0.010*"people" + 0.009*"emory" + 0.007*"come" + 0.007*"sun" + 0.006*"way"
INFO : topic #8 (0.100): 0.015*"immigration" + 0.013*"im

INFO : topic diff=0.001027, rho=0.117851
INFO : -7.335 per-word bound, 161.4 perplexity estimate based on a held-out corpus of 66 documents with 15454 words
INFO : PROGRESS: pass 71, at document #66/66
INFO : topic #4 (0.100): 0.013*"love" + 0.011*"students" + 0.009*"drop" + 0.009*"swap" + 0.009*"album" + 0.009*"film" + 0.009*"add" + 0.006*"student" + 0.006*"time" + 0.005*"course"
INFO : topic #5 (0.100): 0.016*"said" + 0.014*"center" + 0.012*"emory" + 0.011*"year" + 0.009*"director" + 0.009*"arts" + 0.008*"film" + 0.007*"brooks" + 0.007*"vaught" + 0.007*"patients"
INFO : topic #9 (0.100): 0.014*"album" + 0.008*"like" + 0.007*"man" + 0.007*"song" + 0.007*"character" + 0.006*"film" + 0.005*"makes" + 0.005*"series" + 0.005*"fans" + 0.005*"audience"
INFO : topic #2 (0.100): 0.028*"emory" + 0.019*"students" + 0.015*"said" + 0.013*"sga" + 0.013*"president" + 0.012*"student" + 0.011*"university" + 0.008*"graduate" + 0.007*"campus" + 0.007*"government"
INFO : topic #1 (0.100): 0.023*"trump" +

Wall time: 13.5 s


In [32]:
%time lda_model_fa2017 = gensim.models.LdaModel(fa2017_corpus, num_topics=10, id2word=id2word_fa2017, passes=75)

INFO : using symmetric alpha at 0.1
INFO : using symmetric eta at 0.1
INFO : using serial LDA version on this node
INFO : running online (multi-pass) LDA training, 10 topics, 75 passes over the supplied corpus of 135 documents, updating model once every 135 documents, evaluating perplexity every 135 documents, iterating 50x with a convergence threshold of 0.001000
INFO : -13.261 per-word bound, 9817.0 perplexity estimate based on a held-out corpus of 135 documents with 45641 words
INFO : PROGRESS: pass 0, at document #135/135
INFO : topic #9 (0.100): 0.014*"emory" + 0.008*"said" + 0.005*"students" + 0.005*"sept" + 0.004*"university" + 0.004*"according" + 0.003*"college" + 0.003*"daca" + 0.003*"campus" + 0.003*"life"
INFO : topic #2 (0.100): 0.008*"said" + 0.007*"emory" + 0.006*"trump" + 0.003*"people" + 0.003*"game" + 0.003*"daca" + 0.003*"university" + 0.003*"president" + 0.003*"students" + 0.003*"according"
INFO : topic #1 (0.100): 0.007*"emory" + 0.005*"said" + 0.004*"students" + 0.

INFO : topic #6 (0.100): 0.008*"emory" + 0.005*"sept" + 0.005*"year" + 0.005*"students" + 0.004*"officers" + 0.004*"epd" + 0.004*"student" + 0.004*"responded" + 0.004*"said" + 0.004*"hope"
INFO : topic diff=0.190143, rho=0.353553
INFO : -8.866 per-word bound, 466.4 perplexity estimate based on a held-out corpus of 135 documents with 45641 words
INFO : PROGRESS: pass 7, at document #135/135
INFO : topic #8 (0.100): 0.008*"said" + 0.008*"emory" + 0.006*"eagles" + 0.005*"women" + 0.005*"team" + 0.004*"season" + 0.003*"game" + 0.003*"university" + 0.003*"sept" + 0.003*"goal"
INFO : topic #7 (0.100): 0.007*"emory" + 0.004*"bar" + 0.003*"campus" + 0.003*"like" + 0.003*"food" + 0.003*"university" + 0.003*"students" + 0.003*"building" + 0.003*"new" + 0.003*"schultz"
INFO : topic #3 (0.100): 0.007*"emory" + 0.005*"trump" + 0.005*"students" + 0.004*"speech" + 0.004*"class" + 0.003*"group" + 0.003*"major" + 0.003*"pace" + 0.003*"like" + 0.003*"time"
INFO : topic #4 (0.100): 0.012*"emory" + 0.012*

INFO : topic diff=0.020557, rho=0.258199
INFO : -8.835 per-word bound, 456.6 perplexity estimate based on a held-out corpus of 135 documents with 45641 words
INFO : PROGRESS: pass 14, at document #135/135
INFO : topic #7 (0.100): 0.006*"emory" + 0.004*"bar" + 0.003*"campus" + 0.003*"food" + 0.003*"like" + 0.003*"building" + 0.003*"university" + 0.003*"students" + 0.003*"new" + 0.003*"schultz"
INFO : topic #5 (0.100): 0.008*"said" + 0.007*"barankitse" + 0.006*"emory" + 0.005*"runners" + 0.005*"ung" + 0.005*"film" + 0.004*"sga" + 0.003*"oxford" + 0.003*"jin" + 0.003*"saturday"
INFO : topic #0 (0.100): 0.007*"game" + 0.005*"play" + 0.004*"film" + 0.003*"anthony" + 0.003*"like" + 0.003*"pizza" + 0.003*"kingsman" + 0.003*"fantasy" + 0.003*"trade" + 0.003*"football"
INFO : topic #1 (0.100): 0.008*"team" + 0.007*"eagles" + 0.006*"emory" + 0.006*"speech" + 0.005*"game" + 0.005*"atlanta" + 0.005*"said" + 0.005*"season" + 0.004*"home" + 0.004*"sept"
INFO : topic #2 (0.100): 0.008*"trump" + 0.008

INFO : topic #2 (0.100): 0.008*"trump" + 0.007*"daca" + 0.007*"said" + 0.005*"swift" + 0.004*"emory" + 0.004*"president" + 0.004*"people" + 0.004*"warhol" + 0.004*"carter" + 0.004*"life"
INFO : topic #6 (0.100): 0.008*"emory" + 0.005*"sept" + 0.005*"year" + 0.005*"students" + 0.005*"officers" + 0.005*"epd" + 0.005*"student" + 0.004*"responded" + 0.004*"hope" + 0.004*"said"
INFO : topic #0 (0.100): 0.007*"game" + 0.005*"play" + 0.004*"film" + 0.003*"anthony" + 0.003*"like" + 0.003*"pizza" + 0.003*"kingsman" + 0.003*"fantasy" + 0.003*"trade" + 0.003*"football"
INFO : topic #8 (0.100): 0.008*"said" + 0.008*"emory" + 0.007*"eagles" + 0.006*"team" + 0.006*"women" + 0.004*"season" + 0.004*"sept" + 0.004*"university" + 0.004*"game" + 0.003*"goal"
INFO : topic #7 (0.100): 0.006*"emory" + 0.004*"food" + 0.004*"bar" + 0.003*"campus" + 0.003*"like" + 0.003*"building" + 0.003*"students" + 0.003*"university" + 0.003*"new" + 0.003*"schultz"
INFO : topic diff=0.003895, rho=0.208514
INFO : -8.829 per-

INFO : topic #0 (0.100): 0.007*"game" + 0.005*"play" + 0.004*"film" + 0.003*"anthony" + 0.003*"like" + 0.003*"pizza" + 0.003*"kingsman" + 0.003*"fantasy" + 0.003*"trade" + 0.003*"football"
INFO : topic #4 (0.100): 0.014*"according" + 0.013*"emory" + 0.007*"university" + 0.007*"campus" + 0.007*"irma" + 0.007*"monday" + 0.006*"atlanta" + 0.005*"students" + 0.005*"georgia" + 0.004*"power"
INFO : topic #5 (0.100): 0.008*"said" + 0.007*"barankitse" + 0.006*"emory" + 0.005*"runners" + 0.005*"ung" + 0.005*"film" + 0.004*"sga" + 0.004*"oxford" + 0.003*"jin" + 0.003*"saturday"
INFO : topic #2 (0.100): 0.008*"trump" + 0.007*"daca" + 0.007*"said" + 0.005*"swift" + 0.004*"emory" + 0.004*"president" + 0.004*"people" + 0.004*"warhol" + 0.004*"carter" + 0.004*"life"
INFO : topic diff=0.001861, rho=0.182574
INFO : -8.826 per-word bound, 453.9 perplexity estimate based on a held-out corpus of 135 documents with 45641 words
INFO : PROGRESS: pass 29, at document #135/135
INFO : topic #8 (0.100): 0.008*"s

INFO : topic #0 (0.100): 0.007*"game" + 0.005*"play" + 0.004*"film" + 0.003*"anthony" + 0.003*"like" + 0.003*"pizza" + 0.003*"kingsman" + 0.003*"fantasy" + 0.003*"trade" + 0.003*"football"
INFO : topic #8 (0.100): 0.008*"said" + 0.008*"emory" + 0.007*"team" + 0.007*"eagles" + 0.006*"women" + 0.004*"season" + 0.004*"sept" + 0.004*"university" + 0.004*"junior" + 0.003*"game"
INFO : topic #1 (0.100): 0.008*"team" + 0.007*"eagles" + 0.006*"emory" + 0.006*"speech" + 0.005*"game" + 0.005*"said" + 0.005*"atlanta" + 0.005*"season" + 0.004*"home" + 0.004*"sept"
INFO : topic diff=0.001307, rho=0.164399
INFO : -8.825 per-word bound, 453.4 perplexity estimate based on a held-out corpus of 135 documents with 45641 words
INFO : PROGRESS: pass 36, at document #135/135
INFO : topic #2 (0.100): 0.008*"trump" + 0.007*"daca" + 0.007*"said" + 0.005*"swift" + 0.004*"emory" + 0.004*"president" + 0.004*"people" + 0.004*"warhol" + 0.004*"carter" + 0.004*"life"
INFO : topic #8 (0.100): 0.008*"said" + 0.008*"em

INFO : topic #0 (0.100): 0.007*"game" + 0.005*"play" + 0.004*"film" + 0.003*"anthony" + 0.003*"like" + 0.003*"pizza" + 0.003*"kingsman" + 0.003*"fantasy" + 0.003*"trade" + 0.003*"football"
INFO : topic #2 (0.100): 0.008*"trump" + 0.007*"daca" + 0.007*"said" + 0.005*"swift" + 0.004*"president" + 0.004*"emory" + 0.004*"people" + 0.004*"warhol" + 0.004*"carter" + 0.004*"life"
INFO : topic diff=0.001050, rho=0.150756
INFO : -8.823 per-word bound, 453.0 perplexity estimate based on a held-out corpus of 135 documents with 45641 words
INFO : PROGRESS: pass 43, at document #135/135
INFO : topic #4 (0.100): 0.014*"according" + 0.014*"emory" + 0.007*"campus" + 0.007*"university" + 0.007*"irma" + 0.007*"monday" + 0.006*"atlanta" + 0.005*"georgia" + 0.005*"students" + 0.004*"power"
INFO : topic #5 (0.100): 0.007*"said" + 0.007*"barankitse" + 0.006*"emory" + 0.005*"ung" + 0.005*"film" + 0.005*"runners" + 0.004*"sga" + 0.004*"oxford" + 0.003*"jin" + 0.003*"saturday"
INFO : topic #7 (0.100): 0.006*"e

INFO : topic #4 (0.100): 0.014*"according" + 0.014*"emory" + 0.007*"campus" + 0.007*"university" + 0.007*"irma" + 0.007*"monday" + 0.006*"atlanta" + 0.005*"georgia" + 0.005*"students" + 0.004*"power"
INFO : topic diff=0.000901, rho=0.140028
INFO : -8.822 per-word bound, 452.6 perplexity estimate based on a held-out corpus of 135 documents with 45641 words
INFO : PROGRESS: pass 50, at document #135/135
INFO : topic #4 (0.100): 0.014*"emory" + 0.014*"according" + 0.007*"campus" + 0.007*"university" + 0.007*"irma" + 0.007*"monday" + 0.006*"atlanta" + 0.005*"georgia" + 0.005*"students" + 0.004*"power"
INFO : topic #3 (0.100): 0.007*"emory" + 0.006*"trump" + 0.005*"students" + 0.004*"speech" + 0.004*"class" + 0.003*"group" + 0.003*"pace" + 0.003*"major" + 0.003*"like" + 0.003*"hate"
INFO : topic #6 (0.100): 0.008*"emory" + 0.005*"sept" + 0.005*"year" + 0.005*"students" + 0.005*"officers" + 0.005*"epd" + 0.005*"student" + 0.004*"responded" + 0.004*"hope" + 0.004*"said"
INFO : topic #5 (0.100

INFO : topic diff=0.000812, rho=0.131306
INFO : -8.821 per-word bound, 452.3 perplexity estimate based on a held-out corpus of 135 documents with 45641 words
INFO : PROGRESS: pass 57, at document #135/135
INFO : topic #4 (0.100): 0.014*"emory" + 0.014*"according" + 0.007*"campus" + 0.007*"university" + 0.007*"irma" + 0.007*"monday" + 0.006*"atlanta" + 0.005*"georgia" + 0.005*"students" + 0.004*"county"
INFO : topic #0 (0.100): 0.007*"game" + 0.005*"play" + 0.004*"film" + 0.004*"anthony" + 0.003*"like" + 0.003*"pizza" + 0.003*"kingsman" + 0.003*"fantasy" + 0.003*"trade" + 0.003*"characters"
INFO : topic #3 (0.100): 0.007*"emory" + 0.006*"trump" + 0.005*"students" + 0.004*"speech" + 0.004*"class" + 0.003*"group" + 0.003*"pace" + 0.003*"major" + 0.003*"like" + 0.003*"hate"
INFO : topic #5 (0.100): 0.007*"said" + 0.007*"barankitse" + 0.006*"emory" + 0.005*"ung" + 0.005*"film" + 0.004*"sga" + 0.004*"runners" + 0.004*"oxford" + 0.003*"jin" + 0.003*"according"
INFO : topic #9 (0.100): 0.015*"

INFO : PROGRESS: pass 64, at document #135/135
INFO : topic #4 (0.100): 0.014*"emory" + 0.014*"according" + 0.007*"campus" + 0.007*"university" + 0.007*"irma" + 0.007*"monday" + 0.006*"atlanta" + 0.005*"georgia" + 0.005*"students" + 0.004*"county"
INFO : topic #2 (0.100): 0.008*"trump" + 0.007*"daca" + 0.007*"said" + 0.005*"swift" + 0.004*"president" + 0.004*"emory" + 0.004*"people" + 0.004*"warhol" + 0.004*"carter" + 0.004*"life"
INFO : topic #6 (0.100): 0.008*"emory" + 0.005*"sept" + 0.005*"year" + 0.005*"students" + 0.005*"officers" + 0.005*"epd" + 0.005*"student" + 0.004*"responded" + 0.004*"hope" + 0.004*"said"
INFO : topic #0 (0.100): 0.007*"game" + 0.005*"play" + 0.004*"film" + 0.004*"anthony" + 0.003*"pizza" + 0.003*"kingsman" + 0.003*"like" + 0.003*"fantasy" + 0.003*"characters" + 0.003*"trade"
INFO : topic #5 (0.100): 0.007*"barankitse" + 0.007*"said" + 0.006*"emory" + 0.005*"ung" + 0.005*"film" + 0.004*"sga" + 0.004*"runners" + 0.004*"oxford" + 0.003*"jin" + 0.003*"legislatu

INFO : topic #0 (0.100): 0.007*"game" + 0.005*"play" + 0.004*"film" + 0.004*"anthony" + 0.003*"pizza" + 0.003*"kingsman" + 0.003*"like" + 0.003*"fantasy" + 0.003*"characters" + 0.003*"trade"
INFO : topic #7 (0.100): 0.005*"emory" + 0.004*"food" + 0.004*"bar" + 0.003*"like" + 0.003*"campus" + 0.003*"building" + 0.003*"schultz" + 0.003*"restaurant" + 0.003*"students" + 0.002*"new"
INFO : topic #9 (0.100): 0.015*"emory" + 0.013*"said" + 0.009*"students" + 0.008*"daca" + 0.006*"university" + 0.006*"according" + 0.006*"law" + 0.006*"school" + 0.005*"sept" + 0.005*"program"
INFO : topic #5 (0.100): 0.007*"barankitse" + 0.007*"said" + 0.006*"emory" + 0.005*"ung" + 0.005*"film" + 0.004*"sga" + 0.004*"oxford" + 0.003*"runners" + 0.003*"jin" + 0.003*"legislature"
INFO : topic diff=0.000635, rho=0.117041
INFO : -8.819 per-word bound, 451.6 perplexity estimate based on a held-out corpus of 135 documents with 45641 words
INFO : PROGRESS: pass 72, at document #135/135
INFO : topic #1 (0.100): 0.008*

Wall time: 34.7 s


In [33]:
sp2016_topics = lda_model_sp2016.show_topics(10, 10, formatted = False)
fa2016_topics = lda_model_fa2016.show_topics(10, 10, formatted = False)
sp2017_topics = lda_model_sp2017.show_topics(10, 10, formatted = False)
fa2017_topics = lda_model_fa2017.show_topics(10, 10, formatted = False)

def show_topic(topics):
    for topic in topics:
        topic_num = topic[0]
        topic_words = ""
        topic_pairs = topic[1]
        for pair in topic_pairs:
            topic_words += pair[0] + ", "
        print("T" + str(topic_num) + ": " + topic_words)
        
        
print("-----Spring 2016-----")
show_topic(sp2016_topics)
print("-----Fall 2016-----")
show_topic(fa2016_topics)
print("-----Spring 2017-----")
show_topic(sp2017_topics)
print("-----Fall 2017-----")
show_topic(fa2017_topics)

-----Spring 2016-----
T0: students, emory, said, black, class, college, campus, americans, course, student, 
T1: eagles, game, points, jan, lead, people, think, know, team, point, 
T2: film, speech, free, films, like, hated, yak, yik, authenticity, director, 
T3: business, said, group, school, businesses, goizueta, provide, areas, help, trying, 
T4: emory, best, page, library, email, service, university, information, little, georgia, 
T5: house, said, website, think, program, students, emory, according, book, going, 
T6: game, man, star, punch, wars, vote, candidate, students, new, like, 
T7: film, story, like, films, time, best, year, way, character, ve, 
T8: said, students, student, lounge, building, harris, hall, class, according, project, 
T9: team, hawks, game, time, university, second, emory, meter, said, placed, 
-----Fall 2016-----
T0: emory, football, college, trump, said, sterk, people, university, new, sports, 
T1: students, staples, emory, scott, campus, safe, need, album, 

In [35]:
# create the bag of words for the document on the basis of the CCP dictionary, created above
sp2016_doc_bow = id2word_sp2016.doc2bow(tokens)
fa2016_doc_bow = id2word_fa2016.doc2bow(tokens)

# get the topics that the doc consists of
sp2016_doc_topics = lda_model.get_document_topics(sp2016_doc_bow)
fa2016_doc_topics = lda_model_fa2016.get_document_topics(sp2016_doc_bow)

def show_percentage(doc_topics):
    for topic, prob in doc_topics:
        print("T" + str(topic) + ": " + "{:.2%}".format(prob) + " of document.")
        topic_words = "Top words in topic: "
        select_topics = topics[topic]
        for pair in select_topics[1]:
            topic_words += pair[0] + ", "
        print(topic_words)

show_percentage(sp2016_doc_topics)
print("-----\n")
show_percentage(fa2016_doc_topics)

T0: 22.62% of document.
Top words in topic: like, people, time, think, said, way, work, life, world, know, new, feel, want, different, ve, good, going, best, years, love, 
T1: 22.87% of document.
Top words in topic: emory, students, said, university, according, college, president, community, school, people, student, education, campus, year, trump, atlanta, georgia, program, department, wheel, 
T2: 5.90% of document.
Top words in topic: game, team, brereton, great, emory, britain, baseball, runs, said, boden, solomon, innings, eagles, final, tournament, petrels, run, year, inning, win, 
T3: 5.68% of document.
Top words in topic: game, season, win, games, quarterback, year, points, week, backup, nfl, point, players, atlanta, new, allen, hawks, team, teams, play, ryan, 
T4: 1.77% of document.
Top words in topic: yak, yik, speech, resolution, emory, sga, think, people, posts, world, university, thunberg, hate, media, support, empathy, student, members, person, abuse, 
T5: 3.24% of document

## TF-IDF

In [115]:
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

docs_2014 = df['Content'][df['Date'].str[-4:] == "2014"].tolist()
docs_2015 = df['Content'][df['Date'].str[-4:] == "2015"].tolist()
docs_2016 = df['Content'][df['Date'].str[-4:] == "2016"].tolist()
docs_2017 = df['Content'][df['Date'].str[-4:] == "2017"].tolist()
docs_2018 = df['Content'][df['Date'].str[-4:] == "2018"].tolist()
docs_2019 = df['Content'][df['Date'].str[-4:] == "2019"].tolist()

docs_2014_wc = docs_2014
docs_2015_wc = docs_2015
docs_2016_wc = docs_2016
docs_2017_wc = docs_2017
docs_2018_wc = docs_2018
docs_2019_wc = docs_2019

# word_count_vector.shape
# docs_2014
docs_2014 = [' '.join(docs_2014)]
docs_2015 = [' '.join(docs_2015)]
docs_2016 = [' '.join(docs_2016)]
docs_2017 = [' '.join(docs_2017)]
docs_2018 = [' '.join(docs_2018)]
docs_2019 = [' '.join(docs_2019)]

323

In [108]:
# Year of 2014
tfidf_vectorizer=TfidfVectorizer(stop_words='english', use_idf=True)

tfidf_vectorizer_vectors_2014 = tfidf_vectorizer.fit_transform(docs_2014)

tfidf_result_2014 = pd.DataFrame(tfidf_vectorizer_vectors_2014.T.todense(), 
                                 index=tfidf_vectorizer.get_feature_names(), 
                                 columns=["tf-idf Score"])
tfidf_result_2014.sort_values(by=["tf-idf Score"],ascending=False).head(20)

Unnamed: 0,tf-idf Score
emory,0.364561
said,0.300324
students,0.223836
team,0.199333
college,0.196022
university,0.147016
time,0.145692
people,0.142712
year,0.1245
like,0.122845


In [109]:
# Year of 2015
tfidf_vectorizer_vectors_2015 = tfidf_vectorizer.fit_transform(docs_2015)

tfidf_result_2015 = pd.DataFrame(tfidf_vectorizer_vectors_2015.T.todense(), 
                                 index=tfidf_vectorizer.get_feature_names(), 
                                 columns=["tf-idf Score"])
tfidf_result_2015.sort_values(by=["tf-idf Score"],ascending=False).head(20)

Unnamed: 0,tf-idf Score
emory,0.408915
said,0.310005
students,0.252253
college,0.178303
university,0.163433
people,0.136615
like,0.134889
time,0.132366
year,0.132101
student,0.115505


In [110]:
# Year of 2016
tfidf_vectorizer_vectors_2016 = tfidf_vectorizer.fit_transform(docs_2016)

tfidf_result_2016 = pd.DataFrame(tfidf_vectorizer_vectors_2016.T.todense(), 
                                 index=tfidf_vectorizer.get_feature_names(), 
                                 columns=["tf-idf Score"])
tfidf_result_2016.sort_values(by=["tf-idf Score"],ascending=False).head(20)

Unnamed: 0,tf-idf Score
emory,0.409598
said,0.302781
students,0.168352
film,0.155414
college,0.15494
time,0.151311
university,0.14926
team,0.144211
like,0.14216
people,0.136638


In [111]:
# Year of 2017
tfidf_vectorizer_vectors_2017 = tfidf_vectorizer.fit_transform(docs_2017)

tfidf_result_2017 = pd.DataFrame(tfidf_vectorizer_vectors_2017.T.todense(), 
                                 index=tfidf_vectorizer.get_feature_names(), 
                                 columns=["tf-idf Score"])
tfidf_result_2017.sort_values(by=["tf-idf Score"],ascending=False).head(20)

Unnamed: 0,tf-idf Score
emory,0.43724
said,0.374316
students,0.188771
university,0.171024
according,0.137411
time,0.135125
student,0.132301
team,0.128268
like,0.118453
people,0.114688


In [112]:
# Year of 2018
tfidf_vectorizer_vectors_2018 = tfidf_vectorizer.fit_transform(docs_2018)

tfidf_result_2018 = pd.DataFrame(tfidf_vectorizer_vectors_2018.T.todense(), 
                                 index=tfidf_vectorizer.get_feature_names(), 
                                 columns=["tf-idf Score"])
tfidf_result_2018.sort_values(by=["tf-idf Score"],ascending=False).head(20)

Unnamed: 0,tf-idf Score
said,0.423482
emory,0.389653
students,0.174616
student,0.150862
university,0.126982
wheel,0.126485
film,0.125366
according,0.125117
like,0.120391
time,0.119272


In [113]:
# Year of 2019
tfidf_vectorizer_vectors_2019 = tfidf_vectorizer.fit_transform(docs_2019)

tfidf_result_2019 = pd.DataFrame(tfidf_vectorizer_vectors_2019.T.todense(), 
                                 index=tfidf_vectorizer.get_feature_names(), 
                                 columns=["tf-idf Score"])
tfidf_result_2019.sort_values(by=["tf-idf Score"],ascending=False).head(20)

Unnamed: 0,tf-idf Score
emory,0.46306
said,0.34542
students,0.167578
university,0.149813
wheel,0.149616
post,0.126522
film,0.121785
student,0.121785
like,0.121193
time,0.120601


## Word Counting

In [118]:
# Year of 2014 Word Frequencies
cv = CountVectorizer(stop_words = 'english')
word_count_vector_2014 = cv.fit_transform(docs_2014_wc)
sum_words = word_count_vector_2014.sum(axis=0)

words_freq_2014 = [(word, sum_words[0, idx]) for word, idx in cv.vocabulary_.items()]

words_freq_2014 = sorted(words_freq_2014, key = lambda x: x[1], reverse=True)

# display the top 10
words_freq_2014[:10]

[('emory', 1101),
 ('said', 907),
 ('students', 676),
 ('team', 602),
 ('college', 592),
 ('university', 444),
 ('time', 440),
 ('people', 431),
 ('year', 376),
 ('like', 371)]