# Topic Modeling the Dispatch — Part I

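Run the following in a terminal or command prompt (in spaCy 3 and later the `en` shortcut is no longer available; use `en_core_web_sm` in its place):
`python -m spacy download en`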

The core packages used in this tutorial are `nltk`, `re`, `gensim`, `spacy` and `pyLDAvis`. In addition, we will use `matplotlib`, `numpy`, `pandas` and `plotly` for data handling and visualization. Let's import them.

(to install and run jupyter notebooks, see: <https://www.csestack.org/install-use-jupyter-notebook-python-example/>)

In [1]:
%%time
import nltk
#nltk.download('stopwords')


CPU times: user 541 ms, sys: 253 ms, total: 793 ms
Wall time: 1.29 s


We'll also need some libraries for data manipulation and visualization: `matplotlib`, `numpy` and `pandas`.

In [2]:
import re
import numpy
import pandas
pandas.set_option("display.max_colwidth", 30)
from pprint import pprint

In [3]:
# Gensim
import gensim
import gensim.corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

In [4]:
# spacy for lemmatization
import spacy

In [5]:
# Plotting tools
import pyLDAvis
import pyLDAvis.gensim_models  # don't skip this
import matplotlib.pyplot as plt
%matplotlib inline

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

In [6]:
# import nltk
# import ssl

# try:
#     _create_unverified_https_context = ssl._create_unverified_context
# except AttributeError:
#     pass
# else:
#     ssl._create_default_https_context = _create_unverified_https_context

# nltk.download()

# a pop-up window will open and you should select `Stopwords` to download;
# after it is installed, you can close that pop-up window

# Stopwords

In [7]:
%%time

# NLTK Stop words
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

print(stop_words)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

**NB:** It is usually better to curate your own stop word list for the corpus at hand. How can we create one?
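
One possible way to start a curated list, sketched under assumptions: take a general-purpose list as the base and add words that are frequent in your corpus but carry little topical meaning. Here `base_stop_words` is a short stand-in for the NLTK list and `corpus_specific` is a hypothetical selection; neither is derived from the Dispatch data.

```python
# A short stand-in for stopwords.words('english'); in the notebook, use the NLTK list instead.
base_stop_words = ["i", "me", "my", "the", "and", "of"]

# Hypothetical corpus-specific additions (words you decide carry no topical meaning).
corpus_specific = ["said", "upon", "the"]

# A set removes duplicates; sorting makes the result reproducible.
curated_stop_words = sorted(set(base_stop_words) | set(corpus_specific))

print(curated_stop_words)
```

The frequency list built later in this notebook is a natural source for the corpus-specific part.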

# Load our Dispatch data:

In [8]:
%%time

dispatchSubfolder = "./Dispatch_Processed_TSV/"

# WE CAN EDIT THIS LIST IN ORDER TO REDUCE THE AMOUNT OF DATA THAT WE ARE LOADING
dispatchFiles = ["Dispatch_1860.tsv",
                 "Dispatch_1861.tsv",
                 "Dispatch_1862.tsv",
                 "Dispatch_1863.tsv",
                 "Dispatch_1864.tsv",
                 "Dispatch_1865.tsv"
                 ]

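frames = []

for f in dispatchFiles:
    dfTemp = pandas.read_csv(dispatchSubfolder + f, sep="\t", header=0)
    frames.append(dfTemp)

# DataFrame.append was removed in pandas 2.0; pandas.concat combines all yearly tables at once
df = pandas.concat(frames)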

dispatch = df
# drop=True -- use it to avoid creating a new column with the old index values
dispatch = dispatch.reset_index(drop=True) 

# add a "month" column by stripping the day from each date (YYYY-MM-DD -> YYYY-MM), so that we can aggregate our data by month
dispatch["month"] = [re.sub(r"-\d\d$", "", str(i)) for i in dispatch["date"]]

# reorder columns
dispatch = dispatch[["id", "month", "date", "type", "header", "text"]]

CPU times: user 1.56 s, sys: 145 ms, total: 1.71 s
Wall time: 1.72 s
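
The regular expression used for the `month` column can be tried on a few sample strings first. A minimal sketch; `sample_dates` is made up for illustration:

```python
import re

sample_dates = ["1860-12-31", "1861-01-05", "1865-01-23"]

# r"-\d\d$" matches a hyphen plus two digits at the very end of the string (the day part).
months = [re.sub(r"-\d\d$", "", d) for d in sample_dates]

print(months)
```

The raw-string prefix `r` keeps Python from interpreting `\d` as a string escape before the pattern reaches `re`.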


In [9]:
dispatch

Unnamed: 0,id,month,date,type,header,text
0,1860-12-31_article_000,1860-12,1860-12-31,article,The National crisis.partic...,The National crisis. parti...
1,1860-12-31_article_000,1860-12,1860-12-31,article,[from the Charleston couri...,[from the Charleston couri...
2,1860-12-31_article_000,1860-12,1860-12-31,article,Death of Commodore Platt.,Death of Commodore Platt.;...
3,1860-12-31_article_000,1860-12,1860-12-31,article,Death of the last survivor...,Death of the last survivor...
4,1860-12-31_article_000,1860-12,1860-12-31,article,Christmas in Charleston.,Christmas in Charleston.;;...
...,...,...,...,...,...,...
129848,1865-01-23_orders_000,1865-01,1865-01-23,orders,"Treasury Department,Confed...","Treasury Department, Confe..."
129849,1865-01-23_orders_000,1865-01,1865-01-23,orders,"Treasury Department,Confed...","Treasury Department, Confe..."
129850,1865-01-23_orders_000,1865-01,1865-01-23,orders,"Treasury Department, Confe...","Treasury Department, Confe..."
129851,1865-01-23_orders_000,1865-01,1865-01-23,orders,"Treasury Department,Confed...","Treasury Department, Confe..."


In [10]:
%%time

dispatch["month"] = pandas.to_datetime(dispatch["month"], format="%Y-%m")
dispatch["date"] = pandas.to_datetime(dispatch["date"], format="%Y-%m-%d")

dispatch

CPU times: user 49.5 ms, sys: 8.76 ms, total: 58.3 ms
Wall time: 56.6 ms


Unnamed: 0,id,month,date,type,header,text
0,1860-12-31_article_000,1860-12-01,1860-12-31,article,The National crisis.partic...,The National crisis. parti...
1,1860-12-31_article_000,1860-12-01,1860-12-31,article,[from the Charleston couri...,[from the Charleston couri...
2,1860-12-31_article_000,1860-12-01,1860-12-31,article,Death of Commodore Platt.,Death of Commodore Platt.;...
3,1860-12-31_article_000,1860-12-01,1860-12-31,article,Death of the last survivor...,Death of the last survivor...
4,1860-12-31_article_000,1860-12-01,1860-12-31,article,Christmas in Charleston.,Christmas in Charleston.;;...
...,...,...,...,...,...,...
129848,1865-01-23_orders_000,1865-01-01,1865-01-23,orders,"Treasury Department,Confed...","Treasury Department, Confe..."
129849,1865-01-23_orders_000,1865-01-01,1865-01-23,orders,"Treasury Department,Confed...","Treasury Department, Confe..."
129850,1865-01-23_orders_000,1865-01-01,1865-01-23,orders,"Treasury Department, Confe...","Treasury Department, Confe..."
129851,1865-01-23_orders_000,1865-01-01,1865-01-23,orders,"Treasury Department,Confed...","Treasury Department, Confe..."


In [11]:
# you can change the parameter below, but it may become messy if it is too large
# this parameter sets the maximum width of columns
pandas.set_option("display.max_colwidth", 50)

dispatch

Unnamed: 0,id,month,date,type,header,text
0,1860-12-31_article_000,1860-12-01,1860-12-31,article,The National crisis.particulars of the evacuat...,The National crisis. particulars of the evacua...
1,1860-12-31_article_000,1860-12-01,1860-12-31,article,"[from the Charleston courier, of Friday.]Fort ...","[from the Charleston courier, of Friday.] Fort..."
2,1860-12-31_article_000,1860-12-01,1860-12-31,article,Death of Commodore Platt.,Death of Commodore Platt.;;; Another of our mo...
3,1860-12-31_article_000,1860-12-01,1860-12-31,article,Death of the last survivor of the battle of Bu...,Death of the last survivor of the battle of Bu...
4,1860-12-31_article_000,1860-12-01,1860-12-31,article,Christmas in Charleston.,"Christmas in Charleston.;;; --It seemed, on Tu..."
...,...,...,...,...,...,...
129848,1865-01-23_orders_000,1865-01-01,1865-01-23,orders,"Treasury Department,Confederate States of Amer...","Treasury Department, Confederate States of Ame..."
129849,1865-01-23_orders_000,1865-01-01,1865-01-23,orders,"Treasury Department,Confederate States of Amer...","Treasury Department, Confederate States of Ame..."
129850,1865-01-23_orders_000,1865-01-01,1865-01-23,orders,"Treasury Department, Confederate States of Ame...","Treasury Department, Confederate States of Ame..."
129851,1865-01-23_orders_000,1865-01-01,1865-01-23,orders,"Treasury Department,Confederate States of Amer...","Treasury Department, Confederate States of Ame..."


The following checks data types in a dataframe (table). Let's run it to check if our "date" column was properly converted:

In [12]:
dispatch.dtypes

id                object
month     datetime64[ns]
date      datetime64[ns]
type              object
header            object
text              object
dtype: object
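
With `month` stored as a datetime, aggregating by month becomes a single `groupby`. The sketch below uses a tiny made-up table (`toy`) rather than the Dispatch data, and derives the month directly from the date column:

```python
import pandas

toy = pandas.DataFrame({
    "date": ["1860-12-30", "1860-12-31", "1861-01-02"],
    "type": ["article", "advert", "article"],
})
toy["date"] = pandas.to_datetime(toy["date"], format="%Y-%m-%d")

# Truncate every date to the first day of its month.
toy["month"] = toy["date"].dt.to_period("M").dt.to_timestamp()

# Count items per month.
items_per_month = toy.groupby("month").size()
print(items_per_month)
```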

The following line prints a specific piece of our table (a cell from the first row, column `text`):

In [13]:
print(dispatch["text"][0])

The National crisis. particulars of the evacuation and occupation of Fort Moultrie. resignation of Secretary Floyd. &amp;c., &amp;c., &amp;c.;;; The Washington Constitution of yesterday announces that the resignation of Hon. John B. Floyd, Secretary of War, was tendered on Saturday, and accepted by the President.--The Star of the evening before, foreshadowing this result, says:;;; The on dit of the day, immediately around us is, that Secretaries Floyd, Thompson, and Thomas, all of whom believe in the alleged constitutional right of secession, it will be remembered, have formally notified the President that they will resign their respective portfolios unless he accede to the demand of the South Carolina Commissioners, that orders shall be issued to Major Anderson directing him to go back to Fort Moultrie from Fort Sumter, with all his force — of course thus shadowing the latter to the --This rumor is probably true.;;; we may not inappropriately add, that if such orders are issued to Maj

The following command allows us to count instances of specific values in a specific column:

In [14]:
dispatch['type'].value_counts()

article     71014
ad-blank    26518
advert      18736
orders      10916
death        1038
            ...  
ordors          1
misc            1
product         1
report          1
oped            1
Name: type, Length: 104, dtype: int64

## Filtering your dataframe:

1. Range of rows or columns by index positions
2. Filtering by value(s)
3. And many other ways (you can google many useful tutorials)

In [15]:
len(dispatch)

129853

Range of rows by index (after `reset_index`, the row labels coincide with positions; note that `.loc` includes both endpoints of the range):

In [16]:
dispatch = dispatch.reset_index(drop=True)
len(dispatch.loc[1:1000])

1000
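
`.loc` selects by index label and includes both endpoints, which is why `dispatch.loc[1:1000]` has 1000 rows; `.iloc` selects by position and excludes the endpoint. A small sketch on a made-up table:

```python
import pandas

toy = pandas.DataFrame({"text": list("abcdef")})

by_label = toy.loc[1:3]      # label-based: rows labelled 1, 2 and 3 (end included)
by_position = toy.iloc[1:3]  # position-based: rows at positions 1 and 2 (end excluded)

print(len(by_label), len(by_position))
```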

In [17]:
dispatch

Unnamed: 0,id,month,date,type,header,text
0,1860-12-31_article_000,1860-12-01,1860-12-31,article,The National crisis.particulars of the evacuat...,The National crisis. particulars of the evacua...
1,1860-12-31_article_000,1860-12-01,1860-12-31,article,"[from the Charleston courier, of Friday.]Fort ...","[from the Charleston courier, of Friday.] Fort..."
2,1860-12-31_article_000,1860-12-01,1860-12-31,article,Death of Commodore Platt.,Death of Commodore Platt.;;; Another of our mo...
3,1860-12-31_article_000,1860-12-01,1860-12-31,article,Death of the last survivor of the battle of Bu...,Death of the last survivor of the battle of Bu...
4,1860-12-31_article_000,1860-12-01,1860-12-31,article,Christmas in Charleston.,"Christmas in Charleston.;;; --It seemed, on Tu..."
...,...,...,...,...,...,...
129848,1865-01-23_orders_000,1865-01-01,1865-01-23,orders,"Treasury Department,Confederate States of Amer...","Treasury Department, Confederate States of Ame..."
129849,1865-01-23_orders_000,1865-01-01,1865-01-23,orders,"Treasury Department,Confederate States of Amer...","Treasury Department, Confederate States of Ame..."
129850,1865-01-23_orders_000,1865-01-01,1865-01-23,orders,"Treasury Department, Confederate States of Ame...","Treasury Department, Confederate States of Ame..."
129851,1865-01-23_orders_000,1865-01-01,1865-01-23,orders,"Treasury Department,Confederate States of Amer...","Treasury Department, Confederate States of Ame..."


Filtering by value(s):

In [18]:
dispatch_light = dispatch[dispatch.type != "ad-blank"]
len(dispatch_light)

103335
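
To filter by several values at once, `isin` can be combined with `~` (negation). A sketch on a made-up table; the type labels are borrowed from the counts above, the rows are invented:

```python
import pandas

toy = pandas.DataFrame({
    "type": ["article", "ad-blank", "advert", "orders", "ad-blank"],
    "text": ["a", "b", "c", "d", "e"],
})

# Keep only rows whose type is NOT in the list of unwanted types.
unwanted = ["ad-blank", "advert"]
toy_light = toy[~toy["type"].isin(unwanted)]

print(toy_light.index.tolist())  # the original row labels survive filtering

# Renumber the rows from 0 after filtering.
toy_light = toy_light.reset_index(drop=True)
print(toy_light.index.tolist())
```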

Filtering affects the index numbers of rows in the table (they are in the leftmost "column"): the numbers of the removed rows go missing. We need to reset (renumber) them so that we can merge the table with the topic data that we will generate later.

**NB**: reset can be done with the following command `dispatch_light = dispatch_light.reset_index(drop=True)`

In [19]:
dispatch_light = dispatch_light.reset_index(drop=True)
dispatch_light

Unnamed: 0,id,month,date,type,header,text
0,1860-12-31_article_000,1860-12-01,1860-12-31,article,The National crisis.particulars of the evacuat...,The National crisis. particulars of the evacua...
1,1860-12-31_article_000,1860-12-01,1860-12-31,article,"[from the Charleston courier, of Friday.]Fort ...","[from the Charleston courier, of Friday.] Fort..."
2,1860-12-31_article_000,1860-12-01,1860-12-31,article,Death of Commodore Platt.,Death of Commodore Platt.;;; Another of our mo...
3,1860-12-31_article_000,1860-12-01,1860-12-31,article,Death of the last survivor of the battle of Bu...,Death of the last survivor of the battle of Bu...
4,1860-12-31_article_000,1860-12-01,1860-12-31,article,Christmas in Charleston.,"Christmas in Charleston.;;; --It seemed, on Tu..."
...,...,...,...,...,...,...
103330,1865-01-23_orders_000,1865-01-01,1865-01-23,orders,"Treasury Department,Confederate States of Amer...","Treasury Department, Confederate States of Ame..."
103331,1865-01-23_orders_000,1865-01-01,1865-01-23,orders,"Treasury Department,Confederate States of Amer...","Treasury Department, Confederate States of Ame..."
103332,1865-01-23_orders_000,1865-01-01,1865-01-23,orders,"Treasury Department, Confederate States of Ame...","Treasury Department, Confederate States of Ame..."
103333,1865-01-23_orders_000,1865-01-01,1865-01-23,orders,"Treasury Department,Confederate States of Amer...","Treasury Department, Confederate States of Ame..."


You can read/print the text of any item in the following manner: `your_dataframe["column_name"][row_number]`

In [20]:
print(dispatch_light["text"][123])

Singular cause of death.;;; -- Col. William Early, an old and respectable citizen of Washington county, Tenn., died suddenly on the 11th instant.;;; He had been salting down some pork, and cut his hand slightly against a bone, from which mortification and death ensued.


# Preparing our data

Do not rush to run the following code! The Dispatch contains a lot of data, and processing all of it may take quite a while. In class, let's use a smaller sample. At home, re-run this notebook with all the data (you can drop items of the type `ad-blank`).

In [21]:
%%time
dispatch_light["textData"] = dispatch_light["text"]
dispatch_light["textData"] = [re.sub(r"\W+", " ", str(i).lower()) for i in dispatch_light["textData"]]
dispatch_light["textData"] = [re.sub(r" +", " ", str(i)) for i in dispatch_light["textData"]]

dispatch_light

CPU times: user 14 s, sys: 139 ms, total: 14.2 s
Wall time: 14.2 s


Unnamed: 0,id,month,date,type,header,text,textData
0,1860-12-31_article_000,1860-12-01,1860-12-31,article,The National crisis.particulars of the evacuat...,The National crisis. particulars of the evacua...,the national crisis particulars of the evacuat...
1,1860-12-31_article_000,1860-12-01,1860-12-31,article,"[from the Charleston courier, of Friday.]Fort ...","[from the Charleston courier, of Friday.] Fort...",from the charleston courier of friday fort mo...
2,1860-12-31_article_000,1860-12-01,1860-12-31,article,Death of Commodore Platt.,Death of Commodore Platt.;;; Another of our mo...,death of commodore platt another of our most e...
3,1860-12-31_article_000,1860-12-01,1860-12-31,article,Death of the last survivor of the battle of Bu...,Death of the last survivor of the battle of Bu...,death of the last survivor of the battle of bu...
4,1860-12-31_article_000,1860-12-01,1860-12-31,article,Christmas in Charleston.,"Christmas in Charleston.;;; --It seemed, on Tu...",christmas in charleston it seemed on tuesday a...
...,...,...,...,...,...,...,...
103330,1865-01-23_orders_000,1865-01-01,1865-01-23,orders,"Treasury Department,Confederate States of Amer...","Treasury Department, Confederate States of Ame...",treasury department confederate states of amer...
103331,1865-01-23_orders_000,1865-01-01,1865-01-23,orders,"Treasury Department,Confederate States of Amer...","Treasury Department, Confederate States of Ame...",treasury department confederate states of amer...
103332,1865-01-23_orders_000,1865-01-01,1865-01-23,orders,"Treasury Department, Confederate States of Ame...","Treasury Department, Confederate States of Ame...",treasury department confederate states of amer...
103333,1865-01-23_orders_000,1865-01-01,1865-01-23,orders,"Treasury Department,Confederate States of Amer...","Treasury Department, Confederate States of Ame...",treasury department confederate states of amer...


In [22]:
dispatch_light["textData"][0]

'the national crisis particulars of the evacuation and occupation of fort moultrie resignation of secretary floyd amp c amp c amp c the washington constitution of yesterday announces that the resignation of hon john b floyd secretary of war was tendered on saturday and accepted by the president the star of the evening before foreshadowing this result says the on dit of the day immediately around us is that secretaries floyd thompson and thomas all of whom believe in the alleged constitutional right of secession it will be remembered have formally notified the president that they will resign their respective portfolios unless he accede to the demand of the south carolina commissioners that orders shall be issued to major anderson directing him to go back to fort moultrie from fort sumter with all his force of course thus shadowing the latter to the this rumor is probably true we may not inappropriately add that if such orders are issued to major anderson secretaries toney holt black and

In [23]:
dispatch_light["text"][0]

'The National crisis. particulars of the evacuation and occupation of Fort Moultrie. resignation of Secretary Floyd. &amp;c., &amp;c., &amp;c.;;; The Washington Constitution of yesterday announces that the resignation of Hon. John B. Floyd, Secretary of War, was tendered on Saturday, and accepted by the President.--The Star of the evening before, foreshadowing this result, says:;;; The on dit of the day, immediately around us is, that Secretaries Floyd, Thompson, and Thomas, all of whom believe in the alleged constitutional right of secession, it will be remembered, have formally notified the President that they will resign their respective portfolios unless he accede to the demand of the South Carolina Commissioners, that orders shall be issued to Major Anderson directing him to go back to Fort Moultrie from Fort Sumter, with all his force — of course thus shadowing the latter to the --This rumor is probably true.;;; we may not inappropriately add, that if such orders are issued to Ma
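
Comparing the two versions shows that the HTML entity `&amp;` survives cleaning as the stray token `amp`. One possible remedy, which is not part of the pipeline above, is to decode entities with the standard library's `html.unescape` before stripping non-word characters:

```python
import html
import re

raw = "resignation of Secretary Floyd. &amp;c., &amp;c.;;; The Washington Constitution"

# Decode HTML entities first: "&amp;" becomes "&", which \W+ then removes like other punctuation.
decoded = html.unescape(raw)
cleaned = re.sub(r"\W+", " ", decoded.lower()).strip()

print(cleaned)
```

Alternatively, `amp` can simply be added to the stop word list.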

# Tokenize and Clean Texts 

Tokenization is, essentially, the splitting of a text (i.e., an uninterrupted string of characters) into a `list` of tokens (which in English roughly correspond to words). This operation may take some time, so be patient.

In [24]:
%%time
def sent_to_words(sentences):
    for sentence in sentences:
        yield(gensim.utils.simple_preprocess(str(sentence), deacc=True))  # deacc=True strips accent marks; punctuation is dropped by the tokenizer itself

dispatch_light["textDataLists"] = list(sent_to_words(dispatch_light["textData"].copy()))

CPU times: user 41 s, sys: 1.16 s, total: 42.1 s
Wall time: 42.7 s


In [25]:
dispatch_light["textDataLists"].head()

0    [the, national, crisis, particulars, of, the, ...
1    [from, the, charleston, courier, of, friday, f...
2    [death, of, commodore, platt, another, of, our...
3    [death, of, the, last, survivor, of, the, batt...
4    [christmas, in, charleston, it, seemed, on, tu...
Name: textDataLists, dtype: object

In [26]:
print(dispatch_light["text"][1001]) # our original text

1860.;;; Stoves — Stoves.;;; 1860.;;; Wm. Sears Wood,;;; No. 6 Main street, near the Old Market.;;; Manufacturer, Wholesale and Retail Dealer in;;; Stoves, Ranges and Furnaces.;;; Mott 's Agricultural Boilers,;;; Tin and Sheet Iron Ware.;;; Copper Lightning Rods &amp;c., &amp;c.;;; Plumbing and Gas-Fitting, in all its branches;;; Jobbing promptly attended to.;;; Repairs for all kinds of Stoves, always on hand.;;; Roofing and Guttering;;; done in the city and country, in the best manner and at shortest notice.;;; se 26 --ts


In [27]:
print(dispatch_light["textDataLists"][1001]) # our `listed` text

['stoves', 'stoves', 'wm', 'sears', 'wood', 'no', 'main', 'street', 'near', 'the', 'old', 'market', 'manufacturer', 'wholesale', 'and', 'retail', 'dealer', 'in', 'stoves', 'ranges', 'and', 'furnaces', 'mott', 'agricultural', 'boilers', 'tin', 'and', 'sheet', 'iron', 'ware', 'copper', 'lightning', 'rods', 'amp', 'amp', 'plumbing', 'and', 'gas', 'fitting', 'in', 'all', 'its', 'branches', 'jobbing', 'promptly', 'attended', 'to', 'repairs', 'for', 'all', 'kinds', 'of', 'stoves', 'always', 'on', 'hand', 'roofing', 'and', 'guttering', 'done', 'in', 'the', 'city', 'and', 'country', 'in', 'the', 'best', 'manner', 'and', 'at', 'shortest', 'notice', 'se', 'ts']
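
The listed text shows what the tokenizer dropped: numbers, punctuation, and one-letter tokens. The sketch below is a rough standard-library approximation of that behavior, not gensim's implementation; the function name `rough_preprocess` is hypothetical, and the length limits of 2 and 15 mirror the `min_len` and `max_len` defaults of `simple_preprocess`.

```python
import re

def rough_preprocess(text, min_len=2, max_len=15):
    """Lowercase, keep runs of letters, drop tokens that are too short or too long."""
    # [^\W\d_]+ matches runs of letters only (word characters minus digits and underscore).
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len]

sample = "Mott 's Agricultural Boilers,;;; Tin and Sheet Iron Ware.;;; se 26 --ts"
print(rough_preprocess(sample))
```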


# Creating a frequency list

## Identifying stop words and low frequency words

Why would we want/need to identify and remove these?

In [28]:
%%time

vocabList = [item for sublist in dispatch_light["textDataLists"] for item in sublist]

freqDic = {}
low_frequency_words = []

# collecting frequencies, like we did before
for v in vocabList:
    if v in freqDic:
        freqDic[v] += 1
    else:
        freqDic[v]  = 1

# reformatting collected results
freqList = []
for k,v in freqDic.items():
    val = "%09d\t%s\t0" % (v, k)
    freqList.append(val)
    # the cutoff value for low frequency items should be determined through the distribution of frequencies
    # but it is always safe to remove items that have frequency 1 (most likely typos)
    if v <= 1:
        low_frequency_words.append(k)

with open("dispatch_freq_list.csv", "w", encoding="utf8") as f9:
    f9.write("\n".join(sorted(freqList, reverse=True)))

print("-"*50)
print("frequency_list")
print(len(freqList))    
print("low_frequency_words")
print(len(low_frequency_words))
print("-"*50)

--------------------------------------------------
frequency_list
125904
low_frequency_words
63510
--------------------------------------------------
CPU times: user 5.08 s, sys: 647 ms, total: 5.73 s
Wall time: 6.08 s
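
The same counts can be collected more compactly with `collections.Counter` from the standard library. A minimal sketch on a made-up list of token lists (`toy_docs`), including the frequency-1 cutoff used above:

```python
from collections import Counter

toy_docs = [["fort", "sumter", "fort"], ["fort", "moultrie"], ["sumter", "evacuaton"]]

# Flatten the list of token lists and count every token in one pass.
freq = Counter(token for doc in toy_docs for token in doc)

# Words that occur only once are candidates for removal (often typos).
low_frequency = [word for word, count in freq.items() if count <= 1]

print(freq.most_common(2))
print(low_frequency)
```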


In [29]:
print(low_frequency_words[100:200])

['demirep', 'mobillans', 'mafflin', 'buem', 'jankins', 'quartes', 'kobinson', 'crittendon', 'fortres', 'maryck', 'revivify', 'barwall', 'mazyek', 'krarney', 'seannell', 'labouchere', 'tiensin', 'molis', 'attendez', 'avet', 'plaisir', 'gossipped', 'capul', 'barustoff', 'ciœsus', 'heanan', 'pefore', 'telion', 'blace', 'zooper', 'chermania', 'zerenates', 'vish', 'proun', 'harolds', 'slungshotted', 'gratied', 'substanated', 'medicin', 'menoned', 'litzinger', 'pingitsville', 'griffithville', 'savoie', 'neri', 'touchhole', 'hydrogenated', 'begbie', 'bookbindery', 'moxan', 'vaillant', 'barocke', 'siza', 'benedek', 'dapples', 'lausanne', 'chur', 'lombards', 'crajona', 'jassy', 'atholi', 'travesties', 'fils', 'enfante', 'numismatic', 'nagurs', 'rubic', 'sorcke', 'execu', 'dabner', 'whitting', 'osward', 'usquhart', 'alvoy', 'hobbed', 'quatlebaum', 'memmimger', 'wardleman', 'fluidal', 'shirtless', 'milledonville', 'porteress', 'augennes', 'whimper', 'oppressiveness', 'impenetrably', 'participancy

After this, you can work through the generated frequency list and mark the words that you want to treat as stop words by changing `0` to `1` in the third column. We can then load stop words from this file, and you keep a record of your decisions in case you want to reconsider what counts as a stop word. This approach allows you to tailor your stop word list to your corpus.

You should rename the frequency file so that it does not get overwritten when you re-run the script, for example to `dispatch_freq_list_manual.csv` (despite the `.csv` extension, its columns are separated by tabs).

In [31]:
with open("dispatch_freq_list_manual.csv", "r", encoding="utf8") as f1:
    data = f1.read().split("\n")
    
    stop_words_custom = []
    
    for d in data:
        d = d.split("\t")
        # skip blank or malformed lines; the third column marks stop words with "1"
        if len(d) == 3 and d[2] == "1":
            stop_words_custom.append(d[1])
 
stop_words_custom.extend(stop_words) # adding stop words from nltk
stop_words_custom = list(set(stop_words_custom))

print(stop_words_custom)

['only', 'or', "doesn't", 'there', 'under', 'most', 'further', 'mustn', 'd', 'and', 'but', 'should', "you're", "mightn't", 'few', 't', 'this', 'his', 'ours', 'for', 'be', "should've", 're', 'not', "wasn't", 'did', 'them', 'ain', "isn't", 'who', 'their', 'a', 've', 'are', 'ourselves', 'she', 'me', 'its', 'some', 'will', 'her', 'so', 'isn', 'our', 'above', 'y', 'hers', 'him', 'am', 'is', 'no', 'more', 'any', "you've", 'what', 'have', 'yourselves', 'herself', 'can', "weren't", 'which', 'why', "wouldn't", 'they', 'where', 'm', "didn't", 'had', 'other', "shouldn't", 'himself', 'doing', "hadn't", 'you', 'between', 'shan', 'themselves', 'through', 'wasn', 'own', 'do', 'off', 'an', 'those', 'whom', 'were', 'my', 'down', 'each', 'i', 'such', 'hasn', 'up', 'just', "you'll", 'we', 'of', 'when', 'o', 'with', 'during', "that'll", 'over', 'once', 'nor', 'll', 'shouldn', 'too', 'before', 'both', 'myself', "aren't", 'below', "won't", 'as', 'about', 'couldn', 'it', "shan't", 'being', 'he', 'that', 'doe

**NB:** One can also add bigrams and trigrams into the mix, before running topic modeling.
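
As a rough illustration of what adding bigrams means, adjacent token pairs can be counted with the standard library alone; frequent pairs are candidates for merging into single tokens such as `fort_sumter`. This is a simplified sketch on made-up data, not the phrase-detection models that `gensim` provides:

```python
from collections import Counter

toy_docs = [
    ["fort", "sumter", "was", "evacuated"],
    ["major", "anderson", "left", "fort", "sumter"],
]

# zip(doc, doc[1:]) pairs every token with its right-hand neighbor.
bigram_counts = Counter(pair for doc in toy_docs for pair in zip(doc, doc[1:]))

# Keep pairs seen at least twice (an arbitrary threshold for this toy example).
frequent = ["_".join(pair) for pair, n in bigram_counts.items() if n >= 2]
print(frequent)
```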

I have already prepared a list of stop words; let's use it:

In [32]:
stop_words_custom = ["the", "of", "and", "to", "in", "a", "that", "for", "on", "was", "is", "at", "be", "by",
                   "from", "his", "he", "it", "with", "as", "this", "will", "which", "have", "or", "are",
                   "they", "their", "not", "were", "been", "has", "our", "we", "all", "but", "one", "had",
                   "who", "an", "no", "i", "them", "about", "him", "two", "upon", "may", "there", "any",
                   "some", "so", "men", "when", "if", "day", "her", "under", "would", "c", "such", "made",
                   "up", "last", "j", "time", "years", "other", "into", "said", "new", "very", "five",
                   "after", "out", "these", "shall", "my", "w", "more", "its", "now", "before", "three",
                   "m", "than", "h", "o'clock", "old", "being", "left", "can", "s", "man", "only", "same",
                   "act", "first", "between", "above", "she", "you", "place", "following", "do", "per",
                   "every", "most", "near", "us", "good", "should", "having", "great", "also", "over",
                   "r", "could", "twenty", "people", "those", "e", "without", "four", "received", "p", "then",
                   "what", "well", "where", "must", "says", "g", "large", "against", "back", "000", "through",
                   "b", "off", "few", "me", "sent", "while", "make", "number", "many", "much", "give",
                   "1", "six", "down", "several", "high", "since", "little", "during", "away", "until",
                   "each", "5", "year", "present", "own", "t", "here", "d", "found", "reported", "2",
                   "right", "given", "age", "your", "way", "side", "did", "part", "long", "next", "fifty",
                   "another", "1st", "whole", "10", "still", "among", "3", "within", "get", "named", "f",
                   "l", "himself", "ten", "both", "nothing", "again", "n", "thirty", "eight", "took",
                   "never", "came", "called", "small", "passed", "just", "brought", "4", "further",
                   "yet", "half", "far", "held", "soon", "main", "8", "second", "however", "say",
                   "heavy", "thus", "hereby", "even", "ran", "come", "whom", "like", "cannot", "head",
                   "ever", "themselves", "put", "12", "cause", "known", "7", "go", "6", "once", "therefore",
                   "thursday", "full", "apply", "see", "though", "seven", "tuesday", "11", "done",
                   "whose", "let", "how", "making", "immediately", "forty", "early", "wednesday",
                   "either", "too", "amount", "fact", "heard", "receive", "short", "less", "100",
                   "know", "might", "except", "supposed", "others", "doubt", "set", "works"]

# Text cleaning

In [33]:
def remove_words(texts, word_list_filter):
    return [[word for word in simple_preprocess(str(doc)) if word not in word_list_filter] for doc in texts]

Removing `low_frequency_words` is going to take a while, so it is currently commented out. You can run this at home. (**Note**: low frequency words should not be a problem in our case --- they will simply be ignored since they are used only in single texts; expanding the stop word list should be more helpful, since stop words have extremely high frequencies.)
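
Part of the slowness comes from the test `word not in word_list_filter`: when the filter is a list, every lookup scans the whole list. Converting the filter to a `set` first makes each lookup constant-time on average. A sketch with a hypothetical helper, `remove_words_fast`, which assumes that each document is already a list of tokens:

```python
def remove_words_fast(texts, word_list_filter):
    """Drop every token that appears in word_list_filter."""
    blocked = set(word_list_filter)  # set membership is O(1) on average
    return [[word for word in doc if word not in blocked] for doc in texts]

toy_docs = [["the", "national", "crisis"], ["death", "of", "commodore", "platt"]]
print(remove_words_fast(toy_docs, ["the", "of"]))
```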

In [None]:
%%time
dispatch_light["textDataListsFiltered"] = remove_words(dispatch_light["textDataLists"], stop_words_custom)
#dispatch_light["textDataListsFiltered"] = remove_words(dispatch_light["textDataListsFiltered"], low_frequency_words)

In [35]:
dispatch_light["textDataLists"].head(5)

0    [the, national, crisis, particulars, of, the, ...
1    [from, the, charleston, courier, of, friday, f...
2    [death, of, commodore, platt, another, of, our...
3    [death, of, the, last, survivor, of, the, batt...
4    [christmas, in, charleston, it, seemed, on, tu...
Name: textDataLists, dtype: object

In [36]:
dispatch_light["textDataListsFiltered"].head(5)

0    [national, crisis, particulars, evacuation, oc...
1    [charleston, courier, friday, fort, moultrie, ...
2    [death, commodore, platt, eminent, public, ser...
3    [death, survivor, battle, bunker, hill, ralph,...
4    [christmas, charleston, seemed, elements, cons...
Name: textDataListsFiltered, dtype: object

# Saving intermediate results

It makes sense to store the results of expensive preprocessing, so that you do not have to rerun time-consuming operations again and again.

In [37]:
dispatch_light.to_csv("./Dispatch_Processed_TSV/Dispatch_Light_Preprocessed.tsv", sep="\t", index=False)

This spot will now become our new starting point.
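
One caveat when reloading, sketched below on a made-up table: `to_csv` writes list columns as their string representation, so after `read_csv` a column such as `textDataListsFiltered` holds strings rather than lists. `ast.literal_eval` from the standard library converts them back. An in-memory buffer stands in for the TSV file here.

```python
import ast
import io
import pandas

toy = pandas.DataFrame({"id": ["a", "b"], "tokens": [["national", "crisis"], ["death"]]})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, sep="\t", index=False)
buffer.seek(0)

reloaded = pandas.read_csv(buffer, sep="\t", header=0)
print(type(reloaded["tokens"][0]))  # a string, not a list

# Parse the string representation back into real Python lists.
reloaded["tokens"] = reloaded["tokens"].apply(ast.literal_eval)
print(reloaded["tokens"][0])
```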