<a href="https://colab.research.google.com/github/meva77/Yellow-Copper/blob/master/LDA_No_more_silence.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Topic Modeling for No More Silence

This notebook is a proof of concept of topic modeling of the OCRed text from the No More Silence corpus.

Topic modeling is a type of statistical modeling for discovering the abstract “topics” that occur in a collection of documents. Latent Dirichlet Allocation (LDA) is one such topic model: it represents each document as a mixture of topics and each topic as a distribution over words, with Dirichlet priors on both distributions. Fitting LDA therefore yields a topics-per-document model and a words-per-topic model, which can be used to assign the text of a document to its most likely topics.
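
To make the two Dirichlet-based models concrete, here is a minimal sketch of the generative story that LDA assumes. It is not the fitting procedure and it is not used elsewhere in this notebook; the toy vocabulary, the number of topics, and the concentration value 0.5 are all assumptions chosen for illustration.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, vocab, alpha, n_words, rng):
    """Generate one toy document following LDA's generative story."""
    # Per-document topic mixture (the topics-per-document model).
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(n_words):
        # Pick a topic for this word, then pick a word from that topic.
        topic = rng.choices(range(len(theta)), weights=theta)[0]
        words.append(rng.choices(vocab, weights=topic_word[topic])[0])
    return theta, words

rng = random.Random(2018)
vocab = ["aids", "health", "clinic", "vote", "senate", "bill"]  # toy vocabulary (assumption)
n_topics = 2  # assumption

# One word distribution per topic (the words-per-topic model).
topic_word = [sample_dirichlet([0.5] * len(vocab), rng) for _ in range(n_topics)]

theta, words = generate_document(topic_word, vocab, [0.5] * n_topics, 8, rng)
print(theta)
print(words)
```

Fitting LDA runs this story in reverse: given only the observed words, it estimates the topic mixtures and word distributions that most plausibly produced them.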

## Import data

In [0]:
import pandas as pd
import os

# Link to Google Drive folder
from google.colab import drive
# This will open a new tab and prompt for an authorization code.
drive.mount('/content/drive', force_remount = True)

# Customize project path to the individual analyst's google drive structure:
# TODO: try to automatically detect what folder this Colab notebook is in.
dir_proj = '/content/drive/My Drive/data-raw'
#dir_proj = '/content/gdrive/Text summarization No More Silence 2019/notebooks/'

# Import the scaled data
# TODO: confirm that this is the right data.
df = pd.read_excel(os.path.join(dir_proj, "NoMoreSilence_ProjectData.xlsx"))


Mounted at /content/drive
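
One possible step toward the TODO above, sketched under assumptions: rather than detecting where the notebook itself lives, check a list of candidate folders and use the first one that contains the data file. The helper name `find_project_dir` is hypothetical; the two candidate paths are the ones that appear in the cell above.

```python
from pathlib import Path

def find_project_dir(candidates, filename):
    """Return the first candidate directory that contains filename, else None."""
    for candidate in candidates:
        path = Path(candidate)
        if (path / filename).is_file():
            return path
    return None

candidates = [
    "/content/drive/My Drive/data-raw",
    "/content/gdrive/Text summarization No More Silence 2019/notebooks/",
]
dir_found = find_project_dir(candidates, "NoMoreSilence_ProjectData.xlsx")
print(dir_found)  # None when neither folder is mounted
```

Each analyst would still need to add their own Drive layout to `candidates`, so this only reduces manual editing rather than removing it.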


In [0]:
!ls "/content/drive/My Drive/data-raw"

NoMoreSilence_ProjectData.xlsx


In [0]:

# Rename text column to not have a space in the name, and to be lowercase.
df.rename(index=str, columns={"Ocr text": "ocr_text"}, inplace = True)

print(df.shape)
print(df.columns)

text_col = "ocr_text"

(735, 53)
Index(['Collection Title', 'Title', 'Local Identifier ', 'Type', 'Date ',
       'Date Type', 'Publication/Origination Info', 'Creator 1 Name',
       'Creator 1 NameType', 'Creator 1 Source', 'Creator 2 Name',
       'Creator 2 NameType', 'Creator 2 Source', 'Creator 3 Name',
       'Creator 3 NameType', 'Creator 3 Source', 'Creator 4 Name',
       'Creator 4 NameType', 'Creator 4 Source', 'Format/Physical Description',
       'Language ', 'Language Code', 'Copyright Status', 'Copyright Statement',
       'Source', 'Subject (Name) 1 Name', 'Subject (Name) 1 Name Type',
       'Subject (Name) 1 Source', 'Subject (Name) 2 Name',
       'Subject (Name) 2 Name Type', 'Subject (Name) 2 Source',
       'Subject (Name) 3 Name', 'Subject (Name) 3 Name Type',
       'Subject (Name) 3 Source', 'Subject (Topic) 1 Heading',
       'Subject (Topic) 1 Heading Type', 'Subject (Topic) 1 Source',
       'Subject (Topic) 2 Heading', 'Subject (Topic) 2 Heading Type',
       'Subject (Topic) 2 
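
Several of the column names listed above carry trailing spaces or punctuation (for example `'Local Identifier '` and `'Date '`). If more columns than `ocr_text` are needed later, the same renaming idea could be applied to all of them. The sketch below is an assumption about how that might look; `to_snake_case` is a hypothetical helper, not part of the original notebook.

```python
import re

def to_snake_case(name):
    """Lowercase a column name and replace runs of non-alphanumerics with '_'."""
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip())
    return cleaned.strip("_").lower()

columns = ["Ocr text", "Local Identifier ", "Date ", "Subject (Name) 1 Name"]
print([to_snake_case(c) for c in columns])
# ['ocr_text', 'local_identifier', 'date', 'subject_name_1_name']
```

Because pandas accepts a function for the `columns` argument of `DataFrame.rename`, this could be applied as `df.rename(columns=to_snake_case)`.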

In [0]:
df.ocr_text.sample(10)

437    INDECENTMATERIALS (Text bySenatorJesseHelms,R-...
182      pgNbr=1 rear. tio^-frele Cjty University nea...
441    ¬ß San Francisco Black CoALmoN on AIDS 1042Div...
712    DECLARATION OF DEAN F. ECHENBERG, M.D. 1, Dean...
607    WE ARE A RESEARCH QROUP THAT PROVIDES AIDS ANT...
619    NEXT WAN MEETING MARCH 7, 1989, 9:30-11:30 25 ...
239      pgNbr=1 Richard Whelan on the Barry Mehler i...
259      pgNbr=1 Christmas 1992 New Year's 1993 Randy...
364                                                     
273    WRITER ‚Ä¢ EDITOR  PUBLIC RELATIONS CONSUL TAN...
Name: ocr_text, dtype: object

In [0]:
# Copy the text column so that adding a column does not write to a slice of df.
data_text = df[['ocr_text']].copy()
data_text['index'] = data_text.index
doc = data_text


In [0]:

print(len(doc))
print(doc[:5])

735
                                            ocr_text index
0  PROPOSITION 64 The AIDSInitiativein California...     0
1  MAKING YOUR WILL California State Aids Legal S...     1
2  January 11, 1997 Community Liaison Committee c...     2
3  ^ GREAT REPUBLIC IIMSURAIMCE COMPANY i 470 SOU...     3
4  SANFRANCISCOAIDSFOUNDATION P.O.BOX 426182,SANF...     4


## Pre-processing

It is good practice to remove as much noise as possible from textual data, so we start with some basic text cleaning. First, import the necessary libraries.

In [0]:
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
import nltk
from nltk.stem import WordNetLemmatizer, SnowballStemmer
import numpy as np

# Fix the random seed so that results are reproducible.
np.random.seed(2018)

# WordNet data is required by WordNetLemmatizer.
nltk.download('wordnet')

stemmer = SnowballStemmer("english")


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [0]:
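# Create the lemmatizer once instead of once per token.
lemmatizer = WordNetLemmatizer()

# Lemmatize a token as a verb, then stem it with the Snowball stemmer.
def lemmatize_stemming(token):
    return stemmer.stem(lemmatizer.lemmatize(token, pos='v'))

# Tokenize a document, drop stopwords and tokens of 3 characters or fewer,
# then lemmatize and stem the remaining tokens.
def preprocess(ocr_text):
    result = []
    for token in simple_preprocess(ocr_text):
        if token not in STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result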
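
The filtering step inside `preprocess` can be illustrated without gensim or NLTK. The sketch below is a simplified stand-in, not the function above: it approximates tokenization with a regular expression, uses a tiny hand-picked stopword set (an assumption; gensim's `STOPWORDS` is far larger), and skips lemmatization and stemming entirely.

```python
import re

# Tiny illustrative stopword set; an assumption, not gensim's STOPWORDS.
TOY_STOPWORDS = {"about", "with", "your", "that", "this", "have"}

def toy_preprocess(text, min_len=4):
    """Lowercase, split on non-letters, drop stopwords and short tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in TOY_STOPWORDS and len(t) >= min_len]

sample = "I've seen you talking about your family's suffering for several months"
print(toy_preprocess(sample))
# ['seen', 'talking', 'family', 'suffering', 'several', 'months']
```

The length threshold matters for OCRed text in particular, because stray one- and two-character fragments are removed before they can reach the topic model.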

In [0]:
doc.ocr_text.iloc[200]

'  pgNbr=1 1 An Open Latter To The Bergalis Family By Randy Shi i ts On television shows on every network, I\'ve seen you talking about your family\'s suffering for several months and about what the government should do to prevent future pain like your\'a¬´ Because anyone in possession of a human near\'t can symo ath i c e with /our anguisn, you\'ve been able to say some very tough things wi t r. ou c. man y of y a ur .1.11 e i\'" v i e we f" as r-j \\ g t ou g hi gu e sit i a n s b ac k ¬´ But it\'s time scmebooy talk back to you with candor, because the legislation you are proposing will hurt people and much of the anger you express will only serve, I feel, to ensure that many thousands of more families will feel the pain of losing someone to AIDS. What you\'re saying is not helping the fight against this disease; it\'s hindering it, and it\'s time you came to understand this. You\'ra angry that your daughter was infected with this horrible virus and is dying,, I understand that. I\'

In [0]:
doc_sample = doc.ocr_text.iloc[200]
print('original document: ')
print(doc_sample.split(' '))
print('\n\n tokenized and lemmatized document: ')
print(preprocess(doc_sample))

original document: 
['', '', 'pgNbr=1', '1', 'An', 'Open', 'Latter', 'To', 'The', 'Bergalis', 'Family', 'By', 'Randy', 'Shi', 'i', 'ts', 'On', 'television', 'shows', 'on', 'every', 'network,', "I've", 'seen', 'you', 'talking', 'about', 'your', "family's", 'suffering', 'for', 'several', 'months', 'and', 'about', 'what', 'the', 'government', 'should', 'do', 'to', 'prevent', 'future', 'pain', 'like', "your'a¬´", 'Because', 'anyone', 'in', 'possession', 'of', 'a', 'human', "near't", 'can', 'symo', 'ath', 'i', 'c', 'e', 'with', '/our', 'anguisn,', "you've", 'been', 'able', 'to', 'say', 'some', 'very', 'tough', 'things', 'wi', 't', 'r.', 'ou', 'c.', 'man', 'y', 'of', 'y', 'a', 'ur', '.1.11', 'e', 'i\'"', 'v', 'i', 'e', 'we', 'f"', 'as', 'r-j', '\\', 'g', 't', 'ou', 'g', 'hi', 'gu', 'e', 'sit', 'i', 'a', 'n', 's', 'b', 'ac', 'k', '¬´', 'But', "it's", 'time', 'scmebooy', 'talk', 'back', 'to', 'you', 'with', 'candor,', 'because', 'the', 'legislation', 'you', 'are', 'proposing', 'will', 'hurt', 

In [0]:
doc_sample = doc.ocr_text.iloc[100]
print('original document: ')
print(doc_sample.split(' '))
print('\n\n tokenized and lemmatized document: ')
print(preprocess(doc_sample))

original document: 
['', 'pgNbr=1', 'I', "'", '', 'BY-¬∑RA¬∑NDY:', 'SHILTS', '¬∑', '"t\'', '', 'pgNbr=2', 'UP', '', 'GROWS', '', "There's", 'a', 'new', 'breed', '', 'of', 'gay', 'leadership', 'in', '', 'San', 'Francisco', 'today.', '', "It's", 'not', 'just', 'in', 'the', '', 'Castro,', 'or', 'even', 'City', '', 'Hall', '...', 'gay', 'clout', 'is', '', 'everywhere.', '', 'ed', '', 'he', 'reception', 'drew', 'an', 'array', 'of', 'San', 'Francisco', '']


 tokenized and lemmatized document: 
['pgnbr', 'shilt', 'pgnbr', 'grow', 'breed', 'leadership', 'francisco', 'today', 'castro', 'citi', 'hall', 'clout', 'recept', 'draw', 'array', 'francisco']
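
Note that the page marker `pgNbr=N` survives pre-processing as the token `pgnbr`, which would then appear in many documents without carrying any topical meaning. One option, sketched here as an assumption rather than as part of the original pipeline, is to strip the markers before tokenizing. The pattern is inferred from the sample output above, and `strip_page_markers` is a hypothetical helper.

```python
import re

# Matches OCR page markers such as "pgNbr=1" (pattern inferred from the samples).
PAGE_MARKER = re.compile(r"pgNbr=\d+")

def strip_page_markers(text):
    """Remove page markers and collapse the whitespace they leave behind."""
    without_markers = PAGE_MARKER.sub(" ", text)
    return " ".join(without_markers.split())

sample = " pgNbr=1 I '  BY RANDY SHILTS  pgNbr=2 UP  GROWS  There's a new breed"
print(strip_page_markers(sample))
# I ' BY RANDY SHILTS UP GROWS There's a new breed
```

The cleaned string could then be passed on as `preprocess(strip_page_markers(text))`. Adding `pgnbr` to the stopword set would be an alternative with a similar effect.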


In [0]:
doc.ocr_text.iloc[100]

' pgNbr=1 I \'  BY-¬∑RA¬∑NDY: SHILTS ¬∑ "t\'  pgNbr=2 UP  GROWS  There\'s a new breed  of gay leadership in  San Francisco today.  It\'s not just in the  Castro, or even City  Hall ... gay clout is  everywhere.  ed  he reception drew an array of San Francisco '