## Library of Congress & Chronicling America

This notebook uses historic newspapers and select digitized newspaper pages provided by [Chronicling America](https://chroniclingamerica.loc.gov/about/) (ISSN 2475-2703).

This example is based on [*Hispano América*](https://chroniclingamerica.loc.gov/lccn/sn87021178/), a newspaper published in San Francisco.

[Chronicling America](https://chroniclingamerica.loc.gov/about/api/) provides an extensive application programming interface (API) that you can use to explore all of the data. The information is also [published as JSON](https://chroniclingamerica.loc.gov/lccn/sn87021178.json), including links to the OCR text files.

### Setting things up

In [None]:
import pandas as pd
import re
import os
from pathlib import Path
import requests
from numpy import mean, ones
import nltk
import json

### Let's retrieve the results!

The metadata for *Hispano América* is available as a single JSON file. Its *issues* attribute lists the URL of every issue, and the JSON of each issue in turn lists its *pages*.

In [None]:
url = 'https://chroniclingamerica.loc.gov/lccn/sn87021178.json'

r = requests.get(url)
r.raise_for_status()  # stop early if the request failed

ca_dict = r.json()

df = pd.DataFrame(ca_dict['issues'])
df.head()
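
The nesting described above can be explored offline before sending any request. The following sketch uses small hand-written dictionaries with the keys the notebook relies on (`issues`, `url`, `date_issued`, `pages`); the URLs and dates are placeholders, not real records.

```python
# Hand-written stand-in for the title-level JSON; the values are placeholders.
sample_title = {
    "issues": [
        {"url": "https://example.org/lccn/sn00000000/1917-01-06/ed-1.json",
         "date_issued": "1917-01-06"},
        {"url": "https://example.org/lccn/sn00000000/1917-01-13/ed-1.json",
         "date_issued": "1917-01-13"},
    ],
}

# Stand-in for one issue-level JSON: each page entry carries its own URL.
sample_issue = {
    "pages": [
        {"url": "https://example.org/lccn/sn00000000/1917-01-06/ed-1/seq-1.json"},
        {"url": "https://example.org/lccn/sn00000000/1917-01-06/ed-1/seq-2.json"},
    ],
}

# One level down per request: title -> issues -> pages
issue_urls = [issue["url"] for issue in sample_title["issues"]]
page_urls = [page["url"] for page in sample_issue["pages"]]

print(len(issue_urls), "issues;", len(page_urls), "pages in the first issue")
```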

### How many issues?

In [None]:
df.count()

## Retrieving the OCR texts from Chronicling America

**Note:** This step may take a while because it sends one request per issue and two more per page (the page metadata and the OCR text). Run the cell only when you need to download the texts.

In [None]:
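# Make sure the output folder exists before writing to it
os.makedirs('lc-editions', exist_ok=True)

for index, row in df.iterrows():
    print(index, row['url'])
    response = requests.get(row['url'])
    if not response.ok:
        # Skip issues that could not be retrieved
        print('Skipping, HTTP status', response.status_code)
        continue

    text = ''
    for p in response.json()['pages']:
        json_page = requests.get(p['url']).json()
        # json_page['text'] holds the URL of the plain-text OCR file for this page
        ocr = requests.get(json_page['text'])
        # Decode the OCR bytes as UTF-8 (requests may otherwise guess Latin-1)
        ocr.encoding = 'utf-8'
        # Replace line breaks with spaces so that words on adjacent lines do not fuse
        text += ocr.text.replace('\n', ' ') + ' '

    # Build a flat file name from the issue URL ('/' -> '_', 'json' -> 'txt')
    filename = row['url'].replace('https://chroniclingamerica.loc.gov/lccn/sn87021178/','').replace('/', '_').replace('json', 'txt')
    with open('lc-editions/{}'.format(filename), 'w', encoding='utf-8') as out_file:
        out_file.write(text)

df.head(10)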
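
Accented characters are easy to damage when UTF-8 bytes are decoded as Latin-1. The following standard-library sketch reproduces the problem with the words *Hispano América* and shows two equivalent repairs: decoding the bytes as UTF-8 from the start, or round-tripping the damaged string through Latin-1.

```python
raw = "Hispano América".encode("utf-8")   # bytes as a server might send them

garbled = raw.decode("latin-1")           # wrong guess: 'Ã©' appears instead of 'é'
repaired = garbled.encode("latin-1").decode("utf-8")  # undo the wrong guess
direct = raw.decode("utf-8")              # decode correctly from the start

print(garbled)
print(repaired == direct)
```

The round trip only works when every character of the damaged string exists in Latin-1, so decoding correctly in the first place is the more robust option.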

## Now we load the text into pandas DataFrame

In [None]:
for index, row in df.iterrows():
    print(index, row['url'])
   
    filename = Path('lc-editions/{}'.format(row['url'].replace('https://chroniclingamerica.loc.gov/lccn/sn87021178/','').replace('/', '_').replace('json', 'txt')))
    
    text = ''
    
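    if os.path.exists(filename):
        # Read with an explicit encoding so accented characters survive on every platform
        with open(filename, 'r', encoding='utf-8') as myfile:
            text = myfile.read()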
  
    df.loc[index, 'ocr_text'] = text

df.head(10)  
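
Both the download and the loading step derive the local file name from the issue URL with the same chain of `replace` calls. One way to avoid repeating it is a small helper. This is a sketch: `issue_filename` and `BASE_URL` are hypothetical names, and the URL in the usage line only illustrates the expected shape.

```python
from pathlib import Path

BASE_URL = 'https://chroniclingamerica.loc.gov/lccn/sn87021178/'

def issue_filename(issue_url, folder='lc-editions'):
    """Map an issue URL to the local text file path used in this notebook."""
    name = issue_url.replace(BASE_URL, '').replace('/', '_').replace('json', 'txt')
    return Path(folder) / name

# Illustrative URL; real values come from df['url']
example = issue_filename(BASE_URL + '1917-01-06/ed-1.json')
print(example.name)
```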

## Extracting the years from the dates

In [None]:
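# Take the first four characters of each date (the year); values that
# cannot be parsed become missing (<NA>) instead of an empty string
df['year'] = pd.to_numeric(df['date_issued'].str[:4], errors='coerce').astype('Int64')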

In [None]:
df.head(3)
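
As a sketch of what the year extraction does for a single value, assuming dates start with a four-digit year as in `YYYY-MM-DD` (an assumption worth checking against `date_issued` in your data): `extract_year` is a hypothetical helper that catches only the expected errors and returns `None` for unparseable values. The dates below are made up.

```python
from collections import Counter

def extract_year(date_issued):
    """Return the leading four-digit year as an int, or None if it cannot be parsed."""
    try:
        return int(date_issued[:4])
    except (TypeError, ValueError):
        return None

# Made-up dates for illustration
dates = ['1917-01-06', '1917-01-13', '1918-02-02', None, 'unknown']
years = [extract_year(d) for d in dates]

# Count issues per year, ignoring values that could not be parsed
issues_per_year = Counter(y for y in years if y is not None)
print(years)
print(issues_per_year)
```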

## Remove punctuation/lower casing/stopwords
Next, let's apply some simple preprocessing to the text to make it more amenable to analysis and the results more reliable. We use a regular expression to remove punctuation, lowercase the text, drop very short tokens, and remove stopwords.

Tokens of one or two characters are dropped because the OCR may contain errors; no dictionary lookup (for example against WordNet) is performed in the cells below. We have also added some specific stopwords to improve the results.

The `initial_clean` function performs the first pass: it removes punctuation, lowercases the text, and splits it into tokens.

In [None]:
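def initial_clean(text):
    """
    Clean the text: remove punctuation, lowercase, and tokenize.
    """
    # sub() returns a new string, so the result must be assigned
    text = re.sub(r'[",.!?]', '', text)
    text = text.lower() # lower case text
    # word_tokenize needs NLTK's Punkt data: nltk.download('punkt')
    # (recent NLTK releases name this resource 'punkt_tab')
    text = nltk.word_tokenize(text)
    return text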
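
A common pitfall in this kind of cleaning is that `re.sub` (like `Pattern.sub`) returns a new string rather than modifying its argument, so the result has to be assigned. The sketch below shows the punctuation and lowercasing steps on a made-up sentence. It uses `str.split` in place of `nltk.word_tokenize` so that it runs without downloading NLTK data, which is a simplification; `simple_clean` is a hypothetical name.

```python
import re

PUNCTUATION = re.compile(r'[",.!?]')  # same characters as in initial_clean

def simple_clean(text):
    """Remove punctuation, lowercase, and split on whitespace (not NLTK tokenization)."""
    text = PUNCTUATION.sub('', text)   # sub() returns the cleaned string
    return text.lower().split()

tokens = simple_clean('La "Independencia", dijo. Viva el trabajo!')
print(tokens)
```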


In [None]:
def remove_words(text):
    """
    Drop tokens of two characters or fewer.
    """
    return [token for token in text if len(token) > 2]

## Removing stop words

Stop words are words that do not add much meaning to a sentence, for example English words such as *the*, *he*, and *have*.

Several Python packages provide stopword lists, and these lists can be customized.

In [None]:
from nltk.corpus import stopwords
nltk.download('stopwords')
# A set makes the membership test in remove_stop_words fast
stop_words = set(stopwords.words('spanish'))
# Custom stopwords added on top of the NLTK list
stop_words.update(['with', 'song', 'guitar', 'spanish', 'typical'])

def remove_stop_words(text):
    return [word for word in text if word not in stop_words]
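
To see the filter in isolation without downloading the NLTK list, the following sketch uses a tiny hand-picked stop list and a `set` for the membership test. The list and the function name are assumptions for illustration only; the notebook itself uses NLTK's Spanish list plus the custom words.

```python
# Tiny illustrative stop list; this is not NLTK's list
demo_stop_words = {'el', 'la', 'de', 'que', 'y', 'with'}

def remove_demo_stop_words(tokens):
    """Keep only the tokens that are not in the demo stop list."""
    return [t for t in tokens if t not in demo_stop_words]

filtered = remove_demo_stop_words(['la', 'independencia', 'de', 'trabajadores', 'with'])
print(filtered)
```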

We create a function to perform the whole process.

In [None]:
def apply_all(text):
    """
    Apply all of the cleaning functions above in sequence
    """
    return remove_stop_words(remove_words(initial_clean(text)))

Finally, we process the original text with the pandas `apply` method.

In [None]:
# clean the OCR text and create the new column "tokenized_text"
import time
t1 = time.time()
df['tokenized_text'] = df['ocr_text'].apply(apply_all)
t2 = time.time()
print("Time to clean and tokenize", len(df), "issues:", (t2-t1)/60, "min")

## Checking the result

In [None]:
df.head(10)

In [None]:
df['tokenized_text']

## Create Gensim Dictionary and Corpus
Topic modeling with LDA is based on a dictionary and a corpus. This example uses the gensim library to build both.

In [None]:
# LDA
import gensim
from gensim import corpora, models, similarities

In [None]:
tokenized = df['tokenized_text']

# Create the term dictionary of the corpus, where each unique term is assigned an index.
dictionary = corpora.Dictionary(tokenized)
# Keep terms that appear in at least 1 document (which every term does)
# and in no more than 80% of the documents.
dictionary.filter_extremes(no_below=1, no_above=0.8)
# Convert each tokenized document to a bag-of-words list of (term id, count) pairs
corpus = [dictionary.doc2bow(tokens) for tokens in tokenized]
#print(corpus[:1])
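
To make the dictionary and bag-of-words idea concrete, here is a standard-library sketch that mimics the three steps on made-up tokens. It illustrates the concept and is not gensim's implementation: the function names are hypothetical, ids are assigned in alphabetical order, and gensim's additional `keep_n` limit is ignored.

```python
from collections import Counter

# Made-up tokenized documents
docs = [
    ['independencia', 'pueblo', 'trabajadores'],
    ['pueblo', 'guerra', 'pueblo'],
    ['pueblo', 'trabajadores', 'huelga'],
]

# Document frequency: in how many documents does each term appear?
doc_freq = Counter(term for doc in docs for term in set(doc))

def filter_terms(doc_freq, num_docs, no_below=1, no_above=0.8):
    """Keep terms found in at least no_below documents and at most no_above of all documents."""
    return sorted(t for t, n in doc_freq.items()
                  if n >= no_below and n <= no_above * num_docs)

# 'pueblo' appears in 3 of 3 documents (100% > 80%), so it is dropped
term_to_id = {term: i for i, term in enumerate(filter_terms(doc_freq, len(docs)))}

def to_bow(tokens):
    """Return sorted (term id, count) pairs for the terms kept in the dictionary."""
    counts = Counter(t for t in tokens if t in term_to_id)
    return sorted((term_to_id[t], n) for t, n in counts.items())

bow_corpus = [to_bow(doc) for doc in docs]
print(term_to_id)
print(bow_corpus)
```

Each document ends up as a short list of pairs, which is the representation the LDA model consumes.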

## Building the Topic Model
In this step, `num_topics` is the number of topics to create and `passes` is the number of times the algorithm iterates over the entire corpus. Running LDA produces the topics.

In [None]:
import warnings
warnings.simplefilter("ignore", DeprecationWarning)

#LDA
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics = 5, id2word=dictionary, passes=15)
ldamodel.save('model_combined.gensim')
topics = ldamodel.print_topics(num_words=4)
for topic in topics:
    print(topic)

This output shows the 5 topics and, for each one, the 4 words that best describe it. From this output we can guess that each topic and its words revolve around a common theme (e.g., topic 3 is related to *independencia* and *trabajadores*). Because LDA is initialized randomly, the topic numbers and words can differ between runs.
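
`print_topics` returns each topic as a pair of a topic id and a formatted string of weighted words. If only the words are needed, for example to label topics, they can be pulled out with a regular expression. The sample below is made up to show the shape of the output and is not a result from this corpus; `topic_words` is a hypothetical helper.

```python
import re

# Made-up sample in the shape returned by print_topics(num_words=4)
sample_topics = [
    (0, '0.012*"pueblo" + 0.010*"gobierno" + 0.009*"guerra" + 0.008*"ciudad"'),
    (1, '0.015*"independencia" + 0.011*"trabajadores" + 0.007*"huelga" + 0.006*"patria"'),
]

def topic_words(topic_string):
    """Extract (word, weight) pairs from a 'weight*"word" + ...' string."""
    return [(word, float(weight))
            for weight, word in re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)]

for topic_id, topic_string in sample_topics:
    print(topic_id, [word for word, _ in topic_words(topic_string)])
```

When the model object is at hand, gensim's `show_topic` returns (word, probability) pairs directly, which avoids the parsing step.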