# Environment Setup

This notebook uses the modules listed below. If any of them are not installed, install them from the command line using a Bash shell, Terminal, or the Anaconda Prompt.

- `wordcloud`
- `matplotlib`
- `spaCy`
- `nltk`
- `gensim`
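
Before installing anything, it can help to see which of these modules Python can already find. The following is a minimal sketch using only the standard library; the helper name `missing_modules` is made up for this example, and the list uses import names (spaCy is imported as `spacy`).

```python
# Check which of the required packages can be imported in this environment.
from importlib.util import find_spec

# Import names for the modules listed above.
REQUIRED = ["wordcloud", "matplotlib", "spacy", "nltk", "gensim"]


def missing_modules(names):
    """Return the names that cannot be found on the current import path."""
    return [name for name in names if find_spec(name) is None]


missing = missing_modules(REQUIRED)
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All required modules are installed.")
```

Anything reported as not installed can usually be added with `python -m pip install <name>` or through conda; the INSTALL links in each section give the recommended steps.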

## wordcloud

"A little word cloud generator in Python," made by Andreas Mueller

Documentation:
- INSTALL: ["word_cloud", GitHub](https://github.com/amueller/word_cloud)
- ["WordCloud for Python documentation"](https://amueller.github.io/word_cloud/)

## matplotlib

"Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python...Matplotlib produces publication-quality figures in a variety of hardcopy formats and interactive environments across platforms. Matplotlib can be used in Python scripts, the Python and IPython shell, web application servers, and various graphical user interface toolkits"

Documentation:
- INSTALL: ["Installation," matplotlib documentation](https://matplotlib.org/stable/users/installing.html)
- ["Documentation," matplotlib](https://matplotlib.org/stable/tutorials/index.html)

## spaCy

"spaCy is an open-source software library for advanced natural language processing, written in the programming languages Python and Cython"

Documentation:
- INSTALL: ["Install spaCy", spaCy documentation](https://spacy.io/usage)
- ["GUIDES", spaCy documentation](https://spacy.io/usage/linguistic-features)

## nltk

"NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum."

Documentation:
- INSTALL
  * ["Installing NLTK," NLTK documentation](https://www.nltk.org/install.html)
  * ["Installing NLTK Data," NLTK documentation](http://www.nltk.org/data.html)
- [NLTK Documentation](https://www.nltk.org/index.html)

For more on NLTK:
- Steven Bird, Ewan Klein, and Edward Loper, [*Natural Language Processing With Python: Analyzing Text with the Natural Language Toolkit*](http://www.nltk.org/book/) (O'Reilly, 2009).

## gensim

"Gensim is a free open-source Python library for representing documents as semantic vectors, as efficiently (computer-wise) and painlessly (human-wise) as possible. Gensim is designed to process raw, unstructured digital texts (”plain text”) using unsupervised machine learning algorithms."

The core concepts of gensim are:
- Document: some text.
- Corpus: a collection of documents.
- Vector: a mathematically convenient representation of a document.
- Model: an algorithm for transforming vectors from one representation to another.
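
To make these four concepts concrete, here is a rough plain-Python illustration. It does not use the gensim API; the sample sentences and the names `vocabulary`, `to_vector`, and `binary_model` are invented for this sketch.

```python
# Plain-Python illustration of the four concepts (not the gensim API).

# Document: some text.
document = "the cat sat on the mat"

# Corpus: a collection of documents.
corpus = [document, "the dog sat", "a cat and a dog"]

# Build a vocabulary that maps each distinct token to an integer id.
vocabulary = {}
for doc in corpus:
    for token in doc.split():
        vocabulary.setdefault(token, len(vocabulary))


def to_vector(doc):
    """Vector: a sparse bag-of-words list of (token_id, count) pairs."""
    counts = {}
    for token in doc.split():
        token_id = vocabulary[token]
        counts[token_id] = counts.get(token_id, 0) + 1
    return sorted(counts.items())


def binary_model(vector):
    """Model: transforms one representation into another (counts -> presence)."""
    return [(token_id, 1) for token_id, _ in vector]


vector = to_vector(document)
print(vector)
print(binary_model(vector))
```

In gensim itself, the dictionary and bag-of-words steps are handled by `corpora.Dictionary` and `doc2bow`, as the gensim cells later in this notebook show.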


Documentation:
- INSTALL: ["Installation", Gensim documentation](https://radimrehurek.com/gensim/intro.html#installation)
- ["Documentation," Gensim](https://radimrehurek.com/gensim/auto_examples/index.html)

## Putting It All Together

# Generate WordCloud

In [None]:
# import modules
from wordcloud import WordCloud

# read text (the file is closed automatically after reading)
with open('sample_output_file.txt') as f:
    text = f.read()

In [None]:
# create image
wordcloud = WordCloud().generate(text)

# display image using matplotlib
import matplotlib.pyplot as plt
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')

In [None]:
# lower max font size
wordcloud = WordCloud(max_font_size=40).generate(text)
plt.figure()
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

In [None]:
# non-matplotlib option
image = wordcloud.to_image()
image.show()

# Getting Started With spaCy

In [None]:
# import module
import spacy

# load English tokenizer, tagger, parser, and NER
# (the model must be downloaded first: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")

# process document
spacy_text = text
doc = nlp(spacy_text)

print(doc)

In [None]:
# analyze syntax
print("Noun phrases:", [chunk.text for chunk in doc.noun_chunks])
print("Verbs:", [token.lemma_ for token in doc if token.pos_ == 'VERB'])

In [None]:
# part-of-speech tagging
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_, token.shape_, token.is_alpha, token.is_stop)

In [None]:
# find named entities
for entity in doc.ents:
    print(entity.text, entity.label_)

In [None]:
# visualizing named entities
from spacy import displacy

# render inline in the notebook; displacy.serve() would start a local web server instead
displacy.render(doc, style="ent")

# Getting Started With NLTK

We can use `nltk.download()` to download additional nltk components.

In [None]:
# load additional nltk data
import nltk
nltk.download()

The next step with `nltk` is to tokenize the text, which then lets us access other components of `nltk`.

In [None]:
# tokenize text using nltk
import nltk
tokens = nltk.word_tokenize(text)
print(tokens)

In [None]:
# tag tokens by part of speech
tagged = nltk.pos_tag(tokens)
print(tagged)

In [None]:
# identify and extract named entities
entities = nltk.chunk.ne_chunk(tagged)
entities

`NLTK` also includes other options for tokenizing a text.

In [None]:
# tokenize using word_tokenize
from nltk.tokenize import word_tokenize

word_tokenize(text)

In [None]:
# tokenize using TreebankWordTokenizer
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()

tokenizer.tokenize(text)

In [None]:
# tokenize using WordPunctTokenizer 
from nltk.tokenize import WordPunctTokenizer

tokenizer = WordPunctTokenizer()

tokenizer.tokenize(text)

In [None]:
# tokenize using RegexpTokenizer
from nltk.tokenize import RegexpTokenizer

# raw string so that \w reaches the regex engine unchanged
tokenizer = RegexpTokenizer(r"[\w']+")

regex_words = tokenizer.tokenize(text)

regex_words
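
To see what the pattern `[\w']+` keeps and drops, it can be tried directly with Python's built-in `re` module. This is only a sketch on a made-up sentence; as far as I understand, `RegexpTokenizer` with its default settings returns the pattern's matches in much the same way.

```python
import re

# The same pattern passed to RegexpTokenizer above: runs of word characters
# and apostrophes. The sample sentence is made up for illustration.
pattern = re.compile(r"[\w']+")
sample = "Don't panic: it's only a test."

print(pattern.findall(sample))
# prints ["Don't", 'panic', "it's", 'only', 'a', 'test']
```

Contractions stay in one piece because the apostrophe is part of the character class, while the colon and the final period are dropped.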

We can combine `word_tokenize` with a few other data wrangling steps to output a cleaned list of words. Repeated words are kept, because the frequency counts later on depend on them.

In [None]:
# tokenize using word_tokenize
from nltk.tokenize import word_tokenize

tokens = word_tokenize(text)

# convert to lower case
tokens = [w.lower() for w in tokens]

# remove punctuation/special characters
import string
table = str.maketrans('', '', string.punctuation)
stripped = [w.translate(table) for w in tokens]

# remove non-text content
words = [word for word in stripped if word.isalpha()]

# filter out stop words
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

words = [w for w in words if w not in stop_words]

# optional: keep only words longer than 3 characters
# words = [word for word in words if len(word) > 3]

# output cleaned list of words
print(words)
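
The same cleaning steps can be traced on a single sentence without any downloads. The sketch below is a simplification: the sample sentence and the two-word stop list are invented, and whitespace splitting stands in for `word_tokenize`.

```python
import string

# Hypothetical sample text and a tiny stand-in stop word list; the notebook
# uses nltk's word_tokenize and its English stop word list instead.
sample = "The Quick brown fox didn't jump over 2 lazy dogs!"
stop_words = {"the", "over"}

# split on whitespace (a simplification of word_tokenize)
tokens = sample.split()

# convert to lower case
tokens = [w.lower() for w in tokens]

# remove punctuation/special characters
table = str.maketrans("", "", string.punctuation)
stripped = [w.translate(table) for w in tokens]

# keep alphabetic tokens only, then drop stop words
words_sample = [w for w in stripped if w.isalpha() and w not in stop_words]

print(words_sample)
# prints ['quick', 'brown', 'fox', 'didnt', 'jump', 'lazy', 'dogs']
```

Note that stripping punctuation turns "didn't" into "didnt", so it is worth checking how contractions in your own text come out of this step.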

We can then take that list of words and plot term frequency and distribution.

In [None]:
# import nltk components
from nltk.probability import FreqDist

# analyze term frequency/distribution
data_analysis = FreqDist(words)

data_analysis

In [None]:
# plot term frequency/distribution for all terms
data_analysis.plot()

In [None]:
# show term frequency/distribution for top 10 terms
for word, frequency in data_analysis.most_common(10):
    print(u'{};{}'.format(word, frequency))
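
`FreqDist` is built on Python's `collections.Counter`, so the counting itself can be sketched with the standard library alone. The word list here is a hypothetical stand-in for the cleaned `words` list.

```python
from collections import Counter

# Hypothetical cleaned word list standing in for `words`.
sample_words = ["archive", "letter", "archive", "map", "letter", "archive"]

counts = Counter(sample_words)

# top 2 terms with their counts
for word, frequency in counts.most_common(2):
    print("{};{}".format(word, frequency))
# prints archive;3 then letter;2
```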

If you really want to dig in with NLTK:
- Steven Bird, Ewan Klein, and Edward Loper, [*Natural Language Processing With Python: Analyzing Text with the Natural Language Toolkit*](http://www.nltk.org/book/) (O'Reilly, 2009).

# Getting Started With Gensim

In [None]:
# creating a gensim dictionary from a single txt file

from gensim.utils import simple_preprocess
from gensim import corpora

# each line of the file is treated as one document
with open('sample.txt') as f:
    dictionary = corpora.Dictionary(simple_preprocess(line, deacc=True) for line in f)

dictionary.token2id

In [None]:
# create bag of words from single text file
from gensim import corpora
from gensim.utils import simple_preprocess
import smart_open


# corpus class that streams one line (document) at a time
class BoWCorpus(object):
    def __init__(self, path, dictionary):
        self.filepath = path
        self.dictionary = dictionary

    def __iter__(self):
        with smart_open.open(self.filepath, encoding='latin') as f:
            for line in f:
                # tokenize
                tokenized_list = simple_preprocess(line, deacc=True)

                # create bag of words; allow_update=True adds new tokens to the
                # dictionary that was passed in, so no separate merge is needed
                bow = self.dictionary.doc2bow(tokenized_list, allow_update=True)

                # lazy return the BoW
                yield bow


# create dictionary
mydict = corpora.Dictionary()

# build corpus from txt file
bow_corpus = BoWCorpus('sample.txt', dictionary=mydict)  # memory friendly

# print token id and count for each line
for line in bow_corpus:
    print(line)
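
The corpus above is a class with an `__iter__` method rather than a one-off generator, which means it can be looped over more than once. The sketch below shows that idea with the standard library only; `LineCorpus` and the temporary file are hypothetical, and whitespace splitting stands in for `simple_preprocess`.

```python
import os
import tempfile


class LineCorpus:
    """Yield one whitespace-tokenized line at a time from a text file."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # The file is reopened on every pass, so the corpus can be reused.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                yield line.lower().split()


# Write a small throwaway file so the sketch runs on its own.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("First line here\nSecond line\n")

line_corpus = LineCorpus(tmp.name)
first_pass = list(line_corpus)
second_pass = list(line_corpus)  # works again because __iter__ reopens the file

print(first_pass)
print(first_pass == second_pass)

# clean up the throwaway file
os.remove(tmp.name)
```

Only one line is held in memory at a time, which is why this pattern is described as memory friendly for large text files.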

In [None]:
# show word weights
from gensim import models
import numpy as np

for doc in bow_corpus:
    print([[mydict[id], freq] for id, freq in doc])

In [None]:
# create tf-idf model
tfidf = models.TfidfModel(bow_corpus, smartirs='ntc')

# Show the TF-IDF weights
for doc in tfidf[bow_corpus]:
    print([[mydict[id], np.around(freq, decimals=2)] for id, freq in doc])
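
For intuition about what the weights mean, here is a rough hand-rolled version of the calculation. It assumes that `'ntc'` corresponds to raw term counts, an inverse document frequency of log2(N / df), and cosine normalization; that reading is an assumption, and gensim's own numbers may differ in detail. The three short documents are invented.

```python
import math
from collections import Counter

# Hypothetical tokenized documents, standing in for the lines of a text file.
docs = [["cat", "sat"], ["cat", "ran"], ["dog", "ran", "ran"]]

# document frequency: number of documents each term appears in
doc_freq = Counter(term for doc in docs for term in set(doc))
n_docs = len(docs)


def tfidf_vector(doc):
    """Raw term count * log2(N / df), scaled to unit (cosine) length."""
    counts = Counter(doc)
    weights = {t: c * math.log2(n_docs / doc_freq[t]) for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else weights


for doc in docs:
    print({t: round(w, 2) for t, w in tfidf_vector(doc).items()})
```

In the first document, "sat" receives a higher weight than "cat" because it appears in only one of the three documents, which is the behavior TF-IDF is designed to capture.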

## Python Text Analysis Next Steps and Additional Resources

### More documentation on tools covered in this notebook

**`wordcloud`**
- wordcloud, "[WordCloud for Python documentation](https://amueller.github.io/word_cloud/)"
- wordcloud, "[Gallery of Examples](https://amueller.github.io/word_cloud/auto_examples/index.html#example-gallery)"
- Duong Vu, "[Generating Word Clouds in Python](https://www.datacamp.com/community/tutorials/wordcloud-python)" *DataCamp* (8 November 2019)
- Zolzaya Luvsandorj, "[Simple Word Cloud in Python](https://towardsdatascience.com/simple-wordcloud-in-python-2ae54a9f58e5)" *Towards Data Science* (20 June 2020)

**`spaCy`**
- spaCy, "[spaCy 101: Everything you need to know](https://spacy.io/usage/spacy-101)"
- spaCy, "[Advanced NLP with spaCy](https://course.spacy.io/en/)"
- Ines Montani, "[spaCy Cheat Sheet: Advanced NLP in Python](https://www.datacamp.com/community/blog/spacy-cheatsheet)" *DataCamp* (14 July 2021)
- Conor Mc., "[A short introduction to NLP in Python with spaCy](https://towardsdatascience.com/a-short-introduction-to-nlp-in-python-with-spacy-d0aa819af3ad)" *Towards Data Science* (17 March 2017)

**`nltk`**
- Steven Bird, Ewan Klein, and Edward Loper, *[Natural Language Processing With Python: Analyzing Text with the Natural Language Toolkit](http://www.nltk.org/book/)* (O'Reilly, 2009).
  * Free online book includes chapters on processing tasks, categorizing/tagging words, classifying text, extracting information, analyzing sentence structure, and other tasks.
- Avinash Navlani, "[Text Analytics for Beginners using NLTK](https://www.datacamp.com/community/tutorials/text-analytics-beginners-nltk)" *DataCamp* (13 December 2019)
- Michelle Morales, "[How to Work with Language Data in Python 3 Using the Natural Language Toolkit (NLTK)](https://www.digitalocean.com/community/tutorials/how-to-work-with-language-data-in-python-3-using-the-natural-language-toolkit-nltk)" *Digital Ocean* (3 January 2017)
- Shaumik Daityari, "[How to Perform Sentiment Analysis in Python 3 using the Natural Language Toolkit (NLTK)](https://www.digitalocean.com/community/tutorials/how-to-perform-sentiment-analysis-in-python-3-using-the-natural-language-toolkit-nltk)" *Digital Ocean* (26 September 2019)
- Zoë Wilkinson Saldaña, "Sentiment Analysis for Exploratory Data Analysis," The Programming Historian 7 (2018), https://doi.org/10.46430/phen0079.
- François Dominic Laramée, "Introduction to stylometry with Python," The Programming Historian 7 (2018), https://doi.org/10.46430/phen0078.

### Other Tutorials, Packages, and Methods

For a more detailed overview of text analysis methods (start here before diving into a new method or tutorial):
- Heather Froehlich, "[Text Analysis: An Introduction, Methods](https://guides.libraries.psu.edu/c.php?g=829065&p=5922906)" *PennState University Library Guide*
- Stéfan Sinclair & Geoffrey Rockwell, *[The Art of Literary Text Analysis](https://nbviewer.jupyter.org/github/sgsinclair/alta/blob/master/ipynb/ArtOfLiteraryTextAnalysis.ipynb)*, edited by Melissa Mony (last modified 12 January 2018)
  * Includes Jupyter notebooks with more resources on `nltk`, visualizing textual data, and a variety of analysis methods 

For a more detailed overview of various tools and tutorials (beyond those listed here):
- Alan Liu, "[Tutorials for DH Tools and Methods, Text Analysis](http://dhresourcesforprojectbuilding.pbworks.com/w/page/69244314/Tutorials%20for%20DH%20Tools%20and%20Methods#tutorials-text-analysis)" *digital humanities resources for project building* (2019)

**List of words and word frequency**
- William J. Turkel and Adam Crymble, "Normalizing Textual Data with Python," The Programming Historian 1 (2012), https://doi.org/10.46430/phen0014.
- William J. Turkel and Adam Crymble, "Counting Word Frequencies with Python," The Programming Historian 1 (2012), https://doi.org/10.46430/phen0003.
- William J. Turkel and Adam Crymble, "Keywords in Context (Using n-grams) with Python," The Programming Historian 1 (2012), https://doi.org/10.46430/phen0010.

**TF-IDF (term frequency, inverse document frequency)**
- Matthew J. Lavin, "Analyzing Documents with TF-IDF," The Programming Historian 8 (2019), https://doi.org/10.46430/phen0082.
- John R. Ladd, "Understanding and Using Common Similarity Measures for Text Analysis," The Programming Historian 9 (2020), https://doi.org/10.46430/phen0089.
  * Both of these tutorials use [`scikit-learn`'s tf-idf implementation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)

**Topic Modeling (looking for patterns of words in text/corpus to build topics, builds on tf-idf)**
- For more specifics on topic modeling as a methodology:
  * Ted Underwood, "[Topic modeling made just simple enough](https://tedunderwood.com/2012/04/07/topic-modeling-made-just-simple-enough/)" *blog* (7 April 2012)
  * Scott Weingart, "[Topic Modeling for Humanists: A Guided Tour](http://scottbot.net/topic-modeling-for-humanists-a-guided-tour/)" *blog* (25 July 2012)
- Shawn Graham, Scott Weingart, and Ian Milligan, "Getting Started with Topic Modeling and MALLET," The Programming Historian 1 (2012), https://doi.org/10.46430/phen0017.
- Shashank Kapadia, "[Topic Modeling in Python: Latent Dirichlet Allocation (LDA)](https://towardsdatascience.com/end-to-end-topic-modeling-in-python-latent-dirichlet-allocation-lda-35ce4ed6b3e0)" *Towards Data Science* (14 April 2019)
- Susan Li, "[Topic Modelling in Python with NLTK and Gensim](https://towardsdatascience.com/topic-modelling-in-python-with-nltk-and-gensim-4ef03213cd21)" *Towards Data Science* (30 March 2018)

**Sentiment analysis (emotional intensity/weight for words and phrases in a text)**
- Zoë Wilkinson Saldaña, "Sentiment Analysis for Exploratory Data Analysis," The Programming Historian 7 (2018), https://doi.org/10.46430/phen0079.

**Stylometry (literary style, authorship attribution)**
- François Dominic Laramée, "Introduction to stylometry with Python," The Programming Historian 7 (2018), https://doi.org/10.46430/phen0078.

**Geoparsing (extracting and plotting location information)**
- Beatrice Alex, "Geoparsing English-Language Text with the Edinburgh Geoparser," The Programming Historian 6 (2017), https://doi.org/10.46430/phen0067.

**Working with other source material (other archives, APIs, etc)**
- Stephen Krewson, "Extracting Illustrated Pages from Digital Libraries with Python," The Programming Historian 8 (2019), https://doi.org/10.46430/phen0084.
  * Tutorial covers workflows for getting digital image content from HathiTrust and Internet Archive, but could also get you started with an OCR workflow for texts in these collections.
- Resources for using the Library of Congress' "Chronicling America" collection (wide variety of national and local U.S. newspapers, 1777-1963)
  * Library of Congress, "[About the Site and API](https://chroniclingamerica.loc.gov/about/api/)" *Chronicling America: Historic American Newspapers*
  * Library of Congress, "[Available Tutorials](https://libraryofcongress.github.io/data-exploration/all-tutorials.html)" *loc.gov JSON API*
  * Hugo van Kemenade, "[chroniclingamerica.py: Python API to search Chronicling America newspaper pages](https://github.com/hugovk/chroniclingamerica.py)" *GitHub* (2016)
  * Jason Heppler, "[Working with the Chronicling America API](https://observablehq.com/@hepplerj/working-with-the-chronicling-america-api)" *Observable* (6 July 2020)

**Stanford Natural Language Processing Group**

From their "[Software](https://nlp.stanford.edu/software/)" page: "The Stanford NLP Group makes some of our Natural Language Processing software available to everyone! We provide statistical NLP, deep learning NLP, and rule-based NLP tools for major computational linguistics problems, which can be incorporated into applications with human language technology needs. These packages are widely used in industry, academia, and government...All our supported software distributions are written in Java."

**`scikit-learn`**
- `scikit-learn`, "[Working With Text Data](https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html)": The `scikit-learn` machine-learning library includes a number of modules and resources for natural language processing, including tasks like feature extraction and linear models for categorization.