# Environment Setup

This notebook uses the modules listed below. If any of them are not installed, install them from the command line using a Bash shell, Terminal, or the Anaconda Prompt.

- `wordcloud`
- `matplotlib`
- `spaCy`
- `nltk`
- `gensim`
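
A quick way to see which of these modules are missing is to ask Python whether it can locate each one. The following is a minimal sketch using only the standard library; `REQUIRED_MODULES` and `find_missing` are hypothetical names introduced for illustration, and the check only confirms that a module can be found, not that it imports cleanly.

```python
import importlib.util

# Import names for the modules listed above (spaCy is imported as "spacy").
REQUIRED_MODULES = ["wordcloud", "matplotlib", "spacy", "nltk", "gensim"]


def find_missing(modules):
    """Return the names in `modules` that cannot be located in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


missing = find_missing(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are installed.")
```

Any module reported as missing can then be installed by following the INSTALL links in the sections below.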

## wordcloud

"A little word cloud generator in Python," made by Andreas Mueller

"A little word cloud generator in Python," made by Andreas Mueller

Documentation:
- INSTALL: ["word_cloud", GitHub](https://github.com/amueller/word_cloud)
- ["WordCloud for Python documentation"](https://amueller.github.io/word_cloud/)

## matplotlib

"Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python...Matplotlib produces publication-quality figures in a variety of hardcopy formats and interactive environments across platforms. Matplotlib can be used in Python scripts, the Python and IPython shell, web application servers, and various graphical user interface toolkits"

Documentation:
- INSTALL: ["Installation," matplotlib documentation](https://matplotlib.org/stable/users/installing.html)
- ["Documentation," matplotlib](https://matplotlib.org/stable/tutorials/index.html)

## spaCy

"spaCy is an open-source software library for advanced natural language processing, written in the programming languages Python and Cython"

Documentation:
- INSTALL: ["Install spaCy", spaCy documentation](https://spacy.io/usage)
- ["GUIDES", spaCy documentation](https://spacy.io/usage/linguistic-features)

## nltk

"NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum."

Documentation:
- INSTALL
  * ["Installing NLTK," NLTK documentation](https://www.nltk.org/install.html)
  * ["Installing NLTK Data," NLTK documentation](http://www.nltk.org/data.html)
- [NLTK Documentation](https://www.nltk.org/index.html)

For more on NLTK:
- Steven Bird, Ewan Klein, and Edward Loper, [*Natural Language Processing With Python: Analyzing Text with the Natural Language Toolkit*](http://www.nltk.org/book/) (O'Reilly, 2009).

## gensim

"Gensim is a free open-source Python library for representing documents as semantic vectors, as efficiently (computer-wise) and painlessly (human-wise) as possible. Gensim is designed to process raw, unstructured digital texts (”plain text”) using unsupervised machine learning algorithms."

The core concepts of gensim are:
- Document: some text.
- Corpus: a collection of documents.
- Vector: a mathematically convenient representation of a document.
- Model: an algorithm for transforming vectors from one representation to another.
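
The four concepts above can be illustrated without gensim itself. The following is a minimal pure-Python sketch, not gensim's implementation: the toy corpus and the helpers `build_vocabulary`, `to_bow`, and `to_relative_frequency` are hypothetical names introduced only for illustration.

```python
from collections import Counter

# Corpus: a collection of documents (a made-up toy example).
corpus = [
    "human machine interface",
    "machine learning for human language",
    "graph of trees",
]

# Document: some text, here the first entry of the corpus.
document = corpus[0]


def build_vocabulary(docs):
    """Map each distinct token to an integer id, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab


def to_bow(doc, vocab):
    """Vector: represent a document as sorted (token_id, count) pairs."""
    counts = Counter(vocab[t] for t in doc.lower().split() if t in vocab)
    return sorted(counts.items())


def to_relative_frequency(bow):
    """Model: transform a count vector into a relative-frequency vector."""
    total = sum(count for _, count in bow)
    return [(token_id, count / total) for token_id, count in bow]


vocab = build_vocabulary(corpus)
bow_vectors = [to_bow(doc, vocab) for doc in corpus]

print(to_bow(document, vocab))
print(to_relative_frequency(bow_vectors[1]))
```

In gensim, the vocabulary role is played by `corpora.Dictionary` and the count vectors by `doc2bow`, both of which appear later in this notebook.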


Documentation:
- INSTALL: ["Installation", Gensim documentation](https://radimrehurek.com/gensim/intro.html#installation)
- ["Documentation," Gensim](https://radimrehurek.com/gensim/auto_examples/index.html)

## Putting It All Together

# Generate WordCloud

In [None]:
# import modules
from wordcloud import WordCloud

# read text (the context manager closes the file once it has been read)
with open('sample_output_file.txt') as f:
    text = f.read()

In [None]:
# create image
wordcloud = WordCloud().generate(text)

# display image using matplotlib
import matplotlib.pyplot as plt
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')

In [None]:
# lower max font size
wordcloud = WordCloud(max_font_size=40).generate(text)
plt.figure()
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

In [None]:
# non-matplotlib option
image = wordcloud.to_image()
image.show()

# Getting Started With spaCy

In [None]:
# import module
import spacy

# load English tokenizer, tagger, parser, and NER
# (requires the model; install it once with: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")

# process document
spacy_text = text
doc = nlp(spacy_text)

print(doc)

In [None]:
# analyze syntax
print("Noun phrases:", [chunk.text for chunk in doc.noun_chunks])
print("Verbs:", [token.lemma_ for token in doc if token.pos_ == 'VERB'])

In [None]:
# part-of-speech tagging
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_, token.shape_, token.is_alpha, token.is_stop)

In [None]:
# find named entities
for entity in doc.ents:
    print(entity.text, entity.label_)

In [None]:
# visualizing named entities
from spacy import displacy

# render displays inline in a notebook; displacy.serve starts a web server and blocks the cell
displacy.render(doc, style="ent")

# Getting Started With NLTK

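We can use `nltk.download()` to download additional nltk components. Called without arguments, it opens an interactive downloader; passing a package name, such as `nltk.download('stopwords')`, fetches a single resource.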

In [None]:
# load additional nltk data
import nltk
nltk.download()

The next step with `nltk` is to tokenize the text, which then lets us access other components of `nltk`.

In [None]:
# tokenize text using nltk
import nltk
tokens = nltk.word_tokenize(text)
print(tokens)

In [None]:
# tag tokens by part of speech
tagged = nltk.pos_tag(tokens)
print(tagged)

In [None]:
# identify and extract named entities
entities = nltk.chunk.ne_chunk(tagged)
entities

`NLTK` also includes other options for tokenizing a text.

In [None]:
# tokenize using word_tokenize
from nltk.tokenize import word_tokenize

word_tokenize(text)

In [None]:
# tokenize using TreebankWordTokenizer
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()

tokenizer.tokenize(text)

In [None]:
# tokenize using WordPunctTokenizer 
from nltk.tokenize import WordPunctTokenizer

tokenizer = WordPunctTokenizer()

tokenizer.tokenize(text)

In [None]:
# tokenize using RegexpTokenizer
from nltk.tokenize import RegexpTokenizer

# raw string so that \w is passed to the regex engine unchanged
tokenizer = RegexpTokenizer(r"[\w']+")

regex_words = tokenizer.tokenize(text)

regex_words
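
To see what the `[\w']+` pattern matches, the same regular expression can be tried with Python's built-in `re` module. This is a minimal sketch on a made-up sentence rather than NLTK's implementation; `sample_sentence` is a hypothetical input and not part of the workshop data.

```python
import re

# Hypothetical sample sentence, not taken from the workshop data.
sample_sentence = "Don't stop, it's 9 a.m.!"

# [\w']+ matches runs of letters, digits, underscores, and apostrophes,
# so contractions stay together while other punctuation is dropped.
word_pattern = re.compile(r"[\w']+")
regex_tokens = word_pattern.findall(sample_sentence)
print(regex_tokens)  # ["Don't", 'stop', "it's", '9', 'a', 'm']

# For comparison, splitting on whitespace keeps punctuation attached.
whitespace_tokens = sample_sentence.split()
print(whitespace_tokens)  # ["Don't", 'stop,', "it's", '9', 'a.m.!']
```

The comparison shows the trade-off: the regular expression strips punctuation but also breaks abbreviations such as "a.m." into separate tokens.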

We can use the most effective tokenizing method for this data in combination with a few other data wrangling steps to output a unique list of words.

In [None]:
# tokenize using word_tokenize
from nltk.tokenize import word_tokenize

tokens = word_tokenize(text)

# convert to lower case
tokens = [w.lower() for w in tokens]

# remove punctuation/special characters
import string
table = str.maketrans('', '', string.punctuation)
stripped = [w.translate(table) for w in tokens]

# remove non-text content
words = [word for word in stripped if word.isalpha()]

# filter out stop words
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

words = [w for w in words if w not in stop_words]

# optional: remove words with fewer than 3 characters
# words = [word for word in words if len(word) >= 3]

# output cleaned list of words
print(words)

We can then take that list of words and plot term frequency and distribution.

In [None]:
# import nltk components
from nltk.probability import FreqDist

# analyze term frequency/distribution
data_analysis = FreqDist(words)

data_analysis

In [None]:
# plot term frequency/distribution for all terms
data_analysis.plot()

In [None]:
# show term frequency/distribution for top 10 terms
for word, frequency in data_analysis.most_common(10):
    print(f'{word};{frequency}')
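
Term frequency is a count of how often each word occurs in the list. The following is a minimal sketch using `collections.Counter` from the standard library, which also provides a `most_common` method; `sample_words` is a made-up list standing in for the cleaned `words` list, and the relative-frequency step is an illustrative addition.

```python
from collections import Counter

# Hypothetical stand-in for the cleaned list of words.
sample_words = ["data", "text", "data", "model", "text", "data"]

# Count how many times each word occurs.
term_counts = Counter(sample_words)

# Show the two most frequent terms as word;count pairs.
for word, frequency in term_counts.most_common(2):
    print(f"{word};{frequency}")

# Relative frequency: each count divided by the total number of words.
total = sum(term_counts.values())
relative_frequency = {word: count / total for word, count in term_counts.items()}
print(relative_frequency)
```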

If you really want to dig in with NLTK:
- Steven Bird, Ewan Klein, and Edward Loper, [*Natural Language Processing With Python: Analyzing Text with the Natural Language Toolkit*](http://www.nltk.org/book/) (O'Reilly, 2009).

# Getting Started With Gensim

In [None]:
# create a gensim dictionary from a single txt file
from gensim import corpora
from gensim.utils import simple_preprocess

# each line of the file is treated as one document
with open('sample.txt') as f:
    dictionary = corpora.Dictionary(simple_preprocess(line, deacc=True) for line in f)

# mapping from each token to its integer id
dictionary.token2id

In [None]:
from gensim.utils import simple_preprocess
from smart_open import smart_open
import nltk
nltk.download('stopwords')  # run once
from nltk.corpus import stopwords
stop_words = stopwords.words('english')


class BoWCorpus(object):
    def __init__(self, path, dictionary):
        self.filepath = path
        self.dictionary = dictionary

    def __iter__(self):
        global mydict  # OPTIONAL, only if updating the source dictionary.
        for line in smart_open(self.filepath, encoding='latin'):
            # tokenize
            tokenized_list = simple_preprocess(line, deacc=True)

            # create bag of words
            bow = self.dictionary.doc2bow(tokenized_list, allow_update=True)

            # update the source dictionary (OPTIONAL)
            mydict.merge_with(self.dictionary)

            # lazy return the BoW
            yield bow


# Create the Dictionary
mydict = corpora.Dictionary()

# Create the Corpus
bow_corpus = BoWCorpus('sample.txt', dictionary=mydict)  # memory friendly

# Print the token_id and count for each line.
for line in bow_corpus:
    print(line)

In [None]:
# show word weights
from gensim import models
import numpy as np

for doc in bow_corpus:
    print([[mydict[id], freq] for id, freq in doc])

In [None]:
# create bag of words from single text file
from gensim.utils import simple_preprocess
from smart_open import smart_open
import nltk
nltk.download('stopwords')  # run once
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

# set up classes for bag of words model
class BoWCorpus(object):
    def __init__(self, path, dictionary):
        self.filepath = path
        self.dictionary = dictionary

    def __iter__(self):
        global mydict  # OPTIONAL, only if updating the source dictionary.
        for line in smart_open(self.filepath, encoding='latin'):
            # tokenize
            tokenized_list = simple_preprocess(line, deacc=True)

            # create bag of words
            bow = self.dictionary.doc2bow(tokenized_list, allow_update=True)

            # update the source dictionary (OPTIONAL)
            mydict.merge_with(self.dictionary)

            # lazy return the BoW
            yield bow


# create dictionary
mydict = corpora.Dictionary()

# build corpus from txt file
bow_corpus = BoWCorpus('sample.txt', dictionary=mydict)  # memory friendly

# print token id and count for each line
for line in bow_corpus:
    print(line)

In [None]:
# create tf-idf model
tfidf = models.TfidfModel(bow_corpus, smartirs='ntc')

# Show the TF-IDF weights
for doc in tfidf[bow_corpus]:
    print([[mydict[id], np.around(freq, decimals=2)] for id, freq in doc])
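
The `smartirs='ntc'` setting follows SMART notation: `n` for raw term frequency, `t` for inverse document frequency, and `c` for cosine (unit-length) normalization. The following is a minimal pure-Python sketch of that weighting on a made-up corpus. It assumes a base-2 logarithm for the inverse document frequency and is not gensim's implementation, whose details may differ; `toy_docs` and `tfidf_ntc` are hypothetical names introduced for illustration.

```python
import math
from collections import Counter

# Hypothetical toy corpus: each document is already tokenized.
toy_docs = [
    ["cat", "sat", "mat"],
    ["cat", "cat", "dog"],
    ["dog", "barked"],
]


def tfidf_ntc(docs):
    """Weight terms as tf * log2(N / df), then scale each document to unit length."""
    n_docs = len(docs)

    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))

    weighted_docs = []
    for doc in docs:
        term_freq = Counter(doc)  # "n": raw term frequency

        # "t": multiply by the inverse document frequency (base 2 assumed).
        weights = {
            term: tf * math.log2(n_docs / doc_freq[term])
            for term, tf in term_freq.items()
        }

        # "c": cosine normalization, so each document vector has length 1.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if norm > 0:
            weights = {term: w / norm for term, w in weights.items()}
        weighted_docs.append(weights)
    return weighted_docs


toy_weights = tfidf_ntc(toy_docs)
for weights in toy_weights:
    print({term: round(w, 2) for term, w in weights.items()})
```

In the first toy document, "sat" and "mat" receive a higher weight than "cat" because "cat" also appears in another document, which is the effect TF-IDF is meant to capture.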