# Text Processing

# Topics
- Parsing, Stemming, Lemmatization
- Frequency Analysis
- Named Entity Recognition
- Word Embeddings

# Workshop: Creating Word2Vec Models

Credits:
- https://codesachin.wordpress.com/2015/10/09/generating-a-word2vec-model-from-a-block-of-text-using-gensim-python/
- https://www.kaggle.com/c/word2vec-nlp-tutorial#part-2-word-vectors

Word2Vec
- Semantic learning of text representations
- Neural network 
- Cosine similarity
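
Word2Vec models are usually queried by comparing word vectors with cosine similarity. The sketch below shows the calculation using only the standard library. The three vectors are made-up numbers chosen for illustration, not output from a trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

# Toy 3-dimensional "word vectors" (hypothetical values, for illustration only)
king = [0.9, 0.1, 0.4]
queen = [0.8, 0.2, 0.5]
banana = [-0.3, 0.9, -0.1]

print(cosine_similarity(king, queen))   # close to 1: vectors point the same way
print(cosine_similarity(king, banana))  # much lower: vectors point apart
```

Values near 1 mean the vectors point in nearly the same direction, values near 0 mean they are unrelated, and negative values mean they point in opposing directions.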

## Setup

Run this command from an Anaconda prompt (within the mldds03 environment):

```
(mldds03) conda install gensim cython nltk
```

### gensim: for training word2vec

https://radimrehurek.com/gensim/


### Cython: to speed up training word2vec
http://docs.cython.org/en/latest/src/quickstart/install.html


### NLTK: for text processing
Installation: https://www.nltk.org/install.html

Book: http://www.nltk.org/book


## Download text

For demonstration purposes, we'll start with Wikipedia articles.

We'll use a Python library that wraps the Wikipedia APIs.

https://pypi.org/project/wikipedia/

Run this from an Anaconda prompt (within the mldds03 environment):

```
(mldds03) pip install wikipedia
```

In [1]:
import wikipedia
from wikipedia import search, page

# Get our documents: wikipedia articles
topic = 'singapore'

titles = search(topic)
titles

['Singapore',
 'Singapore Standard Time',
 'Capella Resort, Singapore',
 'Singapore Airlines fleet',
 'Singapore dollar',
 'Languages of Singapore',
 'History of Singapore',
 'Singapore Armed Forces',
 'Economy of Singapore',
 'Singapore Sling']

In [4]:
# retrieve all pages
wikipages = [page(title) for title in titles]

# inspect the first page
wikipages[0].summary

'Singapore ( ( listen)), officially the Republic of Singapore, is a sovereign city-state and island country in Southeast Asia. It lies one degree (137 kilometres or 85 miles) north of the equator, at the southern tip of the Malay Peninsula, with Indonesia\'s Riau Islands to the south and Peninsular Malaysia to the north. Singapore\'s territory consists of one main island along with 62 other islets. Since independence, extensive land reclamation has increased its total size by 23% (130 square kilometres or 50 square miles).\nStamford Raffles founded colonial Singapore in 1819 as a trading post of the British East India Company; after the latter\'s collapse in 1858, the islands were ceded to the British Raj as a crown colony. During the Second World War, Singapore was occupied by Japan. It gained independence from the UK in 1963 by federating with other former British territories to form Malaysia, but separated two years later over ideological differences, becoming a sovereign nation in 

## Process text

- Remove special characters
- Convert to lowercase
- Tokenize the text into words
- Optionally remove stop words such as 'a', 'the'
- Stem each word
  - Stemmers remove morphological affixes from words, leaving only the word stem
  - http://www.nltk.org/howto/stem.html

In [14]:
import re # python regular expressions library
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# Download NLTK corpora
# List of available corpora: http://www.nltk.org/book/ch02.html#tab-corpora

# 1. Download the Punkt sentence tokenizer
# https://www.nltk.org/_modules/nltk/tokenize/punkt.html
nltk.download('punkt')

# 2. Download the Stop Words corpus
nltk.download('stopwords')

# 3. Helper function to convert text into list of stem words
def text_to_stem_wordlist(text, remove_stopwords=False):
    """Cleans and converts text to a wordlist of stemmed words
    Args:
        text: input text
        remove_stopwords: whether to remove stopwords
    Returns:
        a list of stemmed words (strings)
    """
    # Replace every character that is not a letter or digit with a space
    text = re.sub('[^a-zA-Z0-9]', ' ', text)
    
    # Convert to lowercase
    text = text.lower()
    
    # Tokenize the text
    tokens = nltk.word_tokenize(text)
    
    # Optionally remove stop words
    # We will keep stop words for training
    if remove_stopwords:
        stops = set(stopwords.words('english'))
        tokens = [token for token in tokens if token not in stops]
    
    # Stem the words; we'll try the Porter stemmer first
    stemmer = PorterStemmer()
    tokens = [stemmer.stem(t) for t in tokens]
    
    return tokens

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\issohl\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\issohl\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [13]:
# Test our helper function to see what it does
text = wikipages[0].summary
print('===== Original text for first article =====')
print(text)

wordlist = text_to_stem_wordlist(text)
print('\n===== Stem words [%d words] =====' % len(wordlist))
print(wordlist)

wordlist = text_to_stem_wordlist(text, remove_stopwords=True)
print('\n===== Stem words - stopwords [%d words] =====' % len(wordlist))
print(wordlist)

===== Original text for first article =====
Singapore ( ( listen)), officially the Republic of Singapore, is a sovereign city-state and island country in Southeast Asia. It lies one degree (137 kilometres or 85 miles) north of the equator, at the southern tip of the Malay Peninsula, with Indonesia's Riau Islands to the south and Peninsular Malaysia to the north. Singapore's territory consists of one main island along with 62 other islets. Since independence, extensive land reclamation has increased its total size by 23% (130 square kilometres or 50 square miles).
Stamford Raffles founded colonial Singapore in 1819 as a trading post of the British East India Company; after the latter's collapse in 1858, the islands were ceded to the British Raj as a crown colony. During the Second World War, Singapore was occupied by Japan. It gained independence from the UK in 1963 by federating with other former British territories to form Malaysia, but separated two years later over ideological diffe

### Convert all articles to stem words

Let's now convert all articles on our topic to stem wordlists.

So far we have only examined the summary of each article. Let's see how to get the full content.

Looking at the wikipedia library's documentation, we can use `WikipediaPage.content` to get to the plain text content for each page: https://wikipedia.readthedocs.io/en/latest/code.html

In [25]:
wikipages[0].content

'Singapore ( ( listen)), officially the Republic of Singapore, is a sovereign city-state and island country in Southeast Asia. It lies one degree (137 kilometres or 85 miles) north of the equator, at the southern tip of the Malay Peninsula, with Indonesia\'s Riau Islands to the south and Peninsular Malaysia to the north. Singapore\'s territory consists of one main island along with 62 other islets. Since independence, extensive land reclamation has increased its total size by 23% (130 square kilometres or 50 square miles).\nStamford Raffles founded colonial Singapore in 1819 as a trading post of the British East India Company; after the latter\'s collapse in 1858, the islands were ceded to the British Raj as a crown colony. During the Second World War, Singapore was occupied by Japan. It gained independence from the UK in 1963 by federating with other former British territories to form Malaysia, but separated two years later over ideological differences, becoming a sovereign nation in 

In [28]:
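training_set = []

print('Converting %d articles to training set...' % len(titles))
for wikipage in wikipages:
    # Word2Vec expects a list of token lists, so keep each article as
    # its own list of stem words (stopwords are kept)
    training_set.append(text_to_stem_wordlist(wikipage.content))

num_words = sum(len(wordlist) for wordlist in training_set)
print('Training set size: %d stem words' % num_words)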

Converting 10 articles to training set...
Training set size: 39822 stem words


### Question to ponder:

Should we randomize the training set?

Why or why not?

## Train a word2vec model

(Credits: https://www.kaggle.com/c/word2vec-nlp-tutorial#part-2-word-vectors)

With the list of nicely parsed sentences, we're ready to train the model. There are a number of parameter choices that affect the run time and the quality of the final model that is produced.

For details on the algorithms below, see the [word2vec API documentation](https://radimrehurek.com/gensim/models/word2vec.html) as well as the [Google documentation](https://code.google.com/archive/p/word2vec/) (Performance section).

### Architecture:
Architecture options are skip-gram (slower, better for infrequent words) or continuous bag of words (CBOW, faster). Note that gensim's `Word2Vec` defaults to CBOW (`sg=0`); pass `sg=1` to select skip-gram.

### Training algorithm:
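Hierarchical softmax (better for infrequent words) or negative sampling (better for frequent words, better with low-dimensional vectors). gensim's `Word2Vec` defaults to negative sampling (`hs=0`, `negative=5`); pass `hs=1` together with `negative=0` to use hierarchical softmax. Start with the default.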

### Downsampling of frequent words:
This can improve both accuracy and speed for large data sets. The [Google documentation](https://code.google.com/archive/p/word2vec/) recommends values between 1e-3 and 1e-5. Let's try 1e-3 and then iterate from there.
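
To get a feel for what the downsampling threshold does, here is a minimal sketch of the subsampling rule as stated in the original word2vec paper, where an occurrence of a word with corpus frequency `f` is discarded with probability `1 - sqrt(t / f)`. The function name `keep_probability` is hypothetical, and actual implementations may use a slightly different expression, so treat the numbers as indicative only.

```python
import math

def keep_probability(word_frequency, threshold=1e-3):
    """Probability of keeping one occurrence of a word (paper formula).

    word_frequency: the word's share of all tokens in the corpus (0 < f <= 1)
    threshold: the downsampling setting t
    """
    if word_frequency <= 0:
        raise ValueError("word_frequency must be positive")
    # Words rarer than the threshold are always kept (probability capped at 1)
    return min(1.0, math.sqrt(threshold / word_frequency))

# Very frequent words are dropped often; rare words are left alone
for freq in (0.05, 0.001, 0.0001):
    print(freq, round(keep_probability(freq), 3))
```

Lowering the threshold towards 1e-5 discards frequent words more aggressively.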

### Word vector dimensionality:
Higher dimensionality (more features) usually results in better models, but also longer runtimes. Reasonable values can be in the tens to hundreds. We'll try 100.

### Context / window size:
For skip-gram usually around 10, for CBOW around 5. More is better, up to a point.
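
The window size decides which neighbouring words count as context. The sketch below illustrates how a window turns a token list into training examples for each architecture: skip-gram predicts each context word from the center word, while CBOW predicts the center word from its context. The helper names are hypothetical and the sketch leaves out details of real implementations, such as downsampling and negative sampling.

```python
def skipgram_pairs(tokens, window):
    """Yield (center, context) pairs for every token within `window` positions."""
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                yield center, tokens[j]

def cbow_examples(tokens, window):
    """Yield (context_words, center) examples: the context predicts the center."""
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        context = tokens[start:i] + tokens[i + 1:end]
        if context:
            yield context, center

# Hypothetical stemmed tokens, for illustration only
tokens = ["singapor", "is", "a", "sovereign", "citi", "state"]
print(list(skipgram_pairs(tokens, window=1)))
print(list(cbow_examples(tokens, window=2))[:2])
```

A larger window produces more pairs per word, which is one reason training slows down as the window grows.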

### Worker threads:
Number of worker threads to run in parallel. This can significantly improve training speed. The number to choose depends on how many logical CPU cores your computer has (on Windows, Start Menu -> System Information, look for Processors). Start with 2-4, and increase it if your computer is more powerful.
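
If you prefer to check the core count from Python, the standard library can report it. This is a small sketch; leaving one core free for the rest of the system is an assumption of this example, not a gensim requirement.

```python
import os

# os.cpu_count() reports logical cores and can return None on unusual platforms
logical_cores = os.cpu_count() or 1

# Leave one core free for the rest of the system, but always use at least one worker
num_workers = max(1, logical_cores - 1)
print('Logical cores: %d, suggested workers: %d' % (logical_cores, num_workers))
```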

### Minimum word count:
This helps limit the size of the vocabulary to meaningful words. Any word that does not occur at least this many times across all documents is ignored. Reasonable values could be between 10 and 100. Higher values also help limit run time.

For Wikipedia articles, we'll try a minimum word count of 50.
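
To see the effect of a minimum word count, the sketch below counts tokens across a few toy sentences and keeps only those that occur often enough. The function name `filter_vocabulary` and the sample sentences are hypothetical; the training library applies its own equivalent filter internally.

```python
from collections import Counter

def filter_vocabulary(sentences, min_count):
    """Return {token: count} for tokens seen at least `min_count` times."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {token: n for token, n in counts.items() if n >= min_count}

# Hypothetical stemmed sentences, for illustration only
sentences = [
    ["singapor", "is", "a", "citi"],
    ["singapor", "is", "an", "island"],
    ["the", "citi", "is", "a", "port"],
]
vocab = filter_vocabulary(sentences, min_count=2)
print(sorted(vocab.items()))
```

On a small corpus, a high minimum count can leave very few words, so it is worth checking the vocabulary size before training.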

In [30]:
from gensim.models import word2vec

word2vec?



In [None]:
# Credits: https://www.kaggle.com/c/word2vec-nlp-tutorial#part-2-word-vectors

# Import the built-in logging module and configure it so that Word2Vec 
# creates nice output messages
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                    level=logging.INFO)

In [31]:
# Credits: https://www.kaggle.com/c/word2vec-nlp-tutorial#part-2-word-vectors

# Set values for various parameters
num_features = 100    # Word vector dimensionality                      
min_word_count = 50   # Minimum word count                        
num_workers = 2       # Number of threads to run in parallel
context = 10          # Context window size                                                                                    
downsampling = 1e-3   # Downsample setting for frequent words

# Initialize and train the model
# Note: gensim 4.0+ renamed the `size` argument to `vector_size`
print('Training Word2Vec model... (this may take a while)')
%time model = word2vec.Word2Vec(training_set, workers=num_workers, \
            size=num_features, min_count=min_word_count, \
            window=context, sample=downsampling)

# If you don't plan to train the model any further, calling 
# init_sims will make the model much more memory-efficient.
# (init_sims is deprecated in gensim 4.0+, where this call can be dropped)
model.init_sims(replace=True)

# It can be helpful to create a meaningful model name and 
# save the model for later use. You can load it later using Word2Vec.load()
model_name = "wikipedia_100features_50minwords_10context"
model.save(model_name)

Training Word2Vec model... (this may take a while)
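
Once a model has been trained and saved, a natural next step is to look up the nearest neighbours of a word. The helper below is a minimal sketch: the name `nearest_words` is hypothetical, and it assumes the model exposes its vectors as `model.wv`, which supports membership tests and `most_similar`. Because the model is trained on stems, queries should use stemmed tokens too.

```python
def nearest_words(model, word, topn=5):
    """Return up to `topn` (word, cosine similarity) pairs.

    Returns an empty list if `word` is not in the model's vocabulary,
    for example because it occurred fewer times than the minimum word count.
    """
    if word not in model.wv:
        return []
    return model.wv.most_similar(word, topn=topn)

# Usage, assuming the saved model file is in the working directory and
# `stemmed_word` holds a token produced by the same stemming step:
#   from gensim.models import Word2Vec
#   model = Word2Vec.load("wikipedia_100features_50minwords_10context")
#   print(nearest_words(model, stemmed_word))
```

Checking membership first avoids a `KeyError` for words that were filtered out of the vocabulary.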

https://www.kaggle.com/c/word2vec-nlp-tutorial#part-2-word-vectors

https://radimrehurek.com/gensim/

https://www.nltk.org/index.html

http://www.nltk.org/book/ch02.html#tab-corpora

https://github.com/charlieg/A-Smattering-of-NLP-in-Python

http://p.migdal.pl/2017/01/06/king-man-woman-queen-why.html