# Text Mining (NLP) with Python

**Author:** Ties de Kok   ([Personal Website](https://www.tiesdekok.com))   <br>
**Last updated:** June 2020  
**Python version:** Python 3.7  
**License:** MIT License  

**Note:** Some features (like the ToC) will only work if you run the notebook or if you use nbviewer by clicking this link:  
https://nbviewer.jupyter.org/github/TiesdeKok/Python_NLP_Tutorial/blob/master/NLP_Notebook.ipynb

# *Introduction*

This notebook contains code examples to get you started with Python for Natural Language Processing (NLP) / Text Mining.  

In the grand scheme of things there are roughly 4 steps:  

1. Identify a data source  
2. Gather the data  
3. Process the data  
4. Analyze the data  

This notebook only discusses steps 3 and 4. If you want to learn more about step 2, see my [Python tutorial](https://github.com/TiesdeKok/LearnPythonforResearch). 

## Note: companion slides

This notebook was designed to accompany a PhD course session on NLP techniques in Accounting Research.  
The slides of this session are publicly available here: [Slides](http://www.tiesdekok.com/AccountingNLP_Slides/)

# *Elements / topics that are discussed in this notebook:*


<img style="float: left" src="https://i.imgur.com/c3aCZLA.png" width="50%" /> 

# *Table of Contents*  <a id='toc'></a>

* [Primer on NLP tools](#tool_primer)     
* [Process + Clean text](#proc_clean)   
    * [Normalization](#normalization)
        * [Deal with unwanted characters](#unwanted_char)
        * [Sentence segmentation](#sentence_seg)   
        * [Word tokenization](#word_token)
        * [Lemmatization & Stemming](#lem_and_stem) 
    * [Language modeling](#lang_model) 
        * [Part-of-Speech tagging](#pos_tagging) 
        * [Uni-Gram & N-Grams](#n_grams) 
        * [Stop words](#stop_words) 
* [Direct feature extraction](#feature_extract) 
    * [Feature search](#feature_search) 
        * [Entity recognition](#entity_recognition) 
        * [Pattern search](#pattern_search) 
    * [Text evaluation](#text_eval) 
        * [Language](#language) 
        * [Dictionary counting](#dict_counting) 
        * [Readability](#readability) 
* [Represent text numerically](#text_numerical) 
    * [Bag of Words](#bows) 
        * [TF-IDF](#tfidf) 
    * [Word Embeddings](#word_embed) 
        * [Spacy](#spacyEmbedding)
        * [Word2Vec](#Word2Vec) 
* [Statistical models](#stat_models) 
    * ["Traditional" machine learning](#trad_ml) 
        * [Supervised](#trad_ml_supervised) 
            * [Naïve Bayes](#trad_ml_supervised_nb) 
            * [Support Vector Machines (SVM)](#trad_ml_supervised_svm) 
        * [Unsupervised](#trad_ml_unsupervised) 
            * [Latent Dirichlet Allocation (LDA)](#trad_ml_unsupervised_lda) 
            * [pyLDAvis](#trad_ml_unsupervised_pyLDAvis) 
* [Model Selection and Evaluation](#trad_ml_eval) 
* [Neural Networks](#nn_ml)

# <span style="text-decoration: underline;">Primer on NLP tools</span><a id='tool_primer'></a> [(to top)](#toc)

There are many tools available for NLP purposes.  
The code examples below are based on what I personally like to use; they are not intended to be a comprehensive overview.  

Besides built-in Python functionality I will use / demonstrate the following packages:

**Standard NLP libraries**:
1. `Spacy` 
2. `NLTK` and the higher-level wrapper `TextBlob`

*Note: besides installing the above packages you often also have to download (model) data. Make sure to check the documentation!*

**Standard machine learning library**:

1. `scikit-learn`

**Specific task libraries**:

There are many, just a couple of examples:

1. `pyLDAvis` for visualizing LDA
2. `langdetect` for detecting languages
3. `fuzzywuzzy` for fuzzy text matching
4. `Gensim` for topic modelling

# <span style="text-decoration: underline;">Get some example data</span><a id='example_data'></a> [(to top)](#toc)

There are many example datasets available to play around with, see for example this great repository:  
https://archive.ics.uci.edu/ml/datasets.php

The data that I will use for most of the examples is the "Reuter_50_50 Data Set" that is used for author identification experiments. 

See the details here: https://archive.ics.uci.edu/ml/datasets/Reuter_50_50  

### Download and load the data

Can't follow what I am doing here? Please see my [Python tutorial](https://github.com/TiesdeKok/LearnPythonforResearch) (although the `zipfile` and `io` operations are not very relevant).

In [1]:
import requests, zipfile, io, os
from tqdm.notebook import tqdm

*Note:* for `tqdm` to work in JupyterLab you need to install the `@jupyter-widgets/jupyterlab-manager` using the puzzle icon in the left side bar. 

*Download and extract the zip file with the data*

In [2]:
if not os.path.exists('C50test'):
    r = requests.get("https://archive.ics.uci.edu/ml/machine-learning-databases/00217/C50.zip")
    z = zipfile.ZipFile(io.BytesIO(r.content))
    z.extractall()

*Load the data into memory*

In [2]:
folder_dict = {'test' : 'C50test'}
text_dict = {'test' : {}}

In [3]:
for label, folder in tqdm(folder_dict.items()):
    authors = os.listdir(folder)
    for author in authors:
        text_files = os.listdir(os.path.join(folder, author))
        for file in text_files:
            with open(os.path.join(folder, author, file), 'r') as text_file:
                text_dict[label].setdefault(author, []).append(' '.join(text_file.readlines()))


*Note: the text comes pre-split per sentence; for the sake of example I undo this through `' '.join(text_file.readlines())`.*

In [4]:
text_dict['test']['TimFarrand'][0]

'United Biscuits (Holdings) Plc more than doubled its profits in 1996 to 109 million pounds ($174 million) before tax and exceptional items, reflecting a simpler and slimmed-down portfolio of products.\n United, which owns brands such as McVities biscuits and KP nuts but has exited from its U.S. Keebler subsidiary, said total exceptional charges, mainly from the loss on disposal of businesses, amounted to 84.7 million pounds in 1996 compared with 150.3 million in 1995.\n Sales rose by three percent to 1.887 billion and trading profits grew four percent to 129.2 million.\n Underlying profits growth was in line with stock brokers forecasts, but a presentation by management to analysts was greeted positively, sending the group\'s shares up 11 pence to 248-1/2p by 1415 gmt.\n "It\'s all quite encouraging. The way they are analysing and managing the business is very much more in line with what the market demands," said Richard Workman, an analyst at ABN-AMRO Hoare Govett.\n The company said

# <span style="text-decoration: underline;">Process + Clean text</span><a id='proc_clean'></a> [(to top)](#toc)

## Convert the text into a NLP representation

We can use the text directly, but if we want to use packages like `spacy` and `textblob` we first have to convert the text into a corresponding object.  

### Spacy

**Note:** depending on the way that you installed the language models you will need to import it differently:

```
from spacy.en import English
nlp = English()
```
OR
```
import en_core_web_sm
nlp = en_core_web_sm.load()

import en_core_web_md
nlp = en_core_web_md.load()

import en_core_web_lg
nlp = en_core_web_lg.load()
```

In [6]:
import spacy
import en_core_web_sm
nlp = en_core_web_sm.load()

Convert all text in the "test" sample to a `spacy` `doc` object using `nlp.pipe()`:

In [7]:
spacy_text = {}
for author, text_list in tqdm(text_dict['test'].items()):
    spacy_text[author] = list(nlp.pipe(text_list))


In [8]:
# for doc in nlp.pipe(texts, disable=["tagger", "parser", "lemmatizer"]):


*A note on speed:*  This is slow because we didn't disable any components, see this note from the documentation:  
> Only apply the pipeline components you need. Getting predictions from the model that you don’t actually need adds up and becomes very inefficient at scale. To prevent this, use the disable keyword argument to disable components you don’t need – either when loading a model, or during processing with nlp.pipe. See the section on disabling pipeline components for more details and examples. [link](https://spacy.io/usage/processing-pipelines#disabling)

In [9]:
type(spacy_text['TimFarrand'][0])

spacy.tokens.doc.Doc

### NLTK

In [9]:
import nltk

We can apply basic `nltk` operations directly to the text so we don't need to convert first. 

### TextBlob

In [10]:
from textblob import TextBlob

Convert all text in the "test" sample to a `TextBlob` object using `TextBlob()`:

In [11]:
textblob_text = {}
for author, text_list in text_dict['test'].items():
    textblob_text[author] = [TextBlob(text) for text in text_list]

In [12]:
type(textblob_text['TimFarrand'][0])

textblob.blob.TextBlob

## <span style="text-decoration: underline;">Normalization</span><a id='normalization'></a> [(to top)](#toc)

**Text normalization** describes the task of transforming the text into a different (more comparable) form.  

This can imply many things, I will show a couple of options below:

### <span style="text-decoration: underline;">Deal with unwanted characters</span><a id='unwanted_char'></a> [(to top)](#toc)

You will often notice that there are characters that you don't want in your text.  

Let's look at this sentence for example:

> "Shares in brewing-to-leisure group Bass Plc are likely to be held back until Britain\'s Trade and Industry secretary Ian Lang decides whether to allow its proposed merge with brewer Carlsberg-Tetley, said analysts.\n Earlier Lang announced the Bass deal would be referred to the Monoplies and Mergers"

You notice that there are some `\` and `\n` in there. These control how the string is displayed; if we print this text we get:  

In [10]:
text_dict['test']['TimFarrand'][0][:298]

'United Biscuits (Holdings) Plc more than doubled its profits in 1996 to 109 million pounds ($174 million) before tax and exceptional items, reflecting a simpler and slimmed-down portfolio of products.\n United, which owns brands such as McVities biscuits and KP nuts but has exited from its U.S. Kee'

In [11]:
print(text_dict['test']['TimFarrand'][0][:298])

United Biscuits (Holdings) Plc more than doubled its profits in 1996 to 109 million pounds ($174 million) before tax and exceptional items, reflecting a simpler and slimmed-down portfolio of products.
 United, which owns brands such as McVities biscuits and KP nuts but has exited from its U.S. Kee


These special characters can cause problems in our analyses (and can be hard to debug if you are using `print` statements to inspect the data).

**So how do we remove them?**

In many cases it is sufficient to simply use the `.replace()` function:

In [15]:
text_dict['test']['TimFarrand'][0][:298].replace('\n', '').replace('\\', '')

"Shares in brewing-to-leisure group Bass Plc are likely to be held back until Britain's Trade and Industry secretary Ian Lang decides whether to allow its proposed merge with brewer Carlsberg-Tetley, said analysts. Earlier Lang announced the Bass deal would be referred to the Monoplies and Mergers"

Sometimes, however, the problem arises because of encoding / decoding problems.  

In those cases you can usually do something like:  

In [16]:
problem_sentence = 'This is some \u03c0 text that has to be cleaned\u2026! it\u0027s difficult to deal with!'
print(problem_sentence)
print(problem_sentence.encode().decode('unicode_escape').encode('ascii','ignore'))

This is some π text that has to be cleaned…! it's difficult to deal with!
b"This is some  text that has to be cleaned! it's difficult to deal with!"


An alternative that is better at preserving the unicode characters is to use `unidecode`:

In [17]:
import unidecode

In [18]:
print('\u738b\u7389')

王玉


In [19]:
unidecode.unidecode(u"\u738b\u7389")

'Wang Yu '

In [20]:
unidecode.unidecode(problem_sentence)

"This is some p text that has to be cleaned...! it's difficult to deal with!"

### <span style="text-decoration: underline;">Sentence segmentation</span><a id='sentence_seg'></a> [(to top)](#toc)

Sentence segmentation refers to the task of splitting up the text by sentence.  

You could do this by splitting on the `.` symbol, but dots are used in many other cases as well so it is not very robust:

In [21]:
text_dict['test']['TimFarrand'][0][:550].split('.')

["Shares in brewing-to-leisure group Bass Plc are likely to be held back until Britain's Trade and Industry secretary Ian Lang decides whether to allow its proposed merge with brewer Carlsberg-Tetley, said analysts",
 '\n Earlier Lang announced the Bass deal would be referred to the Monoplies and Mergers Commission which is due to report before March 24, 1997',
 ' The shares fell 6p to 781p on the news',
 '\n "The stock is probably dead in the water until March," said John Wakley, analyst at Lehman Brothers',
 '  \n Dermott Carr, an analyst at Nikko said, "the mark']
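
To make the problem concrete, here is a minimal standard-library sketch. The helper names `naive_split` and `regex_split` and the sample text are made up for illustration. Splitting on every `.` cuts numbers such as `1.887` in two, while a regular expression that only splits after `.`, `!` or `?` followed by whitespace and a capital letter or quote does better. It would still trip over abbreviations such as `U.S. Keebler`, which is one reason to prefer a trained model.

```python
import re

def naive_split(text):
    # Split on every dot, dropping empty fragments
    return [part.strip() for part in text.split('.') if part.strip()]

def regex_split(text):
    # Split on whitespace that follows ., ! or ? and precedes a capital letter or a quote
    pattern = r'(?<=[.!?])\s+(?=["A-Z])'
    return [part.strip() for part in re.split(pattern, text) if part.strip()]

# Hypothetical sample text, loosely modelled on the article above
sample = 'Sales rose to 1.887 billion. Profits grew four percent. "It is encouraging," said an analyst.'

for sentence in naive_split(sample):
    print('naive:', sentence)
for sentence in regex_split(sample):
    print('regex:', sentence)
```

The naive version returns four fragments for three sentences because the decimal point is treated as a sentence boundary.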

It is better to use a more sophisticated implementation such as the one by `Spacy`:

In [22]:
example_paragraph = spacy_text['TimFarrand'][0]

In [23]:
sentence_list = [s for s in example_paragraph.sents]
sentence_list[:5]

[Shares in brewing-to-leisure group Bass Plc are likely to be held back until Britain's Trade and Industry secretary Ian Lang decides whether to allow its proposed merge with brewer Carlsberg-Tetley, said analysts.
  ,
 Earlier Lang announced the Bass deal would be referred to the Monoplies and Mergers Commission which is due to report before March 24, 1997.,
 The shares fell 6p to 781p on the news.
  ,
 "The stock is probably dead in the water until March," said John Wakley, analyst at Lehman Brothers.  
  ,
 Dermott Carr, an analyst at Nikko said, "the market is going to hang onto them for the moment]

Notice that the returned object is still a `spacy` object:

In [24]:
type(sentence_list[0])

spacy.tokens.span.Span

*Note:* `spacy` sentence segmentation relies on the text being capitalized, so make sure you didn't convert it to all lower case before running this operation.

Apply to all texts (for use later on):

In [12]:
spacy_sentences = {}
for author, text_list in tqdm(spacy_text.items()):
    spacy_sentences[author] = [list(text.sents) for text in text_list]


In [13]:
spacy_sentences['TimFarrand'][0][:3]

[United Biscuits (Holdings) Plc more than doubled its profits in 1996 to 109 million pounds ($174 million) before tax and exceptional items, reflecting a simpler and slimmed-down portfolio of products.
  ,
 United, which owns brands such as McVities biscuits and KP nuts but has exited from its U.S. Keebler subsidiary, said total exceptional charges, mainly from the loss on disposal of businesses, amounted to 84.7 million pounds in 1996 compared with 150.3 million in 1995.
  ,
 Sales rose by three percent to 1.887 billion and trading profits grew four percent to 129.2 million.
  ]

### <span style="text-decoration: underline;">Word tokenization</span><a id='word_token'></a> [(to top)](#toc)

Word tokenization means to split the sentence (or text) up into words.

In [14]:
example_sentence = spacy_sentences['TimFarrand'][0][0]
example_sentence

United Biscuits (Holdings) Plc more than doubled its profits in 1996 to 109 million pounds ($174 million) before tax and exceptional items, reflecting a simpler and slimmed-down portfolio of products.
 

A word is called a `token` in this context (hence `tokenization`), using `spacy`:

In [28]:
token_list = [token for token in example_sentence]
token_list[0:15]

[Shares,
 in,
 brewing,
 -,
 to,
 -,
 leisure,
 group,
 Bass,
 Plc,
 are,
 likely,
 to,
 be,
 held]
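
For intuition, a rough approximation of this behaviour can be sketched with the standard library alone. The function name `simple_tokenize` is made up for this example, and the regular expression is an assumption about what a "good enough" tokenizer looks like, not how `spacy` works internally.

```python
import re

def simple_tokenize(text):
    # \w+ matches runs of letters/digits; [^\w\s] matches each punctuation mark on its own
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("Shares in brewing-to-leisure group Bass Plc")
print(tokens)
```

This splits the hyphenated phrase the same way as the output above, but it handles cases such as `Britain's` differently from `spacy` (which keeps `'s` together as one token), so treat it as a fallback rather than a replacement.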

### <span style="text-decoration: underline;">Lemmatization & Stemming</span><a id='lem_and_stem'></a> [(to top)](#toc)

In some cases you want to convert a word (i.e. token) into a more general representation.  

For example: convert "car", "cars", "car's", "cars'" all into the word `car`.

This is generally done through lemmatization / stemming (different approaches trying to achieve a similar goal).  

**Spacy**

Spacy offers built-in functionality for lemmatization:

In [29]:
lemmatized = [token.lemma_ for token in example_sentence]
lemmatized[0:15]

['share',
 'in',
 'brewing',
 '-',
 'to',
 '-',
 'leisure',
 'group',
 'Bass',
 'Plc',
 'be',
 'likely',
 'to',
 'be',
 'hold']

**NLTK**

Using the NLTK library we can also use the more aggressive Porter Stemmer:

In [30]:
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()

In [31]:
stemmed = [stemmer.stem(token.text) for token in example_sentence]
stemmed[0:15]

['share',
 'in',
 'brew',
 '-',
 'to',
 '-',
 'leisur',
 'group',
 'bass',
 'plc',
 'are',
 'like',
 'to',
 'be',
 'held']

**Compare**:

In [32]:
print('  Original  | Spacy Lemma  | NLTK Stemmer')
print('-' * 41)
for original, lemma, stem in zip(token_list[:15], lemmatized[:15], stemmed[:15]):
    print(str(original).rjust(10, ' '), ' | ', str(lemma).rjust(10, ' '), ' | ', str(stem).rjust(10, ' '))

  Original  | Spacy Lemma  | NLTK Stemmer
-----------------------------------------
    Shares  |       share  |       share
        in  |          in  |          in
   brewing  |     brewing  |        brew
         -  |           -  |           -
        to  |          to  |          to
         -  |           -  |           -
   leisure  |     leisure  |      leisur
     group  |       group  |       group
      Bass  |        Bass  |        bass
       Plc  |         Plc  |         plc
       are  |          be  |         are
    likely  |      likely  |        like
        to  |          to  |          to
        be  |          be  |          be
      held  |        hold  |        held


In my experience it is usually best to use lemmatization instead of a stemmer. 

## <span style="text-decoration: underline;">Language modeling</span><a id='lang_model'></a> [(to top)](#toc)

Text is inherently structured in complex ways, and we can often use some of this underlying structure. 

### <span style="text-decoration: underline;">Part-of-Speech tagging</span><a id='pos_tagging'></a> [(to top)](#toc)

Part of speech tagging refers to the identification of words as nouns, verbs, adjectives, etc. 

Using `Spacy`:

In [33]:
pos_list = [(token, token.pos_) for token in example_sentence]
pos_list[0:10]

[(Shares, 'NOUN'),
 (in, 'ADP'),
 (brewing, 'NOUN'),
 (-, 'PUNCT'),
 (to, 'ADP'),
 (-, 'PUNCT'),
 (leisure, 'NOUN'),
 (group, 'NOUN'),
 (Bass, 'PROPN'),
 (Plc, 'PROPN')]

### <span style="text-decoration: underline;">Uni-Gram & N-Grams</span><a id='n_grams'></a> [(to top)](#toc)

A sentence is not a random collection of words; the sequence of words has information value.  

A simple way to incorporate some of this sequence is by using what is called `n-grams`.  
An `n-gram` is nothing more than a combination of `N` words into one token (a uni-gram token is just one word).  

So we can convert `"Sentence about flying cars"` into a list of bigrams:

> Sentence-about, about-flying, flying-cars  

See my slide on N-Grams for a more comprehensive example: [click here](http://www.tiesdekok.com/AccountingNLP_Slides/#14)
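
The bigram example above can be reproduced in plain Python with a sliding window. This is a minimal sketch; the helper name `make_ngrams` is made up for illustration and it assumes the text is already split into tokens.

```python
def make_ngrams(tokens, n, sep='-'):
    # Slide a window of n tokens across the list; yields nothing if n > len(tokens)
    return [sep.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "Sentence about flying cars".split()

print(make_ngrams(tokens, 1))  # uni-grams: every word on its own
print(make_ngrams(tokens, 2))  # bigrams
print(make_ngrams(tokens, 3))  # trigrams
```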

Using `NLTK`:

In [34]:
bigram_list = ['-'.join(x) for x in nltk.bigrams([token.text for token in example_sentence])]
bigram_list[10:15]

['are-likely', 'likely-to', 'to-be', 'be-held', 'held-back']

Using `spacy`:

In [16]:
def tokenize_without_punctuation(sen_obj):
    return [token.text for token in sen_obj if token.is_alpha]

In [15]:
def create_ngram(sen_obj, n, sep = '-'):
    token_list = tokenize_without_punctuation(sen_obj)
    ngram_list = []
    # Slide a window of n tokens over the sentence. Using range() avoids the
    # empty result that the slice token_list[:-n+1] would give for n = 1.
    for i in range(len(token_list) - n + 1):
        ngram_list.append(sep.join(token_list[i:i + n]))
    return ngram_list

In [17]:
create_ngram(example_sentence, 2)[:5]

['United-Biscuits',
 'Biscuits-Holdings',
 'Holdings-Plc',
 'Plc-more',
 'more-than']

In [38]:
create_ngram(example_sentence, 3)[:5]

['Shares-in-brewing',
 'in-brewing-to',
 'brewing-to-leisure',
 'to-leisure-group',
 'leisure-group-Bass']

### <span style="text-decoration: underline;">Stop words</span><a id='stop_words'></a> [(to top)](#toc)

Depending on what you are trying to do it is possible that there are many words that don't add any information value to the sentence.  

The primary example is stop words.  

Sometimes you can improve the accuracy of your model by removing stop words.

Using `Spacy`:

In [39]:
no_stop_words = [token for token in example_sentence if not token.is_stop]

In [40]:
no_stop_words[:10]

[Shares, brewing, -, -, leisure, group, Bass, Plc, likely, held]

In [41]:
token_list[:10]

[Shares, in, brewing, -, to, -, leisure, group, Bass, Plc]

*Note* we can also remove punctuation in the same way:

In [42]:
[token for token in example_sentence if not token.is_stop and token.is_alpha][:10]

[Shares, brewing, leisure, group, Bass, Plc, likely, held, Britain, Trade]
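
If you need full control over which words are dropped, the same filter is easy to write by hand. The sketch below uses a tiny, hand-picked stop word set (`SAMPLE_STOP_WORDS` is made up for this example and is far smaller than the list that ships with `spacy`).

```python
# Hypothetical, deliberately tiny stop word list for illustration only
SAMPLE_STOP_WORDS = {'in', 'to', 'are', 'be', 'back', 'until'}

def remove_stop_words(tokens, stop_words=SAMPLE_STOP_WORDS):
    # Compare in lower case so that capitalized stop words are removed too
    return [token for token in tokens if token.lower() not in stop_words]

tokens = "Shares in brewing to leisure group Bass Plc are likely to be held back".split()
print(remove_stop_words(tokens))
```

Keeping the list in a `set` makes each membership check cheap, which matters once you process thousands of documents.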

## Wrap everything into one function

**Basic SpaCy text processing function**

1. Split into sentences
2. Apply lemmatizer, remove stop words, remove punctuation
3. Clean up the sentence using `textacy`

In [19]:
def process_text_custom(text):
    sentences = list(nlp(text, disable=['tagger', 'ner', 'entity_linker', 'textcat', 'entity_ruler']).sents)
    lemmatized_sentences = []
    for sentence in sentences:
        lemmatized_sentences.append([token.lemma_ for token in sentence if not token.is_stop and token.is_alpha])
    return [' '.join(sentence) for sentence in lemmatized_sentences]

In [20]:
spacy_text_clean = {}
for author, text_list in tqdm(text_dict['test'].items()):
    lst = []
    for text in text_list:
        lst.append(process_text_custom(text))
    spacy_text_clean[author] = lst




*Note:* this would take quite a long time if we didn't disable some of the components. 

In [21]:
count = 0
for author, texts in spacy_text_clean.items():
    for text in texts:
        count += len(text)
print('Number of sentences:', count)

Number of sentences: 53986


Result

In [22]:
spacy_text_clean['TimFarrand'][0][:3]

['united biscuits holdings plc doubled profits million pounds million tax exceptional items reflecting simpler slimmed portfolio products',
 'united owns brands mcvities biscuits kp nuts exited keebler subsidiary said total exceptional charges mainly loss disposal businesses amounted million pounds compared million',
 'sales rose percent billion trading profits grew percent million']

*Note:* the quality of the input text is not great, so the sentence segmentation is also not great (without further tweaking).

# <span style="text-decoration: underline;">Direct feature extraction</span><a id='feature_extract'></a> [(to top)](#toc)

We have now pre-processed our text into something that we can use for direct feature extraction or convert to a numerical representation. 

## <span style="text-decoration: underline;">Feature search</span><a id='feature_search'></a> [(to top)](#toc)

### <span style="text-decoration: underline;">Entity recognition</span><a id='entity_recognition'></a> [(to top)](#toc)

It is often useful / relevant to extract entities that are mentioned in a piece of text.   

SpaCy is quite powerful in extracting entities, however, it doesn't work very well on lowercase text.  

Given that "token.lemma\_" removes capitalization I will use `spacy_sentences` for this example.

In [23]:
example_sentence = spacy_sentences['TimFarrand'][0][3]
example_sentence

Underlying profits growth was in line with stock brokers forecasts, but a presentation by management to analysts was greeted positively, sending the group's shares up 11 pence to 248-1/2p by 1415 gmt.
 

In [24]:
[(i, i.label_) for i in nlp(example_sentence.text).ents]

[(11 pence, 'MONEY'), (248, 'CARDINAL'), (1415, 'DATE')]

In [25]:
example_sentence = spacy_sentences['TimFarrand'][4][0]
example_sentence

Brewer to leisure group Whitbread Plc has turned in a "sound" business performance in the last three months, said chief executive Peter Jarvis in an interview on Friday.
 

In [26]:
[(i, i.label_) for i in nlp(example_sentence.text).ents]

[(Whitbread Plc, 'ORG'),
 (the last three months, 'DATE'),
 (Peter Jarvis, 'PERSON'),
 (Friday, 'DATE')]

### <span style="text-decoration: underline;">Pattern search</span><a id='pattern_search'></a> [(to top)](#toc)

Using the built-in `re` (regular expression) library you can pattern match nearly anything you want.  

I will not go into details about regular expressions but see here for a tutorial:  
https://regexone.com/references/python  

In [27]:
import re

**TIP**: Use [Pythex.org](https://pythex.org/) to try out your regular expression

Example on Pythex: <a href="https://pythex.org/?regex=IDNUMBER: (\d\d\d-\w\w)&test_string=Ties de Kok (IDNUMBER: 123-AZ). Rest of Text." target='_blank'>click here</a>

**Example 1:**  

In [52]:
string_1 = 'Ties de Kok (#IDNUMBER: 123-AZ). Rest of text...'
string_2 = 'Philip Joos (#IDNUMBER: 663-BY). Rest of text...'

In [53]:
pattern = r'#IDNUMBER: (\d\d\d-\w\w)'

In [54]:
print(re.findall(pattern, string_1)[0])
print(re.findall(pattern, string_2)[0])

123-AZ
663-BY
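
One caveat with `re.findall(...)[0]` is that it raises an `IndexError` when the pattern is not found. A slightly more defensive variant, sketched below, compiles the pattern once and returns `None` for texts without an ID. The names `ID_PATTERN` and `extract_id` are made up for this example.

```python
import re

# Same pattern as above, written with counted repetitions and a named group
ID_PATTERN = re.compile(r'#IDNUMBER: (?P<id>\d{3}-\w{2})')

def extract_id(text):
    # search() returns None when there is no match, so no exception is raised
    match = ID_PATTERN.search(text)
    return match.group('id') if match else None

print(extract_id('Ties de Kok (#IDNUMBER: 123-AZ). Rest of text...'))
print(extract_id('A text without an ID number.'))
```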


### Example 2:

Print every sentence that contains the word 'million' (case-insensitive):

In [28]:
TERM = 'million'
for sen in spacy_text_clean['TimFarrand'][2]:
    if re.search(TERM, sen, flags=re.IGNORECASE):
        print(sen)

company hotels excluding metropole london acquired november million stg million lonrho showed occupancy level percent weeks touch percent time
group hotel occupancy pulled slightly period temporary closure stakis tyneside work underway million stg refurbishment
turnover casinos rose percent million stg driven percent increase attendances


## <span style="text-decoration: underline;">Text evaluation</span><a id='text_eval'></a> [(to top)](#toc)

Besides feature search there are also many ways to analyze the text as a whole.  

Let's, for example, evaluate the following paragraph:

In [29]:
example_paragraph = ' '.join([x for x in spacy_text_clean['TimFarrand'][2]])
example_paragraph[:500]

'scottish based stakis plc wednesday reported surge visitors casinos sharp rise hotel room rates chief executive david michel confident mood future trends real terms room rates late room rates reached pre recession levels provinces michels told reuters company hotels excluding metropole london acquired november million stg million lonrho showed occupancy level percent weeks touch percent time average room rate rose stg period quarter percent think average percent year said michels forecast finish'

### <span style="text-decoration: underline;">Language</span><a id='language'></a> [(to top)](#toc)

Using the `langdetect` package it is easy to detect the language of a piece of text:

In [32]:
from langdetect import detect

In [33]:
print(detect(example_paragraph))

### <span style="text-decoration: underline;">Readability</span><a id='readability'></a> [(to top)](#toc)

Generally I'd recommend calculating the readability metrics yourself as they don't tend to be that difficult to compute. However, there are packages out there that can help, such as `spacy_readability`.

In [59]:
from spacy_readability import Readability

In [60]:
nlp.add_pipe(Readability(), name='readability', last=True)

In [61]:
doc = nlp("I am some really difficult text to read because I use obnoxiously large words.")
print(doc._.flesch_kincaid_grade_level)
print(doc._.smog)

8.412857142857145
0


**Manual example:** FOG index

In [62]:
import syllapy

In [63]:
def calculate_fog(document):
    doc = nlp(document, disable=['tagger', 'ner', 'entity_linker', 'textcat', 'entity_ruler'])
    sen_list = list(doc.sents)
    num_sen = len(sen_list)

    num_words = 0
    num_complex_words = 0
    for sen_obj in sen_list:
        words_in_sen = [token.text for token in sen_obj if token.is_alpha]
        num_words += len(words_in_sen)
        num_complex  = 0
        for word in words_in_sen:
            num_syl = syllapy.count(word.lower())
            if num_syl > 2:
                num_complex += 1
        num_complex_words += num_complex
        
    fog = 0.4 * ((num_words / num_sen) + ((num_complex_words / num_words)*100))
    return {'fog' : fog, 
            'num_sen' : num_sen, 
            'num_words' : num_words, 
            'num_complex_words' : num_complex_words}

In [64]:
calculate_fog(example_paragraph)

{'fog': 13.42327889849504,
 'num_sen': 36,
 'num_words': 347,
 'num_complex_words': 83}
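
The formula inside `calculate_fog` is easy to check by hand: 0.4 times the sum of the average sentence length and the percentage of complex words. The short sketch below plugs in the counts reported above; the helper name `fog_index` is made up for this example.

```python
def fog_index(num_words, num_sentences, num_complex_words):
    avg_sentence_length = num_words / num_sentences      # words per sentence
    pct_complex = 100 * num_complex_words / num_words    # share of words with 3+ syllables
    return 0.4 * (avg_sentence_length + pct_complex)

# Counts taken from the output of calculate_fog(example_paragraph) above
fog = fog_index(num_words=347, num_sentences=36, num_complex_words=83)
print(round(fog, 4))
```

With roughly 9.6 words per sentence and about 23.9% complex words this gives 0.4 × 33.56 ≈ 13.42, in line with the value returned above.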

## Text similarity

### Using `fuzzywuzzy`

In [65]:
from fuzzywuzzy import fuzz

In [66]:
fuzz.ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")

91
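
If you cannot install `fuzzywuzzy`, a comparable score can be sketched with `difflib` from the standard library. The helper name `similarity_score` is made up for this example, and the scores are not guaranteed to match `fuzz.ratio` exactly.

```python
from difflib import SequenceMatcher

def similarity_score(a, b):
    # ratio() is a float in [0, 1]; scale to 0-100 to resemble fuzz.ratio
    return round(100 * SequenceMatcher(None, a, b).ratio())

score = similarity_score("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
print(score)
```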

### Using `spacy`

Spacy can provide a similarity score based on the semantic similarity ([link](https://spacy.io/usage/vectors-similarity))

In [67]:
tokens_1 = nlp("fuzzy wuzzy was a bear")
tokens_2 = nlp("wuzzy fuzzy was a bear")

tokens_1.similarity(tokens_2)

1.0000000623731768

In [68]:
tokens_1 = nlp("Tom believes German cars are the best.")
tokens_2 = nlp("Sarah recently mentioned that she would like to go on holiday to Germany.")

tokens_1.similarity(tokens_2)

0.8127869114665882

### <span style="text-decoration: underline;">Term (dictionary) counting</span><a id='dict_counting'></a> [(to top)](#toc)

A common technique for basic NLP insights is to create simple metrics based on term counts. 

These are relatively easy to implement.

### Example 1:

In [69]:
word_dictionary = ['soft', 'first', 'most', 'be']

In [70]:
for word in word_dictionary:
    print(word, example_paragraph.count(word))

soft 2
first 0
most 0
be 7


### Example 2:

In [71]:
pos = ['great', 'agree', 'increase']
neg = ['bad', 'disagree', 'decrease']

sentence = '''According to the president everything is great, great, 
and great even though some people might disagree with those statements.'''

pos_count = 0
for word in pos:
    pos_count += sentence.lower().count(word)
print(pos_count)

neg_count = 0
for word in neg:
    neg_count += sentence.lower().count(word)
print(neg_count)

pos_count / (neg_count + pos_count)

4
1


0.8

Getting the total number of words is also easy:

In [72]:
num_tokens = len([token for token in nlp(sentence) if token.is_alpha])
num_tokens

19

#### Example 3:

We can also save the count per word

In [73]:
pos_count_dict = {}
for word in pos:
    pos_count_dict[word] = sentence.lower().count(word)

In [74]:
pos_count_dict

{'great': 3, 'agree': 1, 'increase': 0}

*Note:* `.lower()` is actually quite slow; if you have a lot of words / sentences it is recommended to minimize the number of `.lower()` operations that you have to make.
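
A second caveat is that `str.count` matches substrings: in the example above `agree` is counted once even though it only occurs inside `disagree`. A possible fix, sketched below, is to tokenize once (which also means calling `.lower()` only once) and count whole words. The helper name `count_terms` and the simple regular expression are assumptions made for this example.

```python
import re
from collections import Counter

def count_terms(text, terms):
    # Lower-case and tokenize once, then look up each term as a whole word
    word_counts = Counter(re.findall(r"[a-z']+", text.lower()))
    return {term: word_counts[term] for term in terms}

sentence = '''According to the president everything is great, great, 
and great even though some people might disagree with those statements.'''

pos_counts = count_terms(sentence, ['great', 'agree', 'increase'])
neg_counts = count_terms(sentence, ['bad', 'disagree', 'decrease'])
print(pos_counts)
print(neg_counts)
```

With whole-word matching the positive count drops from 4 to 3, so the ratio computed in Example 2 would become 0.75 instead of 0.8.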

# <span style="text-decoration: underline;">Represent text numerically</span><a id='text_numerical'></a> [(to top)](#toc)

## <span style="text-decoration: underline;">Bag of Words</span><a id='bows'></a> [(to top)](#toc)

Sklearn includes the `CountVectorizer` and `TfidfVectorizer` classes.  

For details, see the documentation:  
[TF](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer)  
[TFIDF](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer)

*Note 1:* these vectorizers also provide a variety of built-in preprocessing options (e.g. n-grams, stop word removal, accent stripping).

*Note 2:* example based on the following website [click here](http://ethen8181.github.io/machine-learning/clustering_old/tf_idf/tf_idf.html)

In [34]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

### Simple example:

In [35]:
doc_1 = "The sky is blue."
doc_2 = "The sun is bright today."
doc_3 = "The sun in the sky is bright."
doc_4 = "We can see the shining sun, the bright sun."

Calculate term frequency:

In [36]:
vectorizer = CountVectorizer(stop_words='english')
tf = vectorizer.fit_transform([doc_1, doc_2, doc_3, doc_4])

In [38]:
print(vectorizer.get_feature_names_out(), '\n')
for doc_tf_vector in tf.toarray():
    print(doc_tf_vector)

['blue' 'bright' 'shining' 'sky' 'sun' 'today'] 

[1 0 0 1 0 0]
[0 1 0 0 1 1]
[0 1 0 1 1 0]
[0 1 1 0 2 0]


### <span style="text-decoration: underline;">TF-IDF</span><a id='tfidf'></a> [(to top)](#toc)

In [39]:
transformer = TfidfVectorizer(stop_words='english')
tfidf = transformer.fit_transform([doc_1, doc_2, doc_3, doc_4])

In [40]:
for doc_vector in tfidf.toarray():
    print(doc_vector)

[0.78528828 0.         0.         0.6191303  0.         0.        ]
[0.         0.47380449 0.         0.         0.47380449 0.74230628]
[0.         0.53256952 0.         0.65782931 0.53256952 0.        ]
[0.         0.36626037 0.57381765 0.         0.73252075 0.        ]
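To make these numbers less of a black box, here is a minimal sketch that reproduces them by hand with the standard library. It assumes the scikit-learn defaults (`smooth_idf=True`, `norm='l2'`), for which the inverse document frequency is `ln((1 + n_docs) / (1 + df)) + 1`. The token lists are hard-coded and are an assumption derived from the vocabulary printed earlier (stop words already removed).

```python
import math

# Tokens per document after lowercasing and stop word removal (hard-coded assumption)
docs = [
    ['blue', 'sky'],
    ['sun', 'bright', 'today'],
    ['sun', 'sky', 'bright'],
    ['shining', 'sun', 'bright', 'sun'],
]

vocab = sorted({term for doc in docs for term in doc})
n_docs = len(docs)

# Document frequency: in how many documents does each term appear?
df = {term: sum(term in doc for doc in docs) for term in vocab}

# Smoothed inverse document frequency
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}

def tfidf_vector(doc):
    """Raw term count times idf, followed by L2 normalization."""
    raw = [doc.count(term) * idf[term] for term in vocab]
    norm = math.sqrt(sum(w * w for w in raw))
    return [w / norm for w in raw]

tfidf_manual = [tfidf_vector(doc) for doc in docs]

print(vocab)
for vec in tfidf_manual:
    print([round(w, 4) for w in vec])
```

Terms that occur in fewer documents (such as "blue") receive a higher weight than terms that occur in many documents (such as "sun" or "bright").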


### More elaborate example:

In [41]:
clean_paragraphs = []
for author, value in spacy_text_clean.items():
    for article in value:
        clean_paragraphs.append(' '.join(article))

In [42]:
len(clean_paragraphs)

2500

In [43]:
transformer = TfidfVectorizer(stop_words='english')
tfidf_large = transformer.fit_transform(clean_paragraphs)

In [44]:
# Use .shape instead of .toarray() to avoid building a large dense matrix
print('Number of vectors:', tfidf_large.shape[0])
print('Number of words in dictionary:', tfidf_large.shape[1])

Number of vectors: 2500
Number of words in dictionary: 27743


In [85]:
tfidf_large

<2500x21978 sparse matrix of type '<class 'numpy.float64'>'
	with 410121 stored elements in Compressed Sparse Row format>

## <span style="text-decoration: underline;">Word Embeddings</span><a id='word_embed'></a> [(to top)](#toc)

### <span style="text-decoration: underline;">Spacy</span><a id='spacyEmbedding'></a> [(to top)](#toc)

The `en_core_web_lg` language model comes with GloVe vectors trained on the Common Crawl dataset ([link](https://spacy.io/models/en#en_core_web_lg))

In [45]:
tokens = nlp("The Dutch word for peanut butter is 'pindakaas', did you know that? This is a typpo.")

for token in tokens:
    if token.is_alpha:
        print(token.text, token.has_vector, token.vector_norm, token.is_oov)

The True 7.2950425 True
Dutch True 7.227342 True
word True 5.537156 True
for True 7.144757 True
peanut True 6.721452 True
butter True 5.5611167 True
is True 6.0367823 True
pindakaas True 6.0864863 True
did True 7.466853 True
you True 8.403958 True
know True 6.753204 True
that True 7.6419835 True
This True 7.5847964 True
is True 6.569338 True
a True 7.379784 True
typpo True 6.361052 True


In [46]:
token = nlp('Car')
print('The token: "{}" has the following vector (dimension: {})'.format(token.text, len(token.vector)))
token.vector

The token: "Car" has the following vector (dimension: 96)


array([-1.3298006 , -1.0853987 ,  0.7887242 ,  0.56536686, -0.03679049,
        0.17918521,  1.0721457 ,  0.4140611 , -0.45138526, -0.4259355 ,
       -0.32376927, -0.24423355, -1.1884332 ,  0.38261512,  0.15325083,
        0.888623  , -1.0993898 , -0.36031944, -0.0155575 , -0.48386702,
       -0.65094745,  1.1043994 , -1.2378284 ,  0.16953191, -0.19734624,
       -0.11411573,  0.655138  ,  0.71806145,  0.1673866 ,  1.1834166 ,
       -0.5743222 ,  1.0206122 ,  0.2183578 , -0.8829129 , -0.37797755,
       -0.8775984 , -0.8520317 ,  0.5326886 ,  0.44445798, -0.02371767,
       -0.45813775,  0.1717524 ,  0.3198011 ,  0.56773376,  0.15410456,
       -0.26940504, -1.2045121 , -1.0995429 ,  0.20882471, -0.5321012 ,
        0.33936197,  0.8772712 ,  0.7056221 , -0.4283748 ,  0.673675  ,
       -1.0647851 ,  0.76150036, -0.8680595 , -0.11669695, -0.03319389,
       -1.2372603 ,  0.29322624,  0.12529306, -0.27613178,  0.4557415 ,
       -0.4610034 , -0.09175343,  0.7253711 , -0.12498236,  0.11

### <span style="text-decoration: underline;">Word2Vec</span><a id='Word2Vec'></a> [(to top)](#toc)

The simple example below is from:  https://medium.com/@mishra.thedeepak/word2vec-in-minutes-gensim-nlp-python-6940f4e00980

*Note:* you might have to run `nltk.download('brown')` to install the NLTK corpus files

In [47]:
import gensim
from nltk.corpus import brown

In [49]:
sentences = brown.sents()

In [50]:
model = gensim.models.Word2Vec(sentences, min_count=1)

Save model

In [51]:
model.save('brown_model')

Load model

In [52]:
model = gensim.models.Word2Vec.load('brown_model')

Find words most similar to 'mother':

In [53]:
print(model.wv.most_similar("mother"))

[('father', 0.9791275262832642), ('husband', 0.96830815076828), ('wife', 0.9470910429954529), ('son', 0.9341433644294739), ('friend', 0.9211450815200806), ('nickname', 0.9091935157775879), ('voice', 0.9058724045753479), ('brother', 0.8998143672943115), ('patient', 0.8902835249900818), ('uncle', 0.8770355582237244)]
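The scores returned by `most_similar` are cosine similarities between word vectors. The sketch below shows how such a score is computed; the three-dimensional vectors are hypothetical toy values, not vectors from the trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors, for illustration only
toy_vectors = {
    'mother': [0.9, 0.8, 0.1],
    'father': [0.8, 0.9, 0.2],
    'garden': [0.1, 0.2, 0.9],
}

# Rank all other words by their similarity to 'mother', most similar first
ranked = sorted(
    ((word, cosine_similarity(toy_vectors['mother'], vec))
     for word, vec in toy_vectors.items() if word != 'mother'),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranked)
```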


Find the odd one out:

In [54]:
print(model.wv.doesnt_match("breakfast cereal dinner lunch".split()))

cereal


In [55]:
print(model.wv.doesnt_match("pizza pasta garden fries".split()))

garden
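The idea behind finding the odd one out is roughly: normalize the word vectors, average them, and pick the word that is least similar to that average. The code below is a rough sketch of that idea using hypothetical two-dimensional toy vectors; it is not gensim's actual implementation, and the function names are assumptions.

```python
import math

def unit(vector):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(a * a for a in vector))
    return [a / norm for a in vector]

def odd_one_out(vectors):
    """Return the word whose unit vector is least similar to the mean unit vector."""
    units = {word: unit(vec) for word, vec in vectors.items()}
    dim = len(next(iter(units.values())))
    mean = unit([sum(u[i] for u in units.values()) / len(units) for i in range(dim)])
    # The dot product of two unit vectors equals their cosine similarity
    return min(units, key=lambda word: sum(a * b for a, b in zip(units[word], mean)))

# Hypothetical toy vectors, for illustration only
toy_vectors = {
    'pizza': [0.9, 0.1],
    'pasta': [0.8, 0.2],
    'fries': [0.85, 0.15],
    'garden': [0.1, 0.9],
}

print(odd_one_out(toy_vectors))
```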


Retrieve vector representation of the word "human"

In [56]:
model.wv['human']

array([-0.5970706 ,  0.377084  ,  0.7432498 ,  0.18828899, -0.2765691 ,
       -0.63277966,  1.4230691 ,  1.3836328 , -0.5716369 , -0.35936382,
       -0.09947161, -0.70819783,  0.7136255 , -0.4917133 ,  0.09956642,
       -0.63992554,  0.4826114 , -0.26998937, -0.4830221 , -1.0307387 ,
        0.5938822 ,  0.04562862,  0.5864228 ,  0.74978435, -0.36310643,
       -0.35855043,  0.21060367,  0.17792048, -0.73211503, -0.10176354,
        0.43670663, -0.6949803 ,  0.78549016, -0.32429957,  0.31239748,
        0.88337296, -0.37561104, -0.4772151 , -0.59918207,  0.01508404,
       -0.08576139, -0.33897203,  0.5327704 ,  0.00574782,  0.6405279 ,
        0.14520381, -0.322308  , -0.16603571,  0.07882246,  0.20980729,
        1.1841923 , -0.6082804 , -0.4157628 , -0.09768971, -0.8226214 ,
       -0.49966252,  0.97380173,  0.14174512, -0.6766032 , -0.1139592 ,
        0.06136736,  0.27670348,  0.01457378, -0.6659458 , -1.0047694 ,
        1.0267179 ,  0.06506062,  0.30315375, -0.83636177,  0.18

# <span style="text-decoration: underline;">Statistical models</span><a id='stat_models'></a> [(to top)](#toc)

## <span style="text-decoration: underline;">"Traditional" machine learning</span><a id='trad_ml'></a> [(to top)](#toc)

The library to use for machine learning is scikit-learn (["sklearn"](http://scikit-learn.org/stable/index.html)).

## <span>Supervised</span><a id='trad_ml_supervised'></a> [(to top)](#toc)

In [57]:
from sklearn.model_selection import cross_val_score, KFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import metrics
import joblib

In [58]:
import pandas as pd
import numpy as np

### Convert the data into a pandas DataFrame (so that we can input it more easily)

In [59]:
article_list = []
for author, value in spacy_text_clean.items():
    for article in value:
        article_list.append((author, ' '.join(article)))

In [60]:
article_df = pd.DataFrame(article_list, columns=['author', 'text'])

In [61]:
article_df.sample(5)

| | author | text |
|---|---|---|
| 2204 | BernardHickey | treasurer peter costello wednesday announced b... |
| 289 | PatriciaCommins | summit medical systems said monday revenues su... |
| 151 | BenjaminKangLim | selling motorcycles joy ride giant military fo... |
| 1388 | GrahamEarnshaw | china announced tuesday preferential tax rate ... |
| 2383 | FumikoFujisaki | year formed japanese financial megamerger worl... |


### Split the sample into a training and test sample

In [62]:
X_train, X_test, y_train, y_test = train_test_split(article_df.text, article_df.author, test_size=0.20, random_state=3561)

In [63]:
print(len(X_train), len(X_test))

2000 500


### Train and evaluate function

Simple function to train (i.e. fit) and evaluate the model

In [64]:
def train_and_evaluate(clf, X_train, X_test, y_train, y_test):
    
    clf.fit(X_train, y_train)
    
    print("Accuracy on training set:")
    print(clf.score(X_train, y_train))
    print("Accuracy on testing set:")
    print(clf.score(X_test, y_test))
    
    y_pred = clf.predict(X_test)
    
    print("Classification Report:")
    print(metrics.classification_report(y_test, y_pred))
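The classification report lists precision, recall and the F1-score per author. As a minimal sketch of what these columns mean, the code below computes them by hand for a single class; the function name and the toy labels are assumptions for illustration.

```python
def precision_recall_f1(y_true, y_pred, label):
    """Compute precision, recall and F1 for one class from paired labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)  # correctly predicted as label
    fp = sum(t != label and p == label for t, p in pairs)  # wrongly predicted as label
    fn = sum(t == label and p != label for t, p in pairs)  # label that was missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Hypothetical toy labels, for illustration only
toy_true = ['A', 'A', 'A', 'B', 'B', 'C']
toy_pred = ['A', 'A', 'B', 'B', 'A', 'C']

precision_a, recall_a, f1_a = precision_recall_f1(toy_true, toy_pred, 'A')
print(round(precision_a, 2), round(recall_a, 2), round(f1_a, 2))
```

Precision answers "of all articles attributed to this author, how many were correct?", while recall answers "of all articles by this author, how many were found?".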

### <span>Naïve Bayes estimator</span><a id='trad_ml_supervised_nb'></a> [(to top)](#toc)

In [65]:
from sklearn.naive_bayes import MultinomialNB

Define pipeline

In [66]:
clf = Pipeline([
    ('vect', TfidfVectorizer(strip_accents='unicode',
                             lowercase=True,
                             max_features=1500,
                             stop_words='english')),
    ('clf', MultinomialNB(alpha=1,
                          fit_prior=True)),
])

Train and show evaluation stats

In [67]:
train_and_evaluate(clf, X_train, X_test, y_train, y_test)

Accuracy on training set:
0.853
Accuracy on testing set:
0.73
Classification Report:
                   precision    recall  f1-score   support

    AaronPressman       0.64      0.88      0.74         8
       AlanCrosby       0.53      1.00      0.70         8
   AlexanderSmith       0.83      0.91      0.87        11
  BenjaminKangLim       0.43      0.27      0.33        11
    BernardHickey       0.58      0.70      0.64        10
      BradDorfman       0.86      0.67      0.75         9
 DarrenSchuettler       0.71      0.71      0.71        14
      DavidLawder       1.00      0.50      0.67        10
    EdnaFernandes       0.60      1.00      0.75         3
      EricAuchard       0.71      0.56      0.62         9
   FumikoFujisaki       0.93      1.00      0.96        13
   GrahamEarnshaw       0.62      0.91      0.74        11
 HeatherScoffield       0.83      0.50      0.62        10
       JanLopatka       0.83      0.50      0.62        10
    JaneMacartney       0.45 

Save results

In [68]:
joblib.dump(clf, 'naive_bayes_results.pkl')

['naive_bayes_results.pkl']

Predict a single example (note that index 33 is taken from the training set, so this is an in-sample prediction):

In [69]:
example_y, example_X = y_train[33], X_train[33]

In [70]:
print('Actual author:', example_y)
print('Predicted author:', clf.predict([example_X])[0])

Actual author: MarcelMichelson
Predicted author: MarcelMichelson


### <span>Support Vector Machines (SVM)</span><a id='trad_ml_supervised_svm'></a> [(to top)](#toc)

In [71]:
from sklearn.svm import SVC

Define pipeline

In [72]:
clf_svm = Pipeline([
    ('vect', TfidfVectorizer(strip_accents='unicode',
                             lowercase=True,
                             max_features=1500,
                             stop_words='english')),
    ('clf', SVC(kernel='rbf',
                C=10,
                gamma=0.3)),
])

*Note:* The SVC estimator is very sensitive to the hyperparameters!

Train and show evaluation stats

In [73]:
train_and_evaluate(clf_svm, X_train, X_test, y_train, y_test)

Accuracy on training set:
0.9985
Accuracy on testing set:
0.84
Classification Report:
                   precision    recall  f1-score   support

    AaronPressman       0.73      1.00      0.84         8
       AlanCrosby       0.89      1.00      0.94         8
   AlexanderSmith       0.92      1.00      0.96        11
  BenjaminKangLim       0.53      0.73      0.62        11
    BernardHickey       0.86      0.60      0.71        10
      BradDorfman       0.73      0.89      0.80         9
 DarrenSchuettler       1.00      0.93      0.96        14
      DavidLawder       0.80      0.80      0.80        10
    EdnaFernandes       0.60      1.00      0.75         3
      EricAuchard       0.57      0.89      0.70         9
   FumikoFujisaki       1.00      1.00      1.00        13
   GrahamEarnshaw       0.83      0.91      0.87        11
 HeatherScoffield       1.00      1.00      1.00        10
       JanLopatka       1.00      0.70      0.82        10
    JaneMacartney       0.50

Save results

In [74]:
joblib.dump(clf_svm, 'svm_results.pkl')

['svm_results.pkl']

Predict a single example (note that index 33 is taken from the training set, so this is an in-sample prediction):

In [75]:
example_y, example_X = y_train[33], X_train[33]

In [76]:
print('Actual author:', example_y)
print('Predicted author:', clf_svm.predict([example_X])[0])

Actual author: MarcelMichelson
Predicted author: MarcelMichelson


## <span>Model Selection and Evaluation</span><a id='trad_ml_eval'></a> [(to top)](#toc)

Both the `TfidfVectorizer` and `SVC()` estimator take a lot of hyperparameters.  

It can be difficult to figure out what the best parameters are.

We can use `GridSearchCV` to help figure this out.

In [77]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer
from sklearn.metrics import f1_score

First we define the options that should be tried out:

In [78]:
clf_search = Pipeline([
    ('vect', TfidfVectorizer()),
    ('clf', SVC())
])
parameters = {
    'vect__stop_words': ['english'],
    'vect__strip_accents': ['unicode'],
    'vect__max_features': [1500],
    'vect__ngram_range': [(1, 1), (2, 2)],
    'clf__gamma': [0.2, 0.3, 0.4],
    'clf__C': [8, 10, 12],
    'clf__kernel': ['rbf'],
}
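A grid search tries every combination of the listed values, so the number of model fits grows quickly. The sketch below enumerates the combinations with `itertools.product` to show how many candidates this grid contains. The fold count of 5 is an assumption based on the default in recent scikit-learn versions.

```python
from itertools import product

parameters = {
    'vect__stop_words': ['english'],
    'vect__strip_accents': ['unicode'],
    'vect__max_features': [1500],
    'vect__ngram_range': [(1, 1), (2, 2)],
    'clf__gamma': [0.2, 0.3, 0.4],
    'clf__C': [8, 10, 12],
    'clf__kernel': ['rbf'],
}

# Every candidate is one combination of values, one value per parameter
keys = list(parameters)
candidates = [dict(zip(keys, combo)) for combo in product(*parameters.values())]

n_folds = 5  # assumption: default number of cross-validation folds
print('Candidates:', len(candidates))
print('Cross-validation fits:', len(candidates) * n_folds)
```

Adding one more value to any of the lists multiplies the number of candidates, so it usually pays off to start with a coarse grid and refine it around the best result.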

Run everything:

In [80]:
grid = GridSearchCV(clf_search, 
                    param_grid=parameters, 
                    # scoring=make_scorer(f1_score, average='micro'), 
                    n_jobs=-1
                   )
grid.fit(X_train, y_train)    

GridSearchCV(estimator=Pipeline(step...clf', SVC())]), n_jobs=-1,
             param_grid={'clf__C': [8, 10, ...], 'clf__gamma': [0.2, 0.3, ...], 'clf__kernel': ['rbf'], 'vect__max_features': [1500], ...})


*Note:* if you are on a powerful (preferably Unix) system you can set `n_jobs` to the number of available threads (`-1` uses all processors) to speed up the calculation.

In [81]:
print("The best parameters are %s with a score of %0.2f" % (grid.best_params_, grid.best_score_))
y_true, y_pred = y_test, grid.predict(X_test)
print(metrics.classification_report(y_true, y_pred))

The best parameters are {'clf__C': 8, 'clf__gamma': 0.3, 'clf__kernel': 'rbf', 'vect__max_features': 1500, 'vect__ngram_range': (1, 1), 'vect__stop_words': 'english', 'vect__strip_accents': 'unicode'} with a score of 0.79
                   precision    recall  f1-score   support

    AaronPressman       0.73      1.00      0.84         8
       AlanCrosby       0.89      1.00      0.94         8
   AlexanderSmith       0.92      1.00      0.96        11
  BenjaminKangLim       0.53      0.73      0.62        11
    BernardHickey       0.86      0.60      0.71        10
      BradDorfman       0.73      0.89      0.80         9
 DarrenSchuettler       1.00      0.93      0.96        14
      DavidLawder       0.80      0.80      0.80        10
    EdnaFernandes       0.60      1.00      0.75         3
      EricAuchard       0.57      0.89      0.70         9
   FumikoFujisaki       1.00      1.00      1.00        13
   GrahamEarnshaw       0.83      0.91      0.87        11
 HeatherSc

## <span>Unsupervised</span><a id='trad_ml_unsupervised'></a> [(to top)](#toc)

### <span>Latent Dirichlet Allocation (LDA)</span><a id='trad_ml_unsupervised_lda'></a> [(to top)](#toc)

In [82]:
from sklearn.decomposition import LatentDirichletAllocation

Vectorizer (using `CountVectorizer` for the sake of example)

In [83]:
vectorizer = CountVectorizer(strip_accents='unicode',
                             lowercase=True,
                             max_features=1500,
                             stop_words='english',
                             max_df=0.8)
tf_large = vectorizer.fit_transform(clean_paragraphs)

Run the LDA model

In [84]:
n_topics = 10
n_top_words = 25

In [85]:
lda = LatentDirichletAllocation(n_components=n_topics, max_iter=10,
                                learning_method='online',
                                n_jobs=-1)
lda_fitted = lda.fit_transform(tf_large)

Visualize top words

In [86]:
def save_top_words(model, feature_names, n_top_words):
    out_list = []
    for topic_idx, topic in enumerate(model.components_):
        out_list.append((topic_idx+1, " ".join([feature_names[i] for i in topic.argsort()[:-n_top_words - 1:-1]])))
    out_df = pd.DataFrame(out_list, columns=['topic_id', 'top_words'])
    return out_df
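The slice `topic.argsort()[:-n_top_words - 1:-1]` in the function above can look cryptic: `argsort()` returns the indices that sort the weights in ascending order, and the slice walks backwards from the end to keep the `n_top_words` largest. A minimal sketch with plain Python lists (which follow the same slicing rules) and hypothetical toy weights:

```python
# Hypothetical toy topic weights and feature names, for illustration only
weights = [0.1, 0.7, 0.2, 0.5]
feature_names = ['bank', 'china', 'market', 'percent']
n_top = 2

# Indices that sort the weights in ascending order (what argsort() returns)
ascending = sorted(range(len(weights)), key=lambda i: weights[i])

# Walk backwards from the end and stop after n_top items
top_indices = ascending[:-n_top - 1:-1]
top_words = [feature_names[i] for i in top_indices]

print(ascending)
print(top_indices)
print(top_words)
```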

In [88]:
result_df = save_top_words(lda, vectorizer.get_feature_names_out(), n_top_words)

In [89]:
result_df

| | topic_id | top_words |
|---|---|---|
| 0 | 1 | conrail new merger long services companies com... |
| 1 | 2 | china kong hong chinese beijing people deng pa... |
| 2 | 3 | percent million year quarter sales share billi... |
| 3 | 4 | bank market banks financial percent securities... |
| 4 | 5 | new company ford gm tobacco car industry chrys... |
| 5 | 6 | company new internet computer software microso... |
| 6 | 7 | year percent tonnes prices oil market million ... |
| 7 | 8 | thomson aol online new service america french ... |
| 8 | 9 | pounds million british bre shares gold company... |
| 9 | 10 | government czech minister percent told deal po... |


### <span>pyLDAvis</span><a id='trad_ml_unsupervised_pyLDAvis'></a> [(to top)](#toc)

In [96]:
# %matplotlib inline
import pyLDAvis.lda_model
# import pyLDAvis.sklearn  # module name used by older pyLDAvis releases
pyLDAvis.enable_notebook()

In [97]:
pyLDAvis.lda_model.prepare(lda, tf_large, vectorizer, n_jobs=-1)

**Warning:** due to a small bug, showing the `pyLDAvis` visualization hides some of the JupyterLab icons.

## <span style="text-decoration: underline;">Neural Networks</span><a id='nn_ml'></a> [(to top)](#toc)

Interested? Check out the Stanford course CS224n ([Page](http://web.stanford.edu/class/cs224n/index.html#schedule))!   