# Week 2: Python Business Analytics

See the repository for future work: https://github.com/firmai/python-business-analytics

Or sign up to the mailing list: https://mailchi.mp/ec4942d52cc5/firmai

# Text Mining NLP

# *Introduction*

This notebook contains code examples to get you started with Natural Language Processing (NLP) / Text Mining for Research and Data Science purposes.  

In the grand scheme of things there are roughly four steps:  

1. Identify a data source  
2. Gather the data  
3. Process the data  
4. Analyze the data  


## Note: companion slides

# *Elements / topics that are discussed in this notebook:*


<img style="float: left" src="https://i.imgur.com/c3aCZLA.png" width="50%" />

# *Table of Contents*  <a id='toc'></a>

* [Primer on NLP tools](#tool_primer)     
* [Process + Clean text](#proc_clean)   
    * [Normalization](#normalization)
        * [Deal with unwanted characters](#unwanted_char)
        * [Sentence segmentation](#sentence_seg)   
        * [Word tokenization](#word_token)
        * [Lemmatization & Stemming](#lem_and_stem)
    * [Language modeling](#lang_model)
        * [Part-of-Speech tagging](#pos_tagging)
        * [Uni-Gram & N-Grams](#n_grams)
        * [Stop words](#stop_words)
* [Direct feature extraction](#feature_extract)
    * [Feature search](#feature_search)
        * [Entity recognition](#entity_recognition)
        * [Pattern search](#pattern_search)
    * [Text evaluation](#text_eval)
        * [Language](#language)
        * [Dictionary counting](#dict_counting)
        * [Readability](#readability)
* [Represent text numerically](#text_numerical)
    * [Bag of Words](#bows)
        * [TF-IDF](#tfidf)
    * [Word Embeddings](#word_embed)
        * [Word2Vec](#Word2Vec)
* [Statistical models](#stat_models)
    * ["Traditional" machine learning](#trad_ml)
        * [Supervised](#trad_ml_supervised)
            * [Naïve Bayes](#trad_ml_supervised_nb)
            * [Support Vector Machines (SVM)](#trad_ml_supervised_svm)
        * [Unsupervised](#trad_ml_unsupervised)
            * [Latent Dirichlet Allocation (LDA)](#trad_ml_unsupervised_lda)
            * [pyLDAvis](#trad_ml_unsupervised_pyLDAvis)
* [Model Selection and Evaluation](#trad_ml_eval)
* [Neural Networks](#nn_ml)

# <span style="text-decoration: underline;">Primer on NLP tools</span><a id='tool_primer'></a> [(to top)](#toc)

There are many tools available for NLP purposes.  
The code examples below are based on what I personally like to use; they are not intended to be a comprehensive overview.  

Besides built-in Python functionality I will use / demonstrate the following packages:

**Standard NLP libraries**:
1. `Spacy` and the higher-level wrapper `Textacy`
2. `NLTK` and the higher-level wrapper `TextBlob`

*Note: besides installing the above packages you also often have to download (model) data. Make sure to check the documentation!*

**Standard machine learning library**:

1. `scikit-learn`

**Specific task libraries**:

There are many, just a couple of examples:

1. `pyLDAvis` for visualizing LDA
2. `langdetect` for detecting languages
3. `fuzzywuzzy` for fuzzy text matching
4. `textstat` to calculate readability statistics
5. `Gensim` for topic modelling

# <span style="text-decoration: underline;">Get some example data</span><a id='example_data'></a> [(to top)](#toc)

There are many example datasets available to play around with, see for example this great repository:  
https://archive.ics.uci.edu/ml/datasets.html?format=&task=&att=&area=&numAtt=&numIns=&type=text&sort=nameUp&view=table

The data that I will use for most of the examples is the "Reuter_50_50 Data Set" that is used for author identification experiments.

See the details here: https://archive.ics.uci.edu/ml/datasets/Reuter_50_50  

### Download and load the data

Can't follow what I am doing here? Please see my [Python tutorial](https://github.com/TiesdeKok/LearnPythonforResearch) (although the `zipfile` and `io` operations are not very relevant).

In [3]:
# !uv add textacy
# !uv add langdetect
# !uv add gensim
# !uv add fuzzywuzzy
# !uv add pyLDAvis

In [1]:
import requests, zipfile, io, os

*Download and extract the zip file with the data*

In [2]:
if not os.path.exists('C50test'):
    r = requests.get("https://archive.ics.uci.edu/ml/machine-learning-databases/00217/C50.zip")
    z = zipfile.ZipFile(io.BytesIO(r.content))
    z.extractall()

*Load the data into memory*

In [2]:
folder_dict = {'test' : 'C50test'}
text_dict = {'test' : {}}

In [3]:
for label, folder in folder_dict.items():
    authors = os.listdir(folder)
    for author in authors:
        text_files = os.listdir(os.path.join(folder, author))
        for file in text_files:
            with open(os.path.join(folder, author, file), 'r') as text_file:
                text_dict[label].setdefault(author, []).append(' '.join(text_file.readlines()))

*Note: the text comes pre-split per sentence; for the sake of example I undo this through `' '.join(text_file.readlines())`*

In [4]:
text_dict['test']['TimFarrand'][0]

'United Biscuits (Holdings) Plc more than doubled its profits in 1996 to 109 million pounds ($174 million) before tax and exceptional items, reflecting a simpler and slimmed-down portfolio of products.\n United, which owns brands such as McVities biscuits and KP nuts but has exited from its U.S. Keebler subsidiary, said total exceptional charges, mainly from the loss on disposal of businesses, amounted to 84.7 million pounds in 1996 compared with 150.3 million in 1995.\n Sales rose by three percent to 1.887 billion and trading profits grew four percent to 129.2 million.\n Underlying profits growth was in line with stock brokers forecasts, but a presentation by management to analysts was greeted positively, sending the group\'s shares up 11 pence to 248-1/2p by 1415 gmt.\n "It\'s all quite encouraging. The way they are analysing and managing the business is very much more in line with what the market demands," said Richard Workman, an analyst at ABN-AMRO Hoare Govett.\n The company said

# <span style="text-decoration: underline;">Process + Clean text</span><a id='proc_clean'></a> [(to top)](#toc)

## Convert the text into an NLP representation

We can use the text directly, but if we want to use packages like `spacy` and `textblob` we first have to convert the text into a corresponding object.  

### Spacy

**Note:** depending on the way that you installed the language model you can load it in one of two ways:

```
import spacy
nlp = spacy.load('en_core_web_sm')
```
OR
```
import en_core_web_sm
nlp = en_core_web_sm.load()
```

In [14]:
import spacy
import en_core_web_sm

nlp = en_core_web_sm.load()
# nlp = spacy.load('en_core_web_sm')



Convert all text in the "test" sample to a `spacy` `doc` object using `nlp()`:

In [16]:
spacy_text = {}
for author, text_list in text_dict['test'].items():
    spacy_text[author] = [nlp(text) for text in text_list]

In [17]:
type(spacy_text['TimFarrand'][0])

spacy.tokens.doc.Doc

### NLTK

In [18]:
import nltk

We can apply basic `nltk` operations directly to the text so we don't need to convert first.

### TextBlob

In [20]:
from textblob import TextBlob

Convert all text in the "test" sample to a `TextBlob` object using `TextBlob()`:

In [21]:
textblob_text = {}
for author, text_list in text_dict['test'].items():
    textblob_text[author] = [TextBlob(text) for text in text_list]

In [22]:
type(textblob_text['TimFarrand'][0])

textblob.blob.TextBlob

## <span style="text-decoration: underline;">Normalization</span><a id='normalization'></a> [(to top)](#toc)

**Text normalization** describes the task of transforming the text into a different (more comparable) form.  

This can imply many things; I will show a couple of them below:

### <span style="text-decoration: underline;">Deal with unwanted characters</span><a id='unwanted_char'></a> [(to top)](#toc)

You will often notice that there are characters that you don't want in your text.  

Let's look at this sentence for example:

> "Shares in brewing-to-leisure group Bass Plc are likely to be held back until Britain\'s Trade and Industry secretary Ian Lang decides whether to allow its proposed merge with brewer Carlsberg-Tetley, said analysts.\n Earlier Lang announced the Bass deal would be referred to the Monoplies and Mergers"

You notice that there are some `\` and `\n` in there. These are used to define how a string should be displayed. The same happens in our example text; compare the raw string with the printed version:  

In [23]:
text_dict['test']['TimFarrand'][0][:298]

'United Biscuits (Holdings) Plc more than doubled its profits in 1996 to 109 million pounds ($174 million) before tax and exceptional items, reflecting a simpler and slimmed-down portfolio of products.\n United, which owns brands such as McVities biscuits and KP nuts but has exited from its U.S. Kee'

In [24]:
print(text_dict['test']['TimFarrand'][0][:298])

United Biscuits (Holdings) Plc more than doubled its profits in 1996 to 109 million pounds ($174 million) before tax and exceptional items, reflecting a simpler and slimmed-down portfolio of products.
 United, which owns brands such as McVities biscuits and KP nuts but has exited from its U.S. Kee


If we want to analyze text we often don't care about the visual representation, and these formatting characters might actually cause problems!  

**So how do we remove them?**

In many cases it is sufficient to simply use the `.replace()` string method:

In [25]:
text_dict['test']['TimFarrand'][0][:298].replace('\n', '').replace('\\', '')

'United Biscuits (Holdings) Plc more than doubled its profits in 1996 to 109 million pounds ($174 million) before tax and exceptional items, reflecting a simpler and slimmed-down portfolio of products. United, which owns brands such as McVities biscuits and KP nuts but has exited from its U.S. Kee'
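
A minimal sketch of a slightly more general clean-up, assuming you want every run of whitespace (newlines, tabs, repeated spaces) collapsed into a single space. The helper name `collapse_whitespace` and the sample string are illustrative and not part of the original notebook:

```python
import re

def collapse_whitespace(text):
    """Replace every run of whitespace (spaces, tabs, newlines) with one space."""
    return re.sub(r'\s+', ' ', text).strip()

# Made-up sample containing a newline, a tab, and repeated spaces.
sample = 'portfolio of products.\n United, which owns\tbrands   such as McVities'
print(collapse_whitespace(sample))
```

Unlike chained `.replace()` calls, this also handles tabs and double spaces without listing each case separately.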

Sometimes, however, the problem arises because of encoding/decoding problems.  

In those cases you can usually do something like:  

In [26]:
problem_sentence = 'This is some \\u03c0 text that has to be cleaned\\u2026! it\\u0027s annoying!'
print(problem_sentence.encode().decode('unicode_escape').encode('ascii','ignore'))

b"This is some  text that has to be cleaned! it's annoying!"


### <span style="text-decoration: underline;">Sentence segmentation</span><a id='sentence_seg'></a> [(to top)](#toc)

Sentence segmentation is the task of splitting a piece of text into sentences.  

You could do this by splitting on the `.` symbol, but dots are used in many other cases as well so it is not very robust:

In [27]:
text_dict['test']['TimFarrand'][0][:550].split('.')

['United Biscuits (Holdings) Plc more than doubled its profits in 1996 to 109 million pounds ($174 million) before tax and exceptional items, reflecting a simpler and slimmed-down portfolio of products',
 '\n United, which owns brands such as McVities biscuits and KP nuts but has exited from its U',
 'S',
 ' Keebler subsidiary, said total exceptional charges, mainly from the loss on disposal of businesses, amounted to 84',
 '7 million pounds in 1996 compared with 150',
 '3 million in 1995',
 '\n Sales rose by three percent to 1',
 '887 billion and trading profits grew four pe']

It is better to use a more sophisticated implementation such as the one by `Spacy`:

In [28]:
example_paragraph = spacy_text['TimFarrand'][0]

In [29]:
sentence_list = [s for s in example_paragraph.sents]
sentence_list[:5]

[United Biscuits (Holdings) Plc more than doubled its profits in 1996 to 109 million pounds ($174 million) before tax and exceptional items, reflecting a simpler and slimmed-down portfolio of products.
  ,
 United, which owns brands such as McVities biscuits and KP nuts but has exited from its U.S. Keebler subsidiary, said total exceptional charges, mainly from the loss on disposal of businesses, amounted to 84.7 million pounds in 1996 compared with 150.3 million in 1995.
  ,
 Sales rose by three percent to 1.887 billion and trading profits grew four percent to 129.2 million.
  ,
 Underlying profits growth was in line with stock brokers forecasts, but a presentation by management to analysts was greeted positively, sending the group's shares up 11 pence to 248-1/2p by 1415 gmt.
  ,
 "It's all quite encouraging.]

Notice that the returned object is still a `spacy` object:

In [30]:
type(sentence_list[0])

spacy.tokens.span.Span

Apply to all texts (for use later on):

In [31]:
spacy_sentences = {}
for author, text_list in spacy_text.items():
    spacy_sentences[author] = [list(text.sents) for text in text_list]

In [32]:
spacy_sentences['TimFarrand'][0][:3]

[United Biscuits (Holdings) Plc more than doubled its profits in 1996 to 109 million pounds ($174 million) before tax and exceptional items, reflecting a simpler and slimmed-down portfolio of products.
  ,
 United, which owns brands such as McVities biscuits and KP nuts but has exited from its U.S. Keebler subsidiary, said total exceptional charges, mainly from the loss on disposal of businesses, amounted to 84.7 million pounds in 1996 compared with 150.3 million in 1995.
  ,
 Sales rose by three percent to 1.887 billion and trading profits grew four percent to 129.2 million.
  ]

### <span style="text-decoration: underline;">Word tokenization</span><a id='word_token'></a> [(to top)](#toc)

Word tokenization means splitting the sentence (or text) up into words.

In [33]:
example_sentence = spacy_sentences['TimFarrand'][0][0]
example_sentence

United Biscuits (Holdings) Plc more than doubled its profits in 1996 to 109 million pounds ($174 million) before tax and exceptional items, reflecting a simpler and slimmed-down portfolio of products.
 

A word is called a `token` in this context (hence `tokenization`), using `spacy`:

In [34]:
token_list = [token for token in example_sentence]
token_list[0:15]

[United,
 Biscuits,
 (,
 Holdings,
 ),
 Plc,
 more,
 than,
 doubled,
 its,
 profits,
 in,
 1996,
 to,
 109]
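
To see why a dedicated tokenizer helps, the sketch below compares a plain whitespace split with a simple regular-expression tokenizer. The regular expression is an illustrative assumption and is much cruder than the rules that `spacy` applies:

```python
import re

sentence = 'United Biscuits (Holdings) Plc more than doubled its profits in 1996.'

# Splitting on whitespace leaves punctuation glued to the words.
whitespace_tokens = sentence.split()

# \w+ matches a run of letters/digits; [^\w\s] matches one punctuation character.
regex_tokens = re.findall(r'\w+|[^\w\s]', sentence)

print(whitespace_tokens[:4])
print(regex_tokens[:6])
```

The whitespace split keeps `(Holdings)` as one token, whereas the regular expression separates the brackets, which is closer to the `spacy` output above.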

### <span style="text-decoration: underline;">Lemmatization & Stemming</span><a id='lem_and_stem'></a> [(to top)](#toc)

In some cases you want to convert a word (i.e. token) into a more general representation.  

For example: convert "car", "cars", "car's", "cars'" all into the word `car`.

This is generally done through lemmatization / stemming (different approaches trying to achieve a similar goal).  

**Spacy**

Spacy offers built-in functionality for lemmatization:

In [35]:
lemmatized = [token.lemma_ for token in example_sentence]
lemmatized[0:15]

['United',
 'Biscuits',
 '(',
 'Holdings',
 ')',
 'Plc',
 'more',
 'than',
 'double',
 'its',
 'profit',
 'in',
 '1996',
 'to',
 '109']

**NLTK**

Using the NLTK library we can also use the more aggressive Porter Stemmer:

In [36]:
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()

In [37]:
stemmed = [stemmer.stem(token.text) for token in example_sentence]
stemmed[0:15]

['unit',
 'biscuit',
 '(',
 'hold',
 ')',
 'plc',
 'more',
 'than',
 'doubl',
 'it',
 'profit',
 'in',
 '1996',
 'to',
 '109']

**Compare**:

In [38]:
for original, lemma, stem in zip(token_list[:15], lemmatized[:15], stemmed[:15]):
    print(original, ' | ', lemma, ' | ', stem)

United  |  United  |  unit
Biscuits  |  Biscuits  |  biscuit
(  |  (  |  (
Holdings  |  Holdings  |  hold
)  |  )  |  )
Plc  |  Plc  |  plc
more  |  more  |  more
than  |  than  |  than
doubled  |  double  |  doubl
its  |  its  |  it
profits  |  profit  |  profit
in  |  in  |  in
1996  |  1996  |  1996
to  |  to  |  to
109  |  109  |  109


In my experience it is usually best to use lemmatization instead of a stemmer.

## <span style="text-decoration: underline;">Language modeling</span><a id='lang_model'></a> [(to top)](#toc)

Text is inherently structured in complex ways, and we can often use some of this underlying structure.

### <span style="text-decoration: underline;">Part-of-Speech tagging</span><a id='pos_tagging'></a> [(to top)](#toc)

Part of speech tagging refers to the identification of words as nouns, verbs, adjectives, etc.

Using `Spacy`:

In [39]:
pos_list = [(token, token.pos_) for token in example_sentence]
pos_list[0:10]

[(United, 'PROPN'),
 (Biscuits, 'PROPN'),
 ((, 'PUNCT'),
 (Holdings, 'PROPN'),
 (), 'PUNCT'),
 (Plc, 'PROPN'),
 (more, 'ADV'),
 (than, 'ADP'),
 (doubled, 'VERB'),
 (its, 'PRON')]

### <span style="text-decoration: underline;">Uni-Gram & N-Grams</span><a id='n_grams'></a> [(to top)](#toc)

Obviously a sentence is not a random collection of words; the sequence of words has information value.  

A simple way to incorporate some of this sequence is by using what is called `n-grams`.  
An `n-gram` is nothing more than a combination of `N` words into one token (a uni-gram token is just one word).  

So we can convert `"Sentence about flying cars"` into a list of bigrams:

> Sentence-about, about-flying, flying-cars  

See my slide on N-Grams for a more comprehensive example: [click here](http://www.tiesdekok.com/AccountingNLP_Slides/#14)

Using `NLTK`:

In [40]:
bigram_list = ['-'.join(x) for x in nltk.bigrams([token.text for token in example_sentence])]
bigram_list[10:15]

['profits-in', 'in-1996', '1996-to', 'to-109', '109-million']
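
The same idea generalizes to any `N`. A minimal sketch in plain Python, where the helper name `ngrams` is hypothetical and chosen only for this example:

```python
def ngrams(tokens, n):
    """Return all n-grams of a token list, each joined with '-'."""
    # Slide a window of length n over the tokens; an empty list results if n is too large.
    return ['-'.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = 'Sentence about flying cars'.split()
print(ngrams(tokens, 2))  # bigrams
print(ngrams(tokens, 3))  # trigrams
```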

### <span style="text-decoration: underline;">Stop words</span><a id='stop_words'></a> [(to top)](#toc)

Depending on what you are trying to do, it is possible that there are many words that don't add any information value to the sentence.  

The primary example is stop words: very common words such as "more", "than", and "its".  

Sometimes you can improve the accuracy of your model by removing stop words.

Using `Spacy`:

In [41]:
no_stop_words = [token for token in example_sentence if not token.is_stop]

In [42]:
no_stop_words[:10]

[United, Biscuits, (, Holdings, ), Plc, doubled, profits, 1996, 109]

In [43]:
token_list[:10]

[United, Biscuits, (, Holdings, ), Plc, more, than, doubled, its]

*Note* we can also remove punctuation in the same way:

In [44]:
[token for token in example_sentence if not token.is_punct][:5]

[United, Biscuits, Holdings, Plc, more]
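
If you are not working with `spacy` objects, the same filtering can be sketched on plain strings. The stop-word list below is a tiny hand-made assumption for illustration; real lists (such as the one `spacy` uses) are far longer:

```python
# Tiny illustrative stop-word list, not a real resource.
STOP_WORDS = {'more', 'than', 'its', 'in', 'to', 'the', 'a', 'of'}

tokens = ['United', 'Biscuits', 'more', 'than', 'doubled', 'its', 'profits', 'in', '1996']

# Compare in lowercase so that capitalized stop words are removed as well.
filtered = [token for token in tokens if token.lower() not in STOP_WORDS]
print(filtered)
```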

## Wrap everything into one function

Below I will primarily use `SpaCy` directly. However, I also recommend checking out the high-level wrapper `Textacy`.

See their GitHub page for details: https://github.com/chartbeat-labs/textacy

### Quick `Textacy` example

In [45]:
import textacy

In [46]:
example_text = text_dict['test']['TimFarrand'][0]

In [47]:
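# `textacy.preprocess_text` is gone from recent textacy releases; this assumes a version (0.11 or later) that provides `textacy.preprocessing`
from textacy import preprocessing
cleaned_text = preprocessing.remove.punctuation(preprocessing.normalize.unicode(example_text)).lower()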

**Basic SpaCy text processing function**

1. Split into sentences
2. Apply lemmatizer and remove stop words, punctuation, and whitespace
3. Convert each cleaned sentence back into a `spacy` object

In [52]:
def process_text_custom(text):
    sentences = list(nlp(text).sents)
    lemmatized_sentences = []
    for sentence in sentences:
        # Keep the lemma of every token that is not a stop word, punctuation, or whitespace
        lemmatized_sentences.append([token.lemma_ for token in sentence
                                     if not (token.is_stop or token.is_punct or token.is_space)])
    return [nlp(' '.join(sentence)) for sentence in lemmatized_sentences]

In [53]:
%%time
spacy_text_clean = {}
for author, text_list in text_dict['test'].items():
    lst = []
    for text in text_list:
        lst.append(process_text_custom(text))
    spacy_text_clean[author] = lst

CPU times: user 16min 5s, sys: 9.75 s, total: 16min 15s
Wall time: 1h 12min 48s


Note that there are quite a lot of sentences (~54K), so this takes a while (about 16 minutes of CPU time and over an hour of wall time in the run above).

In [54]:
count = 0
for author, texts in spacy_text_clean.items():
    for text in texts:
        count += len(text)
print('Number of sentences:', count)

Number of sentences: 53986


Result

In [55]:
spacy_text_clean['TimFarrand'][0][:3]

[United Biscuits Holdings Plc double profit 1996 109 million pound $ 174 million tax exceptional item reflect simple slimme portfolio product,
 United own brand McVities biscuit KP nut exit U.S. Keebler subsidiary say total exceptional charge mainly loss disposal business amount 84.7 million pound 1996 compare 150.3 million 1995,
 sale rise percent 1.887 billion trading profit grow percent 129.2 million]

# <span style="text-decoration: underline;">Direct feature extraction</span><a id='feature_extract'></a> [(to top)](#toc)

We now have pre-processed our text into something that we can use for direct feature extraction or to convert it to a numerical representation.

## <span style="text-decoration: underline;">Feature search</span><a id='feature_search'></a> [(to top)](#toc)

### <span style="text-decoration: underline;">Entity recognition</span><a id='entity_recognition'></a> [(to top)](#toc)

It is often useful / relevant to extract entities that are mentioned in a piece of text.   

SpaCy is quite powerful at extracting entities; however, it doesn't work very well on lowercase text.  

Given that `token.lemma_` can remove capitalization, I will use `spacy_sentences` for this example.

In [56]:
example_sentence = spacy_sentences['TimFarrand'][0][3]
example_sentence

Underlying profits growth was in line with stock brokers forecasts, but a presentation by management to analysts was greeted positively, sending the group's shares up 11 pence to 248-1/2p by 1415 gmt.
 

In [58]:
[(i, i.label_) for i in nlp(example_sentence.text).ents]

[(11 pence, 'MONEY'), (248, 'CARDINAL'), (1415, 'DATE')]

In [59]:
example_sentence = spacy_sentences['TimFarrand'][4][0]
example_sentence

Brewer to leisure group Whitbread Plc has turned in a "sound" business performance in the last three months, said chief executive Peter Jarvis in an interview on Friday.
 

In [60]:
[(i, i.label_) for i in nlp(example_sentence.text).ents]

[(Whitbread Plc, 'ORG'),
 (the last three months, 'DATE'),
 (Peter Jarvis, 'PERSON'),
 (Friday, 'DATE')]

### <span style="text-decoration: underline;">Pattern search</span><a id='pattern_search'></a> [(to top)](#toc)

Using the built-in `re` (regular expression) library you can pattern match nearly anything you want.  

I will not go into details about regular expressions but see here for a tutorial:  
https://regexone.com/references/python  

In [61]:
import re

**TIP**: Use [Pythex.org](https://pythex.org/) to try out your regular expression

Example on Pythex: <a href="https://pythex.org/?regex=IDNUMBER: (\d\d\d-\w\w)&test_string=Ties de Kok (IDNUMBER: 123-AZ). Rest of Text." target='_blank'>click here</a>

### Example 1:

In [62]:
string_1 = 'Ties de Kok (#IDNUMBER: 123-AZ). Rest of text...'
string_2 = 'Philip Joos (#IDNUMBER: 663-BY). Rest of text...'

In [63]:
pattern = r'#IDNUMBER: (\d\d\d-\w\w)'

In [64]:
print(re.findall(pattern, string_1)[0])
print(re.findall(pattern, string_2)[0])

123-AZ
663-BY
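
Note that `re.findall(...)[0]` raises an `IndexError` when the pattern is not found. A minimal sketch of a more defensive variant, assuming you want `None` for texts without an ID; the helper `extract_id` and the group names `digits` and `letters` are illustrative additions:

```python
import re

# Compile once and reuse; the named groups make the two parts of the ID explicit.
id_pattern = re.compile(r'#IDNUMBER: (?P<digits>\d{3})-(?P<letters>\w{2})')

def extract_id(text):
    """Return (digits, letters) for the first ID in text, or None if there is none."""
    match = id_pattern.search(text)
    if match is None:
        return None
    return match.group('digits'), match.group('letters')

print(extract_id('Ties de Kok (#IDNUMBER: 123-AZ). Rest of text...'))
print(extract_id('No identifier in this text.'))
```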


### Example 2:

Print every sentence that contains the word 'million':

In [67]:
TERM = 'million'
for sen in spacy_text_clean['TimFarrand'][2]:
    # re.search returns a match object (truthy) or None (falsy)
    if re.search(TERM, sen.text):
        print(sen)
        # print(sen.ents)

company hotel exclude Metropole London acquire November 327 million stg $ 543 million Lonrho show occupancy level 71.8 percent 13 week 1997 touch 72 percent time
group hotel occupancy pull slightly period temporary closure Stakis Tyneside work underway 3.5 million stg refurbishment
turnover casino rise 20 percent 14.2 million stg drive 17 percent increase attendance


## <span style="text-decoration: underline;">Text evaluation</span><a id='text_eval'></a> [(to top)](#toc)

Besides feature search there are also many ways to analyze the text as a whole.  

Let's, for example, evaluate the following paragraph:

In [70]:
example_paragraph = ' '.join([x.text for x in spacy_text_clean['TimFarrand'][2]])
example_paragraph[:500]

'Scottish base Stakis Plc Wednesday report surge visitor casino sharp rise hotel room rate chief executive David Michel confident mood future trend real term room rate late 1980 room rate reach pre recession level province Michels tell Reuters company hotel exclude Metropole London acquire November 327 million stg $ 543 million Lonrho show occupancy level 71.8 percent 13 week 1997 touch 72 percent time average room rate rise 50.10 stg period 45.58 1996 quarter 10 percent think average 7.5 percent'

### <span style="text-decoration: underline;">Language</span><a id='language'></a> [(to top)](#toc)

Using the `langdetect` package it is easy to detect the language of a piece of text:

In [None]:
!pip install langdetect

In [71]:
from langdetect import detect

In [72]:
detect(example_paragraph)

'en'

### <span style="text-decoration: underline;">Readability</span><a id='readability'></a> [(to top)](#toc)

Using the `textstat` package we can compute various readability metrics:

https://github.com/shivam5992/textstat

In [None]:
!pip install textstat

In [73]:
from textstat.textstat import textstat

In [74]:
print(textstat.flesch_reading_ease(example_paragraph))
print(textstat.smog_index(example_paragraph))
print(textstat.flesch_kincaid_grade(example_paragraph))
print(textstat.coleman_liau_index(example_paragraph))
print(textstat.automated_readability_index(example_paragraph))
print(textstat.dale_chall_readability_score(example_paragraph))
print(textstat.difficult_words(example_paragraph))
print(textstat.linsear_write_formula(example_paragraph))
print(textstat.gunning_fog(example_paragraph))
print(textstat.text_standard(example_paragraph))

34.07917624521073
16.32212239822248
14.909885057471268
15.376724137931038
17.826837606837607
14.103224329501916
54
13.4
17.724904214559388
14th and 15th grade
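
To make these scores less of a black box, here is a sketch of the standard Flesch Reading Ease and Flesch-Kincaid Grade formulas. The word, sentence, and syllable counts are made-up inputs; `textstat` applies its own counting rules, so its results for real text can differ from a hand count:

```python
def flesch_reading_ease(n_words, n_sentences, n_syllables):
    """Higher scores mean easier text (longer sentences and words lower the score)."""
    return 206.835 - 1.015 * (n_words / n_sentences) - 84.6 * (n_syllables / n_words)

def flesch_kincaid_grade(n_words, n_sentences, n_syllables):
    """Approximate U.S. school grade level needed to read the text."""
    return 0.39 * (n_words / n_sentences) + 11.8 * (n_syllables / n_words) - 15.59

# Hypothetical counts: 100 words, 5 sentences, 150 syllables.
print(flesch_reading_ease(100, 5, 150))
print(flesch_kincaid_grade(100, 5, 150))
```

Both formulas depend only on the average sentence length and the average number of syllables per word.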


## Text similarity

In [None]:
!pip install gensim
!pip install fuzzywuzzy
!pip install pyLDAvis

In [76]:
from fuzzywuzzy import fuzz

In [77]:
fuzz.ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")

91
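
`fuzz.ratio` returns a similarity score between 0 and 100. As a rough standard-library analogue (an assumption for illustration; the scores need not match `fuzzywuzzy` exactly), `difflib.SequenceMatcher` can be used in the same way. The helper name `similarity` is hypothetical:

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Similarity score between 0 and 100 based on matching character blocks."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

print(similarity('fuzzy wuzzy was a bear', 'wuzzy fuzzy was a bear'))  # near-duplicates
print(similarity('fuzzy wuzzy was a bear', 'fuzzy wuzzy was a bear'))  # identical strings
print(similarity('abc', 'xyz'))                                        # nothing in common
```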

### <span style="text-decoration: underline;">Term (dictionary) counting</span><a id='dict_counting'></a> [(to top)](#toc)

One of the most common techniques that researchers currently use (at least in Accounting research) is to compute simple metrics based on counting words in a dictionary.  
This technique is, for example, very prevalent in sentiment analysis (counting positive and negative words).  

In essence this technique is very simple to program:

### Example 1:

In [78]:
word_dictionary = ['soft', 'first', 'most', 'be']

In [79]:
for word in word_dictionary:
    print(word, example_paragraph.count(word))

soft 0
first 0
most 0
be 3
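
Keep in mind that `str.count()` counts substrings, not words. Since stop words were removed from `example_paragraph`, the hits for `be` above are most likely fragments of longer words. A minimal sketch of whole-word counting with a regular expression; the helper `count_whole_word` and the sample text are illustrative:

```python
import re

def count_whole_word(word, text):
    """Count case-insensitive occurrences of word as a whole word in text."""
    # \b marks a word boundary; re.escape protects special characters in the word.
    pattern = r'\b' + re.escape(word) + r'\b'
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

sample = 'The number of members will be small, but it may be growing.'
print(sample.count('be'))              # also counts 'be' inside 'number' and 'members'
print(count_whole_word('be', sample))  # counts only the standalone word
```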


### Example 2:

In [80]:
pos = ['great', 'increase']
neg = ['bad', 'decrease']

sentence = '''According to Trump everything is great, great,
and great even though his popularity is seeing a decrease.'''

pos_count = 0
for word in pos:
    pos_count += sentence.lower().count(word)
print(pos_count)

neg_count = 0
for word in neg:
    neg_count += sentence.lower().count(word)
print(neg_count)

pos_count / (neg_count + pos_count)

3
1


0.75

Getting the total number of words is also easy:

In [81]:
len(nlp(example_paragraph))

234

### Example 3:

We can also save the count per word

In [82]:
pos_count_dict = {}
for word in pos:
    pos_count_dict[word] = sentence.lower().count(word)

In [83]:
pos_count_dict

{'great': 3, 'increase': 0}

# <span style="text-decoration: underline;">Represent text numerically</span><a id='text_numerical'></a> [(to top)](#toc)

## <span style="text-decoration: underline;">Bag of Words</span><a id='bows'></a> [(to top)](#toc)

Sklearn includes the `CountVectorizer` and `TfidfVectorizer` classes.  

For details, see the documentation:  
[TF](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer)  
[TFIDF](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer)

Note 1: These classes also already include a lot of preprocessing options (e.g. ngrams, remove stop words, accent stripper).

Note 2: Example based on the following website [click here](http://ethen8181.github.io/machine-learning/clustering_old/tf_idf/tf_idf.html)

In [84]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

### Simple example:

In [85]:
doc_1 = "The sky is blue."
doc_2 = "The sun is bright today."
doc_3 = "The sun in the sky is bright."
doc_4 = "We can see the shining sun, the bright sun."

Calculate term frequency:

In [86]:
vectorizer = CountVectorizer(stop_words='english')
tf = vectorizer.fit_transform([doc_1, doc_2, doc_3, doc_4])

In [88]:
print(vectorizer.get_feature_names_out())
for doc_tf_vector in tf.toarray():
    print(doc_tf_vector)

['blue' 'bright' 'shining' 'sky' 'sun' 'today']
[1 0 0 1 0 0]
[0 1 0 0 1 1]
[0 1 0 1 1 0]
[0 1 1 0 2 0]


### <span style="text-decoration: underline;">TF-IDF</span><a id='tfidf'></a> [(to top)](#toc)

In [89]:
transformer = TfidfVectorizer(stop_words='english')
tfidf = transformer.fit_transform([doc_1, doc_2, doc_3, doc_4])

In [90]:
for doc_vector in tfidf.toarray():
    print(doc_vector)

[0.78528828 0.         0.         0.6191303  0.         0.        ]
[0.         0.47380449 0.         0.         0.47380449 0.74230628]
[0.         0.53256952 0.         0.65782931 0.53256952 0.        ]
[0.         0.36626037 0.57381765 0.         0.73252075 0.        ]
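
The weights above can be reproduced by hand. The sketch below starts from the term counts printed earlier and applies the smoothed inverse document frequency and L2 normalization that scikit-learn uses by default (`smooth_idf=True`, `norm='l2'`). The variable and function names are illustrative:

```python
import math

# Term counts copied from the CountVectorizer output
# (columns: blue, bright, shining, sky, sun, today).
counts = [
    [1, 0, 0, 1, 0, 0],
    [0, 1, 0, 0, 1, 1],
    [0, 1, 0, 1, 1, 0],
    [0, 1, 1, 0, 2, 0],
]
n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: in how many documents does each term occur?
df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

def tfidf_row(row):
    """Weight raw counts by idf, then scale the row to unit Euclidean length."""
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    return [v / norm for v in weighted]

for row in counts:
    print([round(v, 8) for v in tfidf_row(row)])
```

Rare terms such as `blue` receive a higher weight than terms that occur in most documents, such as `sun` and `bright`.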


### More elaborate example:

In [91]:
clean_paragraphs = []
for author, value in spacy_text_clean.items():
    for article in value:
        clean_paragraphs.append(' '.join([x.text for x in article]))

In [92]:
len(clean_paragraphs)

2500

In [93]:
transformer = TfidfVectorizer(stop_words='english')
tfidf_large = transformer.fit_transform(clean_paragraphs)

In [94]:
# The matrix is sparse, so read its dimensions from .shape instead of densifying it
n_vectors, n_words = tfidf_large.shape
print('Number of vectors:', n_vectors)
print('Number of words in dictionary:', n_words)

Number of vectors: 2500
Number of words in dictionary: 24036


In [97]:
tfidf_large

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 443919 stored elements and shape (2500, 24036)>

## <span style="text-decoration: underline;">Word Embeddings</span><a id='word_embed'></a> [(to top)](#toc)

### <span style="text-decoration: underline;">Word2Vec</span><a id='Word2Vec'></a> [(to top)](#toc)

Simple example below is from:  https://medium.com/@mishra.thedeepak/word2vec-in-minutes-gensim-nlp-python-6940f4e00980

In [98]:
import gensim
nltk.download('brown', download_dir="/opt/share/nltk_data")
from nltk.corpus import brown

[nltk_data] Downloading package brown to /home/koollio/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.


In [99]:
sentences = brown.sents()
model = gensim.models.Word2Vec(sentences, min_count=1)

Save model

In [100]:
model.save('brown_model')

Load model

In [101]:
model = gensim.models.Word2Vec.load('brown_model')

Find words most similar to 'mother':

In [104]:
print(model.wv.most_similar("mother"))

[('father', 0.9790230393409729), ('husband', 0.9679327011108398), ('wife', 0.946841835975647), ('son', 0.9338570237159729), ('friend', 0.9207152128219604), ('nickname', 0.9092258810997009), ('voice', 0.9054335951805115), ('brother', 0.8999156355857849), ('patient', 0.8900373578071594), ('uncle', 0.8772851228713989)]
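
The scores returned by `most_similar` are cosine similarities between word vectors. A minimal sketch of that ranking in plain Python follows; the three-dimensional vectors are invented for illustration and are not taken from the trained model, and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings (real Word2Vec vectors have 100 dimensions by default)
toy_vectors = {
    "mother": [0.9, 0.8, 0.1],
    "father": [0.8, 0.9, 0.2],
    "garden": [0.1, 0.2, 0.9],
}


def rank_by_similarity(word, vectors):
    """Rank all other words by cosine similarity to `word`, best first."""
    query = vectors[word]
    scores = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


print(rank_by_similarity("mother", toy_vectors))
```

Because cosine similarity only looks at direction, two words can be very similar even if their vectors differ in length.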


Find the odd one out:

In [105]:
print(model.wv.doesnt_match("breakfast cereal dinner lunch".split()))

cereal


In [106]:
print(model.wv.doesnt_match("pizza pasta garden fries".split()))

garden
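
Roughly speaking, `doesnt_match` picks the word whose vector is least similar to the average of all the given word vectors. The sketch below illustrates that idea in plain Python under the assumption that all vectors are first scaled to unit length; the two-dimensional vectors and the function names are hypothetical, and this is not gensim's actual code.

```python
import math


def unit(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector]


def odd_one_out(words, vectors):
    """Return the word least similar to the mean of the normalized vectors."""
    normed = {w: unit(vectors[w]) for w in words}
    dims = len(next(iter(normed.values())))
    mean = unit([
        sum(vec[i] for vec in normed.values()) / len(normed)
        for i in range(dims)
    ])
    # Dot product of unit vectors equals their cosine similarity
    similarity = {
        w: sum(a * b for a, b in zip(vec, mean)) for w, vec in normed.items()
    }
    return min(similarity, key=similarity.get)


# Hypothetical toy embeddings: three food words and one unrelated word
toy_vectors = {
    "pizza": [0.9, 0.1],
    "pasta": [0.8, 0.2],
    "fries": [0.85, 0.15],
    "garden": [0.1, 0.9],
}

print(odd_one_out(["pizza", "pasta", "garden", "fries"], toy_vectors))
```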


Retrieve the vector representation of the word "human":

In [111]:
model.wv['human']

array([-0.59822446,  0.37523517,  0.7429167 ,  0.18978556, -0.27945676,
       -0.63060504,  1.4232993 ,  1.3837723 , -0.5692845 , -0.35793665,
       -0.09688178, -0.7069113 ,  0.71324366, -0.4941389 ,  0.09916251,
       -0.6406467 ,  0.48585665, -0.26917398, -0.48565945, -1.0296743 ,
        0.59314114,  0.04635499,  0.5858246 ,  0.7504386 , -0.36290467,
       -0.35890818,  0.21142045,  0.17639156, -0.73307604, -0.10112016,
        0.43756485, -0.6939162 ,  0.78580564, -0.3232209 ,  0.31219828,
        0.88461053, -0.37566754, -0.47876412, -0.59902835,  0.01704816,
       -0.08662194, -0.3377133 ,  0.5304672 ,  0.00568901,  0.6407281 ,
        0.14313847, -0.32486907, -0.16631149,  0.07767224,  0.21260379,
        1.1854287 , -0.60895926, -0.4149272 , -0.09742101, -0.824041  ,
       -0.49826938,  0.9745278 ,  0.14413634, -0.67737544, -0.11342227,
        0.06062152,  0.27667984,  0.01661909, -0.66811484, -1.0064873 ,
        1.0263972 ,  0.06277847,  0.3018849 , -0.8364451 ,  0.17

# <span style="text-decoration: underline;">Statistical models</span><a id='stat_models'></a> [(to top)](#toc)

## <span style="text-decoration: underline;">"Traditional" machine learning</span><a id='trad_ml'></a> [(to top)](#toc)

The library to use for machine learning is scikit-learn (["sklearn"](http://scikit-learn.org/stable/index.html)).

## <span>Supervised</span><a id='trad_ml_supervised'></a> [(to top)](#toc)

In [123]:
from sklearn.model_selection import cross_val_score, KFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import metrics
import joblib

In [113]:
import pandas as pd
import numpy as np

### Convert the data into a pandas DataFrame (so that it is easier to pass to scikit-learn)

In [114]:
article_list = []
for author, value in spacy_text_clean.items():
    for article in value:
        article_list.append((author, ' '.join([x.text for x in article])))

In [115]:
article_df = pd.DataFrame(article_list, columns=['author', 'text'])

In [116]:
article_df.sample(5)

| | author | text |
|---|---|---|
| 2483 | SamuelPerry | Intel Corp. world big maker computer chip say ... |
| 2238 | BernardHickey | New Zealand base investment group Brierley Inv... |
| 2258 | KeithWeir | anglo dutch publishing group Reed Elsevier Plc... |
| 1151 | EdnaFernandes | Northern England glitter record inward investm... |
| 2125 | DarrenSchuettler | saga Bre X Minerals Ltd. rich indonesian gold ... |


### Split the data into a training set and a test set

In [117]:
X_train, X_test, y_train, y_test = train_test_split(article_df.text, article_df.author, test_size=0.20, random_state=3561)

In [118]:
print(len(X_train), len(X_test))

2000 500


### Train and evaluate function

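A simple function to train (i.e. fit) a model and evaluate it on both the training set and the test set:

In [119]:
def train_and_evaluate(clf, X_train, X_test, y_train, y_test):
    """Fit clf on the training data, then print accuracy scores and a per-class report."""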
    clf.fit(X_train, y_train)

    print("Accuracy on training set:")
    print(clf.score(X_train, y_train))
    print("Accuracy on testing set:")
    print(clf.score(X_test, y_test))

    y_pred = clf.predict(X_test)

    print("Classification Report:")
    print(metrics.classification_report(y_test, y_pred))

### <span>Naïve Bayes estimator</span><a id='trad_ml_supervised_nb'></a> [(to top)](#toc)

In [120]:
from sklearn.naive_bayes import MultinomialNB

Define pipeline

In [121]:
clf = Pipeline([
    ('vect', TfidfVectorizer(strip_accents='unicode',
                             lowercase = True,
                            max_features = 1500,
                            stop_words='english'
                            )),

    ('clf', MultinomialNB(alpha = 1,
                          fit_prior = True
                          )
    ),
])
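
To give a feel for what the `MultinomialNB` step does, here is a minimal plain-Python sketch of multinomial Naive Bayes with additive smoothing (`alpha = 1`) and class priors estimated from the data (the equivalent of `fit_prior = True`). It is a simplification: it works on raw token counts rather than the TF-IDF weights used in the pipeline, and the toy documents, labels, and function names are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def fit_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from tokenized docs."""
    vocab = sorted({tok for doc in docs for tok in doc})
    class_counts = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc)

    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_likelihood = {}
    for c in class_counts:
        # alpha is added to every token count so unseen tokens never get probability 0
        total = sum(token_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            tok: math.log((token_counts[c][tok] + alpha) / total) for tok in vocab
        }
    return log_prior, log_likelihood


def predict_multinomial_nb(doc, log_prior, log_likelihood):
    """Return the class with the highest log posterior for a tokenized doc."""
    scores = {}
    for c in log_prior:
        # Tokens outside the training vocabulary are ignored
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][tok] for tok in doc if tok in log_likelihood[c]
        )
    return max(scores, key=scores.get)


# Hypothetical toy training data: two "authors" with different vocabularies
train_docs = [
    ["gold", "mine", "toronto"],
    ["gold", "stock", "canada"],
    ["software", "internet", "computer"],
    ["microsoft", "software", "network"],
]
train_labels = ["author_a", "author_a", "author_b", "author_b"]

log_prior, log_likelihood = fit_multinomial_nb(train_docs, train_labels, alpha=1.0)
print(predict_multinomial_nb(["gold", "canada", "unknownword"], log_prior, log_likelihood))
```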

Train and show evaluation stats

In [122]:
train_and_evaluate(clf, X_train, X_test, y_train, y_test)

Accuracy on training set:
0.848
Accuracy on testing set:
0.708
Classification Report:
                   precision    recall  f1-score   support

    AaronPressman       0.64      0.88      0.74         8
       AlanCrosby       0.50      1.00      0.67         8
   AlexanderSmith       0.82      0.82      0.82        11
  BenjaminKangLim       0.60      0.27      0.38        11
    BernardHickey       0.67      0.60      0.63        10
      BradDorfman       0.86      0.67      0.75         9
 DarrenSchuettler       0.75      0.64      0.69        14
      DavidLawder       0.83      0.50      0.62        10
    EdnaFernandes       0.43      1.00      0.60         3
      EricAuchard       0.71      0.56      0.62         9
   FumikoFujisaki       0.93      1.00      0.96        13
   GrahamEarnshaw       0.59      0.91      0.71        11
 HeatherScoffield       0.83      0.50      0.62        10
       JanLopatka       0.83      0.50      0.62        10
    JaneMacartney       0.40
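
The columns of the classification report follow from per-class counts of true positives, false positives, and false negatives. The sketch below shows the arithmetic. The counts are an assumption: the confusion counts themselves are not printed above, but `tp=7, fp=4, fn=1` is one combination that is consistent with the first row of the report (AaronPressman, support 8).

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from per-class counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Assumed counts for one class: 8 true articles (support), 7 found, 4 false alarms
precision, recall, f1 = precision_recall_f1(tp=7, fp=4, fn=1)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Support is simply `tp + fn`, the number of test articles that truly belong to the class, so classes with very low support (such as EdnaFernandes with 3) have noisy scores.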

Save results

In [124]:
joblib.dump(clf, 'naive_bayes_results.pkl')

['naive_bayes_results.pkl']

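Predict a single example. Note that `X_train[33]` selects the row with index label 33 from the training set, so this is an in-sample check rather than a true out-of-sample prediction: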

In [125]:
example_y, example_X = y_train[33], X_train[33]

In [126]:
print('Actual author:', example_y)
print('Predicted author:', clf.predict([example_X])[0])

Actual author: MarcelMichelson
Predicted author: MarcelMichelson


### <span>Support Vector Machines (SVM)</span><a id='trad_ml_supervised_svm'></a> [(to top)](#toc)

In [127]:
from sklearn.svm import SVC

Define pipeline

In [128]:
clf_svm = Pipeline([
    ('vect', TfidfVectorizer(strip_accents='unicode',
                             lowercase = True,
                            max_features = 1500,
                            stop_words='english'
                            )),

    ('clf', SVC(kernel='rbf' ,
                C=10, gamma=0.3)
    ),
])

*Note:* The SVC estimator is very sensitive to the hyperparameters!
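
To see why `gamma` matters, recall that the RBF kernel measures similarity as `exp(-gamma * ||x - y||^2)`. The sketch below evaluates it for a few values of `gamma`; the two vectors are hypothetical stand-ins for TF-IDF document vectors.

```python
import math


def rbf_kernel(x, y, gamma):
    """RBF (Gaussian) kernel between two vectors of equal length."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Hypothetical unit-length document vectors
x = [0.8, 0.6, 0.0]
y = [0.6, 0.0, 0.8]

for gamma in (0.03, 0.3, 3.0):
    print(gamma, round(rbf_kernel(x, y, gamma), 4))
```

With a small `gamma` almost all documents look alike to the model, while a large `gamma` makes every document look different from every other, which tends to overfit. `C` plays a complementary role: larger values penalize training errors more heavily and so regularize less.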

Train and show evaluation stats

In [129]:
train_and_evaluate(clf_svm, X_train, X_test, y_train, y_test)

Accuracy on training set:
0.998
Accuracy on testing set:
0.846
Classification Report:
                   precision    recall  f1-score   support

    AaronPressman       0.80      1.00      0.89         8
       AlanCrosby       0.89      1.00      0.94         8
   AlexanderSmith       0.85      1.00      0.92        11
  BenjaminKangLim       0.64      0.82      0.72        11
    BernardHickey       0.80      0.80      0.80        10
      BradDorfman       0.80      0.89      0.84         9
 DarrenSchuettler       1.00      0.93      0.96        14
      DavidLawder       0.89      0.80      0.84        10
    EdnaFernandes       0.60      1.00      0.75         3
      EricAuchard       0.58      0.78      0.67         9
   FumikoFujisaki       1.00      1.00      1.00        13
   GrahamEarnshaw       0.83      0.91      0.87        11
 HeatherScoffield       1.00      1.00      1.00        10
       JanLopatka       1.00      0.70      0.82        10
    JaneMacartney       0.50

Save results

In [130]:
joblib.dump(clf_svm, 'svm_results.pkl')

['svm_results.pkl']

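Predict a single example. As before, `X_train[33]` is taken from the training set (index label 33), so this is an in-sample check rather than a true out-of-sample prediction: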

In [131]:
example_y, example_X = y_train[33], X_train[33]

In [132]:
print('Actual author:', example_y)
print('Predicted author:', clf_svm.predict([example_X])[0])

Actual author: MarcelMichelson
Predicted author: MarcelMichelson


## <span>Model Selection and Evaluation</span><a id='trad_ml_eval'></a> [(to top)](#toc)

Both `TfidfVectorizer` and `SVC` take many hyperparameters, and it can be difficult to work out which combination performs best.

`GridSearchCV` helps with this: it tries every combination in a predefined grid and scores each one with cross-validation.

In [133]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer
from sklearn.metrics import f1_score

First we define the options that should be tried out:

In [159]:
clf_search = Pipeline([
    ('vect', TfidfVectorizer()),
    ('clf', SVC())
])
parameters = {'vect__stop_words': ['english'],
              'vect__strip_accents': ['unicode'],
              'vect__max_features' : [1500],
              'vect__ngram_range': [(1,1), (2,2) ],
              'clf__gamma' : [0.2, 0.3, 0.4],
              'clf__C' : [8, 10, 12],
              'clf__kernel' : ['rbf']
             }
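
Before starting a grid search it helps to know how much work it involves. The sketch below enumerates the grid with `itertools.product`. The dictionary repeats the grid from this section so that the snippet is self-contained, and `cv_folds = 6` is an assumption that mirrors the `cv=6` argument used in the `GridSearchCV` call in this section.

```python
from itertools import product

# Same grid as defined above, repeated so this snippet runs on its own
parameters = {
    'vect__stop_words': ['english'],
    'vect__strip_accents': ['unicode'],
    'vect__max_features': [1500],
    'vect__ngram_range': [(1, 1), (2, 2)],
    'clf__gamma': [0.2, 0.3, 0.4],
    'clf__C': [8, 10, 12],
    'clf__kernel': ['rbf'],
}

# Every candidate is one combination of values, one per parameter
candidates = [
    dict(zip(parameters, combo)) for combo in product(*parameters.values())
]

cv_folds = 6  # assumption: matches cv=6 in the GridSearchCV call
total_fits = len(candidates) * cv_folds

print('Candidates:', len(candidates))
print('Model fits during the search:', total_fits)
```

Each extra value in any list multiplies the number of candidates, so the grid grows quickly; with the default `refit=True` there is also one final fit of the best candidate on the full training set.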


Run the grid search:

In [None]:
grid = GridSearchCV(clf_search, param_grid=parameters, cv=6, n_jobs=-1)
grid.fit(X_train, y_train)

*Note:* `n_jobs=-1` uses all available CPU cores to run the cross-validation fits in parallel; set it to a smaller positive number to leave cores free for other work.

In [None]:
print("The best parameters are %s with a score of %0.2f" % (grid.best_params_, grid.best_score_))
y_true, y_pred = y_test, grid.predict(X_test)
print(metrics.classification_report(y_true, y_pred))

## <span>Unsupervised</span><a id='trad_ml_unsupervised'></a> [(to top)](#toc)

### <span>Latent Dirichlet Allocation (LDA)</span><a id='trad_ml_unsupervised_lda'></a> [(to top)](#toc)

In [163]:
from sklearn.decomposition import LatentDirichletAllocation

Vectorize the text (using `CountVectorizer`, since LDA models raw term counts)

In [164]:
vectorizer = CountVectorizer(strip_accents='unicode',
                             lowercase = True,
                            max_features = 1500,
                            stop_words='english', max_df=0.8)
tf_large = vectorizer.fit_transform(clean_paragraphs)

Run the LDA model

In [165]:
n_topics = 10
n_top_words = 25

In [166]:
lda = LatentDirichletAllocation(n_components=n_topics, max_iter=10,
                                learning_method='online',
                                n_jobs=1)
lda_fitted = lda.fit_transform(tf_large)

Collect the top words for each topic

In [167]:
def save_top_words(model, feature_names, n_top_words):
    out_list = []
    for topic_idx, topic in enumerate(model.components_):
        out_list.append((topic_idx+1, " ".join([feature_names[i] for i in topic.argsort()[:-n_top_words - 1:-1]])))
    out_df = pd.DataFrame(out_list, columns=['topic_id', 'top_words'])
    return out_df
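
The slice `topic.argsort()[:-n_top_words - 1:-1]` is compact but easy to misread: `argsort()` returns indices ordered from the smallest to the largest weight, and the slice walks backwards from the end to pick the `n_top_words` largest. A plain-Python equivalent is sketched below; the weights, vocabulary, and function name are hypothetical.

```python
def top_indices(weights, n_top):
    """Indices of the n_top largest weights, largest first."""
    # Ascending order of indices by weight, like numpy's argsort()
    ascending = sorted(range(len(weights)), key=lambda i: weights[i])
    # Walk backwards from the end and stop after n_top items
    return ascending[:-n_top - 1:-1]


# Hypothetical topic-word weights for a five-word vocabulary
feature_names = ["bank", "gold", "china", "share", "software"]
topic_weights = [0.5, 3.2, 0.1, 2.4, 1.7]

print(" ".join(feature_names[i] for i in top_indices(topic_weights, 3)))
```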

In [169]:
result_df = save_top_words(lda, vectorizer.get_feature_names_out(), n_top_words)

In [170]:
result_df

| | topic_id | top_words |
|---|---|---|
| 0 | 1 | quarter analyst percent share million sale com... |
| 1 | 2 | government minister official union state plan ... |
| 2 | 3 | service company new amp corp network customer ... |
| 3 | 4 | percent million profit bank billion analyst ma... |
| 4 | 5 | company internet computer microsoft software t... |
| 5 | 6 | 000 crop air tobacco japanese trade industry j... |
| 6 | 7 | market tonne percent price export 000 trader o... |
| 7 | 8 | china kong hong chinese beijing people deng of... |
| 8 | 9 | bre gold stock share toronto busang canada con... |
| 9 | 10 | company bank share market financial business m... |


### <span>pyLDAvis</span><a id='trad_ml_unsupervised_pyLDAvis'></a> [(to top)](#toc)

In [None]:
# %matplotlib inline
import pyLDAvis.lda_model

pyLDAvis.enable_notebook()

In [None]:
pyLDAvis.lda_model.prepare(lda, tf_large, vectorizer, n_jobs=1)


Credit: [Ties de Kok](https://github.com/TiesdeKok)

Repository: [Python NLP](https://github.com/TiesdeKok/Python_NLP_Tutorial)