## Python Examples for Text Pre-processing

We can use Python to pre-process text. Several libraries make this easier, notably:

- NLTK (Natural Language Toolkit)
- TextBlob

Documentation is available for [TextBlob](https://textblob.readthedocs.io/en/dev/) and [NLTK](https://www.nltk.org/).

In [2]:
from textblob import TextBlob

texts = ['ball pitch goal corner keeper kick pass run referee',
        'ball pass scrum maul lineout kick goal fullback',
        'net pass serve set rotate net block hit libero ball',
        'court wing pass circle goal umpire quarter ball']

# Wrap each text in a TextBlob and lower-case it so word matching is case-insensitive
doclist = []
for text in texts:
    doc = TextBlob(text)
    doclist.append(doc.lower())

In [3]:
doclist[0].words

WordList(['ball', 'pitch', 'goal', 'corner', 'keeper', 'kick', 'pass', 'run', 'referee'])

### Calculate Term Frequency - Inverse Document Frequency

Term Frequency - Inverse Document Frequency (TF-IDF) is a weighting scheme that highlights words characteristic of a particular document within a corpus: words that appear relatively often in that document but rarely in the others. It is a useful way to weight terms for information retrieval or for finding topics in documents.

Words that occur only once or twice in a single document and not in any other documents don't tell us a lot about the document - they may be just the whim of the writer. Similarly, words that appear a lot in all the documents don't tell us much about the differences between documents.

**Notes:**  
TF-IDF code adapted from Steven Loria: http://stevenloria.com/finding-important-words-in-a-document-using-tf-idf/  

In [4]:
%%bash
# install textblob library if not already installed
pip install textblob

#### Definitions

For each word in the corpus:

**Term Frequency** (tf) = number of times the word occurs in a document, divided by the total number of words in that document

**Document Frequency** (df) = number of documents in the corpus containing the word

**Inverse Document Frequency** (idf) = natural logarithm of the number of documents divided by the document frequency for the word

The tf-idf score for a word in a given document is then tf * idf
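
As a worked example of these definitions, the sketch below computes each quantity for the word `kick` in the first document. It uses plain Python lists rather than TextBlob so that it runs on its own; the variable names (`docs`, `term_freq`, `doc_freq`, `inv_doc_freq`) are illustrative and not part of the notebook's code.

```python
import math

# The four toy documents from the start of this notebook, split on whitespace.
texts = ['ball pitch goal corner keeper kick pass run referee',
         'ball pass scrum maul lineout kick goal fullback',
         'net pass serve set rotate net block hit libero ball',
         'court wing pass circle goal umpire quarter ball']
docs = [text.split() for text in texts]

word = 'kick'
first_doc = docs[0]

term_freq = first_doc.count(word) / len(first_doc)   # 1 occurrence / 9 words
doc_freq = sum(1 for d in docs if word in d)         # 'kick' is in documents 1 and 2
inv_doc_freq = math.log(len(docs) / doc_freq)        # ln(4 / 2)
score = term_freq * inv_doc_freq                     # tf * idf

print(round(term_freq, 5), doc_freq, round(inv_doc_freq, 5), round(score, 5))
# prints: 0.11111 2 0.69315 0.07702
```

A word such as `ball`, which occurs in all four documents, has idf = ln(4 / 4) = 0, so its tf-idf score is 0 in every document.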

#### Interpretation

A _high_ tf-idf score for a word in a document means the word is fairly frequent in that document but appears in few of the other documents in the corpus

A _low_ tf-idf score means the word is infrequent in that document, or appears in many of the documents in the corpus

Tf-idf scores are relative to a corpus. Adding more documents will change the weightings.
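
To illustrate how the weightings depend on the corpus, the sketch below recomputes idf after adding a fifth document. The extra document and the helper `idf_plain` are hypothetical, made up for this illustration; they are not part of the notebook's corpus or code.

```python
import math

def idf_plain(word, docs):
    """Unsmoothed idf, ln(N / df); assumes the word occurs in at least one document."""
    doc_freq = sum(1 for d in docs if word in d)
    return math.log(len(docs) / doc_freq)

texts = ['ball pitch goal corner keeper kick pass run referee',
         'ball pass scrum maul lineout kick goal fullback',
         'net pass serve set rotate net block hit libero ball',
         'court wing pass circle goal umpire quarter ball']
docs = [text.split() for text in texts]

# Hypothetical fifth document, invented only for this illustration.
extra_doc = 'ball pass scrum tackle try'.split()
bigger = docs + [extra_doc]

scrum_before = idf_plain('scrum', docs)     # ln(4 / 1)
scrum_after = idf_plain('scrum', bigger)    # ln(5 / 2): 'scrum' is now less distinctive
court_before = idf_plain('court', docs)     # ln(4 / 1)
court_after = idf_plain('court', bigger)    # ln(5 / 1): 'court' is now more distinctive

print(round(scrum_before, 5), round(scrum_after, 5))   # prints: 1.38629 0.91629
print(round(court_before, 5), round(court_after, 5))   # prints: 1.38629 1.60944
```

The idf of `scrum` falls because the new document also contains it, while the idf of `court` rises because the corpus grew and `court` still appears in only one document.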

In [9]:
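'''
Here are function definitions for tf, df, idf and tfidf.
Note that idf raises ZeroDivisionError for a word that appears in no document.
In practice idf is usually smoothed, for example by using 1 + df in the
denominator and then adding 1 to the result.
'''

import math

def tf(word, doc):
    # relative frequency of the word within one document
    return doc.words.count(word) / len(doc.words)

def df(word, doclist):
    # number of documents that contain the word at least once
    return sum(1 for doc in doclist if word in doc.words)

def idf(word, doclist):
    # natural log of (number of documents / document frequency)
    return math.log(len(doclist) / df(word, doclist))

def tfidf(word, doc, doclist):
    return tf(word, doc) * idf(word, doclist)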
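
A minimal sketch of one common smoothed variant follows, assuming the form ln((1 + N) / (1 + df)) + 1, which is the default smoothing documented for scikit-learn's `TfidfTransformer`. The function name `idf_smooth` is hypothetical, and the sketch works on plain word lists rather than TextBlob objects so that it is self-contained.

```python
import math

def idf_smooth(word, docs):
    """Smoothed idf: ln((1 + N) / (1 + df)) + 1.

    Adding 1 to both counts avoids division by zero for unseen words,
    and the final + 1 keeps words found in every document above zero.
    """
    doc_freq = sum(1 for d in docs if word in d)
    return math.log((1 + len(docs)) / (1 + doc_freq)) + 1

texts = ['ball pitch goal corner keeper kick pass run referee',
         'ball pass scrum maul lineout kick goal fullback',
         'net pass serve set rotate net block hit libero ball',
         'court wing pass circle goal umpire quarter ball']
docs = [text.split() for text in texts]

print(idf_smooth('ball', docs))                # in all 4 documents: ln(5 / 5) + 1 = 1.0
print(round(idf_smooth('pitch', docs), 5))     # in 1 document: ln(5 / 2) + 1
print(round(idf_smooth('hockey', docs), 5))    # in no document: ln(5 / 1) + 1, no error
```

With this variant, words that occur in every document keep a small non-zero weight instead of dropping to 0, so the scores will not match the unsmoothed results shown in this notebook.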

In [10]:
'''
Here we loop through the list of documents called 'doclist'.
For each document, 'scores' is a dictionary of key:value pairs:
each key is a word in the document and the value is its tfidf score.
The words are sorted by tfidf score, largest first.
Lastly we print up to ten top-scoring words for each document.
'''

for i, doc in enumerate(doclist):
    print("Top words in document {}".format(i + 1), "({}...)".format(doc[:20]))
    scores = {word: tfidf(word, doc, doclist) for word in doc.words}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    for word, score in sorted_words[:10]:
        print("\t{}, TF-IDF: {}".format(word, round(score, 5)))

Top words in document 1 (ball pitch goal corn...)
	pitch, TF-IDF: 0.15403
	corner, TF-IDF: 0.15403
	keeper, TF-IDF: 0.15403
	run, TF-IDF: 0.15403
	referee, TF-IDF: 0.15403
	kick, TF-IDF: 0.07702
	goal, TF-IDF: 0.03196
	ball, TF-IDF: 0.0
	pass, TF-IDF: 0.0
Top words in document 2 (ball pass scrum maul...)
	scrum, TF-IDF: 0.17329
	maul, TF-IDF: 0.17329
	lineout, TF-IDF: 0.17329
	fullback, TF-IDF: 0.17329
	kick, TF-IDF: 0.08664
	goal, TF-IDF: 0.03596
	ball, TF-IDF: 0.0
	pass, TF-IDF: 0.0
Top words in document 3 (net pass serve set r...)
	net, TF-IDF: 0.27726
	serve, TF-IDF: 0.13863
	set, TF-IDF: 0.13863
	rotate, TF-IDF: 0.13863
	block, TF-IDF: 0.13863
	hit, TF-IDF: 0.13863
	libero, TF-IDF: 0.13863
	pass, TF-IDF: 0.0
	ball, TF-IDF: 0.0
Top words in document 4 (court wing pass circ...)
	court, TF-IDF: 0.17329
	wing, TF-IDF: 0.17329
	circle, TF-IDF: 0.17329
	umpire, TF-IDF: 0.17329
	quarter, TF-IDF: 0.17329
	goal, TF-IDF: 0.03596
	pass, TF-IDF: 0.0
	ball, TF-IDF: 0.0
