# BT2101 Introduction to Text Mining

## 1 Goal

In this notebook, we will explore text mining, including:
* Handling text data
* Parsing and Removing Punctuation
* Tokenization
* tf-idf

The examples shown in this notebook are based on ["Introduction to Machine Learning with Python: A Guide for Data Scientists"](#3-References) and ["Machine Learning with Python Cookbook"](#3-References). The code here has been revised and differs from the originals.

A typical text mining procedure:
* Extracting keywords from text
* Preprocessing: Converting unstructured data to structured data
* Keyword selection
* Clustering: Group similar words together; Identify common patterns/styles
* Analysis: Identify relationships between the information in text and the focal outcome variable

In [None]:
# -*- coding:utf-8 -*-
# __future__ imports must come before any other statement in the cell
from __future__ import division
from math import sqrt

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

## 2 Understand Text Data

Four data types that you frequently encounter:
* Numerical data
* Categorical data
* Structured String data
* Text data

The first three are all structured data. In this notebook, you will learn how to handle text data. <br/>

### 2.1 Cleaning Text
A few simple steps for cleaning text data:

In [None]:
# Suppose you have some text data that has already been loaded into a list of strings
text_data = ["      This is a BT2101 course    .",
            "This course teaches Machine learning methods,",
            "    including Supervised, unsupervised, and Deep learning models."]

In [None]:
# Some simple preparation work
# (list comprehensions return lists in both Python 2 and Python 3, unlike map)

# 1. Strip leading and trailing whitespace
strip_whitespace = [x.strip() for x in text_data]
strip_whitespace

# 2. Remove periods
remove_periods = [y.replace(".", "") for y in strip_whitespace]
remove_periods

# 3. Convert to upper case
upper_case = [z.upper() for z in remove_periods]
upper_case
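
The three cleaning steps above can also be combined into a single helper. The following is a minimal sketch, not part of the original notebook; the name `clean_line` is hypothetical. It assumes the periods should be removed before stripping, so that whitespace left in front of a removed period is stripped as well.

```python
def clean_line(line):
    """Remove periods, strip surrounding whitespace, and convert to upper case."""
    # Removing periods first means any spaces that preceded a final period get stripped too
    return line.replace(".", "").strip().upper()

sample_lines = ["      This is a BT2101 course    .",
                "This course teaches Machine learning methods,"]
cleaned = [clean_line(line) for line in sample_lines]
print(cleaned)
# ['THIS IS A BT2101 COURSE', 'THIS COURSE TEACHES MACHINE LEARNING METHODS,']
```

Note that the order of the steps matters: stripping first and removing the period afterwards would leave trailing spaces on the first line.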

### 2.2 Parsing

Parsing is the process of structuring the input text and deriving patterns within the structured data, including:
* Sentence Segmentation
* Remove stop words (e.g., numbers, punctuation, symbols, whitespace)
* Tokenization
* Stemming (Text Normalization)

#### Remove stop words
By removing words that are very common in a given language but carry little information, we can focus on the important words instead. Typical candidates are:
* Articles (the, a, an…)
* Prepositions (for, after, above, across, before, under…)
* Conjunctions (and, but, nor, yet, so, than…)
* Pronouns (she, he, I, you, they, them…)
* Auxiliary Verbs (can, will, could, would, must) and Linking Verbs (is, are, am)
* Question words (when, where, how, what, which)
* Punctuation

Stop words are the common words of a language (e.g., a, the, so, them, he, she, who, what, when, how, is, are). In text processing, stop words are usually ignored to speed up processing and, in many tasks, to improve accuracy.

In [None]:
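# Example 1: Remove punctuation
import string
import re

# Build a regex character class that matches any single punctuation character
punctuation_pattern = re.compile("[%s]" % re.escape(string.punctuation))
remove_punctuation = [punctuation_pattern.sub("", x) for x in text_data]
remove_punctuation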

In [None]:
# Example 2: Remove words that are very commonly used but less informative
# Load library (Natural Language Toolkit NLTK)
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')

In [None]:
# What are these stopwords
stop_words = stopwords.words('english')
stop_words

In [None]:
# Suppose you have a list of words, and you want to remove stop words from them
strings = "i am going to take this BT2101 module because it is very very interesting"
word_list = strings.split()
word_list

In [None]:
# Remove stop words
word_list_without_stopwords = [word for word in word_list if word not in stop_words]
word_list_without_stopwords
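
The membership test above is case-sensitive, so a capitalized word such as "It" would not match a lower-case stop word. The sketch below shows one way to handle this. It uses a small, hand-picked stop word set (`demo_stop_words`, a hypothetical stand-in for NLTK's list) so that it runs without any downloads.

```python
# Hypothetical, hand-picked stop word set used only for illustration;
# the notebook itself uses NLTK's English list.
demo_stop_words = {"i", "am", "to", "this", "because", "it", "is", "very"}

sentence = "I am going to take this BT2101 module because It is very very interesting"
tokens = sentence.split()

# Compare the lower-cased token against the stop word set, but keep the original casing
filtered = [tok for tok in tokens if tok.lower() not in demo_stop_words]
print(filtered)
# ['going', 'take', 'BT2101', 'module', 'interesting']
```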

#### Tokenization
A token is a meaningful unit of text, most often a word, that we are interested in using for further analysis. Tokenization is the process of splitting text into tokens (i.e., individual words). 
* Generally an easy task for English
* Split the string by space and punctuation
* Some problems arise with hyphenation, apostrophes, and periods
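
To make these points concrete, here is a minimal sketch of a naive tokenizer that splits on anything that is not a letter, digit, or underscore. The function name `simple_tokenize` is hypothetical. It works for plain sentences, but it breaks apart contractions, hyphenated words, and abbreviations.

```python
import re

def simple_tokenize(text):
    """Return runs of letters, digits, and underscores (a naive tokenizer)."""
    return re.findall(r"\w+", text)

plain = simple_tokenize("This is a BT2101 course.")
tricky = simple_tokenize("It's a state-of-the-art model from the U.S.A.")

print(plain)
# ['This', 'is', 'a', 'BT2101', 'course']
print(tricky)
# ['It', 's', 'a', 'state', 'of', 'the', 'art', 'model', 'from', 'the', 'U', 'S', 'A']
```

Dedicated tokenizers such as the NLTK one used in this section apply additional rules for cases like these.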


In [None]:
# Load library (Natural Language Toolkit NLTK)
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt')

In [None]:
# Suppose you have a text
# Join the cleaned lines with spaces so that words at line boundaries stay separate
text = " ".join(remove_punctuation)
text

In [None]:
# Tokenize into words
word_tokenize(text)

In [None]:
# Tokenize into sentences
from nltk.tokenize import sent_tokenize
text = "This is a BT2101 course. This course teaches Machine learning methods. You will learn Supervised, unsupervised, and Deep learning models."
sent_tokenize(text)

#### Stemming and Text Normalization
Text Normalization is a process of transforming text into a form that is consistent. For example:
* I’m really HAPPY! → i’m really happy!
* U.S.A → usa
* café → cafe

The purpose of normalization is to make the text more “general”, so that "café" and "cafe" are treated in the same manner.
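
A minimal sketch of such a normalization step, using only Python's standard library, is shown below. The helper name `normalize_text` is hypothetical, and the choice to drop periods for the "U.S.A" case is an assumption made to reproduce the example above.

```python
import unicodedata

def normalize_text(text):
    """Lower-case the text and strip accents (diacritics)."""
    # NFKD splits an accented letter into a base letter plus a combining mark
    decomposed = unicodedata.normalize("NFKD", text.lower())
    # Drop the combining marks and keep everything else
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(normalize_text("I'm really HAPPY!"))        # i'm really happy!
print(normalize_text("café"))                     # cafe
print(normalize_text("U.S.A").replace(".", ""))   # usa
```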

In addition to the obvious transformations (changing to lower case, etc.), we can also transform words to their stem (or root form):
* books → book
* beautiful → beauty
* eats → eat

This process is called Stemming. Porter stemmer is a popular rule-based stemming algorithm:
1. Remove plurals, -ed, -ing
2. Turn terminal y to i when there is another vowel in the stem (e.g., furry → furri, fry → fry)
3. Maps double suffixes to single ones (e.g., playfulness → playful)
4. Deals with suffixes, -full, -ness, etc.
5. Takes off -ant, -ence, etc.
6. Removes the final -e
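
To give a feel for what "rule-based" means here, the following toy sketch strips a few suffixes in the spirit of step 1. It is deliberately simplistic and is NOT the Porter algorithm, which applies many more rules and conditions. The name `naive_stem` and the minimum stem length of 3 are assumptions made for illustration.

```python
def naive_stem(word):
    """Strip a few common suffixes; a toy rule set, NOT the Porter algorithm."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when a stem of at least three characters remains
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

stems = [naive_stem(w) for w in ["books", "eats", "learning", "played", "is"]]
print(stems)
# ['book', 'eat', 'learn', 'play', 'is']
```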


In [None]:
# We use NLTK's PorterStemmer to do stemming
from nltk.stem.porter import PorterStemmer

In [None]:
# Suppose you have a list of word tokens, and you want to stem each of them
strings = "i am interested in these amazing machine learning models"
tokenized_words = strings.split()
tokenized_words

In [None]:
# Apply Porter stemming to each token
porter = PorterStemmer()
porter_stem = [porter.stem(word) for word in tokenized_words]
porter_stem

### 2.3 Bag of Words

One of the most common methods of transforming text into features is the **bag-of-words** model. A bag-of-words model outputs a feature for every unique word in the text data, with each feature containing a count of occurrences in each observation. For example, the sentence `I love Brazil. Brazil!` has a value of 2 in the `brazil` feature, because the word *brazil* appears twice.

Suppose you want to create a set of features indicating the number of times an observation's text contains a particular word.
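
Before turning to a library, the idea can be sketched in plain Python with `collections.Counter`. This is an illustrative sketch only; the helper `tokenize` is hypothetical and keeps lower-cased words of two or more characters, which is an assumption chosen so that single-letter words such as "I" are dropped.

```python
import re
from collections import Counter

docs = ["I love machine learning. Machine learning!",
        "Ensemble learning is the best",
        "Deep learning beats all"]

def tokenize(doc):
    """Lower-case and keep words of two or more characters (a naive tokenizer)."""
    return re.findall(r"\b\w\w+\b", doc.lower())

# One Counter per document: word -> number of occurrences
counts = [Counter(tokenize(doc)) for doc in docs]

# The vocabulary is the sorted set of all words seen in any document
vocabulary = sorted(set(word for c in counts for word in c))

# Each row is a document, each column a vocabulary word
matrix = [[c[word] for word in vocabulary] for c in counts]

print(vocabulary)
for row in matrix:
    print(row)
# ['all', 'beats', 'best', 'deep', 'ensemble', 'is', 'learning', 'love', 'machine', 'the']
# [0, 0, 0, 0, 0, 0, 2, 1, 2, 0]
# [0, 0, 1, 0, 1, 1, 1, 0, 0, 1]
# [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
```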

In [None]:
# Load library
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
# Create text
text_data = np.array(['I love machine learning. Machine learning!', 'Ensemble learning is the best', 'Deep learning beats all'])
text_data

In [None]:
# Create bag of words
count = CountVectorizer()
bag_of_words = count.fit_transform(text_data)
bag_of_words

In [None]:
# get_feature_names_out() requires scikit-learn 1.0+; older versions use get_feature_names()
bagofwords = pd.DataFrame(bag_of_words.toarray(), columns=count.get_feature_names_out())
bagofwords

### 2.4 TF-IDF

So far we have weighted each token based on its term frequency (frequency of occurrence in the document).
* Idea: words that occur more → document seems to focus more on that idea
* However, even after excluding stop words, some words are simply more common than others
* Being more frequent does not necessarily make them more important than low-frequency words

An improvement is to also consider how a word is used in other documents in the corpus. The statistic **tf-idf** is intended to measure how important a word is to a document in a collection (or corpus) of documents, for example, to one novel in a collection of novels or to one website in a collection of websites.

tf-idf (term frequency * inverse document frequency) is an approach to reflect how important a word is.
* Made up of 2 components: tf and idf
* tf = how many times the term appears in the document

<img src="http://dovgalecs.com/blog/wp-content/uploads/2012/03/img131.gif" width="500">
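
If the image above does not render, one common textbook form of idf, which agrees with the worked example that follows, is sketched here. The symbols $N$ and $df(t)$ are introduced for illustration and may differ from the notation used in the image:

$$idf(t) = \log\frac{N}{df(t)}$$

where $N$ is the number of documents in the corpus and $df(t)$ is the number of documents that contain the term $t$.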

For example, suppose we have 100 documents in our corpus:
* The term “SOC” appears in just 1 document: $idf(\text{“soc”}) = \log(100/1) = \log(100)$
* The term “i” (a stop word) appears in all the documents: $idf(\text{“i”}) = \log(100/100) = \log(1) = 0$

Justification for idf: if a term appears in many documents, any single appearance is probably not important. If a term is seldom seen, then when it does appear it is likely to be important (i.e., the document is likely to be about it).

Having motivated idf, we still need to account for the fact that a term mentioned many times in a document is probably important to that document. The final weight therefore combines both: $\text{final weight} = tf \times idf$
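
The weighting can be worked through by hand on a tiny corpus. The sketch below assumes the plain form of idf, the log of the total number of documents divided by the number of documents containing the term, and uses the raw count as tf. The names `docs`, `doc_freq`, and `tf_idf` are hypothetical, and the code assumes Python 3 division.

```python
import math
from collections import Counter

# Three pre-tokenized documents (tokens chosen by hand for illustration)
docs = [["love", "machine", "learning", "machine", "learning"],
        ["ensemble", "learning", "is", "the", "best"],
        ["deep", "learning", "beats", "all"]]

n_docs = len(docs)

# Document frequency: number of documents that contain each term
doc_freq = Counter(term for doc in docs for term in set(doc))

def tf_idf(term, doc):
    """Raw term count in doc times log(N / df); assumes the term occurs in the corpus."""
    tf = doc.count(term)
    idf = math.log(n_docs / doc_freq[term])
    return tf * idf

# "learning" appears in every document, so idf = log(1) = 0
print(tf_idf("learning", docs[0]))
# "machine" appears twice in one document only, so the weight is 2 * log(3)
print(tf_idf("machine", docs[0]))
```

Library implementations often differ from this plain formula. scikit-learn's `TfidfVectorizer`, for instance, by default smooths the idf term and L2-normalizes each row, so its numbers will not match the ones computed here.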


In [None]:
# Load library
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
# Create text
text_data = np.array(['I love machine learning. Machine learning!', 'Ensemble learning is the best', 'Deep learning beats all'])
text_data

In [None]:
# Create tf-idf feature matrix
tfidf = TfidfVectorizer()
feature_matrix = tfidf.fit_transform(text_data)
feature_matrix

In [None]:
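# Use the feature names in column order; the key order of vocabulary_ does not match the columns
# get_feature_names_out() requires scikit-learn 1.0+; older versions use get_feature_names()
tfidf_vector = pd.DataFrame(feature_matrix.toarray(), columns=tfidf.get_feature_names_out())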
tfidf_vector

In [None]:
# vocabulary_ maps each term to its column index in the feature matrix
tfidf.vocabulary_

More information about `NLTK` can be found at https://www.nltk.org/.

## 3 References
[1] Müller, A.C. and Guido, S., 2016. Introduction to Machine Learning with Python: A Guide for Data Scientists. O'Reilly Media, Inc. <br/>
[2] Albon, C., 2018. Machine Learning with Python Cookbook. O'Reilly Media, Inc.