## Text Mining and Analytics

Text mining analyses texts by turning them into a more structured form and then deriving insights from that structure.

#### Example technique

**Named entity recognition (NER):** a subtask of information extraction that locates and classifies *named entities* mentioned in *unstructured text* into predefined categories such as:
* names of people 
* locations
* dates
* ID codes

e.g. Unstructured text --> named entity recognition --> annotated text. Starting from the sentence:

>Jim bought 300 shares of Acme Corp. in 2006.

NER produces an annotated block of text that highlights the names of entities:

>**Jim** bought 300 shares of **Acme Corp.** in **2006**.

Jim: Person  
Acme Corp: Organization  
2006: Time  
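
The annotation step can be imitated with a toy tagger. This is only a minimal sketch: it uses a hand-made lookup table (`GAZETTEER`, a hypothetical name) plus a regular expression for years, whereas real NER systems are usually statistical models. All names in the block are assumptions made for illustration.

```python
import re

# Hypothetical gazetteer: a hand-made lookup of known entity names.
GAZETTEER = {"Jim": "Person", "Acme Corp.": "Organization"}

# Four-digit years starting with 19 or 20 are treated as "Time" entities.
YEAR = re.compile(r"\b(?:19|20)\d{2}\b")


def tag_entities(text):
    """Return (entity, label) pairs in the order they appear in text."""
    found = []
    for name, label in GAZETTEER.items():
        for match in re.finditer(re.escape(name), text):
            found.append((match.start(), name, label))
    for match in YEAR.finditer(text):
        found.append((match.start(), match.group(), "Time"))
    # Sort by start position so the output follows the sentence order.
    return [(entity, label) for _, entity, label in sorted(found)]


sentence = "Jim bought 300 shares of Acme Corp. in 2006."
entities = tag_entities(sentence)
for entity, label in entities:
    print(f"{entity}: {label}")
```

A lookup table like this cannot recognise names it has never seen, which is one reason trained models are preferred in practice.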

#### Applications
* Entity identification
* Plagiarism detection
* Topic identification
* Text clustering
* Translation
* Auto-text summarisation
* Fraud detection
* Spam filtering
* Sentiment analysis

#### Difficulties
* Text ambiguity
* Spelling mistakes 
* Acronyms
* Homonyms: bat (animal) <--> bat (baseball)
* Language context: 
    * English language models won't work well for Arabic and vice versa
    * An algorithm trained on Twitter data won't work well if applied to legal texts




### Bag of Words

Simplest way to structure textual data --> every document is turned into a word vector. Each entry of the vector is a boolean that is true if the corresponding word is contained in the document.
    
Consider two documents:

> 1. "Game of thrones is a great television series but the books are better."  
 
> 2. "Doing data science is more fun than watching television"

The documents can be combined into a structured format called the *document-term matrix*:
* Rows --> documents
* Columns --> words

A binary-coded bag of words is one way to structure the data; others exist.

```python
import json

# One (features, label) pair per document: each feature is True if the
# word occurs in that document and False otherwise.
a = [
    ({'game': True, 'of': True, 'thrones': True, 'is': True, 'a': True,
      'great': True, 'television': True, 'series': True, 'but': True,
      'the': True, 'books': True, 'are': True, 'better': True,
      'doing': False, 'data': False, 'science': False, 'more': False,
      'fun': False, 'than': False, 'watching': False},
     'gameofthrones'),
    ({'doing': True, 'data': True, 'science': True, 'is': True,
      'more': True, 'fun': True, 'than': True, 'watching': True,
      'television': True, 'game': False, 'of': False, 'thrones': False,
      'a': False, 'great': False, 'series': False, 'but': False,
      'the': False, 'books': False, 'are': False, 'better': False},
     'datascience'),
]

print(json.dumps(a, indent=2))
```

```
[
  [
    {
      "game": true,
      "of": true,
      "thrones": true,
      "is": true,
      "a": true,
      "great": true,
      "television": true,
      "series": true,
      "but": true,
      "the": true,
      "books": true,
      "are": true,
      "better": true,
      "doing": false,
      "data": false,
      "science": false,
      "more": false,
      "fun": false,
      "than": false,
      "watching": false
    },
    "gameofthrones"
  ],
  [
    {
      "doing": true,
      "data": true,
      "science": true,
      "is": true,
      "more": true,
      "fun": true,
      "than": true,
      "watching": true,
      "television": true,
      "game": false,
      "of": false,
      "thrones": false,
      "a": false,
      "great": false,
      "series": false,
      "but": false,
      "the": false,
      "books": false,
      "are": false,
      "better": false
    },
    "datascience"
  ]
]
```
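
The hand-written dictionaries can also be derived from the raw sentences. The following is a minimal sketch of building a binary document-term matrix, assuming a simple regex tokeniser (`tokenize` is a name introduced here for illustration) that lowercases the text and keeps only alphabetic words.

```python
import re

# The two example documents, keyed by their labels.
docs = {
    "gameofthrones": "Game of thrones is a great television series "
                     "but the books are better.",
    "datascience": "Doing data science is more fun than watching television",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


token_sets = {label: set(tokenize(text)) for label, text in docs.items()}

# Columns: every distinct word in the corpus, in alphabetical order.
vocabulary = sorted(set().union(*token_sets.values()))

# Rows: one boolean vector per document.
matrix = {
    label: [word in words for word in vocabulary]
    for label, words in token_sets.items()
}

print(vocabulary)
for label, row in matrix.items():
    print(label, [int(flag) for flag in row])
```

Each row has one entry per vocabulary word, so both documents end up as vectors of the same length regardless of how many words they contain.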


### Cleansing Text Data

A number of standard cleansing tasks should be done before building the document-term matrix:
* normalising --> lowercase all text
* tokenisation --> cut text into pieces called tokens or terms
    * lots of mechanisms for this
    * consider a token as a unit of information
        * unigrams: single-word tokens
        * bigrams: two words per token
        * n-grams: n words per token
* stop word filtering --> remove very common words that carry little information
    * NLTK comes with a list of English stop words
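
The cleansing steps can be sketched in plain Python. This is an illustrative sketch only: the stop word set below is a tiny hand-picked assumption, not NLTK's actual list, and the function names are made up for this example.

```python
import re

# Tiny illustrative stop word set; an assumption, not NLTK's actual list.
STOP_WORDS = {"a", "are", "but", "is", "of", "the"}


def normalise(text):
    """Lowercase all text."""
    return text.lower()


def tokenise(text):
    """Split already-normalised text into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text)


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word set."""
    return [token for token in tokens if token not in STOP_WORDS]


text = "Game of thrones is a great television series"
unigrams = tokenise(normalise(text))
bigrams = ngrams(unigrams, 2)
filtered = remove_stop_words(unigrams)

print(unigrams)
print(bigrams)
print(filtered)
```

Note that filtering stop words before building bigrams would change which word pairs are produced, so the order of the steps is a design choice.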
    
#### Stemming and lemmatisation

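*Stemming* is the process of bringing words back to their root form, for example by removing suffixes. Some (idealised) examples; a rule-based stemmer may return a truncated stem such as `announc` rather than a dictionary word: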
* [courses, course] --> course
* [announce, announces, announcing] --> announce  
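
Suffix removal can be illustrated with a toy stemmer. This is a sketch under loud assumptions: the four suffix rules and the minimum stem length are hypothetical and far cruder than an established algorithm such as Porter's. It does show the grouping effect, and also that the shared stem need not be a dictionary word.

```python
# Hypothetical suffix rules, tried in order; real stemmers use many more.
SUFFIXES = ("ing", "es", "s", "e")

# Refuse to strip a suffix if fewer than this many characters would remain.
MIN_STEM_LENGTH = 3


def toy_stem(word):
    """Strip the first matching suffix, keeping at least MIN_STEM_LENGTH chars."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[:-len(suffix)]
    return word


groups = [
    ["courses", "course"],
    ["announce", "announces", "announcing"],
]

# Each group collapses to a single stem.
stems = [{toy_stem(word) for word in group} for group in groups]
print(stems)
```

Every word in a group maps to the same stem, which is what matters for building the document-term matrix, even though the stems here are truncated forms.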
    
*Lemmatisation*, as defined on Wikipedia:

>In linguistics, lemmatisation is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form.
>
> 1. The word "better" has "good" as its lemma. This link is missed by stemming, as it requires a dictionary look-up.
>
> 2. The word "walk" is the base form for the word "walking", and hence this is matched in both stemming and lemmatisation.
>
> 3. The word "meeting" can be either the base form of a noun or a form of a verb ("to meet") depending on the context; e.g., "in our last meeting" or "We are meeting again tomorrow". Unlike stemming, lemmatisation attempts to select the correct lemma depending on the context.

Transforming words to their lemma form benefits from part-of-speech tagging (POS tagging). This process attributes a grammatical label to every word in a sentence. For example:

>Game of thrones is a television series

is tagged as 

```
{
    'game':        'NN',   # Noun, singular or mass
    'of':          'IN',   # Preposition or subordinating conjunction
    'thrones':     'NNS',  # Noun, plural
    'is':          'VBZ',  # Verb, 3rd person singular present
    'a':           'DT',   # Determiner
    'television':  'NN',   # Noun, singular or mass
    'series':      'NN'    # Noun, singular or mass
}
```

POS tagging is a use case of sentence tokenisation rather than word tokenisation: the tagger needs the whole sentence, because the tag of a word depends on its context.
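
How a POS tag helps pick the right lemma can be sketched with a small lookup table. The table `LEMMAS` and the function `lemmatise` are hypothetical and cover only the Wikipedia examples above; a real lemmatiser relies on a full dictionary and a trained tagger. The tags are Penn Treebank tags of the kind used in the tagging example.

```python
# Hypothetical lemma dictionary keyed by (word, Penn Treebank POS tag).
LEMMAS = {
    ("better", "JJR"): "good",     # JJR: adjective, comparative
    ("walking", "VBG"): "walk",    # VBG: verb, gerund or present participle
    ("meeting", "VBG"): "meet",
    ("meeting", "NN"): "meeting",  # NN: noun, singular or mass
}


def lemmatise(word, pos_tag):
    """Look up the lemma; fall back to the lowercased word if unknown."""
    key = (word.lower(), pos_tag)
    return LEMMAS.get(key, word.lower())


# "in our last meeting" vs. "We are meeting again tomorrow"
print(lemmatise("meeting", "NN"))
print(lemmatise("meeting", "VBG"))
print(lemmatise("better", "JJR"))
```

The same surface form "meeting" resolves to two different lemmas, which a stemmer working on the word alone cannot do.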

Let's consider how to build the document-term matrix.

### TF-IDF

#### Term frequency

Instead of binary encoding each word in a document, we can count the number of times each word occurs in the document:

> $TF(t,d) = f_{t,d}$ 

where $f_{t,d}$ is the number of times term $t$ occurs in document $d$.

If the corpus contains many documents of different lengths, normalise the raw counts by document length. For example, the word "stock" could occur in both a 200-word email and a 20-word email (c.f. the Enron emails case study). The count could be 20 in the 200-word email and 2 in the 20-word email; once normalised, both are the same proportion of the document (10%). The count can also be scaled logarithmically.
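
The "stock" example can be checked with a few lines of Python. This is a minimal sketch: the two emails are synthetic token lists built to match the counts in the text, and `math.log(1 + count)` is just one common choice of logarithmic scaling, assumed here for illustration.

```python
import math
from collections import Counter


def raw_tf(term, tokens):
    """Raw count of term in the token list."""
    return Counter(tokens)[term]


def normalised_tf(term, tokens):
    """Raw count divided by document length."""
    return raw_tf(term, tokens) / len(tokens)


def log_tf(term, tokens):
    """Logarithmically scaled count (one common variant, assumed here)."""
    return math.log(1 + raw_tf(term, tokens))


# Synthetic emails: 20 of 200 words and 2 of 20 words are "stock".
long_email = ["stock"] * 20 + ["other"] * 180
short_email = ["stock"] * 2 + ["other"] * 18

print(raw_tf("stock", long_email), raw_tf("stock", short_email))
print(normalised_tf("stock", long_email), normalised_tf("stock", short_email))
print(log_tf("stock", long_email), log_tf("stock", short_email))
```

The raw counts differ by a factor of ten, while the normalised values agree, which is the point of dividing by document length.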

#### Inverse Document Frequency

Document frequency indicates how common a term is in the entire corpus (collection of documents); IDF is its inverse, so rarer terms score higher.
* Terms with **high document frequency** $\implies$ appear almost everywhere, so they won't be good for discriminating when clustering documents
* Terms with **very low document frequency** $\implies$ appear in too few documents to be the basis for a meaningful cluster

IDF has a functional form that captures the distribution of the term over the entire corpus:

> $IDF(t,D) = 1 + \log_{2}\left(\frac{N}{N_{t}}\right)$

where $N$ is the number of documents in the corpus $D$ and $N_{t} = |\{d \in D : t \in d\}|$ is the number of documents containing term $t$. Note a couple of things:
* If a term is present in every document then $N_{t} = N$, the logarithm is 0 and the IDF is 1
* $N / N_{t}$ is the inverse of the document frequency $N_{t} / N$
* There are alternative weighting schemes available
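
The IDF formula translates directly into code. In this sketch the four-document corpus is made up for illustration and each document is represented as a set of words; the function name `idf` is an assumption.

```python
import math


def idf(term, corpus):
    """IDF(t, D) = 1 + log2(N / N_t), with N_t the documents containing t."""
    n_docs = len(corpus)
    n_containing = sum(1 for doc in corpus if term in doc)
    if n_containing == 0:
        raise ValueError(f"term {term!r} does not occur in the corpus")
    return 1 + math.log2(n_docs / n_containing)


# Made-up corpus: "stock" is in every document, "lunch" in only one.
corpus = [
    {"stock", "price", "rises"},
    {"stock", "market", "falls"},
    {"stock", "meeting", "today"},
    {"stock", "lunch", "menu"},
]

print(idf("stock", corpus))  # term in all 4 documents
print(idf("lunch", corpus))  # term in 1 of 4 documents
```

A term found in every document gets the minimum value of 1, and the value grows as the term becomes rarer.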

#### TF-IDF

This is calculated as 

> $TFIDF(t,d,D) = TF(t,d) \cdot IDF(t,D)$

A high weight in TF-IDF is reached by a high term frequency (in the given document) and a low document frequency of the term in the whole collection of documents; the weights hence tend to filter out common terms.
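
Putting the two parts together, here is a minimal sketch of TF-IDF applied to the two example documents from the bag-of-words section. It assumes raw counts for TF, the $1 + \log_{2}$ form of IDF given above and a simple regex tokeniser; the function names are illustrative.

```python
import math
import re
from collections import Counter

docs = [
    "Game of thrones is a great television series but the books are better.",
    "Doing data science is more fun than watching television",
]

# Lowercase each document and keep only alphabetic word tokens.
tokenised = [re.findall(r"[a-z]+", doc.lower()) for doc in docs]


def tf(term, tokens):
    """Raw count of term in one document."""
    return Counter(tokens)[term]


def idf(term, corpus):
    """1 + log2(N / N_t); term must occur in at least one document."""
    n_containing = sum(1 for tokens in corpus if term in tokens)
    return 1 + math.log2(len(corpus) / n_containing)


def tfidf(term, tokens, corpus):
    """TF-IDF weight of term in one document of the corpus."""
    return tf(term, tokens) * idf(term, corpus)


# Weights for every distinct term of the first document.
weights = {
    term: tfidf(term, tokenised[0], tokenised) for term in set(tokenised[0])
}
for term, weight in sorted(weights.items()):
    print(f"{term}: {weight}")
```

Words shared by both documents, such as "television" and "is", receive a lower weight than words that occur in only one of them, such as "thrones".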

#### Note on the Naive part of the Naive Bayes classifier

In the NB classifier each element of the evidence vector $E = (e_{1}, \dots, e_{k})$ is considered independent of the others, so the likelihood of the vector, conditioned on $C = c_{1}$, decomposes into 

>$P(E \mid C=c_{1}) = P(e_{1} \mid c_{1})\,P(e_{2} \mid c_{1}) \cdots P(e_{k} \mid c_{1})$

This is naive because the words in phrases like `Data Science`, `data analysis` and `Game of Thrones` will not be linked if the data is prepared as unigrams. Bigrams, trigrams, etc. would be needed to capture such word *interactions*. 
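
The effect of the independence assumption can be seen in a short sketch. The per-word likelihoods below are invented numbers for a single hypothetical class $c_{1}$, not estimates from any data set.

```python
import math

# Hypothetical per-word likelihoods P(e_i | c_1) for one class.
likelihoods = {"data": 0.20, "science": 0.10, "game": 0.01, "thrones": 0.01}


def naive_likelihood(words, word_likelihoods):
    """P(E | C = c_1) as the product of the individual P(e_i | c_1)."""
    return math.prod(word_likelihoods[word] for word in words)


in_order = naive_likelihood(["data", "science"], likelihoods)
shuffled = naive_likelihood(["science", "data"], likelihoods)

# The product ignores word order and adjacency entirely.
print(in_order, shuffled)
```

Because the likelihood is a plain product, the phrase "data science" scores exactly the same as the two words appearing anywhere in the document in any order.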
> $\therefore$ Good to compare the NB against the decision tree classifier as it *does* consider variable interactions

## Classifying reddit posts