# Unstructured Text
Most of the data we've dealt with so far has been structured. Unstructured data includes things like emails, blog posts and articles. One useful thing you can do with this kind of data is sentiment analysis: deciding whether a piece of text is positive or negative.

If we want to decide if someone likes or dislikes different foods, we can come up with a list of words that provide evidence the person would like/dislike the food:

| Like  | Dislike  |
|:-:|:-:|
| Delicious  | Awful  |
| Tasty  | Bland  |
| Good  | Bad  |
| Love  | Hate  |
| Smooth  | Gritty  |

We can count how many like and dislike words there are and use that to classify the text. Rather than using raw counts, though, we can use naive Bayes:

$$
h_{MAP} = \arg\max_{h \in H} P(D|h)P(h)
$$

- $h_{MAP} = \arg\max_{h \in H}$: Of all the hypotheses in $H$, pick the one with the highest probability
- $P(D|h)$: Probability of the data given the hypothesis (e.g. the probability of seeing specific words in the text given the class of the text)
- $P(h)$: Probability of the hypothesis
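
As a minimal sketch of the arg max, assuming the likelihoods $P(D|h)$ and priors $P(h)$ are already known. The numbers are made up for illustration and `map_hypothesis` is a hypothetical helper name, not something from these notes:

```python
def map_hypothesis(likelihoods, priors):
    """Return the hypothesis h with the largest P(D|h) * P(h)."""
    scores = {h: likelihoods[h] * priors[h] for h in priors}
    return max(scores, key=scores.get)

# Illustrative values only: P(D|h) and P(h) for two hypotheses.
likelihoods = {'like': 0.002, 'dislike': 0.0005}
priors = {'like': 0.5, 'dislike': 0.5}

best = map_hypothesis(likelihoods, priors)
print(best)
```

With equal priors the decision is driven entirely by the likelihoods; unequal priors can tip the result towards the more common class.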

We can have a training dataset of text called the training __corpus__. Each entry in the corpus is a __document__ (e.g. a tweet). Each document is labelled with a class (e.g. favourable/unfavourable), and we train our classifier using this corpus. The priors $P(h)$ for the two classes might be:

- P(Favourable) = 0.5
- P(Unfavourable) = 0.5
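
One common way to get priors like these is to take the fraction of training documents carrying each label. A small sketch, assuming a hypothetical list of labels with an even split (`estimate_priors` is an illustrative name):

```python
from collections import Counter

def estimate_priors(labels):
    """Estimate P(h) as the share of documents labelled h."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}

# Hypothetical labels for a four-document corpus.
labels = ['favourable', 'unfavourable', 'favourable', 'unfavourable']
priors = estimate_priors(labels)
print(priors)
```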

$P(D|h)$ is the probability of seeing some evidence (data) given the hypothesis $h$. Here the data is the text. We could calculate a probability for each word at each position in the text, but there are so many words that this could take a while. Instead, we can treat documents as bags of words.

Instead of asking for the probability that the third word of a favourable review is 'thrill', we ask for the probability that the word 'thrill' occurs anywhere in a favourable document.
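
A quick sketch of the bag-of-words idea, assuming simple lowercase whitespace tokenisation (real text would need punctuation handling). `bag_of_words` is a hypothetical helper:

```python
from collections import Counter

def bag_of_words(text):
    """Lowercase, split on whitespace and count each token; word order is discarded."""
    return Counter(text.lower().split())

a = bag_of_words('the thrill of the chase')
b = bag_of_words('the chase of the thrill')
print(a['thrill'])
print(a == b)  # same words in a different order give the same bag
```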

## Training
First we determine the __vocabulary__ (the unique words) of all the documents in the corpus; |Vocabulary| is the number of words in the vocabulary. Next, for each word $w_k$ in the vocabulary, we compute the probability of that word occurring given each hypothesis, $P(w_k|h_i)$:

1. Combine the documents tagged with hypothesis $h_i$ into a single text file
2. Count how many word occurrences there are in the file and call this $n$ (e.g. if the file contains 500 words, $n = 500$)
3. For each word $w_k$ in the vocabulary, count how many times that word occurs in the file. Call this $n_k$
4. For each word in the vocabulary, compute:

$$
P(w_k|h_i) = \frac{n_k + 1}{n + |\text{Vocabulary}|}
$$
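
The training steps above can be sketched as follows. This is a minimal version under stated assumptions: documents arrive already tokenised as `(tokens, label)` pairs, the two-document corpus is made up, and `train` is a hypothetical function name:

```python
from collections import Counter

def train(corpus):
    """corpus: list of (tokens, label) pairs.
    Returns the vocabulary and {label: {word: P(word|label)}}."""
    vocabulary = {word for tokens, _ in corpus for word in tokens}

    # Step 1: pool the tokens of all documents sharing a label.
    pooled = {}
    for tokens, label in corpus:
        pooled.setdefault(label, []).extend(tokens)

    probabilities = {}
    for label, tokens in pooled.items():
        n = len(tokens)           # Step 2: total word occurrences for this label
        counts = Counter(tokens)  # Step 3: n_k for every word
        # Step 4: (n_k + 1) / (n + |Vocabulary|)
        probabilities[label] = {
            word: (counts[word] + 1) / (n + len(vocabulary))
            for word in vocabulary
        }
    return vocabulary, probabilities

# Made-up corpus for illustration.
corpus = [
    (['tasty', 'and', 'delicious'], 'like'),
    (['bland', 'and', 'gritty'], 'dislike'),
]
vocabulary, probabilities = train(corpus)
print(probabilities['like']['tasty'])  # (1 + 1) / (3 + 5)
print(probabilities['like']['bland'])  # (0 + 1) / (3 + 5)
```

The `+ 1` in the numerator (add-one smoothing) means a word that never appears with a label still gets a non-zero probability, so a single unseen word cannot zero out the whole product.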

Once the training phase is complete, we can classify documents with the formula shown above:

$$
h_{MAP} = \arg\max_{h \in H} P(D|h)P(h)
$$

## After Training Example

In [3]:
import math  # used later for log probabilities
import pandas as pd

# Tokens of the test tweet we want to classify
test = ['I', 'am', 'stunned', 'by', 'the', 'hype', 'over', 'gravity']
# Conditional word probabilities produced by training, one entry per vocabulary word
words = {
    'word': ['am', 'by', 'good', 'gravity', 'great', 'hype', 'I', 'over', 'stunned', 'the'],
    'P(word|like)': [0.007, 0.012, 0.002, 0.00001, 0.003, 0.0007, 0.01, 0.005, 0.0009, 0.047],
    'P(word|dislike)': [0.009, 0.012, 0.0005, 0.00001, 0.0007, 0.002, 0.01, 0.0047, 0.002, 0.0465],
}

df = pd.DataFrame(words)
df

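|   | P(word\|dislike) | P(word\|like) | word |
|:-:|:-:|:-:|:-:|
| 0 | 0.009 | 0.007 | am |
| 1 | 0.012 | 0.012 | by |
| 2 | 0.0005 | 0.002 | good |
| 3 | 1e-05 | 1e-05 | gravity |
| 4 | 0.0007 | 0.003 | great |
| 5 | 0.002 | 0.0007 | hype |
| 6 | 0.01 | 0.01 | I |
| 7 | 0.0047 | 0.005 | over |
| 8 | 0.002 | 0.0009 | stunned |
| 9 | 0.0465 | 0.047 | the |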


## Predicting test tweet
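For each hypothesis, multiply the prior by the probability of each word in the tweet given that hypothesis, then pick the hypothesis with the larger product:

- $P(\text{like}) \times P(\text{I}|\text{like}) \times P(\text{am}|\text{like}) \times \dots$
- $P(\text{dislike}) \times P(\text{I}|\text{dislike}) \times P(\text{am}|\text{dislike}) \times \dots$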

In [15]:
def like_vs_dislike(tokens, probabilities_df, h_liked_p, h_disliked_p):
    """Return P(like) * prod P(word|like) and P(dislike) * prod P(word|dislike)."""
    # Start from the priors and multiply in one conditional probability per token
    total_liked_p = h_liked_p
    total_disliked_p = h_disliked_p
    for t in tokens:
        # Look up the row for this token; assumes every token is in the vocabulary
        token_probabilities = probabilities_df[probabilities_df['word'] == t]
        liked_p = token_probabilities['P(word|like)'].values[0]
        disliked_p = token_probabilities['P(word|dislike)'].values[0]

        total_liked_p *= liked_p
        total_disliked_p *= disliked_p

    return (total_liked_p, total_disliked_p)


like_p, dislike_p = like_vs_dislike(test, df, 0.5, 0.5)
print(like_p, dislike_p)
category = 'like' if like_p > dislike_p else 'dislike'
print(f'Higher probability that the tweet should be classified in the {category} category')

6.2181e-22 4.72068e-21
Higher probability that the tweet should be classified in the dislike category


## Small Numbers and Python
Word probabilities are very small numbers, and multiplying many of them together gives a tiny number. Python floats have a limited range, so a small enough product underflows to 0. To solve this we can add the logs of the probabilities rather than multiplying the probabilities.

In [22]:
print(0.0001 ** 100)        # the true value, 1e-400, underflows to 0.0
print(0 + math.log(0.001))  # a log probability stays comfortably representable

0.0
-6.907755278982137
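
A sketch of the same like/dislike comparison done in log space, using the test tweet and the word probabilities from the example earlier in these notes. The probabilities are copied into plain dictionaries to keep the block self-contained, and `log_score` is a hypothetical helper name:

```python
import math

test = ['I', 'am', 'stunned', 'by', 'the', 'hype', 'over', 'gravity']
p_like = {'am': 0.007, 'by': 0.012, 'gravity': 0.00001, 'hype': 0.0007,
          'I': 0.01, 'over': 0.005, 'stunned': 0.0009, 'the': 0.047}
p_dislike = {'am': 0.009, 'by': 0.012, 'gravity': 0.00001, 'hype': 0.002,
             'I': 0.01, 'over': 0.0047, 'stunned': 0.002, 'the': 0.0465}

def log_score(tokens, word_probabilities, prior):
    """Return log P(h) plus the sum of log P(word|h) over the tokens."""
    return math.log(prior) + sum(math.log(word_probabilities[t]) for t in tokens)

like_log = log_score(test, p_like, 0.5)
dislike_log = log_score(test, p_dislike, 0.5)
print(like_log, dislike_log)
print('dislike' if dislike_log > like_log else 'like')
```

Because the log is an increasing function, the hypothesis with the larger sum of logs is also the one with the larger product, so the classification does not change.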


## Logarithms
If $b^n = x$, then $\log_b(x) = n$: the log of a number $x$ is the exponent $n$ to which you need to raise a base $b$ to get that number.

$\log_{10}(1000) = 3$ since $1000 = 10^3$

The base of the Python log function is _e_. Taking logs compresses the scale of a number, and because $\log(ab) = \log(a) + \log(b)$, we add the logs of the probabilities instead of multiplying the probabilities.
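
A short check of these properties with Python's `math` module. The values of `a` and `b` are arbitrary small probabilities chosen for illustration:

```python
import math

print(math.log10(1000))   # base-10 log
print(math.log(math.e))   # math.log uses base e when no base is given
print(math.log(8, 2))     # an explicit base can be passed as the second argument

# The log of a product equals the sum of the logs (up to floating-point rounding).
a, b = 0.002, 0.0007
product_log = math.log(a * b)
sum_of_logs = math.log(a) + math.log(b)
print(math.isclose(product_log, sum_of_logs))
```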

## Stop Words
You can often throw out the 200 most frequent words in the English language (stop words), as they tend not to add useful information to the classifier. That said, for certain tasks (such as identifying the time period when documents were written) it can be useful to keep only the most frequent words and throw out the rest. Frequent words can also carry a signal of their own: in online chats, predators may use the words 'I', 'me' and 'you' more frequently than other people.
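
A minimal sketch of stop-word filtering. The stop list here is a deliberately tiny, hypothetical one (a real list would hold a few hundred frequent words), and `remove_stop_words` is an illustrative name:

```python
# Hypothetical stop list, much shorter than a real one.
STOP_WORDS = {'i', 'am', 'by', 'the', 'over'}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lowercase form appears in the stop list."""
    return [t for t in tokens if t.lower() not in stop_words]

tokens = ['I', 'am', 'stunned', 'by', 'the', 'hype', 'over', 'gravity']
filtered = remove_stop_words(tokens)
print(filtered)
```

Filtering would be applied to both the training corpus and the documents being classified, so that the vocabulary and the test tokens stay consistent.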