# Language Modeling and N-grams

At the core of many NLP tasks is the practice of assigning probabilities to a given string. This is referred to as **language modeling**, and it has seen a huge amount of development in recent years.

If we can assign probabilities to a string, we can tackle tasks such as:

- predicting the next word in a sentence (auto-complete)
- correctly parsing speech to text
- correcting grammatical errors in written text
- assessing the quality of translations
- measuring the probability or likelihood of a given string

Modern approaches rely on deep learning, massive language models and large amounts of data (BERT and co.), which have proven to be a formidable solution. In contrast, n-grams are a relatively simple and naive approach based on token frequencies. Despite their age, these models offer the following advantages:

- flexibility in adapting and transforming data for your corpus
- relatively lightweight

Furthermore, in scenarios where your data is unusual or scarce, or where compute and latency costs are high, n-gram based approaches might be worth considering.


## N-Grams

When trying to model the next word in a sentence, we could take the frequentist approach and make a guess by referring to a large corpus, counting how frequently the given sequence occurs and how often each of its continuations appears.

Thus we could pretty well predict the next word in a sequence like: 

$$ P(\textit{blue} \vert \textit{the sky is}) = \dfrac{C(\textit{the sky is blue})}{C(\textit{the sky is})} $$
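
As a minimal sketch of this counting estimate, the snippet below uses a tiny made-up corpus. The sentences and the helper names `count_sequence` and `relative_frequency` are illustrative assumptions, not part of the dataset used later.

```python
def count_sequence(tokens, sequence):
    """Count how often `sequence` occurs as a contiguous run in `tokens`."""
    n = len(sequence)
    return sum(
        1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == sequence
    )


def relative_frequency(corpus, context, word):
    """Estimate P(word | context) as C(context + word) / C(context)."""
    context_count = 0
    continuation_count = 0
    for sentence in corpus:
        tokens = sentence.split()
        context_count += count_sequence(tokens, context)
        continuation_count += count_sequence(tokens, context + [word])
    if context_count == 0:
        return 0.0  # the context never occurs, so there is nothing to estimate from
    return continuation_count / context_count


# Toy corpus, invented purely for illustration.
toy_corpus = [
    "the sky is blue",
    "the sky is blue today",
    "the sky is grey",
    "today the sky is clear",
]

p_blue = relative_frequency(toy_corpus, ["the", "sky", "is"], "blue")
p_red = relative_frequency(toy_corpus, ["the", "sky", "is"], "red")
print(p_blue, p_red)
```

In this toy corpus the context *"the sky is"* occurs four times and is followed by *"blue"* twice, so the estimate for *"blue"* is 2/4.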

Although this works in some cases, it isn't without issues. First, language is not static like our corpus; it is constantly evolving. Second, the number of word permutations that form sensible sentences is far larger than any corpus. Just because we never encounter the sentence *"the sky is red"* in our reference corpus doesn't mean we should assign it a probability of 0. In addition, estimating the probability of the full sentence *"the sky is blue"* would require us to count the frequency of this sentence and divide it by the count of all 4-word sentences.

Although this naive approach isn't ideal, we can make some assumptions to simplify the task. We first transform the joint probability using the **chain rule**

$$ P(w^n_1)  = P(w_1)P(w_2\vert w_1)P(w_3 \vert w_1^2)...P(w_n \vert w_1^{n-1}) = \prod_{k=1}^n P(w_k \vert w_1^{k-1}) $$

and then make a **Markov assumption** such that:

$$ P(w_n \vert w_1^{n-1}) \approx P(w_n \vert w_{n-1}) $$

This allows us to simplify our joint probability to:

$$ P(w_1^n) \approx \prod_{k=1}^n P(w_k \vert w_{k-1}) $$
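
As a worked example, the bigram approximation applied to *"the sky is blue"* expands as follows. The start symbol $\langle s \rangle$ standing in for $w_0$ is an assumption added here so that the first factor is well defined; the derivation above does not say how $w_0$ is handled.

$$ P(\textit{the sky is blue}) \approx P(\textit{the} \vert \langle s \rangle) \, P(\textit{sky} \vert \textit{the}) \, P(\textit{is} \vert \textit{sky}) \, P(\textit{blue} \vert \textit{is}) $$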

where we can use the **MLE** (maximum likelihood estimate) to estimate the individual probabilities:

$$ P(w_k \vert w_{k-1}) = \dfrac{C(w_{k-1} w_k)}{\sum_{w_i} C(w_{k-1} w_i)} $$
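
A minimal sketch of this estimate on a toy corpus is shown below. The sentences, the function name `train_bigram_mle` and the `<s>` start marker are assumptions made for illustration only.

```python
from collections import Counter


def train_bigram_mle(sentences):
    """Return a function estimating P(word | previous) by relative frequency."""
    bigram_counts = Counter()
    context_counts = Counter()
    for tokens in sentences:
        # "<s>" is a hypothetical start marker so the first word has a context.
        padded = ["<s>"] + tokens
        for previous, word in zip(padded, padded[1:]):
            bigram_counts[(previous, word)] += 1
            # Number of bigrams starting with `previous` (the denominator).
            context_counts[previous] += 1

    def probability(word, previous):
        if context_counts[previous] == 0:
            return 0.0  # unseen context
        return bigram_counts[(previous, word)] / context_counts[previous]

    return probability


# Toy corpus, invented purely for illustration.
toy_sentences = [
    ["the", "sky", "is", "blue"],
    ["the", "sky", "is", "grey"],
    ["the", "sea", "is", "blue"],
]
prob = train_bigram_mle(toy_sentences)

print(prob("sky", "the"))   # 2 of the 3 bigrams starting with "the"
print(prob("blue", "is"))   # 2 of the 3 bigrams starting with "is"
```

Any bigram that never occurs in the training sentences receives probability 0 under this estimate.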


Note that in our **Markov assumption** above, we assume the next word depends only on the previous word. This is referred to as a bigram model. In practice, we could also predict words based on multiple preceding words (trigrams, 4-grams and so on) to capture more contextual information.
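
To make the role of the context length concrete, here is a small sketch that extracts n-grams of any order from a token list. The helper name `ngrams` is an assumption for illustration.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token tuples in `tokens`."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = ["the", "sky", "is", "blue"]
bigrams = ngrams(tokens, 2)   # each word paired with one preceding word
trigrams = ngrams(tokens, 3)  # each word paired with two preceding words
print(bigrams)
print(trigrams)
```

Larger values of `n` capture more context, but each individual n-gram is seen less often in a corpus of a given size.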

## A Bigram Example 

The data I'll use for this example originates from [here](http://help.sentiment140.com/for-students/)

It consists of scraped tweets, with emoticons used to determine the sentiment labels. The website describes the structure as follows:

`
    The data is a CSV with emoticons removed. Data file format has 6 fields:
    0 - the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)
    1 - the id of the tweet (2087)
    2 - the date of the tweet (Sat May 16 23:58:44 UTC 2009)
    3 - the query (lyx). If there is no query, then this value is NO_QUERY.
    4 - the user that tweeted (robotickilldozr)
    5 - the text of the tweet (Lyx is cool)
`

This will provide an interesting dataset with many issues seen in the real world.

In [33]:
import os
from pathlib import Path
import pandas as pd
from pprint import pprint
import re
import spacy

nlp = spacy.load('en')

In [8]:
CWD = Path(os.getcwd())
TRAIN_CSV = CWD / 'training.1600000.processed.noemoticon.csv'
TEST_CSV = CWD / 'testdata.manual.2009.06.14.csv'

In [21]:
columns = ['polarity', 'id', 'date','query','user','text']

train_df = pd.read_csv(TRAIN_CSV, encoding='latin-1', header=None)
test_df = pd.read_csv(TEST_CSV, encoding = 'latin-1', header=None)
train_df.columns = columns
test_df.columns = columns

A practical issue which immediately arises when trying to load this data is that it is not `utf-8` encoded.

You can try a bash command like
```
file -i *
``` 
It will give you a best guess. Otherwise, simply trying `ascii` and `latin-1` is reasonable, especially for English text. If you want to read more about this problem, which arises all too often in the wild, read [here](https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/) or take a look at Wikipedia.
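
The same trial-and-error idea can be sketched in Python. The helper `guess_encoding` and its default candidate list are assumptions for illustration. Note that `latin-1` can decode any byte sequence, so it only works as a last resort and a successful decode does not prove the guess is right.

```python
def guess_encoding(raw_bytes, candidates=("ascii", "utf-8", "latin-1")):
    """Return the first candidate encoding that decodes `raw_bytes` without error."""
    for encoding in candidates:
        try:
            raw_bytes.decode(encoding)
        except UnicodeDecodeError:
            continue  # try the next, more permissive candidate
        return encoding
    return None


# "caf\u00e9" encoded as latin-1 gives b'caf\xe9', which is neither valid
# ASCII nor valid UTF-8.
sample = "caf\u00e9".encode("latin-1")
print(guess_encoding(sample))
```

For a large file it is usually enough to run this on the first few thousand bytes read in binary mode.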

In [24]:
train_df

Unnamed: 0,polarity,id,date,query,user,text
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."
...,...,...,...,...,...,...
1599995,4,2193601966,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,AmandaMarie1028,Just woke up. Having no school is the best fee...
1599996,4,2193601969,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,TheWDBoards,TheWDB.com - Very cool to hear old Walt interv...
1599997,4,2193601991,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,bpbabe,Are you ready for your MoJo Makeover? Ask me f...
1599998,4,2193602064,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,tinydiamondz,Happy 38th Birthday to my boo of alll time!!! ...


As mentioned in chapter 2, some of the questions we're now confronted with are the following:

- how do we tokenize the data
- is punctuation a token
    - should repeated punctuation be normalized (!!! -> !)
- should we lowercase everything
- should we replace/standardize
    - usernames
    - hashtags
    - numbers
    - urls

Depending on the downstream task, some of these steps might be needed. For thoroughness we'll implement all of them, so that the data has fewer unique tokens.
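
As a rough sketch of what these normalization steps could look like with plain regular expressions, consider the snippet below. It does not tokenize; it only lowercases and substitutes. The placeholder tokens (`<url>`, `<user>`, `<hashtag>`, `<number>`) and the function name `normalize` are assumptions chosen for this example.

```python
import re

# Placeholder tokens such as "<url>" are hypothetical names chosen for this sketch.
URL_RE = re.compile(r"https?://\S+")
USER_RE = re.compile(r"(?<!\w)@\w+")
HASHTAG_RE = re.compile(r"(?<!\w)#\w+")
NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")
REPEATED_PUNCT_RE = re.compile(r"([!?.])\1+")


def normalize(text):
    """Lowercase `text` and replace URLs, usernames, hashtags and numbers."""
    text = text.lower()
    text = URL_RE.sub("<url>", text)  # before numbers, since URLs may contain digits
    text = USER_RE.sub("<user>", text)
    text = HASHTAG_RE.sub("<hashtag>", text)
    text = NUMBER_RE.sub("<number>", text)
    text = REPEATED_PUNCT_RE.sub(r"\1", text)  # "!!!" -> "!"
    return text


example = "@Kenichan Managed to save 50% at http://twitpic.com/2y1zl #FML!!!"
normalized = normalize(example)
print(normalized)
```

The order of the substitutions matters: URLs, usernames and hashtags are replaced before numbers so that digits inside them are not rewritten separately.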

First let's have a look at usernames and hashtags since I'm not 100% familiar with how they can be written.

In [28]:
users = []
tags = []
for text in train_df.text.values:
    rough_tokens = text.split(' ')
    for token in rough_tokens:
        if '@' in token:
            users.append(token)
        if '#' in token:
            tags.append(token)

In [37]:
users[0:10],tags[0:10]

(['@switchfoot',
  '@Kenichan',
  '@nationwideclass',
  '@Kwesidei',
  '@LOLTrish',
  '@Tatiana_K',
  '@twittera',
  '@caregiving',
  '@octolinz16',
  '@smarrison'],
 ['#itm',
  '#therapyfail',
  '#fb',
  '#TTSC?',
  '#24',
  '#gayforpeavy',
  '#FML',
  '#3',
  '#camerafail',
  '#'])
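
The sample shows that the naive `'#' in token` check also picks up tokens with trailing punctuation (`#TTSC?`) and bare `#` symbols. One way to separate clean tokens from irregular ones is sketched below on a hand-copied subset of the output above. The helper name `split_by_shape` and the definition of "clean" are assumptions.

```python
import re

# A tag or mention counts as "clean" here if it is the marker followed only by
# word characters (letters, digits, underscore).
CLEAN_TOKEN_RE = re.compile(r"[@#]\w+")


def split_by_shape(tokens):
    """Split tokens into those fully matching the clean pattern and the rest."""
    clean = [t for t in tokens if CLEAN_TOKEN_RE.fullmatch(t)]
    irregular = [t for t in tokens if not CLEAN_TOKEN_RE.fullmatch(t)]
    return clean, irregular


sample_tags = ["#itm", "#therapyfail", "#TTSC?", "#24", "#"]
clean, irregular = split_by_shape(sample_tags)
print(clean)
print(irregular)
```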

Now we can write a few regexes to help us identify these tokens, pipe everything through `spaCy`, and rely on some of the tools it provides.

In a strictly Python-based approach we could use the `re` library. However, in this case we'll use some built-in functionality from `spaCy`, which provides neater code and makes it reusable down the road.

Thus we'll refer to the [docs](https://spacy.io/usage/rule-based-matching#regex) and build a couple of regex patterns. Note that these regex patterns need to work on the already tokenized text.

In [86]:
from spacy.matcher import Matcher
from spacy.tokens import Doc, Span, Token

nlp = spacy.load('en', disable=['ner','parser','tagger'])

In [114]:
class hashtag_matcher(object):
    
    def __init__(self, nlp, label='HASHTAG'):
        self.label = nlp.vocab.strings[label]  # get entity label ID
        self.matcher = Matcher(nlp.vocab)
        self.matcher.add("HASHTAG", None, [{"ORTH": "#"}, {"IS_ASCII": True}])

        # Register attribute on the Token. We'll be overwriting this based on
        # the matches, so we're only setting a default value, not a getter.
        Token.set_extension("is_hashtag", default=False, force=True)

        # Register attributes on Doc and Span via a getter that checks if one of
        # the contained tokens is set to is_hashtag == True.
        Doc.set_extension("has_hashtag", getter=self.has_hashtag, force=True)
        Span.set_extension("has_hashtag", getter=self.has_hashtag, force=True)

    def __call__(self, doc):
        matches = self.matcher(doc)
        hashtags = []
        for match_id, start, end in matches:
            if doc.vocab.strings[match_id] == "HASHTAG":
                hashtags.append(doc[start:end])
        # Merge each "#" + word span into a single token and flag it.
        with doc.retokenize() as retokenizer:
            for span in hashtags:
                retokenizer.merge(span)
                for token in span:
                    token._.is_hashtag = True
        return doc  # don't forget to return the Doc!

    def has_hashtag(self, tokens):
        """Getter for Doc and Span attributes. Returns True if one of the tokens
        is a hashtag. Since the getter is only called when we access the
        attribute, we can refer to the Token's 'is_hashtag' attribute here,
        which is already set in the processing step."""
        return any(t._.get("is_hashtag") for t in tokens)

In [115]:
nlp.remove_pipe('hashtag_matcher')

('hashtag_matcher', <__main__.hashtag_matcher at 0x7f1e5b9cf370>)

In [116]:
tag_matcher = hashtag_matcher(nlp)

In [117]:
nlp.add_pipe(tag_matcher, last=True)

In [118]:
test_doc = 'here is a #tweet'
for token in nlp(test_doc):
    print(token, token._.is_hashtag)

here False
is False
a False
#tweet True


In [106]:
nlp.pipe_names

['hashtag_matcher']

In [108]:
list(nlp(test_doc))

TypeError: 'hashtag_matcher' object is not iterable

In [79]:

user_pattern = r'^@\w+$'
tag_pattern = r'^#\w+$'

In [54]:
user_pattern = re.compile(r'\s(@\w+)\s')
tag_pattern = re.compile(r'[\s^](#\w+)\s')

In [76]:
tag_pattern = re.compile(r'[\s^](#\w+)[\s$]')
test = '#asbc #asdf hi there #as'
tag_pattern.sub(' <USER> ', test, count=3)

'#asbc <USER> hi there #as'

In [71]:
tag_pattern.match(test)

<re.Match object; span=(0, 7), match=' #asbc '>

In [56]:
test = ' #asbc hi there'
user_pattern.sub('<USER>', test)

' #asbc hi there'
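
One reason the patterns above miss tokens at the start or end of a string is that `^` and `$` are literal characters inside a character class such as `[\s^]`, not anchors. A possible alternative is sketched here with lookarounds, so that the surrounding whitespace is checked but not consumed. The `<HASHTAG>` placeholder is an assumed name.

```python
import re

# (?<!\S) : not preceded by a non-space character (start of string or whitespace)
# (?!\S)  : not followed by a non-space character (end of string or whitespace)
user_pattern = re.compile(r"(?<!\S)@\w+(?!\S)")
tag_pattern = re.compile(r"(?<!\S)#\w+(?!\S)")

test = "#asbc #asdf hi there #as"
replaced_tags = tag_pattern.sub("<HASHTAG>", test)
replaced_users = user_pattern.sub("<USER>", "@switchfoot hello @Kenichan")
print(replaced_tags)
print(replaced_users)
```

Because the lookarounds consume no characters, adjacent tags separated by a single space are both matched.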

In [58]:
tag_pattern.sub??

In [25]:
train_df.text.values

array(["@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer.  You shoulda got David Carr of Third Day to do it. ;D",
       "is upset that he can't update his Facebook by texting it... and might cry as a result  School today also. Blah!",
       '@Kenichan I dived many times for the ball. Managed to save 50%  The rest go out of bounds',
       ..., 'Are you ready for your MoJo Makeover? Ask me for details ',
       'Happy 38th Birthday to my boo of alll time!!! Tupac Amaru Shakur ',
       'happy #charitytuesday @theNSPCC @SparksCharity @SpeakingUpH4H '],
      dtype=object)

## Additional Comments:

If you don't want to go through the trouble of counting n-grams yourself, a few resources built from very large corpora are available here:

- [google n-grams](https://storage.googleapis.com/books/ngrams/books/datasetsv2.html)
- [5-grams](https://catalog.ldc.upenn.edu/LDC2006T13)
- [historical british 1,2,3-grams](https://data.bris.ac.uk/data/dataset/dobuvuu00mh51q773bo8ybkdz)
- [yahoo n-grams](https://webscope.sandbox.yahoo.com/catalog.php?datatype=l&guccounter=1)

Additional general NLP data sources can also be found [here](https://github.com/niderhoff/nlp-datasets), although some of the links are broken.