# Language Models

---


### Background and Theory:
#### * Categories and applications of NLP
#### * What is a Language model?
#### * See a Rule based and a Statistically based model
#### * Problems in Language Representation

### Applications for this week
#### * Spacy Basic - solving Preprocessing
#### * Spacy Advanced - Introducing word2vec

---

## Categories and applications of NLP

----

## What is a Language Model?
#### Either rule based or statistically based

---

### Rule based:
#### Codify your expertise on what constitutes 'correct' and 'incorrect' langauge
* Good for specific & limited examples
* Lots of customisation
* Linguists love it!
* e.g. VADER, Spacy

### Rule-based example:
* A sentiment analysis function using simple sentiment counting

In [37]:
import re
import requests
from bs4 import BeautifulSoup as soup

In [38]:
keywords = {}

In [39]:
#scrape positive words
positive = requests.get('https://www.thesaurus.com/browse/good')
text = soup(positive.text, parser='html.parser')
synonyms = text.find(attrs={"class":"css-1ytlws2 et6tpn80"})
pos_words = [a.text for a in synonyms.find_all('a')]
keywords['positive'] = pos_words

In [40]:
#scrape negative words
negative = requests.get('https://www.thesaurus.com/browse/bad')
text = soup(negative.text, parser='html.parser')
synonyms = text.find(attrs={"class":"css-1ytlws2 et6tpn80"})
neg_words = [a.text for a in synonyms.find_all('a')]
keywords['negative'] = neg_words

In [41]:
keywords['positive']

['acceptable',
 'bad',
 'excellent',
 'exceptional',
 'favorable',
 'great',
 'marvelous',
 'positive',
 'satisfactory',
 'satisfying',
 'superb',
 'valuable',
 'wonderful',
 'ace',
 'boss',
 'bully',
 'capital',
 'choice',
 'crack',
 'nice',
 'pleasing',
 'prime',
 'rad',
 'sound',
 'spanking',
 'sterling',
 'super',
 'superior',
 'welcome',
 'worthy',
 'admirable',
 'agreeable',
 'commendable',
 'congenial',
 'deluxe',
 'first-class',
 'first-rate',
 'gnarly',
 'gratifying',
 'honorable',
 'neat',
 'precious',
 'recherché',
 'reputable',
 'select',
 'shipshape',
 'splendid',
 'stupendous',
 'super-eminent',
 'super-excellent',
 'tip-top',
 'up to snuff']

In [42]:
def positive_or_negative(text, keywords):
        """calculate the sentiment of a string based on word count of positive and negative terms

        params: text - the string to be assessed 
                keywords - a dictionary of keywords

        returns: classification either positive or negative"""
        
        try:
            text = re.findall('(?u)\\b\\w\\w+\\b',text.lower())
            positives = [pos for pos in text if pos in keywords['positive']]
            negatives = [neg for neg in text if neg in keywords['negative']]
            return 'positive' if len(positives) >= len(negatives) else 'negative' 
        except:
            return 'Sorry something went wrong'

In [43]:
positive_or_negative(input(), keywords)

Brexit is not a good idea


'positive'

In [44]:
positive_or_negative(input(), keywords)

Brexit is an atrocious idea


'negative'


<span style="background-color:orange">**-> Rule based systems are easy to build, but hard to get good results from**</span>
- double meanings break systems
- humour and sarcasm breaks system

* Semantics - I go home vs I go house - MEANING
* Syntax - I go home vs I goes home - GRAMMER
* Pragmatics - I go home vs I go home cry myself to sleep because I am depressed - CONTENT - Grices maxims

### Statistically-based: 
#### Build a model which tries to assess whether language is 'likely' or 'unlikely' regarding semantics, syntax and pragmatics

* Good for less well-defined uses
* Harder to customise 
* Computer Scientists love it!
* e.g. Spacy, HuggingFace

$P(W) = P(w_1, w_2, ..., w_n)$ # seuqence classification 

or

$P(w_{t+1} | w_{t-1+n}, ..., w_{t})$ # Sequence generation

### Statistically-based example:
* A bigram Markov model which generates new language (Sequence Generation)
* Downloads text data and calculates probabilities based on that
* Use the probability distributions to calculate new information

In [48]:
from sklearn.datasets import fetch_20newsgroups
import pandas as pd
import numpy as np

In [46]:
def generate_word_distribution(data):
    """create a probability distribution over all words
    
        params: data - a Bunch data object from sklearn
        returns: Word probability distribution
    """

    text = data['data']
    all_data = ' '.join([' '.join(re.findall('(?u)\\b\\w\\w+\\b',article.lower())) for article in text]).split()
    words = pd.DataFrame({'words':all_data})
    words['next_words'] = words['words'].shift(-1)
    word_distribution = words.groupby('words')['next_words'].value_counts(normalize=True)
    
    return word_distribution

In [47]:
def text_generation(seed, length, distribution):
    """seed a distribution with a seed word, and ask it to make more words
        
        params: seed - A seed word, 
                length -Length of the generated sentence
                distribution - A word probability distribution
                
        returns: generated sentence
    """
    
    try:
        seed = seed.lower()
        for i in range(length):
             seed += ' ' + np.random.choice(distribution[seed.split()[-1]].index, p=distribution[seed.split()[-1]].values)
        return seed
    
    except:
        print('Oops! Try another seed')
        return None

#### Download text data

In [52]:
data = fetch_20newsgroups(remove=['header', 'footer'])

#### Calculate the bigram probabilities

In [54]:
distribution = generate_word_distribution(data)

In [55]:
distribution

words     next_words
00        00            0.112125
          gmt           0.035854
          1993          0.021512
          01            0.018905
          am            0.014342
                          ...   
érale     et            1.000000
ête       renvers       1.000000
íålittin  no            1.000000
ñaustin   jacobs        1.000000
ýé        am            1.000000
Name: next_words, Length: 1054743, dtype: float64

#### Generate some new sentences

In [60]:
text_generation('While', 50, distribution)

'while driving course in the penguins the israeli patrols every time year although it was about flopticals could use 32 bits long exposure most of speech research council of china or cipriani att com coffee as saying shouldn think that make size up in assuming of the university of comfort and'

#### Staitiscal based models:
- less precise - harder to control
- they learn themselves
- require good initialization
- require lots of data to work well
- THE BEST STATISTICAL MODELS ARE DEEP MODELS - NEURAL NETWORKS

----

## Problems in Language Representation

#### * Preprocessing - tokenization, stop words, lowercase, lemmatization/stemming
#### * Curse of dimensionality
#### * Semantic similarity
#### * Word order
#### * Word sense disambiguation
#### * Grammar

---

In [8]:
'I was an untokenized string'.split() # tokenization

['I', 'was', 'an', 'untokenized', 'string']

In [9]:
'and if but on a' # stop words

'and if but on a'

In [11]:
'Apple is a fruit', 'My fruit is an apple' # always lowercase

('Apple is a fruit', 'My fruit is an apple')

In [12]:
'To be or not to be that is the question whether tis nobler to suffer the sling and arrows'
'is' > 'be' 'am' > 'be'# lemmatization
'slings' > 'sling' # stemming - find the root (lexeme) of the word

True

# Applications for this week:

## Spacy Basic - solving Preprocessing

* `pip install spacy`
* `python -m spacy download en_core_web_md` - if this doesn't load, try
* `python -m spacy download en_core_web_sm`

In [5]:
import spacy

In [6]:
nlp = spacy.load('en_core_web_md')

In [24]:
result = nlp('I can transform these strings')
result

I can transform this string

In [25]:
type(result)

spacy.tokens.doc.Doc

In [26]:
result[-1]

string

In [27]:
type(result[-1])

spacy.tokens.token.Token

In [32]:
result[-1].pos_, result[-1].is_stop, result[-1].lemma_

('NOUN', False, 'string')

In [36]:
def clean_text(corpus, model):
    """preprocess a string (tokens, stopwords, lowercase, lemma & stemming) returns the cleaned result
        params: review - a string
                model - a spacy model
                
        returns: list of cleaned strings
    """
    
    new_doc = []
    doc = model(corpus)
    for word in doc:
        if not word.is_stop and word.is_alpha:
            new_doc.append(word.lemma_.lower())
            
    return new_doc

In [35]:
clean_text('I can transform these strings', nlp)

['transform', 'string']

## Semantic Similarity - None in BOW!!

---

### Spacy Advanced - introducing word2vec

#### Solves: Semantic similarity, curse of dimensionality

In [4]:
from sklearn.metrics.pairwise import cosine_similarity

In [5]:
def vectorize_my_word(word, model):
    try:
        return model.vocab[word].vector.reshape(-1,1).T
    except:
        print("Doesn't look like this word can be found")
        return None