# Vocabulary
- **Task**: implement different strategies to tokenize and normalize text in order to weight token relevance.
- **Input**: raw text
- **Output**: a list of tokens for each text

### Main steps
0. Language detection
1. Tokenization
2. Case, punctuation, stopwords
3. Normalization

In [1]:
import json
import pandas as pd

In [2]:
dataset_file = '../data/wiki_dataset.json'
with open(dataset_file, 'r') as infile:
    dataset = json.load(infile)
docs = dataset['docs']
queries = dataset['queries']

In [3]:
T = docs[10]

In [4]:
T

'Owen Underhill (born January 26, 1954) is a Canadian composer, flutist and conductor based in Vancouver. He is currently a professor of music at Simon Fraser University. He has been an active contributor to the new music scene on the West Coast, as a flutist, as co-music director of Western Front New Music (1982-3), as the artistic director (1987–2000) of the Vancouver New Music Society, and as a conductor in Magnetic Band and the Turning Point Ensemble, for which he is also currently the Artistic Co-Director.'

## Language detection
Many options are available; see, for example, [langdetect](https://pypi.org/project/langdetect).

In [5]:
from langdetect import detect, detect_langs

In [6]:
L = "Testo che mescola English words con testo italiano."

This example shows how models trained mainly on English may be unbalanced: the sentence is mostly Italian, yet it is classified as English with near-certainty.

In [9]:
print(detect_langs(L))
print(detect(L))

[en:0.9999981301583485]
en


Language detection is useful when dealing with multilingual corpora, because some of the vocabulary-building operations are language dependent (e.g., lemmatization).
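As a minimal sketch of why the detected language matters, the snippet below routes tokens to a language-specific stopword list. The language code is assumed to come from a detector such as `detect`; the tiny stopword sets and the helper `remove_stopwords` are hypothetical and serve only as an illustration.

```python
# Hypothetical, deliberately tiny stopword lists keyed by language code.
STOPWORDS_BY_LANG = {
    'en': {'the', 'with', 'and', 'of'},
    'it': {'che', 'con', 'il', 'di'},
}

def remove_stopwords(tokens, lang):
    """Drop stopwords using the list for `lang`; unknown languages pass through."""
    stop = STOPWORDS_BY_LANG.get(lang, set())
    return [t for t in tokens if t.lower() not in stop]

mixed = ['Testo', 'che', 'mescola', 'English', 'words', 'con', 'testo', 'italiano']
print(remove_stopwords(mixed, 'it'))  # the Italian list removes 'che' and 'con'
print(remove_stopwords(mixed, 'en'))  # the English list leaves them in place
```

With a wrong language label, as in the mixed-language example, the Italian function words would survive stopword removal.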

## Tokenization

In [10]:
from nltk.tokenize import RegexpTokenizer

In [11]:
pattern = r'\w+|\$[\d\.]+|\S+'  # raw string, so the backslashes reach the regex engine unchanged
tokenizer = RegexpTokenizer(pattern)

In [12]:
text = docs[10]
tokens = tokenizer.tokenize(text)

In [13]:
print(text)

Owen Underhill (born January 26, 1954) is a Canadian composer, flutist and conductor based in Vancouver. He is currently a professor of music at Simon Fraser University. He has been an active contributor to the new music scene on the West Coast, as a flutist, as co-music director of Western Front New Music (1982-3), as the artistic director (1987–2000) of the Vancouver New Music Society, and as a conductor in Magnetic Band and the Turning Point Ensemble, for which he is also currently the Artistic Co-Director.


In [14]:
print(tokens)

['Owen', 'Underhill', '(born', 'January', '26', ',', '1954', ')', 'is', 'a', 'Canadian', 'composer', ',', 'flutist', 'and', 'conductor', 'based', 'in', 'Vancouver', '.', 'He', 'is', 'currently', 'a', 'professor', 'of', 'music', 'at', 'Simon', 'Fraser', 'University', '.', 'He', 'has', 'been', 'an', 'active', 'contributor', 'to', 'the', 'new', 'music', 'scene', 'on', 'the', 'West', 'Coast', ',', 'as', 'a', 'flutist', ',', 'as', 'co', '-music', 'director', 'of', 'Western', 'Front', 'New', 'Music', '(1982-3),', 'as', 'the', 'artistic', 'director', '(1987–2000)', 'of', 'the', 'Vancouver', 'New', 'Music', 'Society', ',', 'and', 'as', 'a', 'conductor', 'in', 'Magnetic', 'Band', 'and', 'the', 'Turning', 'Point', 'Ensemble', ',', 'for', 'which', 'he', 'is', 'also', 'currently', 'the', 'Artistic', 'Co', '-Director.']
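The pattern tries its alternatives from left to right at each position: a run of word characters, then a dollar amount, then any run of non-whitespace characters. This explains tokens such as `'(born'` and `'-music'`, which are captured whole by the last alternative. With its default settings `RegexpTokenizer` returns the matches of the pattern, so the behaviour can be reproduced with the standard `re` module; the snippet below is a sketch of the mechanism on two short strings, the second of which is made up for illustration.

```python
import re

pattern = r'\w+|\$[\d\.]+|\S+'

# '(' is not a word character, so \S+ swallows '(born' as a single token
first = re.findall(pattern, 'Owen Underhill (born January 26, 1954)')
print(first)

# \w+ stops at the hyphen; the dollar amount is matched by the second alternative
second = re.findall(pattern, 'co-music costs $3.50')
print(second)
```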


**Note**: when dealing with long texts, tokenization should be performed sentence by sentence, using <code>nltk.tokenize.sent_tokenize</code> to split the text before tokenization and normalization.
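A minimal sketch of the sentence-by-sentence approach follows. To keep it runnable without downloading NLTK data, a naive regex splitter (`split_sentences`, a hypothetical helper) stands in for `sent_tokenize`, and `re.findall` stands in for the tokenizer; in practice the NLTK functions would be used instead.

```python
import re

def split_sentences(text):
    # Naive stand-in for sent_tokenize: split after '.', '!' or '?' followed by whitespace
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

def tokenize_by_sentence(text, pattern=r'\w+|\$[\d\.]+|\S+'):
    # One token list per sentence instead of a single flat list
    return [re.findall(pattern, s) for s in split_sentences(text)]

sample = 'He is a professor of music. He is also a conductor.'
per_sentence = tokenize_by_sentence(sample)
for sent_tokens in per_sentence:
    print(sent_tokens)
```

The naive splitter breaks on abbreviations such as "Dr. Smith", which is one reason to prefer a trained sentence tokenizer.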

## Case, punctuation and stopwords removal
The importance of each step depends on the size of the corpus and on its sparseness.
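One rough way to see the effect of each step is to count the distinct tokens that remain after it. The sketch below does this on a hypothetical toy token list with a two-word stopword set; real corpora and the NLTK stopword list would give different numbers.

```python
from string import punctuation

# Hypothetical toy data, for illustration only
toy_tokens = ['The', 'the', 'music', 'Music', ',', 'of', 'New', 'new']
toy_stopwords = {'the', 'of'}

steps = [
    ('lower', lambda data: [x.lower() for x in data]),
    ('punctuation', lambda data: [x for x in data if x not in punctuation]),
    ('stopwords', lambda data: [x for x in data if x not in toy_stopwords]),
]

# Track the vocabulary size (number of distinct tokens) after each step
sizes = [('raw', len(set(toy_tokens)))]
toy_current = toy_tokens
for name, f in steps:
    toy_current = f(toy_current)
    sizes.append((name, len(set(toy_current))))

print(sizes)
```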

In [15]:
from string import punctuation
from nltk.corpus import stopwords

In [16]:
# Use a distinct name so that the imported `stopwords` corpus reader is not shadowed
stop_words = set(stopwords.words('english'))

In [19]:
print(list(stop_words)[:10])
print(punctuation)

['its', 'theirs', 'didn', 'herself', 'who', 'weren', 'her', 'a', 'was', 'below']
!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [20]:
lower_tokens = lambda data: [x.lower() for x in data]
punct_tokens = lambda data: [x for x in data if x not in punctuation]
stop_tokens = lambda data: [x for x in data if x not in stop_words]

In [21]:
pipeline = [('lower', lower_tokens), ('punctuation', punct_tokens), ('stopwords', stop_tokens)]
current = tokens
print(T)
print(current, '\n')
for operation, f in pipeline:
    print(operation)
    current = f(current)
    print(current, '\n')

Owen Underhill (born January 26, 1954) is a Canadian composer, flutist and conductor based in Vancouver. He is currently a professor of music at Simon Fraser University. He has been an active contributor to the new music scene on the West Coast, as a flutist, as co-music director of Western Front New Music (1982-3), as the artistic director (1987–2000) of the Vancouver New Music Society, and as a conductor in Magnetic Band and the Turning Point Ensemble, for which he is also currently the Artistic Co-Director.
['Owen', 'Underhill', '(born', 'January', '26', ',', '1954', ')', 'is', 'a', 'Canadian', 'composer', ',', 'flutist', 'and', 'conductor', 'based', 'in', 'Vancouver', '.', 'He', 'is', 'currently', 'a', 'professor', 'of', 'music', 'at', 'Simon', 'Fraser', 'University', '.', 'He', 'has', 'been', 'an', 'active', 'contributor', 'to', 'the', 'new', 'music', 'scene', 'on', 'the', 'West', 'Coast', ',', 'as', 'a', 'flutist', ',', 'as', 'co', '-music', 'director', 'of', 'Western', 'Front', 
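The `punct_tokens` filter tests `x not in punctuation`, which is a substring test against the punctuation string. Tokens such as `','` are dropped, but tokens that merely contain punctuation, such as `'(born'`, survive. A possible refinement, sketched below with the hypothetical helper `strip_punct`, strips punctuation characters from both ends of each token and discards the tokens that become empty.

```python
from string import punctuation

def strip_punct(tokens):
    # Remove leading/trailing ASCII punctuation, then drop tokens left empty
    stripped = (t.strip(punctuation) for t in tokens)
    return [t for t in stripped if t]

cleaned = strip_punct(['(born', ',', '(1982-3),', '-director.', 'co'])
print(cleaned)
```

Punctuation inside a token (the hyphen in `1982-3`) is kept, and non-ASCII characters such as the en dash are not part of `string.punctuation`.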

## Normalization

### Stemming

In [22]:
from nltk.stem.snowball import SnowballStemmer

In [23]:
stemmer = SnowballStemmer('english')

In [24]:
print([stemmer.stem(x) for x in current])

['owen', 'underhil', '(born', 'januari', '26', '1954', 'canadian', 'compos', 'flutist', 'conductor', 'base', 'vancouv', 'current', 'professor', 'music', 'simon', 'fraser', 'univers', 'activ', 'contributor', 'new', 'music', 'scene', 'west', 'coast', 'flutist', 'co', '-music', 'director', 'western', 'front', 'new', 'music', '(1982-3),', 'artist', 'director', '(1987–2000)', 'vancouv', 'new', 'music', 'societi', 'conductor', 'magnet', 'band', 'turn', 'point', 'ensembl', 'also', 'current', 'artist', 'co', '-director.']


### Lemmatization with WordNet

In [26]:
from nltk.corpus import wordnet as wn

In [27]:
syns = wn.synsets('group')

#### Problem 1: word sense disambiguation

In [28]:
for syn in syns:
    print(syn, syn.definition())

Synset('group.n.01') any number of entities (members) considered as a unit
Synset('group.n.02') (chemistry) two or more atoms bound together as a single unit and forming part of a molecule
Synset('group.n.03') a set that is closed, associative, has an identity element and every element has an inverse
Synset('group.v.01') arrange into a group or groups
Synset('group.v.02') form a group or group together


#### Problem 2: choice of lemma

In [29]:
for syn in syns:
    print(syn, [lemma.name() for lemma in syn.lemmas()])

Synset('group.n.01') ['group', 'grouping']
Synset('group.n.02') ['group', 'radical', 'chemical_group']
Synset('group.n.03') ['group', 'mathematical_group']
Synset('group.v.01') ['group']
Synset('group.v.02') ['group', 'aggroup']


### Naive strategy

In [30]:
def wnlemma(word):
    # Naive choice: first lemma of the first synset; fall back to the word itself
    try:
        return wn.synsets(word)[0].lemmas()[0].name()
    except IndexError:
        return word

In [31]:
lemma_tokens = lambda data: [wnlemma(x) for x in data]

In [32]:
print(current)
print(lemma_tokens(current))

['owen', 'underhill', '(born', 'january', '26', '1954', 'canadian', 'composer', 'flutist', 'conductor', 'based', 'vancouver', 'currently', 'professor', 'music', 'simon', 'fraser', 'university', 'active', 'contributor', 'new', 'music', 'scene', 'west', 'coast', 'flutist', 'co', '-music', 'director', 'western', 'front', 'new', 'music', '(1982-3),', 'artistic', 'director', '(1987–2000)', 'vancouver', 'new', 'music', 'society', 'conductor', 'magnetic', 'band', 'turning', 'point', 'ensemble', 'also', 'currently', 'artistic', 'co', '-director.']
['Owen', 'underhill', '(born', 'January', 'twenty-six', '1954', 'Canadian', 'composer', 'flutist', 'conductor', 'establish', 'Vancouver', 'presently', 'professor', 'music', 'Simon', 'fraser', 'university', 'active_agent', 'subscriber', 'new', 'music', 'scene', 'West', 'seashore', 'flutist', 'carbon_monoxide', '-music', 'director', 'Western', 'front', 'new', 'music', '(1982-3),', 'artistic', 'director', '(1987–2000)', 'Vancouver', 'new', 'music', 'soc

### Exercise: find a better strategy for word sense disambiguation using WordNet
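One possible direction, offered as a hint rather than as the intended solution, is a simplified Lesk-style heuristic: choose the sense whose definition shares the most words with the context. To stay self-contained, the sketch below hard-codes the three noun glosses of "group" printed earlier instead of querying WordNet; the function name `simplified_lesk` and the example contexts are assumptions. NLTK also provides `nltk.wsd.lesk`, which works directly on synsets.

```python
import re

# Glosses copied from the WordNet output shown earlier for the noun senses of 'group'
GLOSSES = {
    'group.n.01': 'any number of entities (members) considered as a unit',
    'group.n.02': '(chemistry) two or more atoms bound together as a single unit '
                  'and forming part of a molecule',
    'group.n.03': 'a set that is closed, associative, has an identity element '
                  'and every element has an inverse',
}

def words(text):
    return set(re.findall(r'\w+', text.lower()))

def simplified_lesk(context, glosses):
    # Score each sense by the number of words shared between context and gloss
    ctx = words(context)
    return max(glosses, key=lambda sense: len(ctx & words(glosses[sense])))

chem = simplified_lesk('the hydroxyl group contains atoms of oxygen and hydrogen in the molecule', GLOSSES)
math = simplified_lesk('every element of the set has an inverse', GLOSSES)
print(chem, math)
```

Function words inflate the overlap counts here; removing stopwords before scoring is an obvious next refinement.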

# Approaches based on language modeling: spaCy

In [33]:
import spacy

In [34]:
nlp = spacy.load("en_core_web_sm")

In [35]:
doc = nlp(T)

### Sentence parsing

In [36]:
for s in doc.sents:
    print(s)

Owen Underhill (born January 26, 1954) is a Canadian composer, flutist and conductor based in Vancouver.
He is currently a professor of music at Simon Fraser University.
He has been an active contributor to the new music scene on the West Coast, as a flutist, as co-music director of Western Front New Music (1982-3), as the artistic director (1987–2000) of the Vancouver New Music Society, and as a conductor in Magnetic Band and the Turning Point Ensemble, for which he is also currently the Artistic Co-Director.


## Tokenization

In [37]:
fields = ['text', 'lemma', 'pos', 'tag', 'dep', 'shape', 'alpha', 'stopwords']
tks = []
for token in list(doc.sents)[0]:
    data = [token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
            token.shape_, token.is_alpha, token.is_stop]
    tks.append(dict(zip(fields, data)))

In [38]:
df = pd.DataFrame(tks)

In [39]:
df

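|   | text | lemma | pos | tag | dep | shape | alpha | stopwords |
|---|------|-------|-----|-----|-----|-------|-------|-----------|
| 0 | Owen | Owen | PROPN | NNP | compound | Xxxx | True | False |
| 1 | Underhill | Underhill | PROPN | NNP | nsubj | Xxxxx | True | False |
| 2 | ( | ( | PUNCT | -LRB- | punct | ( | False | False |
| 3 | born | bear | VERB | VBN | acl | xxxx | True | False |
| 4 | January | January | PROPN | NNP | npadvmod | Xxxxx | True | False |
| 5 | 26 | 26 | NUM | CD | nummod | dd | False | False |
| 6 | , | , | PUNCT | , | punct | , | False | False |
| 7 | 1954 | 1954 | NUM | CD | nummod | dddd | False | False |
| 8 | ) | ) | PUNCT | -RRB- | punct | ) | False | False |
| 9 | is | be | VERB | VBZ | ROOT | xx | True | True |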


## Case, punctuation and stopwords removal

In [40]:
punct_tokens = lambda data: [x for x in data if x.pos_ not in ['PUNCT', 'SPACE']]
stop_tokens = lambda data: [x for x in data if not x.is_stop]

In [41]:
spacy_tokens = stop_tokens(punct_tokens(nlp(T)))

In [42]:
print(spacy_tokens)

[Owen, Underhill, born, January, 26, 1954, Canadian, composer, flutist, conductor, based, Vancouver, currently, professor, music, Simon, Fraser, University, active, contributor, new, music, scene, West, Coast, flutist, co, -, music, director, Western, New, Music, 1982, -, 3, artistic, director, 1987–2000, Vancouver, New, Music, Society, conductor, Magnetic, Band, Turning, Point, Ensemble, currently, Artistic, Co, -, Director]


## Normalization

In [43]:
spacy_lemma = lambda data: [x.lemma_ for x in data]

In [44]:
print(spacy_lemma(spacy_tokens))

['Owen', 'Underhill', 'bear', 'January', '26', '1954', 'canadian', 'composer', 'flutist', 'conductor', 'base', 'Vancouver', 'currently', 'professor', 'music', 'Simon', 'Fraser', 'University', 'active', 'contributor', 'new', 'music', 'scene', 'West', 'Coast', 'flutist', 'co', '-', 'music', 'director', 'western', 'New', 'Music', '1982', '-', '3', 'artistic', 'director', '1987–2000', 'Vancouver', 'New', 'Music', 'Society', 'conductor', 'Magnetic', 'Band', 'Turning', 'Point', 'ensemble', 'currently', 'Artistic', 'Co', '-', 'Director']


### A look into dependencies and entities (more on this later in the course)

In [45]:
from spacy import displacy

In [46]:
T = docs[1].strip()

In [47]:
displacy.render(nlp(T), style='ent')

In [48]:
displacy.render(nlp(T), style='dep', options={'compact': True,
                                              'collapse_phrases': True,
                                              'add_lemma': True})

In [49]:
table = {'token': [], 'token dep': [], 'head': [], 'head pos': [], 'children': [], 'ancestors': []}
for token in nlp(T):
    table['token'].append(token.text)
    table['token dep'].append(token.dep_)
    table['head'].append(token.head.text)
    table['head pos'].append(token.head.pos_)
    table['children'].append(", ".join([child.text for child in token.children]))
    table['ancestors'].append(", ".join([a.text for a in token.ancestors]))
S = pd.DataFrame(table)

In [50]:
S

|   | token | token dep | head | head pos | children | ancestors |
|---|-------|-----------|------|----------|----------|-----------|
| 0 | Elections | nsubjpass | held | VERB | to | held |
| 1 | to | prep | Elections | NOUN | Council | Elections, held |
| 2 | Rotherham | compound | Council | PROPN |  | Council, to, Elections, held |
| 3 | Metropolitan | compound | Council | PROPN |  | Council, to, Elections, held |
| 4 | Borough | compound | Council | PROPN |  | Council, to, Elections, held |
| 5 | Council | pobj | to | ADP | Rotherham, Metropolitan, Borough | to, Elections, held |
| 6 | were | auxpass | held | VERB |  | held |
| 7 | held | ROOT | held | VERB | Elections, were, on, . |  |
| 8 | on | prep | held | VERB | May | held |
| 9 | 3 | nummod | May | PROPN |  | May, on, held |
