# Lab 1
## **Text processing**

## Exercise 1
Benchmark different language-detection algorithms by computing the accuracy of each approach.
- FastText
- LangID
- langDetect

Hint: use the `iso639-lang` package to convert between language names and ISO 639 codes.

Report
- Accuracy
- Average time per example
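
The two reported metrics can be sketched as follows. This is a minimal, library-free sketch: `toy_detect` and the `NAME_TO_CODE` table are hypothetical stand-ins for a real detector and for the `iso639-lang` conversion, and the four samples are made up for illustration.

```python
import time

# Hypothetical stand-in for the name-to-code conversion (iso639-lang in the lab).
NAME_TO_CODE = {"English": "en", "Dutch": "nl", "Swedish": "sv"}

def toy_detect(text):
    """Hypothetical detector: guesses the language from a few marker words."""
    words = set(text.lower().split())
    if words & {"het", "de", "een"}:
        return "nl"
    if words & {"och", "på", "är"}:
        return "sv"
    return "en"

def benchmark(detect, samples):
    """Return (accuracy, average seconds per example) for a detector."""
    correct = 0
    start = time.perf_counter()
    for text, language in samples:
        # gold labels are full names, predictions are two-letter codes
        correct += int(detect(text) == NAME_TO_CODE[language])
    elapsed = time.perf_counter() - start
    return correct / len(samples), elapsed / len(samples)

samples = [
    ("the cat sat on the mat", "English"),
    ("het huis is groot", "Dutch"),
    ("hon bor på landet", "Swedish"),
    ("en katt sover", "Swedish"),  # no marker word, so the toy detector misses it
]
accuracy, avg_time = benchmark(toy_detect, samples)
print(f"Accuracy: {accuracy:.3f}. Avg time/sample: {avg_time * 1000:.3f} ms")
```

`time.perf_counter()` is used here because it is a monotonic, high-resolution clock, which suits short per-example timings better than wall-clock time.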

In [None]:
!wget https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P1/langid_dataset.csv

### My code...

In [2]:
# import
import pandas as pd

# read file
corpus = pd.read_csv('langid_dataset.csv')

In [3]:
# sneak peek
corpus.head()

Unnamed: 0,Text,language
0,klement gottwaldi surnukeha palsameeriti ning ...,Estonian
1,sebes joseph pereira thomas på eng the jesuit...,Swedish
2,ถนนเจริญกรุง อักษรโรมัน thanon charoen krung เ...,Thai
3,விசாகப்பட்டினம் தமிழ்ச்சங்கத்தை இந்துப் பத்திர...,Tamil
4,de spons behoort tot het geslacht haliclona en...,Dutch


In [None]:
# accuracy function and formatted-printing function

def acc(clf_fcn, corpus):
    """Count the texts whose predicted ISO 639-1 code matches the gold label."""
    correct = 0
    for text, lan in zip(corpus.Text, corpus.language):
        try:
            pred = clf_fcn(text)
        except Exception:
            # a failed prediction counts as a miss
            continue
        if isinstance(pred, tuple):  # langid.classify returns (lang, score)
            pred = pred[0]
        # dataset labels are language names; convert to the two-letter code
        correct += int(pred == Lang(lan).pt1)
    return correct

def output(method, corpus, correct, elapsed):
    print(f'{method:s} \nAccuracy: {correct/len(corpus):.3f}. Est time/sample: {elapsed/len(corpus)*1000:.3f} ms')

In [None]:
import time
#!pip install iso639-lang
from iso639 import Lang

In [None]:
# FastText (fastlangid)
from fastlangid import LID

method = 'FastText'

# fastText model
fastText_clf = LID()

# accuracy and timing
start = time.time()
correct = acc(fastText_clf.predict, corpus)
elapsed = time.time() - start

# print result
output(method, corpus, correct, elapsed)

In [None]:
# LangID
#!pip install langid
import langid

method = 'LangID'

# accuracy and timing
start = time.time()
correct = acc(langid.classify, corpus)
elapsed = time.time() - start

# print result
output(method, corpus, correct, elapsed)


In [None]:
# langdetect
#!pip install langdetect
from langdetect import detect

method = 'langdetect'

# accuracy and timing
start = time.time()
correct = acc(detect, corpus)
elapsed = time.time() - start

# print result
output(method, corpus, correct, elapsed)

## Exercise 2
For the English texts, apply word-level tokenization. What is the average number of words per sentence?
Implement word tokenization using both NLTK and spaCy, and report the results for both.
For spaCy, use the `en_core_web_sm` model.
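
The average depends on what the tokenizer treats as a token, which is one reason NLTK and spaCy can report different numbers. The sketch below uses only the standard library; both tokenizers are simplified stand-ins, not the NLTK or spaCy algorithms, and the two texts are made up for illustration.

```python
import re

def whitespace_tokens(text):
    """Split on whitespace only; punctuation stays attached to words."""
    return text.split()

def regex_tokens(text):
    """Split words and punctuation marks into separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

def average_length(tokenize, texts):
    """Average number of tokens per text."""
    return sum(len(tokenize(t)) for t in texts) / len(texts)

texts = [
    "the runners-up spot, which remains notable",
    "he [musk] wants to go to mars",
]
print(average_length(whitespace_tokens, texts))  # 6.5
print(average_length(regex_tokens, texts))       # 9.0
```

The gap comes entirely from hyphens, commas and brackets being counted as separate tokens, so it is worth stating which convention a reported average follows.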

### My code...

In [4]:
# keep only the English texts
corpus_eng = corpus.loc[corpus.language=='English']

In [5]:
# imports
#!pip install nltk
import nltk
#!pip install -U spacy
#python -m spacy download en_core_web_sm (run in terminal)
import spacy

In [45]:
# counting average number of words

# nltk
tot_words = 0
for sentence in corpus_eng.Text:
    tokens = nltk.word_tokenize(sentence)
    tot_words += len(tokens)

print(f'NLTK \nAverage number of words/sentence: {tot_words/len(corpus_eng):.2f}')

# spacy
# the tokenizer always runs and is not a pipeline component;
# 'tok2vec' is left enabled but is not needed for counting tokens
spacy_nlp = spacy.load("en_core_web_sm", disable=['parser', 'ner', 'tagger', 'attribute_ruler', 'lemmatizer'])
tot_words = 0
for sentence in corpus_eng.Text:
    doc = spacy_nlp(sentence)
    tot_words += len(doc)  # counts every token, including whitespace tokens and "-"

print(f'spaCy \nAverage number of words/sentence: {tot_words/len(corpus_eng):.2f}')



NLTK 
Average number of words/sentence: 68.75
spaCy 
Average number of words/sentence: 72.33


In [46]:
spacy_nlp.pipe_names

['tok2vec']

## Exercise 3

Dependency parsing analyzes the grammatical structure of a sentence: it identifies which words are related and the type of relationship between them.
The output of this step is a dependency tree similar to the one shown in the figure below.


Use spaCy to parse the dependency tree of a **randomly selected** sentence. You can use either an English sentence or one in your native language (if supported in [spaCy](https://spacy.io/usage/models/)). Use [displaCy](https://explosion.ai/demos/displacy) to visualize the result in the notebook.
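
To make the data structure concrete: a dependency parse assigns every token exactly one head and one relation label, which is the information spaCy exposes as `token.head` and `token.dep_`. The sketch below stores a hand-annotated parse of a made-up sentence (the annotation is an assumption for illustration, not parser output) and prints its arcs.

```python
# Each entry: (token, index of its head, relation label).
# The root points to itself, mirroring spaCy's convention.
parse = [
    ("she", 1, "nsubj"),    # subject of "reads"
    ("reads", 1, "ROOT"),   # main verb, root of the tree
    ("old", 3, "amod"),     # adjective modifying "books"
    ("books", 1, "dobj"),   # direct object of "reads"
]

def arcs(parse):
    """Return (head, relation, dependent) triples, skipping the root."""
    return [
        (parse[head][0], rel, tok)
        for i, (tok, head, rel) in enumerate(parse)
        if head != i
    ]

def root(parse):
    """Return the token that is its own head."""
    return next(tok for i, (tok, head, _) in enumerate(parse) if head == i)

for head, rel, dep in arcs(parse):
    print(f"{head} --{rel}--> {dep}")
print("root:", root(parse))
```

Because every non-root token has exactly one head, a sentence of n tokens always yields n - 1 arcs, which is what makes the structure a tree.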

In [7]:
corpus_eng

Unnamed: 0,Text,language
37,in johnson was awarded an american institute ...,English
40,bussy-saint-georges has built its identity on ...,English
76,minnesotas state parks are spread across the s...,English
90,nordahl road is a station served by north coun...,English
97,a talk by takis fotopoulos about the internati...,English
...,...,...
21829,on march empty mirrors press published epste...,English
21879,he [musk] wants to go to mars to back up human...,English
21896,overall the male is black above and white belo...,English
21897,tim reynolds born december in wiesbaden germ...,English


In [52]:
# choose random sentence
nlp = spacy.load("en_core_web_sm")
sentence = corpus_eng.sample(random_state=123).Text.iloc[0]
doc = nlp(sentence)
# render the dependency tree inline in the notebook
from spacy import displacy
displacy.render(doc, style="dep", jupyter=True)

## Exercise 4
For the same sentence selected in the previous step apply all the following steps:
1. Lemmatization: convert each word to its root form.
2. Stopword removal: remove language-specific stopwords.
3. Part of Speech Tagging: for each word in the sentence display its part-of-speech.

For each step, print the resulting list on the console.
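
One way to combine the three steps is to annotate the full sentence once and filter afterwards, so that each tag is assigned with the whole sentence as context. The sketch below illustrates that order of operations on hand-written records shaped like spaCy's token attributes (`text`, `lemma_`, `pos_`, `is_stop`); the `Token` tuple and its values are hypothetical, not model output.

```python
from typing import NamedTuple

class Token(NamedTuple):
    """Hypothetical record mirroring a few spaCy token attributes."""
    text: str
    lemma: str
    pos: str
    is_stop: bool

# Hand-annotated for illustration: "he was named as head coach"
tokens = [
    Token("he", "he", "PRON", True),
    Token("was", "be", "AUX", True),
    Token("named", "name", "VERB", False),
    Token("as", "as", "ADP", True),
    Token("head", "head", "NOUN", False),
    Token("coach", "coach", "NOUN", False),
]

lemmas = [t.lemma for t in tokens]               # 1. lemmatization
content = [t for t in tokens if not t.is_stop]   # 2. stopword removal
pos_tags = [(t.lemma, t.pos) for t in content]   # 3. POS of the kept words

print(lemmas)
print([t.lemma for t in content])
print(pos_tags)
```

Filtering on per-token flags keeps the tags tied to the original sentence, whereas re-parsing a stopword-free string gives the tagger less context to work with.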

In [71]:
# what is included in pipeline
print(f'{nlp.pipe_names}\n')

# original sentence
print(f'Original sentence: \n{sentence}')

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

Original sentence: 
wright took up a coaching role at guildford city following his retirement in  before holding similar positions at walsall and luton town he was named as head coach at everton in  and later coached the india youth team between  and  in preparation for the  afc youth championship in  he inherited syed abdul rahims india national team where wright led the side to the runners-up spot in the  asian cup which remains the most notable triumph in professional football for india


In [70]:
# 1. lemmatization
doc = nlp(sentence)
sentence_lemma = " ".join([token.lemma_ for token in doc])
print(f'After lemmatization: \n{sentence_lemma}')

After lemmatization: 
wright take up a coach role at guildford city follow his retirement in   before hold similar position at walsall and luton town he be name as head coach at everton in   and later coach the india youth team between   and   in preparation for the   afc youth championship in   he inherit syed abdul rahim india national team where wright lead the side to the runner - up spot in the   asian cup which remain the most notable triumph in professional football for india


In [69]:
# 2. stopword removal
stopwords = nlp.Defaults.stop_words
sentence_wo_sw = " ".join([w for w in sentence_lemma.split() if w not in stopwords])

print(f'After removal of stopwords: \n{sentence_wo_sw}')

After removal of stopwords: 
wright coach role guildford city follow retirement hold similar position walsall luton town head coach everton later coach india youth team preparation afc youth championship inherit syed abdul rahim india national team wright lead runner - spot asian cup remain notable triumph professional football india


In [76]:
# 3. POS

# re-tag the lemmatized, stopword-free text (tags may differ from those assigned to the full sentence)
doc = nlp(sentence_wo_sw)
for token in doc:
    print(f'{token.text:15}, {token.pos_}')


wright         , NOUN
coach          , NOUN
role           , NOUN
guildford      , ADJ
city           , NOUN
follow         , VERB
retirement     , NOUN
hold           , VERB
similar        , ADJ
position       , NOUN
walsall        , PROPN
luton          , PROPN
town           , PROPN
head           , PROPN
coach          , NOUN
everton        , PROPN
later          , ADJ
coach          , PROPN
india          , PROPN
youth          , PROPN
team           , NOUN
preparation    , PROPN
afc            , PROPN
youth          , PROPN
championship   , PROPN
inherit        , PROPN
syed           , VERB
abdul          , PROPN
rahim          , PROPN
india          , PROPN
national       , PROPN
team           , NOUN
wright         , NOUN
lead           , VERB
runner         , NOUN
-              , PUNCT
spot           , NOUN
asian          , ADJ
cup            , NOUN
remain         , VERB
notable        , ADJ
triumph        , PROPN
professional   , PROPN
football       , NOUN
india          , 
