# Lab 1
## **Text processing**

## Exercise 1:
Benchmark different language-detection algorithms by computing the accuracy of each approach:
- FastText
- LangID
- langDetect

Hint: use the `iso639-lang` package for language code conversion.

Report
- Accuracy
- Average time per example
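
The two reported metrics can be computed by one helper that works for any detector. The following is a minimal sketch, not the lab's reference solution: `benchmark` and `toy_detector` are hypothetical names, and the toy detector only exists to make the example runnable without installing any of the three libraries.

```python
import time

def benchmark(predict, texts, labels):
    """Return (accuracy, average seconds per example) for one detector."""
    correct = 0
    start = time.perf_counter()
    for text, label in zip(texts, labels):
        try:
            correct += int(predict(text) == label)
        except Exception:
            pass  # a failed prediction counts as a miss
    elapsed = time.perf_counter() - start
    return correct / len(texts), elapsed / len(texts)

# Hypothetical stand-in detector, only here to make the sketch runnable.
def toy_detector(text):
    return "it" if "ciao" in text.lower() else "en"

texts = ["Ciao a tutti", "Hello everyone", "Good morning", "ciao mondo"]
labels = ["it", "en", "en", "it"]

accuracy, avg_time = benchmark(toy_detector, texts, labels)
print(f"Accuracy: {accuracy:.3f}, average time per example: {avg_time * 1000:.4f} ms")
```

Dividing by the total number of examples means that texts on which a detector raises an error lower its accuracy rather than being silently dropped.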

In [None]:
!wget https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P1/langid_dataset.csv

### My code...

In [None]:
# import
import pandas as pd

# read file
corpus = pd.read_csv('langid_dataset.csv')

In [None]:
# sneak peek
corpus.head()

In [None]:
# accuracy function and format printing-function

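def acc(clf_fcn, corpus):
    """Count the texts whose predicted language matches the label."""
    # Lang is imported from iso639 in the next cell, before acc is first called
    correct = 0
    for text, lan in zip(corpus.Text, corpus.language):
        try:
            pred = clf_fcn(text)
        except Exception:
            continue  # a failed prediction counts as incorrect
        if isinstance(pred, tuple):  # langid.classify returns (lang, prob)
            pred = pred[0]
        # compare two-letter codes only, e.g. 'zh-hans' -> 'zh'
        correct += int(pred[:2] == Lang(lan).pt1)
    return correct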

def output(method, corpus, correct, elapsed):
    print(f'{method:s} \nAccuracy: {correct/len(corpus):.3f}. Est time/sample: {elapsed/len(corpus)*1000:.3f} ms')

In [None]:
import time
#!pip install iso639-lang
from iso639 import Lang

In [None]:
# FastText (fastlangid)
from fastlangid import LID

method = 'FastText'

# fastText model
fastText_clf = LID()

# accuracy and timing
start = time.time()
correct = acc(fastText_clf.predict, corpus)
elapsed = time.time() - start

# print result
output(method, corpus, correct, elapsed)

In [None]:
# LangID
#!pip install langid
import langid

method = 'LangID'

# accuracy and timing
start = time.time()
correct = acc(langid.classify, corpus)
elapsed = time.time() - start

# print result
output(method, corpus, correct, elapsed)


In [None]:
# langdetect
#!pip install langdetect
from langdetect import detect

method = 'langdetect'

# accuracy and timing
start = time.time()
correct = acc(detect, corpus)
elapsed = time.time() - start

# print result
output(method, corpus, correct, elapsed)

## Exercise 2
For English-written text, apply word-level tokenization. What is the average number of words per sentence?
Implement word-tokenization using both nltk and spacy. Report the results for both of them.
For spaCy use the en_core_web_sm model.
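
The averaging step is independent of the tokenizer, so it can be written once and reused for both libraries. Below is a small sketch under stated assumptions: `average_tokens` is a hypothetical helper, and `regex_tokenize` is a naive regex tokenizer used as a stand-in so the example runs without nltk or spaCy. It is not equivalent to either library's tokenizer.

```python
import re

def average_tokens(sentences, tokenize):
    """Mean number of tokens per sentence for a given tokenizer function."""
    counts = [len(tokenize(s)) for s in sentences]
    return sum(counts) / len(counts)

# Naive stand-in tokenizer (an assumption of this sketch):
# runs of word characters, or single punctuation marks.
def regex_tokenize(sentence):
    return re.findall(r"\w+|[^\w\s]", sentence)

sentences = ["Hello, world!", "Tokenizers disagree on details."]
avg = average_tokens(sentences, regex_tokenize)
print(f"Average tokens per sentence: {avg:.2f}")
```

Passing `nltk.word_tokenize` or a function wrapping a spaCy pipeline as `tokenize` would give the two numbers the exercise asks for. Whether punctuation tokens count as words is a choice worth stating in the report.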

### My code...

In [None]:
# find only english texts
corpus_eng = corpus.loc[corpus.language=='English']

In [None]:
# imports
#!pip install nltk
import nltk
#!pip install -U spacy
#python -m spacy download en_core_web_sm (run in terminal)
import spacy

In [None]:
# counting average number of words

# nltk
tot_words = 0
for sentence in corpus_eng.Text:
    tokens = nltk.word_tokenize(sentence)
    tot_words += len(tokens)

print(f'NLTK \nAverage number of words/sentence: {tot_words/len(corpus_eng):.2f}')

# spacy
# only 'tok2vec' stays enabled; the tokenizer always runs and is all that is needed here
spacy_nlp = spacy.load("en_core_web_sm", disable=['parser', 'ner', 'tagger', 'attribute_ruler', 'lemmatizer'])
tot_words = 0
for sentence in corpus_eng.Text:
    doc = spacy_nlp(sentence)
    tot_words += len(doc)  # token count includes whitespace and punctuation tokens

print(f'spaCy \nAverage number of words/sentence: {tot_words/len(corpus_eng):.2f}')



In [None]:
spacy_nlp.pipe_names

## Exercise 3

Dependency Parsing aims at analyzing the grammatical structure of sentences. The main goal is to find out related words as well as the type of the relationship between them.
The output of this step is a dependency tree similar to the one reported in the figure below.


Use spaCy to parse the dependency tree of a **randomly selected** sentence. You can use either an English sentence or one in your native language (if supported in [spaCy](https://spacy.io/usage/models/)). Use [displaCy](https://explosion.ai/demos/displacy) to visualize the result in the notebook.
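
Conceptually, a dependency tree stores, for every word, the word it depends on (its head) and the type of that relation. The sketch below illustrates this data structure with a hand-written parse of a three-word sentence. The `Arc` class and `children` helper are hypothetical, and the arcs are written by hand rather than produced by a parser. In spaCy the same information is exposed per token as `token.head` and `token.dep_`.

```python
from dataclasses import dataclass

@dataclass
class Arc:
    word: str
    head: str      # the word this token depends on
    relation: str  # dependency label

# Hand-written parse of "She reads books" (not parser output).
arcs = [
    Arc("She", "reads", "nsubj"),
    Arc("reads", "reads", "ROOT"),  # the root points to itself, as in spaCy
    Arc("books", "reads", "dobj"),
]

def children(head, arcs):
    """Words that depend directly on `head`."""
    return [a.word for a in arcs if a.head == head and a.word != head]

for arc in arcs:
    print(f"{arc.word:<6} --{arc.relation}--> {arc.head}")
print(children("reads", arcs))
```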

In [None]:
corpus_eng

In [None]:
# choose random sentence
nlp = spacy.load("en_core_web_sm")
sentence = corpus_eng.sample(random_state=123).Text.iloc[0]
doc = nlp(sentence)
spacy.displacy.render(doc, style="dep")

## Exercise 4
For the same sentence selected in the previous step apply all the following steps:
1. Lemmatization: convert each word to its root form.
2. Stopword removal: remove language-specific stopwords.
3. Part of Speech Tagging: for each word in the sentence display its part-of-speech.

For each step, print the resulting list on the console.

In [None]:
# what is included in pipeline
print(f'{nlp.pipe_names}\n')

# original sentence
print(f'Original sentence: \n{sentence}')

In [None]:
# 1. lemmatization
doc = nlp(sentence)
sentence_lemma = " ".join([token.lemma_ for token in doc])
print(f'After lemmatization: \n{sentence_lemma}')

In [None]:
# 2. stopword removal
stopwords = nlp.Defaults.stop_words
sentence_wo_sw = " ".join([w for w in sentence_lemma.split() if w.lower() not in stopwords])  # stopword list is lowercase

print(f'After removal of stopwords: \n{sentence_wo_sw}')

In [None]:
# 3. POS

# use latest version of sentence in doc
doc = nlp(sentence_wo_sw)
for token in doc:
    print(f'{token.text:{15}}, {token.pos_}')


## **Occurrence-based text representation - TF-IDF**

---
TF-IDF (term frequency-inverse document frequency) is a statistical measure that evaluates how relevant a word is to a document in a collection of documents. It yields an occurrence-based vector representation of each document.
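
To make the measure concrete, here is a minimal sketch of the textbook variant, where the weight of a term in a document is its relative frequency multiplied by `log(N / df)`, with `N` the number of documents and `df` the number of documents containing the term. The three toy documents and the `tfidf` helper are assumptions for illustration. Library implementations differ in the details: scikit-learn's `TfidfVectorizer`, for example, applies IDF smoothing and L2 normalization by default, so its numbers will not match this sketch.

```python
import math
from collections import Counter

# Toy collection (an assumption for illustration), already tokenized.
docs = [
    "the cat sat".split(),
    "the dog sat".split(),
    "the cat ran".split(),
]

n_docs = len(docs)
# document frequency: in how many documents each term appears
df = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    """Textbook TF-IDF: relative term frequency times log(N / df)."""
    counts = Counter(doc)
    return {t: (c / len(doc)) * math.log(n_docs / df[t]) for t, c in counts.items()}

weights = tfidf(docs[1])
for term, weight in weights.items():
    print(f"{term:<4} {weight:.3f}")
```

A term that occurs in every document ("the") gets weight zero, while a term unique to one document ("dog") gets the highest weight.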

## Exercise 5
Use TF-IDF to vectorize each sentence in the original data collection. You can choose your preferred implementation of TF-IDF vectorization; one is available in the [scikit-learn library](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).

In [None]:
#!pip install scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

v = TfidfVectorizer()
X = v.fit_transform(corpus.Text)
#X = X.toarray()

In [None]:
feature_names = v.get_feature_names_out()

#dense = X.todense()

#arr = X.toarray()


## Exercise 6
Build a supervised multi-class language detector using as features the vector obtained by TF-IDF representation. Use 80% of the data to train the language detector and 20% of the data for assessing its accuracy.
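
With many languages in the data, a stratified split keeps the class proportions the same in the training and test portions. The sketch below shows the idea in plain Python. `stratified_split` is a hypothetical helper that illustrates what the `stratify` argument of scikit-learn's `train_test_split` achieves; it is not that function's implementation.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_indices, test_indices), keeping class proportions."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])   # 20% of this class
        train_idx.extend(indices[n_test:])  # remaining 80%
    return train_idx, test_idx

labels = ["English"] * 10 + ["Italian"] * 10
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Any vectorizer should then be fitted on the training portion only and merely applied to the test portion, so that no information from the test data leaks into the features.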



In [None]:
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
import numpy as np


target = corpus.language.values # yields numpy.array
text = corpus.Text.values

#X_tr, X_test, y_tr, y_test = train_test_split(X, target, test_size = 0.2, random_state=42, stratify=target)

X_tr, X_test, y_tr, y_test = train_test_split(text, target, test_size = 0.2, random_state=42, stratify=target)

X_train, X_val, y_train, y_val = train_test_split(X_tr, y_tr, test_size=0.2, random_state=42, stratify=y_tr)

In [None]:
# apply tf-idf on input variables (text)
X_train_tfidf = v.fit_transform(X_train) # only fit to the training data
X_val_tfidf = v.transform(X_val)
X_test_tfidf = v.transform(X_test)

In [None]:
# one-vs-rest classifier using logistic regression
clf_lr = OneVsRestClassifier(LogisticRegression(penalty='l2')).fit(X_train_tfidf, y_train)

# multinomial naive Bayes
clf_mnb = MultinomialNB(alpha=0.5).fit(X_train_tfidf, y_train)

# training performance
print(f'One-vs-rest logistic regression \nTraining acc: {clf_lr.score(X_train_tfidf, y_train):.3f} \nValidation acc: {clf_lr.score(X_val_tfidf, y_val):.3f}')

print(f'Multinomial Naive Bayes \nTraining acc: {clf_mnb.score(X_train_tfidf, y_train):.3f} \nValidation acc: {clf_mnb.score(X_val_tfidf, y_val):.3f}')



In [None]:
# test performance
print(f'One-vs-rest logistic regression \nTest acc: {clf_lr.score(X_test_tfidf, y_test):.3f}')
print(f'Multinomial Naive Bayes \nTest acc: {clf_mnb.score(X_test_tfidf, y_test):.3f}')

## **Topic Modelling**

Occurrence-based representations are high-dimensional. What is the dimension of the generated TF-IDF vector representation?
Topic modelling focuses on capturing latent topics in large document corpora.

The data collection used in this second part of the practice is provided [here](https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P1/CovidFake_filtered.csv) - [source: Zenodo](https://zenodo.org/record/4282522#.YVdCXcbOOpd)




In [None]:
# dimensions of generated TF-IDF for whole corpus
np.shape(X) #277.719 features (tokens)

## Exercise 7

Latent Semantic Indexing (LSI) models underlying concepts by using SVD (Singular Value Decomposition).
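
In matrix terms, LSI keeps only the $k$ largest singular values of the term-document matrix. The symbols below are assumptions of this note, not notation from the lab: $X$ is the $m \times n$ matrix of term weights ($m$ terms, $n$ documents) and $k$ is the number of topics.

$$
X \approx U_k \, \Sigma_k \, V_k^{\top}
$$

Here $U_k$ ($m \times k$) relates terms to topics, $\Sigma_k$ ($k \times k$, diagonal) holds the $k$ largest singular values, and $V_k$ ($n \times k$) relates documents to topics.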

Use [gensim](https://radimrehurek.com/gensim/) library to:
1. Create a corpus composed of the headlines contained in the data collection.
2. Generate a [dictionary](https://radimrehurek.com/gensim/corpora/dictionary.html) to create a word -> id mapping (required by LSI module).
3. Using the dictionary, preprocess the corpus to obtain the representation required for LSI model training ([documentation here](https://radimrehurek.com/gensim/models/lsimodel.html)).
4. Inspect the top-5 topics generated by the LSI model for the analysed corpus.
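
Steps 2 and 3 amount to mapping each word to an integer id and then describing every document as a list of `(id, count)` pairs, which is the bag-of-words format that gensim's `doc2bow` returns. The sketch below reproduces that representation in plain Python. The toy headlines, `word2id` and `to_bow` are assumptions for illustration, and ids are assigned in order of first appearance here, which may differ from the ids gensim's `Dictionary` assigns.

```python
from collections import Counter

# Toy headlines (an assumption for illustration).
headlines = [
    "covid vaccine trial starts",
    "vaccine rumours spread online",
]
tokenized = [h.split() for h in headlines]

# word -> id mapping, ids assigned in order of first appearance
word2id = {}
for doc in tokenized:
    for word in doc:
        word2id.setdefault(word, len(word2id))

def to_bow(doc):
    """Bag-of-words: sorted (word_id, count) pairs; unknown words are skipped."""
    counts = Counter(word2id[w] for w in doc if w in word2id)
    return sorted(counts.items())

bow_corpus = [to_bow(doc) for doc in tokenized]
print(word2id)
print(bow_corpus)
```

Words missing from the mapping are dropped silently, so the dictionary has to be built from the same tokens that are later converted to bag-of-words vectors.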

In [None]:
!wget https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P1/CovidFake_filtered.csv

In [None]:
# 1. Create a corpus from the headlines

# read file
corpus_covid = pd.read_csv('CovidFake_filtered.csv').headlines.values

# split each headline into a list of words (one list per headline)
corpus_covid_merged = [sentence.split() for sentence in corpus_covid]

In [None]:
corpus_covid[0]

In [None]:
# 2. dictionary for ID-mapping
from gensim.corpora import Dictionary
dct = Dictionary(corpus_covid_merged)  # initialize a Dictionary

In [None]:
# 3. Preprocess corpus to get representation for LSI-model, using dictionary
processed_corpus = [dct.doc2bow(text) for text in corpus_covid_merged]

In [None]:
# 4. Train the LSI-model
#!pip install gensim
from gensim.models import LsiModel
model = LsiModel(processed_corpus, id2word=dct)
model.print_topics(5)

## Exercise 8 (Optional)

If no stopword removal is applied, the top-scored words contributing to each topic are common English words (e.g., *to, for, in, of, on*). Moreover, leaving punctuation attached to words can hurt topic identification. Repeat the procedure of Ex. 7, adding a preliminary preprocessing step to:
1. **remove stopwords**
2. **strip punctuation**
3. **lowercase all words**
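
The order of the three steps matters: lowercasing and stripping punctuation before the stopword check ensures that tokens such as "The" or "for," are recognized as stopwords. A minimal sketch, assuming a tiny hypothetical stopword list (`STOPWORDS`) in place of a real one from nltk or spaCy:

```python
import string

STOPWORDS = {"the", "to", "for", "in", "of", "on", "a", "is"}  # tiny hypothetical list
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def preprocess(headline):
    """Lowercase, strip punctuation, then drop stopwords and empty tokens."""
    tokens = (w.lower().translate(PUNCT_TABLE) for w in headline.split())
    return [t for t in tokens if t and t not in STOPWORDS]

cleaned = preprocess("The vaccine, is SAFE for children!")
print(cleaned)
```

Tokens that consist only of punctuation become empty strings after stripping, which is why empty tokens are filtered out as well.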

In [None]:
# remove stopwords, punctuation and make lower case

#from nltk.corpus import stopwords # not working
import nltk
import string

# nltk.download('stopwords') # not working
#stop_words = set(stopwords.words('english'))
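stopwords = nlp.Defaults.stop_words
punct_table = str.maketrans('', '', string.punctuation)
# lowercase and strip punctuation first, so that tokens like "The," are also caught by the stopword filter
corpus_covid_no_sw = [[w.lower().translate(punct_table) for w in s] for s in corpus_covid_merged]
# drop stopwords and tokens left empty after stripping punctuation
corpus_covid_no_sw = [[w for w in s if w and w not in stopwords] for s in corpus_covid_no_sw]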
print(corpus_covid_no_sw[:10])

In [None]:
rs_tm_dict = Dictionary(corpus_covid_no_sw)  # build the dictionary from the preprocessed tokens
rs_processed_corpus = [rs_tm_dict.doc2bow(text) for text in corpus_covid_no_sw]
rs_model = LsiModel(rs_processed_corpus, id2word=rs_tm_dict)
rs_model.print_topics(5)

## Exercise 9 (Optional)

Leveraging the same corpus used for LSI model generation, apply LDA modelling, setting the number of topics to 5. Display the words that contribute most to those topics according to the LDA model.

In [None]:
from gensim.models.ldamodel import LdaModel
lda = LdaModel(rs_processed_corpus, id2word=rs_tm_dict, num_topics=5)  # 5 topics, as the exercise requires
lda.print_topics(5)

## Exercise 10 (Optional)

Using the [pyLDAvis]() library, build an interactive visualization for the trained LDA model.

In [None]:
!pip install pyLDAvis
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis
pyLDAvis.enable_notebook()

lda_display = gensimvis.prepare(lda, rs_processed_corpus, rs_tm_dict, sort_topics=False)
pyLDAvis.display(lda_display)