### initial: 29sep2022  
### revised: jul2023
-----
## NLP
#### 0. Review
#### 1. Vectorization
#### 2. Dataset processing
#### 3. Training basic models
#### 4. Saving, loading, and deploying models via an API
#### 5. Conversational AI - Basic chatbot(s)

In [None]:
# import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt

import spacy

import gensim
# import fasttext


### 0. Review
- Libraries: 
    - **NLTK** (word_tokenize, sent_tokenize, PorterStemmer, FreqDist, ...)
    - **TextBlob** (.correct(), .words)
    - https://www.nltk.org/book/ch02.html
    - https://textblob.readthedocs.io/en/dev/
    - https://www.analyticsvidhya.com/blog/2021/10/making-natural-language-processing-easy-with-textblob/ 
    - ...
- NLP terminology: corpus, tokenization, stopwords, stemming, POS tagging, lemmatization, …
- Processes: 
    - EDA (word frequencies, sentence density, distribution across the dataset, building a WordCloud, …)
    - Feature engineering (vectorization)
- ...

-----------------
#### 1. Vectorization
- What is vectorization? 
    - Converting tokens, sentences, and documents into vector representations that are suitable as input to machine learning algorithms 

- Bag-of-words
- One-hot encoding
- ...

In [None]:
text = '''Lipik is a town in western Slavonia, in the Požega-Slavonia County of northeastern Croatia. 
It is known for its spas, mineral water and Lipizzaner stables. 
Lipik was occupied by Ottoman forces along with several other cities in Slavonia until its liberation in 1691. 
In 1773, the warm waters of Lipik were described favorably by a Varaždin doctor. 
It continued to be used as a treatment spa for over a century, and in 1872, the first hotel was opened in the town. 
By 1920 the number of hotels grew to six. Spa treatment is still the major focus of economy for the town. 
In the late 19th and early 20th century, Lipik was part of the Požega County of the Kingdom of Croatia-Slavonia.
'''


corpus = text.split(' ')
print(corpus)


# without punctuation
import string
corpus = text.translate(str.maketrans('', '', string.punctuation)).split(' ')
print(corpus)


# note: this counts every token, repeats included, not only the unique words
VOCAB = len(corpus)
print(VOCAB)

In [None]:
sents = text.split('.')
print(sents)


frekvencije = {}

for sent in sents:
    tokens = sent.split(' ')
    for token in tokens:
        token = token.strip().lower()
        if not token:
            continue  # skip empty strings left over from splitting
        if token not in frekvencije:
            frekvencije[token] = 1
        else:
            frekvencije[token] += 1
            
frekvencije

In [None]:
### Vector representations of words:

vektori = []
for sent in sents:
    r_vector = []
    tokens = [t.strip().lower() for t in sent.split(' ')]
    for rijec in corpus:
        # corpus tokens keep their original casing, so normalize before comparing
        if rijec.strip().lower() in tokens:
            r_vector.append(1)
        else:
            r_vector.append(0)
    vektori.append(r_vector)
    
    
print(len(vektori))
### why 9?
for i in vektori:
    print(len(i))
    ### why 122?

print(vektori[0])



vektori = np.asarray(vektori)

- Frequency vectors
    - https://towardsdatascience.com/the-magic-behind-embedding-models-part-1-974d539f21fd
    - https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
- One-hot vectors
    - One-hot vectors are suitable for representing individual tokens, as well as whole sentences or documents
    - https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer
- without using stopwords
- **problems?**
    - sparse vectors - too many zeros
    - memory-hungry
    - they lose word order: reordering the words in a sentence yields the same vector
    - they do not account for semantic distribution, i.e. patterns in the text that stem from its meaning
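
The word-order problem can be seen with a minimal sketch. The two sentences and the `bag_of_words` helper are made up for illustration; they are not part of the notebook's dataset.

```python
from collections import Counter

def bag_of_words(sentence, vocabulary):
    """Count how often each vocabulary word occurs in the sentence."""
    counts = Counter(sentence.lower().split())
    return [counts[word] for word in vocabulary]

# Hypothetical example: same words, different order, different meaning
s1 = "the dog bites the man"
s2 = "the man bites the dog"

vocabulary = sorted(set(s1.split()) | set(s2.split()))
v1 = bag_of_words(s1, vocabulary)
v2 = bag_of_words(s2, vocabulary)

print(vocabulary)
print(v1)
print(v2)
print(v1 == v2)  # both sentences map to the same vector
```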
-------------
- TF-IDF
    - weights each word's frequency in a document by how rare the word is across the corpus (on a logarithmic scale), so rarer words get a larger weight than common ones

- https://machinelearningmastery.com/gentle-introduction-bag-words-model/ 
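
A minimal sketch of the TF-IDF idea, assuming the plain textbook variant `tf * log(N / df)`. Libraries such as scikit-learn or Keras use smoothed variants, so their numbers will differ. The toy corpus and the `tf_idf` helper are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already tokenized
docs = [
    "lipik is a town in slavonia".split(),
    "lipik is known for its spas".split(),
    "the town has six hotels".split(),
]

def tf_idf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    n_docs = len(docs)
    # document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

weights = tf_idf(docs)
# print the terms of the first document, heaviest first
for term, weight in sorted(weights[0].items(), key=lambda kv: -kv[1]):
    print(f"{term:10s} {weight:.3f}")
```

In the first document, "slavonia" occurs in one document only and therefore outweighs "lipik", which occurs in two.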

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer()

tokenizer.fit_on_texts(sents)

tokenizer.index_word

tokenizer.texts_to_matrix(sents, mode="tfidf")

------------------------------
### Better (newer) options - Word2vec algorithms
- https://arxiv.org/pdf/1301.3781.pdf 
- **CBOW**
- **Skip-gram**

![cbow&skip-gram](data/1_HmmFCZpKk3i4EvMYZ855tg.png)

- the best-known algorithms for training word embeddings, i.e. vector features
    - two-layer neural networks
    - the input is a corpus
    - the output is a set of vectors
    
- skip-gram uses a word to predict its "target" context.
    - the vectors that come out of skip-gram capture semantic relations between words very well
    - ![cbow&skip-gram](data/1_jpnKO5X0Ii8PVdQYFO2z1Q.png)

- CBOW uses the context to predict the "target" word
    - every example with a target word is fed into the network
    - the average of the values in the hidden layer becomes the vector that represents our target word
---------------
- Example: "Lipik je divan grad" ("Lipik is a lovely town")
    - in skip-gram the input would be 'divan', and the output 'Lipik', 'je', 'grad'
    - in CBOW the input would be 'Lipik', 'je', 'grad', and the model returns candidate target words with some probability, "divan" among them
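
The example can be reproduced with a small sketch that builds the training examples for both variants. The `training_pairs` helper and the window size of 2 are assumptions for illustration; real Word2vec implementations add details such as subsampling and negative sampling.

```python
def training_pairs(tokens, window=2):
    """Build skip-gram pairs and CBOW examples with a sliding window."""
    skipgram, cbow = [], []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        # skip-gram: one (input word, context word) pair per neighbour
        skipgram.extend((target, c) for c in context)
        # CBOW: the whole context is the input, the centre word is the label
        cbow.append((context, target))
    return skipgram, cbow

tokens = "lipik je divan grad".split()
skipgram, cbow = training_pairs(tokens, window=2)

print([pair for pair in skipgram if pair[0] == "divan"])
print([example for example in cbow if example[1] == "divan"])
```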

- https://towardsdatascience.com/nlp-101-word2vec-skip-gram-and-cbow-93512ee24314
- https://machinelearningmastery.com/develop-word-embeddings-python-gensim/  
- https://towardsdatascience.com/a-beginners-guide-to-word-embedding-with-gensim-word2vec-model-5970fa56cc92 (+ T-SNE)
- https://towardsdatascience.com/creating-word-embeddings-coding-the-word2vec-algorithm-in-python-using-deep-learning-b337d0ba17a8 - how to build your own embeddings from scratch with a 2-layer neural network

### Libraries used to build Word2vec representations of text:
- gensim (https://radimrehurek.com/gensim/)
- fasttext (https://fasttext.cc/)
- ...

In [None]:
import gensim

print(text)

In [None]:
### 1. Create a corpus of texts/sentences/tokens
import pprint

# one list of lowercase tokens per sentence; empty sentences are dropped
gensim_corpus = [
    sent.strip().lower().split()
    for sent in text.split('.')
    if sent.strip()
]
pprint.pprint(gensim_corpus)


# sentences = MyCorpus()
# model = gensim.models.Word2Vec(sentences=gensim_corpus)

In [None]:
### 2. Create a dictionary - it holds every word our pipeline will know

from gensim import corpora

dictionary = corpora.Dictionary(gensim_corpus)
print(dictionary)

In [None]:
### 3. Since the dictionary has 75 unique tokens, each token gets a unique id from 0 to 74

dictionary.token2id

In [None]:
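### Instantiate and train the model; its input is a list of sentences, each of which is a list of tokens (a nested list)

model = gensim.models.Word2Vec(
    sentences=gensim_corpus,
    vector_size=50,  # dimensionality of the word vectors
    window=2,        # max distance between the current and the predicted word
    min_count=1,     # keep every word, even ones that occur only once
    sg=1             # 1 = skip-gram, 0 = CBOW
)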
# https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html#training-parameters


lipik_vektor = model.wv['lipik']
print(lipik_vektor)


In [None]:
model.wv.most_similar('lipik')

In [None]:
print(model.wv.most_similar('lipik'))

In [None]:
### What if we look up a word that was not present in the model?
### (gensim is expected to raise a KeyError here)

print(model.wv.most_similar('village'))

- fastText was developed at Facebook in 2016 as an extension of the Word2Vec algorithms
    - it breaks every token into character n-grams (subwords)
    - this lets it better represent words it never saw in the training dataset
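
A rough sketch of what "character n-grams" means here. The `char_ngrams` helper is hypothetical; the `<` and `>` boundary markers and the 3 to 6 length range follow fastText's usual defaults, but the real library additionally hashes the n-grams into buckets and keeps a vector for the whole word.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of a word wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams

print(char_ngrams("where", min_n=3, max_n=3))

# an unseen word still shares many subwords with a word seen in training
shared = set(char_ngrams("village")) & set(char_ngrams("villages"))
print(sorted(shared))
```

Because "villages" shares most of its n-grams with "village", a subword model can assemble a usable vector for one even if only the other appeared in the training data.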
    

In [None]:
from gensim.models import FastText # (= character n-gram algorithm)

In [None]:
### Saving the model

# model.wv.save_word2vec_format('model.bin', binary=True)
# model.wv.save_word2vec_format('model.txt', binary=False)


### other kinds of models:
# 1. LDA
    # - https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation#:~:text=In%20natural%20language%20processing%2C%20the,of%20the%20data%20are%20similar.
    # - https://towardsdatascience.com/nlp-extracting-the-main-topics-from-your-dataset-using-lda-in-minutes-21486f5aa925
    # - https://radimrehurek.com/gensim/auto_examples/tutorials/run_ensemblelda.html
# 2. Doc2Vec
    # https://radimrehurek.com/gensim/auto_examples/tutorials/run_doc2vec_lee.html#introducing-paragraph-vector
        # not the same as the average of all word2vec vectors!
# 3. Latent Semantic Analysis
    # - https://en.wikipedia.org/wiki/Latent_semantic_analysis


In [None]:
### visualization example ###

from sklearn.manifold import TSNE

def display_closestwords_tsnescatterplot(model, word, size):
    arr = np.empty((0,size), dtype='f')
    word_labels = [word]

    close_words = model.wv.most_similar(word)
    print(type(close_words))
    arr = np.append(arr, np.array([model.wv[word]]), axis=0)
    for wrd_score in close_words:
        wrd_vector = model.wv[wrd_score[0]]
        word_labels.append(wrd_score[0])
        arr = np.append(arr, np.array([wrd_vector]), axis=0)
        
    tsne = TSNE(n_components=2, perplexity=5, random_state=42)
    np.set_printoptions(suppress=True)
    Y = tsne.fit_transform(arr)
    x_coords = Y[:, 0]
    y_coords = Y[:, 1]
    plt.scatter(x_coords, y_coords)
    for label, x, y in zip(word_labels, x_coords, y_coords):
        plt.annotate(label, xy=(x, y), xytext=(0, 0), textcoords='offset points')
    # pad both ends of each axis so that edge points are not clipped
    plt.xlim(x_coords.min() - 0.00005, x_coords.max() + 0.00005)
    plt.ylim(y_coords.min() - 0.00005, y_coords.max() + 0.00005)
    plt.show()

display_closestwords_tsnescatterplot(model, 'lipik', 50) 

------------------------

- **Transfer learning**
    - extracting "knowledge" from some other source and applying it to our problem/dataset
        - knowledge = features
        - knowledge = data
        - knowledge = metadata in ML
    - lets us apply trained models to a different, similar, or identical problem (depending on the task we are solving)
    - https://ruder.io/state-of-transfer-learning-in-nlp/ 

    - best-known pretrained word-embedding models: 
        - https://nlp.stanford.edu/projects/glove/ 
        - GoogleNews-vectors-negative300.bin (https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz / https://figshare.com/articles/dataset/GoogleNews-vectors-negative300_bin/6007688)
    
    - today's SOTA NLP models:
        - transformers (BERT (BERTić), GPT, XLNet)
        - https://www.analyticsvidhya.com/blog/2019/06/understanding-transformers-nlp-state-of-the-art-models/ 

In [None]:
import spacy
# python -m spacy download en_core_web_lg # md # sm
nlp = spacy.load("en_core_web_lg")


### IF THIS FAILS:
## activate the environment and run the command:
    # "python -m spacy download en_core_web_lg"

In [None]:
### take the text about Lipik from the beginning
doc = nlp(text)

In [None]:
print(type(doc))

In [None]:
sentences = list(doc.sents)
print(sentences)

In [None]:
### spaCy exposes a range of linguistic data per token, sentence, or whole document

for token in doc:
    print("{0}\t{1}\t{2}\t{3}\t{4}\t{5}\t{6}\t{7}".format(
        token.text,
        token.idx,
        token.lemma_,
        token.is_punct,
        token.is_space,
        token.shape_,
        token.pos_,
        token.tag_
    ))

In [None]:
### VECTORS
vectors = []
for sent in doc.sents:
    s = []
    for token in sent:
        # print(token.vector.shape)
        if token.is_oov:
            print(token.text)
            # out-of-vocabulary tokens get a placeholder vector of ones
            vector = np.ones((300,), dtype=np.float32)
        else:
            vector = token.vector
        s.append(vector)
    vectors.append(s)

# print(len(vectors[0][5]))
# print(vectors[0][5])

In [None]:
print(vectors[5])

In [None]:
### pretrained NER labels

from spacy import displacy
displacy.render(doc, style="ent", jupyter=True)


### Semantic similarity 

- How similar/different are two words? 
- What are the possible methods for computing similarity between words? 

-----------------
- What would be the ways of computing similarity between two documents?
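
Two common answers for vectors are cosine similarity and Euclidean distance. The sketch below uses made-up 3-dimensional vectors and hypothetical helper names. For whole documents, one option (an assumption, not the only approach) is to apply the same measures to averaged word vectors or to TF-IDF vectors.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between a and b: 1 = same direction, 0 = orthogonal."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a, b):
    """Straight-line distance between the two points."""
    return float(np.linalg.norm(a - b))

# Hypothetical 3-dimensional "embeddings"
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same direction as a, twice as long
c = np.array([-3.0, 0.0, 1.0])  # orthogonal to a

print(cosine_sim(a, b), euclidean_distance(a, b))
print(cosine_sim(a, c), euclidean_distance(a, c))
```

Cosine similarity ignores vector length, so `a` and `b` count as maximally similar even though they are far apart in Euclidean terms.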

In [None]:
# https://www.sciencedirect.com/topics/computer-science/cosine-similarity#:~:text=Cosine%20similarity%20measures%20the%20similarity,in%20roughly%20the%20same%20direction.&text=Thus%2C%20each%20document%20is%20an,called%20a%20term%2Dfrequency%20vector. 
# https://www.sciencedirect.com/topics/computer-science/euclidean-distance 

from scipy import spatial
def get_cosine_similarity(*word_pairs, nlp_pipeline):
    # source: https://towardsdatascience.com/a-little-spacy-food-for-thought-easy-to-use-nlp-framework-97cbcc81f977
    cosine_similarity = lambda x, y: 1 - spatial.distance.cosine(x, y)
    for pair in word_pairs:
        print("{} AND {}: ".format(pair[0], pair[1]), cosine_similarity(nlp_pipeline.vocab[pair[0]].vector, nlp_pipeline.vocab[pair[1]].vector))
    # print("apple vs banana: ", cosine_similarity(nlp.vocab["apple"].vector, nlp.vocab["banana"].vector))
    # print("car vs banana: ", cosine_similarity(nlp.vocab["car"].vector, nlp.vocab["banana"].vector))
    # print("car vs bus: ", cosine_similarity(nlp.vocab["car"].vector, nlp.vocab["bus"].vector))
    # print("tomatos vs banana: ", cosine_similarity(nlp.vocab["tomatos"].vector, nlp.vocab["banana"].vector))
    # print("tomatos vs cucumber: ", cosine_similarity(nlp.vocab["tomatos"].vector, nlp.vocab["cucumber"].vector))


get_cosine_similarity(
        ("apple", "banana"),
        ("Slavonija", "Croatia"),
        ("spa", "water"),
        ("spa", "mineral"),
        ("water", "mineral"),
        ("Lipik", "Pakrac"),
        ("Lipik", "Croatia"),
        nlp_pipeline = nlp
    )

In [None]:
cosine_similarity = lambda x, y: 1 - spatial.distance.cosine(x, y)

man = nlp.vocab["man"].vector
woman = nlp.vocab["woman"].vector
queen = nlp.vocab["queen"].vector
king = nlp.vocab["king"].vector
calculated_king = man - woman + queen
print("similarity between our calculated king vector and real king vector:", cosine_similarity(calculated_king, king))

In [None]:
### spaCy has no gensim-style ".most_similar(word)" helper on the pipeline,
### so we query the vector table (nlp.vocab.vectors) directly.
### Note: an out-of-vocabulary word has an all-zero vector and gives meaningless scores.

def get_most_similar(nlp_pipeline, word, limit=7):
    vocab = nlp_pipeline.vocab
    query = vocab[str(word)].vector.reshape(1, -1)
    # ask for one extra neighbour because the query word usually matches itself
    keys, _, scores = vocab.vectors.most_similar(query, n=limit + 1)
    similar = [(vocab.strings[int(k)].lower(), float(s)) for k, s in zip(keys[0], scores[0])]
    return [(w, s) for w, s in similar if w != str(word).lower()][:limit]

# get_most_similar(nlp, "Lipik")
# get_most_similar(nlp, "Croatia")
# get_most_similar(nlp, "Slavonija")
# get_most_similar(nlp, "Lipizzaner")
get_most_similar(nlp, "man")

In [None]:
from gensim.models import KeyedVectors
import os

Word2Vec_embeddings = os.path.join(os.getcwd(), 'models', 'GoogleNews-vectors-negative300.bin')
w2v_model = KeyedVectors.load_word2vec_format(Word2Vec_embeddings, binary=True)

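# tokenize the Lipik text explicitly instead of relying on a `tokens` variable left over from an earlier cell
tokens = text.split()
embeddings = []
for t in tokens: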
    try:
        # https://radimrehurek.com/gensim_3.8.3/models/keyedvectors.html#gensim.models.keyedvectors.Word2VecKeyedVectors
        embeddings.append(w2v_model.get_vector(t)) 
    except KeyError:
        # embeddings.append(get_placeholder_vector(300))
        continue


# print(cosine_similarity(calculated_king, king))

In [None]:
### Using GloVe (Stanford University)
# https://nlp.stanford.edu/projects/glove/

filename = 'glove.6B.50d.txt'

GloVe_embeddings = os.path.join(os.getcwd(), 'models', 'glove.6B', filename)

embeddings_index = {}
with open(GloVe_embeddings, "r", encoding="utf-8") as file:
    for line in file:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype="float32")
        # print(coefs.shape)
        vector_size = coefs.shape[0]
        embeddings_index[word.strip().lower()] = coefs
    print("Loaded %s word vectors." % len(embeddings_index))

tokens = text.split(' ')
embeddings = []

for token in tokens:
    t = token.strip().lower()
    try:
        embeddings.append(embeddings_index[t])
    except KeyError:
        # embeddings.append(get_placeholder_vector(vector_size))
        continue

-----------------------------------------------------------

### TASKS (if semantic similarity and the unsupervised approach to text analysis are not yet clear): 
##### Textual (dis)similarity 
1. Given a new random article, which article in an existing dataset is it most similar to?
2. Run a cluster analysis to divide a dataset into n clusters (use your data science skills and the elbow method to choose the best n) -> define the semantic classes of those articles
3. Given a new random article, which class does it belong to?

##### How-To:
1. Get 1000 Wikipedia articles (choose a topic about an AI field, or 3 AI fields)
    - use web scraping, not an existing Kaggle or Wikipedia dataset :) 

2. Model training
    1. Don't train a model; use TF-IDF, CBOW, Jaccard similarity, and Levenshtein distance (yes)
    2. Train an unsupervised model on all the data and use cosine similarity between both words and documents
    3. Perform topic modelling with gensim --> use this data both to classify texts and to measure similarity
    4. Use TextRank or RAKE for keyword extraction --> does this work better on your dataset?
    - see https://www.analyticsvidhya.com/blog/2022/01/four-of-the-easiest-and-most-effective-methods-of-keyword-extraction-from-a-single-text-using-python/ for keyword extraction info
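
As a starting point for the "no training" option, here is a minimal sketch of Jaccard similarity and Levenshtein distance in plain Python. The helper names and the sample inputs are made up; for real articles you would tokenize and normalize the text first.

```python
def jaccard_similarity(tokens_a, tokens_b):
    """Size of the intersection divided by the size of the union of the token sets."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    if not set_a and not set_b:
        return 1.0  # convention chosen here: two empty texts count as identical
    return len(set_a & set_b) / len(set_a | set_b)

def levenshtein_distance(s, t):
    """Minimum number of single-character edits that turn s into t."""
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]

print(jaccard_similarity("machine learning is fun".split(),
                         "deep learning is fun".split()))
print(levenshtein_distance("kitten", "sitting"))
```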