Sveučilište u Zagrebu<br>
Fakultet elektrotehnike i računarstva

## Uvod u znanost o podacima

# Replikacija rezultata

In [57]:
import numpy as np
import math
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.model_selection import cross_val_score
from sklearn.feature_selection import SelectKBest, mutual_info_classif
import matplotlib.pyplot as plt
from collections import Counter
import warnings
from sklearn.metrics import f1_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.svm import SVC
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
import random
warnings.filterwarnings("ignore")

In [58]:
!pip install swig



In [59]:
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', -1)

Znanstveni rad opisuje novi način razvrstavanja članaka na temelju jezgrenih funkcija. Jezgrene funkcije predstavljaju umnožak u prostoru značajki. Jezgrene funkcije se koriste za klasifikaciju članaka jer je članke teško vektorizirati, odnosno pretvoriti u vektor značajki.

Znanstveni rad opisuje implementaciju jezgre SSK koja navodno ima bolje performanse od standardnih jezgri NGK i WK.

## Kernels

Jezgrene funkcije računaju sličnost između dva primjera. Sličnost se računa kao umnožak u prostoru značajki. Primjeri se samo implicitno preslikavaju u prostor značajki i tamo se množe.

### WK kernel

Standardni pristup klasifikaciji teksta preslikava tekst u visokodimenzionalni vektor u kojem svaki element vektora označava prisutnost ili nedostatak neke značajke. Ovakav pristup gubi svu informaciju o redoslijedu riječi te zadržava samo informaciju o frekvenciji pojavljivanja pojmova u dokumentu.

Npr.
s="science is organized knowledge"
t="wisdom is organized life"

feature vector = ["science, is, organized, knowledge, wisdom, life]

fi_1 = [1, 1, 1, 1, 0, 0]
fi_2 = [0, 1, 1, 0, 1, 1]

K(s, t) = [1, 1, 1, 1, 0, 0]*[0, 1, 1, 0, 1, 1] = 2

In [60]:
import kernels.wk

In [61]:
wk_kernel = lambda x, y: wk(x, y)

def wkGmats(trainDocs, testDocs):
    vectorizer = CountVectorizer(analyzer = "word", tokenizer = None, preprocessor = None, stop_words = "english", input='content') 

    train_data_features = vectorizer.fit_transform(trainDocs)
    train_data_features = train_data_features.toarray()

    transformer = TfidfTransformer(smooth_idf=True)
    tfidf = transformer.fit_transform(train_data_features)
    tfidf = tfidf.toarray() 

    nTrainDocs = len(tfidf)
    GmatTrain = np.ones((nTrainDocs,nTrainDocs))

    for i in range( 0, nTrainDocs ):
        for j in range(0,nTrainDocs):
            GmatTrain[i][j] = np.dot(tfidf[i], tfidf[j])
            
    n_features_train = len(tfidf[0])
    
    train_data_features = vectorizer.transform(testDocs)
    train_data_features = train_data_features.toarray()

    transformer = TfidfTransformer(smooth_idf=True)
    tfidfTest = transformer.fit_transform(train_data_features)
    tfidfTest = tfidfTest.toarray() 

    nTestDocs = len(tfidfTest)
    GmatTest = np.ones((nTestDocs,nTrainDocs))

    for i in range( 0, nTestDocs ):
        for j in range(0,nTrainDocs):
            GmatTest[i][j] = np.dot(tfidfTest[i], tfidf[j])
    
    return GmatTrain, GmatTest

In [62]:
[wk_train_gram_mat, wk_test_gram_mat] = wkGmats(["science is organized knowledge"], ["wisdom is organized life"])

wk.wk("science is organized knowledge","wisdom is organized life")

Starting WK, creating bag of words...



0.14851129125610232

### NGK kernel

NGK jezgra koristi n-grams. N-grams daju n susjednih slova nekog stringa.

In [63]:
import kernels.ngk

doc1, doc2 = "science is organized knowledge", "wisdom is organized life"
print(ngk.ngk(doc1, doc2))
print(ngk.ngk(doc1, doc2, n=7))

0.38207551689619024
0.24129913647238913


### SSK kernel

Cijeli dokument se promatra kao jedan dugačak sequence. Prostor značajki u ovom slučaju je set svih ne nužno susjednih substringova od k simbola. Dva članka su to sličniji što imaju više zajedničkih takvih substringova.

Sličnost se računa u ovisnosti o lambdi koja mjeri težinu u ovisnosti u duljini i uzastopnosti charactera svakog subsequenca iz jednog dokumenta u drugom.

In [64]:
import kernels.ssk

In [65]:
def ssk_compute_train_gram(docs, kernel=None):
    n = len(docs)
    gram = np.ones((n, n))
    print("SSK compute train gram...")
    for x in range(n):
        #print('{0:.2f}%'.format(x / n * 100))
        for y in range(x + 1, n):
            gram[x, y] = kernel(docs[x], docs[y])
            gram[y, x] = gram[x, y]
    return gram  

In [66]:
def ssk_compute_test_gram(test, train, kernel=None):
    gram = np.zeros((len(test), len(train)))
    print("SSK compute test gram...")
    for x in range(len(test)):
        #print('{0:.2f}%'.format(x / len(test) * 100))
        for y in range(len(train)):
            gram[x, y] = kernel(test[x], train[y])
    return gram

### Priprema podataka
U članku piše da su sve riječi u body-jima svih članaka pretvorene u lowercase. Također, uklonjene su sve *stopwords*, a interpunkcijski znakovi zamijenjeni su razmacima. Također, bitni su nam samo stupci TOPICS i BODY tako da ostale možemo izbaciti.

U članku piše da su zadržali samo stem riječi. Pokušala sam to napraviti, ali javlja neku grešku s nltk --> pokušati opet.

In [67]:
clanci_stripped = pd.read_csv("my_data/data/clanci_stripped.csv", index_col = 0)
clanci_stripped.head()

Unnamed: 0,TOPICS,BODY
0,['cocoa'],showers continued throughout week bahia cocoa zone alleviating drought since early january improving prospects coming temporao although normal humidity levels restored comissaria smith said weekly review dry period means temporao late year arrivals week ended february bags kilos making cumulative total season mln stage last year seems cocoa delivered earlier consignment included arrivals figures comissaria smith said still doubt much old crop cocoa still available harvesting practically come end total bahia crop estimates around mln bags sales standing almost mln hundred thousand bags still hands farmers middlemen exporters processors doubts much cocoa would fit export shippers experiencing dificulties obtaining bahia superior certificates view lower quality recent weeks farmers sold good part cocoa held consignment comissaria smith said spot bean prices rose cruzados per arroba kilos bean shippers reluctant offer nearby shipment limited sales booked march shipment dlrs per tonne ports named new crop sales also light open ports june july going dlrs dlrs new york july aug sept dlrs per tonne fob routine sales butter made march april sold dlrs april may butter went times new york may june july dlrs aug sept dlrs times new york sept oct dec dlrs times new york dec comissaria smith said destinations u covertible currency areas uruguay open ports cake sales registered dlrs march april dlrs may dlrs aug times new york dec oct dec buyers u argentina uruguay convertible currency areas liquor sales limited march april selling dlrs june july dlrs times new york july aug sept dlrs times new york sept oct dec times new york dec comissaria smith said total bahia sales currently estimated mln bags crop mln bags crop final figures period february expected published brazilian cocoa trade commission carnival ends midday february reuter
1,"['grain', 'wheat', 'corn', 'barley', 'oat', 'sorghum']",u agriculture department reported farmer owned reserve national five day average price february follows dlrs bu sorghum cwt natl loan release call avge rate x level price price wheat iv v vi corn iv v x rates natl loan release call avge rate x level price price oats v barley n iv v sorghum iv v reserves ii iii matured level iv reflects grain entered oct feedgrain july wheat level v wheat barley corn sorghum level vi covers wheat entered january x rates dlrs per cwt lbs n available reuter
2,"['veg-oil', 'linseed', 'lin-oil', 'soy-oil', 'sun-oil', 'soybean', 'oilseed', 'corn', 'sunseed', 'grain', 'sorghum', 'wheat']",argentine grain board figures show crop registrations grains oilseeds products february thousands tonnes showing future shipments month total total february brackets bread wheat prev feb march total maize mar total nil sorghum nil nil oilseed export registrations sunflowerseed total soybean may total nil board also detailed export registrations subproducts follows subproducts wheat prev feb march apr total linseed prev feb mar apr total soybean prev feb mar nil apr nil may total sunflowerseed prev feb mar apr total vegetable oil registrations sunoil prev feb mar apr may nil jun total linoil prev feb mar apr total soybean oil prev feb mar nil apr may jun jul total reuter
3,['earn'],champion products inc said board directors approved two one stock split common shares shareholders record april company also said board voted recommend shareholders annual meeting april increase authorized capital stock five mln mln shares reuter
4,['acq'],computer terminal systems inc said completed sale shares common stock warrants acquire additional one mln shares lt sedio n v lugano switzerland dlrs company said warrants exercisable five years purchase price dlrs per share computer terminal said sedio also right buy additional shares increase total holdings pct computer terminals outstanding common stock certain circumstances involving change control company company said conditions occur warrants would exercisable price equal pct common stocks market price time exceed dlrs per share computer terminal also said sold technolgy rights dot matrix impact technology including future improvements lt woodco inc houston tex dlrs said would continue exclusive worldwide licensee technology woodco company said moves part reorganization plan would help pay current operation costs ensure product delivery computer terminal makes computer generated labels forms tags ticket printers terminals reuter


Razdvajamo BODY-je po TOPICS-ima tako da u TOPICS nije lista nego samo jedna vrijednost.

In [68]:
#clanci_stripped = clanci_stripped.loc[:, ["TOPICS", "BODY"]]
clanci_stripped = clanci_stripped.loc[clanci_stripped.TOPICS.notnull(), :]
#print(clanci.head(n=10))

mapa = {'TOPICS': [], 'BODY': []}
for index, row in clanci_stripped.iterrows():
    topics = row.TOPICS.split(",")
    for topic in topics:
        topic = topic.replace('[', "")
        topic = topic.replace("]", "")
        if mapa['TOPICS'] is None:
            mapa['TOPICS'] = [topic]
        else:
            mapa['TOPICS'].append(topic)
            
        body = row.BODY
        unfiltered_body = ""
        for word in body.split(" "):
            if len(word) > 3:
                unfiltered_body+=word
                unfiltered_body+=" "
        filtered_body = ""
        for character in unfiltered_body:
            if (character.isalnum()) or (character == ' '):
                filtered_body += character
        filtered_body = filtered_body.replace('[^a-zA-Z]', " ")
        filtered_body = filtered_body.replace(' [ ]+', ' ')
        
        if mapa['BODY'] is None:
            mapa['BODY'] = [filtered_body]
        else:
            mapa['BODY'].append(filtered_body)
            
dataframe = pd.DataFrame(mapa)
dataframe.to_csv("my_data/data/clanci_split.csv")

In [69]:
dataframe.head()

Unnamed: 0,TOPICS,BODY
0,'cocoa',showers continued throughout week bahia cocoa zone alleviating drought since early january improving prospects coming temporao although normal humidity levels restored comissaria smith said weekly review period means temporao late year arrivals week ended february bags kilos making cumulative total season stage last year seems cocoa delivered earlier consignment included arrivals figures comissaria smith said still doubt much crop cocoa still available harvesting practically come total bahia crop estimates around bags sales standing almost hundred thousand bags still hands farmers middlemen exporters processors doubts much cocoa would export shippers experiencing dificulties obtaining bahia superior certificates view lower quality recent weeks farmers sold good part cocoa held consignment comissaria smith said spot bean prices rose cruzados arroba kilos bean shippers reluctant offer nearby shipment limited sales booked march shipment dlrs tonne ports named crop sales also light open ports june july going dlrs dlrs york july sept dlrs tonne routine sales butter made march april sold dlrs april butter went times york june july dlrs sept dlrs times york sept dlrs times york comissaria smith said destinations covertible currency areas uruguay open ports cake sales registered dlrs march april dlrs dlrs times york buyers argentina uruguay convertible currency areas liquor sales limited march april selling dlrs june july dlrs times york july sept dlrs times york sept times york comissaria smith said total bahia sales currently estimated bags crop bags crop final figures period february expected published brazilian cocoa trade commission carnival ends midday february reuter
1,'grain',agriculture department reported farmer owned reserve national five average price february follows dlrs sorghum natl loan release call avge rate level price price wheat corn rates natl loan release call avge rate level price price oats barley sorghum reserves matured level reflects grain entered feedgrain july wheat level wheat barley corn sorghum level covers wheat entered january rates dlrs available reuter
2,'wheat',agriculture department reported farmer owned reserve national five average price february follows dlrs sorghum natl loan release call avge rate level price price wheat corn rates natl loan release call avge rate level price price oats barley sorghum reserves matured level reflects grain entered feedgrain july wheat level wheat barley corn sorghum level covers wheat entered january rates dlrs available reuter
3,'corn',agriculture department reported farmer owned reserve national five average price february follows dlrs sorghum natl loan release call avge rate level price price wheat corn rates natl loan release call avge rate level price price oats barley sorghum reserves matured level reflects grain entered feedgrain july wheat level wheat barley corn sorghum level covers wheat entered january rates dlrs available reuter
4,'barley',agriculture department reported farmer owned reserve national five average price february follows dlrs sorghum natl loan release call avge rate level price price wheat corn rates natl loan release call avge rate level price price oats barley sorghum reserves matured level reflects grain entered feedgrain july wheat level wheat barley corn sorghum level covers wheat entered january rates dlrs available reuter


## Experimental Results

Ciljevi eksperimenata su:
- proučavati utjecaj promjene parametara k(duljina) i $\lambda$(težina)
- uočiti prednosti kombiniranja različitih jezgri

### Podjela podataka u train i test skup

Eksperimenti su provedeni samo na dijelu Reuters seta. U članku piše da je subset bio veličine 470 dokumenata, od čega je 380 bilo korišteno za treniranje, a 90 za ispitivanje.

U eksperimentu su odabrane kategorije "earn", "acq", "crude" i "corn".


In [70]:
earn_clanci = dataframe[dataframe.TOPICS.str.contains("earn")]
acq_clanci = dataframe[dataframe.TOPICS.str.contains("acq")]
crude_clanci = dataframe[dataframe.TOPICS.str.contains("crude")]
corn_clanci = dataframe[dataframe.TOPICS.str.contains("corn")]

clanci = [earn_clanci, acq_clanci, crude_clanci, corn_clanci]

Navedeno je da je broj članaka za pojedinu kategoriju za učenje (ispitivanje) sljedeći:
1. earn 152 (40)
2. acquisition 114 (25)
3. crude 76 (15)
4. corn 38 (10)

In [71]:
from sklearn.model_selection import train_test_split

def give_train_test_split(give_train):
    [earn_tr, earn_te] = train_test_split(earn_clanci, train_size=152/len(earn_clanci), test_size=40/len(earn_clanci))
    [acq_tr, acq_te] = train_test_split(acq_clanci, train_size=114/len(acq_clanci), test_size=25/len(acq_clanci))
    [crude_tr, crude_te] = train_test_split(crude_clanci, train_size=76/len(crude_clanci), test_size=15/len(crude_clanci))
    [corn_tr, corn_te] = train_test_split(corn_clanci, train_size=38/len(corn_clanci), test_size=10/len(corn_clanci))

    y_train = []
    y_train.extend(['earn' for i in range(0, len(earn_tr))])
    y_train.extend(['acq' for i in range(0, len(acq_tr))])
    y_train.extend(['crude' for i in range(0, len(crude_tr))])
    y_train.extend(['corn' for i in range(0, len(corn_tr))])
    y_train = np.array(y_train)
    #print(y_train)

    y_test = []
    y_test.extend(['earn' for i in range(0, len(earn_te))])
    y_test.extend(['acq' for i in range(0, len(acq_te))])
    y_test.extend(['crude' for i in range(0, len(crude_te))])
    y_test.extend(['corn' for i in range(0, len(corn_te))])
    y_test = np.array(y_test)
    #print(y_test)

    clanci_test = [earn_te, acq_te, crude_te, corn_te]
    clanci_test = pd.concat(clanci_test)
    clanci_train = [earn_tr, acq_tr, crude_tr, corn_tr]
    clanci_train = pd.concat(clanci_train)
    
    earn_test=[]
    acq_test=[]
    crude_test=[]
    corn_test=[]
    for index, row in earn_te.iterrows():
        earn_test.append(row['BODY'])
    for index, row in acq_te.iterrows():
        acq_test.append(row['BODY'])
    for index, row in crude_te.iterrows():
        crude_test.append(row['BODY'])
    for index, row in corn_te.iterrows():
        corn_test.append(row['BODY'])
    
    treniranje_parovi = []
    treniranje = []
    i = 0
    for index, row in clanci_train.iterrows():
        par = []
        par = [row['BODY'], y_train[i]]
        treniranje.append(row['BODY'])
        treniranje_parovi.append(par)

    testiranje = []
    for index, row in clanci_test.iterrows():
        testiranje.append(row['BODY'])
    
    if give_train:
        return [treniranje, y_train]
    else:
        return [earn_test, acq_test, crude_test, corn_test, testiranje, y_test]


### Effectiveness of Varying Sequence Length

U ovom dijelu promatramo kako parametar duljine subsequenca, k, utječe na točnost modela. Za svaku vrijednost k, eksperiment je proveden 10 puta i onda su dobivene vrijednosti mean i sd. Lambda je postavljen na 0.5.


Kako je računanje za SSK jako sporo, kod njega sam izostavila provođenje eksperimenta 10 puta pa se on provodi samo par puta. Za NGK i WK se provodi 10 puta i vrijednosti evaluacije rezultata su uprosječene.

Stvorimo listu u koju stavljamo [category, ime ljuske, length, f1_mean, f1_std, precision_mean, precision_std, recall_mean, recall_std]. Od te liste kasnije stvorimo dataframe.

#### Evaluacija SSK jezgre

In [72]:
# ispisuje tablicu kao sto je u radu
def ssk_evaluation(category, k_range=[5], lambd_range=[0.5], filename="my_data/results/no_file.csv"):
    rezultat_lista = []

    print("Starting SSK evaluation for {}...".format(category))
    for k in k_range:
        for lambd in lambd_range:
            print("\tk = {}, lambda = {}".format(k, lambd))
            lista_u_ovom_koraku = []
            lista_u_ovom_koraku.append(category)
            lista_u_ovom_koraku.append("SSK")
            f1 = []
            precision = []
            recall = []
            for j in range(0, 10):
                [trening, y_trening] = give_train_test_split(True)

                #odabire podskup primjera za ucenje jer ih je inace previse i dugo traje
                treniranje = []
                y_train = []
                for i in range(0, 60):
                    r = random.randint(0, len(trening)-1)
                    treniranje.append(trening[r])
                    y_train.append(y_trening[r])

                ssk_kernel = lambda x, y: ssk.ssk(x, y, k, lambd)
                train_gram = ssk_compute_train_gram(treniranje, kernel=ssk_kernel)
                [earn_test, acq_test, crude_test, corn_test, _, _] = give_train_test_split(False)
                if(category=="earn"):
                    X_test = earn_test
                    y_true = np.array(['earn' for i in range(0, len(earn_test))])
                elif category == "acq":
                    X_test = acq_test
                    y_true = np.array(['acq' for i in range(0, len(acq_test))])
                elif category == "crude":
                    X_test = crude_test
                    y_true = np.array(['crude' for i in range(0, len(crude_test))])
                else:
                    X_test = corn_test
                    y_true = np.array(['corn' for i in range(0, len(corn_test))])

                test_gram = ssk_compute_test_gram(X_test, treniranje, kernel=ssk_kernel)
                clf = SVC(kernel='precomputed')
                clf.fit(train_gram, y_train)
                # predikcija
                y_pred = clf.predict(test_gram)
                f1.append(f1_score(y_true, y_pred, average='micro'))
                precision.append(precision_score(y_true, y_pred, average='micro'))
                recall.append(recall_score(y_true, y_pred, average='micro'))
            # kraj j petlje
        
            if len(k_range) == 1:
                lista_u_ovom_koraku.append(lambd)
            else:
                lista_u_ovom_koraku.append(k)
            
            lista_u_ovom_koraku.append(round(np.mean(f1), 4))
            lista_u_ovom_koraku.append(round(np.std(f1), 4))
            lista_u_ovom_koraku.append(round(np.mean(precision), 4))
            lista_u_ovom_koraku.append(round(np.std(precision), 4))
            lista_u_ovom_koraku.append(round(np.mean(recall), 4))
            lista_u_ovom_koraku.append(round(np.std(recall), 4))
        
            rez = pd.DataFrame([lista_u_ovom_koraku])
            rez.to_csv(filename, mode='a', header=False)       
        
            rezultat_lista.append(lista_u_ovom_koraku)
    print("End of SSK evaluation\n")
    return rezultat_lista


#### Evaluacija NGK jezgre

In [73]:
def ngk_evaluation(category, k_range=[5], filename="my_data/results/no_file.csv"):
    rezultat_lista = []
    
    print("Starting NGK evaluation for {}...".format(category))
    for k in k_range:
        lista_u_ovom_koraku = []
        lista_u_ovom_koraku.append(category)
        lista_u_ovom_koraku.append("NGK")
        print("\t k =", k)
        f1 = []
        precision = []
        recall = []
        for i in range(0, 10):
            #treniranje jezgre
            [trening, y_trening] = give_train_test_split(True)
            
            #odabire podskup primjera za ucenje jer ih je inace previse i dugo traje
            treniranje = []
            y_train = []
            for i in range(0, 60):
                r = random.randint(0, len(trening)-1)
                treniranje.append(trening[r])
                y_train.append(y_trening[r])
        
            [earn_test, acq_test, crude_test, corn_test, _, _] = give_train_test_split(False)
            if(category=="earn"):
                X_test = earn_test
                y_true = np.array(['earn' for i in range(0, len(earn_test))])
            elif category == "acq":
                X_test = acq_test
                y_true = np.array(['acq' for i in range(0, len(acq_test))])
            elif category == "crude":
                X_test = crude_test
                y_true = np.array(['crude' for i in range(0, len(crude_test))])
            else:
                X_test = corn_test
                y_true = np.array(['corn' for i in range(0, len(corn_test))])

            train_gram, test_gram = ngk.ngkGmats(treniranje, X_test, n=k)
            clf = SVC(kernel='precomputed')
            clf.fit(train_gram, y_train)
            y_pred = clf.predict(test_gram)
            f1.append(round(f1_score(y_true, y_pred, average='micro'), 3))
            precision.append(round(precision_score(y_true, y_pred, average='micro'), 3))
            recall.append(round(recall_score(y_true, y_pred, average='micro'), 3))
        
        if len(k_range) == 1:
            lista_u_ovom_koraku.append(0)
        else:
            lista_u_ovom_koraku.append(k)
        lista_u_ovom_koraku.append(round(np.mean(f1), 4))
        lista_u_ovom_koraku.append(round(np.std(f1), 4))
        lista_u_ovom_koraku.append(round(np.mean(precision), 4))
        lista_u_ovom_koraku.append(round(np.std(precision), 4))
        lista_u_ovom_koraku.append(round(np.mean(recall), 4))
        lista_u_ovom_koraku.append(round(np.std(recall), 4))
        
        rez = pd.DataFrame([lista_u_ovom_koraku])
        rez.to_csv(filename, mode='a', header=False)      
        
        rezultat_lista.append(lista_u_ovom_koraku)
        #print(lista_u_ovom_koraku)
    print("End of NGK evaluation\n")
    return rezultat_lista
        

#### Evaluacija WK jezgre

In [74]:
def wk_evaluation(category, filename="my_data/results/no_file.csv"):
    rezultat_lista = []
    rezultat_lista.append(category)
    rezultat_lista.append("WK")

    f1 = []
    precision = []
    recall = []
    print("Starting WK evaluation for {}...".format(category))
    for i in range(0, 10):
        #print(category, i)
        #treniranje jezgre
        
        [trening, y_trening] = give_train_test_split(True)
            
        #odabire podskup primjera za ucenje jer ih je inace previse i dugo traje
        treniranje = []
        y_train = []
        for i in range(0, 60):
            r = random.randint(0, len(trening)-1)
            treniranje.append(trening[r])
            y_train.append(y_trening[r])
        
        [earn_test, acq_test, crude_test, corn_test, _, _] = give_train_test_split(False)
        if(category=="earn"):
            X_test = earn_test
            y_true = np.array(['earn' for i in range(0, len(earn_test))])
        elif category == "acq":
            X_test = acq_test
            y_true = np.array(['acq' for i in range(0, len(acq_test))])
        elif category == "crude":
            X_test = crude_test
            y_true = np.array(['crude' for i in range(0, len(crude_test))])
        else:
            X_test = corn_test
            y_true = np.array(['corn' for i in range(0, len(corn_test))])

        wk_kernel = lambda x, y: wk(x, y)
        [wk_train_gram, wk_test_gram] = wkGmats(treniranje, X_test)
        clf = SVC(kernel='precomputed')
        clf.fit(wk_train_gram, y_train)
        y_pred = clf.predict(wk_test_gram)
        f1.append(round(f1_score(y_true, y_pred, average='micro'), 3))
        precision.append(round(precision_score(y_true, y_pred, average='micro'), 3))
        recall.append(round(recall_score(y_true, y_pred, average='micro'), 3))
        
    rezultat_lista.append(0)
    rezultat_lista.append(round(np.mean(f1), 4))
    rezultat_lista.append(round(np.std(f1), 4))
    rezultat_lista.append(round(np.mean(precision), 4))
    rezultat_lista.append(round(np.std(precision), 4))
    rezultat_lista.append(round(np.mean(recall), 4))
    rezultat_lista.append(round(np.std(recall), 4))
    
    rez = pd.DataFrame([rezultat_lista])
    rez.to_csv(filename, mode='a', header=False) 
    
    print("End of WK evaluation\n")
    return [rezultat_lista]
        

In [None]:
# k_range=[3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
def evaluation_for_varying_sequence_lengths():
    filename = "my_data/results/varying_sequence_length.csv"
    rezultat = []
    varying_sequence_length = pd.DataFrame(rezultat, columns=["category", "ime ljuske", "k", "f1_mean", "f1_std", "precision_mean", "precision_std", "recall_mean", "recall_std"])
    varying_sequence_length.to_csv(filename)
    
    rezultat = ngk_evaluation("earn", k_range=[*range(3, 10+1, 1)], filename=filename)
    
    rezultat = ssk_evaluation("earn", k_range=[*range(3, 10+1, 1)], filename=filename)
    
    rezultat = wk_evaluation("earn", filename=filename)
    
    rezultat = ngk_evaluation("acq", k_range=[*range(3, 10+1, 1)], filename=filename)
    
    rezultat = ssk_evaluation("acq", k_range=[*range(3, 10+1, 1)], filename=filename)
    
    rezultat = wk_evaluation("acq", filename=filename)
    
    rezultat = ngk_evaluation("crude", k_range=[*range(3, 10+1, 1)], filename=filename)
    
    rezultat = ssk_evaluation("crude", k_range=[*range(3, 10+1, 1)], filename=filename)
    
    rezultat = wk_evaluation("crude", filename=filename)
    
    rezultat = ngk_evaluation("corn", k_range=[*range(3, 10+1, 1)], filename=filename)
    
    rezultat = ssk_evaluation("corn", k_range=[*range(3, 10+1, 1)], filename=filename)

    rezultat = wk_evaluation("corn", filename=filename)
    
evaluation_for_varying_sequence_lengths()

Starting NGK evaluation for earn...
	 k = 3
	 k = 4
	 k = 5
	 k = 6
	 k = 7
	 k = 8
	 k = 9
	 k = 10
End of NGK evaluation

Starting SSK evaluation for earn...
	k = 3, lambda = 0.5
SSK compute train gram...
SSK compute test gram...
SSK compute train gram...
SSK compute test gram...
SSK compute train gram...
SSK compute test gram...
SSK compute train gram...
SSK compute test gram...
SSK compute train gram...
SSK compute test gram...
SSK compute train gram...
SSK compute test gram...
SSK compute train gram...
SSK compute test gram...
SSK compute train gram...
SSK compute test gram...
SSK compute train gram...
SSK compute test gram...
SSK compute train gram...
SSK compute test gram...
	k = 4, lambda = 0.5
SSK compute train gram...
SSK compute test gram...
SSK compute train gram...
SSK compute test gram...
SSK compute train gram...
SSK compute test gram...
SSK compute train gram...
SSK compute test gram...
SSK compute train gram...
SSK compute test gram...
SSK compute train gram...
SSK com

#### Usporedba rezultata

Radi brzine izvođenja, nisam računala za sve k kao što je u znanstvenom radu. Ali se može uočiti da SSK najbolje radi za male i srednje velike k (otprilike 4-7). Parametar k može se postaviti unaprijed unakrsnom provjerom tako da maksimizira točnost (minimizira pogrešku) na skupu za provjeru.

In [None]:
filename = "my_data/results/varying_sequence_length.csv"
rezultat = pd.read_csv(filename, index_col=[0, 1, 2, 3])
display(rezultat)
#pd.reset_option('all')

### Effectiveness of Varying Weight Decay Factors

Sada ispitujemo model za promjenjive vrijednosti lambde. Parametar $\lambda$ upravlja "kažnjavanjem" ne-susjednih substringova. Što su stringovi "ne-susjedniji" u člancima, to su više kažnjeni.

Pozivamo iste funkcije kao i za Varying Sequence Length, ali ovaj puta predajemo niz lambdi (weight decay factor) za koje testiramo model.

Za svaku vrijednost $\lambda$, eksperiment je proveden 10 puta i onda su dobivene vrijednosti mean i sd. Parametar k je postavljen na 5.


In [None]:
rezultat_lista = []
def evaluation_for_varying_weight_decay_factors():
    filename = "my_data/results/varying_weight_decay.csv"
    rezultat = []
    varying_weight_decay = pd.DataFrame(rezultat, columns=["category", "ime ljuske", "lambda", "f1_mean", "f1_std", "precision_mean", "precision_std", "recall_mean", "recall_std"])
    varying_weight_decay.to_csv(filename)
    
    rezultat = ngk_evaluation("earn", filename=filename)
    
    rezultat = ssk_evaluation("earn", lambd_range=[*range(0.01, 1, 0.2)], filename=filename)
    
    rezultat = wk_evaluation("earn", filename=filename)
    
    rezultat = ngk_evaluation("acq", filename=filename)
    
    rezultat = ssk_evaluation("acq", lambd_range=[*range(0.01, 1, 0.2)], filename=filename)
    
    rezultat = wk_evaluation("acq", filename=filename)
    
    rezultat = ngk_evaluation("crude", filename=filename)
    
    rezultat = ssk_evaluation("crude", lambd_range=[*range(0.01, 1, 0.2)], filename=filename)
    
    rezultat = wk_evaluation("crude", filename=filename)
    
    rezultat = ngk_evaluation("corn", filename=filename)
    
    rezultat = ssk_evaluation("corn", lambd_range=[*range(0.01, 1, 0.2)], filename=filename)
    
    rezultat = wk_evaluation("corn", filename=filename)

evaluation_for_varying_weight_decay_factors()

#### Usporedba rezultata

Moji rezultati se i ne poklapaju baš s onima danim u znanstvenom radu. Kod njih za sve kategorije osim 'corn' preciznost postiže vrhunac za veće vrijednosti lambde. Kod kategorije 'corn', koja postiže najveću preciznost za lambda=0.3, vidimo da povećanje lambde ne znači nužno i povećanje preciznosti.

In [None]:
filename = "my_data/results/varying_weight_decay.csv"
rezultat = pd.read_csv(filename, index_col=[0, 1, 2, 3])
display(rezultat)

### Effectiveness of Combining Kernels

Sljedeće se promatra ako kombinacija jezgri pomaže generalizaciji modela, odnosno je li model točniji na neviđenim podacima.

#### Combining NGK and SSK

Ovdje smo kombinirali NGK i SSK jezgre. Koristili smo njihovu težinsku sumu tako da w_ng predstavlja utjecaj NGK, a w_sk utjecaj SSK u sumi. Parametri k i lambda postavljeni su na 5 i 0.5

In [None]:
k = 5
lambd = 0.5
def NGK_SSK_comb_evaluation(category, filename="my_data/results/no_file.csv"):
    rezultat_lista = []
    w_ng_list = [1, 0.5, 0.8, 0.9] #[1, 0, 0.5, 0.6, 0.7, 0.8, 0.9]
    w_sk_list = [0, 0.5, 0.2, 0.1] #[0, 1, 0.5, 0.4, 0.3, 0.2, 0.1]
    print("Starting NGK_SSK_COMB evaluation...")
    for i in range(0, len(w_ng_list)):
        w_ng = w_ng_list[i]
        w_sk = w_sk_list[i]
        lista_u_ovom_koraku = []
        lista_u_ovom_koraku.append(category)
        lista_u_ovom_koraku.append(w_ng)
        lista_u_ovom_koraku.append(w_sk)
        print("\tw_ng={} w_sk={}".format(w_ng, w_sk))
        f1 = []
        precision = []
        recall = []
        for j in range(0, 3):
            #treniranje jezgre
            [trening, y_trening] = give_train_test_split(True)

            #odabire podskup primjera za ucenje jer ih je inace previse i dugo traje
            treniranje = []
            y_train = []
            for it in range(0, 60):
                r = random.randint(0, len(trening)-1)
                treniranje.append(trening[r])
                y_train.append(y_trening[r])

            [earn_test, acq_test, crude_test, corn_test, _, _] = give_train_test_split(False)
            if(category=="earn"):
                X_test = earn_test
                y_true = np.array(['earn' for i in range(0, len(earn_test))])
            elif category == "acq":
                X_test = acq_test
                y_true = np.array(['acq' for i in range(0, len(acq_test))])
            elif category == "crude":
                X_test = crude_test
                y_true = np.array(['crude' for i in range(0, len(crude_test))])
            else:
                X_test = corn_test
                y_true = np.array(['corn' for i in range(0, len(corn_test))])

            ssk_kernel = lambda x, y: ssk.ssk(x, y, 5, 0.5)
            ssk_train_gram = ssk_compute_train_gram(treniranje, kernel=ssk_kernel)
            ssk_test_gram = ssk_compute_test_gram(X_test, treniranje, kernel=ssk_kernel)
            ngk_train_gram, ngk_test_gram = ngk.ngkGmats(treniranje, X_test, n=5)

            test_gram = ngk_test_gram*w_ng + ssk_test_gram*w_sk
            train_gram = ngk_train_gram*w_ng + ssk_train_gram*w_sk

            clf = SVC(kernel='precomputed')
            clf.fit(train_gram, y_train)
            y_pred = clf.predict(test_gram)
            f1.append(f1_score(y_true, y_pred, average='micro'))
            precision.append(precision_score(y_true, y_pred, average='micro'))
            recall.append(recall_score(y_true, y_pred, average='micro'))
        # kraj for j petlje
        
        lista_u_ovom_koraku.append(round(np.mean(f1), 3))
        lista_u_ovom_koraku.append(round(np.std(f1), 3))
        lista_u_ovom_koraku.append(round(np.mean(precision), 3))
        lista_u_ovom_koraku.append(round(np.std(precision), 3))
        lista_u_ovom_koraku.append(round(np.mean(recall), 3))
        lista_u_ovom_koraku.append(round(np.std(recall), 3))
        
        rez = pd.DataFrame([lista_u_ovom_koraku])
        rez.to_csv(filename, mode='a', header=False)   
        
        #print(lista_u_ovom_koraku)
        rezultat_lista.append(lista_u_ovom_koraku)
        
    print("Ending NGK_SSK_COMB evaluation\n")
    return rezultat_lista


In [None]:
def evaluation_for_combining_ngk_and_ssk():
    rezultat = []
    filename = "my_data/results/combining_ngk_and_ssk.csv"
    combining_ngk_and_ssk = pd.DataFrame(rezultat, columns=["category", "w_ng", "w_sk", "f1_mean", "f1_std", "precision_mean", "precision_std", "recall_mean", "recall_std"])
    combining_ngk_and_ssk.to_csv()
    
    rezultat = NGK_SSK_comb_evaluation("earn", filename)
    
    rezultat = NGK_SSK_comb_evaluation("acq", filename)
    
    rezultat = NGK_SSK_comb_evaluation("crude", filename)
    
    rezultat = NGK_SSK_comb_evaluation("corn", filename)

evaluation_for_combining_ngk_and_ssk()

#### Usporedba rezultata

In [None]:
filename = "my_data/results/combining_ngk_and_ssk.csv"
rezultat = pd.read_csv(filename, index_col=[0, 1, 2, 3])
display(rezultat)

#### Combining SSK with different weight decay factors

Sljedeće smo proučavali kombiniranje dvije SSK jezgre s različitim lambdama.

In [183]:
def SSK_lambda_comb_evaluation(category, filename="my_data/results/no_file.csv"):
    rezultat_lista = []
    lambda_1_list = [0.05, 0.5, 0.05]
    lambda_2_list = [0.0, 0.0, 0.5]
    print("Starting combining SSK with differend lambdas for {}...".format(category))
    for i in range(0, len(lambda_1_list)):
        lambda1 = lambda_1_list[i]
        lambda2 = lambda_2_list[i]
        lista_u_ovom_koraku = []
        lista_u_ovom_koraku.append(category)
        lista_u_ovom_koraku.append(lambda1)
        lista_u_ovom_koraku.append(lambda2)
        print("\tlambda_1={} lambda_2={}".format(lambda1, lambda2))
        f1 = []
        precision = []
        recall = []
        #for j in range(0, 10):
        #treniranje jezgre
        [trening, y_trening] = give_train_test_split(True)
            
        #odabire podskup primjera za ucenje jer ih je inace previse i dugo traje
        treniranje = []
        y_train = []
        for i in range(0, 60):
            r = random.randint(0, len(trening)-1)
            treniranje.append(trening[r])
            y_train.append(y_trening[r])
            
        [earn_test, acq_test, crude_test, corn_test, _, _] = give_train_test_split(False)
        if(category=="earn"):
            X_test = earn_test
            y_true = np.array(['earn' for i in range(0, len(earn_test))])
        elif category == "acq":
            X_test = acq_test
            y_true = np.array(['acq' for i in range(0, len(acq_test))])
        elif category == "crude":
            X_test = crude_test
            y_true = np.array(['crude' for i in range(0, len(crude_test))])
        else:
            X_test = corn_test
            y_true = np.array(['corn' for i in range(0, len(corn_test))])
        
        SSKTrainGram = np.zeros((len(treniranje),len(treniranje)))
        for m in range(0, len(treniranje)):
            for n in range(0, len(treniranje)):
                SSKTrainGram[m][n] = ssk.ssk(treniranje[m], treniranje[n], 5, lambda1) + ssk.ssk(treniranje[m], treniranje[n], 5, lambda2) 
        
        SSKTestGram = np.zeros((len(X_test),len(X_test)))
        for m in range(0, len(X_test)):
            for n in range(0, len(X_test)):
                SSKTrainGram[m][n] = ssk.ssk(X_test[m], X_test[n], 5, lambda1) + ssk.ssk(treniranje[m], treniranje[n], 5, lambda2)

        clf = SVC(kernel='precomputed')
        clf.fit(SSKTrainGram, y_train)
        y_pred = clf.predict(SSKTestGram)
        f1.append(f1_score(y_true, y_pred, average='micro'))
        precision.append(precision_score(y_true, y_pred, average='micro'))
        recall.append(recall_score(y_true, y_pred, average='micro'))
        # kraj j petlje
        
        lista_u_ovom_koraku.append(round(np.mean(f1), 3))
        lista_u_ovom_koraku.append(round(np.std(f1), 3))
        lista_u_ovom_koraku.append(round(np.mean(precision), 3))
        lista_u_ovom_koraku.append(round(np.std(precision), 3))
        lista_u_ovom_koraku.append(round(np.mean(recall), 3))
        lista_u_ovom_koraku.append(round(np.std(recall), 3))
        
        rez = pd.DataFrame([lista_u_ovom_koraku])
        rez.to_csv(filename, mode='a', header=False)
        
        #print(lista_u_ovom_koraku)
        rezultat_lista.append(lista_u_ovom_koraku)
    
    print("End of combining SSK with differend lambdas\n")
    return rezultat_lista


In [185]:
def evaluation_for_combining_lambda_ssk():
    rezultat = []
    filename = "my_data/results/combining_lambda_ssk.csv"
    combining_lambda_ssk = pd.DataFrame(rezultat, columns=["category", "lambda_1", "lambda_2", "f1_mean", "f1_std", "precision_mean", "precision_std", "recall_mean", "recall_std"])
    combining_lambda_ssk.to_csv(filename)
    
    rezultat = SSK_lambda_comb_evaluation("earn", filename)
    
    rezultat = SSK_lambda_comb_evaluation("acq", filename)
    
    rezultat = SSK_lambda_comb_evaluation("crude", filename)
    
    rezultat = SSK_lambda_comb_evaluation("corn", filename)

evaluation_for_combining_lambda_ssk()

Starting combining SSK with differend lambdas for earn...
	lambda_1=0.05 lambda_2=0.0


ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

#### Usporedba rezultata

Ovaj dio iz nekog razloga baca grešku...

### Combining kernels of different lengths

#### Usporedba rezultata

### Moji pokušaji implementacija jezgri koji su bili krivi/prespori

In [53]:
## MOJ SSK -> radi za onaj mali uvodni primjer
import itertools

# SSK - string subsequence kernel
def is_subsequence(subsequence, word):
    iterator = iter(word)
    if all(c in iterator for c in subsequence):
        return True
    else:
        return False

def ssk_kernel(string1, string2, k=2, lambd=1):
    stupci = []
    tablica = {}

    for word in [string1, string2]:
        letters = list(word)
        for combination in itertools.combinations(letters, k): # nalazi sve kombinacije slova u letters duljine k
            s = ''.join(combination)
            if s not in stupci:
                stupci.append(s)
    
    #print(stupci)

    for word in [string1, string2]:
        tablica[word] = [0 for i in range(len(stupci))]
        subsequence_index = 0
        for stupac in stupci:
            if is_subsequence(stupac, word):
                cell_rez = 1
                index_slova_rijeci = 0
                for index_slova_stupca in range(len(stupac) - 1):
                    cell_rez += word.index(stupac[index_slova_stupca+1], word.index(stupac[index_slova_stupca])+1)-word.index(stupac[index_slova_stupca])
                tablica[word][subsequence_index] = pow(lambd, cell_rez)
                #print(word, tablica[word])
                # res += i.index(j[ki+1], i.index(j[ki])+1)-i.index(j[ki])
            subsequence_index += 1
    red_1 = np.array(tablica[string1])
    red_2 = np.array(tablica[string2])
    
    rez_1 = np.sum(red_1*red_2.T)
    rez_2 = np.sum(red_1*red_1.T)
    rez_3 = np.sum(red_2*red_2.T)
    rez = rez_1/pow(rez_2*rez_3, 0.5)
    return rez

print(ssk_kernel("cat","car", lambd=2))

0.16666666666666666


In [None]:
# NGK - n-grams kernel
# NGK is a linear kernel that returns a similarity score between documents
# that are indexed by n-grams
# vrijednost jezgrene funkcije
def ngk(string1, string2):
    def ngrams(string):
        ngrams = set(())
        for n in range(1, len(string)+1):
            ngrams_helper = zip(*[string[i:] for i in range(n)])
            for ngram in ngrams_helper:
                ngrams.add(''.join(ngram))
        #print(ngrams)
        return ngrams
    
    ngrams_1 = ngrams(string1) # racuna ngrams za prvi dokument
    ngrams_2 = ngrams(string2) # racuna ngrams za drugi dokument
    
    # usporeduje broj jednakih ngrams oba dokumenta
    intercept_rez = ngrams_1.intersection(ngrams_2)
    num_common = len(intercept_rez)
    
    rez = num_common/(len(ngrams_1)+len(ngrams_2))
    rez = rez/0.5 #skaliranje
    return rez

def ngk_kernel(X1, X2):
    kernel_matrix = np.zeros([len(X1), len(X2)])
    for i in range(0, len(X1)):
        for j in range(0, len(X2)):
            kernel_matrix[i][j] = ngk(X1[i], X1[j])
    return kernel_matrix

print(ngk_kernel("car","cat"))