Sveučilište u Zagrebu<br>
Fakultet elektrotehnike i računarstva

## Introduction to Data Science

# Replication of Results

In [49]:
import numpy as np
import math
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.model_selection import cross_val_score
from sklearn.feature_selection import SelectKBest, mutual_info_classif
import matplotlib.pyplot as plt
from collections import Counter
import warnings
from sklearn.metrics import f1_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.svm import SVC
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
#warnings.filterwarnings("ignore")

In [16]:
!pip install swig



The paper describes a new way of classifying articles based on kernel functions. A kernel function computes an inner product in a feature space. Kernel functions are used for classifying articles because articles are hard to vectorize, that is, to convert into feature vectors.

The paper describes an implementation of the SSK kernel, which reportedly performs better than the standard NGK and WK kernels.

## Kernels

Kernel functions compute the similarity between two examples. The similarity is an inner product in a feature space: the examples are mapped into that space only implicitly, and the product is taken there.
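
As a small illustration of this idea (not taken from the paper), the sketch below uses the quadratic kernel $K(x, y) = (x \cdot y)^2$ on plain numeric vectors. The names `quadratic_kernel` and `explicit_features` are chosen here for illustration only. The kernel value computed directly from the inputs matches the inner product of the explicitly built feature vectors, even though the kernel never constructs them.

```python
from itertools import product

def dot(x, y):
    """Plain inner product of two equally long vectors."""
    return sum(a * b for a, b in zip(x, y))

def quadratic_kernel(x, y):
    """Kernel value computed directly from the inputs, without building features."""
    return dot(x, y) ** 2

def explicit_features(x):
    """Explicit feature map of the quadratic kernel: all pairwise products x_i * x_j."""
    return [a * b for a, b in product(x, repeat=2)]

x = [1.0, 2.0, 3.0]
y = [0.5, -1.0, 2.0]

implicit = quadratic_kernel(x, y)
explicit = dot(explicit_features(x), explicit_features(y))
print(implicit, explicit)  # both values are 20.25
```

The string kernels below follow the same pattern, except that the feature space is indexed by words, n-grams or subsequences instead of pairs of coordinates.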

### WK kernel

The standard approach to text classification maps a text to a high-dimensional vector in which each element indicates the presence or absence of some feature. This approach loses all information about word order and keeps only information about how often terms occur in the document.

For example:
s = "science is organized knowledge"
t = "wisdom is organized life"

feature vector = ["science", "is", "organized", "knowledge", "wisdom", "life"]

fi_1 = [1, 1, 1, 1, 0, 0]
fi_2 = [0, 1, 1, 0, 1, 1]

K(s, t) = [1, 1, 1, 1, 0, 0]*[0, 1, 1, 0, 1, 1] = 2
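
The example above can be reproduced with a few lines of Python. This is only a sketch with binary presence indicators and a hand-written vocabulary; the helper name `binary_bag_of_words` is an assumption made for illustration. The notebook's `wkGmats` uses tf-idf weights and removes English stop words instead, so its values differ from this simple count.

```python
def binary_bag_of_words(doc, vocabulary):
    """Return a 0/1 vector marking which vocabulary terms occur in the document."""
    words = set(doc.split())
    return [1 if term in words else 0 for term in vocabulary]

s = "science is organized knowledge"
t = "wisdom is organized life"
vocabulary = ["science", "is", "organized", "knowledge", "wisdom", "life"]

fi_1 = binary_bag_of_words(s, vocabulary)
fi_2 = binary_bag_of_words(t, vocabulary)

# The kernel value is the inner product of the two feature vectors.
k_st = sum(a * b for a, b in zip(fi_1, fi_2))
print(fi_1, fi_2, k_st)  # [1, 1, 1, 1, 0, 0] [0, 1, 1, 0, 1, 1] 2
```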

In [52]:
import wk

In [64]:
wk_kernel = lambda x, y: wk.wk(x, y)

def wkGmats(trainDocs, testDocs):
    """Return the (train x train) and (test x train) Gram matrices of the WK kernel."""
    # The default "word" analyzer lowercases the text, treats punctuation as a
    # separator and tokenizes on words; it does not remove markup tokens.
    # stop_words should be "english" if not using clean_input_docs().
    vectorizer = CountVectorizer(analyzer="word",
                                 tokenizer=None,
                                 preprocessor=None,
                                 stop_words="english", input='content')
    transformer = TfidfTransformer(smooth_idf=True)

    # Fit the vocabulary and the IDF weights on the training documents only.
    tfidf = transformer.fit_transform(vectorizer.fit_transform(trainDocs)).toarray()
    # Reuse the fitted vocabulary and IDF weights for the test documents
    # instead of refitting them on the test data.
    tfidfTest = transformer.transform(vectorizer.transform(testDocs)).toarray()

    # Entry [i][j] is the dot product of the tf-idf vectors of documents i and j.
    GmatTrain = tfidf @ tfidf.T
    GmatTest = tfidfTest @ tfidf.T

    return GmatTrain, GmatTest

In [69]:
[wk_train_gram_mat, wk_test_gram_mat] = wkGmats(["science is organized knowledge"], ["wisdom is organized life"])

wk.wk("science is organized knowledge","wisdom is organized life")

Starting WK, creating bag of words...



0.14851129125610232

### NGK kernel

The NGK kernel uses n-grams. The n-grams of a string are its sequences of n adjacent characters.
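
To make the feature space concrete, here is a minimal sketch of character n-gram extraction and an unnormalized n-gram kernel computed as the inner product of n-gram counts. The functions `char_ngrams` and `ngram_kernel` are illustrative assumptions; the `ngk` module used in this notebook is not shown here and may weight or normalize its values differently.

```python
from collections import Counter

def char_ngrams(text, n):
    """Return all substrings of n adjacent characters, in order of appearance."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def ngram_kernel(s, t, n):
    """Unnormalized n-gram kernel: inner product of the n-gram count vectors."""
    counts_s = Counter(char_ngrams(s, n))
    counts_t = Counter(char_ngrams(t, n))
    # A Counter returns 0 for n-grams that do not occur in t.
    return sum(count * counts_t[gram] for gram, count in counts_s.items())

print(char_ngrams("science", 3))     # ['sci', 'cie', 'ien', 'enc', 'nce']
print(ngram_kernel("abab", "bab", 2))  # 3
```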

In [None]:
import ngk

doc1, doc2 = "science is organized knowledge", "wisdom is organized life"
print(ngk.ngk(doc1, doc2))
print(ngk.ngk(doc1, doc2, n=7))

### SSK kernel

The whole document is treated as one long sequence. The feature space in this case is the set of all subsequences of k symbols, which need not be contiguous. Two articles are more similar the more such subsequences they share.

The similarity depends on lambda, a decay factor that weights every occurrence of a subsequence according to how long a stretch of the document it spans: the less contiguous its characters are, the smaller its contribution.
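
The following brute-force sketch illustrates this definition on very short strings. It assumes the usual weighting in which an occurrence spanning `span` characters contributes `lam ** span`, and it enumerates all index combinations, so its cost grows combinatorially with the string length. The names `ssk_features`, `naive_ssk` and `naive_ssk_normalized` are hypothetical; the `ssk` module imported below is not shown in this notebook and a practical implementation would use a much more efficient recursion.

```python
import math
from collections import defaultdict
from itertools import combinations

def ssk_features(s, k, lam):
    """Map each length-k subsequence of s to the sum of lam**span over its occurrences."""
    features = defaultdict(float)
    for idx in combinations(range(len(s)), k):
        subsequence = "".join(s[i] for i in idx)
        span = idx[-1] - idx[0] + 1  # stretch of s covered by this occurrence
        features[subsequence] += lam ** span
    return features

def naive_ssk(s, t, k, lam):
    """Unnormalized kernel: inner product of the two subsequence feature vectors."""
    fs, ft = ssk_features(s, k, lam), ssk_features(t, k, lam)
    return sum(weight * ft[u] for u, weight in fs.items() if u in ft)

def naive_ssk_normalized(s, t, k, lam):
    """Kernel scaled so that every string has similarity 1 with itself."""
    return naive_ssk(s, t, k, lam) / math.sqrt(
        naive_ssk(s, s, k, lam) * naive_ssk(t, t, k, lam))

lam = 0.5
print(naive_ssk("cat", "car", 2, lam))             # lam**4
print(naive_ssk_normalized("cat", "car", 2, lam))  # 1 / (2 + lam**2)
```

For k = 2 the strings "cat" and "car" share only the subsequence "ca", which is contiguous in both, so the unnormalized value is $\lambda^2 \cdot \lambda^2 = \lambda^4$.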

In [None]:
import ssk

In [None]:
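def ssk_compute_train_gram(docs, kernel=None):
    # kernel must be a function of two documents; the default None is only a placeholder.
    n = len(docs)
    # The diagonal stays at 1, which assumes a normalized kernel with K(x, x) = 1.
    gram = np.ones((n, n))
    for x in range(n):
        # Progress as a percentage of processed rows.
        print('{0:.2f}%'.format(100 * x / n))
        # The Gram matrix is symmetric, so only the upper triangle is computed.
        for y in range(x + 1, n):
            gram[x, y] = kernel(docs[x], docs[y])
            gram[y, x] = gram[x, y]
    return gram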

In [None]:
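def ssk_compute_test_gram(test, train, kernel=None):
    # Rows correspond to test documents and columns to training documents.
    gram = np.zeros((len(test), len(train)))
    for x in range(len(test)):
        # Progress as a percentage of processed test documents.
        print('{0:.2f}%'.format(100 * x / len(test)))
        for y in range(len(train)):
            gram[x, y] = kernel(test[x], train[y])
    return gram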

### Data preparation
According to the paper, all words in the bodies of all articles were converted to lowercase, all *stopwords* were removed, and punctuation marks were replaced with spaces. Only the TOPICS and BODY columns matter here, so the others can be dropped.

The paper also says that only word stems were kept. I tried to do this, but nltk raised an error, so this step still has to be retried.
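
The cleaning steps described above (without stemming) could look roughly like the sketch below. The tiny `STOPWORDS` set and the function `clean_body` are illustrative assumptions, not the code that produced `clanci_stripped.csv`; a real run would use a complete stop-word list.

```python
import string

# Tiny illustrative stop-word list; a real run would use a full list.
STOPWORDS = {"the", "a", "an", "of", "in", "is", "and", "to", "said"}

def clean_body(text, stopwords=STOPWORDS):
    """Lowercase the text, replace punctuation with spaces and drop stop words."""
    text = text.lower()
    # Map every ASCII punctuation character to a space.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    text = text.translate(table)
    words = [word for word in text.split() if word not in stopwords]
    return " ".join(words)

cleaned = clean_body("Showers continued throughout the week in the Bahia cocoa zone, the company said.")
print(cleaned)  # showers continued throughout week bahia cocoa zone company
```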

In [17]:
clanci_stripped = pd.read_csv("clanci_stripped.csv", index_col = 0)
clanci_stripped.head()

Unnamed: 0,TOPICS,BODY
0,['cocoa'],showers continued throughout week bahia cocoa ...
1,"['grain', 'wheat', 'corn', 'barley', 'oat', 's...",u agriculture department reported farmer owned...
2,"['veg-oil', 'linseed', 'lin-oil', 'soy-oil', '...",argentine grain board figures show crop regist...
3,['earn'],champion products inc said board directors app...
4,['acq'],computer terminal systems inc said completed s...


We split the articles by topic, so that TOPICS holds a single value instead of a list and the BODY is repeated once for each of its topics.

In [37]:
import re

#clanci_stripped = clanci_stripped.loc[:, ["TOPICS", "BODY"]]
clanci_stripped = clanci_stripped.loc[clanci_stripped.TOPICS.notnull(), :]

mapa = {'TOPICS': [], 'BODY': []}
for index, row in clanci_stripped.iterrows():
    # Keep only words longer than 3 characters.
    body = " ".join(word for word in row.BODY.split(" ") if len(word) > 3)
    # Keep only letters, digits and spaces.
    filtered_body = "".join(ch for ch in body if ch.isalnum() or ch == ' ')
    # str.replace does not understand regular expressions, so re.sub is used:
    # replace the remaining non-letters with spaces and collapse runs of spaces.
    filtered_body = re.sub(r'[^a-zA-Z ]', ' ', filtered_body)
    filtered_body = re.sub(r' +', ' ', filtered_body)

    # TOPICS is stored as the text of a list, e.g. "['grain', 'wheat']".
    for topic in row.TOPICS.split(","):
        # Remove the brackets; the quotes around the topic name are kept.
        topic = topic.replace('[', "").replace("]", "").strip()
        mapa['TOPICS'].append(topic)
        # The same body is repeated once for every topic of the article.
        mapa['BODY'].append(filtered_body)

dataframe = pd.DataFrame(mapa)
dataframe.to_csv("clanci_split.csv")
dataframe.head(n=10)

Unnamed: 0,TOPICS,BODY
0,'cocoa',showers continued throughout week bahia cocoa ...
1,'grain',agriculture department reported farmer owned r...
2,'wheat',agriculture department reported farmer owned r...
3,'corn',agriculture department reported farmer owned r...
4,'barley',agriculture department reported farmer owned r...
5,'oat',agriculture department reported farmer owned r...
6,'sorghum',agriculture department reported farmer owned r...
7,'veg-oil',argentine grain board figures show crop regist...
8,'linseed',argentine grain board figures show crop regist...
9,'lin-oil',argentine grain board figures show crop regist...


## Experimental Results

The goals of the experiments are:
- to study the effect of varying the parameters k (length) and $\lambda$ (weight)
- to observe the benefits of combining different kernels

### Splitting the data into training and test sets

The experiments were carried out on only a part of the Reuters dataset. The paper states that the subset contained 470 documents, of which 380 were used for training and 90 for testing.

The categories "earn", "acq", "crude" and "corn" were selected for the experiment.


In [None]:
earn_clanci = dataframe[dataframe.TOPICS.str.contains("earn")]
acq_clanci = dataframe[dataframe.TOPICS.str.contains("acq")]
crude_clanci = dataframe[dataframe.TOPICS.str.contains("crude")]
corn_clanci = dataframe[dataframe.TOPICS.str.contains("corn")]

# Concatenate into one dataframe; a plain list has no head() method.
clanci = pd.concat([earn_clanci, acq_clanci, crude_clanci, corn_clanci])
clanci.head()

The paper gives the following number of training (test) articles per category:
1. earn 152 (40)
2. acquisition 114 (25)
3. crude 76 (15)
4. corn 38 (10)

In [39]:
from sklearn.model_selection import train_test_split

def give_train_test_split(give_train):
    """Draw a random split with the per-category sizes used in the paper.

    Returns [treniranje, y_train] if give_train is True, otherwise
    [earn_test, acq_test, crude_test, corn_test, testiranje, y_test].
    Every call draws a new split, so the training and test documents returned
    by two separate calls are not guaranteed to be disjoint.
    """
    # (label, dataframe, number of training documents, number of test documents)
    kategorije = [('earn', earn_clanci, 152, 40),
                  ('acq', acq_clanci, 114, 25),
                  ('crude', crude_clanci, 76, 15),
                  ('corn', corn_clanci, 38, 10)]

    treniranje, y_train = [], []
    testiranje, y_test = [], []
    test_po_kategoriji = []
    for oznaka, clanci_kategorije, n_train, n_test in kategorije:
        # Integer sizes request exact document counts and avoid rounding of fractions.
        tr, te = train_test_split(clanci_kategorije, train_size=n_train, test_size=n_test)
        treniranje.extend(tr['BODY'].tolist())
        y_train.extend([oznaka] * len(tr))
        test_po_kategoriji.append(te['BODY'].tolist())
        testiranje.extend(te['BODY'].tolist())
        y_test.extend([oznaka] * len(te))

    if give_train:
        return [treniranje, np.array(y_train)]
    earn_test, acq_test, crude_test, corn_test = test_po_kategoriji
    return [earn_test, acq_test, crude_test, corn_test, testiranje, np.array(y_test)]


### Effectiveness of Varying Sequence Length

In this part we examine how the subsequence length k affects the accuracy of the model. For each value of k the experiment was run 10 times, and the mean and standard deviation were then computed. Lambda is set to 0.5.


Because computing the SSK is very slow, I skipped the 10 repetitions for it, so it is run only once. For NGK the experiment is run 10 times and the evaluation metrics are averaged.

We build a list of rows of the form [category, kernel name (column `ime ljuske`), length, f1_mean, f1_std, precision_mean, precision_std, recall_mean, recall_std]. That list is later turned into a dataframe.
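
A compact way to build one such row is sketched below. The helper `result_row` is hypothetical and the scores passed to it are placeholders, not results from the experiment. `statistics.pstdev` is the population standard deviation, which is what `np.std` computes by default.

```python
from statistics import mean, pstdev

COLUMNS = ["category", "ime ljuske", "k", "f1_mean", "f1_std",
           "precision_mean", "precision_std", "recall_mean", "recall_std"]

def result_row(category, kernel_name, k, f1, precision, recall):
    """Summarize the per-run scores as mean and standard deviation, rounded to 3 places."""
    row = [category, kernel_name, k]
    for scores in (f1, precision, recall):
        row.append(round(mean(scores), 3))
        row.append(round(pstdev(scores), 3))
    return row

# Placeholder scores from three imaginary runs.
scores = [0.9, 0.95, 1.0]
row = result_row("earn", "NGK", 5, scores, scores, scores)
print(dict(zip(COLUMNS, row)))
```

A list of such rows can be passed directly to `pd.DataFrame(rows, columns=COLUMNS)`.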

#### Evaluating the SSK kernel

In [77]:
# Builds the rows of a table like the one in the paper.
def ssk_evaluation(category, k_range=[5], lambd_range=[0.5]):
    rezultat_lista = []
    [treniranje, y_train] = give_train_test_split(True)
    for k in k_range:
        for lambd in lambd_range:
            print(category, k)
            ssk_kernel = lambda x, y: ssk.ssk(x, y, k, lambd)
            train_gram = ssk_compute_train_gram(treniranje, kernel=ssk_kernel)

            # SSK is slow, so a single test split is evaluated instead of 10.
            [earn_test, acq_test, crude_test, corn_test, _, _] = give_train_test_split(False)
            testovi = {'earn': earn_test, 'acq': acq_test, 'crude': crude_test, 'corn': corn_test}
            X_test = testovi[category]
            y_true = np.array([category] * len(X_test))
            test_gram = ssk_compute_test_gram(X_test, treniranje, kernel=ssk_kernel)

            clf = SVC(kernel='precomputed')
            clf.fit(train_gram, y_train)
            # Prediction.
            y_pred = clf.predict(test_gram)

            f1 = [f1_score(y_true, y_pred, average='micro')]
            precision = [precision_score(y_true, y_pred, average='micro')]
            recall = [recall_score(y_true, y_pred, average='micro')]

            # Third column: the parameter being varied (lambda when k is fixed, otherwise k).
            parametar = lambd if len(k_range) == 1 else k
            # One row per (k, lambda) combination; the caller writes the rows to CSV.
            rezultat_lista.append([category, "SSK", parametar,
                                   round(np.mean(f1), 3), round(np.std(f1), 3),
                                   round(np.mean(precision), 3), round(np.std(precision), 3),
                                   round(np.mean(recall), 3), round(np.std(recall), 3)])
    return rezultat_lista


#### Evaluating the NGK kernel

In [41]:
def ngk_evaluation(category, k_range=[5]):
    rezultat_lista = []
    [treniranje, y_train] = give_train_test_split(True)
    for k in k_range:
        print(k)
        f1 = []
        precision = []
        recall = []
        for i in range(0, 10):
            # Draw a new test split for every repetition.
            [earn_test, acq_test, crude_test, corn_test, _, _] = give_train_test_split(False)
            testovi = {'earn': earn_test, 'acq': acq_test, 'crude': crude_test, 'corn': corn_test}
            X_test = testovi[category]
            y_true = np.array([category] * len(X_test))

            train_gram, test_gram = ngk.ngkGmats(treniranje, X_test, n=k)
            clf = SVC(kernel='precomputed')
            clf.fit(train_gram, y_train)
            y_pred = clf.predict(test_gram)
            f1.append(round(f1_score(y_true, y_pred, average='micro'), 3))
            precision.append(round(precision_score(y_true, y_pred, average='micro'), 3))
            recall.append(round(recall_score(y_true, y_pred, average='micro'), 3))

        # NGK has no lambda, so the third column always holds the n-gram length k.
        rezultat_lista.append([category, "NGK", k,
                               round(np.mean(f1), 3), round(np.std(f1), 3),
                               round(np.mean(precision), 3), round(np.std(precision), 3),
                               round(np.mean(recall), 3), round(np.std(recall), 3)])
    return rezultat_lista
        

#### Evaluating the WK kernel

In [74]:
def wk_evaluation(category):
    f1 = []
    precision = []
    recall = []

    [treniranje, y_train] = give_train_test_split(True)
    for i in range(0, 10):
        print(category, i)
        # Draw a new test split for every repetition.
        [earn_test, acq_test, crude_test, corn_test, _, _] = give_train_test_split(False)
        testovi = {'earn': earn_test, 'acq': acq_test, 'crude': crude_test, 'corn': corn_test}
        X_test = testovi[category]
        y_true = np.array([category] * len(X_test))

        [wk_train_gram, wk_test_gram] = wkGmats(treniranje, X_test)
        clf = SVC(kernel='precomputed')
        clf.fit(wk_train_gram, y_train)
        y_pred = clf.predict(wk_test_gram)
        f1.append(round(f1_score(y_true, y_pred, average='micro'), 3))
        precision.append(round(precision_score(y_true, y_pred, average='micro'), 3))
        recall.append(round(recall_score(y_true, y_pred, average='micro'), 3))

    # WK has neither k nor lambda, so the third column is set to 0.
    rezultat_lista = [category, "WK", 0,
                      round(np.mean(f1), 3), round(np.std(f1), 3),
                      round(np.mean(precision), 3), round(np.std(precision), 3),
                      round(np.mean(recall), 3), round(np.std(recall), 3)]

    # Return a list with a single row, matching the other evaluation functions.
    return [rezultat_lista]
        

In [78]:
def evaluation_for_varying_sequence_lengths():
    stupci = ["category", "ime ljuske", "k", "f1_mean", "f1_std", "precision_mean", "precision_std", "recall_mean", "recall_std"]
    ngk_k = [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
    ssk_k = [3, 5, 7, 14]

    def spremi(rezultat):
        # Append the rows to the CSV file that collects all results.
        pd.DataFrame(rezultat, columns=stupci).to_csv("varying_sequence_length.csv", mode='a', header=False)

    # To start a new results file, first write an empty table with the header:
    #pd.DataFrame([], columns=stupci).to_csv("varying_sequence_length.csv")

    # The commented-out experiments are skipped in this run.
    #spremi(ngk_evaluation("earn", k_range=ngk_k))
    #spremi(ssk_evaluation("earn", k_range=ssk_k))
    spremi(wk_evaluation("earn"))

    #spremi(ngk_evaluation("acq", k_range=ngk_k))
    spremi(ssk_evaluation("acq", k_range=ssk_k))
    spremi(wk_evaluation("acq"))

    #spremi(ssk_evaluation("crude", k_range=ssk_k))
    #spremi(ngk_evaluation("crude", k_range=ngk_k))
    spremi(wk_evaluation("crude"))

    #spremi(ngk_evaluation("corn", k_range=ngk_k))
    #spremi(ssk_evaluation("corn", k_range=ssk_k))
    spremi(wk_evaluation("corn"))

evaluation_for_varying_sequence_lengths()

acq 3
0.00%


KeyboardInterrupt: 

#### Comparison of results

To keep the running time manageable, I did not compute results for every k used in the paper. Still, SSK appears to work best for small to medium values of k (roughly 4-7). The parameter k can be chosen in advance by cross-validation so that it maximizes accuracy (minimizes error) on the validation set.

In [83]:
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)  # None means no limit; -1 is deprecated

rezultat = pd.read_csv("varying_sequence_length.csv", index_col=[0, 1, 2, 3])
display(rezultat)
#pd.reset_option('all')



Unnamed: 0,category,ime ljuske,k,f1_mean,f1_std,precision_mean,precision_std,recall_mean,recall_std
0,earn,NGK,3,0.885,0.03,0.885,0.03,0.885,0.03
1,earn,NGK,4,0.885,0.051,0.885,0.051,0.885,0.051
2,earn,NGK,5,0.923,0.028,0.923,0.028,0.923,0.028
3,earn,NGK,6,0.93,0.05,0.93,0.05,0.93,0.05
4,earn,NGK,7,0.933,0.035,0.933,0.035,0.933,0.035
5,earn,NGK,8,0.915,0.041,0.915,0.041,0.915,0.041
6,earn,NGK,9,0.918,0.046,0.918,0.046,0.918,0.046
7,earn,NGK,10,0.915,0.039,0.915,0.039,0.915,0.039
8,earn,NGK,11,0.893,0.055,0.893,0.055,0.893,0.055
9,earn,NGK,12,0.908,0.046,0.908,0.046,0.908,0.046
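
The F1, precision and recall columns are identical in every row. This is expected with `average='micro'` when each document has exactly one true and one predicted label: every wrong prediction counts once as a false positive (for the predicted class) and once as a false negative (for the true class), so all three scores reduce to accuracy. The pure-Python sketch below shows this on placeholder labels; `micro_scores` is a hypothetical helper, not part of scikit-learn.

```python
def micro_scores(y_true, y_pred):
    """Micro-averaged precision, recall and F1 for single-label predictions."""
    labels = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    # Pool true positives, false positives and false negatives over all labels.
    for label in labels:
        for truth, pred in zip(y_true, y_pred):
            if pred == label and truth == label:
                tp += 1
            elif pred == label:
                fp += 1
            elif truth == label:
                fn += 1
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 0.0 if tp == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Placeholder labels: all test documents belong to one category, as in the cells above.
y_true = ["earn"] * 5
y_pred = ["earn", "earn", "acq", "earn", "corn"]
precision, recall, f1 = micro_scores(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(precision, recall, f1, accuracy)
```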


### Effectiveness of Varying Weight Decay Factors

We now test the model for varying values of lambda. Lambda controls how strongly non-contiguous subsequences are penalized: the more spread out the characters of a subsequence are in an article, the more heavily it is penalized.
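
As a worked illustration, assume the usual weighting in which an occurrence of a subsequence contributes $\lambda$ raised to the length of the stretch it spans. For the subsequence "ct" in three short strings this gives

$$
\phi_{ct}(\text{act}) = \lambda^{2}, \qquad \phi_{ct}(\text{cat}) = \lambda^{3}, \qquad \phi_{ct}(\text{cart}) = \lambda^{4}
$$

With $\lambda = 0.5$ the weights are 0.25, 0.125 and 0.0625, so each additional gap character halves the contribution. A small $\lambda$ therefore makes the kernel depend almost entirely on contiguous matches, while a $\lambda$ close to 1 treats gapped and contiguous occurrences nearly alike.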

We call the same functions as for Varying Sequence Length, but this time we pass a list of lambdas (weight decay factors) for which the model is tested.

For each value of lambda the experiment was run 10 times, and the mean and standard deviation were then computed. The parameter k is set to 5.


In [None]:
def evaluation_for_varying_weight_decay_factors():
    stupci = ["category", "ime ljuske", "lambda", "f1_mean", "f1_std", "precision_mean", "precision_std", "recall_mean", "recall_std"]
    kategorije = ["earn", "acq", "crude", "corn"]
    lambde = [0.01, 0.05, 0.3, 0.7]

    def spremi(rezultat):
        # Append the rows to the CSV file that collects all results.
        pd.DataFrame(rezultat, columns=stupci).to_csv("varying_weight_decay.csv", mode='a', header=False)

    # To start a new results file, first write an empty table with the header:
    #pd.DataFrame([], columns=stupci).to_csv("varying_weight_decay.csv")

    # The NGK runs are skipped here.
    #for kategorija in kategorije:
    #    spremi(ngk_evaluation(kategorija))

    for kategorija in kategorije:
        spremi(ssk_evaluation(kategorija, lambd_range=lambde))

    for kategorija in kategorije:
        spremi(wk_evaluation(kategorija))

evaluation_for_varying_weight_decay_factors()

#### Comparison of results


### Effectiveness of Combining Kernels

#### Combining NGK and SSK

In [None]:
k = 5
lambd = 0.5
def NGK_SSK_comb_evaluation(category):
    rezultat_lista = []
    w_ng_list = [1, 0.5, 0.8, 0.9] #[1, 0, 0.5, 0.6, 0.7, 0.8, 0.9]
    w_sk_list = [0, 0.5, 0.2, 0.1] #[0, 1, 0.5, 0.4, 0.3, 0.2, 0.1]
    for i in len(w_ng_list):
        w_ng = w_ng_list[i]
        w_sk = w_sk_list[i]
        lista_u_ovom_koraku = []
        lista_u_ovom_koraku.append(category)
        lista_u_ovom_koraku.append(w_ng)
        lista_u_ovom_koraku.append(w_sk)
        print(w_ng, w_sk)
        f1 = []
        precision = []
        recall = []
        #for i in range(0, 10):
        #treniranje jezgre
        [treniranje, y_train] = give_train_test_split(True)
        [earn_test, acq_test, crude_test, corn_test, _, _] = give_train_test_split(False)
        if(category=="earn"):
            X_test = earn_test
            y_true = np.array(['earn' for i in range(0, len(earn_test))])
        elif category == "acq":
            X_test = acq_test
            y_true = np.array(['acq' for i in range(0, len(acq_test))])
        elif category == "crude":
            X_test = crude_test
            y_true = np.array(['crude' for i in range(0, len(crude_test))])
        else:
            X_test = corn_test
            y_true = np.array(['corn' for i in range(0, len(corn_test))])
            
        ssk_kernel = lambda x, y: ssk.ssk(x, y, 5, 0.5)
        ssk_train_gram = ssk_compute_train_gram(treniranje, kernel=ssk_kernel)
        ssk_test_gram = ssk_compute_test_gram(X_test, treniranje, kernel=ssk_kernel)
        ngk_train_gram, ngk_test_gram = ngk.ngkGmats(treniranje, X_test, n=5)
            
        test_gram = ngk_test_gram*w_ng + ssk_test_gram*w_sk
        train_gram = ngk_train_gram*w_ng + ssk_train_gram*w_sk
            
        clf = SVC(kernel='precomputed')
        clf.fit(train_gram, y_train)
        y_pred = clf.predict(test_gram)
        f1.append(round(f1_score(y_true, y_pred, average='micro'), 3))
        precision.append(round(precision_score(y_true, y_pred, average='micro'), 3))
        recall.append(round(recall_score(y_true, y_pred, average='micro'), 3))
        # end of the (disabled) repetition loop
        
        lista_u_ovom_koraku.append(round(np.mean(f1), 3))
        lista_u_ovom_koraku.append(round(np.std(f1), 3))
        lista_u_ovom_koraku.append(round(np.mean(precision), 3))
        lista_u_ovom_koraku.append(round(np.std(precision), 3))
        lista_u_ovom_koraku.append(round(np.mean(recall), 3))
        lista_u_ovom_koraku.append(round(np.std(recall), 3))
        #print(lista_u_ovom_koraku)
        rezultat_lista.append(lista_u_ovom_koraku)
        
    return rezultat_lista


In [None]:
def evaluation_for_combining_ngk_and_ssk():
    columns = ["category", "w_ng", "w_sk", "f1_mean", "f1_std", "precision_mean", "precision_std", "recall_mean", "recall_std"]
    # write the header row first, then append one block of results per category
    pd.DataFrame([], columns=columns).to_csv("combining_ngk_and_ssk.csv")
    for category in ["earn", "acq", "crude", "corn"]:
        rezultat = NGK_SSK_comb_evaluation(category)
        pd.DataFrame(rezultat, columns=columns).to_csv("combining_ngk_and_ssk.csv", mode='a', header=False)

evaluation_for_combining_ngk_and_ssk()
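
The weighted combination used in the cell above, `train_gram = ngk_train_gram*w_ng + ssk_train_gram*w_sk`, can be checked on toy matrices. The following is a minimal pure-Python sketch; `combine_grams` and the two 2x2 matrices are made-up names and values for illustration, not part of the notebook or of any library. A sum of valid kernels with non-negative weights is again a valid kernel, which is why the combined matrix can be handed to `SVC(kernel='precomputed')`.

```python
def combine_grams(gram_a, gram_b, w_a, w_b):
    """Entry-wise weighted sum w_a * gram_a + w_b * gram_b of two equally shaped matrices."""
    if len(gram_a) != len(gram_b) or any(len(ra) != len(rb) for ra, rb in zip(gram_a, gram_b)):
        raise ValueError("Gram matrices must have the same shape")
    return [[w_a * a + w_b * b for a, b in zip(ra, rb)] for ra, rb in zip(gram_a, gram_b)]

# hypothetical toy Gram matrices (values chosen only for illustration)
toy_ngk = [[1.0, 0.25], [0.25, 1.0]]
toy_ssk = [[1.0, 0.75], [0.75, 1.0]]

only_ngk = combine_grams(toy_ngk, toy_ssk, 1, 0)       # weights (1, 0): pure NGK
half_half = combine_grams(toy_ngk, toy_ssk, 0.5, 0.5)  # equal weights: entry-wise average
print(only_ngk)
print(half_half)
```

With weights (1, 0) the combination reduces to the first matrix, which corresponds to the first weight pair in `w_ng_list` and `w_sk_list`.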

#### Combining SSK with different lambdas

In [None]:
def SSK_lambda_comb_evaluation(category):
    rezultat_lista = []
    lambda_1_list = [0.05, 0.5, 0.05]
    lambda_2_list = [0.0, 0.0, 0.5]
    for i in range(len(lambda_1_list)):
        lambda1 = lambda_1_list[i]
        lambda2 = lambda_2_list[i]
        lista_u_ovom_koraku = []
        lista_u_ovom_koraku.append(category)
        lista_u_ovom_koraku.append(lambda1)
        lista_u_ovom_koraku.append(lambda2)
        print(lambda1, lambda2)
        f1 = []
        precision = []
        recall = []
        for _ in range(0, 10):  # 10 repetitions; `_` avoids shadowing the outer index i
            # train the kernel
            [treniranje, y_train] = give_train_test_split(True)
            [earn_test, acq_test, crude_test, corn_test, _, _] = give_train_test_split(False)
            if(category=="earn"):
                X_test = earn_test
                y_true = np.array(['earn' for i in range(0, len(earn_test))])
            elif category == "acq":
                X_test = acq_test
                y_true = np.array(['acq' for i in range(0, len(acq_test))])
            elif category == "crude":
                X_test = crude_test
                y_true = np.array(['crude' for i in range(0, len(crude_test))])
            else:
                X_test = corn_test
                y_true = np.array(['corn' for i in range(0, len(corn_test))])
            
            ssk_kernel_1 = lambda x, y: ssk.ssk(x, y, 5, lambda1)
            ssk_kernel_2 = lambda x, y: ssk.ssk(x, y, 5, lambda2)
            ssk_train_gram_1 = ssk_compute_train_gram(treniranje, kernel=ssk_kernel_1)
            ssk_train_gram_2 = ssk_compute_train_gram(treniranje, kernel=ssk_kernel_2)
            ssk_test_gram_1 = ssk_compute_test_gram(X_test, treniranje, kernel=ssk_kernel_1)
            ssk_test_gram_2 = ssk_compute_test_gram(X_test, treniranje, kernel=ssk_kernel_2)
            
            train_gram = ssk_train_gram_1 + ssk_train_gram_2
            test_gram = ssk_test_gram_1 + ssk_test_gram_2
        
            clf = SVC(kernel='precomputed')
            clf.fit(train_gram, y_train)
            y_pred = clf.predict(test_gram)
            f1.append(f1_score(y_true, y_pred, average='micro'))
            precision.append(precision_score(y_true, y_pred, average='micro'))
            recall.append(recall_score(y_true, y_pred, average='micro'))
        
        lista_u_ovom_koraku.append(round(np.mean(f1), 3))
        lista_u_ovom_koraku.append(round(np.std(f1), 3))
        lista_u_ovom_koraku.append(round(np.mean(precision), 3))
        lista_u_ovom_koraku.append(round(np.std(precision), 3))
        lista_u_ovom_koraku.append(round(np.mean(recall), 3))
        lista_u_ovom_koraku.append(round(np.std(recall), 3))
        #print(lista_u_ovom_koraku)
        rezultat_lista.append(lista_u_ovom_koraku)
        
    return rezultat_lista


In [None]:
def evaluation_for_combining_lambda_ssk():
    columns = ["category", "lambda_1", "lambda_2", "f1_mean", "f1_std", "precision_mean", "precision_std", "recall_mean", "recall_std"]
    # write the header row first, then append one block of results per category
    pd.DataFrame([], columns=columns).to_csv("combininig_lambda_ssk.csv")
    for category in ["earn", "acq", "crude", "corn"]:
        rezultat = SSK_lambda_comb_evaluation(category)
        pd.DataFrame(rezultat, columns=columns).to_csv("combininig_lambda_ssk.csv", mode='a', header=False)

evaluation_for_combining_lambda_ssk()

## IGNORE

Training the models on full articles is too slow, so I decided to extract the n = 50 most frequent words from each article and compare the articles by those words.

In [11]:
from collections import Counter

def return_n_most_common_words(row, n):
    """Return the n most frequent words longer than 3 characters in row['BODY']."""
    words_in_row = row['BODY'].split(" ")
    # count only words longer than 3 characters (drops short tokens such as "vs" or "cts")
    count = Counter(word for word in words_in_row if len(word) > 3)
    return np.array([word for word, _ in count.most_common(n)])
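
As a quick check of the filtering idea, the sketch below applies the same counting to a plain string instead of a DataFrame row. The helper name `most_common_long_words`, its `min_len` parameter, and the sample sentence are assumptions made for illustration only.

```python
from collections import Counter

def most_common_long_words(text, n, min_len=4):
    """Return the n most frequent whitespace-separated words with at least min_len characters."""
    counts = Counter(word for word in text.split() if len(word) >= min_len)
    return [word for word, _ in counts.most_common(n)]

# hypothetical sample in the style of the 'earn' articles
sample = "loss nine cts vs loss cts net loss dlrs revs dlrs"
top_words = most_common_long_words(sample, n=2)
print(top_words)
```

Short tokens such as `cts`, `vs` and `net` never enter the counter, so only `loss` (3 occurrences) and `dlrs` (2 occurrences) are returned.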

In [12]:
earn_clanci['MOST_COMMON'] = earn_clanci.apply(lambda row: return_n_most_common_words(row, n=50), axis=1)
acq_clanci['MOST_COMMON'] = acq_clanci.apply(lambda row: return_n_most_common_words(row, n=50), axis=1)
crude_clanci['MOST_COMMON'] = crude_clanci.apply(lambda row: return_n_most_common_words(row, n=50), axis=1)
corn_clanci['MOST_COMMON'] = corn_clanci.apply(lambda row: return_n_most_common_words(row, n=50), axis=1)
earn_clanci

Unnamed: 0,TOPICS,BODY,MOST_COMMON
19,'earn',champion products inc said its board of direct...,"[said, board, stock, shares, shareholders, apr..."
21,'earn',shr cts vs dlrs net vs ...,"[dlrs, assets, deposits, loans, note, availabl..."
22,'earn',ohio mattress co said its first quarter endin...,"[said, quarter, first, acquisitions, dlrs, sea..."
24,'earn',oper shr loss two cts vs profit seven cts ...,"[profit, oper, loss, revs, shrs, mths, seven, ..."
25,'earn',shr one dlr vs cts net mln vs ...,"[revs, dlrs, nine, mths, billion, reuter]"
...,...,...,...
13058,'earn',shr loss nine cts vs loss cts net loss ...,"[loss, dlrs, capitalized, costs, nine, revs, s..."
13059,'earn',shr cts vs cts shr diluted cts vs...,"[diluted, shrs, sales, nine, mths, dlrs, reuter]"
13060,'earn',shr cts vs cts net mln vs ...,"[dlrs, sales, shrs, nine, mths, oper, billion,..."
13103,'earn',nine months ended august group shr ...,"[billion, group, nine, months, ended, august, ..."


In [53]:
## MY SSK -> works for the small introductory example
import itertools

# SSK - string subsequence kernel
def is_subsequence(subsequence, word):
    # consuming one shared iterator enforces that the characters appear in order
    iterator = iter(word)
    return all(c in iterator for c in subsequence)

def ssk_kernel(string1, string2, k=2, lambd=1):
    stupci = []
    tablica = {}

    for word in [string1, string2]:
        letters = list(word)
        for combination in itertools.combinations(letters, k): # all length-k combinations of the letters, in order
            s = ''.join(combination)
            if s not in stupci:
                stupci.append(s)
    
    #print(stupci)

    for word in [string1, string2]:
        tablica[word] = [0 for i in range(len(stupci))]
        subsequence_index = 0
        for stupac in stupci:
            if is_subsequence(stupac, word):
                cell_rez = 1
                index_slova_rijeci = 0
                for index_slova_stupca in range(len(stupac) - 1):
                    cell_rez += word.index(stupac[index_slova_stupca+1], word.index(stupac[index_slova_stupca])+1)-word.index(stupac[index_slova_stupca])
                tablica[word][subsequence_index] = pow(lambd, cell_rez)
                #print(word, tablica[word])
                # res += i.index(j[ki+1], i.index(j[ki])+1)-i.index(j[ki])
            subsequence_index += 1
    red_1 = np.array(tablica[string1])
    red_2 = np.array(tablica[string2])
    
    rez_1 = np.sum(red_1*red_2.T)
    rez_2 = np.sum(red_1*red_1.T)
    rez_3 = np.sum(red_2*red_2.T)
    rez = rez_1/pow(rez_2*rez_3, 0.5)
    return rez

print(ssk_kernel("cat","car", lambd=2))

0.16666666666666666
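
The value 1/6 can be reproduced by hand: with k = 2 and λ = 2, "cat" has the features ca = 2² = 4, ct = 2³ = 8, at = 2² = 4, and "car" has ca = 4, cr = 8, ar = 4. Only "ca" is shared, so the normalized kernel is 16 / 96. The cell above relies on `word.index`, which always finds the first occurrence of a character. The sketch below is a brute-force variant under my own assumptions: it enumerates index tuples, so repeated characters are handled, and it weights each occurrence by λ raised to the span it covers. `subsequence_features` and `ssk_bruteforce` are hypothetical names, and the cost grows combinatorially, so it is only suitable for short strings.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt

def subsequence_features(text, k, lambd):
    """Map each length-k subsequence of text to the sum of lambd**span over its occurrences."""
    features = defaultdict(float)
    for idx in combinations(range(len(text)), k):
        span = idx[-1] - idx[0] + 1  # length of the window the occurrence covers
        features[''.join(text[i] for i in idx)] += lambd ** span
    return features

def ssk_bruteforce(s, t, k=2, lambd=1.0):
    """Normalized subsequence kernel computed from explicit feature maps."""
    fs, ft = subsequence_features(s, k, lambd), subsequence_features(t, k, lambd)
    dot = lambda a, b: sum(v * b.get(u, 0.0) for u, v in a.items())
    norm = sqrt(dot(fs, fs) * dot(ft, ft))
    return dot(fs, ft) / norm if norm else 0.0

value = ssk_bruteforce("cat", "car", k=2, lambd=2)
print(value)
```

On this example the sketch agrees with the printed output of the cell above.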


In [None]:
# NGK - n-grams kernel
# NGK is a linear kernel that returns a similarity score between documents
# that are indexed by n-grams
# computes the value of the kernel function
def ngk(string1, string2):
    def ngrams(string):
        ngrams = set(())
        for n in range(1, len(string)+1):
            ngrams_helper = zip(*[string[i:] for i in range(n)])
            for ngram in ngrams_helper:
                ngrams.add(''.join(ngram))
        #print(ngrams)
        return ngrams
    
    ngrams_1 = ngrams(string1) # n-grams of the first document
    ngrams_2 = ngrams(string2) # n-grams of the second document
    
    # count the n-grams shared by both documents
    intercept_rez = ngrams_1.intersection(ngrams_2)
    num_common = len(intercept_rez)
    
    rez = num_common/(len(ngrams_1)+len(ngrams_2))
    rez = rez/0.5 # scaling: identical documents score 1 (Dice coefficient)
    return rez

def ngk_kernel(X1, X2):
    kernel_matrix = np.zeros([len(X1), len(X2)])
    for i in range(0, len(X1)):
        for j in range(0, len(X2)):
            kernel_matrix[i][j] = ngk(X1[i], X2[j])
    return kernel_matrix

print(ngk_kernel(["car"], ["cat"]))  # pass lists of documents; bare strings would be compared character by character
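
For comparison, an n-gram similarity can also be restricted to n-grams of one fixed length instead of all lengths from 1 to the string length. The sketch below assumes contiguous character n-grams and the same Dice-style scaling as the cell above; `char_ngrams` and `dice_ngram_similarity` are hypothetical names, and this is not the implementation behind `ngk.ngkGmats`.

```python
def char_ngrams(text, n):
    """Set of all contiguous character n-grams of length n in text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def dice_ngram_similarity(s, t, n):
    """2 * |shared n-grams| / (|n-grams of s| + |n-grams of t|), or 0.0 if both sets are empty."""
    grams_s, grams_t = char_ngrams(s, n), char_ngrams(t, n)
    total = len(grams_s) + len(grams_t)
    return 2 * len(grams_s & grams_t) / total if total else 0.0

# "car" -> {"ca", "ar"}, "cat" -> {"ca", "at"}: one shared bigram out of 2 + 2
similarity = dice_ngram_similarity("car", "cat", n=2)
print(similarity)
```

Returning 0.0 when both sets are empty is a design choice of this sketch; it avoids the division by zero that occurs when both strings are shorter than n.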