# Experiment: How does ML performance vary with the number of Dewey digits?

## Hypothesis: The more digits a dewey has, the easier it is to classify, since its content is more specific.

In [61]:
import sys
sys.path.append('/home/ubuntu/PycharmProjects_saved/tgpl_w_oop')
from nb_ml import utils_nb
import pandas as pd
folder = "/home/ubuntu/PycharmProjects_saved/tgpl_w_oop/data_set/tgcForOptimization/tgcForOptimization"
articles = utils_nb.get_articles_from_folder_several_deweys(folder)
print(articles.describe())

        dewey                                          file_name  \
count   14567                                              14567   
unique   3552                                              14567   
top     36229  httpwwwidunnnotsstat201003art26Sosialemedierog...   
freq      604                                                  1   

                                                     text  
count                                               14567  
unique                                              14278  
top      samtideninnhold Billettmerket Ikke forambisiø...  
freq                                                   15  


# Overview of deweys and number of documents per dewey
The first step is to get an overview of which deweys we have and how many documents each one has. To get an impression, we print the 60 most frequent. This also gives an idea of which 3-, 4-, 5- and 6-digit deweys may be suitable for further experiments.


In [100]:
topN = articles["dewey"].value_counts().head(60)
print("dewey       frequency")
topN


dewey       frequency


36229         604
839823        141
362293        137
3621          129
3622          126
34304         116
3627          100
362204         96
6168915        90
351481         90
362292         79
30223          75
306            70
379481         58
34602          58
3412422        57
61092          56
34705          54
34306          53
34206          53
9072           53
75981          52
657            51
34401          48
30712          47
34606          47
3521409481     47
0014           46
37817          45
327481         44
61612          44
3637387        44
36211068       44
30542          44
341481         43
34202          42
34604          42
33263          42
34603          42
610711         41
7114           40
839821         40
34505          40
331257         40
3523           39
61578          39
193            39
30072          39
6167           37
3053           36
3058           36
343055         35
610730711      34
3401           34
346043         34
36218     

The table above shows a number of promising candidates for further experiments. I list them by number of digits, with the frequency in parentheses:
- 10 digits: 3521409481 (47)
- 7 digits: 6168915 (90), 3412422 (57)
- 6 digits: 839823 (141), 362293 (137), 362204 (96), 362292 (79), 379481 (58)
- 5 digits: 36229 (604), 34304 (116), 30223 (75), 34602 (58)
- 4 digits: 3621 (129), 3622 (126), 3627 (100), 9072 (53)
- 3 digits: 306 (70), 657 (51)
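
The grouping by digit count in the list above can also be done programmatically. The following is a minimal pure-Python sketch; the `counts` dictionary is a small subset of the frequencies from the table, and `group_by_digit_count` is a hypothetical helper that is not part of the notebook code.

```python
from collections import defaultdict

# A small subset of the frequencies from the table above (for illustration only).
counts = {
    "36229": 604, "839823": 141, "362293": 137, "3621": 129,
    "3622": 126, "34304": 116, "6168915": 90, "306": 70, "657": 51,
}

def group_by_digit_count(counts):
    """Map number of digits -> list of (dewey, frequency), most frequent first."""
    groups = defaultdict(list)
    for dewey, freq in counts.items():
        groups[len(dewey)].append((dewey, freq))
    for members in groups.values():
        members.sort(key=lambda pair: pair[1], reverse=True)
    return dict(groups)

candidates = group_by_digit_count(counts)
for n_digits in sorted(candidates, reverse=True):
    print(n_digits, candidates[n_digits])
```

Because the deweys are stored as strings, `len(dewey)` is the digit count, including any leading zeros.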

# Test scenarios
### Main scenario
- 3 vs 6

#### Additional scenarios
- 3 vs 5
- 3 vs 6
- 3 vs 7

### Technical setup
The tests use logistic regression as the classification algorithm; the test code is shown below.

Libraries:
- NLTK
- scikit-learn
- matplotlib
- numpy
- pandas


In [63]:
import sys
sys.path.append('/home/ubuntu/PycharmProjects_saved/tgpl_w_oop/nb_ml')
from nb_ml import logreg
from nltk.tokenize import word_tokenize 
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.manifold import TSNE
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics import accuracy_score
import matplotlib.cm as cm
import matplotlib.pyplot as plt  # needed by tsne() and plotDecisionSurface()
import numpy as np
from collections import OrderedDict
from sklearn.neighbors import KNeighborsClassifier
class dewey_test():
    
    def __init__(self, data):
        self.corpus_dataframe = data.copy()
        self.filtered_corpus = []
        self.x_train = None
        self.y_train = None
        self.x_test = None
        self.y_test = None
        self.model = None
        self.predictions = None
        self.results = None
        self.accuracy = None
    def preprocessing(self, numArticlesPerDewey=2, strict = False):
        # With strict=True, every dewey is reduced to exactly numArticlesPerDewey articles.
        if strict:
            self.getStrictArticleSelection(numArticlesPerDewey)
        # NOTE: the tokenized text is not used downstream; CountVectorizer in fit()
        # tokenizes the raw text itself.
        for text in self.corpus_dataframe["text"].values:
            tokenized_text = word_tokenize(text = str(text), language = "norwegian")
        self.y_train = self.corpus_dataframe["dewey"].tolist()
        self.y_test = self.corpus_dataframe["dewey"].tolist()
    def splitToTrainingAndTest(self, stratified):
        x = self.corpus_dataframe["text"].tolist()
        y = self.corpus_dataframe["dewey"].tolist()
        if stratified == True:
            self.x_train, self.x_test, self.y_train, self.y_test = train_test_split(x, y,test_size = 0.2, stratify = y,random_state = 42)
        else:
            self.x_train, self.x_test, self.y_train, self.y_test = train_test_split(x, y,test_size = 0.2,random_state = 42)
    def fit(self):
        count_vectorizer = CountVectorizer(max_features = 10000)
        self.x_train = count_vectorizer.fit_transform(self.x_train)
        self.x_test = count_vectorizer.transform(self.x_test)
    def train(self):
        self.model = LogisticRegression() 
        self.model.fit(self.x_train, self.y_train)
    def predict(self):
        self.predictions = self.model.predict(self.x_test)
        self.results = classification_report(self.y_test, self.predictions)
        self.getAccuracy()
    def getAccuracy(self):
        self.accuracy = accuracy_score(self.y_test, self.predictions)
    def printResults(self):
        print(str(self.results) +"\n")
        print("Accuracy:"+ str(self.accuracy))
    
    def tsne(self):
        X_reduced = TruncatedSVD(n_components = 50, random_state=0).fit_transform(self.x_train)
        X_embedded = TSNE(n_components =2, perplexity = 40, random_state = 0).fit_transform(X_reduced)
        
        colors = cm.rainbow(np.linspace(0,1,len(set(self.y_train))))
        unique_labels = set(self.y_train)
        color_dictionary = dict(zip(unique_labels, colors))
        
        color_list = []
        for label in self.y_train:
            color_list.append(color_dictionary[str(label)])
        for i in range(0,len(self.y_train)):    
            plt.scatter(X_embedded[i,0], X_embedded[i,1], c = color_list[i],label = str(self.y_train[i]), 
                        cmap = "tab20b" )
        handles, labels = plt.gca().get_legend_handles_labels()
        by_label = OrderedDict(zip(labels, handles))
        plt.legend(by_label.values(), by_label.keys())
        plt.title("TSNE-plot")
        plt.show()
    def getStrictArticleSelection(self, articlesPerDewey):
        np.random.seed(0)
        size = articlesPerDewey  # sample size per dewey
        replace = False  # sample without replacement

        # Keep only deweys that have at least `size` articles.
        dewey_counts = self.corpus_dataframe['dewey'].value_counts()
        eligible = dewey_counts[dewey_counts >= size].index
        self.corpus_dataframe = self.corpus_dataframe[self.corpus_dataframe['dewey'].isin(eligible)]
        # Draw exactly `size` articles from each remaining dewey.
        fn = lambda obj: obj.loc[np.random.choice(obj.index, size, replace), :]
        self.corpus_dataframe = self.corpus_dataframe.groupby('dewey', as_index=False).apply(fn)
        
    def plotDecisionSurface(self):
        X_reduced = TruncatedSVD(n_components = 50, random_state=0).fit_transform(self.x_train)
        X_embedded = TSNE(n_components =2, perplexity = 40, random_state = 0).fit_transform(X_reduced)
        
        
        # Build the color mapping from the training labels, since those are the points plotted.
        unique_labels = set(self.y_train)
        colors = cm.rainbow(np.linspace(0,1,len(unique_labels)))
        color_dictionary = dict(zip(unique_labels, colors))
        
        color_list = []
        for label in self.y_train:
            color_list.append(color_dictionary[str(label)])
                
        # create meshgrid
        resolution = 1000 # 1000x1000 background pixels
        X2d_xmin, X2d_xmax = np.min(X_embedded[:,0]), np.max(X_embedded[:,0])
        X2d_ymin, X2d_ymax = np.min(X_embedded[:,1]), np.max(X_embedded[:,1])
        xx, yy = np.meshgrid(np.linspace(X2d_xmin, X2d_xmax, resolution), np.linspace(X2d_ymin, X2d_ymax, resolution))

        # approximate Voronoi tesselation on resolution x resolution grid using 1-NN
        # The labels are encoded as integers because contourf needs numeric values
        # (a label such as "362x" cannot be converted to a float).
        _, label_codes = np.unique(self.y_train, return_inverse=True)
        background_model = KNeighborsClassifier(n_neighbors=1).fit(X_embedded, label_codes)
        voronoiBackground = background_model.predict(np.c_[xx.ravel(), yy.ravel()])
        voronoiBackground = voronoiBackground.reshape(xx.shape)
        
        plt.contourf(xx, yy, voronoiBackground, cmap=plt.cm.Paired)
        for i in range(0,len(self.y_train)):    
            plt.scatter(X_embedded[i,0], X_embedded[i,1], c = color_list[i],label = str(self.y_train[i]), 
                        cmap = "tab20b" )
        handles, labels = plt.gca().get_legend_handles_labels()
        by_label = OrderedDict(zip(labels, handles))
        plt.legend(by_label.values(), by_label.keys())
        plt.title("Contour-diagram")
        #plt.scatter(X_embedded[:,0], X_embedded[:,1], c=self.y_train)
        plt.show()
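
The strict selection in `getStrictArticleSelection` keeps only the deweys that have enough articles and then draws the same number of articles from each, without replacement. The idea can be illustrated without pandas. This is a simplified sketch, not the notebook's implementation: `balanced_sample` and the toy `rows` are hypothetical, and it uses Python's `random` module instead of `np.random`.

```python
import random
from collections import defaultdict

def balanced_sample(rows, per_class, seed=0):
    """Keep classes with at least `per_class` rows and draw exactly
    `per_class` rows from each of them, without replacement."""
    by_label = defaultdict(list)
    for label, text in rows:
        by_label[label].append((label, text))
    rng = random.Random(seed)
    sample = []
    for label in sorted(by_label):
        members = by_label[label]
        if len(members) >= per_class:
            sample.extend(rng.sample(members, per_class))
    return sample

# Toy corpus: five articles for 362, three for 616 and one for 839.
rows = (
    [("362", f"doc {i}") for i in range(5)]
    + [("616", f"doc {i}") for i in range(3)]
    + [("839", "doc 0")]
)
selected = balanced_sample(rows, per_class=3)
print(selected)
```

With `per_class=3`, the class 839 is dropped entirely because it has only one article, and the two remaining classes end up with three articles each.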
        


In [64]:
## Helper functions
def getDeweyAndAllSubdeweys(deweynr, corpus):
    # Select the rows whose dewey starts with deweynr (the dewey itself and all its sub-deweys).
    filter_col = [col for col in corpus["dewey"] if col.startswith(deweynr)]
    dfWithDeweyAndSubdeweys = corpus.loc[corpus['dewey'].isin(filter_col)].copy()
    return dfWithDeweyAndSubdeweys

def sliceDewey(x,length):
    # Truncate a dewey to at most `length` digits; shorter deweys are returned unchanged.
    return x[:length]
def joinDeweysDfs(*args):
    all_dfs = []
    for arg in args:
        all_dfs.append(arg)
    joined_df = pd.concat(all_dfs)
    return joined_df
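
The helper functions rely on the fact that a dewey and all of its sub-deweys share the same prefix. The sketch below shows that idea on plain Python lists; `select_with_subdeweys`, `truncate` and the toy `deweys` list are hypothetical stand-ins and not the DataFrame-based helpers defined above.

```python
def select_with_subdeweys(prefix, deweys):
    """Return every dewey that starts with `prefix`: the dewey itself and its sub-deweys."""
    return [d for d in deweys if d.startswith(prefix)]

def truncate(dewey, length):
    """Cut a dewey down to its first `length` digits."""
    return dewey[:length]

deweys = ["362", "36229", "362293", "3621", "616", "6168915", "839823"]

family_362 = select_with_subdeweys("362", deweys)
labels_3_digits = [truncate(d, 3) for d in family_362]

print(family_362)
print(labels_3_digits)
```

Truncating the selected deweys to three digits is what turns the whole family into the single class label 362.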

## Scenario 13

In [65]:
df_362 = getDeweyAndAllSubdeweys("362", articles)
df_362["dewey"] = df_362["dewey"].str[:3]
print(df_362.describe())
df_616 = getDeweyAndAllSubdeweys("616", articles)
df_616["dewey"] = df_616["dewey"].str[:3]
print(df_616.describe())
scenario_13_data= joinDeweysDfs(df_362, df_616)
scenario_13 = dewey_test(scenario_13_data)
scenario_13.preprocessing(numArticlesPerDewey =240, strict = True)
scenario_13.splitToTrainingAndTest(stratified = True)
scenario_13.fit()
scenario_13.train()
scenario_13.predict()
scenario_13.printResults()

       dewey                                          file_name  \
count   2029                                               2029   
unique     1                                               2029   
top      362  httpwwwidunnnotsrus200406omgivelsene_taler_alt...   
freq    2029                                                  1   

                                                     text  
count                                                2029  
unique                                               1978  
top      rus avhengighet nr Arbeidslivet rusmiddelfri ...  
freq                                                    5  
       dewey                                          file_name  \
count    791                                                791   
unique     1                                                791   
top      616  urndoi104045tidsskr090300Psykogeneikkeepilepti...   
freq     791                                                  1   

                            

## Scenario 14

In [85]:
scenario_14_data= joinDeweysDfs(df_362, df_616)
scenario_14 = dewey_test(scenario_14_data)
scenario_14.preprocessing(numArticlesPerDewey =2, strict = False)
scenario_14.splitToTrainingAndTest(stratified = True)
scenario_14.fit()
scenario_14.train()
scenario_14.predict()
scenario_14.printResults()

             precision    recall  f1-score   support

        362       0.92      0.94      0.93       406
        616       0.84      0.80      0.82       158

avg / total       0.90      0.90      0.90       564


Accuracy:0.902482269504
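
The `avg / total` row in the report appears to be the support-weighted mean of the per-class scores. As a small check under that assumption, the sketch below recomputes it from the rounded values printed above; `weighted_average` is a hypothetical helper.

```python
# (precision, recall, f1, support) per class, copied from the report above.
report = {
    "362": (0.92, 0.94, 0.93, 406),
    "616": (0.84, 0.80, 0.82, 158),
}

def weighted_average(report, column):
    """Average one score column, weighting each class by its support."""
    total_support = sum(row[3] for row in report.values())
    weighted_sum = sum(row[column] * row[3] for row in report.values())
    return weighted_sum / total_support

avg_precision = weighted_average(report, 0)
avg_recall = weighted_average(report, 1)
avg_f1 = weighted_average(report, 2)
print(round(avg_precision, 2), round(avg_recall, 2), round(avg_f1, 2))
```

All three averages round to 0.90, which matches the `avg / total` row. Since class 362 has more than twice the support of class 616, it dominates the average.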


## Scenario 15

In [89]:
## To get enough articles, I merge dewey 3621 and 3622, which gives 255 articles (129 + 126).
df_3621 = getDeweyAndAllSubdeweys("3621", articles)
df_3622 = getDeweyAndAllSubdeweys("3622", articles)
df_362x = joinDeweysDfs(df_3622, df_3621)

## Keep only the 4-digit deweys (drop the longer sub-deweys), then set every label
## to "362x" to indicate that this is a synthetic dewey.
mask = (df_362x['dewey'].str.len() < 5)
df_362x = df_362x.loc[mask].copy()
df_362x["dewey"] = "362x"


## To get enough articles, I merge dewey "61689", "61681", "61684", "61683",
## "6168582", "61686", "6168915", "616891" and "6168914".
df_61689 = getDeweyAndAllSubdeweys("61689", articles)
df_61681 = getDeweyAndAllSubdeweys("61681", articles)
df_61684 = getDeweyAndAllSubdeweys("61684", articles)
df_61683 = getDeweyAndAllSubdeweys("61683", articles)
df_6168582 = getDeweyAndAllSubdeweys("6168582", articles)
df_61686 = getDeweyAndAllSubdeweys("61686", articles)

df_616x = joinDeweysDfs( df_61689, df_61681, df_61684,df_61683, df_6168582, df_61686)

validDeweys = ["61689", "61681", "61684", "61683", "6168582", "61686", "6168915","616891","6168914"]
df_616x = df_616x[df_616x['dewey'].isin(validDeweys)].copy()
## Here too, the label is set to "616x" to indicate that this is a synthetic dewey.
df_616x["dewey"] = "616x"

#df_616x.describe()
scenario_15_data= joinDeweysDfs(df_362x, df_616x)
scenario_15 = dewey_test(scenario_15_data)
scenario_15.preprocessing(numArticlesPerDewey =240, strict = True)
scenario_15.splitToTrainingAndTest(stratified = True)
scenario_15.fit()
scenario_15.train()
scenario_15.predict()
scenario_15.printResults()

             precision    recall  f1-score   support

       362x       0.89      0.98      0.93        48
       616x       0.98      0.88      0.92        48

avg / total       0.93      0.93      0.93        96


Accuracy:0.927083333333
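
The synthetic classes in this scenario are built in two steps: select the deweys by prefix, then drop everything longer than the target digit count and assign one shared label. A minimal sketch of that filter on plain lists follows; `synthetic_class` and the toy `deweys` list are hypothetical and only illustrate the logic.

```python
def synthetic_class(prefixes, deweys, max_len, label):
    """Keep deweys that start with one of `prefixes` and have at most
    `max_len` digits, and pair each of them with the synthetic `label`."""
    kept = [
        d for d in deweys
        if d.startswith(tuple(prefixes)) and len(d) <= max_len
    ]
    return [(d, label) for d in kept]

deweys = ["3621", "36211068", "3622", "362204", "3627"]
pairs = synthetic_class(["3621", "3622"], deweys, max_len=4, label="362x")
print(pairs)
```

Here 36211068 and 362204 match a prefix but are dropped for being longer than four digits, and 3627 matches no prefix, so only 3621 and 3622 are relabeled as 362x.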


## Scenario 16

In [108]:
scenario_16_data= joinDeweysDfs(df_362x, df_616x)
scenario_16 = dewey_test(scenario_16_data)
scenario_16.preprocessing(numArticlesPerDewey =2, strict = False)
scenario_16.splitToTrainingAndTest(stratified = True)
scenario_16.fit()
scenario_16.train()
scenario_16.predict()
scenario_16.printResults()

             precision    recall  f1-score   support

       362x       0.90      0.92      0.91        51
       616x       0.92      0.90      0.91        50

avg / total       0.91      0.91      0.91       101


Accuracy:0.910891089109


## Scenario 17

In [109]:
## To get enough articles, dewey 362293, 362204 and 362292 are merged. The new "6-digit" dewey
## is named 362xxx.
df_362293 = getDeweyAndAllSubdeweys("362293", articles)
df_362204 = getDeweyAndAllSubdeweys("362204", articles)
df_362292 = getDeweyAndAllSubdeweys("362292", articles)
df_362xxx = joinDeweysDfs(df_362293, df_362204, df_362292)
mask = (df_362xxx['dewey'].str.len() < 7)
df_362xxx = df_362xxx.loc[mask].copy()
df_362xxx["dewey"] = "362xxx"

# Because it was hard to find enough articles, I use dewey 839 in these tests.
# The following deweys are merged to get enough articles: 839823, 839821, 839822, 839828, 839813, 8398209
df_839823 = getDeweyAndAllSubdeweys("839823", articles)
df_839821 = getDeweyAndAllSubdeweys("839821", articles)
df_839822 = getDeweyAndAllSubdeweys("839822", articles)
df_839828 = getDeweyAndAllSubdeweys("839828", articles)
df_839813 = getDeweyAndAllSubdeweys("839813", articles)
df_8398209 = getDeweyAndAllSubdeweys("8398209", articles)

df_839xxx = joinDeweysDfs(df_839823, df_839821, df_839822, df_839828, df_839813,df_8398209)
mask = (df_839xxx['dewey'].str.len() < 8)
df_839xxx = df_839xxx.loc[mask].copy()
df_839xxx["dewey"] = "839xxx"

scenario_17_data= joinDeweysDfs(df_362xxx, df_839xxx)
scenario_17 = dewey_test(scenario_17_data)
scenario_17.preprocessing(numArticlesPerDewey =240, strict = True)
scenario_17.splitToTrainingAndTest(stratified = True)
scenario_17.fit()
scenario_17.train()
scenario_17.predict()
scenario_17.printResults()

             precision    recall  f1-score   support

     362xxx       0.98      1.00      0.99        48
     839xxx       1.00      0.98      0.99        48

avg / total       0.99      0.99      0.99        96


Accuracy:0.989583333333


## Scenario 18

In [110]:
scenario_18_data= joinDeweysDfs(df_362xxx, df_839xxx)
scenario_18 = dewey_test(scenario_18_data)
scenario_18.preprocessing(numArticlesPerDewey =2, strict = False)
scenario_18.splitToTrainingAndTest(stratified = True)
scenario_18.fit()
scenario_18.train()
scenario_18.predict()
scenario_18.printResults()

             precision    recall  f1-score   support

     362xxx       0.98      1.00      0.99        63
     839xxx       1.00      0.98      0.99        50

avg / total       0.99      0.99      0.99       113


Accuracy:0.991150442478


## Scenario 19

In [120]:
## The two 3-digit deweys are 362 and 616. The outside class ("utenfor") is built from a random sample
# of articles that do not belong to these deweys.

# Build the random set
df_362 = getDeweyAndAllSubdeweys("362", articles)
df_616 = getDeweyAndAllSubdeweys("616", articles)
invalidDeweys =set(list(df_362["dewey"].values) + list(df_616["dewey"].values))

df_articles_wo_362_616 = articles[~articles['dewey'].isin(invalidDeweys)]

utenfor_sett = df_articles_wo_362_616.sample(n=240, random_state=2)
utenfor_sett["dewey"] = "utenfor"


df_362["dewey"] = df_362["dewey"].str[:3]
df_616["dewey"] = df_616["dewey"].str[:3]

scenario_19_data= joinDeweysDfs(df_362, df_616, utenfor_sett)
scenario_19 = dewey_test(scenario_19_data)
scenario_19.preprocessing(numArticlesPerDewey =240, strict = True)
scenario_19.splitToTrainingAndTest(stratified = True)
scenario_19.fit()
scenario_19.train()
scenario_19.predict()
scenario_19.printResults()

             precision    recall  f1-score   support

        362       0.73      0.75      0.74        48
        616       0.79      0.77      0.78        48
    utenfor       0.88      0.88      0.88        48

avg / total       0.80      0.80      0.80       144


Accuracy:0.798611111111
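
The outside class ("utenfor") in this scenario is a seeded random sample of articles whose dewey does not fall under 362 or 616. The sketch below shows the same construction on a plain list of deweys; `sample_outside` and the toy data are hypothetical, and it uses Python's `random` module rather than `DataFrame.sample`.

```python
import random

def sample_outside(deweys, excluded_prefixes, n, seed=2):
    """Draw `n` deweys that start with none of `excluded_prefixes`
    and label each of them as the outside class."""
    pool = [d for d in deweys if not d.startswith(tuple(excluded_prefixes))]
    rng = random.Random(seed)
    return [(d, "utenfor") for d in rng.sample(pool, n)]

deweys = ["36229", "6168915", "839823", "34304", "306", "657", "3621", "61612"]
outside = sample_outside(deweys, ["362", "616"], n=3)
print(outside)
```

Fixing the seed makes the sample reproducible, so the same outside set can be reused across scenarios.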


## Scenario 20

In [125]:
utenfor_sett_alle = df_articles_wo_362_616.sample(n=2000, random_state=2)
utenfor_sett_alle["dewey"] = "utenfor"

scenario_20_data= joinDeweysDfs(df_362, df_616, utenfor_sett_alle)
scenario_20 = dewey_test(scenario_20_data)
scenario_20.preprocessing(numArticlesPerDewey =2, strict = False)
scenario_20.splitToTrainingAndTest(stratified = True)
scenario_20.fit()
scenario_20.train()
scenario_20.predict()
scenario_20.printResults()

             precision    recall  f1-score   support

        362       0.82      0.88      0.85       406
        616       0.79      0.70      0.74       158
    utenfor       0.89      0.86      0.87       400

avg / total       0.84      0.84      0.84       964


Accuracy:0.843360995851


## Scenario 21

In [127]:
scenario_21_data= joinDeweysDfs(scenario_15_data, utenfor_sett)
scenario_21 = dewey_test(scenario_21_data)
scenario_21.preprocessing(numArticlesPerDewey =240, strict = True)
scenario_21.splitToTrainingAndTest(stratified = True)
scenario_21.fit()
scenario_21.train()
scenario_21.predict()
scenario_21.printResults()

             precision    recall  f1-score   support

       362x       0.85      0.94      0.89        48
       616x       0.93      0.81      0.87        48
    utenfor       0.86      0.88      0.87        48

avg / total       0.88      0.88      0.87       144


Accuracy:0.875


## Scenario 22

In [131]:
df_362 = getDeweyAndAllSubdeweys("362", articles)
# The outside class must exclude 362 and 839, the two deweys used in scenario 17.
df_839 = getDeweyAndAllSubdeweys("839", articles)
invalidDeweys =set(list(df_362["dewey"].values) + list(df_839["dewey"].values))
df_articles_wo_362_839 = articles[~articles['dewey'].isin(invalidDeweys)]
utenfor_sett = df_articles_wo_362_839.sample(n=240, random_state=2)
utenfor_sett["dewey"] = "utenfor"

scenario_22_data= joinDeweysDfs(scenario_17_data, utenfor_sett)
scenario_22 = dewey_test(scenario_22_data)
scenario_22.preprocessing(numArticlesPerDewey =240, strict = True)
scenario_22.splitToTrainingAndTest(stratified = True)
scenario_22.fit()
scenario_22.train()
scenario_22.predict()
scenario_22.printResults()

             precision    recall  f1-score   support

     362xxx       0.85      0.96      0.90        48
     839xxx       0.94      0.96      0.95        48
    utenfor       0.90      0.77      0.83        48

avg / total       0.90      0.90      0.89       144


Accuracy:0.895833333333


## Scenario 23

In [136]:
utenfor_sett_alle = df_articles_wo_362_839.sample(n=2000, random_state=2)
utenfor_sett_alle["dewey"] = "utenfor"

scenario_23_data= joinDeweysDfs(scenario_18_data, utenfor_sett_alle)
scenario_23 = dewey_test(scenario_23_data)
scenario_23.preprocessing(numArticlesPerDewey =240, strict = False)
scenario_23.splitToTrainingAndTest(stratified = True)
scenario_23.fit()
scenario_23.train()
scenario_23.predict()
scenario_23.printResults()

             precision    recall  f1-score   support

     362xxx       0.81      0.81      0.81        63
     839xxx       0.90      0.72      0.80        50
    utenfor       0.94      0.96      0.95       400

avg / total       0.92      0.92      0.92       513


Accuracy:0.918128654971
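
In the unbalanced scenarios the outside class is much larger than the other classes, so it helps to compare the accuracy with what a classifier would score by always predicting the largest class. The sketch below computes that majority-class baseline from the support columns of scenarios 20 and 23; `majority_baseline` is a hypothetical helper, and the baseline is added here for comparison only.

```python
# Test-set support per class, copied from the reports for scenarios 20 and 23.
supports = {
    "scenario 20": {"362": 406, "616": 158, "utenfor": 400},
    "scenario 23": {"362xxx": 63, "839xxx": 50, "utenfor": 400},
}

def majority_baseline(support):
    """Accuracy obtained by always predicting the most frequent class."""
    return max(support.values()) / sum(support.values())

baselines = {name: majority_baseline(s) for name, s in supports.items()}
for name, value in baselines.items():
    print(name, round(value, 3))
```

The baseline is about 0.42 for scenario 20 and about 0.78 for scenario 23, so the accuracy of 0.918 in scenario 23 is a smaller improvement over the baseline than the raw number suggests.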
