<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc" style="margin-top: 1em;"><ul class="toc-item"><li><span><a href="#Modules-Definition" data-toc-modified-id="Modules-Definition-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Modules Definition</a></span></li><li><span><a href="#Read-and-trim-dictionary" data-toc-modified-id="Read-and-trim-dictionary-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Read and trim dictionary</a></span></li><li><span><a href="#Import-and-Preprocess-Tweets" data-toc-modified-id="Import-and-Preprocess-Tweets-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Import and Preprocess Tweets</a></span></li><li><span><a href="#Create-and-save-doc2vec-model" data-toc-modified-id="Create-and-save-doc2vec-model-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Create and save doc2vec model</a></span></li><li><span><a href="#Train-the-Model:-The-Training-Dataset" data-toc-modified-id="Train-the-Model:-The-Training-Dataset-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Train the Model: The Training Dataset</a></span><ul class="toc-item"><li><span><a href="#K-neighbors-classifier" data-toc-modified-id="K-neighbors-classifier-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>K-neighbors-classifier</a></span></li><li><span><a href="#Random-Forest-and/or-Support-Vector-Machine" data-toc-modified-id="Random-Forest-and/or-Support-Vector-Machine-5.2"><span class="toc-item-num">5.2&nbsp;&nbsp;</span>Random Forest and/or Support Vector Machine</a></span></li></ul></li><li><span><a href="#General-Classification-Model" data-toc-modified-id="General-Classification-Model-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>General Classification Model</a></span></li><li><span><a href="#Classification-of-unseen-data" data-toc-modified-id="Classification-of-unseen-data-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Classification of unseen data</a></span><ul class="toc-item"><li><span><a href="#Neural-Network" data-toc-modified-id="Neural-Network-7.1"><span class="toc-item-num">7.1&nbsp;&nbsp;</span>Neural Network</a></span></li></ul></li><li><span><a href="#Automatic-Classification" data-toc-modified-id="Automatic-Classification-8"><span class="toc-item-num">8&nbsp;&nbsp;</span>Automatic Classification</a></span></li></ul></div>

In [1]:
import csv
import os
from os import path
import pandas as pd
import numpy as np
from nltk import word_tokenize
from nltk import sent_tokenize
from nltk.corpus import stopwords
from gensim.models import doc2vec
from gensim.corpora import WikiCorpus
from gensim.corpora import Dictionary
from collections import namedtuple
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold # import KFold
from sklearn.grid_search import GridSearchCV
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)



# Modules Definition

Define the following modules:
- preProcess(): Text preprocessing, to tokenize, remove stopwords, and sentences that are too short.
- addTags(): This is needed to change the structure of each sentence (tweet) in the format required by doc2vec. We need to add a tag to each document.

In [None]:
def preProcess(df):
    docs = []
    corpus = []
    idOriginal = []
    countWords = 0
    countDict = 0
    countShort = 0
    i = 0
    for ss in df["documents"]:
        if (i % 1000 == 0):
            print("Doc {0:5d} has been processed".format(i))
        i += 1
        tokens = word_tokenize(ss)
        #print(tokens)
        countWords += len(tokens)
        for w in tokens:
            #print("w ", w, " in dictionary? ", w in dct.token2id)
            if w not in dct.token2id:
                countDict += 1
        words = [w.lower() for w in word_tokenize(ss) if w not in stop_words and w in dct.token2id]
        if len(words) > 2:
            docs.append(words)
            corpus.append(ss)
            idOriginal.append(df.iloc[i-1][0])
        else:
            countShort += 1
    print("Dict = ", countDict, " (", countWords, " - ", countDict/countWords, " ) and Short = ", countShort)
    
    return docs, corpus, idOriginal

def addTags(docs):
    dTags = []
    analyzedDocument = namedtuple("AnalyzedDocument", "words tags")
    for i, doc in enumerate(docs):
        tags = [i]
        dTags.append(analyzedDocument(doc, tags))
    return dTags

# Read and trim dictionary 

We import a dictionary built on the Italian wikipedia and we cut it to a maximum number of words.

In [None]:
inp = "../wiki/Wikipedia_Word2vec-master/v1/itwiki-20180520-pages-articles.xml.bz2"
wiki = WikiCorpus(inp,lemmatize=False, dictionary={})

In [None]:
dct = Dictionary.load_from_text("/home/marco/gdrive/research/nlp/wiki/dictWiki.txt")
print("Length dictionary before filter = ", len(dct))
dct.filter_extremes(keep_n=500000)
print("Length dictionary after filter = ", len(dct))

In [None]:
stop_words = stopwords.words("italian")

# Import and Preprocess Tweets

We read the __full list__ of tweets (over 23k tweets), to build a doc2vec model with the entire dataset.

In [None]:
for i in range(len(dTagsAll)):
    if (dTagsAll[i][1][0] != idOriginal[i]-1):
        print(dTagsAll[i][1][0], " vs ", idOriginal[i]-1)
print("done")

In [None]:
df = pd.read_csv("data/listTweets.csv")
df.head()
docs, corpus, idOriginal = preProcess(df)
dTagsAll = addTags(docs)
print("Total nr. of documents = ", len(docs))
#for i in range(len(docs)):
#    print(dTagsAll[i][1], " :: ", dTagsAll[i][0])
hashtags = df[df["ID"].isin(idOriginal)]["HASHTAG"]
hashtags = [w.lower() for w in hashtags]

In [None]:
dfProcessed = pd.DataFrame({
    "idOriginal": idOriginal,
    "idCurrent" : np.arange(len(docs)),
    "docs"      : docs,
    "corpus"    : corpus,
    "hashtag"   : hashtags
})
dfProcessed.head()
dfProcessed.to_csv("data/preProcessedAll.csv")

# Create and save doc2vec model

Note that there are two implementations of the doc2vec model:

- doc2vecTwitter.model: This is the model obtained when the option "dm=0" is used in the generation of the model using the gensim module.
- doc2vecTwitter.model.dm: This model is obtained activating "dm=1". Some papers in the literature say that the doc2vec model obtained using "dm=1" is of higher quality than the one obtained with "dm=0".

In [None]:
model = doc2vec.Doc2Vec(size=300, window=10, min_count=2, iter=100, workers=4, dm=0, max_vocab_size=10000)
model.build_vocab(dTagsAll)
model.train(dTagsAll, total_examples=model.corpus_count, epochs=model.iter)
model.init_sims(replace=True)
model.save("doc2vecTwitter.model")

In [None]:
modelDM = doc2vec.Doc2Vec(size=300, window=10, min_count=2, iter=100, workers=4, dm=1, max_vocab_size=10000)
modelDM.build_vocab(dTagsAll)
modelDM.train(dTagsAll, total_examples=modelDM.corpus_count, epochs=modelDM.iter)
modelDM.init_sims(replace=True)
modelDM.save("doc2vecTwitter.model.dm")

In [None]:
model = doc2vec.Doc2Vec.load("doc2vecTwitter.model") # load one of the two versions of doc2vec model
# this is a little test to assess the quality of the doc2vec model...
withTest = 0
if withTest:
    x = []
    for idx in range(1000):
        v0 = model.infer_vector(dTagsAll[idx].words, steps=10000)
        sims = model.docvecs.most_similar([v0], topn=len(docs))
        rank = [idx for idx,sim in sims].index(idx)
        x.append(rank)
        if idx % 50 == 0:
            print("Vector ", idx, " has rank ", rank)

In [None]:
print(np.mean(x))
print(np.median(x))
print(x)

In [None]:
model.most_similar("banca", topn=20)


# Train the Model: The Training Dataset

We now import the training dataset, i.e., the one in which tweets have been manually labeled. The current version contains only labels $0$ and $1$. Therefore, we can use standard binary classification algorithms. Currently, we are trying the following algorithms:
- k-neighbors-classifier
- random forest
- support vector machine
We also do some calibration with Support Vector machine, to determine an appropriate choice of $\gamma$ and $C$.

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

In [None]:
dTrain1000 = pd.read_csv("data/sample_1000.csv")
dTrain1000.head()
docsTraining, corpusTraining, idTraining = preProcess(dTrain1000)
dTags1000 = addTags(docsTraining)

In [None]:
# get embedding for Hashtag (inferred)
hashtags = dTrain1000[dTrain1000["ID"].isin(idTraining)]["HASHTAG"]
vecHashtags = []
print("banca" in model.wv)
count = 0
for ht in hashtags:
    vecHashtags.append(model.infer_vector(ht))

In [None]:
print(dTrain1000.iloc[0,1])
print(dTrain1000.iloc[0,0])
#for i in range(10):
#    print("i = ", i , " vs tag ", dTags1000[i][1], " :: ", dTags1000[i].words)
# get mapping from original id to TAG value in doc2vec
id2tag = {orig:curr for orig,curr in zip(dfProcessed.idOriginal, dfProcessed.idCurrent)}
print(len(dTrain1000), len(docsTraining))
print(id2tag[64])

In [None]:
# infer the vector embeddings using doc2vec
# NOTE: Rem that some tweets have been eliminated from dTrain1000
Y = dTrain1000[dTrain1000["ID"].isin(idTraining)]["COD_SE"]
tags = [id2tag[i] for i in idTraining]

y = [y for y in Y]
X = []
#nn = 100
for idx in range(len(idTraining)):
#for idx in range(100):
    tag = tags[idx]
    if idx % 50 == 0:
        print("Vector ", idx, " with TAG = ", tag)
    #X.append(model.infer_vector(dTags[idx].words, steps=10000))
    if docsTraining[idx] != dTagsAll[tag].words:
        print("D2V(", tag, ") ", dTagsAll[tag].words)
        print("docs = ", dTrain1000.iloc[idx][1])
        print("doc =  ", docsTraining[idx])
        input("aka")
    xi = [i for i in model.docvecs[tag]]
    #print(xi)
    for i in vecHashtags[idx]:
        xi.append(i)
    X.append(xi)
print("D2V(", tag, ") ", dTagsAll[tag].words)
print("docs = ", dTrain1000.iloc[idx][1])
print("doc =  ", docsTraining[idx])

## K-neighbors-classifier

In [None]:
classifier = KNeighborsClassifier(n_neighbors=2)  
classifier.fit(X, y)  
y_pred = classifier.predict(X) 

In [None]:
print(confusion_matrix(y, y_pred))  
print(classification_report(y, y_pred))  

In [None]:
ypd = pd.DataFrame({
    "id": idOriginal,
    "corpus": corpusTraining,
    "y":y,
    "yPred":y_pred})
ypd.to_csv("data/yPred.csv")

## Random Forest and/or Support Vector Machine

Here we use a k-fold cross-validation method, to determine a more "objective" measure of the accuracy of the method. Note that such k-fold validation can easily be activated for random forest as well (just uncomment the corresponding lines and comment out those belonging to SVM).

In [9]:
method = "NN" # change it to "RF" or "SVM"
print("** ** ** Using Method ", method, " ** ** **")
KK = 5
kf = KFold(n_splits=KK, shuffle=True) # Define the split - into 5 folds 
print("Nr. of Folds = ", kf.get_n_splits(X)) # returns the number of splitting iterations in the cross-validator
i = 0
param_grid = {'C': [1, 5, 10, 50,100,1000,10000,100000,1000000],
              'gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5]}
avgPos      = 0.0
avgFalsePos = 0.0
avgAccuracy = 0.0
avgScore    = 0.0
for train_index, test_index in kf.split(X):
    #print("TEST:", test_index)
    i += 1
    print("Fold ", i)
    X_train, X_test = np.array(X)[train_index],  np.array(X)[test_index]
    y_train, y_test = np.array(y)[train_index],  np.array(y)[test_index]
    if method == "RF":
        mm = RandomForestClassifier(n_estimators=500, criterion="gini", n_jobs=-1, max_features="sqrt",class_weight={0:1, 1:100000})
        mm.fit(X_train, np.array(y_train))
    elif method == "SVM":
        #mm = SVC(kernel='rbf', class_weight={0: 1, 1: 100000}, gamma=0.0001, C=1E10)
        mm = SVC(kernel='sigmoid', class_weight={0: 1, 1: 100000}, gamma=0.001, C=1E10)
        # this is used to run a grid search on model parameters
        #grid = GridSearchCV(modelSVM, param_grid)
        #grid.fit(X_train, y_train)
        #print("Best Parameters = ", grid.best_params_)    
        mm.fit(X_train, np.array(y_train))
    elif method == "NN":
        mm = MLPClassifier(solver='lbfgs', alpha=0.5, hidden_layer_sizes=(5, 2), random_state=1)
        mm.fit(X, y)    
 
    # relevant output
    #training_predictions = modelRF.predict(X_train)
    testing_predictions = mm.predict(X_test)
    #print(confusion_matrix(y_train, training_predictions))
    #print(classification_report(y_train, training_predictions)) 
    tt = confusion_matrix(y_test, testing_predictions)
    print(tt)
    #print("Percent Correct Positive = ", tt[1][1]/(tt[1][0] + tt[1][1]))
    #print("Percent False Positive = ", tt[0][1]/(tt[0][1] + tt[1][1]))
    avgPos += tt[1][1]/(tt[1][0] + tt[1][1])
    avgFalsePos += tt[0][1]/(tt[0][1] + tt[1][1])
    avgAccuracy += (tt[0][0] + tt[1][1])/sum(sum(tt))
    avgScore += mm.score(X_test, y_test) # same as accuracy
    #print(classification_report(y_test, testing_predictions)) 

print("AVGS ::")
print("Correct Positives = ", avgPos/KK)
print("False Positives   = ", avgFalsePos/KK)
print("Accuracy          = ", avgAccuracy/KK)
print("Score             = ", avgScore/KK)

** ** ** Using Method  NN  ** ** **
Nr. of Folds =  5
Fold  1
[[188   0]
 [  0  11]]
Fold  2
[[187   1]
 [  0  11]]
Fold  3
[[191   2]
 [  0   6]]
Fold  4
[[194   0]
 [  0   4]]
Fold  5
[[182   4]
 [  0  12]]
AVGS ::
Correct Positives =  1.0
False Positives   =  0.116666666667
Accuracy          =  0.992944520583
Score             =  0.992944520583


# General Classification Model

This is the model we want to use on the unseen data. The choice of method and paramenters depend on the calibration phase described above.

In [None]:
# create general model to be used on unseen data
def createModel(method, kk, X, y, dm):
    if method == "RF":
        print("** ** ** Using Method ", method, " with dm = ", dm, " ** ** **")
        mmAll = RandomForestClassifier(n_estimators=500, criterion="gini", n_jobs=-1, max_features="sqrt",class_weight={0:1, 1:100000})
    elif method == "SVM":
        print("** ** ** Using Method ", method, " with dm = ", dm, " [ kernel = ", kk, "] ** ** **")
        mmAll = SVC(kernel=kk, class_weight={0: 1, 1: 100000}, gamma=0.0001, C=1E10)
        #mmAll = SVC(kernel='linear', class_weight={0: 0.001, 1: 10000000}, gamma=0.01, C=1)
    elif method == "NN":
        print("** ** ** Using Method ", method, " with dm = ", dm, " ** ** **")
        mmAll = MLPClassifier(solver='lbfgs', alpha=0.5, hidden_layer_sizes=(5, 2), random_state=1)

    mmAll.fit(X, np.array(y))
    training_predictions = mmAll.predict(X)
    #print(confusion_matrix(y, training_predictions))
    #print(classification_report(y, training_predictions)) 
    
    return mmAll


# Classification of unseen data

Read the full list of tweets (over 23k) and apply the best classifier (learned above) to this data. The resulting classification is stored on a dataframe and written on disk.

In [None]:
# this is no longer needed
#dfAll = pd.read_csv("data/listTweets.csv")
#docsAll, corpusAll, idOriginalAll = preProcess(dfAll)
#dTagsAll = addTags(docsAll)
#print("Tot documents: ", len(docsAll))
#print(dTagsAll[0])

In [None]:
# load the model and infer the embedding vectors
model = doc2vec.Doc2Vec.load("doc2vecTwitter.model.dm")
if os.path.exists("data/XAll.csv.dm") or os.path.exists("data/XAll.csv"):
    print("SKIPPING tweets embedding infer vector... Load tweets embedding from disk ")
    XAll = []
    with open("data/XAll.csv.dm","r") as f:
        rr = csv.reader(f)
        for row in rr:
            XAll.append(row)
    print("Read ", len(XAll), " vector embeddings.")
else:
    print("Computing tweets embedding...")
    XAll = []
    for idx in range(len(dTagsAll)):
        if idx % 25 == 0:
            print("Vector ", idx)
        XAll.append(model.infer_vector(dTagsAll[idx].words, steps=10000))
    # write in csv format the embeddings. This should save some time
    with open("data/XAll.csv.dm","w") as f:
        wr = csv.writer(f)
        wr.writerows(XAll)
        
print("Step completed.")

## Neural Network

In [None]:

clf = MLPClassifier(solver='lbfgs', alpha=1e-5,
                    hidden_layer_sizes=(5, 2), random_state=1)
clf.fit(X, y)       
training_predictions = clf.predict(X)
print(confusion_matrix(y, training_predictions))
print(classification_report(y, training_predictions)) 
print(clf.score(X,y))


# Automatic Classification

We first define and load needed libraries. Next, we run the cycle over different method (`dm=0` and `dm=1`), different classification algorithms (`RF` and `SVM`), and in the case of Support Vector Machine, different kernels.

In [2]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

In [3]:
def preProcess(df):
    docs = []
    corpus = []
    idOriginal = []
    countWords = 0
    countDict = 0
    countShort = 0
    i = 0
    print("Preprocessing Full Set of Documents ... ")
    for ss in df["documents"]:
        if (i % 10000 == 0):
            print("Doc {0:5d} has been processed".format(i))
        i += 1
        tokens = word_tokenize(ss)
        #print(tokens)
        countWords += len(tokens)
        for w in tokens:
            #print("w ", w, " in dictionary? ", w in dct.token2id)
            if w not in dct.token2id:
                countDict += 1
        words = [w.lower() for w in word_tokenize(ss) if w not in stop_words and w in dct.token2id]
        if len(words) > 2:
            docs.append(words)
            corpus.append(ss)
            idOriginal.append(df.iloc[i-1][0])
        else:
            countShort += 1
    print("... done.")
    print("Dict = ", countDict, " (", countWords, " - ", countDict/countWords, " ) and Short = ", countShort)
    
    return docs, corpus, idOriginal

def addTags(docs):
    dTags = []
    analyzedDocument = namedtuple("AnalyzedDocument", "words tags")
    for i, doc in enumerate(docs):
        tags = [i]
        dTags.append(analyzedDocument(doc, tags))
    return dTags

In [4]:
def createDF():
    df = pd.read_csv("data/listTweets.csv")
    docs, corpus, idOriginal = preProcess(df)
    dTagsAll = addTags(docs)
    print("Total nr. of documents = ", len(docs))
    hashtags = df[df["ID"].isin(idOriginal)]["HASHTAG"]
    hashtags = [w.lower() for w in hashtags]
        
    dfProcessed = pd.DataFrame({
    "idOriginal": idOriginal,
    "idCurrent" : np.arange(len(docs)),
    "docs"      : docs,
    "corpus"    : corpus,
    "hashtag"   : hashtags
    })
    
    return dfProcessed

def readDoc2VecModel(dm):
    if dm==0:
        return doc2vec.Doc2Vec.load("doc2vecTwitter.model")
    elif dm==1:
        return doc2vec.Doc2Vec.load("doc2vecTwitter.model.dm")
    
def prepareData(modelDoc2Vec, hashtags, nDocs):    
    vecHashtags = []
    for ht in hashtags:
        vecHashtags.append(modelDoc2Vec.infer_vector(ht))
    XAll = []
    # construct XAll
    for idx in range(nDocs):
    #for idx in range(100):
        tag = dfProcessed.idCurrent[idx]
        xi = [i for i in modelDoc2Vec.docvecs[tag]]
        for i in vecHashtags[idx]:
            xi.append(i)
        XAll.append(xi)
    return XAll

def prepareTraining(modelDoc2Vec, selected_column):
    print("Preprocessing Training Data ...")
    dTrain1000 = pd.read_csv("data/sample_1000.csv")
    docsTraining, corpusTraining, idTraining = preProcess(dTrain1000)
    dTags1000 = addTags(docsTraining)
    hashtags = dTrain1000[dTrain1000["ID"].isin(idTraining)]["HASHTAG"]
    id2tag = {orig:curr for orig,curr in zip(dfProcessed.idOriginal, dfProcessed.idCurrent)}
    vecHashtags = []
    for ht in hashtags:
        vecHashtags.append(modelDoc2Vec.infer_vector(ht))

    Y = dTrain1000[dTrain1000["ID"].isin(idTraining)][selected_column]
    y = [y for y in Y]
    tags = [id2tag[i] for i in idTraining]

    X = []
    #nn = 100
    for idx in range(len(idTraining)):
    #for idx in range(100):
        tag = tags[idx]
        if idx % 500 == 0:
            print("Vector ", idx, " with TAG = ", tag)
        xi = [i for i in modelDoc2Vec.docvecs[tag]]
        #print(xi)
        for i in vecHashtags[idx]:
            xi.append(i)
        X.append(xi)
    print("... done.")

    return X, y

In [5]:
# use general model to predict Y value of unseen data
# note that dfProcessed here is a GLOBAL var
# create general model to be used on unseen data
def createModel(method, kk, X, y, dm):
    if method == "RF":
        print("** ** ** Using Method ", method, " with dm = ", dm, " ** ** **")
        mmAll = RandomForestClassifier(n_estimators=500, criterion="gini", n_jobs=-1, max_features="sqrt",class_weight={0:1, 1:100000})
    elif method == "SVM":
        print("** ** ** Using Method ", method, " with dm = ", dm, " [ kernel = ", kk, "] ** ** **")
        mmAll = SVC(kernel=kk, class_weight={0: 1, 1: 100000}, gamma=0.0001, C=1E10)
        #mmAll = SVC(kernel='linear', class_weight={0: 0.001, 1: 10000000}, gamma=0.01, C=1)
    elif method == "NN":
        print("** ** ** Using Method ", method, " with dm = ", dm, " ** ** **")
        mmAll = MLPClassifier(solver='lbfgs', alpha=0.5, hidden_layer_sizes=(5, 2), random_state=1)


    mmAll.fit(X, np.array(y))
    training_predictions = mmAll.predict(X)
    #print(confusion_matrix(y, training_predictions))
    #print(classification_report(y, training_predictions)) 
    
    return mmAll

def predictAll(mmAll, newCol, XAll):
    training_predictions = mmAll.predict(XAll)
    print("Nr. Ones = ", sum(training_predictions))
    print("Done with prediction on all unseen data")
    dfProcessed[newCol] = training_predictions
    
#def getTraining(modelDoc2Vec, selected_column):
#    df = pd.read_csv("data/sample_1000.csv")
#    idTrain = [int(i) for i in df["ID"]]
#    hashtags = dfProcessed[dfProcessed["idOriginal"].isin(idTrain)]["hashtag"]
#    ids = dfProcessed[dfProcessed["idOriginal"].isin(idTrain)]["idOriginal"]
    #y = df[df["ID"].isin(idTrain)]["COD_SE"]
#    y = df[df["ID"].isin(idTrain)][selected_column]
#    return hashtags, y


In [6]:
# preliminary libraries and stuff
stop_words = stopwords.words("italian")
dct = Dictionary.load_from_text("/home/marco/gdrive/research/nlp/wiki/dictWiki.txt")
print("Length dictionary before filter = ", len(dct))
dct.filter_extremes(keep_n=500000)
print("Length dictionary after filter = ", len(dct))

dfProcessed = createDF()

Length dictionary before filter =  2022361


2018-06-25 11:25:37,002 : INFO : discarding 1522361 tokens: [('al', 670883), ('alla', 533998), ('bagchee', 3), ('categoria', 695352), ('chandrakantha', 5), ('che', 694169), ('collegamenti', 623305), ('con', 695374), ('da', 678541), ('dal', 544442)]...
2018-06-25 11:25:37,003 : INFO : keeping 500000 tokens which were in no less than 5 and no more than 524298 (=50.0%) documents
2018-06-25 11:25:39,496 : INFO : resulting dictionary: Dictionary(500000 unique tokens: ['stadhuis', 'monétaire', 'köller', 'indecifrati', 'krunić']...)


Length dictionary after filter =  500000
Preprocessing Full Set of Documents ... 
Doc     0 has been processed
Doc 10000 has been processed
Doc 20000 has been processed
... done.
Dict =  83200  ( 329807  -  0.25226875111807817  ) and Short =  170
Total nr. of documents =  23142


In [7]:
# cycle over different methods and kernels
cols = ["SE", "ME"]  #two types of emotions
dm = [0,1]
methods = ["RF", "SVM", "NN"]
kernels = ["sigmoid", "linear", "poly", "rbf"]
for cc in cols:
    selected_column = "COD_" + cc
    print("*"*81)
    print("*\t \t \t \t COLUMN ", selected_column, " \t \t \t \t*")
    print("*"*81)

    for dd in dm:
        print("*"*81)
        print("*\t \t \t \t CLASSIFICATION \t \t \t \t*")
        print("*"*81)

        modelDoc2Vec = readDoc2VecModel(dd)
        #hashtags = dfProcessed["hashtag"]
        XAll = prepareData(modelDoc2Vec, dfProcessed["hashtag"], len(dfProcessed))
        X, y = prepareTraining(modelDoc2Vec, selected_column)

        for meth in methods:
            if meth == "SVM":
                for kk in kernels:
                    mmAll = createModel(meth, kk, X, y, dd)
                    newCol = cc + "_dm" + str(dd) + "_" + str(meth) + "_" + str(kk)
                    predictAll(mmAll, newCol, XAll)
            elif meth == "RF":
                mmAll = createModel(meth, "", X, y, dd)
                newCol = cc + "_dm" + str(dd) + "_" + str(meth)
                predictAll(mmAll, newCol, XAll)
            elif meth == "NN":
                mmAll = createModel(meth, "", X, y, dd)
                newCol = cc + "_dm" + str(dd) + "_" + str(meth)
                predictAll(mmAll, newCol, XAll)

print("Classification saved on disk...")
dfProcessed.to_csv("data/classifyUnseen.csv")

2018-06-25 11:26:19,261 : INFO : loading Doc2Vec object from doc2vecTwitter.model


*********************************************************************************
*	 	 	 	 COLUMN  COD_SE  	 	 	 	*
*********************************************************************************
*********************************************************************************
*	 	 	 	 CLASSIFICATION 	 	 	 	*
*********************************************************************************


2018-06-25 11:26:19,691 : INFO : loading docvecs recursively from doc2vecTwitter.model.docvecs.* with mmap=None
2018-06-25 11:26:19,692 : INFO : loading wv recursively from doc2vecTwitter.model.wv.* with mmap=None
2018-06-25 11:26:19,693 : INFO : setting ignored attribute syn0norm to None
2018-06-25 11:26:19,695 : INFO : setting ignored attribute cum_table to None
2018-06-25 11:26:19,696 : INFO : loaded doc2vecTwitter.model


Preprocessing Training Data ...
Preprocessing Full Set of Documents ... 
Doc     0 has been processed
... done.
Dict =  3530  ( 14254  -  0.24764978251718817  ) and Short =  7
Vector  0  with TAG =  19
Vector  500  with TAG =  11848
... done.
** ** ** Using Method  RF  with dm =  0  ** ** **
Nr. Ones =  138
Done with prediction on all unseen data
** ** ** Using Method  SVM  with dm =  0  [ kernel =  sigmoid ] ** ** **
Nr. Ones =  3943
Done with prediction on all unseen data
** ** ** Using Method  SVM  with dm =  0  [ kernel =  linear ] ** ** **
Nr. Ones =  3942
Done with prediction on all unseen data
** ** ** Using Method  SVM  with dm =  0  [ kernel =  poly ] ** ** **
Nr. Ones =  1697
Done with prediction on all unseen data
** ** ** Using Method  SVM  with dm =  0  [ kernel =  rbf ] ** ** **
Nr. Ones =  3762
Done with prediction on all unseen data
** ** ** Using Method  NN  with dm =  0  ** ** **


2018-06-25 11:27:08,417 : INFO : loading Doc2Vec object from doc2vecTwitter.model.dm


Nr. Ones =  1809
Done with prediction on all unseen data
*********************************************************************************
*	 	 	 	 CLASSIFICATION 	 	 	 	*
*********************************************************************************


2018-06-25 11:27:08,943 : INFO : loading docvecs recursively from doc2vecTwitter.model.dm.docvecs.* with mmap=None
2018-06-25 11:27:08,945 : INFO : loading wv recursively from doc2vecTwitter.model.dm.wv.* with mmap=None
2018-06-25 11:27:08,946 : INFO : setting ignored attribute syn0norm to None
2018-06-25 11:27:08,948 : INFO : setting ignored attribute cum_table to None
2018-06-25 11:27:08,949 : INFO : loaded doc2vecTwitter.model.dm


Preprocessing Training Data ...
Preprocessing Full Set of Documents ... 
Doc     0 has been processed
... done.
Dict =  3530  ( 14254  -  0.24764978251718817  ) and Short =  7
Vector  0  with TAG =  19
Vector  500  with TAG =  11848
... done.
** ** ** Using Method  RF  with dm =  1  ** ** **
Nr. Ones =  154
Done with prediction on all unseen data
** ** ** Using Method  SVM  with dm =  1  [ kernel =  sigmoid ] ** ** **
Nr. Ones =  4177
Done with prediction on all unseen data
** ** ** Using Method  SVM  with dm =  1  [ kernel =  linear ] ** ** **
Nr. Ones =  4174
Done with prediction on all unseen data
** ** ** Using Method  SVM  with dm =  1  [ kernel =  poly ] ** ** **
Nr. Ones =  6469
Done with prediction on all unseen data
** ** ** Using Method  SVM  with dm =  1  [ kernel =  rbf ] ** ** **
Nr. Ones =  3539
Done with prediction on all unseen data
** ** ** Using Method  NN  with dm =  1  ** ** **


2018-06-25 11:28:18,351 : INFO : loading Doc2Vec object from doc2vecTwitter.model


Nr. Ones =  1800
Done with prediction on all unseen data
*********************************************************************************
*	 	 	 	 COLUMN  COD_ME  	 	 	 	*
*********************************************************************************
*********************************************************************************
*	 	 	 	 CLASSIFICATION 	 	 	 	*
*********************************************************************************


2018-06-25 11:28:18,864 : INFO : loading docvecs recursively from doc2vecTwitter.model.docvecs.* with mmap=None
2018-06-25 11:28:18,866 : INFO : loading wv recursively from doc2vecTwitter.model.wv.* with mmap=None
2018-06-25 11:28:18,867 : INFO : setting ignored attribute syn0norm to None
2018-06-25 11:28:18,868 : INFO : setting ignored attribute cum_table to None
2018-06-25 11:28:18,870 : INFO : loaded doc2vecTwitter.model


Preprocessing Training Data ...
Preprocessing Full Set of Documents ... 
Doc     0 has been processed
... done.
Dict =  3530  ( 14254  -  0.24764978251718817  ) and Short =  7
Vector  0  with TAG =  19
Vector  500  with TAG =  11848
... done.
** ** ** Using Method  RF  with dm =  0  ** ** **
Nr. Ones =  36
Done with prediction on all unseen data
** ** ** Using Method  SVM  with dm =  0  [ kernel =  sigmoid ] ** ** **
Nr. Ones =  1973
Done with prediction on all unseen data
** ** ** Using Method  SVM  with dm =  0  [ kernel =  linear ] ** ** **
Nr. Ones =  1973
Done with prediction on all unseen data
** ** ** Using Method  SVM  with dm =  0  [ kernel =  poly ] ** ** **
Nr. Ones =  516
Done with prediction on all unseen data
** ** ** Using Method  SVM  with dm =  0  [ kernel =  rbf ] ** ** **
Nr. Ones =  1925
Done with prediction on all unseen data
** ** ** Using Method  NN  with dm =  0  ** ** **


2018-06-25 11:28:55,344 : INFO : loading Doc2Vec object from doc2vecTwitter.model.dm


Nr. Ones =  1003
Done with prediction on all unseen data
*********************************************************************************
*	 	 	 	 CLASSIFICATION 	 	 	 	*
*********************************************************************************


2018-06-25 11:28:55,747 : INFO : loading docvecs recursively from doc2vecTwitter.model.dm.docvecs.* with mmap=None
2018-06-25 11:28:55,748 : INFO : loading wv recursively from doc2vecTwitter.model.dm.wv.* with mmap=None
2018-06-25 11:28:55,749 : INFO : setting ignored attribute syn0norm to None
2018-06-25 11:28:55,750 : INFO : setting ignored attribute cum_table to None
2018-06-25 11:28:55,752 : INFO : loaded doc2vecTwitter.model.dm


Preprocessing Training Data ...
Preprocessing Full Set of Documents ... 
Doc     0 has been processed
... done.
Dict =  3530  ( 14254  -  0.24764978251718817  ) and Short =  7
Vector  0  with TAG =  19
Vector  500  with TAG =  11848
... done.
** ** ** Using Method  RF  with dm =  1  ** ** **
Nr. Ones =  44
Done with prediction on all unseen data
** ** ** Using Method  SVM  with dm =  1  [ kernel =  sigmoid ] ** ** **
Nr. Ones =  2702
Done with prediction on all unseen data
** ** ** Using Method  SVM  with dm =  1  [ kernel =  linear ] ** ** **
Nr. Ones =  2702
Done with prediction on all unseen data
** ** ** Using Method  SVM  with dm =  1  [ kernel =  poly ] ** ** **


KeyboardInterrupt: 