# TCNER assigment for Data Science by Irma Harms and Maaike Keurhorst

## Decription of project:

This study investigates which combination of methods works best for recommending a conference to a researcher given the title of this new research paper. The methods were formed by combining dimensionality reduction techniques (Term Frequency and Term Frequency-Inverse Document Frequency) with classifiers (Naive Bayes, K-Nearest Neighbors, Linear Support Vector Machine), giving us six combinations. The results showed that TF-IDF Naïve Bayes was the fastest and TF-IDF Linear Support Vector Machine was the best performing. Overall, the Naïve Bayes methods did perform just a bit less than the others, but might be a better classifier for bigger data sets due to its speed. 


## All imports

To keep everything clear and to avoid redundancy in imports, we do them all here.

In [1]:
# Functions preprocessing
import pandas as pd
import spacy
import en_core_web_sm
from spacy.lang.en.stop_words import STOP_WORDS

# Functions feature extraction
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

# Functions for classification
from sklearn import svm
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
import numpy as np
from sklearn.pipeline import Pipeline

# Evaluation
from sklearn.metrics import accuracy_score
from sklearn.metrics import average_precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.metrics import precision_recall_fscore_support as score

## All functions for getting the data and data preprocessing

We save the data in pandas for ease of use. The txt files gotten have first been converted to excel before used in the code. 


For the preprocessing we take the following steps:

- Stopword removal
- Lemmetazation
- Tokenization
- Lowercase
- Punctuation and character removal

In [2]:
# Return a pandas dataframe
def getData(textfile):
    df = pd.read_excel(textfile)
    return df

In [3]:
# All the preprocessing steps as mentioned before. Returns tokens.
def preprocessing(title, nlp):
    # Tokenization
    title = nlp(str(title))
    tokens = [token.lemma_ for token in title]
    
    # Punctuation, stopword and space removal
    tokens = [token for token in tokens if not nlp.vocab[token].is_punct | nlp.vocab[token].is_space | nlp.vocab[token].is_stop]
    
    # Lowercase
    tokens = [token for token in tokens if token.lower()]
    
    # Character removal: Check if it at least has some letters, otherwise remove
    tokens = [token for token in tokens if token.islower()]
    
    return tokens

## All functions for feature extraction

This is combined in the pipeline together with the classifier, so we only need a function to combine the tokens back into one text

In [4]:
def return_to_text(corpus):
    space = " "
    return [space.join(lst) for lst in corpus]

## All functions for classification

All classifiers use the predict function to make the pipeline and make the actual prediction.

In [5]:
def predict(clf, x, y, prepros):
    if prepros == 'tfidf':
        print('tfidf')
        pipeline = Pipeline([('vect', CountVectorizer()),('tfidf', TfidfTransformer()),('clf', clf),]) 
    else:
        print('bow')
        pipeline = Pipeline([('vect', CountVectorizer()),('clf', clf),]) 

    return cross_val_predict(pipeline, x, y, cv=3, n_jobs=3, verbose=2)

In [6]:
# Support vector machine
def svm_linear(x, y, prepros):
    
    # The classifier
    clf = svm.SVC(kernel = 'linear')
    
    return predict(clf, x, y, prepros)

In [7]:
# Multinominal bayes
def bayes(x, y, prepros):
    
    # The classifier
    clf = MultinomialNB()

    return predict(clf, x, y, prepros)

In [8]:
# K-nearest neighbors
def k_nearest_neigbours(x, y, prepros, k):
    
    # The classifier
    clf = KNeighborsClassifier(n_neighbors=k)

    return predict(clf, x, y, prepros)


## Evaulation

Here is the method we can call for evauating our system

In [9]:
def get_scores(real, pred):
    precision, recall, fscore, support = score(real, pred, average='macro')

    print(pd.crosstab(real, pred, rownames=['Actual'], colnames=['Predicted'], margins=True))
    print(' ')
    print('F1 score:', f1_score(real, pred, average='macro'))
    print('Recall:', recall_score(real, pred, average='macro'))
    print('Accuracy:', accuracy_score(real, pred))
    print('Precision:', precision)

## Calling all functions
Here we can call all functions

In [10]:
# Get all the data
test_gt = getData('TestGroundTruth.xlsx')
test = getData('Test.xlsx')
train = getData('Train.xlsx')

In [11]:
# Data preprocessing
nlp = en_core_web_sm.load()
tokens_test = [preprocessing(text, nlp) for text in test['Title']]
tokens_train = [preprocessing(text, nlp) for text in train['Title']]
text_train = return_to_text(tokens_train)

In [13]:
# Training of bow
svm_linear_tfidf_pred = svm_linear(text_train, train['Conference'], 'tfidf')
bayes_tfidf_pred = bayes(text_train, train['Conference'], 'tfidf')
knn_tfidf_pred = k_nearest_neigbours(text_train, train['Conference'], 'tfidf', 1)

svm_linear_bow_pred = svm_linear(text_train, train['Conference'], 'bow')
bayes_bow_pred = bayes(text_train, train['Conference'], 'bow')
knn_bow_pred = k_nearest_neigbours(text_train, train['Conference'], 'bow', 1)


# Used for finding the best k value. 
# for i in range(1, 8):
#     knn_tfidf_pred = k_nearest_neigbours(text_train, train['Conference'], 'bow', i)
#     print('(', i, ',', f1_score(train['Conference'], knn_tfidf_pred, average = 'micro'), ')')
    
# for i in range(1,11):
#     knn_bow_pred = k_nearest_neigbours(text_train, train['Conference'], 'tfidf', i)
#     print('(', i, ',', f1_score(train['Conference'], knn_bow_pred, average = 'micro'), ')')

tfidf


[Parallel(n_jobs=3)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=3)]: Done   3 out of   3 | elapsed:   13.8s finished


tfidf


[Parallel(n_jobs=3)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=3)]: Done   3 out of   3 | elapsed:    0.3s finished


tfidf


[Parallel(n_jobs=3)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=3)]: Done   3 out of   3 | elapsed:   13.2s finished


bow


[Parallel(n_jobs=3)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=3)]: Done   3 out of   3 | elapsed:   11.6s finished


bow


[Parallel(n_jobs=3)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=3)]: Done   3 out of   3 | elapsed:    0.3s finished


bow


[Parallel(n_jobs=3)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=3)]: Done   3 out of   3 | elapsed:    2.3s finished


In [14]:
# put all the predictions in a dict.
predictions = {'svm_linear_bow': svm_linear_bow_pred, 'bayes_bow': bayes_bow_pred, 
               'knn_bow': knn_bow_pred, 'svm_linear_tfidf': svm_linear_tfidf_pred, 
                'bayes_tfidf': bayes_tfidf_pred, 'knn_tfidf': knn_tfidf_pred}

# Print all scores
for each in predictions:
    print(each)
    get_scores(train['Conference'], predictions[each])
    print(' ')
    print(' ')

svm_linear_bow
Predicted  INFOCOM  ISCAS  SIGGRAPH  VLDB   WWW    All
Actual                                                
INFOCOM       2792    321        34  1209   125   4481
ISCAS          628   5924       169   721    72   7514
SIGGRAPH       170    312      1730   389    77   2678
VLDB           451    149        33  2848   197   3678
WWW            352    131        74   439  2296   3292
All           4393   6837      2040  5606  2767  21643
 
F1 score: 0.7119224302630673
Recall: 0.7058513840314863
Accuracy: 0.7203252783810008
Precision: 0.7375728679771367
 
 
bayes_bow
Predicted  INFOCOM  ISCAS  SIGGRAPH  VLDB   WWW    All
Actual                                                
INFOCOM       2753   1313        43   153   219   4481
ISCAS          294   6893       155    67   105   7514
SIGGRAPH        83    467      1914    65   149   2678
VLDB           315   1524        81  1286   472   3678
WWW            258    199        61   223  2551   3292
All           3703  10396    