## <span style="color:navy">Alternative 1: Classifying articles using their abstracts</span>

### **Loading the packages we are going to use.**

In [53]:
import numpy as np
import pandas as pd
import csv
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss

### **Loading the data about each article into a dataframe (id / year / title / authors / abstract)**

In [109]:
df = pd.read_csv("node_information.csv")
print(df.columns)

Index(['id', 'year', 'title', 'authors', 'abstract'], dtype='object')


### **Reading the training data (document ids and the journal each document was published in).**


### **We have 28 journals**

In [40]:
all_ids = list()
y_all = list()
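with open('train.csv', 'r') as f:
    next(f)  # skip the header row
    for line in f:
        # Strip the line ending before splitting, so the label stays intact
        # even if the last line has no trailing newline
        t = line.strip().split(',')
        all_ids.append(t[0])
        y_all.append(t[1])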
n_all = len(all_ids)
unique = np.unique(y_all)
print("\nNumber of classes: ", unique.size)


Number of classes:  28
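
The manual `split(',')` parsing above can also be written with the standard `csv` module, which handles line endings on its own. The following is a minimal sketch, assuming `train.csv` has a header row followed by `id,journal` rows. The sample text, the column names in it, and the `read_labels` helper are made up for illustration.

```python
import csv
import io

def read_labels(handle):
    """Return (ids, labels) from a CSV stream with a header row (hypothetical helper)."""
    reader = csv.reader(handle)
    next(reader)  # skip the header row
    ids, labels = [], []
    for row in reader:
        if not row:  # ignore blank lines
            continue
        ids.append(row[0])
        labels.append(row[1])
    return ids, labels

# Made-up sample standing in for train.csv; note the missing final newline
sample = io.StringIO("Article,Journal\n101,Nature\n102,Science\n103,Nature")
sample_ids, sample_labels = read_labels(sample)
n_classes = len(set(sample_labels))
print(sample_ids, sample_labels, n_classes)
```

With the real file, `read_labels(open('train.csv', newline=''))` would play the role of the loop above.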


### **Splitting into training and validation sets, in order to run a local evaluation before uploading the results to Kaggle**

In [41]:
train_ids,valid_ids, y_train, y_val = train_test_split(all_ids, y_all, test_size=0.2, random_state=1)

In [42]:
print("Number of documents used for training : ",len(train_ids))
print("Number of documents used for validation: ",len(valid_ids))

Number of documents used for training :  12272
Number of documents used for validation:  3069
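
The two counts above can be checked by hand from `test_size=0.2`. This is a small arithmetic sketch, assuming the validation share is rounded up to a whole document, which is consistent with the numbers printed above.

```python
import math

n_all = 12272 + 3069   # total labelled documents, from the output above
test_size = 0.2

# Assumption: the validation share is rounded up to a whole document
n_valid = math.ceil(test_size * n_all)
n_train = n_all - n_valid

print(n_all, n_train, n_valid)
```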


### **Extracting abstracts for training and validation ids**

In [157]:
train_abstracts = list()
val_abstracts = list()

for i in train_ids:
    train_abstracts.append(df.loc[df['id'] == int(i)]['abstract'].iloc[0])

for i in valid_ids:
    val_abstracts.append(df.loc[df['id'] == int(i)]['abstract'].iloc[0])

In [158]:
print("Train abstracts dimensionality: ", len(train_abstracts))
print("Validation abstracts dimensionality: ", len(val_abstracts))

Train abstracts dimensionality:  12272
Validation abstracts dimensionality:  3069
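
The loops above scan the whole dataframe once per id. A possible alternative is to index the abstracts by id once and then select all ids in a single call. This is only a sketch on a made-up dataframe (`toy_df` and `toy_ids` are illustrative names), and it assumes every id appears exactly once.

```python
import pandas as pd

# Made-up stand-in for node_information.csv
toy_df = pd.DataFrame({
    "id": [1, 2, 3],
    "abstract": ["first abstract", "second abstract", "third abstract"],
})

# Index the abstracts by article id once
abstract_by_id = toy_df.set_index("id")["abstract"]

toy_ids = ["3", "1"]  # ids read from a CSV file arrive as strings
# .loc with a list of labels returns the rows in the requested order
toy_abstracts = abstract_by_id.loc[[int(i) for i in toy_ids]].tolist()
print(toy_abstracts)
```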


### **Creating the training matrix and validation matrix.**
- Each row corresponds to an article and each column to a word present in at least 2 and at most 50 articles. 
- The value of each entry in a row is equal to the frequency of that word in the corresponding article.
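
To make the two bullets concrete, here is a pure-Python sketch of the same idea. It is not the scikit-learn implementation: it splits on whitespace, does not remove stop words, and the function names are made up. The thresholds are lowered to 2 and 2 only because the toy corpus is tiny.

```python
from collections import Counter

def select_vocabulary(docs, min_df=2, max_df=50):
    """Keep the words whose document frequency lies in [min_df, max_df]."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc.lower().split()))  # count each word once per document
    return sorted(w for w, n in doc_freq.items() if min_df <= n <= max_df)

def count_row(doc, vocabulary):
    """Term-frequency row of one document over a fixed vocabulary."""
    counts = Counter(doc.lower().split())
    return [counts[w] for w in vocabulary]

toy_docs = [
    "graph kernels compare graph structure",
    "graph neural networks for text",
    "kernels for text and text mining",
    "graph theory",
]

# "graph" appears in 3 documents and is dropped; words seen in one document are dropped too
toy_vocab = select_vocabulary(toy_docs, min_df=2, max_df=2)
toy_matrix = [count_row(d, toy_vocab) for d in toy_docs]
print(toy_vocab)
print(toy_matrix)
```

Each row of `toy_matrix` corresponds to one document and each column to one retained word, as in the matrices built below.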

In [45]:
vec = CountVectorizer(decode_error='ignore', min_df=2, max_df=50, stop_words='english')
X_train = vec.fit_transform(train_abstracts)
X_valid = vec.transform(val_abstracts)

In [46]:
print("Train matrix dimensionality: ", X_train.shape)
print("Validation matrix dimensionality: ", X_valid.shape)

Train matrix dimensionality:  (12272, 7965)
Validation matrix dimensionality:  (3069, 7965)
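
Both matrices have the same number of columns because the vocabulary is learned from the training abstracts only and then reused. The sketch below illustrates this on made-up sentences with a default `CountVectorizer` (no `min_df`, `max_df` or stop-word settings): words that were not seen during fitting are ignored by `transform`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents for illustration
toy_train = ["graph kernels for classification", "graph neural networks"]
toy_valid = ["quantum graph theory"]

toy_vec = CountVectorizer()
toy_X_train = toy_vec.fit_transform(toy_train)  # learns the vocabulary
toy_X_valid = toy_vec.transform(toy_valid)      # reuses it; unseen words are dropped

print(toy_X_train.shape, toy_X_valid.shape)
print(sorted(toy_vec.vocabulary_))
```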


### **Reading the test data (ids of the documents to classify).**

In [47]:
test_ids = list()
with open('test.csv', 'r') as f:
    next(f)
    for line in f:
        test_ids.append(line[:-2])

### **Extracting abstracts for test ids**

In [48]:
# Extract the abstract of each test article from the dataframe
n_test = len(test_ids)
test_abstracts = list()
for i in test_ids:
    test_abstracts.append(df.loc[df['id'] == int(i)]['abstract'].iloc[0])

### **Creating the test matrix**

In [49]:
# Create the test matrix following the same approach as in the case of the training matrix
X_test = vec.transform(test_abstracts)

In [50]:
print("Train matrix dimensionality: ", X_train.shape)
print("Validation matrix dimensionality: ", X_valid.shape)
print("Test matrix dimensionality: ", X_test.shape)

Train matrix dimensionality:  (12272, 7965)
Validation matrix dimensionality:  (3069, 7965)
Test matrix dimensionality:  (3836, 7965)


### **Logistic regression classifier to classify the articles of the validation set**

In [82]:
clf = LogisticRegression()
clf.fit(X_train, y_train)
y_pred = clf.predict_proba(X_valid)
loss = log_loss(y_val,y_pred)
print("Logistic Regression classifier's loss : ",loss)

Logistic Regression classifier's loss :  2.53412567057
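
To put this number in context: the multiclass log loss is the mean negative log of the probability assigned to the true class. A classifier that always predicts the same probability 1/28 for every journal would therefore score ln(28), roughly 3.33, so lower values are better than this uninformed baseline. The helper below is a simplified sketch of the metric (no probability clipping), not scikit-learn's implementation.

```python
import math

def multiclass_log_loss(true_indices, probabilities):
    """Mean negative log-probability assigned to the true class."""
    total = 0.0
    for true_idx, probs in zip(true_indices, probabilities):
        total += -math.log(probs[true_idx])
    return total / len(true_indices)

n_classes = 28
# Three made-up documents, each predicted with uniform probabilities
uniform = [[1.0 / n_classes] * n_classes for _ in range(3)]
baseline = multiclass_log_loss([0, 5, 27], uniform)
print(round(baseline, 4))  # equals ln(28)
```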


### **SGD classifier to classify the articles of the validation set**

In [80]:
from sklearn.linear_model import SGDClassifier
# Note: scikit-learn 1.1 renamed this loss to 'log_loss'; 'log' was removed in 1.3
clf = SGDClassifier(loss='log')
clf.fit(X_train, y_train)
y_pred = clf.predict_proba(X_valid)
loss = log_loss(y_val,y_pred)
print("SGD classifier's loss : ",loss)



SGD classifier's loss :  2.57648476773


### **SVC classifier to classify the articles of the validation set**

In [83]:
from sklearn.svm import SVC
clf = SVC(random_state=0,probability=True)
clf.fit(X_train, y_train)
y_pred = clf.predict_proba(X_valid)
loss = log_loss(y_val,y_pred)
print("SVC classifier's loss : ",loss)

SVC classifier's loss :  2.39967155118


 | **Logistic Regression**        | **SGD**           | **SVC**  |
 | :-------------: |:-------------:| :-----: |
 | 2.53412567057      | 2.57648476773 | **2.39967155118** |


###  <span style="color:DarkGreen">SVC achieves the lowest validation log loss of the three classifiers.</span>

--------------------------------------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------------------------------------


## <span style="color:navy">Alternative 2: Classifying articles using all their text info</span>


### **Adding year, title and authors to our training and validation data**

In [159]:
train_years = list()
train_titles = list()
train_authors = list()

val_years = list()
val_titles = list()
val_authors = list()

for i in train_ids:
    row = df.loc[df['id'] == int(i)].iloc[0]  # look the article up once
    train_years.append(row['year'])
    train_titles.append(row['title'])
    # A missing authors field is read as NaN (a float); use an empty string instead
    train_authors.append("" if isinstance(row['authors'], float) else row['authors'])

for i in valid_ids:
    row = df.loc[df['id'] == int(i)].iloc[0]  # look the article up once
    val_years.append(row['year'])
    val_titles.append(row['title'])
    # A missing authors field is read as NaN (a float); use an empty string instead
    val_authors.append("" if isinstance(row['authors'], float) else row['authors'])

In [160]:
train_all = list()
for i in range(0,len(train_abstracts)):
    train_all.append(train_abstracts[i] + " " + train_titles[i] + " " + str(train_years[i]) + " " + str(train_authors[i]))

In [161]:
valid_all = list()
for i in range(0,len(val_abstracts)):
    valid_all.append(val_abstracts[i] + " " + val_titles[i] + " " + str(val_years[i]) + " " + str(val_authors[i]))
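
The concatenation used in the two cells above can be summarised by one small function. This is a sketch with a made-up name (`join_fields`) and made-up inputs; it mirrors the handling of a missing authors field, which arrives as NaN, i.e. a float.

```python
def join_fields(abstract, title, year, authors):
    """Concatenate the text fields of one article (hypothetical helper)."""
    if isinstance(authors, float):  # NaN marks a missing authors field
        authors = ""
    return " ".join([abstract, title, str(year), authors])

nan = float("nan")
with_authors = join_fields("an abstract", "a title", 2001, "A. Author, B. Author")
without_authors = join_fields("an abstract", "a title", 2001, nan)
print(with_authors)
print(repr(without_authors))
```

Because the year is turned into a string, it becomes one more token that the vectorizer can keep or drop according to its document frequency.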

### **Creating the new training and validation matrices.**
- Each row corresponds to an article and each column to a word present in at least 2 and at most 50 articles. 
- The value of each entry in a row is equal to the frequency of that word in the corresponding article.

In [164]:
vec = CountVectorizer(decode_error='ignore', min_df=2, max_df=50, stop_words='english')
X_train = vec.fit_transform(train_all)
X_valid = vec.transform(valid_all)

### **Reconstructing the test data in order to include the new info**

In [166]:
test_abstracts = list()
for i in test_ids:
    test_abstracts.append(df.loc[df['id'] == int(i)]['abstract'].iloc[0])
    
test_years = list()
test_titles = list()
test_authors = list()

for i in test_ids:
    row = df.loc[df['id'] == int(i)].iloc[0]  # look the article up once
    test_years.append(row['year'])
    test_titles.append(row['title'])
    # A missing authors field is read as NaN (a float); use an empty string instead
    test_authors.append("" if isinstance(row['authors'], float) else row['authors'])

In [167]:
test_all = list()
for i in range(0,len(test_abstracts)):
    test_all.append(test_abstracts[i] + " " + test_titles[i] + " " + str(test_years[i]) + " " + str(test_authors[i]))

### **Creating the new test matrix**

In [168]:
X_test = vec.transform(test_all)

In [169]:
print("Train matrix dimensionality: ", X_train.shape)
print("Validation matrix dimensionality: ", X_valid.shape)
print("Test matrix dimensionality: ", X_test.shape)

Train matrix dimensionality:  (12272, 12011)
Validation matrix dimensionality:  (3069, 12011)
Test matrix dimensionality:  (3836, 12011)


### **Logistic regression classifier to classify the articles of the validation set**

In [170]:
clf = LogisticRegression()
clf.fit(X_train, y_train)
y_pred = clf.predict_proba(X_valid)
loss = log_loss(y_val,y_pred)
print("Logistic Regression classifier's loss : ",loss)

Logistic Regression classifier's loss :  2.41165919834


### **SGD classifier to classify the articles of the validation set**

In [171]:
from sklearn.linear_model import SGDClassifier
# Note: scikit-learn 1.1 renamed this loss to 'log_loss'; 'log' was removed in 1.3
clf = SGDClassifier(loss='log')
clf.fit(X_train, y_train)
y_pred = clf.predict_proba(X_valid)
loss = log_loss(y_val,y_pred)
print("SGD classifier's loss : ",loss)



SGD classifier's loss :  2.48548767875


### **SVC classifier to classify the articles of the validation set**

In [172]:
from sklearn.svm import SVC
clf = SVC(random_state=0,probability=True)
clf.fit(X_train, y_train)
y_pred = clf.predict_proba(X_valid)
loss = log_loss(y_val,y_pred)
print("SVC classifier's loss : ",loss)

SVC classifier's loss :  2.30822787892


 | **Logistic Regression**        | **SGD**           | **SVC**  |
 | :-------------: |:-------------:| :-----: |
 | 2.41165919834      | 2.48548767875 | **2.30822787892** |

###  <span style="color:DarkGreen">SVC still has the lowest validation log loss of the three classifiers.</span>
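
The effect of adding the title, year and authors can be read off by comparing the two result tables. The snippet below only redoes that arithmetic on the reported validation losses; the dictionary names are illustrative.

```python
# Validation log losses copied from the two tables above
abstracts_only = {"Logistic Regression": 2.53412567057, "SGD": 2.57648476773, "SVC": 2.39967155118}
all_text = {"Logistic Regression": 2.41165919834, "SGD": 2.48548767875, "SVC": 2.30822787892}

for name in abstracts_only:
    gain = abstracts_only[name] - all_text[name]
    relative = 100.0 * gain / abstracts_only[name]
    print(f"{name}: {gain:.4f} lower ({relative:.1f}%)")

# Classifier with the lowest loss when all text fields are used
best = min(all_text, key=all_text.get)
print(best)
```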

### **We use the SVC to classify our test data and create the corresponding .csv file for Kaggle**

In [173]:
y_pred = clf.predict_proba(X_test)

In [175]:
# Write predictions to a file
with open('text_baseline_results.csv', 'w', newline='') as csvfile:  # newline='' avoids blank rows on Windows
    writer = csv.writer(csvfile, delimiter=',')
    lst = clf.classes_.tolist()
    lst.insert(0, "Article")
    writer.writerow(lst)
    for i,test_id in enumerate(test_ids):
        lst = y_pred[i,:].tolist()
        lst.insert(0, test_id)
        writer.writerow(lst)
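
The resulting file has one header row ("Article" followed by the class names) and one row of probabilities per test document. The sketch below reproduces that layout in memory with made-up class names, ids and probabilities, so the format can be inspected without training a model.

```python
import csv
import io

toy_classes = ["Journal A", "Journal B"]  # stands in for clf.classes_
toy_test_ids = ["7", "9"]                 # stands in for test_ids
toy_pred = [[0.25, 0.75], [0.5, 0.5]]     # stands in for the predict_proba output

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter=",", lineterminator="\n")
writer.writerow(["Article"] + toy_classes)
for test_id, probs in zip(toy_test_ids, toy_pred):
    writer.writerow([test_id] + probs)

submission_text = buffer.getvalue()
print(submission_text)
```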

### <span style="color:navy">Kaggle evaluation: 2.25878</span>