# <a id='toc1_'></a>[Modeling](#toc0_)
This notebook loads a dataset with `title`, `abstract` and `journal` columns and preprocesses it. The text is cleaned by lowercasing, removing non-alphabetic characters and dropping stop words. `embeddings` for `title` and `abstract` are then generated with a neural-network word embedding and reduced with `PCA`. Journal labels are encoded with `LabelEncoder`, and the dataset is split into training and testing sets.

To choose a predictive model, I compared twenty scikit-learn classifiers on the same train/test split. Test accuracy stayed low (about 0.26 at best, for `QuadraticDiscriminantAnalysis` and the RBF-kernel `SVC`), so the focus shifted towards making the overall process easier to use.

Input preprocessing functions were developed so that a new title and abstract can be run through the same `pipeline` and passed to the trained model.

**Table of contents**<a id='toc0_'></a>    
- [Modeling](#toc1_)    
    - [Import Libraries](#toc1_1_1_)    
    - [Load Data](#toc1_1_2_)    
    - [Reduce the data to have the same number of publications per journal](#toc1_1_3_)
    - [Preprocess Data](#toc1_1_4_)
    - [Create Embeddings](#toc1_1_5_)
    - [Encode Labels and Split Data](#toc1_1_6_)
    - [Train and Evaluate Classification Models](#toc1_1_7_)
    - [Grid Search for parameters](#toc1_1_8_)
    - [Train the classification model on the whole dataset](#toc1_1_9_)
    - [Save the model and the label encoder](#toc1_1_10_)
    - [Test the model with my own unpublished neuroscience article](#toc1_1_11_)

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

### <a id='toc1_1_1_'></a>[Import Libraries](#toc0_)

In [1]:
import numpy as np
import pandas as pd

# Functions
import sys
sys.path.append('../src')
from support_model import *

# Text Processing
import re
import ast
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Dimensionality Reduction
from sklearn.decomposition import PCA

# Deep Learning
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Machine Learning
import joblib
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV

# ML Classification Models
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, BaggingClassifier, ExtraTreesClassifier, VotingClassifier, StackingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, NuSVC, LinearSVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis

# Ignore Warnings
import warnings
warnings.filterwarnings('ignore')




In [2]:
# nltk.download('punkt')

In [3]:
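minimum_number_of_journals = 1000  # publications a journal needs to be kept; also the per-journal sample size
n_components = 1                   # components kept per text column (passed to column_embeddings)
n_estimators = 50                  # not referenced in the cells below
seed = 12                          # random seed for sampling and the train/test split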

### <a id='toc1_1_2_'></a>[Load Data](#toc0_)

In [4]:
publications = pd.read_csv('../data/neuropapers_db/publications.csv')

In [5]:
pre_data = publications[['title', 'abstract', 'journal']]

### <a id='toc1_1_3_'></a>[Reduce the data to have the same number of publications per journal](#toc0_)

In [6]:
# Count the number of publications per journal
counts = pre_data['journal'].value_counts()

# Remove the journals with fewer than `minimum_number_of_journals` publications (1000 here)
journals_to_drop = counts[counts < minimum_number_of_journals].index
data_filtered = pre_data[~pre_data['journal'].isin(journals_to_drop)]

In [7]:
journals_to_drop

Index(['American journal of Alzheimer's disease and other dementias',
       'Expert review of neurotherapeutics', 'Reviews in the neurosciences',
       'Neuroscientist', 'Neurophotonics', 'Translational neuroscience',
       'Annual review of neuroscience',
       'Frontiers of neurology and neuroscience', 'Biomedical reports',
       'Journal of the history of the neurosciences', 'Nature aging',
       'Journal of Physiology Paris', 'Current protocols in neuroscience',
       'Acta neurobiologiae experimentalis', 'IBRO reports',
       'Functional neurology', 'AJOB neuroscience', 'eLife'],
      dtype='object')

In [8]:
# Randomly downsample each journal to the same number of papers
subsets = []

for journal in data_filtered['journal'].unique():
    subset = data_filtered[data_filtered['journal'] == journal]
    if len(subset) >= minimum_number_of_journals:
        subset = subset.sample(n=minimum_number_of_journals, random_state=seed)
    subsets.append(subset)

# Concatenate once instead of growing a DataFrame inside the loop
data = pd.concat(subsets)

In [9]:
data = data.sample(frac=1, random_state=seed).reset_index(drop=True)
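
The per-journal downsampling can also be written with `DataFrameGroupBy.sample` (available since pandas 1.1). The following is a minimal sketch on a made-up frame; the toy journal names and the `per_journal` value are illustrative assumptions, not the notebook's data.

```python
import pandas as pd

# Toy stand-in for `data_filtered` (hypothetical data): "A" has 5 papers, "B" has 3.
toy = pd.DataFrame({
    "title": [f"paper {i}" for i in range(8)],
    "journal": ["A"] * 5 + ["B"] * 3,
})

per_journal = 3  # plays the role of minimum_number_of_journals
seed = 12

# Keep only journals with enough papers, then draw the same number from each.
counts = toy["journal"].value_counts()
kept = toy[toy["journal"].isin(counts[counts >= per_journal].index)]
balanced = kept.groupby("journal").sample(n=per_journal, random_state=seed)

# Shuffle the rows and reset the index, as the notebook does.
balanced = balanced.sample(frac=1, random_state=seed).reset_index(drop=True)
balance_counts = balanced["journal"].value_counts().to_dict()
print(balance_counts)
```

The rows drawn this way need not match those picked by the loop, because the random state is consumed differently; only the per-journal counts are guaranteed to agree.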

### <a id='toc1_1_4_'></a>[Preprocess Data](#toc0_)

In [11]:
columns_to_preprocess = ['title', 'abstract']
for column in columns_to_preprocess:
    data[column] = data[column].apply(preprocess_text)
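
`preprocess_text` is imported from `support_model` and its source is not shown here. Based on the description at the top of the notebook (lowercasing, removing non-alphabetic characters, dropping stop words), it might look roughly like the sketch below. The name `preprocess_text_sketch` and the short `STOP_WORDS` set are assumptions made for illustration; the real helper presumably relies on the NLTK stop-word list imported above.

```python
import re

# Small illustrative stop-word set (an assumption); NLTK's English list is much longer.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}

def preprocess_text_sketch(text):
    """Lowercase, replace non-alphabetic characters with spaces, drop stop words."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # digits and punctuation become spaces
    tokens = text.split()
    return " ".join(token for token in tokens if token not in STOP_WORDS)

cleaned = preprocess_text_sketch("Expression and role of Galectin-3 in the cerebellum")
print(cleaned)  # expression role galectin cerebellum
```

Note that stripping digits turns "Galectin-3" into "galectin", so identifiers that differ only by a number collapse into the same token.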

### <a id='toc1_1_5_'></a>[Create Embeddings](#toc0_)

In [14]:
list_of_columns = ['title', 'abstract']
new_data = data.copy()

for column in list_of_columns:
    embeddings_df = column_embeddings(data, column, n_components)
    new_data = new_data.merge(embeddings_df, left_index=True, right_index=True)

data = new_data.copy()
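
`column_embeddings` also lives in `support_model`, so its implementation is not visible here. The sketch below only illustrates the shape of the step: one vector per document, reduced with `PCA`, returned as a frame that shares the input index so it can be merged back. It swaps in an untrained random embedding table and mean pooling instead of the TensorFlow model; the function name, the `dim` parameter and the `<column>_pc<i>` column names are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

def column_embeddings_sketch(df, column, n_components, dim=16, seed=12):
    """Hypothetical stand-in: mean-pooled random word vectors reduced with PCA."""
    rng = np.random.default_rng(seed)
    tokenised = [str(text).split() for text in df[column]]
    vocab = sorted({token for tokens in tokenised for token in tokens})
    # Untrained embedding table (assumption): one random vector per vocabulary word.
    table = {token: rng.normal(size=dim) for token in vocab}
    # One vector per document: the mean of its word vectors (zeros if empty).
    doc_vectors = np.array([
        np.mean([table[t] for t in tokens], axis=0) if tokens else np.zeros(dim)
        for tokens in tokenised
    ])
    reduced = PCA(n_components=n_components, random_state=seed).fit_transform(doc_vectors)
    names = [f"{column}_pc{i + 1}" for i in range(n_components)]
    # Reuse the input index so the result can be merged back by index.
    return pd.DataFrame(reduced, columns=names, index=df.index)

toy = pd.DataFrame({"title": ["galectin cerebellum",
                              "synaptic plasticity",
                              "microglia activation cortex"]})
features = column_embeddings_sketch(toy, "title", n_components=1)
merged = toy.merge(features, left_index=True, right_index=True)
print(merged.columns.tolist())
```

If the real helper likewise returns `n_components` columns per text column, then `n_components = 1` would leave the classifiers with only two features in total, which may be worth revisiting given the accuracies reported further down.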




### <a id='toc1_1_6_'></a>[Encode Labels and Split Data](#toc0_)

In [15]:
label_encoder = LabelEncoder()
data['y_encoded'] = label_encoder.fit_transform(data['journal'])

In [16]:
all_columns = data.columns
drop_columns = ['title', 'abstract', 'journal', 'y_encoded']
columns = all_columns.drop(drop_columns)

X = data[columns]
y = data['y_encoded']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=seed)
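
Because the journals were downsampled to equal sizes, it may be worth keeping that balance in both splits. `train_test_split` accepts a `stratify` argument for this; the sketch below shows its effect on made-up data with four balanced classes (the toy arrays are assumptions, not the notebook's features).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 4 balanced classes with 25 rows each and 2 features.
rng = np.random.default_rng(12)
X_toy = rng.normal(size=(100, 2))
y_toy = np.repeat(np.arange(4), 25)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=12, stratify=y_toy
)

# With stratify, every class contributes the same share to the test set.
test_counts = np.bincount(y_te)
print(test_counts)
```

Without `stratify`, the per-class counts in the test set vary from split to split, which adds noise to accuracy comparisons between models.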

### <a id='toc1_1_7_'></a>[Train and Evaluate Classification Models](#toc0_)

In [33]:
models = {"dtc": DecisionTreeClassifier(),
          "rfc": RandomForestClassifier(),
          "svcr": SVC(kernel="rbf"),
          "svcl": SVC(kernel="linear"),
          "knc": KNeighborsClassifier(),
          "logr": LogisticRegression(),
          "adaboost": AdaBoostClassifier(),
          "gradient_boosting": GradientBoostingClassifier(),
          "naive_bayes": GaussianNB(),
          "mlp": MLPClassifier(),
          "bagging": BaggingClassifier(),
          "extra_trees": ExtraTreesClassifier(),
          "voting": VotingClassifier(estimators=[('dtc', DecisionTreeClassifier()), ('rfc', RandomForestClassifier()), ('svc', SVC())]),
          "stacking": StackingClassifier(estimators=[('knc', KNeighborsClassifier()), ('logr', LogisticRegression()), ('svc', SVC())], final_estimator=DecisionTreeClassifier()),
          "sgd": SGDClassifier(),
          "nu_svc": NuSVC(),
          "linear_svc": LinearSVC(),
          "gaussian_process": GaussianProcessClassifier(),
          "lda": LinearDiscriminantAnalysis(),
          "qda": QuadraticDiscriminantAnalysis()}   

In [37]:
for model_name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"{model_name} Accuracy: {accuracy}")

dtc Accuracy: 0.1984375
rfc Accuracy: 0.245
svcr Accuracy: 0.2621875
svcl Accuracy: 0.2496875
knc Accuracy: 0.2103125
logr Accuracy: 0.2440625
adaboost Accuracy: 0.19125
gradient_boosting Accuracy: 0.2528125
naive_bayes Accuracy: 0.2509375
mlp Accuracy: 0.2490625
bagging Accuracy: 0.2228125
extra_trees Accuracy: 0.240625
voting Accuracy: 0.250625
stacking Accuracy: 0.203125
sgd Accuracy: 0.2153125
nu_svc Accuracy: 0.2525
linear_svc Accuracy: 0.234375
gaussian_process Accuracy: 0.245625
lda Accuracy: 0.24
qda Accuracy: 0.2628125
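
A trivial baseline helps put these accuracies in context. All printed values are multiples of 1/3200, which would be consistent with a 3,200-row test set, 16,000 balanced rows and 16 journals, so chance level would be about 0.0625; this is an inference from the output, not a figure the notebook reports. The sketch below uses scikit-learn's `DummyClassifier` on made-up data with 16 balanced classes to show how such a baseline could be computed.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Toy stand-in (assumption): 16 balanced classes, features carry no signal.
n_classes, per_class = 16, 50
rng = np.random.default_rng(12)
X_toy = rng.normal(size=(n_classes * per_class, 2))
y_toy = np.repeat(np.arange(n_classes), per_class)

# Always predicts a single class, so it is right for 1 class out of 16.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)
baseline_accuracy = accuracy_score(y_toy, baseline.predict(X_toy))
print(baseline_accuracy)
```

Under that reading, the best models above would be roughly four times better than chance while still misclassifying about three papers out of four.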


### <a id='toc1_1_8_'></a>[Grid Search for parameters](#toc0_)

In [48]:
models = {"svcr": SVC(kernel="rbf"),
          "qda": QuadraticDiscriminantAnalysis()}   

param_grids = {"svcr": {"C": [0.1, 1],
                        "gamma": [0.01, 0.1],
                        "kernel": ['rbf'],
                        "probability": [True, False],
                        "tol": [1e-3, 1e-5],
                        "class_weight": [None, 'balanced']},
               "qda": {"reg_param": [0.0, 0.2],
                       "store_covariance": [True, False],
                       "tol": [1e-3, 1e-5]}}

In [49]:
# Perform GridSearchCV for each model
for model_name, model in models.items():
    param_grid = param_grids[model_name]
    grid_search = GridSearchCV(model, param_grid, cv=5, scoring='accuracy')
    grid_search.fit(X_train, y_train)

    # Get the best parameters and retrain the model
    best_params = grid_search.best_params_
    model.set_params(**best_params)
    model.fit(X_train, y_train)

    # Make predictions on the test set
    y_pred = model.predict(X_test)

    # Evaluate the model
    accuracy = accuracy_score(y_test, y_pred)
    print(f"{model_name} Best Parameters: {best_params}")
    print(f"{model_name} Accuracy: {accuracy}")

svcr Best Parameters: {'C': 1, 'class_weight': 'balanced', 'gamma': 0.1, 'kernel': 'rbf', 'probability': True, 'tol': 0.001}
svcr Accuracy: 0.2328125
qda Best Parameters: {'reg_param': 0.0, 'store_covariance': True, 'tol': 0.001}
qda Accuracy: 0.2628125
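
`GridSearchCV` refits the best configuration on the full training data by default (`refit=True`), so the fitted `best_estimator_` can be used directly, and `best_score_` gives the cross-validated accuracy the search actually optimised. A minimal sketch on synthetic data (the dataset and the reduced grid are assumptions chosen to keep the example short):

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data, only to make the sketch runnable.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=0,
    n_classes=3, random_state=12,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=12, stratify=y_toy
)

search = GridSearchCV(
    QuadraticDiscriminantAnalysis(),
    {"reg_param": [0.0, 0.2]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_tr, y_tr)

# refit=True (the default) has already retrained the best configuration on X_tr.
cv_accuracy = search.best_score_  # mean cross-validated accuracy of the best setting
test_accuracy = search.best_estimator_.score(X_te, y_te)
print(search.best_params_, round(cv_accuracy, 3), round(test_accuracy, 3))
```

Comparing the two numbers can help explain results like the tuned `svcr` scoring 0.2328 on the test set against 0.2622 with default settings: the search selects on cross-validation accuracy, and the `gamma` grid above does not include the default `'scale'` value.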


### <a id='toc1_1_9_'></a>[Train the classification model on the whole dataset](#toc0_)

In [17]:
qda_model = QuadraticDiscriminantAnalysis()
qda_model.fit(X, y)
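
Once the model is fitted on all rows, no held-out data remains to score it. One option is to estimate its accuracy with cross-validation first and then fit the final model on everything. A hedged sketch with synthetic data standing in for `X` and `y`:

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for X and y (assumption), only to make the sketch runnable.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=0,
    n_classes=3, random_state=12,
)

# Estimate accuracy on 5 stratified folds before using every row for training.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=12)
scores = cross_val_score(
    QuadraticDiscriminantAnalysis(), X_toy, y_toy, cv=cv, scoring="accuracy"
)
print(scores.mean(), scores.std())

# The final model then uses all rows, as in the notebook.
final_model = QuadraticDiscriminantAnalysis().fit(X_toy, y_toy)
```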

### <a id='toc1_1_10_'></a>[Save the model and the label encoder](#toc0_)

In [18]:
joblib.dump(label_encoder, '../model/label_encoder.pkl')

['label_encoder.pkl']

In [19]:
joblib.dump(qda_model, '../model/qda_model.pkl')

['qda_model.pkl']
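
A quick round trip confirms that the saved objects load back and behave the same. The sketch below uses a temporary directory and toy data (both assumptions) rather than the notebook's `../model/` folder. `joblib.dump` does not create missing parent directories, so the target folder has to exist beforehand.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.preprocessing import LabelEncoder

# Toy data with string labels standing in for journal names (hypothetical).
rng = np.random.default_rng(12)
X_toy = rng.normal(size=(60, 2))
labels = np.repeat(["Journal A", "Journal B", "Journal C"], 20)

encoder = LabelEncoder()
y_toy = encoder.fit_transform(labels)
model = QuadraticDiscriminantAnalysis().fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "qda_model.pkl")
    encoder_path = os.path.join(tmp_dir, "label_encoder.pkl")
    joblib.dump(model, model_path)
    joblib.dump(encoder, encoder_path)

    # Load both objects back while the temporary files still exist.
    loaded_model = joblib.load(model_path)
    loaded_encoder = joblib.load(encoder_path)

same_predictions = np.array_equal(model.predict(X_toy), loaded_model.predict(X_toy))
same_classes = list(loaded_encoder.classes_) == list(encoder.classes_)
print(same_predictions, same_classes)
```

Only the classifier and the label encoder are saved here; any fitted objects used to turn new text into features would need to be available at prediction time as well.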

### <a id='toc1_1_11_'></a>[Test the model with my own unpublished neuroscience article](#toc0_)
https://www.biorxiv.org/content/10.1101/364760v1

In [21]:
input_title = "Expression and role of Galectin-3 in the postnatal development of the cerebellum"
input_abstract = "Many proteins initially identified in the immune system play roles in neurogenesis, neuronal migration, axon guidance, synaptic plasticity and other processes related to the formation and refinement of neural circuits. Although the function of the immune-related protein Galectin-3 (LGALS3) has been extensively studied in the regulation of inflammation, cancer and microglia activation, little is known about its role in the development of the brain. In this study, we identified that LGALS3 is expressed in the developing postnatal cerebellum. More precisely, LGALS3 is expressed by cells in meninges and in the choroid plexus, and in subpopulations of astrocytes and of microglial cells in the cerebellar cortex. Analysis of Lgals3 knockout mice showed that Lgals3 is dispensable for the development of cerebellar cytoarchitecture and Purkinje cell excitatory synaptogenesis in the mouse."

In [22]:
result = predict_journal_for_input(input_title, input_abstract, qda_model, label_encoder)
print(result)

[{'journal': 'Biological psychology', 'probability': 0.07037802108040132}]
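
A single label with a probability of about 0.07 says little on its own, so listing several candidates can make the model's low confidence easier to read. `predict_journal_for_input` is defined in `support_model` and not shown, so the helper below, `top_k_journals`, is a hypothetical illustration of how `predict_proba` and `LabelEncoder.inverse_transform` could be combined; the toy data is made up.

```python
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.preprocessing import LabelEncoder

def top_k_journals(model, encoder, features, k=3):
    """Hypothetical helper: the k most probable labels for one feature row."""
    row = np.asarray(features).reshape(1, -1)
    probabilities = model.predict_proba(row)[0]
    best = np.argsort(probabilities)[::-1][:k]  # indices sorted by descending probability
    # model.classes_ holds encoded labels; map them back to journal names.
    names = encoder.inverse_transform(model.classes_[best])
    return [
        {"journal": str(name), "probability": float(probabilities[i])}
        for name, i in zip(names, best)
    ]

# Toy data (assumption): three classes shifted apart so they are separable.
rng = np.random.default_rng(12)
labels = np.repeat(["Journal A", "Journal B", "Journal C"], 30)
X_toy = rng.normal(size=(90, 2)) + np.repeat(np.arange(3), 30)[:, None] * 3.0

encoder = LabelEncoder()
model = QuadraticDiscriminantAnalysis().fit(X_toy, encoder.fit_transform(labels))

ranking = top_k_journals(model, encoder, X_toy[0], k=3)
for entry in ranking:
    print(entry)
```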
