<a href="https://colab.research.google.com/github/mhmd2015/UIC412ML1/blob/main/UIC412_Project4_Research_topic_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
from sklearn.preprocessing import LabelEncoder


# 1- Load data into panda

In [2]:
data_path = "https://raw.githubusercontent.com/mhmd2015/UIC412ML1/main/"

train = pd.read_csv(data_path+"blogs_train.csv")
test = pd.read_csv(data_path+"blogs_test.csv")

print('Train shape:',train.shape)
print('Test shape:',test.shape)

Train shape: (20972, 9)
Test shape: (8989, 3)


## set the required labeling

In [3]:
labels = ['Computer Science', 'Physics', 'Mathematics', 'Statistics', 'Quantitative Biology', 'Quantitative Finance']

# 2- prepares data for training a machine learning model

Takes raw training data, selects the relevant features and targets, splits them into training and testing sets, and converts the target variables to a format suitable for model training.

### test = test.drop(['ID'],axis=1):

Removes the 'ID' column from the test DataFrame. The axis=1 argument specifies that dropping a column and it not needed for the model's predictions.

### X = train.loc[:,['TITLE','ABSTRACT']]:

Creates a new DataFrame X containing only the 'TITLE' and 'ABSTRACT' columns from the train DataFrame.

These are the features will use to train your model.

loc[:, ['TITLE', 'ABSTRACT']] is used for label-based indexing, selecting all rows (:) and the specified columns.

### y = train.loc[:,labels]:

Creates a new DataFrame y containing the columns specified in the labels list from the train DataFrame. These are the target variables (the topics to predict).

### from sklearn.model_selection import train_test_split:
imports the train_test_split function from scikit-learn's model_selection module. This function is used to split your data into training and testing sets.

### X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42, shuffle=True):

This is where the data splitting happens.
X and y are the feature and target DataFrames created earlier.

#### test_size=0.1
specifies that 10% of the data will be used for the testing set, and the remaining 90% for the training set.

#### random_state=42
ensures that the split is the same every time you run the code, which is helpful for reproducibility.

#### shuffle=True
shuffles the data before splitting.


### y_test.reset_index(drop=True,inplace=True) and X_test.reset_index(drop=True,inplace=True):
Reset the index of the X_test and y_test DataFrames. When train_test_split creates the subsets, it keeps the original index. Resetting the index with drop=True discards the old index and creates a new default integer index, which can be useful for later operations. inplace=True modifies the DataFrame directly without returning a new one.


### y1 = np.array(y_train) and y2 = np.array(y_test):
convert the y_train and y_test DataFrames into NumPy arrays. Many machine learning algorithms in scikit-learn work more efficiently with NumPy arrays. y1 will contain the training target data as a NumPy array, and y2 will contain the testing target data as a NumPy array.



In [4]:
test = test.drop(['ID'],axis=1)

X = train.loc[:,['TITLE','ABSTRACT']]
y = train.loc[:,labels]

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42, shuffle=True)

print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)

y_test.reset_index(drop=True,inplace=True)
X_test.reset_index(drop=True,inplace=True)

y1 = np.array(y_train)
y2 = np.array(y_test)

(18874, 2) (2098, 2)
(18874, 6) (2098, 6)


# 3- Cleaning the data

In [5]:
#Removing Punctuations

X_train.replace('[^a-zA-Z]',' ', regex=True, inplace=True)
X_test.replace('[^a-zA-Z]',' ', regex=True, inplace=True)

test.replace('[^a-zA-Z]',' ', regex=True, inplace=True)

#Converting to lower case characters

for index in X_train.columns:
  X_train[index] = X_train[index].str.lower()

for index in X_test.columns:
  X_test[index] = X_test[index].str.lower()

for index in test.columns:
  test[index] = test[index].str.lower()

#Removing one letter words

X_train['ABSTRACT'] = X_train['ABSTRACT'].str.replace(r'\b\w\b', '').str.replace(r'\s+', ' ')
X_test['ABSTRACT'] = X_test['ABSTRACT'].str.replace(r'\b\w\b', '').str.replace(r'\s+', ' ')

test['ABSTRACT'] = test['ABSTRACT'].str.replace(r'\b\w\b', '').str.replace(r'\s+', ' ')

#Removing multiple blank spaces

X_train = X_train.replace('\s+', ' ', regex=True)
X_test = X_test.replace('\s+', ' ', regex=True)

test = test.replace('\s+', ' ', regex=True)

# 4- Downloads necessary resources for Natural Language Toolkit (NLTK)
Python library for working with human language data.

**import nltk**: Imports the NLTK library.

**nltk.download('punkt')**: Downloads the 'punkt' tokenizer models, which are used for splitting text into sentences and words.

**nltk.download('wordnet')**: Downloads WordNet, a lexical database of English.

**nltk.download('stopwords')**: Downloads a list of common English stop words (words like 'the', 'a', 'is', etc.) that are often removed in text processing.

**nltk.download('averaged_perceptron_tagger')**: Downloads a pre-trained POS (Part-of-Speech) tagger.

And defines stop words and combines the 'TITLE' and 'ABSTRACT' columns into a single 'combined' column in the training, testing, and prediction datasets, dropping the original columns.

In [6]:
import nltk
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
from nltk import sent_tokenize, word_tokenize
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import stopwords


stop_words = set(stopwords.words('english'))
# len(stop_words)
# X_train['ABSTRACT'] = X_train['ABSTRACT'].apply(lambda x: ' '.join(term for term in x.split() if term not in stop_words))
# X_test['ABSTRACT'] = X_test['ABSTRACT'].apply(lambda x: ' '.join(term for term in x.split() if term not in stop_words))

# test['ABSTRACT'] = test['ABSTRACT'].apply(lambda x: ' '.join(term for term in x.split() if term not in stop_words))

X_train['combined'] = X_train['TITLE']+' '+X_train['ABSTRACT']
X_test['combined'] = X_test['TITLE']+' '+X_test['ABSTRACT']

test['combined'] = test['TITLE']+' '+test['ABSTRACT']

X_train = X_train.drop(['TITLE','ABSTRACT'],axis=1)
X_test = X_test.drop(['TITLE','ABSTRACT'],axis=1)

test = test.drop(['TITLE','ABSTRACT'],axis=1)

X_train.head()

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


Unnamed: 0,combined
13275,clustering in hilbert space of a quantum optim...
19273,graph heat mixture model learning graph infer...
6427,fast and unsupervised methods for multilingual...
19168,natasha faster non convex stochastic optimizat...
14148,kustaanheimo stiefel transformation with an ar...


# 5- Preprocessing

In [7]:

X_lines = []
for row in range(0,X.shape[0]):
  X_lines.append(' '.join(str(x) for x in X.iloc[row,:]))

In [8]:
train_lines = []
for row in range(0,X_train.shape[0]):
  train_lines.append(' '.join(str(x) for x in X_train.iloc[row,:]))

test_lines = []
for row in range(0,X_test.shape[0]):
  test_lines.append(' '.join(str(x) for x in X_test.iloc[row,:]))

predtest_lines = []
for row in range(0,test.shape[0]):
  predtest_lines.append(' '.join(str(x) for x in test.iloc[row,:]))

In [9]:
len(train_lines)

18874

# 6- Convert
Converts text data into a format that machine learning models can understand and process by transforming the text data into numerical vectors using two common techniques:

**Count Vectorization**: CountVectorizer counts the occurrences of words (and pairs of words, due to ngram_range=(1,2)) in your text. X_train_cv, X_test_cv, and test_cv are the results of this step.

**TF-IDF Transformation**: TfidfTransformer then applies the Term Frequency-Inverse Document Frequency (TF-IDF) weighting to the count vectors. TF-IDF gives higher weights to words that are important to a specific document but less common across all documents. X_train_tf, X_test_tf, and test_tf are the final numerical representations used for training the model.



In [10]:
from sklearn.feature_extraction.text import CountVectorizer

countvector = CountVectorizer(ngram_range=(1,2))
X_train_cv = countvector.fit_transform(train_lines)
X_test_cv = countvector.transform(test_lines)

test_cv = countvector.transform(predtest_lines)

#Using TfidfVectorizer

from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer

tfidfvector = TfidfTransformer()
X_train_tf = tfidfvector.fit_transform(X_train_cv)
X_test_tf = tfidfvector.fit_transform(X_test_cv)

test_tf = tfidfvector.fit_transform(test_cv)

X_cv = countvector.transform(X_lines)

X_tf = tfidfvector.fit_transform(X_cv) #x_tf,y

# 7- Train the model
trains a machine learning model to predict multiple output labels based text data:

**from sklearn.svm import LinearSVC**: imports the LinearSVC class from scikit-learn's Support Vector Machine (SVM) module.

LinearSVC is a linear classifier based on the Support Vector Machine algorithm. It's suitable for large datasets and works well with text data.

**from sklearn.multioutput import MultiOutputClassifier**: imports the MultiOutputClassifier class from scikit-learn's multioutput module. This is a meta-estimator that trains a separate classifier for each target variable in a multi-label classification problem. Since the task involves predicting multiple topics for each document (a document can belong to more than one topic), MultiOutputClassifier is used to handle this.

**model = LinearSVC(C=0.5, class_weight='balanced', random_state=42)**: This line initializes a LinearSVC model with specific parameters:
- **C=0.5**: This is the regularization parameter. A smaller C value increases the regularization strength, which helps prevent overfitting.

- **class_weight='balanced'**: This parameter automatically adjusts weights inversely proportional to class frequencies in the input data. This is useful when have imbalanced classes (some topics appear much more often than others).

- **random_state=42**: This sets the random seed for reproducibility.
- **models = MultiOutputClassifier(model)**: This line creates a MultiOutputClassifier instance and wraps the LinearSVC model you just defined.
This means that a separate LinearSVC classifier will be trained for each of your target labels (each topic).

- **models.fit(X_train_tf, y1)**: This is the training step.
- **models.fit()**: This method trains the multi-output classifier.
- **X_train_tf**: This is training data, which is the TF-IDF vectorized representation of your combined 'TITLE' and 'ABSTRACT' text.
- **y1**: This is training target data, which is a NumPy array containing the labels (topics) for each training document.

The code sets up a LinearSVC model to handle multi-label classification and then trains it on preprocessed and vectorized training data (X_train_tf and y1). The MultiOutputClassifier ensures that a separate classifier is trained for each of the possible topics you are trying to predict.

In [11]:
from sklearn.svm import LinearSVC
from sklearn.multioutput import MultiOutputClassifier

model = LinearSVC(C=0.5, class_weight='balanced', random_state=42)
models = MultiOutputClassifier(model)

models.fit(X_train_tf, y1)

# 8- Evaluate
evaluates the performance of trained model on the test data:

- **classification_report(y2,preds)**: A detailed report showing precision, recall, and F1-score for each topic, as well as overall averages.
- **accuracy_score(y2,preds)**: The overall accuracy of the model, which is the proportion of correctly predicted labels across all samples and topics.
- **(commented out)confusion_matrix(y2,preds)`**: If uncommented, this would print the confusion matrix, which shows a summary of correct and incorrect predictions broken down by class.

In [12]:
preds = models.predict(X_test_tf)
preds


array([[1, 0, 0, 1, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [1, 0, 0, 0, 0, 0],
       ...,
       [0, 1, 0, 0, 1, 0],
       [1, 0, 0, 1, 0, 0],
       [0, 1, 0, 0, 0, 0]])

In [13]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

#print(confusion_matrix(y2,preds))
print(classification_report(y2,preds))
print(accuracy_score(y2,preds))

              precision    recall  f1-score   support

           0       0.81      0.91      0.85       853
           1       0.88      0.89      0.88       623
           2       0.84      0.84      0.84       580
           3       0.72      0.86      0.78       516
           4       0.53      0.40      0.46        58
           5       0.86      0.69      0.77        26

   micro avg       0.81      0.86      0.83      2656
   macro avg       0.77      0.76      0.76      2656
weighted avg       0.81      0.86      0.83      2656
 samples avg       0.84      0.89      0.84      2656

0.6611058150619638


In [14]:
predssv = models.predict(test_tf)
predssv

array([[0, 0, 0, 1, 0, 0],
       [0, 1, 0, 0, 0, 0],
       [1, 0, 0, 0, 0, 0],
       ...,
       [0, 0, 0, 0, 1, 0],
       [0, 0, 0, 1, 0, 0],
       [1, 0, 0, 0, 0, 0]])

In [15]:
test = pd.read_csv(data_path+"blogs_test.csv")

submit = pd.DataFrame({'ID': test.ID, 'Computer Science': predssv[:,0],'Physics':predssv[:,1],
                       'Mathematics':predssv[:,2],'Statistics':predssv[:,3],'Quantitative Biology':predssv[:,4],
                       'Quantitative Finance':predssv[:,5]})
submit.head()

Unnamed: 0,ID,Computer Science,Physics,Mathematics,Statistics,Quantitative Biology,Quantitative Finance
0,20973,0,0,0,1,0,0
1,20974,0,1,0,0,0,0
2,20975,1,0,0,0,0,0
3,20976,0,1,0,0,0,0
4,20977,1,0,0,0,0,0


In [35]:
submit.to_csv('predictions.csv', index=False)

# More tests

In [17]:
# A function to clean the text
def preprocess_text(df, column_name='combined'):
    """Cleans the text data in a DataFrame."""
    df_copy = df.copy()
    # Removing Punctuations and converting to lower case
    df_copy[column_name] = df_copy[column_name].str.replace('[^a-zA-Z]',' ', regex=True).str.lower()
    # Removing one letter words and multiple spaces
    df_copy[column_name] = df_copy[column_name].str.replace(r'\b\w\b', '', regex=True).str.replace(r'\s+', ' ', regex=True)
    return df_copy

In [16]:
def predict_topic(text, trained_model, count_vec, tfidf_vec, topic_list):
    """
    Predicts the research topic(s) for a given string of text.

    Args:
    text (str): The title and abstract of a research paper.
    trained_model: The fitted multi-output classifier.
    count_vec: The fitted CountVectorizer.
    tfidf_vec: The fitted TfidfTransformer.
    topic_list (list): The list of topic names.

    Returns:
    list: A list of predicted topic names.
    """
    # Create a temporary DataFrame to use the existing preprocessing function
    text_df = pd.DataFrame({'combined': [text]})
    processed_text_df = preprocess_text(text_df)
    processed_text = processed_text_df['combined']

    # Vectorize the preprocessed text using the fitted vectorizers
    text_cv = count_vec.transform(processed_text)
    text_tf = tfidf_vec.transform(text_cv)

    # Make a prediction
    prediction_array = trained_model.predict(text_tf)

    # Map the prediction to topic names
    predicted_topics = [topic_list[i] for i, value in enumerate(prediction_array[0]) if value == 1]

    return predicted_topics






In [38]:
# --- 7. Example Usage ---
print("\n--- Example Prediction ---")
sample_abstract = """
A science to study cell and enzymes.
"""


# Use the function with all the necessary trained components
predicted_labels = predict_topic(sample_abstract, models, countvector, tfidfvector, labels)
print(f"Sample Text:\n{sample_abstract}")
print(f"\nPredicted Topics: {predicted_labels}")


--- Example Prediction ---
Sample Text:

A science to study cell and enzymes.


Predicted Topics: ['Quantitative Biology']


In [39]:
# --- 7. Example Usage ---
print("\n--- Example Prediction ---")
sample_abstract = """
A science to solve sine and cosine.
"""


# Use the function with all the necessary trained components
predicted_labels = predict_topic(sample_abstract, models, countvector, tfidfvector, labels)
print(f"Sample Text:\n{sample_abstract}")
print(f"\nPredicted Topics: {predicted_labels}")


--- Example Prediction ---
Sample Text:

A science to solve sine and cosine.


Predicted Topics: ['Mathematics']


In [40]:
# --- 7. Example Usage ---
print("\n--- Example Prediction ---")
sample_abstract = """
This paper introduces a novel deep learning framework for sentiment analysis.
We propose a deep convolutional neural network (CNN) combined with a recurrent neural network (RNN)
to classify text. Our statistical analysis shows that the model achieves state-of-the-art
performance on several benchmark datasets. The algorithm is implemented in Python and leverages
modern computational architectures for efficient training.
"""


# Use the function with all the necessary trained components
predicted_labels = predict_topic(sample_abstract, models, countvector, tfidfvector, labels)
print(f"Sample Text:\n{sample_abstract}")
print(f"\nPredicted Topics: {predicted_labels}")


--- Example Prediction ---
Sample Text:

This paper introduces a novel deep learning framework for sentiment analysis.
We propose a deep convolutional neural network (CNN) combined with a recurrent neural network (RNN)
to classify text. Our statistical analysis shows that the model achieves state-of-the-art
performance on several benchmark datasets. The algorithm is implemented in Python and leverages
modern computational architectures for efficient training.


Predicted Topics: ['Computer Science', 'Statistics']


In [41]:
# --- 7. Example Usage ---
print("\n--- Example Prediction ---")
sample_abstract = """
Bioinformatics is an interdisciplinary field that develops methods
and software tools for understanding biological data.
One such resource is a biological wiki, or "bio wiki,"
a collaborative online platform for sharing biological information.
These wikis often contain curated data on genes, proteins, and pathways,
facilitating research and knowledge dissemination in the life sciences.
"""


# Use the function with all the necessary trained components
predicted_labels = predict_topic(sample_abstract, models, countvector, tfidfvector, labels)
print(f"Sample Text:\n{sample_abstract}")
print(f"\nPredicted Topics: {predicted_labels}")


--- Example Prediction ---
Sample Text:

Bioinformatics is an interdisciplinary field that develops methods
and software tools for understanding biological data.
One such resource is a biological wiki, or "bio wiki,"
a collaborative online platform for sharing biological information.
These wikis often contain curated data on genes, proteins, and pathways,
facilitating research and knowledge dissemination in the life sciences.


Predicted Topics: ['Computer Science', 'Quantitative Biology']


# Improvement by using Naive Bayes method

In [23]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.multioutput import MultiOutputClassifier
from sklearn.metrics import classification_report, accuracy_score, f1_score # Import f1_score

In [25]:


# Assuming train_df, test_df, X_train_tfidf, X_val_tfidf, y_train, y_val are already defined
# from your existing pipeline, with preprocessing and feature extraction completed.

# Model: Multinomial Naive Bayes with Hyperparameter Tuning
nb_classifier = MultinomialNB()
multi_nb = MultiOutputClassifier(nb_classifier)

# Hyperparameter tuning for alpha (smoothing parameter)
param_grid_nb = {'estimator__alpha': [0.01, 0.1, 0.5, 1.0, 2.0]}
grid_search_nb = GridSearchCV(multi_nb, param_grid_nb, cv=5, scoring='f1_micro', n_jobs=-1)
grid_search_nb.fit(X_train_tf, y_train)

# Best parameters and model
print("Best parameters for Multinomial Naive Bayes:", grid_search_nb.best_params_)
best_nb_model = grid_search_nb.best_estimator_


# Evaluate on validation set
y_val = y2 # y_val is assigned y2 (test data) here, but the variable name is misleading. Using y2 directly below is clearer.
y_pred_nb = best_nb_model.predict(X_test_tf) # Use X_test_tf for prediction on the test set
print("\nMultinomial Naive Bayes Classification Report:\n",
      classification_report(y2, y_pred_nb, # Use y2 for evaluation
                           target_names=['Computer Science', 'Physics', 'Mathematics',
                                         'Statistics', 'Quantitative Biology', 'Quantitative Finance']))
print("Multinomial Naive Bayes Accuracy:", accuracy_score(y2, y_pred_nb)) # Use y2 for evaluation
print("Multinomial Naive Bayes Micro-average F1-score:",
      f1_score(y2, y_pred_nb, average='micro')) # Use y2 for evaluation
print("Multinomial Naive Bayes Macro-average F1-score:",
      f1_score(y2, y_pred_nb, average='macro')) # Use y2 for evaluation



Best parameters for Multinomial Naive Bayes: {'estimator__alpha': 0.1}

Multinomial Naive Bayes Classification Report:
                       precision    recall  f1-score   support

    Computer Science       0.82      0.87      0.84       853
             Physics       0.97      0.72      0.83       623
         Mathematics       0.93      0.63      0.75       580
          Statistics       0.86      0.52      0.65       516
Quantitative Biology       0.00      0.00      0.00        58
Quantitative Finance       0.00      0.00      0.00        26

           micro avg       0.88      0.69      0.77      2656
           macro avg       0.60      0.46      0.51      2656
        weighted avg       0.86      0.69      0.75      2656
         samples avg       0.78      0.73      0.74      2656

Multinomial Naive Bayes Accuracy: 0.6129647283126788
Multinomial Naive Bayes Micro-average F1-score: 0.7711505922165821
Multinomial Naive Bayes Macro-average F1-score: 0.5115706345968979


In [26]:
# Input your text here for prediction
sample_text = """
This paper discusses the latest advancements in quantum computing and its potential impact on cryptography.
We explore new algorithms and experimental results.
"""

# Preprocess the input text using the existing function
sample_text_df = pd.DataFrame({'combined': [sample_text]})
processed_sample_text_df = preprocess_text(sample_text_df)
processed_sample_text = processed_sample_text_df['combined']


# Vectorize the preprocessed text using the fitted vectorizers
sample_text_cv = countvector.transform(processed_sample_text)
sample_text_tf = tfidfvector.transform(sample_text_cv)

# Make prediction using the best Naive Bayes model
predicted_labels_nb = best_nb_model.predict(sample_text_tf)

# Map the prediction to topic names
predicted_topics_nb = [labels[i] for i, value in enumerate(predicted_labels_nb[0]) if value == 1]

print(f"Sample Text:\n{sample_text}")
print(f"\nPredicted Topics (Naive Bayes): {predicted_topics_nb}")


Sample Text:

This paper discusses the latest advancements in quantum computing and its potential impact on cryptography.
We explore new algorithms and experimental results.


Predicted Topics (Naive Bayes): ['Computer Science']
