## Required Libraries and Installation
This project utilizes several key Python libraries. To ensure proper execution, please install them using the following commands:

In [27]:
##Exact Pip Command for Installation:

# pip install spacy sentence-transformers scikit-learn pandas numpy PyPDF2 textstat seaborn matplotlib
##Installs core libraries for NLP, ML, data handling, and PDF processing.

##Downloads the small English spaCy language model for text analysis.
# python (or python3) -m spacy download en_core_web_sm
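
Before running the notebook, it can help to confirm that each dependency is importable. The sketch below uses only the standard library. The helper name `missing_packages` and the `REQUIRED` list are hypothetical, and the entries are import names rather than pip names (for example, scikit-learn is imported as `sklearn`).

```python
import importlib.util

def missing_packages(import_names):
    """Return the top-level import names that cannot be found in this environment."""
    return [name for name in import_names
            if importlib.util.find_spec(name) is None]

# Import names can differ from pip names (scikit-learn -> sklearn).
REQUIRED = ["spacy", "sentence_transformers", "sklearn", "pandas",
            "numpy", "PyPDF2", "textstat", "seaborn", "matplotlib"]
```

Calling `missing_packages(REQUIRED)` returns an empty list when everything is installed. The spaCy model `en_core_web_sm` is a separate download and is not covered by this check.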

## Imports and Initialization
This section imports necessary libraries and initializes key components for our paper classification pipeline:

In [28]:
import os
import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import spacy
from sklearn.metrics import confusion_matrix, roc_curve, auc
from PyPDF2 import PdfReader
from sentence_transformers import SentenceTransformer 
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import f1_score, classification_report
from sklearn.model_selection import train_test_split
from textstat import flesch_reading_ease 


# Initialize NLP components and models
nlp = spacy.load('en_core_web_sm')
embedding_model = SentenceTransformer('allenai/specter')
scaler = StandardScaler()
classifier = RandomForestClassifier(n_estimators=100, random_state=42)

No sentence-transformers model found with name allenai/specter. Creating a new one with mean pooling.


## Reading PDF
#### <small>This Python function <b>read_pdf</b> extracts text content from a PDF file using the <code>PdfReader</code> class from PyPDF2. Its job in a binary classifier context is to convert the raw PDF documents into text strings, which can then be processed (e.g., tokenized, vectorized) and used as input for the classification model (e.g., to predict whether a paper is publishable or not). If reading fails, it prints the error and returns <code>None</code>. </small>

In [29]:
def read_pdf(pdf_path):
    """Extract text from PDF file."""
    try:
        reader = PdfReader(pdf_path)
        text = ""
        for page in reader.pages:
            # Defensive guard: treat a page with no extractable text as empty
            text += (page.extract_text() or "") + "\n"
        return text
    except Exception as e:
        print(f"Error reading PDF {pdf_path}: {str(e)}")
        return None
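
One defensive pattern when joining per-page text is to treat a missing or empty extraction result as a blank page, so a single problematic page does not discard the whole document. The sketch below is a standalone illustration with a hypothetical helper `join_pages`; it works on plain strings and does not call PyPDF2.

```python
def join_pages(page_texts):
    """Join per-page text, treating None or empty results as blank pages."""
    return "".join((text or "") + "\n" for text in page_texts)

# Hypothetical per-page results: the second page yielded nothing.
sample = ["First page text", None, "Third page text"]
combined = join_pages(sample)
```

Each page still contributes a newline, so page boundaries remain visible in the combined string.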

## Preprocessing Text
#### <small>This <b>preprocess_text</b> function cleans and prepares the extracted text from PDFs for use in a machine learning model. It removes extra whitespace and citation numbers, then uses spaCy for tokenization, filtering out stop words and punctuation to retain only meaningful words for the binary classification task.</small>

In [30]:
def preprocess_text(text):
    """Clean and preprocess the extracted text."""
    if not text:
        return ""
    
    # Basic cleaning
    text = re.sub(r'\s+', ' ', text)  # Remove extra whitespace
    text = re.sub(r'\[[\d,\s]+\]', '', text)  # Remove citation numbers
    
    # Process with spaCy
    doc = nlp(text)
    
    # Keep only meaningful tokens
    tokens = [token.text for token in doc 
              if not token.is_stop and not token.is_punct 
              and token.text.strip()]
    
    return " ".join(tokens)
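
The two regex substitutions can be tried in isolation, without loading spaCy. This is a minimal sketch using the same patterns as `preprocess_text`; the helper name `basic_clean` and the sample sentence are made up for illustration.

```python
import re

def basic_clean(text):
    """Apply the same two regex substitutions used in preprocess_text."""
    text = re.sub(r'\s+', ' ', text)          # collapse whitespace runs
    text = re.sub(r'\[[\d,\s]+\]', '', text)  # drop bracketed citation numbers
    return text

cleaned = basic_clean("Deep  learning [1, 2] works\n well [3].")
print(repr(cleaned))
```

Because citations are removed after whitespace is collapsed, the result can contain doubled spaces where a citation used to be. In the full function this is harmless, since the spaCy token filter drops whitespace-only tokens.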

## Feature Extraction

The `extract_features` function processes the text of each paper to create a set of numerical features. These features are designed to capture various aspects of the paper's content and structure, which can be useful for distinguishing between different types of papers or predicting publishability. The extracted features include:

*   **Structural Features:** Presence of key sections like abstract, introduction, methodology, results, and conclusion.
*   **Content Quality Features:** Count of citations, equations, figures, and tables, which can indicate the depth and rigor of the research.
*   **Readability and Complexity Features:** Flesch reading ease score, word count, and average word length, providing insights into the writing style and complexity of the paper.
*   **Technical Content Density:** Ratio of technical keywords (e.g., "algorithm," "method," "analysis") to the total word count, reflecting the technical focus of the paper.

These features are then used as input for the machine learning model.
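
For reference, the Flesch reading ease score is commonly defined as follows (the standard English formula; the exact value returned by `textstat` depends on how it counts sentences and syllables):

$$
\text{FRE} = 206.835 - 1.015 \cdot \frac{\text{total words}}{\text{total sentences}} - 84.6 \cdot \frac{\text{total syllables}}{\text{total words}}
$$

Higher scores indicate easier text, so dense academic writing tends to score low.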

In [31]:
def extract_features(text):
    """
     This function extracts numerical features from the paper text that might 
     be informative for classifying research papers. """
    
    features = {}
    
    # Basic structure features (presence of common section headings)
    features['has_abstract'] = int(bool(re.search(r'\b(abstract)\b', text.lower())))
    features['has_introduction'] = int(bool(re.search(r'\b(introduction|background)\b', text.lower())))
    features['has_methodology'] = int(bool(re.search(r'\b(methodology|methods|approach)\b', text.lower())))
    features['has_results'] = int(bool(re.search(r'\b(results|findings)\b', text.lower())))
    features['has_conclusion'] = int(bool(re.search(r'\b(conclusion|conclusions|summary)\b', text.lower())))
    
    # Content quality features
    features['citation_count'] = len(re.findall(r'\[\d+\]|\(\w+\s*,\s*\d{4}\)', text))  # Count citations in various formats
    features['equation_count'] = len(re.findall(r'\$.*?\$', text)) # Count occurrences of LaTeX expressions
    features['figure_count'] = len(re.findall(r'\b(figure|fig\.)\s*\d+\b', text.lower())) # Count mentions of figures by number
    features['table_count'] = len(re.findall(r'\b(table|tbl\.)\s*\d+\b', text.lower())) # Count mentions of tables by number
    
    # Readability and complexity
    features['reading_score'] = flesch_reading_ease(text)  # Flesch reading ease, imported from textstat above
    words = text.split()
    features['word_count'] = len(words)   # Count total words
    features['average_word_length'] = float(np.mean([len(word) for word in words])) if words else 0.0  # Average word length (0.0 for empty text)
    
    # Technical content density
    features['technical_word_ratio'] = len(re.findall(r'\b(algorithm|method|implementation|analysis|evaluation)\b', text.lower())) / max(1, features['word_count'])
    # Count technical terms divided by word count (to avoid division by zero)
    
    return features
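
To see what the counting patterns actually match, they can be run on a short string. The sketch below reuses the citation, figure, and table regexes from `extract_features`; the sample sentence (including the author name) is invented for illustration.

```python
import re

sample = ("As shown in Figure 1 and Fig. 2, prior work [1] and "
          "(Smith, 2020) agree. See Table 3.")

# Same patterns as in extract_features.
citation_count = len(re.findall(r'\[\d+\]|\(\w+\s*,\s*\d{4}\)', sample))
figure_count = len(re.findall(r'\b(figure|fig\.)\s*\d+\b', sample.lower()))
table_count = len(re.findall(r'\b(table|tbl\.)\s*\d+\b', sample.lower()))

print(citation_count, figure_count, table_count)  # prints: 2 2 1
```

Note that the citation pattern only matches single-number brackets such as `[1]` and single-author parentheticals such as `(Smith, 2020)`; grouped citations like `[1, 2]` are not counted.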

### Generating Embeddings

The `generate_embedding` function creates a document embedding (vector representation) of the input text using the SPECTER model. This embedding captures the semantic meaning of the text and is used as input for the classification model.

In [32]:
def generate_embedding(text):
    """Generate document embedding using SPECTER."""
    return embedding_model.encode(text)
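
The load message printed earlier says the model was wrapped with mean pooling, meaning the per-token vectors are averaged into one document vector. The toy sketch below illustrates that idea in plain Python. It is not the sentence-transformers implementation, and the helper name `mean_pool` and the input values are hypothetical.

```python
def mean_pool(token_vectors, attention_mask):
    """Average token vectors, ignoring positions where the mask is 0."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    # max(count, 1) avoids dividing by zero when every position is masked
    return [value / max(count, 1) for value in total]

# Three token vectors of dimension 2; the last position is padding.
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
pooled = mean_pool(tokens, mask)
print(pooled)  # prints: [2.0, 3.0]
```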

## Preparing Features

The `prepare_features` function combines the extracted numerical features and the document embedding into a single feature vector. This combined vector serves as the final input to the classification model.

In [33]:
def prepare_features(features_dict, embedding):
    """Combine numerical features and embedding into single feature vector."""
    numerical_features = np.array(list(features_dict.values()))
    return np.concatenate([numerical_features, embedding])
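
`prepare_features` relies on the dictionary's insertion order, which is consistent here because `extract_features` always fills the keys in the same sequence. If features ever come from another source, an explicit order is a safer option. The sketch below assumes a hypothetical helper `to_vector` and a shortened, illustrative `FEATURE_ORDER` list.

```python
def to_vector(features, feature_order):
    """Return feature values as floats in an explicit, fixed order."""
    return [float(features[name]) for name in feature_order]

# Hypothetical subset of the feature names produced by extract_features.
FEATURE_ORDER = ["has_abstract", "citation_count", "word_count"]

a = to_vector({"has_abstract": 1, "citation_count": 4, "word_count": 1200},
              FEATURE_ORDER)
b = to_vector({"word_count": 1200, "has_abstract": 1, "citation_count": 4},
              FEATURE_ORDER)
```

Both calls yield the same vector even though the dictionaries were built in different orders.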

### Data Preparation

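The `prepare_data` function takes a list of labeled paper paths and performs the full data preprocessing pipeline: reading the PDF, preprocessing the text, extracting features (numerical features from the raw text and an embedding from the preprocessed text), and combining them into a feature matrix `X` with a matching label array `y`. Papers whose PDF cannot be read are skipped.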

In [34]:
def prepare_data(labeled_papers):
    """Prepare features and labels from labeled papers."""
    X = []
    y = []
    
    for paper_path, label in labeled_papers:
        # Read and process paper
        text = read_pdf(paper_path)
        if text is None:
            continue
        
        processed_text = preprocess_text(text)
        
        # Extract features
        features = extract_features(text)
        embedding = generate_embedding(processed_text)
        
        # Combine features
        X.append(prepare_features(features, embedding))
        y.append(label)
    
    return np.array(X), np.array(y)

## Model Training and Evaluation

This section details the training and evaluation of our classification model. The `train_classifier` function orchestrates the process:

1.  **Data Preparation:** The `prepare_data` function (defined above) extracts features (X) and labels (y) from the labeled papers.
2.  **Train-Test Split:** Data is split into 80% training and 20% testing sets using `train_test_split` with stratification to maintain class balance.
3.  **Feature Scaling:** Features are scaled using a pre-defined `scaler` (e.g., `StandardScaler`) to improve model performance.
4.  **Model Training:** A pre-defined `classifier` is trained on the scaled training data.
5.  **Prediction and Evaluation:** Predictions are made on the test set, and performance is evaluated using the F1 score and a detailed classification report. The F1 score is returned for model comparison.

This pipeline gives a quick baseline for comparing classifier choices. With only 15 labeled papers, the 20% test split holds just 3 samples, so the reported metrics should be read as indicative rather than reliable.

In [35]:
def train_classifier(labeled_papers):
    """Train the classifier on labeled papers and report test-set metrics."""
    # Prepare features and labels
    X, y = prepare_data(labeled_papers)
    
    # Split the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )
    
    # Scale features
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)
    
    # Train classifier
    classifier.fit(X_train, y_train)
    
    # Make predictions on test set
    y_pred = classifier.predict(X_test)
    # y_proba = classifier.predict_proba(X_test)
    
    # Calculate and print metrics
    f1 = f1_score(y_test, y_pred)
    print("\nModel Performance Metrics:")
    print(f"F1 Score: {f1:.3f}")
    print("\nDetailed Classification Report:")
    print(classification_report(y_test, y_pred))
    
    return f1
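
The scaling step fits statistics on the training split only and then reuses them for the test split, which keeps test data from influencing the transformation. The sketch below shows that idea for a single feature in plain Python. The helper names are hypothetical, and it mirrors the usual standardization formula (mean and population standard deviation) rather than calling scikit-learn.

```python
import statistics

def fit_standardizer(train_values):
    """Learn mean and population standard deviation from training values only."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values) or 1.0  # constant feature: avoid dividing by zero
    return mean, std

def standardize(values, mean, std):
    """Apply previously learned statistics to any split."""
    return [(value - mean) / std for value in values]

train = [1.0, 2.0, 3.0]
test = [2.0, 4.0]
mean, std = fit_standardizer(train)
train_scaled = standardize(train, mean, std)
test_scaled = standardize(test, mean, std)  # uses the training statistics
```

A test value equal to the training mean maps to 0, while values outside the training range can fall well beyond the scaled training values.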

## Paper Publishability Prediction

The `predict_paper` function predicts a paper's publishability given its file path:

1.  **Text Extraction & Preprocessing:** Extracts text from the PDF using `read_pdf` and preprocesses it with `preprocess_text`. Returns `None` if PDF reading fails.
2.  **Feature Engineering:** Extracts features (`extract_features`) from the raw text and generates a text embedding (`generate_embedding`) from the preprocessed text.
3.  **Feature Preparation & Scaling:** Combines features and embedding using `prepare_features`, then scales the resulting feature vector using the pre-trained `scaler`. Reshaping to (1, -1) ensures compatibility with the scaler.
4.  **Prediction:** Uses the pre-trained `classifier` to predict publishability.
5.  **Output:** Returns a dictionary containing the prediction.

This function reuses the fitted `scaler` and trained `classifier`, so `train_classifier` must be run before it is called.

In [36]:
def predict_paper(paper_path):
    """Predict if a paper is publishable."""
    # Read and process paper
    text = read_pdf(paper_path)
    if text is None:
        return None
    
    processed_text = preprocess_text(text)
    
    # Extract features
    features = extract_features(text)
    embedding = generate_embedding(processed_text)
    
    # Prepare features
    X = prepare_features(features, embedding)
    X = scaler.transform(X.reshape(1, -1))
    
    # Make prediction
    probability = classifier.predict_proba(X)[0]
    prediction = classifier.predict(X)[0]
    
    return {
        'prediction': prediction,
        # 'confidence': probability[1] if prediction == 1 else probability[0]
    }
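
The commented-out confidence line indexes the probability array by position, which assumes the columns are ordered as classes 0 and 1. In scikit-learn the columns of `predict_proba` follow `classifier.classes_`, so looking up the predicted class there is a slightly more general approach. The sketch below uses plain lists and a hypothetical helper `confidence_for`; the probability values are invented.

```python
def confidence_for(prediction, classes, probabilities):
    """Return the probability assigned to the predicted class."""
    return probabilities[list(classes).index(prediction)]

# Hypothetical values in the shape returned for one sample.
classes = [0, 1]
probabilities = [0.3, 0.7]
confidence = confidence_for(1, classes, probabilities)
```

Inside `predict_paper`, the equivalent call would pass `classifier.classes_` and the already computed `probability` array.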

## Main Execution and Result Generation

The `main` function orchestrates the entire paper publishability prediction process:

1.  **Labeled Data:** Defines a list of labeled papers (filename, label) for model training.
2.  **Model Training:** Trains the classifier using `train_classifier` and prints the overall F1 score.
3.  **Batch Prediction:** Iterates through all PDF files in the 'Papers' directory, predicting publishability for each using `predict_paper`. Handles potential PDF reading errors.
4.  **Result Aggregation:** Stores the predictions (filename and publishability) in a list.
5.  **Result Saving:** Saves the aggregated results to a 'results.csv' file using a Pandas DataFrame.

This function provides a complete workflow for training the model and applying it to a batch of papers, saving the results for further analysis.

In [37]:
def main():
    # Example of labeled papers (path, label)
    labeled_papers = [
        ('R001.pdf', 0),
        ('R002.pdf', 0),
        ('R003.pdf', 0),
        ('R004.pdf', 0),            
        ('R005.pdf', 0),
        ('R006.pdf', 1),
        ('R007.pdf', 1),
        ('R008.pdf', 1),
        ('R009.pdf', 1),
        ('R010.pdf', 1),
        ('R011.pdf', 1),
        ('R012.pdf', 1),
        ('R013.pdf', 1),
        ('R014.pdf', 1),
        ('R015.pdf', 1)
    ]
    
    # Train the classifier and get F1 score
    f1 = train_classifier(labeled_papers)
    print(f"\nOverall F1 Score: {f1:.3f}")
    
    # Directory containing papers to classify
    papers_dir = 'Papers'
    
    # Classify all papers in directory
    results = []
    for filename in os.listdir(papers_dir):
        if filename.endswith('.pdf'):
            paper_path = os.path.join(papers_dir, filename)
            result = predict_paper(paper_path)
            if result is not None:
                results.append({
                    'paper_id': filename,
                    'publishable': result['prediction'],
                    # 'confidence': result['confidence']
                })
    
    # Save results to CSV
    results_df = pd.DataFrame(results)
    results_df.to_csv('results.csv', index=False)
    print("\nClassification complete. Results saved to results.csv")
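
`os.listdir` returns entries in arbitrary order, and `filename.endswith('.pdf')` is case-sensitive, so the rows in `results.csv` may appear in a different order between runs and files named `.PDF` are skipped. If a stable order matters, the file list can be filtered and sorted first. The sketch below uses a hypothetical helper `select_pdfs` on an invented listing.

```python
def select_pdfs(filenames):
    """Return PDF filenames in sorted order, matching the extension case-insensitively."""
    return sorted(name for name in filenames if name.lower().endswith(".pdf"))

# Hypothetical directory listing.
listing = ["P002.pdf", "notes.txt", "P001.PDF", "P003.pdf"]
selected = select_pdfs(listing)
```

In `main`, this would correspond to iterating over `select_pdfs(os.listdir(papers_dir))`.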

## Main Execution Entry Point

This section defines the entry point for the program. The `if __name__ == "__main__":` block is a standard Python construct that ensures the `main()` function, which orchestrates the entire paper classification workflow, is executed only when the script is run directly rather than imported as a module.

In [39]:
if __name__ == "__main__":
    main()


Model Performance Metrics:
F1 Score: 0.800

Detailed Classification Report:
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         1
           1       0.67      1.00      0.80         2

    accuracy                           0.67         3
   macro avg       0.33      0.50      0.40         3
weighted avg       0.44      0.67      0.53         3


Overall F1 Score: 0.800



Classification complete. Results saved to results.csv
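
The numbers in the report above can be reproduced by hand. With supports of 1 (class 0) and 2 (class 1), a recall of 1.00 for class 1 and 0.00 for class 0 implies that all three test papers were predicted as class 1. The sketch below recomputes the class 1 metrics from such labels in plain Python; the sample order is an assumption, and the helper name is hypothetical.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Labels consistent with the report: everything predicted as class 1.
y_true = [0, 1, 1]
y_pred = [1, 1, 1]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"{precision:.2f} {recall:.2f} {f1:.2f}")  # prints: 0.67 1.00 0.80
```

An F1 of 0.800 here therefore reflects a classifier that never predicts class 0 on the test split, which is worth keeping in mind when reading the headline score.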
