# Topic Modelling - Citation Prediction Project
By Jakub Wujec and Jakub Żmujdzin

## Abstract
This project develops an article citation prediction model built on topic modelling. We examine the text of articles to identify common semantics and categories, and use machine learning to build predictive models from these observations. We then evaluate how accurately the model predicts citation counts. The approach uses natural language processing (NLP) techniques, namely topic modelling and document-term matrices, to extract features from the articles. A supervised tree boosting algorithm (XGBoost) then predicts the citation score from the topics present in an article. The research focuses mainly on identifying which topics are "hot" in terms of citations rather than on predicting the exact citation count of an article.

## Keywords
Topic modelling, XGBoost, citation, regression

## Introduction

In [3]:
import pandas as pd
from pathlib import Path
df = pd.read_json(Path.cwd() / "final_df" / "final_df.json")

### Research questions
- How can the text content of a paper be used to predict its citation count?
- Can machine learning algorithms be trained to accurately predict the citation count of a paper based on its text content?
- Can combining the tasks of topic modelling and citation prediction produce better results than keeping the tasks separate?
### Motivation
As students with an interest in publishing papers in the future, we see maximizing citation score as one possible objective when writing. The first question was whether Machine Learning (Natural Language Processing) tools can find a correlation between certain topics and citation score. If such a relationship exists, the next step was to find out which topics produce the highest citation scores and are therefore worth focusing on. Finally, we wanted to determine whether splitting the topic modelling and citation prediction tasks produces better results, which helps decide which models to focus on for accurate predictions in the future.
### Methodology
- Models

Latent Dirichlet Allocation (LDA) was chosen as the basic tool for topic modelling in this task. It performs well on many kinds of text and is widely regarded as one of the standard models for topic modelling.
For predicting citation score from topics, we chose XGBoost, a gradient boosted trees method, along with sLDA, a supervised variant of LDA. XGBoost is considered one of the most accurate supervised models. sLDA, on the other hand, combines topic modelling and regression, including their hyperparameter choice, in a single model, which could in theory yield better citation predictions for a paper, since the tasks are not split.
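
To make the "tasks are not split" point concrete, the standard sLDA formulation with a linear response can be written as below. This is a sketch of the textbook model, not a description of the exact configuration fitted here; the symbols $w$ and $\sigma^2$ are notation introduced for this illustration. For a document $d$ with $N_d$ tokens, let $z_{d,n}$ be the one-hot topic assignment of token $n$ and $y_d$ the citation count:

$$
\bar{z}_d = \frac{1}{N_d} \sum_{n=1}^{N_d} z_{d,n},
\qquad
y_d \mid \bar{z}_d \sim \mathcal{N}\!\left(w^\top \bar{z}_d,\ \sigma^2\right)
$$

Because $y_d$ is part of the generative model, the topic assignments are inferred so that they explain both the words and the citation count. In the two-stage approach, LDA is fitted on the words alone and XGBoost is trained afterwards on the resulting topic proportions. If the `mu` and `nu_sq` columns in the sLDA results table follow the arguments of tomotopy's `SLDAModel`, they would correspond to the mean and variance of a Gaussian prior on $w$.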

- Dataset source & presentation

The dataset is available in the final_df/final_df.json file.
To construct this dataset, we used the arXiv API and Google Scholar. Using the arXiv API, we searched with the query "machine learning" and downloaded each article's title, authors and the link to the PDF file containing its text. Then, we used BeautifulSoup to scrape Google Scholar. For each article in the dataframe, we searched Google Scholar for the article's title and authors and extracted the citation count, if it was available. Finally, we downloaded the PDF files from the arXiv links collected earlier and used PyPDF2 to read their text. We saved the articles to a JSON file containing each publication's title, text and citation score.
We arrived at 1234 articles, 12 of which were wrongly decoded and were discarded.
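
The first collection step described above can be sketched as follows. This is a minimal illustration, assuming the public arXiv Atom endpoint at `http://export.arxiv.org/api/query`; the helper names `build_query_url` and `parse_feed`, the `all:` search prefix and the paging values are assumptions made for this example, not the exact calls used to build the dataset.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

ARXIV_API = "http://export.arxiv.org/api/query"  # public arXiv API endpoint
ATOM = "{http://www.w3.org/2005/Atom}"           # Atom XML namespace prefix


def build_query_url(query, start=0, max_results=100):
    """Build an arXiv API URL for a phrase search (hypothetical helper)."""
    params = {
        "search_query": f'all:"{query}"',
        "start": start,
        "max_results": max_results,
    }
    return f"{ARXIV_API}?{urlencode(params)}"


def parse_feed(xml_text):
    """Extract title, authors and PDF link from an Atom feed string."""
    root = ET.fromstring(xml_text)
    records = []
    for entry in root.iter(f"{ATOM}entry"):
        # Titles may be wrapped over several lines; collapse the whitespace.
        title = " ".join(entry.findtext(f"{ATOM}title", default="").split())
        authors = [
            author.findtext(f"{ATOM}name", default="")
            for author in entry.iter(f"{ATOM}author")
        ]
        # The PDF link is the <link> element whose title attribute is "pdf".
        pdf_link = next(
            (
                link.get("href")
                for link in entry.iter(f"{ATOM}link")
                if link.get("title") == "pdf"
            ),
            None,
        )
        records.append({"title": title, "authors": authors, "link": pdf_link})
    return records
```

The feed text itself would be fetched from `build_query_url("machine learning")`, for example with `urllib.request.urlopen`, and passed to `parse_feed`. Keeping the parsing separate from the network call makes it possible to check the parser on a saved response.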


In [2]:
df.head()

| | title | link | citations | text |
|---|---|---|---|---|
| 0 | Continual Reinforcement Learning with TELLA | http://arxiv.org/pdf/2208.04287v1 | 2 | Workshop Track - 1st Conference on Lifelong Le... |
| 1 | An exact mapping between the Variational Renor... | http://arxiv.org/pdf/1410.3831v1 | 295 | arXiv:1410.3831v1 [stat.ML] 14 Oct 2014An ex... |
| 2 | Learning Generative Models across Incomparable... | http://arxiv.org/pdf/1905.05461v2 | 69 | Learning Generative Models across Incomparable... |
| 3 | On the Generalization Ability of Online Learni... | http://arxiv.org/pdf/1305.2505v1 | 74 | On the Generalization Ability of Online Learni... |
| 4 | Geometric Understanding of Deep Learning | http://arxiv.org/pdf/1805.10451v2 | 110 | Geometric Understanding of Deep Learning\nNa L... |


In [3]:
df['text'].isna().sum()

12

- Code for data cleaning and preprocessing

We used NLTK's `PorterStemmer` and `RegexpTokenizer`.

Do not run the cell below; it is for presentation purposes only.

In [None]:
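from nltk.stem import PorterStemmer
from nltk.tokenize import RegexpTokenizer

stemmer = PorterStemmer()
# Tokens are maximal runs of word characters (letters, digits, underscore)
tokenizer = RegexpTokenizer(r'\w+')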

We wrote a lengthy function ```preprocess_text``` which does the following, in order: <br>
- [x] Replaces runs of whitespace and digits with a single space ```re.sub(r"[\s\d]+", " ", word)```
- [x] Gets rid of LaTeX equations ```re.sub(r"(\${1,2})(?:(?!\1)[\s\S])*\1", ... ```
- [x] Tokenizes the words ``` tokenizer.tokenize(... ```
- [x] Gets rid of words of 2 characters or fewer ``` if len(word) > 2 ```
- [x] Gets rid of "special" words we have identified as useless ```word not in [ ... ] ```
- [x] Stems the remaining words
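
The two regular expressions in the list above can be tried in isolation. The snippet below is a small sketch using only the standard library; the helper names `strip_digits` and `strip_latex` are introduced for illustration and do not appear in the project code.

```python
import re

# Same patterns as in the list above, compiled once for reuse.
DIGITS_WS_RE = re.compile(r"[\s\d]+")
LATEX_RE = re.compile(r"(\${1,2})(?:(?!\1)[\s\S])*\1")


def strip_digits(chunk: str) -> str:
    """Replace every run of whitespace and digits with one space."""
    return DIGITS_WS_RE.sub(" ", chunk)


def strip_latex(chunk: str) -> str:
    """Replace $...$ and $$...$$ spans with one space.

    The backreference \\1 forces the closing delimiter to match the
    opening one, so an unmatched "$" is left untouched.
    """
    return LATEX_RE.sub(" ", chunk)


print(repr(strip_digits("model 42  v2")))
print(repr(strip_latex("$x+y$")))
print(repr(strip_latex("see$$a=b$$.")))
print(repr(strip_latex("price $5")))
```

Both substitutions insert a space rather than an empty string, so neighbouring words are not glued together when the matched span is removed.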

Finally, we applied CountVectorizer to the output.
- [x] Using max_df and min_df we removed words that are too frequent or too rare
- [x] Using stop_words='english' we removed English stopwords
- [x] We extracted word n-grams of length 1 to 4
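
The effect of `min_df`, `max_df` and the n-gram range can be illustrated without scikit-learn. The sketch below is a simplified pure-Python stand-in, not the CountVectorizer implementation: it skips stopword removal, and the helper names `word_ngrams` and `filter_vocabulary` are made up for this example. It follows the rule that, for fractional thresholds, a term is kept when its document frequency lies between `min_df * n_docs` and `max_df * n_docs` inclusive.

```python
from collections import Counter


def word_ngrams(tokens, n_min=1, n_max=4):
    """Return all word n-grams with n_min <= n <= n_max as strings."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


def filter_vocabulary(docs, min_df=0.01, max_df=0.8, ngram_range=(1, 4)):
    """Keep n-grams whose document frequency is within the given bounds."""
    doc_freq = Counter()
    for doc in docs:
        # A set counts each n-gram at most once per document.
        doc_freq.update(set(word_ngrams(doc.split(), *ngram_range)))
    n_docs = len(docs)
    return sorted(
        term
        for term, count in doc_freq.items()
        if min_df * n_docs <= count <= max_df * n_docs
    )


docs = [
    "deep learn model",
    "deep learn theori",
    "deep graph",
    "deep kernel",
    "deep bound",
]
# "deep" occurs in every document and is dropped as too frequent;
# n-grams seen in a single document are dropped as too rare.
print(filter_vocabulary(docs, min_df=0.3, max_df=0.8))
```

With the thresholds used in this project (`min_df = 0.01`, `max_df = 0.8`), an n-gram survives only if it appears in at least 1% and at most 80% of the articles.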

Do not run the cell below; it is for presentation purposes only.

In [None]:
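import re

from sklearn.feature_extraction.text import CountVectorizer

# Extraction artefacts and boilerplate tokens identified as useless.
# The multi-word entries can never equal a single token; they are kept
# to mirror the original list.
EXCLUDED_WORDS = {
    "uni",
    "uni uni",
    "uni uni uni",
    "ieee",
    "doi",
    "vextendsingl",
    "http",
    "https",
    "vextenddoubl",
    "parenrightbig",
    "parenleftbig",
}
LATEX_RE = re.compile(r"(\${1,2})(?:(?!\1)[\s\S])*\1")


def preprocess_text(text: str) -> str:
    # Work chunk by chunk on the whitespace-split text: replace digits with
    # spaces first, then remove $...$ / $$...$$ spans inside each chunk
    chunks = [
        LATEX_RE.sub(" ", re.sub(r"[\s\d]+", " ", chunk))
        for chunk in text.split()
    ]
    tokens = tokenizer.tokenize(" ".join(chunks))
    # Stem tokens longer than 2 characters that are not excluded;
    # all other tokens are replaced by empty strings
    return " ".join(
        stemmer.stem(word) if len(word) > 2 and word not in EXCLUDED_WORDS else ""
        for word in tokens
    )


tf_vectorizer = CountVectorizer(
    ngram_range=(1, 4),            # word n-grams of length 1 to 4
    max_df=0.8,                    # drop terms in more than 80% of documents
    min_df=0.01,                   # drop terms in fewer than 1% of documents
    tokenizer=tokenizer.tokenize,
    stop_words='english',
)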


- Our fitted topic model is available in the model.pkl file

In [4]:
import pickle

# Load the fitted topic model; the context manager closes the file afterwards
with open('model.pkl', 'rb') as f:
    lda = pickle.load(f)
lda

## Presentation and interpretation of results
Aggregated Profiles plot:
<br>
<img src="https://raw.githubusercontent.com/jzmujdzin/topic-modelling-citation-prediction/main/agg_profiles_static.png">
<br>
Interactive plot (html link):
<a href="agg_profiles.html" target="_blank">here</a>
<br> <br>
Interpretation of Aggregated Profiles
For all topics, citations were highest when the topic probability was very close to 0. They then dropped sharply and rose slowly as the topic probability increased. This is because the probability that a paper was relevant to a topic was usually either very low or very high.
Two topics stood out in terms of citations: Topic 0 (Cybersecurity/Hacking/Reinforcement Learning) and Topic 4 (Classification/Modelling). Topics 1 and 2 had very similar citation levels. Topic 3 (Reinforcement Learning) had a similar number of citations across the whole range of p, with one outlier at approximately p = 0.72, where it reached 600 citations.
The conclusions from the Aggregated Profiles analysis are consistent with what one would expect. Cybersecurity can be considered a "trendy" topic over the last few years, while the Classification/Modelling topic is so broad that it is not surprising it collects so many citations.
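
For readers unfamiliar with aggregated profiles, the idea is sketched below in plain Python. This is an illustration under assumptions, not the code that produced the plot: `predict` stands for any fitted model's prediction function, and the toy model and feature names are made up. For each grid value, the chosen topic probability is overwritten in every observation and the predictions are averaged.

```python
def aggregated_profile(predict, rows, feature, grid):
    """Average prediction over all rows with `feature` fixed to each grid value."""
    profile = []
    for value in grid:
        # Copy each row and overwrite only the feature under study.
        predictions = [predict({**row, feature: value}) for row in rows]
        profile.append(sum(predictions) / len(predictions))
    return profile


def toy_predict(row):
    """Toy stand-in for a fitted regressor (hypothetical)."""
    return 10 * row["topic_0"] + row["topic_1"]


rows = [
    {"topic_0": 0.1, "topic_1": 0.9},
    {"topic_0": 0.5, "topic_1": 0.5},
]
print(aggregated_profile(toy_predict, rows, "topic_0", [0.0, 0.5, 1.0]))
```

Because the other features keep their observed values, the curve shows how the average predicted citation count changes with one topic probability alone.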
<br> <br>
Variable importance plot:
<br>
<img src="https://raw.githubusercontent.com/jzmujdzin/topic-modelling-citation-prediction/main/var_importance_static.png">
<br>
Interactive plot (html link):
<a href="var_importance.html" target="_blank">here</a>
<br> <br>
Interpretation of Variable Importance
Topic 0 (Cybersecurity/Hacking/Reinforcement Learning) had the highest drop-out loss, meaning the model treated it as the most informative of all variables. Topics 1, 4 and 2 had very similar drop-out loss values and can be considered equally worth keeping from the perspective of the XGB Regressor model. Topic 3 (Reinforcement Learning) was the least informative.
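
The drop-out loss behind a variable importance plot of this kind can be sketched as a permutation test. The code below is a minimal illustration, assuming RMSE as the loss and a single permutation; the function names, the toy model and the data are made up and are not taken from the project. A variable's values are shuffled across observations, and the loss of the unchanged model is measured on the perturbed data.

```python
import math
import random


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )


def dropout_loss(predict, rows, targets, feature, seed=0):
    """Loss after permuting one feature; higher means more important."""
    rng = random.Random(seed)
    shuffled = [row[feature] for row in rows]
    rng.shuffle(shuffled)
    permuted = [{**row, feature: value} for row, value in zip(rows, shuffled)]
    return rmse(targets, [predict(row) for row in permuted])


def toy_predict(row):
    """Toy model that uses only topic_0 (hypothetical)."""
    return 100 * row["topic_0"]


rows = [{"topic_0": i / 10, "topic_1": 1 - i / 10} for i in range(10)]
targets = [toy_predict(row) for row in rows]

for name in ("topic_0", "topic_1"):
    print(name, dropout_loss(toy_predict, rows, targets, name))
```

Shuffling a variable the model ignores leaves the loss unchanged, while shuffling one it relies on raises the loss, which is why a higher drop-out loss is read as higher importance.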

In [1]:
import tomotopy as tp

mdl = tp.SLDAModel.load('best_model.bin')
mdl


In [6]:
results_sld = pd.read_csv("slda_results_2.csv").drop(columns="Unnamed: 0")

In [9]:
results_sld.sort_values(by="mape", ascending=True)

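| | k | min_df | rm_top | vars | alpha | eta | mu | nu_sq | glm_param | seed | mape |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 23 | 25 | 0 | 5 | l | 0.2 | 0.01 | 0.0 | 1 | 1 | 123 | 0.848175 |
| 15 | 25 | 0 | 0 | l | 0.3 | 0.01 | 0.0 | 1 | 1 | 123 | 1.054079 |
| 4 | 20 | 0 | 5 | l | 0.1 | 0.01 | 0.0 | 1 | 1 | 123 | 25.392506 |
| 3 | 20 | 0 | 5 | l | 0.3 | 0.01 | 0.2 | 1 | 1 | 123 | 74.194412 |
| 8 | 15 | 0 | 1 | l | 0.3 | 0.01 | 0.2 | 1 | 1 | 123 | 103.041735 |
| 6 | 20 | 0 | 1 | l | 0.2 | 0.01 | 0.0 | 1 | 1 | 123 | 112.171611 |
| 29 | 15 | 0 | 1 | l | 0.1 | 0.01 | 0.1 | 1 | 1 | 123 | 122.279371 |
| 12 | 15 | 0 | 0 | l | 0.2 | 0.01 | 0.2 | 1 | 1 | 123 | 136.615933 |
| 24 | 15 | 0 | 5 | l | 0.1 | 0.01 | 0.0 | 1 | 1 | 123 | 145.060881 |
| 9 | 25 | 0 | 2 | l | 0.3 | 0.05 | 0.2 | 1 | 1 | 123 | 291.991895 |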


#### sLDA results interpretation
As we can see, there is a significant difference between the first two models and the rest. Their mean absolute percentage error (MAPE) is reasonable and makes it possible to judge whether a paper is likely to be highly cited.
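
For reference, MAPE over $n$ papers is $\frac{1}{n}\sum_{i=1}^{n} \left|\frac{y_i - \hat{y}_i}{y_i}\right|$, where $y_i$ is the true and $\hat{y}_i$ the predicted citation count. The function below is a minimal sketch of that definition. It returns a fraction rather than a percentage and rejects zero true values, which are assumptions of this example; how papers with zero citations were handled when the `mape` column was computed is not stated here.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error, returned as a fraction."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    if any(t == 0 for t in y_true):
        # The relative error is undefined for a true value of zero.
        raise ValueError("MAPE is undefined when a true value is zero")
    return sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


# One prediction is 10% off and the other 50% off.
print(mape([100, 50], [110, 25]))
```

If the table reports MAPE as a fraction, as this function does, a value of 0.85 means that predictions are off by 85% of the true citation count on average.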

#### List of topics

In [None]:
PRINTED_TOPICS = 5
# Top words of the first PRINTED_TOPICS topics of the loaded sLDA model
topic_list = [mdl.get_topic_words(i) for i in range(PRINTED_TOPICS)]

## Conclusions
In this research we set out to answer the following questions:
#### How can the text content of a paper be used to predict its citation count?  
We can use topic modelling algorithms to predict citation counts. However, we need either a supervised version of the algorithm or a combination of two different models to predict the number of citations from the text.

#### Can machine learning algorithms be trained to accurately predict the citation count of a paper based on its text content?
We have shown in our study that machine learning algorithms can be trained to predict the citation count of a paper based on its text content. The models are not fully accurate, yet their predictive ability lets us see whether a paper will be cited a lot or not.

#### Can combining the tasks of topic modelling and citation prediction produce better results than keeping the tasks separate?
Our study has shown that the sLDA model outperformed the combination of LDA and XGBoost on the MAPE metric. However, the difference was not that big, and sLDA has its downsides: its training is computationally much more expensive than that of LDA and XGBoost.

Overall, the results show that the text content of a paper can be used to predict its citation count. The predictions are not exact, but they provide useful insight into the potential impact and recognition of a paper.

It was also found that combining the tasks of topic modelling and citation prediction can produce better results than performing the tasks separately. The sLDA model outperformed the combination of LDA and XGBoost on the MAPE metric, indicating that sLDA is a more effective model for this particular task. However, the results may vary depending on the specific application and the data being used.

While the results from this study are encouraging, there are limits to how accurately machine learning algorithms can predict citation counts. Numerous factors influence the citation count of a paper, including the quality of the research, the relevance of the topic, and the visibility of the paper. These factors cannot be captured by the text content alone, so the results of this research should be considered together with other relevant information.

In conclusion, the findings from this research highlight the potential of text content to predict the citation count of a paper and the role of machine learning algorithms in this process. Further research is needed to improve the accuracy and applicability of these models, and to better understand the various factors that influence the citation count of a paper.