# Topic Modelling - Citation Prediction Project
By Jakub Wujec and Jakub Żmujdzin

## Abstract
This project aims to develop an article citation prediction model that identifies and categorizes articles using topic modelling. We examine the text of the articles to identify common semantics and categories, and use machine learning techniques to build predictive models based on these observations. We then evaluate how accurately the model predicts citation counts. The proposed model uses natural language processing (NLP) techniques, namely topic modelling and document-term matrices, to extract features from the articles. We then use a supervised tree boosting algorithm (XGBoost) to predict the citation score from the topics present in an article. The research focuses mainly on identifying which topics are "hot" in terms of citations, rather than on predicting the exact citation count of an article.

## Keywords
Topic modelling, XGBoost, citation, regression

## Introduction

In [1]:
import pandas as pd
from pathlib import Path
df = pd.read_json(Path.cwd() / "final_df" / "final_df.json")

### Research questions
some research q
### Motivation for undertaking the topic
some motivation text 
### Methodology
- Dataset source & presentation <br>
The dataset is available in the final_df/final_df.json file.
To construct the dataset, we used the arXiv API and Google Scholar. Using the arXiv API, we searched for the query "machine learning" and downloaded each article's title, authors and link to the PDF file containing the text. We then used BeautifulSoup to scrape Google Scholar: for each article in the dataframe, we searched Google Scholar for the article's title and authors and extracted the citation count, if it was available. Finally, we used PyPDF2 to read the PDF files behind the arXiv links collected earlier and extract their text. We saved the articles to a JSON file containing each publication's title, text and citation score.
In total, we collected 1234 articles, of which 12 were wrongly decoded. Those articles were discarded.
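
The arXiv step described above can be sketched roughly as follows. This is an illustration only, not the code used to build the dataset: the helper names `build_query_url` and `parse_feed`, the `all:"..."` query form and the hand-written sample feed (including the author name) are assumptions made for this example. It builds a search URL for the public arXiv API and reads the title, authors and PDF link from an Atom response using only the standard library.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom namespace used in arXiv API responses


def build_query_url(query, start=0, max_results=100):
    """Build an arXiv API search URL (hypothetical helper)."""
    params = {"search_query": f'all:"{query}"', "start": start, "max_results": max_results}
    return "http://export.arxiv.org/api/query?" + urlencode(params)


def parse_feed(xml_text):
    """Return title, authors and PDF link for each Atom entry (hypothetical helper)."""
    records = []
    for entry in ET.fromstring(xml_text).iter(f"{ATOM}entry"):
        # Titles may be wrapped over several lines, so collapse the whitespace.
        title = " ".join(entry.findtext(f"{ATOM}title", default="").split())
        authors = [a.findtext(f"{ATOM}name") for a in entry.findall(f"{ATOM}author")]
        pdf_link = next(
            (link.get("href") for link in entry.findall(f"{ATOM}link")
             if link.get("title") == "pdf"),
            None,
        )
        records.append({"title": title, "authors": authors, "link": pdf_link})
    return records


# Hand-written sample shaped like an arXiv response, so the sketch runs offline.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>Geometric Understanding of
      Deep Learning</title>
    <author><name>Jane Doe</name></author>
    <link title="pdf" href="http://arxiv.org/pdf/1805.10451v2" rel="related" type="application/pdf"/>
  </entry>
</feed>"""

url = build_query_url("machine learning", max_results=5)
records = parse_feed(SAMPLE_FEED)
```

In a real run, the response body fetched from `url` would be passed to `parse_feed` in place of `SAMPLE_FEED`.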


In [2]:
df.head()

| Unnamed: 0 | title | link | citations | text |
|---|---|---|---|---|
| 0 | Continual Reinforcement Learning with TELLA | http://arxiv.org/pdf/2208.04287v1 | 2 | Workshop Track - 1st Conference on Lifelong Le... |
| 1 | An exact mapping between the Variational Renor... | http://arxiv.org/pdf/1410.3831v1 | 295 | arXiv:1410.3831v1 [stat.ML] 14 Oct 2014An ex... |
| 2 | Learning Generative Models across Incomparable... | http://arxiv.org/pdf/1905.05461v2 | 69 | Learning Generative Models across Incomparable... |
| 3 | On the Generalization Ability of Online Learni... | http://arxiv.org/pdf/1305.2505v1 | 74 | On the Generalization Ability of Online Learni... |
| 4 | Geometric Understanding of Deep Learning | http://arxiv.org/pdf/1805.10451v2 | 110 | Geometric Understanding of Deep Learning\nNa L... |


In [3]:
df['text'].isna().sum()

12
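
The 12 missing texts counted above match the 12 wrongly decoded articles. A minimal sketch of how such rows could be discarded, using toy data (the names `toy_df` and `clean_df` are hypothetical, and the filtering code actually used in the project is not shown in this notebook):

```python
import pandas as pd

# Toy stand-in for the real dataframe; the second row mimics a failed PDF decode.
toy_df = pd.DataFrame(
    {
        "title": ["Paper A", "Paper B", "Paper C"],
        "citations": [2, 295, 69],
        "text": ["some decoded text", None, "more decoded text"],
    }
)

# Keep only rows whose text was decoded, then renumber the index.
clean_df = toy_df.dropna(subset=["text"]).reset_index(drop=True)
```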

- Code for data cleaning and preprocessing

We used NLTK's PorterStemmer and RegexpTokenizer.

Do not run the cell below; it is for presentation purposes only.

In [None]:
from nltk.stem import PorterStemmer
from nltk.tokenize import RegexpTokenizer
stemmer = PorterStemmer()
tokenizer = RegexpTokenizer(r'\w+')

We have constructed a lengthy and chaotic function ```preprocess_text```, which (believe us) does the following, in order: <br>
- [x] Replaces runs of whitespace and digits with a single space ```re.sub(r"[\s\d]+", " ", word)```
- [x] Removes LaTeX equations ```re.sub(r"(\${1,2})(?:(?!\1)[\s\S])*\1", ... ```
- [x] Tokenizes the text ``` tokenizer.tokenize(... ```
- [x] Drops words of 2 characters or fewer ``` if len(word) > 2 ```
- [x] Drops "special" words we have identified as useless ```word not in [ ... ] ```
- [x] Stems the remaining words
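
To make the steps above concrete, here is a small standard-library sketch that applies the same two regular expressions, the same `\w+` tokenization and the same length filter to one sentence. It is an illustration, not the project's function: `clean_without_stemming` and the shortened `BLOCKLIST` are hypothetical names, `re.findall` stands in for the NLTK tokenizer, and stemming is left out so that the example needs no third-party packages.

```python
import re

LATEX_EQUATION = re.compile(r"(\${1,2})(?:(?!\1)[\s\S])*\1")
WHITESPACE_OR_DIGITS = re.compile(r"[\s\d]+")
BLOCKLIST = {"ieee", "doi", "http", "https"}  # shortened, illustrative blocklist


def clean_without_stemming(text):
    # Steps 1-2: applied to each whitespace-separated chunk, as in preprocess_text.
    chunks = [
        LATEX_EQUATION.sub(" ", WHITESPACE_OR_DIGITS.sub(" ", chunk))
        for chunk in text.split()
    ]
    # Step 3: \w+ tokenization.
    tokens = re.findall(r"\w+", " ".join(chunks))
    # Steps 4-5: keep tokens longer than 2 characters that are not blocklisted.
    return [token for token in tokens if len(token) > 2 and token not in BLOCKLIST]


result = clean_without_stemming("The loss $x^2$ drops after 100 epochs, see doi 10.1000")
print(result)  # ['The', 'loss', 'drops', 'after', 'epochs', 'see']
```

Because the equation regex runs on whitespace-separated chunks, it only matches equations that contain no spaces, such as `$x^2$`. An equation like `$alpha + beta$` is split into separate chunks first, so its words survive unless the length filter or the blocklist removes them.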

Finally, we applied CountVectorizer to the output.
- [x] Using max_df=0.8 and min_df=0.01, we removed terms that are too frequent or too rare (present in more than 80% or in fewer than 1% of the articles)
- [x] Using stop_words='english', we removed English stop words
- [x] We extracted word n-grams of 1 to 4 words (ngram_range=(1, 4))
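
The effect of these three settings can be seen on a toy corpus. The sketch below is illustrative only: the documents, the variable names and the thresholds (`ngram_range=(1, 2)`, `min_df=2`, `max_df=0.7`) are assumptions chosen so that the filtering is visible on four short documents, and they differ from the project's values.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = [
    "topic model predicts citation score",
    "topic model with regression",
    "citation score regression",
    "topic model",
]

toy_vectorizer = CountVectorizer(
    ngram_range=(1, 2),    # unigrams and bigrams
    min_df=2,              # drop terms found in fewer than 2 documents
    max_df=0.7,            # drop terms found in more than 70% of documents
    stop_words="english",  # removes "with" before n-grams are built
)
doc_term_matrix = toy_vectorizer.fit_transform(toy_docs)
vocabulary = sorted(toy_vectorizer.vocabulary_)
print(vocabulary)  # ['citation', 'citation score', 'regression', 'score']
```

Here "topic", "model" and "topic model" occur in three of the four documents and are removed by `max_df`, while terms that occur only once, such as "predicts", are removed by `min_df`.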

Do not run the cell below; it is for presentation purposes only.

In [None]:
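import re

from sklearn.feature_extraction.text import CountVectorizer

# Tokens we identified as extraction noise; compared against unstemmed tokens.
USELESS_WORDS = [
    "uni", "uni uni", "uni uni uni", "ieee", "doi", "vextendsingl",
    "http", "https", "vextenddoubl", "parenrightbig", "parenleftbig",
]


def preprocess_text(text: str) -> str:
    # For each whitespace-separated chunk: replace digit runs with a space,
    # then blank out $...$ / $$...$$ LaTeX spans.
    cleaned = " ".join(
        re.sub(r"(\${1,2})(?:(?!\1)[\s\S])*\1", " ", re.sub(r"[\s\d]+", " ", chunk))
        for chunk in text.split()
    )
    # Tokenize, stem tokens longer than 2 characters that are not blocklisted,
    # and replace every other token with an empty string.
    return " ".join(
        stemmer.stem(word) if len(word) > 2 and word not in USELESS_WORDS else ""
        for word in tokenizer.tokenize(cleaned)
    )


tf_vectorizer = CountVectorizer(
    ngram_range=(1, 4),
    max_df=0.8,
    min_df=0.01,
    tokenizer=tokenizer.tokenize,
    stop_words="english",
)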


- Our fitted topic model is available in the model.pkl file

In [4]:
import pickle
with open('model.pkl', 'rb') as f:  # close the file handle after loading
    lda = pickle.load(f)
lda
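
The type of the pickled `lda` object is not shown in this notebook. Assuming it is a scikit-learn `LatentDirichletAllocation`, the sketch below shows on a toy corpus how such a model turns a document-term matrix into per-document topic proportions, which is the kind of topic feature a regressor such as XGBoost could then be trained on. The corpus, the number of topics, the seed and the name `toy_lda` are all assumptions made for this example.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = [
    "neural network training gradient",
    "gradient descent neural network",
    "bayesian inference posterior prior",
    "posterior sampling bayesian prior",
]

counts = CountVectorizer().fit_transform(toy_docs)

# Two topics and a fixed seed are arbitrary choices for this toy corpus.
toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = toy_lda.fit_transform(counts)  # shape: (n_documents, n_topics)

# Each row holds one document's topic proportions and sums to 1.
print(doc_topics.round(2))
```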

### Reasons for choosing methods
some reasons

## Presentation and interpretation of results
Aggregated Profiles plot:
<br>
<img src="https://raw.githubusercontent.com/jzmujdzin/topic-modelling-citation-prediction/main/agg_profiles_static.png">
<br>
Interactive plot (html link):
<a href="agg_profiles.html" target="_blank">here</a>
<br> <br>
Some interpretation for aggregated profiles
<br> <br>
Variable importance plot:
<br>
<img src="https://raw.githubusercontent.com/jzmujdzin/topic-modelling-citation-prediction/main/var_importance_static.png">
<br>
Interactive plot (html link):
<a href="var_importance.html" target="_blank">here</a>
<br> <br>
Some interpretation for var importance

## Conclusions
Verifying the previously stated research questions