# Related Work recommendation system for research papers

In [40]:
################################################################
# Get the libraries
import pandas as pd
from tqdm import tqdm
import matplotlib.pyplot as plt
import numpy as np
import scienceplots
plt.style.use(['science', 'notebook', 'grid'])


## Get data


Source: https://www.kaggle.com/datasets/nechbamohammed/research-papers-dataset

The dataset comes from the Kaggle platform, which gives little context about how the papers were selected or which time period they cover.

Data quality analysis is an important part of every machine learning project. *Before starting to analyse the data*, we want to make sure that the dataset reflects reality, meaning that it is:
1. Valid
2. Reliable

This is why we build a data dictionary: it lets us define data quality (DQ) rules, think about the expected data types and list the coherence rules to check. This way, we make sure that we work with data that reflect reality and that we know how to exploit and interpret them adequately.

NB: Even though data quality is measured differently for tabular numerical data and for textual data, there are certain "common sense" rules we should be able to confirm before modelling.



In [11]:
df = pd.read_csv("./data/dblp-v10.csv")

df.shape

(1000000, 8)

| col_name | description | dtype | value rules | DQ rules | example value |
|----------|-------------|-------|-------------|----------|---------------|
| abstract | Paper abstract text | string | free format | Always assigned | 'In this paper, a robust 3D triangular mesh watermarking algorithm based on 3D segmentation is proposed. In this algorithm three classes of watermarking are combined. First, we segment the original image to many different regions. Then we mark every type of region with the corresponding algorithm based on their curvature value. The experiments show that our watermarking is robust against numerous attacks including RST transformations, smoothing, additive random noise, cropping, simplification and remeshing.' |
| authors | List of authors of the paper | string | ['Name Lastname1', 'Name Lastname2', ...] | Always assigned, same name always in the same format | "['S. Ben Jabra', 'Ezzeddine Zagrouba']" |
| n_citation | Number of citations at the sampling date | integer | non-negative | Always assigned | 50 |
| references | List of paper ids of all papers referenced in the paper | string | ['paperid1', 'paperid2', ...] | Always assigned | "['09cb2d7d-47d1-4a85-bfe5-faa8221e644b', '10aa16da-3cc8-4af6-9d66-48037e915d76', '35cb45c3-9408-4096-ab30-bc2e4de3fb5d', '661a342e-a911-4420-b67d-51c75d3b14e9', '779553f3-e4c1-456e-bc01-5eb9d9567541', 'b24ba5c0-fee8-4a3e-9330-17f6564856cd', 'fd1c676d-1296-4f19-89b4-17c7ecd270f3']" |
| title | Official title of the paper | string | free format | Always assigned | 'A new approach of 3D watermarking based on image segmentation' |
| venue | Name of the conference or journal where the paper was published | string | free format | Not always assigned | 'international symposium on computers and communications' |
| year | Year of publishing | integer | 1900 <= year <= 2025 | Always assigned | 2008 |
| id | Paper identifier | string | Unique | Always assigned | 4ab3735c-80f1-472d-b953-fa0557fed28b |
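The data dictionary can also be written down as code, so that the same rules are checked automatically every time the dataset is reloaded. The sketch below is only an illustration: the `DATA_DICTIONARY` structure, the `check_against_dictionary` helper and the toy DataFrame are assumptions made for this example, not part of the dataset or of pandas.

```python
import pandas as pd

# Hypothetical machine-readable version of the data dictionary.
# "kind" is the expected type family, "nullable" mirrors the DQ rule column.
DATA_DICTIONARY = {
    "abstract": {"kind": "string", "nullable": False},
    "authors": {"kind": "string", "nullable": False},
    "n_citation": {"kind": "integer", "nullable": False},
    "references": {"kind": "string", "nullable": False},
    "title": {"kind": "string", "nullable": False},
    "venue": {"kind": "string", "nullable": True},
    "year": {"kind": "integer", "nullable": False},
    "id": {"kind": "string", "nullable": False, "unique": True},
}


def check_against_dictionary(df: pd.DataFrame, dictionary: dict) -> list:
    """Return a list of human-readable rule violations (empty if none)."""
    problems = []
    for col, rule in dictionary.items():
        if col not in df.columns:
            problems.append(f"{col}: column is missing")
            continue
        series = df[col]
        if rule["kind"] == "integer" and not pd.api.types.is_integer_dtype(series):
            problems.append(f"{col}: expected an integer dtype, got {series.dtype}")
        if not rule["nullable"] and series.isna().any():
            problems.append(f"{col}: contains missing values")
        if rule.get("unique") and series.duplicated().any():
            problems.append(f"{col}: contains duplicated values")
    return problems


# Toy frame standing in for the real dataset (values are made up)
toy = pd.DataFrame({
    "abstract": ["Some abstract", None],
    "authors": ["['A. Author']", "['B. Author']"],
    "n_citation": [50, 0],
    "references": ["['id-1']", "['id-2']"],
    "title": ["Paper one", "Paper two"],
    "venue": ["Some venue", None],
    "year": [2008, 2015],
    "id": ["x", "x"],
})

problems = check_against_dictionary(toy, DATA_DICTIONARY)
for problem in problems:
    print(problem)
```

On the toy frame, the missing abstract and the repeated id are reported, while the missing venue is accepted because that column is declared nullable.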


In [15]:
df.head()

Unnamed: 0,abstract,authors,n_citation,references,title,venue,year,id
0,"In this paper, a robust 3D triangular mesh wat...","['S. Ben Jabra', 'Ezzeddine Zagrouba']",50,"['09cb2d7d-47d1-4a85-bfe5-faa8221e644b', '10aa...",A new approach of 3D watermarking based on ima...,international symposium on computers and commu...,2008,4ab3735c-80f1-472d-b953-fa0557fed28b
1,We studied an autoassociative neural network w...,"['Joaquín J. Torres', 'Jesús M. Cortés', 'Joaq...",50,"['4017c9d2-9845-4ad2-ad5b-ba65523727c5', 'b118...",Attractor neural networks with activity-depend...,Neurocomputing,2007,4ab39729-af77-46f7-a662-16984fb9c1db
2,It is well-known that Sturmian sequences are t...,"['Genevi eve Paquin', 'Laurent Vuillon']",50,"['1c655ee2-067d-4bc4-b8cc-bc779e9a7f10', '2e4e...",A characterization of balanced episturmian seq...,Electronic Journal of Combinatorics,2007,4ab3a4cf-1d96-4ce5-ab6f-b3e19fc260de
3,One of the fundamental challenges of recognizi...,"['Yaser Sheikh', 'Mumtaz Sheikh', 'Mubarak Shah']",221,"['056116c1-9e7a-4f9b-a918-44eb199e67d6', '05ac...",Exploring the space of a human action,international conference on computer vision,2005,4ab3a98c-3620-47ec-b578-884ecf4a6206
4,This paper generalizes previous optimal upper ...,"['Efraim Laksman', 'Håkan Lennerstad', 'Magnus...",0,"['01a765b8-0cb3-495c-996f-29c36756b435', '5dbc...",Generalized upper bounds on the minimum distan...,Ima Journal of Mathematical Control and Inform...,2015,4ab3b585-82b4-4207-91dd-b6bce7e27c4e


# 1. Data quality

In this section we confirm (or refute) the rules we assumed in the data dictionary above, and we take the appropriate action after each check.

## 1.1. NaN check

- Where is data missing, and is that a problem for us?


In [30]:
def check_for_nans(df: 'pd.DataFrame', col: str) -> float:    
    nan_percentage = df[col].isna().mean() * 100
    
    if nan_percentage == 0:
        print(f"There are no NaN values in the column '{col}'.")
    else:
        print(f"The column '{col}' contains {nan_percentage:.2f}% NaN values.")
    
    return nan_percentage

In [31]:
for col in df.columns:
    df[col] = df[col].replace(["", " ", "NA", "null", "N/A", 'Nan'], pd.NA)
    check_for_nans(df, col)

The column 'abstract' contains 17.25% NaN values.
The column 'authors' contains 0.00% NaN values.
There are no NaN values in the column 'n_citation'.
The column 'references' contains 12.44% NaN values.
There are no NaN values in the column 'title'.
The column 'venue' contains 17.78% NaN values.
There are no NaN values in the column 'year'.
There are no NaN values in the column 'id'.
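Because the percentages are rounded to two decimals, a column with only a handful of missing values (such as 'authors', reported as 0.00%) can look clean although it is not. A possible complement, sketched here on a toy DataFrame, is to report absolute counts next to the percentages. The helper name `nan_summary` is an assumption made for this example.

```python
import pandas as pd


def nan_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column, worst column first."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "n_missing": counts,
        "pct_missing": counts / len(df) * 100,
    })
    return summary.sort_values("n_missing", ascending=False)


# Toy frame standing in for the real dataset (values are made up)
toy = pd.DataFrame({
    "abstract": ["text", None, None, "text"],
    "venue": ["a", "b", None, "d"],
    "year": [2001, 2002, 2003, 2004],
})

summary = nan_summary(toy)
print(summary)
```

The vectorised `isna().sum()` call handles all columns at once, so no explicit loop over the columns is needed.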


In [41]:
df[df['abstract'].isna()]

Unnamed: 0,abstract,authors,n_citation,references,title,venue,year,id
9,,"['Zhanjun Bai', 'Xing Zhou', 'Ralph Mason']",3,"['54f270aa-ce44-4ece-a2ca-c63a9f266cb3', '638c...",A novel Injection Locked Rotary Traveling Wave...,international symposium on circuits and systems,2014,4ab439a4-9379-44f5-b98b-87125ae7366e
36,,['Ruiz-Huerta'],50,,The Programmable Compiler,IEEE Computer,1983,4ab689ab-506e-457c-b0b2-192829c34035
74,,"['Shin Ya Abe', 'Youhua Shi', 'Kimiyoshi Usami...",0,"['04159258-4dfc-45f4-8ee8-5709e9700049', '09f5...",Floorplan driven architecture and high-level s...,IEICE Transactions on Fundamentals of Electron...,2013,4ab90243-f09f-4a20-bfe8-37e199ff6c95
77,,"['Willard L. Eastman', 'Shimon Even']",0,['bddb9051-acc8-4eb0-b770-6fbb8ca5871f'],Some further results on synchronizable block c...,IEEE Transactions on Information Theory,1966,4ab93ae9-278c-468d-8077-371bfedb576a
97,,"['Chulhoon Jang', 'Chansoo Kim', 'Dongchul Kim...",21,"['5384be17-46f4-412b-9c95-834d90d83297', '971d...",Multiple exposure images based traffic light r...,,2014,4aba3206-6232-4aa9-9e89-d178da94e865
...,...,...,...,...,...,...,...,...
999983,,"['Chao Wang', 'Yizhong Yuan', 'Xiaohui Tian']",0,"['03208590-7f63-4a9c-be3b-89afc2ce58a1', '7b42...",Assessment of range‐separated exchange functio...,Journal of Computational Chemistry,2017,fd14f60b-9577-4461-824c-57611090cd02
999984,,"['Oliver Kroemer', 'Jan Peters']",0,"['01f07b38-7038-4ac1-b9b5-4b79a13f307b', '0a15...",A Comparison of Autoregressive Hidden Markov M...,international conference on robotics and autom...,2017,fd256ca3-41df-40c2-80a8-a3286d4b982f
999986,,"['Xian-He Sun', 'Yuhang Liu']",0,"['26031e0e-3b83-4ab8-bcda-e1a342814b70', '6666...",Utilizing Concurrency: A New Theory for Memory...,languages and compilers for parallel computing,2016,fd6bbc97-1107-4857-86f0-4c1a5aff8a4c
999988,,"['Prabhakar Dixit', 'Joos C. A. M. Buijs', 'Wi...",0,"['00c59fef-26e6-4c07-8388-9784a05306a3', '0a0e...",Using Domain Knowledge to Enhance Process Mini...,,2015,fdf08a4d-1002-405f-b0e9-4c9df4632ff9


In [51]:
df = df.dropna(subset=['abstract'])
df[df['abstract'].isna()]

Unnamed: 0,abstract,authors,n_citation,references,title,venue,year,id


In [52]:
df[df['authors'].isna()]

Unnamed: 0,abstract,authors,n_citation,references,title,venue,year,id
594452,This paper proposes a new intra-mode decision ...,,1,"['1a6ecea3-bad3-4139-8c15-9a35247b8be4', '93cf...",An efficient intra-mode decision method for HEVC,"Signal, Image and Video Processing",2016,9c4cf6a4-3d7a-4892-9acd-dc30336c73f1


In [53]:
df = df.dropna(subset=['authors'])
df[df['authors'].isna()]

Unnamed: 0,abstract,authors,n_citation,references,title,venue,year,id


In [54]:
df[df['references'].isna()].shape

(40309, 8)

To do: justify the decision to drop rows, and decide whether the rows with missing references should be dropped as well.
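One way to inform this decision is to parse the stringified lists and treat a missing `references` value as an empty list, so that those papers stay in the dataset with zero outgoing references. This is only a sketch of that option, not a decision taken in this notebook. The helper `parse_list_column` and the toy data are assumptions, and `ast.literal_eval` will raise an error on malformed strings.

```python
import ast

import pandas as pd


def parse_list_column(value):
    """Turn a stringified list such as "['a', 'b']" into a Python list.

    Missing values (None, NaN, pd.NA) become an empty list.
    """
    if pd.isna(value):
        return []
    return ast.literal_eval(value)


# Toy frame standing in for the real dataset (ids are made up)
toy = pd.DataFrame({
    "id": ["p1", "p2", "p3"],
    "references": ["['p2', 'p3']", None, "['p1']"],
})

toy["references_list"] = toy["references"].apply(parse_list_column)
toy["n_references"] = toy["references_list"].apply(len)
print(toy[["id", "n_references"]])
```

Comparing `n_references` with `n_citation` or `year` for the affected rows may help to tell whether the missing references are a data collection gap or a genuine property of the papers.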

## 1.2. Value rules
- year format and range
- citation count range and type


In [55]:
def check_nonnegative_integer(df: 'pd.DataFrame', col: str) -> float:
    """Print and return the percentage of values in `col` that are non-negative integers."""
    # Accept both Python ints and NumPy integer scalars
    positive_int_percentage = (df[col].apply(lambda x: isinstance(x, (int, np.integer)) and x >= 0).mean()) * 100
    
    if positive_int_percentage == 100:
        print(f"All values in the column '{col}' are non negative  integers.")
    else:
        print(f"{100 - positive_int_percentage:.2f}% of values in the column '{col}' are not non-negative integers.")
    
    return positive_int_percentage


In [56]:
def check_int_1900_to_2025(df: 'pd.DataFrame', col: str) -> float:
    """Print and return the percentage of values in `col` that are integers between 1900 and 2025."""
    # Accept both Python ints and NumPy integer scalars
    int_1900_to_2025_percentage = (df[col].apply(lambda x: isinstance(x, (int, np.integer)) and 1900 <= x <= 2025).mean()) * 100
    
    if int_1900_to_2025_percentage == 100:
        print(f"All values in the column '{col}' are real numbers between 1900 and 2025.")
    else:
        print(f"{100 - int_1900_to_2025_percentage:.2f}% of values in the column '{col}' are not real numbers between 1900 and 2025.")
        
    return int_1900_to_2025_percentage

In [39]:
def validate_columns(df):
    validation_rules = {
        "year": check_int_1900_to_2025,
        "n_citation": check_nonnegative_integer
    }
    
    for col, func in validation_rules.items():
        if col in df.columns:  
            func(df, col) 
        else:
            print(f'The column {col} does not exist in the dataframe!')
            
validate_columns(df)

All values in the column 'year' are real numbers between 1900 and 2025.
All values in the column 'n_citation' are non negative  integers.


## 1.3. Uniqueness check
- unique id column
- unique title column
- no duplicated rows

In [58]:
df.duplicated().sum()

0

In [59]:
df['id'].duplicated().sum()

0

In [60]:
df['title'].duplicated().sum()

629

To do: decide whether rows with duplicated titles should be dropped, given that their ids are unique.
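To support this decision, the rows sharing a title can be inspected side by side: two papers with the same title but different authors or years are probably distinct works (for example generic titles such as "Editorial"), whereas rows that also share authors and year are more likely true duplicates. The sketch below illustrates the idea on a toy DataFrame. The grouping criteria are an assumption, not a rule established by this notebook.

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are made up)
toy = pd.DataFrame({
    "id": ["a", "b", "c", "d", "e"],
    "title": ["Editorial", "Editorial", "Deep nets", "Deep nets", "Unique paper"],
    "authors": ["['X']", "['Y']", "['Z']", "['Z']", "['W']"],
    "year": [2001, 2005, 2010, 2010, 2012],
})

# All rows whose title appears more than once, sorted for visual comparison
same_title = toy[toy["title"].duplicated(keep=False)].sort_values("title")
print(same_title)

# Rows that repeat an earlier row on title, authors and year at the same time
likely_true_duplicates = toy.duplicated(subset=["title", "authors", "year"], keep="first")

n_same_title = int(toy["title"].duplicated().sum())
n_likely_true = int(likely_true_duplicates.sum())
print(f"{n_same_title} repeated titles, {n_likely_true} likely true duplicates")
```

If only the likely true duplicates are removed, `toy[~likely_true_duplicates]` keeps the first occurrence of each of them.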

## Save cleaned dataset

In [62]:
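def save_df_to_csv(df: pd.DataFrame, path: str, filename: str) -> None:
    """Save `df` as a CSV file named `filename` inside the existing directory `path`."""
    if not filename.endswith('.csv'):
        raise ValueError("Filename must have a .csv extension !")
    
    full_path = f"{path}/{filename}"
    # index=False: the row index carries no information, so it is not written
    df.to_csv(full_path, index=False)
    print(f"DataFrame saved as {full_path}")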

In [63]:
df.shape

(827532, 8)

In [65]:
save_df_to_csv(df, './data/cleaned', 'research_papers_cleaned.csv')

DataFrame saved as ./data/cleaned/research_papers_cleaned.csv


# 2. Exploratory Data Analysis (EDA)