# Citation

For citations, we should have the papers we already hold cite each other. That way every citation connects two papers that exist in our tables, which gives us a graph. If we used the existing (real) citations instead, the cited papers would not necessarily be in our tables, so we would not find those relationships there.

- Another point to note: a paper would normally cite papers from a related field, not from a random one. E.g., it doesn't make sense for an NLP paper to cite a quantum computing paper.

Therefore, we need papers for each of the keywords that I provided to Konok.
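
As a rough illustration of the related-field idea, the sketch below restricts the candidate pool for a paper to papers whose field is considered related. The `RELATED_FIELDS` map, the `candidate_ids` helper, and the toy paper IDs are all hypothetical; only the `paperId` and `field_query` column names come from the data.

```python
import numpy as np
import pandas as pd

# Hypothetical relatedness map (an assumption, not part of the dataset):
# which field_query values a paper from a given field may cite.
RELATED_FIELDS = {
    "NLP": ["NLP", "LLM", "Graph Database"],
    "LLM": ["LLM", "NLP"],
    "Graph Database": ["Graph Database", "NLP"],
    "Quantum Computing": ["Quantum Computing"],
}

# Toy stand-in for the papers table.
toy = pd.DataFrame({
    "paperId": ["p1", "p2", "p3", "p4", "p5"],
    "field_query": ["NLP", "LLM", "Quantum Computing", "Graph Database", "NLP"],
})

def candidate_ids(papers, paper_id):
    """Return IDs of the other papers whose field is related to paper_id's field."""
    field = papers.loc[papers["paperId"] == paper_id, "field_query"].iloc[0]
    allowed = RELATED_FIELDS.get(field, [field])
    mask = papers["field_query"].isin(allowed) & (papers["paperId"] != paper_id)
    return papers.loc[mask, "paperId"].tolist()

rng = np.random.default_rng(0)
pool = candidate_ids(toy, "p1")
# Draw distinct citations from the related pool only.
cited = rng.choice(pool, size=min(2, len(pool)), replace=False).tolist()
print(pool, cited)
```

With this toy map, the quantum computing paper `p3` never enters the pool for the NLP paper `p1`.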

In [1]:
import pandas as pd
import numpy as np
import random
np.random.seed(0)

In [2]:
# Data

df_papers = pd.read_csv('../konok_data/papers_data.csv')
df_papers.head()

| | field_query | paperId | title | url | venue | publicationTypes | abstract | year | citationCount | journal_name | journal_volume | journal_pages |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NLP | 29ddc1f43f28af7c846515e32cc167bc66886d0c | Parameter-Efficient Transfer Learning for NLP | https://www.semanticscholar.org/paper/29ddc1f4... | International Conference on Machine Learning | ['JournalArticle', 'Conference'] | Fine-tuning large pre-trained models is an eff... | 2019 | 2453 | ArXiv | abs/1902.00751 | |
| 1 | NLP | 58ed1fbaabe027345f7bb3a6312d41c5aac63e22 | Retrieval-Augmented Generation for Knowledge-I... | https://www.semanticscholar.org/paper/58ed1fba... | Neural Information Processing Systems | ['JournalArticle'] | Large pre-trained language models have been sh... | 2020 | 1686 | ArXiv | abs/2005.11401 | |
| 2 | NLP | d6a083dad7114f3a39adc65c09bfbb6cf3fee9ea | Energy and Policy Considerations for Deep Lear... | https://www.semanticscholar.org/paper/d6a083da... | Annual Meeting of the Association for Computat... | ['JournalArticle', 'Conference'] | Recent progress in hardware and methodology fo... | 2019 | 2113 | ArXiv | abs/1906.02243 | |
| 3 | NLP | 06d7cb8c8816360feb33c3367073e0ef66d7d0b0 | Super-NaturalInstructions: Generalization via ... | https://www.semanticscholar.org/paper/06d7cb8c... | Conference on Empirical Methods in Natural Lan... | ['JournalArticle', 'Conference'] | How well can NLP models generalize to a variet... | 2022 | 432 | | | 5085-5109 |
| 4 | NLP | 5471114e37448bea2457b74894b1ecb92bbcfdf6 | From Pretraining Data to Language Models to Do... | https://www.semanticscholar.org/paper/5471114e... | Annual Meeting of the Association for Computat... | ['JournalArticle', 'Conference'] | Language models (LMs) are pretrained on divers... | 2023 | 72 | ArXiv | abs/2305.08283 | |


In [3]:
df_papers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1200 entries, 0 to 1199
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   field_query       1200 non-null   object
 1   paperId           1200 non-null   object
 2   title             1200 non-null   object
 3   url               1200 non-null   object
 4   venue             1122 non-null   object
 5   publicationTypes  1034 non-null   object
 6   abstract          892 non-null    object
 7   year              1200 non-null   int64 
 8   citationCount     1200 non-null   int64 
 9   journal_name      1038 non-null   object
 10  journal_volume    776 non-null    object
 11  journal_pages     645 non-null    object
dtypes: int64(2), object(10)
memory usage: 112.6+ KB


In [4]:
len(df_papers)

1200

In [5]:
df_papers['field_query'].value_counts()

NLP                  100
Machine Learning     100
LLM                  100
Deep Learning        100
Quantum Computing    100
Graph Database       100
Data Management      100
Data Modeling        100
 Big Data            100
Data Processing      100
Data Storage         100
Data Querying        100
Name: field_query, dtype: int64
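
The ` Big Data` label in the output above appears to carry a leading space. If field labels are ever used for grouping or filtering, it may be worth normalising them first. A minimal sketch on a toy Series (the values below are made up for illustration; only the `field_query` name comes from the data):

```python
import pandas as pd

# Toy stand-in for df_papers['field_query']; the leading space mirrors ' Big Data'.
fields = pd.Series(["NLP", " Big Data", "Big Data", "LLM"], name="field_query")

# Strip surrounding whitespace so ' Big Data' and 'Big Data' count as one label.
fields_clean = fields.str.strip()

print(fields_clean.value_counts())
```

On the real table the same idea would be `df_papers['field_query'].str.strip()`, assuming the column contains only strings.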

Given an NLP paper, it may cite a graph database paper, but probably not a quantum computing one. However, since the keywords we now have to use are general and similar to each other, let's simply let the papers cite one another regardless of field.

Our data source will not necessarily have all the citations for a particular paper. So, let's pick a random number for each paper and use only that many paper IDs to cite it.

Make sure that a paper's own ID is not used to cite it.

In [6]:
# Take an explicit copy so that adding columns later does not raise a SettingWithCopyWarning
df_papers_new = df_papers[['paperId']].copy()

In [7]:
# The upper bound of np.random.randint is exclusive, so this draws values from 15 to 50
random_numbers = np.random.randint(15, 51, size=len(df_papers))
len(random_numbers)

1200

In [8]:
df_papers_new['random_cites'] = random_numbers
df_papers_new.head()


| | paperId | random_cites |
|---|---|---|
| 0 | 29ddc1f43f28af7c846515e32cc167bc66886d0c | 15 |
| 1 | 58ed1fbaabe027345f7bb3a6312d41c5aac63e22 | 18 |
| 2 | d6a083dad7114f3a39adc65c09bfbb6cf3fee9ea | 18 |
| 3 | 06d7cb8c8816360feb33c3367073e0ef66d7d0b0 | 24 |
| 4 | 5471114e37448bea2457b74894b1ecb92bbcfdf6 | 34 |


Now, for each paper, pick as many other paper IDs as the number in the random_cites column and use them as its citations. A paper must NOT cite itself.

In [9]:
# All unique paper IDs that can be drawn as citations
all_paper_ids = df_papers_new['paperId'].unique()

# Generate the citation list for one row
def generate_citations(row):
    paper_id = row['paperId']
    cite_number = row['random_cites']

    # Exclude the current paper's own ID so that it cannot cite itself
    other_paper_ids = all_paper_ids[all_paper_ids != paper_id]

    # Randomly select cite_number distinct IDs (replace=False prevents repeats)
    citations = list(np.random.choice(other_paper_ids, cite_number, replace=False))

    return citations

# Apply the function to each row to generate the citations
df_papers_new['Citations'] = df_papers_new.apply(generate_citations, axis=1)

df_papers_new.head()


| | paperId | random_cites | Citations |
|---|---|---|---|
| 0 | 29ddc1f43f28af7c846515e32cc167bc66886d0c | 15 | [5c45a5d05ac564adb67811eeb9d41d6460c70135, 63a... |
| 1 | 58ed1fbaabe027345f7bb3a6312d41c5aac63e22 | 18 | [7676c02ea839ff1ceb6e5e1427c42bc45e169bde, ce2... |
| 2 | d6a083dad7114f3a39adc65c09bfbb6cf3fee9ea | 18 | [9d6aa5247b9919a86f174e918107c234c548274d, 13c... |
| 3 | 06d7cb8c8816360feb33c3367073e0ef66d7d0b0 | 24 | [90aca7b4cdd728b28b2fb5b4dc3ae3e37daa5b94, ed9... |
| 4 | 5471114e37448bea2457b74894b1ecb92bbcfdf6 | 34 | [ca0e479ba2327f71e842d033b6b48b082962cc6a, dca... |


In [10]:
# Look up the citation list of the row labelled 1 with a single .loc call
df_papers_new.loc[1, 'Citations']

['7676c02ea839ff1ceb6e5e1427c42bc45e169bde',
 'ce2d5b5856bb6c9ab5c2390eb8b180c75a162055',
 'c0aec04ee86c0724d61c976f19590fbe9c615723',
 '9d6aa5247b9919a86f174e918107c234c548274d',
 'f295157f37cfb43cd8d8d2690ea124edc5ea59c2',
 '4e58100b319d74f97ed550a4e5fa32dea8c06fe1',
 'b3dbe1b460a3b66df1653508c9eed7dd51dee4d2',
 '63adc1e5086481e36b19b62707a96b799da51e59',
 '7171a0e9b07ebc98a32eb912262613efc20f283a',
 '752604994a7ca548ff2954114fc61a501d857b1c',
 'd6e1e4f0ad898ca6ac37e6e139a77fa3982170d4',
 '4f8d648c52edf74e41b0996128aa536e13cc7e82',
 'e449b9b3fe04fe260731a3c74d2123bf6eaadf5b',
 '52a6695ae1c08cc29baf764dedb5831c7a954214',
 '41c93960a066876d5e4f1dacaef75cd8daa2791f',
 '375125029b085e70a109491656b69aa01bc2a166',
 '4bd3c9e1bb1ca2df62b66201616b8740300efd0a',
 'f72d3f58ff73353978e224af348448b34d27cf7b']
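
Before dropping any columns, it may be worth checking the rules stated earlier: no paper cites itself, no ID repeats within a list, and every cited ID exists in the dataset. The sketch below runs these checks on a toy frame with made-up IDs; it assumes the same `paperId` and `Citations` column layout as `df_papers_new`.

```python
import pandas as pd

# Toy stand-in for df_papers_new (paperId plus a list-valued Citations column).
toy = pd.DataFrame({
    "paperId": ["a", "b", "c"],
    "Citations": [["b", "c"], ["a"], ["a", "b"]],
})

known_ids = set(toy["paperId"])

# A paper must not appear in its own citation list.
self_cites = toy.apply(lambda row: row["paperId"] in row["Citations"], axis=1)

# Sampling with replace=False should leave no repeated IDs within a list.
has_duplicates = toy["Citations"].apply(lambda ids: len(ids) != len(set(ids)))

# Every cited ID should refer to a paper that exists in the dataset.
unknown_refs = toy["Citations"].apply(lambda ids: not set(ids) <= known_ids)

print(self_cites.any(), has_duplicates.any(), unknown_refs.any())
```

All three flags should be false for a valid citation table; any true value points at the rows that need to be regenerated.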

In [11]:
# delete random_cites column

df_papers_new = df_papers_new.drop(columns=['random_cites'])


In [12]:
df_papers_new.to_csv('../aryan_data/citations_info.csv', index=False)
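
For loading into a graph system, a long edge-list layout (one row per citation link) is often easier to work with than a list column. Note also that a list column written with `to_csv` is read back as a plain string, so it would need parsing on reload. The sketch below uses `DataFrame.explode` on a toy frame; the `linkedPaperId` column name is hypothetical, and whether each link points from or to `paperId` depends on how the `Citations` column is interpreted.

```python
import pandas as pd

# Toy stand-in for df_papers_new after random_cites has been dropped.
toy = pd.DataFrame({
    "paperId": ["a", "b", "c"],
    "Citations": [["b", "c"], ["a"], ["a", "b"]],
})

# One row per (paper, linked paper) pair; 'linkedPaperId' is a made-up name.
edges = (
    toy.explode("Citations")
       .rename(columns={"Citations": "linkedPaperId"})
       .reset_index(drop=True)
)

print(edges)
```

The resulting two-column table can be saved with `edges.to_csv(..., index=False)` and read back without any list parsing.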