In [2]:
import numpy as np
import pandas as pd
import re

## Creating Citation Data

In [3]:
# Google Scholar (GS) papers that cite JFE papers
df_ref = pd.read_excel("data/JFE_GS_DATA.xlsx")

In [None]:
df_ref = df_ref.iloc[:,1:]

In [None]:
df_ref

In [None]:
# Papers published in JFE
jfe = pd.read_excel("data/Published_Papers_JFE.xlsx",index_col=0)

In [None]:
jfe

We can make several types of networks: 

1. Author -> paper network 
    * nodes - authors, papers 
    * edges - authors worked on paper x
    * Implication: collaboration network 

2. Citations between papers 
    * nodes - papers (JFE, GS)
    * edges - citations across papers 
    * Implication: citation network 

* Note: We have a lot of features: 
    * Papers with identified authors 
    * University of authors
    * Country of authors 
    * Date of publication 
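
As a rough illustration of the first option (author -> paper), the sketch below builds a small bipartite graph and projects it onto authors to obtain a collaboration network. The author names, paper titles, and the `kind` attribute are hypothetical placeholders, not values from the JFE or GS data.

```python
import networkx as nx
from networkx.algorithms import bipartite

# Hypothetical (author, paper) pairs; the real data would come from the JFE/GS tables.
authorship = [
    ("Author A", "Paper 1"),
    ("Author B", "Paper 1"),
    ("Author B", "Paper 2"),
]

# Undirected bipartite graph: one node set for authors, one for papers.
B = nx.Graph()
for author, paper in authorship:
    B.add_node(author, kind="author")
    B.add_node(paper, kind="paper")
    B.add_edge(author, paper)

# Project onto authors: two authors are linked if they share at least one paper.
authors = [n for n, d in B.nodes(data=True) if d["kind"] == "author"]
collab = bipartite.projected_graph(B, authors)
print(sorted(collab.edges()))
```

The projection drops the paper nodes, so any paper-level feature (such as date of publication) would need to be carried over separately if it is required.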

Citation Network: 

In [None]:
cit_ntwk = df_ref[['Cleaned Titles', 'cited_paper']].rename(columns={'Cleaned Titles': 'citing_paper'})

In [None]:
cit_ntwk

In [58]:
raw_df = pd.merge(cit_ntwk, jfe.rename(columns={'Title': 'cited_paper'}), 
                     on='cited_paper',how='right').dropna()
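
The right join keeps every JFE paper, and `.dropna()` then removes any row with a missing value in any column, so it also drops rows where an unrelated `jfe` column happens to be empty. A possible way to make the filtering explicit is pandas' `indicator=True` option, sketched here on hypothetical toy frames (`cit_toy`, `jfe_toy`) that only mimic the column names used above.

```python
import pandas as pd

# Hypothetical toy stand-ins for cit_ntwk and jfe.
cit_toy = pd.DataFrame({
    "citing_paper": ["gs paper 1", "gs paper 2"],
    "cited_paper": ["JFE paper A", "JFE paper A"],
})
jfe_toy = pd.DataFrame({
    "Title": ["JFE paper A", "JFE paper B"],
    "Date": ["February 2024", "September 2008"],
})

merged_toy = pd.merge(
    cit_toy,
    jfe_toy.rename(columns={"Title": "cited_paper"}),
    on="cited_paper",
    how="right",
    indicator=True,  # adds a `_merge` column: 'both', 'left_only' or 'right_only'
)

# 'right_only' rows are JFE papers with no citing paper in the GS data.
uncited = merged_toy[merged_toy["_merge"] == "right_only"]
# 'both' rows are actual citation edges.
edges_toy = merged_toy[merged_toy["_merge"] == "both"].drop(columns="_merge")
print(len(edges_toy), len(uncited))
```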

In [59]:
raw_df.to_csv("cit_ntwk_raw.csv")

In [117]:
merged_df = pd.merge(cit_ntwk, jfe.rename(columns={'Title': 'cited_paper'}), 
                     on='cited_paper',how='right').dropna()

In [29]:
merged_df = merged_df[['citing_paper','cited_paper','Date']]

In [30]:
merged_df

Unnamed: 0,citing_paper,cited_paper,Date
0,121413,21704,February 2024
1,183313,21704,February 2024
2,124760,21704,February 2024
3,2829,86470,February 2024
4,72332,86470,February 2024
...,...,...,...
410809,15250,26077,September 2008
410810,116915,26077,September 2008
410811,22517,26077,September 2008
410812,96250,26077,September 2008


In [56]:
merged_df.to_csv("cit_ntwk.csv")

In [31]:
merged_df.groupby('citing_paper').count().sort_values(by = 'cited_paper')

citing_paper,cited_paper,Date
0,1,1
107049,1,1
107047,1,1
107044,1,1
107042,1,1
...,...,...
91331,160,160
106738,184,184
154096,259,259
150942,315,315


In [50]:
merged_df.groupby('citing_paper').count().sort_values(by = 'cited_paper').sum()

cited_paper    410772
Date           410772
dtype: int64

In [51]:
unique_cit = merged_df.groupby('citing_paper').count().sort_values(by = 'cited_paper')

In [56]:
unique_cit[unique_cit.cited_paper >1].sum()

cited_paper    306662
Date           306662
dtype: int64
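
The same counts can be obtained a little more directly with `value_counts`. The sketch below uses a hypothetical toy frame with the same column names as `merged_df`; the variable names `per_citing`, `multi`, and `share` are illustrative only.

```python
import pandas as pd

# Hypothetical toy edge list: p1 cites 2 JFE papers, p2 cites 1, p3 cites 3.
toy = pd.DataFrame({
    "citing_paper": ["p1", "p1", "p2", "p3", "p3", "p3"],
    "cited_paper": ["A", "B", "A", "A", "B", "C"],
})

# Number of JFE papers cited by each citing paper.
per_citing = toy["citing_paper"].value_counts()

# Citing papers with more than one JFE citation, and their share of all edges.
multi = per_citing[per_citing > 1]
share = multi.sum() / per_citing.sum()
print(len(multi), multi.sum(), round(share, 3))
```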

Note that many Google Scholar papers cite more than one JFE paper: citing papers with more than one JFE citation account for 306,662 of the 410,772 citation edges above. This motivates visualizing the citation network, where
* Node: paper 
* Edge: paper-to-paper citation

## NetworkX package

In [5]:
import numpy as np
import pandas as pd
import re
# Import the NetworkX package
import networkx as nx
from tqdm import tqdm

In [7]:
merged_df = pd.read_csv("cit_ntwk_raw.csv",index_col = 0)

In [8]:
df = merged_df[['citing_paper','cited_paper','Date']]

In [25]:
df

Unnamed: 0,citing_paper,cited_paper,Date
0,dissecting mechanisms of financial crises: int...,Delayed crises and slow recoveries,February 2024
1,public liquidity and financial crises,Delayed crises and slow recoveries,February 2024
2,inefficient credit cycles,Delayed crises and slow recoveries,February 2024
3,learning and the capital age premium,Learning about the consumption risk exposure o...,February 2024
4,"investment, uncertainty, and u-shaped return v...",Learning about the consumption risk exposure o...,February 2024
...,...,...,...
410809,kobi̇'lerin finansman sorunları ve çözüm öneri...,Financing patterns around the world: Are small...,September 2008
410810,finanțarea întreprinderilor mici și mijlocii d...,Financing patterns around the world: Are small...,September 2008
410811,relação entre estrutura de financiamento e açõ...,Financing patterns around the world: Are small...,September 2008
410812,中国商业银行综合融资能力测度及影响因素分析,Financing patterns around the world: Are small...,September 2008


In [9]:
# Collect the unique paper titles (nodes) from both columns
unique_nodes = set(df['citing_paper']) | set(df['cited_paper'])

In [24]:
unique_nodes

{'zur vertrauensökonomik: der interbankenmarkt in der krise von 2007-2009',
 'nascent markets: understanding the success and failure of new stock markets',
 'the existence of idiosyncratic risk in reits market',
 'инвестиционные решения компаний в условиях асимметрии информации',
 'unconventional monetary policy,(a) synchronicity and the yield curve',
 'to what extent do environmental rating schemes capture climate goals?',
 'imprecise and informative: lessons from market reactions to imprecise disclosure',
 'the impact of changes of capital cost in enterprise valuation',
 'stock markets during covid-19',
 'georg schreyögg',
 'disentangling shareholder risk aversion from leverage-dependent borrowing cost on corporate policies',
 'reservation return and asset pricing inference',
 'transmission of quantitative easing: the role of central bank reserves',
 "«they'll just go to moody's»: investigating corporate credit rating updates using machine learning techniques",
 'phasing out the gses

In [10]:
paper_nodes = pd.DataFrame(unique_nodes, columns = ['Title'])
paper_nodes['name'] = ["node"+str(i) for i in range(len(paper_nodes))]
paper_nodes

Unnamed: 0,Title,name
0,zur vertrauensökonomik: der interbankenmarkt i...,node0
1,nascent markets: understanding the success and...,node1
2,the existence of idiosyncratic risk in reits m...,node2
3,инвестиционные решения компаний в условиях аси...,node3
4,"unconventional monetary policy,(a) synchronici...",node4
...,...,...
186539,the rise of conscious consumers: the cash flow...,node186539
186540,an overview of the literature on upper echelons,node186540
186541,sustainability reporting and performance of me...,node186541
186542,women in monitoring positions and market risk....,node186542


In [12]:
# Create a directed graph G
G = nx.DiGraph()

In [13]:
# Collect the unique paper titles (nodes) from both columns
unique_nodes = set(merged_df['citing_paper']) | set(merged_df['cited_paper'])

# Add nodes, labelling each 'GS' if it appears as a citing paper, else 'JFE'
with tqdm(total=len(unique_nodes)) as pbar:
    for node in unique_nodes:
        G.add_node(node, label='GS' if node in merged_df['citing_paper'].values else 'JFE')
        pbar.update(1)


100%|██████████████████████████████████████████████████████████████████████████| 186544/186544 [37:46<00:00, 82.29it/s]


In [14]:
# Add edges from DataFrame
with tqdm(total=len(merged_df)) as pbar:
    for _, row in merged_df.iterrows():
        G.add_edge(row['citing_paper'], row['cited_paper'])
        pbar.update(1)

100%|███████████████████████████████████████████████████████████████████████| 410772/410772 [00:38<00:00, 10656.46it/s]
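
The node loop above is slow mainly because `node in merged_df['citing_paper'].values` scans the whole array for every node. A possible faster variant, sketched on a hypothetical toy edge list with the same column names, builds the graph directly from the DataFrame and uses a Python `set` for the membership test. `G_fast`, `citing`, and `labels` are illustrative names, not part of the original notebook.

```python
import networkx as nx
import pandas as pd

# Hypothetical toy edge list with the same column names as merged_df.
edges = pd.DataFrame({
    "citing_paper": ["gs paper 1", "gs paper 2", "gs paper 2"],
    "cited_paper": ["JFE paper A", "JFE paper A", "JFE paper B"],
})

# Build the directed graph straight from the edge list (citing -> cited).
G_fast = nx.from_pandas_edgelist(
    edges, source="citing_paper", target="cited_paper", create_using=nx.DiGraph
)

# A set gives constant-time membership tests, unlike `x in series.values`.
citing = set(edges["citing_paper"])
labels = {n: "GS" if n in citing else "JFE" for n in G_fast.nodes}
nx.set_node_attributes(G_fast, labels, name="label")

print(G_fast.number_of_nodes(), G_fast.number_of_edges())
```

As in the original loop, a title that appears in both columns would be labelled `'GS'`.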


In [26]:
G.nodes['Delayed crises and slow recoveries']

{'label': 'JFE'}

In [28]:
# Export the graph to GraphML format
nx.write_graphml(G, "jfe_gs_citation_ntwk.graphml")
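
A quick way to check the export is to read the file back and confirm that direction and node labels survive the round trip. This is a minimal sketch on a hypothetical two-node graph written to a temporary directory, not the full citation network.

```python
import os
import tempfile

import networkx as nx

# Hypothetical toy graph with the same 'label' attribute as above.
G_toy = nx.DiGraph()
G_toy.add_node("gs paper 1", label="GS")
G_toy.add_node("JFE paper A", label="JFE")
G_toy.add_edge("gs paper 1", "JFE paper A")

# Write to a temporary file and read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.graphml")
    nx.write_graphml(G_toy, path)
    G_back = nx.read_graphml(path)

print(G_back.is_directed(), G_back.nodes["JFE paper A"])
```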

---

## Further Improvements

Citation network: 

* Nodes - papers (JFE or GS) 
* Edges - citation (directed)

Use only the GS papers whose authors have been identified.

Should we also restrict the JFE papers to those with identified authors?
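
One way this restriction could look, assuming a set of GS titles with identified authors is available: keep all JFE nodes plus the identified GS nodes and take the induced subgraph. The set `identified` below is a hypothetical placeholder, since how titles link to the author-match table is not shown here.

```python
import networkx as nx

# Hypothetical toy citation graph with 'label' attributes.
G_toy = nx.DiGraph()
G_toy.add_edge("gs paper 1", "JFE paper A")
G_toy.add_edge("gs paper 2", "JFE paper A")
nx.set_node_attributes(
    G_toy,
    {"gs paper 1": "GS", "gs paper 2": "GS", "JFE paper A": "JFE"},
    name="label",
)

# Hypothetical: GS titles whose authors were matched.
identified = {"gs paper 1"}

# Keep every JFE node, and only the GS nodes with identified authors.
keep = [
    n for n, d in G_toy.nodes(data=True)
    if d["label"] == "JFE" or n in identified
]
G_sub = G_toy.subgraph(keep).copy()
print(sorted(G_sub.nodes()), G_sub.number_of_edges())
```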

In [8]:
gs_matches = pd.read_excel("data/JFE_citing_paper_author_matches.xlsx",index_col=0)

In [9]:
gs_matches

Unnamed: 0,Authors_x,First Name,Middle Name_x,Last Name,Authors_y,University,Source,Middle Name_y,Country
0,A krishnamurthy,Arvind,,Krishnamurthy,Arvind krishnamurthy,stanford university,Abfer,,USA
4,W li,Wendy,C.y.,Li,Wendy c.y. li,Executive director moon economics institute,Cepr,C.y.,Unknown
5,W li,Wendy,C.y.,Li,Wendy c.y. li,Executive director moon economics institute,Cepr,C.y.,Unknown
6,Z li,Zhan,,Li,Zhan li,"Postdoctoral researcher in economics, national...",Cepr,,China
8,K li,Kai,,Li,Kai li,university of british columbia,Abfer,,Canada
...,...,...,...,...,...,...,...,...,...
152487,C eckel,Carsten,,Eckel,Carsten eckel,Professor of economics bibliothek wirtscharfts...,Cepr,,Germany
152525,S winston smith,Stanley,D,Smith,Stanley d smith,university of central florida,Afa,D,USA
152538,R stehrer,Robert,,Stehrer,Robert stehrer,Scientific director the vienna institute for i...,Cepr,,Austria
152566,J kren,Janez,,Kren,Janez kren,Doctoral researcher ku leuven,Cepr,,Unknown
