# <a id='toc1_'></a>[intermediary tables](#toc0_)

This notebook creates four intermediary (junction) tables for the neuropapers_db database:

    pub_res: pub_id & res_id
    pub_aff: pub_id & aff_id
    jrn_res: journal_id & res_id
    aff_res: aff_id & res_id

Each table resolves one of the many-to-many relationships in neuropapers_db by pairing the keys of the two tables it links.
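
As a rough illustration of the pattern repeated in every section of this notebook, the sketch below builds a junction table from a list column. The frames `toy_pub`, `toy_res` and `toy_pub_res` and their contents are made up for the example; they are not part of neuropapers_db.

```python
import pandas as pd

# Toy stand-ins for publications.csv and researchers.csv (hypothetical data)
toy_pub = pd.DataFrame({
    "pub_id": [1, 2],
    "authors": [["Wu, A", "Zhang, J"], ["Wu, A"]],
})
toy_res = pd.DataFrame({
    "res_id": [10, 11],
    "researcher": ["Wu, A", "Zhang, J"],
})

# One row per (publication, author) pair
pairs = toy_pub.explode("authors")

# Swap the author name for its res_id and keep only the two keys
toy_pub_res = (
    pairs.merge(toy_res, left_on="authors", right_on="researcher", how="inner")
         [["pub_id", "res_id"]]
         .drop_duplicates()
         .reset_index(drop=True)
)
print(toy_pub_res)
```

The result has one row per link, so a researcher with several papers (or a paper with several researchers) simply appears in several rows.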

**Table of contents**<a id='toc0_'></a>    
- [intermediary tables](#toc1_)    
    - [Import tables](#toc1_1_1_)    
  - [pub-res](#toc1_2_)    
  - [pub-aff](#toc1_3_)    
  - [jrn-res](#toc1_4_)    
  - [aff-res](#toc1_5_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

In [1]:
# Libraries
import pandas as pd

# Abstract Syntax Trees
import ast

# Ignore Warnings
import warnings
warnings.filterwarnings('ignore')

### <a id='toc1_1_1_'></a>[Import tables](#toc0_)

In [2]:
pub = pd.read_csv('../data/neuropapers_db/publications.csv')
res = pd.read_csv('../data/neuropapers_db/researchers.csv')
aff = pd.read_csv('../data/neuropapers_db/affiliations.csv')
jrn = pd.read_csv('../data/neuropapers_db/journals.csv')

In [3]:
display(pub.columns, res.columns, aff.columns, jrn.columns)

Index(['pub_id', 'journal_id', 'last_revision', 'volume', 'title', 'pages',
       'DOI', 'authors', 'journal', 'abstract', 'abstract_words', 'keywords',
       'terms', 'pub_type', 'citation', 'publication_year', 'pub_date'],
      dtype='object')

Index(['res_id', 'researcher', 'surname', 'name', 'gender', 'prob'], dtype='object')

Index(['affiliation_names', 'aff_id', 'longitude', 'latitude'], dtype='object')

Index(['journal_id', 'country_id', 'journal_name', 'num_articles', 'region',
       'publisher', 'coverage', 'categories', 'areas', 'sjr_2010',
       'quartile_2010', 'hindex_2010', 'docs_2010', 'sjr_2011',
       'quartile_2011', 'hindex_2011', 'docs_2011', 'sjr_2012',
       'quartile_2012', 'hindex_2012', 'docs_2012', 'sjr_2013',
       'quartile_2013', 'hindex_2013', 'docs_2013', 'sjr_2014',
       'quartile_2014', 'hindex_2014', 'docs_2014', 'sjr_2015',
       'quartile_2015', 'hindex_2015', 'docs_2015', 'sjr_2016',
       'quartile_2016', 'hindex_2016', 'docs_2016', 'sjr_2017',
       'quartile_2017', 'hindex_2017', 'docs_2017', 'sjr_2018',
       'quartile_2018', 'hindex_2018', 'docs_2018', 'sjr_2019',
       'quartile_2019', 'hindex_2019', 'docs_2019', 'sjr_2020',
       'quartile_2020', 'hindex_2020', 'docs_2020', 'sjr_2021',
       'quartile_2021', 'hindex_2021', 'docs_2021', 'sjr_2022',
       'quartile_2022', 'hindex_2022', 'docs_2022'],
      dtype='object')

## <a id='toc1_2_'></a>[pub-res](#toc0_)

In [4]:
pub.head(2)

Unnamed: 0,pub_id,journal_id,last_revision,volume,title,pages,DOI,authors,journal,abstract,abstract_words,keywords,terms,pub_type,citation,publication_year,pub_date
0,38012702,105,2023-11-29,20.0,"Neuroinflammation, memory, and depression: new...",283,10.1186/s12974-023-02964-x,"['Wu, Anbiao', 'Zhang, Jiyan']",Journal of neuroinflammation,As one of most common and severe mental disord...,"one common severe mental disorders , major dep...","['one', 'common', 'severe', 'mental', 'disorde...","['Hippocampal neurogenesis', 'Major depressive...",Journal Article,J Neuroinflammation. 2023 Nov 27;20(1):283. do...,2023,2023-01-01
1,38012669,105,2023-11-29,20.0,OTUD1 ameliorates cerebral ischemic injury thr...,281,10.1186/s12974-023-02968-7,"['Zheng, Shengnan', 'Li, Yiquan', 'Song, Xiaom...",Journal of neuroinflammation,BACKGROUND: Inflammatory response triggered by...,BACKGROUND : Inflammatory response triggered i...,"['background', 'inflammatory', 'response', 'tr...","['Cerebral ischemic injury', 'Inflammation', '...",Journal Article,J Neuroinflammation. 2023 Nov 27;20(1):281. do...,2023,2023-01-01


In [5]:
# dataframe with [pub_id & authors]; .copy() so the assignment below does not write to a slice of pub
df_01 = pub[['pub_id', 'authors']].copy()

# Transform the strings to lists again
df_01['authors'] = df_01['authors'].apply(ast.literal_eval)

# Create a row per author
df_01 = df_01.explode('authors')
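
The two steps in the cell above can be seen on a single row. This is a minimal sketch: the frame `raw` is built by hand from the first publication shown earlier, and the names `raw`, `parsed` and `exploded` are only for illustration.

```python
import ast
import pandas as pd

# A list column comes back from CSV as its string representation
raw = pd.DataFrame({
    "pub_id": [38012702],
    "authors": ["['Wu, Anbiao', 'Zhang, Jiyan']"],
})

# Work on a copy so the original frame is left untouched
parsed = raw.copy()
parsed["authors"] = parsed["authors"].apply(ast.literal_eval)

# explode repeats pub_id (and the index label) once per list element
exploded = parsed.explode("authors")
print(exploded)
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for strings read from a file.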

In [6]:
res.head(2)

Unnamed: 0,res_id,researcher,surname,name,gender,prob
0,0,"Howe, Charles L",Howe,Charles L,male,1.0
1,1,"Sun, Hong-Shuo",Sun,Hong-Shuo,unknown,0.0


In [7]:
# dataframe with res_id and 'researcher'
df_02 = res[['res_id', 'researcher']]

In [8]:
df_02.head(2)

Unnamed: 0,res_id,researcher
0,0,"Howe, Charles L"
1,1,"Sun, Hong-Shuo"


In [9]:
# add the res_id to the table
display(df_01.shape)
merged_df = pd.merge(df_01, df_02, left_on='authors', right_on='researcher', how='inner')
merged_df.shape

(371874, 2)

(371874, 4)
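
Here the row count is unchanged by the inner merge, which suggests every author name found a match. When the counts differ, one way to see which rows were lost is a left merge with `indicator=True`. The sketch below uses invented data (`authors_long`, `lookup`) to show the idea.

```python
import pandas as pd

# Hypothetical exploded authors and researcher lookup
authors_long = pd.DataFrame({
    "pub_id": [1, 1, 2],
    "authors": ["Wu, A", "Zhang, J", "Nobody, X"],
})
lookup = pd.DataFrame({"res_id": [10, 11], "researcher": ["Wu, A", "Zhang, J"]})

# A left merge with indicator=True keeps every author row and adds a
# `_merge` column saying whether a match was found in `lookup`
checked = authors_long.merge(
    lookup, left_on="authors", right_on="researcher",
    how="left", indicator=True,
)
unmatched = checked[checked["_merge"] == "left_only"]
print(unmatched[["pub_id", "authors"]])
```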

In [10]:
merged_df.isna().sum()

pub_id        0
authors       0
res_id        0
researcher    0
dtype: int64

In [11]:
merged_df.columns

Index(['pub_id', 'authors', 'res_id', 'researcher'], dtype='object')

In [12]:
# Keep only the two key columns for the junction table
pub_res = merged_df[['pub_id', 'res_id']]

In [13]:
display(pub_res.shape)
pub_res = pub_res.drop_duplicates()
display(pub_res.shape)

(371874, 2)

(371851, 2)
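
`drop_duplicates` removed 23 rows here (371874 - 371851), that is, (pub_id, res_id) pairs that occurred more than once. To look at such pairs before discarding them, `duplicated(keep=False)` can be used. The frame `pairs` below is hypothetical.

```python
import pandas as pd

# Hypothetical junction rows with one repeated pair
pairs = pd.DataFrame({"pub_id": [1, 1, 1, 2], "res_id": [10, 11, 10, 10]})

# keep=False marks every member of a duplicated group, not just the repeats
repeated = pairs[pairs.duplicated(keep=False)]
print(repeated)

deduped = pairs.drop_duplicates()
print(len(pairs) - len(deduped), "row(s) removed")
```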

In [14]:
pub_res.isna().sum()

pub_id    0
res_id    0
dtype: int64

In [15]:
pub_res.to_csv('../data/neuropapers_db/pub_res.csv', index=False)

## <a id='toc1_3_'></a>[pub-aff](#toc0_)

In [16]:
affiliations = pd.read_csv('../data/affiliations_df.csv')

In [17]:
affiliations.columns

Index(['Unnamed: 0', 'pub_id', 'authors', 'auth_aff_list', 'first_auth_aff',
       'last_auth_aff'],
      dtype='object')

In [18]:
# .copy() so the column assignments below do not write to a slice of affiliations
df_03 = affiliations[['pub_id', 'auth_aff_list']].copy()

In [19]:
df_03.head(2)

Unnamed: 0,pub_id,auth_aff_list
0,38012702,"['Beijing Institute of Basic Medical Sciences,..."
1,38012669,"['Department of Pharmacology, School of Basic ..."


In [20]:
# Fill NaN with '[]' (an empty list written as a string) for the anonymous publications
df_03['auth_aff_list'] = df_03['auth_aff_list'].fillna('[]')

In [21]:
# Transform the strings to lists again
df_03['auth_aff_list'] = df_03['auth_aff_list'].apply(ast.literal_eval)

In [22]:
# Create a row per affiliation
df_03 = df_03.explode('auth_aff_list')
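
The fill, parse and explode steps interact in a way that is easy to miss: an empty list does not disappear when exploded, it becomes one NaN row. A small sketch with invented data (`raw`, `clean`, `long_df`) shows this.

```python
import ast
import pandas as pd

# Hypothetical rows: one with affiliations, one with none recorded
raw = pd.DataFrame({
    "pub_id": [1, 2],
    "auth_aff_list": ["['Univ A', 'Univ B']", None],
})

clean = raw.copy()
# '[]' is the string form of an empty list, so literal_eval can parse it
clean["auth_aff_list"] = clean["auth_aff_list"].fillna("[]").apply(ast.literal_eval)

# explode turns an empty list into a single NaN row for that publication
long_df = clean.explode("auth_aff_list")
print(long_df)
```

In this notebook those NaN rows would be expected to drop out in the inner merge that follows, provided the affiliation lookup itself contains no missing names (pandas matches NaN keys with each other).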

In [23]:
df_03.head(2)

Unnamed: 0,pub_id,auth_aff_list
0,38012702,"Beijing Institute of Basic Medical Sciences, B..."
0,38012702,"Beijing Institute of Basic Medical Sciences, B..."


In [24]:
df_04 = aff[['affiliation_names', 'aff_id']]

In [25]:
df_04.head(2)

Unnamed: 0,affiliation_names,aff_id
0,"The fourth affiliated hospital, Harbin Medical...",0
1,"Department of Biology, University of Padova, P...",1


In [26]:
# add the aff_id to the table
# Note: this displays the shape of df_01 (from the pub-res section), not of df_03
display(df_01.shape)
merged_df_02 = pd.merge(df_03, df_04, left_on='auth_aff_list', right_on='affiliation_names', how='inner')
merged_df_02.shape

(371874, 2)

(373699, 4)
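
An inner merge can both drop rows (no matching name) and add rows (a name that occurs more than once in the lookup). To rule out the second case, `merge` accepts a `validate` argument. The sketch below uses a made-up lookup in which one name is deliberately duplicated.

```python
import pandas as pd
from pandas.errors import MergeError

left = pd.DataFrame({"pub_id": [1, 2], "auth_aff_list": ["Univ A", "Univ B"]})
# Hypothetical lookup in which 'Univ A' appears under two ids
lookup = pd.DataFrame({
    "affiliation_names": ["Univ A", "Univ A", "Univ B"],
    "aff_id": [0, 1, 2],
})

try:
    left.merge(lookup, left_on="auth_aff_list", right_on="affiliation_names",
               how="inner", validate="many_to_one")
    duplicate_keys = False
except MergeError:
    # Raised because the right-hand keys are not unique
    duplicate_keys = True

print("lookup has duplicate names:", duplicate_keys)
```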

In [27]:
pub_aff = merged_df_02[['pub_id', 'aff_id']]

In [28]:
display(pub_aff.shape)
pub_aff = pub_aff.drop_duplicates()
display(pub_aff.shape)

(373699, 2)

(169938, 2)

In [29]:
pub_aff.isna().sum()

pub_id    0
aff_id    0
dtype: int64

In [30]:
pub_aff.to_csv('../data/neuropapers_db/pub_aff.csv', index=False)

## <a id='toc1_4_'></a>[jrn-res](#toc0_)

In [31]:
publications = pd.read_csv('../data/publications_df.csv')

In [32]:
# .copy() so the column assignments below do not write to a slice of publications
df_05 = publications[['journal', 'authors']].copy()

In [33]:
# .copy() so journal_name can be title-cased later without touching jrn
df_06 = jrn[['journal_id', 'journal_name']].copy()

In [34]:
df_07 = res[['res_id', 'researcher']]

In [35]:
type(df_05['authors'][0])

str

In [36]:
# Transform the strings to lists again
df_05['authors'] = df_05['authors'].apply(ast.literal_eval)

# Create a row per author
df_05 = df_05.explode('authors')

In [37]:
df_05.head(2)

Unnamed: 0,journal,authors
0,Journal of neuroinflammation,"Wu, Anbiao"
0,Journal of neuroinflammation,"Zhang, Jiyan"


In [38]:
# Rename the two journals whose names differ from those in the journals table
old_names = ['The Neuroscientist : a review journal bringing neurobiology, neurology and ',
             'Journal of physiology, Paris']
new_names = ['Neuroscientist', 'Journal of Physiology Paris']

df_05['journal'] = df_05['journal'].replace(old_names, new_names)

In [39]:
df_05['journal'] = df_05['journal'].str.title()
df_06['journal_name'] = df_06['journal_name'].str.title()
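
Title-casing both columns makes the join insensitive to capitalisation. A minimal sketch of the effect, where the `right` value is an assumed spelling chosen for the example:

```python
import pandas as pd

# Same journal written with different capitalisation (right-hand value is hypothetical)
left = pd.Series(["Journal of neuroinflammation"])
right = pd.Series(["Journal Of Neuroinflammation"])

# Raw strings differ, so a merge on them would find no match
raw_equal = bool((left == right).all())

# After title-casing both sides the keys agree
title_equal = bool((left.str.title() == right.str.title()).all())

print(raw_equal, title_equal)
```

This only handles differences in case; differences in punctuation or wording, like the two journals renamed above, still need explicit handling.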

In [40]:
# add the journal_id to the table
display(df_05.shape)
merged_df_03 = pd.merge(df_05, df_06, left_on='journal', right_on='journal_name', how='left')
merged_df_03.shape

(372016, 2)

(372016, 4)

In [41]:
merged_df_03.isna().sum()

journal         0
authors         0
journal_id      0
journal_name    0
dtype: int64

In [42]:
# add the res_id to the table
display(merged_df_03.shape)
merged_df_04 = pd.merge(merged_df_03, df_07, left_on='authors', right_on='researcher', how='left')
merged_df_04.shape

(372016, 4)

(372016, 6)

In [43]:
merged_df_04.head(2)

Unnamed: 0,journal,authors,journal_id,journal_name,res_id,researcher
0,Journal Of Neuroinflammation,"Wu, Anbiao",105,Journal Of Neuroinflammation,54006,"Wu, Anbiao"
1,Journal Of Neuroinflammation,"Zhang, Jiyan",105,Journal Of Neuroinflammation,92026,"Zhang, Jiyan"


In [44]:
jrn_res = merged_df_04[['journal_id', 'res_id']]

In [45]:
display(jrn_res.shape)
jrn_res = jrn_res.drop_duplicates()
display(jrn_res.shape)

(372016, 2)

(269401, 2)

In [46]:
jrn_res.isna().sum()

journal_id    0
res_id        0
dtype: int64

In [47]:
jrn_res.to_csv('../data/neuropapers_db/jrn_res.csv', index=False)

## <a id='toc1_5_'></a>[aff-res](#toc0_)

In [48]:
# .copy() so the column assignments below do not write to a slice of affiliations
df_08 = affiliations[['authors', 'auth_aff_list']].copy()
df_09 = aff[['affiliation_names', 'aff_id']]
df_10 = res[['res_id', 'researcher']]

In [49]:
# Clean affiliations
df_08.isna().sum()

authors            0
auth_aff_list    674
dtype: int64

In [50]:
# Fill NaN with '[]' (an empty list written as a string) for the anonymous publications
df_08['auth_aff_list'] = df_08['auth_aff_list'].fillna('[]')

In [51]:
type(df_08['authors'][0]), type(df_08['auth_aff_list'][0])

(str, str)

In [52]:
# Transform the strings to lists again
df_08['auth_aff_list'] = df_08['auth_aff_list'].apply(ast.literal_eval)
df_08['authors'] = df_08['authors'].apply(ast.literal_eval)

In [53]:
type(df_08['authors'][0]), type(df_08['auth_aff_list'][0])

(list, list)

In [54]:
# Create a row per affiliation
df_08_exp = df_08.explode('auth_aff_list')

In [55]:
# Create a row per researcher
df_08_exp = df_08_exp.explode('authors')

In [56]:
display(df_08_exp.shape)
df_08_exp = df_08_exp.drop_duplicates()
display(df_08_exp.shape)

(3683436, 2)

(1463368, 2)
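
The large intermediate shape (3683436 rows) follows from exploding two list columns one after the other: each publication contributes one row for every combination of author and affiliation. The sketch below shows this on a single hypothetical publication.

```python
import pandas as pd

# Hypothetical publication with 2 authors and 3 affiliations
row = pd.DataFrame({
    "authors": [["Wu, A", "Zhang, J"]],
    "auth_aff_list": [["Univ A", "Univ B", "Univ C"]],
})

# Exploding one column, then the other, pairs every author with every affiliation
crossed = row.explode("auth_aff_list").explode("authors")
print(len(crossed))  # 2 authors x 3 affiliations
print(crossed.drop_duplicates())
```

As written, this means aff_res links a researcher to every affiliation listed on a paper they co-authored, which may include affiliations of their co-authors.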

In [57]:
# add the aff_id to the table
display(df_08_exp.shape)
merged_df_05 = pd.merge(df_08_exp, df_09, left_on='auth_aff_list', right_on='affiliation_names', how='inner')
merged_df_05.shape

(1463368, 2)

(1460099, 4)

In [58]:
# add the res_id to the table
display(merged_df_05.shape)
merged_df_06 = pd.merge(merged_df_05, df_10, left_on='authors', right_on='researcher', how='left')
merged_df_06.shape

(1460099, 4)

(1460099, 6)

In [59]:
aff_res = merged_df_06[['aff_id', 'res_id']]

In [60]:
display(aff_res.shape)
aff_res = aff_res.drop_duplicates()
display(aff_res.shape)

(1460099, 2)

(1460099, 2)

In [61]:
aff_res.isna().sum()

aff_id    0
res_id    0
dtype: int64

In [62]:
aff_res.to_csv('../data/neuropapers_db/aff_res.csv', index=False)
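
The notebook does not show how the exported CSV files are loaded into the database. As one possible sketch, assuming an SQLite backend (an assumption, the actual engine of neuropapers_db is not stated here), a junction table can be declared with a composite primary key so that each pair is stored only once. The frame `toy_pub_res` is invented for the example.

```python
import sqlite3
import pandas as pd

# Hypothetical junction rows; in the notebook these would come from pub_res
toy_pub_res = pd.DataFrame({"pub_id": [1, 1, 2], "res_id": [10, 11, 10]})

con = sqlite3.connect(":memory:")  # in-memory database, for the example only

# The composite primary key allows each (pub_id, res_id) pair only once
con.execute(
    """
    CREATE TABLE pub_res (
        pub_id INTEGER NOT NULL,
        res_id INTEGER NOT NULL,
        PRIMARY KEY (pub_id, res_id)
    )
    """
)

# .tolist() converts NumPy integers to plain Python ints for sqlite3
con.executemany(
    "INSERT INTO pub_res (pub_id, res_id) VALUES (?, ?)",
    toy_pub_res.to_numpy().tolist(),
)
con.commit()

n_rows = con.execute("SELECT COUNT(*) FROM pub_res").fetchone()[0]
print(n_rows)
```

With this schema, inserting a pair that already exists raises `sqlite3.IntegrityError`, which is the database-side counterpart of the `drop_duplicates` calls above.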