## Authors

For authors, we keep only the author ID, name, affiliation, and affiliation type (company or university) in authors.csv.

Each paper will carry a list of author IDs.
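
As a rough illustration of what "a list of author IDs per paper" could look like, the sketch below groups a toy table shaped like `authors_data.csv` by `paperId`. The toy values and the `authorIds` column name are assumptions made for this example, not part of the original pipeline.

```python
import pandas as pd

# Toy rows shaped like authors_data.csv; the ids are made up for illustration.
toy = pd.DataFrame({
    "paperId": ["p1", "p1", "p2"],
    "authorId": [11, 12, 11],
})

# Collect each paper's author ids into one list per paper.
paper_authors = (
    toy.groupby("paperId")["authorId"]
       .agg(list)
       .reset_index(name="authorIds")  # "authorIds" is a hypothetical column name
)
print(paper_authors)
```

The same author ID can appear in several lists, which is why the authors table itself only needs one row per author.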

In [1]:
import pandas as pd
import numpy as np
import random
np.random.seed(0)

In [3]:
# Get Data 

df_authors = pd.read_csv('../konok_data/authors_data.csv')
df_authors.head()

Unnamed: 0,paperId,authorId,name,affiliations
0,29ddc1f43f28af7c846515e32cc167bc66886d0c,2815290.0,N. Houlsby,
1,29ddc1f43f28af7c846515e32cc167bc66886d0c,1911881.0,A. Giurgiu,
2,29ddc1f43f28af7c846515e32cc167bc66886d0c,40569328.0,Stanislaw Jastrzebski,
3,29ddc1f43f28af7c846515e32cc167bc66886d0c,68973833.0,Bruna Morrone,
4,29ddc1f43f28af7c846515e32cc167bc66886d0c,51985388.0,Quentin de Laroussilhe,


In [4]:
df_authors.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6801 entries, 0 to 6800
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   paperId       6801 non-null   object 
 1   authorId      6780 non-null   float64
 2   name          6801 non-null   object 
 3   affiliations  0 non-null      float64
dtypes: float64(2), object(2)
memory usage: 212.7+ KB


In [5]:
# Some author rows have no authorId

df_authors[df_authors['authorId'].isnull()]

Unnamed: 0,paperId,authorId,name,affiliations
162,77a096d80eb4dd4ccd103d1660c5a5498f7d026b,,Amanpreet Singh,
369,c0e6cd2ec3bc9eb46c7d45bb708854da3327339e,,L.,
398,25761ba4bdc054bfe902fe7c5d6338be6d00d491,,Ali Shariq Imran,
943,256db9dba1978f004a67c86ffc321563b1aee79a,,Chaofan Chen,
1106,69d49a06f09cf934310ccbf3bb2a360fa719272d,,Alessandro Anna Emily Emmanuel Georg Ghassem G...,
1374,ec58a564fdda29e6a9a0a7bab5eeb4c290f716d7,,Zhiyuan Liu,
1464,92930ed3560ea6c86d53cf52158bc793b089054d,,Yizhou Wang,
1633,cd29c25c489562b409a60f83365f93f33ee1a0a1,,Bochuan Cao,
1754,96273b87cd0eb9d1c9a12afae621ce2abdbbab36,,Xiao Yu,
2015,3a58efcc4558727cc5c131c44923635da4524f33,,Ryan Faulkner,


In [6]:
# Check if these paperIds have other authors

df_authors.query('paperId == "77a096d80eb4dd4ccd103d1660c5a5498f7d026b"')

Unnamed: 0,paperId,authorId,name,affiliations
154,77a096d80eb4dd4ccd103d1660c5a5498f7d026b,1743722.0,Douwe Kiela,
155,77a096d80eb4dd4ccd103d1660c5a5498f7d026b,153409000.0,Max Bartolo,
156,77a096d80eb4dd4ccd103d1660c5a5498f7d026b,40383660.0,Yixin Nie,
157,77a096d80eb4dd4ccd103d1660c5a5498f7d026b,9264826.0,Divyansh Kaushik,
158,77a096d80eb4dd4ccd103d1660c5a5498f7d026b,80833910.0,Atticus Geiger,
159,77a096d80eb4dd4ccd103d1660c5a5498f7d026b,47039340.0,Zhengxuan Wu,
160,77a096d80eb4dd4ccd103d1660c5a5498f7d026b,2737827.0,Bertie Vidgen,
161,77a096d80eb4dd4ccd103d1660c5a5498f7d026b,119869500.0,Grusha Prasad,
162,77a096d80eb4dd4ccd103d1660c5a5498f7d026b,,Amanpreet Singh,
163,77a096d80eb4dd4ccd103d1660c5a5498f7d026b,1422035000.0,Pratik Ringshia,
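
The query above checks a single paper. One possible way to run the same check for every affected paper at once is sketched below on toy data; the variable names `affected`, `valid_per_paper`, and `orphaned` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_authors; p2 has no author with a valid id.
toy = pd.DataFrame({
    "paperId": ["p1", "p1", "p2", "p3", "p3"],
    "authorId": [11.0, np.nan, np.nan, 12.0, 13.0],
})

# Papers that contain at least one row with a missing authorId.
affected = toy.loc[toy["authorId"].isna(), "paperId"].unique()

# For those papers, count the authors that do have an id
# (count() ignores NaN values).
valid_per_paper = (
    toy[toy["paperId"].isin(affected)]
    .groupby("paperId")["authorId"]
    .count()
)

# Papers that would be left with no authors after dropping the NaN rows.
orphaned = valid_per_paper[valid_per_paper == 0].index.tolist()
print(orphaned)
```

An empty `orphaned` list on the real data would confirm that dropping the rows leaves every paper with at least one author.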


In our data, the link between a paperId and its authors is stored only in this table, and the paper checked above still has nine other authors with valid IDs. An alternative would be to generate a random ID for each affected author, but keeping every author is not necessary here, so I remove the 21 rows with a missing authorId (6801 rows in total, 6780 with an ID).

In [7]:
df_authors = df_authors.dropna(subset=['authorId'])
df_authors[df_authors['authorId'].isnull()]

Unnamed: 0,paperId,authorId,name,affiliations


In [8]:
# converting the authorId to integer
df_authors['authorId'] = df_authors['authorId'].astype(int)
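
The plain `int` cast works here only because the rows with a missing ID were dropped first. If those rows had to be kept, pandas' nullable `Int64` dtype would be one option; the following is a sketch of that alternative, not what this notebook does.

```python
import numpy as np
import pandas as pd

# A float column with a gap, as authorId looks right after loading.
ids = pd.Series([2815290.0, np.nan, 1911881.0], name="authorId")

# "Int64" (capital I) is the nullable integer dtype: NaN becomes <NA>
# instead of raising, and the ids no longer carry a ".0" suffix.
nullable_ids = ids.astype("Int64")
print(nullable_ids)
```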

In [9]:
df_authors.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6780 entries, 0 to 6800
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   paperId       6780 non-null   object 
 1   authorId      6780 non-null   int64  
 2   name          6780 non-null   object 
 3   affiliations  0 non-null      float64
dtypes: float64(1), int64(1), object(2)
memory usage: 264.8+ KB


# Creating the New Authors dataframe

In [10]:
df_authors_new = df_authors[['authorId', 'name']]
df_authors_new

Unnamed: 0,authorId,name
0,2815290,N. Houlsby
1,1911881,A. Giurgiu
2,40569328,Stanislaw Jastrzebski
3,68973833,Bruna Morrone
4,51985388,Quentin de Laroussilhe
...,...,...
6796,1882948,M. Pirrung
6797,121917790,W. P. Smith
6798,143840196,Mathew Thomas
6799,1798255,Diego Figueira
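
Since the source table has one row per paper and author pair, an author who wrote several papers would appear more than once in `df_authors_new`. Whether that happens in this data is not shown above; the sketch below, on invented toy rows, shows one way to check and to keep a single row per `authorId`.

```python
import pandas as pd

# Toy stand-in for df_authors_new; author 11 appears on two papers.
toy = pd.DataFrame({
    "authorId": [11, 12, 11],
    "name": ["A. Author", "B. Author", "A. Author"],
})

# Compare the row count with the number of distinct ids.
n_rows = len(toy)
n_unique = toy["authorId"].nunique()

# Keep the first row seen for each authorId.
unique_authors = toy.drop_duplicates(subset="authorId").reset_index(drop=True)
print(n_rows, n_unique)
print(unique_authors)
```

Deduplicating before affiliations are assigned would also guarantee that each author ends up with exactly one affiliation.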


# Add column for affiliations & type

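Each author needs an affiliation with either a university or a company. The source data has none (the affiliations column is entirely null), so we draw them from two public datasets: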

University Affiliations : https://www.kaggle.com/datasets/joebeachcapital/qs-world-university-rankings-2024

Company Affiliations : https://www.kaggle.com/datasets/sabirbagwan/fortune-2023-companies-dataset

In [12]:
df_uni = pd.read_csv('../aryan_data/universities.csv')
df_company = pd.read_csv('../aryan_data/companies.csv')

In [13]:
df_uni.head()

Unnamed: 0,2024 RANK,2023 RANK,Institution Name,Country Code,Country,SIZE,FOCUS,RES.,AGE,STATUS,...,International Faculty Rank,International Students Score,International Students Rank,International Research Network Score,International Research Network Rank,Employment Outcomes Score,Employment Outcomes Rank,Sustainability Score,Sustainability Rank,Overall SCORE
0,rank display,rank display2,institution,location code,location,size,focus,research,age band,status,...,ifr rank,isr score,isr rank,irn score,irn rank,ger score,ger rank,SUS SCORE,SUS RANK,Overall Score
1,1,1,Massachusetts Institute of Technology (MIT),US,United States,M,CO,VH,5,B,...,56,88.2,128,94.3,58,100,4,95.2,51,100.0
2,2,2,University of Cambridge,UK,United Kingdom,L,FC,VH,5,A,...,64,95.8,85,99.9,7,100,6,97.3,33=,99.2
3,3,4,University of Oxford,UK,United Kingdom,L,FC,VH,5,A,...,110,98.2,60,100.0,1,100,3,97.8,26=,98.9
4,4,5,Harvard University,US,United States,L,FC,VH,5,B,...,210,66.8,223,100.0,5,100,1,96.7,39,98.3


In [14]:
# Top 50 universities; row 0 of df_uni is a second header row, so start at row 1

universities = df_uni['Institution Name'][1:51].tolist()
len(universities)

50
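
The `[1:51]` slice is needed because the file has a second, descriptive header line that pandas reads as data row 0. A possible alternative is to skip that line while loading, as in this sketch. The CSV text is a made-up two-column miniature of the real file, and the institution names are placeholders.

```python
import io
import pandas as pd

# Miniature stand-in for universities.csv: a header line, a second
# descriptive header line, then the data rows.
csv_text = (
    "2024 RANK,Institution Name\n"
    "rank display,institution\n"
    "1,Example University A\n"
    "2,Example University B\n"
)

# skiprows=[1] drops the file's second line (0-indexed line 1),
# so the first data row is the first real university.
df = pd.read_csv(io.StringIO(csv_text), skiprows=[1])
names = df["Institution Name"].head(50).tolist()
print(names)
```

With the extra line gone, numeric columns such as the rank are also parsed as numbers rather than strings.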

In [15]:
universities

['Massachusetts Institute of Technology (MIT) ',
 'University of Cambridge',
 'University of Oxford',
 'Harvard University',
 'Stanford University',
 'Imperial College London',
 'ETH Zurich - Swiss Federal Institute of Technology',
 'National University of Singapore (NUS)',
 'UCL',
 'University of California, Berkeley (UCB)',
 'University of Chicago',
 'University of Pennsylvania',
 'Cornell University',
 'The University of Melbourne',
 'California Institute of Technology (Caltech)',
 'Yale University',
 'Peking University',
 'Princeton University',
 'The University of New South Wales (UNSW Sydney)',
 'The University of Sydney',
 'University of Toronto',
 'The University of Edinburgh',
 'Columbia University',
 'Université PSL',
 'Tsinghua University',
 'Nanyang Technological University, Singapore (NTU)',
 'The University of Hong Kong',
 'Johns Hopkins University',
 'The University of Tokyo',
 'University of California, Los Angeles (UCLA)',
 'McGill University',
 'The University of Manc
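
Note that the first entry in the output above ends with a trailing space (`'... (MIT) '`). If the names are later matched against other data, stripping whitespace first may avoid silent mismatches. A minimal sketch using two of the names shown above:

```python
# Two entries copied from the output above; the first has a trailing space.
universities = [
    "Massachusetts Institute of Technology (MIT) ",
    "University of Cambridge",
]

# Remove leading and trailing whitespace from every name.
cleaned = [name.strip() for name in universities]
print(cleaned)
```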

In [16]:
df_company.head()

Unnamed: 0,company,rank,revenue,profit,num. of employees,sector,city,state,profitable
0,Walmart,1,572754.0,13673.0,2300000.0,Retailing,Bentonville,AR,yes
1,Amazon,2,469822.0,33364.0,1608000.0,Retailing,Seattle,WA,yes
2,Apple,3,365817.0,94680.0,154000.0,Technology,Cupertino,CA,yes
3,CVS Health,4,292111.0,7910.0,258000.0,Health Care,Woonsocket,RI,yes
4,UnitedHealth Group,5,287597.0,17285.0,350000.0,Health Care,Minnetonka,MN,yes


In [17]:
# list of companies
companies = df_company['company'][:50].tolist()
len(companies)

50

In [18]:
companies

['Walmart',
 'Amazon',
 'Apple',
 'CVS Health',
 'UnitedHealth Group',
 'ExxonMobil',
 'Berkshire Hathaway',
 'Alphabet',
 'McKesson',
 'AmerisourceBergen',
 'Costco Wholesale',
 'Cigna',
 'AT&T',
 'Microsoft',
 'Cardinal Health',
 'Chevron',
 'Home Depot',
 'Walgreens Boots Alliance',
 'Marathon Petroleum',
 'Elevance Health',
 'Kroger',
 'Ford Motor',
 'Verizon Communications',
 'JPMorgan Chase',
 'General Motors',
 'Centene',
 'Meta Platforms',
 'Comcast',
 'Phillips 66',
 'Valero Energy',
 'Dell Technologies',
 'Target',
 'Fannie Mae',
 'UPS',
 "Lowe's",
 'Bank of America',
 'Johnson & Johnson',
 'Archer Daniels Midland',
 'FedEx',
 'Humana',
 'Wells Fargo',
 'State Farm Insurance',
 'Pfizer',
 'Citigroup',
 'PepsiCo Beverages',
 'Intel',
 'Procter & Gamble',
 'General Electric',
 'IBM',
 'MetLife']

# Combine them

Every author must be affiliated with either a university or a company, so we assign each one a name drawn from the two lists above. Authors do not all need to share affiliations, but some overlap is desirable, which is why the pool is limited to 50 universities and 50 companies.

In [19]:
data_length = len(df_authors_new)

# Populate the 'Affiliations' column randomly from universities and companies
affiliations = random.choices(universities + companies, k=data_length)

df_authors_new['Affiliations'] = affiliations
df_authors_new['Affiliation_type'] = df_authors_new['Affiliations'].apply(lambda x: 'university' if x in universities else 'company')
df_authors_new.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_authors_new['Affiliations'] = affiliations


Unnamed: 0,authorId,name,Affiliations,Affiliation_type
0,2815290,N. Houlsby,Citigroup,company
1,1911881,A. Giurgiu,McKesson,company
2,40569328,Stanislaw Jastrzebski,General Electric,company
3,68973833,Bruna Morrone,Comcast,company
4,51985388,Quentin de Laroussilhe,Northwestern University,university
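
The `SettingWithCopyWarning` above is raised because `df_authors_new` was created by selecting columns from `df_authors`, so pandas cannot tell whether the assignment is meant to reach the original frame. Taking an explicit copy at selection time is the usual way to make the intent clear. The sketch below uses toy data, and the affiliation names are placeholders.

```python
import pandas as pd

# Toy stand-in for df_authors.
source = pd.DataFrame({
    "paperId": ["p1", "p2"],
    "authorId": [11, 12],
    "name": ["A. Author", "B. Author"],
})

# .copy() makes the selection an independent DataFrame, so adding
# columns to it is unambiguous and leaves `source` untouched.
subset = source[["authorId", "name"]].copy()
subset["Affiliations"] = ["Example University", "Example Company"]
print(subset)
```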


In [20]:
df_authors_new['Affiliation_type'].value_counts()

company       3414
university    3366
Name: Affiliation_type, dtype: int64
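
One caveat about reproducibility: `np.random.seed(0)` at the top of the notebook seeds NumPy only, while `random.choices` uses Python's built-in `random` module, so rerunning the cell will generally produce a different assignment and slightly different counts. The sketch below shows one way to make the draw repeatable, and it also draws the affiliation type together with the name instead of looking it up afterwards. The function name `assign_affiliations` and the placeholder lists are assumptions for this example.

```python
import random

# Placeholder pools; in the notebook these would be the two 50-item lists.
universities = ["Example University A", "Example University B"]
companies = ["Example Company X", "Example Company Y"]

# Pair every name with its type so both are sampled in one step.
pool = [(u, "university") for u in universities] + \
       [(c, "company") for c in companies]

def assign_affiliations(n, seed=0):
    """Draw n (name, type) pairs with replacement from a seeded generator."""
    rng = random.Random(seed)  # local generator, independent of global state
    return rng.choices(pool, k=n)

first = assign_affiliations(5)
second = assign_affiliations(5)
print(first == second)
```

Because the generator is created inside the function from a fixed seed, the result does not depend on what other cells have already drawn.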

In [21]:
# save to csv

df_authors_new.to_csv('../aryan_data/authors_info.csv', index=False)