## Scraping Author list from ADS
- Obtain a new token: https://ui.adsabs.harvard.edu/#user/settings/token 
- ADS package that I use: https://github.com/andycasey/ads

ADS limits how many requests I can run per day (5000).
The script walks through our input list of names. For each name it searches for papers with that first author, sorts the papers by publication date (newest first), and takes the affiliation from the most recent one. If that affiliation is empty, it moves on to the next paper until it finds a non-empty affiliation; if no paper has one, the affiliation is recorded as 'NAN'.
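
The selection rule described above can be sketched in plain Python. This is only an illustration, not the code used further down: `pick_affiliation` is a hypothetical helper, the papers are assumed to be dicts with `pubdate` and `aff` keys, and `'-'` is assumed to mark an empty affiliation.

```python
def pick_affiliation(papers, empty="-", missing="NAN"):
    """Return the affiliation of the most recent paper that has one."""
    # ADS dates look like '2014-09-00', so string order matches date order.
    newest_first = sorted(papers, key=lambda p: p["pubdate"], reverse=True)
    for paper in newest_first:
        if paper["aff"] != empty:
            return paper["aff"]
    return missing  # no paper lists an affiliation


# Made-up sample records, newest one without an affiliation.
papers = [
    {"pubdate": "2014-09-00", "aff": "-"},
    {"pubdate": "2014-08-00", "aff": "University of Virginia"},
    {"pubdate": "2012-10-00", "aff": "-"},
]
result = pick_affiliation(papers)
no_aff = pick_affiliation([{"pubdate": "2012-10-00", "aff": "-"}])
```

Here `result` is the affiliation of the 2014-08 paper, because the newer 2014-09 entry is empty, and `no_aff` falls back to `'NAN'`.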

In [126]:
%matplotlib inline 

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import re
import pickle
from datetime import datetime

import ads
#https://github.com/andycasey/ads

from matplotlib import rcParams

#these colors come from colorbrewer2.org. Each is an RGB triplet
dark2_colors = [(0.10588235294117647, 0.6196078431372549, 0.4666666666666667),
                (0.8509803921568627, 0.37254901960784315, 0.00784313725490196),
                (0.4588235294117647, 0.4392156862745098, 0.7019607843137254),
                (0.9058823529411765, 0.1607843137254902, 0.5411764705882353),
                (0.4, 0.6509803921568628, 0.11764705882352941),
                (0.9019607843137255, 0.6705882352941176, 0.00784313725490196),
                (0.6509803921568628, 0.4627450980392157, 0.11372549019607843),
                (0.4, 0.4, 0.4)]

rcParams['figure.figsize'] = (10, 6)
rcParams['figure.dpi'] = 150
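from cycler import cycler  # installed together with matplotlib
rcParams['axes.prop_cycle'] = cycler('color', dark2_colors)  # replaces the deprecated 'axes.color_cycle'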
rcParams['lines.linewidth'] = 2
rcParams['axes.grid'] = True
rcParams['axes.facecolor'] = '#eeeeee'
rcParams['font.size'] = 14
rcParams['patch.edgecolor'] = 'none'

def remove_border(axes=None, top=False, right=False, left=True, bottom=True):
    """
    Minimize chartjunk by stripping out unnecessary plot borders and axis ticks
    
    The top/right/left/bottom keywords toggle whether the corresponding plot border is drawn
    """
    ax = axes or plt.gca()
    ax.spines['top'].set_visible(top)
    ax.spines['right'].set_visible(right)
    ax.spines['left'].set_visible(left)
    ax.spines['bottom'].set_visible(bottom)
    
    #turn off all ticks
    ax.yaxis.set_ticks_position('none')
    ax.xaxis.set_ticks_position('none')
    
    #now re-enable visibles
    if top:
        ax.xaxis.tick_top()
    if bottom:
        ax.xaxis.tick_bottom()
    if left:
        ax.yaxis.tick_left()
    if right:
        ax.yaxis.tick_right()
        
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)


ads.config.token = 'aykjztnkf5jTSxoltAvCygXuAk7fH5z2VKnZy4iU'
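
Keeping the token out of the notebook avoids sharing it by accident. A minimal sketch, assuming the token is stored in an environment variable: `load_ads_token` is a hypothetical helper, and the variable name `ADS_DEV_KEY` is the one I recall from the `ads` README, so treat it as an assumption.

```python
import os


def load_ads_token(env_var="ADS_DEV_KEY", environ=None):
    """Return the API token stored in an environment variable.

    The helper and the default variable name are illustrative choices.
    """
    environ = os.environ if environ is None else environ
    token = environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError("set %s to your ADS API token first" % env_var)
    return token


# In the notebook this would replace the hard-coded string:
# ads.config.token = load_ads_token()
```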

In [128]:
## Check today's search-query limit ("reset" is a Unix timestamp; convert it at http://www.convert-unix-time.com/?t=1470873600)

r = ads.RateLimits('SearchQuery')
print r

SearchQuery: {"reset": "1470873600", "limit": "5000", "remaining": "4875"}
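
The `reset` value can also be converted locally with the standard library. The sketch below parses the printed payload as JSON, which is an assumption based only on how the output above looks, and reports the reset time in UTC together with the number of requests already used.

```python
import json
from datetime import datetime, timezone

# Rate-limit payload copied from the output above.
payload = '{"reset": "1470873600", "limit": "5000", "remaining": "4875"}'
limits = json.loads(payload)

# The values arrive as strings, so convert them before doing arithmetic.
reset_at = datetime.fromtimestamp(int(limits["reset"]), tz=timezone.utc)
used = int(limits["limit"]) - int(limits["remaining"])

print(reset_at.isoformat())  # 2016-08-11T00:00:00+00:00
print(used)                  # 125
```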


In [5]:
with open('author_list.pickle', 'rb') as f:  # binary mode is the safe way to read a pickle
    names = pickle.load(f)

with open('author_list_r.pickle', 'rb') as f:  #only consider parts of the list (GO, GTO, AR)
    names2 = pickle.load(f)

In [47]:
q=ads.SearchQuery(first_author=names[0], fl=['first_author'])
q.execute()

In [50]:
q.response.numFound

130

In [141]:
papers=list(ads.SearchQuery(first_author='Helen Russell', fl=['first_author','first_author_norm','author',\
                                                       'title','aff','pubdate','pub','property','keyword',\
                                                      'citation_count']))

In [143]:
js=[]
for ind, i in enumerate(papers):
#    if ind ==20 :
#        print i.metrics
    # keep every fetched field of the paper as one row
    j=dict(i.items())
    js.append(j)
#    for key, value in dict(i.items()).iteritems():
#        print key, value


df=pd.DataFrame(js)
# most-cited papers first; .iloc replaces the deprecated .ix indexer
dff=df.iloc[np.argsort(df.citation_count.values)[::-1]]; dff.head()

#j['metrics']

Unnamed: 0,aff,author,citation_count,first_author,first_author_norm,id,keyword,property,pub,pubdate,title
26,"[-, -]","[Russell, H. A. J., Arnott, R. W. C.]",24,"Russell, H. A. J.","Russell, H",6285828,,"[REFEREED, ARTICLE]",Journal of Sedimentary Research,2003-11-00,[Hydraulic-Jump and Hyperconcentrated-Flow Dep...
14,[-],"[Russell, H. W.]",4,"Russell, H. W.","Russell, H",2202811,,"[REFEREED, ARTICLE]",Journal of the Optical Society of America (191...,1940-06-00,[A new two-color optical pyrometer]
30,[-],"[Russell, H. C.]",4,"Russell, H. C.","Russell, H",2105169,,"[REFEREED, ARTICLE]",Quarterly Journal of the Royal Meteorological ...,1893-01-00,[Moving anticyclones in the Southern Hemisphere]
49,[-],"[Russell, H. N.]",1,"Russell, H. N.","Russell, H",168420,,"[OPENACCESS, REFEREED, ADS_OPENACCESS, ARTICLE]",The Observatory,1935-08-00,"[The George Darwin lecture, 1935]"
36,[-],"[Russell, H. L.]",1,"Russell, H. L.","Russell, H",9206645,,"[REFEREED, ARTICLE]",Science,1911-10-00,[Contagious Abortion in Cattle]


In [124]:
paperi = list(ads.SearchQuery(q="bibcode:\"2015arXiv150701293E\""))
#print paperi.metrics['basic stats']['total number of reads']

[]

In [69]:
##draft
papers=list(ads.SearchQuery(first_author=name, fl=['first_author','author','title','aff','pubdate','pub','property']))

print name

for i in papers:
    print i.pubdate
    
names=["Paul Goudfrooij","Philip Kaaret","Eric Perlman","Craig Sarazin","F. Tavecchio",\
       "Walter Lewin","Michael Garcia","Jonathan Grindlay","Edmund Nelan","Stefano Casertano"]

In [83]:
## Don't re-run this cell: doing so resets the collected results

print len(names)
author_list=[]

2194


In [96]:
# Re-run this cell to collect fresh results (index 1074 is skipped by hand)
for index, name in enumerate(names):
    if index!=1074:
        papers=list(ads.SearchQuery(first_author=name, fl=['first_author','first_author_norm','author'\
                                                           ,'title','aff','pubdate','pub','property']))
        aut_l=[]
        for i in papers:
            aut={}
            aut['search_name']=name
            aut['first_author']=i.first_author
            aut['first_author_norm']=i.first_author_norm
            aut['author_list']=i.author
            aut['title']=i.title
            aut['first_aff']=i.aff[0]
            aut['aff']=i.aff
            aut['pubdate']=i.pubdate
            aut['pub']=i.pub
            aut['property']=i.property
            aut_l.append(aut)
        print index
    else: 
        aut_l=[]
    author_list.append(aut_l)

1076
... (progress output for indices 1077 through 1274 omitted)
1275
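
The loop above skips one index by hand. A more general pattern, sketched here under stated assumptions, is to catch failures per name and record them, so one bad query does not stop the run and every name still gets an entry. `collect_papers` and `fake_fetch` are hypothetical: `fake_fetch` stands in for the ADS query so the sketch runs offline, and papers are modelled as plain dicts rather than `ads` article objects.

```python
def collect_papers(names, fetch_papers):
    """Fetch papers for every name, recording failures instead of stopping."""
    author_list, failed = [], []
    for index, name in enumerate(names):
        try:
            rows = []
            for paper in fetch_papers(name):
                aff = paper.get("aff") or ["-"]  # guard against a missing list
                rows.append({"search_name": name,
                             "first_aff": aff[0],
                             "aff": aff,
                             "pubdate": paper.get("pubdate")})
        except Exception as err:  # broad on purpose: any failure skips the name
            failed.append((index, name, str(err)))
            rows = []
        author_list.append(rows)  # one entry per name keeps indices aligned
    return author_list, failed


def fake_fetch(name):
    """Hypothetical stand-in for the ADS query, so the sketch runs offline."""
    if name == "Bad Name":
        raise ValueError("query failed")
    return [{"aff": None, "pubdate": "2014-09-00"}]


author_list, failed = collect_papers(["Craig Sarazin", "Bad Name"], fake_fetch)
```

After the run, `failed` lists the index, name, and error message of every query that raised, which makes it easy to retry only those names later.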


In [108]:
len(author_list)

2192

In [109]:
outname='author_paper.pickle'
with open(outname, 'wb') as f:  # the with-block closes the file after writing
    pickle.dump(author_list, f)
print outname

author_paper.pickle


<hr>

In [51]:
with open('author_paper.pickle', 'rb') as f:  # binary mode is the safe way to read a pickle
    author_list2 = pickle.load(f)

In [52]:
list_pd=[]
for i in author_list2:
    list_pd.append(pd.DataFrame(i))

In [53]:
list_pd[18].shape==(0,0)

True

In [56]:
df=list_pd[3]
dff=df.iloc[np.argsort(df.pubdate.values)[::-1]]; dff.head()  # newest first; .iloc replaces the deprecated .ix
#if dff.aff.values[0]=='-':

Unnamed: 0,aff,author_name,property,pub,pubdate,search_name,title
23,-,"Sarazin, Craig","[NONARTICLE, NOT REFEREED]",Chandra Proposal,2014-09-00,Craig Sarazin,[The Burst Cluster: Dark Matter in a Merging C...
45,University of Virginia,"Sarazin, Craig","[ARTICLE, NOT REFEREED]",X-ray View of Galaxy Ecosystems,2014-08-00,Craig Sarazin,[The Physical State of the Hot and Cool Gas in...
2,"Department of Astronomy, University of Virgini...","Sarazin, C.","[ARTICLE, NOT REFEREED]",The X-ray Universe 2014,2014-07-00,Craig Sarazin,[XMM-Newton and Chandra Observations of the Re...
22,-,"Sarazin, Craig","[NONARTICLE, NOT REFEREED]",XMM-Newton Proposal,2013-10-00,Craig Sarazin,[PKS B1400-33 and Abell S753: A Very Bright Ra...
12,-,"Sarazin, Craig","[NONARTICLE, NOT REFEREED]",XMM-Newton Proposal,2012-10-00,Craig Sarazin,[The Burst Cluster: Dark Matter in a Merging C...


In [409]:
list_pd[19].head()

Unnamed: 0,aff,author_name,property,pub,pubdate,search_name,title
0,-,"Wheeler, T.","[NONARTICLE, NOT REFEREED]",Space Telescope NICMOS Instrument Science Report,2005-10-00,Thomas Wheeler,[NICMOS Small ΔT Dewar / NCS PID Model for Or...
1,-,"Wheeler, T.","[REFEREED, ARTICLE]",Weather,2004-06-00,Thomas Wheeler,[Meeting report: Meteorology and agriculture]
2,"Forest Ecology and Biogeosciences, University ...","Wheeler, T.","[NONARTICLE, NOT REFEREED]",AGU Fall Meeting Abstracts,2010-12-00,Thomas Wheeler,[The Importance of Marine Nutrient Subsidies i...
3,Space Telescope Science Institute,"Wheeler, Thomas","[NONARTICLE, NOT REFEREED]",HST Proposal,2013-10-00,Thomas Wheeler,[COS NUV MAMA Fold Distribution]
4,Space Telescope Science Institute,"Wheeler, Thomas","[NONARTICLE, NOT REFEREED]",HST Proposal,2013-10-00,Thomas Wheeler,[COS FUV Recovery after Anomalous Shutdown]


In [367]:
f_name,f_aff=[],[]
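for i in range(len(list_pd)):
    df=list_pd[i]
    if not df.shape==(0,0):
        # newest paper first; .iloc replaces the deprecated .ix indexer
        dff=df.iloc[np.argsort(df.pubdate.values)[::-1]]
        aff=dff.aff.values[0]
        ind=1
        # walk down the sorted papers until a non-empty affiliation turns up
        while aff=='-':
            if ind<len(dff.aff.values):
                aff=dff.aff.values[ind]
                ind+=1
            else: 
                aff='NAN'  # no paper lists an affiliation
        f_name.append(dff.search_name.values[0])
        f_aff.append(aff)
    else:
        print i  # index of a name for which no papers were found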

18
20
23
27
30
33
42
84
97
113
181
182
189
210
270
286
396
438
539
1322
1416
1447


In [240]:
df3=pd.DataFrame({'name':f_name,'aff':f_aff});
print np.sum(df3['aff']=='NAN')

4


In [310]:
def replace_name(x):
    """Reduce the raw affiliation string in x['aff'] to a shorter, more uniform institution name."""
    aff=x['aff'].lstrip()
    # keep only the comma-separated piece that mentions a university, if there is one
    cut=[part for part in aff.split(',') if 'University' in part]
    if cut:
        aff=cut[0].lstrip()
    
    if 'Space Telescope Science Institute' in aff or 'STScI' in aff:
        return 'Space Telescope Science Institute'
    elif 'Univ. of' in aff:
        return aff.replace('Univ. of','University of')
    elif 'Univ' in aff:
        return aff.replace('Univ.','University')
    elif 'Harvard-Smithsonian Center for Astrophysics' in aff:
        return 'Harvard-Smithsonian Center for Astrophysics'
    else:
        return aff

In [311]:
xx='Department of Physics and Astronomy, University of Utah, Salt Lake City, UT 84112─0830, USA.'
aff=[part for part in xx.split(',') if 'University' in part]  # list comprehension returns a list on Python 2 and 3
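
To see what the cleanup rules do to a few typical strings, here is a standalone variant that takes a plain string instead of a DataFrame row. `normalize_affiliation` is a hypothetical name, the example strings are made up for illustration, and the rules are a close but not identical copy of `replace_name` (it strips both ends and tests for `'Univ.'` rather than `'Univ'`).

```python
def normalize_affiliation(aff):
    """Standalone variant of the cleanup rules, taking a plain string."""
    aff = aff.strip()
    # Prefer the comma-separated piece that names a university, if any.
    pieces = [p.strip() for p in aff.split(",") if "University" in p]
    if pieces:
        aff = pieces[0]
    if "Space Telescope Science Institute" in aff or "STScI" in aff:
        return "Space Telescope Science Institute"
    if "Univ. of" in aff:
        return aff.replace("Univ. of", "University of")
    if "Univ." in aff:
        return aff.replace("Univ.", "University")
    if "Harvard-Smithsonian Center for Astrophysics" in aff:
        return "Harvard-Smithsonian Center for Astrophysics"
    return aff


# Made-up example strings.
examples = [
    "Department of Physics and Astronomy, University of Utah, Salt Lake City, UT, USA",
    "STScI, 3700 San Martin Drive, Baltimore, MD 21218",
    "Dept. of Astronomy, Univ. of Iowa",
]
cleaned = [normalize_affiliation(a) for a in examples]
```

The third example shows a limitation: an abbreviated `Univ. of` is expanded, but the department prefix stays, because the comma split only looks for the full word `University`.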

In [375]:
for index, i in enumerate(df3.apply(replace_name, axis=1)):
    print index, i

0 Space Telescope Science Institute
1 University of Iowa
2 Florida Institute of Technology
3 University of Virginia
4 INAF - Osservatorio Astronomico di Brera, via E. Bianchi 46, I-23807 Merate, Italy
5 Massachusetts Institute of Technology
6 Smithsonian Institution Astrophysical Observatory
7 Harvard
8 Space Telescope Science Institute
9 Space Telescope Science Institute
10 Space Telescope Science Institute
11 Space Telescope Science Institute
12 Space Telescope Science Institute
13 University of the Basque Country UPV/EHU
14 Computer Sciences Corporation
15 Southern Illinois University
16 University of California - Riverside
17 University of Illinois
18 Space Telescope Science Institute
19 Swarthmore College
20 The Johns Hopkins University
21 University of Wyoming
22 University of Massachusetts Amherst
23 Space Telescope Science Institute
24 The Pennsylvania State University
25 University of Texas, at Austin
26 Arizona State University
27 Catholic University of America
28 Space Teles

In [387]:
df3.loc[146]

aff     Kavli Institute for Astronomy and Astrophysics...
name                                      Gregory Herczeg
Name: 146, dtype: object

In [391]:
print np.where(names=='Anna Frebel')

(array([724]),)


In [397]:
df=list_pd[724]
dff=df.iloc[np.argsort(df.pubdate.values)[::-1]]; dff  # newest first; .iloc replaces the deprecated .ix

Unnamed: 0,aff,author_name,property,pub,pubdate,search_name,title
11,"Department of Physics, Massachusetts Institute...","Frebel, A.","[REFEREED, ARTICLE]",Journal of Physics Conference Series,2016-01-00,Anna Frebel,[A new r-process star with low abundances of r...
41,Massachusetts Institute of Technology,"Frebel, Anna","[NONARTICLE, NOT REFEREED]",IAU General Assembly,2015-08-00,Anna Frebel,[The first stars - Recent results and prospect...
40,Massachusetts Institute of Technology,"Frebel, Anna","[NONARTICLE, NOT REFEREED]",IAU General Assembly,2015-08-00,Anna Frebel,[Chemical abundances of the most metal-poor st...
38,Massachusetts Institute of Technology,"Frebel, Anna","[NONARTICLE, NOT REFEREED]",APS April Meeting Abstracts,2015-04-00,Anna Frebel,[New developments in understanding the r-proce...
22,-,"Frebel, Anna","[OPENACCESS, EPRINT_OPENACCESS, ARTICLE, NOT R...",ArXiv e-prints,2014-08-00,Anna Frebel,[Reconstructing the cosmic evolution of the ch...
39,Kavli Institute for Astrophysics and Space Res...,"Frebel, Anna","[OPENACCESS, REFEREED, EPRINT_OPENACCESS, PUB_...",The Astrophysical Journal,2014-05-00,Anna Frebel,[Segue 1: An Unevolved Fossil Galaxy from the ...
31,Massachusetts Institute of Technology,"Frebel, Anna","[NONARTICLE, NOT REFEREED]",HST Proposal,2013-10-00,Anna Frebel,[The nucleosynthetic origins and chemical evol...
35,"Massachusetts Institute of Technology, Kavli I...","Frebel, Anna","[OPENACCESS, REFEREED, EPRINT_OPENACCESS, PUB_...",The Astrophysical Journal,2013-07-00,Anna Frebel,[The 300 km s<SUP>-1</SUP> Stellar Stream near...
37,Kavli Institute for Astrophysics and Space Res...,"Frebel, Anna","[OPENACCESS, REFEREED, EPRINT_OPENACCESS, PUB_...",The Astrophysical Journal,2013-05-00,Anna Frebel,[Deriving Stellar Effective Temperatures of Me...
29,Massachusetts Institute of Technology,"Frebel, Anna","[OPENACCESS, EPRINT_OPENACCESS, ARTICLE, NOT R...",The First Galaxies,2013-00-00,Anna Frebel,[Exploring the Universe with Metal-Poor Stars]


In [362]:
len(list_pd), len(names)

(2192, 2194)
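
Since the two lengths differ, a position in `names` is not guaranteed to point at the same author in `list_pd`. One way to avoid relying on positions is to look records up by name. This is a sketch with a hypothetical helper, `index_by_search_name`, assuming every record carries the `search_name` key written by the collection loop; the sample data is made up.

```python
def index_by_search_name(author_list):
    """Map each searched name to its list of paper records."""
    by_name = {}
    for records in author_list:
        if records:  # names with no papers leave an empty list behind
            by_name[records[0]["search_name"]] = records
    return by_name


# Tiny made-up sample in the same shape as author_list.
sample = [
    [{"search_name": "Anna Frebel", "aff": "Massachusetts Institute of Technology"}],
    [],
    [{"search_name": "Craig Sarazin", "aff": "University of Virginia"}],
]
by_name = index_by_search_name(sample)
frebel_aff = by_name["Anna Frebel"][0]["aff"]
```

Names that returned no papers are simply absent from the mapping, so a lookup with `by_name.get(name)` returns `None` for them instead of silently picking a neighbour's records.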

In [187]:
names2[names2=='Stephen Odewahn']

array(['Stephen Odewahn'], dtype=object)