# Calculating the H-index of a researcher (testing and Simon version)

This notebook shows how to use Python and the [Dimensions Analytics API](https://www.dimensions.ai/dimensions-apis/) to calculate the H-index of a researcher. 

NOTE: the public folder contains a version that I implemented.

#### Prerequisites

In [1]:
# data analysis libraries 
import pandas as pd
# Dimensions API query helper
import dimcli
# log in with the credentials stored in the local dsl.ini file
dimcli.login()
dsl = dimcli.Dsl()

DimCli v0.5.4 - Succesfully connected to <https://app.dimensions.ai> (method: dsl.ini file)


### Selecting a researcher

Let's take a researcher, e.g. [Michael Boutros (ur.01357111535.49)](https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.01357111535.49), and save the ID in a variable that can be referenced later.

> Try modifying the researcher ID below to get different results! 

In [2]:
RESEARCHER = "ur.01357111535.49"

### Simon version

In [3]:
def get_H_Index(researcher_id):
  tc = pd.DataFrame(dsl.query("""
      search publications
      where researchers.id = "{}"
      return publications[times_cited]
      limit 1000
  """.format(researcher_id))['publications'])
  # For each publication, count how many publications have at least as many citations
  tc['Hcandidate'] = tc.apply(lambda r: len(tc[tc['times_cited'] >= r['times_cited']].index), axis=1)
  # Return the largest citation count that does not exceed its own Hcandidate count
  return tc[tc['times_cited'] <= tc['Hcandidate']].times_cited.max()
print("H_index is:", get_H_Index(RESEARCHER))

H_index is: 52.0


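> ERROR: the result should be 53. The function can only return a citation count that actually occurs in the data. 53 publications have at least 53 citations (the 53rd has 55), but no publication has exactly 53 citations, so the largest value that passes the filter is 52. The H-index is a number of publications (a 1-based rank), not the citation count found at that rank.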
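One possible way to repair the counting approach is to take, for each publication, the smaller of its citation count and the number of publications cited at least as often, and then keep the maximum. The sketch below uses plain Python lists instead of pandas so that it runs without the API; the function name `h_index_by_counting` is hypothetical and not part of the notebook.

```python
def h_index_by_counting(citations):
    """Return the h-index of a list of citation counts (any order).

    For each count c, `at_least` is the number of publications cited
    at least c times. min(c, at_least) is an h value supported by that
    publication, and the h-index is the largest such value.
    """
    best = 0
    for c in citations:
        at_least = sum(1 for other in citations if other >= c)
        best = max(best, min(c, at_least))
    return best


# Four publications have at least 4 citations, but none has exactly 4,
# so filtering on citation counts alone would miss the answer.
print(h_index_by_counting([10, 8, 5, 5, 1]))  # 4
```

Like the cell above, this compares every publication with every other one, so it is quadratic in the number of publications; that is acceptable for the 1000-record limit used here.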

### My version

The H-index function takes a list of citation counts, sorted in descending order, and returns the H-index: the largest number h such that h publications have at least h citations each.

In [4]:
def the_H_function(sorted_citations_list, n=1):
    """From a list of integers [n1, n2, ...] holding publication citation counts,
    sorted in descending order, return the highest 1-based list position
    whose citation count is >= that position.
    
    e.g.
    >>> the_H_function([10, 8, 5, 4, 3])
    4
    >>> the_H_function([25, 8, 5, 3, 3])
    3
    >>> the_H_function([1000, 20])
    2
    """
    if sorted_citations_list and sorted_citations_list[0] >= n:
        return the_H_function(sorted_citations_list[1:], n+1)
    else:
        return n-1

The H-index function is generic: it accepts any list of numbers representing publication citations, provided the list is sorted in descending order.
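The recursive function makes one call per counted publication and copies the list each time, so a researcher with a very high H-index could in principle approach Python's default recursion limit. A loop-based sketch of the same idea is shown here as an alternative; the name `h_index_iterative` is hypothetical, and unlike the function above it sorts its input itself.

```python
def h_index_iterative(citations):
    """Return the h-index of a list of citation counts (any order)."""
    # Rank publications from most cited to least cited.
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cited in enumerate(ranked, start=1):
        if cited >= rank:
            # The publication at this 1-based rank still has enough citations.
            h = rank
        else:
            # Counts only decrease from here on, so no later rank can qualify.
            break
    return h


print(h_index_iterative([10, 8, 5, 4, 3]))  # 4
print(h_index_iterative([3, 25, 5, 8, 3]))  # 3
```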

To pass real-world data to the H-index function, we can use the Dimensions API to extract the citation counts of a researcher's publications (up to 1000, the limit set in the query), like this: 

In [5]:
def get_pubs_citations(researcher_id):
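    # the ranked output further down shows this query returning the most cited publications first
    q = """search publications where researchers.id = "{}" return publications[times_cited] sort by times_cited limit 1000"""
    pubs = dsl.query(q.format(researcher_id))
    # publications without a citation count are treated as having 0 citations
    return list(pubs.as_dataframe().fillna(0)['times_cited'])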

Finally, we combine the two functions to calculate the H-Index for a specific researcher:

In [6]:
print("H_index is:", the_H_function(get_pubs_citations(RESEARCHER)))

H_index is: 53
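The value 53 can be checked by hand against the ranked citation list printed further down. The sketch below copies ranks 50 to 56 from that listing and looks for the last rank whose citation count is at least the rank itself; the variable names are illustrative only.

```python
# Rank -> citation count, copied from ranks 50-56 of the printed listing.
excerpt = {50: 60, 51: 58, 52: 56, 53: 55, 54: 52, 55: 52, 56: 51}

# A rank qualifies when the publication at that rank has at least
# as many citations as the rank number.
qualifying = [rank for rank, cited in excerpt.items() if cited >= rank]
h_from_excerpt = max(qualifying)

print(h_from_excerpt)  # 53
```

Rank 53 qualifies because its publication has 55 citations, while rank 54 does not because its publication has only 52. The excerpt is sufficient here because every rank before 50 has at least 60 citations and the counts never increase after rank 56.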


The full ranked list of citation counts makes the result easy to verify by eye:

In [10]:
for x,y in enumerate(get_pubs_citations(RESEARCHER)):
    print (x+1,y)

1 757.0
2 592.0
3 557.0
4 409.0
5 382.0
6 362.0
7 358.0
8 348.0
9 336.0
10 322.0
11 315.0
12 309.0
13 296.0
14 292.0
15 268.0
16 235.0
17 218.0
18 217.0
19 213.0
20 212.0
21 157.0
22 142.0
23 140.0
24 121.0
25 119.0
26 117.0
27 109.0
28 107.0
29 103.0
30 102.0
31 101.0
32 98.0
33 90.0
34 88.0
35 87.0
36 87.0
37 84.0
38 83.0
39 81.0
40 81.0
41 75.0
42 74.0
43 72.0
44 70.0
45 69.0
46 69.0
47 68.0
48 64.0
49 64.0
50 60.0
51 58.0
52 56.0
53 55.0
54 52.0
55 52.0
56 51.0
57 45.0
58 45.0
59 44.0
60 43.0
61 41.0
62 40.0
63 37.0
64 36.0
65 36.0
66 35.0
67 35.0
68 35.0
69 34.0
70 34.0
71 34.0
72 34.0
73 33.0
74 33.0
75 32.0
76 32.0
77 31.0
78 30.0
79 30.0
80 27.0
81 27.0
82 26.0
83 26.0
84 26.0
85 25.0
86 25.0
87 25.0
88 24.0
89 24.0
90 24.0
91 23.0
92 23.0
93 23.0
94 21.0
95 21.0
96 20.0
97 20.0
98 19.0
99 19.0
100 19.0
101 18.0
102 18.0
103 18.0
104 17.0
105 17.0
106 17.0
107 16.0
108 16.0
109 15.0
110 15.0
111 15.0
112 14.0
113 14.0
114 13.0
115 12.0
116 12.0
117 12.0
118 12.0
119 11.0
120 11