# Find the h-index Metric

The h-index is a metric that measures both the productivity and citation impact of a researcher. Specifically, a researcher’s h-index is the largest number `h` such that the researcher has published `h` papers that have each been cited at least `h` times.

For example, if Carl has published papers A, B, C, D, which have been cited 3, 1, 4, and 1 times, respectively, then his h-index is `2` (corresponding to papers A and C).

Design an algorithm to determine a researcher’s h-index.

## Abstraction

Given an array of non-negative integers, find the largest `h` such that at least `h` entries in the array are greater than or equal to `h`.

Example 1:  
```
Input: [3, 1, 4, 1]
Output: 2
Explanation: The h-index is 2 because the 2 entries [3, 4] each have >= 2 citations.
It is not 3 because only 2 entries, [3, 4], have >= 3 citations.
```

Example 2:  
```
Input: [1, 4, 1, 5, 2, 1, 2, 5, 6]
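Output: 4
Explanation: The 4 entries [4, 5, 5, 6] each have >= 4 citations.
It is not 5 because only 3 entries, [5, 5, 6], have >= 5 citations.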
```

Example 3:  
```
Input: [1]
Output: 1
Explanation: The single paper has at least 1 citation.
```

Example 4:    
```
Input: [0]
Output: 0
Explanation: No entry is cited. The output is the same for [0, 0], etc.
```

**Hints**

- An easy approach is to sort the array first.
- What are the possible values of the h-index?
- A faster approach is to use extra space.

## Approach 1: Brute force count

Count upward from 1, and for each count check how many entries in the array are greater than or equal to the count. As soon as there are fewer entries than the count, the h-index is one less than the count.

Applied to the example `[1, 4, 1, 4, 2, 1, 3, 5, 6]`, the count progresses as shown below. At step 5 only 2 entries qualify, so we stop there, and the h-index is 4.

| citation threshold  | 1 | 2 | 3 | 4 | 5 |
| ------------------- | - | - | - | - | - |
| entries ≥ threshold | 9 | 6 | 5 | 4 | 2 |
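The counts in the table above can be reproduced with a short sketch. The helper name `count_at_least` is ours, introduced only for illustration.

```python
def count_at_least(citations, k):
    """Return how many entries of citations are >= k."""
    return sum(c >= k for c in citations)


citations = [1, 4, 1, 4, 2, 1, 3, 5, 6]
counts = [count_at_least(citations, k) for k in range(1, 6)]
print(counts)  # [9, 6, 5, 4, 2]

# The first k whose count drops below k is 5, so the h-index is 5 - 1 = 4.
first_failure = next(k for k in range(1, 6) if count_at_least(citations, k) < k)
print(first_failure - 1)  # 4
```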


```python
def h_index(citations) -> int:
    if not citations:
        return 0
    # Every paper has at least min(citations) citations, so the h-index is
    # at least min(min(citations), n); start counting from there.
    i = min(min(citations), len(citations))
    while i <= len(citations):
        n = sum(t >= i for t in citations)  # entries cited at least i times
        if i <= n:
            i += 1
        else:
            break
    return max(i - 1, 0)
```

```python
assert h_index([3, 1, 4, 1]) == 2
assert h_index([1, 4, 1, 5, 2, 1, 2, 5, 6]) == 4
assert h_index([1]) == 1
assert h_index([0]) == 0
```

- Time complexity: O(n<sup>2</sup>)
- Space complexity: O(1)

## Approach 2: Sort and count

One way to improve on the brute-force algorithm is to avoid re-scanning the whole array for every candidate value. To do so, we first sort the citations array in *descending* order. After sorting, if citations[i] > i, then papers 0 to i all have at least i+1 citations.

Thus, to find the h-index, we search for the largest i (call it i′) such that citations[i] > i. The h-index is then i′+1. If no such i exists, the h-index is 0.

For example, with the sorted citations below, the largest such i is 2, so the h-index is 3:

|                  i | **0** | **1** | **2** | **3** | **4** | **5** | **6** |
| -----------------: | :---- | ----- | ----- | ----- | ----- | ----- | ----- |
|   sorted citations | 10    | 9     | 5     | 3     | 3     | 2     | 1     |
| citations[i] > i? | true  | true  | true  | false | false | false | false |
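The table above can be checked with a few lines of Python. This is a small sketch using the same example values; the variable names `ranked` and `flags` are ours.

```python
citations = [10, 9, 5, 3, 3, 2, 1]
ranked = sorted(citations, reverse=True)  # already descending in this example
flags = [c > i for i, c in enumerate(ranked)]
print(flags)  # [True, True, True, False, False, False, False]

# The values fall while the indices rise, so the True flags form a prefix.
# Counting them therefore gives i' + 1, which is the h-index.
print(sum(flags))  # 3
```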

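```python
def h_index(citations) -> int:
    citations.sort(reverse=True)  # sorts in place, so the caller's list is modified
    i = 0
    # Advance while paper i has more than i citations.
    while i < len(citations) and i < citations[i]:
        i += 1
    return i  # number of papers that passed the test, i.e. i' + 1
```

```python
assert h_index([3, 1, 4, 1]) == 2
assert h_index([1, 4, 1, 5, 2, 1, 2, 5, 6]) == 4
assert h_index([1]) == 1
assert h_index([0]) == 0
```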

- Time complexity: O(n log n)
- Space complexity: O(1) beyond the memory the in-place sort itself uses

## [Approach 3: Smart count](https://leetcode.com/problems/h-index/solution/)

To do better than O(n log n), we need a non-comparison-based sorting algorithm. The most commonly used one is `counting sort`.

> Counting sort operates by counting the number of objects that have each distinct key value, and using arithmetic on those tallies to determine the positions of each key value in the output sequence. Its running time is linear in the number of items and the difference between the maximum and minimum keys, so it is only suitable for direct use in situations where the variation in keys is not significantly greater than the number of items.

> (from Wikipedia)

However, in our problem, the keys are the citation counts of each paper, which can be much larger than the number of papers n. It seems that we cannot use `counting sort`. The trick here is the following observation:

> Any citation count larger than n can be replaced by n, and the h-index will not change after the replacement. The reason is that the h-index is bounded above by the total number of papers n, i.e. h ≤ n.
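The observation above can be checked on a small input. The sketch below uses a direct, quadratic implementation of the definition; the name `h_index_by_definition` is ours and is not one of the approaches in this document.

```python
def h_index_by_definition(citations):
    """Largest h such that at least h entries are >= h (direct O(n^2) check)."""
    n = len(citations)
    # h = 0 always qualifies, so max() never sees an empty sequence.
    return max(h for h in range(n + 1) if sum(c >= h for c in citations) >= h)


citations = [1, 3, 2, 3, 100]
n = len(citations)
clamped = [min(c, n) for c in citations]  # replace anything above n by n
print(clamped)  # [1, 3, 2, 3, 5]
print(h_index_by_definition(citations), h_index_by_definition(clamped))  # 3 3
```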

We don't even need to get sorted citations. We can find the h-index by using the paper counts directly.

To explain this, let's look at the following example:

`citations = [1, 3, 2, 3, 100]`

The counting results are:

| **k** | **0** | **1** | **2** | **3** | **4** | **5** |
| ----- | ----- | ----- | ----- | ----- | ----- | ----- |
| count | 0     | 1     | 1     | 2     | 0     | 1     |
| s<sub>k</sub>   | 5     | 5     | 4     | 3     | 1     | 1     |

The value s<sub>k</sub> is defined as "the sum of all counts with citation ≥ k", or equivalently "the number of papers having at least k citations". By the definition of the h-index, the largest k with k ≤ s<sub>k</sub> is our answer.
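Both rows of the table can be reproduced as follows. This is a sketch using the example above; building the whole list `s` is only for illustration, and `itertools.accumulate` is used here to form the suffix sums.

```python
from itertools import accumulate

citations = [1, 3, 2, 3, 100]
n = len(citations)

count = [0] * (n + 1)
for c in citations:
    count[min(c, n)] += 1  # citation counts above n land in bucket n
print(count)  # [0, 1, 1, 2, 0, 1]

# s[k] = count[k] + count[k + 1] + ... + count[n], i.e. a suffix sum
s = list(accumulate(reversed(count)))[::-1]
print(s)  # [5, 5, 4, 3, 1, 1]
```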

After replacing 100 with n = 5, we have citations = [1, 3, 2, 3, 5]. Now we count the number of papers for each citation number from 0 to 5. The counts are [0, 1, 1, 2, 0, 1]. Scanning from right to left (5 down to 0), the first k with k ≤ s<sub>k</sub> is the h-index, here 3.

Since we can calculate s<sub>k</sub> on the fly while traversing the count array, we need only one pass through it, which costs O(n) time.
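The right-to-left scan described above can be sketched as follows. The function name `h_index_counting` and the variable `papers` (the running value of s<sub>k</sub>) are ours.

```python
def h_index_counting(citations):
    n = len(citations)
    count = [0] * (n + 1)
    for c in citations:
        count[min(c, n)] += 1  # clamp citation counts above n to n

    papers = 0  # running s_k: papers with at least k citations
    for k in range(n, -1, -1):
        papers += count[k]
        if k <= papers:
            return k
    return 0  # not reached: k = 0 always satisfies 0 <= papers


print(h_index_counting([1, 3, 2, 3, 100]))  # 3
```

Both loops touch each of the n + 1 buckets at most once, so the sketch runs in O(n) time with O(n) extra space for `count`.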

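```python
def h_index(citations) -> int:
    s = len(citations)  # starts at n, the total number of papers
    c = [0] * (s + 1)   # c[k] = papers with k citations, clamped to n
    for i in citations:
        c[min(i, s)] += 1

    # Scan k upward. After subtracting c[k], s is the number of papers
    # with more than k citations; the first k with k >= s is the h-index.
    for k, v in enumerate(c):
        s -= v
        if k >= s:
            return k
    return 0
```

```python
assert h_index([3, 1, 4, 1]) == 2
assert h_index([1, 4, 1, 5, 2, 1, 2, 5, 6]) == 4
assert h_index([1]) == 1
assert h_index([0]) == 0
```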

**Reference**
- [LeetCode 274. H-Index](https://leetcode.com/problems/h-index/solution/)
- EPI (Elements of Programming Interviews) [solution 13.3](https://github.com/adnanaziz/EPIJudge/blob/master/epi_judge_python_solutions/h_index.py)