This notebook counts the citations associated with each author. Running the cells in order creates files that map each author id to the list of citation counts for their papers, the sum of those counts, the maximum count received by a single paper, and the average count per paper.

This first command counts the number of times each abstract was cited and stores the result in the file citation_data/citation_counts.

In [1]:
cat citation_data/hep-th-citations | python count_citations.py > citation_data/citation_counts
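The script count_citations.py is not shown in this notebook. As a rough sketch of what such a step might look like, assuming each input line holds two whitespace-separated ids (citing paper, then cited paper) and that the output is one "id count" pair per line as the next cell expects, the counting could be done as follows. The function name `count_citations` and the sample ids are illustrative only.

```python
from collections import Counter

def count_citations(lines):
    """Count how often each id appears in the second (cited) column.

    Assumes each line looks like 'citing_id cited_id'; this format is an
    assumption, not something confirmed by the notebook.
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        citing, cited = fields
        counts[cited] += 1
    return counts

# made-up sample input, for illustration only
sample_lines = ['0001001 9304045', '0001002 9304045', '0001002 9308122', '']
sample_counts = count_citations(sample_lines)

# one "id count" pair per line, matching what the next cell reads back
output_lines = ['%s %d' % (paper, n) for paper, n in sorted(sample_counts.items())]
```

In a real script the lines would come from sys.stdin and the output would be printed, so that the shell pipeline above could redirect it into citation_data/citation_counts.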

In [1]:
# read the citation counts file into a dictionary
with open('citation_data/citation_counts', 'r') as f:
    citation_counts = {line.strip().split()[0]: int(line.strip().split()[1]) for line in f}

In [2]:
# create a list of strings from a string that looks like a list of strings eg ['a', 'b', 'c']
def parse_string_to_list(string):
    return map(lambda x: x.replace("'", '').strip(), string.replace(']', '').replace('[', '').strip().split(','))

# create a dictionary mapping strings to lists from lines of the form: key ['item', 'item2', 'item3']
with open('abstract_map.txt', 'r') as f:
    abstract_map = {line.strip().split()[0]: parse_string_to_list(' '.join(line.strip().split()[1:])) for line in f}
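To make the expected line format concrete, here is a small sketch of the same parsing applied to a single made-up line. The helper name `parse_list_literal` and the sample ids are hypothetical; the logic mirrors `parse_string_to_list` but uses a list comprehension so it also returns a list under Python 3.

```python
def parse_list_literal(string):
    """Turn a string such as "['a', 'b']" into the list ['a', 'b']."""
    inner = string.replace(']', '').replace('[', '').strip()
    return [item.replace("'", '').strip() for item in inner.split(',')]

# made-up line in the form: key ['item', 'item2', 'item3']
line = "9304045 ['a1', 'a2', 'a3']\n"
parts = line.strip().split()
key = parts[0]
authors = parse_list_literal(' '.join(parts[1:]))

# an empty list literal yields a single empty string, not an empty list
empty = parse_list_literal('[]')
```

Note that an abstract with no authors produces `['']` rather than `[]`, which is presumably why a later cell skips the author `''` when writing the inverted map.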

In [3]:
from collections import defaultdict

In [4]:
# invert abstract dictionary from abstract -> authors to author -> abstracts
author_abstract_map = defaultdict(list)
for abstract, authors in abstract_map.iteritems():
    for author in authors:
        author_abstract_map[author].append(abstract)

In [5]:
# store dictionary in file in format author_id\tabstract1 abstract2 abstract3
# open with 'w' rather than 'a' so re-running the cell does not append duplicate lines
with open('citation_data/author_to_abstract.txt', 'w') as f:
    for author, abstracts in author_abstract_map.iteritems():
        if author == '':
            continue
        f.write(author + '\t' + ' '.join(abstracts)+ '\n')

In [6]:
# get a dictionary mapping authors to their abstracts
def read_author_abstract():
    with open('citation_data/author_to_abstract.txt', 'r') as f:
        d = defaultdict(list)
        for line in f:
            author, abstract_string = line.strip().split('\t')
            abstracts = abstract_string.split()
            d[author] = abstracts
    return d

author_to_abstract = read_author_abstract()

In [7]:
def replace_abstract_with_count(author_to_abstract, abstract_counts):
    # replaces abstracts with their corresponding count
    out_dict = defaultdict(list)
    for author, abstracts in author_to_abstract.iteritems():
        for abstract in abstracts:
            try:
                out_dict[author].append(abstract_counts[abstract])
            except KeyError:
                out_dict[author].append(0)

    return out_dict

In [8]:
author_to_count = replace_abstract_with_count(author_to_abstract, citation_counts)
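As a small worked example of the replacement step, the sketch below runs the same idea on toy data. The function name `counts_for_authors` and the ids are made up for illustration; it uses `dict.get` with a default of 0 in place of the try/except, which should behave the same way for abstracts that were never cited.

```python
from collections import defaultdict

def counts_for_authors(author_to_abstract, abstract_counts):
    """Map each author to the citation counts of their abstracts.

    Abstracts missing from abstract_counts are treated as having 0 citations.
    """
    out = defaultdict(list)
    for author, abstracts in author_to_abstract.items():
        for abstract in abstracts:
            out[author].append(abstract_counts.get(abstract, 0))
    return out

# toy data: p3 never appears in the counts, so it contributes 0
toy_author_to_abstract = {'a1': ['p1', 'p2'], 'a2': ['p3']}
toy_counts = {'p1': 5, 'p2': 1}
toy_result = counts_for_authors(toy_author_to_abstract, toy_counts)
```

Here author a1 ends up with the list [5, 1] and author a2 with [0].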

The citation counts for our dataset are taken only from a two-month period in 2003 (either February-March or May-June; it isn't clear which). This has advantages and drawbacks. The main advantage is that a fixed window means papers that have been around longer do not get a bigger count just by virtue of being older. The drawback is that old papers may no longer be cited as much, in favour of the new hotness, although we might expect the best papers to stand the test of time and still be cited. Conversely, new papers may not have been found yet and so not yet be cited, but they also have a chance to benefit from being new and interesting.

In [12]:
with open('citation_data/author_citation_counts', 'w') as f:
    for author, counts in author_to_count.iteritems():
        f.write(author + '\t' + ' '.join(map(str, counts)) + '\n')

In [13]:
with open('citation_data/author_citation_sums', 'w') as f:
    for author, counts in author_to_count.iteritems():
        f.write(author + '\t' + str(sum(counts)) + '\n')

In [14]:
with open('citation_data/author_citation_max', 'w') as f:
    for author, counts in author_to_count.iteritems():
        f.write(author + '\t' + str(max(counts)) + '\n')

In [15]:
from __future__ import division
with open('citation_data/author_citation_avg', 'w') as f:
    for author, counts in author_to_count.iteritems():
        f.write(author + '\t' + str(sum(counts)/len(counts)) + '\n')
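The three summary files above are each derived from the same per-author list of counts. A minimal sketch of computing all three statistics in one pass, on toy data, is shown here; the function name `summarise_counts` is hypothetical, and it assumes every author has at least one count (as is the case when the lists are read from author_to_abstract.txt).

```python
def summarise_counts(author_to_count):
    """Return {author: (total, maximum, average)} for non-empty count lists."""
    summary = {}
    for author, counts in author_to_count.items():
        total = sum(counts)
        # float() keeps the average a true division under Python 2 as well
        summary[author] = (total, max(counts), total / float(len(counts)))
    return summary

# toy data for illustration only
toy_author_to_count = {'a1': [5, 1], 'a2': [0]}
toy_summary = summarise_counts(toy_author_to_count)
```

For the toy input, a1 has a total of 6, a maximum of 5 and an average of 3.0, while a2 has zeros throughout.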