### How many votes were cast for each paper?

* Data export from unipark tool on **Mar 25, 2018**
* At that time, **154 responses** were available

In [1]:
exportdate = 20180325 # YYYYMMDD, appears in file name

In [2]:
number_of_papers = 435 # this is constant

In [3]:
import pandas as pd

In [4]:
df = pd.read_csv(f'../data/{exportdate}repract.csv')

How many rows and columns?

In [5]:
df.shape

(154, 1360)

What does the data look like?

In [6]:
df.head(2)

|  | lfdn | external_lfdn | tester | dispcode | lastpage | quality | duration | v_7039 | v_7040 | v_7041 | ... | output_mode | javascript | flash | session_id | language | cleaned | ats | datetime | date_of_last_access | date_of_first_mail |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 106 | 0 | no tester | Completed after break (32) | 2138658 | NotShown | -1 | NotShown | NotShown | 0 | ... | HTML | NotShown | NotShown | 3bb21c1b318e2f6b87557566bdd6b4d9 | English | Not cleaned | 1515411510 | 2018-01-08 11:38:30 | 2018-01-08 13:07:14 | 0000-00-00 00:00:00 |
| 1 | 131 | 0 | no tester | Completed (31) | 2138658 | NotShown | 3805 | NotShown | NotShown | NotShown | ... | HTML | NotShown | NotShown | fc38f6556787a459c2cc604abf799448 | English | Not cleaned | 1515667019 | 2018-01-11 10:36:59 | 2018-01-11 11:40:24 | 0000-00-00 00:00:00 |


### Getting the Data Into Shape

1. Map PaperIDs to variables, starting at v_7039 for PaperID 1, ending at v_7473
2. Create a df containing evaluations as rows and papers as columns
3. Count number of valid values in each column
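The three steps above can be sketched on a toy frame before applying them to the real export. This is only an illustration: the frame `toy`, its values, and the names `variable_to_paper`, `toy_evaluations`, and `valid_counts` are made up for this example, and only the starting variable number 7039 is taken from the survey.

```python
import pandas as pd

# Toy stand-in for the export (hypothetical values):
# two metadata columns, then one column per paper.
toy = pd.DataFrame({
    'lfdn': [106, 131, 140],
    'duration': [-1, 3805, 912],
    'v_7039': ['NotShown', 'Essential', 'Worthwhile'],
    'v_7040': ['Unwise', 'NotShown', '0'],
    'v_7041': ['0', 'NotShown', 'NotShown'],
})
first_variable = 7039  # variable number of PaperID 1, as in the real export
toy_papers = 3         # toy value; the real survey has 435 papers

# Step 1: map variable names to PaperIDs
variable_to_paper = {f'v_{first_variable + i}': i + 1 for i in range(toy_papers)}

# Step 2: evaluations as rows, papers as columns
toy_evaluations = toy[list(variable_to_paper)].rename(columns=variable_to_paper)

# Step 3: count valid ratings per paper (everything except '0' and 'NotShown')
valid = ['Essential', 'Worthwhile', 'Unimportant', 'Unwise']
valid_counts = toy_evaluations.isin(valid).sum()
print(valid_counts.to_dict())
```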

In [7]:
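# Pair each PaperID (1..435) with its survey variable name (v_7039..v_7473)
paper_ids = [(paper_id, f'v_{7038 + paper_id}')
             for paper_id in range(1, number_of_papers + 1)]
# Lookup from variable name to PaperID
paper_id_dict = {variable: paper_id for paper_id, variable in paper_ids}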

In [12]:
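# The first 7 columns (lfdn ... duration) are survey metadata;
# the next 435 columns hold one rating variable per paper
evaluations = df.iloc[:, 7:7+number_of_papers]
evaluations = evaluations.rename(paper_id_dict, axis=1)
assert evaluations.shape == (df.shape[0], number_of_papers)
evaluations.head(2)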

|  | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 426 | 427 | 428 | 429 | 430 | 431 | 432 | 433 | 434 | 435 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NotShown | NotShown | 0 | NotShown | NotShown | NotShown | NotShown | NotShown | NotShown | NotShown | ... | NotShown | NotShown | NotShown | NotShown | NotShown | NotShown | NotShown | NotShown | NotShown | NotShown |
| 1 | NotShown | NotShown | NotShown | NotShown | NotShown | NotShown | NotShown | NotShown | NotShown | NotShown | ... | NotShown | NotShown | NotShown | NotShown | NotShown | NotShown | NotShown | NotShown | NotShown | NotShown |


Distinct values in the evaluations df:

In [13]:
set(evaluations.values.flatten())

{'0', 'Essential', 'NotShown', 'Unimportant', 'Unwise', 'Worthwhile'}

'NotShown' marks papers that were not shown to a respondent (i.e., not part of their random sample). '0' is one of the survey tool's ways of encoding 'NotAnswered'. In the later analysis, we are mainly interested in 'Essential', 'Worthwhile', 'Unimportant', and 'Unwise'.
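To make the distinction between the three states concrete, here is a minimal sketch on made-up data. It separates papers that were shown, papers that were actually rated, and papers that were shown but left unanswered. The frame `toy_evaluations` and the column names of `summary` are assumptions chosen for this example.

```python
import pandas as pd

# Hypothetical evaluations: rows are respondents, columns are PaperIDs.
toy_evaluations = pd.DataFrame({
    1: ['NotShown', 'Essential', '0'],
    2: ['Unwise', 'NotShown', 'Worthwhile'],
})
valid = ['Essential', 'Worthwhile', 'Unimportant', 'Unwise']

shown = toy_evaluations != 'NotShown'   # paper was part of the respondent's sample
answered = toy_evaluations.isin(valid)  # respondent gave one of the four ratings
unanswered = toy_evaluations == '0'     # shown, but no rating was given

# Boolean frames sum column-wise to per-paper counts.
summary = pd.DataFrame({
    'Shown': shown.sum(),
    'Answered': answered.sum(),
    'Unanswered': unanswered.sum(),
})
print(summary)
```

By construction, `Shown` equals `Answered` plus `Unanswered` for every paper as long as the six values listed above are the only ones in the data.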

In [14]:
rating_values = ['Essential', 'Worthwhile', 'Unimportant', 'Unwise', '0', 'NotShown']

In [15]:
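# One row per paper, one column per rating value; cells stay NaN until a count is filled in
evaluation_counts = pd.DataFrame(columns=['PaperId']+rating_values, index=range(1, number_of_papers+1))
for i in range(1, number_of_papers+1):
    evaluation_counts.at[i,'PaperId'] = i
    # value_counts() only lists values that actually occur in this paper's column
    for idx, val in evaluations[i].value_counts().items():
        evaluation_counts.at[i, idx] = val
# Rating values that never occurred for a paper get a count of 0
evaluation_counts = evaluation_counts.fillna(0)
evaluation_counts.head(2)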

|  | PaperId | Essential | Worthwhile | Unimportant | Unwise | 0 | NotShown |
|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 3 | 1 | 0 | 0 | 149 |
| 2 | 2 | 2 | 1 | 1 | 0 | 1 | 149 |
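The same count table could also be built without an explicit loop. The following is a sketch of one possible alternative, not the approach used in this notebook. It applies `value_counts` to every column at once and then reshapes the result. The toy frame `toy_evaluations` and the name `counts` are assumptions for this example.

```python
import pandas as pd

rating_values = ['Essential', 'Worthwhile', 'Unimportant', 'Unwise', '0', 'NotShown']

# Hypothetical evaluations: rows are respondents, columns are PaperIDs.
toy_evaluations = pd.DataFrame({
    1: ['NotShown', 'Essential', '0', 'Essential'],
    2: ['Unwise', 'NotShown', 'Worthwhile', 'NotShown'],
})

counts = (
    toy_evaluations
    .apply(pd.Series.value_counts)   # rows: rating values, columns: papers
    .T                               # rows: papers, columns: rating values
    .reindex(columns=rating_values)  # fixed column order, adds values that never occurred
    .fillna(0)
    .astype(int)
)
counts.insert(0, 'PaperId', counts.index)
print(counts)
```

The `reindex` step matters because `value_counts` only reports values that occur at least once, so a rating such as 'Unimportant' would otherwise be missing as a column in this toy data.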


#### Now that we have the raw counts, we can aggregate positives, negatives and non-ratings:

In [16]:
number_of_positives = evaluation_counts[['Essential', 'Worthwhile']].sum(axis=1)
number_of_negatives = evaluation_counts[['Unimportant', 'Unwise']].sum(axis=1)
number_of_ratings = evaluation_counts[['Essential', 'Worthwhile', 'Unimportant', 'Unwise']].sum(axis=1)
number_not_rated = evaluation_counts[['0', 'NotShown']].sum(axis=1)
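A quick consistency check can catch mistakes in this aggregation: for every paper, ratings plus non-ratings should add up to the number of responses (154 in this export). The sketch below rebuilds the two count rows shown earlier for papers 1 and 2 by hand. The names `counts`, `positives`, `negatives`, `ratings`, and `not_rated` are local to this example.

```python
import pandas as pd

number_of_responses = 154  # rows in the export, see df.shape

# Count rows for papers 1 and 2, copied from the output above.
counts = pd.DataFrame(
    {'Essential': [1, 2], 'Worthwhile': [3, 1], 'Unimportant': [1, 1],
     'Unwise': [0, 0], '0': [0, 1], 'NotShown': [149, 149]},
    index=[1, 2],
)

positives = counts[['Essential', 'Worthwhile']].sum(axis=1)
negatives = counts[['Unimportant', 'Unwise']].sum(axis=1)
ratings = positives + negatives
not_rated = counts[['0', 'NotShown']].sum(axis=1)

# Every response falls into exactly one of the six categories for each paper.
assert (ratings + not_rated == number_of_responses).all()
print(ratings.to_dict())
```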

#### Which papers have no ratings?

In [17]:
number_of_ratings[number_of_ratings == 0]

92     0
250    0
252    0
267    0
303    0
376    0
dtype: int64

In [18]:
evaluation_counts = evaluation_counts.rename({'0': 'ZeroRating'}, axis=1)
evaluation_counts['PosRatings'] = number_of_positives
evaluation_counts['NegRatings'] = number_of_negatives
evaluation_counts['TotalRatings'] = number_of_ratings
evaluation_counts.head(2)

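|  | PaperId | Essential | Worthwhile | Unimportant | Unwise | ZeroRating | NotShown | PosRatings | NegRatings | TotalRatings |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 3 | 1 | 0 | 0 | 149 | 4 | 1 | 5 |
| 2 | 2 | 2 | 1 | 1 | 0 | 1 | 149 | 3 | 1 | 4 |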


Now the evaluation counts can be sorted by the total number of ratings (descending), with ties broken by PaperId (ascending):

In [19]:
evaluation_counts = evaluation_counts.sort_values(
    ['TotalRatings', 'PaperId'], ascending=[False, True])
evaluation_counts.head()

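|  | PaperId | Essential | Worthwhile | Unimportant | Unwise | ZeroRating | NotShown | PosRatings | NegRatings | TotalRatings |
|---|---|---|---|---|---|---|---|---|---|---|
| 248 | 248 | 2 | 5 | 6 | 0 | 0 | 141 | 7 | 6 | 13 |
| 100 | 100 | 7 | 5 | 0 | 0 | 1 | 141 | 12 | 0 | 12 |
| 39 | 39 | 1 | 6 | 4 | 0 | 0 | 143 | 7 | 4 | 11 |
| 104 | 104 | 2 | 5 | 4 | 0 | 1 | 142 | 7 | 4 | 11 |
| 108 | 108 | 1 | 9 | 0 | 1 | 0 | 143 | 10 | 1 | 11 |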


Persisting to CSV:

In [20]:
#evaluation_counts.to_csv(f'../analysis/{exportdate}repract_evaluation_counts.csv', index=False)

The End.