# Updating Vispubdata with new IEEE VIS papers

This notebook allows you to create an update of vispubdata. It will walk you through the different steps that are needed. 

This code was written by Petra Isenberg (petra.isenberg@inria.fr) and is updated regularly.
If you find better or more elegant ways to implement what I did below, don't hesitate to suggest improvements!


## Preparation

### Get an IEEE Xplore API Key
https://ieeexplore.ieee.org/Xplorehelp/administrators-and-librarians/api-portal

Put the API key into the file called "ieeexplore-apikey.txt" in the main folder.


### 1) Finding the issue number

Find the issue number for the latest TVCG issue that holds the new VIS papers you would like to add. To do so, do the following:
- Go to the TVCG IEEEXplore entry: https://ieeexplore.ieee.org/xpl/mostRecentIssue.jsp?punumber=2945. Find the issue that holds the IEEE VIS papers of the year. At the time of writing, this was typically issue 1 of each year. Open any paper of this issue. Then check the source of the html page for this paper and search in the source file for "isnumber".
- Open the file "xplore-isnumbers.csv". Add a new line at the bottom with the year,[YOUR ISNUMBER],TVCGSI,articles-SI. TVCGSI stands for "special issue of TVCG" and "articles-SI" stands for "article in the special issue". The last field can be left empty because the download step below fills it in automatically. In the earlier years you can find slightly different codes, but those don't have to concern you if you are updating more recent articles.

**Example:**

The IEEE VIS 2023 papers can be found here in the digital library: https://ieeexplore.ieee.org/xpl/tocresult.jsp?isnumber=10373160&punumber=2945. Open a random paper from the issue. For example this one: https://ieeexplore.ieee.org/document/10301796. Check the source. In the document metadata you will find the isnumber: isnumber=10373160. Add the following line to xplore-isnumbers.csv: **2023,10373160,TVCGSI,**
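
The manual edit of xplore-isnumbers.csv can also be scripted. The following is a minimal sketch, assuming the four-column layout documented in the next section; the helper name `add_isnumber_row` is hypothetical and not part of this notebook. It refuses to add an isnumber twice and leaves the last field empty, as in the example line.

```python
import csv
from pathlib import Path


def add_isnumber_row(csv_path, year, isnumber, conference="TVCGSI"):
    """Append one row to the isnumber file unless the isnumber is already listed.

    Returns True if a row was written, False if the isnumber was already present.
    """
    csv_path = Path(csv_path)
    with csv_path.open(newline="", encoding="utf-8") as f:
        existing = {row["isnumber"] for row in csv.DictReader(f)}
    if str(isnumber) in existing:
        return False

    text = csv_path.read_text(encoding="utf-8")
    with csv_path.open("a", newline="", encoding="utf-8") as f:
        # Make sure the last existing line is terminated before appending.
        if text and not text.endswith("\n"):
            f.write("\n")
        # The json_filename field stays empty here.
        csv.writer(f, lineterminator="\n").writerow([year, isnumber, conference, ""])
    return True
```

For the 2023 example this would be called as `add_isnumber_row("xplore-isnumbers.csv", 2023, 10373160)`.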

#### Documentation for xplore-isnumbers.csv

The columns of this file are:
1. year: the year of the conference, e.g., 1990 for Vis 1990
2. isnumber: the issue number as given by IEEEXplore (see above)
3. conference: the conference that belongs to the issue number. Vis-conf = the IEEE Visualization conference when papers were still published as conference papers, InfoVis-conf = the IEEE Information Visualization conference when papers were still published as conference papers, TVCGSI = a special issue of TVCG (see above)
4. json_filename: the name of the file that will hold the paper info once downloaded. You don't have to write anything here. This will be automatically set below

### 2) Get the IEEE VIS full paper titles from the IEEE VIS program chairs
Every year there are a few errors on the IEEE digital library. In addition, the special issue of TVCG that includes the IEEE VIS full papers (usually the first issue of the year) sometimes contains additional papers such as best papers from associated events such as LDAV. 
To catch errors and exclude papers that do not belong on vispubdata, I have data files with the correct titles of each IEEE VIS full paper. These are the titles listed on vispubdata. To get the correct titles, email [program@ieeevis.org](mailto:program@ieeevis.org) to ask for the file. Then go to the folder for the year that you are updating. For example, to add paper titles for VIS 2023, **create a folder** called 2023 and in it a file called **VisTitles.txt**. Copy the titles in there, one title per line.
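
Because the title check further below compares lowercased titles for exact equality, stray whitespace or duplicated lines in VisTitles.txt can lead to spurious "not found" reports. As a hedged sketch (the helper name `write_titles_file` and its parameters are hypothetical, not part of this notebook), a list of titles could be cleaned and written like this:

```python
from pathlib import Path


def write_titles_file(titles, year, base_dir="."):
    """Write cleaned titles, one per line, to <base_dir>/<year>/VisTitles.txt."""
    folder = Path(base_dir) / str(year)
    folder.mkdir(parents=True, exist_ok=True)

    cleaned = []
    seen = set()
    for title in titles:
        # Collapse tabs and repeated spaces, and trim both ends.
        title = " ".join(title.split())
        # Skip empty lines and case-insensitive duplicates.
        if title and title.lower() not in seen:
            seen.add(title.lower())
            cleaned.append(title)

    target = folder / "VisTitles.txt"
    target.write_text("".join(t + "\n" for t in cleaned), encoding="utf-8")
    return target
```

Collapsing tabs matters in particular because the title file is later read as a tab-separated file.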

### 3) Get the awarded papers
Open the file **awardedPapers.csv** and add the information for the paper awards of the conference year you are adding. The paper award types are:
- BP = Best paper award
- HM = Honorable mention award
- TT = Test of Time award
- BCS = Best case study award (was given out in the early years, not recently)

If a paper received more than one award, separate the award codes with a semicolon. You can still do this step after you've done the conversion of IEEEXplore data to vispubdata format. That way you can easily copy the DOIs of the awarded papers from the generated_data file called Vispubdata-Vis.csv if you're adding a new year. If you are adding an older award, search for the DOI on IEEEXplore or vispubdata.
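
A small check of the award field can catch typos before they end up in the data. This sketch only looks at the award string itself (for example "BP;TT"); it makes no assumption about the other columns of awardedPapers.csv, and the names `VALID_AWARDS` and `parse_awards` are hypothetical.

```python
# The four award codes listed above.
VALID_AWARDS = {"BP", "HM", "TT", "BCS"}


def parse_awards(award_field):
    """Split a semicolon-separated award field and reject unknown codes."""
    codes = [code.strip() for code in award_field.split(";") if code.strip()]
    unknown = [code for code in codes if code not in VALID_AWARDS]
    if unknown:
        raise ValueError(f"Unknown award code(s): {unknown}")
    return codes
```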

### 4) Extract a VIS Authors file from DBLP
To do this go to the folder in this repo called "dblp-data-extraction" and follow the instructions in the ParseDBLP-VIS-Authors notebook. 



### Libraries 
Make sure to have these libraries. Install them if not. Run the code below before starting to check

In [12]:
import urllib.request, json 
import requests
import pandas as pd
import os
from pathlib import Path
import numpy as np
import re
import json
import sys
import csv
from crossref.restful import Works, Etiquette


with open('ieeexplore-apikey.txt') as f:
    apikey = f.readline().strip() #strip the trailing newline so it does not end up inside the request URL

youremail = "petra.isenberg@inria.fr" #replace this with your own email address. This is required for querying CrossRef



## Download the paper metadata from IEEEXplore

These next code sections will download and check that there is a json file with data about papers from each year. If your data files are up to date you can skip this section.

In [13]:


#set whether or not to download all files again or just check for new ones
# --> if you want updated download counts, then set this to true
renewall = True 

xplore_numbers = pd.read_csv("xplore-isnumbers.csv", dtype=object) #read everything as strings
print(xplore_numbers.head(60))


    year  isnumber    conference     json_filename
0   1990      3914      Vis-conf      articles-vis
1   1991      4467      Vis-conf      articles-vis
2   1992      6054      Vis-conf      articles-vis
3   1993      9012      Vis-conf      articles-vis
4   1994      8039      Vis-conf      articles-vis
5   1995     10213      Vis-conf      articles-vis
6   1995     11604  InfoVis-conf  articles-infovis
7   1996     12277      Vis-conf      articles-vis
8   1996     12180  InfoVis-conf  articles-infovis
9   1997     14360      Vis-conf      articles-vis
10  1997     13801  InfoVis-conf  articles-infovis
11  1998     16079      Vis-conf      articles-vis
12  1998     15744  InfoVis-conf  articles-infovis
13  1999     17553      Vis-conf      articles-vis
14  1999     17394  InfoVis-conf  articles-infovis
15  2000     19150      Vis-conf      articles-vis
16  2000     19138  InfoVis-conf  articles-infovis
17  2001     20824      Vis-conf      articles-vis
18  2001     20791  InfoVis-con

##### Check if we have all files we need

In [14]:
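for index, row in xplore_numbers.iterrows():
    year = row['year']
    if not pd.isnull(row['json_filename']):
        filename = row['json_filename']+".json"
        
        #the download step below stores the files in <year>/original_data/
        my_file = Path(year+"/original_data/"+filename)
        #write through .at: assigning to the row returned by iterrows() may only change a copy
        if renewall or not my_file.is_file():
            xplore_numbers.at[index,'json_filename'] = np.nan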

##### Download files

In [15]:
#next we download all the files we don't have  or we want to renew  
baseurl =  "https://ieeexploreapi.ieee.org/api/v1/search/articles?parameter&apikey="+apikey+"&max_records=200&is_number="

for index, row in xplore_numbers.iterrows():
    year = row['year']
    print(year + " " + row['conference'])
    
    if pd.isnull(row['json_filename']):
        fullurl = baseurl + row['isnumber']
        
        filename = "articles-SI"
        if(row['conference'] == "InfoVis-conf"): 
            filename = "articles-infovis"
        elif (row['conference'] == 'VAST-conf'):
            filename = "articles-vast"
        elif (row['conference'] == 'SciVis-conf'):
            filename = "articles-scivis"
        elif (row['conference'] == 'Vis-conf'):
            filename = "articles-vis"
            
        full_filename = year + "/original_data/"+filename + ".json"
        
        # Create target Directory if it doesn't exist
        directory = year+"/original_data"
        if not os.path.exists(directory):
            os.makedirs(directory) #also creates the year folder if it is missing
            print("Directory " , directory ,  " Created ")
        else:    
            print("Directory " , directory ,  " already exists")
        
        resp = requests.get(fullurl, verify=True)
        resp.raise_for_status() #stop here instead of saving an HTTP error page as .json
        with open(full_filename, 'w+', encoding="utf-8") as f:
            f.write(resp.text)
        
        xplore_numbers.at[index,'json_filename'] = filename

xplore_numbers.to_csv("xplore-isnumbers.csv",index=False)

1990 Vis-conf
Directory  1990/original_data  already exists
1991 Vis-conf
Directory  1991/original_data  already exists
1992 Vis-conf
Directory  1992/original_data  already exists
1993 Vis-conf
Directory  1993/original_data  already exists
1994 Vis-conf
Directory  1994/original_data  already exists
1995 Vis-conf
Directory  1995/original_data  already exists
1995 InfoVis-conf
Directory  1995/original_data  already exists
1996 Vis-conf
Directory  1996/original_data  already exists
1996 InfoVis-conf
Directory  1996/original_data  already exists
1997 Vis-conf
Directory  1997/original_data  already exists
1997 InfoVis-conf
Directory  1997/original_data  already exists
1998 Vis-conf
Directory  1998/original_data  already exists
1998 InfoVis-conf
Directory  1998/original_data  already exists
1999 Vis-conf
Directory  1999/original_data  already exists
1999 InfoVis-conf
Directory  1999/original_data  already exists
2000 Vis-conf
Directory  2000/original_data  already exists
2000 InfoVis-conf
Di
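
A response that is not valid JSON (for example an error page) would break the conversion step that follows. As a hedged sketch, the request URL can be built in one place and a response saved only if it parses as JSON and contains an "articles" list, which is the node the conversion code reads. The helper names `build_issue_url`, `save_if_valid_json`, and `download_issue` are hypothetical; the URL pattern is the one used in the download cell, and the sketch uses only the standard library.

```python
import json
import urllib.request
from pathlib import Path

BASE_URL = ("https://ieeexploreapi.ieee.org/api/v1/search/articles"
            "?parameter&apikey={apikey}&max_records=200&is_number={isnumber}")


def build_issue_url(apikey, isnumber):
    """Build the request URL; strip() guards against a newline read from the key file."""
    return BASE_URL.format(apikey=apikey.strip(), isnumber=isnumber)


def save_if_valid_json(text, target):
    """Write text to target only if it is JSON with an 'articles' list. Returns True if saved."""
    try:
        payload = json.loads(text)
    except json.JSONDecodeError:
        return False
    if not isinstance(payload, dict) or not isinstance(payload.get("articles"), list):
        return False
    target = Path(target)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(text, encoding="utf-8")
    return True


def download_issue(apikey, isnumber, target):
    """Fetch one issue and save it if it looks valid (needs network access and a valid key)."""
    with urllib.request.urlopen(build_issue_url(apikey, isnumber), timeout=30) as resp:
        text = resp.read().decode("utf-8")
    return save_if_valid_json(text, target)
```

Whether an empty issue should count as valid is a design choice; here an empty "articles" list is still accepted.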

## Convert downloaded paper metadata from json to csv

Here we convert the .json file into something more flat and similar to vispubdata. Run this code if you updated the .json files above.

This code comes from here:
https://github.com/vinay20045/json-to-csv/blob/master/json_to_csv.py


In [16]:
##
# Convert to string keeping encoding in mind...
##
def to_string(s):
    
    try:
        return str(s)
    except:
        #Change the encoding type if needed
        return s.encode('utf-8')

    
def reduce_item(key, value):
    global reduced_item
    
    #Reduction Condition 1
    if type(value) is list:
        i=0
        for sub_item in value:
            reduce_item(key+'_'+to_string(i), sub_item)
            i=i+1

    #Reduction Condition 2
    elif type(value) is dict:
        sub_keys = value.keys()
        for sub_key in sub_keys:
            reduce_item(key+'_'+to_string(sub_key), value[sub_key])
    
    #Base Condition
    else:
        reduced_item[to_string(key)] = to_string(value)

        

for index, row in xplore_numbers.iterrows():
    year = row['year']
    
    if not pd.isnull(row['json_filename']):
        node = "articles"
        
        generated_data_path = year +"/generated_data/"
        pathExists = os.path.exists(generated_data_path)
        if not pathExists:
            os.makedirs(generated_data_path)
        
        json_file_path = year + "/original_data/" + row['json_filename']+".json"
        csv_file_path = generated_data_path + row['json_filename']+".csv"
        
        #print ("Now working on "+csv_file_path)
        
        #TODO: We could check if the .csv file already exists and then decide not to do anything

        #fp = open(json_file_path, 'r',encoding="utf8")
        with open(json_file_path, mode = 'rb') as fp: #the with-block closes the file again
            raw_data = json.loads(fp.read())

        try:
            data_to_be_processed = raw_data[node]
        except (KeyError, TypeError): #no "articles" node: fall back to the whole document
            data_to_be_processed = raw_data

        processed_data = []
        header = []
        for item in data_to_be_processed:
            reduced_item = {}
            reduce_item(node, item)

            header += reduced_item.keys()

            processed_data.append(reduced_item)

        header = list(set(header))
        header.sort()

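        with open(csv_file_path, 'w+',encoding="utf8",newline="") as f: #newline="" is what the csv module expects; it avoids blank lines on Windows
            writer = csv.DictWriter(f, header, dialect='excel')
            writer.writeheader()
            for processed_row in processed_data: #separate name so the outer loop's row is not overwritten
                writer.writerow(processed_row)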

        #print ("Just completed writing "+csv_file_path+" file with %d columns" % len(header))
        
    else:
        print("We are missing a filename for: " + year + " " + row['conference']+ ". Did you run the previous step already?")
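
To see what the flattening produces, here is a small self-contained sketch. It uses a variant of reduce_item that returns a dictionary instead of writing to a global variable. The function name `flatten` and the sample record are made up for illustration and are not taken from a real IEEEXplore response.

```python
def flatten(key, value, out=None):
    """Flatten nested lists and dicts into one dict with underscore-joined keys."""
    if out is None:
        out = {}
    if isinstance(value, list):
        # List items get their position appended to the key.
        for i, sub_item in enumerate(value):
            flatten(f"{key}_{i}", sub_item, out)
    elif isinstance(value, dict):
        # Dict entries get their own key appended.
        for sub_key, sub_value in value.items():
            flatten(f"{key}_{sub_key}", sub_value, out)
    else:
        # Leaves are stored as strings.
        out[str(key)] = str(value)
    return out


# Illustrative record only (hypothetical data).
record = {
    "title": "A Paper",
    "authors": {"authors": [
        {"full_name": "A. Author", "author_order": 1},
        {"full_name": "B. Author", "author_order": 2},
    ]},
}
flat = flatten("articles", record)
for column, value in flat.items():
    print(column, "=", value)
```

Each author ends up in its own numbered column, such as `articles_authors_authors_0_full_name`, which is why later steps collect columns whose names contain "full_name" and sort them by the number in the name.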



## Convert IEEEXplore csvs to Vispubdata

Here we convert the data from the .csv files generated in the previous step into the vispubdata format



#### Read the latest vispubdata

In [32]:
# Google Sheet URL
vispubdata_sheet_url = "https://docs.google.com/spreadsheets/d/1xgoOPu28dQSSGPIp_HHQs0uvvcyLNdkMF9XtRajhhxU/gviz/tq?tqx=out:csv"

# Read the CSV into a DataFrame
vispubdata = pd.read_csv(vispubdata_sheet_url)

print(vispubdata.columns)

#if you ever want to introduce new columns you could try to do it here. I usually add them on the Google sheet. 
final_columns = vispubdata.columns
#final_columns = ['Conference', 'Year', 'Title', 'DOI', 'Link', 'FirstPage','LastPage','PaperType','Abstract','AuthorNames-Deduped','AuthorNames','AuthorAffiliation','InternalReferences','AuthorKeywords','AminerCitationCount','CitationCount_CrossRef','PubsCited_CrossRef',' Downloads_Xplore','Award','GraphicsReplicabilityStamp']

#double check that the names are correct here
crossRefCitation_column = "CitationCount_CrossRef"
crossRefPubsCited_column = 'PubsCited_CrossRef'
downloads_column = "Downloads_Xplore"


Index(['Conference', 'Year', 'Title', 'DOI', 'Link', 'FirstPage', 'LastPage',
       'PaperType', 'Abstract', 'AuthorNames-Deduped', 'AuthorNames',
       'AuthorAffiliation', 'InternalReferences', 'AuthorKeywords',
       'AminerCitationCount', 'CitationCount_CrossRef', 'PubsCited_CrossRef',
       'Downloads_Xplore', 'Award', 'GraphicsReplicabilityStamp'],
      dtype='object')


#### Helper methods

In [33]:
#we need this later to sort the author columns by the number hidden in its name  
def num_sort(test_string):
    return list(map(int, re.findall(r'\d+', test_string)))[0]
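
The reason for sorting by the embedded number rather than alphabetically shows up as soon as a paper has ten or more authors. A short demonstration follows; the column names are illustrative, and num_sort is repeated so the snippet runs on its own.

```python
import re


def num_sort(test_string):
    # Use the first number found in the string as the sort key.
    return list(map(int, re.findall(r'\d+', test_string)))[0]


cols = [
    "articles_authors_authors_10_full_name",
    "articles_authors_authors_2_full_name",
    "articles_authors_authors_0_full_name",
]

sorted_plain = sorted(cols)                  # alphabetical: 0, 10, 2
sorted_numeric = sorted(cols, key=num_sort)  # numeric: 0, 2, 10

print(sorted_plain)
print(sorted_numeric)
```

Note that num_sort raises an IndexError for a string without any digits, so it should only be applied to numbered columns.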

In [34]:
#Here we prepare the data structure that will resemble the final vispubdata table
def prepareXploreDFTable(xplore_df):
    
    #get all column names
    columns = xplore_df.columns

    #remove all the columns we don't need
    #-------------------------------------------------------------
    xplore_df.drop('articles_access_type', axis=1, inplace=True,errors='ignore')
    xplore_df.drop('articles_abstract_url', axis=1, inplace=True,errors='ignore')
    xplore_df.drop('articles_publisher', axis=1, inplace=True,errors='ignore')
    xplore_df.drop('articles_article_number', axis=1, inplace=True,errors='ignore')
    xplore_df.drop('articles_volume', axis=1, inplace=True,errors='ignore')
    xplore_df.drop('articles_rank', axis=1, inplace=True,errors='ignore')
    xplore_df.drop('articles_publication_number', axis=1, inplace=True,errors='ignore')
    xplore_df.drop('articles_publication_date', axis=1, inplace=True,errors='ignore')
    xplore_df.drop('articles_pdf_url', axis=1, inplace=True,errors='ignore')
    xplore_df.drop('articles_partnum', axis=1, inplace=True,errors='ignore')
    xplore_df.drop('articles_issue', axis=1, inplace=True,errors='ignore')
    xplore_df.drop('articles_issn', axis=1, inplace=True,errors='ignore')
    xplore_df.drop('articles_is_number', axis=1, inplace=True,errors='ignore')
    xplore_df.drop('articles_html_url', axis=1, inplace=True,errors='ignore')
    xplore_df.drop('articles_publication_title', axis=1, inplace=True,errors='ignore')
    xplore_df.drop('articles_citing_patent_count', axis=1, inplace=True,errors='ignore')
    xplore_df.drop('articles_conference_location', axis=1, inplace=True,errors='ignore')
    xplore_df.drop('articles_conference_dates', axis=1, inplace=True,errors='ignore')
 

    #WARNING: I assume the file is correctly ordered. If not we need to do something more fancy
    columns_to_drop = columns[columns.str.contains("author_order")] 
    xplore_df.drop(columns_to_drop, axis=1, inplace=True)

    columns_to_drop = columns[columns.str.contains("ieee_terms")] 
    xplore_df.drop(columns_to_drop, axis=1, inplace=True)
    
    columns_to_drop = columns[columns.str.contains("_id")] 
    xplore_df.drop(columns_to_drop, axis=1, inplace=True)
    
    columns_to_drop = columns[columns.str.contains("authorUrl")] 
    xplore_df.drop(columns_to_drop, axis=1, inplace=True)
    
    columns_to_drop = columns[columns.str.contains("isbn")] 
    xplore_df.drop(columns_to_drop, axis=1, inplace=True)
    

    #rename the columns we want to keep
    xplore_df.rename(index=str, inplace=True, columns={"articles_title":"Title",
                                                       "articles_start_page":"FirstPage",
                                                       "articles_abstract":"Abstract",
                                                       "articles_publication_year": "Year",
                                                       "articles_content_type": "PaperType",
                                                       "articles_end_page":"LastPage",
                                                       "articles_doi":"DOI",
                                                       "articles_citing_paper_count":"CitationCount_CrossRef",
                                                       "articles_download_count":"Downloads_Xplore"})


    #put all author full names together
    #----------------------------------------
    author_columns = columns[columns.str.contains("full_name")].tolist()
    author_columns.sort(key=num_sort)
    
    #no fillna needed here: the lambda below drops missing values before joining
    authors = xplore_df[author_columns].apply(lambda x: ';'.join(x.dropna().values.tolist()), axis=1)
    authors = authors.str.rstrip(";")
    #we have the authors put together, now add them to the df
    xplore_df['AuthorNames'] = authors
    #now remove all the individual columns that we no longer need
    xplore_df.drop(author_columns, axis=1, inplace=True)

    #put all affiliations together
    #------------------------------------------
    ##We have to do something more complicated for the affiliations because the csv file does not contain an affiliation column if none of the authors in a given position has an affiliation
    
    #Careful. IEEEXplore seems to change the way it handles the affiliations. Their json changes each year. 
    #Also, the original json can now handle multiple affiliations - todo for the future. Requires an update to vispubdata.
    
    xplore_df['AuthorAffiliation'] = "" #make an empty column first
    xplore_df["AuthorCount"] =  xplore_df['AuthorNames'].str.count(';') + 1
    
    for index, row in xplore_df.iterrows():
        #we need to find out how many authors a paper has first
        authorCount = row["AuthorCount"]
        affiliations = ""
        for i in range(0,authorCount):
            author_column_name = author_columns[i]
            affcolumn = author_column_name.replace("full_name","affiliation")
            if affcolumn in xplore_df.columns:
                #we're doing this a few too many times here but we don't care for speed, yet
                affiliations = affiliations + ";" + row[affcolumn]
            else:
                affiliations = affiliations + ";" + ""
        affiliations = affiliations[1:]
        #print(affiliations)
        xplore_df.at[index,'AuthorAffiliation'] = affiliations
    
    
    xplore_df.drop(["AuthorCount"], axis=1, inplace=True)
    
    affiliation_columns = columns[columns.str.contains("affiliation")]
    #now remove all the individual columns that we no longer need
    xplore_df.drop(affiliation_columns, axis=1, inplace=True)

    #put all author keywords together
    #------------------------------------------
    kw_columns = columns[columns.str.contains("author_terms")]
    #no fillna needed here: the lambda below drops missing values before joining
    keywords = xplore_df[kw_columns].apply(lambda x: ','.join(x.dropna().values.tolist()), axis=1)
    #we have the keywords put together, now add them to the df
    xplore_df['AuthorKeywords'] = keywords
    #now remove all the individual columns that we no longer need
    xplore_df.drop(kw_columns, axis=1, inplace=True)

    #create the link column
    #--------------------------------------------
    xplore_df["Link"] = 'http://dx.doi.org/' + xplore_df["DOI"]

    #now add columns that are missing
    xplore_df_columns = xplore_df.columns
    
    for c in final_columns:
        if not c in xplore_df_columns:
            if c == "Conference":
                xplore_df["Conference"] = ["conference_external"] * len(xplore_df.index)
            else:
                xplore_df[c] = [""] * len(xplore_df.index)
    
    for c in xplore_df_columns:
        if not c in final_columns:
            #we remove all columns that we haven't captured yet
            xplore_df.drop([c], axis=1, inplace=True)
            
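    #now reorder the columns
    #note: this only rebinds the local name and nothing is returned, so the caller's frame keeps
    #its order; checkConferenceTitleFiles applies final_columns again before writing
    xplore_df = xplore_df[final_columns]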


In [35]:
def checkConferenceTitleFiles(conf_title_file,xplore_df,year_to_import,conference_external):
    
    #For a couple of years InfoVis, SciVis, and VAST were all published in the same special issue of TVCG. 
    #In order to separate them we need to check which title in that special issue belongs to which conference.
    

    conf_titles = pd.read_csv(conf_title_file,header=None,names=['Title'],sep="\t",encoding ='utf-8',quoting=csv.QUOTE_NONE)
    conf_titles = conf_titles['Title'].str.lower()
    titles_not_found = [ ]
    reasons_found = 0
    xplore_df = xplore_df[final_columns]
    xPloreTitles = xplore_df['Title'].str.lower()

    reasonsFileName = str(year_to_import)+"/missingTitles-reason.txt"
    
    for t in conf_titles:
        #we should just get the titles from df that match the conference
        #now check if the title exists in the read out data
        found = (xPloreTitles == t).any() #True if at least one downloaded title matches exactly
            
                
        if found:
            #rowindex = pd.Index(xPloreTitles).get_loc(t)
            #xplore_df.at[str(rowindex),'Conference'] = conference_external.replace('-conf', '')
            confname = conference_external.replace('-conf', '')
            xplore_df.loc[xplore_df['Title'].str.lower() == t,"Conference"] = confname
            
        else:
            titles_not_found.append(t)
            if(os.path.isfile(reasonsFileName)):
                reason_df = pd.read_csv(reasonsFileName,dtype=object, sep="\t",encoding ='utf-8',keep_default_na=False)
                reasontitles = reason_df["Title"].str.lower().tolist()
                if t in reasontitles:
                    reasons_found = reasons_found + 1
                else:
                    print("No reason found for: "+t)

    
    if len(titles_not_found) > 0:
        print("For "+str(year_to_import)+ " " + str(conference_external) + " I couldn't find " + str(len(titles_not_found)) + " titles. "+str(reasons_found) + " reasons for missing titles were recorded.")
    pd.DataFrame({'Conference':conference_external,'Title':titles_not_found}).to_csv(year_to_import+'/generated_data/TitlesNotFound-'+conference_external+'.csv',index=False)
    
    xplore_df.to_csv(year_to_import+"/generated_data/Vispubdata-"+conference_external+".csv",index =False)
    

In [36]:
def workOnYear(year_to_import,csv_filename,conference_external):

    #we'll get the .csv file for the conference and year we're interested in
    path = year_to_import+"/generated_data/"+csv_filename

    #double-checking that the file exists
    if(os.path.isfile(path)):
        print("Your file " + path + " exists")
    else:
        print("Your file " + path + " does not exist")

    #print(path)
    
    #now we load it
    xplore_df = pd.read_csv(path,dtype=object, encoding ='utf-8',keep_default_na=False,dialect='excel')
    #now prepare it
    prepareXploreDFTable(xplore_df)
    
    
    ###Now we have the xplore_df file ready for this particular year and conference
    
    conf_title_file = year_to_import+"/"+conference_external+"Titles.txt"
    
    
    
    if Path(conf_title_file).is_file():
        checkConferenceTitleFiles(conf_title_file,xplore_df,year_to_import,conference_external)

    else:
        #print("Title file: " + conf_title_file + " not found")
        
        #Most likely we've encountered the special issue link that we need to split into InfoVis, Vis, SciVis, or VAST
        if "TVCGSI" in conference_external:
            possibleConferences = ["InfoVis","SciVis","VAST"]
            titleFilesFound = []
            filesChecked = 0
            
            #here we check if this special issue contained papers from InfoVis, VAST, or SciVis
            for pc in possibleConferences:
                conf_title_file = year_to_import+"/"+pc+"Titles.txt"
                if Path(conf_title_file).is_file():
                    checkConferenceTitleFiles(conf_title_file,xplore_df,year_to_import,pc)
                    filesChecked = filesChecked + 1
                    
            #now we check if this special issue is from the combined conference that no longer has these sub-conferences
            conf_title_file = year_to_import+"/VisTitles.txt" #from 2021 onwards
            if Path(conf_title_file).is_file():
                checkConferenceTitleFiles(conf_title_file,xplore_df,year_to_import,"Vis")
                filesChecked = filesChecked + 1
            
            if filesChecked == 0:
                print("Warning: For this special issue there is no file with paper titles!")
                    
            
        else:
            print("Some problem here. I don't know what to do with " + conf_title_file)
        
    print("------------------------")
        

#### Start the conversion

In [37]:
#lets work on all years

xplore_numbers = pd.read_csv("xplore-isnumbers.csv", dtype=object) #read everything as strings

for index, row in xplore_numbers.iterrows():
    year = row['year']
    issue_number = row['isnumber']
    conference_external = row['conference']

    csv_filename = xplore_numbers.loc[xplore_numbers['isnumber'] == issue_number, 'json_filename']+".csv"
    csv_filename = csv_filename.iloc[0]
    
    print("Looking for: " + conference_external + " " + year + " at " + csv_filename)
    
    workOnYear(year,csv_filename,conference_external)

Looking for: Vis-conf 1990 at articles-vis.csv
Your file 1990/generated_data/articles-vis.csv exists
------------------------
Looking for: Vis-conf 1991 at articles-vis.csv
Your file 1991/generated_data/articles-vis.csv exists
------------------------
Looking for: Vis-conf 1992 at articles-vis.csv
Your file 1992/generated_data/articles-vis.csv exists
------------------------
Looking for: Vis-conf 1993 at articles-vis.csv
Your file 1993/generated_data/articles-vis.csv exists
------------------------
Looking for: Vis-conf 1994 at articles-vis.csv
Your file 1994/generated_data/articles-vis.csv exists
------------------------
Looking for: Vis-conf 1995 at articles-vis.csv
Your file 1995/generated_data/articles-vis.csv exists
------------------------
Looking for: InfoVis-conf 1995 at articles-infovis.csv
Your file 1995/generated_data/articles-infovis.csv exists
------------------------
Looking for: Vis-conf 1996 at articles-vis.csv
Your file 1996/generated_data/articles-vis.csv exists
-----

### Double-check correctness of the data manually

#### Check for titles that were not found
Now we have the data in vispubdata format. Please read the output above. Each time a specific title hasn't been found, a reason should have been recorded. If no reason has been recorded, then there is an error either in the IEEEXplore DL or in our list of titles, and you need to check manually where the paper is. The main reference should be the paper itself: try to find it on the DL and open the pdf. You can find the titles from our list that haven't been found by going to the year's generated_data folder and opening the respective **TitlesNotFound-[conference].csv** file (for example, TitlesNotFound-Vis.csv). Once you have found out where the problem is, create a file called **missingTitles-reason.txt** in the folder of that year. The header is "Title", "DOI", "Reason", all tab separated. On line 2, copy the title from our titles list (which has to be the correct one), add the DOI of the paper, and explain why the title is different from the one in the IEEE DL. You can follow the example in the 2018 folder.
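
Writing the reasons file by hand makes it easy to slip in spaces instead of tabs. The following sketch appends one tab-separated entry and writes the header when the file is new. The helper name `record_reason` and its parameters are hypothetical, and the DOI in the usage note is a placeholder, not a real paper.

```python
import csv
from pathlib import Path


def record_reason(year, title, doi, reason, base_dir="."):
    """Append one Title/DOI/Reason entry to <base_dir>/<year>/missingTitles-reason.txt."""
    path = Path(base_dir) / str(year) / "missingTitles-reason.txt"
    path.parent.mkdir(parents=True, exist_ok=True)
    is_new = not path.exists()
    with path.open("a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        if is_new:
            # Header expected by the title check: Title, DOI, Reason.
            writer.writerow(["Title", "DOI", "Reason"])
        writer.writerow([title, doi, reason])
    return path
```

A call could look like `record_reason(2023, "Correct Title From Our List", "10.0000/placeholder", "Title misspelled in the DL")`.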

#### Check if we missed any titles in our list
Go to the generated_data folder of the year that you are trying to add. Open it. Check all the papers that have a **conference_external** label in the first column. If you suspect that they might be actual IEEE VIS papers, then investigate why their title is not in your list. Reasons can be:
- Many of the entries labeled as conference_external will be editorials, the list of PC members, the OC, etc. You can ignore them
- Every once in a while there is something that looks like an actual paper. Don't assume immediately that you missed the paper. Check the program on ieeevis.org for the year you are adding. In the last couple of years, there has been the occasional best paper from an associated event such as LDAV or VDS included in the special issue of TVCG that also includes all VIS papers. We don't want these papers in vispubdata, so ignore them. If you find one of these, create a file called "non-vis titles in SI.txt" in the main folder of the year you are adding and add the paper title to it. This helps other people avoid looking for this paper again.
- If, however, you find an actual Vis paper with the conference_external label, it means that you didn't have its title in the "VisTitles.txt" file. Add it and rerun the code above.

Note, if you are re-checking older years, there is a peculiarity: The special issue contained papers from all 3 conferences (InfoVis, SciVis, and VAST) - so you will see a lot of conference_external labels in the respective files. So you'll have to do a merge of the three vispubdata files and find out what is consistently labeled as conference_external. I haven't written code for that yet.
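
One possible way to do that merge is sketched here. It assumes that each per-conference file (Vispubdata-InfoVis.csv, Vispubdata-SciVis.csv, Vispubdata-VAST.csv) lists every entry of the special issue and has the Conference, Title, and DOI columns. The helper name `consistently_external` is hypothetical and the sketch has not been run against the real data.

```python
import csv
from pathlib import Path


def consistently_external(csv_paths, label="conference_external"):
    """Return the (DOI, Title) pairs that carry `label` in every given file."""
    common = None
    for path in csv_paths:
        with Path(path).open(newline="", encoding="utf-8") as f:
            external = {
                (row["DOI"], row["Title"])
                for row in csv.DictReader(f)
                if row["Conference"] == label
            }
        # Keep only entries that were external in all files seen so far.
        common = external if common is None else common & external
    return sorted(common or [])
```

Entries returned by this function were not claimed by any of the three title lists, so they are the ones worth checking by hand.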

#### Fixing the publication year

If you are adding a new special issue published after 2015, the publication year has to be corrected. Since 2015, VIS papers have been published in the first issue of the following calendar year of IEEE TVCG, but in vispubdata we record papers by the year they were presented at the conference and not by when they appeared in TVCG. For simplicity, we set the Year column of every file to the year of the folder the data is in.




In [38]:
for root, dirs, files in os.walk("./"):
    for file in files:
        
        filepath = os.path.join(root, file)
        if os.path.basename(root) == "generated_data" and file.startswith("Vispubdata-"): #works with both / and \ as path separator
            year = root[2:6]  #this may be a source of error in other operating systems. To check...
            df = pd.read_csv(filepath,dtype=object, encoding ='utf-8',keep_default_na=False)
            df['Year'] = year
            df.to_csv(filepath,index =False)
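
A pathlib-based way to enumerate the same files and derive the year from the folder name is sketched below. It assumes that the year folders are four-digit names directly under the working directory; the helper name `find_vispubdata_files` is hypothetical.

```python
from pathlib import Path


def find_vispubdata_files(base_dir="."):
    """Yield (year, path) for every <year>/generated_data/Vispubdata-*.csv file."""
    pattern = "*/generated_data/Vispubdata-*.csv"
    for path in sorted(Path(base_dir).glob(pattern)):
        # The folder two levels up from the file is the year folder.
        year = path.parent.parent.name
        if year.isdigit() and len(year) == 4:
            yield year, path
```

With this, the loop body becomes `for year, filepath in find_vispubdata_files(): ...`, and no string slicing of the path is needed.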


#### Fixing the publication type

If you are adding VIS papers that are part of a special issue of TVCG, then we can simply replace the paper type "Journals" with "J" (= vispubdata notation). If, however, you are adding conference papers or posters, you will likely need to fix this data manually, since IEEEXplore tags both as "Conferences" while vispubdata marks posters, panels, VAST challenges, etc. as "M".


In [39]:
for root, dirs, files in os.walk("./"):
    for file in files:
        
        filepath = os.path.join(root, file)
        if os.path.basename(root) == "generated_data" and file.startswith("Vispubdata-"): #works with both / and \ as path separator
            year = root[2:6]  #this may be a source of error in other operating systems. To check...
            df = pd.read_csv(filepath,dtype=object, encoding ='utf-8',keep_default_na=False)
            df.loc[df['PaperType'] == 'Journals', 'PaperType'] = "J"
            df.to_csv(filepath,index =False)

## Make sure that you have done all the steps above

Next, we automatically remove all entries with the conference_external label. So it's really important that you finished careful error checking at this point.

In [40]:

years = xplore_numbers.year.unique()
dataframes = []

for root, dirs, files in os.walk("./"):
    for file in files:
        filepath = os.path.join(root, file)
        if os.path.basename(root) == "generated_data" and file.startswith("Vispubdata-"): #works with both / and \ as path separator
            df = pd.read_csv(filepath,dtype=object, encoding ='utf-8',keep_default_na=False)
            df.drop(df[df.Conference == "conference_external"].index,inplace=True)
            df.to_csv(filepath,index =False)
            dataframes.append(df)
            
#this df will hold all IEEE VIS papers with data from IEEEXplore. Note that the online version of vispubdata contains error fixes that exist only in the spreadsheet,
#so NEVER just copy this df over onto vispubdata. We should only copy over certain columns, such as the ones about downloads
ieeexplore_vispub_df = pd.concat(dataframes, ignore_index=True)   
            
ieeexplore_vispub_df.info()

ieeexplore_vispub_df.to_csv("results/DEBUG_ieeexplore_vispub_df.csv")

#Petra: I usually take a quick scan through this DEBUG-file to see if everything looks more or less ok
            
            


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3747 entries, 0 to 3746
Data columns (total 20 columns):
 #   Column                      Non-Null Count  Dtype 
---  ------                      --------------  ----- 
 0   Conference                  3747 non-null   object
 1   Year                        3747 non-null   object
 2   Title                       3747 non-null   object
 3   DOI                         3747 non-null   object
 4   Link                        3747 non-null   object
 5   FirstPage                   3747 non-null   object
 6   LastPage                    3747 non-null   object
 7   PaperType                   3747 non-null   object
 8   Abstract                    3747 non-null   object
 9   AuthorNames-Deduped         3747 non-null   object
 10  AuthorNames                 3747 non-null   object
 11  AuthorAffiliation           3747 non-null   object
 12  InternalReferences          3747 non-null   object
 13  AuthorKeywords              3747 non-null   obje

Now we merge the current vispubdata with the new table.
We keep our current version of vispubdata, add the new papers, and then copy over the new citation and download counts.

In [41]:
###################CONTINUE HERE#################################
#somewhere above we've loaded the current vispubdata into a dataframe
vispubdata #the current version of vispubdata
ieeexplore_vispub_df #all vis papers with data from IEEEXplore -> that is, some papers will be missing but there will be some new ones in here

#we first add the new papers to vispubdata
new_papers = ieeexplore_vispub_df[~ieeexplore_vispub_df['DOI'].isin(vispubdata['DOI'])]

print(new_papers['Year']) #hopefully we're only adding papers from one new year here

vispubdata_new = pd.concat([vispubdata,new_papers],ignore_index = True)


#print(vispubdata_new.info())
vispubdata_new.to_csv("results/DEBUG_vispubdata_new.csv")

vispubdata_new.head()

0       1990
1       1990
2       1990
3       1990
4       1990
        ... 
3742    2023
3743    2023
3744    2023
3745    2023
3746    2023
Name: Year, Length: 3747, dtype: object


Unnamed: 0,Conference,Year,Title,DOI,Link,FirstPage,LastPage,PaperType,Abstract,AuthorNames-Deduped,AuthorNames,AuthorAffiliation,InternalReferences,AuthorKeywords,AminerCitationCount,CitationCount_CrossRef,PubsCited_CrossRef,Downloads_Xplore,Award,GraphicsReplicabilityStamp
0,Vis,2022,Photosensitive Accessibility for Interactive D...,10.1109/tvcg.2022.3209359,http://dx.doi.org/10.1109/TVCG.2022.3209359,374.0,384.0,J,Accessibility guidelines place restrictions on...,Laura South;Michelle A. Borkin,Laura South;Michelle A. Borkin,"Northeastern University, USA;Northeastern Univ...",0.1109/tvcg.2011.185;10.1109/tvcg.2021.3114829...,"accessibility,photosensitive epilepsy,photosen...",,4.0,63.0,554.0,,
1,Vis,2022,HetVis: A Visual Analysis Approach for Identif...,10.1109/tvcg.2022.3209347,http://dx.doi.org/10.1109/TVCG.2022.3209347,310.0,319.0,J,Horizontal federated learning (HFL) enables di...,Xumeng Wang;Wei Chen 0001;Jiazhi Xia;Zhen Wen;...,Xumeng Wang;Wei Chen;Jiazhi Xia;Zhen Wen;Rongc...,"TMCC, CS, Nankai University, China;State Key L...",0.1109/tvcg.2015.2467618;10.1109/tvcg.2019.293...,"Federated learning,data heterogeneity,cluster ...",,10.0,43.0,984.0,,
2,Vis,2022,Rigel: Transforming Tabular Data by Declarativ...,10.1109/tvcg.2022.3209385,http://dx.doi.org/10.1109/TVCG.2022.3209385,128.0,138.0,J,"We present Rigel, an interactive system for ra...",Ran Chen;Di Weng;Yanwei Huang;Xinhuan Shu;Jiay...,Ran Chen;Di Weng;Yanwei Huang;Xinhuan Shu;Jiay...,"State Key Lab of CAD&CG, Zhejiang University, ...",0.1109/tvcg.2021.3114830;10.1109/vast47406.201...,"Data transformation,self-service data transfor...",,6.0,68.0,610.0,,
3,Vis,2022,BeauVis: A Validated Scale for Measuring the A...,10.1109/tvcg.2022.3209390,http://dx.doi.org/10.1109/TVCG.2022.3209390,363.0,373.0,J,We developed and validated a rating scale to a...,Tingying He;Petra Isenberg;Raimund Dachselt;To...,Tingying He;Petra Isenberg;Raimund Dachselt;To...,"Université Paris-Saclay, CNRS, Inria, LISN, Fr...",0.1109/infvis.2005.1532128;10.1109/tvcg.2006.1...,"Aesthetics,aesthetic pleasure,validated scale,...",,7.0,79.0,753.0,,X
4,Vis,2022,NAS-Navigator: Visual Steering for Explainable...,10.1109/tvcg.2022.3209361,http://dx.doi.org/10.1109/TVCG.2022.3209361,299.0,309.0,J,The success of DL can be attributed to hours o...,Anjul Kumar Tyagi;Cong Xie;Klaus Mueller 0001,Anjul Tyagi;Cong Xie;Klaus Mueller,"Computer Science Department, Visual Analytics ...",0.1109/vast.2012.6400490;10.1109/tvcg.2019.293...,"Deep Learning,Neural Network Architecture Sear...",,0.0,63.0,391.0,,
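
The `isin` test above compares DOIs as exact strings, so a DOI stored in lowercase in one table and in uppercase in the other is treated as a new paper. The printed `Year` column above spans 1990 to 2023 rather than a single year, and a case mismatch is one possible cause of that. DOI names are case-insensitive, so one option is to compare on a normalised key. A minimal sketch with toy data (the second DOI is a placeholder, not a real paper):

```python
import pandas as pd

vispubdata = pd.DataFrame({
    "DOI": ["10.1109/TVCG.2022.3209359", "10.1109/TVCG.2022.3209347"],
})
ieeexplore_vispub_df = pd.DataFrame({
    "DOI": ["10.1109/tvcg.2022.3209359", "10.0000/00000001"],
    "Year": ["2022", "2023"],
})

# Compare on a lowercased, whitespace-stripped copy; the stored DOIs stay untouched.
existing_keys = vispubdata["DOI"].str.strip().str.lower()
candidate_keys = ieeexplore_vispub_df["DOI"].str.strip().str.lower()

new_papers = ieeexplore_vispub_df[~candidate_keys.isin(existing_keys)]
print(new_papers)
```

With the exact-string comparison both toy rows would count as new; with the normalised key only the placeholder DOI does.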


### Copy over the download data

The IEEEXplore API reports how many times a paper has been downloaded. We copy these counts over here.




In [42]:
download_df = pd.DataFrame({'DOI':ieeexplore_vispub_df['DOI'],downloads_column:ieeexplore_vispub_df[downloads_column]})
download_df

for index,row in download_df.iterrows():
    downloads = row[downloads_column]
    doi = row['DOI']

    vispubdata_new.loc[vispubdata_new['DOI']== doi,downloads_column] = downloads

#check that now we have download counts updated
vispubdata_new.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7500 entries, 0 to 7499
Data columns (total 20 columns):
 #   Column                      Non-Null Count  Dtype 
---  ------                      --------------  ----- 
 0   Conference                  7500 non-null   object
 1   Year                        7500 non-null   object
 2   Title                       7500 non-null   object
 3   DOI                         7500 non-null   object
 4   Link                        7500 non-null   object
 5   FirstPage                   7350 non-null   object
 6   LastPage                    7135 non-null   object
 7   PaperType                   7500 non-null   object
 8   Abstract                    7430 non-null   object
 9   AuthorNames-Deduped         7498 non-null   object
 10  AuthorNames                 7499 non-null   object
 11  AuthorAffiliation           7494 non-null   object
 12  InternalReferences          6871 non-null   object
 13  AuthorKeywords              6521 non-null   obje
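
The row-by-row loop can also be written as a lookup. A sketch on toy data, assuming each DOI occurs at most once in the IEEEXplore table; `Downloads_Xplore` is the column name shown in the table output earlier and stands in for `downloads_column`:

```python
import pandas as pd

downloads_column = "Downloads_Xplore"

vispubdata_new = pd.DataFrame({
    "DOI": ["10.0000/a", "10.0000/b", "10.0000/c"],
    downloads_column: ["10", "20", "30"],
})
ieeexplore_vispub_df = pd.DataFrame({
    "DOI": ["10.0000/a", "10.0000/c"],
    downloads_column: ["15", "35"],
})

# DOI -> download count; verify_integrity raises if a DOI is duplicated.
lookup = ieeexplore_vispub_df.set_index("DOI", verify_integrity=True)[downloads_column]

# Look up the new count for every row; DOIs unknown to IEEEXplore give NaN ...
updated = vispubdata_new["DOI"].map(lookup)

# ... and those rows keep their previous value.
vispubdata_new[downloads_column] = updated.fillna(vispubdata_new[downloads_column])
print(vispubdata_new)
```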

## Add Awards

Make sure that you have updated **awardedPapers.csv**. See the instructions above.

In [43]:
awardedPapers = pd.read_csv("awardedPapers.csv")
awardedPapers.head()

for index,row in awardedPapers.iterrows():
    award = row['Award']
    doi = row['DOI']

    vispubdata_new.loc[vispubdata_new['DOI']== doi,'Award'] = award
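
The loop silently skips any award whose DOI does not occur in `vispubdata_new`, for example because of a typo in the CSV. A short sketch for listing such rows, using toy data (the DOIs and award codes are illustrative, not taken from the real file):

```python
import pandas as pd

vispubdata_new = pd.DataFrame({
    "DOI": ["10.0000/a", "10.0000/b"],
    "Award": ["", ""],
})
awardedPapers = pd.DataFrame({
    "DOI": ["10.0000/a", "10.0000/zzz"],
    "Award": ["BP", "HM"],
})

# Award rows whose DOI has no counterpart in vispubdata_new.
unmatched = awardedPapers[~awardedPapers["DOI"].isin(vispubdata_new["DOI"])]
print(unmatched)
```

An empty result means every award found its paper. The same check can be applied to the replicability stamp file.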
    

## Add Replicability Stamp

In [44]:
replicablePapers = pd.read_csv("tvcg-dois-with-stamp.csv")
replicablePapers.head()

stampMarker = "X"

for index,row in replicablePapers.iterrows():
    
    doi = row['doi']

    vispubdata_new.loc[vispubdata_new['DOI'] == doi,'GraphicsReplicabilityStamp'] = stampMarker



## Add CrossRef Data



### Download new Citation data.

The output of this step is the file **results/citations_vispubdata.csv**. You can skip this section if that file is reasonably recent, since the next chunk of code takes around 30 minutes to run. The subsequent steps read the file back in.

In [32]:

my_etiquette = Etiquette('Vispubdata', '9.02', 'https://sites.google.com/site/vispubdata/home', youremail)
str(my_etiquette)

'Vispubdata/9.02 (https://sites.google.com/site/vispubdata/home; mailto:petra.isenberg@inria.fr) BasedOn: CrossrefAPI/1.5.0'

In [34]:
works = Works(etiquette=my_etiquette)

crossRefCitations = [ ]
crossRefPubsCited = [ ]
dois = [ ] 

print("Starting to work on the citation counts")

for index,row in vispubdata_new.iterrows():
    doi = row['DOI']
    print(str(index) + " " + doi)
    
    paper = works.doi(doi)
    
    if (paper is None):
        print("CrossRef does not know: " + doi)
        dois.append(doi)
        crossRefCitations.append("")
        crossRefPubsCited.append("")
        continue
        
    isreferencedby = paper['is-referenced-by-count']
    
    if (isreferencedby is None):
        isreferencedby = ""
    
    references = paper['references-count']
    
    if (references is None):
        references = ""
        
    crossRefCitations.append(isreferencedby)
    crossRefPubsCited.append(references)
    dois.append(doi)


citationdf = pd.DataFrame({'DOI':dois,crossRefCitation_column:crossRefCitations,crossRefPubsCited_column:crossRefPubsCited})
citationdf.to_csv("results/citations_vispubdata.csv",index=False)
citationdf.head()

    #vispubdata_new.at[index,crossRefCitation_column] = isreferencedby
    #vispubdata_new.at[index,crossRefPubsCited_column] = references
    

#vispubdata_new.head(10)
#citationfilename = "citation-data-update/"+filename+"-citation-update.csv"
#vispubdata_new.to_csv(citationfilename,index=False)
    

Starting to work on the citation counts
0 10.1109/TVCG.2022.3209359
1 10.1109/TVCG.2022.3209347
2 10.1109/TVCG.2022.3209385
3 10.1109/TVCG.2022.3209390
4 10.1109/TVCG.2022.3209361
5 10.1109/TVCG.2022.3209360
6 10.1109/TVCG.2022.3209426
7 10.1109/TVCG.2022.3209354
8 10.1109/TVCG.2022.3209379
9 10.1109/TVCG.2022.3209365
10 10.1109/TVCG.2022.3209375
11 10.1109/TVCG.2022.3209377
12 10.1109/TVCG.2022.3209387
13 10.1109/TVCG.2022.3209374
14 10.1109/TVCG.2022.3209405
15 10.1109/TVCG.2022.3209356
16 10.1109/TVCG.2022.3209380
17 10.1109/TVCG.2022.3209411
18 10.1109/TVCG.2022.3209373
19 10.1109/TVCG.2022.3209392
20 10.1109/TVCG.2022.3209383
21 10.1109/TVCG.2022.3209371
22 10.1109/TVCG.2022.3209348
23 10.1109/TVCG.2022.3209367
24 10.1109/TVCG.2022.3209357
25 10.1109/TVCG.2022.3209470
26 10.1109/TVCG.2022.3209388
27 10.1109/TVCG.2022.3209352
28 10.1109/TVCG.2022.3209395
29 10.1109/TVCG.2022.3209398
30 10.1109/TVCG.2022.3209384
31 10.1109/TVCG.2022.3209353
32 10.1109/TVCG.2022.3209442
33 10.1109/TV

Unnamed: 0,DOI,CitationCount_CrossRef,PubsCited_CrossRef
0,10.1109/TVCG.2022.3209359,4,63
1,10.1109/TVCG.2022.3209347,10,43
2,10.1109/TVCG.2022.3209385,6,68
3,10.1109/TVCG.2022.3209390,7,79
4,10.1109/TVCG.2022.3209361,0,63
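
The loop above indexes the CrossRef record directly, which raises a `KeyError` if a record lacks one of the two count fields. A sketch of the same bookkeeping with the lookup passed in as a function, so that it can be tried without network access. `collect_counts` and `fake_lookup` are hypothetical names; in the notebook, `works.doi` would take the place of `fake_lookup`:

```python
def collect_counts(dois, lookup):
    """Return one dict per DOI with citation and reference counts ("" if unknown)."""
    rows = []
    for doi in dois:
        paper = lookup(doi)  # expected to return a dict, or None for unknown DOIs
        if paper is None:
            rows.append({"DOI": doi, "CitationCount_CrossRef": "", "PubsCited_CrossRef": ""})
            continue
        # .get() returns None instead of raising when a field is absent.
        cited_by = paper.get("is-referenced-by-count")
        references = paper.get("references-count")
        rows.append({
            "DOI": doi,
            "CitationCount_CrossRef": "" if cited_by is None else cited_by,
            "PubsCited_CrossRef": "" if references is None else references,
        })
    return rows

def fake_lookup(doi):
    # Stand-in for works.doi(doi); the records are made up.
    records = {
        "10.0000/a": {"is-referenced-by-count": 4, "references-count": 63},
        "10.0000/b": {"is-referenced-by-count": 0},
    }
    return records.get(doi)

rows = collect_counts(["10.0000/a", "10.0000/b", "10.0000/c"], fake_lookup)
for row in rows:
    print(row)
```

The list of dicts can be passed straight to `pd.DataFrame(rows)` to obtain the same three columns as `citationdf`.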


### Integrate the Citation data

Do not skip this part.

In [115]:
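#error checking
#Check for which DOIs we don't have data
#The download cell above writes this file into the results folder
citationdf = pd.read_csv("results/citations_vispubdata.csv")
#read_csv turns empty fields into NaN, so test with isna() rather than == ""
citationdf.loc[citationdf[crossRefCitation_column].isna()]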



Unnamed: 0,DOI,CitationCount_CrossRef,PubsCited_CrossRef


In [47]:
#For debugging the API
# doi = '10.0000/00000001'

# works = Works(etiquette=my_etiquette)
# paper = works.doi(doi)
# isreferencedby = ""
# if(paper is not None):
#         isreferencedby = works.doi(doi)['is-referenced-by-count']
#         if (isreferencedby is None):
#                 isreferencedby = ""

# print("Citations: " + isreferencedby)
# #pubscited = works.doi(doi)['references-count']

# works.doi(doi)

Citations: 


In [117]:
#Here we copy the citation data over
#it would be much smarter to do this with a merge but I am currently worried of getting it wrong and merging rows and values in that I don't want. Since I don't care for speed yet....

for index,row in citationdf.iterrows():
    
    doi = row['DOI']
    pubscited = row[crossRefPubsCited_column]
    citation = row[crossRefCitation_column]

    vispubdata_new.loc[vispubdata_new['DOI'] == doi,crossRefPubsCited_column] = pubscited
    vispubdata_new.loc[vispubdata_new['DOI'] == doi,crossRefCitation_column] = citation



vispubdata_new.to_csv("results/DEBUG_vispubdata_new.csv")

vispubdata_new.head()


Unnamed: 0,Conference,Year,Title,DOI,Link,FirstPage,LastPage,PaperType,Abstract,AuthorNames-Deduped,AuthorNames,AuthorAffiliation,InternalReferences,AuthorKeywords,AminerCitationCount,CitationCount_CrossRef,PubsCited_CrossRef,Downloads_Xplore,Award,GraphicsReplicabilityStamp
0,Vis,2022,Photosensitive Accessibility for Interactive D...,10.1109/TVCG.2022.3209359,http://dx.doi.org/10.1109/TVCG.2022.3209359,374.0,384.0,J,Accessibility guidelines place restrictions on...,Laura South;Michelle Borkin,Laura South;Michelle A. Borkin,"Northeastern University, USA;Northeastern Univ...",10.1109/TVCG.2011.185;10.1109/TVCG.2021.311482...,"accessibility,photosensitive epilepsy,photosen...",,4,63,554,,
1,Vis,2022,HetVis: A Visual Analysis Approach for Identif...,10.1109/TVCG.2022.3209347,http://dx.doi.org/10.1109/TVCG.2022.3209347,310.0,319.0,J,Horizontal federated learning (HFL) enables di...,Xumeng Wang;Wei Chen 0001;Jiazhi Xia;Zhen Wen;...,Xumeng Wang;Wei Chen;Jiazhi Xia;Zhen Wen;Rongc...,"TMCC, CS, Nankai University, China;State Key L...",10.1109/TVCG.2015.2467618;10.1109/TVCG.2019.29...,"Federated learning,data heterogeneity,cluster ...",,10,43,984,,
2,Vis,2022,Rigel: Transforming Tabular Data by Declarativ...,10.1109/TVCG.2022.3209385,http://dx.doi.org/10.1109/TVCG.2022.3209385,128.0,138.0,J,"We present Rigel, an interactive system for ra...",Ran Chen;Di Weng;Yanwei Huang;Xinhuan Shu;Jiay...,Ran Chen;Di Weng;Yanwei Huang;Xinhuan Shu;Jiay...,"State Key Lab of CAD&CG, Zhejiang University, ...",10.1109/TVCG.2021.3114830;10.1109/VAST47406.20...,"Data transformation,self-service data transfor...",,6,68,610,,
3,Vis,2022,BeauVis: A Validated Scale for Measuring the A...,10.1109/TVCG.2022.3209390,http://dx.doi.org/10.1109/TVCG.2022.3209390,363.0,373.0,J,We developed and validated a rating scale to a...,Tingying He;Petra Isenberg;Raimund Dachselt;To...,Tingying He;Petra Isenberg;Raimund Dachselt;To...,"Université Paris-Saclay, CNRS, Inria, LISN, Fr...",10.1109/INFVIS.2005.1532128;10.1109/TVCG.2006....,"Aesthetics,aesthetic pleasure,validated scale,...",,7,79,753,,
4,Vis,2022,NAS-Navigator: Visual Steering for Explainable...,10.1109/TVCG.2022.3209361,http://dx.doi.org/10.1109/TVCG.2022.3209361,299.0,309.0,J,The success of DL can be attributed to hours o...,Anjul Tyagi;Cong Xie;Klaus Mueller 0001,Anjul Tyagi;Cong Xie;Klaus Mueller,"Computer Science Department, Visual Analytics ...",10.1109/VAST.2012.6400490;10.1109/TVCG.2019.29...,"Deep Learning,Neural Network Architecture Sear...",,0,63,391,,
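
For reference, the merge mentioned in the comment above could look roughly like the sketch below. It runs on toy data and is an alternative to the loop, not what the notebook does. The two guards address the worry about merging the wrong rows: `how="left"` keeps every vispubdata row, and `validate="many_to_one"` raises an error if a DOI appears more than once in `citationdf`:

```python
import pandas as pd

vispubdata_new = pd.DataFrame({
    "DOI": ["10.0000/a", "10.0000/b", "10.0000/c"],
    "Title": ["A", "B", "C"],
    "CitationCount_CrossRef": ["1", "2", "3"],
    "PubsCited_CrossRef": ["10", "20", "30"],
})
citationdf = pd.DataFrame({
    "DOI": ["10.0000/a", "10.0000/c"],
    "CitationCount_CrossRef": [4, 7],
    "PubsCited_CrossRef": [63, 79],
})

count_columns = ["CitationCount_CrossRef", "PubsCited_CrossRef"]

# Drop the stale counts, then left-join the fresh ones on DOI.
merged = vispubdata_new.drop(columns=count_columns).merge(
    citationdf[["DOI"] + count_columns],
    on="DOI",
    how="left",
    validate="many_to_one",
)

# A left join with a validated right side can never add or lose rows.
assert len(merged) == len(vispubdata_new)
print(merged)
```

Unlike the loop, this version leaves the counts empty (NaN) for papers that are missing from `citationdf` and moves the two count columns to the end, so it is not a drop-in replacement.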


### Getting the internal citations

This is a test to see whether we can get this information through the IEEE API as well.



In [3]:
# from xploreapi import XPLORE
# query = XPLORE(apikey)
# query.setAuthToken('auth_token')
# query.dataType('json')
# query.dataFormat('raw')
# query.fullTextRequest('article number')
# data = query.callAPI()


Token cannot be retrieved


UnboundLocalError: local variable 'tokenValue' referenced before assignment

## Include deduped authors from DBLP

The following code expects that you have already run preparation step 2 at the very top of the document. That is the step in which you extracted potential VIS authors from DBLP.

In [7]:
dblpauthors = pd.read_csv("../dblp-data-extraction/data/VIS-author-articles.csv",keep_default_na=False)

#if this fails you didn't do step 2 above

FileNotFoundError: [Errno 2] No such file or directory: '../dblp-data-extraction/data/VIS-author-articles.csv'

In [None]:
# vispubdata -- this is vispubdata before the update