# Updating Vispubdata with new IEEE VIS papers

This notebook allows you to create an update of vispubdata. It will walk you through the different steps that are needed. 

This code was written by Petra Isenberg (petra.isenberg@inria.fr) and is updated regularly.
If you find better or more elegant ways to implement what I did below, don't hesitate to suggest improvements!


## Preparation

### Get an IEEE Xplore API Key
https://ieeexplore.ieee.org/Xplorehelp/administrators-and-librarians/api-portal

Put the API key into the file called "ieeexplore-apikey.txt" in the main folder.


### 1) Finding the issue number

Find the issue number for the latest TVCG issue that holds the new VIS papers you would like to add. To do so, do the following:
- Go to the TVCG IEEEXplore entry: https://ieeexplore.ieee.org/xpl/mostRecentIssue.jsp?punumber=2945. Find the issue that holds the IEEE VIS papers of the year. At the time of writing, this was typically issue 1 of each year. Open any paper of this issue. Then view the HTML source of the paper's page and search it for "isnumber".
- Open the file "xplore-isnumbers.csv". Add a new line at the bottom with the year,[YOUR ISNUMBER],TVCGSI,articles-SI. TVCGSI stands for "special issue of TVCG" and "articles-SI" stands for "article in the special issue". For earlier years you will find slightly different codes, but those don't have to concern you if you are updating more recent articles.

**Example:**

The IEEE VIS 2023 papers can be found here in the digital library: https://ieeexplore.ieee.org/xpl/tocresult.jsp?isnumber=10373160&punumber=2945. Open a random paper from the issue, for example this one: https://ieeexplore.ieee.org/document/10301796. Check the source. In the document metadata you will find the isnumber: isnumber=10373160. Add the following line to xplore-isnumbers.csv: `2023,10373160,TVCGSI,`
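
If you already have the table-of-contents URL of the issue, the isnumber can also be read from its query string instead of the page source. This is a minimal sketch, assuming the URL carries an `isnumber` parameter as in the example above; `extract_isnumber` is a hypothetical helper and not part of this notebook.

```python
from urllib.parse import urlparse, parse_qs

def extract_isnumber(url):
    """Return the value of the 'isnumber' query parameter, or None if it is absent."""
    query = parse_qs(urlparse(url).query)
    values = query.get("isnumber")
    return values[0] if values else None

# URL taken from the VIS 2023 example above
toc_url = "https://ieeexplore.ieee.org/xpl/tocresult.jsp?isnumber=10373160&punumber=2945"
isnumber = extract_isnumber(toc_url)
print(isnumber)
```

Document URLs such as https://ieeexplore.ieee.org/document/10301796 carry no query string, so for those you still need to check the page source.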

#### Documentation for xplore-isnumbers.csv

The columns of this file are:
1. year: the year of the conference, e.g., 1990 for Vis 1990.
2. isnumber: the issue number as given by IEEEXplore. See above.
3. conference: the conference that belongs to the issue number. Anything after 2021 should be like the last entry below.
   - **Vis-conf** = the IEEE Visualization conference when papers were still published as conference papers
   - **InfoVis-conf** = the IEEE Information Visualization conference when papers were still published as conference papers
   - **VAST-conf** = the IEEE VAST conference when papers were still published as conference papers
   - **SciVis-conf** = the IEEE SciVis conference when papers were still published as conference papers
   - **TVCGSI** = the papers that are published as a special issue in TVCG
4. json_filename: the name of the file that will hold the paper info once downloaded. Anything after 2021 should be like the last entry below.
   - **articles-vis** for **Vis-conf**
   - **articles-infovis** for **InfoVis-conf**
   - **articles-vast** for **VAST-conf**
   - **articles-scivis** for **SciVis-conf**
   - **articles-SI** for **TVCGSI**
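
The pairing between the conference code and the file name can be written down as a small lookup table. The sketch below builds one CSV line from it; `CONFERENCE_TO_FILENAME` and `make_isnumber_row` are hypothetical names used only for illustration, and the mapping itself is copied from the list above.

```python
# Mapping taken from the column documentation above
CONFERENCE_TO_FILENAME = {
    "Vis-conf": "articles-vis",
    "InfoVis-conf": "articles-infovis",
    "VAST-conf": "articles-vast",
    "SciVis-conf": "articles-scivis",
    "TVCGSI": "articles-SI",
}

def make_isnumber_row(year, isnumber, conference):
    """Build one line for xplore-isnumbers.csv; reject unknown conference codes."""
    if conference not in CONFERENCE_TO_FILENAME:
        raise ValueError("Unknown conference code: " + str(conference))
    fields = [str(year), str(isnumber), conference, CONFERENCE_TO_FILENAME[conference]]
    return ",".join(fields)

print(make_isnumber_row(2023, 10373160, "TVCGSI"))
```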

### 2) Get the IEEE VIS full paper titles from the IEEE VIS program chairs
Every year there are a few errors on the IEEE digital library. In addition, the special issue of TVCG that includes the IEEE VIS full papers (usually the first issue of the year) sometimes contains additional papers such as best papers from associated events such as LDAV. 
To catch errors and exclude papers that do not belong on vispubdata, I have data files with the correct titles of each IEEE VIS full paper. These are the titles listed on vispubdata. To get the correct titles, email [program@ieeevis.org](mailto:program@ieeevis.org) to ask for the file. Then go to the folder for the year that you are updating. For example, to add paper titles for VIS 2023, **create a folder** called 2023 and in it a file called **VisTitles.txt**. Copy the titles in there, one title per line.
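
Before running the conversion it can help to sanity-check the titles file. The sketch below reads one title per line and reports titles that occur twice when compared case-insensitively (the notebook also compares lowercased titles). `load_titles` and `find_duplicate_titles` are hypothetical helpers, and the titles in the demo are placeholders, not real papers.

```python
import os
import tempfile

def load_titles(path):
    """Read one title per line, dropping blank lines and surrounding whitespace."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

def find_duplicate_titles(titles):
    """Return the lowercased titles that occur more than once."""
    seen, duplicates = set(), set()
    for title in titles:
        key = title.lower()
        if key in seen:
            duplicates.add(key)
        seen.add(key)
    return sorted(duplicates)

# Demo on a temporary file with placeholder titles
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "VisTitles.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("Placeholder Title A\n\nPlaceholder Title B \nplaceholder title a\n")
    titles = load_titles(path)
    duplicates = find_duplicate_titles(titles)

print(titles)
print(duplicates)
```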

### 3) Get the awarded papers
Open the file **awardedPapers.csv** and add the information for the paper awards of the conference year you are adding. The paper award types are:
- **BP** = Best paper award
- **HM** = Honorable mention award
- **TT** = Test of Time award
- **BCS** = Best case study award (was given out in the early years, not recently)

If a paper received more than one award then separate them by a semicolon. You can still do this step after you've done the conversion of IEEEXplore data to vispubdata format. That way you can easily copy the DOIs of the awarded papers from the `generated_data` file called `Vispubdata-Vis.csv` if you're adding a new year - if you are adding an older award then search for the DOI on IEEEXplore or vispubdata.
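
A quick check of the award field can catch typos before they reach vispubdata. This is a minimal sketch, assuming the four codes listed above are the only valid ones; `parse_awards` is a hypothetical helper, not part of the notebook.

```python
# Award codes as documented above
VALID_AWARDS = {"BP", "HM", "TT", "BCS"}

def parse_awards(award_field):
    """Split a semicolon-separated award field and reject unknown codes."""
    codes = [code.strip() for code in award_field.split(";") if code.strip()]
    unknown = [code for code in codes if code not in VALID_AWARDS]
    if unknown:
        raise ValueError("Unknown award code(s): " + ", ".join(unknown))
    return codes

print(parse_awards("BP;TT"))
```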

### 4) Extract a VIS Authors file from DBLP
We assume that this step is already done. If not, go to the [`../dblp-data-extraction/`](../dblp-data-extraction/) folder in this repo and follow the instructions in the `ParseDBLP-VIS-Authors` notebook. 



### Libraries 
Make sure you have these libraries installed. Run the code below before starting, to check that they can all be imported.

In [None]:
import urllib.request
import requests
import pandas as pd
import os
from pathlib import Path
import numpy as np
import re
import json
import sys
import csv
from crossref.restful import Works, Etiquette


with open('ieeexplore-apikey.txt') as f:
    apikey = f.readline().strip() #strip the trailing newline so it doesn't end up in the request URL

youremail = "petra.isenberg@inria.fr" #replace this with your own email address. This is required for querying CrossRef

xplore_numbers = pd.read_csv("xplore-isnumbers.csv", dtype=object,keep_default_na=False) #read everything as strings


## Download the paper metadata from IEEEXplore

These next code sections will download and check that there is a json file with data about papers from each year. If your data files are up to date you can skip this section.
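
The request used in this notebook asks for at most 200 records per issue (`max_records=200`). A downloaded file that contains exactly that many articles may therefore have been cut off. The sketch below checks this. It only relies on the top-level `articles` key that the conversion step also reads; treating a full page as "possibly truncated" is an assumption on my part, and the sample response is a placeholder, not real Xplore output.

```python
import json

MAX_RECORDS = 200  # same value as max_records in the request URL of this notebook

def count_articles(json_text):
    """Count the entries under the "articles" key; 0 if the key is missing."""
    data = json.loads(json_text)
    if not isinstance(data, dict):
        return 0
    return len(data.get("articles", []))

def looks_truncated(json_text, max_records=MAX_RECORDS):
    """True if the response holds at least max_records articles (assumed sign of a cut-off result)."""
    return count_articles(json_text) >= max_records

# Placeholder response for illustration only
sample = json.dumps({"articles": [{"title": "Placeholder"}] * 3})
print(count_articles(sample), looks_truncated(sample))
```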

In [None]:


#set whether or not to download all files again or just check for new ones
# --> if you want updated download counts, then set this to true
renewall = True 

print(xplore_numbers.head(60))


##### Check if we have all files we need

In [None]:
for index, row in xplore_numbers.iterrows():
    year = row['year']
    if not pd.isnull(row['json_filename']):
        filename = row['json_filename']+".json"
        
        #the download step stores the files in <year>/original_data/
        my_file = Path(year+"/original_data/"+filename)
        #write to the DataFrame itself: changes made to `row` are not guaranteed to be stored
        if renewall or not my_file.is_file():
            xplore_numbers.at[index,'json_filename'] = np.nan

##### Download files

In [None]:
#next we download all the files we don't have  or we want to renew  
baseurl =  "https://ieeexploreapi.ieee.org/api/v1/search/articles?parameter&apikey="+apikey+"&max_records=200&is_number="

for index, row in xplore_numbers.iterrows():
    year = row['year']
    print(year + " " + row['conference'])
    
    if pd.isnull(row['json_filename']):
        fullurl = baseurl + row['isnumber']
        
        filename = "articles-SI"
        if(row['conference'] == "InfoVis-conf"): 
            filename = "articles-infovis"
        elif (row['conference'] == 'VAST-conf'):
            filename = "articles-vast"
        elif (row['conference'] == 'SciVis-conf'):
            filename = "articles-scivis"
        elif (row['conference'] == 'Vis-conf'):
            filename = "articles-vis"
            
        full_filename = year + "/original_data/"+filename + ".json"
        
        # Create the target directory (including the year folder) if it doesn't exist
        directory = year+"/original_data"
        if not os.path.exists(directory):
            os.makedirs(directory)
            print("Directory", directory, "created")
        else:    
            print("Directory", directory, "already exists")
        
        with open(full_filename, 'w+', encoding="utf-8") as f:
            resp = requests.get(fullurl, verify=True)
            f.write(resp.text)
        
        xplore_numbers.at[index,'json_filename'] = filename

xplore_numbers.to_csv("xplore-isnumbers.csv",index=False)

print('Processing done.')

## Convert downloaded paper metadata from json to csv

Here we convert the .json file into something more flat and similar to vispubdata. Run this code if you updated the .json files above.

This code comes from here:
https://github.com/vinay20045/json-to-csv/blob/master/json_to_csv.py
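
To see what the flattening produces, here is a small self-contained variant of the same idea that returns a dictionary instead of writing to a global variable. Nested keys and list positions are joined with underscores, which is where column names such as `..._0_full_name` come from. The function name `flatten` and the input record are made up for illustration; the record is not real Xplore output.

```python
def flatten(key, value, out=None):
    """Flatten nested dicts and lists into a single dict with underscore-joined keys."""
    if out is None:
        out = {}
    if isinstance(value, list):
        # list items are addressed by their position
        for i, sub_item in enumerate(value):
            flatten(f"{key}_{i}", sub_item, out)
    elif isinstance(value, dict):
        for sub_key, sub_value in value.items():
            flatten(f"{key}_{sub_key}", sub_value, out)
    else:
        # leaf value: store it as a string
        out[str(key)] = str(value)
    return out

# Placeholder record for illustration only
record = {
    "title": "Placeholder",
    "authors": {"authors": [{"full_name": "A. Author"}, {"full_name": "B. Author"}]},
}
flat = flatten("articles", record)
for name, value in flat.items():
    print(name, "=", value)
```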


In [None]:
##
# Convert to string keeping encoding in mind...
##
def to_string(s):
    
    try:
        return str(s)
    except:
        #Change the encoding type if needed
        return s.encode('utf-8')

    
def reduce_item(key, value):
    global reduced_item
    
    #Reduction Condition 1
    if type(value) is list:
        i=0
        for sub_item in value:
            reduce_item(key+'_'+to_string(i), sub_item)
            i=i+1

    #Reduction Condition 2
    elif type(value) is dict:
        sub_keys = value.keys()
        for sub_key in sub_keys:
            reduce_item(key+'_'+to_string(sub_key), value[sub_key])
    
    #Base Condition
    else:
        reduced_item[to_string(key)] = to_string(value)

        

for index, row in xplore_numbers.iterrows():
    year = row['year']
    
    if not pd.isnull(row['json_filename']):
        node = "articles"
        
        generated_data_path = year +"/generated_data/"
        pathExists = os.path.exists(generated_data_path)
        if not pathExists:
            os.makedirs(generated_data_path)
        
        json_file_path = year + "/original_data/" + row['json_filename']+".json"
        csv_file_path = generated_data_path + row['json_filename']+".csv"
        
        #print ("Now working on "+csv_file_path)
        
        #TODO: We could check if the .csv file already exists and then decide not to do anything

        #open in binary mode and let json detect the encoding; the file is closed automatically
        with open(json_file_path, mode='rb') as fp:
            raw_data = json.load(fp)

        #fall back to the whole document if there is no "articles" node
        try:
            data_to_be_processed = raw_data[node]
        except (KeyError, TypeError):
            data_to_be_processed = raw_data

        processed_data = []
        header = []
        for item in data_to_be_processed:
            reduced_item = {}
            reduce_item(node, item)

            header += reduced_item.keys()

            processed_data.append(reduced_item)

        header = list(set(header))
        header.sort()

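        #newline='' prevents blank lines between rows on Windows
        with open(csv_file_path, 'w+', encoding="utf8", newline='') as f:
            writer = csv.DictWriter(f, header, dialect='excel')
            writer.writeheader()
            #use a separate name so we don't overwrite the outer loop's `row`
            for reduced_row in processed_data:
                writer.writerow(reduced_row)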

        #print ("Just completed writing "+csv_file_path+" file with %d columns" % len(header))
        
    else:
        print("We are missing a filename for: " + year + " " + row['conference']+ ". Did you run the previous step already?")



## Convert IEEEXplore csvs to Vispubdata

Here we convert the data from the .csv files generated above into the vispubdata format.



#### Read the latest vispubdata

In [None]:
# Google Sheet URL
vispubdata_sheet_url = "https://docs.google.com/spreadsheets/d/1xgoOPu28dQSSGPIp_HHQs0uvvcyLNdkMF9XtRajhhxU/gviz/tq?tqx=out:csv"

# Read the CSV into a DataFrame
vispubdata = pd.read_csv(vispubdata_sheet_url,keep_default_na=False)

print(vispubdata.columns)

#if you ever want to introduce new columns you could try to do it here. I usually add them on the Google sheet. 
final_columns = vispubdata.columns
#final_columns = ['Conference', 'Year', 'Title', 'DOI', 'Link', 'FirstPage','LastPage','PaperType','Abstract','AuthorNames-Deduped','AuthorNames','AuthorAffiliation','InternalReferences','AuthorKeywords','AminerCitationCount','CitationCount_CrossRef','PubsCited_CrossRef',' Downloads_Xplore','Award','GraphicsReplicabilityStamp']

#double check that the names are correct here
crossRefCitation_column = "CitationCount_CrossRef"
crossRefPubsCited_column = 'PubsCited_CrossRef'
downloads_column = "Downloads_Xplore"


#### Helper methods

In [None]:
#we need this later to sort the author columns by the number hidden in its name  
def num_sort(test_string):
    return list(map(int, re.findall(r'\d+', test_string)))[0]

In [None]:
#Here we prepare the data structure that will resemble the final vispubdata table
def prepareXploreDFTable(xplore_df):
    
    #get all column names
    columns = xplore_df.columns

    #remove all the columns we don't need
    #-------------------------------------------------------------
    xplore_df.drop('articles_access_type', axis=1, inplace=True,errors='ignore')
    xplore_df.drop('articles_abstract_url', axis=1, inplace=True,errors='ignore')
    xplore_df.drop('articles_publisher', axis=1, inplace=True,errors='ignore')
    xplore_df.drop('articles_article_number', axis=1, inplace=True,errors='ignore')
    xplore_df.drop('articles_volume', axis=1, inplace=True,errors='ignore')
    xplore_df.drop('articles_rank', axis=1, inplace=True,errors='ignore')
    xplore_df.drop('articles_publication_number', axis=1, inplace=True,errors='ignore')
    xplore_df.drop('articles_publication_date', axis=1, inplace=True,errors='ignore')
    xplore_df.drop('articles_pdf_url', axis=1, inplace=True,errors='ignore')
    xplore_df.drop('articles_partnum', axis=1, inplace=True,errors='ignore')
    xplore_df.drop('articles_issue', axis=1, inplace=True,errors='ignore')
    xplore_df.drop('articles_issn', axis=1, inplace=True,errors='ignore')
    xplore_df.drop('articles_is_number', axis=1, inplace=True,errors='ignore')
    xplore_df.drop('articles_html_url', axis=1, inplace=True,errors='ignore')
    xplore_df.drop('articles_publication_title', axis=1, inplace=True,errors='ignore')
    xplore_df.drop('articles_citing_patent_count', axis=1, inplace=True,errors='ignore')
    xplore_df.drop('articles_conference_location', axis=1, inplace=True,errors='ignore')
    xplore_df.drop('articles_conference_dates', axis=1, inplace=True,errors='ignore')
 

    #WARNING: I assume the file is correctly ordered. If not we need to do something more fancy
    columns_to_drop = columns[columns.str.contains("author_order")] 
    xplore_df.drop(columns_to_drop, axis=1, inplace=True)

    columns_to_drop = columns[columns.str.contains("ieee_terms")] 
    xplore_df.drop(columns_to_drop, axis=1, inplace=True)
    
    columns_to_drop = columns[columns.str.contains("_id")] 
    xplore_df.drop(columns_to_drop, axis=1, inplace=True)
    
    columns_to_drop = columns[columns.str.contains("authorUrl")] 
    xplore_df.drop(columns_to_drop, axis=1, inplace=True)
    
    columns_to_drop = columns[columns.str.contains("isbn")] 
    xplore_df.drop(columns_to_drop, axis=1, inplace=True)
    

    #rename the columns we want to keep
    xplore_df.rename(index=str, inplace=True, columns={"articles_title":"Title",
                                                       "articles_start_page":"FirstPage",
                                                       "articles_abstract":"Abstract",
                                                       "articles_publication_year": "Year",
                                                       "articles_content_type": "PaperType",
                                                       "articles_end_page":"LastPage",
                                                       "articles_doi":"DOI",
                                                       "articles_citing_paper_count":"CitationCount_CrossRef",
                                                       "articles_download_count":"Downloads_Xplore"})


    #put all author full names together
    #----------------------------------------
    author_columns = columns[columns.str.contains("full_name")].tolist()
    author_columns.sort(key=num_sort)
    
    #dropna() in the join below already skips missing names, so no separate fillna is needed
    authors = xplore_df[author_columns].apply(lambda x: ';'.join(x.dropna().values.tolist()), axis=1)
    authors = authors.str.rstrip(";")
    #we have the authors put together, now add them to the df
    xplore_df['AuthorNames'] = authors
    #now remove all the individual columns that we no longer need
    xplore_df.drop(author_columns, axis=1, inplace=True)

    #put all affiliations together
    #------------------------------------------
    ##We have to do something more complicated for the affiliations because the csv file does not contain an affiliation column if none of the authors in a given position has an affiliation
    
    #Careful. IEEEXplore seems to change the way it handles the affiliations. Their json changes each year. 
    #Also, the original json can now handle multiple affiliations - todo for the future. Requires an update to vispubdata.
    
    xplore_df['AuthorAffiliation'] = "" #make an empty column first
    xplore_df["AuthorCount"] =  xplore_df['AuthorNames'].str.count(';') + 1
    
    for index, row in xplore_df.iterrows():
        #we need to find out how many authors a paper has first
        authorCount = row["AuthorCount"]
        affiliations = ""
        for i in range(0,authorCount):
            author_column_name = author_columns[i]
            affcolumn = author_column_name.replace("full_name","affiliation")
            if affcolumn in xplore_df.columns:
                #we're doing this a few too many times here but we don't care for speed, yet
                affiliations = affiliations + ";" + row[affcolumn]
            else:
                affiliations = affiliations + ";" + ""
        affiliations = affiliations[1:]
        #print(affiliations)
        xplore_df.at[index,'AuthorAffiliation'] = affiliations
    
    
    xplore_df.drop(["AuthorCount"], axis=1, inplace=True)
    
    affiliation_columns = columns[columns.str.contains("affiliation")]
    #now remove all the individual columns that we no longer need
    xplore_df.drop(affiliation_columns, axis=1, inplace=True)

    #put all author keywords together
    #------------------------------------------
    kw_columns = columns[columns.str.contains("author_terms")]
    #dropna() in the join below already skips missing keywords, so no separate fillna is needed
    keywords = xplore_df[kw_columns].apply(lambda x: ','.join(x.dropna().values.tolist()), axis=1)
    keywords = keywords.apply(lambda x: x.rstrip(','))
    #we have the keywords put together, now add them to the df
    xplore_df['AuthorKeywords'] = keywords
    #now remove all the individual columns that we no longer need
    xplore_df.drop(kw_columns, axis=1, inplace=True)

    #create the link column
    #--------------------------------------------
    xplore_df["Link"] = 'http://dx.doi.org/' + xplore_df["DOI"]

    #now add columns that are missing
    xplore_df_columns = xplore_df.columns
    
    for c in final_columns:
        if not c in xplore_df_columns:
            if c == "Conference":
                xplore_df["Conference"] = ["conference_external"] * len(xplore_df.index)
            else:
                xplore_df[c] = [""] * len(xplore_df.index)
    
    for c in xplore_df_columns:
        if not c in final_columns:
            #we remove all columns that we haven't captured yet
            xplore_df.drop([c], axis=1, inplace=True)
            
    #now reorder the columns; return the result because this assignment only rebinds the local name
    xplore_df = xplore_df[final_columns]
    return xplore_df


In [None]:
def checkConferenceTitleFiles(conf_title_file,xplore_df,year_to_import,conference_external):
    
    #For a couple of years InfoVis, SciVis, and VAST were all published in the same special issue of TVCG. 
    #In order to separate them we need to check which title in that special issue belongs to which conference.
    

    conf_titles = pd.read_csv(conf_title_file,header=None,names=['Title'],sep="\t",encoding ='utf-8',quoting=csv.QUOTE_NONE,keep_default_na=False)
    conf_titles = conf_titles['Title'].str.lower()
    titles_not_found = [ ]
    reasons_found = 0
    xplore_df = xplore_df[final_columns]
    xPloreTitles = xplore_df['Title'].str.lower()

    reasonsFileName = str(year_to_import)+"/missingTitles-reason.txt"
    
    for t in conf_titles:
        #we should just get the titles from df that match the conference
        #now check if the title exists in the read out data
        found = xPloreTitles[xPloreTitles == t].any()
            
                
        if(found != False):
            #rowindex = pd.Index(xPloreTitles).get_loc(t)
            #xplore_df.at[str(rowindex),'Conference'] = conference_external.replace('-conf', '')
            confname = conference_external.replace('-conf', '')
            xplore_df.loc[xplore_df['Title'].str.lower() == t,"Conference"] = confname
            
        else:
            titles_not_found.append(t)
            if(os.path.isfile(reasonsFileName)):
                reason_df = pd.read_csv(reasonsFileName,dtype=object, sep="\t",encoding ='utf-8',keep_default_na=False)
                reasontitles = reason_df["Title"].str.lower().tolist()
                if t in reasontitles:
                    reasons_found = reasons_found + 1
                else:
                    print("No reason found for: "+t)

    
    if len(titles_not_found) > 0:
        print("For "+str(year_to_import)+ " " + str(conference_external) + " I couldn't find " + str(len(titles_not_found)) + " titles. "+str(reasons_found) + " reasons for missing titles were recorded.")
    pd.DataFrame({'Conference':conference_external,'Title':titles_not_found}).to_csv(year_to_import+'/generated_data/TitlesNotFound-'+conference_external+'.csv',index=False)
    
    xplore_df.to_csv(year_to_import+"/generated_data/Vispubdata-"+conference_external+".csv",index =False)
    

In [None]:
def workOnYear(year_to_import,csv_filename,conference_external):

    #we'll get the .csv file for the conference and year we're interested in
    path = year_to_import+"/generated_data/"+csv_filename

    #double-checking that the file exists
    if(os.path.isfile(path)):
        print("Your file " + path + " exists")
    else:
        print("Your file " + path + " does not exist")

    #print(path)
    
    #now we load it
    xplore_df = pd.read_csv(path,dtype=object, encoding ='utf-8',keep_default_na=False,dialect='excel')
    #now prepare it
    prepareXploreDFTable(xplore_df)
    
    
    ###Now we have the xplore_df file ready for this particular year and conference
    
    conf_title_file = year_to_import+"/"+conference_external+"Titles.txt"
    
    
    
    if Path(conf_title_file).is_file():
        checkConferenceTitleFiles(conf_title_file,xplore_df,year_to_import,conference_external)

    else:
        #print("Title file: " + conf_title_file + " not found")
        
        #Most likely we've encountered the special issue link that we need to split into InfoVis, Vis, SciVis, or VAST
        if "TVCGSI" in conference_external:
            possibleConferences = ["InfoVis","SciVis","VAST"]
            titleFilesFound = []
            filesChecked = 0
            
            #here we check if this special issue contained papers from InfoVis, VAST, or SciVis
            for pc in possibleConferences:
                conf_title_file = year_to_import+"/"+pc+"Titles.txt"
                if Path(conf_title_file).is_file():
                    checkConferenceTitleFiles(conf_title_file,xplore_df,year_to_import,pc)
                    filesChecked = filesChecked + 1
                    
            #now we check if this special issue is from the combined conference that no longer has these sub-conferences
            conf_title_file = year_to_import+"/VisTitles.txt" #from 2021 onwards
            if Path(conf_title_file).is_file():
                    checkConferenceTitleFiles(conf_title_file,xplore_df,year_to_import,"Vis")
                    filesChecked = filesChecked + 1
            
            if filesChecked == 0:
                print("Warning: For this special issue there is no file with paper titles!")
                    
            
        else:
            print("Some problem here. I don't know what to do with " + conf_title_file)
        
    print("------------------------")
        

#### Start the conversion

In [None]:
#lets work on all years

xplore_numbers = pd.read_csv("xplore-isnumbers.csv", dtype=object) #read everything as strings

for index, row in xplore_numbers.iterrows():
    year = row['year']
    issue_number = row['isnumber']
    conference_external = row['conference']

    csv_filename = xplore_numbers.loc[xplore_numbers['isnumber'] == issue_number, 'json_filename']+".csv"
    csv_filename = csv_filename.iloc[0]
    
    print("Looking for: " + conference_external + " " + year + " at " + csv_filename)
    
    workOnYear(year,csv_filename,conference_external)

### Double-check correctness of the data manually

#### Check for titles that were not found
Now we have the data in vispubdata format. Please read the output above. Each time a specific title hasn't been found, a reason should have been recorded. If no reason has been recorded, then there is an error either in the IEEEXplore DL or in our list of titles, and you need to check manually where the paper is. The main reference should be the paper itself: try to find it on the DL and open the pdf. You can find the titles that have not been found by going to the year's generated_data folder and opening the respective `TitlesNotFound-*.csv` file. Once you have found the problem, do the following:
- if the title is incorrect in the respective `VisTitles.txt` file, fix it there
- if the title is incorrect on the IEEEXplore digital library, then create a file called `missingTitles-reason.txt`. The header is "Title" "DOI" "Reason", all tab-separated. On line 2, copy the title from our titles list (which has to be the correct one), add the DOI of the paper, and explain why the title is different from the one in the IEEE DL. You can follow the example in the 2018 folder. You can also try to email the IEEEXplore DL folks to tell them about the error (or email the publications chairs of IEEE VIS, who might have a more direct connection to them).
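
The reasons file described above can also be written programmatically. The sketch below writes the tab-separated header and one entry, then reads the file back. `write_reasons_file` is a hypothetical helper, the demo writes to a temporary folder rather than a year folder, and the title, DOI, and reason are placeholders.

```python
import csv
import os
import tempfile

def write_reasons_file(path, rows):
    """Write (title, doi, reason) tuples as a tab-separated file with a header line."""
    with open(path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(["Title", "DOI", "Reason"])
        writer.writerows(rows)

# Placeholder entry: title, DOI, and reason are made up for illustration
rows = [("Placeholder Paper Title", "10.0000/placeholder", "Title differs in the digital library")]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "missingTitles-reason.txt")
    write_reasons_file(path, rows)
    # read the file back to confirm the three columns are in place
    with open(path, encoding="utf-8", newline="") as f:
        read_back = list(csv.DictReader(f, delimiter="\t"))

print(read_back)
```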

#### Check if you missed any titles in our list
Go to the `generated_data` folder of the year that you are trying to add. Open all files that have the format `Vispubdata-*.csv`. Check all the papers that have a **conference_external** label in the first column. If you suspect that they might be actual IEEE VIS papers, then investigate why their title is not in your list. Reasons can be:
- Many of the entries labeled as **conference_external** will be editorials, the list of PC members, the OC, or other front/back matter. You can ignore them
- Every once in a while there is something that looks like an actual paper. Don't assume immediately that you missed the paper. Check the program of https://ieeevis.org of the year you are adding. In the last couple of years, there has been the occasional best paper from an associated event such as LDAV or VDS included in the special issue of TVCG that also includes all VIS papers. We don't want these papers in vispubdata, so ignore them. If you find one of these, create a file called `non-vis titles in SI.txt` in the main folder of the year you are adding. Add the paper title. This helps other people to avoid looking for this paper again.
- If, however, you find an actual Vis paper with the **conference_external** label, it means that you didn't have its title in the `VisTitles.txt` file. Add it and rerun the code above.

Note, if you are re-checking older years, there is a peculiarity: The special issue contained papers from all 3 conferences (InfoVis, SciVis, and VAST) - so you will see a lot of conference_external labels in the respective files. So you'll have to do a merge of the three vispubdata files and find out what is consistently labeled as **conference_external**. I haven't written code for that yet.
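
One possible way to approach this merge is to collect, for each of the three files, the set of titles labeled **conference_external** and then intersect the sets. The sketch below does this on plain lists of dictionaries (as you would get from `csv.DictReader`) so that it runs without extra dependencies. Matching on lowercased titles is an assumption, `consistently_external` is a hypothetical helper, and the rows are placeholders.

```python
def consistently_external(tables, label="conference_external"):
    """Return the lowercased titles that carry `label` in every one of the tables."""
    external_sets = [
        {row["Title"].lower() for row in table if row["Conference"] == label}
        for table in tables
    ]
    if not external_sets:
        return []
    return sorted(set.intersection(*external_sets))

# Placeholder rows standing in for the three Vispubdata-*.csv files of one year
infovis = [
    {"Title": "Paper A", "Conference": "InfoVis"},
    {"Title": "Paper B", "Conference": "conference_external"},
    {"Title": "Front Matter", "Conference": "conference_external"},
]
scivis = [
    {"Title": "Paper A", "Conference": "conference_external"},
    {"Title": "Paper B", "Conference": "SciVis"},
    {"Title": "Front Matter", "Conference": "conference_external"},
]
vast = [
    {"Title": "Paper A", "Conference": "conference_external"},
    {"Title": "Paper B", "Conference": "conference_external"},
    {"Title": "Front Matter", "Conference": "conference_external"},
]

external = consistently_external([infovis, scivis, vast])
print(external)
```

Only entries that no conference claimed remain in the result; these are the ones to inspect by hand.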

#### Fixing the publication year

If you are adding a new special issue published after 2015, the publication year has to be corrected. Since 2015, VIS papers have been published in the first issue of the next calendar year of IEEE TVCG, but in vispubdata we record papers by the year they were presented at the conference and not by when they appeared in TVCG. For simplicity, we set the year of every paper to the year of the folder its data is in.
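
Taking the year from the folder name can be done without depending on the path separator of the operating system. The sketch below looks for the first four-digit path component; `year_from_path` is a hypothetical helper, and the two example paths only mirror the folder layout used in this notebook.

```python
from pathlib import PurePosixPath, PureWindowsPath

def year_from_path(path):
    """Return the first path component that consists of exactly four digits, or None."""
    for part in path.parts:
        if len(part) == 4 and part.isdigit():
            return part
    return None

# The same relative path written with POSIX and with Windows separators
posix_year = year_from_path(PurePosixPath("./2023/generated_data/Vispubdata-Vis.csv"))
windows_year = year_from_path(PureWindowsPath(r".\2023\generated_data\Vispubdata-Vis.csv"))
print(posix_year, windows_year)
```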




In [None]:
for root, dirs, files in os.walk("./"):
    for file in files:
        
        filepath = os.path.join(root, file)
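        #match on path components rather than a Windows-only "\" separator
        if (os.path.basename(root) == "generated_data" and file.startswith("Vispubdata-")):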
            year = root[2:6]  #this may be a source of error in other operating systems. To check...
            df = pd.read_csv(filepath,dtype=object, encoding ='utf-8',keep_default_na=False)
            df['Year'] = year
            df.to_csv(filepath,index =False)


#### Fixing the publication type

If you are adding VIS papers that are part of a special issue of TVCG, we can simply replace the paper type "Journals" with "J" (the vispubdata notation). If, however, you are adding conference papers or posters, you will likely need to fix this data manually, since IEEEXplore tags both as "Conferences" while vispubdata marks posters, panels, VAST challenges, etc. as "M".


In [None]:
for root, dirs, files in os.walk("./"):
    for file in files:
        
        filepath = os.path.join(root, file)
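        #match on path components rather than a Windows-only "\" separator
        if (os.path.basename(root) == "generated_data" and file.startswith("Vispubdata-")):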
            year = root[2:6]  #this may be a source of error in other operating systems. To check...
            df = pd.read_csv(filepath,dtype=object, encoding ='utf-8',keep_default_na=False)
            df.loc[df['PaperType'] == 'Journals', 'PaperType'] = "J"
            df.to_csv(filepath,index =False)

## Make sure that you have done all the steps above

Next, we automatically remove all entries with the conference_external label. So it's really important that you finished careful error checking at this point.

In [None]:

years = xplore_numbers.year.unique()
dataframes = []

for root, dirs, files in os.walk("./"):
    for file in files:
        filepath = os.path.join(root, file)
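        #match on path components rather than a Windows-only "\" separator
        if (os.path.basename(root) == "generated_data" and file.startswith("Vispubdata-")):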
            df = pd.read_csv(filepath,dtype=object, encoding ='utf-8',keep_default_na=False)
            df.drop(df[df.Conference == "conference_external"].index,inplace=True)
            df.to_csv(filepath,index =False)
            dataframes.append(df)
            
#this df will hold all ieeevis papers with data from ieeexplore. Note, that the online version of vispubdata contains error fixes that are available only on the spreadsheet
#so NEVER just copy this df below over onto vispubdata. We should only copy over certain columns like the ones about downloads
ieeexplore_vispub_df = pd.concat(dataframes, ignore_index=True)   
            
ieeexplore_vispub_df.info()

ieeexplore_vispub_df.to_csv("results/DEBUG_ieeexplore_vispub_df.csv")

#Petra: I usually take a quick scan through this DEBUG-file to see if everything looks more or less ok
            
            


Now we merge the current vispubdata and the new table.
We want to keep our current version of vispubdata and add the new papers, then we need to copy over the new citation and download counts

In [None]:
###################CONTINUE HERE#################################
#somewhere above we've loaded the current vispubdata into a dataframe
vispubdata #the current version of vispubdata
ieeexplore_vispub_df #all vis papers with data from IEEEXplore -> that is, some papers will be missing but there will be some new ones in here
ieeexplore_vispub_df['DOI'] = ieeexplore_vispub_df['DOI'].str.lower()
vispubdata['DOI'] = vispubdata['DOI'].str.lower()

#we first add the new papers to vispubdata
new_papers = ieeexplore_vispub_df[~ieeexplore_vispub_df['DOI'].isin(vispubdata['DOI'])]

print(new_papers['Year']) #hopefully we're only adding papers from one new year here

vispubdata_new = pd.concat([vispubdata,new_papers],ignore_index = True)


#print(vispubdata_new.info())
vispubdata_new.to_csv("results/DEBUG_vispubdata_new.csv")

vispubdata_new.head()

### Copy over the download data

The IEEEXplore API provides information on how many times a paper has been downloaded. We copy these counts into the merged table here.




In [None]:
download_df = pd.DataFrame({'DOI':ieeexplore_vispub_df['DOI'],downloads_column:ieeexplore_vispub_df[downloads_column]})
download_df["DOI"] = download_df["DOI"].str.lower()

for index,row in download_df.iterrows():
    downloads = row[downloads_column]
    doi = row['DOI']

    vispubdata_new.loc[vispubdata_new['DOI']== doi,downloads_column] = downloads

#check that now we have download counts updated
vispubdata_new.info()

## Add Awards

Make sure that you have updated **awardedPapers.csv**. See the instructions above.

In [None]:
awardedPapers = pd.read_csv("awardedPapers.csv",keep_default_na=False)
awardedPapers["DOI"] = awardedPapers["DOI"].str.lower()

for index,row in awardedPapers.iterrows():
    award = row['Award']
    doi = row['DOI']

    vispubdata_new.loc[vispubdata_new['DOI']== doi,'Award'] = award
    

## Add Replicability Stamp

In [None]:
replicablePapers = pd.read_csv("tvcg-dois-with-stamp.csv",keep_default_na=False)
replicablePapers["doi"] = replicablePapers["doi"].str.lower()

stampMarker = "X"

for index,row in replicablePapers.iterrows():
    
    doi = row['doi']

    vispubdata_new.loc[vispubdata_new['DOI'] == doi,'GraphicsReplicabilityStamp'] = stampMarker



## Add CrossRef Data



### Download new Citation data.

The output from this operation is a file named **results/citations_references_vispubdata.csv**. You can skip this section if this file is relatively recent, since the next chunk of code takes around 30 minutes to run. The file will be read in by the subsequent steps.
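
One way to decide whether the download can be skipped is to check the age of the output file. This is only a sketch: the helper `is_recent` and the 30-day threshold are assumptions and not part of the original notebook; the path is the one written by the download cell.

```python
import os
import time


def is_recent(path, max_age_days=30):
    """Return True if `path` exists and was modified within `max_age_days`."""
    if not os.path.exists(path):
        return False
    age_seconds = time.time() - os.path.getmtime(path)
    return age_seconds < max_age_days * 24 * 60 * 60


citations_file = "results/citations_references_vispubdata.csv"

if is_recent(citations_file):
    print("Citation file is recent; the Crossref download can be skipped.")
else:
    print("Citation file is missing or old; run the download cells below.")
```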

In [None]:

my_etiquette = Etiquette('Vispubdata', '9.02', 'https://sites.google.com/site/vispubdata/home', youremail)
str(my_etiquette)

In [None]:
works = Works(etiquette=my_etiquette)

crossRefCitations = [ ]
crossRefPubsCited = [ ]
dois = [ ] 
referenced_papers = [ ]
referenced_papers_without_DOI = []

print("Starting to work on the citation counts")

for index,row in vispubdata_new.iterrows():
    doi = row['DOI']
    print(str(index) + " " + doi)
    
    paper = works.doi(doi)
    
    if (paper is None):
        print("CrossRef does not know: " + doi)
        dois.append(doi)
        crossRefCitations.append("")
        crossRefPubsCited.append("")
        referenced_papers_without_DOI.append("")
        referenced_papers.append("")
        continue
        
    isreferencedby = paper['is-referenced-by-count']
    
    if (isreferencedby is None):
        isreferencedby = ""
    
    references = paper['references-count']
    
    if (references is None):
        references = ""

    if 'reference' in paper:
        cited_papers = paper['reference']

        citedpapers = ""
        citedpapers_withoutDOI = ""

        if(cited_papers is None):
            citedpapers = ""
            print("No cited papers found for: " + doi)
        else:
            for ref in cited_papers:
                if "DOI" in ref:
                    cited_doi = ref['DOI'].lower().strip()
                    citedpapers =  citedpapers + ";" + cited_doi

                elif "article-title" in ref:
                    citedpapers_withoutDOI = citedpapers_withoutDOI + ":::" + ref['article-title'].strip()
                

        if len(citedpapers) > 0:
            citedpapers = citedpapers[1:]
        
        if len(citedpapers_withoutDOI) > 0:
            citedpapers_withoutDOI = citedpapers_withoutDOI[3:]
    else:
        print("No cited papers found for: " + doi + ": Crossref does not have the reference list for this paper. This should be flagged to the IEEE Xplore people for them to fix.")
        citedpapers_withoutDOI = ""
        citedpapers = ""
        #referenced_papers_without_DOI.append("")
        #referenced_papers.append("")
                
        
    crossRefCitations.append(isreferencedby)
    crossRefPubsCited.append(references)
    referenced_papers_without_DOI.append(citedpapers_withoutDOI)
    referenced_papers.append(citedpapers)

    dois.append(doi)


citationdf = pd.DataFrame({'DOI':dois,crossRefCitation_column:crossRefCitations,crossRefPubsCited_column:crossRefPubsCited,"citedPapers":referenced_papers,"citedPapersWithoutDOI":referenced_papers_without_DOI})
citationdf.to_csv("results/citations_references_vispubdata.csv",index=False)
citationdf.head()

### Integrate the Citation data

This part you shouldn't skip over.
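
Before integrating, it can help to list the DOIs for which Crossref returned nothing. The sketch below works on an in-memory CSV with invented values; `CitationCount_CrossRef` is a placeholder for the name stored in `crossRefCitation_column`. It relies on empty cells being read as empty strings, which is what `keep_default_na=False` does.

```python
import io

import pandas as pd

citation_col = "CitationCount_CrossRef"  # hypothetical column name

# Stand-in for results/citations_references_vispubdata.csv (values are invented).
csv_text = "DOI,CitationCount_CrossRef\n10.1/a,12\n10.1/b,\n10.1/c,3\n"
citations = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

# Empty cells stay empty strings, so they can be found with a string comparison.
is_empty = citations[citation_col].astype(str).str.strip() == ""
missing = citations.loc[is_empty, "DOI"].tolist()

print(str(len(missing)) + " DOI(s) without Crossref data: " + str(missing))
```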

In [None]:
#error checking
#Check for which DOI we don't have data
citationdf = pd.read_csv("results/citations_references_vispubdata.csv", keep_default_na=False)
citationdf.head()



In [None]:
#For debugging the API
# doi = '10.0000/00000001'

# works = Works(etiquette=my_etiquette)
# paper = works.doi(doi)
# isreferencedby = ""
# if(paper is not None):
#         isreferencedby = works.doi(doi)['is-referenced-by-count']
#         if (isreferencedby is None):
#                 isreferencedby = ""

# print("Citations: " + str(isreferencedby))
# #pubscited = works.doi(doi)['references-count']

# works.doi(doi)

In [None]:
#Here we copy the citation data over
#it would be much smarter to do this with a merge but I am currently worried of getting it wrong and merging rows and values in that I don't want. Since I don't care for speed yet....

for index,row in citationdf.iterrows():
    
    doi = row['DOI']
    pubscited = row[crossRefPubsCited_column]
    citation = row[crossRefCitation_column]

    vispubdata_new.loc[vispubdata_new['DOI'] == doi,crossRefPubsCited_column] = pubscited
    vispubdata_new.loc[vispubdata_new['DOI'] == doi,crossRefCitation_column] = citation



vispubdata_new.to_csv("results/DEBUG_vispubdata_new.csv")

vispubdata_new.head()


### Getting the internal citations

In previous versions of vispubdata we extracted the internal references using other tools. For example, a tool called Grobid (https://github.com/kermitt2/grobid) converted the PDF to XML, and then we went through a paper title matching process, followed by manual checking of the results. That is all to say that the old references are likely pretty correct and I don't want to change them. The following code is therefore just about adding potentially missing internal references as exposed by Crossref. If you want to manually correct this data, then do so on vispubdata online. In future years we'll only be adding potentially missing references if Crossref has found new ones.
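
The rule described above (keep every existing internal reference, and add a Crossref reference only if it points to a VIS paper that is not yet listed) can be sketched for a single paper as follows. The function name `merge_internal_refs` and the DOIs are made up for illustration.

```python
def merge_internal_refs(current, crossref_refs, vis_dois):
    """Return the existing refs plus new Crossref refs that are VIS papers."""

    def split(value):
        # Non-string values (e.g. NaN) and empty entries yield no references.
        if not isinstance(value, str):
            return []
        return [s.strip().lower() for s in value.split(";") if s.strip()]

    merged = split(current)
    known = set(vis_dois)
    for ref in split(crossref_refs):
        # Only add references to VIS papers that are not already listed.
        if ref in known and ref not in merged:
            merged.append(ref)
    return ";".join(merged)


vis = ["10.1109/vis.1", "10.1109/vis.2", "10.1109/vis.3"]
result = merge_internal_refs(
    "10.1109/VIS.1", "10.1109/vis.2;10.1145/other;10.1109/vis.1", vis
)
print(result)  # 10.1109/vis.1;10.1109/vis.2
```

Existing references are never removed, so manual corrections made in earlier versions survive the update.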


In [None]:
vispubdataDOIs = vispubdata_new['DOI'].str.lower().tolist()

for index,row in citationdf.iterrows():
    
    doi = row['DOI'].lower()

    crossRefReferences = row['citedPapers']
    if pd.isna(crossRefReferences) or crossRefReferences == "":
         #if crossref didn't return any references (an empty cell, since the csv is read with keep_default_na=False), for now we just continue. Later we can write code to check if it returned any titles of papers
         continue
    
    
    internalReferences_current = vispubdata_new.loc[vispubdata_new['DOI'] == doi,'InternalReferences']
    currentreflist = internalReferences_current #.tolist()[0].split(";")
    
    if not pd.isna(internalReferences_current).values[0]:
         currentreflist = currentreflist.tolist()[0].split(";")
    else:
         currentreflist = []
    
    #print("This many internal refs before: " + str(len(currentreflist)))
    refsbefore = len(currentreflist)
    currentreflist = [s.strip() for s in currentreflist]
    currentreflist = [s.lower() for s in currentreflist]
     
       
    crossRefList = crossRefReferences.split(";")
    crossRefList = [s.strip() for s in crossRefList]
    crossRefList = [s.lower() for s in crossRefList]

    for crossRefRef in crossRefList:
         if crossRefRef in currentreflist:
              #in this case the citation is already in our list
              #print("already found: " + crossRefRef)
              continue
         else:
              #the citation is not already on vispubdata. There are two reasons. Either the reference is not to a VIS paper, or it was forgotten
              #so let's check if the reference is a vis paper
              if crossRefRef in vispubdataDOIs:
                   #yes, we've found a VIS paper. So now we need to add it to our list:
                   currentreflist.append(crossRefRef)

    #print("This many internal refs after: " + str(len(currentreflist)))
    #if len(currentreflist) > refsbefore:
    #     print("found something new")
    
    

    #print(crossRefReferences)
    
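    #drop empty entries before joining; slicing off the first character would cut into the first DOI whenever the list does not start with an empty string
    vispubdata_new.loc[vispubdata_new['DOI'] == doi,'InternalReferences'] = ';'.join(ref for ref in currentreflist if ref)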

In [None]:
#let's debug
#so we can double-check that things look good
vispubdata_new.to_csv("results/DEBUG_vispubdata_new.csv")

## Include deduped authors from DBLP

For the following code I expect that you have already run preparation 2 from all the way at the top of the document. That's the step where you extracted potential VIS authors from DBLP.
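
To illustrate how a DOI can be pulled out of DBLP's list of electronic identifiers, here is a small sketch. It assumes that the identifiers are joined with `::`; the sample URLs are invented.

```python
import re

# Matches the part after "doi.org/" up to the next "::" separator or the end.
DOI_PATTERN = re.compile(r"doi\.org/([^:]+?)(::|$)")


def extract_doi(ee):
    """Return the lower-cased DOI found in `ee`, or None if there is none."""
    match = DOI_PATTERN.search(ee)
    return match.group(1).lower() if match else None


sample = "https://doi.org/10.1109/TVCG.2023.1234567::https://www.example.org/paper.pdf"
print(extract_doi(sample))  # 10.1109/tvcg.2023.1234567
print(extract_doi("https://www.example.org/paper.pdf"))  # None
```

Note that this pattern cannot match a DOI that itself contains a colon, so such a paper would show up as "not found in DBLP".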

In [None]:
def find_doi_substring(text):
    # Define the regex pattern to find the DOI that starts with doi.org
    pattern = r'doi\.org/([^:]+?)(::|$)'
    
    # Search for the pattern in the given text
    match = re.search(pattern, text)
    
    # If a match is found, return the captured group (the DOI substring)
    if match:
        return match.group(1)
    else:
        return None

In [None]:
dblpauthors = pd.read_csv("../dblp-data-extraction/data/VIS-author-articles.csv",keep_default_na=False)

# Define the regex pattern to find the doi out of the list of electronic identifiers
dblpauthors['DOI'] = dblpauthors['ee'].apply(find_doi_substring)
dblpauthors['DOI'] = dblpauthors['DOI'].str.lower()

dblpauthors.head(20)


#if this fails you didn't do step 2 above

In [None]:
# let's prepare to copy things over

vispubdata_new['AuthorNames-Deduped'] = vispubdata_new['AuthorNames-Deduped'].astype(str)

dois_not_in_dblp = []


for index,row in vispubdata_new.iterrows():

    doi = row['DOI']
    #find the DOI in the dblpauthors
    
    #print(doi)
    found = dblpauthors['DOI'].eq(doi).any() 
    #print(found)
   
    if found:

        dblp_rowindex = dblpauthors.index[dblpauthors['DOI'] == doi].tolist()[0]
        #print(dblp_rowindex)
        vispubdata_rowindex = index #pd.Index(vispubdata.DOI).get_loc(doi)
        #print(vispubdata_rowindex)
        
        vispubdata_deduped_authors = vispubdata_new.at[vispubdata_rowindex,'AuthorNames-Deduped']
        authors_on_dblp = dblpauthors.at[dblp_rowindex,'author']
        authors_on_dblp = authors_on_dblp.replace("::",";")
        

        if vispubdata_deduped_authors != authors_on_dblp:
            dblp_deduped_authors = authors_on_dblp
            print("DOI: " +dblpauthors.at[dblp_rowindex,'DOI'] + " " + vispubdata_new.at[vispubdata_rowindex,'DOI'])
            print("Existing authors do not match new authors:")
            print(" old: "+str(vispubdata_deduped_authors))
            print(" new: "+str(dblp_deduped_authors))
            vispubdata_new.at[vispubdata_rowindex,'AuthorNames-Deduped'] = dblp_deduped_authors
        #else:
            #print("Old authors match new authors")
    else:
        dois_not_in_dblp.append(doi)

print(len(dois_not_in_dblp))


In [None]:
print(dois_not_in_dblp)

#DOIs regularly  not found:
#10.0000/00000001 -> M paper
#10.0000/00000002 -> M paper
#10.1109/VISUAL.1991.175767 --> M paper
#10.1109/VISUAL.1991.175823 -- 28 --> M papers (panels etc.)
#10.1109/VISUAL.1992.235182 -- 87 --> M papers
#10.1109/INFVIS.2003.1249000 --> M paper
#10.1109/VISUAL.2003.1250348 --> M paper
#10.1109/INFVIS.2004.20 --> M paper, missing in DBLP
#10.1109/INFVIS.2004.73 --> M paper, missing in DBLP
#10.1109/INFVIS.2005.1532153 --> M paper

#for those papers we just copy over their IEEEXplore author list
for doi in dois_not_in_dblp:
    vispubdata_rowindex = pd.Index(vispubdata_new.DOI).get_loc(doi)
    authors = vispubdata_new.at[vispubdata_rowindex,'AuthorNames']
    vispubdata_new.loc[vispubdata_rowindex, 'AuthorNames-Deduped'] = authors



## The Finale
We save the file that can be copied over to vispubdata to disk. At this point the AMiner citations are still missing, because they are not updated that often. If you do want to update them, head over to the AMiner code once you have finished running the code here.


In [None]:
vispubdata_new.to_csv("results/vispubdata-update.csv",index=False)

## Appendix
Now we generate a few files that are handy for data analysis about authors. This is not strictly necessary for the vispubdata update.
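
As a rough sketch of what these files contain, the following turns a semicolon-separated author column into one row per author and derives the position codes (`F` for first, `M` for middle, `L` for last). It runs on two invented papers and uses pandas' `explode` and `groupby`, which is an alternative to a row-by-row loop and not necessarily how the cells below do it.

```python
import pandas as pd

# Two invented papers: one with three authors, one with a single author.
papers = pd.DataFrame({
    "DOI": ["10.1/a", "10.1/b"],
    "AuthorNames-Deduped": ["Ada A.;Bob B.;Cy C.", "Dee D."],
})

# One row per (paper, author), keeping the author order within each paper.
rows = (
    papers.assign(Author=papers["AuthorNames-Deduped"].str.split(";"))
    .explode("Author")
    .drop(columns="AuthorNames-Deduped")
    .reset_index(drop=True)
)

# 1-based position within each paper, and the number of authors per paper.
rows["PositionNumber"] = rows.groupby("DOI").cumcount() + 1
n_authors = rows.groupby("DOI")["Author"].transform("size")

rows["PositionCode"] = "M"
rows.loc[rows["PositionNumber"] == n_authors, "PositionCode"] = "L"
rows.loc[rows["PositionNumber"] == 1, "PositionCode"] = "F"  # single authors count as first

print(rows)
```

Because the position is computed per DOI group, the last author of the final paper is handled like any other last author.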


In [None]:
deduped_authors = []

author_lists = vispubdata_new['AuthorNames-Deduped'].dropna()

for author_list in author_lists:
    #split the semicolon-separated list into individual names
    for a in author_list.split(';'):
        deduped_authors.append(a)
    

deduped_unique_authors = list(set(deduped_authors))

#print(deduped_unique_authors)    

pd.DataFrame({'Authors':deduped_unique_authors}).to_csv('results/Deduped-Authors.csv',index=False)

In [None]:



#create the papers -> author names df
papersAuthorsTemp = pd.DataFrame({'Paper DOI':vispubdata_new['DOI'],'Author Names':vispubdata_new['AuthorNames-Deduped']})

authors = papersAuthorsTemp['Author Names'].str.split(';').tolist()
print(authors)
papersAuthors = pd.DataFrame(authors, index=papersAuthorsTemp['Paper DOI']).stack()
papersAuthors = papersAuthors.reset_index([0, 'Paper DOI'])
papersAuthors.columns = ['Paper DOI', 'Author Names']

papersAuthors.to_csv("results/DedupedAuthors-Papers.csv",index=False)

papersAuthors.head()

In [None]:
papersAuthorsPosition = papersAuthors
papersAuthorsPosition['PositionNumber'] = -1
papersAuthorsPosition['PositionCode'] = ''
papersAuthorsPosition['Affiliation'] = ''

papersAuthorsPosition.head()
lastDOI = ''
lastPosition = 0
lastindex = -1

#Warning, this assumes that the df was created in sorted order
for index, row in papersAuthorsPosition.iterrows():
    doi = row['Paper DOI']
    position = -1
    affiliation = ''
    
    allAffs = vispubdata_new.loc[vispubdata_new['DOI'] == doi,'AuthorAffiliation'].values[0]
    #print(allAffs)
    
    affList = allAffs.split(";")
    
    if doi != lastDOI:
        #we've reached a new paper
        position = 1
        positionCode = 'F'
        lastDOI = doi
        if len(affList) > 0:
            affiliation = affList[0]
        
        #if we're not at the very first row
        if lastindex != -1:
            #we need to update the positionCode of the last author
            papersAuthorsPosition.loc[lastindex,'PositionCode'] = 'L'
            #if it was a single author, we set them to 'F'
            if papersAuthorsPosition.loc[lastindex,'PositionNumber'] == 1:
                papersAuthorsPosition.loc[lastindex,'PositionCode'] = 'F'

    elif doi == lastDOI:
        #we're still at the same paper
        if len(affList) > lastPosition:
            affiliation = affList[lastPosition]
        position = lastPosition + 1
        positionCode = 'M'
        
        
    papersAuthorsPosition.loc[index,'PositionNumber'] = position
    papersAuthorsPosition.loc[index,'PositionCode'] = positionCode
    papersAuthorsPosition.loc[index,'Affiliation'] = affiliation
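    lastindex = index
    lastPosition = position

#the loop only marks a paper's last author once the next paper starts,
#so the final row of the table still needs its 'L' (single authors stay 'F')
if lastindex != -1 and papersAuthorsPosition.loc[lastindex,'PositionNumber'] != 1:
    papersAuthorsPosition.loc[lastindex,'PositionCode'] = 'L'
        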
        
papersAuthorsPosition.to_csv("results/DedupedAuthors-Papers-Position-Affiliation.csv",index=False)

papersAuthorsPosition.head(10)


And now we're done. Next, update the journal papers by heading over to **journals-at-vis-update.ipynb**.