# Matching citation counts from AMiner v14

Author: Natkamon Tovanich
Last Update: 06/06/2024

**Instructions**
1. Run first_scan.ipynb to match each paper's title and/or DOI against AMiner.
2. Run second_scan.py to find candidate matches for the remaining papers using a string distance function.
3. Manually select the matching papers from the second scan in candidate_papers.csv:
    - Sort by matching score. Scores above 80 are usually good matches; rows below that can most likely be discarded. Check that the two titles are similar.
    - Delete the rows whose titles do not look similar.
    - Save the file.
4. Run merge_data.ipynb to merge the results of the first scan (exact title or DOI match) with those of the second scan (string distance candidates).
5. The result is available in vispubdata_citation.csv.
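
The manual review in step 3 relies on a 0-100 similarity score. The sketch below shows how such a score could be computed and thresholded. It uses `difflib.SequenceMatcher` from the Python standard library as a stand-in, which may not be the string distance function that second_scan.py uses. The helper names `similarity` and `filter_candidates` and the sample titles are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0-100 similarity score between two titles (case-insensitive)."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def filter_candidates(pairs, threshold=80):
    """Keep (vispub_title, aminer_title) pairs whose score exceeds the threshold."""
    return [(a, b) for a, b in pairs if similarity(a, b) > threshold]


# Invented example pairs: a spelling variant and an unrelated title
pairs = [
    ("Visualizing Large Graphs", "Visualising large graphs"),
    ("Visualizing Large Graphs", "A Survey of Deep Learning"),
]
kept = filter_candidates(pairs)
```

Only the spelling variant passes the threshold here. Pairs close to the cutoff still need the manual title check described above.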

### First scan: match by DOI or exact title

In [1]:
import pandas as pd
import json
import re
import pickle

Load the datasets

At the moment you have to switch datasets manually here. We need to fix this later so it runs through both automatically.

In [3]:
datasets = ['./vispubdata-update/results/vispubdata-update.csv',
            './vispubdata-update/results/vispubdata-update-journals.csv']
appendices = ['vis-papers','journal-papers']

datasets_dfs = []

for dataset in datasets:
    df = pd.read_csv(dataset, keep_default_na=False)
    datasets_dfs.append(df)
    


Normalize titles (lower case, non-word characters removed) and convert DOIs to lower case

In [4]:
for df in datasets_dfs:
    df["title"] = df["Title"].apply(lambda x: re.sub(r'[^\w]', '', str(x).lower()))
    df["doi"] = df["DOI"].apply(lambda x: str(x).lower())
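
To illustrate why non-word characters are stripped, the sketch below applies the same regular expression to two differently punctuated versions of a title. `normalize_title` is a hypothetical helper name and the sample titles are invented.

```python
import re


def normalize_title(title) -> str:
    """Lower-case a title and drop every character that is not a letter, digit, or underscore."""
    return re.sub(r'[^\w]', '', str(title).lower())


# Two spellings of the same (invented) title collapse to one lookup key
a = normalize_title("Graph-Based Views: A Case Study")
b = normalize_title("graph based views - a case study")
```

Both calls produce the same key, so differences in case, spacing, and punctuation do not prevent an exact title match.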

Read the AMiner v14 dataset line by line and check whether each paper matches by DOI or by exact title.

In [7]:
%%time
def checkTitleMatch(titles, dois):
    """Scan AMiner v14 and match papers by DOI first, then by exact normalized title.

    titles: dict mapping normalized title -> VisPubData DOI
    dois:   dict mapping lower-case DOI -> VisPubData DOI
    Returns the list of matches and a dict mapping every AMiner title (lower case) to its id.
    """
    count, match_doi, match_title = 0, 0, 0
    m, choices = list(), dict()
    with open("./aminer-citation-update/data/dblp_v14.json", encoding="utf8") as infile:
        for line in infile:
            # Each paper sits on its own line: drop the trailing comma before parsing
            line = line.strip().strip(',').strip("]'")
            # Skip blank lines and anything that is not a JSON object
            if not line.startswith('{'):
                continue
            paper = json.loads(line)
            title = str(paper['title'])
            lower = re.sub(r'[^\w]', '', title.lower())

            # Keep every paper title to check for missing matches later
            choices[title.lower()] = paper['id']

            # First, try to match by paper DOI
            doi = str(paper.get("doi", "")).lower()
            if "doi" in paper and doi in dois:
                m.append([dois[doi], paper['id'], 'doi'])
                match_doi += 1

            # Otherwise, match by the exact normalized title
            elif lower in titles:
                m.append([titles[lower], paper['id'], 'title'])
                match_title += 1

            # Report progress every million papers
            count += 1
            if count % 1000000 == 0:
                print(count, match_doi, match_title)
        print(count, match_doi, match_title)
    return m, choices

CPU times: total: 0 ns
Wall time: 0 ns
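
The matching order (DOI first, exact normalized title as a fallback) can be tried on a few in-memory lines before scanning the full dump. This is a minimal sketch, not the notebook's function: `match_lines` is a hypothetical helper, and the sample records, DOIs, and ids are invented.

```python
import json
import re


def match_lines(lines, titles, dois):
    """Match JSON paper lines against lookup dicts: DOI first, then exact normalized title."""
    matches = []
    for line in lines:
        line = line.strip().rstrip(',')
        if not line.startswith('{'):  # skip blank lines and the array brackets
            continue
        paper = json.loads(line)
        doi = str(paper.get('doi', '')).lower()
        key = re.sub(r'[^\w]', '', str(paper.get('title', '')).lower())
        if doi and doi in dois:
            matches.append([dois[doi], paper['id'], 'doi'])
        elif key in titles:
            matches.append([titles[key], paper['id'], 'title'])
    return matches


# Invented sample: a JSON array written with one record per line
sample = [
    '[',
    '{"id": "a1", "title": "Some Paper", "doi": "10.1000/ABC"},',
    '{"id": "a2", "title": "Another paper!"},',
    '{"id": "a3", "title": "Unrelated"}',
    ']',
]
dois = {"10.1000/abc": "10.1000/abc"}      # lower-case DOI -> original DOI
titles = {"anotherpaper": "10.1000/xyz"}   # normalized title -> original DOI
matches = match_lines(sample, titles, dois)
```

The first record matches by DOI despite the different case, the second matches by title only because it has no DOI, and the third is left for the second scan.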


Convert to dictionaries for faster lookups

In [11]:
for index, df in enumerate(datasets_dfs):
    # Map normalized title and lower-case DOI to the original DOI
    titles = df.set_index('title')['DOI'].to_dict()
    dois = df.set_index('doi')['DOI'].to_dict()
    matches, aminer_titles = checkTitleMatch(titles, dois)

    # Save the exact matches for this dataset
    m = pd.DataFrame(matches, columns=['vispub_doi', 'aminer_id', 'method'])
    m.to_csv('./aminer-citation-update/results/exact_matching'+appendices[index]+".csv", index=False)

    # Save all AMiner titles for the second scan
    with open('./aminer-citation-update/results/aminer_titles'+appendices[index]+".p", 'wb') as fp:
        pickle.dump(aminer_titles, fp, protocol=pickle.HIGHEST_PROTOCOL)
        print("Wrote File: " + str(index))