# Matching citation counts from AMiner v14

Author: Natkamon Tovanich
Last Update: 06/06/2024

**Instructions**
1. Run first_scan.ipynb to match each paper to AMiner by DOI or exact title.
2. Run second_scan.py to find potential matches using a string distance function.
3. Manually select the correct matches from the second-scan candidates in candidate_papers.csv.
4. Run merge_data.ipynb to merge the results of the first scan (DOI or exact title matches) and the second scan (string distance candidates).
5. The result is available in vispubdata_citation.csv.
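
The string distance step in the instructions above can be sketched as follows. second_scan.py is not shown here, so this is only an illustration that assumes the standard-library `difflib` module as the similarity measure; the helper `find_candidates`, the cutoff of 0.9, and the sample titles are hypothetical.

```python
import difflib

def find_candidates(title, aminer_titles, n=3, cutoff=0.9):
    """Return up to n AMiner titles whose similarity to `title` is at least `cutoff`."""
    # get_close_matches ranks possibilities by SequenceMatcher ratio.
    return difflib.get_close_matches(title.lower(), aminer_titles, n=n, cutoff=cutoff)

# Hypothetical sample data, for illustration only.
aminer_titles = [
    "a survey of graph visualization techniques",
    "interactive exploration of large time series",
]

# A spelling variant ("visualisation") fails an exact match but is a close candidate.
candidates = find_candidates("A Survey of Graph Visualisation Techniques", aminer_titles)
print(candidates)
```

Candidates found this way still need the manual review described in step 3, since a high similarity score does not guarantee that two titles refer to the same paper.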

### First scan: match by DOI or exact title

In [None]:
import pandas as pd
import json
import re
import pickle

Load the dataset

In [None]:
df = pd.read_csv("../vispubdata-update/vispubdata-update.csv", keep_default_na=False)
df

Normalize the titles (lowercase, with all non-word characters removed) and lowercase the DOIs

In [None]:
df["title"] = df["Title"].apply(lambda x: re.sub(r'[^\w]', '', str(x).lower()))
df["doi"] = df["DOI"].apply(lambda x: str(x).lower())
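
To illustrate what the title normalization does, here is a small standalone sketch using the same regular expression. The helper name `normalize_title` and the sample titles are hypothetical and are not part of the notebook.

```python
import re

def normalize_title(title):
    """Lowercase the title and drop everything except letters, digits, and underscores."""
    return re.sub(r'[^\w]', '', str(title).lower())

# Titles that differ only in case, spacing, or punctuation collapse to the same key.
a = normalize_title("Visual Analytics: A Survey!")
b = normalize_title("visual analytics a survey")
print(a, a == b)
```

Because whitespace and punctuation are removed entirely, "exact title" matching in this notebook means equality of these normalized keys rather than of the raw strings.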

Build dictionaries keyed by normalized title and by lowercase DOI for fast lookups

In [None]:
titles = df.set_index('title')['DOI'].to_dict()
dois = df.set_index('doi')['DOI'].to_dict()
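
One caveat with dictionary lookups is that a dictionary holds a single value per key, so two papers whose titles normalize to the same string would share one entry. The following sketch shows one way to check for such collisions beforehand. It uses plain Python with hypothetical titles, and `normalize_title` is an illustrative helper that mirrors the normalization used above.

```python
import re
from collections import Counter

def normalize_title(title):
    """Same normalization as in the notebook: lowercase, non-word characters removed."""
    return re.sub(r'[^\w]', '', str(title).lower())

# Hypothetical titles, for illustration only.
raw_titles = ["Graph Drawing", "Graph-Drawing", "Volume Rendering"]

# Count how many raw titles map to each normalized key.
counts = Counter(normalize_title(t) for t in raw_titles)
collisions = [key for key, n in counts.items() if n > 1]
print(collisions)
```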

Read the AMiner v14 dataset line by line and check whether each paper matches by DOI or, failing that, by exact normalized title.

In [None]:
%%time
count, match_doi, match_title = 0, 0, 0
m, l, choices = list(), dict(), dict()
with open("dblp_v14.json", encoding="utf8") as infile:
    for line in infile:
        line = line.strip().strip(',').strip("]'")
        if not line.startswith('{'):  # also skips empty lines, which would make line[0] raise IndexError
            continue
        paper = json.loads(line)
        lower = re.sub(r'[^\w]', '', str(paper['title']).lower())
        
        # Record every AMiner title (lowercased) so unmatched papers can be checked later
        choices[paper['title'].lower()] = paper['id']
        
        # First, try to match by paper DOI
        if "doi" in paper and str(paper["doi"]).lower() in dois:
            m.append([dois[str(paper["doi"]).lower()], paper['id'], 'doi'])
            l[paper['id']] = paper
            match_doi += 1
        
        # Match by the exact title
        elif lower in titles:
            m.append([titles[lower], paper['id'], 'title'])
            l[paper['id']] = paper
            match_title += 1
            
        count += 1
        if count % 1_000_000 == 0:
            print(count, match_doi, match_title)
    print(count, match_doi, match_title)
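
The line cleaning in the loop above suggests that dblp_v14.json is a single JSON array with one object per line. Under that assumption, the parsing step can be sketched in isolation on a tiny in-memory sample. The generator `iter_papers` and the sample records are hypothetical.

```python
import io
import json

def iter_papers(lines):
    """Yield one dict per record from a JSON array stored with one object per line."""
    for line in lines:
        # Drop surrounding whitespace and the comma that separates array elements.
        line = line.strip().rstrip(',')
        if not line.startswith('{'):
            continue  # skips the "[" and "]" lines and any blank lines
        yield json.loads(line)

# Hypothetical miniature file in the assumed layout.
sample = io.StringIO(
    '[\n'
    '{"id": "p1", "title": "First Paper", "doi": "10.1000/a"},\n'
    '{"id": "p2", "title": "Second Paper"}\n'
    ']\n'
)
papers = list(iter_papers(sample))
print([p["id"] for p in papers])
```

Streaming the file this way avoids loading the whole array into memory with a single `json.load` call, which matters for a dump of this size.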

In [None]:
m = pd.DataFrame(m, columns=['vispub_doi', 'aminer_id', 'method'])
m.to_csv('results/exact_matching.csv', index=False)

In [None]:
with open('results/aminer_titles.p', 'wb') as fp:
    pickle.dump(choices, fp, protocol=pickle.HIGHEST_PROTOCOL)