# Data Cleaning: Scopus Publications

This notebook performs data cleaning and preprocessing on the Scopus publications dataset. The main steps include:

1. Loading and initial data exploration
2. Data cleaning using the ScopusDataCleaner class
3. Data quality checks and validation
4. Visualization of cleaned data

## Setup and Imports

In [1]:
import sys

sys.path.append("/Users/jlq293/Projects/Study-1-Bibliometrics")
import pandas as pd
import numpy as np
import json
from src.data_fetching.ScopusDataCleaner import ScopusDataCleaner
import matplotlib.pyplot as plt

# show all columns
pd.set_option("display.max_columns", None)

## Load Raw Data

First, we load the raw Scopus data from the CSV file.

In [2]:
p = "../data/01-raw/scopusnew/final_scopus_results_20250326_081230.csv"
df = pd.read_csv(p)
print(df.shape)


(44444, 37)


## Data Cleaning

We use the ScopusDataCleaner class to perform the following cleaning steps:

1. Drop unnecessary columns
2. Rename columns for consistency
3. Filter by publication type and subtype
4. Remove duplicates (first by eid, then by abstract)
5. Remove rows with missing abstracts
6. Format dates
7. Create unique author-year combinations
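The internals of ScopusDataCleaner are not shown in this notebook. As a rough sketch of what the duplicate-removal and missing-abstract steps might look like in plain pandas, the example below uses a toy frame with invented values. The helper `drop_duplicates_logged` is hypothetical and is not part of the class.

```python
import pandas as pd

# Toy data: column names mirror the notebook, values are invented for illustration.
toy = pd.DataFrame(
    {
        "eid": ["e1", "e2", "e2", "e3", "e4"],
        "abstract": ["alpha", "beta", "beta", None, "alpha"],
    }
)


def drop_duplicates_logged(frame, column):
    """Hypothetical helper: drop repeated values in `column`, keeping the first.

    Rows where `column` is missing are kept so a later step can handle them.
    Returns the filtered frame and the number of rows removed.
    """
    before = len(frame)
    keep = frame[column].isna() | ~frame[column].duplicated(keep="first")
    out = frame[keep]
    return out, before - len(out)


toy, n_eid = drop_duplicates_logged(toy, "eid")
toy, n_abs = drop_duplicates_logged(toy, "abstract")

# Count and drop rows without an abstract.
n_missing = int(toy["abstract"].isna().sum())
toy = toy.dropna(subset=["abstract"])

print(n_eid, n_abs, n_missing, len(toy))
```

Returning the removed count from each step is one way to build a removal log alongside the cleaning itself. Whether the real class does it this way is an assumption.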

In [3]:
scopus_cleaner = ScopusDataCleaner(df)
scopus_cleaner.drop_columns()
scopus_cleaner.rename_columns()
scopus_cleaner.subset_publication_type()
scopus_cleaner.subset_publication_subtype()
scopus_cleaner.remove_duplicates(column="eid")
scopus_cleaner.remove_duplicates(column="abstract")
scopus_cleaner.remove_missing_abstracts()

scopus_cleaner.date_formater()
scopus_cleaner.unique_auth_year_col()

INFO:src.data.ScopusDataCleaner:Initialized with 44444 rows and 37 columns
INFO:src.data.ScopusDataCleaner:Dropped 12 columns
INFO:src.data.ScopusDataCleaner:Columns renamed successfully
INFO:src.data.ScopusDataCleaner:Removing 600 Book publications
INFO:src.data.ScopusDataCleaner:Removing 238 Book Series publications
INFO:src.data.ScopusDataCleaner:Removing 92 Conference Proceeding publications
INFO:src.data.ScopusDataCleaner:Removing 27 Trade Journal publications
INFO:src.data.ScopusDataCleaner:Remaining publications: 43473
INFO:src.data.ScopusDataCleaner:Removing 757 Conference Paper publications
INFO:src.data.ScopusDataCleaner:Remaining publications: 42716
INFO:src.data.ScopusDataCleaner:Removed 0 duplicates based on eid
INFO:src.data.ScopusDataCleaner:Removed 27 duplicates based on abstract
INFO:src.data.ScopusDataCleaner:Removed 3728 rows with missing abstracts
INFO:src.data.ScopusDataCleaner:Date columns formatted successfully
INFO:src.data.ScopusDataCleaner:Created unique_auth_

## Get Cleaned Data and Removal Log

We retrieve the cleaned dataset and the removal log to track changes made during cleaning.

In [4]:
df = scopus_cleaner.get_dataframe()
print(f"Final shape: {df.shape}")
removal_log = scopus_cleaner.get_removal_log()

Final shape: (38961, 28)
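As a quick sanity check, the final row count can be reconciled against the removal counts reported in the cleaning log. The sketch below only redoes that arithmetic with numbers copied from the log output. The variable names are illustrative.

```python
# Row counts copied from the cleaning log above.
rows_after_subtype_filter = 42716
removed = {
    "duplicates_eid": 0,
    "duplicates_abstract": 27,
    "missing_abstracts": 3728,
}

expected_final_rows = rows_after_subtype_filter - sum(removed.values())
final_rows = 38961  # first element of the reported final shape

assert expected_final_rows == final_rows

# The publication-type stage does not reconcile the same way:
initial_rows = 44444
logged_type_removals = 600 + 238 + 92 + 27  # Book, Book Series, Conference Proceeding, Trade Journal
rows_after_type_filter = 43473
unlogged = (initial_rows - rows_after_type_filter) - logged_type_removals
print(unlogged)
```

The duplicate and missing-abstract steps account for the final count exactly. The publication-type filter, however, removes 14 more rows than its log lines report. One possible explanation is that rows with a missing or unlisted publication type are dropped without a log message, but this is a guess and would need to be checked against the class.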


## Save Cleaned Data

Save the cleaned dataset and removal log for future use.

In [7]:
# save df as pkl
p = "../data/02-clean/articles/scopus_cleaned_20250326_081230.pkl"
df.to_pickle(p)

In [8]:
# save removal log as json
p = "../output/descriptive-stats-logs/"
# Convert NumPy scalar types (e.g. np.int64) to native Python types,
# because the json module cannot serialize them directly
converted_dict = {
    k: (v.item() if isinstance(v, np.generic) else v) for k, v in removal_log.items()
}

# Write the converted dictionary to a JSON file
with open(p + "scopus_removal_log_20250326_081230.json", "w") as f:
    json.dump(converted_dict, f)
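To illustrate why the conversion step is needed, here is a minimal sketch with a made-up log dictionary. The actual keys of `removal_log` are not shown in this notebook, so the ones below are placeholders.

```python
import json

import numpy as np

# Hypothetical log entries: one NumPy integer, one native Python integer.
sample_log = {"duplicates_abstract": np.int64(27), "missing_abstracts": 3728}

# json rejects NumPy integer scalars because they are not Python ints.
try:
    json.dumps(sample_log)
    raised = False
except TypeError:
    raised = True

# Same conversion as in the cell above: .item() returns the native Python value.
converted = {
    k: (v.item() if isinstance(v, np.generic) else v) for k, v in sample_log.items()
}

# The converted dictionary survives a JSON round trip unchanged.
round_trip = json.loads(json.dumps(converted))
print(raised, round_trip)
```

This conversion only handles top-level scalar values. If the log ever contains nested dictionaries or NumPy arrays, a recursive conversion or a custom `default` function passed to `json.dump` would be needed.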