# Clean bibliographic datasets
- Read bibliographic datasets
- Select and rename columns
- Normalise entries
- Merge datasets into a single dataset
- Clean dataset
- Write dataset to file

**TODO**
- Add the possibility to switch off the logger output.
- Print some basic stats at the end of the code.
- Group the Scopus, Lens, and Dimensions operations so that each group can be skipped with an if statement when the user doesn't specify all three directories.
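
For the first TODO item, a minimal sketch of a logger switch using only the standard `logging` module. It assumes the messages come from a logger named `"Biblio"` (inferred from the `- Biblio -` field in the log lines); `set_logger_output` is a hypothetical helper, not part of the project's `utilities` module.

```python
import logging


def set_logger_output(enabled: bool, logger_name: str = "Biblio") -> None:
    """Switch a named logger on or off.

    The logger name "Biblio" is an assumption taken from the log output.
    """
    logger = logging.getLogger(logger_name)
    if enabled:
        logger.setLevel(logging.INFO)
    else:
        # A level above CRITICAL suppresses every standard log message.
        logger.setLevel(logging.CRITICAL + 1)


set_logger_output(False)
logging.getLogger("Biblio").info("not shown while output is switched off")
set_logger_output(True)
```

Note that this only affects messages routed through `logging`; lines written with plain `print` calls would need a separate switch.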

In [1]:
import sys

from pathlib import Path

# Add the src directory to the Python path
src_path = Path("..") / "src"
src_dir = str(src_path.resolve())  # sys.path holds strings, so compare as a string
if src_dir not in sys.path:
    sys.path.insert(0, src_dir)

from config import *
from utilities import *
from clean import *

In [2]:
# Input parameters
# -----------------------
biblio_project_dir = 'example_project'  # directory for the data and models of your bibliographic project
scopus_input_dir = 'raw/scopus'         # directory with Scopus data; leave empty if you don't have Scopus data
lens_input_dir = 'raw/lens'             # directory with Lens data; leave empty if you don't have Lens data
dims_input_dir = 'raw/dimensions'       # directory with Dimensions data; leave empty if you don't have Dimensions data

output_dir = 'processed'                # directory where you want to save the merged and cleaned bibliographic dataset
output_file = 'biblio_example_all.csv'  # filename of the bibliographic dataset; leave empty if you don't want to save the data

n_rows = 1000                           # the maximum number of rows read for each dataset; set to 0 if you want to read all the data

write_cols = ['authors', 'title', 'abstract', 'year', 'pub_date', 
              'n_cited', 'source', 'kws', 'fos', 'anzsrc_2020', 
              'auth_affils', 'link', 'links', 'bib_src', 'scopus_id', 
              'lens_id', 'dims_id']
# -----------------------
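
The `n_rows` convention above (0 means "read everything") can be sketched with the standard library alone. How `read_biblio_csv_files_to_df` handles the value internally is not shown in this notebook, so `read_rows` below is a hypothetical stand-in that only illustrates one plausible mapping from 0 to "no limit".

```python
import csv
import io
from itertools import islice


def read_rows(handle, n_rows=0):
    """Read at most n_rows records from a CSV handle; n_rows <= 0 reads all."""
    reader = csv.DictReader(handle)
    limit = n_rows if n_rows > 0 else None  # islice treats None as "no limit"
    return list(islice(reader, limit))


# Small in-memory example (made-up records, for illustration only).
sample = "title,year\nA,2001\nB,2002\nC,2003\n"
first_two = read_rows(io.StringIO(sample), n_rows=2)
everything = read_rows(io.StringIO(sample), n_rows=0)
```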

In [3]:
# 1. Read the bibliographic datasets
scopus_df = read_biblio_csv_files_to_df(biblio_project_dir = biblio_project_dir, 
                                        input_dir = scopus_input_dir,
                                        biblio_source = BiblioSource.SCOPUS,
                                        n_rows = n_rows)

lens_df = read_biblio_csv_files_to_df(biblio_project_dir = biblio_project_dir, 
                                      input_dir = lens_input_dir,
                                      biblio_source = BiblioSource.LENS,
                                      n_rows = n_rows)

dims_df = read_biblio_csv_files_to_df(biblio_project_dir = biblio_project_dir, 
                                      input_dir = dims_input_dir,
                                      biblio_source = BiblioSource.DIMS,
                                      n_rows = n_rows)

# 2. Select and rename columns from the dataset
scopus_df = modify_cols_biblio_df(biblio_df_ = scopus_df, 
                                  reshape_base = Reshape.SCOPUS_ALL)

lens_df = modify_cols_biblio_df(biblio_df_ = lens_df, 
                                reshape_base = Reshape.LENS_ALL)

dims_df = modify_cols_biblio_df(biblio_df_ = dims_df, 
                                reshape_base = Reshape.DIMS_ALL)

# 3. Normalise key variables in the dataset (bib_src, links, keywords, authors, author-affils)
scopus_df = normalise_biblio_entities(biblio_df_ = scopus_df)
lens_df = normalise_biblio_entities(biblio_df_ = lens_df)
dims_df = normalise_biblio_entities(biblio_df_ = dims_df)

# 4. Merge the datasets from Scopus, Lens, and Dimensions
biblio_df = merge_biblio_dfs(scopus_df, lens_df, dims_df)

# 5. Clean the title and abstract, remove duplicate titles, and merge values from
#    different bibliographic datasets
biblio_df = clean_biblio_df(biblio_df_ = biblio_df)

# 6. Optionally save the results
if output_file:
    write_df(biblio_df = biblio_df[write_cols],
             biblio_project_dir = biblio_project_dir,
             output_dir = output_dir,
             output_file = output_file)
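
The third TODO item (skipping sources whose directory is left empty) could be approached by looping over the sources instead of repeating each step three times. The sketch below is self-contained: `process_selected_sources` and `fake_pipeline` are hypothetical names, and the stand-in pipeline only marks where steps 1 to 3 would be chained in the notebook.

```python
from typing import Callable, Dict, List


def process_selected_sources(input_dirs: Dict[str, str],
                             pipeline: Callable[[str, str], list]) -> List[list]:
    """Run pipeline(source_name, input_dir) for every source with a non-empty directory."""
    results = []
    for source_name, input_dir in input_dirs.items():
        if not input_dir:  # empty string or None: no data for this source
            continue
        results.append(pipeline(source_name, input_dir))
    return results


# Stand-in for illustration only. In the notebook this would chain
# read_biblio_csv_files_to_df, modify_cols_biblio_df, and normalise_biblio_entities.
def fake_pipeline(source_name, input_dir):
    return [f"{source_name}:{input_dir}"]


frames = process_selected_sources(
    {"scopus": "raw/scopus", "lens": "", "dims": "raw/dimensions"},
    fake_pipeline,
)
print(frames)  # [['scopus:raw/scopus'], ['dims:raw/dimensions']]
```

Whether `merge_biblio_dfs` accepts a variable number of dataframes is not confirmed by this notebook, so passing it a shorter list is an assumption that would need checking.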


2023-05-21 11:08:17,394 - Biblio - Reading 2 CSV files...
2023-05-21 11:08:17,443 - Biblio - File: scopus_example_1999_2017.csv, Size: 472 rows
2023-05-21 11:08:17,473 - Biblio - File: scopus_example_2018_2023.csv, Size: 520 rows
2023-05-21 11:08:17,491 - Biblio - Total number of publications in the dataframe: 992
2023-05-21 11:08:17,492 - Biblio - Reading 1 CSV files...
2023-05-21 11:08:17,520 - Biblio - File: lens_example.csv, Size: 1000 rows
2023-05-21 11:08:17,528 - Biblio - Total number of publications in the dataframe: 1000
2023-05-21 11:08:17,529 - Biblio - Reading 1 CSV files...
2023-05-21 11:08:17,576 - Biblio - File: dims_example.csv, Size: 999 rows
2023-05-21 11:08:17,592 - Biblio - Total number of publications in the dataframe: 999
2023-05-21 11:08:18,106 - Biblio - Number of publications in the input biblio_df: 2991


Removed 0 titles that were empty strings
Removed 0 titles that were NaN
Removed 8 records where the title contained "conference", "workshop", or "proceeding"
Removed additional 3 titles that were empty strings
Replaced 213 abtracts that were NaN with an empty string


2023-05-21 11:08:18,844 - Biblio - Number of publications before removing duplicate titles: 2980


Duplicate group: #2600 

2023-05-21 11:08:22,102 - Biblio - Number of publications after removing duplicate titles: 2668
2023-05-21 11:08:22,146 - Biblio - Writing biblio_df (2668 publications) to file 'biblio_example_all.csv'...
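
For the second TODO item, a minimal sketch of an end-of-run summary. The three counts are taken from the log output above; `summarise_counts` is a hypothetical helper, and in practice the counts would be collected from the dataframes rather than typed in.

```python
def summarise_counts(n_input: int, n_before_dedup: int, n_after_dedup: int) -> dict:
    """Derive basic cleaning statistics from three publication counts."""
    return {
        "input": n_input,
        "removed_by_title_filters": n_input - n_before_dedup,
        "removed_as_duplicates": n_before_dedup - n_after_dedup,
        "output": n_after_dedup,
        "retained_pct": round(100 * n_after_dedup / n_input, 1),
    }


# Counts copied from the log lines of this run.
stats = summarise_counts(n_input=2991, n_before_dedup=2980, n_after_dedup=2668)
for name, value in stats.items():
    print(f"{name}: {value}")
```

With these counts, 11 records are dropped by the title filters (8 conference-like titles plus 3 empty titles) and 312 by duplicate removal.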
