# Clean Scopus Data

This notebook demonstrates the cleaning and preprocessing of Scopus publication data using the `ScopusDataCleaner` class.

## Overview

The cleaning process includes:
1. Loading and initial data inspection
2. Removing unnecessary columns
3. Renaming columns for clarity
4. Filtering publication types
5. Removing duplicates
6. Formatting dates
7. Creating unique author-year identifiers
8. Final data validation and export

## Setup and Imports

In [4]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

import sys
sys.path.append("..")

from src.data.ScopusDataCleaner import ScopusDataCleaner, ScopusConfig

# Set style for better visualizations.
# The bare 'seaborn' style name was deprecated in matplotlib 3.6 and later removed,
# so we apply seaborn's default theme through seaborn itself instead.
sns.set_theme()
sns.set_palette("husl")

# Configure display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

## Load Data

First, we'll load the raw Scopus data from the CSV file.

In [None]:
# Define paths
data_dir = Path("../data/raw")
output_dir = Path("../data/processed")

# Load the raw data
df = pd.read_csv(data_dir / "scopus.csv")
print(f"Loaded DataFrame with shape: {df.shape}")
print("\nColumns:")
print(df.columns.tolist())

## Initialize Data Cleaner

We'll create a custom configuration for the data cleaner. `ScopusConfig` lets us:
1. Specify which publication types to keep
2. Define which subtypes to remove
3. Customize column renaming if needed

The configuration below sets the first two and leaves column renaming at its defaults.

In [None]:
# Create custom configuration
config = ScopusConfig(
    publication_types_to_keep=["Journal"],  # Keep only publications from journal sources
    publication_subtypes_to_remove=["Conference Paper"],  # Remove conference papers
)

# Initialize the cleaner
cleaner = ScopusDataCleaner(df, config=config)
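
To make the shape of such a configuration concrete, here is a minimal sketch of how a config object like this could be structured as a dataclass. It is not the actual `ScopusConfig` implementation: the class name `SketchConfig`, the defaults, and the `column_renames` field are assumptions for illustration. Only the two keyword names used above come from this notebook.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SketchConfig:
    """Hypothetical stand-in for ScopusConfig; the real class may differ."""

    # Publication types retained by the type filter (default is an assumption).
    publication_types_to_keep: List[str] = field(default_factory=lambda: ["Journal"])
    # Publication subtypes dropped by the subtype filter.
    publication_subtypes_to_remove: List[str] = field(default_factory=list)
    # Hypothetical field: maps raw column names to cleaned column names.
    column_renames: Dict[str, str] = field(default_factory=dict)


sketch = SketchConfig(
    publication_subtypes_to_remove=["Conference Paper"],
    column_renames={"Source title": "journal"},  # illustrative names only
)
print(sketch)
```

Using `default_factory` gives each instance its own list or dict, so changing one configuration never leaks into another.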

## Clean Data

Now we'll apply the cleaning steps in sequence. Each step will log its actions and results.

In [None]:
# Apply cleaning steps
cleaner.drop_columns()
cleaner.rename_columns()
cleaner.subset_publication_type()
cleaner.subset_publication_subtype()
cleaner.remove_duplicates()
cleaner.date_formater()
cleaner.unique_auth_year_col()

# Get the cleaned DataFrame
cleaned_df = cleaner.get_dataframe()

# Display cleaning summary
print("\nCleaning Summary:")
for key, value in cleaner.get_removal_log().items():
    print(f"{key}: {value}")
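
For readers without access to `ScopusDataCleaner`, the sketch below shows roughly what the filtering, deduplication, date formatting, and author-year steps could look like in plain pandas. It runs on a toy DataFrame, and every column name in it (`first_author`, `publication_subtype`, `auth_year`, and so on) is an assumption. The class itself may implement these steps differently.

```python
import pandas as pd

# Toy records; all column names and values here are hypothetical.
toy = pd.DataFrame(
    {
        "title": ["A", "A", "B", "C"],
        "first_author": ["Smith", "Smith", "Lee", "Kim"],
        "publication_type": ["Journal", "Journal", "Journal", "Book"],
        "publication_subtype": ["Article", "Article", "Conference Paper", "Chapter"],
        "date": ["2020-01-15", "2020-01-15", "2021-06-01", "2019-03-10"],
    }
)

# Keep the selected publication types, then drop unwanted subtypes.
kept = toy[toy["publication_type"].isin(["Journal"])]
kept = kept[~kept["publication_subtype"].isin(["Conference Paper"])]

# Drop exact duplicate rows and renumber the index.
kept = kept.drop_duplicates().reset_index(drop=True)

# Parse date strings and derive a year column.
kept["date"] = pd.to_datetime(kept["date"], errors="coerce")
kept["year"] = kept["date"].dt.year

# Build an author-year identifier by joining author name and year.
kept["auth_year"] = kept["first_author"] + "_" + kept["year"].astype(str)

print(kept)
```

Of the four toy rows, the book chapter and the conference paper are filtered out and the two identical Smith rows collapse into one, so a single row remains.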

## Data Validation

Let's run some basic validation checks on the cleaned data: missing values per column, column data types, and any remaining duplicate rows.

In [None]:
# Check for missing values
missing_values = cleaned_df.isnull().sum()
print("Missing values per column:")
print(missing_values[missing_values > 0])

# Check data types
print("\nData types:")
print(cleaned_df.dtypes)

# Check for duplicates
duplicates = cleaned_df.duplicated().sum()
print(f"\nNumber of duplicates: {duplicates}")
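
The printed checks above rely on someone reading the output. As a complement, here is a minimal sketch of the same checks wrapped in a function that returns a list of problems, which is easier to assert on in a script or test. The function name `validate` and the column names in the example are assumptions, not part of `ScopusDataCleaner`.

```python
import pandas as pd


def validate(frame: pd.DataFrame, required: list) -> list:
    """Return human-readable problems; an empty list means all checks passed."""
    problems = []

    # Required columns that are absent from the frame.
    missing_cols = [c for c in required if c not in frame.columns]
    if missing_cols:
        problems.append(f"missing columns: {missing_cols}")

    # Fully duplicated rows (first occurrence is not counted).
    n_dupes = int(frame.duplicated().sum())
    if n_dupes:
        problems.append(f"{n_dupes} duplicate rows")

    # Null values in required columns that are present.
    for col in required:
        if col in frame.columns and frame[col].isnull().any():
            problems.append(f"null values in '{col}'")

    return problems


# Toy frame with one duplicate row and one missing year (hypothetical columns).
sample = pd.DataFrame({"title": ["A", "A", "B"], "year": [2020, 2020, None]})
print(validate(sample, required=["title", "year", "publication_type"]))
```

On the real data, something like `assert not validate(cleaned_df, required=[...])` would stop the notebook before export if any check fails, with the required column list chosen to match the cleaned schema.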

## Visualize Publication Trends

Let's create some visualizations to understand the publication patterns.

In [None]:
# Create figure with subplots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))

# Plot 1: Publications per year
yearly_counts = cleaned_df['year'].value_counts().sort_index()
yearly_counts.plot(kind='bar', ax=ax1)
ax1.set_title('Publications per Year')
ax1.set_xlabel('Year')
ax1.set_ylabel('Number of Publications')
ax1.tick_params(axis='x', rotation=45)

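# Plot 2: Publication types distribution
# (shows a single slice if only one publication type was kept during cleaning)
type_counts = cleaned_df['publication_type'].value_counts()
type_counts.plot(kind='pie', ax=ax2, autopct='%1.1f%%')
ax2.set_ylabel('')  # Hide the default y-axis label on the pie chart
ax2.set_title('Distribution of Publication Types')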

plt.tight_layout()
plt.show()

## Save Cleaned Data

Finally, we'll save the cleaned dataset to a CSV file.

In [None]:
# Create output directory if it doesn't exist
output_dir.mkdir(parents=True, exist_ok=True)

# Save cleaned data
output_file = output_dir / "scopus_cleaned.csv"
cleaned_df.to_csv(output_file, index=False)
print(f"Saved cleaned data to: {output_file}")
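
After saving, it can be worth confirming that the file reads back as expected. The sketch below shows one way to do such a round-trip check. It uses a toy DataFrame and a temporary directory rather than the real output path, and the column names are assumptions.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame standing in for the cleaned data (hypothetical columns).
frame = pd.DataFrame({"title": ["A", "B"], "year": [2020, 2021]})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "scopus_cleaned.csv"
    frame.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

# Shape and column order should survive the round trip.
round_trip_ok = (
    reloaded.shape == frame.shape
    and list(reloaded.columns) == list(frame.columns)
)
print(f"Round trip OK: {round_trip_ok}")
```

Note that CSV does not store data types, so datetime columns come back as plain strings unless `parse_dates` is passed to `pd.read_csv`.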