# Data cleanup

In this notebook, we clean the Agrofood CO2 emission dataset: we select the columns relevant to our analysis, remove duplicate rows, drop rows with missing values, and summarize the features that remain.

In [12]:
import pandas as pd

In [13]:
# Load the dataset
df = pd.read_csv("../data/raw/Agrofood_co2_emission.csv")

# Define relevant columns

Only some columns of the dataset are relevant for our analysis. We will keep the following columns:

- Year
- Area
- total_emission
- Savanna fires
- Forest fires

In [14]:
relevant_columns = ['Year', 'Area', 'total_emission', 'Savanna fires', 'Forest fires']
df = df[relevant_columns]
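
As an alternative to loading the full file and then subsetting, the column selection can be pushed into `pd.read_csv` through its `usecols` parameter. The sketch below is a minimal illustration, not the notebook's actual pipeline. It uses a small in-memory CSV with made-up values, and the names `toy_csv`, `toy_df`, and `extra_column` are hypothetical.

```python
import io

import pandas as pd

# Toy CSV standing in for the real file; all values are made up for illustration.
toy_csv = io.StringIO(
    "Area,Year,Savanna fires,Forest fires,extra_column,total_emission\n"
    "Country A,1990,5.5,7.0,59.2,3475.3\n"
    "Country A,1991,5.5,7.0,31.4,5680.1\n"
)

relevant_columns = ['Year', 'Area', 'total_emission', 'Savanna fires', 'Forest fires']

# usecols restricts parsing to the listed columns; pandas raises a ValueError
# if any listed column is absent from the file.
toy_df = pd.read_csv(toy_csv, usecols=relevant_columns)

# usecols preserves the file's column order, so reorder explicitly.
toy_df = toy_df[relevant_columns]

print(toy_df.columns.tolist())
```

Reading only the needed columns saves memory on wide files and fails early if a column name is misspelled.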

# Remove duplicates

Next, we remove duplicate rows from the dataset and drop any rows with missing values in the relevant columns.

In [15]:
# Check for exact duplicates across all columns
exact_duplicates = df.duplicated().sum()
print(f"Exact duplicates found: {exact_duplicates}")

# Check duplicates on specific key columns
key_duplicates = df.duplicated(subset=['Year', 'Area']).sum()
print(f"Duplicates based on Year and Area: {key_duplicates}")

# Remove exact duplicates (keep first occurrence)
df_deduped = df.drop_duplicates(keep='first')

# Drop rows with a missing value in any relevant column
df_cleaned = df_deduped.dropna(subset=relevant_columns)

# View duplicate rows
duplicate_rows = df[df.duplicated(keep=False)]
print(f"Total rows involved in duplication: {len(duplicate_rows)}")

# Log the two steps separately so that rows dropped for missing values
# are not reported as duplicates
cleaning_log = {
    'original_rows': len(df),
    'exact_duplicates_removed': len(df) - len(df_deduped),
    'rows_with_missing_values_removed': len(df_deduped) - len(df_cleaned),
    'final_rows': len(df_cleaned)
}
print(f"\n\nCleaning summary: {cleaning_log}")

Exact duplicates found: 0
Duplicates based on Year and Area: 0
Total rows involved in duplication: 0


Cleaning summary: {'original_rows': 6965, 'exact_duplicates_removed': 0, 'rows_with_missing_values_removed': 93, 'final_rows': 6872}
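
To see which cleaning step removes which rows, the duplicate removal and the missing-value removal can be tallied separately, and the missing values counted per column before they are dropped. The following is a minimal sketch on a toy frame with made-up values. The names `toy`, `deduped`, `complete`, and `report` are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame: rows 0 and 1 are exact duplicates; rows 2 and 3 each have one NaN.
toy = pd.DataFrame({
    'Year': [1990, 1990, 1991, 1992],
    'Area': ['A', 'A', 'A', 'B'],
    'total_emission': [1.0, 1.0, 2.0, 3.0],
    'Savanna fires': [0.1, 0.1, np.nan, 0.3],
    'Forest fires': [0.2, 0.2, 0.5, np.nan],
})

# Step 1: remove exact duplicates, keeping the first occurrence.
deduped = toy.drop_duplicates(keep='first')

# Count missing values per column before dropping anything.
missing_per_column = deduped.isna().sum()

# Step 2: drop rows that still contain a missing value.
complete = deduped.dropna()

report = {
    'original_rows': len(toy),
    'duplicates_removed': len(toy) - len(deduped),
    'rows_with_missing_removed': len(deduped) - len(complete),
    'final_rows': len(complete),
}
print(missing_per_column)
print(report)
```

Keeping an intermediate frame after each step makes the row counts attributable to a single operation.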


# Feature summary

Finally, we summarize the features of the cleaned dataset: the data type, the share of missing values, and the number of unique values for each column.

In [19]:
summary = pd.DataFrame({
    'Feature Name': df_cleaned.columns,
    'Type': df_cleaned.dtypes,
    'Missing?': df_cleaned.isnull().mean().round(2),
    'Unique Values': df_cleaned.nunique()
})
# Print the label on its own line so the table header stays aligned
print("\n\nFeature summary table:")
print(summary)



Feature summary table:
                  Feature Name     Type  Missing?  Unique Values
Year                      Year    int64       0.0             31
Area                      Area   object       0.0            233
total_emission  total_emission  float64       0.0           6806
Savanna fires    Savanna fires  float64       0.0           3746
Forest fires      Forest fires  float64       0.0           2962
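
The same kind of summary can be wrapped in a small helper so it is easy to rerun after each cleaning step. The sketch below is one possible variant, not the notebook's own code. The function name `summarize_features` and the toy frame are hypothetical, and it reports missing values as a percentage and moves the feature names out of the index so they appear only once.

```python
import numpy as np
import pandas as pd


def summarize_features(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, percent missing, unique value count."""
    return (
        pd.DataFrame({
            'Type': frame.dtypes.astype(str),
            'Missing (%)': (frame.isna().mean() * 100).round(2),
            'Unique Values': frame.nunique(),  # NaN is not counted as a value
        })
        .rename_axis('Feature Name')
        .reset_index()
    )


# Toy frame with made-up values for illustration.
toy = pd.DataFrame({
    'Year': [1990, 1991, 1991],
    'Area': ['A', 'A', 'B'],
    'total_emission': [1.0, np.nan, 3.0],
})

toy_summary = summarize_features(toy)
print(toy_summary.to_string(index=False))
```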
