# Data cleanup

This notebook cleans the Agrofood CO2 emission dataset. It restricts the data to a subset of countries, removes duplicate rows and rows with missing values, and summarizes the features that remain.

In [16]:
import pandas as pd

In [17]:
# Load the dataset
df = pd.read_csv("../data/raw/Agrofood_co2_emission.csv")

## Define relevant columns

To keep the example simple, we keep all columns and restrict the rows to a few countries. In this example we use Southern European countries.

In [18]:
# Keep every column for now; replace df.columns with an explicit list to select a subset
relevant_columns = df.columns
df = df[relevant_columns]

southern_europe = [
    'Albania', 'Andorra', 'Bosnia and Herzegovina', 'Croatia', 'Greece', 'Italy',
    'Kosovo', 'Malta', 'Montenegro', 'North Macedonia', 'Portugal', 'San Marino',
    'Serbia', 'Slovenia', 'Spain', 'Vatican City'
]
european_countries = southern_europe
df = df[df['Area'].isin(european_countries)]
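
`isin` silently ignores names that do not appear in the `Area` column, so a country spelled differently in the dataset would drop out without warning. The following is a minimal sketch of a check for that, assuming a hypothetical helper `split_by_presence` and a small made-up frame standing in for the real data:

```python
import pandas as pd


def split_by_presence(frame, names, column="Area"):
    """Return (found, missing) lists of names relative to frame[column].

    Hypothetical helper for illustration; not part of the original notebook.
    """
    present = set(frame[column].unique())
    found = [name for name in names if name in present]
    missing = [name for name in names if name not in present]
    return found, missing


# Toy data standing in for the real dataset (values are illustrative only)
toy = pd.DataFrame({
    "Area": ["Italy", "Spain", "France"],
    "Year": [2000, 2000, 2000],
})

found, missing = split_by_presence(toy, ["Italy", "Spain", "Vatican City"])
print(f"Found in data: {found}")
print(f"Not found in data: {missing}")
```

Running the same check against `df` and `southern_europe` before filtering would show which of the listed countries actually contribute rows.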

## Remove duplicates

Next, we remove exact duplicate rows and drop rows with missing values.

In [19]:
# Check for exact duplicates across all columns
exact_duplicates = df.duplicated().sum()
print(f"Exact duplicates found: {exact_duplicates}")

# Check duplicates on specific key columns
key_duplicates = df.duplicated(subset=['Year', 'Area']).sum()
print(f"Duplicates based on Year and Area: {key_duplicates}")

# Remove exact duplicates (keep first occurrence)
df_cleaned = df.drop_duplicates(keep='first')
# Drop rows that have a missing value in any relevant column
df_cleaned = df_cleaned.dropna(subset=relevant_columns)


# View duplicate rows
duplicate_rows = df[df.duplicated(keep=False)]
print(f"Total rows involved in duplication: {len(duplicate_rows)}")

# 'rows_removed' counts rows dropped as duplicates or because of missing values
cleaning_log = {
    'original_rows': len(df),
    'rows_removed': len(df) - len(df_cleaned),
    'final_rows': len(df_cleaned)
}
print(f"\n\nCleaning summary: {cleaning_log}")

Exact duplicates found: 0
Duplicates based on Year and Area: 0
Total rows involved in duplication: 0


Cleaning summary: {'original_rows': 394, 'rows_removed': 91, 'final_rows': 303}
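
Because duplicates and missing values are removed in two separate steps, it can help to count each step on its own. The sketch below shows one way to do that, assuming a hypothetical helper `clean_with_log` and a small made-up frame; it drops rows with a missing value in any column, which matches the cell above only while `relevant_columns` covers every column.

```python
import pandas as pd


def clean_with_log(frame):
    """Drop exact duplicates, then rows with any missing value, counting each step.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    deduped = frame.drop_duplicates(keep="first")
    complete = deduped.dropna()
    log = {
        "original_rows": len(frame),
        "exact_duplicates_removed": len(frame) - len(deduped),
        "rows_with_missing_removed": len(deduped) - len(complete),
        "final_rows": len(complete),
    }
    return complete, log


# Toy data: one exact duplicate (Italy) and one row with a missing value (Spain)
toy = pd.DataFrame({
    "Area": ["Italy", "Italy", "Spain", "Malta"],
    "Year": [2000, 2000, 2000, 2000],
    "Forest fires": [1.0, 1.0, None, 2.0],
})

cleaned, log = clean_with_log(toy)
print(log)
```

Keeping the two counts apart makes it clear whether rows were lost to duplication or to missing data.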


## Feature summary

Finally, we summarize the features of the cleaned dataset: data type, share of missing values, and number of unique values.

In [20]:
summary = pd.DataFrame({
    'Feature Name': df_cleaned.columns,
    'Type': df_cleaned.dtypes,
    'Missing?': df_cleaned.isnull().mean().round(2),
    'Unique Values': df_cleaned.nunique()
})
print(f"\n\nFeature summary table: {summary}")



Feature summary table:                                                     Feature Name     Type  \
Area                                                        Area   object   
Year                                                        Year    int64   
Savanna fires                                      Savanna fires  float64   
Forest fires                                        Forest fires  float64   
Crop Residues                                      Crop Residues  float64   
Rice Cultivation                                Rice Cultivation  float64   
Drained organic soils (CO2)          Drained organic soils (CO2)  float64   
Pesticides Manufacturing                Pesticides Manufacturing  float64   
Food Transport                                    Food Transport  float64   
Forestland                                            Forestland  float64   
Net Forest conversion                      Net Forest conversion  float64   
Food Household Consumption            Food Househol
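
The printed table wraps partly because the feature names appear twice, once as the index and once as a column. A minimal sketch of a more compact variant, assuming a hypothetical helper `summarize_features`, a renamed `Missing share` column, and a small made-up frame in place of the real data:

```python
import pandas as pd


def summarize_features(frame):
    """Build a per-feature summary with a plain integer index.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    return pd.DataFrame({
        "Feature Name": frame.columns,
        "Type": frame.dtypes.astype(str).to_numpy(),
        # Fraction of rows with a missing value, between 0 and 1
        "Missing share": frame.isnull().mean().round(2).to_numpy(),
        # nunique() ignores missing values by default
        "Unique Values": frame.nunique().to_numpy(),
    })


# Toy data standing in for the cleaned dataset (values are illustrative only)
toy = pd.DataFrame({
    "Area": ["Italy", "Spain", "Malta", "Italy"],
    "Forest fires": [1.0, None, 2.0, 1.0],
})

summary = summarize_features(toy)
# to_string(index=False) prints the table without the index column
print(summary.to_string(index=False))
```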

## Save the cleaned up data

In [21]:
df_cleaned.to_csv('../data/interim/cleaned_data.csv', index=False)
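
`to_csv` raises an error if the target directory does not exist. The sketch below creates the directory first and reads the file back as a quick check. It assumes a hypothetical helper `save_csv` and writes a toy frame to a temporary directory instead of the real `../data/interim` path.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_csv(frame, path):
    """Create the parent directory if needed, then write frame to path as CSV.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    frame.to_csv(path, index=False)
    return path


# Toy data standing in for df_cleaned (values are illustrative only)
toy = pd.DataFrame({"Area": ["Italy", "Spain"], "Year": [2000, 2001]})

with tempfile.TemporaryDirectory() as tmp:
    out = save_csv(toy, Path(tmp) / "interim" / "cleaned_data.csv")
    # Read the file back while the temporary directory still exists
    reloaded = pd.read_csv(out)

print(reloaded)
```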