# Data cleanup

In this notebook, we clean up the yield tables dataset: we select the columns relevant for modeling, remove duplicate rows, drop rows with missing values, and summarize the remaining features.

In [1]:
import pandas as pd

In [2]:
# Load the dataset
df = pd.read_csv("../data/raw/yield_tables.csv", sep=";")

## Define relevant columns

Columns relevant for building the regression model:

- `id`: The unique identifier of the yield table; each identifier refers to one tree species.
- `yield_class`: The yield class.
- `age`: The age in years.
- `average_height`: The average height in meters.
- `dbh`: The diameter at breast height in centimeters.
- `taper`: The tree taper.
- `trees_per_ha`: The number of trees per hectare.
- `volume_per_ha`: The volume per hectare in Vfm (Vorratsfestmeter, cubic meters of standing timber).

In [3]:
relevant_columns = [
    "id",
    "yield_class",
    "age",
    "average_height",
    "dbh",
    "taper",
    "trees_per_ha",
    "volume_per_ha",
]
df = df[relevant_columns]
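
Selecting columns by label raises a `KeyError` if any name is absent, which can happen, for example, when the CSV is read with the wrong separator. The following is a minimal sketch of an explicit check before selecting. The frame `toy`, the list `wanted`, and all values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy frame standing in for the loaded CSV; values are made up.
toy = pd.DataFrame(
    {"id": [1, 1], "age": [20, 25], "dbh": [10.5, 12.0], "notes": ["a", "b"]}
)
wanted = ["id", "age", "dbh", "taper"]

# Columns that were requested but are absent from the frame
missing_columns = [col for col in wanted if col not in toy.columns]

# Keep only the requested columns that are actually present
present_columns = [col for col in wanted if col in toy.columns]
subset = toy[present_columns]

print(f"Missing columns: {missing_columns}")
```

In this notebook the same check would use `relevant_columns` and `df`; whether to fail or continue on missing columns is a design choice left open here.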

## Remove duplicates and missing values

Next, we remove duplicate rows and drop any rows with missing values. Dropping these rows is not critical: an inspection of the dataset shows that it is consistent per tree species. If a column contains data for one entry of a species (grouped by the `id` column), it contains data for all entries of that species. Dropping rows with missing values therefore removes whole species rather than individual entries, and we can drop duplicates and missing values without losing important information.
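
The per-species consistency described above can be checked rather than assumed. The following is a sketch on made-up data (the frame `toy` and its values are hypothetical): it computes, per `id` and column, the share of missing entries. A share strictly between 0 and 1 would indicate a species with only partially missing data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: species 1 is complete, species 2 has no taper at all.
toy = pd.DataFrame(
    {
        "id": [1, 1, 2, 2],
        "age": [20, 30, 20, 30],
        "taper": [0.5, 0.6, np.nan, np.nan],
    }
)

# Share of missing values per species (id) and column
missing_share = toy.drop(columns="id").isna().groupby(toy["id"]).mean()

# Species for which some column is only partially missing
is_partial = ((missing_share > 0) & (missing_share < 1)).any(axis=1)
partial = missing_share[is_partial]

print(missing_share)
print(f"Species with partially missing columns: {list(partial.index)}")
```

Applied to `df`, an empty `partial` frame would support the statement that data is either present or missing for a whole species.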

In [4]:
# Check for exact duplicates across all columns
exact_duplicates = df.duplicated().sum()
print(f"Exact duplicates found: {exact_duplicates}")

# Remove exact duplicates (keep first occurrence)
df_duplicates_removed = df.drop_duplicates(keep="first")
df_cleaned = df_duplicates_removed.dropna(subset=relevant_columns)  # drop the NaN values


# View duplicate rows
duplicate_rows = df[df.duplicated(keep=False)]
print(f"Total rows involved in duplication: {len(duplicate_rows)}")

cleaning_log = {
    "original_rows": len(df),
    "exact_duplicates_removed": len(df) - len(df_duplicates_removed),
    "rows_with_missing_values_removed": len(df_duplicates_removed) - len(df_cleaned),
    "final_rows": len(df_cleaned),
}
print(f"\n\nCleaning summary: {cleaning_log}")

Exact duplicates found: 5
Total rows involved in duplication: 10


Cleaning summary: {'original_rows': 14398, 'exact_duplicates_removed': 5, 'rows_with_missing_values_removed': 9974, 'final_rows': 4419}
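
The two duplicate counts above differ because `duplicated()` uses `keep="first"` by default and marks only the repeats, while `keep=False` marks every row that has an identical twin. Five repeats among ten involved rows means the duplicates come as five pairs. A small sketch on made-up data (the frame `toy` is hypothetical) shows the difference:

```python
import pandas as pd

# Hypothetical toy frame with two pairs of identical rows
toy = pd.DataFrame({"id": [1, 1, 2, 3, 3], "age": [20, 20, 30, 40, 40]})

# Default keep="first": marks only the repeats, not the first occurrence
repeats = int(toy.duplicated().sum())

# keep=False: marks every row that has at least one identical twin
involved = int(toy.duplicated(keep=False).sum())

# Dropping duplicates removes exactly the repeats
deduplicated = toy.drop_duplicates(keep="first")

print(f"Repeats: {repeats}, rows involved: {involved}")
```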


## Feature summary

Finally, we summarize the features of the cleaned dataset.

In [5]:
summary = pd.DataFrame(
    {
        "Feature Name": df_cleaned.columns,
        "Type": df_cleaned.dtypes,
        "Missing?": df_cleaned.isnull().mean().round(2),
        "Unique Values": df_cleaned.nunique(),
    }
)
# Print the title on its own line so the table header stays aligned
print("Feature summary table:")
print(summary)

Feature summary table:
                  Feature Name     Type  Missing?  Unique Values
id                          id    int64       0.0             30
yield_class        yield_class  float64       0.0             30
age                        age    int64       0.0             30
average_height  average_height  float64       0.0           1312
dbh                        dbh  float64       0.0           1619
taper                    taper  float64       0.0           1476
trees_per_ha      trees_per_ha  float64       0.0           1826
volume_per_ha    volume_per_ha  float64       0.0           1222


## Save the cleaned up data

In [6]:
df_cleaned.to_csv("../data/interim/cleaned_data.csv", index=False)
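
A quick way to confirm that the saved file can be read back unchanged is a round trip. The following is a minimal sketch on made-up data, writing to a temporary directory instead of the project's `data` folder; the frame `toy` and the temporary path are assumptions for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical toy frame standing in for df_cleaned
toy = pd.DataFrame({"id": [1, 2], "age": [20, 30], "dbh": [10.5, 12.25]})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "cleaned_data.csv"
    # Same call as above: comma separator by default, no index column
    toy.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

# Raises an AssertionError if values, column names, or dtypes differ
pd.testing.assert_frame_equal(toy, reloaded)
print("Round trip OK")
```

Note that the raw file is read with `sep=";"`, whereas `to_csv` writes commas by default, so the cleaned file is read back without a `sep` argument.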