# Comprehensive Pixels Dataset Exploration

This notebook loads, inspects, and visualizes the contents of `/data/oe23/fert-recon/data/processed/comprehensive_pixels_dataset.csv`.

## 1. Load the Dataset

Load `comprehensive_pixels_dataset.csv` with pandas.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
df = pd.read_csv("/data/oe23/fert-recon/data/processed/comprehensive_pixels_dataset.csv")
df.head()

| | pixel_id | year | soilgrids_bdod_bdod_0_5cm_mean | soilgrids_bdod_bdod_100_200cm_mean | soilgrids_bdod_bdod_15_30cm_mean | soilgrids_bdod_bdod_30_60cm_mean | soilgrids_bdod_bdod_5_15cm_mean | soilgrids_bdod_bdod_60_100cm_mean | soilgrids_cec_cec_0_5cm_mean | soilgrids_cec_cec_100_200cm_mean | ... | terraclimate_vs_mean_2000 | terraclimate_vs_min_2000 | terraclimate_vs_p25_2000 | terraclimate_vs_p50_2000 | terraclimate_vs_p75_2000 | terraclimate_vs_stdDev_2000 | yield_maize_2000 | yield_rice_2000 | yield_soybean_2000 | yield_wheat_2000 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2000 | -32768.0 | -32768.0 | -32768.0 | -32768.0 | -32768.0 | -32768.0 | -32768.0 | -32768.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 1 | 2000 | -32768.0 | -32768.0 | -32768.0 | -32768.0 | -32768.0 | -32768.0 | -32768.0 | -32768.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 2 | 2000 | -32768.0 | -32768.0 | -32768.0 | -32768.0 | -32768.0 | -32768.0 | -32768.0 | -32768.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 3 | 2000 | -32768.0 | -32768.0 | -32768.0 | -32768.0 | -32768.0 | -32768.0 | -32768.0 | -32768.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 4 | 2000 | -32768.0 | -32768.0 | -32768.0 | -32768.0 | -32768.0 | -32768.0 | -32768.0 | -32768.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
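
With over 600,000 rows and more than 200 columns, a full load can be slow. As a rough sketch, `pd.read_csv` can read just the header (`nrows=0`) to list the columns, then load only a chosen subset with `usecols` and `nrows`. The toy CSV text and the `wanted` list below are illustrative assumptions, not taken from the real file.

```python
import io
import pandas as pd

# Toy CSV standing in for the real file (hypothetical values); the column
# names only mimic the naming pattern seen in the output above.
csv_text = (
    "pixel_id,year,soilgrids_bdod_bdod_0_5cm_mean,yield_maize_2000\n"
    "0,2000,-32768.0,0.0\n"
    "1,2000,-32768.0,0.0\n"
    "2,2000,131.0,2.5\n"
)

# Read only the header row to discover column names without loading any data.
header = pd.read_csv(io.StringIO(csv_text), nrows=0)
all_columns = header.columns.tolist()

# Load a subset of columns and only the first two rows as a quick preview.
wanted = ["pixel_id", "year", "yield_maize_2000"]  # illustrative choice
preview = pd.read_csv(io.StringIO(csv_text), usecols=wanted, nrows=2)

print(all_columns)
print(preview)
```

For the real dataset, the `io.StringIO(...)` argument would be replaced by the file path used in the cell above.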


## 2. View Basic Dataset Information

Display the shape, column names, and data types of the dataset.

In [2]:
print(f"Shape: {df.shape}")
print("Columns:", df.columns.tolist())
print("\nData Types:")
print(df.dtypes)

Shape: (608256, 219)
Columns: ['pixel_id', 'year', 'soilgrids_bdod_bdod_0_5cm_mean', 'soilgrids_bdod_bdod_100_200cm_mean', 'soilgrids_bdod_bdod_15_30cm_mean', 'soilgrids_bdod_bdod_30_60cm_mean', 'soilgrids_bdod_bdod_5_15cm_mean', 'soilgrids_bdod_bdod_60_100cm_mean', 'soilgrids_cec_cec_0_5cm_mean', 'soilgrids_cec_cec_100_200cm_mean', 'soilgrids_cec_cec_15_30cm_mean', 'soilgrids_cec_cec_30_60cm_mean', 'soilgrids_cec_cec_5_15cm_mean', 'soilgrids_cec_cec_60_100cm_mean', 'soilgrids_cfvo_cfvo_0_5cm_mean', 'soilgrids_cfvo_cfvo_100_200cm_mean', 'soilgrids_cfvo_cfvo_15_30cm_mean', 'soilgrids_cfvo_cfvo_30_60cm_mean', 'soilgrids_cfvo_cfvo_5_15cm_mean', 'soilgrids_cfvo_cfvo_60_100cm_mean', 'soilgrids_clay_clay_0_5cm_mean', 'soilgrids_clay_clay_100_200cm_mean', 'soilgrids_clay_clay_15_30cm_mean', 'soilgrids_clay_clay_30_60cm_mean', 'soilgrids_clay_clay_5_15cm_mean', 'soilgrids_clay_clay_60_100cm_mean', 'soilgrids_nitrogen_nitrogen_0_5cm_mean', 'soilgrids_nitrogen_nitrogen_100_200cm_mean', 'soilgrid
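
A flat list of 219 names is hard to scan. One possible way to summarize it is to count columns by their leading prefix (`soilgrids`, `terraclimate`, `yield`). This is a minimal sketch on a hand-written list of names copied from the output above; on the real data, `df.columns` would take the place of `columns`.

```python
import pandas as pd

# A few column names copied from the output above, used as a stand-in
# for df.columns.
columns = [
    "pixel_id", "year",
    "soilgrids_bdod_bdod_0_5cm_mean", "soilgrids_cec_cec_0_5cm_mean",
    "terraclimate_vs_mean_2000", "terraclimate_vs_min_2000",
    "yield_maize_2000", "yield_wheat_2000",
]

# The text before the first underscore is treated as the data-source prefix.
prefix = pd.Series(columns).str.split("_").str[0]
counts = prefix.value_counts()
print(counts)
```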

## 3. Check for Missing Values

Check for missing values in each column and summarize their counts.

In [None]:
missing_counts = df.isnull().sum()
missing_percent = (missing_counts / len(df)) * 100
missing_summary = pd.DataFrame({'Missing Count': missing_counts, 'Missing %': missing_percent})
missing_summary[missing_summary['Missing Count'] > 0]
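
Note that `isnull()` only counts true `NaN` values. The first rows shown in Section 1 contain `-32768.0`, the minimum signed 16-bit integer, which is a common raster fill value. Assuming it marks no-data here (an inference from the output, not something the dataset documents), a sketch like the following counts the sentinel and converts it to `NaN` before summarizing. The toy frame and the `NODATA` constant are hypothetical.

```python
import numpy as np
import pandas as pd

# Assumed no-data sentinel, inferred from the df.head() output.
NODATA = -32768.0

# Toy frame with made-up values, standing in for df.
toy = pd.DataFrame({
    "pixel_id": [0, 1, 2, 3],
    "soilgrids_bdod_bdod_0_5cm_mean": [-32768.0, -32768.0, 131.0, 128.0],
    "yield_maize_2000": [0.0, 0.0, 2.5, 3.1],
})

# Count sentinel occurrences per column.
sentinel_counts = (toy == NODATA).sum()

# Replace the sentinel with NaN so isnull() reports it as missing.
cleaned = toy.replace(NODATA, np.nan)
missing_after = cleaned.isnull().sum()

print(sentinel_counts)
print(missing_after)
```

Whether the `0.0` values in the `terraclimate_*` and `yield_*` columns are also fills cannot be determined from the output shown.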

## 4. Explore Feature Distributions

Plot histograms and boxplots for numerical features to understand their distributions.

In [None]:
num_cols = df.select_dtypes(include=[np.number]).columns

# One subplot per numeric column would give a figure several hundred inches
# tall for this dataset, so only a subset is plotted. Taking the first 12
# columns is an arbitrary choice; adjust the selection as needed.
plot_cols = num_cols[:12]

# Histograms (squeeze=False keeps axes 2-D even for a single column)
fig, axes = plt.subplots(len(plot_cols), 1, figsize=(8, 4 * len(plot_cols)), squeeze=False)
for ax, col in zip(axes.ravel(), plot_cols):
    df[col].hist(ax=ax, bins=30)
    ax.set_title(f'Histogram of {col}')
plt.tight_layout()
plt.show()

# Boxplots
fig, axes = plt.subplots(len(plot_cols), 1, figsize=(8, 4 * len(plot_cols)), squeeze=False)
for ax, col in zip(axes.ravel(), plot_cols):
    sns.boxplot(x=df[col], ax=ax)
    ax.set_title(f'Boxplot of {col}')
plt.tight_layout()
plt.show()
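
Plots are hard to scan across a couple of hundred features, so a numeric summary can complement them. The sketch below combines `describe()` with extra percentiles and a count of values outside the conventional 1.5 × IQR whiskers used by boxplots. The toy frame is made up; on the real data, `df[num_cols]` would take the place of `toy`.

```python
import pandas as pd

# Toy data: column "a" has one obvious outlier, column "b" has none.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6, 7, 8, 9, 100],
    "b": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
})

# One row per feature, with extra tail percentiles.
summary = toy.describe(percentiles=[0.05, 0.25, 0.5, 0.75, 0.95]).T

# Count values outside the 1.5 * IQR whiskers for each column.
q1, q3 = toy.quantile(0.25), toy.quantile(0.75)
iqr = q3 - q1
outliers = ((toy < q1 - 1.5 * iqr) | (toy > q3 + 1.5 * iqr)).sum()
summary["n_outliers"] = outliers

print(summary)
```

Sorting `summary` by `n_outliers` is one way to decide which columns are worth plotting individually.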

## 5. Visualize Correlations

Create a correlation matrix heatmap to visualize relationships between features.

In [None]:
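corr = df[num_cols].corr()
# With roughly 200 numeric columns, per-cell annotations are unreadable and
# slow to draw, so they are turned off and the figure is enlarged.
plt.figure(figsize=(20, 16))
sns.heatmap(corr, annot=False, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation Matrix Heatmap')
plt.show()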
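
A heatmap over this many features is difficult to read cell by cell. As a complementary sketch, the strongest pairwise correlations can be pulled out of the matrix by taking its upper triangle (so each pair appears once and the diagonal is skipped) and sorting by absolute value. The toy frame and the `feature_a`, `feature_b`, and `r` column names are illustrative.

```python
import numpy as np
import pandas as pd

# Toy data: y is an exact multiple of x, z is negatively related to both.
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [2.0, 4.0, 6.0, 8.0, 10.0],
    "z": [5.0, 3.0, 4.0, 1.0, 2.0],
})
corr = toy.corr()

# Indices of the strict upper triangle: each unordered pair exactly once.
i, j = np.triu_indices(len(corr), k=1)
pairs = pd.DataFrame({
    "feature_a": corr.index[i],
    "feature_b": corr.columns[j],
    "r": corr.to_numpy()[i, j],
})

# Strongest relationships first, regardless of sign.
pairs = pairs.sort_values("r", key=lambda s: s.abs(), ascending=False)
print(pairs.head(10))
```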

## 6. Display Sample Rows

Show a few sample rows from the dataset for inspection.

In [None]:
df.sample(10, random_state=42)
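
Since the first rows of the dataset hold only `-32768.0` in the SoilGrids columns, a purely random sample may be dominated by such pixels. A possible refinement, assuming `-32768.0` is a no-data sentinel (an inference, not documented in the source), is to sample only from rows whose `soilgrids_*` columns are all valid. The toy frame below is hypothetical.

```python
import pandas as pd

# Assumed no-data sentinel, inferred from the df.head() output.
NODATA = -32768.0

# Toy frame with made-up values, standing in for df.
toy = pd.DataFrame({
    "pixel_id": [0, 1, 2, 3],
    "year": [2000, 2000, 2000, 2000],
    "soilgrids_bdod_bdod_0_5cm_mean": [-32768.0, -32768.0, 131.0, 128.0],
    "soilgrids_cec_cec_0_5cm_mean": [-32768.0, -32768.0, 210.0, 190.0],
})

# Keep rows where every soilgrids_* column holds a non-sentinel value.
soil_cols = [c for c in toy.columns if c.startswith("soilgrids_")]
has_soil = (toy[soil_cols] != NODATA).all(axis=1)
valid = toy[has_soil]

# Sample from the valid rows only; cap n at the number of rows available.
sample = valid.sample(n=min(10, len(valid)), random_state=42)
print(sample)
```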