# Exploratory Data Analysis: Global CO₂ Emissions

In this notebook, we'll perform an initial analysis of the `owid-co2-data.csv` dataset. The main goals are:

1.  **Understand the Data:** Get familiar with the features, their types, and the overall structure.
2.  **Clean and Prepare:** Identify and handle missing values, filter irrelevant data, and select the features needed for our clustering analysis.
3.  **Generate a Reduced Dataset:** Save a cleaned, smaller version of the dataset that will be fed into our Kafka-Spark data pipeline.

---

### 1. Initial Setup and Data Loading

In [None]:
%pip install -q pandas numpy matplotlib seaborn

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('seaborn-v0_8-whitegrid')

In [None]:
df = pd.read_csv('../data/owid-co2-data.csv')

### 2. Data Inspection and Initial Findings

Let's start with a high-level overview of the dataset.

In [None]:
print(f"The dataset has {df.shape[0]} rows and {df.shape[1]} columns.\n")
df.info()

**Initial Findings:**

- The dataset is large: over 50,000 rows and 79 columns.
- Missing (`NaN`) values are widespread: 77 of the 79 columns contain missing data.
- The time range is very wide, starting from the year 1750.
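
To back findings like these with numbers rather than reading them off `df.info()`, a per-column missing-value report helps. The sketch below is one possible way to do it; `missing_summary` is a hypothetical helper name, and the toy frame only stands in for `df`.

```python
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages, worst first."""
    summary = pd.DataFrame({
        "missing": frame.isna().sum(),
        "missing_pct": frame.isna().mean() * 100,
    })
    return summary.sort_values("missing_pct", ascending=False)


# Toy data (hypothetical values) standing in for the real `df`.
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "gdp": [1.0, None, None, 4.0],
    "co2": [0.5, 0.7, None, 0.9],
})

report = missing_summary(toy)
n_cols_with_missing = int((report["missing"] > 0).sum())
print(report)
print(f"{n_cols_with_missing} of {len(report)} columns contain missing values.")
```

On the real data, `missing_summary(df).head(10)` would list the ten sparsest columns.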

### 3. Data Cleaning and Feature Selection

In its current state the dataset is too broad and too sparse for our analysis. We need to filter and clean it based on our project's goal: **clustering countries by their emission patterns**.

#### 3.1. Filtering by Time Period

Data from the 18th and 19th centuries is very sparse. To confirm this, let's plot the percentage of missing GDP values per year. We'll then keep only the period from 1950 onwards, where data is more consistently available.

In [None]:
plt.figure(figsize=(12, 4))
missing_gdp_by_year = df['gdp'].isna().groupby(df['year']).mean() * 100
missing_gdp_by_year.plot(kind='line')
plt.title('Percentage of Missing GDP Data Over Time')
plt.ylabel('Missing GDP (%)')
plt.axvspan(1750, 1950, color='red', alpha=0.1, label='Highly Sparse Period')
plt.legend()
plt.show()

df_filtered = df[df['year'] >= 1950].copy()
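
The 1950 cutoff is read off the plot. As a cross-check, the same missing-share series can be queried for the earliest year that meets a chosen coverage threshold. This is only a sketch: the helper names and the 50% threshold are assumptions, and the toy frame uses made-up values.

```python
import pandas as pd


def missing_share_by_year(frame, column, year_col="year"):
    """Percentage of missing values in `column` for each year."""
    return frame[column].isna().groupby(frame[year_col]).mean() * 100


def first_year_below(share, threshold_pct):
    """Earliest year whose missing share is at or below the threshold, else None."""
    below = share[share <= threshold_pct]
    return int(below.index.min()) if not below.empty else None


# Toy data (hypothetical values) standing in for the real `df`.
toy = pd.DataFrame({
    "year": [1900, 1900, 1950, 1950, 2000, 2000],
    "gdp": [None, None, 1.0, None, 2.0, 3.0],
})

share = missing_share_by_year(toy, "gdp")
cutoff = first_year_below(share, threshold_pct=50)
print(share)
print(f"First year with at most 50% missing GDP: {cutoff}")
```

Note that this returns the first year that meets the threshold, not a guarantee that every later year does, so the plot remains the primary evidence.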

#### 3.2. Selecting Relevant Features

For our clustering objective, we don't need all 79 columns. We'll select features that best explain a country's emission profile:

- **Identifiers:** `country`, `year`, `iso_code`
- **Scale Factors:** `population`, `gdp`
- **Emission Metrics:** `co2` (total emissions), `co2_per_capita` (emissions per person, independent of country size)

In [None]:
relevant_columns = [
    "country", "year", "iso_code", 
    "population", "gdp", "co2", "co2_per_capita"
]
df_selected = df_filtered[relevant_columns]

#### 3.3. Removing Aggregate Regions

A crucial step is to remove rows that don't represent individual countries but are aggregates (e.g., 'World', 'Asia', 'Europe'). These would skew our clustering results. We can identify them because they typically lack an `iso_code`.

In [None]:
non_country_entities = df_selected[df_selected['iso_code'].isnull()]['country'].unique()
print(f"Found {len(non_country_entities)} aggregate entities to remove. Examples: {list(non_country_entities)[:5]}\n")

df_countries_only = df_selected.dropna(subset=['iso_code'])
print(f"Shape after removing aggregates: {df_countries_only.shape}")
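
Because aggregates only *typically* lack an `iso_code`, it is worth checking what survives the filter. The sketch below is one possible sanity check, not part of the original pipeline: the helper names are hypothetical, the name list comes from the examples mentioned above, and the non-standard code in the toy frame is invented for illustration.

```python
import pandas as pd

# Names taken from the examples in the text; extend for your copy of the data.
KNOWN_AGGREGATES = {"World", "Asia", "Europe"}


def leftover_aggregates(frame, names=KNOWN_AGGREGATES):
    """Return known aggregate names still present in the `country` column."""
    return sorted(set(frame["country"].unique()) & set(names))


def nonstandard_iso_codes(frame):
    """Return iso_code values that are not exactly three uppercase letters."""
    codes = frame["iso_code"].dropna().astype(str)
    return sorted(codes[~codes.str.fullmatch(r"[A-Z]{3}")].unique())


# Toy data: the "OWID_WRL" code is a hypothetical example of a non-ISO code.
toy = pd.DataFrame({
    "country": ["France", "World", "Asia", "Kenya"],
    "iso_code": ["FRA", "OWID_WRL", None, "KEN"],
})

countries_only = toy.dropna(subset=["iso_code"])
print("Aggregates still present:", leftover_aggregates(countries_only))
print("Non-standard codes:", nonstandard_iso_codes(countries_only))
```

If either list is non-empty on the real data, those rows need a decision before clustering. A non-standard code is not necessarily an aggregate, so inspect the entries rather than dropping them blindly.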

#### 3.4. Handling Remaining Missing Values

Even after filtering, we still have missing data, especially for `gdp`. For our pipeline, the simplest approach is to work only with complete records, so we remove any row that still has a `NaN` value in our selected columns. The trade-off is that country-years without a value for every selected column disappear from the clustering input.

In [None]:
print(f"Shape before dropping NaNs: {df_countries_only.shape}\n")
df_clean = df_countries_only.dropna()

print(f"Final shape of the cleaned dataset: {df_clean.shape}")
print(f"\nWe retained {df_clean.shape[0] / df_filtered.shape[0]:.2%} of the data from 1950 onwards.")
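
Before accepting the loss from `dropna()`, it helps to see which column is responsible for it. The following is a minimal sketch with hypothetical helper names and toy values. It counts missing values per column and the rows lost solely because of one column.

```python
import pandas as pd


def missing_per_column(frame):
    """Number of missing values in each column, largest first."""
    return frame.isna().sum().sort_values(ascending=False)


def rows_lost_only_to(frame, column):
    """Count rows where `column` is the only missing value."""
    others_complete = frame.drop(columns=column).notna().all(axis=1)
    return int((frame[column].isna() & others_complete).sum())


# Toy data (hypothetical values) standing in for `df_countries_only`.
toy = pd.DataFrame({
    "population": [1.0, 2.0, 3.0, 4.0],
    "gdp": [None, None, 3.0, 4.0],
    "co2": [1.0, None, 3.0, 4.0],
})

total_dropped = len(toy) - len(toy.dropna())
only_gdp = rows_lost_only_to(toy, "gdp")
print(missing_per_column(toy))
print(f"Rows dropped in total: {total_dropped}; dropped only because of gdp: {only_gdp}")
```

If most of the loss on the real data comes from `gdp` alone, imputing or dropping that one feature may be worth weighing against discarding the rows.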

### 4. Final Exploration on Cleaned Data

In [None]:
df_clean.describe()

In [None]:
plt.figure(figsize=(8, 6))
correlation_matrix = df_clean[["population", "gdp", "co2", "co2_per_capita"]].corr()
sns.heatmap(correlation_matrix, annot=True, cmap='viridis', fmt='.2f')
plt.title('Correlation Matrix of Key Variables')
plt.show()

**Correlation Insights:**
- There is a very strong positive correlation between `gdp` and `co2` (0.97) and between `population` and `co2` (0.88). This is expected: larger economies and populations tend to have higher total emissions.
- `co2_per_capita` has a very weak correlation with the other variables, suggesting it adds a different dimension to the analysis (per-person emissions rather than overall scale).
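
The heatmap uses Pearson correlation on raw values, which a handful of very large countries can dominate. A rank-based (Spearman) correlation is a common robustness check. The sketch below computes it as Pearson on ranks, using toy numbers chosen to show how a single extreme point can drive the raw coefficient.

```python
import pandas as pd

# Toy data (hypothetical values): four small observations plus one extreme one.
toy = pd.DataFrame({
    "gdp": [1.0, 2.0, 3.0, 4.0, 100.0],
    "co2": [5.0, 4.0, 3.0, 2.0, 50.0],
})

# Pearson correlation on the raw values.
pearson = toy.corr()

# Spearman correlation, computed as Pearson correlation of the ranks.
rank_corr = toy.rank().corr()

print(f"Pearson:  {pearson.loc['gdp', 'co2']:.3f}")
print(f"Spearman: {rank_corr.loc['gdp', 'co2']:.3f}")
```

On the real data, `df_clean[["population", "gdp", "co2", "co2_per_capita"]].rank().corr()` gives the rank-based matrix. If it differs sharply from the Pearson one, scaling or log-transforming the features before clustering is worth considering.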

### 5. Save the Final Dataset

Now we'll save this cleaned and reduced dataset. This will be the file our Kafka producer reads from.

In [None]:
output_path = '../data/reduced_co2_emissions.csv'
df_clean.to_csv(output_path, index=False)

print(f"Cleaned dataset saved to {output_path}")
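
Since a downstream producer will read this file, a quick round-trip check can catch surprises such as values that pandas parses back as `NaN`. This is a sketch only: `verify_saved_csv` is a hypothetical helper, and the example writes toy data to a temporary directory rather than to `../data/`.

```python
import os
import tempfile

import pandas as pd


def verify_saved_csv(frame, path):
    """Reload `path` and check shape, column order, and absence of NaNs."""
    reloaded = pd.read_csv(path)
    if reloaded.shape != frame.shape:
        raise ValueError(f"Shape changed: {frame.shape} -> {reloaded.shape}")
    if list(reloaded.columns) != list(frame.columns):
        raise ValueError("Column names or order changed on reload")
    if reloaded.isna().any().any():
        raise ValueError("Reloaded file contains missing values")
    return reloaded


# Toy data (hypothetical values) standing in for `df_clean`.
toy = pd.DataFrame({
    "country": ["France", "Kenya"],
    "year": [2000, 2000],
    "iso_code": ["FRA", "KEN"],
    "co2": [1.5, 0.3],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "reduced.csv")
    toy.to_csv(path, index=False)
    reloaded = verify_saved_csv(toy, path)

print(f"Round trip OK: {reloaded.shape[0]} rows, {reloaded.shape[1]} columns")
```

In the notebook the equivalent call would be `verify_saved_csv(df_clean, output_path)`.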

### 6. Guiding Questions for Further Analysis

Based on this initial exploration, our Spark clustering job could help answer questions like:

- **Global & Temporal Patterns:** How have global CO₂ emissions increased over the years? What are the main clusters of countries based on their emission profiles?
- **Country Evolution:** How have different countries and regions evolved over time? Can we identify countries that have successfully decoupled GDP growth from CO₂ emissions?
- **Influencing Factors:** Does higher GDP always mean higher CO₂? Are wealthier countries more efficient (lower CO₂ per capita)?
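
As a starting point for the decoupling question, one simple working definition (an assumption here, not an established part of this project) is that a country's GDP rose while its total CO₂ fell between two years. The sketch below applies that definition to toy data; `decoupled_countries` is a hypothetical helper.

```python
import pandas as pd


def decoupled_countries(frame, start_year, end_year):
    """Countries whose GDP rose while CO2 fell between the two years."""
    subset = frame[frame["year"].isin([start_year, end_year])]
    # One row per country, columns keyed by (metric, year).
    wide = subset.pivot(index="country", columns="year", values=["gdp", "co2"])
    wide = wide.dropna()  # keep countries observed in both years
    gdp_up = wide[("gdp", end_year)] > wide[("gdp", start_year)]
    co2_down = wide[("co2", end_year)] < wide[("co2", start_year)]
    return sorted(wide.index[gdp_up & co2_down])


# Toy data (hypothetical values) with the same column names as the cleaned dataset.
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "year": [2000, 2020, 2000, 2020],
    "gdp": [10.0, 20.0, 10.0, 15.0],
    "co2": [5.0, 4.0, 5.0, 8.0],
})

result = decoupled_countries(toy, start_year=2000, end_year=2020)
print(result)
```

Comparing two single years is sensitive to one-off shocks, so averaging over a few years around each endpoint would be a reasonable refinement.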