# CO2 Emission Data Preprocessing
This notebook preprocesses the CO2 emissions dataset for machine learning: it handles missing values, standardizes the numeric features, and splits the data into training and test sets.

In [None]:
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

## Load the Cleaned Dataset

In [None]:
# Load the cleaned dataset
data_path = 'path_to_cleaned_file/co2_emissions_cleaned.csv'
df = pd.read_csv(data_path)
df.head()

## Handle Missing Data

In [None]:
# Forward-fill missing values (fillna(method='ffill') is deprecated in pandas 2.x)
df = df.ffill()  # Each gap takes the value from the row above in the same column
# Alternatively, drop rows that still contain missing values
# df = df.dropna()
df.isnull().sum()  # Count the remaining missing values per column
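
A forward fill without an axis argument works down each column, so in a table with one row per country a gap is filled from the previous country rather than from the same country's previous year. A minimal sketch of a row-wise fill follows, assuming a wide layout with one column per year; the toy frame, its values, and the names `toy`, `year_cols`, and `filled` are hypothetical and not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy wide-format frame (hypothetical values): one row per country, one column per year
toy = pd.DataFrame({
    "Country Name": ["A", "B"],
    "1990": [1.0, np.nan],
    "1991": [np.nan, 2.0],
    "1992": [3.0, np.nan],
})
year_cols = ["1990", "1991", "1992"]

# Fill each gap from the same country's previous year (axis=1 moves along the columns)
filled = toy.copy()
filled[year_cols] = toy[year_cols].ffill(axis=1)

# A gap in the first year has nothing earlier to copy from, so it stays NaN
remaining = int(filled[year_cols].isna().sum().sum())
print(filled)
print("Remaining missing values:", remaining)
```

Gaps that survive the row-wise fill still need a decision, such as a backward fill or dropping the affected rows.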

## Feature Scaling

In [None]:
# Scale the features using StandardScaler
scaler = StandardScaler()
scaled_features = scaler.fit_transform(df.iloc[:, 4:])  # Scale the year columns (numeric)
scaled_df = pd.DataFrame(scaled_features, columns=df.columns[4:], index=df.index)  # Reuse df's index so the inserted columns align row by row
scaled_df.insert(0, 'Country Name', df['Country Name'])  # Add back the non-scaled columns
scaled_df.insert(1, 'Country Code', df['Country Code'])
scaled_df.head()
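
To make the scaling step concrete: `StandardScaler` standardizes each column independently by subtracting the column mean and dividing by the column's population standard deviation. The sketch below checks this on a small hypothetical array (the values and the names `values`, `demo_scaler`, and `manual` are illustrative only) and shows that `inverse_transform` recovers the original units.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: three rows, two columns on very different scales
values = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

demo_scaler = StandardScaler()
scaled = demo_scaler.fit_transform(values)

# The same result by hand: z = (x - column mean) / column std (ddof=0)
manual = (values - values.mean(axis=0)) / values.std(axis=0)

# inverse_transform maps scaled values back to the original units
restored = demo_scaler.inverse_transform(scaled)

print(np.allclose(scaled, manual))
print(np.allclose(restored, values))
```

Keeping the fitted scaler object around is useful later, since predictions made on scaled data can be mapped back to emission units the same way.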

## Splitting the Data for Training and Testing

In [None]:
# Define X (features) and y (target) for ML
X = scaled_df.iloc[:, 2:-1]  # Scaled year columns, excluding the target year to avoid leakage
y = df.iloc[:, -1]  # Example target: emissions for the latest year (assumes the last column is the most recent year)

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
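
One caveat about the order of operations: the scaler in this notebook is fitted on every row before the split, so statistics from the test rows influence the scaled training features. A common alternative is to split first and fit the scaler on the training rows only. The sketch below illustrates that order on synthetic data; the frame `demo`, its column names, and all variable names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the year columns (hypothetical values)
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(50, 4)), columns=["2016", "2017", "2018", "2019"])

X_demo = demo[["2016", "2017", "2018"]]  # earlier years as features
y_demo = demo["2019"]                    # latest year as target

# Split first, so the test rows stay unseen during fitting
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Fit the scaler on the training rows only, then apply it to both sets
split_scaler = StandardScaler()
X_tr_scaled = split_scaler.fit_transform(X_tr)
X_te_scaled = split_scaler.transform(X_te)

print(X_tr_scaled.shape, X_te_scaled.shape)
```

With this order the training columns have zero mean exactly, while the test columns are only approximately centered, which is expected.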

## Next Steps
- After preprocessing, the data is ready for modeling.
- You can now use this preprocessed data in your machine learning models (e.g., regression, time-series forecasting).
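
As a possible starting point for the regression case, the sketch below fits a baseline `LinearRegression` on a train/test split and reports R² on the held-out rows. It uses synthetic data with a known linear relationship rather than the emissions dataset, and every name in it (`X_syn`, `y_syn`, `baseline`) is hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic features and a target that is linear in them plus small noise
rng = np.random.default_rng(1)
X_syn = rng.normal(size=(100, 3))
y_syn = 2.0 * X_syn[:, 0] - 1.0 * X_syn[:, 1] + 0.01 * rng.normal(size=100)

X_fit, X_hold, y_fit, y_hold = train_test_split(X_syn, y_syn, test_size=0.2, random_state=42)

# Fit the baseline model and evaluate it on the held-out rows
baseline = LinearRegression()
baseline.fit(X_fit, y_fit)
r2 = baseline.score(X_hold, y_hold)  # coefficient of determination
predictions = baseline.predict(X_hold)

print("Held-out R^2:", round(r2, 3))
```

A simple baseline like this gives a reference score against which more elaborate models can be compared.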