<center>
    <h1> Predicting Air Quality for Health Risk Assessment: A Machine Learning Approach </h1>
    <h2> Oversampling using SMOTE and Splitting Data </h2>
    <h3> Divya Neelamegam, Padhma Cebolu Srinivasan, Poojitha Venkat Ram, Shruti Badrinarayanan, Sourabh Suresh Kumar </h3>
</center>

### Load Dataset

In [1]:
import pandas as pd

# Load Dataset
data = pd.read_csv("processed_data.csv")
data.head()

| Unnamed: 0 | City | Date | PM2.5 | PM10 | NO | NO2 | NOx | NH3 | CO | SO2 | O3 | Benzene | Toluene | Xylene | AQI | AQI_Bucket |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Ahmedabad | 2015-01-29 | 83.13 | 118.127103 | 6.93 | 28.71 | 33.72 | 23.483476 | 6.93 | 49.52 | 59.76 | 0.02 | 0.0 | 3.14 | 209.0 | Poor |
| 1 | Ahmedabad | 2015-01-30 | 79.84 | 118.127103 | 13.85 | 28.68 | 41.08 | 23.483476 | 13.85 | 48.49 | 97.07 | 0.04 | 0.0 | 4.81 | 328.0 | Very Poor |
| 2 | Ahmedabad | 2015-01-31 | 94.52 | 118.127103 | 24.39 | 32.66 | 52.61 | 23.483476 | 24.39 | 67.39 | 111.33 | 0.24 | 0.01 | 7.67 | 514.0 | Severe |
| 3 | Ahmedabad | 2015-02-01 | 135.99 | 118.127103 | 43.48 | 42.08 | 84.57 | 23.483476 | 43.48 | 75.23 | 102.7 | 0.4 | 0.04 | 25.87 | 782.0 | Severe |
| 4 | Ahmedabad | 2015-02-02 | 178.33 | 118.127103 | 54.56 | 35.31 | 72.8 | 23.483476 | 54.56 | 55.04 | 107.38 | 0.46 | 0.06 | 35.61 | 914.0 | Severe |


### Encode Target Variable - AQI_Bucket

In [2]:
from sklearn.preprocessing import OrdinalEncoder

# Define order of categories for AQI_Bucket
categories_order = [['Good', 'Satisfactory', 'Moderate', 'Poor', 'Very Poor', 'Severe']]

# Create encoder object
ordinal_encoder = OrdinalEncoder(categories=categories_order)

# Reshape target variable to fit the encoder's expected format
y_reshaped = data['AQI_Bucket'].values.reshape(-1, 1)

# Apply encoder to the target variable
data['Encoded_AQI_Bucket'] = ordinal_encoder.fit_transform(y_reshaped)

# Use 'Encoded_AQI_Bucket' as target variable
y = data['Encoded_AQI_Bucket']

# Drop AQI_Bucket
data = data.drop(columns=['AQI_Bucket'])
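
As a quick sanity check on the encoding above, the sketch below fits the same `OrdinalEncoder` configuration on a few made-up labels and confirms that the codes follow the declared order (`Good` = 0 through `Severe` = 5). The toy labels and the names `toy_labels`, `toy_codes`, `label_to_code`, and `decoded` are hypothetical; only the category order comes from the cell above.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Same category order as in the cell above
categories_order = [['Good', 'Satisfactory', 'Moderate', 'Poor', 'Very Poor', 'Severe']]
encoder = OrdinalEncoder(categories=categories_order)

# Hypothetical labels, used only for illustration
toy_labels = np.array(['Poor', 'Good', 'Severe', 'Moderate'], dtype=object).reshape(-1, 1)
toy_codes = encoder.fit_transform(toy_labels).ravel()

# Map every category name to its numeric code
label_to_code = {label: float(code) for code, label in enumerate(encoder.categories_[0])}

# inverse_transform recovers the original string labels from the codes
decoded = encoder.inverse_transform(toy_codes.reshape(-1, 1)).ravel()
```

Keeping the fitted encoder around (or saving `label_to_code`) makes it possible to translate model predictions back into bucket names later.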

### Encode Variable - City

In [3]:
from sklearn.preprocessing import OrdinalEncoder

# Define order of categories for City
categories_order = [['Ahmedabad', 'Aizawl', 'Amaravati', 'Amritsar', 'Bengaluru',
       'Bhopal', 'Brajrajnagar', 'Chandigarh', 'Chennai', 'Coimbatore',
       'Delhi', 'Ernakulam', 'Gurugram', 'Guwahati', 'Hyderabad',
       'Jaipur', 'Jorapokhar', 'Kochi', 'Kolkata', 'Lucknow', 'Mumbai',
       'Patna', 'Shillong', 'Talcher', 'Thiruvananthapuram',
       'Visakhapatnam']]

# Create encoder object
ordinal_encoder = OrdinalEncoder(categories=categories_order)

# Reshape the City column to the 2D format the encoder expects
city_reshaped = data['City'].values.reshape(-1, 1)

# Apply encoder to the City column
data['Encoded_City'] = ordinal_encoder.fit_transform(city_reshaped)

# Note: Encoded_City is an input feature, not the target.
# The target remains Encoded_AQI_Bucket.

# Drop City
data = data.drop(columns=['City'])
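
One caveat worth noting: cities have no natural order, so the alphabetical codes (0 for Ahmedabad through 25 for Visakhapatnam) suggest a ranking that distance-based methods may treat as meaningful. A possible alternative, not used in this notebook, is one-hot encoding. The sketch below uses a small hypothetical frame and `pd.get_dummies`; the `City` prefix and the toy values are illustrative choices.

```python
import pandas as pd

# Hypothetical mini-frame standing in for the real data
toy = pd.DataFrame({
    'City': ['Delhi', 'Mumbai', 'Delhi', 'Chennai'],
    'PM2.5': [150.0, 60.0, 180.0, 45.0],
})

# One indicator column per city instead of a single ordinal code
one_hot = pd.get_dummies(toy, columns=['City'], prefix='City', dtype=int)
city_columns = [c for c in one_hot.columns if c.startswith('City_')]
```

The trade-off is width: with 26 cities this adds 26 indicator columns in place of one.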

### Extract Date Features

In [4]:
# Convert the 'Date' column to a datetime format
data['Date'] = pd.to_datetime(data['Date'])

data['year'] = data['Date'].dt.year
data['month'] = data['Date'].dt.month
data['day'] = data['Date'].dt.day
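
To illustrate what the `.dt` accessors return, here is the same extraction applied to two dates taken from the preview above. The frame name `toy` is hypothetical. The resulting columns are plain integers, which is why the original `Date` column can be dropped afterwards.

```python
import pandas as pd

# Two dates from the data preview, as strings (the form read_csv produces)
toy = pd.DataFrame({'Date': ['2015-01-29', '2015-02-01']})
toy['Date'] = pd.to_datetime(toy['Date'])

# Same three derived columns as in the cell above
toy['year'] = toy['Date'].dt.year
toy['month'] = toy['Date'].dt.month
toy['day'] = toy['Date'].dt.day

date_parts = toy[['year', 'month', 'day']].values.tolist()
```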

### Drop Unused Columns

In [5]:
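# Drop Date (replaced by year/month/day) and AQI (AQI_Bucket is derived from it,
# so keeping it would leak the target). Also drop 'Unnamed: 0', the row index
# left over from the CSV, so it is not used as a feature.
data = data.drop(columns=['Date', 'AQI', 'Unnamed: 0'], errors='ignore')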

### Oversampling

In [6]:
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

X = data.drop(columns=['Encoded_AQI_Bucket'])
y = data['Encoded_AQI_Bucket']

# Keep 80% for training; hold out 20% to be split later into validation and test sets
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Apply SMOTE to the training set only, so the validation and test sets
# keep their original class distribution
smote = SMOTE(sampling_strategy='auto', random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
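
For intuition about what `fit_resample` produces, the sketch below implements the core SMOTE idea in plain NumPy: each synthetic sample is placed at a random point on the segment between a minority sample and one of its k nearest minority neighbours. This is a simplified illustration, not imbalanced-learn's implementation; the function name `smote_like_samples`, its parameters, and the toy data are all hypothetical.

```python
import numpy as np

def smote_like_samples(X_minority, n_new, k=5, seed=42):
    """Generate n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    n = len(X_minority)
    k = min(k, n - 1)  # cannot have more neighbours than other samples

    # Pairwise Euclidean distances between minority samples
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a sample is not its own neighbour

    # Indices of the k nearest neighbours for every sample
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(n)                  # pick a random minority sample
        nn = neighbours[base, rng.integers(k)]  # pick one of its neighbours
        gap = rng.random()                      # interpolation factor in [0, 1)
        synthetic[i] = X_minority[base] + gap * (X_minority[nn] - X_minority[base])
    return synthetic

# Hypothetical minority class: 6 points with 2 features
X_min = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0],
                  [2.0, 1.0], [4.0, 2.0], [3.0, 1.0]])
X_syn = smote_like_samples(X_min, n_new=10, k=3)
```

Because new rows are interpolated, integer-coded columns such as `Encoded_City`, `month`, and `day` will generally take fractional values in the resampled training set. If that matters for the downstream model, imbalanced-learn's `SMOTENC`, which treats designated columns as categorical, may be worth considering.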

### Split Remaining Data

In [7]:
# Split the held-out 20% evenly into validation and test sets (10% of the full data each)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp)
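
A quick way to confirm the 80/10/10 proportions is to run the same two-step split on placeholder data. The 1,000-row arrays below are hypothetical and stand in for `X` and `y`; all variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 1,000 rows, 3 features, two imbalanced classes (900 vs 100)
X_toy = np.arange(3000).reshape(1000, 3)
y_toy = np.array([0] * 900 + [1] * 100)

# Step 1: 80% train, 20% held out
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy)

# Step 2: split the held-out part evenly into validation and test
X_v, X_te, y_v, y_te = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=42, stratify=y_tmp)

sizes = (len(X_tr), len(X_v), len(X_te))
# Share of the minority class in each subset; stratify keeps these close
minority_share = (y_tr.mean(), y_v.mean(), y_te.mean())
```

With `stratify` set in both steps, each subset should keep roughly the same class mix as the full data, which `minority_share` makes easy to inspect.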

### Save Train, Validation and Test Data to CSVs

In [8]:
# Save the datasets to CSV files
X_resampled.to_csv('X_resampled_train.csv', index=False)
y_resampled.to_csv('y_resampled_train.csv', index=False)

X_val.to_csv('X_val.csv', index=False)
y_val.to_csv('y_val.csv', index=False)

X_test.to_csv('X_test.csv', index=False)
y_test.to_csv('y_test.csv', index=False)
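
When these files are read back in a later notebook, note that each `y_*.csv` holds a single column with the header `Encoded_AQI_Bucket`, so `read_csv` returns a one-column DataFrame rather than a Series. The sketch below, using a hypothetical Series and a temporary directory, shows one way to recover a Series with `squeeze`.

```python
import os
import tempfile
import pandas as pd

# Hypothetical target values standing in for y_val
y_toy = pd.Series([0.0, 3.0, 5.0, 2.0], name='Encoded_AQI_Bucket')

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'y_val.csv')
    y_toy.to_csv(path, index=False)                  # same call pattern as above
    y_loaded = pd.read_csv(path).squeeze('columns')  # one-column DataFrame -> Series
```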