# **Air Quality Forecast: Machine Learning Model**

Air quality has a significant impact on human health and the environment. The main factors affecting air quality in India are:

- PM2.5 and PM10: Particulate matter
- NO, NO2, NOx: Nitrogen oxides
- NH3: Ammonia
- CO: Carbon monoxide
- SO2: Sulfur dioxide
- O3: Ozone
- Benzene, Toluene, Xylene: Volatile organic compounds

The primary goal of our air quality prediction model is to accurately forecast the Air Quality Index (AQI). AQI is an indicator that shows the level of air pollution and its effects on public health.

![AQI Mini Image](https://www.deq.ok.gov/wp-content/uploads/air-division/aqi_mini-768x432.png)



Our model predicts future AQI values based on the levels of various pollutants that affect air quality. These predictions assist decision-makers in issuing health alerts, formulating environmental policies, optimizing traffic and industrial management, and helping the general public plan their daily activities.

**What is the business problem you are trying to solve using machine learning?**
* The goal of this project is to predict future Air Quality Index (AQI) values by using machine learning algorithms to analyze the factors that affect AQI. These predictions can be used to monitor air quality and develop improvement strategies. We aim to identify the impact of each pollutant parameter (PM2.5, PM10, NO, NO2, NOx, NH3, CO, SO2, O3, Benzene, Toluene, Xylene) on AQI and to forecast AQI from the future values of these parameters.

**Why are we interested in solving this problem? What impact will it have on the business?**

- Solving this problem is of great importance for public health, environmental sustainability, and the protection of biodiversity. Accurately predicting air quality allows authorities and the public to take measures against air pollution. For example, health alerts and precautions can be issued. Regulations and policies related to air pollution can be developed. Industrial and traffic management can be optimized. By providing a cleaner environment to society, the quality of life can be improved.

**What are some known issues with the data? (data entry errors, missing data, unit differences, etc.)**
- Missing Data: Several columns, including the target column, have missing values.
- Data Types: The Date column is loaded with the object data type and needs to be converted to datetime before any date-based analysis.
- Seasonal Variations: Seasonal effects can introduce variability in the data.
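
The date conversion mentioned in the list above can be sketched as follows. This is a minimal illustration on a hypothetical three-row sample, not the notebook's actual cell: the `YYYY-MM-DD` date format, the sample values, and the extra `Month` column are assumptions. A month column is one simple way to make seasonal variation visible to later analysis.

```python
import pandas as pd

# Hypothetical sample with the same column names as the dataset
sample = pd.DataFrame({
    "City": ["Delhi", "Delhi", "Delhi"],
    "Date": ["2015-01-01", "2015-01-02", "2015-07-15"],  # assumed date format
    "AQI": [300.0, 320.0, 110.0],
})

# Convert the object column to datetime; unparseable entries become NaT
sample["Date"] = pd.to_datetime(sample["Date"], errors="coerce")

# Derive a month column that can be used to inspect seasonal effects
sample["Month"] = sample["Date"].dt.month

print(sample.dtypes)
```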

### Loading the required Library Packages

In [None]:
# Install required libraries
%pip install numpy pandas matplotlib seaborn scikit-learn
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
from warnings import filterwarnings
filterwarnings('ignore')

from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn import metrics
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.metrics import accuracy_score, confusion_matrix

### Reading and exploring the Air Quality Dataset

In [None]:
# Ensure pandas is imported
import pandas as pd

df = pd.read_csv('air quality data.csv')

**ANALYZING THE DATASET**
- Size, data types, null value rate, statistical information for each column

In [None]:
df.head()

In [None]:
# Checking the number of rows and columns of the dataset
df.shape

In [None]:
# Dataset Information Overview
df.info()

In [None]:
df.isnull().sum()
# There are a lot of missing values present in the dataset

In [None]:
# There are no duplicate values present in the dataset
df.duplicated().sum()

In [None]:
# Drop rows where the 'AQI' column has missing values.
# With inplace=True, dropna modifies df directly and returns None, so the result is not assigned.
df.dropna(subset=['AQI'], inplace=True)


In [None]:
df.isnull().sum().sort_values(ascending=False)

In [None]:
df.shape

In [None]:
# Summary Statistics for the Dataset
df.describe().T 

In [None]:
# Percentage of null values in each column of this DataFrame
null_values_percentage = (df.isnull().sum()/df.isnull().count()*100).sort_values(ascending=False)

In [None]:
null_values_percentage

### Key Considerations:
Xylene has the highest percentage of missing values (61.86%), so you'll need to decide whether to impute these values or drop the feature.

PM10 and NH3 also have significant missing values (roughly 26-28%).

No Missing Values:
City, Date, AQI, and AQI_Bucket have 0% null values
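
The impute-or-drop decision above can be made explicit with a small helper. The sketch below is one possible approach, not the notebook's method: the function names, the 50% threshold, and the toy values are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

def missing_percentage(frame):
    """Percentage of missing values per column, highest first."""
    return (frame.isnull().mean() * 100).sort_values(ascending=False)

def drop_sparse_columns(frame, max_missing_pct=50.0):
    """Drop columns whose missing share exceeds max_missing_pct (assumed threshold)."""
    pct = missing_percentage(frame)
    to_drop = pct[pct > max_missing_pct].index
    return frame.drop(columns=to_drop)

# Hypothetical toy data: Xylene is mostly missing, PM10 only partly
toy = pd.DataFrame({
    "Xylene": [np.nan, np.nan, np.nan, 1.0],
    "PM10": [1.0, np.nan, 3.0, 4.0],
    "AQI": [1.0, 2.0, 3.0, 4.0],
})

print(missing_percentage(toy))
print(drop_sparse_columns(toy).columns.tolist())
```

Columns that stay below the threshold can then be imputed, as the notebook does later with column means.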

### Now we will start our Data Exploration using Visualization EDA - Univariate analysis for each feature

In [None]:
import matplotlib.pyplot as plt

df['Xylene'].plot(kind='hist', figsize=(10, 5))
plt.legend()
plt.show()

In [None]:
df['PM10'].plot(kind='hist',figsize=(10,5))
plt.legend()
plt.show()

In [None]:
df['NH3'].plot(kind='hist',figsize=(10,5))
plt.legend()
plt.show()

In [None]:
df['Toluene'].plot(kind='hist',figsize=(10,5))
plt.legend()
plt.show()

In [None]:
df['Benzene'].plot(kind='hist',figsize=(10,5))
plt.legend()
plt.show()

In [None]:
df['NOx'].plot(kind='hist',figsize=(10,5))
plt.legend()
plt.show()

In [None]:
df['O3'].plot(kind='hist',figsize=(10,5))
plt.legend()
plt.show()

In [None]:
df['PM2.5'].plot(kind='hist',figsize=(10,5))
plt.legend()
plt.show()

In [None]:
df['SO2'].plot(kind='hist',figsize=(10,5))
plt.legend()
plt.show()

In [None]:
df['CO'].plot(kind='hist',figsize=(10,5))
plt.legend()
plt.show()

In [None]:
df['NO2'].plot(kind='hist',figsize=(10,5))
plt.legend()
plt.show()

In [None]:
df['NO'].plot(kind='hist',figsize=(10,5))
plt.legend()
plt.show()

In [None]:
df['AQI'].plot(kind='hist',figsize=(10,5))
plt.legend()
plt.show()

In [None]:
# distribution of aqi from 2015-2020
import seaborn as sns
sns.displot(df, x="AQI", color="purple")
plt.show()

The plot below shows how often each city appears in the whole dataset.

### Bivariate analysis

In [None]:
sns.set(style="darkgrid")
graph=sns.catplot(x="City",kind="count",data=df,height=5,aspect=3)
graph.set_xticklabels(rotation=90)

The plot below shows how often each city appears in each category of the "AQI_Bucket" variable: "Poor", "Very Poor", "Severe", "Moderate", "Satisfactory", and "Good". It gives an idea of how the cities are distributed across the AQI categories.

In [None]:
sns.set(style="darkgrid")
graph=sns.catplot(x="City",kind="count",data=df,col="AQI_Bucket",col_wrap=2,height=3.5,aspect=3)
graph.set_xticklabels(rotation=90)

The plots below are a sequence of boxplots that show the distribution of the non-null values of the numerical variables across the cities.

In [None]:
graph1=sns.catplot(x="City",y="PM2.5",kind="box",data=df,height=5,aspect=3)
graph1.set_xticklabels(rotation=90)

In [None]:
graph2=sns.catplot(x="City",y="NO2",kind="box",data=df,height=5,aspect=3)
graph2.set_xticklabels(rotation=90)

In [None]:
graph3=sns.catplot(x="City",y="O3",data=df,kind="box",height=5,aspect=3)
graph3.set_xticklabels(rotation=90)

In [None]:
graph4=sns.catplot(x="City",y="SO2",data=df,kind="box",height=5,aspect=3)
graph4.set_xticklabels(rotation=90)

In [None]:
graph5=sns.catplot(data=df,kind="box",x="City",y="NOx",height=6,aspect=3)
graph5.set_xticklabels(rotation=90)

In [None]:
graph6=sns.catplot(data=df,kind="box",x="City",y="NO",height=6,aspect=3)
graph6.set_xticklabels(rotation=90)

The plot below shows the frequencies of the different categories of the variable AQI_Bucket.

In [None]:
graph7=sns.catplot(x="AQI_Bucket",data=df,kind="count",height=6,aspect=3)
graph7.set_xticklabels(rotation=90)

#### Checking all null values and treating those null values.

In [None]:
# Checking all null values

df.isnull().sum().sort_values(ascending=False)

# Xylene has the most null values, followed by PM10 and NH3

In [None]:
df.describe().loc["mean"]

In [None]:
df = df.replace({

"PM2.5" :{np.nan:67.476613},
"PM10" :{np.nan:118.454435},
"NO": {np.nan:17.622421},
"NO2": {np.nan:28.978391},
"NOx": {np.nan:32.289012},
"NH3": {np.nan:23.848366},
"CO":  {np.nan:2.345267},
"SO2": {np.nan:34.912885},
"O3": {np.nan:38.320547},
"Benzene": {np.nan:3.458668},
"Toluene": {np.nan:9.525714},
"Xylene": {np.nan:3.588683}})
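
The cell above fills each gap with a hard-coded column mean. As a hedged alternative, the same idea can be written so that the means are computed from the data rather than typed in. The helper name and the toy values below are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def fill_with_column_means(frame):
    """Fill missing numeric values with the mean of their own column."""
    means = frame.mean(numeric_only=True)  # one mean per numeric column
    return frame.fillna(means)             # non-numeric columns are left untouched

# Hypothetical toy data
toy = pd.DataFrame({
    "PM2.5": [10.0, np.nan, 30.0],
    "City": ["A", "B", "C"],
})
filled = fill_with_column_means(toy)
print(filled)
```

Computing the means at run time keeps the imputation consistent if rows are added or removed earlier in the notebook.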


In [None]:
df.isnull().sum()

In [None]:
graph=sns.catplot(x="AQI_Bucket",data=df,kind="count",height=6,aspect=3)
graph.set_xticklabels(rotation=90)

We drop AQI_Bucket from the dataset because it is a categorical label derived from AQI, not a pollutant measurement that affects air quality.
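
To illustrate why AQI_Bucket carries no independent information, the sketch below rebuilds such a label from AQI values with `pd.cut`. The breakpoints are an assumption: they follow the commonly cited ranges of India's National Air Quality Index (0-50 Good, 51-100 Satisfactory, 101-200 Moderate, 201-300 Poor, 301-400 Very Poor, above 400 Severe) and are not stated in this notebook.

```python
import numpy as np
import pandas as pd

# Assumed category edges (upper bounds inclusive); not taken from the notebook
AQI_EDGES = [0, 50, 100, 200, 300, 400, np.inf]
AQI_LABELS = ["Good", "Satisfactory", "Moderate", "Poor", "Very Poor", "Severe"]

def bucket_aqi(aqi):
    """Map numeric AQI values to category labels."""
    return pd.cut(aqi, bins=AQI_EDGES, labels=AQI_LABELS, include_lowest=True)

example = bucket_aqi(pd.Series([45, 150, 450]))
print(example.tolist())
```

Because the label is a function of the target, keeping it as a feature would leak the answer to the model.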

In [None]:
df = df.drop(["AQI_Bucket"], axis=1)

In [None]:
df.head()

### Detecting Outliers and Treatment

We drew boxplots to observe outlier data.

In [None]:
sns.boxplot(data=df[[ 'PM2.5', 'PM10']])

In [None]:
sns.boxplot(data=df[[ 'NO', 'NO2', 'NOx','NH3']]) 

In [None]:
sns.boxplot(data=df[[ 'O3', 'CO', 'SO2']])

In [None]:
sns.boxplot(data=df[[ 'Benzene', 'Toluene', 'Xylene']])

**DATA EDITING PROCEDURES**
- Procedures related to outlier data, missing data, data that has little relationship with our target column

We observed that there were many outliers in our independent variables. Since modeling with this data could give misleading results, we replaced the outliers with quartile values.

In [None]:
# This function takes a DataFrame as a parameter and identifies outliers in its numeric columns.
# It replaces these outliers with the corresponding quartile values (Q1 or Q3). Outliers are identified using the interquartile range (IQR).
def replace_outliers_with_quartiles(df):
    
    for column in df.select_dtypes(include=['number']).columns: # Used to cycle through all numeric columns in the DataFrame.
        Q1 = df[column].quantile(0.25)
        Q3 = df[column].quantile(0.75)
        IQR = Q3 - Q1
        # To identify outliers, lower and upper limits are calculated and values outside these limits are considered outliers.
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        # For each column, we identify outliers and replace them with Q1 or Q3. We do this using a lambda function.
        #If the value is less than the lower bound, it is replaced with Q1. If it is greater than the upper bound, 
        #it is replaced with Q3. In the last case, the value is not changed and remains the same.
        df[column] = df[column].apply(
            lambda x: Q1 if x < lower_bound else (Q3 if x > upper_bound else x)
        )
    
    return df 

df = replace_outliers_with_quartiles(df)
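
The same IQR rule can be expressed without a per-value lambda. The sketch below is a vectorized variant under the same definition of an outlier (below Q1 - 1.5·IQR or above Q3 + 1.5·IQR). The function name and toy data are assumptions, and unlike the function above it works on a copy instead of modifying its input.

```python
import pandas as pd

def replace_outliers_vectorized(frame):
    """Replace IQR outliers with Q1 (low side) or Q3 (high side), column by column."""
    result = frame.copy()  # leave the caller's DataFrame untouched
    for column in result.select_dtypes(include="number").columns:
        values = result[column]
        q1, q3 = values.quantile(0.25), values.quantile(0.75)
        iqr = q3 - q1
        lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
        # mask() replaces entries where the condition is True
        result[column] = values.mask(values < lower, q1).mask(values > upper, q3)
    return result

# Hypothetical toy data with one extreme reading
toy = pd.DataFrame({"PM2.5": [1.0, 2.0, 3.0, 4.0, 100.0]})
cleaned = replace_outliers_vectorized(toy)
print(cleaned)
```

Note that either version treats every numeric column the same way, so the target column is adjusted too if it is still in the DataFrame.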

In [None]:
df.describe().T

We created another box plot for the data in the columns of the current DataFrame.

In [None]:
sns.boxplot(data=df[[ 'PM2.5', 'PM10']])

In [None]:
sns.boxplot(data=df[[ 'NO', 'NO2', 'NOx','NH3']])

In [None]:
sns.boxplot(data=df[[ 'O3', 'CO', 'SO2']])

In [None]:
sns.boxplot(data=df[[ 'Benzene', 'Toluene', 'Xylene']])

In [None]:
# distribution of aqi from 2015-2020
sns.displot(df, x="AQI", color="red")
plt.show()

In [None]:
df1=df.drop(columns=['City'])

#### Multivariate analysis

In [None]:
plt.figure(figsize=(12, 8)) 
sns.heatmap(df1.select_dtypes(include=['number']).corr(), annot=True) 
plt.show() 

The variables most strongly correlated with AQI appear to be PM2.5, PM10, CO and NOx. We will base our predictions on variables whose correlation with AQI is above 0.25.
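
A correlation threshold like the 0.25 mentioned above can be applied programmatically. The following is a minimal sketch: the helper name, the use of the absolute correlation, and the toy data are assumptions rather than the notebook's own selection step.

```python
import pandas as pd

def select_by_correlation(frame, target="AQI", threshold=0.25):
    """Return numeric columns whose absolute correlation with target exceeds threshold."""
    corr = frame.select_dtypes(include="number").corr()[target].drop(target)
    return corr[corr.abs() > threshold].index.tolist()

# Hypothetical toy data: PM2.5 tracks AQI exactly, Noise is uncorrelated with it
toy = pd.DataFrame({
    "AQI": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "PM2.5": [2.0, 4.0, 6.0, 8.0, 10.0, 12.0],
    "Noise": [1.0, -1.0, 0.0, 0.0, -1.0, 1.0],
})
selected = select_by_correlation(toy)
print(selected)
```

Using the absolute value keeps strongly negative correlations as well, which a plain `> 0.25` filter would discard.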

In [None]:
df.head()

**PREPARING THE DATA**
- Determining the numerical and categorical columns of the dataset, filling the empty values, scaling with a scaler, applying one-hot encoding to categorical data, splitting the dataset into training, test and validation sets, and determining the input and target columns of the resulting sets

In [None]:
df.columns

In [None]:
# Dropping unnecessary columns
df.drop(['Date'],axis=1,inplace=True)        # no need of this feature
df.drop(['City'],axis=1,inplace=True)        # we predict from the pollutant parameters, not from the location, so we drop this

In [None]:
df

In [None]:
from sklearn.preprocessing import StandardScaler
df1 = StandardScaler().fit_transform(df)

In [None]:
df1

In [None]:
df = pd.DataFrame(df1,columns = df.columns)

In [None]:
df

**DATA MODELING**

In [None]:
# Data Preparation for Modeling
x=df[["PM2.5","PM10","NO","NO2","NOx","NH3","CO","SO2","O3","Benzene","Toluene","Xylene"]]
y=df["AQI"]

In [None]:
x.head()

In [None]:
y.head()

In [None]:
X_train,X_test,Y_train,Y_test=train_test_split(x,y,test_size=0.2,random_state=70)
print(X_train.shape,X_test.shape,Y_train.shape,Y_test.shape)
# splitting the data into training and testing data
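
The scaling cells above fit `StandardScaler` on the whole table, target included, before the split. A common alternative is to split first and fit the scaler on the training features only, so that no statistics from the test rows reach the model. The sketch below shows that ordering on synthetic data: the array names and the generated values are hypothetical stand-ins, not the notebook's variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-ins for pollutant features and AQI
rng = np.random.default_rng(0)
features = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
target = features.sum(axis=1)

X_tr, X_te, y_tr, y_te = train_test_split(
    features, target, test_size=0.2, random_state=70
)

scaler = StandardScaler().fit(X_tr)   # statistics come from training rows only
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)  # test rows reuse the training statistics

print(X_tr_scaled.mean(axis=0).round(6), X_te_scaled.shape)
```

Tree-based models such as random forests are largely insensitive to feature scaling, so this ordering matters most when a linear model is compared against them.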

## Applying an appropriate Regression algorithm to build a model.

#### Random Forest Regressor

In [None]:
import pickle
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# Train a RandomForest model on the air quality split created above
# (X_train, X_test, Y_train, Y_test), not on a synthetic dataset
model = RandomForestRegressor(random_state=42)
model.fit(X_train, Y_train)

# Evaluate on both splits
print("R² on train:", r2_score(Y_train, model.predict(X_train)))
print("R² on test:", r2_score(Y_test, model.predict(X_test)))

# Save the trained model
with open("air_quality_prediction.pkl", "wb") as f:
    pickle.dump(model, f)

print("Model trained and saved successfully!")

The R² score indicates how much of the variability in the air quality data the model explains. A low score would mean that the model is not learning the relationships between pollutant levels and AQI effectively. The scores obtained for this model are reported in the Model Accuracy Comparison section.

### Model Accuracy Comparison

Inferences from Model Accuracy Comparison

RandomForest Performs Well:

Among the algorithms tested, RandomForest exhibits the highest R² score: 0.9775989 on the training set and 0.847178 on the test set.

The R² score on the training dataset was 97.75%, meaning the model explains 97.75% of the variance of the target variable in the training data. In other words, its predictions closely follow the AQI values it was trained on.

The R² score on the test dataset was 84.71%, meaning the model explains 84.71% of the variance of the target variable in data it has not seen before. This indicates good predictive performance on new data, although the gap between the training and test scores suggests some overfitting.
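
For reference, the R² values discussed above follow the standard definition, one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it by hand with NumPy. The function name and the sample numbers are illustrative assumptions, not results from this notebook.

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """R² = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return 1.0 - ss_res / ss_tot

# Illustrative values only
score = r2_manual([3.0, -0.5, 2.0, 7.0], [2.5, 0.0, 2.0, 8.0])
print(round(score, 4))
```

A score of 1 means the predictions match the targets exactly, while a score near 0 means the model does no better than always predicting the mean.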
