**Introduction**

Air pollution is a critical environmental issue, impacting health and ecosystems worldwide. This project focuses on analyzing and predicting air quality using a dataset collected from various monitoring stations in India. The data includes geographic details, pollutant levels (PM10, NH3, OZONE, etc.), and statistical measures such as minimum, maximum, and average pollutant concentrations.

Using machine learning techniques, the project aims to uncover pollution trends, identify high-risk areas, and predict pollution levels to support decision-making. Such predictions can help target interventions that reduce the adverse effects of pollution.

In [None]:
import pandas as pd
import numpy as np

**Load the Dataset**

In [None]:
df = pd.read_csv('/content/dataset1.csv')
df.head()

Unnamed: 0,country,state,city,station,last_update,latitude,longitude,pollutant_id,pollutant_min,pollutant_max,pollutant_avg
0,India,Assam,Byrnihat,"Central Academy for SFS, Byrnihat - PCBA",17-01-2025 18:00,26.071318,91.87488,PM10,75.0,252.0,177.0
1,India,Assam,Byrnihat,"Central Academy for SFS, Byrnihat - PCBA",17-01-2025 18:00,26.071318,91.87488,NH3,1.0,5.0,3.0
2,India,Assam,Byrnihat,"Central Academy for SFS, Byrnihat - PCBA",17-01-2025 18:00,26.071318,91.87488,OZONE,22.0,57.0,43.0
3,India,Assam,Guwahati,"IITG, Guwahati - PCBA",17-01-2025 18:00,26.202864,91.700464,PM10,59.0,199.0,152.0
4,India,Assam,Guwahati,"IITG, Guwahati - PCBA",17-01-2025 18:00,26.202864,91.700464,NH3,5.0,5.0,5.0


In [None]:
type(df)

pandas.core.frame.DataFrame

In [None]:
df.describe()

Unnamed: 0,latitude,longitude,pollutant_min,pollutant_max,pollutant_avg
count,1358.0,1358.0,1291.0,1291.0,1291.0
mean,22.260818,78.817265,33.15182,88.290473,56.346243
std,5.489397,4.94249,39.877859,98.7165,61.487052
min,8.514909,70.909168,1.0,1.0,1.0
25%,18.9767,75.675238,6.0,17.0,12.0
50%,23.041137,77.482194,18.0,54.0,35.0
75%,26.766433,80.948222,47.0,120.0,81.0
max,34.066206,94.636574,293.0,500.0,378.0


In [None]:
df.duplicated()

Unnamed: 0,0
0,False
1,False
2,False
3,False
4,False
...,...
1353,False
1354,False
1355,False
1356,False
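
`df.duplicated()` returns one Boolean per row, so the listing above is hard to summarize at a glance. A minimal sketch of counting and dropping duplicates follows; the frame `toy` and the names `n_duplicates` and `deduplicated` are hypothetical stand-ins, not part of the dataset.

```python
import pandas as pd

# Hypothetical toy frame standing in for df; rows 0 and 2 are identical.
toy = pd.DataFrame({
    "station": ["A", "B", "A"],
    "pollutant_id": ["PM10", "PM10", "PM10"],
    "pollutant_avg": [177.0, 152.0, 177.0],
})

# duplicated() marks every row that repeats an earlier row; summing counts them.
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each repeated row.
deduplicated = toy.drop_duplicates()

print(n_duplicates, len(deduplicated))
```

On the real data, the same `.sum()` call on `df.duplicated()` would give a single number instead of a long Boolean column.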


**Identify Missing Values**

In [None]:
df.isnull().sum()

Unnamed: 0,0
country,0
state,0
city,0
station,0
last_update,0
latitude,0
longitude,0
pollutant_id,0
pollutant_min,67
pollutant_max,67


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1358 entries, 0 to 1357
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   country        1358 non-null   object 
 1   state          1358 non-null   object 
 2   city           1358 non-null   object 
 3   station        1358 non-null   object 
 4   last_update    1358 non-null   object 
 5   latitude       1358 non-null   float64
 6   longitude      1358 non-null   float64
 7   pollutant_id   1358 non-null   object 
 8   pollutant_min  1291 non-null   float64
 9   pollutant_max  1291 non-null   float64
 10  pollutant_avg  1291 non-null   float64
dtypes: float64(5), object(6)
memory usage: 116.8+ KB


**Handle Missing Values**

In [None]:
# Assign the filled column back; calling fillna(inplace=True) on df[col] is deprecated
for col in ['pollutant_min', 'pollutant_max', 'pollutant_avg']:
    df[col] = df[col].fillna(df[col].mean())

In [None]:
df.isnull().sum()

Unnamed: 0,0
country,0
state,0
city,0
station,0
last_update,0
latitude,0
longitude,0
pollutant_id,0
pollutant_min,0
pollutant_max,0
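
The fill above uses one mean per column across all pollutants, even though the sample rows show PM10 and NH3 on very different scales (177.0 versus 3.0). One possible alternative, sketched here on hypothetical toy data, is to impute within each `pollutant_id` group using the group median. This is an assumption about what might suit the data, not the method used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the column names mirror the dataset, the values are made up.
toy = pd.DataFrame({
    "pollutant_id": ["PM10", "PM10", "PM10", "NH3", "NH3", "NH3"],
    "pollutant_avg": [150.0, np.nan, 170.0, 3.0, 5.0, np.nan],
})

# transform("median") returns a Series aligned with toy, holding each row's group median
# (NaN values are skipped when the median is computed).
group_median = toy.groupby("pollutant_id")["pollutant_avg"].transform("median")

# Fill each missing value with the median of its own pollutant.
toy["pollutant_avg"] = toy["pollutant_avg"].fillna(group_median)

print(toy)
```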


In [None]:
# Convert strings to datetime; the source values are day-first (e.g. 17-01-2025 18:00)
df['last_update'] = pd.to_datetime(df['last_update'], format='%d-%m-%Y %H:%M')


In [None]:
df.head()

Unnamed: 0,country,state,city,station,last_update,latitude,longitude,pollutant_id,pollutant_min,pollutant_max,pollutant_avg
0,India,Assam,Byrnihat,"Central Academy for SFS, Byrnihat - PCBA",2025-01-17 18:00:00,26.071318,91.87488,PM10,75.0,252.0,177.0
1,India,Assam,Byrnihat,"Central Academy for SFS, Byrnihat - PCBA",2025-01-17 18:00:00,26.071318,91.87488,NH3,1.0,5.0,3.0
2,India,Assam,Byrnihat,"Central Academy for SFS, Byrnihat - PCBA",2025-01-17 18:00:00,26.071318,91.87488,OZONE,22.0,57.0,43.0
3,India,Assam,Guwahati,"IITG, Guwahati - PCBA",2025-01-17 18:00:00,26.202864,91.700464,PM10,59.0,199.0,152.0
4,India,Assam,Guwahati,"IITG, Guwahati - PCBA",2025-01-17 18:00:00,26.202864,91.700464,NH3,5.0,5.0,5.0


In [None]:
df1 = df.copy()  # independent copy, so later changes to df1 do not alter df

In [None]:
df1.head()

Unnamed: 0,country,state,city,station,last_update,latitude,longitude,pollutant_id,pollutant_min,pollutant_max,pollutant_avg
0,India,Assam,Byrnihat,"Central Academy for SFS, Byrnihat - PCBA",2025-01-17 18:00:00,26.071318,91.87488,PM10,75.0,252.0,177.0
1,India,Assam,Byrnihat,"Central Academy for SFS, Byrnihat - PCBA",2025-01-17 18:00:00,26.071318,91.87488,NH3,1.0,5.0,3.0
2,India,Assam,Byrnihat,"Central Academy for SFS, Byrnihat - PCBA",2025-01-17 18:00:00,26.071318,91.87488,OZONE,22.0,57.0,43.0
3,India,Assam,Guwahati,"IITG, Guwahati - PCBA",2025-01-17 18:00:00,26.202864,91.700464,PM10,59.0,199.0,152.0
4,India,Assam,Guwahati,"IITG, Guwahati - PCBA",2025-01-17 18:00:00,26.202864,91.700464,NH3,5.0,5.0,5.0


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

In [None]:
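# Safeguard: drop rows with a missing target. This is a no-op here because
# pollutant_avg was already mean-filled above; note that those 67 imputed rows
# (mean ≈ 56.3) all fall into the "Moderate" class created in the next step.
data_cleaned = df1.dropna(subset=["pollutant_avg"]).copy()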

In [None]:
# Categorize 'pollutant_avg' into classes (Low, Moderate, High)
# Using arbitrary thresholds for demonstration
bins = [0, 50, 100, float('inf')]
labels = ["Low", "Moderate", "High"]
data_cleaned["pollutant_class"] = pd.cut(data_cleaned["pollutant_avg"], bins=bins, labels=labels)
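
By default `pd.cut` builds right-closed intervals, so the bins above are (0, 50], (50, 100] and (100, inf). A small sketch with hypothetical sample values shows where the boundary values land:

```python
import pandas as pd

bins = [0, 50, 100, float("inf")]
labels = ["Low", "Moderate", "High"]

# Hypothetical values chosen to sit on and around the bin edges.
values = pd.Series([1.0, 50.0, 50.5, 100.0, 378.0])
classes = pd.cut(values, bins=bins, labels=labels)

# A value of exactly 0 falls outside (0, 50] and becomes NaN.
zero_class = pd.cut(pd.Series([0.0]), bins=bins, labels=labels)

print(classes.tolist())
print(zero_class.isna().tolist())
```

Since the smallest `pollutant_avg` reported by `df.describe()` is 1.0, the open lower edge at 0 does not drop any rows here.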

In [None]:
# Encode categorical features
label_encoders = {}
categorical_columns = ["country", "state", "city", "station", "pollutant_id"]

for col in categorical_columns:
    le = LabelEncoder()
    data_cleaned[col] = le.fit_transform(data_cleaned[col])
    label_encoders[col] = le
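
`LabelEncoder` assigns integer codes in sorted order of the class names, and keeping each fitted encoder in `label_encoders` makes it possible to map codes back to the original strings later. A minimal sketch, using a hypothetical short list of pollutant names:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of pollutant_id values.
pollutants = ["PM10", "NH3", "OZONE", "PM10"]

le = LabelEncoder()
codes = le.fit_transform(pollutants)

# classes_ holds the sorted unique labels; a label's code is its position in this array.
print(list(le.classes_))

# inverse_transform maps integer codes back to the original labels.
decoded = le.inverse_transform([2, 0])
print(list(codes), list(decoded))
```

The codes carry an arbitrary ordering (NH3 < OZONE < PM10 here), which distance-based models such as KNN will treat as meaningful; one-hot encoding is a common alternative when that is a concern.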

In [None]:
# Drop unnecessary columns
data_preprocessed = data_cleaned.drop(columns=["last_update", "pollutant_min", "pollutant_max", "pollutant_avg"])

**Split Data into Features (X) and Target (y)**

In [None]:
X = data_preprocessed.drop(columns=["pollutant_class"])
y = data_preprocessed["pollutant_class"]

**Split into Training and Testing Sets**

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

X_train.shape, X_test.shape, y_train.value_counts(), y_test.value_counts()

((1086, 7),
 (272, 7),
 pollutant_class
 Low         627
 Moderate    255
 High        204
 Name: count, dtype: int64,
 pollutant_class
 Low         167
 Moderate     63
 High         42
 Name: count, dtype: int64)
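
The class counts above are imbalanced, with "Low" making up more than half of both splits. One option, shown here as a sketch on hypothetical toy labels rather than as the split used in this notebook, is to pass `stratify` to `train_test_split` so that each split keeps the same class proportions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 60 Low, 25 Moderate, 15 High.
y_toy = np.array(["Low"] * 60 + ["Moderate"] * 25 + ["High"] * 15)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# stratify=y_toy preserves the 60/25/15 proportions in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

low_count = int((y_te == "Low").sum())
moderate_count = int((y_te == "Moderate").sum())
high_count = int((y_te == "High").sum())
print(low_count, moderate_count, high_count)
```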

**KNN**

In [None]:
from sklearn import metrics
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

In [None]:
knn_class = KNeighborsClassifier(n_neighbors=20, metric='minkowski', p=2)
knn_class.fit(X_train,y_train)
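
With `metric='minkowski'` and `p=2`, KNN uses Euclidean distance, so features with large numeric ranges (for example integer station codes) can outweigh features with small ranges (for example latitude). A common remedy is to standardize features before fitting. The sketch below uses hypothetical synthetic data and is not the configuration used in this notebook.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical features on very different scales: an integer code and a latitude-like value.
X_toy = np.column_stack([
    rng.integers(0, 400, size=200),
    rng.normal(22.0, 5.0, size=200),
])
# Hypothetical binary label that depends only on the small-scale feature.
y_toy = (X_toy[:, 1] > 22.0).astype(int)

# The pipeline standardizes each feature to zero mean and unit variance before KNN.
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=20))
scaled_knn.fit(X_toy, y_toy)

preds = scaled_knn.predict(X_toy[:5])
print(preds)
```

Wrapping the scaler in a pipeline means it is fitted on the training data only and then reused unchanged at prediction time.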






In [None]:
y_pred_knn=knn_class.predict(X_test)
Accuracy_Knn=round((metrics.accuracy_score(y_test, y_pred_knn)*100),2)
print('Accuracy (KNN): ',Accuracy_Knn,"%")

Accuracy (KNN):  61.4 %
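
To put this accuracy in context, it helps to compare it with a classifier that always predicts the most frequent class. The sketch below rebuilds label arrays from the class counts printed for the train/test split (627/255/204 and 167/63/42); the feature arrays are placeholder zeros, since `DummyClassifier` ignores them.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Label arrays rebuilt from the class counts shown for the split above.
y_train_toy = np.array(["Low"] * 627 + ["Moderate"] * 255 + ["High"] * 204)
y_test_toy = np.array(["Low"] * 167 + ["Moderate"] * 63 + ["High"] * 42)

# Placeholder features; the most_frequent strategy never looks at them.
X_train_toy = np.zeros((len(y_train_toy), 1))
X_test_toy = np.zeros((len(y_test_toy), 1))

baseline = DummyClassifier(strategy="most_frequent").fit(X_train_toy, y_train_toy)
baseline_acc = round(baseline.score(X_test_toy, y_test_toy) * 100, 2)
print("Accuracy (majority baseline):", baseline_acc, "%")
```

Always predicting "Low" scores 167 / 272, which is about 61.4%, so the KNN model as configured does not improve on this baseline in terms of accuracy.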


**Classification Report**

In [None]:
unique_classes = np.unique(np.concatenate((y_test, y_pred_knn)))

print(classification_report(y_test, y_pred_knn, target_names=[f'class {i}' for i in unique_classes]))

                precision    recall  f1-score   support

    class High       0.29      0.05      0.08        42
     class Low       0.63      0.96      0.76       167
class Moderate       0.50      0.06      0.11        63

      accuracy                           0.61       272
     macro avg       0.47      0.36      0.32       272
  weighted avg       0.54      0.61      0.50       272
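
The two averaged rows can be reproduced from the per-class rows: the macro average is the unweighted mean over classes, and the weighted average weights each class by its support. A short check using the rounded f1-scores from the report (so the results agree only up to rounding):

```python
# Per-class f1-scores and supports copied from the report above.
f1 = {"High": 0.08, "Low": 0.76, "Moderate": 0.11}
support = {"High": 42, "Low": 167, "Moderate": 63}

# Macro average: every class counts equally.
macro_f1 = sum(f1.values()) / len(f1)

# Weighted average: each class counts in proportion to its number of test samples.
weighted_f1 = sum(f1[c] * support[c] for c in f1) / sum(support.values())

print(round(macro_f1, 2), round(weighted_f1, 2))
```

The gap between the two averages reflects the imbalance: the weighted figure is pulled up by the large "Low" class, while the macro figure exposes the weak scores on "High" and "Moderate".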

