<a href="https://colab.research.google.com/github/nawazullakhankayani-cadiff/St20329043_CMP7005_PRAC1.ipynb/blob/main/St20329043_CMP7005_PRAC1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [11]:

!git config --global user.name "nawazullakhankayani-cadiff"
!git config --global user.email "mr.kayaninawazullakhan.uk@gmail.com"




# 📘 Introduction to the Dataset

The dataset used in this project is the India Air Quality Dataset, collected from monitoring stations across 26 major cities. It includes pollution readings recorded between 2015 and 2020 from cities such as Delhi, Mumbai, Kolkata, Bengaluru, Kochi, Jaipur, and many others. The dataset contains key pollutants, including 🌫 PM2.5, 🌁 PM10, 🟡 SO₂, 🟦 NO₂, 🟥 CO, and 🟩 O₃, which are essential for assessing air quality. Each record also carries the city and date, allowing comparisons across regions and time periods. This makes the dataset useful for studying pollution patterns, identifying highly affected cities, and observing how air quality changes over the years. It supports several key tasks in this project, including data cleaning, exploratory data analysis, and building predictive models. Overall, the dataset offers a clear and comprehensive view of air pollution in India and supports meaningful visual and interactive analysis.

In [20]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns

In [14]:
from google.colab import drive

In [15]:
drive.mount('/content/drive')

Mounted at /content/drive


In [16]:
drive_path = '/content/drive/MyDrive/ Data Analysis_St20329043/Assessment Data-20251118'

# 📑 Merging CSV Data Sources

The data-loading process begins by scanning the specified directory (drive_path) and identifying all files with a .csv extension. Each CSV file is read individually using pandas.read_csv() and stored in a list named dataframes. Once all files are imported, the list of DataFrames is merged into a single unified dataset using pd.concat(), ensuring the index is reset for consistency. This approach allows seamless integration of multiple data sources and prepares a consolidated dataset suitable for subsequent cleaning, exploration, and modelling tasks.

In [35]:

dataframes = []
for filename in os.listdir(drive_path):
    if filename.endswith('.csv'):
        file_path = os.path.join(drive_path, filename)
        df = pd.read_csv(file_path)
        dataframes.append(df)

In [36]:
# Check all CSV names
print("All CSV files found:")
csv_files = []
for filename in os.listdir(drive_path):
    if filename.endswith('.csv'):
        csv_files.append(filename)

for name in csv_files:
    print(name)

All CSV files found:
Kochi_data.csv
Talcher_data.csv
Jaipur_data.csv
Mumbai_data.csv
Visakhapatnam_data.csv
Bengaluru_data.csv
Kolkata_data.csv
Ernakulam_data.csv
Patna_data.csv
Thiruvananthapuram_data.csv
Ahmedabad_data.csv
Jorapokhar_data.csv
Hyderabad_data.csv
Guwahati_data.csv
Coimbatore_data.csv
Brajrajnagar_data.csv
Shillong_data.csv
Bhopal_data.csv
Chandigarh_data.csv
Gurugram_data.csv
Delhi_data.csv
Lucknow_data.csv
Amritsar_data.csv
Chennai_data.csv
Aizawl_data.csv
Amaravati_data.csv


In [37]:
df.columns

Index(['City', 'Date', 'PM2.5', 'PM10', 'NO', 'NO2', 'NOx', 'NH3', 'CO', 'SO2',
       'O3', 'Benzene', 'Toluene', 'Xylene', 'AQI', 'AQI_Bucket'],
      dtype='object')

The dataset includes a mix of pollutant measurements, air quality indicators, and basic city/date information:

Pollutants measured:
PM2.5, PM10, NO, NO2, NOx, NH3, CO, SO2, O3

Volatile organic compounds:
Benzene, Toluene, Xylene

Air Quality Indicators:
AQI and AQI_Bucket

Other columns:
City and Date

In [23]:
df = pd.concat(dataframes, ignore_index=True)
df

Unnamed: 0,City,Date,PM2.5,PM10,NO,NO2,NOx,NH3,CO,SO2,O3,Benzene,Toluene,Xylene,AQI,AQI_Bucket
0,Kochi,22/01/2020,46.54,113.87,21.86,84.84,99.99,27.89,1.25,7.01,19.69,,0.00,,,
1,Kochi,23/01/2020,62.48,110.34,22.08,55.67,73.57,25.35,1.33,6.48,9.54,,0.00,,110.0,Moderate
2,Kochi,24/01/2020,62.87,114.86,37.44,60.49,97.80,25.60,1.44,6.91,9.06,,0.00,,111.0,Moderate
3,Kochi,25/01/2020,61.76,113.70,92.78,67.73,160.51,22.11,1.34,6.68,11.29,,0.00,,144.0,Moderate
4,Kochi,26/01/2020,66.76,113.24,106.79,58.69,165.49,20.70,1.36,6.75,11.43,,0.05,,197.0,Moderate
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29526,Amaravati,27/06/2020,14.50,24.43,1.53,6.53,4.72,8.97,0.55,13.75,33.76,0.06,0.27,0.15,42.0,Good
29527,Amaravati,28/06/2020,16.65,28.51,1.43,8.32,5.59,9.77,0.66,10.86,37.34,0.10,0.43,0.12,49.0,Good
29528,Amaravati,29/06/2020,20.96,32.56,1.65,9.55,6.43,14.30,0.66,14.79,43.29,0.12,0.69,0.10,56.0,Satisfactory
29529,Amaravati,30/06/2020,21.34,35.16,1.74,10.69,7.10,13.38,0.66,14.58,45.32,0.14,1.42,0.20,61.0,Satisfactory


In [25]:
df.head(10)

Unnamed: 0,City,Date,PM2.5,PM10,NO,NO2,NOx,NH3,CO,SO2,O3,Benzene,Toluene,Xylene,AQI,AQI_Bucket
0,Kochi,22/01/2020,46.54,113.87,21.86,84.84,99.99,27.89,1.25,7.01,19.69,,0.0,,,
1,Kochi,23/01/2020,62.48,110.34,22.08,55.67,73.57,25.35,1.33,6.48,9.54,,0.0,,110.0,Moderate
2,Kochi,24/01/2020,62.87,114.86,37.44,60.49,97.8,25.6,1.44,6.91,9.06,,0.0,,111.0,Moderate
3,Kochi,25/01/2020,61.76,113.7,92.78,67.73,160.51,22.11,1.34,6.68,11.29,,0.0,,144.0,Moderate
4,Kochi,26/01/2020,66.76,113.24,106.79,58.69,165.49,20.7,1.36,6.75,11.43,,0.05,,197.0,Moderate
5,Kochi,27/01/2020,67.99,114.06,100.13,57.16,157.29,19.87,1.31,6.21,10.63,,0.0,,179.0,Moderate
6,Kochi,28/01/2020,68.04,119.42,85.93,56.04,141.97,20.26,1.38,6.06,9.78,,0.05,,167.0,Moderate
7,Kochi,29/01/2020,71.79,119.6,73.72,59.63,133.35,21.6,1.4,4.93,14.14,,0.0,,159.0,Moderate
8,Kochi,30/01/2020,52.43,99.98,92.06,54.21,146.26,19.78,1.36,9.03,11.96,,0.0,,153.0,Moderate
9,Kochi,31/01/2020,46.71,109.08,106.14,53.77,159.91,19.59,1.32,20.29,14.61,,0.0,,185.0,Moderate


# 📘 Dataset Columns (India Air Quality Dataset)

This dataset records air-pollution measurements collected from different cities across India. Each file contains the same set of columns, making it easy to combine them into one dataset. The main columns included are:

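City – Identifies the location where the air-quality reading was taken.

Date – The calendar date of the observation in DD/MM/YYYY format, from which Year, Month, and Day are later extracted.

PM2.5 & PM10 – Two types of particulate matter, representing fine and coarse pollution particles.

SO₂, NO₂, CO, O₃ – Four major gaseous pollutants, reported as concentrations (typically in micrograms per cubic meter; CO is usually reported in milligrams per cubic meter).

NO, NOx, NH3, Benzene, Toluene, Xylene – Additional gases and volatile organic compounds recorded alongside the main pollutants.

AQI & AQI_Bucket – The air quality index and its category label (for example Good, Satisfactory, Moderate).

Year, Month, Day – Time-based features derived later from the Date column to support trend analysis; they are not part of the raw files.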

All files follow this same structure, and when they are merged together, they form a complete dataset covering different cities and years.
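
As a minimal sketch of how the time-based features could be derived, the snippet below parses a small made-up sample in the same DD/MM/YYYY layout as the Date column. The output column names (`Year`, `Month`, `Day`) and the sample rows are assumptions for illustration, not the notebook's actual cleaning step.

```python
import pandas as pd

# Tiny made-up sample that mimics the City/Date layout of the merged data.
sample = pd.DataFrame({
    "City": ["Kochi", "Kochi", "Amaravati"],
    "Date": ["22/01/2020", "23/01/2020", "01/07/2020"],
})

# Parse with an explicit day-first format so 01/07/2020 is read as 1 July,
# not 7 January.
sample["Date"] = pd.to_datetime(sample["Date"], format="%d/%m/%Y")

# Derive the calendar features (column names are illustrative assumptions).
sample["Year"] = sample["Date"].dt.year
sample["Month"] = sample["Date"].dt.month
sample["Day"] = sample["Date"].dt.day

print(sample)
```

Passing an explicit `format` avoids ambiguous day/month guesses and makes malformed dates fail loudly instead of being parsed silently.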

In [26]:
df.tail(10)

Unnamed: 0,City,Date,PM2.5,PM10,NO,NO2,NOx,NH3,CO,SO2,O3,Benzene,Toluene,Xylene,AQI,AQI_Bucket
29521,Amaravati,22/06/2020,17.97,50.49,2.05,12.1,8.1,10.92,0.65,11.68,30.15,0.17,1.55,0.53,46.0,Good
29522,Amaravati,23/06/2020,15.48,34.84,2.07,9.15,6.55,13.53,0.58,15.88,39.82,0.11,0.8,0.16,55.0,Satisfactory
29523,Amaravati,24/06/2020,14.88,35.33,1.86,10.59,7.15,11.62,0.59,14.04,38.01,0.13,1.03,0.21,52.0,Satisfactory
29524,Amaravati,25/06/2020,20.74,37.36,1.9,8.44,6.03,14.57,0.62,15.38,39.32,0.13,0.89,0.28,50.0,Good
29525,Amaravati,26/06/2020,10.41,20.55,1.88,7.21,5.37,12.46,0.53,15.86,25.96,0.06,0.3,,37.0,Good
29526,Amaravati,27/06/2020,14.5,24.43,1.53,6.53,4.72,8.97,0.55,13.75,33.76,0.06,0.27,0.15,42.0,Good
29527,Amaravati,28/06/2020,16.65,28.51,1.43,8.32,5.59,9.77,0.66,10.86,37.34,0.1,0.43,0.12,49.0,Good
29528,Amaravati,29/06/2020,20.96,32.56,1.65,9.55,6.43,14.3,0.66,14.79,43.29,0.12,0.69,0.1,56.0,Satisfactory
29529,Amaravati,30/06/2020,21.34,35.16,1.74,10.69,7.1,13.38,0.66,14.58,45.32,0.14,1.42,0.2,61.0,Satisfactory
29530,Amaravati,01/07/2020,22.0,34.0,1.5,9.68,6.4,8.45,0.59,10.88,29.15,0.1,0.5,,54.0,Satisfactory


In [27]:
print(f"The Data has {df.shape[0]} total records and {df.shape[1]} columns ")

The Data has 29531 total records and 16 columns 


In [28]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29531 entries, 0 to 29530
Data columns (total 16 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   City        29531 non-null  object 
 1   Date        29531 non-null  object 
 2   PM2.5       24933 non-null  float64
 3   PM10        18391 non-null  float64
 4   NO          25949 non-null  float64
 5   NO2         25946 non-null  float64
 6   NOx         25346 non-null  float64
 7   NH3         19203 non-null  float64
 8   CO          27472 non-null  float64
 9   SO2         25677 non-null  float64
 10  O3          25509 non-null  float64
 11  Benzene     23908 non-null  float64
 12  Toluene     21490 non-null  float64
 13  Xylene      11422 non-null  float64
 14  AQI         24850 non-null  float64
 15  AQI_Bucket  24850 non-null  object 
dtypes: float64(13), object(3)
memory usage: 3.6+ MB


The dataset has 29,531 rows and 16 columns, showing air quality measurements for different cities in India. Each row is one day's data for a city.

✔ What's in the Dataset

Numeric columns: PM2.5, PM10, NO, NO2, NOx, NH3, CO, SO2, O3, Benzene, Toluene, Xylene, AQI

Other info: City, Date, AQI_Bucket (Good, Moderate, Poor, etc.)

Missing Data

Some pollutants are not recorded every day:

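PM2.5: 24,933 of 29,531 values available (about 84%)

PM10: 18,391 values available (about 62%)

NH3: 19,203 values available (about 65%)

Xylene: 11,422 values available (about 39%)

Most columns are largely complete, but PM10, NH3, and especially Xylene have substantial gaps that will need attention during data cleaning.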
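
The gaps noted above can be quantified per column rather than read off `df.info()`. The following is a rough sketch on a small made-up frame; the sample values and the names `missing_count` and `missing_pct` are assumptions, and the same two calls would apply to the merged `df`.

```python
import numpy as np
import pandas as pd

# Small made-up frame with deliberate gaps, standing in for the merged data.
sample = pd.DataFrame({
    "City": ["Kochi", "Kochi", "Delhi", "Delhi"],
    "PM2.5": [46.54, 62.48, np.nan, 80.0],
    "Xylene": [np.nan, np.nan, np.nan, 0.15],
})

# isna().sum() counts missing cells per column; isna().mean() gives the
# fraction missing, which is converted here to a percentage.
missing = pd.DataFrame({
    "missing_count": sample.isna().sum(),
    "missing_pct": (sample.isna().mean() * 100).round(1),
}).sort_values("missing_pct", ascending=False)

print(missing)
```

Sorting by percentage puts the sparsest columns first, which helps when deciding whether to impute or drop a column.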

✔ Why This Dataset Is Useful

We can use this data to:

Compare pollution levels between cities

Check trends over time

See which pollutants are highest in different areas

Study overall air quality and AQI categories
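
Two of the uses listed above, comparing cities and studying AQI categories, might be sketched as follows. The sample rows are made up for illustration, and the variable names `city_pm25` and `bucket_counts` are assumptions rather than part of the notebook.

```python
import pandas as pd

# Made-up rows using the same column names as the merged dataset.
sample = pd.DataFrame({
    "City": ["Kochi", "Kochi", "Amaravati", "Amaravati"],
    "PM2.5": [62.0, 68.0, 14.0, 16.0],
    "AQI_Bucket": ["Moderate", "Moderate", "Good", "Satisfactory"],
})

# Average PM2.5 per city, highest first (mean() skips NaN by default).
city_pm25 = sample.groupby("City")["PM2.5"].mean().sort_values(ascending=False)

# Number of days each city spent in each AQI category.
bucket_counts = pd.crosstab(sample["City"], sample["AQI_Bucket"])

print(city_pm25)
print(bucket_counts)
```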

In [29]:
df.duplicated().sum()

np.int64(0)

In [30]:
df = df.drop('NO', axis=1, errors='ignore')

In [None]:
city  = df['City'].value_counts()
print(f'Total number of cities in the dataset : {len(city)}')
city