# Description:
This project analyzes India's air quality trends from 2020 to 2022 using a dataset covering key pollutants: SO₂, NO₂, PM10, and PM2.5. The analysis identifies pollution levels across states and cities and visualizes how they change over time.

**Key highlights include:**

- Handling missing and inconsistent data (e.g., leaving missing pollutant readings as nulls and merging inconsistent spellings of the same state name).
- Exploratory Data Analysis (EDA) in Python to derive meaningful insights.
- Visualization of pollutant trends and state-wise AQI variations using Matplotlib and Seaborn.
- Key findings, such as the most polluted states and the impact of specific pollutants.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns  # used for the bar plots of top polluted states and cities

In [None]:
df=pd.read_excel("/kaggle/input/aqi-2020-2022/AQI 2020-2022.xlsx")
df.head(10)

The data is stored in an Excel file, so we load it with pd.read_excel().
df.head(10) shows the first 10 rows and gives us a first idea of the data.

**Looking at basic information**

In [None]:
df.shape

The dataset has 1139 rows and 8 columns.

In [None]:
df.describe()

df.describe() shows the descriptive statistics of the numeric columns.

In [None]:
df.info()

A few columns contain null values, which means no data was collected for that pollutant. Instead of replacing them with some value, we leave them as they are, because imputed values might affect the AQI calculation.
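
To see how many readings are missing per column, and to confirm that leaving them as nulls does not break the averages used later, a small check like the following can help. This is a minimal sketch on a made-up frame (`sample` and its values are hypothetical, not taken from the dataset).

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (values are made up).
sample = pd.DataFrame({
    "City": ["A", "B", "C", "D"],
    "SO2": [10.0, np.nan, 7.0, 12.0],
    "PM2.5": [np.nan, np.nan, 40.0, 55.0],
})

# Number of missing readings in each column.
missing_counts = sample.isna().sum()
print(missing_counts)

# mean() skips NaN by default, so unfilled gaps are simply ignored
# instead of being counted as zeros.
so2_mean = sample["SO2"].mean()
print(so2_mean)
```

On the real data, the same call would be `df.isna().sum()`.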

**Analysing the data**

In [None]:
df.rename(columns={'State / Union Territory': 'State', 'City / town': 'City'}, inplace=True)

We rename these two columns to make them easier to work with.

In [None]:
df.columns

In [None]:
mask1=df['Year'] == 2022 
mask2=df['State']=='Andhra Pradesh'
df[mask1 & mask2]

Boolean masks like these let us filter the dataset to focus on specific states, years, or pollutant levels.
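
The same masking pattern extends to pollutant thresholds and to several states at once. The sketch below uses a made-up frame (`sample` and the threshold of 70 are assumptions for illustration only).

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are made up).
sample = pd.DataFrame({
    "State": ["Andhra Pradesh", "Andhra Pradesh", "Kerala", "Kerala"],
    "Year": [2021, 2022, 2022, 2022],
    "PM10": [60.0, 72.0, 45.0, 110.0],
})

# Combine conditions with & (and) or | (or); each condition needs its own parentheses.
high_pm10_2022 = sample[(sample["Year"] == 2022) & (sample["PM10"] > 70)]
print(high_pm10_2022)

# isin() selects rows whose state appears in a list of states.
selected_states = sample[sample["State"].isin(["Kerala"])]
print(selected_states)
```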

In [None]:
df.groupby('State')['AQI'].mean().reset_index() 

We can see data entry inconsistencies in the state names, so we replace the variants with a single name: the (UT) suffix is dropped from union territory names, and stray line breaks and misspellings are fixed, e.g. Pondicherry (UT) becomes Puducherry, and so on.

**Cleaning the Data**

In [None]:
# Map each inconsistent spelling to a single canonical state / UT name
corrections = {
    'Chandigarh (UT)': 'Chandigarh',
    'Dadara & Nagar Haveli\nand Daman & Diu (UT)': 'Dadra & Nagar Haveli and Daman & Diu',
    'Dadara & Nagar Haveli and Daman & Diu (UT)': 'Dadra & Nagar Haveli and Daman & Diu',
    'Delhi (UT)': 'Delhi',
    'Jammu & Kashmir (UT)': 'Jammu & Kashmir',
    'Assam\nBihar': 'Assam',
    'Gujara': 'Gujarat',
    'Pondicherry (UT)': 'Puducherry',
}
df['State'] = df['State'].replace(corrections)
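
As an alternative to listing every variant by hand, the whitespace and (UT) suffix can be normalized with string methods first, leaving only genuine misspellings for the dictionary. This is a sketch of that idea, not the approach used above; the `states` series and the `spelling_fixes` dictionary are hypothetical.

```python
import pandas as pd

# Hypothetical raw values showing the kinds of variants seen in the State column.
states = pd.Series([
    "Delhi (UT)",
    "Dadara & Nagar Haveli\nand Daman & Diu (UT)",
    "  Kerala ",
    "Pondicherry (UT)",
])

normalized = (
    states
    .str.replace(r"\s+", " ", regex=True)        # collapse newlines and repeated spaces
    .str.strip()                                 # drop leading/trailing spaces
    .str.replace(r"\s*\(UT\)$", "", regex=True)  # drop a trailing "(UT)" marker
)

# Only real spelling differences still need an explicit mapping.
spelling_fixes = {
    "Dadara & Nagar Haveli and Daman & Diu": "Dadra & Nagar Haveli and Daman & Diu",
    "Pondicherry": "Puducherry",
}
cleaned = normalized.replace(spelling_fixes)
print(cleaned.tolist())
```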

In [None]:
df.groupby('State')['AQI'].mean().reset_index() 

In [None]:
df['State'].nunique()

You can now see that the data covers 28 states and 5 UTs.

**Checking for duplicates**

In [None]:
df.duplicated().sum()

There are no duplicate rows.
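
Note that `duplicated()` with no arguments only flags rows that match in every column. If the same state, city and year could appear twice with different readings, a check on those key columns catches it. A minimal sketch on made-up data (the frame and the choice of key columns are assumptions):

```python
import pandas as pd

# Toy frame: the first two rows share State/City/Year but differ in AQI.
sample = pd.DataFrame({
    "State": ["State X", "State X", "State X"],
    "City": ["City Y", "City Y", "City Y"],
    "Year": [2021, 2021, 2022],
    "AQI": [80, 85, 90],
})

# Rows identical in every column.
full_row_dupes = sample.duplicated().sum()

# Rows that repeat an earlier State/City/Year combination.
key_dupes = sample.duplicated(subset=["State", "City", "Year"]).sum()

print(full_row_dupes, key_dupes)
```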

# Basic Visualizations


**AQI Distribution**

In [None]:
# Alternative (requires seaborn imported as sns):
# sns.histplot(df['AQI'], kde=True, bins=20, color='blue')
# draws the same histogram with a KDE curve overlaid.

plt.hist(df['AQI'], bins=30, color='#abcdef', edgecolor='darkblue')

# Add a title and labels
plt.title("AQI Distribution")
plt.xlabel("AQI")
plt.ylabel("Frequency")

# Show the plot
plt.show()

**Average AQI by State**

In [None]:
# Average AQI per state
state_avg = df.groupby('State')['AQI'].mean()

# Horizontal bar chart: states on the y-axis, average AQI on the x-axis
state_avg.plot(kind='barh', figsize=(10, 10), color='#ff7f0e', edgecolor='darkblue')
plt.title("Average AQI by State")
plt.xlabel("Average AQI")
plt.ylabel("State")
plt.show()

In [None]:
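# Average level of each pollutant per state
# (named `pollutants` rather than `list` to avoid shadowing the built-in)
pollutants = ['SO2', 'NO2', 'PM10', 'PM2.5']
state_avg = df.groupby('State')[pollutants].mean()

# Horizontal bar chart: states on the y-axis, concentration on the x-axis
state_avg.plot(kind='barh', figsize=(15, 16), colormap='viridis')
plt.title("Average Pollutant Levels by State")
plt.xlabel("Average Concentration")
plt.ylabel("State")
plt.show()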

# Trends Over Time

**AQI trends over years**

In [None]:
yearly_aqi = df.groupby('Year')['AQI'].mean()

# Plot AQI trends over years
plt.plot(yearly_aqi,"*-m",mec='b')
plt.title("Average AQI Over the Years")
plt.xticks([2020,2021,2022])
plt.xlabel("Year")
plt.ylabel("Average AQI")
plt.grid(True)
plt.show()
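
To put numbers on the trend shown in the plot, the year-over-year change can be computed from the yearly averages. The values below are made up for illustration; on the real data the same calls would apply to `yearly_aqi`.

```python
import pandas as pd

# Hypothetical yearly average AQI values (not from the dataset).
yearly = pd.Series([100.0, 110.0, 99.0], index=[2020, 2021, 2022], name="AQI")

# Absolute change from the previous year (the first year has no predecessor, so NaN).
absolute_change = yearly.diff()

# Percentage change from the previous year.
percent_change = yearly.pct_change() * 100

print(pd.DataFrame({"AQI": yearly, "change": absolute_change, "change_%": percent_change}))
```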

**Pollutant Trends Over the Years**

In [None]:
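# Average level of each pollutant per year
# (named `pollutants` rather than `list` to avoid shadowing the built-in)
pollutants = ['SO2', 'NO2', 'PM10', 'PM2.5']
yearly_pollutants = df.groupby('Year')[pollutants].mean()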

# Plot pollutant trends
yearly_pollutants.plot(figsize=(5,4), marker='o')
plt.xticks([2020,2021,2022])
plt.title("Pollutant Trends Over the Years")
plt.ylabel("Average Concentration")
plt.show()

**Top Polluted states**

In [None]:
# Top 10 polluted states
top_states = df.groupby('State')['AQI'].mean().sort_values(ascending=False).head(10)

# Plot: average AQI on the x-axis, state names (the index) on the y-axis
# (sns is seaborn, imported as `import seaborn as sns`)
sns.barplot(x=top_states, y=top_states.index, palette='Reds_r')
plt.title("Top 10 Polluted States (Average AQI)")
plt.xlabel("Average AQI")
plt.ylabel("State")
plt.show()

In [None]:
# The index of this Series (state names) goes on the y-axis of the plot above,
# and its values (average AQI) go on the x-axis.
df.groupby('State')['AQI'].mean().sort_values(ascending=False).head(10)

**Top Polluted cities**

In [None]:
# Top 10 polluted cities
top_cities = df.groupby('City')['AQI'].mean().sort_values(ascending=False).head(10)

# Plot
sns.barplot(x=top_cities, y=top_cities.index, palette='cividis')
plt.title("Top 10 Polluted Cities (Average AQI)")
plt.xlabel("Average AQI")
plt.ylabel("City")
plt.show()


# Insights & Conclusions:

**Key Insights:**

**Top Pollutants**: PM10 and PM2.5 are the leading contributors to poor air quality.

**Most Polluted States**: Regions such as Delhi and Uttar Pradesh consistently recorded high AQI values.

**Yearly Trends**: A general improvement in AQI during 2020 (possibly due to COVID-19 lockdowns), followed by a gradual increase in pollutant levels in 2021 and 2022.
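
One way to probe the Top Pollutants insight is to check how strongly each pollutant correlates with AQI. The sketch below runs on made-up numbers, so its output says nothing about the real dataset; it only shows the calls. Correlation measures association and does not show how AQI is actually computed.

```python
import pandas as pd

# Toy readings (values are made up, not from the dataset).
sample = pd.DataFrame({
    "SO2": [5, 8, 6, 7, 9],
    "PM10": [50, 80, 120, 160, 200],
    "AQI": [60, 85, 125, 150, 190],
})

# Pearson correlation of every pollutant column with AQI, strongest first.
corr_with_aqi = (
    sample.corr()["AQI"]
    .drop("AQI")                 # remove AQI's correlation with itself
    .sort_values(ascending=False)
)
print(corr_with_aqi)
```

On the real data, this would be `df[['SO2', 'NO2', 'PM10', 'PM2.5', 'AQI']].corr()['AQI']`; pandas ignores missing values pairwise when computing it.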

**Final Conclusion**:
The analysis highlights the critical need for targeted interventions in highly polluted regions and stricter pollution control measures, especially for PM10 and PM2.5, which have the most significant impact on air quality.