<a href="https://colab.research.google.com/github/kubrayuksekkaya/python-basics-data-science-project/blob/main/example_notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Air Quality Index (AQI) of Cities with a Population over 750,000
## Introduction

In this section, we analyze Air Quality Index (AQI) data for cities with a population over 750,000. The dataset contains the population of 5307 cities and the 2022 AQI level of each city. Our aim is to determine the AQI level of each city and to measure the correlation between AQI and population.


## Data Loading and Cleaning
First, let's import the required libraries:

In [4]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import folium
from folium.plugins import MarkerCluster

Next, let's load the city population data and the 2022 AQI data:

In [None]:
# Load the population data of 5307 cities
all_cities_data = pd.read_csv("population-cities-data.csv", nrows=5307)
capitals = all_cities_data["city"].tolist()

# Load the 2022 AQI data of 5307 cities
aqi_data_2022 = pd.read_csv("air-pollution-rank-cities-2022.csv", nrows=5307)
aqi_city_pop_list = aqi_data_2022['2022_avg'].tolist()
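The two lists above are later paired by row position, which only works if both files list the same cities in the same order. A minimal sketch of a more defensive alternative joins the tables by city name instead. It assumes that both files expose a city column; the toy frames and their values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical stand-ins for the two CSV files; values are illustrative only
pop_demo = pd.DataFrame({"city": ["A", "B", "C"],
                         "pop2023": [900_000, 2_500_000, 1_200_000]})
aqi_demo = pd.DataFrame({"city": ["C", "A", "D"],
                         "2022_avg": [80.0, 42.0, 120.0]})

# An inner join keeps only cities present in both tables, regardless of row order
merged_demo = pop_demo.merge(aqi_demo, on="city", how="inner")
print(merged_demo)
```

Cities that appear in only one of the two tables ("B" and "D" here) drop out of the result, so every remaining row has both a population and an AQI value.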

We will create a dictionary data_pop_cities to store the AQI data for the cities with a population over 750,000.

In [None]:
# Initialize an empty dictionary to store the data
data_pop_cities = {}

# Zip the lists of cities and AQI data
zip_data_city_pop = zip(capitals, aqi_city_pop_list)

# Add only the cities that have AQI data to the dictionary
for city, aqi in zip_data_city_pop:
    if pd.notna(aqi):  # skip cities whose AQI value is missing
        data_pop_cities[city] = aqi
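The loop above does not itself check population, so it relies on the population file already being limited to large cities. If that were not the case, a threshold filter could be applied first. The following is a minimal sketch on made-up data; pop2023 is the column name used later in this notebook, while aqi is a hypothetical column name chosen for the example.

```python
import pandas as pd

# Illustrative data only; "aqi" is a hypothetical column name
cities_demo = pd.DataFrame({
    "city": ["A", "B", "C", "D"],
    "pop2023": [600_000, 800_000, 5_000_000, 1_000_000],
    "aqi": [30.0, 75.0, None, 160.0],
})

# Boolean mask: population above the threshold and a non-missing AQI value
mask = (cities_demo["pop2023"] > 750_000) & cities_demo["aqi"].notna()
filtered_demo = cities_demo[mask]

# Build the city -> AQI dictionary from the filtered rows
aqi_by_city_demo = dict(zip(filtered_demo["city"], filtered_demo["aqi"]))
print(aqi_by_city_demo)
```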

## AQI Level Description
Let's create a function determine_aqi_level to determine the AQI level description based on the AQI value:

In [None]:
def determine_aqi_level(aqi):
    if aqi <= 50:
        return "Good"
    elif aqi <= 100:
        return "Moderate"
    elif aqi <= 150:
        return "Unhealthy for Sensitive Groups"
    elif aqi <= 200:
        return "Unhealthy"
    elif aqi <= 300:
        return "Very Unhealthy"
    else:
        return "Hazardous"
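As an aside, the same thresholds can be applied to a whole column at once. The sketch below uses pandas cut with bin edges copied from the function above; the input values are illustrative only.

```python
import numpy as np
import pandas as pd

# Upper bounds match the thresholds in determine_aqi_level;
# pd.cut intervals are right-inclusive by default, like the <= comparisons
bins = [-np.inf, 50, 100, 150, 200, 300, np.inf]
labels = ["Good", "Moderate", "Unhealthy for Sensitive Groups",
          "Unhealthy", "Very Unhealthy", "Hazardous"]

aqi_values_demo = pd.Series([12, 50, 51, 150, 201, 450])  # illustrative values
levels_demo = pd.cut(aqi_values_demo, bins=bins, labels=labels)
print(levels_demo.tolist())
```

One difference is worth noting: a missing value stays missing with cut, whereas the if/elif chain falls through to "Hazardous" for NaN because every comparison with NaN is false.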

We will use the function to determine the AQI level description for each city in the data_pop_cities dictionary.

In [None]:
# Determine the AQI level description for each city
# and collect the results in a list
aqi_levels = [determine_aqi_level(aqi) for city, aqi in data_pop_cities.items()]

## Correlation between City Population and AQI
This section measures the correlation between city population and AQI.

We start by creating a new dataframe containing city names and AQI values. The population column is then added to the dataframe, and both the population and AQI columns are converted to a numeric type.

Finally, we compute the correlation between population and AQI with the corr method of the pandas library, which returns the Pearson coefficient by default, and print the result.

In [None]:
# Create a new dataframe from (city, aqi) pairs.
# Each pair supplies one row with two fields, so the shape of the data
# matches the two column names passed to the DataFrame constructor.
city_aqi_pairs = list(zip(capitals, aqi_city_pop_list))
df = pd.DataFrame(city_aqi_pairs, columns=['city', 'aqi'])

# Add the population column to the dataframe
df['population'] = all_cities_data['pop2023']

# Ensure that the data types of the population and AQI columns are numeric
df['population'] = df['population'].astype(float)
df['aqi'] = df['aqi'].astype(float)

# Find the correlation level between population and AQI
corr_pop_cities = df['population'].corr(df['aqi'])

print(f"The correlation between AQI level and population of the cities is {corr_pop_cities}")
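To make the coefficient less of a black box, here is a small worked sketch that computes Pearson's r by hand and compares it with the pandas result. The numbers are made up for illustration. A rank-based (Spearman) variant is included as well, computed here by correlating the ranks, since population figures tend to be heavily skewed.

```python
import numpy as np
import pandas as pd

# Illustrative values only, not taken from the dataset
demo = pd.DataFrame({
    "population": [1.0e6, 2.0e6, 4.0e6, 8.0e6],
    "aqi": [40.0, 55.0, 90.0, 70.0],
})

# Pearson r: sum of products of centered values, divided by the
# square root of the product of the sums of squared centered values
x = demo["population"] - demo["population"].mean()
y = demo["aqi"] - demo["aqi"].mean()
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

# pandas computes the Pearson coefficient by default
r_pandas = demo["population"].corr(demo["aqi"])

# Spearman coefficient: Pearson correlation of the ranks
r_spearman = demo["population"].rank().corr(demo["aqi"].rank())

print(r_manual, r_pandas, r_spearman)
```

The manual and pandas values agree up to floating-point error. The Spearman value only depends on the ordering of the cities, so a single very large city influences it less than it influences Pearson's r.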

## Scatter Plot of City Population and AQI
This section creates a scatter plot to visualize the relationship between city population and AQI.

We use the scatterplot method from the seaborn library to create a scatter plot of city population (x-axis) and AQI (y-axis) based on the data in the df dataframe.

The plot is then adjusted by setting the title, x-label, and y-label. The final step is to display the plot using the show method from the matplotlib library.

In [None]:
# Create a scatter plot
sns.scatterplot(x='population', y='aqi', data=df)

# Adjust the labels and the title
plt.title("Scatter plot of city population and AQI")
plt.xlabel("Population (in ten millions)")
plt.ylabel("AQI")

# Show the plot
plt.show()

## Heatmap of Correlation between City Population and AQI
This section creates a heatmap to visualize the correlation between city population and AQI.

We start by creating a correlation dataframe between population and AQI from the data in the df dataframe. This dataframe is then passed to the heatmap method from the seaborn library to create the heatmap.

The heatmap is then annotated to display the actual correlation values and is displayed using the show method from the matplotlib library.

In [None]:
# Create the correlation dataframe
corr_pop_cities_2 = df[['population', 'aqi']].corr()

# Create the heatmap
sns.heatmap(corr_pop_cities_2, annot=True)

# Show the plot
plt.show()
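With only two variables, the matrix passed to the heatmap is 2×2: ones on the diagonal and the same coefficient in both off-diagonal cells. A small sketch with made-up numbers shows that structure.

```python
import pandas as pd

# Illustrative values only, not taken from the dataset
demo = pd.DataFrame({
    "population": [1.0e6, 2.0e6, 4.0e6, 8.0e6],
    "aqi": [40.0, 55.0, 90.0, 70.0],
})

# Pairwise Pearson correlations between all numeric columns
corr_demo = demo[["population", "aqi"]].corr()
print(corr_demo)

# The single informative number is the off-diagonal entry
r_demo = corr_demo.loc["population", "aqi"]
```

For two columns the heatmap therefore conveys one value; it becomes more informative once further numeric columns are added to the dataframe.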

## Histogram of AQI Levels for Cities
This section shows the distribution of AQI levels for cities with a population over 750,000.

We start by creating a dataframe from the AQI levels for the cities, and then pass it to the countplot method from the seaborn library, which draws one bar per AQI level showing the number of cities at that level. Strictly speaking, this is a bar chart of category counts rather than a histogram of numeric values.

The bar width, the figure size, and the font sizes of the labels are adjusted to make the plot more readable. Finally, the plot is displayed using the show method from the matplotlib library.

In [None]:
# Create a dataframe from the AQI levels for the cities
data = {'AQI Level': aqi_levels}
df1 = pd.DataFrame(data)

# Create the histogram
sns.countplot(x='AQI Level', data=df1, width=0.5,
              order=["Good", "Moderate", "Unhealthy for Sensitive Groups", "Unhealthy", "Very Unhealthy", "Hazardous"])

# Adjust the size of the figure
plt.gcf().set_size_inches(10, 5)

# Adjust the font size of the labels
plt.xticks(fontsize=7)
plt.xlabel("AQI Level", fontsize=15)
plt.ylabel("Number of Cities", fontsize=15)
plt.title("AQI Levels for Cities with population over 750,000", fontsize=15)

# Show plot
plt.show()
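The numbers behind the bars can also be inspected as a table. The sketch below counts the labels with value_counts and reindexes them into the same order as the plot; the label list is illustrative only.

```python
import pandas as pd

level_order = ["Good", "Moderate", "Unhealthy for Sensitive Groups",
               "Unhealthy", "Very Unhealthy", "Hazardous"]

# Illustrative labels only, not taken from the dataset
levels_demo = pd.Series(["Good", "Moderate", "Moderate",
                         "Unhealthy", "Good", "Moderate"])

# Count each label, then reindex so that absent categories appear with a count of 0
counts_demo = levels_demo.value_counts().reindex(level_order, fill_value=0)
print(counts_demo)
```

Reindexing with fill_value=0 keeps empty categories visible, which mirrors how the order argument reserves a slot for every level in the plot.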