App successfully hosted on Streamlit Cloud

link: https://airmattersapp.streamlit.app/

GitHub link:
https://github.com/himshisehi/CMP7005_Programming-_for_Data_Analysis.git

Commit History
![image.png](attachment:3f5fb513-c0cb-482d-b93d-2c48810eac3a.png)


In [1]:
## Import the necessary Libraries to run the code

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import missingno as msno
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline



In [None]:
pip install seaborn

In [None]:
pip install matplotlib

In [None]:
pip install missingno

Four monitoring sites are selected: Wanshouxigong as the urban site, Changping as the suburban site, Huairou as the rural site, and Aotizhongxin as the industrial site.

**Wanshouxigong : Located in central Beijing, this site represents typical urban air quality. Surrounded by dense traffic and residential/commercial zones.**

**Huairou : Situated in the northern rural region of Beijing. Characterized by lower population density and more natural surroundings.**

**Changping : Located in the northern outskirts of Beijing. Considered a suburban area, less dense than the city center but still developed.**

**Aotizhongxin : Known as a hotspot due to its proximity to Olympic venues and high development zones. This site is often monitored closely.**
            

***Fundamental Data Understanding***

In [None]:
## loading all four datasets here

urban_df = pd.read_csv("PRSA_Data_Wanshouxigong_20130301-20170228.csv")
suburban_df = pd.read_csv("PRSA_Data_Changping_20130301-20170228.csv")
rural_df = pd.read_csv("PRSA_Data_Huairou_20130301-20170228.csv")
industrial_df = pd.read_csv("PRSA_Data_Aotizhongxin_20130301-20170228.csv")

In [None]:
## Adding the Category columns gives context to each record for later analysis

urban_df['Category'] = 'Urban'
suburban_df['Category'] = 'Suburban'
rural_df['Category'] = 'Rural'
industrial_df['Category'] = 'Industrial'

In [None]:
## Combine the four site DataFrames into a single DataFrame

air_quality_df = pd.concat([urban_df, suburban_df, rural_df, industrial_df], ignore_index=True)
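
The load, tag, and combine cells above can also be written as one loop over a station-to-category mapping. The sketch below uses tiny in-memory frames in place of the CSV files so that it runs on its own; `station_categories` and `load_station` are hypothetical names that do not appear in the original notebook.

```python
import pandas as pd

# Station-to-category mapping taken from the site selection above.
station_categories = {
    "Wanshouxigong": "Urban",
    "Changping": "Suburban",
    "Huairou": "Rural",
    "Aotizhongxin": "Industrial",
}

def load_station(station):
    """Hypothetical stand-in for pd.read_csv: returns a two-row toy frame."""
    return pd.DataFrame({"station": [station] * 2, "PM2.5": [10.0, 20.0]})

# Tag each frame with its category, then stack them with a fresh index.
frames = [
    load_station(station).assign(Category=category)
    for station, category in station_categories.items()
]
combined = pd.concat(frames, ignore_index=True)
print(combined["Category"].value_counts())
```

With the real files, the body of `load_station` would presumably become `pd.read_csv(f"PRSA_Data_{station}_20130301-20170228.csv")`, following the file names used above.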

In [None]:
# Save the merged dataset to a CSV file

air_quality_df.to_csv("air_quality.csv", index=False)

In [None]:
# Load the CSV file into a pandas DataFrame

df = pd.read_csv("air_quality.csv")
df.head()

In [None]:
# Combine Year, Month, Day, and Hour into a new 'Datetime' column

air_quality_df['Datetime'] = pd.to_datetime(air_quality_df[['year', 'month', 'day', 'hour']].astype(str).agg('-'.join, axis=1),format='%Y-%m-%d-%H')

In [None]:
# Check the first few rows to confirm the combined Datetime

print(air_quality_df[['year', 'month', 'day', 'hour', 'Datetime']].head())
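
The string-joining approach above works. As an alternative sketch, `pd.to_datetime` can also assemble timestamps directly from a DataFrame whose columns are named `year`, `month`, `day`, and `hour`, which avoids the string conversion. The `toy` frame is made up for illustration.

```python
import pandas as pd

# Toy rows standing in for the real year/month/day/hour columns.
toy = pd.DataFrame({
    "year": [2013, 2013],
    "month": [3, 3],
    "day": [1, 1],
    "hour": [0, 13],
})

# pd.to_datetime assembles one timestamp per row from the named columns.
toy["Datetime"] = pd.to_datetime(toy[["year", "month", "day", "hour"]])
print(toy)
```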

In [None]:
# Displaying basic information about the dataset

print(air_quality_df.info())

In [None]:
# Get summary statistics of numerical columns

print(air_quality_df.describe())

***Data pre-processing***

In [None]:
# Check for missing values in the dataset

missing_values = air_quality_df.isna().sum()
print("Missing values in each column:\n", missing_values)

In [None]:
## Use a missingno nullity matrix to visualize which variables have missing values and how the gaps are distributed.

msno.matrix(air_quality_df)
plt.show()

In [None]:
## Handle Missing Values
## Impute missing values in numerical columns with the mean

numerical_cols = air_quality_df.select_dtypes(include=['float64', 'int64']).columns
air_quality_df[numerical_cols] = air_quality_df[numerical_cols].fillna(air_quality_df[numerical_cols].mean())

In [None]:
## For categorical columns, fill missing values with the mode - most frequent value 

categorical_cols = air_quality_df.select_dtypes(include=['object']).columns
air_quality_df[categorical_cols] = air_quality_df[categorical_cols].fillna(air_quality_df[categorical_cols].mode().iloc[0])

In [None]:
# checking to see if missing values are handled

print("Missing values after imputation:\n", air_quality_df.isna().sum())
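
The cells above fill every gap with the overall column mean across all four sites. One possible refinement, shown here only as a sketch on toy data and not what this notebook does, is to fill each gap with the mean of its own `Category`, so that a rural gap is not filled with a value dominated by urban readings.

```python
import numpy as np
import pandas as pd

# Toy data: one missing PM2.5 value in each category.
toy = pd.DataFrame({
    "Category": ["Urban", "Urban", "Urban", "Rural", "Rural", "Rural"],
    "PM2.5": [100.0, np.nan, 80.0, 20.0, 30.0, np.nan],
})

# Global mean imputation, as in the notebook: every gap gets the same value.
global_filled = toy["PM2.5"].fillna(toy["PM2.5"].mean())

# Per-category mean imputation (assumption: an alternative, not the notebook's method).
group_means = toy.groupby("Category")["PM2.5"].transform("mean")
group_filled = toy["PM2.5"].fillna(group_means)

print(pd.DataFrame({"global": global_filled, "per_category": group_filled}))
```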

In [None]:
# Check for duplicates and count them

duplicate_rows = air_quality_df.duplicated()
print(duplicate_rows.sum())

In [None]:
# Display the duplicate rows

duplicates = air_quality_df[air_quality_df.duplicated()]
print(duplicates)

Note: The year, month, day, and hour columns were already combined into a single Datetime column in the steps above.
## Create Additional Time-Based Features

In [None]:
# Create a 'DayOfWeek' feature (0 = Monday) and a binary 'Weekend' feature (1 for weekend, 0 for weekday)

air_quality_df['DayOfWeek'] = air_quality_df['Datetime'].dt.dayofweek
air_quality_df['Weekend'] = (air_quality_df['DayOfWeek'] >= 5).astype(int)

In [None]:
# Check the first few rows

print(air_quality_df[['Datetime', 'DayOfWeek', 'Weekend']].head())

In [None]:
print(air_quality_df.head())

In [None]:
# Create 'TimeOfDay' feature based on Hour

def get_time_of_day(hour):
    if 6 <= hour < 12:
        return 'Morning'
    elif 12 <= hour < 18:
        return 'Afternoon'
    elif 18 <= hour < 21:
        return 'Evening'
    else:
        return 'Night'

air_quality_df['TimeOfDay'] = air_quality_df['hour'].apply(get_time_of_day)

# Check the first few rows
print(air_quality_df[['hour', 'TimeOfDay']].head())
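
The four buckets defined above are not equally wide, which is worth keeping in mind when record counts per `TimeOfDay` are compared later. A small check, reusing the same boundaries, counts how many of the 24 hours fall into each bucket.

```python
from collections import Counter

def get_time_of_day(hour):
    # Same boundaries as the notebook's function.
    if 6 <= hour < 12:
        return "Morning"
    elif 12 <= hour < 18:
        return "Afternoon"
    elif 18 <= hour < 21:
        return "Evening"
    return "Night"

# Count how many clock hours (0-23) map to each label.
hours_per_bucket = Counter(get_time_of_day(h) for h in range(24))
print(hours_per_bucket)
```

With hourly data and no gaps in the timestamps, "Night" should therefore hold about three times as many records as "Evening".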


In [None]:
# Save the new merged dataset to a CSV file 

air_quality_df.to_csv("updated_air_quality.csv", index=False)

In [None]:
## Drop the year, month, day, and hour columns, which are now redundant with Datetime

air_quality_df.drop(['year', 'month', 'day', 'hour'], axis=1, inplace=True)

In [None]:
# Verify if the columns are dropped

print(air_quality_df.head())

***Statistics/computation-based analysis and Visualisation***

In [None]:
## generating general statistical summaries of the numerical and categorical variables
# Display overall dataset statistics

print(air_quality_df.describe(include='all'))

## Univariate analysis: distributions of the pollutant variables and bar charts for the categorical data

In [None]:
# List of pollutant columns that you want to visualize
pollutants = ['PM2.5', 'PM10', 'SO2', 'NO2', 'CO', 'O3']

# Set up the matplotlib figure with an appropriate size
plt.figure(figsize=(16, 12))

# Loop through each pollutant and plot its histogram with a Kernel Density Estimate
for i, pollutant in enumerate(pollutants):
    plt.subplot(2, 3, i + 1)  # Create a 2x3 grid of subplots
    sns.histplot(air_quality_df[pollutant], kde=True, bins=30)
    plt.title(f'Distribution of {pollutant}')
    plt.xlabel(pollutant)
    plt.ylabel('Frequency')

plt.tight_layout()  # Adjust subplots for a clean layout
plt.show()

In [None]:
## Bar charts of the average level of each pollutant by time of day
# List of pollutant columns
pollutants = ['PM2.5', 'PM10', 'SO2', 'NO2', 'CO', 'O3']

# Set up plot
plt.figure(figsize=(18, 12))
sns.set(style="whitegrid")

# Create a barplot for each pollutant
for i, pollutant in enumerate(pollutants):
    plt.subplot(2, 3, i + 1)
    sns.barplot(x='TimeOfDay', y=pollutant, data=air_quality_df, estimator='mean', errorbar=None, palette='pastel', order=['Morning', 'Afternoon', 'Evening', 'Night'])
    plt.title(f'Average {pollutant} by Time of Day')
    plt.xlabel("Time of Day")
    plt.ylabel(f"{pollutant} Level")

plt.tight_layout()
plt.show()

In [None]:
## Bar charts for the distribution of records across different times of day

plt.figure(figsize=(8, 5))
sns.countplot(x='TimeOfDay', data=air_quality_df, order=['Morning', 'Afternoon', 'Evening', 'Night'])
plt.title('Record Count by Time of Day')
plt.xlabel('Time of Day')
plt.ylabel('Count')
plt.show()


In [None]:
# Bar chart for distribution of Average Pollutant Levels: Weekdays vs Weekends

# List of pollutant columns
pollutants = ['PM2.5', 'PM10', 'SO2', 'NO2', 'CO', 'O3']

# Group data by 'Weekend' and compute mean for each pollutant
avg_pollutants = air_quality_df.groupby('Weekend')[pollutants].mean().T

# Rename the index to be more readable
avg_pollutants.index.name = 'Pollutant'

# Plotting
avg_pollutants.plot(kind='bar', figsize=(12, 7), colormap='Set2')
plt.title('Average Pollutant Levels: Weekdays vs Weekends')
plt.xlabel('Pollutants')
plt.ylabel('Average Concentration')
plt.xticks(rotation=0)
plt.legend(title='Weekend', labels=['Weekday (0)', 'Weekend (1)'])
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()

In [None]:
## Bar chart of the number of records that fall on weekdays vs weekends

plt.figure(figsize=(6, 4))
sns.countplot(x='Weekend', data=air_quality_df, palette='viridis')
plt.title('Record Count by Weekend Indicator')
plt.xlabel('Weekend (0 = Weekday, 1 = Weekend)')
plt.ylabel('Count')
plt.show()


In [None]:
# Plot a bar chart showing the record count for each category
# Note: every category has 35064 records

plt.figure(figsize=(8, 5))
sns.countplot(x='Category', data=air_quality_df, palette='Set2')
plt.title('Record Count by Category')
plt.xlabel('Category')
plt.ylabel('Count')
plt.show()
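
The figure of 35064 records per category matches one record per hour over the period in the file names (20130301 to 20170228). A quick arithmetic check, assuming the data runs hourly from 00:00 on the first day to 23:00 on the last:

```python
from datetime import datetime

# Assumed first and last hourly timestamps, based on the dates in the file names.
start = datetime(2013, 3, 1, 0)
end = datetime(2017, 2, 28, 23)

# Number of hourly timestamps from start to end, both inclusive.
expected_hours = int((end - start).total_seconds() // 3600) + 1
print(expected_hours)
```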

## Bivariate Analysis for the relationship between two variables: Box Plot

In [None]:
# The boxplots show the distribution of each pollutant by area

# Define pollutants
pollutants = ['PM2.5', 'PM10', 'SO2', 'NO2', 'CO', 'O3']

# Set plot style
sns.set(style="whitegrid")

# Plot each pollutant's boxplot by Category
plt.figure(figsize=(18, 12))

for i, pollutant in enumerate(pollutants):
    plt.subplot(2, 3, i+1)
    sns.boxplot(x='Category', y=pollutant, data=air_quality_df, palette='Set2')
    plt.title(f'{pollutant} Distribution by Area')
    plt.xlabel('Area')
    plt.ylabel(pollutant)

plt.tight_layout()
plt.show()


In [None]:
# Step 1: Compute average value of each pollutant
pollutants = ['PM2.5', 'PM10', 'SO2', 'NO2', 'CO', 'O3']
mean_values = air_quality_df[pollutants].mean()

# Step 2: Identify pollutant with highest average
highest_avg_pollutant = mean_values.idxmax()
print(f"Highest average pollutant: {highest_avg_pollutant}")
print(f"Average of each pollutant:\n{mean_values}")

## Multivariate Analysis for interactions between several variables at once          

In [None]:
##  Correlation Heatmap of Pollutants
## used only the numeric pollutant columns

pollutants = ['PM2.5', 'PM10', 'SO2', 'NO2', 'CO', 'O3']
corr_matrix = air_quality_df[pollutants].corr()

plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Heatmap of Pollutants')
plt.show()


## Time Series Visualization using Line Plot for Daily Average

In [None]:
# Ensure Datetime is set as index
air_quality_df.set_index('Datetime', inplace=True)

# List of pollutants to analyze
pollutants = ['PM2.5', 'PM10', 'SO2', 'NO2', 'CO', 'O3']

# Set up the plot
plt.figure(figsize=(16, 10))

# Loop through pollutants and plot each one
for i, pollutant in enumerate(pollutants):
    plt.subplot(3, 2, i + 1)  # 3 rows, 2 columns layout
    daily_avg = air_quality_df[pollutant].resample('D').mean()
    plt.plot(daily_avg, label=f'Daily Avg {pollutant}', color='steelblue')
    plt.title(f'Daily Average {pollutant} Concentration Over Time')
    plt.xlabel('Date')
    plt.ylabel(f'{pollutant} Level')

plt.tight_layout()  # Apply once, after all subplots are drawn
plt.show()

# Reset index if needed for further analysis
air_quality_df.reset_index(inplace=True)

In [None]:
## Scatter plots of temperature against each pollutant to check for non-linear relationships

pollutants = ['PM2.5', 'PM10', 'SO2', 'NO2', 'CO', 'O3']

plt.figure(figsize=(18, 12))

for i, pollutant in enumerate(pollutants):
    plt.subplot(2, 3, i + 1)
    sns.scatterplot(x='TEMP', y=pollutant, data=air_quality_df, alpha=0.3)
    plt.title(f'Temperature vs {pollutant}')
    plt.xlabel('Temperature (°C)')
    plt.ylabel(f'{pollutant} (µg/m³)')
    plt.grid(True)

plt.tight_layout()
plt.show()

In [None]:
## Scatter plots of wind speed (WSPM) against each pollutant, with the correlation shown in each title

pollutants = ['PM2.5', 'PM10', 'SO2', 'NO2', 'CO', 'O3']

plt.figure(figsize=(18, 12))

for i, pollutant in enumerate(pollutants):
    plt.subplot(2, 3, i + 1)
    sns.scatterplot(x='WSPM', y=pollutant, data=air_quality_df, alpha=0.3, color='green')
    correlation = air_quality_df['WSPM'].corr(air_quality_df[pollutant])
    plt.title(f'WSPM vs {pollutant} (corr = {correlation:.2f})')
    plt.xlabel('Wind Speed (m/s)')
    plt.ylabel(f'{pollutant} (µg/m³)')
    plt.grid(True)

plt.tight_layout()
plt.show()


**Task 3: Building a machine-learning model**

In [None]:
# Predict PM2.5 from the other pollutants and the weather variables (air_quality_df has no nulls after imputation)
features = ['PM10', 'SO2', 'NO2', 'CO', 'O3', 'TEMP', 'PRES', 'DEWP', 'RAIN', 'WSPM']
X = air_quality_df[features]
y = air_quality_df['PM2.5']

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# Linear Regression
# ---------------------
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)
y_pred_lr = lr_model.predict(X_test)

mse_lr = mean_squared_error(y_test, y_pred_lr)
rmse_lr = np.sqrt(mse_lr)
r2_lr = r2_score(y_test, y_pred_lr)

print("--- Linear Regression ---")
print(f"R² Score: {r2_lr:.2f}")
print(f"RMSE: {rmse_lr:.2f}")
print(f"MSE: {mse_lr:.2f}")
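
For reference, the three metrics printed above follow from simple definitions. The sketch below computes them by hand on four made-up values (not results from this dataset): MSE is the mean squared error, RMSE is its square root, and R² is one minus the ratio of the residual sum of squares to the total sum of squares.

```python
import math

# Made-up actual and predicted values, for illustration only.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]

n = len(y_true)

# Mean squared error and its square root.
mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
rmse = math.sqrt(mse)

# R^2 = 1 - SS_res / SS_tot, where SS_tot is measured around the mean of y_true.
mean_true = sum(y_true) / n
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

print(f"MSE={mse:.3f} RMSE={rmse:.3f} R2={r2:.3f}")
```

Because RMSE is in the same units as the target, it is usually the easier of the two error figures to read against PM2.5 concentrations.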

In [None]:
# Random Forest
# ---------------------
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_test)

mse_rf = mean_squared_error(y_test, y_pred_rf)
rmse_rf = np.sqrt(mse_rf)
r2_rf = r2_score(y_test, y_pred_rf)

print("\n--- Random Forest ---")
print(f"R² Score: {r2_rf:.2f}")
print(f"RMSE: {rmse_rf:.2f}")
print(f"MSE: {mse_rf:.2f}")

In [None]:
# Create DataFrame of feature importances
feature_importance = pd.DataFrame({
    'Feature': features,
    'Importance': rf_model.feature_importances_
}).sort_values(by='Importance', ascending=False)

# Plot
plt.figure(figsize=(8, 5))
sns.barplot(data=feature_importance, x='Importance', y='Feature', palette='Blues_d')
plt.title('Feature Importance - Random Forest')
plt.xlabel('Importance')
plt.ylabel('Features')
plt.grid(True)
plt.show()


In [None]:
# --- Feature Importance - Linear Regression ---
coefficients = pd.DataFrame({
    'Feature': features,
    'Coefficient': lr_model.coef_,
    'Importance': np.abs(lr_model.coef_)
}).sort_values(by='Importance', ascending=False)

plt.figure(figsize=(8, 5))
sns.barplot(data=coefficients, x='Importance', y='Feature', palette='Greens_d')
plt.title('Feature Importance - Linear Regression')
plt.xlabel('Absolute Coefficient Value')
plt.ylabel('Feature')
plt.grid(True)
plt.tight_layout()
plt.show()
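
One caveat about the chart above: raw linear-regression coefficients depend on each feature's units, so their absolute values are not directly comparable between, say, pressure and rainfall. A common remedy is to standardize the features before fitting. The sketch below illustrates the effect on synthetic data, using NumPy least squares in place of scikit-learn's `LinearRegression`; the feature names and `fit_coefficients` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Two synthetic features on very different scales (hypothetical stand-ins).
small_scale = rng.normal(0.0, 1.0, n)      # spread of about 1
large_scale = rng.normal(0.0, 1000.0, n)   # spread of about 1000
y = 2.0 * small_scale + 0.01 * large_scale + rng.normal(0.0, 0.1, n)

def fit_coefficients(X, y):
    """Ordinary least squares with an intercept; returns the slope terms only."""
    design = np.column_stack([np.ones(len(X)), X])
    solution, *_ = np.linalg.lstsq(design, y, rcond=None)
    return solution[1:]

X = np.column_stack([small_scale, large_scale])
raw_coefs = fit_coefficients(X, y)

# Standardize each column to zero mean and unit variance before fitting.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
std_coefs = fit_coefficients(X_std, y)

print("raw:", np.round(raw_coefs, 3))
print("standardized:", np.round(std_coefs, 3))
```

On the raw features the small-scale variable has the larger coefficient, while after standardization the ranking flips, because the large-scale variable actually moves the target more per standard deviation.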

In [None]:
# --- Actual vs Predicted Plot for Linear Regression ---
plt.figure(figsize=(7, 5))
sns.scatterplot(x=y_test, y=y_pred_lr, alpha=0.6)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
plt.title("Actual vs Predicted - Linear Regression")
plt.xlabel("Actual PM2.5")
plt.ylabel("Predicted PM2.5")
plt.grid(True)
plt.tight_layout()
plt.show()


In [None]:
# --- Actual vs Predicted Plot for Random Forest ---
plt.figure(figsize=(7, 5))
sns.scatterplot(x=y_test, y=y_pred_rf, alpha=0.6)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
plt.title("Actual vs Predicted - Random Forest")
plt.xlabel("Actual PM2.5")
plt.ylabel("Predicted PM2.5")
plt.grid(True)
plt.tight_layout()
plt.show()

**Task 4: Application Development**

In [None]:
!pip install streamlit


In [None]:
air_quality_df.to_csv('final_air_quality.csv', index=False)

In [None]:
import pandas as pd

# Load the final cleaned dataset
final_df = pd.read_csv('final_air_quality.csv')

# Show the first few rows
print(final_df.head())
