# Data Analysis Project: Air Quality in Changping
Name: Icha Revi Amanda

Email: m010d4kx2634@bangkit.academy

Dicoding ID: icha_revi

# Define the Question



*   What were the PM2.5 levels like during 2015?
*   How is PM2.5 related to weather conditions?



# Import All Packages/Libraries Used

In [None]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy.stats import f_oneway
import matplotlib.pyplot as plt
import seaborn as sns
import os

sns.set_style("darkgrid")

In [None]:
pwd = os.getcwd()

# Data Wrangling

## Gathering Data

In [None]:
# os.path.join builds the path with the correct separator on any OS
changping = os.path.join(pwd, 'data', 'PRSA_Data_Changping_20130301-20170228.csv')
df_changping = pd.read_csv(changping)
df_changping

In [None]:
changping_modified = df_changping.copy()

### Assessing Data

Inspect the structure of the data (column names, data types, and non-null counts) to see what values are available.

In [None]:
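# info() prints its report and returns None, so its result is not assigned
df_changping.info()
columns_df = df_changping.columns
columns_df

In [None]:
# Summary statistics for every column, numeric and non-numeric
summary_all = df_changping.describe(include='all')
summary_all

In [None]:
# Count missing values in the data itself, not in the summary table
df_changping.isna().sum()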

## Cleaning Data

In [None]:
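# Percentage of missing values per column
missing_rate = df_changping.isnull().mean() * 100

# Boolean mask of missing PM2.5/PM10 readings, restricted to 2015
columns_plot = ['PM2.5', 'PM10']
missing_data = df_changping[columns_plot].isnull()
missing_data['year'] = df_changping['year']
missing_data_2015 = missing_data[missing_data['year'] == 2015].drop('year', axis=1)

plt.figure(figsize=(20, 8))
sns.heatmap(missing_data_2015, cmap='inferno', cbar=True)
plt.title('Pattern for Missing Data in 2015')
# Heatmap columns are the pollutants; rows are the hourly observations
plt.xlabel('Pollutant')
plt.ylabel('Observation (row index)')
plt.yticks(rotation=0)
plt.show()

# Number of missing readings per pollutant in 2015 (the year column is excluded)
missing_rate, missing_data_2015.sum()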

The missing values were assessed by their missing rate, because the amount of missing data is relatively small.


*   The missing rate is about 2.2% for PM2.5 and 1.65% for PM10, which indicates that the data is nearly complete.
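
As a minimal sketch of how a missing rate is computed, the example below uses a small made-up frame (the values are not from the Changping data). The mean of the boolean mask returned by `isnull()` is the fraction of missing cells per column.

```python
import numpy as np
import pandas as pd

# Toy readings with a few gaps (values are made up for illustration)
toy = pd.DataFrame({
    "PM2.5": [12.0, np.nan, 30.0, 41.0, np.nan],
    "PM10": [20.0, 25.0, np.nan, 50.0, 55.0],
})

# isnull() marks each missing cell as True; the column mean of those
# booleans is the fraction missing, and * 100 turns it into a percentage
toy_missing_rate = toy.isnull().mean() * 100
print(toy_missing_rate)
```

Here two of five PM2.5 values and one of five PM10 values are missing, so the rates come out at 40% and 20%.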



In [None]:
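# Forward fill: each gap takes the last valid reading before it
# (fillna(method='ffill') is deprecated in recent pandas versions)
data_inserted = df_changping.ffill()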
duplicates = data_inserted.duplicated().sum()
constant_columns = data_inserted.columns[data_inserted.nunique() <= 1]
data_types = data_inserted.dtypes
duplicates, constant_columns, data_types

The `duplicated` check shows that no rows are duplicated.
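
To illustrate what these checks report, here is a small sketch on a made-up frame that, unlike the real data, does contain a duplicate row and a constant column. The column names mirror the dataset; the values are hypothetical.

```python
import pandas as pd

# Made-up rows: the last row repeats the third one exactly
toy = pd.DataFrame({
    "No": [1, 2, 3, 3],
    "station": ["Changping"] * 4,
    "PM2.5": [10.0, 12.0, 15.0, 15.0],
})

# duplicated() flags rows identical to an earlier row
n_duplicates = toy.duplicated().sum()

# Columns with a single distinct value carry no information for the analysis
constant_cols = toy.columns[toy.nunique() <= 1]

# drop_duplicates() would remove the repeated row if one were found
deduplicated = toy.drop_duplicates()
print(n_duplicates, list(constant_cols), len(deduplicated))
```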



In [None]:
summary_statistics = data_inserted.describe()
summary_statistics

## Cleaned Data


*   Filled the missing values with forward fill
*   Checked for duplicated rows and found none
*   Used the Changping station data for the analysis
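
The forward fill used in the cleaning step can be sketched on a short made-up series. One caveat worth keeping in mind: a gap at the very start has no earlier value to copy, so it stays missing.

```python
import numpy as np
import pandas as pd

# Made-up readings with gaps in the middle and at the end
toy = pd.Series([10.0, np.nan, np.nan, 16.0, np.nan], name="PM2.5")

# Each NaN is replaced by the most recent valid value before it
filled = toy.ffill()
print(filled.tolist())

# A leading NaN has nothing before it, so forward fill leaves it in place
leading_gap = pd.Series([np.nan, 5.0]).ffill()
print(leading_gap.isna().sum())
```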





# Exploratory Data Analysis (EDA)

## Explore

In [None]:
df_changping.sample(10)

*   Conduct statistical summaries
*   Analyze seasonal patterns and relationships with weather conditions

# Visualization & Explanatory Analysis

## Conduct statistical summaries

In [None]:
data_inserted['date'] = pd.to_datetime(data_inserted[['year', 'month', 'day', 'hour']])
data_time_series = data_inserted[['date', 'PM2.5', 'O3']].set_index('date').resample('M').mean()

plt.figure(figsize=(15, 6))
plt.plot(data_time_series.index, data_time_series['PM2.5'], label='PM2.5', color='red')
plt.plot(data_time_series.index, data_time_series['O3'], label='O3', color='green')
plt.title('Monthly Average Pollutant Concentrations of PM2.5 and O3')
plt.xlabel('Date')
plt.ylabel('Concentrations')
plt.legend()
plt.show()

In [None]:
correlation_matrix = data_inserted[['PM2.5', 'PM10', 'SO2', 'NO2', 'CO', 'O3', 'TEMP', 'PRES', 'DEWP', 'RAIN', 'WSPM']].corr()
correlation_matrix

### Statistical Summary

*   The pollutant concentrations vary widely. For example, PM2.5 has a mean of 71.1 and a standard deviation of 72.4, so the spread is as large as the mean itself.
*   The monthly averages of PM2.5 and O3 help to reveal seasonal changes in air quality.
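
The monthly averaging can be sketched on a made-up daily series. This version groups by calendar month with `to_period("M")` instead of `resample`, which is an alternative chosen here for illustration, not the call used in the notebook.

```python
import pandas as pd

# Made-up daily values: 100 throughout January 2015, 50 throughout February 2015
idx = pd.date_range("2015-01-01", "2015-02-28", freq="D")
values = [100.0 if ts.month == 1 else 50.0 for ts in idx]
toy = pd.Series(values, index=idx, name="PM2.5")

# Group every timestamp by its calendar month and average within each month
monthly_mean = toy.groupby(toy.index.to_period("M")).mean()
print(monthly_mean)
```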



## Analyze seasonal patterns and relationships with weather conditions

In [None]:
# Make sure no gaps remain before grouping (forward fill)
data_inserted = data_inserted.ffill()


groups = data_inserted.groupby('year')['PM2.5']

anova_test_data = [group[1] for group in groups]

anova_test_result = f_oneway(*anova_test_data)


seasonal_pattern = data_inserted.groupby('month')['PM2.5'].mean()

weather_relations = df_changping[['TEMP', 'PRES', 'DEWP', 'RAIN', 'PM2.5']].corr()['PM2.5']


print("ANOVA Test Result:", anova_test_result)
print("Seasonal Month:", seasonal_pattern)
print("Correlations with Weather Conditions:", weather_relations)

plt.figure(figsize=(10, 6))
seasonal_pattern.plot(kind='bar', color='grey')
plt.title('Mean PM2.5 Levels by Month')
plt.xlabel('Month')
plt.ylabel('Mean PM2.5')
plt.xticks(ticks=range(0, 12), labels=[str(m) for m in range(1, 13)], rotation=0)
plt.show()

### Seasonal Patterns


*   The bar chart shows the mean PM2.5 concentration for each month.
*   The highest monthly average PM2.5 concentrations occur in January, March, and December, which fall in or close to the winter season.
*   The lowest monthly average PM2.5 concentrations occur in August and September, around the end of summer.
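
Since the question is about 2015 specifically, the monthly pattern could also be computed on that year alone. The sketch below does this on made-up rows that reuse the dataset's column names. The month-to-season mapping is an assumption (meteorological seasons for the northern hemisphere), not something defined in the dataset.

```python
import pandas as pd

# Made-up rows using the dataset's column names (year, month, PM2.5)
toy = pd.DataFrame({
    "year":  [2014, 2015, 2015, 2015, 2015, 2015, 2016],
    "month": [1, 1, 1, 8, 8, 12, 1],
    "PM2.5": [90.0, 120.0, 80.0, 40.0, 60.0, 110.0, 70.0],
})

# Keep only 2015 before averaging, so other years do not affect the result
only_2015 = toy[toy["year"] == 2015]
monthly_2015 = only_2015.groupby("month")["PM2.5"].mean()

# Assumed mapping from month number to meteorological season
season_of = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "autumn", 10: "autumn", 11: "autumn",
}
seasonal_2015 = only_2015.groupby(only_2015["month"].map(season_of))["PM2.5"].mean()

print(monthly_2015)
print(seasonal_2015)
```

On the real data, the same filter would be applied to `data_inserted` before the `groupby('month')` step.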

### Relations with Weather Conditions


*   Correlations with pollutants: the correlation matrix shows that the pollutants are related to one another as well as to the meteorological variables. The correlation between PM2.5 and O3, for example, suggests a relationship that could come from a shared source or from an interaction between these pollutants.
*   Correlations with weather conditions: the correlations between PM2.5 and the weather variables show that PM2.5

1.   has a negative correlation with TEMP, so PM2.5 tends to be higher at lower temperatures
2.   has a positive correlation with DEWP (dew point), so PM2.5 tends to be higher in more humid conditions
3.   has no strong correlation with PRES or RAIN
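
How to read these coefficients can be sketched with synthetic data. All values below are randomly generated under stated assumptions and say nothing about the real measurements: `TEMP` is built to drive `PM2.5` down, and `RAIN` is built to be independent of it. The last lines show that Pearson correlation only measures linear association, so a coefficient near zero does not by itself prove that two variables are unrelated.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic variables (not real measurements)
temp = rng.normal(10, 10, n)
rain = rng.exponential(0.5, n)                 # independent of PM2.5 by construction
pm25 = 100 - 2 * temp + rng.normal(0, 10, n)   # decreases linearly with temperature

toy = pd.DataFrame({"TEMP": temp, "RAIN": rain, "PM2.5": pm25})

# Pearson correlation of each weather variable with PM2.5
toy_corr = toy.corr()["PM2.5"].drop("PM2.5")
print(toy_corr)

# y = x**2 depends entirely on x, yet the linear correlation is zero
x = np.arange(-3, 4)
r_nonlinear = np.corrcoef(x, x ** 2)[0, 1]
print(r_nonlinear)
```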







In [None]:
# Export the cleaned data for the dashboard
data_inserted.to_csv(os.path.join(pwd, 'dashboard', 'changping_df.csv'), index=False)

# Conclusion



*   During 2015, PM2.5 concentrations were higher in the colder months (January, March, and December) and lower in August and September. This is probably caused by atmospheric conditions.
*   PM2.5 has a negative correlation with TEMP and a positive correlation with DEWP. Its correlation with PRES and RAIN is very small to none, which suggests little or no linear relationship between these variables and PM2.5 concentrations.