# Data Analysis Project: Air Quality Dataset
- **Name:** Muhammad Naufal Ilman
- **Email:** mnaufalilman@gmail.com
- **Dicoding ID:** nopal_ilman

## Defining the Business Questions

- How are weather conditions (temperature, air pressure, humidity) related to air pollution levels?
- Does air quality differ between the monitoring stations?

## Import All Packages/Libraries Used

In [1]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Data Wrangling

### Gathering Data

In [2]:
folder_path = "Data"  # Replace with the path to your CSV folder

# Collect all CSV files in the folder
all_files = [f for f in os.listdir(folder_path) if f.endswith('.csv')]

# Combine all CSV files into a single DataFrame
df_list = [pd.read_csv(os.path.join(folder_path, file)) for file in all_files]
df_AirQuality = pd.concat(df_list, ignore_index=True)

df_AirQuality.to_csv("Air_Quality_Total.csv", index=False)
df_AirQuality.head()

Unnamed: 0,No,year,month,day,hour,PM2.5,PM10,SO2,NO2,CO,O3,TEMP,PRES,DEWP,RAIN,wd,WSPM,station
0,1,2013,3,1,0,4.0,4.0,4.0,7.0,300.0,77.0,-0.7,1023.0,-18.8,0.0,NNW,4.4,Aotizhongxin
1,2,2013,3,1,1,8.0,8.0,4.0,7.0,300.0,77.0,-1.1,1023.2,-18.2,0.0,N,4.7,Aotizhongxin
2,3,2013,3,1,2,7.0,7.0,5.0,10.0,300.0,73.0,-1.1,1023.5,-18.2,0.0,NNW,5.6,Aotizhongxin
3,4,2013,3,1,3,6.0,6.0,11.0,11.0,300.0,72.0,-1.4,1024.5,-19.4,0.0,NW,3.1,Aotizhongxin
4,5,2013,3,1,4,3.0,3.0,12.0,12.0,300.0,72.0,-2.0,1025.2,-19.5,0.0,N,2.0,Aotizhongxin
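
`os.listdir` does not guarantee any particular file order, so the row order of the combined DataFrame can differ between machines. The following is a minimal sketch of a loader that sorts the file names first; the helper name `load_station_csvs` and the two tiny stand-in files are hypothetical and only there so the example runs without the real dataset.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_station_csvs(folder):
    """Read every CSV in `folder` (sorted by file name) into one DataFrame."""
    csv_paths = sorted(Path(folder).glob("*.csv"))
    if not csv_paths:
        raise FileNotFoundError(f"No CSV files found in {folder!r}")
    frames = [pd.read_csv(path) for path in csv_paths]
    return pd.concat(frames, ignore_index=True)


# Tiny stand-in files (made-up names and values) so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"No": [1, 2], "station": ["B", "B"]}).to_csv(Path(tmp, "b.csv"), index=False)
    pd.DataFrame({"No": [1], "station": ["A"]}).to_csv(Path(tmp, "a.csv"), index=False)
    demo = load_station_csvs(tmp)

print(demo)
```

With the real data, the call would presumably be `load_station_csvs("Data")`. Raising an error on an empty folder avoids the less obvious failure that `pd.concat` produces when given an empty list.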


### Assessing Data

**1. Check whether each column has the correct data type**

In [3]:
# Check data types
df_AirQuality.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 420768 entries, 0 to 420767
Data columns (total 18 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   No       420768 non-null  int64  
 1   year     420768 non-null  int64  
 2   month    420768 non-null  int64  
 3   day      420768 non-null  int64  
 4   hour     420768 non-null  int64  
 5   PM2.5    412029 non-null  float64
 6   PM10     414319 non-null  float64
 7   SO2      411747 non-null  float64
 8   NO2      408652 non-null  float64
 9   CO       400067 non-null  float64
 10  O3       407491 non-null  float64
 11  TEMP     420370 non-null  float64
 12  PRES     420375 non-null  float64
 13  DEWP     420365 non-null  float64
 14  RAIN     420378 non-null  float64
 15  wd       418946 non-null  object 
 16  WSPM     420450 non-null  float64
 17  station  420768 non-null  object 
dtypes: float64(11), int64(5), object(2)
memory usage: 57.8+ MB


**Insight:**
- The `month` and `wd` columns do not yet have the type I want, `category`: `month` is stored as `int64` and `wd` as `object`.

**2. Check for duplicate rows**

In [4]:
print("Number of duplicate rows: ", df_AirQuality.duplicated().sum())

Number of duplicate rows:  0


**Insight:**
- The data contains no duplicate rows.


### Cleaning Data

In [5]:
# Check missing values
total_missing = df_AirQuality.isnull().sum().sum()  # Total number of missing cells

percent_missing = (total_missing / df_AirQuality.size) * 100  # Share of missing cells among all cells

print(f"Total missing values: {total_missing}")
print(f"Percentage of missing values: {percent_missing:.2f}%")

Total missing values: 74027
Percentage of missing values: 0.98%


**Insight:**
- There are 74,027 missing values, which is 0.98% of all 7,573,824 cells (420,768 rows × 18 columns).
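
The percentage above is computed over cells, not rows, so it understates how many rows are affected once incomplete rows are dropped. Below is a minimal sketch that reports missing values per column and the share of rows containing at least one missing value. It uses a small synthetic frame `df` with made-up values in place of `df_AirQuality`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df_AirQuality (values are made up for illustration).
df = pd.DataFrame({
    "PM2.5": [4.0, np.nan, 7.0, 6.0],
    "CO": [300.0, np.nan, np.nan, 300.0],
    "station": ["A", "A", "B", "B"],
})

# Missing cells per column, and as a share of all cells.
missing_per_column = df.isna().sum()
cell_pct = missing_per_column.sum() / df.size * 100

# Rows that contain at least one missing value, and their share of all rows.
rows_with_missing = df.isna().any(axis=1).sum()
row_pct = rows_with_missing / len(df) * 100

print(missing_per_column)
print(f"Missing cells: {cell_pct:.2f}% | Rows with a missing value: {row_pct:.2f}%")
```

Comparing the two percentages before cleaning shows how much data a row-wise `dropna` would remove.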

In [6]:
# Drop rows that contain missing values
df_AirQuality.dropna(inplace=True)
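
`dropna` keeps the original row labels, which is why the later `info()` output reports `Index: 382168 entries, 0 to 420767` instead of a `RangeIndex`. In other words, 38,600 of the 420,768 rows (about 9.2%) are removed. The sketch below shows this behaviour on a small synthetic frame and how `reset_index(drop=True)` restores a contiguous index. Resetting the index is an optional suggestion, not something the notebook does.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df_AirQuality (values are made up for illustration).
df = pd.DataFrame({
    "PM2.5": [4.0, np.nan, 7.0, 6.0],
    "TEMP": [-0.7, -1.1, np.nan, -1.4],
})

cleaned = df.dropna()
labels_after_drop = cleaned.index.tolist()   # original labels survive the drop

# Optional: renumber the rows 0..n-1 so positional and label access agree.
cleaned = cleaned.reset_index(drop=True)
rows_dropped = len(df) - len(cleaned)

print(labels_after_drop, cleaned.index.tolist(), rows_dropped)
```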

In [7]:
# Replace month numbers (1-12) with English month names (January-December)
# Replace the 16 compass abbreviations in wd with full names (e.g. N: North, NNE: North - North East, NNW: North - North West)
df_AirQuality['month'] = df_AirQuality['month'].replace({1:'January', 2:'February', 3:'March', 4:'April', 5:'May', 6:'June', 7:'July', 8:'August', 9:'September', 10:'October', 11:'November', 12:'December'})
df_AirQuality['wd'] = df_AirQuality['wd'].replace({'N': 'North', 'NNE': 'North - North East', 'NE': 'North East', 'ENE': 'East - North East', 'E':'East', 'ESE': 'East - South East', 'SE': 'South East', 'SSE': 'South - South East', 'S': 'South', 'SSW': 'South - South West', 'SW': 'South West', 'WSW': 'West - South West', 'W': 'West', 'WNW': 'West - North West', 'NW': 'North West', 'NNW': 'North - North West'})
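
`Series.replace` leaves any value that is not in the dictionary unchanged, so a typo or unexpected code in `wd` would pass through silently. As a hedged sketch, `Series.map` can be used to detect such values, because it returns `NaN` for keys missing from the dictionary. The mapping below is only a subset of the full one, and `"XX"` is a made-up invalid code.

```python
import pandas as pd

# Subset of the full mapping, for illustration only.
wd_names = {"N": "North", "NNW": "North - North West", "NW": "North West"}
wd = pd.Series(["N", "NNW", "NW", "XX"])  # "XX" is a made-up invalid code

replaced = wd.replace(wd_names)  # unmapped codes pass through unchanged
mapped = wd.map(wd_names)        # unmapped codes become NaN

# Codes that the dictionary does not cover.
unmapped_codes = wd[mapped.isna()].unique().tolist()
print(unmapped_codes)
```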

In [8]:
df_AirQuality.head()

Unnamed: 0,No,year,month,day,hour,PM2.5,PM10,SO2,NO2,CO,O3,TEMP,PRES,DEWP,RAIN,wd,WSPM,station
0,1,2013,March,1,0,4.0,4.0,4.0,7.0,300.0,77.0,-0.7,1023.0,-18.8,0.0,North - North West,4.4,Aotizhongxin
1,2,2013,March,1,1,8.0,8.0,4.0,7.0,300.0,77.0,-1.1,1023.2,-18.2,0.0,North,4.7,Aotizhongxin
2,3,2013,March,1,2,7.0,7.0,5.0,10.0,300.0,73.0,-1.1,1023.5,-18.2,0.0,North - North West,5.6,Aotizhongxin
3,4,2013,March,1,3,6.0,6.0,11.0,11.0,300.0,72.0,-1.4,1024.5,-19.4,0.0,North West,3.1,Aotizhongxin
4,5,2013,March,1,4,3.0,3.0,12.0,12.0,300.0,72.0,-2.0,1025.2,-19.5,0.0,North,2.0,Aotizhongxin


In [10]:
# Convert month, wd, and station to the category dtype
df_AirQuality['month'] = df_AirQuality['month'].astype('category')
df_AirQuality['wd'] = df_AirQuality['wd'].astype('category')
df_AirQuality['station'] = df_AirQuality['station'].astype('category')
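
`astype('category')` on strings produces categories in alphabetical order, so sorting or grouping by `month` would start with April rather than January. A minimal sketch of an ordered categorical that keeps calendar order follows. The names `month_dtype` and `months` are illustrative and not part of the notebook.

```python
import pandas as pd

month_names = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]
# Ordered dtype: categories follow the calendar, not the alphabet.
month_dtype = pd.CategoricalDtype(categories=month_names, ordered=True)

months = pd.Series(["March", "January", "December", "March"])
plain = months.astype("category")    # categories sorted alphabetically
ordered = months.astype(month_dtype)  # categories in calendar order

print(list(plain.cat.categories))
print(ordered.sort_values().tolist())
```

In the notebook this would presumably become `df_AirQuality['month'].astype(month_dtype)`, which also makes later bar charts and group summaries appear in calendar order.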

In [11]:
# Numerical columns
numerical_cols = [col for col in df_AirQuality.columns if df_AirQuality[col].dtype in ['int64', 'float64']]

# Categorical columns
categorical_cols = [col for col in df_AirQuality.columns if df_AirQuality[col].dtype == 'category']

# Feature counts
total_features = len(categorical_cols) + len(numerical_cols)
num_numerical_features = len(numerical_cols)
num_categorical_features = len(categorical_cols)

print(f"Number of features: {total_features}")
print(f"Number of numerical features: {num_numerical_features}")
print(f"Number of categorical features: {num_categorical_features}\n")
print(f"List of categorical features:\n{categorical_cols}\n")
print(f"List of numerical features:\n{numerical_cols}")

Number of features: 18
Number of numerical features: 15
Number of categorical features: 3

List of categorical features:
['month', 'wd', 'station']

List of numerical features:
['No', 'year', 'day', 'hour', 'PM2.5', 'PM10', 'SO2', 'NO2', 'CO', 'O3', 'TEMP', 'PRES', 'DEWP', 'RAIN', 'WSPM']
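
The same split can be written with `DataFrame.select_dtypes`, which also catches numeric dtypes other than `int64` and `float64` (for example `int32`). This is an alternative sketch on a small synthetic frame, not the notebook's own code.

```python
import pandas as pd

# Synthetic stand-in with the same kinds of columns as df_AirQuality.
df = pd.DataFrame({
    "year": [2013, 2014],
    "TEMP": [-0.7, 13.5],
    "wd": pd.Categorical(["North", "East"]),
    "station": pd.Categorical(["Aotizhongxin", "Dongsi"]),
})

numerical_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(include="category").columns.tolist()

print(numerical_cols)
print(categorical_cols)
```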


In [12]:
# Show all unique values of each categorical column using set()
for col in categorical_cols:
    print(f"Column: {col}")
    print(set(df_AirQuality[col].unique()))
    print("-" * 30)

Column: month
{'July', 'January', 'June', 'March', 'April', 'August', 'February', 'September', 'November', 'May', 'December', 'October'}
------------------------------
Column: wd
{'North', 'North - North East', 'North - North West', 'East', 'South West', 'East - South East', 'East - North East', 'South - South East', 'South', 'West - South West', 'West - North West', 'North West', 'North East', 'West', 'South - South West', 'South East'}
------------------------------
Column: station
{'Dongsi', 'Wanliu', 'Shunyi', 'Changping', 'Nongzhanguan', 'Dingling', 'Aotizhongxin', 'Tiantan', 'Wanshouxigong', 'Gucheng', 'Huairou', 'Guanyuan'}
------------------------------


In [13]:
# Check data types
df_AirQuality.info()

<class 'pandas.core.frame.DataFrame'>
Index: 382168 entries, 0 to 420767
Data columns (total 18 columns):
 #   Column   Non-Null Count   Dtype   
---  ------   --------------   -----   
 0   No       382168 non-null  int64   
 1   year     382168 non-null  int64   
 2   month    382168 non-null  category
 3   day      382168 non-null  int64   
 4   hour     382168 non-null  int64   
 5   PM2.5    382168 non-null  float64 
 6   PM10     382168 non-null  float64 
 7   SO2      382168 non-null  float64 
 8   NO2      382168 non-null  float64 
 9   CO       382168 non-null  float64 
 10  O3       382168 non-null  float64 
 11  TEMP     382168 non-null  float64 
 12  PRES     382168 non-null  float64 
 13  DEWP     382168 non-null  float64 
 14  RAIN     382168 non-null  float64 
 15  wd       382168 non-null  category
 16  WSPM     382168 non-null  float64 
 17  station  382168 non-null  category
dtypes: category(3), float64(11), int64(4)
memory usage: 47.7 MB


## Exploratory Data Analysis (EDA)

In [14]:
# Overall descriptive statistics
df_AirQuality.describe(include="all")

Unnamed: 0,No,year,month,day,hour,PM2.5,PM10,SO2,NO2,CO,O3,TEMP,PRES,DEWP,RAIN,wd,WSPM,station
count,382168.0,382168.0,382168,382168.0,382168.0,382168.0,382168.0,382168.0,382168.0,382168.0,382168.0,382168.0,382168.0,382168.0,382168.0,382168,382168.0,382168
unique,,,12,,,,,,,,,,,,,16,,12
top,,,March,,,,,,,,,,,,,North East,,Nongzhanguan
freq,,,33536,,,,,,,,,,,,,39319,,33114
mean,17955.93107,2014.714905,,15.711308,11.575184,79.432383,104.573837,15.634814,50.570068,1229.940563,57.376676,13.518694,1010.813471,2.417195,0.06503,,1.738031,
std,10001.787087,1.160266,,8.803064,6.933552,80.154901,91.379446,21.306103,35.062086,1157.151476,56.709013,11.425355,10.452381,13.798402,0.823901,,1.241152,
min,1.0,2013.0,,1.0,0.0,2.0,2.0,0.2856,2.0,100.0,0.2142,-19.9,982.4,-36.0,0.0,,0.0,
25%,9610.0,2014.0,,8.0,6.0,20.0,36.0,2.0,23.0,500.0,10.4958,3.1,1002.4,-9.0,0.0,,0.9,
50%,18103.0,2015.0,,16.0,12.0,55.0,82.0,7.0,43.0,900.0,45.0,14.4,1010.4,3.0,0.0,,1.4,
75%,26515.0,2016.0,,23.0,18.0,111.0,145.0,19.0,71.0,1500.0,82.0,23.2,1019.0,15.1,0.0,,2.2,


In [15]:
# Category Data
df_AirQuality[categorical_cols].describe().T

Unnamed: 0,count,unique,top,freq
month,382168,12,March,33536
wd,382168,16,North East,39319
station,382168,12,Nongzhanguan,33114


**Insight:**
- After incomplete rows are dropped, Nongzhanguan has the most remaining records (33,114), which suggests that its measurements are more complete than those of the other stations.
- March is the most frequent month, with 33,536 records remaining in the cleaned dataset.

In [16]:
# Numeric Data
df_AirQuality[numerical_cols].describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
No,382168.0,17955.93107,10001.787087,1.0,9610.0,18103.0,26515.0,35064.0
year,382168.0,2014.714905,1.160266,2013.0,2014.0,2015.0,2016.0,2017.0
day,382168.0,15.711308,8.803064,1.0,8.0,16.0,23.0,31.0
hour,382168.0,11.575184,6.933552,0.0,6.0,12.0,18.0,23.0
PM2.5,382168.0,79.432383,80.154901,2.0,20.0,55.0,111.0,844.0
PM10,382168.0,104.573837,91.379446,2.0,36.0,82.0,145.0,999.0
SO2,382168.0,15.634814,21.306103,0.2856,2.0,7.0,19.0,500.0
NO2,382168.0,50.570068,35.062086,2.0,23.0,43.0,71.0,290.0
CO,382168.0,1229.940563,1157.151476,100.0,500.0,900.0,1500.0,10000.0
O3,382168.0,57.376676,56.709013,0.2142,10.4958,45.0,82.0,1071.0


**Insight:**
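- Pollution levels are fairly high and strongly right-skewed, especially for PM2.5 (mean 79.4, median 55, max 844), PM10 (mean 104.6, median 82, max 999), and CO (mean 1,229.9, median 900, max 10,000), which points to a number of extreme pollution episodes.
- The mean temperature is about 13.5°C, but the range is very wide (minimum -19.9°C, 75th percentile 23.2°C), reflecting the different seasons in the data.
- Air pressure is relatively stable at around 1010 hPa (standard deviation about 10.5 hPa).
- Rainfall is zero in most observations (the 75th percentile is 0.0), although there are some extreme rain events.
- Wind speed is generally low (median 1.4, 75th percentile 2.2) but can reach high values on some occasions.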


## Visualization & Explanatory Analysis

### Question 1:
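
One possible way to approach the first question is a rank correlation table between the weather columns (`TEMP`, `PRES`, `DEWP`) and the pollutant columns. The sketch below is only an illustration: the data is synthetic, the relationships in it are invented, and it says nothing about the real dataset. Spearman correlation is an assumption chosen here because the pollutant distributions are strongly skewed.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data; the relationships below are invented for the demo.
rng = np.random.default_rng(0)
n = 500
temp = rng.normal(13.5, 11.4, n)
demo = pd.DataFrame({
    "TEMP": temp,
    "PRES": 1010 - 0.7 * (temp - 13.5) + rng.normal(0, 5, n),
    "DEWP": temp - 10 + rng.normal(0, 5, n),
    "PM2.5": rng.gamma(1.0, 80.0, n),
    "O3": np.clip(57 + 3 * (temp - 13.5) + rng.normal(0, 30, n), 0, None),
})

weather_cols = ["TEMP", "PRES", "DEWP"]
pollutant_cols = ["PM2.5", "O3"]

# Rows: weather variables, columns: pollutants, values: Spearman correlation.
corr = (
    demo[weather_cols + pollutant_cols]
    .corr(method="spearman")
    .loc[weather_cols, pollutant_cols]
)
print(corr.round(2))
```

On the real data, `demo` would be replaced by `df_AirQuality`, and the resulting table could be passed to `sns.heatmap` for a visual summary.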

### Question 2:
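
One possible way to approach the second question is to summarise a pollutant per station and rank the stations. The sketch below uses three of the station names with made-up PM2.5 values, so the ranking it prints is not a result from the real data.

```python
import pandas as pd

# Made-up PM2.5 values for three of the twelve stations.
demo = pd.DataFrame({
    "station": pd.Categorical(
        ["Dongsi", "Dongsi", "Huairou", "Huairou", "Wanliu", "Wanliu"]
    ),
    "PM2.5": [90.0, 80.0, 50.0, 60.0, 70.0, 80.0],
})

# Mean, median, and row count per station, highest mean first.
station_summary = (
    demo.groupby("station", observed=True)["PM2.5"]
    .agg(["mean", "median", "count"])
    .sort_values("mean", ascending=False)
)
print(station_summary)
```

Reporting the median next to the mean is a deliberate choice here, because the pollutant values are right-skewed and a few extreme hours can dominate the mean.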

**Insight:**
- xxx
- xxx

## Further Analysis (Optional)

## Conclusion

- Conclusion for question 1
- Conclusion for question 2