<a href="https://colab.research.google.com/github/kimjaehwankimjaehwan/korea/blob/main/pollution_in_seoul.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [16]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Importing Libraries and Data

In [17]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objs as go
import plotly.offline as py
import folium
import warnings
warnings.filterwarnings('ignore')

There are multiple CSV files available. Let's open the measurement summary.

In [18]:
pol_data = pd.read_csv("/content/drive/MyDrive/한국분석/air_pollution_in_seoul/AirPollutionSeoul/Measurement_summary.csv")
pol_data.head()

Unnamed: 0,Measurement date,Station code,Address,Latitude,Longitude,SO2,NO2,O3,CO,PM10,PM2.5
0,2017-01-01 00:00,101,"19, Jong-ro 35ga-gil, Jongno-gu, Seoul, Republ...",37.572016,127.005008,0.004,0.059,0.002,1.2,73.0,57.0
1,2017-01-01 01:00,101,"19, Jong-ro 35ga-gil, Jongno-gu, Seoul, Republ...",37.572016,127.005008,0.004,0.058,0.002,1.2,71.0,59.0
2,2017-01-01 02:00,101,"19, Jong-ro 35ga-gil, Jongno-gu, Seoul, Republ...",37.572016,127.005008,0.004,0.056,0.002,1.2,70.0,59.0
3,2017-01-01 03:00,101,"19, Jong-ro 35ga-gil, Jongno-gu, Seoul, Republ...",37.572016,127.005008,0.004,0.056,0.002,1.2,70.0,58.0
4,2017-01-01 04:00,101,"19, Jong-ro 35ga-gil, Jongno-gu, Seoul, Republ...",37.572016,127.005008,0.003,0.051,0.002,1.2,69.0,61.0


In [19]:
pol_data.shape

(647511, 11)

The dataset has 647511 rows and 11 columns.

In [20]:
pol_data.isnull().sum()

Unnamed: 0,0
Measurement date,0
Station code,0
Address,0
Latitude,0
Longitude,0
SO2,0
NO2,0
O3,0
CO,0
PM10,0
PM2.5,0


There are no null values in the data. Let's look at the distribution of each pollutant.

In [21]:
pol_data[['SO2', 'NO2', 'O3', 'CO', 'PM10', 'PM2.5']].describe()

Unnamed: 0,SO2,NO2,O3,CO,PM10,PM2.5
count,647511.0,647511.0,647511.0,647511.0,647511.0,647511.0
mean,-0.001795,0.022519,0.017979,0.509197,43.708051,25.411995
std,0.078832,0.115153,0.099308,0.405319,71.137342,43.924595
min,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0
25%,0.003,0.016,0.008,0.3,22.0,11.0
50%,0.004,0.025,0.021,0.5,35.0,19.0
75%,0.005,0.038,0.034,0.6,53.0,31.0
max,3.736,38.445,33.6,71.7,3586.0,6256.0


Here we can see that the minimum value is -1 for every pollutant. That is not a valid reading, since a concentration cannot be negative; it is probably a measurement error or a placeholder for a missing reading. Let's count how often this occurs.

In [22]:
# Count the negative readings for each pollutant
for col in ['SO2', 'NO2', 'O3', 'CO', 'PM10', 'PM2.5']:
    print("We have", (pol_data[col] < 0).sum(), "negative values for", col)

We have 3976 negative values for SO2
We have 3834 negative values for NO2
We have 4059 negative values for O3
We have 4036 negative values for CO
We have 3962 negative values for PM10
We have 3973 negative values for PM2.5
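
The per-column counts above can also be computed in one vectorized step, and the same boolean mask shows whether the negative readings fall in the same rows. The sketch below uses a small made-up frame (`toy`, a hypothetical stand-in for `pol_data`), so the numbers are illustrative only.

```python
import pandas as pd

# Toy frame standing in for pol_data; the values are made up for illustration.
toy = pd.DataFrame({
    "SO2": [0.004, -1.0, 0.003, -1.0],
    "NO2": [0.059, -1.0, 0.051, 0.040],
    "PM10": [73.0, -1.0, -1.0, 69.0],
})

# A boolean mask summed column-wise counts the negative readings per pollutant.
negative_counts = (toy < 0).sum()
print(negative_counts)

# Rows where every pollutant is negative at once (for example, a station outage).
all_negative_rows = int((toy < 0).all(axis=1).sum())
print("Rows with all pollutants negative:", all_negative_rows)
```

On the real data, replacing `toy` with `pol_data[['SO2', 'NO2', 'O3', 'CO', 'PM10', 'PM2.5']]` should give the same counts as the printed lines above.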


In [23]:
data = [go.Scatter(x=pol_data['Measurement date'],
                   y=pol_data['SO2'], name='SO2'),
        go.Scatter(x=pol_data['Measurement date'],
                   y=pol_data['NO2'], name='NO2'),
        go.Scatter(x=pol_data['Measurement date'],
                   y=pol_data['CO'], name='CO'),
        go.Scatter(x=pol_data['Measurement date'],
                   y=pol_data['O3'], name='O3')]

##layout object
layout = go.Layout(title='Gases Levels',
                    yaxis={'title':'Level (ppm)'},
                    xaxis={'title':'Date'})

## Figure object

fig = go.Figure(data=data, layout=layout)

## Plotting
py.iplot(fig)

In [24]:
data = pol_data[pol_data['SO2']<0]

In [25]:
data[['SO2','NO2','O3','CO','PM10','PM2.5']].describe()

It looks like most of these values occur in the same rows, that is, at the same measurement times: in the rows where SO2 is negative, the counts match and the mean is close to -1 in most of the other columns as well. We can use imputation to replace the -1 values with the mean of each column's remaining readings.

In [29]:
from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values=-1, strategy='mean')
df_imputed = pd.DataFrame(imp.fit_transform(pol_data[["SO2","NO2","O3","CO","PM10","PM2.5"]]))
df_imputed.columns = pol_data[["SO2","NO2","O3","CO","PM10","PM2.5"]].columns
df_imputed.index = pol_data.index
remain_df = pol_data[pol_data.columns.difference(["SO2","NO2","O3","CO","PM10","PM2.5"])]
df = pd.concat([remain_df, df_imputed], axis=1)
df.head()

Unnamed: 0,Address,Latitude,Longitude,Measurement date,Station code,SO2,NO2,O3,CO,PM10,PM2.5
0,"19, Jong-ro 35ga-gil, Jongno-gu, Seoul, Republ...",37.572016,127.005008,2017-01-01 00:00,101,0.004,0.059,0.002,1.2,73.0,57.0
1,"19, Jong-ro 35ga-gil, Jongno-gu, Seoul, Republ...",37.572016,127.005008,2017-01-01 01:00,101,0.004,0.058,0.002,1.2,71.0,59.0
2,"19, Jong-ro 35ga-gil, Jongno-gu, Seoul, Republ...",37.572016,127.005008,2017-01-01 02:00,101,0.004,0.056,0.002,1.2,70.0,59.0
3,"19, Jong-ro 35ga-gil, Jongno-gu, Seoul, Republ...",37.572016,127.005008,2017-01-01 03:00,101,0.004,0.056,0.002,1.2,70.0,58.0
4,"19, Jong-ro 35ga-gil, Jongno-gu, Seoul, Republ...",37.572016,127.005008,2017-01-01 04:00,101,0.003,0.051,0.002,1.2,69.0,61.0
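
For comparison, the same mean imputation can be sketched with pandas alone: mark -1 as missing, then fill each column with the mean of its valid readings. The frame `toy` below is a hypothetical stand-in with made-up values. Note that this approach, like `SimpleImputer(missing_values=-1)`, only touches readings that are exactly -1; other negative values, if any exist, would be left unchanged.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up readings; -1 marks an invalid measurement.
toy = pd.DataFrame({
    "SO2": [0.004, -1.0, 0.006],
    "PM10": [70.0, 50.0, -1.0],
})

# Mark the -1 sentinel as missing, then fill each column with the mean
# of its remaining (valid) readings.
masked = toy.replace(-1, np.nan)
imputed = masked.fillna(masked.mean())
print(imputed)
```

Because the mean is sensitive to extreme readings such as the maxima seen in the `describe()` output, the median (`masked.median()`) may be a more robust fill value; that is a design choice rather than something the notebook prescribes.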


In [30]:
# TODO: Implement the time series with folium
# max() takes the per-column maximum for each station (not its latest reading),
# so that every marker color shows up on the maps
last_entry = df.groupby('Station code').max()
last_entry

Station code,Address,Latitude,Longitude,Measurement date,SO2,NO2,O3,CO,PM10,PM2.5
101,"19, Jong-ro 35ga-gil, Jongno-gu, Seoul, Republ...",37.572016,127.005008,2019-12-31 23:00,0.406,0.109,0.325,40.0,516.0,513.0
102,"15, Deoksugung-gil, Jung-gu, Seoul, Republic o...",37.564263,126.974676,2019-12-31 23:00,0.082,0.248,0.178,8.7,296.0,276.0
103,"136, Hannam-daero, Yongsan-gu, Seoul, Republic...",37.540033,127.00485,2019-12-31 23:00,0.016,0.106,0.164,1.9,330.0,6256.0
104,"215, Jinheung-ro, Eunpyeong-gu, Seoul, Republi...",37.609823,126.934848,2019-12-31 23:00,0.08,0.121,0.186,8.0,985.0,985.0
105,"32, Segeomjeong-ro 4-gil, Seodaemun-gu, Seoul,...",37.593742,126.949679,2019-12-31 23:00,0.027,0.092,0.175,10.9,985.0,985.0
106,"10, Poeun-ro 6-gil, Mapo-gu, Seoul, Republic o...",37.55558,126.905597,2019-12-31 23:00,0.332,0.097,0.368,31.3,985.0,985.0
107,"18, Ttukseom-ro 3-gil, Seongdong-gu, Seoul, Re...",37.541864,127.049659,2019-12-31 23:00,0.144,0.113,0.189,15.3,985.0,985.0
108,"571, Gwangnaru-ro, Gwangjin-gu, Seoul, Republi...",37.54718,127.092493,2019-12-31 23:00,0.342,0.135,0.336,37.5,1661.0,985.0
109,"43, Cheonho-daero 13-gil, Dongdaemun-gu, Seoul...",37.575743,127.028885,2019-12-31 23:00,0.312,0.121,0.277,30.9,329.0,323.0
110,"369, Yongmasan-ro, Jungnang-gu, Seoul, Republi...",37.584848,127.094023,2019-12-31 23:00,0.223,0.087,0.271,22.3,693.0,660.0
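
Since `groupby(...).max()` takes the maximum of every column independently, the values in one row of `last_entry` can come from different timestamps. If the actual latest reading per station is wanted instead, one possible approach is sketched below on a made-up frame (`toy` is hypothetical, and its values are chosen only to make the difference visible).

```python
import pandas as pd

# Toy measurements for two stations; values are made up for illustration.
toy = pd.DataFrame({
    "Station code": [101, 101, 102, 102],
    "Measurement date": ["2019-12-31 22:00", "2019-12-31 23:00",
                         "2019-12-31 23:00", "2019-12-31 22:00"],
    "SO2": [0.406, 0.003, 0.004, 0.082],
})
toy["Measurement date"] = pd.to_datetime(toy["Measurement date"])

# Per-column maximum: values in one row may come from different timestamps.
per_station_max = toy.groupby("Station code").max()

# Latest reading: sort by time and keep the final row of each station.
latest = (toy.sort_values("Measurement date")
             .groupby("Station code")
             .tail(1)
             .set_index("Station code")
             .sort_index())
print(per_station_max)
print(latest)
```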


Now we need to know which levels of the above pollutants count as good and which as bad.

In [33]:
safe_limit = pd.read_csv('/content/drive/MyDrive/한국분석/air_pollution_in_seoul/AirPollutionSeoul/Original Data/Measurement_item_info.csv')
safe_limit

Unnamed: 0,Item code,Item name,Unit of measurement,Good(Blue),Normal(Green),Bad(Yellow),Very bad(Red)
0,1,SO2,ppm,0.02,0.05,0.15,1.0
1,3,NO2,ppm,0.03,0.06,0.2,2.0
2,5,CO,ppm,2.0,9.0,15.0,50.0
3,6,O3,ppm,0.03,0.09,0.15,0.5
4,8,PM10,Mircrogram/m3,30.0,80.0,150.0,600.0
5,9,PM2.5,Mircrogram/m3,15.0,35.0,75.0,500.0


The `get_colors` function returns a color based on the pollution level of a given pollutant.

In [34]:
#https://stackoverflow.com/a/16729808
def get_colors(data, safe_limit, item):
    item_row = safe_limit.loc[safe_limit['Item name'] == item]
    if (data > item_row.iloc[0]['Very bad(Red)']):
        return 'red'
    elif (data > item_row.iloc[0]['Bad(Yellow)']):
        return 'yellow'
    elif (data > item_row.iloc[0]['Normal(Green)']):
        return 'green'
    else:
        return 'blue'
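
To make the threshold logic concrete, here is a small worked run of the same function on the SO2 row of the table above. The function body is repeated so the block runs on its own; the test values are chosen for illustration. Note that the function never reads the `Good(Blue)` column, so anything at or below `Normal(Green)` maps to blue. Whether that matches the intended meaning of the thresholds depends on how the table defines them, which the data file itself does not state.

```python
import pandas as pd

# SO2 row copied from the Measurement_item_info table above.
safe_limit = pd.DataFrame({
    "Item name": ["SO2"],
    "Good(Blue)": [0.02],
    "Normal(Green)": [0.05],
    "Bad(Yellow)": [0.15],
    "Very bad(Red)": [1.0],
})

def get_colors(data, safe_limit, item):
    # Same comparisons as in the notebook cell, repeated for a standalone run.
    item_row = safe_limit.loc[safe_limit['Item name'] == item]
    if data > item_row.iloc[0]['Very bad(Red)']:
        return 'red'
    elif data > item_row.iloc[0]['Bad(Yellow)']:
        return 'yellow'
    elif data > item_row.iloc[0]['Normal(Green)']:
        return 'green'
    else:
        return 'blue'

# Map a few illustrative SO2 values (ppm) to their colors.
results = {value: get_colors(value, safe_limit, "SO2")
           for value in (0.016, 0.082, 0.406, 1.5)}
print(results)
```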

We add one color column per pollutant to the `last_entry` dataframe, to be used for the map markers.

In [35]:
# Add one "<pollutant> Color" column per pollutant
for item in ['SO2', 'NO2', 'O3', 'CO', 'PM10', 'PM2.5']:
    last_entry[item + ' Color'] = last_entry[item].apply(get_colors, args=(safe_limit, item))
last_entry

Station code,Address,Latitude,Longitude,Measurement date,SO2,NO2,O3,CO,PM10,PM2.5,SO2 Color,NO2 Color,O3 Color,CO Color,PM10 Color,PM2.5 Color
101,"19, Jong-ro 35ga-gil, Jongno-gu, Seoul, Republ...",37.572016,127.005008,2019-12-31 23:00,0.406,0.109,0.325,40.0,516.0,513.0,yellow,green,yellow,yellow,yellow,red
102,"15, Deoksugung-gil, Jung-gu, Seoul, Republic o...",37.564263,126.974676,2019-12-31 23:00,0.082,0.248,0.178,8.7,296.0,276.0,green,yellow,yellow,blue,yellow,yellow
103,"136, Hannam-daero, Yongsan-gu, Seoul, Republic...",37.540033,127.00485,2019-12-31 23:00,0.016,0.106,0.164,1.9,330.0,6256.0,blue,green,yellow,blue,yellow,red
104,"215, Jinheung-ro, Eunpyeong-gu, Seoul, Republi...",37.609823,126.934848,2019-12-31 23:00,0.08,0.121,0.186,8.0,985.0,985.0,green,green,yellow,blue,red,red
105,"32, Segeomjeong-ro 4-gil, Seodaemun-gu, Seoul,...",37.593742,126.949679,2019-12-31 23:00,0.027,0.092,0.175,10.9,985.0,985.0,blue,green,yellow,green,red,red
106,"10, Poeun-ro 6-gil, Mapo-gu, Seoul, Republic o...",37.55558,126.905597,2019-12-31 23:00,0.332,0.097,0.368,31.3,985.0,985.0,yellow,green,yellow,yellow,red,red
107,"18, Ttukseom-ro 3-gil, Seongdong-gu, Seoul, Re...",37.541864,127.049659,2019-12-31 23:00,0.144,0.113,0.189,15.3,985.0,985.0,green,green,yellow,yellow,red,red
108,"571, Gwangnaru-ro, Gwangjin-gu, Seoul, Republi...",37.54718,127.092493,2019-12-31 23:00,0.342,0.135,0.336,37.5,1661.0,985.0,yellow,green,yellow,yellow,red,red
109,"43, Cheonho-daero 13-gil, Dongdaemun-gu, Seoul...",37.575743,127.028885,2019-12-31 23:00,0.312,0.121,0.277,30.9,329.0,323.0,yellow,green,yellow,yellow,yellow,yellow
110,"369, Yongmasan-ro, Jungnang-gu, Seoul, Republi...",37.584848,127.094023,2019-12-31 23:00,0.223,0.087,0.271,22.3,693.0,660.0,yellow,green,yellow,yellow,red,red


Let's plot a map showing the SO2 level at each station.

## Pollution of SO2

In [36]:
# This creates the map object
m = folium.Map(
    location=[37.541, 126.981], # center of where the map initializes
    #tiles='Stamen Toner', # the style used for the map (defaults to OSM)
    zoom_start=11, # the initial zoom level
    title = "Pollution level of SO2")
# Add one marker per station, colored by its SO2 level
for ind in last_entry.index:
    folium.Marker([last_entry.loc[ind, 'Latitude'], last_entry.loc[ind, 'Longitude']],
                  popup=str(ind),
                  icon=folium.Icon(color=last_entry.loc[ind, 'SO2 Color'], icon='info-sign')).add_to(m)

# Display the map
m

## Pollution of NO2

In [37]:
# This creates the map object
m = folium.Map(
    location=[37.541, 126.981], # center of where the map initializes
    #tiles='Stamen Toner', # the style used for the map (defaults to OSM)
    zoom_start=11, # the initial zoom level
    title = "Pollution level of NO2")
# Add one marker per station, colored by its NO2 level
for ind in last_entry.index:
    folium.Marker([last_entry.loc[ind, 'Latitude'], last_entry.loc[ind, 'Longitude']],
                  popup=str(ind),
                  icon=folium.Icon(color=last_entry.loc[ind, 'NO2 Color'], icon='info-sign')).add_to(m)

# Display the map
m

## Pollution of O3

In [38]:
# This creates the map object
m = folium.Map(
    location=[37.541, 126.981], # center of where the map initializes
    #tiles='Stamen Toner', # the style used for the map (defaults to OSM)
    zoom_start=11, # the initial zoom level
    title = "Pollution level of O3")
# Add one marker per station, colored by its O3 level
for ind in last_entry.index:
    folium.Marker([last_entry.loc[ind, 'Latitude'], last_entry.loc[ind, 'Longitude']],
                  popup=str(ind),
                  icon=folium.Icon(color=last_entry.loc[ind, 'O3 Color'], icon='info-sign')).add_to(m)

# Display the map
m

## Pollution of CO

In [39]:
# This creates the map object
m = folium.Map(
    location=[37.541, 126.981], # center of where the map initializes
    #tiles='Stamen Toner', # the style used for the map (defaults to OSM)
    zoom_start=11, # the initial zoom level
    title = "Pollution level of CO")
# Add one marker per station, colored by its CO level
for ind in last_entry.index:
    folium.Marker([last_entry.loc[ind, 'Latitude'], last_entry.loc[ind, 'Longitude']],
                  popup=str(ind),
                  icon=folium.Icon(color=last_entry.loc[ind, 'CO Color'], icon='info-sign')).add_to(m)

# Display the map
m

## Pollution of PM10

In [40]:
# This creates the map object
m = folium.Map(
    location=[37.541, 126.981], # center of where the map initializes
    #tiles='Stamen Toner', # the style used for the map (defaults to OSM)
    zoom_start=11, # the initial zoom level
    title = "Pollution level of PM10")
# Add one marker per station, colored by its PM10 level
for ind in last_entry.index:
    folium.Marker([last_entry.loc[ind, 'Latitude'], last_entry.loc[ind, 'Longitude']],
                  popup=str(ind),
                  icon=folium.Icon(color=last_entry.loc[ind, 'PM10 Color'], icon='info-sign')).add_to(m)

# Display the map
m

## Pollution of PM2.5

In [41]:
# This creates the map object
m = folium.Map(
    location=[37.541, 126.981], # center of where the map initializes
    #tiles='Stamen Toner', # the style used for the map (defaults to OSM)
    zoom_start=11, # the initial zoom level
    title = "Pollution level of PM2.5")
# Add one marker per station, colored by its PM2.5 level
for ind in last_entry.index:
    folium.Marker([last_entry.loc[ind, 'Latitude'], last_entry.loc[ind, 'Longitude']],
                  popup=str(ind),
                  icon=folium.Icon(color=last_entry.loc[ind, 'PM2.5 Color'], icon='info-sign')).add_to(m)

# Display the map
m
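
The six map cells above differ only in the color column they read, so they could be folded into one helper. The sketch below is one possible shape, not the notebook's own code: `marker_specs`, `build_map` and `FOLIUM_COLOR` are hypothetical names. It also assumes that `folium.Icon` accepts only a fixed palette that does not include `'yellow'`, and therefore maps yellow to `'orange'`; check this against the installed folium version.

```python
import pandas as pd

# Assumption: folium.Icon's supported palette has no 'yellow', so use 'orange'.
FOLIUM_COLOR = {"yellow": "orange"}

def marker_specs(stations, color_column):
    """Return one dict per station with its location, popup text and icon color."""
    specs = []
    for code, row in stations.iterrows():
        color = row[color_column]
        specs.append({
            "location": [row["Latitude"], row["Longitude"]],
            "popup": str(code),
            "color": FOLIUM_COLOR.get(color, color),
        })
    return specs

def build_map(stations, color_column):
    """Build a folium map with one colored marker per station."""
    import folium  # imported here so marker_specs works without folium installed
    m = folium.Map(location=[37.541, 126.981], zoom_start=11)
    for spec in marker_specs(stations, color_column):
        folium.Marker(spec["location"], popup=spec["popup"],
                      icon=folium.Icon(color=spec["color"], icon="info-sign")).add_to(m)
    return m

# Two stations taken from the last_entry table above.
stations = pd.DataFrame(
    {"Latitude": [37.572016, 37.564263],
     "Longitude": [127.005008, 126.974676],
     "SO2 Color": ["yellow", "green"]},
    index=pd.Index([101, 102], name="Station code"),
)
specs = marker_specs(stations, "SO2 Color")
print(specs)
```

In a notebook cell, `build_map(last_entry, 'NO2 Color')` as the last expression would then display the NO2 map, and likewise for the other pollutants.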

- TODO: Add a time slider to the folium maps