# Finding Pattern of Annual Weather in Malang Regency, East Java, Indonesia

This project is written in a Jupyter Notebook with the following objective:

**How does the annual weather pattern affect the blanket needed for sleeping?**

To answer this, we will use data from the database of the Meteorological, Climatological, and Geophysical Agency of Indonesia (BMKG). This is second-party data collected by the East Java Climatology Station. The data source is [here](https://dataonline.bmkg.go.id/home) (registration is required to access the data).

![bmkg logo](https://upload.wikimedia.org/wikipedia/commons/1/12/Logo_BMKG_%282010%29.png)

The data that we need is the average temperature, humidity, and wind speed over the last year. There is a catch, however: the data is only available in monthly intervals, so we need to download each month and merge the files manually into an annual dataset. I used Excel to combine the data and exported it to CSV. Let's process our data.
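The manual merge could also be done in pandas. The following is only a minimal sketch, assuming each monthly download has been saved as a semicolon-delimited CSV with identical columns. The helper name `merge_monthly` and the sample values are hypothetical, and in-memory strings stand in for the real files.

```python
import io
import pandas as pd

def merge_monthly(sources):
    """Concatenate monthly exports (paths or file-like objects) into one frame."""
    frames = [pd.read_csv(src, delimiter=";") for src in sources]
    # ignore_index=True renumbers the rows 0..n-1 across all months
    return pd.concat(frames, ignore_index=True)

# Two tiny in-memory stand-ins for monthly files (illustrative values only)
august = io.StringIO("Date;Tavg\n30-08-2021;23,4\n31-08-2021;23,3\n")
september = io.StringIO("Date;Tavg\n01-09-2021;24,2\n")

merged = merge_monthly([august, september])
print(merged)
```

With real downloads, the list passed to `merge_monthly` would hold the file paths in chronological order.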

## Prepare Data

To proceed, we need to import the necessary libraries.

In [6]:
import numpy as np
import pandas as pd

Let's import the CSV file and assign it to a dataframe variable.

In [36]:
weather_df = pd.read_csv("annual_weather_data.csv", delimiter=";")
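The temperature columns in this export use a comma as the decimal mark, which is why they load as text. As an alternative, `pd.read_csv` accepts a `decimal` argument that parses such values as floats at load time. The sketch below uses a small in-memory sample in the assumed layout of the file (illustrative rows, not the full dataset).

```python
import io
import pandas as pd

# Illustrative sample in the assumed layout of the export:
# ';' separates fields and ',' is the decimal mark.
sample = io.StringIO(
    "Date;Tx;Tavg;RH_avg;ff_x;ff_avg\n"
    "19-08-2021;28,5;23,4;72;7;2\n"
    "20-08-2021;27,4;23,3;82;4;2\n"
)

# decimal="," tells the parser to read "23,4" as the float 23.4
sample_df = pd.read_csv(sample, delimiter=";", decimal=",")
print(sample_df.dtypes)
```

The notebook keeps the plain `read_csv` call and converts the column in a later cell. Both routes should end with a `float64` temperature column.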

## Process Data

Before processing the data, check the dataframe first.

In [37]:
weather_df.head()

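| | Date | Tx | Tavg | RH_avg | ff_x | ff_avg |
|---|---|---|---|---|---|---|
| 0 | 19-08-2021 | 28,5 | 23,4 | 72.0 | 7.0 | 2.0 |
| 1 | 20-08-2021 | 27,4 | 23,3 | 82.0 | 4.0 | 2.0 |
| 2 | 21-08-2021 | 28,8 | 24,2 | 78.0 | 6.0 | 2.0 |
| 3 | 22-08-2021 | 28,6 | 24,3 | 76.0 | 6.0 | 3.0 |
| 4 | 23-08-2021 | 29 | 22,8 | 74.0 | 5.0 | 2.0 |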


Because we only need the averages of temperature, humidity, and wind speed, we can drop the other two fields.

In [38]:
weather_df.drop(columns=['Tx', 'ff_x'], inplace=True)

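From the dataframe, we can see that the columns are named with abbreviations. We can rename them to make them easier to understand. The cell below also drops the row at index 365, so that 365 daily records remain.

In [39]:
# Drop the row at index 365, leaving rows 0 to 364
weather_df.drop(365, inplace=True)

# Replace the abbreviated column names with descriptive ones
weather_df.rename(columns={'Tavg' : 'avg_temperature',
                           'RH_avg' : 'avg_humidity(%)',
                           'ff_avg' : 'avg_windspeed'}, inplace=True)
weather_df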

| | Date | avg_temperature | avg_humidity(%) | avg_windspeed |
|---|---|---|---|---|
| 0 | 19-08-2021 | 23,4 | 72.0 | 2.0 |
| 1 | 20-08-2021 | 23,3 | 82.0 | 2.0 |
| 2 | 21-08-2021 | 24,2 | 78.0 | 2.0 |
| 3 | 22-08-2021 | 24,3 | 76.0 | 3.0 |
| 4 | 23-08-2021 | 22,8 | 74.0 | 2.0 |
| ... | ... | ... | ... | ... |
| 360 | 14-08-2022 | 24,6 | 81.0 | 2.0 |
| 361 | 15-08-2022 | 24,3 | 77.0 | 2.0 |
| 362 | 16-08-2022 | 24 | 79.0 | 2.0 |
| 363 | 17-08-2022 | 23,6 | 76.0 | 2.0 |


Next, summarize it to get a high-level understanding of the data (data profiling).

In [40]:
weather_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 365 entries, 0 to 364
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Date             365 non-null    object 
 1   avg_temperature  365 non-null    object 
 2   avg_humidity(%)  364 non-null    float64
 3   avg_windspeed    365 non-null    float64
dtypes: float64(2), object(2)
memory usage: 11.5+ KB


The summary shows that the data has 365 rows and 4 columns. It also shows that "Date" and "avg_temperature" are stored as `object` rather than as a date and a number, so we need to convert them. In addition, "avg_humidity(%)" contains one null value, which we can fill.

In [41]:
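# Temperatures use a decimal comma (e.g. "23,4"); swap it for a point before casting
weather_df['avg_temperature'] = (weather_df['avg_temperature']
                                 .str.replace(',', '.', regex=False)
                                 .astype('float64'))
# Dates are day-first (DD-MM-YYYY), so parse them with an explicit format
weather_df['Date'] = pd.to_datetime(weather_df['Date'], format='%d-%m-%Y')
weather_df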

| | Date | avg_temperature | avg_humidity(%) | avg_windspeed |
|---|---|---|---|---|
| 0 | 2021-08-19 | 23.4 | 72.0 | 2.0 |
| 1 | 2021-08-20 | 23.3 | 82.0 | 2.0 |
| 2 | 2021-08-21 | 24.2 | 78.0 | 2.0 |
| 3 | 2021-08-22 | 24.3 | 76.0 | 3.0 |
| 4 | 2021-08-23 | 22.8 | 74.0 | 2.0 |
| ... | ... | ... | ... | ... |
| 360 | 2022-08-14 | 24.6 | 81.0 | 2.0 |
| 361 | 2022-08-15 | 24.3 | 77.0 | 2.0 |
| 362 | 2022-08-16 | 24.0 | 79.0 | 2.0 |
| 363 | 2022-08-17 | 23.6 | 76.0 | 2.0 |
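Two things can go wrong in a conversion like this: a text value that is not a number, and a day-first date read as month-first. The sketch below shows one way to guard against both, using small illustrative series. The `"n/a"` entry is a made-up bad value and does not come from the dataset.

```python
import pandas as pd

# Illustrative raw values; "n/a" is a hypothetical bad entry
raw_temp = pd.Series(["23,4", "24", "n/a"])
raw_date = pd.Series(["19-08-2021", "01-09-2021"])

# errors="coerce" turns unparseable text into NaN instead of raising
converted = pd.to_numeric(raw_temp.str.replace(",", ".", regex=False), errors="coerce")
bad_values = raw_temp[converted.isna()]

# An explicit format makes "01-09-2021" mean 1 September, not 9 January
dates = pd.to_datetime(raw_date, format="%d-%m-%Y")

print(bad_values.tolist())
print(dates.dt.month.tolist())
```

Listing `bad_values` before filling anything makes it easier to decide whether a value is a typo or genuinely missing.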


In [42]:
print(weather_df[weather_df.isna().any(axis=1)])

          Date  avg_temperature  avg_humidity(%)  avg_windspeed
132 2021-12-29             24.0              NaN            2.0


We can fill the null value with the average of the values in the previous and next rows. Those neighbours are 75.0 and 86.0, whose mean is 80.5; the cell below uses 80.

In [43]:
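# Fill only the humidity column so other columns are never touched by accident
weather_df['avg_humidity(%)'] = weather_df['avg_humidity(%)'].fillna(80)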
weather_df.iloc[130:135]

| | Date | avg_temperature | avg_humidity(%) | avg_windspeed |
|---|---|---|---|---|
| 130 | 2021-12-27 | 24.0 | 82.0 | 3.0 |
| 131 | 2021-12-28 | 25.3 | 75.0 | 3.0 |
| 132 | 2021-12-29 | 24.0 | 80.0 | 2.0 |
| 133 | 2021-12-30 | 24.3 | 86.0 | 2.0 |
| 134 | 2021-12-31 | 23.9 | 86.0 | 1.0 |
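The average of the neighbouring rows can also be computed rather than typed in by hand. As a minimal sketch, `Series.interpolate` with the linear method fills a single gap with the mean of the values on either side. The humidity values below are those of rows 130 to 134 before the fill; the variable names are only for illustration.

```python
import pandas as pd

# Humidity for rows 130..134, with the gap at row 132 still missing
humidity = pd.Series([82.0, 75.0, None, 86.0, 86.0],
                     index=[130, 131, 132, 133, 134])

# Linear interpolation treats rows as equally spaced, so a single gap
# becomes the mean of its two neighbours: (75.0 + 86.0) / 2
filled = humidity.interpolate(method="linear")
print(filled.loc[132])
```

This gives 80.5 for row 132, slightly different from the constant 80 used in the notebook.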


After cleaning the dataframe, we can check again to verify that our data is clean.

In [44]:
weather_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 365 entries, 0 to 364
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   Date             365 non-null    datetime64[ns]
 1   avg_temperature  365 non-null    float64       
 2   avg_humidity(%)  365 non-null    float64       
 3   avg_windspeed    365 non-null    float64       
dtypes: datetime64[ns](1), float64(3)
memory usage: 11.5 KB


In [46]:
weather_df.tail()

| | Date | avg_temperature | avg_humidity(%) | avg_windspeed |
|---|---|---|---|---|
| 360 | 2022-08-14 | 24.6 | 81.0 | 2.0 |
| 361 | 2022-08-15 | 24.3 | 77.0 | 2.0 |
| 362 | 2022-08-16 | 24.0 | 79.0 | 2.0 |
| 363 | 2022-08-17 | 23.6 | 76.0 | 2.0 |
| 364 | 2022-08-18 | 23.4 | 80.0 | 2.0 |


Great, the data is clean. We can continue to analyze the dataframe.
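The visual checks above can also be written as code, so they can be rerun whenever the data is refreshed. The helper below is a hypothetical sketch, not part of the original notebook. It reports missing values, a non-datetime date column, and dates that do not form one row per day. The demo frame reuses the last three rows shown above.

```python
import pandas as pd

def check_clean(df, date_col="Date"):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    if df.isna().any().any():
        problems.append("missing values present")
    if not pd.api.types.is_datetime64_any_dtype(df[date_col]):
        problems.append(f"{date_col} is not a datetime column")
    else:
        # A complete daily series has exactly one row per calendar day
        expected = pd.date_range(df[date_col].min(), df[date_col].max(), freq="D")
        if len(df) != len(expected) or df[date_col].duplicated().any():
            problems.append("dates are not one row per day")
    return problems

demo = pd.DataFrame({
    "Date": pd.to_datetime(["2022-08-16", "2022-08-17", "2022-08-18"]),
    "avg_temperature": [24.0, 23.6, 23.4],
})
print(check_clean(demo))
```

Calling `check_clean(weather_df)` on the cleaned dataframe should return an empty list if every day in the range is present exactly once.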

## Analyze Data