# Cagayan Data Mining Notebook

<div style="text-align:center">
    <img src='https://riverbasin.denr.gov.ph/img/CDO%20RB/CDORB1.jpg' width='500px'/>
</div>

This notebook processes the cleaned data produced by the data cleaning notebook. It performs the following steps:
1. Removes the minutes and seconds from each timestamp so that every reading is aligned to its hour.
2. Removes duplicated rows, keeping the first occurrence of each.
3. Amends missing water level data by interpolation.
4. Checks file integrity.
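
A minimal sketch of what step 3 could look like, assuming time-weighted linear interpolation on an hourly index. The method, the `limit_area` choice, and the sample values are illustrative assumptions, not the settings used by the authors.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly water level series with gaps (values are invented)
times = pd.date_range("2020-01-01 00:00", periods=6, freq="60min")
levels = pd.Series([0.90, np.nan, np.nan, 0.96, 0.98, np.nan],
                   index=times, name="Water Level")

# Linear interpolation weighted by the time index.
# limit_area="inside" fills only gaps surrounded by valid readings,
# so the trailing gap is left as NaN instead of being extrapolated.
filled = levels.interpolate(method="time", limit_area="inside")
print(filled)
```

Leaving leading and trailing gaps unfilled is a design choice: there is no later (or earlier) reading to interpolate toward, so any value placed there would be a guess.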

### Authors
- Gifrey John M. Sulay

### Checked by:
- Dr. Anabel A. Abuzo
- Engr. Augustine Ave Padagunan

In [40]:
import numpy as np
import pandas as pd
from pathlib import Path
import os
from IPython.display import display

__author__ = "Gifrey John M. Sulay"
__copyright__ = "Xavier University - Engineering Resource Center"

In [None]:
date_format = "%m,%d,%Y"

# Keep prompting until the user confirms the parsed dates
while True:
    try:
        input_start_date = input("Please input Start Date (Format=Month,Date,Year): ")
        start_date_datetime = pd.to_datetime(input_start_date, format=date_format)

        input_end_date = input("Please input End Date (Format=Month,Date,Year): ")
        end_date_datetime = pd.to_datetime(input_end_date, format=date_format)
    except ValueError as e:
        # Raised by pd.to_datetime when the text does not match date_format
        print(e)
        continue

    print(f"Date start is {start_date_datetime} and End Date is {end_date_datetime}")
    date_confirm = input("Type y if the input is correct or n to try again: ")
    if date_confirm == "y":
        start_date = start_date_datetime
        end_date = end_date_datetime
        break
    elif date_confirm != "n":
        print("Unrecognized answer, please try again.")

In [41]:
cag_path = Path("Edited_Data/Cagayan/cag.csv")
rg_path = Path("Edited_Data/Cagayan/rg1.csv")

cag = pd.read_csv(cag_path)[['Timestamp', 'Sensor Value']]
rg = pd.read_csv(rg_path)[['Timestamp', 'Sensor Value']]

display('Cagayan Water Table',cag.head(),cag.tail())
display('Rain Gauge Data',rg.head(),rg.tail())

'Cagayan Water Table'

Unnamed: 0,Timestamp,Sensor Value
0,12/2/2019 12:29:20 AM,0.910916
1,12/2/2019 1:29:23 AM,0.910916
2,12/2/2019 2:30:19 AM,0.886947
3,12/2/2019 3:29:57 AM,0.910916
4,12/2/2019 4:29:26 AM,0.934885


Unnamed: 0,Timestamp,Sensor Value
14937,1/1/2022 10:28:42 AM,0.982823
14938,1/1/2022 11:31:29 AM,0.982823
14939,1/1/2022 12:29:57 PM,0.982823
14940,1/1/2022 1:28:35 PM,0.982823
14941,1/1/2022 2:34:39 PM,0.982823


'Rain Gauge Data'

Unnamed: 0,Timestamp,Sensor Value
0,12/31/2020 00:00:00,0.0
1,12/31/2020 00:30:00,0.0
2,12/31/2020 01:00:00,0.0
3,12/31/2020 01:30:00,0.0
4,12/31/2020 02:00:00,0.01515


Unnamed: 0,Timestamp,Sensor Value
17515,12/31/2021 21:30:00,0.12
17516,12/31/2021 22:00:00,0.11
17517,12/31/2021 22:30:00,0.07935
17518,12/31/2021 23:00:00,0.03998
17519,12/31/2021 23:30:00,0.02873


### Checking For Duplicates

In [42]:
#Delete duplicates
cag_1 = cag.drop_duplicates(keep='first')
rg_1 = rg.drop_duplicates(keep='first')

cag_dropped=len(cag)-len(cag_1)
rg_dropped=len(rg)-len(rg_1)

print(f"No of cells dropped from water level table is {cag_dropped} cells")
print(f"No of cells dropped from rain gauge table is {rg_dropped} cells")

cag = cag_1
rg = rg_1

display(cag,rg)

No of cells dropped from water level table is 364 cells
No of cells dropped from rain gauge table is 0 cells


Unnamed: 0,Timestamp,Sensor Value
0,12/2/2019 12:29:20 AM,0.910916
1,12/2/2019 1:29:23 AM,0.910916
2,12/2/2019 2:30:19 AM,0.886947
3,12/2/2019 3:29:57 AM,0.910916
4,12/2/2019 4:29:26 AM,0.934885
...,...,...
14937,1/1/2022 10:28:42 AM,0.982823
14938,1/1/2022 11:31:29 AM,0.982823
14939,1/1/2022 12:29:57 PM,0.982823
14940,1/1/2022 1:28:35 PM,0.982823


Unnamed: 0,Timestamp,Sensor Value
0,12/31/2020 00:00:00,0.00000
1,12/31/2020 00:30:00,0.00000
2,12/31/2020 01:00:00,0.00000
3,12/31/2020 01:30:00,0.00000
4,12/31/2020 02:00:00,0.01515
...,...,...
17515,12/31/2021 21:30:00,0.12000
17516,12/31/2021 22:00:00,0.11000
17517,12/31/2021 22:30:00,0.07935
17518,12/31/2021 23:00:00,0.03998


### Time format

We have the timestamp data, but it is ***unreadable*** for indexing because the values are plain text strings. We therefore convert the Timestamp strings into datetime values that the pandas module used in this notebook can work with, and in the same step set the minutes and seconds to zero so that each reading is aligned to its hour.
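
As an illustration of this conversion, the sketch below parses the first three water level timestamps shown earlier and truncates them to the hour. The explicit `format` string and the use of `dt.floor` are assumptions chosen for the example; the notebook itself converts the values one at a time with `replace`.

```python
import pandas as pd

# First three water level timestamps from the table above, as plain strings
raw = pd.Series(["12/2/2019 12:29:20 AM",
                 "12/2/2019 1:29:23 AM",
                 "12/2/2019 2:30:19 AM"])

# Parse the strings into datetime values (12-hour clock with AM/PM)
parsed = pd.to_datetime(raw, format="%m/%d/%Y %I:%M:%S %p")

# Truncate to the start of the hour (minutes and seconds become zero)
hourly = parsed.dt.floor("60min")
print(hourly)
```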

In [43]:
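def column_apply(df, column, function):
    # Applies function to every value of one column, modifying df in place
    df[column] = df[column].apply(function)

def map_apply(df, function):
    # Applies function to every cell and returns the result as a new dataframe
    return df.applymap(function)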

In [44]:
import datetime

column_apply(cag,'Timestamp', lambda x: pd.to_datetime(x).replace(second = 0, minute=0))
column_apply(rg,'Timestamp', lambda x: pd.to_datetime(x).replace(second = 0, minute=0))

# column_apply(cag,'Timestamp', lambda x: x.replace(second = 0, minute=0))
# column_apply(rg,'Timestamp', lambda x: x.replace(second = 0, minute=0))

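### Transforming the data into an hourly format

The data as it stands now comes in sub-hourly readings (the rain gauge table, for example, is in ***30-minute increments***). We want the data in an hourly format. Since the timestamps were truncated to the hour in the previous step, we group the rows by timestamp and take the mean of all the data points that fall within the same hour.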
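
The averaging can be checked by hand on a small sample. The sketch below uses four of the last rain gauge rows shown earlier (22:00 to 23:30 on 12/31/2021); the variable names are illustrative only.

```python
import pandas as pd

# Four 30-minute rain gauge readings taken from the table above
sample = pd.DataFrame({
    "Timestamp": pd.to_datetime(
        ["12/31/2021 22:00:00", "12/31/2021 22:30:00",
         "12/31/2021 23:00:00", "12/31/2021 23:30:00"],
        format="%m/%d/%Y %H:%M:%S"),
    "Sensor Value": [0.11, 0.07935, 0.03998, 0.02873],
})

# Align every reading to its hour, then average the readings within each hour
sample["Timestamp"] = sample["Timestamp"].dt.floor("60min")
hourly = sample.groupby(by="Timestamp").mean().reset_index()
print(hourly)
```

The 22:00 row averages 0.11 and 0.07935, which gives the 0.094675 that appears in the hourly rain gauge table.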

In [45]:
rg = rg.groupby(by='Timestamp').mean().reset_index()
cag = cag.groupby(by='Timestamp').mean().reset_index()

Rename the Sensor Value header of each table to its corresponding value type (e.g. Water Level, Rain Gauge).

In [46]:
cag=cag.rename(columns={"Sensor Value":"Water Level"})
rg=rg.rename(columns={"Sensor Value":"Rain Gauge"})

In [47]:
display(cag,rg)

Unnamed: 0,Timestamp,Water Level
0,2019-12-02 00:00:00,0.910916
1,2019-12-02 01:00:00,0.910916
2,2019-12-02 02:00:00,0.886947
3,2019-12-02 03:00:00,0.910916
4,2019-12-02 04:00:00,0.934885
...,...,...
14560,2022-01-01 10:00:00,0.982823
14561,2022-01-01 11:00:00,0.982823
14562,2022-01-01 12:00:00,0.982823
14563,2022-01-01 13:00:00,0.982823


Unnamed: 0,Timestamp,Rain Gauge
0,2020-12-31 00:00:00,0.000000
1,2020-12-31 01:00:00,0.000000
2,2020-12-31 02:00:00,0.071375
3,2020-12-31 03:00:00,0.308200
4,2020-12-31 04:00:00,0.178800
...,...,...
8755,2021-12-31 19:00:00,0.001405
8756,2021-12-31 20:00:00,0.000000
8757,2021-12-31 21:00:00,0.060000
8758,2021-12-31 22:00:00,0.094675


### Checking for Missing Data

Create a new dataframe containing the complete hourly timestamp progression, then merge the water level dataframe and the rain gauge dataframe onto it, using the timestamp progression as the base. Any hour without a reading then shows up as a missing value.

The timestamp progression starts at the earliest date found in either the rain gauge or the water level table and ends at the latest date found in either.
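
A small sketch of why the left merge exposes gaps, using invented readings with one hour missing. The frame and column values here are hypothetical.

```python
import pandas as pd

# Hypothetical hourly readings with the 02:00 row missing
readings = pd.DataFrame({
    "Timestamp": pd.to_datetime(
        ["2020-01-01 00:00", "2020-01-01 01:00", "2020-01-01 03:00"]),
    "Water Level": [0.91, 0.92, 0.95],
})

# Complete hourly progression from the first to the last reading
base = pd.DataFrame({
    "Timestamp": pd.date_range(start=readings["Timestamp"].min(),
                               end=readings["Timestamp"].max(),
                               freq="60min"),
})

# A left merge keeps every hour of the base; hours without a reading become NaN
merged = pd.merge(base, readings, how="left", on="Timestamp")
print(merged)
```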

In [48]:
# First and last timestamps of each table (rows are sorted by the groupby above)
rg_start_date = rg.iloc[0,0]
rg_end_date = rg.iloc[-1,0]
cag_start_date = cag.iloc[0,0]
cag_end_date = cag.iloc[-1,0]

# The base timeline spans the earliest start to the latest end
start_date = min(rg_start_date, cag_start_date)
end_date = max(rg_end_date, cag_end_date)

base_time = pd.DataFrame({'Timestamp':pd.date_range(start=start_date, end=end_date, freq="H")})

In [49]:
cag = pd.merge(base_time, cag, how='left', on='Timestamp')
main_df = pd.merge(cag,rg,how='left', on='Timestamp')

In [50]:
main_df

Unnamed: 0,Timestamp,Water Level,Rain Gauge
0,2019-12-02 00:00:00,0.910916,
1,2019-12-02 01:00:00,0.910916,
2,2019-12-02 02:00:00,0.886947,
3,2019-12-02 03:00:00,0.910916,
4,2019-12-02 04:00:00,0.934885,
...,...,...,...
18274,2022-01-01 10:00:00,0.982823,
18275,2022-01-01 11:00:00,0.982823,
18276,2022-01-01 12:00:00,0.982823,
18277,2022-01-01 13:00:00,0.982823,


### Hour and Rain Gauge Difference Data
Create a new column that holds the hour at which each data point was recorded.

We also create a new column for the rain gauge difference, computed by subtracting the preceding value from the current value. All negative differences are replaced with 0.
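
The difference described above can also be written without an explicit loop. The following is a sketch on invented values, assuming that a difference involving a missing reading should count as 0.

```python
import pandas as pd

# Hypothetical rain gauge readings, including one missing value
rain = pd.Series([0.0, 0.2, 0.5, 0.3, float("nan"), 0.4])

# diff() subtracts the preceding value from the current one,
# clip(lower=0) replaces negative differences with 0,
# fillna(0) covers the first row and any difference that involves NaN
rg_diff = rain.diff().clip(lower=0).fillna(0)
print(rg_diff.tolist())
```

Treating differences next to a gap as 0 is an assumption made for this example; depending on the analysis it may be preferable to leave them as missing.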

In [51]:
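main_df['Hour'] = main_df['Timestamp'].apply(lambda x: pd.to_datetime(x).strftime('%H'))

# Rearranges the dataframe to place the Hour data as the first column.
# .copy() makes main_df an independent dataframe, which avoids the
# SettingWithCopyWarning when new columns are assigned later.
main_df = main_df[['Hour']+main_df.columns.values[:-1].tolist()].copy()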

In [58]:
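def rg_diff(lst):
    # The first row has no preceding value, so its difference is 0
    new_list = [0]
    values = list(lst)
    for previous, current in zip(values, values[1:]):
        diff = current - previous
        # Keep positive differences; negative or missing (NaN) differences become 0
        if diff > 0:
            new_list.append(diff)
        else:
            new_list.append(0)
    return new_list

main_df['RG_Diff'] = rg_diff(main_df.loc[:,'Rain Gauge'])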



### Check the missing Values of both the Water Level and Rain Gauge

Checks whether the water level data or the rain gauge data has any missing values.
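
Besides listing the affected rows, it can help to summarize how much is missing and how long each gap lasts. The sketch below does this on a small invented dataframe; the names `missing_counts` and `gap_lengths` are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical merged data with gaps in both columns
df = pd.DataFrame({
    "Water Level": [0.90, np.nan, np.nan, 0.95, np.nan],
    "Rain Gauge": [np.nan, 0.0, 0.1, np.nan, 0.2],
})

# Number of missing values per column
missing_counts = df.isna().sum()

# Length of each run of consecutive missing water level values:
# a new run id starts whenever the missing/not-missing state changes
is_missing = df["Water Level"].isna()
run_id = is_missing.ne(is_missing.shift(fill_value=False)).cumsum()
gap_lengths = is_missing[is_missing].groupby(run_id[is_missing]).size().tolist()

print(missing_counts)
print(gap_lengths)
```

Knowing the gap lengths is useful before interpolating, since a long gap is harder to fill credibly than a single missing hour.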

In [56]:
missing_water = main_df.loc[main_df['Water Level'].isna()]
missing_rg = main_df.loc[main_df['Rain Gauge'].isna()]

display(missing_water)
display(missing_rg)

Unnamed: 0,Hour,Timestamp,Water Level,Rain Gauge,RG_Diff
510,06,2019/12/23 06:00:00,,,0.0
1615,07,2020/02/07 07:00:00,,,0.0
1730,02,2020/02/12 02:00:00,,,0.0
2101,13,2020/02/27 13:00:00,,,0.0
4252,04,2020/05/27 04:00:00,,,0.0
...,...,...,...,...,...
18267,03,2022/01/01 03:00:00,,,0.0
18268,04,2022/01/01 04:00:00,,,0.0
18269,05,2022/01/01 05:00:00,,,0.0
18270,06,2022/01/01 06:00:00,,,0.0


Unnamed: 0,Hour,Timestamp,Water Level,Rain Gauge,RG_Diff
0,00,2019/12/02 00:00:00,0.910916,,0.0
1,01,2019/12/02 01:00:00,0.910916,,0.0
2,02,2019/12/02 02:00:00,0.886947,,0.0
3,03,2019/12/02 03:00:00,0.910916,,0.0
4,04,2019/12/02 04:00:00,0.934885,,0.0
...,...,...,...,...,...
18274,10,2022/01/01 10:00:00,0.982823,,0.0
18275,11,2022/01/01 11:00:00,0.982823,,0.0
18276,12,2022/01/01 12:00:00,0.982823,,0.0
18277,13,2022/01/01 13:00:00,0.982823,,0.0




In [55]:
time_frmt = '%Y/%m/%d %H:00:00'
main_df['Timestamp'] = main_df['Timestamp'].apply(lambda x: x.strftime(time_frmt))

