# Tagoloan Data Mining Notebook

<div>
    <img src='https://riverbasin.denr.gov.ph/img/Tagoloan%20RB/TagoloanRB3.jpg' width='500px'/>
</div>

This notebook processes the cleaned data produced by the data cleaning notebook. It performs the following steps:
1. Removes the minutes and seconds from each timestamp.
2. Cleans duplicated data by dropping rows that share a timestamp (the dropped rows are saved separately).
3. Amends missing water level data by interpolation.
4. Checks the integrity of the resulting dataframe.

### Authors
- Gifrey John M. Sulay

### Checked by:
- Dr. Anabel A. Abuzo
- Engr. Augustine Ave Padagunan

# CODE PROPER

### Import Modules and Metadata

In [1]:
import numpy as np
import pandas as pd
import re

__author__ = "Gifrey John M. Sulay"
__copyright__ = "Xavier University - Engineering Resource Center"


### Open Dataset

In [2]:
#open data
tag_path='Edited_Data/tag.csv'
rg0_path='Edited_Data/rg0.csv'
rg1_path='Edited_Data/rg1.csv'

tag=pd.read_csv(tag_path)
rg1=pd.read_csv(rg1_path)
rg0=pd.read_csv(rg0_path)

In [3]:
tag

Unnamed: 0.1,Unnamed: 0,Station Name,Sensor Type,Sensor Value,Sensor ID,Units,Timestamp,Sensor Label
0,0,FGEN Tagoloan Gauge Station,Level,1.535440,46481,Meters,3/25/2020 12:47:00 AM,Water Level
1,1,FGEN Tagoloan Gauge Station,Level,1.559315,46481,Meters,3/25/2020 1:47:00 AM,Water Level
2,2,FGEN Tagoloan Gauge Station,Level,1.463815,46481,Meters,3/25/2020 2:47:00 AM,Water Level
3,3,FGEN Tagoloan Gauge Station,Level,1.487690,46481,Meters,3/25/2020 3:47:00 AM,Water Level
4,4,FGEN Tagoloan Gauge Station,Level,1.487690,46481,Meters,3/25/2020 4:47:00 AM,Water Level
...,...,...,...,...,...,...,...,...
10068,1123,FGEN Tagoloan Gauge Station,Level,1.821940,46481,Meters,5/18/2021 8:47:00 AM,Water Level
10069,1124,FGEN Tagoloan Gauge Station,Level,1.989065,46481,Meters,5/18/2021 10:47:00 AM,Water Level
10070,1125,FGEN Tagoloan Gauge Station,Level,1.989065,46481,Meters,5/18/2021 11:47:00 AM,Water Level
10071,1126,FGEN Tagoloan Gauge Station,Level,2.012940,46481,Meters,5/18/2021 12:47:00 PM,Water Level


In [4]:
rg0

Unnamed: 0.1,Unnamed: 0,Station Name,Sensor Type,Sensor Value,Sensor ID,Units,Timestamp,Sensor Label
0,0,FGEN Tagoloan Gauge Station,Rain,319.6,46491,mm,3/25/2020 12:47:00 AM,Rain Gauge
1,1,FGEN Tagoloan Gauge Station,Rain,319.6,46491,mm,3/25/2020 1:47:00 AM,Rain Gauge
2,2,FGEN Tagoloan Gauge Station,Rain,319.6,46491,mm,3/25/2020 2:47:00 AM,Rain Gauge
3,3,FGEN Tagoloan Gauge Station,Rain,319.6,46491,mm,3/25/2020 3:47:00 AM,Rain Gauge
4,4,FGEN Tagoloan Gauge Station,Rain,319.6,46491,mm,3/25/2020 4:47:00 AM,Rain Gauge
...,...,...,...,...,...,...,...,...
10065,1123,FGEN Tagoloan Gauge Station,Rain,574.6,46491,mm,5/18/2021 8:47:00 AM,Rain Gauge
10066,1124,FGEN Tagoloan Gauge Station,Rain,576.2,46491,mm,5/18/2021 10:47:00 AM,Rain Gauge
10067,1125,FGEN Tagoloan Gauge Station,Rain,576.2,46491,mm,5/18/2021 11:47:00 AM,Rain Gauge
10068,1126,FGEN Tagoloan Gauge Station,Rain,576.2,46491,mm,5/18/2021 12:47:00 PM,Rain Gauge


In [5]:
rg1

Unnamed: 0.1,Unnamed: 0,Station Name,Sensor Type,Sensor Value,Sensor ID,Units,Timestamp,Sensor Label
0,0,FGEN Tagoloan Gauge Station,DailyRain,0.000,1546480,mm,3/25/2020 12:59:59 AM,Rain Gauge 1
1,1,FGEN Tagoloan Gauge Station,DailyRain,0.000,1546480,mm,3/25/2020 1:59:59 AM,Rain Gauge 1
2,2,FGEN Tagoloan Gauge Station,DailyRain,0.000,1546480,mm,3/25/2020 2:59:59 AM,Rain Gauge 1
3,3,FGEN Tagoloan Gauge Station,DailyRain,0.000,1546480,mm,3/25/2020 3:59:59 AM,Rain Gauge 1
4,4,FGEN Tagoloan Gauge Station,DailyRain,0.000,1546480,mm,3/25/2020 4:59:59 AM,Rain Gauge 1
...,...,...,...,...,...,...,...,...
9874,1137,FGEN Tagoloan Gauge Station,DailyRain,0.688,1546480,mm,5/18/2021 9:59:59 AM,Rain Gauge 1
9875,1138,FGEN Tagoloan Gauge Station,DailyRain,0.752,1546480,mm,5/18/2021 10:59:59 AM,Rain Gauge 1
9876,1139,FGEN Tagoloan Gauge Station,DailyRain,0.752,1546480,mm,5/18/2021 11:59:59 AM,Rain Gauge 1
9877,1140,FGEN Tagoloan Gauge Station,DailyRain,0.752,1546480,mm,5/18/2021 12:59:59 PM,Rain Gauge 1


### Cleaning
This section does the following:
1. Acquires *Timestamp* and *Sensor Value* data from the cleaned data set.
2. Removes the minutes and seconds from the *Timestamp* column.
3. Creates a dataframe of duplicated values.

In [6]:
#filter data
tag_f=tag[['Timestamp','Sensor Value']].rename(columns={'Sensor Value':'Water_Level'})
rg0_f=rg0[['Timestamp','Sensor Value']].rename(columns={'Sensor Value':'RG0_Level'})
rg1_f=rg1[['Timestamp','Sensor Value']].rename(columns={'Sensor Value':'RG1_Level'})

In [7]:
#create function to truncate each timestamp to the hour and rewrite it as year/month/day hour:00:00
def timestamp_conv(df):
    headers=list(df.columns.values)
    array=df.to_numpy()
    timestamps=array[:,0]
    values=array[:,1]
    
    timestamp_edited =[]
    for i in timestamps:
        timestamp=i.split()
        date=timestamp[0]
        time=timestamp[1]
        hms=time.split(':')
        
        if len(timestamp)==3:
            morn_aft=timestamp[2]
            if int(hms[0])==12 and morn_aft=='AM':
                hour=0
            elif int(hms[0])==12 and morn_aft=='PM':
                hour=12
            elif morn_aft=='PM':
                hour=int(hms[0])+12
            elif morn_aft=='AM':
                hour=int(hms[0])

        if len(timestamp)==2:
            hour=int(hms[0])
        #reorganized timestamp to y/m/d
        date_split=date.split('/')
        month=date_split[0]
        day=date_split[1]
        year=date_split[2]
        reorganized_date=f"{int(year)}/{int(month)}/{int(day)}"

        new_timestamp=f"{reorganized_date} {int(hour)}:00:00"
        timestamp_edited.append(new_timestamp)
    
    timestamp_edited_arr=np.array(timestamp_edited)
    
    #reshape both arrays into columns so they can be stacked side by side
    timestamp_edited_arr=timestamp_edited_arr[:,np.newaxis]
    values=values[:, np.newaxis]
    
    data=np.hstack([timestamp_edited_arr, values])
    
    new_dataframe=pd.DataFrame(data, columns=headers)
    return new_dataframe

In [8]:
#convert timestamps
tag_n=timestamp_conv(tag_f)
rg0_n=timestamp_conv(rg0_f)
rg1_n=timestamp_conv(rg1_f)
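
The hour truncation done by `timestamp_conv` can also be sketched with vectorized pandas calls. This is only an illustrative alternative, not the notebook's method: the `sample` dataframe is made up, and the result holds datetime values rather than the `year/month/day hour:00:00` strings that the rest of the notebook merges on.

```python
import pandas as pd

# Hypothetical sample in the same "M/D/YYYY h:mm:ss AM/PM" layout as the source data.
sample = pd.DataFrame({
    "Timestamp": [
        "3/25/2020 12:47:00 AM",
        "3/25/2020 1:47:00 PM",
        "5/18/2021 12:59:59 PM",
    ],
    "Water_Level": [1.535440, 1.559315, 2.012940],
})

# Parse the 12-hour timestamps, then truncate each one to the start of its hour.
parsed = pd.to_datetime(sample["Timestamp"], format="%m/%d/%Y %I:%M:%S %p")
sample["Timestamp"] = parsed.dt.floor("h")

print(sample)
```

Keeping datetime values would let later steps rely on pandas' own time handling, at the cost of changing the string keys used in the merges.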

In [9]:
#generate the complete hourly timestamp array used as the timestamp index for the dataframe
#returns clean_timestamp
start="2020-03-25"
end="2021-05-19"
x = pd.date_range(start=start, end=end, freq='H').tolist()
clean_timestamp=[]
for i in x:
    month=i.strftime("%m")
    day=i.strftime("%d")
    year=i.strftime("%Y")
    hour=i.strftime("%H")
    
    string=f"{int(year)}/{int(month)}/{int(day)} {int(hour)}:00:00"
    clean_timestamp.append(string)

In [10]:
#create df of duplicated values (every row whose timestamp occurs more than once)
tag_n_duplicates=tag_n[tag_n.duplicated(subset='Timestamp', keep=False)]
rg0_n_duplicates=rg0_n[rg0_n.duplicated(subset='Timestamp', keep=False)]
rg1_n_duplicates=rg1_n[rg1_n.duplicated(subset='Timestamp', keep=False)]

#drop redundant data points
#note: keep=False removes every row that shares a timestamp; keep='first' would retain the first occurrence
tag_n.drop_duplicates(subset='Timestamp', keep=False, inplace=True)
rg0_n.drop_duplicates(subset='Timestamp', keep=False, inplace=True)
rg1_n.drop_duplicates(subset='Timestamp', keep=False, inplace=True)
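
To make the effect of the `keep` argument concrete, here is a minimal sketch on a made-up dataframe (`toy` and its values are hypothetical). It shows how `keep=False` differs from `keep='first'` when two rows share a timestamp.

```python
import pandas as pd

# Hypothetical readings; the 1:00 timestamp appears twice.
toy = pd.DataFrame({
    "Timestamp": [
        "2020/3/25 0:00:00",
        "2020/3/25 1:00:00",
        "2020/3/25 1:00:00",
        "2020/3/25 2:00:00",
    ],
    "Water_Level": [1.50, 1.55, 1.60, 1.46],
})

# Every row whose timestamp occurs more than once.
duplicates = toy[toy.duplicated(subset="Timestamp", keep=False)]

# keep=False discards all rows of a duplicated timestamp.
dropped_all = toy.drop_duplicates(subset="Timestamp", keep=False)

# keep="first" retains the first row of each duplicated timestamp.
kept_first = toy.drop_duplicates(subset="Timestamp", keep="first")

print(len(duplicates), len(dropped_all), len(kept_first))
```

With `keep=False`, the affected hour disappears from the series and shows up later as a missing value after the merge. With `keep='first'`, it would keep one reading.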

### Merging
This section accomplishes the following:
1. Merges Water Level to Each Rain Gauge

In [11]:
#create dataframe for complete and correct timestamp
ts=pd.DataFrame({'Timestamp':clean_timestamp})

#create base dataframe(timestamp and water level)
ts_tag_n=pd.merge(ts,tag_n,how='left',on='Timestamp')
base=ts_tag_n
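
The left merge onto a complete timestamp list is what exposes gaps in the sensor data. A minimal sketch on made-up values (`full_index` and `readings` are hypothetical names) illustrates this:

```python
import pandas as pd

# Complete hourly timestamps, in the same string format the notebook uses.
full_index = pd.DataFrame({
    "Timestamp": ["2020/3/25 0:00:00", "2020/3/25 1:00:00", "2020/3/25 2:00:00"],
})

# Hypothetical readings with the 1:00 value missing.
readings = pd.DataFrame({
    "Timestamp": ["2020/3/25 0:00:00", "2020/3/25 2:00:00"],
    "Water_Level": [1.53, 1.46],
})

# how="left" keeps every timestamp of full_index; unmatched hours become NaN.
merged = pd.merge(full_index, readings, how="left", on="Timestamp")

# Row positions of the hours that have no reading.
missing_rows = merged[merged["Water_Level"].isnull()].index.tolist()

print(merged)
print(missing_rows)
```

Because the merge key is a string, the formatting of both timestamp columns has to match exactly for rows to line up.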

In [12]:
#merge base and rain gauge 0
base_rg0=pd.merge(base,rg0_n,how='left',on='Timestamp')
base_rg0

Unnamed: 0,Timestamp,Water_Level,RG0_Level
0,2020/3/25 0:00:00,1.53544,319.6
1,2020/3/25 1:00:00,1.55932,319.6
2,2020/3/25 2:00:00,1.46382,319.6
3,2020/3/25 3:00:00,1.48769,319.6
4,2020/3/25 4:00:00,1.48769,319.6
...,...,...,...
10076,2021/5/18 20:00:00,,
10077,2021/5/18 21:00:00,,
10078,2021/5/18 22:00:00,,
10079,2021/5/18 23:00:00,,


In [13]:
#merge base and rain gauge 1
base_rg1=pd.merge(base,rg1_n,how='left',on='Timestamp')
base_rg1

Unnamed: 0,Timestamp,Water_Level,RG1_Level
0,2020/3/25 0:00:00,1.53544,0
1,2020/3/25 1:00:00,1.55932,0
2,2020/3/25 2:00:00,1.46382,0
3,2020/3/25 3:00:00,1.48769,0
4,2020/3/25 4:00:00,1.48769,0
...,...,...,...
10076,2021/5/18 20:00:00,,
10077,2021/5/18 21:00:00,,
10078,2021/5/18 22:00:00,,
10079,2021/5/18 23:00:00,,


### First Save
This section does the following:
1. Saves the merged data of each rain gauge to **Cleaned_Data.xlsx**
2. Saves the row indices of missing values to **Missing Values.xlsx**
3. Saves the dataframes of duplicated values to **Duplicated Values.xlsx**

In [14]:
#save to excel sheet
with pd.ExcelWriter('Cleaned_Data.xlsx') as writer:
    base_rg0.to_excel(writer, sheet_name='Water Level - Rain Gauge 0')
    base_rg1.to_excel(writer, sheet_name='Water Level - Rain Gauge 1')

In [15]:
#create spreadsheet of missing values
water_level_missing=base_rg0[base_rg0['Water_Level'].isnull()].index.tolist()
rg0_missing=base_rg0[base_rg0['RG0_Level'].isnull()].index.tolist()
rg1_missing=base_rg1[base_rg1['RG1_Level'].isnull()].index.tolist()

water_level_missing_df=pd.DataFrame({'Missing Water Level':water_level_missing})
rg0_missing_df=pd.DataFrame({'Missing RG0':rg0_missing})
rg1_missing_df=pd.DataFrame({'Missing RG1':rg1_missing})

with pd.ExcelWriter('Missing Values.xlsx') as writer:
    water_level_missing_df.to_excel(writer, sheet_name='Water Level')
    rg0_missing_df.to_excel(writer, sheet_name='Rain Gauge 0')
    rg1_missing_df.to_excel(writer, sheet_name='Rain Gauge 1')
    
with pd.ExcelWriter('Duplicated Values.xlsx') as writer:
    tag_n_duplicates.set_index('Timestamp').to_excel(writer, sheet_name='Water Level')
    rg0_n_duplicates.set_index('Timestamp').to_excel(writer, sheet_name='Rain Gauge 0')
    rg1_n_duplicates.set_index('Timestamp').to_excel(writer, sheet_name='Rain Gauge 1')

### Amendment of Missing Values
This section does the following:
1. Water Level
    - Interpolates missing values using pandas' Series.interpolate with its default linear method.
2. Rain Gauge 0
    - Replaces missing (*NaN*) values in Rain Gauge 0 using Series.interpolate with method='pad', which carries the last valid reading forward.
3. Saves the amended data in **Corrected_Data.xlsx**.

**Note:**
It is unclear whether the data given for Rain Gauge 1 are correct. *Thus, Rain Gauge 1 will not be used in the succeeding analysis.*
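
The two gap-filling strategies listed above behave quite differently. The following is a minimal sketch on made-up series (`level` and `rain` are hypothetical). It uses `Series.ffill()` for the padding step, which gives the same result as `interpolate(method='pad')`; recent pandas releases deprecate the latter spelling.

```python
import numpy as np
import pandas as pd

# Hypothetical water level series with a two-hour gap.
level = pd.Series([1.0, np.nan, np.nan, 4.0])

# Linear interpolation (the default) draws a straight line across the gap.
linear = level.interpolate()

# Hypothetical rain gauge series with the same gap.
rain = pd.Series([319.6, np.nan, np.nan, 320.2])

# Padding (forward fill) repeats the last valid reading until a new one arrives.
padded = rain.ffill()

print(linear.tolist())
print(padded.tolist())
```

Linear interpolation suits a quantity that varies continuously, such as water level. Padding a rain gauge reading assumes that no rain was recorded during the gap.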

In [16]:
#interpolate water level
new_water_level=base['Water_Level'].astype('float64').interpolate()

#replace nan values in RG0 by carrying the last valid reading forward (pandas interpolate with method='pad')
rg0_copy=base_rg0['RG0_Level'].astype('float64').interpolate(method='pad')

#create hour-to-hour difference on Rain Gauge 0
rg0_diff=[0]
count=1
for i in rg0_copy[1:]:
    diff=i-rg0_copy[count-1]
    rg0_diff.append(diff)
    count+=1
#where the reading dropped (negative difference), use the current reading as the difference
count=1
for i in rg0_diff[1:]:
    if i<0:
        val=rg0_copy[count]
        rg0_diff[count]=val
    count += 1
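
The differencing loops above can also be expressed with vectorized pandas operations. This is a sketch on made-up readings (`cumulative` is hypothetical). Reading a drop in the gauge value as a reset is an assumption; the code simply applies the same rule as the loop, replacing a negative difference with the current reading.

```python
import pandas as pd

# Hypothetical gauge readings that drop between the third and fourth hour.
cumulative = pd.Series([319.6, 319.6, 320.4, 0.2, 0.8])

# Hour-to-hour difference; the first hour has no predecessor, so it is set to 0.
increments = cumulative.diff().fillna(0.0)

# Where the difference is negative, take the current reading as the increment.
increments = increments.where(increments >= 0, cumulative)

print([round(v, 6) for v in increments.tolist()])
```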


In [17]:
#create dataframe with corrected data from a copy of the base dataframe (so base itself is not modified)
corrected_df=base.copy()
corrected_df['Corrected_Water_Level']= new_water_level
corrected_df=corrected_df.drop(columns=['Water_Level'])
corrected_df['Corrected_RG0_Level']=rg0_copy
corrected_df['RG0_Diff']=rg0_diff
corrected_df=corrected_df.set_index('Timestamp')

In [18]:
def hours_only(df):
    timestamp_arr=df.index.values.tolist()
    hour_lst=[]
    for i in timestamp_arr:
        time=i.split()[1]
        hour=time.split(':')[0]
        hour_lst.append(hour)
    df['Hour']=hour_lst
    return df

corrected_df=hours_only(corrected_df)

In [19]:
headers=corrected_df.columns.values.tolist()
headers.remove('Hour')
headers.insert(0,'Hour')
corrected_df=corrected_df[headers]
corrected_df

Unnamed: 0_level_0,Hour,Corrected_Water_Level,Corrected_RG0_Level,RG0_Diff
Timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2020/3/25 0:00:00,0,1.535440,319.6,0.0
2020/3/25 1:00:00,1,1.559315,319.6,0.0
2020/3/25 2:00:00,2,1.463815,319.6,0.0
2020/3/25 3:00:00,3,1.487690,319.6,0.0
2020/3/25 4:00:00,4,1.487690,319.6,0.0
...,...,...,...,...
2021/5/18 20:00:00,20,2.251690,576.2,0.0
2021/5/18 21:00:00,21,2.251690,576.2,0.0
2021/5/18 22:00:00,22,2.251690,576.2,0.0
2021/5/18 23:00:00,23,2.251690,576.2,0.0


In [20]:
with pd.ExcelWriter('Corrected_Data.xlsx') as writer:
    corrected_df.to_excel(writer,sheet_name='Corrected_Water_and_RG0_Level')

### Dataframe Integrity Check
Checks if dataframe has missing values

In [21]:
def check_integrity(df):
    return df[df.isnull().any(axis=1)]
x=check_integrity(corrected_df)
type(x)
x

Unnamed: 0_level_0,Hour,Corrected_Water_Level,Corrected_RG0_Level,RG0_Diff
Timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
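
An empty result, as above, means no row contains a missing value. As a small illustration on made-up dataframes (`complete` and `gappy` are hypothetical), the same check returns rows only when a gap exists:

```python
import numpy as np
import pandas as pd

def check_integrity(df):
    # Return only the rows that contain at least one missing value.
    return df[df.isnull().any(axis=1)]

complete = pd.DataFrame({"Hour": [0, 1], "Level": [1.5, 1.6]})
gappy = pd.DataFrame({"Hour": [0, 1], "Level": [1.5, np.nan]})

# True when the check finds nothing to report.
complete_ok = check_integrity(complete).empty

# Row labels of the rows that still contain a gap.
gappy_rows = check_integrity(gappy).index.tolist()

print(complete_ok, gappy_rows)
```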


### Amendment 1: Erroneous Water Level
Erroneous water level data were found. A data point is categorized as erroneous if the absolute difference between it and the previous data point is 0.12 or more.

Objective:
* Flag the data points whose difference from the previous point is 0.12 or more
* Remove erroneous data
* Interpolate removed values
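
The three objectives above can be sketched in vectorized form as follows. The `level` series is made up; only the 0.12 threshold comes from the text.

```python
import pandas as pd

diff_limit = 0.12  # threshold stated in the text above

# Hypothetical water levels with a single spike at the third point.
level = pd.Series([1.50, 1.52, 1.90, 1.55, 1.56])

# Absolute difference from the previous point; the first point has none.
jump = level.diff().abs().fillna(0.0)

# 1 marks a point whose jump reaches the limit, 0 otherwise.
err_check = (jump >= diff_limit).astype(int)

# Blank out the flagged points, then fill them by linear interpolation.
cleaned = level.mask(err_check == 1).interpolate()

print(err_check.tolist())
print(cleaned.round(4).tolist())
```

In this toy series the point after the spike is flagged as well, because the return to the normal level is also a jump of more than 0.12. The same would apply to any rule based only on the difference from the previous point.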


In [22]:
water_level = corrected_df['Corrected_Water_Level']
water_level

Timestamp
2020/3/25 0:00:00     1.535440
2020/3/25 1:00:00     1.559315
2020/3/25 2:00:00     1.463815
2020/3/25 3:00:00     1.487690
2020/3/25 4:00:00     1.487690
                        ...   
2021/5/18 20:00:00    2.251690
2021/5/18 21:00:00    2.251690
2021/5/18 22:00:00    2.251690
2021/5/18 23:00:00    2.251690
2021/5/19 0:00:00     2.251690
Name: Corrected_Water_Level, Length: 10081, dtype: float64

In [23]:
#diff_limit is the threshold: a data point is flagged as erroneous if its absolute difference from the previous point is at least diff_limit
diff_limit = 0.12
is_erroneous=[0]
count = 1
for i in water_level[1:]:
    #use .iloc for positional access because water_level is indexed by timestamp strings
    val = abs(i-water_level.iloc[count - 1])
    if val >= diff_limit:
        is_erroneous.append(1)
    else:
        is_erroneous.append(0)
    count += 1
corrected_df['err_check']=is_erroneous
print(len(is_erroneous))

10081


In [24]:
corrected_df.loc[corrected_df['err_check']==1, 'Corrected_Water_Level'] = np.nan
edited_water_level=corrected_df.loc[:, 'Corrected_Water_Level']
final_water_level=edited_water_level.astype('float64').interpolate()

In [25]:
corrected_df=corrected_df.drop(columns=['Corrected_Water_Level'])
corrected_df['Corrected_Water_Level']=final_water_level

In [26]:
corrected_df.loc[corrected_df['err_check']==1, 'Corrected_Water_Level']

Timestamp
2020/3/30 20:00:00    1.499627
2020/4/2 1:00:00      1.471773
2020/4/2 2:00:00      1.479732
2020/4/6 12:00:00     1.431982
2020/4/6 13:00:00     1.447898
                        ...   
2021/5/13 22:00:00    1.694607
2021/5/14 4:00:00     1.833878
2021/5/14 21:00:00    1.774190
2021/5/14 22:00:00    1.750315
2021/5/18 13:00:00    2.132315
Name: Corrected_Water_Level, Length: 536, dtype: float64

In [27]:
with pd.ExcelWriter('Corrected_Data.xlsx') as writer:
    corrected_df.to_excel(writer,sheet_name='Corrected_Water_and_RG0_Level')