<img src="GEOS_Logo.pdf" width="500" />


# Step **8** of **`G2FNL`**: <font color=blue>"remove_outliers.ipynb"</font>
#### Oct 5, 2021  <font color=red>(v. working)</font>
##### Jeonghyeop Kim (jeonghyeop.kim@gmail.com)

> input files: **`zeroFilled_i`**, **`days_per_month.dat`**, **`station_list_full.dat`**,  **`steps.txt`**, and **`time_vector.dat`** \
> output files: **`outlierRemoved_i`** 


0. This code is part of the GPS2FNL process.
1. It removes outliers.
> Position data for each month are treated as one set. \
> The code fits a straight line to each month of data and subtracts this model from the data. \
> It then performs a simple statistical analysis of the residuals. \
> Outliers for each month are defined as any data points outside the tolerance level. \
> The default tolerance level is +/- 3 sigma.
2. Potential issues:
> Some stations still show problematic outliers after this algorithm is applied. \
> Some post-seismic signals may be identified as outliers and removed, which means a loss of interesting signal. \
> Maybe skip the month if an earthquake occurred in that month?
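
The per-month procedure described above can be sketched as follows. This is a minimal illustration, not the notebook's final implementation: the function name `flag_outliers`, the parameter `n_sigma`, and the synthetic data are assumptions introduced here, and sigma is assumed to be the sample standard deviation of the residuals.

```python
import numpy as np

def flag_outliers(t, u, n_sigma=3.0):
    """Return a boolean mask, True where u lies more than n_sigma standard
    deviations (of the residuals) away from a straight-line fit u ~ a*t + b.

    Hypothetical helper for illustration only.
    """
    t = np.asarray(t, dtype=float)
    u = np.asarray(u, dtype=float)
    slope, intercept = np.polyfit(t, u, 1)      # least-squares straight line
    residual = u - (slope * t + intercept)      # data minus linear model
    sigma = residual.std(ddof=1)                # sample standard deviation
    return np.abs(residual) > n_sigma * sigma

# Synthetic "month" of 30 daily positions with one artificial spike
t = np.arange(1, 31)
u = 0.001 * t + 0.0005 * np.sin(t)
u[14] += 0.05                                   # inject a single outlier
mask = flag_outliers(t, u)
print(np.flatnonzero(mask))
```

In the real data, zero-filled days would presumably need to be excluded before the fit so that they do not bias the line or the residual statistics.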

<div class="alert alert-danger">
Do NOT run this code twice without re-starting the kernel
</div>

In [1]:
# 1. import modules
import numpy as np
import pandas as pd
import os
from datetime import datetime

In [8]:
current_dir=os.getcwd()
os.getcwd()

'/Users/jkim/main/GPS2FNL_2021/summer_project_2021'

In [17]:
# 2. Read files for (1) the number of days in each month,
#                   (2) the number of stations,
#                   (3) time_vector.dat, for the first and last dates of the analysis, and
#                   (4) earthquake-related steps


#############################################
#(1)
datefile = 'days_per_month.dat'
dateNvec = pd.read_csv(datefile, sep = ' ', header = None)
dateNvec.columns = ['NofD']

#############################################
#(2)
list_full = "station_list_full.dat"
df_list=pd.read_csv(list_full, header=None)
N_list = len(df_list) 
#############################################
#(3)
timefile = 'time_vector.dat'
df_time=pd.read_csv(timefile, header=None)
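# Use .iloc[row, col] to get a scalar; int() on a one-element Series is deprecated in recent pandas
startDateAnalysis=int(df_time.iloc[0, 0])
endDateAnalysis=int(df_time.iloc[-1, 0])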
##########################################################################################
#(4)
metadata = "steps.txt" #file name
df_metadata=pd.read_csv(metadata, header=None, names=list('0123456'), sep=r'(?:,|\s+)', \
                        comment='#', engine='python')
## steps.txt is in an irregular shape
## 'names=list('0123456')' is to fill empty spots with NaN 
df_steps_earthquakes = df_metadata[df_metadata['2'] == 2].reset_index(drop=True)
df_steps_earthquakes.columns=['stID','time','flag','threshold','distance','mag','eventID'] 
#The step data has a time column in the form of yyMMMdd 
date_old = df_steps_earthquakes.time.tolist() # A DataFrame to a list
date_new = pd.to_datetime(date_old, format='%y%b%d').strftime('%Y%m%d') # convert date format
df_steps_earthquakes.loc[:,'time'] = date_new # replaces with the new date  in YYYYMMDD
df_steps_earthquakes['time']=df_steps_earthquakes['time'].astype(int) #str to int
df_steps_earthquakes = df_steps_earthquakes[(df_steps_earthquakes['time']>=startDateAnalysis) & \
                                            (df_steps_earthquakes['time']<=endDateAnalysis)]
df_steps_earthquakes = df_steps_earthquakes.reset_index(drop=True)
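
The two less obvious parts of the cell above, reading a ragged file with a fixed list of column names and converting `yyMMMdd` dates to `YYYYMMDD` integers, can be tried on a small in-memory sample. The station IDs, event IDs, and values below are made up for illustration; only the parsing options mirror the cell above.

```python
import io
import pandas as pd

# Hypothetical sample: the first row has fewer fields than the earthquake rows
sample = "\n".join([
    "AAAA 08Jul15 1 Antenna_changed",
    "AAAA 10Apr04 2 1500.0 320.5 7.2 ev0001",
    "BBBB 19Jul06 2 2000.0 150.2 7.1 ev0002",
])

# Seven column names force short rows to be padded with NaN
df = pd.read_csv(io.StringIO(sample), header=None, names=list("0123456"),
                 sep=r"(?:,|\s+)", comment="#", engine="python")

# Keep earthquake-related steps only (flag == 2) and give columns readable names
eq = df[df["2"] == 2].reset_index(drop=True).copy()
eq.columns = ["stID", "time", "flag", "threshold", "distance", "mag", "eventID"]

# yyMMMdd (e.g. 10Apr04) -> YYYYMMDD as an integer (e.g. 20100404)
eq["time"] = (pd.to_datetime(eq["time"], format="%y%b%d")
                .dt.strftime("%Y%m%d")
                .astype(int))
print(eq[["stID", "time", "mag"]])
```

Storing the dates as integers makes the later range filter a simple numeric comparison against `startDateAnalysis` and `endDateAnalysis`.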

In [19]:
processing_dir = os.path.join(current_dir, 'data', 'processing')
os.chdir(processing_dir) # cd to the processing directory
os.getcwd()

'/Users/jkim/main/GPS2FNL_2021/summer_project_2021/data/processing'

# **`IDENTIFY AND ELIMINATE OUTLIERS`**

In [38]:
N_months = len(dateNvec) # How many months for the time period of interest?



##############################################
# STEP 1: Read data files station by station #
##############################################

column_names = ['datenum','date','lon','lat','ue','un','uz','se','sn','sz','corr_en','flag']

for i in range(0,1):  # Later, replace range(0,1) with range(N_list)

    inputfile = "zeroFilled_"+str(i+1) #input_file = zeroFilled_"$i"
    df_input=pd.read_csv(inputfile,sep=' ',header=None)   
    df_input = df_input.reset_index() #Index column will be added and will be used as 'datenum' consecutive integers.
    df_input.columns = column_names
    df_input.loc[:,['datenum']]=df_input.loc[:,['datenum']]+1 #datenum starts from 1 instead of 0
    
##############################################
# STEP 2: READ DATA MONTH BY MONTH!          #
##############################################

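    FirstMonth = 0
    for j in range(N_months):
        date_for_the_month=int(dateNvec.iloc[j, 0])  # number of days in month j
        LastMonth = FirstMonth + date_for_the_month
        df_month = df_input.loc[FirstMonth:LastMonth-1,:].reset_index(drop=True)
        print(df_month)
        FirstMonth = LastMonth
        
##############################################
# STEP 3: Decide to pass the month or not    #
##############################################
        
        # Days with data; zero-filled days have ue == 0
        n_valid = int((df_month['ue'] != 0).sum())
        # ASSUMPTION: row i of station_list_full.dat starts with the station ID used in steps.txt
        station_id = str(df_list.iloc[i, 0]).split()[0]
        eq_dates = df_steps_earthquakes.loc[df_steps_earthquakes['stID'] == station_id, 'time']
        has_earthquake = eq_dates.between(df_month['date'].iloc[0], df_month['date'].iloc[-1]).any()

        if n_valid < 6:       # The number of non-zero values is less than 6 : pass
            continue
        elif has_earthquake:  # an earthquake occurred in the month : pass
            continue
        else:                 # otherwise, going!
            pass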
        

    datenum      date      lon      lat        ue        un        uz  \
0         1  20060101  242.907  34.1164 -0.060673  0.739344  0.104282   
1         2  20060102  242.907  34.1164 -0.063803  0.742935  0.093794   
2         3  20060103  242.907  34.1164 -0.061580  0.739282  0.093533   
3         4  20060104  242.907  34.1164 -0.060382  0.739491  0.089145   
4         5  20060105  242.907  34.1164 -0.063088  0.746896  0.096681   
5         6  20060106  242.907  34.1164 -0.059025  0.734961  0.083636   
6         7  20060107  242.907  34.1164 -0.063438  0.743007  0.093221   
7         8  20060108  242.907  34.1164 -0.061394  0.741470  0.077786   
8         9  20060109  242.907  34.1164 -0.063111  0.739119  0.092447   
9        10  20060110  242.907  34.1164 -0.061224  0.740158  0.094935   
10       11  20060111  242.907  34.1164 -0.060240  0.739454  0.098241   
11       12  20060112  242.907  34.1164 -0.063474  0.740794  0.093735   
12       13  20060113  242.907  34.1164 -0.056689  
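
The month-by-month slicing in STEP 2 relies on a running offset built from `days_per_month.dat`. The same bookkeeping can be written with a cumulative sum, as in the sketch below. The day counts and the one-column DataFrame are stand-ins chosen for illustration, not the notebook's data.

```python
import numpy as np
import pandas as pd

# Hypothetical day counts for three consecutive months
days_per_month = np.array([31, 28, 31])

# Row offsets: month j covers rows starts[j] .. ends[j]-1
ends = np.cumsum(days_per_month)
starts = ends - days_per_month

# Stand-in for df_input: one row per day, datenum starting at 1
df = pd.DataFrame({"datenum": np.arange(1, ends[-1] + 1)})

# Position-based slicing (iloc) excludes the end index, so no "-1" is needed
months = [df.iloc[s:e].reset_index(drop=True) for s, e in zip(starts, ends)]

for j, m in enumerate(months):
    print(j, len(m), m["datenum"].iloc[0], m["datenum"].iloc[-1])
```

Using `iloc` with half-open ranges avoids the `LastMonth-1` adjustment that label-based `loc` slicing requires.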

In [21]:
df_input

| | datenum | date | lon | lat | ue | un | uz | se | sn | sz | corr_en | flag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 20060101 | 242.907 | 34.1164 | -0.060673 | 0.739344 | 0.104282 | 0.000993 | 0.001109 | 0.004718 | -0.121517 | 1.0 |
| 1 | 2 | 20060102 | 242.907 | 34.1164 | -0.063803 | 0.742935 | 0.093794 | 0.001094 | 0.001246 | 0.005135 | -0.129376 | 1.0 |
| 2 | 3 | 20060103 | 242.907 | 34.1164 | -0.061580 | 0.739282 | 0.093533 | 0.001055 | 0.001129 | 0.005246 | -0.104294 | 1.0 |
| 3 | 4 | 20060104 | 242.907 | 34.1164 | -0.060382 | 0.739491 | 0.089145 | 0.000975 | 0.001105 | 0.004680 | -0.121829 | 1.0 |
| 4 | 5 | 20060105 | 242.907 | 34.1164 | -0.063088 | 0.746896 | 0.096681 | 0.001119 | 0.001218 | 0.005573 | -0.110840 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5717 | 5718 | 20210827 | 0.000 | 0.0000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 |
| 5718 | 5719 | 20210828 | 0.000 | 0.0000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 |
| 5719 | 5720 | 20210829 | 0.000 | 0.0000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 |
| 5720 | 5721 | 20210830 | 0.000 | 0.0000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 |


In [None]:
#     df_save=df_save.fillna(float(0))
#     savefile = "outlierRemoved_"+str(i+1) #output file = outlierRemoved_"$i"
#     df_save.to_csv(savefile ,header=None, index=None ,float_format='%.6f', sep=' ') #SAVE AS THEY ARE