<img src="GEOS_Logo.pdf" width="500" />


# Step **8** of **`G2FNL`**: <font color=blue>"remove_outliers.ipynb"</font>
#### Oct 8, 2021  <font color=red>(v. working)</font>
##### Jeonghyeop Kim (jeonghyeop.kim@gmail.com)

> input files: **`zeroFilled_i`**, **`days_per_month.dat`**, **`station_list_full.dat`**,  **`steps.txt`**, and **`time_vector.dat`** \
> output files: **`outlierRemoved_i`** 


0. This code is part of the GPS2FNL process.
1. It removes outliers.
> Position data for each month are treated as one set. \
> The code fits a straight line to each month and subtracts this model from the data. \
> A simple statistical analysis is then performed on the residuals. \
> Outliers for each month are defined as any data outside the tolerance level. \
> The default tolerance level is +/- 3 sigma.
2. Potential issues: 
> Some stations still show problematic outliers after this algorithm is applied. \
> Some post-seismic signals may be identified as outliers and removed, which means a loss of interesting signal. \
> Maybe pass the month if an earthquake occurred in that month?
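
The monthly test described in item 1 can be written compactly. This is only a sketch of one reasonable reading: the symbols $t_k$ (day index within the month), $u_k$ (one position component), $a$, $b$ (fitted intercept and slope), $N$ (number of non-zero days) and $n$ (tolerance in sigma, default 3) are introduced here for illustration, and the $N-2$ normalization of the residual scatter is an assumption that the notebook does not specify.

$$
r_k = u_k - (a + b\,t_k), \qquad
\sigma = \sqrt{\frac{1}{N-2}\sum_{k=1}^{N} r_k^{2}}, \qquad
\text{day } k \text{ is an outlier if } |r_k| > n\,\sigma .
$$

Because the line is a least-squares fit with an intercept, the residuals $r_k$ have zero mean, so the tolerance band is centered on the fitted line.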

<div class="alert alert-danger">
Do NOT run this code twice without re-starting the kernel
</div>

In [1]:
# 1. import modules
import numpy as np
import pandas as pd
import os
from datetime import datetime

In [2]:
current_dir=os.getcwd()
os.getcwd()

'/Users/jkim/main/GPS2FNL_2021/summer_project_2021'

In [3]:
# 2. read files for (1) the number of days in each month
#               and (2) the list of stations
#               and (3) time_vector.dat for the first and last dates of the analysis
#               and (4) earthquake-related steps


#############################################
#(1)
datefile = 'days_per_month.dat'
dateNvec = pd.read_csv(datefile, sep = ' ', header = None)
dateNvec.columns = ['NofD']

#############################################
#(2)
list_full = "station_list_full.dat"
df_list=pd.read_csv(list_full, header=None)
df_list.columns=['StID']
N_list = len(df_list) 


#############################################
#(3)
timefile = 'time_vector.dat'
df_time=pd.read_csv(timefile, header=None)
startDateAnalysis=int(df_time.iloc[0, 0])   # first date of the analysis (scalar, not a Series)
endDateAnalysis=int(df_time.iloc[-1, 0])    # last date of the analysis
##########################################################################################
#(4)
metadata = "steps.txt" #file name
df_metadata=pd.read_csv(metadata, header=None, names=list('0123456'), sep=r'(?:,|\s+)', \
                        comment='#', engine='python')
## steps.txt is in an irregular shape
## 'names=list('0123456')' is to fill empty spots with NaN 
df_steps_earthquakes = df_metadata[df_metadata['2'] == 2].reset_index(drop=True)
df_steps_earthquakes.columns=['stID','time','flag','threshold','distance','mag','eventID'] 
#The step data has a time column in the form of yyMMMdd 
date_old = df_steps_earthquakes.time.tolist() # A DataFrame to a list
date_new = pd.to_datetime(date_old, format='%y%b%d').strftime('%Y%m%d') # convert date format
df_steps_earthquakes.loc[:,'time'] = date_new # replaces with the new date  in YYYYMMDD
df_steps_earthquakes['time']=df_steps_earthquakes['time'].astype(int) #str to int
df_steps_earthquakes = df_steps_earthquakes[(df_steps_earthquakes['time']>=startDateAnalysis) & \
                                            (df_steps_earthquakes['time']<=endDateAnalysis)]
df_steps_earthquakes = df_steps_earthquakes.reset_index(drop=True)
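
The date handling in part (4) can be tried in isolation. The snippet below is a minimal sketch, not the notebook's code: the two sample strings and the `start`/`end` window are made-up values, chosen only to show how a `yyMMMdd` string becomes a `YYYYMMDD` integer and why plain integer comparisons are enough for the time-window filter. It assumes an English locale for the month abbreviations.

```python
import pandas as pd

# Hypothetical sample values in the yyMMMdd layout of the 'time' column
raw_dates = ["11MAR11", "99AUG16"]

# Parse yyMMMdd, re-emit as YYYYMMDD strings, then convert to integers
parsed = pd.to_datetime(raw_dates, format="%y%b%d")
as_int = [int(s) for s in parsed.strftime("%Y%m%d")]

# YYYYMMDD integers sort in calendar order, so >= and <= select a date window
start, end = 20000101, 20201231  # hypothetical analysis window
in_window = [d for d in as_int if start <= d <= end]

print(as_int)
print(in_window)
```

Two-digit years follow the usual `%y` convention (69 to 99 map to 1969 to 1999, 00 to 68 map to 2000 to 2068), which is worth keeping in mind for older step records.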

In [4]:
processing_dir = os.path.join(current_dir, 'data', 'processing')
os.chdir(processing_dir) # cd to processing directory
os.getcwd()

'/Users/jkim/main/GPS2FNL_2021/summer_project_2021/data/processing'

# **`IDENTIFY AND ELIMINATE OUTLIERS`**

In [70]:
N_months = len(dateNvec) # How many months for the time period of interest?

##############################################
# STEP 1: Read data files station by station #
##############################################

column_names = ['datenum','date','lon','lat','ue','un','uz','se','sn','sz','corr_en','flag']

for i in range(0,1):  # Later, replace range(0,1) with range(N_list)

    inputfile = "zeroFilled_"+str(i+1) #input_file = zeroFilled_"$i"
    df_input=pd.read_csv(inputfile,sep=' ',header=None)   
    df_input = df_input.reset_index() #Index column will be added and will be used as 'datenum' consecutive integers.
    df_input.columns = column_names
    df_input['datenum'] += 1 #datenum starts from 1 instead of 0
    
    stationID = df_list.loc[i, 'StID'] #station ID as a scalar

##############################################
# STEP 2: READ DATA MONTH BY MONTH!          #
##############################################

    FirstMonth = 0
    for j in range(N_months):
        date_for_the_month=int(dateNvec.iloc[j, 0]) # number of days in month j
        LastMonth = FirstMonth + date_for_the_month
        df_month = df_input.loc[FirstMonth:LastMonth-1,:].reset_index(drop=True)
        FirstMonth = LastMonth
        
##############################################
# STEP 3: Decide to pass the month or not    #
##############################################
        df_month_nonzero = df_month[df_month['lon']!=0]
    
    
        # 3-a The number of non-zero values is less than 6 : skip 
        if len(df_month_nonzero) < 6: 
            print("Small number of data: skip the month")
            continue
    
        # 3-b An earthquake occurred within that month : skip    
        IniTimeNonzeroMonth=df_month_nonzero.iloc[0,1]
        EndTimeNonzeroMonth=df_month_nonzero.iloc[-1,1]
        df_steps_exist=df_steps_earthquakes[(df_steps_earthquakes['stID']==stationID) & \
                                    (df_steps_earthquakes['time']>=IniTimeNonzeroMonth) & \
                                    (df_steps_earthquakes['time']<=EndTimeNonzeroMonth)]    
        
        if len(df_steps_exist) != 0: 
            print("Earthquake within the month %s for station %s : skip the month" %(f"{j:03}",stationID))
            continue
            
        else: # otherwise, going!
            print("work from here : linear fitting and residual and 3sigma ...")

work from here

In [None]:
#     df_save=df_save.fillna(float(0))
#     savefile = "outlierRemoved_"+str(i+1) #output file = outlierRemoved_"$i"
#     df_save.to_csv(savefile ,header=None, index=None ,float_format='%.6f', sep=' ') #SAVE AS THEY ARE
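
The branch marked "work from here" in the loop above still needs the linear fit, the residuals and the 3-sigma test. One possible way to fill it is sketched below. This is not the author's implementation: the helper name `flag_month_outliers`, its `components` and `n_sigma` parameters, the choice of `ddof=2` for the residual scatter, and the rule that a day is flagged when any one component exceeds the tolerance are all assumptions made for illustration. The demo data are synthetic. The helper expects at least three non-zero days, which the notebook's "fewer than 6" check already guarantees.

```python
import numpy as np
import pandas as pd


def flag_month_outliers(df_month, components=("ue", "un", "uz"), n_sigma=3.0):
    """Return a boolean Series (indexed like df_month) marking outlier days.

    Hypothetical helper, not part of the notebook.
    """
    valid = df_month["lon"] != 0                      # skip zero-filled days
    t = df_month.loc[valid, "datenum"].to_numpy(dtype=float)
    bad = np.zeros(t.size, dtype=bool)

    for comp in components:
        y = df_month.loc[valid, comp].to_numpy(dtype=float)
        slope, intercept = np.polyfit(t, y, 1)        # least-squares straight line
        resid = y - (slope * t + intercept)           # detrended residuals
        sigma = resid.std(ddof=2)                     # assumption: 2 fitted parameters
        bad |= np.abs(resid) > n_sigma * sigma        # flag if any component exceeds

    mask = pd.Series(False, index=df_month.index)
    mask.loc[valid] = bad                             # zero-filled days stay False
    return mask


# Synthetic 30-day month: linear trends plus small deterministic wiggles
days = np.arange(1, 31)
demo = pd.DataFrame({
    "datenum": days,
    "lon": -117.0,
    "ue": 1e-3 * days + 5e-4 * np.sin(days),
    "un": -2e-3 * days + 5e-4 * np.cos(days),
    "uz": 5e-4 * np.sin(2.0 * days),
})
demo.loc[[3, 4], ["lon", "ue", "un", "uz"]] = 0.0     # two zero-filled days
demo.loc[10, "ue"] += 0.05                            # one artificial spike

mask = flag_month_outliers(demo)
print(demo.loc[mask, "datenum"].tolist())
```

The function only returns a mask. What to do with flagged days (for example, writing zeros so that they match the zero-fill convention of the input files) is left open here, since the notebook does not yet state it.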