# **Aggregation and Manipulation of Timestamps**
## Grouping by Timestamp; Merging on Timestamp; Extracting Timestamp Information; Calculating Timedeltas; Adding Timedeltas; and Concatenating (SQL Union/Stacking/Appending) Dataframes.

## _ETL Workflow Notebook 1_

Marco Cesar Prado Soares, Data Scientist Specialist - Bayer Crop Science LATAM
- marcosoares.feq@gmail.com
- marco.soares@bayer.com

Install statsmodels library

In [None]:
! pip install statsmodels

Install tensorflow library

In [None]:
! pip install tensorflow

Install Keras library

In [None]:
! pip install keras

Install SHAP library

In [None]:
! pip install shap

In [None]:
#check the version of the package
! pip show shap

In [None]:
# Upgrade to the most recent library versions, if a given module is not present and analysis cannot be
# executed.
! pip install pip --upgrade
! pip install tensorflow --upgrade
! pip install keras --upgrade
! pip install shap --upgrade
! pip install scikit-learn --upgrade
! pip install pandas --upgrade
! pip install numpy --upgrade
! pip install matplotlib --upgrade
! pip install seaborn --upgrade
! pip install scipy --upgrade
! pip install statsmodels --upgrade

## **Load Python Libraries in Global Context**

In [None]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# **Function for mounting Google Drive or S3 (AWS Simple Storage Service) bucket**

In [2]:
def mount_storage_system (source = 'aws', path_to_store_imported_s3_bucket = '/', s3_bucket_name = None, s3_obj_key_preffix = None):
    
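    # Both SDKs are optional: sagemaker (AWS SageMaker Python SDK) is available only where
    # it was installed, and google.colab only inside Google Colab. Then, import only the
    # module required by the selected source:
    if (source == 'aws'):
        from sagemaker.session import Session
    elif (source == 'google'):
        from google.colab import drive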
    
    # source = 'google' for mounting the google drive;
    # source = 'aws' for mounting an AWS S3 bucket.
    
    # THE FOLLOWING PARAMETERS HAVE EFFECT ONLY WHEN source == 'aws'
    
    # path_to_store_imported_s3_bucket: path of the Python environment to which the
    # S3 bucket contents will be imported. If it is None, or if 
    # path_to_store_imported_s3_bucket = '/', bucket will be imported to the root path. 
    # Alternatively, input the path as a string (in quotes). e.g. 
    # path_to_store_imported_s3_bucket = '/copied_s3_bucket'
    
    # s3_bucket_name = None.
    ## This parameter is mandatory to access an AWS S3 bucket. Replace None with a string
    # containing the bucket's name. e.g. s3_bucket_name = "aws-bucket-1" accesses a bucket
    # named "aws-bucket-1".
    
    # s3_obj_key_preffix = None. Keep it None or as an empty string (s3_obj_key_preffix = '')
    # to import the whole bucket content, instead of a single object from it.
    # Alternatively, set it as a string containing the subfolder from the bucket to import:
    # Suppose that your bucket (admin-created) has four objects with the following object 
    # keys: Development/Projects.xlsx; Finance/statement1.pdf; Private/taxdocument.pdf; and
    # s3-dg.pdf. The s3-dg.pdf key does not have a prefix, so its object appears directly 
    # at the root level of the bucket. If you open the Development/ folder, you see 
    # the Projects.xlsx object in it.
    # Check Amazon documentation:
    # https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-keys.html
    
    # In summary, if the path of the file is 'bucket/my_path/.../file.csv', where 'bucket'
    # is the bucket's name, then s3_obj_key_preffix = 'my_path/.../', i.e. the path without
    # the final 'file.csv' part (file name with extension).
    
    if (source == 'google'):
        
        print("Associate the Python environment to your Google Drive account, and authorize the access in the opened window.")
        
        drive.mount('/content/drive')
        
        print("Now your Python environment is connected to your Google Drive: the root directory of your environment is now the root of your Google Drive.")
        print("In Google Colab, navigate to the folder icon (\'Files\') of the left navigation menu to find a specific folder or file in your Google Drive.")
        print("Click on the folder or file name and select the ellipsis (...) icon on the right of the name to reveal the option \'Copy path\', which will give you the path to use as input for loading objects and files on your Python environment.")
        print("Caution: save your files into different directories of the Google Drive. If files are all saved in a same folder or directory, like the root path, they may not be accessible from your Python environment.")
        print("If you still cannot see the file after moving it to a different folder, reload the environment.")
    
    elif (source == 'aws'):
        
        # Notice: if you wanted to authenticate directly from Python code, you could use
        # the following code, instead, to start the S3 client. boto3 is AWS S3 Python SDK:
        
        # import boto3
        # ACCESS_KEY = 'access_key_ID'
        # PASSWORD_KEY = 'password_key'
        # s3_client = boto3.client('s3', aws_access_key_id = ACCESS_KEY, aws_secret_access_key = PASSWORD_KEY)
        # ... [here, use the same following code until line new_session = Session()]
        # [keep the line for session start. Substitute the line with the .download_data
        # method by the following line:]
        # s3_client.download_file(s3_bucket_name, s3_file_name_with_extension, path_to_store_imported_s3_bucket)
        
        # Check if the whole bucket will be downloaded (s3_obj_key_preffix = None):
        if (s3_obj_key_preffix is None):
            
            s3_obj_key_preffix = ''
        
        # If the path to store is None, also import the bucket to the root path:
        if (path_to_store_imported_s3_bucket is None):
            
            path_to_store_imported_s3_bucket = '/'
        
        # If the bucket name was provided, start the session. If not, print an error
        # message:
        
        if (s3_bucket_name is None):
            
            print("Please, provide a valid S3 Bucket name to download from.")
        
        else:
        
            # start a new sagemaker session:

            print("Starting a SageMaker session to be associated with the S3 bucket.")

            new_session = Session()
            # Check sagemaker session class documentation:
            # https://sagemaker.readthedocs.io/en/stable/api/utility/session.html
            new_session.download_data(path = path_to_store_imported_s3_bucket, bucket = s3_bucket_name, key_prefix = s3_obj_key_preffix)

            print(f"S3 bucket contents successfully imported to path \'{path_to_store_imported_s3_bucket}\'.")
            
    else:
        
        print("Select a valid source: \'google\' for mounting Google Drive; or \'aws\' for accessing AWS S3 Bucket.")
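
The relation between a full object path, the bucket name, and the key prefix described in the comments above can be illustrated with plain string handling. This is only a sketch: `split_s3_path` is a hypothetical helper that is not part of this notebook, and it makes no call to AWS.

```python
# Hypothetical helper (not part of the notebook): split a full object path
# into the bucket name, the key prefix and the file name.
def split_s3_path(full_path):
    # Drop a leading 's3://' scheme if present, then any leading slash.
    if full_path.startswith("s3://"):
        full_path = full_path[len("s3://"):]
    full_path = full_path.lstrip("/")

    # The first path component is the bucket; the rest is the object key.
    bucket, _, key = full_path.partition("/")

    # The prefix is everything in the key up to and including the last '/'.
    # An object stored at the root level of the bucket has an empty prefix.
    prefix, sep, file_name = key.rpartition("/")
    return bucket, prefix + sep, file_name


bucket, key_prefix, file_name = split_s3_path("aws-bucket-1/Development/Projects.xlsx")
print(bucket, key_prefix, file_name)
```

The second returned element is the kind of value expected by `s3_obj_key_preffix`; an empty string corresponds to importing the whole bucket.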

# **Function for loading the dataframe**

In [15]:
def load_dataframe (file_directory_path, file_name_with_extension, has_header = True, txt_csv_col_sep = "comma", sheet_to_load = None):
    
    import os
    import pandas as pd
    
    # WARNING: Use this function to load dataframes stored on Excel (xls, xlsx, etc), 
    # txt, or CSV (comma separated values) files.
    
    # file_directory_path - (string, in quotes): input the path of the directory (e.g. folder path) 
    # where the file is stored. e.g. file_directory_path = "/" or file_directory_path = "/folder"
    
    # file_name_with_extension - (string, in quotes): input the name of the file with the extension
    # e.g. file_name_with_extension = "file.xlsx", or, file_name_with_extension = "file.csv"
    
    # has_header = True if the imported table has headers (row with column names).
    # Alternatively, has_header = False if the table does not have a header.
    
    # txt_csv_col_sep = "comma" - This parameter has effect only when the file is a 'txt'
    # or 'csv'. It informs how the different columns are separated.
    # Keep txt_csv_col_sep = "comma" for columns separated by commas (",");
    # or set txt_csv_col_sep = "whitespace" for columns separated by simple spaces (" ").
    
    # sheet_to_load - This parameter has effect only for Excel files.
    # Keep sheet_to_load = None not to specify a sheet of the file, so that the first sheet
    # will be loaded.
    # sheet_to_load may be an integer or a string (inside quotes). sheet_to_load = 0
    # loads the first sheet (sheet with index 0); sheet_to_load = 1 loads the second sheet
    # of the file (index 1); sheet_to_load = "Sheet1" loads a sheet named as "Sheet1".
    # Declare a number to load the sheet with that index, starting from 0; or declare a
    # name to load the sheet with that name.
    
    # Create the complete file path:
    file_path = os.path.join(file_directory_path, file_name_with_extension)
    # Extract the file extension
    file_extension = os.path.splitext(file_path)[1][1:]
    # os.path.splitext(file_path) is a tuple of strings: the first is the complete file
    # root with no extension; the second is the extension starting with a point: '.txt'
    # When we set os.path.splitext(file_path)[1], we are selecting the second element of
    # the tuple. By selecting os.path.splitext(file_path)[1][1:], we are taking this string
    # from the second character (index 1), eliminating the dot: 'txt'
    
    if ((file_extension == 'txt') or (file_extension == 'csv')):
        # Both comparisons return plain booleans, so the logical operator 'or' applies.
        # (The operators & and | are the element-wise 'And'/'Or' used with Pandas series.)
        # pandas.read_csv method must be used.
        
        if (has_header == True):
            
            if (txt_csv_col_sep == "comma"):
            
                dataset = pd.read_csv(file_path)
            
            elif (txt_csv_col_sep == "whitespace"):
                
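                # sep = r'\s+' splits the columns on any run of whitespace. It replaces
                # delim_whitespace = True, which is deprecated in recent Pandas versions.
                dataset = pd.read_csv(file_path, sep = r'\s+')
            
            else:
                # Stop here: with an invalid separator no dataset is created, and the final
                # print and return statements would fail with an undefined variable.
                raise ValueError(f"Enter a valid column separator for the {file_extension} file: 'comma' or 'whitespace'.")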
        
        else:
            # has_header == False
              
            if (txt_csv_col_sep == "comma"):
            
                dataset = pd.read_csv(file_path, header = None)
            
            elif (txt_csv_col_sep == "whitespace"):
                
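                # sep = r'\s+' splits the columns on any run of whitespace. It replaces
                # delim_whitespace = True, which is deprecated in recent Pandas versions.
                dataset = pd.read_csv(file_path, sep = r'\s+', header = None)
            
            else:
                # Stop here: with an invalid separator no dataset is created, and the final
                # print and return statements would fail with an undefined variable.
                raise ValueError(f"Enter a valid column separator for the {file_extension} file: 'comma' or 'whitespace'.")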
        
    else:
        # If it is neither a csv nor a txt file, let's assume it is one of the different
        # possible Excel files.
        print("Excel file inferred. If an error message is shown, check if a valid file extension was used: \'xlsx\', \'xls\', etc.")
            
        if (sheet_to_load is not None):        
        #Case where the user specifies which sheet of the Excel file should be loaded.
            
            if (has_header == True):
                
                dataset = pd.read_excel(file_path, sheet_name = sheet_to_load)
            
            else:
                #No header
                dataset = pd.read_excel(file_path, sheet_name = sheet_to_load, header = None)
        
        else:
            #No sheet specified
            if (has_header == True):
                
                dataset = pd.read_excel(file_path)
            
            else:
                #No header
                dataset = pd.read_excel(file_path, header = None)
    
    print(f"Dataset extracted from {file_path}. Check the 10 first rows of the dataset:\n")
    print(dataset.head(10))
    
    return dataset   
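
The extension check and the two reading modes used by the function can be tried on small in-memory tables. This is a minimal sketch, not a replacement for `load_dataframe`: the file name and the two text tables are illustrative assumptions, and `io.StringIO` stands in for files on disk.

```python
import io
import os
import pandas as pd

# Extension extraction, following the same steps used inside load_dataframe.
file_path = os.path.join("/folder", "file.csv")
file_extension = os.path.splitext(file_path)[1][1:]
print(file_extension)

# In-memory stand-ins for files on disk (illustrative data only).
csv_text = "timestamp,value\n2021-01-01 00:00:00,1.0\n2021-01-01 00:30:00,3.0\n"
txt_text = "2021-01-01 1.0\n2021-01-02 3.0\n"

# Comma-separated table whose first row contains the column names.
with_header = pd.read_csv(io.StringIO(csv_text))

# Whitespace-separated table without header: columns are numbered from 0.
no_header = pd.read_csv(io.StringIO(txt_text), sep=r"\s+", header=None)

print(with_header.head(10))
print(no_header.head(10))
```

When the table has no header, Pandas names the columns with integers, so they are selected as `no_header[0]`, `no_header[1]`, and so on.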

# **Function for grouping the data by a timestamp**

In [42]:
def GROUP_BY_TIMESTAMP (df, timestamp_tag_column, grouping_frequency_unit = 'day', number_of_periods_to_group = 1, aggregate_function = 'mean', start_time = None, offset_time = None):
    
    import pandas as pd
    import numpy as np
    
    #df - dataframe/table containing the data to be grouped
    
    #timestamp_tag_column: name (header) of the column containing the
    #timestamps for grouping the data.
    
    #grouping_frequency_unit: the frequency of aggregation. The possible values are:
    
    grp_frq_unit_dict = {'year': "Y", 'month': "M", 'week': "W", 
                            'day': "D", 'hour': "H", 'minute': "min", 'second': 'S'}
    
    #Simply provide the key: 'year', 'month', 'week',..., 'second', and this dictionary
    #will convert it to the Pandas frequency alias.
    #The default is 'day', so this frequency will be used if no value is provided.
    
    #To access the value of a dictionary d = {'key1': item1, ...}, declare d['key1'],
    #which returns item1: simply pass the key as a string (in quotes) inside brackets,
    #just as if you were accessing a column from the dataframe.
    #Since grouping_frequency_unit is a variable storing a string, it should not come in
    #quotes:
    
    #Convert the input to Pandas encoding:
    frq_unit = grp_frq_unit_dict[grouping_frequency_unit]
    
    #https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html
    #To group by business day, check the example:
    #https://stackoverflow.com/questions/13019719/get-business-days-between-start-and-end-date-using-pandas
    
    #number_of_periods_to_group: the bin size. The default is 1, so we will group by '1day'
    #if number_of_periods_to_group = 2 we would be grouping by every 2 days.
    #If the unit was minute and number_of_periods_to_group = 30, we would be grouping into
    #30-min bins.
    
    if (number_of_periods_to_group <=0):
        
        print("Invalid number of periods to group. Changing to 1 period.")
        number_of_periods_to_group = 1
    
    if (number_of_periods_to_group == 1):
        
        #Do not put the number 1 prior to the frequency unit
        FREQ =  frq_unit
    
    else:
        #perform the string concatenation. Convert the number into a string:
        number_of_periods_to_group = str(number_of_periods_to_group)
        #Concatenate the strings:
        FREQ = number_of_periods_to_group + frq_unit
        #The expected output is like '2D' for a 2-day grouping
        
    #aggregate_function: Pandas aggregation method: 'mean', 'median', 'std', 'sum', 'min'
    # 'max', etc. The default is 'mean'. Then, if no aggregate is provided, 
    # the mean will be calculated.
    
    #You can pass a list of multiple aggregations, like: 
    #aggregate_function = ['mean', 'max', 'sum']
    #You can also pass custom functions, like a user-defined pct30 (30th percentile),
    #or NumPy functions:
    #aggregate_function = pct30
    #aggregate_function = np.mean (numpy.mean)
    
    #ADJUSTMENT OF GROUPING BASED ON A FIXED TIMESTAMP
    #These parameters are set to None as default.
    #You can specify the origin (start_time) or the offset (offset_time). Both shift the
    #edges of the grouping bins: the origin is the timestamp from which the bins are
    #counted, whereas the offset is a time interval added to the default origin.
    #For instance: start_time = '2000-10-01 23:30:00' or offset_time = '23h30min'
    
    #WARNING: DECLARE ONLY ONE OF THESE PARAMETERS. DO NOT DECLARE AN OFFSET IF AN 
    #ORIGIN WAS SPECIFIED, AND VICE-VERSA.
    
    #Create a Pandas timestamp object from the timestamp_tag_column. It guarantees that
    #the timestamp manipulation methods can be correctly applied.
    #Let's create using nanoseconds resolution, so that the timestamps present the
    #maximum possible resolution:
    
    # START: CONVERT ALL TIMESTAMPS/DATETIMES/STRINGS TO pandas.Timestamp OBJECTS.
    # This will prevent any compatibility problems.
    
    #The pd.Timestamp function can handle a single timestamp per call. Then, we must
    # loop through the series, and apply the function to each element.
    
    #1. Start a list to store the Pandas timestamps:
    timestamp_list = []
    
    #2. Loop through each element of the timestamp column, and apply the function
    # to guarantee that all elements are Pandas timestamps
    
    for timestamp in df[timestamp_tag_column]:
        #Access each element 'timestamp' of the series df[timestamp_tag_column]
        timestamp_list.append(pd.Timestamp(timestamp, unit = 'ns'))
    
    #3. Create a column in the dataframe that will be used as key for the Grouper class
    # The grouper requires a column in the dataframe - it cannot use a list for that.
    # Simply copy the list as the new column:
    df['timestamp_obj'] = timestamp_list
    
    #Now we have a column correspondent to timestamp_tag_column, but only with
    # Pandas timestamp objects
    
    #In this function, we do not convert the Timestamps to another type of object.
    #That is because the Grouper class requires a datetime-like key to group the
    #dataframe when a frequency is specified.
    
    if (start_time is not None):
        
        grouped_df = df.groupby(pd.Grouper(key = 'timestamp_obj' , freq = FREQ, origin = start_time)).agg(aggregate_function)
    
    elif (offset_time is not None):
        
        grouped_df = df.groupby(pd.Grouper(key = 'timestamp_obj' , freq = FREQ, offset = offset_time)).agg(aggregate_function)
    
    else:
        
        #Standard situation, when both start_time and offset_time are None
        grouped_df = df.groupby(pd.Grouper(key = 'timestamp_obj' , freq = FREQ)).agg(aggregate_function)
    
    print (f"Dataframe grouped by every {number_of_periods_to_group} {frq_unit}.")
    
    #The parameter 'key' of the Grouper class must be the name (string) of a column
    # of the dataframe
    # https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Grouper.html
    
    #The objects 'timestamp_obj' are now the index from grouped_df dataframe
    #Let's store them as a column and restart the index:
    #1. Copy the index to a new column:
    grouped_df['Timestamp_grouped'] = grouped_df.index
    
    #2. Reset the index:
    grouped_df = grouped_df.reset_index(drop = True)
    
    #3. 'Timestamp_grouped' is now the last column. Let's create a list of the
    # reordered columns, starting from 'Timestamp_grouped'
    
    reordered_cols_list = ['Timestamp_grouped']
    
    for i in range((len(grouped_df.columns)-1)):
        
        #This loop goes from i = 0 to i = (len(grouped_df.columns)-2)
        # grouped_df.columns contains the column names. Since indexing starts
        # from 0, the last element has index i = (len(grouped_df.columns)-1).
        # But this last element is 'Timestamp_grouped', which we used as the
        # first element of the list reordered_cols_list. Then, we must loop from the
        # first element of grouped_df.columns to the element immediately before
        # 'Timestamp_grouped', i.e. the element with index (len(grouped_df.columns)-2).
        # range (i, j) goes from i to j-1. If only one value is specified, i = 0 and j =
        # declared value. If you print all i values in range(10), numbers from 0 to 9
        # will be shown.
        
        reordered_cols_list.append(grouped_df.columns[i])
    
    #4. Reorder the dataframe passing the list reordered_cols_list as the column filters
    # / columns selection list. Notice that df[['col1', 'col2']] = df[list], where list =
    # ['col1', 'col2']. To select or reorder columns, we pass the list of columns inside
    # brackets as parameter.
    
    grouped_df = grouped_df[reordered_cols_list]
    
    
    # Pandas .head(Y) method results in a dataframe containing the first Y rows of the 
    # original dataframe. The default .head() is Y = 5. Print first 10 rows of the 
    # new dataframe:
    print("Dataframe successfully grouped. Check its 10 first rows:\n")
    print(grouped_df.head(10))
    
    #Now return the grouped dataframe with the timestamp as the first column:
    
    return grouped_df
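
The effect of the frequency string and of the offset can be checked on a small table. The sketch below does not call `GROUP_BY_TIMESTAMP`: it applies `pd.Grouper` directly, converts the timestamps with the vectorized `pd.to_datetime` in place of the element-by-element loop, and uses illustrative data and column names that are assumptions of this example.

```python
import pandas as pd

# Illustrative data: four measurements taken within a little more than one hour.
df = pd.DataFrame({
    "timestamp": ["2021-01-01 00:10:00", "2021-01-01 00:20:00",
                  "2021-01-01 00:40:00", "2021-01-01 01:05:00"],
    "value": [1.0, 3.0, 5.0, 7.0],
})

# Vectorized conversion of the whole column to datetime objects.
df["timestamp_obj"] = pd.to_datetime(df["timestamp"])

# 30-minute bins counted from midnight: 00:00, 00:30, 01:00.
grouped = (
    df.groupby(pd.Grouper(key="timestamp_obj", freq="30min"))["value"]
      .mean()
      .reset_index()
)

# Same bin size, with the bin edges shifted by 10 minutes: 00:10, 00:40.
shifted = (
    df.groupby(pd.Grouper(key="timestamp_obj", freq="30min", offset="10min"))["value"]
      .mean()
      .reset_index()
)

print(grouped)
print(shifted)
```

Selecting the `value` column before aggregating keeps the original text column out of the mean calculation. `reset_index()` moves the grouped timestamps from the index to the first column in a single step.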

# **Function for merging (joining) the data on a timestamp column**

In [46]:
def MERGE_ON_TIMESTAMP (df_left, df_right, left_key, right_key, how_to_join = "inner", merge_method = 'ordered', merged_suffixes = ('_left', '_right'), asof_direction = 'nearest', ordered_filling = None):
    
    #WARNING: Only two dataframes can be merged on each call of the function.
    
    import pandas as pd
    import numpy as np
    
    #df_left: dataframe to be joined as the left one.
    
    #df_right: dataframe to be joined as the right one
    
    #left_key: (String) name of column of the left dataframe to be used as key for joining.
    
    #right_key: (String) name of column of the right dataframe to be used as key for joining.
    
    #how_to_join: joining method: "inner", "outer", "left", "right". The default is "inner".
    
    #merge_method: which pandas merging method will be applied:
    #merge_method = 'ordered' for using the .merge_ordered method.
    #merge_method = "asof" for using the .merge_asof method.
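    #WARNING: .merge_asof matches each row of the left dataframe to the nearest key of the
    #right one (not to an exactly equal key), and always behaves like a left join. Then, the
    #how_to_join parameter is not applicable. Both dataframes must be sorted by their keys.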
    
    # merged_suffixes = ('_left', '_right') - tuple of the suffixes to be added to columns
    #with equal names. Simply modify the strings inside quotes to modify the standard
    #values. If no tuple is provided, the standard denomination will be used.
    
    #asof_direction: this parameter will only be used if the .merge_asof method is
    #selected. The default is 'nearest' to merge the closest timestamps in both 
    #directions. The other options are: 'backward' or 'forward'.
    
    #ordered_filling: this parameter will only be used on the merge_ordered method.
    #The default is None. Input ordered_filling = 'ffill' to fill missing values with the
    #previous value.
    
    if (merge_method == 'ordered'):
    
        if (ordered_filling == 'ffill'):
            
            merged_df = pd.merge_ordered(df_left, df_right, left_on = left_key, right_on = right_key, how = how_to_join, suffixes = merged_suffixes, fill_method='ffill')
        
        else:
            
            merged_df = pd.merge_ordered(df_left, df_right, left_on = left_key, right_on = right_key, how = how_to_join, suffixes = merged_suffixes)
    
    elif (merge_method == 'asof'):
        
        merged_df = pd.merge_asof(df_left, df_right, left_on = left_key, right_on = right_key, suffixes = merged_suffixes, direction = asof_direction)
    
    else:
        
        print("You did not enter a valid merge method for this function, \'ordered\' or \'asof\'.")
        print("Then, applying the conventional Pandas .merge method, followed by .sort_values method.")
        
        #Pandas sort_values method: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html
        
        merged_df = df_left.merge(df_right, left_on = left_key, right_on = right_key, how = how_to_join, suffixes = merged_suffixes)
        merged_df = merged_df.sort_values(by = left_key, ascending = True)
        #sort by the timestamp key of the left dataframe, not by whichever column comes first.
    
    # Pandas .head(Y) method results in a dataframe containing the first Y rows of the 
    # original dataframe. The default .head() is Y = 5. Print first 10 rows of the 
    # new dataframe:
    print("Dataframe successfully merged. Check its 10 first rows:\n")
    print(merged_df.head(10))
    
    return merged_df
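
The difference between the two merging methods is easier to see on a small example. The sketch below calls the Pandas functions directly instead of `MERGE_ON_TIMESTAMP`; the two tables and the shared key name `timestamp` are assumptions made only for this illustration.

```python
import pandas as pd

# Illustrative tables sampled at different instants (already sorted by the key).
left = pd.DataFrame({
    "timestamp": pd.to_datetime(["2021-01-01 00:00:00",
                                 "2021-01-01 00:10:00",
                                 "2021-01-01 00:20:00"]),
    "a": [1, 2, 3],
})
right = pd.DataFrame({
    "timestamp": pd.to_datetime(["2021-01-01 00:02:00",
                                 "2021-01-01 00:19:00"]),
    "b": [10, 20],
})

# merge_ordered: exact key matching. With an outer join, all five distinct
# timestamps are kept in order; 'ffill' propagates the previous value forward.
ordered = pd.merge_ordered(left, right, on="timestamp", how="outer",
                           fill_method="ffill")

# merge_asof: every row of the left table receives the row of the right
# table with the closest timestamp, so the result keeps the three left rows.
asof = pd.merge_asof(left, right, on="timestamp", direction="nearest")

print(ordered)
print(asof)
```

In the ordered merge, the first row has no earlier value of `b` to propagate, so it remains missing.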

# **Function for creating a column with isolated information from the timestamp**
- Use this function for creating a column containing isolated information from the timestamp: 
    - Value of year,
    - Value of month,
    - Value of day,
    - Value of hour,
    - etc.

In [47]:
def EXTRACT_TIMESTAMP_INFO (df, timestamp_tag_column, extracted_info, new_column_name = None):
    
    import pandas as pd
    import numpy as np
    
    #df: dataframe containing the timestamp.
    
    #timestamp_tag_column: declare as a string under quotes. This is the column from 
    #which we will extract the timestamp.
    
    #extracted_info: information to extract from the timestamp. The allowed values are:
    #'year', 'month', 'week', 'day', 'hour', 'minute', or 'second'
    
    #new_column_name: name (string) of the newly created column. 
    #If no value is provided, it will be equal to extracted_info.
    
    if (new_column_name is None):
        
        new_column_name = extracted_info
    
    # START: CONVERT ALL TIMESTAMPS/DATETIMES/STRINGS TO pandas.Timestamp OBJECTS.
    # This will prevent any compatibility problems.
    
    #The pd.Timestamp function can handle a single timestamp per call. Then, we must
    # loop through the series, and apply the function to each element.
    
    #1. Start a list to store the Pandas timestamps:
    timestamp_list = []
    
    #2. Loop through each element of the timestamp column, and apply the function
    # to guarantee that all elements are Pandas timestamps
    
    for timestamp in df[timestamp_tag_column]:
        #Access each element 'timestamp' of the series df[timestamp_tag_column]
        timestamp_list.append(pd.Timestamp(timestamp, unit = 'ns'))
    
    # https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Timestamp.html
    
    #Use extracted_info to select which attribute will be read from each timestamp.
    # Loop through each element of the dataset, access the timestamp, 
    # extract the information and store it in the correspondent position of the 
    # new column. Again, the attributes are read from a single Timestamp object,
    # not from the list. That is why we must loop through each of them:
    
    #start a list to store the values of the new column
    new_column_vals = []
    
    for i in range(len(df)):
        # i goes from zero to the index of the last element of the dataframe df
        # This element has index len(df) - 1
        # Append the values to the list according to the selected extracted_info
        
        if (extracted_info == 'year'):
            
            new_column_vals.append((timestamp_list[i]).year)
        
        elif (extracted_info == "month"):
            
            new_column_vals.append((timestamp_list[i]).month)
        
        elif (extracted_info == "week"):
            
            new_column_vals.append((timestamp_list[i]).week)
        
        elif (extracted_info == "day"):
            
            new_column_vals.append((timestamp_list[i]).day)
        
        elif (extracted_info == "hour"):
            
            new_column_vals.append((timestamp_list[i]).hour)
        
        elif (extracted_info == "minute"):
            
            new_column_vals.append((timestamp_list[i]).minute)
        
        elif (extracted_info == "second"):
            
            new_column_vals.append((timestamp_list[i]).second)
        
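        else:
            # Stop at the first invalid value: otherwise nothing would be appended, and
            # the list would not match the number of rows of the dataframe.
            raise ValueError("Invalid extracted information. Please select: year, month, week, day, hour, minute, or second.")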
    
    # Copy the list 'new_column_vals' to a new column of the dataframe:
    
    df[new_column_name] = new_column_vals
     
    # Pandas .head(Y) method results in a dataframe containing the first Y rows of the 
    # original dataframe. The default .head() is Y = 5. Print first 10 rows of the 
    # new dataframe:
    print("Timestamp information successfully extracted. Check dataset's 10 first rows:\n")
    print(df.head(10))
    
    #Now that the information was retrieved from all Timestamps, return the new
    #dataframe:
    
    return df
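
For comparison, the same pieces of information can be read from the whole column at once through the `.dt` accessor of a datetime series. This is a sketch of an alternative route, not the method used by `EXTRACT_TIMESTAMP_INFO`; the data and column names are illustrative assumptions.

```python
import pandas as pd

# Illustrative data: two timestamps stored as strings.
df = pd.DataFrame({"timestamp": ["2021-03-15 08:45:30", "2022-12-31 23:59:59"]})

# Vectorized conversion of the whole column to datetime objects.
ts = pd.to_datetime(df["timestamp"])

# Each attribute of the .dt accessor returns one value per row.
df["year"] = ts.dt.year
df["month"] = ts.dt.month
df["week"] = ts.dt.isocalendar().week   # ISO 8601 week number
df["day"] = ts.dt.day
df["hour"] = ts.dt.hour
df["minute"] = ts.dt.minute
df["second"] = ts.dt.second

print(df.head(10))
```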

# **Function for calculating differences between timestamps (timedeltas)**
- Use this function for creating a column containing the differences between two timestamp columns.

In [48]:
def CALCULATE_TIMEDELTA (df, timestamp_tag_column1, timestamp_tag_column2, timedelta_column_name  = None, returned_timedelta_unit = None):
    
    import pandas as pd
    import numpy as np
    
    #THIS FUNCTION PERFORMS THE OPERATION df[timestamp_tag_column1] - df[timestamp_tag_column2]
    #The declaration order will determine the sign of the output.
    
    #df: dataframe containing the two timestamp columns.
    
    #timestamp_tag_column1: string containing the name of the column with the timestamp
    # on the left (from which the right timestamp will be subtracted).
    
    #timestamp_tag_column2: string containing the name of the column with the timestamp
    # on the right, that will be subtracted from the timestamp on the left.
    
    #timedelta_column_name: name of the new column. If no value is provided, the default
    #name [timestamp_tag_column1]-[timestamp_tag_column2] will be given:
    
    if (timedelta_column_name is None):
        
        #apply the default name:
        timedelta_column_name = "[" + timestamp_tag_column1 + "]" + "-" + "[" + timestamp_tag_column2 + "]"
    
    #Pandas Timedelta class: applicable to timedelta objects
    # https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Timedelta.html
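    #The delta attribute from the Timedelta class returns the timedelta in
    #nanoseconds, guaranteeing the internal compatibility. It was deprecated in Pandas 1.5
    #and removed in Pandas 2.0; the value attribute returns the same integer: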
    # https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Timedelta.delta.html#pandas.Timedelta.delta
    
    #returned_timedelta_unit: unit of the new column. If no value is provided, the unit will be
    # considered as nanoseconds. 
    # POSSIBLE VALUES FOR THE TIMEDELTA UNIT:
    #'year', 'month', 'day', 'hour', 'minute', 'second'.
    
    # START: CONVERT ALL TIMESTAMPS/DATETIMES/STRINGS TO pandas.Timestamp OBJECTS.
    # This will prevent any compatibility problems.
    
    #The pd.Timestamp function can handle a single timestamp per call. Then, we must
    # loop through the series, and apply the function to each element.
    
    #1. Start a list to store the Pandas timestamps:
    timestamp_list = []
    
    #2. Loop through each element of the timestamp column, and apply the function
    # to guarantee that all elements are Pandas timestamps
    
    for timestamp in df[timestamp_tag_column1]:
        #Access each element 'timestamp' of the series df[timestamp_tag_column1]
        timestamp_list.append(pd.Timestamp(timestamp, unit = 'ns'))
    
    #3. Create a column in the dataframe that will store the timestamps.
    # Simply copy the list as the column:
    df[timestamp_tag_column1] = timestamp_list
    
    #Repeat these steps for the other column (timestamp_tag_column2):
    # Restart the list, loop through all the column, and apply the pd.Timestamp function
    # to each element, individually:
    timestamp_list = []
    
    for timestamp in df[timestamp_tag_column2]:
        #Access each element 'timestamp' of the series df[timestamp_tag_column2]
        timestamp_list.append(pd.Timestamp(timestamp, unit = 'ns'))
    
    df[timestamp_tag_column2] = timestamp_list
    
    # Pandas Timestamps can be subtracted to result into a Pandas Timedelta.
    # We will apply the delta method from Pandas Timedeltas.
    
    #4. Create a timedelta object as the difference between the timestamps:
    
    # NOTICE: Even though a list could not be submitted to direct operations like
    # sum, subtraction and multiplication, the series and NumPy arrays can. When we
    # copied the list as a new column on the dataframes, we converted the lists to series
    # called df[timestamp_tag_column1] and df[timestamp_tag_column2]. These two series now
    # can be submitted to direct operations.
    
    timedelta_obj = df[timestamp_tag_column1] - df[timestamp_tag_column2]
    
    #This timedelta_obj is a series of timedelta64 objects. The Pandas Timedelta function
    # can process only one element of the series in each call. Then, we must loop through
    # the series to obtain the integer values in nanoseconds. Even though this loop may 
    # look unnecessary, it converts each element individually to nanoseconds, guaranteeing
    # the internal compatibility. Then, no errors due to manipulation of timestamps with
    # different resolutions, or due to the presence of global variables, etc. will happen.
    # This is a conservative way to manipulate timedeltas.
    
    #5. Create an empty list to store the timedeltas in nanoseconds
    TimedeltaList = []
    
    #6. Loop through each element of timedelta_obj and convert it to nanoseconds.
    # Both the pd.Timedelta constructor and its value attribute apply to a single object.
    # Timedelta.delta was deprecated in pandas 1.5 and removed in pandas 2.0; the value
    # attribute returns the same integer number of nanoseconds.
    #len(timedelta_obj) is the total of timedeltas present.
    
    for i in range(len(timedelta_obj)):
        
        #This loop goes from i = 0 to i = [len(timedelta_obj) - 1], so that
        #all positions are evaluated.
        
        #.iloc selects the i-th element by position, so the loop also works when the
        # dataframe index is not the default 0, 1, 2, ...
        TimedeltaList.append(pd.Timedelta(timedelta_obj.iloc[i]).value)
    
    #The loop is used because the pd.Timedelta constructor handles a single object
    # in each call or iteration.
    
    #Now the list contains the timedeltas as integer numbers of nanoseconds, so we are
    # dealing with numbers, not with timestamps.
    # Even though some steps seem unnecessary, they are added to avoid hard-to-identify
    # bugs that result from a timestamp assigned to the wrong type of object.
    
    #Unlike series (columns) and arrays, a list cannot be directly submitted to
    # operations like sum, division, and multiplication. We would have to loop through
    # each element, as we did with the Pandas Timestamp and Timedelta constructors,
    # which only manipulate one object per call.
    # For simpler operations like division, we can convert the list to a NumPy array and
    # submit the entire array to the operation at once, avoiding slower iterative methods.
    
    #Convert the timedelta list to a NumPy array:
    # Notice that we could have created a column with the TimedeltaList, so that it would
    # be converted to a series. However, we have not yet defined the name of the
    # new column. So, it is easier to simply convert it to a NumPy array, and then copy
    # the array as a new column.
    TimedeltaList = np.array(TimedeltaList)
    
    #Convert the array to the desired unit by dividing it by the proper factor:
    
    if (returned_timedelta_unit == 'year'):
        
        #1. Convert the list to seconds (1 s = 10**9 ns, where 10**9 represents
        #the potentiation operation in Python, i.e., 10^9. e.g. 10**2 = 100):
        TimedeltaList = TimedeltaList / (10**9) #in seconds
        
        #2. Convert it to minutes (1 min = 60 s):
        TimedeltaList = TimedeltaList / 60.0 #in minutes
        
        #3. Convert it to hours (1 h = 60 min):
        TimedeltaList = TimedeltaList / 60.0 #in hours
        
        #4. Convert it to days (1 day = 24 h):
        TimedeltaList = TimedeltaList / 24.0 #in days
        
        #5. Convert it to years. 1 year = 365 days + 6 h = (365 + 6/24) days
        # = 365.25 days
        
        TimedeltaList = TimedeltaList / (365.25) #in years
        
        #In Python 3, the / operator always performs float division; the .0 is optional.
        
        print("Returned timedelta in years. Considered 1 year = 365 days + 6 h.")
    
    
    elif (returned_timedelta_unit == 'month'):
        
        #1. Convert the list to seconds (1 s = 10**9 ns, where 10**9 represents
        #the potentiation operation in Python, i.e., 10^9. e.g. 10**2 = 100):
        TimedeltaList = TimedeltaList / (10**9) #in seconds
        
        #2. Convert it to minutes (1 min = 60 s):
        TimedeltaList = TimedeltaList / 60.0 #in minutes
        
        #3. Convert it to hours (1 h = 60 min):
        TimedeltaList = TimedeltaList / 60.0 #in hours
        
        #4. Convert it to days (1 day = 24 h):
        TimedeltaList = TimedeltaList / 24.0 #in days
        
        #5. Convert it to months. Consider 1 month = 30 days
        
        TimedeltaList = TimedeltaList / (30.0) #in months
        
        #In Python 3, the / operator always performs float division; the .0 is optional.
        
        print("Returned timedelta in months. Considered 1 month = 30 days.")
        
    
    elif (returned_timedelta_unit == 'day'):
        
        #1. Convert the list to seconds (1 s = 10**9 ns, where 10**9 represents
        #the potentiation operation in Python, i.e., 10^9. e.g. 10**2 = 100):
        TimedeltaList = TimedeltaList / (10**9) #in seconds
        
        #2. Convert it to minutes (1 min = 60 s):
        TimedeltaList = TimedeltaList / 60.0 #in minutes
        
        #3. Convert it to hours (1 h = 60 min):
        TimedeltaList = TimedeltaList / 60.0 #in hours
        
        #4. Convert it to days (1 day = 24 h):
        TimedeltaList = TimedeltaList / 24.0 #in days
        
        #In Python 3, the / operator always performs float division; the .0 is optional.
        
        print("Returned timedelta in days.")
        
    
    elif (returned_timedelta_unit == 'hour'):
        
        #1. Convert the list to seconds (1 s = 10**9 ns, where 10**9 represents
        #the potentiation operation in Python, i.e., 10^9. e.g. 10**2 = 100):
        TimedeltaList = TimedeltaList / (10**9) #in seconds
        
        #2. Convert it to minutes (1 min = 60 s):
        TimedeltaList = TimedeltaList / 60.0 #in minutes
        
        #3. Convert it to hours (1 h = 60 min):
        TimedeltaList = TimedeltaList / 60.0 #in hours
        
        #In Python 3, the / operator always performs float division; the .0 is optional.
        
        print("Returned timedelta in hours [h].")
    

    elif (returned_timedelta_unit == 'minute'):
        
        #1. Convert the list to seconds (1 s = 10**9 ns, where 10**9 represents
        #the potentiation operation in Python, i.e., 10^9. e.g. 10**2 = 100):
        TimedeltaList = TimedeltaList / (10**9) #in seconds
        
        #2. Convert it to minutes (1 min = 60 s):
        TimedeltaList = TimedeltaList / 60.0 #in minutes
        
        #In Python 3, the / operator always performs float division; the .0 is optional.
        
        print("Returned timedelta in minutes [min].")
        
        
    elif (returned_timedelta_unit == 'second'):
        
        #1. Convert the list to seconds (1 s = 10**9 ns, where 10**9 represents
        #the potentiation operation in Python, i.e., 10^9. e.g. 10**2 = 100):
        TimedeltaList = TimedeltaList / (10**9) #in seconds
        
        #In Python 3, the / operator always performs float division.
        
        print("Returned timedelta in seconds [s].")
        
        
    else:
        
        returned_timedelta_unit = 'ns'
        print("No unit or invalid unit provided for timedelta. Then, returned timedelta in nanoseconds (1s = 10^9 ns).")
        
        #In case no unit or an invalid value is provided,
        #the timedelta is kept in nanoseconds.
    
    #Finally, create a column in the dataframe named as timedelta_column_name 
    # with the elements of TimedeltaList converted to the correct unit of time:
    
    #Append the selected unit as a suffix on the timedelta_column_name:
    timedelta_column_name = timedelta_column_name + "_" + returned_timedelta_unit
    
    df[timedelta_column_name] = TimedeltaList
      
    # Pandas .head(Y) method results in a dataframe containing the first Y rows of the 
    # original dataframe. The default .head() is Y = 5. Print first 10 rows of the 
    # new dataframe:
    print("Timedeltas successfully calculated. Check dataset's 10 first rows:\n")
    print(df.head(10))
    
    #Finally, return the dataframe with the new column:
    
    return df
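
The function above converts and subtracts the timestamps one element at a time. For comparison, the same difference between two columns can be sketched with vectorized pandas operations. This is only a minimal illustration, not a replacement for the function: the dataframe and the column names `start_time`, `end_time` and `timedelta_hour` are hypothetical.

```python
import pandas as pd

# Hypothetical sample data; the column names are illustrative only.
df = pd.DataFrame({
    "end_time": ["2021-01-02 00:00:00", "2021-01-01 12:00:00"],
    "start_time": ["2021-01-01 00:00:00", "2021-01-01 06:00:00"],
})

# pd.to_datetime converts a whole column of strings to timestamps at once.
end = pd.to_datetime(df["end_time"])
start = pd.to_datetime(df["start_time"])

# Subtracting two datetime series yields a series of timedeltas.
delta = end - start

# .dt.total_seconds() returns the duration of each element as a float (seconds).
# Dividing by 3600 converts seconds to hours.
df["timedelta_hour"] = delta.dt.total_seconds() / 3600.0

print(df)
```

Here the conversion to hours is a single division applied to the whole series, so no explicit loop is required.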

# **Function for calculating differences between successive timestamps (delay)**
- Use this function to create a column containing the differences between successive timestamps from the same column.
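
As a quick illustration of what "delay" means here, the sketch below computes the difference between each timestamp and the following one with `Series.shift`. It is a simplified, hypothetical example (the dataframe and the column names `timestamp` and `time_delay_minute` are assumptions), not the function defined in the next cell.

```python
import pandas as pd

# Hypothetical sample data with irregularly spaced measurements.
df = pd.DataFrame({
    "timestamp": [
        "2021-01-01 00:00:00",
        "2021-01-01 00:10:00",
        "2021-01-01 00:40:00",
    ]
})

ts = pd.to_datetime(df["timestamp"])

# shift(-1) aligns each row with the next timestamp.
# The last row has no following measurement, so its delay is missing (NaT).
delay = ts.shift(-1) - ts

# Convert to minutes and store zero in the last row, where there is no delay yet.
df["time_delay_minute"] = (delay.dt.total_seconds() / 60.0).fillna(0.0)

# Average delay, excluding the last row (which has no following measurement).
avg_delay = delay.dt.total_seconds().iloc[:-1].mean() / 60.0

print(df)
print(f"Average delay = {avg_delay} minute")
```

The last row is excluded from the average because its zero is only a placeholder, not a measured delay.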

In [3]:
def CALCULATE_DELAY (df, timestamp_tag_column, new_timedelta_column_name  = None, returned_timedelta_unit = None, return_avg_delay = True):
    
    import pandas as pd
    import numpy as np
    
    #THIS FUNCTION CALCULATES THE DIFFERENCE (timedelta - delay) BETWEEN TWO SUCCESSIVE
    # Timestamps from the same column
    
    #df: dataframe containing the timestamp column.
    #timestamp_tag_column: string containing the name of the column with the timestamps
    
    #new_timedelta_column_name: name of the new column. If no value is provided, the default
    #name "time_delay" will be given. The selected time unit is appended as a suffix.
    
    # return_avg_delay = True will print and return the value of the average delay.
    # return_avg_delay = False will omit this information
    
    if (new_timedelta_column_name is None):
        
        #apply the default name:
        new_timedelta_column_name = "time_delay"
    
    #Pandas Timedelta class: applicable to timedelta objects
    # https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Timedelta.html
    #The delta attribute from the Timedelta class returns the timedelta in nanoseconds
    #(deprecated in pandas 1.5 and removed in pandas 2.0, where the value attribute
    #gives the same integer), guaranteeing the internal compatibility:
    # https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Timedelta.delta.html#pandas.Timedelta.delta
    
    #returned_timedelta_unit: unit of the new column. If no value is provided, the unit will be
    # considered as nanoseconds. 
    # POSSIBLE VALUES FOR THE TIMEDELTA UNIT:
    #'year', 'month', 'day', 'hour', 'minute', 'second'.
    
    # START: CONVERT ALL TIMESTAMPS/DATETIMES/STRINGS TO pandas.Timestamp OBJECTS.
    # This will prevent any compatibility problems.
    
    #The pd.Timestamp constructor handles a single timestamp per call. Then, we must
    # loop through the series, and apply it to each element.
    
    #1. Start a list to store the Pandas timestamps:
    timestamp_list = []
    
    #2. Loop through each element of the timestamp column, and apply the function
    # to guarantee that all elements are Pandas timestamps
    
    for timestamp in df[timestamp_tag_column]:
        #Access each element 'timestamp' of the series df[timestamp_tag_column]
        timestamp_list.append(pd.Timestamp(timestamp, unit = 'ns'))
    
    #3. Create a column in the dataframe that will store the timestamps.
    # Simply copy the list as the column:
    timestamp_tag_column1 = timestamp_tag_column + "_ts"
    df[timestamp_tag_column1] = timestamp_list
    
    # Now, let's create a list of the following timestamps
    following_timestamp = []
    # Let's skip the index 0, correspondent to the first timestamp:
    
    for i in range (1, len(timestamp_list)):
        
        # this loop goes from i = 1 to i = len(timestamp_list) - 1, the last index
        # of the list. If we simply declared range (len(timestamp_list)), the loop
        # will start from 0, the default
        
        #append the element from timestamp_list to following_timestamp:
        following_timestamp.append(timestamp_list[i])
    
    # Notice that this list has one element less than the original list, because we started
    # copying from index 1, not 0. Therefore, let's repeat the last element of timestamp_list.
    # The index -1 always refers to the last element, even when the loop above did not run
    # (dataframe with a single row), in which case the variable i would not be defined:
    following_timestamp.append(timestamp_list[-1])
    # Now, let's store it into a column (series) of the dataframe:
    timestamp_tag_column2 = timestamp_tag_column + "_ts_delayed"
    df[timestamp_tag_column2] = following_timestamp
    
    # Pandas Timestamps can be subtracted to result in a Pandas Timedelta.
    # The resulting Timedeltas will then be converted to nanoseconds.
    
    #4. Create a timedelta object as the difference between the timestamps:
    
    # NOTICE: Even though a list could not be submitted to direct operations like
    # sum, subtraction and multiplication, the series and NumPy arrays can. When we
    # copied the list as a new column on the dataframes, we converted the lists to series
    # called df[timestamp_tag_column1] and df[timestamp_tag_column2]. These two series now
    # can be submitted to direct operations.
    
    # Delay = next measurement (tag_column2, timestamp higher) - current measurement
    # (tag_column1, timestamp lower). Since we repeated the last timestamp,
    # in the last row it will be subtracted from itself, resulting in zero.
    # This is expected, since there is no following measurement yet
    timedelta_obj = df[timestamp_tag_column2] - df[timestamp_tag_column1]
    
    #This timedelta_obj is a series of timedelta64 objects. The pd.Timedelta constructor
    # processes only one element per call, so we loop through the series to obtain the
    # integer values in nanoseconds. Even though this loop may look unnecessary, working
    # with plain nanosecond counts avoids errors caused by manipulating timestamps with
    # different resolutions.
    
    #5. Create an empty list to store the timedeltas in nanoseconds
    TimedeltaList = []
    
    #6. Loop through each element of timedelta_obj and convert it to nanoseconds.
    # Both the pd.Timedelta constructor and its value attribute apply to a single object.
    # Timedelta.delta was deprecated in pandas 1.5 and removed in pandas 2.0; the value
    # attribute returns the same integer number of nanoseconds.
    #len(timedelta_obj) is the total of timedeltas present.
    
    for i in range(len(timedelta_obj)):
        
        #This loop goes from i = 0 to i = [len(timedelta_obj) - 1], so that
        #all positions are evaluated.
        
        #.iloc selects the i-th element by position, so the loop also works when the
        # dataframe index is not the default 0, 1, 2, ...
        TimedeltaList.append(pd.Timedelta(timedelta_obj.iloc[i]).value)
    
    #The loop is used because the pd.Timedelta constructor handles a single object
    # in each call or iteration.
    
    #Now the list contains the timedeltas as integer numbers of nanoseconds, so we are
    # dealing with numbers, not with timestamps.
    # Even though some steps seem unnecessary, they are added to avoid hard-to-identify
    # bugs that result from a timestamp assigned to the wrong type of object.
    
    #Unlike series (columns) and arrays, a list cannot be directly submitted to
    # operations like sum, division, and multiplication. We would have to loop through
    # each element, as we did with the Pandas Timestamp and Timedelta constructors,
    # which only manipulate one object per call.
    # For simpler operations like division, we can convert the list to a NumPy array and
    # submit the entire array to the operation at once, avoiding slower iterative methods.
    
    #Convert the timedelta list to a NumPy array:
    # Notice that we could have created a column with the TimedeltaList, so that it would
    # be converted to a series. However, we have not yet defined the name of the
    # new column. So, it is easier to simply convert it to a NumPy array, and then copy
    # the array as a new column.
    TimedeltaList = np.array(TimedeltaList)
    
    #Convert the array to the desired unit by dividing it by the proper factor:
    
    if (returned_timedelta_unit == 'year'):
        
        #1. Convert the list to seconds (1 s = 10**9 ns, where 10**9 represents
        #the potentiation operation in Python, i.e., 10^9. e.g. 10**2 = 100):
        TimedeltaList = TimedeltaList / (10**9) #in seconds
        
        #2. Convert it to minutes (1 min = 60 s):
        TimedeltaList = TimedeltaList / 60.0 #in minutes
        
        #3. Convert it to hours (1 h = 60 min):
        TimedeltaList = TimedeltaList / 60.0 #in hours
        
        #4. Convert it to days (1 day = 24 h):
        TimedeltaList = TimedeltaList / 24.0 #in days
        
        #5. Convert it to years. 1 year = 365 days + 6 h = (365 + 6/24) days
        # = 365.25 days
        
        TimedeltaList = TimedeltaList / (365.25) #in years
        
        #In Python 3, the / operator always performs float division; the .0 is optional.
        
        print("Returned timedelta in years. Considered 1 year = 365 days + 6 h.")
    
    
    elif (returned_timedelta_unit == 'month'):
        
        #1. Convert the list to seconds (1 s = 10**9 ns, where 10**9 represents
        #the potentiation operation in Python, i.e., 10^9. e.g. 10**2 = 100):
        TimedeltaList = TimedeltaList / (10**9) #in seconds
        
        #2. Convert it to minutes (1 min = 60 s):
        TimedeltaList = TimedeltaList / 60.0 #in minutes
        
        #3. Convert it to hours (1 h = 60 min):
        TimedeltaList = TimedeltaList / 60.0 #in hours
        
        #4. Convert it to days (1 day = 24 h):
        TimedeltaList = TimedeltaList / 24.0 #in days
        
        #5. Convert it to months. Consider 1 month = 30 days
        
        TimedeltaList = TimedeltaList / (30.0) #in months
        
        #In Python 3, the / operator always performs float division; the .0 is optional.
        
        print("Returned timedelta in months. Considered 1 month = 30 days.")
        
    
    elif (returned_timedelta_unit == 'day'):
        
        #1. Convert the list to seconds (1 s = 10**9 ns, where 10**9 represents
        #the potentiation operation in Python, i.e., 10^9. e.g. 10**2 = 100):
        TimedeltaList = TimedeltaList / (10**9) #in seconds
        
        #2. Convert it to minutes (1 min = 60 s):
        TimedeltaList = TimedeltaList / 60.0 #in minutes
        
        #3. Convert it to hours (1 h = 60 min):
        TimedeltaList = TimedeltaList / 60.0 #in hours
        
        #4. Convert it to days (1 day = 24 h):
        TimedeltaList = TimedeltaList / 24.0 #in days
        
        #In Python 3, the / operator always performs float division; the .0 is optional.
        
        print("Returned timedelta in days.")
        
    
    elif (returned_timedelta_unit == 'hour'):
        
        #1. Convert the list to seconds (1 s = 10**9 ns, where 10**9 represents
        #the potentiation operation in Python, i.e., 10^9. e.g. 10**2 = 100):
        TimedeltaList = TimedeltaList / (10**9) #in seconds
        
        #2. Convert it to minutes (1 min = 60 s):
        TimedeltaList = TimedeltaList / 60.0 #in minutes
        
        #3. Convert it to hours (1 h = 60 min):
        TimedeltaList = TimedeltaList / 60.0 #in hours
        
        #In Python 3, the / operator always performs float division; the .0 is optional.
        
        print("Returned timedelta in hours [h].")
    

    elif (returned_timedelta_unit == 'minute'):
        
        #1. Convert the list to seconds (1 s = 10**9 ns, where 10**9 represents
        #the potentiation operation in Python, i.e., 10^9. e.g. 10**2 = 100):
        TimedeltaList = TimedeltaList / (10**9) #in seconds
        
        #2. Convert it to minutes (1 min = 60 s):
        TimedeltaList = TimedeltaList / 60.0 #in minutes
        
        #In Python 3, the / operator always performs float division; the .0 is optional.
        
        print("Returned timedelta in minutes [min].")
        
        
    elif (returned_timedelta_unit == 'second'):
        
        #1. Convert the list to seconds (1 s = 10**9 ns, where 10**9 represents
        #the potentiation operation in Python, i.e., 10^9. e.g. 10**2 = 100):
        TimedeltaList = TimedeltaList / (10**9) #in seconds
        
        #In Python 3, the / operator always performs float division.
        
        print("Returned timedelta in seconds [s].")
        
        
    else:
        
        returned_timedelta_unit = 'ns'
        print("No unit or invalid unit provided for timedelta. Then, returned timedelta in nanoseconds (1s = 10^9 ns).")
        
        #In case no unit or an invalid value is provided,
        #the timedelta is kept in nanoseconds.
    
    #Finally, create a column in the dataframe named as new_timedelta_column_name 
    # with the elements of TimedeltaList converted to the correct unit of time:
    
    #Append the selected unit as a suffix on the new_timedelta_column_name:
    new_timedelta_column_name = new_timedelta_column_name + "_" + returned_timedelta_unit
    
    df[new_timedelta_column_name] = TimedeltaList
      
    # Pandas .head(Y) method results in a dataframe containing the first Y rows of the 
    # original dataframe. The default .head() is Y = 5. Print first 10 rows of the 
    # new dataframe:
    print("Time delays successfully calculated. Check dataset's 10 first rows:\n")
    print(df.head(10))
    
    if (return_avg_delay == True):
        
        # Let's calculate the average delay, print and return it:
        # Firstly, we must remove the last element of the TimedeltaList.
        # Remember that this element is 0 because there is no delay. It was added to allow
        # the element-wise operations between the series.
        # Since the values were already copied to the dataframe, there is no risk of
        # losing information.
        
        # TimedeltaList is now a NumPy array, which does not support the del statement.
        # Slicing with [:-1] keeps all elements but the last one:
        delays_without_last = TimedeltaList[:-1]
        
        # Now we calculate the average value:
        avg_delay = np.average(delays_without_last)
        
        print(f"Average delay = {avg_delay} {returned_timedelta_unit}")
        
        # Return the dataframe and the average value:
        return df, avg_delay
    
    #Finally, return the dataframe with the new column:
    
    else: 
        # Return only the dataframe
        return df
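
The unit-conversion branches used in the functions above divide the nanosecond values step by step (seconds, minutes, hours, and so on). As a compact sketch of the same arithmetic, the factors can be stored in a dictionary and applied in a single division. The names `NS_PER_UNIT` and `convert_nanoseconds` are hypothetical helpers introduced only for this illustration; the approximations (1 month = 30 days, 1 year = 365.25 days) are the ones stated in the functions above, and results may differ from the step-by-step divisions by floating-point rounding.

```python
# Number of nanoseconds in each supported unit.
# 1 month = 30 days and 1 year = 365.25 days are approximations.
NS_PER_UNIT = {
    "second": 10**9,
    "minute": 60 * 10**9,
    "hour": 3600 * 10**9,
    "day": 24 * 3600 * 10**9,
    "month": 30 * 24 * 3600 * 10**9,
    "year": 365.25 * 24 * 3600 * 10**9,
}


def convert_nanoseconds(values_ns, unit=None):
    """Convert a sequence of nanosecond values to the requested unit.

    Returns the converted values and the name of the unit actually used.
    If the unit is None or not recognized, the values stay in nanoseconds.
    """
    factor = NS_PER_UNIT.get(unit)
    if factor is None:
        return list(values_ns), "ns"
    return [value / factor for value in values_ns], unit


values, used_unit = convert_nanoseconds([3600 * 10**9, 7200 * 10**9], "hour")
print(values, used_unit)
```

Returning the unit together with the values makes it straightforward to build the column-name suffix, as the functions above do.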

# **Function for adding or subtracting a timedelta from a timestamp**
- Use this function to create a column containing the timestamps shifted by a fixed timedelta value (offset).
- Set `timedelta` as a negative value to subtract this timedelta from the timestamp (as explained in the comments).
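
The core operation can be sketched in a few lines: a single `pd.Timedelta` is added to a whole column of timestamps, and a negative value subtracts time. The dataframe and the column names `timestamp`, `timestamp_plus` and `timestamp_minus` are hypothetical and serve only as an illustration.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "timestamp": ["2021-01-01 00:00:00", "2021-01-31 18:00:00"]
})

ts = pd.to_datetime(df["timestamp"])

# A positive timedelta moves the timestamps forward: here, 2.5 hours.
df["timestamp_plus"] = ts + pd.Timedelta(2.5, unit="h")

# A negative timedelta moves the timestamps backward: here, 1 day.
df["timestamp_minus"] = ts + pd.Timedelta(-1, unit="D")

print(df)
```

Because the offset is a fixed duration, it is applied identically to every row of the column.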

In [49]:
def ADD_TIMEDELTA (df, timestamp_tag_column, timedelta, new_timestamp_col  = None, timedelta_unit = None):
    
    import pandas as pd
    import numpy as np
    
    #THIS FUNCTION ADDS A FIXED TIMEDELTA (difference of time
    # or offset) to a timestamp.
    
    #df: dataframe containing the timestamp column.
    
    #timestamp_tag_column: string containing the name of the column with the timestamp
    # to which the timedelta will be added to.
    
    #timedelta: numeric value of the timedelta.
    # WARNING: simply input a numeric value, not a string with unit. e.g. timedelta = 2.4
    # If you want to subtract a timedelta, input a negative value. e.g. timedelta = - 2.4
    
    #new_timestamp_col: name of the new column containing the obtained timestamp. 
    # If no value is provided, the default name [timestamp_tag_column]+[timedelta] 
    # will be given (at the end of the code, after we created the timedelta object 
    # with correct units)
    
    #Pandas Timedelta class: applicable to timedelta objects
    # https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Timedelta.html
    #The delta attribute from the Timedelta class returns the timedelta in nanoseconds
    #(deprecated in pandas 1.5 and removed in pandas 2.0); it is not needed in this function:
    # https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Timedelta.delta.html#pandas.Timedelta.delta
    
    #timedelta_unit: unit of the timedelta interval. If no value is provided, 
    # the unit will be considered 'ns' (default). Possible values are:
    #'day', 'hour', 'minute', 'second', 'ns'.
    
    if (timedelta_unit is None):
        
        timedelta_unit = 'ns'
    
    # Pandas Timedelta does not support units of years or months, since these values
    # are ambiguous (e.g. a month may have 28 to 31 days, so an approximation would
    # be necessary).
    
    # START: CONVERT ALL TIMESTAMPS/DATETIMES/STRINGS TO pandas.Timestamp OBJECTS.
    # This will prevent any compatibility problems.
    
    #The pd.Timestamp constructor handles a single timestamp per call. Then, we must
    # loop through the series, and apply it to each element.
    
    #1. Start a list to store the Pandas timestamps:
    timestamp_list = []
    
    #2. Loop through each element of the timestamp column, and apply the function
    # to guarantee that all elements are Pandas timestamps
    
    for timestamp in df[timestamp_tag_column]:
        #Access each element 'timestamp' of the series df[timestamp_tag_column]
        timestamp_list.append(pd.Timestamp(timestamp, unit = 'ns'))
    
    #3. Create a column in the dataframe that will store the timestamps.
    # Simply copy the list as the column:
    df[timestamp_tag_column] = timestamp_list
    
    # The Pandas Timestamp can be directly added to a Pandas Timedelta.
 
    #Dictionary for converting the timedelta_unit to the Pandas encoding used by the
    # Timedelta constructor. To access the element of a dictionary d = {"key": element},
    # simply declare d['key'], as if you were accessing the column of a dataframe. Here,
    # the key is the argument of the function, whereas the element is the corresponding
    # Pandas encoding. This dictionary simplifies the search for the proper time
    # encoding: depending on the Pandas method, the encoding may be 'd', "D" or "day"
    # for day, for instance. So, we avoid having to check the whole documentation by
    # creating a simpler common encoding for the functions in this notebook.
    # A unit that is not a key of this dictionary will raise a KeyError.
    
    unit_dict = {
        
        'day': 'd',
        'hour': 'h',
        'minute': 'min',
        'second': 's',
        'ns': 'ns'
        
    }
    
    #Create the Pandas timedelta object from the timedelta value and the selected
    # time units:
    timedelta = pd.Timedelta(timedelta, unit_dict[timedelta_unit])
    
    #A pandas Timedelta object has total compatibility with a pandas
    #Timestamp, so we can simply add the Timedelta to the Timestamp to obtain a new 
    #corrected timestamp.
    # Again, notice that the timedelta can be positive (sum of time), or negative
    # (subtraction of time).
    
    #Now, add the timedelta to the timestamp, and store it into a proper list/series:
    new_timestamps = df[timestamp_tag_column] + timedelta
     
    #Finally, create a column in the dataframe named as new_timestamp_col
    #and store the new timestamps into it
    
    if (new_timestamp_col is None):
        
        #apply the default name:
        new_timestamp_col = "[" + timestamp_tag_column + "]" + "+" + "[" + str(timedelta) + "]"
        #The str function converts the timedelta object to a string, so it can be
        #concatenated in this line of code.
        #Notice that we defined the name of the new column at the end of the code so
        #that we already converted the 'timedelta' to a Timedelta object containing
        #the correct units.
    
    df[new_timestamp_col] = new_timestamps
      
    # Pandas .head(Y) method results in a dataframe containing the first Y rows of the 
    # original dataframe. The default .head() is Y = 5. Print first 10 rows of the 
    # new dataframe:
    print("Timedeltas successfully added. Check dataset's 10 first rows:\n")
    print(df.head(10))
    
    #Finally, return the dataframe with the new column:
    
    return df

# **Function for concatenating (SQL UNION) multiple dataframes**
- Vertical concatenation of the dataframes.
- Equivalent to SQL Union: vertical stack/append of the tables.
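
To make the difference between the two join types concrete, the sketch below stacks two small dataframes with `pd.concat`. The dataframes and their column names (`id`, `value`, `extra`) are hypothetical; only `pd.concat` and its `ignore_index`, `sort` and `join` parameters come from pandas.

```python
import pandas as pd

# Two hypothetical dataframes; df2 has one column that df1 lacks.
df1 = pd.DataFrame({"id": [1, 2], "value": [10.0, 20.0]})
df2 = pd.DataFrame({"id": [3], "value": [30.0], "extra": ["x"]})

# Outer join (the default): all columns are kept, and rows coming from a
# dataframe without a given column receive a missing value in it.
outer_df = pd.concat([df1, df2], ignore_index=True, sort=False)

# Inner join: only the columns present in every dataframe are kept.
inner_df = pd.concat([df1, df2], ignore_index=True, join="inner")

print(outer_df)
print(inner_df)
```

In both cases all three rows are kept; the join type only decides which columns survive.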

In [50]:
def UNION_DATAFRAMES (list_of_dataframes, ignore_index_on_union = True, sort_values_on_union = True, union_join_type = None):
    
    import pandas as pd
    #union_join_type can be 'inner' to perform an inner join, which keeps only the
    #columns that are present in all of the dataframes.
    #The default (None) is 'outer': the dataframes will be stacked on the columns with
    #same names but, in case there is no correspondence, the row will present a missing
    #value for the columns which are not present in one of the dataframes.
    
    #list_of_dataframes must be a list containing the dataframe objects
    # example: list_of_dataframes = [df1, df2, df3, df4]
    #Notice that the dataframes are objects, not strings. Therefore, they should not
    # be declared inside quotes.
    # There is no limit of dataframes. In this example, we will concatenate 4 dataframes.
    # If list_of_dataframes = [df1, df2, df3] we would concatenate 3, and if
    # list_of_dataframes = [df1, df2, df3, df4, df5] we would concatenate 5 dataframes.
    
    #The other parameters are the same as those of the Pandas concat function.
    # ignore_index_on_union = ignore_index;
    # sort_values_on_union = sort (sorts the non-concatenation axis, i.e. the column
    # labels; it does not sort the row values)
    # union_join_type = join
    #Check Datacamp course Joining Data with pandas, Chap.3, 
    # Advanced Merging and Concatenating
    
    if (union_join_type == 'inner'):
        
        print("Warning: concatenating dataframes using the \'inner\' join method, which keeps only the columns common to all dataframes.")
        concat_df = pd.concat(list_of_dataframes, ignore_index = ignore_index_on_union, sort = sort_values_on_union, join = union_join_type)
    
    else:
        
        #In case None or an invalid value is provided, use the default 'outer', by simply
        # not declaring the 'join':
        concat_df = pd.concat(list_of_dataframes, ignore_index = ignore_index_on_union, sort = sort_values_on_union)
    
    # Pandas .head(Y) method results in a dataframe containing the first Y rows of the 
    # original dataframe. The default .head() is Y = 5. Print first 10 rows of the 
    # new dataframe:
    print("Dataframes successfully concatenated. Check the 10 first rows of new dataframe:\n")
    print(concat_df.head(10))
    
    #Now return the concatenated dataframe:
    
    return concat_df

# **Function for exporting the dataframe**
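
For the local (non-S3) case, exporting reduces to building the file path and calling `DataFrame.to_csv`. The sketch below is a minimal illustration under stated assumptions: the dataframe, the file name `exported_data.csv` and the use of a temporary directory are hypothetical choices made so the example can run anywhere.

```python
import os
import tempfile

import pandas as pd

# Hypothetical dataframe to be exported.
df = pd.DataFrame({"id": [1, 2], "value": [10.5, 20.5]})

# A temporary directory stands in for the export directory in this sketch.
export_dir = tempfile.mkdtemp()

# os.path.join builds the full path from the directory and the file name.
file_path = os.path.join(export_dir, "exported_data.csv")

# index=False prevents the dataframe index from being written as a column.
df.to_csv(file_path, index=False)

# Read the file back to confirm that the export worked.
reloaded = pd.read_csv(file_path)
print(reloaded)
```

Reading the file back is a simple check that the columns and values were written as expected.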

In [None]:
def export_dataframe (dataframe_to_be_exported, new_file_name_with_csv_extension, file_directory_path = None, export_to_s3_bucket = False, s3_bucket_name = None, desired_s3_file_name_with_csv_extension = None):
    
    import os
    import logging
    import boto3
    #boto3 is the AWS SDK for Python, used here to access S3
    from botocore.exceptions import ClientError
    import pandas as pd
    
    ## WARNING: all file extensions should be .csv for this function
    
    # file_directory_path - (string, in quotes): input the path of the directory 
    # (e.g. folder path) where the file will be stored. e.g. file_directory_path = "/" 
    # or file_directory_path = "/folder"
    # If you want to export the file to AWS S3, this parameter will have no effect.
    # In this case, you can set file_directory_path = None

    # new_file_name_with_csv_extension - (string, in quotes): input the name of the 
    # file with the extension. e.g. new_file_name_with_csv_extension = "file.csv"
    
    # export_to_s3_bucket = False. Alternatively, set as True to export the file to an
    # AWS S3 Bucket.

    ## The following parameters have effect only when export_to_s3_bucket == True:

    # s3_bucket_name: this parameter is mandatory to access an AWS S3 bucket. Substitute it
    # for a string with the bucket's name. e.g. s3_bucket_name = "aws-bucket-1" accesses a
    # bucket named "aws-bucket-1"

    # desired_s3_file_name_with_csv_extension: the name desired for the object stored in S3
    # (string, in quotes). Keep it None to set it equal to new_file_name_with_csv_extension. 
    # Alternatively, set it as a string analogous to new_file_name_with_csv_extension. 
    # e.g. desired_s3_file_name_with_csv_extension = "S3_file.csv"
    
    if (export_to_s3_bucket == True):
        
        if (desired_s3_file_name_with_csv_extension is None):
            #Repeat new_file_name_with_extension
            desired_s3_file_name_with_csv_extension = new_file_name_with_csv_extension
        
        # If the bucket name was provided, start the session. If not, print an error
        # message:
        
        if (s3_bucket_name is None):
            
            print("Please, provide a valid S3 Bucket name to export to.")
        
        else:
        
            # start S3 client:
            print("Starting AWS S3 client.")
        
            # Let's export the file to an AWS S3 (Simple Storage Service) bucket:
            # instantiate the S3 resource and upload the file to S3
            s3_client = boto3.resource('s3')
            
            # Create a local copy of the file on the root.
            local_copy_path = os.path.join("/", new_file_name_with_csv_extension)
            dataframe_to_be_exported.to_csv(local_copy_path, index = False)
            
            print("Local copy of the dataframe created on the root path to export to S3.")
            print("Simply delete this file from the root path if you only want to keep the S3 version.")
            
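            # Upload this local copy to S3.
            # ClientError (from botocore) is the exception caught in the AWS documentation
            # example; logging is used to register the error message.
            from botocore.exceptions import ClientError
            import logging
            
            try:
                response = s3_client.meta.client.upload_file(local_copy_path, s3_bucket_name, desired_s3_file_name_with_csv_extension)
            
            except ClientError as e:
                logging.error(e)
                return False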
            
            print(f"{desired_s3_file_name_with_csv_extension} successfully exported to {s3_bucket_name} AWS S3 bucket.")
            return True
            # Check AWS Documentation:
            # https://boto3.amazonaws.com/v1/documentation/api/latest/guide/s3-uploading-files.html
            
            # Notice: if you wanted to authenticate directly from Python code, you could use
            # the following code, instead:        
            # ACCESS_KEY = 'access_key_ID'
            # PASSWORD_KEY = 'password_key'
            # s3_client = boto3.client('s3', aws_access_key_id = ACCESS_KEY, aws_secret_access_key = PASSWORD_KEY)
            # s3_client.upload_file(local_copy_path, s3_bucket_name, desired_s3_file_name_with_csv_extension)
            
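    else :
        # Do not export to AWS S3. Export to other path.
        # If no directory was provided, export to the root path:
        if (file_directory_path is None):
            file_directory_path = "/"
        
        # Create the complete file path:
        file_path = os.path.join(file_directory_path, new_file_name_with_csv_extension)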

        dataframe_to_be_exported.to_csv(file_path, index = False)

        print(f"Dataframe {new_file_name_with_csv_extension} exported as \'{file_path}\'.")
        print("Warning: if there was a file in this file path, it was replaced by the exported dataframe.")

# **Function for downloading a file from Google Colab or AWS S3 to the local machine or uploading a file from the machine to S3 or to Colab's instant memory**

In [2]:
def download_or_upload_file (source = 'aws', action = 'download', object_to_download_from_colab = None, s3_bucket_name = None, local_path_of_storage = '/', file_name_with_extension = None):
    
    import os
    import boto3
    # boto3 is the AWS SDK for Python (used here to access S3)
    from google.colab import files
    
    # source = 'google' for downloading from (or uploading to) Google Colab's instant memory;
    # source = 'aws' for downloading from (or uploading to) an AWS S3 bucket.
    
    # action = 'download' to download the file to the local machine
    # action = 'upload' to upload a file from local machine to AWS S3 or to
    # Google Colab's instant memory
    
    # object_to_download_from_colab = None. This option has effect only when
    # source == 'google'. In this case, this parameter is mandatory. 
    # Declare as object_to_download_from_colab the object that you want to download.
    # Since it is an object and not a string, it should not be declared in quotes.
    # e.g. to download a dictionary named dict, object_to_download_from_colab = dict.
    # To download a dataframe named df, declare object_to_download_from_colab = df.
    # To export a model named keras_model, declare object_to_download_from_colab = keras_model
    
    ## THE FOLLOWING PARAMETERS HAVE EFFECT ONLY WHEN source == 'aws'
    
    # s3_bucket_name = None.
    ## This parameter is mandatory to access an AWS S3 bucket. Replace it with a string
    # containing the bucket's name. e.g. s3_bucket_name = "aws-bucket-1" accesses a bucket
    # named "aws-bucket-1"
    
    # LOCAL_PATH_OF_STORAGE: path of the local computer environment 
    # to which the S3 bucket contents will be downloaded (ACTION == 'download'); or
    # path of the folder containing the file that will be uploaded in S3 (ACTION = 'upload'). 
    # If it is None, or if LOCAL_PATH_OF_STORAGE = '/', files 
    # will be imported to the root path. Alternatively, input the path as a string 
    # (in quotes).
    # Examples: LOCAL_PATH_OF_STORAGE = '/copied_s3_bucket'; 
    # LOCAL_PATH_OF_STORAGE = "/My_folder"; LOCAL_PATH_OF_STORAGE = "/Users/Me/Documents/"
    # Notice that only the directories should be declared: do not include the file name and
    # its extension.
    
    # file_name_with_extension: string, in quotes, containing the name of the file which will
    # be downloaded from S3, or uploaded to S3, followed by its extension. 
    ## This parameter is mandatory when source == 'aws'
    # Examples:
    # file_name_with_extension = 'Screen_Shot.png'; file_name_with_extension = 'dataset.csv',
    # file_name_with_extension = "dictionary.pkl", file_name_with_extension = "model.h5",
    # file_name_with_extension = 'doc.pdf', file_name_with_extension = 'model.dill'

    if (source == 'google'):
        
        if (action == 'upload'):
            
            print("Click on the button for file selection and select the files from your machine that will be uploaded in the Colab environment.")
            print("Warning: the files will be removed from Colab memory after the Kernel dies or after the notebook is closed.")
            # this functionality requires the previous declaration:
            ## from google.colab import files
            
            colab_files_dict = files.upload()
            
            # The files are stored into a dictionary called colab_files_dict where the keys
            # are the names of the files and the values are the files themselves.
            ## e.g. if you upload a single file named "dictionary.pkl", the dictionary will be
            ## colab_files_dict = {'dictionary.pkl': file}, where file is actually a big string
            ## representing the contents of the file. The length of this value is the size of the
            ## uploaded file, in bytes.
            ## To access the file is like accessing a value from a dictionary: 
            ## d = {'key1': 'val1'}, d['key1'] == 'val1'
            ## we simply declare the key inside brackets and quotes, the same way we would do for
            ## accessing the column of a dataframe.
            ## In this example, colab_files_dict['dictionary.pkl'] access the content of the 
            ## .pkl file, and len(colab_files_dict['dictionary.pkl']) is the size of the .pkl
            ## file in bytes.
            ## To check the dictionary keys, apply the method .keys() to the dictionary (with empty
            ## parentheses): colab_files_dict.keys()
            
            for key in colab_files_dict.keys():
                #loop through each element of the list of keys of the dictionary
                # (list colab_files_dict.keys()). Each element is named 'key'
                print(f"User uploaded file {key} with length {len(colab_files_dict[key])} bytes.")
                # The key is the name of the file, and the length of the value
                ## correspondent to the key is the file's size in bytes.
                ## Notice that the content of the uploaded object must be passed 
                ## as argument for a proper function to be interpreted. 
                ## For instance, the content of a xlsx file should be passed as
                ## argument for Pandas .read_excel function; the pkl file must be passed as
                ## argument for pickle.
                ## e.g., if you uploaded 'table.xlsx' and stored it into colab_files_dict you should
                ## declare df = pd.read_excel(colab_files_dict['table.xlsx']) to obtain a dataframe
                ## df from the uploaded table. Notice that is the value, not the key, that is the
                ## argument.
                
            # General instructions, printed only once, after all uploaded files were listed:
            print("The uploaded files are stored into a dictionary object named as colab_files_dict.")
            print("Each key from this dictionary is the name of an uploaded file. The value correspondent to that key is the file itself.")
            print("The structure of a general Python dictionary is dict = {\'key1\': value1}. To access value1, declare file = dict[\'key1\'], as if you were accessing a column from a dataframe.")
            print("Then, if you uploaded a file named \'table.xlsx\', you can access this file as:")
            print("uploaded_file = colab_files_dict[\'table.xlsx\']")
            print("Notice, though, that the object uploaded_file is the whole file content, not a Python object already converted. To convert to a Python object, pass this element as argument for a proper function or method.")
            print("In this example, to convert the object uploaded_file to a dataframe, Pandas pd.read_excel function could be used. In the following line, a df dataframe object is obtained from the uploaded file:")
            print("df = pd.read_excel(uploaded_file)")
        
        elif (action == 'download'):
            
            if (object_to_download_from_colab is None):
                
                #No object was declared
                print("Please, inform an object to download. Since it is an object, not a string, it should not be declared in quotes.")
            
            else:
                
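                print("The file will be downloaded to your computer.")

                # Note: google.colab's files.download expects the path (string) of a file
                # saved in the Colab environment. If the object is not a file path, save
                # it to a file first and pass the path of that file.
                files.download(object_to_download_from_colab)

                print(f"File {object_to_download_from_colab} successfully downloaded from Colab environment.")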

        else:
            
            print("Please, select a valid action, download or upload.")
          
    elif (source == 'aws'):
        
        # Notice: if you wanted to authenticate directly from Python code, you could use
        # the following code, instead for starting the client:
        
        # ACCESS_KEY = 'access_key_ID'
        # PASSWORD_KEY = 'password_key'
        # s3_client = boto3.client('s3', aws_access_key_id = ACCESS_KEY, aws_secret_access_key = PASSWORD_KEY)
        # The remaining code is the same.
        
        
        # If the path to store is None, also import the bucket content to root path;
        # or upload the file from root path to the bucket
        if (local_path_of_storage is None):
            
            local_path_of_storage = '/'
        
        # If the bucket name was provided, start the session. If not, print an error
        # message. The same for the file name with extension:
        
        if (s3_bucket_name is None):
            
            print("Please, provide a valid S3 Bucket name.")
        
        elif (file_name_with_extension is None):
            
            print("Please, provide a valid file name with its extension. e.g. \'dataset.csv\'.")
        
        else:
            
            # Obtain the full file path from which the file will be uploaded to S3; or to
            # which the file will be downloaded from S3:
            file_path = os.path.join(local_path_of_storage, file_name_with_extension)
            
            # Start S3 client:
            s3_client = boto3.resource('s3')
            
            print("Starting AWS S3 client.")
            
            if (action == 'upload'):
                
                s3_client.Object(s3_bucket_name, file_name_with_extension).\
                    upload_file(Filename = file_path)
                
                print(f"File {file_name_with_extension} successfully uploaded to AWS S3 {s3_bucket_name} bucket.")
            
            elif (action == 'download'):

                print("The file will be downloaded to your computer.")
                
                s3_client.Object(s3_bucket_name, file_name_with_extension).download_file(file_path)
                
                print(f"File {file_name_with_extension} successfully downloaded from AWS S3 {s3_bucket_name} bucket.")

            else:

                print("Please, select a valid action, download or upload.")

    else:
        
        print("Select a valid source: \'google\' for Google Colab's instant memory; or \'aws\' for accessing an AWS S3 Bucket.")

## **Call the functions**

### **Mounting Google Drive or S3 (AWS Simple Storage Service) bucket**

In [None]:
SOURCE = 'aws'
# SOURCE = 'google' for mounting the google drive;
# SOURCE = 'aws' for accessing an AWS S3 bucket

## THE FOLLOWING PARAMETERS HAVE EFFECT ONLY WHEN SOURCE == 'aws':

PATH_TO_STORE_IMPORTED_S3_BUCKET = '/'
# PATH_TO_STORE_IMPORTED_S3_BUCKET: path of the Python environment to which the
# S3 bucket contents will be imported. If it is None, or if 
# PATH_TO_STORE_IMPORTED_S3_BUCKET = '/', bucket will be imported to the root path. 
# Alternatively, input the path as a string (in quotes). e.g. 
# PATH_TO_STORE_IMPORTED_S3_BUCKET = '/copied_s3_bucket'

S3_BUCKET_NAME = 'name_of_aws_s3_bucket_to_be_accessed'
## This parameter is mandatory to access an AWS S3 bucket. Replace it with a string
# containing the bucket's name. e.g. s3_bucket_name = "aws-bucket-1" accesses a bucket
# named "aws-bucket-1"

S3_OBJECT_KEY_PREFFIX_FOLDER = None
# S3_OBJECT_KEY_PREFFIX_FOLDER = None. Keep it None or as an empty string 
# (S3_OBJECT_KEY_PREFFIX_FOLDER = '') to import the whole bucket content, instead of a 
# single object from it.
# Alternatively, set it as a string containing the subfolder from the bucket to import:
# Suppose that your bucket (admin-created) has four objects with the following object 
# keys: Development/Projects.xlsx; Finance/statement1.pdf; Private/taxdocument.pdf; and
# s3-dg.pdf. 
# The s3-dg.pdf key does not have a prefix, so its object appears directly 
# at the root level of the bucket. If you open the Development/ folder, you see 
# the Projects.xlsx object in it.
# In summary, if the path of the file is 'bucket/my_path/.../file.csv',
# where 'bucket' is the bucket's name, then key_preffix = 'my_path/.../', i.e. the path
# without the final 'file.csv' part (file name with extension).

mount_storage_system (source = SOURCE, path_to_store_imported_s3_bucket = PATH_TO_STORE_IMPORTED_S3_BUCKET, s3_bucket_name = S3_BUCKET_NAME, s3_obj_key_preffix = S3_OBJECT_KEY_PREFFIX_FOLDER)
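
The relation between an object key and its prefix, described in the comments above, can be sketched in a few lines. The helper `split_s3_key` below is hypothetical (it is not part of this notebook or of any AWS library) and only manipulates strings, so it does not access S3.

```python
import posixpath

def split_s3_key(object_key):
    """Split an S3 object key into (prefix, file name with extension)."""
    # S3 keys always use '/' as separator, so posixpath is used instead of os.path
    prefix, file_name = posixpath.split(object_key)
    # Re-append the trailing slash, so that the prefix looks like a folder path
    if prefix:
        prefix = prefix + "/"
    return prefix, file_name

keys = ["Development/Projects.xlsx", "Finance/statement1.pdf", "s3-dg.pdf"]
split_keys = [split_s3_key(key) for key in keys]

for key, (prefix, file_name) in zip(keys, split_keys):
    print(f"key = {key!r}: prefix = {prefix!r}, file = {file_name!r}")
```

An object with no prefix, such as `s3-dg.pdf`, results in an empty string as prefix, which matches the case where the whole bucket is imported.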

### **Importing the dataset**

In [3]:
# WARNING: Use this function to load dataframes stored on Excel (xls, xlsx, etc), 
# txt, or CSV (comma separated values) files.

FILE_DIRECTORY_PATH = "/"
# FILE_DIRECTORY_PATH - (string, in quotes): input the path of the directory 
# (e.g. folder path) where the file is stored. e.g. FILE_DIRECTORY_PATH = "/" 
# or FILE_DIRECTORY_PATH = "/folder"

FILE_NAME_WITH_EXTENSION = "dataset.csv"
# FILE_NAME_WITH_EXTENSION - (string, in quotes): input the name of the file with the 
# extension. e.g. FILE_NAME_WITH_EXTENSION = "file.xlsx", or, 
# FILE_NAME_WITH_EXTENSION = "file.csv"
    
HAS_HEADER = True
# HAS_HEADER = True if the imported table has headers (row with column names).
# Alternatively, HAS_HEADER = False if the dataframe does not have header.

TXT_CSV_COL_SEP = "comma"
# TXT_CSV_COL_SEP = "comma" - This parameter has effect only when the file is a 'txt'
# or 'csv'. It informs how the different columns are separated.
# TXT_CSV_COL_SEP = "comma" for columns separated by commas (",");
# TXT_CSV_COL_SEP = "whitespace" for columns separated by simple spaces (" ").

SHEET_TO_LOAD = None
# SHEET_TO_LOAD - This parameter has effect only for Excel files.
# Keep SHEET_TO_LOAD = None not to specify a sheet of the file, so that the first sheet
# will be loaded.
# SHEET_TO_LOAD may be an integer or a string (inside quotes). SHEET_TO_LOAD = 0
# loads the first sheet (sheet with index 0); SHEET_TO_LOAD = 1 loads the second sheet
# of the file (index 1); SHEET_TO_LOAD = "Sheet1" loads a sheet named as "Sheet1".
# Declare a number to load the sheet with that index, starting from 0; or declare a
# name to load the sheet with that name.

#The dataframe will be stored in the object named 'dataset':
# Simply modify this object on the left of equality:
dataset = load_dataframe (file_directory_path = FILE_DIRECTORY_PATH, file_name_with_extension = FILE_NAME_WITH_EXTENSION, has_header = HAS_HEADER, txt_csv_col_sep = TXT_CSV_COL_SEP, sheet_to_load = SHEET_TO_LOAD)
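
The body of `load_dataframe` is not shown in this part of the notebook, so the block below is only a sketch of how the `HAS_HEADER` and `TXT_CSV_COL_SEP` options could map to arguments of `pd.read_csv`. The names `SEPARATORS` and `read_text_table` are hypothetical, and in-memory text buffers replace real files.

```python
import io
import pandas as pd

# Hypothetical mapping from the option names used in this notebook to separators
SEPARATORS = {"comma": ",", "whitespace": r"\s+"}

def read_text_table(buffer, col_sep="comma", has_header=True):
    # header = 0 uses the first row as column names; header = None creates
    # integer column names (0, 1, 2, ...)
    header = 0 if has_header else None
    return pd.read_csv(buffer, sep=SEPARATORS[col_sep], header=header)

csv_df = read_text_table(io.StringIO("DATE,value\n2021-01-01,1\n"), col_sep="comma")
txt_df = read_text_table(io.StringIO("2021-01-01 1\n2021-01-02 2\n"),
                         col_sep="whitespace", has_header=False)

print(csv_df)
print(txt_df)
```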

### **Grouping the data by a timestamp**

In [None]:
DATASET = dataset #Alternatively: object containing the dataset to be grouped

TIMESTAMP_TAG_COLUMN = "DATE"
#Alternatively: string (inside quotes) containing the name (header) of the timestamp column

GROUPING_FREQUENCY_UNIT = 'day'
#Alternatively: 'year', 'month', 'week', 'hour', 'minute', 'day', or 'second'

NUMBER_OF_PERIODS_TO_GROUP = 1 
# Group by every NUMBER_OF_PERIODS_TO_GROUP = 1 periods (every day, if 'day' is selected).
#Bin size. Alternatively: any integer number. Check the instructions in function comments.

AGGREGATE_FUNCTION = 'mean'
# Keep the method inside quotes.
#Alternatively: any Pandas aggregation method: 'mean', 'median', 'std', 'sum', 'min', 'max',
# etc. You can pass a list of multiple aggregations, like: AGGREGATE_FUNCTION = 
# ['mean', 'max', 'sum']. You can also pass custom functions, like: pct30 (30-percentile), 
# or np.mean: AGGREGATE_FUNCTION = pct30, AGGREGATE_FUNCTION = np.mean. Functions are
# objects, so they should not be declared inside quotes.

#ADJUSTMENT OF GROUPING BASED ON A FIXED TIMESTAMP
# You can specify the origin (start_time) or the offset (offset_time), which are equivalent.
#WARNING: DECLARE ONLY ONE OF THESE PARAMETERS. DO NOT DECLARE AN OFFSET IF AN ORIGIN WAS 
# SPECIFIED, AND VICE-VERSA.
START_TIME = None
OFFSET_TIME = None
# Alternatively, these parameters should be declared as a pandas Timestamp or in the
# specific notation of Pandas offset_time for the Grouper class:
# START_TIME = pd.Timestamp('2000-10-01 23:30:00', unit = 'ns')
# Simply substitute the Timestamp inside quotes by the correct start timestamp.
# This timestamp does not have to be complete, but must be interpretable by the Timestamp
# function.
# OFFSET_TIME = '23h30min', OFFSET_TIME = '2min', etc. Simply substitute the offset time
# inside quotes by the correct value.
# For examples on the notation for start and offset time, check Pandas grouper class
# documentation, and Pandas timestamp class documentation:
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Grouper.html
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Timestamp.html


#New dataframe saved as grouped_df. Simply modify this object on the left of equality:
grouped_df = GROUP_BY_TIMESTAMP (df = DATASET, timestamp_tag_column = TIMESTAMP_TAG_COLUMN, grouping_frequency_unit = GROUPING_FREQUENCY_UNIT, number_of_periods_to_group = NUMBER_OF_PERIODS_TO_GROUP, aggregate_function = AGGREGATE_FUNCTION, start_time = START_TIME, offset_time = OFFSET_TIME)
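
The parameters above refer to the Pandas Grouper class. As a rough illustration of what grouping by a timestamp looks like in plain Pandas, the sketch below uses a small invented dataframe; the column names and values are assumptions, and the grouping function of this notebook may treat details differently.

```python
import pandas as pd

# Invented data, only for illustration
df = pd.DataFrame({
    "DATE": pd.to_datetime([
        "2021-01-01 08:00", "2021-01-01 20:00",
        "2021-01-02 09:00", "2021-01-03 10:00",
    ]),
    "value": [1.0, 3.0, 5.0, 7.0],
})

# One bin per day (number of periods = 1, unit = day), aggregated by the mean
daily_mean = df.groupby(pd.Grouper(key="DATE", freq="1D")).mean()

# Bins of 2 days, with a list of aggregations
two_day_stats = df.groupby(pd.Grouper(key="DATE", freq="2D")).agg(["mean", "max"])

# Daily bins shifted by an offset of 12 hours: each bin goes from noon to noon
noon_mean = df.groupby(pd.Grouper(key="DATE", freq="1D", offset="12h")).mean()

print(daily_mean)
print(two_day_stats)
print(noon_mean)
```

With the offset, the measurements from 20:00 of the first day and 09:00 of the second day fall into the same bin, which is why their mean appears as a single row.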

### **Merging (joining) the data on a timestamp column**

In [None]:
DF_LEFT = dataset1 #Alternatively: object containing the dataset to be joined on the left
DF_RIGHT = dataset2 #Alternatively: object containing the dataset to be joined on the right

LEFT_KEY = "DATE" 
#Alternatively: (string) name of the column of the left dataframe to be used as key for 
# joining. Keep inside quotes.
RIGHT_KEY = "DATE"
#Alternatively: (string) name of the column of the right dataframe to be used as key for 
# joining. Keep inside quotes.

HOW_TO_JOIN = "inner"
#Alternatively: "inner", "outer", "left", "right". This option has no effect 
# if MERGE_METHOD = "asof". Keep inside quotes.

MERGE_METHOD = "ordered"
#Alternatively: MERGE_METHOD = 'ordered' to use pandas .merge_ordered method, or
# MERGE_METHOD = "asof" for using the .merge_asof method.
# WARNING: .merge_asof uses fuzzy matching, so the HOW_TO_JOIN parameter is not applicable.
# Keep inside quotes.

MERGED_SUFFIXES = ('_left', '_right')
# MERGED_SUFFIXES = ('_left', '_right') - tuple of the suffixes to be added to columns
# that have the same name in both datasets.
# Example: suppose both datasets have the column 'Value'. The column from the left dataset
# will be renamed as "Value_left", and the column from the right dataset will be renamed as
# "Value_right".
# Alternatively: modify the strings inside quotes to modify the standard values. 
# Do not eliminate the parentheses that indicate the tuple object.
# A tuple is an immutable sequence: it is declared inside parentheses, instead of the
# brackets [] used for lists, and its elements cannot be modified after creation.

ASOF_DIRECTION = "nearest"
# Parameter of .merge_asof method. 'nearest' merges the closest timestamps in both directions.
# Alternatively: 'backward' or 'forward'.
# This option has no effect if MERGE_METHOD = "ordered". Keep inside quotes.

ORDERED_FILLING = None
# Parameter of .merge_ordered method.
# Alternatively: ORDERED_FILLING = 'ffill' (inside quotes) to fill missings 
# with the previous value.
# This option has no effect if MERGE_METHOD = "asof".


#New dataframe saved as merged_df. Simply modify this object on the left of equality:
merged_df = MERGE_ON_TIMESTAMP (df_left = DF_LEFT, df_right = DF_RIGHT, left_key = LEFT_KEY, right_key = RIGHT_KEY, how_to_join = HOW_TO_JOIN, merge_method = MERGE_METHOD, merged_suffixes = MERGED_SUFFIXES, asof_direction = ASOF_DIRECTION, ordered_filling = ORDERED_FILLING)
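
To make the difference between the two merge methods concrete, the sketch below applies `pd.merge_ordered` and `pd.merge_asof` directly to two small invented dataframes. The data are assumptions made only for this illustration; the merging function of this notebook may add further steps that are not reproduced here.

```python
import pandas as pd

# Invented data: both dataframes share the columns "DATE" and "Value"
left = pd.DataFrame({
    "DATE": pd.to_datetime(["2021-01-01 00:00", "2021-01-01 01:00", "2021-01-01 02:00"]),
    "Value": [10, 20, 30],
})
right = pd.DataFrame({
    "DATE": pd.to_datetime(["2021-01-01 00:00", "2021-01-01 01:10", "2021-01-01 03:00"]),
    "Value": [1, 2, 3],
})

# Ordered merge: exact matching on the key, result sorted by the key.
# With how = "outer", unmatched rows are kept and forward-filled by fill_method.
ordered = pd.merge_ordered(left, right, on="DATE", how="outer",
                           suffixes=("_left", "_right"), fill_method="ffill")

# As-of merge: each row on the left receives the row on the right with the
# closest timestamp. Both dataframes must be sorted by the key.
asof = pd.merge_asof(left, right, on="DATE", direction="nearest",
                     suffixes=("_left", "_right"))

print(ordered)
print(asof)
```

The as-of merge keeps exactly the rows of the left dataframe, which is why a join type is not applicable to it.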

### **Creating a column with isolated information from the timestamp**

In [None]:
DATASET = dataset #Alternatively: object containing the dataset to be analyzed

TIMESTAMP_TAG_COLUMN = "DATE"
#Alternatively: string (inside quotes) containing the name (header) of the timestamp column
#Keep inside quotes.

EXTRACTED_INFO = "year"
# Information to extract from the timestamp. The allowed values are:
# 'year', 'month', 'week', 'day', 'hour', 'minute', or 'second'

NEW_COLUMN_NAME = None
# Name of the newly created column. If no value is provided, it will be equal to 
# extracted_info. Alternatively: keep it as None, or input a name (string) for the new
# column, inside quotes (e.g. NEW_COLUMN_NAME = "extracted_information")


#New dataframe saved as new_df. Simply modify this object on the left of equality:
new_df = EXTRACT_TIMESTAMP_INFO (df = DATASET, timestamp_tag_column = TIMESTAMP_TAG_COLUMN, extracted_info = EXTRACTED_INFO, new_column_name = NEW_COLUMN_NAME)
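
In plain Pandas, the components of a timestamp column can be obtained through the `.dt` accessor. The sketch below uses invented timestamps and is only an illustration; it is not necessarily how the extraction function of this notebook is implemented.

```python
import pandas as pd

# Invented timestamps, only for illustration
df = pd.DataFrame({
    "DATE": pd.to_datetime(["2021-03-15 13:45:10", "2022-11-02 07:05:59"]),
})

# Each component of the timestamp becomes a new column
df["year"] = df["DATE"].dt.year
df["month"] = df["DATE"].dt.month
# ISO week number (isocalendar is available in recent Pandas versions)
df["week"] = df["DATE"].dt.isocalendar().week
df["hour"] = df["DATE"].dt.hour

print(df)
```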

### **Calculating differences between timestamps (timedeltas)**

In [None]:
DATASET = dataset #Alternatively: object containing the dataset to be analyzed

TIMESTAMP_TAG_COLUMN1 = "DATE"
#Alternatively: string (inside quotes) containing the name (header) of the timestamp column
# on the left (from which the right timestamp will be subtracted).
#Keep inside quotes.

TIMESTAMP_TAG_COLUMN2 = "TIMESTAMP2"
#Alternatively: string (inside quotes) containing the name (header) of the timestamp column
# on the right, that will be subtracted from the timestamp on the left.
#Keep inside quotes.

TIMEDELTA_COLUMN_NAME = None
# Name of the new column. If no value is provided, the default name 
# [timestamp_tag_column1]-[timestamp_tag_column2] will be given.
# Alternatively: keep it as None or input a name (string) for the new column inside quotes:
# e.g. TIMEDELTA_COLUMN_NAME = "Timestamp_difference"
    
RETURNED_TIMEDELTA_UNIT = None
#Unit of the new column. If no value is provided, the unit will be considered as nanoseconds. 
# Alternatively: keep it None, for the results in nanoseconds, or input RETURNED_TIMEDELTA_UNIT = 
# 'year', 'month', 'day', 'hour', 'minute', or 'second' (keep these inside quotes).


# New dataframe saved as new_df. Simply modify this object on the left of equality:
new_df = CALCULATE_TIMEDELTA (df = DATASET, timestamp_tag_column1 = TIMESTAMP_TAG_COLUMN1, timestamp_tag_column2 = TIMESTAMP_TAG_COLUMN2, timedelta_column_name  = TIMEDELTA_COLUMN_NAME, returned_timedelta_unit = RETURNED_TIMEDELTA_UNIT)
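
The subtraction of two timestamp columns and the conversion of the result to a given unit can be sketched as follows. The data are invented, and the conversions shown are only one possible approach. Notice that 'year' and 'month' do not have a fixed duration, so any conversion to these units depends on a convention that is not shown here.

```python
import pandas as pd

# Invented data, only for illustration
df = pd.DataFrame({
    "DATE": pd.to_datetime(["2021-01-02 12:00", "2021-01-05 00:00"]),
    "TIMESTAMP2": pd.to_datetime(["2021-01-01 00:00", "2021-01-04 18:00"]),
})

# Left timestamp minus right timestamp: the result is a timedelta column
df["delta"] = df["DATE"] - df["TIMESTAMP2"]

# Dividing by a Timedelta of one unit converts the timedelta to a float
df["delta_hours"] = df["delta"] / pd.Timedelta(hours=1)

# Equivalent approach through the total number of seconds (1 day = 86400 s)
df["delta_days"] = df["delta"].dt.total_seconds() / 86400

print(df)
```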

### **Calculating differences between successive timestamps (delays)**

#### Case 1: return average delay

In [None]:
DATASET = dataset #Alternatively: object containing the dataset to be analyzed

TIMESTAMP_TAG_COLUMN = "DATE"
#Alternatively: string (inside quotes) containing the name (header) of the timestamp column.
#Keep inside quotes.

NEW_TIMEDELTA_COLUMN_NAME = None
# Name of the new column. If no value is provided, the default name 
# [timestamp_tag_column1]-[timestamp_tag_column2] will be given.
# Alternatively: keep it as None or input a name (string) for the new column inside quotes:
# e.g. NEW_TIMEDELTA_COLUMN_NAME = "Timestamp_difference"
    
RETURNED_TIMEDELTA_UNIT = None
#Unit of the new column. If no value is provided, the unit will be considered as nanoseconds. 
# Alternatively: keep it None, for the results in nanoseconds, or input RETURNED_TIMEDELTA_UNIT = 
# 'year', 'month', 'day', 'hour', 'minute', or 'second' (keep these inside quotes).

RETURN_AVG_DELAY = True
# RETURN_AVG_DELAY = True will print and return the value of the average delay.
# RETURN_AVG_DELAY = False will omit this information

# New dataframe saved as new_df. Simply modify this object on the left of equality.
# The average delay (float value) is stored into the variable avg_delay. 
# Simply modify this object on the left of equality.
new_df, avg_delay = CALCULATE_DELAY (df = DATASET, timestamp_tag_column = TIMESTAMP_TAG_COLUMN, new_timedelta_column_name  = NEW_TIMEDELTA_COLUMN_NAME, returned_timedelta_unit = RETURNED_TIMEDELTA_UNIT, return_avg_delay = RETURN_AVG_DELAY)

#### Case 2: do not return average delay

In [None]:
DATASET = dataset #Alternatively: object containing the dataset to be analyzed

TIMESTAMP_TAG_COLUMN = "DATE"
#Alternatively: string (inside quotes) containing the name (header) of the timestamp column.
#Keep inside quotes.

NEW_TIMEDELTA_COLUMN_NAME = None
# Name of the new column. If no value is provided, the default name 
# [timestamp_tag_column1]-[timestamp_tag_column2] will be given.
# Alternatively: keep it as None or input a name (string) for the new column inside quotes:
# e.g. NEW_TIMEDELTA_COLUMN_NAME = "Timestamp_difference"
    
RETURNED_TIMEDELTA_UNIT = None
#Unit of the new column. If no value is provided, the unit will be considered as nanoseconds. 
# Alternatively: keep it None, for the results in nanoseconds, or input RETURNED_TIMEDELTA_UNIT = 
# 'year', 'month', 'day', 'hour', 'minute', or 'second' (keep these inside quotes).

RETURN_AVG_DELAY = False
# RETURN_AVG_DELAY = True will print and return the value of the average delay.
# RETURN_AVG_DELAY = False will omit this information

# New dataframe saved as new_df. Simply modify this object on the left of equality:
new_df = CALCULATE_DELAY (df = DATASET, timestamp_tag_column = TIMESTAMP_TAG_COLUMN, new_timedelta_column_name  = NEW_TIMEDELTA_COLUMN_NAME, returned_timedelta_unit = RETURNED_TIMEDELTA_UNIT, return_avg_delay = RETURN_AVG_DELAY)
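
A delay between successive timestamps can be illustrated with the `.diff()` method, as in the sketch below. The data are invented, and the direction of the difference (each row minus the previous one) is an assumption: the delay function of this notebook may use a different convention.

```python
import pandas as pd

# Invented timestamps, only for illustration
df = pd.DataFrame({
    "DATE": pd.to_datetime(["2021-01-01 00:00", "2021-01-01 00:10", "2021-01-01 00:40"]),
})

# The timestamps must be ordered before calculating successive differences
df = df.sort_values("DATE").reset_index(drop=True)

# Difference between each timestamp and the previous one, converted to minutes.
# The first row has no previous timestamp, so its delay is a missing value.
df["delay_minutes"] = df["DATE"].diff() / pd.Timedelta(minutes=1)

# The mean ignores the missing value of the first row
avg_delay = df["delay_minutes"].mean()

print(df)
print(f"Average delay = {avg_delay} minutes")
```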

### **Adding or subtracting a timedelta from a timestamp**

In [None]:
DATASET = dataset #Alternatively: object containing the dataset to be analyzed

TIMESTAMP_TAG_COLUMN = "DATE"
# Alternatively: string (inside quotes) containing the name (header) of the timestamp column

TIMEDELTA = 2
# Numeric value of the timedelta.
# WARNING: simply input a numeric value, not a string with unit. e.g. timedelta = 2.4
# If you want to subtract a timedelta, input a negative value. e.g. timedelta = - 2.4
# Alternatively, input any desired real number.

NEW_TIMESTAMP_COL = None
# Name of the new column containing the obtained timestamp.  If no value is provided, the 
# default name [timestamp_tag_column]+[timedelta] will be given.
# Alternatively, input a string value inside quotes with the name of this new column.
# e.g. NEW_TIMESTAMP_COL = "new_timestamp"

TIMEDELTA_UNIT = None
# Unit of the timedelta interval. If no value is provided, the unit will be considered 'ns' 
# (default). 
# Possible values are: TIMEDELTA_UNIT = None, 'day', 'hour', 'minute', 'second', or 'ns'.
# Keep the unit inside quotes. 

#New dataframe saved as new_df. Simply modify this object on the left of equality:
new_df = ADD_TIMEDELTA (df = DATASET, timestamp_tag_column = TIMESTAMP_TAG_COLUMN, timedelta = TIMEDELTA, new_timestamp_col  = NEW_TIMESTAMP_COL, timedelta_unit = TIMEDELTA_UNIT)
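
Adding a numeric timedelta with a given unit to a timestamp column can be sketched with `pd.to_timedelta`. The data and the new column names below are invented for this illustration.

```python
import pandas as pd

# Invented timestamps, only for illustration
df = pd.DataFrame({
    "DATE": pd.to_datetime(["2021-01-01 00:00", "2021-01-31 23:00"]),
})

# Numeric value + unit are converted to a Timedelta and added to the whole column
df["plus_2_hours"] = df["DATE"] + pd.to_timedelta(2, unit="h")

# A negative value subtracts the timedelta; non-integer values are accepted
df["minus_1_5_days"] = df["DATE"] + pd.to_timedelta(-1.5, unit="D")

print(df)
```

Calendar effects are handled by Pandas: adding 2 hours to 23:00 of January 31 moves the timestamp to February 1.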

### **Concatenating (SQL UNION) multiple dataframes**

In [None]:
LIST_OF_DATAFRAMES = [dataset1, dataset2]
# LIST_OF_DATAFRAMES must be a list containing the dataframe objects
# example: list_of_dataframes = [df1, df2, df3, df4]
# Notice that the dataframes are objects, not strings. Therefore, they should not
# be declared inside quotes.
# There is no limit to the number of dataframes. With list_of_dataframes = 
# [df1, df2, df3, df4], as in the example above, we would concatenate 4 dataframes.
# If LIST_OF_DATAFRAMES = [df1, df2, df3] we would concatenate 3, and if
# LIST_OF_DATAFRAMES = [df1, df2, df3, df4, df5] we would concatenate 5 dataframes.


IGNORE_INDEX_ON_UNION = True # Alternatively: True or False

SORT_VALUES_ON_UNION = True # Alternatively: True or False

UNION_JOIN_TYPE = None
# JOIN can be 'inner' to perform an inner join, eliminating the missing values
# The default (None) is 'outer': the dataframes will be stacked on the columns with
# same names but, in case there is no correspondence, the row will present a missing
# value for the columns which are not present in one of the dataframes.
# When using the 'inner' method, only the common columns will remain.
# Alternatively, keep UNION_JOIN_TYPE = None for the standard outer join; or set
# UNION_JOIN_TYPE = "inner" (inside quotes) for using the inner join.
    
# These last 3 parameters correspond to the following arguments of the Pandas concat function:
# IGNORE_INDEX_ON_UNION = ignore_index;
# SORT_VALUES_ON_UNION = sort;
# UNION_JOIN_TYPE = join.
# Check Datacamp course Joining Data with pandas, Chap.3, 
# Advanced Merging and Concatenating
    

#New dataframe saved as concat_df. Simply modify this object on the left of equality:
concat_df = UNION_DATAFRAMES (list_of_dataframes = LIST_OF_DATAFRAMES, ignore_index_on_union = IGNORE_INDEX_ON_UNION, sort_values_on_union = SORT_VALUES_ON_UNION, union_join_type = UNION_JOIN_TYPE)
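
The effect of the join type on the columns of the concatenated dataframe can be seen in the sketch below, which calls `pd.concat` directly on two invented dataframes that share only part of their columns.

```python
import pandas as pd

# Invented dataframes: "DATE" and "a" are common; "b" and "c" are not
df1 = pd.DataFrame({"DATE": pd.to_datetime(["2021-01-01"]), "a": [1], "b": [10]})
df2 = pd.DataFrame({"DATE": pd.to_datetime(["2021-01-02"]), "a": [2], "c": [20]})

# Outer join (Pandas default): all columns are kept; missing values fill the gaps
outer = pd.concat([df1, df2], ignore_index=True, sort=True, join="outer")

# Inner join: only the columns present in all dataframes remain
inner = pd.concat([df1, df2], ignore_index=True, sort=True, join="inner")

print(outer)
print(inner)
```

Since `ignore_index = True`, the rows of the result are renumbered from 0 instead of keeping the original indices of each dataframe.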

## **Exporting the dataframe as CSV file**

In [None]:
## WARNING: all file extensions should be .csv for this function

DATAFRAME_TO_BE_EXPORTED = dataset
#Alternatively: object containing the dataset to be exported.

FILE_DIRECTORY_PATH = "/"
# FILE_DIRECTORY_PATH - (string, in quotes): input the path of the directory 
# (e.g. folder path) where the file will be saved. e.g. FILE_DIRECTORY_PATH = "/" 
# or FILE_DIRECTORY_PATH = "/folder"
# If you want to export the file to AWS S3, this parameter will have no effect.
# In this case, you can set FILE_DIRECTORY_PATH = None

NEW_FILE_NAME_WITH_CSV_EXTENSION = "dataset.csv"
# NEW_FILE_NAME_WITH_CSV_EXTENSION - (string, in quotes): input the name of the file with the 
# extension. e.g. NEW_FILE_NAME_WITH_CSV_EXTENSION = "file.csv"

EXPORT_TO_S3_BUCKET = False
# export_to_s3_bucket = False. Alternatively, set as True to export the file to an
# AWS S3 Bucket.
    
## The following parameters have effect only when export_to_s3_bucket == True:

S3_BUCKET_NAME = None   
## This parameter is mandatory to access an AWS S3 bucket. Replace it with a string
# containing the bucket's name. e.g. S3_BUCKET_NAME = "aws-bucket-1" accesses a bucket
# named "aws-bucket-1"
DESIRED_S3_FILE_NAME_WITH_CSV_EXTENSION = None
# The name desired for the object stored in S3 (string, in quotes). 
# Keep it None to set it equal to NEW_FILE_NAME_WITH_CSV_EXTENSION. 
# Alternatively, set it as a string analogous to NEW_FILE_NAME_WITH_CSV_EXTENSION.
# e.g. DESIRED_S3_FILE_NAME_WITH_CSV_EXTENSION = "S3_file.csv"

export_dataframe(dataframe_to_be_exported = DATAFRAME_TO_BE_EXPORTED, new_file_name_with_csv_extension = NEW_FILE_NAME_WITH_CSV_EXTENSION, file_directory_path = FILE_DIRECTORY_PATH, export_to_s3_bucket = EXPORT_TO_S3_BUCKET, s3_bucket_name = S3_BUCKET_NAME, desired_s3_file_name_with_csv_extension = DESIRED_S3_FILE_NAME_WITH_CSV_EXTENSION)

## **Downloading a file from Google Colab or AWS S3 to the local machine or uploading a file from the machine to S3 or to Colab's instant memory**

In [None]:
SOURCE = 'aws'
# SOURCE = 'google' for downloading from (or uploading to) Google Colab's instant memory;
# SOURCE = 'aws' for downloading from (or uploading to) an AWS S3 bucket.

ACTION = 'download'
# ACTION = 'download' to download the file to the local machine
# ACTION = 'upload' to upload a file from local machine to AWS S3 or to Google Colab's 
# instant memory

OBJECT_TO_DOWNLOAD_FROM_COLAB = None
# OBJECT_TO_DOWNLOAD_FROM_COLAB = None. This option has effect only when
# SOURCE == 'google'. In this case, this parameter is mandatory. 
# Declare as OBJECT_TO_DOWNLOAD_FROM_COLAB the object that you want to download.
# Since it is an object and not a string, it should not be declared in quotes.
# e.g. to download a dictionary named dict, OBJECT_TO_DOWNLOAD_FROM_COLAB = dict.
# To download a dataframe named df, declare OBJECT_TO_DOWNLOAD_FROM_COLAB = df.
# To export a model named keras_model, declare OBJECT_TO_DOWNLOAD_FROM_COLAB = keras_model
    
## THE FOLLOWING PARAMETERS HAVE EFFECT ONLY WHEN SOURCE == 'aws'

S3_BUCKET_NAME = None
## This parameter is mandatory to access an AWS S3 bucket. Replace it with a string
# containing the bucket's name. e.g. S3_BUCKET_NAME = "aws-bucket-1" accesses a bucket named
# "aws-bucket-1"

LOCAL_PATH_OF_STORAGE = '/'
# LOCAL_PATH_OF_STORAGE: path of the local computer environment 
# to which the S3 bucket contents will be downloaded (ACTION == 'download'); or
# path of the folder containing the file that will be uploaded in S3 (ACTION = 'upload'). 
# If it is None, or if LOCAL_PATH_OF_STORAGE = '/', files 
# will be imported to the root path. Alternatively, input the path as a string (in quotes). 
# Examples: LOCAL_PATH_OF_STORAGE = '/copied_s3_bucket'; 
# LOCAL_PATH_OF_STORAGE = "/My_folder"; LOCAL_PATH_OF_STORAGE = "/Users/Me/Documents/"
# Notice that only the directories should be declared: do not include the file name and
# its extension.

FILE_NAME_WITH_EXTENSION = None
# FILE_NAME_WITH_EXTENSION: string, in quotes, containing the name of the file which will be
# downloaded from S3 or uploaded to S3, followed by its extension. 
## This parameter is mandatory when SOURCE == 'aws'
# Examples:
# FILE_NAME_WITH_EXTENSION = 'Screen_Shot.png'; FILE_NAME_WITH_EXTENSION = 'dataset.csv',
# FILE_NAME_WITH_EXTENSION = "dictionary.pkl", FILE_NAME_WITH_EXTENSION = "model.h5",
# FILE_NAME_WITH_EXTENSION = 'doc.pdf', FILE_NAME_WITH_EXTENSION = 'model.dill'

download_or_upload_file (source = SOURCE, action = ACTION, object_to_download_from_colab = OBJECT_TO_DOWNLOAD_FROM_COLAB, s3_bucket_name = S3_BUCKET_NAME, local_path_of_storage = LOCAL_PATH_OF_STORAGE, file_name_with_extension = FILE_NAME_WITH_EXTENSION)

****

# **Grouping by Date in Pandas - Background and Documentation**

- Suppose we have timestamps with the datetime objects stored in column 'Date' of the dataframe df.

## In the examples below, we aggregate the dataframes by date (year, month, day, min) in terms of the mean values over the set time interval.
- The time interval is the aggregation bin.
- To aggregate in terms of sum, simply replace .mean() with .sum().
- The same applies to the other possible aggregate functions: median, var, std, min, max, etc.
- **There are many use cases where we want the total sum over a given period of time. In those cases, we apply the .sum() aggregate function** of pandas, instead of the .mean() used in the next examples.

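### WARNING: Before grouping, make sure that the 'Date' column stores pandas Timestamp objects, with a resolution of at least seconds. For that, use:
`timestamp_object = pd.Timestamp(datetime_object, unit = 's')`
- For a resolution in another scale, simply modify this parameter. For instance, unit = 'ns' for nanoseconds.
- According to the pandas documentation, `unit` is the unit used for the conversion when the input is numeric (int or float), e.g. the number of seconds since the epoch.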
- Check the pandas.Timestamp class documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Timestamp.html
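
To convert a whole column at once, `pd.to_datetime` is usually more convenient than building each Timestamp individually. The sketch below assumes a hypothetical dataframe whose 'Date' column arrives as strings; the column and variable names are illustrative only.

```python
import pandas as pd

# Hypothetical dataframe whose 'Date' column arrives as plain strings.
df = pd.DataFrame({
    "Date": ["2021-01-01 00:00:10", "2021-01-01 00:00:50", "2021-01-02 12:30:00"],
    "value": [1.0, 3.0, 5.0],
})

# Convert the whole column; each element becomes a pandas Timestamp.
df["Date"] = pd.to_datetime(df["Date"])

# A single numeric epoch value can be converted with the 'unit' argument
# (here: seconds elapsed since 1970-01-01).
ts_from_epoch = pd.Timestamp(1609459200, unit="s")

print(df.dtypes)
print(ts_from_epoch)
```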

## Calling Grouper class
- Firstly, convert all datetime objects into pandas.Timestamps.
- To group by dates, we must call the Grouper class:
- https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Grouper.html

Syntax:

```
pandas.Grouper(key=None, level=None, freq=None, axis=0, sort=False)
```
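- Notice that setting sort = True will sort the grouped values. We do not need to specify axis = 0, since it is the default.
- The frequency aliases depend on the pandas version: since pandas 2.2, 'Y', 'M', 'H' and 'S' are deprecated in favor of 'YE', 'ME', 'h' and 's'. The examples below use the older spelling.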

## Group by Year

```
df.groupby(pd.Grouper(key='Date', freq='1Y')).mean()
```

In this case, we grouped by intervals of 1 year. We could group by different values of years, though. For instance:

```
df.groupby(pd.Grouper(key='Date', freq='2Y')).mean()
```
Groups by intervals of 2 years.

## Group by Month

```
df.groupby(pd.Grouper(key='Date', freq='1M')).mean()
```
- Again, we could modify the number of months. For instance, the aggregation by quarters (3-month periods) is done as:

```
df.groupby(pd.Grouper(key='Date', freq='3M')).mean()
```

## Group by Week

```
df.groupby(pd.Grouper(key='Date', freq='1W')).mean()
```
- As usual, simply modify the number before 'W' to change the number of weeks in the grouping.
- The substitution of '1W' by '2W' results in the aggregation every 2 weeks.

## Group by Day

```
df.groupby(pd.Grouper(key='Date', freq='1D')).mean()
```

- If you want to group by a different number of days, simply modify the number before 'D'.
- So, grouping by every two days is performed as `df.groupby(pd.Grouper(key='Date', freq='2D')).mean()`; whereas `df.groupby(pd.Grouper(key='Date', freq='5D')).mean()` groups by every five days.
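
The daily grouping can be checked on a small example. This is a minimal sketch with a hypothetical dataframe (the values are made up for illustration); it compares the `.mean()` and `.sum()` aggregates over the same daily bins.

```python
import pandas as pd

# Hypothetical measurements: two on the first day, one on each following day.
df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2021-01-01 08:00", "2021-01-01 20:00",
        "2021-01-02 09:00", "2021-01-03 10:00",
    ]),
    "value": [1.0, 3.0, 5.0, 7.0],
})

# One bin per day: mean and total of the values registered in each day.
daily_mean = df.groupby(pd.Grouper(key="Date", freq="1D")).mean()
daily_sum = df.groupby(pd.Grouper(key="Date", freq="1D")).sum()

# Bins of two days: the first bin covers 2021-01-01 and 2021-01-02.
two_day_mean = df.groupby(pd.Grouper(key="Date", freq="2D")).mean()

print(daily_mean)
print(daily_sum)
print(two_day_mean)
```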

## Group by Hour

```
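# Group by hour and by the categorical column 'Location' (assumes df has both columns).
# key='Date' is needed when the timestamps are in a column and not in the index.
grouper = df.groupby([pd.Grouper(key='Date', freq='1H'), 'Location'])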
```
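
A possible use of the hourly grouper combined with a second key is sketched below. The dataframe and its 'Location' column are hypothetical. The hour is written as '60min' because the hour alias is spelled 'H' or 'h' depending on the pandas version, whereas 'min' is accepted in both cases.

```python
import pandas as pd

# Hypothetical readings from two locations.
df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2021-01-01 08:05", "2021-01-01 08:40",
        "2021-01-01 08:50", "2021-01-01 09:10",
    ]),
    "Location": ["A", "A", "B", "A"],
    "value": [1.0, 3.0, 10.0, 5.0],
})

# Group by 1-hour bins AND by location, then take the mean of 'value'.
# The result is a Series indexed by (hour, location).
hourly = (
    df.groupby([pd.Grouper(key="Date", freq="60min"), "Location"])["value"]
      .mean()
)

print(hourly)
```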

## Group by Minute

```
df.groupby(pd.Grouper(key='Date', freq='1min')).mean()
```
- To group by every 15 mins: `df.groupby(pd.Grouper(key='Date', freq='15min')).mean()`
- To group by every 2 mins: `df.groupby(pd.Grouper(key='Date', freq='2min')).mean()`

## Group by Second

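The next example upsamples the time series into 30-second bins. Notice that `.asfreq` requires the timestamps to be the index of the dataframe, and that it only reindexes the data to the new frequency, without aggregating. To aggregate by second, use `pd.Grouper` with a frequency in seconds, as in the previous sections.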
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.asfreq.html

```
df.asfreq(freq='30S')
```
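
The sketch below illustrates what `.asfreq` does on a hypothetical series sampled every minute: the new 30-second timestamps are created, but they receive missing values unless a fill method is passed. The alias is written in lowercase ('30s'), which is the spelling that recent pandas versions expect.

```python
import pandas as pd

# Hypothetical series sampled once per minute, with the timestamps as index.
df = pd.DataFrame(
    {"value": [1.0, 2.0, 3.0]},
    index=pd.to_datetime([
        "2021-01-01 00:00:00", "2021-01-01 00:01:00", "2021-01-01 00:02:00",
    ]),
)

# Upsample to 30 s: the new intermediate timestamps are filled with NaN.
upsampled = df.asfreq(freq="30s")

# Same upsampling, but propagating the last valid observation forward.
filled = df.asfreq(freq="30s", method="ffill")

print(upsampled)
print(filled)
```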

### Adjusting the time bins based on a fixed timestamp:
- Suppose a grouping by every 17 mins.
- You can specify an origin or specify an offset (equivalent):

```
df.groupby(pd.Grouper(key='Date', freq='17min', origin='2000-01-01')).mean()
```

By default, the bins are anchored at 00:00:00 of the first day found in the data (`origin='start_day'`), and the `offset` is a timedelta added to that origin. So, if the first timestamp falls on 2000-10-01, the following lines are equivalent: in the second one, we simply specify an offset of 23 h and 30 min, so that the bins are anchored at 23:30:00 of the first day instead of at 00:00:00:

```
df.groupby(pd.Grouper(key='Date', freq='17min', origin='2000-10-01 23:30:00')).mean()
df.groupby(pd.Grouper(key='Date', freq='17min', offset='23h30min')).mean()
```
The same output can be obtained by defining a string or timestamp and passing it as argument:

```
start = '2000-10-01 23:30:00'
df.groupby(pd.Grouper(key='Date', freq='17min', origin= start)).mean()
```

Now, suppose the timestamps contain the time information (e.g.: 01:10:20), so that the first timestamp is not necessarily at 00:00:00, as before. With the default `origin='start_day'`, the **'offset' parameter is still counted from 00:00:00 of the first day, and not from the first timestamp.**
- When the time is not declared, pandas assigns the time 00:00:00 to the timestamp.
- So, if we have `offset='2min'`, the bin edges fall at 00:02:00 of the first day plus multiples of the bin width. The first bin is the one that contains the first timestamp of the dataframe df.
- To anchor the bins at the first timestamp itself, set `origin='start'`. In this case, `offset = 'XXhYYmin'` indicates to the `Grouper` class that the bin edges should be shifted by XX h and YY min in relation to the first timestamp.

# **Merging (joining) the data by a timestamp with Pandas - Background and Documentation**
- We could use the .merge method, but this will not return an ordered dataframe.
- Let's use the .merge_ordered instead.
- If the data is not synchronous, we can perform the fuzzy merging using the .merge_asof method.

## Methods comparison
_From Datacamp course: Joining Data with pandas, chapter 4 - Merging Ordered and Time-Series Data_

### .merge() method:
- Column(s) to join on: on , left_on , and right_on
- Type of join: how (left, right, inner, outer)
    - Default: 'inner'.
- Overlapping column names: suffixes
- Calling the method: df1.merge(df2)

### .merge_ordered() method:
- Column(s) to join on: on , left_on , and right_on
- Type of join: how (left, right, inner, outer)
    - Default: 'outer'.
- Overlapping column names: suffixes
- Calling the method: pd.merge_ordered(df1, df2)

Examples:

```
import pandas as pd
pd.merge_ordered(aapl, mcd, on='date', suffixes=('_aapl','_mcd'))
```
#### Forward fill: fills missing with previous value

```
pd.merge_ordered(aapl, mcd, on='date', suffixes=('_aapl','_mcd'), fill_method='ffill')
```
- When to use merge_ordered()?
    - Ordered data / time series.
    - Filling in missing values.

### .merge_asof() method:
- Similar to a merge_ordered() left join.
    - Similar features as merge_ordered().
- Matches on the nearest key value, instead of requiring exact matches.
    - The merged "on" columns must be sorted.

```
pd.merge_asof(visa, ibm, on='date_time', suffixes=('_visa','_ibm'))
```
#### merge_asof() example with direction
```
pd.merge_asof(visa, ibm, on=['date_time'], suffixes=('_visa','_ibm'), direction='forward')
```

direction: ‘backward’ (default), ‘forward’, or ‘nearest’.
- 'nearest' allows both directions.
- merge_asof does not allow filling. Check: 
    - https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge_asof.html
    - https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge_ordered.html#pandas.merge_ordered


- When to use merge_asof()
    - Data sampled from a process.
    - Developing a training set (no data leakage).
    - .merge_asof uses fuzzy matching and has no `how` parameter: the result always keeps the rows of the left dataframe (left join).
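
The three directions can be compared on a small example. This is only a sketch: the `visa` and `ibm` dataframes below are hypothetical, with timestamps that never coincide exactly, so that the fuzzy matching becomes visible.

```python
import pandas as pd

# Hypothetical quotes sampled at non-matching instants (both sorted by key).
visa = pd.DataFrame({
    "date_time": pd.to_datetime([
        "2021-01-01 09:00:00", "2021-01-01 09:01:00", "2021-01-01 09:02:00",
    ]),
    "close": [10.0, 11.0, 12.0],
})
ibm = pd.DataFrame({
    "date_time": pd.to_datetime([
        "2021-01-01 08:59:30", "2021-01-01 09:00:45", "2021-01-01 09:02:10",
    ]),
    "close": [50.0, 51.0, 52.0],
})

# backward (default): last right-hand row at or before each left timestamp.
backward = pd.merge_asof(visa, ibm, on="date_time", suffixes=("_visa", "_ibm"))

# forward: first right-hand row at or after each left timestamp.
forward = pd.merge_asof(visa, ibm, on="date_time",
                        suffixes=("_visa", "_ibm"), direction="forward")

# nearest: closest right-hand row, in either direction.
nearest = pd.merge_asof(visa, ibm, on="date_time",
                        suffixes=("_visa", "_ibm"), direction="nearest")

print(backward["close_ibm"].tolist())
print(forward["close_ibm"].tolist())
print(nearest["close_ibm"].tolist())
```

For training sets, the backward direction is the one that avoids data leakage, since each row only receives information available at or before its own timestamp.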