# **Dataset Transformation**
## Transforming the dataset and reverse transforms: log-transform; exponential transform; Box-Cox transform; One-Hot Encoding; feature scaling; importing or exporting models and dictionaries.

## _ETL Workflow Notebook 3_

Marco Cesar Prado Soares, Data Scientist Specialist - Bayer Crop Science LATAM
- marcosoares.feq@gmail.com
- marco.soares@bayer.com

Install statsmodels library

In [None]:
! pip install statsmodels

Install tensorflow library

In [None]:
! pip install tensorflow

Install Keras library

In [None]:
! pip install keras

Install SHAP library

In [None]:
! pip install shap

In [None]:
#check the version of the package
! pip show shap

In [None]:
# Upgrade to the most recent library versions, if a given module is not present and analysis cannot be
# executed.
! pip install pip --upgrade
! pip install tensorflow --upgrade
! pip install keras --upgrade
! pip install shap --upgrade
! pip install scikit-learn --upgrade
! pip install pandas --upgrade
! pip install numpy --upgrade
! pip install matplotlib --upgrade
! pip install seaborn --upgrade
! pip install scipy --upgrade
! pip install statsmodels --upgrade

## **Load Python Libraries in Global Context**

In [None]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# **Function for mounting Google Drive or S3 (AWS Simple Storage Service) bucket**

In [2]:
def mount_storage_system (source = 'aws', path_to_store_imported_s3_bucket = '/', s3_bucket_name = None, s3_obj_key_preffix = None):
    
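    # sagemaker is the AWS SageMaker Python SDK; google.colab is available only in Google Colab.
    # Import only the module required by the selected source, so that the function
    # does not fail in an environment where the other one is not installed.
    if (source == 'aws'):
        from sagemaker.session import Session
    elif (source == 'google'):
        from google.colab import drive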
    
    # source = 'google' for mounting the google drive;
    # source = 'aws' for mounting an AWS S3 bucket.
    
    # THE FOLLOWING PARAMETERS HAVE EFFECT ONLY WHEN source == 'aws'
    
    # path_to_store_imported_s3_bucket: path of the Python environment to which the
    # S3 bucket contents will be imported. If it is None, or if 
    # path_to_store_imported_s3_bucket = '/', bucket will be imported to the root path. 
    # Alternatively, input the path as a string (in quotes). e.g. 
    # path_to_store_imported_s3_bucket = '/copied_s3_bucket'
    
    # s3_bucket_name = None.
    ## This parameter is obligatory to access an AWS S3 bucket. Substitute it for a string
    # with the bucket's name. e.g. s3_bucket_name = "aws-bucket-1" accesses a bucket named
    # "aws-bucket-1"
    
    # s3_obj_key_preffix = None. Keep it None or as an empty string (s3_obj_key_preffix = '')
    # to import the whole bucket content, instead of a single object from it.
    # Alternatively, set it as a string containing the subfolder from the bucket to import:
    # Suppose that your bucket (admin-created) has four objects with the following object 
    # keys: Development/Projects.xlsx; Finance/statement1.pdf; Private/taxdocument.pdf; and
    # s3-dg.pdf. The s3-dg.pdf key does not have a prefix, so its object appears directly 
    # at the root level of the bucket. If you open the Development/ folder, you see 
    # the Projects.xlsx object in it.
    # Check Amazon documentation:
    # https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-keys.html
    
    # In summary, if the path of the file is: 'bucket/my_path/.../file.csv'
    # where 'bucket' is the bucket's name, s3_obj_key_preffix = 'my_path/.../', i.e. the
    # path without the final 'file.csv' part (file name with extension).
    
    if (source == 'google'):
        
        print("Associate the Python environment to your Google Drive account, and authorize the access in the opened window.")
        
        drive.mount('/content/drive')
        
        print("Now your Python environment is connected to your Google Drive: the root directory of your environment is now the root of your Google Drive.")
        print("In Google Colab, navigate to the folder icon (\'Files\') of the left navigation menu to find a specific folder or file in your Google Drive.")
        print("Click on the folder or file name and select the ellipsis (...) icon on the right of the name to reveal the option \'Copy path\', which will give you the path to use as input for loading objects and files on your Python environment.")
        print("Caution: save your files into different directories of the Google Drive. If files are all saved in a same folder or directory, like the root path, they may not be accessible from your Python environment.")
        print("If you still cannot see the file after moving it to a different folder, reload the environment.")
    
    elif (source == 'aws'):
        
        # Notice: if you wanted to authenticate directly from Python code, you could use
        # the following code, instead, to start the S3 client. boto3 is the AWS SDK for Python:
        
        # import boto3
        # ACCESS_KEY = 'access_key_ID'
        # PASSWORD_KEY = 'password_key'
        # s3_client = boto3.client('s3', aws_access_key_id = ACCESS_KEY, aws_secret_access_key = PASSWORD_KEY)
        # ... [here, use the same following code until line new_session = Session()]
        # [keep the line for session start. Substitute the line with the .download_data
        # method by the following line:]
        # s3_client.download_file(s3_bucket_name, s3_file_name_with_extension, path_to_store_imported_s3_bucket)
        
        # Check if the whole bucket will be downloaded (s3_obj_key_preffix = None):
        if (s3_obj_key_preffix is None):
            
            s3_obj_key_preffix = ''
        
        # If the path to store is None, also import the bucket to the root path:
        if (path_to_store_imported_s3_bucket is None):
            
            path_to_store_imported_s3_bucket = '/'
        
        # If the bucket name was provided, start the session. If not, print an error
        # message:
        
        if (s3_bucket_name is None):
            
            print("Please, provide a valid S3 Bucket name to download from.")
        
        else:
        
            # start a new sagemaker session:

            print("Starting a SageMaker session to be associated with the S3 bucket.")

            new_session = Session()
            # Check sagemaker session class documentation:
            # https://sagemaker.readthedocs.io/en/stable/api/utility/session.html
            new_session.download_data(path = path_to_store_imported_s3_bucket, bucket = s3_bucket_name, key_prefix = s3_obj_key_preffix)

            print(f"S3 bucket contents successfully imported to path \'{path_to_store_imported_s3_bucket}\'.")
            
    else:
        
        print("Select a valid source: \'google\' for mounting Google Drive; or \'aws\' for accessing AWS S3 Bucket.")
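
The relation between a full object path, the bucket name and the key prefix described in the comments of `mount_storage_system` can be sketched as below. This is only an illustration: `split_s3_path` is a hypothetical helper that does not exist in the notebook or in any AWS SDK, and it assumes that the path starts with the bucket name.

```python
import posixpath

def split_s3_path(full_path):
    """Split 'bucket/my_path/.../file.csv' into (bucket, key_prefix, file_name).

    Hypothetical helper for illustration only; it is not part of the notebook.
    """
    # The first path component is assumed to be the bucket name.
    bucket, _, object_key = full_path.strip('/').partition('/')
    # S3 object keys use forward slashes, hence posixpath.
    prefix, file_name = posixpath.split(object_key)
    # Re-append the trailing slash so the prefix has the 'my_path/.../' form.
    key_prefix = prefix + '/' if prefix else ''
    return bucket, key_prefix, file_name

bucket, key_prefix, file_name = split_s3_path('aws-bucket-1/Development/Projects.xlsx')
print(bucket, key_prefix, file_name)

# An object stored directly at the root level of the bucket has an empty prefix:
root_level = split_s3_path('aws-bucket-1/s3-dg.pdf')
print(root_level)
```

With these values, `key_prefix` is what would be passed as `s3_obj_key_preffix`, and the empty prefix corresponds to importing from the root of the bucket.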

# **Function for loading the dataframe**

In [9]:
def load_dataframe (file_directory_path, file_name_with_extension, has_header = True, txt_csv_col_sep = "comma", sheet_to_load = None):
    
    import os
    import pandas as pd
    
    # WARNING: Use this function to load dataframes stored on Excel (xls, xlsx, etc), 
    # txt, or CSV (comma separated values) files.
    
    # file_directory_path - (string, in quotes): input the path of the directory (e.g. folder path) 
    # where the file is stored. e.g. file_directory_path = "/" or file_directory_path = "/folder"
    
    # file_name_with_extension - (string, in quotes): input the name of the file with the extension
    # e.g. file_name_with_extension = "file.xlsx", or, file_name_with_extension = "file.csv"
    
    # has_header = True if the imported table has headers (row with columns names).
    # Alternatively, has_header = False if the dataframe does not have header.
    
    # txt_csv_col_sep = "comma" - This parameter has effect only when the file is a 'txt'
    # or 'csv'. It informs how the different columns are separated.
    # txt_csv_col_sep = "comma" for columns separated by comma (",");
    # txt_csv_col_sep = "whitespace" for columns separated by simple spaces (" ").
    
    # sheet_to_load - This parameter has effect only for Excel files.
    # keep sheet_to_load = None not to specify a sheet of the file, so that the first sheet
    # will be loaded.
    # sheet_to_load may be an integer or a string (inside quotes). sheet_to_load = 0
    # loads the first sheet (sheet with index 0); sheet_to_load = 1 loads the second sheet
    # of the file (index 1); sheet_to_load = "Sheet1" loads a sheet named as "Sheet1".
    # Declare a number to load the sheet with that index, starting from 0; or declare a
    # name to load the sheet with that name.
    
    # Create the complete file path:
    file_path = os.path.join(file_directory_path, file_name_with_extension)
    # Extract the file extension
    file_extension = os.path.splitext(file_path)[1][1:]
    # os.path.splitext(file_path) is a tuple of strings: the first is the complete file
    # root with no extension; the second is the extension starting with a point: '.txt'
    # When we set os.path.splitext(file_path)[1], we are selecting the second element of
    # the tuple. By selecting os.path.splitext(file_path)[1][1:], we are taking this string
    # from the second character (index 1), eliminating the dot: 'txt'
    
    if ((file_extension == 'txt') | (file_extension == 'csv')): 
        # The operator & is equivalent to 'And' (intersection).
        # The operator | is equivalent to 'Or' (union).
        # pandas.read_csv method must be used.
        
        if (has_header == True):
            
            if (txt_csv_col_sep == "comma"):
            
                dataset = pd.read_csv(file_path)
            
            elif (txt_csv_col_sep == "whitespace"):
                
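                # sep = r'\s+' splits on any run of whitespace; it replaces delim_whitespace = True,
                # which is deprecated in recent pandas versions.
                dataset = pd.read_csv(file_path, sep = r'\s+')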
            
            else:
                print(f"Enter a valid column separator for the {file_extension} file: \'comma\' or \'whitespace\'.")
        
        else:
            # has_header == False
              
            if (txt_csv_col_sep == "comma"):
            
                dataset = pd.read_csv(file_path, header = None)
            
            elif (txt_csv_col_sep == "whitespace"):
                
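                # sep = r'\s+' splits on any run of whitespace; it replaces delim_whitespace = True,
                # which is deprecated in recent pandas versions.
                dataset = pd.read_csv(file_path, sep = r'\s+', header = None)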
            
            else:
                print(f"Enter a valid column separator for the {file_extension} file: \'comma\' or \'whitespace\'.")
        
    else:
        # If it is neither a csv nor a txt file, let's assume it is one of the different
        # possible Excel files.
        print("Excel file inferred. If an error message is shown, check if a valid file extension was used: \'xlsx\', \'xls\', etc.")
            
        if (sheet_to_load is not None):        
        #Case where the user specifies which sheet of the Excel file should be loaded.
            
            if (has_header == True):
                
                dataset = pd.read_excel(file_path, sheet_name = sheet_to_load)
            
            else:
                #No header
                dataset = pd.read_excel(file_path, sheet_name = sheet_to_load, header = None)
        
        else:
            #No sheet specified
            if (has_header == True):
                
                dataset = pd.read_excel(file_path)
            
            else:
                #No header
                dataset = pd.read_excel(file_path, header = None)
    
    print(f"Dataset extracted from {file_path}. Check the first 10 rows of the dataset:\n")
    print(dataset.head(10))
    
    return dataset   
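
The two steps that `load_dataframe` relies on for text files, extracting the extension and choosing the column separator, can be tried in isolation. The sketch below uses an in-memory string as a stand-in for a file on disk; the column names and values are made up for the example.

```python
import io
import os
import pandas as pd

# Extension without the leading dot, obtained the same way as in load_dataframe:
file_extension = os.path.splitext("/folder/file.txt")[1][1:]

# In-memory stand-in for a whitespace-separated text file (made-up data):
raw_text = "temperature pressure\n25.0 1.01\n30.5 0.98\n"

# sep = r"\s+" treats any run of whitespace as a single column separator.
with_header = pd.read_csv(io.StringIO(raw_text), sep = r"\s+")

# header = None keeps the first line as data and numbers the columns 0, 1, ...
no_header = pd.read_csv(io.StringIO(raw_text), sep = r"\s+", header = None)

print(file_extension)
print(with_header)
print(no_header.shape)
```

Reading the same text with `header = None` yields one extra row, because the line with the column names is then treated as data.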

# **Function for dataframe general characterization**

In [10]:
def df_gen_charac (df):
    
    import pandas as pd
    
    print("Dataframe first 10 rows:")
    print(df.head(10))
    
    #Line break before next information:
    print("\n")
    df_shape  = df.shape
    print(f"Dataframe shape (rows, columns) = {df_shape}.")
    
    #Line break before next information:
    print("\n")
    df_columns_list = df.columns
    print(f"Dataframe columns list = {df_columns_list}.")
    
    #Line break before next information:
    print("\n")
    df_dtypes = df.dtypes
    print("Dataframe variables types:")
    print(df_dtypes)
    
    #Line break before next information:
    print("\n")
    df_general_statistics = df.describe()
    print("Dataframe general statistics (numerical variables):")
    print(df_general_statistics)
    
    #Line break before next information:
    print("\n")
    df_missing_values = df.isna().sum()
    print("Total of missing values for each feature:")
    print(df_missing_values)
    
    return df_shape, df_columns_list, df_dtypes, df_general_statistics, df_missing_values
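
As a quick illustration of what the pandas calls inside `df_gen_charac` return, the sketch below applies them to a tiny made-up dataframe with one numeric column, one text column and two missing values.

```python
import numpy as np
import pandas as pd

# Made-up data: column "a" is numeric, column "b" holds text; each has one missing entry.
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", "y", None]})

# (rows, columns) tuple:
df_shape = df.shape

# Count of missing values per column:
missing_per_column = df.isna().sum()

# By default, describe() summarizes only the numeric columns when any are present:
numeric_summary = df.describe()

print(df_shape)
print(missing_per_column)
print(numeric_summary)
```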

# **Function for obtaining the correlation plot**

- The Pandas method dataset.corr() calculates the Pearson's correlation coefficients R.
- Pearson's correlation coefficients R go from -1 to 1.
- These coefficients are R, not R².

#### To obtain the coefficients R², we raise the results to the 2nd power, i.e., we calculate (dataset.corr())**2
- R² goes from 0 to 1, where 1 represents the perfect correlation.
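
The difference between R and R² can be checked on a small made-up dataset in which one column grows linearly with `x` and another decreases linearly with it. This is only a sketch; the column names are invented for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0],
    "y_up": [2.0, 4.0, 6.0, 8.0],      # y_up = 2*x
    "y_down": [8.0, 6.0, 4.0, 2.0],    # y_down = 10 - 2*x
})

# Pearson's R keeps the sign of the relation (values from -1 to 1):
r = df.corr(method = "pearson")

# Squaring element-wise gives R² (values from 0 to 1) and discards the sign:
r_squared = r ** 2

print(r)
print(r_squared)
```

Both `y_up` and `y_down` are perfectly linear in `x`, so they share the same R² with it even though their R values have opposite signs.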

In [11]:
def correlation_plot (df, show_masked_plot = True, responses_to_return_corr = None, set_returned_limit = None, export_png = False, directory_to_save = None, file_name = None, png_resolution_dpi = 110):
    
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns
    
    #show_masked_plot = True - keep as True if you want to see a cleaned version of the plot
    # where a mask is applied.
    
    #responses_to_return_corr - keep as None to return the full correlation matrix.
    # If you want to display the correlations for a particular group of features, input them
    # as a list, even if this list contains a single element. Examples:
    # responses_to_return_corr = ['response1'] for a single response
    # responses_to_return_corr = ['response1', 'response2', 'response3'] for multiple
    # responses. Notice that 'response1',... should be substituted by the name ('string')
    # of a column of the dataset that represents a response variable.
    # WARNING: The returned coefficients will be ordered according to the order of the list
    # of responses. i.e., they will be firstly ordered based on 'response1'
    
    # set_returned_limit = None - This variable will only present effects in case you have
    # provided a response feature to be returned. In this case, keep set_returned_limit = None
    # to return all of the correlation coefficients; or, alternatively, 
    # provide an integer number to limit the total of coefficients returned. 
    # e.g. if set_returned_limit = 10, only the ten highest coefficients will be returned. 

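    # numeric_only = True (available from pandas 1.5) ignores non-numeric columns, which
    # would otherwise raise an error in pandas 2.0 and later.
    correlation_matrix = df.corr(method = 'pearson', numeric_only = True)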
    
    if (show_masked_plot == False):
        #Show standard plot
        
        plt.figure()
        sns.heatmap((correlation_matrix)**2, annot=True, fmt=".2f")
        
        if (export_png == True):
            # Image will be exported
            import os

            #check if the user defined a directory path. If not, set as the default root path:
            if (directory_to_save is None):
                #set as the default
                directory_to_save = "/"

            #check if the user defined a file name. If not, set as the default name for this
            # function.
            if (file_name is None):
                #set as the default
                file_name = "correlation_plot"

            #check if the user defined an image resolution. If not, set as the default 110 dpi
            # resolution.
            if (png_resolution_dpi is None):
                #set as 110 dpi
                png_resolution_dpi = 110

            #Get the new_file_path, including the '.png' extension: savefig does not
            # append an extension when the format argument is passed explicitly.
            new_file_path = os.path.join(directory_to_save, file_name + ".png")

            #Export the file to this new path:
            #quality is not passed: it applies only to JPEG output, and recent matplotlib
            # versions no longer accept it.
            plt.savefig(new_file_path, dpi = png_resolution_dpi, format = 'png', transparent = False) 
            #format (str, supported formats) = 'png', 'pdf', 'ps', 'eps' or 'svg'
            #transparent = True or False
            # For other parameters of .savefig method, check https://indianaiproduction.com/matplotlib-savefig/
            print (f"Figure exported as \'{new_file_path}\'. Any previous file in this root path was overwritten.")

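        #Set image size (width, height, in inches) of the current figure for printing in the
        # notebook's cell. Calling plt.figure here would create a new, empty figure instead.
        plt.gcf().set_size_inches(24, 8)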

    #Once the pandas method .corr() calculates R, we raise it to the second power 
    # to obtain R². R² goes from zero to 1, where 1 represents the perfect correlation.
    
    else:
        
        #Show masked (cleaner) plot instead of the standard one
        
        plt.figure()
        # Mask for the upper triangle
        mask = np.zeros_like((correlation_matrix)**2)

        mask[np.triu_indices_from(mask)] = True

        # Generate a custom diverging colormap
        cmap = sns.diverging_palette(220, 10, as_cmap=True)

        # Heatmap with mask and correct aspect ratio
        sns.heatmap(((correlation_matrix)**2), mask=mask, cmap=cmap, vmax=.3, center=0,
                    square=True, linewidths=.5, cbar_kws={"shrink": .5})
        
        if (export_png == True):
            # Image will be exported
            import os

            #check if the user defined a directory path. If not, set as the default root path:
            if (directory_to_save is None):
                #set as the default
                directory_to_save = "/"

            #check if the user defined a file name. If not, set as the default name for this
            # function.
            if (file_name is None):
                #set as the default
                file_name = "correlation_plot"

            #check if the user defined an image resolution. If not, set as the default 110 dpi
            # resolution.
            if (png_resolution_dpi is None):
                #set as 110 dpi
                png_resolution_dpi = 110

            #Get the new_file_path, including the '.png' extension: savefig does not
            # append an extension when the format argument is passed explicitly.
            new_file_path = os.path.join(directory_to_save, file_name + ".png")

            #Export the file to this new path:
            #quality is not passed: it applies only to JPEG output, and recent matplotlib
            # versions no longer accept it.
            plt.savefig(new_file_path, dpi = png_resolution_dpi, format = 'png', transparent = False) 
            #format (str, supported formats) = 'png', 'pdf', 'ps', 'eps' or 'svg'
            #transparent = True or False
            # For other parameters of .savefig method, check https://indianaiproduction.com/matplotlib-savefig/
            print (f"Figure exported as \'{new_file_path}\'. Any previous file in this root path was overwritten.")

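        #Set image size (width, height, in inches) of the current figure for printing in the
        # notebook's cell. Calling plt.figure here would create a new, empty figure instead.
        plt.gcf().set_size_inches(24, 8)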

        #Again, the method dataset.corr() calculates R within the variables of dataset.
        #To calculate R², we simply raise it to the second power: (dataset.corr()**2)
    
    #Sort the values of correlation_matrix in Descending order:
    
    if (responses_to_return_corr is not None):
        
        #Select only the desired responses, by passing the list responses_to_return_corr
        # as parameter for column filtering:
        correlation_matrix = correlation_matrix[responses_to_return_corr]
        
        #Now sort the values according to the responses, by passing the list
        # responses_to_return_corr as the parameter
        correlation_matrix = correlation_matrix.sort_values(by = responses_to_return_corr, ascending = False)
        
        # If a limit of coefficients was determined, apply it:
        if (set_returned_limit is not None):
                
                correlation_matrix = correlation_matrix.head(set_returned_limit)
                #Pandas .head(X) method returns the first X rows of the dataframe.
                # Here, it returns the defined limit of coefficients, set_returned_limit.
                # The default .head() is X = 5.
        
        print(correlation_matrix)
    
    print("ATTENTION: The correlation plots show the linear correlations R², which go from 0 (no correlation) to 1 (perfect correlation). The main diagonal always shows R² = 1, since each variable is perfectly correlated to itself.")
    print("The returned correlation matrix, on the other hand, presents the linear coefficients of correlation R, not R². R values go from -1 (perfect negative correlation) to 1 (perfect positive correlation).")
    print("None of these coefficients takes non-linear relations or multiple linear correlation into account. When there are multiple predictors, it is necessary to calculate the adjusted R², which accounts for the number of predictors.")
    
    return correlation_matrix
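
Two steps of `correlation_plot` can be reproduced without drawing anything: building the mask that hides the upper triangle of the heatmap, and ranking the coefficients by a response column. The sketch below uses random made-up data, and the column names `f1`, `f2` and `response1` are placeholders.

```python
import numpy as np
import pandas as pd

# Made-up dataset with two features and one response:
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size = (50, 3)), columns = ["f1", "f2", "response1"])

correlation_matrix = df.corr(method = "pearson")
r_squared = correlation_matrix ** 2

# Boolean mask that is True on and above the main diagonal. Passed as mask = mask
# to sns.heatmap, it hides those cells and leaves only the lower triangle visible.
mask = np.triu(np.ones(r_squared.shape, dtype = bool))

# Keep only the response column and sort the R values in descending order:
ranked = correlation_matrix[["response1"]].sort_values(by = ["response1"], ascending = False)

# Limit the number of returned coefficients, as set_returned_limit does:
top_two = ranked.head(2)

print(mask)
print(top_two)
```

The first row of `ranked` is always the response itself, since its correlation with itself is 1.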

# **Function for obtaining scatter plots and simple linear regressions**
- Here, only a single predictor variable is analyzed at a time.
- The plots show Y x X, where X is the predictor (independent) variable.
- The linear regressions are of the type Y = aX + b, i.e., a single pair (X, Y) is analyzed.

        x1, y1, lab1: blue
        x2, y2, lab2: red
        x3, y3, lab3: green
        x4, y4, lab4: black
        x5, y5, lab5: magenta
        x6, y6, lab6: yellow
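
A minimal sketch of the regression Y = aX + b for a single pair (X, Y), using `scipy.stats.linregress` on made-up points that lie exactly on the line y = 2x + 1:

```python
import pandas as pd
from scipy import stats

# Made-up data, deliberately unsorted, lying exactly on y = 2x + 1:
x = pd.Series([3.0, 1.0, 2.0, 4.0])
y = 2.0 * x + 1.0

# Fit y = slope * x + intercept:
reg = stats.linregress(x, y)

# Sort X before evaluating the fitted line, so that the plotted segments
# run from left to right:
x_sorted = x.sort_values().reset_index(drop = True)
y_fitted = reg.intercept + reg.slope * x_sorted

# linregress returns R (rvalue); squaring it gives R²:
r_squared = reg.rvalue ** 2

print("y = %.2f*x + %.2f" % (reg.slope, reg.intercept))
print("R² = %.4f" % r_squared)
```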

In [12]:
def scatter_plot_lin_reg (x1 = None, y1 = None, x2 = None, y2 = None, x3 = None, y3 = None, x4 = None, y4 = None, x5 = None, y5 = None, x6 = None, y6 = None, x_axis_rotation = 0, y_axis_rotation = 0, show_linear_reg = True, grid = True, add_splines_lines = False, lab1 = None, lab2 = None, lab3 = None, lab4 = None, lab5 = None, lab6 = None, horizontal_axis_title = None, vertical_axis_title = None, plot_title = None, export_png = False, directory_to_save = None, file_name = None, png_resolution_dpi = 110): 
    
    import matplotlib.pyplot as plt
    import pandas as pd
    from scipy import stats
    
    if (add_splines_lines == True):
        line_value = '-'
    else:
        line_value = ''
    
    if (show_linear_reg == True):
        estatisticas = []
        estatisticas.append("Linear Fitting:")
        estatisticas.append("R² = ")
    
    fig = plt.figure()
    ax = fig.add_subplot()
    
    if not (x1 is None):
        
        if not (lab1 is None):
            label_1 = lab1
        else:
            label_1 = "Y1 x X1"
        
        #Negated check: runs only if the values are not None
        ax.plot(x1, y1, linestyle = line_value, marker = 'o', color='blue', label=label_1)
        
        if (show_linear_reg == True):
            reta1 = []
            #Linear regression calculation:
            reg1 = stats.linregress(x1, y1)
            #Sort the X values, so that the line segments form the correct straight line
            x_reg1 = x1.sort_values()
            x_reg1 = x_reg1.reset_index(drop = True)
            #Fitted curve:
            y_reg1 = (reg1).intercept + (reg1).slope*(x_reg1)
            #Generate the string with the line equation
            string1 = "y = %.2f*x + %.2f" %((reg1).slope, (reg1).intercept)
            reta1.append(string1)
            #Calculate R²
            r_sq1 = (reg1).rvalue**2
            reta1.append(r_sq1)
            print("\nLinear Fitting 1: " + string1)
            #Print the formatted R² value
            print("\nR² (fitting 1) = %.4f" %(r_sq1))
            string_label1 = 'Linear regression: ' + label_1
            ax.plot(x_reg1, y_reg1,  linestyle='-', marker='', color='blue', label = string_label1)
    
    if not (x2 is None):
        #Runs only if the second (x2, y2) pair is present
                
        if not (lab2 is None):
            label_2 = lab2
        else:
            label_2 = "Y2"
        
        ax.plot(x2, y2, linestyle = line_value, marker = 'o', color='red', label=label_2)    
        
        if (show_linear_reg == True):
            reta2 = []
            #Linear regression calculation:
            reg2 = stats.linregress(x2, y2)
            #Sort the X values, so that the line segments form the correct straight line
            x_reg2 = x2.sort_values()
            x_reg2 = x_reg2.reset_index(drop = True)
            #Fitted curve:
            y_reg2 = (reg2).intercept + (reg2).slope*(x_reg2)
            #Generate the string with the line equation
            string2 = "y = %.2f*x + %.2f" %((reg2).slope, (reg2).intercept)
            reta2.append(string2)
            #Calculate R²
            r_sq2 = (reg2).rvalue**2
            reta2.append(r_sq2)
            print("\nLinear Fitting 2: " + string2)
            #Print the formatted R² value
            print("\nR² (fitting 2) = %.4f" %(r_sq2))
            string_label2 = 'Linear regression: ' + label_2
            ax.plot(x_reg2, y_reg2,  linestyle='-', marker='', color='red', label = string_label2)
        
    if not (x3 is None):
                
        if not (lab3 is None):
            label_3 = lab3
        else:
            label_3 = "Y3"
        
        ax.plot(x3, y3, linestyle = line_value, marker = 'o', color='green', label=label_3)
        
        if (show_linear_reg == True):
            reta3 = []
            #Linear regression calculation:
            reg3 = stats.linregress(x3, y3)
            #Sort the X values, so that the line segments form the correct straight line
            x_reg3 = x3.sort_values()
            x_reg3 = x_reg3.reset_index(drop = True)
            #Fitted curve:
            y_reg3 = (reg3).intercept + (reg3).slope*(x_reg3)
            #Generate the string with the line equation
            string3 = "y = %.2f*x + %.2f" %((reg3).slope, (reg3).intercept)
            reta3.append(string3)
            #Calculate R²
            r_sq3 = (reg3).rvalue**2
            reta3.append(r_sq3)
            print("\nLinear Fitting 3: " + string3)
            #Print the formatted R² value
            print("\nR² (fitting 3) = %.4f" %(r_sq3))
            string_label3 = 'Linear regression: ' + label_3
            ax.plot(x_reg3, y_reg3,  linestyle='-', marker='', color='green', label = string_label3)
        
    if not (x4 is None):
                
        if not (lab4 is None):
            label_4 = lab4
        else:
            label_4 = "Y4"
        
        ax.plot(x4, y4, linestyle = line_value, marker = 'o', color='black', label=label_4)
        
        if (show_linear_reg == True):
            reta4 = []
            #Linear regression calculation:
            reg4 = stats.linregress(x4, y4)
            #Sort the X values, so that the line segments form the correct straight line
            x_reg4 = x4.sort_values()
            x_reg4 = x_reg4.reset_index(drop = True)
            #Fitted curve:
            y_reg4 = (reg4).intercept + (reg4).slope*(x_reg4)
            #Generate the string with the line equation
            string4 = "y = %.2f*x + %.2f" %((reg4).slope, (reg4).intercept)
            reta4.append(string4)
            #Calculate R²
            r_sq4 = (reg4).rvalue**2
            reta4.append(r_sq4)
            print("\nLinear Fitting 4: " + string4)
            #Print the formatted R² value
            print("\nR² (fitting 4) = %.4f" %(r_sq4))
            string_label4 = 'Linear regression: ' + label_4
            ax.plot(x_reg4, y_reg4,  linestyle='-', marker='', color='black', label = string_label4)
    
    if not (x5 is None):
               
        if not (lab5 is None):
            label_5 = lab5
        else:
            label_5 = "Y5"
        
        ax.plot(x5, y5, linestyle = line_value, marker = 'o', color='magenta', label=label_5)
        
        if (show_linear_reg == True):
            reta5 = []
            #Linear regression calculation:
            reg5 = stats.linregress(x5, y5)
            #Sort the X values, so that the line segments form the correct straight line
            x_reg5 = x5.sort_values()
            x_reg5 = x_reg5.reset_index(drop = True)
            #Fitted curve:
            y_reg5 = (reg5).intercept + (reg5).slope*(x_reg5)
            #Generate the string with the line equation
            string5 = "y = %.2f*x + %.2f" %((reg5).slope, (reg5).intercept)
            reta5.append(string5)
            #Calculate R²
            r_sq5 = (reg5).rvalue**2
            reta5.append(r_sq5)
            print("\nLinear Fitting 5: " + string5)
            #Print the formatted R² value
            print("\nR² (fitting 5) = %.4f" %(r_sq5))
            string_label5 = 'Linear regression: ' + label_5
            ax.plot(x_reg5, y_reg5,  linestyle='-', marker='', color='magenta', label = string_label5)
   
    if not (x6 is None):
               
        if not (lab6 is None):
            label_6 = lab6
        else:
            label_6 = "Y6"
        
        ax.plot(x6, y6, linestyle = line_value, marker = 'o', color='yellow', label=label_6)
            
        if (show_linear_reg == True):
            reta6 = []
            #Linear regression calculation:
            reg6 = stats.linregress(x6, y6)
            #Sort the X values, so that the line segments form the correct straight line
            x_reg6 = x6.sort_values()
            x_reg6 = x_reg6.reset_index(drop = True)
            #Fitted curve:
            y_reg6 = (reg6).intercept + (reg6).slope*(x_reg6)
            #Generate the string with the line equation
            string6 = "y = %.2f*x + %.2f" %((reg6).slope, (reg6).intercept)
            reta6.append(string6)
            #Calculate R²
            r_sq6 = (reg6).rvalue**2
            reta6.append(r_sq6)
            print("\nLinear Fitting 6: " + string6)
            #Print the formatted R² value
            print("\nR² (fitting 6) = %.4f" %(r_sq6))
            string_label6 = 'Linear regression: ' + label_6
            ax.plot(x_reg6, y_reg6,  linestyle='-', marker='', color='yellow', label = string_label6)
   
    if not (plot_title is None):
        #Plot title
        ax.set_title(plot_title) 
    
    if not (horizontal_axis_title is None):
        #X-axis title
        ax.set_xlabel(horizontal_axis_title)
    
    if not (vertical_axis_title is None):
        #Y-axis title
        ax.set_ylabel(vertical_axis_title)
    
    
    #ROTATE X AXIS IN XX DEGREES
    plt.xticks(rotation = x_axis_rotation)
    # XX = 0 DEGREES x_axis (Default)
    #ROTATE Y AXIS IN XX DEGREES:
    plt.yticks(rotation = y_axis_rotation)
    # XX = 0 DEGREES y_axis (Default)
    
    ax.grid(grid)
    ax.legend()
    
    if (export_png == True):
        # Image will be exported
        import os
        
        #check if the user defined a directory path. If not, set as the default root path:
        if (directory_to_save is None):
            #set as the default
            directory_to_save = "/"
        
        #check if the user defined a file name. If not, set as the default name for this
        # function.
        if (file_name is None):
            #set as the default
            file_name = "scatter_plot_lin_reg"
        
        #check if the user defined an image resolution. If not, set as the default 110 dpi
        # resolution.
        if (png_resolution_dpi is None):
            #set as 110 dpi
            png_resolution_dpi = 110
        
        #Get the new_file_path, including the '.png' extension: savefig does not
        # append an extension when the format argument is passed explicitly.
        new_file_path = os.path.join(directory_to_save, file_name + ".png")
        
        #Export the file to this new path:
        #quality is not passed: it applies only to JPEG output, and recent matplotlib
        # versions no longer accept it.
        plt.savefig(new_file_path, dpi = png_resolution_dpi, format = 'png', transparent = False) 
        #format (str, supported formats) = 'png', 'pdf', 'ps', 'eps' or 'svg'
        #transparent = True or False
        # For other parameters of .savefig method, check https://indianaiproduction.com/matplotlib-savefig/
        print (f"Figure exported as \'{new_file_path}\'. Any previous file in this root path was overwritten.")
    
    #Set image size (width, height, in inches) for printing in the notebook's cell.
    # Calling plt.figure here would create a new, empty figure instead of resizing this one.
    fig.set_size_inches(12, 8)
    #fig.tight_layout()
    
    ## Show an image read from an image file:
    ## import matplotlib.image as pltimg
    ## img=pltimg.imread('mydecisiontree.png')
    ## imgplot = plt.imshow(img)
    ## See linkedIn Learning course: "Supervised machine learning and the technology boom",
    ##  Ex_Files_Supervised_Learning, Exercise Files, lesson '03. Decision Trees', '03_05', 
    ##  '03_05_END.ipynb'
    plt.show()
    
    if (show_linear_reg == True):
        
        #The summary always contains the statistics and the first regression line.
        #Each further line is added only if all of the previous series are present:
        d = {'Statistics': estatisticas,
             label_1: reta1}
        
        if not (x2 is None):
            #at least 2 series are present
            d[label_2] = reta2
            
            if not (x3 is None):
                #at least 3 series are present
                d[label_3] = reta3
                
                if not (x4 is None):
                    #at least 4 series are present
                    d[label_4] = reta4
                    
                    if not (x5 is None):
                        #at least 5 series are present
                        d[label_5] = reta5
                        
                        if not (x6 is None):
                            #all 6 series are present
                            d[label_6] = reta6
        
        lin_reg_summary = pd.DataFrame(data = d)
        
        return lin_reg_summary     
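
The export branch of the function above fills in defaults for the directory and the file name before saving the figure. A minimal sketch of that path-defaulting step as a standalone helper, using only the standard library; the name `resolve_export_path` and its `default_name` parameter are hypothetical and not part of this notebook, and the sketch assumes the `.png` extension should be appended explicitly:

```python
import os

def resolve_export_path(directory_to_save=None, file_name=None, default_name="figure"):
    """Hypothetical helper: return the full path of the PNG file to be exported."""
    # Fall back to the root path when no directory is given.
    if directory_to_save is None:
        directory_to_save = "/"
    # Fall back to a function-specific default name when no file name is given.
    if file_name is None:
        file_name = default_name
    # Append the extension only if the user did not already provide it.
    if not file_name.lower().endswith(".png"):
        file_name = file_name + ".png"
    return os.path.join(directory_to_save, file_name)

print(resolve_export_path("reports", "my_plot"))
print(resolve_export_path(None, None, default_name="scatter_plot_lin_reg"))
```

Keeping this logic in one place would avoid repeating the same checks in every plotting function.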

# **Function for time series visualization**

        x1, y1, lab1: blue
        x2, y2, lab2: red
        x3, y3, lab3: green
        x4, y4, lab4: black
        x5, y5, lab5: magenta
        x6, y6, lab6: yellow

In [13]:
def time_series_vis (x1 = None, y1 = None, x2 = None, y2 = None, x3 = None, y3 = None, x4 = None, y4 = None, x5 = None, y5 = None, x6 = None, y6 = None, x_axis_rotation = 70, y_axis_rotation = 0, grid = True, add_splines_lines = True, add_scatter_dots = False, lab1 = None, lab2 = None, lab3 = None, lab4 = None, lab5 = None, lab6 = None, horizontal_axis_title = None, vertical_axis_title = None, plot_title = None, export_png = False, directory_to_save = None, file_name = None, png_resolution_dpi = 110):
    
    import matplotlib.pyplot as plt
    
    if (add_splines_lines == True):
        line_value = '-'
    else:
        line_value = ''
    
    if (add_scatter_dots == True):
        marker_value = 'o'
    else:
        marker_value = ''
    
    fig = plt.figure()
    ax = fig.add_subplot()
    
    if not (lab1 is None):
        
        label_1 = lab1
    
    else:
        label_1 = "Y1"

    if not (x1 is None):
        ax.plot(x1, y1, linestyle = line_value, marker = marker_value, color='blue', label=label_1)
    
    if not (x2 is None):
        #runs only when both are present
        if not (lab2 is None):
            label_2 = lab2
        else:
            label_2 = "Y2"
        
        ax.plot(x2, y2, linestyle = line_value, marker = marker_value, color='red', label=label_2)
    
    if not (x3 is None):
                
        if not (lab3 is None):
            label_3 = lab3
        else:
            label_3 = "Y3"
        
        ax.plot(x3, y3, linestyle = line_value, marker = marker_value, color='green', label=label_3)
    
    if not (x4 is None):
                
        if not (lab4 is None):
            label_4 = lab4
        else:
            label_4 = "Y4"
        
        ax.plot(x4, y4, linestyle = line_value, marker = marker_value, color='black', label=label_4)
    
    if not (x5 is None):
               
        if not (lab5 is None):
            label_5 = lab5
        else:
            label_5 = "Y5"
        
        ax.plot(x5, y5, linestyle = line_value, marker = marker_value, color='magenta', label=label_5)
   
    if not (x6 is None):
               
        if not (lab6 is None):
            label_6 = lab6
        else:
            label_6 = "Y6"
        
        ax.plot(x6, y6, linestyle = line_value, marker = marker_value, color='yellow', label=label_6)
   
    if not (plot_title is None):
        #graphic's title
        ax.set_title(plot_title) 
    
    if not (horizontal_axis_title is None):
        #X-axis title
        ax.set_xlabel(horizontal_axis_title)
    
    if not (vertical_axis_title is None):
        #Y-axis title
        ax.set_ylabel(vertical_axis_title)
    
    #ROTATE X AXIS IN XX DEGREES
    plt.xticks(rotation = x_axis_rotation)
    # XX = 70 DEGREES x_axis (Default)
    #ROTATE Y AXIS IN XX DEGREES:
    plt.yticks(rotation = y_axis_rotation)
    # XX = 0 DEGREES y_axis (Default)
    
    ax.grid(grid)
    ax.legend()
    
    if (export_png == True):
        # Image will be exported
        import os
        
        #check if the user defined a directory path. If not, set as the default root path:
        if (directory_to_save is None):
            #set as the default
            directory_to_save = "/"
        
        #check if the user defined a file name. If not, set as the default name for this
        # function.
        if (file_name is None):
            #set as the default
            file_name = "time_series_vis"
        
        #check if the user defined an image resolution. If not, set as the default 110 dpi
        # resolution.
        if (png_resolution_dpi is None):
            #set as 110 dpi
            png_resolution_dpi = 110
        
        #Get the new_file_path, adding the '.png' extension explicitly: when the format
        # argument is passed, savefig uses the file name verbatim and appends no extension.
        new_file_path = os.path.join(directory_to_save, file_name + ".png")
        
        #Export the file to this new path:
        plt.savefig(new_file_path, dpi = png_resolution_dpi, format = 'png', transparent = False) 
        #the 'quality' argument only applied to JPEG output and was removed from recent
        # Matplotlib releases, so it is not passed here.
        #format (str, supported formats) = 'png', 'pdf', 'ps', 'eps' or 'svg'
        #transparent = True or False
        # For other parameters of .savefig method, check https://indianaiproduction.com/matplotlib-savefig/
        print (f"Figure exported as '{new_file_path}'. Any previous file in this root path was overwritten.")
    
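    #Set image size (width, height), in inches, for printing in the notebook's cell.
    # The existing figure is resized; plt.figure(figsize = ...) would create a new, empty figure.
    fig.set_size_inches(12, 8)
    #fig.tight_layout()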
    
    ## Show an image read from an image file:
    ## import matplotlib.image as pltimg
    ## img=pltimg.imread('mydecisiontree.png')
    ## imgplot = plt.imshow(img)
    ## See linkedIn Learning course: "Supervised machine learning and the technology boom",
    ##  Ex_Files_Supervised_Learning, Exercise Files, lesson '03. Decision Trees', '03_05', 
    ##  '03_05_END.ipynb'
    plt.show()
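
The six plotting blocks of `time_series_vis` follow one convention: a fixed color per series position and a default label `Y1` to `Y6` when none is given. A minimal sketch of that convention in plain Python, assuming the series are passed as a list of `(x, y)` tuples; `build_series_specs` and `SERIES_COLORS` are hypothetical names used only for illustration:

```python
# Color convention listed for time_series_vis: series 1 to 6, in this order.
SERIES_COLORS = ["blue", "red", "green", "black", "magenta", "yellow"]

def build_series_specs(series, labels=()):
    """Hypothetical helper: pair each (x, y) series with its color and label.

    series: list of (x, y) tuples; a tuple whose x is None is skipped.
    labels: optional labels; missing or None entries fall back to 'Y1', 'Y2', ...
    """
    specs = []
    # zip stops at six entries, the number of colors in the convention.
    for position, ((x, y), color) in enumerate(zip(series, SERIES_COLORS)):
        if x is None:
            continue
        label = None
        if position < len(labels):
            label = labels[position]
        if label is None:
            label = f"Y{position + 1}"
        specs.append({"x": x, "y": y, "color": color, "label": label})
    return specs

specs = build_series_specs(
    series=[([1, 2, 3], [10, 12, 11]), (None, None), ([1, 2, 3], [7, 8, 9])],
    labels=["temperature"],
)
for spec in specs:
    print(spec["label"], spec["color"])
```

Each dictionary could then be passed to `ax.plot(spec["x"], spec["y"], color = spec["color"], label = spec["label"])`, so a single loop would cover all six series.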

# **Functions for histogram visualization**

- Function `histogram`: the ideal bin interval is calculated through Montgomery's method, and the histogram is built from this calculated bin size.
    - Douglas C. Montgomery (2009). Introduction to Statistical Process Control, Sixth Edition, John Wiley & Sons.
- Function `histogram_alternative`: the histogram is obtained by manually defining the total number of bins (i.e., into how many intervals the sample space should be divided).
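
The bin-size rule used by `histogram` takes the number of cells as the integer closest to the square root of the sample size and divides the data range by it. A minimal sketch in plain Python; `montgomery_bin_size` is a hypothetical name, and the sketch keeps the bin width as a float, whereas `histogram` additionally rounds it to an integer:

```python
import math

def montgomery_bin_size(values):
    """Hypothetical helper: number of cells and bin width from the square-root rule."""
    sample_size = len(values)
    lowest, highest = min(values), max(values)
    # Number of rows of the frequency table: integer closest to sqrt(sample size).
    n_cells = int(round(math.sqrt(sample_size)))
    # Bin width: data range divided by the number of cells (kept as a float here).
    bin_size = abs(highest - lowest) / n_cells
    return n_cells, bin_size

sample = [2, 4, 4, 5, 7, 8, 9, 11, 12, 14, 15, 17, 18, 20, 21, 22]
n_cells, bin_size = montgomery_bin_size(sample)
print(n_cells, bin_size)  # 16 points with range 20 -> 4 cells of width 5.0
```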

In [14]:
def histogram (y, bar_width, x_axis_rotation = 70, y_axis_rotation = 0, grid = True, normal_curve_overlay = True, data_units_label = None, y_title = None, histogram_title = None, export_png = False, directory_to_save = None, file_name = None, png_resolution_dpi = 110):
    
    import pandas as pd
    import matplotlib
    import numpy as np
    import matplotlib.pyplot as plt
    
    # ideal bin interval calculated through Montgomery's method. 
    # Histogram is obtained from this calculated bin size.
    # Douglas C. Montgomery (2009). Introduction to Statistical Process Control, 
    # Sixth Edition, John Wiley & Sons.
    
    
    #Calculation of the bin size (histogram bar width):
    #1: Find the lowest and the highest values within the data
    #2: Calculate rangehist = highest - lowest
    #3: Calculate the number of input data points (samplesize)
    #4: Calculate the number of cells of the frequency table (ncells)
    #ncells = integer closest to the square root of samplesize
    #5: Calculate binsize = rangehist/(ncells)
    #ATTENTION: do not forget to convert range, ncells, samplesize and binsize to absolute values,
    #because the histogram width must be a positive number

    y = y.reset_index(drop=True)
    #makes the indices of this series consecutive and starting from zero

    #General statistics: mean (mu) and standard deviation (sigma)
    mu = y.mean() 
    sigma = y.std() 

    #Calculation of the bin size
    highest = y.max()
    lowest = y.min()
    rangehist = highest - lowest
    rangehist = abs(rangehist)
    #guarantees that it is a positive number
    samplesize = y.count() #count of the total number of entries
    ncells = (samplesize)**0.5 #exponentiation: ** - square root of samplesize
    #the result of the square root is always positive
    ncells = round(ncells) #nearest whole number
    ncells = int(ncells) #integer part of the rounded number
    #ncells = number of rows of the frequency table
    binsize = rangehist/ncells
    if (round(binsize) >= 1):
        binsize = int(round(binsize)) #integer bin width
    #otherwise, the fractional width is kept: a width rounded to zero would never
    #advance the bar limits, and the loops below would not terminate
    
    #Construction of the frequency table

    j = 0 #index of the frequency table
    #This index is different from the position of the values sorted in ascending order
    xhist = []
    #Empty list that will contain the x values of the histogram
    yhist = []
    #Empty list that will contain the y values of the histogram
    hist_labels = []
    #This list will store the bar limits as strings

    pontomediodabarra = lowest + binsize/2 
    limitedabarra = lowest + binsize
    #midpoint of the bar 
    #upper limit of the first bar of the histogram
    seriedohist1 = y
    seriedohist1 = seriedohist1.sort_values(ascending=True)
    #series with the values in ascending order
    seriedohist1 = seriedohist1.reset_index(drop=True)
    #guarantees that the new series has consecutive indices, starting from zero
    i = 0 #initial row of the histogram series in ascending order
    valcomparado = seriedohist1[i]
    #first value of the series, the lowest one

    while (j <= (ncells-1)):
        
        #stops when the rows of the table are finished
        xhist.append(pontomediodabarra)
        #x value (bar midpoint) of the frequency table
        cont = 0
        #counting variable of the histogram
        #the count must be restarted
       
        if (i < samplesize):
            #2 conditionals to prevent an element with a nonexistent index
            #from being accessed
            while (valcomparado <= limitedabarra) and (valcomparado < highest):
                #the second criterion guarantees the stop in cases where the data are
                #very close
                cont = cont + 1 #adds a count to the frequency table
                i = i + 1
                
                if (i < samplesize): 
                    valcomparado = seriedohist1[i]
        
        yhist.append(cont) #number of occurrences counted
        
        limite_infdabarra = pontomediodabarra - binsize/2
        rotulo = "%.2f - %.2f" %(limite_infdabarra, limitedabarra)
        #interval of the frequency table
        #%.2f: rounded to 2 decimal places
        hist_labels.append(rotulo)
        
        pontomediodabarra = pontomediodabarra + binsize
        #both the midpoints and the limits are shifted by the same interval
        
        limitedabarra = limitedabarra + binsize
        #next bar
        
        j = j + 1
    
    #We have to check whether the maximum value was included,
    #because the approximation to an integer may have
    #rounded down and excluded the upper limit.
    #Note, however, that in the last iteration the upper limit of the bar was 
    #increased by binsize but, since j was already greater than ncells-1, the loop stopped.
    
    #So, limitedabarra at this moment is the limit of the bar that would be
    #built next, not of the last bar of the frequency table.
    #This may make this limit already greater than highest.
    
    #Note, however, that the lower limit of the bar was not increased.
    #So, it suffices to check if it plus binsize is lower than the highest value
        
    while ((limite_infdabarra+binsize) < highest):
        
        #create new rows until the highest point of the histogram
        #has been counted
        ncells = ncells + 1 #adds a row to the frequency table
        xhist.append(pontomediodabarra)
        
        cont = 0 #counting variable of the histogram
        
        while (valcomparado <= limitedabarra):
            cont = cont + 1 #adds a count to the frequency table
            i = i + 1
            if (i < samplesize):
                valcomparado = seriedohist1[i]
                #only if i is still not greater than the total of data
            
            else: 
                
                break
        
        #stop the loop if i reached a size greater than the amount 
        #of data. We must take this care because we are adding
        #more rows to the frequency table to correct the approximation
        #of ncells by an integer
        
        yhist.append(cont) #number of occurrences counted
        
        limite_infdabarra = pontomediodabarra - binsize/2
        rotulo = "%.2f - %.2f" %(limite_infdabarra, limitedabarra)
        #interval of the frequency table - 2 decimal places
        hist_labels.append(rotulo)
        
        pontomediodabarra = pontomediodabarra + binsize
        #both the midpoints and the limits are shifted by the same interval
        
        limitedabarra = limitedabarra + binsize
        #next bar
        
        
    estatisticas_col1 = []
    #will contain the descriptions (row labels) of the general statistics table
    estatisticas_col2 = []
    #will contain the values of the general statistics table
    
    estatisticas_col1.append("Count of data evaluated")
    estatisticas_col2.append(samplesize)
    estatisticas_col1.append("Average (mu)")
    estatisticas_col2.append(mu)
    estatisticas_col1.append("Standard deviation (sigma)")
    estatisticas_col2.append(sigma)
    estatisticas_col1.append("Highest value")
    estatisticas_col2.append(highest)
    estatisticas_col1.append("Lowest value")
    estatisticas_col2.append(lowest)
    estatisticas_col1.append("Data range (maximum value - lowest value)")
    estatisticas_col2.append(rangehist)
    estatisticas_col1.append("Bin size (bar width)")
    estatisticas_col2.append(binsize)
    estatisticas_col1.append("Total rows in frequency table")
    estatisticas_col2.append(ncells)
    #since the append method stores the entries sequentially, the correspondence
    #between the two columns is guaranteed
    #As in any string, including the ones used as plot labels,
    #the \n characters are read as line breaks
    
    d1 = {"General Statistics": estatisticas_col1, "Calculated Value": estatisticas_col2}
    #dictionary of the two lists, used to create the dataframe with the descriptions
    estatisticas_gerais = pd.DataFrame(data = d1)
    
    #If the titles are present (value is not None),
    #use them.
    #Otherwise, create generic labels for the histogram
    
    eixo_y = "Counting/Frequency"
    
    if not (data_units_label is None):
        xlabel = data_units_label
    
    else:
        xlabel = "Frequency\n table data"
    
    if not (y_title is None):
        eixo_x = y_title
        #remember that, in the histogram, the original data go to the X axis
        #The Y axis becomes the axis of the count/frequency of those data
    
    else:
        eixo_x = "X: Mean value of the interval"
    
    if not (histogram_title is None):
        string1 = r" - $\mu = %.2f$, $\sigma = %.2f$" %(mu, sigma)
        main_label = histogram_title + string1
        #concatenates the title string with the string containing the mean and standard deviation
        #%.2f: the number between %. and f indicates the number of decimal places of the 
        #float variable. Here, we round to 2 decimal places
        #DO NOT FORGET THE DOT: it indicates that the number of decimal places
        #will be rounded
        #the raw string (r prefix) keeps the backslashes of \mu and \sigma literal
    
    else:
        main_label = r"Data Histogram - $\mu = %.2f$, $\sigma = %.2f$" %(mu, sigma)
        #the $\ $ symbols render the enclosed name as the Greek letter
    
    d2 = {"Considered interval": hist_labels, eixo_x: xhist, eixo_y: yhist}
    #dictionary that composes the frequency table
    tab_frequencias = pd.DataFrame(data = d2)
    #creates the frequency table as an output dataframe
    
    #parameters of the normal curve already calculated:
    #mu and sigma
    #number of bins: ncells
    #specification limits: lsl, usl - target
    
    #maximum value of the histogram
    max_hist = max(yhist)
    #selects the maximum count of the histogram, used to adjust the normal curve:
    #the normal density integrates to 1, so its height is not on the scale of the counts.
    #Rescaling it with max_hist makes it fit the height of the histogram
    
    if (normal_curve_overlay == True):
        
        #build the adjusted/expected normal curve
        #create points from mu - 4sigma to mu + 4sigma, 
        #so that almost the whole normal curve is covered. 
        #The increment is 0.10 sigma per iteration
        x_inf = mu -(4)*sigma
        x_sup = mu + 4*sigma
        x_inc = (0.10)*sigma
        
        x_normal_adj = []
        y_normal_adj = []
        
        x_adj = x_inf
        y_adj = ((1 / (np.sqrt(2 * np.pi) * sigma)) *np.exp(-0.5 * (1 / sigma * (x_adj - mu))**2))
        x_normal_adj.append(x_adj)
        y_normal_adj.append(y_adj)
        
        while(x_adj < x_sup): 
            
            x_adj = x_adj + x_inc
            y_adj = ((1 / (np.sqrt(2 * np.pi) * sigma)) *np.exp(-0.5 * (1 / sigma * (x_adj - mu))**2))
            x_normal_adj.append(x_adj)
            y_normal_adj.append(y_adj)
        
        #adjust the height of the curve to the histogram. To do so, we
        #calculate how many times the highest point of the histogram is greater than the
        #highest point of the normal curve (this ratio is called fator). Then,
        #we multiply each element of the normal curve by this same factor
        max_normal = max(y_normal_adj) 
        #maximum of the normal density, approximately 1/(sigma*sqrt(2*pi)), reached near x = mu
        
        fator = (max_hist)/(max_normal)
        size_normal = len(y_normal_adj) #number of points created
        
        i = 0
        while (i < size_normal):
            y_normal_adj[i] = (y_normal_adj[i])*(fator)
            i = i + 1
    
    #Make the plot
    fig, ax = plt.subplots()
    
    #ROTATE X AXIS IN XX DEGREES
    plt.xticks(rotation = x_axis_rotation)
    # XX = 70 DEGREES x_axis (Default)
    #ROTATE Y AXIS IN XX DEGREES:
    plt.yticks(rotation = y_axis_rotation)
    # XX = 0 DEGREES y_axis (Default)
    
    #STANDARD MATPLOTLIB METHOD:
    #bins = number of bins (intervals) of the histogram. Adjust it manually
    #increasing bins will increase the histogram's resolution, but reduce the height of the bars
    
    #ax.hist(y, bins=20, width = bar_width, label=xlabel, color='blue')
    #IF GRAPHIC IS NOT SHOWN: THAT IS BECAUSE THE DISTANCES BETWEEN VALUES ARE LOW, AND YOU WILL
    #HAVE TO USE THE STANDARD HISTOGRAM METHOD FROM MATPLOTLIB.
    #TO DO THAT, UNCOMMENT THE LINE ABOVE: ax.hist(y, bins=20, width = bar_width, label=xlabel, color='blue')
    #AND COMMENT OUT THE LINE BELOW: ax.bar(xhist, yhist, width = bar_width, label=xlabel, color='blue')
    
    #IF YOU WANT TO CREATE GRAPHIC AS A BAR CHART BASED ON THE CALCULATED DISTRIBUTION TABLE, 
    #COMMENT OUT THE LINE ABOVE AND UNCOMMENT THE LINE BELOW:
    ax.bar(xhist, yhist, width = bar_width, label=xlabel, color='blue')
    #manually adjust the width to bring the bars closer together or further apart
    
    if (normal_curve_overlay == True):
    
        #add the normal curve
        ax.plot(x_normal_adj, y_normal_adj, color = 'black', label = 'Adjusted/expected\n normal curve')
    
    ax.set_xlabel(eixo_x)
    ax.set_ylabel(eixo_y)
    ax.set_title(main_label)
    ax.set_xticks(xhist)
    
    ax.legend()
    ax.grid(grid)
    
    if (export_png == True):
        # Image will be exported
        import os
        
        #check if the user defined a directory path. If not, set as the default root path:
        if (directory_to_save is None):
            #set as the default
            directory_to_save = "/"
        
        #check if the user defined a file name. If not, set as the default name for this
        # function.
        if (file_name is None):
            #set as the default
            file_name = "histogram"
        
        #check if the user defined an image resolution. If not, set as the default 110 dpi
        # resolution.
        if (png_resolution_dpi is None):
            #set as 110 dpi
            png_resolution_dpi = 110
        
        #Get the new_file_path, adding the '.png' extension explicitly: when the format
        # argument is passed, savefig uses the file name verbatim and appends no extension.
        new_file_path = os.path.join(directory_to_save, file_name + ".png")
        
        #Export the file to this new path:
        plt.savefig(new_file_path, dpi = png_resolution_dpi, format = 'png', transparent = False) 
        #the 'quality' argument only applied to JPEG output and was removed from recent
        # Matplotlib releases, so it is not passed here.
        #format (str, supported formats) = 'png', 'pdf', 'ps', 'eps' or 'svg'
        #transparent = True or False
        # For other parameters of .savefig method, check https://indianaiproduction.com/matplotlib-savefig/
        print (f"Figure exported as '{new_file_path}'. Any previous file in this root path was overwritten.")
    
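    #Set image size (width, height), in inches, for printing in the notebook's cell.
    # The existing figure is resized; plt.figure(figsize = ...) would create a new, empty figure.
    fig.set_size_inches(12, 8)
    #fig.tight_layout()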
    
    ## Show an image read from an image file:
    ## import matplotlib.image as pltimg
    ## img=pltimg.imread('mydecisiontree.png')
    ## imgplot = plt.imshow(img)
    ## See linkedIn Learning course: "Supervised machine learning and the technology boom",
    ##  Ex_Files_Supervised_Learning, Exercise Files, lesson '03. Decision Trees', '03_05', 
    ##  '03_05_END.ipynb'
    plt.show()
    
    print("General statistics:\n")
    print(estatisticas_gerais)
    print("\n") # line break
    print("Frequency table:\n")
    print(tab_frequencias)

    return estatisticas_gerais, tab_frequencias
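
The frequency table returned by `histogram` can be cross-checked against an independent count. A minimal sketch in plain Python, assuming half-open intervals `[lower, upper)` with the maximum value assigned to the last interval, which differs from the right-closed comparison (`<=`) used in the loops above; `frequency_table` is a hypothetical name used only for illustration:

```python
import math

def frequency_table(values, bin_size):
    """Hypothetical helper: rows of (lower limit, upper limit, midpoint, count)."""
    lowest, highest = min(values), max(values)
    # Enough bins to cover the whole range; at least one bin if all values are equal.
    n_bins = max(1, math.ceil((highest - lowest) / bin_size))
    counts = [0] * n_bins
    for value in values:
        # Index of the bin containing the value; the maximum goes to the last bin.
        index = min(int((value - lowest) // bin_size), n_bins - 1)
        counts[index] += 1
    rows = []
    for k, count in enumerate(counts):
        lower = lowest + k * bin_size
        rows.append((lower, lower + bin_size, lower + bin_size / 2, count))
    return rows

rows = frequency_table(list(range(10)), bin_size=3)
for lower, upper, midpoint, count in rows:
    print("%.2f - %.2f" % (lower, upper), midpoint, count)
```

Whatever the interval convention, the counts of a complete frequency table add up to the sample size, which gives a quick consistency check.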

In [15]:
def histogram_alternative (y, total_of_bins, bar_width, x_axis_rotation = 70, y_axis_rotation = 0, grid = True, data_units_label = None, y_title = None, histogram_title = None, export_png = False, directory_to_save = None, file_name = None, png_resolution_dpi = 110):
    
    import pandas as pd
    import matplotlib
    import numpy as np
    import matplotlib.pyplot as plt
    
    #Calculation of the bin size (histogram bar width):
    #1: Find the lowest and the highest values within the data
    #2: Calculate rangehist = highest - lowest
    #3: Calculate the number of input data points (samplesize)
    #4: Calculate the number of cells of the frequency table (ncells)
    #ncells = integer closest to the square root of samplesize
    #5: Calculate binsize = rangehist/(ncells)
    #ATTENTION: do not forget to convert range, ncells, samplesize and binsize to absolute values,
    #because the histogram width must be a positive number
    
    #this variable here is to simply guarantee the compatibility of the function,
    # with no extensive code modifications. It has no real effect.
    normal_curve_overlay = True
    

    y = y.reset_index(drop=True)
    #makes the indices of this series consecutive and starting from zero

    #General statistics: mean (mu) and standard deviation (sigma)
    mu = y.mean() 
    sigma = y.std() 

    #Calculation of the bin size
    highest = y.max()
    lowest = y.min()
    rangehist = highest - lowest
    rangehist = abs(rangehist)
    #guarantees that it is a positive number
    samplesize = y.count() #count of the total number of entries
    ncells = (samplesize)**0.5 #exponentiation: ** - square root of samplesize
    #the result of the square root is always positive
    ncells = round(ncells) #nearest whole number
    ncells = int(ncells) #integer part of the rounded number
    #ncells = number of rows of the frequency table
    binsize = rangehist/ncells
    binsize = round(binsize)
    binsize = int(binsize) #must be an integer
    
    #Construction of the frequency table

    j = 0 #index of the frequency table
    #This index is different from the position of the values sorted in ascending order
    xhist = []
    #Empty list that will contain the x values of the histogram
    yhist = []
    #Empty list that will contain the y values of the histogram
    hist_labels = []
    #This list will store the bar limits as strings

    pontomediodabarra = lowest + binsize/2 
    limitedabarra = lowest + binsize
    #midpoint of the bar 
    #upper limit of the first bar of the histogram
    seriedohist1 = y
    seriedohist1 = seriedohist1.sort_values(ascending=True)
    #series with the values in ascending order
    seriedohist1 = seriedohist1.reset_index(drop=True)
    #guarantees that the new series has consecutive indices, starting from zero
    i = 0 #initial row of the histogram series in ascending order
    valcomparado = seriedohist1[i]
    #first value of the series, the lowest one
        
    estatisticas_col1 = []
    #will contain the descriptions (row labels) of the general statistics table
    estatisticas_col2 = []
    #will contain the values of the general statistics table
    
    estatisticas_col1.append("Count of data evaluated")
    estatisticas_col2.append(samplesize)
    estatisticas_col1.append("Average (mu)")
    estatisticas_col2.append(mu)
    estatisticas_col1.append("Standard deviation (sigma)")
    estatisticas_col2.append(sigma)
    estatisticas_col1.append("Highest value")
    estatisticas_col2.append(highest)
    estatisticas_col1.append("Lowest value")
    estatisticas_col2.append(lowest)
    estatisticas_col1.append("Data range (maximum value - lowest value)")
    estatisticas_col2.append(rangehist)
    estatisticas_col1.append("Bin size (bar width)")
    estatisticas_col2.append(binsize)
    estatisticas_col1.append("Total rows in frequency table")
    estatisticas_col2.append(ncells)
    #since the append method stores the entries sequentially, the correspondence
    #between the two columns is guaranteed
    #As in any string, including the ones used as plot labels,
    #the \n characters are read as line breaks
    
    d1 = {"General Statistics": estatisticas_col1, "Calculated Value": estatisticas_col2}
    #dictionary of the two lists, used to create the dataframe with the descriptions
    estatisticas_gerais = pd.DataFrame(data = d1)
    
    #If the titles are present (value is not None),
    #use them.
    #Otherwise, create generic labels for the histogram
    
    eixo_y = "Counting/Frequency"
    
    if not (data_units_label is None):
        xlabel = data_units_label
    
    else:
        xlabel = "Frequency\n table data"
    
    if not (y_title is None):
        eixo_x = y_title
        #remember that, in the histogram, the original data go to the X axis
        #The Y axis becomes the axis of the count/frequency of those data
    
    else:
        eixo_x = "X: Mean value of the interval"
    
    if not (histogram_title is None):
        string1 = r" - $\mu = %.2f$, $\sigma = %.2f$" %(mu, sigma)
        main_label = histogram_title + string1
        #concatenates the title string with the string containing the mean and standard deviation
        #%.2f: the number between %. and f indicates the number of decimal places of the 
        #float variable. Here, we round to 2 decimal places
        #DO NOT FORGET THE DOT: it indicates that the number of decimal places
        #will be rounded
        #the raw string (r prefix) keeps the backslashes of \mu and \sigma literal
    
    else:
        main_label = r"Data Histogram - $\mu = %.2f$, $\sigma = %.2f$" %(mu, sigma)
        #the $\ $ symbols render the enclosed name as the Greek letter
    
    d2 = {"Considered interval": hist_labels, eixo_x: xhist, eixo_y: yhist}
    #dictionary that composes the frequency table
    tab_frequencias = pd.DataFrame(data = d2)
    #creates the frequency table as an output dataframe
   
    #parameters of the normal curve already calculated:
    #mu and sigma
    #number of bins: ncells
    #specification limits: lsl, usl - target
    
    #maximum value of the histogram
    #max_hist = max(yhist)
    #would select the maximum count of the histogram, used to adjust a normal curve
    #to the height of the histogram. It is kept commented out because the frequency
    #lists are not filled in this function
    
    
    #Make the plot
    fig, ax = plt.subplots()
    
    #ROTATE X AXIS IN XX DEGREES
    plt.xticks(rotation = x_axis_rotation)
    # XX = 70 DEGREES x_axis (Default)
    #ROTATE Y AXIS IN XX DEGREES:
    plt.yticks(rotation = y_axis_rotation)
    # XX = 0 DEGREES y_axis (Default)
    
    #STANDARD MATPLOTLIB METHOD:
    #bins = number of bins (intervals) of the histogram. Adjust it manually
    #increasing bins will increase the histogram's resolution, but reduce the height of the bars
    
    ax.hist(y, bins = total_of_bins, width = bar_width, label=xlabel, color='blue')
    #THE STANDARD HISTOGRAM METHOD FROM MATPLOTLIB IS USED HERE FOR THE CASES WHERE THE DISTANCES
    #BETWEEN VALUES ARE LOW AND THE BAR CHART BASED ON THE FREQUENCY TABLE IS NOT SHOWN.
    
    #IF YOU WANT TO CREATE GRAPHIC AS A BAR CHART BASED ON THE CALCULATED DISTRIBUTION TABLE, 
    #COMMENT OUT THE LINE ABOVE AND UNCOMMENT THE LINE BELOW:
    #ax.bar(xhist, yhist, width = bar_width, label=xlabel, color='blue')
    #manually adjust the width to bring the bars closer together or further apart
    
    ax.set_xlabel(eixo_x)
    ax.set_ylabel(eixo_y)
    ax.set_title(main_label)
    #ax.set_xticks(xhist)
    
    ax.legend()
    ax.grid(grid)
    
    if (export_png == True):
        # Image will be exported
        import os
        
        #check if the user defined a directory path. If not, set as the default root path:
        if (directory_to_save is None):
            #set as the default
            directory_to_save = "/"
        
        #check if the user defined a file name. If not, set as the default name for this
        # function.
        if (file_name is None):
            #set as the default
            file_name = "histogram_alternative"
        
        #check if the user defined an image resolution. If not, set as the default 110 dpi
        # resolution.
        if (png_resolution_dpi is None):
            #set as 110 dpi
            png_resolution_dpi = 110
        
        #Get the new_file_path
        new_file_path = os.path.join(directory_to_save, file_name + ".png")
        
        #Export the file to this new path:
        # When format is passed explicitly, savefig uses the file name verbatim, so the
        # '.png' extension is added to the path above.
        plt.savefig(new_file_path, dpi = png_resolution_dpi, format = 'png', transparent = False) 
        #The quality argument applied only to JPEG output and is not accepted by recent
        # Matplotlib releases, so it is not passed here.
        #format (str, supported formats) = 'png', 'pdf', 'ps', 'eps' or 'svg'
        #transparent = True or False
        # For other parameters of .savefig method, check https://indianaiproduction.com/matplotlib-savefig/
        print (f"Figure exported as \'{new_file_path}\'. Any previous file in this root path was overwritten.")
    
    #Set image size (width, height), in inches, for printing in the notebook's cell.
    # plt.figure(figsize = ...) would open a second, empty figure instead of resizing this one.
    fig.set_size_inches(12, 8)
    #fig.tight_layout()
    
    ## Show an image read from an image file:
    ## import matplotlib.image as pltimg
    ## img=pltimg.imread('mydecisiontree.png')
    ## imgplot = plt.imshow(img)
    ## See linkedIn Learning course: "Supervised machine learning and the technology boom",
    ##  Ex_Files_Supervised_Learning, Exercise Files, lesson '03. Decision Trees', '03_05', 
    ##  '03_05_END.ipynb'
    plt.show()
    
    print("General statistics:\n")
    print(estatisticas_gerais)
    # This function is supposed to be used in cases where the differences between data
    # is very small. In such cases, there will be no trust values calculated for the 
    # frequency table. Therefore, we omit it here, but it can be accessed from the
    # returned dataframe.

    return estatisticas_gerais, tab_frequencias

# **Function for testing data normality and visualizing probability plot**
- Check the probability that data is actually described by a normal distribution.

In [19]:
def test_data_normality (y, alpha = 0.10, show_probability_plot = True, x_axis_rotation = 0, y_axis_rotation = 0, grid = True, export_png = False, directory_to_save = None, file_name = None, png_resolution_dpi = 110):
    
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from statsmodels.stats import diagnostic
    from scipy import stats
    # Check https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.probplot.html#scipy.stats.probplot
    # Check https://docs.scipy.org/doc/scipy/tutorial/stats.html
    # Check https://docs.scipy.org/doc/scipy-1.8.0/html-scipyorg/reference/generated/scipy.stats.normaltest.html
    
    # WARNING: The statistical tests require at least 20 samples
    
    # Confidence level = 1 - ALPHA. For ALPHA = 0.10, we get a 0.90 = 90% confidence
    # Set ALPHA = 0.05 to get 0.95 = 95% confidence in the analysis.
    # Notice that, when less trust is needed, we can increase ALPHA to get less restrictive
    # results.
    
    # y = series of data that will be tested.
    # y = dataset['Y']
    
    #Alternatively: set SHOW_PROBABILITY_PLOT = True to obtain the probability plot for the
    # variable Y (normal distribution tested). 
    # Set SHOW_PROBABILITY_PLOT = False to omit the probability plot.
    
    lista1 = []
    #this list will be the first column, describing the contents of the other columns
    #Note: strictly, the p-value is the probability of observing data at least as extreme
    # as the sample if the null hypothesis (normality) were true.
    lista1.append("p-value: probability that data is described by the normal distribution.")
    lista1.append("Probability of being described by the normal distribution (%).")
    lista1.append("alpha")
    lista1.append("Criterion: is not described by normal if p < alpha = %.3f." %(alpha))
    #%.3f shows the float with 3 decimal places
    #%f refers to a float variable
    #informs the user of the value set for rejection
    lista1.append("Are data described by the normal?")
    #Note that append adds the elements in sequence, row by row;
    #no index is specified, since each new element goes to the next row
    
    #Scipy.stats’ normality test
    # It is based on D’Agostino and Pearson’s test that combines 
    # skew and kurtosis to produce an omnibus test of normality.
    _, scipystats_test_pval = stats.normaltest(y)
    # The underscore indicates an output to be ignored, which is s^2 + k^2, 
    # where s is the z-score returned by skewtest and k is the z-score returned by kurtosistest.
    # https://docs.scipy.org/doc/scipy-1.8.0/html-scipyorg/reference/generated/scipy.stats.normaltest.html
    
    #create list with only the p-val
    p_scipy = []
    p_scipy.append(scipystats_test_pval) #p-value
    p_scipy.append(100*scipystats_test_pval) #p in percent
    p_scipy.append(alpha)
    
    if (scipystats_test_pval < alpha):
        p_scipy.append("p = %.3f < %.3f" %(scipystats_test_pval, alpha))
        p_scipy.append("Data not described by normal.")
    else:
        p_scipy.append("p = %.3f >= %.3f" %(scipystats_test_pval, alpha))
        p_scipy.append("Data described by normal.")    
    
    #Lilliefors’ test
    lilliefors_test = diagnostic.kstest_normal(y, dist='norm', pvalmethod='table')
    #Return: first element: ksstat: float
    #Kolmogorov-Smirnov test statistic with estimated mean and variance.
    #Second element: p-value: float
    #If the p-value is lower than some threshold, e.g. 0.10, then we can reject the Null hypothesis that the sample comes from a normal distribution.
    
    #create list with only the p-value
    p_lillie = []
    p_lillie.append(lilliefors_test[1]) #p-value
    p_lillie.append(100*lilliefors_test[1]) #p in percent
    p_lillie.append(alpha)
    
    if (lilliefors_test[1] < alpha):
        p_lillie.append("p = %.3f < %.3f" %(lilliefors_test[1], alpha))
        p_lillie.append("Data not described by normal.")
    else:
        p_lillie.append("p = %.3f >= %.3f" %(lilliefors_test[1], alpha))
        p_lillie.append("Data described by normal.")
        
    
    #Anderson-Darling
    ad_test = diagnostic.normal_ad(y, axis=0)
    #Return: first element: ad2: float
    #Anderson Darling test statistic.
    #Second element: p-val: float
    #The p-value for hypothesis that the data comes from a normal distribution with unknown mean and variance.
    
    #create list with only the p-value
    p_ad = []
    p_ad.append(ad_test[1]) #p-value
    p_ad.append(100*ad_test[1]) #p in percent
    p_ad.append(alpha)
    
    if (ad_test[1] < alpha):
        p_ad.append("p = %.3f < %.3f" %(ad_test[1], alpha))
        p_ad.append("Data not described by normal.")
    else:
        p_ad.append("p = %.3f >= %.3f" %(ad_test[1], alpha))
        p_ad.append("Data described by normal.")
    
    #NOTE: %f shows the float variable with 6 decimal places by default.
    #To fix the number of decimal places, add that number before the f.
    #Examples: %.1f: 1 decimal place;
    # %.2f: 2 decimal places; %.3f: 3 decimal places; %.4f: 4 decimal places
    
    data_normality_dict = {'Parameters and Interpretation': lista1, 'D’Agostino and Pearson normality test': p_scipy, 'Lilliefors Test': p_lillie, 'Anderson-Darling Test': p_ad}
    
    #dictionary with the values obtained
    data_normality_res = pd.DataFrame(data = data_normality_dict)
    #output dataframe
    
    print("Check data normality results:\n")
    print(data_normality_res)
    print("\n") #line break
    
    # Calculate data skewness and kurtosis
    
    # Skewness
    data_skew = stats.skew(y)
    # skewness = 0 : normally distributed.
    # skewness > 0 : more weight in the right tail of the distribution.
    # skewness < 0 : more weight in the left tail of the distribution.
    # https://www.geeksforgeeks.org/scipy-stats-skew-python/
    
    # Kurtosis
    data_kurtosis = stats.kurtosis(y, fisher = True)
    # scipy.stats.kurtosis(array, axis=0, fisher=True, bias=True) function 
    # calculates the kurtosis (Fisher or Pearson) of a data set. It is the fourth 
    # central moment divided by the square of the variance. 
    # It is a measure of the “tailedness” i.e. descriptor of shape of probability 
    # distribution of a real-valued random variable. 
    # In simple terms, one can say it is a measure of how heavy tail is compared 
    # to a normal distribution.
    # fisher parameter: fisher : Bool; Fisher’s definition is used (normal 0.0) if True; 
    # else Pearson’s definition is used (normal 3.0) if set to False.
    # https://www.geeksforgeeks.org/scipy-stats-kurtosis-function-python/
    print("A normal distribution should present no skewness (distribution distortion); and no kurtosis (long-tail).")
    print(f"For the data analyzed: skewness = {data_skew}; kurtosis = {data_kurtosis}")
    
    if (data_skew < 0):
        
        print(f"Skewness {data_skew} < 0: more weight in the left tail of the distribution.")
    
    elif (data_skew > 0):
        
        print(f"Skewness {data_skew} > 0: more weight in the right tail of the distribution.")
        
    else:
        
        print(f"Skewness {data_skew} = 0: no distortion of the distribution.")
    
    
    print(f"Data kurtosis = {data_kurtosis}")
    
    if (data_kurtosis == 0):
        
        print("Data kurtosis = 0. No long-tail effects detected.")
    
    #Calculate the mode of the distribution:
    # https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mode.html
    data_mode = stats.mode(y, axis = None)[0]
    # returns an array of arrays. The first array is called mode=array and contains the mode.
    # Axis: Default is 0. If None, compute over the whole array.
    # we set axis = None to compute the general mode.
    
    #Create general statistics dictionary:
    general_statistics_dict = {
        
        "Count_of_analyzed_values": len(y),
        "Data_mean": np.mean(y),
        "Data_mean_ignoring_missing_values": np.nanmean(y),
        "Data_variance": np.var(y),
        "Data_variance_ignoring_missing_values": np.nanvar(y),
        "Data_standard_deviation": np.std(y),
        "Data_standard_deviation_ignoring_missing_values": np.nanstd(y),
        "Data_skewness": data_skew,
        "Data_kurtosis": data_kurtosis,
        "Data_mode": data_mode
        
    }
    
    print("Skewness and kurtosis successfully returned in the dictionary general_statistics_dict.\n")
    print(general_statistics_dict)
    print("\n")
    
    if (show_probability_plot == True):
        #Obtain the probability plot  
        fig, ax = plt.subplots()

        ax.set_title("Probability Plot for Normal Distribution")

        #ROTATE X AXIS IN XX DEGREES
        plt.xticks(rotation = x_axis_rotation)
        # XX = 70 DEGREES x_axis (Default)
        #ROTATE Y AXIS IN XX DEGREES:
        plt.yticks(rotation = y_axis_rotation)
        # XX = 0 DEGREES y_axis (Default)   

        res = stats.probplot(y, dist = 'norm', fit = True, plot = ax)
        #This function returns a tuple, so we must store it into res
        
        #Other distributions to check, see scipy Stats documentation. 
        # you could test dist=stats.loggamma, where stats was imported from scipy
        # https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.probplot.html#scipy.stats.probplot

        ax.grid(grid)
        ax.legend()

        if (export_png == True):
            # Image will be exported
            import os

            #check if the user defined a directory path. If not, set as the default root path:
            if (directory_to_save is None):
                #set as the default
                directory_to_save = "/"

            #check if the user defined a file name. If not, set as the default name for this
            # function.
            if (file_name is None):
                #set as the default
                file_name = "probability_plot"

            #check if the user defined an image resolution. If not, set as the default 110 dpi
            # resolution.
            if (png_resolution_dpi is None):
                #set as 110 dpi
                png_resolution_dpi = 110

            #Get the new_file_path
            new_file_path = os.path.join(directory_to_save, file_name + ".png")

            #Export the file to this new path:
            # When format is passed explicitly, savefig uses the file name verbatim, so the
            # '.png' extension is added to the path above.
            plt.savefig(new_file_path, dpi = png_resolution_dpi, format = 'png', transparent = False) 
            #The quality argument applied only to JPEG output and is not accepted by recent
            # Matplotlib releases, so it is not passed here.
            #format (str, supported formats) = 'png', 'pdf', 'ps', 'eps' or 'svg'
            #transparent = True or False
            # For other parameters of .savefig method, check https://indianaiproduction.com/matplotlib-savefig/
            print (f"Figure exported as \'{new_file_path}\'. Any previous file in this root path was overwritten.")

        #Set image size (width, height), in inches, for printing in the notebook's cell.
        # plt.figure(figsize = ...) would open a second, empty figure instead of resizing this one.
        fig.set_size_inches(12, 8)
        #fig.tight_layout()

        ## Show an image read from an image file:
        ## import matplotlib.image as pltimg
        ## img=pltimg.imread('mydecisiontree.png')
        ## imgplot = plt.imshow(img)
        ## See linkedIn Learning course: "Supervised machine learning and the technology boom",
        ##  Ex_Files_Supervised_Learning, Exercise Files, lesson '03. Decision Trees', '03_05', 
        ##  '03_05_END.ipynb'
        plt.show()
    
    return data_normality_res, general_statistics_dict
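
The skewness sign convention and the p-value criterion used in `test_data_normality` can be checked on synthetic data. The sketch below is illustrative only: the samples, the seed, and the names `normal_sample`, `skewed_sample`, and `results` are assumptions that do not come from the notebook, and only SciPy's D'Agostino-Pearson test is shown.

```python
import numpy as np
from scipy import stats

# Reproducible synthetic samples (illustrative data, not from the notebook).
rng = np.random.default_rng(42)
normal_sample = rng.normal(loc=10.0, scale=2.0, size=500)
skewed_sample = rng.lognormal(mean=0.0, sigma=1.0, size=500)

results = {}
for name, sample in [("normal", normal_sample), ("log-normal", skewed_sample)]:
    # D'Agostino-Pearson omnibus test. Null hypothesis: the sample is normal.
    _, p_value = stats.normaltest(sample)
    results[name] = {
        "p_value": p_value,
        # Positive skewness: more weight in the right tail.
        "skewness": stats.skew(sample),
        # fisher=True returns excess kurtosis (0 for a normal distribution).
        "excess_kurtosis": stats.kurtosis(sample, fisher=True),
    }

for name, values in results.items():
    print(name, {key: round(float(value), 4) for key, value in values.items()})
```

The log-normal sample has a long right tail, so its skewness is positive and its p-value falls far below the usual values of alpha.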

# **Function for column filtering (selecting) or column renaming**

In [16]:
def col_filter_rename (df, cols_list, mode = 'filter'):
    
    import pandas as pd
    
    #mode = 'filter' for filtering only the list of columns passed as cols_list;
    #mode = 'rename' for renaming the columns with the names passed as cols_list.
    
    #cols_list = list of strings containing the names (headers) of the columns to select
    # (filter); or to be set as the new columns' names, according to the selected mode.
    # For instance: cols_list = ['col1', 'col2', 'col3'] will 
    # select columns 'col1', 'col2', and 'col3' (or rename the columns with these names). 
    # Declare the names inside quotes.
    
    print(f"Original columns in the dataframe:\n{df.columns}")
    
    if (mode == 'filter'):
        
        #filter the dataframe so that it will contain only the cols_list.
        df = df[cols_list]
        print("Dataframe filtered according to the list provided.")
        
    elif (mode == 'rename'):
        
        #Check if the number of columns of the dataset is equal to the number of elements
        # of the new list. It will avoid raising an exception error.
        boolean_filter = (len(cols_list) == len(df.columns))
        
        if (boolean_filter == False):
            #Impossible to rename, number of elements are different.
            print("The number of columns of the dataframe is different from the number of elements of the list. Please provide a list whose number of elements equals the number of columns.")
        
        else:
            #Same number of elements, so that we can update the columns' names.
            df.columns = cols_list
            print("Dataframe columns renamed according to the list provided.")
            print("Warning: the substitution is element-wise: the first element of the list is now the name of the first column, and so on, ..., so that the last element is the name of the last column.")
        
        
    else:
        print("Enter a valid mode: \'filter\' or \'rename\'.")
    
    return df
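
The two modes reduce to two plain pandas operations. A minimal sketch, assuming a toy dataframe whose column names are hypothetical; the `.copy()` calls are an addition made here so that the original dataframe stays untouched:

```python
import pandas as pd

# Hypothetical toy dataframe used only for illustration.
df = pd.DataFrame({"col1": [1, 2], "col2": [3, 4], "col3": [5, 6]})

# 'filter' mode: keep only the listed columns, in the listed order.
filtered_df = df[["col3", "col1"]].copy()

# 'rename' mode: the replacement is element-wise, so the list must have
# exactly one name per existing column.
new_names = ["a", "b", "c"]
renamed_df = df.copy()
if len(new_names) == len(renamed_df.columns):
    renamed_df.columns = new_names

print(list(filtered_df.columns))
print(list(renamed_df.columns))
```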

# **Function for log-transforming the variables**

- The log-normal distribution is derived from the normal distribution.
- If the values Y follow a log-normal distribution, their logarithms follow a normal distribution.
- A log-normal curve resembles a normal curve, but with skewness (distortion) and kurtosis (long tail).

Applying the log is a method for **normalizing the variables**: the sample space shrinks after the transformation, making the data more suitable for Machine Learning algorithms.
- Preferably apply the transformation to the whole dataset, so that all variables have the same order of magnitude.
- It is not needed for variables whose values already lie between -100 and 100, the range where most outputs of the log transformation fall.

#### **WARNING**: This function will eliminate rows where the selected variables present values lower than or equal to zero (the logarithm is defined only for positive values).

In [17]:
def log_transform (df, subset = None, create_new_columns = True, new_columns_suffix = "_log"):
    
    import pandas as pd
    import numpy as np
    
    #### WARNING: This function will eliminate rows where the selected variables present 
    #### values lower or equal to zero (condition for the logarithm to be applied).
    
    # subset = None
    # Set subset = None to transform the whole dataset. Alternatively, pass a list with 
    # columns names for the transformation to be applied. For instance:
    # subset = ['col1', 'col2', 'col3'] will apply the transformation to the columns named
    # as 'col1', 'col2', and 'col3'. Declare the names inside quotes.
    # Declaring the full list of columns is equivalent to setting subset = None.
    
    # create_new_columns = True
    # Set create_new_columns = True to store the transformed data into new
    # columns. Or set create_new_columns = False to overwrite the existing columns
    
    #new_columns_suffix = "_log"
    # This value has effect only if create_new_columns = True.
    # The new column name will be set as column + new_columns_suffix. Then, if the original
    # column was "column1" and the suffix is "_log", the new column will be named as
    # "column1_log".
    # Alternatively, input inside quotes a string with the desired suffix. Recommendation:
    # start the suffix with "_" to separate it from the original name.
    
    # Check if a subset was defined. If so, make columns_list = subset 
    if not (subset is None):
        
        columns_list = subset
    
    else:
        #There is no declared subset. Then, make columns_list equal to the list of
        # columns of the dataframe.
        columns_list = df.columns
    
    #Loop through each column:
    for column in columns_list:
        #access each element in the list column_list. The element is named 'column'.
        
        #boolean filter to check if the entry is higher than zero, condition for the log
        # to be applied
        boolean_filter = (df[column] > 0)
        #This filter is equals True only for the rows where the column is higher than zero.
        
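        #Apply the boolean filter to the dataframe, removing the entries where the column
        # cannot be log transformed.
        # The boolean_filter selects only the rows for which the filter values are True.
        # .copy() makes the filtered result an independent dataframe, so the column
        # assignment below does not raise a SettingWithCopyWarning.
        df = df[boolean_filter].copy()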
        
        #Check if a new column will be created, or if the original column should be
        # substituted.
        if (create_new_columns == True):
            # Create a new column.
            
            # The new column name will be set as column + new_columns_suffix
            new_column_name = column + new_columns_suffix
        
        else:
            # Overwrite the existing column. Simply set new_column_name as the value 'column'
            new_column_name = column
        
        # Calculate the column value as the log transform of the original series (column)
        df[new_column_name] = np.log(df[column])
    
    print("The columns were successfully log-transformed. Check the 10 first rows of the new dataset:\n")
    print(df.head(10))
    
    return df
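
The effect of the positivity filter and of the log itself can be seen on a small synthetic column. This is a sketch under stated assumptions: the column name `y`, the seed, and the two invalid entries are invented for illustration, and the steps mirror the function above rather than call it.

```python
import numpy as np
import pandas as pd

# Synthetic log-normal column plus two entries the logarithm cannot handle.
rng = np.random.default_rng(0)
df = pd.DataFrame({"y": rng.lognormal(mean=2.0, sigma=0.8, size=1000)})
df = pd.concat([df, pd.DataFrame({"y": [0.0, -3.5]})], ignore_index=True)

# Keep only strictly positive rows, then store the log in a suffixed column.
positive_rows = df["y"] > 0
df_log = df.loc[positive_rows].copy()
df_log["y_log"] = np.log(df_log["y"])

print(f"Rows before: {len(df)}; rows after filtering: {len(df_log)}")
print(f"Skewness before: {df_log['y'].skew():.3f}; after log: {df_log['y_log'].skew():.3f}")
```

The two non-positive rows are dropped, and the skewness of the logged column is much closer to zero than that of the original column.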

# **Function for reversing the log-transform - applying the exponential transformation**

In [18]:
def reverse_log_transform(df, subset = None, create_new_columns = True, new_columns_suffix = "_originalScale"):
    
    import pandas as pd
    import numpy as np
    
    #### NOTE: Unlike log_transform, this function does not eliminate rows: the
    #### exponential is defined for zero and negative values as well.
    
    # subset = None
    # Set subset = None to transform the whole dataset. Alternatively, pass a list with 
    # columns names for the transformation to be applied. For instance:
    # subset = ['col1', 'col2', 'col3'] will apply the transformation to the columns named
    # as 'col1', 'col2', and 'col3'. Declare the names inside quotes.
    # Declaring the full list of columns is equivalent to setting subset = None.
    
    # create_new_columns = True
    # Set create_new_columns = True to store the transformed data into new
    # columns. Or set create_new_columns = False to overwrite the existing columns
    
    #new_columns_suffix = "_originalScale"
    # This value has effect only if create_new_columns = True.
    # The new column name will be set as column + new_columns_suffix. Then, if the original
    # column was "column1" and the suffix is "_originalScale", the new column will be named 
    # as "column1_originalScale".
    # Alternatively, input inside quotes a string with the desired suffix. Recommendation:
    # start the suffix with "_" to separate it from the original name.
    
    # Check if a subset was defined. If so, make columns_list = subset 
    if not (subset is None):
        
        columns_list = subset
    
    else:
        #There is no declared subset. Then, make columns_list equal to the list of
        # columns of the dataframe.
        columns_list = df.columns
    
    #Loop through each column:
    for column in columns_list:
        #access each element in the list column_list. The element is named 'column'.
        
        # The exponential transformation can be applied to zero and negative values,
        # so we remove the boolean filter.
        
        #Check if a new column will be created, or if the original column should be
        # substituted.
        if (create_new_columns == True):
            # Create a new column.
            
            # The new column name will be set as column + new_columns_suffix
            new_column_name = column + new_columns_suffix
        
        else:
            # Overwrite the existing column. Simply set new_column_name as the value 'column'
            new_column_name = column
        
        # Calculate the column value as the exponential of the log-transformed series (column)
        df[new_column_name] = np.exp(df[column])
    
    print("The log_transform was successfully reversed through the exponential transformation. Check the 10 first rows of the new dataset:\n")
    print(df.head(10))
    
    return df
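
A quick round-trip check shows that the exponential recovers the original scale. This is a minimal sketch on hypothetical values; the column names follow the default suffixes `_log` and `_originalScale` only by analogy with the functions above.

```python
import numpy as np
import pandas as pd

# Hypothetical positive values on very different scales.
df = pd.DataFrame({"column1": [0.5, 1.0, 20.0, 300.0]})

# Forward transform, then the reverse (exponential) transform.
df["column1_log"] = np.log(df["column1"])
df["column1_log_originalScale"] = np.exp(df["column1_log"])

# The recovered column should match the original up to floating-point error.
round_trip_ok = np.allclose(df["column1"], df["column1_log_originalScale"])
print(round_trip_ok)

# The exponential also accepts zero and negative inputs, so no rows are lost.
print(np.exp(np.array([-2.0, 0.0])))
```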

# **Function for obtaining and applying Box-Cox transform**
- Transform data into a series that are represented by the normal distribution.
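
For reference, the transform documented for `scipy.stats.boxcox` and its inverse can be written as follows, where $x > 0$ is the original value, $y$ is the transformed value, and $\lambda$ is the Box-Cox parameter:

$$
y =
\begin{cases}
\dfrac{x^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\[6pt]
\ln x, & \lambda = 0
\end{cases}
\qquad\Longleftrightarrow\qquad
x =
\begin{cases}
(\lambda y + 1)^{1/\lambda}, & \lambda \neq 0 \\[6pt]
e^{y}, & \lambda = 0
\end{cases}
$$

The transform is defined only for strictly positive $x$, and it is monotonically increasing in $x$ for every $\lambda$, so it preserves the order of the data and of any specification limits.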

In [20]:
def box_cox_transform (df, column_to_transform, mode = 'calculate_and_apply', lambda_boxcox = None, suffix = '_BoxCoxTransf', specification_lims = None):
    
    import pandas as pd
    import numpy as np
    from statsmodels.stats import diagnostic
    from scipy import stats
    
    # This function will process a single column column_to_transform 
    # of the dataframe df per call.
    
    # Check https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.boxcox.html
    ## Box-Cox transform is given by:
    ## y = (x**lmbda - 1) / lmbda,  for lmbda != 0
    ## log(x),                  for lmbda = 0
    
    # column_to_transform must be a string with the name of the column.
    # e.g. column_to_transform = 'column1' to transform a column named as 'column1'
    
    # mode = 'calculate_and_apply'
    # Set mode = 'calculate_and_apply' to calculate lambda and apply the Box-Cox
    # transform; or mode = 'apply_only' to apply the transform for a known lambda.
    # For 'apply_only', lambda_boxcox must be provided.
    
    # lambda_boxcox must be a float value. e.g. lambda_boxcox = 1.7
    # If you calculated lambda from the function box_cox_transform and saved the
    # transformation data summary dictionary as data_sum_dict, simply set:
    # lambda_boxcox = data_sum_dict['lambda_boxcox']
    # This will access the value on the key 'lambda_boxcox' of the dictionary, which
    # contains the lambda. 
    
    # Analogously, spec_lim_dict['Inf_spec_lim_transf'] access
    # the inferior specification limit transformed; and spec_lim_dict['Sup_spec_lim_transf'] 
    # access the superior specification limit transformed.
    
    # If lambda_boxcox is None, 
    # the mode will be automatically set as 'calculate_and_apply'.
    
    #suffix: string (inside quotes).
    # How the transformed column will be identified in the returned data_transformed_df.
    # If column_to_transform = 'Y' and suffix = '_BoxCoxTransf', the transformed column will be
    # identified as 'Y_BoxCoxTransf'.
    # Alternatively, input inside quotes a string with the desired suffix. Recommendation:
    # start the suffix with "_" to separate it from the original name
    
    #specification_lims = None if there are no specification limits for the variable being
    # transformed by the function.
    #In case there were originally specification limits for the variable (column) being
    # transformed, declare them as a list, array, or tuple of two numbers (float).
    # e.g. if the column represents a variable with specifications between 10 and 20 kg, declare
    # specification_lims = [10, 20]. If it represents a variable whose specifications should
    # be between 0.5 and 12.5 L, declare specification_lims = [0.5, 12.5]. The limits must be
    # strictly positive, since the Box-Cox transform is not defined for zero or negative values.
    # Then, the function will return the specifications transformed by the same Box-Cox
    # transformation applied to the data. Remember: if data were transformed, so should be
    # the specification limits.

    y = df[column_to_transform]
    
    boolean_check1 = (lambda_boxcox is None) and (mode == 'apply_only')
    # 'apply_only' needs a known lambda. If none was provided,
    # automatically set mode = 'calculate_and_apply'
    
    if (boolean_check1 == True):
        print("No value set for \'lambda_boxcox\'. Setting mode to \'calculate_and_apply\'.")
        mode = 'calculate_and_apply'
    
    boolean_check2 = (mode != 'calculate_and_apply') and (mode != 'apply_only')
    # != is the 'is different from' operator.
    #Check if neither 'calculate_and_apply' nor 'apply_only' were selected
    
    if (boolean_check2 == True):
        print("Invalid value set for \'mode\'. Setting mode to \'calculate_and_apply\'.")
        mode = 'calculate_and_apply'
    
    if (mode == 'calculate_and_apply'):
        # Calculate lambda_boxcox
        lambda_boxcox = stats.boxcox_normmax(y, method='pearsonr')
        # method = 'pearsonr' picks the lambda that maximizes the Pearson correlation
        # coefficient between y = boxcox(x) and the values expected for y if x were
        # normally distributed. Use method = 'mle' for the maximum likelihood estimate.
    
    # For other cases, we will apply the lambda_boxcox set as the function parameter.

    # Calculate the transformed variable
    y_transform = stats.boxcox(y, lmbda=lambda_boxcox, alpha=None)
    # y_transform stores the Box-Cox transformed values
    
    if not (suffix is None):
        #only if a suffix was declared
        #concatenate the column name to the suffix
        new_col = column_to_transform + suffix
    
    else:
        #concatenate the column name to the standard '_BoxCoxTransf' suffix
        new_col = column_to_transform + '_BoxCoxTransf'
    
    # Work on a copy, so that the dataframe passed as argument is not modified in place
    data_transformed_df = df.copy()
    data_transformed_df[new_col] = y_transform
    #dataframe containing the transformed data
    
    print("Data successfully transformed. Check the 10 first transformed rows:\n")
    print(data_transformed_df.head(10))
    print("\n") #line break
    
    #normality tests for the transformed variable
    #Lilliefors’ test
    lilliefors_test = diagnostic.kstest_normal(y_transform, dist='norm', pvalmethod='table')
    #Return: first element: ksstat: float
    #Kolmogorov-Smirnov test statistic with estimated mean and variance.
    #Second element: p-value: float
    #If the p-value is lower than some threshold, e.g. 0.10, then we can reject the Null hypothesis that the sample comes from a normal distribution.
    
    p_lillie = (lilliefors_test[1])
    #keep only the p-value
    
    #Anderson-Darling
    ad_test = diagnostic.normal_ad(y_transform, axis=0)
    #Return: first element: ad2: float
    #Anderson Darling test statistic.
    #Second element: p-val: float
    #The p-value for hypothesis that the data comes from a normal distribution with unknown mean and variance.
    
    p_ad = (ad_test[1])
    #keep only the p-value
    
    data_sum_dict = {'lambda_boxcox': lambda_boxcox, 'Lilliefors_p_value': p_lillie, 'AndersonDarling_p_value': p_ad}
    #dictionary with the p-values and the lambda
    
    print("Box-Cox Transformation Summary:\n")
    print(data_sum_dict)
    print("\n") #line break
    
    if not (specification_lims is None):
        #run this step only when the specification limits were provided
        
        #Convert the list of specifications into a NumPy array:
        spec_lim_array = np.array(specification_lims, dtype = float)
        
        #Apply the Box-Cox transform to this array and store the results in the same array:
        spec_lim_array = stats.boxcox(spec_lim_array, lmbda=lambda_boxcox, alpha=None)
        
        #One key per limit (a list cannot be used as a dictionary key):
        spec_lim_dict = {'Inf_spec_lim_transf': spec_lim_array[0], 'Sup_spec_lim_transf': spec_lim_array[1]}
        
        print("New specification limits successfully obtained:\n")
        print(spec_lim_dict)
    
    if not (specification_lims is None):
        #If there are specification limits, also return the transformed limits
        return data_transformed_df, data_sum_dict, spec_lim_dict
    
    #if there are no specification limits:
    else:
        return data_transformed_df, data_sum_dict
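
The core of the function is three SciPy calls: estimate lambda, transform the data, and transform the specification limits with the same lambda. The sketch below assumes synthetic, strictly positive data and hypothetical limits `[1.0, 12.5]`; it is not the notebook's function, only the same sequence of steps.

```python
import numpy as np
from scipy import stats

# Synthetic, strictly positive, right-skewed data (illustrative only).
rng = np.random.default_rng(1)
y = rng.lognormal(mean=1.0, sigma=0.5, size=300)

# Lambda that maximizes the correlation of the normal probability plot.
lambda_boxcox = stats.boxcox_normmax(y, method="pearsonr")

# Apply the transform with the estimated lambda.
y_transformed = stats.boxcox(y, lmbda=lambda_boxcox)

# Hypothetical specification limits, transformed with the same lambda.
specification_lims = np.array([1.0, 12.5])
spec_lims_transformed = stats.boxcox(specification_lims, lmbda=lambda_boxcox)
spec_lim_dict = {
    "Inf_spec_lim_transf": spec_lims_transformed[0],
    "Sup_spec_lim_transf": spec_lims_transformed[1],
}

print(f"lambda = {float(lambda_boxcox):.4f}")
print(f"skewness before = {stats.skew(y):.3f}; after = {stats.skew(y_transformed):.3f}")
print(spec_lim_dict)
```

Because the transform is monotonic, the lower limit stays below the upper limit after the transformation.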

# **Function for reversing Box-Cox transform**

In [21]:
def reverse_box_cox (df, column_to_transform, lambda_boxcox, suffix = '_ReversedBoxCox'):
    
    import pandas as pd
    import numpy as np
    
    # This function will process a single column column_to_transform 
    # of the dataframe df per call.
    
    # Check https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.boxcox.html
    ## Box-Cox transform is given by:
    ## y = (x**lmbda - 1) / lmbda,  for lmbda != 0
    ## log(x),                  for lmbda = 0
    
    # column_to_transform must be a string with the name of the column.
    # e.g. column_to_transform = 'column1' to transform a column named as 'column1'
    
    # lambda_boxcox must be a float value. e.g. lambda_boxcox = 1.7
    # If you calculated lambda from the function box_cox_transform and saved the
    # transformation data summary dictionary as data_sum_dict, simply set:
    # lambda_boxcox = data_sum_dict['lambda_boxcox']
    # This will access the value on the key 'lambda_boxcox' of the dictionary, which
    # contains the lambda. 
    
    # Analogously, spec_lim_dict['Inf_spec_lim_transf'] accesses
    # the transformed inferior specification limit; and spec_lim_dict['Sup_spec_lim_transf'] 
    # accesses the transformed superior specification limit.
    
    # suffix: string (inside quotes).
    # How the retransformed column will be identified in the returned data_retransformed_df.
    # If column_to_transform = 'Y' and suffix = '_ReversedBoxCox', the new column will be
    # identified as 'Y_ReversedBoxCox'.
    # Alternatively, input inside quotes a string with the desired suffix. Recommendation:
    # start the suffix with "_" to separate it from the original name.
    
    y = df[column_to_transform]
    
    if (lambda_boxcox == 0):
        # y_transf = np.log(y_original), according to the Box-Cox definition. Then,
        # y_original = np.exp(y_transf)
        # In this function, the transformed data y_transf is passed as the argument y.
        y_transform = np.exp(y)
    
    else:
        # Reverse the Box-Cox function:
        # y_transf = (y_original**lmbda - 1) / lmbda. Then,
        # y_original ** (lmbda) = (y_transf * lmbda) + 1
        # y_original = ((y_transf * lmbda) + 1) ** (1/lmbda), where ** is the exponentiation
        # In this function, the transformed data y_transf is passed as the argument y.
        y_transform = ((y * lambda_boxcox) + 1) ** (1/lambda_boxcox)
    
    if not (suffix is None):
        #only if a suffix was declared
        #concatenate the column name to the suffix
        new_col = column_to_transform + suffix
    
    else:
        #concatenate the column name to the standard '_ReversedBoxCox' suffix
        new_col = column_to_transform + '_ReversedBoxCox'
    
    data_retransformed_df = df
    data_retransformed_df[new_col] = y_transform
    # dataframe containing the retransformed data
    
    print("Data successfully retransformed. Check the 10 first retransformed rows:\n")
    print(data_retransformed_df.head(10))
    print("\n") #line break
 
    return data_retransformed_df
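
The reverse formulas used above can be checked with a round trip. The block below is a minimal sketch, not part of the original workflow: the helper names `box_cox` and `inverse_box_cox`, the sample array, and the lambda values are all hypothetical, chosen only to show that applying the transform and then its reverse recovers the original data for both branches (lambda = 0 and lambda != 0).

```python
import numpy as np

def box_cox(x, lmbda):
    # Forward Box-Cox transform for strictly positive x (hypothetical helper).
    if lmbda == 0:
        return np.log(x)
    return (x**lmbda - 1) / lmbda

def inverse_box_cox(y, lmbda):
    # Reverse transform: the forward expression solved for x (hypothetical helper).
    if lmbda == 0:
        return np.exp(y)
    return (y * lmbda + 1) ** (1 / lmbda)

# Illustrative, strictly positive sample data.
x = np.array([1.0, 2.5, 4.0, 7.5, 12.0])

for lmbda in (0, 0.5):
    x_back = inverse_box_cox(box_cox(x, lmbda), lmbda)
    # True is printed when the round trip recovers the original values.
    print(lmbda, np.allclose(x_back, x))
```

The round trip only works if the same lambda used for the transform is passed to the reverse function, which is why the lambda stored in the summary dictionary should be kept.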

# **Function for One-Hot Encoding categorical features**

- Transform categorical values without notion of order into numerical (binary) features.
- Process one or more categorical columns per function call, encoding one column at a time.
- For each category, the One-Hot Encoder creates a new column in the dataset. This new column is a binary variable that is equal to 0 if the row is not classified in that category, and equal to 1 if the row represents an element of that category.
- The new columns will be named as the original possible categories.
- Each column is a binary variable of the type "is classified in this category or not".

Therefore, for a category "A", a column named "A" is created.
- If the row is an element from category "A", the value for the column "A" is 1.
- If not, the value for column "A" is 0.
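
As a small illustration of the description above, the sketch below encodes a toy column with scikit-learn's `OneHotEncoder`. The dataframe `toy_df` and its values are hypothetical; the point is only to show one binary column per category, named after the category.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data with three categories: "A", "B" and "C".
toy_df = pd.DataFrame({"category": ["A", "B", "A", "C"]})

# Keep the fitted encoder in its own variable: the categories are read from it.
encoder = OneHotEncoder()
encoded = encoder.fit_transform(toy_df[["category"]]).toarray()

# categories_ is a list with one array per encoded column; take the first one.
encoded_df = pd.DataFrame(encoded, columns = encoder.categories_[0], index = toy_df.index)
print(encoded_df)
```

Each row has the value 1 only in the column of its own category, and 0 in the others.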

In [22]:
def OneHotEncode_df (df, subset_of_features_to_be_encoded):

    import pandas as pd
    from sklearn.preprocessing import OneHotEncoder
    
    #df: the whole dataframe to be processed.
    
    #subset_of_features_to_be_encoded: list of strings (inside quotes), 
    # containing the names of the columns with the categorical variables that will be 
    # encoded. If a single column will be encoded, declare this parameter as a list with
    # only one element, e.g. subset_of_features_to_be_encoded = ["column1"] 
    # will encode the column named as 'column1'; 
    # subset_of_features_to_be_encoded = ["col1", 'col2', 'col3'] will encode 3 columns
    # with categorical variables: 'col1', 'col2', and 'col3'.
    
    #Start an encoding dictionary empty:
    encoding_dict = {}
    
    #Start a copy of the original dataframe. This copy will be updated to create the new
    # transformed dataframe. Then, we avoid manipulating the original object.
    new_df = df.copy(deep = True)
    
    #loop through each column of the subset:
    for column in subset_of_features_to_be_encoded:
        
        # Loop through each element (named 'column') of the list of columns to analyze,
        # subset_of_features_to_be_encoded
        
        #We could process the whole subset at once, but it could make us lose information
        # about the generated columns
        
        # set a subset of the dataframe X containing 'column' as the only column:
        # it will be equivalent to using .reshape(-1,1) to set a 1D-series
        # or array in the shape for scikit-learn:
        # For doing so, pass a list of columns for column filtering, containing
        # the object column as its single element:
        X  = df[[column]]
        
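        #Start the OneHotEncoder object:
        OneHot_enc_obj = OneHotEncoder()
        
        #Fit the encoder to that column and transform it. The encoder is kept in its
        # own variable, because the categories are attributes of the fitted encoder:
        encoded_X = OneHot_enc_obj.fit_transform(X) 
        
        #fit_transform returns a SciPy sparse matrix (most of its entries are zeros).
        #Retrieve the encoded categories from the fitted encoder and store this array. 
        #It will give the proper columns' names:
        encoded_columns = OneHot_enc_obj.categories_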

        #encoded_columns is a list containing a single element.
        # This element is an array like:
        # array(['cat1', 'cat2', 'cat3', 'cat4', 'cat5', 'cat6', 'cat7', 'cat8'], dtype=object)
        # Then, this array is the element of index 0 from the list encoded_columns.
        # It is represented as encoded_columns[0]

        #Therefore, we actually want the array which is named as encoded_columns[0]
        # Each element of this array is the name of one of the encoded columns. In the
        # example above, the element 'cat2' would be accessed as encoded_columns[0][1],
        # since it is the element of index [1] (second element) from the array 
        # encoded_columns[0].
        
        #Update the dictionary to store the original column name as key, and the categories
        # array as the value:
        encoding_dict.update({column: encoded_columns[0]})

        #Create the dense array:
        encoded_X = encoded_X.toarray()
        #print("One-Hot Encoding Matrix:")
        #print(encoded_X)

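        #Convert it into a dataframe, keeping the index of the original dataframe, so
        # that the rows stay aligned when the dataframes are joined on the index:
        encoded_X_df = pd.DataFrame(encoded_X, index = df.index)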

        #modify the names of the columns for the ones stored in the array encoded_columns[0]
        # Simply access the values stored in the dictionary. To access a value, simply pass
        # the name of the key (in quotes) inside brackets after the name of the dictionary,
        # just as accessing a column from a dataframe:
        encoded_X_df.columns = encoding_dict[column]
        
        #Inner join the new dataset with the encoded dataset.
        # Use the index as the key, since indices are necessarily correspondent.
        # To use join on index, we apply pandas .concat method.
        # To join on a specific key, we could use pandas .merge method with the arguments
        # left_on = 'left_key', right_on = 'right_key'; or, if the keys have same name,
        # on = 'key':
        # Check Pandas merge and concat documentation:
        # https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge.html
        
        new_df = pd.concat([new_df, encoded_X_df], axis = 1, join = "inner")
        
        print(f"Successfully encoded column \'{column}\' and merged the encoded columns to the dataframe.")
        print("Check first 5 rows of the encoded table that was merged:\n")
        print(encoded_X_df.head())
        # The default of the head method, when no parameter is passed, is to show 5 rows; if an
        # integer number Y is passed as argument .head(Y), Pandas shows the first Y rows.
    
    print("Finished One-Hot Encoding. Returning the new transformed dataframe; and an encoding dictionary with the original columns as keys, and arrays containing the categories on those columns as the correspondent values.")
    print(f"For each category in the columns \'{subset_of_features_to_be_encoded}\', a new column has value 1, if it is the actual category of that row; or is 0 if not.")
    print("Check the first 10 rows of the new dataframe:\n")
    print(new_df.head(10))

    #return the transformed dataframe and the encoding dictionary:
    return new_df, encoding_dict
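
Since the encoded columns are joined to the dataframe on the index, the two objects need matching index labels. The sketch below, with hypothetical data and variable names, shows what `pd.concat` with `axis = 1` and `join = "inner"` does when the labels match and when they do not.

```python
import pandas as pd

# Hypothetical dataframe whose index does not start at zero
# (e.g. after filtering rows of a larger table).
df = pd.DataFrame({"category": ["A", "B", "A"]}, index = [10, 11, 12])
encoded_values = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]

# Default RangeIndex (0, 1, 2): no label in common with df.index.
misaligned = pd.DataFrame(encoded_values, columns = ["A", "B"])
# Same index labels as df.
aligned = pd.DataFrame(encoded_values, columns = ["A", "B"], index = df.index)

# The inner join keeps only the index labels present in both dataframes.
lost_rows = pd.concat([df, misaligned], axis = 1, join = "inner")
kept_rows = pd.concat([df, aligned], axis = 1, join = "inner")

print(len(lost_rows), len(kept_rows))
```

With mismatched labels the inner join silently drops rows, so it is worth checking the number of rows of the dataframe after encoding.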

# **Function for scaling the features**
- Many Machine Learning algorithms (e.g. gradient-based and distance-based ones) are sensitive to the scale of the features. This function provides 3 methods (modes) of scaling:
    - `mode = 'standard'`: applies the standard scaling, which creates a new variable with mean = 0 and standard deviation = 1. Each value Y is transformed as Ytransf = (Y - u)/s, where u is the mean of the training samples, and s is the standard deviation of the training samples.
    - `mode = 'min_max'`: applies min-max normalization, with a resultant feature ranging from 0 to 1. Each value Y is transformed as Ytransf = (Y - Ymin)/(Ymax - Ymin), where Ymin and Ymax are the minimum and maximum values of Y, respectively.
    - `mode = 'factor'`: divides the whole series by a numeric value provided as argument. For a factor F, the new Y values will be Ytransf = Y/F.
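
The three formulas above can be applied by hand to a small array. This is only a worked sketch with NumPy: the array `Y` and the factor `F` are hypothetical values, and the standard deviation is the population one (`ddof = 0`), which is also what scikit-learn's `StandardScaler` uses.

```python
import numpy as np

# Hypothetical feature values.
Y = np.array([2.0, 4.0, 6.0, 8.0])

# 'standard': Ytransf = (Y - u)/s, with the population standard deviation (ddof = 0).
standard = (Y - Y.mean()) / Y.std()

# 'min_max': Ytransf = (Y - Ymin)/(Ymax - Ymin), ranging from 0 to 1.
min_max = (Y - Y.min()) / (Y.max() - Y.min())

# 'factor': Ytransf = Y/F, for a factor F chosen by the user.
F = 2.0
factor = Y / F

print(standard)
print(min_max)
print(factor)
```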

In [23]:
def feature_scaling (df, subset_of_features_to_scale, mode = 'standard', scale_with_new_params = True, scaling_params = None, suffix = '_scaled'):
    
    from sklearn.preprocessing import StandardScaler
    from sklearn.preprocessing import MinMaxScaler
    # Scikit-learn Preprocessing data guide:
    # https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-scaler
    # Standard scaler documentation:
    # https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
    # Min-Max scaler documentation:
    # https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html#sklearn.preprocessing.MinMaxScaler.set_params
    
    ## Many Machine Learning algorithms are sensitive to the scale of the features. 
    
    ## This function provides 3 methods (modes) of scaling:
    ## mode = 'standard': applies the standard scaling, 
    ##  which creates a new variable with mean = 0; and standard deviation = 1.
    ##  Each value Y is transformed as Ytransf = (Y - u)/s, where u is the mean 
    ##  of the training samples, and s is the standard deviation of the training samples.
    
    ## mode = 'min_max': applies min-max normalization, with a resultant feature 
    ## ranging from 0 to 1. each value Y is transformed as 
    ## Ytransf = (Y - Ymin)/(Ymax - Ymin), where Ymin and Ymax are the minimum and 
    ## maximum values of Y, respectively.
    
    ## mode = 'factor': divides the whole series by a numeric value provided as argument. 
    ## For a factor F, the new Y values will be Ytransf = Y/F.
    
    #df: the whole dataframe to be processed.
    
    #subset_of_features_to_scale: list of strings (inside quotes), 
    # containing the names of the columns with the numeric variables that will be 
    # scaled. If a single column will be scaled, declare this parameter as a list with
    # only one element, e.g. subset_of_features_to_scale = ["column1"] 
    # will scale the column named as 'column1'; 
    # subset_of_features_to_scale = ["col1", 'col2', 'col3'] will scale 3 columns
    # with numeric variables: 'col1', 'col2', and 'col3'.
    
    # scale_with_new_params = True
    # Set scale_with_new_params = True if you want to calculate a new
    # scaler for the data; or set scale_with_new_params = False if you want to apply 
    # parameters previously obtained to the data (i.e., if you want to apply a scaler
    # previously trained to another set of data; or want to simply apply the same
    # scaler again).
    
    # scaling_params:
    # This variable has effect only when scale_with_new_params = False
    ## WARNING: The mode 'factor' demands the input of the dictionary of factors that will be 
    # used for normalizing each column. Therefore, it can be used only 
    # when scale_with_new_params = False.
    
    ## For the mode 'factor', declare scaling_params as a dictionary containing the 
    # column name as the key and the correspondent factor as the value.
    # e.g. subset_of_features_to_scale = ['col1', 'col2'], 'col1' will be divided by 2.0, 
    # and 'col2' will be divided by 3.2,  then:
    # scaling_params = {'col1': 2.0, 'col2': 3.2}
    
    ## WARNING: For scaling_params (when scale_with_new_params = False and 
    # mode = 'standard' or mode = 'min_max'), the dictionary must be declared with the
    # column name as the key, and the whole dictionary of parameters as the correspondent
    # value. Then, it will be a dictionary of dictionaries, where there is a dictionary 
    # correspondent to each key. Each dictionary should be declared in the same way as the 
    # scaling_dictionary printed as output when the scaler is trained.
    
    #suffix: string (inside quotes).
    # How the transformed column will be identified in the returned dataframe.
    # If the column is named 'Y' and suffix = '_scaled', the transformed column will be
    # identified as 'Y_scaled'.
    # Alternatively, input inside quotes a string with the desired suffix. Recommendation:
    # start the suffix with "_" to separate it from the original name
      
    if (suffix is None):
        #set as the default
        suffix = '_scaled'
    
    #Start a copy of the original dataframe. This copy will be updated to create the new
    # transformed dataframe. Then, we avoid manipulating the original object.
    new_df = df.copy(deep = True)
    
    if (scale_with_new_params == True):
            #Let's create a new scaler
            
            #Start an scaling dictionary empty:
            scaling_dict = {}
            
            if (mode == 'standard'):
                
                for column in subset_of_features_to_scale:
                    # Loop through each element (named 'column') of the list of columns 
                    # to analyze:
                    
                    #Create a dataframe X by subsetting only the analyzed column
                    # it will be equivalent to using .reshape(-1,1) to set a 1D-series
                    # or array in the shape for scikit-learn:
                    # For doing so, pass a list of columns for column filtering, containing
                    # the object column as its single element:
                    X = new_df[[column]]
                    
                    #start the scaler:
                    scaler = StandardScaler()
                    
                    #fit the scaler to the column
                    scaler = scaler.fit(X)
                    
                    #calculate the scaled feature, and store it as new array:
                    scaled_feature = scaler.transform(X)
                    # scaler.inverse_transform(X) would reverse the scaling.

                    # Create the new_column name:
                    new_column = column + suffix
                    
                    # Set the new column as scaled_feature
                    new_df[new_column] = scaled_feature
                    
                    # Get the scaling parameters for that column.
                    # scaler.get_params returns only the constructor settings (copy,
                    # with_mean, with_std). The statistics learned by .fit are stored in
                    # the attributes mean_ and scale_, so add them to the dictionary:
                    scaling_params = scaler.get_params(deep=True)
                    scaling_params['mean_'] = float(scaler.mean_[0])
                    scaling_params['scale_'] = float(scaler.scale_[0])
                    
                    #scaling_params is a dictionary containing the scaling parameters.
                    #Update the dictionary to store the original column name as key, 
                    # and the dictionary of parameters as the value:
                    scaling_dict.update({column: scaling_params})
                    
                    print(f"Successfully scaled column {column}.")
                
                print("Successfully scaled the dataframe. Returning the transformed dataframe and the scaling dictionary.")
                print("Check 10 first rows of the new dataframe:\n")
                print(new_df.head(10))
                print("\n") # line break
                print("Check also the scaling dictionary obtained:\n")
                print(scaling_dict)
                
                return new_df, scaling_dict
                
            elif (mode == 'min_max'):
                  
                for column in subset_of_features_to_scale:
                    # Loop through each element (named 'column') of the list of columns 
                    # to analyze:
                    
                    #Create a dataframe X by subsetting only the analyzed column
                    # it will be equivalent to using .reshape(-1,1) to set a 1D-series
                    # or array in the shape for scikit-learn:
                    # For doing so, pass a list of columns for column filtering, containing
                    # the object column as its single element:
                    X = new_df[[column]]
                    
                    #start the scaler:
                    scaler = MinMaxScaler()
                    
                    #fit the scaler to the column
                    scaler = scaler.fit(X)
                    
                    #calculate the scaled feature, and store it as new array:
                    scaled_feature = scaler.transform(X)
                    # scaler.inverse_transform(X) would reverse the scaling.

                    # Create the new_column name:
                    new_column = column + suffix
                    
                    # Set the new column as scaled_feature
                    new_df[new_column] = scaled_feature
                    
                    # Get the scaling parameters for that column.
                    # scaler.get_params returns only the constructor settings (e.g.
                    # feature_range). The statistics learned by .fit are stored in the
                    # attributes data_min_ and data_max_, so add them to the dictionary:
                    scaling_params = scaler.get_params(deep=True)
                    scaling_params['data_min_'] = float(scaler.data_min_[0])
                    scaling_params['data_max_'] = float(scaler.data_max_[0])
                    
                    #scaling_params is a dictionary containing the scaling parameters.
                    #Update the dictionary to store the original column name as key, 
                    # and the dictionary of parameters as the value:
                    scaling_dict.update({column: scaling_params})
                    
                    print(f"Successfully scaled column {column}.")
                
                print("Successfully scaled the dataframe. Returning the transformed dataframe and the scaling dictionary.")
                print("Check 10 first rows of the new dataframe:\n")
                print(new_df.head(10))
                print("\n") # line break
                print("Check also the scaling dictionary obtained:\n")
                print(scaling_dict)
                
                return new_df, scaling_dict
                
            else:
                print("Enter a valid mode, standard or min_max. The mode factor can be only used when scale_with_new_params == False and when a scaling dictionary was input as scaling_params.")       
                return "error", "error"
                
    else: 
        # scale_with_new_params == False
        # Use a previously obtained scaling_dict:
        
        scaling_dict = scaling_params
        
        if (mode == 'factor'):
            
            for column in subset_of_features_to_scale:
                # Loop through each element (named 'column') of the list of columns 
                # to analyze:
                
                # Create the new_column name:
                new_column = column + suffix
                # Create the new_column by dividing the previous column by the scaling factor:
                new_df[new_column] = (new_df[column])/(scaling_dict[column])
                
                print(f"Successfully scaled column {column}.")

            print("Successfully scaled the dataframe.")
            print("Check 10 first rows of the new dataframe:\n")
            print(new_df.head(10))

            return new_df
        
        elif (mode == 'standard'):
            
            for column in subset_of_features_to_scale:
                # Loop through each element (named 'column') of the list of columns 
                # to analyze:
                
                #Create a dataframe X by subsetting only the analyzed column
                # it will be equivalent to using .reshape(-1,1) to set a 1D-series
                # or array in the shape for scikit-learn:
                # For doing so, pass a list of columns for column filtering, containing
                # the object column as its single element:
                X = new_df[[column]]
                    
                #Get the dictionary of scaling parameters for the feature 'column':
                # For that, access the key: 'column' in the scaling_dict dictionary
                # to retrieve its value, i.e., the dictionary for that feature:
                scaling_params = scaling_dict[column]
                    
                # A scaler that was not fitted cannot transform data, and the
                # .set_params method restores only the constructor settings, not the
                # statistics learned by .fit. Then, apply the standard scaling formula
                # directly: Ytransf = (Y - u)/s. The mean u and the standard deviation s
                # are expected under the keys 'mean_' and 'scale_' of scaling_params
                # (the values of the attributes mean_ and scale_ of the fitted scaler):
                scaled_feature = (X[column] - scaling_params['mean_'])/(scaling_params['scale_'])
                # scaled_feature*s + u would reverse the scaling.

                # Create the new_column name:
                new_column = column + suffix
                    
                # Set the new column as scaled_feature
                new_df[new_column] = scaled_feature
                    
                print(f"Successfully scaled column {column}.")
                
            print("Successfully scaled the dataframe.")
            print("Check 10 first rows of the new dataframe:\n")
            print(new_df.head(10))
                
            return new_df
        
        elif (mode == 'min_max'):
            
            for column in subset_of_features_to_scale:
                # Loop through each element (named 'column') of the list of columns 
                # to analyze:
                
                #Create a dataframe X by subsetting only the analyzed column
                # it will be equivalent to using .reshape(-1,1) to set a 1D-series
                # or array in the shape for scikit-learn:
                # For doing so, pass a list of columns for column filtering, containing
                # the object column as its single element:
                X = new_df[[column]]
                    
                #Get the dictionary of scaling parameters for the feature 'column':
                # For that, access the key: 'column' in the scaling_dict dictionary
                # to retrieve its value, i.e., the dictionary for that feature:
                scaling_params = scaling_dict[column]
                    
                # A scaler that was not fitted cannot transform data, and the
                # .set_params method restores only the constructor settings, not the
                # statistics learned by .fit. Then, apply the min-max formula directly:
                # Ytransf = (Y - Ymin)/(Ymax - Ymin). Ymin and Ymax are expected under
                # the keys 'data_min_' and 'data_max_' of scaling_params (the values of
                # the attributes data_min_ and data_max_ of the fitted scaler):
                y_min = scaling_params['data_min_']
                y_max = scaling_params['data_max_']
                scaled_feature = (X[column] - y_min)/(y_max - y_min)
                # scaled_feature*(y_max - y_min) + y_min would reverse the scaling.

                # Create the new_column name:
                new_column = column + suffix
                    
                # Set the new column as scaled_feature
                new_df[new_column] = scaled_feature
                    
                print(f"Successfully scaled column {column}.")
                
            print("Successfully scaled the dataframe.")
            print("Check 10 first rows of the new dataframe:\n")
            print(new_df.head(10))
                
            return new_df
        
        else:

            print("Select a valid mode: standard, min_max, or factor.")
            return "error"
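
When a scaler is reused on another dataset, what has to be saved are the statistics learned during the fit. The sketch below, with hypothetical dataframes `train_df` and `new_df`, illustrates that `get_params()` of a scikit-learn `StandardScaler` returns only the constructor settings, while the mean and the standard deviation are stored in the fitted attributes `mean_` and `scale_`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical training data and new data to be scaled with the same statistics.
train_df = pd.DataFrame({"col1": [2.0, 4.0, 6.0, 8.0]})
new_df = pd.DataFrame({"col1": [5.0, 10.0]})

scaler = StandardScaler().fit(train_df[["col1"]])

# Constructor settings only: no mean or standard deviation here.
print(scaler.get_params())
# Statistics learned by .fit (one value per feature).
print(scaler.mean_, scaler.scale_)

# Saving these two numbers is enough to scale new data later.
u = float(scaler.mean_[0])
s = float(scaler.scale_[0])

manual = (new_df["col1"] - u) / s
from_scaler = scaler.transform(new_df[["col1"]]).ravel()
print(np.allclose(manual, from_scaler))
```

Storing `u` and `s` in a plain dictionary keeps the scaling reproducible without needing the fitted scaler object itself.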

# **Function for reversing the scaling of the features**
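- `mode = 'standard'`: reverses the standard scaling. Each value is retransformed as Y = Ytransf*s + u.
- `mode = 'min_max'`: reverses the min-max normalization. Each value is retransformed as Y = Ytransf*(Ymax - Ymin) + Ymin.
- `mode = 'factor'`: reverses the division by the factor F. Each value is retransformed as Y = Ytransf*F.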
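
Each mode is reversed by solving its scaling formula for Y. The sketch below uses hypothetical values for `Y` and `F`; it scales the array by hand in each mode and then applies the reverse expression to confirm that the original values are recovered.

```python
import numpy as np

# Hypothetical original feature values and scaling parameters.
Y = np.array([2.0, 4.0, 6.0, 8.0])
u, s = Y.mean(), Y.std()
Y_min, Y_max = Y.min(), Y.max()
F = 2.0

# Scale, then reverse, for each mode.
back_standard = ((Y - u) / s) * s + u
back_min_max = ((Y - Y_min) / (Y_max - Y_min)) * (Y_max - Y_min) + Y_min
back_factor = (Y / F) * F

for name, values in [("standard", back_standard), ("min_max", back_min_max), ("factor", back_factor)]:
    # True is printed when the reverse recovers the original values.
    print(name, np.allclose(values, Y))
```

The reverse only recovers the original data when it uses the same parameters (u, s, Ymin, Ymax or F) that were used for scaling.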

In [24]:
def reverse_feature_scaling (df, subset_of_features_to_scale, scaling_params, mode = 'standard', suffix = '_reverseScaling'):
    
    from sklearn.preprocessing import StandardScaler
    from sklearn.preprocessing import MinMaxScaler
    # Scikit-learn Preprocessing data guide:
    # https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-scaler
    # Standard scaler documentation:
    # https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
    # Min-Max scaler documentation:
    # https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html#sklearn.preprocessing.MinMaxScaler.set_params
    
    ## This function reverses the scaling applied by the function feature_scaling. 
    
    ## The 3 methods (modes) of scaling that can be reversed are:
    ## mode = 'standard': applies the standard scaling, 
    ##  which creates a new variable with mean = 0; and standard deviation = 1.
    ##  Each value Y is transformed as Ytransf = (Y - u)/s, where u is the mean 
    ##  of the training samples, and s is the standard deviation of the training samples.
    
    ## mode = 'min_max': applies min-max normalization, with a resultant feature 
    ## ranging from 0 to 1. each value Y is transformed as 
    ## Ytransf = (Y - Ymin)/(Ymax - Ymin), where Ymin and Ymax are the minimum and 
    ## maximum values of Y, respectively.
    
    ## mode = 'factor': divides the whole series by a numeric value provided as argument. 
    ## For a factor F, the new Y values will be Ytransf = Y/F.
    
    #df: the whole dataframe to be processed.
    
    #subset_of_features_to_scale: list of strings (inside quotes), 
    # containing the names of the columns with the scaled variables that will be 
    # re-scaled. If a single column will be re-scaled, declare this parameter as a list with
    # only one element, e.g. subset_of_features_to_scale = ["column1"] 
    # will re-scale the column named as 'column1'; 
    # subset_of_features_to_scale = ["col1", 'col2', 'col3'] will re-scale 3 columns:
    # 'col1', 'col2', and 'col3'.
    
    ## WARNING: The mode 'factor' demands the input of the dictionary of factors that were 
    # used for normalizing each column.
    
    ## For the mode 'factor', declare scaling_params as a dictionary containing the 
    # column name as the key and the correspondent factor as the value.
    # e.g. subset_of_features_to_scale = ['col1', 'col2'], 'col1' was divided by 2.0, 
    # and 'col2' was divided by 3.2,  then:
    # scaling_params = {'col1': 2.0, 'col2': 3.2}
    
    ## WARNING: For scaling_params (when mode = 'standard' or mode = 'min_max'), 
    # the dictionary must be declared with the
    # column name as the key, and the whole dictionary of parameters as the correspondent
    # value. Then, it will be a dictionary of dictionaries, where there is a dictionary 
    # correspondent to each key. Each dictionary should be declared in the same way as the 
    # scaling_dictionary printed as output when the scaler is trained.
    
    #suffix: string (inside quotes).
    # How the re-scaled column will be identified in the returned dataframe.
    # If the column is named 'Y' and suffix = '_reverseScaling', the new column will be
    # identified as 'Y_reverseScaling'.
    # Alternatively, input inside quotes a string with the desired suffix. Recommendation:
    # start the suffix with "_" to separate it from the original name
      
    if (suffix is None):
        #set as the default
        suffix = '_reverseScaling'
    
    #Start a copy of the original dataframe. This copy will be updated to create the new
    # transformed dataframe. Then, we avoid manipulating the original object.
    new_df = df.copy(deep = True)
    
    # Use a previously obtained scaling_dict:
        
    scaling_dict = scaling_params
        
    if (mode == 'factor'):
            
        for column in subset_of_features_to_scale:
            # Loop through each element (named 'column') of the list of columns 
            # to analyze:
                
            # Create the new_column name:
            new_column = column + suffix
            # Create the new_column.
            # Once the scaling was performed through division, the reverse of it consists
            # on a multiplication:
            
            new_df[new_column] = (new_df[column])*(scaling_dict[column])
                
            print(f"Successfully re-scaled column {column}.")

        # These steps run only after all the columns were processed (outside the loop):
        print("Successfully re-scaled the dataframe.")
        print("Check 10 first rows of the new dataframe:\n")
        print(new_df.head(10))

        return new_df
        
    elif (mode == 'standard'):
            
        for column in subset_of_features_to_scale:
            # Loop through each element (named 'column') of the list of columns 
            # to analyze:
                
            #Create a dataframe X by subsetting only the analyzed column
            # it will be equivalent to using .reshape(-1,1) to set a 1D-series
            # or array in the shape for scikit-learn:
            # For doing so, pass a list of columns for column filtering, containing
            # the object column as its single element:
            X = new_df[[column]]
                    
            #Get the dictionary of scaling parameters for the feature 'column':
            # For that, access the key: 'column' in the scaling_dict dictionary
            # to retrieve its value, i.e., the dictionary for that feature:
            scaling_params = scaling_dict[column]
                    
            # A scaler that was not fitted cannot invert the transform, and the
            # .set_params method restores only the constructor settings, not the
            # statistics learned by .fit. Then, invert the standard scaling formula
            # directly: Y = Ytransf*s + u. The mean u and the standard deviation s
            # are expected under the keys 'mean_' and 'scale_' of scaling_params
            # (the values of the attributes mean_ and scale_ of the fitted scaler):
            scaled_feature = (X[column])*(scaling_params['scale_']) + scaling_params['mean_']

            # Create the new_column name:
            new_column = column + suffix
                    
            # Set the new column as scaled_feature
            new_df[new_column] = scaled_feature
                    
            print(f"Successfully re-scaled column {column}.")
                
        print("Successfully re-scaled the dataframe.")
        print("Check 10 first rows of the new dataframe:\n")
        print(new_df.head(10))
                
        return new_df
        
    elif (mode == 'min_max'):
            
        for column in subset_of_features_to_scale:
            # Loop through each element (named 'column') of the list of columns 
            # to analyze:
                
            #Create a dataframe X by subsetting only the analyzed column
            # it will be equivalent to using .reshape(-1,1) to set a 1D-series
            # or array in the shape for scikit-learn:
            # For doing so, pass a list of columns for column filtering, containing
            # the object column as its single element:
            X = new_df[[column]]
                    
            #start the scaler:
            scaler = MinMaxScaler()
                    
            #Get the dictionary of scaling parameters for the feature 'column':
            # For that, access the key: 'column' in the scaling_dict dictionary
            # to retrieve its value, i.e., the dictionary for that feature:
            scaling_params = scaling_dict[column]
                    
            # Now, set the scaler parameters to be equal to the values retrieved
            # as the dictionary scaling_params:
            scaler = scaler.set_params(**scaling_params)
            # Notice that the .set_params method substitutes the step where we applied
            # the .fit method. The parameters must be unpacked as keyword arguments (**).
                    
            #Invert the scaling of the feature, and store it as new array:
            scaled_feature = scaler.inverse_transform(X)
            # Notice that this step substitutes the application of the method
            # scaler.transform(X), used for scaling the variable.
                
            # Create the new_column name:
            new_column = column + suffix
            # Set the new column as the feature restored to its original scale:
            new_df[new_column] = scaled_feature
                    
            print(f"Successfully re-scaled column {column}.")
                
        print("Successfully re-scaled the dataframe.")
        print("Check 10 first rows of the new dataframe:\n")
        print(new_df.head(10))
                
        return new_df
        
    else:

        print("Select a valid mode: standard, min_max, or factor.")
        return "error"
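
The arithmetic behind reversing a scaling can be checked by hand. The sketch below is a plain-Python illustration, not the scikit-learn implementation: it assumes standard scaling of the form z = (x - mean) / std with the population standard deviation, and min-max scaling to the range [0, 1]. The helper names `inverse_standard` and `inverse_min_max` and the sample values are hypothetical.

```python
import math
from statistics import fmean, pstdev

def inverse_standard(scaled, mean, std):
    # Reverse z = (x - mean) / std, i.e. x = z * std + mean
    return [z * std + mean for z in scaled]

def inverse_min_max(scaled, minimum, maximum):
    # Reverse z = (x - min) / (max - min), i.e. x = z * (max - min) + min
    return [z * (maximum - minimum) + minimum for z in scaled]

# Hypothetical feature values
original = [2.0, 4.0, 6.0, 8.0]

# Standard scaling followed by its inverse
mean, std = fmean(original), pstdev(original)
standardized = [(x - mean) / std for x in original]
recovered_standard = inverse_standard(standardized, mean, std)

# Min-max scaling followed by its inverse
lowest, highest = min(original), max(original)
normalized = [(x - lowest) / (highest - lowest) for x in original]
recovered_min_max = inverse_min_max(normalized, lowest, highest)

# Both round trips should return the original values (up to rounding)
round_trip_ok = all(
    math.isclose(a, b) and math.isclose(c, b)
    for a, b, c in zip(recovered_standard, original, recovered_min_max)
)
print(round_trip_ok)
```

The same statistics used for scaling (mean and standard deviation, or minimum and maximum) must be available when reversing, which is why they need to be stored alongside the scaled data.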

# **Function for exporting the dataframe**

In [None]:
def export_dataframe (dataframe_to_be_exported, new_file_name_with_csv_extension, file_directory_path = None, export_to_s3_bucket = False, s3_bucket_name = None, desired_s3_file_name_with_csv_extension = None):
    
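    import os
    import logging
    import boto3
    # boto3 is the AWS SDK for Python
    from botocore.exceptions import ClientError
    # logging and ClientError are required by the error handling of the S3 upload
    import pandas as pd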
    
    ## WARNING: all file extensions should be .csv for this function
    
    # FILE_DIRECTORY_PATH - (string, in quotes): input the path of the directory 
    # (e.g. folder path) where the file is stored. e.g. FILE_DIRECTORY_PATH = "/" 
    # or FILE_DIRECTORY_PATH = "/folder"
    # If you want to export the file to AWS S3, this parameter will have no effect.
    # In this case, you can set FILE_DIRECTORY_PATH = None

    # NEW_FILE_NAME_WITH_CSV_EXTENSION - (string, in quotes): input the name of the 
    # file with the extension. e.g. NEW_FILE_NAME_WITH_CSV_EXTENSION = "file.csv"
    
    # export_to_s3_bucket = False. Alternatively, set as True to export the file to an
    # AWS S3 Bucket.

    ## The following parameters have effect only when export_to_s3_bucket == True:

    # s3_bucket_name = None.
    ## This parameter is mandatory to access an AWS S3 bucket. Substitute it for a string
    # with the bucket's name. e.g. s3_bucket_name = "aws-bucket-1" accesses a bucket named
    # "aws-bucket-1"

    # desired_s3_file_name_with_csv_extension = None.
    # The name desired for the object stored in S3 (string, in quotes). 
    # Keep it None to set it equal to new_file_name_with_csv_extension. 
    # Alternatively, set it as a string analogous to new_file_name_with_csv_extension. 
    # e.g. desired_s3_file_name_with_csv_extension = "S3_file.csv"
    
    if (export_to_s3_bucket == True):
        
        if (desired_s3_file_name_with_csv_extension is None):
            #Repeat new_file_name_with_extension
            desired_s3_file_name_with_csv_extension = new_file_name_with_csv_extension
        
        # If the bucket name was provided, start the session. If not, print an error
        # message:
        
        if (s3_bucket_name is None):
            
            print("Please, provide a valid S3 Bucket name to export to.")
        
        else:
        
            # start S3 client:
            print("Starting AWS S3 client.")
        
            # Let's export the file to a AWS S3 (simple storage service) bucket
            # instantiate S3 client and upload to s3
            s3_client = boto3.resource('s3')
            
            # Create a local copy of the file on the root.
            local_copy_path = os.path.join("/", new_file_name_with_csv_extension)
            dataframe_to_be_exported.to_csv(local_copy_path, index = False)
            
            print("Local copy of the dataframe created on the root path to export to S3.")
            print("Simply delete this file from the root path if you only want to keep the S3 version.")
            
            # Upload this local copy to S3:
            try:
                response = s3_client.meta.client.upload_file(local_copy_path, s3_bucket_name, desired_s3_file_name_with_csv_extension)
            
            except ClientError as e:
                logging.error(e)
                return False
            
            print(f"{desired_s3_file_name_with_csv_extension} successfully exported to {s3_bucket_name} AWS S3 bucket.")
            return True
            # Check AWS Documentation:
            # https://boto3.amazonaws.com/v1/documentation/api/latest/guide/s3-uploading-files.html
            
            # Notice: if you wanted to authenticate directly from Python code, you could use
            # the following code, instead:        
            # ACCESS_KEY = 'access_key_ID'
            # PASSWORD_KEY = 'password_key'
            # s3_client = boto3.client('s3', aws_access_key_id = ACCESS_KEY, aws_secret_access_key = PASSWORD_KEY)
            # s3_client.upload_file(local_copy_path, s3_bucket_name, desired_s3_file_name_with_csv_extension)
            
    else :
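        # Do not export to AWS S3. Export to other path.
        # If no directory was provided, use the root path:
        if (file_directory_path is None):
            file_directory_path = "/"
        # Create the complete file path:
        file_path = os.path.join(file_directory_path, new_file_name_with_csv_extension)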

        dataframe_to_be_exported.to_csv(file_path, index = False)

        print(f"Dataframe {new_file_name_with_csv_extension} exported as \'{file_path}\'.")
        print("Warning: if there was a file in this file path, it was replaced by the exported dataframe.")
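
The path handling of the local export can be tried in isolation before writing to a real directory. The following is a minimal sketch, not the function above: `build_export_path` is a hypothetical helper, the rows are made-up data, and the standard-library `csv` module stands in for `DataFrame.to_csv` so that the block runs without pandas. It writes to a temporary directory instead of the root path.

```python
import csv
import os
import tempfile

def build_export_path(file_name, directory=None):
    # Hypothetical helper: enforce the .csv extension and fall back to the
    # current working directory when no directory is given.
    if not file_name.lower().endswith(".csv"):
        raise ValueError("The file name must end with .csv")
    if directory is None:
        directory = os.getcwd()
    return os.path.join(directory, file_name)

# Made-up rows standing in for a dataframe
rows = [{"feature": 1.0, "target": 0}, {"feature": 2.5, "target": 1}]

with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = build_export_path("dataset.csv", tmp_dir)

    # Write the rows without an index column, as index = False does in pandas
    with open(file_path, "w", newline="") as csv_file:
        writer = csv.DictWriter(csv_file, fieldnames=["feature", "target"])
        writer.writeheader()
        writer.writerows(rows)

    # Read the file back to confirm what was exported
    with open(file_path, newline="") as csv_file:
        exported_rows = list(csv.DictReader(csv_file))

print(exported_rows)
```

Values read back from a CSV file are strings, so numeric columns have to be converted again after import.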

# **Function for importing or exporting models and dictionaries**

In [2]:
def import_export_model_or_dict (action = 'import', objects_manipulated = 'model_only', model_file_name = None, dictionary_file_name = None, directory_path = '/', model_type = 'keras', dict_to_export = None, model_to_export = None, use_colab_memory = False):
    
    import os
    from statsmodels.tsa.arima.model import ARIMA
    from statsmodels.tsa.arima.model import ARIMAResults
    from keras.models import load_model
    from google.colab import files
    import pickle as pkl
    import dill
    
    # action = 'import' for importing a model and/or a dictionary;
    # action = 'export' for exporting a model and/or a dictionary.
    
    # objects_manipulated = 'model_only' if only a model will be manipulated.
    # objects_manipulated = 'dict_only' if only a dictionary will be manipulated.
    # objects_manipulated = 'model_and_dict' if both a model and a dictionary will be
    # manipulated.
    
    #model_file_name: string with the name of the file containing the model (for 'import');
    # or of the name that the exported file will have (for 'export')
    # e.g. model_file_name = 'model'
    # WARNING: Do not add the file extension.
    # Keep it in quotes. Keep model_file_name = None if no model will be manipulated.
    
    # dictionary_file_name: string with the name of the file containing the dictionary 
    # (for 'import');
    # or of the name that the exported file will have (for 'export')
    # e.g. dictionary_file_name = 'history_dict'
    # WARNING: Do not add the file extension.
    # Keep it in quotes. Keep DICTIONARY_FILE_NAME = None if no 
    # dictionary will be manipulated.
    
    # DIRECTORY_PATH: path of the directory where the model will be saved,
    # or from which the model will be retrieved. If no value is provided,
    # the DIRECTORY_PATH will be the root: "/"
    # Notice that the model and the dictionary must be stored in the same path.
    # If a model and a dictionary will be exported, they will be stored in the same
    # DIRECTORY_PATH.
    
    # model_type: This parameter has effect only when a model will be manipulated.
    # model_type: 'keras' for deep learning keras/ tensorflow models with extension .h5
    # model_type = 'sklearn_xgb' for models from sklearn or xgboost (non-deep learning)
    # model_type = 'arima' for ARIMA model (Statsmodels)
    
    # dict_to_export and model_to_export: 
    # These two parameters have effect only when ACTION == 'export'. In this case, they
    # must be declared. If ACTION == 'export', keep:
    # dict_to_export = None, 
    # model_to_export = None
    # If one of these objects will be exported, substitute None by the name of the object
    # e.g. if your model is stored in the global memory as 'keras_model' declare:
    # model_to_export = keras_model. Notice that it must be declared without quotes, since
    # it is not a string, but an object.
    # For exporting a dictionary named as 'dict':
    # dict_to_export = dict
    
    # use_colab_memory: this parameter has effect only when using Google Colab (or it will
    # raise an error). Set as use_colab_memory = True if you want to use the instant memory
    # from Google Colaboratory: you will upload or download the file and it will be available
    # only during the time when the kernel is running. It will be deleted when the kernel
    # dies, for instance, when you close the notebook.
    
    # If action == 'export' and use_colab_memory == True, then the file will be downloaded
    # to your computer (running the cell will start the download).
    
    # Check the directory path
    if (directory_path is None):
        # set as the root:
        directory_path = "/"
        
        
    bool_check1 = (objects_manipulated != 'model_only')
    # bool_check1 == True if a dictionary will be manipulated
    
    bool_check2 = (objects_manipulated != 'dict_only')
    # bool_check2 == True if a model will be manipulated
    
    if (bool_check1 == True):
        #manipulate a dictionary
        
        if (dictionary_file_name is None):
            print("Please, enter a name for the dictionary.")
            return "error1"
        
        else:
            # Create the file path for the dictionary:
            dict_path = os.path.join(directory_path, dictionary_file_name)
            # Extract the file extension
            dict_extension = 'pkl'
            #concatenate:
            dict_path = dict_path + "." + dict_extension
            
    
    if (bool_check2 == True):
        #manipulate a model
        
        if (model_file_name is None):
            print("Please, enter a name for the model.")
            return "error1"
        
        else:
            # Create the file path for the model:
            model_path = os.path.join(directory_path, model_file_name)
            # Extract the file extension
            
            #check model_type:
            if (model_type == 'keras'):
                model_extension = 'h5'
            
            elif (model_type == 'sklearn_xgb'):
                model_extension = 'dill'
                #it could be 'pkl', though
            
            elif (model_type == 'arima'):
                model_extension = 'pkl'
            
            else:
                print("Enter a valid model_type: keras, sklearn_xgb, or arima.")
                return "error2"
            
            #concatenate:
            model_path = model_path +  "." + model_extension
            
    # Now we have the full paths for the dictionary and for the model.
    
    if (action == 'import'):
        
        if (use_colab_memory == True):
            
            print("Click on the button for file selection and select the files from your machine that will be uploaded in the Colab environment.")
            print("Warning: the files will be removed from Colab memory after the Kernel dies or after the notebook is closed.")
            # this functionality requires the previous declaration:
            ## from google.colab import files
            colab_files_dict = files.upload()
            # The files are stored into a dictionary called colab_files_dict where the keys
            # are the names of the files and the values are the files themselves.
            ## e.g. if you upload a single file named "dictionary.pkl", the dictionary will be
            ## colab_files_dict = {'dictionary.pkl': file}, where file is actually a big string
            ## representing the contents of the file. The length of this value is the size of the
            ## uploaded file, in bytes.
            ## To access the file is like accessing a value from a dictionary: 
            ## d = {'key1': 'val1'}, d['key1'] == 'val1'
            ## we simply declare the key inside brackets and quotes, the same way we would do for
            ## accessing the column of a dataframe.
            ## In this example, colab_files_dict['dictionary.pkl'] access the content of the 
            ## .pkl file, and len(colab_files_dict['dictionary.pkl']) is the size of the .pkl
            ## file in bytes.
            ## To check the dictionary keys, apply the method .keys() to the dictionary (with empty
            ## parentheses): colab_files_dict.keys()
            
            for key in colab_files_dict.keys():
                #loop through each element of the list of keys of the dictionary
                # (list colab_files_dict.keys()). Each element is named 'key'
                print(f"User uploaded file {key} with length {len(colab_files_dict[key])} bytes.")
                # The key is the name of the file, and the length of the value
                ## correspondent to the key is the file's size in bytes.
                ## Notice that the content of the uploaded object must be passed 
                ## as argument for a proper function to be interpreted. 
                ## For instance, the content of a xlsx file should be passed as
                ## argument for Pandas .read_excel function; the pkl file must be passed as
                ## argument for pickle.
                ## e.g., if you uploaded 'table.xlsx' and stored it into colab_files_dict you should
                ## declare df = pd.read_excel(colab_files_dict['table.xlsx']) to obtain a dataframe
                ## df from the uploaded table. Notice that is the value, not the key, that is the
                ## argument.
        
        if (bool_check1 == True):
            #manipulate a dictionary
            if (use_colab_memory == True):
                key = dictionary_file_name + "." + dict_extension
                #Use the key to access the file content (a bytes object), and pass it
                # to pickle. Since the content is bytes, not a path, use pkl.loads:
                imported_dict = pkl.loads(colab_files_dict[key])
                print(f"Dictionary {key} successfully imported to Colab environment.")
            
            else:
                #standard method             
                imported_dict = pkl.load(open(dict_path, 'rb'))
                # 'rb' stands for read binary (read mode). For writing mode, 'wb', 'write binary'
                print(f"Dictionary successfully imported from {dict_path}.")
                
        if (bool_check2 == True):
            #manipulate a model
            # select the proper model
        
            if (model_type == 'keras'):
                
                if (use_colab_memory == True):
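                    key = model_file_name + "." + model_extension
                    # load_model expects a file path, not the file content. files.upload()
                    # also saves the uploaded file in the current working directory, so
                    # the model is loaded from the file name:
                    model = load_model(key)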
                    print(f"Keras/TensorFlow model: {key} successfully imported to Colab environment.")
            
                else:
                    #standard method
                    # We previously declared:
                    # from keras.models import load_model
                    model = load_model(model_path)
                    print(f"Keras/TensorFlow model successfully imported from {model_path}.")

            elif (model_type == 'sklearn_xgb'):
                
                if (use_colab_memory == True):
                    key = model_file_name + "." + model_extension
                    # The uploaded content is a bytes object, so use dill.loads:
                    model = dill.loads(colab_files_dict[key])
                    print(f"Scikit-learn or XGBoost model: {key} successfully imported to Colab environment.")
            
                else:
                    #standard method
                    model = dill.load(open(model_path, 'rb'))
                    print(f"Scikit-learn or XGBoost model successfully imported from {model_path}.")
                    # For loading a pickle model:
                    ## model = pkl.load(open(model_path, 'rb'))
                    # 'rb' stands for read binary (read mode). For writing mode, 'wb', 'write binary'

            elif (model_type == 'arima'):
                
                if (use_colab_memory == True):
                    key = model_file_name + "." + model_extension
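                    # ARIMAResults.load expects a file path. files.upload() also saves the
                    # uploaded file in the current working directory, so use the file name:
                    model = ARIMAResults.load(key)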
                    print(f"ARIMA model: {key} successfully imported to Colab environment.")
            
                else:
                    #standard method
                    # We previously declared:
                    # from statsmodels.tsa.arima.model import ARIMAResults
                    model = ARIMAResults.load(model_path)
                    print(f"ARIMA model successfully imported from {model_path}.")
            
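        # Return the imported objects. This block is outside the model branch, so that
        # it also runs when objects_manipulated == 'dict_only':
        if (objects_manipulated == 'model_only'):
            # only the model should be returned
            return model
        
        elif (objects_manipulated == 'dict_only'):
            # only the dictionary should be returned:
            return imported_dict
        
        else:
            # Both objects are returned:
            return model, imported_dict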

    
    elif (action == 'export'):
        
        #Let's export the models or dictionary:
        if (use_colab_memory == True):
            print("The files will be downloaded to your computer.")
        
        if (bool_check1 == True):
            #manipulate a dictionary
            if (use_colab_memory == True):
                ## Download the dictionary
                key = dictionary_file_name + "." + dict_extension
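                # Use a context manager so the file is closed before the download starts:
                with open(key, 'wb') as dict_file:
                    pkl.dump(dict_to_export, dict_file)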
                # this functionality requires the previous declaration:
                ## from google.colab import files
                files.download(key)
                
                print(f"Dictionary {key} successfully downloaded from Colab environment.")
            
            else:
                #standard method             
                pkl.dump(dict_to_export, open(dict_path, 'wb'))
                #to save the file, the mode must be set as 'wb' (write binary)
                print(f"Dictionary successfully exported as {dict_path}.")
                
        if (bool_check2 == True):
            #manipulate a model
            # select the proper model
        
            if (model_type == 'keras'):
                
                if (use_colab_memory == True):
                    ## Download the model
                    key = model_file_name + "." + model_extension
                    model_to_export.save(key)
                    files.download(key)
                    print(f"Keras/TensorFlow model: {key} successfully downloaded from Colab environment.")
            
                else:
                    #standard method
                    model_to_export.save(model_path)
                    print(f"Keras/TensorFlow model successfully exported as {model_path}.")

            elif (model_type == 'sklearn_xgb'):
                
                if (use_colab_memory == True):
                    ## Download the model
                    key = model_file_name + "." + model_extension
                    # to save the file, the mode must be set as 'wb' (write binary).
                    # The context manager closes the file before the download starts:
                    with open(key, 'wb') as model_file:
                        dill.dump(model_to_export, model_file)
                    files.download(key)
                    print(f"Scikit-learn or XGBoost model: {key} successfully downloaded from Colab environment.")
            
                else:
                    #standard method
                    dill.dump(model_to_export, open(model_path, 'wb'))
                    print(f"Scikit-learn or XGBoost model successfully exported as {model_path}.")
                    # For exporting a pickle model:
                    ## pkl.dump(model_to_export, open(model_path, 'wb'))
                    
            elif (model_type == 'arima'):
                
                if (use_colab_memory == True):
                    ## Download the model
                    key = model_file_name + "." + model_extension
                    model_to_export.save(key)
                    files.download(key)
                    print(f"ARIMA model: {key} successfully downloaded from Colab environment.")
            
                else:
                    #standard method
                    model_to_export.save(model_path)
                    print(f"ARIMA model successfully exported as {model_path}.")
        
        print("Export of files completed.")
    
    else:
        print("Enter a valid action, import or export.")
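
The dictionary branch of the function boils down to a pickle round trip: build the path from directory, name, and the `.pkl` extension, write in `'wb'` mode, and read in `'rb'` mode. The sketch below shows that round trip on its own, assuming a made-up `scaling_dict` and a temporary directory in place of `directory_path`.

```python
import os
import pickle as pkl
import tempfile

# Made-up dictionary standing in for the object passed as dict_to_export
scaling_dict = {"feature_1": {"mean": 5.0, "std": 2.0}}

with tempfile.TemporaryDirectory() as tmp_dir:
    # Same path construction as in the function: directory + name + ".pkl"
    dict_path = os.path.join(tmp_dir, "scaling_dict") + "." + "pkl"

    # Export: 'wb' stands for write binary
    with open(dict_path, "wb") as dict_file:
        pkl.dump(scaling_dict, dict_file)

    # Import: 'rb' stands for read binary
    with open(dict_path, "rb") as dict_file:
        imported_dict = pkl.load(dict_file)

# The imported object is an equal copy, not the same object in memory
print(imported_dict == scaling_dict, imported_dict is scaling_dict)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.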

# **Function for downloading a file from Google Colab or AWS S3 to the local machine or uploading a file from the machine to S3 or to Colab's instant memory**

In [2]:
def download_or_upload_file (source = 'aws', action = 'download', object_to_download_from_colab = None, s3_bucket_name = None, local_path_of_storage = '/', file_name_with_extension = None):
    
    import os
    import boto3
    # boto3 is AWS S3 Python SDK
    from google.colab import files
    
    # source = 'google' for downloading from (or uploading to) Google Colab's instant memory;
    # source = 'aws' for downloading from (or uploading to) an AWS S3 bucket.
    
    # action = 'download' to download the file to the local machine
    # action = 'upload' to upload a file from local machine to AWS S3 or to
    # Google Colab's instant memory
    
    # object_to_download_from_colab = None. This option has effect only when
    # source == 'google'. In this case, this parameter is mandatory. 
    # Declare as object_to_download_from_colab the object that you want to download.
    # Since it is an object and not a string, it should not be declared in quotes.
    # e.g. to download a dictionary named dict, object_to_download_from_colab = dict.
    # To download a dataframe named df, declare object_to_download_from_colab = df.
    # To export a model named keras_model, declare object_to_download_from_colab = keras_model
    
    ## THE FOLLOWING PARAMETERS HAVE EFFECT ONLY WHEN source == 'aws'
    
    # s3_bucket_name = None.
    ## This parameter is mandatory to access an AWS S3 bucket. Substitute it for a string
    # with the bucket's name. e.g. s3_bucket_name = "aws-bucket-1" accesses a bucket named
    # "aws-bucket-1"
    
    # LOCAL_PATH_OF_STORAGE: path of the local computer environment 
    # to which the S3 bucket contents will be downloaded (ACTION == 'download'); or
    # path of the folder containing the file that will be uploaded in S3 (ACTION = 'upload'). 
    # If it is None, or if LOCAL_PATH_OF_STORAGE = '/', files 
    # will be imported to the root path. Alternatively, input the path as a string 
    # (in quotes).
    # Examples: LOCAL_PATH_OF_STORAGE = '/copied_s3_bucket'; 
    # LOCAL_PATH_OF_STORAGE = "/My_folder"; LOCAL_PATH_OF_STORAGE = "/Users/Me/Documents/"
    # Notice that only the directories should be declared: do not include the file name and
    # its extension.
    
    # file_name_with_extension: string, in quotes, containing the name of the file which
    # will be downloaded from S3 or uploaded to S3, followed by its extension. 
    ## This parameter is mandatory when source == 'aws'
    # Examples:
    # file_name_with_extension = 'Screen_Shot.png'; file_name_with_extension = 'dataset.csv',
    # file_name_with_extension = "dictionary.pkl", file_name_with_extension = "model.h5",
    # file_name_with_extension = 'doc.pdf', file_name_with_extension = 'model.dill'

    if (source == 'google'):
        
        if (action == 'upload'):
            
            print("Click on the button for file selection and select the files from your machine that will be uploaded in the Colab environment.")
            print("Warning: the files will be removed from Colab memory after the Kernel dies or after the notebook is closed.")
            # this functionality requires the previous declaration:
            ## from google.colab import files
            
            colab_files_dict = files.upload()
            
            # The files are stored into a dictionary called colab_files_dict where the keys
            # are the names of the files and the values are the files themselves.
            ## e.g. if you upload a single file named "dictionary.pkl", the dictionary will be
            ## colab_files_dict = {'dictionary.pkl': file}, where file is actually a big string
            ## representing the contents of the file. The length of this value is the size of the
            ## uploaded file, in bytes.
            ## To access the file is like accessing a value from a dictionary: 
            ## d = {'key1': 'val1'}, d['key1'] == 'val1'
            ## we simply declare the key inside brackets and quotes, the same way we would do for
            ## accessing the column of a dataframe.
            ## In this example, colab_files_dict['dictionary.pkl'] access the content of the 
            ## .pkl file, and len(colab_files_dict['dictionary.pkl']) is the size of the .pkl
            ## file in bytes.
            ## To check the dictionary keys, apply the method .keys() to the dictionary (with empty
            ## parentheses): colab_files_dict.keys()
            
            for key in colab_files_dict.keys():
                #loop through each element of the list of keys of the dictionary
                # (list colab_files_dict.keys()). Each element is named 'key'
                print(f"User uploaded file {key} with length {len(colab_files_dict[key])} bytes.")
                # The key is the name of the file, and the length of the value
                ## correspondent to the key is the file's size in bytes.
                ## Notice that the content of the uploaded object must be passed 
                ## as argument for a proper function to be interpreted. 
                ## For instance, the content of a xlsx file should be passed as
                ## argument for Pandas .read_excel function; the pkl file must be passed as
                ## argument for pickle.
                ## e.g., if you uploaded 'table.xlsx' and stored it into colab_files_dict you should
                ## declare df = pd.read_excel(colab_files_dict['table.xlsx']) to obtain a dataframe
                ## df from the uploaded table. Notice that is the value, not the key, that is the
                ## argument.
                
            # The instructions below are printed once, after the loop over the uploaded files:
            print("The uploaded files are stored into a dictionary object named as colab_files_dict.")
            print("Each key from this dictionary is the name of an uploaded file. The value correspondent to that key is the file itself.")
            print("The structure of a general Python dictionary is dict = {\'key1\': value1}. To access value1, declare file = dict[\'key1\'], as if you were accessing a column from a dataframe.")
            print("Then, if you uploaded a file named \'table.xlsx\', you can access this file as:")
            print("uploaded_file = colab_files_dict[\'table.xlsx\']")
            print("Notice, though, that the object uploaded_file is the whole file content, not a Python object already converted. To convert to a Python object, pass this element as argument for a proper function or method.")
            print("In this example, to convert the object uploaded_file to a dataframe, Pandas pd.read_excel function could be used. In the following line, a df dataframe object is obtained from the uploaded file:")
            print("df = pd.read_excel(uploaded_file)")
        
        elif (action == 'download'):
            
            if (object_to_download_from_colab is None):
                
                #No object was declared
                print("Please, inform an object to download. Since it is an object, not a string, it should not be declared in quotes.")
            
            else:
                
                print("The file will be downloaded to your computer.")

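                # Note: files.download expects the path (string) of a file already saved
                # in the Colab file system, so an object must be saved to a file beforehand.
                files.download(object_to_download_from_colab)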

                print(f"File {object_to_download_from_colab} successfully downloaded from Colab environment.")

        else:
            
            print("Please, select a valid action, download or upload.")
          
    elif (source == 'aws'):
        
        # Notice: if you wanted to authenticate directly from Python code, you could use
        # the following code, instead for starting the client:
        
        # ACCESS_KEY = 'access_key_ID'
        # PASSWORD_KEY = 'password_key'
        # s3_client = boto3.client('s3', aws_access_key_id = ACCESS_KEY, aws_secret_access_key = PASSWORD_KEY)
        # The rest of the code is the same.
        
        
        # If the path to store is None, also import the bucket content to root path;
        # or upload the file from root path to the bucket
        if (local_path_of_storage is None):
            
            local_path_of_storage = '/'
        
        # If the bucket name was provided, start the session. If not, print an error
        # message. The same for the file name with extension:
        
        if (s3_bucket_name is None):
            
            print("Please, provide a valid S3 Bucket name.")
        
        elif (file_name_with_extension is None):
            
            print("Please, provide a valid file name with its extension. e.g. \'dataset.csv\'.")
        
        else:
            
            # Obtain the full file path from which the file will be uploaded to S3; or to
            # which the file will be downloaded from S3:
            file_path = os.path.join(local_path_of_storage, file_name_with_extension)
            
            # Start S3 client:
            s3_client = boto3.resource('s3')
            
            print("Starting AWS S3 client.")
            
            if (action == 'upload'):
                
                s3_client.Object(s3_bucket_name, file_name_with_extension).\
                    upload_file(Filename = file_path)
                
                print(f"File {file_name_with_extension} successfully uploaded to AWS S3 {s3_bucket_name} bucket.")
            
            elif (action == 'download'):

                print("The file will be downloaded to your computer.")
                
                s3_client.Object(s3_bucket_name, file_name_with_extension).download_file(file_path)
                
                print(f"File {file_name_with_extension} successfully downloaded from AWS S3 {s3_bucket_name} bucket.")

            else:

                print("Please, select a valid action, download or upload.")

    else:
        
        print("Select a valid source: \'google\' for Google Colab\'s instant memory; or \'aws\' for accessing AWS S3 Bucket.")
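
Because each value of the dictionary returned by `files.upload()` is the raw content of a file (a bytes object), it has to be decoded or wrapped in a buffer before a reader can use it. The sketch below imitates that dictionary by hand so it runs outside Colab: `colab_files_dict` and its contents are made up, and the standard-library `csv` reader stands in for a Pandas reader.

```python
import csv
import io
import pickle

# Hand-built stand-in for the dictionary returned by files.upload():
# keys are file names, values are the file contents as bytes.
colab_files_dict = {
    "dictionary.pkl": pickle.dumps({"a": 1}),
    "table.csv": b"x,y\n1,2\n3,4\n",
}

for key, content in colab_files_dict.items():
    # The length of the value is the size of the file in bytes
    print(f"User uploaded file {key} with length {len(content)} bytes.")

# A pickled object is restored directly from the bytes with pickle.loads
uploaded_dict = pickle.loads(colab_files_dict["dictionary.pkl"])

# A text table is decoded and wrapped in an in-memory buffer for the reader
text_buffer = io.StringIO(colab_files_dict["table.csv"].decode("utf-8"))
table_rows = list(csv.DictReader(text_buffer))

print(uploaded_dict, table_rows)
```

For binary formats, `io.BytesIO(content)` gives an equivalent in-memory file object that readers accepting file-like inputs can consume.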

## **Call the functions**

### **Mounting Google Drive or S3 (AWS Simple Storage Service) bucket**

In [None]:
SOURCE = 'aws'
# SOURCE = 'google' for mounting the google drive;
# SOURCE = 'aws' for accessing an AWS S3 bucket

## THE FOLLOWING PARAMETERS HAVE EFFECT ONLY WHEN SOURCE == 'aws':

PATH_TO_STORE_IMPORTED_S3_BUCKET = '/'
# PATH_TO_STORE_IMPORTED_S3_BUCKET: path of the Python environment to which the
# S3 bucket contents will be imported. If it is None, or if 
# PATH_TO_STORE_IMPORTED_S3_BUCKET = '/', bucket will be imported to the root path. 
# Alternatively, input the path as a string (in quotes). e.g. 
# PATH_TO_STORE_IMPORTED_S3_BUCKET = '/copied_s3_bucket'

S3_BUCKET_NAME = 'name_of_aws_s3_bucket_to_be_accessed'
## This parameter is mandatory for accessing an AWS S3 bucket. Replace it with a string
# containing the bucket's name. e.g. S3_BUCKET_NAME = "aws-bucket-1" accesses a bucket
# named "aws-bucket-1"

S3_OBJECT_KEY_PREFFIX_FOLDER = None
# S3_OBJECT_KEY_PREFFIX_FOLDER = None. Keep it None or as an empty string 
# (S3_OBJECT_KEY_PREFFIX_FOLDER = '') to import the whole bucket content, instead of a 
# single object from it.
# Alternatively, set it as a string containing the subfolder from the bucket to import:
# Suppose that your bucket (admin-created) has four objects with the following object 
# keys: Development/Projects.xlsx; Finance/statement1.pdf; Private/taxdocument.pdf; and
# s3-dg.pdf. 
# The s3-dg.pdf key does not have a prefix, so its object appears directly 
# at the root level of the bucket. If you open the Development/ folder, you see 
# the Projects.xlsx object in it.
# In summary, if the path of the file is 'bucket/my_path/.../file.csv',
# where 'bucket' is the bucket's name, then the key prefix is 'my_path/.../', i.e. the
# path without the final 'file.csv' (file name with extension) part.

mount_storage_system (source = SOURCE, path_to_store_imported_s3_bucket = PATH_TO_STORE_IMPORTED_S3_BUCKET, s3_bucket_name = S3_BUCKET_NAME, s3_obj_key_preffix = S3_OBJECT_KEY_PREFFIX_FOLDER)

### **Importing the dataset**

In [3]:
# WARNING: Use this function to load dataframes stored on Excel (xls, xlsx, etc), 
# txt, or CSV (comma separated values) files.

FILE_DIRECTORY_PATH = "/"
# FILE_DIRECTORY_PATH - (string, in quotes): input the path of the directory 
# (e.g. folder path) where the file is stored. e.g. FILE_DIRECTORY_PATH = "/" 
# or FILE_DIRECTORY_PATH = "/folder"

FILE_NAME_WITH_EXTENSION = "dataset.csv"
# FILE_NAME_WITH_EXTENSION - (string, in quotes): input the name of the file with the 
# extension. e.g. FILE_NAME_WITH_EXTENSION = "file.xlsx", or, 
# FILE_NAME_WITH_EXTENSION = "file.csv"
    
HAS_HEADER = True
# HAS_HEADER = True if the imported table has headers (row with column names).
# Alternatively, HAS_HEADER = False if the dataframe does not have header.

TXT_CSV_COL_SEP = "comma"
# TXT_CSV_COL_SEP - This parameter has effect only when the file is a 'txt'
# or 'csv'. It informs how the different columns are separated:
# TXT_CSV_COL_SEP = "comma" for columns separated by commas (",");
# TXT_CSV_COL_SEP = "whitespace" for columns separated by simple spaces (" ").

SHEET_TO_LOAD = None
# SHEET_TO_LOAD - This parameter has effect only for Excel files.
# Keep SHEET_TO_LOAD = None not to specify a sheet of the file, so that the first sheet
# will be loaded.
# SHEET_TO_LOAD may be an integer or a string (inside quotes). SHEET_TO_LOAD = 0
# loads the first sheet (sheet with index 0); SHEET_TO_LOAD = 1 loads the second sheet
# of the file (index 1); SHEET_TO_LOAD = "Sheet1" loads a sheet named as "Sheet1".
# Declare a number to load the sheet with that index, starting from 0; or declare a
# name to load the sheet with that name.

#The dataframe will be stored in the object named 'dataset':
# Simply modify this object on the left of equality:
dataset = load_dataframe (file_directory_path = FILE_DIRECTORY_PATH, file_name_with_extension = FILE_NAME_WITH_EXTENSION, has_header = HAS_HEADER, txt_csv_col_sep = TXT_CSV_COL_SEP, sheet_to_load = SHEET_TO_LOAD)
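
The separator and header options can be pictured with a direct pandas call. The sketch below is only an illustration: it assumes that "comma" corresponds to `sep=","`, that "whitespace" corresponds to a whitespace separator, and that `HAS_HEADER = False` corresponds to `header=None`. The in-memory texts are made up, and `load_dataframe` itself may be implemented differently.

```python
import io
import pandas as pd

# Hypothetical file contents, kept in memory so the example is self-contained.
csv_text = "a,b\n1,2\n3,4\n"
space_text = "1 2\n3 4\n"

# Comma-separated file with a header row (assumed equivalent of
# TXT_CSV_COL_SEP = "comma", HAS_HEADER = True).
df_comma = pd.read_csv(io.StringIO(csv_text), sep=",")

# Whitespace-separated file without a header row (assumed equivalent of
# TXT_CSV_COL_SEP = "whitespace", HAS_HEADER = False).
df_space = pd.read_csv(io.StringIO(space_text), sep=r"\s+", header=None)

print(df_comma)
print(df_space)
```

Without a header row, pandas labels the columns with integers starting from 0.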

### **Filtering (selecting) or renaming columns of the dataframe**

In [None]:
DATASET = dataset #Alternatively: object containing the dataset to be analyzed

MODE = 'filter'
# MODE = 'filter' for filtering only the list of columns passed as cols_list;
# MODE = 'rename' for renaming the columns with the names passed as cols_list.

COLS_LIST = ['column1', 'column2', 'column3']
# COLS_LIST = list of strings containing the names (headers) of the columns to select
# (filter); or to be set as the new columns' names, according to the selected mode.
# For instance: COLS_LIST = ['col1', 'col2', 'col3'] will 
# select columns 'col1', 'col2', and 'col3' (or rename the columns with these names). 
# Declare the names inside quotes.
# Simply substitute the list by the list of columns that you want to select; or the
# list of the new names you want to give to the dataset columns.

#New dataframe saved as new_df. Simply modify this object on the left of equality:
new_df = col_filter_rename (df = DATASET, cols_list = COLS_LIST, mode = MODE)
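
The two modes can be illustrated with plain pandas operations. This is a minimal sketch on a made-up dataframe, not the body of `col_filter_rename`; it assumes that 'rename' expects one new name per existing column, in order.

```python
import pandas as pd

# Hypothetical dataframe used only for illustration.
df = pd.DataFrame({"column1": [1, 2], "column2": [3, 4], "column3": [5, 6]})

# mode = 'filter': keep only the listed columns (copy so df is not modified).
filtered_df = df[["column1", "column3"]].copy()

# mode = 'rename': assign one new name per existing column, in order.
renamed_df = df.copy()
renamed_df.columns = ["col1", "col2", "col3"]

print(filtered_df.columns.tolist())
print(renamed_df.columns.tolist())
```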

### **log-transforming the variables**

In [None]:
#### WARNING: This function will eliminate rows where the selected variables present 
#### values lower than or equal to zero (the logarithm is defined only for positive values).

DATASET = dataset #Alternatively: object containing the dataset to be analyzed

SUBSET = None
# Set SUBSET = None to transform the whole dataset. Alternatively, pass a list with 
# columns names for the transformation to be applied. For instance:
# SUBSET = ['col1', 'col2', 'col3'] will apply the transformation to the columns named
# as 'col1', 'col2', and 'col3'. Declare the names inside quotes.
# Declaring the full list of columns is equivalent to setting SUBSET = None.

CREATE_NEW_COLUMNS = True
# Set CREATE_NEW_COLUMNS = True to store the transformed data into new
# columns, or CREATE_NEW_COLUMNS = False to overwrite the existing columns.
    
NEW_COLUMNS_SUFFIX = "_log"
# This value has effect only if CREATE_NEW_COLUMNS = True.
# The new column name will be set as column + NEW_COLUMNS_SUFFIX. Then, if the original
# column was "column1" and the suffix is "_log", the new column will be named
# "column1_log".
# Alternatively, input inside quotes a string with the desired suffix. Recommendation:
# start the suffix with "_" to separate it from the original name.

#New dataframe saved as log_transf_df.
# Simply modify this object on the left of equality:
log_transf_df = log_transform (df = DATASET, subset = SUBSET, create_new_columns = CREATE_NEW_COLUMNS, new_columns_suffix = NEW_COLUMNS_SUFFIX)

# The log-normal distribution is derived from the normal distribution:
# if the values Y follow a log-normal distribution, their logarithms follow a normal one.
# A log-normal curve resembles a normal curve, but it shows skewness (asymmetry) 
# and excess kurtosis (a long tail).

# Applying the log is a way of bringing skewed, positive variables closer to normality: 
# the range of values shrinks after the transformation, making the data more 
# suitable for Machine Learning algorithms. Preferably apply 
# the transformation to the whole dataset, so that all variables have the same order 
# of magnitude.
# The transformation is usually unnecessary for variables whose values already lie 
# roughly between -100 and 100, which is where most outputs of the log transformation fall.
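
The row filtering and the transformation described above can be sketched with NumPy and pandas. This is a minimal illustration, not the body of `log_transform`: it assumes the natural logarithm is used, and the dataframe, `subset`, and `suffix` values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the second and third rows contain non-positive values.
df = pd.DataFrame({"column1": [1.0, 10.0, 0.0, 100.0],
                   "column2": [2.0, -5.0, 3.0, 4.0]})
subset = ["column1", "column2"]
suffix = "_log"

# Keep only the rows where every selected column is strictly positive,
# since the logarithm is not defined for zero or negative values.
positive_rows = (df[subset] > 0).all(axis=1)
log_df = df[positive_rows].copy()

# Store the natural log of each selected column in a new, suffixed column.
for col in subset:
    log_df[col + suffix] = np.log(log_df[col])

print(log_df)
```

Only the first and the last rows survive the filter, which is why the warning about eliminated rows matters when the dataset contains zeros.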

### **Reversing the log-transform - Exponentially transforming variables**

In [None]:
DATASET = dataset #Alternatively: object containing the dataset to be analyzed

SUBSET = None
# Set SUBSET = None to transform the whole dataset. Alternatively, pass a list with 
# columns names for the transformation to be applied. For instance:
# SUBSET = ['col1', 'col2', 'col3'] will apply the transformation to the columns named
# as 'col1', 'col2', and 'col3'. Declare the names inside quotes.
# Declaring the full list of columns is equivalent to setting SUBSET = None.

CREATE_NEW_COLUMNS = True
# Set CREATE_NEW_COLUMNS = True to store the transformed data into new
# columns, or CREATE_NEW_COLUMNS = False to overwrite the existing columns.
    
NEW_COLUMNS_SUFFIX = "_originalScale"
# This value has effect only if CREATE_NEW_COLUMNS = True.
# The new column name will be set as column + NEW_COLUMNS_SUFFIX. Then, if the original
# column was "column1" and the suffix is "_originalScale", the new column will be named
# "column1_originalScale".
# Alternatively, input inside quotes a string with the desired suffix. Recommendation:
# start the suffix with "_" to separate it from the original name.

#New dataframe saved as rescaled_df.
# Simply modify this object on the left of equality:
rescaled_df = reverse_log_transform(df = DATASET, subset = SUBSET, create_new_columns = CREATE_NEW_COLUMNS, new_columns_suffix = NEW_COLUMNS_SUFFIX)
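
Reversing a natural-log transform amounts to applying the exponential function. The sketch below assumes the data were transformed with the natural logarithm; the column name and values are hypothetical, and it is not the body of `reverse_log_transform`.

```python
import numpy as np
import pandas as pd

# Hypothetical log-transformed column (natural log of 1, 10 and 100).
log_df = pd.DataFrame({"column1_log": np.log([1.0, 10.0, 100.0])})
suffix = "_originalScale"

# exp(log(y)) = y, so the exponential recovers the original scale.
rescaled_df = log_df.copy()
rescaled_df["column1_log" + suffix] = np.exp(rescaled_df["column1_log"])

print(rescaled_df)
```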

### **Obtaining and applying Box-Cox transform**
- Transforms a data series into one that is approximately described by a normal distribution.

#### Case 1: no specification limits provided to Box-Cox transform

In [None]:
# This function will process a single column column_to_transform of the dataframe df 
# per call.

DATASET = dataset #Alternatively: object containing the dataset to be processed

COLUMN_TO_TRANSFORM = 'column_to_transform'
# COLUMN_TO_TRANSFORM must be a string with the name of the column.
# e.g. COLUMN_TO_TRANSFORM = 'column1' to transform a column named as 'column1'

MODE = 'calculate_and_apply'
# MODE = 'calculate_and_apply' to calculate lambda and apply the Box-Cox
# transform; MODE = 'apply_only' to apply the transform for a known lambda.
# For 'apply_only', LAMBDA_BOXCOX must be provided.

LAMBDA_BOXCOX = None
# LAMBDA_BOXCOX must be a float value. e.g. LAMBDA_BOXCOX = 1.7
# If you calculated lambda from the function box_cox_transform and saved the
# transformation data summary dictionary as data_sum_dict, simply set:
## LAMBDA_BOXCOX = data_sum_dict['lambda_boxcox']
# This will access the value on the key 'lambda_boxcox' of the dictionary, which
# contains the lambda. 
# If lambda_boxcox is None, the mode will be automatically set as 'calculate_and_apply'.

SUFFIX = '_BoxCoxTransf'
#suffix: string (inside quotes).
# How the transformed column will be identified in the returned data_transformed_df.
# If the original column is 'Y' and suffix = '_BoxCoxTransf', the transformed column 
# will be identified as 'Y_BoxCoxTransf'.
# Alternatively, input inside quotes a string with the desired suffix. Recommendation:
# start the suffix with "_" to separate it from the original name

SPECIFICATION_LIMS = None
#specification_lims = None if there are no specification limits for the variable being
# transformed by the function.
#In case there were originally specification limits for the variable (column) being
# transformed, declare them as a list, array, or tuple of two numbers (float).
# e.g. if the column represents a variable with specifications from 10 to 20 kg, declare
# specification_lims = [10, 20]. If it represents a variable whose specifications should
# be between 0 and 12.5 L, declare specification_lims = [0, 12.5]
# Then, the function will return the specifications transformed by the same Box-Cox
# transformation applied to the data. Remember: if data were transformed, so should be
# the specification limits.

#New dataframe saved as data_transformed_df; dictionary saved as data_sum_dict.
# Simply modify this object on the left of equality:
data_transformed_df, data_sum_dict = box_cox_transform (df = DATASET, column_to_transform = COLUMN_TO_TRANSFORM, mode = MODE, lambda_boxcox = LAMBDA_BOXCOX, suffix = SUFFIX, specification_lims = SPECIFICATION_LIMS)
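
The two modes can be illustrated with `scipy.stats.boxcox`. This is a sketch under assumptions: the dataframe is synthetic, and it is not confirmed here that `box_cox_transform` relies on SciPy internally. Box-Cox requires strictly positive data.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic, strictly positive and right-skewed data (hypothetical column).
rng = np.random.default_rng(0)
df = pd.DataFrame({"column1": rng.lognormal(mean=0.0, sigma=0.5, size=200)})

# 'calculate_and_apply': with lmbda omitted, boxcox returns the transformed
# values and the lambda that maximizes the log-likelihood function.
transformed, lambda_boxcox = stats.boxcox(df["column1"].to_numpy())
df["column1_BoxCoxTransf"] = transformed

# 'apply_only': with a known lambda, boxcox returns only the transformed values.
reapplied = stats.boxcox(df["column1"].to_numpy(), lmbda=lambda_boxcox)

print(f"Estimated lambda: {lambda_boxcox:.4f}")
```

Saving the estimated lambda is what makes it possible to apply the same transform to new data, or to reverse it later.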

#### Case 2: specification limits provided to Box-Cox transform

In [None]:
# This function will process a single column column_to_transform of the dataframe df 
# per call.

DATASET = dataset #Alternatively: object containing the dataset to be processed

COLUMN_TO_TRANSFORM = 'column_to_transform'
# COLUMN_TO_TRANSFORM must be a string with the name of the column.
# e.g. COLUMN_TO_TRANSFORM = 'column1' to transform a column named as 'column1'

MODE = 'calculate_and_apply'
# MODE = 'calculate_and_apply' to calculate lambda and apply the Box-Cox
# transform; MODE = 'apply_only' to apply the transform for a known lambda.
# For 'apply_only', LAMBDA_BOXCOX must be provided.

LAMBDA_BOXCOX = None
# LAMBDA_BOXCOX must be a float value. e.g. LAMBDA_BOXCOX = 1.7
# If you calculated lambda from the function box_cox_transform and saved the
# transformation data summary dictionary as data_sum_dict, simply set:
## LAMBDA_BOXCOX = data_sum_dict['lambda_boxcox']
# This will access the value on the key 'lambda_boxcox' of the dictionary, which
# contains the lambda. 
# If lambda_boxcox is None, the mode will be automatically set as 'calculate_and_apply'.

SUFFIX = '_BoxCoxTransf'
#suffix: string (inside quotes).
# How the transformed column will be identified in the returned data_transformed_df.
# If the original column is 'Y' and suffix = '_BoxCoxTransf', the transformed column 
# will be identified as 'Y_BoxCoxTransf'.
# Alternatively, input inside quotes a string with the desired suffix. Recommendation:
# start the suffix with "_" to separate it from the original name

SPECIFICATION_LIMS = [None, None]
## First element: inferior specification limit (Float)
## Second element: superior specification limit (Float)

#specification_lims = None if there are no specification limits for the variable being
# transformed by the function.
#In case there were originally specification limits for the variable (column) being
# transformed, declare them as a list, array, or tuple of two numbers (float).
# e.g. if the column represents a variable with specifications from 10 to 20 kg, declare
# specification_lims = [10, 20]. If it represents a variable whose specifications should
# be between 0 and 12.5 L, declare specification_lims = [0, 12.5]
# Then, the function will return the specifications transformed by the same Box-Cox
# transformation applied to the data. Remember: if data were transformed, so should be
# the specification limits.

#New dataframe saved as data_transformed_df; dictionaries saved as data_sum_dict and
# spec_lim_dict.
# Simply modify this object on the left of equality:
data_transformed_df, data_sum_dict, spec_lim_dict = box_cox_transform (df = DATASET, column_to_transform = COLUMN_TO_TRANSFORM, mode = MODE, lambda_boxcox = LAMBDA_BOXCOX, suffix = SUFFIX, specification_lims = SPECIFICATION_LIMS)
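
Transforming the specification limits with the same lambda as the data can be sketched with `scipy.special.boxcox`, which evaluates (x^lambda - 1)/lambda for lambda different from 0 and log(x) for lambda equal to 0. The lambda and the limits below are hypothetical values, and the variable names are illustrative only.

```python
import numpy as np
from scipy import special

# Hypothetical lambda, e.g. the value stored in data_sum_dict['lambda_boxcox'].
lambda_boxcox = 0.5

# Hypothetical specification limits in the original scale: [inferior, superior].
specification_lims = [10.0, 20.0]

# Apply to the limits exactly the same transform that was applied to the data.
transformed_lims = special.boxcox(np.array(specification_lims), lambda_boxcox)

print(transformed_lims)
```

Since the transform is defined for positive values, an inferior limit of 0 maps to -1/lambda when lambda is positive and has no finite image otherwise.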

### **Reversing Box-Cox transform**

In [None]:
# This function will process a single column column_to_transform of the dataframe df 
# per call.

DATASET = dataset #Alternatively: object containing the dataset to be processed

COLUMN_TO_TRANSFORM = 'column_to_transform'
# COLUMN_TO_TRANSFORM must be a string with the name of the column.
# e.g. COLUMN_TO_TRANSFORM = 'column1' to transform a column named as 'column1'

LAMBDA_BOXCOX = None
# LAMBDA_BOXCOX must be a float value. e.g. LAMBDA_BOXCOX = 1.7
# If you calculated lambda from the function box_cox_transform and saved the
# transformation data summary dictionary as data_sum_dict, simply set:
## LAMBDA_BOXCOX = data_sum_dict['lambda_boxcox']
# This will access the value on the key 'lambda_boxcox' of the dictionary, which
# contains the lambda. 
# Replace None with the lambda used in the original transform: it is required to reverse it.

SUFFIX = '_ReversedBoxCox'
#suffix: string (inside quotes).
# How the transformed column will be identified in the returned data_transformed_df.
# If the original column is 'Y' and suffix = '_ReversedBoxCox', the transformed column 
# will be identified as 'Y_ReversedBoxCox'.
# Alternatively, input inside quotes a string with the desired suffix. Recommendation:
# start the suffix with "_" to separate it from the original name

#New dataframe saved as retransformed_df.
# Simply modify this object on the left of equality:
retransformed_df = reverse_box_cox (df = DATASET, column_to_transform = COLUMN_TO_TRANSFORM, lambda_boxcox = LAMBDA_BOXCOX, suffix = SUFFIX)
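
The reverse operation can be sketched with `scipy.special.inv_boxcox`, which inverts the Box-Cox formula for a given lambda. The values below are hypothetical, and this round trip only illustrates the idea; `reverse_box_cox` may be implemented differently.

```python
import numpy as np
from scipy import special

# Hypothetical lambda, which must be the one used in the forward transform.
lambda_boxcox = 0.5
original = np.array([1.0, 4.0, 9.0])

# Forward transform followed by its inverse.
transformed = special.boxcox(original, lambda_boxcox)
recovered = special.inv_boxcox(transformed, lambda_boxcox)

print(transformed)
print(recovered)
```

Using a lambda different from the one applied in the forward step would not recover the original values.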

### **One-Hot Encoding the categorical variables**
- For each category, the One-Hot Encoder creates a new column in the dataset. This new column is a binary variable, which is equal to 0 if the row does not belong to that category, and equal to 1 if it does. For a category "A", a column named "A" is created.
    - If the row is an element from category "A", the value for the column "A" is 1.
    - If not, the value for column "A" is 0.

In [None]:
DATASET = dataset #Alternatively: object containing the dataset to be processed

SUBSET_OF_FEATURES_TO_BE_ENCODED = ['COLUMN1', 'COLUMN2', 'COLUMN3']
# SUBSET_OF_FEATURES_TO_BE_ENCODED: list of strings (inside quotes), 
# containing the names of the columns with the categorical variables that will be 
# encoded. If a single column will be encoded, declare this parameter as a list with
# only one element, e.g. SUBSET_OF_FEATURES_TO_BE_ENCODED = ["column1"] 
# will encode the column named 'column1'; 
# SUBSET_OF_FEATURES_TO_BE_ENCODED = ["col1", 'col2', 'col3'] will encode 3 columns
# with categorical variables: 'col1', 'col2', and 'col3'.

#New dataframe saved as one_hot_encoded_df; dictionary saved as encoding_dict.
# Simply modify this object on the left of equality:
one_hot_encoded_df, encoding_dict = OneHotEncode_df (df = DATASET, subset_of_features_to_be_encoded = SUBSET_OF_FEATURES_TO_BE_ENCODED)
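
The binary columns described above can be reproduced on a toy example with `pandas.get_dummies`. Note that this is a swapped-in technique used only for illustration: `OneHotEncode_df` may rely on a different encoder, and the dataframe below is made up.

```python
import pandas as pd

# Hypothetical dataframe with one categorical column.
df = pd.DataFrame({"COLUMN1": ["A", "B", "A", "C"],
                   "value": [1.0, 2.0, 3.0, 4.0]})

# One binary (0/1) column per category, named after the category itself.
dummies = pd.get_dummies(df["COLUMN1"], dtype=int)

# Append the new columns to the original dataframe.
one_hot_encoded_df = pd.concat([df, dummies], axis=1)

print(one_hot_encoded_df)
```

Each row has exactly one of the new columns equal to 1, the one matching its category.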

### **Scaling the features - Standard scaler, Min-Max scaler, division by factor**

#### Case 1: obtaining a new scaler

In [None]:
DATASET = dataset #Alternatively: object containing the dataset to be processed

SUBSET_OF_FEATURES_TO_SCALE = ['COLUMN1', 'COLUMN2', 'COLUMN3']
# SUBSET_OF_FEATURES_TO_SCALE: list of strings (inside quotes), 
# containing the names of the columns with the numeric variables that will be 
# scaled. If a single column will be scaled, declare this parameter as a list with
# only one element, e.g. SUBSET_OF_FEATURES_TO_SCALE = ["column1"] 
# will scale the column named 'column1'; 
# SUBSET_OF_FEATURES_TO_SCALE = ["col1", 'col2', 'col3'] will scale 3 columns:
# 'col1', 'col2', and 'col3'.

MODE = 'standard'
## Alternatively: MODE = 'standard', MODE = 'min_max', MODE = 'factor'
## This function provides 3 methods (modes) of scaling:
## MODE = 'standard': applies the standard scaling, 
##  which creates a new variable with mean = 0; and standard deviation = 1.
##  Each value Y is transformed as Ytransf = (Y - u)/s, where u is the mean 
##  of the training samples, and s is the standard deviation of the training samples.
    
## MODE = 'min_max': applies min-max normalization, with a resultant feature 
## ranging from 0 to 1. Each value Y is transformed as 
## Ytransf = (Y - Ymin)/(Ymax - Ymin), where Ymin and Ymax are the minimum and 
## maximum values of Y, respectively.
    
## MODE = 'factor': divides the whole series by a numeric value provided as argument. 
## For a factor F, the new Y values will be Ytransf = Y/F.

SCALE_WITH_NEW_PARAMS = True
# Set SCALE_WITH_NEW_PARAMS = True if you want to calculate a new
# scaler for the data; or SCALE_WITH_NEW_PARAMS = False if you want to apply 
# parameters previously obtained to the data (i.e., if you want to apply a scaler
# previously trained to another set of data; or want to simply apply the same
# scaler again).
    
## WARNING: The MODE 'factor' demands the input of the factors that will be 
# used for normalizing each column. Therefore, it can be used only 
# when SCALE_WITH_NEW_PARAMS = False.

SCALING_PARAMS = None
# This variable has effect only when SCALE_WITH_NEW_PARAMS = False
## For the MODE 'factor', declare SCALING_PARAMS as a dictionary containing the 
# column name as the key and the correspondent factor as the value.
# e.g. SUBSET_OF_FEATURES_TO_SCALE = ['col1', 'col2'], 'col1' will be divided by 2.0, 
# and 'col2' will be divided by 3.2,  then:
# SCALING_PARAMS = {'col1': 2.0, 'col2': 3.2}
    
## WARNING: For SCALING_PARAMS when SCALE_WITH_NEW_PARAMS = False and 
# MODE = 'standard' or MODE = 'min_max', the dictionary must be declared with the
# column name as the key, and the whole dictionary of parameters as the correspondent
# value. Then, it will be a dictionary of dictionaries, where there is a dictionary 
# correspondent to each key. Each dictionary should be declared in the same way as the 
# scaling_dictionary printed as output when the scaler is trained.

SUFFIX = '_scaled'
#suffix: string (inside quotes).
# How the transformed column will be identified in the returned data_transformed_df.
# If the original column is 'Y' and suffix = '_scaled', the transformed column 
# will be identified as 'Y_scaled'.
# Alternatively, input inside quotes a string with the desired suffix. Recommendation:
# start the suffix with "_" to separate it from the original name

#New dataframe saved as new_df; dictionary saved as scaling_dict.
# Simply modify this object on the left of equality:
new_df, scaling_dict = feature_scaling (df = DATASET, subset_of_features_to_scale = SUBSET_OF_FEATURES_TO_SCALE, mode = MODE, scale_with_new_params = SCALE_WITH_NEW_PARAMS, scaling_params = SCALING_PARAMS, suffix = SUFFIX)
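
The 'standard' and 'min_max' formulas from the comments can be written directly in pandas. The sketch below uses a made-up column and a hypothetical layout for the parameters dictionary; the population standard deviation (`ddof=0`) is an assumption, and `feature_scaling` may compute or store its parameters differently.

```python
import pandas as pd

# Hypothetical column to be scaled.
df = pd.DataFrame({"col1": [1.0, 2.0, 3.0, 4.0]})

# MODE = 'standard': Ytransf = (Y - u)/s.
mean = df["col1"].mean()
std = df["col1"].std(ddof=0)  # assumption: population standard deviation
df["col1_standard"] = (df["col1"] - mean) / std

# MODE = 'min_max': Ytransf = (Y - Ymin)/(Ymax - Ymin).
y_min, y_max = df["col1"].min(), df["col1"].max()
df["col1_min_max"] = (df["col1"] - y_min) / (y_max - y_min)

# Hypothetical dictionary of dictionaries: one entry of parameters per column.
scaling_params = {"col1": {"mean": mean, "std": std, "min": y_min, "max": y_max}}

print(df)
print(scaling_params)
```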

#### Case 2: using scaling parameters previously obtained

In [None]:
DATASET = dataset #Alternatively: object containing the dataset to be processed

SUBSET_OF_FEATURES_TO_SCALE = ['COLUMN1', 'COLUMN2', 'COLUMN3']
# SUBSET_OF_FEATURES_TO_SCALE: list of strings (inside quotes), 
# containing the names of the columns with the numeric variables that will be 
# scaled. If a single column will be scaled, declare this parameter as a list with
# only one element, e.g. SUBSET_OF_FEATURES_TO_SCALE = ["column1"] 
# will scale the column named 'column1'; 
# SUBSET_OF_FEATURES_TO_SCALE = ["col1", 'col2', 'col3'] will scale 3 columns:
# 'col1', 'col2', and 'col3'.

MODE = 'standard'
## Alternatively: MODE = 'standard', MODE = 'min_max', MODE = 'factor'
## This function provides 3 methods (modes) of scaling:
## MODE = 'standard': applies the standard scaling, 
##  which creates a new variable with mean = 0; and standard deviation = 1.
##  Each value Y is transformed as Ytransf = (Y - u)/s, where u is the mean 
##  of the training samples, and s is the standard deviation of the training samples.
    
## MODE = 'min_max': applies min-max normalization, with a resultant feature 
## ranging from 0 to 1. Each value Y is transformed as 
## Ytransf = (Y - Ymin)/(Ymax - Ymin), where Ymin and Ymax are the minimum and 
## maximum values of Y, respectively.
    
## MODE = 'factor': divides the whole series by a numeric value provided as argument. 
## For a factor F, the new Y values will be Ytransf = Y/F.

SCALE_WITH_NEW_PARAMS = False
# Set SCALE_WITH_NEW_PARAMS = True if you want to calculate a new
# scaler for the data; or SCALE_WITH_NEW_PARAMS = False if you want to apply 
# parameters previously obtained to the data (i.e., if you want to apply a scaler
# previously trained to another set of data; or want to simply apply the same
# scaler again).
    
## WARNING: The MODE 'factor' demands the input of the factors that will be 
# used for normalizing each column. Therefore, it can be used only 
# when SCALE_WITH_NEW_PARAMS = False.

SCALING_PARAMS = None
# This variable has effect only when SCALE_WITH_NEW_PARAMS = False
## For the MODE 'factor', declare SCALING_PARAMS as a dictionary containing the 
# column name as the key and the correspondent factor as the value.
# e.g. SUBSET_OF_FEATURES_TO_SCALE = ['col1', 'col2'], 'col1' will be divided by 2.0, 
# and 'col2' will be divided by 3.2,  then:
# SCALING_PARAMS = {'col1': 2.0, 'col2': 3.2}
    
## WARNING: For SCALING_PARAMS when SCALE_WITH_NEW_PARAMS = False and 
# MODE = 'standard' or MODE = 'min_max', the dictionary must be declared with the
# column name as the key, and the whole dictionary of parameters as the correspondent
# value. Then, it will be a dictionary of dictionaries, where there is a dictionary 
# correspondent to each key. Each dictionary should be declared in the same way as the 
# scaling_dictionary printed as output when the scaler is trained.

SUFFIX = '_scaled'
#suffix: string (inside quotes).
# How the transformed column will be identified in the returned data_transformed_df.
# If the original column is 'Y' and suffix = '_scaled', the transformed column 
# will be identified as 'Y_scaled'.
# Alternatively, input inside quotes a string with the desired suffix. Recommendation:
# start the suffix with "_" to separate it from the original name

#New dataframe saved as new_df.
# Simply modify this object on the left of equality:
new_df = feature_scaling (df = DATASET, subset_of_features_to_scale = SUBSET_OF_FEATURES_TO_SCALE, mode = MODE, scale_with_new_params = SCALE_WITH_NEW_PARAMS, scaling_params = SCALING_PARAMS, suffix = SUFFIX)
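
Reusing parameters means that the new data are scaled with the statistics of the data on which the scaler was obtained, not with their own. The sketch below assumes a hypothetical layout for the stored parameters and made-up values; it also shows the 'factor' mode, where each column is divided by a user-supplied number.

```python
import pandas as pd

# Hypothetical parameters obtained previously from the training data.
train_params = {"col1": {"mean": 2.5, "std": 1.118033988749895}}

# Factors for MODE = 'factor', in the {column: factor} format described above.
factors = {"col1": 2.0, "col2": 3.2}

# New data to be scaled with the previously obtained parameters.
new_data = pd.DataFrame({"col1": [2.5, 5.0], "col2": [3.2, 6.4]})
scaled = new_data.copy()

# Standard scaling with the stored mean and standard deviation.
params = train_params["col1"]
scaled["col1_scaled"] = (new_data["col1"] - params["mean"]) / params["std"]

# Division by factor: Ytransf = Y/F.
for col, factor in factors.items():
    scaled[col + "_factor"] = new_data[col] / factor

print(scaled)
```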

### **Reversing scaling of the features - Standard scaler, Min-Max scaler, division by factor**

In [None]:
DATASET = dataset #Alternatively: object containing the dataset to be processed

SUBSET_OF_FEATURES_TO_SCALE = ['COLUMN1', 'COLUMN2', 'COLUMN3']
# SUBSET_OF_FEATURES_TO_SCALE: list of strings (inside quotes), 
# containing the names of the columns with the scaled variables that will be 
# returned to the original scale. If a single column will be processed, declare this 
# parameter as a list with only one element, e.g. SUBSET_OF_FEATURES_TO_SCALE = ["column1"] 
# will process the column named 'column1'; 
# SUBSET_OF_FEATURES_TO_SCALE = ["col1", 'col2', 'col3'] will process 3 columns:
# 'col1', 'col2', and 'col3'.

MODE = 'standard'
## Alternatively: MODE = 'standard', MODE = 'min_max', MODE = 'factor'
## This function provides 3 methods (modes) of scaling:
## MODE = 'standard': applies the standard scaling, 
##  which creates a new variable with mean = 0; and standard deviation = 1.
##  Each value Y is transformed as Ytransf = (Y - u)/s, where u is the mean 
##  of the training samples, and s is the standard deviation of the training samples.
    
## MODE = 'min_max': applies min-max normalization, with a resultant feature 
## ranging from 0 to 1. Each value Y is transformed as 
## Ytransf = (Y - Ymin)/(Ymax - Ymin), where Ymin and Ymax are the minimum and 
## maximum values of Y, respectively.
    
## MODE = 'factor': divides the whole series by a numeric value provided as argument. 
## For a factor F, the new Y values will be Ytransf = Y/F.

SCALING_PARAMS = None
# Replace None with the dictionary of parameters that was used to scale the data.
## For the MODE 'factor', declare SCALING_PARAMS as a dictionary containing the 
# column name as the key and the correspondent factor as the value.
# e.g. SUBSET_OF_FEATURES_TO_SCALE = ['col1', 'col2'], 'col1' will be divided by 2.0, 
# and 'col2' will be divided by 3.2,  then:
# SCALING_PARAMS = {'col1': 2.0, 'col2': 3.2}
    
## WARNING: For MODE = 'standard' or MODE = 'min_max', 
# the dictionary must be declared with the
# column name as the key, and the whole dictionary of parameters as the correspondent
# value. Then, it will be a dictionary of dictionaries, where there is a dictionary 
# correspondent to each key. Each dictionary should be declared in the same way as the 
# scaling_dictionary printed as output when the scaler is trained.

SUFFIX = '_reverseScaling'
#suffix: string (inside quotes).
# How the transformed column will be identified in the returned data_transformed_df.
# If the original column is 'Y' and suffix = '_reverseScaling', the transformed column 
# will be identified as 'Y_reverseScaling'.
# Alternatively, input inside quotes a string with the desired suffix. Recommendation:
# start the suffix with "_" to separate it from the original name

#New dataframe saved as new_df.
# Simply modify this object on the left of equality:
new_df = reverse_feature_scaling (df = DATASET, subset_of_features_to_scale = SUBSET_OF_FEATURES_TO_SCALE, scaling_params = SCALING_PARAMS, mode = MODE, suffix = SUFFIX)
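
Each scaling mode is reversed by solving its formula for Y: Y = Ytransf*s + u for 'standard', Y = Ytransf*(Ymax - Ymin) + Ymin for 'min_max', and Y = Ytransf*F for 'factor'. The sketch below applies these inverses to made-up scaled columns; the parameter names and values are hypothetical.

```python
import pandas as pd

# Hypothetical parameters used in the original scaling.
params = {"mean": 2.5, "std": 0.5, "min": 1.0, "max": 4.0}
factor = 2.0

# Hypothetical scaled columns, one per scaling mode.
scaled_df = pd.DataFrame({"standard": [-1.0, 0.0, 1.0],
                          "min_max": [0.0, 0.5, 1.0],
                          "factor": [0.5, 1.0, 1.5]})

# Invert each formula to return to the original scale.
restored = pd.DataFrame({
    "standard": scaled_df["standard"] * params["std"] + params["mean"],
    "min_max": scaled_df["min_max"] * (params["max"] - params["min"]) + params["min"],
    "factor": scaled_df["factor"] * factor,
})

print(restored)
```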

### **Importing or exporting models and dictionaries**

#### Case 1: import only a model

In [None]:
ACTION = 'import'
# ACTION = 'import' for importing a model and/or a dictionary;
# ACTION = 'export' for exporting a model and/or a dictionary.

OBJECTS_MANIPULATED = 'model_only'
# OBJECTS_MANIPULATED = 'model_only' if only a model will be manipulated.
# OBJECTS_MANIPULATED = 'dict_only' if only a dictionary will be manipulated.
# OBJECTS_MANIPULATED = 'model_and_dict' if both a model and a dictionary will 
#  be manipulated.

MODEL_FILE_NAME = None
# MODEL_FILE_NAME: string with the name of the file containing the model (for 'import');
# or of the name that the exported file will have (for 'export')
# e.g. MODEL_FILE_NAME = 'model'
# WARNING: Do not add the file extension.
# Keep it in quotes. Keep MODEL_FILE_NAME = None if no model will be manipulated.

DICTIONARY_FILE_NAME = None
# DICTIONARY_FILE_NAME: string with the name of the file containing the dictionary 
# (for 'import');
# or of the name that the exported file will have (for 'export')
# e.g. DICTIONARY_FILE_NAME = 'history_dict'
# WARNING: Do not add the file extension.
# Keep it in quotes. Keep DICTIONARY_FILE_NAME = None if no dictionary will be manipulated.

DIRECTORY_PATH = '/'
# DIRECTORY_PATH: path of the directory where the model will be saved,
# or from which the model will be retrieved. If no value is provided,
# the DIRECTORY_PATH will be the root: "/"
# Notice that the model and the dictionary must be stored in the same path.
# If a model and a dictionary will be exported, they will be stored in the same
# DIRECTORY_PATH.
    
MODEL_TYPE = 'keras'
# This parameter has effect only when a model will be manipulated.
# MODEL_TYPE: 'keras' for deep learning keras/ tensorflow models with extension .h5
# MODEL_TYPE = 'sklearn_xgb' for models from sklearn or xgboost (non-deep learning)
# MODEL_TYPE = 'arima' for ARIMA model (Statsmodels)

DICT_TO_EXPORT = None
MODEL_TO_EXPORT = None 
# These two parameters have effect only when ACTION == 'export'. In this case, they
# must be declared. If ACTION == 'import', keep:
# DICT_TO_EXPORT = None, 
# MODEL_TO_EXPORT = None
# If one of these objects will be exported, replace None with the name of the object
# e.g. if your model is stored in the global memory as 'keras_model' declare:
# MODEL_TO_EXPORT = keras_model. Notice that it must be declared without quotes, since
# it is not a string, but an object.
# For exporting a dictionary named as 'dict':
# DICT_TO_EXPORT = dict

USE_COLAB_MEMORY = False
# USE_COLAB_MEMORY: this parameter has effect only when using Google Colab (otherwise it
# will raise an error). Set USE_COLAB_MEMORY = True if you want to use the temporary memory
# of Google Colaboratory: you will upload or download the file, and it will be available
# only while the kernel is running. It will be deleted when the kernel
# dies, for instance, when you close the notebook.
    
# If ACTION == 'export' and USE_COLAB_MEMORY == True, then the file will be downloaded
# to your computer (running the cell will start the download).

#Model object saved as model.
# Simply modify this object on the left of equality:
model = import_export_model_or_dict (action = ACTION, objects_manipulated = OBJECTS_MANIPULATED, model_file_name = MODEL_FILE_NAME, dictionary_file_name = DICTIONARY_FILE_NAME, directory_path = DIRECTORY_PATH, model_type = MODEL_TYPE, dict_to_export = DICT_TO_EXPORT, model_to_export = MODEL_TO_EXPORT, use_colab_memory = USE_COLAB_MEMORY)    
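
For dictionaries, an export followed by an import can be pictured as a serialization round trip. The sketch below uses Python's `pickle` module and a temporary directory; both are assumptions made for illustration, since the storage format and file extension used by `import_export_model_or_dict` are not shown here. The dictionary content and file name are hypothetical.

```python
import os
import pickle
import tempfile

# Hypothetical dictionary to be exported.
dict_to_export = {"lambda_boxcox": 0.5, "columns": ["col1", "col2"]}

with tempfile.TemporaryDirectory() as directory_path:
    # The extension is appended to the file name by the code, not by the user.
    file_path = os.path.join(directory_path, "history_dict" + ".pkl")

    # 'export': serialize the dictionary to a binary file.
    with open(file_path, "wb") as opened_file:
        pickle.dump(dict_to_export, opened_file)

    # 'import': read the dictionary back from the same path.
    with open(file_path, "rb") as opened_file:
        imported_dict = pickle.load(opened_file)

print(imported_dict)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.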

#### Case 2: import only a dictionary

In [None]:
ACTION = 'import'
# ACTION = 'import' for importing a model and/or a dictionary;
# ACTION = 'export' for exporting a model and/or a dictionary.

OBJECTS_MANIPULATED = 'dict_only'
# OBJECTS_MANIPULATED = 'model_only' if only a model will be manipulated.
# OBJECTS_MANIPULATED = 'dict_only' if only a dictionary will be manipulated.
# OBJECTS_MANIPULATED = 'model_and_dict' if both a model and a dictionary will 
#  be manipulated.

MODEL_FILE_NAME = None
# MODEL_FILE_NAME: string with the name of the file containing the model (for 'import');
# or of the name that the exported file will have (for 'export')
# e.g. MODEL_FILE_NAME = 'model'
# WARNING: Do not add the file extension.
# Keep it in quotes. Keep MODEL_FILE_NAME = None if no model will be manipulated.

DICTIONARY_FILE_NAME = None
# DICTIONARY_FILE_NAME: string with the name of the file containing the dictionary 
# (for 'import');
# or of the name that the exported file will have (for 'export')
# e.g. DICTIONARY_FILE_NAME = 'history_dict'
# WARNING: Do not add the file extension.
# Keep it in quotes. Keep DICTIONARY_FILE_NAME = None if no dictionary will be manipulated.

DIRECTORY_PATH = '/'
# DIRECTORY_PATH: path of the directory where the model will be saved,
# or from which the model will be retrieved. If no value is provided,
# the DIRECTORY_PATH will be the root: "/"
# Notice that the model and the dictionary must be stored in the same path.
# If a model and a dictionary will be exported, they will be stored in the same
# DIRECTORY_PATH.
    
MODEL_TYPE = 'keras'
# This parameter has effect only when a model will be manipulated.
# MODEL_TYPE: 'keras' for deep learning keras/ tensorflow models with extension .h5
# MODEL_TYPE = 'sklearn_xgb' for models from sklearn or xgboost (non-deep learning)
# MODEL_TYPE = 'arima' for ARIMA model (Statsmodels)

DICT_TO_EXPORT = None
MODEL_TO_EXPORT = None 
# These two parameters have effect only when ACTION == 'export'. In this case, they
# must be declared. If ACTION == 'import', keep:
# DICT_TO_EXPORT = None, 
# MODEL_TO_EXPORT = None
# If one of these objects will be exported, replace None with the name of the object
# e.g. if your model is stored in the global memory as 'keras_model' declare:
# MODEL_TO_EXPORT = keras_model. Notice that it must be declared without quotes, since
# it is not a string, but an object.
# For exporting a dictionary named as 'dict':
# DICT_TO_EXPORT = dict

USE_COLAB_MEMORY = False
# USE_COLAB_MEMORY: this parameter only has effect when using Google Colab (otherwise, it
# will raise an error). Set USE_COLAB_MEMORY = True if you want to use the temporary memory
# from Google Colaboratory: you will upload or download the file, and it will be available
# only while the kernel is running. It will be deleted when the kernel
# dies, for instance, when you close the notebook.
    
# If ACTION == 'export' and USE_COLAB_MEMORY == True, then the file will be downloaded
# to your computer (running the cell will start the download).

# Dictionary saved as imported_dict.
# Simply modify this object on the left of equality:
imported_dict = import_export_model_or_dict (action = ACTION, objects_manipulated = OBJECTS_MANIPULATED, model_file_name = MODEL_FILE_NAME, dictionary_file_name = DICTIONARY_FILE_NAME, directory_path = DIRECTORY_PATH, model_type = MODEL_TYPE, dict_to_export = DICT_TO_EXPORT, model_to_export = MODEL_TO_EXPORT, use_colab_memory = USE_COLAB_MEMORY)    

#### Case 3: import a model and a dictionary

In [None]:
ACTION = 'import'
# ACTION = 'import' for importing a model and/or a dictionary;
# ACTION = 'export' for exporting a model and/or a dictionary.

OBJECTS_MANIPULATED = 'model_and_dict'
# OBJECTS_MANIPULATED = 'model_only' if only a model will be manipulated.
# OBJECTS_MANIPULATED = 'dict_only' if only a dictionary will be manipulated.
# OBJECTS_MANIPULATED = 'model_and_dict' if both a model and a dictionary will 
#  be manipulated.

MODEL_FILE_NAME = None
# MODEL_FILE_NAME: string with the name of the file containing the model (for 'import');
# or of the name that the exported file will have (for 'export')
# e.g. MODEL_FILE_NAME = 'model'
# WARNING: Do not add the file extension.
# Keep it in quotes. Keep MODEL_FILE_NAME = None if no model will be manipulated.

DICTIONARY_FILE_NAME = None
# DICTIONARY_FILE_NAME: string with the name of the file containing the dictionary 
# (for 'import');
# or of the name that the exported file will have (for 'export')
# e.g. DICTIONARY_FILE_NAME = 'history_dict'
# WARNING: Do not add the file extension.
# Keep it in quotes. Keep DICTIONARY_FILE_NAME = None if no dictionary will be manipulated.

DIRECTORY_PATH = '/'
# DIRECTORY_PATH: path of the directory where the model will be saved,
# or from which the model will be retrieved. If no value is provided,
# the DIRECTORY_PATH will be the root: "/"
# Notice that the model and the dictionary must be stored in the same path.
# If a model and a dictionary will be exported, they will be stored in the same
# DIRECTORY_PATH.
    
MODEL_TYPE = 'keras'
# This parameter has effect only when a model will be manipulated.
# MODEL_TYPE: 'keras' for deep learning keras/ tensorflow models with extension .h5
# MODEL_TYPE = 'sklearn_xgb' for models from sklearn or xgboost (non-deep learning)
# MODEL_TYPE = 'arima' for ARIMA model (Statsmodels)

DICT_TO_EXPORT = None
MODEL_TO_EXPORT = None 
# These two parameters have effect only when ACTION == 'export'. In this case, they
# must be declared. If ACTION == 'import', keep:
# DICT_TO_EXPORT = None, 
# MODEL_TO_EXPORT = None
# If one of these objects will be exported, replace None with the name of the object
# e.g. if your model is stored in the global memory as 'keras_model' declare:
# MODEL_TO_EXPORT = keras_model. Notice that it must be declared without quotes, since
# it is not a string, but an object.
# For exporting a dictionary named as 'dict':
# DICT_TO_EXPORT = dict

USE_COLAB_MEMORY = False
# USE_COLAB_MEMORY: this parameter only has effect when using Google Colab (otherwise, it
# will raise an error). Set USE_COLAB_MEMORY = True if you want to use the temporary memory
# from Google Colaboratory: you will upload or download the file, and it will be available
# only while the kernel is running. It will be deleted when the kernel
# dies, for instance, when you close the notebook.
    
# If ACTION == 'export' and USE_COLAB_MEMORY == True, then the file will be downloaded
# to your computer (running the cell will start the download).

# Model object saved as model. Dictionary saved as imported_dict.
# Simply modify these objects on the left of equality:
model, imported_dict = import_export_model_or_dict (action = ACTION, objects_manipulated = OBJECTS_MANIPULATED, model_file_name = MODEL_FILE_NAME, dictionary_file_name = DICTIONARY_FILE_NAME, directory_path = DIRECTORY_PATH, model_type = MODEL_TYPE, dict_to_export = DICT_TO_EXPORT, model_to_export = MODEL_TO_EXPORT, use_colab_memory = USE_COLAB_MEMORY)    

#### Case 4: export a model and/or a dictionary

In [None]:
ACTION = 'export'
# ACTION = 'import' for importing a model and/or a dictionary;
# ACTION = 'export' for exporting a model and/or a dictionary.

OBJECTS_MANIPULATED = 'model_only'
# OBJECTS_MANIPULATED = 'model_only' if only a model will be manipulated.
# OBJECTS_MANIPULATED = 'dict_only' if only a dictionary will be manipulated.
# OBJECTS_MANIPULATED = 'model_and_dict' if both a model and a dictionary will 
#  be manipulated.

MODEL_FILE_NAME = None
# MODEL_FILE_NAME: string with the name of the file containing the model (for 'import');
# or of the name that the exported file will have (for 'export')
# e.g. MODEL_FILE_NAME = 'model'
# WARNING: Do not add the file extension.
# Keep it in quotes. Keep MODEL_FILE_NAME = None if no model will be manipulated.

DICTIONARY_FILE_NAME = None
# DICTIONARY_FILE_NAME: string with the name of the file containing the dictionary 
# (for 'import');
# or of the name that the exported file will have (for 'export')
# e.g. DICTIONARY_FILE_NAME = 'history_dict'
# WARNING: Do not add the file extension.
# Keep it in quotes. Keep DICTIONARY_FILE_NAME = None if no dictionary will be manipulated.

DIRECTORY_PATH = '/'
# DIRECTORY_PATH: path of the directory where the model will be saved,
# or from which the model will be retrieved. If no value is provided,
# the DIRECTORY_PATH will be the root: "/"
# Notice that the model and the dictionary must be stored in the same path.
# If a model and a dictionary will be exported, they will be stored in the same
# DIRECTORY_PATH.
    
MODEL_TYPE = 'arima'
# This parameter has effect only when a model will be manipulated.
# MODEL_TYPE: 'keras' for deep learning keras/ tensorflow models with extension .h5
# MODEL_TYPE = 'sklearn_xgb' for models from sklearn or xgboost (non-deep learning)
# MODEL_TYPE = 'arima' for ARIMA model (Statsmodels)

DICT_TO_EXPORT = None
MODEL_TO_EXPORT = None 
# These two parameters have effect only when ACTION == 'export'. In this case, they
# must be declared. If ACTION == 'import', keep:
# DICT_TO_EXPORT = None, 
# MODEL_TO_EXPORT = None
# If one of these objects will be exported, replace None with the name of the object
# e.g. if your model is stored in the global memory as 'keras_model' declare:
# MODEL_TO_EXPORT = keras_model. Notice that it must be declared without quotes, since
# it is not a string, but an object.
# For exporting a dictionary named as 'dict':
# DICT_TO_EXPORT = dict

USE_COLAB_MEMORY = False
# USE_COLAB_MEMORY: this parameter only has effect when using Google Colab (otherwise, it
# will raise an error). Set USE_COLAB_MEMORY = True if you want to use the temporary memory
# from Google Colaboratory: you will upload or download the file, and it will be available
# only while the kernel is running. It will be deleted when the kernel
# dies, for instance, when you close the notebook.
    
# If ACTION == 'export' and USE_COLAB_MEMORY == True, then the file will be downloaded
# to your computer (running the cell will start the download).

import_export_model_or_dict (action = ACTION, objects_manipulated = OBJECTS_MANIPULATED, model_file_name = MODEL_FILE_NAME, dictionary_file_name = DICTIONARY_FILE_NAME, directory_path = DIRECTORY_PATH, model_type = MODEL_TYPE, dict_to_export = DICT_TO_EXPORT, model_to_export = MODEL_TO_EXPORT, use_colab_memory = USE_COLAB_MEMORY)    
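
The export and import of a dictionary can be pictured as a simple round trip to disk. The sketch below is only an illustration under stated assumptions: it assumes the dictionary is serialized with `pickle` and stored with a `.pkl` extension, which may differ from what `import_export_model_or_dict` actually does. The names `history_dict`, `directory_path` and `file_path` are hypothetical.

```python
import os
import pickle
import tempfile

# Hypothetical dictionary standing in for DICT_TO_EXPORT.
history_dict = {"loss": [0.9, 0.5, 0.3], "val_loss": [1.0, 0.7, 0.6]}

# A temporary directory stands in for DIRECTORY_PATH.
directory_path = tempfile.mkdtemp()
dictionary_file_name = "history_dict"  # declared without the file extension

# Assumption: the dictionary is stored as a pickle file with extension '.pkl'.
file_path = os.path.join(directory_path, dictionary_file_name + ".pkl")

# Export: write the dictionary to the chosen directory.
with open(file_path, "wb") as f:
    pickle.dump(history_dict, f)

# Import: read the dictionary back from the same path.
with open(file_path, "rb") as f:
    imported_dict = pickle.load(f)

# The imported object should be equal to the exported one.
print(imported_dict == history_dict)
```

Since the file name is combined with the directory through `os.path.join`, the same code works whether `directory_path` is the root or a nested folder.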

### **Characterizing the dataframe**

In [None]:
DATASET = dataset #Alternatively: object containing the dataset to be analyzed

# New objects saved as df_shape, df_columns_list, df_dtypes, df_general_statistics, df_missing_values.
# Simply modify these objects on the left of equality:
df_shape, df_columns_list, df_dtypes, df_general_statistics, df_missing_values = df_gen_charac (df = DATASET)
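
For reference, the same five pieces of information can be obtained directly with pandas. This is a minimal sketch, not the implementation of `df_gen_charac`: the toy `dataset` below is hypothetical, and the exact types returned by the function may differ from the plain pandas objects shown here.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset with some missing values.
dataset = pd.DataFrame({
    "temperature": [20.5, 21.0, np.nan, 23.1],
    "pressure": [1.0, 1.1, 1.2, np.nan],
    "label": ["a", "b", "b", None],
})

df_shape = dataset.shape                   # (number of rows, number of columns)
df_columns_list = list(dataset.columns)    # column names
df_dtypes = dataset.dtypes                 # variable type of each column
df_general_statistics = dataset.describe() # count, mean, std, quartiles (numeric columns)
df_missing_values = dataset.isna().sum()   # total of missing values per column

print(df_shape)
print(df_missing_values)
```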

### **Obtaining correlation plots**

In [None]:
DATASET = dataset #Alternatively: object containing the dataset to be analyzed

SHOW_MASKED_PLOT = True
# SHOW_MASKED_PLOT = True - keep as True if you want to see a cleaned version of the plot,
# where a mask is applied. Set SHOW_MASKED_PLOT = False to show the plot without
# the mask.

RESPONSES_TO_RETURN_CORR = None
# RESPONSES_TO_RETURN_CORR - keep as None to return the full correlation matrix.
# If you want to display the correlations for a particular group of features, input them
# as a list, even if this list contains a single element. Examples:
# RESPONSES_TO_RETURN_CORR = ['response1'] for a single response
# RESPONSES_TO_RETURN_CORR = ['response1', 'response2', 'response3'] for multiple
# responses. Notice that 'response1',... should be replaced with the name (string, in
# quotes) of a column of the dataset that represents a response variable.
# WARNING: The returned coefficients will be ordered according to the order of the list
# of responses, i.e., they will be ordered first by 'response1'.

SET_RETURNED_LIMIT = None
# SET_RETURNED_LIMIT = None - This variable only has effect if you have
# provided a response feature to be returned. In this case, keep SET_RETURNED_LIMIT = None
# to return all of the correlation coefficients; or, alternatively, 
# provide an integer to limit the total of coefficients returned. 
# e.g. if SET_RETURNED_LIMIT = 10, only the ten highest coefficients will be returned. 

EXPORT_PNG = False
# Set EXPORT_PNG = False if you do not want to export the obtained image;
# Set EXPORT_PNG = True to export the obtained image.

DIRECTORY_TO_SAVE = None
# This parameter has effect only if EXPORT_PNG = True.
# (string, in quotes): input the path of the directory 
# (e.g. folder path) where the file will be stored. e.g. DIRECTORY_TO_SAVE = "/" 
# or DIRECTORY_TO_SAVE = "/folder"
# If EXPORT_PNG = True and DIRECTORY_TO_SAVE = None, the file will be saved in the root
# path, DIRECTORY_TO_SAVE = "/"

FILE_NAME = None
# This parameter has effect only if EXPORT_PNG = True.
# (string, in quotes): input the name you want for the file without the 
# extension, which will be 'png'. e.g. FILE_NAME = "my_image" will save a file 'my_image.png' 
# If EXPORT_PNG = True and FILE_NAME = None, the file will be saved as:
# 'scatter_plot_lin_reg.png'
# WARNING: if there is already a file in the path DIRECTORY_TO_SAVE saved as FILE_NAME,
# the file will be overwritten.

PNG_RESOLUTION_DPI = 110
# This parameter has effect only if EXPORT_PNG = True.
# Alternatively, input an integer that will correspond to the resolution of the exported
# image in dpi. If PNG_RESOLUTION_DPI = None, it will be set as 110.

#New dataframe saved as correlation_matrix. Simply modify this object on the left of equality:
correlation_matrix = correlation_plot (df = DATASET, show_masked_plot = SHOW_MASKED_PLOT, responses_to_return_corr = RESPONSES_TO_RETURN_CORR, set_returned_limit = SET_RETURNED_LIMIT, export_png = EXPORT_PNG, directory_to_save = DIRECTORY_TO_SAVE, file_name = FILE_NAME, png_resolution_dpi = PNG_RESOLUTION_DPI)
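
The roles of the mask, of the list of responses and of the returned limit can be illustrated with plain pandas and NumPy. The sketch below is an assumption about how such a selection could work, not the code of `correlation_plot`: the dataset is synthetic, the mask hides the upper triangle and the diagonal (a common choice for correlation heatmaps), and the coefficients are sorted by their signed value rather than by their absolute value.

```python
import numpy as np
import pandas as pd

# Synthetic dataset: 'response1' depends strongly on 'feature_a' only.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
dataset = pd.DataFrame({
    "feature_a": x,
    "feature_b": rng.normal(size=200),
    "response1": 2.0 * x + rng.normal(scale=0.1, size=200),
})

# Full correlation matrix (Pearson coefficients by default).
correlation_matrix = dataset.corr()

# Mask for a cleaner plot: True marks the upper triangle and the diagonal,
# which repeat the information already shown in the lower triangle.
mask = np.triu(np.ones(correlation_matrix.shape, dtype=bool))
masked_matrix = correlation_matrix.mask(mask)

# Keep only the correlations with the chosen responses, ordered by the first
# response of the list and limited to the highest coefficients.
responses_to_return_corr = ["response1"]
set_returned_limit = 2
selected = (
    correlation_matrix[responses_to_return_corr]
    .sort_values(by=responses_to_return_corr, ascending=False)
    .head(set_returned_limit)
)

print(masked_matrix)
print(selected)
```

The boolean array `mask` can also be passed to the `mask` argument of `seaborn.heatmap` to hide the same cells in the plot.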

### **Obtaining scatter plots and simple linear regressions**

        x1, y1, lab1: blue
        x2, y2, lab2: red
        x3, y3, lab3: green
        x4, y4, lab4: black
        x5, y5, lab5: magenta
        x6, y6, lab6: yellow

In [None]:
DATASET = dataset #Alternatively: object containing the dataset to be analyzed

X1 = DATASET['X1']
#Alternatively: None; or other column in quotes, substituting 'X1'
# e.g. X1 = DATASET['Time'] for an X variable named 'Time', if 'Time' is a float, not
# a datetime64. If 'Time' should be interpreted as a timestamp, then we would declare it as:

# X1 = (DATASET['Time']).astype('datetime64[D]')

# In summary: apply the method .astype('datetime64[D]') if you want the value to be
# interpreted (correctly) as a timestamp.

Y1 = DATASET['Y1'] 
#Alternatively: None; or other column in quotes, substituting 'Y1'
# e.g. Y1 = DATASET['Speed'] for a Y variable named 'Speed'

X2 = None #Alternatively: series for X2 (analogous to X1)
Y2 = None #Alternatively: series for Y2 (analogous to Y1)
X3 = None #Alternatively: series for X3 (analogous to X1)
Y3 = None #Alternatively: series for Y3 (analogous to Y1)
X4 = None #Alternatively: series for X4 (analogous to X1)
Y4 = None #Alternatively: series for Y4 (analogous to Y1)
X5 = None #Alternatively: series for X5 (analogous to X1)
Y5 = None #Alternatively: series for Y5 (analogous to Y1)
X6 = None #Alternatively: series for X6 (analogous to X1)
Y6 = None #Alternatively: series for Y6 (analogous to Y1)
# Warning: if X2, X3, X4, X5, and X6 are timestamps, do not forget to use the method
# .astype('datetime64[D]'). e.g.: X2 = (DATASET['DATE']).astype('datetime64[D]')
# If all X axes are the same, you can also declare: X2 = X1, X3 = X1, X4 = X1, X5 = X1
# and X6 = X1.

X_AXIS_ROTATION = 0
#Rotation of X axis labels. Alternatively, insert any numeric value from 0 to 90 (degrees).
Y_AXIS_ROTATION = 0
#Rotation of Y axis labels. Alternatively, insert any numeric value from 0 to 90 (degrees).

SHOW_LINEAR_REG = True
# Set SHOW_LINEAR_REG = True to plot the linear regression lines and show 
# the linear regression calculated for each pair Y x X (i.e., each equation 
# Y = aX + b, as well as the calculated R² coefficient). 
# Set SHOW_LINEAR_REG = False to omit both the linear regression lines on the plot and
# the equations and R² coefficients obtained.

GRID = True #Alternatively: True or False
# If GRID = False, no grid lines are shown in the graphic.
ADD_SPLINE_LINES = False #Alternatively: True or False
# If ADD_SPLINE_LINES = False, no lines connecting the successive values are shown.
# Since we are obtaining a scatter plot, it makes no sense to omit the dots,
# as we can do in the time series visualization function.

LAB1 = None #Alternatively: string inside quotes containing the label for series 1
LAB2 = None #Alternatively: string inside quotes containing the label for series 2
LAB3 = None #Alternatively: string inside quotes containing the label for series 3
LAB4 = None #Alternatively: string inside quotes containing the label for series 4
LAB5 = None #Alternatively: string inside quotes containing the label for series 5
LAB6 = None #Alternatively: string inside quotes containing the label for series 6
#e.g. LAB1 = "Y1_values"

HORIZONTAL_AXIS_TITLE = None #Alternatively: string inside quotes for horizontal title
VERTICAL_AXIS_TITLE = None #Alternatively: string inside quotes for vertical title
PLOT_TITLE = None #Alternatively: string inside quotes for graphic title
# e.g. HORIZONTAL_AXIS_TITLE = "X", VERTICAL_AXIS_TITLE = "Y", PLOT_TITLE = "YxX"

EXPORT_PNG = False
# Set EXPORT_PNG = False if you do not want to export the obtained image;
# Set EXPORT_PNG = True to export the obtained image.

DIRECTORY_TO_SAVE = None
# This parameter has effect only if EXPORT_PNG = True.
# (string, in quotes): input the path of the directory 
# (e.g. folder path) where the file will be stored. e.g. DIRECTORY_TO_SAVE = "/" 
# or DIRECTORY_TO_SAVE = "/folder"
# If EXPORT_PNG = True and DIRECTORY_TO_SAVE = None, the file will be saved in the root
# path, DIRECTORY_TO_SAVE = "/"

FILE_NAME = None
# This parameter has effect only if EXPORT_PNG = True.
# (string, in quotes): input the name you want for the file without the 
# extension, which will be 'png'. e.g. FILE_NAME = "my_image" will save a file 'my_image.png' 
# If EXPORT_PNG = True and FILE_NAME = None, the file will be saved as:
# 'scatter_plot_lin_reg.png'
# WARNING: if there is already a file in the path DIRECTORY_TO_SAVE saved as FILE_NAME,
# the file will be overwritten.

PNG_RESOLUTION_DPI = 110
# This parameter has effect only if EXPORT_PNG = True.
# Alternatively, input an integer that will correspond to the resolution of the exported
# image in dpi. If PNG_RESOLUTION_DPI = None, it will be set as 110.

#New dataframe saved as lin_reg_summary. Simply modify this object on the left of equality:
lin_reg_summary = scatter_plot_lin_reg (x1 = X1, y1 = Y1, x2 = X2, y2 = Y2, x3 = X3, y3 = Y3, x4 = X4, y4 = Y4, x5 = X5, y5 = Y5, x6 = X6, y6 = Y6, x_axis_rotation = X_AXIS_ROTATION, y_axis_rotation = Y_AXIS_ROTATION, show_linear_reg = SHOW_LINEAR_REG, grid = GRID, add_splines_lines = ADD_SPLINE_LINES, lab1 = LAB1, lab2 = LAB2, lab3 = LAB3, lab4 = LAB4, lab5 = LAB5, lab6 = LAB6, horizontal_axis_title = HORIZONTAL_AXIS_TITLE, vertical_axis_title = VERTICAL_AXIS_TITLE, plot_title = PLOT_TITLE, export_png = EXPORT_PNG, directory_to_save = DIRECTORY_TO_SAVE, file_name = FILE_NAME, png_resolution_dpi = PNG_RESOLUTION_DPI)
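
The quantities reported for each pair Y x X (the coefficients a and b of Y = aX + b, and the R² coefficient) can be reproduced with an ordinary least-squares fit. The sketch below uses `numpy.polyfit` on hypothetical data; it is one standard way of obtaining these values and is not necessarily how `scatter_plot_lin_reg` computes them.

```python
import numpy as np

# Hypothetical X and Y series.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])

# Least-squares fit of a first-degree polynomial: y = a*x + b.
a, b = np.polyfit(x, y, deg=1)

# R² = 1 - (residual sum of squares) / (total sum of squares).
y_pred = a * x + b
ss_res = np.sum((y - y_pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

print(f"Y = {a:.2f}*X + {b:.2f}, R² = {r_squared:.4f}")
```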

### **Visualizing time series**

        x1, y1, lab1: blue
        x2, y2, lab2: red
        x3, y3, lab3: green
        x4, y4, lab4: black
        x5, y5, lab5: magenta
        x6, y6, lab6: yellow

In [None]:
DATASET = dataset #Alternatively: object containing the dataset to be analyzed

#X1 = dataset.index to use the index as the axis itself
X1 = (DATASET['DATE']).astype('datetime64[D]') 
#Alternatively: None; or other column in quotes, substituting 'DATE'
# WARNING: Modify only the object in the first parenthesis: DATASET['DATE']
# Do not modify the method .astype('datetime64[D]')
#Remove .astype('datetime64[D]') if it is not a datetime.
# e.g. X1 = DATASET['Time'] for an X variable named 'Time', if 'Time' is a float, not
# a datetime64. If 'Time' should be interpreted as a timestamp, then we would declare it as:

# X1 = (DATASET['Time']).astype('datetime64[D]')

# In summary: apply the method .astype('datetime64[D]') if you want the value to be
# interpreted (correctly) as a timestamp.

# Notice that this data-transforming step guarantees that 'DATE' is interpreted as a
# timestamp, not as an object or string. The astype method sets the variable type to
# 'datetime64[D]' (day resolution). If we wanted the timestamps to be resolved in
# nanoseconds, we should use 'datetime64[ns]'; for seconds, 'datetime64[s]'.
Y1 = DATASET['Y1'] 
#Alternatively: None; or other column in quotes, substituting 'Y1'
# e.g. Y1 = DATASET['Speed'] for a Y variable named 'Speed'

X2 = None #Alternatively: series for X2 (analogous to X1)
Y2 = None #Alternatively: series for Y2 (analogous to Y1)
X3 = None #Alternatively: series for X3 (analogous to X1)
Y3 = None #Alternatively: series for Y3 (analogous to Y1)
X4 = None #Alternatively: series for X4 (analogous to X1)
Y4 = None #Alternatively: series for Y4 (analogous to Y1)
X5 = None #Alternatively: series for X5 (analogous to X1)
Y5 = None #Alternatively: series for Y5 (analogous to Y1)
X6 = None #Alternatively: series for X6 (analogous to X1)
Y6 = None #Alternatively: series for Y6 (analogous to Y1)
# Warning: if X2, X3, X4, X5, and X6 are timestamps, do not forget to use the method
# .astype('datetime64[D]'). e.g.: X2 = (DATASET['DATE']).astype('datetime64[D]')
# If all X axes are the same, you can also declare: X2 = X1, X3 = X1, X4 = X1, X5 = X1
# and X6 = X1.

X_AXIS_ROTATION = 70
#Rotation of X axis labels. Alternatively, insert any numeric value from 0 to 90 (degrees).
Y_AXIS_ROTATION = 0
#Rotation of Y axis labels. Alternatively, insert any numeric value from 0 to 90 (degrees).

GRID = True #Alternatively: True or False
# If GRID = False, no grid lines are shown in the graphic.
ADD_SPLINE_LINES = True #Alternatively: True or False
# If ADD_SPLINE_LINES = False, no lines connecting the successive values are shown.
ADD_SCATTER_DOTS = False #Alternatively: True or False
# If ADD_SCATTER_DOTS = False, the dots (scatter plot) are omitted, so only the lines
# corresponding to the series are shown.

# Notice that adding the dots and omitting the spline lines is equivalent to obtaining a
# scatter plot. If you want to do so, consider using the scatter_plot_lin_reg function, 
# which is capable of calculating the linear regressions.

LAB1 = None #Alternatively: string inside quotes containing the label for series 1
LAB2 = None #Alternatively: string inside quotes containing the label for series 2
LAB3 = None #Alternatively: string inside quotes containing the label for series 3
LAB4 = None #Alternatively: string inside quotes containing the label for series 4
LAB5 = None #Alternatively: string inside quotes containing the label for series 5
LAB6 = None #Alternatively: string inside quotes containing the label for series 6
#e.g. LAB1 = "Y1_values"

HORIZONTAL_AXIS_TITLE = None #Alternatively: string inside quotes for horizontal title
VERTICAL_AXIS_TITLE = None #Alternatively: string inside quotes for vertical title
PLOT_TITLE = None #Alternatively: string inside quotes for graphic title
# e.g. HORIZONTAL_AXIS_TITLE = "X", VERTICAL_AXIS_TITLE = "Y", PLOT_TITLE = "YxX"

EXPORT_PNG = False
# Set EXPORT_PNG = False if you do not want to export the obtained image;
# Set EXPORT_PNG = True to export the obtained image.

DIRECTORY_TO_SAVE = None
# This parameter has effect only if EXPORT_PNG = True.
# (string, in quotes): input the path of the directory 
# (e.g. folder path) where the file will be stored. e.g. DIRECTORY_TO_SAVE = "/" 
# or DIRECTORY_TO_SAVE = "/folder"
# If EXPORT_PNG = True and DIRECTORY_TO_SAVE = None, the file will be saved in the root
# path, DIRECTORY_TO_SAVE = "/"

FILE_NAME = None
# This parameter has effect only if EXPORT_PNG = True.
# (string, in quotes): input the name you want for the file without the 
# extension, which will be 'png'. e.g. FILE_NAME = "my_image" will save a file 'my_image.png' 
# If EXPORT_PNG = True and FILE_NAME = None, the file will be saved as:
# 'time_series_vis.png'
# WARNING: if there is already a file in the path DIRECTORY_TO_SAVE saved as FILE_NAME,
# the file will be overwritten.

PNG_RESOLUTION_DPI = 110
# This parameter has effect only if EXPORT_PNG = True.
# Alternatively, input an integer that will correspond to the resolution of the exported
# image in dpi. If PNG_RESOLUTION_DPI = None, it will be set as 110.

time_series_vis (x1 = X1, y1 = Y1, x2 = X2, y2 = Y2, x3 = X3, y3 = Y3, x4 = X4, y4 = Y4, x5 = X5, y5 = Y5, x6 = X6, y6 = Y6, x_axis_rotation = X_AXIS_ROTATION, y_axis_rotation = Y_AXIS_ROTATION, grid = GRID, add_splines_lines = ADD_SPLINE_LINES, add_scatter_dots = ADD_SCATTER_DOTS, lab1 = LAB1, lab2 = LAB2, lab3 = LAB3, lab4 = LAB4, lab5 = LAB5, lab6 = LAB6, horizontal_axis_title = HORIZONTAL_AXIS_TITLE, vertical_axis_title = VERTICAL_AXIS_TITLE, plot_title = PLOT_TITLE, export_png = EXPORT_PNG, directory_to_save = DIRECTORY_TO_SAVE, file_name = FILE_NAME, png_resolution_dpi = PNG_RESOLUTION_DPI)
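
The conversion of the X axis into timestamps can also be done with `pandas.to_datetime`, which parses strings and leaves the resolution to pandas. This is a minimal sketch on a hypothetical dataset, offered as an alternative to the `.astype('datetime64[D]')` cast used above; recent pandas releases may reject the day-resolution cast, so this form can be a useful fallback.

```python
import pandas as pd

# Hypothetical dataset where 'DATE' was read as text and is out of order.
dataset = pd.DataFrame({
    "DATE": ["2021-01-03", "2021-01-01", "2021-01-02"],
    "Y1": [3.0, 1.0, 2.0],
})

# Parse the strings into timestamps.
x1 = pd.to_datetime(dataset["DATE"])

# Check that the series is now interpreted as a datetime, not as object or string.
is_datetime = pd.api.types.is_datetime64_any_dtype(x1)

# With real timestamps, sorting follows the chronological order.
ordered = dataset.assign(DATE=x1).sort_values("DATE")

print(is_datetime)
print(ordered)
```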

### **Visualizing histograms**

#### Case 1: automatically calculate the ideal histogram bin size
- The ideal bin interval is calculated through Montgomery's method. Histogram is obtained from this calculated bin size.
    - Douglas C. Montgomery (2009). Introduction to Statistical Process Control, Sixth Edition, John Wiley & Sons.

In [None]:
# REMEMBER: A histogram is the representation of a statistical distribution 
# of a given variable.

DATASET = dataset #Alternatively: object containing the dataset to be analyzed

ANALYZED_VARIABLE = DATASET['analyzed_variable']
#Alternatively: other column in quotes, substituting 'analyzed_variable'
# e.g., if the analyzed variable is in a column named 'column1':
# ANALYZED_VARIABLE = DATASET['column1']

SET_GRAPHIC_BAR_WIDTH = 2.0
# This parameter must be visually adjusted for each particular analyzed variable.
# Manually set this parameter until you see only a minimal separation between successive
# bars (i.e., you know that the bars are not overlapping, but they are not so distant that
# the statistic distribution profile is not clear).
# You can input any numeric value, and the order of magnitude will vary depending on the
# dispersion and on the width of the sample space.
# e.g. SET_GRAPHIC_BAR_WIDTH = 3; SET_GRAPHIC_BAR_WIDTH = 0.003

X_AXIS_ROTATION = 70 
#Rotation of X axis labels. Alternatively, insert any numeric value from 0 to 90 (degrees).

Y_AXIS_ROTATION = 0
#Rotation of Y axis labels. Alternatively, insert any numeric value from 0 to 90 (degrees).

GRID = True #Alternatively: True or False
# If GRID = False, no grid lines are shown in the graphic.

NORMAL_CURVE_OVERLAY = True
#Alternatively: set NORMAL_CURVE_OVERLAY = True to show a normal curve overlaying the
# histogram; or set NORMAL_CURVE_OVERLAY = False to omit the normal curve (show only
# the histogram).

DATA_UNITS_LABEL = None
# Input a string inside quotes to set a label for the X-axis, which will represent
# the type of data that the histogram is evaluating, i.e., what the statistical distribution
# shown by the histogram means.
# e.g. if DATA_UNITS_LABEL = "particle_diameter_in_nm", the X axis will be labelled with
# this string. Then, we know that the diagram represents the distribution (counts of
# data for each defined bin) of particle diameters.

Y_TITLE = None
#Alternatively: string inside quotes for vertical title. e.g. Y_TITLE = "Analyzed_values".

HISTOGRAM_TITLE = None
#Alternatively: string inside quotes for graphic title. e.g. 
# HISTOGRAM_TITLE = "Analyzed_values_histogram".

EXPORT_PNG = False
# Set EXPORT_PNG = False if you do not want to export the obtained image;
# Set EXPORT_PNG = True to export the obtained image.

DIRECTORY_TO_SAVE = None
# This parameter has effect only if EXPORT_PNG = True.
# (string, in quotes): input the path of the directory 
# (e.g. folder path) where the file will be stored. e.g. DIRECTORY_TO_SAVE = "/" 
# or DIRECTORY_TO_SAVE = "/folder"
# If EXPORT_PNG = True and DIRECTORY_TO_SAVE = None, the file will be saved in the root
# path, DIRECTORY_TO_SAVE = "/"

FILE_NAME = None
# This parameter has effect only if EXPORT_PNG = True.
# (string, in quotes): input the name you want for the file without the 
# extension, which will be 'png'. e.g. FILE_NAME = "my_image" will save a file 'my_image.png' 
# If EXPORT_PNG = True and FILE_NAME = None, the file will be saved as:
# 'histogram.png'
# WARNING: if there is already a file in the path DIRECTORY_TO_SAVE saved as FILE_NAME,
# the file will be overwritten.

PNG_RESOLUTION_DPI = 110
# This parameter has effect only if EXPORT_PNG = True.
# Alternatively, input an integer that will correspond to the resolution of the exported
# image in dpi. If PNG_RESOLUTION_DPI = None, it will be set as 110.

#New dataframes saved as general_statistics and frequency_tab.
# Simply modify these objects on the left of equality:
general_statistics, frequency_tab = histogram (y = ANALYZED_VARIABLE, bar_width = SET_GRAPHIC_BAR_WIDTH, x_axis_rotation = X_AXIS_ROTATION, y_axis_rotation = Y_AXIS_ROTATION, grid = GRID, normal_curve_overlay = NORMAL_CURVE_OVERLAY, data_units_label = DATA_UNITS_LABEL, y_title = Y_TITLE, histogram_title = HISTOGRAM_TITLE, export_png = EXPORT_PNG, directory_to_save = DIRECTORY_TO_SAVE, file_name = FILE_NAME, png_resolution_dpi = PNG_RESOLUTION_DPI)

# Suppose we registered the values corresponding to a given feature / property / attribute
# in column Y, and we want to know the statistical distribution of Y. The maximum value
# observed for Y is named Ymax, whereas the minimum value observed is Ymin.
# Therefore, our sample space ranges from Ymin to Ymax. Now, we divide this sample space into
# equal-width intervals, named bins. The width of each bin is the bin_size. The 1st bin
# corresponds to the interval Ymin to (Ymin + bin_size), where we can call (Ymin + bin_size)
# = Y1. Then, the 2nd interval ranges from Y1 to (Y1 + bin_size), and so on, until we get to
# the last interval, which contains Ymax.
# Now, we count how many values of Y belong to each bin. The plot of the count of values in
# each bin x the bin interval (or the mid-value of the bin) is the
# histogram. For the first bin, this mid-value is Ymin + bin_size/2, since this value is
# exactly in the middle of the interval Ymin to (Ymin + bin_size).

# In other words, we can imagine that each Y value was printed on the surface of a ball, and
# each bin is a bucket labelled Ymin to (Ymin + bin_size), (Ymin + bin_size) to 
# (Ymin + 2*bin_size), and so on, until we cover the Ymax value. We put every ball inside
# the bucket whose label interval contains the ball value. Finally, we count
# the balls per bucket, and plot the count of balls x (the mid-value of the interval 
# labelling the corresponding bucket). This plot is the histogram, and it 
# represents the statistical distribution.
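
The binning procedure described in the comments above can be sketched with `numpy.histogram`. The data and the variable `number_of_bins` are hypothetical, and the number of bins is fixed by hand here instead of being calculated, so this only illustrates the counting step, not the bin-size calculation used by the `histogram` function.

```python
import numpy as np

# Hypothetical values of Y: Ymin = 1.0 and Ymax = 9.0.
y = np.array([1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 9.0])
number_of_bins = 4

# Divide the sample space [Ymin, Ymax] into equal-width bins and count the
# values that fall inside each one.
counts, bin_edges = np.histogram(y, bins=number_of_bins)

# Width of each bin and the mid-value that labels it on the X axis.
bin_size = bin_edges[1] - bin_edges[0]
bin_mid_values = bin_edges[:-1] + bin_size / 2

for mid, count in zip(bin_mid_values, counts):
    print(f"bin centered at {mid:.1f}: {count} values")
```

Every bin is closed on the left and open on the right, except the last one, which also includes Ymax. This is why each value is counted exactly once.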

#### Case 2: set number of bins
- Use this one if the distance between data points is too small, or if the histogram function did not return a valid histogram.
- Here, the histogram is obtained by manually defining the total of bins (i.e., into how many intervals the sample space should be divided).

In [None]:
# REMEMBER: A histogram is the representation of a statistical distribution 
# of a given variable.

DATASET = dataset #Alternatively: object containing the dataset to be analyzed

ANALYZED_VARIABLE = DATASET['analyzed_variable']
#Alternatively: other column in quotes, substituting 'analyzed_variable'
# e.g., if the analyzed variable is in a column named 'column1':
# ANALYZED_VARIABLE = DATASET['column1']

TOTAL_OF_BINS = 50
# This parameter must be an integer: it represents the total of bins of the 
# histogram, i.e., the number of intervals into which the sample space will be divided
# (check the comments after the histogram_alternative
# function call).
# Manually adjust this parameter to obtain more or less resolution of the statistical
# distribution: fewer bins tend to result in higher counts of values per bin, since
# a larger interval of values is grouped. After modifying the total of bins, do not forget
# to adjust the bar width in SET_GRAPHIC_BAR_WIDTH.
# Examples: TOTAL_OF_BINS = 50, to divide the sample space into 50 equally-separated 
# intervals; TOTAL_OF_BINS = 10 to divide it into 10 intervals; TOTAL_OF_BINS = 100 to
# divide it into 100 intervals.

SET_GRAPHIC_BAR_WIDTH = 2.0
# This parameter must be visually adjusted for each particular analyzed variable.
# Manually set this parameter until you see only a minimal separation between successive
# bars (i.e., you know that the bars are not overlapping, but they are not so distant that
# the statistical distribution profile is not clear).
# You can input any numeric value, and the order of magnitude will vary depending on the
# dispersion and on the width of the sample space.
# e.g. SET_GRAPHIC_BAR_WIDTH = 3; SET_GRAPHIC_BAR_WIDTH = 0.003

X_AXIS_ROTATION = 70 
#Rotation of X axis labels. Alternatively, insert any numeric value from 0 to 90 (degrees).

Y_AXIS_ROTATION = 0
#Rotation of Y axis labels. Alternatively, insert any numeric value from 0 to 90 (degrees).

GRID = True #Alternatively: True or False
# If GRID = False, no grid lines are shown in the graphic.

DATA_UNITS_LABEL = None
# Input a string inside quotes for setting a label for the X-axis, which will represent
# the type of data that the histogram is evaluating, i.e., what the statistical distribution
# shown by the histogram means.
# e.g. if DATA_UNITS_LABEL = "particle_diameter_in_nm", the X axis will be labelled with
# this string. Then, we can know that the diagram represents the distribution (counts of
# data for each defined bin) of particle diameters.

Y_TITLE = None
#Alternatively: string inside quotes for vertical title. e.g. Y_TITLE = "Analyzed_values".

HISTOGRAM_TITLE = None
#Alternatively: string inside quotes for graphic title. e.g. 
# HISTOGRAM_TITLE = "Analyzed_values_histogram".

EXPORT_PNG = False
# Set EXPORT_PNG = False if you do not want to export the obtained image;
# Set EXPORT_PNG = True to export the obtained image.

DIRECTORY_TO_SAVE = None
# This parameter has effect only if EXPORT_PNG = True.
# (string, in quotes): input the path of the directory 
# (e.g. folder path) where the file will be stored. e.g. DIRECTORY_TO_SAVE = "/" 
# or DIRECTORY_TO_SAVE = "/folder"
# If EXPORT_PNG = True and DIRECTORY_TO_SAVE = None, the file will be saved in the root
# path, DIRECTORY_TO_SAVE = "/"

FILE_NAME = None
# This parameter has effect only if EXPORT_PNG = True.
# (string, in quotes): input the name you want for the file without the 
# extension, which will be 'png'. e.g. FILE_NAME = "my_image" will save a file 'my_image.png' 
# If EXPORT_PNG = True and FILE_NAME = None, the file will be saved as:
# 'histogram_alternative.png'
# WARNING: if there is already a file in the path DIRECTORY_TO_SAVE saved as FILE_NAME,
# the file will be overwritten.

PNG_RESOLUTION_DPI = 110
# This parameter has effect only if EXPORT_PNG = True.
# Alternatively, input an integer that will correspond to the resolution of the exported
# image in dpi. If PNG_RESOLUTION_DPI = None, it will be set as 110.

#New dataframes saved as general_statistics and frequency_tab.
# Simply modify this object on the left of equality:
general_statistics, frequency_tab = histogram_alternative (y = ANALYZED_VARIABLE, total_of_bins = TOTAL_OF_BINS, bar_width = SET_GRAPHIC_BAR_WIDTH, x_axis_rotation = X_AXIS_ROTATION, y_axis_rotation = Y_AXIS_ROTATION, grid = GRID, data_units_label = DATA_UNITS_LABEL, y_title = Y_TITLE, histogram_title = HISTOGRAM_TITLE, export_png = EXPORT_PNG, directory_to_save = DIRECTORY_TO_SAVE, file_name = FILE_NAME, png_resolution_dpi = PNG_RESOLUTION_DPI)

# Suppose we registered the values corresponding to a given feature / property / attribute
# in column Y, and we want to know the statistical distribution of Y. The maximum value observed
# for Y is named Ymax, whereas the minimum value observed is Ymin.
# Therefore, our sample space ranges from Ymin to Ymax. Now, we divide this sample space into
# equally-sized intervals, named bins. The width of each bin is the bin_size. The 1st bin
# corresponds to the interval Ymin to (Ymin + bin_size), where we can call (Ymin + bin_size)
# = Y1. Then, the 2nd interval ranges from Y1 to (Y1 + bin_size),..., until we get to the
# last interval, where we find Ymax.
# Now, we count how many values of Y belong to each bin. The graphic of the count of values in
# each bin x the bin interval (or the value corresponding to the middle of the bin) is the
# histogram. For the first bin, this mid-value would be Ymin + bin_size/2, since this value is
# exactly in the middle of the interval Ymin to (Ymin + bin_size).

# In other words, we can imagine that each Y value was printed on the surface of a ball, and
# each bin is a bucket labelled Ymin - (Ymin + bin_size), (Ymin + bin_size) -
# (Ymin + 2*bin_size), and so on, until we cover the Ymax value. We put each ball inside the
# bucket whose labelling interval contains the ball's value. Finally, we count
# the balls per bucket, and plot the count of balls x (the middle value of the interval
# labelling the corresponding bucket). This graphic is the histogram and
# represents the statistical distribution.

### **Testing data normality and visualizing probability plot**
- Check whether the data is plausibly described by a normal distribution: the tests return p-values that are compared against the significance level ALPHA.
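
As a rough illustration of what such a check involves, the sketch below applies two common normality tests from SciPy and computes the points of a normal probability plot. It rests on assumptions: the sample `y` is synthetic, and the choice of Shapiro-Wilk and D'Agostino's K² tests is made here for illustration only, since the internals of `test_data_normality` are not shown in this section.

```python
import numpy as np
from scipy import stats

# Hypothetical sample: 200 synthetic values drawn from a normal distribution.
rng = np.random.default_rng(42)
y = rng.normal(loc=0.0, scale=1.0, size=200)

alpha = 0.10  # significance level; confidence level = 1 - alpha

# Null hypothesis of both tests: the sample comes from a normal distribution.
shapiro_stat, shapiro_p = stats.shapiro(y)
dagostino_stat, dagostino_p = stats.normaltest(y)

for name, p_value in [("Shapiro-Wilk", shapiro_p), ("D'Agostino K^2", dagostino_p)]:
    verdict = "reject normality" if p_value < alpha else "cannot reject normality"
    print(f"{name}: p-value = {p_value:.4f} -> {verdict}")

# Probability plot data: theoretical normal quantiles x ordered sample values,
# plus the least-squares line fitted through them (r close to 1 suggests normality).
(theoretical_q, ordered_values), (slope, intercept, r) = stats.probplot(y, dist="norm")
print(f"Probability plot correlation coefficient: {r:.4f}")
```

A p-value below ALPHA leads to rejecting the hypothesis of normality; a p-value above it only means that the test found no evidence against normality.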

In [None]:
# WARNING: The statistical tests require at least 20 samples

DATASET = dataset #Alternatively: object containing the dataset to be analyzed

Y = DATASET['Y'] 
#Alternatively: other column in quotes, substituting 'Y'
# e.g. Y = DATASET['Speed'] for a Y variable named 'Speed'

ALPHA = 0.10
# Confidence level = 1 - ALPHA. For ALPHA = 0.10, we get a 0.90 = 90% confidence
# Set ALPHA = 0.05 to get 0.95 = 95% confidence in the analysis.
# Notice that, when less trust is needed, we can increase ALPHA to get less restrictive
# results.

SHOW_PROBABILITY_PLOT = True
#Alternatively: set SHOW_PROBABILITY_PLOT = True to obtain the probability plot for the
# variable Y (normal distribution tested). 
# Set SHOW_PROBABILITY_PLOT = False to omit the probability plot.
X_AXIS_ROTATION = 0
#Rotation of X axis labels. Alternatively, insert any numeric value from 0 to 90 (degrees).
Y_AXIS_ROTATION = 0
#Rotation of Y axis labels. Alternatively, insert any numeric value from 0 to 90 (degrees).
GRID = True #Alternatively: True or False
# If GRID = False, no grid lines are shown in the graphic.

EXPORT_PNG = False
# Set EXPORT_PNG = False if you do not want to export the obtained image;
# Set EXPORT_PNG = True to export the obtained image.

DIRECTORY_TO_SAVE = None
# This parameter has effect only if EXPORT_PNG = True.
# (string, in quotes): input the path of the directory 
# (e.g. folder path) where the file will be stored. e.g. DIRECTORY_TO_SAVE = "/" 
# or DIRECTORY_TO_SAVE = "/folder"
# If EXPORT_PNG = True and DIRECTORY_TO_SAVE = None, the file will be saved in the root
# path, DIRECTORY_TO_SAVE = "/"

FILE_NAME = None
# This parameter has effect only if EXPORT_PNG = True.
# (string, in quotes): input the name you want for the file without the 
# extension, which will be 'png'. e.g. FILE_NAME = "my_image" will save a file 'my_image.png' 
# If EXPORT_PNG = True and FILE_NAME = None, the file will be saved as:
# 'scatter_plot_lin_reg.png'
# WARNING: if there is already a file in the path DIRECTORY_TO_SAVE saved as FILE_NAME,
# the file will be overwritten.

PNG_RESOLUTION_DPI = 110
# This parameter has effect only if EXPORT_PNG = True.
# Alternatively, input an integer that will correspond to the resolution of the exported
# image in dpi. If PNG_RESOLUTION_DPI = None, it will be set as 110.

#New dataframe saved as data_normality_res
# Skewness, kurtosis, and general statistics dictionary returned as general_statistics_dict
# Simply modify these objects on the left of equality:
data_normality_res, general_statistics_dict = test_data_normality (y = Y, alpha = ALPHA, show_probability_plot = SHOW_PROBABILITY_PLOT, x_axis_rotation = X_AXIS_ROTATION, y_axis_rotation = Y_AXIS_ROTATION, grid = GRID, export_png = EXPORT_PNG, directory_to_save = DIRECTORY_TO_SAVE, file_name = FILE_NAME, png_resolution_dpi = PNG_RESOLUTION_DPI)

## **Exporting the dataframe as CSV file**
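
For reference, a local CSV export reduces to joining the directory and the file name and calling pandas' `to_csv`. The sketch below is a minimal stand-in, not the `export_dataframe` function itself: the dataframe contents are hypothetical and a temporary directory replaces FILE_DIRECTORY_PATH so that the example runs anywhere.

```python
import os
import tempfile

import pandas as pd

# Hypothetical dataframe standing in for the dataset to be exported.
df = pd.DataFrame({"id": [1, 2, 3], "value": [0.5, 1.5, 2.5]})

# A temporary directory is used here in place of FILE_DIRECTORY_PATH.
file_directory_path = tempfile.mkdtemp()
new_file_name_with_csv_extension = "dataset.csv"

# Build the full path and export without writing the index as an extra column.
file_path = os.path.join(file_directory_path, new_file_name_with_csv_extension)
df.to_csv(file_path, index=False)

# Read the file back to confirm that the export worked.
reloaded = pd.read_csv(file_path)
print(reloaded.head())
```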

In [None]:
## WARNING: all file extensions should be .csv for this function

DATAFRAME_TO_BE_EXPORTED = dataset
#Alternatively: object containing the dataset to be exported.

FILE_DIRECTORY_PATH = "/"
# FILE_DIRECTORY_PATH - (string, in quotes): input the path of the directory 
# (e.g. folder path) where the file will be stored. e.g. FILE_DIRECTORY_PATH = "/" 
# or FILE_DIRECTORY_PATH = "/folder"
# If you want to export the file to AWS S3, this parameter will have no effect.
# In this case, you can set FILE_DIRECTORY_PATH = None

NEW_FILE_NAME_WITH_CSV_EXTENSION = "dataset.csv"
# NEW_FILE_NAME_WITH_CSV_EXTENSION - (string, in quotes): input the name of the file with the 
# extension. e.g. NEW_FILE_NAME_WITH_CSV_EXTENSION = "file.csv"

EXPORT_TO_S3_BUCKET = False
# export_to_s3_bucket = False. Alternatively, set as True to export the file to an
# AWS S3 Bucket.
    
## The following parameters have effect only when export_to_s3_bucket == True:

S3_BUCKET_NAME = None    
## This parameter is mandatory to access an AWS S3 bucket. Replace it with a string
# containing the bucket's name. e.g. S3_BUCKET_NAME = "aws-bucket-1" accesses a bucket named
# "aws-bucket-1"
DESIRED_S3_FILE_NAME_WITH_CSV_EXTENSION = None
# The name desired for the object stored in S3 (string, in quotes). 
# Keep it None to set it equal to NEW_FILE_NAME_WITH_CSV_EXTENSION. 
# Alternatively, set it as a string analogous to NEW_FILE_NAME_WITH_CSV_EXTENSION.
# e.g. DESIRED_S3_FILE_NAME_WITH_CSV_EXTENSION = "S3_file.csv"

export_dataframe(dataframe_to_be_exported = DATAFRAME_TO_BE_EXPORTED, new_file_name_with_csv_extension = NEW_FILE_NAME_WITH_CSV_EXTENSION, file_directory_path = FILE_DIRECTORY_PATH, export_to_s3_bucket = EXPORT_TO_S3_BUCKET, s3_bucket_name = S3_BUCKET_NAME, desired_s3_file_name_with_csv_extension = DESIRED_S3_FILE_NAME_WITH_CSV_EXTENSION)

## **Downloading a file from Google Colab or AWS S3 to the local machine or uploading a file from the machine to S3 or to Colab's instant memory**

In [None]:
SOURCE = 'aws'
# SOURCE = 'google' for downloading from (or uploading to) Google Colab's instant memory;
# SOURCE = 'aws' for downloading from (or uploading to) an AWS S3 bucket.

ACTION = 'download'
# ACTION = 'download' to download the file to the local machine
# ACTION = 'upload' to upload a file from local machine to AWS S3 or to Google Colab's 
# instant memory

OBJECT_TO_DOWNLOAD_FROM_COLAB = None
# OBJECT_TO_DOWNLOAD_FROM_COLAB = None. This option has effect only when
# SOURCE == 'google'. In this case, this parameter is mandatory. 
# Declare as OBJECT_TO_DOWNLOAD_FROM_COLAB the object that you want to download.
# Since it is an object and not a string, it should not be declared in quotes.
# e.g. to download a dictionary named dict, OBJECT_TO_DOWNLOAD_FROM_COLAB = dict.
# To download a dataframe named df, declare OBJECT_TO_DOWNLOAD_FROM_COLAB = df.
# To export a model named keras_model, declare OBJECT_TO_DOWNLOAD_FROM_COLAB = keras_model
    
## THE FOLLOWING PARAMETERS HAVE EFFECT ONLY WHEN source == 'aws'

S3_BUCKET_NAME = None
## This parameter is mandatory to access an AWS S3 bucket. Replace it with a string
# containing the bucket's name. e.g. S3_BUCKET_NAME = "aws-bucket-1" accesses a bucket named
# "aws-bucket-1"

LOCAL_PATH_OF_STORAGE = '/'
# LOCAL_PATH_OF_STORAGE: path of the local computer environment 
# to which the S3 bucket contents will be downloaded (ACTION == 'download'); or
# path of the folder containing the file that will be uploaded in S3 (ACTION = 'upload'). 
# If it is None, or if LOCAL_PATH_OF_STORAGE = '/', files 
# will be imported to the root path. Alternatively, input the path as a string (in quotes). 
# Examples: LOCAL_PATH_OF_STORAGE = '/copied_s3_bucket'; 
# LOCAL_PATH_OF_STORAGE = "/My_folder"; LOCAL_PATH_OF_STORAGE = "/Users/Me/Documents/"
# Notice that only the directories should be declared: do not include the file name and
# its extension.

FILE_NAME_WITH_EXTENSION = None
# FILE_NAME_WITH_EXTENSION: string, in quotes, containing the name of the file which will be
# downloaded from S3 or uploaded to S3, followed by its extension. 
## This parameter is mandatory when SOURCE == 'aws'
# Examples:
# FILE_NAME_WITH_EXTENSION = 'Screen_Shot.png'; FILE_NAME_WITH_EXTENSION = 'dataset.csv',
# FILE_NAME_WITH_EXTENSION = "dictionary.pkl", FILE_NAME_WITH_EXTENSION = "model.h5",
# FILE_NAME_WITH_EXTENSION = 'doc.pdf', FILE_NAME_WITH_EXTENSION = 'model.dill'

download_or_upload_file (source = SOURCE, action = ACTION, object_to_download_from_colab = OBJECT_TO_DOWNLOAD_FROM_COLAB, s3_bucket_name = S3_BUCKET_NAME, local_path_of_storage = LOCAL_PATH_OF_STORAGE, file_name_with_extension = FILE_NAME_WITH_EXTENSION)

****

# **One-Hot Encoding - Background**

If there are **categorical features**, they should be converted into numerical variables so that they can be processed by the machine learning algorithms.

\- We can assign integer values for each one of the categories. This works well for situations where there is a scale or order for the assignment of the variables (e.g., if there is a satisfaction grade).

\- On the other hand, the results may be compromised if there is no order. That is because the ML algorithms assume that, if two categories have close numbers, then the categories are similar, which is not necessarily true. There are cases where the categories have no relation with each other.

\- In these cases, the best strategy is the One-Hot Encoding. For each category, the One-Hot Encoder creates a new column in the dataset. This new column is a binary variable which equals zero if the row is not classified in that category, and equals 1 when the row represents an element in that category.

\- Naturally, the number of columns grows with the number of possible labels. The One-Hot Encoder from Sklearn returns a SciPy sparse matrix, which stores only the positions of the non-zero elements. The computational cost is thus reduced, because we do not store a huge amount of null values.

\- Since each column is a binary variable of the type "is classified in this category or not", we expect that the created columns contain more zeros than 1s. That is because if an element belongs to one category (= 1), it does not belong to the others, so its value is zero for all other columns.
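
The two strategies above can be illustrated with a small sketch. The dataframe, its column names and the integer mapping are hypothetical and serve only as an example; `OneHotEncoder` is used with its default settings, under which it returns a SciPy sparse matrix.

```python
import pandas as pd
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

# Hypothetical dataset: one ordered feature and one feature without any order.
df = pd.DataFrame({
    "satisfaction": ["low", "high", "medium", "low"],
    "color": ["red", "green", "blue", "green"],
})

# Ordered categories: integers preserve the scale (low < medium < high).
satisfaction_order = {"low": 0, "medium": 1, "high": 2}
df["satisfaction_encoded"] = df["satisfaction"].map(satisfaction_order)

# Unordered categories: One-Hot Encoding creates one binary column per category.
encoder = OneHotEncoder()
encoded = encoder.fit_transform(df[["color"]])

print(sparse.issparse(encoded))  # the result is stored as a sparse matrix
print(encoder.categories_[0])    # one new column per category
print(encoded.toarray())         # dense view: exactly one 1 per row
```

Each row of the dense view contains a single 1, in the column of the category to which the row belongs, and zeros in all other columns.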