# **Process Simulation**
- Feed the sensitivity-analysis datasets into the models, then reverse the transformations to simulate and visualize different conditions

## _Machine Learning Modelling Workflow Notebook 6_

## Content:
1. Loading the dataframes;
2. Loading the models;
3. Converting the datasets into NumPy arrays with correct format for CNN and RNN Architectures;
4. Using the models to predict outputs;
5. Using the classification models to predict probabilities;
6. Merging (joining) dataframes on given keys; and sorting the merged table;
7. Concatenating (SQL Union/Stacking/Appending) dataframes;
8. Column filtering (selecting) or column renaming;
9. Reversing transforms: log-transform (exponentially transforming variables); 
10. Box-Cox transform; 
11. One-Hot Encoding;
12. Feature scaling;
13. Bar chart visualization;
14. Time series visualization.

Marco Cesar Prado Soares, Data Scientist Specialist - Bayer Crop Science LATAM
- marcosoares.feq@gmail.com
- marco.soares@bayer.com

In [None]:
# To install a library (e.g. tensorflow), uncomment and run:
# ! pip install tensorflow
# To update a library (e.g. tensorflow), uncomment and run:
# ! pip install tensorflow --upgrade
# To update pip, uncomment and run:
# ! pip install pip --upgrade
# To check whether a library is installed and view its information (e.g. tensorflow),
# uncomment and run:
# ! pip show tensorflow
# To import a Python file (e.g. idsw_etl.py) saved in the notebook's workspace directory,
# uncomment and run:
# import idsw_etl
# or:
# import idsw_etl as etl
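
The same installation check can also be done from Python, without shell commands. The sketch below is only an illustration: `is_installed` and `installed_version` are hypothetical helper names that do not exist in this notebook, and they rely solely on the standard library (`importlib`).

```python
import importlib.util
from importlib import metadata


def is_installed(module_name):
    """Return True if a top-level module can be located in this environment."""
    # find_spec only locates a top-level module; it does not import it.
    return importlib.util.find_spec(module_name) is not None


def installed_version(distribution_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(distribution_name)
    except metadata.PackageNotFoundError:
        return None


# Report the status of a few libraries used in this notebook.
for name in ("numpy", "pandas", "tensorflow"):
    print(name, is_installed(name), installed_version(name))
```

A library reported as missing can then be installed with the `pip install` command shown in the cell above.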

## **Load Python Libraries in Global Context**

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels as sm
import tensorflow as tf
import shap
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error, classification_report, confusion_matrix, accuracy_score, precision_score, recall_score, roc_auc_score
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.neural_network import MLPRegressor, MLPClassifier
from xgboost import XGBRegressor, XGBClassifier

# **Function for mounting Google Drive or S3 (AWS Simple Storage Service) bucket**

In [None]:
def mount_storage_system (source = 'aws', path_to_store_imported_s3_bucket = '', s3_bucket_name = None, s3_obj_prefix = None):
    
    # source = 'google' for mounting the google drive;
    # source = 'aws' for mounting an AWS S3 bucket.
    
    # THE FOLLOWING PARAMETERS HAVE EFFECT ONLY WHEN source == 'aws'
    
    # path_to_store_imported_s3_bucket: path of the Python environment to which the
    # S3 bucket contents will be imported. If it is None, or if 
    # path_to_store_imported_s3_bucket = '/', bucket will be imported to the root path. 
    # Alternatively, input the path as a string (in quotes). e.g. 
    # path_to_store_imported_s3_bucket = 'copied_s3_bucket'
    
    # s3_bucket_name = None.
    # This parameter is mandatory to access an AWS S3 bucket. Replace it with a string
    # containing the bucket's name. e.g. s3_bucket_name = "aws-bucket-1" accesses a bucket
    # named "aws-bucket-1"
    
    # s3_obj_prefix = None. Keep it None or as an empty string (s3_obj_prefix = '')
    # to import the whole bucket content, instead of a single object from it.
    # Alternatively, set it as a string containing the subfolder from the bucket to import:
    # Suppose that your bucket (admin-created) has four objects with the following object 
    # keys: Development/Projects.xlsx; Finance/statement1.pdf; Private/taxdocument.pdf; and
    # s3-dg.pdf. The s3-dg.pdf key does not have a prefix, so its object appears directly 
    # at the root level of the bucket. If you open the Development/ folder, you see 
    # the Projects.xlsx object in it.
    # Check Amazon documentation:
    # https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-keys.html
    
    # In summary, if the path of the file is: 'bucket/my_path/.../file.csv'
    # where 'bucket' is the bucket's name, key_prefix = 'my_path/.../', without the
    # 'file.csv' (file name with extension) last part.
    
    # So, declare the prefix as s3_obj_prefix to import only files from
    # a given folder (directory) of the bucket.
    # DO NOT PUT A SLASH before (to the left of) the prefix;
    # DO NOT ADD THE BUCKET'S NAME to the left of the prefix:
    # s3_obj_prefix = "bucket_directory1/.../bucket_directoryN/"

    # Alternatively, provide the full path of a given file if you want to import only it:
    # s3_obj_prefix = "bucket_directory1/.../bucket_directoryN/my_file.ext"
    # where my_file is the file's name, and ext is its extension.


    # Attention: after running this function for fetching from AWS Simple Storage Service (S3), 
    # your 'AWS Access key ID' and your 'Secret access key' will be requested.
    # The 'Secret access key' is hidden as you type, so it cannot be seen or copied by
    # other users. The same is not true for the 'Access key ID', the bucket's name, 
    # and the prefix. All of these are sensitive information from the organization.
    # Therefore, after importing the data, always remember to clear the output of this cell
    # and to remove such information from the strings.
    # Remember that these credentials may grant access to protected information, so they must not
    # be used by unauthorized people.

    # Also, remember to delete the imported files from the workspace after finishing the analysis.
    # Storing files in S3 costs considerably less than storing them directly in the
    # workspace. Also, files stored in S3 may be accessed by users other than those with access to
    # the notebook's workspace.
    
    
    if (source == 'google'):
        
        from google.colab import drive
        # Google Colab library must be imported only in case it is
        # going to be used, for avoiding AWS compatibility issues.
        
        print("Associate the Python environment to your Google Drive account, and authorize the access in the opened window.")
        
        drive.mount('/content/drive')
        
        print("Now your Python environment is connected to your Google Drive: the root directory of your environment is now the root of your Google Drive.")
        print("In Google Colab, navigate to the folder icon (\'Files\') of the left navigation menu to find a specific folder or file in your Google Drive.")
        print("Click on the folder or file name and select the ellipsis (...) icon on the right of the name to reveal the option \'Copy path\', which will give you the path to use as input for loading objects and files on your Python environment.")
        print("Caution: save your files into different directories of the Google Drive. If files are all saved in a same folder or directory, like the root path, they may not be accessible from your Python environment.")
        print("If you still cannot see the file after moving it to a different folder, reload the environment.")
    
    elif (source == 'aws'):
        
        import os
        import boto3
        # boto3 is the AWS SDK for Python, used here to access S3.
        # The boto3 library must be imported only in case 
        # it is going to be used, for avoiding 
        # Google Colab compatibility issues.
        from getpass import getpass

        # Check if path_to_store_imported_s3_bucket is None. If it is, make it the root directory:
        if ((path_to_store_imported_s3_bucket is None)|(str(path_to_store_imported_s3_bucket) == "/")):
            
            # For the S3 buckets, the path should not start with slash. Assign the empty
            # string instead:
            path_to_store_imported_s3_bucket = ""
            print("Bucket\'s content will be copied to the notebook\'s root directory.")
        
        elif (str(path_to_store_imported_s3_bucket) == ""):
            # Guarantee that the path is the empty string.
            # Avoid accessing the else condition, what would raise an error
            # since the empty string has no character of index 0
            path_to_store_imported_s3_bucket = str(path_to_store_imported_s3_bucket)
            print("Bucket\'s content will be copied to the notebook\'s root directory.")
        
        else:
            # Use the str attribute to guarantee that the path was read as a string:
            path_to_store_imported_s3_bucket = str(path_to_store_imported_s3_bucket)
            
            if(path_to_store_imported_s3_bucket[0] == "/"):
                # the first character is the slash. Let's remove it

                # In AWS, neither the prefix nor the path to which the file will be imported
                # (file from S3 to workspace) or from which the file will be exported to S3
                # (the path in the notebook's workspace) may start with slash, or the operation
                # will not be concluded. Then, we have to remove this character if it is present.

                # The slash is character 0. Then, we want all characters from character 1 (the
                # second) to character len(str(path_to_store_imported_s3_bucket)) - 1, the index
                # of the last character. The slicing syntax is:
                # string[1:] - all characters from index 1 to the end;
                # string[:10] - all characters from index 0 to index 10-1 = 9 (including 9); or
                # string[1:10] - characters from index 1 to index 9.
                # So, slice the whole string, starting from character 1:
                path_to_store_imported_s3_bucket = path_to_store_imported_s3_bucket[1:]
                # Attention: even though strings may be seen as sequences of characters that can be
                # sliced, they are immutable: we can neither assign a character to a given position
                # nor delete a character from a position.

        # Ask the user to provide the credentials:
        ACCESS_KEY = input("Enter your AWS Access Key ID here (in the right). It is the value stored in the field \'Access key ID\' from your AWS user credentials CSV file.")
        print("\n") # line break
        SECRET_KEY = getpass("Enter your password (Secret key) here (in the right). It is the value stored in the field \'Secret access key\' from your AWS user credentials CSV file.")
        
        # The use of 'getpass' instead of 'input' hides the password as it is typed.
        # So, the password is not visible to other users and cannot be copied.
        
        print("\n")
        print("WARNING: The bucket\'s name, the prefix, the AWS access key ID, and the AWS Secret access key are all sensitive information, which may grant access to protected information from the organization.\n")
        print("After copying data from S3 to your workspace, remember to remove this information from the notebook, especially if it is going to be shared. Also, remember to remove the files from the workspace.\n")
        print("Storing files in Simple Storage Service costs considerably less than storing them directly in the SageMaker workspace. Also, files stored in S3 may be accessed by users other than those with access to the notebook\'s workspace.\n")

        # Check if the user actually provided the mandatory inputs, instead
        # of putting None or empty string:
        if ((ACCESS_KEY is None) | (ACCESS_KEY == '')):
            print("AWS Access Key ID is missing. It is the value stored in the field \'Access key ID\' from your AWS user credentials CSV file.")
            return "error"
        elif ((SECRET_KEY is None) | (SECRET_KEY == '')):
            print("AWS Secret Access Key is missing. It is the value stored in the field \'Secret access key\' from your AWS user credentials CSV file.")
            return "error"
        elif ((s3_bucket_name is None) | (s3_bucket_name == '')):
            print ("Please, enter a valid S3 Bucket\'s name. Do not add sub-directories or folders (prefixes), only the name of the bucket itself.")
            return "error"
        
        else:
            # Use the str attribute to guarantee that all AWS parameters were properly read as strings, and not as
            # other variables (like integers or floats):
            ACCESS_KEY = str(ACCESS_KEY)
            SECRET_KEY = str(SECRET_KEY)
            s3_bucket_name = str(s3_bucket_name)
        
        if (s3_bucket_name[0] == "/"):
            # the first character is the slash. Let's remove it

            # In AWS, neither the prefix nor the path to which the file will be imported
            # (file from S3 to workspace) or from which the file will be exported to S3
            # (the path in the notebook's workspace) may start with slash, or the operation
            # will not be concluded. Then, we have to remove this character if it is present.

            # So, slice the whole string, starting from character 1 (as done for 
            # path_to_store_imported_s3_bucket):
            s3_bucket_name = s3_bucket_name[1:]

        # Remove any possible trailing whitespace (spaces and tabs)
        # that may be present in the strings. Use the Python string
        # rstrip method, which trims only the right end of the string:
        # when no arguments are provided, whitespace characters
        # are the removed characters
        # https://www.w3schools.com/python/ref_string_rstrip.asp?msclkid=ee2d05c3c56811ecb1d2189d9f803f65
        s3_bucket_name = s3_bucket_name.rstrip()
        ACCESS_KEY = ACCESS_KEY.rstrip()
        SECRET_KEY = SECRET_KEY.rstrip()
        # Since the user manually inputs ACCESS_KEY and SECRET_KEY,
        # it is easy to input whitespaces without noticing that.

        # Now process the optional parameter.
        # Check if a prefix was passed as input parameter. If so, we must select only the names that start with
        # the prefix.
        # Example: in the bucket 'my_bucket' we have a directory 'dir1'.
        # In the main (root) directory, we have a file whose key is 'file1.json'.
        # If we pass the prefix 'dir1/', we want only the files whose keys start with 'dir1/'
        # such as: 'dir1/file2.json', excluding the file in the main (root) directory and excluding the files in other
        # directories. Also, we want to eliminate the object names with no extensions, like 'dir1/' or 'dir1/dir2',
        # since these object names represent folders or directories, not files.

        if (s3_obj_prefix is None):
            print ("No prefix, specific object, or subdirectory provided.") 
            print (f"Then, retrieving all content from the bucket \'{s3_bucket_name}\'.\n")
        elif ((s3_obj_prefix == "/") | (s3_obj_prefix == '')):
            # The root directory in the bucket must not be specified starting with the slash
            # If the root "/" or the empty string '' is provided, make
            # it equivalent to None (no directory)
            s3_obj_prefix = None
            print ("No prefix, specific object, or subdirectory provided.") 
            print (f"Then, retrieving all content from the bucket \'{s3_bucket_name}\'.\n")
    
        else:
            # Since there is a prefix, use the str attribute to guarantee that the path was read as a string:
            s3_obj_prefix = str(s3_obj_prefix)
            
            if(s3_obj_prefix[0] == "/"):
                # the first character is the slash. Let's remove it

                # In AWS, neither the prefix nor the path to which the file will be imported
                # (file from S3 to workspace) or from which the file will be exported to S3
                # (the path in the notebook's workspace) may start with slash, or the operation
                # will not be concluded. Then, we have to remove this character if it is present.

                # So, slice the whole string, starting from character 1 (as done for 
                # path_to_store_imported_s3_bucket):
                s3_obj_prefix = s3_obj_prefix[1:]

            # Remove any possible trailing whitespace (spaces and tabs)
            # that may be present in the string, using the Python string
            # rstrip method, which trims only the right end of the string:
            s3_obj_prefix = s3_obj_prefix.rstrip()
            
            # Store the total characters in the prefix string after removing the initial slash
            # and trailing spaces:
            prefix_len = len(s3_obj_prefix)
            
            print("AWS Access Credentials, and bucket\'s prefix, object or subdirectory provided.\n")	

            
        print ("Starting connection with the S3 bucket.\n")
        
        try:
            # Start the S3 resource as the object 's3_client'
            s3_client = boto3.resource('s3', aws_access_key_id = ACCESS_KEY, aws_secret_access_key = SECRET_KEY)
        
            # boto3 only builds the resource object here. AWS validates the credentials
            # on the first request sent to the service.
            print("S3 resource successfully started.\n")
            # An object 'data_table.xlsx' in the main (root) directory of the s3_bucket is stored in Python environment as:
            # s3.ObjectSummary(bucket_name='bucket_name', key='data_table.xlsx')
            # The name of each object is stored as the attribute 'key' of the object.
        
        except Exception:
            
            print("Failed to start the AWS Simple Storage Service (S3) resource. Review if your credentials are correct.")
            print("The variable \'access_key\' must be set as the value (string) stored as \'Access key ID\' in your user security credentials CSV file.")
            print("The variable \'secret_key\' must be set as the value (string) stored as \'Secret access key\' in your user security credentials CSV file.")
            # Stop here: without the resource, the next steps cannot run.
            return "error"
        
        try:
            # Connect to the bucket specified as 's3_bucket_name'.
            # The bucket is started as the object 's3_bucket':
            s3_bucket = s3_client.Bucket(s3_bucket_name)
            print(f"Connection with bucket \'{s3_bucket_name}\' established.\n")
            
        except Exception:
            
            print("Failed to connect with the bucket, which usually happens when declaring a wrong bucket\'s name.") 
            print("Check the spelling of your bucket_name string and remember that it must be all in lower-case.\n")
            # Stop here: without the bucket object, the next steps cannot run.
            return "error"
                

        # Then, let's obtain a list of all objects in the bucket (list bucket_objects):
        
        bucket_objects_list = []

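        # Loop through all objects of the bucket. Listing the objects sends a request
        # to AWS, so wrong credentials or a wrong bucket name may raise an exception here:
        try:
            for stored_obj in s3_bucket.objects.all():
                
                # Loop through all elements 'stored_obj' from s3_bucket.objects.all()
                # Which stores the ObjectSummary for all objects in the bucket s3_bucket:
                # Let's store only the key attribute and use the str function
                # to guarantee that all values were stored as strings.
                bucket_objects_list.append(str(stored_obj.key))
        
        except Exception:
            print("Failed to list the objects of the bucket. Review your credentials, the bucket\'s name, and your S3 permissions.")
            return "error"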
        
        # Now start a support list to store only the elements from
        # bucket_objects_list that are not folders or directories
        # (objects with extensions).
        # If a prefix was provided, only files with that prefix should
        # be added:
        support_list = []
        
        for stored_obj in bucket_objects_list:
            
            # Loop through all elements 'stored_obj' from the list
            # bucket_objects_list

            # Check the file extension.
            file_extension = os.path.splitext(stored_obj)[1][1:]
            
            # The os.path.splitext method splits the string at its LAST dot (".") to
            # separate the file extension from the full path. Example:
            # "C:/dir1/dir2/data_table.csv" is split into:
            # "C:/dir1/dir2/data_table" (root part) and '.csv' (extension part)
            # https://www.geeksforgeeks.org/python-os-path-splitext-method/?msclkid=2d56198fc5d311ec820530cfa4c6d574

            # os.path.splitext(stored_obj) is a tuple of strings: the first is the complete file
            # root with no extension; the second is the extension starting with a dot: '.txt'
            # When we set os.path.splitext(stored_obj)[1], we are selecting the second element of
            # the tuple. By selecting os.path.splitext(stored_obj)[1][1:], we are taking this string
            # from the second character (index 1), eliminating the dot: 'txt'


            # Check if the file extension is not an empty string '' (i.e., that it is different
            # from the empty string):
            if (file_extension != ''):
                
                # The extension is different from the empty string, so the object is neither a folder nor a directory.
                # The object is actually a file and may be copied if it satisfies the prefix condition. If there
                # is no prefix to check, we may simply copy the object to the list.

                # If there is a prefix, the first characters of the stored_obj must be the prefix:
                if (s3_obj_prefix is not None):
                    
                    # Since a prefix was declared, we want only the objects whose first portion
                    # corresponds to the prefix. str.startswith checks exactly that: it is equivalent to
                    # comparing the slice string[0:prefix_len] (indices 0 to prefix_len - 1) with the prefix.
                    # If this first part is the prefix, then append the object to 
                    # support list:
                    if (stored_obj.startswith(s3_obj_prefix)):

                        support_list.append(stored_obj)

                else:
                    # There is no prefix, so we can simply append the object to the list:
                    support_list.append(stored_obj)

            
        # Make the objects list the support list itself:
        bucket_objects_list = support_list
            
        # Now, bucket_objects_list contains the names of all objects from the bucket that must be copied.

        print("Finished mapping objects to fetch. Now, all these objects from S3 bucket will be copied to the notebook\'s workspace, in the specified directory.\n")
        print(f"A total of {len(bucket_objects_list)} files were found in the specified bucket\'s prefix (\'{s3_obj_prefix}\').")
        # Show the first and last files only if at least one file was found,
        # since indexing an empty list raises an IndexError:
        if (len(bucket_objects_list) > 0):
            print(f"The first file found is \'{bucket_objects_list[0]}\'; whereas the last file found is \'{bucket_objects_list[-1]}\'.")
            
        # Now, let's try copying the files:
            
        try:
            
            # Loop through all objects in the list bucket_objects and copy them to the workspace:
            for copied_object in bucket_objects_list:

                # Select the object in the bucket previously started as 's3_bucket':
                selected_object = s3_bucket.Object(copied_object)
            
                # Now, copy this object to the workspace:
                # Set the new file_path. Notice that by now, copied_object may be a string like:
                # 'dir1/.../dirN/file_name.ext', where dirN is the n-th directory and ext is the file extension.
                # We want only the file_name to join with the path to store the imported bucket. So, we can use the
                # str.split method specifying the separator sep = '/' to break the string into a list of substrings.
                # The last element from this list will be 'file_name.ext'
                # https://www.w3schools.com/python/ref_string_split.asp?msclkid=135399b6c63111ecada75d7d91add056

                # 1. Break the copied_object full path into the list object_path_list, using the .split method:
                object_path_list = copied_object.split(sep = "/")

                # 2. Get the last element from this list. Since it has length len(object_path_list) and indexing starts from
                # zero, the index of the last element is (len(object_path_list) - 1):
                fetched_object = object_path_list[(len(object_path_list) - 1)]

                # 3. Finally, join the string fetched_object with the new path (path on the notebook's workspace) to finish
                # The new object's file_path:

                file_path = os.path.join(path_to_store_imported_s3_bucket, fetched_object)

                # Download the selected object to the workspace in the specified file_path
                # The parameter Filename must be input with the path of the copied file, including its name and
                # extension. Example Filename = "/my_table.xlsx" copies a xlsx file named 'my_table' to the notebook's main (root)
                # directory
                selected_object.download_file(Filename = file_path)

                print(f"The file \'{fetched_object}\' was successfully copied to notebook\'s workspace.\n")

                
            print("Finished copying the files from the bucket to the notebook\'s workspace. It may take a couple of minutes until they are shown in the SageMaker environment.\n") 
            print("Do not forget to delete these copies after finishing the analysis. They will remain stored in the bucket.\n")


        except Exception:

            # Run this code for any exception that may happen (only the generic Exception
            # class is specified, so any regular error runs the following code).
            # Check: https://pythonbasics.org/try-except/?msclkid=4f6b4540c5d011ecb1fe8a4566f632a6
            # for seeing how to handle successive exceptions

            print("Attention! The function raised an exception error, which is probably due to the AWS Simple Storage Service (S3) permissions.")
            print("Before running again this function, check this quick guide for configuring the permission roles in AWS.\n")
            print("It is necessary to create a user with full access permissions to interact with S3 from SageMaker. To configure the User, go to the upper ribbon of AWS, click on Services, and select IAM – Identity and Access Management.")
            print("1. In IAM\'s lateral panel, search for \'Users\' in the group of Access Management.")
            print("2. Click on the \'Add users\' button.")
            print("3. Set a user name in the text box \'User name\'.")
            print("Attention: users and S3 buckets cannot be written in upper case. Also, selecting a name already used by an Amazon user or bucket will raise an error message.\n")
            print("4. In the field \'Select type of Access to AWS\'-\'Select type of AWS credentials\' select the option \'Access key - Programmatic access\'. After that, click on the button \'Next: Permissions\'.")
            print("5. In the field \'Set Permissions\', keep the \'Add user to a group\' button marked.")
            print("6. In the field \'Add user to a group\', click on \'Create group\' (alternatively, you can be added to a group already configured or copy the permissions of another user).")
            print("7. In the text box \'Group\'s name\', set a name for the new group of permissions.")
            print("8. In the search bar below (\'Filter policies\'), search for a policy that fits your needs, and check the option button on the left of this policy. The policy \'AmazonS3FullAccess\' grants full access to the S3 content.")
            print("9. Finally, click on \'Create a group\'.")
            print("10. After the group is created, it will appear with a check box marked, over the previous groups. Keep it marked and click on the button \'Next: Tags\'.")
            print("11. Create and note down the Access key ID and Secret access key. You can also download a comma separated values (CSV) file containing the credentials for future use.")
            print("ATTENTION: These parameters are required for accessing the bucket\'s content from any application, including AWS SageMaker.")
            print("12. Click on \'Next: Review\' and review the user credentials information and permissions.")
            print("13. Click on \'Create user\' and click on the download button to download the CSV file containing the user credentials information.")
            print("The headers of the CSV file (the stored fields) are: \'User name, Password, Access key ID, Secret access key, Console login link\'.")
            print("You need both the values indicated as \'Access key ID\' and as \'Secret access key\' to fetch the S3 bucket.")
            print("\n") # line break
            print("After acquiring the necessary user privileges, use the boto3 library to fetch the bucket from the Python code. boto3 is the AWS SDK for Python.")
            print("For fetching a specific bucket\'s file use the following code:\n")
            print("1. Set a variable \'access_key\' as the value (string) stored as \'Access key ID\' in your user security credentials CSV file.")
            print("2. Set a variable \'secret_key\' as the value (string) stored as \'Secret access key\' in your user security credentials CSV file.")
            print("3. Set a variable \'bucket_name\' as a string containing only the name of the bucket. Do not add subdirectories, folders (prefixes), or file names.")
            print("Example: if your bucket is named \'my_bucket\' and its main directory contains folders like \'folder1\', \'folder2\', etc, do not declare bucket_name = \'my_bucket/folder1\', even if you only want files from folder1.")
            print("ALWAYS declare only the bucket\'s name: bucket_name = \'my_bucket\'.")
            print("4. Set a variable \'file_path\' containing the path from the bucket\'s subdirectories to the file you want to fetch. Include the file name and its extension.")
            print("If the file is stored in the bucket\'s root (main) directory: file_path = \"my_file.ext\".")
            print("If the path of the file in the bucket is: \'dir1/…/dirN/my_file.ext\', where dirN is the N-th subdirectory, and dir1 is a folder or directory of the main (root) bucket\'s directory: file_path = \"dir1/…/dirN/my_file.ext\".")
            print("Also, we say that \'dir1/…/dirN/\' is the file\'s prefix. Notice that the name of the bucket is never declared here as the path for fetching its content from the Python code.")
            print("5. Set a variable named \'new_path\' to store the path of the file copied to the notebook’s workspace. This path must contain the file name and its extension.")
            print("Example: if you want to copy \'my_file.ext\' to the root directory of the notebook’s workspace, set: new_path = \"/my_file.ext\".")
            print("6. Finally, declare the following code, which refers to the defined variables:\n")

            # Let's use triple quotes to declare a formatted string
            example_code = """
                import boto3
                # Start S3 client as the object 's3_client'
                s3_client = boto3.resource('s3', aws_access_key_id = access_key, aws_secret_access_key = secret_key)
                # Connect to the bucket specified as 'bucket_name'.
                # The bucket is started as the object 's3_bucket':
                s3_bucket = s3_client.Bucket(bucket_name)
                # Select the object in the bucket previously started as 's3_bucket':
                selected_object = s3_bucket.Object(file_path)
                # Download the selected object to the workspace, in the path specified as 'new_path'.
                # The parameter Filename must receive the path of the copied file, including its name and
                # extension. Example: Filename = "/my_table.xlsx" copies an xlsx file named 'my_table' to the notebook's main (root)
                # directory
                selected_object.download_file(Filename = new_path)
                """

            print(example_code)

            print("An object \'my_file.ext\' in the main (root) directory of the s3_bucket is stored in Python environment as:")
            print("""s3.ObjectSummary(bucket_name='bucket_name', key='my_file.ext')""")
            # triple quotes keep the internal quotes without requiring many backslashes "\" (the escape character)
            print("Then, the name of each object is stored as the attribute \'key\' of the object. To view all objects, we can loop through their \'key\' attributes:\n")
            example_code = """
                # Loop through all objects of the bucket:
                for stored_obj in s3_bucket.objects.all():
                    # Loop through all elements 'stored_obj' from s3_bucket.objects.all()
                    # Which stores the ObjectSummary for all objects in the bucket s3_bucket:
                    # Print the object’s names:
                    print(stored_obj.key)
                    """

            print(example_code)

                
    else:
        
        print("Select a valid source: \'google\' for mounting Google Drive; or \'aws\' for accessing AWS S3 Bucket.")
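
The printed instructions above distinguish the file's prefix ('dir1/…/dirN/') from the file name. A minimal sketch of that split, using only the standard library, is shown below. The helper name `split_s3_key` is hypothetical and is not part of this notebook's functions; it only illustrates how a `file_path` value decomposes, and it does not connect to AWS.

```python
import posixpath

def split_s3_key(file_path):
    """Split a bucket-relative path into (prefix, file_name).

    Hypothetical helper for illustration only: S3 keys always use '/'
    as separator, so posixpath is used instead of os.path.
    """
    prefix, file_name = posixpath.split(file_path)
    # posixpath.split drops the trailing slash; restore it for non-empty prefixes
    if prefix:
        prefix = prefix + "/"
    return prefix, file_name

# File stored in the bucket's root (main) directory: empty prefix
print(split_s3_key("my_file.ext"))

# File stored inside subdirectories of the bucket
print(split_s3_key("dir1/dir2/my_file.ext"))
```

Notice that the bucket name appears in neither element: it is passed separately as `bucket_name`.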

# **Function for downloading a file from Google Colab to the local machine; or uploading a file from the machine to Colab's instant memory**

In [None]:
def upload_to_or_download_file_from_colab (action = 'download', file_to_download_from_colab = None):
    
    # action = 'download' to download the file to the local machine
    # action = 'upload' to upload a file from local machine to
    # Google Colab's instant memory
    
    # file_to_download_from_colab = None. This parameter is mandatory when
    # action = 'download'. 
    # Declare as file_to_download_from_colab the name of the file that you want to download, with
    # the corresponding extension.
    # It should be declared as a string (in quotes).
    # e.g. to download a dictionary saved as 'dict.pkl', declare file_to_download_from_colab = 'dict.pkl'
    # To download a dataframe saved as 'df.csv', declare file_to_download_from_colab = 'df.csv'
    # To download a model saved as 'keras_model.h5', declare file_to_download_from_colab = 'keras_model.h5'
 
    from google.colab import files
    # google.colab library must be imported only in case 
    # it is going to be used, for avoiding 
    # AWS compatibility issues.
        
    if (action == 'upload'):
            
        print("Click on the button for file selection and select the files from your machine that will be uploaded in the Colab environment.")
        print("Warning: the files will be removed from Colab memory after the Kernel dies or after the notebook is closed.")
        # this functionality requires the previous declaration:
        ## from google.colab import files
            
        colab_files_dict = files.upload()
            
        # The files are stored into a dictionary called colab_files_dict where the keys
        # are the names of the files and the values are the files themselves.
        ## e.g. if you upload a single file named "dictionary.pkl", the dictionary will be
        ## colab_files_dict = {'dictionary.pkl': file}, where file is actually a bytes object
        ## holding the raw contents of the file. The length of this value is the size of the
        ## uploaded file, in bytes.
        ## Accessing the file is like accessing a value from a dictionary: 
        ## d = {'key1': 'val1'}, d['key1'] == 'val1'
        ## we simply declare the key inside brackets and quotes, the same way we would do for
        ## accessing the column of a dataframe.
        ## In this example, colab_files_dict['dictionary.pkl'] accesses the content of the 
        ## .pkl file, and len(colab_files_dict['dictionary.pkl']) is the size of the .pkl
        ## file in bytes.
        ## To check the dictionary keys, apply the method .keys() to the dictionary (with empty
        ## parentheses): colab_files_dict.keys()
            
        for key in colab_files_dict.keys():
            #loop through each element of the list of keys of the dictionary
            # (list colab_files_dict.keys()). Each element is named 'key'
            print(f"User uploaded file {key} with length {len(colab_files_dict[key])} bytes.")
            # The key is the name of the file, and the length of the value
            ## corresponding to the key is the file's size in bytes.
            ## Notice that the content of the uploaded object must be passed 
            ## as argument for a proper function to be interpreted. 
            ## For instance, the content of an xlsx file should be passed as
            ## argument for Pandas .read_excel function; the pkl file must be passed as
            ## argument for pickle.
            ## e.g., if you uploaded 'table.xlsx' and stored it into colab_files_dict you should
            ## declare df = pd.read_excel(colab_files_dict['table.xlsx']) to obtain a dataframe
            ## df from the uploaded table. Notice that it is the value, not the key, that is the
            ## argument.
                
        # The instructions and the return statement are placed outside the loop, so that
        # all uploaded files are reported before the dictionary is returned.
        print("The uploaded files are stored into a dictionary object named as colab_files_dict.")
        print("Each key from this dictionary is the name of an uploaded file. The value corresponding to that key is the file itself.")
        print("The structure of a general Python dictionary is dict = {\'key1\': value1}. To access value1, declare file = dict[\'key1\'], as if you were accessing a column from a dataframe.")
        print("Then, if you uploaded a file named \'table.xlsx\', you can access this file as:")
        print("uploaded_file = colab_files_dict[\'table.xlsx\']")
        print("Notice, though, that the object uploaded_file is the whole file content, not a Python object already converted. To convert to a Python object, pass this element as argument for a proper function or method.")
        print("In this example, to convert the object uploaded_file to a dataframe, Pandas pd.read_excel function could be used. In the following line, a df dataframe object is obtained from the uploaded file:")
        print("df = pd.read_excel(uploaded_file)")
        print("Also, the uploaded file itself will be available in the Colaboratory Notebook\'s workspace.")

        return colab_files_dict
        
    elif (action == 'download'):
            
        if (file_to_download_from_colab is None):
                
            #No object was declared
            print("Please, inform a file to download from the notebook\'s workspace. It should be declared in quotes and with the extension: e.g. \'table.csv\'.")
            
        else:
                
            print("The file will be downloaded to your computer.")

            files.download(file_to_download_from_colab)

            print(f"File {file_to_download_from_colab} successfully downloaded from Colab environment.")

    else:
            
        print("Please, select a valid action, \'download\' or \'upload\'.")
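
The dictionary returned by the upload branch maps each file name to the raw content of the file. A minimal sketch of how such a value can be converted into Python objects is shown below. It assumes the values are bytes objects; the dictionary is a hand-made stand-in for the real upload result, and the helper `rows_from_uploaded_csv` is hypothetical. Only the standard library is used, so the sketch also runs outside Colab.

```python
import csv
import io

# Hypothetical stand-in for the dictionary returned by files.upload():
# keys are file names, values are the raw file contents (assumed to be bytes).
colab_files_dict = {"table.csv": b"id,value\n1,10.5\n2,7.25\n"}

def rows_from_uploaded_csv(files_dict, file_name, encoding="utf-8"):
    """Decode the uploaded bytes and parse them as CSV rows (list of dicts)."""
    raw_bytes = files_dict[file_name]
    # Wrap the decoded text in a file-like buffer so that csv can read it
    text_buffer = io.StringIO(raw_bytes.decode(encoding), newline="")
    return list(csv.DictReader(text_buffer))

rows = rows_from_uploaded_csv(colab_files_dict, "table.csv")

# The length of the value is the size of the file in bytes
print(f"table.csv has {len(colab_files_dict['table.csv'])} bytes.")
print(rows)
```

The same idea applies to readers that expect a file-like object: wrapping the bytes in `io.BytesIO` gives them a buffer to read from.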

# **Function for loading the dataframe**

In [None]:
def load_pandas_dataframe (file_directory_path, file_name_with_extension, load_txt_file_with_json_format = False, how_missing_values_are_registered = None, has_header = True, decimal_separator = '.', txt_csv_col_sep = "comma", load_all_sheets_at_once = False, sheet_to_load = None, json_record_path = None, json_field_separator = "_", json_metadata_prefix_list = None):
    
    # Pandas documentation:
    # pd.read_csv: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
    # pd.read_excel: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html
    # pd.json_normalize: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.json_normalize.html
    # Python JSON documentation:
    # https://docs.python.org/3/library/json.html
    
    import os
    import json
    import numpy as np
    import pandas as pd
    from pandas import json_normalize
    
    ## WARNING: Use this function to load dataframes stored on Excel (xls, xlsx, xlsm, xlsb, odf, ods and odt), 
    ## JSON, txt, or CSV (comma separated values) files.
    
    # file_directory_path - (string, in quotes): input the path of the directory (e.g. folder path) 
    # where the file is stored. e.g. file_directory_path = "/" or file_directory_path = "/folder"
    
    # file_name_with_extension - (string, in quotes): input the name of the file with the 
    # extension. e.g. file_name_with_extension = "file.xlsx", or, 
    # file_name_with_extension = "file.csv", "file.txt", or "file.json"
    # Again, the extensions may be: xls, xlsx, xlsm, xlsb, odf, ods, odt, json, txt or csv.
    
    # load_txt_file_with_json_format = False. Set load_txt_file_with_json_format = True 
    # if you want to read a file with txt extension containing a text formatted as JSON 
    # (but not saved as JSON).
    # WARNING: if load_txt_file_with_json_format = True, all the JSON file parameters of the 
    # function (below) must be set. If not, an error message will be raised.
    
    # how_missing_values_are_registered = None: keep it None if missing values are registered as None,
    # empty or np.nan. Pandas automatically converts None to NumPy np.nan objects (floats).
    # This parameter manipulates the argument na_values (default: None) from Pandas functions.
    # By default the following values are interpreted as NaN: ‘’, ‘#N/A’, ‘#N/A N/A’, ‘#NA’, ‘-1.#IND’, 
    #‘-1.#QNAN’, ‘-NaN’, ‘-nan’, ‘1.#IND’, ‘1.#QNAN’, ‘<NA>’, ‘N/A’, ‘NA’, ‘NULL’, ‘NaN’, 
    # ‘n/a’, ‘nan’, ‘null’.

    # If a different denomination is used, indicate it here. e.g.
    # how_missing_values_are_registered = '.' will convert all strings '.' to missing values;
    # how_missing_values_are_registered = 0 will convert zeros to missing values.

    # If a dictionary is passed, it specifies per-column NA values. For example, if zero is the
    # missing value only in column 'numeric_col', you can specify the following dictionary:
    # how_missing_values_are_registered = {'numeric_col': 0}
    
    
    # has_header = True if the imported table has headers (row with columns names).
    # Alternatively, has_header = False if the dataframe does not have header.
    
    # decimal_separator = '.' - String. Keep it '.' or None to use the period ('.') as
    # the decimal separator. Alternatively, specify here the separator.
    # e.g. decimal_separator = ',' will set the comma as the separator.
    # It manipulates the argument 'decimal' from Pandas functions.
    
    # txt_csv_col_sep = "comma" - This parameter has effect only when the file is a 'txt'
    # or 'csv'. It informs how the different columns are separated.
    # Set txt_csv_col_sep = "comma", or txt_csv_col_sep = "," 
    # for columns separated by comma;
    # txt_csv_col_sep = "whitespace", or txt_csv_col_sep = " " 
    # for columns separated by one or more whitespace characters.
    # You can also set a specific separator as string. For example:
    # txt_csv_col_sep = '\s+'; or txt_csv_col_sep = '\t' (in this last example, the tabulation
    # is used as separator for the columns - '\t' represents the tab character).
    
    
    ## Parameters for loading Excel files:
    
    # load_all_sheets_at_once = False - This parameter has effect only for Excel files.
    # If load_all_sheets_at_once = True, the function will return a list of dictionaries, each
    # dictionary containing 2 key-value pairs: the first key will be 'sheet', and its
    # value will be the name (or number) of the table (sheet). The second key will be 'df',
    # and its value will be the pandas dataframe object obtained from that sheet.
    # This argument has preference over sheet_to_load. If it is True, all sheets will be loaded.
    
    # sheet_to_load - This parameter has effect only for Excel files.
    # keep sheet_to_load = None not to specify a sheet of the file, so that the first sheet
    # will be loaded.
    # sheet_to_load may be an integer or a string (inside quotes). sheet_to_load = 0
    # loads the first sheet (sheet with index 0); sheet_to_load = 1 loads the second sheet
    # of the file (index 1); sheet_to_load = "Sheet1" loads a sheet named as "Sheet1".
    # Declare a number to load the sheet with that index, starting from 0; or declare a
    # name to load the sheet with that name.
    
    
    ## Parameters for loading JSON files:
    
    # json_record_path (string): manipulates the parameter 'record_path' from json_normalize method.
    # Path in each object to list of records. If not passed, data will be assumed to 
    # be an array of records. If a given field from the JSON stores a nested JSON (or a nested
    # dictionary) declare it here to decompose the content of the nested data. e.g. if the field
    # 'books' stores a nested JSON, declare, json_record_path = 'books'
    
    # json_field_separator = "_" (string). Manipulates the parameter 'sep' from json_normalize method.
    # Nested records will generate names separated by sep. 
    # e.g., for json_field_separator = ".", {‘foo’: {‘bar’: 0}} -> foo.bar.
    # Then, if a given field 'main_field' stores a nested JSON with fields 'field1', 'field2', ...
    # the name of the columns of the dataframe will be formed by concatenating 'main_field', the
    # separator, and the names of the nested fields: 'main_field_field1', 'main_field_field2',...
    
    # json_metadata_prefix_list: list of strings (in quotes). Manipulates the parameter 
    # 'meta' from json_normalize method. Fields to use as metadata for each record in resulting 
    # table. Declare here the non-nested fields, i.e., the fields in the principal JSON. They
    # will be repeated in the rows of the dataframe to give the metadata (context) of the rows.
    
    # e.g. Suppose a JSON with the following structure: {'name': 'Mary', 'last': 'Shelley',
    # 'books': [{'title': 'Frankenstein', 'year': 1818}, {'title': 'Mathilda', 'year': 1819}, {'title': 'The Last Man', 'year': 1826}]},
    # Here, there are nested JSONs in the field 'books'. The fields that are not nested
    # are 'name' and 'last'.
    # Then, json_record_path = 'books'
    # json_metadata_prefix_list = ['name', 'last']
    
    
    # Create the complete file path:
    file_path = os.path.join(file_directory_path, file_name_with_extension)
    # Extract the file extension
    file_extension = os.path.splitext(file_path)[1][1:]
    # os.path.splitext(file_path) is a tuple of strings: the first is the complete file
    # root with no extension; the second is the extension starting with a point: '.txt'
    # When we set os.path.splitext(file_path)[1], we are selecting the second element of
    # the tuple. By selecting os.path.splitext(file_path)[1][1:], we are taking this string
    # from the second character (index 1), eliminating the dot: 'txt'
    
    # Check if the decimal separator is None. If it is, set it as '.' (period):
    if (decimal_separator is None):
        decimal_separator = '.'
    
    if ((file_extension == 'txt') | (file_extension == 'csv')): 
        # The operator & is equivalent to 'And' (intersection).
        # The operator | is equivalent to 'Or' (union).
        # pandas.read_csv method must be used.
        if (load_txt_file_with_json_format == True):
            
            print("Reading a txt file containing JSON parsed data. A reading error will be raised if you did not set the JSON parameters.\n")
            
            with open(file_path, 'r') as opened_file:
                # 'r' stands for read mode; 'w' stands for write mode
                # read the whole file as a string named 'file_full_text'
                file_full_text = opened_file.read()
                # if we used the readlines() method, we would be reading the
                # file by line, not the whole text at once.
                # https://stackoverflow.com/questions/8369219/how-to-read-a-text-file-into-a-string-variable-and-strip-newlines?msclkid=a772c37bbfe811ec9a314e3629df4e1e
                # https://www.tutorialkart.com/python/python-read-file-as-string/#:~:text=example.py%20%E2%80%93%20Python%20Program.%20%23open%20text%20file%20in,and%20prints%20it%20to%20the%20standard%20output.%20Output.?msclkid=a7723a1abfe811ecb68bba01a2b85bd8
                
            #Now, file_full_text is a string containing the full content of the txt file.
            json_file = json.loads(file_full_text)
            # json.load(): parses JSON from a file object (e.g. an opened file).
            # json.loads(): parses a string with JSON content.
            # e.g. json.loads() must be used to read a string with JSON before converting it to a flat
            # table like a dataframe.
            # check: https://www.pythonpip.com/python-tutorials/how-to-load-json-file-using-python/#:~:text=The%20json.load%20%28%29%20is%20used%20to%20read%20the,and%20alter%20data%20in%20our%20application%20or%20system.
            dataset = json_normalize(json_file, record_path = json_record_path, sep = json_field_separator, meta = json_metadata_prefix_list)
        
        else:
            # Not a JSON txt
        
            if (has_header == True):

                if ((txt_csv_col_sep == "comma") | (txt_csv_col_sep == ",")):

                    dataset = pd.read_csv(file_path, na_values = how_missing_values_are_registered, verbose = True, parse_dates = True, infer_datetime_format = True, decimal = decimal_separator)
                    # verbose = True for showing number of NA values placed in non-numeric columns.
                    #  parse_dates = True: try parsing the index; infer_datetime_format = True : If True and parse_dates is enabled, pandas will attempt to infer the format of the datetime strings in 
                    # the columns, and if it can be inferred, switch to a faster method of parsing them. In some cases this can increase the 
                    # parsing speed by 5-10x.

                elif ((txt_csv_col_sep == "whitespace") | (txt_csv_col_sep == " ")):

                    dataset = pd.read_csv(file_path, delim_whitespace = True, na_values = how_missing_values_are_registered, verbose = True, parse_dates = True, infer_datetime_format = True, decimal = decimal_separator)
                    
                    
                else:
                    
                    try:
                        
                        # Try using the character specified as the argument txt_csv_col_sep:
                        dataset = pd.read_csv(file_path, sep = txt_csv_col_sep, na_values = how_missing_values_are_registered, verbose = True, parse_dates = True, infer_datetime_format = True, decimal = decimal_separator)
                    
                    except Exception:
                        # An error was raised, the separator is not valid
                        print(f"Enter a valid column separator for the {file_extension} file, like: \'comma\' or \'whitespace\'.")
                        # Return early: no dataset was loaded, so there is nothing to display
                        return "error"


            else:
                # has_header == False

                if ((txt_csv_col_sep == "comma") | (txt_csv_col_sep == ",")):

                    dataset = pd.read_csv(file_path, header = None, na_values = how_missing_values_are_registered, verbose = True, parse_dates = True, infer_datetime_format = True, decimal = decimal_separator)

                    
                elif ((txt_csv_col_sep == "whitespace") | (txt_csv_col_sep == " ")):

                    dataset = pd.read_csv(file_path, delim_whitespace = True, header = None, na_values = how_missing_values_are_registered, verbose = True, parse_dates = True, infer_datetime_format = True, decimal = decimal_separator)
                    
                    
                else:
                    
                    try:
                        
                        # Try using the character specified as the argument txt_csv_col_sep:
                        dataset = pd.read_csv(file_path, sep = txt_csv_col_sep, header = None, na_values = how_missing_values_are_registered, verbose = True, parse_dates = True, infer_datetime_format = True, decimal = decimal_separator)
                    
                    except Exception:
                        # An error was raised, the separator is not valid
                        print(f"Enter a valid column separator for the {file_extension} file, like: \'comma\' or \'whitespace\'.")
                        # Return early: no dataset was loaded, so there is nothing to display
                        return "error"

    elif (file_extension == 'json'):
        
        with open(file_path, 'r') as opened_file:
            
            json_file = json.load(opened_file)
            # The structure json_file = json.load(open(file_path)) relies on the GC to close the file. That's not a 
            # good idea: If someone doesn't use CPython the garbage collector might not be using refcounting (which 
            # collects unreferenced objects immediately) but e.g. collect garbage only after some time.
            # Since file handles are closed when the associated object is garbage collected or closed 
            # explicitly (.close() or .__exit__() from a context manager) the file will remain open until 
            # the GC kicks in.
            # Using 'with' ensures the file is closed as soon as the block is left - even if an exception 
            # happens inside that block, so it should always be preferred for any real application.
            # source: https://stackoverflow.com/questions/39447362/equivalent-ways-to-json-load-a-file-in-python
            
        # json.load(): parses JSON from a file object (e.g. an opened file).
        # json.loads(): parses a string with JSON content.
        # Then, json.load for a .json file
        # and json.loads for a text file containing JSON
        # check: https://www.pythonpip.com/python-tutorials/how-to-load-json-file-using-python/#:~:text=The%20json.load%20%28%29%20is%20used%20to%20read%20the,and%20alter%20data%20in%20our%20application%20or%20system.   
        dataset = json_normalize(json_file, record_path = json_record_path, sep = json_field_separator, meta = json_metadata_prefix_list)
    
    else:
        # If it is neither a csv, a txt, nor a json file, let's assume it is one of the different
        # possible Excel files.
        print("Excel file inferred. If an error message is shown, check if a valid file extension was used: \'xlsx\', \'xls\', etc.\n")
        # For Excel type files, Pandas automatically detects the decimal separator and requires only the parameter parse_dates.
        # Firstly, the argument infer_datetime_format was present on read_excel function, but was removed.
        # Since Pandas version 1.4, it is possible to pass the parameter 'decimal' to
        # read_excel function for detecting decimal cases in strings. For numeric variables, it is not needed, though
        
        if (load_all_sheets_at_once == True):
            
            # Corresponds to setting sheet_name = None
            
            if (has_header == True):
                
                xlsx_doc = pd.read_excel(file_path, sheet_name = None, na_values = how_missing_values_are_registered, verbose = True, parse_dates = True)
                # verbose = True for showing number of NA values placed in non-numeric columns.
                # parse_dates = True: try parsing the index.
                
            else:
                #No header
                xlsx_doc = pd.read_excel(file_path, sheet_name = None, header = None, na_values = how_missing_values_are_registered, verbose = True, parse_dates = True)
            
            # xlsx_doc is a dictionary containing the sheet names as keys, and dataframes as values.
            # Let's convert it to the desired format.
            # For a dictionary d, d.keys() is a view of the keys; d.values() is a view of the values;
            # and d.items() is a view of tuples with format (key, value)
            
            # Create a list of returned datasets:
            list_of_datasets = []
            
            # Let's iterate through the array of tuples. The first element returned is the key, and the
            # second is the value
            for sheet_name, dataframe in (xlsx_doc.items()):
                # sheet_name = key; dataframe = value
                # Define the dictionary with the standard format:
                df_dict = {'sheet': sheet_name,
                            'df': dataframe}
                
                # Add the dictionary to the list:
                list_of_datasets.append(df_dict)
            
            print("\n")
            print(f"A total of {len(list_of_datasets)} dataframes were retrieved from the Excel file.\n")
            print(f"The dataframes correspond to the following Excel sheets: {list(xlsx_doc.keys())}\n")
            print("Returning a list of dictionaries. Each dictionary contains the key \'sheet\', with the original sheet name; and the key \'df\', with the Pandas dataframe object obtained.\n")
            print(f"Check the 10 first rows of the dataframe obtained from the first sheet, named {list_of_datasets[0]['sheet']}:\n")
            
            try:
                # only works in Jupyter Notebook:
                from IPython.display import display
                display((list_of_datasets[0]['df']).head(10))
            
            except: # regular mode
                print((list_of_datasets[0]['df']).head(10))
            
            return list_of_datasets
            
        elif (sheet_to_load is not None):        
        #Case where the user specifies which sheet of the Excel file should be loaded.
            
            if (has_header == True):
                
                dataset = pd.read_excel(file_path, sheet_name = sheet_to_load, na_values = how_missing_values_are_registered, verbose = True, parse_dates = True)
                # verbose = True for showing number of NA values placed in non-numeric columns.
                # parse_dates = True: try parsing the index.
                
            else:
                #No header
                dataset = pd.read_excel(file_path, sheet_name = sheet_to_load, header = None, na_values = how_missing_values_are_registered, verbose = True, parse_dates = True)
                
        
        else:
            #No sheet specified
            if (has_header == True):
                
                dataset = pd.read_excel(file_path, na_values = how_missing_values_are_registered, verbose = True, parse_dates = True)
                
            else:
                #No header
                dataset = pd.read_excel(file_path, header = None, na_values = how_missing_values_are_registered, verbose = True, parse_dates = True)
                
    print(f"Dataset extracted from {file_path}. Check the 10 first rows of this dataframe:\n")
    
    try:
        # only works in Jupyter Notebook:
        from IPython.display import display
        display(dataset.head(10))
            
    except: # regular mode
        print(dataset.head(10))
    
    return dataset
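
Two steps of the function above can be isolated in a short sketch: the extraction of the file extension with `os.path.splitext`, and the difference between `json.load` (file object) and `json.loads` (string). The helper `get_extension`, the file names, and the sample records are hypothetical and were chosen only for illustration; the files are written to a temporary directory.

```python
import json
import os
import tempfile

def get_extension(file_path):
    """Return the extension without the leading dot, e.g. 'xlsx'."""
    # os.path.splitext returns (root, '.ext'); [1][1:] drops the dot
    return os.path.splitext(file_path)[1][1:]

# Hypothetical sample records
records = [{"name": "Mary", "last": "Shelley"}]

with tempfile.TemporaryDirectory() as tmp_dir:
    json_path = os.path.join(tmp_dir, "authors.json")
    txt_path = os.path.join(tmp_dir, "authors.txt")

    # Save the same JSON content with two different extensions
    for path in (json_path, txt_path):
        with open(path, "w") as opened_file:
            json.dump(records, opened_file)

    # .json file: json.load receives the opened file object
    with open(json_path, "r") as opened_file:
        from_json_file = json.load(opened_file)

    # .txt file with JSON content: read the whole text, then json.loads
    with open(txt_path, "r") as opened_file:
        from_txt_file = json.loads(opened_file.read())

print(get_extension(json_path), get_extension(txt_path))
print(from_json_file == from_txt_file)
```

Both paths produce the same Python object, which is why the function can pass either result to `json_normalize`.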

# **Function for converting JSON object to dataframe**
- Objects may be:
    - String with JSON formatted text;
    - List with nested dictionaries (JSON formatted);
    - Each dictionary may contain nested dictionaries, or nested lists of dictionaries (nested JSON).
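
The object types listed above are related as follows: a string with JSON formatted text becomes a list of (possibly nested) dictionaries after parsing. A minimal sketch, using a hypothetical sample string and only the standard library:

```python
import json

# Hypothetical JSON formatted text, stored as a string
json_string = '[{"name": "Mary", "books": [{"title": "Frankenstein", "year": 1818}]}]'

# Parsing the string yields the equivalent list of nested dictionaries
json_list = json.loads(json_string)

print(type(json_string).__name__, "->", type(json_list).__name__)
# Nested values are reached by chaining keys and indices
print(json_list[0]["books"][0]["title"])
```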

In [None]:
def json_obj_to_pandas_dataframe (json_obj_to_convert, json_obj_type = 'list', json_record_path = None, json_field_separator = "_", json_metadata_prefix_list = None):
    
    import json
    import pandas as pd
    from pandas import json_normalize
    
    # JSON object in terms of Python structure: list of dictionaries, where each value of a
    # dictionary may be a dictionary or a list of dictionaries (nested structures).
    # example of highly nested structure saved as a list 'json_formatted_list'. Note that the same
    # structure could be declared and stored into a string variable. For instance, if you have a txt
    # file containing JSON, you could read the txt and save its content as a string.
    # json_formatted_list = [{'field1': val1, 'field2': {'dict_val': dict_val}, 'field3': [{
    # 'nest1': nest_val1}, {'nest2': nestval2}]}, {'field1': val1, 'field2': {'dict_val': dict_val}, 
    # 'field3': [{'nest1': nest_val1}, {'nest2': nestval2}]}]    

    # json_obj_type = 'list', in case the object was saved as a list of dictionaries (JSON format)
    # json_obj_type = 'string', in case it was saved as a string (text) containing JSON.

    # json_obj_to_convert: object containing JSON, or string with JSON content to parse.
    # Objects may be: string with JSON formatted text;
    # list with nested dictionaries (JSON formatted);
    # dictionaries, possibly with nested dictionaries (JSON formatted).
    
    # https://docs.python.org/3/library/json.html
    # https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.json_normalize.html#pandas.json_normalize
    
    # json_record_path (string): manipulates the parameter 'record_path' from json_normalize method.
    # Path in each object to list of records. If not passed, data will be assumed to 
    # be an array of records. If a given field from the JSON stores a nested JSON (or a nested
    # dictionary) declare it here to decompose the content of the nested data. e.g. if the field
    # 'books' stores a nested JSON, declare, json_record_path = 'books'
    
    # json_field_separator = "_" (string). Manipulates the parameter 'sep' from json_normalize method.
    # Nested records will generate names separated by sep. 
    # e.g., for json_field_separator = ".", {‘foo’: {‘bar’: 0}} -> foo.bar.
    # Then, if a given field 'main_field' stores a nested JSON with fields 'field1', 'field2', ...
    # the name of the columns of the dataframe will be formed by concatenating 'main_field', the
    # separator, and the names of the nested fields: 'main_field_field1', 'main_field_field2',...
    
    # json_metadata_prefix_list: list of strings (in quotes). Manipulates the parameter 
    # 'meta' from json_normalize method. Fields to use as metadata for each record in resulting 
    # table. Declare here the non-nested fields, i.e., the fields in the principal JSON. They
    # will be repeated in the rows of the dataframe to give the metadata (context) of the rows.
    
    # e.g. Suppose a JSON with the following structure: {'name': 'Mary', 'last': 'Shelley',
    # 'books': [{'title': 'Frankenstein', 'year': 1818}, {'title': 'Mathilda', 'year': 1819}, {'title': 'The Last Man', 'year': 1826}]},
    # Here, there are nested JSONs in the field 'books'. The fields that are not nested
    # are 'name' and 'last'.
    # Then, json_record_path = 'books'
    # json_metadata_prefix_list = ['name', 'last']

    
    if (json_obj_type == 'string'):
        # Use the json.loads method to convert the string to json
        json_file = json.loads(json_obj_to_convert)
        # json.load(): parses JSON from a file object (e.g. an opened file).
        # json.loads(): parses a string with JSON content.
        # e.g. json.loads() must be used to read a string with JSON before converting it
        # to a flat structure like a dataframe.
        # check: https://www.pythonpip.com/python-tutorials/how-to-load-json-file-using-python/#:~:text=The%20json.load%20%28%29%20is%20used%20to%20read%20the,and%20alter%20data%20in%20our%20application%20or%20system.
    
    elif (json_obj_type == 'list'):
        
        # make the json_file the object itself:
        json_file = json_obj_to_convert
    
    else:
        print ("Enter a valid JSON object type: \'list\', in case the JSON object is a list of dictionaries in JSON format; or \'string\', if the JSON is stored as a text (string variable).")
        return "error"
    
    dataset = json_normalize(json_file, record_path = json_record_path, sep = json_field_separator, meta = json_metadata_prefix_list)
    
    print(f"JSON object converted to a flat dataframe object. Check the 10 first rows of this dataframe:\n")
    
    try:
        # only works in Jupyter Notebook:
        from IPython.display import display
        display(dataset.head(10))
            
    except Exception: # regular mode
        print(dataset.head(10))
    
    return dataset
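
The Mary Shelley example from the comments above can be reproduced directly with `pd.json_normalize`. The block below is a minimal sketch that assumes the JSON arrives as a string; the variable names `json_string` and `flat_df` are illustrative and are not part of the notebook's function.

```python
import json
import pandas as pd

# Hypothetical input: the nested JSON from the comments, stored as a string.
json_string = (
    '{"name": "Mary", "last": "Shelley", "books": ['
    '{"title": "Frankenstein", "year": 1818}, '
    '{"title": "Mathilda", "year": 1819}, '
    '{"title": "The Last Man", "year": 1826}]}'
)

# json.loads parses a string; json.load would be used for a file object.
json_file = json.loads(json_string)

# record_path points to the nested list of records ('books');
# meta lists the non-nested fields repeated on every row;
# sep is the separator used when nested field names are concatenated.
flat_df = pd.json_normalize(
    json_file,
    record_path="books",
    meta=["name", "last"],
    sep="_",
)

print(flat_df)
```

Each book becomes one row, and the non-nested fields `name` and `last` are repeated on every row as context.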

# **Function for importing or exporting models, lists, or dictionaries**

In [None]:
def import_export_model_list_dict (action = 'import', objects_manipulated = 'model_only', model_file_name = None, dictionary_or_list_file_name = None, directory_path = '', model_type = 'keras', dict_or_list_to_export = None, model_to_export = None, use_colab_memory = False):
    
    import os
    import pickle
    import dill
    # pickle and dill save the file in binary serialized mode. So, we must use
    # open 'rb' or 'wb' when calling the context manager. The 'b' stands for 'binary',
    # informing the context manager (with statement) that a binary file will be processed
    import tensorflow as tf
    from statsmodels.tsa.arima.model import ARIMA, ARIMAResults
    from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler, MinMaxScaler
    from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet, LogisticRegression
    from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
    from sklearn.neural_network import MLPRegressor, MLPClassifier
    from xgboost import XGBRegressor, XGBClassifier
    
    # action = 'import' for importing a model and/or a dictionary;
    # action = 'export' for exporting a model and/or a dictionary.
    
    # objects_manipulated = 'model_only' if only a model will be manipulated.
    # objects_manipulated = 'dict_or_list_only' if only a dictionary or list will be manipulated.
    # objects_manipulated = 'model_and_dict' if both a model and a dictionary will be
    # manipulated.
    
    # model_file_name: string with the name of the file containing the model (for 'import');
    # or of the name that the exported file will have (for 'export')
    # e.g. model_file_name = 'model'
    # WARNING: Do not add the file extension.
    # Keep it in quotes. Keep model_file_name = None if no model will be manipulated.
    
    # dictionary_or_list_file_name: string with the name of the file containing the dictionary 
    # (for 'import');
    # or of the name that the exported file will have (for 'export')
    # e.g. dictionary_or_list_file_name = 'history_dict'
    # WARNING: Do not add the file extension.
    # Keep it in quotes. Keep dictionary_or_list_file_name = None if no 
    # dictionary or list will be manipulated.
    
    # DIRECTORY_PATH: path of the directory where the model will be saved,
    # or from which the model will be retrieved. If no value is provided (empty string),
    # the files will be read from or saved in the current working directory.
    # Notice that the model and the dictionary must be stored in the same path.
    # If a model and a dictionary will be exported, they will be stored in the same
    # DIRECTORY_PATH.
    
    # model_type: This parameter has effect only when a model will be manipulated.
    # model_type = 'keras' for deep learning keras/ tensorflow models with extension .h5
    # model_type = 'tensorflow_lambda' for deep learning tensorflow models containing 
    # lambda layers. Such models are compressed as tar.gz.
    # model_type = 'sklearn' for models from scikit-learn (non-deep learning)
    # model_type = 'xgb_regressor' for XGBoost regression models (non-deep learning)
    # model_type = 'xgb_classifier' for XGBoost classification models (non-deep learning)
    # model_type = 'arima' for ARIMA model (Statsmodels)
    
    # dict_or_list_to_export and model_to_export: 
    # These two parameters have effect only when ACTION == 'export'. In this case, they
    # must be declared. If ACTION == 'import', keep:
    # dict_or_list_to_export = None, 
    # model_to_export = None
    # If one of these objects will be exported, substitute None by the name of the object
    # e.g. if your model is stored in the global memory as 'keras_model' declare:
    # model_to_export = keras_model. Notice that it must be declared without quotes, since
    # it is not a string, but an object.
    # For exporting a dictionary named as 'dict':
    # dict_or_list_to_export = dict
    
    # use_colab_memory: this parameter should be used only on Google Colab (otherwise, it
    # will raise an error). Set use_colab_memory = True if you want to use the instant memory
    # from Google Colaboratory: you will upload or download the file, and it will be available
    # only while the kernel is running. It will be deleted when the kernel
    # dies, for instance, when you close the notebook.
    
    # If action == 'export' and use_colab_memory == True, then the file will be downloaded
    # to your computer (running the cell will start the download).
    
    # Check the directory path
    if (directory_path is None):
        # use the current working directory (empty string):
        directory_path = ""
        
        
    bool_check1 = (objects_manipulated != 'model_only')
    # bool_check1 == True if a dictionary will be manipulated
    
    bool_check2 = (objects_manipulated != 'dict_or_list_only')
    # bool_check2 == True if a model will be manipulated
    
    if (bool_check1 == True):
        #manipulate a dictionary
        
        if (dictionary_or_list_file_name is None):
            print("Please, enter a name for the dictionary or list.")
            return "error1"
        
        else:
            # Create the file path for the dictionary:
            dict_path = os.path.join(directory_path, dictionary_or_list_file_name)
            # Extract the file extension
            dict_extension = 'pkl'
            #concatenate:
            dict_path = dict_path + "." + dict_extension
            
    
    if (bool_check2 == True):
        #manipulate a model
        
        if (model_file_name is None):
            print("Please, enter a name for the model.")
            return "error1"
        
        else:
            # Create the file path for the model:
            model_path = os.path.join(directory_path, model_file_name)
            # Define the file extension according to the model type:
            
            #check model_type:
            if (model_type == 'keras'):
                model_extension = 'h5'
            
            elif (model_type == 'tensorflow_lambda'):
                model_extension = 'tar.gz'
            
            elif (model_type == 'sklearn'):
                model_extension = 'dill'
                #it could be 'pkl', though
            
            elif (model_type == 'xgb_regressor'):
                model_extension = 'json'
                #it could be 'ubj', though
            
            elif (model_type == 'xgb_classifier'):
                model_extension = 'json'
                #it could be 'ubj', though
            
            elif (model_type == 'arima'):
                model_extension = 'pkl'
            
            else:
                print("Enter a valid model_type: 'keras', 'tensorflow_lambda', 'sklearn', 'xgb_regressor', 'xgb_classifier', or 'arima'.")
                return "error2"
            
            #concatenate:
            model_path = model_path +  "." + model_extension
            
    # Now we have the full paths for the dictionary and for the model.
    
    if (action == 'import'):
        
        if (use_colab_memory == True):
             
            from google.colab import files
            # google.colab library must be imported only in case 
            # it is going to be used, for avoiding 
            # AWS compatibility issues.
            
            print("Click on the button for file selection and select the files from your machine that will be uploaded in the Colab environment.")
            print("Warning: the files will be removed from Colab memory after the Kernel dies or after the notebook is closed.")
            # this functionality requires the previous declaration:
            ## from google.colab import files
            colab_files_dict = files.upload()
            # The files are stored into a dictionary called colab_files_dict where the keys
            # are the names of the files and the values are the files themselves.
            ## e.g. if you upload a single file named "dictionary.pkl", the dictionary will be
            ## colab_files_dict = {'dictionary.pkl': file}, where file is actually a big string
            ## representing the contents of the file. The length of this value is the size of the
            ## uploaded file, in bytes.
            ## To access the file is like accessing a value from a dictionary: 
            ## d = {'key1': 'val1'}, d['key1'] == 'val1'
            ## we simply declare the key inside brackets and quotes, the same way we would do for
            ## accessing the column of a dataframe.
            ## In this example, colab_files_dict['dictionary.pkl'] access the content of the 
            ## .pkl file, and len(colab_files_dict['dictionary.pkl']) is the size of the .pkl
            ## file in bytes.
            ## To check the dictionary keys, apply the method .keys() to the dictionary (with empty
            ## parentheses): colab_files_dict.keys()
            
            for key in colab_files_dict.keys():
                #loop through each element of the list of keys of the dictionary
                # (list colab_files_dict.keys()). Each element is named 'key'
                print(f"User uploaded file {key} with length {len(colab_files_dict[key])} bytes.")
                # The key is the name of the file, and the length of the value
                ## correspondent to the key is the file's size in bytes.
                ## Notice that the content of the uploaded object must be passed 
                ## as argument for a proper function to be interpreted. 
                ## For instance, the content of a xlsx file should be passed as
                ## argument for Pandas .read_excel function; the pkl file must be passed as
                ## argument for pickle.
                ## e.g., if you uploaded 'table.xlsx' and stored it into colab_files_dict you should
                ## declare df = pd.read_excel(colab_files_dict['table.xlsx']) to obtain a dataframe
                ## df from the uploaded table. Notice that it is the value, not the key, that is
                ## the argument.
        
        if (bool_check1 == True):
            #manipulate a dictionary
            if (use_colab_memory == True):
                key = dictionary_or_list_file_name + "." + dict_extension
                # colab_files_dict[key] stores the content of the file (bytes), not a path.
                # Since files.upload() also saves the uploaded file in the working directory,
                # open the file by its name (the key) and pass it to pickle:
                with open(key, 'rb') as opened_file:
            
                    imported_dict = pickle.load(opened_file)
                    # The structure imported_dict = pickle.load(open(key, 'rb')) relies 
                    # on the GC to close the file. That's not a good idea: If someone doesn't use 
                    # CPython the garbage collector might not be using refcounting (which collects 
                    # unreferenced objects immediately) but e.g. collect garbage only after some time.
                    # Since file handles are closed when the associated object is garbage collected or 
                    # closed explicitly (.close() or .__exit__() from a context manager) the file 
                    # will remain open until the GC kicks in.
                    # Using 'with' ensures the file is closed as soon as the block is left - even if 
                    # an exception happens inside that block, so it should always be preferred for any 
                    # real application.
                    # source: https://stackoverflow.com/questions/39447362/equivalent-ways-to-json-load-a-file-in-python

                print(f"Dictionary or list {key} successfully imported to Colab environment.")
            
            else:
                #standard method
                with open(dict_path, 'rb') as opened_file:
            
                    imported_dict = pickle.load(opened_file)
                
                # 'rb' stands for read binary (read mode). For writing mode, 'wb', 'write binary'
                print(f"Dictionary or list successfully imported from {dict_path}.")
                
        if (bool_check2 == True):
            #manipulate a model
            # select the proper model
        
            if (model_type == 'keras'):
                
                if (use_colab_memory == True):
                    key = model_file_name + "." + model_extension
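                    # files.upload() saves the uploaded file in the working directory, so
                    # load the model from its file name (the key), not from the bytes content:
                    model = tf.keras.models.load_model(key)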
                    print(f"Keras/TensorFlow model: {key} successfully imported to Colab environment.")
            
                else:
                    #standard method
                    # We previously declared:
                    # from keras.models import load_model
                    model = tf.keras.models.load_model(model_path)
                    print(f"Keras/TensorFlow model successfully imported from {model_path}.")
            
            elif (model_type == 'tensorflow_lambda'):
                
                if (use_colab_memory == True):
                    
                    key = model_file_name + "." + model_extension
                    
                    # Try accessing the tar.gz file directly from the environment:
                    model_path = key
                    # to access from the dictionary:
                    # model_path = colab_files_dict[key]
                    
                    # Extract to a temporary 'tmp' directory:
                    #try:
                    # Compress the directory using tar
                    # https://www.gnu.org/software/tar/manual/tar.html
                    #    ! tar --extract --file=model_path --verbose --verbose tmp/
                    
                    #except:
                        
                    from tarfile import TarFile
                    # pickle, csv, tarfile, and zipfile are on Python standard library
                    # https://docs.python.org/3/library/tarfile.html
                    # https://docs.python.org/3/library/zipfile.html#module-zipfile
                    tar_file = TarFile.open(model_path, mode = 'r:gz')
                    tar_file.extractall("tmp/")
                    tar_file.close()
                    
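                    # The export step saves the model in 'saved_model/my_model' before
                    # compressing, so this is the path inside the extracted directory:
                    model = tf.keras.models.load_model("tmp/saved_model/my_model")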
                    print(f"TensorFlow model: {model_path} successfully imported to Colab environment.")
                    
                else:
                    #standard method
                    # Extract to a temporary 'tmp' directory:
                    #try:
                        # Compress the directory using tar
                        # https://www.gnu.org/software/tar/manual/tar.html
                    #    ! tar --extract --file=model_path --verbose --verbose tmp/
                    
                    #except:
                        
                    from tarfile import TarFile
                    # pickle, csv, tarfile, and zipfile are on Python standard library
                    # https://docs.python.org/3/library/tarfile.html
                    # https://docs.python.org/3/library/zipfile.html#module-zipfile
                    tar_file = TarFile.open(model_path, mode = 'r:gz')
                    tar_file.extractall("tmp/")
                    tar_file.close()
                    
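                    # The export step saves the model in 'saved_model/my_model' before
                    # compressing, so this is the path inside the extracted directory:
                    model = tf.keras.models.load_model("tmp/saved_model/my_model")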
                    print(f"TensorFlow model successfully imported from {model_path}.")
            
            elif (model_type == 'sklearn'):
                
                if (use_colab_memory == True):
                    key = model_file_name + "." + model_extension
                    
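                    # files.upload() saves the uploaded file in the working directory, so
                    # open it by its name (the key); colab_files_dict[key] is the bytes content:
                    with open(key, 'rb') as opened_file: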
            
                        model = dill.load(opened_file)
                    
                    print(f"Scikit-learn model: {key} successfully imported to Colab environment.")
            
                else:
                    #standard method
                    with open(model_path, 'rb') as opened_file:
            
                        model = dill.load(opened_file)
                
                    print(f"Scikit-learn model successfully imported from {model_path}.")
                    # For loading a pickle model:
                    ## model = pkl.load(open(model_path, 'rb'))
                    # 'rb' stands for read binary (read mode). For writing mode, 'wb', 'write binary'

            elif (model_type == 'xgb_regressor'):
                
                # Create an instance (object) from the class XGBRegressor:
                
                model = XGBRegressor()
                # Now we can apply the load_model method from this class:
                
                if (use_colab_memory == True):
                    key = model_file_name + "." + model_extension
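                    # load_model updates the object in place and returns None, so its
                    # output must not be assigned to the variable model.
                    # The uploaded file is read from the working directory by its name (key).
                    model.load_model(key)
                    print(f"XGBoost regression model: {key} successfully imported to Colab environment.")
            
                else:
                    #standard method
                    # load_model updates the object in place and returns None:
                    model.load_model(model_path)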
                    print(f"XGBoost regression model successfully imported from {model_path}.")
                    # model.load_model("model.json") or model.load_model("model.ubj")
                    # .load_model is a method from xgboost object
            
            elif (model_type == 'xgb_classifier'):

                # Create an instance (object) from the class XGBClassifier:

                model = XGBClassifier()
                # Now we can apply the load_model method from this class:
                
                if (use_colab_memory == True):
                    key = model_file_name + "." + model_extension
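                    # load_model updates the object in place and returns None, so its
                    # output must not be assigned to the variable model.
                    # The uploaded file is read from the working directory by its name (key).
                    model.load_model(key)
                    print(f"XGBoost classification model: {key} successfully imported to Colab environment.")
            
                else:
                    #standard method
                    # load_model updates the object in place and returns None:
                    model.load_model(model_path)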
                    print(f"XGBoost classification model successfully imported from {model_path}.")
                    # model.load_model("model.json") or model.load_model("model.ubj")
                    # .load_model is a method from xgboost object

            elif (model_type == 'arima'):
                
                if (use_colab_memory == True):
                    key = model_file_name + "." + model_extension
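                    # ARIMAResults.load expects a file name or file handle, so use the
                    # name of the uploaded file (the key), saved in the working directory:
                    model = ARIMAResults.load(key)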
                    print(f"ARIMA model: {key} successfully imported to Colab environment.")
            
                else:
                    #standard method
                    # We previously declared:
                    # from statsmodels.tsa.arima.model import ARIMAResults
                    model = ARIMAResults.load(model_path)
                    print(f"ARIMA model successfully imported from {model_path}.")
            
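        # The return statements must be outside the model block: otherwise, nothing
        # is returned when only a dictionary or list is imported.
        if (objects_manipulated == 'model_only'):
            # only the model should be returned
            return model
        
        elif (objects_manipulated == 'dict_or_list_only'):
            # only the dictionary or list should be returned:
            return imported_dict
        
        else:
            # Both objects are returned:
            return model, imported_dict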

    
    elif (action == 'export'):
        
        #Let's export the models or dictionary:
        if (use_colab_memory == True):
            
            from google.colab import files
            # google.colab library must be imported only in case 
            # it is going to be used, for avoiding 
            # AWS compatibility issues.
            
            print("The files will be downloaded to your computer.")
        
        if (bool_check1 == True):
            #manipulate a dictionary
            if (use_colab_memory == True):
                ## Download the dictionary
                key = dictionary_or_list_file_name + "." + dict_extension
                
                with open(key, 'wb') as opened_file:
            
                    pickle.dump(dict_or_list_to_export, opened_file)
                
                # this functionality requires the previous declaration:
                ## from google.colab import files
                files.download(key)
                
                print(f"Dictionary or list {key} successfully downloaded from Colab environment.")
            
            else:
                #standard method 
                with open(dict_path, 'wb') as opened_file:
            
                    pickle.dump(dict_or_list_to_export, opened_file)
                
                #to save the file, the mode must be set as 'wb' (write binary)
                print(f"Dictionary or list successfully exported as {dict_path}.")
                
        if (bool_check2 == True):
            #manipulate a model
            # select the proper model
        
            if (model_type == 'keras'):
                
                if (use_colab_memory == True):
                    ## Download the model
                    key = model_file_name + "." + model_extension
                    model_to_export.save(key)
                    files.download(key)
                    print(f"Keras/TensorFlow model: {key} successfully downloaded from Colab environment.")
            
                else:
                    #standard method
                    model_to_export.save(model_path)
                    print(f"Keras/TensorFlow model successfully exported as {model_path}.")
            
            elif (model_type == 'tensorflow_lambda'):
                
                if (use_colab_memory == True):
                    ## Download the model
                    
                    # Save your model in the SavedModel format
                    model_to_export.save('saved_model/my_model')
                    
                    #try:
                        # Compress the directory using tar
                        # https://www.gnu.org/software/tar/manual/tar.html
                    #    ! tar -czvf model_path saved_model/
                    
                    #except NotFoundError:
                        
                    from tarfile import TarFile
                    # pickle, csv, tarfile, and zipfile are on Python standard library
                    # https://docs.python.org/3/library/tarfile.html
                    # https://docs.python.org/3/library/zipfile.html#module-zipfile
                    # Create the archive in the working directory, with the same name
                    # that will be used for the download:
                    key = model_file_name + "." + model_extension
                    tar_file = TarFile.open(key, mode = 'w:gz')
                    tar_file.add('saved_model/')
                    tar_file.close()
                    
                    files.download(key)
                    print(f"TensorFlow model: {key} successfully downloaded from Colab environment.")
            
                else:
                    #standard method
                    # Save your model in the SavedModel format
                    model_to_export.save('saved_model/my_model')
                    
                    #try:
                        # Compress the directory using tar
                    #    ! tar -czvf model_path saved_model/
                    
                    #except NotFoundError:
                        
                    from tarfile import TarFile
                        # pickle, csv, tarfile, and zipfile are on Python standard library
                        # https://docs.python.org/3/library/tarfile.html
                        # https://docs.python.org/3/library/zipfile.html#module-zipfile
                    tar_file = TarFile.open(model_path, mode = 'w:gz')
                    tar_file.add('saved_model/')
                    tar_file.close()
                        
                    print(f"TensorFlow model successfully exported as {model_path}.")

            elif (model_type == 'sklearn'):
                
                if (use_colab_memory == True):
                    ## Download the model
                    key = model_file_name + "." + model_extension
                    
                    with open(key, 'wb') as opened_file:

                        dill.dump(model_to_export, opened_file)
                    
                    #to save the file, the mode must be set as 'wb' (write binary)
                    files.download(key)
                    print(f"Scikit-learn model: {key} successfully downloaded from Colab environment.")
            
                else:
                    #standard method
                    with open(model_path, 'wb') as opened_file:

                        dill.dump(model_to_export, opened_file)
                    
                    print(f"Scikit-learn model successfully exported as {model_path}.")
                    # For exporting a pickle model:
                    ## pkl.dump(model_to_export, open(model_path, 'wb'))
            
            elif ((model_type == 'xgb_regressor') or (model_type == 'xgb_classifier')):
                # In both cases, the XGBoost object is already loaded in global
                # context memory. So there is already the object for using the
                # save_model method, available for both classes (XGBRegressor and
                # XGBClassifier).
                # We can simply check if it is one type OR the other, since the
                # method is the same:
                
                if (use_colab_memory == True):
                    ## Download the model
                    key = model_file_name + "." + model_extension
                    model_to_export.save_model(key)
                    files.download(key)
                    print(f"XGBoost model: {key} successfully downloaded from Colab environment.")
            
                else:
                    #standard method
                    model_to_export.save_model(model_path)
                    print(f"XGBoost model successfully exported as {model_path}.")
                    # For exporting a pickle model:
                    ## pkl.dump(model_to_export, open(model_path, 'wb'))
            
            elif (model_type == 'arima'):
                
                if (use_colab_memory == True):
                    ## Download the model
                    key = model_file_name + "." + model_extension
                    model_to_export.save(key)
                    files.download(key)
                    print(f"ARIMA model: {key} successfully downloaded from Colab environment.")
            
                else:
                    #standard method
                    model_to_export.save(model_path)
                    print(f"ARIMA model successfully exported as {model_path}.")
        
        print("Export of files completed.")
    
    else:
        print("Enter a valid action, import or export.")
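
The binary read and write modes and the `with` context manager discussed in the comments above can be checked in isolation. The block below is a minimal sketch using only the standard library; `history_dict` and the temporary directory are illustrative assumptions, not objects from the notebook. The last step shows that content already held in memory as bytes (such as the values of the dictionary returned by a Colab upload) is deserialized with `pickle.loads`, not with `open`.

```python
import os
import pickle
import tempfile

# Hypothetical object to export: a small training-history dictionary.
history_dict = {"loss": [0.9, 0.5, 0.3], "epochs": 3}

with tempfile.TemporaryDirectory() as directory_path:
    # Build the file path the same way the function does: name + "." + extension.
    dict_path = os.path.join(directory_path, "history_dict") + "." + "pkl"

    # 'wb' = write binary. The file is closed as soon as the block is left.
    with open(dict_path, "wb") as opened_file:
        pickle.dump(history_dict, opened_file)

    # 'rb' = read binary.
    with open(dict_path, "rb") as opened_file:
        restored_dict = pickle.load(opened_file)

# When the serialized content is already in memory as bytes, there is no path
# to open: pickle.loads reads the bytes directly.
serialized_bytes = pickle.dumps(history_dict)
restored_from_bytes = pickle.loads(serialized_bytes)

print(restored_dict == history_dict, restored_from_bytes == history_dict)
```

The same pattern applies to `dill`, which exposes `dump`, `load`, `dumps`, and `loads` with the same meaning.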

# **Function for converting the datasets into NumPy arrays with correct format for CNN and RNN Architectures**
- These architectures require the conversion of the dataset to NumPy arrays with specific shapes. 
    - Use this function for converting the dataset (or list with prediction parameters) to the correct formats before feeding the deep learning models.
    - This function must be called before the train-test splitting: pass the arrays obtained from this function to the train-test splitting function.
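
As a minimal sketch of the shapes involved (not the function's actual implementation), the block below builds a small hypothetical dataframe with 4 predictors and the response as the last column, and reshapes it as described in the function's comments. The column names, the values, and the choice of 2 subsequences of 2 time steps for the CNN-LSTM case are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset: N = 6 rows, M = 4 predictors, response as last column.
df = pd.DataFrame({
    "col1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "col2": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
    "col3": [10, 20, 30, 40, 50, 60],
    "col4": [7, 8, 9, 10, 11, 12],
    "response": [0, 1, 0, 1, 0, 1],
})

# Separate predictors (all columns but the last) from the response (last column).
X_array = df.iloc[:, :-1].to_numpy()
y_array = df.iloc[:, -1].to_numpy()

# CNN and LSTM: (rows, columns, 1) for X and (rows, 1) for y.
X_cnn = X_array.reshape((X_array.shape[0], X_array.shape[1], 1))
y_cnn = y_array.reshape((y_array.shape[0], 1))

# Encoder-Decoder LSTM: y receives a third dimension as well.
y_encoder_decoder = y_cnn.reshape((y_cnn.shape[0], y_cnn.shape[1], 1))

# CNN-LSTM: each sample with 4 values is split into 2 subsequences of 2 steps.
X_cnn_lstm = X_array.reshape((X_array.shape[0], 2, 2, 1))

print(X_cnn.shape, y_cnn.shape, y_encoder_decoder.shape, X_cnn_lstm.shape)
# (6, 4, 1) (6, 1) (6, 1, 1) (6, 2, 2, 1)
```

The CNN-LSTM reshape only works when the number of predictors equals the number of subsequences times the steps per subsequence (here, 2 x 2 = 4).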

In [2]:
def convert_dataset_into_numpy_arrays (input_data, arrays_for = 'training', architecture_to_be_fed_with_returned_arrays = 'cnn'):
    
    #WARNING: PASS ONLY THE DESIRED COLUMNS AND KEEP THE RESPONSE VARIABLE AS THE LAST COLUMN
    #OF THE DATAFRAME input_data
    
    import numpy as np
    import pandas as pd
    
    # input_data is a dataframe or a list passed as input.
    
    # When making a single-entry prediction:
    # It is equivalent to pass a list or to pass a dataframe with 
    # a single row. In case of passing a list, each element of the list 
    # should correspond to one variable, in the same order of columns
    # of the dataframe used for training. For instance:
    # if the dataframe has 3 columns (predictive variables):
    # 'col1', 'col2', 'col3', the list should have 3 elements: 
    # input_data = [val1, val2, val3]. The first
    # element val1 is the value for the variable 'col1', val2 corresponds
    # to 'col2', and val3 corresponds to 'col3'. Then, the X_array output
    # from this function can be input on the model for predicting the
    # output for the combination val1, val2, val3.
    # Example: input_data = [1.0, 2.3, 7] (point is the decimal separator)
    
    # arrays_for = 'training' should be used when the generated arrays
    # will be used for training. In this case, the last column of the
    # input dataset must be the response variable (the labels); and only
    # the features selected as predictors should be on the other columns.
    
    # arrays_for = 'prediction': use this parameter if you are going to
    # pass the arrays to the model to obtain a prediction for them.
    # In this case, there is no response, so all values will be interpreted
    # as belonging to a predictive feature.
    
    # ARCHITECTURE_TO_BE_FED_WITH_RETURNED_ARRAYS (string) = 'cnn', 
    # 'lstm', 'encoder_decoder', or 'cnn_lstm', depending on the model 
    # that will be fed with the arrays.
    # Notice that LSTM and CNN architectures are fed with the same format of
    # arrays, so ARCHITECTURE_TO_BE_FED_WITH_RETURNED_ARRAYS = 'lstm'
    # and ARCHITECTURE_TO_BE_FED_WITH_RETURNED_ARRAYS = 'cnn' return the same
    # results.
    
    
    # CONVERSION OF THE DATASET INTO NUMPY ARRAYS FOR DEEP LEARNING MODELS:
    # - Step needed for adapting techniques of image classification (convolutions) and
    # text classification (RNNs like LSTMs or GRUs) for structured (tabular) data.
    
    # X = Subset containing only the predictive variables (columns);
    # X contains N rows (N entries) and M columns.
    # y = series containing the response variable. y is a single column with N values
    # (N rows or entries).
    
    # 1. We must convert the dataset into NumPy arrays.
    # - The dataframe X is converted into a big array X_array.
    # - Each row of the original dataset X becomes an element of the array X_array.
    #   - X_array is an array of arrays: each element from X_array is itself an array.
    #   - Since each row becomes one array, X_array will be an array with N elements, one per row.
    #   For a given element (a given array nested in X_array):
    #     - Each element of the nested array corresponds to one column of the original dataset.
    #     - In other words, we pick a row from the dataset X and save it as a separate array.
    #     - Then, we append this separate array as a new element of X_array.
    #     - Since the nested array is simply one row from X, it contains M values, one per column.
    #     - Also, the values are in the same order as the columns from X, since it is simply a copy.
    #    Each nested array is a sequence that will be read by the deep learning algorithm.
    
    # 2. We must convert y to an array of arrays. But this array is simpler: since there is a single
    # response, each array contains a single value, i.e., the response variable for a given row.
    # There will be N arrays with single value, since there are N rows.
    
    # Finally, for the LSTM and CNN, the arrays are reshaped as:
    # (X_array.shape[0], X_array.shape[1], 1)
    # (y_array.shape[0], 1)
    
    # For the Encoder-Decoder LSTM architecture, though, the final shape of y_array
    # must have the same number of dimensions as X_array (i.e., 3 dimensions). So, in this
    # case, X_array is still reshaped as:
    # (X_array.shape[0], X_array.shape[1], 1)
    # but y_array undergoes a second reshape to have 3 dimensions too:
    # Firstly (y_array.shape[0], 1), and finally:
    # (y_array.shape[0], y_array.shape[1], 1)
    
    # For the CNN-LSTM architecture, X_array requires another modification:
    # when using a hybrid CNN-LSTM model, each sample is further divided into
    # subsequences. The CNN interprets each subsequence, and the LSTM pieces
    # together the interpretations from the subsequences.
    # - Here, each sample is split into 2 subsequences of 2 timesteps each, so each
    # sample (row) must contain exactly 2 x 2 = 4 predictive features.
    # - So, for a total of N entries = X_array.shape[0] (rows of the original dataset),
    # the data must be converted into arrays of the following format before feeding the model:
    #   [X_array.shape[0], 2, 2, 1]
    # Here, y_array is the same used in the LSTM and CNN architectures, without a second
    # reshape: it is simply reshaped as:
    # (y_array.shape[0], 1)

    
    # Check if a valid architecture was selected: 
    if (architecture_to_be_fed_with_returned_arrays == 'lstm'):
        print("Preparing arrays for the Simplified Long Short-Term Memory (LSTM) Recurrent Neural Network (RNN) Architecture.\n")
    
    elif (architecture_to_be_fed_with_returned_arrays == 'cnn'):
        print("Preparing arrays for the Convolutional Neural Network (CNN) Architecture.\n")
    
    elif (architecture_to_be_fed_with_returned_arrays == 'cnn_ltsm'):
        print("Preparing arrays for the CNN-LSTM Hybrid Architecture.\n")
    
    elif (architecture_to_be_fed_with_returned_arrays == 'encoder_decoder'):
        print("Preparing arrays for the Encoder-Decoder Recurrent Neural Network (RNN) Architecture.\n")
   
    else:
        print("Please, input a valid architecture: \'lstm\', \'cnn\', \'cnn_ltsm\', or \'encoder_decoder\'.")
        return "error"
    
    
    boolean_check = (arrays_for == 'training')
    # The steps regarding the manipulation of the y-array will only
    # take place when boolean_check is True
    
    # Check input data type:
    # Notice that the data type is an object or a type (special word).
    # Then, it should not be declared in quotes
    if (type(input_data) == list):
        input_data_type = 'list'
        
    else:
        # It is a dataframe
        input_data_type = 'dataframe'
    
        print("Converting the dataframe to the array format required by CNNs and RNNs.\n")
        print("WARNING: This function should be used for modelling a single response variable.")
        print("\n")
        
        if (boolean_check): # arrays for training
            
            print("Before calling this function, make sure that the response variable is the last column of the dataframe passed as input.")
            print("The other columns should contain only the features selected as predictors.")
            print("\n")
            print("arrays_for = \'training\' - two arrays will be returned: the array with the predictors and the array with the responses.")
            print("Notice, though, that the last column will always be interpreted as the response, whereas the others will be interpreted as predictors")
            print("\n")
    
    if (arrays_for == 'prediction'):
        
        print("arrays_for = \'prediction\' - a single array will be returned: this array must be used as input of the model, to obtain the response.")
        print("In this case, all the columns or values input into this function should be representative of predictive features.")
        print("\n")
        
    print("WARNING: You must call this function before the train-test splitting: pass the output arrays to the function destined to splitting data into train and test sets.")
    print("It is equivalent to pass a list declaring input_data_type = \'list\' or to pass a single-row dataframe declaring input_data_type = \'dataframe\'.")
    print("\n")
    
    
    if (input_data_type == 'list'):
        
        print("Input data interpreted as Python list.")
        # Let's convert a single row for making the prediction
        
        # 1. Let's create a list of columns:
        cols_list = []
        
        for i in range(len(input_data)):
            # goes from i = 0 to i = len(input_data) - 1, index of
            # the last element of the list:
            cols_list.append(("col_" + str(i)))
            # The addition of i guarantees that all columns have different
            # names, so we can create the dataframe
        
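        # Create a dictionary where the column names are the keys and each
        # value from the list is wrapped in a single-element list (one row).
        # A list cannot be used as a dictionary key, so the columns and
        # the values are paired one by one. Since cols_list was created by
        # looping through input_data, both have the same size.
        pred_dict = {col: [value] for col, value in zip(cols_list, input_data)}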
        
        # Now, convert it to a dataframe:
        # This dataframe contains a single row. That is why it is
        # equivalent to input a single-row dataframe
        df = pd.DataFrame(data = pred_dict)
        print("The list was converted to a single-row Pandas dataframe before processing.")
    
    else:
        print("Input data interpreted as a Pandas dataframe.")
        # simply copy input_data into df:
        df = input_data
    
    #Check number of rows and columns of dataframe df
    #df.shape[0] is the number of rows, whereas df.shape[1] 
    #is the number of columns of dataframe df
    num_rows = df.shape[0]
    num_columns = df.shape[1]
    
    # Save the following values for comparison after obtaining the
    # NumPy arrays too
    
    # Slice a dataframe: df[i:j]
    # Slice the dataframe, getting only row i to row (j-1)
    # Indexing naturally starts from 0
    # Notice that the slicer defined as df[i:j] takes all columns from
    # the dataframe: it copies the dataframe structure (columns), but
    # selects only the specified rows.
    original_first_row = df[0:1]
    # This is equivalent to df[:1] - if there is no start for the
    # slicer, the start from 0 is implicit
    # slice: get rows from row 0 to row (1-1) = 0
    # Therefore, we will obtain a copy of the dataframe, but containing
    # only the first row (row 0)
    original_last_row = df[(num_rows - 1):(num_rows)] 
    # slice the dataframe from row (num_rows - 1), the index of the
    # last row, to row (num_rows) - 1 = (num_rows - 1)
    # Therefore, this slicer is a copy of the dataframe but containing
    # only its last row.
    
    # Slices are (fractions of) pandas dataframes, so elements must be
    # accessed through .iloc or .loc method
    
    # Let's get the first and last elements from the first column:
    # They have the index 0 (first index)
    first_element_first_col = original_first_row.iloc[0,0]
    last_element_first_col = original_last_row.iloc[0,0]
    
    if (boolean_check):
        
        #store the first and last response (last column). 
        # The index of the last column is (num_columns - 1) 
        # since it starts from zero:
        first_element_last_col = original_first_row.iloc[0,(num_columns - 1)]
        last_element_last_col = original_last_row.iloc[0,(num_columns - 1)]
    

    # Now, let's start the data conversion:
    # Notice that the steps regarding y, the response variables, only
    # take place when the value of the boolean boolean_check is True:
    
    #Initialize the arrays as empty NumPy arrays:
    X = np.array([])
    
    if (boolean_check): # executed only when boolean_check is True
        y = np.array([])
    
    for i in range(num_rows):
        
        #for loop may be declared as for i in range(N, M).
        #In this case, i goes from i = N to i = (M-1).
        #Also, for loop may be declared as for i in range (M)
        #In this case, i goes from i = 0 to i = (M-1).
        #Since the first element was not declared, i goes from zero (first index) to (num_rows-1),
        #which is the last index possible for the rows.
        #At the end of each loop, i = i + 1 (automatically)
        
        #Start the lists that will store the attributes/variables' values for that row (x_list);
        #and the response variable for that row (y_list).
        #Start as empty lists
        
        x_list = []
        
        if (boolean_check):
            y_list = []
        
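        #loop through each column, appending each value of the variables as a new element of the list
        #When training, the last column is the response, so only the first (num_columns - 1)
        #columns are features; when predicting, all of the columns are features.
        num_features = (num_columns - 1) if boolean_check else num_columns
        
        for j in range(num_features):
            
            #j goes from column j = 0 (first column) to column j = (num_features - 1). When
            #training, this is the index of the last column prior to the response variable
            
            #NOTICE: WHEN TRAINING, THE RESPONSE VARIABLE MUST BE THE LAST COLUMN
            x_list.append(df.iloc[i,j])
            #append element of row i and column j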
        
        if (boolean_check):
            
            y_list.append(df.iloc[i,((num_columns)-1)])
            #Append only the value of the response variable
            #If you put this command inside the second for loop, one y will be added for each j, so the list will get
            #a bunch of equal responses (one response for each column j, instead of a single value)   

            #Concatenate the y_list as elements of the NumPy arrays:
            #y_array must be in the form array([y1,y2,...,yn])
            #i.e., a single array, with all elements in sequence
            #To do so, we concatenate the array with the list.
            #CONCATENATE method appends a list or a Numpy array in the right to the end of a numpy array in the left,
            #forming a single array of elements.
            y = np.concatenate((y, (y_list)), axis = 0)

            #Theoretically, we could use the method numpy.stack to stack all of the arrays X.
            #We want X_array to be an array of arrays/lists, i.e., each element of the array is a list itself,
            #obtaining the format [[x1,..., xn],...,[x1,...xn]] - 1 array for each row, and each individual array
            #containing a number of elements equals to the number of columns.

        #The problem is that numpy.stack demands all stacked arrays to have the same shape. So, we 
        #would have to first store all the lists and then stack them at once.
        #Instead, we concatenate all values into a flat array and then use numpy.split to split it
        #into several smaller arrays.
        X = np.concatenate((X, (x_list)), axis = 0)
        
    
    #Now X is a single flat array containing (num_rows x number of predictive features) elements
    
    X = np.split(X, num_rows)
    #np.split(array, N) splits array into an array containing N sub-arrays;
    #in this case, N = number of rows = total elements of the array y_array: we want one array for each row.
    
    #Now, each row corresponds to one element of the array y_array (the response), and one sub-array from
    #X_array. Each element of this sub-array contains the value of one column on the original dataset.
    
    X_array = np.array(X)
    
    if (boolean_check):
        
        y_array = np.array(y)
    
    #Number of elements in each list (each sub-array from X_array):
    num_elements_in_each_X_array = len(x_list)
    #must be equals to the number of features
    print("\n")
    print("Successfully converted the data to the array format needed for the CNNs and RNNs.")
    print("\n")
    print("Now the data is in the same format as the datasets used for language processing: in these datasets, each row represents a sentence.")
    print("In turn, the sentences are split into tokens, which may be words or punctuation marks. So, each entry (row) of the dataset corresponds to a sentence, and the value for each column is a token.")
    print("Keras demands sequences with equal sizes, independently of the problem: text processing, image processing, etc.")
    print("As a consequence, the text sequences must be padded or truncated, so that all sequences have the same number of tokens. In the case of images, the axes lengths must be constant, so some images may get cropped, and others may have to be filled.")
    print("\n")
    print("Here, the sequences are also the rows, and the number of elements of the sequence is the number of columns itself: instead of tokens, we have the values for each variable as the columns, taken here as the sequence elements. Naturally, all entries must have the same sequence size, in this case, the same number of columns (attributes).")
    print("The total of loops performed by an RNN is the number of elements of each sequence: in word processing, it is the number of tokens; in time series analyses, it is the number of variables (columns).")
    print("e.g. if all sequences have 100 tokens, then the RNN will loop 100 times. If your time series is described by 10 features, the RNN will loop 10 times.")
    
    print("\n")
    print("Now, let\'s compare the properties and the first and last elements of the arrays and of the original dataframe to check that the array generation did not result in an error.")
    print(f"Original dataframe shape = {df.shape}")
    print("\n")
             
    print("The generated X-arrays should have one element for each feature on the dataset.")
    
    if (boolean_check):
        
        # Remove 1 (the response column) to get the total of predictive
        # variables columns:
        print(f"Original number of column features = {num_columns - 1}")
        print(f"Number of elements in each X-array = {num_elements_in_each_X_array}")
        
        if (num_elements_in_each_X_array == (num_columns - 1)):
            print("The X-arrays indeed have one element per predictive variable.")
            print("\n")
        else:
            print("WARNING: review the input data passed: the function generated X-arrays with number of elements different from the original number of predictive variables columns.")
            print("\n")      
    
    else:
        # Do not remove one column. All columns from the original dataset
        # are from predictive features
        print(f"Original number of column features (all of the columns) = {num_columns}")
        print(f"Number of elements in each X-array = {num_elements_in_each_X_array}")
        
        if (num_elements_in_each_X_array == (num_columns)):
            print("The X-arrays indeed have one element per predictive variable.")
            print("\n")
        else:
            print("WARNING: review the input data passed: the function generated X-arrays with number of elements different from the original number of predictive variables columns.")
            print("\n") 
    
    if (boolean_check):
        # Test only when there is an y-array
        print("The total of X-arrays must be equal to the total of y-arrays:")
        print(f"Total of X-arrays = {X_array.shape[0]}")
        # X_array.shape is a tuple (M, N), where
        # M = total of arrays; N = total of elements in each array = 
        # num_elements_in_each_X_array (length of the list used to
        # generate the array). Then, here we want the first element
        # from the tuple:
        print(f"Total of y-arrays = {len(y_array)}")
        # y_array.shape is a tuple (M,), with second position empty
        # so we can simply pick the length of the array
        
        if (X_array.shape[0] == len(y_array)):
            print("The number of X-arrays and y-arrays are indeed equal.")
            print("\n")
        else:
            print("WARNING: review the input data passed: the function generated a total of X-arrays different from the total of y-arrays.")
            print("\n")
    
    # Test for both cases:
    print("The total of X-arrays must be equal to the number of rows of the original dataset: each row (entry) must have been converted into an array.")
    print(f"Total of X-arrays = {X_array.shape[0]}")
    print(f"Original number of rows = {num_rows}")
    
    if (X_array.shape[0] == num_rows):
        print("There is indeed one array per row of the original dataset.")
        print("\n")
    else:
        print("WARNING: review the input data passed: the function generated a total of X-arrays different from the total of rows of the original dataset.")
        print("\n")
    
    if (boolean_check):
        # Test only when there is an y-array
        print("The first y-array must store the first element from the response column; whereas the last y-array must store the last element of the response column.")
        print(f"1st element from the response column of the dataset = {first_element_last_col}")
        print(f"1st y-array = {y_array[0]}")
        
        if (y_array[0] == first_element_last_col):
            print("1st element from the response column was correctly stored in the first y-array.")
            print("\n")
        else:
            print("WARNING: review the input data passed: the function generated a 1st y-array different from the 1st element from the response variable column of the dataset.")
            print("\n")
        
        print(f"Last element from the response column of the dataset = {last_element_last_col}")
        print(f"Last y-array = {y_array[(len(y_array) - 1)]}")
        # there are len(y_array) arrays in total, so the last index is 
        # len(y_array) - 1
       
        if (y_array[(len(y_array) - 1)] == last_element_last_col):
            print("Last element from the response column was correctly stored in the last y-array.")
            print("\n")
        else:
            print("WARNING: review the input data passed: the function generated a last y-array different from the last element from the response variable column of the dataset.")
            print("\n")
            
    # X_array is an array of arrays:
    # Each row from the original dataset was converted into a sequence, 
    # i.e, into a separated array;
    # Each of these sequences (arrays) was stored as an element of the
    # bigger array named X_array.
    
    # So, if we have N predictive features, 
    # and M rows in the dataset, we have now an array like:
    
    # X_array = array([[val_0_0, val_1_0, ..., val_(N-1)_0],
                    # [val_0_1, val_1_1, ..., val_(N-1)_1],
                    # ...
                    # [val_0_(M-1), val_1_(M-1), ..., val_(N-1)_(M-1)]
                    # ])
                    
    # where val_0_0 is the value of the first column on the first
    # entry (row 0),..., val_(N-1)_0 is the value of the last column
    # (column N-1) for row 0, ..., val_0_i is the value of the 1st
    # element for row i (the row stored as the i-th array), 
    # val_j_i is the value for column j, row i,..., and 
    # val_(N-1)_(M-1) is the value for the last column (N-1) and last
    # row (M-1)
    
    # If we get the first array from X_array, we would have:
    # X_array[0] = array([val_0_0, val_1_0, ..., val_(N-1)_0])
    # That is of course the 1st row from the original dataset.
    
    # So, the row i from the original dataset would be accessed as:
    # X_array[i] = array([val_0_i, val_1_i, ..., val_(N-1)_i])
    
    # On the other hand, each element from the array is also indexed.
    # For instance: X_array[i][0] = val_0_i
    # and the element correspondent to column j, row i is accessed as:
    # X_array[i][j] = val_j_i
    
    # If val_j_i was also an array, it would be also indexed, and the
    # element on the index k would be accessed as X_array[i][j][k]
    
    # Therefore, simply put more brackets to index successive dimensions
    # Here we have two dimensions (one array into another), so we must
    # index two indices.
    
    
    # Check if the first element from the first X-array corresponds
    # to the 1st element of the 1st column of the dataset:
    print("The 1st element stored in the first X-array must correspond to the element on the 1st column and 1st row of the dataset.")
    print(f"Element on the 1st row and 1st column of the dataset = {first_element_first_col}")
    print(f"1st element of the 1st array = {X_array[0][0]}")
    
    if (X_array[0][0] == first_element_first_col):
        
        print("1st element from the 1st column was correctly stored as the 1st element of the 1st array.")
        print("\n")
    else:
        print("WARNING: review the input data passed: the function generated a 1st element of the first X-array different from the 1st element of the 1st column of the dataset.")
        print("Compare the 1st X-array with the 1st row from the dataset. 1st X array:")
        print(X_array[0])
        print("1st row of the dataset:")
        print(original_first_row)
        print("\n")
    
    print("Finally, the 1st element stored in the last X-array must correspond to the element on the 1st column and last row of the dataset.")
    print(f"Element on the last row and 1st column of the dataset = {last_element_first_col}")
    print(f"1st element of the last array = {X_array[(X_array.shape[0] - 1)][0]}")
    # The total of arrays is X_array.shape[0], so the index of the last
    # array is (X_array.shape[0] - 1)
    # The index [0] from this last array is its 1st element
    
    if (X_array[(X_array.shape[0]-1)][0] == last_element_first_col):
        
        print("Last element from the 1st column was correctly stored as the 1st element of the last array.")
        print("\n")
    else:
        print("WARNING: review the input data passed: the function generated a 1st element of the last X-array different from the last element of the 1st column of the dataset.")
        print("Compare the last X-array with the last row from the dataset. last X array:")
        print(X_array[(X_array.shape[0] - 1)])
        print("Last row of the dataset:")
        print(original_last_row)
        print("\n")
    
    # Now that the conversion was checked, perform the last reshapes 
    # for getting the data ready for the model.
    # The final reshapes will result into data much more difficult to
    # compare with the original dataset, and so we firstly check the
    # conversion. If the array conversion was correct, the final step
    # should not result in (extra) shape problems
    
    print("Now that we checked the conversion of the dataset to NumPy arrays, we can perform the last reshapes for getting the data ready for the models.")
    print("\n")
    
    # If preparing data for the CNN-LSTM Architecture, perform a special 
    # reshape:
    if (architecture_to_be_fed_with_returned_arrays == 'cnn_ltsm'):
        
        # reshape from [samples, timesteps] into 
        # [samples, subsequences, timesteps, features]
        # As such, we will split each sample into 2 subsequences of 2 timesteps 
        # per subsequence. Notice that this reshape requires exactly
        # 2 x 2 = 4 predictive features per sample.
        X_array = X_array.reshape((X_array.shape[0], 2, 2, 1))
    
    else:
        # ordinary reshape of X_array, the same for the other 3 architectures:
        
        # reshape from [samples, timesteps] into [samples, timesteps, features]
        # We must add a third dimension to X_array
        X_array = X_array.reshape(X_array.shape[0], X_array.shape[1], 1)
    
    print(f"Final shape of the X-arrays = {X_array.shape}")
    
    if (boolean_check):
        # If there are y-arrays, reshape them:
        
        # ordinary reshape of y_array, the same for all:
        y_array = y_array.reshape(y_array.shape[0], 1)
        
        # If preparing data for training the Encoder-Decoder RNN architecture,
        # perform the special (second) reshape to obtain the 3rd dimension:
        if (architecture_to_be_fed_with_returned_arrays == 'encoder_decoder'):
            # When using the encoder-decoder architecture, y_arrays must have
            # the same shape as X_arrays (i.e., must have 3 dimensions):
            y_array = y_array.reshape(y_array.shape[0], y_array.shape[1], 1)
        
        # We are working with only a single response y.
        # If we had two responses, y would be a numpy array in the 
        # format [y1, y2], i.e., each row would be an array with two 
        # elements.
        # In this case, the command would be again 
        # y_array.reshape(y_array.shape[0], y_array.shape[1], 1)
        # The command y_array.reshape(y_array.shape[0], 1) is used 
        # only when we have a single response, the present situation
        
        # Here, our y_arrays are arrays containing a single element:
        # y_array = ([[y_0],
                    # [y_1],
                    # ...,
                    # [y_(M-1)]
                    # ])
        # This configuration is equivalent to a single array with M
        # elements (one per row), so the reshape is simpler.
        
        print(f"Final shape of the y-arrays = {y_array.shape}")
        print("\n")
        print("Returning X and y arrays in the correct format for the deep learning models.") 
        print("Now, you can pass X_array and y_array as inputs of the function \'split_data_into_train_and_test\' to split them into training and test sets, as usual.")
        
        return X_array, y_array
    
    else:
        # There is no y_array to return
        print("Returning X arrays in the correct format for the deep learning models.") 
        print("Now, you can pass X_array as input of the trained model to get its predictions.")
        
        return X_array

    # Final note:
    # In this function, we directly applied the .reshape method instead
    # of using the function np.reshape. That is because X_array and
    # y_array were originally created as NumPy arrays, so they have
    # the .reshape method available.
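
The reshapes described in the function above can be checked on a toy array. The block below is only a minimal sketch under stated assumptions: `X_toy` and `y_toy` are hypothetical arrays with 3 rows and 4 predictive features (4 features are needed so that each sample can be split into 2 subsequences of 2 timesteps for the CNN-LSTM case). It reproduces the shapes only and does not call the function itself.

```python
import numpy as np

# Hypothetical toy data: 3 rows (samples) and 4 predictive features
X_toy = np.arange(12, dtype=float).reshape(3, 4)
y_toy = np.array([10.0, 20.0, 30.0])

# LSTM, CNN and Encoder-Decoder inputs: [samples, timesteps, features]
X_seq = X_toy.reshape(X_toy.shape[0], X_toy.shape[1], 1)

# CNN-LSTM input: [samples, subsequences, timesteps, features]
X_cnn_lstm = X_toy.reshape((X_toy.shape[0], 2, 2, 1))

# Responses for LSTM, CNN and CNN-LSTM: [samples, 1]
y_2d = y_toy.reshape(y_toy.shape[0], 1)

# Responses for the Encoder-Decoder: second reshape adds the 3rd dimension
y_3d = y_2d.reshape(y_2d.shape[0], y_2d.shape[1], 1)

print(X_seq.shape, X_cnn_lstm.shape, y_2d.shape, y_3d.shape)
# prints: (3, 4, 1) (3, 2, 2, 1) (3, 1) (3, 1, 1)
```

Since `reshape` keeps the element order, the first row `[0., 1., 2., 3.]` becomes the two subsequences `[0., 1.]` and `[2., 3.]` in the CNN-LSTM layout.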

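The row-by-row concatenation followed by `np.split` used in the function above should produce the same 2-D array as selecting the predictor columns and converting them in one step. The sketch below compares both approaches, assuming a hypothetical dataframe `toy_df` with numeric columns only and the response in the last column. It illustrates the equivalence and is not a replacement for the checks performed by the function.

```python
import numpy as np
import pandas as pd

# Hypothetical dataframe: two predictors and the response as the last column
toy_df = pd.DataFrame({
    "feat_a": [1.0, 2.0, 3.0],
    "feat_b": [4.0, 5.0, 6.0],
    "response": [0.1, 0.2, 0.3],
})

num_rows, num_columns = toy_df.shape

# Loop-based conversion: flatten row by row, then split into one array per row
X_flat = np.array([])
for i in range(num_rows):
    row_values = [toy_df.iloc[i, j] for j in range(num_columns - 1)]
    X_flat = np.concatenate((X_flat, row_values), axis=0)
X_loop = np.array(np.split(X_flat, num_rows))

# Vectorized conversion: all columns but the last are predictors
X_vec = toy_df.iloc[:, :-1].to_numpy(dtype=float)
y_vec = toy_df.iloc[:, -1].to_numpy(dtype=float)

print(np.array_equal(X_loop, X_vec))
```

The vectorized form avoids one `np.concatenate` call per row, which may matter for large datasets.
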
# **Function for making predictions with the models**

In [None]:
def make_model_predictions (model_object, X, dataframe_for_concatenating_predictions = None, col_with_predictions_suffix = None):
    
    import numpy as np
    import pandas as pd
    import tensorflow as tf
    from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.neural_network import MLPRegressor
    from sklearn.neural_network import MLPClassifier
    from xgboost import XGBRegressor
    from xgboost import XGBClassifier
    
    # predict_for = 'subset' or predict_for = 'single_entry'
    # The function will automatically detect if it is dealing with lists, NumPy arrays
    # or Pandas dataframes. If X is a list or a single-dimension array, predict_for
    # will be set as 'single_entry'. If X is a multi-dimension NumPy array (as the
    # outputs for preparing data - even single_entry - for deep learning models), or if
    # it is a Pandas dataframe, the function will set predict_for = 'subset'
    
    # X = subset of predictive variables (dataframe, NumPy array, or list).
    # If PREDICT_FOR = 'single_entry', X should be a list of parameters values.
    # e.g. X = [1.2, 3, 4] (dot is the decimal separator, commas separate the values). 
    # Notice that the list should contain only the numeric values, in the same order as the
    # corresponding columns.
    # If PREDICT_FOR = 'subset' (prediction for multiple entries), X should be a dataframe 
    # (subset) or a multi-dimensional NumPy array of the parameters values, as usual.
    
    # model_object: object containing the model that will be analyzed. e.g.
    # model_object = elastic_net_linear_reg_model
    
    # dataframe_for_concatenating_predictions: if you want to concatenate the predictions
    # to a dataframe, pass it here:
    # e.g. dataframe_for_concatenating_predictions = df
    # If the dataframe must be the same one passed as X, repeat the dataframe object here:
    # X = dataset, dataframe_for_concatenating_predictions = dataset.
    # Alternatively, if dataframe_for_concatenating_predictions = None, 
    # the prediction will be returned as a series or NumPy array, depending on the input format.
    # Notice that the concatenated predictions will be added as a new column.
    
    # col_with_predictions_suffix = None. If the predictions are added as a new column
    # of the dataframe dataframe_for_concatenating_predictions, you can declare this
    # parameter as string with a suffix for identifying the model. If no suffix is added, the new
    # column will be named 'y_pred'.
    # e.g. col_with_predictions_suffix = '_keras' will create a column named "y_pred_keras". This
    # parameter is useful when working with multiple models. Always start the suffix with underscore
    # "_" so that no blank spaces are added; the suffix will not be merged to the column; and there
    # will be no confusion with the dot (.) notation for methods, JSON attributes, etc.
    
    
    # Check the type of input: if we are predicting the output for a subset (NumPy array reshaped
    # for deep learning models or Pandas dataframe); or predicting for a single entry (single-
    # dimension NumPy array or Python list).
    
    # 1. Check if a list was input. Lists do not have the attribute shape, present in dataframes
    # and NumPy arrays. Accessing the attribute shape from a list will raise the Exception error
    # named AttributeError
    # Try to access the attribute shape. If the error AttributeError is raised, it is a list, so
    # set predict_for = 'single_entry':
    
    try:
        
        # Try accessing the shape attribute
        X_shape = X.shape
        
        # Now, check the type of the object X: if it is a dataframe or a numpy array:
        X_type = type(X)
        
        # type(X) == numpy.ndarray (or np.ndarray if NumPy was imported as np) if it is
        # an array
        # type(X) == pandas.core.frame.DataFrame (or pd.core.frame.DataFrame if Pandas
        # was imported as pd) if it is a pandas dataframe.
        # Notice that the object type is not a string, so it should not be declared in quotes.
        
        if (X_type == np.ndarray):
            
            # It is a NumPy array
            # If this array was previously manipulated for the deep learning models, it has 3
            # dimensions, so: X_shape = (N, M, 1), N = number of arrays (the number of rows
            # of the original dataset), and M = number of elements on each array (the number
            # of columns of the original dataset)
            
            # If the array has the 3rd dimension, we should consider the prediction for 'subset',
            # even if it is for a single entry. That is because the array is already reshaped
            # and the single_entry code would reshape again.
            
            # Let's try to access the 3rd dimension as X_shape[2]. 
            # If there is no 3rd dimension, the exception error IndexError will be raised, since
            # there is no index 2:
            try:
                
                # Try accessing the 3rd dimension:
                third_dim = X_shape[2]
                
                # Since it was accessed, the array is already in the correct shape, so set
                # prediction for subset:
                predict_for = 'subset'
            
            except IndexError:
                
                # The index error was raised because there is no 3rd dimension. Then, we are
                # dealing with a numpy array equivalent to a list. Set prediction for single_entry.
                # It is true even if there are two dimensions like (N, 1) - (2nd dimension added
                # by the function for correcting the array format for deep learning).
                predict_for = 'single_entry'
        
        else:
            # It is a Pandas dataframe
            # Set prediction for a subset:
            predict_for = 'subset'
        
        
    except AttributeError:
        
        # The AttributeError is raised when there is no attribute. 
        # Since Python lists do not have the shape attribute, 
        # the input of a list raises this error when trying to access the object's shape.
        # Since it is a list, set predict_for = 'single_entry':
        predict_for = 'single_entry'
        
    
    if (predict_for == 'single_entry'):
        
        print("Making prediction for a single entry X.")
        print("X must be a list with values in the order of the correspondent columns of the dataset.")
        print("In other words: declare X as a Python list of values correspondent to each variable, using the same order of variables (columns) used in the dataset.")
        
        # Reshape the entry into a single row (1 sample, M features), the 2D format
        # expected by the predict method:
        X_reshaped = np.reshape(np.array(X), (1, -1))
        
        y_pred = model_object.predict(X_reshaped)
            
        print(f"Output value predicted for the entry parameters = {y_pred}\n")
        print("Attention: for classification with Keras/TensorFlow and other deep learning frameworks, this output will not be a class, but an array with the probability that the entry belongs to each class. In this case, it is better to use the function calculate_class_probability below, setting type_of_model = 'deep_learning'. That function returns a dataframe with the probability calculated for each class.")
        print("The output class from the deep learning model is the class with the highest probability indicated by the predict method. Again, the order of classes is the order they appear in the training dataset. For instance, when using the ImageDataGenerator, the 1st class is the name of the 1st read directory, the 2nd class is the 2nd directory, and so on.")
            
        print("Returning only the predicted value.")
            
        return y_pred
    
    else:
        
        # prediction for a subset
        y_pred = model_object.predict(X)
        print("Attention: for classification with Keras/TensorFlow and other deep learning frameworks, this output will not be a class, but an array with the probability that the entry belongs to each class. In this case, it is better to use the function calculate_class_probability below, setting type_of_model = 'deep_learning'. That function returns a dataframe with the probability calculated for each class.")
        print("The output class from the deep learning model is the class with the highest probability indicated by the predict method. Again, the order of classes is the order they appear in the training dataset. For instance, when using the ImageDataGenerator, the 1st class is the name of the 1st read directory, the 2nd class is the 2nd directory, and so on.")
        
        # If y_pred came from a RNN with the parameter return_sequences = True and/or
        # return_state = True, then the hidden and/or cell states from the LSTMs
        # were returned. So, the returned array has at least one extra dimension (two
        # if both parameters are True). On the other hand, we want only the first dimension,
        # correspondent to the actual output.
        
        # Remember that, due to the reshapes for preparing data for deep learning models,
        # y_pred must have at least 2 dimensions: (N, 1), where N is the number of rows of
        # the original dataset. But y_pred returned from a model with return_sequences = True
        # or return_state = True will be of dimension (N, N, 1). If both parameters are True,
        # the dimension is (N, N, N, 1), since there are extra arrays for both the hidden and
        # cell states.
        
        # The conclusion is that there is a third dimension only for models where return_sequences
        # = True or return_state = True
        
        # Check if y_pred is a numpy array, instead of a Pandas dataframe:
        
        if (type(y_pred) == np.ndarray):
            
                # Try accessing the array's 3rd dimension. If there is no 3rd dimension,
                # the exception error IndexError will be raised.
                # Notice: if 4 or more dimensions are present, we can still access
                # the 3rd dimension (naturally).
                try:
                    
                    third_dim = y_pred.shape[2]
                
                    # If we could access the third dimension, then return_state and/or
                    # return_sequences = True
                    
                    # We want only the values stored as the 1st dimension
                    # y_pred is an array where each element is an array with two elements. 
                    # To get only the first elements:
                    # (slice the arrays: get all values only for dimension 0, the 1st dim):
                    y_pred = y_pred[:,0]
                    # if we used y_pred[:,1] we would get the second element, 
                    # which is the hidden state h (input of the next LSTM unit).
                    # It happens because of the parameter return_sequences = True. 
                    # If return_state = True, there would be a third element, corresponding 
                    # to the cell state c.
                    # Notice that we want only the 1st dimension (0), no matter the case.
                
                except IndexError:
                
                    # The index error was raised because there is no 3rd dimension. Then,
                    # we do not have to worry with the returned states
                    # simply set y_pred as itself:
                    y_pred = y_pred
                    # Even though the slicing y_pred = y_pred[:,0] would not generate an
                    # error, it would unnecessarily modify the shape of the array (extra
                    # critical step).
                    
                    # Also, the array obtained as y_pred[:,0] when there are 3 or more 
                    # dimensions has same shape as y_pred when there are only 1 or 2 
                    # dimensions. So, the extra modification of the shape would eliminate
                    # this correspondence.
                
                # If we wanted only the first array, we could set y_pred = y_pred[0]
        
        # Check if there is a dataframe to concatenate the predictions
        if not (dataframe_for_concatenating_predictions is None):
            
            # there is a dataframe for concatenating the predictions
            
            # concatenate the predicted values with dataframe_for_concatenating_predictions.
            # Add the predicted values as a column:
            
            # check if there is a suffix:
            if not (col_with_predictions_suffix is None):
                # There is a suffix declared
                # Since there is a suffix, concatenate it to 'y_pred':
                col_name = "y_pred" + col_with_predictions_suffix
            
            else:
                # Create the column name as the standard.
                # The name of the new column is simply 'y_pred'
                col_name = "y_pred"
            
            # Set a local copy of the dataframe to manipulate, so that the original
            # object is not modified:
            X_copy = dataframe_for_concatenating_predictions.copy(deep = True)
            
            # Add the predictions as the new column named col_name:
            X_copy[col_name] = y_pred
            
            print(f"The prediction was added as the new column {col_name} of the dataframe, and this dataframe was returned. Check its 10 first rows:\n")
            print(X_copy.head(10))
            
            return X_copy
        
        else:
            
            print("Returning only the predicted values. Check the 10 first values of the series:\n")
            print(y_pred[:10]) # slice the first 10 elements of the series, array or list
            # dataset[:,10]: all rows for column 10 of dataset
            # dataset[1,:]: all columns for row 1 of dataset.
            
            return y_pred
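
The input check at the top of the function above can be tried in isolation. The sketch below is a minimal reproduction of that logic, assuming a hypothetical helper named `infer_prediction_mode` (it is not part of this notebook): objects without the `shape` attribute (lists) and NumPy arrays with fewer than 3 dimensions are treated as a single entry, whereas dataframes and arrays already reshaped to 3 dimensions are treated as a subset.

```python
import numpy as np
import pandas as pd

def infer_prediction_mode(X):
    # Hypothetical helper (assumption for illustration): mirrors the
    # try/except checks of the function above without raising exceptions.

    # Lists do not have the shape attribute:
    if not hasattr(X, "shape"):
        return "single_entry"

    # NumPy arrays: only arrays with a 3rd dimension were already reshaped
    # for the deep learning models.
    if isinstance(X, np.ndarray):
        return "subset" if X.ndim >= 3 else "single_entry"

    # Any other object with a shape (e.g. a Pandas dataframe):
    return "subset"

print(infer_prediction_mode([1.2, 3, 4]))
print(infer_prediction_mode(np.array([1.2, 3, 4])))
print(infer_prediction_mode(np.zeros((5, 3, 1))))
print(infer_prediction_mode(pd.DataFrame({"a": [1, 2]})))
```

Notice that, under this rule, a regular 2D array of shape (N, M) is also classified as a single entry, so multiple rows should be passed as a dataframe or as a 3D array.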

# **Function for calculating probabilities associated to each class**
- Set the list_of_classes returned from function `retrieve_classes_used_to_train` as the input of this function.
- The predictions (outputs) from deep learning models (e.g. Keras/TensorFlow models) are themselves the probabilities associated to each possible class.
    - For Scikit-learn and XGBoost, we must use a specific method for retrieving the probabilities.
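
As a minimal sketch of the difference between the two Scikit-learn methods, the example below trains a logistic regression on made-up toy data (the arrays `X_train`, `y_train` and `X_new` are assumptions for illustration only). The method `predict` returns the class labels, whereas `predict_proba` returns one column of probabilities per class, in the order stored in the attribute `classes_`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data, made up for illustration: one feature, two classes.
X_train = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X_train, y_train)

X_new = np.array([[0.5], [4.5]])

# Class labels predicted for each row:
labels = clf.predict(X_new)

# Probabilities: one row per entry, one column per class (same order as clf.classes_):
probas = clf.predict_proba(X_new)

print("classes:", clf.classes_)
print("labels:", labels)
print("probabilities:\n", probas.round(3))
```

Each row of `probas` sums to 1, and the predicted label is the class whose column holds the highest probability.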

In [None]:
def calculate_class_probability (model_object, X, list_of_classes, type_of_model = 'other', dataframe_for_concatenating_predictions = None):

    import numpy as np
    import pandas as pd
    
    # predict_for = 'subset' or predict_for = 'single_entry'
    # The function will automatically detect if it is dealing with lists, NumPy arrays
    # or Pandas dataframes. If X is a list or a single-dimension array, predict_for
    # will be set as 'single_entry'. If X is a multi-dimension NumPy array (as the
    # outputs for preparing data - even single_entry - for deep learning models), or if
    # it is a Pandas dataframe, the function will set predict_for = 'subset'
    
    # X = subset of predictive variables (dataframe, NumPy array, or list).
    # If predict_for = 'single_entry', X should be a list of parameter values,
    # e.g. X = [1.2, 3, 4] (the dot is the decimal separator; commas separate values). 
    # Notice that the list should contain only the numeric values, in the same order of the
    # correspondent columns.
    # If predict_for = 'subset' (prediction for multiple entries), X should be a dataframe 
    # (subset) or a multi-dimensional NumPy array of the parameter values, as usual.
    
    # model_object: object containing the model that will be analyzed. e.g.
    # model_object = elastic_net_linear_reg_model
    
    # list_of_classes is the list of classes effectively used for training
    # the model. Set this parameter as the object returned from function
    # retrieve_classes_used_to_train
    
    # type_of_model = 'other' or type_of_model = 'deep_learning'
    
    # Notice that the output will be an array of probabilities, where each
    # element corresponds to a possible class, in the order classes appear.
    
    # dataframe_for_concatenating_predictions: if you want to concatenate the predictions
    # to a dataframe, pass it here:
    # e.g. dataframe_for_concatenating_predictions = df
    # If the dataframe must be the same one passed as X, repeat the dataframe object here:
    # X = dataset, dataframe_for_concatenating_predictions = dataset.
    # Alternatively, if dataframe_for_concatenating_predictions = None, 
    # a dataframe containing only the calculated probabilities will be returned.
    # Notice that the concatenated probabilities will be added as new columns, one per class.
    
    # All of the new columns (appended or not) will have the prefix "prob_class_" followed
    # by the correspondent class name to identify them. For a single entry, the returned
    # dataframe has the columns 'class' and 'probability' instead.
    
       
    # 1. Check if a list was input. Unlike dataframes and NumPy arrays, lists do not have
    # the shape attribute, so accessing it on a list raises an AttributeError.
    # Try to access the attribute shape. If AttributeError is raised, X is a list, so
    # set predict_for = 'single_entry':
    
    try:
        
        # Try accessing the shape attribute
        X_shape = X.shape
        
        # Now, check the type of the object X: if it is a dataframe or a numpy array:
        X_type = type(X)
        
        # type(X) == numpy.ndarray (or np.ndarray if NumPy was imported as np) if it is
        # an array
        # type(X) == pandas.core.frame.DataFrame (or pd.core.frame.DataFrame if Pandas
        # was imported as pd) if it is a pandas dataframe.
        # Notice that the object type is not a string, so it should not be declared in quotes.
        
        if (X_type == np.ndarray):
            
            # It is a NumPy array
            # If this array was previously manipulated for the deep learning models, it has 3
            # dimensions, so: X_shape = (N, M, 1), N = number of arrays (the number of rows
            # of the original dataset), and M = number of elements on each array (the number
            # of columns of the original dataset)
            
            # If the array has the 3rd dimension, we should consider the prediction for 'subset',
            # even if it is for a single entry. That is because the array is already reshaped
            # and the single_entry code would reshape again.
            
            # Let's try to access the 3rd dimension as X_shape[2]. 
            # If there is no 3rd dimension, the exception error IndexError will be raised, since
            # there is no index 2:
            try:
                
                # Try accessing the 3rd dimension:
                third_dim = X_shape[2]
                
                # Since it was accessed, the array is already in the correct shape, so set
                # prediction for subset:
                predict_for = 'subset'
            
            except IndexError:
                
                # The index error was raised because there is no 3rd dimension. Then, we are
                # dealing with a numpy array equivalent to a list. Set prediction for single_entry.
                # It is true even if there are two dimensions like (N, 1) - (2nd dimension added
                # by the function for correcting the array format for deep learning).
                predict_for = 'single_entry'
        
        else:
            # It is a Pandas dataframe
            # Set prediction for a subset:
            predict_for = 'subset'
        
        
    except AttributeError:
        
        # The AttributeError is raised when there is no attribute. 
        # Since Python lists do not have the shape attribute, 
        # the input of a list raises this error when trying to access the object's shape.
        # Since it is a list, set predict_for = 'single_entry':
        predict_for = 'single_entry'
        
        
    # Check if it is a keras or other deep learning framework; or if it is a sklearn or xgb model:
    boolean_check = (type_of_model == 'deep_learning')
    
    if (boolean_check): # run if it is True
        print("The predictions (outputs) from deep learning models are themselves the probabilities associated to each possible class.")
        print("\n") #line break
        print("The output will be an array of float values: each float represents the probability of one class, in the order the classes appear. For a binary classifier, the first element will correspond to class 0; and the second element will be the probability of class 1.")
    
    
    if (predict_for == 'single_entry'):
        
        print("Calculating probabilities for a single entry X.")
        print("X must be a list with values in the order of the correspondent columns of the dataset.")
        print("In other words: declare X as a Python list of values correspondent to each variable, using the same order of variables (columns) used in the dataset.")
        
        # Reshape the entry into a single row (1 sample, M features), the 2D format
        # expected by the predict and predict_proba methods:
        X_reshaped = np.reshape(np.array(X), (1, -1))
        
        if (boolean_check): 
            # Use the predict method itself for deep learning models.
            # These models do not have the predict_proba method.
            # Their output is itself the probability for each class.
            y_pred_probabilities = model_object.predict(X_reshaped)
        
        else:
            # use the predict_proba method from sklearn and xgboost:
            y_pred_probabilities = model_object.predict_proba(X_reshaped)
        
        print("Probabilities calculated using the entry parameters.") 
        print(f"Probabilities calculated for each one of the classes {list_of_classes} (in the order of classes) = {y_pred_probabilities}\n")
        
        # create a dictionary with the possible classes and the correspondent probabilities.
        # The output has shape (1, number of classes), so flatten it into a 1D list to
        # pair each probability with its class:
        probability_dict = {'class': list_of_classes,
                            'probability': list(np.ravel(y_pred_probabilities))}
            
        # Convert it to a Pandas dataframe:
        probabilities_df = pd.DataFrame(data = probability_dict)
            
        print("Returning a dataframe containing the classes and the probabilities calculated for the entry to belong to each class. Check it below:")
        print(probabilities_df)
            
        return probabilities_df
    
    
    else:
        
        # prediction for a subset
        
        if (boolean_check): 
            # Use the predict method itself for deep learning models.
            # These models do not have the predict_proba method.
            # Their output is itself the probability for each class.
            y_pred_probabilities = model_object.predict(X)
            
            # If y_pred_probabilities came from a RNN with the parameter return_sequences = True 
            # and/or return_state = True, then the hidden and/or cell states from the LSTMs
            # were returned. So, the returned array has at least one extra dimension (two
            # if both parameters are True). On the other hand, we want only the first dimension,
            # correspondent to the actual output.

            # Remember that, due to the reshapes for preparing data for deep learning models,
            # y_pred_probabilities must have at least 2 dimensions: (N, 1), where N is the number 
            # of rows of the original dataset. But y_pred_probabilities returned from a model 
            # with return_sequences = True or return_state = True will be of dimension (N, N, 1). 
            # If both parameters are True, the dimension is (N, N, N, 1), since there are extra 
            # arrays for both the hidden and cell states.

            # The conclusion is that there is a third dimension only for models where 
            # return_sequences = True or return_state = True

            # Check if y_pred_probabilities is a numpy array, instead of a Pandas dataframe:

            if (type(y_pred_probabilities) == np.ndarray):

                    # Try accessing the array's 3rd dimension. If there is no 3rd dimension,
                    # the exception error IndexError will be raised.
                    # Notice: if 4 or more dimensions are present, we can still access
                    # the 3rd dimension (naturally).
                    try:

                        third_dim = y_pred_probabilities.shape[2]

                        # If we could access the third dimension, then return_state and/or
                        # return_sequences = True

                        # We want only the values stored as the 1st dimension
                        # y_pred_probabilities is an array where each element is an array with 
                        # two elements. To get only the first elements:
                        # (slice the arrays: get all values only for dimension 0, the 1st dim):
                        y_pred_probabilities = y_pred_probabilities[:,0]
                        # if we used y_pred_probabilities[:,1] we would get the second element, 
                        # which is the hidden state h (input of the next LSTM unit).
                        # It happens because of the parameter return_sequences = True. 
                        # If return_state = True, there would be a third element, corresponding 
                        # to the cell state c.
                        # Notice that we want only the 1st dimension (0), no matter the case.

                    except IndexError:

                        # The index error was raised because there is no 3rd dimension. Then,
                        # we do not have to worry with the returned states
                        # simply set y_pred_probabilities as itself:
                        y_pred_probabilities = y_pred_probabilities
                        # Even though the slicing y_pred_probabilities[:,0] would not generate an
                        # error, it would unnecessarily modify the shape of the array (extra
                        # critical step).

                        # Also, the array obtained as y_pred[:,0] when there are 3 or more 
                        # dimensions has same shape as y_pred when there are only 1 or 2 
                        # dimensions. So, the extra modification of the shape would eliminate
                        # this correspondence.

                    # If we wanted only the first array, we could set 
                    # y_pred_probabilities = y_pred_probabilities[0]
        
        else:
            # use the predict_proba method from sklearn and xgboost:
            y_pred_probabilities = model_object.predict_proba(X)
        
        # y_pred_probabilities is a column containing arrays of probabilities
        # Let's create a dataframe separating each element of the array into
        # a separate column
        
        # Get the size of each array. It is the total of elements from
        # list_of_classes (total of possible classes):
        total_of_classes = len(list_of_classes)
        
        # Get the total of rows. It is the length of X:
        
        # If X is a NumPy array, get its first dimension:
        if (X_type == np.ndarray):
            
            # Get the first dimension of the array (dimension 0)
            # This dimension is the total of arrays, i.e., the total
            # of rows on the original dataset:
            # X.shape = (N, M, 1), N = total of arrays (rows of the original
            # dataset); M = total of elements in each array (columns of the
            # original dataset). Analogously, y.shape = (N, 1)
            total_rows = X.shape[0]
        
        else:
            
            # X is a dataframe, so the number of rows is its length
            total_rows = len(X)
        
        # Starts a dictionary. This dictionary will have the class as the
        # key and a list of the probabilities that the element belong to that
        # class as the value (in the dataframe, the class will be column,
        # with its calculated probability in each row):
        probability_dict = {}
        
        # Loop through each possible class:
        for i in range (total_of_classes):
            # loops from i = 0 (first index) 
            # to i = (total_of_classes - 1), index of the last element of list
            # 'list_of_classes'
            
            # Retrieve the name of the class in the list 'list_of_classes'.
            # It is the i-th element from list list_of_classes:
            class_name = list_of_classes[i]
            # Let's concatenate the prefix "prob_class_" to this string.
            # This string will be used as column name, so it will be clear 
            # in the output dataframe that the column refers to the 
            # probability calculated for the class. Since the elements may 
            # have been saved as numbers, use the str function to guarantee 
            # that the element is read as a string, and concatenate the
            # prefix to its left:
            class_name = "prob_class_" + str(class_name)
            
            # Start a list of probabilities:
            prob_list = []
            
            # Now loop through each row j from the dataframe
            # to retrieve the array in the column y_pred_probabilities:
            
            for j in range(total_rows):
                # goes from j = 0 (first row of the dataframe) to
                # j = total_rows - 1, index of the last row
                # Get the array of probabilities for that row:
                prob_array = y_pred_probabilities[j]
                
                # Append the i-th element of that array in prob_list
                # The i-th position of the array is the probability
                # of the class being analyzed in the i-th iteration of
                # the main loop
                prob_list.append(prob_array[i])
            
            # Now that the probabilities for the class correspondent to
            # each row were retrieved as the list prob_list, update the
            # dictionary. Use the class name saved as class_name as the
            # key, and put the prob_list as the correspondent value:
            probability_dict.update({class_name: prob_list})
        
        # Now that we finished the loop, the probability dictionary contains
        # each one of the classes as its keys, and the list of probabilities
        # for each row as the correspondent values. 
        # Also, the keys are identified with the prefix 'prob_class_' to
        # indicate that they refer to the probability of belonging to
        # one class. Let's convert this dictionary to a Pandas dataframe:
        
        probabilities_df = pd.DataFrame(data = probability_dict)
        
        # Check if there is a dataframe to concatenate the predictions
        if not (dataframe_for_concatenating_predictions is None):
            
            # there is a dataframe for concatenating the predictions.
            
            # Set a local copy of the dataframe to manipulate, so that the original
            # object is not modified:
            X_copy = dataframe_for_concatenating_predictions.copy(deep = True)
            
            # Append the columns from probabilities_df with Pandas concat
            # method, setting axis = 1 (axis = 0  appends rows)
            # Use the pandas 'inner' join, which removes entries without
            # correspondence. It is the same strategy used for concatenating
            # the dataframe obtained from One-Hot Encoding transformation in the
            # ETL Workflow (3_Dataset_Transformation)
            X_copy = pd.concat([X_copy, probabilities_df], axis = 1, join = "inner")
      
            print("The dataframe was concatenated to the probabilities calculated for each class and returned. Check its first 10 entries:\n")
            print(X_copy.head(10))
            
            return X_copy
        
        else:
            
            print("Returning only the dataframe with the probabilities calculated for each class. Check its first 10 entries:\n")
            print(probabilities_df.head(10))
            
            return probabilities_df
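
The nested loops used above to build `probability_dict` can also be written in a vectorized form. The sketch below is only an illustration under stated assumptions: the array `y_pred_probabilities`, the list `list_of_classes` and the dataframe `df` hold made-up values, standing in for the model output and for the dataframe passed as `dataframe_for_concatenating_predictions`.

```python
import numpy as np
import pandas as pd

# Hypothetical model output: 2 rows, 3 classes (values made up for illustration).
y_pred_probabilities = np.array([[0.7, 0.2, 0.1],
                                 [0.1, 0.3, 0.6]])
list_of_classes = ['low', 'medium', 'high']

# One column per class, named with the prefix "prob_class_":
column_names = ["prob_class_" + str(class_name) for class_name in list_of_classes]
probabilities_df = pd.DataFrame(y_pred_probabilities, columns = column_names)

# Hypothetical dataframe with a non-default index:
df = pd.DataFrame({'x1': [5.1, 6.3]}, index = [10, 11])

# pd.concat with axis = 1 aligns the rows by index label, so reset the index
# before concatenating; otherwise the 'inner' join would drop every row here.
combined_df = pd.concat([df.reset_index(drop = True), probabilities_df],
                        axis = 1, join = "inner")
print(combined_df)
```

Since the 'inner' join keeps only index labels present in both objects, resetting the index is a safe choice whenever the dataframe was previously filtered or sorted.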

# **Function for merging (joining) dataframes on given keys; and sorting the merged table**
- Merge (join) types:
    - 'inner': resultant dataframe contains only the rows on the left dataframe with correspondent values on the right dataframe. Can be used for filtering a set of labelled rows. Results in no missing values;
    - 'left': resultant dataframe contains all the rows from the left table (even those without correspondence on the right); and the rows from the right table that have correspondence on the left one. Since rows from the left table may not have correspondence, it may result in missing values.
    - 'right': resultant dataframe contains all the rows from the right table (even those without correspondence on the left); and the rows from the left table that have correspondence on the right one. Since rows from the right table may not have correspondence, it may result in missing values.
    - 'outer': in SQL, the Pandas 'outer' merge usually corresponds to the FULL OUTER JOIN: the resultant dataframe contains all rows from both tables, not taking in account if there is correspondence. So, it may result in missing values.
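
The four merge types listed above can be compared on a small example. The sketch below uses two made-up dataframes (the names `df_left` and `df_right` and their values are assumptions for illustration) that share the key column `id`, and counts the rows and missing values obtained with each join.

```python
import pandas as pd

# Made-up tables for illustration: ids 2 and 3 are present in both of them.
df_left = pd.DataFrame({'id': [1, 2, 3], 'temperature': [20.5, 21.0, 19.8]})
df_right = pd.DataFrame({'id': [2, 3, 4], 'pressure': [1.01, 0.99, 1.02]})

summary = {}

for how in ('inner', 'left', 'right', 'outer'):
    # Merge on the common key and sort by it, resetting the index afterwards:
    merged = df_left.merge(df_right, on = 'id', how = how)
    merged = merged.sort_values(by = 'id', ascending = True).reset_index(drop = True)

    # Total of missing values in the merged table:
    n_missing = int(merged.isna().sum().sum())
    summary[how] = (len(merged), n_missing)
    print(f"{how}: {len(merged)} rows, {n_missing} missing values")
```

Only the 'inner' join returns a table without missing values here, since it keeps only the ids found in both dataframes.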

In [None]:
def MERGE_AND_SORT_DATAFRAMES (df_left, df_right, left_key, right_key, how_to_join = "inner", merged_suffixes = ('_left', '_right'), sort_merged_df = False, column_to_sort = None, ascending_sorting = True):
    
    #WARNING: Only two dataframes can be merged on each call of the function.
    
    import numpy as np
    import pandas as pd
    
    # df_left: dataframe to be joined as the left one.
    
    # df_right: dataframe to be joined as the right one
    
    # left_key: (String) name of column of the left dataframe to be used as key for joining.
    
    # right_key: (String) name of column of the right dataframe to be used as key for joining.
    
    # how_to_join: joining method: "inner", "outer", "left", "right". The default is "inner".
    
    # This function always applies the standard pandas .merge method. The pandas functions
    # merge_ordered and merge_asof are not used here.
    
    # merged_suffixes = ('_left', '_right') - tuple of the suffixes to be added to columns
    # with equal names. Simply modify the strings inside quotes to modify the standard
    # values. If no tuple is provided, the standard denomination will be used.
    
    # sort_merged_df = False not to sort the merged dataframe. If you want to sort it,
    # set as True. If sort_merged_df = True and column_to_sort = None, the dataframe will
    # be sorted by its first column.
    
    # column_to_sort = None. Keep it None if the dataframe should not be sorted.
    # Alternatively, pass a string with a column name to sort, such as:
    # column_to_sort = 'col1'; or a list of columns to use for sorting: column_to_sort = 
    # ['col1', 'col2']
    
    # ascending_sorting = True. If you want to sort the column(s) passed on column_to_sort in
    # ascending order, set as True. Set as False if you want to sort in descending order. If
    # you want to sort each column passed as list column_to_sort in a specific order, pass a 
    # list of booleans like ascending_sorting = [False, True] - the first column of the list
    # will be sorted in descending order, whereas the 2nd will be in ascending. Notice that
    # the correspondence is element-wise: the boolean in list ascending_sorting will correspond 
    # to the sorting order of the column with the same position in list column_to_sort.
    # If None, the dataframe will be sorted in ascending order.
    
    # Create dataframe local copies to manipulate, avoiding that Pandas operates on
    # the original objects; or that Pandas tries to set values on slices or copies,
    # resulting in unpredictable results.
    # Use the copy method to effectively create a second object with the same properties
    # of the input parameters, but completely independent from it.
    DF_LEFT = df_left.copy(deep = True)
    DF_RIGHT = df_right.copy(deep = True)
    
    # check if the keys are the same:
    boolean_check = (left_key == right_key)
    # if boolean_check is True, we will merge using the on parameter, instead of left_on and right_on:
    
    if (boolean_check): # runs if it is True:
        
        merged_df = DF_LEFT.merge(DF_RIGHT, on = left_key, how = how_to_join, suffixes = merged_suffixes)
    
    else:
        # use left_on and right_on
        merged_df = DF_LEFT.merge(DF_RIGHT, left_on = left_key, right_on = right_key, how = how_to_join, suffixes = merged_suffixes)
    
    # Check if the dataframe should be sorted:
    if (sort_merged_df == True):
        
        # check if column_to_sort = None. If it is, set it as the first column (index 0):
        if (column_to_sort is None):
            
            column_to_sort = merged_df.columns[0]
            print(f"Sorting merged dataframe by its first column = {column_to_sort}\n")
        
        # check if ascending_sorting is None. If it is, set it as True:
        if (ascending_sorting is None):
            
            ascending_sorting = True
            print("Sorting merged dataframe in ascending order.\n")
        
        # Now, sort the dataframe according to the parameters:
        merged_df = merged_df.sort_values(by = column_to_sort, ascending = ascending_sorting)
        # (if column_to_sort was None, this is the first column, with index 0.)
    
        # Now, reset index positions:
        merged_df = merged_df.reset_index(drop = True)
        print("Merged dataframe successfully sorted.\n")
    
    # Pandas .head(Y) method results in a dataframe containing the first Y rows of the 
    # original dataframe. The default .head() is Y = 5. Print first 10 rows of the 
    # new dataframe:
    print("Dataframe successfully merged. Check its 10 first rows:\n")
    
    try:
        # only works in Jupyter Notebook:
        from IPython.display import display
        display(merged_df.head(10))
            
    except: # regular mode
        print(merged_df.head(10))
    
    return merged_df
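
The merge-and-sort steps described above can be sketched with plain Pandas calls. This is a minimal illustration, not the notebook's function itself: the dataframes `df_left_example` and `df_right_example` and their column names are hypothetical, created only for this example.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration:
df_left_example = pd.DataFrame({"batch_id": [3, 1, 2], "temperature": [71.5, 68.0, 70.2]})
df_right_example = pd.DataFrame({"batch": [1, 2, 4], "yield_pct": [88.1, 90.4, 85.0]})

# The key columns have different names, so left_on and right_on are used
# instead of the on parameter:
merged_example = df_left_example.merge(
    df_right_example,
    left_on="batch_id",
    right_on="batch",
    how="inner",
    suffixes=("_left", "_right"),
)

# Sort by two columns with element-wise sorting orders: the first column in
# descending order, the second in ascending order. Then reset the index:
merged_example = merged_example.sort_values(
    by=["yield_pct", "temperature"], ascending=[False, True]
).reset_index(drop=True)

print(merged_example)
```

Only the batches present in both tables (1 and 2) remain after the inner join, and the row with the highest `yield_pct` comes first.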

# **Function for concatenating (SQL UNION) multiple dataframes**
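- Vertical concatenation of the dataframes (default, `what_to_append = 'rows'`).
- Equivalent to SQL UNION ALL: vertical stack/append of the tables; duplicate rows are not removed.
- Set `what_to_append = 'columns'` for a lateral (horizontal) append instead.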

In [None]:
def UNION_DATAFRAMES (list_of_dataframes, what_to_append = 'rows', ignore_index_on_union = True, sort_values_on_union = True, union_join_type = None):
    
    import pandas as pd
    # union_join_type can be 'inner' to perform an inner join, which keeps only the
    # columns that are common to all dataframes.
    # The default (None) is 'outer': the dataframes will be stacked on the columns with
    # same names but, in case there is no correspondence, the row will present a missing
    # value for the columns which are not present in one of the dataframes.
    
    #list_of_dataframes must be a list containing the dataframe objects
    # example: list_of_dataframes = [df1, df2, df3, df4]
    #Notice that the dataframes are objects, not strings. Therefore, they should not
    # be declared inside quotes.
    # There is no limit of dataframes. In this example, we will concatenate 4 dataframes.
    # If list_of_dataframes = [df1, df2, df3] we would concatenate 3, and if
    # list_of_dataframes = [df1, df2, df3, df4, df5] we would concatenate 5 dataframes.
    
    # what_to_append = 'rows' for appending the rows from one dataframe
    # into the other; what_to_append = 'columns' for appending the columns
    # from one dataframe into the other (horizontal or lateral append).
    
    # When what_to_append = 'rows', Pandas .concat method is defined as
    # axis = 0, i.e., the operation occurs in the row level, so the rows
    # of the second dataframe are added to the bottom of the first one.
    # It is the SQL union, and creates a dataframe with more rows, and
    # total of columns equals to the total of columns of the first dataframe
    # plus the columns of the second one that were not in the first dataframe.
    # When what_to_append = 'columns', Pandas .concat method is defined as
    # axis = 1, i.e., the operation occurs in the column level: the two
    # dataframes are laterally merged using the index as the key, 
    # preserving all columns from both dataframes. Therefore, the number of
    # rows will be the total of rows of the dataframe with more entries,
    # and the total of columns will be the sum of the total of columns of
    # the first dataframe with the total of columns of the second dataframe.
    
    #The other parameters are the same from Pandas .concat method.
    # ignore_index_on_union = ignore_index;
    # sort_values_on_union = sort
    # union_join_type = join
    # https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html
    
    #Check Datacamp course Joining Data with pandas, Chap.3, 
    # Advanced Merging and Concatenating
    
    # Create dataframe local copies to manipulate, avoiding that Pandas operates on
    # the original objects; or that Pandas tries to set values on slices or copies,
    # resulting in unpredictable results.
    # Use the copy method to effectively create a second object with the same properties
    # of the input parameters, but completely independent from it.
    
    # Start a list of copied dataframes:
    LIST_OF_DATAFRAMES = []
    
    # Loop through each element from list_of_dataframes:
    for dataframe in list_of_dataframes:
        
        # create a copy of the object:
        copied_df = dataframe.copy(deep = True)
        # Append this element to the LIST_OF_DATAFRAMES:
        LIST_OF_DATAFRAMES.append(copied_df)
    
    # Check axis:
    if (what_to_append == 'rows'):
        
        AXIS = 0
    
    elif (what_to_append == 'columns'):
        
        AXIS = 1
        
        # In this case, we must save a list of columns of each one of the dataframes, containing
        # the different column names observed. That is because the concat method replaces the
        # original column names with integer labels (0, 1, 2, ...) when AXIS = 1 and
        # ignore_index = True.
        # We can start the LIST_OF_COLUMNS as the columns from the first object on the
        # LIST_OF_DATAFRAMES, eliminating one iteration cycle. Since the columns attribute returns
        # a Pandas Index object, we use the list function to convert it to a regular list:
        
        i = 0
        analyzed_df = LIST_OF_DATAFRAMES[i]
        LIST_OF_COLUMNS = list(analyzed_df.columns)
        
        # Now, loop through each other element on LIST_OF_DATAFRAMES. Since index 0 was already
        # considered, start from index 1:
        for i in range (1, len(LIST_OF_DATAFRAMES)):
            
            analyzed_df = LIST_OF_DATAFRAMES[i]
            
            # Now, loop through each column, named 'col', from the list of columns of analyzed_df:
            for col in list(analyzed_df.columns):
                
                # If 'col' is not in LIST_OF_COLUMNS, append it to the list with its current name.
                # The order of the columns on the concatenated dataframe will be the same (the order
                # they appear):
                if not (col in LIST_OF_COLUMNS):
                    LIST_OF_COLUMNS.append(col)
                
                else:
                    # There is already a column with this name. So, append col with a suffix:
                    LIST_OF_COLUMNS.append(col + "_df_" + str(i))
                    
        # Now, we have a list of all column names, that we will use for retrieving the headers after
        # concatenation.
    
    else:
        print("No valid string was input to what_to_append, so appending rows (vertical append, equivalent to SQL UNION).\n")
        AXIS = 0
    
    if (union_join_type == 'inner'):
        
        print("Warning: concatenating dataframes using the \'inner\' join method, that removes missing values.\n")
        concat_df = pd.concat(LIST_OF_DATAFRAMES, axis = AXIS, ignore_index = ignore_index_on_union, sort = sort_values_on_union, join = union_join_type)
    
    else:
        
        #In case None or an invalid value is provided, use the default 'outer', by simply
        # not declaring the 'join':
        concat_df = pd.concat(LIST_OF_DATAFRAMES, axis = AXIS, ignore_index = ignore_index_on_union, sort = sort_values_on_union)
    
    if (AXIS == 1):
        # If we concatenated columns with ignore_index = True, we lost the columns' names
        # (headers). So, use the list LIST_OF_COLUMNS as the new headers for this case:
        concat_df.columns = LIST_OF_COLUMNS
    
    # Pandas .head(Y) method results in a dataframe containing the first Y rows of the 
    # original dataframe. The default .head() is Y = 5. Print first 10 rows of the 
    # new dataframe:
    print("Dataframes successfully concatenated. Check the 10 first rows of new dataframe:\n")
    
    try:
        # only works in Jupyter Notebook:
        from IPython.display import display
        display(concat_df.head(10))
            
    except: # regular mode
        print(concat_df.head(10))
    
    #Now return the concatenated dataframe:
    
    return concat_df
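
The difference between appending rows and appending columns can be checked directly with `pd.concat`. The sketch below uses two hypothetical dataframes (`df_a` and `df_b`) and shows the 'outer' and 'inner' joins for the vertical append, as well as the integer column labels that result from a lateral append with `ignore_index = True`.

```python
import pandas as pd

# Hypothetical dataframes sharing only the column 'a':
df_a = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
df_b = pd.DataFrame({"a": [5, 6], "c": [7, 8]})

# Vertical append with the default 'outer' join: all columns are kept, and
# missing values fill the positions with no correspondence.
rows_outer = pd.concat([df_a, df_b], axis=0, ignore_index=True, sort=True)

# Vertical append with the 'inner' join: only the common column 'a' remains.
rows_inner = pd.concat([df_a, df_b], axis=0, ignore_index=True, join="inner")

# Lateral append: with ignore_index = True, the column names are replaced by
# integer labels, so the headers must be restored afterwards.
cols_concat = pd.concat([df_a, df_b], axis=1, ignore_index=True)
labels_after_concat = list(cols_concat.columns)

# Restore the headers, adding a suffix to the repeated column name:
cols_concat.columns = ["a", "b", "a_df_1", "c"]

print(rows_outer)
print(rows_inner)
print(cols_concat)
```

Duplicate rows, if present, are kept by `pd.concat`; calling `.drop_duplicates()` on the result would be one way to mimic a SQL UNION without the ALL keyword.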

# **Function for plotting the bar chart**
- Bars may be vertically or horizontally oriented.
- Bar charts are plotted after selecting an aggregation function, and the cumulative percent curve may be obtained and plotted with the bars (in secondary axis).
- To obtain a **Pareto chart**, keep `aggregate_function = 'sum'`, `calculate_and_plot_cumulative_percent = True`, and `orientation = 'vertical'`.
- To obtain the **data distribution of categorical variables**, select any numeric column as the response, and set `aggregate_function = 'count'`. You can also set `calculate_and_plot_cumulative_percent = True` to compare the frequencies of each possible value.
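
The table behind a Pareto chart can be sketched in a few Pandas steps: aggregate the response by category, sort in descending order, and accumulate the percent of the total. The example below is only an illustration with hypothetical data (`records`, `defect_type`, `cost`); it does not draw the plot and does not call the notebook's function.

```python
import pandas as pd

# Hypothetical records: one row per observed defect and its cost.
records = pd.DataFrame({
    "defect_type": ["scratch", "dent", "scratch", "stain", "dent", "scratch"],
    "cost": [10.0, 30.0, 20.0, 5.0, 20.0, 15.0],
})

# Aggregate the response by category (sum), then sort in descending order of
# the response; ties are ordered by category name in ascending order.
pareto_table = (
    records.dropna()
    .groupby("defect_type", as_index=False)["cost"].sum()
    .sort_values(by=["cost", "defect_type"], ascending=[False, True])
    .reset_index(drop=True)
)

# Cumulative sum and cumulative percent of the total:
pareto_table["cost_cumsum"] = pareto_table["cost"].cumsum()
pareto_table["cost_cum_pct"] = pareto_table["cost_cumsum"] / pareto_table["cost"].sum() * 100

# Frequency of each category (aggregate 'count'), which gives the
# distribution of the categorical variable:
count_table = records.groupby("defect_type", as_index=False)["cost"].count()
count_table.columns = ["defect_type", "count_of_entries"]

print(pareto_table)
print(count_table)
```

The last row of the cumulative percent column reaches 100, which is the curve plotted on the secondary axis of a Pareto chart.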

### Use this function for obtaining the statistical distributions for categorical variables

In [None]:
def bar_chart (df, categorical_var_name, response_var_name, aggregate_function = 'sum', add_suffix_to_aggregated_col = True, suffix = None, calculate_and_plot_cumulative_percent = True, orientation = 'vertical', limit_of_plotted_categories = None, horizontal_axis_title = None, vertical_axis_title = None, plot_title = None, x_axis_rotation = 70, y_axis_rotation = 0, grid = True, export_png = False, directory_to_save = None, file_name = None, png_resolution_dpi = 330):

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from scipy import stats
    
    # df: dataframe being analyzed
    
    # categorical_var_name: string (inside quotes) containing the name 
    # of the column to be analyzed. e.g. 
    # categorical_var_name = "column1"
    
    # response_var_name: string (inside quotes) containing the name 
    # of the column that stores the response correspondent to the
    # categories. e.g. response_var_name = "response_feature" 
    
    # aggregate_function = 'sum': String defining the aggregation 
    # method that will be applied. Possible values:
    # 'median', 'mean', 'mode', 'sum', 'min', 'max', 'variance', 'count',
    # 'standard_deviation', '10_percent_quantile', '20_percent_quantile',
    # '25_percent_quantile', '30_percent_quantile', '40_percent_quantile',
    # '50_percent_quantile', '60_percent_quantile', '70_percent_quantile',
    # '75_percent_quantile', '80_percent_quantile', '90_percent_quantile',
    # '95_percent_quantile', 'kurtosis', 'skew', 'interquartile_range',
    # 'mean_standard_error', 'entropy'
    # To use another aggregate function, you can use the .agg method, passing 
    # the aggregate as argument, such as in:
    # .agg(scipy.stats.mode), 
    # where the argument is a Scipy aggregate function.
    # If None or an invalid function is input, 'sum' will be used.
    
    # add_suffix_to_aggregated_col = True will add a suffix to the
    # aggregated column. e.g. 'responseVar_mean'. If add_suffix_to_aggregated_col 
    # = False, the aggregated column will have the original column name.
    
    # suffix = None. Keep it None if no suffix should be added, or if
    # the name of the aggregate function should be used as suffix, after
    # "_". Alternatively, set it as a string. As recommendation, put the
    # "_" sign in the beginning of this string to separate the suffix from
    # the original column name. e.g. if the response variable is 'Y' and
    # suffix = '_agg', the new aggregated column will be named as 'Y_agg'
    
    # calculate_and_plot_cumulative_percent = True to calculate and plot
    # the line of cumulative percent, or 
    # calculate_and_plot_cumulative_percent = False to omit it.
    # This feature is only available when aggregate_function = 'sum', 'median',
    # 'mean', 'mode', or 'count'. So, it will be automatically set as False if 
    # another aggregate is selected.
    
    # orientation = 'vertical' is the standard, and plots vertical bars
    # (perpendicular to the X axis). In this case, the categories are shown
    # in the X axis, and the correspondent responses are in Y axis.
    # Alternatively, orientation = 'horizontal' results in horizontal bars.
    # In this case, categories are in Y axis, and responses in X axis.
    # If None or invalid values are provided, orientation is set as 'vertical'.
    
    # Note: to obtain a Pareto chart, keep aggregate_function = 'sum',
    # calculate_and_plot_cumulative_percent = True, and orientation = 'vertical'.
    
    # limit_of_plotted_categories: integer value that represents
    # the maximum number of categories that will be plotted. Keep it None to plot
    # all categories. Alternatively, set an integer value. e.g.: if
    # limit_of_plotted_categories = 4, but there are more categories,
    # the dataset will be sorted in descending order and: 1) The remaining
    # categories will be summed in a new category named 'others' if the
    # aggregate function is 'sum'; 2) Or the other categories will be simply
    # omitted from the plot, for other aggregate functions. Notice that
    # it limits only the categories in the plot: all of them will be
    # returned in the dataframe.
    # Use this parameter to obtain a cleaner plot. Notice that the remaining
    # categories will be aggregated as 'others' even if there is a single category
    # beyond the limit.
    
    
    # Create a local copy of the dataframe to manipulate:
    DATASET = df.copy(deep = True)
    
    # Before calling the method, we must guarantee that the variables may be
    # used for that aggregate. Some aggregations are permitted only for numeric variables, so calling
    # the methods before selecting the variables may raise warnings or errors.
    
    
    list_of_aggregates = ['median', 'mean', 'mode', 'sum', 'min', 'max', 'variance', 'count',
                          'standard_deviation', '10_percent_quantile', '20_percent_quantile', 
                          '25_percent_quantile', '30_percent_quantile', '40_percent_quantile', 
                          '50_percent_quantile', '60_percent_quantile', '70_percent_quantile', 
                          '75_percent_quantile', '80_percent_quantile', '90_percent_quantile', 
                          '95_percent_quantile', 'kurtosis', 'skew', 'interquartile_range', 
                          'mean_standard_error', 'entropy']
    
    list_of_numeric_aggregates = ['median', 'mean', 'sum', 'min', 'max', 'variance',
                                  'standard_deviation', '10_percent_quantile', '20_percent_quantile', 
                                  '25_percent_quantile', '30_percent_quantile', '40_percent_quantile', 
                                  '50_percent_quantile', '60_percent_quantile', '70_percent_quantile', 
                                  '75_percent_quantile', '80_percent_quantile', '90_percent_quantile',
                                  '95_percent_quantile', 'kurtosis', 'skew', 'interquartile_range', 
                                  'mean_standard_error']
    
    # Check if an invalid or no aggregation function was selected:
    if ((aggregate_function is None) or (aggregate_function not in list_of_aggregates)):
        
        aggregate_function = 'sum'
        print("Invalid or no aggregation function input, so using the default \'sum\'.\n")
    
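    # List the possible numeric data types for a Pandas dataframe column
    # (signed integers, unsigned integers, and floats):
    numeric_dtypes = [np.int8, np.int16, np.int32, np.int64, np.uint8, np.uint16, np.uint32, np.uint64, np.float16, np.float32, np.float64]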
    
    # Check if a numeric aggregate was selected:
    if (aggregate_function in list_of_numeric_aggregates):
        
        column_data_type = DATASET[response_var_name].dtype
        
        if (column_data_type not in numeric_dtypes):
            
            # If the Pandas series was defined as an object, it means it is categorical
            # (string, date, etc).
            print("Numeric aggregate selected, but categorical variable indicated as response variable.")
            print("Setting aggregate_function = \'mode\', to make aggregate compatible with data type.\n")
            
            aggregate_function = 'mode'
    
    else: # categorical aggregate function
        
        column_data_type = DATASET[response_var_name].dtype
        
        if ((column_data_type in numeric_dtypes) & (aggregate_function != 'count')):
            # count is the only aggregate for categorical that can be used for numerical variables as well.
            
            print("Categorical aggregate selected, but numeric variable indicated as response variable.")
            print("Setting aggregate_function = \'sum\', to make aggregate compatible with data type.\n")
            
            aggregate_function = 'sum'
    
    # Before grouping, let's remove the missing values, avoiding the raising of TypeError.
    # Pandas deprecated the automatic dropna with aggregation:
    DATASET = DATASET.dropna(axis = 0)
    
    # Convert categorical_var_name to Pandas 'category' type. If the variable is represented by
    # a number, the dataframe would be grouped in terms of an aggregation of the variable, instead
    # of as a category. This conversion prevents that from happening:
    DATASET[categorical_var_name] = DATASET[categorical_var_name].astype("category")    
    
    # If an aggregate function different from 'sum', 'mean', 'median', 'mode' or 'count'
    # is used with calculate_and_plot_cumulative_percent = True, 
    # set calculate_and_plot_cumulative_percent = False:
    # (check if aggregate function is not in the list of allowed values):
    if ((aggregate_function not in ['sum', 'mean', 'median', 'mode', 'count']) & (calculate_and_plot_cumulative_percent == True)):
        
        calculate_and_plot_cumulative_percent = False
        print("The cumulative percent is only calculated when aggregate_function = \'sum\', \'mean\', \'median\', \'mode\', or \'count\'. So, calculate_and_plot_cumulative_percent was set as False.")
    
    # Guarantee that the columns from the aggregated dataset have the correct names
    
    # Groupby according to the selection.
    # Here, there is a great gain of performance in not using a dictionary of methods:
    # If using a dictionary of methods, Pandas would calculate the results for each one of the methods.
    
    # Pandas groupby method documentation:
    # https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html?msclkid=7b3531a6cff211ec9086f4edaddb94ba
    # argument as_index = False: prevents the grouper variable from being set as index of the new dataframe.
    # (default: as_index = True);
    # dropna = False: does not remove the missing values (default: dropna = True, kept here to avoid
    # compatibility and version issues)
    
    if (aggregate_function == 'median'):
        
        DATASET = DATASET.groupby(by = categorical_var_name, as_index = False, sort = True)[response_var_name].agg('median')

    elif (aggregate_function == 'mean'):
        
        DATASET = DATASET.groupby(by = categorical_var_name, as_index = False, sort = True)[response_var_name].mean()
    
    elif (aggregate_function == 'mode'):
        
        DATASET = DATASET.groupby(by = categorical_var_name, as_index = False, sort = True)[response_var_name].agg(stats.mode)
    
    elif (aggregate_function == 'sum'):
        
        DATASET = DATASET.groupby(by = categorical_var_name, as_index = False, sort = True)[response_var_name].sum()
    
    elif (aggregate_function == 'count'):
        
        DATASET = DATASET.groupby(by = categorical_var_name, as_index = False, sort = True)[response_var_name].count()

    elif (aggregate_function == 'min'):
        
        DATASET = DATASET.groupby(by = categorical_var_name, as_index = False, sort = True)[response_var_name].min()
    
    elif (aggregate_function == 'max'):
        
        DATASET = DATASET.groupby(by = categorical_var_name, as_index = False, sort = True)[response_var_name].max()
    
    elif (aggregate_function == 'variance'):
        
        DATASET = DATASET.groupby(by = categorical_var_name, as_index = False, sort = True)[response_var_name].var()

    elif (aggregate_function == 'standard_deviation'):
        
        DATASET = DATASET.groupby(by = categorical_var_name, as_index = False, sort = True)[response_var_name].std()
    
    elif (aggregate_function == '10_percent_quantile'):
        
        DATASET = DATASET.groupby(by = categorical_var_name, as_index = False, sort = True)[response_var_name].quantile(0.10)
    
    elif (aggregate_function == '20_percent_quantile'):
        
        DATASET = DATASET.groupby(by = categorical_var_name, as_index = False, sort = True)[response_var_name].quantile(0.20)
    
    elif (aggregate_function == '25_percent_quantile'):
        
        DATASET = DATASET.groupby(by = categorical_var_name, as_index = False, sort = True)[response_var_name].quantile(0.25)
    
    elif (aggregate_function == '30_percent_quantile'):
        
        DATASET = DATASET.groupby(by = categorical_var_name, as_index = False, sort = True)[response_var_name].quantile(0.30)
    
    elif (aggregate_function == '40_percent_quantile'):
        
        DATASET = DATASET.groupby(by = categorical_var_name, as_index = False, sort = True)[response_var_name].quantile(0.40)
    
    elif (aggregate_function == '50_percent_quantile'):
        
        DATASET = DATASET.groupby(by = categorical_var_name, as_index = False, sort = True)[response_var_name].quantile(0.50)

    elif (aggregate_function == '60_percent_quantile'):
        
        DATASET = DATASET.groupby(by = categorical_var_name, as_index = False, sort = True)[response_var_name].quantile(0.60)
    
    elif (aggregate_function == '70_percent_quantile'):
        
        DATASET = DATASET.groupby(by = categorical_var_name, as_index = False, sort = True)[response_var_name].quantile(0.70)

    elif (aggregate_function == '75_percent_quantile'):
        
        DATASET = DATASET.groupby(by = categorical_var_name, as_index = False, sort = True)[response_var_name].quantile(0.75)

    elif (aggregate_function == '80_percent_quantile'):
        
        DATASET = DATASET.groupby(by = categorical_var_name, as_index = False, sort = True)[response_var_name].quantile(0.80)
    
    elif (aggregate_function == '90_percent_quantile'):
        
        DATASET = DATASET.groupby(by = categorical_var_name, as_index = False, sort = True)[response_var_name].quantile(0.90)
    
    elif (aggregate_function == '95_percent_quantile'):
        
        DATASET = DATASET.groupby(by = categorical_var_name, as_index = False, sort = True)[response_var_name].quantile(0.95)

    elif (aggregate_function == 'kurtosis'):
        
        DATASET = DATASET.groupby(by = categorical_var_name, as_index = False, sort = True)[response_var_name].agg(stats.kurtosis)
    
    elif (aggregate_function == 'skew'):
        
        DATASET = DATASET.groupby(by = categorical_var_name, as_index = False, sort = True)[response_var_name].agg(stats.skew)

    elif (aggregate_function == 'interquartile_range'):
        
        DATASET = DATASET.groupby(by = categorical_var_name, as_index = False, sort = True)[response_var_name].agg(stats.iqr)
    
    elif (aggregate_function == 'mean_standard_error'):
        
        DATASET = DATASET.groupby(by = categorical_var_name, as_index = False, sort = True)[response_var_name].agg(stats.sem)
    
    else: # entropy
        
        DATASET = DATASET.groupby(by = categorical_var_name, as_index = False, sort = True)[response_var_name].agg(stats.entropy)

    
    # List of columns of the aggregated dataset:
    list_of_columns = list(DATASET.columns) # convert to a list
    
    if (add_suffix_to_aggregated_col == True):
            
        if (suffix is None):
                
            suffix = "_" + aggregate_function
            
        new_columns = [(str(name) + suffix) for name in list_of_columns]
        # Do not consider the first element, which is the categorical variable name with a suffix.
        # Concatenate the correct name with the columns from the second element of the list:
        new_columns = [categorical_var_name] + new_columns[1:]
        # Make it the new columns:
        DATASET.columns = new_columns
        # Update the list of columns:
        list_of_columns = list(DATASET.columns)
    
    if (aggregate_function == 'mode'):
        
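        # The column was saved as a series of tuples. Each row contains a tuple like:
        # ([calculated_mode], [counting_of_occurrences]). We want only the calculated mode.
        # On the other hand, if we do column[0], we will get the column's first row. So, we have to
        # go through each aggregated column, retrieving only the mode:
        
        # Loop through each aggregated column. The grouping column (categorical_var_name) stores
        # the categories themselves, not mode results, so it must be skipped:
        for column in [col for col in list_of_columns if (col != categorical_var_name)]: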
            
            # Save the series as a list:
            list_of_modes_arrays = list(DATASET[column])
            # Start a list of modes:
            list_of_modes = []
            
            # Loop through each element from the list of arrays:
            for mode_array in list_of_modes_arrays:
                # mode_array is like:
                # ModeResult(mode=array([calculated_mode]), count=array([counting_of_occurrences]))
                # To retrieve only the mode, we must access the element [0][0] from this array.
                # Some SciPy versions return the mode as a scalar instead of an array, so the
                # element [0] is converted to an array before retrieving its first value:
                mode_values = np.atleast_1d(mode_array[0])
                
                if (len(mode_values) > 0):
                    list_of_modes.append(mode_values[0])
                
                else:
                    # This happens when the array stores no values (i.e., there are only
                    # missing values). In this case, simply append np.nan (the missing value):
                    list_of_modes.append(np.nan)
            
            # Make the list of modes the column itself:
            DATASET[column] = list_of_modes
    
            
    # the name of the response variable is now the second element from the list of column:
    response_var_name = list(DATASET.columns)[1]
    # the categorical variable name was not changed.
    
    # Let's sort the dataframe.
    
    # Order the dataframe in descending order by the response.
    # If there are equal responses, order them by category, in
    # ascending order; put the missing values in the first position
    # To pass multiple columns and multiple types of ordering, we use
    # lists. If there was a single column to order by, we would declare
    # it as a string. If only one order of ascending was used, we would
    # declare it as a simple boolean
    # https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html
    
    DATASET = DATASET.sort_values(by = [response_var_name, categorical_var_name], ascending = [False, True], na_position = 'first')
    
    # Now, reset index positions:
    DATASET = DATASET.reset_index(drop = True)
    
    if (aggregate_function == 'count'):
        
        # Here, the column represents the counting, no matter the variable set as response.
        DATASET.columns = [categorical_var_name, 'count_of_entries']
        response_var_name = 'count_of_entries'
    
    # If calculate_and_plot_cumulative_percent = True, create a column to store the
    # cumulative percent:
    if (calculate_and_plot_cumulative_percent): 
        # Run the following code if the boolean value is True (implicitly).
        # The cumulative percent is only calculated when the aggregate is 'sum', 'mean',
        # 'median', 'mode', or 'count'.
        
        # Create a column series for the cumulative sum:
        cumsum_col = response_var_name + "_cumsum"
        DATASET[cumsum_col] = DATASET[response_var_name].cumsum()
        
        # total sum is the last element from this series
        # (i.e. the element with index len(DATASET) - 1)
        total_sum = DATASET[cumsum_col][(len(DATASET) - 1)]
        
        # Now, create a column for the accumulated percent
        # by dividing cumsum_col by total_sum and multiplying it by
        # 100 (%):
        cum_pct_col = response_var_name + "_cum_pct"
        DATASET[cum_pct_col] = (DATASET[cumsum_col])/(total_sum) * 100
        print(f"Successfully calculated cumulative sum and cumulative percent correspondent to the response variable {response_var_name}.")
    
    print("Successfully aggregated and ordered the dataset to plot. Check the 10 first rows of this returned dataset:\n")
    
    try:
        # only works in Jupyter Notebook:
        from IPython.display import display
        display(DATASET.head(10))
            
    except: # regular mode
        print(DATASET.head(10))
    
    # Check if the total of plotted categories is limited:
    if not (limit_of_plotted_categories is None):
        
        # Since the value is not None, we have to limit it
        # Check if the length of the dataframe is lower than or equal to the limit.
        # If it is, we simply copy the columns to the series (there is no need of
        # a memory-consuming loop or of applying the head method to a local copy
        # of the dataframe):
        df_length = len(DATASET)
            
        if (df_length <= limit_of_plotted_categories):
            # Simply copy the columns to the graphic series:
            categories = DATASET[categorical_var_name]
            responses = DATASET[response_var_name]
            # If there is a cum_pct column, copy it to a series too:
            if (calculate_and_plot_cumulative_percent):
                cum_pct = DATASET[cum_pct_col]
        
        else:
            # The limit is lower than the total of categories,
            # so we actually have to limit the size of plotted df:
        
            # If aggregate_function is not 'sum', we simply apply
            # the head method to obtain the first rows (number of
            # rows input as parameter; if no parameter is input, the
            # number of 5 rows is used):
            
            # Limit to the number limit_of_plotted_categories:
            # create another local copy of the dataframe not to
            # modify the returned dataframe object:
            plotted_df = DATASET.copy(deep = True).head(limit_of_plotted_categories)

            # Create the series of elements to plot:
            categories = list(plotted_df[categorical_var_name])
            responses = list(plotted_df[response_var_name])
            # If the cumulative percent was obtained, create the series for it:
            if (calculate_and_plot_cumulative_percent):
                cum_pct = list(plotted_df[cum_pct_col])
            
            # Start variable to store the aggregates from the others:
            other_responses = 0
            
            # Loop through each row from DATASET:
            for i in range(0, len(DATASET)):
                
                # Check if the category is not in categories:
                category = DATASET[categorical_var_name][i]
                
                if (category not in categories):
                    
                    # sum the value in the response variable to other_responses:
                    other_responses = other_responses + DATASET[response_var_name][i]
            
            # Now we finished the sum of the other responses, let's add these elements to
            # the lists:
            categories.append("others")
            responses.append(other_responses)
            # If there is a cumulative percent, append 100% to the list: the kept
            # categories plus the ones aggregated as 'others' must total 100%.
            if (calculate_and_plot_cumulative_percent):
                cum_pct.append(100)
    
    else:
        # This is the situation where there is no limit of plotted categories. So, we
        # simply copy the columns to the plotted series (it is equivalent to the 
        # situation where there is a limit, but the limit is equal or inferior to the
        # size of the dataframe):
        categories = DATASET[categorical_var_name]
        responses = DATASET[response_var_name]
        # If there is a cum_pct column, copy it to a series too:
        if (calculate_and_plot_cumulative_percent):
            cum_pct = DATASET[cum_pct_col]
    
    
    # Now the data is prepared and we only have to plot 
    # categories, responses, and cum_pct:
    
    # Let's put a small degree of transparency (1 - OPACITY) = 0.05 = 5%
    # so that the bars do not completely block other views.
    OPACITY = 0.95
    
    # Set labels and titles for the case they are None
    if (plot_title is None):
        
        if (aggregate_function == 'count'):
            # The graph is the same count, no matter the response
            plot_title = f"Bar_chart_count_of_{categorical_var_name}"
        
        else:
            plot_title = f"Bar_chart_for_{response_var_name}_by_{categorical_var_name}"
    
    if (horizontal_axis_title is None):

        horizontal_axis_title = categorical_var_name

    if (vertical_axis_title is None):
        # Notice that response_var_name already has the suffix indicating the
        # aggregation function
        vertical_axis_title = response_var_name
    
    # Set the figure size in inches (width, height) for displaying in the notebook's cell:
    fig, ax1 = plt.subplots(figsize = (12, 8))

    # Rotate the X axis tick labels by x_axis_rotation degrees (default: 70):
    plt.xticks(rotation = x_axis_rotation)
    # Rotate the Y axis tick labels by y_axis_rotation degrees (default: 0):
    plt.yticks(rotation = y_axis_rotation)
    
    plt.title(plot_title)
    
    if (orientation == 'horizontal'):
        
        # invert the axes in relation to the default (vertical, below)
        ax1.set_ylabel(horizontal_axis_title)
        ax1.set_xlabel(vertical_axis_title, color = 'darkblue')
        
        # Horizontal bars used - barh method (bar horizontal):
        # https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.barh.html
        # Now, the categorical variables stored in series categories must be
        # positioned as the vertical axis Y, whereas the correspondent responses
        # must be in the horizontal axis X.
        ax1.barh(categories, responses, color = 'darkblue', alpha = OPACITY, label = categorical_var_name)
        #.barh(y, x, ...)
        
        if (calculate_and_plot_cumulative_percent):
            # Let's plot the line for the cumulative percent
            # Set the grid for the bar chart as False. If it is True, there will
            # be two grids, one for the bars and another for the percents, making 
            # the image difficult to interpret:
            ax1.grid(False)
            
            # Create the twin plot for the cumulative percent:
            # for the vertical orientation, we use the twinx. Here, we use twiny
            ax2 = ax1.twiny()
            # Here, the x axis must be the cum_pct value, and the Y
            # axis must be categories (it must be correspondent to the
            # bar chart)
            ax2.plot(cum_pct, categories, '-ro', label = "cumulative\npercent")
            #.plot(x, y, ...)
            ax2.tick_params('x', color = 'red')
            ax2.set_xlabel("Cumulative Percent (%)", color = 'red')
            ax2.legend()
            ax2.grid(grid) # shown if user set grid = True
            # If user wants to see the grid, it is shown only for the cumulative line.
        
        else:
            # There is no cumulative line, so the parameter grid must control 
            # the bar chart's grid
            ax1.legend()
            ax1.grid(grid)
        
    else: 
        
        ax1.set_xlabel(horizontal_axis_title)
        ax1.set_ylabel(vertical_axis_title, color = 'darkblue')
        # If None or an invalid orientation was used, set it as vertical
        # Use Matplotlib standard bar method (vertical bar):
        # https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.bar.html#matplotlib.pyplot.bar
        
        # In this standard case, the categorical variables (categories) are positioned
        # as X, and the responses as Y:
        ax1.bar(categories, responses, color = 'darkblue', alpha = OPACITY, label = categorical_var_name)
        #.bar(x, y, ...)
        
        if (calculate_and_plot_cumulative_percent):
            # Let's plot the line for the cumulative percent
            # Set the grid for the bar chart as False. If it is True, there will
            # be two grids, one for the bars and another for the percents, making 
            # the image difficult to interpret:
            ax1.grid(False)
            
            # Create the twin plot for the cumulative percent:
            ax2 = ax1.twinx()
            ax2.plot(categories, cum_pct, '-ro', label = "cumulative\npercent")
            #.plot(x, y, ...)
            ax2.tick_params('y', color = 'red')
            ax2.set_ylabel("Cumulative Percent (%)", color = 'red', rotation = 270)
            # rotate the twin axis so that its label is inverted in relation to the main
            # vertical axis.
            ax2.legend()
            ax2.grid(grid) # shown if user set grid = True
            # If user wants to see the grid, it is shown only for the cumulative line.
        
        else:
            # There is no cumulative line, so the parameter grid must control 
            # the bar chart's grid
            ax1.legend()
            ax1.grid(grid)
    
    # Notice that the .plot method is used for generating the cumulative percent line
    # for both orientations. It is different from .bar and .barh, which specify the
    # orientation of a bar; or .axhline (creation of a horizontal constant line); or
    # .axvline (creation of a vertical constant line).
    
    # Now the parameters specific to the configurations are finished, so we can go back
    # to the general code:
    
    if (export_png == True):
        # Image will be exported
        import os
        
        #check if the user defined a directory path. If not, set as the default root path:
        if (directory_to_save is None):
            #set as the default
            directory_to_save = ""
        
        #check if the user defined a file name. If not, set as the default name for this
        # function.
        if (file_name is None):
            #set as the default
            file_name = "bar_chart"
        
        #check if the user defined an image resolution. If not, set as the default 330 dpi
        # resolution.
        if (png_resolution_dpi is None):
            #set as 330 dpi
            png_resolution_dpi = 330
        
        #Get the new_file_path. Since the format is explicitly passed to savefig, the
        # method uses the file name verbatim, so the '.png' extension is added here:
        new_file_path = os.path.join(directory_to_save, file_name + ".png")
        
        #Export the file to this new path:
        plt.savefig(new_file_path, dpi = png_resolution_dpi, format = 'png', transparent = False) 
        #format (str, supported formats) = 'png', 'pdf', 'ps', 'eps' or 'svg'
        #transparent = True or False
        # The quality argument applied only to JPEG files and was removed from recent
        # Matplotlib versions, so it is not passed here.
        # For other parameters of .savefig method, check https://indianaiproduction.com/matplotlib-savefig/
        print (f"Figure exported as '{new_file_path}'. Any previous file with this name in this path was overwritten.")
    
    #fig.tight_layout()
    
    ## Show an image read from an image file:
    ## import matplotlib.image as pltimg
    ## img=pltimg.imread('mydecisiontree.png')
    ## imgplot = plt.imshow(img)
    ## See linkedIn Learning course: "Supervised machine learning and the technology boom",
    ##  Ex_Files_Supervised_Learning, Exercise Files, lesson '03. Decision Trees', '03_05', 
    ##  '03_05_END.ipynb'
    plt.show()
    
    return DATASET
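
The aggregation of the categories above the plotting limit into a single 'others' bar, performed by the function above, can also be written without explicit loops. The sketch below is only an illustration under stated assumptions, not the notebook's implementation: the helper name `collapse_to_others` and the toy column names are hypothetical; the dataframe is assumed to be already aggregated (one row per category) and sorted; and summing the remaining rows is assumed to be meaningful (sum or count aggregates).

```python
import pandas as pd

def collapse_to_others(df, category_col, response_col, limit):
    # Hypothetical helper (not part of the notebook): keep the first `limit`
    # rows and aggregate every remaining row into a single 'others' entry.
    if limit is None or len(df) <= limit:
        # Nothing to collapse: return the columns as plain lists
        return list(df[category_col]), list(df[response_col])

    kept = df.head(limit)
    categories = list(kept[category_col]) + ["others"]
    # iloc[limit:] selects all rows after the kept ones, including the last row
    others_total = df[response_col].iloc[limit:].sum()
    responses = list(kept[response_col]) + [others_total]
    return categories, responses

# Toy data, already sorted in descending order of the response:
toy = pd.DataFrame({"defect": ["a", "b", "c", "d"], "count": [50, 30, 15, 5]})
categories, responses = collapse_to_others(toy, "defect", "count", limit = 2)
print(categories, responses)
```

Slicing with `iloc[limit:]` avoids the off-by-one risk of a manual `range` loop, since the slice always runs to the last row of the dataframe.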

# **Function for time series visualization**
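
The function in the next cell accepts data either in a single column of values accompanied by a column of labels, or as a list of dictionaries with the keys 'x', 'y' and 'lab'. The sketch below shows, as an illustration only, how the first format maps onto the second. The helper name `long_to_series_list` and the toy data are hypothetical and are not part of the notebook; the column names follow the 'time', 'results' and 'group' example used in the function's comments.

```python
import pandas as pd

def long_to_series_list(df, x_col, y_col, label_col):
    # Hypothetical helper (not part of the notebook): build one dictionary
    # {'x', 'y', 'lab'} per label, sorted by x and then by y, with the values
    # stored as NumPy arrays.
    series_list = []
    for lab, group in df.groupby(label_col, sort = True):
        group = group.sort_values(by = [x_col, y_col])
        series_list.append({
            "x": group[x_col].to_numpy(),
            "y": group[y_col].to_numpy(),
            "lab": lab,
        })
    return series_list

# Toy data: two groups, A and B, with unordered timestamps
toy = pd.DataFrame({
    "time": pd.to_datetime(["2022-01-02", "2022-01-01", "2022-01-01", "2022-01-02"]),
    "results": [2.0, 1.0, 10.0, 20.0],
    "group": ["A", "A", "B", "B"],
})
series_list = long_to_series_list(toy, "time", "results", "group")
for series in series_list:
    print(series["lab"], series["y"])
```

Each dictionary in `series_list` has the same keys expected in `list_of_dictionaries_with_series_to_analyze`, so both input modes can share the same plotting code.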

In [None]:
def time_series_vis (data_in_same_column = False, df = None, column_with_predict_var_x = None, column_with_response_var_y = None, column_with_labels = None, list_of_dictionaries_with_series_to_analyze = [{'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}], x_axis_rotation = 70, y_axis_rotation = 0, grid = True, add_splines_lines = True, add_scatter_dots = False, horizontal_axis_title = None, vertical_axis_title = None, plot_title = None, export_png = False, directory_to_save = None, file_name = None, png_resolution_dpi = 330):
     
    import random
    # Python Random documentation:
    # https://docs.python.org/3/library/random.html?msclkid=9d0c34b2d13111ec9cfa8ddaee9f61a1
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import matplotlib.colors as mcolors
    
    # matplotlib.colors documentation:
    # https://matplotlib.org/3.5.0/api/colors_api.html?msclkid=94286fa9d12f11ec94660321f39bf47f
    
    # Matplotlib list of colors:
    # https://matplotlib.org/stable/gallery/color/named_colors.html?msclkid=0bb86abbd12e11ecbeb0a2439e5b0d23
    # Matplotlib colors tutorial:
    # https://matplotlib.org/stable/tutorials/colors/colors.html
    # Matplotlib example of Python code using matplotlib.colors:
    # https://matplotlib.org/stable/_downloads/0843ee646a32fc214e9f09328c0cd008/colors.py
    # Same example as Jupyter Notebook:
    # https://matplotlib.org/stable/_downloads/2a7b13c059456984288f5b84b4b73f45/colors.ipynb
    
        
    # data_in_same_column = False: set as True if all the values to plot are in a same column.
    # If data_in_same_column = True, you must specify the dataframe containing the data as df;
    # the column containing the predict variable (X) as column_with_predict_var_x; the column 
    # containing the responses to plot (Y) as column_with_response_var_y; and the column 
    # containing the labels (subgroup) indication as column_with_labels. 
    # df is an object, so do not declare it in quotes. The other three arguments (columns' names) 
    # are strings, so declare in quotes. 
    
    # Example: suppose you have a dataframe saved as dataset, and two groups A and B to compare. 
    # All the results for both groups are in a column named 'results', which will be plotted against
    # the time, saved as 'time' (X = 'time'; Y = 'results'). If the result is for
    # an entry from group A, then a column named 'group' has the value 'A'. If it is for group B,
    # column 'group' shows the value 'B'. In this example:
    # data_in_same_column = True,
    # df = dataset,
    # column_with_predict_var_x = 'time',
    # column_with_response_var_y = 'results', 
    # column_with_labels = 'group'
    # If you want to declare a list of dictionaries, keep data_in_same_column = False and keep
    # df = None (the other arguments may be set as None, but it is not mandatory: 
    # column_with_predict_var_x = None, column_with_response_var_y = None, column_with_labels = None).
    

    # Parameter to input when DATA_IN_SAME_COLUMN = False:
    # list_of_dictionaries_with_series_to_analyze: if data is already converted to series, lists
    # or arrays, provide them as a list of dictionaries. It must be declared as a list, in brackets,
    # even if there is a single dictionary.
    # Use always the same keys: 'x' for the X-series (predict variables); 'y' for the Y-series
    # (response variables); and 'lab' for the labels. If you do not want to declare a series, simply
    # keep as None, but do not remove or rename a key (ALWAYS USE THE KEYS SHOWN AS MODEL).
    # If you want, you can remove elements (dictionaries) from the list to declare fewer elements;
    # and you can also add more elements (dictionaries) to the lists, if you need to plot more series.
    # Simply put a comma after the last element from the list and declare a new dictionary, keeping the
    # same keys: {'x': x_series, 'y': y_series, 'lab': label}, where x_series, y_series and label
    # represents the series and label of the added dictionary (you can pass 'lab': None, but if 
    # 'x' or 'y' are None, the new dictionary will be ignored).
    
    # Examples:
    # list_of_dictionaries_with_series_to_analyze = 
    # [{'x': DATASET['X'], 'y': DATASET['Y'], 'lab': 'label'}]
    # will plot a single variable. In turn:
    # list_of_dictionaries_with_series_to_analyze = 
    # [{'x': DATASET['X'], 'y': DATASET['Y1'], 'lab': 'label'}, {'x': DATASET['X'], 'y': DATASET['Y2'], 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}]
    # will plot two series, Y1 x X and Y2 x X.
    # Notice that all dictionaries where 'x' or 'y' are None are automatically ignored.
    # If None is provided to 'lab', an automatic label will be generated.
    
    
    # List the possible numeric data types for a Pandas dataframe column:
    numeric_dtypes = [np.int16, np.int32, np.int64, np.float16, np.float32, np.float64]
    
    if (data_in_same_column == True):
        
        print("Data to be plotted in a same column.\n")
        
        if (df is None):
            
            print("Please, input a valid dataframe as df.\n")
            list_of_dictionaries_with_series_to_analyze = []
            # The code will check the size of this list on the next block.
            # If it is zero, code is simply interrupted.
            # Instead of returning an error, we use this code structure that can be applied
            # on other graphic functions that do not return a summary (and so we should not
            # return a value like 'error' to interrupt the function).
        
        elif (column_with_predict_var_x is None):
            
            print("Please, input a valid column name as column_with_predict_var_x.\n")
            list_of_dictionaries_with_series_to_analyze = []
           
        elif (column_with_response_var_y is None):
            
            print("Please, input a valid column name as column_with_response_var_y.\n")
            list_of_dictionaries_with_series_to_analyze = []
        
        else:
            
            # set a local copy of the dataframe:
            DATASET = df.copy(deep = True)
            
            if (column_with_labels is None):
            
                print("Using the whole series (column) for plotting.\n")
                column_with_labels = 'whole_series_' + column_with_response_var_y
                DATASET[column_with_labels] = column_with_labels
            
            # Sort DATASET by column_with_predict_var_x, then by column_with_labels,
            # and then by column_with_response_var_y, all in ascending order.
            # Since we sort by label (group), it is easier to separate the groups.
            DATASET = DATASET.sort_values(by = [column_with_predict_var_x, column_with_labels, column_with_response_var_y], ascending = [True, True, True])
            
            # Reset indices:
            DATASET = DATASET.reset_index(drop = True)
            
            # If column_with_predict_var_x is not numeric, the user may be trying to pass a date as x. 
            # So, let's try to convert it to datetime:
            if ((DATASET[column_with_predict_var_x]).dtype not in numeric_dtypes):
                  
                try:
                    DATASET[column_with_predict_var_x] = (DATASET[column_with_predict_var_x]).astype('datetime64[ns]')
                    print("Variable X successfully converted to datetime64[ns].\n")
                    
                except Exception:
                    # The conversion failed, so keep the original values
                    pass
            
            # Get the unique values of the labels, and save them as a list using the
            # list constructor:
            unique_labels = list(DATASET[column_with_labels].unique())
            print(f"{len(unique_labels)} different labels detected: {unique_labels}.\n")
            
            # Start a list to store the dictionaries containing the keys:
            # 'x': list of predict variables; 'y': list of responses; 'lab': the label (group)
            list_of_dictionaries_with_series_to_analyze = []
            
            # Loop through each possible label:
            for lab in unique_labels:
                # loop through each element from the list unique_labels, referred as lab
                
                # Set a filter for the dataset, to select only rows correspondent to that
                # label:
                boolean_filter = (DATASET[column_with_labels] == lab)
                
                # Create a copy of the dataset, with entries selected by that filter:
                ds_copy = (DATASET[boolean_filter]).copy(deep = True)
                # Sort again by X and Y, to guarantee the results are in order:
                ds_copy = ds_copy.sort_values(by = [column_with_predict_var_x, column_with_response_var_y], ascending = [True, True])
                # Restart the index of the copy:
                ds_copy = ds_copy.reset_index(drop = True)
                
                # Re-extract the X and Y series and convert them to NumPy arrays 
                # (these arrays will be important later in the function):
                x = np.array(ds_copy[column_with_predict_var_x])
                y = np.array(ds_copy[column_with_response_var_y])
            
                # Then, create the dictionary:
                dict_of_values = {'x': x, 'y': y, 'lab': lab}
                
                # Now, append dict_of_values to list_of_dictionaries_with_series_to_analyze:
                list_of_dictionaries_with_series_to_analyze.append(dict_of_values)
                
            # Now, we have a list of dictionaries with the same format of the input list.
            
    else:
        
        # The user input a list_of_dictionaries_with_series_to_analyze
        # Create a support list:
        support_list = []
        
        # Loop through each element on the list list_of_dictionaries_with_series_to_analyze:
        
        for i in range (0, len(list_of_dictionaries_with_series_to_analyze)):
            # from i = 0 to i = len(list_of_dictionaries_with_series_to_analyze) - 1, index of the
            # last element from the list
            
            # pick the i-th dictionary from the list:
            dictionary = list_of_dictionaries_with_series_to_analyze[i]
            
            # access 'x', 'y', and 'lab' keys from the dictionary:
            x = dictionary['x']
            y = dictionary['y']
            lab = dictionary['lab']
            # x and y may be Pandas series, NumPy arrays, or lists, as long as they can
            # be converted to a Pandas series, which provides the astype method:
            # https://www.askpython.com/python/built-in-methods/python-astype?msclkid=8f3de8afd0d411ec86a9c1a1e290f37c
            
            # check if at least x and y are not None:
            if ((x is not None) and (y is not None)):
                
                # Wrap x in a Pandas series, so that the dtype attribute and the astype
                # method are available even if the user passed a list:
                x = pd.Series(x)
                
                # If x is not numeric, the user may be trying to pass a date as x. 
                # So, let's try to convert it to datetime:
                if (x.dtype not in numeric_dtypes):

                    try:
                        x = (x).astype('datetime64[ns]')
                        print(f"Variable X from {i}-th dictionary successfully converted to datetime64[ns].\n")

                    except Exception:
                        # The conversion failed, so keep the original values
                        pass
                
                # Possibly, x and y are not ordered. Firstly, let's merge them into a temporary
                # dataframe to be able to order them together.
                # Converting x and y with the list constructor makes the values independent
                # from their origin, even if they came from a Pandas dataframe. Then, the
                # temporary dataframe may be manipulated and sorted without the risk of
                # modifying the original data:
                
                temp_df = pd.DataFrame(data = {'x': list(x), 'y': list(y)})
                # sort this dataframe by 'x' and 'y':
                temp_df = temp_df.sort_values(by = ['x', 'y'], ascending = [True, True])
                # restart index:
                temp_df = temp_df.reset_index(drop = True)
                
                # Re-extract the X and Y series and convert them to NumPy arrays 
                # (these arrays will be important later in the function):
                x = np.array(temp_df['x'])
                y = np.array(temp_df['y'])
                
                # check if lab is None:
                if (lab is None):
                    # input a default label.
                    # Use the str function to convert the integer to string, allowing it
                    # to be concatenated
                    lab = "X" + str(i) + "_x_" + "Y" + str(i)
                    
                # Then, create the dictionary:
                dict_of_values = {'x': x, 'y': y, 'lab': lab}
                
                # Now, append dict_of_values to support list:
                support_list.append(dict_of_values)
            
        # Now, support_list contains only the dictionaries with valid entries, as well
        # as labels for each collection of data. The values are independent from their origin,
        # and now they are ordered and in the same format of the data extracted directly from
        # the dataframe.
        # So, make the list_of_dictionaries_with_series_to_analyze the support_list itself:
        list_of_dictionaries_with_series_to_analyze = support_list
        print(f"{len(list_of_dictionaries_with_series_to_analyze)} valid series input.\n")

        
    # Now that both methods of input resulted in the same format of list, we can process both
    # with the same code.
    
    # Each dictionary in list_of_dictionaries_with_series_to_analyze represents a series to
    # plot. So, the total of series to plot is:
    total_of_series = len(list_of_dictionaries_with_series_to_analyze)
    
    if (total_of_series <= 0):
        
        print("No valid series to plot. Please, provide valid arguments.\n")
    
    else:
        
        # Continue to plotting.
        # Notice that all the series were sorted after they were separated and before
        # being added to the dictionaries. Also, the timestamps were converted to datetime64.
        # So, list_of_dictionaries_with_series_to_analyze now contains all series
        # converted to NumPy arrays, with timestamps parsed as datetimes.
        # This list will be the object returned at the end of the function. Since it is a
        # JSON-formatted list, we can use the function json_obj_to_pandas_dataframe to convert
        # it to a Pandas dataframe.
        
        
        # Now, we can plot the figure.
        # We set alpha = 0.95 (opacity) to give a degree of transparency (5%), 
        # so that one series does not completely block the visualization of the others.
        
        # Let's retrieve the list of Matplotlib CSS colors:
        css4 = mcolors.CSS4_COLORS
        # css4 is a dictionary of colors: {'aliceblue': '#F0F8FF', 'antiquewhite': '#FAEBD7', ...}
        # Each key of this dictionary is a color name to be passed as argument color on the plot
        # function. So let's retrieve the keys and use the list constructor to convert them
        # to a list of colors:
        list_of_colors = list(css4.keys())
        
        # On 11 May 2022, this list of colors had 148 different elements.
        # Since this list is in alphabetic order, let's create a random order for the colors.
        
        # Function random.sample(input_sequence, k = number_of_samples): 
        # this function creates a list with number_of_samples elements (an integer), obtained by
        # randomly selecting elements from input_sequence without replacement, so no element
        # is picked twice.
        
        # Function random.choices(input_sequence, k = number_of_samples):
        # similarly, randomly selects k elements from input_sequence, but with replacement,
        # so the same color could be repeated while others would be left out.
        # Since we want to simply randomly sort the sequence, we use random.sample with
        # k = len(input_sequence) to obtain the randomly sorted sequence:
        list_of_colors = random.sample(list_of_colors, k = len(list_of_colors))
        # Now, we have a random list of colors to use for plotting the charts
        
        if (add_splines_lines == True):
            LINE_STYLE = '-'

        else:
            LINE_STYLE = ''
        
        if (add_scatter_dots == True):
            MARKER = 'o'
            
        else:
            MARKER = ''
        
        # Matplotlib linestyle:
        # https://matplotlib.org/stable/gallery/lines_bars_and_markers/linestyles.html?msclkid=68737f24d16011eca9e9c4b41313f1ad
        
        if (plot_title is None):
            # Set graphic title
            plot_title = f"Y_x_timestamp"

        if (horizontal_axis_title is None):
            # Set horizontal axis title
            horizontal_axis_title = "timestamp"

        if (vertical_axis_title is None):
            # Set vertical axis title
            vertical_axis_title = "Y"
        
        # Let's put a small degree of transparency (1 - OPACITY) = 0.05 = 5%
        # so that one series does not completely block the view of the others.
        OPACITY = 0.95
        
        # Set the figure size in inches (width, height) for displaying in the notebook's cell:
        fig = plt.figure(figsize = (12, 8))
        ax = fig.add_subplot()

        i = 0 # Restart counting for the loop of colors
        
        # Loop through each dictionary from list_of_dictionaries_with_series_to_analyze:
        for dictionary in list_of_dictionaries_with_series_to_analyze:
            
            # Try selecting a color from list_of_colors:
            try:
                
                COLOR = list_of_colors[i]
                # Go to the next element i, so that the next plot will use a different color:
                i = i + 1
            
            except IndexError:
                
                # This error will be raised if list index is out of range, 
                # i.e. if i >= len(list_of_colors) - we used all colors from the list (at least 148).
                # So, return the index to zero to restart the colors from the beginning:
                i = 0
                COLOR = list_of_colors[i]
                i = i + 1
            
            # Access the arrays and label from the dictionary:
            X = dictionary['x']
            Y = dictionary['y']
            LABEL = dictionary['lab']
            
            # Plot the series as lines and/or markers, according to LINE_STYLE and MARKER:
            ax.plot(X, Y, linestyle = LINE_STYLE, marker = MARKER, color = COLOR, alpha = OPACITY, label = LABEL)
            # Axes.plot documentation:
            # https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.plot.html?msclkid=42bc92c1d13511eca8634a2c93ab89b5
            
            # x and y are positional arguments: they are specified by their position in function
            # call, not by an argument name like 'marker'.
            
            # Matplotlib markers:
            # https://matplotlib.org/stable/api/markers_api.html?msclkid=36c5eec5d16011ec9583a5777dc39d1f
            
        # Now we finished plotting all of the series, we can set the general configuration:
        
        # Rotate the X axis tick labels by x_axis_rotation degrees (default: 70):
        plt.xticks(rotation = x_axis_rotation)
        # Rotate the Y axis tick labels by y_axis_rotation degrees (default: 0):
        plt.yticks(rotation = y_axis_rotation)

        ax.set_title(plot_title)
        ax.set_xlabel(horizontal_axis_title)
        ax.set_ylabel(vertical_axis_title)

        ax.grid(grid) # show grid or not
        ax.legend(loc = 'upper left')
        # position options: 'upper right'; 'upper left'; 'lower left'; 'lower right';
        # 'right', 'center left'; 'center right'; 'lower center'; 'upper center', 'center'
        # https://www.statology.org/matplotlib-legend-position/

        if (export_png == True):
            # Image will be exported
            import os

            #check if the user defined a directory path. If not, set as the default root path:
            if (directory_to_save is None):
                #set as the default
                directory_to_save = ""

            #check if the user defined a file name. If not, set as the default name for this
            # function.
            if (file_name is None):
                #set as the default
                file_name = "time_series_vis"

            #check if the user defined an image resolution. If not, set as the default 330 dpi
            # resolution.
            if (png_resolution_dpi is None):
                #set as 330 dpi
                png_resolution_dpi = 330

            #Get the new_file_path, explicitly adding the '.png' extension, since savefig
            # uses the file name verbatim when the format argument is set:
            new_file_path = os.path.join(directory_to_save, file_name + ".png")

            #Export the file to this new path:
            plt.savefig(new_file_path, dpi = png_resolution_dpi, format = 'png', transparent = False) 
            #The 'quality' argument only affected JPEG output and is not accepted by recent
            # Matplotlib versions, so it is not passed here.
            #format (str, supported formats) = 'png', 'pdf', 'ps', 'eps' or 'svg'
            #transparent = True or False
            # For other parameters of .savefig method, check https://indianaiproduction.com/matplotlib-savefig/
            print (f"Figure exported as \'{new_file_path}\'. Any previous file in this root path was overwritten.")

        #Set the figure size (width, height), in inches, for printing in the notebook's cell:
        #plt.figure(figsize = (12, 8))
        #fig.tight_layout()

        ## Show an image read from an image file:
        ## import matplotlib.image as pltimg
        ## img=pltimg.imread('mydecisiontree.png')
        ## imgplot = plt.imshow(img)
        ## See linkedIn Learning course: "Supervised machine learning and the technology boom",
        ##  Ex_Files_Supervised_Learning, Exercise Files, lesson '03. Decision Trees', '03_05', 
        ##  '03_05_END.ipynb'
        plt.show()
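
The color selection at the start of the plotting loop restarts from the first color once the list is exhausted. A minimal sketch of the same wrap-around using modular indexing instead of `try`/`except IndexError` is shown below. The palette and the series labels are hypothetical values chosen only for illustration.

```python
# Hypothetical palette and series labels, used only for illustration.
list_of_colors = ['darkblue', 'crimson', 'green']
series_labels = ['series_1', 'series_2', 'series_3', 'series_4', 'series_5']

# i % len(list_of_colors) wraps back to 0 once the palette is exhausted,
# which reproduces the reset performed in the except IndexError branch.
assigned_colors = {
    label: list_of_colors[i % len(list_of_colors)]
    for i, label in enumerate(series_labels)
}

for label, color in assigned_colors.items():
    print(f"{label}: {color}")
```

With three colors and five series, the fourth and fifth series reuse the first and second colors.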

# **Function for column filtering (selecting); ordering; or renaming all columns**

In [None]:
def select_order_or_rename_columns (df, columns_list, mode = 'select_or_order_columns'):
    
    import numpy as np
    import pandas as pd
    
    # MODE = 'select_or_order_columns' for filtering only the list of columns passed as columns_list,
    # and setting a new column order. In this mode, you can pass the columns in any order: 
    # the order of elements on the list will be the new order of columns.

    # MODE = 'rename_columns' for renaming the columns with the names passed as columns_list. In this
    # mode, the list must have same length and same order of the columns of the dataframe. That is because
    # the columns will sequentially receive the names in the list. So, a mismatching of positions
    # will result into columns with incorrect names.
    
    # columns_list = list of strings containing the names (headers) of the columns to select
    # (filter); or to be set as the new columns' names, according to the selected mode.
    # For instance: columns_list = ['col1', 'col2', 'col3'] will 
    # select columns 'col1', 'col2', and 'col3' (or rename the columns with these names). 
    # Declare the names inside quotes.
    
    # Set a local copy of the dataframe to manipulate:
    DATASET = df.copy(deep = True)
    
    print(f"Original columns in the dataframe:\n{DATASET.columns}\n")
    
    if (columns_list is None):
        # No list was provided: treat it as an empty list.
        # (A comparison with np.nan is never True, so it is not used here.)
        columns_list = []
    
    if (len(columns_list) == 0):
        print("Please, input a valid list of columns.\n")
        return DATASET
    
    if (mode == 'select_or_order_columns'):
        
        #filter the dataframe so that it will contain only the columns in columns_list.
        DATASET = DATASET[columns_list]
        print("Dataframe filtered according to the list provided.\n")
        print("Check the new dataframe:\n")
        
        try:
            # only works in Jupyter Notebook:
            from IPython.display import display
            display(DATASET)

        except: # regular mode
            print(DATASET)
        
    elif (mode == 'rename_columns'):
        
        # Check if the number of columns of the dataset is equal to the number of elements
        # of the new list. This avoids raising an exception.
        boolean_filter = (len(columns_list) == len(DATASET.columns))
        
        if (boolean_filter == False):
            #Impossible to rename: the numbers of elements are different.
            print("The number of columns of the dataframe is different from the number of elements of the list. Please, provide a list with a number of elements equal to the number of columns.\n")
            return DATASET
        
        else:
            #Same number of elements, so that we can update the columns' names.
            DATASET.columns = columns_list
            print("Dataframe columns renamed according to the list provided.\n")
            print("Warning: the substitution is element-wise: the first element of the list is now the name of the first column, and so on, ..., so that the last element is the name of the last column.\n")
            print("Check the new dataframe:\n")
            try:
                # only works in Jupyter Notebook:
                from IPython.display import display
                display(DATASET)

            except: # regular mode
                print(DATASET)
        
    else:
        print("Enter a valid mode: \'select_or_order_columns\' or \'rename_columns\'.")
        return DATASET
    
    return DATASET
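
The two modes of `select_order_or_rename_columns` correspond to two plain pandas operations. The sketch below reproduces them on a small, hypothetical dataframe (the column names and values are illustrative only) without calling the function itself.

```python
import pandas as pd

# Hypothetical dataframe, used only for illustration.
df = pd.DataFrame({'col1': [1, 2], 'col2': [3.0, 4.0], 'col3': ['a', 'b']})

# 'select_or_order_columns' behaviour: keep only the listed columns,
# in the order in which they appear in the list.
selected_df = df[['col3', 'col1']]

# 'rename_columns' behaviour: names are assigned by position, so the list
# must have the same length as the number of columns.
new_names = ['temperature', 'pressure', 'batch']
renamed_df = df.copy(deep = True)
if len(new_names) == len(renamed_df.columns):
    renamed_df.columns = new_names

print(list(selected_df.columns))
print(list(renamed_df.columns))
```

Because the renaming is positional, passing the names in a different order would silently mislabel the columns, which is why the function prints a warning after renaming.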

# **Function for reversing the log-transform - applying the exponential transformation**

In [None]:
def reverse_log_transform (df, subset = None, create_new_columns = True, new_columns_suffix = "_originalScale"):
    
    import numpy as np
    import pandas as pd
    
    #### NOTE: The exponential transformation is defined for every real number, so no rows
    #### are removed here (unlike the log-transform, which requires strictly positive values).
    
    # subset = None
    # Set subset = None to transform the whole dataset. Alternatively, pass a list with 
    # columns names for the transformation to be applied. For instance:
    # subset = ['col1', 'col2', 'col3'] will apply the transformation to the columns named
    # as 'col1', 'col2', and 'col3'. Declare the names inside quotes.
    # Declaring the full list of columns is equivalent to setting subset = None.
    
    # create_new_columns = True
    # Set create_new_columns = True to store the transformed data into new
    # columns. Or set create_new_columns = False to overwrite the existing columns.
    
    # new_columns_suffix = "_originalScale"
    # This value has effect only if create_new_columns = True.
    # The new column name will be set as column + new_columns_suffix. Then, if the original
    # column was "column1" and the suffix is "_originalScale", the new column will be named 
    # as "column1_originalScale".
    # Alternatively, input inside quotes a string with the desired suffix. Recommendation:
    # start the suffix with "_" to separate it from the original name.
    
    
    # Start a local copy of the dataframe:
    DATASET = df.copy(deep = True)
    
    # Check if a subset was defined. If so, make columns_list = subset 
    if not (subset is None):
        
        columns_list = subset
    
    else:
        #There is no declared subset. Then, make columns_list equal to the list of
        # all columns of the dataframe. Non-numeric columns are filtered out below.
        columns_list = list(DATASET.columns)
        
    # Let's check if there are categorical columns in columns_list. Only numerical
    # columns should remain
    # Start a support list:
    support_list = []
    # List the possible numeric data types for a Pandas dataframe column:
    numeric_dtypes = [np.int16, np.int32, np.int64, np.float16, np.float32, np.float64]

    # Loop through each column in columns_list:
    for column in columns_list:
        
        # Check the Pandas series (column) data type:
        column_type = DATASET[column].dtype
            
        # If it is not categorical (object), append it to the support list:
        if (column_type in numeric_dtypes):
                
            support_list.append(column)
    
    # Finally, make the columns_list support_list itself:
    columns_list = support_list
    
    #Loop through each column to apply the transform:
    for column in columns_list:
        #access each element in the list column_list. The element is named 'column'.
        
        # The exponential transformation can be applied to zero and negative values,
        # so we remove the boolean filter.
        
        #Check if a new column will be created, or if the original column should be
        # substituted.
        if (create_new_columns == True):
            # Create a new column.
            
            # The new column name will be set as column + new_columns_suffix
            new_column_name = column + new_columns_suffix
        
        else:
            # Overwrite the existing column. Simply set new_column_name as the value 'column'
            new_column_name = column
        
        # Calculate the column value as the exponential of the log-transformed series (column)
        DATASET[new_column_name] = np.exp(DATASET[column])
    
    print("The log_transform was successfully reversed through the exponential transformation. Check the 10 first rows of the new dataset:\n")
    
    try:
        # only works in Jupyter Notebook:
        from IPython.display import display
        display(DATASET.head(10))
            
    except: # regular mode
        print(DATASET.head(10))
    
    return DATASET
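
A quick way to check the reversal is a round trip: apply the logarithm, then the exponential, and compare with the original values. The sketch below assumes the forward transform used the natural logarithm (`np.log`), since `np.exp` only reverses that base. The column name `yield_kg` and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical strictly positive data, as required by the forward log-transform.
original_df = pd.DataFrame({'yield_kg': [1.0, 10.0, 100.0]})

# Forward step (assumed to be the natural logarithm):
log_df = pd.DataFrame({'yield_kg': np.log(original_df['yield_kg'])})

# Reverse step, mirroring create_new_columns = True with the default suffix:
log_df['yield_kg_originalScale'] = np.exp(log_df['yield_kg'])

# The reversed column should match the original data up to floating-point error.
recovered = np.allclose(log_df['yield_kg_originalScale'], original_df['yield_kg'])
print(recovered)
```

If the data had been transformed with a base-10 logarithm instead, the reverse step would need `10 ** x` rather than `np.exp(x)`.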

# **Function for reversing Box-Cox transform**

In [None]:
def reverse_box_cox (df, column_to_transform, lambda_boxcox, suffix = '_ReversedBoxCox'):
    
    import numpy as np
    import pandas as pd
    
    # This function will process a single column column_to_transform 
    # of the dataframe df per call.
    
    # Check https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.boxcox.html
    ## Box-Cox transform is given by:
    ## y = (x**lmbda - 1) / lmbda,  for lmbda != 0
    ## log(x),                  for lmbda = 0
    
    # column_to_transform must be a string with the name of the column.
    # e.g. column_to_transform = 'column1' to transform a column named as 'column1'
    
    # lambda_boxcox must be a float value. e.g. lambda_boxcox = 1.7
    # If you calculated lambda from the function box_cox_transform and saved the
    # transformation data summary dictionary as data_sum_dict, simply set:
    # lambda_boxcox = data_sum_dict['lambda_boxcox']
    # This will access the value on the key 'lambda_boxcox' of the dictionary, which
    # contains the lambda. 
    
    # Analogously, spec_lim_dict['Inf_spec_lim_transf'] access
    # the inferior specification limit transformed; and spec_lim_dict['Sup_spec_lim_transf'] 
    # access the superior specification limit transformed.
    
    #suffix: string (inside quotes).
    # How the transformed column will be identified in the returned dataframe.
    # If column_to_transform = 'Y' and suffix = '_ReversedBoxCox', the transformed column will be
    # named 'Y_ReversedBoxCox'.
    # Alternatively, input inside quotes a string with the desired suffix. Recommendation:
    # start the suffix with "_" to separate it from the original name
    
    
    # Start a local copy of the dataframe:
    DATASET = df.copy(deep = True)

    y = DATASET[column_to_transform]
    
    if (lambda_boxcox == 0):
        #ytransf = np.log(y), according to Box-Cox definition. Then
        #y_retransform = np.exp(y)
        #In the case of this function, ytransf is passed as the argument y.
        y_transform = np.exp(y)
    
    else:
        #apply Box-Cox function:
        #y_transf = (y**lmbda - 1) / lmbda. Then,
        #y_retransf ** (lmbda) = (y_transf * lmbda) + 1
        #y_retransf = ((y_transf * lmbda) + 1) ** (1/lmbda), where ** is the exponentiation operator
        #In the case of this function, ytransf is passed as the argument y.
        y_transform = ((y * lambda_boxcox) + 1) ** (1/lambda_boxcox)
    
    if not (suffix is None):
        #only if a suffix was declared
        #concatenate the column name to the suffix
        new_col = column_to_transform + suffix
    
    else:
        #concatenate the column name to the standard '_ReversedBoxCox' suffix
        new_col = column_to_transform + '_ReversedBoxCox'
    
    DATASET[new_col] = y_transform
    #DATASET now contains the retransformed data
    
    print("Data successfully retransformed. Check the 10 first retransformed rows:\n")
    
    try:
        # only works in Jupyter Notebook:
        from IPython.display import display
        display(DATASET.head(10))
            
    except: # regular mode
        print(DATASET.head(10))
    
    print("\n") #line break
 
    return DATASET
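
The inversion formulas used in `reverse_box_cox` can be verified with a round trip. The sketch below writes the forward Box-Cox formula and its inverse as two small helper functions and checks that the original values are recovered. `box_cox` and `inverse_box_cox` are hypothetical names that are not part of the notebook, and the sample values and lambdas are illustrative.

```python
import numpy as np

def box_cox(x, lmbda):
    # Forward transform: (x**lmbda - 1) / lmbda for lmbda != 0; log(x) for lmbda == 0.
    x = np.asarray(x, dtype = float)
    if lmbda == 0:
        return np.log(x)
    return (x ** lmbda - 1) / lmbda

def inverse_box_cox(y, lmbda):
    # Reverse transform: (y * lmbda + 1) ** (1 / lmbda) for lmbda != 0; exp(y) for lmbda == 0.
    y = np.asarray(y, dtype = float)
    if lmbda == 0:
        return np.exp(y)
    return (y * lmbda + 1) ** (1 / lmbda)

# Hypothetical strictly positive data (Box-Cox requires x > 0):
x = np.array([0.5, 1.0, 2.0, 8.0])

round_trip_ok = {}
for lmbda in (0, 0.5, 1.7):
    x_back = inverse_box_cox(box_cox(x, lmbda), lmbda)
    round_trip_ok[lmbda] = bool(np.allclose(x_back, x))

print(round_trip_ok)
```

The reverse formula is only meaningful when `y * lmbda + 1` is positive, which always holds for values produced by the forward transform of positive data.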

# **Function for One-Hot Encoding categorical features**
- Transform categorical values with no notion of order into numerical (binary) features.
- For each category, the One-Hot Encoder creates a new column in the dataset. This new column is a binary variable that equals 0 if the row is not classified in that category, and equals 1 if the row represents an element of that category.
- The new columns are named as the original column + "_" + category + "_OneHotEnc".
- Each column is a binary variable of the type "is classified in this category or not".

Therefore, for a category "A" of a column named "column1", a column named "column1_A_OneHotEnc" is created.
- If the row is an element from category "A", the value for this column is 1.
- If not, the value for this column is 0.

In [None]:
def OneHotEncode_df (df, subset_of_features_to_be_encoded):

    import numpy as np
    import pandas as pd
    from sklearn.preprocessing import OneHotEncoder
    # https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
    
    # df: the whole dataframe to be processed.
    
    # subset_of_features_to_be_encoded: list of strings (inside quotes), 
    # containing the names of the columns with the categorical variables that will be 
    # encoded. If a single column will be encoded, declare this parameter as list with
    # only one element e.g.subset_of_features_to_be_encoded = ["column1"] 
    # will analyze the column named as 'column1'; 
    # subset_of_features_to_be_encoded = ["col1", 'col2', 'col3'] will analyze 3 columns
    # with categorical variables: 'col1', 'col2', and 'col3'.
    
    #Start an empty encoding list (it will be a list of dictionaries):
    encoding_list = []
    
    # Start a copy of the original dataframe. This copy will be updated to create the new
    # transformed dataframe. Then, we avoid manipulating the original object.
    new_df = df.copy(deep = True)
    
    try:
        # only works in Jupyter Notebook:
        from IPython.display import display  
    except:
        pass
    
    #loop through each column of the subset:
    for column in subset_of_features_to_be_encoded:
        
        # Start two empty dictionaries:
        encoding_dict = {}
        nested_dict = {}
        
        # Add the column to encoding_dict as the key 'column':
        encoding_dict['column'] = column
        
        # Loop through each element (named 'column') of the list of columns to analyze,
        # subset_of_features_to_be_encoded
        
        # We could process the whole subset at once, but it could make us lose information
        # about the generated columns
        
        # set a subset of the dataframe X containing 'column' as the only column:
        # it will be equivalent to using .reshape(-1,1) to set a 1D-series
        # or array in the shape for scikit-learn:
        # For doing so, pass a list of columns for column filtering, containing
        # the object column as its single element:
        X  = df[[column]]
        
        #Start the OneHotEncoder object:
        OneHot_enc_obj = OneHotEncoder()
        
        #Fit the object to that column:
        OneHot_enc_obj = OneHot_enc_obj.fit(X)
        # Get the transformed columns as a SciPy sparse matrix: 
        transformed_columns = OneHot_enc_obj.transform(X)
        # Convert the sparse matrix to a NumPy dense array:
        transformed_columns = transformed_columns.toarray()
        
        # Now, let's retrieve the encoding information and save it:
        # Show encoded categories and store this array. 
        # It will give the proper columns' names:
        encoded_columns = OneHot_enc_obj.categories_

        # encoded_columns is an array containing a single element.
        # This element is an array like:
        # array(['cat1', 'cat2', 'cat3', 'cat4', 'cat5', 'cat6', 'cat7', 'cat8'], dtype=object)
        # Then, this array is the element of index 0 from the list encoded_columns.
        # It is represented as encoded_columns[0]

        # Therefore, we actually want the array which is named as encoded_columns[0]
        # Each element of this array is the name of one of the encoded columns. In the
        # example above, the element 'cat2' would be accessed as encoded_columns[0][1],
        # since it is the element of index [1] (second element) from the array 
        # encoded_columns[0].
        
        new_columns = encoded_columns[0]
        # To identify the column that originated these new columns, we can join the
        # string column to each element from new_columns:
        
        # Update the nested dictionary: store the new_columns as the key 'categories':
        nested_dict['categories'] = new_columns
        # Store the encoder object as the key 'OneHot_enc_obj'
        # Add the encoder object to the dictionary:
        nested_dict['OneHot_enc_obj'] = OneHot_enc_obj
        
        # Store the nested dictionary in the encoding_dict as the key 'OneHot_encoder':
        encoding_dict['OneHot_encoder'] = nested_dict
        # Append the encoding_dict as an element from list encoding_list:
        encoding_list.append(encoding_dict)
        
        # Now we saved all encoding information, let's transform the data:
        
        # Start a support_list to store the concatenated strings:
        support_list = []
        
        for encoded_col in new_columns:
            # Use str() to guarantee that the category name is a string.
            # Add an underscore "_" to separate the strings and an identifier of the transform:
            new_column = column + "_" + str(encoded_col) + "_OneHotEnc"
            # Append it to the support_list:
            support_list.append(new_column)
            
        # Convert the support list to NumPy array, and make new_columns the support list itself:
        new_columns = np.array(support_list)
        
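        # Create a Pandas dataframe from the array transformed_columns. Reuse the index of
        # new_df, so that the rows stay aligned in the concatenation below even when the
        # dataframe does not have the default 0, 1, 2, ... index:
        encoded_X_df = pd.DataFrame(transformed_columns, index = new_df.index)
        
        # Modify the name of the columns to make it equal to new_columns:
        encoded_X_df.columns = new_columns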
        
        #Inner join the new dataset with the encoded dataset.
        # Use the index as the key, since indices are necessarily correspondent.
        # To use join on index, we apply pandas .concat method.
        # To join on a specific key, we could use pandas .merge method with the arguments
        # left_on = 'left_key', right_on = 'right_key'; or, if the keys have same name,
        # on = 'key':
        # Check Pandas merge and concat documentation:
        # https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge.html
        
        new_df = pd.concat([new_df, encoded_X_df], axis = 1, join = "inner")
        # When axis = 0, the .concat operation occurs in the row level, so the rows
        # of the second dataframe are added to the bottom of the first one.
        # It is the SQL union, and creates a dataframe with more rows, and
        # total of columns equals to the total of columns of the first dataframe
        # plus the columns of the second one that were not in the first dataframe.
        # When axis = 1, the operation occurs in the column level: the two
        # dataframes are laterally merged using the index as the key, 
        # preserving all columns from both dataframes. With join = "inner", only
        # the rows whose index is present in both dataframes are kept,
        # and the total of columns will be the sum of the total of columns of
        # the first dataframe with the total of columns of the second dataframe.
        
        print(f"Successfully encoded column \'{column}\' and merged the encoded columns to the dataframe.\n")
        print("Check first 5 rows of the encoded table that was merged:\n")
        
        try:
            display(encoded_X_df.head())
        except: # regular mode
            print(encoded_X_df.head())
        
        # The default of the head method, when no parameter is passed, is to show 5 rows; if an
        # integer number Y is passed as argument .head(Y), Pandas shows the first Y rows.
        print("\n")
        
    print("Finished One-Hot Encoding. Returning the new transformed dataframe; and an encoding list.\n")
    print("Each element from this list is a dictionary with the original column name as key \'column\', and a nested dictionary as the key \'OneHot_encoder\'.\n")
    print("In turn, the nested dictionary shows the different categories as key \'categories\' and the encoder object as the key \'OneHot_enc_obj\'.\n")
    print("Use the encoder object to inverse the One-Hot Encoding in the correspondent function.\n")
    print(f"For each category in the columns \'{subset_of_features_to_be_encoded}\', a new column has value 1, if it is the actual category of that row; or is 0 if not.\n")
    print("Check the first 10 rows of the new dataframe:\n")
    
    try:
        display(new_df.head(10))
    except:
        print(new_df.head(10))

    #return the transformed dataframe and the encoding list:
    return new_df, encoding_list
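
The encoding steps of `OneHotEncode_df` can be sketched for a single column as below. The example also shows why the fitted encoder object is stored: its `inverse_transform` method recovers the original categories. The column name `catalyst`, its values, and the non-default index are hypothetical, chosen to illustrate that the encoded dataframe should share the index of the original one before concatenation.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical dataframe with a non-default index, used only for illustration.
df = pd.DataFrame({'catalyst': ['A', 'B', 'A', 'C']}, index = [10, 11, 12, 13])

# Fit the encoder on a single-column dataframe and convert the sparse output
# to a dense NumPy array:
encoder = OneHotEncoder()
encoded_array = encoder.fit_transform(df[['catalyst']]).toarray()

# Build the new column names as column + "_" + category + "_OneHotEnc":
new_columns = ['catalyst_' + str(cat) + '_OneHotEnc' for cat in encoder.categories_[0]]

# Reuse the original index so that the concatenation aligns the rows:
encoded_df = pd.DataFrame(encoded_array, columns = new_columns, index = df.index)
encoded_full_df = pd.concat([df, encoded_df], axis = 1, join = 'inner')

# The stored encoder reverses the encoding:
recovered = encoder.inverse_transform(encoded_array).ravel()

print(list(encoded_df.columns))
print(list(recovered))
```

Processing one column per encoder, as the function does, keeps a one-to-one link between each original column and its fitted object, which simplifies the later reversal.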

# **Function for reversing the scaling of the features**
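- `mode = 'standard'`: reverses the standard scaling, `Ytransf = (Y - u)/s`.
- `mode = 'min_max'`: reverses the min-max normalization, `Ytransf = (Y - Ymin)/(Ymax - Ymin)`.
- `mode = 'factor'`: reverses the division of the series by a factor `F`, `Ytransf = Y/F`.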

In [None]:
def reverse_feature_scaling (df, subset_of_features_to_scale, list_of_scaling_params, mode = 'min_max', suffix = '_reverseScaling'):
    
    import pandas as pd
    from sklearn.preprocessing import StandardScaler
    from sklearn.preprocessing import MinMaxScaler
    # Scikit-learn Preprocessing data guide:
    # https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-scaler
    # Standard scaler documentation:
    # https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
    # Min-Max scaler documentation:
    # https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html#sklearn.preprocessing.MinMaxScaler.set_params
    
    ## mode = 'standard': reverses the standard scaling, 
    ##  which creates a new variable with mean = 0; and standard deviation = 1.
    ##  Each value Y is transformed as Ytransf = (Y - u)/s, where u is the mean 
    ##  of the training samples, and s is the standard deviation of the training samples.
    
    ## mode = 'min_max': reverses min-max normalization, with a resultant feature 
    ## ranging from 0 to 1. each value Y is transformed as 
    ## Ytransf = (Y - Ymin)/(Ymax - Ymin), where Ymin and Ymax are the minimum and 
    ## maximum values of Y, respectively.
    ## mode = 'factor': reverses the division of the whole series by a numeric value 
    # provided as argument. 
    ## For a factor F, the new Y transformed values are Ytransf = Y/F.
    # Notice that if the original mode was 'normalize_by_maximum', then the maximum value used
    # must be declared as any other factor.
    
    # df: the whole dataframe to be processed.
    
    # subset_of_features_to_scale: list of strings (inside quotes), 
    # containing the names of the columns with the scaled variables whose scaling will be 
    # reversed. If a single column will be processed, declare this parameter as a list with
    # only one element e.g. subset_of_features_to_scale = ["column1"] 
    # will process the column named as 'column1'; 
    # subset_of_features_to_scale = ["col1", 'col2', 'col3'] will process 3 columns:
    # 'col1', 'col2', and 'col3'.
    
    # list_of_scaling_params is a list of dictionaries with the same format as the list returned
    # by this function. Each dictionary must correspond to one of the features that will be re-scaled,
    # but the list does not have to be in the same order as the columns: the function matches each
    # dictionary to its column through the key 'column'.
    # The first key of the dictionary must be 'column'. This key must store a string with the exact
    # name of the column that will be scaled.
    # the second key must be 'scaler'. This key must store a dictionary. The dictionary must store
    # one of two keys: 'scaler_obj' - sklearn scaler object to be used; or 'scaler_details' - the
    # numeric parameters for re-calculating the scaler without the object. The key 'scaler_details', 
    # must contain a nested dictionary. For the mode 'min_max', this dictionary should contain 
    # two keys: 'min', with the minimum value of the variable, and 'max', with the maximum value. 
    # For mode 'standard', the keys should be 'mu', with the mean value, and 'sigma', with its 
    # standard deviation. For the mode 'factor', the key should be 'factor', and should contain the 
    # factor for division (the scaling value. e.g 'factor': 2.0 will divide the column by 2.0.).
    # Again, if you want to normalize by the maximum, declare the maximum value as any other factor for
    # division.
    # The key 'scaler_details' will not create an object: the transform will be directly performed 
    # through vectorial operations.
    
    # suffix: string (inside quotes).
    # How the transformed column will be identified in the returned dataframe.
    # If the column is named 'Y' and suffix = '_reverseScaling', the transformed column will be
    # named 'Y_reverseScaling'.
    # Alternatively, input inside quotes a string with the desired suffix. Recommendation:
    # start the suffix with "_" to separate it from the original name
      
    if (suffix is None):
        #set as the default
        suffix = '_reverseScaling'
    
    #Start a copy of the original dataframe. This copy will be updated to create the new
    # transformed dataframe. Then, we avoid manipulating the original object.
    new_df = df.copy(deep = True)
    
    #Start an empty scaling list (it will be a list of dictionaries):
    scaling_list = []
    
    # Use a previously obtained scaler:
    
    for column in subset_of_features_to_scale:
        
        # Create a dataframe X by subsetting only the analyzed column
        # it will be equivalent to using .reshape(-1,1) to set a 1D-series
        # or array in the shape for scikit-learn:
        # For doing so, pass a list of columns for column filtering, containing
        # the object column as its single element:
        X = new_df[[column]]

        # Loop through each element of the list:
            
        for scaling_dict in list_of_scaling_params:
                
            # check if the dictionary is from that column:
            if (scaling_dict['column'] == column):
                    
                # We found the correct dictionary. Let's retrieve the information:
                # retrieve the nested dictionary:
                nested_dict = scaling_dict['scaler']
                    
                # try accessing the scaler object:
                try:
                    scaler = nested_dict['scaler_obj']
                    #calculate the reversed scaled feature, and store it as new array:
                    rev_scaled_feature = scaler.inverse_transform(X)
                        
                    # Add the parameters to the nested dictionary:
                    nested_dict['scaling_params'] = scaler.get_params(deep = True)
                        
                    if (mode == 'standard'):
                            
                        nested_dict['scaler_details'] = {
                                'mu': rev_scaled_feature.mean(),
                                'sigma': rev_scaled_feature.std()
                            }
                        
                    elif (mode == 'min_max'):
                            
                        nested_dict['scaler_details'] = {
                                'min': rev_scaled_feature.min(),
                                'max': rev_scaled_feature.max()
                            }
                    
                except:
                        
                    try:
                        # As last alternative, let's try accessing the scaler details dict
                        scaler_details = nested_dict['scaler_details']
                                
                        if (mode == 'standard'):
                                
                            nested_dict['scaling_params'] = 'standard_scaler_manually_defined'
                            mu = scaler_details['mu']
                            sigma = scaler_details['sigma']
                                    
                            if (sigma != 0):
                                # scaled_feature = (X - mu)/sigma
                                rev_scaled_feature = (X * sigma) + mu
                            else:
                                # scaled_feature = (X - mu)
                                rev_scaled_feature = (X + mu)
                                
                        elif (mode == 'min_max'):
                                    
                            nested_dict['scaling_params'] = 'min_max_scaler_manually_defined'
                            minimum = scaler_details['min']
                            maximum = scaler_details['max']
                                    
                            if ((maximum - minimum) != 0):
                                # scaled_feature = (X - minimum)/(maximum - minimum)
                                rev_scaled_feature = (X * (maximum - minimum)) + minimum
                            else:
                                # scaled_feature = X/maximum
                                rev_scaled_feature = (X * maximum)
                                
                        elif (mode == 'factor'):
                                
                            nested_dict['scaling_params'] = 'normalization_by_factor'
                            factor = scaler_details['factor']
                            # scaled_feature = X/(factor)
                            rev_scaled_feature = (X * factor)
                                
                        else:
                            print("Select a valid mode: standard, min_max, or factor.\n")
                            return "error", "error"
                            
                    except:
                                
                        print(f"No valid scaling dictionary was input for column {column}.\n")
                        return "error", "error"
         
                # Create the new_column name:
                new_column = column + suffix
                # Set the new column as the reverse-scaled feature (rev_scaled_feature):
                new_df[new_column] = rev_scaled_feature

                # Add the nested dictionary to the scaling_dict:
                scaling_dict['scaler'] = nested_dict

                # Finally, append the scaling_dict to the list scaling_list:
                scaling_list.append(scaling_dict)

                print(f"Successfully re-scaled column {column}.\n")
                
    print("Successfully re-scaled the dataframe.\n")
    print("Check 10 first rows of the new dataframe:\n")
    
    try:
        # only works in Jupyter Notebook:
        from IPython.display import display
        display(new_df.head(10))
            
    except: # regular mode
        print(new_df.head(10))
                
    return new_df, scaling_list
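
The three reverse transforms applied above can be checked in isolation with a round trip. The sketch below is only illustrative: `reverse_scaling` is a hypothetical helper, not part of this notebook, and the dictionary keys `'mu'` and `'sigma'` are assumptions (only `'min'`, `'max'` and `'factor'` appear in the code above).

```python
import numpy as np

def reverse_scaling(x_scaled, mode, params):
    """Hypothetical helper: invert a scaling step using the same formulas as above."""
    x_scaled = np.asarray(x_scaled, dtype=float)
    if mode == 'standard':
        # Assumed keys 'mu' and 'sigma'; forward transform was (X - mu) / sigma
        mu, sigma = params['mu'], params['sigma']
        return x_scaled * sigma + mu if sigma != 0 else x_scaled + mu
    if mode == 'min_max':
        # Forward transform was (X - min) / (max - min), or X / max when max == min
        minimum, maximum = params['min'], params['max']
        span = maximum - minimum
        return x_scaled * span + minimum if span != 0 else x_scaled * maximum
    if mode == 'factor':
        # Forward transform was X / factor
        return x_scaled * params['factor']
    raise ValueError("mode must be 'standard', 'min_max', or 'factor'")

# Round trip: scale with the forward formula, then reverse it
original = np.array([10.0, 20.0, 30.0])
mu, sigma = original.mean(), original.std()
scaled = (original - mu) / sigma
restored = reverse_scaling(scaled, 'standard', {'mu': mu, 'sigma': sigma})
print(np.allclose(restored, original))
```

If the round trip does not recover the original values, the parameters stored for the column probably do not match those used in the forward scaling.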

## **Call the functions**

### **Mounting Google Drive or S3 (AWS Simple Storage Service) bucket**

In [None]:
SOURCE = 'aws'
# SOURCE = 'google' for mounting the google drive;
# SOURCE = 'aws' for accessing an AWS S3 bucket

## THE FOLLOWING PARAMETERS HAVE EFFECT ONLY WHEN SOURCE == 'aws':

PATH_TO_STORE_IMPORTED_S3_BUCKET = ''
# PATH_TO_STORE_IMPORTED_S3_BUCKET: path of the Python environment to which the
# S3 bucket contents will be imported. If it is None; or if it is an empty string; or if 
# PATH_TO_STORE_IMPORTED_S3_BUCKET = '/', bucket will be imported to the root path. 
# Alternatively, input the path as a string (in quotes). e.g. 
# PATH_TO_STORE_IMPORTED_S3_BUCKET = 'copied_s3_bucket'

S3_BUCKET_NAME = 'my_bucket'
## This parameter is obligatory to access an AWS S3 bucket. Substitute it for a string
# with the bucket's name. e.g. S3_BUCKET_NAME = "aws-bucket-1" accesses a bucket named
# "aws-bucket-1"

S3_OBJECT_FOLDER_PREFIX = ""
# S3_OBJECT_FOLDER_PREFIX = None. Keep it None; or as an empty string 
# (S3_OBJECT_FOLDER_PREFIX = ''); or as the root "/" to import the 
# whole bucket content, instead of a single object from it.
# Alternatively, set it as a string containing the subfolder from the bucket to import:
# Suppose that your bucket (admin-created) has four objects with the following object 
# keys: Development/Projects.xlsx; Finance/statement1.pdf; Private/taxdocument.pdf; and
# s3-dg.pdf. 
# The s3-dg.pdf key does not have a prefix, so its object appears directly 
# at the root level of the bucket. If you open the Development/ folder, you see 
# the Projects.xlsx object in it.
# In summary, if the path of the file is: 'bucket/my_path/.../file.csv'
# where 'bucket' is the bucket's name, prefix = 'my_path/.../', without the
# 'file.csv' (file name with extension) last part.

# So, declare the prefix as S3_OBJECT_FOLDER_PREFIX to import only files from
# a given folder (directory) of the bucket.
# DO NOT PUT A SLASH before (to the left of) the prefix;
# DO NOT ADD THE BUCKET'S NAME to the left of the prefix:
# S3_OBJECT_FOLDER_PREFIX = "bucket_directory1/.../bucket_directoryN/"

# Alternatively, provide the full path of a given file if you want to import only it:
# S3_OBJECT_FOLDER_PREFIX = "bucket_directory1/.../bucket_directoryN/my_file.ext"
# where my_file is the file's name, and ext is its extension.


# Attention: after running this function for fetching AWS Simple Storage Service (S3), 
# your 'AWS Access key ID' and your 'Secret access key' will be requested.
# The 'Secret access key' will be hidden through dots, so it cannot be visualized or copied by
# other users. On the other hand, the same is not true for 'Access key ID', the bucket's name 
# and the prefix. All of these are sensitive information from the organization.
# Therefore, after importing the information, always remember to clear the output of this cell
# and to remove such information from the strings.
# Remember that these data may grant privileges for accessing protected information, 
# so they should not be used by non-authorized people.

# Also, remember to delete the imported files from the workspace after finishing the analysis.
# Storing files in S3 costs considerably less than storing them directly in the
# workspace. Also, files stored in S3 may be accessed by users other than those with access to
# the notebook's workspace.
mount_storage_system (source = SOURCE, path_to_store_imported_s3_bucket = PATH_TO_STORE_IMPORTED_S3_BUCKET, s3_bucket_name = S3_BUCKET_NAME, s3_obj_prefix = S3_OBJECT_FOLDER_PREFIX)
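
To make the prefix rule above concrete, the sketch below splits an object key into its folder prefix and its file name. `split_s3_key` is a hypothetical helper written only for this illustration; it works on plain strings and does not connect to AWS.

```python
import posixpath

def split_s3_key(object_key):
    """Hypothetical helper: return (prefix, file_name) for an S3 object key."""
    prefix, file_name = posixpath.split(object_key)
    # Keys at the bucket root have no prefix; otherwise restore the trailing slash
    prefix = prefix + '/' if prefix else ''
    return prefix, file_name

for key in ['Finance/statement1.pdf', 'my_path/sub_dir/file.csv', 's3-dg.pdf']:
    print(key, '->', split_s3_key(key))
```

The first element of each tuple is the value that would be declared as `S3_OBJECT_FOLDER_PREFIX` to import the whole folder: it never starts with a slash and never contains the bucket's name.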

## **Downloading a file from Google Colab to the local machine; or uploading a file from the machine to Colab's instant memory**

#### Case 1: upload a file to Colab's workspace

In [None]:
ACTION = 'upload'
# ACTION = 'download' to download the file to the local machine
# ACTION = 'upload' to upload a file from local machine to Google Colab's 
# instant memory

FILE_TO_DOWNLOAD_FROM_COLAB = None
# FILE_TO_DOWNLOAD_FROM_COLAB = None. This parameter is obligatory when
# action = 'download'. 
# Declare as FILE_TO_DOWNLOAD_FROM_COLAB the file that you want to download, with
# the correspondent extension.
# It should be declared as a string (in quotes), as in the examples below.
# e.g. to download a dictionary named dict, FILE_TO_DOWNLOAD_FROM_COLAB = 'dict.pkl'
# To download a dataframe named df, declare FILE_TO_DOWNLOAD_FROM_COLAB = 'df.csv'
# To export a model named keras_model, declare FILE_TO_DOWNLOAD_FROM_COLAB = 'keras_model.h5'

# Dictionary storing the uploaded files returned as colab_files_dict.
# Simply modify this object on the left of the equality:
colab_files_dict = upload_to_or_download_file_from_colab (action = ACTION, file_to_download_from_colab = FILE_TO_DOWNLOAD_FROM_COLAB)

#### Case 2: download a file from Colab's workspace

In [None]:
ACTION = 'download'
# ACTION = 'download' to download the file to the local machine
# ACTION = 'upload' to upload a file from local machine to Google Colab's 
# instant memory

FILE_TO_DOWNLOAD_FROM_COLAB = None
# FILE_TO_DOWNLOAD_FROM_COLAB = None. This parameter is obligatory when
# action = 'download'. 
# Declare as FILE_TO_DOWNLOAD_FROM_COLAB the file that you want to download, with
# the correspondent extension.
# It should be declared as a string (in quotes), as in the examples below.
# e.g. to download a dictionary named dict, FILE_TO_DOWNLOAD_FROM_COLAB = 'dict.pkl'
# To download a dataframe named df, declare FILE_TO_DOWNLOAD_FROM_COLAB = 'df.csv'
# To export a model named keras_model, declare FILE_TO_DOWNLOAD_FROM_COLAB = 'keras_model.h5'

upload_to_or_download_file_from_colab (action = ACTION, file_to_download_from_colab = FILE_TO_DOWNLOAD_FROM_COLAB)

### **Importing the dataset**

In [None]:
## WARNING: Use this function to load dataframes stored on Excel (xls, xlsx, xlsm, xlsb, odf, ods and odt), 
## JSON, txt, or CSV (comma separated values) files.

FILE_DIRECTORY_PATH = ""
# FILE_DIRECTORY_PATH - (string, in quotes): input the path of the directory 
# (e.g. folder path) where the file is stored. e.g. FILE_DIRECTORY_PATH = "" 
# or FILE_DIRECTORY_PATH = "folder"

FILE_NAME_WITH_EXTENSION = "dataset.csv"
# FILE_NAME_WITH_EXTENSION - (string, in quotes): input the name of the file with the 
# extension. e.g. FILE_NAME_WITH_EXTENSION = "file.xlsx", or, 
# FILE_NAME_WITH_EXTENSION = "file.csv", "file.txt", or "file.json"
# Again, the extensions may be: xls, xlsx, xlsm, xlsb, odf, ods, odt, json, txt or csv.

LOAD_TXT_FILE_WITH_JSON_FORMAT = False
# LOAD_TXT_FILE_WITH_JSON_FORMAT = False. Set LOAD_TXT_FILE_WITH_JSON_FORMAT = True 
# if you want to read a file with txt extension containing a text formatted as JSON 
# (but not saved as JSON).
# WARNING: if LOAD_TXT_FILE_WITH_JSON_FORMAT = True, all the JSON file parameters of the 
# function (below) must be set. If not, an error message will be raised.

HOW_MISSING_VALUES_ARE_REGISTERED = None
# HOW_MISSING_VALUES_ARE_REGISTERED = None: keep it None if missing values are registered as None,
# empty or np.nan. Pandas automatically converts None to NumPy np.nan objects (floats).
# This parameter manipulates the argument na_values (default: None) from Pandas functions.
# By default the following values are interpreted as NaN: ‘’, ‘#N/A’, ‘#N/A N/A’, ‘#NA’, ‘-1.#IND’, 
#‘-1.#QNAN’, ‘-NaN’, ‘-nan’, ‘1.#IND’, ‘1.#QNAN’, ‘<NA>’, ‘N/A’, ‘NA’, ‘NULL’, ‘NaN’, 
# ‘n/a’, ‘nan’, ‘null’.

# If a different denomination is used, indicate it as a string. e.g.
# HOW_MISSING_VALUES_ARE_REGISTERED = '.' will convert all strings '.' to missing values;
# HOW_MISSING_VALUES_ARE_REGISTERED = 0 will convert zeros to missing values.

# If dict passed, specific per-column NA values. For example, if zero is the missing value
# only in column 'numeric_col', you can specify the following dictionary:
# HOW_MISSING_VALUES_ARE_REGISTERED = {'numeric_col': 0}

    
HAS_HEADER = True
# HAS_HEADER = True if the imported table has headers (row with column names).
# Alternatively, HAS_HEADER = False if the dataframe does not have header.

DECIMAL_SEPARATOR = '.'
# DECIMAL_SEPARATOR = '.' - String. Keep it '.' or None to use the period ('.') as
# the decimal separator. Alternatively, specify here the separator.
# e.g. DECIMAL_SEPARATOR = ',' will set the comma as the separator.
# It manipulates the argument 'decimal' from Pandas functions.

TXT_CSV_COL_SEP = "comma"
# TXT_CSV_COL_SEP = "comma" - This parameter has effect only when the file is a 'txt'
# or 'csv'. It informs how the different columns are separated.
# Set TXT_CSV_COL_SEP = "comma", or TXT_CSV_COL_SEP = "," 
# for columns separated by commas;
# TXT_CSV_COL_SEP = "whitespace", or TXT_CSV_COL_SEP = " " 
# for columns separated by simple spaces.
# You can also set a specific separator as a string. For example:
# TXT_CSV_COL_SEP = '\s+'; or TXT_CSV_COL_SEP = '\t' (in this last example, the tab
# character is used as the column separator).

## Parameters for loading Excel files:

LOAD_ALL_SHEETS_AT_ONCE = False
# LOAD_ALL_SHEETS_AT_ONCE = False - This parameter has effect only for Excel files.
# If LOAD_ALL_SHEETS_AT_ONCE = True, the function will return a list of dictionaries, each
# dictionary containing 2 key-value pairs: the first key will be 'sheet', and its
# value will be the name (or number) of the table (sheet). The second key will be 'df',
# and its value will be the pandas dataframe object obtained from that sheet.
# This argument has preference over SHEET_TO_LOAD. If it is True, all sheets will be loaded.
    
SHEET_TO_LOAD = None
# SHEET_TO_LOAD - This parameter has effect only for Excel files.
# Keep SHEET_TO_LOAD = None not to specify a sheet of the file, so that the first sheet
# will be loaded.
# SHEET_TO_LOAD may be an integer or a string (inside quotes). SHEET_TO_LOAD = 0
# loads the first sheet (sheet with index 0); SHEET_TO_LOAD = 1 loads the second sheet
# of the file (index 1); SHEET_TO_LOAD = "Sheet1" loads a sheet named as "Sheet1".
# Declare a number to load the sheet with that index, starting from 0; or declare a
# name to load the sheet with that name.

## Parameters for loading JSON files:

JSON_RECORD_PATH = None
# JSON_RECORD_PATH (string): manipulate parameter 'record_path' from json_normalize method.
# Path in each object to list of records. If not passed, data will be assumed to 
# be an array of records. If a given field from the JSON stores a nested JSON (or a nested
# dictionary) declare it here to decompose the content of the nested data. e.g. if the field
# 'books' stores a nested JSON, declare, JSON_RECORD_PATH = 'books'

JSON_FIELD_SEPARATOR = "_"
# JSON_FIELD_SEPARATOR = "_" (string). Manipulates the parameter 'sep' from json_normalize method.
# Nested records will generate names separated by sep. 
# e.g., for JSON_FIELD_SEPARATOR = ".", {‘foo’: {‘bar’: 0}} -> foo.bar.
# Then, if a given field 'main_field' stores a nested JSON with fields 'field1', 'field2', ...
# the name of the columns of the dataframe will be formed by concatenating 'main_field', the
# separator, and the names of the nested fields: 'main_field_field1', 'main_field_field2',...

JSON_METADATA_PREFIX_LIST = None
# JSON_METADATA_PREFIX_LIST: list of strings (in quotes). Manipulates the parameter 
# 'meta' from json_normalize method. Fields to use as metadata for each record in resulting 
# table. Declare here the non-nested fields, i.e., the fields in the principal JSON. They
# will be repeated in the rows of the dataframe to give the metadata (context) of the rows.

# e.g. Suppose a JSON with the following structure: {'name': 'Mary', 'last': 'Shelley',
# 'books': [{'title': 'Frankenstein', 'year': 1818}, {'title': 'Mathilda', 'year': 1819}, {'title': 'The Last Man', 'year': 1826}]},
# Here, there are nested JSONs in the field 'books'. The fields that are not nested
# are 'name' and 'last'.
# Then, JSON_RECORD_PATH = 'books'
# JSON_METADATA_PREFIX_LIST = ['name', 'last']


# The dataframe will be stored in the object named 'dataset':
# Simply modify this object on the left of equality:
dataset = load_pandas_dataframe (file_directory_path = FILE_DIRECTORY_PATH, file_name_with_extension = FILE_NAME_WITH_EXTENSION, load_txt_file_with_json_format = LOAD_TXT_FILE_WITH_JSON_FORMAT, how_missing_values_are_registered = HOW_MISSING_VALUES_ARE_REGISTERED, has_header = HAS_HEADER, decimal_separator = DECIMAL_SEPARATOR, txt_csv_col_sep = TXT_CSV_COL_SEP, load_all_sheets_at_once = LOAD_ALL_SHEETS_AT_ONCE, sheet_to_load = SHEET_TO_LOAD, json_record_path = JSON_RECORD_PATH, json_field_separator = JSON_FIELD_SEPARATOR, json_metadata_prefix_list = JSON_METADATA_PREFIX_LIST)

# OBS: If an Excel file is loaded and LOAD_ALL_SHEETS_AT_ONCE = True, then the object
# dataset will be a list of dictionaries, with 'sheet' as the key containing the sheet name; and 'df'
# as the key corresponding to the Pandas dataframe. So, to access the 3rd dataframe (index 2, since
# indexing starts from zero): df = dataset[2]['df'], where dataset is the list returned.
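
The comments above state that the missing-value marker, the decimal separator and the column separator are passed on to the Pandas arguments `na_values`, `decimal` and `sep`. As a minimal sketch of how these arguments interact, the example below calls `pandas.read_csv` directly on an in-memory text; the sample data and column names are invented for illustration, and the internals of `load_pandas_dataframe` are not shown in this section.

```python
import io
import pandas as pd

# Invented sample: semicolon-separated columns, comma as decimal separator,
# and '.' registered as the missing-value marker
raw_text = "batch;temperature;yield\nA;25,5;0,91\nB;.;0,87\nC;27,0;.\n"

df = pd.read_csv(
    io.StringIO(raw_text),
    sep=';',        # column separator
    decimal=',',    # decimal separator
    na_values='.',  # extra string to be interpreted as NaN
    header=0,       # first row holds the column names
)

print(df)
print(df.dtypes)
```

Both numeric columns should be parsed as floats, with the '.' entries converted to NaN.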

### **Converting JSON object to dataframe**

In [None]:
# JSON object in terms of Python structure: list of dictionaries, where each value of a
# dictionary may be a dictionary or a list of dictionaries (nested structures).
# example of highly nested structure saved as a list 'json_formatted_list'. Note that the same
# structure could be declared and stored into a string variable. For instance, if you have a txt
# file containing JSON, you could read the txt and save its content as a string.
# json_formatted_list = [{'field1': val1, 'field2': {'dict_val': dict_val}, 'field3': [{
# 'nest1': nest_val1}, {'nest2': nestval2}]}, {'field1': val1, 'field2': {'dict_val': dict_val}, 
# 'field3': [{'nest1': nest_val1}, {'nest2': nestval2}]}]

JSON_OBJ_TO_CONVERT = json_object #Alternatively: object containing the JSON to be converted

# JSON_OBJ_TO_CONVERT: object containing JSON, or string with JSON content to parse.
# Objects may be: string with JSON formatted text;
# list with nested dictionaries (JSON formatted);
# dictionaries, possibly with nested dictionaries (JSON formatted).

JSON_OBJ_TYPE = 'list'
# JSON_OBJ_TYPE = 'list', in case the object was saved as a list of dictionaries (JSON format)
# JSON_OBJ_TYPE = 'string', in case it was saved as a string (text) containing JSON.

## Parameters for loading JSON files:

JSON_RECORD_PATH = None
# JSON_RECORD_PATH (string): manipulate parameter 'record_path' from json_normalize method.
# Path in each object to list of records. If not passed, data will be assumed to 
# be an array of records. If a given field from the JSON stores a nested JSON (or a nested
# dictionary) declare it here to decompose the content of the nested data. e.g. if the field
# 'books' stores a nested JSON, declare, JSON_RECORD_PATH = 'books'

JSON_FIELD_SEPARATOR = "_"
# JSON_FIELD_SEPARATOR = "_" (string). Manipulates the parameter 'sep' from json_normalize method.
# Nested records will generate names separated by sep. 
# e.g., for JSON_FIELD_SEPARATOR = ".", {‘foo’: {‘bar’: 0}} -> foo.bar.
# Then, if a given field 'main_field' stores a nested JSON with fields 'field1', 'field2', ...
# the name of the columns of the dataframe will be formed by concatenating 'main_field', the
# separator, and the names of the nested fields: 'main_field_field1', 'main_field_field2',...

JSON_METADATA_PREFIX_LIST = None
# JSON_METADATA_PREFIX_LIST: list of strings (in quotes). Manipulates the parameter 
# 'meta' from json_normalize method. Fields to use as metadata for each record in resulting 
# table. Declare here the non-nested fields, i.e., the fields in the principal JSON. They
# will be repeated in the rows of the dataframe to give the metadata (context) of the rows.

# e.g. Suppose a JSON with the following structure: [{'name': 'Mary', 'last': 'Shelley',
# 'books': [{'title': 'Frankenstein', 'year': 1818}, {'title': 'Mathilda', 'year': 1819}, {'title': 'The Last Man', 'year': 1826}]}]
# Here, there are nested JSONs in the field 'books'. The fields that are not nested
# are 'name' and 'last'.
# Then, JSON_RECORD_PATH = 'books'
# JSON_METADATA_PREFIX_LIST = ['name', 'last']


# The dataframe will be stored in the object named 'dataset':
# Simply modify this object on the left of equality:
dataset = json_obj_to_pandas_dataframe (json_obj_to_convert = JSON_OBJ_TO_CONVERT, json_obj_type = JSON_OBJ_TYPE, json_record_path = JSON_RECORD_PATH, json_field_separator = JSON_FIELD_SEPARATOR, json_metadata_prefix_list = JSON_METADATA_PREFIX_LIST)
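
The comments above map `JSON_RECORD_PATH`, `JSON_METADATA_PREFIX_LIST` and `JSON_FIELD_SEPARATOR` to the `record_path`, `meta` and `sep` parameters of `json_normalize`. The sketch below applies `pandas.json_normalize` directly to the Mary Shelley example, assuming the function forwards these parameters unchanged; its actual implementation is not shown in this section.

```python
import pandas as pd

json_formatted_list = [{
    'name': 'Mary', 'last': 'Shelley',
    'books': [
        {'title': 'Frankenstein', 'year': 1818},
        {'title': 'Mathilda', 'year': 1819},
        {'title': 'The Last Man', 'year': 1826},
    ],
}]

books_df = pd.json_normalize(
    json_formatted_list,
    record_path='books',     # nested list to unpack into rows
    meta=['name', 'last'],   # non-nested fields repeated on every row
    sep='_',                 # separator for names of nested fields
)

print(books_df)
```

Each nested book becomes one row, and the metadata fields 'name' and 'last' are repeated on every row.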

### **Importing or exporting models and dictionaries (or lists)**

#### Case 1: import only a model

In [None]:
ACTION = 'import'
# ACTION = 'import' for importing a model and/or a dictionary;
# ACTION = 'export' for exporting a model and/or a dictionary.

OBJECTS_MANIPULATED = 'model_only'
# OBJECTS_MANIPULATED = 'model_only' if only a model will be manipulated.
# OBJECTS_MANIPULATED = 'dict_or_list_only' if only a dictionary will be manipulated.
# OBJECTS_MANIPULATED = 'model_and_dict' if both a model and a dictionary will 
#  be manipulated.

MODEL_FILE_NAME = None
# MODEL_FILE_NAME: string with the name of the file containing the model (for 'import');
# or of the name that the exported file will have (for 'export')
# e.g. MODEL_FILE_NAME = 'model'
# WARNING: Do not add the file extension.
# Keep it in quotes. Keep MODEL_FILE_NAME = None if no model will be manipulated.

DICTIONARY_OR_LIST_FILE_NAME = None
# DICTIONARY_OR_LIST_FILE_NAME: string with the name of the file containing the dictionary 
# (for 'import');
# or of the name that the exported file will have (for 'export')
# e.g. DICTIONARY_OR_LIST_FILE_NAME = 'history_dict'
# WARNING: Do not add the file extension.
# Keep it in quotes. Keep DICTIONARY_OR_LIST_FILE_NAME = None if no dictionary 
# or list will be manipulated.

DIRECTORY_PATH = ''
# DIRECTORY_PATH: path of the directory where the model will be saved,
# or from which the model will be retrieved. If no value is provided,
# the DIRECTORY_PATH will be the root: ""
# Notice that the model and the dictionary must be stored in the same path.
# If a model and a dictionary will be exported, they will be stored in the same
# DIRECTORY_PATH.
    
MODEL_TYPE = 'arima'
# This parameter has effect only when a model will be manipulated.
# MODEL_TYPE: 'keras' for deep learning Keras/ TensorFlow models with extension .h5
# MODEL_TYPE: 'tensorflow_lambda' for deep learning tensorflow models containing 
# lambda layers. Such models are compressed as tar.gz.
# MODEL_TYPE = 'sklearn' for models from Scikit-learn (non-deep learning)
# MODEL_TYPE = 'xgb_regressor' for XGBoost regression models (non-deep learning)
# MODEL_TYPE = 'xgb_classifier' for XGBoost classification models (non-deep learning)
# MODEL_TYPE = 'arima' for ARIMA model (Statsmodels)

DICT_OR_LIST_TO_EXPORT = None
MODEL_TO_EXPORT = None 
# These two parameters have effect only when ACTION == 'export'. In this case, they
# must be declared. If ACTION == 'import', keep:
# DICT_OR_LIST_TO_EXPORT = None, 
# MODEL_TO_EXPORT = None
# If one of these objects will be exported, substitute None by the name of the object
# e.g. if your model is stored in the global memory as 'keras_model' declare:
# MODEL_TO_EXPORT = keras_model. Notice that it must be declared without quotes, since
# it is not a string, but an object.
# For exporting a dictionary named as 'dict':
# DICT_OR_LIST_TO_EXPORT = dict

USE_COLAB_MEMORY = False
# USE_COLAB_MEMORY: this parameter has effect only when using Google Colab (or it will
# raise an error). Set as USE_COLAB_MEMORY = True if you want to use the instant memory
# from Google Colaboratory: you will upload or download the file and it will be available
# only while the kernel is running. It will be deleted when the kernel
# dies, for instance, when you close the notebook.
    
# If ACTION == 'export' and USE_COLAB_MEMORY == True, then the file will be downloaded
# to your computer (running the cell will start the download).

# Model object saved as model.
# Simply modify this object on the left of equality:
model = import_export_model_list_dict (action = ACTION, objects_manipulated = OBJECTS_MANIPULATED, model_file_name = MODEL_FILE_NAME, dictionary_or_list_file_name = DICTIONARY_OR_LIST_FILE_NAME, directory_path = DIRECTORY_PATH, model_type = MODEL_TYPE, dict_or_list_to_export = DICT_OR_LIST_TO_EXPORT, model_to_export = MODEL_TO_EXPORT, use_colab_memory = USE_COLAB_MEMORY) 

#### Case 2: import only a dictionary or a list

In [None]:
ACTION = 'import'
# ACTION = 'import' for importing a model and/or a dictionary;
# ACTION = 'export' for exporting a model and/or a dictionary.

OBJECTS_MANIPULATED = 'dict_or_list_only'
# OBJECTS_MANIPULATED = 'model_only' if only a model will be manipulated.
# OBJECTS_MANIPULATED = 'dict_or_list_only' if only a dictionary will be manipulated.
# OBJECTS_MANIPULATED = 'model_and_dict' if both a model and a dictionary will 
#  be manipulated.

MODEL_FILE_NAME = None
# MODEL_FILE_NAME: string with the name of the file containing the model (for 'import');
# or of the name that the exported file will have (for 'export')
# e.g. MODEL_FILE_NAME = 'model'
# WARNING: Do not add the file extension.
# Keep it in quotes. Keep MODEL_FILE_NAME = None if no model will be manipulated.

DICTIONARY_OR_LIST_FILE_NAME = None
# DICTIONARY_OR_LIST_FILE_NAME: string with the name of the file containing the dictionary 
# (for 'import');
# or of the name that the exported file will have (for 'export')
# e.g. DICTIONARY_OR_LIST_FILE_NAME = 'history_dict'
# WARNING: Do not add the file extension.
# Keep it in quotes. Keep DICTIONARY_OR_LIST_FILE_NAME = None if no dictionary 
# or list will be manipulated.

DIRECTORY_PATH = ''
# DIRECTORY_PATH: path of the directory where the model will be saved,
# or from which the model will be retrieved. If no value is provided,
# the DIRECTORY_PATH will be the root: ""
# Notice that the model and the dictionary must be stored in the same path.
# If a model and a dictionary will be exported, they will be stored in the same
# DIRECTORY_PATH.
    
MODEL_TYPE = 'arima'
# This parameter has effect only when a model will be manipulated.
# MODEL_TYPE: 'keras' for deep learning Keras/ TensorFlow models with extension .h5
# MODEL_TYPE: 'tensorflow_lambda' for deep learning tensorflow models containing 
# lambda layers. Such models are compressed as tar.gz.
# MODEL_TYPE = 'sklearn' for models from Scikit-learn (non-deep learning)
# MODEL_TYPE = 'xgb_regressor' for XGBoost regression models (non-deep learning)
# MODEL_TYPE = 'xgb_classifier' for XGBoost classification models (non-deep learning)
# MODEL_TYPE = 'arima' for ARIMA model (Statsmodels)

DICT_OR_LIST_TO_EXPORT = None
MODEL_TO_EXPORT = None 
# These two parameters have effect only when ACTION == 'export'. In this case, they
# must be declared. If ACTION == 'import', keep:
# DICT_OR_LIST_TO_EXPORT = None, 
# MODEL_TO_EXPORT = None
# If one of these objects will be exported, substitute None by the name of the object
# e.g. if your model is stored in the global memory as 'keras_model' declare:
# MODEL_TO_EXPORT = keras_model. Notice that it must be declared without quotes, since
# it is not a string, but an object.
# For exporting a dictionary named as 'dict':
# DICT_OR_LIST_TO_EXPORT = dict

USE_COLAB_MEMORY = False
# USE_COLAB_MEMORY: this parameter has effect only when using Google Colab (or it will
# raise an error). Set as USE_COLAB_MEMORY = True if you want to use the instant memory
# from Google Colaboratory: you will upload or download the file and it will be available
# only while the kernel is running. It will be deleted when the kernel
# dies, for instance, when you close the notebook.
    
# If ACTION == 'export' and USE_COLAB_MEMORY == True, then the file will be downloaded
# to your computer (running the cell will start the download).

# Dictionary or list saved as imported_dict_or_list.
# Simply modify this object on the left of equality:
imported_dict_or_list = import_export_model_list_dict (action = ACTION, objects_manipulated = OBJECTS_MANIPULATED, model_file_name = MODEL_FILE_NAME, dictionary_or_list_file_name = DICTIONARY_OR_LIST_FILE_NAME, directory_path = DIRECTORY_PATH, model_type = MODEL_TYPE, dict_or_list_to_export = DICT_OR_LIST_TO_EXPORT, model_to_export = MODEL_TO_EXPORT, use_colab_memory = USE_COLAB_MEMORY) 
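
For reference, a dictionary or list can be exported and re-imported with the standard library alone. This is a minimal sketch assuming pickle serialization and a '.pkl' extension; the format actually used by `import_export_model_list_dict` is not shown in this section, and `history_dict` is an invented example object.

```python
import os
import pickle
import tempfile

# Invented example object to export
history_dict = {'loss': [0.9, 0.5, 0.3], 'val_loss': [1.0, 0.6, 0.4]}

# A temporary directory stands in for DIRECTORY_PATH
directory_path = tempfile.mkdtemp()
# The file name is declared without extension; the extension is appended here
file_path = os.path.join(directory_path, 'history_dict' + '.pkl')

# Export: serialize the object to disk
with open(file_path, 'wb') as opened_file:
    pickle.dump(history_dict, opened_file)

# Import: read the object back
with open(file_path, 'rb') as opened_file:
    imported_dict_or_list = pickle.load(opened_file)

print(imported_dict_or_list == history_dict)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.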

#### Case 3: import a model and a dictionary (or a list)

In [None]:
ACTION = 'import'
# ACTION = 'import' for importing a model and/or a dictionary;
# ACTION = 'export' for exporting a model and/or a dictionary.

OBJECTS_MANIPULATED = 'model_and_dict'
# OBJECTS_MANIPULATED = 'model_only' if only a model will be manipulated.
# OBJECTS_MANIPULATED = 'dict_or_list_only' if only a dictionary will be manipulated.
# OBJECTS_MANIPULATED = 'model_and_dict' if both a model and a dictionary will 
#  be manipulated.

MODEL_FILE_NAME = None
# MODEL_FILE_NAME: string with the name of the file containing the model (for 'import');
# or of the name that the exported file will have (for 'export')
# e.g. MODEL_FILE_NAME = 'model'
# WARNING: Do not add the file extension.
# Keep it in quotes. Keep MODEL_FILE_NAME = None if no model will be manipulated.

DICTIONARY_OR_LIST_FILE_NAME = None
# DICTIONARY_OR_LIST_FILE_NAME: string with the name of the file containing the dictionary 
# (for 'import');
# or of the name that the exported file will have (for 'export')
# e.g. DICTIONARY_OR_LIST_FILE_NAME = 'history_dict'
# WARNING: Do not add the file extension.
# Keep it in quotes. Keep DICTIONARY_OR_LIST_FILE_NAME = None if no dictionary 
# or list will be manipulated.

DIRECTORY_PATH = ''
# DIRECTORY_PATH: path of the directory where the model will be saved,
# or from which the model will be retrieved. If no value is provided,
# the DIRECTORY_PATH will be the root: ""
# Notice that the model and the dictionary must be stored in the same path.
# If a model and a dictionary will be exported, they will be stored in the same
# DIRECTORY_PATH.
    
MODEL_TYPE = 'arima'
# This parameter has effect only when a model will be manipulated.
# MODEL_TYPE: 'keras' for deep learning Keras/ TensorFlow models with extension .h5
# MODEL_TYPE: 'tensorflow_lambda' for deep learning tensorflow models containing 
# lambda layers. Such models are compressed as tar.gz.
# MODEL_TYPE = 'sklearn' for models from Scikit-learn (non-deep learning)
# MODEL_TYPE = 'xgb_regressor' for XGBoost regression models (non-deep learning)
# MODEL_TYPE = 'xgb_classifier' for XGBoost classification models (non-deep learning)
# MODEL_TYPE = 'arima' for ARIMA model (Statsmodels)

DICT_OR_LIST_TO_EXPORT = None
MODEL_TO_EXPORT = None 
# These two parameters have effect only when ACTION == 'export'. In this case, they
# must be declared. If ACTION == 'import', keep:
# DICT_OR_LIST_TO_EXPORT = None, 
# MODEL_TO_EXPORT = None
# If one of these objects will be exported, substitute None by the name of the object
# e.g. if your model is stored in the global memory as 'keras_model' declare:
# MODEL_TO_EXPORT = keras_model. Notice that it must be declared without quotes, since
# it is not a string, but an object.
# For exporting a dictionary named as 'dict':
# DICT_OR_LIST_TO_EXPORT = dict

USE_COLAB_MEMORY = False
# USE_COLAB_MEMORY: this parameter has effect only when using Google Colab (or it will
# raise an error). Set as USE_COLAB_MEMORY = True if you want to use the instant memory
# from Google Colaboratory: you will upload or download the file and it will be available
# only while the kernel is running. It will be deleted when the kernel
# dies, for instance, when you close the notebook.
    
# If ACTION == 'export' and USE_COLAB_MEMORY == True, then the file will be downloaded
# to your computer (running the cell will start the download).

# Model object saved as model. Dictionary or list saved as imported_dict_or_list.
# Simply modify these objects on the left of equality:
model, imported_dict_or_list = import_export_model_list_dict (action = ACTION, objects_manipulated = OBJECTS_MANIPULATED, model_file_name = MODEL_FILE_NAME, dictionary_or_list_file_name = DICTIONARY_OR_LIST_FILE_NAME, directory_path = DIRECTORY_PATH, model_type = MODEL_TYPE, dict_or_list_to_export = DICT_OR_LIST_TO_EXPORT, model_to_export = MODEL_TO_EXPORT, use_colab_memory = USE_COLAB_MEMORY) 

#### Case 4: export a model and/or a dictionary (or a list)

In [None]:
ACTION = 'export'
# ACTION = 'import' for importing a model and/or a dictionary;
# ACTION = 'export' for exporting a model and/or a dictionary.

OBJECTS_MANIPULATED = 'model_only'
# OBJECTS_MANIPULATED = 'model_only' if only a model will be manipulated.
# OBJECTS_MANIPULATED = 'dict_or_list_only' if only a dictionary will be manipulated.
# OBJECTS_MANIPULATED = 'model_and_dict' if both a model and a dictionary will 
#  be manipulated.

MODEL_FILE_NAME = None
# MODEL_FILE_NAME: string with the name of the file containing the model (for 'import');
# or of the name that the exported file will have (for 'export')
# e.g. MODEL_FILE_NAME = 'model'
# WARNING: Do not add the file extension.
# Keep it in quotes. Keep MODEL_FILE_NAME = None if no model will be manipulated.

DICTIONARY_OR_LIST_FILE_NAME = None
# DICTIONARY_OR_LIST_FILE_NAME: string with the name of the file containing the dictionary 
# (for 'import');
# or of the name that the exported file will have (for 'export')
# e.g. DICTIONARY_OR_LIST_FILE_NAME = 'history_dict'
# WARNING: Do not add the file extension.
# Keep it in quotes. Keep DICTIONARY_OR_LIST_FILE_NAME = None if no dictionary 
# or list will be manipulated.

DIRECTORY_PATH = ''
# DIRECTORY_PATH: path of the directory where the model will be saved,
# or from which the model will be retrieved. If no value is provided,
# the DIRECTORY_PATH will be the root: ""
# Notice that the model and the dictionary must be stored in the same path.
# If a model and a dictionary will be exported, they will be stored in the same
# DIRECTORY_PATH.
    
MODEL_TYPE = 'arima'
# This parameter has effect only when a model will be manipulated.
# MODEL_TYPE: 'keras' for deep learning Keras/ TensorFlow models with extension .h5
# MODEL_TYPE: 'tensorflow_lambda' for deep learning tensorflow models containing 
# lambda layers. Such models are compressed as tar.gz.
# MODEL_TYPE = 'sklearn' for models from Scikit-learn (non-deep learning)
# MODEL_TYPE = 'xgb_regressor' for XGBoost regression models (non-deep learning)
# MODEL_TYPE = 'xgb_classifier' for XGBoost classification models (non-deep learning)
# MODEL_TYPE = 'arima' for ARIMA model (Statsmodels)

DICT_OR_LIST_TO_EXPORT = None
MODEL_TO_EXPORT = None 
# These two parameters have effect only when ACTION == 'export'. In this case, they
# must be declared: substitute None by the object(s) to export. If ACTION == 'import', keep:
# DICT_OR_LIST_TO_EXPORT = None, 
# MODEL_TO_EXPORT = None
# If one of these objects will be exported, substitute None by the name of the object
# e.g. if your model is stored in the global memory as 'keras_model' declare:
# MODEL_TO_EXPORT = keras_model. Notice that it must be declared without quotes, since
# it is not a string, but an object.
# For exporting a dictionary named as 'dict':
# DICT_OR_LIST_TO_EXPORT = dict

USE_COLAB_MEMORY = False
# USE_COLAB_MEMORY: this parameter has effect only when using Google Colab (otherwise, it
# will raise an error). Set USE_COLAB_MEMORY = True if you want to use the instant memory
# from Google Colaboratory: you will upload or download the file, and it will be available
# only while the kernel is running. It will be deleted when the kernel
# dies, for instance, when you close the notebook.
    
# If ACTION == 'export' and USE_COLAB_MEMORY == True, then the file will be downloaded
# to your computer (running the cell will start the download).

import_export_model_list_dict (action = ACTION, objects_manipulated = OBJECTS_MANIPULATED, model_file_name = MODEL_FILE_NAME, dictionary_or_list_file_name = DICTIONARY_OR_LIST_FILE_NAME, directory_path = DIRECTORY_PATH, model_type = MODEL_TYPE, dict_or_list_to_export = DICT_OR_LIST_TO_EXPORT, model_to_export = MODEL_TO_EXPORT, use_colab_memory = USE_COLAB_MEMORY) 

### **Making predictions with the models**
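
The cell below relies on automatic detection of the input format: lists and one-dimensional arrays are treated as a single entry, whereas dataframes and multi-dimensional arrays are treated as a subset. A minimal sketch of that rule follows. The helpers `detect_predict_for` and `as_batch` are hypothetical names used only for illustration; they are not the internals of `make_model_predictions`.

```python
import numpy as np
import pandas as pd

def detect_predict_for(X):
    """Hypothetical helper: classify the input as 'single_entry' or 'subset'."""
    if isinstance(X, pd.DataFrame):
        return 'subset'
    arr = np.asarray(X)
    # Lists and 1-D arrays hold the values of one entry only.
    return 'single_entry' if arr.ndim <= 1 else 'subset'

def as_batch(X):
    """Hypothetical helper: return an array with one row per entry."""
    if detect_predict_for(X) == 'single_entry':
        # One row, with as many columns as there are values in the list.
        return np.asarray(X, dtype=float).reshape(1, -1)
    return np.asarray(X)

single = [1.2, 3, 4]
subset = pd.DataFrame({'a': [1.0, 2.0], 'b': [3.0, 4.0], 'c': [5.0, 6.0]})

print(detect_predict_for(single), as_batch(single).shape)
print(detect_predict_for(subset), as_batch(subset).shape)
```

Reshaping a single entry to one row keeps the order of the values, which is why the list must follow the order of the columns used for training.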

In [None]:
MODEL_OBJECT = lstm_model # Alternatively: object storing another model
# MODEL_OBJECT: object containing the model that will be analyzed. e.g.
# MODEL_OBJECT = elastic_net_linear_reg_model

X_df = X
# There is no need to declare predict_for ('subset' or 'single_entry'):
# the function will automatically detect if it is dealing with lists, NumPy arrays
# or Pandas dataframes. If X_df is a list or a single-dimension array, predict_for
# will be set as 'single_entry'. If X_df is a multi-dimension NumPy array (such as the
# outputs of the data preparation for deep learning models, even for a single entry), or if
# it is a Pandas dataframe, the function will set predict_for = 'subset'.
    
# X_df = subset of predictive variables (dataframe, NumPy array, or list).
# For a 'single_entry' prediction, X_df should be a list of parameter values.
# e.g. X_df = [1.2, 3, 4] (the dot is the decimal separator; commas separate the values). 
# Notice that the list should contain only the numeric values, in the same order as the
# corresponding columns.
# For a 'subset' prediction (multiple entries), X_df should be a dataframe 
# (subset) or a multi-dimensional NumPy array of the parameter values, as usual.

DATAFRAME_FOR_CONCATENATING_PREDICTIONS = dataset  
# DATAFRAME_FOR_CONCATENATING_PREDICTIONS: if you want to concatenate the predictions
# to a dataframe, pass it here:
# e.g. DATAFRAME_FOR_CONCATENATING_PREDICTIONS = df
# If the dataframe must be the same one passed as X, repeat the dataframe object here:
# X_df = dataset, DATAFRAME_FOR_CONCATENATING_PREDICTIONS = dataset.
# Alternatively, if DATAFRAME_FOR_CONCATENATING_PREDICTIONS = None, 
# the prediction will be returned as a series or NumPy array, depending on the input format.
# Notice that the concatenated predictions will be added as a new column.

COLUMN_WITH_PREDICTIONS_SUFFIX = None
# COLUMN_WITH_PREDICTIONS_SUFFIX = None. If the predictions are added as a new column
# of the dataframe DATAFRAME_FOR_CONCATENATING_PREDICTIONS, you can declare this
# parameter as a string with a suffix for identifying the model. If no suffix is added, the new
# column will be named 'y_pred'.
# e.g. COLUMN_WITH_PREDICTIONS_SUFFIX = '_keras' will create a column named "y_pred_keras". This
# parameter is useful when working with multiple models. Always start the suffix with an underscore
# "_": it keeps the suffix visually separated from the column name without adding blank spaces,
# and avoids confusion with the dot (.) notation for methods, JSON attributes, etc.

# Predictions returned as prediction_output
# Simply modify this object (or variable) on the left of equality:
prediction_output = make_model_predictions (model_object = MODEL_OBJECT, X = X_df, dataframe_for_concatenating_predictions = DATAFRAME_FOR_CONCATENATING_PREDICTIONS, col_with_predictions_suffix = COLUMN_WITH_PREDICTIONS_SUFFIX)

### **Calculating probabilities associated with each class**
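
The probabilities are returned in columns named with the prefix "prob_class_" followed by the class name. The sketch below shows one way such a table could be assembled from a probability matrix. The class names, the probability values and the `predicted_class` column are made-up illustrations, and the code is not the implementation of `calculate_class_probability`.

```python
import numpy as np
import pandas as pd

# Hypothetical inputs: class names and a probability matrix with one row per entry
# and one column per class (e.g. the output of a softmax layer).
list_of_classes = ['low', 'medium', 'high']
probabilities = np.array([[0.7, 0.2, 0.1],
                          [0.1, 0.3, 0.6]])

# One column per class, named with the "prob_class_" prefix.
prob_columns = ['prob_class_' + str(c) for c in list_of_classes]
prob_df = pd.DataFrame(probabilities, columns=prob_columns)

# Appending the probabilities to an existing dataframe with the same index.
dataset = pd.DataFrame({'feature': [10.0, 20.0]})
dataset_with_probs = pd.concat([dataset, prob_df], axis=1)

# Illustrative extra column: the most probable class of each row.
most_probable = probabilities.argmax(axis=1)
dataset_with_probs['predicted_class'] = [list_of_classes[i] for i in most_probable]
print(dataset_with_probs)
```

The order of `list_of_classes` must match the order of the columns in the probability matrix, otherwise the probabilities are attributed to the wrong classes.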

In [None]:
MODEL_OBJECT = lstm_model # Alternatively: object storing another model
# MODEL_OBJECT: object containing the model that will be analyzed. e.g.
# MODEL_OBJECT = mlp_model

X_df = X
# There is no need to declare predict_for ('subset' or 'single_entry'):
# the function will automatically detect if it is dealing with lists, NumPy arrays
# or Pandas dataframes. If X_df is a list or a single-dimension array, predict_for
# will be set as 'single_entry'. If X_df is a multi-dimension NumPy array (such as the
# outputs of the data preparation for deep learning models, even for a single entry), or if
# it is a Pandas dataframe, the function will set predict_for = 'subset'.
    
# X_df = subset of predictive variables (dataframe, NumPy array, or list).
# For a 'single_entry' prediction, X_df should be a list of parameter values.
# e.g. X_df = [1.2, 3, 4] (the dot is the decimal separator; commas separate the values). 
# Notice that the list should contain only the numeric values, in the same order as the
# corresponding columns.
# For a 'subset' prediction (multiple entries), X_df should be a dataframe 
# (subset) or a multi-dimensional NumPy array of the parameter values, as usual.

LIST_OF_CLASSES = list_of_classes
# LIST_OF_CLASSES is the list of classes effectively used for training
# the model. Set this parameter as the object returned from function
# retrieve_classes_used_to_train

TYPE_OF_MODEL = 'deep_learning'
# TYPE_OF_MODEL = 'deep_learning' if Keras/TensorFlow or other deep learning
# framework was used to obtain the model;
# TYPE_OF_MODEL = 'other' for Scikit-learn or XGBoost models.

DATAFRAME_FOR_CONCATENATING_PREDICTIONS = dataset  
# DATAFRAME_FOR_CONCATENATING_PREDICTIONS: if you want to concatenate the predictions
# to a dataframe, pass it here:
# e.g. DATAFRAME_FOR_CONCATENATING_PREDICTIONS = df
# If the dataframe must be the same one passed as X, repeat the dataframe object here:
# X_df = dataset, DATAFRAME_FOR_CONCATENATING_PREDICTIONS = dataset.
# Alternatively, if DATAFRAME_FOR_CONCATENATING_PREDICTIONS = None, 
# the prediction will be returned as a series or NumPy array, depending on the input format.
# Notice that the concatenated predictions will be added as a new column.    
# All of the new columns (appended or not) will have the prefix "prob_class_" followed
# by the corresponding class name to identify them.


# Probabilities returned as calculated_probability
# Simply modify this object (or variable) on the left of equality:
calculated_probability = calculate_class_probability (model_object = MODEL_OBJECT, X = X_df, list_of_classes = LIST_OF_CLASSES, type_of_model = TYPE_OF_MODEL, dataframe_for_concatenating_predictions = DATAFRAME_FOR_CONCATENATING_PREDICTIONS)

### **Merging (joining) dataframes on given keys; and sorting the merged table**
- Merge (join) types:
    - 'inner': the resultant dataframe contains only the rows from the left dataframe that have corresponding keys on the right dataframe. It can be used for filtering a set of labelled rows. The join itself introduces no missing values.
    - 'left': the resultant dataframe contains all the rows from the left table (even those without correspondence on the right), and the rows from the right table that have correspondence on the left one. Since rows from the left table may not have correspondence, it may result in missing values.
    - 'right': the resultant dataframe contains all the rows from the right table (even those without correspondence on the left), and the rows from the left table that have correspondence on the right one. Since rows from the right table may not have correspondence, it may result in missing values.
    - 'outer': the Pandas 'outer' merge corresponds to the SQL FULL OUTER JOIN: the resultant dataframe contains all rows from both tables, regardless of whether there is correspondence. So, it may result in missing values.
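
The four join types listed above can be compared on a small example. This is a sketch with made-up dataframes that calls `pandas.merge` directly; it assumes, but does not confirm, that `MERGE_AND_SORT_DATAFRAMES` relies on the same Pandas functionality.

```python
import pandas as pd

# Keys 2 and 3 exist on both sides; key 1 only on the left; key 4 only on the right.
df_left = pd.DataFrame({'left_key_column': [1, 2, 3], 'Value': [10, 20, 30]})
df_right = pd.DataFrame({'right_key_column': [2, 3, 4], 'Value': [200, 300, 400]})

merged = {}
for how in ['inner', 'left', 'right', 'outer']:
    merged[how] = pd.merge(df_left, df_right, how=how,
                           left_on='left_key_column', right_on='right_key_column',
                           suffixes=('_left', '_right'))
    n_missing = int(merged[how].isna().sum().sum())
    print(how, '-> rows:', len(merged[how]), '| missing values:', n_missing)

# Optional sorting of the merged table, here in descending order of 'Value_left'.
sorted_df = (merged['outer']
             .sort_values(by='Value_left', ascending=False)
             .reset_index(drop=True))
```

Both dataframes have a column named 'Value', so the suffixes turn them into 'Value_left' and 'Value_right' in every merged table.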

In [None]:
DF_LEFT = dataset1 #Alternatively: object containing the dataset to be joined on the left
DF_RIGHT = dataset2 #Alternatively: object containing the dataset to be joined on the right

LEFT_KEY = "left_key_column" 
#Alternatively: (string) name of the column of the left dataframe to be used as key for 
# joining. Keep inside quotes.
RIGHT_KEY = "right_key_column"
#Alternatively: (string) name of the column of the right dataframe to be used as key for 
# joining. Keep inside quotes.

HOW_TO_JOIN = "inner"
#Alternatively: "inner", "outer", "left", "right".

MERGED_SUFFIXES = ('_left', '_right')
# MERGED_SUFFIXES = ('_left', '_right') - tuple of the suffixes to be added to columns
# with the same name in both dataframes.
# Example: suppose both datasets have the column 'Value'. The column from the left dataset
# will be renamed as "Value_left", and the column from the right dataset will be renamed as
# "Value_right".
# Alternatively: modify the strings inside quotes to modify the standard values. 
# Do not eliminate the parentheses that indicate the tuple object.
# A tuple is an immutable sequence, declared inside parentheses () instead of the
# brackets [] used for lists.

SORT_MERGED_DF = False
# SORT_MERGED_DF = False not to sort the merged dataframe. If you want to sort it,
# set as True. If SORT_MERGED_DF = True and COLUMN_TO_SORT = None, the dataframe will
# be sorted by its first column.

COLUMN_TO_SORT = None
# COLUMN_TO_SORT = None. Keep it None if the dataframe should not be sorted, or if it
# should be sorted by its first column (SORT_MERGED_DF = True).
# Alternatively, pass a string with a column name to sort, such as:
# COLUMN_TO_SORT = 'col1'; or a list of columns to use for sorting: COLUMN_TO_SORT = 
# ['col1', 'col2']

ASCENDING_SORTING = True
# ASCENDING_SORTING = True. If you want to sort the column(s) passed on COLUMN_TO_SORT in
# ascending order, set as True. Set as False if you want to sort in descending order. If
# you want to sort each column passed as list COLUMN_TO_SORT in a specific order, pass a 
# list of booleans like ASCENDING_SORTING = [False, True] - the first column of the list
# will be sorted in descending order, whereas the 2nd will be in ascending. Notice that
# the correspondence is element-wise: the boolean in list ASCENDING_SORTING will correspond 
# to the sorting order of the column with the same position in list COLUMN_TO_SORT.
# If None, the dataframe will be sorted in ascending order.
    

#New dataframe saved as merged_df. Simply modify this object on the left of equality:
merged_df = MERGE_AND_SORT_DATAFRAMES (df_left = DF_LEFT, df_right = DF_RIGHT, left_key = LEFT_KEY, right_key = RIGHT_KEY, how_to_join = HOW_TO_JOIN, merged_suffixes = MERGED_SUFFIXES, sort_merged_df = SORT_MERGED_DF, column_to_sort = COLUMN_TO_SORT, ascending_sorting = ASCENDING_SORTING)

### **Concatenating (SQL UNION) multiple dataframes**
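
The parameters of the next cell are described as equivalent to `ignore_index`, `sort` and `join` from Pandas `concat`. The sketch below calls `pandas.concat` directly on two made-up dataframes to show the difference between the outer and inner joins. The mapping of `WHAT_TO_APPEND = 'rows'` to `axis=0` and of `'columns'` to `axis=1` is an assumption.

```python
import pandas as pd

# The dataframes share column 'a'; 'b' exists only in df1 and 'c' only in df2.
df1 = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df2 = pd.DataFrame({'a': [5, 6], 'c': [7, 8]})

# Stacking rows with the outer join: all columns are kept, and the positions
# without correspondence are filled with missing values.
outer_rows = pd.concat([df1, df2], axis=0, join='outer', ignore_index=True, sort=True)

# Stacking rows with the inner join: only the common columns remain.
inner_rows = pd.concat([df1, df2], axis=0, join='inner', ignore_index=True)

# Appending columns (horizontal or lateral append): rows are aligned by index.
side_by_side = pd.concat([df1, df2], axis=1)

print(outer_rows)
print(inner_rows)
print(side_by_side)
```

With `ignore_index=True`, the stacked rows receive a new sequential index instead of repeating the indices of the original dataframes.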

In [None]:
LIST_OF_DATAFRAMES = [dataset1, dataset2]
# LIST_OF_DATAFRAMES must be a list containing the dataframe objects
# example: LIST_OF_DATAFRAMES = [df1, df2, df3, df4]
# Notice that the dataframes are objects, not strings. Therefore, they should not
# be declared inside quotes.
# There is no limit to the number of dataframes. The example above concatenates 4 dataframes.
# If LIST_OF_DATAFRAMES = [df1, df2, df3] we would concatenate 3, and if
# LIST_OF_DATAFRAMES = [df1, df2, df3, df4, df5] we would concatenate 5 dataframes.

WHAT_TO_APPEND = 'rows'
# WHAT_TO_APPEND = 'rows' for appending the rows from one dataframe
# into the other; WHAT_TO_APPEND = 'columns' for appending the columns
# from one dataframe into the other (horizontal or lateral append).

IGNORE_INDEX_ON_UNION = True # Alternatively: True or False

SORT_VALUES_ON_UNION = True # Alternatively: True or False

UNION_JOIN_TYPE = None
# UNION_JOIN_TYPE can be 'inner' to perform an inner join, eliminating the missing values
# The default (None) is 'outer': the dataframes will be stacked on the columns with
# same names but, in case there is no correspondence, the row will present a missing
# value for the columns which are not present in one of the dataframes.
# When using the 'inner' method, only the common columns will remain.
# Alternatively, keep UNION_JOIN_TYPE = None for the standard outer join; or set
# UNION_JOIN_TYPE = "inner" (inside quotes) for using the inner join.
    
# The last 3 parameters correspond to arguments of the Pandas concat function:
# IGNORE_INDEX_ON_UNION = ignore_index;
# SORT_VALUES_ON_UNION = sort
# UNION_JOIN_TYPE = join
# Check Datacamp course Joining Data with pandas, Chap.3, 
# Advanced Merging and Concatenating
    

#New dataframe saved as concat_df. Simply modify this object on the left of equality:
concat_df = UNION_DATAFRAMES (list_of_dataframes = LIST_OF_DATAFRAMES, what_to_append = WHAT_TO_APPEND, ignore_index_on_union = IGNORE_INDEX_ON_UNION, sort_values_on_union = SORT_VALUES_ON_UNION, union_join_type = UNION_JOIN_TYPE)

### **Filtering (selecting); ordering; or renaming columns from the dataframe**
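
The two modes of the next cell can be illustrated with plain Pandas. This is a minimal sketch on a made-up dataframe, not the body of `select_order_or_rename_columns`; the length check is an added illustration of why the list of new names must match the columns.

```python
import pandas as pd

dataset = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4], 'col3': [5, 6]})

# Equivalent of 'select_or_order_columns': keep only the listed columns,
# in the order in which they appear in the list.
selected = dataset[['col3', 'col1']]

# Equivalent of 'rename_columns': the names are assigned by position,
# so the list needs exactly one name per column.
new_names = ['x', 'y', 'z']
if len(new_names) != len(dataset.columns):
    raise ValueError('One new name per column is required.')
renamed = dataset.copy()
renamed.columns = new_names

print(selected)
print(renamed)
```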

In [None]:
DATASET = dataset #Alternatively: object containing the dataset to be analyzed

MODE = 'select_or_order_columns'
# MODE = 'select_or_order_columns' for filtering only the list of columns passed as COLUMNS_LIST,
# and setting a new column order. In this mode, you can pass the columns in any order: 
# the order of elements on the list will be the new order of columns.

# MODE = 'rename_columns' for renaming the columns with the names passed as COLUMNS_LIST. In this
# mode, the list must have the same length and the same order as the columns of the dataframe. That
# is because the columns will sequentially receive the names in the list. So, a position mismatch
# will result in columns with incorrect names.

COLUMNS_LIST = ['column1', 'column2', 'column3']
# COLUMNS_LIST = list of strings containing the names (headers) of the columns to select
# (filter); or to be set as the new columns' names, according to the selected mode.
# For instance: COLUMNS_LIST = ['col1', 'col2', 'col3'] will 
# select columns 'col1', 'col2', and 'col3' (or rename the columns with these names). 
# Declare the names inside quotes.
# Simply substitute the list by the list of columns that you want to select; or the
# list of the new names you want to give to the dataset columns.

# New dataframe saved as new_df. Simply modify this object on the left of equality:
new_df = select_order_or_rename_columns (df = DATASET, columns_list = COLUMNS_LIST, mode = MODE)

### **Reversing the log-transform - Exponentially transforming variables**
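
Reversing a log-transform means applying the exponential function to the transformed values. The sketch below assumes that the natural logarithm was used in the original transformation, and it uses a made-up column; it is not the implementation of `reverse_log_transform`.

```python
import numpy as np
import pandas as pd

# Made-up data in the original scale, and its natural-log transform.
original = pd.DataFrame({'col1': [1.0, 10.0, 100.0]})
log_df = np.log(original)

subset = ['col1']
suffix = '_originalScale'

rescaled = log_df.copy()
for col in subset:
    # exp(log(x)) = x: the new column stores the values in the original scale.
    rescaled[col + suffix] = np.exp(rescaled[col])

print(rescaled)
```

If the data were transformed with a different base (for instance, base 10), the reverse operation would be the power of that base instead of `np.exp`.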

In [None]:
DATASET = dataset #Alternatively: object containing the dataset to be analyzed

SUBSET = None
# Set SUBSET = None to transform the whole dataset. Alternatively, pass a list with 
# columns names for the transformation to be applied. For instance:
# SUBSET = ['col1', 'col2', 'col3'] will apply the transformation to the columns named
# as 'col1', 'col2', and 'col3'. Declare the names inside quotes.
# Declaring the full list of columns is equivalent to setting SUBSET = None.

CREATE_NEW_COLUMNS = True
# Set CREATE_NEW_COLUMNS = True to store the transformed data into new
# columns. Alternatively, set CREATE_NEW_COLUMNS = False to overwrite the existing columns
    
NEW_COLUMNS_SUFFIX = "_originalScale"
# This value has effect only if CREATE_NEW_COLUMNS = True.
# The new column name will be set as column + NEW_COLUMNS_SUFFIX. Then, if the original
# column was "column1" and the suffix is "_originalScale", the new column will be named as
# "column1_originalScale".
# Alternatively, input inside quotes a string with the desired suffix. Recommendation:
# start the suffix with "_" to separate it from the original name.

#New dataframe saved as rescaled_df.
# Simply modify this object on the left of equality:
rescaled_df = reverse_log_transform(df = DATASET, subset = SUBSET, create_new_columns = CREATE_NEW_COLUMNS, new_columns_suffix = NEW_COLUMNS_SUFFIX)

### **Reversing Box-Cox transform**
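
For reference, the Box-Cox transform with parameter lambda maps a positive value x to (x^lambda - 1)/lambda when lambda is not zero, and to ln(x) when lambda is zero. The reverse transform is therefore (lambda*y + 1)^(1/lambda), or exp(y) when lambda is zero. The sketch below checks the round trip with NumPy; the function names are hypothetical and the code is not the implementation of `reverse_box_cox`.

```python
import numpy as np

def box_cox(x, lmbda):
    """Forward Box-Cox transform of positive values."""
    x = np.asarray(x, dtype=float)
    return np.log(x) if lmbda == 0 else (x ** lmbda - 1.0) / lmbda

def inverse_box_cox(y, lmbda):
    """Reverse Box-Cox transform: recovers the values in the original scale."""
    y = np.asarray(y, dtype=float)
    return np.exp(y) if lmbda == 0 else (lmbda * y + 1.0) ** (1.0 / lmbda)

x = np.array([0.5, 1.0, 2.0, 10.0])
for lmbda in (0, 0.5, 1.7):
    recovered = inverse_box_cox(box_cox(x, lmbda), lmbda)
    print(lmbda, np.allclose(recovered, x))
```

The reverse transform only recovers the original data when the same lambda used in the forward transform is provided. SciPy offers `scipy.special.inv_boxcox` for the same computation.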

In [None]:
# This function will process a single column column_to_transform of the dataframe df 
# per call.

DATASET = dataset #Alternatively: object containing the dataset to be processed

COLUMN_TO_TRANSFORM = 'column_to_transform'
# COLUMN_TO_TRANSFORM must be a string with the name of the column.
# e.g. COLUMN_TO_TRANSFORM = 'column1' to transform a column named as 'column1'

LAMBDA_BOXCOX = None
# LAMBDA_BOXCOX must be a float value. e.g. LAMBDA_BOXCOX = 1.7
# If you calculated lambda from the function box_cox_transform and saved the
# transformation data summary dictionary as data_sum_dict, simply set:
## LAMBDA_BOXCOX = data_sum_dict['lambda_boxcox']
# This will access the value on the key 'lambda_boxcox' of the dictionary, which
# contains the lambda. 
# If lambda_boxcox is None, the mode will be automatically set as 'calculate_and_apply'.

SUFFIX = '_ReversedBoxCox'
# SUFFIX: string (inside quotes).
# How the transformed column will be identified in the returned dataframe.
# If COLUMN_TO_TRANSFORM = 'Y' and SUFFIX = '_ReversedBoxCox', the transformed column will be
# identified as 'Y_ReversedBoxCox'.
# Alternatively, input inside quotes a string with the desired suffix. Recommendation:
# start the suffix with "_" to separate it from the original name

#New dataframe saved as retransformed_df.
# Simply modify this object on the left of equality:
retransformed_df = reverse_box_cox (df = DATASET, column_to_transform = COLUMN_TO_TRANSFORM, lambda_boxcox = LAMBDA_BOXCOX, suffix = SUFFIX)

### **One-Hot Encoding the categorical variables**
- For each category, the One-Hot Encoder creates a new column in the dataset. This new column is a binary variable, which is equal to 0 if the row is not classified in that category, and equal to 1 if the row represents an element of that category. For a category "A", a column named "A" is created.
    - If the row is an element from category "A", the value for the column "A" is 1.
    - If not, the value for column "A" is 0.
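
The rule above can be reproduced on a small made-up dataframe. Note that this sketch uses `pandas.get_dummies` as a stand-in: `OneHotEncode_df` also returns the encoder objects and may be implemented differently. The empty prefix is chosen here only so that the new columns are named exactly after the categories.

```python
import pandas as pd

dataset = pd.DataFrame({'COLUMN1': ['A', 'B', 'A', 'C'],
                        'value': [1.0, 2.0, 3.0, 4.0]})

# One binary column per category of 'COLUMN1'; the original column is replaced.
encoded = pd.get_dummies(dataset, columns=['COLUMN1'],
                         prefix='', prefix_sep='', dtype=int)
print(encoded)

# Each row belongs to exactly one category, so the binary columns sum to 1.
print(encoded[['A', 'B', 'C']].sum(axis=1).tolist())
```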

In [None]:
DATASET = dataset #Alternatively: object containing the dataset to be processed

SUBSET_OF_FEATURES_TO_BE_ENCODED = ['COLUMN1', 'COLUMN2', 'COLUMN3']
# SUBSET_OF_FEATURES_TO_BE_ENCODED: list of strings (inside quotes), 
# containing the names of the columns with the categorical variables that will be 
# encoded. If a single column will be encoded, declare this parameter as a list with
# only one element, e.g. SUBSET_OF_FEATURES_TO_BE_ENCODED = ["column1"] 
# will encode the column named as 'column1'; 
# SUBSET_OF_FEATURES_TO_BE_ENCODED = ["col1", 'col2', 'col3'] will encode 3 columns
# with categorical variables: 'col1', 'col2', and 'col3'.

# New dataframe saved as one_hot_encoded_df; list of encoding information,
# including different categories and encoder objects, saved as OneHot_encoding_list.
# Simply modify these objects on the left of equality:
one_hot_encoded_df, OneHot_encoding_list = OneHotEncode_df (df = DATASET, subset_of_features_to_be_encoded = SUBSET_OF_FEATURES_TO_BE_ENCODED)

### **Reversing scaling of the features - Standard scaler, Min-Max scaler, division by factor**
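
Each scaling mode has a simple algebraic inverse. The sketch below applies the inverse formulas with NumPy, using the same keys described for 'scaler_details' in the next cell ('mu' and 'sigma'; 'min' and 'max'; 'factor'). The function `reverse_scaling` is a hypothetical illustration, not the implementation of `reverse_feature_scaling`.

```python
import numpy as np

def reverse_scaling(values, mode, scaler_details):
    """Hypothetical helper: bring scaled values back to the original scale."""
    values = np.asarray(values, dtype=float)
    if mode == 'standard':
        # Ytransf = (Y - mu)/sigma  ->  Y = Ytransf*sigma + mu
        return values * scaler_details['sigma'] + scaler_details['mu']
    if mode == 'min_max':
        # Ytransf = (Y - Ymin)/(Ymax - Ymin)  ->  Y = Ytransf*(Ymax - Ymin) + Ymin
        return values * (scaler_details['max'] - scaler_details['min']) + scaler_details['min']
    if mode == 'factor':
        # Ytransf = Y/F  ->  Y = Ytransf*F
        return values * scaler_details['factor']
    raise ValueError(f"Unknown mode: {mode}")

y = np.array([2.0, 4.0, 6.0, 10.0])
cases = {
    'standard': ({'mu': y.mean(), 'sigma': y.std()}, (y - y.mean()) / y.std()),
    'min_max': ({'min': y.min(), 'max': y.max()}, (y - y.min()) / (y.max() - y.min())),
    'factor': ({'factor': 2.0}, y / 2.0),
}
for mode, (details, scaled) in cases.items():
    print(mode, np.allclose(reverse_scaling(scaled, mode, details), y))
```

The parameters must be the ones obtained when the scaler was fitted (typically on the training data); recalculating them from the scaled column would not recover the original values.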

In [None]:
DATASET = dataset #Alternatively: object containing the dataset to be processed

SUBSET_OF_FEATURES_TO_SCALE = ['COLUMN1', 'COLUMN2', 'COLUMN3']
# SUBSET_OF_FEATURES_TO_SCALE: list of strings (inside quotes), 
# containing the names of the columns with the scaled variables that will have 
# the scaling reversed. If a single column will be processed, declare this parameter as a list
# with only one element, e.g. SUBSET_OF_FEATURES_TO_SCALE = ["column1"] 
# will process the column named as 'column1'; 
# SUBSET_OF_FEATURES_TO_SCALE = ["col1", 'col2', 'col3'] will process 3 columns:
# 'col1', 'col2', and 'col3'.

MODE = 'min_max'
## Alternatively: MODE = 'standard', MODE = 'min_max', MODE = 'factor'
## MODE indicates which of the 3 scaling methods (modes) was originally applied, 
## and will now be reversed:
## MODE = 'standard': the standard scaling, 
##  which creates a new variable with mean = 0; and standard deviation = 1.
##  Each value Y was transformed as Ytransf = (Y - u)/s, where u is the mean 
##  of the training samples, and s is the standard deviation of the training samples.
##  The reverse transform is Y = Ytransf*s + u.
    
## MODE = 'min_max': the min-max normalization, with a resultant feature 
## ranging from 0 to 1. Each value Y was transformed as 
## Ytransf = (Y - Ymin)/(Ymax - Ymin), where Ymin and Ymax are the minimum and 
## maximum values of Y, respectively.
## The reverse transform is Y = Ytransf*(Ymax - Ymin) + Ymin.
    
## MODE = 'factor': the division of the whole series by a numeric value provided as argument. 
## For a factor F, the scaled values were Ytransf = Y/F, so the reverse transform is
## Y = Ytransf*F.

LIST_OF_SCALING_PARAMS = [
                            {'column': None,
                            'scaler': {'scaler_obj': None, 
                                      'scaler_details': None}},
                            {'column': None,
                            'scaler': {'scaler_obj': None, 
                                      'scaler_details': None}}
                            
                         ]
# LIST_OF_SCALING_PARAMS is a list of dictionaries with the same format as the list returned
# from this function. Each dictionary must correspond to one of the features that will be rescaled,
# but the list does not have to be in the same order as the columns: each dictionary is
# identified by its 'column' key.
# The first key of the dictionary must be 'column'. This key must store a string with the exact
# name of the column that will be rescaled.
# The second key must be 'scaler'. This key must store a dictionary. The dictionary must store
# one of two keys: 'scaler_obj' - sklearn scaler object to be used; or 'scaler_details' - the
# numeric parameters for re-calculating the scaler without the object. The key 'scaler_details' 
# must contain a nested dictionary. For the mode 'min_max', this dictionary should contain 
# two keys: 'min', with the minimum value of the variable, and 'max', with the maximum value. 
# For mode 'standard', the keys should be 'mu', with the mean value, and 'sigma', with its 
# standard deviation. For the mode 'factor', the key should be 'factor', and should contain the 
# factor used for division (the scaling value, e.g. 'factor': 2.0 for a column divided by 2.0).
# If the column was normalized by its maximum, declare the maximum value as any other factor for
# division.

SUFFIX = '_reverseScaling'
# SUFFIX: string (inside quotes).
# How the transformed columns will be identified in the returned dataframe.
# If a column is named 'Y' and SUFFIX = '_reverseScaling', the transformed column will be
# identified as 'Y_reverseScaling'.
# Alternatively, input inside quotes a string with the desired suffix. Recommendation:
# start the suffix with "_" to separate it from the original name

# New dataframe saved as rescaled_df; list of scaling parameters saved as scaling_list
# Simply modify this object on the left of equality:
rescaled_df, scaling_list = reverse_feature_scaling (df = DATASET, subset_of_features_to_scale = SUBSET_OF_FEATURES_TO_SCALE, list_of_scaling_params = LIST_OF_SCALING_PARAMS, mode = MODE, suffix = SUFFIX)

### **Plotting a bar chart**
- To obtain a **Pareto chart**, keep `aggregate_function = 'sum'`, `calculate_and_plot_cumulative_percent = True`, and `orientation = 'vertical'`.
- To obtain the **data distribution of categorical variables**, select any numeric column as the response, and set `aggregate_function = 'count'`. You can also set `calculate_and_plot_cumulative_percent = True` to compare the frequencies of each possible value.
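
The table behind a Pareto chart is the response summed per category, sorted in descending order, plus the cumulative percent of the total. The sketch below builds that table with Pandas on made-up data; the column names are hypothetical and the code is not the implementation of `bar_chart`.

```python
import pandas as pd

# Made-up data: cost of each defect occurrence.
dataset = pd.DataFrame({
    'defect_type': ['scratch', 'dent', 'scratch', 'stain', 'dent', 'scratch'],
    'cost': [50.0, 20.0, 30.0, 5.0, 10.0, 35.0],
})

# Aggregate function 'sum', then descending order of the aggregated response.
pareto = (dataset.groupby('defect_type', as_index=False)['cost'].sum()
          .sort_values(by='cost', ascending=False)
          .reset_index(drop=True))

# Cumulative percent of the total, used for the line of the Pareto chart.
pareto['cumulative_percent'] = 100 * pareto['cost'].cumsum() / pareto['cost'].sum()
print(pareto)
```

In the chart, the bars would show the `cost` column and the line would show `cumulative_percent`, so the categories that concentrate most of the total appear first.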

In [None]:
DATASET = dataset #Alternatively: object containing the dataset to be analyzed

CATEGORICAL_VAR_NAME = 'categorical_column_name'
# CATEGORICAL_VAR_NAME: string (inside quotes) containing the name 
# of the column to be analyzed. e.g. 
# CATEGORICAL_VAR_NAME = "column1"

RESPONSE_VAR_NAME = "response_column_name"
# RESPONSE_VAR_NAME: string (inside quotes) containing the name 
# of the column that stores the response correspondent to the
# categories. e.g. RESPONSE_VAR_NAME = "response_feature"

AGGREGATE_FUNCTION = 'sum'
# AGGREGATE_FUNCTION = 'sum': String defining the aggregation 
# method that will be applied. Possible values:
# 'median', 'mean', 'mode', 'sum', 'min', 'max', 'variance', 'count',
# 'standard_deviation','10_percent_quantile', '20_percent_quantile',
# '25_percent_quantile', '30_percent_quantile', '40_percent_quantile',
# '50_percent_quantile', '60_percent_quantile', '70_percent_quantile',
# '75_percent_quantile', '80_percent_quantile', '90_percent_quantile',
# and '95_percent_quantile'.
# To use another aggregate function, the method must be added to the
# dictionary of methods agg_methods_dict, defined in the function.
# If None or an invalid function is input, 'sum' will be used.

ADD_SUFFIX_TO_AGGREGATED_COL = True
# ADD_SUFFIX_TO_AGGREGATED_COL = True will add a suffix to the
# aggregated column. e.g. 'responseVar_mean'. If ADD_SUFFIX_TO_AGGREGATED_COL
# = False, the aggregated column will have the original column name.
SUFFIX = None
# suffix = None. Keep it None if no suffix should be added, or if
# the name of the aggregate function should be used as suffix, after
# "_". Alternatively, set it as a string. As recommendation, put the
# "_" sign in the beginning of this string to separate the suffix from
# the original column name. e.g. if the response variable is 'Y' and
# suffix = '_agg', the new aggregated column will be named as 'Y_agg'
CALCULATE_AND_PLOT_CUMULATIVE_PERCENT = True
# CALCULATE_AND_PLOT_CUMULATIVE_PERCENT = True to calculate and plot
# the line of cumulative percent, or 
# CALCULATE_AND_PLOT_CUMULATIVE_PERCENT = False to omit it.
# This feature is only shown when AGGREGATE_FUNCTION = 'sum', 'median',
# 'mean', or 'mode'. So, it will be automatically set as False if 
# another aggregate is selected.
ORIENTATION = 'vertical'
# ORIENTATION = 'vertical' is the standard, and plots vertical bars
# (perpendicular to the X axis). In this case, the categories are shown
# in the X axis, and the correspondent responses are in Y axis.
# Alternatively, ORIENTATION = 'horizontal' results in horizontal bars.
# In this case, categories are in Y axis, and responses in X axis.
# If None or invalid values are provided, orientation is set as 'vertical'.
LIMIT_OF_PLOTTED_CATEGORIES = None
# LIMIT_OF_PLOTTED_CATEGORIES: integer value that represents
# the maximum number of categories that will be plotted. Keep it None to plot
# all categories. Alternatively, set an integer value. e.g.: if
# LIMIT_OF_PLOTTED_CATEGORIES = 4, but there are more categories,
# the dataset will be sorted in descending order and: 1) The remaining
# categories will be summed into a new category named 'others' if the
# aggregate function is 'sum'; 2) Or the other categories will be simply
# omitted from the plot, for other aggregate functions. Notice that
# it limits only the categories shown in the plot: all of them will be
# returned in the dataframe.
# Use this parameter to obtain a cleaner plot. Notice that the remaining
# categories will be aggregated as 'others' even if there is a single category
# beyond the limit.

X_AXIS_ROTATION = 70
#Rotation of X axis labels. Alternatively, insert any numeric value from 0 to 90 (degrees).
Y_AXIS_ROTATION = 0
#Rotation of Y axis labels. Alternatively, insert any numeric value from 0 to 90 (degrees).

GRID = True #Alternatively: True or False
# If GRID = False, no grid lines are shown in the graphic.

HORIZONTAL_AXIS_TITLE = None #Alternatively: string inside quotes for horizontal title
VERTICAL_AXIS_TITLE = None #Alternatively: string inside quotes for vertical title
PLOT_TITLE = None #Alternatively: string inside quotes for graphic title
# e.g. HORIZONTAL_AXIS_TITLE = "X", VERTICAL_AXIS_TITLE = "Y", PLOT_TITLE = "YxX"

EXPORT_PNG = False
# Set EXPORT_PNG = False if you do not want to export the obtained image;
# Set EXPORT_PNG = True to export the obtained image.

DIRECTORY_TO_SAVE = None
# This parameter has effect only if EXPORT_PNG = True.
# (string, in quotes): input the path of the directory 
# (e.g. folder path) where the file will be stored. e.g. DIRECTORY_TO_SAVE = "" 
# or DIRECTORY_TO_SAVE = "folder"
# If EXPORT_PNG = True and DIRECTORY_TO_SAVE = None, the file will be saved in the root
# path, DIRECTORY_TO_SAVE = ""

FILE_NAME = None
# This parameter has effect only if EXPORT_PNG = True.
# (string, in quotes): input the name you want for the file without the 
# extension, which will be 'png'. e.g. FILE_NAME = "my_image" will save a file 'my_image.png' 
# If EXPORT_PNG = True and FILE_NAME = None, the file will be saved as:
# 'bar_chart.png'
# WARNING: if there is already a file in the path DIRECTORY_TO_SAVE saved as FILE_NAME,
# the file will be overwritten.

PNG_RESOLUTION_DPI = 330
# This parameter has effect only if EXPORT_PNG = True.
# Alternatively, input an integer that will correspond to the resolution of the exported
# image in dpi. If PNG_RESOLUTION_DPI = None, it will be set as 330.


# New dataframe saved as aggregated_sorted_df. 
# Simply modify this object on the left of equality:
aggregated_sorted_df = bar_chart (df = DATASET, categorical_var_name = CATEGORICAL_VAR_NAME, response_var_name = RESPONSE_VAR_NAME, aggregate_function = AGGREGATE_FUNCTION, add_suffix_to_aggregated_col = ADD_SUFFIX_TO_AGGREGATED_COL, suffix = SUFFIX, calculate_and_plot_cumulative_percent = CALCULATE_AND_PLOT_CUMULATIVE_PERCENT, orientation = ORIENTATION, limit_of_plotted_categories = LIMIT_OF_PLOTTED_CATEGORIES, horizontal_axis_title = HORIZONTAL_AXIS_TITLE, vertical_axis_title = VERTICAL_AXIS_TITLE, plot_title = PLOT_TITLE, x_axis_rotation = X_AXIS_ROTATION, y_axis_rotation = Y_AXIS_ROTATION, grid = GRID, export_png = EXPORT_PNG, directory_to_save = DIRECTORY_TO_SAVE, file_name = FILE_NAME, png_resolution_dpi = PNG_RESOLUTION_DPI)

### **Visualizing time series**
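
The next cell accepts the data in two layouts: a dataframe where all responses share one column and a label column identifies the group, or a list of dictionaries with the keys 'x', 'y' and 'lab'. The sketch below converts the first layout into the second with Pandas, using the 'time', 'results' and 'group' columns of the example given in the cell. It only illustrates the data structures; it is not the plotting function.

```python
import pandas as pd

# Made-up long-format data: results of groups A and B stored in the same column.
dataset = pd.DataFrame({
    'time': [0, 1, 2, 0, 1, 2],
    'results': [1.0, 1.5, 2.0, 0.8, 1.1, 1.9],
    'group': ['A', 'A', 'A', 'B', 'B', 'B'],
})

# One dictionary per group, always with the keys 'x', 'y' and 'lab'.
list_of_dictionaries = [
    {'x': group_df['time'], 'y': group_df['results'], 'lab': label}
    for label, group_df in dataset.groupby('group')
]

# Dictionaries where 'x' or 'y' is None carry no series, so they are discarded.
list_of_dictionaries.append({'x': None, 'y': None, 'lab': None})
valid_series = [d for d in list_of_dictionaries
                if d['x'] is not None and d['y'] is not None]

for d in valid_series:
    print(d['lab'], d['x'].tolist(), d['y'].tolist())
```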

In [None]:
DATA_IN_SAME_COLUMN = False

# Parameters to input when DATA_IN_SAME_COLUMN = True:
DATASET = None #Alternatively: object containing the dataset to be analyzed (e.g. DATASET = dataset)
COLUMN_WITH_PREDICT_VAR_X = 'X' # Alternatively: correct name for X-column
COLUMN_WITH_RESPONSE_VAR_Y = 'Y' # Alternatively: correct name for Y-column
COLUMN_WITH_LABELS = 'label_column' # Alternatively: correct name for column with the labels or groups

# DATA_IN_SAME_COLUMN = False: set as True if all the values to plot are in a same column.
# If DATA_IN_SAME_COLUMN = True, you must specify the dataframe containing the data as DATASET;
# the column containing the predict variable (X) as COLUMN_WITH_PREDICT_VAR_X; the column 
# containing the responses to plot (Y) as COLUMN_WITH_RESPONSE_VAR_Y; and the column 
# containing the labels (subgroup) indication as COLUMN_WITH_LABELS. 
# DATASET is an object, so do not declare it in quotes. The other three arguments (columns' names) 
# are strings, so declare in quotes. 

# Example: suppose you have a dataframe saved as dataset, and two groups A and B to compare. 
# All the results for both groups are in a column named 'results', which will be plotted against
# the time, saved as 'time' (X = 'time'; Y = 'results'). If the result is for
# an entry from group A, then a column named 'group' has the value 'A'. If it is for group B,
# column 'group' shows the value 'B'. In this example:
# DATA_IN_SAME_COLUMN = True,
# DATASET = dataset,
# COLUMN_WITH_PREDICT_VAR_X = 'time',
# COLUMN_WITH_RESPONSE_VAR_Y = 'results', 
# COLUMN_WITH_LABELS = 'group'
# If you want to declare a list of dictionaries, keep DATA_IN_SAME_COLUMN = False and keep
# DATASET = None (the other arguments may be set as None, but it is not mandatory: 
# COLUMN_WITH_PREDICT_VAR_X = None, COLUMN_WITH_RESPONSE_VAR_Y = None, COLUMN_WITH_LABELS = None).


# Parameter to input when DATA_IN_SAME_COLUMN = False:
LIST_OF_DICTIONARIES_WITH_SERIES_TO_ANALYZE = [
    
    {'x': None, 'y': None, 'lab': None}, 
    {'x': None, 'y': None, 'lab': None}, 
    {'x': None, 'y': None, 'lab': None}, 
    {'x': None, 'y': None, 'lab': None}, 
    {'x': None, 'y': None, 'lab': None}, 
    {'x': None, 'y': None, 'lab': None}, 
    {'x': None, 'y': None, 'lab': None}, 
    {'x': None, 'y': None, 'lab': None}, 
    {'x': None, 'y': None, 'lab': None}, 
    {'x': None, 'y': None, 'lab': None}, 
    {'x': None, 'y': None, 'lab': None}
    
]
# LIST_OF_DICTIONARIES_WITH_SERIES_TO_ANALYZE: if data is already converted to series, lists
# or arrays, provide them as a list of dictionaries. It must be declared as a list, in brackets,
# even if there is a single dictionary.
# Always use the same keys: 'x' for the X-series (predictor variables); 'y' for the Y-series
# (response variables); and 'lab' for the labels. If you do not want to declare a series, simply
# keep it as None, but do not remove or rename a key (ALWAYS USE THE KEYS SHOWN AS MODEL).
# If you want, you can remove elements (dictionaries) from the list to declare fewer elements;
# and you can also add more elements (dictionaries) to the list, if you need to plot more series.
# Simply put a comma after the last element of the list and declare a new dictionary, keeping the
# same keys: {'x': x_series, 'y': y_series, 'lab': label}, where x_series, y_series and label
# represent the series and label of the added dictionary (you can pass 'lab': None, but if
# 'x' or 'y' is None, the new dictionary will be ignored).

# Examples:
# LIST_OF_DICTIONARIES_WITH_SERIES_TO_ANALYZE = 
# [{'x': DATASET['X'], 'y': DATASET['Y'], 'lab': 'label'}]
# will plot a single variable. In turn:
# LIST_OF_DICTIONARIES_WITH_SERIES_TO_ANALYZE = 
# [{'x': DATASET['X'], 'y': DATASET['Y1'], 'lab': 'label'},
#  {'x': DATASET['X'], 'y': DATASET['Y2'], 'lab': None},
#  {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None},
#  {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None},
#  {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}]
# will plot two series, Y1 x X and Y2 x X.
# Notice that all dictionaries where 'x' or 'y' is None are automatically ignored.
# If None is provided to 'lab', an automatic label will be generated.


X_AXIS_ROTATION = 70
#Rotation of X axis labels. Alternatively, insert any numeric value from 0 to 90 (degrees).
Y_AXIS_ROTATION = 0
#Rotation of Y axis labels. Alternatively, insert any numeric value from 0 to 90 (degrees).
GRID = True #Alternatively: True or False
# If GRID = False, no grid lines are shown in the graphic.
ADD_SPLINE_LINES = True #Alternatively: True or False
# If ADD_SPLINE_LINES = False, no lines connecting the successive values are shown.
ADD_SCATTER_DOTS = False #Alternatively: True or False
# If ADD_SCATTER_DOTS = False, no dots representing the data points are shown.
HORIZONTAL_AXIS_TITLE = None #Alternatively: string inside quotes for horizontal title
VERTICAL_AXIS_TITLE = None #Alternatively: string inside quotes for vertical title
PLOT_TITLE = None #Alternatively: string inside quotes for graphic title
# e.g. HORIZONTAL_AXIS_TITLE = "X", VERTICAL_AXIS_TITLE = "Y", PLOT_TITLE = "YxX"

EXPORT_PNG = False
# Set EXPORT_PNG = False if you do not want to export the obtained image;
# Set EXPORT_PNG = True to export the obtained image.
DIRECTORY_TO_SAVE = None
# This parameter has effect only if EXPORT_PNG = True.
# (string, in quotes): input the path of the directory 
# (e.g. folder path) where the file will be stored. e.g. DIRECTORY_TO_SAVE = "" 
# or DIRECTORY_TO_SAVE = "folder"
# If EXPORT_PNG = True and DIRECTORY_TO_SAVE = None, the file will be saved in the root
# path, DIRECTORY_TO_SAVE = ""
FILE_NAME = None
# This parameter has effect only if EXPORT_PNG = True.
# (string, in quotes): input the name you want for the file without the 
# extension, which will be 'png'. e.g. FILE_NAME = "my_image" will save a file 'my_image.png' 
# If EXPORT_PNG = True and FILE_NAME = None, the file will be saved as:
# 'time_series_vis.png'
# WARNING: if there is already a file in the path DIRECTORY_TO_SAVE saved as FILE_NAME,
# the file will be overwritten.
PNG_RESOLUTION_DPI = 330
# This parameter has effect only if EXPORT_PNG = True.
# Alternatively, input an integer that will correspond to the resolution of the exported
# image in dpi. If PNG_RESOLUTION_DPI = None, it will be set as 330.


time_series_vis(
    data_in_same_column = DATA_IN_SAME_COLUMN,
    df = DATASET,
    column_with_predict_var_x = COLUMN_WITH_PREDICT_VAR_X,
    column_with_response_var_y = COLUMN_WITH_RESPONSE_VAR_Y,
    column_with_labels = COLUMN_WITH_LABELS,
    list_of_dictionaries_with_series_to_analyze = LIST_OF_DICTIONARIES_WITH_SERIES_TO_ANALYZE,
    x_axis_rotation = X_AXIS_ROTATION,
    y_axis_rotation = Y_AXIS_ROTATION,
    grid = GRID,
    add_splines_lines = ADD_SPLINE_LINES,
    add_scatter_dots = ADD_SCATTER_DOTS,
    horizontal_axis_title = HORIZONTAL_AXIS_TITLE,
    vertical_axis_title = VERTICAL_AXIS_TITLE,
    plot_title = PLOT_TITLE,
    export_png = EXPORT_PNG,
    directory_to_save = DIRECTORY_TO_SAVE,
    file_name = FILE_NAME,
    png_resolution_dpi = PNG_RESOLUTION_DPI
)
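
The comments in the cell above describe two ways of passing data: a single long-format dataframe (DATA_IN_SAME_COLUMN = True) or a list of dictionaries with the keys 'x', 'y' and 'lab'. The sketch below shows one possible way to build that list from a long-format dataframe, assuming one series is plotted per label. The helper name `long_to_series_dicts` and the sample values are hypothetical and are not part of the notebook's library; only the column names 'time', 'results' and 'group' come from the example in the comments.

```python
import pandas as pd

def long_to_series_dicts(df, x_col, y_col, label_col):
    """Hypothetical helper: return one {'x', 'y', 'lab'} dictionary per label value."""
    series_dicts = []
    # sort=False keeps the groups in their order of first appearance
    for label, group in df.groupby(label_col, sort=False):
        series_dicts.append({'x': group[x_col], 'y': group[y_col], 'lab': label})
    return series_dicts

# Illustrative data mirroring the 'time' / 'results' / 'group' example (values are made up)
dataset = pd.DataFrame({
    'time': [0, 1, 2, 0, 1, 2],
    'results': [1.0, 1.5, 2.1, 0.9, 1.2, 1.8],
    'group': ['A', 'A', 'A', 'B', 'B', 'B'],
})

LIST_OF_DICTIONARIES_WITH_SERIES_TO_ANALYZE = long_to_series_dicts(
    dataset, x_col='time', y_col='results', label_col='group'
)
```

With this list, DATA_IN_SAME_COLUMN would stay False and DATASET would stay None, as the parameter comments describe. Whether the resulting figure matches the one obtained with DATA_IN_SAME_COLUMN = True depends on the implementation of time_series_vis, which is not shown here.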

****