# **Dataset Transformation**

## _ETL Workflow Notebook 3_

## Content:
1. Removing trailing or leading white spaces or characters (trim) from string variables, and modifying the variable type;
2. Capitalizing or lowering case of string variables (string homogenizing);
3. Adding contractions to the contractions library;
4. Correcting contracted strings;
5. Substituting (replacing) substrings on string variables;
6. Inverting the order of the string characters;
7. Slicing the strings;
8. Getting the leftmost characters from the strings (retrieve first characters);
9. Getting the rightmost characters from the strings (retrieve last characters);
10. Joining strings from the same column into a single string;
11. Joining several string columns into a single string column;
12. Splitting strings into a list of strings;
13. Substituting (replacing or switching) whole strings by different text values (on string variables);
14. Replacing strings with Machine Learning: finding similar strings and replacing them by standard strings;
15. Searching for Regular Expression (RegEx) within a string column;
16. Replacing a Regular Expression (RegEx) from a string column;
17. Applying Fast Fourier Transform;
18. Generating columns with frequency information;
19. Transforming the dataset and reverse transforms: log-transform; 
20. Exponential transform; 
21. Box-Cox transform; 
22. One-Hot Encoding;
23. Ordinal Encoding;
24. Feature scaling; 
25. Importing or exporting models and dictionaries.

Marco Cesar Prado Soares, Data Scientist Specialist - Bayer Crop Science LATAM
- marcosoares.feq@gmail.com
- marco.soares@bayer.com

In [None]:
# To install a library (e.g. tensorflow), uncomment and run:
# ! pip install tensorflow
# To update a library (e.g. tensorflow), uncomment and run:
# ! pip install tensorflow --upgrade
# To update pip, uncomment and run:
# ! pip install pip --upgrade
# To check whether a library is installed and view its information, uncomment and run
# (e.g. tensorflow):
# ! pip show tensorflow
# To run a Python file (e.g. idsw_etl.py) saved in the notebook's workspace directory,
# uncomment and run:
# import idsw_etl
# or:
# import idsw_etl as etl

## **Load Python Libraries in Global Context**

In [None]:
import pandas as pd
import numpy as np

# **Function for mounting Google Drive or S3 (AWS Simple Storage Service) bucket**

In [None]:
def mount_storage_system (source = 'aws', path_to_store_imported_s3_bucket = '', s3_bucket_name = None, s3_obj_prefix = None):
    
    # source = 'google' for mounting the google drive;
    # source = 'aws' for mounting an AWS S3 bucket.
    
    # THE FOLLOWING PARAMETERS HAVE EFFECT ONLY WHEN source == 'aws'
    
    # path_to_store_imported_s3_bucket: path of the Python environment to which the
    # S3 bucket contents will be imported. If it is None, or if 
    # path_to_store_imported_s3_bucket = '/', bucket will be imported to the root path. 
    # Alternatively, input the path as a string (in quotes). e.g. 
    # path_to_store_imported_s3_bucket = 'copied_s3_bucket'
    
    # s3_bucket_name = None.
    # This parameter is mandatory to access an AWS S3 bucket. Replace it with a string
    # containing the bucket's name. e.g. s3_bucket_name = "aws-bucket-1" accesses a bucket
    # named "aws-bucket-1".
    
    # s3_obj_prefix = None. Keep it None or as an empty string (s3_obj_prefix = '')
    # to import the whole bucket content, instead of a single object from it.
    # Alternatively, set it as a string containing the subfolder from the bucket to import:
    # Suppose that your bucket (admin-created) has four objects with the following object 
    # keys: Development/Projects.xlsx; Finance/statement1.pdf; Private/taxdocument.pdf; and
    # s3-dg.pdf. The s3-dg.pdf key does not have a prefix, so its object appears directly 
    # at the root level of the bucket. If you open the Development/ folder, you see 
    # the Projects.xlsx object in it.
    # Check Amazon documentation:
    # https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-keys.html
    
    # In summary, if the path of the file is: 'bucket/my_path/.../file.csv'
    # where 'bucket' is the bucket's name, s3_obj_prefix = 'my_path/.../', without the
    # final 'file.csv' part (file name with extension).
    
    # So, set s3_obj_prefix to import only files from a given folder (directory)
    # of the bucket.
    # DO NOT PUT A SLASH before (to the left of) the prefix;
    # DO NOT ADD THE BUCKET'S NAME to the left of the prefix:
    # s3_obj_prefix = "bucket_directory1/.../bucket_directoryN/"

    # Alternatively, provide the full path of a given file if you want to import only it:
    # s3_obj_prefix = "bucket_directory1/.../bucket_directoryN/my_file.ext"
    # where my_file is the file's name, and ext is its extension.


    # Attention: after running this function for fetching AWS Simple Storage Service (S3), 
    # your 'AWS Access key ID' and your 'Secret access key' will be requested.
    # The 'Secret access key' is hidden while you type it, so it cannot be seen or copied by
    # other users. On the other hand, the same is not true for the 'Access key ID', the bucket's name 
    # and the prefix. All of these are sensitive information from the organization.
    # Therefore, after importing the information, always remember to clear the output of this cell
    # and to remove such information from the strings.
    # Remember that these data may grant privileges for accessing the information, so they should not
    # be shared with unauthorized people.

    # Also, remember to delete the imported files from the workspace after finishing the analysis.
    # The cost of storing files in S3 is considerably lower than the cost of storing them directly in the
    # workspace. Also, files stored in S3 may be accessed by users other than those with access to
    # the notebook's workspace.
    
    
    if (source == 'google'):
        
        from google.colab import drive
        # Google Colab library must be imported only in case it is
        # going to be used, for avoiding AWS compatibility issues.
        
        print("Associate the Python environment to your Google Drive account, and authorize the access in the opened window.")
        
        drive.mount('/content/drive')
        
        print("Now your Python environment is connected to your Google Drive: the root directory of your environment is now the root of your Google Drive.")
        print("In Google Colab, navigate to the folder icon (\'Files\') of the left navigation menu to find a specific folder or file in your Google Drive.")
        print("Click on the folder or file name and select the ellipsis (...) icon on the right of the name to reveal the option \'Copy path\', which will give you the path to use as input for loading objects and files on your Python environment.")
        print("Caution: save your files into different directories of the Google Drive. If files are all saved in a same folder or directory, like the root path, they may not be accessible from your Python environment.")
        print("If you still cannot see the file after moving it to a different folder, reload the environment.")
    
    elif (source == 'aws'):
        
        import os
        import boto3
        # boto3 is AWS S3 Python SDK
        # sagemaker and boto3 libraries must be imported only in case 
        # they are going to be used, for avoiding 
        # Google Colab compatibility issues.
        from getpass import getpass

        # Check if path_to_store_imported_s3_bucket is None. If it is, make it the root directory:
        if ((path_to_store_imported_s3_bucket is None)|(str(path_to_store_imported_s3_bucket) == "/")):
            
            # For the S3 buckets, the path should not start with slash. Assign the empty
            # string instead:
            path_to_store_imported_s3_bucket = ""
            print("Bucket\'s content will be copied to the notebook\'s root directory.")
        
        elif (str(path_to_store_imported_s3_bucket) == ""):
            # Guarantee that the path is the empty string.
            # Avoid accessing the else condition, what would raise an error
            # since the empty string has no character of index 0
            path_to_store_imported_s3_bucket = str(path_to_store_imported_s3_bucket)
            print("Bucket\'s content will be copied to the notebook\'s root directory.")
        
        else:
            # Use the str attribute to guarantee that the path was read as a string:
            path_to_store_imported_s3_bucket = str(path_to_store_imported_s3_bucket)
            
            if(path_to_store_imported_s3_bucket[0] == "/"):
                # the first character is the slash. Let's remove it

                # In AWS, neither the prefix nor the path to which the file will be imported
                # (file from S3 to workspace) or from which the file will be exported to S3
                # (the path in the notebook's workspace) may start with slash, or the operation
                # will not be concluded. Then, we have to remove this character if it is present.

                # The slash is character 0. Then, we want all characters from character 1 (the
                # second) to character len(path_to_store_imported_s3_bucket) - 1, the index
                # of the last character. The slicing syntax is:
                # string[1:] - all characters from character 1 to the end;
                # string[:10] - all characters from the start up to character 10-1 = 9 (including 9); or
                # string[1:10] - characters from 1 to 9.
                # So, slice the whole string, starting from character 1:
                path_to_store_imported_s3_bucket = path_to_store_imported_s3_bucket[1:]
                # Attention: even though strings may be seen as sequences of characters that can be
                # sliced, they are immutable: we can neither assign a character to a given position
                # nor delete a character from a position.

        # Ask the user to provide the credentials:
        ACCESS_KEY = input("Enter your AWS Access Key ID here (in the field on the right). It is the value stored in the field \'Access key ID\' from your AWS user credentials CSV file.")
        print("\n") # line break
        SECRET_KEY = getpass("Enter your password (Secret key) here (in the field on the right). It is the value stored in the field \'Secret access key\' from your AWS user credentials CSV file.")
        
        # The use of 'getpass' instead of 'input' hides the password while it is typed.
        # So, the password is not visible to other users and cannot be copied.
        
        print("\n")
        print("WARNING: The bucket\'s name, the prefix, the AWS access key ID, and the AWS Secret access key are all sensitive information, which may grant access to protected information from the organization.\n")
        print("After copying data from S3 to your workspace, remember to remove this information from the notebook, especially if it is going to be shared. Also, remember to remove the files from the workspace.\n")
        print("The cost of storing files in Simple Storage Service is considerably lower than the cost of storing them directly in the SageMaker workspace. Also, files stored in S3 may be accessed by users other than those with access to the notebook\'s workspace.\n")

        # Check if the user actually provided the mandatory inputs, instead
        # of putting None or empty string:
        if ((ACCESS_KEY is None) | (ACCESS_KEY == '')):
            print("AWS Access Key ID is missing. It is the value stored in the field \'Access key ID\' from your AWS user credentials CSV file.")
            return "error"
        elif ((SECRET_KEY is None) | (SECRET_KEY == '')):
            print("AWS Secret Access Key is missing. It is the value stored in the field \'Secret access key\' from your AWS user credentials CSV file.")
            return "error"
        elif ((s3_bucket_name is None) | (s3_bucket_name == '')):
            print ("Please, enter a valid S3 Bucket\'s name. Do not add sub-directories or folders (prefixes), only the name of the bucket itself.")
            return "error"
        
        else:
            # Use the str attribute to guarantee that all AWS parameters were properly read as strings, and not as
            # other variables (like integers or floats):
            ACCESS_KEY = str(ACCESS_KEY)
            SECRET_KEY = str(SECRET_KEY)
            s3_bucket_name = str(s3_bucket_name)
        
        if(s3_bucket_name[0] == "/"):
                # the first character is the slash. Let's remove it

                # In AWS, neither the prefix nor the path to which the file will be imported
                # (file from S3 to workspace) or from which the file will be exported to S3
                # (the path in the notebook's workspace) may start with slash, or the operation
                # will not be concluded. Then, we have to remove this character if it is present.

                # So, slice the whole string, starting from character 1 (as did for 
                # path_to_store_imported_s3_bucket):
                s3_bucket_name = s3_bucket_name[1:]

        # Remove any trailing whitespace (spaces, tabs, line breaks)
        # that may be present in the strings. Use the Python string
        # rstrip method, which trims only the right end of the string
        # (the strip method is the one that trims both ends).
        # When no arguments are provided, whitespace characters
        # are the ones removed:
        # https://www.w3schools.com/python/ref_string_rstrip.asp?msclkid=ee2d05c3c56811ecb1d2189d9f803f65
        s3_bucket_name = s3_bucket_name.rstrip()
        ACCESS_KEY = ACCESS_KEY.rstrip()
        SECRET_KEY = SECRET_KEY.rstrip()
        # Since the user manually inputs the parameters ACCESS_KEY and SECRET_KEY,
        # it is easy to input whitespace without noticing.

        # Now process the optional parameter.
        # Check if a prefix was passed as input parameter. If so, we must select only the names that start with
        # the prefix.
        # Example: in the bucket 'my_bucket' we have a directory 'dir1'.
        # In the main (root) directory, we have a file 'file1.json', whose key is simply 'file1.json'.
        # If we pass the prefix 'dir1', we want only the files whose keys start with 'dir1/',
        # such as 'dir1/file2.json', excluding the file in the main (root) directory and the files in other
        # directories. Also, we want to eliminate the object names with no extensions, like 'dir1/' or 'dir1/dir2',
        # since these object names represent folders or directories, not files.

        if (s3_obj_prefix is None):
            print ("No prefix, specific object, or subdirectory provided.") 
            print (f"Then, retrieving all content from the bucket \'{s3_bucket_name}\'.\n")
        elif ((s3_obj_prefix == "/") | (s3_obj_prefix == '')):
            # The root directory in the bucket must not be specified starting with the slash
            # If the root "/" or the empty string '' is provided, make
            # it equivalent to None (no directory)
            s3_obj_prefix = None
            print ("No prefix, specific object, or subdirectory provided.") 
            print (f"Then, retrieving all content from the bucket \'{s3_bucket_name}\'.\n")
    
        else:
            # Since there is a prefix, use the str attribute to guarantee that the path was read as a string:
            s3_obj_prefix = str(s3_obj_prefix)
            
            if(s3_obj_prefix[0] == "/"):
                # the first character is the slash. Let's remove it

                # In AWS, neither the prefix nor the path to which the file will be imported
                # (file from S3 to workspace) or from which the file will be exported to S3
                # (the path in the notebook's workspace) may start with slash, or the operation
                # will not be concluded. Then, we have to remove this character if it is present.

                # So, slice the whole string, starting from character 1 (as did for 
                # path_to_store_imported_s3_bucket):
                s3_obj_prefix = s3_obj_prefix[1:]

            # Remove any trailing whitespace (spaces, tabs, line breaks)
            # that may be present in the string. Use the Python string
            # rstrip method, which trims only the right end of the string:
            s3_obj_prefix = s3_obj_prefix.rstrip()
            
            # Store the total characters in the prefix string after removing the initial slash
            # and trailing spaces:
            prefix_len = len(s3_obj_prefix)
            
            print("AWS Access Credentials, and bucket\'s prefix, object or subdirectory provided.\n")	

            
        print ("Starting connection with the S3 bucket.\n")
        
        try:
            # Start the S3 resource as the object 's3_client'.
            # Note: boto3 validates the credentials only when the first request is sent to AWS.
            s3_client = boto3.resource('s3', aws_access_key_id = ACCESS_KEY, aws_secret_access_key = SECRET_KEY)
        
            print("S3 client successfully started.\n")
            # An object 'data_table.xlsx' in the main (root) directory of the s3_bucket is stored in Python environment as:
            # s3.ObjectSummary(bucket_name='bucket_name', key='data_table.xlsx')
            # The name of each object is stored as the attribute 'key' of the object.
        
        except:
            
            print("Failed to connect to AWS Simple Storage Service (S3). Review if your credentials are correct.")
            print("The variable \'access_key\' must be set as the value (string) stored as \'Access key ID\' in your user security credentials CSV file.")
            print("The variable \'secret_key\' must be set as the value (string) stored as \'Secret access key\' in your user security credentials CSV file.")
            # Stop here: without 's3_client', the next steps would raise a NameError.
            return "error"
        
        try:
            # Create the bucket object for the bucket named s3_bucket_name.
            # The bucket is started as the object 's3_bucket':
            s3_bucket = s3_client.Bucket(s3_bucket_name)
            print(f"Connection with bucket \'{s3_bucket_name}\' established.\n")
            
        except:
            
            print("Failed to connect with the bucket, which usually happens when declaring a wrong bucket\'s name.") 
            print("Check the spelling of your bucket_name string and remember that it must be all in lower-case.\n")
            # Stop here: without 's3_bucket', the object listing below cannot run.
            return "error"
                

        # Then, let's obtain a list of all objects in the bucket (list bucket_objects):
        
        bucket_objects_list = []

        # Loop through all objects of the bucket:
        for stored_obj in s3_bucket.objects.all():
            
            # Loop through all elements 'stored_obj' from s3_bucket.objects.all()
            # Which stores the ObjectSummary for all objects in the bucket s3_bucket:
            # Let's store only the key attribute and use the str function
            # to guarantee that all values were stored as strings.
            bucket_objects_list.append(str(stored_obj.key))
        
        # Now start a support list to store only the elements from
        # bucket_objects_list that are not folders or directories
        # (objects with extensions).
        # If a prefix was provided, only files with that prefix should
        # be added:
        support_list = []
        
        for stored_obj in bucket_objects_list:
            
            # Loop through all elements 'stored_obj' from the list
            # bucket_objects_list

            # Check the file extension.
            file_extension = os.path.splitext(stored_obj)[1][1:]
            
            # The os.path.splitext method splits the string at its LAST dot (".") to
            # separate the file extension from the full path. Example:
            # "C:/dir1/dir2/data_table.csv" is split into:
            # "C:/dir1/dir2/data_table" (root part) and '.csv' (extension part)
            # https://www.geeksforgeeks.org/python-os-path-splitext-method/?msclkid=2d56198fc5d311ec820530cfa4c6d574

            # os.path.splitext(stored_obj) is a tuple of strings: the first is the complete file
            # root with no extension; the second is the extension starting with a dot: '.txt'
            # When we set os.path.splitext(stored_obj)[1], we are selecting the second element of
            # the tuple. By selecting os.path.splitext(stored_obj)[1][1:], we are taking this string
            # from the second character (index 1), eliminating the dot: 'txt'


            # Check if the file extension is not an empty string (i.e., that it is different from ''):
            if (file_extension != ''):
                    
                    # The extension is different from the empty string, so the object is neither a folder nor a directory.
                    # The object is actually a file and may be copied if it satisfies the prefix condition. If there
                    # is no prefix to check, we may simply copy the object to the list.

                    # If there is a prefix, the first characters of the stored_obj must be the prefix:
                    if not (s3_obj_prefix is None):
                        
                        # Check the characters from the position 0 (1st character) to the position
                        # prefix_len - 1. Since a prefix was declared, we want only the objects that this first portion
                        # corresponds to the prefix. string[i:j] slices the string from index i to index j-1
                        # Then, the 1st portion of the string to check is: string[0:(prefix_len)]

                        # Slice the string stored_obj from position 0 (1st character) to position prefix_len - 1,
                        # The position that the prefix should end.
                        obj_name_first_part = (stored_obj)[0:(prefix_len)]
                        
                        # If this first part is the prefix, then append the object to 
                        # support list:
                        if (obj_name_first_part == (s3_obj_prefix)):

                                support_list.append(stored_obj)

                    else:
                        # There is no prefix, so we can simply append the object to the list:
                        support_list.append(stored_obj)

            
        # Make the objects list the support list itself:
        bucket_objects_list = support_list
            
        # Now, bucket_objects_list contains the names of all objects from the bucket that must be copied.

        print("Finished mapping objects to fetch. Now, all these objects from S3 bucket will be copied to the notebook\'s workspace, in the specified directory.\n")
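        print(f"A total of {len(bucket_objects_list)} files were found in the specified bucket\'s prefix (\'{s3_obj_prefix}\').")
        # Report the first and last files only if at least one file was found,
        # since indexing an empty list would raise an IndexError:
        if (len(bucket_objects_list) > 0):
            print(f"The first file found is \'{bucket_objects_list[0]}\'; whereas the last file found is \'{bucket_objects_list[-1]}\'.")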
            
        # Now, let's try copying the files:
            
        try:
            
            # Loop through all objects in the list bucket_objects and copy them to the workspace:
            for copied_object in bucket_objects_list:

                # Select the object in the bucket previously started as 's3_bucket':
                selected_object = s3_bucket.Object(copied_object)
            
                # Now, copy this object to the workspace:
                # Set the new file_path. Notice that by now, copied_object may be a string like:
                # 'dir1/.../dirN/file_name.ext', where dirN is the n-th directory and ext is the file extension.
                # We want only the file_name to join with the path to store the imported bucket. So, we can use the
                # str.split method specifying the separator sep = '/' to break the string into a list of substrings.
                # The last element from this list will be 'file_name.ext'
                # https://www.w3schools.com/python/ref_string_split.asp?msclkid=135399b6c63111ecada75d7d91add056

                # 1. Break the copied_object full path into the list object_path_list, using the .split method:
                object_path_list = copied_object.split(sep = "/")

                # 2. Get the last element from this list. Since it has length len(object_path_list) and indexing starts from
                # zero, the index of the last element is (len(object_path_list) - 1):
                fetched_object = object_path_list[(len(object_path_list) - 1)]

                # 3. Finally, join the string fetched_object with the new path (path on the notebook's workspace) to finish
                # The new object's file_path:

                file_path = os.path.join(path_to_store_imported_s3_bucket, fetched_object)

                # Download the selected object to the workspace in the specified file_path
                # The parameter Filename must be input with the path of the copied file, including its name and
                # extension. Example Filename = "/my_table.xlsx" copies a xlsx file named 'my_table' to the notebook's main (root)
                # directory
                selected_object.download_file(Filename = file_path)

                print(f"The file \'{fetched_object}\' was successfully copied to notebook\'s workspace.\n")

                
            print("Finished copying the files from the bucket to the notebook\'s workspace. It may take a couple of minutes until they are shown in the SageMaker environment.\n") 
            print("Do not forget to delete these copies after finishing the analysis. They will remain stored in the bucket.\n")


        except:

            # Run this code for any other exception that may happen (no exception error
            # specified, so any exception runs the following code).
            # Check: https://pythonbasics.org/try-except/?msclkid=4f6b4540c5d011ecb1fe8a4566f632a6
            # for seeing how to handle successive exceptions

            print("Attention! The function raised an exception error, which is probably due to the AWS Simple Storage Service (S3) permissions.")
            print("Before running again this function, check this quick guide for configuring the permission roles in AWS.\n")
            print("It is necessary to create a user with full access permissions to interact with S3 from SageMaker. To configure the user, go to the upper ribbon of AWS, click on Services, and select IAM – Identity and Access Management.")
            print("1. In IAM\'s lateral panel, search for \'Users\' in the group of Access Management.")
            print("2. Click on the \'Add users\' button.")
            print("3. Set a user name in the text box \'User name\'.")
            print("Attention: users and S3 buckets cannot be written in upper case. Also, selecting a name already used by an Amazon user or bucket will raise an error message.\n")
            print("4. In the field \'Select type of Access to AWS\'-\'Select type of AWS credentials\' select the option \'Access key - Programmatic access\'. After that, click on the button \'Next: Permissions\'.")
            print("5. In the field \'Set Permissions\', keep the \'Add user to a group\' button marked.")
            print("6. In the field \'Add user to a group\', click on \'Create group\' (alternatively, you can be added to a group already configured or copy the permissions of another user).")
            print("7. In the text box \'Group\'s name\', set a name for the new group of permissions.")
            print("8. In the search bar below (\'Filter policies\'), search for a policy that fits your needs, and check the option button on the left of this policy. The policy \'AmazonS3FullAccess\' grants full access to the S3 content.")
            print("9. Finally, click on \'Create a group\'.")
            print("10. After the group is created, it will appear with a check box marked, over the previous groups. Keep it marked and click on the button \'Next: Tags\'.")
            print("11. Create and note down the Access key ID and Secret access key. You can also download a comma separated values (CSV) file containing the credentials for future use.")
            print("ATTENTION: These parameters are required for accessing the bucket\'s content from any application, including AWS SageMaker.")
            print("12. Click on \'Next: Review\' and review the user credentials information and permissions.")
            print("13. Click on \'Create user\' and click on the download button to download the CSV file containing the user credentials information.")
            print("The headers of the CSV file (the stored fields) are: \'User name, Password, Access key ID, Secret access key, Console login link\'.")
            print("You need both the values indicated as \'Access key ID\' and as \'Secret access key\' to fetch the S3 bucket.")
            print("\n") # line break
            print("After acquiring the necessary user privileges, use the boto3 library to fetch the bucket from the Python code. boto3 is AWS S3 Python SDK.")
            print("For fetching a specific bucket\'s file use the following code:\n")
            print("1. Set a variable \'access_key\' as the value (string) stored as \'Access key ID\' in your user security credentials CSV file.")
            print("2. Set a variable \'secret_key\' as the value (string) stored as \'Secret access key\' in your user security credentials CSV file.")
            print("3. Set a variable \'bucket_name\' as a string containing only the name of the bucket. Do not add subdirectories, folders (prefixes), or file names.")
            print("Example: if your bucket is named \'my_bucket\' and its main directory contains folders like \'folder1\', \'folder2\', etc, do not declare bucket_name = \'my_bucket/folder1\', even if you only want files from folder1.")
            print("ALWAYS declare only the bucket\'s name: bucket_name = \'my_bucket\'.")
            print("4. Set a variable \'file_path\' containing the path from the bucket\'s subdirectories to the file you want to fetch. Include the file name and its extension.")
            print("If the file is stored in the bucket\'s root (main) directory: file_path = \"my_file.ext\".")
            print("If the path of the file in the bucket is: \'dir1/…/dirN/my_file.ext\', where dirN is the N-th subdirectory, and dir1 is a folder or directory of the main (root) bucket\'s directory: file_path = \"dir1/…/dirN/my_file.ext\".")
            print("Also, we say that \'dir1/…/dirN/\' is the file\'s prefix. Notice that the name of the bucket is never declared here as the path for fetching its content from the Python code.")
            print("5. Set a variable named \'new_path\' to store the path of the file copied to the notebook’s workspace. This path must contain the file name and its extension.")
            print("Example: if you want to copy \'my_file.ext\' to the root directory of the notebook’s workspace, set: new_path = \"/my_file.ext\".")
            print("6. Finally, declare the following code, which refers to the defined variables:\n")

            # Let's use triple quotes to declare a formatted string
            example_code = """
                import boto3
                # Start S3 client as the object 's3_client'
                s3_client = boto3.resource('s3', aws_access_key_id = access_key, aws_secret_access_key = secret_key)
                # Connect to the bucket specified as 'bucket_name'.
                # The bucket is started as the object 's3_bucket':
                s3_bucket = s3_client.Bucket(bucket_name)
                # Select the object in the bucket previously started as 's3_bucket':
                selected_object = s3_bucket.Object(file_path)
                # Download the selected object to the workspace, in the path specified as new_path
                # The parameter Filename must be input with the path of the copied file, including its name and
                # extension. Example Filename = "/my_table.xlsx" copies a xlsx file named 'my_table' to the notebook's main (root)
                # directory
                selected_object.download_file(Filename = new_path)
                """

            print(example_code)

            print("An object \'my_file.ext\' in the main (root) directory of the s3_bucket is stored in Python environment as:")
            print("""s3.ObjectSummary(bucket_name='bucket_name', key='my_file.ext')""")
            # Triple quotes keep the internal quotes without needing many backslashes "\" (the escape character).
            print("Then, the name of each object is stored as the attribute \'key\' of the object. To view all objects, we can loop through their \'key\' attributes:\n")
            example_code = """
                # Loop through all objects of the bucket:
                for stored_obj in s3_bucket.objects.all():
                    # Each element 'stored_obj' from s3_bucket.objects.all() is the
                    # ObjectSummary of one object stored in the bucket s3_bucket.
                    # Print the object's name:
                    print(stored_obj.key)
                    """

            print(example_code)

                
    else:
        
        print("Select a valid source: \'google\' for mounting Google Drive; or \'aws\' for accessing AWS S3 Bucket.")

# **Function for loading the dataframe**

In [None]:
def load_pandas_dataframe (file_directory_path, file_name_with_extension, load_txt_file_with_json_format = False, how_missing_values_are_registered = None, has_header = True, decimal_separator = '.', txt_csv_col_sep = "comma", load_all_sheets_at_once = False, sheet_to_load = None, json_record_path = None, json_field_separator = "_", json_metadata_prefix_list = None):
    
    # Pandas documentation:
    # pd.read_csv: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
    # pd.read_excel: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html
    # pd.json_normalize: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.json_normalize.html
    # Python JSON documentation:
    # https://docs.python.org/3/library/json.html
    
    import os
    import json
    import numpy as np
    import pandas as pd
    from pandas import json_normalize
    
    ## WARNING: Use this function to load dataframes stored on Excel (xls, xlsx, xlsm, xlsb, odf, ods and odt), 
    ## JSON, txt, or CSV (comma separated values) files.
    
    # file_directory_path - (string, in quotes): input the path of the directory (e.g. folder path) 
    # where the file is stored. e.g. file_directory_path = "/" or file_directory_path = "/folder"
    
    # file_name_with_extension - (string, in quotes): input the name of the file with the 
    # extension. e.g. file_name_with_extension = "file.xlsx", or, 
    # file_name_with_extension = "file.csv", "file.txt", or "file.json"
    # Again, the extensions may be: xls, xlsx, xlsm, xlsb, odf, ods, odt, json, txt or csv.
    
    # load_txt_file_with_json_format = False. Set load_txt_file_with_json_format = True 
    # if you want to read a file with txt extension containing a text formatted as JSON 
    # (but not saved as JSON).
    # WARNING: if load_txt_file_with_json_format = True, all the JSON file parameters of the 
    # function (below) must be set. If not, an error message will be raised.
    
    # how_missing_values_are_registered = None: keep it None if missing values are registered as None,
    # empty or np.nan. Pandas automatically converts None to NumPy np.nan objects (floats).
    # This parameter manipulates the argument na_values (default: None) from Pandas functions.
    # By default the following values are interpreted as NaN: ‘’, ‘#N/A’, ‘#N/A N/A’, ‘#NA’, ‘-1.#IND’, 
    # ‘-1.#QNAN’, ‘-NaN’, ‘-nan’, ‘1.#IND’, ‘1.#QNAN’, ‘<NA>’, ‘N/A’, ‘NA’, ‘NULL’, ‘NaN’, 
    # ‘n/a’, ‘nan’, ‘null’.

    # If a different denomination is used, indicate it here. e.g.
    # how_missing_values_are_registered = '.' will convert all strings '.' to missing values;
    # how_missing_values_are_registered = 0 will convert zeros to missing values.

    # If a dictionary is passed, it specifies per-column NA values. For example, if zero is the
    # missing value only in column 'numeric_col', you can specify the following dictionary:
    # how_missing_values_are_registered = {'numeric_col': 0}
    
    
    # has_header = True if the imported table has a header (row with the column names).
    # Alternatively, has_header = False if the table does not have a header.
    
    # decimal_separator = '.' - String. Keep it '.' or None to use the period ('.') as
    # the decimal separator. Alternatively, specify here the separator.
    # e.g. decimal_separator = ',' will set the comma as the separator.
    # It manipulates the argument 'decimal' from Pandas functions.
    
    # txt_csv_col_sep = "comma" - This parameter has effect only when the file is a 'txt'
    # or 'csv'. It informs how the different columns are separated.
    # Set txt_csv_col_sep = "comma", or txt_csv_col_sep = "," 
    # for columns separated by comma;
    # set txt_csv_col_sep = "whitespace", or txt_csv_col_sep = " " 
    # for columns separated by simple spaces.
    # You can also set a specific separator as string. For example:
    # txt_csv_col_sep = '\s+'; or txt_csv_col_sep = '\t' (in this last example, the tabulation
    # is used as separator for the columns - '\t' represents the tab character).
    
    
    ## Parameters for loading Excel files:
    
    # load_all_sheets_at_once = False - This parameter has effect only for Excel files.
    # If load_all_sheets_at_once = True, the function will return a list of dictionaries, each
    # dictionary containing 2 key-value pairs: the first key will be 'sheet', and its
    # value will be the name (or number) of the table (sheet). The second key will be 'df',
    # and its value will be the pandas dataframe object obtained from that sheet.
    # This argument has preference over sheet_to_load. If it is True, all sheets will be loaded.
    
    # sheet_to_load - This parameter has effect only for Excel files.
    # Keep sheet_to_load = None not to specify a sheet of the file, so that the first sheet
    # will be loaded.
    # sheet_to_load may be an integer or a string (inside quotes). sheet_to_load = 0
    # loads the first sheet (sheet with index 0); sheet_to_load = 1 loads the second sheet
    # of the file (index 1); sheet_to_load = "Sheet1" loads a sheet named as "Sheet1".
    # Declare a number to load the sheet with that index, starting from 0; or declare a
    # name to load the sheet with that name.
    
    
    ## Parameters for loading JSON files:
    
    # json_record_path (string): manipulate parameter 'record_path' from json_normalize method.
    # Path in each object to list of records. If not passed, data will be assumed to 
    # be an array of records. If a given field from the JSON stores a nested JSON (or a nested
    # dictionary) declare it here to decompose the content of the nested data. e.g. if the field
    # 'books' stores a nested JSON, declare, json_record_path = 'books'
    
    # json_field_separator = "_" (string). Manipulates the parameter 'sep' from json_normalize method.
    # Nested records will generate names separated by sep. 
    # e.g., for json_field_separator = ".", {‘foo’: {‘bar’: 0}} -> foo.bar.
    # Then, if a given field 'main_field' stores a nested JSON with fields 'field1', 'field2', ...
    # the name of the columns of the dataframe will be formed by concatenating 'main_field', the
    # separator, and the names of the nested fields: 'main_field_field1', 'main_field_field2',...
    
    # json_metadata_prefix_list: list of strings (in quotes). Manipulates the parameter 
    # 'meta' from json_normalize method. Fields to use as metadata for each record in resulting 
    # table. Declare here the non-nested fields, i.e., the fields in the principal JSON. They
    # will be repeated in the rows of the dataframe to give the metadata (context) of the rows.
    
    # e.g. Suppose a JSON with the following structure: {'name': 'Mary', 'last': 'Shelley',
    # 'books': [{'title': 'Frankenstein', 'year': 1818}, {'title': 'Mathilda', 'year': 1819},{'title': 'The Last Man', 'year': 1826}]},
    # Here, there are nested JSONs in the field 'books'. The fields that are not nested
    # are 'name' and 'last'.
    # Then, json_record_path = 'books'
    # json_metadata_prefix_list = ['name', 'last']
    
    
    # Create the complete file path:
    file_path = os.path.join(file_directory_path, file_name_with_extension)
    # Extract the file extension (in lower case, so that 'CSV' and 'csv' are treated alike)
    file_extension = os.path.splitext(file_path)[1][1:].lower()
    # os.path.splitext(file_path) is a tuple of strings: the first is the complete file
    # root with no extension; the second is the extension starting with a point: '.txt'
    # When we set os.path.splitext(file_path)[1], we are selecting the second element of
    # the tuple. By selecting os.path.splitext(file_path)[1][1:], we are taking this string
    # from the second character (index 1), eliminating the dot: 'txt'
    
    # Check if the decimal separator is None. If it is, set it as '.' (period):
    if (decimal_separator is None):
        decimal_separator = '.'
    
    if ((file_extension == 'txt') | (file_extension == 'csv')): 
        # The operator & is equivalent to 'And' (intersection).
        # The operator | is equivalent to 'Or' (union).
        # pandas.read_csv method must be used.
        if (load_txt_file_with_json_format == True):
            
            print("Reading a txt file containing JSON-formatted text. A reading error will be raised if you did not set the JSON parameters.\n")
            
            with open(file_path, 'r') as opened_file:
                # 'r' stands for read mode; 'w' stands for write mode
                # read the whole file as a string named 'file_full_text'
                file_full_text = opened_file.read()
                # if we used the readlines() method, we would be reading the
                # file by line, not the whole text at once.
                # https://stackoverflow.com/questions/8369219/how-to-read-a-text-file-into-a-string-variable-and-strip-newlines
                # https://www.tutorialkart.com/python/python-read-file-as-string/
                
            #Now, file_full_text is a string containing the full content of the txt file.
            json_file = json.loads(file_full_text)
            # json.load(): parses JSON from a file object (an opened file).
            # json.loads(): parses a string with JSON content.
            # e.g. json.loads() must be used to read a string with JSON before converting it to a flat
            # structure like a dataframe.
            # check: https://www.pythonpip.com/python-tutorials/how-to-load-json-file-using-python/
            dataset = json_normalize(json_file, record_path = json_record_path, sep = json_field_separator, meta = json_metadata_prefix_list)
        
        else:
            # Not a JSON txt
        
            if (has_header == True):

                if ((txt_csv_col_sep == "comma") | (txt_csv_col_sep == ",")):

                    dataset = pd.read_csv(file_path, na_values = how_missing_values_are_registered, verbose = True, parse_dates = True, infer_datetime_format = True, decimal = decimal_separator)
                    # verbose = True for showing number of NA values placed in non-numeric columns.
                    # parse_dates = True: try parsing the index as dates.
                    # infer_datetime_format = True: if parse_dates is enabled, pandas attempts to infer
                    # the format of the datetime strings in the columns and, if it can be inferred,
                    # switches to a faster method of parsing them. In some cases this can increase the
                    # parsing speed by 5-10x.
                    # Note: infer_datetime_format is deprecated since pandas 2.0.

                elif ((txt_csv_col_sep == "whitespace") | (txt_csv_col_sep == " ")):

                    dataset = pd.read_csv(file_path, delim_whitespace = True, na_values = how_missing_values_are_registered, verbose = True, parse_dates = True, infer_datetime_format = True, decimal = decimal_separator)
                    
                    
                else:
                    
                    try:
                        
                        # Try using the character specified as the argument txt_csv_col_sep:
                        dataset = pd.read_csv(file_path, sep = txt_csv_col_sep, na_values = how_missing_values_are_registered, verbose = True, parse_dates = True, infer_datetime_format = True, decimal = decimal_separator)
                    
                    except Exception:
                        # An error was raised: the separator is probably not valid
                        print(f"Enter a valid column separator for the {file_extension} file, like: \'comma\' or \'whitespace\'.")
                        # No dataset was loaded, so re-raise the original error:
                        raise


            else:
                # has_header == False

                if ((txt_csv_col_sep == "comma") | (txt_csv_col_sep == ",")):

                    dataset = pd.read_csv(file_path, header = None, na_values = how_missing_values_are_registered, verbose = True, parse_dates = True, infer_datetime_format = True, decimal = decimal_separator)

                    
                elif ((txt_csv_col_sep == "whitespace") | (txt_csv_col_sep == " ")):

                    dataset = pd.read_csv(file_path, delim_whitespace = True, header = None, na_values = how_missing_values_are_registered, verbose = True, parse_dates = True, infer_datetime_format = True, decimal = decimal_separator)
                    
                    
                else:
                    
                    try:
                        
                        # Try using the character specified as the argument txt_csv_col_sep:
                        dataset = pd.read_csv(file_path, sep = txt_csv_col_sep, header = None, na_values = how_missing_values_are_registered, verbose = True, parse_dates = True, infer_datetime_format = True, decimal = decimal_separator)
                    
                    except Exception:
                        # An error was raised: the separator is probably not valid
                        print(f"Enter a valid column separator for the {file_extension} file, like: \'comma\' or \'whitespace\'.")
                        # No dataset was loaded, so re-raise the original error:
                        raise

    elif (file_extension == 'json'):
        
        with open(file_path, 'r') as opened_file:
            
            json_file = json.load(opened_file)
            # The structure json_file = json.load(open(file_path)) relies on the GC to close the file. That's not a 
            # good idea: If someone doesn't use CPython the garbage collector might not be using refcounting (which 
            # collects unreferenced objects immediately) but e.g. collect garbage only after some time.
            # Since file handles are closed when the associated object is garbage collected or closed 
            # explicitly (.close() or .__exit__() from a context manager) the file will remain open until 
            # the GC kicks in.
            # Using 'with' ensures the file is closed as soon as the block is left - even if an exception 
            # happens inside that block, so it should always be preferred for any real application.
            # source: https://stackoverflow.com/questions/39447362/equivalent-ways-to-json-load-a-file-in-python
            
        # json.load(): parses JSON from a file object (an opened file).
        # json.loads(): parses a string with JSON content.
        # Then, use json.load for a .json file
        # and json.loads for a text (string) containing JSON.
        # check: https://www.pythonpip.com/python-tutorials/how-to-load-json-file-using-python/
        dataset = json_normalize(json_file, record_path = json_record_path, sep = json_field_separator, meta = json_metadata_prefix_list)
    
    else:
        # If it is neither a csv, nor a txt, nor a json file, let's assume it is one of the
        # possible Excel file types.
        print("Excel file inferred. If an error message is shown, check if a valid file extension was used: \'xlsx\', \'xls\', etc.\n")
        # For Excel files, Pandas automatically detects the decimal separator of numeric cells, so only
        # the parameter parse_dates is set here.
        # The argument infer_datetime_format was once present on the read_excel function, but was removed.
        # Since Pandas 1.4, it is possible to pass the parameter 'decimal' to the read_excel function
        # for parsing decimal separators in strings. For numeric variables, it is not needed, though.
        
        if (load_all_sheets_at_once == True):
            
            # Corresponds to setting sheet_name = None
            
            if (has_header == True):
                
                xlsx_doc = pd.read_excel(file_path, sheet_name = None, na_values = how_missing_values_are_registered, verbose = True, parse_dates = True)
                # verbose = True for showing number of NA values placed in non-numeric columns.
                # parse_dates = True: try parsing the index as dates.
                
            else:
                #No header
                xlsx_doc = pd.read_excel(file_path, sheet_name = None, header = None, na_values = how_missing_values_are_registered, verbose = True, parse_dates = True)
            
            # xlsx_doc is a dictionary containing the sheet names as keys, and dataframes as values.
            # Let's convert it to the desired format.
            # For a dictionary d, d.keys() is a view of the keys; d.values() is a view of the values;
            # and d.items() is a view of tuples with format (key, value)
            
            # Create a list of returned datasets:
            list_of_datasets = []
            
            # Let's iterate through the (key, value) tuples. The first element returned is the key, and the
            # second is the value
            for sheet_name, dataframe in (xlsx_doc.items()):
                # sheet_name = key; dataframe = value
                # Define the dictionary with the standard format:
                df_dict = {'sheet': sheet_name,
                            'df': dataframe}
                
                # Add the dictionary to the list:
                list_of_datasets.append(df_dict)
            
            print("\n")
            print(f"A total of {len(list_of_datasets)} dataframes were retrieved from the Excel file.\n")
            print(f"The dataframes correspond to the following Excel sheets: {list(xlsx_doc.keys())}\n")
            print("Returning a list of dictionaries. Each dictionary contains the key \'sheet\', with the original sheet name; and the key \'df\', with the Pandas dataframe object obtained.\n")
            print(f"Check the first 10 rows of the dataframe obtained from the first sheet, named {list_of_datasets[0]['sheet']}:\n")
            
            try:
                # only works in Jupyter Notebook:
                from IPython.display import display
                display((list_of_datasets[0]['df']).head(10))
            
            except: # regular mode
                print((list_of_datasets[0]['df']).head(10))
            
            return list_of_datasets
            
        elif (sheet_to_load is not None):        
            # Case where the user specifies which sheet of the Excel file should be loaded.
            
            if (has_header == True):
                
                dataset = pd.read_excel(file_path, sheet_name = sheet_to_load, na_values = how_missing_values_are_registered, verbose = True, parse_dates = True)
                # verbose = True for showing number of NA values placed in non-numeric columns.
                # parse_dates = True: try parsing the index as dates.
                
            else:
                #No header
                dataset = pd.read_excel(file_path, sheet_name = sheet_to_load, header = None, na_values = how_missing_values_are_registered, verbose = True, parse_dates = True)
                
        
        else:
            #No sheet specified
            if (has_header == True):
                
                dataset = pd.read_excel(file_path, na_values = how_missing_values_are_registered, verbose = True, parse_dates = True)
                
            else:
                #No header
                dataset = pd.read_excel(file_path, header = None, na_values = how_missing_values_are_registered, verbose = True, parse_dates = True)
                
    print(f"Dataset extracted from {file_path}. Check the first 10 rows of this dataframe:\n")
    
    try:
        # only works in Jupyter Notebook:
        from IPython.display import display
        display(dataset.head(10))
            
    except: # regular mode
        print(dataset.head(10))
    
    return dataset
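
The effect of the parameters `txt_csv_col_sep`, `decimal_separator` and `how_missing_values_are_registered` can be illustrated directly with the Pandas arguments they manipulate (`sep`, `decimal` and `na_values`). The block below is only a minimal sketch: the in-memory text `raw_text` and the name `sample_df` are hypothetical, created for illustration, and are not part of the notebook.

```python
import io
import pandas as pd

# Hypothetical in-memory file (assumption for illustration only):
# ';' separates the columns, ',' is the decimal mark, and '.' registers a missing value.
raw_text = "id;value\n1;3,5\n2;.\n3;7,25\n"

# sep, decimal and na_values are the read_csv arguments manipulated by
# txt_csv_col_sep, decimal_separator and how_missing_values_are_registered.
sample_df = pd.read_csv(io.StringIO(raw_text), sep = ";", decimal = ",", na_values = ".")

print(sample_df)
print(sample_df.dtypes)
```

With these settings, the column 'value' should be read as numeric, and the '.' entry should be converted to a missing value.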

# **Function for converting JSON object to dataframe**
- Objects may be:
    - String with JSON formatted text;
    - List with nested dictionaries (JSON formatted);
    - Each dictionary may contain nested dictionaries, or nested lists of dictionaries (nested JSON).

In [None]:
def json_obj_to_pandas_dataframe (json_obj_to_convert, json_obj_type = 'list', json_record_path = None, json_field_separator = "_", json_metadata_prefix_list = None):
    
    import json
    import pandas as pd
    from pandas import json_normalize
    
    # JSON object in terms of Python structure: list of dictionaries, where each value of a
    # dictionary may be a dictionary or a list of dictionaries (nested structures).
    # example of highly nested structure saved as a list 'json_formatted_list'. Note that the same
    # structure could be declared and stored into a string variable. For instance, if you have a txt
    # file containing JSON, you could read the txt and save its content as a string.
    # json_formatted_list = [{'field1': val1, 'field2': {'dict_val': dict_val}, 'field3': [{
    # 'nest1': nest_val1}, {'nest2': nestval2}]}, {'field1': val1, 'field2': {'dict_val': dict_val}, 
    # 'field3': [{'nest1': nest_val1}, {'nest2': nestval2}]}]    

    # json_obj_type = 'list', in case the object was saved as a list of dictionaries (JSON format)
    # json_obj_type = 'string', in case it was saved as a string (text) containing JSON.

    # json_obj_to_convert: object containing JSON, or string with JSON content to parse.
    # Objects may be: string with JSON formatted text;
    # list with nested dictionaries (JSON formatted);
    # dictionaries, possibly with nested dictionaries (JSON formatted).
    
    # https://docs.python.org/3/library/json.html
    # https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.json_normalize.html#pandas.json_normalize
    
    # json_record_path (string): manipulate parameter 'record_path' from json_normalize method.
    # Path in each object to list of records. If not passed, data will be assumed to 
    # be an array of records. If a given field from the JSON stores a nested JSON (or a nested
    # dictionary) declare it here to decompose the content of the nested data. e.g. if the field
    # 'books' stores a nested JSON, declare, json_record_path = 'books'
    
    # json_field_separator = "_" (string). Manipulates the parameter 'sep' from json_normalize method.
    # Nested records will generate names separated by sep. 
    # e.g., for json_field_separator = ".", {‘foo’: {‘bar’: 0}} -> foo.bar.
    # Then, if a given field 'main_field' stores a nested JSON with fields 'field1', 'field2', ...
    # the name of the columns of the dataframe will be formed by concatenating 'main_field', the
    # separator, and the names of the nested fields: 'main_field_field1', 'main_field_field2',...
    
    # json_metadata_prefix_list: list of strings (in quotes). Manipulates the parameter 
    # 'meta' from json_normalize method. Fields to use as metadata for each record in resulting 
    # table. Declare here the non-nested fields, i.e., the fields in the principal JSON. They
    # will be repeated in the rows of the dataframe to give the metadata (context) of the rows.
    
    # e.g. Suppose a JSON with the following structure: {'name': 'Mary', 'last': 'Shelley',
    # 'books': [{'title': 'Frankenstein', 'year': 1818}, {'title': 'Mathilda', 'year': 1819},{'title': 'The Last Man', 'year': 1826}]},
    # Here, there are nested JSONs in the field 'books'. The fields that are not nested
    # are 'name' and 'last'.
    # Then, json_record_path = 'books'
    # json_metadata_prefix_list = ['name', 'last']

    
    if (json_obj_type == 'string'):
        # Use the json.loads method to convert the string to json
        json_file = json.loads(json_obj_to_convert)
        # json.load(): parses JSON from a file object (an opened file).
        # json.loads(): parses a string with JSON content.
        # e.g. json.loads() must be used to read a string with JSON before converting it to a flat
        # structure like a dataframe.
        # check: https://www.pythonpip.com/python-tutorials/how-to-load-json-file-using-python/
    
    elif (json_obj_type == 'list'):
        
        # make the json_file the object itself:
        json_file = json_obj_to_convert
    
    else:
        print ("Enter a valid JSON object type: \'list\', in case the JSON object is a list of dictionaries in JSON format; or \'string\', if the JSON is stored as a text (string variable).")
        return "error"
    
    dataset = json_normalize(json_file, record_path = json_record_path, sep = json_field_separator, meta = json_metadata_prefix_list)
    
    print("JSON object converted to a flat dataframe object. Check the first 10 rows of this dataframe:\n")
    
    try:
        # only works in Jupyter Notebook:
        from IPython.display import display
        display(dataset.head(10))
            
    except: # regular mode
        print(dataset.head(10))
    
    return dataset
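
The roles of `json_record_path` and `json_metadata_prefix_list` may be easier to see on the example described in the comments. The sketch below calls `pd.json_normalize` directly on that structure; the variable names `author_record` and `books_df` are hypothetical, introduced only for this illustration.

```python
import pandas as pd

# Structure from the comments: 'books' stores the nested records;
# 'name' and 'last' are the non-nested (metadata) fields.
author_record = {
    'name': 'Mary',
    'last': 'Shelley',
    'books': [{'title': 'Frankenstein', 'year': 1818},
              {'title': 'Mathilda', 'year': 1819},
              {'title': 'The Last Man', 'year': 1826}]
}

# record_path selects the nested list to expand into rows;
# meta lists the fields repeated on every row as context.
books_df = pd.json_normalize(author_record, record_path = 'books', meta = ['name', 'last'], sep = "_")

print(books_df)
```

Each book becomes one row, and the metadata fields 'name' and 'last' are repeated on every row.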

# **Function for dataframe general characterization**

In [None]:
def df_general_characterization (df):
    
    import pandas as pd

    # Set a local copy of the dataframe:
    DATASET = df.copy(deep = True)

    # Show dataframe's header
    print("Dataframe\'s first 10 rows:\n")
    
    try:
        # only works in Jupyter Notebook:
        from IPython.display import display
        display(DATASET.head(10))
            
    except: # regular mode
        print(DATASET.head(10))

    # Show dataframe's tail:
    # Line break before next information:
    print("\n")
    print("Dataframe\'s last 10 rows:\n")
    try:
        display(DATASET.tail(10))
    except:
        print(DATASET.tail(10))
    
    # Show dataframe's shape:
    # Line break before next information:
    print("\n")
    df_shape  = DATASET.shape
    print("Dataframe\'s shape = (number of rows, number of columns) =\n")
    try:
        display(df_shape)
    except:
        print(df_shape)
    
    # Show dataframe's columns:
    # Line break before next information:
    print("\n")
    df_columns_array = DATASET.columns
    print("Dataframe\'s columns =\n")
    try:
        display(df_columns_array)
    except:
        print(df_columns_array)
    
    # Show dataframe's columns types:
    # Line break before next information:
    print("\n")
    df_dtypes = DATASET.dtypes
    # Now, the df_dtypes series has the original columns set as index, but this index has no name.
    # Let's rename it using the .rename method from Pandas Index object:
    # https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Index.rename.html#pandas.Index.rename
    # To access the Index object, we call the index attribute from Pandas dataframe.
    # By setting inplace = True, we modify the object inplace, by simply calling the method:
    df_dtypes.index.rename(name = 'dataframe_column', inplace = True)
    # Let's also modify the series label or name:
    # https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.rename.html
    df_dtypes.rename('dtype_series', inplace = True)
    print("Dataframe\'s variables types:\n")
    try:
        display(df_dtypes)
    except:
        print(df_dtypes)
    
    # Show dataframe's general statistics for numerical variables:
    # Line break before next information:
    print("\n")
    df_general_statistics = DATASET.describe()
    print("Dataframe\'s general (summary) statistics for numeric variables:\n")
    try:
        display(df_general_statistics)
    except:
        print(df_general_statistics)
    
    # Show total of missing values for each variable:
    # Line break before next information:
    print("\n")
    total_of_missing_values_series = DATASET.isna().sum()
    # This is a series which uses the original column names as index
    proportion_of_missing_values_series = DATASET.isna().mean()
    percent_of_missing_values_series = proportion_of_missing_values_series * 100
    missingness_dict = {'count_of_missing_values': total_of_missing_values_series,
                       'proportion_of_missing_values': proportion_of_missing_values_series,
                       'percent_of_missing_values': percent_of_missing_values_series}
    
    df_missing_values = pd.DataFrame(data = missingness_dict)
    # Now, the dataframe has the original columns set as index, but this index has no name.
    # Let's rename it using the .rename method from Pandas Index object:
    df_missing_values.index.rename(name = 'dataframe_column', inplace = True)
    
    # Create a one-row dataframe with the missingness for the whole dataframe.
    # Count the rows that contain at least one missing value:
    total_rows = len(DATASET)
    rows_with_missing = int(DATASET.isna().any(axis = 1).sum())
    # Avoid a division by zero for an empty dataframe:
    proportion_rows_with_missing = (rows_with_missing / total_rows) if (total_rows > 0) else 0.0
    # Pass the scalars as single-element lists or arrays:
    one_row_data = {'dataframe_column': ['missingness_across_rows'],
                    'count_of_missing_values': [rows_with_missing],
                    'proportion_of_missing_values': [proportion_rows_with_missing],
                    'percent_of_missing_values': [proportion_rows_with_missing * 100]
                    }
    one_row_df = pd.DataFrame(data = one_row_data)
    one_row_df.set_index('dataframe_column', inplace = True)
    
    # Append this one_row_df to df_missing_values:
    df_missing_values = pd.concat([df_missing_values, one_row_df])
    
    print("Missing values on each feature; and missingness considering all rows from the dataframe:")
    print("(note: \'missingness_across_rows\' was calculated by: checking which rows have at least one missing value (NA); and then comparing total rows with NAs with total rows in the dataframe).\n")
    
    try:
        display(df_missing_values)
    except:
        print(df_missing_values)
    
    return df_shape, df_columns_array, df_dtypes, df_general_statistics, df_missing_values
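
The missingness summary built above can be illustrated with a minimal sketch on a toy dataframe. The dataframe `toy_df` and its column names are hypothetical and were chosen only for illustration. The sketch also counts the rows with at least one missing value through `isna().any(axis = 1)`, an alternative to comparing lengths before and after `dropna`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataframe; the column names are illustrative only.
toy_df = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, 4.0],
    'b': ['x', 'y', None, 'w'],
    'c': [10, 20, 30, 40],
})

# Per-column count and proportion of missing values.
# Both results are series indexed by the original column names:
missing_count = toy_df.isna().sum()
missing_proportion = toy_df.isna().mean()
missing_percent = missing_proportion * 100

# Rows with at least one missing value, counted without dropping any row:
rows_with_na = int(toy_df.isna().any(axis = 1).sum())
proportion_rows_with_na = rows_with_na / len(toy_df)

summary = pd.DataFrame({
    'count_of_missing_values': missing_count,
    'proportion_of_missing_values': missing_proportion,
    'percent_of_missing_values': missing_percent,
})
print(summary)
print(f"Rows with at least one NA: {rows_with_na} of {len(toy_df)}")
```

Both ways of counting incomplete rows should agree: the difference between the total number of rows and the number of rows kept by `dropna(how = 'any')` is the number of rows flagged by `isna().any(axis = 1)`.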

# **Function for obtaining the correlation plot**
- The Pandas method dataset.corr() calculates Pearson's correlation coefficients R.
- Pearson's correlation coefficients R go from -1 to 1.
- These coefficients are R, not R².

#### To obtain the coefficients R², we raise the results to the 2nd power, i.e., we calculate (dataset.corr())**2
- R² goes from 0 to 1, where 1 represents the perfect correlation.
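
As a minimal sketch of the difference between R and R², the example below computes both for a toy dataframe. The dataframe `toy_df` and its column names are hypothetical. One column grows linearly with `x` and another decreases linearly with it, so R has opposite signs while R² is the same for both.

```python
import pandas as pd

# Hypothetical toy data: one response increases and the other decreases with x.
toy_df = pd.DataFrame({
    'x': [1, 2, 3, 4, 5],
    'y_up': [2, 4, 6, 8, 10],
    'y_down': [10, 8, 6, 4, 2],
})

# Pearson's correlation coefficients R (from -1 to 1):
r_matrix = toy_df.corr(method = 'pearson')

# R² (from 0 to 1), obtained by squaring each element of the matrix:
r2_matrix = r_matrix ** 2

print(r_matrix)
print(r2_matrix)
```

Squaring discards the sign: a strong negative correlation and a strong positive one give the same R², so the direction of the relation has to be read from R.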

In [None]:
def correlation_plot (df, show_masked_plot = True, responses_to_return_corr = None, set_returned_limit = None, export_png = False, directory_to_save = None, file_name = None, png_resolution_dpi = 330):
    
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    
    #show_masked_plot = True - keep as True if you want to see a cleaned version of the plot
    # where a mask is applied.
    
    #responses_to_return_corr - keep as None to return the full correlation matrix.
    # If you want to display the correlations for a particular group of features, input them
    # as a list, even if this list contains a single element. Examples:
    # responses_to_return_corr = ['response1'] for a single response
    # responses_to_return_corr = ['response1', 'response2', 'response3'] for multiple
    # responses. Notice that 'response1',... should be substituted by the name ('string')
    # of a column of the dataset that represents a response variable.
    # WARNING: The returned coefficients will be ordered according to the order of the list
    # of responses. i.e., they will be firstly ordered based on 'response1'
    
    # set_returned_limit = None - This variable will only present effects in case you have
    # provided a response feature to be returned. In this case, keep set_returned_limit = None
    # to return all of the correlation coefficients; or, alternatively, 
    # provide an integer number to limit the total of coefficients returned. 
    # e.g. if set_returned_limit = 10, only the ten highest coefficients will be returned. 
    
    # set a local copy of the dataset to perform the calculations:
    DATASET = df.copy(deep = True)
    
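    # Restrict the calculation to numeric columns: recent Pandas versions raise an
    # error when .corr() finds non-numeric columns.
    correlation_matrix = DATASET.select_dtypes(include = 'number').corr(method = 'pearson')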
    
    if (show_masked_plot == False):
        #Show standard plot
        
        plt.figure(figsize = (12, 8))
        sns.heatmap((correlation_matrix)**2, annot = True, fmt = ".2f")
        
        if (export_png == True):
            # Image will be exported
            import os

            #check if the user defined a directory path. If not, set as the default root path:
            if (directory_to_save is None):
                #set as the default
                directory_to_save = ""

            #check if the user defined a file name. If not, set as the default name for this
            # function.
            if (file_name is None):
                #set as the default
                file_name = "correlation_plot"

            #check if the user defined an image resolution. If not, set as the default 330 dpi
            # resolution.
            if (png_resolution_dpi is None):
                #set as 330 dpi
                png_resolution_dpi = 330

            #Get the new_file_path. Since savefig uses the file name verbatim when
            # the format argument is set, add the extension explicitly:
            new_file_path = os.path.join(directory_to_save, file_name + ".png")

            #Export the file to this new path:
            plt.savefig(new_file_path, dpi = png_resolution_dpi, format = 'png', transparent = False) 
            #The quality argument only applied to JPEG files and was removed from
            # recent Matplotlib versions, so it is not passed here.
            #format (str, supported formats) = 'png', 'pdf', 'ps', 'eps' or 'svg'
            #transparent = True or False
            # For other parameters of .savefig method, check https://indianaiproduction.com/matplotlib-savefig/
            print (f"Figure exported as \'{new_file_path}\'. Any previous file with this path was overwritten.")

        plt.show()

    #Since the pandas method .corr() calculates R, we raise it to the second power 
    # to obtain R². R² goes from zero to 1, where 1 represents the perfect correlation.
    
    else:
        
        # Show masked (cleaner) plot instead of the standard one
        # Set image size (width, height), in inches, for printing in the notebook's cell:
        plt.figure(figsize = (12, 8))
        # Mask for the upper triangle
        mask = np.zeros_like((correlation_matrix)**2)

        mask[np.triu_indices_from(mask)] = True

        # Generate a custom diverging colormap
        cmap = sns.diverging_palette(220, 10, as_cmap = True)

        # Heatmap with mask and correct aspect ratio
        sns.heatmap(((correlation_matrix)**2), mask = mask, cmap = cmap, center = 0,
                    linewidths = .5)
        
        if (export_png == True):
            # Image will be exported
            import os

            #check if the user defined a directory path. If not, set as the default root path:
            if (directory_to_save is None):
                #set as the default
                directory_to_save = ""

            #check if the user defined a file name. If not, set as the default name for this
            # function.
            if (file_name is None):
                #set as the default
                file_name = "correlation_plot"

            #check if the user defined an image resolution. If not, set as the default 330 dpi
            # resolution.
            if (png_resolution_dpi is None):
                #set as 330 dpi
                png_resolution_dpi = 330

            #Get the new_file_path. Since savefig uses the file name verbatim when
            # the format argument is set, add the extension explicitly:
            new_file_path = os.path.join(directory_to_save, file_name + ".png")

            #Export the file to this new path:
            plt.savefig(new_file_path, dpi = png_resolution_dpi, format = 'png', transparent = False) 
            #The quality argument only applied to JPEG files and was removed from
            # recent Matplotlib versions, so it is not passed here.
            #format (str, supported formats) = 'png', 'pdf', 'ps', 'eps' or 'svg'
            #transparent = True or False
            # For other parameters of .savefig method, check https://indianaiproduction.com/matplotlib-savefig/
            print (f"Figure exported as \'{new_file_path}\'. Any previous file with this path was overwritten.")

        plt.show()

        #Again, the method dataset.corr() calculates R within the variables of dataset.
        #To calculate R², we simply raise it to the second power: (dataset.corr()**2)
    
    #Sort the values of correlation_matrix in Descending order:
    
    if (responses_to_return_corr is not None):
        
        if (type(responses_to_return_corr) == str):
            # If a string was input, put it inside a list
            responses_to_return_corr = [responses_to_return_corr]
        
        #Select only the desired responses, by passing the list responses_to_return_corr
        # as parameter for column filtering:
        correlation_matrix = correlation_matrix[responses_to_return_corr]
        # By passing a list as argument, we assure that the output is a dataframe
        # and not a series, even if the list contains a single element.
        
        # Create a list of boolean variables == False, one False correspondent to
        # each one of the responses
        ascending_modes = [False] * len(responses_to_return_corr)
        
        #Now sort the values according to the responses, by passing the list
        # response
        correlation_matrix = correlation_matrix.sort_values(by = responses_to_return_corr, ascending = ascending_modes)
        
        # If a limit of coefficients was determined, apply it:
        if (set_returned_limit is not None):
            
            correlation_matrix = correlation_matrix.head(set_returned_limit)
            #Pandas .head(X) method returns the first X rows of the dataframe.
            # Here, it returns the defined limit of coefficients, set_returned_limit.
            # The default .head() is X = 5.
    
    print("ATTENTION: The correlation plots show the squared linear correlations R², which go from 0 (no correlation) to 1 (perfect correlation). The main diagonal always corresponds to R² = 1, since each variable is perfectly correlated with itself.\n")
    print("The returned correlation matrix, on the other hand, presents the linear coefficients of correlation R, not R². R values go from -1 (perfect negative correlation) to 1 (perfect positive correlation).\n")
    print("None of these coefficients takes non-linear relations or multiple linear correlation into account. For those cases, other metrics are needed, such as the adjusted R² of a multiple regression, which accounts for the number of predictors.\n")
    
    print("Correlation matrix - numeric results:\n")
    try:
        # only works in Jupyter Notebook:
        from IPython.display import display
        display(correlation_matrix)
            
    except: # regular mode
        print(correlation_matrix)
    
    return correlation_matrix
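
The filtering and sorting of the correlation matrix by response can be sketched on a toy dataframe as follows. The dataframe `toy_df`, its column names, and the variable names are hypothetical. The sketch mirrors the logic of `responses_to_return_corr` and `set_returned_limit` with plain Pandas calls, and it is not a replacement for the function above.

```python
import pandas as pd

# Hypothetical toy data with one response and three features.
toy_df = pd.DataFrame({
    'response': [1.0, 2.0, 3.0, 4.0, 5.0],
    'feat_pos': [1.1, 1.9, 3.2, 3.9, 5.1],   # close to the response
    'feat_neg': [5.0, 4.0, 3.0, 2.0, 1.0],   # perfectly negatively correlated
    'feat_weak': [2.0, 1.0, 2.0, 1.0, 2.0],  # no linear relation
})

corr = toy_df.corr(method = 'pearson')

# Passing a list keeps the result as a dataframe, even with a single response:
responses = ['response']
sorted_corr = corr[responses].sort_values(by = responses, ascending = False)

# Limit the number of returned coefficients (analogous to set_returned_limit = 2):
top_two = sorted_corr.head(2)

print(sorted_corr)
print(top_two)
```

Because the sort is on R in descending order, a strong negative correlation such as `feat_neg` lands at the bottom of the table. If the goal is to rank features by strength regardless of sign, one option (an assumption, not part of the function above) is to sort by the absolute value of R instead.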

# **Function for obtaining scatter plots and simple linear regressions**
- Here, only a single predictor variable is analyzed at a time.
- The plots will show Y x X, where X is the predictor (independent) variable.
- The linear regressions will be of the type Y = aX + b, i.e., a single pair (X, Y) analyzed.
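
A minimal sketch of a simple linear regression Y = aX + b with `scipy.stats.linregress` is shown below. The arrays are hypothetical toy data. The last part illustrates one possible way of turning timestamps into a numeric scale in seconds before fitting. It is vectorized with NumPy and is an assumption for illustration, not necessarily the procedure used in the function that follows.

```python
import numpy as np
from scipy import stats

# Hypothetical toy data lying exactly on the line y = 2x + 1:
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# Fit y = slope * x + intercept:
lin_reg = stats.linregress(x, y)

# Predicted values used to draw the straight line:
y_pred = lin_reg.intercept + lin_reg.slope * x

# rvalue is the correlation coefficient R; squaring it gives R²:
r_squared = lin_reg.rvalue ** 2

print("y = %.2f*x + %.2f" % (lin_reg.slope, lin_reg.intercept))
print("R² = %.4f" % r_squared)

# When X holds timestamps, the regression needs a numeric scale.
# Here the first timestamp is taken as zero and the distances are in seconds:
timestamps = np.array(['2024-01-01T00:00:00',
                       '2024-01-01T00:00:10',
                       '2024-01-01T00:00:30'], dtype = 'datetime64[ns]')
seconds = (timestamps - timestamps[0]) / np.timedelta64(1, 's')
print(seconds)
```

The numeric scale preserves the spacing between timestamps, so the fitted slope is expressed per second elapsed since the first observation.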

In [None]:
def scatter_plot_lin_reg (data_in_same_column = False, df = None, column_with_predict_var_x = None, column_with_response_var_y = None, column_with_labels = None, list_of_dictionaries_with_series_to_analyze = [{'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}], x_axis_rotation = 70, y_axis_rotation = 0, show_linear_reg = True, grid = True, add_splines_lines = False, horizontal_axis_title = None, vertical_axis_title = None, plot_title = None, export_png = False, directory_to_save = None, file_name = None, png_resolution_dpi = 330): 
    
    import random
    # Python Random documentation:
    # https://docs.python.org/3/library/random.html?msclkid=9d0c34b2d13111ec9cfa8ddaee9f61a1
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import matplotlib.colors as mcolors
    from scipy import stats
    
    # matplotlib.colors documentation:
    # https://matplotlib.org/3.5.0/api/colors_api.html?msclkid=94286fa9d12f11ec94660321f39bf47f
    
    # Matplotlib list of colors:
    # https://matplotlib.org/stable/gallery/color/named_colors.html?msclkid=0bb86abbd12e11ecbeb0a2439e5b0d23
    # Matplotlib colors tutorial:
    # https://matplotlib.org/stable/tutorials/colors/colors.html
    # Matplotlib example of Python code using matplotlib.colors:
    # https://matplotlib.org/stable/_downloads/0843ee646a32fc214e9f09328c0cd008/colors.py
    # Same example as Jupyter Notebook:
    # https://matplotlib.org/stable/_downloads/2a7b13c059456984288f5b84b4b73f45/colors.ipynb
    
        
    # data_in_same_column = False: set as True if all the values to plot are in a same column.
    # If data_in_same_column = True, you must specify the dataframe containing the data as df;
    # the column containing the predict variable (X) as column_with_predict_var_x; the column 
    # containing the responses to plot (Y) as column_with_response_var_y; and the column 
    # containing the labels (subgroup) indication as column_with_labels. 
    # df is an object, so do not declare it in quotes. The other three arguments (columns' names) 
    # are strings, so declare in quotes. 
    
    # Example: suppose you have a dataframe saved as dataset, and two groups A and B to compare. 
    # All the results for both groups are in a column named 'results', which will be plotted against
    # the time, saved as 'time' (X = 'time'; Y = 'results'). If the result is for
    # an entry from group A, then a column named 'group' has the value 'A'. If it is for group B,
    # column 'group' shows the value 'B'. In this example:
    # data_in_same_column = True,
    # df = dataset,
    # column_with_predict_var_x = 'time',
    # column_with_response_var_y = 'results', 
    # column_with_labels = 'group'
    # If you want to declare a list of dictionaries, keep data_in_same_column = False and keep
    # df = None (the other arguments may be set as None, but it is not mandatory: 
    # column_with_predict_var_x = None, column_with_response_var_y = None, column_with_labels = None).
    

    # Parameter to input when DATA_IN_SAME_COLUMN = False:
    # list_of_dictionaries_with_series_to_analyze: if data is already converted to series, lists
    # or arrays, provide them as a list of dictionaries. It must be declared as a list, in brackets,
    # even if there is a single dictionary.
    # Use always the same keys: 'x' for the X-series (predict variables); 'y' for the Y-series
    # (response variables); and 'lab' for the labels. If you do not want to declare a series, simply
    # keep as None, but do not remove or rename a key (ALWAYS USE THE KEYS SHOWN AS MODEL).
    # If you want, you can remove elements (dictionaries) from the list to declare fewer elements;
    # and you can also add more elements (dictionaries) to the lists, if you need to plot more series.
    # Simply put a comma after the last element from the list and declare a new dictionary, keeping the
    # same keys: {'x': x_series, 'y': y_series, 'lab': label}, where x_series, y_series and label
    # represents the series and label of the added dictionary (you can pass 'lab': None, but if 
    # 'x' or 'y' are None, the new dictionary will be ignored).
    
    # Examples:
    # list_of_dictionaries_with_series_to_analyze = 
    # [{'x': DATASET['X'], 'y': DATASET['Y'], 'lab': 'label'}]
    # will plot a single variable. In turn:
    # list_of_dictionaries_with_series_to_analyze = 
    # [{'x': DATASET['X'], 'y': DATASET['Y1'], 'lab': 'label'}, {'x': DATASET['X'], 'y': DATASET['Y2'], 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}]
    # will plot two series, Y1 x X and Y2 x X.
    # Notice that all dictionaries where 'x' or 'y' are None are automatically ignored.
    # If None is provided to 'lab', an automatic label will be generated.
    
    # List the possible numeric data types for a Pandas dataframe column:
    numeric_dtypes = [np.int16, np.int32, np.int64, np.float16, np.float32, np.float64]
    
    if (data_in_same_column == True):
        
        print("Data to be plotted in a same column.\n")
        
        if (df is None):
            
            print("Please, input a valid dataframe as df.\n")
            list_of_dictionaries_with_series_to_analyze = []
            # The code will check the size of this list on the next block.
            # If it is zero, code is simply interrupted.
            # Instead of returning an error, we use this code structure that can be applied
            # on other graphic functions that do not return a summary (and so we should not
            # return a value like 'error' to interrupt the function).
        
        elif (column_with_predict_var_x is None):
            
            print("Please, input a valid column name as column_with_predict_var_x.\n")
            list_of_dictionaries_with_series_to_analyze = []
           
        elif (column_with_response_var_y is None):
            
            print("Please, input a valid column name as column_with_response_var_y.\n")
            list_of_dictionaries_with_series_to_analyze = []
        
        else:
            
            # set a local copy of the dataframe:
            DATASET = df.copy(deep = True)
            
            if (column_with_labels is None):
            
                print("Using the whole series (column) for correlation.\n")
                column_with_labels = 'whole_series_' + column_with_response_var_y
                DATASET[column_with_labels] = column_with_labels
            
            # sort DATASET by column_with_predict_var_x, then by column_with_labels,
            # and then by column_with_response_var_y, all in Ascending order
            # Since we sort by label (group), it is easier to separate the groups.
            DATASET = DATASET.sort_values(by = [column_with_predict_var_x, column_with_labels, column_with_response_var_y], ascending = [True, True, True])
            
            # Reset indices:
            DATASET = DATASET.reset_index(drop = True)
            
            # If column_with_predict_var_x is an object, the user may be trying to pass a date as x. 
            # So, let's try to convert it to datetime:
    
            if ((DATASET[column_with_predict_var_x]).dtype not in numeric_dtypes):
                  
                try:
                    DATASET[column_with_predict_var_x] = (DATASET[column_with_predict_var_x]).astype('datetime64[ns]')
                    print("Variable X successfully converted to datetime64[ns].\n")
                    
                except:
                    # Simply ignore it
                    pass
            
            # Get an array of unique values of the labels, and convert it to a list using the
            # list() constructor:
            unique_labels = list(DATASET[column_with_labels].unique())
            print(f"{len(unique_labels)} different labels detected: {unique_labels}.\n")
            
            # Start a list to store the dictionaries containing the keys:
            # 'x': list of predict variables; 'y': list of responses; 'lab': the label (group)
            list_of_dictionaries_with_series_to_analyze = []
            
            # Loop through each possible label:
            for lab in unique_labels:
                # loop through each element from the list unique_labels, referred as lab
                
                # Set a filter for the dataset, to select only rows correspondent to that
                # label:
                boolean_filter = (DATASET[column_with_labels] == lab)
                
                # Create a copy of the dataset, with entries selected by that filter:
                ds_copy = (DATASET[boolean_filter]).copy(deep = True)
                # Sort again by X and Y, to guarantee the results are in order:
                ds_copy = ds_copy.sort_values(by = [column_with_predict_var_x, column_with_response_var_y], ascending = [True, True])
                # Restart the index of the copy:
                ds_copy = ds_copy.reset_index(drop = True)
                
                # Re-extract the X and Y series and convert them to NumPy arrays 
                # (these arrays will be important later in the function):
                x = np.array(ds_copy[column_with_predict_var_x])
                y = np.array(ds_copy[column_with_response_var_y])
            
                # Then, create the dictionary:
                dict_of_values = {'x': x, 'y': y, 'lab': lab}
                
                # Now, append dict_of_values to list_of_dictionaries_with_series_to_analyze:
                list_of_dictionaries_with_series_to_analyze.append(dict_of_values)
                
            # Now, we have a list of dictionaries with the same format of the input list.
            
    else:
        
        # The user input a list_of_dictionaries_with_series_to_analyze
        # Create a support list:
        support_list = []
        
        # Loop through each element on the list list_of_dictionaries_with_series_to_analyze:
        
        for i in range (0, len(list_of_dictionaries_with_series_to_analyze)):
            # from i = 0 to i = len(list_of_dictionaries_with_series_to_analyze) - 1, index of the
            # last element from the list
            
            # pick the i-th dictionary from the list:
            dictionary = list_of_dictionaries_with_series_to_analyze[i]
            
            # access 'x', 'y', and 'lab' keys from the dictionary:
            x = dictionary['x']
            y = dictionary['y']
            lab = dictionary['lab']
            # Remember that x and y are expected to be Pandas series or NumPy arrays, so we can apply
            # the astype method:
            # https://www.askpython.com/python/built-in-methods/python-astype?msclkid=8f3de8afd0d411ec86a9c1a1e290f37c
            
            # check if at least x and y are not None:
            if ((x is not None) and (y is not None)):
                
                # If x is not numeric, the user may be trying to pass a date as x. 
                # So, let's try to convert it to datetime:
                if (x.dtype not in numeric_dtypes):

                    try:
                        x = (x).astype('datetime64[ns]')
                        print(f"Variable X from {i}-th dictionary successfully converted to datetime64[ns].\n")

                    except:
                        # Simply ignore it
                        pass
                
                # Possibly, x and y are not ordered. Firstly, let's merge them into a temporary
                # dataframe to be able to order them together.
                # Use the list() constructor to guarantee that x and y are read as lists. These lists
                # are the values of the dictionary passed as argument to the constructor of the
                # temporary dataframe. Converting to lists makes the data independent
                # from its origin, even if it was created from a Pandas dataframe. Then, we have a
                # completely independent dataframe that may be manipulated and sorted, without worrying
                # that it may modify its origin:
                
                temp_df = pd.DataFrame(data = {'x': list(x), 'y': list(y)})
                # sort this dataframe by 'x' and 'y':
                temp_df = temp_df.sort_values(by = ['x', 'y'], ascending = [True, True])
                # restart index:
                temp_df = temp_df.reset_index(drop = True)
                
                # Re-extract the X and Y series and convert them to NumPy arrays 
                # (these arrays will be important later in the function):
                x = np.array(temp_df['x'])
                y = np.array(temp_df['y'])
                
                # check if lab is None:
                if (lab is None):
                    # input a default label.
                    # Use the str() function to convert the integer to string, allowing it
                    # to be concatenated
                    lab = "X" + str(i) + "_x_" + "Y" + str(i)
                    
                # Then, create the dictionary:
                dict_of_values = {'x': x, 'y': y, 'lab': lab}
                
                # Now, append dict_of_values to support list:
                support_list.append(dict_of_values)
            
        # Now, support_list contains only the dictionaries with valid entries, as well
        # as labels for each collection of data. The values are independent from their origin,
        # and now they are ordered and in the same format of the data extracted directly from
        # the dataframe.
        # So, make the list_of_dictionaries_with_series_to_analyze the support_list itself:
        list_of_dictionaries_with_series_to_analyze = support_list
        print(f"{len(list_of_dictionaries_with_series_to_analyze)} valid series input.\n")

        
    # Now that both methods of input resulted in the same format of list, we can process both
    # with the same code.
    
    # Each dictionary in list_of_dictionaries_with_series_to_analyze represents a series to
    # plot. So, the total of series to plot is:
    total_of_series = len(list_of_dictionaries_with_series_to_analyze)
    
    if (total_of_series <= 0):
        
        print("No valid series to plot. Please, provide valid arguments.\n")
        return "error" 
        # we return the value because this function always returns an object.
        # In other functions, this return would be omitted.
    
    else:
        
        # Continue to plotting and calculating the fitting.
        # Notice that we sorted all the lists after they were separated and before
        # adding them to dictionaries. Also, the timestamps were converted to datetime64 variables
        
        # Now we pre-processed the data, we can obtain a final list of dictionaries, containing
        # the linear regression information (it will be plotted only if the user asked to). Start
        # a list to store all predictions:
        list_of_dictionaries_with_series_and_predictions = []
        
        # Loop through each dictionary (element) on the list list_of_dictionaries_with_series_to_analyze:
        for dictionary in list_of_dictionaries_with_series_to_analyze:
            
            x_is_datetime = False
            # boolean that will map if x is a datetime or not. Only change to True when it is.
            
            # Access keys 'x' and 'y' to retrieve the arrays.
            x = dictionary['x']
            y = dictionary['y']
            
            # Check if the elements from array x are np.datetime64 objects. Pick the first
            # element to check:
            
            if (type(x[0]) == np.datetime64):
                
                x_is_datetime = True
                
            if (x_is_datetime):
                # In this case, performing the linear regression directly in X will
                # return an error. We must associate a sequential number to each time.
                # to keep the distance between these integers the same as in the original sequence
                # let's define a difference of 1 ns as 1. The 1st timestamp will be zero, and the
                # addition of 1 ns will be an addition of 1 unit. So a timestamp recorded 10 ns
                # after the time zero will have value 10. At the end, we divide every element by
                # 10**9, to obtain the correspondent distance in seconds.
                
                # start a list for the associated integer timescale. Put the number zero,
                # associated to the first timestamp:
                int_timescale = [0]
                
                # loop through each element of the array x, starting from index 1:
                for i in range(1, len(x)):
                    
                    # calculate the timedelta between x[i] and x[i-1]:
                    # The value attribute from the Timedelta class returns the timedelta as an
                    # integer number of nanoseconds (the former .delta attribute was removed in Pandas 2.0):
                    timedelta = pd.Timedelta(x[i] - x[(i-1)]).value
                    
                    # Sum this timedelta (integer number of nanoseconds) to the
                    # previous element from int_timescale, and append the result to the list:
                    int_timescale.append((timedelta + int_timescale[(i-1)]))
                
                # Now convert the new scale (that preserves the distance between timestamps)
                # to NumPy array:
                int_timescale = np.array(int_timescale)
                
                # Divide by 10**9 to obtain the distances in seconds, reducing the order of
                # magnitude of the integer numbers (the division is allowed for arrays)
                int_timescale = int_timescale / (10**9)
                
                # Finally, use this timescale to obtain the linear regression:
                lin_reg = stats.linregress(int_timescale, y = y)
            
            else:
                # Obtain the linear regression object directly from x. Since x is not a
                # datetime object, we can calculate the regression directly on it:
                lin_reg = stats.linregress(x, y = y)
                
            # Retrieve the equation as a string.
            # Access the attributes intercept and slope from the lin_reg object:
            lin_reg_equation = "y = %.2f*x + %.2f" %((lin_reg).slope, (lin_reg).intercept)
            # .2f: float with only two decimals
                
            # Retrieve R² (coefficient of determination) also as a string
            r2_lin_reg = "R²_lin_reg = %.4f" %(((lin_reg).rvalue) ** 2)
            # .4f: 4 decimals. ((lin_reg).rvalue) is the correlation coefficient R. We
            # raise it to the second power by doing **2, where ** is the exponentiation operator.
                
            # Add these two strings to the dictionary
            dictionary['lin_reg_equation'] = lin_reg_equation
            dictionary['r2_lin_reg'] = r2_lin_reg
                
            # Now, as final step, let's apply the values x to the linear regression
            # equation to obtain the predicted series used to plot the straight line.
                
            # Python lists cannot perform vector operations like element-wise sum or product, 
            # but NumPy arrays can. For example, [1, 2] + 1 would be interpreted as an attempt
            # to concatenate a list and an integer, resulting in error. But np.array([1, 2]) + 1
            # is allowed, resulting in np.array([2, 3]).
            # This and the fact that Scipy and Matplotlib are built on NumPy were the reasons
            # why we converted every list to NumPy arrays.
            
            # Save the predicted values as the array y_pred_lin_reg.
            # Access the attributes intercept and slope from the lin_reg object.
            # The equation is y = (slope * x) + intercept
            
            # Notice that again we cannot apply the equation directly to a timestamp.
            # So once again we will apply the integer scale to obtain the predictions
            # if we are dealing with datetime objects:
            if (x_is_datetime):
                y_pred_lin_reg = ((lin_reg).intercept) + ((lin_reg).slope) * (int_timescale)
            
            else:
                # x is not a timestamp, so we can directly apply it to the regression
                # equation:
                y_pred_lin_reg = ((lin_reg).intercept) + ((lin_reg).slope) * (x)
            
            # Add this array to the dictionary with the key 'y_pred_lin_reg':
            dictionary['y_pred_lin_reg'] = y_pred_lin_reg
            
            if (x_is_datetime):
            
                print("For performing the linear regression, a sequence of floats proportional to the timestamps was created. The returned object contains a dictionary with the timestamps and the correspondent floats, which keep the distance proportion between successive timestamps. The sequence was created by calculating the timedeltas as an integer number of nanoseconds, which were converted to seconds. The first timestamp was considered time = 0.")
                print("Notice that the regression equation is based on the use of this sequence of floats as X.\n")
                
                dictionary['warning'] = "x is a numeric scale that was obtained from datetimes, preserving the distance relationships. It was obtained for allowing the linear regression."
                dictionary['numeric_to_datetime_correlation'] = {
                    
                    'x = 0': x[0],
                    f'x = {max(int_timescale)}': x[(len(x) - 1)]
                    
                }
                
                dictionary['sequence_of_floats_correspondent_to_timestamps'] = {
                                                                                'original_timestamps': x,
                                                                                'sequence_of_floats': int_timescale
                                                                                }
                
            # Finally, append this dictionary to list_of_dictionaries_with_series_and_predictions:
            list_of_dictionaries_with_series_and_predictions.append(dictionary)
        
        print("Returning a list of dictionaries. Each one contains the arrays of valid series and labels, and the equations, R² and values predicted by the linear regressions.\n")
        
        # Now that the loop is finished, list_of_dictionaries_with_series_and_predictions 
        # contains all series converted to NumPy arrays, with timestamps parsed as datetimes, 
        # and all the information regarding the linear regression, including the predicted 
        # values for plotting.
        # This list will be the object returned at the end of the function. Since it is a
        # JSON-formatted list, we can use the function json_obj_to_pandas_dataframe to convert
        # it to a Pandas dataframe.
        
        
        # Now, we can plot the figure.
        # We set alpha = 0.95 (opacity) to give a degree of transparency (5%), 
        # so that one series does not completely block the visualization of the other.
        
        # Let's retrieve the list of Matplotlib CSS colors:
        css4 = mcolors.CSS4_COLORS
        # css4 is a dictionary of colors: {'aliceblue': '#F0F8FF', 'antiquewhite': '#FAEBD7', ...}
        # Each key of this dictionary is a color name that can be passed as the argument color of
        # the plot function. So let's retrieve the keys and use the list constructor to convert
        # them to a list of colors:
        list_of_colors = list(css4.keys())
        
        # On 11 May 2022, this list of colors had 148 different elements.
        # Since this list is in alphabetic order, let's create a random order for the colors.
        
        # Function random.sample(input_sequence, k = number_of_samples): 
        # returns a list with k elements randomly selected from input_sequence WITHOUT
        # replacement, so no element is picked twice. k must be an integer.
        
        # Function random.choices(input_sequence, k = number_of_samples):
        # also randomly selects k elements from input_sequence, but WITH replacement, so the
        # same color could be repeated while others would be left out.
        # Since we simply want to randomly reorder the sequence, we use random.sample with
        # k = len(input_sequence), which returns a random permutation of the list:
        list_of_colors = random.sample(list_of_colors, k = len(list_of_colors))
        # Now, we have a randomly ordered list of colors to use for plotting the charts
        
        if (add_splines_lines == True):
            LINE_STYLE = '-'

        else:
            LINE_STYLE = ''
        
        # Matplotlib linestyle:
        # https://matplotlib.org/stable/gallery/lines_bars_and_markers/linestyles.html?msclkid=68737f24d16011eca9e9c4b41313f1ad
        
        if (plot_title is None):
            # Set graphic title
            plot_title = f"Y_x_X"

        if (horizontal_axis_title is None):
            # Set horizontal axis title
            horizontal_axis_title = "X"

        if (vertical_axis_title is None):
            # Set vertical axis title
            vertical_axis_title = "Y"
        
        # Let's put a small degree of transparency (1 - OPACITY) = 0.05 = 5%
        # so that the plotted series do not completely block other views.
        OPACITY = 0.95
        
        # Set image size (width, height) in inches for printing in the notebook's cell:
        fig = plt.figure(figsize = (12, 8))
        ax = fig.add_subplot()

        i = 0 # Restart counting for the loop of colors
        
        # Loop through each dictionary from list_of_dictionaries_with_series_and_predictions:
        for dictionary in list_of_dictionaries_with_series_and_predictions:
            
            # Try selecting a color from list_of_colors:
            try:
                
                COLOR = list_of_colors[i]
                # Go to the next element i, so that the next plot will use a different color:
                i = i + 1
            
            except IndexError:
                
                # This error will be raised if list index is out of range, 
                # i.e. if i >= len(list_of_colors) - we used all colors from the list (at least 148).
                # So, return the index to zero to restart the colors from the beginning:
                i = 0
                COLOR = list_of_colors[i]
                i = i + 1
            
            # Access the arrays and label from the dictionary:
            X = dictionary['x']
            Y = dictionary['y']
            LABEL = dictionary['lab']
            
            # Scatter plot:
            ax.plot(X, Y, linestyle = LINE_STYLE, marker = "o", color = COLOR, alpha = OPACITY, label = LABEL)
            # Axes.plot documentation:
            # https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.plot.html?msclkid=42bc92c1d13511eca8634a2c93ab89b5
            
            # x and y are positional arguments: they are specified by their position in function
            # call, not by an argument name like 'marker'.
            
            # Matplotlib markers:
            # https://matplotlib.org/stable/api/markers_api.html?msclkid=36c5eec5d16011ec9583a5777dc39d1f
            
            if (show_linear_reg == True):
                
                # Plot the linear regression using the same color.
                # Access the array of fitted Y's in the dictionary:
                Y_PRED = dictionary['y_pred_lin_reg']
                Y_PRED_LABEL = 'lin_reg_' + str(LABEL) # for the case where label is numeric
                
                ax.plot(X, Y_PRED,  linestyle = '-', marker = '', color = COLOR, alpha = OPACITY, label = Y_PRED_LABEL)

        # Now we finished plotting all of the series, we can set the general configuration:
        
        # Rotate the X axis tick labels by x_axis_rotation degrees:
        plt.xticks(rotation = x_axis_rotation)
        # x_axis_rotation = 0 keeps the X labels horizontal.
        # Rotate the Y axis tick labels by y_axis_rotation degrees:
        plt.yticks(rotation = y_axis_rotation)
        # y_axis_rotation = 0 keeps the Y labels horizontal.
        
        ax.set_title(plot_title)
        ax.set_xlabel(horizontal_axis_title)
        ax.set_ylabel(vertical_axis_title)

        ax.grid(grid) # show grid or not
        ax.legend(loc = 'upper left')
        # position options: 'upper right'; 'upper left'; 'lower left'; 'lower right';
        # 'right', 'center left'; 'center right'; 'lower center'; 'upper center', 'center'
        # https://www.statology.org/matplotlib-legend-position/

        if (export_png == True):
            # Image will be exported
            import os

            #check if the user defined a directory path. If not, set as the default root path:
            if (directory_to_save is None):
                #set as the default
                directory_to_save = ""

            #check if the user defined a file name. If not, set as the default name for this
            # function.
            if (file_name is None):
                #set as the default
                file_name = "scatter_plot_lin_reg"

            #check if the user defined an image resolution. If not, set as the default 330 dpi
            # resolution.
            if (png_resolution_dpi is None):
                #set as 330 dpi
                png_resolution_dpi = 330

            #Get the new_file_path
            # Since the format is passed explicitly to savefig, the file name is used verbatim.
            # So, add the .png extension if the user did not include it:
            if not file_name.lower().endswith(".png"):
                file_name = file_name + ".png"
            new_file_path = os.path.join(directory_to_save, file_name)

            #Export the file to this new path:
            plt.savefig(new_file_path, dpi = png_resolution_dpi, format = 'png', transparent = False) 
            # Do not pass the quality argument: it applied only to JPEG files and is not
            # accepted by recent Matplotlib versions.
            #format (str, supported formats) = 'png', 'pdf', 'ps', 'eps' or 'svg'
            #transparent = True or False
            # For other parameters of .savefig method, check https://indianaiproduction.com/matplotlib-savefig/
            print (f"Figure exported as \'{new_file_path}\'. Any previous file with this name in this path was overwritten.")

        #fig.tight_layout()

        ## Show an image read from an image file:
        ## import matplotlib.image as pltimg
        ## img=pltimg.imread('mydecisiontree.png')
        ## imgplot = plt.imshow(img)
        ## See linkedIn Learning course: "Supervised machine learning and the technology boom",
        ##  Ex_Files_Supervised_Learning, Exercise Files, lesson '03. Decision Trees', '03_05', 
        ##  '03_05_END.ipynb'
        plt.show()
        
        if (show_linear_reg == True):
            
            try:
                # only works in Jupyter Notebook:
                from IPython.display import display
            except ImportError:
                pass
            
            print("\nLinear regression summaries (equations and R²):\n")
            
            for dictionary in list_of_dictionaries_with_series_and_predictions:
                
                print(f"Linear regression summary for {dictionary['lab']}:\n")
                
                try:
                    display(dictionary['lin_reg_equation'])
                    display(dictionary['r2_lin_reg'])

                except: # regular mode                  
                    print(dictionary['lin_reg_equation'])
                    print(dictionary['r2_lin_reg'])
                
                print("\n")
         
        
        return list_of_dictionaries_with_series_and_predictions
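
The numeric time scale described in the function above can be illustrated in isolation. The following is a minimal sketch, not the function's own implementation: it assumes a small hypothetical dataset, and it uses `np.polyfit` as a stand-in for the regression routine that produced the `lin_reg` object. It converts timestamps to seconds elapsed since the first one and then applies y = (slope * x) + intercept on that scale.

```python
import numpy as np

# Hypothetical data: four unevenly spaced timestamps and a response variable.
timestamps = np.array(
    ["2022-05-11T00:00:00", "2022-05-11T00:00:10",
     "2022-05-11T00:00:30", "2022-05-11T00:01:00"],
    dtype="datetime64[ns]",
)
y = np.array([1.0, 2.0, 4.0, 7.0])

# Timedeltas relative to the first timestamp, as integer nanoseconds,
# then converted to seconds. The first timestamp maps to 0.0.
elapsed_ns = (timestamps - timestamps[0]).astype("int64")
elapsed_s = elapsed_ns / 1e9

# Degree-1 least-squares fit. np.polyfit returns the coefficients from the
# highest degree to the lowest, i.e. [slope, intercept].
slope, intercept = np.polyfit(elapsed_s, y, deg=1)

# Predicted values on the numeric scale: y = (slope * x) + intercept
y_pred = slope * elapsed_s + intercept
```

Because the scale preserves the distances between timestamps, the fitted slope is expressed in units of Y per second.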

# **Function for time series visualization**

In [None]:
def time_series_vis (data_in_same_column = False, df = None, column_with_predict_var_x = None, column_with_response_var_y = None, column_with_labels = None, list_of_dictionaries_with_series_to_analyze = [{'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}], x_axis_rotation = 70, y_axis_rotation = 0, grid = True, add_splines_lines = True, add_scatter_dots = False, horizontal_axis_title = None, vertical_axis_title = None, plot_title = None, export_png = False, directory_to_save = None, file_name = None, png_resolution_dpi = 330):
     
    import random
    # Python Random documentation:
    # https://docs.python.org/3/library/random.html?msclkid=9d0c34b2d13111ec9cfa8ddaee9f61a1
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import matplotlib.colors as mcolors
    
    # matplotlib.colors documentation:
    # https://matplotlib.org/3.5.0/api/colors_api.html?msclkid=94286fa9d12f11ec94660321f39bf47f
    
    # Matplotlib list of colors:
    # https://matplotlib.org/stable/gallery/color/named_colors.html?msclkid=0bb86abbd12e11ecbeb0a2439e5b0d23
    # Matplotlib colors tutorial:
    # https://matplotlib.org/stable/tutorials/colors/colors.html
    # Matplotlib example of Python code using matplotlib.colors:
    # https://matplotlib.org/stable/_downloads/0843ee646a32fc214e9f09328c0cd008/colors.py
    # Same example as Jupyter Notebook:
    # https://matplotlib.org/stable/_downloads/2a7b13c059456984288f5b84b4b73f45/colors.ipynb
    
        
    # data_in_same_column = False: set as True if all the values to plot are in a same column.
    # If data_in_same_column = True, you must specify the dataframe containing the data as df;
    # the column containing the predict variable (X) as column_with_predict_var_x; the column 
    # containing the responses to plot (Y) as column_with_response_var_y; and the column 
    # containing the labels (subgroup) indication as column_with_labels. 
    # df is an object, so do not declare it in quotes. The other three arguments (columns' names) 
    # are strings, so declare in quotes. 
    
    # Example: suppose you have a dataframe saved as dataset, and two groups A and B to compare. 
    # All the results for both groups are in a column named 'results', which will be plotted against
    # the time, saved as 'time' (X = 'time'; Y = 'results'). If the result is for
    # an entry from group A, then a column named 'group' has the value 'A'. If it is for group B,
    # column 'group' shows the value 'B'. In this example:
    # data_in_same_column = True,
    # df = dataset,
    # column_with_predict_var_x = 'time',
    # column_with_response_var_y = 'results', 
    # column_with_labels = 'group'
    # If you want to declare a list of dictionaries, keep data_in_same_column = False and keep
    # df = None (the other arguments may be set as None, but it is not mandatory: 
    # column_with_predict_var_x = None, column_with_response_var_y = None, column_with_labels = None).
    

    # Parameter to input when DATA_IN_SAME_COLUMN = False:
    # list_of_dictionaries_with_series_to_analyze: if data is already converted to series, lists
    # or arrays, provide them as a list of dictionaries. It must be declared as a list, in brackets,
    # even if there is a single dictionary.
    # Always use the same keys: 'x' for the X-series (predict variables); 'y' for the Y-series
    # (response variables); and 'lab' for the labels. If you do not want to declare a series, simply
    # keep it as None, but do not remove or rename a key (ALWAYS USE THE KEYS SHOWN AS MODEL).
    # If you want, you can remove elements (dictionaries) from the list to declare fewer elements;
    # and you can also add more elements (dictionaries) to the list, if you need to plot more series.
    # Simply put a comma after the last element from the list and declare a new dictionary, keeping the
    # same keys: {'x': x_series, 'y': y_series, 'lab': label}, where x_series, y_series and label
    # represent the series and label of the added dictionary (you can pass 'lab': None, but if 
    # 'x' or 'y' are None, the new dictionary will be ignored).
    
    # Examples:
    # list_of_dictionaries_with_series_to_analyze = 
    # [{'x': DATASET['X'], 'y': DATASET['Y'], 'lab': 'label'}]
    # will plot a single variable. In turn:
    # list_of_dictionaries_with_series_to_analyze = 
    # [{'x': DATASET['X'], 'y': DATASET['Y1'], 'lab': 'label'}, {'x': DATASET['X'], 'y': DATASET['Y2'], 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}]
    # will plot two series, Y1 x X and Y2 x X.
    # Notice that all dictionaries where 'x' or 'y' are None are automatically ignored.
    # If None is provided to 'lab', an automatic label will be generated.
    
    
    # List the possible numeric data types for a Pandas dataframe column:
    numeric_dtypes = [np.int16, np.int32, np.int64, np.float16, np.float32, np.float64]
    
    if (data_in_same_column == True):
        
        print("Data to be plotted in a same column.\n")
        
        if (df is None):
            
            print("Please, input a valid dataframe as df.\n")
            list_of_dictionaries_with_series_to_analyze = []
            # The code will check the size of this list on the next block.
            # If it is zero, code is simply interrupted.
            # Instead of returning an error, we use this code structure that can be applied
            # on other graphic functions that do not return a summary (and so we should not
            # return a value like 'error' to interrupt the function).
        
        elif (column_with_predict_var_x is None):
            
            print("Please, input a valid column name as column_with_predict_var_x.\n")
            list_of_dictionaries_with_series_to_analyze = []
           
        elif (column_with_response_var_y is None):
            
            print("Please, input a valid column name as column_with_response_var_y.\n")
            list_of_dictionaries_with_series_to_analyze = []
        
        else:
            
            # set a local copy of the dataframe:
            DATASET = df.copy(deep = True)
            
            if (column_with_labels is None):
            
                print("Using the whole series (column) for correlation.\n")
                column_with_labels = 'whole_series_' + column_with_response_var_y
                DATASET[column_with_labels] = column_with_labels
            
            # Sort DATASET by column_with_predict_var_x, then by column_with_labels,
            # and then by column_with_response_var_y, all in ascending order.
            # Since we sort by label (group), it is easier to separate the groups.
            DATASET = DATASET.sort_values(by = [column_with_predict_var_x, column_with_labels, column_with_response_var_y], ascending = [True, True, True])
            
            # Reset indices:
            DATASET = DATASET.reset_index(drop = True)
            
            # If column_with_predict_var_x is an object, the user may be trying to pass a date as x. 
            # So, let's try to convert it to datetime:
            if ((DATASET[column_with_predict_var_x]).dtype not in numeric_dtypes):
                  
                try:
                    DATASET[column_with_predict_var_x] = (DATASET[column_with_predict_var_x]).astype('datetime64[ns]')
                    print("Variable X successfully converted to datetime64[ns].\n")
                    
                except:
                    # Simply ignore it
                    pass
            
            # Get the unique values of the labels, and save them as a list using the
            # list constructor:
            unique_labels = list(DATASET[column_with_labels].unique())
            print(f"{len(unique_labels)} different labels detected: {unique_labels}.\n")
            
            # Start a list to store the dictionaries containing the keys:
            # 'x': list of predict variables; 'y': list of responses; 'lab': the label (group)
            list_of_dictionaries_with_series_to_analyze = []
            
            # Loop through each possible label:
            for lab in unique_labels:
                # loop through each element from the list unique_labels, referred as lab
                
                # Set a filter for the dataset, to select only rows correspondent to that
                # label:
                boolean_filter = (DATASET[column_with_labels] == lab)
                
                # Create a copy of the dataset, with entries selected by that filter:
                ds_copy = (DATASET[boolean_filter]).copy(deep = True)
                # Sort again by X and Y, to guarantee the results are in order:
                ds_copy = ds_copy.sort_values(by = [column_with_predict_var_x, column_with_response_var_y], ascending = [True, True])
                # Restart the index of the copy:
                ds_copy = ds_copy.reset_index(drop = True)
                
                # Re-extract the X and Y series and convert them to NumPy arrays 
                # (these arrays will be important later in the function):
                x = np.array(ds_copy[column_with_predict_var_x])
                y = np.array(ds_copy[column_with_response_var_y])
            
                # Then, create the dictionary:
                dict_of_values = {'x': x, 'y': y, 'lab': lab}
                
                # Now, append dict_of_values to list_of_dictionaries_with_series_to_analyze:
                list_of_dictionaries_with_series_to_analyze.append(dict_of_values)
                
            # Now, we have a list of dictionaries with the same format of the input list.
            
    else:
        
        # The user input a list_of_dictionaries_with_series_to_analyze
        # Create a support list:
        support_list = []
        
        # Loop through each element on the list list_of_dictionaries_with_series_to_analyze:
        
        for i in range (0, len(list_of_dictionaries_with_series_to_analyze)):
            # from i = 0 to i = len(list_of_dictionaries_with_series_to_analyze) - 1, index of the
            # last element from the list
            
            # pick the i-th dictionary from the list:
            dictionary = list_of_dictionaries_with_series_to_analyze[i]
            
            # access 'x', 'y', and 'lab' keys from the dictionary:
            x = dictionary['x']
            y = dictionary['y']
            lab = dictionary['lab']
            # Remember that all these variables are series from a dataframe, so we can apply
            # the astype function:
            # https://www.askpython.com/python/built-in-methods/python-astype?msclkid=8f3de8afd0d411ec86a9c1a1e290f37c
            
            # check if at least x and y are not None:
            if ((x is not None) and (y is not None)):
                
                # If x is not numeric, the user may be trying to pass a date as x. 
                # So, let's try to convert it to datetime:
                if (x.dtype not in numeric_dtypes):

                    try:
                        x = (x).astype('datetime64[ns]')
                        print(f"Variable X from {i}-th dictionary successfully converted to datetime64[ns].\n")

                    except:
                        # Simply ignore it
                        pass
                
                # Possibly, x and y are not ordered. Firstly, let's merge them into a temporary
                # dataframe to be able to order them together.
                # Use the list constructor to guarantee that x and y are read as lists. These lists
                # are the values of a dictionary passed as argument to the constructor of the
                # temporary dataframe. By converting to lists, we make the data independent
                # from its origin, even if it was created from a Pandas dataframe. Then, we have a
                # completely independent dataframe that may be manipulated and sorted, without worrying
                # that it may modify its origin:
                
                temp_df = pd.DataFrame(data = {'x': list(x), 'y': list(y)})
                # sort this dataframe by 'x' and 'y':
                temp_df = temp_df.sort_values(by = ['x', 'y'], ascending = [True, True])
                # restart index:
                temp_df = temp_df.reset_index(drop = True)
                
                # Re-extract the X and Y series and convert them to NumPy arrays 
                # (these arrays will be important later in the function):
                x = np.array(temp_df['x'])
                y = np.array(temp_df['y'])
                
                # check if lab is None:
                if (lab is None):
                    # input a default label.
                    # Use the str function to convert the integer to string, allowing it
                    # to be concatenated
                    lab = "X" + str(i) + "_x_" + "Y" + str(i)
                    
                # Then, create the dictionary:
                dict_of_values = {'x': x, 'y': y, 'lab': lab}
                
                # Now, append dict_of_values to support list:
                support_list.append(dict_of_values)
            
        # Now, support_list contains only the dictionaries with valid entries, as well
        # as labels for each collection of data. The values are independent from their origin,
        # and now they are ordered and in the same format of the data extracted directly from
        # the dataframe.
        # So, make the list_of_dictionaries_with_series_to_analyze the support_list itself:
        list_of_dictionaries_with_series_to_analyze = support_list
        print(f"{len(list_of_dictionaries_with_series_to_analyze)} valid series input.\n")

        
    # Now that both methods of input resulted in the same format of list, we can process both
    # with the same code.
    
    # Each dictionary in list_of_dictionaries_with_series_to_analyze represents a series to
    # plot. So, the total of series to plot is:
    total_of_series = len(list_of_dictionaries_with_series_to_analyze)
    
    if (total_of_series <= 0):
        
        print("No valid series to plot. Please, provide valid arguments.\n")
    
    else:
        
        # Continue to plotting.
        # Notice that we sorted all the lists after they were separated and before
        # adding them to dictionaries. Also, the timestamps were converted to datetime64 variables.
        # Now that the loops are finished, list_of_dictionaries_with_series_to_analyze 
        # contains all series converted to NumPy arrays, with timestamps parsed as datetimes.
        # Notice that this function only plots the series: it does not return this list.
        
        
        # Now, we can plot the figure.
        # We set alpha = 0.95 (opacity) to give a degree of transparency (5%), 
        # so that one series does not completely block the visualization of the other.
        
        # Let's retrieve the list of Matplotlib CSS colors:
        css4 = mcolors.CSS4_COLORS
        # css4 is a dictionary of colors: {'aliceblue': '#F0F8FF', 'antiquewhite': '#FAEBD7', ...}
        # Each key of this dictionary is a color name that can be passed as the argument color of
        # the plot function. So let's retrieve the keys and use the list constructor to convert
        # them to a list of colors:
        list_of_colors = list(css4.keys())
        
        # On 11 May 2022, this list of colors had 148 different elements.
        # Since this list is in alphabetic order, let's create a random order for the colors.
        
        # Function random.sample(input_sequence, k = number_of_samples): 
        # returns a list with k elements randomly selected from input_sequence WITHOUT
        # replacement, so no element is picked twice. k must be an integer.
        
        # Function random.choices(input_sequence, k = number_of_samples):
        # also randomly selects k elements from input_sequence, but WITH replacement, so the
        # same color could be repeated while others would be left out.
        # Since we simply want to randomly reorder the sequence, we use random.sample with
        # k = len(input_sequence), which returns a random permutation of the list:
        list_of_colors = random.sample(list_of_colors, k = len(list_of_colors))
        # Now, we have a randomly ordered list of colors to use for plotting the charts
        
        if (add_splines_lines == True):
            LINE_STYLE = '-'

        else:
            LINE_STYLE = ''
        
        if (add_scatter_dots == True):
            MARKER = 'o'
            
        else:
            MARKER = ''
        
        # Matplotlib linestyle:
        # https://matplotlib.org/stable/gallery/lines_bars_and_markers/linestyles.html?msclkid=68737f24d16011eca9e9c4b41313f1ad
        
        if (plot_title is None):
            # Set graphic title
            plot_title = f"Y_x_timestamp"

        if (horizontal_axis_title is None):
            # Set horizontal axis title
            horizontal_axis_title = "timestamp"

        if (vertical_axis_title is None):
            # Set vertical axis title
            vertical_axis_title = "Y"
        
        # Let's put a small degree of transparency (1 - OPACITY) = 0.05 = 5%
        # so that the plotted series do not completely block other views.
        OPACITY = 0.95
        
        # Set image size (width, height) in inches for printing in the notebook's cell:
        fig = plt.figure(figsize = (12, 8))
        ax = fig.add_subplot()

        i = 0 # Restart counting for the loop of colors
        
        # Loop through each dictionary from list_of_dictionaries_with_series_to_analyze:
        for dictionary in list_of_dictionaries_with_series_to_analyze:
            
            # Try selecting a color from list_of_colors:
            try:
                
                COLOR = list_of_colors[i]
                # Go to the next element i, so that the next plot will use a different color:
                i = i + 1
            
            except IndexError:
                
                # This error will be raised if list index is out of range, 
                # i.e. if i >= len(list_of_colors) - we used all colors from the list (at least 148).
                # So, return the index to zero to restart the colors from the beginning:
                i = 0
                COLOR = list_of_colors[i]
                i = i + 1
            
            # Access the arrays and label from the dictionary:
            X = dictionary['x']
            Y = dictionary['y']
            LABEL = dictionary['lab']
            
            # Time series plot (lines and/or markers, according to LINE_STYLE and MARKER):
            ax.plot(X, Y, linestyle = LINE_STYLE, marker = MARKER, color = COLOR, alpha = OPACITY, label = LABEL)
            # Axes.plot documentation:
            # https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.plot.html?msclkid=42bc92c1d13511eca8634a2c93ab89b5
            
            # x and y are positional arguments: they are specified by their position in function
            # call, not by an argument name like 'marker'.
            
            # Matplotlib markers:
            # https://matplotlib.org/stable/api/markers_api.html?msclkid=36c5eec5d16011ec9583a5777dc39d1f
            
        # Now we finished plotting all of the series, we can set the general configuration:
        
        # Rotate the X axis tick labels by x_axis_rotation degrees:
        plt.xticks(rotation = x_axis_rotation)
        # x_axis_rotation = 0 keeps the X labels horizontal.
        # Rotate the Y axis tick labels by y_axis_rotation degrees:
        plt.yticks(rotation = y_axis_rotation)
        # y_axis_rotation = 0 keeps the Y labels horizontal.

        ax.set_title(plot_title)
        ax.set_xlabel(horizontal_axis_title)
        ax.set_ylabel(vertical_axis_title)

        ax.grid(grid) # show grid or not
        ax.legend(loc = 'upper left')
        # position options: 'upper right'; 'upper left'; 'lower left'; 'lower right';
        # 'right', 'center left'; 'center right'; 'lower center'; 'upper center', 'center'
        # https://www.statology.org/matplotlib-legend-position/

        if (export_png == True):
            # Image will be exported
            import os

            #check if the user defined a directory path. If not, set as the default root path:
            if (directory_to_save is None):
                #set as the default
                directory_to_save = ""

            #check if the user defined a file name. If not, set as the default name for this
            # function.
            if (file_name is None):
                #set as the default
                file_name = "time_series_vis"

            #check if the user defined an image resolution. If not, set as the default 330 dpi
            # resolution.
            if (png_resolution_dpi is None):
                #set as 330 dpi
                png_resolution_dpi = 330

            #Get the new_file_path
            new_file_path = os.path.join(directory_to_save, file_name)

            #Export the file to this new path:
            # When format is passed explicitly, savefig uses the file name verbatim and does not
            # append an extension, so '.png' is added here.
            plt.savefig(new_file_path + ".png", dpi = png_resolution_dpi, format = 'png', transparent = False) 
            # The 'quality' argument only applied to JPEG output and was removed from recent
            # Matplotlib releases, so it is not passed.
            #format (str, supported formats) = 'png', 'pdf', 'ps', 'eps' or 'svg'
            #transparent = True or False
            # For other parameters of .savefig method, check https://indianaiproduction.com/matplotlib-savefig/
            print (f"Figure exported as \'{new_file_path}.png\'. Any previous file in this root path was overwritten.")

        #Set image size (width, height, in inches) for printing in the notebook's cell:
        #plt.figure(figsize = (12, 8))
        #fig.tight_layout()

        ## Show an image read from an image file:
        ## import matplotlib.image as pltimg
        ## img=pltimg.imread('mydecisiontree.png')
        ## imgplot = plt.imshow(img)
        ## See linkedIn Learning course: "Supervised machine learning and the technology boom",
        ##  Ex_Files_Supervised_Learning, Exercise Files, lesson '03. Decision Trees', '03_05', 
        ##  '03_05_END.ipynb'
        plt.show()

# **Functions for histogram visualization**
- Ideal number of bins is calculated through Montgomery's method.
    - Douglas C. Montgomery (2009). Introduction to Statistical Process Control, Sixth Edition, John Wiley & Sons.
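
As a rough illustration of the bin calculation, the sketch below assumes (as the comments in this notebook's code describe) that the number of cells is the integer closest to the square root of the sample size, and that the bin size is the data range divided by that number. The names `montgomery_bins` and `sample` are hypothetical and used only for this illustration; they are not part of the notebook's functions.

```python
import math

def montgomery_bins(values):
    """Return (n_cells, bin_size) using the square-root rule (illustrative helper)."""
    n = len(values)
    if n == 0:
        raise ValueError("at least one value is required")
    # Number of cells: integer closest to sqrt(n), never below 1.
    n_cells = max(1, int(round(math.sqrt(n))))
    # Range of the data; abs() keeps the bin width positive.
    range_hist = abs(max(values) - min(values))
    bin_size = range_hist / n_cells
    return n_cells, bin_size

# Hypothetical sample: 50 values from 1.0 to 50.0
sample = [float(i) for i in range(1, 51)]
n_cells, bin_size = montgomery_bins(sample)
print(n_cells, bin_size)
```

With 50 observations the square root is about 7.07, so the rule suggests 7 cells; the class below still draws the histogram with the number of bins passed by the user and only reports this value.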

In [16]:
class capability_analysis:
            
    # Initialize instance attributes.
    # define the Class constructor, i.e., how are its objects:
    def __init__ (self, df, column_with_variable_to_be_analyzed, specification_limits, total_of_bins = 10, alpha = 0.10):
                
        import numpy as np
        import pandas as pd
        
        # If the user passes the arguments, use them. Otherwise, use the standard values.
        # Set the class objects' attributes.
        # Suppose the object is named analysis. We can access an attribute as:
        # analysis.mu, for instance.
        # So, we can save the variables as objects' attributes.
        self.df = df
        self.column_with_variable_to_be_analyzed = column_with_variable_to_be_analyzed
        self.specification_limits = specification_limits
        self.sample_size = df[column_with_variable_to_be_analyzed].count()
        self.mu = (df[column_with_variable_to_be_analyzed]).mean() 
        self.median = (df[column_with_variable_to_be_analyzed]).median()
        self.sigma = (df[column_with_variable_to_be_analyzed]).std()
        self.lowest = (df[column_with_variable_to_be_analyzed]).min()
        self.highest = (df[column_with_variable_to_be_analyzed]).max()
        self.total_of_bins = total_of_bins
        self.alpha = alpha
        
        # Start a dictionary of constants
        self.dict_of_constants = {}
        # Get parameters to update later:
        self.histogram_dict = {}
        self.capability_dict = {}
        self.normality_dict = {}
        
        print("WARNING: this capability analysis is based on the strong hypothesis that data follows the normal (Gaussian) distribution.\n")
        
    # Define the class methods.
    # All methods must take an object from the class (self) as one of the parameters
   
    # Define a dictionary of constants.
    # Each key in the dictionary corresponds to a number of samples in a subgroup.
    # sample_size - This variable represents the total of labels or subgroups n. 
    # If there are multiple labels, this variable will be updated later.
    
    def check_data_normality (self):
        
        import numpy as np
        import pandas as pd
        from scipy import stats
        from statsmodels.stats import diagnostic
        
        alpha = self.alpha
        df = self.df
        column_with_variable_to_be_analyzed = self.column_with_variable_to_be_analyzed
        sample_size = self.sample_size
        mu = self.mu 
        median = self.median
        sigma = self.sigma
        lowest = self.lowest
        highest = self.highest
        normality_dict = self.normality_dict # empty dictionary 
        
        print("WARNING: The statistical tests require at least 20 samples.\n")
        print("Interpretation:")
        print("p-value: probability of observing data at least as far from normality as this sample, if the population were normal.")
        print("Criterion: normality is rejected if p < alpha = %.3f." %(alpha))
        
        if (sample_size < 20):
            
            print(f"Unable to test series normality: at least 20 samples are needed, but found only {sample_size} entries for this series.\n")
            normality_dict['WARNING'] = "Series without the minimum number of elements (20) required to test the normality."
            
        else:
            # Let's test the series.
            y = df[column_with_variable_to_be_analyzed]
            
            # Scipy.stats’ normality test
            # It is based on D’Agostino and Pearson’s test that combines 
            # skew and kurtosis to produce an omnibus test of normality.
            _, scipystats_test_pval = stats.normaltest(y)
            # The underscore indicates an output to be ignored, which is s^2 + k^2, 
            # where s is the z-score returned by skewtest and k is the z-score returned by kurtosistest.
            # https://docs.scipy.org/doc/scipy-1.8.0/html-scipyorg/reference/generated/scipy.stats.normaltest.html
            
            print("\n")
            print("D\'Agostino and Pearson\'s normality test (scipy.stats normality test):")
            print(f"p-value = {scipystats_test_pval} ({scipystats_test_pval*100}%).")
            
            if (scipystats_test_pval < alpha):
                
                print("p = %.3f < %.3f" %(scipystats_test_pval, alpha))
                print(f"According to this test, normality is rejected at the {alpha*100}% significance level.")
            
            else:
                
                print("p = %.3f >= %.3f" %(scipystats_test_pval, alpha))
                print(f"According to this test, normality cannot be rejected at the {alpha*100}% significance level.")
            
            # add this test result to the dictionary:
            normality_dict['dagostino_pearson_p_val'] = scipystats_test_pval
            normality_dict['dagostino_pearson_p_in_pct'] = scipystats_test_pval*100
            
            # Scipy.stats’ Shapiro-Wilk test
            # https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.shapiro.html
            shapiro_test = stats.shapiro(y)
            # returns ShapiroResult(statistic=0.9813305735588074, pvalue=0.16855233907699585)
             
            print("\n")
            print("Shapiro-Wilk normality test:")
            print(f"p-value = {shapiro_test[1]} ({(shapiro_test[1])*100}%).")
            
            if (shapiro_test[1] < alpha):
                
                print("p = %.3f < %.3f" %(shapiro_test[1], alpha))
                print(f"According to this test, normality is rejected at the {alpha*100}% significance level.")
            
            else:
                
                print("p = %.3f >= %.3f" %(shapiro_test[1], alpha))
                print(f"According to this test, normality cannot be rejected at the {alpha*100}% significance level.")
            
            # add this test result to the dictionary:
            normality_dict['shapiro_wilk_p_val'] = shapiro_test[1]
            normality_dict['shapiro_wilk_p_in_pct'] = (shapiro_test[1])*100
            
            # Lilliefors’ normality test
            lilliefors_test = diagnostic.kstest_normal(y, dist = 'norm', pvalmethod = 'table')
            # Returns a tuple: index 0: ksstat: float
            # Kolmogorov-Smirnov test statistic with estimated mean and variance.
            # index 1: p-value:float
            # If the pvalue is lower than some threshold, e.g. 0.10, then we can reject the Null hypothesis that the sample comes from a normal distribution.
            
            print("\n")
            print("Lilliefors\'s normality test:")
            print(f"p-value = {lilliefors_test[1]} ({(lilliefors_test[1])*100}%).")
            
            if (lilliefors_test[1] < alpha):
                
                print("p = %.3f < %.3f" %(lilliefors_test[1], alpha))
                print(f"According to this test, normality is rejected at the {alpha*100}% significance level.")
            
            else:
                
                print("p = %.3f >= %.3f" %(lilliefors_test[1], alpha))
                print(f"According to this test, normality cannot be rejected at the {alpha*100}% significance level.")
            
            # add this test result to the dictionary:
            normality_dict['lilliefors_p_val'] = lilliefors_test[1]
            normality_dict['lilliefors_p_in_pct'] = (lilliefors_test[1])*100

            # Anderson-Darling normality test
            ad_test = diagnostic.normal_ad(y, axis = 0)
            # Returns a tuple: index 0 - ad2: float
            # Anderson Darling test statistic.
            # index 1 - p-val: float
            # The p-value for hypothesis that the data comes from a normal distribution with unknown mean and variance.
            
            print("\n")
            print("Anderson-Darling (AD) normality test:")
            print(f"p-value = {ad_test[1]} ({(ad_test[1])*100}%).")
            
            if (ad_test[1] < alpha):
                
                print("p = %.3f < %.3f" %(ad_test[1], alpha))
                print(f"According to this test, normality is rejected at the {alpha*100}% significance level.")
            
            else:
                
                print("p = %.3f >= %.3f" %(ad_test[1], alpha))
                print(f"According to this test, normality cannot be rejected at the {alpha*100}% significance level.")
            
            # add this test result to the dictionary:
            normality_dict['anderson_darling_p_val'] = ad_test[1]
            normality_dict['anderson_darling_p_in_pct'] = (ad_test[1])*100
            
        # Update the attribute and return the object in both cases
        # (also when the series is too short to be tested):
        self.normality_dict = normality_dict
        
        return self
    
    def get_constants (self):
        
        if (self.sample_size < 2):
            
            self.sample_size = 2
            
        if (self.sample_size <= 25):
            
            dict_of_constants = {
                
                2: {'A':2.121, 'A2':1.880, 'A3':2.659, 'c4':0.7979, '1/c4':1.2533, 'B3':0, 'B4':3.267, 'B5':0, 'B6':2.606, 'd2':1.128, '1/d2':0.8865, 'd3':0.853, 'D1':0, 'D2':3.686, 'D3':0, 'D4':3.267},
                3: {'A':1.732, 'A2':1.023, 'A3':1.954, 'c4':0.8862, '1/c4':1.1284, 'B3':0, 'B4':2.568, 'B5':0, 'B6':2.276, 'd2':1.693, '1/d2':0.5907, 'd3':0.888, 'D1':0, 'D2':4.358, 'D3':0, 'D4':2.574},
                4: {'A':1.500, 'A2':0.729, 'A3':1.628, 'c4':0.9213, '1/c4':1.0854, 'B3':0, 'B4':2.266, 'B5':0, 'B6':2.088, 'd2':2.059, '1/d2':0.4857, 'd3':0.880, 'D1':0, 'D2':4.698, 'D3':0, 'D4':2.282},
                5: {'A':1.342, 'A2':0.577, 'A3':1.427, 'c4':0.9400, '1/c4':1.0638, 'B3':0, 'B4':2.089, 'B5':0, 'B6':1.964, 'd2':2.326, '1/d2':0.4299, 'd3':0.864, 'D1':0, 'D2':4.918, 'D3':0, 'D4':2.114},
                6: {'A':1.225, 'A2':0.483, 'A3':1.287, 'c4':0.9515, '1/c4':1.0510, 'B3':0.030, 'B4':1.970, 'B5':0.029, 'B6':1.874, 'd2':2.534, '1/d2':0.3946, 'd3':0.848, 'D1':0, 'D2':5.078, 'D3':0, 'D4':2.004},
                7: {'A':1.134, 'A2':0.419, 'A3':1.182, 'c4':0.9594, '1/c4':1.0423, 'B3':0.118, 'B4':1.882, 'B5':0.113, 'B6':1.806, 'd2':2.704, '1/d2':0.3698, 'd3':0.833, 'D1':0.204, 'D2':5.204, 'D3':0.076, 'D4':1.924},
                8: {'A':1.061, 'A2':0.373, 'A3':1.099, 'c4':0.9650, '1/c4':1.0363, 'B3':0.185, 'B4':1.815, 'B5':0.179, 'B6':1.751, 'd2':2.847, '1/d2':0.3512, 'd3':0.820, 'D1':0.388, 'D2':5.306, 'D3':0.136, 'D4':1.864},
                9: {'A':1.000, 'A2':0.337, 'A3':1.032, 'c4':0.9693, '1/c4':1.0317, 'B3':0.239, 'B4':1.761, 'B5':0.232, 'B6':1.707, 'd2':2.970, '1/d2':0.3367, 'd3':0.808, 'D1':0.547, 'D2':5.393, 'D3':0.184, 'D4':1.816},
                10: {'A':0.949, 'A2':0.308, 'A3':0.975, 'c4':0.9727, '1/c4':1.0281, 'B3':0.284, 'B4':1.716, 'B5':0.276, 'B6':1.669, 'd2':3.078, '1/d2':0.3249, 'd3':0.797, 'D1':0.687, 'D2':5.469, 'D3':0.223, 'D4':1.777},
                11: {'A':0.905, 'A2':0.285, 'A3':0.927, 'c4':0.9754, '1/c4':1.0252, 'B3':0.321, 'B4':1.679, 'B5':0.313, 'B6':1.637, 'd2':3.173, '1/d2':0.3152, 'd3':0.787, 'D1':0.811, 'D2':5.535, 'D3':0.256, 'D4':1.744},
                12: {'A':0.866, 'A2':0.266, 'A3':0.886, 'c4':0.9776, '1/c4':1.0229, 'B3':0.354, 'B4':1.646, 'B5':0.346, 'B6':1.610, 'd2':3.258, '1/d2':0.3069, 'd3':0.778, 'D1':0.922, 'D2':5.594, 'D3':0.283, 'D4':1.717},
                13: {'A':0.832, 'A2':0.249, 'A3':0.850, 'c4':0.9794, '1/c4':1.0210, 'B3':0.382, 'B4':1.618, 'B5':0.374, 'B6':1.585, 'd2':3.336, '1/d2':0.2998, 'd3':0.770, 'D1':1.025, 'D2':5.647, 'D3':0.307, 'D4':1.693},
                14: {'A':0.802, 'A2':0.235, 'A3':0.817, 'c4':0.9810, '1/c4':1.0194, 'B3':0.406, 'B4':1.594, 'B5':0.399, 'B6':1.563, 'd2':3.407, '1/d2':0.2935, 'd3':0.763, 'D1':1.118, 'D2':5.696, 'D3':0.328, 'D4':1.672},
                15: {'A':0.775, 'A2':0.223, 'A3':0.789, 'c4':0.9823, '1/c4':1.0180, 'B3':0.428, 'B4':1.572, 'B5':0.421, 'B6':1.544, 'd2':3.472, '1/d2':0.2880, 'd3':0.756, 'D1':1.203, 'D2':5.741, 'D3':0.347, 'D4':1.653},
                16: {'A':0.750, 'A2':0.212, 'A3':0.763, 'c4':0.9835, '1/c4':1.0168, 'B3':0.448, 'B4':1.552, 'B5':0.440, 'B6':1.526, 'd2':3.532, '1/d2':0.2831, 'd3':0.750, 'D1':1.282, 'D2':5.782, 'D3':0.363, 'D4':1.637},
                17: {'A':0.728, 'A2':0.203, 'A3':0.739, 'c4':0.9845, '1/c4':1.0157, 'B3':0.466, 'B4':1.534, 'B5':0.458, 'B6':1.511, 'd2':3.588, '1/d2':0.2787, 'd3':0.744, 'D1':1.356, 'D2':5.820, 'D3':0.378, 'D4':1.622},
                18: {'A':0.707, 'A2':0.194, 'A3':0.718, 'c4':0.9854, '1/c4':1.0148, 'B3':0.482, 'B4':1.518, 'B5':0.475, 'B6':1.496, 'd2':3.640, '1/d2':0.2747, 'd3':0.739, 'D1':1.424, 'D2':5.856, 'D3':0.391, 'D4':1.608},
                19: {'A':0.688, 'A2':0.187, 'A3':0.698, 'c4':0.9862, '1/c4':1.0140, 'B3':0.497, 'B4':1.503, 'B5':0.490, 'B6':1.483, 'd2':3.689, '1/d2':0.2711, 'd3':0.734, 'D1':1.487, 'D2':5.891, 'D3':0.403, 'D4':1.597},
                20: {'A':0.671, 'A2':0.180, 'A3':0.680, 'c4':0.9869, '1/c4':1.0133, 'B3':0.510, 'B4':1.490, 'B5':0.504, 'B6':1.470, 'd2':3.735, '1/d2':0.2677, 'd3':0.729, 'D1':1.549, 'D2':5.921, 'D3':0.415, 'D4':1.585},
                21: {'A':0.655, 'A2':0.173, 'A3':0.663, 'c4':0.9876, '1/c4':1.0126, 'B3':0.523, 'B4':1.477, 'B5':0.516, 'B6':1.459, 'd2':3.778, '1/d2':0.2647, 'd3':0.724, 'D1':1.605, 'D2':5.951, 'D3':0.425, 'D4':1.575},
                22: {'A':0.640, 'A2':0.167, 'A3':0.647, 'c4':0.9882, '1/c4':1.0119, 'B3':0.534, 'B4':1.466, 'B5':0.528, 'B6':1.448, 'd2':3.819, '1/d2':0.2618, 'd3':0.720, 'D1':1.659, 'D2':5.979, 'D3':0.434, 'D4':1.566},
                23: {'A':0.626, 'A2':0.162, 'A3':0.633, 'c4':0.9887, '1/c4':1.0114, 'B3':0.545, 'B4':1.455, 'B5':0.539, 'B6':1.438, 'd2':3.858, '1/d2':0.2592, 'd3':0.716, 'D1':1.710, 'D2':6.006, 'D3':0.443, 'D4':1.557},
                24: {'A':0.612, 'A2':0.157, 'A3':0.619, 'c4':0.9892, '1/c4':1.0109, 'B3':0.555, 'B4':1.445, 'B5':0.549, 'B6':1.429, 'd2':3.895, '1/d2':0.2567, 'd3':0.712, 'D1':1.759, 'D2':6.031, 'D3':0.451, 'D4':1.548},
                25: {'A':0.600, 'A2':0.153, 'A3':0.606, 'c4':0.9896, '1/c4':1.0105, 'B3':0.565, 'B4':1.435, 'B5':0.559, 'B6':1.420, 'd2':3.931, '1/d2':0.2544, 'd3':0.708, 'D1':1.806, 'D2':6.056, 'D3':0.459, 'D4':1.541},
            }
            
            # Access the key:
            dict_of_constants = dict_of_constants[self.sample_size]
            
        else: #>= 26
            
            # Approximations for large samples. n, c4 and the square-root term are computed
            # once for readability.
            n = self.sample_size
            c4 = 4*(n-1)/(4*n-3)
            root_term = (2*(n-1))**(0.5)
            
            # A2, d2, 1/d2, d3, D1, D2, D3 and D4 are kept at their tabulated values for n = 25.
            dict_of_constants = {'A':(3/(n**(0.5))), 'A2':0.153, 
                                 'A3':3/(c4*(n**(0.5))), 
                                 'c4':c4, 
                                 '1/c4':1/c4, 
                                 'B3':(1-3/(c4*root_term)), 
                                 'B4':(1+3/(c4*root_term)),
                                 'B5':(c4-3/root_term), 
                                 'B6':(c4+3/root_term), 
                                 'd2':3.931, '1/d2':0.2544, 'd3':0.708, 'D1':1.806, 'D2':6.056, 'D3':0.459, 'D4':1.541}
        
        # Update the attribute
        self.dict_of_constants = dict_of_constants
        
        return self
    
    def get_histogram_array (self):
        
        import numpy as np
        import pandas as pd
        import matplotlib.pyplot as plt
        
        df = self.df
        column_with_variable_to_be_analyzed = self.column_with_variable_to_be_analyzed
        y_hist = df[column_with_variable_to_be_analyzed]
        lowest = self.lowest
        highest = self.highest
        sample_size = self.sample_size
        
        # Number of bins set by the user:
        total_of_bins = self.total_of_bins
        
        # Firstly, get the ideal bin-size according to the Montgomery's method:
        # Douglas C. Montgomery (2009). Introduction to Statistical Process Control, 
        # Sixth Edition, John Wiley & Sons.
        # Sort by the column to analyze (ascending order) and reset the index:
        y_hist = y_hist.sort_values(ascending = True)
        y_hist = y_hist.reset_index(drop = True)
        # Calculation of the bin size (width of the histogram bars):
        # 1: Find the lowest and the highest values in the data table.
        # 2: Calculate range_hist = highest - lowest.
        # 3: Count the number of input data points (sample_size).
        # 4: Calculate the number of cells of the frequency table (n_cells):
        # n_cells = integer closest to the square root of sample_size.
        # 5: Calculate bin_size = range_hist/n_cells.
        # ATTENTION: take the absolute values of range_hist, n_cells, sample_size and bin_size,
        # because the width of the histogram bars must be a positive number.

        # bin-size
        range_hist = abs(highest - lowest)
        n_cells = int(np.rint((sample_size)**(0.5)))
        # We must use the int function to guarantee that n_cells will store an
        # integer number of cells (we cannot have a fraction of a cell).
        # The int function guarantees that the variable will be stored as an integer.
        # The numpy.rint(a) function rounds elements of the array to the nearest integer.
        # https://numpy.org/doc/stable/reference/generated/numpy.rint.html
        # For values exactly halfway between rounded decimal values, 
        # NumPy rounds to the nearest even value. 
        # Thus 1.5 and 2.5 round to 2.0; -0.5 and 0.5 round to 0.0; etc.
        if (n_cells > 3):
            
            print(f"Ideal number of histogram bins calculated through Montgomery's method = {n_cells} bins.\n")
        
        # Retrieve the histogram array hist_array
        fig, ax = plt.subplots() # temporary axes, deleted below so that nothing is shown now
        
        # Get a histogram with the number of bins set by the user (total_of_bins, 10 by default):
        hist_array = plt.hist(y_hist, bins = total_of_bins)
        plt.delaxes(ax) # this will delete ax, so that it will not be plotted.
        plt.show()
        print("") # use this print not to mix with the final plot

        # plt.hist returns a tuple (counts, bin_edges, patches):
        # hist_array[0] = array([count_1, count_2, ..., count_n]), where n = total_of_bins
        # hist_array[1] = array([edge_1, ..., edge_(n+1)])
        # hist_array[0] is the array of counts for each bin, whereas hist_array[1] is
        # the array of bin edges: element i is the left edge of bin i, and the last element
        # is the right edge of the last bin.

        # So hist_array[1] always contains one more element than hist_array[0].
        # To store both in a table with equal lengths, we pad hist_array[0] with a zero:
        # no bin starts at the last edge, so its count is zero.

        MAX_LENGTH = max(len(hist_array[0]), len(hist_array[1])) # Get the length of the longest sequence
        SEQUENCES = [list(hist_array[0]), list(hist_array[1])] # get a list of sequences to pad.
        # Notice that we applied the list function to create a list of lists

        # We cannot pad with the function pad_sequences from tensorflow because it converts all values
        # to integers. Then, we have to pad the sequences by looping through the elements from SEQUENCES:

        # Start a support_list
        support_list = []

        # loop through each sequence in SEQUENCES:
        for sequence in SEQUENCES:
            # add a zero at the end of the sequence until its length reaches MAX_LENGTH
            while (len(sequence) < MAX_LENGTH):

                sequence.append(0)

            # append the sequence to support_list:
            support_list.append(sequence)

        # hist_array is a tuple, and tuples are immutable: they do not support item assignment,
        # i.e., we cannot do hist_array[0] = variable. So, we rebuild the whole object at once,
        # converting support_list to a 2D NumPy array with the function np.array:
        hist_array = np.array(support_list)

        # Get a first estimate of bin_size as the largest difference between successive
        # elements from support_list[1] (the bin edges):

        diff_lists = []

        for i in range (1, len(support_list[1])):

            diff_lists.append(support_list[1][i] - support_list[1][(i-1)])

        # Take the maximum difference. This value is replaced further below by the
        # average difference between bin edges:
        bin_size = np.amax(np.array(diff_lists))

        # Let's get the frequency table, which will be saved on DATASET (to get the code
        # equivalent to the code for the function 'histogram'):

        DATASET = pd.DataFrame(data = {'bin_center': hist_array[1], 'count': hist_array[0]})

        # Get lists of the bin edges (stored as 'bin_center') and of the counts:
        list_of_bins = list(hist_array[1])
        list_of_counts = list(hist_array[0])

        # get the maximum count:
        max_count = DATASET['count'].max()
        # Get the index of the max count:
        max_count_index = list_of_counts.index(max_count)

        # Get the left edge of the bin with the max count (maximum probability):
        bin_of_max_proba = list_of_bins[max_count_index]
        bin_after_the_max_proba = list_of_bins[(max_count_index + 1)] # the next bin
        number_of_bins = len(DATASET) # Total of elements on the frequency table
        
        # Obtain a list of differences between bins
        bins_diffs = [(list_of_bins[i] - list_of_bins[(i-1)]) for i in range (1, len(list_of_bins))]
        # Convert it to Pandas series and use the mean method to retrieve the average bin size:
        bin_size = pd.Series(bins_diffs).mean()
        
        self.histogram_dict = {'df': DATASET, 'list_of_bins': list_of_bins, 'list_of_counts': list_of_counts,
                              'max_count': max_count, 'max_count_index': max_count_index,
                              'bin_of_max_proba': bin_of_max_proba, 'bin_after_the_max_proba': bin_after_the_max_proba,
                              'number_of_bins': number_of_bins, 'bin_size': bin_size}
        
        return self
    
    def get_desired_normal (self):
        
        import numpy as np
        import pandas as pd
        
        # Get a normal completely (6s) in the specifications, and centered
        # within these limits
        
        mu = self.mu
        sigma = self.sigma
        histogram_dict = self.histogram_dict
        max_count = histogram_dict['max_count']
        
        specification_limits = self.specification_limits
        
        lower_spec = specification_limits['lower_spec_lim']
        upper_spec = specification_limits['upper_spec_lim']
        
        if (lower_spec is None):
            
            # There is no lower specification: everything below it is in the specifications.
            # Make it mean - 6sigma (virtually infinite).
            lower_spec = mu - 6*(sigma)
            # Update the dictionary:
            specification_limits['lower_spec_lim'] = lower_spec
        
        if (upper_spec is None):
            
            # There is no upper specification: everything above it is in the specifications.
            # Make it mean + 6sigma (virtually infinite).
            upper_spec = mu + 6*(sigma)
            # Update the dictionary:
            specification_limits['upper_spec_lim'] = upper_spec
        
        # Desired normal mu: center of the specification limits.
        desired_mu = (lower_spec + upper_spec)/2
        
        # Desired sigma: one sixth of the specification range, so that 6 sigma fits within the limits
        desired_sigma = (upper_spec - lower_spec)/6
        
        if (desired_sigma == 0):
            print("Impossible to obtain a normal curve overlaid, because the desired standard deviation is zero.\n")
            print("The lower and upper specification limits coincide.\n")
            
            # Get a dictionary of empty lists for this case
            desired_normal = {'x': [], 'y':[]}
            
        else:
            # Create lists to store the desired normal curve, centered on desired_mu
            # (the center of the specification limits).

            # The lowest value x used for obtaining the normal curve is desired_mu - 4*desired_sigma;
            # the highest x is desired_mu + 4*desired_sigma.
            # Each value is created by incrementing (0.10)*desired_sigma.

            # Note: the arrays created by the plt.hist method present the value of the extreme left
            # (the beginning) of the histogram bars, not the bin center. This does not affect the
            # desired normal, because it is centered on desired_mu and not on the bin of
            # maximum probability.
            
            # Let's create a normal around the desired mean value. Firstly, create the range X - 4s to
            # X + 4s. The probabilities will be calculated for each value in this range:

            x = (desired_mu - (4 * desired_sigma))
            x_of_normal = [x]

            while (x < (desired_mu + (4 * desired_sigma))):

                x = x + (0.10)*(desired_sigma)
                x_of_normal.append(x)

            # Convert the list to a NumPy array, so that it is possible to perform element-wise
            # (vectorial) operations:
            x_of_normal = np.array(x_of_normal)

            # Create an array of the normal curve y, applying the normal curve equation:
            # normal curve = 1/(sigma* ((2*pi)**(0.5))) * exp(-((x-mu)**2)/(2*(sigma**2)))
            # where pi = 3.14..., and exp is the exponential function (base e)
            # Let's center the normal curve on desired_mu:
            y_normal = (1 / (desired_sigma* (np.sqrt(2 * (np.pi))))) * (np.exp(-0.5 * (((1 / desired_sigma) * (x_of_normal - desired_mu)) ** 2)))
            y_normal = np.array(y_normal)

            # Pick the maximum value obtained for y_normal:
            # https://numpy.org/doc/stable/reference/generated/numpy.amax.html#numpy.amax
            y_normal_max = np.amax(y_normal)

            # Let's get a correction factor, comparing the maximum of the histogram counting, max_count,
            # with y_normal_max:
            correction_factor = max_count/(y_normal_max)

            # Now, multiply each value of the array y_normal by the correction factor, to adjust the height:
            y_normal = y_normal * correction_factor
            # Now the peak of the scaled probability density function has the same 
            # height as the tallest bar of the histogram.
            
            desired_normal = {'x': x_of_normal, 'y': y_normal}
        
        # Nest the desired_normal dictionary into specification_limits dictionary:
        specification_limits['desired_normal'] = desired_normal
        # Update the attribute:
        self.specification_limits = specification_limits
        
        return self
    
    def get_fitted_normal (self):
        
        import numpy as np
        import pandas as pd
        
        # Get the normal curve fitted to the data: it uses the sample standard
        # deviation and is centered on the bin of maximum count.
        
        mu = self.mu
        sigma = self.sigma
        histogram_dict = self.histogram_dict
        max_count = histogram_dict['max_count']
        bin_of_max_proba = histogram_dict['bin_of_max_proba']
        specification_limits = self.specification_limits
        
        if (sigma == 0):
            print("Impossible to obtain a normal curve overlaid, because the standard deviation is zero.\n")
            print("The analyzed variable is constant throughout the whole sample space.\n")
            
            # Get a dictionary of empty lists for this case
            fitted_normal = {'x': [], 'y':[]}
            
        else:
            # create lists to store the normal curve. Center the normal curve in the bin
            # of maximum bar (max probability, which will not be the mean if the curve
            # is skewed). For normal distributions, this value will be the mean and the median.

            # set the lowest value x used for obtaining the normal curve as bin_of_max_proba - 4*sigma
            # the highest x will be bin_of_max_proba + 4*sigma
            # each value will be created by incrementing (0.10)*sigma

            x = (bin_of_max_proba - (4 * sigma))
            x_of_normal = [x]

            while (x < (bin_of_max_proba + (4 * sigma))):

                x = x + (0.10)*(sigma)
                x_of_normal.append(x)

            # Convert the list to a NumPy array, so that it is possible to perform element-wise
            # (vectorial) operations:
            x_of_normal = np.array(x_of_normal)

            # Create an array of the normal curve y, applying the normal curve equation:
            # normal curve = 1/(sigma* ((2*pi)**(0.5))) * exp(-((x-mu)**2)/(2*(sigma**2)))
            # where pi = 3.14..., and exp is the exponential function (base e)
            # Let's center the normal curve on bin_of_max_proba
            y_normal = (1 / (sigma* (np.sqrt(2 * (np.pi))))) * (np.exp(-0.5 * (((1 / sigma) * (x_of_normal - bin_of_max_proba)) ** 2)))
            y_normal = np.array(y_normal)

            # Pick the maximum value obtained for y_normal:
            # https://numpy.org/doc/stable/reference/generated/numpy.amax.html#numpy.amax
            y_normal_max = np.amax(y_normal)

            # Let's get a correction factor, comparing the maximum of the histogram counting, max_count,
            # with y_normal_max:
            correction_factor = max_count/(y_normal_max)

            # Now, multiply each value of the array y_normal by the correction factor, to adjust the height:
            y_normal = y_normal * correction_factor
            
            fitted_normal = {'x': x_of_normal, 'y': y_normal}
        
        # Nest the fitted_normal dictionary into specification_limits dictionary:
        specification_limits['fitted_normal'] = fitted_normal
        # Update the attribute:
        self.specification_limits = specification_limits
        
        return self
    
    def get_actual_pdf (self):
        
        # PDF: probability density function.
        # KDE: Kernel density estimation: estimation of the actual probability density
        # https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.gaussian_kde.html#scipy.stats.gaussian_kde
        
        import numpy as np
        import pandas as pd
        from scipy import stats
        
        df = self.df
        column_with_variable_to_be_analyzed = self.column_with_variable_to_be_analyzed
        array_to_analyze = np.array(df[column_with_variable_to_be_analyzed])
        
        mu = self.mu
        sigma = self.sigma
        lowest = self.lowest
        highest = self.highest
        sample_size = self.sample_size
        
        histogram_dict = self.histogram_dict
        max_count = histogram_dict['max_count']
        specification_limits = self.specification_limits 
        
        # Get the KDE object
        kde = stats.gaussian_kde(array_to_analyze)
        
        # Here, kde may represent a distribution with high skewness and kurtosis. So, let's check
        # if the intervals mu - 6s and mu + 6s are represented by the array:
        inf_kde_lim = mu - 6*sigma
        sup_kde_lim = mu + 6*sigma
        
        if (inf_kde_lim > min(list(array_to_analyze))):
            # make the inferior limit the minimum value from the array:
            inf_kde_lim = min(list(array_to_analyze))
        
        if (sup_kde_lim < max(list(array_to_analyze))):
            # make the superior limit the maximum value from the array:
            sup_kde_lim = max(list(array_to_analyze))
        
        # Let's obtain an X array, consisting of all values for which we will calculate the PDF:
        new_x = inf_kde_lim
        new_x_list = [new_x]
        
        while ((new_x) < sup_kde_lim):
            # There is already the first element, so go to the next one.
            new_x = new_x + (0.10)*sigma
            new_x_list.append(new_x)
        
        # Convert the new_x_list to NumPy array, making it the array_to_analyze:
        array_to_analyze = np.array(new_x_list)
        
        # Apply the pdf method to convert the array_to_analyze into the array of probabilities:
        # i.e., calculate the probability for each one of the values in array_to_analyze:
        # PDF: Probability density function
        # https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.gaussian_kde.pdf.html#scipy.stats.gaussian_kde.pdf
        array_of_probs = kde.pdf(array_to_analyze)
        
        # Pick the maximum value obtained for array_of_probs:
        # https://numpy.org/doc/stable/reference/generated/numpy.amax.html#numpy.amax
        array_of_probs_max = np.amax(array_of_probs)

        # Let's get a correction factor, comparing the maximum of the histogram counting, max_count,
        # with array_of_probs_max:
        correction_factor = max_count/(array_of_probs_max)

        # Now, multiply each value of the array array_of_probs by the correction factor, to adjust the height:
        array_of_probs = array_of_probs * correction_factor
        # Now the maximum of the probability density curve has the same 
        # height as the tallest bar of the histogram.
        
        # Define a dictionary
        # X of the probability density plot: values from the series being analyzed.
        # Y of the probability density plot: probabilities calculated for each X.
        actual_pdf = {'x': array_to_analyze, 'y': array_of_probs}
        
        # Nest the actual_pdf dictionary into specification_limits dictionary:
        specification_limits['actual_pdf'] = actual_pdf
        # Update the attribute:
        self.specification_limits = specification_limits
        
        return self
    
    def get_capability_indicators (self):
        
        import numpy as np
        import pandas as pd
        
        # Get a normal completely (6s) in the specifications, and centered
        # within these limits
        
        mu = self.mu
        sigma = self.sigma
        histogram_dict = self.histogram_dict
        bin_of_max_proba = histogram_dict['bin_of_max_proba']
        bin_after_the_max_proba = histogram_dict['bin_after_the_max_proba']
        max_count = histogram_dict['max_count']
        
        specification_limits = self.specification_limits
        lower_spec = specification_limits['lower_spec_lim']
        upper_spec = specification_limits['upper_spec_lim']
        desired_mu = (lower_spec + upper_spec)/2 
        # center of the specification limits: we want the mean to be in the center of the
        # specification limits
        
        range_spec = abs(upper_spec - lower_spec)
        
        # Get the constant:
        self = self.get_constants()
        dict_of_constants = self.dict_of_constants
        constant = dict_of_constants['1/c4']
        
        # Calculate corrected sigma:
        sigma_corrected = sigma*constant
        
        # Calculate the capability indicators, adding them to the
        # capability_dict
        cp = (range_spec)/(6*sigma_corrected)
        cr = 100*(6*sigma_corrected)/(range_spec)
        cm = (range_spec)/(8*sigma_corrected)
        zu = (upper_spec - mu)/(sigma_corrected)
        zl = (mu - lower_spec)/(sigma_corrected)
        
        z_min = min(zu, zl)
        cpk = (z_min)/3

        cpm_factor = 1 + ((mu - desired_mu)/sigma_corrected)**2
        cpm_factor = cpm_factor**(0.5) # square root
        cpm = (cp)/(cpm_factor)
        
        capability_dict = {'indicator': ['cp', 'cr', 'cm', 'zu', 'zl', 'z_min', 'cpk', 'cpm'], 
                            'value': [cp, cr, cm, zu, zl, z_min, cpk, cpm]}
        # Already in format for pd.DataFrame constructor
        
        # Update the attribute:
        self.capability_dict = capability_dict
        
        return self
    
    def capability_interpretation (self):
       
        print("Capable process: a process which meets its specifications.")
        print("Naturally, we want processes capable of meeting the specifications.\n")
        
        print("Specification range:")
        print("Absolute value of the difference between the upper and the lower limits of specification.\n")
        
        print("6s interval:")
        print("Consider mean value = mu; standard deviation = s")
        print("For a normal distribution, 99.7% of the values range from its (mu - 3s) to (mu + 3s).")
        print("So, if the process follows the normal distribution, we can consider that virtually all of the data is in this range with 6s width.\n")
        
        print ("Cp:")
        print ("Relation between specification range and 6s.\n")
        
        print("Cr:")
        print("For a capable process, 6s < specification range.")
        print("So, the inverse of Cp is the fraction of the specification range occupied by the 6s interval.")
        print("Example: if 1/Cp = 0.2, then the 6s interval corresponds to 0.20 (20%) of the specification range.")
        print("Cr = 100 x (1/Cp) - the percent of the specification range occupied by the 6s interval.")
        print("Again, if 1/Cp = 0.2, then Cr = 20: the 6s interval corresponds to 20% of the specification range.\n")
        
        print("Cm:")
        print("It is a more generalized version of Cp.")
        print("Cm is the relation between specification range and 8s.")
        print("Then, even highly distant values from long-tailed curves are analyzed by this indicator.\n")
        
        print("Zu:")
        print("Represents how far the mean of the values is from the upper specification limit.")
        print("Zu = ([upper specification limit] - mu)/s")
        print("A higher Zu indicates a mean value lower than (and more distant from) the upper specification.")
        print("A negative Zu, in turn, indicates that the mean value is greater than the upper specification (i.e., on average, the specification is not met).\n")
        
        print("Zl:")
        print("Represents how far the mean of the values is from the lower specification limit.")
        print("Zl = (mu - [lower specification limit])/s")
        print("A higher Zl indicates a mean value higher than (and more distant from) the lower specification.")
        print("A negative Zl, in turn, indicates that the mean value is lower than the lower specification (i.e., on average, the specification is not met).\n")
        
        print("Zmin:")
        print("It is the minimum value between Zu and Zl.")
        print("So, Zmin indicates which specification is more difficult for the process to meet: the upper or the lower one.")
        print("Example: if Zmin = Zl, the mean of the process is closer to the lower specification than to the upper specification.")
        print("If Zmin, Zu, and Zl are equal, then the process is equally distant from the two specifications.")
        print("Again, if Zmin is negative, at least one of the specifications is not met.\n")
        
        print("Cpk:")
        print("This is the most fundamental capability indicator.")
        print("Consider again that 99.7% of the normally distributed data are within [(mu - 3s), (mu + 3s)].")
        print("Cpk = Zmin/3")
        print("Cpk = min((([upper specification limit] - mu)/3s), ((mu - [lower specification limit])/3s))")
        print("\n")
        print("Cpk simultaneously assesses the process centrality and whether the process is capable of meeting its specifications.")
        print("Here, process centrality means results that are symmetrically distributed around the center of the specification limits.")
        print("Basically, a perfectly-centralized process has its mean equally distant from both specifications")
        print("i.e., the mean is in the center of the specification interval.")
        print("Cpk = + 1 is usually considered the minimum value acceptable for a process.")
        print("Many quality programs define reaching Cpk = + 1.33 as their goal.")
        print("A 6-sigma process, in turn, is defined as a process with Cpk = + 2.")
        print("\n")
        print("High values of Cpk indicate that the process is not only centralized, but that the differences")
        print("([upper specification limit] - mu) and (mu - [lower specification limit]) are greater than 3s.")
        print("Since mu +- 3s is the range for 99.7% of data, it indicates that most of the values generated fall in a range")
        print("that is only a fraction of the specification range.")
        print("So, it is easier for the process to meet the specifications.")
        print("\n")
        print("Cpk values lower than 1 indicate that at least one of the intervals ([upper specification limit] - mu) and (mu - [lower specification limit])")
        print("is lower than 3s, i.e., the process naturally generates values beyond at least one of the specifications.")
        print("Low values of Cpk (in particular the negative ones) indicate not-centralized processes and processes not capable of meeting their specifications.")
        print("So, lower (and, especially, more negative) Cpk: process outputs more distant from the specifications.\n")
        
        print("Cpm:")
        print("This indicator is a more generalized version of Cp, which also penalizes the distance between the mean and the target.")
        print("It basically consists of a normalization of Cp.")
        print("For that, a normalization factor is defined as:")
        print("factor = square root(1 + ((mu - target)/s)**2)")
        print("where target is the center of the specification limits, and **2 represents the second power (square)")
        print("Cpm = Cp/(factor)")
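
The indicators described above can be reproduced by hand for a small sample. The block below is only a minimal sketch, not the class implementation: the function name `capability_indices` and the sample values are hypothetical, and it assumes the plain sample standard deviation, without the `1/c4` correction that `get_capability_indicators` applies.

```python
import math
import statistics

def capability_indices(values, lower_spec, upper_spec):
    """Return Cp, Cpk and Cpm for a sample (hypothetical helper, for illustration only)."""
    mu = statistics.mean(values)
    # Assumption: plain sample standard deviation, with no 1/c4 correction.
    sigma = statistics.stdev(values)

    spec_range = abs(upper_spec - lower_spec)
    # Target: center of the specification limits.
    target = (lower_spec + upper_spec) / 2

    cp = spec_range / (6 * sigma)
    zu = (upper_spec - mu) / sigma
    zl = (mu - lower_spec) / sigma
    cpk = min(zu, zl) / 3
    # Cpm divides Cp by a factor that grows as the mean moves away from the target.
    cpm = cp / math.sqrt(1 + ((mu - target) / sigma) ** 2)

    return {"cp": cp, "cpk": cpk, "cpm": cpm}

# Hypothetical measurements, centered on the middle of the specification limits:
sample = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 10.1, 9.9]
indices = capability_indices(sample, lower_spec=9.0, upper_spec=11.0)
print(indices)
```

Since this sample is centered on the target, the three indicators coincide. Shifting the sample mean towards one of the limits lowers Cpk and Cpm while leaving Cp unchanged.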

In [17]:
def histogram (df, column_to_analyze, total_of_bins = 10, normal_curve_overlay = True, x_axis_rotation = 0, y_axis_rotation = 0, grid = True, horizontal_axis_title = None, vertical_axis_title = None, plot_title = None, export_png = False, directory_to_save = None, file_name = None, png_resolution_dpi = 330):
    
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    
    # column_to_analyze: string with the name of the column that will be analyzed.
    # column_to_analyze = 'col1' obtains a histogram from the column named 'col1'.
    
    # Set a local copy of the dataframe to manipulate:
    DATASET = df.copy(deep = True)
    
    # Sort by the column to analyze (ascending order) and reset the index:
    DATASET = DATASET.sort_values(by = column_to_analyze, ascending = True)
    
    DATASET = DATASET.reset_index(drop = True)
    
    # Create an instance (object) from class capability_analysis:
    capability_obj = capability_analysis(df = DATASET, column_with_variable_to_be_analyzed = column_to_analyze, specification_limits = {'lower_spec_lim': None, 'upper_spec_lim': None}, total_of_bins = total_of_bins)
     
    # Get histogram array:
    capability_obj = capability_obj.get_histogram_array()
    # Attribute .histogram_dict: dictionary with keys 'list_of_bins' and 'list_of_counts'.
    
    # Get fitted normal:
    capability_obj = capability_obj.get_fitted_normal()
    # Now the .specification_limits attribute contains the nested dict fitted_normal = {'x': x_of_normal, 'y': y_normal}
    # in key 'fitted_normal'.
    
    # Get the actual probability density function (PDF):
    capability_obj = capability_obj.get_actual_pdf()
    # Now the dictionary in the attribute .specification_limits has the nested dict actual_pdf = {'x': array_to_analyze, 'y': array_of_probs}
    # in key 'actual_pdf'.
    
    # Retrieve general statistics:
    stats_dict = {
        
        'sample_size': capability_obj.sample_size,
        'mu': capability_obj.mu,
        'median': capability_obj.median,
        'sigma': capability_obj.sigma,
        'lowest': capability_obj.lowest,
        'highest': capability_obj.highest
    }
    
    # Retrieve the histogram dict:
    histogram_dict = capability_obj.histogram_dict
    
    # Retrieve the specification limits dictionary updated:
    specification_limits = capability_obj.specification_limits
    # Retrieve the fitted normal and actual PDF dictionaries:
    fitted_normal = specification_limits['fitted_normal']
    actual_pdf = specification_limits['actual_pdf']
    
    # Raw string: avoids the invalid escape sequences \m and \s in the LaTeX symbols.
    string_for_title = r" - $\mu = %.2f$, $\sigma = %.2f$" %(stats_dict['mu'], stats_dict['sigma'])
    
    if not (plot_title is None):
        plot_title = plot_title + string_for_title
        # %.2f: the number between . and f indicates the number of printed decimal places
        # the notation $\ - LaTeX code for printing formatted equations and symbols.
    
    else:
        # Set graphic title
        plot_title = f"histogram_of_{column_to_analyze}" + string_for_title

    if (horizontal_axis_title is None):
        # Set horizontal axis title
        horizontal_axis_title = column_to_analyze

    if (vertical_axis_title is None):
        # Set vertical axis title
        vertical_axis_title = "Counting/Frequency"
        
    # Let's put a small degree of transparency (1 - OPACITY) = 0.05 = 5%
    # so that the bars do not completely block other views.
    OPACITY = 0.95
    
    y_hist = DATASET[column_to_analyze]
    
    # Set image size in inches (width, height) for printing in the notebook's cell:
    fig = plt.figure(figsize = (12, 8))
    ax = fig.add_subplot()
    
    #STANDARD MATPLOTLIB METHOD:
    #bins = number of bins (intervals) of the histogram. Adjust it manually:
    #increasing bins will increase the histogram's resolution, but reduce the height of the bars.
    
    ax.hist(y_hist, bins = total_of_bins, alpha = OPACITY, label = f'counting_of\n{column_to_analyze}', color = 'darkblue')
    
    #IF YOU WANT TO CREATE THE GRAPHIC AS A BAR CHART BASED ON THE CALCULATED DISTRIBUTION TABLE, 
    #MARK THE LINE ABOVE AS COMMENT AND UNMARK THE LINE BELOW:
    #ax.bar(x_hist, y_hist, label = f'counting_of\n{column_to_analyze}', color = 'darkblue')
    #manually adjust the width parameter to bring the bars closer together or further apart
    
    # Plot the probability density function for the data:
    pdf_x = actual_pdf['x']
    pdf_y = actual_pdf['y']
    
    ax.plot(pdf_x, pdf_y, color = 'darkgreen', linestyle = '-', alpha = OPACITY, label = 'probability\ndensity')
    
    if (normal_curve_overlay == True):
        
        # Check if a normal curve was obtained:
        x_of_normal = fitted_normal['x']
        y_normal = fitted_normal['y']

        if (len(x_of_normal) > 0):
            # Non-empty list, add the normal curve:
            ax.plot(x_of_normal, y_normal, color = 'crimson', linestyle = 'dashed', alpha = OPACITY, label = 'expected\nnormal_curve')

    #ROTATE X AXIS IN XX DEGREES
    plt.xticks(rotation = x_axis_rotation)
    # XX = 0 DEGREES x_axis (Default)
    #ROTATE Y AXIS IN XX DEGREES:
    plt.yticks(rotation = y_axis_rotation)
    # XX = 0 DEGREES y_axis (Default)

    ax.set_title(plot_title)
    ax.set_xlabel(horizontal_axis_title)
    ax.set_ylabel(vertical_axis_title)

    ax.grid(grid) # show grid or not
    ax.legend(loc = 'upper right')
    # position options: 'upper right'; 'upper left'; 'lower left'; 'lower right';
    # 'right', 'center left'; 'center right'; 'lower center'; 'upper center', 'center'
    # https://www.statology.org/matplotlib-legend-position/

    if (export_png == True):
        # Image will be exported
        import os

        #check if the user defined a directory path. If not, set as the default root path:
        if (directory_to_save is None):
            #set as the default
            directory_to_save = ""

        #check if the user defined a file name. If not, set as the default name for this
        # function.
        if (file_name is None):
            #set as the default
            file_name = "histogram"

        #check if the user defined an image resolution. If not, set as the default 330 dpi
        # resolution.
        if (png_resolution_dpi is None):
            #set as 330 dpi
            png_resolution_dpi = 330

        #Get the new_file_path. The extension is added explicitly: when the format argument
        # is set, savefig uses the file name verbatim and does not append an extension.
        new_file_path = os.path.join(directory_to_save, file_name + ".png")

        #Export the file to this new path:
        # The quality argument is not passed: it was removed from savefig in recent Matplotlib versions.
        plt.savefig(new_file_path, dpi = png_resolution_dpi, format = 'png', transparent = False) 
        #format (str, supported formats) = 'png', 'pdf', 'ps', 'eps' or 'svg'
        #transparent = True or False
        # For other parameters of .savefig method, check https://indianaiproduction.com/matplotlib-savefig/
        print (f"Figure exported as \'{new_file_path}\'. Any previous file in this root path was overwritten.")

    #Set image size (x-pixels, y-pixels) for printing in the notebook's cell:
    #plt.figure(figsize = (12, 8))
    #fig.tight_layout()

    ## Show an image read from an image file:
    ## import matplotlib.image as pltimg
    ## img=pltimg.imread('mydecisiontree.png')
    ## imgplot = plt.imshow(img)
    ## See linkedIn Learning course: "Supervised machine learning and the technology boom",
    ##  Ex_Files_Supervised_Learning, Exercise Files, lesson '03. Decision Trees', '03_05', 
    ##  '03_05_END.ipynb'
    plt.show()
      
    stats_dict = {
                  'statistics': ['mean', 'median', 'standard_deviation', f'lowest_{column_to_analyze}', 
                                f'highest_{column_to_analyze}', 'count_of_values', 'number_of_bins', 
                                 'bin_size', 'bin_of_max_proba', 'count_on_bin_of_max_proba'],
                  'value': [stats_dict['mu'], stats_dict['median'], stats_dict['sigma'], 
                            stats_dict['lowest'], stats_dict['highest'], stats_dict['sample_size'], 
                            histogram_dict['number_of_bins'], histogram_dict['bin_size'], 
                            histogram_dict['bin_of_max_proba'], histogram_dict['max_count']]
                 }
    
    # Convert it to a Pandas dataframe setting the list 'statistics' as the index:
    # https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html
    general_stats = pd.DataFrame(data = stats_dict)
    
    # Set the column 'statistics' as the index of the dataframe, using set_index method:
    # https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.set_index.html
    
    # If inplace = True, modifies the DataFrame in place (do not create a new object).
    # Then, we do not create an object equal to the expression. We simply apply the method (so,
    # None is returned from the method):
    general_stats.set_index(['statistics'], inplace = True)
    
    print("Check the general statistics from the analyzed variable:\n")
    
    try:
        # only works in Jupyter Notebook:
        from IPython.display import display
        display(general_stats)
            
    except: # regular mode
        print(general_stats)
    
    print("\n")
    print("Check the frequency table:\n")
    
    freq_table = histogram_dict['df']
    
    try:    
        display(freq_table)    
    except:
        print(freq_table)

    return general_stats, freq_table
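
The normal curve overlaid on the histogram is a normal density rescaled so that its peak has the same height as the tallest bar. The block below is a minimal, dependency-free sketch of that rescaling. The function name `scaled_normal_curve` and the numeric inputs are hypothetical, and the grid is built from integer indices instead of the accumulating `while` loop used in `get_fitted_normal`, which is a design choice of this sketch to avoid floating-point drift at the upper limit.

```python
from statistics import NormalDist

def scaled_normal_curve(center, sigma, max_count, n_sigmas=4, step=0.10):
    """Return (xs, ys) for a normal curve whose peak equals max_count (hypothetical helper)."""
    # Grid from center - n_sigmas*sigma to center + n_sigmas*sigma, in steps of step*sigma.
    n_points = int(round(2 * n_sigmas / step)) + 1
    xs = [center + (i * step - n_sigmas) * sigma for i in range(n_points)]

    # Normal density centered on the given value (e.g. the bin with the maximum count).
    dist = NormalDist(mu=center, sigma=sigma)
    ys = [dist.pdf(x) for x in xs]

    # Correction factor: makes the maximum of the curve match the tallest histogram bar.
    correction_factor = max_count / max(ys)
    return xs, [y * correction_factor for y in ys]

# Hypothetical inputs: tallest bar centered at 5.0 with 30 counts, sigma = 2.0
xs, ys = scaled_normal_curve(center=5.0, sigma=2.0, max_count=30)
print(len(xs), xs[0], xs[-1], max(ys))
```

The rescaled curve is no longer a probability density (its area is not 1). It only serves as a visual reference for the shape that normally distributed data would produce.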

# **Function for testing data normality and visualizing the probability plot**
- Check the probability that data is actually described by a normal distribution.
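
Besides the hypothesis tests, the function below reports the skewness and the excess kurtosis of each series. As a rough, dependency-free sketch of what these two numbers measure, the block below computes them from the central moments, which should correspond to the biased estimators used by default in `scipy.stats.skew` and `scipy.stats.kurtosis` (with `fisher = True`). The function name and the two small samples are hypothetical.

```python
from statistics import fmean

def skewness_and_excess_kurtosis(values):
    """Return (skewness, excess kurtosis) from the central moments (hypothetical helper)."""
    n = len(values)
    mean = fmean(values)

    # Central moments of order 2, 3 and 4 (population versions: divide by n).
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n

    skewness = m3 / m2 ** 1.5
    # Fisher's definition: subtract 3, so that a normal distribution gives 0.
    excess_kurtosis = m4 / m2 ** 2 - 3
    return skewness, excess_kurtosis

symmetric = [1, 2, 3, 4, 5]        # symmetric around the mean
right_tailed = [1, 1, 1, 2, 10]    # one large value stretches the right tail

skew_sym, kurt_sym = skewness_and_excess_kurtosis(symmetric)
skew_right, kurt_right = skewness_and_excess_kurtosis(right_tailed)
print(skew_sym, kurt_sym)
print(skew_right, kurt_right)
```

A symmetric sample gives a skewness of zero, whereas a sample with a long right tail gives a positive skewness.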

In [18]:
def test_data_normality (df, column_to_analyze, column_with_labels_to_test_subgroups = None, alpha = 0.10, show_probability_plot = True, x_axis_rotation = 0, y_axis_rotation = 0, grid = True, export_png = False, directory_to_save = None, file_name = None, png_resolution_dpi = 330):
    
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from statsmodels.stats import diagnostic
    from scipy import stats
    # Check https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.probplot.html#scipy.stats.probplot
    # Check https://docs.scipy.org/doc/scipy/tutorial/stats.html
    # Check https://docs.scipy.org/doc/scipy-1.8.0/html-scipyorg/reference/generated/scipy.stats.normaltest.html
    
    # WARNING: The statistical tests require at least 20 samples
    
    # column_to_analyze: column (variable) of the dataset that will be tested. Declare as a string,
    # in quotes.
    # e.g. column_to_analyze = 'col1' will analyze a column named 'col1'.
    
    # column_with_labels_to_test_subgroups: if there is a column with labels or
    # subgroup indication, and the normality should be tested separately for each label, indicate
    # it here as a string (in quotes). e.g. column_with_labels_to_test_subgroups = 'col2' 
    # will retrieve the labels from 'col2'.
    # Keep column_with_labels_to_test_subgroups = None if a single series (the whole column)
    # will be tested.
    
    # Confidence level = 1 - ALPHA. For ALPHA = 0.10, we get a 0.90 = 90% confidence
    # Set ALPHA = 0.05 to get 0.95 = 95% confidence in the analysis.
    # Notice that, when less trust is needed, we can increase ALPHA to get less restrictive
    # results.
    
    print("WARNING: The statistical tests require at least 20 samples.\n")
    print("Interpretation:")
    print("p-value: probability of observing data at least as far from normality as the sample, assuming that the data is described by the normal distribution.")
    print("Criterion: the series is not described by normal if p < alpha = %.3f." %(alpha))
    
    # Set a local copy of the dataframe to manipulate:
    DATASET = df.copy(deep = True)
    
    # Start a list to store the different Pandas series to test:
    list_of_dicts = []
    
    if not (column_with_labels_to_test_subgroups is None):
        
        # 1. Get the unique values from column_with_labels_to_test_subgroups
        # and save it as the list labels_list:
        labels_list = list(DATASET[column_with_labels_to_test_subgroups].unique())
        
        # 2. Loop through each element from labels_list:
        for label in labels_list:
            
            # 3. Create a copy of the DATASET, filtering for entries where 
            # column_with_labels_to_test_subgroups == label:
            filtered_df = (DATASET[DATASET[column_with_labels_to_test_subgroups] == label]).copy(deep = True)
            # 4. Reset index of the copied dataframe:
            filtered_df = filtered_df.reset_index(drop = True)
            # 5. Create a dictionary, with an identification of the series, and the series
            # that will be tested:
            series_dict = {'series_id': (column_to_analyze + "_" + str(label)), 
                           'series': filtered_df[column_to_analyze],
                           'total_elements_to_test': len(filtered_df[column_to_analyze])}
            
            # 6. Append this dictionary to the list of series:
            list_of_dicts.append(series_dict)
        
    else:
        # In this case, the only series is the column itself. So, let's create a dictionary with
        # same structure:
        series_dict = {'series_id': column_to_analyze, 'series': DATASET[column_to_analyze],
                       'total_elements_to_test': len(DATASET[column_to_analyze])}
        
        # Append this dictionary to the list of series:
        list_of_dicts.append(series_dict)
    
    
    # Start a support list before the loop, so that it accumulates the results from all series:
    support_list = []
    
    # Now, loop through each element from the list of series:
    for series_dict in list_of_dicts:
        
        
        # Check if there are at least 20 samples to test:
        series_id = series_dict['series_id']
        total_elements_to_test = series_dict['total_elements_to_test']
        
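        # Create an instance (object) from class capability_analysis. Pass a dataframe containing
        # only the series under test, so that each subgroup is analyzed separately:
        capability_obj = capability_analysis(df = pd.DataFrame({column_to_analyze: series_dict['series']}), column_with_variable_to_be_analyzed = column_to_analyze, specification_limits = {'lower_spec_lim': None, 'upper_spec_lim': None}, alpha = alpha)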
        
        # Check data normality:
        capability_obj = capability_obj.check_data_normality()
        # Attribute .normality_dict: dictionary with results from normality tests
        
        # Retrieve the normality dictionary:
        normality_dict = capability_obj.normality_dict
        # Nest it in series_dict:
        series_dict['normality_dict'] = normality_dict
        
        # Finally, append the series dictionary to the support list:
        support_list.append(series_dict)
        
        if ((total_elements_to_test >= 20) & (show_probability_plot == True)):
            
            y = series_dict['series']
        
            print("\n")
            #Obtain the probability plot  
            fig, ax = plt.subplots(figsize = (12, 8))

            ax.set_title(f"probability_plot_of_{series_id}_for_normal_distribution")
            
            plot_results = stats.probplot(y, dist = 'norm', fit = True, plot = ax)
            #This function returns a tuple, which is stored in plot_results
            
            ax.grid(grid)
            #ROTATE X AXIS IN XX DEGREES
            plt.xticks(rotation = x_axis_rotation)
            # XX = 0 DEGREES x_axis (Default)
            #ROTATE Y AXIS IN XX DEGREES:
            plt.yticks(rotation = y_axis_rotation)
            # XX = 0 DEGREES y_axis (Default)   
            
            # Other distributions to check, see scipy Stats documentation. 
            # you could test dist=stats.loggamma, where stats was imported from scipy
            # https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.probplot.html#scipy.stats.probplot

            if (export_png == True):
                # Image will be exported
                import os

                #check if the user defined a directory path. If not, set as the default root path:
                if (directory_to_save is None):
                    #set as the default
                    directory_to_save = ""

                #check if the user defined a file name. If not, set as the default name for this
                # function.
                if (file_name is None):
                    #set as the default
                    file_name = "probability_plot_normal"

                #check if the user defined an image resolution. If not, set as the default 330 dpi
                # resolution.
                if (png_resolution_dpi is None):
                    #set as 330 dpi
                    png_resolution_dpi = 330

                #Get the new_file_path. The extension is added explicitly: when the format argument
                # is set, savefig uses the file name verbatim and does not append an extension.
                new_file_path = os.path.join(directory_to_save, file_name + ".png")

                #Export the file to this new path:
                # The quality argument is not passed: it was removed from savefig in recent Matplotlib versions.
                plt.savefig(new_file_path, dpi = png_resolution_dpi, format = 'png', transparent = False) 
                #format (str, supported formats) = 'png', 'pdf', 'ps', 'eps' or 'svg'
                #transparent = True or False
                # For other parameters of .savefig method, check https://indianaiproduction.com/matplotlib-savefig/
                print (f"Figure exported as \'{new_file_path}\'. Any previous file in this root path was overwritten.")

            #Set image size (x-pixels, y-pixels) for printing in the notebook's cell:
            #plt.figure(figsize = (12, 8))
            #fig.tight_layout()
            ## Show an image read from an image file:
            ## import matplotlib.image as pltimg
            ## img=pltimg.imread('mydecisiontree.png')
            ## imgplot = plt.imshow(img)
            ## See linkedIn Learning course: "Supervised machine learning and the technology boom",
            ##  Ex_Files_Supervised_Learning, Exercise Files, lesson '03. Decision Trees', '03_05', 
            ##  '03_05_END.ipynb'
            plt.show()
                
            print("\n")
            
    # Now that we left the for loop, make list_of_dicts the support list itself:
    list_of_dicts = support_list
    
    print("\n")
    print("Finished normality tests. Returning a list of dictionaries, where each dictionary contains the series analyzed and the p-values obtained.\n")
    print("Now, check general statistics of the data distribution:\n")
    
    # Now, let's obtain general statistics for all of the series, even those without the normality
    # test results.
    
    # start a support list:
    support_list = []
    
    for series_dict in list_of_dicts:
        
        y = series_dict['series']
        # Guarantee it is still a pandas series:
        y = pd.Series(y)
        # Calculate data skewness and kurtosis
    
        # Skewness
        data_skew = stats.skew(y)
        # skewness = 0 : symmetric distribution (as expected for normally distributed data).
        # skewness > 0 : more weight in the right tail of the distribution.
        # skewness < 0 : more weight in the left tail of the distribution.
        # https://www.geeksforgeeks.org/scipy-stats-skew-python/

        # Kurtosis
        data_kurtosis = stats.kurtosis(y, fisher = True)
        # scipy.stats.kurtosis(array, axis=0, fisher=True, bias=True) calculates the kurtosis
        # (Fisher or Pearson) of a dataset: the fourth central moment divided by the square
        # of the variance.
        # It measures the "tailedness" of the probability distribution of a real-valued
        # random variable, i.e., how heavy its tails are compared to a normal distribution.
        # fisher (bool): if True, Fisher's definition is used (3.0 is subtracted, so that a
        # normal distribution gives 0.0); if False, Pearson's definition is used (a normal
        # distribution gives 3.0).
        # https://www.geeksforgeeks.org/scipy-stats-kurtosis-function-python/
        print("A normal distribution should present no skewness (distribution distortion) and no excess kurtosis (long-tail effects).\n")
        print("For the data analyzed:\n")
        print(f"skewness = {data_skew}")
        print(f"kurtosis = {data_kurtosis}\n")

        if (data_skew < 0):

            print(f"Skewness = {data_skew} < 0: more weight in the left tail of the distribution.")

        elif (data_skew > 0):

            print(f"Skewness = {data_skew} > 0: more weight in the right tail of the distribution.")

        else:

            print(f"Skewness = {data_skew} = 0: no distortion of the distribution.")
                

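        if (data_kurtosis > 0):

            print("Excess kurtosis > 0: tails heavier than those of a normal distribution (long-tail effects).\n")

        elif (data_kurtosis < 0):

            print("Excess kurtosis < 0: tails lighter than those of a normal distribution.\n")

        else:

            print("Data kurtosis = 0. No long-tail effects detected.\n")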

        #Calculate the mode of the distribution:
        # https://pandas.pydata.org/docs/reference/api/pandas.Series.mode.html
        data_mode = y.mode().iloc[0]
        # Series.mode returns a sorted Series with all the modes (there may be ties); we take
        # the first (smallest) one, as scipy.stats.mode does.
        # Indexing the result of stats.mode(y, axis = None) as [0][0] may fail on recent SciPy
        # versions, where the mode is returned as a scalar when axis = None.

        #Create general statistics dictionary:
        general_statistics_dict = {

            "series_mean": y.mean(),
            "series_variance": y.var(),
            "series_standard_deviation": y.std(),
            "series_skewness": data_skew,
            "series_kurtosis": data_kurtosis,
            "series_mode": data_mode

        }
        
        # Add this dictionary to the series dictionary:
        series_dict['general_statistics'] = general_statistics_dict
        
        # Append the dictionary to support list:
        support_list.append(series_dict)
    
    # Now, make the list of dictionaries support_list itself:
    list_of_dicts = support_list

    return list_of_dicts
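
The moment-based definitions behind the skewness and kurtosis values above can be sketched in plain Python. This is only an illustration of the biased (population) estimators, which `scipy.stats.skew` and `scipy.stats.kurtosis` use by default (`bias = True`). The helper names `central_moment`, `skewness` and `excess_kurtosis` and the sample lists are hypothetical, not part of the notebook.

```python
def central_moment(values, order):
    """k-th central moment in its population form (divide by n)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** order for v in values) / n

def skewness(values):
    # Third central moment divided by the variance raised to 3/2.
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    # Fourth central moment divided by the squared variance, minus 3
    # (Fisher's definition: a normal distribution gives 0).
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3

symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 2, 2, 3, 20]

print(skewness(symmetric))         # symmetric sample: zero skewness
print(skewness(right_tailed))      # positive value: long right tail
print(excess_kurtosis(symmetric))  # negative value: lighter tails than a normal
```

A positive skewness therefore signals a long right tail, which matches the messages printed by the function.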

# **Function for column filtering (selecting); ordering; or renaming all columns**

In [19]:
def select_order_or_rename_columns (df, columns_list, mode = 'select_or_order_columns'):
    
    import numpy as np
    import pandas as pd
    
    # MODE = 'select_or_order_columns' for filtering only the list of columns passed as columns_list,
    # and setting a new column order. In this mode, you can pass the columns in any order: 
    # the order of elements on the list will be the new order of columns.

    # MODE = 'rename_columns' for renaming the columns with the names passed as columns_list. In this
    # mode, the list must have same length and same order of the columns of the dataframe. That is because
    # the columns will sequentially receive the names in the list. So, a mismatching of positions
    # will result into columns with incorrect names.
    
    # columns_list = list of strings containing the names (headers) of the columns to select
    # (filter); or to be set as the new columns' names, according to the selected mode.
    # For instance: columns_list = ['col1', 'col2', 'col3'] will 
    # select columns 'col1', 'col2', and 'col3' (or rename the columns with these names). 
    # Declare the names inside quotes.
    
    # Set a local copy of the dataframe to manipulate:
    DATASET = df.copy(deep = True)
    
    print(f"Original columns in the dataframe:\n{DATASET.columns}\n")
    
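    if (columns_list is None):
        # A comparison with np.nan is always False (and raises an error for array-like
        # inputs), so only None is checked. Use an empty list in this case:
        columns_list = []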
    
    if (len(columns_list) == 0):
        print("Please, input a valid list of columns.\n")
        return DATASET
    
    if (mode == 'select_or_order_columns'):
        
        #filter the dataframe so that it will contain only the columns in columns_list.
        DATASET = DATASET[columns_list]
        print("Dataframe filtered according to the list provided.\n")
        print("Check the new dataframe:\n")
        
        try:
            # only works in Jupyter Notebook:
            from IPython.display import display
            display(DATASET)

        except: # regular mode
            print(DATASET)
        
    elif (mode == 'rename_columns'):
        
        # Check if the number of columns of the dataset is equal to the number of elements
        # of the new list. It will avoid raising an exception error.
        boolean_filter = (len(columns_list) == len(DATASET.columns))
        
        if (boolean_filter == False):
            #Impossible to rename: the numbers of elements are different.
            print("The number of columns of the dataframe is different from the number of elements of the list. Please, provide a list with a number of elements equal to the number of columns.\n")
            return DATASET
        
        else:
            #Same number of elements, so that we can update the columns' names.
            DATASET.columns = columns_list
            print("Dataframe columns renamed according to the list provided.\n")
            print("Warning: the substitution is element-wise: the first element of the list is now the name of the first column, and so on, ..., so that the last element is the name of the last column.\n")
            print("Check the new dataframe:\n")
            try:
                # only works in Jupyter Notebook:
                from IPython.display import display
                display(DATASET)

            except: # regular mode
                print(DATASET)
        
    else:
        print("Enter a valid mode: \'select_or_order_columns\' or \'rename_columns\'.")
        return DATASET
    
    return DATASET
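
The two modes of the function above correspond to two basic Pandas operations, sketched here directly on a toy dataframe. The dataframe and its column names are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

# 'select_or_order_columns': the order of the list becomes the new column order.
selected = df[['c', 'a']]

# 'rename_columns': names are assigned by position, so the list must have
# one element per column, in the same order as the columns appear.
renamed = df.copy()
renamed.columns = ['x', 'y', 'z']

print(list(selected.columns))
print(list(renamed.columns))
```

Since the renaming is positional, a list in the wrong order silently produces mislabeled columns. That is why the function checks the length of the list and prints a warning.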

# **Function for renaming specific columns from the dataframe; or cleaning columns' labels**
- In `'rename_columns'` mode, the function `select_order_or_rename_columns` requires the user to pass a list containing the names of all columns.
- That list must also follow the order in which the columns appear in the dataframe.
- The function `rename_or_clean_columns_labels`, in turn, may manipulate one or several columns per call, and does not depend on their order.
- It can also be used for cleaning the columns' labels: capitalizing (upper case) or lowering the case of all columns' names; replacing substrings in columns' names; or eliminating trailing and leading white spaces or characters from columns' labels.

In [20]:
def rename_or_clean_columns_labels (df, mode = 'set_new_names', substring_to_be_replaced = ' ', new_substring_for_replacement = '_', trailing_substring = None, list_of_columns_labels = [{'column_name': None, 'new_column_name': None}, {'column_name': None, 'new_column_name': None}, {'column_name': None, 'new_column_name': None}, {'column_name': None, 'new_column_name': None}, {'column_name': None, 'new_column_name': None}, {'column_name': None, 'new_column_name': None}, {'column_name': None, 'new_column_name': None}, {'column_name': None, 'new_column_name': None}]):
    
    import numpy as np
    import pandas as pd
    # Pandas .rename method:
    # https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html
    
    # mode = 'set_new_names' will change the columns according to the specifications in
    # list_of_columns_labels.
    
    # list_of_columns_labels = [{'column_name': None, 'new_column_name': None}]
    # This is a list of dictionaries, where each dictionary contains two key-value pairs:
    # the first one contains the original column name; and the second one contains the new name
    # that will substitute the original one. The function will loop through all dictionaries in
    # this list, access the values of the keys 'column_name', and it will be replaced (switched) 
    # by the correspondent value in key 'new_column_name'.
    
    # The object list_of_columns_labels must be declared as a list, 
    # in brackets, even if there is a single dictionary.
    # Use always the same keys: 'column_name' for the original label; 
    # and 'new_column_name', for the correspondent new label.
    # Notice that this function will not search substrings: it will substitute a value only when
    # there is perfect correspondence between the string in 'column_name' and one of the columns
    # labels. So, the cases (upper or lower) must be the same.
    
    # If you want, you can remove elements (dictionaries) from the list to declare fewer elements;
    # and you can also add more elements (dictionaries) to the lists, if you need to replace more
    # values.
    # Simply put a comma after the last element from the list and declare a new dictionary, keeping the
    # same keys: {'column_name': original_col, 'new_column_name': new_col}, 
    # where original_col and new_col represent the strings for searching and replacement 
    # (If one of the keys contains None, the new dictionary will be ignored).
    # Example: list_of_columns_labels = [{'column_name': 'col1', 'new_column_name': 'col'}] will
    # rename 'col1' as 'col'.
    
    
    # mode = 'capitalize_columns' will capitalize all columns names (i.e., they will be put in
    # upper case). e.g. a column named 'column' will be renamed as 'COLUMN'
    
    # mode = 'lowercase_columns' will lower the case of all columns names. e.g. a column named
    # 'COLUMN' will be renamed as 'column'.
    
    # mode = 'replace_substring' will search on the columns names (strings) for the 
    # substring_to_be_replaced (which may be a character or a string); and will replace it by 
    # new_substring_for_replacement (which again may be either a character or a string). 
    # Numbers (integers or floats) will be automatically converted into strings.
    # As an example, consider the default situation where we search for a whitespace ' ' 
    # and replace it by underscore '_': 
    # substring_to_be_replaced = ' ', new_substring_for_replacement = '_'  
    # In this case, a column named 'new column' will be renamed as 'new_column'.
    
    # mode = 'trim' will remove all trailing or leading whitespaces from column names.
    # e.g. a column named as ' col1 ' will be renamed as 'col1'; 'col2 ' will be renamed as
    # 'col2'; and ' col3' will be renamed as 'col3'.
    
    # mode = 'eliminate_trailing_characters' will eliminate a defined trailing and leading 
    # substring from the columns' names. 
    # The substring must be indicated as trailing_substring, and its default, when no value
    # is provided, is equivalent to mode = 'trim' (eliminate white spaces). 
    # e.g., if trailing_substring = '_test' and you have a column named 'col_test', it will be 
    # renamed as 'col'.
    
    
    # Set a local copy of the dataframe to manipulate:
    DATASET = df.copy(deep = True)
    # Guarantee that the columns were read as strings:
    DATASET.columns = (DATASET.columns).astype(str)
    # dataframe.columns is a Pandas Index object, so it has the dtype attribute as other Pandas
    # objects. So, we can use the astype method to set its type as str or 'object' (or "O").
    # Notice that there are situations with int Index, when integers are used as column names or
    # as row indices. So, this portion guarantees that we can call the str attribute to apply string
    # methods.
    
    if (mode == 'set_new_names'):
        
        # Start a mapping dictionary:
        mapping_dict = {}
        # This dictionary will be in the format required by .rename method: old column name as key,
        # and new name as value.

        # Loop through each element from list_of_columns_labels:
        for dictionary in list_of_columns_labels:

            # Access the values in keys:
            column_name = dictionary['column_name']
            new_column_name = dictionary['new_column_name']

            # Check if neither is None:
            if ((column_name is not None) & (new_column_name is not None)):
                
                # Guarantee that both were read as strings:
                column_name = str(column_name)
                new_column_name = str(new_column_name)

                # Add it to the mapping dictionary setting column_name as key, and the new name as the
                # value:
                mapping_dict[column_name] = new_column_name

        # Now, the dictionary is in the correct format for the method. Let's apply it:
        DATASET.rename(columns = mapping_dict, inplace = True)
    
    elif (mode == 'capitalize_columns'):
        
        DATASET.rename(str.upper, axis = 'columns', inplace = True)
    
    elif (mode == 'lowercase_columns'):
        
        DATASET.rename(str.lower, axis = 'columns', inplace = True)
    
    elif (mode == 'replace_substring'):
        
        if (substring_to_be_replaced is None):
            # set as the default (whitespace):
            substring_to_be_replaced = ' '
        
        if (new_substring_for_replacement is None):
            # set as the default (underscore):
            new_substring_for_replacement = '_'
        
        # Apply the str attribute to guarantee that numbers were read as strings:
        substring_to_be_replaced = str(substring_to_be_replaced)
        new_substring_for_replacement = str(new_substring_for_replacement)
        # Replace the substrings in the columns' names:
        substring_replaced_series = (pd.Series(DATASET.columns)).str.replace(substring_to_be_replaced, new_substring_for_replacement, regex = False)
        # regex = False makes the search literal, so that characters such as '.' or '(' are
        # not interpreted as regular expression symbols. The columns Index is converted to a
        # pandas Series for performing an element-wise string replacement.
        
        # Now, convert the columns to the series with the replaced substrings:
        DATASET.columns = substring_replaced_series
        
    elif (mode == 'trim'):
        # Use the strip method from the str attribute with no argument, corresponding to the
        # Trim function.
        DATASET.rename(str.strip, axis = 'columns', inplace = True)
    
    elif (mode == 'eliminate_trailing_characters'):
        
        if (pd.isna(trailing_substring)):
            # No substring provided (None or NaN): apply str.strip() with no arguments,
            # which is equivalent to mode = 'trim':
            DATASET.rename(str.strip, axis = 'columns', inplace = True)
        
        else:
            import re
            # Apply the str function to guarantee that numbers were read as strings:
            trailing_substring = str(trailing_substring)

            # str.strip(chars) treats its argument as a set of characters, not as a substring,
            # and could remove more than intended. So, use a regular expression that matches
            # the exact substring only at the start or at the end of each label:
            pattern = f"^{re.escape(trailing_substring)}|{re.escape(trailing_substring)}$"
            stripped_series = (pd.Series(DATASET.columns)).str.replace(pattern, '', regex = True)
            # The columns Index is converted to a pandas Series for performing the
            # element-wise replacement.

            # Now, convert the columns to the series with the stripped strings:
            DATASET.columns = stripped_series
    
    else:
        print("Select a valid mode: \'set_new_names\', \'capitalize_columns\', \'lowercase_columns\', \'replace_substring\', \'trim\', or \'eliminate_trailing_characters\'.\n")
        return "error"
    
    print("Finished renaming dataframe columns.\n")
    print("Check the new dataframe:\n")
    
    try:
        # only works in Jupyter Notebook:
        from IPython.display import display
        display(DATASET)
            
    except: # regular mode
        print(DATASET)
        
    return DATASET
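
As a rough sketch of the label-cleaning modes described above, the same operations can be chained on the columns Index through its `str` accessor. The dataframe and labels below are made up for illustration; the function itself may apply the steps differently.

```python
import pandas as pd

df = pd.DataFrame({' First Col ': [1], 'second col': [2]})

# 'set_new_names': rename only specific columns through a {old: new} mapping.
renamed = df.rename(columns = {'second col': 'col2'})

# 'trim' + 'replace_substring' + 'lowercase_columns', applied to every label:
cleaned = df.copy()
cleaned.columns = (
    cleaned.columns.str.strip()
    .str.replace(' ', '_', regex = False)
    .str.lower()
)

print(list(renamed.columns))
print(list(cleaned.columns))
```

The mapping only acts on labels that match exactly, which is why the cases (upper or lower) and spaces must be the same as in the original label.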

# **Function for removing trailing or leading white spaces or characters (trim) from string variables, and modifying the variable type**

In [21]:
def trim_spaces_or_characters (df, column_to_analyze, new_variable_type = None, method = 'trim', substring_to_eliminate = None, create_new_column = True, new_column_suffix = "_trim"):
    
    import numpy as np
    import pandas as pd
    
    # column_to_analyze: string (inside quotes), 
    # containing the name of the column that will be analyzed. 
    # e.g. column_to_analyze = "column1" will analyze the column named as 'column1'.
    
    # new_variable_type = None. String (in quotes) that represents a given data type for the column
    # after transformation. Set:
    # - new_variable_type = 'int' to convert the column to integer type after the transform;
    # - new_variable_type = 'float' to convert the column to float (decimal number);
    # - new_variable_type = 'datetime' to convert it to date or timestamp;
    # - new_variable_type = 'category' to convert it to Pandas categorical variable.
    
    # method = 'trim' will eliminate trailing and leading white spaces from the strings in
    # column_to_analyze.
    # method = 'substring' will eliminate a defined trailing and leading substring from
    # column_to_analyze.
    
    # substring_to_eliminate = None. Set as a string (in quotes) if method = 'substring'.
    # e.g. suppose column_to_analyze contains time information: each string ends in " min":
    # "1 min", "2 min", "3 min", etc. If substring_to_eliminate = " min", this portion will be
    # eliminated, resulting in: "1", "2", "3", etc. If new_variable_type = None, these values will
    # continue to be strings. By setting new_variable_type = 'int' or 'float', the series will be
    # converted to a numeric type.
    
    # create_new_column = True
    # Keep create_new_column = True to store the transformed data into a new
    # column. Or set create_new_column = False to overwrite the existing column.
    
    # new_column_suffix = "_trim"
    # This value has effect only if create_new_column = True.
    # The new column name will be set as column + new_column_suffix. Then, if the original
    # column was "column1" and the suffix is "_trim", the new column will be named as
    # "column1_trim".
    # Alternatively, input inside quotes a string with the desired suffix. Recommendation:
    # start the suffix with "_" to separate it from the original name.
    
    
    # Set a local copy of dataframe to manipulate
    DATASET = df.copy(deep = True)
    # Guarantee that the column to analyze was read as string:
    DATASET[column_to_analyze] = (DATASET[column_to_analyze]).astype(str)
    new_series = DATASET[column_to_analyze].copy()
    
    if (method == 'substring'):
        
        if (substring_to_eliminate is None):
            
            method = 'trim'
            print("No valid substring input. Modifying method to \'trim\'.\n")
    
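    if (method == 'substring'):
        
        import re
        print("ATTENTION: Operations of string strip (removal) or replacement are all case-sensitive. There must be correct correspondence between cases and spaces for the strings being removed or replaced.\n")
        # str.strip(chars) treats its argument as a set of characters, not as a substring.
        # So, match the exact substring only at the start or at the end of each string:
        substring_to_eliminate = str(substring_to_eliminate)
        pattern = f"^{re.escape(substring_to_eliminate)}|{re.escape(substring_to_eliminate)}$"
        # For manipulating strings, call the str attribute and, then, the method to be applied:
        new_series = new_series.str.replace(pattern, '', regex = True)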
    
    else:
        
        new_series = new_series.str.strip()
    
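    # Check if the series type should be modified:
    if (new_variable_type is not None):
        
        # Map each supported option to the correspondent type:
        types_dict = {
            'int': np.int64,
            'float': np.float64,
            'datetime': 'datetime64[ns]',
            'category': 'category'
        }
        
        if (new_variable_type in types_dict):
            # Try converting the type:
            try:
                new_series = new_series.astype(types_dict[new_variable_type])
                print(f"Successfully converted the series to the type {new_variable_type}.\n")
            
            except Exception as error:
                # Do not fail silently: report the problem and keep the series as string.
                print(f"Could not convert the series to the type {new_variable_type}: {error}. The series was kept as string.\n")
        
        else:
            print(f"Invalid new_variable_type: {new_variable_type}. Use 'int', 'float', 'datetime', or 'category'. The series was kept as string.\n")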

    if (create_new_column):
        
        if (new_column_suffix is None):
            new_column_suffix = "_trim"
                
        new_column_name = column_to_analyze + new_column_suffix
        DATASET[new_column_name] = new_series
            
    else:
        
        DATASET[column_to_analyze] = new_series
    
    # Now, we are in the main code.
    print("Finished removing leading and trailing spaces or characters (substrings).")
    print("Check the 10 first elements from the series:\n")
    
    try:
        # only works in Jupyter Notebook:
        from IPython.display import display
        display(new_series.head(10))
            
    except: # regular mode
        print(new_series.head(10))
    
    return DATASET
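
One detail worth keeping in mind when removing a substring such as " min": Python's `str.strip(chars)` treats its argument as a set of characters to remove from both ends, not as an exact substring. The sketch below contrasts that behavior with a regular expression anchored at the start and end of the string. The sample strings are hypothetical, and the regular expression is just one possible way of removing the exact substring.

```python
import re

labels = ["1 min", "12 min", "min 3 min", "minimum"]

# str.strip(" min") removes ANY of the characters ' ', 'm', 'i', 'n' from both ends:
stripped_chars = [s.strip(" min") for s in labels]

# Remove the exact substring " min" only at the start or at the end of the string:
substring = " min"
pattern = re.compile(f"^{re.escape(substring)}|{re.escape(substring)}$")
stripped_substring = [pattern.sub("", s) for s in labels]

print(stripped_chars)      # ['1', '12', '3', 'u']
print(stripped_substring)  # ['1', '12', 'min 3', 'minimum']

# After the removal, the numeric strings can be converted to integers:
minutes = [int(s) for s in stripped_substring[:2]]
print(minutes)             # [1, 12]
```

The two approaches agree for values like "1 min", but the character-set behavior also changes strings such as "minimum", which do not contain the substring at all.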

# **Function for capitalizing or lowering case of string variables (string homogenizing)**

In [22]:
def capitalize_or_lower_string_case (df, column_to_analyze, method = 'lowercase', create_new_column = True, new_column_suffix = "_homogenized"):
     
    import numpy as np
    import pandas as pd
    
    # column_to_analyze: string (inside quotes), 
    # containing the name of the column that will be analyzed. 
    # e.g. column_to_analyze = "column1" will analyze the column named as 'column1'.
    
    # method = 'capitalize' will capitalize all letters from the input string 
    # (turn them to upper case).
    # method = 'lowercase' will make the opposite: turn all letters to lower case.
    # e.g. suppose column_to_analyze contains strings such as 'String One', 'STRING 2',  and
    # 'string3'. If method = 'capitalize', the output will contain the strings: 
    # 'STRING ONE', 'STRING 2', 'STRING3'. If method = 'lowercase', the outputs will be:
    # 'string one', 'string 2', 'string3'.
    
    # create_new_column = True
    # Keep create_new_column = True to store the transformed data into a new
    # column. Or set create_new_column = False to overwrite the existing column.
    
    # new_column_suffix = "_homogenized"
    # This value has effect only if create_new_column = True.
    # The new column name will be set as column + new_column_suffix. Then, if the original
    # column was "column1" and the suffix is "_homogenized", the new column will be named as
    # "column1_homogenized".
    # Alternatively, input inside quotes a string with the desired suffix. Recommendation:
    # start the suffix with "_" to separate it from the original name.
    
    
    # Set a local copy of dataframe to manipulate
    DATASET = df.copy(deep = True)
    # Guarantee that the column to analyze was read as string:
    DATASET[column_to_analyze] = (DATASET[column_to_analyze]).astype(str)
    new_series = DATASET[column_to_analyze].copy()
    
    if (method == 'capitalize'):
        
        print("Capitalizing the string (moving all characters to upper case).\n")
        # For manipulating strings, call the str attribute and, then, the method to be applied:
        new_series = new_series.str.upper()
    
    else:
        
        print("Lowering the string case (moving all characters to lower case).\n")
        new_series = new_series.str.lower()
        
    if (create_new_column):
        
        if (new_column_suffix is None):
            new_column_suffix = "_homogenized"
                
        new_column_name = column_to_analyze + new_column_suffix
        DATASET[new_column_name] = new_series
            
    else:
        
        DATASET[column_to_analyze] = new_series
    
    # Now, we are in the main code.
    print(f"Finished homogenizing the string case of {column_to_analyze}, giving value consistency.")
    print("Check the 10 first elements from the series:\n")
    
    try:
        # only works in Jupyter Notebook:
        from IPython.display import display
        display(new_series.head(10))
            
    except: # regular mode
        print(new_series.head(10))
    
    return DATASET
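
To illustrate why homogenizing the case gives value consistency, the sketch below counts categories before and after lowering the case. The sample values are made up, and `collections.Counter` is used only to keep the example free of dependencies.

```python
from collections import Counter

raw_values = ['String One', 'STRING ONE', 'string one', 'String Two']

# Before homogenizing, each spelling counts as a different category:
counts_raw = Counter(raw_values)

# After lowering the case, the three spellings collapse into a single category:
counts_lower = Counter(value.lower() for value in raw_values)

print(len(counts_raw))             # 4 distinct values
print(counts_lower['string one'])  # 3 occurrences of the same value
```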

# **Function for adding contractions to the contractions library**

In [23]:
def add_contractions_to_library (list_of_contractions = [{'contracted_expression': None, 'correct_expression': None}, {'contracted_expression': None, 'correct_expression': None}, {'contracted_expression': None, 'correct_expression': None}, {'contracted_expression': None, 'correct_expression': None}]):
    
    import contractions
    # contractions library: https://github.com/kootenpv/contractions
    
    # list_of_contractions = 
    # [{'contracted_expression': None, 'correct_expression': None}]
    # This is a list of dictionaries, where each dictionary contains two key-value pairs:
    # the first one contains the form as the contraction is usually observed; and the second one 
    # contains the correct (full) string that will replace it.
    # Since contractions can cause issues when processing text, we can expand them with these functions.
    
    # The object list_of_contractions must be declared as a list, 
    # in brackets, even if there is a single dictionary.
    # Use always the same keys: 'contracted_expression' for the contraction; and 'correct_expression', 
    # for the strings with the correspondent correction.
    
    # If you want, you can remove elements (dictionaries) from the list to declare fewer elements;
    # and you can also add more elements (dictionaries) to the lists, if you want to add more elements
    # to the contractions library.
    # Simply put a comma after the last element from the list and declare a new dictionary, keeping the
    # same keys: {'contracted_expression': original_str, 'correct_expression': new_str}, 
    # where original_str and new_str represent the contracted and expanded strings
    # (If one of the keys contains None, the new dictionary will be ignored).
    
    # Example:
    # list_of_contractions = [{'contracted_expression': 'mychange', 'correct_expression': 'my change'}]
    
    
    for dictionary in list_of_contractions:
        
        contraction = dictionary['contracted_expression']
        correction = dictionary['correct_expression']
        
        if ((contraction is not None) & (correction is not None)):
    
            contractions.add(contraction, correction)
            print(f"Successfully included the contracted expression {contraction} to the contractions library.")

    print("Now, the function for contraction correction will be able to process it within the strings.\n")

# **Function for correcting contracted strings**

In [24]:
def correct_contracted_strings (df, column_to_analyze, create_new_column = True, new_column_suffix = "_contractionsFixed"):
     
    import numpy as np
    import pandas as pd
    import contractions
    
    # contractions library: https://github.com/kootenpv/contractions
    
    # column_to_analyze: string (inside quotes), 
    # containing the name of the column that will be analyzed. 
    # e.g. column_to_analyze = "column1" will analyze the column named as 'column1'.
   
    # create_new_column = True
    # Keep create_new_column = True to store the transformed data into a new
    # column. Or set create_new_column = False to overwrite the existing column.
    
    # new_column_suffix = "_contractionsFixed"
    # This value has effect only if create_new_column = True.
    # The new column name will be set as column + new_column_suffix. Then, if the original
    # column was "column1" and the suffix is "_contractionsFixed", the new column will be named as
    # "column1_contractionsFixed".
    # Alternatively, input inside quotes a string with the desired suffix. Recommendation:
    # start the suffix with "_" to separate it from the original name.
    
    # Set a local copy of dataframe to manipulate
    DATASET = df.copy(deep = True)
    DATASET[column_to_analyze] = (DATASET[column_to_analyze]).astype(str)
    new_series = DATASET[column_to_analyze].copy()
    
    # The contractions library operates on one string at a time, so apply it element-wise.
    # Series.apply preserves the original index, keeping the rows aligned even if the
    # dataframe index is not the default 0, 1, 2, ... sequence:
    new_series = new_series.apply(lambda text: contractions.fix(text, slang = True))
    
    if (create_new_column):
            
        if (new_column_suffix is None):
            new_column_suffix = "_contractionsFixed"

        new_column_name = column_to_analyze + new_column_suffix
        DATASET[new_column_name] = new_series
            
    else:

        DATASET[column_to_analyze] = new_series

    # Now, we are in the main code.
    print(f"Finished correcting the contracted strings from column {column_to_analyze}.")
    print("Check the 10 first elements from the series:\n")

    try:
        # only works in Jupyter Notebook:
        from IPython.display import display
        display(new_series.head(10))

    except: # regular mode
        print(new_series.head(10))

    return DATASET
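
The idea behind adding contractions to a library and then expanding them can be sketched with a plain dictionary and a regular expression. This is not the implementation of the `contractions` library, only a simplified illustration; the mapping and the helper `expand_contractions` are hypothetical.

```python
import re

# Hypothetical mapping: contracted expression -> correct (full) expression.
contractions_map = {"can't": "cannot", "i'm": "I am", "won't": "will not"}

# Equivalent in spirit to adding a custom entry to the library:
contractions_map["mychange"] = "my change"

def expand_contractions(text, mapping):
    # Try longer expressions first, and match whole words only, ignoring case.
    keys = sorted(mapping, key = len, reverse = True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(key) for key in keys) + r")\b",
        flags = re.IGNORECASE
    )
    return pattern.sub(lambda match: mapping[match.group(0).lower()], text)

expanded = expand_contractions("I'm sure we can't fail", contractions_map)
custom = expand_contractions("mychange now", contractions_map)

print(expanded)  # I am sure we cannot fail
print(custom)    # my change now
```

As in the functions above, the custom entry must be registered before the correction step for it to be recognized within the strings.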

# **Function for substituting (replacing) substrings on string variables**

In [25]:
def replace_substring (df, column_to_analyze, substring_to_be_replaced = None, new_substring_for_replacement = '', create_new_column = True, new_column_suffix = "_substringReplaced"):
     
    import numpy as np
    import pandas as pd
    
    # column_to_analyze: string (inside quotes), 
    # containing the name of the column that will be analyzed. 
    # e.g. column_to_analyze = "column1" will analyze the column named as 'column1'.
    
    # substring_to_be_replaced = None; new_substring_for_replacement = ''. 
    # Strings (in quotes): when the sequence of characters substring_to_be_replaced is
    # found in the strings from column_to_analyze, it is substituted by the substring
    # new_substring_for_replacement. If None is provided to one of these substring arguments,
    # it will be treated as the empty string: ''
    # e.g. suppose column_to_analyze contains the following strings, with a spelling error:
    # "my collumn 1", 'his collumn 2', 'her column 3'. We may correct this error by setting:
    # substring_to_be_replaced = 'collumn' and new_substring_for_replacement = 'column'. The
    # function will search for the wrong group of characters and, if it finds it, will substitute
    # by the correct sequence: "my column 1", 'his column 2', 'her column 3'.
    
    # create_new_column = True
    # Keep create_new_column = True to store the transformed data into a new
    # column. Or set create_new_column = False to overwrite the existing column.
    
    # new_column_suffix = "_substringReplaced"
    # This value has effect only if create_new_column = True.
    # The new column name will be set as column + new_column_suffix. Then, if the original
    # column was "column1" and the suffix is "_substringReplaced", the new column will be named as
    # "column1_substringReplaced".
    # Alternatively, input inside quotes a string with the desired suffix. Recommendation:
    # start the suffix with "_" to separate it from the original name.
    
    
    # Set a local copy of dataframe to manipulate
    DATASET = df.copy(deep = True)
    # Guarantee that the column to analyze was read as string:
    DATASET[column_to_analyze] = (DATASET[column_to_analyze]).astype(str)
    new_series = DATASET[column_to_analyze].copy()
    
    print("ATTENTION: Operations of string strip (removal) or replacement are all case-sensitive. There must be correct correspondence between cases and spaces for the strings being removed or replaced.\n")
        
    # If one of the input substrings is None, make it the empty string:
    if (substring_to_be_replaced is None):
        substring_to_be_replaced = ''
    
    if (new_substring_for_replacement is None):
        new_substring_for_replacement = ''
    
    # Guarantee that both were read as strings (they may have been improperly read as 
    # integers or floats):
    substring_to_be_replaced = str(substring_to_be_replaced)
    new_substring_for_replacement = str(new_substring_for_replacement)
    
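    # For manipulating strings, call the str attribute and, then, the method to be applied.
    # regex = False makes the substring be matched literally (not as a regular expression):
    new_series = new_series.str.replace(substring_to_be_replaced, new_substring_for_replacement, regex = False)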
        
    if (create_new_column):
        
        if (new_column_suffix is None):
            new_column_suffix = "_substringReplaced"
                
        new_column_name = column_to_analyze + new_column_suffix
        DATASET[new_column_name] = new_series
            
    else:
        
        DATASET[column_to_analyze] = new_series
    
    # Now, we are in the main code.
    print(f"Finished replacing the substring {substring_to_be_replaced} by {new_substring_for_replacement}.")
    print("Check the 10 first elements from the series:\n")
    
    try:
        # only works in Jupyter Notebook:
        from IPython.display import display
        display(new_series.head(10))
            
    except: # regular mode
        print(new_series.head(10))
    
    return DATASET
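
The replacement step can be tried on a toy series, as in the minimal sketch below. The sample data and variable names are hypothetical. The sketch passes `regex=False` explicitly because the default value of this argument differs between pandas versions, and a literal match is what the spelling-correction example in the comments describes.

```python
import pandas as pd

# Hypothetical sample data reproducing the spelling error from the comments
sample = pd.Series(["my collumn 1", "his collumn 2", "her column 3"])

# regex=False: the pattern is treated as plain text, not as a regular expression
corrected = sample.str.replace("collumn", "column", regex=False)

# A literal match matters when the substring contains regex metacharacters:
# with regex=True, '.' would match any character
prices = pd.Series(["1.50", "2.75"])
with_commas = prices.str.replace(".", ",", regex=False)

print(corrected.tolist())
print(with_commas.tolist())
```

Strings that do not contain the searched substring, such as 'her column 3', are returned unchanged.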

# **Function for inverting the order of the string characters**

In [26]:
def invert_strings (df, column_to_analyze, create_new_column = True, new_column_suffix = "_stringInverted"):
     
    import numpy as np
    import pandas as pd
    
    # column_to_analyze: string (inside quotes), 
    # containing the name of the column that will be analyzed. 
    # e.g. column_to_analyze = "column1" will analyze the column named as 'column1'.
    
    # create_new_column = True
    # Set create_new_column = True to store the transformed data into a new
    # column. Or set create_new_column = False to overwrite the existing column.
    
    # new_column_suffix = "_stringInverted"
    # This value has effect only if create_new_column = True.
    # The new column name will be set as column + new_column_suffix. Then, if the original
    # column was "column1" and the suffix is "_stringInverted", the new column will be named as
    # "column1_stringInverted".
    # Alternatively, input inside quotes a string with the desired suffix. Recommendation:
    # start the suffix with "_" to separate it from the original name.
    
    
    # Set a local copy of dataframe to manipulate
    DATASET = df.copy(deep = True)
    # Guarantee that the column to analyze was read as string:
    DATASET[column_to_analyze] = (DATASET[column_to_analyze]).astype(str)
    new_series = DATASET[column_to_analyze].copy()
    
    # Pandas slice: start from -1 (last character) and, with step -1, walk through the
    # string 'backwards' until reaching the first character:
    # https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.slice.html
    
    new_series = new_series.str.slice(start = -1, step = -1)
    
    if (create_new_column):
            
        if (new_column_suffix is None):
            new_column_suffix = "_stringInverted"

        new_column_name = column_to_analyze + new_column_suffix
        DATASET[new_column_name] = new_series
            
    else:

        DATASET[column_to_analyze] = new_series

    # Now, we are in the main code.
    print(f"Finished inversion of the strings.")
    print("Check the 10 first elements from the series:\n")

    try:
        # only works in Jupyter Notebook:
        from IPython.display import display
        display(new_series.head(10))

    except: # regular mode
        print(new_series.head(10))

    return DATASET
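
A minimal sketch of the inversion step on a toy series (the sample values are hypothetical). It uses the same `str.slice` call as the function and compares the result with plain Python slicing, which follows the same rules.

```python
import pandas as pd

# Hypothetical sample strings, including an empty one
sample = pd.Series(["idsw", "abc", ""])

# Start at the last character (-1) and walk backwards (step = -1)
inverted = sample.str.slice(start=-1, step=-1)

# Equivalent operation with plain Python slicing, applied element by element
inverted_python = [text[::-1] for text in sample]

print(inverted.tolist())
```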

# **Function for slicing the strings**

In [27]:
def slice_strings (df, column_to_analyze, first_character_index = None, last_character_index = None, step = 1, create_new_column = True, new_column_suffix = "_slicedString"):
     
    import numpy as np
    import pandas as pd
    
    # column_to_analyze: string (inside quotes), 
    # containing the name of the column that will be analyzed. 
    # e.g. column_to_analyze = "column1" will analyze the column named as 'column1'.
    
    # create_new_column = True
    # Set create_new_column = True to store the transformed data into a new
    # column. Or set create_new_column = False to overwrite the existing column.
    
    # new_column_suffix = "_slicedString"
    # This value has effect only if create_new_column = True.
    # The new column name will be set as column + new_column_suffix. Then, if the original
    # column was "column1" and the suffix is "_slicedString", the new column will be named as
    # "column1_slicedString".
    # Alternatively, input inside quotes a string with the desired suffix. Recommendation:
    # start the suffix with "_" to separate it from the original name.
    
    # first_character_index = None - integer representing the index of the first character to be
    # included in the new strings. If None, slicing will start from first character.
    # Indexing of strings always start from 0. The last index can be represented as -1, the index of
    # the character before as -2, etc (inverse indexing starts from -1).
    # example: consider the string "idsw", which contains 4 characters. We can represent the indices as:
    # 'i': index 0; 'd': 1, 's': 2, 'w': 3. Alternatively: 'w': -1, 's': -2, 'd': -3, 'i': -4.
    
    # last_character_index = None - integer representing the index of the last character to be
    # included in the new strings. If None, slicing will go until the last character.
    # Attention: this is effectively the last character to be added, and not the next index after last
    # character.
    
    # in the 'idsw' example, if we want a string as 'ds', we want the first_character_index = 1 and
    # last_character_index = 2.
    
    # step = 1 - integer representing the slicing step. If step = 1, all characters will be added.
    # If step = 2, then the slicing will pick one element of index i and the element with index (i+2)
    # (1 index will be 'jumped'), and so on.
    # If step is negative, then the order of the new strings will be inverted.
    # Example: step = -1, and the start and finish indices are None: the output will be the inverted
    # string, 'wsdi'.
    # first_character_index = 1, last_character_index = 2, step = 1: output = 'ds';
    # first_character_index = None, last_character_index = None, step = 2: output = 'is';
    # first_character_index = None, last_character_index = None, step = 3: output = 'iw';
    # first_character_index = -1, last_character_index = -2, step = -1: output = 'ws';
    # first_character_index = -1, last_character_index = None, step = -2: output = 'wd';
    # first_character_index = -1, last_character_index = None, step = 1: output = 'w'
    # In this last example, the function tries to access the next element after the character of index
    # -1. Since -1 is the last character, there are no other characters to be added.
    # first_character_index = -2, last_character_index = -1, step = 1: output = 'sw'.
    
    
    # Set a local copy of dataframe to manipulate
    DATASET = df.copy(deep = True)
    # Guarantee that the column to analyze was read as string:
    DATASET[column_to_analyze] = (DATASET[column_to_analyze]).astype(str)
    new_series = DATASET[column_to_analyze].copy()
    
    # Pandas slice:
    # https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.slice.html
    
    
    if (step is None):
        # set as 1
        step = 1
    
    # Slicing includes the start index, but excludes the stop index. Since
    # last_character_index must be included in the output, convert it to the exclusive stop:
    # - positive step: stop = last_character_index + 1;
    # - negative step (string walked backwards): stop = last_character_index - 1.
    # When this shift would cross the boundary of the string (last index -1 with a positive
    # step, or last index 0 with a negative step), the stop is left undefined (None), so the
    # slicing only ends at the end (or at the beginning) of the string.
    if (last_character_index is None):
        stop_index = None
    
    elif (step > 0):
        stop_index = None if (last_character_index == -1) else (last_character_index + 1)
    
    else:
        stop_index = None if (last_character_index == 0) else (last_character_index - 1)
    
    # start = None and stop = None are the defaults of str.slice, so a single call covers
    # all combinations of defined and undefined indices:
    new_series = new_series.str.slice(start = first_character_index, stop = stop_index, step = step)

    if (create_new_column):
            
        if (new_column_suffix is None):
            new_column_suffix = "_slicedString"

        new_column_name = column_to_analyze + new_column_suffix
        DATASET[new_column_name] = new_series
            
    else:

        DATASET[column_to_analyze] = new_series

    # Now, we are in the main code.
    print(f"Finished slicing the strings from character {first_character_index} to character {last_character_index}.")
    print("Check the 10 first elements from the series:\n")

    try:
        # only works in Jupyter Notebook:
        from IPython.display import display
        display(new_series.head(10))

    except: # regular mode
        print(new_series.head(10))

    return DATASET
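
The inclusive-index convention described in the comments can be sketched in plain Python, whose slicing rules `Series.str.slice` follows. The helper name `inclusive_slice` is hypothetical and is not part of the notebook; it only illustrates how an inclusive last index may be converted into the exclusive stop used by slicing, for both positive and negative steps.

```python
def inclusive_slice(text, first=None, last=None, step=1):
    """Slice `text`, treating `last` as the last index to be included."""
    if last is None:
        stop = None
    elif step > 0:
        # The stop is exclusive, so move one position forward;
        # last = -1 already means "until the end of the string"
        stop = None if last == -1 else last + 1
    else:
        # Walking backwards, move one position further back;
        # last = 0 already means "until the beginning of the string"
        stop = None if last == 0 else last - 1
    return text[first:stop:step]


# Examples listed in the comments, for the string 'idsw'
print(inclusive_slice("idsw", 1, 2))        # characters of indices 1 and 2
print(inclusive_slice("idsw", -1, -2, -1))  # last two characters, inverted
print(inclusive_slice("idsw", step=-1))     # whole string, inverted
```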

# **Function for getting the leftest characters from the strings (retrieve last characters)**

In [28]:
def left_characters (df, column_to_analyze, number_of_characters_to_retrieve = 1, new_variable_type = None, create_new_column = True, new_column_suffix = "_leftChars"):
     
    import numpy as np
    import pandas as pd
    
    # column_to_analyze: string (inside quotes), 
    # containing the name of the column that will be analyzed. 
    # e.g. column_to_analyze = "column1" will analyze the column named as 'column1'.
    
    # create_new_column = True
    # Set create_new_column = True to store the transformed data into a new
    # column. Or set create_new_column = False to overwrite the existing column.
    
    # new_column_suffix = "_leftChars"
    # This value has effect only if create_new_column = True.
    # The new column name will be set as column + new_column_suffix. Then, if the original
    # column was "column1" and the suffix is "_leftChars", the new column will be named as
    # "column1_leftChars".
    # Alternatively, input inside quotes a string with the desired suffix. Recommendation:
    # start the suffix with "_" to separate it from the original name.
    
    # number_of_characters_to_retrieve = 1 - integer representing the total of characters that will
    # be retrieved. In this function, the 'leftest' characters are the LAST characters of the
    # string, counted from its end. If number_of_characters_to_retrieve = 1,
    # only the last character will be retrieved.
    # Consider the string 'idsw'.
    # number_of_characters_to_retrieve = 1 - output: 'w';
    # number_of_characters_to_retrieve = 2 - output: 'sw'.
    
    # new_variable_type = None. String (in quotes) that represents a given data type for the column
    # after transformation. Set:
    # - new_variable_type = 'int' to convert the extracted column to integer;
    # - new_variable_type = 'float' to convert the column to float (decimal number);
    # - new_variable_type = 'datetime' to convert it to date or timestamp;
    # - new_variable_type = 'category' to convert it to Pandas categorical variable.
    
    # So, if the last part of the strings is a number, you can use this argument to directly extract
    # this part as numeric variable.
    
    
    # Set a local copy of dataframe to manipulate
    DATASET = df.copy(deep = True)
    # Guarantee that the column to analyze was read as string:
    DATASET[column_to_analyze] = (DATASET[column_to_analyze]).astype(str)
    new_series = DATASET[column_to_analyze].copy()
    
    # Pandas slice:
    # https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.slice.html
    
    if (number_of_characters_to_retrieve is None):
        # set as 1
        number_of_characters_to_retrieve = 1
    
    # last_character_index = -1 would be the index of the last character.
    # If we want the last N = 2 characters, we should go from index -2 to -1, -2 = -1 - (N-1);
    # If we want the last N = 3 characters, we should go from index -3 to -1, -3 = -1 - (N-1);
    # If we want only the last (N = 1) character, we should go from -1 to -1, -1 = -1 - (N-1).
    
    # N = number_of_characters_to_retrieve
    first_character_index = -1 - (number_of_characters_to_retrieve - 1)
    
    # Perform the slicing without setting the limit, to slice until the end of the string:
    new_series = new_series.str.slice(start = first_character_index, step = 1)
    
    # Check if the series type should be modified:
    if (new_variable_type is not None):
        
        # Map the accepted type names to the dtypes used for conversion. The datetime dtype
        # needs an explicit unit ('ns') to be accepted by the astype method:
        type_mapping = {'int': np.int64, 'float': np.float64,
                        'datetime': 'datetime64[ns]', 'category': 'category'}
        
        # Try converting the type:
        try:
            new_series = new_series.astype(type_mapping[new_variable_type])
            print(f"Successfully converted the series to the type {new_variable_type}.\n")
        
        except Exception as error:
            # Unknown type name, or values that cannot be converted: keep the strings.
            print(f"Could not convert the series to the type {new_variable_type}: {error}. The strings were kept.\n")
    
    
    if (create_new_column):
            
        if (new_column_suffix is None):
            new_column_suffix = "_leftChars"

        new_column_name = column_to_analyze + new_column_suffix
        DATASET[new_column_name] = new_series
            
    else:

        DATASET[column_to_analyze] = new_series

    # Now, we are in the main code.
    print(f"Finished extracting the {number_of_characters_to_retrieve} leftest characters.")
    print("Check the 10 first elements from the series:\n")

    try:
        # only works in Jupyter Notebook:
        from IPython.display import display
        display(new_series.head(10))

    except: # regular mode
        print(new_series.head(10))

    return DATASET
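
A minimal sketch of extracting the last characters and converting them to numbers, assuming hypothetical batch codes that end with a two-digit number. The `pd.to_numeric` call with `errors="coerce"` is shown as an alternative for columns where some endings are not numeric; it is not what the function above uses.

```python
import pandas as pd

# Hypothetical codes whose last 2 characters are numeric
codes = pd.Series(["batch_07", "batch_12", "batch_31"])
n_chars = 2

# A negative start with no stop slices from the N-th last character until the end
last_chars = codes.str.slice(start=-n_chars)
as_int = last_chars.astype("int64")

# Alternative: non-numeric endings become NaN instead of raising an error
mixed = pd.Series(["batch_07", "batch_x1"])
coerced = pd.to_numeric(mixed.str.slice(start=-n_chars), errors="coerce")

print(as_int.tolist())
print(coerced.tolist())
```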

# **Function for getting the rightest characters from the strings (retrieve first characters)**

In [101]:
def right_characters (df, column_to_analyze, number_of_characters_to_retrieve = 1, new_variable_type = None, create_new_column = True, new_column_suffix = "_rightChars"):
     
    import numpy as np
    import pandas as pd
    
    # column_to_analyze: string (inside quotes), 
    # containing the name of the column that will be analyzed. 
    # e.g. column_to_analyze = "column1" will analyze the column named as 'column1'.
    
    # create_new_column = True
    # Set create_new_column = True to store the transformed data into a new
    # column. Or set create_new_column = False to overwrite the existing column.
    
    # new_column_suffix = "_rightChars"
    # This value has effect only if create_new_column = True.
    # The new column name will be set as column + new_column_suffix. Then, if the original
    # column was "column1" and the suffix is "_rightChars", the new column will be named as
    # "column1_rightChars".
    # Alternatively, input inside quotes a string with the desired suffix. Recommendation:
    # start the suffix with "_" to separate it from the original name.
    
    # number_of_characters_to_retrieve = 1 - integer representing the total of characters that will
    # be retrieved. In this function, the 'rightest' characters are the FIRST characters of the
    # string, counted from its beginning. If number_of_characters_to_retrieve = 1,
    # only the first character will be retrieved.
    # Consider the string 'idsw'.
    # number_of_characters_to_retrieve = 1 - output: 'i';
    # number_of_characters_to_retrieve = 2 - output: 'id'.
    
    # new_variable_type = None. String (in quotes) that represents a given data type for the column
    # after transformation. Set:
    # - new_variable_type = 'int' to convert the extracted column to integer;
    # - new_variable_type = 'float' to convert the column to float (decimal number);
    # - new_variable_type = 'datetime' to convert it to date or timestamp;
    # - new_variable_type = 'category' to convert it to Pandas categorical variable.
    
    # So, if the first part of the strings is a number, you can use this argument to directly extract
    # this part as numeric variable.
    
    
    # Set a local copy of dataframe to manipulate
    DATASET = df.copy(deep = True)
    # Guarantee that the column to analyze was read as string:
    DATASET[column_to_analyze] = (DATASET[column_to_analyze]).astype(str)
    new_series = DATASET[column_to_analyze].copy()
    
    # Pandas slice:
    # https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.slice.html
    
    if (number_of_characters_to_retrieve is None):
        # set as 1
        number_of_characters_to_retrieve = 1
    
    # first_character_index = 0 would be the index of the first character.
    # If we want the first N = 2 characters, we should go from index 0 to 1, 1 = (N-1);
    # If we want the first N = 3 characters, we should go from index 0 to 2, 2 = (N-1);
    # If we want only the first (N = 1) character, we should go from 0 to 0, 0 = (N-1).
    
    # N = number_of_characters_to_retrieve
    last_character_index = number_of_characters_to_retrieve - 1
    
    # Perform the slicing without setting the start, to slice from the 1st character:
    new_series = new_series.str.slice(stop = (last_character_index + 1), step = 1)
    
    # Check if the series type should be modified:
    if (new_variable_type is not None):
        
        # Map the accepted type names to the dtypes used for conversion. The datetime dtype
        # needs an explicit unit ('ns') to be accepted by the astype method:
        type_mapping = {'int': np.int64, 'float': np.float64,
                        'datetime': 'datetime64[ns]', 'category': 'category'}
        
        # Try converting the type:
        try:
            new_series = new_series.astype(type_mapping[new_variable_type])
            print(f"Successfully converted the series to the type {new_variable_type}.\n")
        
        except Exception as error:
            # Unknown type name, or values that cannot be converted: keep the strings.
            print(f"Could not convert the series to the type {new_variable_type}: {error}. The strings were kept.\n")
    
    
    if (create_new_column):
            
        if (new_column_suffix is None):
            new_column_suffix = "_rightChars"

        new_column_name = column_to_analyze + new_column_suffix
        DATASET[new_column_name] = new_series
            
    else:

        DATASET[column_to_analyze] = new_series

    # Now, we are in the main code.
    print(f"Finished extracting the {number_of_characters_to_retrieve} rightest characters.")
    print("Check the 10 first elements from the series:\n")

    try:
        # only works in Jupyter Notebook:
        from IPython.display import display
        display(new_series.head(10))

    except: # regular mode
        print(new_series.head(10))

    return DATASET
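
A minimal sketch of extracting the first characters and converting them to timestamps, assuming hypothetical records that start with a date in the format YYYY-MM-DD. It uses `pd.to_datetime` with an explicit format, an alternative to the `astype` conversion applied by the function.

```python
import pandas as pd

# Hypothetical records whose first 10 characters are a date
records = pd.Series(["2021-03-05 sample A", "2021-11-20 sample B"])
n_chars = 10

# A stop with no start slices from the first character until index (stop - 1)
first_chars = records.str.slice(stop=n_chars)

# Parse the extracted text as timestamps
as_dates = pd.to_datetime(first_chars, format="%Y-%m-%d")

print(first_chars.tolist())
print(as_dates.dt.month.tolist())
```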

# **Function for joining strings from a same column into a single string**

In [30]:
def join_strings_from_column (df, column_to_analyze, separator = " "):
     
    import numpy as np
    import pandas as pd
    
    # column_to_analyze: string (inside quotes), 
    # containing the name of the column that will be analyzed. 
    # e.g. column_to_analyze = "column1" will analyze the column named as 'column1'.
    
    # separator = " " - string containing the separator. Suppose the column contains the
    # strings: 'a', 'b', 'c', 'd'. If the separator is the empty string '', the output will be:
    # 'abcd' (no separation). If separator = " " (simple whitespace), the output will be 'a b c d'
    
    
    if (separator is None):
        # make it a whitespace:
        separator = " "
    
    # Set a local copy of dataframe to manipulate
    DATASET = df.copy(deep = True)
    # Guarantee that the column to analyze was read as string:
    DATASET[column_to_analyze] = (DATASET[column_to_analyze]).astype(str)
    new_series = DATASET[column_to_analyze].copy()
    
    concat_string = separator.join(new_series)
    # sep.join(list_of_strings) method: join all the strings, separating them by sep.

    # Now, we are in the main code.
    print(f"Finished joining strings from column {column_to_analyze}.")
    print("Check the 10 first characters from the new string:\n")

    try:
        # only works in Jupyter Notebook:
        from IPython.display import display
        display(concat_string[:10])

    except: # regular mode
        print(concat_string[:10])

    return concat_string
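
A minimal sketch of the joining step, using the strings 'a', 'b', 'c', 'd' from the comments. The `Series.str.cat` call is shown as a pandas alternative to the `separator.join` call used by the function.

```python
import pandas as pd

letters = pd.Series(["a", "b", "c", "d"])

# sep.join(iterable_of_strings): join all strings, separating them by sep
joined_space = " ".join(letters.astype(str))
joined_empty = "".join(letters.astype(str))

# Pandas alternative: with no other series provided, str.cat concatenates
# all the elements of the series into a single string
via_cat = letters.str.cat(sep=" ")

print(joined_space)
print(joined_empty)
```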

# **Function for joining several string columns into a single string column**

In [106]:
def join_string_columns (df, list_of_columns_to_join, separator = " ", new_column_suffix = "_stringConcat"):
     
    import numpy as np
    import pandas as pd
    
    # list_of_columns_to_join: list of strings (inside quotes), 
    # containing the name of the columns with strings to be joined.
    # Attention: the strings will be joined row by row, i.e. only strings in the same rows will
    # be concatenated. To join strings from the same column, use function join_strings_from_column
    # e.g. list_of_columns_to_join = ["column1", "column2"] will join strings from "column1" with
    # the correspondent strings from "column2".
    # Notice that you can concatenate any kind of columns (numeric, dates, texts, ...), but the
    # output will be a string column.
    
    # separator = " " - string containing the separator. Suppose the columns contain the
    # strings: 'a', 'b', 'c', 'd' on a given row. If the separator is the empty string '', 
    # the output will be: 'abcd' (no separation). If separator = " " (simple whitespace), 
    # the output will be 'a b c d'
    
    # new_column_suffix = "_stringConcat"
    # The new column name will be set as the name of the first column in list_of_columns_to_join
    # + new_column_suffix. Then, if the first column is "column1" and the suffix is
    # "_stringConcat", the new column will be named as "column1_stringConcat".
    # Alternatively, input inside quotes a string with the desired suffix. Recommendation:
    # start the suffix with "_" to separate it from the original name.
    
    if (separator is None):
        # make it a whitespace:
        separator = " "
        
    # Set a local copy of dataframe to manipulate
    DATASET = df.copy(deep = True)
    
    # Start a string pandas series from DATASET, but without connections with it. It will contain
    # only empty strings.
    second_copy_df = DATASET.copy(deep = True)
    second_copy_df['concat_string'] = ''
    # Also, create a separator series from it, and make it constant and equals to the separator:
    second_copy_df['separator'] = separator
    
    new_series = second_copy_df['concat_string']
    sep_series = second_copy_df['separator']
    
    col = list_of_columns_to_join[0]
    new_series = new_series + (DATASET[col]).astype(str)
    
    # Now, loop through the columns in the list:
    for i in range(1, len(list_of_columns_to_join)):
        # We already picked the 1st column (index 0). Now, we pick the second one and go
        # until len(list_of_columns_to_join) - 1, index of the last column of the list.
        col = list_of_columns_to_join[i]
        
        # concatenate the column with new_series, adding the separator to the left.
        # As we add the separator before, there will be no extra separator after the last string.
        # Convert the columns to strings for concatenation.
        new_series = new_series + sep_series + (DATASET[col]).astype(str)
        # The sep.join(list_of_strings) method can only be applied to array-like objects. It cannot
        # be used for this operation.
            
    if (new_column_suffix is None):
        new_column_suffix = "_stringConcat"

    # Add the suffix to the name of the first column
    new_column_name = list_of_columns_to_join[0] + new_column_suffix
    DATASET[new_column_name] = new_series
    
    # Now, we are in the main code.
    print(f"Finished concatenating strings from columns {list_of_columns_to_join}.")
    print("Check the 10 first elements from the series:\n")

    try:
        # only works in Jupyter Notebook:
        from IPython.display import display
        display(new_series.head(10))

    except: # regular mode
        print(new_series.head(10))

    return DATASET
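
A minimal sketch of the row-by-row concatenation on a toy dataframe (column names and values are hypothetical). It shows the accumulation loop, where each iteration picks the next column of the list, and a more compact alternative based on `DataFrame.agg`, which is not what the function above uses.

```python
import pandas as pd

# Hypothetical dataframe mixing text and numeric columns
df_example = pd.DataFrame({
    "column1": ["a", "b"],
    "column2": [1, 2],
    "column3": ["x", "y"],
})
columns_to_join = ["column1", "column2", "column3"]
separator = " "

# Accumulation loop: start from the first column, then add separator + next column
joined_loop = df_example[columns_to_join[0]].astype(str)
for col in columns_to_join[1:]:
    joined_loop = joined_loop + separator + df_example[col].astype(str)

# Compact alternative: apply separator.join to each row of the string-converted columns
joined_agg = df_example[columns_to_join].astype(str).agg(separator.join, axis=1)

print(joined_loop.tolist())
```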

# **Function for splitting strings into a list of strings**

In [32]:
def split_strings (df, column_to_analyze, separator = " ", create_new_column = True, new_column_suffix = "_stringSplitted"):
     
    import numpy as np
    import pandas as pd
    
    # column_to_analyze: string (inside quotes), 
    # containing the name of the column that will be analyzed. 
    # e.g. column_to_analyze = "column1" will analyze the column named as 'column1'.
   
    # separator = " " - string containing the separator. Suppose the column contains the
    # string: 'a b c d' on a given row. If the separator is whitespace ' ', 
    # the output will be a list: ['a', 'b', 'c', 'd']: the function splits the string into a list
    # of strings (one list per row) every time it finds the separator.
    
    # create_new_column = True
    # Set create_new_column = True to store the transformed data into a new
    # column. Or set create_new_column = False to overwrite the existing column.
    
    # new_column_suffix = "_stringSplitted"
    # This value has effect only if create_new_column = True.
    # The new column name will be set as column + new_column_suffix. Then, if the original
    # column was "column1" and the suffix is "_stringSplitted", the new column will be named as
    # "column1_stringSplitted".
    # Alternatively, input inside quotes a string with the desired suffix. Recommendation:
    # start the suffix with "_" to separate it from the original name.
    
    if (separator is None):
        # make it a whitespace:
        separator = " "
        
    # Set a local copy of dataframe to manipulate
    DATASET = df.copy(deep = True)
    DATASET[column_to_analyze] = (DATASET[column_to_analyze]).astype(str)
    new_series = DATASET[column_to_analyze].copy()
    
    # Split the strings from new_series, getting a list of strings per row:
    new_series = new_series.str.split(separator)
    
    if (create_new_column):
            
        if (new_column_suffix is None):
            new_column_suffix = "_stringSplitted"

        new_column_name = column_to_analyze + new_column_suffix
        DATASET[new_column_name] = new_series
            
    else:

        DATASET[column_to_analyze] = new_series

    # Now, we are in the main code.
    print(f"Finished splitting strings from column {column_to_analyze}.")
    print("Check the 10 first elements (10 lists) from the series:\n")

    try:
        # only works in Jupyter Notebook:
        from IPython.display import display
        display(new_series.head(10))

    except: # regular mode
        print(new_series.head(10))

    return DATASET
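
A minimal sketch of the splitting step on a toy series (the sample values are hypothetical). Besides the list-per-row output produced by the function, it shows the `expand=True` option of `str.split`, which returns one column per piece; this option is only an illustration and is not used by the function above.

```python
import pandas as pd

phrases = pd.Series(["a b c d", "e f"])

# One list of strings per row
as_lists = phrases.str.split(" ")

# One column per piece; rows with fewer pieces are filled with missing values
as_columns = phrases.str.split(" ", expand=True)

print(as_lists.tolist())
print(as_columns.shape)
```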

# **Function for substituting (replacing or switching) whole strings by different text values (on string variables)**

In [33]:
def switch_strings (df, column_to_analyze, list_of_dictionaries_with_original_strings_and_replacements = [{'original_string': None, 'new_string': None}, {'original_string': None, 'new_string': None}, {'original_string': None, 'new_string': None}, {'original_string': None, 'new_string': None}, {'original_string': None, 'new_string': None}, {'original_string': None, 'new_string': None}, {'original_string': None, 'new_string': None}, {'original_string': None, 'new_string': None}, {'original_string': None, 'new_string': None}, {'original_string': None, 'new_string': None}, {'original_string': None, 'new_string': None}], create_new_column = True, new_column_suffix = "_stringReplaced"):
     
    import numpy as np
    import pandas as pd
    
    # column_to_analyze: string (inside quotes), 
    # containing the name of the column that will be analyzed. 
    # e.g. column_to_analyze = "column1" will analyze the column named as 'column1'.
    
    # list_of_dictionaries_with_original_strings_and_replacements = 
    # [{'original_string': None, 'new_string': None}]
    # This is a list of dictionaries, where each dictionary contains two key-value pairs:
    # the first one contains the original string; and the second one contains the new string
    # that will substitute the original one. The function will loop through all dictionaries in
    # this list, access the values of the keys 'original_string', and search these values on the strings
    # in column_to_analyze. When the value is found, it will be replaced (switched) by the correspondent
    # value in key 'new_string'.
    
    # The object list_of_dictionaries_with_original_strings_and_replacements must be declared as a list, 
    # in brackets, even if there is a single dictionary.
    # Use always the same keys: 'original_string' for the original strings to search on the column 
    # column_to_analyze; and 'new_string', for the strings that will replace the original ones.
    # Notice that this function will not search substrings: it will substitute a value only when
    # there is perfect correspondence between the string in 'column_to_analyze' and 'original_string'.
    # So, the cases (upper or lower) must be the same.
    
    # If you want, you can remove elements (dictionaries) from the list to declare fewer elements;
    # and you can also add more elements (dictionaries) to the lists, if you need to replace more
    # values.
    # Simply put a comma after the last element from the list and declare a new dictionary, keeping the
    # same keys: {'original_string': original_str, 'new_string': new_str}, 
    # where original_str and new_str represent the strings for searching and replacement 
    # (If one of the keys contains None, the new dictionary will be ignored).
    
    # Example:
    # Suppose the column_to_analyze contains the values 'sunday', 'monday', 'tuesday', 'wednesday',
    # 'thursday', 'friday', 'saturday', but you want to obtain data labelled as 'weekend' or 'weekday'.
    # Set: list_of_dictionaries_with_original_strings_and_replacements = 
    # [{'original_string': 'sunday', 'new_string': 'weekend'},
    # {'original_string': 'saturday', 'new_string': 'weekend'},
    # {'original_string': 'monday', 'new_string': 'weekday'},
    # {'original_string': 'tuesday', 'new_string': 'weekday'},
    # {'original_string': 'wednesday', 'new_string': 'weekday'},
    # {'original_string': 'thursday', 'new_string': 'weekday'},
    # {'original_string': 'friday', 'new_string': 'weekday'}]
    
    # create_new_column = True
    # Set create_new_column = True to store the transformed data into a new
    # column. Or set create_new_column = False to overwrite the existing column.
    
    # new_column_suffix = "_stringReplaced"
    # This value has effect only if create_new_column = True.
    # The new column name will be set as column + new_column_suffix. Then, if the original
    # column was "column1" and the suffix is "_stringReplaced", the new column will be named as
    # "column1_stringReplaced".
    # Alternatively, input inside quotes a string with the desired suffix. Recommendation:
    # start the suffix with "_" to separate it from the original name.
    
    
    # Set a local copy of dataframe to manipulate
    DATASET = df.copy(deep = True)
    # Guarantee that the column to analyze was read as string:
    DATASET[column_to_analyze] = (DATASET[column_to_analyze]).astype(str)
    new_series = DATASET[column_to_analyze].copy()
    
    print("ATTENTION: Operations of string strip (removal) or replacement are all case-sensitive. There must be correct correspondence between cases and spaces for the strings being removed or replaced.\n")
     
    # Create the mapping dictionary for the str.replace method:
    mapping_dict = {}
    # Each key of the mapping dict must be a string, whereas the value must be the new string
    # that will replace it.
        
    # Loop through each element on the list list_of_dictionaries_with_original_strings_and_replacements:
    
    for i in range (0, len(list_of_dictionaries_with_original_strings_and_replacements)):
        # from i = 0 to i = len(list_of_dictionaries_with_original_strings_and_replacements) - 1, index of the
        # last element from the list
            
        # pick the i-th dictionary from the list:
        dictionary = list_of_dictionaries_with_original_strings_and_replacements[i]
            
        # access 'original_string' and 'new_string' keys from the dictionary:
        original_string = dictionary['original_string']
        new_string = dictionary['new_string']
        
        # check if they are not None:
        if ((original_string is not None) and (new_string is not None)):
            
            #Guarantee that both are read as strings:
            original_string = str(original_string)
            new_string = str(new_string)
            
            # add them to the mapping dictionary, using the original_string as key and
            # new_string as the correspondent value:
            mapping_dict[original_string] = new_string
    
    # Now, the input list was converted into a dictionary with the correct format for the method.
    # Check if there is at least one key in the dictionary:
    if (len(mapping_dict) > 0):
        # len of a dictionary returns the amount of key:value pairs stored. If nothing is stored,
        # len = 0. dictionary.keys() method (no arguments in parentheses) returns a view of
        # the keys; whereas dictionary.values() method returns a view of the values.
        
        new_series = new_series.replace(mapping_dict)
        # For replacing the whole strings using a mapping dictionary, do not call the str
        # attribute
    
        if (create_new_column):
            
            if (new_column_suffix is None):
                new_column_suffix = "_stringReplaced"

            new_column_name = column_to_analyze + new_column_suffix
            DATASET[new_column_name] = new_series
            
        else:

            DATASET[column_to_analyze] = new_series

        # Report the result of the replacement:
        print(f"Finished replacing the strings according to the mapping: {mapping_dict}.")
        print("Check the 10 first elements from the series:\n")

        try:
            # only works in Jupyter Notebook:
            from IPython.display import display
            display(new_series.head(10))

        except Exception: # regular mode
            print(new_series.head(10))

        return DATASET
    
    else:
        print("Input at least one dictionary containing a pair of original string, in the key \'original_string\', and the correspondent new string as key \'new_string\'.")
        print("The dictionaries must be elements from the list list_of_dictionaries_with_original_strings_and_replacements.\n")
        
        return "error"
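
The list-of-dictionaries input used by the function above is first reduced to a plain mapping, and then whole strings are switched. The following is a minimal sketch of those two steps, not the notebook's implementation: `build_mapping` and `switch_values` are hypothetical helper names, and a plain Python list stands in for the pandas column so that the example runs without third-party packages. With pandas, `Series.replace(mapping)` performs the same whole-value substitution.

```python
def build_mapping(list_of_dictionaries):
    """Convert [{'original_string': a, 'new_string': b}, ...] into {a: b, ...}.

    Dictionaries in which either value is None are skipped.
    """
    mapping = {}
    for entry in list_of_dictionaries:
        original = entry.get('original_string')
        new = entry.get('new_string')
        if original is not None and new is not None:
            mapping[str(original)] = str(new)
    return mapping


def switch_values(values, mapping):
    """Replace whole strings found in mapping; leave all other values unchanged."""
    return [mapping.get(str(value), str(value)) for value in values]


replacements = [
    {'original_string': 'sunday', 'new_string': 'weekend'},
    {'original_string': 'saturday', 'new_string': 'weekend'},
    {'original_string': 'monday', 'new_string': 'weekday'},
    {'original_string': None, 'new_string': None},  # ignored: values are None
]

mapping = build_mapping(replacements)
# The match is case-sensitive and applies to whole strings only,
# so 'Sunday' and 'holiday' are left untouched.
labels = switch_values(['sunday', 'monday', 'Sunday', 'holiday'], mapping)
print(labels)  # ['weekend', 'weekday', 'Sunday', 'holiday']
```

Because the lookup is exact, it is usually worth homogenizing the case and trimming white spaces before switching strings.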

# **Function for string replacement with Machine Learning: find similar strings and replace them by standard strings**

In [34]:
def string_replacement_ml (df, column_to_analyze, mode = 'find_and_replace', threshold_for_percent_of_similarity = 80.0, list_of_dictionaries_with_standard_strings_for_replacement = [{'standard_string': None}, {'standard_string': None}, {'standard_string': None}, {'standard_string': None}, {'standard_string': None}, {'standard_string': None}, {'standard_string': None}, {'standard_string': None}, {'standard_string': None}, {'standard_string': None}, {'standard_string': None}], create_new_column = True, new_column_suffix = "_stringReplaced"):
    
    import numpy as np
    import pandas as pd
    from fuzzywuzzy import process
    
    # column_to_analyze: string (inside quotes), 
    # containing the name of the column that will be analyzed. 
    # e.g. column_to_analyze = "column1" will analyze the column named as 'column1'.
    
    # mode = 'find_and_replace' will find similar strings; and switch them by one of the
    # standard strings if the similarity between them is greater than or equal to the threshold.
    # Alternatively: mode = 'find' will only find the similar strings by calculating the similarity.
    
    # threshold_for_percent_of_similarity = 80.0 - 0.0% means no similarity and 100% means equal strings.
    # The threshold_for_percent_of_similarity is the minimum similarity calculated from the
    # Levenshtein (minimum edit) distance algorithm. This distance represents the minimum number of
    # insertion, substitution or deletion of characters operations that are needed for making two
    # strings equal.
    
    # list_of_dictionaries_with_standard_strings_for_replacement =
    # [{'standard_string': None}]
    # This is a list of dictionaries, where each dictionary contains a single key-value pair:
    # the key must be always 'standard_string', and the value will be one of the standard strings 
    # for replacement: if a given string on the column_to_analyze presents a similarity with one 
    # of the standard strings greater than or equal to the threshold_for_percent_of_similarity, it will be
    # substituted by this standard string.
    # For instance, suppose you have a word written in too many ways, making it difficult to use
    # the function switch_strings: "EU" , "eur" , "Europ" , "Europa" , "Erope" , "Evropa" ...
    # You can use this function to search strings similar to "Europe" and replace them.
    
    # The function will loop through all dictionaries in
    # this list, access the values of the keys 'standard_string', and search these values on the strings
    # in column_to_analyze. When the value is found, it will be replaced (switched) if the similarity
    # is sufficiently high.
    
    # The object list_of_dictionaries_with_standard_strings_for_replacement must be declared as a list, 
    # in brackets, even if there is a single dictionary.
    # Use always the same keys: 'standard_string'.
    # Notice that this function performs fuzzy matching, so it MAY MATCH substrings and strings
    # written with different cases (upper or lower) when these portions or modifications make the
    # strings sufficiently similar to each other.
    
    # If you want, you can remove elements (dictionaries) from the list to declare fewer elements;
    # and you can also add more elements (dictionaries) to the lists, if you need to replace more
    # values.
    # Simply put a comma after the last element from the list and declare a new dictionary, keeping the
    # same key: {'standard_string': other_std_str}, 
    # where other_std_str represents the string for searching and replacement 
    # (If the key contains None, the new dictionary will be ignored).
    
    # Example:
    # Suppose the column_to_analyze contains the values 'California', 'Cali', 'Calefornia', 
    # 'Calefornie', 'Californie', 'Calfornia', 'Calefernia', 'New York', 'New York City', 
    # but you want to obtain data labelled as the state 'California' or 'New York'.
    # Set: list_of_dictionaries_with_standard_strings_for_replacement = 
    # [{'standard_string': 'California'},
    # {'standard_string': 'New York'}]
    
    # ATTENTION: It is advisable to first run the function with mode = 'find' to inspect the similarity
    # scores and choose the best threshold; set it as high as possible, avoiding incorrect substitutions
    # in a gray area; and only then perform the replacement. This avoids leaving original incorrect strings
    # in the output dataset, as well as wrong replacements (replacement by one of the standard strings which
    # is not the correct one).
    
    # create_new_column = True
    # Keep create_new_column = True to store the transformed data into a new
    # column. Or set create_new_column = False to overwrite the existing column.
    
    # new_column_suffix = "_stringReplaced"
    # This value has effect only if create_new_column = True.
    # The new column name will be set as column_to_analyze + new_column_suffix. So, if the original
    # column was "column1" and the suffix is "_stringReplaced", the new column will be named
    # "column1_stringReplaced".
    # Alternatively, input inside quotes a string with the desired suffix. Recommendation:
    # start the suffix with "_" to separate it from the original name.
    
    
    print("Performing fuzzy replacement based on the Levenshtein (minimum edit) distance algorithm.")
    print("This distance represents the minimum number of insertion, substitution or deletion of characters operations that are needed for making two strings equal.\n")
    
    print("This means that substrings or different cases (upper or lower) may be searched and replaced, as long as the similarity threshold is reached.\n")
    
    print("ATTENTION!\n")
    print("It is advisable to first search the similarities (mode = 'find') to find the best similarity threshold.\n")
    print("Set the threshold as high as possible, and only then perform the replacement.\n")
    print("This avoids leaving original incorrect strings in the output dataset, as well as wrong replacements (replacement by one of the standard strings which is not the correct one).\n")
    
    # Set a local copy of dataframe to manipulate
    DATASET = df.copy(deep = True)
    # Guarantee that the column to analyze was read as string:
    DATASET[column_to_analyze] = (DATASET[column_to_analyze]).astype(str)
    new_series = DATASET[column_to_analyze].copy()

    # Get the unique values present in column_to_analyze:
    unique_types = new_series.unique()
    
    # Create the summary_list:
    summary_list = []
        
    # Loop through each element on the list list_of_dictionaries_with_standard_strings_for_replacement:
    
    for i in range (0, len(list_of_dictionaries_with_standard_strings_for_replacement)):
        # from i = 0 to i = len(list_of_dictionaries_with_standard_strings_for_replacement) - 1, index of the
        # last element from the list
            
        # pick the i-th dictionary from the list:
        dictionary = list_of_dictionaries_with_standard_strings_for_replacement[i]
            
        # access 'standard_string' key from the dictionary:
        standard_string = dictionary['standard_string']
        
        # check if it is not None:
        if (standard_string is not None):
            
            # Guarantee that it was read as a string:
            standard_string = str(standard_string)
            
            # Calculate the similarity between each one of the unique_types and standard_string:
            similarity_list = process.extract(standard_string, unique_types, limit = len(unique_types))
            
            # Add the similarity list to the dictionary:
            dictionary['similarity_list'] = similarity_list
            # This is a list of tuples with the format (tested_string, percent_of_similarity_with_standard_string)
            # e.g. ('asiane', 92) for checking similarity with string 'asian'
            
            if (mode == 'find_and_replace'):
                
                # If an invalid value was set for threshold_for_percent_of_similarity, correct it to 80% standard:
                
                if(threshold_for_percent_of_similarity is None):
                    threshold_for_percent_of_similarity = 80.0
                
                # np.isnan is required here: a comparison with == np.nan is always False.
                if(np.isnan(threshold_for_percent_of_similarity) or (threshold_for_percent_of_similarity < 0)):
                    threshold_for_percent_of_similarity = 80.0
                
                list_of_replacements = []
                # Let's replace the matches in the series by the standard_string:
                # Iterate through the list of matches
                for match in similarity_list:
                    # Check whether the similarity score is greater than or equal to threshold_for_percent_of_similarity.
                    # The similarity score is the second element (index 1) from the tuples:
                    if (match[1] >= threshold_for_percent_of_similarity):
                        # If it is, select all rows where the column_to_analyze is spelled as
                        # match[0] (1st Tuple element), and set it to standard_string:
                        boolean_filter = (new_series == match[0])
                        new_series.loc[boolean_filter] = standard_string
                        print(f"Found {match[1]}% of similarity between {match[0]} and {standard_string}.")
                        print(f"Then, {match[0]} was replaced by {standard_string}.\n")
                        
                        # Add match to the list of replacements:
                        list_of_replacements.append(match)
                
                # Add the list_of_replacements to the dictionary, if its length is higher than zero:
                if (len(list_of_replacements) > 0):
                    dictionary['list_of_replacements_by_std_str'] = list_of_replacements
            
            # Add the dictionary to the summary_list:
            summary_list.append(dictionary)
      
    # Now, let's replace the original column or create a new one if mode was set as replace:
    if (mode == 'find_and_replace'):
    
        if (create_new_column):
            
            if (new_column_suffix is None):
                new_column_suffix = "_stringReplaced"

            new_column_name = column_to_analyze + new_column_suffix
            DATASET[new_column_name] = new_series
            
        else:

            DATASET[column_to_analyze] = new_series

        # Report the result of the replacement:
        print("Finished replacing the strings by the provided standards. Returning the new dataset and a summary list.\n")
        print("In summary_list, you can check the calculated similarities in keys \'similarity_list\' from the dictionaries.\n")
        print("The similarity list is a list of tuples, where the first element is the string compared against the value on key \'standard_string\'; and the second element is the similarity score, the percent of similarity between the tested and the standard string.\n")
        print("Check the 10 first elements from the new series, with strings replaced:\n")
        
        try:
            # only works in Jupyter Notebook:
            from IPython.display import display
            display(new_series.head(10))

        except Exception: # regular mode
            print(new_series.head(10))
    
    else:
        
        print("Finished mapping similarities. Returning the original dataset and a summary list.\n")
        print("Check the similarities below, in keys \'similarity_list\' from the dictionaries.\n")
        print("The similarity list is a list of tuples, where the first element is the string compared against the value on key \'standard_string\'; and the second element is the similarity score, the percent of similarity between the tested and the standard string.\n")
        
        try:
            # only works in Jupyter Notebook:
            from IPython.display import display
            display(summary_list)
        except Exception: # regular mode
            print(summary_list)
    
    return DATASET, summary_list
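
The Levenshtein (minimum edit) distance mentioned in the function above can be illustrated with a short, dependency-free sketch. This is an assumption-laden illustration, not the scoring used by `fuzzywuzzy`: `levenshtein_distance` and `percent_similarity` are hypothetical helper names, and the percentage formula (share of the longer string that does not need editing) is chosen here only for clarity. The scores returned by `process.extract` come from the library's own scorer and will generally differ from these values.

```python
def levenshtein_distance(a, b):
    """Minimum number of single-character insertions, deletions or
    substitutions needed to turn string a into string b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance between a[:i] and the empty string
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution (or match)
        previous = current
    return previous[-1]


def percent_similarity(a, b):
    """Illustrative score (an assumption): percent of the longer string
    that does not have to be edited."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0  # two empty strings are identical
    return 100.0 * (longest - levenshtein_distance(a, b)) / longest


standard_string = 'California'
for candidate in ['Calefornia', 'Cali']:
    distance = levenshtein_distance(candidate, standard_string)
    score = percent_similarity(candidate, standard_string)
    print(candidate, distance, score)
# Calefornia 1 90.0
# Cali 6 40.0
```

With a threshold of 80, this simplified score would replace 'Calefornia' but keep 'Cali', which shows why inspecting the similarities before replacing is recommended.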

# **Function for searching for Regular Expression (RegEx) within a string column**

In [35]:
class regex_help:

    def __init__ (self, start_helper = True, helper_screen = 0):
        
        # from DataCamp course Regular Expressions in Python
        # https://www.datacamp.com/courses/regular-expressions-in-python#!

        self.start_helper = start_helper
        self.helper_screen = helper_screen
        
        self.helper_menu_1 = """

Regular Expressions (RegEx) Helper
                
Input the number in the text box and press enter to visualize help and examples for a topic:

    1. regex basic theory and most common metacharacters
    2. regex quantifiers
    3. regex anchoring and finding
    4. regex greedy and non-greedy search
    5. regex grouping and capturing
    6. regex alternating and non-capturing groups
    7. regex backreferences
    8. regex lookaround
    9. print all topics at once
    10. Finish regex helper
    
    """
        
        # regex basic theory and most common metacharacters
        # Raw string: keeps backslashes such as \d and \s literal, avoiding invalid-escape warnings.
        self.help_text_1 = r"""
REGular EXpression or regex:
String containing a combination of normal characters and special metacharacters that
describes patterns to find text or positions within a text.

Example:

r'st\d\s\w{3,10}'
- In Python, the r at the beginning indicates a raw string. It is always advisable to use it.
- We said that a regex contains normal characters, or literal characters we already know. 
    - The normal characters match themselves. 
    - In the case shown above, 'st' exactly matches an 's' followed by a 't'.

- Most important metacharacters:
    - \d: digit (number);
    - \D: non-digit;
    - \s: whitespace;
    - \s+: one or more consecutive whitespaces.
    - \S: non-whitespace;
    - \w: (word) character;
    - \W: non-word character.
    - {N,M}: indicates that the character on the left appears from N to M consecutive times (no space after the comma).
        - \w{3,10}: a word character that appears 3, 4, 5,..., or 10 consecutive times.
    - {N}: indicates that the character on the left appears exactly N consecutive times.
        - \d{4}: a digit appears 4 consecutive times.
    - {N,}: indicates that the character appears at least N times.
        - \d{4,}: a digit appears 4 or more times.
        - phone_number = "John: 1-966-847-3131 Michelle: 54-908-42-42424"
        - re.findall(r"\d{1,2}-\d{3}-\d{2,3}-\d{4,}", phone_number) - returns: ['1-966-847-3131', '54-908-42-42424']

ATTENTION: Using metacharacters in regular expressions will allow you to match types of characters such as digits. 
- You can encounter many forms of whitespace such as tabs, space or new line. 
- To make sure you match all of them always specify whitespaces as \s.

re module: Python standard library module to search regex within individual strings.

- .findall method: search all occurrences of the regex within the string, returning a list of strings.
- Syntax: re.findall(r"regex", string)
    - Example: re.findall(r"#movies", "Love #movies! I had fun yesterday going to the #movies")
        - Returns: ['#movies', '#movies']

- .split method: splits the string at each occurrence of the regex, returning a list of strings.
- Syntax: re.split(r"regex", string)
    - Example: re.split(r"!", "Nice Place to eat! I'll come back! Excellent meat!")
        - Returns: ['Nice Place to eat', " I'll come back", ' Excellent meat', '']

- .sub method: replace one or many matches of the regex with a given string (returns a replaced string).
- Syntax: re.sub(r"regex", new_substring, original_string)
    - Example: re.sub(r"yellow", "nice", "I have a yellow car and a yellow house in a yellow neighborhood")
    - Returns: 'I have a nice car and a nice house in a nice neighborhood'

- .search and .match methods: they have the same syntax and are used to find a match. 
    - Both methods return an object with the match found. 
    - The difference is that .match is anchored at the beginning of the string.
- Syntax: re.search(r"regex", string) and re.match(r"regex", string)
    - Example 1: re.search(r"\d{4}", "4506 people attend the show")
    - Returns: <re.Match object; span=(0, 4), match='4506'>
    - re.match(r"\d{4}", "4506 people attend the show")
    - Returns: <re.Match object; span=(0, 4), match='4506'>
        - In this example, we use both methods to find a digit appearing four times. 
        - Both methods return an object with the match found.
    
    - Example 2: re.search(r"\d+", "Yesterday, I saw 3 shows")
    - Returns: <re.Match object; span=(17, 18), match='3'>
    - re.match(r"\d+","Yesterday, I saw 3 shows")
    - Returns: None
        - In this example, we used them to find a match for a digit. 
        - In this case, .search finds a match, but .match does not. 
        - This is because the first characters do not match the regex.

- .group method: detailed in Section 7 (Backreferences).
    - Retrieves the groups captured.
- Syntax: searched_string = re.search(r"regex", string)
    searched_string.group(N) - returns N-th group captured (group 0 is the whole match).
    
    Example: text = "Python 3.0 was released on 12-03-2008."
    information = re.search(r'(\d{1,2})-(\d{2})-(\d{4})', text)
    information.group(3) - returns: '2008'
- .group can only be used with .search and .match methods.

Examples of regex:

1. re.findall(r"User\d", "The winners are: User9, UserN, User8")
    ['User9', 'User8']
2. re.findall(r"User\D", "The winners are: User9, UserN, User8")
    ['UserN']
3. re.findall(r"User\w", "The winners are: User9, UserN, User8")
    ['User9', 'UserN', 'User8']
4. re.findall(r"\W\d", "This skirt is on sale, only $5 today!")
    ['$5']
5. re.findall(r"Data\sScience", "I enjoy learning Data Science")
    ['Data Science']
6. re.sub(r"ice\Scream", "ice cream", "I really like ice-cream")
    'I really like ice cream'

7. regex that matches user mentions that start with @ and follow the pattern @robot3!.

regex = r"@robot\d\W"

8. regex that matches the number of user mentions given as, for example: User_mentions:9.

regex = r"User_mentions:\d"

9. regex that matches the number of likes given as, for example, likes: 5.

regex = r"likes:\s\d"

10. regex that matches the number of retweets given as, for example, number of retweets: 4.

regex = r"number\sof\sretweets:\s\d"

11. regex that matches a sentence separator formed by a non-word character, a digit, the word break, and another non-word character, e.g. &4break!.

regex_sentence = r"\W\dbreak\W"

12. regex that matches the pattern #newH

regex_words = r"\Wnew\w"

"""

        # regex quantifiers
        # Raw string: keeps backslashes such as \d and \s literal, avoiding invalid-escape warnings.
        self.help_text_2 = r"""
Quantifiers: 
A metacharacter that tells the regex engine how many times to match a character immediately to its left.

    1. +: Once or more times.
        - text = "Date of start: 4-3. Date of registration: 10-04."
        - re.findall(r"\d+-\d+", text) - returns: ['4-3', '10-04']
        - Again, \s+ represents one or more consecutive whitespaces.
    2. *: Zero times or more.
        - my_string = "The concert was amazing! @ameli!a @joh&&n @mary90"
        - re.findall(r"@\w+\W*\w+", my_string) - returns: ['@ameli!a', '@joh&&n', '@mary90']
    3. ?: Zero times or once.
        - text = "The color of this image is amazing. However, the colour blue could be brighter."
        - re.findall(r"colou?r", text) - returns: ['color', 'colour']
    
The quantifier refers to the character immediately on the left:
- r"apple+" : + applies to 'e' and not to 'apple'.

Examples of regex:

1. Most of the times, links start with 'http' and do not contain any whitespace, e.g. https://www.datacamp.com. 
- regex to find all the matches of http links appearing:
    - regex = r"http\S+"
    - \S is very useful to use when you know a pattern does not contain spaces and you have reached the end when you do find one.

2. User mentions in Twitter start with @ and can have letters and numbers only, e.g. @johnsmith3.
- regex to find all the matches of user mentions appearing:
    - regex = r"@\w*\d*"

3. regex that finds all dates in a format similar to 27 minutes ago or 4 hours ago.
- regex = r"\d{1,2}\s\w+\sago"

4. regex that finds all dates in a format similar to 23rd june 2018.
- regex = r"\d{1,2}\w{2}\s\w+\s\d{4}"

5. regex that finds all dates in a format similar to 1st september 2019 17:25.
- regex = r"\d{1,2}\w{2}\s\w+\s\d{4}\s\d{1,2}:\d{2}"

6. Hashtags start with a # symbol and contain letters and numbers but never whitespace.
- regex that matches the described hashtag pattern.
    - regex = r"#\w+"
    
"""

        # regex anchoring and finding
        # Raw string: keeps backslashes such as \d and \s literal, avoiding invalid-escape warnings.
        self.help_text_3 = r"""
- Anchoring and Finding Metacharacters

    1. . (dot): Match any character (except newline).
        - my_links = "Just check out this link: www.amazingpics.com. It has amazing photos!"
        - re.findall(r"www.+com", my_links) - returns: ['www.amazingpics.com']
            - The dot . metacharacter is very useful when we want to match all repetitions of any character. 
            - However, we need to be very careful how we use it.
    2. ^: Anchoring on start of the string.
        - my_string = "the 80s music was much better that the 90s"
        - If we do re.findall(r"the\s\d+s", my_string) - returns: ['the 80s', 'the 90s']
        - Using ^: re.findall(r"^the\s\d+s", my_string) - returns: ['the 80s']
    3. $: Anchoring at the end of the string.
        - my_string = "the 80s music hits were much better that the 90s"
        - re.findall(r"the\s\d+s$", my_string) - returns: ['the 90s']
    4. \: Escape special characters.
        - my_string = "I love the music of Mr.Go. However, the sound was too loud."
            - re.split(r".\s", my_string) - returns: ['', 'lov', 'th', 'musi', 'o', 'Mr.Go', 'However', 'th', 'soun', 'wa', 'to', 'loud.']
            - re.split(r"\.\s", my_string) - returns: ['I love the music of Mr.Go', 'However, the sound was too loud.']
    5. |: OR Operator
        - my_string = "Elephants are the world's largest land animal! I would love to see an elephant one day"
        - re.findall(r"Elephant|elephant", my_string) - returns: ['Elephant', 'elephant']
    6. []: set of characters representing the OR Operator.
        Example 1 - my_string = "Yesterday I spent my afternoon with my friends: MaryJohn2 Clary3"
        - re.findall(r"[a-zA-Z]+\d", my_string) - returns: ['MaryJohn2', 'Clary3']
        Example 2 - my_string = "My&name&is#John Smith. I%live$in#London."
        - re.sub(r"[#$%&]", " ", my_string) - returns: 'My name is John Smith. I live in London.'
        
        Note 1: within brackets, the characters to be found should not be separated, as in [#$%&].
            - Whitespaces or other separators would be interpreted as characters to be found.
        Note 2: [a-z] represents all lowercase letters from 'a' to 'z'.
                - [A-Z] represents all uppercase letters from 'A' to 'Z'.
                - Since lower and uppercase are different, we must declare [a-zA-Z] or [A-Za-z] to capture all letters.
                - [0-9] represents all digits from 0 to 9.
                - Something like [a-zA-Z0-9] or [a-z0-9A-Z] will match all letters and all digits.
    
    7. [^ ]: placing ^ at the start of a set negates it, matching any character that is NOT in the set.
        - my_links = "Bad website: www.99.com. Favorite site: www.hola.com"
        - re.findall(r"www[^0-9]+com", my_links) - returns: ['www.hola.com']

Examples of regex:

1. You want to find names of files that appear at the start of the string; 
    - always start with a sequence of 2 or 3 upper or lowercase vowels (a e i o u); 
    - and always finish with the txt ending.
        - Write a regex that matches the pattern of the text file names, e.g. aemyfile.txt.
        # . = match any character
        regex = r"^[aeiouAEIOU]{2,3}.+txt"

2. When a user signs up on the company website, they must provide a valid email address.
    - The company puts some rules in place to verify that the given email address is valid:
    - The first part can contain: Upper A-Z or lowercase letters a-z; 
    - Numbers; Characters: !, #, %, &, *, $, . Must have @. Domain: Can contain any word characters;
    - But only .com ending is allowed. 
        - Write a regular expression to match valid email addresses.
        - Match the regex to the elements contained in emails, and print out the message indicating if it is a valid email or not 
    
    # Write a regex to match a valid email address
    regex = r"^[A-Za-z0-9!#%&*$.]+@\w+\.com"

    for example in emails:
        # Match the regex to the string
        if re.match(regex, example):
            # Complete the format method to print out the result
            print("The email {email_example} is a valid email".format(email_example=example))
        else:
            print("The email {email_example} is invalid".format(email_example=example))
    
    # Notice that we used the .match() method. 
    # The reason is that we want to match the pattern from the beginning of the string.

3. Rules in order to verify valid passwords: it can contain lowercase a-z and uppercase letters A-Z;
    - It can contain numbers; it can contain the symbols: *, #, $, %, !, &, .
    - It must be at least 8 characters long but not more than 20.
        - Write a regular expression to check if the passwords are valid according to the description.
        - Search the elements in the passwords list to find out if they are valid passwords.
        - Print out the message indicating if it is a valid password or not, complete .format() statement.
    
    # Write a regex to check if the password is valid
    # The ^ and $ anchors are needed to enforce the maximum length of 20 characters.
    regex = r"^[a-z0-9A-Z*#$%!&.]{8,20}$"

    for example in passwords:
        # Scan the strings to find a match
        if re.match(regex, example):
            # Complete the format method to print out the result
            print("The password {pass_example} is a valid password".format(pass_example=example))
        else:
            print("The password {pass_example} is invalid".format(pass_example=example))

"""

        # regex greedy and non-greedy search
        self.help_text_4 = """
There are two types of matching methods: greedy and non-greedy (also called lazy) operators. 

Greedy operators
- The standard quantifiers are greedy by default, meaning that they will attempt to match as many characters as possible.
    - Standard quantifiers: *, +, ?, {num,num}
    - Example: re.match(r"\d+", "12345bcada") - returns: <re.Match object; span=(0, 5), match='12345'>
    - We can explain this in the following way: our quantifier will start by matching the first digit found, '1'. 
    - Because it is greedy, it will keep going to find 'more' digits and stop only when no other digit can be matched, returning '12345'.
- If the greedy quantifier has matched so many characters that the rest of the pattern cannot match, it will backtrack, giving up characters matched earlier one at a time and trying again. 
- Backtracking is like driving a car without a map. If you drive through a path and hit a dead end street, you need to backtrack along your road to an earlier point to take another street. 
    - Example: re.match(r".*hello", "xhelloxxxxxx") - returns: <re.Match object; span=(0, 6), match='xhello'>
    - We use the greedy quantifier .* to find anything, zero or more times, followed by the letters "h" "e" "l" "l" "o". 
    - We can see here that it returns the pattern 'xhello'. 
    - So our greedy quantifier will start by matching as much as possible, the entire string. 
    - Then, it tries to match the h, but there are no characters left. So it backtracks, giving up one matched character. 
    - Trying again, it still doesn't match the h, so it backtracks one more step repeatedly until it finally matches the h in the regex, and the rest of the characters.

Non-greedy (lazy) operators
- Because they have lazy behavior, non-greedy quantifiers will attempt to match as few characters as needed, returning the shortest match. 
- To obtain non-greedy quantifiers, we can append a question mark at the end of the greedy quantifiers to convert them into lazy. 
    - Example: re.match(r"\d+?", "12345bcada") - returns: <re.Match object; span=(0, 1), match='1'>
    - Now, our non-greedy quantifier will return the pattern '1'. 
    - In this case, our quantifier will start by matching the first digit found, '1'. 
    - Because it is non-greedy, it will stop there, as we stated that we want 'one or more', and 1 is as few as needed.
- Non-greedy quantifiers also backtrack. 
- In this case, if they have matched so few characters that the rest of the pattern cannot match, they backtrack, expanding the match one character at a time, and try again. 
- Consider the previous example again, this time with the lazy quantifier .*?: re.match(r".*?hello", "xhelloxxxxxx"). Interestingly, we obtain the same match 'xhello'. 
- However, the way this match is obtained differs: the lazy quantifier first matches as little as possible (nothing), leaving the entire string unmatched. 
- Then it tries to match the 'h', but it doesn't work. 
- So, it backtracks, matching one more character, the 'x'. Then, it tries again, this time matching the 'h', and afterwards, the rest of the regex.

- Greedy quantifiers lead to longer matches, and sometimes that is exactly what we need: because lazy quantifiers match as few characters as possible, they may return a shorter match than we expected.
- On the other hand, greedy quantifiers may return longer matches than desired.
    - Example: if you want to extract a word starting with 'a' and ending with 'e' from the string 'I like apple pie', you may think that applying the greedy regex r"a.+e" will return 'apple'. 
    - However, your match will be 'apple pie'. A way to overcome this is to make the quantifier lazy by appending '?' (r"a.+?e"), which will return 'apple'.
    - Choosing between greedy and lazy quantifiers is a very important consideration to keep in mind when handling data for text mining.
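
- A minimal sketch of the greedy versus lazy behavior described above, using the same 'I like apple pie' string. The variable names are only illustrative:

```python
import re

phrase = "I like apple pie"

# Greedy: .+ takes as much as possible, then backtracks to the LAST 'e'
greedy_match = re.search(r"a.+e", phrase).group(0)

# Lazy: .+? takes as little as possible, stopping at the FIRST 'e' that fits
lazy_match = re.search(r"a.+?e", phrase).group(0)

print(greedy_match)  # apple pie
print(lazy_match)    # apple
```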

Examples of regex:

1. You want to extract the number contained in the sentence 'I was born on April 24th'. 
    - A lazy quantifier will make the regex return 2 and 4, because they will match as few characters as needed. 
    - However, a greedy quantifier will return the entire 24 due to its need to match as much as possible.

    1.1. Use a lazy quantifier to match all numbers that appear in the variable sentiment_analysis:
    numbers_found_lazy = re.findall(r"[0-9]+?", sentiment_analysis)
    - Output: ['5', '3', '6', '1', '2']
    
    1.2. Now, use a greedy quantifier to match all numbers that appear in the variable sentiment_analysis.
    numbers_found_greedy = re.findall(r"[0-9]+", sentiment_analysis)
    - Output: ['536', '12']

2.1. Use a greedy quantifier to match text that appears within parentheses in the variable sentiment_analysis.
    
    sentences_found_greedy = re.findall(r"\(.+\)", sentiment_analysis)
    - Output: ["(They were so cute) a few yrs ago. PC crashed, and now I forget the name of the site ('I'm crying)"]

2.2. Now, use a lazy quantifier to match text that appears within parentheses in the variable sentiment_analysis.

    sentences_found_lazy = re.findall(r"\(.+?\)", sentiment_analysis)
    - Output: ["(They were so cute)", "('I'm crying)"]
    
"""

        # regex grouping and capturing
        self.help_text_5 = r"""
Capturing groups in regular expressions
- Let's say that we have the following text:
    
    text = "Clary has 2 friends who she spends a lot of time with. Susan has 3 brothers while John has 4 sisters."
    
- We want to extract information about a person, how many and which type of relationships they have. 
- So, we want to extract Clary 2 friends, Susan 3 brothers and John 4 sisters.
- If we do: re.findall(r'[A-Za-z]+\s\w+\s\d+\s\w+', text), the output will be: ['Clary has 2 friends', 'Susan has 3 brothers', 'John has 4 sisters']
    - The output is quite close, but we do not want the word 'has'.

- We start simple, by trying to extract only the names. We can place parentheses to group those characters, capture them, and retrieve only that group:
    - re.findall(r'([A-Za-z]+)\s\w+\s\d+\s\w+', text) - returns: ['Clary', 'Susan', 'John']
- Actually, we can place parentheses around the three groups that we want to capture. 
    - re.findall(r'([A-Za-z]+)\s\w+\s(\d+)\s(\w+)', text)
    
    - Each group will receive a number: 
        - The entire expression will always be group 0. 
        - The first group: 1; the second: 2; and the third: 3.
    
    - The result returned is: [('Clary', '2', 'friends'), ('Susan', '3', 'brothers'), ('John', '4', 'sisters')]
        - We got a list of tuples: 
            - The first element of each tuple is the match captured corresponding to group 1. 
            - The second, to group 2. The last, to group 3.
    
    - We can use capturing groups to match a specific subpattern in a pattern. 
    - We can use this information for retrieving the groups by numbers; or to organize data.
        - Example: pets = re.findall(r'([A-Za-z]+)\s\w+\s(\d+)\s(\w+)', "Clary has 2 dogs but John has 3 cats")
                    pets[0][0] == 'Clary'
                    - In the code, we placed the parentheses to capture the name of the owner, the number and which type of pets each one has. 
                    - We can access the information retrieved by using indexing and slicing as seen in the code. 
   
- Capturing groups have one important feature. 
    - Remember that quantifiers apply to the character immediately to the left. 
    - So, we can place parentheses to group characters and then apply the quantifier to the entire group. 
    
    Example: re.search(r"(\d[A-Za-z])+", "My user name is 3e4r5fg")
        - returns: <re.Match object; span=(16, 22), match='3e4r5f'>
        - In the code, we have placed parentheses to match the group containing a number and any letter. 
        - We applied the plus quantifier to specify that we want this group repeated once or more times. 
    
- ATTENTION: It's not the same to capture a repeated group AND to repeat a capturing group. 
    
    my_string = "My lucky numbers are 8755 and 33"
    - re.findall(r"(\d)+", my_string) - returns: ['5', '3']
    - re.findall(r"(\d+)", my_string) - returns: ['8755', '33']
    
    - In the first code, we repeat a capturing group that contains a single digit, once or more times. 
        - The whole run of digits is matched, but a repeated capturing group keeps only its last repetition. 
        - So, we get '5' (the last digit of 8755) and '3' (the last digit of 33) as output. 
    - In the second code, we capture a group containing one or more repetitions of a digit, so each whole number is returned. 
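
- The difference can be checked with a short sketch like the one below (same string as above; the variable names are illustrative):

```python
import re

my_string = "My lucky numbers are 8755 and 33"

# (\d)+ repeats a one-digit group: only the last repetition is kept
last_digits = re.findall(r"(\d)+", my_string)

# (\d+) captures the whole run of digits in a single group
whole_numbers = re.findall(r"(\d+)", my_string)

print(last_digits)    # ['5', '3']
print(whole_numbers)  # ['8755', '33']
```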

- Placing a subpattern inside parentheses captures that content and stores it temporarily in memory, so that it can be reused later.

Examples of regex:

1. You want to extract the first part of the email. E.g. if you have the email marysmith90@gmail.com, you are only interested in marysmith90.
- You need to match the entire email, to make sure that you extract only names present in emails. Also, you are only interested in names containing uppercase letters (e.g. A, B, Z), lowercase letters (e.g. a, d, z) and numbers.
- Regex to match the email, capturing only the name part (the part that appears before the @):
    - regex_email = r"([a-z0-9A-Z]+)@\S+"

2. Text follows a pattern: "Here you have your boarding pass LA4214 AER-CDB 06NOV."
- You need to extract the information about the flight: 
    - The two letters indicate the airline (e.g LA); the 4 numbers are the flight number (e.g. 4214);
    - The three letters correspond to the departure (e.g AER); the destination (CDB); the date (06NOV) of the flight.
    - All letters are always uppercase.

- Regular expression to match and capture all the flight information required.
- Find all the matches corresponding to each piece of information about the flight. Assign it to flight_matches.
- Complete the format method with the elements contained in flight_matches: 
    - In the first line print the airline and the flight number. 
    - In the second line, the departure and destination. In the third line, the date.

# Import re
import re

# Write regex to capture information of the flight
regex = r"([A-Z]{2})(\d{4})\s([A-Z]{3})-([A-Z]{3})\s(\d{2}[A-Z]{3})"

# Find all matches of the flight information
flight_matches = re.findall(regex, flight)
    
#Print the matches
print("Airline: {} Flight number: {}".format(flight_matches[0][0], flight_matches[0][1]))
print("Departure: {} Destination: {}".format(flight_matches[0][2], flight_matches[0][3]))
print("Date: {}".format(flight_matches[0][4]))

    - findall() returns a list of tuples. 
    - The nth element of each tuple is the element corresponding to group n. 
    - This provides us with an easy way to access and organize our data.

"""

        # regex alternating and non-capturing groups
        self.help_text_6 = r"""
Alternating and non-capturing groups

- Vertical bar or pipe operator
    - Suppose we have the following string, and we want to find all matches for pet names. 
    - We can use the pipe operator to specify that we want to match cat or dog or bird:
        - my_string = "I want to have a pet. But I don't know if I want a cat, a dog or a bird."
        - re.findall(r"cat|dog|bird", my_string) - returns: ['cat', 'dog', 'bird']
    
     - Now, we changed the string a little bit, and once more we want to find all the pet names, but only those that come after a number and a whitespace. 
     - So, if we specify this again with the pipe operator, we get the wrong output: 
        - my_string = "I want to have a pet. But I don't know if I want 2 cats, 1 dog or a bird."
        - re.findall(r"\d+\scat|dog|bird", my_string) - returns: ['2 cat', 'dog', 'bird']
     
     - That is because the pipe operator works by comparing everything that is to its left (digit whitespace cat) with everything to the right, dog.
     - In order to solve this, we can use alternation. 
         - In simpler terms, we can use parentheses again to group the optional characters:
         
         - my_string = "I want to have a pet. But I don't know if I want 2 cats, 1 dog or a bird."
         - re.findall(r"\d+\s(cat|dog|bird)", my_string) - returns: ['cat', 'dog']
         
         In the code, now the parentheses are added to group cat or dog or bird.
    
    - In the previous example, we may also want to match the number. 
    - In that case, we need to place parentheses to capture the digit group:
    
        - my_string = "I want to have a pet. But I don't know if I want 2 cats, 1 dog or a bird."
        - re.findall(r"(\d+)\s(cat|dog|bird)", my_string) - returns: [('2', 'cat'), ('1', 'dog')]
    
        - In the code, we now use two pairs of parentheses (with the + quantifier inside the first pair, so that the whole number is captured) and we use findall in the string, so we get a list with two tuples.
    
- Non-capturing groups
    - Sometimes, we need to group characters using parentheses, but we are not going to reference back to this group. 
    - For these cases, there is a special type of group called a non-capturing group. 
    - To use one, we just need to add a question mark and a colon inside the parentheses, before the regex.
    
    regex = r"(?:regex)"
    
    - Example: we have the following string, and we want to find all matches of numbers. 
    
        my_string = "John Smith: 34-34-34-042-980, Rebeca Smith: 10-10-10-434-425"
    
    - We see that the pattern consists of two digits and a dash, repeated three times. After that, three digits, a dash, and three more digits. 
    - We want to extract only the last part, without the first repeated elements. 
    - We need to group the first two elements to indicate repetitions, but we do not want to capture them. 
    - So, we use non-capturing groups to group \d repeated two times and dash. Then we indicate this group should be repeated three times. Then, we group \d repeated three times, dash, \d repeated three times:
    
        re.findall(r"(?:\d{2}-){3}(\d{3}-\d{3})", my_string) - returns: ['042-980', '434-425']
    
- Alternation
    - We can combine non-capturing groups and alternation together. 
    - Remember that alternation implies using parentheses and the pipe operator to group optional characters. 
    - Let's suppose that we have the following string. We want to match all the numbers of the day. 
    
        my_date = "Today is 23rd May 2019. Tomorrow is 24th May 19."
    
    - We know that they are followed by 'th' or 'rd', but we only want to capture the number, and not the letters that follow it. 
    - We write our regex to capture inside parentheses \d repeated once or more times. Then, we can use a non-capturing group. 
    - Inside, we use the pipe operator to choose between 'th' or 'rd':
    
        re.findall(r"(\d+)(?:th|rd)", my_date) - returns: ['23', '24']
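
    - A short sketch contrasting a capturing and a non-capturing suffix group on the same string (the variable names are illustrative):

```python
import re

my_date = "Today is 23rd May 2019. Tomorrow is 24th May 19."

# Capturing the suffix too: findall returns tuples with both groups
with_suffix = re.findall(r"(\d+)(th|rd)", my_date)

# Non-capturing suffix: it still constrains the match, but is not returned
days_only = re.findall(r"(\d+)(?:th|rd)", my_date)

print(with_suffix)  # [('23', 'rd'), ('24', 'th')]
print(days_only)    # ['23', '24']
```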

- Non-capturing groups are very often used together with alternation. 
- Sometimes, you have optional patterns and you need to group them. 
- However, you are not interested in keeping them. It's a nice feature of regex.

Examples of regex:

1. Sentiment analysis project: firstly, you want to identify positive tweets about movies and concerts.
- You plan to find all the sentences that contain the words 'love', 'like', or 'enjoy', and capture that word. 
- You will limit the tweets by focusing on those that contain the words 'movie' or 'concert' by keeping the word in another group. 
- You will also save the movie or concert name.
    - For example, if you have the sentence: 'I love the movie Avengers', you match and capture 'love'. 
    - You need to match and capture 'movie'. Afterwards, you match and capture anything until the dot.
    - The list sentiment_analysis contains the text of tweets.
- Regular expression to capture the words 'love', 'like', or 'enjoy'; 
    - match and capture the words 'movie' or 'concert'; 
    - match and capture anything appearing until the '.'.

    regex_positive = r"(love|like|enjoy).+?(movie|concert)\s(.+?)\."

    - The pipe operator works by comparing everything that is to its left with everything to the right. 
    - Grouping optional patterns is the way to get the correct result.

2. After finding positive tweets, you want to do it for negative tweets. 
- Your plan now is to find sentences that contain the words 'hate', 'dislike' or 'disapprove'. 
- You will again save the movie or concert name. 
- You will get the tweet containing the words 'movie' or 'concert', but this time, you do not plan to save the word.
    - For example, if you have the sentence: 'I dislike the movie Avengers a lot.', you match and capture 'dislike'. 
    - You will match, but not capture, the word 'movie'. Afterwards, you match and capture anything until the dot.
- Regular expression to capture the words 'hate', 'dislike' or 'disapprove'; 
    - Match, but do not capture, the words 'movie' or 'concert'; 
    - Match and capture anything appearing until the '.'.
    
    regex_negative = r"(hate|dislike|disapprove).+?(?:movie|concert)\s(.+?)\."
        
        """

        # regex backreferences
        self.help_text_7 = r"""
Backreferences
- This section shows how we can backreference capturing groups.

Numbered groups
- Imagine we come across this text, and we want to extract the date: 
    
    text = "Python 3.0 was released on 12-03-2008. It was a major revision of the language. Many of its major features were backported to Python 2.6.x and 2.7.x version series."
    
- We want to extract only the numbers. So, we can place parentheses in a regex to capture these groups:
    
    regex = r"(\d{1,2})-(\d{1,2})-(\d{4})"

- We have also seen that each of these groups receives a number. 
- The whole expression is group 0; the first group, 1; and so on.

- Let's use .search to match the pattern to the text. 
- To retrieve the groups captured, we can use the method .group specifying the number of a group we want. 

Again: .group method retrieves the groups captured.
    - Syntax: searched_string = re.search(r"regex", string)
    searched_string.group(N) - returns N-th group captured (group 0 is the entire match).

Example: text = "Python 3.0 was released on 12-03-2008."

    information = re.search(r'(\d{1,2})-(\d{2})-(\d{4})', text)
    information.group(3) - returns: '2008'
    information.group(0) - returns: '12-03-2008' (the entire match).

- .group is a method of match objects, such as those returned by .search and .match; .findall returns strings or tuples, which have no .group method.

Named groups
- We can also give names to our capturing groups. 
- Inside the parentheses, we write '?P', and the name inside angle brackets:

    regex = r"(?P<name>regex)"

- Let's say we have the following string, and we want to match the name of the city and zipcode in different groups. 
- We can use capturing groups and assign them the name 'city' and 'zipcode'. 
- We retrieve the information by using .group, and we indicate the name of the group. 
    
    text = "Austin, 78701"
    cities = re.search(r"(?P<city>[A-Za-z]+).*?(?P<zipcode>\d{5})", text)
    cities.group("city") - returns: 'Austin'
    cities.group("zipcode") - returns: '78701'
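
- As a complementary sketch, named groups can also be retrieved all at once with the match object's .groupdict() method, and they remain accessible by number (the variable names by_name and first_group are illustrative):

```python
import re

text = "Austin, 78701"
cities = re.search(r"(?P<city>[A-Za-z]+).*?(?P<zipcode>\d{5})", text)

# Dictionary mapping each group name to the text it captured
by_name = cities.groupdict()

# Named groups are still numbered according to their position
first_group = cities.group(1)

print(by_name)      # {'city': 'Austin', 'zipcode': '78701'}
print(first_group)  # Austin
```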

Backreferences
- There is another way to backreference groups. 
- In fact, the matched group can be reused inside the same regex or outside for substitution. 
- We can do this using backslash and the number of the group:

    regex = r'(\d{1,2})-(\d{2})-(\d{4})'
    
    - we can backreference the groups as:
        (\d{1,2}): (\1);
        (\d{2}): (\2)
        (\d{4}): (\3)

- Example: we have the following string, and we want to find all matches of repeated words. 
- In the code, we specify that we want to capture a sequence of word characters, then a whitespace.
- Finally, we write \1. This will indicate that we want to match the first group captured again. 
- In other words, it says: 'match that sequence of characters that was previously captured once more.' 
    
    sentence = "I wish you a happy happy birthday!"
    re.findall(r"(\w+)\s\1", sentence) - returns: ['happy'] 

- We get the word 'happy' as an output: this was the repeated word in our string.

- Now, we want to replace the repeated word with one occurrence of the same word. 
- In the code, we use the same regex as before, but this time, we use the .sub method. 
- In the replacement part, we can also reference back to the captured group: 
    - We write r"\1" to say: 'replace the entire expression match with the first captured group.' 
    
    re.sub(r"(\w+)\s\1", r"\1", sentence) - returns: 'I wish you a happy birthday!'
    - In the output string, we have only one occurrence of the word 'happy'.
    
- We can also use named groups for backreferencing. 
- To do this, we use ?P= the group name. 

    regex = r"(?P=name)"

Example:
    sentence = "Your new code number is 23434. Please, enter 23434 to open the door."
    re.findall(r"(?P<code>\d{5}).*?(?P=code)", sentence) - returns: ['23434']

- In the code, we want to find all matches of the same number. 
- We use a capturing group and name it 'code'. 
- Later, we reference back to this group, and we obtain the number as an output.

- To reference the group back for replacement, we need to use \g and the group name inside angle brackets. 

    replacement = r"\g<name>"

Example:
    sentence = "This app is not working! It's repeating the last word word."
    re.sub(r"(?P<word>\w+)\s(?P=word)", r"\g<word>", sentence) - returns: 'This app is not working! It's repeating the last word.'
    
- In the code, we want to replace repeated words by one occurrence of the same word. 
- Inside the regex, we use the previous syntax. 
- In the replacement field, we need to use this new syntax as seen in the code.
- Backreferences are very helpful when you need to reuse part of the regex match inside the regex.
- You should remember that the group zero stands for the entire expression matched. 
    - It is always helpful to keep that in mind. Sometimes you will need to use it.
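
- A minimal sketch of backreferences in the replacement field, reusing the release-date text from the numbered-groups example. It only reorders the three captured groups, first by number and then by name; the group names first, second and year are assumptions made for illustration:

```python
import re

text = "Python 3.0 was released on 12-03-2008."

# Numbered backreferences in the replacement: write the groups in reverse order
by_number = re.sub(r"(\d{1,2})-(\d{2})-(\d{4})", r"\3-\2-\1", text)

# Named backreferences in the replacement use the \g<name> syntax
by_name = re.sub(
    r"(?P<first>\d{1,2})-(?P<second>\d{2})-(?P<year>\d{4})",
    r"\g<year>/\g<second>/\g<first>",
    text,
)

print(by_number)  # Python 3.0 was released on 2008-03-12.
print(by_name)    # Python 3.0 was released on 2008/03/12.
```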

Examples of regex:

1. Parsing PDF files: your company gave you some PDF files of signed contracts. The goal of the project is to create a database with the information you parse from them. 
- Three of these columns should correspond to the day, month, and year when the contract was signed.
- The dates appear as 'Signed on 05/24/2016' ('05' indicating the month, '24' the day). 
- You decide to use capturing groups to extract this information. Also, you would like to retrieve that information so you can store it separately in different variables.
- The variable contract contains the text of one contract.

- Write a regex that captures the month, day, and year in which the contract was signed. 
- Scan contract for matches.
- Assign each captured group to the corresponding keys in the dictionary.
- Complete the format method to print out the captured groups. 
- Use the values corresponding to each key in the dictionary.

    # Write regex and scan contract to capture the dates described
    regex_dates = r"Signed\son\s(\d{2})/(\d{2})/(\d{4})"
    dates = re.search(regex_dates, contract)

    # Assign to each key the corresponding match
    signature = {
        "day": dates.group(2),
        "month": dates.group(1),
        "year": dates.group(3)
    }
    # Complete the format method to print-out
    print("Our first contract is dated back to {data[year]}. Particularly, the day {data[day]} of the month {data[month]}.".format(data=signature))

- Remember that each capturing group is assigned a number according to its position in the regex. 
- .group() is available on the match objects returned by .search() and .match(); .findall() returns plain strings or tuples instead.

2. The company is going to develop a new product which will help developers automatically check the code they are writing. 
- You need to write a short script for checking that every HTML tag that is open has its proper closure.
- You have an example of a string containing HTML tags: "<title>The Data Science Company</title>"
- You learn that an opening HTML tag is always at the beginning of the string, and appears inside "<>". 
- A closing tag also appears inside "<>", but it is preceded by "/".
- The list html_tags, contains strings with HTML tags.

- Regex to match closed HTML tags: find if there is a match in each string of the list html_tags. Assign the result to match_tag;
    - If a match is found, print the first group captured and saved in match_tag;
- If no match is found, regex to match only the text inside the HTML tag. Assign it to notmatch_tag.
    - Print the first group captured by the regex and save it in notmatch_tag.
    - To capture the text inside <>, place parentheses around the expression: r"<(text)>". To confirm that the same text appears in the closing tag, reference back to the m-th group captured by using '\m' (m = 1 for the first group).
    - To print the m-th group captured, use .group(m).

    for string in html_tags:
        # Complete the regex and find if it matches a closed HTML tags
        match_tag =  re.match(r"<(\w+)>.*?</\1>", string)

        if match_tag:
            # If it matches print the first group capture
            print("Your tag {} is closed".format(match_tag.group(1))) 
        else:
            # If it doesn't match capture only the tag 
            notmatch_tag = re.match(r"<(\w+)>",string)
            # Print the first group capture
            print("Close your {} tag!".format(notmatch_tag.group(1)))

3. Your task is to replace elongated words that appear in the tweets. 
- We define an elongated word as a word that contains the same character repeated two or more times in a row. 
    - e.g. "Awesoooome".
- Replacing those words is very important since a classifier will treat them as a different term from the source words, lowering their frequency.
- To find them, you will use capturing groups and reference them back using numbers. E.g \4.
- If you want to find a match for 'Awesoooome', you firstly need to capture 'Awes'. 
    - Then, match 'o' and reference the same character back, and then, 'me'.
- The list sentiment_analysis contains the text tweets.
- Regular expression to match an elongated word as described.
- Search the elements in sentiment_analysis list to find out if they contain elongated words. Assign the result to match_elongated.
- Assign the captured group number zero to the variable elongated_word.
    - Print the result contained in the variable elongated_word.

    # Complete the regex to match an elongated word:
    # (\w) captures one character and \1+ requires it to be repeated at least once
    regex_elongated = r"\w*(\w)\1+\w*"

    for tweet in sentiment_analysis:
        # Find if there is a match in each tweet 
        match_elongated = re.search(regex_elongated, tweet)

        if match_elongated:
            # Assign the captured group zero 
            elongated_word = match_elongated.group(0)

            # Complete the format method to print the word
            print("Elongated word found: {word}".format(word=elongated_word))
        else:
            print("No elongated word found") 

        """
        
        # regex lookaround
        self.help_text_8 = r"""
Lookaround
- There are specific types of non-capturing groups that help us look around an expression.
- Look-around will look for what is behind or ahead of a pattern. 
- Imagine that we have the following string:
    
    text = "the white cat sat on the chair"

- We want to see what is surrounding a specific word. 
- For example, we position ourselves in the word 'cat'. 
- So look-around will let us answer the following problem: 
    - At my current position, look ahead and search if 'sat' is there. 
    - Or, look behind and search if 'white' is there.
    
- In other words, looking around allows us to confirm that a sub-pattern is ahead or behind the main pattern.
- "At my current position in the matching process, look ahead or behind and examine whether some pattern matches or does not match before continuing."
- In the previous example, we are looking for the word 'cat'. 
- The look ahead expression can be either positive or negative. For positive we use ?=. For negative, ?!.
    - positive: (?=sat)
    - negative: (?!run)

- Look-ahead
- This non-capturing group checks whether the first part of the expression is followed or not by the lookahead expression. 
- As a consequence, it will return the first part of the expression. 
    - Let's imagine that we have a string containing file names and the status of that file. 
    - We want to extract only those files that are followed by the word 'transferred'. 
    - So we start building the regex by indicating any word character followed by .txt.
    - We now indicate we want the first part to be followed by the word transferred. 
    - We do so by writing ?= and then whitespace transferred all inside the parenthesis:
    
    my_text ="tweets.txt transferred, mypass.txt transferred, keywords.txt error"
    re.findall(r"\w+\.txt(?=\stransferred)", my_text) - returns: ['tweets.txt', 'mypass.txt']

- Negative look-ahead
    - Now, let's use negative lookahead in the same example.
    - In this case, we will say that we want those matches that are NOT followed by the expression 'transferred'. 
    - We use, instead, ?! inside parenthesis:

    my_text = "tweets.txt transferred, mypass.txt transferred, keywords.txt error"
    re.findall(r"\w+\.txt(?!\stransferred)", my_text) - returns: ['keywords.txt']

- Look-behind
- The non-capturing group look-behind gets all matches that are preceded or not by a specific pattern.
- As a consequence, it will return the matches after the look expression.
- Look behind expression can also be either positive or negative. 
    - For positive, we use ?<=. For negative, ?<!.
    - So, we add an intermediate '<' (angle bracket) sign. In the previous example, we can look before the word 'cat': 
        - positive: (?<=white)
        - negative: (?<!brown)
    
- Positive look-behind
    - Let's look at the following string, in which we want to find all matches of the names that are preceded by the word 'member'. 
    - We construct our regex with positive look-behind. 
    - At the end of the regex, we indicate that we want a sequence of word characters whitespace another sequence of word characters:
    
    my_text = "Member: Angus Young, Member: Chris Slade, Past: Malcolm Young, Past: Cliff Williams."
    re.findall(r"(?<=Member:\s)\w+\s\w+", my_text) - returns: ['Angus Young', 'Chris Slade']
    
    - Pay attention to the code: the look-behind expression goes before that expression. 
    - We indicate ?<= followed by member, colon, and whitespace. All inside parentheses. 
    - In that way we get the two names that were preceded by the word member, as shown in the output.

- Negative look-behind
- Now, we have another string, in which we will use negative look-behind. 
- We will find all matches of the word 'cat' or 'dog' that are not preceded by the word 'brown'. 
- In this example, we use ?<!, followed by brown, whitespace. All inside the parenthesis. 
- Then, we indicate our alternation group: 'cat' or 'dog'. 

    my_text = "My white cat sat at the table. However, my brown dog was lying on the couch."
    re.findall(r"(?<!brown\s)(cat|dog)", my_text) - returns: ['cat']

    - Consequently, we get 'cat' as an output, the 'cat' or 'dog' word that is not after the word 'brown'.

In summary:
- Positive lookahead (?=) makes sure that first part of the expression is followed by the lookahead expression. 
- Positive lookbehind (?<=) returns all matches that are preceded by the specified pattern.
- It is important to know that positive lookahead will return the text matched by the first part of the expression after asserting that it is followed by the lookahead expression,
    - while positive lookbehind will return all matches that follow a specific pattern.
- Negative lookarounds work in a similar way to positive lookarounds. 
    - They are very helpful when we are looking to exclude certain patterns from our analysis.
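
- The sketch below gathers the lookaround variants on the file-status string used earlier. The last pattern (a positive look-behind for '.txt ' that returns the status words) is an extra illustration that does not come from the examples above:

```python
import re

my_text = "tweets.txt transferred, mypass.txt transferred, keywords.txt error"

# Positive look-ahead: files followed by ' transferred'
transferred = re.findall(r"\w+\.txt(?=\stransferred)", my_text)

# Negative look-ahead: files NOT followed by ' transferred'
not_transferred = re.findall(r"\w+\.txt(?!\stransferred)", my_text)

# Positive look-behind: words preceded by '.txt' and a whitespace
statuses = re.findall(r"(?<=\.txt\s)\w+", my_text)

print(transferred)      # ['tweets.txt', 'mypass.txt']
print(not_transferred)  # ['keywords.txt']
print(statuses)         # ['transferred', 'transferred', 'error']
```

- Notice that the text inside the lookaround is only checked; it is never part of the returned match.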

Examples of regex:

1. You are interested in the words surrounding 'python'. You want to count how many times a specific word appears right before and after it.
- The variable sentiment_analysis contains the text of one tweet.
- Get all the words that are followed by the word 'python' in sentiment_analysis. 
- Print out the word found.
    - In re.findall(), use \w+ to match the words followed by the word 'python';
    - In the first argument of re.findall(), include ?=\spython within parentheses, to assert that the matched word is followed by a whitespace and the word 'python'.

    # Positive lookahead
    look_ahead = re.findall(r"\w+(?=\spython)", sentiment_analysis)

    # Print out
    print(look_ahead)
 
1.2. Get all the words that are preceded by the word 'python' or 'Python' in sentiment_analysis. Print out the words found.
- In the first argument of re.findall(), include ?<=[Pp]ython\s within parentheses, to assert that the matched word is preceded by the word 'python' (or 'Python') and a whitespace.

    # Positive lookbehind
    look_behind = re.findall(r"(?<=[pP]ython\s)\w+", sentiment_analysis)

    # Print out
    print(look_behind)

2. You need to write a script for a cell-phone searcher. 
- It should scan a list of phone numbers and return those that meet certain characteristics.
- The phone numbers in the list have the structure:
    - Optional area code: 3 numbers
    - Prefix: 4 numbers
    - Line number: 6 numbers
    - Optional extension: 2 numbers
    - E.g. 654-8764-439434-01.
- You decide to use .findall() and the non-capturing group's negative lookahead (?!) and negative lookbehind (?<!).
- The list cellphones, contains three phone numbers:
    cellphones = ['4564-646464-01', '345-5785-544245', '6476-579052-01']

- Get all cell phone numbers that are not preceded by the optional area code.
    - In re.findall() first argument, you use a negative lookbehind ?<! within parentheses () indicating the optional area code.

    for phone in cellphones:
        # Get all phone numbers not preceded by area code
        number = re.findall(r"(?<!\d{3}-)\d{4}-\d{6}-\d{2}", phone)
        print(number)
 
2.1. Get all the cell phone numbers that are not followed by the optional extension.
    - In re.findall() first argument, you use a negative lookahead ?! within parentheses () indicating the optional extension.

    for phone in cellphones:
        # Get all phone numbers not followed by optional extension
        number = re.findall(r"\d{3}-\d{4}-\d{6}(?!-\d{2})", phone)
        print(number)
    
        """
        
    def show_screen (self):
            
        helper_screen = self.helper_screen
        helper_menu_1 = self.helper_menu_1
            
        if (helper_screen == 0):
                
            # Start screen
            print(self.helper_menu_1)
            print("\n")
            # For the input, strip all whitespaces and then convert it to integer:
            helper_screen = int(str(input("Next screen:")).strip())
                
            # the object.__dict__ attribute stores all instance attributes of an object as a dictionary.
            # Analogously, the vars function applied to an object, vars(object), returns the same
            # dictionary. We can access an attribute from the object by passing its name as the key
            # of this dictionary:
            # vars(object)['key']
                
            while (helper_screen != 10):
                    
                if (helper_screen not in range(0, 11)):
                    # range (0, 11): integers from 0 to 10
                        
                    helper_screen = int(str(input("Input a valid number, from 0 to 10:")).strip())
                    
                else:
                        
                    if (helper_screen == 9):
                        # print all at once:
                        for screen_number in range (1, 9):
                            # integers from 1 to 8
                            key = "help_text_" + str(screen_number)
                            # apply the vars function to get the dictionary of attributes, and call the
                            # attribute by passing its name as key from the dictionary:
                            screen_text = vars(self)[key]
                            # Notice that dot notation cannot be used when the attribute name is stored in a
                            # string. The dictionary lookup (or, equivalently, getattr(self, key)) avoids
                            # creating an if else for each of the 8 attributes.
                            print(screen_text)
                            
                        # Now, make helper_screen = 10 for finishing this step:
                        helper_screen = 10
                        
                    else:
                        key = "help_text_" + str(helper_screen)
                        screen_text = vars(self)[key]
                        print(screen_text)
                        helper_screen = int(str(input("Next screen:")).strip())
            
        print("Finishing regex assistant.\n")
            
        return self
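
The lookup of an attribute by its name, used above to select the help screen, can be sketched in isolation. This is a minimal illustration, not the helper class itself: `HelpScreens` and its two texts are hypothetical stand-ins.

```python
class HelpScreens:
    """Hypothetical stand-in for the helper class: one attribute per screen."""

    def __init__(self):
        self.help_text_1 = "Screen 1: metacharacters"
        self.help_text_2 = "Screen 2: quantifiers"


helper = HelpScreens()
key = "help_text_" + str(2)

# vars(obj) returns obj.__dict__, so the attribute name can be used as a dictionary key:
text_from_vars = vars(helper)[key]

# getattr performs the same lookup by name and also accepts a default value,
# which avoids a KeyError when the screen does not exist:
text_from_getattr = getattr(helper, key)
missing_text = getattr(helper, "help_text_99", "No such screen.")

print(text_from_vars)
print(missing_text)
```

The default value of `getattr` may be convenient if the user types a screen number that has no corresponding text.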

In [124]:
def regex_search (df, column_to_analyze, regex_to_search = r"", show_regex_helper = False, create_new_column = True, new_column_suffix = "_regex"):
     
    import numpy as np
    import pandas as pd
    
    # column_to_analyze: string (inside quotes), 
    # containing the name of the column that will be analyzed. 
    # e.g. column_to_analyze = "column1" will analyze the column named as 'column1'.
    
    # regex_to_search = r"" - string containing the regular expression (regex) that will be searched
    # within each string from the column. Declare it with the r before quotes, indicating that the
    # 'raw' string should be read. That is because the regex contains special characters, such as \,
    # which should not be read as escape characters.
    # example of regex: r'st\d\s\w{3,10}'
    # Use the regex helper to check: basic theory and most common metacharacters; regex quantifiers;
    # regex anchoring and finding; regex greedy and non-greedy search; regex grouping and capturing;
    # regex alternating and non-capturing groups; regex backreferences; and regex lookaround.
    
    ## ATTENTION: This function returns ONLY the capturing groups from the regex, i.e., portions of the
    # regex explicitly marked with parentheses (check the regex helper for more details, including how
    # to convert parentheses into non-capturing groups). If no groups are marked as capturing, the
    # function will raise an error.

    # show_regex_helper: set show_regex_helper = True to show a helper guide to the construction of
    # the regular expression. After finishing the helper, the original dataset itself will be returned
    # and the function will not proceed. Use it in case of not knowing or not certain on how to input
    # the regex.
    
    # create_new_column = True
    # Set create_new_column = True to store the transformed data into a new
    # column. Or set create_new_column = False to overwrite the existing column.
    
    # new_column_suffix = "_regex"
    # This value has effect only if create_new_column = True.
    # The new column name will be set as column + new_column_suffix. Then, if the original
    # column was "column1" and the suffix is "_regex", the new column will be named as
    # "column1_regex".
    # Alternatively, input inside quotes a string with the desired suffix. Recommendation:
    # start the suffix with "_" to separate it from the original name.
    
    if (show_regex_helper): # run if True
        
        # Create an instance (object) from class regex_help:
        helper = regex_help()
        # Run helper object:
        helper = helper.show_screen()
        print("Interrupting the function and returning the dataframe itself.")
        print("Use the regex helper instructions to obtain the regex.")
        print("Do not forget to declare it as r'regex', with the r before quotes.")
        print("It indicates a raw expression. It is important for not reading the regex metacharacters as regular string escape characters.")
        print("Also, notice that this function returns only the capturing groups (marked with parentheses).")
        print("If no groups are marked as capturing groups (with parentheses) within the regex, the function will raise an error.\n")
        
        return df
    
    else:
        
        # Set a local copy of dataframe to manipulate
        DATASET = df.copy(deep = True)
        DATASET[column_to_analyze] = (DATASET[column_to_analyze]).astype(str)
        new_series = DATASET[column_to_analyze].copy()
        
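        # Search for the regex within new_series:
        new_series = new_series.str.extract(regex_to_search, expand = False)
        
        # https://pandas.pydata.org/docs/reference/api/pandas.Series.str.extract.html
        # setting expand = False returns a Series if the regex contains a single capture group,
        # and a dataframe with one column per capture group if it contains more than 1 group.
        # With expand = True, a dataframe would always be returned, and the single-group case
        # handled in the except block below would never be reached.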
        
        # The shape attribute is a tuple (N,) for a Pandas Series, and (N, M) for a dataframe,
        # where N is the number of rows, and M is the number of columns.
        # Let's try to access the number of columns. It will only be possible if the object is a
        # dataframe (index 1 from shape tuple):
        try:
            
            total_new_cols = new_series.shape[1]
            
            if (new_column_suffix is None):
                new_column_suffix = "_regex"
            
            new_column_suffix = str(column_to_analyze) + new_column_suffix + "_group_"
            
            # In the regex, the group 0 is the expression itself, whereas the first group is group 1.
            # range (0, total_new_cols) goes from 0 to total_new_cols-1;
            # range (1, total_new_cols + 1) goes from group 1 to group total_new_cols
            # (both cases result in total_new_cols elements):
            
            # Create a list of columns:
            new_columns_list = [(new_column_suffix + str(i)) for i in range (1, (total_new_cols + 1))]
            
            # Make this list the new columns' names:
            new_series.columns = new_columns_list
            
            # Concatenate this dataframe to the original one (add columns to the right of DATASET):
            DATASET = pd.concat([DATASET, new_series], axis = 1, join = "inner")
        
        
        except IndexError:
            
            # There is no second dimension, because it is a series.
            # The regex finds a single group
            
            if (create_new_column):

                if (new_column_suffix is None):
                    new_column_suffix = "_regex"

                new_column_name = column_to_analyze + new_column_suffix
                DATASET[new_column_name] = new_series

            else:

                DATASET[column_to_analyze] = new_series

        # Now, we are in the main code.
        print(f"Finished searching the regex {regex_to_search} within {column_to_analyze}.")
        print("Check the first 10 elements from the output:\n")

        try:
            # only works in Jupyter Notebook:
            from IPython.display import display
            display(new_series.head(10))

        except ImportError: # regular mode
            print(new_series.head(10))

        return DATASET
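
The way pandas returns capture groups can be checked on a small example. This is a minimal sketch with hypothetical sample strings: it only calls `Series.str.extract` directly, to show how the `expand` argument and the number of capture groups determine whether a Series or a dataframe comes back.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration:
codes = pd.Series(["st1 alpha", "st2 beta", "no match"])

# One capture group with expand = False: the result is a Series.
single_group = codes.str.extract(r"st(\d)", expand = False)

# The same regex with expand = True: the result is a one-column dataframe.
single_group_df = codes.str.extract(r"st(\d)", expand = True)

# Two capture groups: the result is a dataframe with one column per group.
two_groups = codes.str.extract(r"st(\d)\s(\w+)", expand = False)

# (?:...) marks a non-capturing group: it takes part in the match but is not returned.
non_capturing = codes.str.extract(r"(?:st)(\d)", expand = False)

print(single_group.tolist())
print(two_groups.shape)
```

Rows where the regex is not found are filled with missing values, so the output keeps the same length as the input column.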

# **Function for replacing a Regular Expression (RegEx) in a string column**
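
The replacement relies on the pandas method `Series.str.replace` with `regex = True`. A minimal sketch of that call, using hypothetical sample strings, is shown below; the last line also illustrates that groups captured by the regex can be referenced in the replacement string.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration:
notes = pd.Series(["order  123", "order 45   shipped", "no digits"])

# Replace every run of digits by the placeholder "<num>":
masked = notes.str.replace(r"\d+", "<num>", regex = True)

# Collapse consecutive whitespace characters into a single space:
collapsed = notes.str.replace(r"\s+", " ", regex = True)

# Groups captured by the regex can be referenced in the replacement as \1, \2, ...
swapped = notes.str.replace(r"(\w+)\s+(\d+)", r"\2 \1", regex = True)

print(masked.tolist())
print(collapsed.tolist())
print(swapped.tolist())
```

Strings where the regex is not found are returned unchanged.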

In [142]:
def regex_replacement (df, column_to_analyze, regex_to_search = r"", string_for_replacement = "", show_regex_helper = False, create_new_column = True, new_column_suffix = "_regex"):
     
    import numpy as np
    import pandas as pd
    
    # column_to_analyze: string (inside quotes), 
    # containing the name of the column that will be analyzed. 
    # e.g. column_to_analyze = "column1" will analyze the column named as 'column1'.
    
    # regex_to_search = r"" - string containing the regular expression (regex) that will be searched
    # within each string from the column. Declare it with the r before quotes, indicating that the
    # 'raw' string should be read. That is because the regex contains special characters, such as \,
    # which should not be read as escape characters.
    # example of regex: r'st\d\s\w{3,10}'
    # Use the regex helper to check: basic theory and most common metacharacters; regex quantifiers;
    # regex anchoring and finding; regex greedy and non-greedy search; regex grouping and capturing;
    # regex alternating and non-capturing groups; regex backreferences; and regex lookaround.
    
    # string_for_replacement = "" - regular string that will replace the regex_to_search: 
    # whenever regex_to_search is found in the string, it is replaced (substituted) by 
    # string_for_replacement. 
    # Example string_for_replacement = " " (whitespace).
    # If string_for_replacement = None, the empty string will be used for replacement.
    
    ## ATTENTION: This function processes a single regex per call.
    
    # show_regex_helper: set show_regex_helper = True to show a helper guide to the construction of
    # the regular expression. After finishing the helper, the original dataset itself will be returned
    # and the function will not proceed. Use it in case of not knowing or not certain on how to input
    # the regex.
    
    # create_new_column = True
    # Set create_new_column = True to store the transformed data into a new
    # column. Or set create_new_column = False to overwrite the existing column.
    
    # new_column_suffix = "_regex"
    # This value has effect only if create_new_column = True.
    # The new column name will be set as column + new_column_suffix. Then, if the original
    # column was "column1" and the suffix is "_regex", the new column will be named as
    # "column1_regex".
    # Alternatively, input inside quotes a string with the desired suffix. Recommendation:
    # start the suffix with "_" to separate it from the original name.
    
    if (show_regex_helper): # run if True
        
        # Create an instance (object) from class regex_help:
        helper = regex_help()
        # Run helper object:
        helper = helper.show_screen()
        print("Interrupting the function and returning the dataframe itself.")
        print("Use the regex helper instructions to obtain the regex.")
        print("Do not forget to declare it as r'regex', with the r before quotes.")
        print("It indicates a raw expression. It is important for not reading the regex metacharacters as regular string escape characters.\n")
        
        return df
    
    else:
        
        if (string_for_replacement is None):
            # make it the empty string
            string_for_replacement = ""
        
        # Set a local copy of dataframe to manipulate
        DATASET = df.copy(deep = True)
        DATASET[column_to_analyze] = (DATASET[column_to_analyze]).astype(str)
        new_series = DATASET[column_to_analyze].copy()
        
        new_series = new_series.str.replace(regex_to_search, string_for_replacement, regex = True)
        # set regex = True to replace a regular expression
        # https://pandas.pydata.org/docs/reference/api/pandas.Series.str.replace.html
            
        if (create_new_column):

            if (new_column_suffix is None):
                new_column_suffix = "_regex"

            new_column_name = column_to_analyze + new_column_suffix
            DATASET[new_column_name] = new_series

        else:

            DATASET[column_to_analyze] = new_series

        # Now, we are in the main code.
        print(f"Finished replacing the regex {regex_to_search} within {column_to_analyze}.")
        print("Check the first 10 elements from the output:\n")

        try:
            # only works in Jupyter Notebook:
            from IPython.display import display
            display(new_series.head(10))

        except ImportError: # regular mode
            print(new_series.head(10))

        return DATASET

# **Function for applying Fast Fourier Transform**
- Determine which frequencies are important by extracting features with the [Fast Fourier Transform](https://en.wikipedia.org/wiki/Fast_Fourier_transform).
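
The idea can be sketched on a synthetic signal before applying it to real data. The example below is an illustration under stated assumptions: the hourly signal is hypothetical, and NumPy's `np.fft.rfft` is used in place of a TensorFlow transform so that the block runs with NumPy alone.

```python
import numpy as np

# Synthetic hourly signal covering 4 years (hypothetical data): a daily cycle,
# a weaker yearly cycle, and a constant offset.
hours_per_year = 24 * 365.2425
n_samples = int(4 * hours_per_year)
t_hours = np.arange(n_samples)
signal = (10.0
          + 3.0 * np.sin(2 * np.pi * t_hours / 24)
          + 1.0 * np.sin(2 * np.pi * t_hours / hours_per_year))

# Subtract the mean to remove the offset, then apply the real-input FFT:
fft = np.fft.rfft(signal - signal.mean())
abs_fft = np.abs(fft)

# Bin k corresponds to k cycles per dataset; dividing by the dataset length in years
# converts it to cycles per year:
years_per_dataset = n_samples / hours_per_year
f_per_year = np.arange(len(fft)) / years_per_dataset

# The strongest peak should sit close to 365.2425 cycles per year (once a day):
dominant_frequency = f_per_year[np.argmax(abs_fft)]
print(round(dominant_frequency, 1))
```

Peaks in `abs_fft` indicate the periodicities that carry most of the signal, which are the candidates for frequency features.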

In [None]:
def fast_fourier_transform (df, column_to_analyze, average_frequency_of_data_collection = 'hour', x_axis_rotation = 0, y_axis_rotation = 0, grid = True, horizontal_axis_title = None, vertical_axis_title = None, plot_title = None, export_png = False, directory_to_save = None, file_name = None, png_resolution_dpi = 330):
    
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import tensorflow as tf
    
    
    # average_frequency_of_data_collection = 'hour' or 'h' for hours; 'day' or 'd' for days;
    # 'minute' or 'min' for minutes; 'second' or 's' for seconds; 'ms' for milliseconds; 'ns' for
    # nanoseconds; 'year' or 'y' for years; 'month' or 'm' for months.
    
    
    average_frequency_of_data_collection = str(average_frequency_of_data_collection).lower()
    
    # Average length of the Gregorian year, in days:
    days_per_year = 365.2425
    
    if ((average_frequency_of_data_collection == 'year')|(average_frequency_of_data_collection == 'y')):
        count_per_year = 1
        xtick_list = [1]
        labels_list = ['1/year']
    
    elif ((average_frequency_of_data_collection == 'month')|(average_frequency_of_data_collection == 'm')):
        count_per_year = 12
        xtick_list = [1, count_per_year]
        labels_list = ['1/year', '1/month']
    
    elif ((average_frequency_of_data_collection == 'day')|(average_frequency_of_data_collection == 'd')):
        count_per_year = days_per_year
        xtick_list = [1, count_per_year]
        labels_list = ['1/year', '1/day']
    
    elif ((average_frequency_of_data_collection == 'hour')|(average_frequency_of_data_collection == 'h')):
        count_per_year = 24 * days_per_year
        xtick_list = [1, days_per_year, count_per_year]
        labels_list = ['1/year', '1/day', '1/h']
    
    elif ((average_frequency_of_data_collection == 'minute')|(average_frequency_of_data_collection == 'min')):
        count_per_year = 60 * 24 * days_per_year
        xtick_list = [1, days_per_year, (24 * days_per_year), count_per_year]
        labels_list = ['1/year', '1/day', '1/h', '1/min']
    
    elif ((average_frequency_of_data_collection == 'second')|(average_frequency_of_data_collection == 's')):
        count_per_year = 60 * 60 * 24 * days_per_year
        xtick_list = [1, days_per_year, (24 * days_per_year), (60 * 24 * days_per_year), count_per_year]
        labels_list = ['1/year', '1/day', '1/h', '1/min', '1/s']
    
    elif (average_frequency_of_data_collection == 'ms'):
        count_per_year = 60 * 60 * 24 * days_per_year * (10**3)
        xtick_list = [1, days_per_year, (24 * days_per_year), (60 * 24 * days_per_year), (60 * 60 * 24 * days_per_year), count_per_year]
        labels_list = ['1/year', '1/day', '1/h', '1/min', '1/s', '1/ms']
    
    elif (average_frequency_of_data_collection == 'ns'):
        count_per_year = 60 * 60 * 24 * days_per_year * (10**9)
        xtick_list = [1, days_per_year, (24 * days_per_year), (60 * 24 * days_per_year), (60 * 60 * 24 * days_per_year), (60 * 60 * 24 * days_per_year * (10**3)), count_per_year]
        labels_list = ['1/year', '1/day', '1/h', '1/min', '1/s', '1/ms', '1/ns']
    
    else:
        print("No valid frequency input. Considering frequency in h.\n")
        count_per_year = 24 * days_per_year
        xtick_list = [1, days_per_year, count_per_year]
        labels_list = ['1/year', '1/day', '1/h']
    
    # Start a local copy of the dataframe:
    DATASET = df.copy(deep = True)
    
    #Subtract the average value of column to analyze from each entry, to eliminate a possible offset
    avg_value = DATASET[column_to_analyze].mean()
    DATASET[column_to_analyze] = DATASET[column_to_analyze] - avg_value
    
    # Perform the Fourier transform
    fft = tf.signal.rfft(DATASET[column_to_analyze])
    f_per_dataset = np.arange(0, len(fft))

    n_samples = len(DATASET[column_to_analyze])
    years_per_dataset = n_samples/(count_per_year)

    f_per_year = f_per_dataset/years_per_dataset
    
    # Let's put a small degree of transparency (1 - OPACITY) = 0.05 = 5%
    # so that the plotted line does not completely block other views.
    OPACITY = 0.95
    
    if (plot_title is None):
        # Set graphic title
        plot_title = "obtained_frequencies"

    if (horizontal_axis_title is None):
        # Set horizontal axis title
        horizontal_axis_title = "frequency_log_scale"

    if (vertical_axis_title is None):
        # Set vertical axis title
        vertical_axis_title = "abs(fft)"
    
    # fft is a complex tensor. Let's pick the absolute value of each complex:
    abs_fft = np.abs(fft)
    
    #Set image size (width, height), in inches, for printing in the notebook's cell:
    fig = plt.figure(figsize = (12, 8))
    ax = fig.add_subplot()
    
    ax.step(f_per_year, abs_fft, color = 'crimson', linestyle = '-', alpha = OPACITY)
    
    # Set the limits of the horizontal axis:
    # X from 0.1, close to zero, to the maximum. Zero cannot be present in log scale.
    # The limits of the vertical axis are left to be set automatically.
    
    plt.xlim([0.1, max(plt.xlim())])
    
    plt.xscale('log')
    plt.xticks(xtick_list, labels = labels_list)
        
    #ROTATE X AXIS IN XX DEGREES
    plt.xticks(rotation = x_axis_rotation)
    # XX = 0 DEGREES x_axis (Default)
    #ROTATE Y AXIS IN XX DEGREES:
    plt.yticks(rotation = y_axis_rotation)
    # XX = 0 DEGREES y_axis (Default)

    ax.set_title(plot_title)
    ax.set_xlabel(horizontal_axis_title)
    ax.set_ylabel(vertical_axis_title)

    ax.grid(grid) # show grid or not
    
    if (export_png == True):
        # Image will be exported
        import os

        #check if the user defined a directory path. If not, set as the default root path:
        if (directory_to_save is None):
            #set as the default
            directory_to_save = ""

        #check if the user defined a file name. If not, set as the default name for this
        # function.
        if (file_name is None):
            #set as the default
            file_name = "fast_fourier_transform"

        #check if the user defined an image resolution. If not, set as the default 330 dpi
        # resolution.
        if (png_resolution_dpi is None):
            #set as 330 dpi
            png_resolution_dpi = 330

        #Get the new_file_path. The '.png' extension is added explicitly, because savefig uses
        # the file name verbatim when the format argument is passed:
        new_file_path = os.path.join(directory_to_save, file_name + ".png")

        #Export the file to this new path:
        plt.savefig(new_file_path, dpi = png_resolution_dpi, format = 'png', transparent = False) 
        #the former quality argument only affected JPEG output, and recent Matplotlib versions
        # no longer accept it
        #format (str, supported formats) = 'png', 'pdf', 'ps', 'eps' or 'svg'
        #transparent = True or False
        # For other parameters of .savefig method, check https://indianaiproduction.com/matplotlib-savefig/
        print (f"Figure exported as '{new_file_path}'. Any previous file in this root path was overwritten.")

    #Set image size (x-pixels, y-pixels) for printing in the notebook's cell:
    #plt.figure(figsize = (12, 8))
    #fig.tight_layout()

    ## Show an image read from an image file:
    ## import matplotlib.image as pltimg
    ## img=pltimg.imread('mydecisiontree.png')
    ## imgplot = plt.imshow(img)
    ## See linkedIn Learning course: "Supervised machine learning and the technology boom",
    ##  Ex_Files_Supervised_Learning, Exercise Files, lesson '03. Decision Trees', '03_05', 
    ##  '03_05_END.ipynb'
    plt.show()
    
    print("Attention: the frequency is in counts per year: 1 count per year corresponds to 1 year; 12 counts: months per year; 365.2425 counts: days per year, etc.\n")
    
    # Also, return a tuple combining the absolute value of fft with the corresponding count per year
    return fft, tuple(zip(abs_fft, f_per_year))

# **Function for generating columns with frequency information**
- This gives the model access to the most important frequency features.
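
The transformation behind these columns can be sketched for the daily frequency. This is a minimal illustration with hypothetical hourly timestamps: each timestamp is converted to seconds and mapped to a sine and a cosine whose period is one day, so that the same hour of different days receives the same pair of values.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly timestamps covering two days:
timestamps = pd.Series(pd.date_range("2023-01-01", periods = 48, freq = pd.Timedelta(hours = 1)))

# Convert each timestamp to seconds elapsed since the Unix epoch:
timestamp_s = timestamps.map(pd.Timestamp.timestamp)

# One day expressed in seconds is the period of the daily signal:
day_s = 60 * 60 * 24

day_sin = np.sin(timestamp_s * (2 * np.pi / day_s))
day_cos = np.cos(timestamp_s * (2 * np.pi / day_s))

features = pd.DataFrame({"timestamp": timestamps, "day_sin": day_sin, "day_cos": day_cos})
print(features.head())
```

Using both sine and cosine removes the ambiguity of a single wave: two different hours may share the same sine value, but not the same (sine, cosine) pair.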

In [8]:
def get_frequency_features (df, timestamp_tag_column, important_frequencies = [{'value': 1, 'unit': 'day'}, {'value':1, 'unit': 'year'}], x_axis_rotation = 0, y_axis_rotation = 0, grid = True, horizontal_axis_title = None, vertical_axis_title = None, plot_title = None, export_png = False, directory_to_save = None, file_name = None, png_resolution_dpi = 330):
    
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    
    # important_frequencies = [{'value': 1, 'unit': 'day'}, {'value':1, 'unit': 'year'}]
    # List of dictionaries with the important frequencies to add to the model. You can remove dictionaries,
    # or add extra dictionaries. The dictionaries must always have the same keys, 'value' and 'unit'.
    # If the important frequency is once a day, the value will be 1, and the unit will be 'day' or 'd'.
    # The possible units are: 'ns', 'ms', 'second' or 's', 'minute' or 'min', 'hour' or 'h', 'day' or 'd',
    # 'month' or 'm', 'year' or 'y'.
    
    # the Date Time column is very useful, but not in this string form. 
    # Start by converting it to seconds:
    
    # Start a local copy of the dataframe:
    DATASET = df.copy(deep = True)
    
    timestamp_s = DATASET[timestamp_tag_column].map(pd.Timestamp.timestamp)
    # the time in seconds is not a useful model input. 
    # It may have daily and yearly periodicity, for instance. 
    # To deal with periodicity, you can get usable signals by using sine and cosine transforms 
    # to clear "Time of day" and "Time of year" signals:
    
    columns_to_plot = []
    
    for freq_dict in important_frequencies:
        
        value = freq_dict['value']
        unit = freq_dict['unit']
        
        if ((value is not None) & (unit is not None)):
            
            unit = str(unit).lower()
            
            column_name1 = unit + "_sin"
            column_name2 = unit + "_cos"
            
            column_tuple = (column_name1, column_name2)
            columns_to_plot.append(column_tuple)
            
            if (unit == 'ns'):
                # convert to seconds:
                factor = 10 ** (-9)
            
            elif (unit == 'ms'):
                # convert to seconds:
                factor = 10 ** (-3)
            
            elif ((unit == 's')|(unit == 'second')):
                # convert to seconds:
                factor = 1
            
            elif ((unit == 'min')|(unit == 'minute')):
                # convert to seconds:
                factor = 60
            
            elif ((unit == 'hour')|(unit == 'h')):
                # convert to seconds:
                factor = 60 * 60
            
            elif ((unit == 'month')|(unit == 'm')):
                # convert to seconds, considering a (365.2425)-day year, divided by 12:
                factor = 60 * 60 * 24 * (365.2425)/12
                print(f"Attention: considering an average month of {(365.2425)/12} days.\n")
            
            elif ((unit == 'year')|(unit == 'y')):
                # convert to seconds, considering a (365.2425)-day year:
                factor = 60 * 60 * 24 * (365.2425)
            
            else:
                # unit == 'day', or 'd', the default case
                # convert to seconds:
                factor = 60 * 60 * 24
            
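            # value is the number of cycles per unit, so the angular frequency, in rad/s, is
            # 2 * pi * value / factor (with value = 1, one cycle per unit):
            DATASET[column_name1] = np.sin(timestamp_s * (2 * np.pi * value / factor))
            DATASET[column_name2] = np.cos(timestamp_s * (2 * np.pi * value / factor))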
            
    # There are 8 possible frequencies to plot, i.e, 16 possible sin and cos plots.
    # List of tuples, containing the pairs of colors to be used:
    colors = [('crimson', 'darkblue'), 
                ('fuchsia', 'black'),
                ('red', 'blue'),
                ('darkgreen', 'magenta'),
                ('aqua', 'violet'),
                ('navy', 'purple'),
                ('green', 'firebrick'),
                ('blue', 'plum')]
    
    # Slice the colors list so that it has the same amount of elements as columns_to_plot:
    colors = colors[:(len(columns_to_plot))]
    # Now, we can zip both to create an iterable containing a tuple of plots and a correspondent
    # tuple of colors.
    
    # Let's put a small degree of transparency (1 - OPACITY) = 0.05 = 5%
    # so that the plotted lines do not completely block other views.
    OPACITY = 0.95
    
    if (plot_title is None):
        # Set graphic title
        plot_title = "frequency_signals"

    if (horizontal_axis_title is None):
        # Set horizontal axis title
        horizontal_axis_title = "time"

    if (vertical_axis_title is None):
        # Set vertical axis title
        vertical_axis_title = "signal"
    
    #Set image size (width, height), in inches, for printing in the notebook's cell:
    fig = plt.figure(figsize = (12, 8))
    ax = fig.add_subplot()
    
    for columns_tuple, colors_tuple in zip(columns_to_plot, colors):
        
        ax.plot(np.array(DATASET[columns_tuple[0]])[:25], linestyle = "-", marker = '', color = colors_tuple[0], alpha = OPACITY, label = columns_tuple[0])
        ax.plot(np.array(DATASET[columns_tuple[1]])[:25], linestyle = "-", marker = '', color = colors_tuple[1], alpha = OPACITY, label = columns_tuple[1])
        
    #ROTATE X AXIS IN XX DEGREES
    plt.xticks(rotation = x_axis_rotation)
    # XX = 0 DEGREES x_axis (Default)
    #ROTATE Y AXIS IN XX DEGREES:
    plt.yticks(rotation = y_axis_rotation)
    # XX = 0 DEGREES y_axis (Default)

    ax.set_title(plot_title)
    ax.set_xlabel(horizontal_axis_title)
    ax.set_ylabel(vertical_axis_title)

    ax.grid(grid) # show grid or not
    ax.legend(loc = 'upper right')
    # position options: 'upper right'; 'upper left'; 'lower left'; 'lower right';
    # 'right', 'center left'; 'center right'; 'lower center'; 'upper center', 'center'
    # https://www.statology.org/matplotlib-legend-position/

    if (export_png == True):
        # Image will be exported
        import os

        #check if the user defined a directory path. If not, set as the default root path:
        if (directory_to_save is None):
            #set as the default
            directory_to_save = ""

        #check if the user defined a file name. If not, set as the default name for this
        # function.
        if (file_name is None):
            #set as the default
            file_name = "frequency_signals"

        #check if the user defined an image resolution. If not, set as the default 330 dpi
        # resolution.
        if (png_resolution_dpi is None):
            #set as 330 dpi
            png_resolution_dpi = 330

        #Get the new_file_path. The '.png' extension is added explicitly, because savefig uses
        # the file name verbatim when the format argument is passed:
        new_file_path = os.path.join(directory_to_save, file_name + ".png")

        #Export the file to this new path:
        plt.savefig(new_file_path, dpi = png_resolution_dpi, format = 'png', transparent = False) 
        #the former quality argument only affected JPEG output, and recent Matplotlib versions
        # no longer accept it
        #format (str, supported formats) = 'png', 'pdf', 'ps', 'eps' or 'svg'
        #transparent = True or False
        # For other parameters of .savefig method, check https://indianaiproduction.com/matplotlib-savefig/
        print (f"Figure exported as '{new_file_path}'. Any previous file in this root path was overwritten.")

    #Set image size (x-pixels, y-pixels) for printing in the notebook's cell:
    #plt.figure(figsize = (12, 8))
    #fig.tight_layout()

    ## Show an image read from an image file:
    ## import matplotlib.image as pltimg
    ## img=pltimg.imread('mydecisiontree.png')
    ## imgplot = plt.imshow(img)
    ## See linkedIn Learning course: "Supervised machine learning and the technology boom",
    ##  Ex_Files_Supervised_Learning, Exercise Files, lesson '03. Decision Trees', '03_05', 
    ##  '03_05_END.ipynb'
    plt.show()
    
    return DATASET

# **Function for log-transforming the variables**
- One distribution derived from the normal is the log-normal.
- If the values Y follow a log-normal distribution, their logarithms follow a normal distribution.
- A log-normal curve resembles a normal curve, but it is skewed (asymmetric) and has a long right tail (higher kurtosis).

Applying the log is a method for **normalizing the variables**: the range of the values shrinks after the transformation, which makes the data more suitable for Machine Learning algorithms.
- Preferably, apply the transformation to the whole dataset, so that all variables have the same order of magnitude.
- The transformation is not necessary for variables that already range from -100 to 100, the interval where most outputs of the log transformation fall.

#### **WARNING**: This function will eliminate rows where the selected variables present values lower than or equal to zero (the logarithm is only defined for positive values).
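
The effect of the transformation, and of the row elimination, can be sketched on simulated data. This is a minimal illustration under stated assumptions: the sample is hypothetical (1000 log-normal draws plus two invalid entries), and the natural logarithm `np.log` is assumed.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: 1000 draws from a log-normal distribution, plus two invalid entries
rng = np.random.default_rng(seed = 0)
values = np.append(rng.lognormal(mean = 0.0, sigma = 1.0, size = 1000), [0.0, -5.0])
data = pd.DataFrame({"y": values})

# The logarithm is only defined for positive values, so the other rows are dropped first:
positive = data[data["y"] > 0].copy()
positive["y_log"] = np.log(positive["y"])

# Skewness close to zero indicates a symmetric, normal-like distribution:
skew_before = positive["y"].skew()
skew_after = positive["y_log"].skew()

# The reverse transform (exponential) recovers the original values:
recovered = np.exp(positive["y_log"])

print(f"Rows kept: {len(positive)} of {len(data)}")
print(f"Skewness before: {skew_before:.2f}; after: {skew_after:.2f}")
```

The skewness of the transformed column is expected to be much closer to zero than that of the original one, which is the sense in which the log "normalizes" the variable.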

In [38]:
def log_transform (df, subset = None, create_new_columns = True, new_columns_suffix = "_log"):
    
    import numpy as np
    import pandas as pd
    
    #### WARNING: This function will eliminate rows where the selected variables present 
    #### values lower than or equal to zero (the logarithm is only defined for positive values).
    
    # subset = None
    # Set subset = None to transform the whole dataset. Alternatively, pass a list with 
    # columns names for the transformation to be applied. For instance:
    # subset = ['col1', 'col2', 'col3'] will apply the transformation to the columns named
    # as 'col1', 'col2', and 'col3'. Declare the names inside quotes.
    # Declaring the full list of columns is equivalent to setting subset = None.
    
    # create_new_columns = True
    # Set create_new_columns = True to store the transformed data into new
    # columns. Or set create_new_columns = False to overwrite the existing columns
    
    # new_columns_suffix = "_log"
    # This value has effect only if create_new_columns = True.
    # The new column name will be set as column + new_columns_suffix. Then, if the original
    # column was "column1" and the suffix is "_log", the new column will be named as
    # "column1_log".
    # Alternatively, input inside quotes a string with the desired suffix. Recommendation:
    # start the suffix with "_" to separate it from the original name.
    
    
    # Start a local copy of the dataframe:
    DATASET = df.copy(deep = True)
    
    # Check if a subset was defined. If so, make columns_list = subset 
    if not (subset is None):
        
        columns_list = subset
    
    else:
        #There is no declared subset. Then, make columns_list equal to the list of all
        # columns of the dataframe. The non-numeric columns are filtered out below.
        columns_list = list(DATASET.columns)
        
    # Let's check if there are categorical columns in columns_list. Only numerical
    # columns should remain
    # Start a support list:
    support_list = []
    # List the possible numeric data types for a Pandas dataframe column:
    numeric_dtypes = [np.int16, np.int32, np.int64, np.float16, np.float32, np.float64]
    
    # Loop through each column in columns_list:
    for column in columns_list:
        
        # Check the Pandas series (column) data type:
        column_type = DATASET[column].dtype
            
        # If it is not categorical (object), append it to the support list:
        if (column_type in numeric_dtypes):
                
            support_list.append(column)
    
    # Finally, make the columns_list support_list itself:
    columns_list = support_list
    
    #Loop through each column to apply the transform:
    for column in columns_list:
        #access each element in the list column_list. The element is named 'column'.
        
        #boolean filter to check if the entry is higher than zero, condition for the log
        # to be applied
        boolean_filter = (DATASET[column] > 0)
        #This filter is equals True only for the rows where the column is higher than zero.
        
        #Apply the boolean filter to the dataframe, removing the entries where the column
        # cannot be log transformed.
        # The boolean_filter selects only the rows for which the filter values are True.
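        # The explicit copy avoids the Pandas SettingWithCopyWarning when the new column
        # is assigned to the filtered dataframe.
        DATASET = DATASET[boolean_filter].copy()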
        
        #Check if a new column will be created, or if the original column should be
        # substituted.
        if (create_new_columns == True):
            # Create a new column.
            
            # The new column name will be set as column + new_columns_suffix
            new_column_name = column + new_columns_suffix
        
        else:
            # Overwrite the existing column. Simply set new_column_name as the value 'column'
            new_column_name = column
        
        # Calculate the column value as the log transform of the original series (column)
        DATASET[new_column_name] = np.log(DATASET[column])
    
    # Reset the index:
    # reset_index returns a new dataframe, so the result must be assigned back.
    DATASET = DATASET.reset_index(drop = True)
    
    print("The columns were successfully log-transformed. Check the 10 first rows of the new dataset:\n")
    
    try:
        # only works in Jupyter Notebook:
        from IPython.display import display
        display(DATASET.head(10))
            
    except: # regular mode
        print(DATASET.head(10))
    
    return DATASET

    # One curve derived from the normal is the log-normal.
    # If the values Y follow a log-normal distribution, their logarithms follow a normal.
    # A log-normal curve resembles a normal, but with skewness (distortion) 
    # and kurtosis (long tail).

    # Applying the log is a methodology for normalizing the variables: 
    # the sample space shrinks after the transformation, making the data more 
    # adequate for being processed by Machine Learning algorithms. Preferentially apply 
    # the transformation to the whole dataset, so that all variables will be of the same 
    # order of magnitude.
    # The transformation is not needed for variables that already range from about -100 
    # to 100, since most outputs of the log transformation fall within that interval.
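
The filter-then-log logic described above can be sketched on a toy dataframe. This is only a minimal illustration and not the notebook's function: the dataframe `toy_df` and its column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'flow' contains a zero and a negative value.
toy_df = pd.DataFrame({
    "flow": [10.0, 0.0, 1000.0, -5.0, 100.0],
    "label": ["a", "b", "c", "d", "e"],
})

# Keep only the rows where the logarithm is defined (values higher than zero).
positive_rows = toy_df[toy_df["flow"] > 0].copy()

# Store the natural logarithm in a new column, mirroring the "_log" suffix.
positive_rows["flow_log"] = np.log(positive_rows["flow"])

# reset_index returns a new dataframe, so the result is assigned back.
positive_rows = positive_rows.reset_index(drop=True)

print(positive_rows)
```

Two of the five rows are dropped here, which is why the warning about eliminated rows matters when the result is later joined with other data.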

# **Function for reversing the log-transform - applying the exponential transformation**

In [8]:
def reverse_log_transform (df, subset = None, create_new_columns = True, new_columns_suffix = "_originalScale"):
    
    import numpy as np
    import pandas as pd
    
    #### NOTE: Differently from log_transform, this function does not eliminate rows:
    #### the exponential can be applied to zero and to negative values.
    
    # subset = None
    # Set subset = None to transform the whole dataset. Alternatively, pass a list with 
    # columns names for the transformation to be applied. For instance:
    # subset = ['col1', 'col2', 'col3'] will apply the transformation to the columns named
    # as 'col1', 'col2', and 'col3'. Declare the names inside quotes.
    # Declaring the full list of columns is equivalent to setting subset = None.
    
    # create_new_columns = True
    # Alternatively, set create_new_columns = True to store the transformed data into new
    # columns. Or set create_new_columns = False to overwrite the existing columns
    
    # new_columns_suffix = "_originalScale"
    # This value has effect only if create_new_columns = True.
    # The new column name will be set as column + new_columns_suffix. Then, if the original
    # column was "column1" and the suffix is "_originalScale", the new column will be named 
    # as "column1_originalScale".
    # Alternatively, input inside quotes a string with the desired suffix. Recommendation:
    # start the suffix with "_" to separate it from the original name.
    
    
    # Start a local copy of the dataframe:
    DATASET = df.copy(deep = True)
    
    # Check if a subset was defined. If so, make columns_list = subset 
    if not (subset is None):
        
        columns_list = subset
    
    else:
        # There is no declared subset. Then, make columns_list equal to the list of
        # all columns of the dataframe. The non-numeric ones are removed below.
        columns_list = list(DATASET.columns)
        
    # Let's check if there are categorical columns in columns_list. Only numerical
    # columns should remain
    # Start a support list:
    support_list = []
    # List the possible numeric data types for a Pandas dataframe column:
    numeric_dtypes = [np.int16, np.int32, np.int64, np.float16, np.float32, np.float64]

    # Loop through each column in columns_list:
    for column in columns_list:
        
        # Check the Pandas series (column) data type:
        column_type = DATASET[column].dtype
            
        # If it is not categorical (object), append it to the support list:
        if (column_type in numeric_dtypes):
                
            support_list.append(column)
    
    # Finally, make the columns_list support_list itself:
    columns_list = support_list
    
    #Loop through each column to apply the transform:
    for column in columns_list:
        #access each element in the list column_list. The element is named 'column'.
        
        # The exponential transformation can be applied to zero and negative values,
        # so we remove the boolean filter.
        
        #Check if a new column will be created, or if the original column should be
        # substituted.
        if (create_new_columns == True):
            # Create a new column.
            
            # The new column name will be set as column + new_columns_suffix
            new_column_name = column + new_columns_suffix
        
        else:
            # Overwrite the existing column. Simply set new_column_name as the value 'column'
            new_column_name = column
        
        # Calculate the column value as the exponential of the original series (column)
        DATASET[new_column_name] = np.exp(DATASET[column])
    
    print("The log_transform was successfully reversed through the exponential transformation. Check the 10 first rows of the new dataset:\n")
    
    try:
        # only works in Jupyter Notebook:
        from IPython.display import display
        display(DATASET.head(10))
            
    except: # regular mode
        print(DATASET.head(10))
    
    return DATASET
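
A quick way to check that the exponential really undoes the log transform is a round trip on a small example. The sketch below assumes a hypothetical column named `mass`; the suffixes follow the defaults of the two functions above.

```python
import numpy as np
import pandas as pd

# Hypothetical strictly positive data.
toy_df = pd.DataFrame({"mass": [0.5, 2.0, 40.0, 300.0]})

# Forward transform (natural logarithm) stored with the "_log" suffix.
toy_df["mass_log"] = np.log(toy_df["mass"])

# Reverse transform (exponential) stored with the "_originalScale" suffix.
toy_df["mass_log_originalScale"] = np.exp(toy_df["mass_log"])

# The round trip should recover the original values up to floating-point error.
round_trip_ok = np.allclose(toy_df["mass"], toy_df["mass_log_originalScale"])
print(round_trip_ok)
```

Rows removed by the log transform (values lower than or equal to zero) are not recovered by the reverse step, since they are no longer in the dataframe.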

# **Function for obtaining and applying Box-Cox transform**
- Transforms the data into a series that is approximately described by the normal distribution.

In [10]:
def box_cox_transform (df, column_to_transform, mode = 'calculate_and_apply', lambda_boxcox = None, suffix = '_BoxCoxTransf', specification_limits = {'lower_spec_lim': None, 'upper_spec_lim': None}):
    
    import numpy as np
    import pandas as pd
    from statsmodels.stats import diagnostic
    from scipy import stats
    
    # This function will process a single column column_to_transform 
    # of the dataframe df per call.
    
    # Check https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.boxcox.html
    ## Box-Cox transform is given by:
    ## y = (x**lmbda - 1) / lmbda,  for lmbda != 0
    ## log(x),                  for lmbda = 0
    
    # column_to_transform must be a string with the name of the column.
    # e.g. column_to_transform = 'column1' to transform a column named as 'column1'
    
    # mode = 'calculate_and_apply'
    # Set mode = 'calculate_and_apply' to calculate lambda and apply the Box-Cox
    # transform; or mode = 'apply_only' to apply the transform for a known lambda.
    # To use 'apply_only', lambda_boxcox must be provided.
    
    # lambda_boxcox must be a float value. e.g. lambda_boxcox = 1.7
    # If you calculated lambda from the function box_cox_transform and saved the
    # transformation data summary dictionary as data_sum_dict, simply set:
    # lambda_boxcox = data_sum_dict['lambda_boxcox']
    # This will access the value on the key 'lambda_boxcox' of the dictionary, which
    # contains the lambda. 
    
    # Analogously, data_sum_dict['spec_lim_dict']['lower_spec_lim_transf'] accesses
    # the transformed lower specification limit; and 
    # data_sum_dict['spec_lim_dict']['upper_spec_lim_transf'] accesses the transformed 
    # upper specification limit.
    
    # If lambda_boxcox is None, 
    # the mode will be automatically set as 'calculate_and_apply'.
    
    # suffix: string (inside quotes).
    # How the transformed column will be identified in the returned dataframe.
    # If column_to_transform = 'Y' and suffix = '_BoxCoxTransf', the transformed column 
    # will be identified as 'Y_BoxCoxTransf'.
    # Alternatively, input inside quotes a string with the desired suffix. Recommendation:
    # start the suffix with "_" to separate it from the original name
    
    # specification_limits = {'lower_spec_lim': None, 'upper_spec_lim': None}
    # If there are specification limits, input them in this dictionary. Do not modify the keys,
    # simply substitute None by the lower and/or the upper specification.
    # e.g. Suppose you have a tank that cannot have more than 10 L. So:
    # specification_limits = {'lower_spec_lim': None, 'upper_spec_lim': 10}, there is only
    # an upper specification equals to 10 (do not add units);
    # Suppose a temperature cannot be lower than 10 ºC, but there is no upper specification. So,
    # specification_limits = {'lower_spec_lim': 10, 'upper_spec_lim': None}. Finally, suppose
    # a liquid which pH must be between 6.8 and 7.2:
    # specification_limits = {'lower_spec_lim': 6.8, 'upper_spec_lim': 7.2}
    
    if not (suffix is None):
        #only if a suffix was declared
        #concatenate the column name to the suffix
        new_col = column_to_transform + suffix
    
    else:
        #concatenate the column name to the standard '_BoxCoxTransf' suffix
        new_col = column_to_transform + '_BoxCoxTransf'
    
    boolean_check = (mode != 'calculate_and_apply') & (mode != 'apply_only')
    # & is the 'and' operator. != is the 'is different from' operator.
    #Check if neither 'calculate_and_apply' nor 'apply_only' were selected
    
    if ((lambda_boxcox is None) & (mode == 'apply_only')):
        print("Invalid value set for 'lambda_boxcox'. Setting mode to 'calculate_and_apply'.\n")
        mode = 'calculate_and_apply'
    
    elif (boolean_check == True):
        print("Invalid value set for 'mode'. Setting mode to 'calculate_and_apply'.\n")
        mode = 'calculate_and_apply'
    
    
    # Start a local copy of the dataframe:
    DATASET = df.copy(deep = True)
    
    print("Box-Cox transformation must be applied only to values higher than zero.\n")
    print("That is because it is a logarithmic transformation.\n")
    print(f"So, filtering out all values from {column_to_transform} lower than or equal to zero.\n")
    DATASET = DATASET[DATASET[column_to_transform] > 0]
    DATASET = DATASET.reset_index(drop = True)
    
    y = DATASET[column_to_transform]
    
    
    if (mode == 'calculate_and_apply'):
        # Calculate lambda_boxcox
        lambda_boxcox = stats.boxcox_normmax(y, method = 'pearsonr')
        # Calculate the lambda of the Box-Cox transformation by maximizing the Pearson
        # correlation coefficient between y = boxcox(x), where boxcox represents the
        # transformation, and the values expected for y if x was normally distributed.
    
    # For other cases, we will apply the lambda_boxcox set as the function parameter.

    # Calculate the transformed variable:
    y_transform = stats.boxcox(y, lmbda = lambda_boxcox, alpha = None)
    
    DATASET[new_col] = y_transform
    # DATASET now contains the transformed data in the column new_col.
    
    print("Data successfully transformed. Check the 10 first transformed rows:\n")
    
    try:
        # only works in Jupyter Notebook:
        from IPython.display import display
        display(DATASET.head(10))
            
    except: # regular mode
        print(DATASET.head(10))
        
    print("\n") #line break
    
    # Start a dictionary to store the summary results of the transform and the normality
    # tests:
    data_sum_dict = {'lambda_boxcox': lambda_boxcox}
    
    # Test normality of the transformed variable:
    # Scipy.stats’ normality test
    # It is based on D’Agostino and Pearson’s test that combines 
    # skew and kurtosis to produce an omnibus test of normality.
    _, scipystats_test_pval = stats.normaltest(y_transform) 
    # add this test result to the dictionary:
    data_sum_dict['dagostino_pearson_p_val'] = scipystats_test_pval
            
    # Scipy.stats’ Shapiro-Wilk test
    shapiro_test = stats.shapiro(y_transform)
    data_sum_dict['shapiro_wilk_p_val'] = shapiro_test[1]
    
    # Lilliefors’ normality test
    lilliefors_test = diagnostic.kstest_normal(y_transform, dist = 'norm', pvalmethod = 'table')
    data_sum_dict['lilliefors_p_val'] = lilliefors_test[1]
    
    # Anderson-Darling normality test
    ad_test = diagnostic.normal_ad(y_transform, axis = 0)
    data_sum_dict['anderson_darling_p_val'] = ad_test[1]
     
    print("Box-Cox Transformation Summary:\n")
    try:
        display(data_sum_dict)     
    except:
        print(data_sum_dict)
    
    print("\n") #line break
    
    lower_spec = specification_limits['lower_spec_lim']
    upper_spec = specification_limits['upper_spec_lim']
    
    if not ((lower_spec is None) and (upper_spec is None)):
        
        # Start a dictionary to store the transformed limits:
        spec_lim_dict = {}
        
        # Transform each limit separately. Then, when one of the limits is ignored, the
        # other one is still stored under the correct key.
        for spec_name, spec_value in [('lower', lower_spec), ('upper', upper_spec)]:
            
            if (spec_value is None):
                continue
            
            if (spec_value <= 0):
                print(f"Box-Cox transform can only be applied to values higher than zero. So, ignoring the {spec_name} specification.\n")
                continue
            
            # Apply the Box-Cox definition manually:
            ## y = (x**lmbda - 1) / lmbda,  for lmbda != 0
            ## log(x),                  for lmbda = 0
            if (lambda_boxcox == 0):
                spec_lim_transf = np.log(spec_value)
            
            else:
                spec_lim_transf = ((spec_value**lambda_boxcox) - 1)/(lambda_boxcox)
            
            # Keys: 'lower_spec_lim_transf' and 'upper_spec_lim_transf'
            spec_lim_dict[spec_name + '_spec_lim_transf'] = spec_lim_transf
        
        print("New specification limits successfully obtained:\n")
        try:
            display(spec_lim_dict)     
        except:
            print(spec_lim_dict)
        
        # Add spec_lim_dict as a new element from data_sum_dict:
        data_sum_dict['spec_lim_dict'] = spec_lim_dict
    
    
    return DATASET, data_sum_dict
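
The core of the function above can be condensed into a few SciPy calls. The sketch below is a hedged illustration under stated assumptions: the sample `x`, the random seed and the specification limit `upper_spec` are hypothetical, and the normality tests are left out for brevity.

```python
import numpy as np
from scipy import stats

# Hypothetical right-skewed, strictly positive sample.
rng = np.random.default_rng(0)
x = rng.lognormal(mean=1.0, sigma=0.6, size=200)

# Estimate lambda by maximizing the Pearson correlation coefficient.
lam = stats.boxcox_normmax(x, method="pearsonr")

# Apply the transform for the estimated lambda.
x_bc = stats.boxcox(x, lmbda=lam)

# Manual application of the Box-Cox definition, for comparison.
if lam == 0:
    x_manual = np.log(x)
else:
    x_manual = (x**lam - 1) / lam

# A specification limit must be transformed with the same lambda
# so that it stays comparable with the transformed data.
upper_spec = 10.0  # hypothetical upper specification limit
if lam == 0:
    upper_spec_transf = np.log(upper_spec)
else:
    upper_spec_transf = (upper_spec**lam - 1) / lam

print(lam, upper_spec_transf)
```

Saving `lam` is what makes the `'apply_only'` mode and the reverse transform possible later on.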

# **Function for reversing Box-Cox transform**

In [11]:
def reverse_box_cox (df, column_to_transform, lambda_boxcox, suffix = '_ReversedBoxCox'):
    
    import numpy as np
    import pandas as pd
    
    # This function will process a single column column_to_transform 
    # of the dataframe df per call.
    
    # Check https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.boxcox.html
    ## Box-Cox transform is given by:
    ## y = (x**lmbda - 1) / lmbda,  for lmbda != 0
    ## log(x),                  for lmbda = 0
    
    # column_to_transform must be a string with the name of the column.
    # e.g. column_to_transform = 'column1' to transform a column named as 'column1'
    
    # lambda_boxcox must be a float value. e.g. lambda_boxcox = 1.7
    # If you calculated lambda from the function box_cox_transform and saved the
    # transformation data summary dictionary as data_sum_dict, simply set:
    # lambda_boxcox = data_sum_dict['lambda_boxcox']
    # This will access the value on the key 'lambda_boxcox' of the dictionary, which
    # contains the lambda. 
    
    # Analogously, data_sum_dict['spec_lim_dict']['lower_spec_lim_transf'] accesses
    # the transformed lower specification limit; and 
    # data_sum_dict['spec_lim_dict']['upper_spec_lim_transf'] accesses the transformed 
    # upper specification limit.
    
    #suffix: string (inside quotes).
    # How the transformed column will be identified in the returned dataframe.
    # If column_to_transform = 'Y' and suffix = '_ReversedBoxCox', the transformed column 
    # will be identified as 'Y_ReversedBoxCox'.
    # Alternatively, input inside quotes a string with the desired suffix. Recommendation:
    # start the suffix with "_" to separate it from the original name
    
    
    # Start a local copy of the dataframe:
    DATASET = df.copy(deep = True)

    y = DATASET[column_to_transform]
    
    if (lambda_boxcox == 0):
        #ytransf = np.log(y), according to Box-Cox definition. Then
        #y_retransform = np.exp(y)
        #In the case of this function, ytransf is passed as the argument y.
        y_transform = np.exp(y)
    
    else:
        #apply Box-Cox function:
        #y_transf = (y**lmbda - 1) / lmbda. Then,
        #y_retransf ** (lmbda) = (y_transf * lmbda) + 1
        #y_retransf = ((y_transf * lmbda) + 1) ** (1/lmbda), where ** is the potentiation
        #In the case of this function, ytransf is passed as the argument y.
        y_transform = ((y * lambda_boxcox) + 1) ** (1/lambda_boxcox)
    
    if not (suffix is None):
        #only if a suffix was declared
        #concatenate the column name to the suffix
        new_col = column_to_transform + suffix
    
    else:
        #concatenate the column name to the standard '_ReversedBoxCox' suffix
        new_col = column_to_transform + '_ReversedBoxCox'
    
    DATASET[new_col] = y_transform
    # DATASET now contains the retransformed data in the column new_col.
    
    print("Data successfully retransformed. Check the 10 first retransformed rows:\n")
    
    try:
        # only works in Jupyter Notebook:
        from IPython.display import display
        display(DATASET.head(10))
            
    except: # regular mode
        print(DATASET.head(10))
    
    print("\n") #line break
 
    return DATASET
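
The inversion formula used above can be checked against SciPy's own inverse, `scipy.special.inv_boxcox`. This is a minimal sketch with hypothetical data and a hypothetical lambda of 0.3, not a replacement for the function.

```python
import numpy as np
from scipy import special, stats

# Hypothetical strictly positive data and a known lambda.
x = np.array([0.5, 1.0, 2.5, 7.0, 20.0])
lam = 0.3

# Forward Box-Cox transform for the known lambda.
x_bc = stats.boxcox(x, lmbda=lam)

# Manual inverse, valid for lambda different from zero:
# x = ((y * lambda) + 1) ** (1 / lambda)
x_back_manual = (x_bc * lam + 1) ** (1 / lam)

# SciPy's inverse of the Box-Cox transform.
x_back_scipy = special.inv_boxcox(x_bc, lam)

# For lambda equal to zero, the transform is the logarithm,
# so the inverse is the exponential.
x_back_zero = np.exp(stats.boxcox(x, lmbda=0))

print(x_back_manual)
```

All three reconstructions should match the original array up to floating-point error.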

# **Function for One-Hot Encoding categorical features**
- Transform categorical values without notion of order into numerical (binary) features.
- For each category, the One-Hot Encoder creates a new column in the dataset. This new column is a binary variable which is equal to zero if the row is not classified in that category, and equal to 1 when the row represents an element of that category.
- The new columns will be named as the original column + "_" + category + "_OneHotEnc".
- Each column is a binary variable of the type "is classified in this category or not".

Therefore, for a category "A" of a column named "column1", a column named "column1_A_OneHotEnc" is created.
- If the row is an element from category "A", the value for the column "column1_A_OneHotEnc" is 1.
- If not, the value for this column is 0.
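
As a small illustration of the naming and of the binary columns described above, the sketch below encodes a hypothetical `color` column with scikit-learn's `OneHotEncoder`. The dataframe and its values are assumptions made only for this example.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical column without notion of order.
toy_df = pd.DataFrame({"color": ["red", "blue", "red", "green"]})

# Fit the encoder on a single-column dataframe and get a dense array.
encoder = OneHotEncoder()
encoded = encoder.fit_transform(toy_df[["color"]]).toarray()

# Name the new columns as original column + "_" + category + "_OneHotEnc".
new_columns = [
    "color_" + str(category) + "_OneHotEnc"
    for category in encoder.categories_[0]
]

# Keep the original index so that the rows stay aligned when concatenating.
encoded_df = pd.DataFrame(encoded, columns=new_columns, index=toy_df.index)
result = pd.concat([toy_df, encoded_df], axis=1)

print(result)
```

Each row has exactly one encoded column equal to 1, namely the one that corresponds to its category.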

In [14]:
def OneHotEncode_df (df, subset_of_features_to_be_encoded):

    import numpy as np
    import pandas as pd
    from sklearn.preprocessing import OneHotEncoder
    # https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
    
    # df: the whole dataframe to be processed.
    
    # subset_of_features_to_be_encoded: list of strings (inside quotes), 
    # containing the names of the columns with the categorical variables that will be 
    # encoded. If a single column will be encoded, declare this parameter as list with
    # only one element e.g.subset_of_features_to_be_encoded = ["column1"] 
    # will analyze the column named as 'column1'; 
    # subset_of_features_to_be_encoded = ["col1", 'col2', 'col3'] will analyze 3 columns
    # with categorical variables: 'col1', 'col2', and 'col3'.
    
    # Start an empty encoding list (it will store one dictionary per encoded column):
    encoding_list = []
    
    # Start a copy of the original dataframe. This copy will be updated to create the new
    # transformed dataframe. Then, we avoid manipulating the original object.
    new_df = df.copy(deep = True)
    
    try:
        # only works in Jupyter Notebook:
        from IPython.display import display  
    except:
        pass
    
    #loop through each column of the subset:
    for column in subset_of_features_to_be_encoded:
        
        # Start two empty dictionaries:
        encoding_dict = {}
        nested_dict = {}
        
        # Add the column to encoding_dict as the key 'column':
        encoding_dict['column'] = column
        
        # Loop through each element (named 'column') of the list of columns to analyze,
        # subset_of_features_to_be_encoded
        
        # We could process the whole subset at once, but it could make us lose information
        # about the generated columns
        
        # set a subset of the dataframe X containing 'column' as the only column:
        # it will be equivalent to using .reshape(-1,1) to set a 1D-series
        # or array in the shape for scikit-learn:
        # For doing so, pass a list of columns for column filtering, containing
        # the object column as its single element:
        X  = df[[column]]
        
        #Start the OneHotEncoder object:
        OneHot_enc_obj = OneHotEncoder()
        
        #Fit the object to that column:
        OneHot_enc_obj = OneHot_enc_obj.fit(X)
        # Get the transformed columns as a SciPy sparse matrix: 
        transformed_columns = OneHot_enc_obj.transform(X)
        # Convert the sparse matrix to a NumPy dense array:
        transformed_columns = transformed_columns.toarray()
        
        # Now, let's retrieve the encoding information and save it:
        # Show encoded categories and store this array. 
        # It will give the proper columns' names:
        encoded_columns = OneHot_enc_obj.categories_

        # encoded_columns is an array containing a single element.
        # This element is an array like:
        # array(['cat1', 'cat2', 'cat3', 'cat4', 'cat5', 'cat6', 'cat7', 'cat8'], dtype=object)
        # Then, this array is the element of index 0 from the list encoded_columns.
        # It is represented as encoded_columns[0]

        # Therefore, we actually want the array which is named as encoded_columns[0]
        # Each element of this array is the name of one of the encoded columns. In the
        # example above, the element 'cat2' would be accessed as encoded_columns[0][1],
        # since it is the element of index [1] (second element) from the array 
        # encoded_columns[0].
        
        new_columns = encoded_columns[0]
        # To identify the column that originated these new columns, we can join the
        # string column to each element from new_columns:
        
        # Update the nested dictionary: store the new_columns as the key 'categories':
        nested_dict['categories'] = new_columns
        # Store the encoder object as the key 'OneHot_enc_obj'
        # Add the encoder object to the dictionary:
        nested_dict['OneHot_enc_obj'] = OneHot_enc_obj
        
        # Store the nested dictionary in the encoding_dict as the key 'OneHot_encoder':
        encoding_dict['OneHot_encoder'] = nested_dict
        # Append the encoding_dict as an element from list encoding_list:
        encoding_list.append(encoding_dict)
        
        # Now we saved all encoding information, let's transform the data:
        
        # Start a support_list to store the concatenated strings:
        support_list = []
        
        for encoded_col in new_columns:
            # Use the str function to guarantee that each category is converted to a string.
            # Add an underscore "_" to separate the strings and an identifier of the transform:
            new_column = column + "_" + str(encoded_col) + "_OneHotEnc"
            # Append it to the support_list:
            support_list.append(new_column)
            
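        # Convert the support list to a NumPy array, and make new_columns the array itself:
        new_columns = np.array(support_list)
        
        # Store the names of the encoded columns in the nested dictionary as the key
        # 'encoded_columns'. This key is required by the reverse_OneHotEncode function.
        # Since encoding_dict references nested_dict, the element already appended to
        # encoding_list is updated as well.
        nested_dict['encoded_columns'] = new_columns
        
        # Create a Pandas dataframe from the array transformed_columns, keeping the index
        # of the dataframe so that the rows remain aligned in the concatenation:
        encoded_X_df = pd.DataFrame(transformed_columns, index = new_df.index)
        
        # Modify the name of the columns to make it equal to new_columns:
        encoded_X_df.columns = new_columns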
        
        #Inner join the new dataset with the encoded dataset.
        # Use the index as the key, since indices are necessarily correspondent.
        # To use join on index, we apply pandas .concat method.
        # To join on a specific key, we could use pandas .merge method with the arguments
        # left_on = 'left_key', right_on = 'right_key'; or, if the keys have same name,
        # on = 'key':
        # Check Pandas merge and concat documentation:
        # https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge.html
        
        new_df = pd.concat([new_df, encoded_X_df], axis = 1, join = "inner")
        # When axis = 0, the .concat operation occurs in the row level, so the rows
        # of the second dataframe are added to the bottom of the first one.
        # It is the SQL union, and creates a dataframe with more rows, and
        # total of columns equals to the total of columns of the first dataframe
        # plus the columns of the second one that were not in the first dataframe.
        # When axis = 1, the operation occurs in the column level: the two
        # dataframes are laterally merged using the index as the key, 
        # preserving all columns from both dataframes. Therefore, the number of
        # rows will be the total of rows of the dataframe with more entries,
        # and the total of columns will be the sum of the total of columns of
        # the first dataframe with the total of columns of the second dataframe.
        
        print(f"Successfully encoded column \'{column}\' and merged the encoded columns to the dataframe.\n")
        print("Check first 5 rows of the encoded table that was merged:\n")
        
        try:
            display(encoded_X_df.head())
        except: # regular mode
            print(encoded_X_df.head())
        
        # The default of the head method, when no parameter is passed, is to show 5 rows; if an
        # integer number Y is passed as argument .head(Y), Pandas shows the first Y rows.
        print("\n")
        
    print("Finished One-Hot Encoding. Returning the new transformed dataframe; and an encoding list.\n")
    print("Each element from this list is a dictionary with the original column name as key \'column\', and a nested dictionary as the key \'OneHot_encoder\'.\n")
    print("In turn, the nested dictionary stores the different categories as the key \'categories\' and the encoder object as the key \'OneHot_enc_obj\'.\n")
    print("Use the encoder object to inverse the One-Hot Encoding in the correspondent function.\n")
    print(f"For each category in the columns \'{subset_of_features_to_be_encoded}\', a new column has value 1, if it is the actual category of that row; or is 0 if not.\n")
    print("Check the first 10 rows of the new dataframe:\n")
    
    try:
        display(new_df.head(10))
    except:
        print(new_df.head(10))

    #return the transformed dataframe and the encoding list:
    return new_df, encoding_list

# **Function for Reversing One-Hot Encoding of categorical features**

In [15]:
def reverse_OneHotEncode (df, encoding_list):

    import pandas as pd
    from sklearn.preprocessing import OneHotEncoder
    # https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
    
    # df: the whole dataframe to be processed.
    
    # encoding_list: list of dictionaries, such as the one generated by the OneHotEncode_df
    # function. Each dictionary must contain two keys:
    # key 'column': string with the original column name (in quotes); 
    # key 'OneHot_encoder': this key must store a nested dictionary.
    # The nested dictionary must contain two keys:
    # key 'OneHot_enc_obj': the fitted encoder object;
    # key 'encoded_columns': list or array with the names of the columns obtained from
    # the encoding. If this key is not present, add it before calling this function.
    # The key 'categories', which stores an array with the different categories, is not
    # required by this function.
    
    
    # Start a copy of the original dataframe. This copy will be updated to create the new
    # transformed dataframe. Then, we avoid manipulating the original object.
    new_df = df.copy(deep = True)
    
    try:
        # only works in Jupyter Notebook:
        from IPython.display import display  
    except:
        pass
    
    for encoder_dict in encoding_list:
        
        try:
            # Check if the required arguments are present:
            if ((encoder_dict['column'] is not None) & (encoder_dict['OneHot_encoder']['OneHot_enc_obj'] is not None) & (encoder_dict['OneHot_encoder']['encoded_columns'] is not None)):

                # Access the column name:
                col_name = encoder_dict['column']

                # Access the nested dictionary:
                nested_dict = encoder_dict['OneHot_encoder']
                # Access the encoder object on the dictionary
                OneHot_enc_obj = nested_dict['OneHot_enc_obj']
                # Access the list of encoded columns:
                list_of_encoded_cols = list(nested_dict['encoded_columns'])

                # Get a subset of the encoded columns
                X = new_df.copy(deep = True)
                X = X[list_of_encoded_cols]

                # Reverse the encoding:
                reversed_array = OneHot_enc_obj.inverse_transform(X)

                # Add the reversed array as the column col_name on the dataframe:
                new_df[col_name] = reversed_array
                
                print(f"Reversed the encoding for {col_name}. Check the 5 first rows of the re-transformed series:\n")
                
                try:
                    display(new_df[[col_name]].head())
                except:
                    print(new_df[[col_name]].head())
                
                print("\n")
            
        except:
            print("Detected dictionary with incorrect keys or format. Unable to reverse encoding. Please, correct it.\n")
    
    print("Finished reversing One-Hot Encoding. Returning the new transformed dataframe.\n")
    print("Check the first 10 rows of the new dataframe:\n")
    
    try:
        # only works in Jupyter Notebook:
        from IPython.display import display
        display(new_df.head(10))
            
    except: # regular mode
        print(new_df.head(10))

    #return the transformed dataframe:
    return new_df

# **Function for Ordinal Encoding categorical features**
- Transform categorical values with a notion of order into numerical (integer) features.
- For each column, the Ordinal Encoder creates a new column in the dataset. This new column holds integer values, where each integer represents one possible category.
- The new columns will be named as the original column + "_OrdinalEnc".

#### WARNING: Machine Learning algorithms assume that close values represent similarity and order. If there is no order and no distance associated to each ordinal, use the One-Hot Encoding for converting categorical to numerical values.
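
A minimal sketch of why the order matters, assuming scikit-learn is installed (the categories `low`, `medium`, and `high` are hypothetical). By default, `OrdinalEncoder` sorts the categories it finds, which for strings means alphabetical order, so the integers may not follow the real ordering. The `categories` parameter declares the intended order explicitly. The function below relies on the default behavior.

```python
from sklearn.preprocessing import OrdinalEncoder

# One categorical feature, already in the 2D shape expected by scikit-learn:
X = [["low"], ["high"], ["medium"], ["low"]]

# Default behavior: the categories are sorted, so 'high' < 'low' < 'medium'.
default_enc = OrdinalEncoder().fit(X)
default_codes = default_enc.transform(X)[:, 0].tolist()

# Explicit order: 'low' < 'medium' < 'high'.
ordered_enc = OrdinalEncoder(categories=[["low", "medium", "high"]]).fit(X)
ordered_codes = ordered_enc.transform(X)[:, 0].tolist()

print(default_enc.categories_[0].tolist(), default_codes)
print(ordered_enc.categories_[0].tolist(), ordered_codes)
```

With the default settings, `medium` receives the largest integer, although it is not the largest category.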

In [16]:
def OrdinalEncode_df (df, subset_of_features_to_be_encoded):

    # Ordinal encoding: associates sequential integer numbers to the categories of each
    # categorical column. If there is no ordering relation among the categories, the
    # One-Hot Encoding is usually the better choice. The Ordinal Encoding is simpler,
    # since it creates a single new column for each encoded feature.
    
    import numpy as np
    import pandas as pd
    from sklearn.preprocessing import OrdinalEncoder
    # https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html#sklearn.preprocessing.OrdinalEncoder 
    
    # df: the whole dataframe to be processed.
    
    # subset_of_features_to_be_encoded: list of strings (inside quotes), 
    # containing the names of the columns with the categorical variables that will be 
    # encoded. If a single column will be encoded, declare this parameter as a list with
    # only one element, e.g. subset_of_features_to_be_encoded = ["column1"] 
    # will encode the column named as 'column1'; 
    # subset_of_features_to_be_encoded = ["col1", 'col2', 'col3'] will encode 3 columns
    # with categorical variables: 'col1', 'col2', and 'col3'.
    
    #Start an encoding list empty (it will be a JSON object):
    encoding_list = []
    
    # Start a copy of the original dataframe. This copy will be updated to create the new
    # transformed dataframe. Then, we avoid manipulating the original object.
    new_df = df.copy(deep = True)
    
    try:
        # only works in Jupyter Notebook:
        from IPython.display import display  
    except:
        pass
   
    #loop through each column of the subset:
    for column in subset_of_features_to_be_encoded:
        
        # Start two empty dictionaries:
        encoding_dict = {}
        nested_dict = {}
        
        # Add the column to encoding_dict as the key 'column':
        encoding_dict['column'] = column
        
        # Loop through each element (named 'column') of the list of columns to analyze,
        # subset_of_features_to_be_encoded
        
        # We could process the whole subset at once, but it could make us lose information
        # about the generated columns
        
        # set a subset of the dataframe X containing 'column' as the only column:
        # it will be equivalent to using .reshape(-1,1) to set a 1D-series
        # or array in the shape for scikit-learn:
        # For doing so, pass a list of columns for column filtering, containing
        # the object column as its single element:
        X  = new_df[[column]]
        
        #Start the OrdinalEncoder object:
        ordinal_enc_obj = OrdinalEncoder()
        
        # Fit the ordinal encoder to the dataframe X:
        ordinal_enc_obj = ordinal_enc_obj.fit(X)
        # Get the transformed dataframe X: 
        transformed_X = ordinal_enc_obj.transform(X)
        # transformed_X is an array of arrays like: [[0.], [0.], ..., [8.]]
        # We want all the values in the first position of the internal arrays:
        transformed_X = transformed_X[:,0]
        # Get the encoded series as a NumPy array:
        encoded_series = np.array(transformed_X)
        
        # Get a name for the new encoded column:
        new_column = column + "_OrdinalEnc"
        # Add this column to the dataframe:
        new_df[new_column] = encoded_series
        
        # Now, let's retrieve the encoding information and save it:
        # Store the categories found by the encoder:
        encoded_categories = ordinal_enc_obj.categories_

        # encoded_categories is a list with one array for each encoded feature (here, a
        # single array). The position of a category in the array is the integer assigned to it.
        
        
        # Update the nested dictionary: store the categories as the key 'categories':
        nested_dict['categories'] = encoded_categories
        # Store the encoder object as the key 'ordinal_enc_obj'
        # Add the encoder object to the dictionary:
        nested_dict['ordinal_enc_obj'] = ordinal_enc_obj
        
        # Store the nested dictionary in the encoding_dict as the key 'ordinal_encoder':
        encoding_dict['ordinal_encoder'] = nested_dict
        # Append the encoding_dict as an element from list encoding_list:
        encoding_list.append(encoding_dict)
        
        print(f"Successfully encoded column \'{column}\' and added the encoded column to the dataframe.\n")
        print("Check first 5 rows of the encoded series that was merged:\n")
        
        try:
            display(new_df[[new_column]].head())
        except:
            print(new_df[[new_column]].head())
        
        # The default of the head method, when no parameter is passed, is to show 5 rows; if an
        # integer number Y is passed as argument .head(Y), Pandas shows the first Y rows.
        print("\n")
        
    print("Finished Ordinal Encoding. Returning the new transformed dataframe; and an encoding list.\n")
    print("Each element from this list is a dictionary with the original column name as key \'column\', and a nested dictionary as the key \'ordinal_encoder\'.\n")
    print("In turn, the nested dictionary shows the different categories as key \'categories\' and the encoder object as the key \'ordinal_enc_obj\'.\n")
    print("Use the encoder object to reverse the Ordinal Encoding in the corresponding function.\n")
    print("Check the first 10 rows of the new dataframe:\n")
    
    try:
        display(new_df.head(10))
    except:
        print(new_df.head(10))
    
    #return the transformed dataframe and the encoding dictionary:
    return new_df, encoding_list

# **Function for Reversing Ordinal Encoding of categorical features**
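
The function below expects each nested dictionary to contain a key `'encoded_column'`, which the list returned by `OrdinalEncode_df` does not include. The following is a minimal sketch of one way to prepare that list, assuming the encoded columns kept the default `"_OrdinalEnc"` suffix. The helper name `add_encoded_column_key` is hypothetical, and a placeholder object stands in for the fitted encoder.

```python
def add_encoded_column_key(encoding_list, suffix="_OrdinalEnc"):
    """Hypothetical helper: return a copy of encoding_list in which each nested
    dictionary also stores the name of the encoded column as 'encoded_column'."""
    prepared_list = []
    for encoding_dict in encoding_list:
        # Shallow copy of the nested dictionary, so the input list is not modified:
        nested_dict = dict(encoding_dict["ordinal_encoder"])
        nested_dict["encoded_column"] = encoding_dict["column"] + suffix
        prepared_list.append({
            "column": encoding_dict["column"],
            "ordinal_encoder": nested_dict,
        })
    return prepared_list


# Example with the same structure returned by OrdinalEncode_df
# (object() is only a placeholder for the fitted encoder):
example_list = [
    {"column": "size",
     "ordinal_encoder": {"categories": None, "ordinal_enc_obj": object()}}
]
prepared_list = add_encoded_column_key(example_list)
print(prepared_list[0]["ordinal_encoder"]["encoded_column"])
```

If the encoded columns were renamed after the encoding, the names must be set manually instead.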

In [17]:
def reverse_OrdinalEncode (df, encoding_list):

    import pandas as pd
    from sklearn.preprocessing import OrdinalEncoder
    # https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html#sklearn.preprocessing.OrdinalEncoder
    
    # df: the whole dataframe to be processed.
    
    # encoding_list: list in the same format of the one generated by OrdinalEncode_df function:
    # it must be a list of dictionaries where each dictionary contains two keys:
    # key 'column': string with the original column name (in quotes); 
    # key 'ordinal_encoder': this key must store a nested dictionary.
    # Even though the nested dictionaries generated by the encoding function present
    # two keys:  'categories', storing an array with the different categories;
    # and 'ordinal_enc_obj', storing the encoder object, only the key 'ordinal_enc_obj' is required.
    ## On the other hand, a third key is needed in the nested dictionary:
    ## key 'encoded_column': this key must store a string with the name of the column
    # obtained from Encoding.
    
    
    # Start a copy of the original dataframe. This copy will be updated to create the new
    # transformed dataframe. Then, we avoid manipulating the original object.
    new_df = df.copy(deep = True)
    
    try:
        # only works in Jupyter Notebook:
        from IPython.display import display  
    except:
        pass
   
    for encoder_dict in encoding_list:
        
        try:
            # Check if the required arguments are present:
            if ((encoder_dict['column'] is not None) & (encoder_dict['ordinal_encoder']['ordinal_enc_obj'] is not None) & (encoder_dict['ordinal_encoder']['encoded_column'] is not None)):

                # Access the column name:
                col_name = encoder_dict['column']

                # Access the nested dictionary:
                nested_dict = encoder_dict['ordinal_encoder']
                # Access the encoder object on the dictionary
                ordinal_enc_obj = nested_dict['ordinal_enc_obj']
                # Access the encoded column and save it as a list:
                list_of_encoded_cols = [nested_dict['encoded_column']]
                # In the One-Hot Encoding we have an array of strings, and calling list()
                # converts the array to a list. Here, in turn, we have a single string,
                # which is also an iterable object: calling list() on a string
                # creates a list with the characters of that string.
                # So, here we create a list with the string as its single element.

                # Get a subset of the encoded column
                X = new_df.copy(deep = True)
                X = X[list_of_encoded_cols]

                # Reverse the encoding:
                reversed_array = ordinal_enc_obj.inverse_transform(X)

                # Add the reversed array as the column col_name on the dataframe:
                new_df[col_name] = reversed_array
                    
                print(f"Reversed the encoding for {col_name}. Check the 5 first rows of the re-transformed series:\n")
                
                try:
                    display(new_df[[col_name]].head())
                except:
                    print(new_df[[col_name]].head())

                print("\n")
                   
        except:
            print("Detected dictionary with incorrect keys or format. Unable to reverse encoding. Please, correct it.\n")
    
    
    print("Finished reversing Ordinal Encoding. Returning the new transformed dataframe.\n")
    print("Check the first 10 rows of the new dataframe:\n")
    
    try:
        display(new_df.head(10))
    except:
        print(new_df.head(10))

    #return the transformed dataframe:
    return new_df

# **Function for scaling the features**
- Many Machine Learning algorithms (e.g. those based on distances or on gradient descent) are sensitive to the scale of the features. This function provides 4 methods (modes) of scaling:
    - `mode = 'standard'`: applies the standard scaling, which creates a new variable with mean = 0 and standard deviation = 1. Each value Y is transformed as Ytransf = (Y - u)/s, where u is the mean of the training samples, and s is the standard deviation of the training samples.
    - `mode = 'min_max'`: applies min-max normalization, with a resultant feature ranging from 0 to 1. Each value Y is transformed as Ytransf = (Y - Ymin)/(Ymax - Ymin), where Ymin and Ymax are the minimum and maximum values of Y, respectively.
    - `mode = 'factor'`: divides the whole series by a numeric value provided as argument. For a factor F, the new Y values will be Ytransf = Y/F.
    - `mode = 'normalize_by_maximum'`: similar to `mode = 'factor'`, but the factor is the maximum value of the column. Available only when `scale_with_new_params = True`.
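
The formulas above can be checked by hand on a small series. The sketch below uses only the Python standard library and hypothetical values. It uses the population standard deviation (`pstdev`), which is the convention adopted by scikit-learn's `StandardScaler`.

```python
from statistics import mean, pstdev

values = [2.0, 4.0, 6.0, 8.0]

# mode = 'standard': Ytransf = (Y - u)/s
mu = mean(values)
sigma = pstdev(values)  # population standard deviation (ddof = 0)
standard = [(y - mu) / sigma for y in values]

# mode = 'min_max': Ytransf = (Y - Ymin)/(Ymax - Ymin)
y_min, y_max = min(values), max(values)
min_max = [(y - y_min) / (y_max - y_min) for y in values]

# mode = 'factor': Ytransf = Y/F
factor = 2.0
by_factor = [y / factor for y in values]

print(standard)
print(min_max)
print(by_factor)
```

The standardized series has mean 0 and standard deviation 1, the min-max series ranges from 0 to 1, and the factor-scaled series is simply the original one divided by 2.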

In [18]:
def feature_scaling (df, subset_of_features_to_scale, mode = 'min_max', scale_with_new_params = True, list_of_scaling_params = None, suffix = '_scaled'):
    
    import numpy as np
    import pandas as pd
    from sklearn.preprocessing import StandardScaler
    from sklearn.preprocessing import MinMaxScaler
    # Scikit-learn Preprocessing data guide:
    # https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-scaler
    # Standard scaler documentation:
    # https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
    # Min-Max scaler documentation:
    # https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html#sklearn.preprocessing.MinMaxScaler.set_params
    
    ## Machine Learning algorithms are extremely sensitive to scale. 
    
    ## This function provides 4 methods (modes) of scaling:
    ## mode = 'standard': applies the standard scaling, 
    ##  which creates a new variable with mean = 0; and standard deviation = 1.
    ##  Each value Y is transformed as Ytransf = (Y - u)/s, where u is the mean 
    ##  of the training samples, and s is the standard deviation of the training samples.
    
    ## mode = 'min_max': applies min-max normalization, with a resultant feature 
    ## ranging from 0 to 1. each value Y is transformed as 
    ## Ytransf = (Y - Ymin)/(Ymax - Ymin), where Ymin and Ymax are the minimum and 
    ## maximum values of Y, respectively.
    
    ## mode = 'factor': divides the whole series by a numeric value provided as argument. 
    ## For a factor F, the new Y values will be Ytransf = Y/F.
    
    ## mode = 'normalize_by_maximum' is similar to mode = 'factor', but the factor will be selected
    # as the maximum value. This mode is available only for scale_with_new_params = True. If
    # scale_with_new_params = False, you should provide the value of the maximum as a division 'factor'.
    
    # df: the whole dataframe to be processed.
    
    # subset_of_features_to_scale: list of strings (inside quotes), 
    # containing the names of the columns with the numeric variables that will be 
    # scaled. If a single column will be scaled, declare this parameter as a list with
    # only one element, e.g. subset_of_features_to_scale = ["column1"] 
    # will scale the column named as 'column1'; 
    # subset_of_features_to_scale = ["col1", 'col2', 'col3'] will scale 3 columns:
    # 'col1', 'col2', and 'col3'.
    
    # scale_with_new_params = True
    # Set scale_with_new_params = True if you want to calculate a new
    # scaler for the data; or set scale_with_new_params = False if you want to apply 
    # parameters previously obtained to the data (i.e., if you want to apply a scaler
    # previously trained to another set of data; or want to simply apply the same
    # scaler again).
    
    # list_of_scaling_params:
    # This variable has effect only when scale_with_new_params = False
    ## WARNING: The mode 'factor' demands the input of the list of factors that will be 
    # used for normalizing each column. Therefore, it can be used only 
    # when scale_with_new_params = False.
    
    # list_of_scaling_params is a list of dictionaries with the same format of the list returned
    # from this function. Each dictionary must correspond to one of the features that will be scaled,
    # but the list does not have to be in the same order of the columns - the function checks the
    # key 'column' of each dictionary.
    # The first key of the dictionary must be 'column'. This key must store a string with the exact
    # name of the column that will be scaled.
    # The second key must be 'scaler'. This key must store a dictionary. The dictionary must store
    # one of two keys: 'scaler_obj' - sklearn scaler object to be used; or 'scaler_details' - the
    # numeric parameters for re-calculating the scaler without the object. The key 'scaler_details', 
    # must contain a nested dictionary. For the mode 'min_max', this dictionary should contain 
    # two keys: 'min', with the minimum value of the variable, and 'max', with the maximum value. 
    # For mode 'standard', the keys should be 'mu', with the mean value, and 'sigma', with its 
    # standard deviation. For the mode 'factor', the key should be 'factor', and should contain the 
    # factor for division (the scaling value. e.g 'factor': 2.0 will divide the column by 2.0.).
    # Again, if you want to normalize by the maximum, declare the maximum value as any other factor for
    # division.
    # The key 'scaler_details' will not create an object: the transform will be directly performed 
    # through vectorial operations.
    
    # suffix: string (inside quotes).
    # How the transformed column will be identified in the returned dataframe.
    # If the original column is 'Y' and suffix = '_scaled', the transformed column will be
    # identified as 'Y_scaled'.
    # Alternatively, input inside quotes a string with the desired suffix. Recommendation:
    # start the suffix with "_" to separate it from the original name
      
    if (suffix is None):
        #set as the default
        suffix = '_scaled'
    
    #Start a copy of the original dataframe. This copy will be updated to create the new
    # transformed dataframe. Then, we avoid manipulating the original object.
    new_df = df.copy(deep = True)
    
    #Start an scaling list empty (it will be a JSON object):
    scaling_list = []
    
    for column in subset_of_features_to_scale:
        
        # Create a dataframe X by subsetting only the analyzed column
        # it will be equivalent to using .reshape(-1,1) to set a 1D-series
        # or array in the shape for scikit-learn:
        # For doing so, pass a list of columns for column filtering, containing
        # the object column as its single element:
        X = new_df[[column]]
        
        if (scale_with_new_params == False):
            
            # Use a previously obtained scaler.
            # Loop through each element of the list:
            
            for scaling_dict in list_of_scaling_params:
                
                # check if the dictionary is from that column:
                if (scaling_dict['column'] == column):
                    
                    # We found the correct dictionary. Let's retrieve the information:
                    # retrieve the nested dictionary:
                    nested_dict = scaling_dict['scaler']
                    
                    # try accessing the scaler object:
                    try:
                        scaler = nested_dict['scaler_obj']
                        #calculate the scaled feature, and store it as new array:
                        scaled_feature = scaler.transform(X)
                        
                        # Add the parameters to the nested dictionary:
                        nested_dict['scaling_params'] = scaler.get_params(deep = True)
                        
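                        # Retrieve the parameters learned by the scaler when it was fitted,
                        # instead of recalculating them from the data being transformed now:
                        if isinstance(scaler, StandardScaler):
                            
                            nested_dict['scaler_details'] = {
                                'mu': scaler.mean_[0],
                                'sigma': scaler.scale_[0]
                            }
                        
                        elif isinstance(scaler, MinMaxScaler):
                            
                            nested_dict['scaler_details'] = {
                                'min': scaler.data_min_[0],
                                'max': scaler.data_max_[0]
                            }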
                    
                    except:
                        
                        try:
                            # As last alternative, let's try accessing the scaler details dict
                            scaler_details = nested_dict['scaler_details']
                                
                            if (mode == 'standard'):
                                
                                nested_dict['scaling_params'] = 'standard_scaler_manually_defined'
                                mu = scaler_details['mu']
                                sigma = scaler_details['sigma']
                                    
                                if (sigma != 0):
                                    scaled_feature = (X - mu)/sigma
                                else:
                                    scaled_feature = (X - mu)
                                
                            elif (mode == 'min_max'):
                                    
                                nested_dict['scaling_params'] = 'min_max_scaler_manually_defined'
                                minimum = scaler_details['min']
                                maximum = scaler_details['max']
                                    
                                if ((maximum - minimum) != 0):
                                    scaled_feature = (X - minimum)/(maximum - minimum)
                                else:
                                    scaled_feature = X/maximum
                                
                            elif (mode == 'factor'):
                                
                                nested_dict['scaling_params'] = 'normalization_by_factor'
                                factor = scaler_details['factor']
                                scaled_feature = X/(factor)
                                
                            else:
                                print("Select a valid mode: standard, min_max, or factor.\n")
                                return "error", "error"
                            
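                        except:
                                
                            print(f"No valid scaling dictionary was input for column {column}.\n")
                            return "error", "error"
                    
                    # The dictionary of this column was found and processed: stop searching,
                    # so that scaling_dict still references it after the loop.
                    break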
            
        elif (mode == 'normalize_by_maximum'):
            
            #Start an scaling dictionary empty:
            scaling_dict = {}

            # add the column to the scaling dictionary:
            scaling_dict['column'] = column

            # Start a nested dictionary:
            nested_dict = {}
            
            factor = X[column].max()
            scaled_feature = X/(factor)
            nested_dict['scaling_params'] = 'normalization_by_factor'
            nested_dict['scaler_details'] = {'factor': factor, 'description': 'division_by_maximum_detected_value'}
    
        else:
            # Create a new scaler:
            
            #Start an scaling dictionary empty:
            scaling_dict = {}

            # add the column to the scaling dictionary:
            scaling_dict['column'] = column
            
            # Start a nested dictionary:
            nested_dict = {}
                
            #start the scaler object:
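            if (mode == 'standard'):
                
                scaler = StandardScaler()
                # StandardScaler uses the population standard deviation (ddof = 0), whereas
                # the Pandas .std() method defaults to the sample one (ddof = 1):
                scaler_details = {'mu': X[column].mean(), 'sigma': X[column].std(ddof = 0)}

            elif (mode == 'min_max'):
                
                scaler = MinMaxScaler()
                scaler_details = {'min': X[column].min(), 'max': X[column].max()}
            
            else:
                # The mode 'factor' requires factors previously defined by the user:
                print("Select a valid mode for fitting a new scaler: standard, min_max, or normalize_by_maximum.\n")
                return "error", "error"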
                
            # fit the scaler to the column
            scaler = scaler.fit(X)
                    
            # calculate the scaled feature, and store it as new array:
            scaled_feature = scaler.transform(X)
            # scaler.inverse_transform(X) would reverse the scaling.
                
            # Get the scaling parameters for that column:
            scaling_params = scaler.get_params(deep = True)
                    
            # scaling_params is a dictionary containing the scaling parameters.
            # Add the scaling parameters to the nested dictionary:
            nested_dict['scaling_params'] = scaling_params
                
            # add the scaler object to the nested dictionary:
            nested_dict['scaler_obj'] = scaler
            
            # Add the scaler_details dictionary:
            nested_dict['scaler_details'] = scaler_details
            
            # Now, all steps are the same for all cases, so we can go back to the main
            # for loop:
    
        # Create the new_column name:
        new_column = column + suffix
                    
        # Set the new column as scaled_feature
        new_df[new_column] = scaled_feature
                
        # Add the nested dictionary to the scaling_dict:
        scaling_dict['scaler'] = nested_dict
                
        # Finally, append the scaling_dict to the list scaling_list:
        scaling_list.append(scaling_dict)
                    
        print(f"Successfully scaled column {column}.\n")
                
    print("Successfully scaled the dataframe. Returning the transformed dataframe and the scaling list.\n")
    print("Check 10 first rows of the new dataframe:\n")
    
    try:
        # only works in Jupyter Notebook:
        from IPython.display import display
        display(new_df.head(10))
            
    except: # regular mode
        print(new_df.head(10))
 
    return new_df, scaling_list

# **Function for reversing the scaling of the features**
- `mode = 'standard'`.
- `mode = 'min_max'`.
- `mode = 'factor'`.
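
Each reverse transform is the algebraic inverse of the corresponding scaling: Y = Ytransf * s + u for `'standard'`; Y = Ytransf * (Ymax - Ymin) + Ymin for `'min_max'`; and Y = Ytransf * F for `'factor'`. The sketch below is a simplified, standard-library illustration of this round trip, not the function defined in the next cell. The helper names `scale` and `reverse_scale` are hypothetical; the parameter keys follow the `'scaler_details'` format, and the cases of zero standard deviation or zero range are not handled.

```python
def scale(values, mode, params):
    """Hypothetical helper: apply the forward scaling to a list of numbers."""
    if mode == "standard":
        return [(y - params["mu"]) / params["sigma"] for y in values]
    if mode == "min_max":
        value_range = params["max"] - params["min"]
        return [(y - params["min"]) / value_range for y in values]
    if mode == "factor":
        return [y / params["factor"] for y in values]
    raise ValueError("Select a valid mode: standard, min_max, or factor.")


def reverse_scale(values, mode, params):
    """Hypothetical helper: apply the inverse of scale() to a list of numbers."""
    if mode == "standard":
        return [y * params["sigma"] + params["mu"] for y in values]
    if mode == "min_max":
        value_range = params["max"] - params["min"]
        return [y * value_range + params["min"] for y in values]
    if mode == "factor":
        return [y * params["factor"] for y in values]
    raise ValueError("Select a valid mode: standard, min_max, or factor.")


original = [2.0, 4.0, 6.0, 8.0]
examples = {
    "standard": {"mu": 5.0, "sigma": 2.0},
    "min_max": {"min": 2.0, "max": 8.0},
    "factor": {"factor": 4.0},
}

recovered = {}
for mode, params in examples.items():
    scaled = scale(original, mode, params)
    recovered[mode] = reverse_scale(scaled, mode, params)
    print(mode, scaled, recovered[mode])
```

Reversing requires the same parameters used for scaling, which is why the list of scaling parameters must be provided to the function.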

In [19]:
def reverse_feature_scaling (df, subset_of_features_to_scale, list_of_scaling_params, mode = 'min_max', suffix = '_reverseScaling'):
    
    import pandas as pd
    from sklearn.preprocessing import StandardScaler
    from sklearn.preprocessing import MinMaxScaler
    # Scikit-learn Preprocessing data guide:
    # https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-scaler
    # Standard scaler documentation:
    # https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
    # Min-Max scaler documentation:
    # https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html#sklearn.preprocessing.MinMaxScaler.set_params
    
    ## mode = 'standard': reverses the standard scaling, 
    ##  which creates a new variable with mean = 0; and standard deviation = 1.
    ##  Each value Y is transformed as Ytransf = (Y - u)/s, where u is the mean 
    ##  of the training samples, and s is the standard deviation of the training samples.
    
    ## mode = 'min_max': reverses min-max normalization, with a resultant feature 
    ## ranging from 0 to 1. each value Y is transformed as 
    ## Ytransf = (Y - Ymin)/(Ymax - Ymin), where Ymin and Ymax are the minimum and 
    ## maximum values of Y, respectively.
    ## mode = 'factor': reverses the division of the whole series by a numeric value 
    # provided as argument. 
    ## For a factor F, the scaled values were obtained as Ytransf = Y/F, so the original
    # values are recovered as Y = Ytransf * F.
    # Notice that if the original mode was 'normalize_by_maximum', then the maximum value used
    # must be declared as any other factor.
    
    # df: the whole dataframe to be processed.
    
    # subset_of_features_to_scale: list of strings (inside quotes), 
    # containing the names of the columns with the scaled variables whose scaling will be 
    # reversed. If a single column will be processed, declare this parameter as a list with
    # only one element, e.g. subset_of_features_to_scale = ["column1"] 
    # will process the column named as 'column1'; 
    # subset_of_features_to_scale = ["col1", 'col2', 'col3'] will process 3 columns:
    # 'col1', 'col2', and 'col3'.
    
    # list_of_scaling_params is a list of dictionaries with the same format of the list returned
    # from the feature_scaling function. Each dictionary must correspond to one of the features
    # whose scaling will be reversed, but the list does not have to be in the same order of the
    # columns - the function checks the key 'column' of each dictionary.
    # The first key of the dictionary must be 'column'. This key must store a string with the exact
    # name of the column that will be processed.
    # The second key must be 'scaler'. This key must store a dictionary. The dictionary must store
    # one of two keys: 'scaler_obj' - sklearn scaler object to be used; or 'scaler_details' - the
    # numeric parameters for re-calculating the scaler without the object. The key 'scaler_details', 
    # must contain a nested dictionary. For the mode 'min_max', this dictionary should contain 
    # two keys: 'min', with the minimum value of the variable, and 'max', with the maximum value. 
    # For mode 'standard', the keys should be 'mu', with the mean value, and 'sigma', with its 
    # standard deviation. For the mode 'factor', the key should be 'factor', and should contain the 
    # factor for division (the scaling value. e.g 'factor': 2.0 will divide the column by 2.0.).
    # Again, if you want to normalize by the maximum, declare the maximum value as any other factor for
    # division.
    # The key 'scaler_details' will not create an object: the transform will be directly performed 
    # through vectorial operations.
    
    # suffix: string (inside quotes).
    # How the transformed column will be identified in the returned dataframe.
    # If the scaled column is 'Y' and suffix = '_reverseScaling', the transformed column will be
    # identified as 'Y_reverseScaling'.
    # Alternatively, input inside quotes a string with the desired suffix. Recommendation:
    # start the suffix with "_" to separate it from the original name
      
    if (suffix is None):
        #set as the default
        suffix = '_reverseScaling'
    
    #Start a copy of the original dataframe. This copy will be updated to create the new
    # transformed dataframe. Then, we avoid manipulating the original object.
    new_df = df.copy(deep = True)
    
    #Start an scaling list empty (it will be a JSON object):
    scaling_list = []
    
    # Use a previously obtained scaler:
    
    for column in subset_of_features_to_scale:
        
        # Create a dataframe X by subsetting only the analyzed column
        # it will be equivalent to using .reshape(-1,1) to set a 1D-series
        # or array in the shape for scikit-learn:
        # For doing so, pass a list of columns for column filtering, containing
        # the object column as its single element:
        X = new_df[[column]]

        # Loop through each element of the list:
            
        for scaling_dict in list_of_scaling_params:
                
            # check if the dictionary is from that column:
            if (scaling_dict['column'] == column):
                    
                # We found the correct dictionary. Let's retrieve the information:
                # retrieve the nested dictionary:
                nested_dict = scaling_dict['scaler']
                    
                # try accessing the scaler object:
                try:
                    scaler = nested_dict['scaler_obj']
                    #calculate the reversed scaled feature, and store it as new array:
                    rev_scaled_feature = scaler.inverse_transform(X)
                        
                    # Add the parameters to the nested dictionary:
                    nested_dict['scaling_params'] = scaler.get_params(deep = True)
                        
                    # Retrieve the parameters learned by the scaler when it was fitted,
                    # instead of recalculating them from the reversed data:
                    if isinstance(scaler, StandardScaler):
                            
                        nested_dict['scaler_details'] = {
                                'mu': scaler.mean_[0],
                                'sigma': scaler.scale_[0]
                            }
                        
                    elif isinstance(scaler, MinMaxScaler):
                            
                        nested_dict['scaler_details'] = {
                                'min': scaler.data_min_[0],
                                'max': scaler.data_max_[0]
                            }
                    
                except:
                        
                    try:
                        # As last alternative, let's try accessing the scaler details dict
                        scaler_details = nested_dict['scaler_details']
                                
                        if (mode == 'standard'):
                                
                            nested_dict['scaling_params'] = 'standard_scaler_manually_defined'
                            mu = scaler_details['mu']
                            sigma = scaler_details['sigma']
                                    
                            if (sigma != 0):
                                # scaled_feature = (X - mu)/sigma
                                rev_scaled_feature = (X * sigma) + mu
                            else:
                                # scaled_feature = (X - mu)
                                rev_scaled_feature = (X + mu)
                                
                        elif (mode == 'min_max'):
                                    
                            nested_dict['scaling_params'] = 'min_max_scaler_manually_defined'
                            minimum = scaler_details['min']
                            maximum = scaler_details['max']
                                    
                            if ((maximum - minimum) != 0):
                                # scaled_feature = (X - minimum)/(maximum - minimum)
                                rev_scaled_feature = (X * (maximum - minimum)) + minimum
                            else:
                                # scaled_feature = X/maximum
                                rev_scaled_feature = (X * maximum)
                                
                        elif (mode == 'factor'):
                                
                            nested_dict['scaling_params'] = 'normalization_by_factor'
                            factor = scaler_details['factor']
                            # scaled_feature = X/(factor)
                            rev_scaled_feature = (X * factor)
                                
                        else:
                            print("Select a valid mode: standard, min_max, or factor.\n")
                            return "error", "error"
                            
                    except:
                                
                        print(f"No valid scaling dictionary was input for column {column}.\n")
                        return "error", "error"
         
                # Create the new_column name:
                new_column = column + suffix

                # Set the new column as rev_scaled_feature
                new_df[new_column] = rev_scaled_feature

                # Add the nested dictionary to the scaling_dict:
                scaling_dict['scaler'] = nested_dict

                # Finally, append the scaling_dict to the list scaling_list:
                scaling_list.append(scaling_dict)

                print(f"Successfully re-scaled column {column}.\n")
                
    print("Successfully re-scaled the dataframe.\n")
    print("Check 10 first rows of the new dataframe:\n")
    
    try:
        # only works in Jupyter Notebook:
        from IPython.display import display
        display(new_df.head(10))
            
    except: # regular mode
        print(new_df.head(10))
                
    return new_df, scaling_list
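
The manual branches of the function above invert the scaling arithmetic directly. The sketch below illustrates that arithmetic in isolation as a round trip (scale, then reverse). The dataframe, the column name `temperature`, and the variable names are hypothetical and are not part of the notebook's functions. The population standard deviation (`ddof = 0`) is an assumption made here to match scikit-learn's `StandardScaler`.

```python
import numpy as np
import pandas as pd

# Hypothetical example data (illustrative only):
df = pd.DataFrame({"temperature": [10.0, 20.0, 30.0, 40.0]})
X = df[["temperature"]]

# Standard scaling: scaled = (X - mu)/sigma
mu = df["temperature"].mean()
sigma = df["temperature"].std(ddof = 0)  # population std (assumption)
scaler_details = {"mu": mu, "sigma": sigma}
scaled_standard = (X - mu) / sigma

# Reverse: same arithmetic as the manual 'standard' branch
rev_standard = (scaled_standard * scaler_details["sigma"]) + scaler_details["mu"]

# Min-max scaling: scaled = (X - minimum)/(maximum - minimum)
minimum = df["temperature"].min()
maximum = df["temperature"].max()
scaled_min_max = (X - minimum) / (maximum - minimum)

# Reverse: same arithmetic as the manual 'min_max' branch
rev_min_max = (scaled_min_max * (maximum - minimum)) + minimum

print(rev_standard)
print(rev_min_max)
```

Both reversed dataframes should reproduce the original column, up to floating-point rounding.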

# **Function for exporting the dataframe as CSV File (to notebook's workspace)**

In [None]:
def export_pd_dataframe_as_csv (dataframe_obj_to_be_exported, new_file_name_without_extension, file_directory_path = None):
    
    import os
    import pandas as pd
    
    ## WARNING: all files exported from this function are .csv (comma separated values)
    
    # dataframe_obj_to_be_exported: dataframe object that is going to be exported from the
    # function. Since it is an object (not a string), it should not be declared in quotes.
    # example: dataframe_obj_to_be_exported = dataset will export the dataset object.
    # ATTENTION: The dataframe object must be a Pandas dataframe.
    
    # file_directory_path - (string, in quotes): input the path of the directory 
    # (e.g. folder path) where the file will be stored. e.g. file_directory_path = "/" 
    # or file_directory_path = "/folder"
    # If you want to export the file to AWS S3, this parameter will have no effect.
    # In this case, you can set file_directory_path = None
    # If file_directory_path = None, the file is saved in the current working directory.

    # new_file_name_without_extension - (string, in quotes): input the name of the 
    # file without the extension. e.g. new_file_name_without_extension = "my_file" 
    # will export a file 'my_file.csv' to notebook's workspace.
    
    # os.path.join raises a TypeError if it receives None. So, if no directory was
    # provided, use the empty string (current working directory):
    if (file_directory_path is None):
        file_directory_path = ""
    
    # Create the complete file path:
    file_path = os.path.join(file_directory_path, new_file_name_without_extension)
    # Concatenate the extension ".csv":
    file_path = file_path + ".csv"

    dataframe_obj_to_be_exported.to_csv(file_path, index = False)

    print(f"Dataframe {new_file_name_without_extension} exported as CSV file to notebook's workspace as '{file_path}'.")
    print("Warning: if there was a file in this file path, it was replaced by the exported dataframe.")
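
The path handling of the export function can be sketched in isolation as follows. This is a minimal illustration, assuming that a missing directory should fall back to the current working directory. The helper `build_csv_path` and the example dataframe are hypothetical, and a temporary directory is used so that no file is left in the workspace.

```python
import os
import tempfile
import pandas as pd

# Hypothetical example dataframe (illustrative only):
df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

def build_csv_path (new_file_name_without_extension, file_directory_path = None):
    # Hypothetical helper: os.path.join does not accept None, so an empty
    # string (current working directory) is used when no directory is given.
    if (file_directory_path is None):
        file_directory_path = ""
    
    file_path = os.path.join(file_directory_path, new_file_name_without_extension)
    # Concatenate the extension ".csv":
    return file_path + ".csv"

# Export to a temporary directory and read the file back to check the round trip:
with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = build_csv_path("my_file", tmp_dir)
    # index = False avoids saving the dataframe index as an extra column
    df.to_csv(file_path, index = False)
    reloaded = pd.read_csv(file_path)

print(reloaded)
```

Since `index = False` is used, the re-imported dataframe has the same columns as the exported one.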

# **Function for importing or exporting models, lists, or dictionaries**

In [None]:
def import_export_model_list_dict (action = 'import', objects_manipulated = 'model_only', model_file_name = None, dictionary_or_list_file_name = None, directory_path = '', model_type = 'keras', dict_or_list_to_export = None, model_to_export = None, use_colab_memory = False):
    
    import os
    import pickle
    import dill
    import tarfile
    import tensorflow as tf
    from zipfile import ZipFile
    # https://docs.python.org/3/library/tarfile.html#tar-examples
    # https://docs.python.org/3/library/zipfile.html#zipfile-objects
    # pickle and dill save the file in binary (bytes) serialized mode. So, we must use
    # mode 'rb' or 'wb' when calling open inside the context manager (with statement).
    # The 'b' stands for 'binary', informing open that a binary file will be processed
    from statsmodels.tsa.arima.model import ARIMA, ARIMAResults
    from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler, MinMaxScaler
    from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet, LogisticRegression
    from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
    from sklearn.neural_network import MLPRegressor, MLPClassifier
    from xgboost import XGBRegressor, XGBClassifier
    
    # action = 'import' for importing a model and/or a dictionary;
    # action = 'export' for exporting a model and/or a dictionary.
    
    # objects_manipulated = 'model_only' if only a model will be manipulated.
    # objects_manipulated = 'dict_or_list_only' if only a dictionary or list will be manipulated.
    # objects_manipulated = 'model_and_dict' if both a model and a dictionary will be
    # manipulated.
    
    # model_file_name: string with the name of the file containing the model (for 'import');
    # or of the name that the exported file will have (for 'export')
    # e.g. model_file_name = 'model'
    # WARNING: Do not add the file extension.
    # Keep it in quotes. Keep model_file_name = None if no model will be manipulated.
    
    # dictionary_or_list_file_name: string with the name of the file containing the dictionary 
    # (for 'import');
    # or of the name that the exported file will have (for 'export')
    # e.g. dictionary_or_list_file_name = 'history_dict'
    # WARNING: Do not add the file extension.
    # Keep it in quotes. Keep dictionary_or_list_file_name = None if no 
    # dictionary or list will be manipulated.
    
    # directory_path: path of the directory where the model will be saved,
    # or from which the model will be retrieved. If no value is provided,
    # directory_path will be the empty string '', i.e. the current working directory.
    # Notice that the model and the dictionary must be stored in the same path.
    # If a model and a dictionary will be exported, they will be stored in the same
    # directory_path.
    
    # model_type: This parameter has effect only when a model will be manipulated.
    # model_type = 'keras' for deep learning keras/ tensorflow models with extension .h5
    # model_type = 'tensorflow_general' for generic deep learning tensorflow models containing 
    # custom layers, losses and architectures. Such models are compressed as tar.gz, tar, or zip.
    # model_type = 'sklearn' for models from scikit-learn (non-deep learning)
    # model_type = 'xgb_regressor' for XGBoost regression models (non-deep learning)
    # model_type = 'xgb_classifier' for XGBoost classification models (non-deep learning)
    # model_type = 'arima' for ARIMA model (Statsmodels)
    
    # dict_or_list_to_export and model_to_export: 
    # These two parameters have effect only when action == 'export'. In this case, they
    # must be declared. If action == 'import', keep:
    # dict_or_list_to_export = None, 
    # model_to_export = None
    # If one of these objects will be exported, substitute None by the name of the object
    # e.g. if your model is stored in the global memory as 'keras_model' declare:
    # model_to_export = keras_model. Notice that it must be declared without quotes, since
    # it is not a string, but an object.
    # For exporting a dictionary named as 'dict':
    # dict_or_list_to_export = dict
    
    # use_colab_memory: this parameter only has effect when using Google Colab (otherwise, it
    # will raise an error). Set as use_colab_memory = True if you want to use the instant memory
    # from Google Colaboratory: you will upload or download the file and it will be available
    # only during the time when the kernel is running. It will be deleted when the kernel
    # dies, for instance, when you close the notebook.
    
    # If action == 'export' and use_colab_memory == True, then the file will be downloaded
    # to your computer (running the cell will start the download).
    
    # Check the directory path
    if (directory_path is None):
        # set as the current working directory (empty string):
        directory_path = ""
        
        
    bool_check1 = (objects_manipulated != 'model_only')
    # bool_check1 == True if a dictionary will be manipulated
    
    bool_check2 = (objects_manipulated != 'dict_or_list_only')
    # bool_check2 == True if a model will be manipulated
    
    if (bool_check1 == True):
        #manipulate a dictionary
        
        if (dictionary_or_list_file_name is None):
            print("Please, enter a name for the dictionary or list.")
            return "error1"
        
        else:
            # Create the file path for the dictionary:
            dict_path = os.path.join(directory_path, dictionary_or_list_file_name)
            # Extract the file extension
            dict_extension = 'pkl'
            #concatenate:
            dict_path = dict_path + "." + dict_extension
            
    
    if (bool_check2 == True):
        #manipulate a model
        
        if (model_file_name is None):
            print("Please, enter a name for the model.")
            return "error1"
        
        else:
            # Create the file path for the model:
            model_path = os.path.join(directory_path, model_file_name)
            # Define the file extension
            
            #check model_type:
            if (model_type == 'keras'):
                model_extension = 'h5'
            
            elif (model_type == 'tensorflow_general'):
                # The model is handled as a compressed 'saved_model' directory. The
                # extension (tar.gz, tar, or zip) is resolved when importing or exporting.
                model_extension = None
            
            elif (model_type == 'sklearn'):
                model_extension = 'dill'
                #it could be 'pkl', though
            
            elif (model_type == 'xgb_regressor'):
                model_extension = 'json'
                #it could be 'ubj', though
            
            elif (model_type == 'xgb_classifier'):
                model_extension = 'json'
                #it could be 'ubj', though
            
            elif (model_type == 'arima'):
                model_extension = 'pkl'
            
            else:
                print("Enter a valid model_type: keras, tensorflow_general, sklearn, xgb_regressor, xgb_classifier, or arima.")
                return "error2"
            
            #concatenate (only when the extension is already known):
            if (model_extension is not None):
                model_path = model_path +  "." + model_extension
            
    # Now we have the full paths for the dictionary and for the model.
    
    if (action == 'import'):
        
        if (use_colab_memory == True):
             
            from google.colab import files
            # google.colab library must be imported only in case 
            # it is going to be used, for avoiding 
            # AWS compatibility issues.
            
            print("Click on the button for file selection and select the files from your machine that will be uploaded in the Colab environment.")
            print("Warning: the files will be removed from Colab memory after the Kernel dies or after the notebook is closed.")
            # this functionality requires the previous declaration:
            ## from google.colab import files
            colab_files_dict = files.upload()
            # The files are stored into a dictionary called colab_files_dict where the keys
            # are the names of the files and the values are the files themselves.
            ## e.g. if you upload a single file named "dictionary.pkl", the dictionary will be
            ## colab_files_dict = {'dictionary.pkl': file}, where file is actually a big string
            ## representing the contents of the file. The length of this value is the size of the
            ## uploaded file, in bytes.
            ## To access the file is like accessing a value from a dictionary: 
            ## d = {'key1': 'val1'}, d['key1'] == 'val1'
            ## we simply declare the key inside brackets and quotes, the same way we would do for
            ## accessing the column of a dataframe.
            ## In this example, colab_files_dict['dictionary.pkl'] access the content of the 
            ## .pkl file, and len(colab_files_dict['dictionary.pkl']) is the size of the .pkl
            ## file in bytes.
            ## To check the dictionary keys, apply the method .keys() to the dictionary (with empty
            ## parentheses): colab_files_dict.keys()
            
            for key in colab_files_dict.keys():
                #loop through each element of the list of keys of the dictionary
                # (list colab_files_dict.keys()). Each element is named 'key'
                print(f"User uploaded file {key} with length {len(colab_files_dict[key])} bytes.")
                # The key is the name of the file, and the length of the value
                ## correspondent to the key is the file's size in bytes.
                ## Notice that the content of the uploaded object must be passed 
                ## as argument for a proper function to be interpreted. 
                ## For instance, the content of a xlsx file should be passed as
                ## argument for Pandas .read_excel function; the pkl file must be passed as
                ## argument for pickle.
                ## e.g., if you uploaded 'table.xlsx' and stored it into colab_files_dict you should
                ## declare df = pd.read_excel(colab_files_dict['table.xlsx']) to obtain a dataframe
                ## df from the uploaded table. Notice that is the value, not the key, that is the
                ## argument.
        
        if (bool_check1 == True):
            #manipulate a dictionary
            if (use_colab_memory == True):
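                key = dictionary_or_list_file_name + "." + dict_extension
                # colab_files_dict[key] holds the raw bytes of the file, not a path, so it
                # cannot be passed to open. Since files.upload() also saves each uploaded
                # file to the current working directory under its own name, open the file
                # by its name (key) and pass the opened file to pickle:
                with open(key, 'rb') as opened_file: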
            
                    imported_dict = pickle.load(opened_file)
                    # The structure imported_dict = pkl.load(open(colab_files_dict[key], 'rb')) relies 
                    # on the GC to close the file. That's not a good idea: If someone doesn't use 
                    # CPython the garbage collector might not be using refcounting (which collects 
                    # unreferenced objects immediately) but e.g. collect garbage only after some time.
                    # Since file handles are closed when the associated object is garbage collected or 
                    # closed explicitly (.close() or .__exit__() from a context manager) the file 
                    # will remain open until the GC kicks in.
                    # Using 'with' ensures the file is closed as soon as the block is left - even if 
                    # an exception happens inside that block, so it should always be preferred for any 
                    # real application.
                    # source: https://stackoverflow.com/questions/39447362/equivalent-ways-to-json-load-a-file-in-python

                print(f"Dictionary or list {key} successfully imported to Colab environment.")
            
            else:
                #standard method
                with open(dict_path, 'rb') as opened_file:
            
                    imported_dict = pickle.load(opened_file)
                
                # 'rb' stands for read binary (read mode). For writing mode, 'wb', 'write binary'
                print(f"Dictionary or list successfully imported from {dict_path}.")
                
        if (bool_check2 == True):
            #manipulate a model
            # select the proper model
        
            if (model_type == 'keras'):
                
                if (use_colab_memory == True):
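                    key = model_file_name + "." + model_extension
                    # files.upload() saved the file to the working directory, so load it by
                    # its name (colab_files_dict[key] is the file content, not a path):
                    model = tf.keras.models.load_model(key)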
                    print(f"Keras/TensorFlow model: {key} successfully imported to Colab environment.")
            
                else:
                    #standard method
                    # load_model is accessed through the tensorflow import declared above:
                    # import tensorflow as tf
                    model = tf.keras.models.load_model(model_path)
                    print(f"Keras/TensorFlow model successfully imported from {model_path}.")
            
            elif (model_type == 'tensorflow_general'):
                
                print("Warning, save the model in a directory called 'saved_model' (before compressing.)\n")
                # Create a temporary folder in case it does not exist:
                # https://www.geeksforgeeks.org/python-os-makedirs-method/
                # Set exist_ok = True
                os.makedirs("tmp/", exist_ok = True)
                
                if (use_colab_memory == True):
                    
                    key = model_file_name
                    
                    try:
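                        model_extension = ".tar"
                        # Build the key from the original file name, and use the file name as
                        # path (files.upload() saves the file to the working directory):
                        key = model_file_name + model_extension
                        model_path = key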
                        # Open the context manager
                        with tarfile.open (model_path, 'r:') as compressed_model:
                            #extract all to the tmp directory:
                            compressed_model.extractall("tmp/")
                        
                        # if you were not using the context manager, it would be necessary to apply
                        # close method: tar = tarfile.open(fname, "r:gz"); tar.extractall(); tar.close()
                    
                    except:
                        
                        try:
                            # try tar.gz extension
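                            model_extension = ".tar.gz"
                            # Restart from the original file name, so that the extensions
                            # from previous attempts are not accumulated in the key:
                            key = model_file_name + model_extension
                            model_path = key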
                            
                            # Open the context manager
                            with tarfile.open (model_path, 'r:gz') as compressed_model:
                                #extract all to the tmp directory:
                                compressed_model.extractall("tmp/")
                        
                        except:
                            # try .zip extension
                            try:
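                                model_extension = ".zip"
                                # Restart from the original file name, so that the extensions
                                # from previous attempts are not accumulated in the key:
                                key = model_file_name + model_extension
                                model_path = key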
                                
                                # Open the context manager
                                with ZipFile (model_path, 'r') as compressed_model:
                                    #extract all to the tmp directory:
                                    compressed_model.extractall("tmp/")
                            
                            except:
                                print("Failed to load the model. Compress it as zip, tar or tar.gz file.\n")
                    
                    
                    # Equivalent shell command for extracting the archive using tar:
                    # https://www.gnu.org/software/tar/manual/tar.html
                    #    ! tar --extract --file=model_path --verbose --verbose tmp/
                    
                    try:
                        model = tf.keras.models.load_model("tmp/saved_model")
                        print(f"TensorFlow model: {model_path} successfully imported to Colab environment.")
                    
                    except:
                        print("Failed to load the model. Save it in a directory named 'saved_model' before compressing.\n")
                    
                else:
                    #standard method
                    
                    # Try simply accessing the directory:
                    try:
                        model = tf.keras.models.load_model("tmp/saved_model")
                    
                    except:
                        
                        try:
                            model = tf.keras.models.load_model(model_file_name)
                        
                        except:
                            
                            # It is compressed
                            try:
                                model_extension = ".tar"
                                model_path = os.path.join(directory_path, model_file_name) + model_extension

                                # Open the context manager
                                with tarfile.open (model_path, 'r:') as compressed_model:
                                    #extract all to the tmp directory:
                                    compressed_model.extractall("tmp/")

                                # if you were not using the context manager, it would be necessary to apply
                                # close method: tar = tarfile.open(fname, "r:gz"); tar.extractall(); tar.close()

                            except:

                                try:
                                    # try tar.gz extension
                                    model_extension = ".tar.gz"
                                    model_path = os.path.join(directory_path, model_file_name) + model_extension

                                    # Open the context manager
                                    with tarfile.open (model_path, 'r:gz') as compressed_model:
                                        #extract all to the tmp directory:
                                        compressed_model.extractall("tmp/")

                                except:
                                    # try .zip extension
                                    try:
                                        model_extension = ".zip"
                                        model_path = os.path.join(directory_path, model_file_name) + model_extension

                                        # Open the context manager
                                        with ZipFile (model_path, 'r') as compressed_model:
                                            #extract all to the tmp directory:
                                            compressed_model.extractall("tmp/")

                                    except:
                                        print("Failed to load the model. Compress it as zip, tar or tar.gz file.\n")

                    
                    try:
                        model = tf.keras.models.load_model("tmp/saved_model")
                        print("TensorFlow model successfully imported from the directory 'tmp/saved_model'.")
                    
                    except:
                        print("Failed to load the model. Save it in a directory named 'saved_model' before compressing.\n")
                    
                    
            elif (model_type == 'sklearn'):
                
                if (use_colab_memory == True):
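                    key = model_file_name + "." + model_extension
                    
                    # files.upload() saved the file to the working directory. Open it by its
                    # name, since colab_files_dict[key] is the file content, not a path:
                    with open(key, 'rb') as opened_file: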
            
                        model = dill.load(opened_file)
                    
                    print(f"Scikit-learn model: {key} successfully imported to Colab environment.")
            
                else:
                    #standard method
                    with open(model_path, 'rb') as opened_file:
            
                        model = dill.load(opened_file)
                
                    print(f"Scikit-learn model successfully imported from {model_path}.")
                    # For loading a pickle model:
                    ## model = pkl.load(open(model_path, 'rb'))
                    # 'rb' stands for read binary (read mode). For writing mode, 'wb', 'write binary'

            elif (model_type == 'xgb_regressor'):
                
                # Create an instance (object) from the class XGBRegressor:
                
                model = XGBRegressor()
                # Now we can apply the load_model method from this class:
                
                if (use_colab_memory == True):
                    key = model_file_name + "." + model_extension
                    # load_model updates the object in place and returns None, so its output
                    # must not be assigned back to model. The uploaded file is loaded by its
                    # name, since files.upload() saved it to the working directory:
                    model.load_model(key)
                    print(f"XGBoost regression model: {key} successfully imported to Colab environment.")
            
                else:
                    #standard method
                    model.load_model(model_path)
                    print(f"XGBoost regression model successfully imported from {model_path}.")
                    # model.load_model("model.json") or model.load_model("model.ubj")
                    # .load_model is a method from xgboost object
            
            elif (model_type == 'xgb_classifier'):

                # Create an instance (object) from the class XGBClassifier:

                model = XGBClassifier()
                # Now we can apply the load_model method from this class:
                
                if (use_colab_memory == True):
                    key = model_file_name + "." + model_extension
                    # load_model updates the object in place and returns None, so its output
                    # must not be assigned back to model. The uploaded file is loaded by its
                    # name, since files.upload() saved it to the working directory:
                    model.load_model(key)
                    print(f"XGBoost classification model: {key} successfully imported to Colab environment.")
            
                else:
                    #standard method
                    model.load_model(model_path)
                    print(f"XGBoost classification model successfully imported from {model_path}.")
                    # model.load_model("model.json") or model.load_model("model.ubj")
                    # .load_model is a method from xgboost object

            elif (model_type == 'arima'):
                
                if (use_colab_memory == True):
                    key = model_file_name + "." + model_extension
                    # Load the uploaded file by its name (saved to the working directory):
                    model = ARIMAResults.load(key)
                    print(f"ARIMA model: {key} successfully imported to Colab environment.")
            
                else:
                    #standard method
                    # We previously declared:
                    # from statsmodels.tsa.arima.model import ARIMAResults
                    model = ARIMAResults.load(model_path)
                    print(f"ARIMA model successfully imported from {model_path}.")
            
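        # The returns are placed outside the model block, so that a dictionary or list
        # is also returned when no model is manipulated.
        if (objects_manipulated == 'model_only'):
            # only the model should be returned
            return model
        
        elif (objects_manipulated == 'dict_or_list_only'):
            # only the dictionary or list should be returned:
            return imported_dict
        
        else:
            # Both objects are returned:
            return model, imported_dict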

    
    elif (action == 'export'):
        
        #Let's export the models or dictionary:
        if (use_colab_memory == True):
            
            from google.colab import files
            # google.colab library must be imported only in case 
            # it is going to be used, for avoiding 
            # AWS compatibility issues.
            
            print("The files will be downloaded to your computer.")
        
        if (bool_check1 == True):
            #manipulate a dictionary
            if (use_colab_memory == True):
                ## Download the dictionary
                key = dictionary_or_list_file_name + "." + dict_extension
                
                with open(key, 'wb') as opened_file:
            
                    pickle.dump(dict_or_list_to_export, opened_file)
                
                # this functionality requires the previous declaration:
                ## from google.colab import files
                files.download(key)
                
                print(f"Dictionary or list {key} successfully downloaded from Colab environment.")
            
            else:
                #standard method 
                with open(dict_path, 'wb') as opened_file:
            
                    pickle.dump(dict_or_list_to_export, opened_file)
                
                #to save the file, the mode must be set as 'wb' (write binary)
                print(f"Dictionary or list successfully exported as {dict_path}.")
                
        if (bool_check2 == True):
            #manipulate a model
            # select the proper model
        
            if (model_type == 'keras'):
                
                if (use_colab_memory == True):
                    ## Download the model
                    key = model_file_name + "." + model_extension
                    model_to_export.save(key)
                    files.download(key)
                    print(f"Keras/TensorFlow model: {key} successfully downloaded from Colab environment.")
            
                else:
                    #standard method
                    model_to_export.save(model_path)
                    print(f"Keras/TensorFlow model successfully exported as {model_path}.")
            
            elif (model_type == 'tensorflow_general'):
                
                # Save your model in the SavedModel format
                # Save as a directory named 'saved_model'
                model_to_export.save('saved_model')
            
                try:
                    # Define the path from scratch in each attempt, so that the
                    # extensions are not accumulated:
                    model_path = "saved_model.tar.gz"
                    
                    # Open the context manager
                    with tarfile.open (model_path, 'w:gz') as compressed_model:
                        #Add the folder:
                        compressed_model.add('saved_model/')    
                        # if you were not using the context manager, it would be necessary to apply
                        # close method: tar = tarfile.open(fname, "w:gz"); tar.add(folder); tar.close()
                
                except:
                    # try compressing as tar:
                    try:
                        model_path = "saved_model.tar"
                        # Open the context manager
                        with tarfile.open (model_path, 'w:') as compressed_model:
                            #Add the folder:
                            compressed_model.add('saved_model/') 
                    
                    except:
                        # compress as zip:
                        model_path = "saved_model.zip"
                        with ZipFile (model_path, 'w') as compressed_model:
                            # ZipFile.write is not recursive: add each file of the folder
                            for root, dirs, file_names in os.walk('saved_model/'):
                                for file_name in file_names:
                                    compressed_model.write(os.path.join(root, file_name))
                
                if (use_colab_memory == True):
                    
                    key = model_path
                    files.download(key)
                    print(f"TensorFlow model: {key} successfully downloaded from Colab environment.")
            
                else:
                    #standard method
                    print(f"TensorFlow model successfully exported as {model_path}.")

            elif (model_type == 'sklearn'):
                
                if (use_colab_memory == True):
                    ## Download the model
                    key = model_file_name + "." + model_extension
                    
                    with open(key, 'wb') as opened_file:

                        dill.dump(model_to_export, opened_file)
                    
                    #to save the file, the mode must be set as 'wb' (write binary)
                    files.download(key)
                    print(f"Scikit-learn model: {key} successfully downloaded from Colab environment.")
            
                else:
                    #standard method
                    with open(model_path, 'wb') as opened_file:

                        dill.dump(model_to_export, opened_file)
                    
                    print(f"Scikit-learn model successfully exported as {model_path}.")
                    # For exporting a pickle model:
                    ## pkl.dump(model_to_export, open(model_path, 'wb'))
            
            elif ((model_type == 'xgb_regressor')|(model_type == 'xgb_classifier')):
                # In both cases, the XGBoost object is already loaded in global
                # context memory. So there is already the object for using the
                # save_model method, available for both classes (XGBRegressor and
                # XGBClassifier).
                # We can simply check if it is one type OR the other, since the
                # method is the same:
                
                if (use_colab_memory == True):
                    ## Download the model
                    key = model_file_name + "." + model_extension
                    model_to_export.save_model(key)
                    files.download(key)
                    print(f"XGBoost model: {key} successfully downloaded from Colab environment.")
            
                else:
                    #standard method
                    model_to_export.save_model(model_path)
                    print(f"XGBoost model successfully exported as {model_path}.")
                    # For exporting a pickle model:
                    ## pkl.dump(model_to_export, open(model_path, 'wb'))
            
            elif (model_type == 'arima'):
                
                if (use_colab_memory == True):
                    ## Download the model
                    key = model_file_name + "." + model_extension
                    model_to_export.save(key)
                    files.download(key)
                    print(f"ARIMA model: {key} successfully downloaded from Colab environment.")
            
                else:
                    #standard method
                    model_to_export.save(model_path)
                    print(f"ARIMA model successfully exported as {model_path}.")
        
        print("Export of files completed.")
    
    else:
        print("Enter a valid action, import or export.")
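
The `tensorflow_general` branch above exports the model as a folder and then compresses that folder. A minimal, standalone sketch of that compression step is shown below. The helper `compress_directory` and the temporary folder standing in for `saved_model` are hypothetical names used only for illustration; they are not part of this notebook's functions.

```python
import os
import tarfile
import tempfile

def compress_directory(directory_path, archive_path=None):
    """Compress a whole directory (recursively) into a .tar.gz archive.

    Hypothetical helper for illustration; returns the path of the archive.
    """
    directory_path = os.path.normpath(directory_path)
    if archive_path is None:
        # Derive the archive name from the folder name, keeping the
        # variable that stores the folder path untouched.
        archive_path = directory_path + ".tar.gz"

    with tarfile.open(archive_path, "w:gz") as compressed_model:
        # arcname keeps only the folder name inside the archive, instead of
        # the full path leading to it. tarfile adds sub-folders recursively.
        compressed_model.add(directory_path, arcname=os.path.basename(directory_path))

    return archive_path


# Usage with a temporary folder standing in for 'saved_model':
with tempfile.TemporaryDirectory() as workspace:
    model_dir = os.path.join(workspace, "saved_model")
    os.makedirs(os.path.join(model_dir, "variables"))
    with open(os.path.join(model_dir, "variables", "weights.bin"), "wb") as f:
        f.write(b"\x00\x01")

    archive = compress_directory(model_dir)
    with tarfile.open(archive, "r:gz") as tar:
        archived_names = sorted(tar.getnames())

print(archived_names)
```

Keeping the folder path and the archive path in separate variables avoids stacking extensions when one compression format fails and another one is tried.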

# **Function for downloading a file from Google Colab to the local machine; or uploading a file from the machine to Colab's instant memory**

In [None]:
def upload_to_or_download_file_from_colab (action = 'download', file_to_download_from_colab = None):
    
    # action = 'download' to download the file to the local machine
    # action = 'upload' to upload a file from local machine to
    # Google Colab's instant memory
    
    # file_to_download_from_colab = None. This parameter is mandatory when
    # action = 'download'. 
    # Declare as file_to_download_from_colab the name of the file that you want to download,
    # with the corresponding extension.
    # It must be declared as a string (in quotes).
    # e.g. to download a dictionary saved as dict.pkl, file_to_download_from_colab = 'dict.pkl'
    # To download a dataframe saved as df.csv, declare file_to_download_from_colab = 'df.csv'
    # To download a model saved as keras_model.h5, declare file_to_download_from_colab = 'keras_model.h5'
 
    from google.colab import files
    # google.colab library must be imported only in case 
    # it is going to be used, for avoiding 
    # AWS compatibility issues.
        
    if (action == 'upload'):
            
        print("Click on the button for file selection and select the files from your machine that will be uploaded in the Colab environment.")
        print("Warning: the files will be removed from Colab memory after the Kernel dies or after the notebook is closed.")
        # this functionality requires the previous declaration:
        ## from google.colab import files
            
        colab_files_dict = files.upload()
            
        # The files are stored into a dictionary called colab_files_dict where the keys
        # are the names of the files and the values are the contents of the files.
        ## e.g. if you upload a single file named "dictionary.pkl", the dictionary will be
        ## colab_files_dict = {'dictionary.pkl': file}, where file is a bytes object
        ## holding the raw contents of the file. The length of this value is the size of the
        ## uploaded file, in bytes.
        ## Accessing the file is like accessing any value from a dictionary: 
        ## d = {'key1': 'val1'}, d['key1'] == 'val1'
        ## we simply declare the key inside brackets and quotes, the same way we would do for
        ## accessing the column of a dataframe.
        ## In this example, colab_files_dict['dictionary.pkl'] accesses the content of the 
        ## .pkl file, and len(colab_files_dict['dictionary.pkl']) is the size of the .pkl
        ## file in bytes.
        ## To check the dictionary keys, apply the method .keys() to the dictionary (with empty
        ## parentheses): colab_files_dict.keys()
            
        for key in colab_files_dict.keys():
            # loop through each key of the dictionary. Each element is named 'key'
            print(f"User uploaded file {key} with length {len(colab_files_dict[key])} bytes.")
            # The key is the name of the file, and the length of the value
            ## corresponding to the key is the file's size in bytes.
            ## Notice that the content of the uploaded object must be passed 
            ## as argument for a proper function to be interpreted. 
            ## For instance, the content of a xlsx file should be passed as
            ## argument for Pandas .read_excel function; the pkl file must be passed as
            ## argument for pickle.
            ## Since the content is stored as bytes, wrap it in io.BytesIO before reading it.
            ## e.g., if you uploaded 'table.xlsx' and stored it into colab_files_dict you should
            ## declare df = pd.read_excel(io.BytesIO(colab_files_dict['table.xlsx'])) to obtain
            ## a dataframe df from the uploaded table. Notice that it is the value, not the key,
            ## that is the argument.
                
        # The instructions are printed once, after the loop, and the dictionary is returned
        # only after all uploaded files were reported:
        print("The uploaded files are stored into a dictionary object named as colab_files_dict.")
        print("Each key from this dictionary is the name of an uploaded file. The value corresponding to that key is the content of the file, in bytes.")
        print("The structure of a general Python dictionary is dict = {\'key1\': value1}. To access value1, declare file = dict[\'key1\'], as if you were accessing a column from a dataframe.")
        print("Then, if you uploaded a file named \'table.xlsx\', you can access this file as:")
        print("uploaded_file = colab_files_dict[\'table.xlsx\']")
        print("Notice, though, that the object uploaded_file is the whole file content, not a Python object already converted. To convert to a Python object, pass this element as argument for a proper function or method.")
        print("In this example, to convert the object uploaded_file to a dataframe, Pandas pd.read_excel function could be used, with the bytes wrapped in io.BytesIO (from the io standard library). In the following line, a df dataframe object is obtained from the uploaded file:")
        print("df = pd.read_excel(io.BytesIO(uploaded_file))")
        print("Also, the uploaded file itself will be available in the Colaboratory Notebook\'s workspace.")
            
        return colab_files_dict
        
    elif (action == 'download'):
            
        if (file_to_download_from_colab is None):
                
            #No object was declared
            print("Please, inform a file to download from the notebook\'s workspace. It should be declared in quotes and with the extension: e.g. \'table.csv\'.")
            
        else:
                
            print("The file will be downloaded to your computer.")

            files.download(file_to_download_from_colab)

            print(f"File {file_to_download_from_colab} successfully downloaded from Colab environment.")

    else:
            
            print("Please, select a valid action, \'download\' or \'upload\'.")
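
The dictionary returned by the upload action maps each file name to the raw bytes of the file, so each value still has to be converted into a Python object. The sketch below illustrates that conversion with the standard library only. The helper `load_uploaded_file` is a hypothetical name, and the dictionary is simulated so the example runs outside Colab. For an Excel file, the analogous call would be `pd.read_excel(io.BytesIO(raw_bytes))`.

```python
import csv
import io
import pickle

def load_uploaded_file(colab_files_dict, file_name):
    """Convert the raw bytes of an uploaded file into a Python object.

    Hypothetical helper: handles only .pkl and .csv, chosen by extension.
    """
    raw_bytes = colab_files_dict[file_name]

    if file_name.endswith(".pkl"):
        # pickle.loads reads directly from a bytes object.
        # Only unpickle files that come from trusted sources.
        return pickle.loads(raw_bytes)

    if file_name.endswith(".csv"):
        # Decode the bytes and wrap the text in a file-like object for the csv reader
        text_buffer = io.StringIO(raw_bytes.decode("utf-8"))
        return list(csv.DictReader(text_buffer))

    raise ValueError(f"No loader defined for {file_name}")


# Simulated result of files.upload(): file names mapped to bytes
colab_files_dict = {
    "dictionary.pkl": pickle.dumps({"key1": "val1"}),
    "table.csv": b"id,name\n1,a\n2,b\n",
}

loaded_dict = load_uploaded_file(colab_files_dict, "dictionary.pkl")
loaded_rows = load_uploaded_file(colab_files_dict, "table.csv")
print(loaded_dict, loaded_rows)
```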

# **Function for exporting a list of files from notebook's workspace to AWS Simple Storage Service (S3)**
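
The function in the next cell stores each file in the bucket under an object key formed by the optional prefix followed by the file name. As a rough sketch of that rule, the hypothetical helper `build_s3_key` below normalizes a prefix and builds the key with forward slashes. It is an illustration under these assumptions, not the code used by the function itself.

```python
import posixpath

def build_s3_key(s3_obj_prefix, file_name):
    """Build the object key a file would receive under a given prefix.

    Hypothetical helper for illustration only.
    """
    if s3_obj_prefix is None:
        prefix = ""
    else:
        # Remove surrounding whitespace and any leading or trailing slashes
        prefix = str(s3_obj_prefix).strip().strip("/")

    if prefix == "":
        # No prefix: the object sits at the root level of the bucket
        return file_name

    # S3 keys use forward slashes regardless of the operating system
    return posixpath.join(prefix, file_name)


for prefix in [None, "/", "folder1", "/folder1/folder2/"]:
    print(repr(prefix), "->", build_s3_key(prefix, "dataset.csv"))
```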

In [None]:
def export_files_to_s3 (list_of_file_names_with_extensions, directory_of_notebook_workspace_storing_files_to_export = None, s3_bucket_name = None, s3_obj_prefix = None):
    
    import os
    import boto3
    # boto3 is AWS S3 Python SDK
    # sagemaker and boto3 libraries must be imported only in case 
    # they are going to be used, for avoiding 
    # Google Colab compatibility issues.
    from getpass import getpass
    
    # list_of_file_names_with_extensions: list containing all the files to export to S3.
    # Declare it as a list even if only a single file will be exported.
    # It must be a list of strings containing the file names followed by the extensions.
    # Example, to export a single file my_file.ext, where my_file is the name and ext is the
    # extension:
    # list_of_file_names_with_extensions = ['my_file.ext']
    # To export 3 files, file1.ext1, file2.ext2, and file3.ext3:
    # list_of_file_names_with_extensions = ['file1.ext1', 'file2.ext2', 'file3.ext3']
    # Other examples:
    # list_of_file_names_with_extensions = ['Screen_Shot.png', 'dataset.csv']
    # list_of_file_names_with_extensions = ["dictionary.pkl", "model.h5"]
    # list_of_file_names_with_extensions = ['doc.pdf', 'model.dill']
    
    # directory_of_notebook_workspace_storing_files_to_export: directory from notebook's workspace
    # from which the files will be exported to S3. Keep it None, or
    # directory_of_notebook_workspace_storing_files_to_export = "/"; or
    # directory_of_notebook_workspace_storing_files_to_export = '' (empty string) to export from
    # the root (main) directory.
    # Alternatively, set as a string containing only the directories and folders, not the file names.
    # Examples: directory_of_notebook_workspace_storing_files_to_export = 'folder1';
    # directory_of_notebook_workspace_storing_files_to_export = 'folder1/folder2/'
    
    # For this function, all exported files must be located in the same directory.
    
    
    # s3_bucket_name = None.
    ## This parameter is mandatory to access an AWS S3 bucket. Substitute it for a string
    # with the bucket's name. e.g. s3_bucket_name = "aws-bucket-1" accesses a bucket named
    # "aws-bucket-1"
    
    # s3_obj_prefix = None. Keep it None or as an empty string (s3_obj_prefix = '')
    # to export the files to the root (main) directory of the bucket.
    # Alternatively, set it as a string containing the subfolder of the bucket that will
    # receive the files:
    # Suppose that your bucket (admin-created) has four objects with the following object 
    # keys: Development/Projects.xlsx; Finance/statement1.pdf; Private/taxdocument.pdf; and
    # s3-dg.pdf. The s3-dg.pdf key does not have a prefix, so its object appears directly 
    # at the root level of the bucket. If you open the Development/ folder, you see 
    # the Projects.xlsx object in it.
    # Check Amazon documentation:
    # https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-keys.html
    
    # In summary, if the path of the file is: 'bucket/my_path/.../file.csv'
    # where 'bucket' is the bucket's name, s3_obj_prefix = 'my_path/.../', without the
    # 'file.csv' (file name with extension) last part.
    
    # So, declare the prefix as s3_obj_prefix to export the files to
    # a given folder (directory) of the bucket.
    # DO NOT PUT A SLASH before (to the left of) the prefix;
    # DO NOT ADD THE BUCKET'S NAME before (to the left of) the prefix:
    # s3_obj_prefix = "bucket_directory1/.../bucket_directoryN/"

    # Do not include a file name in the prefix: the names in
    # list_of_file_names_with_extensions are appended to it to form each object's key.


    # Attention: after running this function for connecting with AWS Simple Storage Service (S3), 
    # your 'AWS Access key ID' and your 'Secret access key' will be requested.
    # The 'Secret access key' will be hidden behind dots, so it cannot be visualized or copied by
    # other users. On the other hand, the same is not true for 'Access key ID', the bucket's name 
    # and the prefix. All of these are sensitive information from the organization.
    # Therefore, after exporting the files, always remember to clear the output of this cell
    # and to remove such information from the strings.
    # Remember that these data may grant privileges for accessing the information, so they should not
    # be shared with non-authorized people.

    # Also, remember to delete the exported files from the workspace after finishing the analysis.
    # The cost for storing the files in S3 is much lower than the cost for storing them directly in the
    # workspace. Also, files stored in S3 may be accessed by users other than those with access to
    # the notebook's workspace.
    
    
    # Check if directory_of_notebook_workspace_storing_files_to_export is None. 
    # If it is, make it the root directory:
    if ((directory_of_notebook_workspace_storing_files_to_export is None)|(str(directory_of_notebook_workspace_storing_files_to_export) == "/")):
            
            # For the S3 buckets, the path should not start with slash. Assign the empty
            # string instead:
            directory_of_notebook_workspace_storing_files_to_export = ""
            print("The files will be exported from the notebook\'s root directory to S3.")
    
    elif (str(directory_of_notebook_workspace_storing_files_to_export) == ""):
        
            # Guarantee that the path is the empty string.
            # This avoids reaching the else condition, which would raise an error
            # since the empty string has no character of index 0
            directory_of_notebook_workspace_storing_files_to_export = str(directory_of_notebook_workspace_storing_files_to_export)
            print("The files will be exported from the notebook\'s root directory to S3.")
          
    else:
        # Use the str function to guarantee that the path was read as a string:
        directory_of_notebook_workspace_storing_files_to_export = str(directory_of_notebook_workspace_storing_files_to_export)
            
        if(directory_of_notebook_workspace_storing_files_to_export[0] == "/"):
            # the first character is the slash. Let's remove it

            # In AWS, neither the prefix nor the path to which the file will be imported
            # (file from S3 to workspace) or from which the file will be exported to S3
            # (the path in the notebook's workspace) may start with slash, or the operation
            # will not be concluded. Then, we have to remove this character if it is present.

            # The slash is character 0. Then, we want all characters from character 1 (the
            # second) to the last character, whose index is
            # len(directory_of_notebook_workspace_storing_files_to_export) - 1.
            # The slicing syntax is: string[1:] - all string characters from character 1;
            # string[:10] - all string characters up to character 10-1 = 9 (including 9); or
            # string[1:10] - characters from 1 to 9
            # So, slice the whole string, starting from character 1:
            directory_of_notebook_workspace_storing_files_to_export = directory_of_notebook_workspace_storing_files_to_export[1:]
            # attention: even though strings may be seen as lists of characters that can be
            # sliced, they are immutable: we can neither assign a character to a given position
            # nor delete a character from a position.

    # Ask the user to provide the credentials:
    ACCESS_KEY = input("Enter your AWS Access Key ID here (on the right). It is the value stored in the field \'Access key ID\' from your AWS user credentials CSV file. ")
    print("\n") # line break
    SECRET_KEY = getpass("Enter your password (Secret key) here (on the right). It is the value stored in the field \'Secret access key\' from your AWS user credentials CSV file. ")
        
    # The use of 'getpass' instead of 'input' hides the password behind dots.
    # So, the password is not visible to other users and cannot be copied.
        
    print("\n")
    print("WARNING: The bucket\'s name, the prefix, the AWS access key ID, and the AWS Secret access key are all sensitive information, which may grant access to protected information from the organization.\n")
    print("After exporting the data to S3, remember to remove this information from the notebook, especially if it is going to be shared. Also, remember to remove the files from the workspace.\n")
    print("The cost for storing files in Simple Storage Service is much lower than the cost for storing them directly in SageMaker workspace. Also, files stored in S3 may be accessed by users other than those with access to the notebook\'s workspace.\n")

    # Check if the user actually provided the mandatory inputs, instead
    # of putting None or empty string:
    if ((ACCESS_KEY is None) | (ACCESS_KEY == '')):
        print("AWS Access Key ID is missing. It is the value stored in the field \'Access key ID\' from your AWS user credentials CSV file.")
        return "error"
    elif ((SECRET_KEY is None) | (SECRET_KEY == '')):
        print("AWS Secret Access Key is missing. It is the value stored in the field \'Secret access key\' from your AWS user credentials CSV file.")
        return "error"
    elif ((s3_bucket_name is None) | (s3_bucket_name == '')):
        print ("Please, enter a valid S3 Bucket\'s name. Do not add sub-directories or folders (prefixes), only the name of the bucket itself.")
        return "error"
    
    else:
        # Use the str function to guarantee that all AWS parameters were properly read as strings, and not as
        # other types (like integers or floats):
        ACCESS_KEY = str(ACCESS_KEY)
        SECRET_KEY = str(SECRET_KEY)
        s3_bucket_name = str(s3_bucket_name)

    if(s3_bucket_name[0] == "/"):
        # the first character is the slash. Let's remove it

        # In AWS, neither the prefix nor the path to which the file will be imported
        # (file from S3 to workspace) or from which the file will be exported to S3
        # (the path in the notebook's workspace) may start with slash, or the operation
        # will not be concluded. Then, we have to remove this character if it is present.

        # So, slice the whole string, starting from character 1 (as done for 
        # directory_of_notebook_workspace_storing_files_to_export):
        s3_bucket_name = s3_bucket_name[1:]

    # Remove any trailing whitespace (spaces, tabs, line breaks)
    # that may be present in the strings. Use the Python string
    # rstrip method, which trims the right end of the string.
    # When no arguments are provided, whitespace characters
    # are the ones removed:
    # https://www.w3schools.com/python/ref_string_rstrip.asp
    s3_bucket_name = s3_bucket_name.rstrip()
    ACCESS_KEY = ACCESS_KEY.rstrip()
    SECRET_KEY = SECRET_KEY.rstrip()
    # Since the user manually inputs the parameters ACCESS_KEY and SECRET_KEY,
    # it is easy to input whitespaces without noticing that.

    # Now process the optional parameter.
    # Check if a prefix was passed as input parameter. If so, the files will be stored in the
    # bucket under that prefix.
    # Example: in the bucket 'my_bucket' we have a directory 'dir1'.
    # With no prefix, a file 'file1.json' is exported to the main (root) directory of the bucket,
    # and its object key is simply 'file1.json'.
    # If we pass the prefix 'dir1', the file is exported to that directory, and its object key
    # becomes 'dir1/file1.json'.

    if ((s3_obj_prefix is None) or (str(s3_obj_prefix) in ("", "/"))):
        # The root directory in the bucket must not be specified starting with the slash.
        # If None, the root "/", or the empty string '' is provided, no directory is used.
        print ("No prefix, specific object, or subdirectory provided.") 
        print (f"Then, exporting to \'{s3_bucket_name}\' root (main) directory.\n")
        # s3_path: path that the file should have in S3:
        s3_path = "" # empty string for the root directory
    
    else:
        # Since there is a prefix, use the str function to guarantee that the path was read as a string:
        s3_obj_prefix = str(s3_obj_prefix)
            
        if(s3_obj_prefix[0] == "/"):
            # the first character is the slash. Let's remove it

            # In AWS, neither the prefix nor the path to which the file will be imported
            # (file from S3 to workspace) or from which the file will be exported to S3
            # (the path in the notebook's workspace) may start with slash, or the operation
            # will not be concluded. Then, we have to remove this character if it is present.

            # So, slice the whole string, starting from character 1 (as done for 
            # directory_of_notebook_workspace_storing_files_to_export):
            s3_obj_prefix = s3_obj_prefix[1:]

        # Remove any trailing whitespace (spaces, tabs, line breaks)
        # that may be present in the string, using the Python string
        # rstrip method:
        s3_obj_prefix = s3_obj_prefix.rstrip()
            
        # s3_path: path that the file should have in S3:
        # Make the path the prefix itself, since there is a prefix:
        s3_path = s3_obj_prefix
            
        print("AWS Access Credentials, and bucket\'s prefix, object or subdirectory provided.\n")

    # The steps below must run whether or not a prefix was provided. They only require that
    # something is left of the bucket's name after the cleaning steps above:
    if (s3_bucket_name == ""):
        print ("Please, enter a valid S3 Bucket\'s name. Do not add sub-directories or folders (prefixes), only the name of the bucket itself.")
        return "error"
    
    else:
            
        print ("Starting connection with the S3 bucket.\n")
        
        try:
            # Start the S3 resource as the object 's3_client'
            s3_client = boto3.resource('s3', aws_access_key_id = ACCESS_KEY, aws_secret_access_key = SECRET_KEY)
        
            # boto3 only validates the credentials when the first request is sent to AWS,
            # so this step does not confirm yet that they were accepted.
            print("S3 client successfully started.\n")
            # An object 'data_table.xlsx' in the main (root) directory of the s3_bucket is stored in Python environment as:
            # s3.ObjectSummary(bucket_name='bucket_name', key='data_table.xlsx')
            # The name of each object is stored as the attribute 'key' of the object.
        
        except Exception:
            
            print("Failed to start the connection with AWS Simple Storage Service (S3). Review if your credentials are correct.")
            print("The Access Key ID must be the value (string) stored as \'Access key ID\' in your user security credentials CSV file.")
            print("The Secret Key must be the value (string) stored as \'Secret access key\' in your user security credentials CSV file.")
            # Without the client, none of the following steps can run:
            return "error"
        
        
        try:
            # Select the bucket specified as 's3_bucket_name'.
            # The bucket is started as the object 's3_bucket':
            s3_bucket = s3_client.Bucket(s3_bucket_name)
            print(f"Bucket \'{s3_bucket_name}\' selected.\n")
            
        except Exception:
            
            print("Failed to connect with the bucket, which usually happens when declaring a wrong bucket\'s name.") 
            print("Check the spelling of your s3_bucket_name string and remember that it must be all in lower-case.\n")
            # Without the bucket, the files cannot be exported:
            return "error"
                
        # Now, let's obtain the lists of all file paths in the notebook's workspace and
        # of the paths that the files should have in S3, after being exported.
        
        try:
            
            # start the lists:
            workspace_full_paths = []
            s3_full_paths = []
            
            # Get the total of files in list_of_file_names_with_extensions:
            total_of_files = len(list_of_file_names_with_extensions)
            
            # And Loop through all elements, named 'my_file' from the list
            for my_file in list_of_file_names_with_extensions:
                
                # Get the full path in the notebook's workspace:
                workspace_file_full_path = os.path.join(directory_of_notebook_workspace_storing_files_to_export, my_file)
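                # Get the full path (object key) that the file will have in S3.
                # S3 keys always use forward slashes, so the key is built explicitly
                # instead of with os.path.join, which uses backslashes on Windows:
                s3_file_full_path = (s3_path.rstrip("/") + "/" + my_file) if (s3_path != "") else my_file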
                
                # Append these paths to the correspondent lists:
                workspace_full_paths.append(workspace_file_full_path)
                s3_full_paths.append(s3_file_full_path)
                
            # Now, both lists have the same number of elements. For an element (file) i,
            # workspace_full_paths has the full file path in notebook's workspace, and
            # s3_full_paths has the path that the new file should have in S3 bucket.
        
        except:
            
            print("The function returned an error when trying to access the list of files. Declare it as a list of strings, even if there is a single element in the list.")
            print("Example: list_of_file_names_with_extensions = [\'my_file.ext\']\n")
            return "error"
        
        
        # Now, loop through all elements i from the lists.
        # The first elements of the lists have index 0; the last elements have index
        # total_of_files - 1, since there are 'total_of_files' elements:
        
        # Then, export the correspondent element to S3:
        
        try:
            
            for i in range(total_of_files):
                # goes from i = 0 to i = total_of_files - 1

                # get the element from list workspace_full_paths 
                # (original path of file i, from which it will be exported):
                PATH_IN_WORKSPACE = workspace_full_paths[i]

                # get the correspondent element of list s3_full_paths
                # (path that the file i should have in S3, after being exported):
                S3_FILE_PATH = s3_full_paths[i]

                # Start the new object in the bucket previously started as 's3_bucket'.
                # Start it with the specified prefix, in S3_FILE_PATH:
                new_s3_object = s3_bucket.Object(S3_FILE_PATH)
                
                # Finally, upload the file in PATH_IN_WORKSPACE.
                # Make new_s3_object the exported file:
            
                # Upload the selected object from the workspace path PATH_IN_WORKSPACE
                # to the S3 path specified as S3_FILE_PATH.
                # The parameter Filename must be input with the path of the file to copy, including its name and
                # extension. Example: Filename = "my_table.xlsx" uploads the file 'my_table.xlsx' stored in the notebook's main (root)
                # directory
                new_s3_object.upload_file(Filename = PATH_IN_WORKSPACE)

                print(f"The file \'{list_of_file_names_with_extensions[i]}\' was successfully exported from notebook\'s workspace to AWS Simple Storage Service (S3).\n")

                
            print("Finished exporting the files from the notebook\'s workspace to S3 bucket. It may take a couple of minutes until they are shown in S3 environment.\n") 
            print("Do not forget to delete these copies after finishing the analysis. They will remain stored in the bucket.\n")


        except:

            # Run this code for any other exception that may happen (no exception error
            # specified, so any exception runs the following code).
            # Check: https://pythonbasics.org/try-except/
            # for seeing how to handle successive exceptions

            print("Attention! The function raised an exception error, which is probably due to the AWS Simple Storage Service (S3) permissions.")
            print("Before running again this function, check this quick guide for configuring the permission roles in AWS.\n")
            print("It is necessary to create an user with full access permissions to interact with S3 from SageMaker. To configure the User, go to the upper ribbon of AWS, click on Services, and select IAM – Identity and Access Management.")
            print("1. In IAM\'s lateral panel, search for \'Users\' in the group of Access Management.")
            print("2. Click on the \'Add users\' button.")
            print("3. Set an user name in the text box \'User name\'.")
            print("Attention: users and S3 buckets cannot be written in upper case. Also, selecting a name already used by an Amazon user or bucket will raise an error message.\n")
            print("4. In the field \'Select type of Access to AWS\'-\'Select type of AWS credentials\' select the option \'Access key - Programmatic access\'. After that, click on the button \'Next: Permissions\'.")
            print("5. In the field \'Set Permissions\', keep the \'Add user to a group\' button marked.")
            print("6. In the field \'Add user to a group\', click on \'Create group\' (alternatively, you can be added to a group already configured or copy the permissions of another user.")
            print("7. In the text box \'Group\'s name\', set a name for the new group of permissions.")
            print("8. In the search bar below (\'Filter politics\'), search for a politics that fill your needs, and check the option button on the left of this politic. The politics \'AmazonS3FullAccess\' grants full access to the S3 content.")
            print("9. Finally, click on \'Create a group\'.")
            print("10. After the group is created, it will appear with a check box marked, over the previous groups. Keep it marked and click on the button \'Next: Tags\'.")
            print("11. Create and note down the Access key ID and Secret access key. You can also download a comma separated values (CSV) file containing the credentials for future use.")
            print("ATTENTION: These parameters are required for accessing the bucket\'s content from any application, including AWS SageMaker.")
            print("12. Click on \'Next: Review\' and review the user credentials information and permissions.")
            print("13. Click on \'Create user\' and click on the download button to download the CSV file containing the user credentials information.")
            print("The headers of the CSV file (the stored fields) are: \'User name, Password, Access key ID, Secret access key, Console login link\'.")
            print("You need both the values indicated as \'Access key ID\' and as \'Secret access key\' to fetch the S3 bucket.")
            print("\n") # line break
            print("After acquiring the necessary user privileges, use the boto3 library to export the file from the notebook’s workspace to the bucket (i.e., to upload a file to the bucket).")
            print("For exporting the file as a new bucket\'s file use the following code:\n")
            print("1. Set a variable \'access_key\' as the value (string) stored as \'Access key ID\' in your user security credentials CSV file.")
            print("2. Set a variable \'secret_key\' as the value (string) stored as \'Secret access key\' in your user security credentials CSV file.")
            print("3. Set a variable \'bucket_name\' as a string containing only the name of the bucket. Do not add subdirectories, folders (prefixes), or file names.")
            print("Example: if your bucket is named \'my_bucket\' and its main directory contains folders like \'folder1\', \'folder2\', etc, do not declare bucket_name = \'my_bucket/folder1\', even if you only want files from folder1.")
            print("ALWAYS declare only the bucket\'s name: bucket_name = \'my_bucket\'.")
            print("4. Set a variable \'file_path_in_workspace\' containing the path of the file in notebook’s workspace. The file will be exported from “file_path_in_workspace” to the S3 bucket.")
            print("If the file is stored in the notebook\'s root (main) directory: file_path_in_workspace = \"my_file.ext\".")
            print("If the path of the file in the notebook workspace is: \'dir1/…/dirN/my_file.ext\', where dirN is the N-th subdirectory, and dir1 is a folder or directory of the main (root) workspace directory: file_path_in_workspace = \"dir1/…/dirN/my_file.ext\".")
            print("5. Set a variable named \'file_path_in_s3\' containing the path from the bucket’s subdirectories to the file you want to create. Include the file name and its extension.")
            print("6. Finally, declare the following code, which refers to the defined variables:\n")

            # Let's use triple quotes to declare a multi-line string
            example_code = """
                import boto3
                # Start S3 client as the object 's3_client'
                s3_client = boto3.resource('s3', aws_access_key_id = access_key, aws_secret_access_key = secret_key)
                # Connect to the bucket specified as 'bucket_name'.
                # The bucket is started as the object 's3_bucket':
                s3_bucket = s3_client.Bucket(bucket_name)
                # Start the new object in the bucket previously started as 's3_bucket'.
                # Start it with the specified prefix, in file_path_in_s3:
                new_s3_object = s3_bucket.Object(file_path_in_s3)
                # Finally, upload the file in file_path_in_workspace.
                # Make new_s3_object the exported file:
                # Upload the selected object from the workspace path file_path_in_workspace
                # to the S3 path specified as file_path_in_s3.
                # The parameter Filename must be input with the path of the uploaded file, including its name and
                # extension. Example: Filename = "/my_table.xlsx" uploads an xlsx file named 'my_table' from
                # the notebook's main (root) directory.
                new_s3_object.upload_file(Filename = file_path_in_workspace)
                """

            print(example_code)

            print("An object \'my_file.ext\' in the main (root) directory of the s3_bucket is stored in Python environment as:")
            print("""s3.ObjectSummary(bucket_name='bucket_name', key='my_file.ext')""") 
            # triple quotes to keep the internal quotes without using too many backslashes "\" (the escape character)
            print("Then, the name of each object is stored as the attribute \'key\' of the object. To view all objects, we can loop through their \'key\' attributes:\n")
            example_code = """
                # Loop through all objects of the bucket:
                for stored_obj in s3_bucket.objects.all():
                    # Loop through all elements 'stored_obj' from s3_bucket.objects.all(),
                    # which stores the ObjectSummary for all objects in the bucket s3_bucket.
                    # Print the object's name:
                    print(stored_obj.key)
                    """

            print(example_code)

## **Call the functions**

### **Mounting Google Drive or S3 (AWS Simple Storage Service) bucket**

In [None]:
SOURCE = 'aws'
# SOURCE = 'google' for mounting the google drive;
# SOURCE = 'aws' for accessing an AWS S3 bucket

## THE FOLLOWING PARAMETERS HAVE EFFECT ONLY WHEN SOURCE == 'aws':

PATH_TO_STORE_IMPORTED_S3_BUCKET = ''
# PATH_TO_STORE_IMPORTED_S3_BUCKET: path of the Python environment to which the
# S3 bucket contents will be imported. If it is None; or if it is an empty string; or if 
# PATH_TO_STORE_IMPORTED_S3_BUCKET = '/', bucket will be imported to the root path. 
# Alternatively, input the path as a string (in quotes). e.g. 
# PATH_TO_STORE_IMPORTED_S3_BUCKET = 'copied_s3_bucket'

S3_BUCKET_NAME = 'my_bucket'
## This parameter is mandatory for accessing an AWS S3 bucket. Substitute it for a string
# with the bucket's name. e.g. S3_BUCKET_NAME = "aws-bucket-1" accesses a bucket named
# "aws-bucket-1"

S3_OBJECT_FOLDER_PREFIX = ""
# S3_OBJECT_FOLDER_PREFIX = None. Keep it None; or as an empty string 
# (S3_OBJECT_FOLDER_PREFIX = ''); or as the root "/" to import the 
# whole bucket content, instead of a single object from it.
# Alternatively, set it as a string containing the subfolder from the bucket to import:
# Suppose that your bucket (admin-created) has four objects with the following object 
# keys: Development/Projects.xlsx; Finance/statement1.pdf; Private/taxdocument.pdf; and
# s3-dg.pdf. 
# The s3-dg.pdf key does not have a prefix, so its object appears directly 
# at the root level of the bucket. If you open the Development/ folder, you see 
# the Projects.xlsx object in it.
# In summary, if the path of the file is: 'bucket/my_path/.../file.csv'
# where 'bucket' is the bucket's name, prefix = 'my_path/.../', without the
# 'file.csv' (file name with extension) last part.

# So, declare the prefix as S3_OBJECT_FOLDER_PREFIX to import only files from
# a given folder (directory) of the bucket.
# DO NOT PUT A SLASH before (to the left of) the prefix;
# DO NOT ADD THE BUCKET'S NAME to the left of the prefix:
# S3_OBJECT_FOLDER_PREFIX = "bucket_directory1/.../bucket_directoryN/"

# Alternatively, provide the full path of a given file if you want to import only it:
# S3_OBJECT_FOLDER_PREFIX = "bucket_directory1/.../bucket_directoryN/my_file.ext"
# where my_file is the file's name, and ext is its extension.


# Attention: after running this function for fetching AWS Simple Storage Service (S3), 
# your 'AWS Access key ID' and your 'Secret access key' will be requested.
# The 'Secret access key' will be hidden behind dots, so it cannot be viewed or copied by
# other users. On the other hand, the same is not true for the 'Access key ID', the bucket's name 
# and the prefix. All of these are sensitive information from the organization.
# Therefore, after importing the information, always remember to clear the output of this cell
# and to remove such information from the strings.
# Remember that these data may grant access to protected information, 
# so they should not be shared with unauthorized people.

# Also, remember to delete the imported files from the workspace after finishing the analysis.
# The cost of storing files in S3 is much lower than that of storing them directly in the
# workspace. Also, files stored in S3 may be accessed by users other than those with access to
# the notebook's workspace.
mount_storage_system (source = SOURCE, path_to_store_imported_s3_bucket = PATH_TO_STORE_IMPORTED_S3_BUCKET, s3_bucket_name = S3_BUCKET_NAME, s3_obj_prefix = S3_OBJECT_FOLDER_PREFIX)
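
The relation between a full object path, the bucket name, and the folder prefix can be sketched in plain Python. This is only an illustration of the naming convention described in the comments above: the helper `split_s3_path` is hypothetical and is not part of this notebook or of boto3.

```python
def split_s3_path(full_path):
    """Split 'bucket/dir1/.../file.ext' into (bucket_name, prefix, file_name).

    Hypothetical helper, written only to illustrate the convention.
    """
    # Drop any leading slash so the bucket name is always the first component.
    parts = full_path.lstrip("/").split("/")
    bucket_name = parts[0]
    file_name = parts[-1] if len(parts) > 1 else ""
    # Everything between the bucket name and the file name is the prefix.
    folders = parts[1:-1]
    prefix = "/".join(folders) + "/" if folders else ""
    return bucket_name, prefix, file_name


# An object stored inside a folder: the prefix ends with a slash and has no leading slash.
print(split_s3_path("my_bucket/Development/Projects.xlsx"))
# An object stored at the root level of the bucket: the prefix is empty.
print(split_s3_path("my_bucket/s3-dg.pdf"))
```

The second element of each tuple is the kind of value expected in S3_OBJECT_FOLDER_PREFIX when a whole folder should be imported.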

### **Importing the dataset**

In [None]:
## WARNING: Use this function to load dataframes stored on Excel (xls, xlsx, xlsm, xlsb, odf, ods and odt), 
## JSON, txt, or CSV (comma separated values) files.

FILE_DIRECTORY_PATH = ""
# FILE_DIRECTORY_PATH - (string, in quotes): input the path of the directory 
# (e.g. folder path) where the file is stored. e.g. FILE_DIRECTORY_PATH = "" 
# or FILE_DIRECTORY_PATH = "folder"

FILE_NAME_WITH_EXTENSION = "dataset.csv"
# FILE_NAME_WITH_EXTENSION - (string, in quotes): input the name of the file with the 
# extension. e.g. FILE_NAME_WITH_EXTENSION = "file.xlsx", or, 
# FILE_NAME_WITH_EXTENSION = "file.csv", "file.txt", or "file.json"
# Again, the extensions may be: xls, xlsx, xlsm, xlsb, odf, ods, odt, json, txt or csv.

LOAD_TXT_FILE_WITH_JSON_FORMAT = False
# LOAD_TXT_FILE_WITH_JSON_FORMAT = False. Set LOAD_TXT_FILE_WITH_JSON_FORMAT = True 
# if you want to read a file with txt extension containing a text formatted as JSON 
# (but not saved as JSON).
# WARNING: if LOAD_TXT_FILE_WITH_JSON_FORMAT = True, all the JSON file parameters of the 
# function (below) must be set. If not, an error message will be raised.

HOW_MISSING_VALUES_ARE_REGISTERED = None
# HOW_MISSING_VALUES_ARE_REGISTERED = None: keep it None if missing values are registered as None,
# empty or np.nan. Pandas automatically converts None to NumPy np.nan objects (floats).
# This parameter manipulates the argument na_values (default: None) from Pandas functions.
# By default the following values are interpreted as NaN: '', '#N/A', '#N/A N/A', '#NA', '-1.#IND', 
# '-1.#QNAN', '-NaN', '-nan', '1.#IND', '1.#QNAN', '<NA>', 'N/A', 'NA', 'NULL', 'NaN', 
# 'n/a', 'nan', 'null'.

# If a different denomination is used, indicate it as a string. e.g.
# HOW_MISSING_VALUES_ARE_REGISTERED = '.' will convert all strings '.' to missing values;
# HOW_MISSING_VALUES_ARE_REGISTERED = 0 will convert zeros to missing values.

# If a dictionary is passed, it specifies per-column NA values. For example, if zero is the
# missing value only in column 'numeric_col', you can specify the following dictionary:
# HOW_MISSING_VALUES_ARE_REGISTERED = {'numeric_col': 0}

    
HAS_HEADER = True
# HAS_HEADER = True if the imported table has headers (row with column names).
# Alternatively, HAS_HEADER = False if the dataframe does not have header.

DECIMAL_SEPARATOR = '.'
# DECIMAL_SEPARATOR = '.' - String. Keep it '.' or None to use the period ('.') as
# the decimal separator. Alternatively, specify here the separator.
# e.g. DECIMAL_SEPARATOR = ',' will set the comma as the separator.
# It manipulates the argument 'decimal' from Pandas functions.

TXT_CSV_COL_SEP = "comma"
# TXT_CSV_COL_SEP = "comma" - This parameter has effect only when the file is a 'txt'
# or 'csv'. It informs how the different columns are separated.
# Set TXT_CSV_COL_SEP = "comma", or TXT_CSV_COL_SEP = "," 
# for columns separated by commas;
# TXT_CSV_COL_SEP = "whitespace", or TXT_CSV_COL_SEP = " " 
# for columns separated by simple spaces.
# You can also set a specific separator as a string. For example:
# TXT_CSV_COL_SEP = '\s+'; or TXT_CSV_COL_SEP = '\t' (in this last example, the tab
# is used as separator for the columns - '\t' represents the tab character).

## Parameters for loading Excel files:

LOAD_ALL_SHEETS_AT_ONCE = False
# LOAD_ALL_SHEETS_AT_ONCE = False - This parameter has effect only for Excel files.
# If LOAD_ALL_SHEETS_AT_ONCE = True, the function will return a list of dictionaries, each
# dictionary containing 2 key-value pairs: the first key will be 'sheet', and its
# value will be the name (or number) of the table (sheet). The second key will be 'df',
# and its value will be the pandas dataframe object obtained from that sheet.
# This argument has preference over SHEET_TO_LOAD. If it is True, all sheets will be loaded.
    
SHEET_TO_LOAD = None
# SHEET_TO_LOAD - This parameter has effect only for Excel files.
# Keep SHEET_TO_LOAD = None not to specify a sheet of the file, so that the first sheet
# will be loaded.
# SHEET_TO_LOAD may be an integer or a string (inside quotes). SHEET_TO_LOAD = 0
# loads the first sheet (sheet with index 0); SHEET_TO_LOAD = 1 loads the second sheet
# of the file (index 1); SHEET_TO_LOAD = "Sheet1" loads a sheet named as "Sheet1".
# Declare a number to load the sheet with that index, starting from 0; or declare a
# name to load the sheet with that name.

## Parameters for loading JSON files:

JSON_RECORD_PATH = None
# JSON_RECORD_PATH (string): manipulates the parameter 'record_path' from the json_normalize method.
# Path in each object to the list of records. If not passed, data will be assumed to 
# be an array of records. If a given field from the JSON stores a nested JSON (or a nested
# dictionary), declare it here to decompose the content of the nested data. e.g. if the field
# 'books' stores a nested JSON, declare JSON_RECORD_PATH = 'books'

JSON_FIELD_SEPARATOR = "_"
# JSON_FIELD_SEPARATOR = "_" (string). Manipulates the parameter 'sep' from json_normalize method.
# Nested records will generate names separated by sep. 
# e.g., for JSON_FIELD_SEPARATOR = ".", {'foo': {'bar': 0}} -> foo.bar.
# Then, if a given field 'main_field' stores a nested JSON with fields 'field1', 'field2', ...
# the name of the columns of the dataframe will be formed by concatenating 'main_field', the
# separator, and the names of the nested fields: 'main_field_field1', 'main_field_field2',...

JSON_METADATA_PREFIX_LIST = None
# JSON_METADATA_PREFIX_LIST: list of strings (in quotes). Manipulates the parameter 
# 'meta' from json_normalize method. Fields to use as metadata for each record in resulting 
# table. Declare here the non-nested fields, i.e., the fields in the principal JSON. They
# will be repeated in the rows of the dataframe to give the metadata (context) of the rows.

# e.g. Suppose a JSON with the following structure: {'name': 'Mary', 'last': 'Shelley',
# 'books': [{'title': 'Frankenstein', 'year': 1818}, {'title': 'Mathilda', 'year': 1819}, {'title': 'The Last Man', 'year': 1826}]},
# Here, there are nested JSONs in the field 'books'. The fields that are not nested
# are 'name' and 'last'.
# Then, JSON_RECORD_PATH = 'books'
# JSON_METADATA_PREFIX_LIST = ['name', 'last']


# The dataframe will be stored in the object named 'dataset':
# Simply modify this object on the left of equality:
dataset = load_pandas_dataframe (file_directory_path = FILE_DIRECTORY_PATH, file_name_with_extension = FILE_NAME_WITH_EXTENSION, load_txt_file_with_json_format = LOAD_TXT_FILE_WITH_JSON_FORMAT, how_missing_values_are_registered = HOW_MISSING_VALUES_ARE_REGISTERED, has_header = HAS_HEADER, decimal_separator = DECIMAL_SEPARATOR, txt_csv_col_sep = TXT_CSV_COL_SEP, load_all_sheets_at_once = LOAD_ALL_SHEETS_AT_ONCE, sheet_to_load = SHEET_TO_LOAD, json_record_path = JSON_RECORD_PATH, json_field_separator = JSON_FIELD_SEPARATOR, json_metadata_prefix_list = JSON_METADATA_PREFIX_LIST)

# OBS: If an Excel file is loaded and LOAD_ALL_SHEETS_AT_ONCE = True, then the object
# dataset will be a list of dictionaries, with the key 'sheet' storing the sheet name, and the key
# 'df' storing the corresponding Pandas dataframe. So, to access the 3rd dataframe (index 2, since
# indexing starts from zero): df = dataset[2]['df'], where dataset is the list returned.
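
As a rough illustration of what the missing-value, decimal-separator, and column-separator parameters control, the sketch below calls `pandas.read_csv` directly on an in-memory text. It is not the implementation of `load_pandas_dataframe`; the file content and the column names are made up for the example.

```python
import io

import pandas as pd

# Hypothetical file content: ';' separates the columns, ',' is the decimal
# separator, and '.' registers a missing value.
csv_text = "sample;temperature\nA;20,5\nB;.\nC;22,0\n"

df = pd.read_csv(
    io.StringIO(csv_text),
    sep=";",        # how the columns are separated
    decimal=",",    # decimal separator used in the numbers
    na_values=".",  # additional string to interpret as a missing value
    header=0,       # the first row stores the column names
)

print(df)
print(df.dtypes)
```

With these arguments, the column 'temperature' is read as a float column and the '.' entry becomes NaN, instead of the whole column being kept as text.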

### **Converting JSON object to dataframe**

In [None]:
# JSON object in terms of Python structure: list of dictionaries, where each value of a
# dictionary may be a dictionary or a list of dictionaries (nested structures).
# example of highly nested structure saved as a list 'json_formatted_list'. Note that the same
# structure could be declared and stored into a string variable. For instance, if you have a txt
# file containing JSON, you could read the txt and save its content as a string.
# json_formatted_list = [{'field1': val1, 'field2': {'dict_val': dict_val}, 'field3': [{
# 'nest1': nest_val1}, {'nest2': nest_val2}]}, {'field1': val1, 'field2': {'dict_val': dict_val}, 
# 'field3': [{'nest1': nest_val1}, {'nest2': nest_val2}]}]

JSON_OBJ_TO_CONVERT = json_object #Alternatively: object containing the JSON to be converted

# JSON_OBJ_TO_CONVERT: object containing JSON, or string with JSON content to parse.
# Objects may be: string with JSON formatted text;
# list with nested dictionaries (JSON formatted);
# dictionaries, possibly with nested dictionaries (JSON formatted).

JSON_OBJ_TYPE = 'list'
# JSON_OBJ_TYPE = 'list', in case the object was saved as a list of dictionaries (JSON format)
# JSON_OBJ_TYPE = 'string', in case it was saved as a string (text) containing JSON.

## Parameters for loading JSON files:

JSON_RECORD_PATH = None
# JSON_RECORD_PATH (string): manipulates the parameter 'record_path' from the json_normalize method.
# Path in each object to the list of records. If not passed, data will be assumed to 
# be an array of records. If a given field from the JSON stores a nested JSON (or a nested
# dictionary), declare it here to decompose the content of the nested data. e.g. if the field
# 'books' stores a nested JSON, declare JSON_RECORD_PATH = 'books'

JSON_FIELD_SEPARATOR = "_"
# JSON_FIELD_SEPARATOR = "_" (string). Manipulates the parameter 'sep' from json_normalize method.
# Nested records will generate names separated by sep. 
# e.g., for JSON_FIELD_SEPARATOR = ".", {'foo': {'bar': 0}} -> foo.bar.
# Then, if a given field 'main_field' stores a nested JSON with fields 'field1', 'field2', ...
# the name of the columns of the dataframe will be formed by concatenating 'main_field', the
# separator, and the names of the nested fields: 'main_field_field1', 'main_field_field2',...

JSON_METADATA_PREFIX_LIST = None
# JSON_METADATA_PREFIX_LIST: list of strings (in quotes). Manipulates the parameter 
# 'meta' from json_normalize method. Fields to use as metadata for each record in resulting 
# table. Declare here the non-nested fields, i.e., the fields in the principal JSON. They
# will be repeated in the rows of the dataframe to give the metadata (context) of the rows.

# e.g. Suppose a JSON with the following structure: [{'name': 'Mary', 'last': 'Shelley',
# 'books': [{'title': 'Frankenstein', 'year': 1818}, {'title': 'Mathilda', 'year': 1819}, {'title': 'The Last Man', 'year': 1826}]}]
# Here, there are nested JSONs in the field 'books'. The fields that are not nested
# are 'name' and 'last'.
# Then, JSON_RECORD_PATH = 'books'
# JSON_METADATA_PREFIX_LIST = ['name', 'last']


# The dataframe will be stored in the object named 'dataset':
# Simply modify this object on the left of equality:
dataset = json_obj_to_pandas_dataframe (json_obj_to_convert = JSON_OBJ_TO_CONVERT, json_obj_type = JSON_OBJ_TYPE, json_record_path = JSON_RECORD_PATH, json_field_separator = JSON_FIELD_SEPARATOR, json_metadata_prefix_list = JSON_METADATA_PREFIX_LIST)
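
The effect of the record path, the metadata list, and the field separator can be seen by calling `pandas.json_normalize` on the example from the comments above. This is a minimal sketch of the underlying Pandas call, not the code of `json_obj_to_pandas_dataframe`; the variable `books_df` is introduced only for the illustration.

```python
import pandas as pd

# JSON-formatted list: the field 'books' stores nested records,
# while 'name' and 'last' are non-nested fields (metadata).
json_formatted_list = [
    {
        "name": "Mary",
        "last": "Shelley",
        "books": [
            {"title": "Frankenstein", "year": 1818},
            {"title": "Mathilda", "year": 1819},
            {"title": "The Last Man", "year": 1826},
        ],
    }
]

books_df = pd.json_normalize(
    json_formatted_list,
    record_path="books",     # each nested record becomes one row
    meta=["name", "last"],   # repeated in every row as context
    sep="_",                 # separator for names of nested fields
)

print(books_df)
```

Each book becomes one row with the columns 'title' and 'year', and the metadata columns 'name' and 'last' are repeated in all rows.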

### **Removing trailing or leading white spaces or characters (trim) from string variables, and modifying the variable type**

In [None]:
DATASET = dataset #Alternatively: object containing the dataset to be analyzed

COLUMN_TO_ANALYZE = 'column_to_analyze'
# COLUMN_TO_ANALYZE: string (inside quotes), 
# containing the name of the column that will be analyzed. 
# e.g. COLUMN_TO_ANALYZE = "column1" will analyze the column named as 'column1'.

NEW_VARIABLE_TYPE = None
# NEW_VARIABLE_TYPE = None. String (in quotes) that represents a given data type for the column
# after transformation. Set:
# - NEW_VARIABLE_TYPE = 'int' to convert the column to integer type after the transform;
# - NEW_VARIABLE_TYPE = 'float' to convert the column to float (decimal number);
# - NEW_VARIABLE_TYPE = 'datetime' to convert it to date or timestamp;
# - NEW_VARIABLE_TYPE = 'category' to convert it to Pandas categorical variable.
    
METHOD = 'trim'
# METHOD = 'trim' will eliminate trailing and leading white spaces from the strings in
# COLUMN_TO_ANALYZE.
# METHOD = 'substring' will eliminate a defined trailing and leading substring from
# COLUMN_TO_ANALYZE.

SUBSTRING_TO_ELIMINATE = None
# SUBSTRING_TO_ELIMINATE = None. Set as a string (in quotes) if METHOD = 'substring'.
# e.g. suppose COLUMN_TO_ANALYZE contains time information: each string ends in " min":
# "1 min", "2 min", "3 min", etc. If SUBSTRING_TO_ELIMINATE = " min", this portion will be
# eliminated, resulting in: "1", "2", "3", etc. If NEW_VARIABLE_TYPE = None, these values will
# continue to be strings. By setting NEW_VARIABLE_TYPE = 'int' or 'float', the series will be
# converted to a numeric type.

CREATE_NEW_COLUMN = True
# CREATE_NEW_COLUMN = True
# Set CREATE_NEW_COLUMN = True to store the transformed data into a new
# column. Alternatively, set CREATE_NEW_COLUMN = False to overwrite the existing column.
NEW_COLUMN_SUFFIX = '_trim'
# NEW_COLUMN_SUFFIX = "_trim"
# This value has effect only if CREATE_NEW_COLUMN = True.
# The new column name will be set as column + NEW_COLUMN_SUFFIX. Then, if the original
# column was "column1" and the suffix is "_trim", the new column will be named as
# "column1_trim".
# Alternatively, input inside quotes a string with the desired suffix. Recommendation:
# start the suffix with "_" to separate it from the original name.
    

# The dataframe will be stored in the object named transf_dataset:
# Simply modify this object on the left of equality:
transf_dataset = trim_spaces_or_characters (df = DATASET, column_to_analyze = COLUMN_TO_ANALYZE, new_variable_type = NEW_VARIABLE_TYPE, method = METHOD, substring_to_eliminate = SUBSTRING_TO_ELIMINATE, create_new_column = CREATE_NEW_COLUMN, new_column_suffix = NEW_COLUMN_SUFFIX)
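
The " min" example from the comments above can be reproduced with plain Pandas string methods. The sketch below assumes that the substring is removed at most once from each end of the string; the actual behavior of `trim_spaces_or_characters` may differ, and the helper `strip_substring` is hypothetical.

```python
import pandas as pd


def strip_substring(text, substring):
    """Remove `substring` once from the start and once from the end of `text`."""
    if not substring:
        return text
    if text.startswith(substring):
        text = text[len(substring):]
    if text.endswith(substring):
        text = text[:-len(substring)]
    return text


df = pd.DataFrame({"duration": [" 1 min", "2 min ", "3 min"]})

# Step 1 (METHOD = 'trim'): remove leading and trailing white spaces.
trimmed = df["duration"].str.strip()

# Step 2 (METHOD = 'substring'): remove the unit from the ends of each string.
without_unit = trimmed.map(lambda text: strip_substring(text, " min"))

# Step 3 (NEW_VARIABLE_TYPE = 'int'): convert the cleaned strings to integers,
# storing them in a new column named with the suffix '_trim'.
df["duration_trim"] = without_unit.astype("int64")

print(df)
```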

### **Capitalizing or lowering case of string variables (string homogenizing)**

In [None]:
DATASET = dataset #Alternatively: object containing the dataset to be analyzed

COLUMN_TO_ANALYZE = 'column_to_analyze'
# COLUMN_TO_ANALYZE: string (inside quotes), 
# containing the name of the column that will be analyzed. 
# e.g. COLUMN_TO_ANALYZE = "column1" will analyze the column named as 'column1'.

METHOD = 'lowercase'
# METHOD = 'capitalize' will capitalize all letters from the input string 
# (turn them to upper case).
# METHOD = 'lowercase' will do the opposite: turn all letters to lower case.
# e.g. suppose COLUMN_TO_ANALYZE contains strings such as 'String One', 'STRING 2',  and
# 'string3'. If METHOD = 'capitalize', the output will contain the strings: 
# 'STRING ONE', 'STRING 2', 'STRING3'. If METHOD = 'lowercase', the outputs will be:
# 'string one', 'string 2', 'string3'.

CREATE_NEW_COLUMN = True
# CREATE_NEW_COLUMN = True
# Set CREATE_NEW_COLUMN = True to store the transformed data into a new
# column. Alternatively, set CREATE_NEW_COLUMN = False to overwrite the existing column.
NEW_COLUMN_SUFFIX = '_homogenized'
# NEW_COLUMN_SUFFIX = "_homogenized"
# This value has effect only if CREATE_NEW_COLUMN = True.
# The new column name will be set as column + NEW_COLUMN_SUFFIX. Then, if the original
# column was "column1" and the suffix is "_homogenized", the new column will be named as
# "column1_homogenized".
# Alternatively, input inside quotes a string with the desired suffix. Recommendation:
# start the suffix with "_" to separate it from the original name.
    
    
# The dataframe will be stored in the object named transf_dataset:
# Simply modify this object on the left of equality:
transf_dataset = capitalize_or_lower_string_case (df = DATASET, column_to_analyze = COLUMN_TO_ANALYZE, method = METHOD, create_new_column = CREATE_NEW_COLUMN, new_column_suffix = NEW_COLUMN_SUFFIX)
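
One reason for homogenizing the case is that labels differing only in capitalization are counted as different categories. The sketch below shows this with the Pandas methods `str.lower` and `str.upper`; the dataframe and its column names are made up, and this is not the code of `capitalize_or_lower_string_case`.

```python
import pandas as pd

df = pd.DataFrame({"label": ["String One", "STRING ONE", "string one", "STRING 2"]})

# METHOD = 'lowercase': all letters turned to lower case, saved in a new column.
df["label_homogenized"] = df["label"].str.lower()

# METHOD = 'capitalize' (as defined in this notebook): all letters to upper case.
upper_case = df["label"].str.upper()

# Number of distinct labels before and after homogenizing:
print(df["label"].nunique())
print(df["label_homogenized"].nunique())
```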

### **Adding contractions to the contractions library**

In [None]:
LIST_OF_CONTRACTIONS = [
    
    {'contracted_expression': None, 'correct_expression': None}, 
    {'contracted_expression': None, 'correct_expression': None}, 
    {'contracted_expression': None, 'correct_expression': None}, 
    {'contracted_expression': None, 'correct_expression': None}

]
# LIST_OF_CONTRACTIONS = [{'contracted_expression': None, 'correct_expression': None}]
# This is a list of dictionaries, where each dictionary contains two key-value pairs:
# the first one contains the form in which the contraction is usually observed; and the second one 
# contains the correct (full) string that will replace it.
# Since contractions can cause issues when processing text, we can expand them with these functions.
        
# The object LIST_OF_CONTRACTIONS must be declared as a list, 
# in brackets, even if there is a single dictionary.
# Always use the same keys: 'contracted_expression' for the contraction; and 'correct_expression', 
# for the strings with the corresponding correction.
        
# If you want, you can remove elements (dictionaries) from the list to declare fewer elements;
# and you can also add more elements (dictionaries) to the list, if you want to add more
# contractions to the library.
# Simply put a comma after the last element from the list and declare a new dictionary, keeping the
# same keys: {'contracted_expression': original_str, 'correct_expression': new_str}, 
# where original_str and new_str represent the contracted and expanded strings
# (if one of the values is None, the dictionary will be ignored).
        
# Example:
# LIST_OF_CONTRACTIONS = [{'contracted_expression': 'mychange', 'correct_expression': 'my change'}]
        

add_contractions_to_library (list_of_contractions = LIST_OF_CONTRACTIONS)
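
The rule that dictionaries containing None are ignored can be sketched as follows. The function `build_contraction_map` is a hypothetical helper written for this illustration, under the assumption that the list is reduced to a simple mapping from the contracted form to the full form; it is not the code of `add_contractions_to_library`.

```python
def build_contraction_map(list_of_contractions):
    """Convert the list of dictionaries into {contracted: correct}, skipping None entries."""
    contraction_map = {}
    for entry in list_of_contractions:
        contracted = entry.get("contracted_expression")
        correct = entry.get("correct_expression")
        # Ignore incomplete dictionaries, where one of the values is None.
        if contracted is None or correct is None:
            continue
        contraction_map[contracted] = correct
    return contraction_map


example_list = [
    {"contracted_expression": "mychange", "correct_expression": "my change"},
    {"contracted_expression": None, "correct_expression": None},
    {"contracted_expression": "can't", "correct_expression": None},
]

print(build_contraction_map(example_list))
```

Only the first dictionary is complete, so it is the only one that reaches the resulting mapping.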

### **Correcting contracted strings**

In [None]:
DATASET = dataset #Alternatively: object containing the dataset to be analyzed

COLUMN_TO_ANALYZE = 'column_to_analyze'
# COLUMN_TO_ANALYZE: string (inside quotes), 
# containing the name of the column that will be analyzed. 
# e.g. COLUMN_TO_ANALYZE = "column1" will analyze the column named as 'column1'.

CREATE_NEW_COLUMN = True
# CREATE_NEW_COLUMN = True
# Set CREATE_NEW_COLUMN = True to store the transformed data into a new
# column. Alternatively, set CREATE_NEW_COLUMN = False to overwrite the existing column.
NEW_COLUMN_SUFFIX = '_contractionsFixed'
# NEW_COLUMN_SUFFIX = "_contractionsFixed"
# This value has effect only if CREATE_NEW_COLUMN = True.
# The new column name will be set as column + NEW_COLUMN_SUFFIX. Then, if the original
# column was "column1" and the suffix is "_contractionsFixed", the new column will be named as
# "column1_contractionsFixed".
# Alternatively, input inside quotes a string with the desired suffix. Recommendation:
# start the suffix with "_" to separate it from the original name.


# The dataframe will be stored in the object named transf_dataset:
# Simply modify this object on the left of equality:
transf_dataset = correct_contracted_strings (df = DATASET, column_to_analyze = COLUMN_TO_ANALYZE, create_new_column = CREATE_NEW_COLUMN, new_column_suffix = NEW_COLUMN_SUFFIX)
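
A simplified view of contraction expansion is a search-and-replace driven by a dictionary. The sketch below uses only the standard `re` module and a small, made-up mapping; it is an assumption about the general technique, not the implementation of `correct_contracted_strings` nor the contents of the contractions library.

```python
import re

# Hypothetical mapping from contracted forms to full forms.
contraction_map = {"can't": "cannot", "it's": "it is", "mychange": "my change"}


def expand_contractions(text, mapping):
    """Replace every whole-word occurrence of a contraction by its full form."""
    if not mapping:
        return text
    # Try longer contractions first, so that a short one does not match
    # inside a longer one.
    alternatives = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(" + "|".join(re.escape(key) for key in alternatives) + r")(?!\w)",
        flags=re.IGNORECASE,
    )
    lowered = {key.lower(): value for key, value in mapping.items()}
    # The match is case-insensitive; the replacement uses the case stored in the mapping.
    return pattern.sub(lambda match: lowered[match.group(0).lower()], text)


print(expand_contractions("It's clear we can't skip mychange.", contraction_map))
```

The lookarounds `(?<!\w)` and `(?!\w)` restrict the replacement to whole words, so a contraction is not expanded when it appears inside a longer word.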

### **Substituting (replacing) substrings on string variables**

In [None]:
DATASET = dataset #Alternatively: object containing the dataset to be analyzed

COLUMN_TO_ANALYZE = 'column_to_analyze'
# COLUMN_TO_ANALYZE: string (inside quotes), 
# containing the name of the column that will be analyzed. 
# e.g. COLUMN_TO_ANALYZE = "column1" will analyze the column named as 'column1'.

SUBSTRING_TO_BE_REPLACED = None
NEW_SUBSTRING_FOR_REPLACEMENT = ''
# SUBSTRING_TO_BE_REPLACED = None; NEW_SUBSTRING_FOR_REPLACEMENT = ''. 
# Strings (in quotes): when the sequence of characters SUBSTRING_TO_BE_REPLACED is
# found in the strings from COLUMN_TO_ANALYZE, it will be substituted by the substring
# NEW_SUBSTRING_FOR_REPLACEMENT. If None is provided to one of these substring arguments,
# it will be substituted by the empty string: ''
# e.g. suppose COLUMN_TO_ANALYZE contains the following strings, with a spelling error:
# "my collumn 1", 'his collumn 2', 'her column 3'. We may correct this error by setting:
# SUBSTRING_TO_BE_REPLACED = 'collumn' and NEW_SUBSTRING_FOR_REPLACEMENT = 'column'. The
# function will search for the wrong group of characters and, if it finds it, will substitute
# by the correct sequence: "my column 1", 'his column 2', 'her column 3'.

CREATE_NEW_COLUMN = True
# CREATE_NEW_COLUMN = True
# Set CREATE_NEW_COLUMN = True to store the transformed data into a new
# column. Alternatively, set CREATE_NEW_COLUMN = False to overwrite the existing column.
NEW_COLUMN_SUFFIX = '_substringReplaced'
# NEW_COLUMN_SUFFIX = "_substringReplaced"
# This value has effect only if CREATE_NEW_COLUMN = True.
# The new column name will be set as column + NEW_COLUMN_SUFFIX. Then, if the original
# column was "column1" and the suffix is "_substringReplaced", the new column will be named as
# "column1_substringReplaced".
# Alternatively, input inside quotes a string with the desired suffix. Recommendation:
# start the suffix with "_" to separate it from the original name.


# The dataframe will be stored in the object named transf_dataset:
# Simply modify this object on the left of equality:
transf_dataset = replace_substring (df = DATASET, column_to_analyze = COLUMN_TO_ANALYZE, substring_to_be_replaced = SUBSTRING_TO_BE_REPLACED, new_substring_for_replacement = NEW_SUBSTRING_FOR_REPLACEMENT, create_new_column = CREATE_NEW_COLUMN, new_column_suffix = NEW_COLUMN_SUFFIX)
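
The 'collumn' example from the comments above can be reproduced with the Pandas method `str.replace`. This is a minimal sketch that assumes a literal (non-regex) replacement; it is not the code of `replace_substring`, and the dataframe is made up for the example.

```python
import pandas as pd

df = pd.DataFrame({"column1": ["my collumn 1", "his collumn 2", "her column 3"]})

substring_to_be_replaced = "collumn"
new_substring_for_replacement = "column"

# regex=False treats the substring literally, so special characters such as
# '.' or '(' in the searched substring are not interpreted as patterns.
df["column1_substringReplaced"] = df["column1"].str.replace(
    substring_to_be_replaced, new_substring_for_replacement, regex=False
)

print(df)
```

Strings that do not contain the searched substring, like 'her column 3', are returned unchanged.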

### **Inverting the order of the string characters**

In [None]:
DATASET = dataset #Alternatively: object containing the dataset to be analyzed

COLUMN_TO_ANALYZE = 'column_to_analyze'
# COLUMN_TO_ANALYZE: string (inside quotes), 
# containing the name of the column that will be analyzed. 
# e.g. COLUMN_TO_ANALYZE = "column1" will analyze the column named as 'column1'.

CREATE_NEW_COLUMN = True
# CREATE_NEW_COLUMN = True
# Set CREATE_NEW_COLUMN = True to store the transformed data into a new
# column. Alternatively, set CREATE_NEW_COLUMN = False to overwrite the existing column.
NEW_COLUMN_SUFFIX = '_stringInverted'
# NEW_COLUMN_SUFFIX = "_stringInverted"
# This value has effect only if CREATE_NEW_COLUMN = True.
# The new column name will be set as column + NEW_COLUMN_SUFFIX. Then, if the original
# column was "column1" and the suffix is "_stringInverted", the new column will be named as
# "column1_stringInverted".
# Alternatively, input inside quotes a string with the desired suffix. Recommendation:
# start the suffix with "_" to separate it from the original name.


# The dataframe will be stored in the object named transf_dataset:
# Simply modify this object on the left of equality:
transf_dataset = invert_strings (df = DATASET, column_to_analyze = COLUMN_TO_ANALYZE, create_new_column = CREATE_NEW_COLUMN, new_column_suffix = NEW_COLUMN_SUFFIX)
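
Inverting the characters of every string in a column can be done with a slice of step -1 applied through the Pandas `.str` accessor. The sketch below only illustrates that idea on a made-up column; it is not the code of `invert_strings`.

```python
import pandas as pd

df = pd.DataFrame({"code": ["idsw", "abc", None]})

# The slice [::-1] reads each string from the last character to the first.
# Missing values are kept as missing.
df["code_stringInverted"] = df["code"].str[::-1]

print(df)
```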

### **Slicing the strings**

In [None]:
DATASET = dataset #Alternatively: object containing the dataset to be analyzed

COLUMN_TO_ANALYZE = 'column_to_analyze'
# COLUMN_TO_ANALYZE: string (inside quotes), 
# containing the name of the column that will be analyzed. 
# e.g. COLUMN_TO_ANALYZE = "column1" will analyze the column named as 'column1'.

FIRST_CHARACTER_INDEX = None
# FIRST_CHARACTER_INDEX = None - integer representing the index of the first character to be
# included in the new strings. If None, slicing will start from first character.
# String indexing always starts from 0. The last index can be written as -1, the index of
# the character before it as -2, and so on (reverse indexing starts from -1).
# example: consider the string "idsw", which contains 4 characters. We can represent the indices as:
# 'i': index 0; 'd': 1, 's': 2, 'w': 3. Alternatively: 'w': -1, 's': -2, 'd': -3, 'i': -4.

LAST_CHARACTER_INDEX = None
# LAST_CHARACTER_INDEX = None - integer representing the index of the last character to be
# included in the new strings. If None, slicing will go until the last character.
# Attention: this is effectively the last character to be added, and not the next index after last
# character.
        
# in the 'idsw' example, if we want a string as 'ds', we want the FIRST_CHARACTER_INDEX = 1 and
# LAST_CHARACTER_INDEX = 2.

STEP = 1
# STEP = 1 - integer representing the slicing step. If STEP = 1, all characters will be added.
# If STEP = 2, the slicing picks the character of index i and then the character of index (i+2),
# skipping one index each time, and so on.
# If STEP is negative, the order of the characters in the new strings will be inverted.
# Example: STEP = -1, and the start and finish indices are None: the output will be the inverted
# string, 'wsdi'.
# FIRST_CHARACTER_INDEX = 1, LAST_CHARACTER_INDEX = 2, STEP = 1: output = 'ds';
# FIRST_CHARACTER_INDEX = None, LAST_CHARACTER_INDEX = None, STEP = 2: output = 'is';
# FIRST_CHARACTER_INDEX = None, LAST_CHARACTER_INDEX = None, STEP = 3: output = 'iw';
# FIRST_CHARACTER_INDEX = -1, LAST_CHARACTER_INDEX = -2, STEP = -1: output = 'ws';
# FIRST_CHARACTER_INDEX = -1, LAST_CHARACTER_INDEX = None, STEP = -2: output = 'wd';
# FIRST_CHARACTER_INDEX = -1, LAST_CHARACTER_INDEX = None, STEP = 1: output = 'w'
# In this last example, the function tries to access the next element after the character of index
# -1. Since -1 is the last character, there are no other characters to be added.
# FIRST_CHARACTER_INDEX = -2, LAST_CHARACTER_INDEX = -1, STEP = 1: output = 'sw'.

CREATE_NEW_COLUMN = True
# CREATE_NEW_COLUMN = True
# Set CREATE_NEW_COLUMN = True to store the transformed data in a new column;
# set CREATE_NEW_COLUMN = False to overwrite the existing column.
NEW_COLUMN_SUFFIX = '_slicedString'
# NEW_COLUMN_SUFFIX = "_slicedString"
# This value has effect only if CREATE_NEW_COLUMN = True.
# The new column name will be set as column + NEW_COLUMN_SUFFIX. Then, if the original
# column was "column1" and the suffix is "_slicedString", the new column will be named as
# "column1_slicedString".
# Alternatively, input inside quotes a string with the desired suffix. Recommendation:
# start the suffix with "_" to separate it from the original name.


# The dataframe will be stored in the object named transf_dataset:
# Simply modify this object on the left of equality:
transf_dataset = slice_strings (df = DATASET, column_to_analyze = COLUMN_TO_ANALYZE, first_character_index = FIRST_CHARACTER_INDEX, last_character_index = LAST_CHARACTER_INDEX, step = STEP, create_new_column = CREATE_NEW_COLUMN, new_column_suffix = NEW_COLUMN_SUFFIX)
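
The cell above treats LAST_CHARACTER_INDEX as the last character to keep, whereas Python's own slices exclude the stop index. The sketch below shows one way to translate between the two conventions. The helper name slice_inclusive is hypothetical and this is not the code of slice_strings; it only reproduces the examples listed in the comments for the string 'idsw'.

```python
def slice_inclusive(text, first=None, last=None, step=1):
    """Slice `text`, treating `last` as the index of the final character to keep."""
    if last is None:
        stop = None
    else:
        # Python's stop index is exclusive, so move one position past `last`
        # in the direction of travel.
        stop = last + 1 if step > 0 else last - 1
        # At the string boundaries the shifted index would wrap around
        # (e.g. -1 + 1 == 0), so slice until the end of the string instead.
        if (step > 0 and stop == 0) or (step < 0 and stop == -1):
            stop = None
    return text[first:stop:step]


examples = {
    "ds": slice_inclusive("idsw", 1, 2, 1),
    "is": slice_inclusive("idsw", None, None, 2),
    "wsdi": slice_inclusive("idsw", None, None, -1),
    "ws": slice_inclusive("idsw", -1, -2, -1),
    "sw": slice_inclusive("idsw", -2, -1, 1),
}
print(examples)
```

On a pandas column, the same helper could be applied row by row with `Series.map`.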

### **Getting the leftest characters from the strings (retrieve last characters)**

In [None]:
DATASET = dataset #Alternatively: object containing the dataset to be analyzed

COLUMN_TO_ANALYZE = 'column_to_analyze'
# COLUMN_TO_ANALYZE: string (inside quotes), 
# containing the name of the column that will be analyzed. 
# e.g. COLUMN_TO_ANALYZE = "column1" will analyze the column named as 'column1'.

NUMBER_OF_CHARACTERS_TO_RETRIEVE = 1
# NUMBER_OF_CHARACTERS_TO_RETRIEVE = 1 - integer representing the total number of characters that
# will be retrieved. This function retrieves the last (trailing) characters of each string, which
# it calls the 'leftest' characters. If NUMBER_OF_CHARACTERS_TO_RETRIEVE = 1, only the last
# character will be retrieved.
# Consider the string 'idsw'.
# NUMBER_OF_CHARACTERS_TO_RETRIEVE = 1 - output: 'w';
# NUMBER_OF_CHARACTERS_TO_RETRIEVE = 2 - output: 'sw'.

NEW_VARIABLE_TYPE = None
# NEW_VARIABLE_TYPE = None. String (in quotes) that represents a given data type for the column
# after transformation. Set:
# - NEW_VARIABLE_TYPE = 'int' to convert the column to integer type after the transform;
# - NEW_VARIABLE_TYPE = 'float' to convert the column to float (decimal number);
# - NEW_VARIABLE_TYPE = 'datetime' to convert it to date or timestamp;
# - NEW_VARIABLE_TYPE = 'category' to convert it to Pandas categorical variable.
# So, if the last part of the strings is a number, you can use this argument to directly extract
# this part as numeric variable.

CREATE_NEW_COLUMN = True
# CREATE_NEW_COLUMN = True
# Set CREATE_NEW_COLUMN = True to store the transformed data in a new column;
# set CREATE_NEW_COLUMN = False to overwrite the existing column.
NEW_COLUMN_SUFFIX = '_leftChars'
# NEW_COLUMN_SUFFIX = "_leftChars"
# This value has effect only if CREATE_NEW_COLUMN = True.
# The new column name will be set as column + NEW_COLUMN_SUFFIX. Then, if the original
# column was "column1" and the suffix is "_leftChars", the new column will be named as
# "column1_leftChars".
# Alternatively, input inside quotes a string with the desired suffix. Recommendation:
# start the suffix with "_" to separate it from the original name.
    

# The dataframe will be stored in the object named transf_dataset:
# Simply modify this object on the left of equality:
transf_dataset = left_characters (df = DATASET, column_to_analyze = COLUMN_TO_ANALYZE, number_of_characters_to_retrieve = NUMBER_OF_CHARACTERS_TO_RETRIEVE, new_variable_type = NEW_VARIABLE_TYPE, create_new_column = CREATE_NEW_COLUMN, new_column_suffix = NEW_COLUMN_SUFFIX)

### **Getting the rightest characters from the strings (retrieve first characters)**

In [None]:
DATASET = dataset #Alternatively: object containing the dataset to be analyzed

COLUMN_TO_ANALYZE = 'column_to_analyze'
# COLUMN_TO_ANALYZE: string (inside quotes), 
# containing the name of the column that will be analyzed. 
# e.g. COLUMN_TO_ANALYZE = "column1" will analyze the column named as 'column1'.

NUMBER_OF_CHARACTERS_TO_RETRIEVE = 1
# NUMBER_OF_CHARACTERS_TO_RETRIEVE = 1 - integer representing the total number of characters that
# will be retrieved. This function retrieves the first (leading) characters of each string, which
# it calls the 'rightest' characters. If NUMBER_OF_CHARACTERS_TO_RETRIEVE = 1, only the first
# character will be retrieved.
# Consider the string 'idsw'.
# NUMBER_OF_CHARACTERS_TO_RETRIEVE = 1 - output: 'i';
# NUMBER_OF_CHARACTERS_TO_RETRIEVE = 2 - output: 'id'.

NEW_VARIABLE_TYPE = None
# NEW_VARIABLE_TYPE = None. String (in quotes) that represents a given data type for the column
# after transformation. Set:
# - NEW_VARIABLE_TYPE = 'int' to convert the column to integer type after the transform;
# - NEW_VARIABLE_TYPE = 'float' to convert the column to float (decimal number);
# - NEW_VARIABLE_TYPE = 'datetime' to convert it to date or timestamp;
# - NEW_VARIABLE_TYPE = 'category' to convert it to Pandas categorical variable.
# So, if the first part of the strings is a number, you can use this argument to directly extract
# this part as numeric variable.

CREATE_NEW_COLUMN = True
# CREATE_NEW_COLUMN = True
# Set CREATE_NEW_COLUMN = True to store the transformed data in a new column;
# set CREATE_NEW_COLUMN = False to overwrite the existing column.
NEW_COLUMN_SUFFIX = '_rightChars'
# NEW_COLUMN_SUFFIX = "_rightChars"
# This value has effect only if CREATE_NEW_COLUMN = True.
# The new column name will be set as column + NEW_COLUMN_SUFFIX. Then, if the original
# column was "column1" and the suffix is "_rightChars", the new column will be named as
# "column1_rightChars".
# Alternatively, input inside quotes a string with the desired suffix. Recommendation:
# start the suffix with "_" to separate it from the original name.
    

# The dataframe will be stored in the object named transf_dataset:
# Simply modify this object on the left of equality:
transf_dataset = right_characters (df = DATASET, column_to_analyze = COLUMN_TO_ANALYZE, number_of_characters_to_retrieve = NUMBER_OF_CHARACTERS_TO_RETRIEVE, new_variable_type = NEW_VARIABLE_TYPE, create_new_column = CREATE_NEW_COLUMN, new_column_suffix = NEW_COLUMN_SUFFIX)
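
The two retrievals described in this cell and the previous one, including the optional type conversion, can be sketched with pandas string slicing. The data and column names below are hypothetical, and the code is an approximation of the documented behavior rather than the code of left_characters or right_characters.

```python
import pandas as pd

# Hypothetical toy data: a two-letter code followed by a four-digit year.
df = pd.DataFrame({"column1": ["AB2021", "CD2022", "EF2023"]})

# Last 4 characters of each string (what the left_characters cell describes),
# converted to numbers, similar to requesting NEW_VARIABLE_TYPE = 'int'.
df["column1_leftChars"] = pd.to_numeric(df["column1"].str[-4:])

# First 2 characters of each string (what the right_characters cell describes),
# stored as a pandas categorical, similar to NEW_VARIABLE_TYPE = 'category'.
df["column1_rightChars"] = df["column1"].str[:2].astype("category")

print(df.dtypes)
```

Extracting the numeric tail directly as a number avoids a second conversion step later in the workflow.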

### **Joining strings from a same column into a single string**

In [None]:
DATASET = dataset #Alternatively: object containing the dataset to be analyzed

COLUMN_TO_ANALYZE = 'column_to_analyze'
# COLUMN_TO_ANALYZE: string (inside quotes), 
# containing the name of the column that will be analyzed. 
# e.g. COLUMN_TO_ANALYZE = "column1" will analyze the column named as 'column1'.

SEPARATOR = " "
# SEPARATOR = " " - string containing the separator. Suppose the column contains the
# strings: 'a', 'b', 'c', 'd'. If the SEPARATOR is the empty string '', the output will be:
# 'abcd' (no separation). If SEPARATOR = " " (simple whitespace), the output will be 'a b c d'


# The returned string is stored as concat_string:
# Simply modify this variable name on the left of equality:
concat_string = join_strings_from_column (df = DATASET, column_to_analyze = COLUMN_TO_ANALYZE, separator = SEPARATOR)
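
As a rough illustration of the separator behavior described above, the sketch below joins all values of one column into a single string. The helper join_column and the toy data are hypothetical; this is not the code of join_strings_from_column.

```python
import pandas as pd

# Hypothetical toy data matching the 'a', 'b', 'c', 'd' example.
df = pd.DataFrame({"column1": ["a", "b", "c", "d"]})


def join_column(series, separator=" "):
    # Cast to str first so that numbers or dates do not break str.join.
    return separator.join(series.astype(str))


concat_string = join_column(df["column1"], separator=" ")
no_separation = join_column(df["column1"], separator="")

print(concat_string)
print(no_separation)
```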

### **Joining several string columns into a single string column**

In [None]:
DATASET = dataset #Alternatively: object containing the dataset to be analyzed

LIST_OF_COLUMNS_TO_JOIN = ['column1', 'column2']
# LIST_OF_COLUMNS_TO_JOIN: list of strings (inside quotes), 
# containing the name of the columns with strings to be joined.
# Attention: the strings will be joined row by row, i.e. only strings in the same row will
# be concatenated. To join strings from the same column, use the function join_strings_from_column.
# e.g. LIST_OF_COLUMNS_TO_JOIN = ["column1", "column2"] will join strings from "column1" with
# the corresponding strings from "column2".
# Notice that you can concatenate any kind of column (numeric, dates, texts, ...), but the output
# will be a string column.

SEPARATOR = " "
# SEPARATOR = " " - string containing the separator placed between the values joined on each row.
# Suppose a row contains 'a' in "column1" and 'b' in "column2". If the SEPARATOR is the empty
# string '', the output for that row will be 'ab' (no separation). If SEPARATOR = " " (simple
# whitespace), the output will be 'a b'.

NEW_COLUMN_SUFFIX = '_stringConcat'
# NEW_COLUMN_SUFFIX = "_stringConcat"
# Note that the join_string_columns call below takes no CREATE_NEW_COLUMN argument.
# The new column name will be set as column + NEW_COLUMN_SUFFIX. Then, if the original
# column was "column1" and the suffix is "_stringConcat", the new column will be named as
# "column1_stringConcat".
# Alternatively, input inside quotes a string with the desired suffix. Recommendation:
# start the suffix with "_" to separate it from the original name.


# The dataframe will be stored in the object named transf_dataset:
# Simply modify this object on the left of equality:
transf_dataset = join_string_columns (df = DATASET, list_of_columns_to_join = LIST_OF_COLUMNS_TO_JOIN, separator = SEPARATOR, new_column_suffix = NEW_COLUMN_SUFFIX)
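
The row-by-row concatenation described above can be approximated in pandas as follows. The data, the column names and the output column name 'joined' are hypothetical, and the sketch does not claim to reproduce how join_string_columns names its output.

```python
import pandas as pd

# Hypothetical toy data mixing text and numeric columns.
df = pd.DataFrame({
    "column1": ["a", "b"],
    "column2": [1, 2],
    "column3": ["x", "y"],
})

columns_to_join = ["column1", "column2", "column3"]
separator = " "

# Cast every column to str, then join the values found on each row.
df["joined"] = df[columns_to_join].astype(str).agg(separator.join, axis=1)

print(df["joined"].tolist())
```

Casting to str first is what allows numeric or date columns to take part in the join, at the cost of the output always being a string column.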

### **Splitting strings into a list of strings**

In [None]:
DATASET = dataset #Alternatively: object containing the dataset to be analyzed

COLUMN_TO_ANALYZE = 'column_to_analyze'
# COLUMN_TO_ANALYZE: string (inside quotes), 
# containing the name of the column that will be analyzed. 
# e.g. COLUMN_TO_ANALYZE = "column1" will analyze the column named as 'column1'.

SEPARATOR = " "
# SEPARATOR = " " - string containing the separator. Suppose the column contains the
# string: 'a b c d' on a given row. If the SEPARATOR is whitespace ' ', 
# the output will be a list: ['a', 'b', 'c', 'd']: the function splits the string into a list
# of strings (one list per row) every time it finds the separator.

CREATE_NEW_COLUMN = True
# CREATE_NEW_COLUMN = True
# Set CREATE_NEW_COLUMN = True to store the transformed data in a new column;
# set CREATE_NEW_COLUMN = False to overwrite the existing column.
NEW_COLUMN_SUFFIX = '_stringSplitted'
# NEW_COLUMN_SUFFIX = "_stringSplitted"
# This value has effect only if CREATE_NEW_COLUMN = True.
# The new column name will be set as column + NEW_COLUMN_SUFFIX. Then, if the original
# column was "column1" and the suffix is "_stringSplitted", the new column will be named as
# "column1_stringSplitted".
# Alternatively, input inside quotes a string with the desired suffix. Recommendation:
# start the suffix with "_" to separate it from the original name.


# The dataframe will be stored in the object named transf_dataset:
# Simply modify this object on the left of equality:
transf_dataset = split_strings (df = DATASET, column_to_analyze = COLUMN_TO_ANALYZE, separator = SEPARATOR, create_new_column = CREATE_NEW_COLUMN, new_column_suffix = NEW_COLUMN_SUFFIX)
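
For comparison, pandas offers the same kind of split through its string accessor. The sketch below uses hypothetical toy data and is not the code of split_strings; it only shows the one-list-per-row output described above.

```python
import pandas as pd

# Hypothetical toy data; each row holds whitespace-separated tokens.
df = pd.DataFrame({"column1": ["a b c d", "e f"]})

# Each string becomes a list of substrings, split wherever the separator occurs.
df["column1_stringSplitted"] = df["column1"].str.split(" ")

print(df["column1_stringSplitted"].tolist())
```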

### **Substituting (replacing or switching) whole strings by different text values (on string variables)**

In [None]:
DATASET = dataset #Alternatively: object containing the dataset to be analyzed

COLUMN_TO_ANALYZE = 'column_to_analyze'
# COLUMN_TO_ANALYZE: string (inside quotes), 
# containing the name of the column that will be analyzed. 
# e.g. COLUMN_TO_ANALYZE = "column1" will analyze the column named as 'column1'.

LIST_OF_DICTIONARIES_WITH_ORIGINAL_STRINGS_AND_REPLACEMENTS = [
    
    {'original_string': None, 'new_string': None}, 
    {'original_string': None, 'new_string': None}, 
    {'original_string': None, 'new_string': None}, 
    {'original_string': None, 'new_string': None}, 
    {'original_string': None, 'new_string': None}, 
    {'original_string': None, 'new_string': None}, 
    {'original_string': None, 'new_string': None}, 
    {'original_string': None, 'new_string': None}, 
    {'original_string': None, 'new_string': None}, 
    {'original_string': None, 'new_string': None}, 
    {'original_string': None, 'new_string': None}
    
]
# LIST_OF_DICTIONARIES_WITH_ORIGINAL_STRINGS_AND_REPLACEMENTS = 
# [{'original_string': None, 'new_string': None}]
# This is a list of dictionaries, where each dictionary contains two key-value pairs:
# the first one contains the original string; and the second one contains the new string
# that will substitute the original one. The function will loop through all dictionaries in
# this list, access the values of the keys 'original_string', and search for these values among the
# strings in COLUMN_TO_ANALYZE. When a value is found, it will be replaced (switched) by the
# corresponding value of the key 'new_string'.
    
# The object LIST_OF_DICTIONARIES_WITH_ORIGINAL_STRINGS_AND_REPLACEMENTS must be declared as a list, 
# in brackets, even if there is a single dictionary.
# Always use the same keys: 'original_string' for the original strings to search for in the column
# column_to_analyze; and 'new_string' for the strings that will replace the original ones.
# Notice that this function will not search for substrings: it will substitute a value only when
# there is an exact match between the string in 'column_to_analyze' and 'original_string'.
# So, the cases (upper or lower) must be the same.
    
# If you want, you can remove elements (dictionaries) from the list to declare fewer elements;
# and you can also add more elements (dictionaries) to the lists, if you need to replace more
# values.
# Simply put a comma after the last element from the list and declare a new dictionary, keeping the
# same keys: {'original_string': original_str, 'new_string': new_str}, 
# where original_str and new_str represent the strings for searching and replacement 
# (If one of the keys contains None, the new dictionary will be ignored).
    
# Example:
# Suppose the COLUMN_TO_ANALYZE contains the values 'sunday', 'monday', 'tuesday', 'wednesday',
# 'thursday', 'friday', 'saturday', but you want to obtain data labelled as 'weekend' or 'weekday'.
# Set: LIST_OF_DICTIONARIES_WITH_ORIGINAL_STRINGS_AND_REPLACEMENTS = 
# [{'original_string': 'sunday', 'new_string': 'weekend'},
# {'original_string': 'saturday', 'new_string': 'weekend'},
# {'original_string': 'monday', 'new_string': 'weekday'},
# {'original_string': 'tuesday', 'new_string': 'weekday'},
# {'original_string': 'wednesday', 'new_string': 'weekday'},
# {'original_string': 'thursday', 'new_string': 'weekday'},
# {'original_string': 'friday', 'new_string': 'weekday'}]

CREATE_NEW_COLUMN = True
# CREATE_NEW_COLUMN = True
# Set CREATE_NEW_COLUMN = True to store the transformed data in a new column;
# set CREATE_NEW_COLUMN = False to overwrite the existing column.
NEW_COLUMN_SUFFIX = '_stringReplaced'
# NEW_COLUMN_SUFFIX = "_stringReplaced"
# This value has effect only if CREATE_NEW_COLUMN = True.
# The new column name will be set as column + NEW_COLUMN_SUFFIX. Then, if the original
# column was "column1" and the suffix is "_stringReplaced", the new column will be named as
# "column1_stringReplaced".
# Alternatively, input inside quotes a string with the desired suffix. Recommendation:
# start the suffix with "_" to separate it from the original name.


# The dataframe will be stored in the object named transf_dataset:
# Simply modify this object on the left of equality:
transf_dataset = switch_strings (df = DATASET, column_to_analyze = COLUMN_TO_ANALYZE, list_of_dictionaries_with_original_strings_and_replacements = LIST_OF_DICTIONARIES_WITH_ORIGINAL_STRINGS_AND_REPLACEMENTS, create_new_column = CREATE_NEW_COLUMN, new_column_suffix = NEW_COLUMN_SUFFIX)
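
The exact-match replacement described above, including the rule that dictionaries containing None are ignored, can be sketched with `Series.replace`. The data and the column name 'day' are hypothetical, and this is an approximation of the documented behavior rather than the code of switch_strings.

```python
import pandas as pd

# Hypothetical toy data; note the capitalized 'Friday'.
df = pd.DataFrame({"day": ["sunday", "monday", "Friday", "saturday"]})

replacements = [
    {"original_string": "sunday", "new_string": "weekend"},
    {"original_string": "saturday", "new_string": "weekend"},
    {"original_string": "monday", "new_string": "weekday"},
    {"original_string": "friday", "new_string": "weekday"},
    {"original_string": None, "new_string": None},  # skipped: contains None
]

# Build a plain {original: new} mapping, dropping incomplete dictionaries.
mapping = {
    d["original_string"]: d["new_string"]
    for d in replacements
    if d["original_string"] is not None and d["new_string"] is not None
}

# Series.replace with a dict swaps whole values only; substrings are untouched.
df["day_stringReplaced"] = df["day"].replace(mapping)

print(df)
```

'Friday' is left unchanged because its case differs from 'friday', which is the behavior the comments warn about; homogenizing the case first avoids this.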

### **Replacing strings with Machine Learning: finding similar strings and replacing them by standard strings**

In [None]:
DATASET = dataset #Alternatively: object containing the dataset to be analyzed

COLUMN_TO_ANALYZE = 'column_to_analyze'
# COLUMN_TO_ANALYZE: string (inside quotes), 
# containing the name of the column that will be analyzed. 
# e.g. COLUMN_TO_ANALYZE = "column1" will analyze the column named as 'column1'.

MODE = 'find_and_replace'
# MODE = 'find_and_replace' will find similar strings and switch them by one of the
# standard strings if the similarity between them is greater than or equal to the threshold.
# Alternatively: MODE = 'find' will only find the similar strings by calculating the similarity.

THRESHOLD_FOR_PERCENT_OF_SIMILARITY = 80.0
# THRESHOLD_FOR_PERCENT_OF_SIMILARITY = 80.0 - 0.0% means no similarity and 100.0% means equal strings.
# The THRESHOLD_FOR_PERCENT_OF_SIMILARITY is the minimum accepted similarity, calculated from the
# Levenshtein (minimum edit) distance algorithm. This distance represents the minimum number of
# character insertions, substitutions or deletions needed to make two
# strings equal.

LIST_OF_DICTIONARIES_WITH_STANDARD_STRINGS_FOR_REPLACEMENT = [
    
    {'standard_string': None},
    {'standard_string': None}, 
    {'standard_string': None},
    {'standard_string': None}, 
    {'standard_string': None}, 
    {'standard_string': None},
    {'standard_string': None}, 
    {'standard_string': None},
    {'standard_string': None}, 
    {'standard_string': None}, 
    {'standard_string': None}
    
]
# This is a list of dictionaries, where each dictionary contains a single key-value pair:
# the key must always be 'standard_string', and the value will be one of the standard strings
# for replacement: if a given string in COLUMN_TO_ANALYZE has a similarity with one
# of the standard strings greater than or equal to THRESHOLD_FOR_PERCENT_OF_SIMILARITY, it will be
# substituted by this standard string.
# For instance, suppose you have a word written in too many ways, making it difficult to use
# the function switch_strings: "EU", "eur", "Europ", "Europa", "Erope", "Evropa", ...
# You can use this function to search for strings similar to "Europe" and replace them.
    
# The function will loop through all dictionaries in this list, access the values of the keys
# 'standard_string', and compare these values with the strings in COLUMN_TO_ANALYZE. When a string
# is sufficiently similar to a standard string, it will be replaced (switched) by it.
    
# The object LIST_OF_DICTIONARIES_WITH_STANDARD_STRINGS_FOR_REPLACEMENT must be declared as a list, 
# in brackets, even if there is a single dictionary.
# Always use the same key: 'standard_string'.
# Notice that this function performs fuzzy matching, so it MAY MATCH substrings and strings
# written with different cases (upper or lower) when these portions or modifications make the
# strings sufficiently similar to each other.
    
# If you want, you can remove elements (dictionaries) from the list to declare fewer elements;
# and you can also add more elements (dictionaries) to the lists, if you need to replace more
# values.
# Simply put a comma after the last element from the list and declare a new dictionary, keeping the
# same key: {'standard_string': other_std_str}, 
# where other_std_str represents the string for searching and replacement 
# (If the key contains None, the new dictionary will be ignored).
    
# Example:
# Suppose the COLUMN_TO_ANALYZE contains the values 'California', 'Cali', 'Calefornia', 
# 'Calefornie', 'Californie', 'Calfornia', 'Calefernia', 'New York', 'New York City', 
# but you want to obtain data labelled as the state 'California' or 'New York'.
# Set: LIST_OF_DICTIONARIES_WITH_STANDARD_STRINGS_FOR_REPLACEMENT = 
# [{'standard_string': 'California'},
# {'standard_string': 'New York'}]
    
# ATTENTION: It is advisable to first run the function with MODE = 'find' and inspect the
# calculated similarities to choose the best threshold; set it as high as possible, avoiding incorrect
# substitutions in a gray area; and only then perform the replacement. This avoids keeping the
# original incorrect strings in the output dataset, as well as wrong replacements (replacement by
# a standard string which is not the correct one).

CREATE_NEW_COLUMN = True
# CREATE_NEW_COLUMN = True
# Set CREATE_NEW_COLUMN = True to store the transformed data in a new column;
# set CREATE_NEW_COLUMN = False to overwrite the existing column.
NEW_COLUMN_SUFFIX = '_stringReplaced'
# NEW_COLUMN_SUFFIX = "_stringReplaced"
# This value has effect only if CREATE_NEW_COLUMN = True.
# The new column name will be set as column + NEW_COLUMN_SUFFIX. Then, if the original
# column was "column1" and the suffix is "_stringReplaced", the new column will be named as
# "column1_stringReplaced".
# Alternatively, input inside quotes a string with the desired suffix. Recommendation:
# start the suffix with "_" to separate it from the original name.


# The dataframe will be stored in the object named transf_dataset.
# The summary list is saved as summary_list.
# Simply modify these objects on the left of equality:
transf_dataset, summary_list = string_replacement_ml (df = DATASET, column_to_analyze = COLUMN_TO_ANALYZE, mode = MODE, threshold_for_percent_of_similarity = THRESHOLD_FOR_PERCENT_OF_SIMILARITY, list_of_dictionaries_with_standard_strings_for_replacement = LIST_OF_DICTIONARIES_WITH_STANDARD_STRINGS_FOR_REPLACEMENT, create_new_column = CREATE_NEW_COLUMN, new_column_suffix = NEW_COLUMN_SUFFIX)
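
To make the threshold concrete, the sketch below computes a Levenshtein distance in pure Python and turns it into a percent similarity. The normalization used here, 100 * (1 - distance / length of the longer string), is an assumption chosen for illustration; string_replacement_ml may compute its similarity differently, and the helper names are hypothetical.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions or substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def percent_similarity(a, b):
    # Assumed normalization: 100 means equal strings, 0 means nothing in common.
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)


def replace_similar(values, standard_strings, threshold=80.0):
    """Replace each value by its closest standard string if similar enough."""
    result = []
    for value in values:
        best = max(standard_strings, key=lambda s: percent_similarity(value, s))
        if percent_similarity(value, best) >= threshold:
            result.append(best)
        else:
            result.append(value)
    return result


states = ["California", "Calefornia", "Calfornia", "Cali", "New York"]
cleaned = replace_similar(states, ["California", "New York"], threshold=80.0)
print(cleaned)
```

Under this normalization 'Calefornia' and 'Calfornia' are one edit away from 'California' (90% similar) and get replaced, while 'Cali' (40% similar) stays as it is. This is the kind of gray-area case that inspecting the similarities first helps to catch.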

### **Searching for Regular Expression (RegEx) within a string column**

In [None]:
DATASET = dataset #Alternatively: object containing the dataset to be analyzed

COLUMN_TO_ANALYZE = 'column_to_analyze'
# COLUMN_TO_ANALYZE: string (inside quotes), 
# containing the name of the column that will be analyzed. 
# e.g. COLUMN_TO_ANALYZE = "column1" will analyze the column named as 'column1'.

REGEX_TO_SEARCH = r""
# REGEX_TO_SEARCH = r"" - string containing the regular expression (regex) that will be searched
# within each string from the column. Declare it with the r before quotes, indicating that the
# 'raw' string should be read. That is because regexes contain special characters, such as \,
# which should not be read as escape characters.
# example of regex: r'st\d\s\w{3,10}'
# Use the regex helper to check: basic theory and most common metacharacters; regex quantifiers;
# regex anchoring and finding; regex greedy and non-greedy search; regex grouping and capturing;
# regex alternating and non-capturing groups; regex backreferences; and regex lookaround.

## ATTENTION: This function returns ONLY the capturing groups from the regex, i.e., portions of the
# regex explicitly marked with parentheses (check the regex helper for more details, including how
# to convert parentheses into non-capturing groups). If no groups are marked as capturing, the
# function will raise an error.

SHOW_REGEX_HELPER = False
# SHOW_REGEX_HELPER: set SHOW_REGEX_HELPER = True to show a helper guide to the construction of
# the regular expression. After finishing the helper, the original dataset itself will be returned
# and the function will not proceed. Use it if you do not know or are not sure how to write
# the regex.

CREATE_NEW_COLUMN = True
# CREATE_NEW_COLUMN = True
# Set CREATE_NEW_COLUMN = True to store the transformed data in a new column;
# set CREATE_NEW_COLUMN = False to overwrite the existing column.
NEW_COLUMN_SUFFIX = '_regex'
# NEW_COLUMN_SUFFIX = "_regex"
# This value has effect only if CREATE_NEW_COLUMN = True.
# The new column name will be set as column + NEW_COLUMN_SUFFIX. Then, if the original
# column was "column1" and the suffix is "_regex", the new column will be named as
# "column1_regex".
# Alternatively, input inside quotes a string with the desired suffix. Recommendation:
# start the suffix with "_" to separate it from the original name.


# The dataframe will be stored in the object named transf_dataset:
# Simply modify this object on the left of equality:
transf_dataset = regex_search (df = DATASET, column_to_analyze = COLUMN_TO_ANALYZE, regex_to_search = REGEX_TO_SEARCH, show_regex_helper = SHOW_REGEX_HELPER, create_new_column = CREATE_NEW_COLUMN, new_column_suffix = NEW_COLUMN_SUFFIX)
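
The requirement of a capturing group can be illustrated with pandas' own `Series.str.extract`, which also raises a ValueError when the pattern has no capture group. The sketch below uses hypothetical toy data and a variation of the example regex with parentheses added around the digit; it is not the code of regex_search.

```python
import pandas as pd

# Hypothetical toy data; the second row has no match.
df = pd.DataFrame({"column1": ["order st1 alpha", "no match here", "st7 bravo99"]})

# Parentheses mark the capturing group: only the digit after "st" is returned.
pattern = r"st(\d)\s\w{3,10}"

# With a single group and expand=False, the result is a Series.
df["column1_regex"] = df["column1"].str.extract(pattern, expand=False)

print(df)
```

Rows where the pattern is not found receive a missing value instead of a string.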

### **Replacing a Regular Expression (RegEx) from a string column**

In [None]:
DATASET = dataset #Alternatively: object containing the dataset to be analyzed

COLUMN_TO_ANALYZE = 'column_to_analyze'
# COLUMN_TO_ANALYZE: string (inside quotes), 
# containing the name of the column that will be analyzed. 
# e.g. COLUMN_TO_ANALYZE = "column1" will analyze the column named as 'column1'.

REGEX_TO_SEARCH = r""
# REGEX_TO_SEARCH = r"" - string containing the regular expression (regex) that will be searched
# within each string from the column. Declare it with the r before quotes, indicating that the
# 'raw' string should be read. That is because regexes contain special characters, such as \,
# which should not be read as escape characters.
# example of regex: r'st\d\s\w{3,10}'
# Use the regex helper to check: basic theory and most common metacharacters; regex quantifiers;
# regex anchoring and finding; regex greedy and non-greedy search; regex grouping and capturing;
# regex alternating and non-capturing groups; regex backreferences; and regex lookaround.

STRING_FOR_REPLACEMENT = ""
# STRING_FOR_REPLACEMENT = "" - regular string that will replace the REGEX_TO_SEARCH: 
# whenever REGEX_TO_SEARCH is found in the string, it is replaced (substituted) by 
# STRING_FOR_REPLACEMENT. 
# Example STRING_FOR_REPLACEMENT = " " (whitespace).
# If STRING_FOR_REPLACEMENT = None, the empty string will be used for replacement.
        
## ATTENTION: This function processes a single regex per call.

SHOW_REGEX_HELPER = False
# SHOW_REGEX_HELPER: set SHOW_REGEX_HELPER = True to show a helper guide to the construction of
# the regular expression. After finishing the helper, the original dataset itself will be returned
# and the function will not proceed. Use it if you do not know or are not sure how to write
# the regex.

CREATE_NEW_COLUMN = True
# CREATE_NEW_COLUMN = True
# Set CREATE_NEW_COLUMN = True to store the transformed data in a new column;
# set CREATE_NEW_COLUMN = False to overwrite the existing column.
NEW_COLUMN_SUFFIX = '_regex'
# NEW_COLUMN_SUFFIX = "_regex"
# This value has effect only if CREATE_NEW_COLUMN = True.
# The new column name will be set as column + NEW_COLUMN_SUFFIX. Then, if the original
# column was "column1" and the suffix is "_regex", the new column will be named as
# "column1_regex".
# Alternatively, input inside quotes a string with the desired suffix. Recommendation:
# start the suffix with "_" to separate it from the original name.


# The dataframe will be stored in the object named transf_dataset:
# Simply modify this object on the left of equality:
transf_dataset = regex_replacement (df = DATASET, column_to_analyze = COLUMN_TO_ANALYZE, regex_to_search = REGEX_TO_SEARCH, string_for_replacement = STRING_FOR_REPLACEMENT, show_regex_helper = SHOW_REGEX_HELPER, create_new_column = CREATE_NEW_COLUMN, new_column_suffix = NEW_COLUMN_SUFFIX)
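
A regex replacement of this kind can be sketched with `Series.str.replace` and `regex=True`. The data below is hypothetical and the code is only an approximation of the documented behavior, not the code of regex_replacement.

```python
import pandas as pd

# Hypothetical toy data containing digits to be stripped.
df = pd.DataFrame({"column1": ["a1b22c", "no digits", "x  9"]})

regex_to_search = r"\d+"     # one or more consecutive digits
string_for_replacement = ""  # empty string: matches are simply removed

df["column1_regex"] = df["column1"].str.replace(
    regex_to_search, string_for_replacement, regex=True
)

print(df["column1_regex"].tolist())
```

Since only one regex is handled per call, several patterns would require chaining several calls, each one reading the output column of the previous one.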

### **Applying Fast Fourier Transform**

In [None]:
DATASET = dataset #Alternatively: object containing the dataset to be analyzed

COLUMN_TO_ANALYZE = 'column_to_analyze'
# COLUMN_TO_ANALYZE: string (inside quotes), 
# containing the name of the column that will be analyzed. 
# e.g. COLUMN_TO_ANALYZE = "column1" will analyze the column named as 'column1'.

AVERAGE_FREQUENCY_OF_DATA_COLLECTION = 'hour'
# AVERAGE_FREQUENCY_OF_DATA_COLLECTION = 'hour' or 'h' for hours; 'day' or 'd' for days;
# 'minute' or 'min' for minutes; 'seconds' or 's' for seconds; 'ms' for milliseconds; 'ns' for
# nanoseconds; 'year' or 'y' for years; 'month' or 'm' for months.


X_AXIS_ROTATION = 0
#Rotation of X axis labels. Alternatively, insert any numeric value from 0 to 90 (degrees).
Y_AXIS_ROTATION = 0
#Rotation of Y axis labels. Alternatively, insert any numeric value from 0 to 90 (degrees).
GRID = True #Alternatively: True or False
# If GRID = False, no grid lines are shown in the graphic.
HORIZONTAL_AXIS_TITLE = None #Alternatively: string inside quotes for horizontal title
VERTICAL_AXIS_TITLE = None #Alternatively: string inside quotes for vertical title
PLOT_TITLE = None #Alternatively: string inside quotes for graphic title
# e.g. HORIZONTAL_AXIS_TITLE = "X", VERTICAL_AXIS_TITLE = "Y", PLOT_TITLE = "YxX"

EXPORT_PNG = False
# Set EXPORT_PNG = False if you do not want to export the obtained image;
# Set EXPORT_PNG = True to export the obtained image.
DIRECTORY_TO_SAVE = None
# This parameter has effect only if EXPORT_PNG = True.
# (string, in quotes): input the path of the directory 
# (e.g. folder path) where the file will be stored. e.g. DIRECTORY_TO_SAVE = "" 
# or DIRECTORY_TO_SAVE = "folder"
# If EXPORT_PNG = True and DIRECTORY_TO_SAVE = None, the file will be saved in the root
# path, DIRECTORY_TO_SAVE = ""
FILE_NAME = None
# This parameter has effect only if EXPORT_PNG = True.
# (string, in quotes): input the name you want for the file without the 
# extension, which will be 'png'. e.g. FILE_NAME = "my_image" will save a file 'my_image.png' 
# If EXPORT_PNG = True and FILE_NAME = None, the file will be saved as:
# 'capability_plot.png'
# WARNING: if there is already a file in the path DIRECTORY_TO_SAVE saved as FILE_NAME,
# the file will be overwritten.
PNG_RESOLUTION_DPI = 330
# This parameter has effect only if EXPORT_PNG = True.
# Alternatively, input an integer that will correspond to the resolution of the exported
# image in dpi. If PNG_RESOLUTION_DPI = None, it will be set as 330.


# The results of the Fast Fourier Transform will be stored in the object named fft.
# Simply modify this object on the left of equality:
fft = fast_fourier_transform (df = DATASET, column_to_analyze = COLUMN_TO_ANALYZE, average_frequency_of_data_collection = AVERAGE_FREQUENCY_OF_DATA_COLLECTION, x_axis_rotation = X_AXIS_ROTATION, y_axis_rotation = Y_AXIS_ROTATION, grid = GRID, horizontal_axis_title = HORIZONTAL_AXIS_TITLE, vertical_axis_title = VERTICAL_AXIS_TITLE, plot_title = PLOT_TITLE, export_png = EXPORT_PNG, directory_to_save = DIRECTORY_TO_SAVE, file_name = FILE_NAME, png_resolution_dpi = PNG_RESOLUTION_DPI)
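
To show what the frequency spectrum of a regularly sampled column reveals, the sketch below applies NumPy's real FFT to a synthetic hourly signal. The signal is hypothetical and the code is not the implementation of fast_fourier_transform; it only illustrates how the collection frequency sets the units of the frequency axis.

```python
import numpy as np

# Hypothetical signal: one sample per hour over 10 days, with a 24-hour cycle.
hours = np.arange(240)
signal = np.sin(2 * np.pi * hours / 24.0)

# Real-input FFT and the matching frequency axis.
# d is the spacing between samples (1 hour), so frequencies are in cycles per hour.
spectrum = np.fft.rfft(signal)
freqs = np.fft.rfftfreq(signal.size, d=1.0)

# Ignore the zero-frequency (mean) term when looking for the dominant cycle.
peak = np.argmax(np.abs(spectrum[1:])) + 1
dominant_period_hours = 1.0 / freqs[peak]

print(dominant_period_hours)
```

A peak at one cycle every 24 hours is the kind of result that suggests adding a daily frequency to the list of important frequencies in the next step.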

### **Generating columns with frequency information**

In [None]:
DATASET = dataset #Alternatively: object containing the dataset to be analyzed

TIMESTAMP_TAG_COLUMN = "timestamp"
# TIMESTAMP_TAG_COLUMN = None. String (inside quotes) containing the name of the column with the
# timestamps. If TIMESTAMP_TAG_COLUMN is None, the index will be the time series reference.
# This is the column from which we will extract the timestamps or values with temporal
# information. e.g. TIMESTAMP_TAG_COLUMN = 'timestamp' will consider the column 'timestamp'
# a time column.

IMPORTANT_FREQUENCIES = [{'value': 1, 'unit': 'day'}, 
                         {'value':1, 'unit': 'year'}]

# IMPORTANT_FREQUENCIES = [{'value': 1, 'unit': 'day'}, {'value':1, 'unit': 'year'}]
# List of dictionaries with the important frequencies to add to the model. You can remove dictionaries
# or add extra ones. The dictionaries must always have the same keys, 'value' and 'unit'.
# If the important frequency is once a day, the value will be 1, and the unit will be 'day' or 'd'.
# The possible units are: 'ns', 'ms', 'second' or 's', 'minute' or 'min', 'day' or 'd', 'month' or 'm',
# 'year' or 'y'.


X_AXIS_ROTATION = 0
#Rotation of X axis labels. Alternatively, insert any numeric value from 0 to 90 (degrees).
Y_AXIS_ROTATION = 0
#Rotation of Y axis labels. Alternatively, insert any numeric value from 0 to 90 (degrees).
GRID = True #Alternatively: True or False
# If GRID = False, no grid lines are shown in the graphic.
HORIZONTAL_AXIS_TITLE = None #Alternatively: string inside quotes for horizontal title
VERTICAL_AXIS_TITLE = None #Alternatively: string inside quotes for vertical title
PLOT_TITLE = None #Alternatively: string inside quotes for graphic title
# e.g. HORIZONTAL_AXIS_TITLE = "X", VERTICAL_AXIS_TITLE = "Y", PLOT_TITLE = "YxX"

EXPORT_PNG = False
# Set EXPORT_PNG = False if you do not want to export the obtained image;
# Set EXPORT_PNG = True to export the obtained image.
DIRECTORY_TO_SAVE = None
# This parameter has effect only if EXPORT_PNG = True.
# (string, in quotes): input the path of the directory 
# (e.g. folder path) where the file will be stored. e.g. DIRECTORY_TO_SAVE = "" 
# or DIRECTORY_TO_SAVE = "folder"
# If EXPORT_PNG = True and DIRECTORY_TO_SAVE = None, the file will be saved in the root
# path, DIRECTORY_TO_SAVE = ""
FILE_NAME = None
# This parameter has effect only if EXPORT_PNG = True.
# (string, in quotes): input the name you want for the file without the 
# extension, which will be 'png'. e.g. FILE_NAME = "my_image" will save a file 'my_image.png' 
# If EXPORT_PNG = True and FILE_NAME = None, the file will be saved as:
# 'capability_plot.png'
# WARNING: if there is already a file in the path DIRECTORY_TO_SAVE saved as FILE_NAME,
# the file will be overwritten.
PNG_RESOLUTION_DPI = 330
# This parameter has effect only if EXPORT_PNG = True.
# Alternatively, input an integer that will correspond to the resolution of the exported
# image in dpi. If PNG_RESOLUTION_DPI = None, it will be set as 330.


# The dataset with new columns containing the frequency information will be stored as dataset.
# Simply modify this object on the left of equality:
dataset = get_frequency_features (df = DATASET, timestamp_tag_column = TIMESTAMP_TAG_COLUMN, important_frequencies = IMPORTANT_FREQUENCIES, x_axis_rotation = X_AXIS_ROTATION, y_axis_rotation = Y_AXIS_ROTATION, grid = GRID, horizontal_axis_title = HORIZONTAL_AXIS_TITLE, vertical_axis_title = VERTICAL_AXIS_TITLE, plot_title = PLOT_TITLE, export_png = EXPORT_PNG, directory_to_save = DIRECTORY_TO_SAVE, file_name = FILE_NAME, png_resolution_dpi = PNG_RESOLUTION_DPI)
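
One common way of encoding a known periodicity is to map each timestamp to sine and cosine waves with that period. The sketch below does this for a frequency of once a day using only pandas and NumPy. It illustrates the general technique and is not a description of how `get_frequency_features` is implemented; the dataframe and the column names `day_sin` and `day_cos` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly time series covering two days (assumption for illustration).
df = pd.DataFrame(
    {"timestamp": pd.date_range("2023-01-01", periods=48, freq=pd.Timedelta(hours=1))}
)

# Seconds elapsed since the Unix epoch for each timestamp.
seconds = (df["timestamp"] - pd.Timestamp("1970-01-01")).dt.total_seconds()

# Frequency of once a day: the period is one day, expressed in seconds.
day_in_seconds = 24 * 60 * 60
df["day_sin"] = np.sin(2 * np.pi * seconds / day_in_seconds)
df["day_cos"] = np.cos(2 * np.pi * seconds / day_in_seconds)

print(df.head())
```

Using both sine and cosine gives every moment of the day a unique pair of values, and moments separated by exactly one period receive the same pair.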

### **log-transforming the variables**

In [None]:
#### WARNING: This function will eliminate rows where the selected variables present 
#### values lower than or equal to zero (the logarithm is defined only for positive values).

DATASET = dataset #Alternatively: object containing the dataset to be analyzed

SUBSET = None
# Set SUBSET = None to transform the whole dataset. Alternatively, pass a list with 
# columns names for the transformation to be applied. For instance:
# SUBSET = ['col1', 'col2', 'col3'] will apply the transformation to the columns named
# as 'col1', 'col2', and 'col3'. Declare the names inside quotes.
# Declaring the full list of columns is equivalent to setting SUBSET = None.

CREATE_NEW_COLUMNS = True
# Alternatively, set CREATE_NEW_COLUMNS = True to store the transformed data into new
# columns. Or set CREATE_NEW_COLUMNS = False to overwrite the existing columns
    
NEW_COLUMNS_SUFFIX = "_log"
# This value has effect only if CREATE_NEW_COLUMNS = True.
# The new column name will be set as column + NEW_COLUMNS_SUFFIX. Then, if the original
# column was "column1" and the suffix is "_log", the new column will be named as
# "column1_log".
# Alternatively, input inside quotes a string with the desired suffix. Recommendation:
# start the suffix with "_" to separate it from the original name.

# New dataframe saved as log_transf_df.
# Simply modify this object on the left of equality:
log_transf_df = log_transform (df = DATASET, subset = SUBSET, create_new_columns = CREATE_NEW_COLUMNS, new_columns_suffix = NEW_COLUMNS_SUFFIX)

# One curve derived from the normal is the log-normal.
# If the values Y follow a log-normal distribution, their logarithms follow a normal distribution.
# A log-normal curve resembles a normal curve, but with skewness (asymmetry) 
# and excess kurtosis (a long tail).

# Applying the log is a way of normalizing the variables: 
# the range of values shrinks after the transformation, making the data more 
# suitable for processing by Machine Learning algorithms. Preferably apply 
# the transformation to the whole dataset, so that all variables have the same order 
# of magnitude.
# The transformation is usually unnecessary for variables that already range from -100 to 100 
# in numerical value, since most outputs of the log transformation fall within this range.
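
The effect described above can be checked on synthetic data. The sketch below assumes the natural logarithm (the base used by `log_transform` is not stated here) and uses a hypothetical column `col1` filled with log-normal samples plus two invalid rows.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed data (assumption): log-normal samples plus two invalid rows.
rng = np.random.default_rng(0)
values = np.concatenate([rng.lognormal(mean=0.0, sigma=1.0, size=1000), [0.0, -3.0]])
df = pd.DataFrame({"col1": values})

# Rows with values <= 0 are dropped, since the logarithm is undefined for them.
positive_df = df[df["col1"] > 0].copy()
positive_df["col1_log"] = np.log(positive_df["col1"])  # natural log (assumed base)

# Skewness close to zero indicates a roughly symmetric distribution.
skew_before = positive_df["col1"].skew()
skew_after = positive_df["col1_log"].skew()
print(f"Skewness before: {skew_before:.2f}; after: {skew_after:.2f}")
```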

### **Reversing the log-transform - Exponentially transforming variables**

In [None]:
DATASET = dataset #Alternatively: object containing the dataset to be analyzed

SUBSET = None
# Set SUBSET = None to transform the whole dataset. Alternatively, pass a list with 
# columns names for the transformation to be applied. For instance:
# SUBSET = ['col1', 'col2', 'col3'] will apply the transformation to the columns named
# as 'col1', 'col2', and 'col3'. Declare the names inside quotes.
# Declaring the full list of columns is equivalent to setting SUBSET = None.

CREATE_NEW_COLUMNS = True
# Alternatively, set CREATE_NEW_COLUMNS = True to store the transformed data into new
# columns. Or set CREATE_NEW_COLUMNS = False to overwrite the existing columns
    
NEW_COLUMNS_SUFFIX = "_originalScale"
# This value has effect only if CREATE_NEW_COLUMNS = True.
# The new column name will be set as column + NEW_COLUMNS_SUFFIX. Then, if the original
# column was "column1" and the suffix is "_originalScale", the new column will be named as
# "column1_originalScale".
# Alternatively, input inside quotes a string with the desired suffix. Recommendation:
# start the suffix with "_" to separate it from the original name.

#New dataframe saved as rescaled_df.
# Simply modify this object on the left of equality:
rescaled_df = reverse_log_transform(df = DATASET, subset = SUBSET, create_new_columns = CREATE_NEW_COLUMNS, new_columns_suffix = NEW_COLUMNS_SUFFIX)
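
Reversing a natural-log transform amounts to applying the exponential. A minimal round-trip sketch, assuming the natural logarithm was used and with hypothetical column names that follow the suffix convention above:

```python
import numpy as np
import pandas as pd

# Hypothetical positive-valued column (assumption for illustration).
df = pd.DataFrame({"column1": [0.5, 1.0, 20.0, 300.0]})

# Forward transform, then its inverse.
df["column1_log"] = np.log(df["column1"])
df["column1_originalScale"] = np.exp(df["column1_log"])

print(df)
```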

### **Obtaining and applying Box-Cox transform**
- Transform a series of data into a series described by a normal distribution.

In [None]:
# This function will process a single column column_to_transform of the dataframe df 
# per call.

DATASET = dataset #Alternatively: object containing the dataset to be processed

COLUMN_TO_TRANSFORM = 'column_to_transform'
# COLUMN_TO_TRANSFORM must be a string with the name of the column.
# e.g. COLUMN_TO_TRANSFORM = 'column1' to transform a column named as 'column1'

MODE = 'calculate_and_apply'
# Alternatively, MODE = 'calculate_and_apply' to calculate lambda and apply the Box-Cox
# transform; MODE = 'apply_only' to apply the transform for a known lambda.
# For 'apply_only', LAMBDA_BOXCOX must be provided.

LAMBDA_BOXCOX = None
# LAMBDA_BOXCOX must be a float value. e.g. LAMBDA_BOXCOX = 1.7
# If you calculated lambda from the function box_cox_transform and saved the
# transformation data summary dictionary as data_sum_dict, simply set:
## LAMBDA_BOXCOX = data_sum_dict['lambda_boxcox']
# This will access the value on the key 'lambda_boxcox' of the dictionary, which
# contains the lambda. 
# If lambda_boxcox is None, the mode will be automatically set as 'calculate_and_apply'.

SUFFIX = '_BoxCoxTransf'
# SUFFIX: string (inside quotes).
# How the transformed column will be identified in the returned data_transformed_df.
# If COLUMN_TO_TRANSFORM = 'Y' and SUFFIX = '_BoxCoxTransf', the transformed column will be
# identified as 'Y_BoxCoxTransf'.
# Alternatively, input inside quotes a string with the desired suffix. Recommendation:
# start the suffix with "_" to separate it from the original name

SPECIFICATION_LIMITS = {'lower_spec_lim': None, 'upper_spec_lim': None}
# specification_limits = {'lower_spec_lim': None, 'upper_spec_lim': None}
# If there are specification limits, input them in this dictionary. Do not modify the keys,
# simply substitute None by the lower and/or the upper specification.
# e.g. Suppose you have a tank that cannot have more than 10 L. So:
# specification_limits = {'lower_spec_lim': None, 'upper_spec_lim': 10}, there is only
# an upper specification equals to 10 (do not add units);
# Suppose a temperature cannot be lower than 10 ºC, but there is no upper specification. So,
# specification_limits = {'lower_spec_lim': 10, 'upper_spec_lim': None}. Finally, suppose
# a liquid which pH must be between 6.8 and 7.2:
# specification_limits = {'lower_spec_lim': 6.8, 'upper_spec_lim': 7.2}

#New dataframe saved as data_transformed_df; dictionary saved as data_sum_dict.
# Simply modify this object on the left of equality:
data_transformed_df, data_sum_dict = box_cox_transform (df = DATASET, column_to_transform = COLUMN_TO_TRANSFORM, mode = MODE, lambda_boxcox = LAMBDA_BOXCOX, suffix = SUFFIX, specification_limits = SPECIFICATION_LIMITS)
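
The two modes described above can be illustrated with SciPy directly. This is a sketch under assumptions, not the implementation of `box_cox_transform`: the data is synthetic, and `scipy.stats.boxcox` is used both to estimate lambda and to apply a known lambda.

```python
import numpy as np
from scipy import stats

# Synthetic, strictly positive data (Box-Cox requires positive values).
rng = np.random.default_rng(42)
y = rng.lognormal(mean=0.0, sigma=0.5, size=500)

# Analogous to 'calculate_and_apply': lambda is estimated by maximum likelihood.
y_transf, lambda_boxcox = stats.boxcox(y)

# Analogous to 'apply_only': the transform is applied with a known lambda.
y_transf_known = stats.boxcox(y, lmbda=lambda_boxcox)

print(f"Estimated lambda: {lambda_boxcox:.4f}")
```

Storing the estimated lambda is what makes it possible to apply the same transform to new data, or to reverse it later.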

### **Reversing Box-Cox transform**

In [None]:
# This function will process a single column column_to_transform of the dataframe df 
# per call.

DATASET = dataset #Alternatively: object containing the dataset to be processed

COLUMN_TO_TRANSFORM = 'column_to_transform'
# COLUMN_TO_TRANSFORM must be a string with the name of the column.
# e.g. COLUMN_TO_TRANSFORM = 'column1' to transform a column named as 'column1'

LAMBDA_BOXCOX = None
# LAMBDA_BOXCOX must be a float value. e.g. LAMBDA_BOXCOX = 1.7
# If you calculated lambda from the function box_cox_transform and saved the
# transformation data summary dictionary as data_sum_dict, simply set:
## LAMBDA_BOXCOX = data_sum_dict['lambda_boxcox']
# This will access the value on the key 'lambda_boxcox' of the dictionary, which
# contains the lambda. 
# This function has no mode for calculating lambda: provide the same lambda that was used
# in the original transformation.

SUFFIX = '_ReversedBoxCox'
# SUFFIX: string (inside quotes).
# How the transformed column will be identified in the returned dataframe.
# If COLUMN_TO_TRANSFORM = 'Y' and SUFFIX = '_ReversedBoxCox', the transformed column will be
# identified as 'Y_ReversedBoxCox'.
# Alternatively, input inside quotes a string with the desired suffix. Recommendation:
# start the suffix with "_" to separate it from the original name

#New dataframe saved as retransformed_df.
# Simply modify this object on the left of equality:
retransformed_df = reverse_box_cox (df = DATASET, column_to_transform = COLUMN_TO_TRANSFORM, lambda_boxcox = LAMBDA_BOXCOX, suffix = SUFFIX)
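
For a nonzero lambda, the Box-Cox transform is Ytransf = (Y^lambda - 1)/lambda, so its inverse is Y = (lambda * Ytransf + 1)^(1/lambda). The sketch below checks this formula against `scipy.special.inv_boxcox`; the lambda value and the data are assumptions chosen for illustration.

```python
import numpy as np
from scipy import special

lambda_boxcox = 0.5  # assumed known lambda (hypothetical value)
y = np.array([1.0, 2.0, 5.0, 10.0])

# Forward transform with the known lambda.
y_transf = special.boxcox(y, lambda_boxcox)

# Inverse transform: SciPy helper versus the explicit formula.
y_back = special.inv_boxcox(y_transf, lambda_boxcox)
y_back_manual = (lambda_boxcox * y_transf + 1.0) ** (1.0 / lambda_boxcox)

print(y_back)
```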

### **One-Hot Encoding the categorical variables**
- For each category, the One-Hot Encoder creates a new column in the dataset. This new column is a binary variable equal to 1 when the row belongs to that category, and equal to 0 otherwise. For a category "A", a column named "A" is created.
    - If the row is an element from category "A", the value for the column "A" is 1.
    - If not, the value for column "A" is 0.

In [None]:
DATASET = dataset #Alternatively: object containing the dataset to be processed

SUBSET_OF_FEATURES_TO_BE_ENCODED = ['COLUMN1', 'COLUMN2', 'COLUMN3']
# SUBSET_OF_FEATURES_TO_BE_ENCODED: list of strings (inside quotes), 
# containing the names of the columns with the categorical variables that will be 
# encoded. If a single column will be encoded, declare this parameter as a list with
# only one element, e.g. SUBSET_OF_FEATURES_TO_BE_ENCODED = ["column1"] 
# will analyze the column named as 'column1'; 
# SUBSET_OF_FEATURES_TO_BE_ENCODED = ["col1", 'col2', 'col3'] will analyze 3 columns
# with categorical variables: 'col1', 'col2', and 'col3'.

# New dataframe saved as one_hot_encoded_df; list of encoding information,
# including different categories and encoder objects as OneHot_encoding_list.
# Simply modify this object on the left of equality:
one_hot_encoded_df, OneHot_encoding_list = OneHotEncode_df (df = DATASET, subset_of_features_to_be_encoded = SUBSET_OF_FEATURES_TO_BE_ENCODED)
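
A minimal sketch of One-Hot Encoding with scikit-learn's `OneHotEncoder`, shown on a hypothetical column `color`. It is an illustration under assumptions and not the implementation of `OneHotEncode_df`; the encoded columns are named after the categories, as described above.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical column (assumption for illustration).
df = pd.DataFrame({"color": ["red", "blue", "red", "green"]})

encoder = OneHotEncoder()
# The default output is a sparse matrix, so it is converted to a dense array.
encoded = encoder.fit_transform(df[["color"]].to_numpy()).toarray()

# One binary column per category; categories are sorted alphabetically.
encoded_columns = [str(category) for category in encoder.categories_[0]]
one_hot_df = pd.concat(
    [df, pd.DataFrame(encoded, columns=encoded_columns, index=df.index)], axis=1
)

print(one_hot_df)
```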

### **Reversing the One-Hot Encoding of the categorical variables**

In [None]:
DATASET = dataset #Alternatively: object containing the dataset to be processed

ENCODING_LIST = [
    
    {'column': None,
    'OneHot_encoder': {'OneHot_enc_obj': None, 'encoded_columns': None}},
    {'column': None,
    'OneHot_encoder': {'OneHot_enc_obj': None, 'encoded_columns': None}},
    {'column': None,
    'OneHot_encoder': {'OneHot_enc_obj': None, 'encoded_columns': None}},
    {'column': None,
    'OneHot_encoder': {'OneHot_enc_obj': None, 'encoded_columns': None}}
    
]
# ENCODING_LIST: list in the same format as the one generated by the OneHotEncode_df function:
# it must be a list of dictionaries where each dictionary contains two keys:
# key 'column': string with the original column name (in quotes); 
# key 'OneHot_encoder': this key must store a nested dictionary.
# The nested dictionaries generated by the encoding function contain
# two keys: 'categories', storing an array with the different categories;
# and 'OneHot_enc_obj', storing the encoder object. Only the key 'OneHot_enc_obj' is required.
## In addition, a third key is needed in the nested dictionary:
## key 'encoded_columns': this key must store a list or array with the names of the columns
# obtained from the encoding.

# New dataframe saved as reversed_one_hot_encoded_df.
# Simply modify this object on the left of equality:
reversed_one_hot_encoded_df = reverse_OneHotEncode (df = DATASET, encoding_list = ENCODING_LIST)
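
The sketch below builds one entry in the ENCODING_LIST format described above and uses the stored encoder object to recover the original categories through scikit-learn's `inverse_transform`. The data and column names are hypothetical, and the loop only illustrates the idea; it is not the implementation of `reverse_OneHotEncode`.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data, encoded first so that a fitted encoder is available.
original = pd.DataFrame({"color": ["red", "blue", "red", "green"]})
encoder = OneHotEncoder()
encoded = encoder.fit_transform(original[["color"]].to_numpy()).toarray()
encoded_columns = [str(category) for category in encoder.categories_[0]]
encoded_df = pd.DataFrame(encoded, columns=encoded_columns)

# One entry in the format expected for ENCODING_LIST.
encoding_list = [
    {"column": "color",
     "OneHot_encoder": {"OneHot_enc_obj": encoder, "encoded_columns": encoded_columns}}
]

# Recover each original column from its binary columns.
for entry in encoding_list:
    enc_obj = entry["OneHot_encoder"]["OneHot_enc_obj"]
    columns = entry["OneHot_encoder"]["encoded_columns"]
    recovered = enc_obj.inverse_transform(encoded_df[columns].to_numpy())
    encoded_df[entry["column"]] = recovered[:, 0]

print(encoded_df)
```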

### **Ordinal Encoding the categorical variables**
- Transform categorical values with notion of order into numerical (integer) features.
- For each column, the Ordinal Encoder creates a new column in the dataset. This new column stores integer values, where each integer represents one of the possible categories.

In [None]:
DATASET = dataset #Alternatively: object containing the dataset to be processed

SUBSET_OF_FEATURES_TO_BE_ENCODED = ['COLUMN1', 'COLUMN2', 'COLUMN3']
# SUBSET_OF_FEATURES_TO_BE_ENCODED: list of strings (inside quotes), 
# containing the names of the columns with the categorical variables that will be 
# encoded. If a single column will be encoded, declare this parameter as a list with
# only one element, e.g. SUBSET_OF_FEATURES_TO_BE_ENCODED = ["column1"] 
# will analyze the column named as 'column1'; 
# SUBSET_OF_FEATURES_TO_BE_ENCODED = ["col1", 'col2', 'col3'] will analyze 3 columns
# with categorical variables: 'col1', 'col2', and 'col3'.

# New dataframe saved as ordinal_encoded_df; list of encoding information,
# including different categories and encoder objects as ordinal_encoding_list.
# Simply modify this object on the left of equality:
ordinal_encoded_df, ordinal_encoding_list = OrdinalEncode_df (df = DATASET, subset_of_features_to_be_encoded = SUBSET_OF_FEATURES_TO_BE_ENCODED)
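
A minimal sketch of Ordinal Encoding with scikit-learn's `OrdinalEncoder`, on a hypothetical column `size`. By default the encoder sorts the categories alphabetically, so the intended order is passed explicitly here. The column name `size_OrdinalEnc` is an assumption, and this is not the implementation of `OrdinalEncode_df`.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordered categories (assumption for illustration).
df = pd.DataFrame({"size": ["small", "large", "medium", "small"]})

# Explicit order: small -> 0, medium -> 1, large -> 2.
encoder = OrdinalEncoder(categories=[["small", "medium", "large"]])
df["size_OrdinalEnc"] = encoder.fit_transform(df[["size"]].to_numpy()).ravel()

print(df)
```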

### **Reversing the Ordinal Encoding of the categorical variables**

In [None]:
DATASET = dataset #Alternatively: object containing the dataset to be processed

ENCODING_LIST = [
    
    {'column': None,
    'ordinal_encoder': {'ordinal_enc_obj': None, 'encoded_column': None}},
    {'column': None,
    'ordinal_encoder': {'ordinal_enc_obj': None, 'encoded_column': None}},
    {'column': None,
    'ordinal_encoder': {'ordinal_enc_obj': None, 'encoded_column': None}},
    {'column': None,
    'ordinal_encoder': {'ordinal_enc_obj': None, 'encoded_column': None}}
    
]
# ENCODING_LIST: list in the same format as the one generated by the OrdinalEncode_df function:
# it must be a list of dictionaries where each dictionary contains two keys:
# key 'column': string with the original column name (in quotes); 
# key 'ordinal_encoder': this key must store a nested dictionary.
# The nested dictionaries generated by the encoding function contain
# two keys: 'categories', storing an array with the different categories;
# and 'ordinal_enc_obj', storing the encoder object. Only the key 'ordinal_enc_obj' is required.
## In addition, a third key is needed in the nested dictionary:
## key 'encoded_column': this key must store a string with the name of the column
# obtained from the encoding.

# New dataframe saved as reversed_ordinal_encoded_df.
# Simply modify this object on the left of equality:
reversed_ordinal_encoded_df = reverse_OrdinalEncode (df = DATASET, encoding_list = ENCODING_LIST)
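
As an illustration of the ENCODING_LIST format for ordinal encoders, the sketch below stores a fitted `OrdinalEncoder` and the name of the encoded column in one entry, then maps the integers back to the categories with `inverse_transform`. All data and names are hypothetical; this is not the implementation of `reverse_OrdinalEncode`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Fitted encoder with an explicit category order (assumption for illustration).
encoder = OrdinalEncoder(categories=[["small", "medium", "large"]])
encoder.fit(np.array([["small"], ["medium"], ["large"]], dtype=object))

# Hypothetical dataframe that holds only the encoded column.
encoded_df = pd.DataFrame({"size_OrdinalEnc": [0.0, 2.0, 1.0, 0.0]})

# One entry in the format expected for ENCODING_LIST.
encoding_list = [
    {"column": "size",
     "ordinal_encoder": {"ordinal_enc_obj": encoder, "encoded_column": "size_OrdinalEnc"}}
]

for entry in encoding_list:
    enc_obj = entry["ordinal_encoder"]["ordinal_enc_obj"]
    encoded_column = entry["ordinal_encoder"]["encoded_column"]
    recovered = enc_obj.inverse_transform(encoded_df[[encoded_column]].to_numpy())
    encoded_df[entry["column"]] = recovered.ravel()

print(encoded_df)
```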

### **Scaling the features - Standard scaler, Min-Max scaler, division by factor**

In [None]:
DATASET = dataset #Alternatively: object containing the dataset to be processed

SUBSET_OF_FEATURES_TO_SCALE = ['COLUMN1', 'COLUMN2', 'COLUMN3']
# SUBSET_OF_FEATURES_TO_SCALE: list of strings (inside quotes), 
# containing the names of the columns with the numeric variables that will be 
# scaled. If a single column will be scaled, declare this parameter as a list with
# only one element, e.g. SUBSET_OF_FEATURES_TO_SCALE = ["column1"] 
# will scale the column named 'column1'; 
# SUBSET_OF_FEATURES_TO_SCALE = ["col1", 'col2', 'col3'] will scale 3 columns:
# 'col1', 'col2', and 'col3'.

MODE = 'min_max'
## Alternatively: MODE = 'standard', MODE = 'min_max', MODE = 'factor', MODE = 'normalize_by_maximum'
## This function provides 4 methods (modes) of scaling:
## MODE = 'standard': applies the standard scaling, 
##  which creates a new variable with mean = 0; and standard deviation = 1.
##  Each value Y is transformed as Ytransf = (Y - u)/s, where u is the mean 
##  of the training samples, and s is the standard deviation of the training samples.
    
## MODE = 'min_max': applies min-max normalization, with a resultant feature 
## ranging from 0 to 1. each value Y is transformed as 
## Ytransf = (Y - Ymin)/(Ymax - Ymin), where Ymin and Ymax are the minimum and 
## maximum values of Y, respectively.
    
## MODE = 'factor': divides the whole series by a numeric value provided as argument. 
## For a factor F, the new Y values will be Ytransf = Y/F.

## MODE = 'normalize_by_maximum' is similar to MODE = 'factor', but the factor will be selected
# as the maximum value. This mode is available only for SCALE_WITH_NEW_PARAMS = True. If
# SCALE_WITH_NEW_PARAMS = False, you should provide the value of the maximum as a division 'factor'.

SCALE_WITH_NEW_PARAMS = True
# Alternatively, set SCALE_WITH_NEW_PARAMS = True if you want to calculate a new
# scaler for the data; or SCALE_WITH_NEW_PARAMS = False if you want to apply 
# parameters previously obtained to the data (i.e., if you want to apply the scaler
# previously trained to another set of data; or wants to simply apply again the same
# scaler).
    
## WARNING: MODE = 'factor' requires the input of the list of factors that will be 
# used for normalizing each column. Therefore, it can be used only 
# when SCALE_WITH_NEW_PARAMS = False.

LIST_OF_SCALING_PARAMS = None
# LIST_OF_SCALING_PARAMS is a list of dictionaries with the same format as the list returned
# by this function. Each dictionary must correspond to one of the features that will be scaled,
# but the list does not have to be in the same order as the columns - it will check one of the
# dictionary keys.
# The first key of the dictionary must be 'column'. This key must store a string with the exact
# name of the column that will be scaled.
# the second key must be 'scaler'. This key must store a dictionary. The dictionary must store
# one of two keys: 'scaler_obj' - sklearn scaler object to be used; or 'scaler_details' - the
# numeric parameters for re-calculating the scaler without the object. The key 'scaler_details', 
# must contain a nested dictionary. For the mode 'min_max', this dictionary should contain 
# two keys: 'min', with the minimum value of the variable, and 'max', with the maximum value. 
# For mode 'standard', the keys should be 'mu', with the mean value, and 'sigma', with its 
# standard deviation. For the mode 'factor', the key should be 'factor', and should contain the 
# factor for division (the scaling value. e.g 'factor': 2.0 will divide the column by 2.0.).
# Again, if you want to normalize by the maximum, declare the maximum value as any other factor for
# division.
# The key 'scaler_details' will not create an object: the transform will be directly performed 
# through vectorial operations.

SUFFIX = '_scaled'
# SUFFIX: string (inside quotes).
# How the transformed column will be identified in the returned scaled_df.
# If the original column is 'Y' and SUFFIX = '_scaled', the transformed column will be
# identified as 'Y_scaled'.
# Alternatively, input inside quotes a string with the desired suffix. Recommendation:
# start the suffix with "_" to separate it from the original name

# New dataframe saved as scaled_df; list of scaling parameters saved as scaling_list
# Simply modify this object on the left of equality:
scaled_df, scaling_list = feature_scaling (df = DATASET, subset_of_features_to_scale = SUBSET_OF_FEATURES_TO_SCALE, mode = MODE, scale_with_new_params = SCALE_WITH_NEW_PARAMS, list_of_scaling_params = LIST_OF_SCALING_PARAMS, suffix = SUFFIX)
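
The three formulas above can be reproduced with plain pandas operations. The sketch below is an illustration under assumptions, not the implementation of `feature_scaling`: the column `Y` is hypothetical, and the standard deviation is computed with `ddof=0` (the population formula, which is what scikit-learn's `StandardScaler` uses). The final list follows the `scaler_details` format described for LIST_OF_SCALING_PARAMS.

```python
import pandas as pd

# Hypothetical numeric column (assumption for illustration).
df = pd.DataFrame({"Y": [2.0, 4.0, 6.0, 8.0]})
y = df["Y"]

# 'standard': Ytransf = (Y - u)/s
mu, sigma = y.mean(), y.std(ddof=0)
df["Y_standard"] = (y - mu) / sigma

# 'min_max': Ytransf = (Y - Ymin)/(Ymax - Ymin)
y_min, y_max = y.min(), y.max()
df["Y_min_max"] = (y - y_min) / (y_max - y_min)

# 'factor': Ytransf = Y/F
factor = 2.0
df["Y_factor"] = y / factor

# Parameters stored so the same scaling can be reapplied or reversed later.
scaling_list = [
    {"column": "Y",
     "scaler": {"scaler_obj": None, "scaler_details": {"min": y_min, "max": y_max}}}
]

print(df)
```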

### **Reversing scaling of the features - Standard scaler, Min-Max scaler, division by factor**

In [None]:
DATASET = dataset #Alternatively: object containing the dataset to be processed

SUBSET_OF_FEATURES_TO_SCALE = ['COLUMN1', 'COLUMN2', 'COLUMN3']
# SUBSET_OF_FEATURES_TO_SCALE: list of strings (inside quotes), 
# containing the names of the columns with the numeric variables whose scaling 
# will be reversed. If a single column will be processed, declare this parameter as a list with
# only one element, e.g. SUBSET_OF_FEATURES_TO_SCALE = ["column1"] 
# will process the column named 'column1'; 
# SUBSET_OF_FEATURES_TO_SCALE = ["col1", 'col2', 'col3'] will process 3 columns:
# 'col1', 'col2', and 'col3'.

MODE = 'min_max'
## Alternatively: MODE = 'standard', MODE = 'min_max', MODE = 'factor'
## This function provides 3 methods (modes) of scaling:
## MODE = 'standard': applies the standard scaling, 
##  which creates a new variable with mean = 0; and standard deviation = 1.
##  Each value Y is transformed as Ytransf = (Y - u)/s, where u is the mean 
##  of the training samples, and s is the standard deviation of the training samples.
    
## MODE = 'min_max': applies min-max normalization, with a resultant feature 
## ranging from 0 to 1. each value Y is transformed as 
## Ytransf = (Y - Ymin)/(Ymax - Ymin), where Ymin and Ymax are the minimum and 
## maximum values of Y, respectively.
    
## MODE = 'factor': divides the whole series by a numeric value provided as argument. 
## For a factor F, the new Y values will be Ytransf = Y/F.

LIST_OF_SCALING_PARAMS = [
                            {'column': None,
                            'scaler': {'scaler_obj': None, 
                                      'scaler_details': None}},
                            {'column': None,
                            'scaler': {'scaler_obj': None, 
                                      'scaler_details': None}}
                            
                         ]
# LIST_OF_SCALING_PARAMS is a list of dictionaries with the same format as the list returned
# by this function. Each dictionary must correspond to one of the features that will be scaled,
# but the list does not have to be in the same order as the columns - it will check one of the
# dictionary keys.
# The first key of the dictionary must be 'column'. This key must store a string with the exact
# name of the column that will be scaled.
# the second key must be 'scaler'. This key must store a dictionary. The dictionary must store
# one of two keys: 'scaler_obj' - sklearn scaler object to be used; or 'scaler_details' - the
# numeric parameters for re-calculating the scaler without the object. The key 'scaler_details', 
# must contain a nested dictionary. For the mode 'min_max', this dictionary should contain 
# two keys: 'min', with the minimum value of the variable, and 'max', with the maximum value. 
# For mode 'standard', the keys should be 'mu', with the mean value, and 'sigma', with its 
# standard deviation. For the mode 'factor', the key should be 'factor', and should contain the 
# factor for division (the scaling value. e.g 'factor': 2.0 will divide the column by 2.0.).
# Again, if you want to normalize by the maximum, declare the maximum value as any other factor for
# division.

SUFFIX = '_reverseScaling'
# SUFFIX: string (inside quotes).
# How the transformed column will be identified in the returned rescaled_df.
# If the original column is 'Y' and SUFFIX = '_reverseScaling', the transformed column will be
# identified as 'Y_reverseScaling'.
# Alternatively, input inside quotes a string with the desired suffix. Recommendation:
# start the suffix with "_" to separate it from the original name

# New dataframe saved as rescaled_df; list of scaling parameters saved as scaling_list
# Simply modify this object on the left of equality:
rescaled_df, scaling_list = reverse_feature_scaling (df = DATASET, subset_of_features_to_scale = SUBSET_OF_FEATURES_TO_SCALE, list_of_scaling_params = LIST_OF_SCALING_PARAMS, mode = MODE, suffix = SUFFIX)
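
Reversing each mode means solving its formula for Y: Y = Ytransf*s + u for 'standard'; Y = Ytransf*(Ymax - Ymin) + Ymin for 'min_max'; and Y = Ytransf*F for 'factor'. The helper `reverse_scale` below is hypothetical and only sketches these inversions from a `scaler_details` dictionary; it is not the implementation of `reverse_feature_scaling`.

```python
import pandas as pd

def reverse_scale(series, mode, scaler_details):
    """Invert a scaling from its stored numeric parameters (hypothetical helper)."""
    if mode == "standard":
        return series * scaler_details["sigma"] + scaler_details["mu"]
    if mode == "min_max":
        value_range = scaler_details["max"] - scaler_details["min"]
        return series * value_range + scaler_details["min"]
    if mode == "factor":
        return series * scaler_details["factor"]
    raise ValueError(f"Unknown mode: {mode}")

# Hypothetical column previously scaled with min = 10 and max = 50.
scaled = pd.Series([0.0, 0.25, 1.0], name="Y_scaled")
original = reverse_scale(scaled, "min_max", {"min": 10.0, "max": 50.0})

print(original.tolist())
```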

### **Importing or exporting models and dictionaries (or lists)**

#### Case 1: import only a model

In [None]:
ACTION = 'import'
# ACTION = 'import' for importing a model and/or a dictionary;
# ACTION = 'export' for exporting a model and/or a dictionary.

OBJECTS_MANIPULATED = 'model_only'
# OBJECTS_MANIPULATED = 'model_only' if only a model will be manipulated.
# OBJECTS_MANIPULATED = 'dict_or_list_only' if only a dictionary will be manipulated.
# OBJECTS_MANIPULATED = 'model_and_dict' if both a model and a dictionary will 
#  be manipulated.

MODEL_FILE_NAME = None
# MODEL_FILE_NAME: string with the name of the file containing the model (for 'import');
# or of the name that the exported file will have (for 'export')
# e.g. MODEL_FILE_NAME = 'model'
# WARNING: Do not add the file extension.
# Keep it in quotes. Keep MODEL_FILE_NAME = None if no model will be manipulated.

DICTIONARY_OR_LIST_FILE_NAME = None
# DICTIONARY_OR_LIST_FILE_NAME: string with the name of the file containing the dictionary 
# (for 'import');
# or of the name that the exported file will have (for 'export')
# e.g. DICTIONARY_OR_LIST_FILE_NAME = 'history_dict'
# WARNING: Do not add the file extension.
# Keep it in quotes. Keep DICTIONARY_OR_LIST_FILE_NAME = None if no dictionary 
# or list will be manipulated.

DIRECTORY_PATH = ''
# DIRECTORY_PATH: path of the directory where the model will be saved,
# or from which the model will be retrieved. If no value is provided,
# the DIRECTORY_PATH will be the root: ""
# Notice that the model and the dictionary must be stored in the same path.
# If a model and a dictionary will be exported, they will be stored in the same
# DIRECTORY_PATH.
    
MODEL_TYPE = 'arima'
# This parameter has effect only when a model will be manipulated.
# MODEL_TYPE = 'keras' for deep learning Keras/ TensorFlow models with extension .h5
# MODEL_TYPE = 'tensorflow_general' for generic deep learning tensorflow models containing 
# custom layers, losses and architectures. Such models are compressed as tar.gz, tar, or zip.
# MODEL_TYPE = 'sklearn' for models from Scikit-learn (non-deep learning)
# MODEL_TYPE = 'xgb_regressor' for XGBoost regression models (non-deep learning)
# MODEL_TYPE = 'xgb_classifier' for XGBoost classification models (non-deep learning)
# MODEL_TYPE = 'arima' for ARIMA model (Statsmodels)

DICT_OR_LIST_TO_EXPORT = None
MODEL_TO_EXPORT = None 
# These two parameters have effect only when ACTION == 'export'. In this case, they
# must be declared. If ACTION == 'import', keep:
# DICT_OR_LIST_TO_EXPORT = None, 
# MODEL_TO_EXPORT = None
# If one of these objects will be exported, substitute None by the name of the object
# e.g. if your model is stored in the global memory as 'keras_model' declare:
# MODEL_TO_EXPORT = keras_model. Notice that it must be declared without quotes, since
# it is not a string, but an object.
# For exporting a dictionary named as 'dict':
# DICT_OR_LIST_TO_EXPORT = dict

USE_COLAB_MEMORY = False
# USE_COLAB_MEMORY: this parameter has effect only when using Google Colab (otherwise it will
# raise an error). Set USE_COLAB_MEMORY = True if you want to use the instant memory
# from Google Colaboratory: you will upload or download the file, and it will be available
# only while the kernel is running. It will be deleted when the kernel
# dies, for instance, when you close the notebook.
    
# If ACTION == 'export' and USE_COLAB_MEMORY == True, then the file will be downloaded
# to your computer (running the cell will start the download).

# Model object saved as model.
# Simply modify this object on the left of equality:
model = import_export_model_list_dict (action = ACTION, objects_manipulated = OBJECTS_MANIPULATED, model_file_name = MODEL_FILE_NAME, dictionary_or_list_file_name = DICTIONARY_OR_LIST_FILE_NAME, directory_path = DIRECTORY_PATH, model_type = MODEL_TYPE, dict_or_list_to_export = DICT_OR_LIST_TO_EXPORT, model_to_export = MODEL_TO_EXPORT, use_colab_memory = USE_COLAB_MEMORY) 
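
For reference, a minimal export-and-import round trip for a Scikit-learn model using `joblib`. This is a sketch under assumptions and not the implementation of `import_export_model_list_dict`: the model, the temporary directory, and the file name and extension `model.joblib` are all hypothetical. Files of this kind should be loaded only from trusted sources, since loading them can execute arbitrary code.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical model to export (assumption for illustration).
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
model_to_export = LinearRegression().fit(X, y)

with tempfile.TemporaryDirectory() as directory_path:
    # Hypothetical file name and extension.
    file_path = os.path.join(directory_path, "model.joblib")
    joblib.dump(model_to_export, file_path)  # export
    model = joblib.load(file_path)           # import

# The imported model should reproduce the predictions of the exported one.
predictions_match = np.allclose(model.predict(X), model_to_export.predict(X))
print(predictions_match)
```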

#### Case 2: import only a dictionary or a list

In [None]:
ACTION = 'import'
# ACTION = 'import' for importing a model and/or a dictionary;
# ACTION = 'export' for exporting a model and/or a dictionary.

OBJECTS_MANIPULATED = 'dict_or_list_only'
# OBJECTS_MANIPULATED = 'model_only' if only a model will be manipulated.
# OBJECTS_MANIPULATED = 'dict_or_list_only' if only a dictionary will be manipulated.
# OBJECTS_MANIPULATED = 'model_and_dict' if both a model and a dictionary will 
#  be manipulated.

MODEL_FILE_NAME = None
# MODEL_FILE_NAME: string with the name of the file containing the model (for 'import');
# or of the name that the exported file will have (for 'export')
# e.g. MODEL_FILE_NAME = 'model'
# WARNING: Do not add the file extension.
# Keep it in quotes. Keep MODEL_FILE_NAME = None if no model will be manipulated.

DICTIONARY_OR_LIST_FILE_NAME = None
# DICTIONARY_OR_LIST_FILE_NAME: string with the name of the file containing the dictionary 
# (for 'import');
# or of the name that the exported file will have (for 'export')
# e.g. DICTIONARY_OR_LIST_FILE_NAME = 'history_dict'
# WARNING: Do not add the file extension.
# Keep it in quotes. Keep DICTIONARY_OR_LIST_FILE_NAME = None if no dictionary 
# or list will be manipulated.

DIRECTORY_PATH = ''
# DIRECTORY_PATH: path of the directory where the model will be saved,
# or from which the model will be retrieved. If no value is provided,
# the DIRECTORY_PATH will be the root: ""
# Notice that the model and the dictionary must be stored in the same path.
# If a model and a dictionary will be exported, they will be stored in the same
# DIRECTORY_PATH.
    
MODEL_TYPE = 'arima'
# This parameter has effect only when a model will be manipulated.
# MODEL_TYPE = 'keras' for deep learning Keras/ TensorFlow models with extension .h5
# MODEL_TYPE = 'tensorflow_general' for generic deep learning tensorflow models containing 
# custom layers, losses and architectures. Such models are compressed as tar.gz, tar, or zip.
# MODEL_TYPE = 'sklearn' for models from Scikit-learn (non-deep learning)
# MODEL_TYPE = 'xgb_regressor' for XGBoost regression models (non-deep learning)
# MODEL_TYPE = 'xgb_classifier' for XGBoost classification models (non-deep learning)
# MODEL_TYPE = 'arima' for ARIMA model (Statsmodels)

DICT_OR_LIST_TO_EXPORT = None
MODEL_TO_EXPORT = None 
# These two parameters have effect only when ACTION == 'export'. In this case, they
# must be declared. If ACTION == 'import', keep:
# DICT_OR_LIST_TO_EXPORT = None, 
# MODEL_TO_EXPORT = None
# If one of these objects will be exported, substitute None by the name of the object
# e.g. if your model is stored in the global memory as 'keras_model' declare:
# MODEL_TO_EXPORT = keras_model. Notice that it must be declared without quotes, since
# it is not a string, but an object.
# For exporting a dictionary named as 'dict':
# DICT_OR_LIST_TO_EXPORT = dict

USE_COLAB_MEMORY = False
# USE_COLAB_MEMORY: this parameter has effect only when using Google Colab (otherwise, it
# will raise an error). Set USE_COLAB_MEMORY = True if you want to use the instant memory
# from Google Colaboratory: you will upload or download the file, and it will be available
# only while the kernel is running. It will be deleted when the kernel
# dies, for instance, when you close the notebook.
    
# If ACTION == 'export' and USE_COLAB_MEMORY == True, then the file will be downloaded
# to your computer (running the cell will start the download).

# Dictionary or list saved as imported_dict_or_list.
# Simply modify this object on the left of equality:
imported_dict_or_list = import_export_model_list_dict (action = ACTION, objects_manipulated = OBJECTS_MANIPULATED, model_file_name = MODEL_FILE_NAME, dictionary_or_list_file_name = DICTIONARY_OR_LIST_FILE_NAME, directory_path = DIRECTORY_PATH, model_type = MODEL_TYPE, dict_or_list_to_export = DICT_OR_LIST_TO_EXPORT, model_to_export = MODEL_TO_EXPORT, use_colab_memory = USE_COLAB_MEMORY) 

#### Case 3: import a model and a dictionary (or a list)

In [None]:
ACTION = 'import'
# ACTION = 'import' for importing a model and/or a dictionary;
# ACTION = 'export' for exporting a model and/or a dictionary.

OBJECTS_MANIPULATED = 'model_and_dict'
# OBJECTS_MANIPULATED = 'model_only' if only a model will be manipulated.
# OBJECTS_MANIPULATED = 'dict_or_list_only' if only a dictionary will be manipulated.
# OBJECTS_MANIPULATED = 'model_and_dict' if both a model and a dictionary will 
#  be manipulated.

MODEL_FILE_NAME = None
# MODEL_FILE_NAME: string with the name of the file containing the model (for 'import');
# or of the name that the exported file will have (for 'export')
# e.g. MODEL_FILE_NAME = 'model'
# WARNING: Do not add the file extension.
# Keep it in quotes. Keep MODEL_FILE_NAME = None if no model will be manipulated.

DICTIONARY_OR_LIST_FILE_NAME = None
# DICTIONARY_OR_LIST_FILE_NAME: string with the name of the file containing the dictionary 
# (for 'import');
# or of the name that the exported file will have (for 'export')
# e.g. DICTIONARY_OR_LIST_FILE_NAME = 'history_dict'
# WARNING: Do not add the file extension.
# Keep it in quotes. Keep DICTIONARY_OR_LIST_FILE_NAME = None if no dictionary 
# or list will be manipulated.

DIRECTORY_PATH = ''
# DIRECTORY_PATH: path of the directory where the model will be saved,
# or from which the model will be retrieved. If no value is provided,
# the DIRECTORY_PATH will be the root: ""
# Notice that the model and the dictionary must be stored in the same path.
# If a model and a dictionary will be exported, they will be stored in the same
# DIRECTORY_PATH.
    
MODEL_TYPE = 'arima'
# This parameter has effect only when a model will be manipulated.
# MODEL_TYPE = 'keras' for deep learning Keras/TensorFlow models with extension .h5
# MODEL_TYPE = 'tensorflow_general' for generic deep learning TensorFlow models containing 
# custom layers, losses and architectures. Such models are compressed as tar.gz, tar, or zip.
# MODEL_TYPE = 'sklearn' for models from Scikit-learn (non-deep learning)
# MODEL_TYPE = 'xgb_regressor' for XGBoost regression models (non-deep learning)
# MODEL_TYPE = 'xgb_classifier' for XGBoost classification models (non-deep learning)
# MODEL_TYPE = 'arima' for ARIMA model (Statsmodels)

DICT_OR_LIST_TO_EXPORT = None
MODEL_TO_EXPORT = None 
# These two parameters have effect only when ACTION == 'export'. In this case, they
# must be declared. If ACTION == 'import', keep:
# DICT_OR_LIST_TO_EXPORT = None, 
# MODEL_TO_EXPORT = None
# If one of these objects will be exported, substitute None by the name of the object
# e.g. if your model is stored in the global memory as 'keras_model' declare:
# MODEL_TO_EXPORT = keras_model. Notice that it must be declared without quotes, since
# it is not a string, but an object.
# For exporting a dictionary named as 'dict':
# DICT_OR_LIST_TO_EXPORT = dict

USE_COLAB_MEMORY = False
# USE_COLAB_MEMORY: this parameter has effect only when using Google Colab (otherwise, it
# will raise an error). Set USE_COLAB_MEMORY = True if you want to use the instant memory
# from Google Colaboratory: you will upload or download the file, and it will be available
# only while the kernel is running. It will be deleted when the kernel
# dies, for instance, when you close the notebook.
    
# If ACTION == 'export' and USE_COLAB_MEMORY == True, then the file will be downloaded
# to your computer (running the cell will start the download).

# Model object saved as model. Dictionary or list saved as imported_dict_or_list.
# Simply modify these objects on the left of equality:
model, imported_dict_or_list = import_export_model_list_dict (action = ACTION, objects_manipulated = OBJECTS_MANIPULATED, model_file_name = MODEL_FILE_NAME, dictionary_or_list_file_name = DICTIONARY_OR_LIST_FILE_NAME, directory_path = DIRECTORY_PATH, model_type = MODEL_TYPE, dict_or_list_to_export = DICT_OR_LIST_TO_EXPORT, model_to_export = MODEL_TO_EXPORT, use_colab_memory = USE_COLAB_MEMORY) 

#### Case 4: export a model and/or a dictionary (or a list)

In [None]:
ACTION = 'export'
# ACTION = 'import' for importing a model and/or a dictionary;
# ACTION = 'export' for exporting a model and/or a dictionary.

OBJECTS_MANIPULATED = 'model_only'
# OBJECTS_MANIPULATED = 'model_only' if only a model will be manipulated.
# OBJECTS_MANIPULATED = 'dict_or_list_only' if only a dictionary will be manipulated.
# OBJECTS_MANIPULATED = 'model_and_dict' if both a model and a dictionary will 
#  be manipulated.

MODEL_FILE_NAME = None
# MODEL_FILE_NAME: string with the name of the file containing the model (for 'import');
# or of the name that the exported file will have (for 'export')
# e.g. MODEL_FILE_NAME = 'model'
# WARNING: Do not add the file extension.
# Keep it in quotes. Keep MODEL_FILE_NAME = None if no model will be manipulated.

DICTIONARY_OR_LIST_FILE_NAME = None
# DICTIONARY_OR_LIST_FILE_NAME: string with the name of the file containing the dictionary 
# (for 'import');
# or of the name that the exported file will have (for 'export')
# e.g. DICTIONARY_OR_LIST_FILE_NAME = 'history_dict'
# WARNING: Do not add the file extension.
# Keep it in quotes. Keep DICTIONARY_OR_LIST_FILE_NAME = None if no dictionary 
# or list will be manipulated.

DIRECTORY_PATH = ''
# DIRECTORY_PATH: path of the directory where the model will be saved,
# or from which the model will be retrieved. If no value is provided,
# the DIRECTORY_PATH will be the root: ""
# Notice that the model and the dictionary must be stored in the same path.
# If a model and a dictionary will be exported, they will be stored in the same
# DIRECTORY_PATH.
    
MODEL_TYPE = 'arima'
# This parameter has effect only when a model will be manipulated.
# MODEL_TYPE = 'keras' for deep learning Keras/TensorFlow models with extension .h5
# MODEL_TYPE = 'tensorflow_general' for generic deep learning TensorFlow models containing 
# custom layers, losses and architectures. Such models are compressed as tar.gz, tar, or zip.
# MODEL_TYPE = 'sklearn' for models from Scikit-learn (non-deep learning)
# MODEL_TYPE = 'xgb_regressor' for XGBoost regression models (non-deep learning)
# MODEL_TYPE = 'xgb_classifier' for XGBoost classification models (non-deep learning)
# MODEL_TYPE = 'arima' for ARIMA model (Statsmodels)

DICT_OR_LIST_TO_EXPORT = None
MODEL_TO_EXPORT = None 
# These two parameters have effect only when ACTION == 'export'. In this case, they
# must be declared. If ACTION == 'import', keep:
# DICT_OR_LIST_TO_EXPORT = None, 
# MODEL_TO_EXPORT = None
# If one of these objects will be exported, substitute None by the name of the object
# e.g. if your model is stored in the global memory as 'keras_model' declare:
# MODEL_TO_EXPORT = keras_model. Notice that it must be declared without quotes, since
# it is not a string, but an object.
# For exporting a dictionary named as 'dict':
# DICT_OR_LIST_TO_EXPORT = dict

USE_COLAB_MEMORY = False
# USE_COLAB_MEMORY: this parameter has effect only when using Google Colab (otherwise, it
# will raise an error). Set USE_COLAB_MEMORY = True if you want to use the instant memory
# from Google Colaboratory: you will upload or download the file, and it will be available
# only while the kernel is running. It will be deleted when the kernel
# dies, for instance, when you close the notebook.
    
# If ACTION == 'export' and USE_COLAB_MEMORY == True, then the file will be downloaded
# to your computer (running the cell will start the download).

import_export_model_list_dict (action = ACTION, objects_manipulated = OBJECTS_MANIPULATED, model_file_name = MODEL_FILE_NAME, dictionary_or_list_file_name = DICTIONARY_OR_LIST_FILE_NAME, directory_path = DIRECTORY_PATH, model_type = MODEL_TYPE, dict_or_list_to_export = DICT_OR_LIST_TO_EXPORT, model_to_export = MODEL_TO_EXPORT, use_colab_memory = USE_COLAB_MEMORY) 
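
The internals of `import_export_model_list_dict` are not shown in this notebook. As a rough illustration only, and assuming that a dictionary or list is serialized with Python's standard `pickle` module (an assumption, as are the `.pkl` extension and the hypothetical helper names `export_dict` and `import_dict`), an export followed by an import could be sketched as:

```python
import os
import pickle
import tempfile


def export_dict(obj, directory_path, file_name):
    """Hypothetical helper: pickle `obj` to <directory_path>/<file_name>.pkl."""
    # The file name is received without extension, as in the cells above.
    file_path = os.path.join(directory_path, file_name + ".pkl")
    with open(file_path, "wb") as opened_file:
        pickle.dump(obj, opened_file)
    return file_path


def import_dict(directory_path, file_name):
    """Hypothetical helper: load the object written by export_dict."""
    file_path = os.path.join(directory_path, file_name + ".pkl")
    with open(file_path, "rb") as opened_file:
        return pickle.load(opened_file)


# Round trip inside a temporary directory, so no file is left behind.
history_dict = {"loss": [0.9, 0.5, 0.3], "epochs": 3}
with tempfile.TemporaryDirectory() as tmp_dir:
    saved_path = export_dict(history_dict, tmp_dir, "history_dict")
    restored_dict = import_dict(tmp_dir, "history_dict")

print(restored_dict == history_dict)
```

Note that the object itself (not its name in quotes) is passed for export, which mirrors the instruction to declare `DICT_OR_LIST_TO_EXPORT` without quotes.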

### **Filtering (selecting); ordering; or renaming columns from the dataframe**

In [None]:
DATASET = dataset #Alternatively: object containing the dataset to be analyzed

MODE = 'select_or_order_columns'
# MODE = 'select_or_order_columns' for filtering only the list of columns passed as COLUMNS_LIST,
# and setting a new column order. In this mode, you can pass the columns in any order: 
# the order of elements on the list will be the new order of columns.

# MODE = 'rename_columns' for renaming the columns with the names passed as COLUMNS_LIST. In this
# mode, the list must have the same length and the same order as the columns of the dataframe. That is
# because the columns will sequentially receive the names in the list. So, a mismatch of positions
# will result in columns with incorrect names.

COLUMNS_LIST = ['column1', 'column2', 'column3']
# COLUMNS_LIST = list of strings containing the names (headers) of the columns to select
# (filter); or to be set as the new columns' names, according to the selected mode.
# For instance: COLUMNS_LIST = ['col1', 'col2', 'col3'] will 
# select columns 'col1', 'col2', and 'col3' (or rename the columns with these names). 
# Declare the names inside quotes.
# Simply substitute the list by the list of columns that you want to select; or the
# list of the new names you want to give to the dataset columns.

# New dataframe saved as new_df. Simply modify this object on the left of equality:
new_df = select_order_or_rename_columns (df = DATASET, columns_list = COLUMNS_LIST, mode = MODE)
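
For reference, the two modes correspond to standard pandas operations. The sketch below is not the implementation of `select_order_or_rename_columns`; it only shows plain pandas equivalents on a hypothetical toy dataframe (`toy_df` and its values are made up for illustration):

```python
import pandas as pd

# Toy dataframe standing in for `dataset`.
toy_df = pd.DataFrame({"column1": [1, 2], "column2": [3, 4], "column3": [5, 6]})

# Analogue of MODE = 'select_or_order_columns': keep only the listed columns,
# in the order given by the list.
selected_df = toy_df[["column3", "column1"]]

# Analogue of MODE = 'rename_columns': names are assigned by position, so the
# list needs one name per column, following the dataframe's current column order.
renamed_df = toy_df.copy()
renamed_df.columns = ["a", "b", "c"]

print(list(selected_df.columns))
print(list(renamed_df.columns))
```

Because renaming is positional, a list with the wrong length raises an error in pandas, while a list with the right length but the wrong order silently mislabels the columns.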

### **Renaming specific columns from the dataframe; or cleaning columns' labels**
- The function `select_order_or_rename_columns` requires the user to pass a list containing the names of all columns.
- Also, that list must contain the columns in the correct order (the order in which they appear in the dataframe).
- The function `rename_or_clean_columns_labels`, used below, may manipulate one or several columns per call, and does not depend on their order.
- It can also be used for cleaning the columns' labels: converting all column names to upper or lower case; replacing substrings in column names; or eliminating trailing and leading white spaces or characters from column labels.

In [None]:
DATASET = dataset #Alternatively: object containing the dataset to be analyzed

MODE = 'set_new_names'
# MODE = 'set_new_names' will change the columns according to the specifications in
# LIST_OF_COLUMNS_LABELS.

# MODE = 'capitalize_columns' will capitalize all columns names (i.e., they will be put in
# upper case). e.g. a column named 'column' will be renamed as 'COLUMN'

# MODE = 'lowercase_columns' will lower the case of all columns names. e.g. a column named
# 'COLUMN' will be renamed as 'column'.

# MODE = 'replace_substring' will search on the columns names (strings) for the 
# SUBSTRING_TO_BE_REPLACED (which may be a character or a string); and will replace it by 
# NEW_SUBSTRING_FOR_REPLACEMENT (which again may be either a character or a string). 
# Numbers (integers or floats) will be automatically converted into strings.
# As an example, consider the default situation where we search for a whitespace ' ' and replace it
# by underscore '_': SUBSTRING_TO_BE_REPLACED = ' ', NEW_SUBSTRING_FOR_REPLACEMENT = '_'  
# In this case, a column named 'new column' will be renamed as 'new_column'.

# MODE = 'trim' will remove all trailing or leading whitespaces from column names.
# e.g. a column named as ' col1 ' will be renamed as 'col1'; 'col2 ' will be renamed as
# 'col2'; and ' col3' will be renamed as 'col3'.

# MODE = 'eliminate_trailing_characters' will eliminate a defined trailing and leading 
# substring from the columns' names. 
# The substring must be indicated as TRAILING_SUBSTRING, and its default, when no value
# is provided, is equivalent to mode = 'trim' (eliminate white spaces). 
# e.g., if TRAILING_SUBSTRING = '_test' and you have a column named 'col_test', it will be 
# renamed as 'col'.

SUBSTRING_TO_BE_REPLACED = ' '
NEW_SUBSTRING_FOR_REPLACEMENT = '_'

TRAILING_SUBSTRING = None

LIST_OF_COLUMNS_LABELS = [
    
    {'column_name': None, 'new_column_name': None}, 
    {'column_name': None, 'new_column_name': None},
    {'column_name': None, 'new_column_name': None}, 
    {'column_name': None, 'new_column_name': None},
    {'column_name': None, 'new_column_name': None}, 
    {'column_name': None, 'new_column_name': None},
    {'column_name': None, 'new_column_name': None}, 
    {'column_name': None, 'new_column_name': None}, 
    {'column_name': None, 'new_column_name': None},
    {'column_name': None, 'new_column_name': None}, 
    {'column_name': None, 'new_column_name': None}
    
]
# LIST_OF_COLUMNS_LABELS = [{'column_name': None, 'new_column_name': None}]
# This is a list of dictionaries, where each dictionary contains two key-value pairs:
# the first one contains the original column name; and the second one contains the new name
# that will substitute the original one. The function will loop through all dictionaries in
# this list, access the value of the key 'column_name', and replace (switch) it 
# with the corresponding value in key 'new_column_name'.
    
# The object LIST_OF_COLUMNS_LABELS must be declared as a list, 
# in brackets, even if there is a single dictionary.
# Always use the same keys: 'column_name' for the original label; 
# and 'new_column_name' for the corresponding new label.
# Notice that this function will not search substrings: it will substitute a value only when
# there is perfect correspondence between the string in 'column_name' and one of the columns
# labels. So, the cases (upper or lower) must be the same.
    
# If you want, you can remove elements (dictionaries) from the list to declare fewer elements;
# and you can also add more elements (dictionaries) to the lists, if you need to replace more
# values.
# Simply put a comma after the last element from the list and declare a new dictionary, keeping the
# same keys: {'column_name': original_col, 'new_column_name': new_col}, 
# where original_col and new_col represent the strings for searching and replacement 
# (If one of the keys contains None, the new dictionary will be ignored).
# Example: LIST_OF_COLUMNS_LABELS = [{'column_name': 'col1', 'new_column_name': 'col'}] will
# rename 'col1' as 'col'.


# New dataframe saved as new_df. Simply modify this object on the left of equality:
new_df = rename_or_clean_columns_labels (df = DATASET, mode = MODE, substring_to_be_replaced = SUBSTRING_TO_BE_REPLACED, new_substring_for_replacement = NEW_SUBSTRING_FOR_REPLACEMENT, trailing_substring = TRAILING_SUBSTRING, list_of_columns_labels = LIST_OF_COLUMNS_LABELS)
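
The modes described above map onto common pandas idioms. The following is a minimal sketch under that assumption, not the code of `rename_or_clean_columns_labels`; the toy dataframe and the intermediate variable names are hypothetical:

```python
import pandas as pd

toy_df = pd.DataFrame({" col1 ": [1], "new column": [2], "col_test": [3]})

# Analogue of MODE = 'set_new_names': exact-match renaming from a list of
# dictionaries; entries containing None are skipped.
list_of_columns_labels = [
    {"column_name": "col_test", "new_column_name": "col"},
    {"column_name": None, "new_column_name": None},
]
mapping = {
    entry["column_name"]: entry["new_column_name"]
    for entry in list_of_columns_labels
    if entry["column_name"] is not None and entry["new_column_name"] is not None
}
renamed_df = toy_df.rename(columns=mapping)

# Analogue of MODE = 'trim': remove leading and trailing whitespace.
trimmed_df = toy_df.copy()
trimmed_df.columns = trimmed_df.columns.str.strip()

# Analogue of MODE = 'replace_substring' with ' ' replaced by '_'.
replaced_df = trimmed_df.copy()
replaced_df.columns = replaced_df.columns.str.replace(" ", "_", regex=False)

# Analogue of MODE = 'capitalize_columns'.
upper_df = replaced_df.copy()
upper_df.columns = upper_df.columns.str.upper()

print(list(renamed_df.columns))
print(list(upper_df.columns))
```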

### **Characterizing the dataframe**

In [None]:
DATASET = dataset #Alternatively: object containing the dataset to be analyzed

# New objects saved as df_shape, df_columns_array, df_dtypes, df_general_statistics, df_missing_values.
# Simply modify these objects on the left of equality:
df_shape, df_columns_array, df_dtypes, df_general_statistics, df_missing_values = df_general_characterization (df = DATASET)
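
To give an idea of what each returned object may hold, here is a hedged sketch using plain pandas attributes and methods. It assumes that `df_general_characterization` wraps calls similar to these, which the notebook does not confirm; the toy dataframe is hypothetical:

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({"x": [1.0, 2.0, np.nan], "label": ["a", "b", None]})

# (number of rows, number of columns)
df_shape = toy_df.shape
# NumPy array with the column labels
df_columns_array = toy_df.columns.values
# Data type of each column
df_dtypes = toy_df.dtypes
# Summary statistics (count, mean, std, quartiles, ...) of the numeric columns
df_general_statistics = toy_df.describe()
# Number of missing entries per column
df_missing_values = toy_df.isna().sum()

print(df_shape)
print(df_missing_values)
```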

### **Obtaining correlation plots**

In [None]:
DATASET = dataset #Alternatively: object containing the dataset to be analyzed

SHOW_MASKED_PLOT = True
# SHOW_MASKED_PLOT = True - keep as True if you want to see a cleaned version of the plot
# where a mask is applied. Alternatively, set SHOW_MASKED_PLOT = False to show the
# plot without the mask.

RESPONSES_TO_RETURN_CORR = None
# RESPONSES_TO_RETURN_CORR - keep as None to return the full correlation matrix.
# If you want to display the correlations for a particular group of features, input them
# as a list of strings (in quotes), even if this list contains a single element. Examples:
# RESPONSES_TO_RETURN_CORR = ['response1'] for a single response
# RESPONSES_TO_RETURN_CORR = ['response1', 'response2', 'response3'] for multiple
# responses. Notice that 'response1',... should be substituted by the name ('string')
# of a column of the dataset that represents a response variable.
# WARNING: The returned coefficients will be ordered according to the order of the list
# of responses, i.e., they will be ordered first by 'response1'.

SET_RETURNED_LIMIT = None
# SET_RETURNED_LIMIT = None - This variable only has effect if you have
# provided a response feature to be returned. In this case, keep SET_RETURNED_LIMIT = None
# to return all of the correlation coefficients; or, alternatively, 
# provide an integer to limit the total number of coefficients returned. 
# e.g. if SET_RETURNED_LIMIT = 10, only the ten highest coefficients will be returned. 

EXPORT_PNG = False
# Set EXPORT_PNG = False if you do not want to export the obtained image;
# Set EXPORT_PNG = True to export the obtained image.

DIRECTORY_TO_SAVE = None
# This parameter has effect only if EXPORT_PNG = True.
# (string, in quotes): input the path of the directory 
# (e.g. folder path) where the file will be stored. e.g. DIRECTORY_TO_SAVE = "" 
# or DIRECTORY_TO_SAVE = "folder"
# If EXPORT_PNG = True and DIRECTORY_TO_SAVE = None, the file will be saved in the root
# path, DIRECTORY_TO_SAVE = ""

FILE_NAME = None
# This parameter has effect only if EXPORT_PNG = True.
# (string, in quotes): input the name you want for the file without the 
# extension, which will be 'png'. e.g. FILE_NAME = "my_image" will save a file 'my_image.png' 
# If EXPORT_PNG = True and FILE_NAME = None, the file will be saved as:
# 'correlation_plot.png'
# WARNING: if there is already a file in the path DIRECTORY_TO_SAVE saved as FILE_NAME,
# the file will be overwritten.

PNG_RESOLUTION_DPI = 330
# This parameter has effect only if EXPORT_PNG = True.
# Alternatively, input an integer that will correspond to the resolution of the exported
# image in dpi. If PNG_RESOLUTION_DPI = None, it will be set as 330.


#New dataframe saved as correlation_matrix. Simply modify this object on the left of equality:
correlation_matrix = correlation_plot (df = DATASET, show_masked_plot = SHOW_MASKED_PLOT, responses_to_return_corr = RESPONSES_TO_RETURN_CORR, set_returned_limit = SET_RETURNED_LIMIT, export_png = EXPORT_PNG, directory_to_save = DIRECTORY_TO_SAVE, file_name = FILE_NAME, png_resolution_dpi = PNG_RESOLUTION_DPI)
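
The numeric side of this cell can be approximated with pandas and NumPy alone. The sketch below leaves out the plot and assumes Pearson correlation, an upper-triangle mask, and descending ordering by the response column. These are assumptions about `correlation_plot`, not confirmed behavior, and the toy data is hypothetical:

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({
    "feature1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "feature2": [5.0, 3.0, 4.0, 1.0, 2.0],
    "response1": [2.0, 4.0, 6.0, 8.0, 10.0],
})

# Full correlation matrix (pandas uses Pearson correlation by default).
correlation_matrix = toy_df.corr()

# Masked version: hide the diagonal and the upper triangle, which only repeat
# the information of the lower triangle.
mask = np.triu(np.ones(correlation_matrix.shape, dtype=bool))
masked_matrix = correlation_matrix.mask(mask)

# Analogue of RESPONSES_TO_RETURN_CORR = ['response1'] with
# SET_RETURNED_LIMIT = 2: order by the response column and keep the top rows.
response_corr = correlation_matrix["response1"].sort_values(ascending=False)
top_two = response_corr.head(2)

print(masked_matrix)
print(top_two)
```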

### **Obtaining scatter plots and simple linear regressions**

In [None]:
DATA_IN_SAME_COLUMN = False

# Parameters to input when DATA_IN_SAME_COLUMN = True:
DATASET = None #Alternatively: object containing the dataset to be analyzed (e.g. DATASET = dataset)
COLUMN_WITH_PREDICT_VAR_X = 'X' # Alternatively: correct name for X-column
COLUMN_WITH_RESPONSE_VAR_Y = 'Y' # Alternatively: correct name for Y-column
COLUMN_WITH_LABELS = 'label_column' # Alternatively: correct name for column with the labels or groups

# DATA_IN_SAME_COLUMN = False: set as True if all the values to plot are in the same column.
# If DATA_IN_SAME_COLUMN = True, you must specify the dataframe containing the data as DATASET;
# the column containing the predictor variable (X) as COLUMN_WITH_PREDICT_VAR_X; the column 
# containing the responses to plot (Y) as COLUMN_WITH_RESPONSE_VAR_Y; and the column 
# containing the labels (subgroup) indication as COLUMN_WITH_LABELS. 
# DATASET is an object, so do not declare it in quotes. The other three arguments (columns' names) 
# are strings, so declare in quotes. 

# Example: suppose you have a dataframe saved as dataset, and two groups A and B to compare. 
# All the results for both groups are in a column named 'results', which will be plotted against
# the time, saved as 'time' (X = 'time'; Y = 'results'). If the result is for
# an entry from group A, then a column named 'group' has the value 'A'. If it is for group B,
# column 'group' shows the value 'B'. In this example:
# DATA_IN_SAME_COLUMN = True,
# DATASET = dataset,
# COLUMN_WITH_PREDICT_VAR_X = 'time',
# COLUMN_WITH_RESPONSE_VAR_Y = 'results', 
# COLUMN_WITH_LABELS = 'group'
# If you want to declare a list of dictionaries, keep DATA_IN_SAME_COLUMN = False and keep
# DATASET = None (the other arguments may be set as None, but it is not mandatory: 
# COLUMN_WITH_PREDICT_VAR_X = None, COLUMN_WITH_RESPONSE_VAR_Y = None, COLUMN_WITH_LABELS = None).


# Parameter to input when DATA_IN_SAME_COLUMN = False:
LIST_OF_DICTIONARIES_WITH_SERIES_TO_ANALYZE = [
    
    {'x': None, 'y': None, 'lab': None}, 
    {'x': None, 'y': None, 'lab': None}, 
    {'x': None, 'y': None, 'lab': None}, 
    {'x': None, 'y': None, 'lab': None}, 
    {'x': None, 'y': None, 'lab': None}, 
    {'x': None, 'y': None, 'lab': None}, 
    {'x': None, 'y': None, 'lab': None}, 
    {'x': None, 'y': None, 'lab': None}, 
    {'x': None, 'y': None, 'lab': None}, 
    {'x': None, 'y': None, 'lab': None}, 
    {'x': None, 'y': None, 'lab': None}
    
]
# LIST_OF_DICTIONARIES_WITH_SERIES_TO_ANALYZE: if data is already converted to series, lists
# or arrays, provide them as a list of dictionaries. It must be declared as a list, in brackets,
# even if there is a single dictionary.
# Always use the same keys: 'x' for the X-series (predictor variables); 'y' for the Y-series
# (response variables); and 'lab' for the labels. If you do not want to declare a series, simply
# keep as None, but do not remove or rename a key (ALWAYS USE THE KEYS SHOWN AS MODEL).
# If you want, you can remove elements (dictionaries) from the list to declare fewer elements;
# and you can also add more elements (dictionaries) to the lists, if you need to plot more series.
# Simply put a comma after the last element from the list and declare a new dictionary, keeping the
# same keys: {'x': x_series, 'y': y_series, 'lab': label}, where x_series, y_series and label
# represent the series and label of the added dictionary (you can pass 'lab': None, but if 
# 'x' or 'y' are None, the new dictionary will be ignored).

# Examples:
# LIST_OF_DICTIONARIES_WITH_SERIES_TO_ANALYZE = 
# [{'x': DATASET['X'], 'y': DATASET['Y'], 'lab': 'label'}]
# will plot a single series. In turn:
# LIST_OF_DICTIONARIES_WITH_SERIES_TO_ANALYZE = 
# [{'x': DATASET['X'], 'y': DATASET['Y1'], 'lab': 'label'}, {'x': DATASET['X'], 'y': DATASET['Y2'], 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}]
# will plot two series, Y1 x X and Y2 x X.
# Notice that all dictionaries where 'x' or 'y' are None are automatically ignored.
# If None is provided to 'lab', an automatic label will be generated.


X_AXIS_ROTATION = 70
#Rotation of X axis labels. Alternatively, insert any numeric value from 0 to 90 (degrees).
Y_AXIS_ROTATION = 0
#Rotation of Y axis labels. Alternatively, insert any numeric value from 0 to 90 (degrees).

SHOW_LINEAR_REG = True
# Set SHOW_LINEAR_REG = True to plot the linear regression lines and show 
# the linear regression calculated for each pair Y x X (i.e., each equation 
# Y = aX + b, as well as the calculated R² coefficient). 
# Set SHOW_LINEAR_REG = False to omit both the regression lines from the graphic and
# the equations and R² coefficients obtained.

GRID = True #Alternatively: True or False
# If GRID = False, no grid lines are shown in the graphic.
ADD_SPLINE_LINES = False #Alternatively: True or False
# If ADD_SPLINE_LINES = False, no lines connecting the successive values are shown.
# Since we are obtaining a scatter plot, there is no point in omitting the dots,
# as can be done in the time series visualization function.
HORIZONTAL_AXIS_TITLE = None #Alternatively: string inside quotes for horizontal title
VERTICAL_AXIS_TITLE = None #Alternatively: string inside quotes for vertical title
PLOT_TITLE = None #Alternatively: string inside quotes for graphic title
# e.g. HORIZONTAL_AXIS_TITLE = "X", VERTICAL_AXIS_TITLE = "Y", PLOT_TITLE = "YxX"

EXPORT_PNG = False
# Set EXPORT_PNG = False if you do not want to export the obtained image;
# Set EXPORT_PNG = True to export the obtained image.
DIRECTORY_TO_SAVE = None
# This parameter has effect only if EXPORT_PNG = True.
# (string, in quotes): input the path of the directory 
# (e.g. folder path) where the file will be stored. e.g. DIRECTORY_TO_SAVE = "" 
# or DIRECTORY_TO_SAVE = "folder"
# If EXPORT_PNG = True and DIRECTORY_TO_SAVE = None, the file will be saved in the root
# path, DIRECTORY_TO_SAVE = ""
FILE_NAME = None
# This parameter has effect only if EXPORT_PNG = True.
# (string, in quotes): input the name you want for the file without the 
# extension, which will be 'png'. e.g. FILE_NAME = "my_image" will save a file 'my_image.png' 
# If EXPORT_PNG = True and FILE_NAME = None, the file will be saved as:
# 'scatter_plot_lin_reg.png'
# WARNING: if there is already a file in the path DIRECTORY_TO_SAVE saved as FILE_NAME,
# the file will be overwritten.
PNG_RESOLUTION_DPI = 330
# This parameter has effect only if EXPORT_PNG = True.
# Alternatively, input an integer that will correspond to the resolution of the exported
# image in dpi. If PNG_RESOLUTION_DPI = None, it will be set as 330.


# JSON-formatted list containing all series converted to NumPy arrays, 
#  with timestamps parsed as datetimes, and all the information regarding the linear regressions, 
# including the predicted values for plotting, returned as list_of_dictionaries_with_series_and_predictions. 
# Simply modify this object on the left of equality:
list_of_dictionaries_with_series_and_predictions = scatter_plot_lin_reg (data_in_same_column = DATA_IN_SAME_COLUMN, df = DATASET, column_with_predict_var_x = COLUMN_WITH_PREDICT_VAR_X, column_with_response_var_y = COLUMN_WITH_RESPONSE_VAR_Y, column_with_labels = COLUMN_WITH_LABELS, list_of_dictionaries_with_series_to_analyze = LIST_OF_DICTIONARIES_WITH_SERIES_TO_ANALYZE, x_axis_rotation = X_AXIS_ROTATION, y_axis_rotation = Y_AXIS_ROTATION, show_linear_reg = SHOW_LINEAR_REG, grid = GRID, add_splines_lines = ADD_SPLINE_LINES, horizontal_axis_title = HORIZONTAL_AXIS_TITLE, vertical_axis_title = VERTICAL_AXIS_TITLE, plot_title = PLOT_TITLE, export_png = EXPORT_PNG, directory_to_save = DIRECTORY_TO_SAVE, file_name = FILE_NAME, png_resolution_dpi = PNG_RESOLUTION_DPI)
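
The regression Y = aX + b and the R² coefficient mentioned for `SHOW_LINEAR_REG` can be reproduced by hand. The sketch below uses `numpy.polyfit` as a stand-in; the routine actually used inside `scatter_plot_lin_reg` is not shown in this notebook, and the sample values are hypothetical:

```python
import numpy as np

# Hypothetical series, as they could be passed in the 'x' and 'y' keys.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 3.1, 4.9, 7.2, 9.0])

# Least-squares fit of the straight line Y = aX + b.
slope, intercept = np.polyfit(x, y, deg=1)
y_pred = slope * x + intercept

# R² = 1 - (residual sum of squares) / (total sum of squares).
ss_res = np.sum((y - y_pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

print(f"Y = {slope:.3f}*X + {intercept:.3f}, R² = {r_squared:.4f}")
```

An R² close to 1 indicates that the fitted line explains most of the variance of Y; it says nothing about whether a straight line is the appropriate model, so the scatter plot itself should still be inspected.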

### **Visualizing time series**

In [None]:
DATA_IN_SAME_COLUMN = False

# Parameters to input when DATA_IN_SAME_COLUMN = True:
DATASET = None #Alternatively: object containing the dataset to be analyzed (e.g. DATASET = dataset)
COLUMN_WITH_PREDICT_VAR_X = 'X' # Alternatively: correct name for X-column
COLUMN_WITH_RESPONSE_VAR_Y = 'Y' # Alternatively: correct name for Y-column
COLUMN_WITH_LABELS = 'label_column' # Alternatively: correct name for column with the labels or groups

# DATA_IN_SAME_COLUMN = False: set as True if all the values to plot are in the same column.
# If DATA_IN_SAME_COLUMN = True, you must specify the dataframe containing the data as DATASET;
# the column containing the predictor variable (X) as COLUMN_WITH_PREDICT_VAR_X; the column 
# containing the responses to plot (Y) as COLUMN_WITH_RESPONSE_VAR_Y; and the column 
# containing the labels (subgroup) indication as COLUMN_WITH_LABELS. 
# DATASET is an object, so do not declare it in quotes. The other three arguments (columns' names) 
# are strings, so declare in quotes. 

# Example: suppose you have a dataframe saved as dataset, and two groups A and B to compare. 
# All the results for both groups are in a column named 'results', which will be plotted against
# the time, saved as 'time' (X = 'time'; Y = 'results'). If the result is for
# an entry from group A, then a column named 'group' has the value 'A'. If it is for group B,
# column 'group' shows the value 'B'. In this example:
# DATA_IN_SAME_COLUMN = True,
# DATASET = dataset,
# COLUMN_WITH_PREDICT_VAR_X = 'time',
# COLUMN_WITH_RESPONSE_VAR_Y = 'results', 
# COLUMN_WITH_LABELS = 'group'
# If you want to declare a list of dictionaries, keep DATA_IN_SAME_COLUMN = False and keep
# DATASET = None (the other arguments may be set as None, but it is not mandatory: 
# COLUMN_WITH_PREDICT_VAR_X = None, COLUMN_WITH_RESPONSE_VAR_Y = None, COLUMN_WITH_LABELS = None).


# Parameter to input when DATA_IN_SAME_COLUMN = False:
LIST_OF_DICTIONARIES_WITH_SERIES_TO_ANALYZE = [
    
    {'x': None, 'y': None, 'lab': None}, 
    {'x': None, 'y': None, 'lab': None}, 
    {'x': None, 'y': None, 'lab': None}, 
    {'x': None, 'y': None, 'lab': None}, 
    {'x': None, 'y': None, 'lab': None}, 
    {'x': None, 'y': None, 'lab': None}, 
    {'x': None, 'y': None, 'lab': None}, 
    {'x': None, 'y': None, 'lab': None}, 
    {'x': None, 'y': None, 'lab': None}, 
    {'x': None, 'y': None, 'lab': None}, 
    {'x': None, 'y': None, 'lab': None}
    
]
# LIST_OF_DICTIONARIES_WITH_SERIES_TO_ANALYZE: if data is already converted to series, lists
# or arrays, provide them as a list of dictionaries. It must be declared as a list, in brackets,
# even if there is a single dictionary.
# Always use the same keys: 'x' for the X-series (predictor variables); 'y' for the Y-series
# (response variables); and 'lab' for the labels. If you do not want to declare a series, simply
# keep as None, but do not remove or rename a key (ALWAYS USE THE KEYS SHOWN AS MODEL).
# If you want, you can remove elements (dictionaries) from the list to declare fewer elements;
# and you can also add more elements (dictionaries) to the lists, if you need to plot more series.
# Simply put a comma after the last element from the list and declare a new dictionary, keeping the
# same keys: {'x': x_series, 'y': y_series, 'lab': label}, where x_series, y_series and label
# represent the series and label of the added dictionary (you can pass 'lab': None, but if 
# 'x' or 'y' are None, the new dictionary will be ignored).

# Examples:
# LIST_OF_DICTIONARIES_WITH_SERIES_TO_ANALYZE = 
# [{'x': DATASET['X'], 'y': DATASET['Y'], 'lab': 'label'}]
# will plot a single series. In turn:
# LIST_OF_DICTIONARIES_WITH_SERIES_TO_ANALYZE = 
# [{'x': DATASET['X'], 'y': DATASET['Y1'], 'lab': 'label'}, {'x': DATASET['X'], 'y': DATASET['Y2'], 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}]
# will plot two series, Y1 x X and Y2 x X.
# Notice that all dictionaries where 'x' or 'y' are None are automatically ignored.
# If None is provided to 'lab', an automatic label will be generated.


X_AXIS_ROTATION = 70
#Rotation of X axis labels. Alternatively, insert any numeric value from 0 to 90 (degrees).
Y_AXIS_ROTATION = 0
#Rotation of Y axis labels. Alternatively, insert any numeric value from 0 to 90 (degrees).
GRID = True #Alternatively: True or False
# If GRID = False, no grid lines are shown in the graphic.
ADD_SPLINE_LINES = True #Alternatively: True or False
# If ADD_SPLINE_LINES = False, no lines connecting the successive values are shown.
ADD_SCATTER_DOTS = False #Alternatively: True or False
# If ADD_SCATTER_DOTS = False, no dots representing the data points are shown.
# In a time series visualization, the dots may be omitted, so that only the lines are drawn.
HORIZONTAL_AXIS_TITLE = None #Alternatively: string inside quotes for horizontal title
VERTICAL_AXIS_TITLE = None #Alternatively: string inside quotes for vertical title
PLOT_TITLE = None #Alternatively: string inside quotes for graphic title
# e.g. HORIZONTAL_AXIS_TITLE = "X", VERTICAL_AXIS_TITLE = "Y", PLOT_TITLE = "YxX"

EXPORT_PNG = False
# Set EXPORT_PNG = False if you do not want to export the obtained image;
# Set EXPORT_PNG = True to export the obtained image.
DIRECTORY_TO_SAVE = None
# This parameter has effect only if EXPORT_PNG = True.
# (string, in quotes): input the path of the directory 
# (e.g. folder path) where the file will be stored. e.g. DIRECTORY_TO_SAVE = "" 
# or DIRECTORY_TO_SAVE = "folder"
# If EXPORT_PNG = True and DIRECTORY_TO_SAVE = None, the file will be saved in the root
# path, DIRECTORY_TO_SAVE = ""
FILE_NAME = None
# This parameter has effect only if EXPORT_PNG = True.
# (string, in quotes): input the name you want for the file without the 
# extension, which will be 'png'. e.g. FILE_NAME = "my_image" will save a file 'my_image.png' 
# If EXPORT_PNG = True and FILE_NAME = None, the file will be saved as:
# 'time_series_vis.png'
# WARNING: if there is already a file in the path DIRECTORY_TO_SAVE saved as FILE_NAME,
# the file will be overwritten.
PNG_RESOLUTION_DPI = 330
# This parameter has effect only if EXPORT_PNG = True.
# Alternatively, input an integer that will correspond to the resolution of the exported
# image in dpi. If PNG_RESOLUTION_DPI = None, it will be set as 330.


time_series_vis (data_in_same_column = DATA_IN_SAME_COLUMN, df = DATASET, column_with_predict_var_x = COLUMN_WITH_PREDICT_VAR_X, column_with_response_var_y = COLUMN_WITH_RESPONSE_VAR_Y, column_with_labels = COLUMN_WITH_LABELS, list_of_dictionaries_with_series_to_analyze = LIST_OF_DICTIONARIES_WITH_SERIES_TO_ANALYZE, x_axis_rotation = X_AXIS_ROTATION, y_axis_rotation = Y_AXIS_ROTATION, grid = GRID, add_splines_lines = ADD_SPLINE_LINES, add_scatter_dots = ADD_SCATTER_DOTS, horizontal_axis_title = HORIZONTAL_AXIS_TITLE, vertical_axis_title = VERTICAL_AXIS_TITLE, plot_title = PLOT_TITLE, export_png = EXPORT_PNG, directory_to_save = DIRECTORY_TO_SAVE, file_name = FILE_NAME, png_resolution_dpi = PNG_RESOLUTION_DPI)
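
The convention described above for LIST_OF_DICTIONARIES_WITH_SERIES_TO_ANALYZE (dictionaries where 'x' or 'y' are None are ignored; a missing 'lab' receives an automatic label) can be sketched as follows. This is only an illustration: `select_valid_series` is a hypothetical helper, not the code used by `time_series_vis`, and the pattern of the automatic label is an assumption.

```python
def select_valid_series(list_of_dicts):
    """Keep only the dictionaries whose 'x' and 'y' are both declared.

    Hypothetical helper written to illustrate the convention; it is not
    the notebook's actual implementation.
    """
    valid = []
    for entry in list_of_dicts:
        # 'is None' is used so that the check also works for arrays and series.
        if entry.get('x') is None or entry.get('y') is None:
            continue  # dictionaries without 'x' or 'y' are ignored
        label = entry.get('lab')
        if label is None:
            # Assumed label pattern; the real automatic label may differ.
            label = f"series_{len(valid) + 1}"
        valid.append({'x': entry['x'], 'y': entry['y'], 'lab': label})
    return valid


series = [
    {'x': [1, 2, 3], 'y': [10, 20, 30], 'lab': 'label'},
    {'x': [1, 2, 3], 'y': [5, 15, 25], 'lab': None},
    {'x': None, 'y': None, 'lab': None},
]
selected = select_valid_series(series)
print([entry['lab'] for entry in selected])
```

Only the first two dictionaries survive the filter, which is why the unused `{'x': None, 'y': None, 'lab': None}` placeholders can be left in the list.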

### **Visualizing histograms**

In [None]:
# REMEMBER: A histogram is the representation of a statistical distribution 
# of a given variable.

DATASET = dataset #Alternatively: object containing the dataset to be analyzed

COLUMN_TO_ANALYZE = 'analyzed_variable'
#Alternatively: other column in quotes, substituting 'analyzed_variable'
# e.g., if the analyzed variable is in a column named 'column1':
# COLUMN_TO_ANALYZE = 'column1'

TOTAL_OF_BINS = 10
# This parameter must be an integer number: it represents the total of bins of the 
# histogram, i.e., the number of divisions of the sample space (in how many intervals
# the sample space will be divided).
# Manually adjust this parameter to obtain more or less resolution of the statistical
# distribution: fewer bins tend to result in higher counts of values per bin, since
# a larger interval of values is grouped. After modifying the total of bins, do not forget
# to adjust the bar width in SET_GRAPHIC_BAR_WIDTH.
# Examples: TOTAL_OF_BINS = 50, to divide the sample space into 50 equally-separated 
# intervals; TOTAL_OF_BINS = 10 to divide it into 10 intervals; TOTAL_OF_BINS = 100 to
# divide it into 100 intervals.
NORMAL_CURVE_OVERLAY = True
#Alternatively: set NORMAL_CURVE_OVERLAY = True to show a normal curve overlaying the
# histogram; or set NORMAL_CURVE_OVERLAY = False to omit the normal curve (show only
# the histogram).

X_AXIS_ROTATION = 0
#Rotation of X axis labels. Alternatively, insert any numeric value from 0 to 90 (degrees).
Y_AXIS_ROTATION = 0
#Rotation of Y axis labels. Alternatively, insert any numeric value from 0 to 90 (degrees).
GRID = True #Alternatively: True or False
# If GRID = False, no grid lines are shown in the graphic.
HORIZONTAL_AXIS_TITLE = None #Alternatively: string inside quotes for horizontal title
VERTICAL_AXIS_TITLE = None #Alternatively: string inside quotes for vertical title
PLOT_TITLE = None #Alternatively: string inside quotes for graphic title
# e.g. HORIZONTAL_AXIS_TITLE = "X", VERTICAL_AXIS_TITLE = "Y", PLOT_TITLE = "YxX"

EXPORT_PNG = False
# Set EXPORT_PNG = False if you do not want to export the obtained image;
# Set EXPORT_PNG = True to export the obtained image.
DIRECTORY_TO_SAVE = None
# This parameter has effect only if EXPORT_PNG = True.
# (string, in quotes): input the path of the directory 
# (e.g. folder path) where the file will be stored. e.g. DIRECTORY_TO_SAVE = "" 
# or DIRECTORY_TO_SAVE = "folder"
# If EXPORT_PNG = True and DIRECTORY_TO_SAVE = None, the file will be saved in the root
# path, DIRECTORY_TO_SAVE = ""
FILE_NAME = None
# This parameter has effect only if EXPORT_PNG = True.
# (string, in quotes): input the name you want for the file without the 
# extension, which will be 'png'. e.g. FILE_NAME = "my_image" will save a file 'my_image.png' 
# If EXPORT_PNG = True and FILE_NAME = None, the file will be saved as:
# 'histogram.png'
# WARNING: if there is already a file in the path DIRECTORY_TO_SAVE saved as FILE_NAME,
# the file will be overwritten.
PNG_RESOLUTION_DPI = 330
# This parameter has effect only if EXPORT_PNG = True.
# Alternatively, input an integer that will correspond to the resolution of the exported
# image in dpi. If PNG_RESOLUTION_DPI = None, it will be set as 330.

#New dataframes saved as general_stats and frequency_table.
# Simply modify these objects on the left of equality:
general_stats, frequency_table = histogram (df = DATASET, column_to_analyze = COLUMN_TO_ANALYZE, total_of_bins = TOTAL_OF_BINS, normal_curve_overlay = NORMAL_CURVE_OVERLAY, x_axis_rotation = X_AXIS_ROTATION, y_axis_rotation = Y_AXIS_ROTATION, grid = GRID, horizontal_axis_title = HORIZONTAL_AXIS_TITLE, vertical_axis_title = VERTICAL_AXIS_TITLE, plot_title = PLOT_TITLE, export_png = EXPORT_PNG, directory_to_save = DIRECTORY_TO_SAVE, file_name = FILE_NAME, png_resolution_dpi = PNG_RESOLUTION_DPI)
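
To make the role of TOTAL_OF_BINS concrete, the sketch below divides the sample space into equally-wide intervals and counts the values in each one. It is a minimal pure-Python illustration, assuming equal-width bins; `equal_width_histogram` is a hypothetical name and is not the `histogram` function called above.

```python
def equal_width_histogram(values, total_of_bins):
    """Count how many values fall in each of total_of_bins equal intervals."""
    lowest, highest = min(values), max(values)
    bin_width = (highest - lowest) / total_of_bins
    counts = [0] * total_of_bins
    for value in values:
        if bin_width == 0:
            index = 0  # all values are identical: a single occupied bin
        else:
            # The maximum value is placed in the last bin (closed right edge).
            index = min(int((value - lowest) / bin_width), total_of_bins - 1)
        counts[index] += 1
    edges = [lowest + i * bin_width for i in range(total_of_bins + 1)]
    return edges, counts


data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]
_, counts_5_bins = equal_width_histogram(data, 5)
_, counts_2_bins = equal_width_histogram(data, 2)
print(counts_5_bins)
print(counts_2_bins)
```

With fewer bins, each interval is wider and groups more values, so the counts per bin are higher and the shape of the distribution is shown with less resolution.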

### **Testing data normality and visualizing the probability plot**
- Check the probability that data is actually described by a normal distribution.

In [None]:
# WARNING: The statistical tests require at least 20 samples

DATASET = dataset #Alternatively: object containing the dataset to be analyzed

COLUMN_TO_ANALYZE = 'column_to_analyze' 
# COLUMN_TO_ANALYZE: column (variable) of the dataset that will be tested. Declare as a string,
# in quotes.
# e.g. COLUMN_TO_ANALYZE = 'col1' will analyze a column named 'col1'.

COLUMN_WITH_LABELS_TO_TEST_SUBGROUPS = None
# column_with_labels_to_test_subgroups: if there is a column with labels or
# subgroup indication, and the normality should be tested separately for each label, indicate
# it here as a string (in quotes). e.g. column_with_labels_to_test_subgroups = 'col2' 
# will retrieve the labels from 'col2'.
# Keep column_with_labels_to_test_subgroups = None if a single series (the whole column)
# will be tested.
    
ALPHA = 0.10
# Confidence level = 1 - ALPHA. For ALPHA = 0.10, we get a 0.90 = 90% confidence
# Set ALPHA = 0.05 to get 0.95 = 95% confidence in the analysis.
# Notice that, when less trust is needed, we can increase ALPHA to get less restrictive
# results.

SHOW_PROBABILITY_PLOT = True
#Alternatively: set SHOW_PROBABILITY_PLOT = True to obtain the probability plot for the
# variable Y (normal distribution tested). 
# Set SHOW_PROBABILITY_PLOT = False to omit the probability plot.
X_AXIS_ROTATION = 0
#Rotation of X axis labels. Alternatively, insert any numeric value from 0 to 90 (degrees).
Y_AXIS_ROTATION = 0
#Rotation of Y axis labels. Alternatively, insert any numeric value from 0 to 90 (degrees).
GRID = True #Alternatively: True or False
# If GRID = False, no grid lines are shown in the graphic.

EXPORT_PNG = False
# Set EXPORT_PNG = False if you do not want to export the obtained image;
# Set EXPORT_PNG = True to export the obtained image.
DIRECTORY_TO_SAVE = None
# This parameter has effect only if EXPORT_PNG = True.
# (string, in quotes): input the path of the directory 
# (e.g. folder path) where the file will be stored. e.g. DIRECTORY_TO_SAVE = "" 
# or DIRECTORY_TO_SAVE = "folder"
# If EXPORT_PNG = True and DIRECTORY_TO_SAVE = None, the file will be saved in the root
# path, DIRECTORY_TO_SAVE = ""
FILE_NAME = None
# This parameter has effect only if EXPORT_PNG = True.
# (string, in quotes): input the name you want for the file without the 
# extension, which will be 'png'. e.g. FILE_NAME = "my_image" will save a file 'my_image.png' 
# If EXPORT_PNG = True and FILE_NAME = None, the file will be saved as:
# 'probability_plot_normal.png'
# WARNING: if there is already a file in the path DIRECTORY_TO_SAVE saved as FILE_NAME,
# the file will be overwritten.
PNG_RESOLUTION_DPI = 330
# This parameter has effect only if EXPORT_PNG = True.
# Alternatively, input an integer that will correspond to the resolution of the exported
# image in dpi. If PNG_RESOLUTION_DPI = None, it will be set as 330.

# List of dictionaries containing the series, p-values, skewness and kurtosis returned as
# list_of_dicts
# Simply modify this object on the left of equality:
list_of_dicts = test_data_normality (df = DATASET, column_to_analyze = COLUMN_TO_ANALYZE, column_with_labels_to_test_subgroups = COLUMN_WITH_LABELS_TO_TEST_SUBGROUPS, alpha = ALPHA, show_probability_plot = SHOW_PROBABILITY_PLOT, x_axis_rotation = X_AXIS_ROTATION, y_axis_rotation = Y_AXIS_ROTATION, grid = GRID, export_png = EXPORT_PNG, directory_to_save = DIRECTORY_TO_SAVE, file_name = FILE_NAME, png_resolution_dpi = PNG_RESOLUTION_DPI)
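
The two ideas used in this cell, the probability plot and the comparison of a p-value with ALPHA, can be sketched with the standard library only. This is an illustration under stated assumptions, not the implementation of `test_data_normality`: the function names are hypothetical, the plotting positions `(i - 0.5) / n` are one common convention among several, and the p-values in the usage lines are made-up inputs rather than results of a real test.

```python
import random
from statistics import NormalDist, mean


def normal_probability_plot_points(sample):
    """Return (theoretical_quantiles, ordered_values) for a normal probability plot."""
    ordered = sorted(sample)
    n = len(ordered)
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    # Assumed plotting positions: (i - 0.5) / n for i = 1..n.
    theoretical = [standard_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return theoretical, ordered


def pearson_correlation(xs, ys):
    """Linear correlation between the two axes of the probability plot."""
    mean_x, mean_y = mean(xs), mean(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    spread_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return covariance / (spread_x * spread_y)


def is_normality_rejected(p_value, alpha=0.10):
    """Reject the hypothesis of normality when the p-value is below alpha."""
    return p_value < alpha


generator = random.Random(0)  # fixed seed, so the example is reproducible
normal_like = [generator.gauss(50, 5) for _ in range(20)]
skewed = [2 ** i for i in range(20)]

r_normal = pearson_correlation(*normal_probability_plot_points(normal_like))
r_skewed = pearson_correlation(*normal_probability_plot_points(skewed))
print(r_normal, r_skewed)

# Hypothetical p-value of 0.03: rejected at ALPHA = 0.10, not rejected at ALPHA = 0.01.
print(is_normality_rejected(0.03, alpha=0.10), is_normality_rejected(0.03, alpha=0.01))
```

In a probability plot, points from normally distributed data fall close to a straight line, so the correlation between the axes is close to 1; strongly skewed data bend away from the line and the correlation drops.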

## **Exporting the dataframe as CSV file (to notebook's workspace)**

In [None]:
## WARNING: all files exported from this function are .csv (comma separated values)

DATAFRAME_OBJ_TO_BE_EXPORTED = dataset
# Alternatively: object containing the dataset to be exported.
# DATAFRAME_OBJ_TO_BE_EXPORTED: dataframe object that is going to be exported from the
# function. Since it is an object (not a string), it should not be declared in quotes.
# example: DATAFRAME_OBJ_TO_BE_EXPORTED = dataset will export the dataset object.
# ATTENTION: The dataframe object must be a Pandas dataframe.

FILE_DIRECTORY_PATH = ""
# FILE_DIRECTORY_PATH - (string, in quotes): input the path of the directory 
# (e.g. folder path) where the file will be stored. e.g. FILE_DIRECTORY_PATH = "" 
# or FILE_DIRECTORY_PATH = "folder"
# If you want to export the file to AWS S3, this parameter will have no effect.
# In this case, you can set FILE_DIRECTORY_PATH = None

NEW_FILE_NAME_WITHOUT_EXTENSION = "dataset"
# NEW_FILE_NAME_WITHOUT_EXTENSION - (string, in quotes): input the name of the 
# file without the extension. e.g. set NEW_FILE_NAME_WITHOUT_EXTENSION = "my_file" 
# to export the CSV file 'my_file.csv' to notebook's workspace.

export_pd_dataframe_as_csv (dataframe_obj_to_be_exported = DATAFRAME_OBJ_TO_BE_EXPORTED, new_file_name_without_extension = NEW_FILE_NAME_WITHOUT_EXTENSION, file_directory_path = FILE_DIRECTORY_PATH)
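
As a rough illustration of how FILE_DIRECTORY_PATH and NEW_FILE_NAME_WITHOUT_EXTENSION combine into the final file path, consider the sketch below. `build_csv_path` is a hypothetical helper, and treating None like the empty string (workspace root) is an assumption; the actual `export_pd_dataframe_as_csv` may handle these cases differently.

```python
import os


def build_csv_path(file_directory_path, new_file_name_without_extension):
    """Join the directory and the file name, adding the '.csv' extension."""
    # Assumption: None and "" both mean the root of the notebook's workspace.
    directory = file_directory_path or ""
    return os.path.join(directory, new_file_name_without_extension + ".csv")


root_path = build_csv_path("", "dataset")
folder_path = build_csv_path("folder", "my_file")
print(root_path)
print(folder_path)

# With a Pandas dataframe named dataset, the path could then be passed to
# the dataframe's to_csv method, e.g. dataset.to_csv(root_path, index=False).
```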

## **Downloading a file from Google Colab to the local machine; or uploading a file from the machine to Colab's instant memory**

#### Case 1: upload a file to Colab's workspace

In [None]:
ACTION = 'upload'
# ACTION = 'download' to download the file to the local machine
# ACTION = 'upload' to upload a file from local machine to Google Colab's 
# instant memory

FILE_TO_DOWNLOAD_FROM_COLAB = None
# FILE_TO_DOWNLOAD_FROM_COLAB = None. This parameter is mandatory when
# ACTION = 'download'. 
# Declare as FILE_TO_DOWNLOAD_FROM_COLAB the name of the file that you want to download, with
# the corresponding extension.
# It should be declared as a string, in quotes.
# e.g. to download a dictionary named dict, FILE_TO_DOWNLOAD_FROM_COLAB = 'dict.pkl'
# To download a dataframe named df, declare FILE_TO_DOWNLOAD_FROM_COLAB = 'df.csv'
# To export a model named keras_model, declare FILE_TO_DOWNLOAD_FROM_COLAB = 'keras_model.h5'

# Dictionary storing the uploaded files returned as colab_files_dict.
# Simply modify this object on the left of the equality:
colab_files_dict = upload_to_or_download_file_from_colab (action = ACTION, file_to_download_from_colab = FILE_TO_DOWNLOAD_FROM_COLAB)

#### Case 2: download a file from Colab's workspace

In [None]:
ACTION = 'download'
# ACTION = 'download' to download the file to the local machine
# ACTION = 'upload' to upload a file from local machine to Google Colab's 
# instant memory

FILE_TO_DOWNLOAD_FROM_COLAB = None
# FILE_TO_DOWNLOAD_FROM_COLAB = None. This parameter is mandatory when
# ACTION = 'download'. 
# Declare as FILE_TO_DOWNLOAD_FROM_COLAB the name of the file that you want to download, with
# the corresponding extension.
# It should be declared as a string, in quotes.
# e.g. to download a dictionary named dict, FILE_TO_DOWNLOAD_FROM_COLAB = 'dict.pkl'
# To download a dataframe named df, declare FILE_TO_DOWNLOAD_FROM_COLAB = 'df.csv'
# To export a model named keras_model, declare FILE_TO_DOWNLOAD_FROM_COLAB = 'keras_model.h5'

upload_to_or_download_file_from_colab (action = ACTION, file_to_download_from_colab = FILE_TO_DOWNLOAD_FROM_COLAB)

## **Exporting a list of files from notebook's workspace to AWS Simple Storage Service (S3)**

In [None]:
LIST_OF_FILE_NAMES_WITH_EXTENSIONS = ['s3_file1.txt', 's3_file2.txt']
# LIST_OF_FILE_NAMES_WITH_EXTENSIONS: list containing all the files to export to S3.
# Declare it as a list even if only a single file will be exported.
# It must be a list of strings containing the file names followed by the extensions.
# Example, to export a single file my_file.ext, where my_file is the name and ext is the
# extension:
# LIST_OF_FILE_NAMES_WITH_EXTENSIONS = ['my_file.ext']
# To export 3 files, file1.ext1, file2.ext2, and file3.ext3:
# LIST_OF_FILE_NAMES_WITH_EXTENSIONS = ['file1.ext1', 'file2.ext2', 'file3.ext3']
# Other examples:
# LIST_OF_FILE_NAMES_WITH_EXTENSIONS = ['Screen_Shot.png', 'dataset.csv']
# LIST_OF_FILE_NAMES_WITH_EXTENSIONS = ["dictionary.pkl", "model.h5"]
# LIST_OF_FILE_NAMES_WITH_EXTENSIONS = ['doc.pdf', 'model.dill']

DIRECTORY_OF_NOTEBOOK_WORKSPACE_STORING_FILES_TO_EXPORT = ''
# DIRECTORY_OF_NOTEBOOK_WORKSPACE_STORING_FILES_TO_EXPORT: directory from notebook's workspace
# from which the files will be exported to S3. Keep it None, or
# DIRECTORY_OF_NOTEBOOK_WORKSPACE_STORING_FILES_TO_EXPORT = "/"; or
# DIRECTORY_OF_NOTEBOOK_WORKSPACE_STORING_FILES_TO_EXPORT = '' (empty string) to export from
# the root (main) directory.
# Alternatively, set as a string containing only the directories and folders, not the file names.
# Examples: DIRECTORY_OF_NOTEBOOK_WORKSPACE_STORING_FILES_TO_EXPORT = 'folder1';
# DIRECTORY_OF_NOTEBOOK_WORKSPACE_STORING_FILES_TO_EXPORT = 'folder1/folder2/'
    
# For this function, all exported files must be located in the same directory.

S3_BUCKET_NAME = 'my_bucket'
## This parameter is mandatory to access an AWS S3 bucket. Substitute it for a string
# with the bucket's name. e.g. s3_bucket_name = "aws-bucket-1" accesses a bucket named
# "aws-bucket-1"

S3_OBJECT_FOLDER_PREFIX = ""
# S3_OBJECT_FOLDER_PREFIX = None. Keep it None; or as an empty string 
# (S3_OBJECT_FOLDER_PREFIX = ''); or as the root "/" to export the 
# files to the root level of the bucket, instead of to one of its folders.
# Alternatively, set it as a string containing the subfolder of the bucket that will store the files:
# Suppose that your bucket (admin-created) has four objects with the following object 
# keys: Development/Projects.xlsx; Finance/statement1.pdf; Private/taxdocument.pdf; and
# s3-dg.pdf. 
# The s3-dg.pdf key does not have a prefix, so its object appears directly 
# at the root level of the bucket. If you open the Development/ folder, you see 
# the Projects.xlsx object in it.
# In summary, if the path of the file is: 'bucket/my_path/.../file.csv'
# where 'bucket' is the bucket's name, prefix = 'my_path/.../', without the
# 'file.csv' (file name with extension) last part.

# So, declare the prefix as S3_OBJECT_FOLDER_PREFIX to export the files to
# a given folder (directory) of the bucket.
# DO NOT PUT A SLASH before (to the left of) the prefix;
# DO NOT ADD THE BUCKET'S NAME to the left of the prefix:
# S3_OBJECT_FOLDER_PREFIX = "bucket_directory1/.../bucket_directoryN/"

# Alternatively, provide the full path of a given file if you want to import only it:
# S3_OBJECT_FOLDER_PREFIX = "bucket_directory1/.../bucket_directoryN/my_file.ext"
# where my_file is the file's name, and ext is its extension.


# Attention: after running this function for connecting with AWS Simple Storage Service (S3), 
# your 'AWS Access key ID' and your 'Secret access key' will be requested.
# The 'Secret access key' will be hidden through dots, so it cannot be visualized or copied by
# other users. On the other hand, the same is not true for 'Access key ID', the bucket's name 
# and the prefix. All of these are sensitive information from the organization.
# Therefore, after running this cell, always remember to clear its output
# and to remove such information from the strings.
# Remember that these data may grant privileges for accessing protected information, 
# so they should not be used by non-authorized people.

# Also, remember to delete the files from the workspace after finishing the analysis.
# The cost of storing files in S3 is much lower than the cost of storing them directly in the
# workspace. Also, files stored in S3 may be accessed by users other than those with access to
# the notebook's workspace.
export_files_to_s3 (list_of_file_names_with_extensions = LIST_OF_FILE_NAMES_WITH_EXTENSIONS, directory_of_notebook_workspace_storing_files_to_export = DIRECTORY_OF_NOTEBOOK_WORKSPACE_STORING_FILES_TO_EXPORT, s3_bucket_name = S3_BUCKET_NAME, s3_obj_prefix = S3_OBJECT_FOLDER_PREFIX)
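
The rule for the prefix described above (no slash and no bucket name to the left of the prefix; file name appended at the end) can be sketched as a small key-building function. `build_s3_object_key` is a hypothetical helper used only for illustration, and the normalization of slashes is an assumption about reasonable behavior, not a description of `export_files_to_s3`.

```python
def build_s3_object_key(s3_object_folder_prefix, file_name_with_extension):
    """Combine a folder prefix and a file name into an S3 object key."""
    # None, '' and '/' are all treated here as the root level of the bucket.
    prefix = (s3_object_folder_prefix or "").strip("/")
    if prefix == "":
        return file_name_with_extension  # object stored at the bucket root
    return prefix + "/" + file_name_with_extension


print(build_s3_object_key("", "s3-dg.pdf"))
print(build_s3_object_key("Development/", "Projects.xlsx"))
print(build_s3_object_key("bucket_directory1/bucket_directory2", "my_file.ext"))

# The resulting key is what an S3 client would receive as the object key,
# e.g. the Key argument of boto3's upload_file(Filename, Bucket, Key).
```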

****

# **One-Hot Encoding - Background**

If there are **categorical features**, they should be converted into numerical variables so that they can be processed by the machine learning algorithms.

\- We can assign integer values for each one of the categories. This works well for situations where there is a scale or order for the assignment of the variables (e.g., if there is a satisfaction grade).

\- On the other hand, the results may be compromised if there is no order. That is because ML algorithms assume that, if two categories have close numbers, then the categories are similar, which is not necessarily true. There are cases where the categories have no relation with each other.

\- In these cases, the best strategy is the One-Hot Encoding. For each category, the One-Hot Encoder creates a new column in the dataset. This new column is a binary variable that is equal to zero if the row is not classified in that category, and equal to 1 when the row represents an element of that category.

\- Naturally, the number of columns grows with the number of possible labels. The One-Hot Encoder from Sklearn creates a SciPy sparse matrix, which stores only the positions of the non-zero elements. Then, the computational cost is reduced, since we do not store a huge number of zeros.

\- Since each column is a binary variable of the type "is classified in this category or not", we expect that the created columns contain more zeros than 1s. That is because if an element belongs to one category (= 1), it does not belong to the others, so its value is zero for all other columns.
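
The points above can be illustrated with a small pure-Python sketch. It is not the Sklearn encoder: `ordinal_encode` and `one_hot_encode` are hypothetical functions, and the list of `(row, column)` pairs only mimics the idea of a sparse matrix that keeps the positions of the 1s.

```python
def ordinal_encode(values, ordered_categories):
    """Map each category to its position in a meaningful order."""
    position = {category: index for index, category in enumerate(ordered_categories)}
    return [position[value] for value in values]


def one_hot_encode(values):
    """Create one binary column per category.

    Returns the category of each column, the dense 0/1 rows, and a
    sparse-like list with the (row, column) positions of the 1s.
    """
    categories = sorted(set(values))
    column_of = {category: j for j, category in enumerate(categories)}
    dense = [
        [1 if column_of[value] == j else 0 for j in range(len(categories))]
        for value in values
    ]
    nonzero_positions = [(i, column_of[value]) for i, value in enumerate(values)]
    return categories, dense, nonzero_positions


# Ordered categories: integer codes preserve the scale.
grades = ordinal_encode(['low', 'high', 'medium'], ['low', 'medium', 'high'])
print(grades)

# Unordered categories: one binary column per color.
colors = ['red', 'green', 'blue', 'green']
categories, dense, nonzero_positions = one_hot_encode(colors)
print(categories)
print(dense)
print(nonzero_positions)
```

Each row of `dense` contains a single 1, so the sparse-like representation needs only one entry per row, no matter how many categories (columns) exist.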