# **Text Preprocessing**

## _Data Extraction Workflow Notebook 1_

## Content:
1. Removing leading or trailing white spaces or characters (trim) from string variables, and modifying the variable type;
2. Capitalizing or lowercasing string variables (string homogenizing);
3. Adding contractions to the contractions library;
4. Correcting contracted strings;
5. Substituting (replacing) substrings in string variables;
6. Inverting the order of the string characters;
7. Slicing the strings;
8. Getting the leftmost characters from the strings (retrieve first characters);
9. Getting the rightmost characters from the strings (retrieve last characters);
10. Joining a list of strings into a single string;
11. Splitting strings into a list of strings;
12. Substituting (replacing or switching) whole strings with different text values (in string variables);
13. Replacing strings with Machine Learning: finding similar strings and replacing them with standard strings;
14. Searching for a Regular Expression (RegEx) within a list of strings;
15. Replacing a Regular Expression (RegEx) within a list of strings.

Marco Cesar Prado Soares, Data Scientist Specialist - Bayer Crop Science LATAM
- marcosoares.feq@gmail.com
- marco.soares@bayer.com

In [None]:
# To install a library (e.g. tensorflow), uncomment and run:
# ! pip install tensorflow
# To update a library (e.g. tensorflow), uncomment and run:
# ! pip install tensorflow --upgrade
# To update pip, uncomment and run:
# ! pip install pip --upgrade
# To check whether a library is installed and view its information, uncomment and run
# (e.g. tensorflow):
# ! pip show tensorflow
# To import a Python file (e.g. idsw_etl.py) saved in the notebook's workspace directory,
# uncomment and run:
# import idsw_etl
# or:
# import idsw_etl as etl

## **Load Python Libraries in Global Context**

In [None]:
import pandas as pd
import numpy as np

# **Function for mounting Google Drive or S3 (AWS Simple Storage Service) bucket**

In [None]:
def mount_storage_system (source = 'aws', path_to_store_imported_s3_bucket = '', s3_bucket_name = None, s3_obj_prefix = None):
    
    # source = 'google' for mounting the google drive;
    # source = 'aws' for mounting an AWS S3 bucket.
    
    # THE FOLLOWING PARAMETERS HAVE EFFECT ONLY WHEN source == 'aws'
    
    # path_to_store_imported_s3_bucket: path of the Python environment to which the
    # S3 bucket contents will be imported. If it is None, an empty string, or if 
    # path_to_store_imported_s3_bucket = '/', the bucket will be imported to the root path. 
    # Alternatively, input the path as a string (in quotes). e.g. 
    # path_to_store_imported_s3_bucket = 'copied_s3_bucket'
    
    # s3_bucket_name = None.
    ## This parameter is mandatory to access an AWS S3 bucket. Replace it with a string
    # containing the bucket's name. e.g. s3_bucket_name = "aws-bucket-1" accesses a bucket
    # named "aws-bucket-1"
    
    # s3_obj_prefix = None. Keep it None or as an empty string (s3_obj_prefix = '')
    # to import the whole bucket content, instead of a single object from it.
    # Alternatively, set it as a string containing the subfolder from the bucket to import:
    # Suppose that your bucket (admin-created) has four objects with the following object 
    # keys: Development/Projects.xlsx; Finance/statement1.pdf; Private/taxdocument.pdf; and
    # s3-dg.pdf. The s3-dg.pdf key does not have a prefix, so its object appears directly 
    # at the root level of the bucket. If you open the Development/ folder, you see 
    # the Projects.xlsx object in it.
    # Check Amazon documentation:
    # https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-keys.html
    
    # In summary, if the path of the file is: 'bucket/my_path/.../file.csv'
    # where 'bucket' is the bucket's name, the prefix is 'my_path/.../', without the
    # final 'file.csv' part (file name with extension).
    
    # So, declare the prefix as s3_obj_prefix to import only files from
    # a given folder (directory) of the bucket.
    # DO NOT PUT A SLASH before (to the left of) the prefix;
    # DO NOT ADD THE BUCKET'S NAME to the left of the prefix:
    # s3_obj_prefix = "bucket_directory1/.../bucket_directoryN/"

    # Alternatively, provide the full path of a given file if you want to import only it:
    # s3_obj_prefix = "bucket_directory1/.../bucket_directoryN/my_file.ext"
    # where my_file is the file's name, and ext is its extension.


    # Attention: after running this function for fetching from AWS Simple Storage Service (S3), 
    # your 'AWS Access key ID' and your 'Secret access key' will be requested.
    # The 'Secret access key' is hidden as you type, so it cannot be seen or copied by
    # other users. The same is not true for the 'Access key ID', the bucket's name 
    # and the prefix. All of these are sensitive information from the organization.
    # Therefore, after importing the data, always remember to clear the output of this cell
    # and to remove such information from the strings.
    # Remember that these credentials may grant access to protected information, so they
    # must not be used by unauthorized people.

    # Also, remember to delete the imported files from the workspace after finishing the analysis.
    # The cost of storing files in S3 is much lower than that of storing them directly in the
    # workspace. Also, files stored in S3 may be accessed by users other than those with access to
    # the notebook's workspace.
    
    
    if (source == 'google'):
        
        from google.colab import drive
        # Google Colab library must be imported only in case it is
        # going to be used, for avoiding AWS compatibility issues.
        
        print("Associate the Python environment to your Google Drive account, and authorize the access in the opened window.")
        
        drive.mount('/content/drive')
        
        print("Now your Python environment is connected to your Google Drive: the root directory of your environment is now the root of your Google Drive.")
        print("In Google Colab, navigate to the folder icon (\'Files\') of the left navigation menu to find a specific folder or file in your Google Drive.")
        print("Click on the folder or file name and select the ellipsis (...) icon on the right of the name to reveal the option \'Copy path\', which will give you the path to use as input for loading objects and files on your Python environment.")
        print("Caution: save your files into different directories of the Google Drive. If files are all saved in a same folder or directory, like the root path, they may not be accessible from your Python environment.")
        print("If you still cannot see the file after moving it to a different folder, reload the environment.")
    
    elif (source == 'aws'):
        
        import os
        import boto3
        # boto3 is the AWS SDK for Python.
        # The boto3 library must be imported only in case 
        # it is going to be used, for avoiding 
        # Google Colab compatibility issues.
        from getpass import getpass

        # Check if path_to_store_imported_s3_bucket is None or the root "/". If it is, make it the root directory:
        if ((path_to_store_imported_s3_bucket is None) or (str(path_to_store_imported_s3_bucket) == "/")):
            
            # For the S3 buckets, the path should not start with slash. Assign the empty
            # string instead:
            path_to_store_imported_s3_bucket = ""
            print("Bucket\'s content will be copied to the notebook\'s root directory.")
        
        elif (str(path_to_store_imported_s3_bucket) == ""):
            # Guarantee that the path is the empty string.
            # Avoid reaching the else branch, which would raise an error
            # since the empty string has no character of index 0
            path_to_store_imported_s3_bucket = str(path_to_store_imported_s3_bucket)
            print("Bucket\'s content will be copied to the notebook\'s root directory.")
        
        else:
            # Use the str function to guarantee that the path was read as a string:
            path_to_store_imported_s3_bucket = str(path_to_store_imported_s3_bucket)
            
            if(path_to_store_imported_s3_bucket[0] == "/"):
                # the first character is the slash. Let's remove it

                # In AWS, neither the prefix nor the path to which the file will be imported
                # (file from S3 to workspace) or from which the file will be exported to S3
                # (the path in the notebook's workspace) may start with slash, or the operation
                # will not be concluded. Then, we have to remove this character if it is present.

                # The slash is character 0. Then, we want all characters from character 1 (the
                # second) to character len(str(path_to_store_imported_s3_bucket)) - 1, the index
                # of the last character. The slicing syntax is:
                # string[1:] - all string characters from character 1 onward;
                # string[:10] - all string characters up to character 10-1 = 9 (including 9); or
                # string[1:10] - characters from 1 to 9.
                # So, slice the whole string, starting from character 1:
                path_to_store_imported_s3_bucket = path_to_store_imported_s3_bucket[1:]
                # Attention: even though strings may be seen as lists of characters that can be
                # sliced, they are immutable: we can neither assign a character to a given
                # position nor delete a character from a position.

        # Ask the user to provide the credentials:
        ACCESS_KEY = input("Enter your AWS Access Key ID here (on the right). It is the value stored in the field \'Access key ID\' from your AWS user credentials CSV file.")
        print("\n") # line break
        SECRET_KEY = getpass("Enter your password (Secret key) here (on the right). It is the value stored in the field \'Secret access key\' from your AWS user credentials CSV file.")
        
        # Using 'getpass' instead of 'input' hides the typed password.
        # So, the password is not visible to other users and cannot be copied.
        
        print("\n")
        print("WARNING: The bucket\'s name, the prefix, the AWS access key ID, and the AWS Secret access key are all sensitive information, which may grant access to protected information from the organization.\n")
        print("After copying data from S3 to your workspace, remember to remove this information from the notebook, especially if it is going to be shared. Also, remember to remove the files from the workspace.\n")
        print("The cost of storing files in Simple Storage Service is much lower than that of storing them directly in the SageMaker workspace. Also, files stored in S3 may be accessed by users other than those with access to the notebook\'s workspace.\n")

        # Check if the user actually provided the mandatory inputs, instead
        # of putting None or empty string:
        if ((ACCESS_KEY is None) or (ACCESS_KEY == '')):
            print("AWS Access Key ID is missing. It is the value stored in the field \'Access key ID\' from your AWS user credentials CSV file.")
            return "error"
        elif ((SECRET_KEY is None) or (SECRET_KEY == '')):
            print("AWS Secret Access Key is missing. It is the value stored in the field \'Secret access key\' from your AWS user credentials CSV file.")
            return "error"
        elif ((s3_bucket_name is None) or (s3_bucket_name == '')):
            print ("Please, enter a valid S3 Bucket\'s name. Do not add sub-directories or folders (prefixes), only the name of the bucket itself.")
            return "error"
        
        else:
            # Use the str function to guarantee that all AWS parameters were properly read as strings, and not as
            # other variables (like integers or floats):
            ACCESS_KEY = str(ACCESS_KEY)
            SECRET_KEY = str(SECRET_KEY)
            s3_bucket_name = str(s3_bucket_name)
        
        if(s3_bucket_name[0] == "/"):
            # the first character is the slash. Let's remove it

            # In AWS, neither the prefix nor the path to which the file will be imported
            # (file from S3 to workspace) or from which the file will be exported to S3
            # (the path in the notebook's workspace) may start with slash, or the operation
            # will not be concluded. Then, we have to remove this character if it is present.

            # So, slice the whole string, starting from character 1 (as done for 
            # path_to_store_imported_s3_bucket):
            s3_bucket_name = s3_bucket_name[1:]

        # Remove any trailing whitespace (spaces and tabs)
        # that may be present in the strings. Use the Python string
        # rstrip method, which works as a right-side Trim function:
        # when no arguments are provided, whitespace characters
        # are removed from the end of the string (use strip to trim both ends).
        # https://www.w3schools.com/python/ref_string_rstrip.asp?msclkid=ee2d05c3c56811ecb1d2189d9f803f65
        s3_bucket_name = s3_bucket_name.rstrip()
        ACCESS_KEY = ACCESS_KEY.rstrip()
        SECRET_KEY = SECRET_KEY.rstrip()
        # Since the user manually inputs the parameters ACCESS_KEY and SECRET_KEY,
        # it is easy to input whitespaces without noticing.

        # Now process the optional parameter.
        # Check if a prefix was passed as input parameter. If so, we must select only the names that start with
        # the prefix.
        # Example: in the bucket 'my_bucket' we have a directory 'dir1'.
        # In the main (root) directory, we have a file 'file1.json', i.e. '/file1.json'.
        # If we pass the prefix 'dir1', we want only the files whose keys start with 'dir1/',
        # such as: 'dir1/file2.json', excluding the file in the main (root) directory and excluding the files in other
        # directories. Also, we want to eliminate the object names with no extensions, like 'dir1/' or 'dir1/dir2',
        # since these object names represent folders or directories, not files.

        if (s3_obj_prefix is None):
            print ("No prefix, specific object, or subdirectory provided.") 
            print (f"Then, retrieving all content from the bucket \'{s3_bucket_name}\'.\n")
        elif ((s3_obj_prefix == "/") or (s3_obj_prefix == '')):
            # The root directory in the bucket must not be specified starting with the slash
            # If the root "/" or the empty string '' is provided, make
            # it equivalent to None (no directory)
            s3_obj_prefix = None
            print ("No prefix, specific object, or subdirectory provided.") 
            print (f"Then, retrieving all content from the bucket \'{s3_bucket_name}\'.\n")
    
        else:
            # Since there is a prefix, use the str function to guarantee that the path was read as a string:
            s3_obj_prefix = str(s3_obj_prefix)
            
            if(s3_obj_prefix[0] == "/"):
                # the first character is the slash. Let's remove it

                # In AWS, neither the prefix nor the path to which the file will be imported
                # (file from S3 to workspace) or from which the file will be exported to S3
                # (the path in the notebook's workspace) may start with slash, or the operation
                # will not be concluded. Then, we have to remove this character if it is present.

                # So, slice the whole string, starting from character 1 (as did for 
                # path_to_store_imported_s3_bucket):
                s3_obj_prefix = s3_obj_prefix[1:]

            # Remove any trailing whitespace (spaces and tabs)
            # that may be present in the string. Use the Python string
            # rstrip method, which works as a right-side Trim function:
            s3_obj_prefix = s3_obj_prefix.rstrip()
            
            # Store the total characters in the prefix string after removing the initial slash
            # and trailing spaces:
            prefix_len = len(s3_obj_prefix)
            
            print("AWS Access Credentials, and bucket\'s prefix, object or subdirectory provided.\n")	

            
        print ("Starting connection with the S3 bucket.\n")
        
        try:
            # Start the S3 resource as the object 's3_client'.
            # Notice that boto3 only validates the credentials when the first request is sent.
            s3_client = boto3.resource('s3', aws_access_key_id = ACCESS_KEY, aws_secret_access_key = SECRET_KEY)
        
            print("S3 client successfully started.\n")
            # An object 'data_table.xlsx' in the main (root) directory of the s3_bucket is stored in Python environment as:
            # s3.ObjectSummary(bucket_name='bucket_name', key='data_table.xlsx')
            # The name of each object is stored as the attribute 'key' of the object.
        
        except Exception:
            
            print("Failed to connect to AWS Simple Storage Service (S3). Review if your credentials are correct.")
            print("The variable \'access_key\' must be set as the value (string) stored as \'Access key ID\' in your user security credentials CSV file.")
            print("The variable \'secret_key\' must be set as the value (string) stored as \'Secret access key\' in your user security credentials CSV file.")
            # Without the client, the next steps cannot run:
            return "error"
        
        try:
            # Connect to the bucket specified as 'bucket_name'.
            # The bucket is started as the object 's3_bucket':
            s3_bucket = s3_client.Bucket(s3_bucket_name)
            print(f"Connection with bucket \'{s3_bucket_name}\' established.\n")
            
        except Exception:
            
            print("Failed to connect with the bucket, which usually happens when declaring a wrong bucket\'s name.") 
            print("Check the spelling of your bucket_name string and remember that it must be all in lower-case.\n")
            # Without the bucket, the next steps cannot run:
            return "error"
                

        # Then, let's obtain a list of all objects in the bucket (list bucket_objects):
        
        bucket_objects_list = []

        # Loop through all objects of the bucket:
        for stored_obj in s3_bucket.objects.all():
            
            # Loop through all elements 'stored_obj' from s3_bucket.objects.all()
            # Which stores the ObjectSummary for all objects in the bucket s3_bucket:
            # Let's store only the key attribute and use the str function
            # to guarantee that all values were stored as strings.
            bucket_objects_list.append(str(stored_obj.key))
        
        # Now start a support list to store only the elements from
        # bucket_objects_list that are not folders or directories
        # (objects with extensions).
        # If a prefix was provided, only files with that prefix should
        # be added:
        support_list = []
        
        for stored_obj in bucket_objects_list:
            
            # Loop through all elements 'stored_obj' from the list
            # bucket_objects_list

            # Check the file extension.
            file_extension = os.path.splitext(stored_obj)[1][1:]
            
            # The os.path.splitext method splits the string at the LAST dot (".") of its last
            # path component, separating the file extension from the full path. Example:
            # "C:/dir1/dir2/data_table.csv" is split into:
            # "C:/dir1/dir2/data_table" (root part) and '.csv' (extension part)
            # https://www.geeksforgeeks.org/python-os-path-splitext-method/?msclkid=2d56198fc5d311ec820530cfa4c6d574

            # os.path.splitext(stored_obj) is a tuple of strings: the first is the complete file
            # root with no extension; the second is the extension starting with a dot: '.txt'
            # When we set os.path.splitext(stored_obj)[1], we are selecting the second element of
            # the tuple. By selecting os.path.splitext(stored_obj)[1][1:], we are taking this string
            # from the second character (index 1), eliminating the dot: 'txt'


            # Check if the file extension is not an empty string '' (i.e., that it is different from
            # the empty string):
            if (file_extension != ''):
                    
                    # The extension is different from the empty string, so the object is neither a folder nor a directory.
                    # The object is actually a file and may be copied if it satisfies the prefix condition. If there
                    # is no prefix to check, we may simply copy the object to the list.

                    # If there is a prefix, the first characters of the stored_obj must be the prefix:
                    if (s3_obj_prefix is not None):
                        
                        # Since a prefix was declared, we want only the objects whose first portion
                        # corresponds to the prefix. string[i:j] slices the string from index i to index j-1.
                        # Then, the 1st portion of the string to check is: string[0:(prefix_len)]

                        # Slice the string stored_obj from position 0 (1st character) to position prefix_len - 1,
                        # the position where the prefix should end.
                        obj_name_first_part = (stored_obj)[0:(prefix_len)]
                        
                        # If this first part is the prefix, then append the object to 
                        # support list:
                        if (obj_name_first_part == (s3_obj_prefix)):

                                support_list.append(stored_obj)

                    else:
                        # There is no prefix, so we can simply append the object to the list:
                        support_list.append(stored_obj)

            
        # Make the objects list the support list itself:
        bucket_objects_list = support_list
            
        # Now, bucket_objects_list contains the names of all objects from the bucket that must be copied.

        print("Finished mapping objects to fetch. Now, all these objects from the S3 bucket will be copied to the notebook\'s workspace, in the specified directory.\n")
        print(f"A total of {len(bucket_objects_list)} files were found in the specified bucket\'s prefix (\'{s3_obj_prefix}\').")
        
        if (len(bucket_objects_list) == 0):
            # Nothing to copy: indexing an empty list would raise an IndexError.
            print("No files to copy. Check the bucket\'s name and the prefix.")
            return "error"
        
        print(f"The first file found is \'{bucket_objects_list[0]}\'; whereas the last file found is \'{bucket_objects_list[-1]}\'.")
            
        # Now, let's try copying the files:
            
        try:
            
            # Loop through all objects in the list bucket_objects and copy them to the workspace:
            for copied_object in bucket_objects_list:

                # Select the object in the bucket previously started as 's3_bucket':
                selected_object = s3_bucket.Object(copied_object)
            
                # Now, copy this object to the workspace:
                # Set the new file_path. Notice that by now, copied_object may be a string like:
                # 'dir1/.../dirN/file_name.ext', where dirN is the n-th directory and ext is the file extension.
                # We want only the file_name to join with the path to store the imported bucket. So, we can use the
                # str.split method specifying the separator sep = '/' to break the string into a list of substrings.
                # The last element from this list will be 'file_name.ext'
                # https://www.w3schools.com/python/ref_string_split.asp?msclkid=135399b6c63111ecada75d7d91add056

                # 1. Break the copied_object full path into the list object_path_list, using the .split method:
                object_path_list = copied_object.split(sep = "/")

                # 2. Get the last element from this list. Since it has length len(object_path_list) and indexing starts from
                # zero, the index of the last element is (len(object_path_list) - 1):
                fetched_object = object_path_list[(len(object_path_list) - 1)]

                # 3. Finally, join the string fetched_object with the new path (path on the notebook's workspace) to finish
                # The new object's file_path:

                file_path = os.path.join(path_to_store_imported_s3_bucket, fetched_object)

                # Download the selected object to the workspace in the specified file_path.
                # The parameter Filename must be set to the path of the copied file, including its name and
                # extension. Example: Filename = "/my_table.xlsx" copies an xlsx file named 'my_table' to the notebook's main (root)
                # directory
                selected_object.download_file(Filename = file_path)

                print(f"The file \'{fetched_object}\' was successfully copied to notebook\'s workspace.\n")

                
            print("Finished copying the files from the bucket to the notebook\'s workspace. It may take a couple of minutes until they are shown in the SageMaker environment.\n") 
            print("Do not forget to delete these copies after finishing the analysis. They will remain stored in the bucket.\n")


        except:

            # Run this code for any other exception that may happen (no exception error
            # specified, so any exception runs the following code).
            # Check: https://pythonbasics.org/try-except/?msclkid=4f6b4540c5d011ecb1fe8a4566f632a6
            # for seeing how to handle successive exceptions

            print("Attention! The function raised an exception error, which is probably due to the AWS Simple Storage Service (S3) permissions.")
            print("Before running again this function, check this quick guide for configuring the permission roles in AWS.\n")
            print("It is necessary to create a user with full access permissions to interact with S3 from SageMaker. To configure the user, go to the upper ribbon of AWS, click on Services, and select IAM – Identity and Access Management.")
            print("1. In IAM\'s lateral panel, search for \'Users\' in the group of Access Management.")
            print("2. Click on the \'Add users\' button.")
            print("3. Set a user name in the text box \'User name\'.")
            print("Attention: users and S3 buckets cannot be written in upper case. Also, selecting a name already used by an Amazon user or bucket will raise an error message.\n")
            print("4. In the field \'Select type of Access to AWS\'-\'Select type of AWS credentials\' select the option \'Access key - Programmatic access\'. After that, click on the button \'Next: Permissions\'.")
            print("5. In the field \'Set Permissions\', keep the \'Add user to a group\' button marked.")
            print("6. In the field \'Add user to a group\', click on \'Create group\' (alternatively, you can be added to a group already configured or copy the permissions of another user).")
            print("7. In the text box \'Group\'s name\', set a name for the new group of permissions.")
            print("8. In the search bar below (\'Filter policies\'), search for a policy that fits your needs, and check the option button on the left of this policy. The policy \'AmazonS3FullAccess\' grants full access to the S3 content.")
            print("9. Finally, click on \'Create a group\'.")
            print("10. After the group is created, it will appear with a check box marked, over the previous groups. Keep it marked and click on the button \'Next: Tags\'.")
            print("11. Create and note down the Access key ID and Secret access key. You can also download a comma separated values (CSV) file containing the credentials for future use.")
            print("ATTENTION: These parameters are required for accessing the bucket\'s content from any application, including AWS SageMaker.")
            print("12. Click on \'Next: Review\' and review the user credentials information and permissions.")
            print("13. Click on \'Create user\' and click on the download button to download the CSV file containing the user credentials information.")
            print("The headers of the CSV file (the stored fields) are: \'User name, Password, Access key ID, Secret access key, Console login link\'.")
            print("You need both the values indicated as \'Access key ID\' and as \'Secret access key\' to fetch the S3 bucket.")
            print("\n") # line break
            print("After acquiring the necessary user privileges, use the boto3 library to fetch the bucket from the Python code. boto3 is the AWS SDK for Python.")
            print("For fetching a specific bucket\'s file use the following code:\n")
            print("1. Set a variable \'access_key\' as the value (string) stored as \'Access key ID\' in your user security credentials CSV file.")
            print("2. Set a variable \'secret_key\' as the value (string) stored as \'Secret access key\' in your user security credentials CSV file.")
            print("3. Set a variable \'bucket_name\' as a string containing only the name of the bucket. Do not add subdirectories, folders (prefixes), or file names.")
            print("Example: if your bucket is named \'my_bucket\' and its main directory contains folders like \'folder1\', \'folder2\', etc, do not declare bucket_name = \'my_bucket/folder1\', even if you only want files from folder1.")
            print("ALWAYS declare only the bucket\'s name: bucket_name = \'my_bucket\'.")
            print("4. Set a variable \'file_path\' containing the path from the bucket\'s subdirectories to the file you want to fetch. Include the file name and its extension.")
            print("If the file is stored in the bucket\'s root (main) directory: file_path = \"my_file.ext\".")
            print("If the path of the file in the bucket is: \'dir1/…/dirN/my_file.ext\', where dirN is the N-th subdirectory, and dir1 is a folder or directory of the main (root) bucket\'s directory: file_path = \"dir1/…/dirN/my_file.ext\".")
            print("Also, we say that \'dir1/…/dirN/\' is the file\'s prefix. Notice that the name of the bucket is never declared here as the path for fetching its content from the Python code.")
            print("5. Set a variable named \'new_path\' to store the path of the file copied to the notebook’s workspace. This path must contain the file name and its extension.")
            print("Example: if you want to copy \'my_file.ext\' to the root directory of the notebook’s workspace, set: new_path = \"/my_file.ext\".")
            print("6. Finally, declare the following code, which refers to the defined variables:\n")

            # Let's use triple quotes to declare a formatted string
            example_code = """
                import boto3
                # Start S3 client as the object 's3_client'
                s3_client = boto3.resource('s3', aws_access_key_id = access_key, aws_secret_access_key = secret_key)
                # Connect to the bucket specified as 'bucket_name'.
                # The bucket is started as the object 's3_bucket':
                s3_bucket = s3_client.Bucket(bucket_name)
                # Select the object in the bucket previously started as 's3_bucket':
                selected_object = s3_bucket.Object(file_path)
                # Download the selected object to the workspace in the specified file_path
                # The parameter Filename must be input with the path of the copied file, including its name and
                # extension. Example Filename = "/my_table.xlsx" copies a xlsx file named 'my_table' to the notebook's main (root)
                # directory
                selected_object.download_file(Filename = new_path)
                """

            print(example_code)

            print("An object \'my_file.ext\' in the main (root) directory of the s3_bucket is stored in Python environment as:")
            print("""s3.ObjectSummary(bucket_name='bucket_name', key='my_file.ext')""")
            # triple quotes keep the internal quotes without needing many backslashes "\" (the escape character)
            print("Then, the name of each object is stored as the attribute \'key\' of the object. To view all objects, we can loop through their \'key\' attributes:\n")
            example_code = """
                # Loop through all objects of the bucket:
                for stored_obj in s3_bucket.objects.all():
                    # Loop through all elements 'stored_obj' from s3_bucket.objects.all()
                    # Which stores the ObjectSummary for all objects in the bucket s3_bucket:
                    # Print the object’s names:
                    print(stored_obj.key)
                    """

            print(example_code)

                
    else:
        
        print("Select a valid source: \'google\' for mounting Google Drive; or \'aws\' for accessing AWS S3 Bucket.")

# **Function for loading a text file**

In [None]:
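# Open the context manager (file_path must store the path of the text file):
with open (file_path, 'r') as text_file:
    
    # Read all lines at once, as a list of strings:
    all_lines = text_file.readlines()
    # readlines() leaves the file pointer at the end of the file, where readline()
    # would return an empty string. Go back to the start before reading a single line:
    text_file.seek(0)
    next_line = text_file.readline()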
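
The cell above reads a text file with `readlines()` and `readline()`. The sketch below illustrates how the file pointer moves between the two calls. The temporary file and its three lines are hypothetical, created only for the demonstration.

```python
import os
import tempfile

# Hypothetical sample file, created only for this demonstration:
with tempfile.NamedTemporaryFile('w', suffix = '.txt', delete = False) as tmp_file:
    tmp_file.write("first line\nsecond line\nthird line\n")
    file_path = tmp_file.name

with open(file_path, 'r') as text_file:
    # readlines() returns every line as a list and leaves the pointer at the end of the file:
    all_lines = text_file.readlines()
    # At the end of the file, readline() returns an empty string:
    line_after_end = text_file.readline()
    # seek(0) moves the pointer back to the beginning of the file:
    text_file.seek(0)
    first_line = text_file.readline()

# Remove the temporary file:
os.remove(file_path)

print(all_lines)
print(repr(line_after_end))
print(repr(first_line))
```

Each string keeps its newline character, so a later trimming step is usually needed.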

In [None]:
def load_pandas_dataframe (file_directory_path, file_name_with_extension, load_txt_file_with_json_format = False, how_missing_values_are_registered = None, has_header = True, decimal_separator = '.', txt_csv_col_sep = "comma", load_all_sheets_at_once = False, sheet_to_load = None, json_record_path = None, json_field_separator = "_", json_metadata_prefix_list = None):
    
    # Pandas documentation:
    # pd.read_csv: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
    # pd.read_excel: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html
    # pd.json_normalize: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.json_normalize.html
    # Python JSON documentation:
    # https://docs.python.org/3/library/json.html
    
    import os
    import json
    import numpy as np
    import pandas as pd
    from pandas import json_normalize
    
    ## WARNING: Use this function to load dataframes stored on Excel (xls, xlsx, xlsm, xlsb, odf, ods and odt), 
    ## JSON, txt, or CSV (comma separated values) files.
    
    # file_directory_path - (string, in quotes): input the path of the directory (e.g. folder path) 
    # where the file is stored. e.g. file_directory_path = "/" or file_directory_path = "/folder"
    
    # file_name_with_extension - (string, in quotes): input the name of the file with the 
    # extension. e.g. file_name_with_extension = "file.xlsx", or, 
    # file_name_with_extension = "file.csv", "file.txt", or "file.json"
    # Again, the extensions may be: xls, xlsx, xlsm, xlsb, odf, ods, odt, json, txt or csv.
    
    # load_txt_file_with_json_format = False. Set load_txt_file_with_json_format = True 
    # if you want to read a file with txt extension containing a text formatted as JSON 
    # (but not saved as JSON).
    # WARNING: if load_txt_file_with_json_format = True, all the JSON file parameters of the 
    # function (below) must be set. If not, an error message will be raised.
    
    # how_missing_values_are_registered = None: keep it None if missing values are registered as None,
    # empty or np.nan. Pandas automatically converts None to NumPy np.nan objects (floats).
    # This parameter manipulates the argument na_values (default: None) from Pandas functions.
    # By default the following values are interpreted as NaN: ‘’, ‘#N/A’, ‘#N/A N/A’, ‘#NA’, ‘-1.#IND’, 
    #‘-1.#QNAN’, ‘-NaN’, ‘-nan’, ‘1.#IND’, ‘1.#QNAN’, ‘<NA>’, ‘N/A’, ‘NA’, ‘NULL’, ‘NaN’, 
    # ‘n/a’, ‘nan’, ‘null’.

    # If a different denomination is used, indicate it here. e.g.
    # how_missing_values_are_registered = '.' will convert all strings '.' to missing values;
    # how_missing_values_are_registered = 0 will convert zeros to missing values.

    # If a dictionary is passed, it specifies per-column NA values. For example, if zero is the missing value
    # only in column 'numeric_col', you can specify the following dictionary:
    # how_missing_values_are_registered = {'numeric_col': 0}
    
    
    # has_header = True if the imported table has headers (row with column names).
    # Alternatively, has_header = False if the dataframe does not have header.
    
    # decimal_separator = '.' - String. Keep it '.' or None to use the period ('.') as
    # the decimal separator. Alternatively, specify here the separator.
    # e.g. decimal_separator = ',' will set the comma as the separator.
    # It manipulates the argument 'decimal' from Pandas functions.
    
    # txt_csv_col_sep = "comma" - This parameter has effect only when the file is a 'txt'
    # or 'csv'. It informs how the different columns are separated.
    # Set txt_csv_col_sep = "comma", or txt_csv_col_sep = "," 
    # for columns separated by comma;
    # txt_csv_col_sep = "whitespace", or txt_csv_col_sep = " " 
    # for columns separated by simple spaces.
    # You can also set a specific separator as string. For example:
    # txt_csv_col_sep = '\s+'; or txt_csv_col_sep = '\t' (in this last example, the tabulation
    # is used as separator for the columns - '\t' represents the tab character).
    
    
    ## Parameters for loading Excel files:
    
    # load_all_sheets_at_once = False - This parameter has effect only for Excel files.
    # If load_all_sheets_at_once = True, the function will return a list of dictionaries, each
    # dictionary containing 2 key-value pairs: the first key will be 'sheet', and its
    # value will be the name (or number) of the table (sheet). The second key will be 'df',
    # and its value will be the pandas dataframe object obtained from that sheet.
    # This argument has preference over sheet_to_load. If it is True, all sheets will be loaded.
    
    # sheet_to_load - This parameter has effect only for Excel files.
    # Keep sheet_to_load = None not to specify a sheet of the file, so that the first sheet
    # will be loaded.
    # sheet_to_load may be an integer or a string (inside quotes). sheet_to_load = 0
    # loads the first sheet (sheet with index 0); sheet_to_load = 1 loads the second sheet
    # of the file (index 1); sheet_to_load = "Sheet1" loads a sheet named as "Sheet1".
    # Declare a number to load the sheet with that index, starting from 0; or declare a
    # name to load the sheet with that name.
    
    
    ## Parameters for loading JSON files:
    
    # json_record_path (string): manipulate parameter 'record_path' from json_normalize method.
    # Path in each object to list of records. If not passed, data will be assumed to 
    # be an array of records. If a given field from the JSON stores a nested JSON (or a nested
    # dictionary) declare it here to decompose the content of the nested data. e.g. if the field
    # 'books' stores a nested JSON, declare, json_record_path = 'books'
    
    # json_field_separator = "_" (string). Manipulates the parameter 'sep' from json_normalize method.
    # Nested records will generate names separated by sep. 
    # e.g., for json_field_separator = ".", {‘foo’: {‘bar’: 0}} -> foo.bar.
    # Then, if a given field 'main_field' stores a nested JSON with fields 'field1', 'field2', ...
    # the name of the columns of the dataframe will be formed by concatenating 'main_field', the
    # separator, and the names of the nested fields: 'main_field_field1', 'main_field_field2',...
    
    # json_metadata_prefix_list: list of strings (in quotes). Manipulates the parameter 
    # 'meta' from json_normalize method. Fields to use as metadata for each record in resulting 
    # table. Declare here the non-nested fields, i.e., the fields in the principal JSON. They
    # will be repeated in the rows of the dataframe to give the metadata (context) of the rows.
    
    # e.g. Suppose a JSON with the following structure: {'name': 'Mary', 'last': 'Shelley',
    # 'books': [{'title': 'Frankenstein', 'year': 1818}, {'title': 'Mathilda', 'year': 1819}, {'title': 'The Last Man', 'year': 1826}]},
    # Here, there are nested JSONs in the field 'books'. The fields that are not nested
    # are 'name' and 'last'.
    # Then, json_record_path = 'books'
    # json_metadata_prefix_list = ['name', 'last']
    
    
    # Create the complete file path:
    file_path = os.path.join(file_directory_path, file_name_with_extension)
    # Extract the file extension
    file_extension = os.path.splitext(file_path)[1][1:]
    # os.path.splitext(file_path) is a tuple of strings: the first is the complete file
    # root with no extension; the second is the extension starting with a point: '.txt'
    # When we set os.path.splitext(file_path)[1], we are selecting the second element of
    # the tuple. By selecting os.path.splitext(file_path)[1][1:], we are taking this string
    # from the second character (index 1), eliminating the dot: 'txt'
    
    # Check if the decimal separator is None. If it is, set it as '.' (period):
    if (decimal_separator is None):
        decimal_separator = '.'
    
    if ((file_extension == 'txt') | (file_extension == 'csv')): 
        # The operator & is equivalent to 'And' (intersection).
        # The operator | is equivalent to 'Or' (union).
        # pandas.read_csv method must be used.
        if (load_txt_file_with_json_format == True):
            
            print("Reading a txt file containing JSON parsed data. A reading error will be raised if you did not set the JSON parameters.\n")
            
            with open(file_path, 'r') as opened_file:
                # 'r' stands for read mode; 'w' stands for write mode
                # read the whole file as a string named 'file_full_text'
                file_full_text = opened_file.read()
                # if we used the readlines() method, we would be reading the
                # file by line, not the whole text at once.
                # https://stackoverflow.com/questions/8369219/how-to-read-a-text-file-into-a-string-variable-and-strip-newlines?msclkid=a772c37bbfe811ec9a314e3629df4e1e
                # https://www.tutorialkart.com/python/python-read-file-as-string/#:~:text=example.py%20%E2%80%93%20Python%20Program.%20%23open%20text%20file%20in,and%20prints%20it%20to%20the%20standard%20output.%20Output.?msclkid=a7723a1abfe811ecb68bba01a2b85bd8
                
            #Now, file_full_text is a string containing the full content of the txt file.
            json_file = json.loads(file_full_text)
            # json.load(): parses JSON from an opened file (file-like object).
            # json.loads(): parses a string with JSON content.
            # e.g. json.loads() must be used to read a string with JSON before converting it to a flat
            # structure like a dataframe.
            # check: https://www.pythonpip.com/python-tutorials/how-to-load-json-file-using-python/#:~:text=The%20json.load%20%28%29%20is%20used%20to%20read%20the,and%20alter%20data%20in%20our%20application%20or%20system.
            dataset = json_normalize(json_file, record_path = json_record_path, sep = json_field_separator, meta = json_metadata_prefix_list)
        
        else:
            # Not a JSON txt
        
            if (has_header == True):

                if ((txt_csv_col_sep == "comma") | (txt_csv_col_sep == ",")):

                    dataset = pd.read_csv(file_path, na_values = how_missing_values_are_registered, verbose = True, parse_dates = True, infer_datetime_format = True, decimal = decimal_separator)
                    # verbose = True for showing number of NA values placed in non-numeric columns.
                    #  parse_dates = True: try parsing the index; infer_datetime_format = True : If True and parse_dates is enabled, pandas will attempt to infer the format of the datetime strings in 
                    # the columns, and if it can be inferred, switch to a faster method of parsing them. In some cases this can increase the 
                    # parsing speed by 5-10x.

                elif ((txt_csv_col_sep == "whitespace") | (txt_csv_col_sep == " ")):

                    dataset = pd.read_csv(file_path, delim_whitespace = True, na_values = how_missing_values_are_registered, verbose = True, parse_dates = True, infer_datetime_format = True, decimal = decimal_separator)
                    
                    
                else:
                    
                    try:
                        
                        # Try using the character specified as the argument txt_csv_col_sep:
                        dataset = pd.read_csv(file_path, sep = txt_csv_col_sep, na_values = how_missing_values_are_registered, verbose = True, parse_dates = True, infer_datetime_format = True, decimal = decimal_separator)
                    
                    except Exception:
                        # An error was raised: the separator is probably not valid.
                        print(f"Enter a valid column separator for the {file_extension} file, like: \'comma\' or \'whitespace\'.")
                        # Re-raise the error: without a dataset, the function cannot proceed.
                        raise


            else:
                # has_header == False

                if ((txt_csv_col_sep == "comma") | (txt_csv_col_sep == ",")):

                    dataset = pd.read_csv(file_path, header = None, na_values = how_missing_values_are_registered, verbose = True, parse_dates = True, infer_datetime_format = True, decimal = decimal_separator)

                    
                elif ((txt_csv_col_sep == "whitespace") | (txt_csv_col_sep == " ")):

                    dataset = pd.read_csv(file_path, delim_whitespace = True, header = None, na_values = how_missing_values_are_registered, verbose = True, parse_dates = True, infer_datetime_format = True, decimal = decimal_separator)
                    
                    
                else:
                    
                    try:
                        
                        # Try using the character specified as the argument txt_csv_col_sep:
                        dataset = pd.read_csv(file_path, sep = txt_csv_col_sep, header = None, na_values = how_missing_values_are_registered, verbose = True, parse_dates = True, infer_datetime_format = True, decimal = decimal_separator)
                    
                    except Exception:
                        # An error was raised: the separator is probably not valid.
                        print(f"Enter a valid column separator for the {file_extension} file, like: \'comma\' or \'whitespace\'.")
                        # Re-raise the error: without a dataset, the function cannot proceed.
                        raise

    elif (file_extension == 'json'):
        
        with open(file_path, 'r') as opened_file:
            
            json_file = json.load(opened_file)
            # The structure json_file = json.load(open(file_path)) relies on the GC to close the file. That's not a 
            # good idea: If someone doesn't use CPython the garbage collector might not be using refcounting (which 
            # collects unreferenced objects immediately) but e.g. collect garbage only after some time.
            # Since file handles are closed when the associated object is garbage collected or closed 
            # explicitly (.close() or .__exit__() from a context manager) the file will remain open until 
            # the GC kicks in.
            # Using 'with' ensures the file is closed as soon as the block is left - even if an exception 
            # happens inside that block, so it should always be preferred for any real application.
            # source: https://stackoverflow.com/questions/39447362/equivalent-ways-to-json-load-a-file-in-python
            
        # json.load(): parses JSON from an opened file (file-like object).
        # json.loads(): parses a string with JSON content.
        # Then, use json.load for a .json file,
        # and json.loads for a text (string) containing JSON.
        # check: https://www.pythonpip.com/python-tutorials/how-to-load-json-file-using-python/#:~:text=The%20json.load%20%28%29%20is%20used%20to%20read%20the,and%20alter%20data%20in%20our%20application%20or%20system.   
        dataset = json_normalize(json_file, record_path = json_record_path, sep = json_field_separator, meta = json_metadata_prefix_list)
    
    else:
        # If it is neither a csv, a txt, nor a json file, let's assume it is one of the different
        # possible Excel files.
        print("Excel file inferred. If an error message is shown, check if a valid file extension was used: \'xlsx\', \'xls\', etc.\n")
        # For Excel type files, Pandas automatically detects the decimal separator of numeric cells and requires only the parameter parse_dates.
        # The argument infer_datetime_format was once present in the read_excel function, but was removed.
        # From pandas version 1.4 on, it is possible to pass the parameter 'decimal' to the
        # read_excel function for parsing decimal separators in strings. For numeric variables, it is not needed, though.
        
        if (load_all_sheets_at_once == True):
            
            # Corresponds to setting sheet_name = None
            
            if (has_header == True):
                
                xlsx_doc = pd.read_excel(file_path, sheet_name = None, na_values = how_missing_values_are_registered, verbose = True, parse_dates = True)
                # verbose = True for showing number of NA values placed in non-numeric columns.
                # parse_dates = True: try parsing the index as dates.
                
            else:
                #No header
                xlsx_doc = pd.read_excel(file_path, sheet_name = None, header = None, na_values = how_missing_values_are_registered, verbose = True, parse_dates = True)
            
            # xlsx_doc is a dictionary containing the sheet names as keys, and dataframes as values.
            # Let's convert it to the desired format.
            # For a dictionary dict, dict.keys() is a view of the keys; dict.values() is a view of the values;
            # and dict.items() is a view of tuples with format ('key', value)
            
            # Create a list of returned datasets:
            list_of_datasets = []
            
            # Let's iterate through the array of tuples. The first element returned is the key, and the
            # second is the value
            for sheet_name, dataframe in (xlsx_doc.items()):
                # sheet_name = key; dataframe = value
                # Define the dictionary with the standard format:
                df_dict = {'sheet': sheet_name,
                            'df': dataframe}
                
                # Add the dictionary to the list:
                list_of_datasets.append(df_dict)
            
            print("\n")
            print(f"A total of {len(list_of_datasets)} dataframes were retrieved from the Excel file.\n")
            print(f"The dataframes correspond to the following Excel sheets: {list(xlsx_doc.keys())}\n")
            print("Returning a list of dictionaries. Each dictionary contains the key \'sheet\', with the original sheet name; and the key \'df\', with the Pandas dataframe object obtained.\n")
            print(f"Check the 10 first rows of the dataframe obtained from the first sheet, named {list_of_datasets[0]['sheet']}:\n")
            
            try:
                # only works in Jupyter Notebook:
                from IPython.display import display
                display((list_of_datasets[0]['df']).head(10))
            
            except: # regular mode
                print((list_of_datasets[0]['df']).head(10))
            
            return list_of_datasets
            
        elif (sheet_to_load is not None):        
        #Case where the user specifies which sheet of the Excel file should be loaded.
            
            if (has_header == True):
                
                dataset = pd.read_excel(file_path, sheet_name = sheet_to_load, na_values = how_missing_values_are_registered, verbose = True, parse_dates = True)
                # verbose = True for showing number of NA values placed in non-numeric columns.
                # parse_dates = True: try parsing the index as dates.
                
            else:
                #No header
                dataset = pd.read_excel(file_path, sheet_name = sheet_to_load, header = None, na_values = how_missing_values_are_registered, verbose = True, parse_dates = True)
                
        
        else:
            #No sheet specified
            if (has_header == True):
                
                dataset = pd.read_excel(file_path, na_values = how_missing_values_are_registered, verbose = True, parse_dates = True)
                
            else:
                #No header
                dataset = pd.read_excel(file_path, header = None, na_values = how_missing_values_are_registered, verbose = True, parse_dates = True)
                
    print(f"Dataset extracted from {file_path}. Check the 10 first rows of this dataframe:\n")
    
    try:
        # only works in Jupyter Notebook:
        from IPython.display import display
        display(dataset.head(10))
            
    except: # regular mode
        print(dataset.head(10))
    
    return dataset
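
The parameters `how_missing_values_are_registered`, `decimal_separator` and `txt_csv_col_sep` are passed to the arguments `na_values`, `decimal` and `sep` of `pd.read_csv`. The following is a minimal sketch of their combined effect, assuming a hypothetical semicolon-separated text read from memory through `io.StringIO` instead of a file.

```python
import io
import pandas as pd

# Hypothetical content: columns separated by ';', decimal comma, and '.' marking missing values:
csv_text = "id;value\n1;3,5\n2;.\n3;7,25\n"

# sep defines the column separator; decimal defines the decimal separator;
# na_values indicates the additional string to be read as a missing value:
dataset = pd.read_csv(io.StringIO(csv_text), sep = ';', decimal = ',', na_values = '.')

print(dataset)
print(f"Missing values in column 'value': {dataset['value'].isna().sum()}")
```

With these settings, the column `value` should be read as a float column containing one missing value, instead of a column of strings.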

# **Function for loading a csv file**

# **Function for converting JSON object to dataframe**
- Objects may be:
    - String with JSON formatted text;
    - List with nested dictionaries (JSON formatted);
    - Each dictionary may contain nested dictionaries, or nested lists of dictionaries (nested JSON).
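
As an illustration of these structures, the sketch below flattens a nested JSON with `pd.json_normalize`, the Pandas function used by the function in the next cell. The record is hypothetical and follows the example given in the comments of this notebook. The same data is also stored as a string to show the role of `json.loads`.

```python
import json
import pandas as pd

# Hypothetical JSON object: a list with one dictionary that holds a nested list of dictionaries:
json_obj = [{'name': 'Mary', 'last': 'Shelley',
             'books': [{'title': 'Frankenstein', 'year': 1818},
                       {'title': 'Mathilda', 'year': 1819},
                       {'title': 'The Last Man', 'year': 1826}]}]

# The same content, stored as a string with JSON formatted text:
json_text = json.dumps(json_obj)
# json.loads parses the string back to Python lists and dictionaries:
parsed_obj = json.loads(json_text)

# record_path selects the nested list that will originate the rows;
# meta repeats the non-nested fields on every row:
dataset = pd.json_normalize(parsed_obj, record_path = 'books', meta = ['name', 'last'], sep = '_')

print(dataset)
```

Each nested book becomes one row, and the fields `name` and `last` are repeated as context for all rows.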

In [None]:
def json_obj_to_pandas_dataframe (json_obj_to_convert, json_obj_type = 'list', json_record_path = None, json_field_separator = "_", json_metadata_prefix_list = None):
    
    import json
    import pandas as pd
    from pandas import json_normalize
    
    # JSON object in terms of Python structure: list of dictionaries, where each value of a
    # dictionary may be a dictionary or a list of dictionaries (nested structures).
    # example of highly nested structure saved as a list 'json_formatted_list'. Note that the same
    # structure could be declared and stored into a string variable. For instance, if you have a txt
    # file containing JSON, you could read the txt and save its content as a string.
    # json_formatted_list = [{'field1': val1, 'field2': {'dict_val': dict_val}, 'field3': [{
    # 'nest1': nest_val1}, {'nest2': nestval2}]}, {'field1': val1, 'field2': {'dict_val': dict_val}, 
    # 'field3': [{'nest1': nest_val1}, {'nest2': nestval2}]}]    

    # json_obj_type = 'list', in case the object was saved as a list of dictionaries (JSON format)
    # json_obj_type = 'string', in case it was saved as a string (text) containing JSON.

    # json_obj_to_convert: object containing JSON, or string with JSON content to parse.
    # Objects may be: string with JSON formatted text;
    # list with nested dictionaries (JSON formatted);
    # dictionaries, possibly with nested dictionaries (JSON formatted).
    
    # https://docs.python.org/3/library/json.html
    # https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.json_normalize.html#pandas.json_normalize
    
    # json_record_path (string): manipulate parameter 'record_path' from json_normalize method.
    # Path in each object to list of records. If not passed, data will be assumed to 
    # be an array of records. If a given field from the JSON stores a nested JSON (or a nested
    # dictionary) declare it here to decompose the content of the nested data. e.g. if the field
    # 'books' stores a nested JSON, declare, json_record_path = 'books'
    
    # json_field_separator = "_" (string). Manipulates the parameter 'sep' from json_normalize method.
    # Nested records will generate names separated by sep. 
    # e.g., for json_field_separator = ".", {‘foo’: {‘bar’: 0}} -> foo.bar.
    # Then, if a given field 'main_field' stores a nested JSON with fields 'field1', 'field2', ...
    # the name of the columns of the dataframe will be formed by concatenating 'main_field', the
    # separator, and the names of the nested fields: 'main_field_field1', 'main_field_field2',...
    
    # json_metadata_prefix_list: list of strings (in quotes). Manipulates the parameter 
    # 'meta' from json_normalize method. Fields to use as metadata for each record in resulting 
    # table. Declare here the non-nested fields, i.e., the fields in the principal JSON. They
    # will be repeated in the rows of the dataframe to give the metadata (context) of the rows.
    
    # e.g. Suppose a JSON with the following structure: {'name': 'Mary', 'last': 'Shelley',
    # 'books': [{'title': 'Frankenstein', 'year': 1818}, {'title': 'Mathilda', 'year': 1819}, {'title': 'The Last Man', 'year': 1826}]},
    # Here, there are nested JSONs in the field 'books'. The fields that are not nested
    # are 'name' and 'last'.
    # Then, json_record_path = 'books'
    # json_metadata_prefix_list = ['name', 'last']

    
    if (json_obj_type == 'string'):
        # Use the json.loads method to convert the string to json
        json_file = json.loads(json_obj_to_convert)
        # json.load(): parses JSON from an opened file (file-like object).
        # json.loads(): parses a string with JSON content.
        # e.g. json.loads() must be used to read a string with JSON before converting it to a flat
        # structure like a dataframe.
        # check: https://www.pythonpip.com/python-tutorials/how-to-load-json-file-using-python/#:~:text=The%20json.load%20%28%29%20is%20used%20to%20read%20the,and%20alter%20data%20in%20our%20application%20or%20system.
    
    elif (json_obj_type == 'list'):
        
        # make the json_file the object itself:
        json_file = json_obj_to_convert
    
    else:
        print ("Enter a valid JSON object type: \'list\', in case the JSON object is a list of dictionaries in JSON format; or \'string\', if the JSON is stored as a text (string variable).")
        return "error"
    
    dataset = json_normalize(json_file, record_path = json_record_path, sep = json_field_separator, meta = json_metadata_prefix_list)
    
    print(f"JSON object converted to a flat dataframe object. Check the 10 first rows of this dataframe:\n")
    
    try:
        # only works in Jupyter Notebook:
        from IPython.display import display
        display(dataset.head(10))
            
    except: # regular mode
        print(dataset.head(10))
    
    return dataset

# **Function for removing trailing or leading white spaces or characters (trim) from string variables, and modifying the variable type**

In [27]:
def trim_spaces_or_characters (string_or_list_of_strings, new_variable_type = None, method = 'trim', substring_to_eliminate = None):
    
    import numpy as np
    
    # string_or_list_of_strings: string or list of strings (inside quotes) 
    # that will be processed. 
    # e.g. string_or_list_of_strings = " text " will process this single string, whereas 
    # string_or_list_of_strings = [' text1', 'text2 '] will process both strings of the list.
    
    # new_variable_type = None. String (in quotes) that represents a given data type for the variables
    # after transformation. Set:
    # - new_variable_type = 'int' to convert the strings to integer type after the transform;
    # - new_variable_type = 'float' to convert the strings to float (decimal number);
    # - new_variable_type = 'datetime' to convert them to date or timestamp.
    
    # method = 'trim' will eliminate trailing and leading white spaces from the strings in
    # string_or_list_of_strings.
    # method = 'substring' will eliminate a defined trailing and leading substring from
    # string_or_list_of_strings.
    
    # substring_to_eliminate = None. Set as a string (in quotes) if method = 'substring'.
    # e.g. suppose string_or_list_of_strings contains time information: each string ends in " min":
    # "1 min", "2 min", "3 min", etc. If substring_to_eliminate = " min", this portion will be
    # eliminated, resulting in: "1", "2", "3", etc. If new_variable_type = None, these values will
    # continue to be strings. By setting new_variable_type = 'int' or 'float', the strings will be
    # converted to a numeric type.
    
    
    # Check if a string was passed. If it was, convert it to list of single element:
    if (type(string_or_list_of_strings) == str):
        list_of_strings = [string_or_list_of_strings]
    
    else: # simply convert the iterable to the new standard name:
        list_of_strings = list(string_or_list_of_strings)
    
    # Now, we have a local copy as a list.
    # As we are dealing with strings and not Pandas dataframes, we do not call the str attribute.
    
    if (method == 'substring'):
        
        if (substring_to_eliminate is None):
            
            method = 'trim'
            print("No valid substring input. Modifying method to \'trim\'.\n")
    
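    if (method == 'substring'):
        
        print("ATTENTION: Operations of string strip (removal) or replacement are all case-sensitive. There must be correct correspondence between cases and spaces for the strings being removed or replaced.\n")
        
        # str.strip(chars) treats its argument as a set of characters, not as a substring.
        # So, slice the strings to remove only the exact leading and trailing substring:
        sub_len = len(substring_to_eliminate)
        new_series = []
        
        for string in list_of_strings:
            
            if (string.startswith(substring_to_eliminate)):
                string = string[sub_len:]
            
            if (string.endswith(substring_to_eliminate)):
                string = string[:(len(string) - sub_len)]
            
            new_series.append(string)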
    
    else:
        
        new_series = [string.strip() for string in list_of_strings]
    
    # Check if the series type should be modified:
    if (new_variable_type is not None):
        # try converting the type:
        try:
            if (new_variable_type == 'int'):

                new_series = np.int64(new_series)

            elif (new_variable_type == 'float'):

                new_series = np.float64(new_series)

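            elif (new_variable_type == 'datetime'):
                # np.datetime64 converts a single value. For a list of strings (ISO 8601 format),
                # create an array with datetime64 dtype:
                new_series = np.array(new_series, dtype = 'datetime64[ns]')
        
            print(f"Successfully converted the strings to the type {new_variable_type}.\n")
        
        except:
            # The conversion failed: keep the strings and warn the user.
            print(f"Unable to convert the strings to the type {new_variable_type}. Returning them as strings.\n")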

    # Now, we are in the main code.
    print("Finished removing leading and trailing spaces or characters (substrings).")
    print("Check the 10 first strings:\n")
    
    try:
        # only works in Jupyter Notebook:
        from IPython.display import display
        display(new_series[:10])
            
    except: # regular mode
        print(new_series[:10])
    
    return new_series
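
One detail of the trimming step deserves attention: Python's `str.strip(chars)` interprets its argument as a set of characters to be removed from both ends, not as an exact substring. The sketch below compares the two behaviors. The helper `remove_exact_affix` and the sample strings are hypothetical, written only for this comparison.

```python
def remove_exact_affix (string, substring):
    # Hypothetical helper: removes 'substring' only when it appears exactly
    # at the beginning and/or at the end of 'string'.
    if (substring and string.startswith(substring)):
        string = string[len(substring):]
    
    if (substring and string.endswith(substring)):
        string = string[:(len(string) - len(substring))]
    
    return string

# Hypothetical sample strings:
durations = ["1 min", "20 min", "minimum min"]

# strip removes any of the characters ' ', 'm', 'i', 'n' from both ends:
stripped_by_chars = [string.strip(" min") for string in durations]
# The helper removes only the exact substring " min":
stripped_exactly = [remove_exact_affix(string, " min") for string in durations]

print(stripped_by_chars)
print(stripped_exactly)
```

For strings like "1 min" both approaches agree, but for "minimum min" the character-based strip also consumes letters of the word itself.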

# **Function for capitalizing or lowering case of string variables (string homogenizing)**

In [28]:
def capitalize_or_lower_string_case (string_or_list_of_strings, method = 'lowercase'):
     
    import numpy as np
    
    # string_or_list_of_strings: string or list of strings (inside quotes) 
    # that will be processed. 
    # e.g. string_or_list_of_strings = "String One" will process this single string, whereas 
    # string_or_list_of_strings = ['String One', 'STRING 2'] will process both strings of the list.
    
    # method = 'capitalize' will capitalize all letters from the input string 
    # (turn them to upper case).
    # method = 'lowercase' will make the opposite: turn all letters to lower case.
    # e.g. suppose string_or_list_of_strings contains strings such as 'String One', 'STRING 2',  and
    # 'string3'. If method = 'capitalize', the output will contain the strings: 
    # 'STRING ONE', 'STRING 2', 'STRING3'. If method = 'lowercase', the outputs will be:
    # 'string one', 'string 2', 'string3'.
    
    
    # Check if a string was passed. If it was, convert it to list of single element:
    if (type(string_or_list_of_strings) == str):
        list_of_strings = [string_or_list_of_strings]
    
    else: # simply convert the iterable to the new standard name:
        list_of_strings = list(string_or_list_of_strings)
    
    # Now, we have a local copy as a list.
    # As we are dealing with strings and not Pandas dataframes, we do not call the str attribute.
    
    if (method == 'capitalize'):
        
        print("Capitalizing the string (moving all characters to upper case).\n")
        new_series = [string.upper() for string in list_of_strings]
    
    else:
        
        print("Lowering the string case (moving all characters to lower case).\n")
        new_series = [string.lower() for string in list_of_strings]
    
    # Now, we are in the main code.
    print(f"Finished homogenizing the string cases, giving value consistency.")
    print("Check the 10 first strings:\n")
    
    try:
        # only works in Jupyter Notebook:
        from IPython.display import display
        display(new_series[:10])
            
    except: # regular mode
        print(new_series[:10])
    
    return new_series
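
As a quick illustration of the two methods described in the comments, the sketch below applies `str.upper` and `str.lower` directly to the example strings. It also shows two built-in alternatives that the function above does not use, since the name 'capitalize' could suggest them.

```python
names = ["String One", "STRING 2", "string3"]

# Equivalent to method = 'capitalize': every letter becomes upper case.
upper_case = [s.upper() for s in names]

# Equivalent to method = 'lowercase': every letter becomes lower case.
lower_case = [s.lower() for s in names]

# Built-in alternatives that are NOT used by the function above:
first_only = "string one".capitalize()  # only the first character is upper-cased
title_case = "string one".title()       # the first character of each word is upper-cased

print(upper_case)
print(lower_case)
print(first_only, "|", title_case)
```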

# **Function for adding contractions to the contractions library**

In [53]:
def add_contractions_to_library (list_of_contractions = [{'contracted_expression': None, 'correct_expression': None}, {'contracted_expression': None, 'correct_expression': None}, {'contracted_expression': None, 'correct_expression': None}, {'contracted_expression': None, 'correct_expression': None}]):
    
    import contractions
    # contractions library: https://github.com/kootenpv/contractions
    
    # list_of_contractions = 
    # [{'contracted_expression': None, 'correct_expression': None}]
    # This is a list of dictionaries, where each dictionary contains two key-value pairs:
    # the first one contains the form as the contraction is usually observed; and the second one 
    # contains the correct (full) string that will replace it.
    # Since contractions can cause issues when processing text, we can expand them with these functions.
    
    # The object list_of_contractions must be declared as a list, 
    # in brackets, even if there is a single dictionary.
    # Use always the same keys: 'contracted_expression' for the contraction; and 'correct_expression', 
    # for the strings with the correspondent correction.
    
    # If you want, you can remove elements (dictionaries) from the list to declare fewer elements;
    # and you can also add more elements (dictionaries) to the lists, if you want to add more elements
    # to the contractions library.
    # Simply put a comma after the last element from the list and declare a new dictionary, keeping the
    # same keys: {'contracted_expression': original_str, 'correct_expression': new_str}, 
    # where original_str and new_str represent the contracted and expanded strings
    # (If one of the keys contains None, the new dictionary will be ignored).
    
    # Example:
    # list_of_contractions = [{'contracted_expression': 'mychange', 'correct_expression': 'my change'}]
    
    
    for dictionary in list_of_contractions:
        
        contraction = dictionary['contracted_expression']
        correction = dictionary['correct_expression']
        
        if ((contraction is not None) and (correction is not None)):
    
            contractions.add(contraction, correction)
            print(f"Successfully added the contracted expression {contraction} to the contractions library.")

    print("Now, the function for contraction correction will be able to process it within the strings.\n")

# **Function for correcting contracted strings**

In [74]:
def correct_contracted_strings (string_or_list_of_strings):
     
    import numpy as np
    import contractions
    
    # contractions library: https://github.com/kootenpv/contractions
    
    # string_or_list_of_strings: string or list of strings (inside quotes), 
    # that will be analyzed. 
    # e.g. string_or_list_of_strings = "column1" will analyze 'column1', whereas 
    # string_or_list_of_strings = ['col1', 'col2'] will process both 'col1' and 'col2'.
    
    # Check if a string was passed. If it was, convert it to list of single element:
    if (type(string_or_list_of_strings) == str):
        list_of_strings = [string_or_list_of_strings]
    
    else: # simply convert the iterable to the new standard name:
        list_of_strings = list(string_or_list_of_strings)
    
    # Now, we have a local copy as a list.
    # As we are dealing with strings and not Pandas dataframes, we do not call the str attribute.
    
    # Contractions operate at one string at once:
    correct_contractions_list = [contractions.fix(string, slang = True) for string in list_of_strings]

    # Now, we are in the main code.
    print(f"Finished correcting the contracted strings.")
    print("Check the 10 first strings:\n")
    
    try:
        # only works in Jupyter Notebook:
        from IPython.display import display
        display(correct_contractions_list[:10])
            
    except: # regular mode
        print(correct_contractions_list[:10])
    
    return correct_contractions_list
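
To make the idea of expanding contractions concrete without depending on any external package, the sketch below uses only the standard `re` module and a tiny hand-written mapping. This is not how the contractions library is implemented; the names `CONTRACTION_MAP` and `expand_contractions` are hypothetical, and the mapping is deliberately incomplete.

```python
import re

# Hypothetical, minimal mapping used only for illustration.
CONTRACTION_MAP = {
    "can't": "cannot",
    "won't": "will not",
    "it's": "it is",
    "i'm": "I am",
}

# One pattern that matches any key as a whole word, ignoring case.
pattern = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTION_MAP) + r")\b",
    flags=re.IGNORECASE,
)


def expand_contractions(text):
    """Replace each known contraction by its expanded form."""
    # The lookup uses the lower-cased match, so the original capitalization is not preserved.
    return pattern.sub(lambda match: CONTRACTION_MAP[match.group(0).lower()], text)


expanded = [expand_contractions(s) for s in ["I can't go", "It's late, and I'm tired"]]
print(expanded)
```

A dedicated library covers far more expressions and edge cases; the sketch only shows the mapping-and-substitution principle.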

# **Function for substituting (replacing) substrings on string variables**

In [55]:
def replace_substring (string_or_list_of_strings, substring_to_be_replaced = None, new_substring_for_replacement = ''):
     
    import numpy as np
    
    # string_or_list_of_strings: string or list of strings (inside quotes), 
    # that will be analyzed. 
    # e.g. string_or_list_of_strings = "column1" will analyze 'column1', whereas 
    # string_or_list_of_strings = ['col1', 'col2'] will process both 'col1' and 'col2'.
    
    # substring_to_be_replaced = None; new_substring_for_replacement = ''. 
    # Strings (in quotes): when the sequence of characters substring_to_be_replaced is
    # found in the strings from string_or_list_of_strings, it will be substituted by the substring
    # new_substring_for_replacement. If None is provided to one of these substring arguments,
    # it will be converted to the empty string: ''
    # e.g. suppose string_or_list_of_strings contains the following strings, with a spelling error:
    # "my collumn 1", 'his collumn 2', 'her column 3'. We may correct this error by setting:
    # substring_to_be_replaced = 'collumn' and new_substring_for_replacement = 'column'. The
    # function will search for the wrong group of characters and, if it finds it, will substitute
    # it by the correct sequence: "my column 1", 'his column 2', 'her column 3'.
    
    
    # Check if a string was passed. If it was, convert it to list of single element:
    if (type(string_or_list_of_strings) == str):
        list_of_strings = [string_or_list_of_strings]
    
    else: # simply convert the iterable to the new standard name:
        list_of_strings = list(string_or_list_of_strings)
    
    # Now, we have a local copy as a list.
    # As we are dealing with strings and not Pandas dataframes, we do not call the str attribute.
    
    print("ATTENTION: Operations of string strip (removal) or replacement are all case-sensitive. There must be correct correspondence between cases and spaces for the strings being removed or replaced.\n")
        
    # If one of the input substrings is None, make it the empty string:
    if (substring_to_be_replaced is None):
        substring_to_be_replaced = ''
    
    if (new_substring_for_replacement is None):
        new_substring_for_replacement = ''
    
    # Guarantee that both were read as strings (they may have been improperly read as 
    # integers or floats):
    substring_to_be_replaced = str(substring_to_be_replaced)
    new_substring_for_replacement = str(new_substring_for_replacement)
    
    new_series = [string.replace(substring_to_be_replaced, new_substring_for_replacement) for string in list_of_strings]
    
    # Now, we are in the main code.
    print(f"Finished replacing the substring {substring_to_be_replaced} by {new_substring_for_replacement}.")
    print("Check the 10 first strings:\n")
    
    try:
        # only works in Jupyter Notebook:
        from IPython.display import display
        display(new_series[:10])
            
    except: # regular mode
        print(new_series[:10])
    
    return new_series
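
The spelling-correction example from the comments can be reproduced with `str.replace` alone. The sketch below also shows two behaviours that are easy to overlook: matching is case-sensitive, and an empty search string matches between every character.

```python
rows = ["my collumn 1", "his collumn 2", "her column 3"]

# Replace the misspelled substring wherever it appears.
fixed = [s.replace("collumn", "column") for s in rows]

# Matching is case-sensitive: 'Collumn' is left untouched.
case_sensitive = "My Collumn".replace("collumn", "column")

# An empty search string matches at every position, so a non-empty
# replacement is inserted before, between and after all characters.
empty_search = "abc".replace("", "-")

# An optional third argument limits the number of replacements.
first_only = "a-b-c".replace("-", "+", 1)

print(fixed)
print(case_sensitive, empty_search, first_only)
```

Because a `None` search substring is converted to the empty string, it is safest to always pass an explicit, non-empty `substring_to_be_replaced`.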

# **Function for inverting the order of the string characters**

In [56]:
def invert_strings (string_or_list_of_strings):
     
    import numpy as np
    
    # string_or_list_of_strings: string or list of strings (inside quotes), 
    # that will be analyzed. 
    # e.g. string_or_list_of_strings = "column1" will analyze 'column1', whereas 
    # string_or_list_of_strings = ['col1', 'col2'] will process both 'col1' and 'col2'.
    
    
    # Check if a string was passed. If it was, convert it to list of single element:
    if (type(string_or_list_of_strings) == str):
        list_of_strings = [string_or_list_of_strings]
    
    else: # simply convert the iterable to the new standard name:
        list_of_strings = list(string_or_list_of_strings)
    
    # Now, we have a local copy as a list.
    # As we are dealing with strings and not Pandas dataframes, we do not call the str attribute.
    
    # Python slice with no start or stop and step = -1: start from the last character
    # and walk through the string 'backwards' until the first one.
    # No Pandas method is needed, since we are operating on plain Python strings.
    
    new_series = [string[::-1] for string in list_of_strings]
    

    # Now, we are in the main code.
    print(f"Finished inversion of the strings.")
    print("Check the 10 first strings:\n")
    
    try:
        # only works in Jupyter Notebook:
        from IPython.display import display
        display(new_series[:10])
            
    except: # regular mode
        print(new_series[:10])
    
    return new_series
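
The inversion relies on the slice `[::-1]`. As a small sketch in plain Python, the lines below compare it with the equivalent `reversed` spelling and use it for a simple palindrome check; the sample words are arbitrary.

```python
words = ["idsw", "level", "Python"]

# Slice with step -1: walk each string backwards.
reversed_words = [w[::-1] for w in words]

# Equivalent, more explicit spelling using the built-in reversed().
reversed_explicit = ["".join(reversed(w)) for w in words]

# A string that equals its own inversion is a palindrome.
palindromes = [w for w in words if w == w[::-1]]

print(reversed_words, palindromes)
```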

# **Function for slicing the strings**

In [57]:
def slice_strings (string_or_list_of_strings, first_character_index = None, last_character_index = None, step = 1):
     
    import numpy as np
    
    # string_or_list_of_strings: string or list of strings (inside quotes), 
    # that will be analyzed. 
    # e.g. string_or_list_of_strings = "column1" will analyze 'column1', whereas 
    # string_or_list_of_strings = ['col1', 'col2'] will process both 'col1' and 'col2'.
    
    # first_character_index = None - integer representing the index of the first character to be
    # included in the new strings. If None, slicing will start from first character.
    # Indexing of strings always start from 0. The last index can be represented as -1, the index of
    # the character before as -2, etc (inverse indexing starts from -1).
    # example: consider the string "idsw", which contains 4 characters. We can represent the indices as:
    # 'i': index 0; 'd': 1, 's': 2, 'w': 3. Alternatively: 'w': -1, 's': -2, 'd': -3, 'i': -4.
    
    # last_character_index = None - integer representing the index of the last character to be
    # included in the new strings. If None, slicing will go until the last character.
    # Attention: this is effectively the last character to be added, and not the next index after last
    # character.
    
    # in the 'idsw' example, if we want a string as 'ds', we want the first_character_index = 1 and
    # last_character_index = 2.
    
    # step = 1 - integer representing the slicing step. If step = 1, all characters will be added.
    # If step = 2, then the slicing will pick one element of index i and the element with index (i+2)
    # (1 index will be 'jumped'), and so on.
    # If step is negative, then the order of the new strings will be inverted.
    # Example: step = -1, and the start and finish indices are None: the output will be the inverted
    # string, 'wsdi'.
    # first_character_index = 1, last_character_index = 2, step = 1: output = 'ds';
    # first_character_index = None, last_character_index = None, step = 2: output = 'is';
    # first_character_index = None, last_character_index = None, step = 3: output = 'iw';
    # first_character_index = -1, last_character_index = -2, step = -1: output = 'ws';
    # first_character_index = -1, last_character_index = None, step = -2: output = 'wd';
    # first_character_index = -1, last_character_index = None, step = 1: output = 'w'
    # In this last example, the function tries to access the next element after the character of index
    # -1. Since -1 is the last character, there are no other characters to be added.
    # first_character_index = -2, last_character_index = -1, step = 1: output = 'sw'.
    
    
    # Check if a string was passed. If it was, convert it to list of single element:
    if (type(string_or_list_of_strings) == str):
        list_of_strings = [string_or_list_of_strings]
    
    else: # simply convert the iterable to the new standard name:
        list_of_strings = list(string_or_list_of_strings)
    
    # Now, we have a local copy as a list.
    # As we are dealing with strings and not Pandas dataframes, we do not call the str attribute.
    
    if (step is None):
        # set as 1
        step = 1
    
    # Python slicing from index i to index j includes index i, but stops before index j.
    # Since last_character_index must be included in the output, move the stop one position
    # further in the direction of the step: +1 for positive steps, -1 for negative steps.
    if (last_character_index is None):
        # Slice until the end of the string (or until its start, if step is negative).
        stop_index = None
    
    elif (step > 0):
        stop_index = last_character_index + 1
        if (stop_index == 0):
            # last_character_index = -1: index 0 would point to the first character,
            # so we leave the stop undefined to slice until the end of the string.
            stop_index = None
    
    else:
        stop_index = last_character_index - 1
        if (stop_index == -1):
            # last_character_index = 0: index -1 would point to the last character,
            # so we leave the stop undefined to slice until the start of the string.
            stop_index = None
    
    # None is a valid slice boundary, so a single expression covers all combinations
    # of defined and undefined limits:
    new_series = [string[first_character_index:stop_index:step] for string in list_of_strings]

    # Now, we are in the main code.
    print(f"Finished slicing the strings from character {first_character_index} to character {last_character_index}.")
    print("Check the 10 first strings:\n")
    
    try:
        # only works in Jupyter Notebook:
        from IPython.display import display
        display(new_series[:10])
            
    except: # regular mode
        print(new_series[:10])
    
    return new_series
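
Because Python slices exclude the stop index, including the last character requires shifting the stop by one position in the direction of the step. The sketch below is a standalone illustration of that rule; the helper name `inclusive_slice` is hypothetical. It reproduces several of the documented examples for the string 'idsw'.

```python
def inclusive_slice(text, first=None, last=None, step=1):
    """Hypothetical helper: slice `text` so that index `last` is part of the result."""
    if last is None:
        stop = None
    elif step > 0:
        stop = last + 1
        if stop == 0:       # last == -1: go until the end of the string
            stop = None
    else:
        stop = last - 1
        if stop == -1:      # last == 0: go until the start of the string
            stop = None
    return text[first:stop:step]


results = {
    "middle": inclusive_slice("idsw", 1, 2),
    "every_second": inclusive_slice("idsw", step=2),
    "last_two": inclusive_slice("idsw", -2, -1),
    "backwards_two": inclusive_slice("idsw", -1, -2, -1),
    "backwards_all": inclusive_slice("idsw", step=-1),
}
print(results)
```

The two `None` substitutions handle the cases where adding or subtracting 1 would wrap around to the opposite end of the string.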

# **Function for getting the leftest characters from the strings (retrieve last characters)**

In [58]:
def left_characters (string_or_list_of_strings, number_of_characters_to_retrieve = 1, new_variable_type = None):
    
    import numpy as np
    
    # string_or_list_of_strings: string or list of strings (inside quotes), 
    # that will be analyzed. 
    # e.g. string_or_list_of_strings = "column1" will analyze 'column1', whereas 
    # string_or_list_of_strings = ['col1', 'col2'] will process both 'col1' and 'col2'.
    
    # number_of_characters_to_retrieve = 1 - integer representing the total of characters that will
    # be retrieved. Here, we will retrieve the last characters of each string, i.e. the ones at its end
    # (despite the function name, they are the rightmost ones when the text is read from left to right).
    # If number_of_characters_to_retrieve = 1, only the last character will be retrieved.
    # Consider the string 'idsw'.
    # number_of_characters_to_retrieve = 1 - output: 'w';
    # number_of_characters_to_retrieve = 2 - output: 'sw'.
    
    # new_variable_type = None. String (in quotes) that represents a given data type for the column
    # after transformation. Set:
    # - new_variable_type = 'int' to convert the extracted column to integer;
    # - new_variable_type = 'float' to convert the column to float (decimal number);
    # - new_variable_type = 'datetime' to convert it to date or timestamp;
   
    # So, if the last part of the strings is a number, you can use this argument to directly extract
    # this part as numeric variable.
    
    
    # Check if a string was passed. If it was, convert it to list of single element:
    if (type(string_or_list_of_strings) == str):
        list_of_strings = [string_or_list_of_strings]
    
    else: # simply convert the iterable to the new standard name:
        list_of_strings = list(string_or_list_of_strings)
    
    # Now, we have a local copy as a list.
    # As we are dealing with strings and not Pandas dataframes, we do not call the str attribute.
    
    # Python string slicing is used here; no Pandas method is needed, since we are
    # operating on plain Python strings.
    
    if (number_of_characters_to_retrieve is None):
        # set as 1
        number_of_characters_to_retrieve = 1
    
    # last_character_index = -1 would be the index of the last character.
    # If we want the last N = 2 characters, we should go from index -2 to -1, -2 = -1 - (N-1);
    # If we want the last N = 3 characters, we should go from index -3 to -1, -3 = -1 - (N-1);
    # If we want only the last (N = 1) character, we should go from -1 to -1, -1 = -1 - (N-1).
    
    # N = number_of_characters_to_retrieve
    first_character_index = -1 - (number_of_characters_to_retrieve - 1)
    
    # Perform the slicing without setting the limit, to slice until the end of the string:
    new_series = [string[first_character_index:] for string in list_of_strings]
    # If no step is specified, step = 1
    
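    # Check if the series type should be modified:
    if (new_variable_type is not None):
        # try converting the type:
        try:
            if (new_variable_type == 'int'):

                new_series = np.int64(new_series)

            elif (new_variable_type == 'float'):

                new_series = np.float64(new_series)

            elif (new_variable_type == 'datetime'):
                # np.datetime64 only accepts a single value, so build an array
                # with the datetime64 dtype to convert the whole list:
                new_series = np.array(new_series, dtype = 'datetime64')
        
            print(f"Successfully converted the strings to the type {new_variable_type}.\n")
        
        except Exception as error:
            # Keep the strings unchanged, but report why the conversion failed:
            print(f"Could not convert the strings to the type {new_variable_type}: {error}\n")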
    
    
    # Now, we are in the main code.
    print(f"Finished extracting the {number_of_characters_to_retrieve} leftest characters.")
    print("Check the 10 first strings:\n")
    
    try:
        # only works in Jupyter Notebook:
        from IPython.display import display
        display(new_series[:10])
            
    except: # regular mode
        print(new_series[:10])
    
    return new_series
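
Retrieving the last N characters amounts to the slice `[-N:]`. The sketch below, in plain Python with made-up sample codes, shows the extraction, the conversion of a numeric suffix, and a caveat for N = 0.

```python
codes = ["sample_042", "sample_107", "x9"]

n = 3
# Slice from index -n until the end of each string.
# Strings shorter than n are returned whole instead of raising an error.
last_chars = [s[-n:] for s in codes]

# A numeric suffix can then be converted with the built-in int().
numbers = [int(s[-n:]) for s in codes[:2]]

# Caveat: -0 equals 0, so s[-0:] returns the entire string, not an empty one.
zero_case = "idsw"[-0:]

print(last_chars, numbers, zero_case)
```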

# **Function for getting the rightest characters from the strings (retrieve first characters)**

In [59]:
def right_characters (string_or_list_of_strings, number_of_characters_to_retrieve = 1, new_variable_type = None):
    
    import numpy as np
    
    # string_or_list_of_strings: string or list of strings (inside quotes), 
    # that will be analyzed. 
    # e.g. string_or_list_of_strings = "column1" will analyze 'column1', whereas 
    # string_or_list_of_strings = ['col1', 'col2'] will process both 'col1' and 'col2'.
    
    # number_of_characters_to_retrieve = 1 - integer representing the total of characters that will
    # be retrieved. Here, we will retrieve the first characters of each string, i.e. the ones at its start
    # (despite the function name, they are the leftmost ones when the text is read from left to right).
    # If number_of_characters_to_retrieve = 1, only the first character will be retrieved.
    # Consider the string 'idsw'.
    # number_of_characters_to_retrieve = 1 - output: 'i';
    # number_of_characters_to_retrieve = 2 - output: 'id'.
    
    # new_variable_type = None. String (in quotes) that represents a given data type for the column
    # after transformation. Set:
    # - new_variable_type = 'int' to convert the extracted column to integer;
    # - new_variable_type = 'float' to convert the column to float (decimal number);
    # - new_variable_type = 'datetime' to convert it to date or timestamp;
     
    # So, if the first part of the strings is a number, you can use this argument to directly extract
    # this part as numeric variable.
    
    
    # Check if a string was passed. If it was, convert it to list of single element:
    if (type(string_or_list_of_strings) == str):
        list_of_strings = [string_or_list_of_strings]
    
    else: # simply convert the iterable to the new standard name:
        list_of_strings = list(string_or_list_of_strings)
    
    # Now, we have a local copy as a list.
    # As we are dealing with strings and not Pandas dataframes, we do not call the str attribute.
    
    # Python string slicing is used here; no Pandas method is needed, since we are
    # operating on plain Python strings.
    
    if (number_of_characters_to_retrieve is None):
        # set as 1
        number_of_characters_to_retrieve = 1
    
    # first_character_index = 0 would be the index of the first character.
    # If we want the first N = 2 characters, we should go from index 0 to 1, 1 = (N-1);
    # If we want the first N = 3 characters, we should go from index 0 to 2, 2 = (N-1);
    # If we want only the first (N = 1) character, we should go from 0 to 0, 0 = (N-1).
    
    # N = number_of_characters_to_retrieve
    last_character_index = number_of_characters_to_retrieve - 1
    
    # Perform the slicing without setting the limit, to slice from the 1st character:
    new_series = [string[:(last_character_index + 1)] for string in list_of_strings]
    # If no step is specified, step = 1
    
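    # Check if the series type should be modified:
    if (new_variable_type is not None):
        # try converting the type:
        try:
            if (new_variable_type == 'int'):

                new_series = np.int64(new_series)

            elif (new_variable_type == 'float'):

                new_series = np.float64(new_series)

            elif (new_variable_type == 'datetime'):
                # np.datetime64 only accepts a single value, so build an array
                # with the datetime64 dtype to convert the whole list:
                new_series = np.array(new_series, dtype = 'datetime64')
        
            print(f"Successfully converted the strings to the type {new_variable_type}.\n")
        
        except Exception as error:
            # Keep the strings unchanged, but report why the conversion failed:
            print(f"Could not convert the strings to the type {new_variable_type}: {error}\n")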
    

    # Now, we are in the main code.
    print(f"Finished extracting the {number_of_characters_to_retrieve} rightest characters.")
    print("Check the 10 first strings:\n")
    
    try:
        # only works in Jupyter Notebook:
        from IPython.display import display
        display(new_series[:10])
            
    except: # regular mode
        print(new_series[:10])
    
    return new_series
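
Retrieving the first N characters amounts to the slice `[:N]`. The sketch below uses made-up strings that start with an ISO date and converts the extracted prefix with the standard library; it is an illustration in plain Python, not a call to the function above.

```python
from datetime import date

stamps = ["2023-01-15 report", "2024-11-02 summary"]

# Slice from the start of each string up to (but excluding) index 4.
first_four = [s[:4] for s in stamps]

# A numeric prefix can be converted with the built-in int().
years = [int(s[:4]) for s in stamps]

# The first 10 characters hold a date in ISO format (YYYY-MM-DD).
dates = [date.fromisoformat(s[:10]) for s in stamps]

# Asking for more characters than the string has simply returns the whole string.
short = "ab"[:5]

print(first_four, years, dates, short)
```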

# **Function for joining list of strings into a single string**

In [45]:
def join_list_of_strings (string_or_list_of_strings, separator = " "):
    
    import numpy as np
    
    # string_or_list_of_strings: string or list of strings (inside quotes), 
    # that will be analyzed. 
    # e.g. string_or_list_of_strings = "column1" will analyze 'column1', whereas 
    # string_or_list_of_strings = ['col1', 'col2'] will process both 'col1' and 'col2'.
    
    # separator = " " - string containing the separator. Suppose the column contains the
    # strings: 'a', 'b', 'c', 'd'. If the separator is the empty string '', the output will be:
    # 'abcd' (no separation). If separator = " " (simple whitespace), the output will be 'a b c d'
    
    
    if (separator is None):
        # make it a whitespace:
        separator = " "
    
    # Check if a string was passed. If it was, convert it to list of single element:
    if (type(string_or_list_of_strings) == str):
        list_of_strings = [string_or_list_of_strings]
    
    else: # simply convert the iterable to the new standard name:
        list_of_strings = list(string_or_list_of_strings)
    
    # Now, we have a local copy as a list.
    # As we are dealing with strings and not Pandas dataframes, we do not call the str attribute.
    
    concat_string = separator.join(list_of_strings)
    # sep.join(list_of_strings) method: join all the strings, separating them by sep.

    # Now, we are in the main code.
    print(f"Finished joining strings.")
    print("Check the 10 first characters of the new string:\n")
    
    try:
        # only works in Jupyter Notebook:
        from IPython.display import display
        display(concat_string[:10])
            
    except: # regular mode
        print(concat_string[:10])
    
    return concat_string
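
The separator examples from the comments can be checked directly with `str.join`. The sketch below also shows that `join` only accepts strings, so other values must be converted first; the list `mixed` is a made-up example.

```python
letters = ["a", "b", "c", "d"]

no_sep = "".join(letters)     # empty separator: no separation
spaced = " ".join(letters)    # single whitespace between elements
csv_line = ",".join(letters)  # any string can act as separator

# join requires every element to be a string; other types raise TypeError.
mixed = ["id", 7, 3.5]
try:
    "-".join(mixed)
    join_failed = False
except TypeError:
    join_failed = True

# Converting each element with str() avoids the error.
safe = "-".join(str(item) for item in mixed)

print(no_sep, spaced, csv_line, join_failed, safe)
```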

# **Function for splitting strings into a list of strings**

In [46]:
def split_strings (string_or_list_of_strings, separator = " "):
    
    import numpy as np
    
    # string_or_list_of_strings: string or list of strings (inside quotes), 
    # that will be analyzed. 
    # e.g. string_or_list_of_strings = "column1" will analyze 'column1', whereas 
    # string_or_list_of_strings = ['col1', 'col2'] will process both 'col1' and 'col2'.
   
    # separator = " " - string containing the separator. Suppose the column contains the
    # string: 'a b c d' on a given row. If the separator is whitespace ' ', 
    # the output will be a list: ['a', 'b', 'c', 'd']: the function splits the string into a list
    # of strings (one list per row) every time it finds the separator.
    
    
    if (separator is None):
        # make it a whitespace:
        separator = " "
        
    # Check if a string was passed. If it was, convert it to list of single element:
    if (type(string_or_list_of_strings) == str):
        list_of_strings = [string_or_list_of_strings]
    
    else: # simply convert the iterable to the new standard name:
        list_of_strings = list(string_or_list_of_strings)
    
    # Now, we have a local copy as a list.
    # As we are dealing with strings and not Pandas dataframes, we do not call the str attribute.
    
    # Split each string from list_of_strings, getting one list of substrings per input string:
    new_series = [string.split(sep = separator) for string in list_of_strings]

    # Now, we are in the main code.
    print(f"Finished splitting strings.")
    print("Check the 10 first strings:\n")
    
    try:
        # only works in Jupyter Notebook:
        from IPython.display import display
        display(new_series[:10])
            
    except: # regular mode
        print(new_series[:10])
    
    return new_series
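
Splitting with an explicit separator behaves differently from splitting on generic whitespace, which matters when the input has repeated spaces. The sketch below illustrates this with plain `str.split`; the sample strings are made up.

```python
line = "a b  c"  # note the two spaces between 'b' and 'c'

# Explicit separator: every single space is a boundary, so an empty string appears.
explicit = line.split(" ")

# No separator: runs of whitespace are treated as one boundary.
whitespace = line.split()

# An optional maxsplit argument limits the number of splits.
limited = "key=value=more".split("=", 1)

# Splitting and joining with the same explicit separator restores the original.
round_trip = " ".join(line.split(" ")) == line

print(explicit, whitespace, limited, round_trip)
```

Since the function above always passes an explicit separator, inputs with repeated separators will yield empty strings in the output lists.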

# **Function for substituting (replacing or switching) whole strings by different text values (on string variables)**

In [47]:
def switch_strings (string_or_list_of_strings, list_of_dictionaries_with_original_strings_and_replacements = [{'original_string': None, 'new_string': None}, {'original_string': None, 'new_string': None}, {'original_string': None, 'new_string': None}, {'original_string': None, 'new_string': None}, {'original_string': None, 'new_string': None}, {'original_string': None, 'new_string': None}, {'original_string': None, 'new_string': None}, {'original_string': None, 'new_string': None}, {'original_string': None, 'new_string': None}, {'original_string': None, 'new_string': None}, {'original_string': None, 'new_string': None}]):
    
    import numpy as np
    
    # string_or_list_of_strings: string or list of strings (inside quotes), 
    # that will be analyzed. 
    # e.g. string_or_list_of_strings = "column1" will analyze 'column1', whereas 
    # string_or_list_of_strings = ['col1', 'col2'] will process both 'col1' and 'col2'.
    
    # list_of_dictionaries_with_original_strings_and_replacements = 
    # [{'original_string': None, 'new_string': None}]
    # This is a list of dictionaries, where each dictionary contains two key-value pairs:
    # the first one contains the original string; and the second one contains the new string
    # that will substitute the original one. The function will loop through all dictionaries in
    # this list, access the values of the keys 'original_string', and search these values on the strings
    # in string_or_list_of_strings. When the value is found, it will be replaced (switched) by the
    # corresponding value in key 'new_string'.
    
    # The object list_of_dictionaries_with_original_strings_and_replacements must be declared as a list, 
    # in brackets, even if there is a single dictionary.
    # Always use the same keys: 'original_string' for the original strings to search in 
    # string_or_list_of_strings; and 'new_string', for the strings that will replace the original ones.
    # Notice that this function is meant to switch whole strings, not substrings: a value should be
    # substituted only when there is perfect correspondence between the input string and 'original_string'.
    # So, the cases (upper or lower) must be the same.
    
    # If you want, you can remove elements (dictionaries) from the list to declare fewer elements;
    # and you can also add more elements (dictionaries) to the lists, if you need to replace more
    # values.
    # Simply put a comma after the last element from the list and declare a new dictionary, keeping the
    # same keys: {'original_string': original_str, 'new_string': new_str}, 
    # where original_str and new_str represent the strings for searching and replacement 
    # (If one of the keys contains None, the new dictionary will be ignored).
    
    # Example:
    # Suppose string_or_list_of_strings contains the values 'sunday', 'monday', 'tuesday', 'wednesday',
    # 'thursday', 'friday', 'saturday', but you want to obtain data labelled as 'weekend' or 'weekday'.
    # Set: list_of_dictionaries_with_original_strings_and_replacements = 
    # [{'original_string': 'sunday', 'new_string': 'weekend'},
    # {'original_string': 'saturday', 'new_string': 'weekend'},
    # {'original_string': 'monday', 'new_string': 'weekday'},
    # {'original_string': 'tuesday', 'new_string': 'weekday'},
    # {'original_string': 'wednesday', 'new_string': 'weekday'},
    # {'original_string': 'thursday', 'new_string': 'weekday'},
    # {'original_string': 'friday', 'new_string': 'weekday'}]
    
    
    # Check if a string was passed. If it was, convert it to list of single element:
    if (type(string_or_list_of_strings) == str):
        list_of_strings = [string_or_list_of_strings]
    
    else: # simply convert the iterable to the new standard name:
        list_of_strings = list(string_or_list_of_strings)
    
    # Now, we have a local copy as a list.
    # As we are dealing with strings and not Pandas dataframes, we do not call the str attribute.
    
    print("ATTENTION: Operations of string strip (removal) or replacement are all case-sensitive. There must be correct correspondence between cases and spaces for the strings being removed or replaced.\n")
     
    # Create the mapping dictionary used for the replacements:
    mapping_dict = {}
    # The key of the mapping dict must be the original string, whereas the value must be the new string
    # that will replace it.
        
    # Loop through each dictionary on the list list_of_dictionaries_with_original_strings_and_replacements:
    for dictionary in list_of_dictionaries_with_original_strings_and_replacements:
            
        # access 'original_string' and 'new_string' keys from the dictionary:
        original_string = dictionary['original_string']
        new_string = dictionary['new_string']
        
        # check if they are not None:
        if ((original_string is not None) and (new_string is not None)):
            
            #Guarantee that both are read as strings:
            original_string = str(original_string)
            new_string = str(new_string)
            
            # add them to the mapping dictionary, using the original_string as key and
            # new_string as the correspondent value:
            mapping_dict[original_string] = new_string
    
    # Now, the input list was converted into a dictionary with the correct format for the method.
    # Check if there is at least one key in the dictionary:
    if (len(mapping_dict) > 0):
        # len of a dictionary returns the number of key:value pairs stored. If nothing is stored,
        # len = 0. The dictionary.keys() method (no arguments in parentheses) returns a view of
        # the keys; whereas the dictionary.values() method returns a view of the values.
        
        for original_string, new_string in mapping_dict.items():
            
            # For plain strings, each call of the replace method performs a single substitution.
            # It is different from the Pandas Series.str.replace method, which processes a whole
            # column in a single call. So, let's re-create the list for each key-value pair:
            # https://www.w3schools.com/python/ref_string_replace.asp
            list_of_strings = [string.replace(original_string, new_string) for string in list_of_strings]
        
        # Now, we are in the main code.
        print(f"Finished replacing the substrings according to the mapping: {mapping_dict}.")
        print("Check the 10 first strings:\n")
    
        try:
            # only works in Jupyter Notebook:
            from IPython.display import display
            display(list_of_strings[:10])

        except: # regular mode
            print(list_of_strings[:10])

        return list_of_strings
    
    else:
        print("Input at least one dictionary containing a pair of original string, in the key \'original_string\', and the correspondent new string as key \'new_string\'.")
        print("The dictionaries must be elements from the list list_of_dictionaries_with_original_strings_and_replacements.\n")
        
        return "error"
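
The sequential replacement performed by the function above can be illustrated with a minimal sketch. The helper name `replace_substrings` and the sample data are hypothetical and used only for illustration; the input format (a list of dictionaries with the keys 'original_string' and 'new_string') follows the function above.

```python
def replace_substrings(strings, replacements):
    """Apply each {'original_string', 'new_string'} pair, in order, to every string."""
    result = list(strings)
    for pair in replacements:
        original = pair.get('original_string')
        new = pair.get('new_string')
        if original is None or new is None:
            continue  # skip incomplete pairs, as the function above does
        # str.replace is case-sensitive and handles one substring per call:
        result = [s.replace(str(original), str(new)) for s in result]
    return result


days = ['monday morning', 'Friday night', 'friday noon']
pairs = [
    {'original_string': 'monday', 'new_string': 'weekday'},
    {'original_string': 'friday', 'new_string': 'weekday'},
    {'original_string': None, 'new_string': None},
]
cleaned = replace_substrings(days, pairs)
print(cleaned)  # ['weekday morning', 'Friday night', 'weekday noon']
```

Notice that 'Friday night' is left untouched because the replacement is case-sensitive. Since the pairs are applied one after the other, a later pair also acts on the output of an earlier one.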

# **Function for string replacement with Machine Learning: find similar strings and replace them by standard strings**

In [48]:
def string_replacement_ml (string_or_list_of_strings, mode = 'find_and_replace', threshold_for_percent_of_similarity = 80.0, list_of_dictionaries_with_standard_strings_for_replacement = [{'standard_string': None}, {'standard_string': None}, {'standard_string': None}, {'standard_string': None}, {'standard_string': None}, {'standard_string': None}, {'standard_string': None}, {'standard_string': None}, {'standard_string': None}, {'standard_string': None}, {'standard_string': None}]):
    
    import numpy as np
    from fuzzywuzzy import process
    
    # string_or_list_of_strings: string or list of strings (inside quotes), 
    # that will be analyzed. 
    # e.g. string_or_list_of_strings = "column1" will analyze 'column1', whereas 
    # string_or_list_of_strings = ['col1', 'col2'] will process both 'col1' and 'col2'.
    
    # mode = 'find_and_replace' will find similar strings; and switch them by one of the
    # standard strings if the similarity between them is greater than or equal to the threshold.
    # Alternatively: mode = 'find' will only find the similar strings by calculating the similarity.
    
    # threshold_for_percent_of_similarity = 80.0 - 0.0% means no similarity and 100% means equal strings.
    # The threshold_for_percent_of_similarity is the minimum similarity calculated from the
    # Levenshtein (minimum edit) distance algorithm. This distance represents the minimum number of
    # insertion, substitution or deletion of characters operations that are needed for making two
    # strings equal.
    
    # list_of_dictionaries_with_standard_strings_for_replacement =
    # [{'standard_string': None}]
    # This is a list of dictionaries, where each dictionary contains a single key-value pair:
    # the key must always be 'standard_string', and the value will be one of the standard strings 
    # for replacement: if a given string in string_or_list_of_strings has a similarity with one 
    # of the standard strings greater than or equal to threshold_for_percent_of_similarity, it will be
    # substituted by this standard string.
    # For instance, suppose you have a word written in too many ways, making it difficult to use
    # the function switch_strings: "EU" , "eur" , "Europ" , "Europa" , "Erope" , "Evropa" ...
    # You can use this function to search strings similar to "Europe" and replace them.
    
    # The function will loop through all dictionaries in
    # this list, access the values of the keys 'standard_string', and compare these values with the
    # strings in string_or_list_of_strings. When a similar string is found, it will be replaced
    # (switched) if the similarity is sufficiently high.
    
    # The object list_of_dictionaries_with_standard_strings_for_replacement must be declared as a list, 
    # in brackets, even if there is a single dictionary.
    # Always use the same key: 'standard_string'.
    # Notice that this function performs fuzzy matching, so it MAY MATCH substrings and strings
    # written with different cases (upper or lower) when these portions or modifications make the
    # strings sufficiently similar to each other.
    
    # If you want, you can remove elements (dictionaries) from the list to declare fewer elements;
    # and you can also add more elements (dictionaries) to the lists, if you need to replace more
    # values.
    # Simply put a comma after the last element from the list and declare a new dictionary, keeping the
    # same key: {'standard_string': other_std_str}, 
    # where other_std_str represents the string for searching and replacement 
    # (If the value stored in the key is None, the new dictionary will be ignored).
    
    # Example:
    # Suppose string_or_list_of_strings contains the values 'California', 'Cali', 'Calefornia', 
    # 'Calefornie', 'Californie', 'Calfornia', 'Calefernia', 'New York', 'New York City', 
    # but you want to obtain data labelled as the state 'California' or 'New York'.
    # Set: list_of_dictionaries_with_standard_strings_for_replacement = 
    # [{'standard_string': 'California'},
    # {'standard_string': 'New York'}]
    
    # ATTENTION: It is advisable to first search for the similarities (mode = 'find') to identify
    # the best similarity threshold; set it as high as possible, avoiding incorrect substitutions in
    # a gray area; and only then perform the replacement. This avoids leaving original incorrect
    # strings in the output dataset, as well as wrong replacements (replacement by one of the
    # standard strings which is not the correct one).
    
    
    print("Performing fuzzy replacement based on the Levenshtein (minimum edit) distance algorithm.")
    print("This distance represents the minimum number of insertion, substitution or deletion of characters operations that are needed for making two strings equal.\n")
    
    print("This means that substrings or different cases (upper or lower) may be searched and replaced, as long as the similarity threshold is reached.\n")
    
    print("ATTENTION!\n")
    print("It is advisable to first search for the similarities (mode = 'find') to identify the best similarity threshold.\n")
    print("Set the threshold as high as possible, and only then perform the replacement.\n")
    print("This avoids leaving original incorrect strings in the output dataset, as well as wrong replacements (replacement by one of the standard strings which is not the correct one).\n")
    
    # Check if a string was passed. If it was, convert it to list of single element:
    if (type(string_or_list_of_strings) == str):
        list_of_strings = [string_or_list_of_strings]
    
    else: # simply convert the iterable to the new standard name:
        list_of_strings = list(string_or_list_of_strings)
    
    # Now, we have a local copy as a list.
    # As we are dealing with strings and not Pandas dataframes, we do not call the str attribute.

    # Get the unique values present in list_of_strings:
    # Convert the list to a set: sets accept only unique elements. These objects are based on
    # mathematical set theory.
    unique_types = set(list_of_strings)
    
    # Create the summary_list:
    summary_list = []
        
    # Loop through each element on the list list_of_dictionaries_with_standard_strings_for_replacement:
    
    for i in range (0, len(list_of_dictionaries_with_standard_strings_for_replacement)):
        # from i = 0 to i = len(list_of_dictionaries_with_standard_strings_for_replacement) - 1, index of the
        # last element from the list
            
        # pick the i-th dictionary from the list:
        dictionary = list_of_dictionaries_with_standard_strings_for_replacement[i]
            
        # access 'standard_string' key from the dictionary:
        standard_string = dictionary['standard_string']
        
        # check if it is not None:
        if (standard_string is not None):
            
            # Guarantee that it was read as a string:
            standard_string = str(standard_string)
            
            # Calculate the similarity between each one of the unique_types and standard_string:
            similarity_list = process.extract(standard_string, unique_types, limit = len(unique_types))
            
            # Add the similarity list to the dictionary:
            dictionary['similarity_list'] = similarity_list
            # This is a list of tuples with the format (tested_string, percent_of_similarity_with_standard_string)
            # e.g. ('asiane', 92) for checking similarity with string 'asian'
            
            if (mode == 'find_and_replace'):
                
                # If an invalid value was set for threshold_for_percent_of_similarity, correct it to 80% standard:
                
                if(threshold_for_percent_of_similarity is None):
                    threshold_for_percent_of_similarity = 80.0
                
                # NaN never compares equal to itself, so np.isnan must be used to detect it:
                if (np.isnan(threshold_for_percent_of_similarity) or (threshold_for_percent_of_similarity < 0)):
                    threshold_for_percent_of_similarity = 80.0
                
                list_of_replacements = []
                # Let's replace the matches in the series by the standard_string:
                # Iterate through the list of matches
                for match in similarity_list:
                    # Check whether the similarity score is greater than or equal to threshold_for_percent_of_similarity.
                    # The similarity score is the second element (index 1) from the tuples:
                    if (match[1] >= threshold_for_percent_of_similarity):
                        # If it is, replace every string spelled exactly as match[0]
                        # (1st tuple element) by standard_string. The whole-string comparison avoids
                        # corrupting longer strings that merely contain match[0] as a substring:
                        list_of_strings = [standard_string if (string == match[0]) else string for string in list_of_strings]
        
                        print(f"Found {match[1]}% of similarity between {match[0]} and {standard_string}.")
                        print(f"Then, {match[0]} was replaced by {standard_string}.\n")
                        
                        # Add match to the list of replacements:
                        list_of_replacements.append(match)
                
                # Add the list_of_replacements to the dictionary, if its length is higher than zero:
                if (len(list_of_replacements) > 0):
                    dictionary['list_of_replacements_by_std_str'] = list_of_replacements
            
            # Add the dictionary to the summary_list:
            summary_list.append(dictionary)
      
    # Now, let's report the results according to the selected mode:
    if (mode == 'find_and_replace'):
    
        # Now, we are in the main code.
        print(f"Finished replacing the strings by the provided standards. Returning the new list and a summary list.\n")
        print("In summary_list, you can check the calculated similarities in keys \'similarity_list\' from the dictionaries.\n")
        print("The similarity list is a list of tuples, where the first element is the string compared against the value on key \'standard_string\'; and the second element is the similarity score, the percent of similarity between the tested and the standard string.\n")
        print("Check the 10 first strings:\n")
    
        try:
            # only works in Jupyter Notebook:
            from IPython.display import display
            display(list_of_strings[:10])

        except: # regular mode
            print(list_of_strings[:10])
    
    else:
        
        print("Finished mapping similarities. Returning the original list and a summary list.\n")
        print("Check the similarities below, in keys \'similarity_list\' from the dictionaries.\n")
        print("The similarity list is a list of tuples, where the first element is the string compared against the value on key \'standard_string\'; and the second element is the similarity score, the percent of similarity between the tested and the standard string.\n")
        
        try:
            # only works in Jupyter Notebook:
            from IPython.display import display
            display(summary_list)
        except: # regular mode
            print(summary_list)
    
    return list_of_strings, summary_list
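
To make the idea of a similarity threshold concrete, here is a minimal, dependency-free sketch of the Levenshtein (minimum edit) distance described in the comments above. It is not the scorer used by fuzzywuzzy's `process.extract`, which combines several ratios, so the scores will generally differ from the ones the function reports. The names `levenshtein_distance`, `percent_similarity` and `standardize` are hypothetical and introduced only for illustration.

```python
def levenshtein_distance(a, b):
    """Minimum number of single-character insertions, deletions or substitutions."""
    previous = list(range(len(b) + 1))  # distances from '' to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ''
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def percent_similarity(a, b):
    """Simple normalization (an assumption): 100% for equal strings, 0% for fully different ones."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein_distance(a, b) / longest)


def standardize(strings, standard_strings, threshold=80.0):
    """Replace each whole string by its most similar standard string, if above the threshold."""
    result = []
    for s in strings:
        best = max(standard_strings, key=lambda std: percent_similarity(s, std))
        result.append(best if percent_similarity(s, best) >= threshold else s)
    return result


states = ['Calefornia', 'Cali', 'New York City', 'Calfornia']
print(standardize(states, ['California', 'New York']))
```

With this simplified score, 'Calefornia' and 'Calfornia' are one edit away from 'California' and are replaced, whereas 'Cali' and 'New York City' stay below the 80% threshold and are kept as they are. This illustrates why it is worth inspecting the similarities before choosing the threshold.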

# **Function for searching for Regular Expression (RegEx) within a string column**
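
As a quick orientation, the sketch below searches a regular expression within each string of a list using only the standard `re` module. The helper name `find_regex_in_strings` and the sample list are hypothetical; this is only an illustration of the idea, not the notebook's own implementation.

```python
import re


def find_regex_in_strings(strings, regex):
    """Return, for each string, the list of all non-overlapping matches of regex."""
    pattern = re.compile(regex)
    return [pattern.findall(s) for s in strings]


reviews = ["4506 people attend the show", "Yesterday, I saw 3 shows", "No numbers here"]
matches = find_regex_in_strings(reviews, r"\d+")
print(matches)  # [['4506'], ['3'], []]

# re.search scans the whole string, whereas re.match only tries the beginning:
print(re.search(r"\d+", reviews[1]).group(0))  # 3
print(re.match(r"\d+", reviews[1]))            # None
```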

In [49]:
class regex_help:

    def __init__ (self, start_helper = True, helper_screen = 0):
        
        # from DataCamp course Regular Expressions in Python
        # https://www.datacamp.com/courses/regular-expressions-in-python#!

        self.start_helper = start_helper
        self.helper_screen = helper_screen
        
        self.helper_menu_1 = """

Regular Expressions (RegEx) Helper
                
Input the number in the text box and press enter to visualize help and examples for a topic:

    1. regex basic theory and most common metacharacters
    2. regex quantifiers
    3. regex anchoring and finding
    4. regex greedy and non-greedy search
    5. regex grouping and capturing
    6. regex alternating and non-capturing groups
    7. regex backreferences
    8. regex lookaround
    9. print all topics at once
    10. Finish regex helper
    
    """
        
        # regex basic theory and most common metacharacters
        self.help_text_1 = r"""
REGular EXpression or regex:
String containing a combination of normal characters and special metacharacters that
describes patterns to find text or positions within a text.

Example:

r'st\d\s\w{3,10}'
- In Python, the r at the beginning indicates a raw string. It is always advisable to use it.
- We said that a regex contains normal characters, or literal characters we already know. 
    - The normal characters match themselves. 
    - In the case shown above, 'st' exactly matches an 's' followed by a 't'.

- Most important metacharacters:
    - \d: digit (number);
    - \D: non-digit;
    - \s: whitespace;
    - \s+: one or more consecutive whitespaces.
    - \S: non-whitespace;
    - \w: (word) character;
    - \W: non-word character.
    - {N,M}: indicates that the character on the left appears from N to M consecutive times (no space after the comma).
        - \w{3,10}: a word character that appears 3, 4, 5,..., or 10 consecutive times.
    - {N}: indicates that the character on the left appears exactly N consecutive times.
        - \d{4}: a digit appears 4 consecutive times.
    - {N,}: indicates that the character appears at least N times.
        - \d{4,}: a digit appears 4 or more times.
        - phone_number = "John: 1-966-847-3131 Michelle: 54-908-42-42424"
        - re.findall(r"\d{1,2}-\d{3}-\d{2,3}-\d{4,}", phone_number) - returns: ['1-966-847-3131', '54-908-42-42424']

ATTENTION: Using metacharacters in regular expressions will allow you to match types of characters such as digits. 
- You can encounter many forms of whitespace such as tabs, space or new line. 
- To make sure you match all of them always specify whitespaces as \s.

re module: Python standard library module to search regex within individual strings.

- .findall method: search all occurrences of the regex within the string, returning a list of strings.
- Syntax: re.findall(r"regex", string)
    - Example: re.findall(r"#movies", "Love #movies! I had fun yesterday going to the #movies")
        - Returns: ['#movies', '#movies']

- .split method: splits the string at each occurrence of the regex, returning a list of strings.
- Syntax: re.split(r"regex", string)
    - Example: re.split(r"!", "Nice Place to eat! I'll come back! Excellent meat!")
        - Returns: ['Nice Place to eat', " I'll come back", ' Excellent meat', '']

- .sub method: replace one or many matches of the regex with a given string (returns a replaced string).
- Syntax: re.sub(r"regex", new_substring, original_string)
    - Example: re.sub(r"yellow", "nice", "I have a yellow car and a yellow house in a yellow neighborhood")
    - Returns: 'I have a nice car and a nice house in a nice neighborhood'

- .search and .match methods: they have the same syntax and are used to find a match. 
    - Both methods return an object with the match found. 
    - The difference is that .match is anchored at the beginning of the string.
- Syntax: re.search(r"regex", string) and re.match(r"regex", string)
    - Example 1: re.search(r"\d{4}", "4506 people attend the show")
    - Returns: <re.Match object; span=(0, 4), match='4506'>
    - re.match(r"\d{4}", "4506 people attend the show")
    - Returns: <re.Match object; span=(0, 4), match='4506'>
        - In this example, we use both methods to find a digit appearing four times. 
        - Both methods return an object with the match found.
    
    - Example 2: re.search(r"\d+", "Yesterday, I saw 3 shows")
    - Returns: <re.Match object; span=(17, 18), match='3'>
    - re.match(r"\d+","Yesterday, I saw 3 shows")
    - Returns: None
        - In this example, we used them to find a match for one or more digits. 
        - In this case, .search finds a match, but .match does not. 
        - This is because the first characters do not match the regex.

- .group method: detailed in Section 7 (Backreferences).
    - Retrieves the groups captured.
- Syntax: searched_string = re.search(r"regex", string)
    searched_string.group(N) - returns the N-th group captured (group 0 is the entire match).
    
    Example: text = "Python 3.0 was released on 12-03-2008."
    information = re.search(r'(\d{1,2})-(\d{2})-(\d{4})', text)
    information.group(3) - returns: '2008'
- .group is a method of the match objects returned by .search and .match.

Examples of regex:

1. re.findall(r"User\d", "The winners are: User9, UserN, User8")
    ['User9', 'User8']
2. re.findall(r"User\D", "The winners are: User9, UserN, User8")
    ['UserN']
3. re.findall(r"User\w", "The winners are: User9, UserN, User8")
    ['User9', 'UserN', 'User8']
4. re.findall(r"\W\d", "This skirt is on sale, only $5 today!")
    ['$5']
5. re.findall(r"Data\sScience", "I enjoy learning Data Science")
    ['Data Science']
6. re.sub(r"ice\Scream", "ice cream", "I really like ice-cream")
    'I really like ice cream'

7. regex that matches the user mentions that starts with @ and follows the pattern @robot3!.

regex = r"@robot\d\W"

8. regex that matches the number of user mentions given as, for example: User_mentions:9.

regex = r"User_mentions:\d"

9. regex that matches the number of likes given as, for example, likes: 5.

regex = r"likes:\s\d"

10. regex that matches the number of retweets given as, for example, number of retweets: 4.

regex = r"number\sof\sretweets:\s\d"

11. regex that matches a separator formed by a non-word character, a digit, the word break, and another non-word character, e.g. &4break!.

regex_sentence = r"\W\dbreak\W"

12. regex that matches the pattern #newH

regex_words = r"\Wnew\w"

"""

        # regex quantifiers
        self.help_text_2 = r"""
Quantifiers: 
A metacharacter that tells the regex engine how many times to match a character immediately to its left.

    1. +: Once or more times.
        - text = "Date of start: 4-3. Date of registration: 10-04."
        - re.findall(r"\d+-\d+", text) - returns: ['4-3', '10-04']
        - Again, \s+ represents one or more consecutive whitespaces.
    2. *: Zero times or more.
        - my_string = "The concert was amazing! @ameli!a @joh&&n @mary90"
        - re.findall(r"@\w+\W*\w+", my_string) - returns: ['@ameli!a', '@joh&&n', '@mary90']
    3. ?: Zero times or once.
        - text = "The color of this image is amazing. However, the colour blue could be brighter."
        - re.findall(r"colou?r", text) - returns: ['color', 'colour']
    
The quantifier refers to the character immediately on the left:
- r"apple+" : + applies to 'e' and not to 'apple'.

Examples of regex:

1. Most of the time, links start with 'http' and do not contain any whitespace, e.g. https://www.datacamp.com. 
- regex to find all the matches of http links appearing:
    - regex = r"http\S+"
    - \S is very useful when you know that a pattern contains no spaces: the first whitespace found marks the end of the match.

2. User mentions in Twitter start with @ and can have letters and numbers only, e.g. @johnsmith3.
- regex to find all the matches of user mentions appearing:
    - regex = r"@\w*\d*"

3. regex that finds all dates in a format similar to 27 minutes ago or 4 hours ago.
- regex = r"\d{1,2}\s\w+\sago"

4. regex that finds all dates in a format similar to 23rd june 2018.
- regex = r"\d{1,2}\w{2}\s\w+\s\d{4}"

5. regex that finds all dates in a format similar to 1st september 2019 17:25.
- regex = r"\d{1,2}\w{2}\s\w+\s\d{4}\s\d{1,2}:\d{2}"

6. Hashtags start with a # symbol and contain letters and numbers but never whitespace.
- regex that matches the described hashtag pattern.
    - regex = r"#\w+"
    
"""

        # regex anchoring and finding
        self.help_text_3 = r"""
- Anchoring and Finding Metacharacters

    1. . (dot): Match any character (except newline).
        - my_links = "Just check out this link: www.amazingpics.com. It has amazing photos!"
        - re.findall(r"www.+com", my_links) - returns: ['www.amazingpics.com']
            - The dot . metacharacter is very useful when we want to match all repetitions of any character. 
            - However, we need to be very careful how we use it.
    2. ^: Anchoring on start of the string.
        - my_string = "the 80s music was much better that the 90s"
        - If we do re.findall(r"the\s\d+s", my_string) - returns: ['the 80s', 'the 90s']
        - Using ^: re.findall(r"^the\s\d+s", my_string) - returns: ['the 80s']
    3. $: Anchoring at the end of the string.
        - my_string = "the 80s music hits were much better that the 90s"
        - re.findall(r"the\s\d+s$", my_string) - returns: ['the 90s']
    4. \: Escape special characters.
        - my_string = "I love the music of Mr.Go. However, the sound was too loud."
            - re.split(r".\s", my_string) - returns: ['', 'lov', 'th', 'musi', 'o', 'Mr.Go', 'However', 'th', 'soun', 'wa', 'to', 'loud.']
            - re.split(r"\.\s", my_string) - returns: ['I love the music of Mr.Go', 'However, the sound was too loud.']
    5. |: OR Operator
        - my_string = "Elephants are the world's largest land animal! I would love to see an elephant one day"
        - re.findall(r"Elephant|elephant", my_string) - returns: ['Elephant', 'elephant']
    6. []: set of characters representing the OR Operator.
        Example 1 - my_string = "Yesterday I spent my afternoon with my friends: MaryJohn2 Clary3"
        - re.findall(r"[a-zA-Z]+\d", my_string) - returns: ['MaryJohn2', 'Clary3']
        Example 2 - my_string = "My&name&is#John Smith. I%live$in#London."
        - re.sub(r"[#$%&]", " ", my_string) - returns: 'My name is John Smith. I live in London.'
        
        Note 1: within brackets, the characters to be found should not be separated, as in [#$%&].
            - Whitespaces or other separators would be interpreted as characters to be found.
        Note 2: [a-z] represents all lowercase letters from 'a' to 'z'.
                - [A-Z] represents all uppercase letters from 'A' to 'Z'.
                - Since lower and uppercase are different, we must declare [a-zA-Z] or [A-Za-z] to capture all letters.
                - [0-9] represents all digits from 0 to 9.
                - Something like [a-zA-Z0-9] or [a-z0-9A-Z] will match all letters and all digits.
    
    7. [^ ]: a ^ placed right after the opening bracket negates the set, which then matches any character NOT listed.
        - my_links = "Bad website: www.99.com. Favorite site: www.hola.com"
        - re.findall(r"www[^0-9]+com", my_links) - returns: ['www.hola.com']

Examples of regex:

1. You want to find names of files that appear at the start of the string; 
    - always start with a sequence of 2 or 3 upper or lowercase vowels (a e i o u); 
    - and always finish with the txt ending.
        - Write a regex that matches the pattern of the text file names, e.g. aemyfile.txt.
        # . = match any character
        regex = r"^[aeiouAEIOU]{2,3}.+txt"

2. When a user signs up on the company website, they must provide a valid email address.
    - The company puts some rules in place to verify that the given email address is valid:
    - The first part can contain: Upper A-Z or lowercase letters a-z; 
    - Numbers; Characters: !, #, %, &, *, $, . Must have @. Domain: Can contain any word characters;
    - But only .com ending is allowed. 
        - Write a regular expression to match valid email addresses.
        - Match the regex to the elements contained in emails, and print out the message indicating if it is a valid email or not 
    
    # Write a regex to match a valid email address
    regex = r"^[A-Za-z0-9!#%&*$.]+@\w+\.com"

    for example in emails:
        # Match the regex to the string
        if re.match(regex, example):
            # Complete the format method to print out the result
            print("The email {email_example} is a valid email".format(email_example=example))
        else:
            print("The email {email_example} is invalid".format(email_example=example))
    
    # Notice that we used the .match() method. 
    # The reason is that we want to match the pattern from the beginning of the string.

3. Rules in order to verify valid passwords: it can contain lowercase a-z and uppercase letters A-Z;
    - It can contain numbers; it can contain the symbols: *, #, $, %, !, &, .
    - It must be at least 8 characters long but not more than 20.
        - Write a regular expression to check if the passwords are valid according to the description.
        - Search the elements in the passwords list to find out if they are valid passwords.
        - Print out the message indicating if it is a valid password or not, complete .format() statement.
    
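    # Write a regex to check if the password is valid.
    # ^ and $ anchor the pattern to the whole string, so passwords longer than 20 characters are rejected:
    regex = r"^[a-z0-9A-Z*#$%!&.]{8,20}$"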

    for example in passwords:
        # Scan the strings to find a match
        if re.match(regex, example):
            # Complete the format method to print out the result
            print("The password {pass_example} is a valid password".format(pass_example=example))
        else:
            print("The password {pass_example} is invalid".format(pass_example=example))

"""

        # regex greedy and non-greedy search
        self.help_text_4 = r"""
There are two types of matching methods: greedy and non-greedy (also called lazy) operators. 

Greedy operators
- The standard quantifiers are greedy by default, meaning that they will attempt to match as many characters as possible.
    - Standard quantifiers: * , + , ? , {num,num}
    - Example: re.match(r"\d+", "12345bcada") - returns: <re.Match object; span=(0, 5), match='12345'>
    - We can explain this in the following way: our quantifier will start by matching the first digit found, '1'. 
    - Because it is greedy, it will keep going to find 'more' digits and stop only when no other digit can be matched, returning '12345'.
- If the greedy quantifier has matched so many characters that the rest of the pattern cannot match, it will backtrack, giving up characters matched earlier one at a time, and try again. 
- Backtracking is like driving a car without a map. If you drive down a road and hit a dead end, you need to go back to an earlier point and take another street. 
    - Example: re.match(r".*hello", "xhelloxxxxxx") - returns: <re.Match object; span=(0, 6), match='xhello'>
    - We use the greedy quantifier .* to find anything, zero or more times, followed by the letters "h" "e" "l" "l" "o". 
    - We can see here that it returns the pattern 'xhello'. 
    - So our greedy quantifier will start by matching as much as possible, the entire string. 
    - Then, it tries to match the h, but there are no characters left. So it backtracks, giving up one matched character. 
    - Trying again, it still doesn't match the h, so it backtracks one more step repeatedly until it finally matches the h in the regex, and the rest of the characters.

Non-greedy (lazy) operators
- Because they have lazy behavior, non-greedy quantifiers will attempt to match as few characters as needed, returning the shortest match. 
- To obtain non-greedy quantifiers, we can append a question mark at the end of the greedy quantifiers to convert them into lazy. 
    - Example: re.match(r"\d+?", "12345bcada") - returns: <re.Match object; span=(0, 1), match='1'>
    - Now, our non-greedy quantifier will return the pattern '1'. 
    - In this case, our quantifier will start by matching the first digit found, '1'. 
    - Because it is non-greedy, it will stop there, as we stated that we want 'one or more', and 1 is as few as needed.
- Non-greedy quantifiers also backtrack. 
- In this case, if they have matched so few characters that the rest of the pattern cannot match, they backtrack, extend the match by one character at a time, and try again. 
- Repeating the backtracking example with the lazy quantifier .*?, as in re.match(r".*?hello", "xhelloxxxxxx"), we obtain the same match 'xhello'. 
- However, this match is obtained differently: the lazy quantifier first matches as little as possible, nothing, leaving the entire string unmatched. 
- Then it tries to match the 'h', but it doesn't work. 
- So, it backtracks, matching one more character, the 'x'. Then, it tries again, this time matching the 'h', and afterwards, the rest of the regex.

- Even though greedy quantifiers lead to longer matches, they are sometimes the best option. 
- Because lazy quantifiers match as few characters as possible, they may return a shorter match than we expected.
- On the other hand, greedy quantifiers always lead to the longest possible match, which is sometimes not desired. 
    - Example: if you want to extract a word starting with 'a' and ending with 'e' in the string 'I like apple pie', you may think that applying the greedy regex r"a.+e" will return 'apple'. 
    - However, your match will be 'apple pie'. A way to overcome this is to make the quantifier lazy by appending '?', as in r"a.+?e", which will return 'apple'.
    - Making quantifiers lazy by adding '?' to match a shorter pattern is a very important consideration to keep in mind when handling data for text mining.

Examples of regex:

1. You want to extract the number contained in the sentence 'I was born on April 24th'. 
    - A lazy quantifier will make the regex return 2 and 4, because they will match as few characters as needed. 
    - However, a greedy quantifier will return the entire 24 due to its need to match as much as possible.

    1.1. Use a lazy quantifier to match all numbers that appear in the variable sentiment_analysis:
    numbers_found_lazy = re.findall(r"[0-9]+?", sentiment_analysis)
    - Output: ['5', '3', '6', '1', '2']
    
    1.2. Now, use a greedy quantifier to match all numbers that appear in the variable sentiment_analysis.
    numbers_found_greedy = re.findall(r"[0-9]+", sentiment_analysis)
    - Output: ['536', '12']

2.1. Use a greedy quantifier to match text that appears within parentheses in the variable sentiment_analysis.
    
    sentences_found_greedy = re.findall(r"\(.+\)", sentiment_analysis)
    - Output: ["(They were so cute) a few yrs ago. PC crashed, and now I forget the name of the site ('I'm crying)"]

2.2. Now, use a lazy quantifier to match text that appears within parentheses in the variable sentiment_analysis.

    sentences_found_lazy = re.findall(r"\(.+?\)", sentiment_analysis)
    - Output: ["(They were so cute)", "('I'm crying)"]
    
"""

        # regex grouping and capturing
        # (raw string: keeps the backslashes of the regex examples literal)
        self.help_text_5 = r"""
Capturing groups in regular expressions
- Let's say that we have the following text:
    
    text = "Clary has 2 friends who she spends a lot time with. Susan has 3 brothers while John has 4 sisters."
    
- We want to extract information about a person, how many and which type of relationships they have. 
- So, we want to extract Clary 2 friends, Susan 3 brothers and John 4 sisters.
- If we do: re.findall(r'[A-Za-z]+\s\w+\s\d+\s\w+', text), the output will be: ['Clary has 2 friends', 'Susan has 3 brothers', 'John has 4 sisters']
    - The output is quite close, but we do not want the word 'has'.

- We start simple, by trying to extract only the names. We can place parentheses to group those characters, capture them, and retrieve only that group:
    - re.findall(r'([A-Za-z]+)\s\w+\s\d+\s\w+', text) - returns: ['Clary', 'Susan', 'John']
- Actually, we can place parentheses around the three groups that we want to capture. 
    - re.findall(r'([A-Za-z]+)\s\w+\s(\d+)\s(\w+)', text)
    
    - Each group will receive a number: 
        - The entire expression will always be group 0. 
        - The first group: 1; the second: 2; and the third: 3.
    
    - The result returned is: [('Clary', '2', 'friends'), ('Susan', '3', 'brothers'), ('John', '4', 'sisters')]
        - We got a list of tuples: 
            - The first element of each tuple is the match captured corresponding to group 1. 
            - The second, to group 2. The last, to group 3.
    
    - We can use capturing groups to match a specific subpattern in a pattern. 
    - We can use this information for retrieving the groups by numbers; or to organize data.
        - Example: pets = re.findall(r'([A-Za-z]+)\s\w+\s(\d+)\s(\w+)', "Clary has 2 dogs but John has 3 cats")
                    pets[0][0] == 'Clary'
                    - In the code, we placed the parentheses to capture the name of the owner, the number and which type of pets each one has. 
                    - We can access the information retrieved by using indexing and slicing as seen in the code. 
   
- Capturing groups have one important feature. 
    - Remember that quantifiers apply to the character immediately to the left. 
    - So, we can place parentheses to group characters and then apply the quantifier to the entire group. 
    
    Example: re.search(r"(\d[A-Za-z])+", "My user name is 3e4r5fg")
        - returns: <re.Match object; span=(16, 22), match='3e4r5f'>
        - In the code, we have placed parentheses to match the group containing a number and any letter. 
        - We applied the plus quantifier to specify that we want this group repeated once or more times. 
    
- ATTENTION: It's not the same to capture a repeated group AND to repeat a capturing group. 
    
    my_string = "My lucky numbers are 8755 and 33"
    - re.findall(r"(\d)+", my_string) - returns: ['5', '3']
    - re.findall(r"(\d+)", my_string) - returns: ['8755', '33']
    
    - In the first code, we use findall to match a capturing group containing one digit. 
        - We want this capturing group to be repeated once or more times. 
        - We get 5 and 3 as an output because a repeated capturing group keeps only its last repetition: the last digit of '8755' and the last digit of '33'. 
    - In the second code, we specify that we should capture a group containing one or more repetitions of a digit. 

- Placing a subpattern inside parentheses captures that content and stores it temporarily in memory. This can be later reused.

Examples of regex:

1. You want to extract the first part of the email. E.g. if you have the email marysmith90@gmail.com, you are only interested in marysmith90.
- You need to match the entire expression. So you make sure to extract only names present in emails. Also, you are only interested in names containing upper (e.g. A,B, Z) or lowercase letters (e.g. a, d, z) and numbers.
- regex to match the email capturing only the name part. The name part appears before the @.
    - regex_email = r"([a-z0-9A-Z]+)@\S+"

2. Text follows a pattern: "Here you have your boarding pass LA4214 AER-CDB 06NOV."
- You need to extract the information about the flight: 
    - The two letters indicate the airline (e.g LA); the 4 numbers are the flight number (e.g. 4214);
    - The three letters correspond to the departure (e.g AER); the destination (CDB); the date (06NOV) of the flight.
    - All letters are always uppercase.

- Regular expression to match and capture all the flight information required.
- Find all the matches corresponding to each piece of information about the flight. Assign it to flight_matches.
- Complete the format method with the elements contained in flight_matches: 
    - In the first line print the airline and the flight number. 
    - In the second line, the departure and destination. In the third line, the date.

# Import re
import re

# Write regex to capture information of the flight
regex = r"([A-Z]{2})(\d{4})\s([A-Z]{3})-([A-Z]{3})\s(\d{2}[A-Z]{3})"

# Find all matches of the flight information
flight_matches = re.findall(regex, flight)
    
#Print the matches
print("Airline: {} Flight number: {}".format(flight_matches[0][0], flight_matches[0][1]))
print("Departure: {} Destination: {}".format(flight_matches[0][2], flight_matches[0][3]))
print("Date: {}".format(flight_matches[0][4]))

    - findall() returns a list of tuples. 
    - The nth element of each tuple is the element corresponding to group n. 
    - This provides us with an easy way to access and organize our data.

"""

        # regex alternating and non-capturing groups
        # (raw string: keeps the backslashes of the regex examples literal)
        self.help_text_6 = r"""
Alternating and non-capturing groups

- Vertical bar or pipe operator
    - Suppose we have the following string, and we want to find all matches for pet names. 
    - We can use the pipe operator to specify that we want to match cat or dog or bird:
        - my_string = "I want to have a pet. But I don't know if I want a cat, a dog or a bird."
        - re.findall(r"cat|dog|bird", my_string) - returns: ['cat', 'dog', 'bird']
    
     - Now, we changed the string a little bit, and once more we want to find all the pet names, but only those that come after a number and a whitespace. 
     - So, if we specify this again with the pipe operator, we get the wrong output: 
        - my_string = "I want to have a pet. But I don't know if I want 2 cats, 1 dog or a bird."
        - re.findall(r"\d+\scat|dog|bird", my_string) - returns: ['2 cat', 'dog', 'bird']
     
     - That is because the pipe operator works by comparing everything that is to its left (digit whitespace cat) with everything to the right, dog.
     - In order to solve this, we can use alternation. 
         - In simpler terms, we can use parentheses again to group the optional characters:
         
         - my_string = "I want to have a pet. But I don't know if I want 2 cats, 1 dog or a bird."
         - re.findall(r"\d+\s(cat|dog|bird)", my_string) - returns: ['cat', 'dog']
         
         In the code, now the parentheses are added to group cat or dog or bird.
    
    - In the previous example, we may also want to match the number. 
    - In that case, we need to place parentheses to capture the digit group:
    
        - my_string = "I want to have a pet. But I don't know if I want 2 cats, 1 dog or a bird."
        - re.findall(r"(\d+)\s(cat|dog|bird)", my_string) - returns: [('2', 'cat'), ('1', 'dog')]
    
        - In the code, we now use two pairs of parentheses and we use findall in the string, so we get a list with two tuples.
    
- Non-capturing groups
    - Sometimes, we need to group characters using parentheses, but we are not going to reference back to this group. 
    - For these cases, there is a special type of group called a non-capturing group. 
    - To use one, we just need to add a question mark and a colon (?:) inside the parentheses, before the regex.
    
    regex = r"(?:regex)"
    
    - Example: we have the following string, and we want to find all matches of numbers. 
    
        my_string = "John Smith: 34-34-34-042-980, Rebeca Smith: 10-10-10-434-425"
    
    - We see that the pattern consists of two digits and a dash, repeated three times. After that, three digits, a dash, and three digits. 
    - We want to extract only the last part, without the first repeated elements. 
    - We need to group the first two elements to indicate repetitions, but we do not want to capture them. 
    - So, we use non-capturing groups to group \d repeated two times and dash. Then we indicate this group should be repeated three times. Then, we group \d repeated three times, dash, \d repeated three times:
    
        re.findall(r"(?:\d{2}-){3}(\d{3}-\d{3})", my_string) - returns: ['042-980', '434-425']
    
- Alternation
    - We can combine non-capturing groups and alternation together. 
    - Remember that alternation implies using parentheses and the pipe operator to group optional characters. 
    - Let's suppose that we have the following string. We want to match all the numbers of the day. 
    
        my_date = "Today is 23rd May 2019. Tomorrow is 24th May 19."
    
    - We know that they are followed by 'th' or 'rd', but we only want to capture the number, and not the letters that follow it. 
    - We write our regex to capture inside parentheses \d repeated once or more times. Then, we can use a non-capturing group. 
    - Inside, we use the pipe operator to choose between 'th' or 'rd':
    
        re.findall(r"(\d+)(?:th|rd)", my_date) - returns: ['23', '24']

- Non-capturing groups are very often used together with alternation. 
- Sometimes, you have optional patterns and you need to group them. 
- However, you are not interested in keeping them. It's a nice feature of regex.

Examples of regex:

1. Sentiment analysis project: firstly, you want to identify positive tweets about movies and concerts.
- You plan to find all the sentences that contain the words 'love', 'like', or 'enjoy', and capture that word. 
- You will limit the tweets by focusing on those that contain the words 'movie' or 'concert' by keeping the word in another group. 
- You will also save the movie or concert name.
    - For example, if you have the sentence: 'I love the movie Avengers', you match and capture 'love'. 
    - You need to match and capture 'movie'. Afterwards, you match and capture anything until the dot.
    - The list sentiment_analysis contains the text of tweets.
- Regular expression to capture the words 'love', 'like', or 'enjoy'; 
    - match and capture the words 'movie' or 'concert'; 
    - match and capture anything appearing until the '.'.

    regex_positive = r"(love|like|enjoy).+?(movie|concert)\s(.+?)\."

    - The pipe operator works by comparing everything that is to its left with everything to the right. 
    - Grouping optional patterns is the way to get the correct result.

2. After finding positive tweets, you want to do it for negative tweets. 
- Your plan now is to find sentences that contain the words 'hate', 'dislike' or 'disapprove'. 
- You will again save the movie or concert name. 
- You will get the tweet containing the words 'movie' or 'concert', but this time, you do not plan to save the word.
    - For example, if you have the sentence: 'I dislike the movie Avengers a lot.', you match and capture 'dislike'. 
    - You will match, but not capture, the word 'movie'. Afterwards, you match and capture anything until the dot.
- Regular expression to capture the words 'hate', 'dislike' or 'disapprove'; 
    - Match, but do not capture, the words 'movie' or 'concert'; 
    - Match and capture anything appearing until the '.'.
    
    regex_negative = r"(hate|dislike|disapprove).+?(?:movie|concert)\s(.+?)\."
        
        """

        # regex backreferences
        # (raw string: otherwise backreferences such as \1 would be read as octal escape characters)
        self.help_text_7 = r"""
Backreferences
- How we can backreference capturing groups.

Numbered groups
- Imagine we come across this text, and we want to extract the date: 
    
    text = "Python 3.0 was released on 12-03-2008. It was a major revision of the language. Many of its major features were backported to Python 2.6.x and 2.7.x version series."
    
- We want to extract only the numbers. So, we can place parentheses in a regex to capture these groups:
    
    regex = r"(\d{1,2})-(\d{1,2})-(\d{4})"

- We have also seen that each of these groups receives a number. 
- The whole expression is group 0; the first group, 1; and so on.

- Let's use .search to match the pattern to the text. 
- To retrieve the groups captured, we can use the method .group specifying the number of a group we want. 

Again: the .group method retrieves the groups captured.
    - Syntax: searched_string = re.search(r"regex", string)
    searched_string.group(N) - returns the N-th group captured (group 0 is the entire match).

Example: text = "Python 3.0 was released on 12-03-2008."

    information = re.search(r'(\d{1,2})-(\d{2})-(\d{4})', text)
    information.group(3) - returns: '2008'
    information.group(0) - returns: '12-03-2008' (the entire match).

- .group is a method of match objects, so it can be used with the results of .search and .match (and of .fullmatch and .finditer), but not with .findall, which returns plain strings or tuples.

Named groups
- We can also give names to our capturing groups. 
- Inside the parentheses, we write '?P', and the name inside angle brackets:

    regex = r"(?P<name>regex)"

- Let's say we have the following string, and we want to match the name of the city and zipcode in different groups. 
- We can use capturing groups and assign them the name 'city' and 'zipcode'. 
- We retrieve the information by using .group, and we indicate the name of the group. 
    
    text = "Austin, 78701"
    cities = re.search(r"(?P<city>[A-Za-z]+).*?(?P<zipcode>\d{5})", text)
    cities.group("city") - returns: 'Austin'
    cities.group("zipcode") - returns: '78701'

Backreferences
- There is another way to backreference groups. 
- In fact, the matched group can be reused inside the same regex or outside for substitution. 
- We can do this using backslash and the number of the group:

    regex = r'(\d{1,2})-(\d{2})-(\d{4})'
    
    - we can backreference the groups as:
        (\d{1,2}): (\1);
        (\d{2}): (\2)
        (\d{4}): (\3)

- Example: we have the following string, and we want to find all matches of repeated words. 
- In the code, we specify that we want to capture a sequence of word characters, then a whitespace.
- Finally, we write \1. This will indicate that we want to match the first group captured again. 
- In other words, it says: 'match that sequence of characters that was previously captured once more.' 
    
    sentence = "I wish you a happy happy birthday!"
    re.findall(r"(\w+)\s\1", sentence) - returns: ['happy'] 

- We get the word 'happy' as an output: this was the repeated word in our string.

- Now, we want to replace the repeated word with one occurrence of the same word. 
- In the code, we use the same regex as before, but this time, we use the .sub method. 
- In the replacement part, we can also reference back to the captured group: 
    - We write r"\1" to say: 'replace the entire expression match with the first captured group.' 
    
    re.sub(r"(\w+)\s\1", r"\1", sentence) - returns: 'I wish you a happy birthday!'
    - In the output string, we have only one occurrence of the word 'happy'.
    
- We can also use named groups for backreferencing. 
- To do this, we use ?P= the group name. 

    regex = r"(?P=name)"

Example:
    sentence = "Your new code number is 23434. Please, enter 23434 to open the door."
    re.findall(r"(?P<code>\d{5}).*?(?P=code)", sentence) - returns: ['23434']

- In the code, we want to find all matches of the same number. 
- We use a capturing group and name it 'code'. 
- Later, we reference back to this group, and we obtain the number as an output.

- To reference the group back for replacement, we need to use \g and the group name inside angle brackets in the replacement string. 

    replacement = r"\g<name>"

Example:
    sentence = "This app is not working! It's repeating the last word word."
    re.sub(r"(?P<word>\w+)\s(?P=word)", r"\g<word>", sentence) - returns: "This app is not working! It's repeating the last word."
    
- In the code, we want to replace repeated words by one occurrence of the same word. 
- Inside the regex, we use the previous syntax. 
- In the replacement field, we need to use this new syntax as seen in the code.
- Backreferences are very helpful when you need to reuse part of the regex match inside the regex.
- You should remember that the group zero stands for the entire expression matched. 
    - It is always helpful to keep that in mind. Sometimes you will need to use it.

Examples of regex:

1. Parsing PDF files: your company gave you some PDF files of signed contracts. The goal of the project is to create a database with the information you parse from them. 
- Three of these columns should correspond to the day, month, and year when the contract was signed.
- The dates appear as 'Signed on 05/24/2016' ('05' indicating the month, '24' the day). 
- You decide to use capturing groups to extract this information. Also, you would like to retrieve that information so you can store it separately in different variables.
- The variable contract contains the text of one contract.

- Write a regex that captures the month, day, and year in which the contract was signed. 
- Scan contract for matches.
- Assign each captured group to the corresponding keys in the dictionary.
- Complete the format method to print out the captured groups. 
- Use the values corresponding to each key in the dictionary.

    # Write regex and scan contract to capture the dates described
    regex_dates = r"Signed\son\s(\d{2})/(\d{2})/(\d{4})"
    dates = re.search(regex_dates, contract)

    # Assign to each key the corresponding match
    signature = {
        "day": dates.group(2),
        "month": dates.group(1),
        "year": dates.group(3)
    }
    # Complete the format method to print-out
    print("Our first contract is dated back to {data[year]}. Particularly, the day {data[day]} of the month {data[month]}.".format(data=signature))

- Remember that each capturing group is assigned a number according to its position in the regex. 
- .group() is available on the match objects returned by .search() and .match(); .findall() returns strings or tuples instead.

2. The company is going to develop a new product which will help developers automatically check the code they are writing. 
- You need to write a short script for checking that every HTML tag that is open has its proper closure.
- You have an example of a string containing HTML tags: "<title>The Data Science Company</title>"
- You learn that an opening HTML tag is always at the beginning of the string, and appears inside "<>". 
- A closing tag also appears inside "<>", but it is preceded by "/".
- The list html_tags, contains strings with HTML tags.

- Regex to match closed HTML tags: find if there is a match in each string of the list html_tags. Assign the result to match_tag;
    - If a match is found, print the first group captured and saved in match_tag;
- If no match is found, regex to match only the text inside the HTML tag. Assign it to notmatch_tag.
    - Print the first group captured by the regex and save it in notmatch_tag.
    - To capture the text inside <>, place parentheses around the expression: r"<(text)>". To confirm that the same text appears in the closing tag, reference back to the m-th group captured by using '\m'.
    - To print the m-th group captured, use .group(m).

    for string in html_tags:
        # Complete the regex and find if it matches a closed HTML tags
        match_tag =  re.match(r"<(\w+)>.*?</\1>", string)

        if match_tag:
            # If it matches print the first group capture
            print("Your tag {} is closed".format(match_tag.group(1))) 
        else:
            # If it doesn't match capture only the tag 
            notmatch_tag = re.match(r"<(\w+)>",string)
            # Print the first group capture
            print("Close your {} tag!".format(notmatch_tag.group(1)))

3. Your task is to replace elongated words that appear in the tweets. 
- We define an elongated word as a word that contains a repeating character twice or more times. 
    - e.g. "Awesoooome".
- Replacing those words is very important since a classifier will treat them as a different term from the source words, lowering their frequency.
- To find them, you will use capturing groups and reference them back using numbers. E.g \4.
- If you want to find a match for 'Awesoooome', you firstly need to capture 'Awes'. 
    - Then, match 'o' and reference the same character back, and then, 'me'.
- The list sentiment_analysis contains the text tweets.
- Regular expression to match an elongated word as described.
- Search the elements in sentiment_analysis list to find out if they contain elongated words. Assign the result to match_elongated.
- Assign the captured group number zero to the variable elongated_word.
    - Print the result contained in the variable elongated_word.

    # Regex to match an elongated word: the captured character must be
    # immediately repeated at least once (backreference followed by +)
    regex_elongated = r"\w*(\w)\1+\w*"

    for tweet in sentiment_analysis:
        # Find if there is a match in each tweet 
        match_elongated = re.search(regex_elongated, tweet)

        if match_elongated:
            # Assign the captured group zero 
            elongated_word = match_elongated.group(0)

            # Complete the format method to print the word
            print("Elongated word found: {word}".format(word=elongated_word))
        else:
            print("No elongated word found") 

        """
        
        # regex lookaround
        # (raw string: keeps the backslashes of the regex examples literal)
        self.help_text_8 = r"""
Lookaround
- There are specific types of non-capturing groups that help us look around an expression.
- Look-around will look for what is behind or ahead of a pattern. 
- Imagine that we have the following string:
    
    text = "the white cat sat on the chair"

- We want to see what is surrounding a specific word. 
- For example, we position ourselves in the word 'cat'. 
- So look-around will let us answer the following problem: 
    - At my current position, look ahead and search if 'sat' is there. 
    - Or, look behind and search if 'white' is there.
    
- In other words, looking around allows us to confirm that a sub-pattern is ahead or behind the main pattern.
- "At my current position in the matching process, look ahead or behind and examine whether some pattern matches or not before continuing."
- In the previous example, we are looking for the word 'cat'. 
- The look ahead expression can be either positive or negative. For positive we use ?=. For negative, ?!.
    - positive: (?=sat)
    - negative: (?!run)

- Look-ahead
- This non-capturing group checks whether the first part of the expression is followed or not by the lookahead expression. 
- As a consequence, it will return the first part of the expression. 
    - Let's imagine that we have a string containing file names and the status of that file. 
    - We want to extract only those files that are followed by the word 'transferred'. 
    - So we start building the regex by indicating any word character followed by .txt.
    - We now indicate we want the first part to be followed by the word transferred. 
    - We do so by writing ?= and then whitespace transferred all inside the parenthesis:
    
    my_text ="tweets.txt transferred, mypass.txt transferred, keywords.txt error"
    re.findall(r"\w+\.txt(?=\stransferred)", my_text) - returns: ['tweets.txt', 'mypass.txt']

- Negative look-ahead
    - Now, let's use negative lookahead in the same example.
    - In this case, we will say that we want those matches that are NOT followed by the expression 'transferred'. 
    - We use, instead, ?! inside parenthesis:

    my_text = "tweets.txt transferred, mypass.txt transferred, keywords.txt error"
    re.findall(r"\w+\.txt(?!\stransferred)", my_text) - returns: ['keywords.txt']

- Look-behind
- The non-capturing group look-behind gets all matches that are preceded or not by a specific pattern.
- As a consequence, it will return the matches after the look expression.
- Look behind expression can also be either positive or negative. 
    - For positive, we use ?<=. For negative, ?<!.
    - So, we add an intermediate '<' (angle bracket) sign. In the previous example, we can look before the word 'cat': 
        - positive: (?<=white)
        - negative: (?<!brown)
    
- Positive look-behind
    - Let's look at the following string, in which we want to find all matches of the names that are preceded by the word 'member'. 
    - We construct our regex with positive look-behind. 
    - At the end of the regex, we indicate that we want a sequence of word characters whitespace another sequence of word characters:
    
    my_text = "Member: Angus Young, Member: Chris Slade, Past: Malcolm Young, Past: Cliff Williams."
    re.findall(r"(?<=Member:\s)\w+\s\w+", my_text) - returns: ['Angus Young', 'Chris Slade']
    
    - Pay attention to the code: the look-behind expression goes before the main expression. 
    - We indicate ?<= followed by member, colon, and whitespace. All inside parentheses. 
    - In that way we get the two names that were preceded by the word member, as shown in the output.

- Negative look-behind
- Now, we have another string, in which we will use negative look-behind. 
- We will find all matches of the word 'cat' or 'dog' that are not preceded by the word 'brown'. 
- In this example, we use ?<!, followed by brown, whitespace. All inside the parenthesis. 
- Then, we indicate our alternation group: 'cat' or 'dog'. 

    my_text = "My white cat sat at the table. However, my brown dog was lying on the couch."
    re.findall(r"(?<!brown\s)(cat|dog)", my_text) - returns: ['cat']

    - Consequently, we get 'cat' as an output, the 'cat' or 'dog' word that is not after the word 'brown'.

In summary:
- Positive lookahead (?=) makes sure that first part of the expression is followed by the lookahead expression. 
- Positive lookbehind (?<=) returns all matches that are preceded by the specified pattern.
- It is important to know that positive lookahead will return the text matched by the first part of the expression after asserting that it is followed by the lookahead expression,
    - while positive lookbehind will return all matches that follow a specific pattern.
- Negative lookarounds work in a similar way to positive lookarounds. 
    - They are very helpful when we are looking to exclude certain patterns from our analysis.

Examples of regex:

1. You are interested in the words surrounding 'python'. You want to count how many times a specific word appears right before and after it.
- The variable sentiment_analysis contains the text of one tweet.
- Get all the words that are followed by the word 'python' in sentiment_analysis. 
- Print out the word found.
    - In re.findall(), use \w+ to match the words followed by the word 'python';
    - In the first argument of re.findall(), include ?=\spython within parentheses to assert that the word is followed by 'python', without including 'python' in the match.

    # Positive lookahead
    look_ahead = re.findall(r"\w+(?=\spython)", sentiment_analysis)

    # Print out
    print(look_ahead)
 
1.2. Get all the words that are preceded by the word 'python' or 'Python' in sentiment_analysis. Print out the words found.
- In the first argument of re.findall(), include ?<=[Pp]ython\s within parentheses to assert that the word is preceded by 'python' (or 'Python'), without including it in the match.

    # Positive lookbehind
    look_behind = re.findall(r"(?<=[pP]ython\s)\w+", sentiment_analysis)

    # Print out
    print(look_behind)

2. You need to write a script for a cell-phone searcher. 
- It should scan a list of phone numbers and return those that meet certain characteristics.
- The phone numbers in the list have the structure:
    - Optional area code: 3 numbers
    - Prefix: 4 numbers
    - Line number: 6 numbers
    - Optional extension: 2 numbers
    - E.g. 654-8764-439434-01.
- You decide to use .findall() and the non-capturing group's negative lookahead (?!) and negative lookbehind (?<!).
- The list cellphones, contains three phone numbers:
    cellphones = ['4564-646464-01', '345-5785-544245', '6476-579052-01']

- Get all cell phone numbers that are not preceded by the optional area code.
    - In the first argument of re.findall(), use a negative lookbehind ?<! within parentheses () indicating the optional area code.

    for phone in cellphones:
        # Get all phone numbers not preceded by area code
        number = re.findall(r"(?<!\d{3}-)\d{4}-\d{6}-\d{2}", phone)
        print(number)
 
2.1. Get all the cell phone numbers that are not followed by the optional extension.
    - In the first argument of re.findall(), use a negative lookahead ?! within parentheses () indicating the optional extension.

    for phone in cellphones:
        # Get all phone numbers not followed by optional extension
        number = re.findall(r"\d{3}-\d{4}-\d{6}(?!-\d{2})", phone)
        print(number)
    
        """
        
    def show_screen (self):
            
        helper_screen = self.helper_screen
        helper_menu_1 = self.helper_menu_1
            
        if (helper_screen == 0):
                
            # Start screen
            print(self.helper_menu_1)
            print("\n")
            # For the input, strip all whitespaces, and then convert it to integer:
            helper_screen = int(str(input("Next screen:")).strip())
                
            # the object.__dict__ attribute holds all attributes from an object as a dictionary.
            # Analogously, the vars function applied to an object vars(object) returns the same
            # dictionary. We can access an attribute from the object by passing the key of this
            # dictionary:
            # vars(object)['key']
                
            while (helper_screen != 10):
                    
                if (helper_screen not in range(0, 11)):
                    # range (0, 11): integers from 0 to 10
                        
                    helper_screen = int(str(input("Input a valid number, from 0 to 10:")).strip())
                    
                else:
                        
                    if (helper_screen == 9):
                        # print all at once:
                        for screen_number in range (1, 9):
                            # integers from 1 to 8
                            key = "help_text_" + str(screen_number)
                            # apply the vars function to get the dictionary of attributes, and call the
                            # attribute by passing its name as key from the dictionary:
                            screen_text = vars(self)[key]
                            # Notice that dot notation cannot be used when the attribute name is stored in a
                            # string. getattr(self, key) is an equivalent alternative to vars(self)[key].
                            print(screen_text)
                            
                        # Now, make helper_screen = 10 for finishing this step:
                        helper_screen = 10
                        
                    else:
                        key = "help_text_" + str(helper_screen)
                        screen_text = vars(self)[key]
                        print(screen_text)
                        helper_screen = int(str(input("Next screen:")).strip())
            
        print("Finishing regex assistant.\n")
            
        return self
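
The dynamic attribute lookup used by `show_screen` can be tried in isolation. The sketch below is only an illustration: `HelpDemo` and its two `help_text_N` attributes are hypothetical stand-ins, not the notebook's `regex_help` class. It compares `vars(obj)[key]` with the built-in `getattr`, which additionally accepts a default value for missing attributes.

```python
class HelpDemo:
    # Hypothetical class, used only to illustrate the lookup by attribute name.
    def __init__(self):
        self.help_text_1 = "screen 1"
        self.help_text_2 = "screen 2"


demo = HelpDemo()

# Build the attribute name as a string, as done in show_screen:
key = "help_text_" + str(2)

# Option 1: index the dictionary of instance attributes.
via_vars = vars(demo)[key]

# Option 2: getattr performs the same lookup by name.
via_getattr = getattr(demo, key)

# getattr can also return a default instead of raising an error:
fallback = getattr(demo, "help_text_99", "no such screen")

print(via_vars, via_getattr, fallback)
```

Using `getattr` with a default could replace the range check on the screen number, but that is a design choice, not something the original class does.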

In [50]:
def regex_search (string_or_list_of_strings, regex_to_search = r"", show_regex_helper = False):
    
    import re
    import numpy as np
    
    # string_or_list_of_strings: string or list of strings (inside quotes), 
    # that will be analyzed. 
    # e.g. string_or_list_of_strings = "column1" will analyze 'column1', whereas 
    # string_or_list_of_strings = ['col1', 'col2'] will process both 'col1' and 'col2'.
    
    # regex_to_search = r"" - string containing the regular expression (regex) that will be searched
    # within each string from the list. Declare it with the r before quotes, indicating that the
    # 'raw' string should be read. That is because the regex contains special characters, such as \,
    # which should not be read as escape characters.
    # example of regex: r'st\d\s\w{3,10}'
    # Use the regex helper to check: basic theory and most common metacharacters; regex quantifiers;
    # regex anchoring and finding; regex greedy and non-greedy search; regex grouping and capturing;
    # regex alternating and non-capturing groups; regex backreferences; and regex lookaround.
    
    ## ATTENTION: If the regex contains capturing groups, i.e., portions explicitly marked with
    # parentheses, re.findall returns ONLY the captured groups (as tuples, when there is more than
    # one group). Check the regex helper for more details, including how to convert parentheses
    # into non-capturing groups. If the regex has no capturing groups, the whole matches are returned.

    # show_regex_helper: set show_regex_helper = True to show a helper guide to the construction of
    # the regular expression. After finishing the helper, the original input itself will be returned
    # and the function will not proceed. Use it if you do not know or are not certain how to write
    # the regex.
    
    
    if (show_regex_helper): # run if True
        
        # Create an instance (object) from class regex_help:
        helper = regex_help()
        # Run helper object:
        helper = helper.show_screen()
        print("Interrupting the function and returning the input itself.")
        print("Use the regex helper instructions to obtain the regex.")
        print("Do not forget to declare it as r'regex', with the r before quotes.")
        print("It indicates a raw expression. It is important for not reading the regex metacharacters as regular string escape characters.")
        print("Also, notice that, if the regex contains capturing groups (marked with parentheses), this function returns only those groups.")
        print("If no groups are marked as capturing groups within the regex, the whole matches are returned.\n")
        
        # There is no dataframe in this function: return the input object itself.
        return string_or_list_of_strings
    
    else:
        
        # Check if a string was passed. If it was, convert it to list of single element:
        if (type(string_or_list_of_strings) == str):
            list_of_strings = [string_or_list_of_strings]

        else: # simply convert the iterable to the new standard name:
            list_of_strings = list(string_or_list_of_strings)

        # Now, we have a local copy as a list.
        # As we are dealing with strings and not Pandas dataframes, we do not call the str attribute.
        
        # Search for the regex within the list:
        new_series = [re.findall(regex_to_search, string) for string in list_of_strings]
        

        # Now, show a sample of the results.
        print(f"Finished searching the regex {regex_to_search} within the list of strings.")
        print("Check the results for the first 10 strings:\n")
    
        try:
            # only works in Jupyter Notebook:
            from IPython.display import display
            display(new_series[:10])

        except ImportError: # regular mode
            print(new_series[:10])

        return new_series
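
What `re.findall` returns depends on the capturing groups in the pattern, which is why the ATTENTION note in `regex_search` matters. The following sketch reuses the example regex from the comments on a made-up sample string (the sample text is an assumption for illustration).

```python
import re

# Sample text invented for this illustration:
text = "st1 apple, st2 banana"

# No capturing groups: the whole matches are returned.
whole_matches = re.findall(r"st\d\s\w{3,10}", text)

# One capturing group: only the captured portion of each match is returned.
one_group = re.findall(r"st(\d)\s\w{3,10}", text)

# Two capturing groups: each match becomes a tuple of the captured portions.
two_groups = re.findall(r"st(\d)\s(\w{3,10})", text)

# (?:...) groups without capturing, so the whole matches are returned again.
non_capturing = re.findall(r"st(?:\d)\s\w{3,10}", text)

print(whole_matches)
print(one_group)
print(two_groups)
print(non_capturing)
```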

# **Function for replacing a Regular Expression (RegEx) in a string column**

In [51]:
def regex_replacement (string_or_list_of_strings, regex_to_search = r"", string_for_replacement = "", show_regex_helper = False):
    
    import re
    import numpy as np
    
    # string_or_list_of_strings: string or list of strings (inside quotes), 
    # that will be analyzed. 
    # e.g. string_or_list_of_strings = "column1" will analyze 'column1', whereas 
    # string_or_list_of_strings = ['col1', 'col2'] will process both 'col1' and 'col2'.
    
    # regex_to_search = r"" - string containing the regular expression (regex) that will be searched
    # within each string from the list. Declare it with the r before quotes, indicating that the
    # 'raw' string should be read. That is because the regex contains special characters, such as \,
    # which should not be read as escape characters.
    # example of regex: r'st\d\s\w{3,10}'
    # Use the regex helper to check: basic theory and most common metacharacters; regex quantifiers;
    # regex anchoring and finding; regex greedy and non-greedy search; regex grouping and capturing;
    # regex alternating and non-capturing groups; regex backreferences; and regex lookaround.
    
    # string_for_replacement = "" - regular string that will replace the regex_to_search: 
    # whenever regex_to_search is found in the string, it is replaced (substituted) by 
    # string_for_replacement. 
    # Example: string_for_replacement = " " (whitespace).
    # If string_for_replacement = None, the empty string will be used for replacement.
    
    ## ATTENTION: This function processes a single regex per call.
    
    # show_regex_helper: set show_regex_helper = True to show a helper guide to the construction of
    # the regular expression. After finishing the helper, the original input itself will be returned
    # and the function will not proceed. Use it if you do not know or are not certain how to write
    # the regex.
    
    
    if (show_regex_helper): # run if True
        
        # Create an instance (object) from class regex_help:
        helper = regex_help()
        # Run helper object:
        helper = helper.show_screen()
        print("Interrupting the function and returning the input itself.")
        print("Use the regex helper instructions to obtain the regex.")
        print("Do not forget to declare it as r'regex', with the r before quotes.")
        print("It indicates a raw expression. It is important for not reading the regex metacharacters as regular string escape characters.\n")
        
        # There is no dataframe in this function: return the input object itself.
        return string_or_list_of_strings
    
    else:
        
        if (string_for_replacement is None):
            # make it the empty string
            string_for_replacement = ""
        
        # Check if a string was passed. If it was, convert it to list of single element:
        if (type(string_or_list_of_strings) == str):
            list_of_strings = [string_or_list_of_strings]

        else: # simply convert the iterable to the new standard name:
            list_of_strings = list(string_or_list_of_strings)

        # Now, we have a local copy as a list.
        # As we are dealing with strings and not Pandas dataframes, we do not call the str attribute.
        # Search for the regex within the list and replace (substitute it):
        new_series = [re.sub(regex_to_search, string_for_replacement, string) for string in list_of_strings]
        

        # Now, show a sample of the results.
        print(f"Finished replacing the regex {regex_to_search} within the input strings.")
        print("Check the first 10 strings:\n")
    
        try:
            # only works in Jupyter Notebook:
            from IPython.display import display
            display(new_series[:10])

        except ImportError: # regular mode
            print(new_series[:10])

        return new_series
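
The replacement step of `regex_replacement` is a list comprehension over `re.sub`. The sketch below shows it on invented sample strings, plus one variation that the function's comments do not mention: `re.sub` also accepts backreferences such as `\1` in the replacement string, provided it is written as a raw string.

```python
import re

# Sample strings invented for this illustration:
strings = ["a   b", "c\t\td"]

# Replace each run of whitespace characters by a single space,
# mirroring the list comprehension used in regex_replacement:
collapsed = [re.sub(r"\s+", " ", s) for s in strings]

# Backreferences in the replacement reuse the captured groups:
swapped = re.sub(r"(\w+)@(\w+)", r"\2 at \1", "user@host")

print(collapsed)
print(swapped)
```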

# **Function for converting the lists to Pandas dataframe**

# **Function for exporting the dataframe as CSV File (to notebook's workspace)**

In [None]:
def export_pd_dataframe_as_csv (dataframe_obj_to_be_exported, new_file_name_without_extension, file_directory_path = None):
    
    import os
    import pandas as pd
    
    ## WARNING: all files exported from this function are .csv (comma separated values)
    
    # dataframe_obj_to_be_exported: dataframe object that is going to be exported from the
    # function. Since it is an object (not a string), it should not be declared in quotes.
    # example: dataframe_obj_to_be_exported = dataset will export the dataset object.
    # ATTENTION: The dataframe object must be a Pandas dataframe.
    
    # file_directory_path - (string, in quotes): input the path of the directory 
    # (e.g. folder path) where the file will be stored. e.g. file_directory_path = "/" 
    # or file_directory_path = "/folder"
    # If file_directory_path = None, the file is saved in the current working directory
    # (the notebook's workspace).

    # new_file_name_without_extension - (string, in quotes): input the name of the 
    # file without the extension. e.g. new_file_name_without_extension = "my_file" 
    # will export a file 'my_file.csv' to notebook's workspace.
    
    # os.path.join raises a TypeError when it receives None. So, replace None by the
    # empty string, which represents the current working directory:
    if (file_directory_path is None):
        file_directory_path = ""
    
    # Create the complete file path:
    file_path = os.path.join(file_directory_path, new_file_name_without_extension)
    # Concatenate the extension ".csv":
    file_path = file_path + ".csv"

    dataframe_obj_to_be_exported.to_csv(file_path, index = False)

    print(f"Dataframe {new_file_name_without_extension} exported as CSV file to notebook\'s workspace as \'{file_path}\'.")
    print("Warning: if there was a file in this file path, it was replaced by the exported dataframe.")
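
The path construction in `export_pd_dataframe_as_csv` can be checked without writing any file. In the sketch below, `build_csv_path` is a hypothetical helper (it does not exist in the notebook) that isolates the directory handling; it assumes that `None` should mean the current working directory, since `os.path.join` does not accept `None`.

```python
import os


def build_csv_path(file_directory_path, new_file_name_without_extension):
    # Hypothetical helper: os.path.join raises a TypeError for None, so None is
    # replaced by the empty string, which stands for the current directory.
    directory = "" if file_directory_path is None else str(file_directory_path)
    return os.path.join(directory, new_file_name_without_extension + ".csv")


# No directory provided: only the file name with the extension is returned.
path_in_root = build_csv_path(None, "my_file")

# Directory provided: the separator is chosen by the operating system.
path_in_folder = build_csv_path("folder", "my_file")

print(path_in_root)
print(path_in_folder)
```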

# **Function for downloading a file from Google Colab to the local machine; or uploading a file from the machine to Colab's instant memory**

In [None]:
def upload_to_or_download_file_from_colab (action = 'download', file_to_download_from_colab = None):
    
    # action = 'download' to download the file to the local machine
    # action = 'upload' to upload a file from local machine to
    # Google Colab's instant memory
    
    # file_to_download_from_colab = None. This parameter is obligatory when
    # action = 'download'. 
    # Declare as file_to_download_from_colab the name of the file that you want to download,
    # with the correspondent extension.
    # It should be declared in quotes, since it is a string.
    # e.g. to download a dictionary saved as dict.pkl, file_to_download_from_colab = 'dict.pkl'
    # To download a dataframe saved as df.csv, declare file_to_download_from_colab = 'df.csv'
    # To download a model saved as keras_model.h5, declare file_to_download_from_colab = 'keras_model.h5'
 
    from google.colab import files
    # google.colab library must be imported only in case 
    # it is going to be used, for avoiding 
    # AWS compatibility issues.
        
    if (action == 'upload'):
            
        print("Click on the button for file selection and select the files from your machine that will be uploaded in the Colab environment.")
        print("Warning: the files will be removed from Colab memory after the Kernel dies or after the notebook is closed.")
        # this functionality requires the previous declaration:
        ## from google.colab import files
            
        colab_files_dict = files.upload()
            
        # The files are stored into a dictionary called colab_files_dict where the keys
        # are the names of the files and the values are the files themselves.
        ## e.g. if you upload a single file named "dictionary.pkl", the dictionary will be
        ## colab_files_dict = {'dictionary.pkl': file}, where file is actually a big string
        ## representing the contents of the file. The length of this value is the size of the
        ## uploaded file, in bytes.
        ## To access the file is like accessing a value from a dictionary: 
        ## d = {'key1': 'val1'}, d['key1'] == 'val1'
        ## we simply declare the key inside brackets and quotes, the same way we would do for
        ## accessing the column of a dataframe.
        ## In this example, colab_files_dict['dictionary.pkl'] access the content of the 
        ## .pkl file, and len(colab_files_dict['dictionary.pkl']) is the size of the .pkl
        ## file in bytes.
        ## To check the dictionary keys, apply the method .keys() to the dictionary (with empty
        ## parentheses): colab_files_dict.keys()
            
        for key in colab_files_dict.keys():
            #loop through each element of the list of keys of the dictionary
            # (list colab_files_dict.keys()). Each element is named 'key'
            print(f"User uploaded file {key} with length {len(colab_files_dict[key])} bytes.")
            # The key is the name of the file, and the length of the value
            ## correspondent to the key is the file's size in bytes.
            ## Notice that the content of the uploaded object must be passed 
            ## as argument for a proper function to be interpreted. 
            ## For instance, the content of a xlsx file should be passed as
            ## argument for Pandas .read_excel function; the pkl file must be passed as
            ## argument for pickle.
            ## e.g., if you uploaded 'table.xlsx' and stored it into colab_files_dict you should
            ## declare df = pd.read_excel(colab_files_dict['table.xlsx']) to obtain a dataframe
            ## df from the uploaded table. Notice that is the value, not the key, that is the
            ## argument.
                
        # After looping through all the uploaded files, print the instructions once and
        # return the dictionary (returning inside the loop would stop at the first file):
        print("The uploaded files are stored into a dictionary object named as colab_files_dict.")
        print("Each key from this dictionary is the name of an uploaded file. The value correspondent to that key is the file itself.")
        print("The structure of a general Python dictionary is dict = {\'key1\': value1}. To access value1, declare file = dict[\'key1\'], as if you were accessing a column from a dataframe.")
        print("Then, if you uploaded a file named \'table.xlsx\', you can access this file as:")
        print("uploaded_file = colab_files_dict[\'table.xlsx\']")
        print("Notice, though, that the object uploaded_file is the whole file content, not a Python object already converted. To convert to a Python object, pass this element as argument for a proper function or method.")
        print("In this example, to convert the object uploaded_file to a dataframe, Pandas pd.read_excel function could be used. In the following line, a df dataframe object is obtained from the uploaded file:")
        print("df = pd.read_excel(uploaded_file)")
        print("Also, the uploaded file itself will be available in the Colaboratory Notebook\'s workspace.")
        
        return colab_files_dict
        
    elif (action == 'download'):
            
        if (file_to_download_from_colab is None):
                
            #No object was declared
            print("Please, provide a file to download from the notebook\'s workspace. It should be declared in quotes and with the extension: e.g. \'table.csv\'.")
            
        else:
                
            print("The file will be downloaded to your computer.")

            files.download(file_to_download_from_colab)

            print(f"File {file_to_download_from_colab} successfully downloaded from Colab environment.")

    else:
            
        print("Please, select a valid action, \'download\' or \'upload\'.")
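
The dictionary returned in the upload branch maps each file name to the raw content of the file. The sketch below simulates such a dictionary with the standard library only, so it runs outside Colab; it assumes the values are `bytes` objects, and the file names and contents are invented for illustration. Libraries that expect a file (instead of raw bytes) usually need the content wrapped in an in-memory buffer first.

```python
import csv
import io
import pickle

# Simulated stand-in for the dictionary returned by the upload branch
# (assumption: each value holds the raw bytes of the uploaded file).
colab_files_dict = {
    "dictionary.pkl": pickle.dumps({"a": 1}),
    "table.csv": b"col1,col2\n1,2\n",
}

for name, content in colab_files_dict.items():
    # The length of the value is the size of the file in bytes.
    print(f"File {name} has {len(content)} bytes.")

# A pickled object can be restored directly from the bytes:
restored = pickle.loads(colab_files_dict["dictionary.pkl"])

# Text formats must be decoded and wrapped in a file-like buffer:
text_buffer = io.StringIO(colab_files_dict["table.csv"].decode("utf-8"))
rows = list(csv.reader(text_buffer))

print(restored)
print(rows)
```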

# **Function for exporting a list of files from notebook's workspace to AWS Simple Storage Service (S3)**

In [None]:
def export_files_to_s3 (list_of_file_names_with_extensions, directory_of_notebook_workspace_storing_files_to_export = None, s3_bucket_name = None, s3_obj_prefix = None):
    
    import os
    import boto3
    # boto3 is AWS S3 Python SDK
    # sagemaker and boto3 libraries must be imported only in case 
    # they are going to be used, for avoiding 
    # Google Colab compatibility issues.
    from getpass import getpass
    
    # list_of_file_names_with_extensions: list containing all the files to export to S3.
    # Declare it as a list even if only a single file will be exported.
    # It must be a list of strings containing the file names followed by the extensions.
    # Example, to a export a single file my_file.ext, where my_file is the name and ext is the
    # extension:
    # list_of_file_names_with_extensions = ['my_file.ext']
    # To export 3 files, file1.ext1, file2.ext2, and file3.ext3:
    # list_of_file_names_with_extensions = ['file1.ext1', 'file2.ext2', 'file3.ext3']
    # Other examples:
    # list_of_file_names_with_extensions = ['Screen_Shot.png', 'dataset.csv']
    # list_of_file_names_with_extensions = ["dictionary.pkl", "model.h5"]
    # list_of_file_names_with_extensions = ['doc.pdf', 'model.dill']
    
    # directory_of_notebook_workspace_storing_files_to_export: directory from notebook's workspace
    # from which the files will be exported to S3. Keep it None, or
    # directory_of_notebook_workspace_storing_files_to_export = "/"; or
    # directory_of_notebook_workspace_storing_files_to_export = '' (empty string) to export from
    # the root (main) directory.
    # Alternatively, set as a string containing only the directories and folders, not the file names.
    # Examples: directory_of_notebook_workspace_storing_files_to_export = 'folder1';
    # directory_of_notebook_workspace_storing_files_to_export = 'folder1/folder2/'
    
    # For this function, all exported files must be located in the same directory.
    
    
    # s3_bucket_name = None.
    ## This parameter is obligatory to access an AWS S3 bucket. Substitute it for a string
    # with the bucket's name. e.g. s3_bucket_name = "aws-bucket-1" accesses a bucket named
    # "aws-bucket-1"
    
    # s3_obj_prefix = None. Keep it None or as an empty string (s3_obj_prefix = '')
    # to export the files to the root (main) directory of the bucket.
    # Alternatively, set it as a string containing the subfolder of the bucket that will
    # receive the files.
    # Suppose that your bucket (admin-created) has four objects with the following object 
    # keys: Development/Projects.xlsx; Finance/statement1.pdf; Private/taxdocument.pdf; and
    # s3-dg.pdf. The s3-dg.pdf key does not have a prefix, so its object appears directly 
    # at the root level of the bucket. If you open the Development/ folder, you see 
    # the Projects.xlsx object in it.
    # Check Amazon documentation:
    # https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-keys.html
    
    # In summary, if the path of the file is: 'bucket/my_path/.../file.csv'
    # where 'bucket' is the bucket's name, s3_obj_prefix = 'my_path/.../', without the
    # 'file.csv' (file name with extension) last part.
    
    # So, declare the prefix as s3_obj_prefix to export the files to
    # a given folder (directory) of the bucket.
    # DO NOT PUT A SLASH before (to the left of) the prefix;
    # DO NOT ADD THE BUCKET'S NAME to the left of the prefix:
    # s3_obj_prefix = "bucket_directory1/.../bucket_directoryN/"

    # Since this function appends the name of each exported file to the prefix,
    # s3_obj_prefix should contain only the folders, not a file name.


    # Attention: after running this function for connecting with AWS Simple Storage Service (S3), 
    # your 'AWS Access key ID' and your 'Secret access key' will be requested.
    # The 'Secret access key' will be hidden as you type, so it cannot be visualized or copied by
    # other users. On the other hand, the same is not true for the 'Access key ID', the bucket's name 
    # and the prefix. All of these are sensitive information from the organization.
    # Therefore, after exporting the files, always remember to clean the output of this cell
    # and to remove such information from the strings.
    # Remember that these data may grant privileges for accessing the information, so they should not
    # be shared with non-authorized people.

    # Also, remember to delete the exported files from the workspace after finishing the analysis.
    # The costs for storing the files in S3 are considerably lower than those for storing them
    # directly in the workspace. Also, files stored in S3 may be accessed by users other than those
    # with access to the notebook's workspace.
    
    
    # Check if directory_of_notebook_workspace_storing_files_to_export is None. 
    # If it is, make it the root directory:
    if ((directory_of_notebook_workspace_storing_files_to_export is None)|(str(directory_of_notebook_workspace_storing_files_to_export) == "/")):
            
            # For the S3 buckets, the path should not start with slash. Assign the empty
            # string instead:
            directory_of_notebook_workspace_storing_files_to_export = ""
            print("The files will be exported from the notebook\'s root directory to S3.")
    
    elif (str(directory_of_notebook_workspace_storing_files_to_export) == ""):
        
            # Guarantee that the path is the empty string.
            # Avoid accessing the else condition, what would raise an error
            # since the empty string has no character of index 0
            directory_of_notebook_workspace_storing_files_to_export = str(directory_of_notebook_workspace_storing_files_to_export)
            print("The files will be exported from the notebook\'s root directory to S3.")
          
    else:
        # Use the str attribute to guarantee that the path was read as a string:
        directory_of_notebook_workspace_storing_files_to_export = str(directory_of_notebook_workspace_storing_files_to_export)
            
        if(directory_of_notebook_workspace_storing_files_to_export[0] == "/"):
            # the first character is the slash. Let's remove it

            # In AWS, neither the prefix nor the path to which the file will be imported
            # (file from S3 to workspace) or from which the file will be exported to S3
            # (the path in the notebook's workspace) may start with slash, or the operation
            # will not be concluded. Then, we have to remove this character if it is present.

            # The slash is character 0. Then, we want all characters from character 1 (the
            # second) to the last character, whose index is
            # len(directory_of_notebook_workspace_storing_files_to_export) - 1.
            # The slicing syntax is: string[1:] - all string characters from character 1 on;
            # string[:10] - all string characters up to character 10-1 = 9 (including 9); or
            # string[1:10] - characters from 1 to 9
            # So, slice the whole string, starting from character 1:
            directory_of_notebook_workspace_storing_files_to_export = directory_of_notebook_workspace_storing_files_to_export[1:]
            # attention: even though strings may be seen as lists of characters that can be
            # sliced, strings are immutable: we can neither assign a character to a given
            # position nor delete a character from a position.

    # Ask the user to provide the credentials:
    ACCESS_KEY = input("Enter your AWS Access Key ID here (on the right). It is the value stored in the field \'Access key ID\' from your AWS user credentials CSV file.")
    print("\n") # line break
    SECRET_KEY = getpass("Enter your password (Secret key) here (on the right). It is the value stored in the field \'Secret access key\' from your AWS user credentials CSV file.")
        
    # The use of 'getpass' instead of 'input' hides the password as it is typed.
    # So, the password is not visible to other users and cannot be copied.
        
    print("\n")
    print("WARNING: The bucket\'s name, the prefix, the AWS access key ID, and the AWS Secret access key are all sensitive information, which may grant access to protected information from the organization.\n")
    print("After exporting the data to S3, remember to remove this information from the notebook, especially if it is going to be shared. Also, remember to remove the files from the workspace.\n")
    print("The cost of storing files in Simple Storage Service is considerably lower than the cost of storing them directly in the SageMaker workspace. Also, files stored in S3 may be accessed by users other than those with access to the notebook\'s workspace.\n")

    # Check if the user actually provided the mandatory inputs, instead
    # of putting None or empty string:
    if ((ACCESS_KEY is None) | (ACCESS_KEY == '')):
        print("AWS Access Key ID is missing. It is the value stored in the field \'Access key ID\' from your AWS user credentials CSV file.")
        return "error"
    elif ((SECRET_KEY is None) | (SECRET_KEY == '')):
        print("AWS Secret Access Key is missing. It is the value stored in the field \'Secret access key\' from your AWS user credentials CSV file.")
        return "error"
    elif ((s3_bucket_name is None) | (s3_bucket_name == '')):
        print ("Please, enter a valid S3 Bucket\'s name. Do not add sub-directories or folders (prefixes), only the name of the bucket itself.")
        return "error"
    
    else:
        # Use the str attribute to guarantee that all AWS parameters were properly read as strings, and not as
        # other variables (like integers or floats):
        ACCESS_KEY = str(ACCESS_KEY)
        SECRET_KEY = str(SECRET_KEY)
        s3_bucket_name = str(s3_bucket_name)

    if(s3_bucket_name[0] == "/"):
        # the first character is the slash. Let's remove it

        # In AWS, neither the prefix nor the path to which the file will be imported
        # (file from S3 to workspace) or from which the file will be exported to S3
        # (the path in the notebook's workspace) may start with slash, or the operation
        # will not be concluded. Then, we have to remove this character if it is present.

        # So, slice the whole string, starting from character 1 (as done for 
        # directory_of_notebook_workspace_storing_files_to_export):
        s3_bucket_name = s3_bucket_name[1:]

    # Remove any possible trailing spaces (whitespaces and tabulations)
    # that may be present in the string. Use the Python string
    # rstrip method, which removes characters only from the right end of the string
    # (the strip method removes them from both ends, like the Trim function).
    # When no arguments are provided, the whitespaces and tabulations
    # are the removed characters
    # https://www.w3schools.com/python/ref_string_rstrip.asp?msclkid=ee2d05c3c56811ecb1d2189d9f803f65
    s3_bucket_name = s3_bucket_name.rstrip()
    ACCESS_KEY = ACCESS_KEY.rstrip()
    SECRET_KEY = SECRET_KEY.rstrip()
    # Since the user manually inputs the parameters ACCESS_KEY and SECRET_KEY,
    # it is easy to input whitespaces without noticing that.

    # Now process the non-obligatory parameter.
    # Check if a prefix was passed as input parameter. If so, we must select only the names that start with
    # The prefix.
    # Example: in the bucket 'my_bucket' we have a directory 'dir1'.
    # In the main (root) directory, we have a file 'file1.json' like: '/file1.json'
    # If we pass the prefix 'dir1', we want only the files that start as '/dir1/'
    # such as: 'dir1/file2.json', excluding the file in the main (root) directory and excluding the files in other
    # directories. Also, we want to eliminate the file names with no extensions, like 'dir1/' or 'dir1/dir2',
    # since these object names represent folders or directories, not files.	

    if (s3_obj_prefix is None):
        print ("No prefix, specific object, or subdirectory provided.") 
        print (f"Then, exporting to \'{s3_bucket_name}\' root (main) directory.\n")
        # s3_path: path that the file should have in S3:
        s3_path = "" # empty string for the root directory
    elif ((s3_obj_prefix == "/") | (s3_obj_prefix == '')):
        # The root directory in the bucket must not be specified starting with the slash
        # If the root "/" or the empty string '' is provided, make
        # it equivalent to None (no directory)
        print ("No prefix, specific object, or subdirectory provided.") 
        print (f"Then, exporting to \'{s3_bucket_name}\' root (main) directory.\n")
        # s3_path: path that the file should have in S3:
        s3_path = "" # empty string for the root directory
    
    else:
        # Since there is a prefix, use the str attribute to guarantee that the path was read as a string:
        s3_obj_prefix = str(s3_obj_prefix)
            
        if(s3_obj_prefix[0] == "/"):
            # the first character is the slash. Let's remove it

            # In AWS, neither the prefix nor the path to which the file will be imported
            # (file from S3 to workspace) or from which the file will be exported to S3
            # (the path in the notebook's workspace) may start with slash, or the operation
            # will not be concluded. Then, we have to remove this character if it is present.

            # So, slice the whole string, starting from character 1 (as done for 
            # directory_of_notebook_workspace_storing_files_to_export):
            s3_obj_prefix = s3_obj_prefix[1:]

        # Remove any possible trailing spaces (whitespaces and tabulations)
        # that may be present in the string. Use the Python string
        # rstrip method, which removes characters only from the right end of the string:
        s3_obj_prefix = s3_obj_prefix.rstrip()
            
        # s3_path: path that the file should have in S3:
        # Make the path the prefix itself, since there is a prefix:
        s3_path = s3_obj_prefix
            
        print("AWS Access Credentials, and bucket\'s prefix, object or subdirectory provided.\n")	

            
        print ("Starting connection with the S3 bucket.\n")
        
        try:
            # Start S3 client as the object 's3_client'
            s3_client = boto3.resource('s3', aws_access_key_id = ACCESS_KEY, aws_secret_access_key = SECRET_KEY)
        
            print(f"Credentials accepted by AWS. S3 client successfully started.\n")
            # An object 'data_table.xlsx' in the main (root) directory of the s3_bucket is stored in Python environment as:
            # s3.ObjectSummary(bucket_name='bucket_name', key='data_table.xlsx')
            # The name of each object is stored as the attribute 'key' of the object.
        
        except:
            
            print("Failed to connect to AWS Simple Storage Service (S3). Review if your credentials are correct.")
            print("The variable \'access_key\' must be set as the value (string) stored as \'Access key ID\' in your user security credentials CSV file.")
            print("The variable \'secret_key\' must be set as the value (string) stored as \'Secret access key\' in your user security credentials CSV file.")
        
        
        try:
            # Connect to the bucket specified as 's3_bucket_name'.
            # The bucket is started as the object 's3_bucket':
            s3_bucket = s3_client.Bucket(s3_bucket_name)
            print(f"Connection with bucket \'{s3_bucket_name}\' established.\n")
            
        except:
            
            print("Failed to connect with the bucket, which usually happens when declaring a wrong bucket\'s name.") 
            print("Check the spelling of your bucket_name string and remember that it must be all in lower-case.\n")
                
        # Now, let's obtain the lists of all file paths in the notebook's workspace and
        # of the paths that the files should have in S3, after being exported.
        
        try:
            
            # start the lists:
            workspace_full_paths = []
            s3_full_paths = []
            
            # Get the total of files in list_of_file_names_with_extensions:
            total_of_files = len(list_of_file_names_with_extensions)
            
            # And Loop through all elements, named 'my_file' from the list
            for my_file in list_of_file_names_with_extensions:
                
                # Get the full path in the notebook's workspace:
                workspace_file_full_path = os.path.join(directory_of_notebook_workspace_storing_files_to_export, my_file)
                # Get the full path that the file will have in S3:
                s3_file_full_path = os.path.join(s3_path, my_file)
                
                # Append these paths to the correspondent lists:
                workspace_full_paths.append(workspace_file_full_path)
                s3_full_paths.append(s3_file_full_path)
                
            # Now, both lists have the same number of elements. For an element (file) i,
            # workspace_full_paths has the full file path in notebook's workspace, and
            # s3_full_paths has the path that the new file should have in S3 bucket.
        
        except:
            
            print("The function returned an error when trying to access the list of files. Declare it as a list of strings, even if there is a single element in the list.")
            print("Example: list_of_file_names_with_extensions = [\'my_file.ext\']\n")
            return "error"
        
        
        # Now, loop through all elements i from the lists.
        # The first elements of the lists have index 0; the last elements have index
        # total_of_files - 1, since there are 'total_of_files' elements:
        
        # Then, export the correspondent element to S3:
        
        try:
            
            for i in range(total_of_files):
                # goes from i = 0 to i = total_of_files - 1

                # get the element from list workspace_full_paths
                # (original path of file i, from which it will be exported):
                PATH_IN_WORKSPACE = workspace_full_paths[i]

                # get the correspondent element of list s3_full_paths
                # (path that the file i should have in S3, after being exported):
                S3_FILE_PATH = s3_full_paths[i]

                # Start the new object in the bucket previously started as 's3_bucket'.
                # Start it with the specified prefix, in S3_FILE_PATH:
                new_s3_object = s3_bucket.Object(S3_FILE_PATH)
                
                # Finally, upload the file in PATH_IN_WORKSPACE.
                # Make new_s3_object the exported file:
            
                # Upload the selected object from the workspace path PATH_IN_WORKSPACE
                # to the S3 path specified as S3_FILE_PATH.
                # The parameter Filename must be input with the path of the file to upload, including its name and
                # extension. Example: Filename = "/my_table.xlsx" uploads an xlsx file named 'my_table' stored in the
                # notebook's main (root) directory.
                new_s3_object.upload_file(Filename = PATH_IN_WORKSPACE)

                print(f"The file \'{list_of_file_names_with_extensions[i]}\' was successfully exported from notebook\'s workspace to AWS Simple Storage Service (S3).\n")

                
            print("Finished exporting the files from the notebook\'s workspace to the S3 bucket. It may take a couple of minutes until they are shown in the S3 environment.\n")
            print("Do not forget to delete these copies after finishing the analysis. They will remain stored in the bucket.\n")


        except:

            # Run this code for any other exception that may happen (no exception error
            # specified, so any exception runs the following code).
            # Check: https://pythonbasics.org/try-except/?msclkid=4f6b4540c5d011ecb1fe8a4566f632a6
            # for seeing how to handle successive exceptions

            print("Attention! The function raised an exception, which is probably due to the AWS Simple Storage Service (S3) permissions.")
            print("Before running this function again, check this quick guide for configuring the permission roles in AWS.\n")
            print("It is necessary to create a user with full access permissions to interact with S3 from SageMaker. To configure the User, go to the upper ribbon of AWS, click on Services, and select IAM – Identity and Access Management.")
            print("1. In IAM\'s lateral panel, search for \'Users\' in the group of Access Management.")
            print("2. Click on the \'Add users\' button.")
            print("3. Set a user name in the text box \'User name\'.")
            print("Attention: users and S3 buckets cannot be written in upper case. Also, selecting a name already used by an Amazon user or bucket will raise an error message.\n")
            print("4. In the field \'Select type of Access to AWS\'-\'Select type of AWS credentials\' select the option \'Access key - Programmatic access\'. After that, click on the button \'Next: Permissions\'.")
            print("5. In the field \'Set Permissions\', keep the \'Add user to a group\' button marked.")
            print("6. In the field \'Add user to a group\', click on \'Create group\' (alternatively, you can be added to a group already configured or copy the permissions of another user).")
            print("7. In the text box \'Group\'s name\', set a name for the new group of permissions.")
            print("8. In the search bar below (\'Filter policies\'), search for a policy that fits your needs, and check the box on the left of this policy. The policy \'AmazonS3FullAccess\' grants full access to the S3 content.")
            print("9. Finally, click on \'Create a group\'.")
            print("10. After the group is created, it will appear with a check box marked, over the previous groups. Keep it marked and click on the button \'Next: Tags\'.")
            print("11. Create and note down the Access key ID and Secret access key. You can also download a comma separated values (CSV) file containing the credentials for future use.")
            print("ATTENTION: These parameters are required for accessing the bucket\'s content from any application, including AWS SageMaker.")
            print("12. Click on \'Next: Review\' and review the user credentials information and permissions.")
            print("13. Click on \'Create user\' and click on the download button to download the CSV file containing the user credentials information.")
            print("The headers of the CSV file (the stored fields) are: \'User name, Password, Access key ID, Secret access key, Console login link\'.")
            print("You need both the values indicated as \'Access key ID\' and as \'Secret access key\' to fetch the S3 bucket.")
            print("\n") # line break
            print("After acquiring the necessary user privileges, use the boto3 library to export the file from the notebook’s workspace to the bucket (i.e., to upload a file to the bucket).")
            print("For exporting the file as a new bucket\'s file use the following code:\n")
            print("1. Set a variable \'access_key\' as the value (string) stored as \'Access key ID\' in your user security credentials CSV file.")
            print("2. Set a variable \'secret_key\' as the value (string) stored as \'Secret access key\' in your user security credentials CSV file.")
            print("3. Set a variable \'bucket_name\' as a string containing only the name of the bucket. Do not add subdirectories, folders (prefixes), or file names.")
            print("Example: if your bucket is named \'my_bucket\' and its main directory contains folders like \'folder1\', \'folder2\', etc, do not declare bucket_name = \'my_bucket/folder1\', even if you only want files from folder1.")
            print("ALWAYS declare only the bucket\'s name: bucket_name = \'my_bucket\'.")
            print("4. Set a variable \'file_path_in_workspace\' containing the path of the file in notebook’s workspace. The file will be exported from “file_path_in_workspace” to the S3 bucket.")
            print("If the file is stored in the notebook\'s root (main) directory: file_path_in_workspace = \"my_file.ext\".")
            print("If the path of the file in the notebook workspace is: \'dir1/…/dirN/my_file.ext\', where dirN is the N-th subdirectory, and dir1 is a folder or directory of the workspace\'s main (root) directory: file_path_in_workspace = \"dir1/…/dirN/my_file.ext\".")
            print("5. Set a variable named \'file_path_in_s3\' containing the path that the exported file should have in the bucket, including the bucket’s subdirectories (prefixes). Include the file name and its extension.")
            print("6. Finally, declare the following code, which refers to the defined variables:\n")

            # Let's use triple quotes to declare a formatted string
            example_code = """
                import boto3
                # Start S3 client as the object 's3_client'
                s3_client = boto3.resource('s3', aws_access_key_id = access_key, aws_secret_access_key = secret_key)
                # Connect to the bucket specified as 'bucket_name'.
                # The bucket is started as the object 's3_bucket':
                s3_bucket = s3_client.Bucket(bucket_name)
                # Start the new object in the bucket previously started as 's3_bucket'.
                # Start it with the specified prefix, in file_path_in_s3:
                new_s3_object = s3_bucket.Object(file_path_in_s3)
                # Finally, upload the file in file_path_in_workspace.
                # Make new_s3_object the exported file:
                # Upload the selected object from the workspace path file_path_in_workspace
                # to the S3 path specified as file_path_in_s3.
                # The parameter Filename must be input with the path of the file to upload, including its name and
                # extension. Example: Filename = "/my_table.xlsx" uploads an xlsx file named 'my_table' stored in
                # the notebook's main (root) directory.
                new_s3_object.upload_file(Filename = file_path_in_workspace)
                """

            print(example_code)

            print("An object \'my_file.ext\' in the main (root) directory of the s3_bucket is stored in Python environment as:")
            print("""s3.ObjectSummary(bucket_name='bucket_name', key='my_file.ext')""")
            # triple quotes to keep the internal quotes without using too many backslashes "\" (the escape character)
            print("Then, the name of each object is stored as the attribute \'key\' of the object. To view all objects, we can loop through their \'key\' attributes:\n")
            example_code = """
                # Loop through all objects of the bucket:
                for stored_obj in s3_bucket.objects.all():
                    # Loop through all elements 'stored_obj' from s3_bucket.objects.all()
                    # Which stores the ObjectSummary for all objects in the bucket s3_bucket:
                    # Print the object’s names:
                    print(stored_obj.key)
                    """

            print(example_code)
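
The function above pairs each file path in the workspace with the key the file should receive in S3. A minimal sketch of that pairing is shown below. The helper name `build_upload_pairs` and the sample folder names are hypothetical. The sketch builds the S3 key with `posixpath.join` rather than `os.path.join`, which is a deliberate substitution: S3 keys use forward slashes, and `posixpath.join` keeps them even when the notebook runs on Windows.

```python
import os
import posixpath

def build_upload_pairs(local_dir, s3_prefix, file_names):
    """Return (local_path, s3_key) tuples for each file name (hypothetical helper)."""
    pairs = []
    for file_name in file_names:
        # Local path follows the conventions of the operating system:
        local_path = os.path.join(local_dir, file_name)
        # posixpath.join always joins with '/', matching the S3 key convention:
        s3_key = posixpath.join(s3_prefix, file_name)
        pairs.append((local_path, s3_key))
    return pairs

pairs = build_upload_pairs("exports", "folder1/folder2", ["a.csv", "b.xlsx"])
for local_path, s3_key in pairs:
    print(local_path, "->", s3_key)
```

Each resulting pair could then be passed to `s3_bucket.Object(s3_key).upload_file(Filename = local_path)`, as done in the loop of the function.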

## **Call the functions**

### **Mounting Google Drive or S3 (AWS Simple Storage Service) bucket**

In [None]:
SOURCE = 'aws'
# SOURCE = 'google' for mounting the google drive;
# SOURCE = 'aws' for accessing an AWS S3 bucket

## THE FOLLOWING PARAMETERS HAVE EFFECT ONLY WHEN SOURCE == 'aws':

PATH_TO_STORE_IMPORTED_S3_BUCKET = ''
# PATH_TO_STORE_IMPORTED_S3_BUCKET: path of the Python environment to which the
# S3 bucket contents will be imported. If it is None; or if it is an empty string; or if 
# PATH_TO_STORE_IMPORTED_S3_BUCKET = '/', bucket will be imported to the root path. 
# Alternatively, input the path as a string (in quotes). e.g. 
# PATH_TO_STORE_IMPORTED_S3_BUCKET = 'copied_s3_bucket'

S3_BUCKET_NAME = 'my_bucket'
# This parameter is mandatory for accessing an AWS S3 bucket. Replace it with a string
# containing the bucket's name. e.g. S3_BUCKET_NAME = "aws-bucket-1" accesses a bucket named
# "aws-bucket-1"

S3_OBJECT_FOLDER_PREFIX = ""
# S3_OBJECT_FOLDER_PREFIX = None. Keep it None; or as an empty string 
# (S3_OBJECT_FOLDER_PREFIX = ''); or as the root "/" to import the 
# whole bucket content, instead of a single object from it.
# Alternatively, set it as a string containing the subfolder from the bucket to import:
# Suppose that your bucket (admin-created) has four objects with the following object 
# keys: Development/Projects.xlsx; Finance/statement1.pdf; Private/taxdocument.pdf; and
# s3-dg.pdf. 
# The s3-dg.pdf key does not have a prefix, so its object appears directly 
# at the root level of the bucket. If you open the Development/ folder, you see 
# the Projects.xlsx object in it.
# In summary, if the path of the file is: 'bucket/my_path/.../file.csv'
# where 'bucket' is the bucket's name, prefix = 'my_path/.../', without the
# 'file.csv' (file name with extension) last part.

# So, declare the prefix as S3_OBJECT_FOLDER_PREFIX to import only files from
# a given folder (directory) of the bucket.
# DO NOT PUT A SLASH before (to the left of) the prefix;
# DO NOT ADD THE BUCKET'S NAME before (to the left of) the prefix:
# S3_OBJECT_FOLDER_PREFIX = "bucket_directory1/.../bucket_directoryN/"

# Alternatively, provide the full path of a given file if you want to import only it:
# S3_OBJECT_FOLDER_PREFIX = "bucket_directory1/.../bucket_directoryN/my_file.ext"
# where my_file is the file's name, and ext is its extension.


# Attention: after running this function for fetching AWS Simple Storage Service (S3),
# your 'AWS Access key ID' and your 'Secret access key' will be requested.
# The 'Secret access key' will be hidden through dots, so it cannot be visualized or copied by
# other users. On the other hand, the same is not true for the 'Access key ID', the bucket's name
# and the prefix. All of these are sensitive information from the organization.
# Therefore, after importing the information, always remember to clear the output of this cell
# and to remove such information from the strings.
# Remember that these data may grant privileges for accessing protected information,
# so they should not be used by unauthorized people.

# Also, remember to delete the imported files from the workspace after finishing the analysis.
# The cost of storing the files in S3 is much lower than that of storing them directly in the
# workspace. Also, files stored in S3 may be accessed by users other than those with access to
# the notebook's workspace.
mount_storage_system (source = SOURCE, path_to_store_imported_s3_bucket = PATH_TO_STORE_IMPORTED_S3_BUCKET, s3_bucket_name = S3_BUCKET_NAME, s3_obj_prefix = S3_OBJECT_FOLDER_PREFIX)
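
The relation between an object key, its prefix, and its file name, as described in the comments above, can be sketched in plain Python. The helper name `split_s3_key` is hypothetical, and the sketch does not contact AWS: it only shows how a key decomposes on its last slash.

```python
def split_s3_key(object_key):
    """Split an S3 object key into (prefix, file_name) (hypothetical helper)."""
    # rpartition splits on the last '/': everything before it, plus the slash,
    # is the prefix; everything after it is the file name.
    head, slash, file_name = object_key.rpartition("/")
    return head + slash, file_name

for key in ["Development/Projects.xlsx", "my_path/sub/file.csv", "s3-dg.pdf"]:
    print(key, "->", split_s3_key(key))
```

A key without a slash, such as 's3-dg.pdf', yields an empty prefix, which corresponds to an object stored at the root level of the bucket.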

### **Converting JSON object to dataframe**

In [None]:
# JSON object in terms of Python structure: list of dictionaries, where each value of a
# dictionary may be a dictionary or a list of dictionaries (nested structures).
# example of highly nested structure saved as a list 'json_formatted_list'. Note that the same
# structure could be declared and stored into a string variable. For instance, if you have a txt
# file containing JSON, you could read the txt and save its content as a string.
# json_formatted_list = [{'field1': val1, 'field2': {'dict_val': dict_val}, 'field3': [{
# 'nest1': nest_val1}, {'nest2': nest_val2}]}, {'field1': val1, 'field2': {'dict_val': dict_val}, 
# 'field3': [{'nest1': nest_val1}, {'nest2': nest_val2}]}]

JSON_OBJ_TO_CONVERT = json_object #Alternatively: object containing the JSON to be converted

# JSON_OBJ_TO_CONVERT: object containing JSON, or string with JSON content to parse.
# Objects may be: string with JSON formatted text;
# list with nested dictionaries (JSON formatted);
# dictionaries, possibly with nested dictionaries (JSON formatted).

JSON_OBJ_TYPE = 'list'
# JSON_OBJ_TYPE = 'list', in case the object was saved as a list of dictionaries (JSON format)
# JSON_OBJ_TYPE = 'string', in case it was saved as a string (text) containing JSON.

## Parameters for loading JSON files:

JSON_RECORD_PATH = None
# JSON_RECORD_PATH (string): manipulates the parameter 'record_path' from json_normalize method.
# Path in each object to list of records. If not passed, data will be assumed to 
# be an array of records. If a given field from the JSON stores a nested JSON (or a nested
# dictionary) declare it here to decompose the content of the nested data. e.g. if the field
# 'books' stores a nested JSON, declare, JSON_RECORD_PATH = 'books'

JSON_FIELD_SEPARATOR = "_"
# JSON_FIELD_SEPARATOR = "_" (string). Manipulates the parameter 'sep' from json_normalize method.
# Nested records will generate names separated by sep. 
# e.g., for JSON_FIELD_SEPARATOR = ".", {'foo': {'bar': 0}} -> foo.bar.
# Then, if a given field 'main_field' stores a nested JSON with fields 'field1', 'field2', ...
# the name of the columns of the dataframe will be formed by concatenating 'main_field', the
# separator, and the names of the nested fields: 'main_field_field1', 'main_field_field2',...

JSON_METADATA_PREFIX_LIST = None
# JSON_METADATA_PREFIX_LIST: list of strings (in quotes). Manipulates the parameter 
# 'meta' from json_normalize method. Fields to use as metadata for each record in resulting 
# table. Declare here the non-nested fields, i.e., the fields in the principal JSON. They
# will be repeated in the rows of the dataframe to give the metadata (context) of the rows.

# e.g. Suppose a JSON with the following structure: [{'name': 'Mary', 'last': 'Shelley',
# 'books': [{'title': 'Frankenstein', 'year': 1818}, {'title': 'Mathilda', 'year': 1819},{'title': 'The Last Man', 'year': 1826}]}]
# Here, there are nested JSONs in the field 'books'. The fields that are not nested
# are 'name' and 'last'.
# Then, JSON_RECORD_PATH = 'books'
# JSON_METADATA_PREFIX_LIST = ['name', 'last']


# The dataframe will be stored in the object named 'dataset':
# Simply modify this object on the left of equality:
dataset = json_obj_to_pandas_dataframe (json_obj_to_convert = JSON_OBJ_TO_CONVERT, json_obj_type = JSON_OBJ_TYPE, json_record_path = JSON_RECORD_PATH, json_field_separator = JSON_FIELD_SEPARATOR, json_metadata_prefix_list = JSON_METADATA_PREFIX_LIST)
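
To make the roles of JSON_RECORD_PATH and JSON_METADATA_PREFIX_LIST more concrete, the plain-Python sketch below applies the expansion they describe to the Mary Shelley example. It only illustrates the expected row layout; it is not the code of `json_obj_to_pandas_dataframe` nor of the `json_normalize` method, and the helper name `expand_records` is hypothetical.

```python
def expand_records(json_list, record_path, meta_fields):
    """One row per nested record, with the metadata fields repeated (hypothetical helper)."""
    rows = []
    for entry in json_list:
        for nested_record in entry[record_path]:
            row = dict(nested_record)        # fields of the nested record
            for field in meta_fields:
                row[field] = entry[field]    # context repeated on every row
            rows.append(row)
    return rows

json_formatted_list = [{
    'name': 'Mary', 'last': 'Shelley',
    'books': [{'title': 'Frankenstein', 'year': 1818},
              {'title': 'Mathilda', 'year': 1819},
              {'title': 'The Last Man', 'year': 1826}],
}]

rows = expand_records(json_formatted_list, record_path = 'books', meta_fields = ['name', 'last'])
for row in rows:
    print(row)
```

Each of the three books becomes one row, and the fields 'name' and 'last' are repeated on every row to give its context.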

### **Removing trailing or leading white spaces or characters (trim) from string variables, and modifying the variable type**

In [None]:
STRING_OR_LIST_OF_STRINGS = []
# string_or_list_of_strings: string or list of strings (inside quotes), 
# that will be analyzed. 
# e.g. string_or_list_of_strings = "column1" will analyze 'column1', whereas 
# string_or_list_of_strings = ['col1', 'col2'] will process both 'col1' and 'col2'.

NEW_VARIABLE_TYPE = None
# NEW_VARIABLE_TYPE = None. String (in quotes) that represents a given data type for the column
# after transformation. Set:
# - NEW_VARIABLE_TYPE = 'int' to convert the column to integer type after the transform;
# - NEW_VARIABLE_TYPE = 'float' to convert the column to float (decimal number);
# - NEW_VARIABLE_TYPE = 'datetime' to convert it to date or timestamp;
# - NEW_VARIABLE_TYPE = 'category' to convert it to Pandas categorical variable.
    
METHOD = 'trim'
# METHOD = 'trim' will eliminate trailing and leading white spaces from the strings in
# STRING_OR_LIST_OF_STRINGS.
# METHOD = 'substring' will eliminate a defined trailing and leading substring from
# STRING_OR_LIST_OF_STRINGS.

SUBSTRING_TO_ELIMINATE = None
# SUBSTRING_TO_ELIMINATE = None. Set as a string (in quotes) if METHOD = 'substring'.
# e.g. suppose STRING_OR_LIST_OF_STRINGS contains time information: each string ends in " min":
# "1 min", "2 min", "3 min", etc. If SUBSTRING_TO_ELIMINATE = " min", this portion will be
# eliminated, resulting in: "1", "2", "3", etc. If NEW_VARIABLE_TYPE = None, these values will
# continue to be strings. By setting NEW_VARIABLE_TYPE = 'int' or 'float', the series will be
# converted to a numeric type.
    

# The list of strings will be stored in the object named list_of_strings:
# Simply modify this object on the left of equality:
list_of_strings = trim_spaces_or_characters (string_or_list_of_strings = STRING_OR_LIST_OF_STRINGS, new_variable_type = NEW_VARIABLE_TYPE, method = METHOD, substring_to_eliminate = SUBSTRING_TO_ELIMINATE)
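
The two methods described above can be sketched with built-in string operations. The helper name `trim_sketch` is hypothetical, and it is an assumption that the substring is removed as an exact sequence from each end; note that `str.strip(chars)` would instead treat its argument as a set of characters, which is why it is not used for the 'substring' case here.

```python
def trim_sketch(strings, substring_to_eliminate = None, new_type = None):
    """Strip white spaces or a leading/trailing substring, then optionally cast (hypothetical helper)."""
    cleaned = []
    for text in strings:
        if substring_to_eliminate:
            # Remove the exact substring from the start and from the end, if present:
            if text.startswith(substring_to_eliminate):
                text = text[len(substring_to_eliminate):]
            if text.endswith(substring_to_eliminate):
                text = text[:-len(substring_to_eliminate)]
        else:
            # Default behavior: remove leading and trailing white spaces
            text = text.strip()
        # Optionally convert the cleaned string (e.g. new_type = int or float):
        cleaned.append(new_type(text) if new_type else text)
    return cleaned

print(trim_sketch(["  a ", "b  "]))                            # ['a', 'b']
print(trim_sketch(["1 min", "2 min", "3 min"], " min", int))   # [1, 2, 3]
```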

### **Capitalizing or lowering case of string variables (string homogenizing)**

In [None]:
STRING_OR_LIST_OF_STRINGS = []
# string_or_list_of_strings: string or list of strings (inside quotes), 
# that will be analyzed. 
# e.g. string_or_list_of_strings = "column1" will analyze 'column1', whereas 
# string_or_list_of_strings = ['col1', 'col2'] will process both 'col1' and 'col2'.


METHOD = 'lowercase'
# METHOD = 'capitalize' will capitalize all letters from the input string 
# (turn them to upper case).
# METHOD = 'lowercase' will make the opposite: turn all letters to lower case.
# e.g. suppose STRING_OR_LIST_OF_STRINGS contains strings such as 'String One', 'STRING 2',  and
# 'string3'. If METHOD = 'capitalize', the output will contain the strings: 
# 'STRING ONE', 'STRING 2', 'STRING3'. If METHOD = 'lowercase', the outputs will be:
# 'string one', 'string 2', 'string3'.
    
    
# The list of strings will be stored in the object named list_of_strings:
# Simply modify this object on the left of equality:
list_of_strings = capitalize_or_lower_string_case (string_or_list_of_strings = STRING_OR_LIST_OF_STRINGS, method = METHOD)
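
The expected outputs described above can be reproduced with Python's built-in string methods, as in the short sketch below (it is not the implementation of `capitalize_or_lower_string_case`). It also shows that the built-in `str.capitalize` behaves differently from METHOD = 'capitalize' as described here: it upper-cases only the first letter and lower-cases the rest.

```python
strings = ['String One', 'STRING 2', 'string3']

upper_case = [text.upper() for text in strings]        # all letters in upper case
lower_case = [text.lower() for text in strings]        # all letters in lower case
first_only = [text.capitalize() for text in strings]   # only the first letter in upper case

print(upper_case)   # ['STRING ONE', 'STRING 2', 'STRING3']
print(lower_case)   # ['string one', 'string 2', 'string3']
print(first_only)   # ['String one', 'String 2', 'String3']
```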

### **Adding contractions to the contractions library**

In [71]:
LIST_OF_CONTRACTIONS = [
    
    {'contracted_expression': None, 'correct_expression': None}, 
    {'contracted_expression': None, 'correct_expression': None}, 
    {'contracted_expression': None, 'correct_expression': None}, 
    {'contracted_expression': None, 'correct_expression': None}

]
# LIST_OF_CONTRACTIONS = [{'contracted_expression': None, 'correct_expression': None}]
# This is a list of dictionaries, where each dictionary contains two key-value pairs:
# the first one contains the form in which the contraction is usually observed; and the second one 
# contains the correct (full) string that will replace it.
# Since contractions can cause issues when processing text, we can expand them with these functions.
        
# The object list_of_contractions must be declared as a list, 
# in brackets, even if there is a single dictionary.
# Always use the same keys: 'contracted_expression' for the contraction; and 'correct_expression', 
# for the string with the corresponding correction.
        
# If you want, you can remove elements (dictionaries) from the list to declare fewer elements;
# and you can also add more elements (dictionaries) to the lists, if you want to add more elements
# to the contractions library.
# Simply put a comma after the last element from the list and declare a new dictionary, keeping the
# same keys: {'contracted_expression': original_str, 'correct_expression': new_str}, 
# where original_str and new_str represent the contracted and expanded strings
# (If one of the values is None, the new dictionary will be ignored).
        
# Example:
# LIST_OF_CONTRACTIONS = [{'contracted_expression': 'mychange', 'correct_expression': 'my change'}]
        

add_contractions_to_library (list_of_contractions = LIST_OF_CONTRACTIONS)

Successfully included the contracted expression hj to the contractions library.
Now, the function for contraction correction will be able to process it within the strings.



### **Correcting contracted strings**

In [None]:
STRING_OR_LIST_OF_STRINGS = []
# string_or_list_of_strings: string or list of strings (inside quotes), 
# that will be analyzed. 
# e.g. string_or_list_of_strings = "column1" will analyze 'column1', whereas 
# string_or_list_of_strings = ['col1', 'col2'] will process both 'col1' and 'col2'.


# The list of strings will be stored in the object named list_of_strings:
# Simply modify this object on the left of equality:
list_of_strings = correct_contracted_strings (string_or_list_of_strings = STRING_OR_LIST_OF_STRINGS)
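
The idea behind the two previous cells (registering contractions, then expanding them within strings) can be sketched with a plain dictionary and a regular expression. This is only an illustration under stated assumptions: the names `contractions_library`, `add_contractions`, and `expand_contractions` are hypothetical, the initial entries are sample values, and the actual functions may rely on a dedicated contractions library instead.

```python
import re

# Sample library (assumption): contracted form -> expanded form
contractions_library = {"can't": "cannot", "won't": "will not"}

def add_contractions(library, list_of_contractions):
    """Add each valid dictionary of the list to the library (hypothetical helper)."""
    for item in list_of_contractions:
        contracted = item.get('contracted_expression')
        correct = item.get('correct_expression')
        if contracted is None or correct is None:
            continue  # incomplete dictionaries are ignored
        library[contracted] = correct

def expand_contractions(text, library):
    """Replace every whole-word contraction found in text (hypothetical helper)."""
    if not library:
        return text
    # Longest keys first, so longer contractions are matched before shorter ones:
    keys = sorted(library, key = len, reverse = True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(key) for key in keys) + r")\b")
    return pattern.sub(lambda match: library[match.group(0)], text)

add_contractions(contractions_library, [
    {'contracted_expression': 'mychange', 'correct_expression': 'my change'},
    {'contracted_expression': None, 'correct_expression': None},
])

print(expand_contractions("I can't accept mychange", contractions_library))
# I cannot accept my change
```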

### **Substituting (replacing) substrings on string variables**

In [None]:
STRING_OR_LIST_OF_STRINGS = []
# string_or_list_of_strings: string or list of strings (inside quotes), 
# that will be analyzed. 
# e.g. string_or_list_of_strings = "column1" will analyze 'column1', whereas 
# string_or_list_of_strings = ['col1', 'col2'] will process both 'col1' and 'col2'.

SUBSTRING_TO_BE_REPLACED = None
NEW_SUBSTRING_FOR_REPLACEMENT = ''
# SUBSTRING_TO_BE_REPLACED = None; NEW_SUBSTRING_FOR_REPLACEMENT = ''. 
# Strings (in quotes): when the sequence of characters SUBSTRING_TO_BE_REPLACED is
# found in the strings from STRING_OR_LIST_OF_STRINGS, it will be substituted by the substring
# NEW_SUBSTRING_FOR_REPLACEMENT. If None is provided to one of these substring arguments,
# it will be substituted by the empty string: ''
# e.g. suppose STRING_OR_LIST_OF_STRINGS contains the following strings, with a spelling error:
# "my collumn 1", 'his collumn 2', 'her column 3'. We may correct this error by setting:
# SUBSTRING_TO_BE_REPLACED = 'collumn' and NEW_SUBSTRING_FOR_REPLACEMENT = 'column'. The
# function will search for the wrong group of characters and, if it finds it, will substitute
# by the correct sequence: "my column 1", 'his column 2', 'her column 3'.


# The list of strings will be stored in the object named list_of_strings:
# Simply modify this object on the left of equality:
list_of_strings = replace_substring (string_or_list_of_strings = STRING_OR_LIST_OF_STRINGS, substring_to_be_replaced = SUBSTRING_TO_BE_REPLACED, new_substring_for_replacement = NEW_SUBSTRING_FOR_REPLACEMENT)
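
The 'collumn' example above can be reproduced with the built-in `str.replace` method. The helper name `replace_sketch` is hypothetical. One design choice in this sketch is an assumption: when the substring to be replaced is None or empty, the strings are returned unchanged, because `str.replace` with an empty search string would insert the new text between every pair of characters.

```python
def replace_sketch(strings, substring_to_be_replaced, new_substring_for_replacement):
    """Replace a substring in every string of a list (hypothetical helper)."""
    old = '' if substring_to_be_replaced is None else substring_to_be_replaced
    new = '' if new_substring_for_replacement is None else new_substring_for_replacement
    if old == '':
        # Nothing to search for: return a copy of the input
        return list(strings)
    return [text.replace(old, new) for text in strings]

strings = ["my collumn 1", "his collumn 2", "her column 3"]
print(replace_sketch(strings, 'collumn', 'column'))
# ['my column 1', 'his column 2', 'her column 3']
```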

### **Inverting the order of the string characters**

In [None]:
STRING_OR_LIST_OF_STRINGS = []
# string_or_list_of_strings: string or list of strings (inside quotes), 
# that will be analyzed. 
# e.g. string_or_list_of_strings = "column1" will analyze 'column1', whereas 
# string_or_list_of_strings = ['col1', 'col2'] will process both 'col1' and 'col2'.


# The list of strings will be stored in the object named list_of_strings:
# Simply modify this object on the left of equality:
list_of_strings = invert_strings (string_or_list_of_strings = STRING_OR_LIST_OF_STRINGS)

### **Slicing the strings**

In [None]:
STRING_OR_LIST_OF_STRINGS = []
# string_or_list_of_strings: string or list of strings (inside quotes), 
# that will be analyzed. 
# e.g. string_or_list_of_strings = "column1" will analyze 'column1', whereas 
# string_or_list_of_strings = ['col1', 'col2'] will process both 'col1' and 'col2'.

FIRST_CHARACTER_INDEX = None
# FIRST_CHARACTER_INDEX = None - integer representing the index of the first character to be
# included in the new strings. If None, slicing will start from first character.
# Indexing of strings always starts from 0. The last index can be represented as -1, the index of
# the character before as -2, etc (inverse indexing starts from -1).
# example: consider the string "idsw", which contains 4 characters. We can represent the indices as:
# 'i': index 0; 'd': 1, 's': 2, 'w': 3. Alternatively: 'w': -1, 's': -2, 'd': -3, 'i': -4.

LAST_CHARACTER_INDEX = None
# LAST_CHARACTER_INDEX = None - integer representing the index of the last character to be
# included in the new strings. If None, slicing will go until the last character.
# Attention: this is effectively the last character to be added, and not the next index after last
# character.
        
# in the 'idsw' example, if we want a string as 'ds', we want the FIRST_CHARACTER_INDEX = 1 and
# LAST_CHARACTER_INDEX = 2.

STEP = 1
# STEP = 1 - integer representing the slicing step. If step = 1, all characters will be added.
# If STEP = 2, then the slicing will pick one element of index i and the element with index (i+2)
# (1 index will be 'jumped'), and so on.
# If STEP is negative, then the order of the new strings will be inverted.
# Example: STEP = -1, and the start and finish indices are None: the output will be the inverted
# string, 'wsdi'.
# FIRST_CHARACTER_INDEX = 1, LAST_CHARACTER_INDEX = 2, STEP = 1: output = 'ds';
# FIRST_CHARACTER_INDEX = None, LAST_CHARACTER_INDEX = None, STEP = 2: output = 'is';
# FIRST_CHARACTER_INDEX = None, LAST_CHARACTER_INDEX = None, STEP = 3: output = 'iw';
# FIRST_CHARACTER_INDEX = -1, LAST_CHARACTER_INDEX = -2, STEP = -1: output = 'ws';
# FIRST_CHARACTER_INDEX = -1, LAST_CHARACTER_INDEX = None, STEP = -2: output = 'wd';
# FIRST_CHARACTER_INDEX = -1, LAST_CHARACTER_INDEX = None, STEP = 1: output = 'w'
# In this last example, the function tries to access the next element after the character of index
# -1. Since -1 is the last character, there are no other characters to be added.
# FIRST_CHARACTER_INDEX = -2, LAST_CHARACTER_INDEX = -1, STEP = 1: output = 'sw'.


# The list of strings will be stored in the object named list_of_strings:
# Simply modify this object on the left of equality:
list_of_strings = slice_strings (string_or_list_of_strings = STRING_OR_LIST_OF_STRINGS, first_character_index = FIRST_CHARACTER_INDEX, last_character_index = LAST_CHARACTER_INDEX, step = STEP)
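
Since LAST_CHARACTER_INDEX is described as the last character actually included, while Python's own slicing excludes the stop index, a conversion is needed. The sketch below shows one possible way of doing it and checks it against the 'idsw' examples listed above. The helper name `slice_inclusive` is hypothetical, and it is an assumption that `slice_strings` works in a similar way.

```python
def slice_inclusive(text, first = None, last = None, step = 1):
    """Slice text treating `last` as the last index to keep (hypothetical helper)."""
    if last is None:
        stop = None
    elif step > 0:
        # Python excludes the stop index, so move one position forward.
        # last = -1 would give stop = 0 (an empty slice), hence None (go to the end).
        stop = last + 1 if last != -1 else None
    else:
        # Walking backwards: move one position back.
        # last = 0 would give stop = -1 (the last character), hence None (go to the start).
        stop = last - 1 if last != 0 else None
    return text[first:stop:step]

print(slice_inclusive('idsw', 1, 2, 1))          # ds
print(slice_inclusive('idsw', None, None, -1))   # wsdi
print(slice_inclusive('idsw', -1, -2, -1))       # ws
print(slice_inclusive('idsw', -2, -1, 1))        # sw
```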

### **Getting the leftest characters from the strings (retrieve last characters)**

In [None]:
STRING_OR_LIST_OF_STRINGS = []
# string_or_list_of_strings: string or list of strings (inside quotes), 
# that will be analyzed. 
# e.g. string_or_list_of_strings = "column1" will analyze 'column1', whereas 
# string_or_list_of_strings = ['col1', 'col2'] will process both 'col1' and 'col2'.

NUMBER_OF_CHARACTERS_TO_RETRIEVE = 1
# NUMBER_OF_CHARACTERS_TO_RETRIEVE = 1 - integer representing the total of characters that will
# be retrieved. Here, we will retrieve the characters at the end of each string (the ones this
# notebook calls 'leftest'; in text written from left to right, they appear on the right).
# If NUMBER_OF_CHARACTERS_TO_RETRIEVE = 1, only the last character will be retrieved.
# Consider the string 'idsw'.
# NUMBER_OF_CHARACTERS_TO_RETRIEVE = 1 - output: 'w';
# NUMBER_OF_CHARACTERS_TO_RETRIEVE = 2 - output: 'sw'.

NEW_VARIABLE_TYPE = None
# NEW_VARIABLE_TYPE = None. String (in quotes) that represents a given data type for the column
# after transformation. Set:
# - NEW_VARIABLE_TYPE = 'int' to convert the column to integer type after the transform;
# - NEW_VARIABLE_TYPE = 'float' to convert the column to float (decimal number);
# - NEW_VARIABLE_TYPE = 'datetime' to convert it to date or timestamp;
# - NEW_VARIABLE_TYPE = 'category' to convert it to Pandas categorical variable.
# So, if the last part of the strings is a number, you can use this argument to directly extract
# this part as numeric variable.
    

# The list of strings will be stored in the object named list_of_strings:
# Simply modify this object on the left of equality:
list_of_strings = left_characters (string_or_list_of_strings = STRING_OR_LIST_OF_STRINGS, number_of_characters_to_retrieve = NUMBER_OF_CHARACTERS_TO_RETRIEVE, new_variable_type = NEW_VARIABLE_TYPE)

### **Getting the rightest characters from the strings (retrieve first characters)**

In [None]:
STRING_OR_LIST_OF_STRINGS = []
# string_or_list_of_strings: string or list of strings (inside quotes), 
# that will be analyzed. 
# e.g. string_or_list_of_strings = "column1" will analyze 'column1', whereas 
# string_or_list_of_strings = ['col1', 'col2'] will process both 'col1' and 'col2'.

NUMBER_OF_CHARACTERS_TO_RETRIEVE = 1
# NUMBER_OF_CHARACTERS_TO_RETRIEVE = 1 - integer representing the total of characters that will
# be retrieved. Here, we will retrieve the characters at the start of each string (the ones this
# notebook calls 'rightest'; in text written from left to right, they appear on the left).
# If NUMBER_OF_CHARACTERS_TO_RETRIEVE = 1, only the first character will be retrieved.
# Consider the string 'idsw'.
# NUMBER_OF_CHARACTERS_TO_RETRIEVE = 1 - output: 'i';
# NUMBER_OF_CHARACTERS_TO_RETRIEVE = 2 - output: 'id'.

NEW_VARIABLE_TYPE = None
# NEW_VARIABLE_TYPE = None. String (in quotes) that represents a given data type for the column
# after transformation. Set:
# - NEW_VARIABLE_TYPE = 'int' to convert the column to integer type after the transform;
# - NEW_VARIABLE_TYPE = 'float' to convert the column to float (decimal number);
# - NEW_VARIABLE_TYPE = 'datetime' to convert it to date or timestamp;
# - NEW_VARIABLE_TYPE = 'category' to convert it to Pandas categorical variable.
# So, if the first part of the strings is a number, you can use this argument to directly extract
# this part as numeric variable.
    

# The list of strings will be stored in the object named list_of_strings:
# Simply modify this object on the left of equality:
list_of_strings = right_characters (string_or_list_of_strings = STRING_OR_LIST_OF_STRINGS, number_of_characters_to_retrieve = NUMBER_OF_CHARACTERS_TO_RETRIEVE, new_variable_type = NEW_VARIABLE_TYPE)
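
The retrieval described in the two cells above can be illustrated with plain Python slicing. This is only a minimal sketch, not the notebook's own implementation: the helper name `take_characters` and its `from_start` and `new_type` parameters are hypothetical, and the sketch assumes that the number of characters is at least 1.

```python
def take_characters(strings, n, from_start=True, new_type=None):
    """Return the first (or last) n characters of each string, optionally cast.

    Hypothetical helper for illustration only; assumes n >= 1.
    """
    # Accept a single string as well as a list of strings.
    if isinstance(strings, str):
        strings = [strings]
    # s[:n] keeps the first n characters; s[-n:] keeps the last n.
    pieces = [s[:n] if from_start else s[-n:] for s in strings]
    # Optionally convert each retrieved piece (e.g. with int or float).
    if new_type is not None:
        pieces = [new_type(p) for p in pieces]
    return pieces


first = take_characters(['idsw', '42abc'], 2)                    # ['id', '42']
last = take_characters('idsw', 1, from_start=False)              # ['w']
numbers = take_characters(['42abc', '17xyz'], 2, new_type=int)   # [42, 17]
```

The last call mirrors the idea behind NEW_VARIABLE_TYPE: when the retrieved part is numeric, it can be converted right after slicing.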

### **Joining list of strings into a single string**

In [None]:
STRING_OR_LIST_OF_STRINGS = []
# string_or_list_of_strings: string or list of strings (inside quotes), 
# that will be analyzed. 
# e.g. string_or_list_of_strings = "column1" will analyze 'column1', whereas 
# string_or_list_of_strings = ['col1', 'col2'] will process both 'col1' and 'col2'.

SEPARATOR = " "
# SEPARATOR = " " - string containing the separator. Suppose the column contains the
# strings: 'a', 'b', 'c', 'd'. If the SEPARATOR is the empty string '', the output will be:
# 'abcd' (no separation). If SEPARATOR = " " (simple whitespace), the output will be 'a b c d'


# The list of strings will be stored in the object named list_of_strings:
# Simply modify this object on the left of equality:
list_of_strings = join_list_of_strings (string_or_list_of_strings = STRING_OR_LIST_OF_STRINGS, separator = SEPARATOR)

### **Splitting strings into a list of strings**

In [None]:
STRING_OR_LIST_OF_STRINGS = []
# string_or_list_of_strings: string or list of strings (inside quotes), 
# that will be analyzed. 
# e.g. string_or_list_of_strings = "column1" will analyze 'column1', whereas 
# string_or_list_of_strings = ['col1', 'col2'] will process both 'col1' and 'col2'.

SEPARATOR = " "
# SEPARATOR = " " - string containing the separator. Suppose the column contains the
# string: 'a b c d' on a given row. If the SEPARATOR is whitespace ' ', 
# the output will be a list: ['a', 'b', 'c', 'd']: the function splits the string into a list
# of strings (one list per row) every time it finds the SEPARATOR.


# The list of strings will be stored in the object named list_of_strings:
# Simply modify this object on the left of equality:
list_of_strings = split_strings (string_or_list_of_strings = STRING_OR_LIST_OF_STRINGS, separator = SEPARATOR)
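
As a quick reference for the two cells above, the behavior of the separator can be reproduced with Python's built-in `str.join` and `str.split`. This is an illustrative sketch using the example values from the comments; it does not call `join_list_of_strings` or `split_strings`, and the variable names are arbitrary.

```python
letters = ['a', 'b', 'c', 'd']

# Joining: the separator is placed between consecutive elements.
joined_without_separator = ''.join(letters)   # 'abcd'
joined_with_whitespace = ' '.join(letters)    # 'a b c d'

# Splitting: each string becomes a list, cut wherever the separator is found.
rows = ['a b c d', 'x y']
split_rows = [row.split(' ') for row in rows]  # [['a', 'b', 'c', 'd'], ['x', 'y']]
```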

### **Substituting (replacing or switching) whole strings by different text values (on string variables)**

In [None]:
STRING_OR_LIST_OF_STRINGS = []
# string_or_list_of_strings: string or list of strings (inside quotes), 
# that will be analyzed. 
# e.g. string_or_list_of_strings = "column1" will analyze 'column1', whereas 
# string_or_list_of_strings = ['col1', 'col2'] will process both 'col1' and 'col2'.

LIST_OF_DICTIONARIES_WITH_ORIGINAL_STRINGS_AND_REPLACEMENTS = [
    
    {'original_string': None, 'new_string': None}, 
    {'original_string': None, 'new_string': None}, 
    {'original_string': None, 'new_string': None}, 
    {'original_string': None, 'new_string': None}, 
    {'original_string': None, 'new_string': None}, 
    {'original_string': None, 'new_string': None}, 
    {'original_string': None, 'new_string': None}, 
    {'original_string': None, 'new_string': None}, 
    {'original_string': None, 'new_string': None}, 
    {'original_string': None, 'new_string': None}, 
    {'original_string': None, 'new_string': None}
    
]
# LIST_OF_DICTIONARIES_WITH_ORIGINAL_STRINGS_AND_REPLACEMENTS = 
# [{'original_string': None, 'new_string': None}]
# This is a list of dictionaries, where each dictionary contains two key-value pairs:
# the first one contains the original string; and the second one contains the new string
# that will substitute the original one. The function will loop through all dictionaries in
# this list, access the values of the keys 'original_string', and search for these values among
# the strings in STRING_OR_LIST_OF_STRINGS. When a value is found, it will be replaced (switched)
# by the corresponding value in key 'new_string'.
    
# The object LIST_OF_DICTIONARIES_WITH_ORIGINAL_STRINGS_AND_REPLACEMENTS must be declared as a list, 
# in brackets, even if there is a single dictionary.
# Always use the same keys: 'original_string' for the original strings to search for in 
# STRING_OR_LIST_OF_STRINGS; and 'new_string' for the strings that will replace the original ones.
# Notice that this function will not search for substrings: it will substitute a value only when
# there is an exact match between the input string and 'original_string'.
# So, the cases (upper or lower) must be the same.
    
# If you want, you can remove elements (dictionaries) from the list to declare fewer elements;
# and you can also add more elements (dictionaries) to the lists, if you need to replace more
# values.
# Simply put a comma after the last element from the list and declare a new dictionary, keeping the
# same keys: {'original_string': original_str, 'new_string': new_str}, 
# where original_str and new_str represent the strings for searching and replacement 
# (If one of the keys contains None, the new dictionary will be ignored).
    
# Example:
# Suppose STRING_OR_LIST_OF_STRINGS contains the values 'sunday', 'monday', 'tuesday', 'wednesday',
# 'thursday', 'friday', 'saturday', but you want to obtain data labelled as 'weekend' or 'weekday'.
# Set: LIST_OF_DICTIONARIES_WITH_ORIGINAL_STRINGS_AND_REPLACEMENTS = 
# [{'original_string': 'sunday', 'new_string': 'weekend'},
# {'original_string': 'saturday', 'new_string': 'weekend'},
# {'original_string': 'monday', 'new_string': 'weekday'},
# {'original_string': 'tuesday', 'new_string': 'weekday'},
# {'original_string': 'wednesday', 'new_string': 'weekday'},
# {'original_string': 'thursday', 'new_string': 'weekday'},
# {'original_string': 'friday', 'new_string': 'weekday'}]


# The list of strings will be stored in the object named list_of_strings:
# Simply modify this object on the left of equality:
list_of_strings = switch_strings (string_or_list_of_strings = STRING_OR_LIST_OF_STRINGS, list_of_dictionaries_with_original_strings_and_replacements = LIST_OF_DICTIONARIES_WITH_ORIGINAL_STRINGS_AND_REPLACEMENTS)
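
The exact-match replacement described above can be sketched with a plain dictionary lookup. This is a hedged illustration rather than the code of `switch_strings`: the helper `switch_exact` is hypothetical, and it only assumes the documented behavior (whole-string, case-sensitive matches; dictionaries containing None are ignored).

```python
def switch_exact(strings, replacements):
    """Replace whole strings that exactly match an 'original_string' entry.

    Hypothetical helper for illustration only.
    """
    # Build a lookup table, skipping dictionaries in which a key holds None.
    mapping = {
        d['original_string']: d['new_string']
        for d in replacements
        if d['original_string'] is not None and d['new_string'] is not None
    }
    # Strings without an exact (case-sensitive) match are kept unchanged.
    return [mapping.get(s, s) for s in strings]


replacements = [
    {'original_string': 'sunday', 'new_string': 'weekend'},
    {'original_string': 'saturday', 'new_string': 'weekend'},
    {'original_string': 'monday', 'new_string': 'weekday'},
    {'original_string': None, 'new_string': None},
]

labels = switch_exact(['sunday', 'monday', 'Sunday', 'sunday morning'], replacements)
# 'Sunday' and 'sunday morning' are not exact matches, so they are left as they are.
```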

### **Replacing strings with Machine Learning: finding similar strings and replacing them by standard strings**

In [None]:
STRING_OR_LIST_OF_STRINGS = []
# string_or_list_of_strings: string or list of strings (inside quotes), 
# that will be analyzed. 
# e.g. string_or_list_of_strings = "column1" will analyze 'column1', whereas 
# string_or_list_of_strings = ['col1', 'col2'] will process both 'col1' and 'col2'.

MODE = 'find_and_replace'
# MODE = 'find_and_replace' will find similar strings and replace them with one of the
# standard strings if the similarity between them is higher than or equal to the threshold.
# Alternatively: MODE = 'find' will only find the similar strings by calculating the similarity.

THRESHOLD_FOR_PERCENT_OF_SIMILARITY = 80.0
# THRESHOLD_FOR_PERCENT_OF_SIMILARITY = 80.0 - 0.0% means no similarity and 100% means equal strings.
# The THRESHOLD_FOR_PERCENT_OF_SIMILARITY is the minimum similarity, calculated from the
# Levenshtein (minimum edit) distance algorithm. This distance represents the minimum number of
# single-character insertions, substitutions or deletions that are needed to make two
# strings equal.

LIST_OF_DICTIONARIES_WITH_STANDARD_STRINGS_FOR_REPLACEMENT = [
    
    {'standard_string': None},
    {'standard_string': None}, 
    {'standard_string': None},
    {'standard_string': None}, 
    {'standard_string': None}, 
    {'standard_string': None},
    {'standard_string': None}, 
    {'standard_string': None},
    {'standard_string': None}, 
    {'standard_string': None}, 
    {'standard_string': None}
    
]
# This is a list of dictionaries, where each dictionary contains a single key-value pair:
# the key must always be 'standard_string', and the value will be one of the standard strings 
# for replacement: if a given string in STRING_OR_LIST_OF_STRINGS presents a similarity with one 
# of the standard strings equal to or higher than the THRESHOLD_FOR_PERCENT_OF_SIMILARITY, it will be
# substituted by this standard string.
# For instance, suppose you have a word written in too many ways, making it difficult to use
# the function switch_strings: "EU" , "eur" , "Europ" , "Europa" , "Erope" , "Evropa" ...
# You can use this function to search strings similar to "Europe" and replace them.
    
# The function will loop through all dictionaries in this list, access the values of the keys 
# 'standard_string', and compare these values with the strings in STRING_OR_LIST_OF_STRINGS. When a 
# string is sufficiently similar to a standard string, it will be replaced (switched) by it.
    
# The object LIST_OF_DICTIONARIES_WITH_STANDARD_STRINGS_FOR_REPLACEMENT must be declared as a list, 
# in brackets, even if there is a single dictionary.
# Always use the same key: 'standard_string'.
# Notice that this function performs fuzzy matching, so it MAY MATCH substrings and strings
# written with different cases (upper or lower) when these portions or modifications make the
# strings sufficiently similar to each other.
    
# If you want, you can remove elements (dictionaries) from the list to declare fewer elements;
# and you can also add more elements (dictionaries) to the lists, if you need to replace more
# values.
# Simply put a comma after the last element from the list and declare a new dictionary, keeping the
# same key: {'standard_string': other_std_str}, 
# where other_std_str represents the string for searching and replacement 
# (If the key contains None, the new dictionary will be ignored).
    
# Example:
# Suppose STRING_OR_LIST_OF_STRINGS contains the values 'California', 'Cali', 'Calefornia', 
# 'Calefornie', 'Californie', 'Calfornia', 'Calefernia', 'New York', 'New York City', 
# but you want to obtain data labelled as the state 'California' or 'New York'.
# Set: LIST_OF_DICTIONARIES_WITH_STANDARD_STRINGS_FOR_REPLACEMENT = 
# [{'standard_string': 'California'},
# {'standard_string': 'New York'}]
    
# ATTENTION: It is advisable to first run the function with MODE = 'find' to inspect the
# similarities and choose the best threshold; set it as high as possible, avoiding incorrect
# substitutions in a gray area; and then perform the replacement. This avoids leaving original
# incorrect strings in the output, as well as wrong replacements (replacement by one of the
# standard strings which is not the correct one).


# The list of strings will be stored in the object named list_of_strings.
# The summary list is saved as summary_list.
# Simply modify these objects on the left of equality:
list_of_strings, summary_list = string_replacement_ml (string_or_list_of_strings = STRING_OR_LIST_OF_STRINGS, mode = MODE, threshold_for_percent_of_similarity = THRESHOLD_FOR_PERCENT_OF_SIMILARITY, list_of_dictionaries_with_standard_strings_for_replacement = LIST_OF_DICTIONARIES_WITH_STANDARD_STRINGS_FOR_REPLACEMENT)
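
To make the role of the threshold more concrete, the sketch below computes the Levenshtein distance in pure Python and turns it into a percentage. It is not the implementation behind `string_replacement_ml`: the helper names are hypothetical, and the scoring formula `100 * (1 - distance / length of the longer string)` is an assumption made for illustration, so the library actually used by the function may return different scores for the same pair of strings.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions or substitutions."""
    previous = list(range(len(b) + 1))  # distances from '' to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from the first i characters of a to ''
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def percent_similarity(a, b):
    """Assumed scoring: 100% for equal strings, lower as more edits are needed."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)


def replace_similar(strings, standards, threshold=80.0):
    """Replace each string by its closest standard string when similar enough."""
    result = []
    for s in strings:
        best = max(standards, key=lambda std: percent_similarity(s, std))
        result.append(best if percent_similarity(s, best) >= threshold else s)
    return result


standards = ['California', 'New York']
cleaned = replace_similar(['Calefornia', 'Calfornia', 'Cali', 'New York City'], standards)
```

Under this assumed formula, 'Calefornia' and 'Calfornia' are one edit away from 'California' and get replaced, whereas 'Cali' and 'New York City' fall below 80% and are kept. This is why inspecting the similarities before replacing is useful.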

### **Searching for Regular Expression (RegEx) within a list of strings**

In [None]:
STRING_OR_LIST_OF_STRINGS = []
# string_or_list_of_strings: string or list of strings (inside quotes), 
# that will be analyzed. 
# e.g. string_or_list_of_strings = "column1" will analyze 'column1', whereas 
# string_or_list_of_strings = ['col1', 'col2'] will process both 'col1' and 'col2'.

REGEX_TO_SEARCH = r""
# REGEX_TO_SEARCH = r"" - string containing the regular expression (regex) that will be searched
# within each input string. Declare it with the r before quotes, indicating that the
# 'raw' string should be read. That is because the regex may contain special characters, such as \,
# which should not be read as escape characters.
# example of regex: r'st\d\s\w{3,10}'
# Use the regex helper to check: basic theory and most common metacharacters; regex quantifiers;
# regex anchoring and finding; regex greedy and non-greedy search; regex grouping and capturing;
# regex alternating and non-capturing groups; regex backreferences; and regex lookaround.

## ATTENTION: This function returns ONLY the capturing groups from the regex, i.e., portions of the
# regex explicitly marked with parentheses (check the regex helper for more details, including how
# to convert parentheses into non-capturing groups). If no groups are marked as capturing, the
# function will raise an error.

SHOW_REGEX_HELPER = False
# SHOW_REGEX_HELPER: set SHOW_REGEX_HELPER = True to show a helper guide to the construction of
# the regular expression. After finishing the helper, the original dataset itself will be returned
# and the function will not proceed. Use it if you do not know or are not sure how to write
# the regex.


# The list of strings will be stored in the object named list_of_strings:
# Simply modify this object on the left of equality:
list_of_strings = regex_search (string_or_list_of_strings = STRING_OR_LIST_OF_STRINGS, regex_to_search = REGEX_TO_SEARCH, show_regex_helper = SHOW_REGEX_HELPER)
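
Since the function returns only the capturing groups, it may help to see how parentheses change what Python's `re` module reports. The sketch below is an illustration with the standard library only and does not reproduce `regex_search`; the sample strings are made up, and the pattern is the example regex from the comments with parentheses added.

```python
import re

# Example regex from the comments, with two parts marked as capturing groups.
pattern = re.compile(r'(st\d)\s(\w{3,10})')

strings = ['unit st1 alpha', 'no match here', 'st9 gamma_ray']

captured = []
for s in strings:
    match = pattern.search(s)
    # groups() returns only what the parentheses captured, one item per group.
    captured.append(match.groups() if match else None)

# (?:...) groups without capturing, so only the digit is returned here.
non_capturing = re.compile(r'(?:st)(\d)')
only_digit = non_capturing.search('st1').groups()
```

Here `captured` is `[('st1', 'alpha'), None, ('st9', 'gamma_ray')]` and `only_digit` is `('1',)`.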

### **Replacing a Regular Expression (RegEx) within a list of strings**

In [None]:
STRING_OR_LIST_OF_STRINGS = []
# string_or_list_of_strings: string or list of strings (inside quotes), 
# that will be analyzed. 
# e.g. string_or_list_of_strings = "column1" will analyze 'column1', whereas 
# string_or_list_of_strings = ['col1', 'col2'] will process both 'col1' and 'col2'.

REGEX_TO_SEARCH = r""
# REGEX_TO_SEARCH = r"" - string containing the regular expression (regex) that will be searched
# within each input string. Declare it with the r before quotes, indicating that the
# 'raw' string should be read. That is because the regex may contain special characters, such as \,
# which should not be read as escape characters.
# example of regex: r'st\d\s\w{3,10}'
# Use the regex helper to check: basic theory and most common metacharacters; regex quantifiers;
# regex anchoring and finding; regex greedy and non-greedy search; regex grouping and capturing;
# regex alternating and non-capturing groups; regex backreferences; and regex lookaround.

STRING_FOR_REPLACEMENT = ""
# STRING_FOR_REPLACEMENT = "" - regular string that will replace the REGEX_TO_SEARCH: 
# whenever REGEX_TO_SEARCH is found in the string, it is replaced (substituted) by 
# STRING_FOR_REPLACEMENT. 
# Example STRING_FOR_REPLACEMENT = " " (whitespace).
# If STRING_FOR_REPLACEMENT = None, the empty string will be used for replacement.
        
## ATTENTION: This function processes a single regex per call.

SHOW_REGEX_HELPER = False
# SHOW_REGEX_HELPER: set SHOW_REGEX_HELPER = True to show a helper guide to the construction of
# the regular expression. After finishing the helper, the original dataset itself will be returned
# and the function will not proceed. Use it if you do not know or are not sure how to write
# the regex.


# The list of strings will be stored in the object named list_of_strings:
# Simply modify this object on the left of equality:
list_of_strings = regex_replacement (string_or_list_of_strings = STRING_OR_LIST_OF_STRINGS, regex_to_search = REGEX_TO_SEARCH, string_for_replacement = STRING_FOR_REPLACEMENT, show_regex_helper = SHOW_REGEX_HELPER)
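
The replacement step corresponds closely to `re.sub` from Python's standard library. The following is a minimal sketch under that assumption, not the body of `regex_replacement`; the helper `replace_regex` is hypothetical and only mirrors the documented rule that None is treated as the empty string.

```python
import re


def replace_regex(strings, regex_to_search, string_for_replacement=''):
    """Replace every match of the regex in each string (hypothetical helper)."""
    # As documented above, None is treated as the empty string.
    if string_for_replacement is None:
        string_for_replacement = ''
    return [re.sub(regex_to_search, string_for_replacement, s) for s in strings]


# Collapse any run of whitespace (spaces, tabs) into a single space.
single_spaced = replace_regex(['a   b\tc', 'x  y'], r'\s+', ' ')

# Remove all digits by replacing them with the empty string.
without_digits = replace_regex(['abc123', 'n0 d1g1ts'], r'\d', None)
```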

## **Exporting the dataframe as CSV file (to notebook's workspace)**

In [None]:
## WARNING: all files exported from this function are .csv (comma separated values)

DATAFRAME_OBJ_TO_BE_EXPORTED = dataset
# Alternatively: object containing the dataset to be exported.
# DATAFRAME_OBJ_TO_BE_EXPORTED: dataframe object that is going to be exported from the
# function. Since it is an object (not a string), it should not be declared in quotes.
# example: DATAFRAME_OBJ_TO_BE_EXPORTED = dataset will export the dataset object.
# ATTENTION: The dataframe object must be a Pandas dataframe.

FILE_DIRECTORY_PATH = ""
# FILE_DIRECTORY_PATH - (string, in quotes): input the path of the directory 
# (e.g. folder path) where the file will be saved. e.g. FILE_DIRECTORY_PATH = "" 
# or FILE_DIRECTORY_PATH = "folder"
# If you want to export the file to AWS S3, this parameter will have no effect.
# In this case, you can set FILE_DIRECTORY_PATH = None

NEW_FILE_NAME_WITHOUT_EXTENSION = "dataset"
# NEW_FILE_NAME_WITHOUT_EXTENSION - (string, in quotes): input the name of the 
# file without the extension. e.g. set NEW_FILE_NAME_WITHOUT_EXTENSION = "my_file" 
# to export the CSV file 'my_file.csv' to notebook's workspace.

export_pd_dataframe_as_csv (dataframe_obj_to_be_exported = DATAFRAME_OBJ_TO_BE_EXPORTED, new_file_name_without_extension = NEW_FILE_NAME_WITHOUT_EXTENSION, file_directory_path = FILE_DIRECTORY_PATH)

## **Downloading a file from Google Colab to the local machine; or uploading a file from the machine to Colab's instant memory**

#### Case 1: upload a file to Colab's workspace

In [None]:
ACTION = 'upload'
# ACTION = 'download' to download the file to the local machine
# ACTION = 'upload' to upload a file from local machine to Google Colab's 
# instant memory

FILE_TO_DOWNLOAD_FROM_COLAB = None
# FILE_TO_DOWNLOAD_FROM_COLAB = None. This parameter is mandatory when
# ACTION = 'download'. 
# Declare as FILE_TO_DOWNLOAD_FROM_COLAB the file that you want to download, with
# the corresponding extension.
# It should be declared as a string, in quotes.
# e.g. to download a dictionary named dict, FILE_TO_DOWNLOAD_FROM_COLAB = 'dict.pkl'
# To download a dataframe named df, declare FILE_TO_DOWNLOAD_FROM_COLAB = 'df.csv'
# To export a model named keras_model, declare FILE_TO_DOWNLOAD_FROM_COLAB = 'keras_model.h5'

# Dictionary storing the uploaded files returned as colab_files_dict.
# Simply modify this object on the left of the equality:
colab_files_dict = upload_to_or_download_file_from_colab (action = ACTION, file_to_download_from_colab = FILE_TO_DOWNLOAD_FROM_COLAB)

#### Case 2: download a file from Colab's workspace

In [None]:
ACTION = 'download'
# ACTION = 'download' to download the file to the local machine
# ACTION = 'upload' to upload a file from local machine to Google Colab's 
# instant memory

FILE_TO_DOWNLOAD_FROM_COLAB = None
# FILE_TO_DOWNLOAD_FROM_COLAB = None. This parameter is mandatory when
# ACTION = 'download'. 
# Declare as FILE_TO_DOWNLOAD_FROM_COLAB the file that you want to download, with
# the corresponding extension.
# It should be declared as a string, in quotes.
# e.g. to download a dictionary named dict, FILE_TO_DOWNLOAD_FROM_COLAB = 'dict.pkl'
# To download a dataframe named df, declare FILE_TO_DOWNLOAD_FROM_COLAB = 'df.csv'
# To export a model named keras_model, declare FILE_TO_DOWNLOAD_FROM_COLAB = 'keras_model.h5'

upload_to_or_download_file_from_colab (action = ACTION, file_to_download_from_colab = FILE_TO_DOWNLOAD_FROM_COLAB)

## **Exporting a list of files from notebook's workspace to AWS Simple Storage Service (S3)**

In [None]:
LIST_OF_FILE_NAMES_WITH_EXTENSIONS = ['s3_file1.txt', 's3_file2.txt']
# LIST_OF_FILE_NAMES_WITH_EXTENSIONS: list containing all the files to export to S3.
# Declare it as a list even if only a single file will be exported.
# It must be a list of strings containing the file names followed by the extensions.
# Example: to export a single file my_file.ext, where my_file is the name and ext is the
# extension:
# LIST_OF_FILE_NAMES_WITH_EXTENSIONS = ['my_file.ext']
# To export 3 files, file1.ext1, file2.ext2, and file3.ext3:
# LIST_OF_FILE_NAMES_WITH_EXTENSIONS = ['file1.ext1', 'file2.ext2', 'file3.ext3']
# Other examples:
# LIST_OF_FILE_NAMES_WITH_EXTENSIONS = ['Screen_Shot.png', 'dataset.csv']
# LIST_OF_FILE_NAMES_WITH_EXTENSIONS = ["dictionary.pkl", "model.h5"]
# LIST_OF_FILE_NAMES_WITH_EXTENSIONS = ['doc.pdf', 'model.dill']

DIRECTORY_OF_NOTEBOOK_WORKSPACE_STORING_FILES_TO_EXPORT = ''
# DIRECTORY_OF_NOTEBOOK_WORKSPACE_STORING_FILES_TO_EXPORT: directory from notebook's workspace
# from which the files will be exported to S3. Keep it None, or
# DIRECTORY_OF_NOTEBOOK_WORKSPACE_STORING_FILES_TO_EXPORT = "/"; or
# DIRECTORY_OF_NOTEBOOK_WORKSPACE_STORING_FILES_TO_EXPORT = '' (empty string) to export from
# the root (main) directory.
# Alternatively, set as a string containing only the directories and folders, not the file names.
# Examples: DIRECTORY_OF_NOTEBOOK_WORKSPACE_STORING_FILES_TO_EXPORT = 'folder1';
# DIRECTORY_OF_NOTEBOOK_WORKSPACE_STORING_FILES_TO_EXPORT = 'folder1/folder2/'
    
# For this function, all exported files must be located in the same directory.

S3_BUCKET_NAME = 'my_bucket'
## This parameter is mandatory to access an AWS S3 bucket. Replace it with a string
# containing the bucket's name. e.g. S3_BUCKET_NAME = "aws-bucket-1" accesses a bucket named
# "aws-bucket-1"

S3_OBJECT_FOLDER_PREFIX = ""
# S3_OBJECT_FOLDER_PREFIX = None. Keep it None; or as an empty string 
# (S3_OBJECT_FOLDER_PREFIX = ''); or as the root "/" to import the 
# whole bucket content, instead of a single object from it.
# Alternatively, set it as a string containing the subfolder from the bucket to import:
# Suppose that your bucket (admin-created) has four objects with the following object 
# keys: Development/Projects.xlsx; Finance/statement1.pdf; Private/taxdocument.pdf; and
# s3-dg.pdf. 
# The s3-dg.pdf key does not have a prefix, so its object appears directly 
# at the root level of the bucket. If you open the Development/ folder, you see 
# the Projects.xlsx object in it.
# In summary, if the path of the file is: 'bucket/my_path/.../file.csv'
# where 'bucket' is the bucket's name, prefix = 'my_path/.../', without the
# 'file.csv' (file name with extension) last part.

# So, declare the prefix as S3_OBJECT_FOLDER_PREFIX to import only files from
# a given folder (directory) of the bucket.
# DO NOT PUT A SLASH before (to the left of) the prefix;
# DO NOT ADD THE BUCKET'S NAME before (to the left of) the prefix:
# S3_OBJECT_FOLDER_PREFIX = "bucket_directory1/.../bucket_directoryN/"

# Alternatively, provide the full path of a given file if you want to import only it:
# S3_OBJECT_FOLDER_PREFIX = "bucket_directory1/.../bucket_directoryN/my_file.ext"
# where my_file is the file's name, and ext is its extension.


# Attention: after running this function to connect to AWS Simple Storage Service (S3), 
# your 'AWS Access key ID' and your 'Secret access key' will be requested.
# The 'Secret access key' will be hidden through dots, so it cannot be visualized or copied by
# other users. On the other hand, the same is not true for 'Access key ID', the bucket's name 
# and the prefix. All of these are sensitive information from the organization.
# Therefore, after importing the information, always remember to clear the output of this cell
# and to remove such information from the strings.
# Remember that these credentials may grant access to protected information, 
# so they should not be used by non-authorized people.

# Also, remember to delete the imported files from the workspace after finishing the analysis.
# Storing files in S3 costs considerably less than storing them directly in the
# workspace. Also, files stored in S3 may be accessed by users other than those with access to
# the notebook's workspace.
export_files_to_s3 (list_of_file_names_with_extensions = LIST_OF_FILE_NAMES_WITH_EXTENSIONS, directory_of_notebook_workspace_storing_files_to_export = DIRECTORY_OF_NOTEBOOK_WORKSPACE_STORING_FILES_TO_EXPORT, s3_bucket_name = S3_BUCKET_NAME, s3_obj_prefix = S3_OBJECT_FOLDER_PREFIX)
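
The relation between the prefix and the final object key ('bucket/my_path/.../file.csv') can be illustrated with simple string handling. This sketch is an assumption about how a prefix and a file name combine into a key; the helper `build_s3_key` is hypothetical, it is not part of `export_files_to_s3`, and it does not connect to AWS.

```python
import posixpath


def build_s3_key(prefix, file_name):
    """Combine a folder prefix and a file name into an object key (hypothetical helper)."""
    # None, '' and '/' all mean the root level of the bucket.
    prefix = (prefix or '').strip('/')
    # S3 keys always use forward slashes, so posixpath is used instead of os.path.
    return posixpath.join(prefix, file_name) if prefix else file_name


root_key = build_s3_key('', 's3_file1.txt')                      # 's3_file1.txt'
nested_key = build_s3_key('folder1/folder2/', 'dataset.csv')     # 'folder1/folder2/dataset.csv'
```

Note that the bucket's name is not part of the key, which is why it should not be added to the prefix.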

****