# **Dataset Transformation**

## _ETL Workflow Notebook 3_

## Content:
1. Removing trailing or leading white spaces or characters (trim) from string variables, and modifying the variable type;
2. Capitalizing or lowering case of string variables (string homogenizing);
3. Substituting (replacing) substrings on string variables;
4. Substituting (replacing or switching) whole strings by different text values (on string variables);
5. Replacing strings with Machine Learning: finding similar strings and replacing them by standard strings;
6. Transforming the dataset and reverse transforms: log-transform; 
7. Exponential transform; 
8. Box-Cox transform; 
9. One-Hot Encoding;
10. Ordinal Encoding;
11. Feature scaling; 
12. Importing or exporting models and dictionaries.

Marco Cesar Prado Soares, Data Scientist Specialist - Bayer Crop Science LATAM
- marcosoares.feq@gmail.com
- marco.soares@bayer.com

## **Load Python Libraries in Global Context**

In [None]:
import numpy as np
import pandas as pd
import idsw
from idsw import etl

## **Call the functions**

### **Mounting Google Drive or S3 (AWS Simple Storage Service) bucket**

In [None]:
SOURCE = 'aws'
# SOURCE = 'google' for mounting the google drive;
# SOURCE = 'aws' for accessing an AWS S3 bucket

## THE FOLLOWING PARAMETERS HAVE EFFECT ONLY WHEN SOURCE == 'aws':

PATH_TO_STORE_IMPORTED_S3_BUCKET = ''
# PATH_TO_STORE_IMPORTED_S3_BUCKET: path of the Python environment to which the
# S3 bucket contents will be imported. If it is None; or if it is an empty string; or if 
# PATH_TO_STORE_IMPORTED_S3_BUCKET = '/', bucket will be imported to the root path. 
# Alternatively, input the path as a string (in quotes). e.g. 
# PATH_TO_STORE_IMPORTED_S3_BUCKET = 'copied_s3_bucket'

S3_BUCKET_NAME = 'my_bucket'
## This parameter is mandatory for accessing an AWS S3 bucket. Replace it with a string
# containing the bucket's name. e.g. S3_BUCKET_NAME = "aws-bucket-1" accesses a bucket named
# "aws-bucket-1"

S3_OBJECT_FOLDER_PREFIX = ""
# S3_OBJECT_FOLDER_PREFIX = None. Keep it None; or as an empty string 
# (S3_OBJECT_FOLDER_PREFIX = ''); or as the root "/" to import the 
# whole bucket content, instead of a single object from it.
# Alternatively, set it as a string containing the subfolder from the bucket to import:
# Suppose that your bucket (admin-created) has four objects with the following object 
# keys: Development/Projects.xlsx; Finance/statement1.pdf; Private/taxdocument.pdf; and
# s3-dg.pdf. 
# The s3-dg.pdf key does not have a prefix, so its object appears directly 
# at the root level of the bucket. If you open the Development/ folder, you see 
# the Projects.xlsx object in it.
# In summary, if the path of the file is: 'bucket/my_path/.../file.csv'
# where 'bucket' is the bucket's name, prefix = 'my_path/.../', without the
# 'file.csv' (file name with extension) last part.

# So, declare the prefix as S3_OBJECT_FOLDER_PREFIX to import only files from
# a given folder (directory) of the bucket.
# DO NOT PUT A SLASH before (to the left of) the prefix;
# DO NOT ADD THE BUCKET'S NAME to the left of the prefix:
# S3_OBJECT_FOLDER_PREFIX = "bucket_directory1/.../bucket_directoryN/"

# Alternatively, provide the full path of a given file if you want to import only it:
# S3_OBJECT_FOLDER_PREFIX = "bucket_directory1/.../bucket_directoryN/my_file.ext"
# where my_file is the file's name, and ext is its extension.


# Attention: after running this function to fetch data from AWS Simple Storage Service (S3), 
# your 'AWS Access key ID' and your 'Secret access key' will be requested.
# The 'Secret access key' will be masked with dots, so it cannot be viewed or copied by
# other users. The same is not true for the 'Access key ID', the bucket's name 
# and the prefix. All of these are sensitive information from the organization.
# Therefore, after importing the data, always remember to clear the output of this cell
# and to remove such information from the strings.
# Remember that these credentials may grant access to protected information, 
# so they should not be used by unauthorized people.

# Also, remember to delete the imported files from the workspace after finishing the analysis.
# Storing files in S3 costs considerably less than storing them directly in the
# workspace. Also, files stored in S3 may be accessed by users other than those with access to
# the notebook's workspace.
idsw.mount_storage_system (source = SOURCE, path_to_store_imported_s3_bucket = PATH_TO_STORE_IMPORTED_S3_BUCKET, s3_bucket_name = S3_BUCKET_NAME, s3_obj_prefix = S3_OBJECT_FOLDER_PREFIX)
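
The prefix rules described in the comments above can be checked with a small helper before calling the mount function. This is only an illustrative sketch: `split_s3_path` is a hypothetical helper (not part of `idsw`), and it assumes the path is written as `'bucket/dir1/.../file.ext'` with at least a bucket name and a file name.

```python
def split_s3_path(full_path):
    """Split 'bucket/dir1/.../file.ext' into (bucket, prefix, file_name).

    Hypothetical helper for illustration only; it is not part of idsw.
    """
    parts = full_path.strip("/").split("/")
    bucket, file_name = parts[0], parts[-1]
    folders = parts[1:-1]
    # The prefix carries no leading slash and no bucket name, and ends with '/'.
    prefix = "/".join(folders) + "/" if folders else ""
    return bucket, prefix, file_name


bucket, prefix, file_name = split_s3_path("my_bucket/Development/Projects.xlsx")
print(bucket, prefix, file_name)
```

An object stored at the root of the bucket, such as `s3-dg.pdf`, yields an empty prefix, which matches the "import the whole bucket" setting described above.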

### **Importing the dataset**

In [None]:
## WARNING: Use this function to load dataframes stored on Excel (xls, xlsx, xlsm, xlsb, odf, ods and odt), 
## JSON, txt, or CSV (comma separated values) files.

FILE_DIRECTORY_PATH = ""
# FILE_DIRECTORY_PATH - (string, in quotes): input the path of the directory 
# (e.g. folder path) where the file is stored. e.g. FILE_DIRECTORY_PATH = "" 
# or FILE_DIRECTORY_PATH = "folder"

FILE_NAME_WITH_EXTENSION = "dataset.csv"
# FILE_NAME_WITH_EXTENSION - (string, in quotes): input the name of the file with the 
# extension. e.g. FILE_NAME_WITH_EXTENSION = "file.xlsx", or, 
# FILE_NAME_WITH_EXTENSION = "file.csv", "file.txt", or "file.json"
# Again, the extensions may be: xls, xlsx, xlsm, xlsb, odf, ods, odt, json, txt or csv.

LOAD_TXT_FILE_WITH_JSON_FORMAT = False
# LOAD_TXT_FILE_WITH_JSON_FORMAT = False. Set LOAD_TXT_FILE_WITH_JSON_FORMAT = True 
# if you want to read a file with txt extension containing a text formatted as JSON 
# (but not saved as JSON).
# WARNING: if LOAD_TXT_FILE_WITH_JSON_FORMAT = True, all the JSON file parameters of the 
# function (below) must be set. If not, an error message will be raised.

HOW_MISSING_VALUES_ARE_REGISTERED = None
# HOW_MISSING_VALUES_ARE_REGISTERED = None: keep it None if missing values are registered as None,
# empty or np.nan. Pandas automatically converts None to NumPy np.nan objects (floats).
# This parameter manipulates the argument na_values (default: None) from Pandas functions.
# By default the following values are interpreted as NaN: ‘’, ‘#N/A’, ‘#N/A N/A’, ‘#NA’, ‘-1.#IND’, 
#‘-1.#QNAN’, ‘-NaN’, ‘-nan’, ‘1.#IND’, ‘1.#QNAN’, ‘<NA>’, ‘N/A’, ‘NA’, ‘NULL’, ‘NaN’, 
# ‘n/a’, ‘nan’, ‘null’.

# If a different denomination is used, indicate it as a string. e.g.
# HOW_MISSING_VALUES_ARE_REGISTERED = '.' will convert all strings '.' to missing values;
# HOW_MISSING_VALUES_ARE_REGISTERED = 0 will convert zeros to missing values.

# If dict passed, specific per-column NA values. For example, if zero is the missing value
# only in column 'numeric_col', you can specify the following dictionary:
# HOW_MISSING_VALUES_ARE_REGISTERED = {'numeric_col': 0}

    
HAS_HEADER = True
# HAS_HEADER = True if the imported table has a header (row with column names).
# Alternatively, HAS_HEADER = False if the dataframe does not have header.

DECIMAL_SEPARATOR = '.'
# DECIMAL_SEPARATOR = '.' - String. Keep it '.' or None to use the period ('.') as
# the decimal separator. Alternatively, specify here the separator.
# e.g. DECIMAL_SEPARATOR = ',' will set the comma as the separator.
# It manipulates the argument 'decimal' from Pandas functions.

TXT_CSV_COL_SEP = "comma"
# txt_csv_col_sep = "comma" - This parameter has effect only when the file is a 'txt'
# or 'csv'. It informs how the different columns are separated.
# Alternatively, txt_csv_col_sep = "comma", or txt_csv_col_sep = "," 
# for columns separated by comma;
# txt_csv_col_sep = "whitespace", or txt_csv_col_sep = " " 
# for columns separated by simple spaces.
# You can also set a specific separator as string. For example:
# txt_csv_col_sep = '\s+'; or txt_csv_col_sep = '\t' (in this last example, the tabulation
# is used as separator for the columns - '\t' represents the tab character).

## Parameters for loading Excel files:

LOAD_ALL_SHEETS_AT_ONCE = False
# LOAD_ALL_SHEETS_AT_ONCE = False - This parameter has effect only for Excel files.
# If LOAD_ALL_SHEETS_AT_ONCE = True, the function will return a list of dictionaries, each
# dictionary containing 2 key-value pairs: the first key will be 'sheet', and its
# value will be the name (or number) of the table (sheet). The second key will be 'df',
# and its value will be the pandas dataframe object obtained from that sheet.
# This argument has preference over SHEET_TO_LOAD. If it is True, all sheets will be loaded.
    
SHEET_TO_LOAD = None
# SHEET_TO_LOAD - This parameter has effect only for Excel files.
# Keep SHEET_TO_LOAD = None not to specify a sheet of the file, so that the first sheet
# will be loaded.
# SHEET_TO_LOAD may be an integer or a string (inside quotes). SHEET_TO_LOAD = 0
# loads the first sheet (sheet with index 0); SHEET_TO_LOAD = 1 loads the second sheet
# of the file (index 1); SHEET_TO_LOAD = "Sheet1" loads a sheet named as "Sheet1".
# Declare a number to load the sheet with that index, starting from 0; or declare a
# name to load the sheet with that name.

## Parameters for loading JSON files:

JSON_RECORD_PATH = None
# JSON_RECORD_PATH (string): manipulate parameter 'record_path' from json_normalize method.
# Path in each object to list of records. If not passed, data will be assumed to 
# be an array of records. If a given field from the JSON stores a nested JSON (or a nested
# dictionary) declare it here to decompose the content of the nested data. e.g. if the field
# 'books' stores a nested JSON, declare, JSON_RECORD_PATH = 'books'

JSON_FIELD_SEPARATOR = "_"
# JSON_FIELD_SEPARATOR = "_" (string). Manipulates the parameter 'sep' from json_normalize method.
# Nested records will generate names separated by sep. 
# e.g., for JSON_FIELD_SEPARATOR = ".", {'foo': {'bar': 0}} -> foo.bar.
# Then, if a given field 'main_field' stores a nested JSON with fields 'field1', 'field2', ...
# the name of the columns of the dataframe will be formed by concatenating 'main_field', the
# separator, and the names of the nested fields. With the default "_" separator:
# 'main_field_field1', 'main_field_field2',...

JSON_METADATA_PREFIX_LIST = None
# JSON_METADATA_PREFIX_LIST: list of strings (in quotes). Manipulates the parameter 
# 'meta' from json_normalize method. Fields to use as metadata for each record in resulting 
# table. Declare here the non-nested fields, i.e., the fields in the principal JSON. They
# will be repeated in the rows of the dataframe to give the metadata (context) of the rows.

# e.g. Suppose a JSON with the following structure: {'name': 'Mary', 'last': 'Shelley',
# 'books': [{'title': 'Frankenstein', 'year': 1818}, {'title': 'Mathilda', 'year': 1819},{'title': 'The Last Man', 'year': 1826}]},
# Here, there are nested JSONs in the field 'books'. The fields that are not nested
# are 'name' and 'last'.
# Then, JSON_RECORD_PATH = 'books'
# JSON_METADATA_PREFIX_LIST = ['name', 'last']


# The dataframe will be stored in the object named 'dataset':
# Simply modify this object on the left of equality:
dataset = idsw.load_pandas_dataframe (file_directory_path = FILE_DIRECTORY_PATH, file_name_with_extension = FILE_NAME_WITH_EXTENSION, load_txt_file_with_json_format = LOAD_TXT_FILE_WITH_JSON_FORMAT, how_missing_values_are_registered = HOW_MISSING_VALUES_ARE_REGISTERED, has_header = HAS_HEADER, decimal_separator = DECIMAL_SEPARATOR, txt_csv_col_sep = TXT_CSV_COL_SEP, load_all_sheets_at_once = LOAD_ALL_SHEETS_AT_ONCE, sheet_to_load = SHEET_TO_LOAD, json_record_path = JSON_RECORD_PATH, json_field_separator = JSON_FIELD_SEPARATOR, json_metadata_prefix_list = JSON_METADATA_PREFIX_LIST)

# OBS: If an Excel file is loaded and LOAD_ALL_SHEETS_AT_ONCE = True, then the object
# dataset will be a list of dictionaries, with the key 'sheet' storing the sheet name and the key
# 'df' storing the corresponding Pandas dataframe. So, to access the 3rd dataframe (index 2, since
# indexing starts from zero): df = dataset[2]['df'], where dataset is the list returned.
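
The comments above state that the missing-value and decimal-separator settings are forwarded to the Pandas arguments `na_values` and `decimal`. The sketch below shows the effect of those arguments directly in `pd.read_csv`, using a small in-memory file. The sample data is made up, and mapping the column separator to `sep` is an assumption about how `idsw` forwards it.

```python
import io

import pandas as pd

# Made-up sample: ';' separates columns, ',' is the decimal mark,
# and '.' is how missing values were registered.
raw = "id;value;status\n1;3,5;ok\n2;.;ok\n3;7,25;.\n"

df = pd.read_csv(
    io.StringIO(raw),
    sep=";",        # column separator
    decimal=",",    # decimal separator
    na_values=".",  # extra string to interpret as a missing value
    header=0,       # first row holds the column names
)
print(df)
print(df.dtypes)
```

With these settings the `value` column is parsed as a float column with one missing entry, instead of being kept as text.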

### **Converting JSON object to dataframe**

In [None]:
# JSON object in terms of Python structure: list of dictionaries, where each value of a
# dictionary may be a dictionary or a list of dictionaries (nested structures).
# example of highly nested structure saved as a list 'json_formatted_list'. Note that the same
# structure could be declared and stored into a string variable. For instance, if you have a txt
# file containing JSON, you could read the txt and save its content as a string.
# json_formatted_list = [{'field1': val1, 'field2': {'dict_val': dict_val}, 'field3': [{
# 'nest1': nest_val1}, {'nest2': nestval2}]}, {'field1': val1, 'field2': {'dict_val': dict_val}, 
# 'field3': [{'nest1': nest_val1}, {'nest2': nestval2}]}]

JSON_OBJ_TO_CONVERT = json_object #Alternatively: object containing the JSON to be converted

# JSON_OBJ_TO_CONVERT: object containing JSON, or string with JSON content to parse.
# Objects may be: string with JSON formatted text;
# list with nested dictionaries (JSON formatted);
# dictionaries, possibly with nested dictionaries (JSON formatted).

JSON_OBJ_TYPE = 'list'
# JSON_OBJ_TYPE = 'list', in case the object was saved as a list of dictionaries (JSON format)
# JSON_OBJ_TYPE = 'string', in case it was saved as a string (text) containing JSON.

## Parameters for loading JSON files:

JSON_RECORD_PATH = None
# JSON_RECORD_PATH (string): manipulate parameter 'record_path' from json_normalize method.
# Path in each object to list of records. If not passed, data will be assumed to 
# be an array of records. If a given field from the JSON stores a nested JSON (or a nested
# dictionary) declare it here to decompose the content of the nested data. e.g. if the field
# 'books' stores a nested JSON, declare, JSON_RECORD_PATH = 'books'

JSON_FIELD_SEPARATOR = "_"
# JSON_FIELD_SEPARATOR = "_" (string). Manipulates the parameter 'sep' from json_normalize method.
# Nested records will generate names separated by sep. 
# e.g., for JSON_FIELD_SEPARATOR = ".", {'foo': {'bar': 0}} -> foo.bar.
# Then, if a given field 'main_field' stores a nested JSON with fields 'field1', 'field2', ...
# the name of the columns of the dataframe will be formed by concatenating 'main_field', the
# separator, and the names of the nested fields. With the default "_" separator:
# 'main_field_field1', 'main_field_field2',...

JSON_METADATA_PREFIX_LIST = None
# JSON_METADATA_PREFIX_LIST: list of strings (in quotes). Manipulates the parameter 
# 'meta' from json_normalize method. Fields to use as metadata for each record in resulting 
# table. Declare here the non-nested fields, i.e., the fields in the principal JSON. They
# will be repeated in the rows of the dataframe to give the metadata (context) of the rows.

# e.g. Suppose a JSON with the following structure: [{'name': 'Mary', 'last': 'Shelley',
# 'books': [{'title': 'Frankenstein', 'year': 1818}, {'title': 'Mathilda', 'year': 1819},{'title': 'The Last Man', 'year': 1826}]}]
# Here, there are nested JSONs in the field 'books'. The fields that are not nested
# are 'name' and 'last'.
# Then, JSON_RECORD_PATH = 'books'
# JSON_METADATA_PREFIX_LIST = ['name', 'last']


# The dataframe will be stored in the object named 'dataset':
# Simply modify this object on the left of equality:
dataset = idsw.json_obj_to_pandas_dataframe (json_obj_to_convert = JSON_OBJ_TO_CONVERT, json_obj_type = JSON_OBJ_TYPE, json_record_path = JSON_RECORD_PATH, json_field_separator = JSON_FIELD_SEPARATOR, json_metadata_prefix_list = JSON_METADATA_PREFIX_LIST)
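
Since the comments above describe these parameters in terms of `json_normalize`, the Mary Shelley example can be reproduced with `pd.json_normalize` alone. This is a sketch of the underlying Pandas call, not of the `idsw` function itself.

```python
import pandas as pd

json_formatted_list = [
    {
        "name": "Mary",
        "last": "Shelley",
        "books": [
            {"title": "Frankenstein", "year": 1818},
            {"title": "Mathilda", "year": 1819},
            {"title": "The Last Man", "year": 1826},
        ],
    }
]

# record_path expands the nested list into one row per book;
# meta repeats the non-nested fields on every row.
books_df = pd.json_normalize(
    json_formatted_list,
    record_path="books",
    meta=["name", "last"],
    sep="_",
)
print(books_df)
```

Each book becomes one row, and the author's `name` and `last` are repeated on every row as context.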

### **Removing trailing or leading white spaces or characters (trim) from string variables, and modifying the variable type**

In [None]:
DATASET = dataset #Alternatively: object containing the dataset to be analyzed

COLUMN_TO_ANALYZE = 'column_to_analyze'
# COLUMN_TO_ANALYZE: string (inside quotes), 
# containing the name of the column that will be analyzed. 
# e.g. COLUMN_TO_ANALYZE = "column1" will analyze the column named as 'column1'.

NEW_VARIABLE_TYPE = None
# NEW_VARIABLE_TYPE = None. String (in quotes) that represents a given data type for the column
# after transformation. Set:
# - NEW_VARIABLE_TYPE = 'int' to convert the column to integer type after the transform;
# - NEW_VARIABLE_TYPE = 'float' to convert the column to float (decimal number);
# - NEW_VARIABLE_TYPE = 'datetime' to convert it to date or timestamp;
# - NEW_VARIABLE_TYPE = 'category' to convert it to Pandas categorical variable.
    
METHOD = 'trim'
# METHOD = 'trim' will eliminate trailing and leading white spaces from the strings in
# COLUMN_TO_ANALYZE.
# METHOD = 'substring' will eliminate a defined trailing and leading substring from
# COLUMN_TO_ANALYZE.

SUBSTRING_TO_ELIMINATE = None
# SUBSTRING_TO_ELIMINATE = None. Set as a string (in quotes) if METHOD = 'substring'.
# e.g. suppose COLUMN_TO_ANALYZE contains time information: each string ends in " min":
# "1 min", "2 min", "3 min", etc. If SUBSTRING_TO_ELIMINATE = " min", this portion will be
# eliminated, resulting in: "1", "2", "3", etc. If NEW_VARIABLE_TYPE = None, these values will
# continue to be strings. By setting NEW_VARIABLE_TYPE = 'int' or 'float', the series will be
# converted to a numeric type.

CREATE_NEW_COLUMN = True
# CREATE_NEW_COLUMN = True
# Set CREATE_NEW_COLUMN = True to store the transformed data into a new
# column. Alternatively, set CREATE_NEW_COLUMN = False to overwrite the existing column.
NEW_COLUMN_SUFFIX = '_trim'
# NEW_COLUMN_SUFFIX = "_trim"
# This value has effect only if CREATE_NEW_COLUMN = True.
# The new column name will be set as column + NEW_COLUMN_SUFFIX. Then, if the original
# column was "column1" and the suffix is "_trim", the new column will be named as
# "column1_trim".
# Alternatively, input inside quotes a string with the desired suffix. Recommendation:
# start the suffix with "_" to separate it from the original name.
    

# The dataframe will be stored in the object named transf_dataset:
# Simply modify this object on the left of equality:
transf_dataset = etl.trim_spaces_or_characters (df = DATASET, column_to_analyze = COLUMN_TO_ANALYZE, new_variable_type = NEW_VARIABLE_TYPE, method = METHOD, substring_to_eliminate = SUBSTRING_TO_ELIMINATE, create_new_column = CREATE_NEW_COLUMN, new_column_suffix = NEW_COLUMN_SUFFIX)
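
As a rough plain-Pandas equivalent of the " min" example described above (a sketch under the assumption that the substring appears at the end of each string; the `idsw` implementation may differ), the trimming, substring removal, and type conversion can be chained as follows. The column name `duration` is made up for the illustration.

```python
import pandas as pd

df = pd.DataFrame({"duration": [" 1 min", "2 min ", "  3 min"]})

# 1) remove leading and trailing white spaces;
# 2) remove the trailing ' min' substring;
# 3) convert the remaining text to integers.
trimmed = df["duration"].str.strip()
df["duration_trim"] = trimmed.str.replace(r" min$", "", regex=True).astype(int)
print(df)
```

Skipping the final `astype(int)` corresponds to leaving NEW_VARIABLE_TYPE = None: the values stay as the strings "1", "2", "3".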

### **Capitalizing or lowering case of string variables (string homogenizing)**

In [None]:
DATASET = dataset #Alternatively: object containing the dataset to be analyzed

COLUMN_TO_ANALYZE = 'column_to_analyze'
# COLUMN_TO_ANALYZE: string (inside quotes), 
# containing the name of the column that will be analyzed. 
# e.g. COLUMN_TO_ANALYZE = "column1" will analyze the column named as 'column1'.

METHOD = 'lowercase'
# METHOD = 'capitalize' will capitalize all letters from the input string 
# (turn them to upper case).
# METHOD = 'lowercase' will make the opposite: turn all letters to lower case.
# e.g. suppose COLUMN_TO_ANALYZE contains strings such as 'String One', 'STRING 2',  and
# 'string3'. If METHOD = 'capitalize', the output will contain the strings: 
# 'STRING ONE', 'STRING 2', 'STRING3'. If METHOD = 'lowercase', the outputs will be:
# 'string one', 'string 2', 'string3'.

CREATE_NEW_COLUMN = True
# CREATE_NEW_COLUMN = True
# Set CREATE_NEW_COLUMN = True to store the transformed data into a new
# column. Alternatively, set CREATE_NEW_COLUMN = False to overwrite the existing column.
NEW_COLUMN_SUFFIX = '_homogenized'
# NEW_COLUMN_SUFFIX = "_homogenized"
# This value has effect only if CREATE_NEW_COLUMN = True.
# The new column name will be set as column + NEW_COLUMN_SUFFIX. Then, if the original
# column was "column1" and the suffix is "_homogenized", the new column will be named as
# "column1_homogenized".
# Alternatively, input inside quotes a string with the desired suffix. Recommendation:
# start the suffix with "_" to separate it from the original name.
    
    
# The dataframe will be stored in the object named transf_dataset:
# Simply modify this object on the left of equality:
transf_dataset = etl.capitalize_or_lower_string_case (df = DATASET, column_to_analyze = COLUMN_TO_ANALYZE, method = METHOD, create_new_column = CREATE_NEW_COLUMN, new_column_suffix = NEW_COLUMN_SUFFIX)
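
The example strings from the comments above can be homogenized with the Pandas string accessor. This sketch only illustrates the two outcomes described for METHOD; it does not reproduce the `idsw` function. The `_upper` suffix is made up for the illustration.

```python
import pandas as pd

df = pd.DataFrame({"column1": ["String One", "STRING 2", "string3"]})

# Equivalent of METHOD = 'capitalize' as described above: all letters in upper case.
df["column1_upper"] = df["column1"].str.upper()
# Equivalent of METHOD = 'lowercase': all letters in lower case.
df["column1_homogenized"] = df["column1"].str.lower()
print(df)
```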

### **Substituting (replacing) substrings on string variables**

In [None]:
DATASET = dataset #Alternatively: object containing the dataset to be analyzed

COLUMN_TO_ANALYZE = 'column_to_analyze'
# COLUMN_TO_ANALYZE: string (inside quotes), 
# containing the name of the column that will be analyzed. 
# e.g. COLUMN_TO_ANALYZE = "column1" will analyze the column named as 'column1'.

SUBSTRING_TO_BE_REPLACED = None
NEW_SUBSTRING_FOR_REPLACEMENT = ''
# SUBSTRING_TO_BE_REPLACED = None; NEW_SUBSTRING_FOR_REPLACEMENT = ''. 
# Strings (in quotes): when the sequence of characters SUBSTRING_TO_BE_REPLACED is
# found in the strings from COLUMN_TO_ANALYZE, it will be substituted by the substring
# NEW_SUBSTRING_FOR_REPLACEMENT. If None is provided to one of these substring arguments,
# it will be substituted by the empty string: ''
# e.g. suppose COLUMN_TO_ANALYZE contains the following strings, with a spelling error:
# "my collumn 1", 'his collumn 2', 'her column 3'. We may correct this error by setting:
# SUBSTRING_TO_BE_REPLACED = 'collumn' and NEW_SUBSTRING_FOR_REPLACEMENT = 'column'. The
# function will search for the wrong group of characters and, if it finds it, will substitute
# by the correct sequence: "my column 1", 'his column 2', 'her column 3'.

CREATE_NEW_COLUMN = True
# CREATE_NEW_COLUMN = True
# Set CREATE_NEW_COLUMN = True to store the transformed data into a new
# column. Alternatively, set CREATE_NEW_COLUMN = False to overwrite the existing column.
NEW_COLUMN_SUFFIX = '_substringReplaced'
# NEW_COLUMN_SUFFIX = "_substringReplaced"
# This value has effect only if CREATE_NEW_COLUMN = True.
# The new column name will be set as column + NEW_COLUMN_SUFFIX. Then, if the original
# column was "column1" and the suffix is "_substringReplaced", the new column will be named as
# "column1_substringReplaced".
# Alternatively, input inside quotes a string with the desired suffix. Recommendation:
# start the suffix with "_" to separate it from the original name.


# The dataframe will be stored in the object named transf_dataset:
# Simply modify this object on the left of equality:
transf_dataset = etl.replace_substring (df = DATASET, column_to_analyze = COLUMN_TO_ANALYZE, substring_to_be_replaced = SUBSTRING_TO_BE_REPLACED, new_substring_for_replacement = NEW_SUBSTRING_FOR_REPLACEMENT, create_new_column = CREATE_NEW_COLUMN, new_column_suffix = NEW_COLUMN_SUFFIX)
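
The spelling-correction example from the comments above corresponds to a literal substring replacement. A minimal plain-Pandas sketch (not the `idsw` implementation) is:

```python
import pandas as pd

df = pd.DataFrame({"column1": ["my collumn 1", "his collumn 2", "her column 3"]})

# regex=False treats 'collumn' as literal text rather than a regular expression.
df["column1_substringReplaced"] = df["column1"].str.replace(
    "collumn", "column", regex=False
)
print(df)
```

Strings that do not contain the substring, such as "her column 3", are left unchanged.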

### **Substituting (replacing or switching) whole strings by different text values (on string variables)**

In [None]:
DATASET = dataset #Alternatively: object containing the dataset to be analyzed

COLUMN_TO_ANALYZE = 'column_to_analyze'
# COLUMN_TO_ANALYZE: string (inside quotes), 
# containing the name of the column that will be analyzed. 
# e.g. COLUMN_TO_ANALYZE = "column1" will analyze the column named as 'column1'.

LIST_OF_DICTIONARIES_WITH_ORIGINAL_STRINGS_AND_REPLACEMENTS = [
    
    {'original_string': None, 'new_string': None}, 
    {'original_string': None, 'new_string': None}, 
    {'original_string': None, 'new_string': None}, 
    {'original_string': None, 'new_string': None}, 
    {'original_string': None, 'new_string': None}, 
    {'original_string': None, 'new_string': None}, 
    {'original_string': None, 'new_string': None}, 
    {'original_string': None, 'new_string': None}, 
    {'original_string': None, 'new_string': None}, 
    {'original_string': None, 'new_string': None}, 
    {'original_string': None, 'new_string': None}
    
]
# LIST_OF_DICTIONARIES_WITH_ORIGINAL_STRINGS_AND_REPLACEMENTS = 
# [{'original_string': None, 'new_string': None}]
# This is a list of dictionaries, where each dictionary contains two key-value pairs:
# the first one contains the original string; and the second one contains the new string
# that will substitute the original one. The function will loop through all dictionaries in
# this list, access the values of the keys 'original_string', and search these values on the strings
# in COLUMN_TO_ANALYZE. When the value is found, it will be replaced (switched) by the correspondent
# value in key 'new_string'.
    
# The object LIST_OF_DICTIONARIES_WITH_ORIGINAL_STRINGS_AND_REPLACEMENTS must be declared as a list, 
# in brackets, even if there is a single dictionary.
# Always use the same keys: 'original_string' for the original strings to search for in the column 
# COLUMN_TO_ANALYZE; and 'new_string', for the strings that will replace the original ones.
# Notice that this function will not search for substrings: it will substitute a value only when
# there is perfect correspondence between the string in 'column_to_analyze' and 'original_string'.
# So, the cases (upper or lower) must be the same.
    
# If you want, you can remove elements (dictionaries) from the list to declare fewer elements;
# and you can also add more elements (dictionaries) to the lists, if you need to replace more
# values.
# Simply put a comma after the last element from the list and declare a new dictionary, keeping the
# same keys: {'original_string': original_str, 'new_string': new_str}, 
# where original_str and new_str represent the strings for searching and replacement 
# (If one of the keys contains None, the new dictionary will be ignored).
    
# Example:
# Suppose the COLUMN_TO_ANALYZE contains the values 'sunday', 'monday', 'tuesday', 'wednesday',
# 'thursday', 'friday', 'saturday', but you want to obtain data labelled as 'weekend' or 'weekday'.
# Set: LIST_OF_DICTIONARIES_WITH_ORIGINAL_STRINGS_AND_REPLACEMENTS = 
# [{'original_string': 'sunday', 'new_string': 'weekend'},
# {'original_string': 'saturday', 'new_string': 'weekend'},
# {'original_string': 'monday', 'new_string': 'weekday'},
# {'original_string': 'tuesday', 'new_string': 'weekday'},
# {'original_string': 'wednesday', 'new_string': 'weekday'},
# {'original_string': 'thursday', 'new_string': 'weekday'},
# {'original_string': 'friday', 'new_string': 'weekday'}]

CREATE_NEW_COLUMN = True
# CREATE_NEW_COLUMN = True
# Set CREATE_NEW_COLUMN = True to store the transformed data into a new
# column. Alternatively, set CREATE_NEW_COLUMN = False to overwrite the existing column.
NEW_COLUMN_SUFFIX = '_stringReplaced'
# NEW_COLUMN_SUFFIX = "_stringReplaced"
# This value has effect only if CREATE_NEW_COLUMN = True.
# The new column name will be set as column + NEW_COLUMN_SUFFIX. Then, if the original
# column was "column1" and the suffix is "_stringReplaced", the new column will be named as
# "column1_stringReplaced".
# Alternatively, input inside quotes a string with the desired suffix. Recommendation:
# start the suffix with "_" to separate it from the original name.


# The dataframe will be stored in the object named transf_dataset:
# Simply modify this object on the left of equality:
transf_dataset = etl.switch_strings (df = DATASET, column_to_analyze = COLUMN_TO_ANALYZE, list_of_dictionaries_with_original_strings_and_replacements = LIST_OF_DICTIONARIES_WITH_ORIGINAL_STRINGS_AND_REPLACEMENTS, create_new_column = CREATE_NEW_COLUMN, new_column_suffix = NEW_COLUMN_SUFFIX)
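
The weekday/weekend example above amounts to an exact, case-sensitive mapping of whole strings. The sketch below builds such a mapping from a list of dictionaries in the same format and applies it with `Series.replace`. It is an approximation in plain Pandas, not the `idsw` implementation, and the column name `day` is made up.

```python
import pandas as pd

replacements = [
    {"original_string": "sunday", "new_string": "weekend"},
    {"original_string": "saturday", "new_string": "weekend"},
    {"original_string": "monday", "new_string": "weekday"},
    {"original_string": "friday", "new_string": "weekday"},
    {"original_string": None, "new_string": None},  # ignored, as described above
]

# Keep only the dictionaries where both values were filled in.
mapping = {
    d["original_string"]: d["new_string"]
    for d in replacements
    if d["original_string"] is not None and d["new_string"] is not None
}

df = pd.DataFrame({"day": ["sunday", "monday", "Sunday", "friday"]})
# Only whole-string, case-sensitive matches are replaced.
df["day_stringReplaced"] = df["day"].replace(mapping)
print(df)
```

The value "Sunday" is left untouched because its case differs from the key "sunday", which illustrates why homogenizing the case first can be useful.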

### **Replacing strings with Machine Learning: finding similar strings and replacing them by standard strings**

In [None]:
DATASET = dataset #Alternatively: object containing the dataset to be analyzed

COLUMN_TO_ANALYZE = 'column_to_analyze'
# COLUMN_TO_ANALYZE: string (inside quotes), 
# containing the name of the column that will be analyzed. 
# e.g. COLUMN_TO_ANALYZE = "column1" will analyze the column named as 'column1'.

MODE = 'find_and_replace'
# MODE = 'find_and_replace' will find similar strings and replace them with one of the
# standard strings if the similarity between them is higher than or equal to the threshold.
# Alternatively: MODE = 'find' will only find the similar strings by calculating the similarity.

THRESHOLD_FOR_PERCENT_OF_SIMILARITY = 80.0
# THRESHOLD_FOR_PERCENT_OF_SIMILARITY = 80.0 - 0.0% means no similarity and 100% means equal strings.
# THRESHOLD_FOR_PERCENT_OF_SIMILARITY is the minimum similarity calculated from the
# Levenshtein (minimum edit) distance algorithm. This distance represents the minimum number of
# character insertions, substitutions or deletions needed to make two
# strings equal.

LIST_OF_DICTIONARIES_WITH_STANDARD_STRINGS_FOR_REPLACEMENT = [
    
    {'standard_string': None},
    {'standard_string': None}, 
    {'standard_string': None},
    {'standard_string': None}, 
    {'standard_string': None}, 
    {'standard_string': None},
    {'standard_string': None}, 
    {'standard_string': None},
    {'standard_string': None}, 
    {'standard_string': None}, 
    {'standard_string': None}
    
]
# This is a list of dictionaries, where each dictionary contains a single key-value pair:
# the key must be always 'standard_string', and the value will be one of the standard strings 
# for replacement: if a given string in COLUMN_TO_ANALYZE has a similarity with one 
# of the standard strings equal to or higher than THRESHOLD_FOR_PERCENT_OF_SIMILARITY, it will be
# substituted by this standard string.
# For instance, suppose you have a word written in too many ways, making it difficult to use
# the function switch_strings: "EU" , "eur" , "Europ" , "Europa" , "Erope" , "Evropa" ...
# You can use this function to search strings similar to "Europe" and replace them.
    
# The function will loop through all dictionaries in this list, access the values of the keys 
# 'standard_string', and search these values on the strings in COLUMN_TO_ANALYZE. When the value 
# is found, it will be replaced (switched) if the similarity is sufficiently high.
    
# The object LIST_OF_DICTIONARIES_WITH_STANDARD_STRINGS_FOR_REPLACEMENT must be declared as a list, 
# in brackets, even if there is a single dictionary.
# Always use the same key: 'standard_string'.
# Notice that this function performs fuzzy matching, so it MAY MATCH substrings and strings
# written with different cases (upper or lower) when these portions or modifications make the
# strings sufficiently similar to each other.
    
# If you want, you can remove elements (dictionaries) from the list to declare fewer elements;
# and you can also add more elements (dictionaries) to the lists, if you need to replace more
# values.
# Simply put a comma after the last element from the list and declare a new dictionary, keeping the
# same key: {'standard_string': other_std_str}, 
# where other_std_str represents the string for searching and replacement 
# (If the key contains None, the new dictionary will be ignored).
    
# Example:
# Suppose the COLUMN_TO_ANALYZE contains the values 'California', 'Cali', 'Calefornia', 
# 'Calefornie', 'Californie', 'Calfornia', 'Calefernia', 'New York', 'New York City', 
# but you want to obtain data labelled as the state 'California' or 'New York'.
# Set: list_of_dictionaries_with_standard_strings_for_replacement = 
# [{'standard_string': 'California'},
# {'standard_string': 'New York'}]
    
# ATTENTION: It is advisable to first run the function with MODE = 'find' to inspect the
# similarities and choose the best threshold; set it as high as possible, avoiding incorrect
# substitutions in a gray area; and then perform the replacement. This avoids leaving original
# incorrect strings in the output dataset, as well as wrong replacements (replacement by one of
# the standard strings which is not the correct one).

CREATE_NEW_COLUMN = True
# CREATE_NEW_COLUMN = True
# Set CREATE_NEW_COLUMN = True to store the transformed data into a new
# column. Alternatively, set CREATE_NEW_COLUMN = False to overwrite the existing column.
NEW_COLUMN_SUFFIX = '_stringReplaced'
# This value has effect only if CREATE_NEW_COLUMN = True.
# The new column name will be set as column + NEW_COLUMN_SUFFIX. So, if the original
# column was "column1" and the suffix is "_stringReplaced", the new column will be named
# "column1_stringReplaced".
# Alternatively, input a string with the desired suffix inside quotes. Recommendation:
# start the suffix with "_" to separate it from the original name.


# The dataframe will be stored in the object named transf_dataset.
# The summary list is saved as summary_list.
# To use other names, simply modify these objects on the left of the equals sign:
transf_dataset, summary_list = etl.string_replacement_ml (df = DATASET, column_to_analyze = COLUMN_TO_ANALYZE, mode = MODE, threshold_for_percent_of_similarity = THRESHOLD_FOR_PERCENT_OF_SIMILARITY, list_of_dictionaries_with_standard_strings_for_replacement = LIST_OF_DICTIONARIES_WITH_STANDARD_STRINGS_FOR_REPLACEMENT, create_new_column = CREATE_NEW_COLUMN, new_column_suffix = NEW_COLUMN_SUFFIX)
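The idea of replacing similar strings by a standard string can be sketched with the standard library alone. This is only an illustration: it uses `difflib.SequenceMatcher` as the similarity measure, which is an assumption and not necessarily what `string_replacement_ml` uses internally, and the helper name `replace_similar` is hypothetical. The threshold here is a ratio between 0 and 1.

```python
from difflib import SequenceMatcher


def replace_similar(values, standard_strings, threshold=0.8):
    """Replace each value by its most similar standard string.

    A value is replaced only when the best similarity ratio reaches the
    threshold; otherwise the original value is kept unchanged.
    """
    replaced = []
    for value in values:
        best_std, best_score = None, 0.0
        for std in standard_strings:
            # Case-insensitive similarity ratio between 0.0 and 1.0
            score = SequenceMatcher(None, value.lower(), std.lower()).ratio()
            if score > best_score:
                best_std, best_score = std, score
        if best_std is not None and best_score >= threshold:
            replaced.append(best_std)
        else:
            replaced.append(value)
    return replaced


raw = ['California', 'Calefornia', 'Cali', 'New York', 'New York City']
cleaned = replace_similar(raw, ['California', 'New York'], threshold=0.8)
print(cleaned)
```

With a high threshold, values such as 'Cali' and 'New York City' are left untouched, which illustrates the gray area mentioned in the ATTENTION note: lowering the threshold replaces more strings, but also raises the risk of wrong replacements.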

### **Log-transforming the variables**

In [None]:
#### WARNING: This function will eliminate rows where the selected variables present 
#### values lower than or equal to zero (the logarithm is defined only for positive values).

DATASET = dataset #Alternatively: object containing the dataset to be analyzed

SUBSET = None
# Set SUBSET = None to transform the whole dataset. Alternatively, pass a list with the
# names of the columns to which the transformation will be applied. For instance:
# SUBSET = ['col1', 'col2', 'col3'] will apply the transformation to the columns named
# 'col1', 'col2', and 'col3'. Declare the names inside quotes.
# Declaring the full list of columns is equivalent to setting SUBSET = None.

CREATE_NEW_COLUMNS = True
# Set CREATE_NEW_COLUMNS = True to store the transformed data into new columns.
# Alternatively, set CREATE_NEW_COLUMNS = False to overwrite the existing columns.
    
NEW_COLUMNS_SUFFIX = "_log"
# This value has effect only if CREATE_NEW_COLUMNS = True.
# The new column name will be set as column + NEW_COLUMNS_SUFFIX. Then, if the original
# column was "column1" and the suffix is "_log", the new column will be named as
# "column1_log".
# Alternatively, input inside quotes a string with the desired suffix. Recommendation:
# start the suffix with "_" to separate it from the original name.

# New dataframe saved as log_transf_df.
# To use another name, simply modify this object on the left of the equals sign:
log_transf_df = etl.log_transform (df = DATASET, subset = SUBSET, create_new_columns = CREATE_NEW_COLUMNS, new_columns_suffix = NEW_COLUMNS_SUFFIX)

# The log-normal is a distribution derived from the normal.
# If the values Y follow a log-normal distribution, their logarithms follow a normal distribution.
# A log-normal curve resembles a normal one, but with skewness (asymmetry) 
# and kurtosis (a long tail).

# Applying the log is a methodology for normalizing the variables: 
# the sample space shrinks after the transformation, making the data more 
# adequate for being processed by Machine Learning algorithms. Preferably, apply 
# the transformation to the whole dataset, so that all variables will be of the same order 
# of magnitude.
# The transformation is not necessary for variables that already range from about -100 to 100
# in numerical value, which is where most outputs of the log transformation fall.
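As a minimal sketch of what a log-transform with new suffixed columns looks like, the snippet below uses plain NumPy and pandas. It assumes the natural logarithm; the base used by `etl.log_transform` is not stated in this notebook, so treat that as an assumption. The toy dataframe is illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1.0, 10.0, 100.0, -5.0, 0.0]})

# The logarithm is defined only for positive values, so rows with
# values lower than or equal to zero are dropped first.
positive = df[df['col1'] > 0].copy()

# Store the result in a new column: original name + suffix "_log"
positive['col1_log'] = np.log(positive['col1'])
print(positive)
```

Values spanning two orders of magnitude (1 to 100) end up within a range of about 0 to 4.6, which shows how the sample space shrinks.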

### **Reversing the log-transform - Exponentially transforming variables**

In [None]:
DATASET = dataset #Alternatively: object containing the dataset to be analyzed

SUBSET = None
# Set SUBSET = None to transform the whole dataset. Alternatively, pass a list with the
# names of the columns to which the transformation will be applied. For instance:
# SUBSET = ['col1', 'col2', 'col3'] will apply the transformation to the columns named
# 'col1', 'col2', and 'col3'. Declare the names inside quotes.
# Declaring the full list of columns is equivalent to setting SUBSET = None.

CREATE_NEW_COLUMNS = True
# Set CREATE_NEW_COLUMNS = True to store the transformed data into new columns.
# Alternatively, set CREATE_NEW_COLUMNS = False to overwrite the existing columns.
    
NEW_COLUMNS_SUFFIX = "_originalScale"
# This value has effect only if CREATE_NEW_COLUMNS = True.
# The new column name will be set as column + NEW_COLUMNS_SUFFIX. Then, if the original
# column was "column1" and the suffix is "_originalScale", the new column will be named as
# "column1_originalScale".
# Alternatively, input inside quotes a string with the desired suffix. Recommendation:
# start the suffix with "_" to separate it from the original name.

# New dataframe saved as rescaled_df.
# To use another name, simply modify this object on the left of the equals sign:
rescaled_df = etl.reverse_log_transform(df = DATASET, subset = SUBSET, create_new_columns = CREATE_NEW_COLUMNS, new_columns_suffix = NEW_COLUMNS_SUFFIX)
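Reversing a log-transform amounts to applying the exponential. The round trip below is a sketch with NumPy and pandas, again assuming the natural logarithm (an assumption about the library, not something this notebook states).

```python
import numpy as np
import pandas as pd

original = np.array([1.0, 10.0, 100.0])
logged = pd.DataFrame({'col1_log': np.log(original)})

# exp is the inverse of the natural log: exp(log(y)) == y
logged['col1_log_originalScale'] = np.exp(logged['col1_log'])
print(logged)
```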

### **Obtaining and applying Box-Cox transform**
- Transforms a series of data into a series that is approximately described by a normal distribution.

In [None]:
# This function will process a single column column_to_transform of the dataframe df 
# per call.

DATASET = dataset #Alternatively: object containing the dataset to be processed

COLUMN_TO_TRANSFORM = 'column_to_transform'
# COLUMN_TO_TRANSFORM must be a string with the name of the column.
# e.g. COLUMN_TO_TRANSFORM = 'column1' to transform a column named as 'column1'

MODE = 'calculate_and_apply'
# Alternatively, MODE = 'calculate_and_apply' to calculate lambda and apply the Box-Cox
# transform; MODE = 'apply_only' to apply the transform for a known lambda.
# For MODE = 'apply_only', LAMBDA_BOXCOX must be provided.

LAMBDA_BOXCOX = None
# LAMBDA_BOXCOX must be a float value. e.g. LAMBDA_BOXCOX = 1.7
# If you calculated lambda with the function box_cox_transform and saved the
# transformation data summary dictionary as data_sum_dict, simply set:
## LAMBDA_BOXCOX = data_sum_dict['lambda_boxcox']
# This will access the value stored in the key 'lambda_boxcox' of the dictionary, which
# contains the lambda. 
# If LAMBDA_BOXCOX is None, the mode will be automatically set as 'calculate_and_apply'.

SUFFIX = '_BoxCoxTransf'
# SUFFIX: string (inside quotes).
# How the transformed column will be identified in the returned data_transformed_df.
# If COLUMN_TO_TRANSFORM = 'Y' and SUFFIX = '_BoxCoxTransf', the transformed column will be
# identified as 'Y_BoxCoxTransf'.
# Alternatively, input a string with the desired suffix inside quotes. Recommendation:
# start the suffix with "_" to separate it from the original name.

SPECIFICATION_LIMITS = {'lower_spec_lim': None, 'upper_spec_lim': None}
# If there are specification limits, input them in this dictionary. Do not modify the keys;
# simply substitute None by the lower and/or the upper specification.
# e.g. Suppose you have a tank that cannot hold more than 10 L. Then:
# SPECIFICATION_LIMITS = {'lower_spec_lim': None, 'upper_spec_lim': 10}, i.e., there is only
# an upper specification, equal to 10 (do not add units).
# Suppose a temperature cannot be lower than 10 ºC, but there is no upper specification. Then:
# SPECIFICATION_LIMITS = {'lower_spec_lim': 10, 'upper_spec_lim': None}. Finally, suppose
# a liquid whose pH must be between 6.8 and 7.2:
# SPECIFICATION_LIMITS = {'lower_spec_lim': 6.8, 'upper_spec_lim': 7.2}

# New dataframe saved as data_transformed_df; dictionary saved as data_sum_dict.
# To use other names, simply modify these objects on the left of the equals sign:
data_transformed_df, data_sum_dict = etl.box_cox_transform (df = DATASET, column_to_transform = COLUMN_TO_TRANSFORM, mode = MODE, lambda_boxcox = LAMBDA_BOXCOX, suffix = SUFFIX, specification_limits = SPECIFICATION_LIMITS)
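The two modes described above can be sketched with `scipy.stats.boxcox`. Whether `box_cox_transform` relies on SciPy internally is an assumption; the snippet only illustrates the difference between estimating lambda and applying a known lambda. The synthetic log-normal sample is made up for the example.

```python
import numpy as np
from scipy import stats

# Box-Cox requires strictly positive data; a log-normal sample is right-skewed
rng = np.random.default_rng(0)
y = rng.lognormal(mean=0.0, sigma=0.5, size=500)

# Equivalent of 'calculate_and_apply': lambda is estimated from the data
y_transf, lambda_boxcox = stats.boxcox(y)

# Equivalent of 'apply_only': reuse a known lambda on the data
y_again = stats.boxcox(y, lmbda=lambda_boxcox)

# For lambda != 0 the transform is (y**lambda - 1) / lambda;
# for lambda == 0 it reduces to log(y).
print('lambda:', lambda_boxcox)
print('skewness before:', stats.skew(y), 'after:', stats.skew(y_transf))
```

Storing the estimated lambda (as the summary dictionary does with the key 'lambda_boxcox') is what makes it possible to apply the same transform to new data, or to reverse it later.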

### **Reversing Box-Cox transform**

In [None]:
# This function will process a single column column_to_transform of the dataframe df 
# per call.

DATASET = dataset #Alternatively: object containing the dataset to be processed

COLUMN_TO_TRANSFORM = 'column_to_transform'
# COLUMN_TO_TRANSFORM must be a string with the name of the column.
# e.g. COLUMN_TO_TRANSFORM = 'column1' to transform a column named as 'column1'

LAMBDA_BOXCOX = None
# LAMBDA_BOXCOX must be a float value. e.g. LAMBDA_BOXCOX = 1.7
# If you calculated lambda with the function box_cox_transform and saved the
# transformation data summary dictionary as data_sum_dict, simply set:
## LAMBDA_BOXCOX = data_sum_dict['lambda_boxcox']
# This will access the value stored in the key 'lambda_boxcox' of the dictionary, which
# contains the lambda. 
# Reversing the transform requires the lambda used in the original Box-Cox transform,
# so substitute None by that value.

SUFFIX = '_ReversedBoxCox'
# SUFFIX: string (inside quotes).
# How the transformed column will be identified in the returned dataframe.
# If COLUMN_TO_TRANSFORM = 'Y' and SUFFIX = '_ReversedBoxCox', the transformed column will be
# identified as 'Y_ReversedBoxCox'.
# Alternatively, input a string with the desired suffix inside quotes. Recommendation:
# start the suffix with "_" to separate it from the original name.

# New dataframe saved as retransformed_df.
# To use another name, simply modify this object on the left of the equals sign:
retransformed_df = etl.reverse_box_cox (df = DATASET, column_to_transform = COLUMN_TO_TRANSFORM, lambda_boxcox = LAMBDA_BOXCOX, suffix = SUFFIX)
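The inverse of the Box-Cox transform follows directly from its definition. The helper below is a hypothetical NumPy-only sketch (the name `inverse_box_cox` is not part of the library), shown to make explicit why the original lambda is needed.

```python
import numpy as np


def inverse_box_cox(y_transf, lambda_boxcox):
    """Invert y_transf = (y**lambda - 1) / lambda (or log(y) when lambda == 0)."""
    y_transf = np.asarray(y_transf, dtype=float)
    if lambda_boxcox == 0:
        return np.exp(y_transf)
    return np.power(lambda_boxcox * y_transf + 1.0, 1.0 / lambda_boxcox)


y = np.array([0.5, 1.0, 2.0, 4.0])
lam = 0.5

# Forward transform with a known lambda, then back to the original scale
y_transf = (y ** lam - 1.0) / lam
y_back = inverse_box_cox(y_transf, lam)
print(y_back)
```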

### **One-Hot Encoding the categorical variables**
- For each category, the One-Hot Encoder creates a new column in the dataset. This new column is a binary variable that equals 0 if the row does not belong to that category, and equals 1 if it does. For a category "A", a column named "A" is created.
    - If the row is an element from category "A", the value for the column "A" is 1.
    - If not, the value for column "A" is 0.

In [None]:
DATASET = dataset #Alternatively: object containing the dataset to be processed

SUBSET_OF_FEATURES_TO_BE_ENCODED = ['COLUMN1', 'COLUMN2', 'COLUMN3']
# SUBSET_OF_FEATURES_TO_BE_ENCODED: list of strings (inside quotes), 
# containing the names of the columns with the categorical variables that will be 
# encoded. If a single column will be encoded, declare this parameter as a list with
# only one element. e.g. SUBSET_OF_FEATURES_TO_BE_ENCODED = ["column1"] 
# will encode the column named 'column1'; 
# SUBSET_OF_FEATURES_TO_BE_ENCODED = ["col1", 'col2', 'col3'] will encode 3 columns
# with categorical variables: 'col1', 'col2', and 'col3'.

# New dataframe saved as one_hot_encoded_df; list of encoding information,
# including the different categories and the encoder objects, saved as OneHot_encoding_list.
# To use other names, simply modify these objects on the left of the equals sign:
one_hot_encoded_df, OneHot_encoding_list = etl.OneHotEncode_df (df = DATASET, subset_of_features_to_be_encoded = SUBSET_OF_FEATURES_TO_BE_ENCODED)
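A small sketch of the resulting layout, using `pandas.get_dummies` as a stand-in for the encoder object (this is an assumption for illustration; the column naming produced by `OneHotEncode_df` may differ):

```python
import pandas as pd

df = pd.DataFrame({'column1': ['A', 'B', 'A', 'C']})

# One binary column per category: 1 if the row belongs to it, 0 otherwise
dummies = pd.get_dummies(df['column1'], dtype=int)
encoded = pd.concat([df, dummies], axis=1)
print(encoded)
```

Each row has exactly one column equal to 1 among the new columns, which is what allows the encoding to be reversed later.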

### **Reversing the One-Hot Encoding of the categorical variables**

In [None]:
DATASET = dataset #Alternatively: object containing the dataset to be processed

ENCODING_LIST = [
    
    {'column': None,
    'OneHot_encoder': {'OneHot_enc_obj': None, 'encoded_columns': None}},
    {'column': None,
    'OneHot_encoder': {'OneHot_enc_obj': None, 'encoded_columns': None}},
    {'column': None,
    'OneHot_encoder': {'OneHot_enc_obj': None, 'encoded_columns': None}},
    {'column': None,
    'OneHot_encoder': {'OneHot_enc_obj': None, 'encoded_columns': None}}
    
]
# ENCODING_LIST: list in the same format as the one generated by the OneHotEncode_df function:
# it must be a list of dictionaries, where each dictionary contains two keys:
# key 'column': string with the original column name (in quotes); 
# key 'OneHot_encoder': this key must store a nested dictionary.
# The nested dictionaries generated by the encoding function have two keys: 'categories',
# storing an array with the different categories; and 'OneHot_enc_obj', storing the encoder
# object. Of these, only the key 'OneHot_enc_obj' is required here.
## On the other hand, an additional key is needed in the nested dictionary:
## key 'encoded_columns': this key must store a list or array with the names of the columns
# obtained from the encoding.

# New dataframe saved as reversed_one_hot_encoded_df.
# To use another name, simply modify this object on the left of the equals sign:
reversed_one_hot_encoded_df = etl.reverse_OneHotEncode (df = DATASET, encoding_list = ENCODING_LIST)
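Conceptually, reversing a One-Hot Encoding means finding, for each row, which encoded column holds the 1. The sketch below does this with pandas alone; it assumes the encoded columns are named after the categories and does not use the encoder object that `reverse_OneHotEncode` expects.

```python
import pandas as pd

encoded = pd.DataFrame({'A': [1, 0, 1, 0],
                        'B': [0, 1, 0, 0],
                        'C': [0, 0, 0, 1]})
encoded_columns = ['A', 'B', 'C']

# idxmax over the columns returns the name of the column holding the 1
encoded['column1'] = encoded[encoded_columns].idxmax(axis=1)
print(encoded)
```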

### **Ordinal Encoding the categorical variables**
- Transforms categorical values with a notion of order into numerical (integer) features.
- For each column, the Ordinal Encoder creates a new column in the dataset. This new column stores integer values, where each integer represents one of the possible categories.

In [None]:
DATASET = dataset #Alternatively: object containing the dataset to be processed

SUBSET_OF_FEATURES_TO_BE_ENCODED = ['COLUMN1', 'COLUMN2', 'COLUMN3']
# SUBSET_OF_FEATURES_TO_BE_ENCODED: list of strings (inside quotes), 
# containing the names of the columns with the categorical variables that will be 
# encoded. If a single column will be encoded, declare this parameter as a list with
# only one element. e.g. SUBSET_OF_FEATURES_TO_BE_ENCODED = ["column1"] 
# will encode the column named 'column1'; 
# SUBSET_OF_FEATURES_TO_BE_ENCODED = ["col1", 'col2', 'col3'] will encode 3 columns
# with categorical variables: 'col1', 'col2', and 'col3'.

# New dataframe saved as ordinal_encoded_df; list of encoding information,
# including the different categories and the encoder objects, saved as ordinal_encoding_list.
# To use other names, simply modify these objects on the left of the equals sign:
ordinal_encoded_df, ordinal_encoding_list = etl.OrdinalEncode_df (df = DATASET, subset_of_features_to_be_encoded = SUBSET_OF_FEATURES_TO_BE_ENCODED)
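To make the integer mapping concrete, the following sketch encodes an ordered category with `pandas.Categorical`. The explicit `order` list, the example column and the suffix '_OrdinalEnc' are illustrative assumptions; `OrdinalEncode_df` may assign the integers differently (scikit-learn's `OrdinalEncoder`, for instance, sorts the categories by default).

```python
import pandas as pd

df = pd.DataFrame({'size': ['small', 'large', 'medium', 'small']})

# Declare the order explicitly so the integers reflect it
order = ['small', 'medium', 'large']
categorical = pd.Categorical(df['size'], categories=order, ordered=True)

# codes holds the position of each value in the ordered category list
df['size_OrdinalEnc'] = categorical.codes
print(df)
```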

### **Reversing the Ordinal Encoding of the categorical variables**

In [None]:
DATASET = dataset #Alternatively: object containing the dataset to be processed

ENCODING_LIST = [
    
    {'column': None,
    'ordinal_encoder': {'ordinal_enc_obj': None, 'encoded_column': None}},
    {'column': None,
    'ordinal_encoder': {'ordinal_enc_obj': None, 'encoded_column': None}},
    {'column': None,
    'ordinal_encoder': {'ordinal_enc_obj': None, 'encoded_column': None}},
    {'column': None,
    'ordinal_encoder': {'ordinal_enc_obj': None, 'encoded_column': None}}
    
]
# ENCODING_LIST: list in the same format as the one generated by the OrdinalEncode_df function:
# it must be a list of dictionaries, where each dictionary contains two keys:
# key 'column': string with the original column name (in quotes); 
# key 'ordinal_encoder': this key must store a nested dictionary.
# The nested dictionaries generated by the encoding function have two keys: 'categories',
# storing an array with the different categories; and 'ordinal_enc_obj', storing the encoder
# object. Of these, only the key 'ordinal_enc_obj' is required here.
## On the other hand, an additional key is needed in the nested dictionary:
## key 'encoded_column': this key must store a string with the name of the column
# obtained from the encoding.

# New dataframe saved as reversed_ordinal_encoded_df.
# To use another name, simply modify this object on the left of the equals sign:
reversed_ordinal_encoded_df = etl.reverse_OrdinalEncode (df = DATASET, encoding_list = ENCODING_LIST)
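Reversing an Ordinal Encoding only requires the list of categories in the order used for encoding. The sketch below maps the integers back with a plain dictionary; it is an illustration under that assumption and does not use the encoder object expected by `reverse_OrdinalEncode`.

```python
import pandas as pd

# Categories in the same order that was used for encoding (assumed known)
order = ['small', 'medium', 'large']
df = pd.DataFrame({'size_OrdinalEnc': [0, 2, 1, 0]})

# Build {0: 'small', 1: 'medium', 2: 'large'} and map the integers back
code_to_category = dict(enumerate(order))
df['size'] = df['size_OrdinalEnc'].map(code_to_category)
print(df)
```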

### **Scaling the features - Standard scaler, Min-Max scaler, division by factor**

In [None]:
DATASET = dataset #Alternatively: object containing the dataset to be processed

SUBSET_OF_FEATURES_TO_SCALE = ['COLUMN1', 'COLUMN2', 'COLUMN3']
# SUBSET_OF_FEATURES_TO_SCALE: list of strings (inside quotes), 
# containing the names of the columns with the numeric variables that will be 
# scaled. If a single column will be scaled, declare this parameter as a list with
# only one element. e.g. SUBSET_OF_FEATURES_TO_SCALE = ["column1"] 
# will scale the column named 'column1'; 
# SUBSET_OF_FEATURES_TO_SCALE = ["col1", 'col2', 'col3'] will scale 3 columns:
# 'col1', 'col2', and 'col3'.

MODE = 'min_max'
## Alternatively: MODE = 'standard', MODE = 'min_max', MODE = 'factor', MODE = 'normalize_by_maximum'
## This function provides 4 methods (modes) of scaling:
## MODE = 'standard': applies the standard scaling, 
##  which creates a new variable with mean = 0; and standard deviation = 1.
##  Each value Y is transformed as Ytransf = (Y - u)/s, where u is the mean 
##  of the training samples, and s is the standard deviation of the training samples.
    
## MODE = 'min_max': applies min-max normalization, with a resultant feature 
## ranging from 0 to 1. Each value Y is transformed as 
## Ytransf = (Y - Ymin)/(Ymax - Ymin), where Ymin and Ymax are the minimum and 
## maximum values of Y, respectively.
    
## MODE = 'factor': divides the whole series by a numeric value provided as argument. 
## For a factor F, the new Y values will be Ytransf = Y/F.

## MODE = 'normalize_by_maximum' is similar to MODE = 'factor', but the factor will be selected
# as the maximum value. This mode is available only for SCALE_WITH_NEW_PARAMS = True. If
# SCALE_WITH_NEW_PARAMS = False, you should provide the value of the maximum as a division 'factor'.

SCALE_WITH_NEW_PARAMS = True
# Set SCALE_WITH_NEW_PARAMS = True if you want to calculate a new
# scaler for the data; or SCALE_WITH_NEW_PARAMS = False if you want to apply 
# previously obtained parameters to the data (i.e., if you want to apply a previously
# trained scaler to another set of data; or simply want to apply the same
# scaler again).

## WARNING: The MODE 'factor' demands the input of the list of factors that will be 
# used for normalizing each column. Therefore, it can be used only 
# when SCALE_WITH_NEW_PARAMS = False.

LIST_OF_SCALING_PARAMS = None
# LIST_OF_SCALING_PARAMS is a list of dictionaries with the same format as the list returned
# by this function. Each dictionary must correspond to one of the features that will be scaled,
# but the list does not have to be in the same order as the columns: the columns are identified
# through one of the dictionary keys.
# The first key of the dictionary must be 'column'. This key must store a string with the exact
# name of the column that will be scaled.
# The second key must be 'scaler'. This key must store a dictionary, which must contain
# one of two keys: 'scaler_obj' - sklearn scaler object to be used; or 'scaler_details' - the
# numeric parameters for re-calculating the scaler without the object. The key 'scaler_details' 
# must contain a nested dictionary. For the mode 'min_max', this dictionary should contain 
# two keys: 'min', with the minimum value of the variable, and 'max', with the maximum value. 
# For the mode 'standard', the keys should be 'mu', with the mean value, and 'sigma', with the 
# standard deviation. For the mode 'factor', the key should be 'factor', and should contain the 
# factor for division (the scaling value. e.g. 'factor': 2.0 will divide the column by 2.0).
# Again, if you want to normalize by the maximum, declare the maximum value as any other factor for
# division.
# The key 'scaler_details' will not create an object: the transform will be directly performed 
# through vectorized operations.

SUFFIX = '_scaled'
# SUFFIX: string (inside quotes).
# How the transformed column will be identified in the returned dataframe.
# If the original column is named 'Y' and SUFFIX = '_scaled', the transformed column will be
# identified as 'Y_scaled'.
# Alternatively, input a string with the desired suffix inside quotes. Recommendation:
# start the suffix with "_" to separate it from the original name.

# New dataframe saved as scaled_df; list of scaling parameters saved as scaling_list.
# To use other names, simply modify these objects on the left of the equals sign:
scaled_df, scaling_list = etl.feature_scaling (df = DATASET, subset_of_features_to_scale = SUBSET_OF_FEATURES_TO_SCALE, mode = MODE, scale_with_new_params = SCALE_WITH_NEW_PARAMS, list_of_scaling_params = LIST_OF_SCALING_PARAMS, suffix = SUFFIX)
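The formulas of the four modes can be checked by hand with a few lines of pandas. This sketch applies them directly as vectorized operations instead of calling `etl.feature_scaling`; the column names and suffixes are illustrative, and the population standard deviation (`ddof=0`) is used because that is what scikit-learn's `StandardScaler` computes.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'column1': [2.0, 4.0, 6.0, 8.0]})
y = df['column1']

# 'standard': Ytransf = (Y - u)/s
mu, sigma = y.mean(), y.std(ddof=0)
df['column1_standard'] = (y - mu) / sigma

# 'min_max': Ytransf = (Y - Ymin)/(Ymax - Ymin)
y_min, y_max = y.min(), y.max()
df['column1_min_max'] = (y - y_min) / (y_max - y_min)

# 'factor': Ytransf = Y/F, with F provided by the user
factor = 2.0
df['column1_factor'] = y / factor

# 'normalize_by_maximum': same as 'factor', with F = Ymax
df['column1_by_max'] = y / y_max

# Parameters kept in the 'scaler_details' layout described above,
# so that the same scaling can be re-applied or reversed later
scaling_params = {'column': 'column1',
                  'scaler': {'scaler_obj': None,
                             'scaler_details': {'min': y_min, 'max': y_max}}}
print(df)
```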

### **Reversing scaling of the features - Standard scaler, Min-Max scaler, division by factor**

In [None]:
DATASET = dataset #Alternatively: object containing the dataset to be processed

SUBSET_OF_FEATURES_TO_SCALE = ['COLUMN1', 'COLUMN2', 'COLUMN3']
# SUBSET_OF_FEATURES_TO_SCALE: list of strings (inside quotes), 
# containing the names of the columns with the scaled variables whose scaling 
# will be reversed. If a single column will be processed, declare this parameter as a list with
# only one element. e.g. SUBSET_OF_FEATURES_TO_SCALE = ["column1"] 
# will process the column named 'column1'; 
# SUBSET_OF_FEATURES_TO_SCALE = ["col1", 'col2', 'col3'] will process 3 columns:
# 'col1', 'col2', and 'col3'.

MODE = 'min_max'
## Alternatively: MODE = 'standard', MODE = 'min_max', MODE = 'factor'
## This function reverses 3 methods (modes) of scaling, which are described below:
## MODE = 'standard': applies the standard scaling, 
##  which creates a new variable with mean = 0; and standard deviation = 1.
##  Each value Y is transformed as Ytransf = (Y - u)/s, where u is the mean 
##  of the training samples, and s is the standard deviation of the training samples.
    
## MODE = 'min_max': applies min-max normalization, with a resultant feature 
## ranging from 0 to 1. Each value Y is transformed as 
## Ytransf = (Y - Ymin)/(Ymax - Ymin), where Ymin and Ymax are the minimum and 
## maximum values of Y, respectively.
    
## MODE = 'factor': divides the whole series by a numeric value provided as argument. 
## For a factor F, the new Y values will be Ytransf = Y/F.

LIST_OF_SCALING_PARAMS = [
                            {'column': None,
                            'scaler': {'scaler_obj': None, 
                                      'scaler_details': None}},
                            {'column': None,
                            'scaler': {'scaler_obj': None, 
                                      'scaler_details': None}}
                            
                         ]
# LIST_OF_SCALING_PARAMS is a list of dictionaries with the same format as the list returned
# by this function. Each dictionary must correspond to one of the features that will be scaled,
# but the list does not have to be in the same order as the columns: the columns are identified
# through one of the dictionary keys.
# The first key of the dictionary must be 'column'. This key must store a string with the exact
# name of the column that will be scaled.
# The second key must be 'scaler'. This key must store a dictionary, which must contain
# one of two keys: 'scaler_obj' - sklearn scaler object to be used; or 'scaler_details' - the
# numeric parameters for re-calculating the scaler without the object. The key 'scaler_details' 
# must contain a nested dictionary. For the mode 'min_max', this dictionary should contain 
# two keys: 'min', with the minimum value of the variable, and 'max', with the maximum value. 
# For the mode 'standard', the keys should be 'mu', with the mean value, and 'sigma', with the 
# standard deviation. For the mode 'factor', the key should be 'factor', and should contain the 
# factor for division (the scaling value. e.g. 'factor': 2.0 will divide the column by 2.0).
# Again, if you want to normalize by the maximum, declare the maximum value as any other factor for
# division.

SUFFIX = '_reverseScaling'
# SUFFIX: string (inside quotes).
# How the transformed column will be identified in the returned dataframe.
# If the original column is named 'Y' and SUFFIX = '_reverseScaling', the transformed column
# will be identified as 'Y_reverseScaling'.
# Alternatively, input a string with the desired suffix inside quotes. Recommendation:
# start the suffix with "_" to separate it from the original name.

# New dataframe saved as rescaled_df; list of scaling parameters saved as scaling_list.
# To use other names, simply modify these objects on the left of the equals sign:
rescaled_df, scaling_list = etl.reverse_feature_scaling (df = DATASET, subset_of_features_to_scale = SUBSET_OF_FEATURES_TO_SCALE, list_of_scaling_params = LIST_OF_SCALING_PARAMS, mode = MODE, suffix = SUFFIX)
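Each scaling formula can be inverted algebraically from the values stored under 'scaler_details'. The helper below is a hypothetical NumPy sketch (the name `reverse_scaling` is not a library function) that uses the keys described above: 'mu' and 'sigma', 'min' and 'max', or 'factor'.

```python
import numpy as np


def reverse_scaling(y_scaled, mode, details):
    """Map scaled values back to the original scale for a given mode."""
    y_scaled = np.asarray(y_scaled, dtype=float)
    if mode == 'standard':
        # Inverse of (Y - mu)/sigma
        return y_scaled * details['sigma'] + details['mu']
    if mode == 'min_max':
        # Inverse of (Y - min)/(max - min)
        return y_scaled * (details['max'] - details['min']) + details['min']
    if mode == 'factor':
        # Inverse of Y/factor
        return y_scaled * details['factor']
    raise ValueError(f"Unknown mode: {mode}")


from_min_max = reverse_scaling([0.0, 1/3, 2/3, 1.0], 'min_max',
                               {'min': 2.0, 'max': 8.0})
from_standard = reverse_scaling([-1.0, 1.0], 'standard',
                                {'mu': 5.0, 'sigma': 2.0})
from_factor = reverse_scaling([1.0, 2.0], 'factor', {'factor': 2.0})
print(from_min_max, from_standard, from_factor)
```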

### **Importing or exporting models and dictionaries (or lists)**

#### Case 1: import only a model

In [None]:
ACTION = 'import'
# ACTION = 'import' for importing a model and/or a dictionary;
# ACTION = 'export' for exporting a model and/or a dictionary.

OBJECTS_MANIPULATED = 'model_only'
# OBJECTS_MANIPULATED = 'model_only' if only a model will be manipulated.
# OBJECTS_MANIPULATED = 'dict_or_list_only' if only a dictionary or a list will be manipulated.
# OBJECTS_MANIPULATED = 'model_and_dict' if both a model and a dictionary will 
#  be manipulated.

MODEL_FILE_NAME = None
# MODEL_FILE_NAME: string with the name of the file containing the model (for 'import');
# or of the name that the exported file will have (for 'export')
# e.g. MODEL_FILE_NAME = 'model'
# WARNING: Do not add the file extension.
# Keep it in quotes. Keep MODEL_FILE_NAME = None if no model will be manipulated.

DICTIONARY_OR_LIST_FILE_NAME = None
# DICTIONARY_OR_LIST_FILE_NAME: string with the name of the file containing the dictionary 
# (for 'import');
# or of the name that the exported file will have (for 'export')
# e.g. DICTIONARY_OR_LIST_FILE_NAME = 'history_dict'
# WARNING: Do not add the file extension.
# Keep it in quotes. Keep DICTIONARY_OR_LIST_FILE_NAME = None if no dictionary 
# or list will be manipulated.

DIRECTORY_PATH = ''
# DIRECTORY_PATH: path of the directory where the model will be saved,
# or from which the model will be retrieved. If no value is provided,
# the DIRECTORY_PATH will be the root: ""
# Notice that the model and the dictionary must be stored in the same path.
# If a model and a dictionary will be exported, they will be stored in the same
# DIRECTORY_PATH.
    
MODEL_TYPE = 'arima'
# This parameter has effect only when a model will be manipulated.
# MODEL_TYPE = 'keras' for deep learning Keras/TensorFlow models with extension .h5
# MODEL_TYPE = 'tensorflow_lambda' for deep learning TensorFlow models containing 
# lambda layers. Such models are compressed as tar.gz.
# MODEL_TYPE = 'sklearn' for models from Scikit-learn (non-deep learning)
# MODEL_TYPE = 'xgb_regressor' for XGBoost regression models (non-deep learning)
# MODEL_TYPE = 'xgb_classifier' for XGBoost classification models (non-deep learning)
# MODEL_TYPE = 'arima' for ARIMA model (Statsmodels)

DICT_OR_LIST_TO_EXPORT = None
MODEL_TO_EXPORT = None 
# These two parameters have effect only when ACTION == 'export'. In this case, they
# must be declared. If ACTION == 'import', keep:
# DICT_OR_LIST_TO_EXPORT = None, 
# MODEL_TO_EXPORT = None
# If one of these objects will be exported, substitute None by the name of the object.
# e.g. if your model is stored in the global memory as 'keras_model', declare:
# MODEL_TO_EXPORT = keras_model. Notice that it must be declared without quotes, since
# it is not a string, but an object.
# For exporting a dictionary named as 'history_dict':
# DICT_OR_LIST_TO_EXPORT = history_dict

USE_COLAB_MEMORY = False
# USE_COLAB_MEMORY: this parameter has effect only when using Google Colab (otherwise, it will
# raise an error). Set USE_COLAB_MEMORY = True if you want to use the instant memory
# from Google Colaboratory: you will upload or download the file, and it will be available
# only while the kernel is running. It will be deleted when the kernel
# dies, for instance, when you close the notebook.
    
# If ACTION == 'export' and USE_COLAB_MEMORY == True, then the file will be downloaded
# to your computer (running the cell will start the download).

# Model object saved as model.
# To use another name, simply modify this object on the left of the equals sign:
model = idsw.import_export_model_list_dict (action = ACTION, objects_manipulated = OBJECTS_MANIPULATED, model_file_name = MODEL_FILE_NAME, dictionary_or_list_file_name = DICTIONARY_OR_LIST_FILE_NAME, directory_path = DIRECTORY_PATH, model_type = MODEL_TYPE, dict_or_list_to_export = DICT_OR_LIST_TO_EXPORT, model_to_export = MODEL_TO_EXPORT, use_colab_memory = USE_COLAB_MEMORY) 

#### Case 2: import only a dictionary or a list

In [None]:
ACTION = 'import'
# ACTION = 'import' for importing a model and/or a dictionary;
# ACTION = 'export' for exporting a model and/or a dictionary.

OBJECTS_MANIPULATED = 'dict_or_list_only'
# OBJECTS_MANIPULATED = 'model_only' if only a model will be manipulated.
# OBJECTS_MANIPULATED = 'dict_or_list_only' if only a dictionary or a list will be manipulated.
# OBJECTS_MANIPULATED = 'model_and_dict' if both a model and a dictionary will 
#  be manipulated.

MODEL_FILE_NAME = None
# MODEL_FILE_NAME: string with the name of the file containing the model (for 'import');
# or of the name that the exported file will have (for 'export')
# e.g. MODEL_FILE_NAME = 'model'
# WARNING: Do not add the file extension.
# Keep it in quotes. Keep MODEL_FILE_NAME = None if no model will be manipulated.

DICTIONARY_OR_LIST_FILE_NAME = None
# DICTIONARY_OR_LIST_FILE_NAME: string with the name of the file containing the dictionary 
# (for 'import');
# or of the name that the exported file will have (for 'export')
# e.g. DICTIONARY_OR_LIST_FILE_NAME = 'history_dict'
# WARNING: Do not add the file extension.
# Keep it in quotes. Keep DICTIONARY_OR_LIST_FILE_NAME = None if no dictionary 
# or list will be manipulated.

DIRECTORY_PATH = ''
# DIRECTORY_PATH: path of the directory where the model will be saved,
# or from which the model will be retrieved. If no value is provided,
# the DIRECTORY_PATH will be the root: ""
# Notice that the model and the dictionary must be stored in the same path.
# If a model and a dictionary will be exported, they will be stored in the same
# DIRECTORY_PATH.
    
MODEL_TYPE = 'arima'
# This parameter has effect only when a model will be manipulated.
# MODEL_TYPE = 'keras' for deep learning Keras/TensorFlow models with extension .h5
# MODEL_TYPE = 'tensorflow_lambda' for deep learning TensorFlow models containing 
# lambda layers. Such models are compressed as tar.gz.
# MODEL_TYPE = 'sklearn' for models from Scikit-learn (non-deep learning)
# MODEL_TYPE = 'xgb_regressor' for XGBoost regression models (non-deep learning)
# MODEL_TYPE = 'xgb_classifier' for XGBoost classification models (non-deep learning)
# MODEL_TYPE = 'arima' for ARIMA model (Statsmodels)

DICT_OR_LIST_TO_EXPORT = None
MODEL_TO_EXPORT = None 
# These two parameters have effect only when ACTION == 'export'. In this case, they
# must be declared. If ACTION == 'import', keep:
# DICT_OR_LIST_TO_EXPORT = None, 
# MODEL_TO_EXPORT = None
# If one of these objects will be exported, substitute None by the name of the object.
# e.g. if your model is stored in the global memory as 'keras_model', declare:
# MODEL_TO_EXPORT = keras_model. Notice that it must be declared without quotes, since
# it is not a string, but an object.
# For exporting a dictionary named as 'history_dict':
# DICT_OR_LIST_TO_EXPORT = history_dict

USE_COLAB_MEMORY = False
# USE_COLAB_MEMORY: this parameter has effect only when using Google Colab (otherwise, it will
# raise an error). Set USE_COLAB_MEMORY = True if you want to use the instant memory
# from Google Colaboratory: you will upload or download the file, and it will be available
# only while the kernel is running. It will be deleted when the kernel
# dies, for instance, when you close the notebook.
    
# If ACTION == 'export' and USE_COLAB_MEMORY == True, then the file will be downloaded
# to your computer (running the cell will start the download).

# Dictionary or list saved as imported_dict_or_list.
# To use another name, simply modify this object on the left of the equals sign:
imported_dict_or_list = idsw.import_export_model_list_dict (action = ACTION, objects_manipulated = OBJECTS_MANIPULATED, model_file_name = MODEL_FILE_NAME, dictionary_or_list_file_name = DICTIONARY_OR_LIST_FILE_NAME, directory_path = DIRECTORY_PATH, model_type = MODEL_TYPE, dict_or_list_to_export = DICT_OR_LIST_TO_EXPORT, model_to_export = MODEL_TO_EXPORT, use_colab_memory = USE_COLAB_MEMORY) 
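For readers who want to see what exporting and re-importing a dictionary or list involves, here is a minimal sketch with the standard-library `pickle` module. The file format and extension used by `import_export_model_list_dict` are not stated in this notebook, so the '.pkl' extension and the two helper functions below are assumptions made only for illustration.

```python
import os
import pickle
import tempfile


def export_dict_or_list(obj, file_name, directory_path=''):
    """Serialize a dictionary or list; the extension is appended here."""
    path = os.path.join(directory_path, file_name + '.pkl')
    with open(path, 'wb') as f:
        pickle.dump(obj, f)
    return path


def import_dict_or_list(file_name, directory_path=''):
    """Load a dictionary or list previously saved by export_dict_or_list."""
    path = os.path.join(directory_path, file_name + '.pkl')
    with open(path, 'rb') as f:
        return pickle.load(f)


scaling_list = [{'column': 'column1',
                 'scaler': {'scaler_obj': None,
                            'scaler_details': {'min': 2.0, 'max': 8.0}}}]

# A temporary directory keeps the example from leaving files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    export_dict_or_list(scaling_list, 'scaling_list', tmp_dir)
    restored = import_dict_or_list('scaling_list', tmp_dir)

print(restored)
```

As in the cell above, the file name is passed without extension. Pickle files should be loaded only from trusted sources, since unpickling can execute arbitrary code.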

#### Case 3: import a model and a dictionary (or a list)

In [None]:
ACTION = 'import'
# ACTION = 'import' for importing a model and/or a dictionary;
# ACTION = 'export' for exporting a model and/or a dictionary.

OBJECTS_MANIPULATED = 'model_and_dict'
# OBJECTS_MANIPULATED = 'model_only' if only a model will be manipulated.
# OBJECTS_MANIPULATED = 'dict_or_list_only' if only a dictionary or a list will be manipulated.
# OBJECTS_MANIPULATED = 'model_and_dict' if both a model and a dictionary will 
#  be manipulated.

MODEL_FILE_NAME = None
# MODEL_FILE_NAME: string with the name of the file containing the model (for 'import');
# or of the name that the exported file will have (for 'export')
# e.g. MODEL_FILE_NAME = 'model'
# WARNING: Do not add the file extension.
# Keep it in quotes. Keep MODEL_FILE_NAME = None if no model will be manipulated.

DICTIONARY_OR_LIST_FILE_NAME = None
# DICTIONARY_OR_LIST_FILE_NAME: string with the name of the file containing the dictionary 
# (for 'import');
# or of the name that the exported file will have (for 'export')
# e.g. DICTIONARY_OR_LIST_FILE_NAME = 'history_dict'
# WARNING: Do not add the file extension.
# Keep it in quotes. Keep DICTIONARY_OR_LIST_FILE_NAME = None if no dictionary 
# or list will be manipulated.

DIRECTORY_PATH = ''
# DIRECTORY_PATH: path of the directory where the model will be saved,
# or from which the model will be retrieved. If no value is provided,
# the DIRECTORY_PATH will be the root: ""
# Notice that the model and the dictionary must be stored in the same path.
# If a model and a dictionary will be exported, they will be stored in the same
# DIRECTORY_PATH.
    
MODEL_TYPE = 'arima'
# This parameter has effect only when a model will be manipulated.
# MODEL_TYPE = 'keras' for deep learning Keras/TensorFlow models with extension .h5
# MODEL_TYPE = 'tensorflow_lambda' for deep learning TensorFlow models containing 
# lambda layers. Such models are compressed as tar.gz.
# MODEL_TYPE = 'sklearn' for models from Scikit-learn (non-deep learning)
# MODEL_TYPE = 'xgb_regressor' for XGBoost regression models (non-deep learning)
# MODEL_TYPE = 'xgb_classifier' for XGBoost classification models (non-deep learning)
# MODEL_TYPE = 'arima' for ARIMA model (Statsmodels)

DICT_OR_LIST_TO_EXPORT = None
MODEL_TO_EXPORT = None 
# These two parameters have effect only when ACTION == 'export'. In this case, they
# must be declared. If ACTION == 'import', keep:
# DICT_OR_LIST_TO_EXPORT = None, 
# MODEL_TO_EXPORT = None
# If one of these objects will be exported, replace None with the name of the object
# e.g. if your model is stored in the global memory as 'keras_model' declare:
# MODEL_TO_EXPORT = keras_model. Notice that it must be declared without quotes, since
# it is not a string, but an object.
# For exporting a dictionary named as 'dict':
# DICT_OR_LIST_TO_EXPORT = dict

USE_COLAB_MEMORY = False
# USE_COLAB_MEMORY: this parameter has effect only when using Google Colab (otherwise it will
# raise an error). Set USE_COLAB_MEMORY = True if you want to use the instant memory
# from Google Colaboratory: you will upload or download the file and it will be available
# only while the kernel is running. It will be deleted when the kernel
# dies, for instance, when you close the notebook.
    
# If ACTION == 'export' and USE_COLAB_MEMORY == True, then the file will be downloaded
# to your computer (running the cell will start the download).

# Model object saved as model. Dictionary or list saved as imported_dict_or_list.
# Simply modify these objects on the left of equality:
model, imported_dict_or_list = idsw.import_export_model_list_dict (action = ACTION, objects_manipulated = OBJECTS_MANIPULATED, model_file_name = MODEL_FILE_NAME, dictionary_or_list_file_name = DICTIONARY_OR_LIST_FILE_NAME, directory_path = DIRECTORY_PATH, model_type = MODEL_TYPE, dict_or_list_to_export = DICT_OR_LIST_TO_EXPORT, model_to_export = MODEL_TO_EXPORT, use_colab_memory = USE_COLAB_MEMORY) 

#### Case 4: export a model and/or a dictionary (or a list)

In [None]:
ACTION = 'export'
# ACTION = 'import' for importing a model and/or a dictionary;
# ACTION = 'export' for exporting a model and/or a dictionary.

OBJECTS_MANIPULATED = 'model_only'
# OBJECTS_MANIPULATED = 'model_only' if only a model will be manipulated.
# OBJECTS_MANIPULATED = 'dict_or_list_only' if only a dictionary will be manipulated.
# OBJECTS_MANIPULATED = 'model_and_dict' if both a model and a dictionary will 
#  be manipulated.

MODEL_FILE_NAME = None
# MODEL_FILE_NAME: string with the name of the file containing the model (for 'import');
# or of the name that the exported file will have (for 'export')
# e.g. MODEL_FILE_NAME = 'model'
# WARNING: Do not add the file extension.
# Keep it in quotes. Keep MODEL_FILE_NAME = None if no model will be manipulated.

DICTIONARY_OR_LIST_FILE_NAME = None
# DICTIONARY_OR_LIST_FILE_NAME: string with the name of the file containing the dictionary 
# (for 'import');
# or of the name that the exported file will have (for 'export')
# e.g. DICTIONARY_OR_LIST_FILE_NAME = 'history_dict'
# WARNING: Do not add the file extension.
# Keep it in quotes. Keep DICTIONARY_OR_LIST_FILE_NAME = None if no dictionary 
# or list will be manipulated.

DIRECTORY_PATH = ''
# DIRECTORY_PATH: path of the directory where the model will be saved,
# or from which the model will be retrieved. If no value is provided,
# the DIRECTORY_PATH will be the root: ""
# Notice that the model and the dictionary must be stored in the same path.
# If a model and a dictionary will be exported, they will be stored in the same
# DIRECTORY_PATH.
    
MODEL_TYPE = 'arima'
# This parameter has effect only when a model will be manipulated.
# MODEL_TYPE = 'keras' for deep learning Keras/TensorFlow models with extension .h5
# MODEL_TYPE = 'tensorflow_lambda' for deep learning TensorFlow models containing 
# lambda layers. Such models are compressed as tar.gz.
# MODEL_TYPE = 'sklearn' for models from Scikit-learn (non-deep learning)
# MODEL_TYPE = 'xgb_regressor' for XGBoost regression models (non-deep learning)
# MODEL_TYPE = 'xgb_classifier' for XGBoost classification models (non-deep learning)
# MODEL_TYPE = 'arima' for ARIMA model (Statsmodels)

DICT_OR_LIST_TO_EXPORT = None
MODEL_TO_EXPORT = None 
# These two parameters have effect only when ACTION == 'export'. In this case, they
# must be declared. If ACTION == 'import', keep:
# DICT_OR_LIST_TO_EXPORT = None, 
# MODEL_TO_EXPORT = None
# If one of these objects will be exported, replace None with the name of the object
# e.g. if your model is stored in the global memory as 'keras_model' declare:
# MODEL_TO_EXPORT = keras_model. Notice that it must be declared without quotes, since
# it is not a string, but an object.
# For exporting a dictionary named as 'dict':
# DICT_OR_LIST_TO_EXPORT = dict

USE_COLAB_MEMORY = False
# USE_COLAB_MEMORY: this parameter has effect only when using Google Colab (otherwise it will
# raise an error). Set USE_COLAB_MEMORY = True if you want to use the instant memory
# from Google Colaboratory: you will upload or download the file and it will be available
# only while the kernel is running. It will be deleted when the kernel
# dies, for instance, when you close the notebook.
    
# If ACTION == 'export' and USE_COLAB_MEMORY == True, then the file will be downloaded
# to your computer (running the cell will start the download).

idsw.import_export_model_list_dict (action = ACTION, objects_manipulated = OBJECTS_MANIPULATED, model_file_name = MODEL_FILE_NAME, dictionary_or_list_file_name = DICTIONARY_OR_LIST_FILE_NAME, directory_path = DIRECTORY_PATH, model_type = MODEL_TYPE, dict_or_list_to_export = DICT_OR_LIST_TO_EXPORT, model_to_export = MODEL_TO_EXPORT, use_colab_memory = USE_COLAB_MEMORY) 
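
The export and import cases above amount to a save-then-load round trip. The snippet below is only a minimal sketch of that idea for a dictionary, assuming Python's standard `pickle` module as the serialization format; the format actually used by `idsw.import_export_model_list_dict` is not stated in this notebook, and the names `history_dict` and `file_path` are hypothetical.

```python
import os
import pickle
import tempfile

# Hypothetical dictionary to export (e.g. a training history).
history_dict = {'loss': [0.9, 0.5, 0.3], 'epochs': 3}

# Temporary directory standing in for DIRECTORY_PATH.
directory_path = tempfile.mkdtemp()
file_path = os.path.join(directory_path, 'history_dict.pkl')

# "Export": serialize the object to disk.
with open(file_path, 'wb') as f:
    pickle.dump(history_dict, f)

# "Import": read the object back into memory.
with open(file_path, 'rb') as f:
    imported_dict_or_list = pickle.load(f)
```

Only unpickle files that come from a trusted source, since loading a pickle can execute arbitrary code.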

### **Filtering (selecting); ordering; or renaming columns from the dataframe**

In [None]:
DATASET = dataset #Alternatively: object containing the dataset to be analyzed

MODE = 'select_or_order_columns'
# MODE = 'select_or_order_columns' for filtering only the list of columns passed as COLUMNS_LIST,
# and setting a new column order. In this mode, you can pass the columns in any order: 
# the order of elements on the list will be the new order of columns.

# MODE = 'rename_columns' for renaming the columns with the names passed as COLUMNS_LIST. In this
# mode, the list must have same length and same order of the columns of the dataframe. That is because
# the columns will sequentially receive the names in the list. So, a mismatching of positions
# will result into columns with incorrect names.

COLUMNS_LIST = ['column1', 'column2', 'column3']
# COLUMNS_LIST = list of strings containing the names (headers) of the columns to select
# (filter); or to be set as the new columns' names, according to the selected mode.
# For instance: COLUMNS_LIST = ['col1', 'col2', 'col3'] will 
# select columns 'col1', 'col2', and 'col3' (or rename the columns with these names). 
# Declare the names inside quotes.
# Simply substitute the list by the list of columns that you want to select; or the
# list of the new names you want to give to the dataset columns.

# New dataframe saved as new_df. Simply modify this object on the left of equality:
new_df = etl.select_order_or_rename_columns (df = DATASET, columns_list = COLUMNS_LIST, mode = MODE)
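
To make the difference between the two modes concrete, the sketch below reproduces the described behavior with plain pandas. It is an illustration under assumptions, not the implementation of `etl.select_order_or_rename_columns`; the toy dataframe and the variable names are hypothetical.

```python
import pandas as pd

# Hypothetical toy dataframe used only for illustration.
dataset = pd.DataFrame({'column1': [1, 2], 'column2': [3, 4], 'column3': [5, 6]})

# Counterpart of MODE = 'select_or_order_columns': keep only the listed
# columns, in the order given by the list.
columns_list = ['column3', 'column1']
selected_df = dataset[columns_list].copy()

# Counterpart of MODE = 'rename_columns': the list must name every column,
# in the dataframe's current order, because names are assigned by position.
new_names = ['a', 'b', 'c']
if len(new_names) != len(dataset.columns):
    raise ValueError("new_names must have one entry per column")
renamed_df = dataset.copy()
renamed_df.columns = new_names
```

Working on copies keeps the original `dataset` unchanged, which mirrors the notebook's pattern of saving the result as a new object.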

### **Renaming specific columns from the dataframe; or cleaning columns' labels**
- The function `select_order_or_rename_columns`, in 'rename_columns' mode, requires the user to pass a list containing the names of all columns.
- That list must also keep the columns in the correct order (the order in which they appear in the dataframe).
- The function `rename_or_clean_columns_labels`, in contrast, may manipulate one or several columns per call, and does not depend on their order.
- It can also be used for cleaning the columns' labels: converting all column names to upper or lower case; replacing substrings in column names; or eliminating trailing and leading white spaces or characters from the labels.

In [None]:
DATASET = dataset #Alternatively: object containing the dataset to be analyzed

MODE = 'set_new_names'
# MODE = 'set_new_names' will change the columns according to the specifications in
# LIST_OF_COLUMNS_LABELS.

# MODE = 'capitalize_columns' will capitalize all columns names (i.e., they will be put in
# upper case). e.g. a column named 'column' will be renamed as 'COLUMN'

# MODE = 'lowercase_columns' will lower the case of all columns names. e.g. a column named
# 'COLUMN' will be renamed as 'column'.

# MODE = 'replace_substring' will search on the columns names (strings) for the 
# SUBSTRING_TO_BE_REPLACED (which may be a character or a string); and will replace it by 
# NEW_SUBSTRING_FOR_REPLACEMENT (which again may be either a character or a string). 
# Numbers (integers or floats) will be automatically converted into strings.
# As an example, consider the default situation where we search for a whitespace ' ' and replace it
# by underscore '_': SUBSTRING_TO_BE_REPLACED = ' ', NEW_SUBSTRING_FOR_REPLACEMENT = '_'  
# In this case, a column named 'new column' will be renamed as 'new_column'.

# MODE = 'trim' will remove all trailing or leading whitespaces from column names.
# e.g. a column named as ' col1 ' will be renamed as 'col1'; 'col2 ' will be renamed as
# 'col2'; and ' col3' will be renamed as 'col3'.

# MODE = 'eliminate_trailing_characters' will eliminate a defined trailing and leading 
# substring from the columns' names. 
# The substring must be indicated as TRAILING_SUBSTRING, and its default, when no value
# is provided, is equivalent to mode = 'trim' (eliminate white spaces). 
# e.g., if TRAILING_SUBSTRING = '_test' and you have a column named 'col_test', it will be 
# renamed as 'col'.

SUBSTRING_TO_BE_REPLACED = ' '
NEW_SUBSTRING_FOR_REPLACEMENT = '_'

TRAILING_SUBSTRING = None

LIST_OF_COLUMNS_LABELS = [
    
    {'column_name': None, 'new_column_name': None}, 
    {'column_name': None, 'new_column_name': None},
    {'column_name': None, 'new_column_name': None}, 
    {'column_name': None, 'new_column_name': None},
    {'column_name': None, 'new_column_name': None}, 
    {'column_name': None, 'new_column_name': None},
    {'column_name': None, 'new_column_name': None}, 
    {'column_name': None, 'new_column_name': None}, 
    {'column_name': None, 'new_column_name': None},
    {'column_name': None, 'new_column_name': None}, 
    {'column_name': None, 'new_column_name': None}
    
]
# LIST_OF_COLUMNS_LABELS = [{'column_name': None, 'new_column_name': None}]
# This is a list of dictionaries, where each dictionary contains two key-value pairs:
# the first one contains the original column name; and the second one contains the new name
# that will substitute the original one. The function will loop through all dictionaries in
# this list, read the value stored in key 'column_name', and replace that column label 
# with the corresponding value stored in key 'new_column_name'.
    
# The object LIST_OF_COLUMNS_LABELS must be declared as a list, 
# in brackets, even if there is a single dictionary.
# Always use the same keys: 'column_name' for the original label; 
# and 'new_column_name' for the corresponding new label.
# Notice that this function will not search substrings: it will substitute a value only when
# there is perfect correspondence between the string in 'column_name' and one of the columns
# labels. So, the cases (upper or lower) must be the same.
    
# If you want, you can remove elements (dictionaries) from the list to declare fewer elements;
# and you can also add more elements (dictionaries) to the lists, if you need to replace more
# values.
# Simply put a comma after the last element from the list and declare a new dictionary, keeping the
# same keys: {'column_name': original_col, 'new_column_name': new_col}, 
# where original_col and new_col represent the strings for searching and replacement 
# (If one of the keys contains None, the new dictionary will be ignored).
# Example: LIST_OF_COLUMNS_LABELS = [{'column_name': 'col1', 'new_column_name': 'col'}] will
# rename 'col1' as 'col'.


# New dataframe saved as new_df. Simply modify this object on the left of equality:
new_df = etl.rename_or_clean_columns_labels (df = DATASET, mode = MODE, substring_to_be_replaced = SUBSTRING_TO_BE_REPLACED, new_substring_for_replacement = NEW_SUBSTRING_FOR_REPLACEMENT, trailing_substring = TRAILING_SUBSTRING, list_of_columns_labels = LIST_OF_COLUMNS_LABELS)
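
Each mode described above has a close counterpart in plain pandas. The sketch below shows those counterparts on a toy dataframe; it is not the code of `etl.rename_or_clean_columns_labels`, and the dataframe, the helper `strip_substring`, and the result names are hypothetical.

```python
import pandas as pd

# Hypothetical dataframe with messy column labels.
dataset = pd.DataFrame({' Col 1 ': [1], 'col2_test': [2], 'COL3': [3]})

# 'set_new_names': exact-match renaming; dictionaries holding None are skipped.
list_of_columns_labels = [
    {'column_name': 'COL3', 'new_column_name': 'col3'},
    {'column_name': None, 'new_column_name': None},
]
mapping = {
    d['column_name']: d['new_column_name']
    for d in list_of_columns_labels
    if d['column_name'] is not None and d['new_column_name'] is not None
}
renamed = dataset.rename(columns=mapping)

# 'capitalize_columns' and 'lowercase_columns'.
upper = dataset.rename(columns=str.upper)
lower = dataset.rename(columns=str.lower)

# 'trim', followed by 'replace_substring' (' ' replaced by '_').
trimmed = dataset.rename(columns=str.strip)
replaced = trimmed.rename(columns=lambda c: c.replace(' ', '_'))


def strip_substring(label, substring):
    """Remove `substring` from the start and the end of `label`, if present."""
    if not substring:
        return label.strip()  # no substring given: behave like 'trim'
    if label.startswith(substring):
        label = label[len(substring):]
    if label.endswith(substring):
        label = label[:-len(substring)]
    return label


# 'eliminate_trailing_characters' with TRAILING_SUBSTRING = '_test'.
stripped = dataset.rename(columns=lambda c: strip_substring(c, '_test'))
```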

### **Characterizing the dataframe**

In [None]:
DATASET = dataset #Alternatively: object containing the dataset to be analyzed

#New objects saved as df_shape, df_columns_array, df_dtypes, df_general_statistics, df_missing_values.
# Simply modify these objects on the left of equality:
df_shape, df_columns_array, df_dtypes, df_general_statistics, df_missing_values = etl.df_general_characterization (df = DATASET)
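
For orientation, the five returned objects resemble what standard pandas attributes and methods provide. The sketch below assumes that correspondence; the exact contents returned by `etl.df_general_characterization` may differ, and the toy dataframe is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical dataframe with one missing value per column.
dataset = pd.DataFrame({'a': [1.0, 2.0, np.nan], 'b': ['x', None, 'z']})

df_shape = dataset.shape                     # (rows, columns)
df_columns_array = dataset.columns.to_numpy()  # column labels as an array
df_dtypes = dataset.dtypes                   # data type of each column
df_general_statistics = dataset.describe()   # summary of numeric columns
df_missing_values = dataset.isna().sum()     # missing values per column
```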

### **Obtaining correlation plots**

In [None]:
DATASET = dataset #Alternatively: object containing the dataset to be analyzed

SHOW_MASKED_PLOT = True
#SHOW_MASKED_PLOT = True - keep as True if you want to see a cleaned version of the plot
# where a mask is applied. Set SHOW_MASKED_PLOT = False to show the plot 
# without the mask.

RESPONSES_TO_RETURN_CORR = None
#RESPONSES_TO_RETURN_CORR - keep as None to return the full correlation matrix.
# If you want to display the correlations for a particular group of features, input them
# as a list, even if this list contains a single element. Examples:
# RESPONSES_TO_RETURN_CORR = ['response1'] for a single response
# RESPONSES_TO_RETURN_CORR = ['response1', 'response2', 'response3'] for multiple
# responses. Notice that 'response1',... should be substituted by the name (string, in quotes)
# of a column of the dataset that represents a response variable.
# WARNING: The returned coefficients will be ordered according to the order of the list
# of responses, i.e., they will be ordered first by 'response1'.

SET_RETURNED_LIMIT = None
# SET_RETURNED_LIMIT = None - This variable will only present effects in case you have
# provided a response feature to be returned. In this case, keep set_returned_limit = None
# to return all of the correlation coefficients; or, alternatively, 
# provide an integer number to limit the total of coefficients returned. 
# e.g. if set_returned_limit = 10, only the ten highest coefficients will be returned. 

EXPORT_PNG = False
# Set EXPORT_PNG = False if you do not want to export the obtained image;
# Set EXPORT_PNG = True to export the obtained image.

DIRECTORY_TO_SAVE = None
# This parameter has effect only if EXPORT_PNG = True.
# (string, in quotes): input the path of the directory 
# (e.g. folder path) where the file will be stored. e.g. DIRECTORY_TO_SAVE = "" 
# or DIRECTORY_TO_SAVE = "folder"
# If EXPORT_PNG = True and DIRECTORY_TO_SAVE = None, the file will be saved in the root
# path, DIRECTORY_TO_SAVE = ""

FILE_NAME = None
# This parameter has effect only if EXPORT_PNG = True.
# (string, in quotes): input the name you want for the file without the 
# extension, which will be 'png'. e.g. FILE_NAME = "my_image" will save a file 'my_image.png' 
# If EXPORT_PNG = True and FILE_NAME = None, the file will be saved as:
# 'correlation_plot.png'
# WARNING: if there is already a file in the path DIRECTORY_TO_SAVE saved as FILE_NAME,
# the file will be overwritten.

PNG_RESOLUTION_DPI = 330
# This parameter has effect only if EXPORT_PNG = True.
# Alternatively, input an integer that will correspond to the resolution of the exported
# image in dpi. If PNG_RESOLUTION_DPI = None, it will be set as 330.


#New dataframe saved as correlation_matrix. Simply modify this object on the left of equality:
correlation_matrix = etl.correlation_plot (df = DATASET, show_masked_plot = SHOW_MASKED_PLOT, responses_to_return_corr = RESPONSES_TO_RETURN_CORR, set_returned_limit = SET_RETURNED_LIMIT, export_png = EXPORT_PNG, directory_to_save = DIRECTORY_TO_SAVE, file_name = FILE_NAME, png_resolution_dpi = PNG_RESOLUTION_DPI)
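
The parameters above combine three ideas: a full correlation matrix, a mask that hides the redundant half of it, and a ranked subset for chosen response columns. The sketch below illustrates them with pandas and NumPy only (no plotting). It assumes Pearson correlation and descending ordering, which this notebook does not confirm for `etl.correlation_plot`; the data and names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'response1' depends strongly on 'feature1'.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
dataset = pd.DataFrame({
    'feature1': x,
    'feature2': rng.normal(size=200),
    'response1': 2.0 * x + rng.normal(scale=0.1, size=200),
})

# Full correlation matrix (Pearson by default in pandas).
correlation_matrix = dataset.corr()

# Mask the diagonal and the upper triangle, which duplicate the lower one.
mask = np.triu(np.ones(correlation_matrix.shape, dtype=bool))
masked_matrix = correlation_matrix.mask(mask)

# Keep only the response columns, order by the first response,
# and limit the number of returned coefficients.
responses = ['response1']
limit = 2
by_response = (
    correlation_matrix[responses]
    .sort_values(by=responses, ascending=False)
    .head(limit)
)
```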

### **Obtaining scatter plots and simple linear regressions**

In [None]:
DATA_IN_SAME_COLUMN = False

# Parameters to input when DATA_IN_SAME_COLUMN = True:
DATASET = None #Alternatively: object containing the dataset to be analyzed (e.g. DATASET = dataset)
COLUMN_WITH_PREDICT_VAR_X = 'X' # Alternatively: correct name for X-column
COLUMN_WITH_RESPONSE_VAR_Y = 'Y' # Alternatively: correct name for Y-column
COLUMN_WITH_LABELS = 'label_column' # Alternatively: correct name for column with the labels or groups

# DATA_IN_SAME_COLUMN = False: set as True if all the values to plot are in a same column.
# If DATA_IN_SAME_COLUMN = True, you must specify the dataframe containing the data as DATASET;
# the column containing the predict variable (X) as COLUMN_WITH_PREDICT_VAR_X; the column 
# containing the responses to plot (Y) as COLUMN_WITH_RESPONSE_VAR_Y; and the column 
# containing the labels (subgroup) indication as COLUMN_WITH_LABELS. 
# DATASET is an object, so do not declare it in quotes. The other three arguments (columns' names) 
# are strings, so declare in quotes. 

# Example: suppose you have a dataframe saved as dataset, and two groups A and B to compare. 
# All the results for both groups are in a column named 'results', which will be plotted against
# the time, saved as 'time' (X = 'time'; Y = 'results'). If the result is for
# an entry from group A, then a column named 'group' has the value 'A'. If it is for group B,
# column 'group' shows the value 'B'. In this example:
# DATA_IN_SAME_COLUMN = True,
# DATASET = dataset,
# COLUMN_WITH_PREDICT_VAR_X = 'time',
# COLUMN_WITH_RESPONSE_VAR_Y = 'results', 
# COLUMN_WITH_LABELS = 'group'
# If you want to declare a list of dictionaries, keep DATA_IN_SAME_COLUMN = False and keep
# DATASET = None (the other arguments may be set as None, but it is not mandatory: 
# COLUMN_WITH_PREDICT_VAR_X = None, COLUMN_WITH_RESPONSE_VAR_Y = None, COLUMN_WITH_LABELS = None).


# Parameter to input when DATA_IN_SAME_COLUMN = False:
LIST_OF_DICTIONARIES_WITH_SERIES_TO_ANALYZE = [
    
    {'x': None, 'y': None, 'lab': None}, 
    {'x': None, 'y': None, 'lab': None}, 
    {'x': None, 'y': None, 'lab': None}, 
    {'x': None, 'y': None, 'lab': None}, 
    {'x': None, 'y': None, 'lab': None}, 
    {'x': None, 'y': None, 'lab': None}, 
    {'x': None, 'y': None, 'lab': None}, 
    {'x': None, 'y': None, 'lab': None}, 
    {'x': None, 'y': None, 'lab': None}, 
    {'x': None, 'y': None, 'lab': None}, 
    {'x': None, 'y': None, 'lab': None}
    
]
# LIST_OF_DICTIONARIES_WITH_SERIES_TO_ANALYZE: if data is already converted to series, lists
# or arrays, provide them as a list of dictionaries. It must be declared as a list, in brackets,
# even if there is a single dictionary.
# Always use the same keys: 'x' for the X-series (predict variables); 'y' for the Y-series
# (response variables); and 'lab' for the labels. If you do not want to declare a series, simply
# keep it as None, but do not remove or rename a key (ALWAYS USE THE KEYS SHOWN AS MODEL).
# If you want, you can remove elements (dictionaries) from the list to declare fewer elements;
# and you can also add more elements (dictionaries) to the list, if you need to plot more series.
# Simply put a comma after the last element of the list and declare a new dictionary, keeping the
# same keys: {'x': x_series, 'y': y_series, 'lab': label}, where x_series, y_series and label
# represent the series and label of the added dictionary (you can pass 'lab': None, but if 
# 'x' or 'y' is None, the new dictionary will be ignored).

# Examples:
# LIST_OF_DICTIONARIES_WITH_SERIES_TO_ANALYZE = 
# [{'x': DATASET['X'], 'y': DATASET['Y'], 'lab': 'label'}]
# will plot a single variable. In turn:
# LIST_OF_DICTIONARIES_WITH_SERIES_TO_ANALYZE = 
# [{'x': DATASET['X'], 'y': DATASET['Y1'], 'lab': 'label'}, {'x': DATASET['X'], 'y': DATASET['Y2'], 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}]
# will plot two series, Y1 x X and Y2 x X.
# Notice that all dictionaries where 'x' or 'y' are None are automatically ignored.
# If None is provided to 'lab', an automatic label will be generated.


X_AXIS_ROTATION = 70
#Rotation of X axis labels. Alternatively, insert any numeric value from 0 to 90 (degrees).
Y_AXIS_ROTATION = 0
#Rotation of Y axis labels. Alternatively, insert any numeric value from 0 to 90 (degrees).

SHOW_LINEAR_REG = True
# Set SHOW_LINEAR_REG = True to plot the linear regression lines and show 
# the linear regression calculated for each pair Y x X (i.e., each fitted equation 
# Y = aX + b, as well as the R² coefficient calculated). 
# Set SHOW_LINEAR_REG = False to omit both the linear regression lines on the graphic, and
# the fitted equations and R² coefficients obtained.

GRID = True #Alternatively: True or False
# If GRID = False, no grid lines are shown in the graphic.
ADD_SPLINE_LINES = False #Alternatively: True or False
# If ADD_SPLINE_LINES = False, no lines connecting the successive values are shown.
# Since we are obtaining a scatter plot, there is no meaning in omitting the dots,
# as we can do for the time series visualization function.
HORIZONTAL_AXIS_TITLE = None #Alternatively: string inside quotes for horizontal title
VERTICAL_AXIS_TITLE = None #Alternatively: string inside quotes for vertical title
PLOT_TITLE = None #Alternatively: string inside quotes for graphic title
# e.g. HORIZONTAL_AXIS_TITLE = "X", VERTICAL_AXIS_TITLE = "Y", PLOT_TITLE = "YxX"

EXPORT_PNG = False
# Set EXPORT_PNG = False if you do not want to export the obtained image;
# Set EXPORT_PNG = True to export the obtained image.
DIRECTORY_TO_SAVE = None
# This parameter has effect only if EXPORT_PNG = True.
# (string, in quotes): input the path of the directory 
# (e.g. folder path) where the file will be stored. e.g. DIRECTORY_TO_SAVE = "" 
# or DIRECTORY_TO_SAVE = "folder"
# If EXPORT_PNG = True and DIRECTORY_TO_SAVE = None, the file will be saved in the root
# path, DIRECTORY_TO_SAVE = ""
FILE_NAME = None
# This parameter has effect only if EXPORT_PNG = True.
# (string, in quotes): input the name you want for the file without the 
# extension, which will be 'png'. e.g. FILE_NAME = "my_image" will save a file 'my_image.png' 
# If EXPORT_PNG = True and FILE_NAME = None, the file will be saved as:
# 'scatter_plot_lin_reg.png'
# WARNING: if there is already a file in the path DIRECTORY_TO_SAVE saved as FILE_NAME,
# the file will be overwritten.
PNG_RESOLUTION_DPI = 330
# This parameter has effect only if EXPORT_PNG = True.
# Alternatively, input an integer that will correspond to the resolution of the exported
# image in dpi. If PNG_RESOLUTION_DPI = None, it will be set as 330.


# JSON-formatted list containing all series converted to NumPy arrays, 
#  with timestamps parsed as datetimes, and all the information regarding the linear regressions, 
# including the predicted values for plotting, returned as list_of_dictionaries_with_series_and_predictions. 
# Simply modify this object on the left of equality:
list_of_dictionaries_with_series_and_predictions = etl.scatter_plot_lin_reg (data_in_same_column = DATA_IN_SAME_COLUMN, df = DATASET, column_with_predict_var_x = COLUMN_WITH_PREDICT_VAR_X, column_with_response_var_y = COLUMN_WITH_RESPONSE_VAR_Y, column_with_labels = COLUMN_WITH_LABELS, list_of_dictionaries_with_series_to_analyze = LIST_OF_DICTIONARIES_WITH_SERIES_TO_ANALYZE, x_axis_rotation = X_AXIS_ROTATION, y_axis_rotation = Y_AXIS_ROTATION, show_linear_reg = SHOW_LINEAR_REG, grid = GRID, add_splines_lines = ADD_SPLINE_LINES, horizontal_axis_title = HORIZONTAL_AXIS_TITLE, vertical_axis_title = VERTICAL_AXIS_TITLE, plot_title = PLOT_TITLE, export_png = EXPORT_PNG, directory_to_save = DIRECTORY_TO_SAVE, file_name = FILE_NAME, png_resolution_dpi = PNG_RESOLUTION_DPI)
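
The fit Y = aX + b and the R² coefficient mentioned for SHOW_LINEAR_REG can be reproduced with an ordinary least-squares line. The sketch below uses `numpy.polyfit` on hypothetical data; it shows the calculation only and is not necessarily how `etl.scatter_plot_lin_reg` computes its results.

```python
import numpy as np

# Hypothetical X and Y series.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 3.1, 4.9, 7.2, 8.8])

# Least-squares straight line: y ~ slope * x + intercept.
slope, intercept = np.polyfit(x, y, deg=1)
y_pred = slope * x + intercept

# Coefficient of determination: R² = 1 - SS_res / SS_tot.
ss_res = np.sum((y - y_pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot
```

An R² close to 1 indicates that the straight line explains most of the variance of Y for that pair of series.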

### **Visualizing time series**

In [None]:
DATA_IN_SAME_COLUMN = False

# Parameters to input when DATA_IN_SAME_COLUMN = True:
DATASET = None #Alternatively: object containing the dataset to be analyzed (e.g. DATASET = dataset)
COLUMN_WITH_PREDICT_VAR_X = 'X' # Alternatively: correct name for X-column
COLUMN_WITH_RESPONSE_VAR_Y = 'Y' # Alternatively: correct name for Y-column
COLUMN_WITH_LABELS = 'label_column' # Alternatively: correct name for column with the labels or groups

# DATA_IN_SAME_COLUMN = False: set as True if all the values to plot are in a same column.
# If DATA_IN_SAME_COLUMN = True, you must specify the dataframe containing the data as DATASET;
# the column containing the predict variable (X) as COLUMN_WITH_PREDICT_VAR_X; the column 
# containing the responses to plot (Y) as COLUMN_WITH_RESPONSE_VAR_Y; and the column 
# containing the labels (subgroup) indication as COLUMN_WITH_LABELS. 
# DATASET is an object, so do not declare it in quotes. The other three arguments (columns' names) 
# are strings, so declare in quotes. 

# Example: suppose you have a dataframe saved as dataset, and two groups A and B to compare. 
# All the results for both groups are in a column named 'results', which will be plotted against
# the time, saved as 'time' (X = 'time'; Y = 'results'). If the result is for
# an entry from group A, then a column named 'group' has the value 'A'. If it is for group B,
# column 'group' shows the value 'B'. In this example:
# DATA_IN_SAME_COLUMN = True,
# DATASET = dataset,
# COLUMN_WITH_PREDICT_VAR_X = 'time',
# COLUMN_WITH_RESPONSE_VAR_Y = 'results', 
# COLUMN_WITH_LABELS = 'group'
# If you want to declare a list of dictionaries, keep DATA_IN_SAME_COLUMN = False and keep
# DATASET = None (the other arguments may be set as None, but it is not mandatory: 
# COLUMN_WITH_PREDICT_VAR_X = None, COLUMN_WITH_RESPONSE_VAR_Y = None, COLUMN_WITH_LABELS = None).


# Parameter to input when DATA_IN_SAME_COLUMN = False:
LIST_OF_DICTIONARIES_WITH_SERIES_TO_ANALYZE = [
    
    {'x': None, 'y': None, 'lab': None}, 
    {'x': None, 'y': None, 'lab': None}, 
    {'x': None, 'y': None, 'lab': None}, 
    {'x': None, 'y': None, 'lab': None}, 
    {'x': None, 'y': None, 'lab': None}, 
    {'x': None, 'y': None, 'lab': None}, 
    {'x': None, 'y': None, 'lab': None}, 
    {'x': None, 'y': None, 'lab': None}, 
    {'x': None, 'y': None, 'lab': None}, 
    {'x': None, 'y': None, 'lab': None}, 
    {'x': None, 'y': None, 'lab': None}
    
]
# LIST_OF_DICTIONARIES_WITH_SERIES_TO_ANALYZE: if data is already converted to series, lists
# or arrays, provide them as a list of dictionaries. It must be declared as a list, in brackets,
# even if there is a single dictionary.
# Always use the same keys: 'x' for the X-series (predict variables); 'y' for the Y-series
# (response variables); and 'lab' for the labels. If you do not want to declare a series, simply
# keep it as None, but do not remove or rename a key (ALWAYS USE THE KEYS SHOWN AS MODEL).
# If you want, you can remove elements (dictionaries) from the list to declare fewer elements;
# and you can also add more elements (dictionaries) to the list, if you need to plot more series.
# Simply put a comma after the last element of the list and declare a new dictionary, keeping the
# same keys: {'x': x_series, 'y': y_series, 'lab': label}, where x_series, y_series and label
# represent the series and label of the added dictionary (you can pass 'lab': None, but if 
# 'x' or 'y' is None, the new dictionary will be ignored).

# Examples:
# LIST_OF_DICTIONARIES_WITH_SERIES_TO_ANALYZE = 
# [{'x': DATASET['X'], 'y': DATASET['Y'], 'lab': 'label'}]
# will plot a single variable. In turn:
# LIST_OF_DICTIONARIES_WITH_SERIES_TO_ANALYZE = 
# [{'x': DATASET['X'], 'y': DATASET['Y1'], 'lab': 'label'}, {'x': DATASET['X'], 'y': DATASET['Y2'], 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}]
# will plot two series, Y1 x X and Y2 x X.
# Notice that all dictionaries where 'x' or 'y' are None are automatically ignored.
# If None is provided to 'lab', an automatic label will be generated.


X_AXIS_ROTATION = 70
#Rotation of X axis labels. Alternatively, insert any numeric value from 0 to 90 (degrees).
Y_AXIS_ROTATION = 0
#Rotation of Y axis labels. Alternatively, insert any numeric value from 0 to 90 (degrees).
GRID = True #Alternatively: True or False
# If GRID = False, no grid lines are shown in the graphic.
ADD_SPLINE_LINES = True #Alternatively: True or False
# If ADD_SPLINE_LINES = False, no lines connecting the successive values are shown.
# Unlike the scatter plot function, this function also allows omitting the dots
# (see ADD_SCATTER_DOTS).
ADD_SCATTER_DOTS = False
# If ADD_SCATTER_DOTS = False, no dots representing the data points are shown.
HORIZONTAL_AXIS_TITLE = None #Alternatively: string inside quotes for horizontal title
VERTICAL_AXIS_TITLE = None #Alternatively: string inside quotes for vertical title
PLOT_TITLE = None #Alternatively: string inside quotes for graphic title
# e.g. HORIZONTAL_AXIS_TITLE = "X", VERTICAL_AXIS_TITLE = "Y", PLOT_TITLE = "YxX"

EXPORT_PNG = False
# Set EXPORT_PNG = False if you do not want to export the obtained image;
# Set EXPORT_PNG = True to export the obtained image.
DIRECTORY_TO_SAVE = None
# This parameter has effect only if EXPORT_PNG = True.
# (string, in quotes): input the path of the directory 
# (e.g. folder path) where the file will be stored. e.g. DIRECTORY_TO_SAVE = "" 
# or DIRECTORY_TO_SAVE = "folder"
# If EXPORT_PNG = True and DIRECTORY_TO_SAVE = None, the file will be saved in the root
# path, DIRECTORY_TO_SAVE = ""
FILE_NAME = None
# This parameter has effect only if EXPORT_PNG = True.
# (string, in quotes): input the name you want for the file without the 
# extension, which will be 'png'. e.g. FILE_NAME = "my_image" will save a file 'my_image.png' 
# If EXPORT_PNG = True and FILE_NAME = None, the file will be saved as:
# 'time_series_vis.png'
# WARNING: if there is already a file in the path DIRECTORY_TO_SAVE saved as FILE_NAME,
# the file will be overwritten.
PNG_RESOLUTION_DPI = 330
# This parameter has effect only if EXPORT_PNG = True.
# Alternatively, input an integer that will correspond to the resolution of the exported
# image in dpi. If PNG_RESOLUTION_DPI = None, it will be set as 330.


etl.time_series_vis (data_in_same_column = DATA_IN_SAME_COLUMN, df = DATASET, column_with_predict_var_x = COLUMN_WITH_PREDICT_VAR_X, column_with_response_var_y = COLUMN_WITH_RESPONSE_VAR_Y, column_with_labels = COLUMN_WITH_LABELS, list_of_dictionaries_with_series_to_analyze = LIST_OF_DICTIONARIES_WITH_SERIES_TO_ANALYZE, x_axis_rotation = X_AXIS_ROTATION, y_axis_rotation = Y_AXIS_ROTATION, grid = GRID, add_splines_lines = ADD_SPLINE_LINES, add_scatter_dots = ADD_SCATTER_DOTS, horizontal_axis_title = HORIZONTAL_AXIS_TITLE, vertical_axis_title = VERTICAL_AXIS_TITLE, plot_title = PLOT_TITLE, export_png = EXPORT_PNG, directory_to_save = DIRECTORY_TO_SAVE, file_name = FILE_NAME, png_resolution_dpi = PNG_RESOLUTION_DPI)

### **Visualizing histograms**

In [None]:
# REMEMBER: A histogram is the representation of a statistical distribution 
# of a given variable.

DATASET = dataset #Alternatively: object containing the dataset to be analyzed

COLUMN_TO_ANALYZE = 'analyzed_variable'
#Alternatively: other column in quotes, substituting 'analyzed_variable'
# e.g., if the analyzed variable is in a column named 'column1':
# COLUMN_TO_ANALYZE = 'column1'

TOTAL_OF_BINS = 10
# This parameter must be an integer: it is the total number of bins of the 
# histogram, i.e., the number of intervals into which the sample space will be
# divided.
# Manually adjust this parameter to obtain more or less resolution of the statistical
# distribution: fewer bins tend to result in higher counts per bin, since
# a larger interval of values is grouped. After modifying the total of bins, do not forget
# to adjust the bar width in SET_GRAPHIC_BAR_WIDTH.
# Examples: TOTAL_OF_BINS = 50, to divide the sample space into 50 equally-separated 
# intervals; TOTAL_OF_BINS = 10 to divide it into 10 intervals; TOTAL_OF_BINS = 100 to
# divide it into 100 intervals.
NORMAL_CURVE_OVERLAY = True
#Alternatively: set NORMAL_CURVE_OVERLAY = True to show a normal curve overlaying the
# histogram; or set NORMAL_CURVE_OVERLAY = False to omit the normal curve (show only
# the histogram).

X_AXIS_ROTATION = 0
#Rotation of X axis labels. Alternatively, insert any numeric value from 0 to 90 (degrees).
Y_AXIS_ROTATION = 0
#Rotation of Y axis labels. Alternatively, insert any numeric value from 0 to 90 (degrees).
GRID = True #Alternatively: True or False
# If GRID = False, no grid lines are shown in the graphic.
HORIZONTAL_AXIS_TITLE = None #Alternatively: string inside quotes for horizontal title
VERTICAL_AXIS_TITLE = None #Alternatively: string inside quotes for vertical title
PLOT_TITLE = None #Alternatively: string inside quotes for graphic title
# e.g. HORIZONTAL_AXIS_TITLE = "X", VERTICAL_AXIS_TITLE = "Y", PLOT_TITLE = "YxX"

EXPORT_PNG = False
# Set EXPORT_PNG = False if you do not want to export the obtained image;
# Set EXPORT_PNG = True to export the obtained image.
DIRECTORY_TO_SAVE = None
# This parameter has effect only if EXPORT_PNG = True.
# (string, in quotes): input the path of the directory 
# (e.g. folder path) where the file will be stored. e.g. DIRECTORY_TO_SAVE = "" 
# or DIRECTORY_TO_SAVE = "folder"
# If EXPORT_PNG = True and DIRECTORY_TO_SAVE = None, the file will be saved in the root
# path, DIRECTORY_TO_SAVE = ""
FILE_NAME = None
# This parameter has effect only if EXPORT_PNG = True.
# (string, in quotes): input the name you want for the file without the 
# extension, which will be 'png'. e.g. FILE_NAME = "my_image" will save a file 'my_image.png' 
# If EXPORT_PNG = True and FILE_NAME = None, the file will be saved as:
# 'histogram.png'
# WARNING: if there is already a file in the path DIRECTORY_TO_SAVE saved as FILE_NAME,
# the file will be overwritten.
PNG_RESOLUTION_DPI = 330
# This parameter has effect only if EXPORT_PNG = True.
# Alternatively, input an integer that will correspond to the resolution of the exported
# image in dpi. If PNG_RESOLUTION_DPI = None, it will be set as 330.

#New dataframes saved as general_stats and frequency_table.
# Simply modify these objects on the left of equality:
general_stats, frequency_table = etl.histogram (df = DATASET, column_to_analyze = COLUMN_TO_ANALYZE, total_of_bins = TOTAL_OF_BINS, normal_curve_overlay = NORMAL_CURVE_OVERLAY, x_axis_rotation = X_AXIS_ROTATION, y_axis_rotation = Y_AXIS_ROTATION, grid = GRID, horizontal_axis_title = HORIZONTAL_AXIS_TITLE, vertical_axis_title = VERTICAL_AXIS_TITLE, plot_title = PLOT_TITLE, export_png = EXPORT_PNG, directory_to_save = DIRECTORY_TO_SAVE, file_name = FILE_NAME, png_resolution_dpi = PNG_RESOLUTION_DPI)
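
The effect of TOTAL_OF_BINS and of the normal curve overlay can be illustrated with NumPy and pandas alone. This is only a minimal sketch, not the implementation of `etl.histogram`: the helper names `frequency_table_sketch` and `normal_curve_sketch`, as well as the column names of the resulting table, are assumptions made for this example.

```python
import numpy as np
import pandas as pd

def frequency_table_sketch(values, total_of_bins=10):
    """Hypothetical helper: count the values that fall in each of
    total_of_bins equally spaced intervals."""
    values = np.asarray(values, dtype=float)
    values = values[~np.isnan(values)]  # ignore missing values
    counts, edges = np.histogram(values, bins=total_of_bins)
    table = pd.DataFrame({
        "bin_start": edges[:-1],
        "bin_end": edges[1:],
        "count": counts,
    })
    table["bin_center"] = (table["bin_start"] + table["bin_end"]) / 2
    return table

def normal_curve_sketch(values, x):
    """Normal density evaluated at x, using the sample mean and
    sample standard deviation of values."""
    values = np.asarray(values, dtype=float)
    mu = np.nanmean(values)
    sigma = np.nanstd(values, ddof=1)
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

rng = np.random.default_rng(0)
sample = rng.normal(loc=50, scale=5, size=1000)

coarse = frequency_table_sketch(sample, total_of_bins=10)
fine = frequency_table_sketch(sample, total_of_bins=50)

# Fewer bins -> wider intervals -> more values counted per bin.
print(coarse["count"].mean(), fine["count"].mean())  # 100.0 20.0

# Density values that could be drawn over the histogram bars.
overlay = normal_curve_sketch(sample, coarse["bin_center"].to_numpy())
```

With 1000 values, 10 bins hold 100 values each on average, while 50 bins hold 20: the total count is the same, only the resolution changes.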

### **Testing data normality and visualizing the probability plot**
- Check the probability that data is actually described by a normal distribution.

In [None]:
# WARNING: The statistical tests require at least 20 samples

DATASET = dataset #Alternatively: object containing the dataset to be analyzed

COLUMN_TO_ANALYZE = 'column_to_analyze' 
# COLUMN_TO_ANALYZE: column (variable) of the dataset that will be tested. Declare as a string,
# in quotes.
# e.g. COLUMN_TO_ANALYZE = 'col1' will analyze a column named 'col1'.

COLUMN_WITH_LABELS_TO_TEST_SUBGROUPS = None
# column_with_labels_to_test_subgroups: if there is a column with labels or
# subgroup indication, and the normality should be tested separately for each label, indicate
# it here as a string (in quotes). e.g. column_with_labels_to_test_subgroups = 'col2' 
# will retrieve the labels from 'col2'.
# Keep column_with_labels_to_test_subgroups = None if a single series (the whole column)
# will be tested.
    
ALPHA = 0.10
# Confidence level = 1 - ALPHA. For ALPHA = 0.10, we get a 0.90 = 90% confidence
# Set ALPHA = 0.05 to get 0.95 = 95% confidence in the analysis.
# Notice that, when less trust is needed, we can increase ALPHA to get less restrictive
# results.

SHOW_PROBABILITY_PLOT = True
#Alternatively: set SHOW_PROBABILITY_PLOT = True to obtain the probability plot for the
# variable Y (normal distribution tested). 
# Set SHOW_PROBABILITY_PLOT = False to omit the probability plot.
X_AXIS_ROTATION = 0
#Rotation of X axis labels. Alternatively, insert any numeric value from 0 to 90 (degrees).
Y_AXIS_ROTATION = 0
#Rotation of Y axis labels. Alternatively, insert any numeric value from 0 to 90 (degrees).
GRID = True #Alternatively: True or False
# If GRID = False, no grid lines are shown in the graphic.

EXPORT_PNG = False
# Set EXPORT_PNG = False if you do not want to export the obtained image;
# Set EXPORT_PNG = True to export the obtained image.
DIRECTORY_TO_SAVE = None
# This parameter has effect only if EXPORT_PNG = True.
# (string, in quotes): input the path of the directory 
# (e.g. folder path) where the file will be stored. e.g. DIRECTORY_TO_SAVE = "" 
# or DIRECTORY_TO_SAVE = "folder"
# If EXPORT_PNG = True and DIRECTORY_TO_SAVE = None, the file will be saved in the root
# path, DIRECTORY_TO_SAVE = ""
FILE_NAME = None
# This parameter has effect only if EXPORT_PNG = True.
# (string, in quotes): input the name you want for the file without the 
# extension, which will be 'png'. e.g. FILE_NAME = "my_image" will save a file 'my_image.png' 
# If EXPORT_PNG = True and FILE_NAME = None, the file will be saved as:
# 'probability_plot_normal.png'
# WARNING: if there is already a file in the path DIRECTORY_TO_SAVE saved as FILE_NAME,
# the file will be overwritten.
PNG_RESOLUTION_DPI = 330
# This parameter has effect only if EXPORT_PNG = True.
# Alternatively, input an integer that will correspond to the resolution of the exported
# image in dpi. If PNG_RESOLUTION_DPI = None, it will be set as 330.

# List of dictionaries containing the series, p-values, skewness and kurtosis returned as
# list_of_dicts
# Simply modify this object on the left of equality:
list_of_dicts = etl.test_data_normality (df = DATASET, column_to_analyze = COLUMN_TO_ANALYZE, column_with_labels_to_test_subgroups = COLUMN_WITH_LABELS_TO_TEST_SUBGROUPS, alpha = ALPHA, show_probability_plot = SHOW_PROBABILITY_PLOT, x_axis_rotation = X_AXIS_ROTATION, y_axis_rotation = Y_AXIS_ROTATION, grid = GRID, export_png = EXPORT_PNG, directory_to_save = DIRECTORY_TO_SAVE, file_name = FILE_NAME, png_resolution_dpi = PNG_RESOLUTION_DPI)
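
As a rough illustration of how p-values are compared with ALPHA, the sketch below runs two normality tests from `scipy.stats` (Shapiro-Wilk and D'Agostino-Pearson) on a whole column or on each labelled subgroup. Which tests `etl.test_data_normality` runs internally is not stated here, so the choice of tests, the helper name `normality_sketch` and the dictionary keys are all assumptions made for this example.

```python
import numpy as np
import pandas as pd
from scipy import stats

def normality_sketch(df, column, label_column=None, alpha=0.10):
    """Hypothetical helper: test the whole column, or each labelled subgroup."""
    if label_column is None:
        groups = [("whole_column", df[column])]
    else:
        groups = list(df.groupby(label_column)[column])

    results = []
    for label, series in groups:
        x = series.dropna().to_numpy(dtype=float)
        if len(x) < 20:
            # Mirrors the warning above: too few samples for the tests.
            results.append({"series": label, "n": len(x), "tested": False})
            continue
        _, p_shapiro = stats.shapiro(x)
        _, p_dagostino = stats.normaltest(x)
        results.append({
            "series": label,
            "n": len(x),
            "tested": True,
            "shapiro_p": float(p_shapiro),
            "dagostino_p": float(p_dagostino),
            "skewness": float(stats.skew(x)),
            "excess_kurtosis": float(stats.kurtosis(x)),
            # p-value below alpha: the hypothesis of normality is rejected.
            "normal_at_alpha": bool(p_shapiro >= alpha),
        })
    return results

rng = np.random.default_rng(42)
example = pd.DataFrame({
    "value": np.concatenate([rng.normal(0, 1, 200), rng.exponential(1.0, 200)]),
    "group": ["gaussian"] * 200 + ["skewed"] * 200,
})

for row in normality_sketch(example, "value", label_column="group", alpha=0.10):
    print(row)
```

A larger ALPHA makes it easier to reject normality, since more p-values fall below the threshold.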

## **Exporting the dataframe as CSV file (to notebook's workspace)**

In [None]:
## WARNING: all files exported from this function are .csv (comma separated values)

DATAFRAME_OBJ_TO_BE_EXPORTED = dataset
# Alternatively: object containing the dataset to be exported.
# DATAFRAME_OBJ_TO_BE_EXPORTED: dataframe object that is going to be exported from the
# function. Since it is an object (not a string), it should not be declared in quotes.
# example: DATAFRAME_OBJ_TO_BE_EXPORTED = dataset will export the dataset object.
# ATTENTION: The dataframe object must be a Pandas dataframe.

FILE_DIRECTORY_PATH = ""
# FILE_DIRECTORY_PATH - (string, in quotes): input the path of the directory 
# (e.g. folder path) where the file will be stored. e.g. FILE_DIRECTORY_PATH = "" 
# or FILE_DIRECTORY_PATH = "folder"
# If you want to export the file to AWS S3, this parameter will have no effect.
# In this case, you can set FILE_DIRECTORY_PATH = None

NEW_FILE_NAME_WITHOUT_EXTENSION = "dataset"
# NEW_FILE_NAME_WITHOUT_EXTENSION - (string, in quotes): input the name of the 
# file without the extension. e.g. set NEW_FILE_NAME_WITHOUT_EXTENSION = "my_file" 
# to export the CSV file 'my_file.csv' to notebook's workspace.

idsw.export_pd_dataframe_as_csv (dataframe_obj_to_be_exported = DATAFRAME_OBJ_TO_BE_EXPORTED, new_file_name_without_extension = NEW_FILE_NAME_WITHOUT_EXTENSION, file_directory_path = FILE_DIRECTORY_PATH)
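
For readers without access to `idsw`, a comparable export can be sketched with pandas alone. The helper `export_csv_sketch` is hypothetical, and omitting the index from the file is an assumption of this sketch, not a documented behavior of `idsw.export_pd_dataframe_as_csv`.

```python
import os
import tempfile
import pandas as pd

def export_csv_sketch(df, file_name_without_extension, directory_path=""):
    """Hypothetical helper: write df to '<directory_path>/<name>.csv'."""
    if not isinstance(df, pd.DataFrame):
        raise TypeError("The object to export must be a pandas DataFrame.")
    directory_path = directory_path or ""  # None and "" both mean the current directory
    if directory_path:
        os.makedirs(directory_path, exist_ok=True)
    file_path = os.path.join(directory_path, file_name_without_extension + ".csv")
    df.to_csv(file_path, index=False)  # assumption: the index is not written
    return file_path

# Usage in a temporary directory, so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    path = export_csv_sketch(pd.DataFrame({"a": [1, 2]}), "my_file", tmp)
    print(os.path.basename(path))  # my_file.csv
```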

## **Downloading a file from Google Colab to the local machine; or uploading a file from the machine to Colab's instant memory**

#### Case 1: upload a file to Colab's workspace

In [None]:
ACTION = 'upload'
# ACTION = 'download' to download the file to the local machine
# ACTION = 'upload' to upload a file from local machine to Google Colab's 
# instant memory

FILE_TO_DOWNLOAD_FROM_COLAB = None
# FILE_TO_DOWNLOAD_FROM_COLAB = None. This parameter is obligatory when
# ACTION = 'download'. 
# Declare as FILE_TO_DOWNLOAD_FROM_COLAB the file that you want to download, with
# the corresponding extension.
# It should be declared as a string, in quotes.
# e.g. to download a dictionary named dict, FILE_TO_DOWNLOAD_FROM_COLAB = 'dict.pkl'
# To download a dataframe named df, declare FILE_TO_DOWNLOAD_FROM_COLAB = 'df.csv'
# To export a model named keras_model, declare FILE_TO_DOWNLOAD_FROM_COLAB = 'keras_model.h5'

# Dictionary storing the uploaded files returned as colab_files_dict.
# Simply modify this object on the left of the equality:
colab_files_dict = idsw.upload_to_or_download_file_from_colab (action = ACTION, file_to_download_from_colab = FILE_TO_DOWNLOAD_FROM_COLAB)

#### Case 2: download a file from Colab's workspace

In [None]:
ACTION = 'download'
# ACTION = 'download' to download the file to the local machine
# ACTION = 'upload' to upload a file from local machine to Google Colab's 
# instant memory

FILE_TO_DOWNLOAD_FROM_COLAB = None
# FILE_TO_DOWNLOAD_FROM_COLAB = None. This parameter is obligatory when
# ACTION = 'download'. 
# Declare as FILE_TO_DOWNLOAD_FROM_COLAB the file that you want to download, with
# the corresponding extension.
# It should be declared as a string, in quotes.
# e.g. to download a dictionary named dict, FILE_TO_DOWNLOAD_FROM_COLAB = 'dict.pkl'
# To download a dataframe named df, declare FILE_TO_DOWNLOAD_FROM_COLAB = 'df.csv'
# To export a model named keras_model, declare FILE_TO_DOWNLOAD_FROM_COLAB = 'keras_model.h5'

idsw.upload_to_or_download_file_from_colab (action = ACTION, file_to_download_from_colab = FILE_TO_DOWNLOAD_FROM_COLAB)
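
The two actions can be sketched directly with the `google.colab.files` module, which exists only inside Google Colab. The wrapper name `colab_transfer_sketch` is hypothetical and this is not the code of `idsw.upload_to_or_download_file_from_colab`; outside Colab the sketch simply returns None.

```python
def colab_transfer_sketch(action, file_name=None):
    """Hypothetical wrapper around google.colab.files (Colab only)."""
    if action not in ("upload", "download"):
        raise ValueError("action must be 'upload' or 'download'")
    if action == "download" and not file_name:
        raise ValueError("file_name is required when action = 'download'")

    try:
        from google.colab import files  # available only inside Google Colab
    except ImportError:
        print("Not running in Google Colab: nothing was transferred.")
        return None

    if action == "upload":
        # Opens a file picker and returns a dictionary
        # {file name: file content in bytes}.
        return files.upload()

    # file_name is a string with the extension, e.g. 'df.csv'.
    files.download(file_name)
    return None
```

Validating the arguments before importing `google.colab` makes a missing file name fail early, in any environment.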

## **Exporting a list of files from notebook's workspace to AWS Simple Storage Service (S3)**

In [None]:
LIST_OF_FILE_NAMES_WITH_EXTENSIONS = ['s3_file1.txt', 's3_file2.txt']
# LIST_OF_FILE_NAMES_WITH_EXTENSIONS: list containing all the files to export to S3.
# Declare it as a list even if only a single file will be exported.
# It must be a list of strings containing the file names followed by the extensions.
# Example: to export a single file my_file.ext, where my_file is the name and ext is the
# extension:
# LIST_OF_FILE_NAMES_WITH_EXTENSIONS = ['my_file.ext']
# To export 3 files, file1.ext1, file2.ext2, and file3.ext3:
# LIST_OF_FILE_NAMES_WITH_EXTENSIONS = ['file1.ext1', 'file2.ext2', 'file3.ext3']
# Other examples:
# LIST_OF_FILE_NAMES_WITH_EXTENSIONS = ['Screen_Shot.png', 'dataset.csv']
# LIST_OF_FILE_NAMES_WITH_EXTENSIONS = ["dictionary.pkl", "model.h5"]
# LIST_OF_FILE_NAMES_WITH_EXTENSIONS = ['doc.pdf', 'model.dill']

DIRECTORY_OF_NOTEBOOK_WORKSPACE_STORING_FILES_TO_EXPORT = ''
# DIRECTORY_OF_NOTEBOOK_WORKSPACE_STORING_FILES_TO_EXPORT: directory from notebook's workspace
# from which the files will be exported to S3. Keep it None, or
# DIRECTORY_OF_NOTEBOOK_WORKSPACE_STORING_FILES_TO_EXPORT = "/"; or
# DIRECTORY_OF_NOTEBOOK_WORKSPACE_STORING_FILES_TO_EXPORT = '' (empty string) to export from
# the root (main) directory.
# Alternatively, set as a string containing only the directories and folders, not the file names.
# Examples: DIRECTORY_OF_NOTEBOOK_WORKSPACE_STORING_FILES_TO_EXPORT = 'folder1';
# DIRECTORY_OF_NOTEBOOK_WORKSPACE_STORING_FILES_TO_EXPORT = 'folder1/folder2/'
    
# For this function, all exported files must be located in the same directory.

S3_BUCKET_NAME = 'my_bucket'
## This parameter is obligatory to access an AWS S3 bucket. Substitute it for a string
# with the bucket's name. e.g. s3_bucket_name = "aws-bucket-1" accesses a bucket named
# "aws-bucket-1"

S3_OBJECT_FOLDER_PREFIX = ""
# S3_OBJECT_FOLDER_PREFIX = None. Keep it None; or as an empty string 
# (S3_OBJECT_FOLDER_PREFIX = ''); or as the root "/" to import the 
# whole bucket content, instead of a single object from it.
# Alternatively, set it as a string containing the subfolder from the bucket to import:
# Suppose that your bucket (admin-created) has four objects with the following object 
# keys: Development/Projects.xlsx; Finance/statement1.pdf; Private/taxdocument.pdf; and
# s3-dg.pdf. 
# The s3-dg.pdf key does not have a prefix, so its object appears directly 
# at the root level of the bucket. If you open the Development/ folder, you see 
# the Projects.xlsx object in it.
# In summary, if the path of the file is: 'bucket/my_path/.../file.csv'
# where 'bucket' is the bucket's name, prefix = 'my_path/.../', without the
# 'file.csv' (file name with extension) last part.

# So, declare the prefix as S3_OBJECT_FOLDER_PREFIX to import only files from
# a given folder (directory) of the bucket.
# DO NOT PUT A SLASH before (to the left of) the prefix;
# DO NOT ADD THE BUCKET'S NAME to the left of the prefix:
# S3_OBJECT_FOLDER_PREFIX = "bucket_directory1/.../bucket_directoryN/"

# Alternatively, provide the full path of a given file if you want to import only it:
# S3_OBJECT_FOLDER_PREFIX = "bucket_directory1/.../bucket_directoryN/my_file.ext"
# where my_file is the file's name, and ext is its extension.


# Attention: after running this function for connecting with AWS Simple Storage Service (S3), 
# your 'AWS Access key ID' and your 'Secret access key' will be requested.
# The 'Secret access key' will be hidden through dots, so it cannot be visualized or copied by
# other users. On the other hand, the same is not true for the 'Access key ID', the bucket's name 
# and the prefix. All of these are sensitive information from the organization.
# Therefore, after running this cell, always remember to clear its output
# and to remove such information from the strings.
# Remember that these data may grant privileges for accessing protected information, 
# so they should not be used by unauthorized people.

# Also, remember to delete the imported files from the workspace after finishing the analysis.
# The cost of storing the files in S3 is considerably lower than that of storing them directly
# in the workspace. Also, files stored in S3 may be accessed by users other than those with
# access to the notebook's workspace.
idsw.export_files_to_s3 (list_of_file_names_with_extensions = LIST_OF_FILE_NAMES_WITH_EXTENSIONS, directory_of_notebook_workspace_storing_files_to_export = DIRECTORY_OF_NOTEBOOK_WORKSPACE_STORING_FILES_TO_EXPORT, s3_bucket_name = S3_BUCKET_NAME, s3_obj_prefix = S3_OBJECT_FOLDER_PREFIX)
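
The sketch below shows one way the local paths and the S3 object keys could be assembled from the parameters above. The helper names `build_s3_transfers` and `upload_sketch` are hypothetical, and this is not the code of `idsw.export_files_to_s3`. The upload step assumes the third-party `boto3` package and lets it resolve the credentials, rather than requesting them interactively.

```python
import os

def build_s3_transfers(file_names, workspace_directory="", s3_obj_prefix=""):
    """Hypothetical helper: pair each local file path with its S3 object key."""
    # None, '' and '/' all mean the root of the workspace.
    directory = (workspace_directory or "").strip()
    if directory == "/":
        directory = ""
    # Leading and trailing slashes are dropped; None, '' and '/' mean bucket root.
    prefix = (s3_obj_prefix or "").strip("/")

    transfers = []
    for name in file_names:
        local_path = os.path.join(directory, name)
        key = f"{prefix}/{name}" if prefix else name
        transfers.append((local_path, key))
    return transfers

def upload_sketch(transfers, bucket_name):
    """Upload each (local_path, key) pair with boto3 (assumed to be installed)."""
    import boto3  # imported lazily, so the key logic above runs without it
    s3 = boto3.client("s3")
    for local_path, key in transfers:
        s3.upload_file(local_path, bucket_name, key)

print(build_s3_transfers(["dataset.csv"], "", "bucket_directory1/"))
# [('dataset.csv', 'bucket_directory1/dataset.csv')]
```

Keeping the key construction separate from the upload makes it possible to check the destination keys before any file leaves the workspace.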

****