# **Dataset Characterization**

## _ETL Workflow Notebook 2_

## Content:
1. Dataframe general characterization; 
2. Characterizing categorical variables;
3. Removing all columns and rows that contain only missing values;
4. Visualizing and characterizing distribution of missing values;
5. Visualizing missingness across a variable, and comparing it to another variable (both numeric);
6. Dealing with missing values; 
7. Obtaining the correlation plot;
8. Plotting bar charts;
9. Calculating cumulative statistics; 
10. Obtaining scatter plots and simple linear regressions;
11. Performing the polynomial fitting; 
12. Visualizing time series; 
13. Visualizing histograms; 
14. Testing normality and visualizing the probability plot;
15. Testing and visualizing probability plots for different statistical distributions;
16. Filtering (selecting); ordering; or renaming columns from the dataframe;
17. Renaming specific columns from the dataframe; or cleaning columns' labels;
18. Dropping specific columns or rows from the dataframe; 
19. Removing duplicate rows from the dataframe.

Marco Cesar Prado Soares, Data Scientist Specialist - Bayer Crop Science LATAM
- marcosoares.feq@gmail.com
- marco.soares@bayer.com

## **Load Python Libraries in Global Context**

In [1]:
import numpy as np
import pandas as pd
import idsw
from idsw import etl
from idsw.etl import etl_workflow as ewf

## **Call the functions**

### **Mounting Google Drive or S3 (AWS Simple Storage Service) bucket**

In [None]:
SOURCE = 'aws'
# SOURCE = 'google' for mounting the google drive;
# SOURCE = 'aws' for accessing an AWS S3 bucket

## THE FOLLOWING PARAMETERS HAVE EFFECT ONLY WHEN SOURCE == 'aws':

PATH_TO_STORE_IMPORTED_S3_BUCKET = ''
# PATH_TO_STORE_IMPORTED_S3_BUCKET: path of the Python environment to which the
# S3 bucket contents will be imported. If it is None; or if it is an empty string; or if 
# PATH_TO_STORE_IMPORTED_S3_BUCKET = '/', bucket will be imported to the root path. 
# Alternatively, input the path as a string (in quotes). e.g. 
# PATH_TO_STORE_IMPORTED_S3_BUCKET = 'copied_s3_bucket'

S3_BUCKET_NAME = 'my_bucket'
## This parameter is obligatory to access an AWS S3 bucket. Substitute it for a string
# with the bucket's name. e.g. S3_BUCKET_NAME = "aws-bucket-1" accesses a bucket named
# "aws-bucket-1"

S3_OBJECT_FOLDER_PREFIX = ""
# S3_OBJECT_FOLDER_PREFIX = None. Keep it None; or as an empty string 
# (S3_OBJECT_FOLDER_PREFIX = ''); or as the root "/" to import the 
# whole bucket content, instead of a single object from it.
# Alternatively, set it as a string containing the subfolder from the bucket to import:
# Suppose that your bucket (admin-created) has four objects with the following object 
# keys: Development/Projects.xlsx; Finance/statement1.pdf; Private/taxdocument.pdf; and
# s3-dg.pdf. 
# The s3-dg.pdf key does not have a prefix, so its object appears directly 
# at the root level of the bucket. If you open the Development/ folder, you see 
# the Projects.xlsx object in it.
# In summary, if the path of the file is: 'bucket/my_path/.../file.csv'
# where 'bucket' is the bucket's name, prefix = 'my_path/.../', without the
# 'file.csv' (file name with extension) last part.

# So, declare the prefix as S3_OBJECT_FOLDER_PREFIX to import only files from
# a given folder (directory) of the bucket.
# DO NOT PUT A SLASH before (to the left of) the prefix;
# DO NOT ADD THE BUCKET'S NAME to the left of the prefix:
# S3_OBJECT_FOLDER_PREFIX = "bucket_directory1/.../bucket_directoryN/"

# Alternatively, provide the full path of a given file if you want to import only it:
# S3_OBJECT_FOLDER_PREFIX = "bucket_directory1/.../bucket_directoryN/my_file.ext"
# where my_file is the file's name, and ext is its extension.


# Attention: after running this function for fetching AWS Simple Storage Service (S3), 
# your 'AWS Access key ID' and your 'Secret access key' will be requested.
# The 'Secret access key' will be hidden through dots, so it cannot be visualized or copied by
# other users. On the other hand, the same is not true for 'Access key ID', the bucket's name 
# and the prefix. All of these are sensitive information from the organization.
# Therefore, after importing the information, always remember to clear the output of this cell
# and to remove such information from the strings.
# Remember that these data may grant privileges for accessing protected information, 
# so they should not be used by unauthorized people.

# Also, remember to delete the imported files from the workspace after finishing the analysis.
# Storing files in S3 costs considerably less than storing them directly in the
# workspace. Also, files stored in S3 may be accessed by users other than those with access to
# the notebook's workspace.
idsw.mount_storage_system (source = SOURCE, path_to_store_imported_s3_bucket = PATH_TO_STORE_IMPORTED_S3_BUCKET, s3_bucket_name = S3_BUCKET_NAME, s3_obj_prefix = S3_OBJECT_FOLDER_PREFIX)
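
As a minimal sketch of the prefix convention described in the cell above, the snippet below splits an object key into its folder prefix and file name using plain string operations. The helper name `split_s3_key` is hypothetical and exists only for illustration; nothing here calls AWS or `idsw`.

```python
def split_s3_key(object_key):
    """Split an S3 object key into (prefix, file_name).

    Hypothetical helper for illustration only: the prefix keeps its
    trailing slash and carries no leading slash or bucket name.
    """
    # rpartition splits on the LAST slash; root-level keys contain no slash
    prefix, slash, file_name = object_key.rpartition("/")
    return prefix + slash, file_name


# Keys taken from the example in the cell above
root_prefix, root_file = split_s3_key("s3-dg.pdf")
dev_prefix, dev_file = split_s3_key("Development/Projects.xlsx")

print(repr(root_prefix), root_file)  # root-level object: empty prefix
print(repr(dev_prefix), dev_file)    # prefix ends with a slash
```

The value obtained for `dev_prefix` has the shape expected by `S3_OBJECT_FOLDER_PREFIX` when only one folder should be imported.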

### **Importing the dataset**

In [None]:
## WARNING: Use this function to load dataframes stored on Excel (xls, xlsx, xlsm, xlsb, odf, ods and odt), 
## JSON, txt, or CSV (comma separated values) files.

FILE_DIRECTORY_PATH = ""
# FILE_DIRECTORY_PATH - (string, in quotes): input the path of the directory 
# (e.g. folder path) where the file is stored. e.g. FILE_DIRECTORY_PATH = "" 
# or FILE_DIRECTORY_PATH = "folder"

FILE_NAME_WITH_EXTENSION = "dataset.csv"
# FILE_NAME_WITH_EXTENSION - (string, in quotes): input the name of the file with the 
# extension. e.g. FILE_NAME_WITH_EXTENSION = "file.xlsx", or, 
# FILE_NAME_WITH_EXTENSION = "file.csv", "file.txt", or "file.json"
# Again, the extensions may be: xls, xlsx, xlsm, xlsb, odf, ods, odt, json, txt or csv.

LOAD_TXT_FILE_WITH_JSON_FORMAT = False
# LOAD_TXT_FILE_WITH_JSON_FORMAT = False. Set LOAD_TXT_FILE_WITH_JSON_FORMAT = True 
# if you want to read a file with txt extension containing a text formatted as JSON 
# (but not saved as JSON).
# WARNING: if LOAD_TXT_FILE_WITH_JSON_FORMAT = True, all the JSON file parameters of the 
# function (below) must be set. If not, an error message will be raised.

HOW_MISSING_VALUES_ARE_REGISTERED = None
# HOW_MISSING_VALUES_ARE_REGISTERED = None: keep it None if missing values are registered as None,
# empty or np.nan. Pandas automatically converts None to NumPy np.nan objects (floats).
# This parameter manipulates the argument na_values (default: None) from Pandas functions.
# By default the following values are interpreted as NaN: ‘’, ‘#N/A’, ‘#N/A N/A’, ‘#NA’, ‘-1.#IND’, 
#‘-1.#QNAN’, ‘-NaN’, ‘-nan’, ‘1.#IND’, ‘1.#QNAN’, ‘<NA>’, ‘N/A’, ‘NA’, ‘NULL’, ‘NaN’, 
# ‘n/a’, ‘nan’, ‘null’.

# If a different denomination is used, indicate it as a string. e.g.
# HOW_MISSING_VALUES_ARE_REGISTERED = '.' will convert all strings '.' to missing values;
# HOW_MISSING_VALUES_ARE_REGISTERED = 0 will convert zeros to missing values.

# If dict passed, specific per-column NA values. For example, if zero is the missing value
# only in column 'numeric_col', you can specify the following dictionary:
# HOW_MISSING_VALUES_ARE_REGISTERED = {'numeric_col': 0}

    
HAS_HEADER = True
# HAS_HEADER = True if the imported table has headers (row with column names).
# Alternatively, HAS_HEADER = False if the dataframe does not have header.

DECIMAL_SEPARATOR = '.'
# DECIMAL_SEPARATOR = '.' - String. Keep it '.' or None to use the period ('.') as
# the decimal separator. Alternatively, specify here the separator.
# e.g. DECIMAL_SEPARATOR = ',' will set the comma as the separator.
# It manipulates the argument 'decimal' from Pandas functions.

TXT_CSV_COL_SEP = "comma"
# txt_csv_col_sep = "comma" - This parameter has effect only when the file is a 'txt'
# or 'csv'. It informs how the different columns are separated.
# Alternatively, txt_csv_col_sep = "comma", or txt_csv_col_sep = "," 
# for columns separated by comma;
# txt_csv_col_sep = "whitespace", or txt_csv_col_sep = " " 
# for columns separated by simple spaces.
# You can also set a specific separator as string. For example:
# txt_csv_col_sep = '\s+'; or txt_csv_col_sep = '\t' (in this last example, the tabulation
# is used as separator for the columns - '\t' represents the tab character).

## Parameters for loading Excel files:

LOAD_ALL_SHEETS_AT_ONCE = False
# LOAD_ALL_SHEETS_AT_ONCE = False - This parameter has effect only for Excel files.
# If LOAD_ALL_SHEETS_AT_ONCE = True, the function will return a list of dictionaries, each
# dictionary containing 2 key-value pairs: the first key will be 'sheet', and its
# value will be the name (or number) of the table (sheet). The second key will be 'df',
# and its value will be the pandas dataframe object obtained from that sheet.
# This argument has preference over SHEET_TO_LOAD. If it is True, all sheets will be loaded.
    
SHEET_TO_LOAD = None
# SHEET_TO_LOAD - This parameter has effect only for Excel files.
# Keep SHEET_TO_LOAD = None not to specify a sheet of the file, so that the first sheet
# will be loaded.
# SHEET_TO_LOAD may be an integer or a string (inside quotes). SHEET_TO_LOAD = 0
# loads the first sheet (sheet with index 0); SHEET_TO_LOAD = 1 loads the second sheet
# of the file (index 1); SHEET_TO_LOAD = "Sheet1" loads a sheet named as "Sheet1".
# Declare a number to load the sheet with that index, starting from 0; or declare a
# name to load the sheet with that name.

## Parameters for loading JSON files:

JSON_RECORD_PATH = None
# JSON_RECORD_PATH (string): manipulates the parameter 'record_path' from json_normalize method.
# Path in each object to list of records. If not passed, data will be assumed to 
# be an array of records. If a given field from the JSON stores a nested JSON (or a nested
# dictionary) declare it here to decompose the content of the nested data. e.g. if the field
# 'books' stores a nested JSON, declare, JSON_RECORD_PATH = 'books'

JSON_FIELD_SEPARATOR = "_"
# JSON_FIELD_SEPARATOR = "_" (string). Manipulates the parameter 'sep' from json_normalize method.
# Nested records will generate names separated by sep. 
# e.g., for JSON_FIELD_SEPARATOR = ".", {‘foo’: {‘bar’: 0}} -> foo.bar.
# Then, if a given field 'main_field' stores a nested JSON with fields 'field1', 'field2', ...
# the name of the columns of the dataframe will be formed by concatenating 'main_field', the
# separator, and the names of the nested fields: 'main_field_field1', 'main_field_field2',...

JSON_METADATA_PREFIX_LIST = None
# JSON_METADATA_PREFIX_LIST: list of strings (in quotes). Manipulates the parameter 
# 'meta' from json_normalize method. Fields to use as metadata for each record in resulting 
# table. Declare here the non-nested fields, i.e., the fields in the principal JSON. They
# will be repeated in the rows of the dataframe to give the metadata (context) of the rows.

# e.g. Suppose a JSON with the following structure: {'name': 'Mary', 'last': 'Shelley',
# 'books': [{'title': 'Frankenstein', 'year': 1818}, {'title': 'Mathilda', 'year': 1819},
# {'title': 'The Last Man', 'year': 1826}]},
# Here, there are nested JSONs in the field 'books'. The fields that are not nested
# are 'name' and 'last'.
# Then, JSON_RECORD_PATH = 'books'
# JSON_METADATA_PREFIX_LIST = ['name', 'last']


# The dataframe will be stored in the object named 'dataset':
# Simply modify this object on the left of equality:
dataset = idsw.load_pandas_dataframe (file_directory_path = FILE_DIRECTORY_PATH, file_name_with_extension = FILE_NAME_WITH_EXTENSION, load_txt_file_with_json_format = LOAD_TXT_FILE_WITH_JSON_FORMAT, how_missing_values_are_registered = HOW_MISSING_VALUES_ARE_REGISTERED, has_header = HAS_HEADER, decimal_separator = DECIMAL_SEPARATOR, txt_csv_col_sep = TXT_CSV_COL_SEP, load_all_sheets_at_once = LOAD_ALL_SHEETS_AT_ONCE, sheet_to_load = SHEET_TO_LOAD, json_record_path = JSON_RECORD_PATH, json_field_separator = JSON_FIELD_SEPARATOR, json_metadata_prefix_list = JSON_METADATA_PREFIX_LIST)

# OBS: If an Excel file is loaded and LOAD_ALL_SHEETS_AT_ONCE = True, then the object
# dataset will be a list of dictionaries, with 'sheet' as the key containing the sheet name; and 'df'
# as the key corresponding to the Pandas dataframe. So, to access the 3rd dataframe (index 2, since
# indexing starts from zero): df = dataset[2]['df'], where dataset is the list returned.
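
The comments above state that the missing-value and decimal parameters manipulate the `na_values` and `decimal` arguments of Pandas functions. As a minimal sketch of what those arguments do, the example below calls `pd.read_csv` directly on an in-memory text. The sample data and variable names are illustrative, and how `idsw.load_pandas_dataframe` forwards its parameters internally is an assumption not verified here.

```python
import io

import pandas as pd

# Small in-memory file: ';' separates columns, ',' is the decimal mark,
# and '.' marks a missing value (all three are illustrative choices).
raw_text = "id;temperature\n1;20,5\n2;.\n3;22,0\n"

df_example = pd.read_csv(
    io.StringIO(raw_text),
    sep=";",         # column separator
    decimal=",",     # decimal separator
    na_values=".",   # extra marker to be read as a missing value
    header=0,        # first row holds the column names
)

print(df_example)
print(df_example.dtypes)
```

With these settings the `temperature` column is parsed as numeric, and the `'.'` entry becomes a missing value instead of forcing the whole column to text.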

### **Converting JSON object to dataframe**

In [None]:
# JSON object in terms of Python structure: list of dictionaries, where each value of a
# dictionary may be a dictionary or a list of dictionaries (nested structures).
# example of highly nested structure saved as a list 'json_formatted_list'. Note that the same
# structure could be declared and stored into a string variable. For instance, if you have a txt
# file containing JSON, you could read the txt and save its content as a string.
# json_formatted_list = [{'field1': val1, 'field2': {'dict_val': dict_val}, 'field3': [{
# 'nest1': nest_val1}, {'nest2': nestval2}]}, {'field1': val1, 'field2': {'dict_val': dict_val}, 
# 'field3': [{'nest1': nest_val1}, {'nest2': nestval2}]}]

JSON_OBJ_TO_CONVERT = json_object #Alternatively: object containing the JSON to be converted

# JSON_OBJ_TO_CONVERT: object containing JSON, or string with JSON content to parse.
# Objects may be: string with JSON formatted text;
# list with nested dictionaries (JSON formatted);
# dictionaries, possibly with nested dictionaries (JSON formatted).

JSON_OBJ_TYPE = 'list'
# JSON_OBJ_TYPE = 'list', in case the object was saved as a list of dictionaries (JSON format)
# JSON_OBJ_TYPE = 'string', in case it was saved as a string (text) containing JSON.

## Parameters for loading JSON files:

JSON_RECORD_PATH = None
# JSON_RECORD_PATH (string): manipulates the parameter 'record_path' from json_normalize method.
# Path in each object to list of records. If not passed, data will be assumed to 
# be an array of records. If a given field from the JSON stores a nested JSON (or a nested
# dictionary) declare it here to decompose the content of the nested data. e.g. if the field
# 'books' stores a nested JSON, declare, JSON_RECORD_PATH = 'books'

JSON_FIELD_SEPARATOR = "_"
# JSON_FIELD_SEPARATOR = "_" (string). Manipulates the parameter 'sep' from json_normalize method.
# Nested records will generate names separated by sep. 
# e.g., for JSON_FIELD_SEPARATOR = ".", {‘foo’: {‘bar’: 0}} -> foo.bar.
# Then, if a given field 'main_field' stores a nested JSON with fields 'field1', 'field2', ...
# the name of the columns of the dataframe will be formed by concatenating 'main_field', the
# separator, and the names of the nested fields: 'main_field_field1', 'main_field_field2',...

JSON_METADATA_PREFIX_LIST = None
# JSON_METADATA_PREFIX_LIST: list of strings (in quotes). Manipulates the parameter 
# 'meta' from json_normalize method. Fields to use as metadata for each record in resulting 
# table. Declare here the non-nested fields, i.e., the fields in the principal JSON. They
# will be repeated in the rows of the dataframe to give the metadata (context) of the rows.

# e.g. Suppose a JSON with the following structure: [{'name': 'Mary', 'last': 'Shelley',
# 'books': [{'title': 'Frankenstein', 'year': 1818}, {'title': 'Mathilda', 'year': 1819},
# {'title': 'The Last Man', 'year': 1826}]}]
# Here, there are nested JSONs in the field 'books'. The fields that are not nested
# are 'name' and 'last'.
# Then, JSON_RECORD_PATH = 'books'
# JSON_METADATA_PREFIX_LIST = ['name', 'last']


# The dataframe will be stored in the object named 'dataset':
# Simply modify this object on the left of equality:
dataset = idsw.json_obj_to_pandas_dataframe (json_obj_to_convert = JSON_OBJ_TO_CONVERT, json_obj_type = JSON_OBJ_TYPE, json_record_path = JSON_RECORD_PATH, json_field_separator = JSON_FIELD_SEPARATOR, json_metadata_prefix_list = JSON_METADATA_PREFIX_LIST)
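
The three JSON parameters above are described as manipulating `record_path`, `sep`, and `meta` from `json_normalize`. The sketch below applies `pd.json_normalize` directly to the author and books structure used in the comments, to show the resulting table. The variable names `authors` and `books_df` are illustrative, and the snippet does not reproduce the internals of `idsw.json_obj_to_pandas_dataframe`.

```python
import pandas as pd

# Example structure from the comments above: 'books' holds the nested records
authors = [
    {
        "name": "Mary",
        "last": "Shelley",
        "books": [
            {"title": "Frankenstein", "year": 1818},
            {"title": "Mathilda", "year": 1819},
            {"title": "The Last Man", "year": 1826},
        ],
    }
]

books_df = pd.json_normalize(
    authors,
    record_path="books",     # one row per nested record
    meta=["name", "last"],   # repeated on every row as context
    sep="_",                 # separator for flattened nested field names
)

print(books_df)
```

Each book becomes one row, and the non-nested fields `name` and `last` are repeated on every row.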

### **Characterizing the dataframe**

In [None]:
DATASET = dataset #Alternatively: object containing the dataset to be analyzed

# New objects saved as df_shape, df_columns_array, df_dtypes, df_general_statistics, df_missing_values.
# Simply modify this object on the left of equality:
df_shape, df_columns_array, df_dtypes, df_general_statistics, df_missing_values = ewf.df_general_characterization (df = DATASET)
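
For orientation, the kinds of summaries named above (shape, columns, dtypes, general statistics, missing values) can be obtained in plain Pandas as sketched below. This is an illustration on a toy dataframe under the assumption that the returned objects carry similar information; it is not the implementation of `ewf.df_general_characterization`.

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame(
    {
        "sensor": ["a", "b", "a", None],
        "reading": [1.0, np.nan, 3.0, 4.0],
    }
)

shape = toy_df.shape                   # (number of rows, number of columns)
columns = toy_df.columns.to_numpy()    # array with the column labels
dtypes = toy_df.dtypes                 # data type of each column
statistics = toy_df.describe()         # summary statistics of numeric columns
missing_counts = toy_df.isna().sum()   # number of missing values per column

print(shape)
print(statistics)
print(missing_counts)
```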

### **Characterizing the categorical variables**

In [None]:
DATASET = dataset #Alternatively: object containing the dataset to be analyzed

TIMESTAMP_TAG_COLUMN = None
# TIMESTAMP_TAG_COLUMN: name (header) of the column containing the timestamps. 
# Keep TIMESTAMP_TAG_COLUMN = None if the dataframe does not contain timestamps.

# Dataframe with summary from the categorical variables returned as cat_vars_summary. 
# Simply modify this object on the left of equality:
cat_vars_summary = ewf.characterize_categorical_variables (df = DATASET, timestamp_tag_column = TIMESTAMP_TAG_COLUMN)
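
A categorical variable is usually summarized by its categories and their frequencies. The sketch below shows one way to do that for a single column in plain Pandas. The toy data and variable names are illustrative, and the exact contents of `cat_vars_summary` are not documented in this notebook, so this is not a reproduction of it.

```python
import pandas as pd

toy_df = pd.DataFrame(
    {
        "color": ["black", "brown", "black", "white", "black"],
        "weight_kg": [24.0, 17.5, 29.0, 8.2, 31.0],
    }
)

# Absolute and relative frequency of each category
color_counts = toy_df["color"].value_counts()
color_share = toy_df["color"].value_counts(normalize=True)

most_frequent = color_counts.idxmax()              # the mode of the column
number_of_categories = toy_df["color"].nunique()   # count of distinct values

print(color_counts)
print(most_frequent, number_of_categories)
```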

### **Removing all columns and rows that contain only missing values**

In [None]:
DATASET = dataset #Alternatively: object containing the dataset to be analyzed

LIST_OF_COLUMNS_TO_IGNORE = None
# list_of_columns_to_ignore: if you do not want to check a specific column, pass its name
# (header) as an element from this list. It should be declared as a list even if it contains
# a single value.
# e.g. list_of_columns_to_ignore = ['column1'] will not analyze missing values in column named
# 'column1'; list_of_columns_to_ignore = ['col1', 'col2'] will ignore columns 'col1' and 'col2'

# Cleaned dataframe returned as cleaned_df. 
# Simply modify this object on the left of equality:
cleaned_df = ewf.remove_completely_blank_rows_and_columns (df = DATASET, list_of_columns_to_ignore = LIST_OF_COLUMNS_TO_IGNORE)
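
In plain Pandas, rows and columns that contain only missing values can be removed with `dropna(how="all")` on each axis. The sketch below illustrates that behavior on a toy dataframe; it does not cover the `list_of_columns_to_ignore` option and is not the implementation of the function above.

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame(
    {
        "col1": [1.0, np.nan, 3.0],
        "col2": [np.nan, np.nan, np.nan],  # completely empty column
        "col3": ["x", None, "z"],
    }
)

# how="all" drops a row (or column) only when every value in it is missing
without_blank_rows = toy_df.dropna(axis=0, how="all")
cleaned_example = without_blank_rows.dropna(axis=1, how="all")

print(cleaned_example)
```

The second row and the column `col2` are removed, while rows with only some missing values would be kept.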

### **Visualizing and characterizing distribution of missing values**

In [None]:
DATASET = dataset #Alternatively: object containing the dataset to be analyzed

SLICE_TIME_WINDOW_FROM = None
SLICE_TIME_WINDOW_TO = None    
# SLICE_TIME_WINDOW_FROM and SLICE_TIME_WINDOW_TO (timestamps). When analyzing time series,
# use these parameters to observe only values in a given time range.
    
# SLICE_TIME_WINDOW_FROM: the lower limit of the analyzed window. If you declare this value
# and keep SLICE_TIME_WINDOW_TO = None, then you will analyze all values that come after
# SLICE_TIME_WINDOW_FROM.
# SLICE_TIME_WINDOW_TO: the upper limit of the analyzed window. If you declare this value
# and keep SLICE_TIME_WINDOW_FROM = None, then you will analyze all values until
# SLICE_TIME_WINDOW_TO.
# If SLICE_TIME_WINDOW_FROM = SLICE_TIME_WINDOW_TO = None, only the standard analysis with
# the whole dataset will be performed. If both values are specified, then the specific time
# window from 'SLICE_TIME_WINDOW_FROM' to 'SLICE_TIME_WINDOW_TO' will be analyzed.
# e.g. SLICE_TIME_WINDOW_FROM = 'May-1976', and SLICE_TIME_WINDOW_TO = 'Jul-1976'
# Notice that the timestamps must be declared in quotes, just as strings.

AGGREGATE_TIME_IN_TERMS_OF = None    
# AGGREGATE_TIME_IN_TERMS_OF = None. Keep it None if you do not want to aggregate the time
# series. Alternatively, set AGGREGATE_TIME_IN_TERMS_OF = 'Y' or AGGREGATE_TIME_IN_TERMS_OF = 
# 'year' to aggregate the timestamps in years; set AGGREGATE_TIME_IN_TERMS_OF = 'M' or
# 'month' to aggregate in terms of months; or set AGGREGATE_TIME_IN_TERMS_OF = 'D' or 'day'
# to aggregate in terms of days.

# Dataframe containing the total and the percent of missing values for each variable
# returned as df_missing_values.
# Simply modify this object on the left of equality:
df_missing_values = ewf.visualize_and_characterize_missing_values (df = DATASET, slice_time_window_from = SLICE_TIME_WINDOW_FROM, slice_time_window_to = SLICE_TIME_WINDOW_TO, aggregate_time_in_terms_of = AGGREGATE_TIME_IN_TERMS_OF)
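
The total and the percent of missing values per variable, optionally restricted to a time window, can be computed in plain Pandas as sketched below. The helper `summarize_missing`, the toy data, and the ISO-formatted window limits are illustrative assumptions; the function above may compute and present its results differently.

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame(
    {
        "col1": [1.0, np.nan, 3.0, np.nan],
        "col2": [10.0, 20.0, np.nan, 40.0],
    },
    # Monthly timestamps from April to July 1976
    index=pd.date_range("1976-04-01", periods=4, freq="MS"),
)


def summarize_missing(df):
    """Return count and percent of missing values for each column."""
    return pd.DataFrame(
        {
            "count_of_missing_values": df.isna().sum(),
            "percent_of_missing_values": df.isna().mean() * 100,
        }
    )


full_summary = summarize_missing(toy_df)

# Restrict the analysis to the window from May 1976 to July 1976 (inclusive)
window_summary = summarize_missing(toy_df.loc["1976-05":"1976-07"])

print(full_summary)
print(window_summary)
```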

### **Visualizing missingness across a variable, and comparing it to another variable (both numeric)**

In [None]:
DATASET = dataset #Alternatively: object containing the dataset to be analyzed

COLUMN_TO_ANALYZE = 'column1'
COLUMN_TO_COMPARE_WITH = 'column2'
# COLUMN_TO_ANALYZE, COLUMN_TO_COMPARE_WITH: strings (in quotes).
# COLUMN_TO_ANALYZE is the column from the dataframe df that will be analyzed in terms of
# missingness; whereas COLUMN_TO_COMPARE_WITH is the column to which column_to_analyze will
# be compared.
# e.g. COLUMN_TO_ANALYZE = 'column1' will analyze 'column1' from df.
# COLUMN_TO_COMPARE_WITH = 'column2' will compare 'column1' against 'column2'

SHOW_INTERPRETED_EXAMPLE = False
# SHOW_INTERPRETED_EXAMPLE: set as True if you want to see an example of a graphic analyzed and
# interpreted.

GRID = True #Alternatively: True or False
# If GRID = False, no grid lines are shown in the graphic.
PLOT_TITLE = None #Alternatively: string inside quotes for graphic title
# e.g. PLOT_TITLE = "YxX"

EXPORT_PNG = False
# Set EXPORT_PNG = False if you do not want to export the obtained image;
# Set EXPORT_PNG = True to export the obtained image.

DIRECTORY_TO_SAVE = None
# This parameter has effect only if EXPORT_PNG = True.
# (string, in quotes): input the path of the directory 
# (e.g. folder path) where the file will be stored. e.g. DIRECTORY_TO_SAVE = "" 
# or DIRECTORY_TO_SAVE = "folder"
# If EXPORT_PNG = True and DIRECTORY_TO_SAVE = None, the file will be saved in the root
# path, DIRECTORY_TO_SAVE = ""

FILE_NAME = None
# This parameter has effect only if EXPORT_PNG = True.
# (string, in quotes): input the name you want for the file without the 
# extension, which will be 'png'. e.g. FILE_NAME = "my_image" will save a file 'my_image.png' 
# If EXPORT_PNG = True and FILE_NAME = None, the file will be saved as:
# 'comparison_of_missing_values.png'
# WARNING: if there is already a file in the path DIRECTORY_TO_SAVE saved as FILE_NAME,
# the file will be overwritten.

PNG_RESOLUTION_DPI = 330
# This parameter has effect only if EXPORT_PNG = True.
# Alternatively, input an integer that will correspond to the resolution of the exported
# image in dpi. If PNG_RESOLUTION_DPI = None, it will be set as 330.


ewf.visualizing_and_comparing_missingness_across_numeric_vars (df = DATASET, column_to_analyze = COLUMN_TO_ANALYZE, column_to_compare_with = COLUMN_TO_COMPARE_WITH, show_interpreted_example = SHOW_INTERPRETED_EXAMPLE, grid = GRID, plot_title = PLOT_TITLE, export_png = EXPORT_PNG, directory_to_save = DIRECTORY_TO_SAVE, file_name = FILE_NAME, png_resolution_dpi = PNG_RESOLUTION_DPI)

### **Dealing with missing values**

In [None]:
DATASET = dataset #Alternatively: object containing the dataset to be manipulated

SUBSET_COLUMNS_LIST = None
# SUBSET_COLUMNS_LIST = list of columns to look for missing values. Only missing values
# in these columns will be considered for deciding which rows to remove.
# Declare it as a list of strings inside quotes containing the columns' names to look at,
# even if this list contains a single element. e.g. SUBSET_COLUMNS_LIST = ['column1']
# will check only 'column1'; whereas SUBSET_COLUMNS_LIST = ['col1', 'col2', 'col3'] will
# check the columns named as 'col1', 'col2', and 'col3'.
# ATTENTION: Subsets are considered only for dropping missing values, not for filling.
    
DROP_MISSING_VAL = True
# DROP_MISSING_VAL = True to eliminate the rows containing missing values.
# Alternatively: DROP_MISSING_VAL = False to use the filling method.

FILL_MISSING_VAL = False
# FILL_MISSING_VAL = False. Set this to True to activate the mode for filling the missing
# values.

ELIMINATE_ONLY_COMPLETELY_EMPTY_ROWS = False
# ELIMINATE_ONLY_COMPLETELY_EMPTY_ROWS = False - This parameter shows effect only when
# DROP_MISSING_VAL = True. If you set ELIMINATE_ONLY_COMPLETELY_EMPTY_ROWS = True, then
# only the rows where all the columns are missing will be eliminated.
# If you define a subset, then only the rows where all the subset columns are missing
# will be eliminated.

MINIMUM_NUMBER_OF_NON_MISSING_VALUES_FOR_A_ROW_TO_BE_KEPT = None
# This parameter shows effect only when DROP_MISSING_VAL = True. 
# If you set MINIMUM_NUMBER_OF_NON_MISSING_VALUES_FOR_A_ROW_TO_BE_KEPT equal to an integer value,
# then only the rows containing at least this number of non-missing values will be kept
# after dropping the NAs.
# e.g. if MINIMUM_NUMBER_OF_NON_MISSING_VALUES_FOR_A_ROW_TO_BE_KEPT = 2, only rows containing at
# least two columns without missing values will be kept.
# If you define a subset, then the criterion is applied only to the subset.

VALUE_TO_FILL = None
# VALUE_TO_FILL = None - This parameter shows effect only when
# FILL_MISSING_VAL = True. Set this parameter as a float value to fill all missing
# values with this value. e.g. VALUE_TO_FILL = 0 will fill all missing values with
# the number 0. You can also pass a function call like 
# VALUE_TO_FILL = np.sum(dataset['col1']). In this case, the missing values will be
# filled with the sum of the series dataset['col1']
# Alternatively, you can also input a string to fill the missing values. e.g.
# VALUE_TO_FILL = 'text' will fill all the missing values with the string "text".

# You can also input a dictionary containing the column(s) to be filled as key(s);
# and the values to fill as the correspondent values. For instance:
# VALUE_TO_FILL = {'col1': 10} will fill only 'col1' with value 10.
# VALUE_TO_FILL = {'col1': 0, 'col2': 'text'} will fill 'col1' with zeros; and will
# fill 'col2' with the value 'text'

FILL_METHOD = "fill_with_zeros"
# FILL_METHOD = "fill_with_zeros". - This parameter shows effect only 
# when FILL_MISSING_VAL = True.
# Alternatively: FILL_METHOD = "fill_with_zeros" - fill all the missing values with 0
    
# FILL_METHOD = "fill_with_value_to_fill" - fill the missing values with the value
# defined as the parameter value_to_fill
    
# FILL_METHOD = "fill_with_avg_or_mode" - fill the missing values with the average value for 
# each column, if the column is numeric; or fill with the mode, if the column is categorical.
# The mode is the most commonly observed value.
    
# FILL_METHOD = "ffill" - Forward (pad) fill: propagate last valid observation forward 
# to next valid.
# FILL_METHOD = 'bfill' - backfill: use next valid observation to fill gap.
# FILL_METHOD = 'nearest' - 'ffill' or 'bfill', depending if the point is closest to the
# next or to the previous non-missing value.

# FILL_METHOD = "fill_by_interpolating" - fill by interpolating the previous and the 
# following value. For categorical columns, it fills the
# missing values with the previous value, just like FILL_METHOD = 'ffill'

INTERPOLATION_ORDER = 'linear'
# INTERPOLATION_ORDER: order of the polynomial used for interpolating if FILL_METHOD =
# "fill_by_interpolating". If INTERPOLATION_ORDER = None, INTERPOLATION_ORDER = 'linear',
# or INTERPOLATION_ORDER = 1, a linear (1st-order) polynomial will be used.
# If INTERPOLATION_ORDER is an integer > 1, then it will represent the polynomial order.
# e.g. INTERPOLATION_ORDER = 2, for a 2nd-order polynomial; INTERPOLATION_ORDER = 3 for a
# 3rd-order, and so on.
    
# WARNING: if the fillna method is selected (FILL_MISSING_VAL == True), but no filling
# methodology is selected, the missing values of the dataset will be filled with 0.
# The same applies when a non-valid fill methodology is selected.
# Pandas fillna method does not allow us to fill only a selected subset.
# WARNING: if FILL_METHOD == "fill_with_value_to_fill" but value_to_fill is None, the 
# missing values will be filled with the value 0.


# New dataframe saved as cleaned_df
# Simply modify this object on the left of equality:
cleaned_df = ewf.handle_missing_values (df = DATASET, subset_columns_list = SUBSET_COLUMNS_LIST, drop_missing_val = DROP_MISSING_VAL, fill_missing_val = FILL_MISSING_VAL, eliminate_only_completely_empty_rows = ELIMINATE_ONLY_COMPLETELY_EMPTY_ROWS, min_number_of_non_missing_val_for_a_row_to_be_kept = MINIMUM_NUMBER_OF_NON_MISSING_VALUES_FOR_A_ROW_TO_BE_KEPT, value_to_fill = VALUE_TO_FILL, fill_method = FILL_METHOD, interpolation_order = INTERPOLATION_ORDER)
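
The dropping and filling modes described above have direct counterparts in Pandas (`dropna`, `fillna`, `ffill`, `interpolate`). The sketch below shows how those counterparts behave on a toy dataframe. The mapping between each notebook parameter and each Pandas call is an assumption made for illustration, not a description of how `ewf.handle_missing_values` is written.

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame(
    {
        "col1": [1.0, np.nan, 3.0, np.nan],
        "col2": [np.nan, np.nan, 30.0, 40.0],
    }
)

# Dropping rows: any missing value, only fully empty rows,
# a minimum number of valid values, or missing values in a subset of columns
dropped_any = toy_df.dropna(how="any")
dropped_all = toy_df.dropna(how="all")
dropped_thresh = toy_df.dropna(thresh=2)         # keep rows with >= 2 non-missing
dropped_subset = toy_df.dropna(subset=["col1"])  # look only at 'col1'

# Filling: constant, per-column dictionary, forward fill, linear interpolation
filled_zero = toy_df.fillna(0)
filled_dict = toy_df.fillna({"col1": 10})
filled_ffill = toy_df.ffill()
filled_interp = toy_df.interpolate(method="linear")

print(dropped_thresh)
print(filled_interp)
```

Note that forward fill and interpolation leave the leading missing values of `col2` untouched, since there is no earlier valid observation to propagate.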

### **Advanced imputation on time series data: finding the best imputation strategy for missing values on a given column**

In [None]:
# This function handles only one column per call, whereas handle_missing_values can process the whole
# dataframe at once.
# The strategies used for handling missing values are different here. You can use the function to
# process data that does not come from time series, but plot the graphs only for time series data.  
# This function is more suitable for dealing with missing values on time series data than handle_missing_values.
# This function will search for the best imputer for a given column.
# It can process both numerical and categorical columns.

DATASET = dataset #Alternatively: object containing the dataset to be manipulated

COLUMN_TO_FILL = None
# string (in quotes) indicating the column with missing values to fill.
# e.g. if COLUMN_TO_FILL = 'col1', imputations will be performed on column 'col1'.
    
TIMESTAMP_TAG_COLUMN = "timestamp_tag_column"
# TIMESTAMP_TAG_COLUMN: string (in quotes) containing the name of the column with the timestamps
# or other values with temporal information, which will be the time series reference. e.g.
# TIMESTAMP_TAG_COLUMN = 'timestamp' will consider the column 'timestamp' a time column.
# If TIMESTAMP_TAG_COLUMN = None, the index will be used for testing different imputations.

TEST_VALUE_TO_FILL = None
# TEST_VALUE_TO_FILL: the function will test the imputation of a constant. Specify this constant here
# or the tested constant will be zero. e.g. TEST_VALUE_TO_FILL = None will test the imputation of 0.
# TEST_VALUE_TO_FILL = 10 will test the imputation of value 10.

SHOW_IMPUTATION_COMPARISON_PLOTS = True
# SHOW_IMPUTATION_COMPARISON_PLOTS = True. Keep it True to plot the scatter plot comparison
# between imputed and original values, as well as the Kernel density estimate (KDE) plot.
# Alternatively, set SHOW_IMPUTATION_COMPARISON_PLOTS = False to omit the plots.

# The following imputation techniques will be tested, and the best one will be automatically
# selected: mean_imputer, median_imputer, mode_imputer, constant_imputer, linear_interpolation,
# quadratic_interpolation, cubic_interpolation, nearest_interpolation, bfill_imputation,
# ffill_imputation, knn_imputer, mice_imputer (MICE = Multiple Imputations by Chained Equations).
    
# MICE: Performs multiple regressions over random samples of the data; 
# Takes the average of multiple regression values; and imputes the missing feature value for the 
# data point.
# KNN (K-Nearest Neighbor): Selects K nearest or similar data points using all the 
# non-missing features. It takes the average of the selected data points to fill in the missing 
# feature.
# These are Machine Learning techniques to impute missing values.
# KNN finds most similar points for imputing.
# MICE performs multiple regression for imputing. MICE is a very robust model for imputation.


# New dataframe saved as cleaned_df
# Simply modify this object on the left of equality:
cleaned_df = ewf.adv_imputation_missing_values (df = DATASET, column_to_fill = COLUMN_TO_FILL, timestamp_tag_column = TIMESTAMP_TAG_COLUMN, test_value_to_fill = TEST_VALUE_TO_FILL, show_imputation_comparison_plots = SHOW_IMPUTATION_COMPARISON_PLOTS)
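
One generic way to compare imputation strategies is to hide values that are actually known, impute them, and measure the error on the hidden points. The sketch below applies this idea to a few of the strategies listed above using plain Pandas. The scoring rule (mean absolute error on hidden points) and all variable names are assumptions for illustration; the notebook does not state which criterion `ewf.adv_imputation_missing_values` uses to select the best imputer.

```python
import numpy as np
import pandas as pd

# Complete toy series with a linear trend, so the true values are known
true_values = pd.Series(np.arange(10, dtype=float))

# Hide two known points to simulate missing values
masked = true_values.copy()
hidden_positions = [3, 6]
masked.iloc[hidden_positions] = np.nan

# Candidate strategies (a subset of those listed above)
candidates = {
    "mean_imputer": masked.fillna(masked.mean()),
    "median_imputer": masked.fillna(masked.median()),
    "ffill_imputation": masked.ffill(),
    "bfill_imputation": masked.bfill(),
    "linear_interpolation": masked.interpolate(method="linear"),
}

# Score each candidate by the mean absolute error on the hidden points
errors = {
    name: (imputed.iloc[hidden_positions] - true_values.iloc[hidden_positions])
    .abs()
    .mean()
    for name, imputed in candidates.items()
}
best_strategy = min(errors, key=errors.get)

print(errors)
print(best_strategy)
```

On this linear toy series the interpolation recovers the hidden points exactly; on real data the ranking depends on the structure of the series.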

### **Applying a list of row filters**
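
The cell in this section expects boolean filters built from dataframe columns. As a minimal sketch, the example below builds three such filters on a toy dataframe and keeps the rows that satisfy all of them. Combining the list with `&` is an assumption about how a list of filters is applied, and the dataframe `dogs` and its columns are illustrative.

```python
import pandas as pd

dogs = pd.DataFrame(
    {
        "color": ["Black", "Brown", "White", "Black"],
        "height_cm": [56, 43, 46, 59],
        "weight_kg": [24.0, 17.5, None, 29.0],
    }
)

# Each condition sits in its own parentheses; & means "and", | means "or"
filter1 = (dogs["height_cm"] >= 45) & (dogs["weight_kg"] < 25)
filter2 = dogs["color"].isin(["Black", "Brown"])   # membership in a subset
filter3 = ~dogs["weight_kg"].isna()                # not missing

# Keep only the rows for which every filter in the list is True
combined = pd.Series(True, index=dogs.index)
for row_filter in [filter1, filter2, filter3]:
    combined = combined & row_filter

filtered = dogs[combined]
print(filtered)
```

Comparisons against a missing value evaluate to False, so the row with a missing `weight_kg` is excluded by `filter1` even before `filter3` is applied.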

In [None]:
# Warning: this function filters the rows and results in a smaller dataset, 
# since it removes the non-selected entries.
# If you want to pass a filter to simply label the selected rows, use the function 
# LABEL_DATAFRAME_SUBSETS, which does not eliminate entries from the dataframe.
    
DATASET = dataset #Alternatively: object containing the dataset to be analyzed

## Define the filters first, and only then define the list of filters
# EXAMPLES OF BOOLEAN FILTERS TO COMPOSE THE LIST
boolean_filter1 = ((None) & (None)) # (condition1 and (&) condition2)
boolean_filter2 = ((None) | (None)) # condition1 or (|) condition2
# boolean filters result into boolean values True or False.

## Examples of filters:
## filter1 = (condition 1) & (condition 2)
## filter1 = (df['column1'] >= 0) & (df['column2'] < 0)
## filter2 = (condition)
## filter2 = (df['column3'] <= 2.5)
## filter3 = (df['column4'] > 10.7)
## filter3 = (condition 1) | (condition 2)
## filter3 = (df['column5'] != 'string1') | (df['column5'] == 'string2')
    
## comparative operators: > (greater than); >= (greater than or equal to); < (less than); 
## <= (less than or equal to); == (equal to); != (not equal to)
    
## logical operators: & (and): the filter is True only if the 
## two conditions combined through & are True
## | (or): the filter is True if at least one of the two conditions combined
## through | is True.
## ~ (not): inverts the boolean, i.e., True becomes False, and False becomes True. 
    
## separate conditions with parentheses. Use parentheses to define the order
## of evaluation of the conditions:
## filter = ((condition1) & (condition2)) | (condition3)
## Here, firstly ((condition1) & (condition2)) = subfilter is evaluated. 
## Then, the resultant (subfilter) | (condition3) is evaluated.

## Pandas .isin method: you can also use this method to filter rows belonging to
## a given subset (the row that is in the subset is selected). The syntax is:
## is_black_or_brown = dogs["color"].isin(["Black", "Brown"])
## or: filter = (dataframe_column_series).isin([value1, value2, ...])
## The negation of this condition may be accessed with the ~ operator:
##  filter = ~(dataframe_column_series).isin([value1, value2, ...])
## Also, you may use isna() method as filter for missing values:
## filter = (dataframe_column_series).isna()
## or, for not missing: ~(dataframe_column_series).isna()

LIST_OF_ROW_FILTERS = [boolean_filter1, boolean_filter2]
# LIST_OF_ROW_FILTERS: list of boolean filters to be applied to the dataframe
# e.g. LIST_OF_ROW_FILTERS = [filter1]
# applies a single filter saved as filter 1. Notice: even if there is a single
# boolean filter, it must be declared inside brackets, as a single-element list.
# That is because the function will loop through the list of filters.
# LIST_OF_ROW_FILTERS = [filter1, filter2, filter3, filter4]
# will apply, in sequence, 4 filters: filter1, filter2, filter3, and filter4.
# Notice that the filters must be declared in the order you want to apply them.

# Filtered dataframe saved as filtered_df
# Simply modify this object on the left of equality:
filtered_df = ewf.APPLY_ROW_FILTERS_LIST (df = DATASET, list_of_row_filters = LIST_OF_ROW_FILTERS)
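
The filter syntax described in the comments above can be tried on a small example. The following is a minimal sketch in plain pandas, assuming a hypothetical dataframe with columns `column1`, `column2` and `column5`; it only illustrates building boolean filters and applying them in sequence, and is not the internal code of the `ewf` function.

```python
import pandas as pd

# Hypothetical dataframe used only for illustration.
df = pd.DataFrame({
    "column1": [-1, 0, 3, 5, 8],
    "column2": [-2, 4, -1, -3, 1],
    "column5": ["string1", "string2", "other", "string1", "other"],
})

# Each condition is wrapped in parentheses; & combines them element-wise.
filter1 = (df["column1"] >= 0) & (df["column2"] < 0)
# ~ negates the .isin() membership test.
filter2 = ~df["column5"].isin(["string1"])

list_of_row_filters = [filter1, filter2]

# Apply the filters one after the other, in the order they were declared.
filtered_df = df.copy()
for row_filter in list_of_row_filters:
    # Restrict the mask to the rows that survived the previous filters.
    filtered_df = filtered_df[row_filter.loc[filtered_df.index]]

print(filtered_df)
```

Because every filter is combined with the previous result, applying the list in sequence is equivalent to joining all filters with `&`.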

### **Obtaining the correlation plot**

In [None]:
DATASET = dataset #Alternatively: object containing the dataset to be analyzed

SHOW_MASKED_PLOT = True
#SHOW_MASKED_PLOT = True - keep as True if you want to see a cleaned version of the plot
# where a mask is applied. Alternatively, set SHOW_MASKED_PLOT = False to omit the masked plot.

RESPONSES_TO_RETURN_CORR = None
#RESPONSES_TO_RETURN_CORR - keep as None to return the full correlation matrix.
# If you want to display the correlations for a particular group of features, input them
# as a list, even if this list contains a single element. Examples:
# responses_to_return_corr = ['response1'] for a single response
# responses_to_return_corr = ['response1', 'response2', 'response3'] for multiple
# responses. Notice that 'response1',... should be substituted by the name ('string')
# of a column of the dataset that represents a response variable.
# WARNING: The returned coefficients will be ordered according to the order of the list
# of responses, i.e., they will be ordered first based on 'response1'.
# Alternatively: a list containing strings (inside quotes) with the names of the response
# columns whose correlations you want to see. Declare as a list even if it contains a
# single element.

SET_RETURNED_LIMIT = None
# SET_RETURNED_LIMIT = None - This variable only has an effect if you have
# provided a response feature to be returned. In this case, keep set_returned_limit = None
# to return all of the correlation coefficients; or, alternatively, 
# provide an integer number to limit the total of coefficients returned. 
# e.g. if set_returned_limit = 10, only the ten highest coefficients will be returned. 

EXPORT_PNG = False
# Set EXPORT_PNG = False if you do not want to export the obtained image;
# Set EXPORT_PNG = True to export the obtained image.

DIRECTORY_TO_SAVE = None
# This parameter has effect only if EXPORT_PNG = True.
# (string, in quotes): input the path of the directory 
# (e.g. folder path) where the file will be stored. e.g. DIRECTORY_TO_SAVE = "" 
# or DIRECTORY_TO_SAVE = "folder"
# If EXPORT_PNG = True and DIRECTORY_TO_SAVE = None, the file will be saved in the root
# path, DIRECTORY_TO_SAVE = ""

FILE_NAME = None
# This parameter has effect only if EXPORT_PNG = True.
# (string, in quotes): input the name you want for the file without the 
# extension, which will be 'png'. e.g. FILE_NAME = "my_image" will save a file 'my_image.png' 
# If EXPORT_PNG = True and FILE_NAME = None, the file will be saved as:
# 'correlation_plot.png'
# WARNING: if there is already a file in the path DIRECTORY_TO_SAVE saved as FILE_NAME,
# the file will be overwritten.

PNG_RESOLUTION_DPI = 330
# This parameter has effect only if EXPORT_PNG = True.
# Alternatively, input an integer that will correspond to the resolution of the exported
# image in dpi. If PNG_RESOLUTION_DPI = None, it will be set as 330.


#New dataframe saved as correlation_matrix. Simply modify this object on the left of equality:
correlation_matrix = ewf.correlation_plot (df = DATASET, show_masked_plot = SHOW_MASKED_PLOT, responses_to_return_corr = RESPONSES_TO_RETURN_CORR, set_returned_limit = SET_RETURNED_LIMIT, export_png = EXPORT_PNG, directory_to_save = DIRECTORY_TO_SAVE, file_name = FILE_NAME, png_resolution_dpi = PNG_RESOLUTION_DPI)
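
As a rough illustration of what a correlation matrix, a masked view, and a per-response ranking look like, here is a minimal sketch with pandas and NumPy on synthetic data. The column names are hypothetical, masking the upper triangle is an assumption about what a "cleaned" plot hides, and ranking by absolute value is likewise an assumption rather than the documented behavior of `ewf.correlation_plot`.

```python
import numpy as np
import pandas as pd

# Synthetic data: feature2 is almost a multiple of feature1; response1 moves against it.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
df = pd.DataFrame({
    "feature1": x,
    "feature2": 2.0 * x + rng.normal(scale=0.1, size=200),
    "response1": -x + rng.normal(scale=0.5, size=200),
})

# Pearson correlation between every pair of numeric columns.
correlation_matrix = df.corr()

# Hide the diagonal and the upper triangle, which only repeat information.
mask = np.triu(np.ones(correlation_matrix.shape, dtype=bool))
masked = correlation_matrix.mask(mask)

# Correlations with a single response, strongest (in absolute value) first.
corr_with_response = correlation_matrix["response1"].drop("response1")
order = corr_with_response.abs().sort_values(ascending=False).index
response_corr = corr_with_response.loc[order]

print(masked)
print(response_corr)
```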

### **Plotting a bar chart**
- To obtain a **Pareto chart**, keep `aggregate_function = 'sum'`, `calculate_and_plot_cumulative_percent = True`, and `orientation = 'vertical'`.
- To obtain the **data distribution of categorical variables**, select any numeric column as the response, and set `aggregate_function = 'count'`. You can also set `calculate_and_plot_cumulative_percent = True` to compare the frequencies of each possible value.

In [None]:
DATASET = dataset #Alternatively: object containing the dataset to be analyzed

CATEGORICAL_VAR_NAME = 'categorical_column_name'
# CATEGORICAL_VAR_NAME: string (inside quotes) containing the name 
# of the column to be analyzed. e.g. 
# CATEGORICAL_VAR_NAME = "column1"

RESPONSE_VAR_NAME = "response_column_name"
# RESPONSE_VAR_NAME: string (inside quotes) containing the name 
# of the column that stores the response corresponding to the
# categories. e.g. RESPONSE_VAR_NAME = "response_feature"

AGGREGATE_FUNCTION = 'sum'
# AGGREGATE_FUNCTION = 'sum': String defining the aggregation 
# method that will be applied. Possible values:
# 'median', 'mean', 'mode', 'sum', 'min', 'max', 'variance', 'count',
# 'standard_deviation','10_percent_quantile', '20_percent_quantile',
# '25_percent_quantile', '30_percent_quantile', '40_percent_quantile',
# '50_percent_quantile', '60_percent_quantile', '70_percent_quantile',
# '75_percent_quantile', '80_percent_quantile', '90_percent_quantile',
# and '95_percent_quantile'.
# To use another aggregate function, the method must be added to the
# dictionary of methods agg_methods_dict, defined in the function.
# If None or an invalid function is input, 'sum' will be used.

ADD_SUFFIX_TO_AGGREGATED_COL = True
# ADD_SUFFIX_TO_AGGREGATED_COL = True will add a suffix to the
# aggregated column. e.g. 'responseVar_mean'. If ADD_SUFFIX_TO_AGGREGATED_COL
# = False, the aggregated column will have the original column name.
SUFFIX = None
# suffix = None. Keep it None if no suffix should be added, or if
# the name of the aggregate function should be used as suffix, after
# "_". Alternatively, set it as a string. As a recommendation, put the
# "_" sign in the beginning of this string to separate the suffix from
# the original column name. e.g. if the response variable is 'Y' and
# suffix = '_agg', the new aggregated column will be named as 'Y_agg'
CALCULATE_AND_PLOT_CUMULATIVE_PERCENT = True
# CALCULATE_AND_PLOT_CUMULATIVE_PERCENT = True to calculate and plot
# the line of cumulative percent, or 
# CALCULATE_AND_PLOT_CUMULATIVE_PERCENT = False to omit it.
# This feature is only shown when AGGREGATE_FUNCTION = 'sum', 'median',
# 'mean', or 'mode'. So, it will be automatically set as False if 
# another aggregate is selected.
ORIENTATION = 'vertical'
# ORIENTATION = 'vertical' is the standard, and plots vertical bars
# (perpendicular to the X axis). In this case, the categories are shown
# in the X axis, and the corresponding responses are in Y axis.
# Alternatively, ORIENTATION = 'horizontal' results in horizontal bars.
# In this case, categories are in Y axis, and responses in X axis.
# If None or invalid values are provided, orientation is set as 'vertical'.
LIMIT_OF_PLOTTED_CATEGORIES = None
# LIMIT_OF_PLOTTED_CATEGORIES: integer value that represents
# the maximum number of categories that will be plotted. Keep it None to plot
# all categories. Alternatively, set an integer value. e.g.: if
# LIMIT_OF_PLOTTED_CATEGORIES = 4, but there are more categories,
# the dataset will be sorted in descending order and: 1) The remaining
# categories will be summed into a new category named 'others' if the
# aggregate function is 'sum'; 2) Or the other categories will be simply
# omitted from the plot, for other aggregate functions. Notice that
# it limits only the categories in the plot: all of them will be
# returned in the dataframe.
# Use this parameter to obtain a cleaner plot. Notice that the remaining
# categories will be aggregated as 'others' even if there is a single category
# beyond the limit.

X_AXIS_ROTATION = 70
#Rotation of X axis labels. Alternatively, insert any numeric value from 0 to 90 (degrees).
Y_AXIS_ROTATION = 0
#Rotation of Y axis labels. Alternatively, insert any numeric value from 0 to 90 (degrees).

GRID = True #Alternatively: True or False
# If GRID = False, no grid lines are shown in the graphic.

HORIZONTAL_AXIS_TITLE = None #Alternatively: string inside quotes for horizontal title
VERTICAL_AXIS_TITLE = None #Alternatively: string inside quotes for vertical title
PLOT_TITLE = None #Alternatively: string inside quotes for graphic title
# e.g. HORIZONTAL_AXIS_TITLE = "X", VERTICAL_AXIS_TITLE = "Y", PLOT_TITLE = "YxX"

EXPORT_PNG = False
# Set EXPORT_PNG = False if you do not want to export the obtained image;
# Set EXPORT_PNG = True to export the obtained image.

DIRECTORY_TO_SAVE = None
# This parameter has effect only if EXPORT_PNG = True.
# (string, in quotes): input the path of the directory 
# (e.g. folder path) where the file will be stored. e.g. DIRECTORY_TO_SAVE = "" 
# or DIRECTORY_TO_SAVE = "folder"
# If EXPORT_PNG = True and DIRECTORY_TO_SAVE = None, the file will be saved in the root
# path, DIRECTORY_TO_SAVE = ""

FILE_NAME = None
# This parameter has effect only if EXPORT_PNG = True.
# (string, in quotes): input the name you want for the file without the 
# extension, which will be 'png'. e.g. FILE_NAME = "my_image" will save a file 'my_image.png' 
# If EXPORT_PNG = True and FILE_NAME = None, the file will be saved as:
# 'bar_chart.png'
# WARNING: if there is already a file in the path DIRECTORY_TO_SAVE saved as FILE_NAME,
# the file will be overwritten.

PNG_RESOLUTION_DPI = 330
# This parameter has effect only if EXPORT_PNG = True.
# Alternatively, input an integer that will correspond to the resolution of the exported
# image in dpi. If PNG_RESOLUTION_DPI = None, it will be set as 330.


# New dataframe saved as aggregated_sorted_df. 
# Simply modify this object on the left of equality:
aggregated_sorted_df = ewf.bar_chart (df = DATASET, categorical_var_name = CATEGORICAL_VAR_NAME, response_var_name = RESPONSE_VAR_NAME, aggregate_function = AGGREGATE_FUNCTION, add_suffix_to_aggregated_col = ADD_SUFFIX_TO_AGGREGATED_COL, suffix = SUFFIX, calculate_and_plot_cumulative_percent = CALCULATE_AND_PLOT_CUMULATIVE_PERCENT, orientation = ORIENTATION, limit_of_plotted_categories = LIMIT_OF_PLOTTED_CATEGORIES, horizontal_axis_title = HORIZONTAL_AXIS_TITLE, vertical_axis_title = VERTICAL_AXIS_TITLE, plot_title = PLOT_TITLE, x_axis_rotation = X_AXIS_ROTATION, y_axis_rotation = Y_AXIS_ROTATION, grid = GRID, export_png = EXPORT_PNG, directory_to_save = DIRECTORY_TO_SAVE, file_name = FILE_NAME, png_resolution_dpi = PNG_RESOLUTION_DPI)
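
The numbers behind a Pareto chart can be reproduced with a few pandas operations. The sketch below assumes a hypothetical dataframe with a categorical column `defect_type` and a numeric response `cost`; it shows the aggregation by sum, the descending sort, the cumulative percent line, and the grouping of the categories beyond a limit, without claiming to mirror the internals of `ewf.bar_chart`.

```python
import pandas as pd

# Hypothetical data: one row per occurrence, with its associated cost.
df = pd.DataFrame({
    "defect_type": ["A", "B", "A", "C", "B", "A", "D"],
    "cost": [50, 20, 30, 10, 10, 20, 60],
})

# Aggregate the response by category (sum) and sort in descending order.
aggregated = (
    df.groupby("defect_type", as_index=False)["cost"]
    .sum()
    .sort_values("cost", ascending=False, ignore_index=True)
)

# Cumulative percent: share of the total reached up to each category.
aggregated["cumulative_percent"] = (
    100 * aggregated["cost"].cumsum() / aggregated["cost"].sum()
)

# With a limit of 2 plotted categories, the rest could be summed as 'others'.
limit = 2
others_total = aggregated["cost"].iloc[limit:].sum()

print(aggregated)
print("others:", others_total)
```

In this toy case the two largest categories already account for 80% of the total cost, which is the kind of reading a Pareto chart is meant to support.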

### **Calculating cumulative statistics**
- Cumulative sum (cumsum); cumulative product (cumprod); cumulative maximum (cummax); cumulative minimum (cummin)

In [None]:
DATASET = dataset #Alternatively: object containing the dataset to be analyzed

COLUMN_TO_ANALYZE = 'column_to_analyze'
# COLUMN_TO_ANALYZE = string (inside quotes) containing the name of the column that will be analyzed. 
# e.g. column_to_analyze = "column1" will analyze the column named as 'column1'.

CUMULATIVE_STATISTIC = 'sum'
# CUMULATIVE_STATISTIC: the statistic that will be calculated. The cumulative
# statistics allowed are: 'sum' (for cumulative sum, cumsum); 'product' 
# (for cumulative product, cumprod); 'max' (for cumulative maximum, cummax);
# and 'min' (for cumulative minimum, cummin).

NEW_CUM_STATS_COL_NAME = None
# NEW_CUM_STATS_COL_NAME = None or string (inside quotes), 
# containing the name of the new column created for storing the cumulative statistic
# calculated. 
# e.g. NEW_CUM_STATS_COL_NAME = "cum_stats" will create a column named as 'cum_stats'.
# If it is None, the new column will be named as column_to_analyze + "_" + [selected
# cumulative function] ('cumsum', 'cumprod', 'cummax', 'cummin')

# New dataframe saved as new_df
# Simply modify this object on the left of equality:
new_df = ewf.calculate_cumulative_stats (df = DATASET, column_to_analyze = COLUMN_TO_ANALYZE, cumulative_statistic = CUMULATIVE_STATISTIC, new_cum_stats_col_name = NEW_CUM_STATS_COL_NAME)
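
The four cumulative statistics correspond to the pandas methods `cumsum`, `cumprod`, `cummax` and `cummin`. The following minimal sketch applies them to a hypothetical column named `column1` and follows the default naming convention described above (column name + "_" + method name); the mapping dictionary is an illustrative helper, not part of the `ewf` API.

```python
import pandas as pd

# Hypothetical column used only for illustration.
df = pd.DataFrame({"column1": [2, 1, 3, 0, 4]})

# Map each allowed statistic name to the corresponding pandas method name.
cumulative_methods = {
    "sum": "cumsum",
    "product": "cumprod",
    "max": "cummax",
    "min": "cummin",
}

for statistic, method in cumulative_methods.items():
    # e.g. df["column1"].cumsum() is stored in the new column 'column1_cumsum'.
    df[f"column1_{method}"] = getattr(df["column1"], method)()

print(df)
```

Note how the cumulative product stays at zero once a zero appears in the column, and how the cumulative maximum and minimum only change when a new extreme value is reached.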

### **Obtaining scatter plots and simple linear regressions**

In [None]:
DATA_IN_SAME_COLUMN = False

# Parameters to input when DATA_IN_SAME_COLUMN = True:
DATASET = None #Alternatively: object containing the dataset to be analyzed (e.g. DATASET = dataset)
COLUMN_WITH_PREDICT_VAR_X = 'X' # Alternatively: correct name for X-column
COLUMN_WITH_RESPONSE_VAR_Y = 'Y' # Alternatively: correct name for Y-column
COLUMN_WITH_LABELS = 'label_column' # Alternatively: correct name for column with the labels or groups

# DATA_IN_SAME_COLUMN = False: set as True if all the values to plot are in the same column.
# If DATA_IN_SAME_COLUMN = True, you must specify the dataframe containing the data as DATASET;
# the column containing the predict variable (X) as COLUMN_WITH_PREDICT_VAR_X; the column 
# containing the responses to plot (Y) as COLUMN_WITH_RESPONSE_VAR_Y; and the column 
# containing the labels (subgroup) indication as COLUMN_WITH_LABELS. 
# DATASET is an object, so do not declare it in quotes. The other three arguments (columns' names) 
# are strings, so declare in quotes. 

# Example: suppose you have a dataframe saved as dataset, and two groups A and B to compare. 
# All the results for both groups are in a column named 'results', which will be plotted against
# the time, saved as 'time' (X = 'time'; Y = 'results'). If the result is for
# an entry from group A, then a column named 'group' has the value 'A'. If it is for group B,
# column 'group' shows the value 'B'. In this example:
# DATA_IN_SAME_COLUMN = True,
# DATASET = dataset,
# COLUMN_WITH_PREDICT_VAR_X = 'time',
# COLUMN_WITH_RESPONSE_VAR_Y = 'results', 
# COLUMN_WITH_LABELS = 'group'
# If you want to declare a list of dictionaries, keep DATA_IN_SAME_COLUMN = False and keep
# DATASET = None (the other arguments may be set as None, but it is not mandatory: 
# COLUMN_WITH_PREDICT_VAR_X = None, COLUMN_WITH_RESPONSE_VAR_Y = None, COLUMN_WITH_LABELS = None).


# Parameter to input when DATA_IN_SAME_COLUMN = False:
LIST_OF_DICTIONARIES_WITH_SERIES_TO_ANALYZE = [
    
    {'x': None, 'y': None, 'lab': None}, 
    {'x': None, 'y': None, 'lab': None}, 
    {'x': None, 'y': None, 'lab': None}, 
    {'x': None, 'y': None, 'lab': None}, 
    {'x': None, 'y': None, 'lab': None}, 
    {'x': None, 'y': None, 'lab': None}, 
    {'x': None, 'y': None, 'lab': None}, 
    {'x': None, 'y': None, 'lab': None}, 
    {'x': None, 'y': None, 'lab': None}, 
    {'x': None, 'y': None, 'lab': None}, 
    {'x': None, 'y': None, 'lab': None}
    
]
# LIST_OF_DICTIONARIES_WITH_SERIES_TO_ANALYZE: if data is already converted to series, lists
# or arrays, provide them as a list of dictionaries. It must be declared as a list, in brackets,
# even if there is a single dictionary.
# Always use the same keys: 'x' for the X-series (predict variables); 'y' for the Y-series
# (response variables); and 'lab' for the labels. If you do not want to declare a series, simply
# keep as None, but do not remove or rename a key (ALWAYS USE THE KEYS SHOWN AS MODEL).
# If you want, you can remove elements (dictionaries) from the list to declare fewer elements;
# and you can also add more elements (dictionaries) to the list, if you need to plot more series.
# Simply put a comma after the last element from the list and declare a new dictionary, keeping the
# same keys: {'x': x_series, 'y': y_series, 'lab': label}, where x_series, y_series and label
# represent the series and label of the added dictionary (you can pass 'lab': None, but if 
# 'x' or 'y' are None, the new dictionary will be ignored).

# Examples:
# LIST_OF_DICTIONARIES_WITH_SERIES_TO_ANALYZE = 
# [{'x': DATASET['X'], 'y': DATASET['Y'], 'lab': 'label'}]
# will plot a single series. In turn:
# LIST_OF_DICTIONARIES_WITH_SERIES_TO_ANALYZE = 
# [{'x': DATASET['X'], 'y': DATASET['Y1'], 'lab': 'label'}, {'x': DATASET['X'], 'y': DATASET['Y2'], 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}]
# will plot two series, Y1 x X and Y2 x X.
# Notice that all dictionaries where 'x' or 'y' are None are automatically ignored.
# If None is provided to 'lab', an automatic label will be generated.


X_AXIS_ROTATION = 70
#Rotation of X axis labels. Alternatively, insert any numeric value from 0 to 90 (degrees).
Y_AXIS_ROTATION = 0
#Rotation of Y axis labels. Alternatively, insert any numeric value from 0 to 90 (degrees).

SHOW_LINEAR_REG = True
# Keep SHOW_LINEAR_REG = True to plot the linear regression lines and show
# the linear regressions calculated for each pair Y x X (i.e., each fitted equation
# Y = aX + b, as well as the R² coefficient calculated). 
# Set SHOW_LINEAR_REG = False to omit both the linear regression lines on the graphic, and
# the equations and R² coefficients obtained.

GRID = True #Alternatively: True or False
# If GRID = False, no grid lines are shown in the graphic.
ADD_SPLINE_LINES = False #Alternatively: True or False
# If ADD_SPLINE_LINES = False, no lines connecting the successive values are shown.
# Since we are obtaining a scatter plot, it makes no sense to omit the dots,
# as we can do for the time series visualization function.
HORIZONTAL_AXIS_TITLE = None #Alternatively: string inside quotes for horizontal title
VERTICAL_AXIS_TITLE = None #Alternatively: string inside quotes for vertical title
PLOT_TITLE = None #Alternatively: string inside quotes for graphic title
# e.g. HORIZONTAL_AXIS_TITLE = "X", VERTICAL_AXIS_TITLE = "Y", PLOT_TITLE = "YxX"

EXPORT_PNG = False
# Set EXPORT_PNG = False if you do not want to export the obtained image;
# Set EXPORT_PNG = True to export the obtained image.
DIRECTORY_TO_SAVE = None
# This parameter has effect only if EXPORT_PNG = True.
# (string, in quotes): input the path of the directory 
# (e.g. folder path) where the file will be stored. e.g. DIRECTORY_TO_SAVE = "" 
# or DIRECTORY_TO_SAVE = "folder"
# If EXPORT_PNG = True and DIRECTORY_TO_SAVE = None, the file will be saved in the root
# path, DIRECTORY_TO_SAVE = ""
FILE_NAME = None
# This parameter has effect only if EXPORT_PNG = True.
# (string, in quotes): input the name you want for the file without the 
# extension, which will be 'png'. e.g. FILE_NAME = "my_image" will save a file 'my_image.png' 
# If EXPORT_PNG = True and FILE_NAME = None, the file will be saved as:
# 'scatter_plot_lin_reg.png'
# WARNING: if there is already a file in the path DIRECTORY_TO_SAVE saved as FILE_NAME,
# the file will be overwritten.
PNG_RESOLUTION_DPI = 330
# This parameter has effect only if EXPORT_PNG = True.
# Alternatively, input an integer that will correspond to the resolution of the exported
# image in dpi. If PNG_RESOLUTION_DPI = None, it will be set as 330.


# JSON-formatted list containing all series converted to NumPy arrays, 
#  with timestamps parsed as datetimes, and all the information regarding the linear regressions, 
# including the predicted values for plotting, returned as list_of_dictionaries_with_series_and_predictions. 
# Simply modify this object on the left of equality:
list_of_dictionaries_with_series_and_predictions = ewf.scatter_plot_lin_reg (data_in_same_column = DATA_IN_SAME_COLUMN, df = DATASET, column_with_predict_var_x = COLUMN_WITH_PREDICT_VAR_X, column_with_response_var_y = COLUMN_WITH_RESPONSE_VAR_Y, column_with_labels = COLUMN_WITH_LABELS, list_of_dictionaries_with_series_to_analyze = LIST_OF_DICTIONARIES_WITH_SERIES_TO_ANALYZE, x_axis_rotation = X_AXIS_ROTATION, y_axis_rotation = Y_AXIS_ROTATION, show_linear_reg = SHOW_LINEAR_REG, grid = GRID, add_splines_lines = ADD_SPLINE_LINES, horizontal_axis_title = HORIZONTAL_AXIS_TITLE, vertical_axis_title = VERTICAL_AXIS_TITLE, plot_title = PLOT_TITLE, export_png = EXPORT_PNG, directory_to_save = DIRECTORY_TO_SAVE, file_name = FILE_NAME, png_resolution_dpi = PNG_RESOLUTION_DPI)
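
The example from the comments above (groups A and B stored in the same column) can be used to show what a simple linear regression Y = aX + b and its R² coefficient are. This is a minimal sketch with NumPy and pandas on hypothetical values; the dictionary `fits` and its keys are illustrative names, and the least-squares fit through `np.polyfit` is one common way to obtain the line, not necessarily the method used inside `ewf.scatter_plot_lin_reg`.

```python
import numpy as np
import pandas as pd

# Hypothetical long-format data: both groups share the 'results' column.
dataset = pd.DataFrame({
    "time": [0, 1, 2, 3, 0, 1, 2, 3],
    "results": [1.1, 2.9, 5.2, 7.1, 2.0, 1.4, 1.1, 0.5],
    "group": ["A"] * 4 + ["B"] * 4,
})

fits = {}
for label, subset in dataset.groupby("group"):
    x = subset["time"].to_numpy(dtype=float)
    y = subset["results"].to_numpy(dtype=float)

    # Least-squares line: polyfit returns the highest-degree coefficient first.
    slope, intercept = np.polyfit(x, y, deg=1)
    y_pred = slope * x + intercept

    # R² = 1 - (residual sum of squares) / (total sum of squares).
    ss_res = np.sum((y - y_pred) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    fits[label] = {
        "slope": slope,
        "intercept": intercept,
        "r_squared": 1 - ss_res / ss_tot,
    }

for label, fit in fits.items():
    print(label, fit)
```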

### **Performing the polynomial fitting**

In [None]:
DATA_IN_SAME_COLUMN = False

# Parameters to input when DATA_IN_SAME_COLUMN = True:
DATASET = None #Alternatively: object containing the dataset to be analyzed (e.g. DATASET = dataset)
COLUMN_WITH_PREDICT_VAR_X = 'X' # Alternatively: correct name for X-column
COLUMN_WITH_RESPONSE_VAR_Y = 'Y' # Alternatively: correct name for Y-column
COLUMN_WITH_LABELS = 'label_column' # Alternatively: correct name for column with the labels or groups

# DATA_IN_SAME_COLUMN = False: set as True if all the values to plot are in the same column.
# If DATA_IN_SAME_COLUMN = True, you must specify the dataframe containing the data as DATASET;
# the column containing the predict variable (X) as COLUMN_WITH_PREDICT_VAR_X; the column 
# containing the responses to plot (Y) as COLUMN_WITH_RESPONSE_VAR_Y; and the column 
# containing the labels (subgroup) indication as COLUMN_WITH_LABELS. 
# DATASET is an object, so do not declare it in quotes. The other three arguments (columns' names) 
# are strings, so declare in quotes. 

# Example: suppose you have a dataframe saved as dataset, and two groups A and B to compare. 
# All the results for both groups are in a column named 'results', which will be plotted against
# the time, saved as 'time' (X = 'time'; Y = 'results'). If the result is for
# an entry from group A, then a column named 'group' has the value 'A'. If it is for group B,
# column 'group' shows the value 'B'. In this example:
# DATA_IN_SAME_COLUMN = True,
# DATASET = dataset,
# COLUMN_WITH_PREDICT_VAR_X = 'time',
# COLUMN_WITH_RESPONSE_VAR_Y = 'results', 
# COLUMN_WITH_LABELS = 'group'
# If you want to declare a list of dictionaries, keep DATA_IN_SAME_COLUMN = False and keep
# DATASET = None (the other arguments may be set as None, but it is not mandatory: 
# COLUMN_WITH_PREDICT_VAR_X = None, COLUMN_WITH_RESPONSE_VAR_Y = None, COLUMN_WITH_LABELS = None).


# Parameter to input when DATA_IN_SAME_COLUMN = False:
LIST_OF_DICTIONARIES_WITH_SERIES_TO_ANALYZE = [
    
    {'x': None, 'y': None, 'lab': None}, 
    {'x': None, 'y': None, 'lab': None}, 
    {'x': None, 'y': None, 'lab': None}, 
    {'x': None, 'y': None, 'lab': None}, 
    {'x': None, 'y': None, 'lab': None}, 
    {'x': None, 'y': None, 'lab': None}, 
    {'x': None, 'y': None, 'lab': None}, 
    {'x': None, 'y': None, 'lab': None}, 
    {'x': None, 'y': None, 'lab': None}, 
    {'x': None, 'y': None, 'lab': None}, 
    {'x': None, 'y': None, 'lab': None}
    
]
# LIST_OF_DICTIONARIES_WITH_SERIES_TO_ANALYZE: if data is already converted to series, lists
# or arrays, provide them as a list of dictionaries. It must be declared as a list, in brackets,
# even if there is a single dictionary.
# Always use the same keys: 'x' for the X-series (predict variables); 'y' for the Y-series
# (response variables); and 'lab' for the labels. If you do not want to declare a series, simply
# keep as None, but do not remove or rename a key (ALWAYS USE THE KEYS SHOWN AS MODEL).
# If you want, you can remove elements (dictionaries) from the list to declare fewer elements;
# and you can also add more elements (dictionaries) to the list, if you need to plot more series.
# Simply put a comma after the last element from the list and declare a new dictionary, keeping the
# same keys: {'x': x_series, 'y': y_series, 'lab': label}, where x_series, y_series and label
# represent the series and label of the added dictionary (you can pass 'lab': None, but if 
# 'x' or 'y' are None, the new dictionary will be ignored).

# Examples:
# LIST_OF_DICTIONARIES_WITH_SERIES_TO_ANALYZE = 
# [{'x': DATASET['X'], 'y': DATASET['Y'], 'lab': 'label'}]
# will plot a single series. In turn:
# LIST_OF_DICTIONARIES_WITH_SERIES_TO_ANALYZE = 
# [{'x': DATASET['X'], 'y': DATASET['Y1'], 'lab': 'label'}, {'x': DATASET['X'], 'y': DATASET['Y2'], 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}]
# will plot two series, Y1 x X and Y2 x X.
# Notice that all dictionaries where 'x' or 'y' are None are automatically ignored.
# If None is provided to 'lab', an automatic label will be generated.

POLYNOMIAL_DEGREE = 6
# Integer value representing the degree of the fitted polynomial.
CALCULATE_ROOTS = False
# CALCULATE_ROOTS = False.  Alternatively, set as True to calculate the roots of the
#  fitted polynomial and return them as a NumPy array.
CALCULATE_DERIVATIVE = False
# CALCULATE_DERIVATIVE = False. Alternatively, set as True to calculate the derivative of the
#  fitted polynomial and add it as a column of the dataframe.
CALCULATE_INTEGRAL = False
# CALCULATE_INTEGRAL = False. Alternatively, set as True to calculate the integral of the
#  fitted polynomial and add it as a column of the dataframe.

X_AXIS_ROTATION = 70
#Rotation of X axis labels. Alternatively, insert any numeric value from 0 to 90 (degrees).
Y_AXIS_ROTATION = 0
#Rotation of Y axis labels. Alternatively, insert any numeric value from 0 to 90 (degrees).
SHOW_POLYNOMIAL_REG = True
# Keep SHOW_POLYNOMIAL_REG = True to plot the polynomial regression curves
# calculated for each pair Y x X. 
# Set SHOW_POLYNOMIAL_REG = False to omit these plots.
GRID = True #Alternatively: True or False
# If GRID = False, no grid lines are shown in the graphic.
ADD_SPLINE_LINES = False #Alternatively: True or False
# If ADD_SPLINE_LINES = False, no lines connecting the successive values are shown.
# Since we are obtaining a scatter plot, it makes no sense to omit the dots,
# as we can do for the time series visualization function.
HORIZONTAL_AXIS_TITLE = None #Alternatively: string inside quotes for horizontal title
VERTICAL_AXIS_TITLE = None #Alternatively: string inside quotes for vertical title
PLOT_TITLE = None #Alternatively: string inside quotes for graphic title
# e.g. HORIZONTAL_AXIS_TITLE = "X", VERTICAL_AXIS_TITLE = "Y", PLOT_TITLE = "YxX"

EXPORT_PNG = False
# Set EXPORT_PNG = False if you do not want to export the obtained image;
# Set EXPORT_PNG = True to export the obtained image.
DIRECTORY_TO_SAVE = None
# This parameter has effect only if EXPORT_PNG = True.
# (string, in quotes): input the path of the directory 
# (e.g. folder path) where the file will be stored. e.g. DIRECTORY_TO_SAVE = "" 
# or DIRECTORY_TO_SAVE = "folder"
# If EXPORT_PNG = True and DIRECTORY_TO_SAVE = None, the file will be saved in the root
# path, DIRECTORY_TO_SAVE = ""
FILE_NAME = None
# This parameter has effect only if EXPORT_PNG = True.
# (string, in quotes): input the name you want for the file without the 
# extension, which will be 'png'. e.g. FILE_NAME = "my_image" will save a file 'my_image.png' 
# If EXPORT_PNG = True and FILE_NAME = None, the file will be saved as:
# 'polynomial_fitting.png'
# WARNING: if there is already a file in the path DIRECTORY_TO_SAVE saved as FILE_NAME,
# the file will be overwritten.
PNG_RESOLUTION_DPI = 330
# This parameter has effect only if EXPORT_PNG = True.
# Alternatively, input an integer that will correspond to the resolution of the exported
# image in dpi. If PNG_RESOLUTION_DPI = None, it will be set as 330.


# JSON-formatted list containing all series converted to NumPy arrays, 
#  with timestamps parsed as datetimes, and all the information regarding the polynomial regressions, 
# including the predicted values for plotting, returned as list_of_dictionaries_with_series_and_predictions. 
# Simply modify this object on the left of equality:
list_of_dictionaries_with_series_and_predictions = ewf.polynomial_fit (data_in_same_column = DATA_IN_SAME_COLUMN, df = DATASET, column_with_predict_var_x = COLUMN_WITH_PREDICT_VAR_X, column_with_response_var_y = COLUMN_WITH_RESPONSE_VAR_Y, column_with_labels = COLUMN_WITH_LABELS, list_of_dictionaries_with_series_to_analyze = LIST_OF_DICTIONARIES_WITH_SERIES_TO_ANALYZE, polynomial_degree = POLYNOMIAL_DEGREE, calculate_roots = CALCULATE_ROOTS, calculate_derivative = CALCULATE_DERIVATIVE, calculate_integral = CALCULATE_INTEGRAL, x_axis_rotation = X_AXIS_ROTATION, y_axis_rotation = Y_AXIS_ROTATION, show_polynomial_reg = SHOW_POLYNOMIAL_REG, grid = GRID, add_splines_lines = ADD_SPLINE_LINES, horizontal_axis_title = HORIZONTAL_AXIS_TITLE, vertical_axis_title = VERTICAL_AXIS_TITLE, plot_title = PLOT_TITLE, export_png = EXPORT_PNG, directory_to_save = DIRECTORY_TO_SAVE, file_name = FILE_NAME, png_resolution_dpi = PNG_RESOLUTION_DPI)
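
To clarify what the options CALCULATE_ROOTS, CALCULATE_DERIVATIVE and CALCULATE_INTEGRAL refer to, here is a minimal sketch that fits a polynomial with NumPy and then obtains its roots, derivative and integral. It uses `np.polyfit` and `np.poly1d` on hypothetical data generated from an exact quadratic (degree 2 instead of the default 6, so the results are easy to check); it is an illustration under these assumptions, not the code behind `ewf.polynomial_fit`.

```python
import numpy as np

# Hypothetical data generated from y = x² - 1, so the expected fit is known.
x = np.linspace(-2.0, 2.0, 9)
y = x ** 2 - 1.0

polynomial_degree = 2

# Least-squares fit; coefficients are returned from the highest degree down.
coefficients = np.polyfit(x, y, deg=polynomial_degree)
polynomial = np.poly1d(coefficients)

# Roots of the fitted polynomial, returned as a NumPy array.
roots = polynomial.roots

# Derivative and integral are polynomials too, and can be evaluated at x.
derivative = polynomial.deriv()
integral = polynomial.integ()  # integration constant is 0 by default

y_pred = polynomial(x)
print("coefficients:", coefficients)
print("roots:", roots)
```

A high degree such as 6 can follow noisy data very closely, so it is worth comparing the fitted curve against lower degrees before relying on its roots or derivative.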

### **Visualizing time series**

In [None]:
DATA_IN_SAME_COLUMN = False

# Parameters to input when DATA_IN_SAME_COLUMN = True:
DATASET = None #Alternatively: object containing the dataset to be analyzed (e.g. DATASET = dataset)
COLUMN_WITH_PREDICT_VAR_X = 'X' # Alternatively: correct name for X-column
COLUMN_WITH_RESPONSE_VAR_Y = 'Y' # Alternatively: correct name for Y-column
COLUMN_WITH_LABELS = 'label_column' # Alternatively: correct name for column with the labels or groups

# DATA_IN_SAME_COLUMN = False: set as True if all the values to plot are in the same column.
# If DATA_IN_SAME_COLUMN = True, you must specify the dataframe containing the data as DATASET;
# the column containing the predict variable (X) as COLUMN_WITH_PREDICT_VAR_X; the column 
# containing the responses to plot (Y) as COLUMN_WITH_RESPONSE_VAR_Y; and the column 
# containing the labels (subgroup) indication as COLUMN_WITH_LABELS. 
# DATASET is an object, so do not declare it in quotes. The other three arguments (columns' names) 
# are strings, so declare in quotes. 

# Example: suppose you have a dataframe saved as dataset, and two groups A and B to compare. 
# All the results for both groups are in a column named 'results', which will be plotted against
# the time, saved as 'time' (X = 'time'; Y = 'results'). If the result is for
# an entry from group A, then a column named 'group' has the value 'A'. If it is for group B,
# column 'group' shows the value 'B'. In this example:
# DATA_IN_SAME_COLUMN = True,
# DATASET = dataset,
# COLUMN_WITH_PREDICT_VAR_X = 'time',
# COLUMN_WITH_RESPONSE_VAR_Y = 'results', 
# COLUMN_WITH_LABELS = 'group'
# If you want to declare a list of dictionaries, keep DATA_IN_SAME_COLUMN = False and keep
# DATASET = None (the other arguments may be set as None, but it is not mandatory: 
# COLUMN_WITH_PREDICT_VAR_X = None, COLUMN_WITH_RESPONSE_VAR_Y = None, COLUMN_WITH_LABELS = None).


# Parameter to input when DATA_IN_SAME_COLUMN = False:
LIST_OF_DICTIONARIES_WITH_SERIES_TO_ANALYZE = [
    
    {'x': None, 'y': None, 'lab': None}, 
    {'x': None, 'y': None, 'lab': None}, 
    {'x': None, 'y': None, 'lab': None}, 
    {'x': None, 'y': None, 'lab': None}, 
    {'x': None, 'y': None, 'lab': None}, 
    {'x': None, 'y': None, 'lab': None}, 
    {'x': None, 'y': None, 'lab': None}, 
    {'x': None, 'y': None, 'lab': None}, 
    {'x': None, 'y': None, 'lab': None}, 
    {'x': None, 'y': None, 'lab': None}, 
    {'x': None, 'y': None, 'lab': None}
    
]
# LIST_OF_DICTIONARIES_WITH_SERIES_TO_ANALYZE: if data is already converted to series, lists
# or arrays, provide them as a list of dictionaries. It must be declared as a list, in brackets,
# even if there is a single dictionary.
# Use always the same keys: 'x' for the X-series (predict variables); 'y' for the Y-series
# (response variables); and 'lab' for the labels. If you do not want to declare a series, simply
# keep as None, but do not remove or rename a key (ALWAYS USE THE KEYS SHOWN AS MODEL).
# If you want, you can remove elements (dictionaries) from the list to declare fewer elements;
# and you can also add more elements (dictionaries) to the lists, if you need to plot more series.
# Simply put a comma after the last element from the list and declare a new dictionary, keeping the
# same keys: {'x': x_series, 'y': y_series, 'lab': label}, where x_series, y_series and label
# represents the series and label of the added dictionary (you can pass 'lab': None, but if 
# 'x' or 'y' are None, the new dictionary will be ignored).

# Examples:
# LIST_OF_DICTIONARIES_WITH_SERIES_TO_ANALYZE = 
# [{'x': DATASET['X'], 'y': DATASET['Y'], 'lab': 'label'}]
# will plot a single variable. In turn:
# LIST_OF_DICTIONARIES_WITH_SERIES_TO_ANALYZE = 
# [{'x': DATASET['X'], 'y': DATASET['Y1'], 'lab': 'label'}, {'x': DATASET['X'], 'y': DATASET['Y2'], 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}, {'x': None, 'y': None, 'lab': None}]
# will plot two series, Y1 x X and Y2 x X.
# Notice that all dictionaries where 'x' or 'y' are None are automatically ignored.
# If None is provided to 'lab', an automatic label will be generated.


X_AXIS_ROTATION = 70
#Rotation of X axis labels. Alternatively, insert any numeric value from 0 to 90 (degrees).
Y_AXIS_ROTATION = 0
#Rotation of Y axis labels. Alternatively, insert any numeric value from 0 to 90 (degrees).
GRID = True #Alternatively: True or False
# If GRID = False, no grid lines are shown in the graphic.
ADD_SPLINE_LINES = True #Alternatively: True or False
# If ADD_SPLINE_LINES = False, no lines connecting the successive values are shown.
# Unlike the scatter plot function, this time series visualization also allows omitting
# the dots:
ADD_SCATTER_DOTS = False
# If ADD_SCATTER_DOTS = False, no dots representing the data points are shown.
HORIZONTAL_AXIS_TITLE = None #Alternatively: string inside quotes for horizontal title
VERTICAL_AXIS_TITLE = None #Alternatively: string inside quotes for vertical title
PLOT_TITLE = None #Alternatively: string inside quotes for graphic title
# e.g. HORIZONTAL_AXIS_TITLE = "X", VERTICAL_AXIS_TITLE = "Y", PLOT_TITLE = "YxX"

EXPORT_PNG = False
# Set EXPORT_PNG = False if you do not want to export the obtained image;
# Set EXPORT_PNG = True to export the obtained image.
DIRECTORY_TO_SAVE = None
# This parameter has effect only if EXPORT_PNG = True.
# (string, in quotes): input the path of the directory 
# (e.g. folder path) where the file will be stored. e.g. DIRECTORY_TO_SAVE = "" 
# or DIRECTORY_TO_SAVE = "folder"
# If EXPORT_PNG = True and DIRECTORY_TO_SAVE = None, the file will be saved in the root
# path, DIRECTORY_TO_SAVE = ""
FILE_NAME = None
# This parameter has effect only if EXPORT_PNG = True.
# (string, in quotes): input the name you want for the file without the 
# extension, which will be 'png'. e.g. FILE_NAME = "my_image" will save a file 'my_image.png' 
# If EXPORT_PNG = True and FILE_NAME = None, the file will be saved as:
# 'time_series_vis.png'
# WARNING: if there is already a file in the path DIRECTORY_TO_SAVE saved as FILE_NAME,
# the file will be overwritten.
PNG_RESOLUTION_DPI = 330
# This parameter has effect only if EXPORT_PNG = True.
# Alternatively, input an integer that will correspond to the resolution of the exported
# image in dpi. If PNG_RESOLUTION_DPI = None, it will be set as 330.


ewf.time_series_vis (data_in_same_column = DATA_IN_SAME_COLUMN, df = DATASET, column_with_predict_var_x = COLUMN_WITH_PREDICT_VAR_X, column_with_response_var_y = COLUMN_WITH_RESPONSE_VAR_Y, column_with_labels = COLUMN_WITH_LABELS, list_of_dictionaries_with_series_to_analyze = LIST_OF_DICTIONARIES_WITH_SERIES_TO_ANALYZE, x_axis_rotation = X_AXIS_ROTATION, y_axis_rotation = Y_AXIS_ROTATION, grid = GRID, add_splines_lines = ADD_SPLINE_LINES, add_scatter_dots = ADD_SCATTER_DOTS, horizontal_axis_title = HORIZONTAL_AXIS_TITLE, vertical_axis_title = VERTICAL_AXIS_TITLE, plot_title = PLOT_TITLE, export_png = EXPORT_PNG, directory_to_save = DIRECTORY_TO_SAVE, file_name = FILE_NAME, png_resolution_dpi = PNG_RESOLUTION_DPI)
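
The two input layouts described above (a single long-format dataframe versus a list of dictionaries) carry the same information. As a minimal sketch, assuming a hypothetical dataframe with the columns `time`, `results` and `group` from the example in the comments, plain pandas can convert the first layout into the second. The name `list_of_series` is illustrative only and is not part of `idsw`.

```python
import pandas as pd

# Hypothetical long-format data: one row per (time, group) observation.
dataset = pd.DataFrame({
    "time": [1, 2, 3, 1, 2, 3],
    "results": [10.0, 12.5, 15.0, 9.0, 11.0, 14.5],
    "group": ["A", "A", "A", "B", "B", "B"],
})

# Build one dictionary per group, keeping the keys 'x', 'y' and 'lab'.
list_of_series = [
    {"x": sub_df["time"], "y": sub_df["results"], "lab": label}
    for label, sub_df in dataset.groupby("group")
]

for series in list_of_series:
    print(series["lab"], list(series["y"]))
```

A list built this way could then be passed as LIST_OF_DICTIONARIES_WITH_SERIES_TO_ANALYZE, with DATA_IN_SAME_COLUMN = False.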

### **Visualizing histograms**

In [None]:
# REMEMBER: A histogram is the representation of a statistical distribution 
# of a given variable.

DATASET = dataset #Alternatively: object containing the dataset to be analyzed

COLUMN_TO_ANALYZE = 'analyzed_variable'
#Alternatively: other column in quotes, substituting 'analyzed_variable'
# e.g., if the analyzed variable is in a column named 'column1':
# COLUMN_TO_ANALYZE = 'column1'

TOTAL_OF_BINS = 10
# This parameter must be an integer number: it represents the total of bins of the 
# histogram, i.e., the number of divisions of the sample space (how many intervals
# the sample space will be divided into).
# Manually adjust this parameter to obtain more or less resolution of the statistical
# distribution: fewer bins tend to result in higher counts of values per bin, since
# a larger interval of values is grouped. After modifying the total of bins, do not forget
# to adjust the bar width in SET_GRAPHIC_BAR_WIDTH.
# Examples: TOTAL_OF_BINS = 50, to divide the sample space into 50 equally-separated 
# intervals; TOTAL_OF_BINS = 10 to divide it into 10 intervals; TOTAL_OF_BINS = 100 to
# divide it into 100 intervals.
NORMAL_CURVE_OVERLAY = True
#Alternatively: set NORMAL_CURVE_OVERLAY = True to show a normal curve overlaying the
# histogram; or set NORMAL_CURVE_OVERLAY = False to omit the normal curve (show only
# the histogram).

X_AXIS_ROTATION = 0
#Rotation of X axis labels. Alternatively, insert any numeric value from 0 to 90 (degrees).
Y_AXIS_ROTATION = 0
#Rotation of Y axis labels. Alternatively, insert any numeric value from 0 to 90 (degrees).
GRID = True #Alternatively: True or False
# If GRID = False, no grid lines are shown in the graphic.
HORIZONTAL_AXIS_TITLE = None #Alternatively: string inside quotes for horizontal title
VERTICAL_AXIS_TITLE = None #Alternatively: string inside quotes for vertical title
PLOT_TITLE = None #Alternatively: string inside quotes for graphic title
# e.g. HORIZONTAL_AXIS_TITLE = "X", VERTICAL_AXIS_TITLE = "Y", PLOT_TITLE = "YxX"

EXPORT_PNG = False
# Set EXPORT_PNG = False if you do not want to export the obtained image;
# Set EXPORT_PNG = True to export the obtained image.
DIRECTORY_TO_SAVE = None
# This parameter has effect only if EXPORT_PNG = True.
# (string, in quotes): input the path of the directory 
# (e.g. folder path) where the file will be stored. e.g. DIRECTORY_TO_SAVE = "" 
# or DIRECTORY_TO_SAVE = "folder"
# If EXPORT_PNG = True and DIRECTORY_TO_SAVE = None, the file will be saved in the root
# path, DIRECTORY_TO_SAVE = ""
FILE_NAME = None
# This parameter has effect only if EXPORT_PNG = True.
# (string, in quotes): input the name you want for the file without the 
# extension, which will be 'png'. e.g. FILE_NAME = "my_image" will save a file 'my_image.png' 
# If EXPORT_PNG = True and FILE_NAME = None, the file will be saved as:
# 'histogram.png'
# WARNING: if there is already a file in the path DIRECTORY_TO_SAVE saved as FILE_NAME,
# the file will be overwritten.
PNG_RESOLUTION_DPI = 330
# This parameter has effect only if EXPORT_PNG = True.
# Alternatively, input an integer that will correspond to the resolution of the exported
# image in dpi. If PNG_RESOLUTION_DPI = None, it will be set as 330.

#New dataframes saved as general_stats and frequency_table.
# Simply modify these objects on the left of equality:
general_stats, frequency_table = ewf.histogram (df = DATASET, column_to_analyze = COLUMN_TO_ANALYZE, total_of_bins = TOTAL_OF_BINS, normal_curve_overlay = NORMAL_CURVE_OVERLAY, x_axis_rotation = X_AXIS_ROTATION, y_axis_rotation = Y_AXIS_ROTATION, grid = GRID, horizontal_axis_title = HORIZONTAL_AXIS_TITLE, vertical_axis_title = VERTICAL_AXIS_TITLE, plot_title = PLOT_TITLE, export_png = EXPORT_PNG, directory_to_save = DIRECTORY_TO_SAVE, file_name = FILE_NAME, png_resolution_dpi = PNG_RESOLUTION_DPI)
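
To make the role of TOTAL_OF_BINS concrete, the sketch below bins a synthetic sample with NumPy and builds a small frequency table. It is only an illustration under stated assumptions: the data are randomly generated, and the column names of `frequency_table` are hypothetical, so they may differ from what `ewf.histogram` actually returns.

```python
import numpy as np
import pandas as pd

# Synthetic sample standing in for DATASET[COLUMN_TO_ANALYZE].
rng = np.random.default_rng(seed=0)
values = pd.Series(rng.normal(loc=50.0, scale=5.0, size=200), name="analyzed_variable")

total_of_bins = 10

# np.histogram splits the observed range into equally spaced intervals
# and counts how many values fall into each one.
counts, bin_edges = np.histogram(values, bins=total_of_bins)

frequency_table = pd.DataFrame({
    "bin_start": bin_edges[:-1],
    "bin_end": bin_edges[1:],
    "count": counts,
})
frequency_table["relative_frequency"] = frequency_table["count"] / counts.sum()

# Summary statistics of the analyzed variable.
general_stats = values.describe()

print(frequency_table)
print(general_stats)
```

Rerunning the sketch with a smaller `total_of_bins` gives wider intervals and larger counts per bin, which is the trade-off described in the comments above.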

### **Testing data normality and visualizing the probability plot**
- Check the probability that data is actually described by a normal distribution.

In [None]:
# WARNING: The statistical tests require at least 20 samples

DATASET = dataset #Alternatively: object containing the dataset to be analyzed

COLUMN_TO_ANALYZE = 'column_to_analyze' 
# COLUMN_TO_ANALYZE: column (variable) of the dataset that will be tested. Declare as a string,
# in quotes.
# e.g. COLUMN_TO_ANALYZE = 'col1' will analyze a column named 'col1'.

COLUMN_WITH_LABELS_TO_TEST_SUBGROUPS = None
# column_with_labels_to_test_subgroups: if there is a column with labels or
# subgroup indication, and the normality should be tested separately for each label, indicate
# it here as a string (in quotes). e.g. column_with_labels_to_test_subgroups = 'col2' 
# will retrieve the labels from 'col2'.
# Keep column_with_labels_to_test_subgroups = None if a single series (the whole column)
# will be tested.
    
ALPHA = 0.10
# Confidence level = 1 - ALPHA. For ALPHA = 0.10, we get a 0.90 = 90% confidence
# Set ALPHA = 0.05 to get 0.95 = 95% confidence in the analysis.
# Notice that, when less trust is needed, we can increase ALPHA to get less restrictive
# results.

SHOW_PROBABILITY_PLOT = True
#Alternatively: set SHOW_PROBABILITY_PLOT = True to obtain the probability plot for the
# variable Y (normal distribution tested). 
# Set SHOW_PROBABILITY_PLOT = False to omit the probability plot.
X_AXIS_ROTATION = 0
#Rotation of X axis labels. Alternatively, insert any numeric value from 0 to 90 (degrees).
Y_AXIS_ROTATION = 0
#Rotation of Y axis labels. Alternatively, insert any numeric value from 0 to 90 (degrees).
GRID = True #Alternatively: True or False
# If GRID = False, no grid lines are shown in the graphic.

EXPORT_PNG = False
# Set EXPORT_PNG = False if you do not want to export the obtained image;
# Set EXPORT_PNG = True to export the obtained image.
DIRECTORY_TO_SAVE = None
# This parameter has effect only if EXPORT_PNG = True.
# (string, in quotes): input the path of the directory 
# (e.g. folder path) where the file will be stored. e.g. DIRECTORY_TO_SAVE = "" 
# or DIRECTORY_TO_SAVE = "folder"
# If EXPORT_PNG = True and DIRECTORY_TO_SAVE = None, the file will be saved in the root
# path, DIRECTORY_TO_SAVE = ""
FILE_NAME = None
# This parameter has effect only if EXPORT_PNG = True.
# (string, in quotes): input the name you want for the file without the 
# extension, which will be 'png'. e.g. FILE_NAME = "my_image" will save a file 'my_image.png' 
# If EXPORT_PNG = True and FILE_NAME = None, the file will be saved as:
# 'probability_plot_normal.png'
# WARNING: if there is already a file in the path DIRECTORY_TO_SAVE saved as FILE_NAME,
# the file will be overwritten.
PNG_RESOLUTION_DPI = 330
# This parameter has effect only if EXPORT_PNG = True.
# Alternatively, input an integer that will correspond to the resolution of the exported
# image in dpi. If PNG_RESOLUTION_DPI = None, it will be set as 330.

# List of dictionaries containing the series, p-values, skewness and kurtosis returned as
# list_of_dicts
# Simply modify this object on the left of equality:
list_of_dicts = ewf.test_data_normality (df = DATASET, column_to_analyze = COLUMN_TO_ANALYZE, column_with_labels_to_test_subgroups = COLUMN_WITH_LABELS_TO_TEST_SUBGROUPS, alpha = ALPHA, show_probability_plot = SHOW_PROBABILITY_PLOT, x_axis_rotation = X_AXIS_ROTATION, y_axis_rotation = Y_AXIS_ROTATION, grid = GRID, export_png = EXPORT_PNG, directory_to_save = DIRECTORY_TO_SAVE, file_name = FILE_NAME, png_resolution_dpi = PNG_RESOLUTION_DPI)
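
The decision rule behind ALPHA can be sketched with SciPy directly. This is not the implementation used by `ewf.test_data_normality`: it assumes SciPy is installed, uses a synthetic sample, and runs only two of the tests (Shapiro-Wilk and D'Agostino-Pearson). The names `sample` and `results` are illustrative.

```python
import numpy as np
from scipy import stats

# Synthetic sample drawn from a normal distribution.
rng = np.random.default_rng(seed=42)
sample = rng.normal(loc=0.0, scale=1.0, size=200)

alpha = 0.10  # significance level; confidence level = 1 - alpha

# Both tests take normality as the null hypothesis.
shapiro_stat, shapiro_p = stats.shapiro(sample)
dagostino_stat, dagostino_p = stats.normaltest(sample)

results = {
    "shapiro_wilk": shapiro_p,
    "dagostino_pearson": dagostino_p,
}

for test_name, p_value in results.items():
    # A p-value above alpha means normality cannot be rejected.
    verdict = "fail to reject normality" if p_value > alpha else "reject normality"
    print(f"{test_name}: p = {p_value:.3f} -> {verdict}")

# Shape descriptors often reported together with the tests.
skewness = stats.skew(sample)
kurtosis = stats.kurtosis(sample)  # excess kurtosis (0 for a normal distribution)
print(f"skewness = {skewness:.3f}, excess kurtosis = {kurtosis:.3f}")
```

Increasing `alpha` makes the rule reject normality more easily, since a p-value must then exceed a higher threshold for the hypothesis to stand.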

### **Testing and visualizing probability plots for different statistical distributions**

In [None]:
# WARNING: The statistical tests require at least 20 samples
# Attention: if you want to test a normal distribution, use the function 
# test_data_normality. Function test_data_normality tests normality through 4 methods 
# and compares them: D’Agostino and Pearson’s; Shapiro-Wilk; Lilliefors; and Anderson-Darling tests.
# The calculation of the p-value from the Anderson-Darling statistic is available only 
# for some distributions. The function specific for normality calculates these 
# probabilities of following the normal distribution.
# Here, the function is intended to test a variety of distributions, and so only the 
# Anderson-Darling test is performed.

DATASET = dataset #Alternatively: object containing the dataset to be analyzed

COLUMN_TO_ANALYZE = 'column_to_analyze' 
# COLUMN_TO_ANALYZE: column (variable) of the dataset that will be tested. Declare as a string,
# in quotes.
# e.g. COLUMN_TO_ANALYZE = 'col1' will analyze a column named 'col1'.

COLUMN_WITH_LABELS_TO_TEST_SUBGROUPS = None
# column_with_labels_to_test_subgroups: if there is a column with labels or
# subgroup indication, and the distribution should be tested separately for each label, indicate
# it here as a string (in quotes). e.g. column_with_labels_to_test_subgroups = 'col2' 
# will retrieve the labels from 'col2'.
# Keep column_with_labels_to_test_subgroups = None if a single series (the whole column)
# will be tested.

STATISTICAL_DISTRIBUTION_TO_TEST = 'lognormal'
# STATISTICAL_DISTRIBUTION_TO_TEST: string (inside quotes) containing the tested statistical 
# distribution. 
## Notice: if data Y follows a 'lognormal', log(Y) follows a normal distribution.
## The 'exponential', 'erlang' and 'chi-squared' distributions are special cases of the 'gamma' distribution.
## There are 91 accepted statistical distributions:
# 'alpha', 'anglit', 'arcsine', 'beta', 'beta_prime', 'bradford', 'burr', 'burr12', 'cauchy',
# 'chi', 'chi-squared', 'cosine', 'double_gamma', 'double_weibull', 
# 'erlang', 'exponential', 'exponentiated_weibull', 'exponential_power',
# 'fatigue_life_birnbaum-saunders', 'fisk_log_logistic', 'folded_cauchy', 'folded_normal',
# 'F', 'gamma', 'generalized_logistic', 'generalized_pareto', 'generalized_exponential', 
# 'generalized_extreme_value', 'generalized_gamma', 'generalized_half-logistic', 
# 'generalized_inverse_gaussian', 'generalized_normal', 
# 'gilbrat', 'gompertz_truncated_gumbel', 'gumbel', 'gumbel_left-skewed', 'half-cauchy', 
# 'half-normal', 'half-logistic', 'hyperbolic_secant', 'gauss_hypergeometric', 
# 'inverted_gamma', 'inverse_normal', 'inverted_weibull', 'johnson_SB', 'johnson_SU', 
# 'KSone','KStwobign', 'laplace', 'left-skewed_levy', 
# 'levy', 'logistic', 'log_laplace', 'log_gamma', 'lognormal', 'log-uniform', 'maxwell', 
# 'mielke_Beta-Kappa', 'nakagami', 'noncentral_chi-squared', 'noncentral_F', 
# 'noncentral_t', 'normal', 'normal_inverse_gaussian', 'pareto', 'lomax', 'power_lognormal',
# 'power_normal', 'power-function', 'R', 'rayleigh', 'rice', 'reciprocal_inverse_gaussian', 
# 'semicircular', 'student-t', 'triangular', 
# 'truncated_exponential', 'truncated_normal', 'tukey-lambda', 'uniform', 'von_mises', 
# 'wald', 'weibull_maximum_extreme_value', 'weibull_minimum_extreme_value', 'wrapped_cauchy'


# List of dictionaries containing the series, p-values, skewness and kurtosis returned as
# list_of_dicts
# Simply modify this object on the left of equality:
list_of_dicts = ewf.test_stat_distribution (df = DATASET, column_to_analyze = COLUMN_TO_ANALYZE, column_with_labels_to_test_subgroups = COLUMN_WITH_LABELS_TO_TEST_SUBGROUPS, statistical_distribution_to_test = STATISTICAL_DISTRIBUTION_TO_TEST)
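
The remark that log(Y) follows a normal distribution when Y is lognormal can be checked with a probability plot. The sketch below is an illustration only, assuming SciPy is available and using synthetic data. It relies on `scipy.stats.probplot`, not on the Anderson-Darling procedure applied by `ewf.test_stat_distribution`, and compares the straightness of the normal probability plot before and after the log transform.

```python
import numpy as np
from scipy import stats

# Synthetic lognormal sample.
rng = np.random.default_rng(seed=1)
y = rng.lognormal(mean=1.0, sigma=0.5, size=300)

# probplot returns the plotted quantiles and a least-squares fit;
# r is the correlation coefficient of that fit (closer to 1 = straighter plot).
(_, _), (slope_raw, intercept_raw, r_raw) = stats.probplot(y, dist="norm")
(_, _), (slope_log, intercept_log, r_log) = stats.probplot(np.log(y), dist="norm")

print(f"normal probability plot of Y:      r = {r_raw:.4f}")
print(f"normal probability plot of log(Y): r = {r_log:.4f}")
```

The transformed data should give the value of `r` closer to 1, which is why a variable suspected to be lognormal can also be examined by testing the normality of its logarithm.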

### **Filtering (selecting); ordering; or renaming columns from the dataframe**

In [None]:
DATASET = dataset #Alternatively: object containing the dataset to be analyzed

MODE = 'select_or_order_columns'
# MODE = 'select_or_order_columns' for filtering only the list of columns passed as COLUMNS_LIST,
# and setting a new column order. In this mode, you can pass the columns in any order: 
# the order of elements on the list will be the new order of columns.

# MODE = 'rename_columns' for renaming the columns with the names passed as COLUMNS_LIST. In this
# mode, the list must have the same length and the same order as the columns of the dataframe. That is
# because the columns will sequentially receive the names in the list. So, a mismatch of positions
# will result in columns with incorrect names.

COLUMNS_LIST = ['column1', 'column2', 'column3']
# COLUMNS_LIST = list of strings containing the names (headers) of the columns to select
# (filter); or to be set as the new columns' names, according to the selected mode.
# For instance: COLUMNS_LIST = ['col1', 'col2', 'col3'] will 
# select columns 'col1', 'col2', and 'col3' (or rename the columns with these names). 
# Declare the names inside quotes.
# Simply substitute the list by the list of columns that you want to select; or the
# list of the new names you want to give to the dataset columns.

# New dataframe saved as new_df. Simply modify this object on the left of equality:
new_df = ewf.select_order_or_rename_columns (df = DATASET, columns_list = COLUMNS_LIST, mode = MODE)
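
For readers more familiar with plain pandas, the two modes correspond roughly to the operations sketched below. This is an assumption about equivalent behavior, not the internal code of `ewf.select_order_or_rename_columns`. The dataframe and the new names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2], "col2": [3, 4], "col3": [5, 6]})

# 'select_or_order_columns': keep only the listed columns, in the listed order.
selected_df = df[["col3", "col1"]]

# 'rename_columns': assign new names positionally, so the list must have
# one name per column, in the order the columns appear in the dataframe.
renamed_df = df.copy()
renamed_df.columns = ["a", "b", "c"]

print(list(selected_df.columns))
print(list(renamed_df.columns))
```

Because the renaming is positional, swapping two names in the list silently swaps the labels of two columns, which is the mismatch the comments above warn about.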

### **Renaming specific columns from the dataframe; or cleaning columns' labels**
- In 'rename_columns' mode, the function `select_order_or_rename_columns` requires the user to pass a list containing the names of all columns.
- Also, this list must contain the columns in the correct order (the order in which they appear in the dataframe).
- The function `rename_or_clean_columns_labels`, in contrast, may manipulate one or several columns per call, and does not depend on their order.
- It can also be used for cleaning the columns' labels: converting all columns' names to upper case or lower case; replacing substrings in columns' names; or eliminating trailing and leading white spaces or characters from columns' labels.

In [None]:
DATASET = dataset #Alternatively: object containing the dataset to be analyzed

MODE = 'set_new_names'
# MODE = 'set_new_names' will change the columns according to the specifications in
# LIST_OF_COLUMNS_LABELS.

# MODE = 'capitalize_columns' will capitalize all columns names (i.e., they will be put in
# upper case). e.g. a column named 'column' will be renamed as 'COLUMN'

# MODE = 'lowercase_columns' will lower the case of all columns names. e.g. a column named
# 'COLUMN' will be renamed as 'column'.

# MODE = 'replace_substring' will search on the columns names (strings) for the 
# SUBSTRING_TO_BE_REPLACED (which may be a character or a string); and will replace it by 
# NEW_SUBSTRING_FOR_REPLACEMENT (which again may be either a character or a string). 
# Numbers (integers or floats) will be automatically converted into strings.
# As an example, consider the default situation where we search for a whitespace ' ' and replace it
# by underscore '_': SUBSTRING_TO_BE_REPLACED = ' ', NEW_SUBSTRING_FOR_REPLACEMENT = '_'  
# In this case, a column named 'new column' will be renamed as 'new_column'.

# MODE = 'trim' will remove all trailing or leading whitespaces from column names.
# e.g. a column named as ' col1 ' will be renamed as 'col1'; 'col2 ' will be renamed as
# 'col2'; and ' col3' will be renamed as 'col3'.

# MODE = 'eliminate_trailing_characters' will eliminate a defined trailing and leading 
# substring from the columns' names. 
# The substring must be indicated as TRAILING_SUBSTRING, and its default, when no value
# is provided, is equivalent to mode = 'trim' (eliminate white spaces). 
# e.g., if TRAILING_SUBSTRING = '_test' and you have a column named 'col_test', it will be 
# renamed as 'col'.

SUBSTRING_TO_BE_REPLACED = ' '
NEW_SUBSTRING_FOR_REPLACEMENT = '_'

TRAILING_SUBSTRING = None

LIST_OF_COLUMNS_LABELS = [
    
    {'column_name': None, 'new_column_name': None}, 
    {'column_name': None, 'new_column_name': None},
    {'column_name': None, 'new_column_name': None}, 
    {'column_name': None, 'new_column_name': None},
    {'column_name': None, 'new_column_name': None}, 
    {'column_name': None, 'new_column_name': None},
    {'column_name': None, 'new_column_name': None}, 
    {'column_name': None, 'new_column_name': None}, 
    {'column_name': None, 'new_column_name': None},
    {'column_name': None, 'new_column_name': None}, 
    {'column_name': None, 'new_column_name': None}
    
]
# LIST_OF_COLUMNS_LABELS = [{'column_name': None, 'new_column_name': None}]
# This is a list of dictionaries, where each dictionary contains two key-value pairs:
# the first one contains the original column name; and the second one contains the new name
# that will substitute the original one. The function will loop through all dictionaries in
# this list, access the value of the key 'column_name', and replace that label 
# by the corresponding value in key 'new_column_name'.
    
# The object LIST_OF_COLUMNS_LABELS must be declared as a list, 
# in brackets, even if there is a single dictionary.
# Use always the same keys: 'column_name' for the original label; 
# and 'new_column_name', for the corresponding new label.
# Notice that this function will not search substrings: it will substitute a value only when
# there is perfect correspondence between the string in 'column_name' and one of the columns
# labels. So, the cases (upper or lower) must be the same.
    
# If you want, you can remove elements (dictionaries) from the list to declare fewer elements;
# and you can also add more elements (dictionaries) to the lists, if you need to replace more
# values.
# Simply put a comma after the last element from the list and declare a new dictionary, keeping the
# same keys: {'column_name': original_col, 'new_column_name': new_col}, 
# where original_col and new_col represent the strings for searching and replacement 
# (If one of the keys contains None, the new dictionary will be ignored).
# Example: LIST_OF_COLUMNS_LABELS = [{'column_name': 'col1', 'new_column_name': 'col'}] will
# rename 'col1' as 'col'.


# New dataframe saved as new_df. Simply modify this object on the left of equality:
new_df = ewf.rename_or_clean_columns_labels (df = DATASET, mode = MODE, substring_to_be_replaced = SUBSTRING_TO_BE_REPLACED, new_substring_for_replacement = NEW_SUBSTRING_FOR_REPLACEMENT, trailing_substring = TRAILING_SUBSTRING, list_of_columns_labels = LIST_OF_COLUMNS_LABELS)
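
Each cleaning mode described above has a close counterpart in plain pandas. The sketch below shows those counterparts on a hypothetical dataframe. It illustrates the transformations themselves and makes no claim about how `ewf.rename_or_clean_columns_labels` is implemented. `str.removesuffix` requires Python 3.9 or later.

```python
import pandas as pd

df = pd.DataFrame({" new column ": [1], "COL_test": [2]})

# 'set_new_names': exact, case-sensitive match on the original label.
renamed = df.rename(columns={"COL_test": "col"})

# 'capitalize_columns' and 'lowercase_columns'.
upper = df.rename(columns=str.upper)
lower = df.rename(columns=str.lower)

# 'trim': remove leading and trailing whitespace.
trimmed = df.rename(columns=str.strip)

# 'replace_substring': here, whitespace replaced by underscore.
replaced = trimmed.rename(columns=lambda name: name.replace(" ", "_"))

# Removing a trailing substring such as '_test' (Python 3.9+).
no_suffix = df.rename(columns=lambda name: name.removesuffix("_test"))

print(list(replaced.columns))
```

Trimming before replacing whitespace avoids labels such as `_new_column_`, so the order in which the cleaning steps are applied matters.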

### **Merging (joining) the data on a timestamp column**

In [None]:
DF_LEFT = dataset1 #Alternatively: object containing the dataset to be joined on the left
DF_RIGHT = dataset2 #Alternatively: object containing the dataset to be joined on the right

LEFT_KEY = "DATE" 
#Alternatively: (string) name of the column of the left dataframe to be used as key for 
# joining. Keep inside quotes.
RIGHT_KEY = "DATE"
#Alternatively: (string) name of the column of the right dataframe to be used as key for 
# joining. Keep inside quotes.

HOW_TO_JOIN = "inner"
#Alternatively: "inner", "outer", "left", "right". This option has no effect 
# if MERGE_METHOD = "asof". Keep inside quotes.

MERGE_METHOD = "asof"
# Alternatively: MERGE_METHOD = 'ordered' to use pandas .merge_ordered method, or
# MERGE_METHOD = "asof" for using the .merge_asof method.
# WARNING: .merge_asof uses fuzzy matching, so the HOW_TO_JOIN parameter is not applicable.
# Keep inside quotes.

## USE MERGE_METHOD = 'asof' to merge data collected asynchronously, i.e., data collected in
# different moments, resulting in timestamps that do not perfectly match.
# merge_asof method sorts the timestamps in ascending order and does not look for a perfect 
# combination of keys. Instead, it takes the timestamp from the right dataframe as key, and 
# searches for the closest timestamp on the left dataframe. So, it inserts the row from the right in 
# the correct position it should have in the left dataframe (in other words, it appends the rows from
# one into the other, respecting the columns and the time order).
# If a missing value would be generated, the 'ffill' parameter can be used to automatically 
# repeat the previous value (from the left dataframe) on the row that came from the right table,
# filling the missing values.

MERGED_SUFFIXES = ('_left', '_right')
# MERGED_SUFFIXES = ('_left', '_right') - tuple of the suffixes to be added to columns.
# Example: suppose both datasets have the column 'Value'. The column from the left dataset
# will be renamed as "Value_left", and the column from the right dataset will be renamed as
# "Value_right".
# Alternatively: modify the strings inside quotes to modify the standard values. 
# Do not eliminate the parentheses that indicate the tuple object.
# A tuple is an immutable sequence of objects. It is declared inside parentheses, instead of
# the brackets used for lists: []

ASOF_DIRECTION = "nearest"
# Parameter of .merge_asof method. 'nearest' merge the closest timestamps in both directions.
# Alternatively: 'backward' or 'forward'.
# This option has no effect if MERGE_METHOD = "ordered". Keep inside quotes.

ORDERED_FILLING = 'ffill'
# Parameter of .merge_ordered method.
# ORDERED_FILLING = 'ffill' (inside quotes) fills missing values 
# with the previous value.
# This option has no effect if MERGE_METHOD = "asof", so you can keep it None


#New dataframe saved as merged_df. Simply modify this object on the left of equality:
merged_df = ewf.MERGE_ON_TIMESTAMP (df_left = DF_LEFT, df_right = DF_RIGHT, left_key = LEFT_KEY, right_key = RIGHT_KEY, how_to_join = HOW_TO_JOIN, merge_method = MERGE_METHOD, merged_suffixes = MERGED_SUFFIXES, asof_direction = ASOF_DIRECTION, ordered_filling = ORDERED_FILLING)
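
The difference between the two merge methods is easier to see on a small example. The sketch below calls the underlying pandas functions `pd.merge_asof` and `pd.merge_ordered` directly on two hypothetical dataframes whose timestamps do not match exactly. How the `idsw` wrapper maps its arguments onto these functions is not shown here and should not be inferred from this sketch.

```python
import pandas as pd

left = pd.DataFrame({
    "DATE": pd.to_datetime(
        ["2023-01-01 10:00:00", "2023-01-01 10:05:00", "2023-01-01 10:10:00"]
    ),
    "Value": [1.0, 2.0, 3.0],
})
right = pd.DataFrame({
    "DATE": pd.to_datetime(
        ["2023-01-01 10:00:30", "2023-01-01 10:04:00", "2023-01-01 10:11:00"]
    ),
    "Value": [10.0, 20.0, 30.0],
})

# merge_asof: both inputs must be sorted by the key; each left row receives
# the right row with the closest timestamp (direction='nearest').
merged_asof = pd.merge_asof(
    left.sort_values("DATE"),
    right.sort_values("DATE"),
    on="DATE",
    direction="nearest",
    suffixes=("_left", "_right"),
)

# merge_ordered: exact key matching, rows ordered by the key, and gaps
# forward-filled with the previous value.
merged_ordered = pd.merge_ordered(
    left, right, on="DATE", how="outer",
    fill_method="ffill", suffixes=("_left", "_right"),
)

print(merged_asof)
print(merged_ordered)
```

With these data, `merged_asof` keeps the three rows of the left dataframe, while `merged_ordered` contains one row per distinct timestamp from both dataframes.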

### **Merging (joining) dataframes on given keys; and sorting the merged table**
- Merge (join) types:
    - 'inner': resultant dataframe contains only the rows on the left dataframe with corresponding values on the right dataframe. Can be used for filtering a set of labelled rows. The merge itself creates no missing values;
    - 'left': resultant dataframe contains all the rows from the left table (even those without correspondence on the right); and the rows from the right table that have correspondence on the left one. Since rows from the left table may not have correspondence, it may result in missing values;
    - 'right': resultant dataframe contains all the rows from the right table (even those without correspondence on the left); and the rows from the left table that have correspondence on the right one. Since rows from the right table may not have correspondence, it may result in missing values;
    - 'outer': the Pandas 'outer' merge corresponds to the SQL FULL OUTER JOIN: the resultant dataframe contains all rows from both tables, regardless of whether there is correspondence. So, it may result in missing values.

In [None]:
DF_LEFT = dataset1 #Alternatively: object containing the dataset to be joined on the left
DF_RIGHT = dataset2 #Alternatively: object containing the dataset to be joined on the right

LEFT_KEY = "left_key_column" 
#Alternatively: (string) name of the column of the left dataframe to be used as key for 
# joining. Keep inside quotes.
RIGHT_KEY = "right_key_column"
#Alternatively: (string) name of the column of the right dataframe to be used as key for 
# joining. Keep inside quotes.

HOW_TO_JOIN = "inner"
#Alternatively: "inner", "outer", "left", "right".

MERGED_SUFFIXES = ('_left', '_right')
# MERGED_SUFFIXES = ('_left', '_right') - tuple of the suffixes to be added to columns.
# Example: suppose both datasets have the column 'Value'. The column from the left dataset
# will be renamed as "Value_left", and the column from the right dataset will be renamed as
# "Value_right".
# Alternatively: modify the strings inside quotes to modify the standard values. 
# Do not eliminate the parentheses that indicate the tuple object.
# A tuple is an immutable sequence of objects. It is declared inside parentheses, instead of
# the brackets used for lists: []

SORT_MERGED_DF = False
# SORT_MERGED_DF = False not to sort the merged dataframe. If you want to sort it,
# set as True. If SORT_MERGED_DF = True and COLUMN_TO_SORT = None, the dataframe will
# be sorted by its first column.

COLUMN_TO_SORT = None
# COLUMN_TO_SORT = None. Keep it None if the dataframe should not be sorted.
# Alternatively, pass a string with a column name to sort, such as:
# COLUMN_TO_SORT = 'col1'; or a list of columns to use for sorting: COLUMN_TO_SORT = 
# ['col1', 'col2']

ASCENDING_SORTING = True
# ascending_sorting = True. If you want to sort the column(s) passed on column_to_sort in
# ascending order, set as True. Set as False if you want to sort in descending order. If
# you want to sort each column passed as list column_to_sort in a specific order, pass a 
# list of booleans like ASCENDING_SORTING = [False, True] - the first column of the list
# will be sorted in descending order, whereas the 2nd will be in ascending. Notice that
# the correspondence is element-wise: the boolean in list ASCENDING_SORTING will correspond 
# to the sorting order of the column with the same position in list COLUMN_TO_SORT.
# If None, the dataframe will be sorted in ascending order.
    

#New dataframe saved as merged_df. Simply modify this object on the left of equality:
merged_df = ewf.MERGE_AND_SORT_DATAFRAMES (df_left = DF_LEFT, df_right = DF_RIGHT, left_key = LEFT_KEY, right_key = RIGHT_KEY, how_to_join = HOW_TO_JOIN, merged_suffixes = MERGED_SUFFIXES, sort_merged_df = SORT_MERGED_DF, column_to_sort = COLUMN_TO_SORT, ascending_sorting = ASCENDING_SORTING)
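
The four join types listed above can be compared by counting the rows each one produces. The following sketch uses `pd.merge` and `sort_values` on two small hypothetical tables. It mirrors the parameters of the cell above, but it is a plain-pandas illustration and not the wrapper's own code.

```python
import pandas as pd

left = pd.DataFrame({"left_key_column": [1, 2, 3], "Value": ["a", "b", "c"]})
right = pd.DataFrame({"right_key_column": [2, 3, 4], "Value": ["x", "y", "z"]})

merge_kwargs = dict(
    left_on="left_key_column",
    right_on="right_key_column",
    suffixes=("_left", "_right"),
)

# Number of rows produced by each join type.
row_counts = {
    how: len(pd.merge(left, right, how=how, **merge_kwargs))
    for how in ("inner", "left", "right", "outer")
}
print(row_counts)

# Outer join, then sort: first key descending, second key ascending.
outer_df = pd.merge(left, right, how="outer", **merge_kwargs)
sorted_df = outer_df.sort_values(
    by=["left_key_column", "Value_left"], ascending=[False, True]
).reset_index(drop=True)
print(sorted_df)
```

Keys 2 and 3 exist in both tables, so the inner join returns two rows, the left and right joins three each, and the outer join four, two of which contain missing values.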

### **Dropping specific columns or rows from the dataframe**

In [None]:
DATASET = dataset #Alternatively: object containing the dataset to be analyzed

WHAT_TO_DROP = 'columns'
# WHAT_TO_DROP = 'columns' for removing the columns specified by their names (headers)
# in COLS_LIST (a list of strings).
# WHAT_TO_DROP = 'rows' for removing the rows specified by their indices in
# ROW_INDEX_LIST (a list of integers). Remember that the indexing starts from zero, i.e.,
# the first row is row number zero.

COLS_LIST = None
# COLS_LIST = list of strings containing the names (headers) of the columns to be removed
# For instance: COLS_LIST = ['col1', 'col2', 'col3'] will 
# remove columns 'col1', 'col2', and 'col3' from the dataframe.
# If a single column will be dropped, you can declare it as a string (outside a list)
# e.g. COLS_LIST = 'col1'; or COLS_LIST = ['col1']

ROW_INDEX_LIST = None
# ROW_INDEX_LIST = a list of integers containing the indices of the rows that will be dropped.
# e.g. ROW_INDEX_LIST = [0, 1, 2] will drop the rows with indices 0 (1st row), 1 (2nd row), and
# 2 (third row). Again, if a single row will be dropped, you can declare it as an integer (outside
# a list).
# e.g. ROW_INDEX_LIST = 20 or ROW_INDEX_LIST = [20] to drop the row with index 20 (21st row).
    
RESET_INDEX_AFTER_DROP = True
# RESET_INDEX_AFTER_DROP = True. Keep it True to restart the index numbering after dropping.
# Alternatively, set RESET_INDEX_AFTER_DROP = False to keep the original numbering (the removed indices
# will be missing).

# New dataframe saved as cleaned_df. Simply modify this object on the left of equality:
cleaned_df = ewf.drop_columns_or_rows (df = DATASET, what_to_drop = WHAT_TO_DROP, cols_list = COLS_LIST, row_index_list = ROW_INDEX_LIST, reset_index_after_drop = RESET_INDEX_AFTER_DROP)
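
As a rough picture of what the parameters above control, the following sketch reproduces the same operations with plain pandas. It is an illustration under assumptions, not the idsw code: the dataframe and the helper `as_list` are hypothetical.

```python
import pandas as pd

# Hypothetical dataframe, used only for illustration.
df = pd.DataFrame({
    "col1": [1, 2, 3, 4],
    "col2": ["a", "b", "c", "d"],
    "col3": [0.1, 0.2, 0.3, 0.4],
})


def as_list(value):
    """Hypothetical helper: wrap a single label or index in a list."""
    return value if isinstance(value, list) else [value]


# Dropping columns by name; a single string is accepted like a one-item list.
without_cols = df.drop(columns=as_list("col1"))

# Dropping rows by index; indexing starts from zero.
without_rows = df.drop(index=as_list([0, 2]))

# Restarting the index numbering after the drop (0, 1, ...).
without_rows_reset = without_rows.reset_index(drop=True)
```

Without the final `reset_index(drop=True)`, the remaining rows keep their original indices (here 1 and 3), which is the behavior described for RESET_INDEX_AFTER_DROP = False.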

### **Removing duplicate rows from the dataframe**

In [None]:
DATASET = dataset #Alternatively: object containing the dataset to be analyzed

LIST_OF_COLUMNS_TO_ANALYZE = None
# If LIST_OF_COLUMNS_TO_ANALYZE = None, the whole dataset will be analyzed, i.e., rows
# will be removed only if they have the same values for all columns of the dataset.
# Alternatively, pass a list of column names (strings) if you want to remove rows with
# the same values for that combination of columns. Pass it as a list, even if a single column
# is being declared.
# e.g. LIST_OF_COLUMNS_TO_ANALYZE = ['column1'] will check only 'column1'. Entries with the same value
# on 'column1' will be considered duplicates and will be removed.
# LIST_OF_COLUMNS_TO_ANALYZE = ['col1', 'col2', 'col3'] will analyze the combination of 3 columns:
# 'col1', 'col2', and 'col3'. Only rows with the same values for these 3 columns will be considered
# duplicates and will be removed.

WHICH_ROW_TO_KEEP = 'first'
# WHICH_ROW_TO_KEEP = 'first' will keep the first detected row and remove all other duplicates. If
# None or an invalid string is input, this method will be selected.
# WHICH_ROW_TO_KEEP = 'last' will keep only the last detected duplicate row, and remove all the others.
    
RESET_INDEX_AFTER_DROP = True
# RESET_INDEX_AFTER_DROP = True. Keep it True to restart the index numbering after dropping.
# Alternatively, set RESET_INDEX_AFTER_DROP = False to keep the original numbering (the removed indices
# will be missing).

# New dataframe saved as cleaned_df. Simply modify this object on the left of equality:
cleaned_df = ewf.remove_duplicate_rows (df = DATASET, list_of_columns_to_analyze = LIST_OF_COLUMNS_TO_ANALYZE, which_row_to_keep = WHICH_ROW_TO_KEEP, reset_index_after_drop = RESET_INDEX_AFTER_DROP)
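
The difference between analyzing the whole row and analyzing a subset of columns, and between keeping the first or the last duplicate, can be seen in the short pandas sketch below. It only assumes that the behavior matches `DataFrame.drop_duplicates`; the dataframe is hypothetical.

```python
import pandas as pd

# Hypothetical dataframe: rows 0 and 1 are identical; rows 0, 1, and 3 share 'col1'.
df = pd.DataFrame({
    "col1": ["x", "x", "y", "x"],
    "col2": [1, 1, 2, 3],
})

# Whole-row comparison (no subset): only row 1 is a duplicate of row 0.
full_row = df.drop_duplicates(subset=None, keep="first")

# Comparison restricted to 'col1', keeping the first occurrence of each value.
by_col1_first = df.drop_duplicates(subset=["col1"], keep="first")

# Same subset, keeping the last occurrence, then restarting the index.
by_col1_last = df.drop_duplicates(subset=["col1"], keep="last").reset_index(drop=True)
```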

## **Exporting the dataframe as CSV file (to notebook's workspace)**

In [None]:
## WARNING: all files exported from this function are .csv (comma separated values)

DATAFRAME_OBJ_TO_BE_EXPORTED = dataset
# Alternatively: object containing the dataset to be exported.
# DATAFRAME_OBJ_TO_BE_EXPORTED: dataframe object that is going to be exported from the
# function. Since it is an object (not a string), it should not be declared in quotes.
# example: DATAFRAME_OBJ_TO_BE_EXPORTED = dataset will export the dataset object.
# ATTENTION: The dataframe object must be a Pandas dataframe.

FILE_DIRECTORY_PATH = ""
# FILE_DIRECTORY_PATH - (string, in quotes): input the path of the directory 
# (e.g. folder path) where the file will be stored. e.g. FILE_DIRECTORY_PATH = "" 
# or FILE_DIRECTORY_PATH = "folder"
# If you want to export the file to AWS S3, this parameter will have no effect.
# In this case, you can set FILE_DIRECTORY_PATH = None

NEW_FILE_NAME_WITHOUT_EXTENSION = "dataset"
# NEW_FILE_NAME_WITHOUT_EXTENSION - (string, in quotes): input the name of the 
# file without the extension. e.g. set NEW_FILE_NAME_WITHOUT_EXTENSION = "my_file" 
# to export the CSV file 'my_file.csv' to notebook's workspace.

idsw.export_pd_dataframe_as_csv (dataframe_obj_to_be_exported = DATAFRAME_OBJ_TO_BE_EXPORTED, new_file_name_without_extension = NEW_FILE_NAME_WITHOUT_EXTENSION, file_directory_path = FILE_DIRECTORY_PATH)
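
A minimal sketch of how the directory path and the file name without extension combine into the final CSV path, using only pandas and the standard library. This is an assumption about the general idea, not the idsw implementation; in particular, writing with `index=False` and using a temporary directory are choices made only for this example.

```python
import os
import tempfile

import pandas as pd

# Hypothetical dataframe to be exported.
df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

# A temporary directory stands in for FILE_DIRECTORY_PATH in this example.
file_directory_path = tempfile.mkdtemp()
new_file_name_without_extension = "dataset"

# The '.csv' extension is appended to the name; the directory comes first.
file_path = os.path.join(file_directory_path, new_file_name_without_extension + ".csv")

# index=False (an assumption here) avoids writing the index as an extra column.
df.to_csv(file_path, index=False)

# Reading the file back confirms what was written.
reloaded = pd.read_csv(file_path)
```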

## **Downloading a file from Google Colab to the local machine; or uploading a file from the machine to Colab's instant memory**

#### Case 1: upload a file to Colab's workspace

In [None]:
ACTION = 'upload'
# ACTION = 'download' to download the file to the local machine
# ACTION = 'upload' to upload a file from local machine to Google Colab's 
# instant memory

FILE_TO_DOWNLOAD_FROM_COLAB = None
# FILE_TO_DOWNLOAD_FROM_COLAB = None. This parameter is mandatory when
# ACTION = 'download'. 
# Declare as FILE_TO_DOWNLOAD_FROM_COLAB the name of the file that you want to download,
# with the corresponding extension.
# It should be declared as a string (in quotes), as in the examples below.
# e.g. to download a dictionary named dict, FILE_TO_DOWNLOAD_FROM_COLAB = 'dict.pkl'
# To download a dataframe named df, declare FILE_TO_DOWNLOAD_FROM_COLAB = 'df.csv'
# To export a model named keras_model, declare FILE_TO_DOWNLOAD_FROM_COLAB = 'keras_model.h5'

# Dictionary storing the uploaded files returned as colab_files_dict.
# Simply modify this object on the left of the equality:
colab_files_dict = idsw.upload_to_or_download_file_from_colab (action = ACTION, file_to_download_from_colab = FILE_TO_DOWNLOAD_FROM_COLAB)

#### Case 2: download a file from Colab's workspace

In [None]:
ACTION = 'download'
# ACTION = 'download' to download the file to the local machine
# ACTION = 'upload' to upload a file from local machine to Google Colab's 
# instant memory

FILE_TO_DOWNLOAD_FROM_COLAB = None
# FILE_TO_DOWNLOAD_FROM_COLAB = None. This parameter is mandatory when
# ACTION = 'download'. 
# Declare as FILE_TO_DOWNLOAD_FROM_COLAB the name of the file that you want to download,
# with the corresponding extension.
# It should be declared as a string (in quotes), as in the examples below.
# e.g. to download a dictionary named dict, FILE_TO_DOWNLOAD_FROM_COLAB = 'dict.pkl'
# To download a dataframe named df, declare FILE_TO_DOWNLOAD_FROM_COLAB = 'df.csv'
# To export a model named keras_model, declare FILE_TO_DOWNLOAD_FROM_COLAB = 'keras_model.h5'

idsw.upload_to_or_download_file_from_colab (action = ACTION, file_to_download_from_colab = FILE_TO_DOWNLOAD_FROM_COLAB)
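
For reference, the two cases above map naturally onto the `files.upload()` and `files.download()` functions of the `google.colab` module. The sketch below is a hypothetical wrapper, not the idsw function; it assumes the code runs inside a Colab runtime, since `google.colab` is not available elsewhere.

```python
def colab_upload_or_download(action, file_name=None):
    """Hypothetical wrapper around google.colab.files (works only inside Colab)."""
    if action not in ("upload", "download"):
        raise ValueError("action must be 'upload' or 'download'")
    if action == "download" and not isinstance(file_name, str):
        # The file to download is identified by its name, e.g. 'df.csv'.
        raise ValueError("file_name must be a string such as 'df.csv'")

    # Imported lazily so that the checks above also run outside Colab.
    from google.colab import files

    if action == "upload":
        # Opens a file picker; returns a dict mapping file names to their bytes.
        return files.upload()

    # Triggers a browser download of a file stored in Colab's workspace.
    files.download(file_name)
    return None
```

In a Colab cell, `colab_files_dict = colab_upload_or_download("upload")` would store the uploaded files, and `colab_upload_or_download("download", "df.csv")` would download an existing file.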

## **Exporting a list of files from notebook's workspace to AWS Simple Storage Service (S3)**

In [None]:
LIST_OF_FILE_NAMES_WITH_EXTENSIONS = ['s3_file1.txt', 's3_file2.txt']
# LIST_OF_FILE_NAMES_WITH_EXTENSIONS: list containing all the files to export to S3.
# Declare it as a list even if only a single file will be exported.
# It must be a list of strings containing the file names followed by the extensions.
# For example, to export a single file my_file.ext, where my_file is the name and ext is the
# extension:
# LIST_OF_FILE_NAMES_WITH_EXTENSIONS = ['my_file.ext']
# To export 3 files, file1.ext1, file2.ext2, and file3.ext3:
# LIST_OF_FILE_NAMES_WITH_EXTENSIONS = ['file1.ext1', 'file2.ext2', 'file3.ext3']
# Other examples:
# LIST_OF_FILE_NAMES_WITH_EXTENSIONS = ['Screen_Shot.png', 'dataset.csv']
# LIST_OF_FILE_NAMES_WITH_EXTENSIONS = ["dictionary.pkl", "model.h5"]
# LIST_OF_FILE_NAMES_WITH_EXTENSIONS = ['doc.pdf', 'model.dill']

DIRECTORY_OF_NOTEBOOK_WORKSPACE_STORING_FILES_TO_EXPORT = ''
# DIRECTORY_OF_NOTEBOOK_WORKSPACE_STORING_FILES_TO_EXPORT: directory from notebook's workspace
# from which the files will be exported to S3. Keep it None, or
# DIRECTORY_OF_NOTEBOOK_WORKSPACE_STORING_FILES_TO_EXPORT = "/"; or
# DIRECTORY_OF_NOTEBOOK_WORKSPACE_STORING_FILES_TO_EXPORT = '' (empty string) to export from
# the root (main) directory.
# Alternatively, set as a string containing only the directories and folders, not the file names.
# Examples: DIRECTORY_OF_NOTEBOOK_WORKSPACE_STORING_FILES_TO_EXPORT = 'folder1';
# DIRECTORY_OF_NOTEBOOK_WORKSPACE_STORING_FILES_TO_EXPORT = 'folder1/folder2/'
    
# For this function, all exported files must be located in the same directory.

S3_BUCKET_NAME = 'my_bucket'
## This parameter is mandatory to access an AWS S3 bucket. Replace it with a string
# containing the bucket's name. e.g. S3_BUCKET_NAME = "aws-bucket-1" accesses a bucket named
# "aws-bucket-1"

S3_OBJECT_FOLDER_PREFIX = ""
# S3_OBJECT_FOLDER_PREFIX = None. Keep it None; or as an empty string 
# (S3_OBJECT_FOLDER_PREFIX = ''); or as the root "/" to import the 
# whole bucket content, instead of a single object from it.
# Alternatively, set it as a string containing the subfolder from the bucket to import:
# Suppose that your bucket (admin-created) has four objects with the following object 
# keys: Development/Projects.xlsx; Finance/statement1.pdf; Private/taxdocument.pdf; and
# s3-dg.pdf. 
# The s3-dg.pdf key does not have a prefix, so its object appears directly 
# at the root level of the bucket. If you open the Development/ folder, you see 
# the Projects.xlsx object in it.
# In summary, if the path of the file is: 'bucket/my_path/.../file.csv'
# where 'bucket' is the bucket's name, prefix = 'my_path/.../', without the
# 'file.csv' (file name with extension) last part.

# So, declare the prefix as S3_OBJECT_FOLDER_PREFIX to import only files from
# a given folder (directory) of the bucket.
# DO NOT PUT A SLASH before (to the left of) the prefix;
# DO NOT ADD THE BUCKET'S NAME to the left of the prefix:
# S3_OBJECT_FOLDER_PREFIX = "bucket_directory1/.../bucket_directoryN/"

# Alternatively, provide the full path of a given file if you want to import only it:
# S3_OBJECT_FOLDER_PREFIX = "bucket_directory1/.../bucket_directoryN/my_file.ext"
# where my_file is the file's name, and ext is its extension.


# Attention: after running this function to connect with AWS Simple Storage Service (S3), 
# your 'AWS Access key ID' and your 'Secret access key' will be requested.
# The 'Secret access key' will be hidden by dots, so it cannot be seen or copied by
# other users. On the other hand, the same is not true for the 'Access key ID', the bucket's name, 
# and the prefix. All of these are sensitive information from the organization.
# Therefore, after running this cell, always remember to clear its output
# and to remove such information from the strings.
# Remember that these data may grant privileges for accessing protected information, 
# so they should not be used by non-authorized people.

# Also, remember to delete the files from the workspace after finishing the analysis.
# Storing files in S3 costs considerably less than storing them directly in the
# workspace. Also, files stored in S3 may be accessed by users other than those with access to
# the notebook's workspace.
idsw.export_files_to_s3 (list_of_file_names_with_extensions = LIST_OF_FILE_NAMES_WITH_EXTENSIONS, directory_of_notebook_workspace_storing_files_to_export = DIRECTORY_OF_NOTEBOOK_WORKSPACE_STORING_FILES_TO_EXPORT, s3_bucket_name = S3_BUCKET_NAME, s3_obj_prefix = S3_OBJECT_FOLDER_PREFIX)
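
The relation between the prefix, the file name, and the resulting S3 object key can be sketched as below. The helpers `build_s3_object_key` and `upload_files` are hypothetical and are not part of idsw; the sketch assumes a client object exposing `upload_file(Filename, Bucket, Key)`, as the boto3 S3 client does, and that credentials are configured separately.

```python
import os
import posixpath


def build_s3_object_key(prefix, file_name):
    """Hypothetical helper: combine a folder prefix and a file name into an object key."""
    # None, '' and '/' all mean the root level of the bucket.
    cleaned_prefix = (prefix or "").strip("/")
    if not cleaned_prefix:
        return file_name
    # S3 keys always use forward slashes, regardless of the operating system.
    return posixpath.join(cleaned_prefix, file_name)


def upload_files(file_names, local_directory, bucket_name, prefix, s3_client):
    """Hypothetical helper: upload each file from one local directory to the bucket."""
    for file_name in file_names:
        local_path = os.path.join(local_directory or "", file_name)
        object_key = build_s3_object_key(prefix, file_name)
        s3_client.upload_file(local_path, bucket_name, object_key)


print(build_s3_object_key("bucket_directory1/bucket_directory2/", "dataset.csv"))
```

With boto3 installed, `s3_client` could be created as `boto3.client("s3")`. The bucket's name is passed separately and never appears in the key.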

****