# Save dataset in GCS
**Code to save a dataset that the user uploads to GCS. It is also necessary to save the rest of the user's selections (for example, the forecast horizon, the type of analysis to run, the types of models to train, etc.).**

---
**When the user uploads a dataset, it should be saved in a GCS bucket.** Note: a bucket and a folder must be created beforehand.

---
**Upload the data, select the list of features, select the target**


---
**----> V2 USING THE GCSFS PACKAGE TO UPLOAD FILES IN A PYTHONIC WAY**

In [None]:
import pandas as pd
import json
import subprocess
import gcsfs

# RUN CODES

### 0. Auxiliary functions

In [None]:
# def upload_files_gcs(path_local_file, path_gcs_folder):
#     """
#     Upload a local file to the cloud. Works only in a Jupyter notebook, because it uses the ! shell command

#     Args:
#         path_local_file: path to file locally
#         path_gcs_folder: path to GCS folder where to save the file
#     """
#     # upload file
#     ! gsutil cp {path_local_file} {path_gcs_folder}

#### THIS FUNCTION IS NOT USED IN THIS EXAMPLE - GCSFS package is used instead

### 0. Config

#### 0.1. Read env variables - configuration GCP

In [None]:
# ---------------------------- read env variables used in the app ----------------------------
import os
from dotenv import load_dotenv, find_dotenv
load_dotenv(find_dotenv())
PROJECT_GCP = os.environ.get("PROJECT_GCP", "")
REGION_GCP = os.environ.get("REGION_GCP", "")
BUCKET_GCP = os.environ.get("BUCKET_GCP", "")
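
Because every variable above falls back to an empty string, a missing `.env` entry only shows up later as a malformed GCS path. A minimal sketch of a fail-fast check follows; the helper name `read_required_env` and the `REQUIRED_VARS` tuple are hypothetical, not part of the original notebook.

```python
import os

# Names taken from the cell above; the tuple itself is an illustrative assumption.
REQUIRED_VARS = ("PROJECT_GCP", "REGION_GCP", "BUCKET_GCP")


def read_required_env(names=REQUIRED_VARS, env=None):
    """Return {name: value} for every name, raising if any is missing or empty."""
    env = os.environ if env is None else env
    values = {name: env.get(name, "").strip() for name in names}
    missing = [name for name, value in values.items() if not value]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return values
```

Calling `read_required_env()` right after `load_dotenv(find_dotenv())` would stop the notebook before any connection to GCS is attempted.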

In [None]:
# connect to GCS
fs = gcsfs.GCSFileSystem(project = PROJECT_GCP)

#### 0.2 Root path
Set the working directory to the parent of the current folder (the project root).

In [None]:
import os
# set the root path (parent of the current folder) so outputs are saved relative to it
actual_path = os.path.abspath(os.getcwd())
print('actual path: ', actual_path)
# os.path.dirname is portable; splitting on '\\' only works on Windows
root_path = os.path.dirname(actual_path)
os.chdir(root_path)
print('new path: ', root_path)

#### 0.3 Read paths of data uploaded BY THE USER
In theory, the user uploads a dataset, which is saved locally at this path. After that, this code starts to run.

In [None]:
path_local_data = 'data_local/data.xlsx'

### 1 PARAMS 

#### 1.1 Read data

In [None]:
# read data
data = pd.read_excel(path_local_data)

In [None]:
data

#### 1.2 Define parameters to upload data
Define the list of parameters that are set when the data is uploaded. For example:

- The name of the dataset. The name will be used as an ID across all the functionalities
- Parameters of the data: number of steps to forecast into the future (according to the sampling interval of the data)

In [None]:
#### DEFINE PARAMETERS THAT ALWAYS NEED TO BE DEFINED ####

# define name of the dataset - the user needs to define it
NAME_DATASET = 'develop-app-final-v2'

# define number of steps to forecast
STEPS_FORECAST = 5

# define feature list
LIST_FEATURES = ['CMPC.SN', 'CHILE.SN', 'COPEC.SN', 'MSFT', 'AAPL', 'GOOG', 'TSLA', 'O', 'BHP']

# define target list
LIST_TARGET = ['VOO']
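
The feature and target names are typed by hand, so a typo would only surface later in the pipeline. The following is a minimal sketch of a sanity check over the selection; `validate_selection` is a hypothetical helper, and the rules it enforces (no overlap between features and target, positive integer horizon) are assumptions rather than requirements stated in this notebook.

```python
def validate_selection(columns, features, targets, steps_forecast):
    """Return a list of human-readable problems; an empty list means the selection is usable."""
    problems = []
    available = set(columns)
    # every selected column must exist in the uploaded data
    missing = [name for name in list(features) + list(targets) if name not in available]
    if missing:
        problems.append("columns not found in the data: " + ", ".join(missing))
    # assumption: a column should not be both a feature and a target
    overlap = sorted(set(features) & set(targets))
    if overlap:
        problems.append("columns used as both feature and target: " + ", ".join(overlap))
    if not isinstance(steps_forecast, int) or steps_forecast < 1:
        problems.append("steps_forecast must be a positive integer")
    return problems
```

In this notebook it could be called as `validate_selection(data.columns, LIST_FEATURES, LIST_TARGET, STEPS_FORECAST)`.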

In [None]:
#### DEFINE PARAMETERS ACCORDING TO THE ANALYSIS THAT THE USER WANTS

## EDA
STATISTICS = False
HISTOGRAMS = True
BOXPLOTS_MONTLY = True 
CORRELATIONS_ALL = True
CORRELATIONS_TARGET = True
SEGMENTATION_ANALYSIS = True
SEG_PARAM_TO_SEGMENT = 'TSLA'
CATEGORICAL_ANALYSIS = True

#### 1.3 Create segmentation parameters according to the user input
For the segmentation analysis, the user decides which feature to segment on, and the intervals are decided internally (because this choice is easier to write in an app). In the future, let the user select both the feature and the intervals.

In [None]:
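# define thresholds to split the data into 3 segments by percentile (below Q1, Q1 to Q3, above Q3)
# the outer edges are padded by 10 so the min and max values fall inside the first and last interval
# cast to float so the values are plain Python numbers that json.dump can serialize
threshold_1 = float(data[SEG_PARAM_TO_SEGMENT].min() - 10)
threshold_2 = float(data[SEG_PARAM_TO_SEGMENT].quantile(0.25))
threshold_3 = float(data[SEG_PARAM_TO_SEGMENT].quantile(0.75))
threshold_4 = float(data[SEG_PARAM_TO_SEGMENT].max() + 10)
SEG_DATA_INTERVALS = [threshold_1, threshold_2, threshold_3, threshold_4]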

# define labels
SEG_DATA_LABELS = [SEG_PARAM_TO_SEGMENT + ' low', SEG_PARAM_TO_SEGMENT + ' medium', SEG_PARAM_TO_SEGMENT + ' high']
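
To make explicit how four interval edges and three labels fit together, here is a pure-Python sketch that maps a value to its segment. `assign_segment` is a hypothetical helper and the numbers are illustrative, not taken from the dataset. It follows the right-closed convention `(low, high]`, which is also the default of `pandas.cut`; whether the downstream analysis uses that convention is an assumption.

```python
from bisect import bisect_left


def assign_segment(value, intervals, labels):
    """Return the label of the right-closed interval (low, high] containing value, or None."""
    if len(labels) != len(intervals) - 1:
        raise ValueError("need exactly one label per interval")
    # bisect_left finds the first edge >= value, so the interval index is one less
    position = bisect_left(intervals, value) - 1
    if position < 0 or position >= len(labels):
        return None  # value lies outside the outer edges
    return labels[position]


# illustrative edges and labels only
intervals = [-10.0, 25.0, 75.0, 110.0]
labels = ["TSLA low", "TSLA medium", "TSLA high"]
segments = [assign_segment(v, intervals, labels) for v in (0.0, 25.0, 50.0, 100.0)]
```

A value equal to an inner edge (25.0 here) lands in the lower segment, and a value equal to the lowest edge is excluded, which is why the outer edges are padded beyond the observed min and max.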

### 2. Upload data and params (save locally and then upload to cloud)

#### 2.1 Build path to save artifacts of this dataset
Build the path to the GCS folder where the files of this dataset are saved.

In [None]:
# define the cloud path where the data will be saved in GCS; common to every file uploaded for this dataset
path_gcs_folder_data = "gs://" + BUCKET_GCP + '/' + NAME_DATASET + '/' + 'data' + '/'
path_gcs_folder_data
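
String concatenation yields a broken path such as `gs:///name/data/` when `BUCKET_GCP` is empty, and doubled slashes when a part already ends with `/`. A small sketch of a more defensive builder follows; `build_gcs_folder` and its `subfolder` parameter are hypothetical names introduced for illustration.

```python
import posixpath


def build_gcs_folder(bucket, dataset_name, subfolder="data"):
    """Return 'gs://<bucket>/<dataset_name>/<subfolder>/' with exactly one slash between parts."""
    parts = [part.strip("/") for part in (bucket, dataset_name, subfolder)]
    if not all(parts):
        raise ValueError("bucket, dataset_name and subfolder must be non-empty")
    # posixpath always joins with '/', regardless of the local operating system
    return "gs://" + posixpath.join(*parts) + "/"
```

With the values used in this notebook, `build_gcs_folder(BUCKET_GCP, NAME_DATASET)` would produce the same string as the concatenation above whenever the bucket name is set.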

#### 2.2. Upload parameters
Create a dictionary with the configuration of the data and save it as JSON for future analysis.

Save it locally and then upload it to the cloud.

In [None]:
# create the parameters dictionary
dict_parameters_data = {
    "steps_forecast": STEPS_FORECAST,
    "list_features": LIST_FEATURES,
    "list_target": LIST_TARGET,
    "eda":{
        "statistics":STATISTICS,
        "histograms":HISTOGRAMS,
        "boxplots_montly": BOXPLOTS_MONTLY,
        "correlations_all": CORRELATIONS_ALL,
        "correlations_target": CORRELATIONS_TARGET,
        "segmentation_analysis":SEGMENTATION_ANALYSIS,
        "seg_param_to_segment": SEG_PARAM_TO_SEGMENT,
        "seg_data_intervals": SEG_DATA_INTERVALS,
        "seg_data_labels": SEG_DATA_LABELS,
        "categorical_analysis": CATEGORICAL_ANALYSIS
        }
    }

dict_parameters_data

In [None]:
# save locally
path_local_parameters = 'data_local/parameters.json'
with open(path_local_parameters, 'w') as file:
    json.dump(dict_parameters_data, file)

In [None]:
# upload cloud
path_cloud_parameters = path_gcs_folder_data + 'parameters.json'
with fs.open(path_cloud_parameters, 'w') as file:
    json.dump(dict_parameters_data, file)
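
The local and cloud cells differ only in which `open` function they call, so the two can share one helper. The sketch below assumes the opener accepts `(path, mode)` the way the built-in `open` and the `fs.open` call above do; `save_json` and `load_json` are hypothetical names.

```python
import json


def save_json(obj, path, opener=open):
    """Serialize obj as JSON to path, using opener(path, 'w') to get a text file handle."""
    with opener(path, "w") as file:
        json.dump(obj, file, indent=2)


def load_json(path, opener=open):
    """Read back the JSON document stored at path."""
    with opener(path, "r") as file:
        return json.load(file)
```

For example, `save_json(dict_parameters_data, path_cloud_parameters, opener=fs.open)` followed by `load_json(path_cloud_parameters, opener=fs.open)` would both upload the parameters and confirm that they can be read back.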

#### 2.3. Upload data file in Storage - GCP
The data uploaded by the user needs to be saved locally first; then it can be saved in the cloud.

In [None]:
# save locally
path_local_data = 'data_local/data.xlsx'
# index=False avoids writing the row index as an extra unnamed column
data.to_excel(path_local_data, index=False)

In [None]:
# save cloud
path_cloud_data = path_gcs_folder_data + 'data.xlsx'
data.to_excel(path_cloud_data, index=False)
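
Writing through `to_excel` stores only what pandas parsed into the DataFrame, so the cloud copy is a re-serialized file rather than the one the user uploaded. If a byte-identical copy is preferred, one option is to stream the local file as is. This is a sketch under that assumption; `copy_file` is a hypothetical helper, and it expects openers that accept `(path, mode)` like the built-in `open` and `fs.open`.

```python
import shutil


def copy_file(src_path, dst_path, src_opener=open, dst_opener=open):
    """Stream the bytes of src_path into dst_path without parsing the content."""
    with src_opener(src_path, "rb") as src, dst_opener(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
```

In this notebook the call would look like `copy_file(path_local_data, path_cloud_data, dst_opener=fs.open)`.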

In [None]:
data