<a href="https://colab.research.google.com/github/kartoch/colab-eda/blob/master/01%20-%20Load%20XML%20and%20save%20as%20CSV.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Instructions

This notebook is an example of how to load all the essence datasets, extract the hierarchical data from the XML, and save the corresponding tabular data as CSV.

The code to load files from and save files to Google Cloud Storage (GCS) is included.

In [0]:
from google.oauth2 import service_account
from google.cloud.storage import client
import io
import pandas as pd
from io import BytesIO
import json
import os.path
import logging
from zipfile import ZipFile

# Constants

- `START_YEAR` : first year from the essence dataset to load (2007 by default)
- `END_YEAR` : last year from the essence dataset to load (choose the current year minus 1, as the current year is probably incomplete)
- `CACHE_DIRECTORY` : local directory used to cache the files downloaded from GCS
- `LOG_LEVEL` : log level used by the logging instance `logger`

In [0]:
START_YEAR = 2007
END_YEAR = 2007
CACHE_DIRECTORY = "/tmp/"
LOG_LEVEL = "DEBUG"
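The advice given for `END_YEAR` (take the current year minus 1) can be computed instead of hard-coded. The following is a minimal sketch; the helper name `last_complete_year` is hypothetical and not part of this notebook.

```python
import datetime

def last_complete_year(today=None):
    """Return the year before `today` (defaults to the current date)."""
    if today is None:
        today = datetime.date.today()
    return today.year - 1

# Hypothetical usage in the constants cell: END_YEAR = last_complete_year()
print(last_complete_year(datetime.date(2020, 6, 15)))  # prints 2019
```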

In [0]:
logging.basicConfig()
logger = logging.getLogger()
logger.setLevel(LOG_LEVEL)

# GCS configuration

- `SERVICE_ACCOUNT` : paste the JSON key of your service account here
- `BUCKET_DATASETS` : bucket containing the original dataset
- `BUCKET_PERSONAL` : bucket where you can read/write to save/load files between notebooks


In [0]:
SERVICE_ACCOUNT = json.loads(r"""{
  "type": "service_account",
  "project_id": "...",
  "private_key_id": "...",
  "private_key": "...",
  "client_email": "...",
  "client_id": "...",
  "auth_uri": "https://accounts.google.com/o/oauth2/auth",
  "token_uri": "https://oauth2.googleapis.com/token",
  "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
  "client_x509_cert_url": "..."
}""")
BUCKET_DATASETS = "essence-dataset-eda"
BUCKET_PERSONAL = "eda-essence-NAME_STUDENT"
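Before creating the client, it can help to check that every `"..."` placeholder in `SERVICE_ACCOUNT` has been replaced. This is a small sketch under that assumption; `unfilled_fields` is a hypothetical helper, and the sample dictionary is made up for illustration.

```python
def unfilled_fields(service_account_info):
    """Return the sorted keys whose value is still "..." or is empty."""
    return sorted(
        key for key, value in service_account_info.items()
        if value in ("...", "", None)
    )

# Made-up sample, not a real key.
sample = {
    "type": "service_account",
    "project_id": "my-project",
    "private_key": "...",
    "client_email": "",
}
print(unfilled_fields(sample))  # prints ['client_email', 'private_key']
```

In the notebook, `unfilled_fields(SERVICE_ACCOUNT)` should return an empty list once the key has been pasted in full.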

# Init and functions for GCS


In [0]:
credentials = service_account.Credentials.from_service_account_info(
    SERVICE_ACCOUNT,
    scopes=["https://www.googleapis.com/auth/cloud-platform"],
)

client_gcs = client.Client(
    credentials=credentials,
    project=credentials.project_id,
)

In [0]:
def download_file(local_filename, remote_filename, bucket):
    # Skip the download when the file is already in the local cache.
    if os.path.isfile(local_filename):
        logger.info("Already downloaded: %s", local_filename)
        return
    blob = bucket.blob(remote_filename)
    blob.download_to_filename(local_filename)

def generator_zip_file(client):
    # Use the client passed as argument rather than the global one.
    bucket = client.bucket(BUCKET_DATASETS)
    for year in range(START_YEAR, END_YEAR + 1):
        blob_pathname = "PrixCarburants_annuel_" + str(year) + ".zip"
        local_filename = os.path.join(CACHE_DIRECTORY, blob_pathname)
        download_file(local_filename, blob_pathname, bucket)
        # The archive is closed when the generator resumes or is closed.
        with ZipFile(local_filename) as zip_ref:
            # Each archive is expected to contain exactly one XML file.
            [xml_filename] = zip_ref.namelist()
            with zip_ref.open(xml_filename) as xml_file:
                yield (xml_file, year)

This example reads each file of the essence dataset in a loop. It uses a generator function which, on each loop iteration, yields a tuple (file object, year) for one file of the dataset, starting from `START_YEAR` and ending with `END_YEAR`.

**You probably want to add your code here**

In [0]:
for f, year in generator_zip_file(client_gcs):
    print(f, year)
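The goal stated in the instructions is to flatten the hierarchical XML into tabular rows, and the loop above is where that would happen. Below is a minimal sketch of one way to do it with the standard library. The XML layout (station elements named `pdv` containing `prix` children, and the attribute names `id`, `cp`, `nom`, `maj`, `valeur`) is an assumption made for illustration and should be checked against the actual files; `SAMPLE_XML` and `iter_price_rows` are hypothetical names.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Made-up sample; the element and attribute names are assumptions.
SAMPLE_XML = b"""<?xml version="1.0" encoding="ISO-8859-1"?>
<pdv_liste>
  <pdv id="1000001" cp="01000">
    <prix nom="Gazole" maj="2007-01-02T07:53:26" valeur="1009"/>
    <prix nom="SP95" maj="2007-01-02T07:53:26" valeur="1183"/>
  </pdv>
  <pdv id="1000002" cp="01000">
    <prix nom="Gazole" maj="2007-01-03T09:00:00" valeur="1012"/>
  </pdv>
</pdv_liste>
"""

def iter_price_rows(xml_file, year):
    """Yield one flat dict per <prix> element, repeating its station's fields."""
    # iterparse streams the file, so a large yearly dump is not loaded at once.
    for _event, elem in ET.iterparse(xml_file, events=("end",)):
        if elem.tag != "pdv":
            continue
        for prix in elem.iter("prix"):
            yield {
                "year": year,
                "station_id": elem.get("id"),
                "postcode": elem.get("cp"),
                "fuel": prix.get("nom"),
                "updated": prix.get("maj"),
                "price": prix.get("valeur"),
            }
        elem.clear()  # free the children of the station just processed

rows = list(iter_price_rows(io.BytesIO(SAMPLE_XML), 2007))

# Write the rows as CSV (to memory here; a real run would target a file).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

Inside the loop, `list(iter_price_rows(f, year))` could be passed to `pd.DataFrame` to get a dataframe for each year, assuming the real files match the layout sketched here.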

This is a code example to send a CSV file to GCS.
- It creates a dataframe and saves it in `/tmp/test.csv`
- The function `zip_and_save_file` zips the file to `/tmp/test.csv.zip` and sends it to the bucket under the name `test.csv.zip`

In [0]:
df_test = pd.DataFrame(
    {"col1": [1,2,3],
     "col2": [4,5,6]}
)
# to_csv returns None when given a path, so keep the dataframe separately.
df_test.to_csv(path_or_buf="/tmp/test.csv")

In [0]:
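def zip_and_save_file(local_filename, remote_filename, bucket):
    zip_local_filename = local_filename + '.zip'
    zip_remote_filename = remote_filename + '.zip'
    # Close the archive before uploading so that it is completely written.
    with ZipFile(zip_local_filename, mode='w') as zip_ref:
        # Store only the base name, without the local directory prefix.
        zip_ref.write(local_filename, arcname=os.path.basename(local_filename))
    blob = bucket.blob(zip_remote_filename)
    blob.upload_from_filename(zip_local_filename)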

In [0]:
zip_and_save_file("/tmp/test.csv","test.csv", client_gcs.bucket(BUCKET_PERSONAL))
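To read such an archive back in another notebook, the zip can be opened and its single member read as CSV. The sketch below shows a local round trip with the standard library only, without GCS; `zip_file` is a hypothetical stand-in for the local half of `zip_and_save_file`, and a temporary directory replaces `/tmp/`.

```python
import csv
import io
import os
import tempfile
from zipfile import ZipFile

def zip_file(local_filename):
    """Zip one file next to itself and return the archive path."""
    zip_local_filename = local_filename + ".zip"
    with ZipFile(zip_local_filename, mode="w") as archive:
        # Store only the base name so the archive has no directory prefix.
        archive.write(local_filename, arcname=os.path.basename(local_filename))
    return zip_local_filename

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "test.csv")
    with open(csv_path, "w", newline="") as handle:
        csv.writer(handle).writerows([["col1", "col2"], [1, 4], [2, 5], [3, 6]])

    zip_path = zip_file(csv_path)

    # Read the single member back, as a later notebook would after download.
    with ZipFile(zip_path) as archive:
        [member] = archive.namelist()
        with archive.open(member) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            read_back = list(csv.reader(text))

print(member)     # prints test.csv
print(read_back)  # prints [['col1', 'col2'], ['1', '4'], ['2', '5'], ['3', '6']]
```

The values come back as strings because `csv.reader` does no type conversion; `pd.read_csv` on the opened member would infer numeric types instead.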