# What's Avro?
* The data from our solar panel is stored in Azure Blob Storage.
* It's stored using the Avro serialization system (https://avro.apache.org/).
* In a nutshell, Avro is a compact, fast, binary data format.
* Avro relies on schemas. Schemas describe the data, its attributes and types.
* The schema is stored in the header of every Avro file, so there is no need to store it separately.
* There's a Python library to read and write Avro files.
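
Avro schemas are themselves written in JSON, so a schema can be inspected with nothing more than the standard library. The snippet below is a minimal sketch: the record name `Message` and the field `ReceivedAt` are made up for illustration, and only `Body` is taken from the messages used later in this notebook. The real schema is the one embedded in each blob.

```python
import json

# Hypothetical schema, for illustration only.
# The actual schema is embedded in the header of each Avro blob.
schema_json = """
{
  "type": "record",
  "name": "Message",
  "fields": [
    {"name": "ReceivedAt", "type": "string"},
    {"name": "Body", "type": "bytes"}
  ]
}
"""

schema = json.loads(schema_json)

# Map each attribute name to its declared type.
field_types = {field["name"]: field["type"] for field in schema["fields"]}
print(field_types)
```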

# Download data from Azure Blob Storage

In [1]:
# Connect to our storage account, download some blobs, and collect them into a list.
# The list may contain only one element; it depends on how many blobs you download.
# For the purpose of this exercise, it might make sense to download only a single blob.
#
# Hint: Log in to the Azure portal and look for the account name and key.
# Hint: Check out the blob structure in the "Storage Explorer (preview)".
# Hint: Look at the documentation of BlockBlobService and look for a suitable method.
# 
# Note: If you print the blob's content, you will see a binary Avro string.
#
# See https://azure-storage.readthedocs.io/ref/azure.storage.blob.baseblobservice.html.

azure_storage_account_name = "redischoolstorage"
azure_storage_account_key = "<See Azure portal>"
azure_blob_container = 'iotdataavro'

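# Note: BlockBlobService belongs to the legacy azure-storage-blob SDK (version 2.x).
# Version 12 and later replaced it with BlobServiceClient.
from azure.storage.blob import BlockBlobService
blob_service = BlockBlobService(azure_storage_account_name, azure_storage_account_key)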

blobs = []

##### YOUR CODE GOES HERE #####
blob = blob_service.get_blob_to_bytes(azure_blob_container, 'redischoolhub/03/2019/10/15/12/00')
blobs.append(blob)
blobs

[<azure.storage.blob.models.Blob at 0x10c93f630>]
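
The blob name used above looks like it encodes a hub name, a partition, and a timestamp down to the minute. That reading is an assumption inferred from this single name; check the actual layout in the Storage Explorer. Under that assumption, a small helper (the name `blob_name_for` is hypothetical) could build the name for a given point in time:

```python
from datetime import datetime


def blob_name_for(hub: str, partition: str, ts: datetime) -> str:
    """Build a blob name of the assumed form hub/partition/YYYY/MM/DD/HH/mm."""
    return f"{hub}/{partition}/{ts:%Y/%m/%d/%H/%M}"


name = blob_name_for("redischoolhub", "03", datetime(2019, 10, 15, 12, 0))
print(name)  # redischoolhub/03/2019/10/15/12/00
```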

# Deserialize Avro messages

In [2]:
# Deserialize the Avro messages, extract the payload, and collect them into a list.
# Check out the attributes of the messages. Which one contains the payload?
#
# Hint: Use DataFileReader from the Avro API and pass it the content of a blob.
# Hint: If you downloaded the blob as an array of bytes, wrap the content in io.BytesIO.
# Hint: DataFileReader is an iterator that returns dicts corresponding to the serialized items.
#
# Note: Each blob always contains two Avro messages. That's why 'payload' is a list.
# Note: Don't forget to close the Avro reader.
#
# See https://avro.apache.org/docs/current/gettingstartedpython.html

from avro.datafile import DataFileReader
from avro.io import DatumReader

payload = []

##### YOUR CODE GOES HERE #####
import io
for b in blobs:
    reader = DataFileReader(io.BytesIO(b.content), DatumReader())
    for elem in reader:
        payload.append(elem['Body'])
    reader.close()
        
payload

[b"{'msgID': 'msg0', 'msgVer': '1.0', 'gwID': 'MGate 5105_8352', 'Regulator': {'Voltage': 4354, 'Current': 17, 'Power': 741, 'Battery_voltage': 1454, 'Battery charging current': 51, 'Battery charging power': 756, 'Load voltage': 1454, 'Load current': 2, 'Load power': 29, 'Battery Temperature': 2130, 'Temperature inside case': 3322, 'Power component temperature': 0, 'Batterys remaining capacity': 0, 'Remote battery temperature': 0, 'Batterys rated power': 1200, 'Battery status': 0, 'Charing equipment status': 11, 'Maximum input volt(PV) today': 0, 'Minimum input volt(PV) today': 0, 'Maximal battery volt today': 0, 'Minimum battery volt today': 0, 'Consumed energy today': 0, 'Consumed energy this month': 0, 'Consumed energy this year': 320, 'Total consumed energy': 345, 'Generated energy today': 3, 'Generated energy this month': 0, 'Generated energy this year': 0, 'Total generated energy': 1025, 'Carbon dioxide reduction': 1, 'Battery current': 50}, 'Temp': {'Ambient Temp': 2129, 'Batter
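
Before handing a blob to DataFileReader, it can help to check that the bytes really are an Avro container file. Such files start with the four bytes `O`, `b`, `j`, and `0x01`. The helper name `looks_like_avro` is made up for this sketch, and the check only inspects the header, not the rest of the file.

```python
# Avro object container files begin with these four magic bytes.
AVRO_MAGIC = b"Obj\x01"


def looks_like_avro(data: bytes) -> bool:
    """Return True if the data starts with the Avro container file magic bytes."""
    return data[:4] == AVRO_MAGIC


print(looks_like_avro(b"Obj\x01rest-of-file"))  # True
print(looks_like_avro(b"{'msgID': 'msg0'}"))    # False
```

With the blobs downloaded earlier, the call would be `looks_like_avro(b.content)`.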

# Parse JSON payload

In [3]:
# Parse the JSON strings that you extracted from the Avro messages, and collect them into a list.
#
# Hint: Use Python's built-in JSON library.
# Hint: The payloads that you get from Avro are bytes objects. You have to decode them to str first.

import json

jsons = []

##### YOUR CODE GOES HERE #####
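for p in payload:
    # Decode the raw bytes into a string.
    json_str = str(p, "utf8")
    # The payload uses single quotes, which json.loads rejects, so swap them
    # for double quotes. Caveat: this breaks if a value contains an apostrophe.
    json_str = json_str.replace("'", '"')
    jsons.append(json.loads(json_str))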
    
jsons

[{'msgID': 'msg0',
  'msgVer': '1.0',
  'gwID': 'MGate 5105_8352',
  'Regulator': {'Voltage': 4354,
   'Current': 17,
   'Power': 741,
   'Battery_voltage': 1454,
   'Battery charging current': 51,
   'Battery charging power': 756,
   'Load voltage': 1454,
   'Load current': 2,
   'Load power': 29,
   'Battery Temperature': 2130,
   'Temperature inside case': 3322,
   'Power component temperature': 0,
   'Batterys remaining capacity': 0,
   'Remote battery temperature': 0,
   'Batterys rated power': 1200,
   'Battery status': 0,
   'Charing equipment status': 11,
   'Maximum input volt(PV) today': 0,
   'Minimum input volt(PV) today': 0,
   'Maximal battery volt today': 0,
   'Minimum battery volt today': 0,
   'Consumed energy today': 0,
   'Consumed energy this month': 0,
   'Consumed energy this year': 320,
   'Total consumed energy': 345,
   'Generated energy today': 3,
   'Generated energy this month': 0,
   'Generated energy this year': 0,
   'Total generated energy': 1025,
   'C
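
The payload shown earlier uses single quotes, so it looks more like the text form of a Python dict than strict JSON. As an alternative sketch to replacing quotes, `ast.literal_eval` from the standard library can parse such text directly. The shortened `raw` value reuses keys from the payload above; the `note` key in `tricky` is made up to show where quote replacement fails.

```python
import ast
import json

# Shortened stand-in for one payload from the Avro messages.
raw = b"{'msgID': 'msg0', 'msgVer': '1.0', 'Regulator': {'Voltage': 4354, 'Current': 17}}"

# literal_eval only accepts Python literals, so it does not execute arbitrary code.
record = ast.literal_eval(raw.decode("utf-8"))
print(record["Regulator"]["Voltage"])  # 4354

# Hypothetical payload with an apostrophe inside a value.
tricky = b"{'note': \"it's sunny\"}"

# literal_eval handles it ...
print(ast.literal_eval(tricky.decode("utf-8")))

# ... while quote replacement produces invalid JSON.
try:
    json.loads(tricky.decode("utf-8").replace("'", '"'))
    replacement_worked = True
except json.JSONDecodeError:
    replacement_worked = False
print(replacement_worked)  # False
```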

# Convert JSON to Pandas DataFrame

In [4]:
# Convert the parsed JSON objects into a single Pandas DataFrame.
#
# Hint: Use pandas' json_normalize function to flatten the nested JSON structure.

import pandas as pd

df = pd.DataFrame()

##### YOUR CODE GOES HERE #####
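# pd.json_normalize replaces the deprecated pandas.io.json.json_normalize,
# and pd.concat replaces DataFrame.append, which was removed in pandas 2.0.
df = pd.concat([pd.json_normalize(j) for j in jsons])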

df

| | msgID | msgVer | gwID | dateTime | Regulator.Voltage | Regulator.Current | Regulator.Power | Regulator.Battery_voltage | Regulator.Battery charging current | Regulator.Battery charging power | ... | WetterOnline.x0411.data.fx_kmh | WetterOnline.x0411.data.pop | WetterOnline.x0411.data.pp_hpa | WetterOnline.x0411.data.rad_wm2 | WetterOnline.x0411.data.rh | WetterOnline.x0411.data.rr_mm | WetterOnline.x0411.data.tt_C | WetterOnline.x0411.data.tta_C | WetterOnline.x0411.data.wm | WetterOnline.x0411.meta.local_date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | msg0 | 1.0 | MGate 5105_8352 | 2019-10-15T14:00:54+00:00 | 4354 | 17 | 741 | 1454 | 51 | 756 | ... | 18 | 10 | 1006.0 | 620 | 55 | 0.0 | 17.7 | 17.7 | so____ | 2019-10-15 13:59:09 |
| 0 | msg0 | 1.0 | MGate 5105_8352 | 2019-10-15T14:01:54+00:00 | 4345 | 17 | 756 | 1454 | 52 | 741 | ... | 18 | 10 | 1006.0 | 620 | 55 | 0.0 | 17.7 | 17.7 | so____ | 2019-10-15 13:59:09 |


# Possible next steps
* Understand what's in the data. --> See Norbert's part.
* Clean the data, e.g. remove rows with null values.
* Add a time-based index. 
* One-hot encode categorical variables.
* Create a linear correlation matrix.
* Select features for a first ML model.
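
Several of these steps can be sketched with pandas on a small stand-in DataFrame. This is an illustration under assumptions, not a recommended pipeline: the first two rows reuse values from the output above, while the last two rows and the weather code `bd____` are made up to show null handling and one-hot encoding.

```python
import pandas as pd

# Stand-in for df. Rows 3 and 4 are hypothetical.
sample = pd.DataFrame({
    "dateTime": [
        "2019-10-15T14:00:54+00:00",
        "2019-10-15T14:01:54+00:00",
        "2019-10-15T14:02:54+00:00",
        "2019-10-15T14:03:54+00:00",
    ],
    "Regulator.Voltage": [4354, 4345, None, 4360],
    "Regulator.Power": [741, 756, 760, 730],
    "WetterOnline.x0411.data.wm": ["so____", "so____", "bd____", "bd____"],
})

# Clean the data: drop rows that contain null values.
clean = sample.dropna()

# Add a time-based index by parsing the ISO 8601 timestamps.
clean = clean.assign(dateTime=pd.to_datetime(clean["dateTime"])).set_index("dateTime")

# One-hot encode the categorical weather code as 0/1 integer columns.
encoded = pd.get_dummies(clean, columns=["WetterOnline.x0411.data.wm"], dtype=int)

# Linear (Pearson) correlation matrix over all columns.
corr = encoded.corr()
print(corr)
```

With only a handful of rows the correlation values are not meaningful; the point is the order of the steps.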