# Introduction to Initial Data Analysis

Initial Data Analysis (IDA) consists of steps performed on the data of a study, typically between the end of data collection and the start of those statistical analyses that address the "research questions". It is important to detect and deal with data issues as early as possible.

The output should be a simplified view of the collected data together with a data summary report.

- Created Date : **10/4/2022**
- Updated Date : **14/4/2022**
- Author       : KK Yong

**References:**
- [Using Dictionary in Python](https://realpython.com/python-dicts/)
- [Handling Nested Dictionary](https://www.programiz.com/python-programming/nested-dictionary)
- [OrderedDict in Python](https://www.geeksforgeeks.org/ordereddict-in-python/)
- [Using Pandas Series](https://pythonbasics.org/pandas-series/)
- [MultiIndex / advanced indexing](https://pandas.pydata.org/docs/user_guide/advanced.html)

# Initialization for Python and NILMTK

Let's kick off by processing and analysing the data with Python.

In [None]:
import dateutil.parser
import dateutil.relativedelta
import warnings

import matplotlib.pyplot as plt
import pandas as pd

from datetime import datetime

import nilmtk as ntk

## Define constants and global variables

In [None]:
warnings.filterwarnings("ignore")
plt.rcParams['figure.figsize'] = [15, 10]

RAW_FILENAME = "../Dataset/ukdale.h5"

START_TS = '2013-04-01 00:00:00'
END_TS = '2013-04-01 12:00:00'

## Overview of UK-DALE Dataset

In [None]:
# Load dataset
ukdale = ntk.DataSet(RAW_FILENAME)

In [None]:
# Print metadata
ntk.utils.print_dict(ukdale.metadata)

### Accessing Dictionary Key-Value Pairs vs. Lists

In [None]:
print("\nList of Keys at the top level, total {} items.".format(len(list(ukdale.metadata))))
print(list(ukdale.metadata))

print("\nList of Keys inside the 'meter_devices' sublevel")
print(list(ukdale.metadata['meter_devices']))

print("\nList of Keys inside the 'timeframe' sublevel")
print(list(ukdale.metadata['timeframe']))

print("\nShow the value of 'description', which is at the top level")
print(ukdale.metadata['description'])

print("\nList of values inside the 'timeframe' sublevel")
print("Data type of timeframe is {}".format(type(ukdale.metadata['timeframe'])))
print(list(ukdale.metadata['timeframe'].values()))

### Detect branches of the dict "ukdale.metadata"

In [None]:
cnt = 0
tmplst_dict = list()
d = ukdale.metadata
for e in d:
    if type(d[e]) is dict:
        cnt += 1
        tmplst_dict.append(d[e])
        # Report the type of the value, not the type of the key
        print("{} - Data Type is {}.".format(e, type(d[e])))
#    else:
#        print("{} - Data Type is {}.".format(e, type(d[e])))

print("\nLength of dict in list = {}.".format(len(tmplst_dict)))
for item in tmplst_dict:
    #if type(subitem) is dict:
    #print("{} - Data Type is {}.".format(item, type(item)))
    print("...{} - {}".format(len(item) ,type(item)))
    for subitem in item:
        print("{} - Data Type is {}.".format(subitem, type(item[subitem])))

In [None]:
def print_hierarchy(d, s):
    """Print every nested dict key under d and return how many were found."""
    global counter
    cnt = 0
    cnt_lvl = 0               # position of the nested dict within the current level
    s += "#"                  # one '#' per nesting level
    indent = "  " * len(s)    # indentation grows with the nesting depth
    for e in d:
        if type(d[e]) is dict:
            print("cnt_lvl={}.\tcounter={}.\t{}\t{}-{}".format(cnt_lvl, counter, s, indent, e))
            dict_tree[e] = s
            cnt_lvl += 1
            counter += 1
            # Count this dict plus every dict nested inside it
            cnt += 1 + print_hierarchy(d[e], s)
    return cnt

dict_tree = dict()
counter = 0

ret_cnt = print_hierarchy(ukdale.metadata,"")
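
The recursive walk above can be tried on a small hand-made dictionary before running it on the real metadata. The block below is a minimal sketch without global variables. `nested_dict_paths` and `sample_metadata` are hypothetical names used only for illustration, and the sample values are made up rather than taken from UK-DALE.

```python
def nested_dict_paths(d, prefix=()):
    """Return the key path of every nested dict found inside d, depth-first."""
    paths = []
    for key, value in d.items():
        if isinstance(value, dict):
            path = prefix + (key,)
            paths.append(path)
            # Recurse so that dicts nested deeper are listed as well
            paths.extend(nested_dict_paths(value, path))
    return paths


# Made-up metadata with two nested branches (illustration only)
sample_metadata = {
    "name": "demo",
    "timeframe": {"start": "2020-01-01", "end": "2020-01-02"},
    "meter_devices": {"MeterA": {"sample_period": 6, "limits": {"upper": 100}}},
}

# Indent each key according to its depth in the hierarchy
for path in nested_dict_paths(sample_metadata):
    print("  " * (len(path) - 1) + "- " + path[-1])
```

Returning the paths instead of printing inside the recursion keeps the function easy to test and reuse.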

In [None]:
# Working with nested dictionary

lst_md_cct_mea = ukdale.metadata["meter_devices"]["CurrentCostTx"]["measurements"]
print("\nContents of the 'meter_devices' -> 'CurrentCostTx' -> 'measurements' sublevel")
print("Data type of lst_md_cct_mea is {}.\n".format(type(lst_md_cct_mea)))
print(list(lst_md_cct_mea))

### Understanding the difference between **( ) { } [ ]** in Python

**( )** is a tuple: An immutable collection of values, usually (but not necessarily) of different types. 

**[ ]** is a list: A mutable collection of values, usually (but not necessarily) of the same type. 

**{ }** is a dict: A mutable collection of key-value pairs.
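
The three bracket types can be compared side by side. This is a small illustrative sketch; all variable names and values are made up and do not come from the dataset.

```python
# ( ) tuple: fixed once created
point = (51.5, -0.12)

# [ ] list: can grow and change
readings = [230, 231, 229]
readings.append(232)

# { } dict: values are looked up by key
meter = {"type": "active", "upper_limit": 25000}
meter["lower_limit"] = 0

# Trying to change a tuple raises TypeError
try:
    point[0] = 0.0
except TypeError as err:
    tuple_error = str(err)

print(readings)
print(meter["upper_limit"])
print(tuple_error)
```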

In [None]:
lst_mea = ukdale.metadata["meter_devices"]["CurrentCostTx"]["measurements"]
print("Count of the lst_mea list = {}".format(len(lst_mea)))

dict_mea = lst_mea[0]
print("\nList of keys for dict_mea\t= {}".format(list(dict_mea)))
print("List of values for dict_mea\t= {}".format(dict_mea.values()))

val_ul = lst_mea[0].get("upper_limit")
print("\nGet value for upper_limit = {}. Data Type is {}".format(val_ul, type(val_ul)))

### Drill down into ukdale.metadata with the "dict" data type and "DateTime"

In [None]:
print("Data Type for the variable of ukdale.metadata is {}".format(type(ukdale.metadata)))

# Using dict - key & value  

start_date_ukdale = ukdale.metadata.get("timeframe").get("start")
end_date_ukdale = ukdale.metadata.get("timeframe").get("end")

print("\nRaw string format")
print("Data Type of start_date is {}, value string is {}".format(type(start_date_ukdale), start_date_ukdale))
print("Data Type of end_date is {}, value string is {}".format(type(end_date_ukdale), end_date_ukdale))

# The raw datetime string is in ISO format, so the 'dateutil' library is used to parse it
# and convert it to Python's datetime data type.
# ts - short form of timestamp

start_ts_ukdale = dateutil.parser.parse(start_date_ukdale)
end_ts_ukdale = dateutil.parser.parse(end_date_ukdale)

print("\nConverted from ISO format")
print("Data Type of start_date is {}, value string is {}".format(type(start_ts_ukdale), start_ts_ukdale))
print("Data Type of end_date is {}, value string is {}".format(type(end_ts_ukdale), end_ts_ukdale))

print("\nDuration/Delta")
delta_in_days = end_ts_ukdale - start_ts_ukdale
delta_in_years = dateutil.relativedelta.relativedelta(end_ts_ukdale, start_ts_ukdale).years
print("The total recorded timespan is {} days or {} years".format(delta_in_days.days, delta_in_years))

### Short Findings

- The recording runs from 2012-11-09 to 2017-04-26 (~1628 days or ~4 years).
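
The same kind of duration calculation can be sketched with the standard library alone. Note that this uses `datetime.fromisoformat` (Python 3.7+) instead of the `dateutil` calls used above, and that the two timestamps are hypothetical examples, not the values stored in the UK-DALE metadata.

```python
from datetime import datetime

# Hypothetical ISO 8601 timestamps (not taken from the dataset)
start = datetime.fromisoformat("2020-01-01T00:00:00+00:00")
end = datetime.fromisoformat("2021-03-01T12:00:00+00:00")

# Subtracting two datetimes gives a timedelta
delta = end - start

# Whole calendar years: subtract one if the anniversary has not been reached yet
# (time of day is ignored in this simplified check)
whole_years = end.year - start.year - ((end.month, end.day) < (start.month, start.day))

print("Timespan is {} days or {} whole years".format(delta.days, whole_years))
```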

## Search a certain timeframe

In [None]:
print(ukdale.buildings)
print("\nData Type of ukdale.buildings = {}. Count = {}.".format(type(ukdale.buildings), len(ukdale.buildings)))

In [None]:
# Open question!
#
# The start and end dates for this data should be 2012 to 2017.  However, the timeframe
# extracted here shows only 1 hour of data on 2013-04-01.
#
# Therefore, this needs further investigation!

print(ukdale.buildings[1].elec.get_timeframe())

In [None]:
# Loop through the OrderedDict of buildings

for item in ukdale.buildings:
    rec = ukdale.buildings[item].elec
    print("building no = {}.".format(item))
    print(rec.get_timeframe())
    print("Sample Period = {}.".format(rec.sample_period()))
    print("Appliances = {}.".format(len(rec.appliances)))
    print("\n")

## Extract data for House/Building 4 based on timeframe

In [None]:
# Select Data based on specific time range
ukdale.set_window(start=START_TS,end=END_TS)
house_data = ukdale.buildings[4].elec

# Simple time-series plot of the power consumption of house 4
house_data.plot()

### MeterGroup

A dataset consists of various groupings of electricity meters.  Meters can be grouped by appliance type and sampling rate, and a meter may be a site meter for the whole house, an appliance-level submeter, or a circuit-level submeter.  The key class in NILMTK's design is **MeterGroup**. It stores a list of meters and lets you select a subset of meters, aggregate power from all meters, and use many other functions.

**nilmtk.global_meter_group** is a MeterGroup object that holds every meter currently loaded.  There is one **MeterGroup** per building, which can be accessed via the **Building.elec** attribute.  There are also **nested MeterGroups** for aggregating together split-phase mains and dual-supply (240 volt) appliances in North American and Canadian datasets.  You can list them by calling the API **".nested_metergroups()"**.

In [None]:
print(house_data)

print("\nData Type for the variable of house_data is {}.".format(type(house_data)))

In [None]:
house_data.mains()

***WORKOUT!***

*There are no nested MeterGroups in this MeterGroup object.  You can look for nested groups in other similar objects or in a different house/building. You can also review the documentation of the UK-DALE dataset.*

### Load all columns of data (default) into a dataframe

This section shows how to load data from a NILMTK DataStore into memory as a DataFrame.  The load function returns a generator of DataFrames loaded from the DataStore based on the conditions specified. If no conditions are specified, then all data from all the columns is loaded. This is a quick guide to [Python generators.](http://stackoverflow.com/a/1756156/732596)
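
The generator behaviour described above is why the loading calls in this notebook are wrapped in `next(...)`. The sketch below shows the idea in plain Python. `load_in_chunks` is a hypothetical stand-in and is not part of NILMTK.

```python
def load_in_chunks(samples, chunk_size):
    """Yield successive chunks of `samples`, one list at a time."""
    for start in range(0, len(samples), chunk_size):
        yield samples[start:start + chunk_size]


chunks = load_in_chunks([10, 20, 30, 40, 50], chunk_size=2)

first = next(chunks)       # only the first chunk is produced here
remaining = list(chunks)   # the generator resumes where it stopped

print(first)
print(remaining)
```

Calling `next()` once fetches only the first chunk, so if the data spans several chunks the rest must be consumed by iterating over the generator.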

In [None]:
#
# Load the 'site_meter' (aggregated data) into the "main_df" dataframe
#
# The sample period is in seconds.  The data can be resampled to a specified period, e.g. 12, 24, 30 or 60.
# The default sample period is 6 seconds per reading
#
main_df = next(house_data[1].load(sample_period=30))

main_df.head()

In [None]:
# Print all available meters and appliances in house 4

house_data.all_meters()

### Load only the washing machine into a dataframe

In [None]:
df_wm = next(house_data['washing machine'].load())
df_wm.plot()

***WORKOUT!***

*Practice zooming in on an interesting time window, and explore the statistical analysis further.*

In [None]:
df_wm.describe()

***WORKOUT!***

*Different sampling-rate settings may give different summary statistics.  You can also extend the statistical analysis using the standard deviation and mean.*

### Get column data and load a specific column of data

This section shows how to extract a specific column of data, using the washing machine as an example.  The API '.available_columns' lists all the column names.

Specific data can also be loaded into another data type, the pandas Series, using the API '.power_series()'.  It provides a generator of one-dimensional pandas.Series objects, each containing power data.

In [None]:
# Get column header

obj_wm = house_data['washing machine']
obj_wm.available_columns()

In [None]:
series_wm = next(obj_wm.power_series())
series_wm.head()

***Notes:***

*The current dataset provides only a single (power, active) column, so no other column can be selected here.  If a dataset does provide 'reactive', 'active' or 'voltage' columns, they can be selected with the function arguments shown in the reference code below.*

```
obj_tv = elec['television']

df = next(obj_tv.load(physical_quantity='voltage'))
df = next(obj_tv.load(physical_quantity='power', ac_type='reactive'))
df = next(obj_tv.load(ac_type='active'))
```

## Statistics APIs for MeterGroups

### Using Pandas Series to perform simple analysis

**House power consumption extracted from the "mains" meter, with a simple plot**.

In [None]:
house_data.mains().power_series_all_data().head()

In [None]:
house_data.mains().power_series_all_data().plot()

In [None]:
start_dtts = dateutil.parser.parse(START_TS)
end_dtts = dateutil.parser.parse(END_TS)
duration = end_dtts - start_dtts

# Load the mains power series once and reuse it for every statistic
mains_power = house_data.mains().power_series_all_data()

print("Start Date    : {}".format(START_TS))
print("End Date      : {}\n".format(END_TS))
print("Total duration (hr:min:sec)  = {}".format(duration))
print("Sum of power readings        = {:.2f}".format(mains_power.sum()))
print("Median of power consumption  = {:.2f}".format(mains_power.median()))
print("Average of power consumption = {:.2f}".format(mains_power.mean()))

***WORKOUT!***

*Practice creating a histogram grouped by hour, and explore the statistical analysis further.*

### Using NILMTK APIs to perform statistical analysis

This shows the proportion of energy submetered in house/building 4.

In [None]:
house_data.proportion_of_energy_submetered()

In [None]:
house_data.available_ac_types('power')

In [None]:
house_data.mains().available_ac_types('power')

In [None]:
house_data.submeters().available_ac_types('power')

In [None]:
# Total energy is returned in 'kWh'

house_data.mains().total_energy() 

In [None]:
# Energy use per submeter

house_data.submeters().energy_per_meter()
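
To build intuition for what these statistics mean, the sketch below converts evenly spaced power samples into energy and then computes the share of mains energy covered by submeters. It is a simplified illustration and not NILMTK's implementation. The sample values, the names `energy_kwh`, `mains_watts` and `submeter_watts`, and the assumption of a constant 6-second sample period with no gaps are all hypothetical.

```python
SAMPLE_PERIOD_S = 6  # assumed constant spacing between samples, in seconds

# Made-up power readings in watts
mains_watts = [400.0, 400.0, 400.0, 400.0]
submeter_watts = {
    "fridge": [100.0, 100.0, 100.0, 100.0],
    "tv": [100.0, 100.0, 100.0, 100.0],
}


def energy_kwh(power_watts, sample_period_s):
    """Approximate energy as sum(power * time step), converted from joules to kWh."""
    joules = sum(power_watts) * sample_period_s
    return joules / 3.6e6  # 1 kWh = 3.6 million joules


mains_energy = energy_kwh(mains_watts, SAMPLE_PERIOD_S)
submetered_energy = sum(
    energy_kwh(watts, SAMPLE_PERIOD_S) for watts in submeter_watts.values()
)

# Fraction of the mains energy that the submeters account for
proportion = submetered_energy / mains_energy

print("Mains energy      = {:.6f} kWh".format(mains_energy))
print("Submetered energy = {:.6f} kWh".format(submetered_energy))
print("Proportion        = {:.2f}".format(proportion))
```

A proportion well below 1 means that much of the household consumption comes from appliances without their own submeter.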

# Your Summary and Findings

As an exercise, take this initial data exploration and analysis further, then record your summary and findings here.

- Created Date: ??
- Updated Date: ??

**Findings:**
- ?
- ?
- ?