# Initial Exploratory Data Analysis

This notebook performs a first analysis of the raw data to get an initial view of the data sets: their features and types, the presence of missing values and outliers, inconsistencies, etc.

For every data set, we want to end up with a to-do list of tasks for later steps of the data preprocessing stage, such as:

* Missing values imputation
* Handling outliers
* Data transformation
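
As a rough illustration of how such a to-do list could be drafted, the sketch below summarises the share of missing values and the number of IQR-rule outliers per column. The helper name `preprocessing_todo`, the 1.5 IQR factor and the toy values are assumptions made for this example; they are not part of the project code.

```python
import numpy as np
import pandas as pd


def preprocessing_todo(df: pd.DataFrame, iqr_factor: float = 1.5) -> pd.DataFrame:
    """Summarise, per column, the share of missing values and IQR-rule outliers."""
    rows = []
    for col in df.columns:
        series = df[col]
        n_outliers = 0
        is_numeric = pd.api.types.is_numeric_dtype(series)
        is_bool = pd.api.types.is_bool_dtype(series)
        if is_numeric and not is_bool:
            # Quantiles ignore NaN; comparisons against NaN evaluate to False
            q1, q3 = series.quantile([0.25, 0.75])
            iqr = q3 - q1
            mask = (series < q1 - iqr_factor * iqr) | (series > q3 + iqr_factor * iqr)
            n_outliers = int(mask.sum())
        rows.append({
            "column": col,
            "dtype": str(series.dtype),
            "missing_pct": 100 * series.isna().mean(),
            "iqr_outliers": n_outliers,
        })
    return pd.DataFrame(rows).set_index("column")


# Toy frame with made-up values, only to show the shape of the summary
toy = pd.DataFrame({
    "air_temperature": [20.0, 21.0, np.nan, 22.0, 95.0],
    "primary_use": ["Office", None, "Education", "Office", "Office"],
})
todo = preprocessing_todo(toy)
print(todo)
```

The IQR rule is only one possible outlier criterion; for skewed variables such as energy readings, a different rule may be more appropriate.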

## Data description

The data come from the Kaggle competition [ASHRAE - Great Energy Predictor III](https://www.kaggle.com/c/ashrae-energy-prediction), where the goal is to predict the energy consumption of several buildings for the next two years. The files and the features in each one are described below:

**train.csv**
* `building_id` - Foreign key for the building metadata.
* `meter` - The meter id code. Read as `{0: electricity, 1: chilledwater, 2: steam, 3: hotwater}`. Not every building has all meter types.
* `timestamp` - When the measurement was taken
* `meter_reading` - The target variable. Energy consumption in kWh (or equivalent). Note that this is real data with measurement error, which we expect will impose a baseline level of modeling error.
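
The meter codes above can be turned into readable labels with a simple mapping. This is a minimal sketch on made-up rows; the constant `METER_NAMES`, the column `meter_name` and the timestamp format are assumptions for illustration.

```python
import pandas as pd

# Mapping taken from the feature description of `meter`
METER_NAMES = {0: "electricity", 1: "chilledwater", 2: "steam", 3: "hotwater"}

# Made-up rows with the same columns as train.csv
sample = pd.DataFrame({
    "building_id": [0, 0, 1],
    "meter": [0, 1, 0],
    "timestamp": ["2016-01-01 00:00:00", "2016-01-01 00:00:00", "2016-01-01 01:00:00"],
    "meter_reading": [12.5, 0.0, 3.2],
})

# Parse timestamps and attach a human-readable meter label
sample["timestamp"] = pd.to_datetime(sample["timestamp"])
sample["meter_name"] = sample["meter"].map(METER_NAMES).astype("category")

# Not every building has all meter types, so count them per building
meters_per_building = sample.groupby("building_id")["meter"].nunique()
print(meters_per_building)
```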

**building_metadata.csv**
* `site_id` - Foreign key for the weather files.
* `building_id` - Foreign key for train.csv
* `primary_use` - Indicator of the primary category of activities for the building based on EnergyStar property type definitions
* `square_feet` - Gross floor area of the building
* `year_built` - Year building was opened
* `floor_count` - Number of floors of the building

**weather_[train/test].csv**

Weather data from a meteorological station as close as possible to the site.

* `site_id`
* `air_temperature` - Degrees Celsius
* `cloud_coverage` - Portion of the sky covered in clouds, in oktas
* `dew_temperature` - Degrees Celsius
* `precip_depth_1_hr` - Millimeters
* `sea_level_pressure` - Millibar/hectopascals
* `wind_direction` - Compass direction (0-360)
* `wind_speed` - Meters per second

**test.csv**

The submission files use row numbers for ID codes in order to save space on the file uploads. test.csv has no feature data; it exists so you can get your predictions into the correct order.

* `row_id` - Row id for your submission file
* `building_id` - Building id code
* `meter` - The meter id code
* `timestamp` - Timestamps for the test data period
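
The foreign keys described above suggest how the files relate: readings link to buildings through `building_id`, and buildings link to weather through `site_id`. The sketch below joins three toy frames along those keys. The values are made up, the `toy_` names are hypothetical, and the presence of a `timestamp` column in the weather files is an assumption, since it is not listed in the description above.

```python
import pandas as pd

# Toy stand-ins for train.csv, building_metadata.csv and weather_train.csv
toy_train = pd.DataFrame({
    "building_id": [0, 1],
    "meter": [0, 0],
    "timestamp": pd.to_datetime(["2016-01-01 00:00", "2016-01-01 00:00"]),
    "meter_reading": [10.0, 20.0],
})
toy_buildings = pd.DataFrame({
    "site_id": [0, 1],
    "building_id": [0, 1],
    "primary_use": ["Office", "Education"],
    "square_feet": [1000, 2000],
})
toy_weather = pd.DataFrame({
    "site_id": [0],
    "timestamp": pd.to_datetime(["2016-01-01 00:00"]),  # assumed column
    "air_temperature": [5.0],
})

# Left joins keep every reading; `validate` raises if a key is duplicated
# on the right-hand side, which would silently multiply rows otherwise
merged = (
    toy_train
    .merge(toy_buildings, on="building_id", how="left", validate="many_to_one")
    .merge(toy_weather, on=["site_id", "timestamp"], how="left", validate="many_to_one")
)
print(merged)
```

Readings without a matching weather row end up with missing weather values after the left join, which is one source of the missing data to look for in the analysis.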


In [1]:
# Load libraries

%load_ext autoreload
%autoreload 2

import pandas as pd
import numpy as np
import datetime as dt
import gc
from src.functions import utils as utl
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
import pandas_profiling

## Train Data import

To avoid memory problems caused by the size of the data sets, we use the function `import_data` from our local library `utils`, which considerably reduces the size of the data frames.

In [5]:
train = utl.import_data('../../data/raw/train.csv') 
building_meta = utl.import_data('../../data/raw/building_metadata.csv')
weather_train = utl.import_data('../../data/raw/weather_train.csv')

Memory usage of dataframe is 616.95 MB
Memory usage after optimization is: 173.90 MB
Decreased by 71.8%
Memory usage of dataframe is 0.07 MB
Memory usage after optimization is: 0.02 MB
Decreased by 73.8%
Memory usage of dataframe is 9.60 MB
Memory usage after optimization is: 2.65 MB
Decreased by 72.4%
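
The source of `import_data` is not shown in this notebook. A common way to obtain reductions of this kind is to downcast numeric columns to the smallest dtype that holds their values and to store strings as categories. The sketch below illustrates that idea under this assumption; `reduce_memory` is a hypothetical helper, not the function in `utils`.

```python
import numpy as np
import pandas as pd


def reduce_memory(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `df` with numeric columns downcast and strings as categories."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            # Smallest signed integer dtype that can hold the values
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif pd.api.types.is_float_dtype(out[col]):
            # float32 when the values survive the conversion within tolerance
            out[col] = pd.to_numeric(out[col], downcast="float")
        elif pd.api.types.is_object_dtype(out[col]) or pd.api.types.is_string_dtype(out[col]):
            # Repeated strings are stored once as category labels
            out[col] = out[col].astype("category")
    return out


# Toy frame with made-up values
toy = pd.DataFrame({
    "building_id": np.arange(1000, dtype="int64"),
    "meter_reading": np.linspace(0, 100, 1000),
    "primary_use": ["Office", "Education"] * 500,
})
small = reduce_memory(toy)

before = toy.memory_usage(deep=True).sum()
after = small.memory_usage(deep=True).sum()
print(f"Decreased by {100 * (before - after) / before:.1f}%")
```

Downcasting floats trades precision for memory, so it is worth checking that the target variable is not affected before relying on it.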


## Exploratory Data Analysis Reports 

Using the library `pandas_profiling`, we generate an exploratory data analysis report for each of the three raw data sets.

### `weather_train` data set

In [7]:
weather_train_profile = weather_train.profile_report(
    title='Weather Data Profile', 
    style={'full_width':True}
)

In [10]:
weather_train_profile.to_file(output_file="../../reports/EDA/weather_train_profile.html")
weather_train_profile

