# Initial Exploratory Data Analysis

This notebook performs an initial analysis of the raw data to get a first look at the data sets: their features and types, missing values, outliers, inconsistencies, etc.

For every data set, we want to end up with a to-do list for the later data preprocessing steps, such as:

* Missing values imputation
* Handling outliers
* Data transformation

## Data description

The data come from the Kaggle competition [ASHRAE - Great Energy Predictor III](https://www.kaggle.com/c/ashrae-energy-prediction), where the goal is to predict the energy consumption of several buildings over the next two years. The files and their features are described below:

**train.csv**
* `building_id` - Foreign key for the building metadata.
* `meter` - The meter id code. Read as `{0: electricity, 1: chilledwater, 2: steam, 3: hotwater}`. Not every building has all meter types.
* `timestamp` - When the measurement was taken.
* `meter_reading` - The target variable. Energy consumption in kWh (or equivalent). Note that this is real data with measurement error, which we expect will impose a baseline level of modeling error.

**building_metadata.csv**
* `site_id` - Foreign key for the weather files.
* `building_id` - Foreign key for `train.csv`.
* `primary_use` - Indicator of the primary category of activities for the building based on EnergyStar property type definitions
* `square_feet` - Gross floor area of the building
* `year_built` - Year building was opened
* `floor_count` - Number of floors of the building

**weather_[train/test].csv**

Weather data from a meteorological station as close as possible to the site.

* `site_id`
* `air_temperature` - Degrees Celsius
* `cloud_coverage` - Portion of the sky covered in clouds, in oktas
* `dew_temperature` - Degrees Celsius
* `precip_depth_1_hr` - Millimeters
* `sea_level_pressure` - Millibar/hectopascals
* `wind_direction` - Compass direction (0-360)
* `wind_speed` - Meters per second

**test.csv**

The submission files use row numbers for ID codes in order to save space on the file uploads. test.csv has no feature data; it exists so you can get your predictions into the correct order.

* `row_id` - Row id for your submission file
* `building_id` - Building id code
* `meter` - The meter id code
* `timestamp` - Timestamps for the test data period
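
The foreign keys described above indicate how the files fit together: `building_id` links the meter readings to the building metadata, and `site_id` (together with `timestamp`) links buildings to the weather data. A minimal sketch with toy rows follows; the meter readings and temperatures are invented for illustration and are not taken from the real files.

```python
import pandas as pd

# Toy stand-ins for the three files (illustrative values only)
timestamps = pd.to_datetime(['2016-01-01 00:00', '2016-01-01 01:00'])

train_toy = pd.DataFrame({
    'building_id': [0, 0, 1],
    'meter': [0, 0, 0],
    'timestamp': [timestamps[0], timestamps[1], timestamps[0]],
    'meter_reading': [10.0, 12.5, 3.2],
})
building_toy = pd.DataFrame({
    'site_id': [0, 0],
    'building_id': [0, 1],
    'primary_use': ['Education', 'Education'],
    'square_feet': [7432, 2720],
})
weather_toy = pd.DataFrame({
    'site_id': [0, 0],
    'timestamp': timestamps,
    'air_temperature': [25.0, 24.4],
})

# Readings -> building metadata via building_id,
# then -> weather via (site_id, timestamp)
merged = (
    train_toy
    .merge(building_toy, on='building_id', how='left')
    .merge(weather_toy, on=['site_id', 'timestamp'], how='left')
)
print(merged)
```

Left joins keep every meter reading even when metadata or weather rows are missing, which makes such gaps visible as `NaN` instead of silently dropping rows.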


In [1]:
# Load libraries

%load_ext autoreload
%autoreload 2

import pandas as pd
import numpy as np
import datetime as dt
import gc
from src.functions import utils as utl
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
import pandas_profiling

## Train Data import

To avoid memory problems caused by the size of the data sets, we use the function `import_data` from our local library `utils`, which considerably reduces the memory footprint of the data frames.

In [46]:
train = utl.import_data('../../data/raw/train.csv') 

Memory usage of dataframe is 616.95 MB
Memory usage after optimization is: 173.90 MB
Decreased by 71.8%
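
The internals of `utl.import_data` are not shown in this notebook. A common way to obtain reductions like the one reported above is to downcast numeric columns to the smallest dtype that can hold their values. The sketch below uses a hypothetical helper, `downcast_numeric`, to illustrate the idea; it is an assumption about the approach, not the actual implementation.

```python
import numpy as np
import pandas as pd

def downcast_numeric(df):
    """Return a copy of df with numeric columns downcast to smaller dtypes.

    Hypothetical helper for illustration; not the code behind utl.import_data.
    """
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast='integer')
        elif pd.api.types.is_float_dtype(out[col]):
            # pd.to_numeric goes no lower than float32
            out[col] = pd.to_numeric(out[col], downcast='float')
    return out

# Toy frame with pandas' default 64-bit dtypes
demo = pd.DataFrame({
    'building_id': np.arange(1000, dtype='int64'),
    'meter_reading': np.arange(1000, dtype='float64') * 0.5,
})
small = downcast_numeric(demo)

before = demo.memory_usage(deep=True).sum()
after = small.memory_usage(deep=True).sum()
print(small.dtypes)
print('Decreased by {:.1f}%'.format(100 * (before - after) / before))
```

Downcasting floats trades precision for memory, so it is worth checking that the reduced precision is acceptable for the target variable.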


In [2]:
building_meta = utl.import_data('../../data/raw/building_metadata.csv')
building_meta.name = 'Building_metadata'

Memory usage of dataframe is 0.07 MB
Memory usage after optimization is: 0.02 MB
Decreased by 73.8%


In [3]:
weather_train = utl.import_data('../../data/raw/weather_train.csv')
weather_train.name = 'Weather'

Memory usage of dataframe is 9.60 MB
Memory usage after optimization is: 2.65 MB
Decreased by 72.4%


## Exploratory Data Analysis Reports 

Using the library `pandas_profiling`, we generate an exploratory data analysis report for each of the three raw data sets.

### `weather_train` data set

In [None]:
weather_train_profile = weather_train.profile_report(
    title='Weather Data Profile', 
    style={'full_width':True}
)

In [None]:
weather_train_profile.to_file(output_file="../../reports/EDA/raw_weather_train_profile.html")
weather_train_profile

The `timestamp` variable is read as categorical, so we convert it to datetime.

In [4]:
weather_train['timestamp'] = pd.to_datetime(weather_train['timestamp'])

In [None]:
weather_train.head()

#### Reports by `site_id`

In [5]:
sites = weather_train['site_id'].unique()

In [43]:
def get_report_by_site(df, sites):
    """Generate a report for every site in list 'sites'
    """
    reports = []
    for site in sites:
        try:
            report = df.loc[df['site_id'] == site, :].profile_report(
                title='Report Site {}'.format(site), 
                style={'full_width':True}
            )
            reports.append(report)  
        except Exception:
            print('WARN: Report for site {} has not been generated'.format(site))
            continue
            
    return reports

In [None]:
weather_reports = get_report_by_site(weather_train, sites)

In [45]:
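def export_reports(df, reports, loc):
    """Export each report in 'reports' to HTML in the directory indicated by 'loc'.

    Files are numbered by position in 'reports', which matches the site id
    only if there is one report per site, in order.
    """
    for i, report in enumerate(reports):
        # Build the path before the try block so it is defined in the handler
        output_file = loc + '{}_site{}.html'.format(df.name, i)
        try:
            report.to_file(output_file=output_file)
        except Exception:
            print('WARN: Export failed for file {}'.format(output_file))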

In [None]:
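# The original call omitted the required 'loc' argument; the directory below is an assumption
export_reports(weather_train, weather_reports, "../../reports/EDA/")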

### `building_metadata` data set

In [8]:
building_meta.head()

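| | site_id | building_id | primary_use | square_feet | year_built | floor_count |
|---|---|---|---|---|---|---|
| 0 | 0 | 0 | Education | 7432 | 2008.0 | NaN |
| 1 | 0 | 1 | Education | 2720 | 2004.0 | NaN |
| 2 | 0 | 2 | Education | 5376 | 1991.0 | NaN |
| 3 | 0 | 3 | Education | 23685 | 2002.0 | NaN |
| 4 | 0 | 4 | Education | 116607 | 1975.0 | NaN |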


In [9]:
building_meta.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1449 entries, 0 to 1448
Data columns (total 6 columns):
site_id        1449 non-null int8
building_id    1449 non-null int16
primary_use    1449 non-null category
square_feet    1449 non-null int32
year_built     675 non-null float16
floor_count    355 non-null float16
dtypes: category(1), float16(2), int16(1), int32(1), int8(1)
memory usage: 17.9 KB
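
The non-null counts above translate directly into missing-value shares, which helps decide how to handle each feature later. A small sketch of the arithmetic, using the counts printed by `info()`:

```python
import pandas as pd

# Non-null counts copied from the info() output above
non_null = pd.Series({
    'site_id': 1449,
    'building_id': 1449,
    'primary_use': 1449,
    'square_feet': 1449,
    'year_built': 675,
    'floor_count': 355,
})
total_rows = 1449

# Percentage of missing values per feature
missing_pct = ((1 - non_null / total_rows) * 100).round(1)
print(missing_pct)
# year_built is about 53.4% missing and floor_count about 75.5% missing

# On the real data frame, building_meta.isna().mean() * 100 should give the same figures
```

With more than half of `year_built` and about three quarters of `floor_count` missing, both features are candidates for the imputation step listed at the top of the notebook.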


Feature `year_built` is of type `float`. We cast it to `str` and remove the trailing '.0'.

In [10]:
building_meta['year_built'] = building_meta['year_built'].astype(str, errors='ignore')

f = lambda x: x.replace('.0','')
building_meta['year_built'] = building_meta['year_built'].apply(f)

In [12]:
building_meta.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1449 entries, 0 to 1448
Data columns (total 6 columns):
site_id        1449 non-null int8
building_id    1449 non-null int16
primary_use    1449 non-null category
square_feet    1449 non-null int32
year_built     1449 non-null object
floor_count    355 non-null float16
dtypes: category(1), float16(1), int16(1), int32(1), int8(1), object(1)
memory usage: 26.3+ KB
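
Note that `year_built` now shows 1449 non-null values: in the pandas version used here, casting to `str` turns each missing value into the literal string `'nan'`, so the 774 missing years are no longer detected as missing. If the goal is only to drop the decimal part, one alternative is pandas' nullable integer dtype `Int64`, which keeps missing values as `<NA>`. A sketch on a toy series, assuming a pandas version that supports `Int64` (0.24 or later):

```python
import numpy as np
import pandas as pd

# Toy stand-in for building_meta['year_built'] (float16 with missing values)
year_built = pd.Series([2008.0, 2004.0, np.nan, 1975.0], dtype='float16')

# Widen to float64 first, then cast to the nullable integer dtype
year_built_int = year_built.astype('float64').astype('Int64')

print(year_built_int)
print('Missing values:', year_built_int.isna().sum())
```

This keeps the feature numeric, so summary statistics and missing-value reports still work on it.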


Let's generate a report by site:

In [17]:
building_reports = get_report_by_site(building_meta, [7])

WARN: Report for site 7 has not been generated


In [20]:
export_reports(building_meta, building_reports, "../../reports/EDA/building_metada/")

Since the report for site 7 fails to generate, let's inspect that site with other functions:

In [29]:
building_meta.loc[building_meta['site_id'] == 7, :]

| | site_id | building_id | primary_use | square_feet | year_built | floor_count |
|---|---|---|---|---|---|---|
| 789 | 7 | 789 | Education | 64583 | 1923.0 | 1.0 |
| 790 | 7 | 790 | Education | 86111 | 1911.0 | 8.0 |
| 791 | 7 | 791 | Education | 150695 | NaN | 5.0 |
| 792 | 7 | 792 | Education | 333681 | 1938.0 | 3.0 |
| 793 | 7 | 793 | Education | 150695 | 1964.0 | 6.0 |
| 794 | 7 | 794 | Education | 731945 | 1969.0 | 11.0 |
| 795 | 7 | 795 | Education | 387500 | 1960.0 | 6.0 |
| 796 | 7 | 796 | Education | 226042 | 1965.0 | 2.0 |
| 797 | 7 | 797 | Education | 764237 | 1979.0 | 13.0 |
| 798 | 7 | 798 | Education | 409028 | 1970.0 | 21.0 |


In [26]:
building_meta.loc[building_meta['site_id'] == 7, ['square_feet', 'floor_count']].describe()

| | square_feet | floor_count |
|---|---|---|
| count | 15.0 | 15.0 |
| mean | 323634.533333 | 8.734375 |
| std | 226863.000213 | 7.730469 |
| min | 64583.0 | 1.0 |
| 25% | 150695.0 | 3.0 |
| 50% | 290625.0 | 6.0 |
| 75% | 446702.0 | 12.0 |
| max | 764237.0 | 26.0 |


In [24]:
building_meta.loc[building_meta['site_id'] == 7, :].isna().sum()

site_id        0
building_id    0
primary_use    0
square_feet    0
year_built     0
floor_count    0
dtype: int64
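
The manual checks above (filter one site, describe it, count missing values) can be bundled into a small helper for any site whose profile report fails. The function name `summarize_site` and the toy data are hypothetical, introduced here only for illustration:

```python
import numpy as np
import pandas as pd

def summarize_site(df, site):
    """Return (numeric summary, missing-value counts) for the rows of one site."""
    subset = df.loc[df['site_id'] == site]
    numeric_summary = subset.select_dtypes(include='number').describe()
    missing = subset.isna().sum()
    return numeric_summary, missing

# Toy frame (illustrative values, not the real metadata)
toy = pd.DataFrame({
    'site_id': [7, 7, 7, 8],
    'square_feet': [60000, 90000, 150000, 1000],
    'floor_count': [1.0, 8.0, np.nan, 2.0],
})

summary, missing = summarize_site(toy, 7)
print(summary)
print(missing)
```

Running the helper on `building_meta` for site 7 would reproduce the `describe()` and `isna().sum()` outputs shown above in one call.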

Report for the whole data set, ignoring site 7:

In [37]:
building_meta_profile = building_meta.loc[building_meta['site_id'] != 7, :].profile_report(
    title='Building Metadata Profile', 
    style={'full_width':True}
)

In [40]:
building_meta_profile.to_file(output_file="../../reports/EDA/raw_building_meta_profile.html")

### `train` data set

In [None]:
# Store the report under a new name so the 'train' data frame is not overwritten
train_profile = train.profile_report(
    title='Train set', 
    style={'full_width':True}
)
