## Lesson 01 Assignment

Import required libraries:

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# Functions

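# Display kernel density plot for the given column with options
# Note: reads from the global energy_loads dataframe defined below.
# Note: the `bw` argument is deprecated in seaborn 0.11+ in favor of
# `bw_method` / `bw_adjust`, which scale the bandwidth differently.
def show_kde_plot(col_name, axis, bandwidth):
    sns.kdeplot(energy_loads.loc[:, col_name], ax = axis, bw = bandwidth)
    axis.set_title('KDE plot of ' + col_name) # Give the plot a main title
    axis.set_xlabel(col_name) # Set text for the x axis
    axis.set_ylabel('Density') # Set text for the y axis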

# Display distribution for the given categorical variable as a bar chart
def show_count_plot(data_set, col_name, axis):
    sns.countplot(x = col_name, data = data_set, ax = axis, color="#4080c0")
    axis.set_title('Number of Buildings by ' + col_name)
    axis.set_xlabel(col_name)
    axis.set_ylabel('Number of Buildings')

Download the energy efficiency data set:

In [None]:
url = 'https://raw.githubusercontent.com/StephenElston/DataScience410/master/Lecture1/EnergyEfficiencyData.csv'

# Download the data into a dataframe object
energy_loads = pd.read_csv(url)

Inspect initial rows of data set:

In [None]:
energy_loads.head()

Inspect final rows of data set:

In [None]:
energy_loads.tail()

### Glossary
 - *Cooling Load*: amount of heat energy that must be removed from a space (cooling) to maintain the temperature in an acceptable range (kWh/m²)
 - *Glazing Area*: glass component of building's façade or internal surfaces relative to floor area, one of 0%, 10%, 25%, or 40% of floor area
 - *Glazing Area Distribution*: how glazing area is distributed within the whole building, 1:Uniform, 2:North, 3:East, 4:South, 5:West
 - *Heating Load*: amount of heat energy that must be added to a space to maintain the temperature in an acceptable range (kWh/m²)
 - *Orientation*: 2:North, 3:East, 4:South, 5:West
 - *Overall Height*: height of the building at its highest point (m)
 - *Relative Compactness*: volume-to-surface ratio of the structure compared to that of the most compact shape with the same volume, 1 = most compact
 - *Roof Area*: total area of roof (m²)
 - *Surface Area*: total area of all surfaces (m²)
 - *Wall Area*: total area of all walls (m²)

### Frequency and Distribution of Categorical Variables

Begin with frequency tables of explicitly categorical variables: *Orientation* and *Glazing Area Distribution*.

Frequency of orientation values (2:North, 3:East, 4:South, 5:West):

In [None]:
energy_loads['counts'] = 1
energy_loads[['counts', 'Orientation']].groupby(['Orientation']).agg('count')

Frequency of glazing area distribution values (1:Uniform, 2:North, 3:East, 4:South, 5:West):

In [None]:
energy_loads['counts'] = 1
energy_loads[['counts', 
              'Glazing Area Distribution']].groupby(['Glazing Area Distribution']).agg('count')
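
The same kind of frequency table can be produced more directly with `value_counts`, which avoids adding a helper `counts` column to the dataframe. The following is a minimal sketch: the helper name `frequency_table` and the small `toy` dataframe are assumptions made for illustration and are not part of the assignment data.

```python
import pandas as pd

def frequency_table(data_set, col_name):
    """Return the number of rows for each distinct value of col_name, sorted by value."""
    return data_set[col_name].value_counts().sort_index()

# Toy stand-in for energy_loads (made-up rows, for illustration only)
toy = pd.DataFrame({'Orientation': [2, 3, 4, 5, 2, 3, 4, 5]})

orientation_counts = frequency_table(toy, 'Orientation')
print(orientation_counts)
```

With the real data, `frequency_table(energy_loads, 'Orientation')` should give the same counts as the `groupby` cell above.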

We can render these frequencies graphically:

In [None]:
# Frequency distribution of categorical variables
sns.set_style("whitegrid") # Set the style before creating the axes so it takes effect
fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(8, 2))
plt.subplots_adjust(wspace=0.3, hspace=0.5)
show_count_plot(energy_loads, 'Orientation', ax1)
show_count_plot(energy_loads, 'Glazing Area Distribution', ax2)

### Distribution of Continuous Variables

Now we explore the distributions of the ostensibly continuous variables via kernel density plots:

In [None]:
# Explore distributions of the columns
sns.set_style("whitegrid") # Set the style before creating the axes so it takes effect
fig, ((ax1, ax2), (ax3, ax4), (ax5, ax6), 
      (ax7, ax8)) = plt.subplots(nrows=4, ncols=2, figsize=(10, 10))
plt.subplots_adjust(wspace=0.3, hspace=0.5)
show_kde_plot('Relative Compactness', ax1, 0.002)
show_kde_plot('Surface Area', ax2, 1)
show_kde_plot('Wall Area', ax3, 2)
show_kde_plot('Roof Area', ax4, 2)
show_kde_plot('Overall Height', ax5, 0.2)
show_kde_plot('Glazing Area', ax6, 0.01)
show_kde_plot('Heating Load', ax7, 0.2)
show_kde_plot('Cooling Load', ax8, 0.2)

It is apparent that 6 of our ostensibly continuous variables are not distributed continuously but take only a limited number of values: *Relative Compactness*, *Surface Area*, *Wall Area*, *Roof Area*, *Overall Height*, and *Glazing Area*. Frequency tables of these variables are given below.

There are 12 possible values of *Relative Compactness*, represented by 64 units each for a total of 768 units:

In [None]:
energy_loads['counts'] = 1
energy_loads[['counts', 'Relative Compactness']].groupby(['Relative Compactness']).agg('count')

Similarly, there are 12 possible values of *Surface Area*, again represented by 64 units each for a total of 768 units:

In [None]:
energy_loads['counts'] = 1
energy_loads[['counts', 'Surface Area']].groupby(['Surface Area']).agg('count')

There are 7 possible values of *Wall Area*:

In [None]:
energy_loads['counts'] = 1
energy_loads[['counts', 'Wall Area']].groupby(['Wall Area']).agg('count')

There are just 4 possible values of *Roof Area*:

In [None]:
energy_loads['counts'] = 1
energy_loads[['counts', 'Roof Area']].groupby(['Roof Area']).agg('count')

*Overall Height* consists of just 2 possible values:

In [None]:
energy_loads['counts'] = 1
energy_loads[['counts', 'Overall Height']].groupby(['Overall Height']).agg('count')

There are just 4 possible values of *Glazing Area*:

In [None]:
energy_loads['counts'] = 1
energy_loads[['counts', 'Glazing Area']].groupby(['Glazing Area']).agg('count')
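
A compact way to confirm how few distinct values each of these columns takes is `nunique`, which reports the count of distinct values per column in one call. This is only a sketch: `distinct_value_counts` is a hypothetical helper, and `toy` is a made-up dataframe used in place of `energy_loads`.

```python
import pandas as pd

def distinct_value_counts(data_set, col_names):
    """Return the number of distinct values found in each of the given columns."""
    return data_set[col_names].nunique()

# Toy stand-in for energy_loads (made-up rows, for illustration only)
toy = pd.DataFrame({
    'Overall Height': [3.5, 3.5, 7.0, 7.0],
    'Glazing Area': [0.0, 0.1, 0.25, 0.4],
})

n_values = distinct_value_counts(toy, ['Overall Height', 'Glazing Area'])
print(n_values)
```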

Clearly, there is a limited number of building configurations among the 768 units in the data set. To identify these configurations, we apply the *groupby* method:

In [None]:
energy_loads['counts'] = 1
energy_loads[['counts', 'Relative Compactness', 'Surface Area', 
              'Wall Area', 'Roof Area', 'Overall Height'
             ]].groupby(['Relative Compactness', 'Surface Area', 'Wall Area', 
                         'Roof Area', 'Overall Height']).agg('count')

It turns out there are 12 building configurations, uniquely identifiable by either *Relative Compactness* or *Surface Area*, which are perfectly correlated (proxies). There are 64 units of each configuration.

*Wall Area*: Three configurations have a wall area of 294 square meters, three have a wall area of 318.5 square meters, and two have a wall area of 343 square meters. The other four configurations have unique wall area measurements, two at the low extreme and two at the high extreme.

*Roof Area*: Six configurations have a roof area of 220.5 square meters, three have a roof area of 147 square meters, two have a roof area of 122.5 square meters, and one has a roof area of 110.25 square meters.

*Overall Height*: Six configurations have an overall height of 3.5 meters, while six have an overall height of 7 meters.

So *Wall Area*, *Roof Area*, and *Overall Height* are each partly determined by *Relative Compactness* (or, equivalently, by *Surface Area*).
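
The claim that two columns are proxies for each other can be checked programmatically: each value of one column should pair with exactly one value of the other, in both directions. The sketch below uses two hypothetical helpers, `is_one_to_one` and `configuration_counts`, and a made-up `toy` dataframe in place of `energy_loads`.

```python
import pandas as pd

def is_one_to_one(data_set, col_a, col_b):
    """True if each value of col_a pairs with exactly one value of col_b, and vice versa."""
    a_to_b = data_set.groupby(col_a)[col_b].nunique()
    b_to_a = data_set.groupby(col_b)[col_a].nunique()
    return bool((a_to_b == 1).all() and (b_to_a == 1).all())

def configuration_counts(data_set, col_names):
    """Return one row per distinct combination of col_names with its number of units."""
    return data_set.groupby(col_names).size().reset_index(name='units')

# Toy stand-in for energy_loads (made-up rows, for illustration only)
toy = pd.DataFrame({
    'Relative Compactness': [0.9, 0.9, 0.8, 0.8],
    'Surface Area': [500.0, 500.0, 600.0, 600.0],
    'Overall Height': [7.0, 7.0, 7.0, 7.0],
})

proxies = is_one_to_one(toy, 'Relative Compactness', 'Surface Area')
configs = configuration_counts(toy, ['Relative Compactness', 'Surface Area'])
print(proxies)
print(configs)
```

In the toy data, *Overall Height* is not one-to-one with *Surface Area*, because a single height is shared by both surface areas.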

### *Surface Area* as a Possible Predictor of Heating / Cooling Load

Since either *Relative Compactness* or *Surface Area* defines the unit configuration (among the 12 configurations identified above), we would like to see how well these variables correlate with heating or cooling load and assess the contribution of the remaining categorical variables (*Orientation*, *Glazing Area*, and *Glazing Area Distribution*). We shall arbitrarily choose *Surface Area* as the basis of our analysis because its units (square meters) are straightforward to conceptualize.

We begin by examining scatter plots of *Surface Area* against *Heating Load* and *Cooling Load*.

Here is *Surface Area* vs. *Heating Load*:

In [None]:
## Define a figure and axes and make a scatter plot
fig = plt.figure(figsize=(6, 6)) # define plot area
ax = fig.gca() # define axis
ax.scatter(x = energy_loads['Surface Area'], y = energy_loads['Heating Load'])
ax.set_title('Scatter plot of Surface Area vs Heating Load') # Give the plot a main title
ax.set_xlabel('Surface Area') # Set text for the x axis
ax.set_ylabel('Heating Load') # Set text for y axis

Here is *Surface Area* vs. *Cooling Load*:

In [None]:
## Define a figure and axes and make a scatter plot
fig = plt.figure(figsize=(6, 6)) # define plot area
ax = fig.gca() # define axis
ax.scatter(x = energy_loads['Surface Area'], y = energy_loads['Cooling Load'])
ax.set_title('Scatter plot of Surface Area vs Cooling Load') # Give the plot a main title
ax.set_xlabel('Surface Area') # Set text for the x axis
ax.set_ylabel('Cooling Load') # Set text for y axis

The relationships of *Surface Area* to *Heating Load* and to *Cooling Load* are evidently very similar. Therefore, we shall consider only *Heating Load* in further analysis.
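
One way to back up the visual impression that the two loads behave alike is to compute the correlation between them. This is a sketch under stated assumptions: `load_correlation` is a hypothetical helper, and `toy` holds made-up values rather than the real data.

```python
import pandas as pd

def load_correlation(data_set, col_a='Heating Load', col_b='Cooling Load'):
    """Return the Pearson correlation between two load columns."""
    return data_set[col_a].corr(data_set[col_b])

# Toy stand-in for energy_loads (made-up rows, for illustration only)
toy = pd.DataFrame({
    'Heating Load': [10.0, 20.0, 30.0, 40.0],
    'Cooling Load': [12.0, 22.0, 32.0, 42.0],
})

r = load_correlation(toy)
print(r)
```

A value close to 1 on the real data would support analyzing only one of the two loads.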

The most startling feature of these two scatter plots is the sharp break in heating and cooling load between surface areas of 661.5 and 686 square meters. Looking at the configurations above, it is apparent that this break is attributable to the height difference between units with surface areas of 661.5 square meters or less and units with surface areas of 686 square meters or more. The former are taller at 7 meters, while the latter are shorter at 3.5 meters. **So the taller configurations have a smaller surface area.** Yet despite their smaller surface area, these taller buildings have significantly *higher* heating and cooling loads.

Graphing *Overall Height* against *Heating Load* reveals this relationship:

In [None]:
## Define a figure and axes and make a scatter plot
fig = plt.figure(figsize=(4, 4)) # define plot area
ax = fig.gca() # define axis
ax.scatter(x = energy_loads['Overall Height'], y = energy_loads['Heating Load'])
ax.set_title('Scatter plot of Overall Height vs Heating Load') # Give the plot a main title
ax.set_xlabel('Overall Height') # Set text for the x axis
ax.set_ylabel('Heating Load') # Set text for y axis

In fact, knowing the height of a unit in our data set allows us to predict its minimum or maximum heating load: If the configuration is 7 meters in overall height, its heating load will be above 15 kWh/m². If the configuration is 3.5 meters in overall height, its heating load will be below 20 kWh/m².
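
Bounds like these can be read off a table of the minimum and maximum load within each height group. The following is a minimal sketch: `load_range_by_group` is a hypothetical helper, and the values in `toy` are made up for illustration and are not taken from the data set.

```python
import pandas as pd

def load_range_by_group(data_set, group_col, load_col):
    """Return the minimum and maximum of load_col within each value of group_col."""
    return data_set.groupby(group_col)[load_col].agg(['min', 'max'])

# Toy stand-in for energy_loads (made-up rows, for illustration only)
toy = pd.DataFrame({
    'Overall Height': [3.5, 3.5, 7.0, 7.0],
    'Heating Load': [6.0, 18.0, 16.0, 40.0],
})

ranges = load_range_by_group(toy, 'Overall Height', 'Heating Load')
print(ranges)
```

With the real data, the call would be `load_range_by_group(energy_loads, 'Overall Height', 'Heating Load')`.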

At the same time, it is clear that surface area by itself is a poor predictor of heating or cooling load: as the surface area increases, there is no corresponding steady increase or decrease in heating or cooling load. Rather, the loads fluctuate as surface area increases. The remaining variables do not appear to explain these fluctuations fully.

It is possible that the relative stability of the heating loads of the lower buildings is explained by the fact that they all have the same roof area (220.5 square meters). Yet there is no corresponding pattern among the taller buildings. Between the configurations with surface areas of 563.5 and 588 square meters, the roof area increases by 24.5 square meters, but the heating load goes down. Between the configurations with surface areas of 612.5 and 637 square meters, the roof area remains constant, but the heating load increases dramatically.

These fluctuations are not explained by wall area either. The wall area increases by the same amount (24.5 square meters) from 588 to 612.5 square meters of surface area and from 612.5 to 637 square meters, but in the former case the heating load goes down, while in the latter it goes up significantly.
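
To see such fluctuations numerically rather than only in a scatter plot, the mean load at each surface area can be tabulated along with its change from the previous surface area. Mixed signs in the change column indicate a non-monotonic relationship. This is a sketch: `mean_load_by_surface_area` is a hypothetical helper, and `toy` contains made-up values.

```python
import pandas as pd

def mean_load_by_surface_area(data_set, area_col='Surface Area', load_col='Heating Load'):
    """Tabulate the mean load per surface area and its change from the previous area."""
    means = data_set.groupby(area_col)[load_col].mean()
    return pd.DataFrame({'mean_load': means, 'change': means.diff()})

# Toy stand-in for energy_loads (made-up rows, for illustration only)
toy = pd.DataFrame({
    'Surface Area': [500.0, 500.0, 600.0, 600.0, 700.0, 700.0],
    'Heating Load': [20.0, 24.0, 18.0, 20.0, 25.0, 27.0],
})

table = mean_load_by_surface_area(toy)
print(table)
```

The first row of `change` is always missing because the smallest surface area has no predecessor.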

### Effects of Categorical Variables

It remains to assess whether any of the categorical variables contributes to the erratic relationship between *Surface Area* and *Heating Load* observed in the scatter plots. In the facet grid below we examine the categorical variables *Orientation*, *Glazing Area*, and *Glazing Area Distribution* in relation to *Surface Area* and *Heating Load*.

In [None]:
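# One panel per (Glazing Area Distribution, Glazing Area) pair, colored by Orientation
g = sns.FacetGrid(energy_loads, col='Glazing Area', 
                  row = 'Glazing Area Distribution', hue = 'Orientation',  
                  palette="Set2", margin_titles=True)
g.map(sns.regplot, 'Surface Area', 'Heating Load', fit_reg = False)
g.add_legend() # Show which color corresponds to which Orientation value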

The facet grid suggests that of the remaining variables, only *Glazing Area* has an effect on *Heating Load*. Configurations where *Glazing Area* = 0.0 have a significantly lower heating load than configurations with larger glazing areas. As *Glazing Area* increases from 0.1 to 0.25 to 0.4, the heating loads steadily increase for each configuration.

This is evident when we color-code the scatter plot of *Surface Area* vs *Heating Load* with values of *Glazing Area*:

In [None]:
# relplot returns a FacetGrid rather than a single Axes, so name it accordingly
grid = sns.relplot(x = 'Surface Area', y = 'Heating Load', 
                   hue = 'Glazing Area', palette = 'Accent',
                   data=energy_loads,
                   height = 8, aspect=1/1)
grid.set(title ='Surface Area vs. Heating Load (Glazing Area by color)', # Main title
         xlabel = 'Surface Area', # Set text for the x axis
         ylabel = 'Heating Load') # Set text for y axis

The color-coded scatter plot reveals that units with *Glazing Area* = 0.0 have consistently lower heating loads. As glazing area increases, so does the heating load. Variation in *Glazing Area* accounts for the vertical alignment of points at each value of *Surface Area*.

On the other hand, *Orientation* and *Glazing Area Distribution* appear to have little measurable effect on heating load.
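
A rough numerical companion to these plots is to compare, for each categorical variable, how far apart its group means of heating load are. A large spread suggests an effect and a spread near zero suggests little effect. This is only a sketch and not a substitute for a proper model: `group_mean_spread` is a hypothetical helper, and `toy` contains made-up values.

```python
import pandas as pd

def group_mean_spread(data_set, group_col, load_col='Heating Load'):
    """Return the gap between the largest and smallest group mean of load_col."""
    means = data_set.groupby(group_col)[load_col].mean()
    return float(means.max() - means.min())

# Toy stand-in for energy_loads (made-up rows, for illustration only)
toy = pd.DataFrame({
    'Glazing Area': [0.0, 0.0, 0.4, 0.4],
    'Orientation': [2, 3, 2, 3],
    'Heating Load': [10.0, 10.0, 20.0, 20.0],
})

glazing_spread = group_mean_spread(toy, 'Glazing Area')
orientation_spread = group_mean_spread(toy, 'Orientation')
print(glazing_spread, orientation_spread)
```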

### Conclusions

Three interesting relationships emerge among the variables:
1. *Relative Compactness* and *Surface Area* are proxies. If we know the value of one, we know the value of the other.
2. Knowing the height of a building in our data set allows us to predict its minimum or maximum heating load: If the building is 7 meters in overall height, its heating load will be above 15 kWh/m². If the building is 3.5 meters in overall height, its heating load will be below 20 kWh/m².
3. *Relative Compactness*, *Surface Area*, *Wall Area*, *Roof Area*, and *Overall Height* are not continuous or independent but have a limited number of values that occur together in 12 configurations apparently determined by the design of the building.