# Exploratory Data Analysis: Energy Consumption For Chicago In 2010

## Life cycle of machine learning project:
1. Understanding the problem
1. Data collection
1. Data checks to perform 
1. Exploratory data analysis
1. Data pre-preocessing
1. Model Training
1. Choose best model

### 1: Understanding the problem
The point of this project is to show how energy consumption is affected by variables such as time of the year, type of building, size of building, number of people, etc.

### 2: Data Collection
Dataset <a href='https://data.cityofchicago.org/Environment-Sustainable-Development/Energy-Usage-2010/8yq3-m6wp/about_data'>source</a>


### 2.1: Import data and required packages

In [1]:
import pandas as pd
import numpy as np
import seaborn as sb
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
from sklearn.impute import SimpleImputer

### Read and print 5 first rows of dataset

In [2]:
df = pd.read_csv('data/chicago_energy_consumption.csv')
df.head()

Unnamed: 0,COMMUNITY AREA NAME,CENSUS BLOCK,BUILDING TYPE,BUILDING_SUBTYPE,KWH JANUARY 2010,KWH FEBRUARY 2010,KWH MARCH 2010,KWH APRIL 2010,KWH MAY 2010,KWH JUNE 2010,...,TOTAL POPULATION,TOTAL UNITS,AVERAGE STORIES,AVERAGE BUILDING AGE,AVERAGE HOUSESIZE,OCCUPIED UNITS,OCCUPIED UNITS PERCENTAGE,RENTER-OCCUPIED HOUSING UNITS,RENTER-OCCUPIED HOUSING PERCENTAGE,OCCUPIED HOUSING UNITS
0,Archer Heights,170315700000000.0,Residential,Multi < 7,,,,,,,...,89.0,24.0,2.0,71.33,3.87,23.0,0.9582,9.0,0.391,23.0
1,Ashburn,170317000000000.0,Residential,Multi 7+,7334.0,7741.0,4214.0,4284.0,2518.0,4273.0,...,112.0,67.0,2.0,41.0,1.81,62.0,0.9254,50.0,0.8059,62.0
2,Auburn Gresham,170317100000000.0,Commercial,Multi < 7,,,,,,,...,102.0,48.0,3.0,86.0,3.0,34.0,0.7082,23.0,0.6759,34.0
3,Austin,170312500000000.0,Commercial,Multi < 7,,,,,,,...,121.0,56.0,2.0,84.0,2.95,41.0,0.7321,32.0,0.78,41.0
4,Austin,170312500000000.0,Commercial,Multi < 7,,,,,,,...,62.0,23.0,2.0,85.0,3.26,19.0,0.8261,11.0,0.579,19.0


### Shape of dataset

In [3]:
print('Columns: ',df.shape[1])
print('Rows: ',df.shape[0])

Columns:  73
Rows:  67051


### 2.2: Dataset information

### Dataset description
For an in-depth look at the dataset click <a href='data_description.md'> here</a>

### List of all columns

In [4]:
df.columns.values

array(['COMMUNITY AREA NAME', 'CENSUS BLOCK', 'BUILDING TYPE',
       'BUILDING_SUBTYPE', 'KWH JANUARY 2010', 'KWH FEBRUARY 2010',
       'KWH MARCH 2010', 'KWH APRIL 2010', 'KWH MAY 2010',
       'KWH JUNE 2010', 'KWH JULY 2010', 'KWH AUGUST 2010',
       'KWH SEPTEMBER 2010', 'KWH OCTOBER 2010', 'KWH NOVEMBER 2010',
       'KWH DECEMBER 2010', 'TOTAL KWH', 'ELECTRICITY ACCOUNTS',
       'ZERO KWH ACCOUNTS', 'THERM JANUARY 2010', 'THERM FEBRUARY 2010',
       'THERM MARCH 2010', 'TERM APRIL 2010', 'THERM MAY 2010',
       'THERM JUNE 2010', 'THERM JULY 2010', 'THERM AUGUST 2010',
       'THERM SEPTEMBER 2010', 'THERM OCTOBER 2010',
       'THERM NOVEMBER 2010', 'THERM DECEMBER 2010', 'TOTAL THERMS',
       'GAS ACCOUNTS', 'KWH TOTAL SQFT', 'THERMS TOTAL SQFT',
       'KWH MEAN 2010', 'KWH STANDARD DEVIATION 2010', 'KWH MINIMUM 2010',
       'KWH 1ST QUARTILE 2010', 'KWH 2ND QUARTILE 2010',
       'KWH 3RD QUARTILE 2010', 'KWH MAXIMUM 2010', 'KWH SQFT MEAN 2010',
       'KWH SQFT STAND

### 3: Data Checks to Perform
- Missing values
- Duplicates
- Data types
- Number of unique values of each column
- Dataset statistics 
- Different categories present in categorical columns

#### Missing Values
 Finding missing values


In [5]:
with pd.option_context('display.max_rows', None, 'display.max_columns', None):  # more options can be specified also
    print(df.isnull().sum())
    print('Total missing values: ', df.isnull().sum().sum())

COMMUNITY AREA NAME                        0
CENSUS BLOCK                              77
BUILDING TYPE                             77
BUILDING_SUBTYPE                          77
KWH JANUARY 2010                         871
KWH FEBRUARY 2010                        871
KWH MARCH 2010                           871
KWH APRIL 2010                           871
KWH MAY 2010                             871
KWH JUNE 2010                            871
KWH JULY 2010                            871
KWH AUGUST 2010                          871
KWH SEPTEMBER 2010                       871
KWH OCTOBER 2010                         871
KWH NOVEMBER 2010                        871
KWH DECEMBER 2010                        871
TOTAL KWH                                871
ELECTRICITY ACCOUNTS                     871
ZERO KWH ACCOUNTS                          0
THERM JANUARY 2010                      2230
THERM FEBRUARY 2010                     4232
THERM MARCH 2010                        1482
TERM APRIL

#### Missing Values
Deleting 871 rows where electric data is not available.

In [6]:
df = df.dropna(subset=['TOTAL KWH'])
print('Total missing values: ', df.isnull().sum().sum())

Total missing values:  100479


#### Missing Values
Using imputation to andle missing values. Numerical values will be handled with the average in each column. Categorical features will be handled with the mode of each column. 

In [7]:
# Creating 2 lists for categrical and numerical columns
cat_col = [i for i in df.columns.values if df[i].dtypes=='O']
num_col = [i for i in df.columns.values if df[i].dtypes=='float64' or df[i].dtypes=='int64']
cat_col, num_col

(['COMMUNITY AREA NAME',
  'BUILDING TYPE',
  'BUILDING_SUBTYPE',
  'ELECTRICITY ACCOUNTS',
  'GAS ACCOUNTS'],
 ['CENSUS BLOCK',
  'KWH JANUARY 2010',
  'KWH FEBRUARY 2010',
  'KWH MARCH 2010',
  'KWH APRIL 2010',
  'KWH MAY 2010',
  'KWH JUNE 2010',
  'KWH JULY 2010',
  'KWH AUGUST 2010',
  'KWH SEPTEMBER 2010',
  'KWH OCTOBER 2010',
  'KWH NOVEMBER 2010',
  'KWH DECEMBER 2010',
  'TOTAL KWH',
  'ZERO KWH ACCOUNTS',
  'THERM JANUARY 2010',
  'THERM FEBRUARY 2010',
  'THERM MARCH 2010',
  'TERM APRIL 2010',
  'THERM MAY 2010',
  'THERM JUNE 2010',
  'THERM JULY 2010',
  'THERM AUGUST 2010',
  'THERM SEPTEMBER 2010',
  'THERM OCTOBER 2010',
  'THERM NOVEMBER 2010',
  'THERM DECEMBER 2010',
  'TOTAL THERMS',
  'KWH TOTAL SQFT',
  'THERMS TOTAL SQFT',
  'KWH MEAN 2010',
  'KWH STANDARD DEVIATION 2010',
  'KWH MINIMUM 2010',
  'KWH 1ST QUARTILE 2010',
  'KWH 2ND QUARTILE 2010',
  'KWH 3RD QUARTILE 2010',
  'KWH MAXIMUM 2010',
  'KWH SQFT MEAN 2010',
  'KWH SQFT STANDARD DEVIATION 2010',
  

'CENCUS BLOCK' should be a categorical data type because it is describing an area, similar to how zipcodes work. Will simply convert this to object data type.

'ELECTRICITY ACCOUNTS' and 'GAS ACCOUNTS' data types are returning as categorical because of the value "Less than 4". For simplcity, will replace "Less than 4" with 3 and then convert to float64. Will also add new boolean column to indicate which rows ave less than 4 accounts


In [8]:
df['GAS ACCOUNTS'].replace('Less than 4', 3, inplace=True)
df['ELECTRICITY ACCOUNTS'].replace('Less than 4', 3, inplace=True)

# Correcting to appropriate data types
df = df.astype({'CENSUS BLOCK':object,'ELECTRICITY ACCOUNTS': 'float64', 'GAS ACCOUNTS': 'float64' })
 
df['GAS ACCOUNTS < 4']= np.where(df['GAS ACCOUNTS'] < 4, True, False)
df['ELECTRIC ACCOUNTS < 4']= np.where(df['ELECTRICITY ACCOUNTS'] < 4, True, False)

In [9]:
# Correcting lists
cat_col = [i for i in df.columns.values if df[i].dtypes=='O']
num_col = [i for i in df.columns.values if df[i].dtypes=='float64' or df[i].dtypes=='int64']
bool_col = [i for i in df.columns.values if df[i].dtypes=='bool']
cat_col, num_col, bool_col

(['COMMUNITY AREA NAME', 'CENSUS BLOCK', 'BUILDING TYPE', 'BUILDING_SUBTYPE'],
 ['KWH JANUARY 2010',
  'KWH FEBRUARY 2010',
  'KWH MARCH 2010',
  'KWH APRIL 2010',
  'KWH MAY 2010',
  'KWH JUNE 2010',
  'KWH JULY 2010',
  'KWH AUGUST 2010',
  'KWH SEPTEMBER 2010',
  'KWH OCTOBER 2010',
  'KWH NOVEMBER 2010',
  'KWH DECEMBER 2010',
  'TOTAL KWH',
  'ELECTRICITY ACCOUNTS',
  'ZERO KWH ACCOUNTS',
  'THERM JANUARY 2010',
  'THERM FEBRUARY 2010',
  'THERM MARCH 2010',
  'TERM APRIL 2010',
  'THERM MAY 2010',
  'THERM JUNE 2010',
  'THERM JULY 2010',
  'THERM AUGUST 2010',
  'THERM SEPTEMBER 2010',
  'THERM OCTOBER 2010',
  'THERM NOVEMBER 2010',
  'THERM DECEMBER 2010',
  'TOTAL THERMS',
  'GAS ACCOUNTS',
  'KWH TOTAL SQFT',
  'THERMS TOTAL SQFT',
  'KWH MEAN 2010',
  'KWH STANDARD DEVIATION 2010',
  'KWH MINIMUM 2010',
  'KWH 1ST QUARTILE 2010',
  'KWH 2ND QUARTILE 2010',
  'KWH 3RD QUARTILE 2010',
  'KWH MAXIMUM 2010',
  'KWH SQFT MEAN 2010',
  'KWH SQFT STANDARD DEVIATION 2010',
  'KWH S

In [10]:
for col in range(len(num_col)):
    imputer = SimpleImputer(strategy='mean')
    df[num_col[col]] = imputer.fit_transform(df[num_col[col]].values.reshape(-1,1))
print('Total missing values: ', df.isnull().sum().sum())

Total missing values:  231


In [13]:
for col in range(len(cat_col)):
    imputer = SimpleImputer(strategy='mode')
    df[cat_col[col]] = imputer.fit_transform(df[cat_col[col]].values.reshape(-1,1))
print('Total missing values: ', df.isnull().sum().sum())

InvalidParameterError: The 'strategy' parameter of SimpleImputer must be a str among {'constant', 'mean', 'most_frequent', 'median'}. Got 'mode' instead.

 ### Checking Duplicates

In [11]:
df.duplicated().sum()

0

There are no duplicates

In [12]:
print(df[df['CENSUS BLOCK'].isnull()]['ELECTRICITY ACCOUNTS'])


258       636.0
2536      114.0
2933      404.0
3215      281.0
4087      727.0
          ...  
60432     360.0
61127     469.0
62227    2074.0
63773    1117.0
66492     904.0
Name: ELECTRICITY ACCOUNTS, Length: 77, dtype: float64
