# Household Electricity Usage Predictor

Household electricity usage depends on factors such as household occupancy, location, and structural features. Per the U.S. Energy Information Administration (EIA), 52% of household energy consumption is used for space heating and air conditioning. Heating and air conditioning usage varies significantly with location, home size and structure, and the equipment and fuels used. Another 25% of household energy usage goes to water heating, lighting, and refrigeration (i.e. year-round energy use). The remaining 23% goes to devices such as televisions, cooking appliances, washers and dryers, and consumer electronics (e.g. computers, smartphones, video game consoles, and streaming devices). The purpose of this project is to develop a predictive model that estimates household electricity usage based on factors such as household occupancy, location, and structural features.

## 1. Data

The Residential Energy Consumption Survey (RECS), first conducted in 1978, uses mail and/or web forms to collect details on the energy characteristics of housing units, usage patterns, and household demographics. Combined with data from energy suppliers (estimates of energy costs and usage for heating, cooling, appliances, etc.), the survey allows the U.S. EIA to develop predictions to better meet future energy demands and to improve energy efficiency and building designs.

The RECS data used in this analysis was published in 2009 and consists of data collected in 2005. Refer to the links below to access the original datasets published by the EIA:
* [Data](https://www.eia.gov/consumption/residential/data/2005/index.php?view=characteristics)
* [Microdata](https://www.eia.gov/consumption/residential/data/2005/index.php?view=microdata)

The 2005 RECS data consists of over 700 columns and 5,000 rows of data, organized in several files. For the purposes of this project, key data on location, home size, and energy usage were extracted from the survey. These data may be found under the following subsections of the [2005 RECS Survey Data](https://www.eia.gov/consumption/residential/data/2015/) website.
* [Household Characteristics](https://www.eia.gov/consumption/residential/data/2005/mdatfiles/RECS05file8.csv)
* [Energy Consumption](https://www.eia.gov/consumption/residential/data/2005/mdatfiles/RECS05file11.csv)

Note that the two supporting spreadsheets contain data on 4,382 households. The reason for the discrepancy between the full dataset and the supporting data files is unknown to the author.

## 2. Data Cleaning 

Data cleaning was completed in the following notebooks: 

* [Data Cleaning Report](https://github.com/mjknights/CapstoneTwoGitHub/blob/main/notebooks/02_data_wrangling_10.19.24.ipynb)
* [Statistical Analysis](https://github.com/mjknights/CapstoneTwoGitHub/blob/main/notebooks/03_Statisitical_Analysis-Copy1.ipynb)
* [Modeling](https://github.com/mjknights/CapstoneTwoGitHub/blob/main/notebooks/05_Modeling.ipynb)

#### Summary of Existing Data

The original dataframe consisted of two merged tables (Household Characteristics and Energy Consumption). These tables contained 4,382 rows, and 32 and 122 columns, respectively. The data related to household energy consumption, appliance usage, cooling/heating, and square footage were extracted from the datasets.
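
The merge described above can be sketched with pandas as follows. This is a minimal illustration on toy data, not the notebook's actual code: it assumes the two files share a household identifier column (named `DOEID` here), and the `TOTSQFT` and `KWH` column names and all values are placeholders.

```python
import pandas as pd

# Toy stand-ins for the two RECS files. The key column name (DOEID) and the
# other column names are assumptions made for this illustration.
characteristics = pd.DataFrame({
    "DOEID": [1, 2, 3],
    "TOTSQFT": [1200, 850, 2300],
})
consumption = pd.DataFrame({
    "DOEID": [1, 2, 3],
    "KWH": [9800, 4100, 15200],
})

# validate="one_to_one" raises an error if a household ID repeats in either table.
merged = characteristics.merge(
    consumption, on="DOEID", how="inner", validate="one_to_one"
)
print(merged.shape)
```

An inner join keeps only households present in both tables, which is one way to make sure every row has both characteristics and consumption data.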

#### Cleaning Data 

The dataframe was examined for duplicates and missing information. Duplicate columns were confirmed to have repeating data, and the replicated information was removed from the dataframe. Missing values were replaced by the median value of the given column, allowing valuable data within the row to remain in the analysis. Note, the median value was chosen over the mean because the median is less sensitive to extreme values, or outlier data. 
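
A minimal sketch of these two cleaning steps, assuming pandas and using made-up column names and values, could look like this:

```python
import numpy as np
import pandas as pd

# Hypothetical example data: "sqft_copy" repeats "sqft", and "rooms" has a gap.
df = pd.DataFrame({
    "sqft": [1000.0, 1500.0, 2000.0, 9000.0],
    "sqft_copy": [1000.0, 1500.0, 2000.0, 9000.0],
    "rooms": [4.0, np.nan, 5.0, 9.0],
})

# Transposing turns columns into rows, so duplicated() flags repeated columns.
is_repeat = df.T.duplicated().to_numpy()
deduped = df.loc[:, ~is_repeat]

# Fill each missing value with the median of its column.
cleaned = deduped.fillna(deduped.median(numeric_only=True))
print(cleaned)
```

In this toy example the median of `rooms` is 5 while the mean is 6, because the single large value (9) pulls the mean upward. This is the sensitivity to extreme values mentioned above.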

The square footage of the households was checked by comparing the total square footage to the square footage of the home components (i.e. attic, basement, air-conditioned area, etc.). 2,108 rows, approximately 48% of the households in the dataset, have conflicting data regarding the air-conditioned square footage. Because the data was collected through homeowner surveys, it is assumed that either the total air-conditioned square footage was not broken down into the square footage of the attic, basement, garage, and/or the rest of the housing unit, or there is an additional area for which square footage was not collected. In other words, several of these features are assumed to be missing from the dataset. Since the usefulness of the missing data is unknown, these rows were not removed from the dataset. Note that the total air-conditioned square footage for these homes is less than or equal to the total square footage of the home.

Several columns contained extraneous information not pertaining to the household data: survey year, questionnaire codes, and final weight. The survey year (2005) was identical for all households. The questionnaire code represented whether the survey was signed/unsigned, returned, etc. The final weight is a statistical adjustment developed by the EIA to apply the findings nationwide. The factor is calculated using the base sampling weight (the inverse of the probability that the household is selected for sampling), a nonresponse adjustment (accounting for bias due to survey nonresponses), and ratio adjustments (ensuring the weight is representative of the population surveyed). This factor was removed to focus the model on the raw data collected through the RECS.

#### Removing Estimates

Several columns within the dataframe are estimates. These include estimates of (1) appliance energy usage and (2) household energy usage. The appliance energy usages were developed using the survey responses to questions regarding the number, size, model/features, energy efficiency rating, and usage of the given appliance. The household energy usage (i.e. the dependent variable within this analysis) was developed based on the energy characteristics of the housing unit, usage patterns, and household demographics. The data collected through the RECS survey is combined with data from energy suppliers to estimate energy costs and usage for heating, cooling, appliances, and other end uses. Because these columns are estimates rather than measurements, they may skew the results of the predictive model. These columns have been removed from the dataframe.

#### Categorical Data

Categorical data, including the census region, census division, four largest states, and home type, were included within the dataframe. The four largest states category denotes whether the household is located within one of the four largest states based on residential energy consumption (in 2005, these states were New York, California, Texas, and Florida). The categorical data was converted into binary columns, each representing a category (i.e. dummy variables). The binary columns allow the data to be included within regression models. Therefore, the results of the analysis will include the statistical relationship between the categorical variables and electricity use.
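
As an illustration of the dummy-variable conversion, the sketch below uses `pandas.get_dummies` on a made-up `home_type` column. The column name, category labels, and values are placeholders, not the RECS codes.

```python
import pandas as pd

# Hypothetical categorical column with three home types.
homes = pd.DataFrame({
    "home_type": ["single_family", "apartment", "mobile_home", "apartment"],
    "kwh": [12000, 4000, 9000, 5000],
})

# Each category becomes its own 0/1 column; the original column is dropped.
encoded = pd.get_dummies(homes, columns=["home_type"], dtype=int)
print(encoded.columns.tolist())
```

Each row has a 1 in exactly one of the new `home_type_*` columns. For linear models, passing `drop_first=True` is a common option to avoid perfectly collinear dummy columns; whether the notebooks did this is not stated here.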

#### Review Distributions 

The distributions of features within the dataframe were examined for obvious outliers and data patterns. The distributions were also examined to ensure the data values looked sensible and there were no obvious errors. The visual analysis identified several columns with wide distributions (Mail Question Codes, and Natural Gas/Fuel/Oil/Kerosene/LPG purchased). Additionally, the square footage and appliance usage graphs showed a small number of extreme values.

The freezer, refrigerator, and water heater usages have outlier values significantly higher than 99.9% of the data. Rows with values above the 99.9th percentile in these columns (i.e. the top 0.1%) were removed from the dataset. These values were removed due to the potential for error throughout the row, and/or because the energy usage within these homes is not representative. This prevents the model from being overly influenced by a few extreme values.
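
A percentile-based filter of this kind might be sketched as follows. The helper name `drop_extreme_rows`, the column names, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def drop_extreme_rows(df, columns, q=0.999):
    """Keep only rows whose value in every listed column is at or below that column's q-th quantile."""
    keep = pd.Series(True, index=df.index)
    for col in columns:
        keep &= df[col] <= df[col].quantile(q)
    return df[keep]

# Synthetic usage data with one implausibly large value planted in each column.
rng = np.random.default_rng(0)
usage = pd.DataFrame({
    "freezer_kwh": rng.normal(500, 50, size=2000),
    "water_heater_kwh": rng.normal(3000, 300, size=2000),
})
usage.loc[0, "freezer_kwh"] = 50_000
usage.loc[1, "water_heater_kwh"] = 300_000

trimmed = drop_extreme_rows(usage, ["freezer_kwh", "water_heater_kwh"])
print(len(usage), len(trimmed))
```

Note that a quantile cutoff always removes the top fraction of rows in each column, whether or not those rows are true errors, so only a small fraction such as 0.1% is trimmed.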

The data distributions of the household square footage columns were compared. These columns summarized the total square footage of each home and provided a breakdown of the square footage of various components of the home (i.e. attic, basement, garage, air-conditioned area, heated area, etc.). The repetitive data may skew the model, potentially leading to overfitting and biased predictions due to reduced generalizability. Therefore, only the total square footage was used in the analysis.

#### Final Dataframe 

The final dataframe used in the analysis consists of 4,370 rows and 37 columns.

## 3. Analysis 

The statistical analysis and data processing were completed in the following notebooks:

* [Statistical Analysis](https://github.com/mjknights/CapstoneTwoGitHub/blob/main/notebooks/03_Statisitical_Analysis-Copy1.ipynb)
* [Pre-Processing & Training Data](http://localhost:8890/notebooks/Downloads/USF/CapstoneTwoGitHub-main/CapstoneTwoGitHub-main/notebooks/04_Pre-processing&TrainingData.ipynb?)

#### Scale Standardization

Many machine learning algorithms are sensitive to the scale of their input features and work best when the features are centered and comparably scaled. Therefore, data standardization was used to center and scale all values within the dataframe. The method independently centered and scaled each feature using the mean and standard deviation of its column, bringing all data within the dataframe to the same scale. The features in the resulting dataframe are centered around zero (i.e. the mean is equal to zero) and have a variance of one (i.e. the standard deviation is equal to one). Additionally, scale standardization was used to ensure the feature inputs contribute equally to the model, reducing biased predictions.
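
The standardization formula is `z = (x - mean) / std`, applied column by column. The sketch below shows it with NumPy on made-up values; scikit-learn's `StandardScaler` performs the same calculation, though which tool the notebook used is not assumed here.

```python
import numpy as np

# Hypothetical features on very different scales: square footage and occupants.
X = np.array([
    [1000.0, 2.0],
    [1500.0, 3.0],
    [2000.0, 4.0],
    [3500.0, 7.0],
])

mean = X.mean(axis=0)
std = X.std(axis=0)  # population standard deviation (ddof=0)
X_scaled = (X - mean) / std

print(X_scaled.mean(axis=0))  # approximately 0 for every column
print(X_scaled.std(axis=0))   # approximately 1 for every column
```

After scaling, both columns have the same spread, so neither dominates a model simply because its raw numbers are larger.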

#### Feature Correlation Heatmap 

The scaled data is reduced to two-dimensional data using the PCA transformation to aid in visualization and data comparison. The resulting data was used to create a heatmap showing a high-level view of the relationships between the features. The correlation matrix, represented using color gradients, displays the strength and direction of each of these correlations. A value of positive one represents a perfect positive correlation, whereas a value of negative one represents a perfect negative correlation. A value of zero indicates no linear correlation between the features.
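
As a small illustration of how such a correlation matrix is computed, the sketch below uses `DataFrame.corr()` (Pearson correlation by default) on made-up values. The column names are placeholders, and the plotting step is omitted.

```python
import pandas as pd

# Hypothetical values: usage rises with square footage and falls with heating days.
df = pd.DataFrame({
    "kwh": [4000, 6000, 9000, 12000, 15000],
    "sqft": [800, 1200, 1500, 2200, 2600],
    "heating_days": [7000, 6500, 5000, 3000, 2500],
})

corr = df.corr()  # pairwise Pearson correlation, values between -1 and 1
print(corr.round(2))
```

The resulting square matrix is what a heatmap colors in. Each feature has a correlation of one with itself, so the diagonal is always one.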

The following conclusions were drawn from the analysis. The heatmap shows relatively strong positive correlations (0.21 to 0.37) between electricity usage and households located in the South Atlantic census region, number of cool days, square footage, and single-family homes. In contrast, the heatmap shows relatively strong negative correlations (-0.24 to -0.19) between electricity usage and apartment buildings with more than 5 units, households located in the Northeast census region and California, and number of heat days.

#### Split Data

The data was partitioned into training and testing sets. The model learns from the training dataset, while the testing dataset is used as an independent assessment of the model (i.e. validation). This means the overall model selection process is fitted to one specific dataset. The model is then cross-validated as a final check on expected future performance.

Following common practice, the dataframe was split 75% to 25%, training to testing. This allows the model to learn underlying patterns and relationships from the training data, then be evaluated on data not observed during training.
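
A 75/25 split can be sketched with NumPy as shown below. The helper `split_train_test` and the seed are hypothetical; scikit-learn's `train_test_split` with `test_size=0.25` is a common ready-made alternative.

```python
import numpy as np

def split_train_test(X, y, test_fraction=0.25, seed=42):
    """Shuffle row indices once, then slice them into test and training parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_test = int(round(len(X) * test_fraction))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Toy data: 20 rows with two features each.
X = np.arange(40, dtype=float).reshape(20, 2)
y = np.arange(20, dtype=float)

X_train, X_test, y_train, y_test = split_train_test(X, y)
print(len(X_train), len(X_test))
```

Shuffling before slicing matters when the rows are ordered (for example by region), and fixing the seed keeps the split reproducible between runs.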

## 4. Modeling

#### Linear Regression Model - Multiple Linear Regression

A multiple linear regression model is used to predict a continuous dependent variable (electricity usage) based on multiple independent variables, assuming a linear relationship (i.e. the dependent variable changes proportionally with the independent variables) while minimizing the average squared error.

The R-Squared coefficient, or coefficient of determination, is the proportion of the variation in the dependent variable that is predicted from the independent variables (i.e. a measure of how successfully the model predicts variations in the data away from the mean). An R-Squared value of 1 means a perfect model, which explains 100 percent of the variation, whereas an R-Squared value of zero indicates the model cannot explain the variation of the data.
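
The definition above corresponds to `R^2 = 1 - SS_res / SS_tot`, where `SS_res` is the sum of squared prediction errors and `SS_tot` is the sum of squared deviations from the mean. A minimal sketch with made-up numbers (the helper name `r_squared` is illustrative):

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return 1.0 - ss_res / ss_tot

y = [10.0, 12.0, 14.0, 16.0]
perfect = r_squared(y, y)                 # predictions match exactly
mean_only = r_squared(y, [13.0] * 4)      # always predicting the mean of y
print(perfect, mean_only)
```

A model that always predicts the mean scores zero, which is why R-Squared is read as the improvement over that baseline.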

The linear regression model yielded an R-Squared value of approximately 0.36 (or 36%).

The predicted values (y-axis) from the model were plotted against the actual results (x-axis). There is a visible positive correlation, but the predictions are clearly imperfect, which is consistent with the low R-Squared value.

![LinearRegression](\\images\\LinearRegression.jpeg)

#### Regression Model - Ordinary Least Squares

Ordinary least squares is a type of linear regression model. The model is used to predict a continuous dependent variable (electricity usage) based on multiple independent variables, assuming a linear relationship (i.e. the dependent variable changes proportionally with the independent variables) while minimizing the sum of squared differences between the observed values of the dependent variable and the values predicted by the model.
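
To make the least-squares idea concrete, the sketch below fits coefficients with `numpy.linalg.lstsq` on synthetic data with known coefficients. It illustrates the technique only; it is not the notebook's code, and the data and coefficients are invented.

```python
import numpy as np

# Synthetic data generated from y = 3 + 2*x1 - 1*x2 plus a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Prepend a column of ones so the first coefficient acts as the intercept.
A = np.column_stack([np.ones(len(X)), X])

# lstsq finds the coefficients that minimize the sum of squared residuals.
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
print(coef.round(2))
```

The recovered coefficients land close to the values used to generate the data, which is what minimizing the squared residuals is expected to achieve when the relationship really is linear.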

The R-Squared coefficient was 0.36 (or 36%), which is the same as the previous model. This is because the two models use the same algorithm and the same data.

The probability value (or p-value) is the probability of obtaining results at least as extreme as the observed data if the null hypothesis of a statistical test is true (here, the null hypothesis is that a feature has no relationship with electricity usage). A p-value of 0.05 or lower is generally considered statistically significant (i.e. the null hypothesis is rejected). Therefore, the p-value was calculated for each feature, and features with p-values greater than 0.05 were removed from the analysis.

The model was run again, and the R-squared value decreased to approximately 0.33 (or 33%). R-squared may have decreased due to (i) overfitting in the previous model or (ii) model structure. The features removed from the dataframe may have been correlated (or collinear) with other features, or a removed feature may have been absorbing variance of the dependent variable.

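With R-squared values of 0.33 to 0.36, the linear regression models explain only about a third of the variation in electricity usage and are not a good representation of the data.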

![OLS](\\images\\OLS.png)

#### Decision Tree - Entropy Model 

Decision trees are supervised learning algorithms used for classification and regression. The goal of the model is to predict the value of a target variable (electricity usage) by learning simple decision rules inferred from the data features (independent variables), creating a piecewise approximation. Entropy, or the measure of uncertainty, is used within a decision tree to determine the features to split at each node. Note, the goal is to minimize the entropy after each split. The resulting model reduces entropy, or uncertainty, by dividing a dataset based on feature values and increasing information gain. 
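
Entropy and information gain can be computed directly from class labels, as in the sketch below. The "high"/"low" usage labels are hypothetical and only illustrate the calculation.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy in bits: -sum(p * log2(p)) over the class proportions."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy of the parent node minus the size-weighted entropy of its children."""
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(parent)
    return entropy(parent) - weighted

parent = ["high", "high", "low", "low"]
clean_split = information_gain(parent, ["high", "high"], ["low", "low"])
useless_split = information_gain(parent, ["high", "low"], ["high", "low"])
print(entropy(parent), clean_split, useless_split)
```

A split that separates the classes completely removes all of the parent's uncertainty, while a split that leaves both children as mixed as the parent gains nothing. At each node the tree picks the split with the largest gain.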

The accuracy score is the ratio of correct predictions (true positives and true negatives) to all predictions (i.e. (True Positives + True Negatives) / (Total Predictions)). An accuracy score of one represents 100% correct classification, whereas a score of zero represents 100% incorrect classification. The accuracy score of the model is approximately 0.71 (71%).

The balanced accuracy score is the average of the recall obtained on each class, which makes it a useful check when the classes are imbalanced (skewed). The balanced accuracy score is approximately 0.14 (14%). This low score may indicate the model is not effectively capturing patterns within the data, potentially due to overfitting the training dataset.

The precision score of the model is the ratio of true positives to the sum of true positives and false positives, or the ability of the model to avoid labeling a negative sample as positive. A precision score of one means every positive prediction is correct, whereas a score of zero means none are correct. The precision scores for true positives and true negatives are approximately 0.51 (51%).

The recall score of the model is the ratio of true positives to the sum of true positives and false negatives, or the ability of the model to find all of the positive samples. A recall score of one means every positive sample was found, whereas a score of zero means none were found. The recall score for true positives and true negatives is approximately 0.71 (71%).
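
The four scores described in this section can be computed from the counts of true/false positives and negatives. The sketch below does this for a two-class case with made-up labels; the helper `classification_scores` is hypothetical and is not taken from the notebooks.

```python
def classification_scores(y_true, y_pred, positive="yes"):
    """Accuracy, balanced accuracy, precision, and recall for a two-class problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)

    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # recall of the negative class
    return {
        "accuracy": (tp + tn) / len(pairs),
        "balanced_accuracy": (recall + specificity) / 2,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": recall,
    }

# Imbalanced toy example: 8 "no", 2 "yes", and a model that always answers "no".
y_true = ["no"] * 8 + ["yes"] * 2
y_pred = ["no"] * 10
scores = classification_scores(y_true, y_pred)
print(scores)
```

In this toy case accuracy looks reasonable (0.8) even though the model never finds a positive sample, while balanced accuracy falls to 0.5. This kind of gap between accuracy and balanced accuracy is what class imbalance tends to produce.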

#### Random Forest Model 

A random forest model is an ensemble of decision trees. The Random Forest algorithm introduces additional randomness when growing trees; instead of searching for the very best feature when splitting a node, it searches for the best feature among a random subset of features. This results in a greater tree diversity, which trades a higher bias for a lower variance, generally yielding an overall better model.

The results of the model are summarized below: 
- Accuracy: 70.2%
- Precision Score for Yes: 63.6%
- Precision Score for No: 63.6%
- Recall Score for Yes: 70.2%
- Recall Score for No: 70.2%

## 5. Results  

Based on the accuracy, precision score, and recall score, the random forest model is the best representation of the data.

## 6. Future Improvements 

- The 2020 RECS study, published in January 2024, was released as the 15th iteration of the survey. This dataset was published after the start of this project. In the future, I would use the most recent survey data. This would better reflect the electricity usage of each household by accounting for updated technology such as advancements in construction and insulation, energy-efficient appliances, integration of smart technology, etc.
- The 2005 RECS survey includes over 700 columns of data. However, only 37 columns were included in the final model. In a future rendition, I would include more features. This may improve the accuracy of the model as well as provide more insight into correlations between electricity usage and the given variables.
- The dataset used in this model includes annual wood consumption in thousands of BTU. This alternative energy source should have been excluded from the analysis. This would have been consistent with the removal of other alternative energy sources such as natural gas, fuel oil, kerosene, and LPG.
- The dataset includes information on whether the households are within the four largest states category. This does not add to the analysis and should be removed in future models.