# Model Building Part 1

As carbon emissions remain one of the biggest drivers of climate change, methods to predict greenhouse gas (GHG) emissions become a useful asset in the planning of civil structures. By researching which aspects of a building's infrastructure correlate most with its greenhouse gas emissions, it becomes more feasible to implement more energy-efficient systems or to discourage carbon-polluting features.

Our team's main objective is to determine which features of a building influence its pollution level or, specifically, its greenhouse gas emission intensity (total GHG emissions / total area). Additionally, we aim to predict a building's GHG emission intensity given certain attributes. To accomplish this, our team implemented three models: decision tree, naive Bayes classification, and quantile regression. Given the results we received, we then chose to test a neural network in hopes of better results. Our decision to test models from both a classification and a regression lens stems from our desire to discover which type of prediction would yield better results. The raw dataset stores each building's GHG emission intensity as a numerical value, so to convert this response variable into a categorical one, we employ DBSCAN clustering to form four clusters (classes) and assign each data point to one of them.

### Decision Tree

[Jupyter Notebook of Decision Tree](https://github.com/libbyqstephan/DATA422_Fall2024_Team5/blob/c8221c9ea4e2275d76d2ea0575b73efd2b6cfa11/Model%20Building%20Part%201/decisionTreeUpdated.ipynb)

Decision trees are a classic and simple classification model that is often overlooked in favor of more intricate and sophisticated models. However, a decision tree may well be sufficient for our purposes, given that our decision tree from the exploratory phase yielded an accuracy of 74%. That model used only a select handful of variables, so in this new tree we compare those results with a model that uses all of our data. Unfortunately, this increase in dimension appears to have added too much noise for the decision tree to handle: the accuracy of the new tree is 53%. While this is still better than selecting one of the four classes uniformly at random (25% accuracy), it is a very poor performance. Previously, the root node was SiteEUI(kBtu/sf), but now it is BuildingType_Multifamily LR (1-4). This variable did not exist beforehand; it is one of the new variables added via encoding. The new root node suggests that whether a building is a low-rise apartment is the biggest indicator of how much GHG emissions the building will produce. With the drastic decrease in accuracy, this model tells us that less is more when it comes to classifying GHG emission intensity: the additional variables likely added severe noise to the set, resulting in an inaccurate model.

### Naive Bayes Classification

[Jupyter Notebook of Naive Bayes](https://github.com/libbyqstephan/DATA422_Fall2024_Team5/blob/c8221c9ea4e2275d76d2ea0575b73efd2b6cfa11/Model%20Building%20Part%201/NaiveBayes.ipynb)

In contrast to decision trees, naive Bayes takes a probability-based modeling approach, so we believed it was worth investigating whether such a model would yield more accurate results. We tested two types of naive Bayes models: multinomial and complement. In particular, we chose complement naive Bayes for its handling of imbalanced classes; the distribution of our classes falls heavily into the first category. Both the multinomial and the complement model gave an accuracy of 49%. The underperformance of these models may be due to naive Bayes's assumption that features are independent, which is likely not the case based on the covariance graph.

#### Multinomial Naive Bayes Accuracy: 0.4933
Multinomial Classification Report:<br>

| Class | Precision | Recall | F1-Score | Support |
|-------|-----------|--------|----------|---------|
| 0     | 0.59      | 0.75   | 0.66     | 399     |
| 1     | 0.27      | 0.40   | 0.32     | 81      |
| 2     | 0.29      | 0.02   | 0.03     | 121     |
| 3     | 0.30      | 0.22   | 0.26     | 143     |
| **Accuracy**   |           |        |          | **0.49** (Total: 744) |
| **Macro Avg**  | 0.36      | 0.35   | 0.32     | 744     |
| **Weighted Avg** | 0.45   | 0.49   | 0.44     | 744     |

Confusion Matrix

|         | Predicted: 0 | Predicted: 1 | Predicted: 2 | Predicted: 3 |
|---------|---------------|--------------|--------------|--------------|
| Actual: 0 | 301           | 42           | 4            | 52           |
| Actual: 1 | 46            | 32           | 1            | 2            |
| Actual: 2 | 71            | 29           | 2            | 19           |
| Actual: 3 | 94            | 17           | 0            | 32           |

#### Complement Naive Bayes Accuracy: 0.4933
Complement Classification Report:<br>

| Class | Precision | Recall | F1-Score | Support |
|-------|-----------|--------|----------|---------|
| 0     | 0.58      | 0.79   | 0.67     | 399     |
| 1     | 0.00      | 0.00   | 0.00     | 81      |
| 2     | 0.24      | 0.23   | 0.24     | 121     |
| 3     | 0.29      | 0.16   | 0.21     | 143     |
| **Accuracy**   |           |        |          | **0.49** (Total: 744) |
| **Macro Avg**  | 0.28      | 0.30   | 0.28     | 744     |
| **Weighted Avg** | 0.40   | 0.49   | 0.44     | 744     |

Confusion Matrix

|         | Predicted: 0 | Predicted: 1 | Predicted: 2 | Predicted: 3 |
|---------|---------------|--------------|--------------|--------------|
| Actual: 0 | 316           | 0            | 43           | 40           |
| Actual: 1 | 49            | 0            | 31           | 1            |
| Actual: 2 | 77            | 0            | 28           | 16           |
| Actual: 3 | 106           | 0            | 14           | 23           |

Examining the classification reports, the precision value tells us what fraction of the model's predictions for a given class are correct, and the recall value tells us what fraction of the actual members of a class the model finds. Both models perform fairly well at predicting class 0 (low GHG emission intensity) but struggle to accurately assign the other classes. This is supported by the confusion matrices, which show that the models correctly assign 301 and 316, respectively, of the 399 class 0 buildings, but also assign most buildings from the other classes to class 0. The complement model also predicts many cases of class 1 to be class 2 instead (31 of 81) and never predicts class 1 at all. Finally, the recall values for both models show that they miss many instances of classes 1, 2, and 3.
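As a minimal sketch of how the report values relate to the confusion matrix, the snippet below recomputes per-class precision, recall, and overall accuracy from the multinomial matrix shown above. The helper names `per_class_metrics` and `accuracy` are hypothetical and written for this illustration; they are not taken from the notebook.

```python
# Multinomial naive Bayes confusion matrix from the report above
# (rows = actual class, columns = predicted class).
confusion = [
    [301, 42, 4, 52],
    [46, 32, 1, 2],
    [71, 29, 2, 19],
    [94, 17, 0, 32],
]


def per_class_metrics(matrix):
    """Return (precision, recall) lists with one entry per class."""
    n = len(matrix)
    precision, recall = [], []
    for k in range(n):
        true_positive = matrix[k][k]
        predicted_k = sum(matrix[row][k] for row in range(n))  # column sum
        actual_k = sum(matrix[k])                              # row sum
        # Guard against a class that is never predicted (column sum of 0).
        precision.append(true_positive / predicted_k if predicted_k else 0.0)
        recall.append(true_positive / actual_k if actual_k else 0.0)
    return precision, recall


def accuracy(matrix):
    """Correct predictions (the diagonal) divided by all predictions."""
    correct = sum(matrix[k][k] for k in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


precision, recall = per_class_metrics(confusion)
for k, (p, r) in enumerate(zip(precision, recall)):
    print(f"class {k}: precision={p:.2f} recall={r:.2f}")
print(f"accuracy={accuracy(confusion):.4f}")
```

Rounded to two decimals, these values match the multinomial classification report. For context, class 0 makes up 399 of the 744 test buildings (about 54%), so always predicting class 0 would score slightly higher on accuracy than either naive Bayes model.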

Overall, multinomial naive Bayes classification outperforms the complement model, but only slightly, based on the confusion matrices and classification reports: the two share the same accuracy, while the multinomial model has the higher macro-averaged F1-score (0.32 versus 0.28). Unfortunately, this indicates that naive Bayes is likely not useful for predicting emission intensity classes.

### Quantile Regression

[Jupyter Notebook of Quantile Regression](https://github.com/libbyqstephan/DATA422_Fall2024_Team5/blob/c8221c9ea4e2275d76d2ea0575b73efd2b6cfa11/Model%20Building%20Part%201/QuantileRegression.ipynb)

For our regression model, we chose quantile regression for its robustness to outliers, which our data set certainly contains. Rather than focusing solely on the mean of the distribution of GHG intensity, quantile regression models chosen quantiles of that distribution. This means that our model considers which factors influence data points at the 25th percentile, the 50th percentile, and so on. The variables used for this model are 'SiteEUI(kBtu/sf)', 'SourceEUI(kBtu/sf)', and 'YearBuilt'.

To assess the performance of the quantile regression, we employ mean pinball loss. Pinball loss is a loss function that applies an asymmetric penalty, which is helpful for determining the accuracy of the quantile predictions. Mean pinball loss is the average of the pinball loss over all data points. Underestimation is penalized by the quantile value, and overestimation is penalized by 1 minus the quantile value. So, for the 0.05 quantile, each unit of underestimation is penalized by only 0.05, while each unit of overestimation is penalized by 1 - 0.05 = 0.95. This gives the model a tendency to overpredict the emission intensity for high quantiles like 0.95 and to underpredict it for low quantiles like 0.05, which partially explains the jump in pinball loss from the quantiles in the interquartile range (0.25, 0.5, and 0.75) to the quantiles at either end. Also contributing to this is the fact that the outliers lie in the tails of the distribution, near the 0.05 or 0.95 quantile, so the loss there is bound to be higher. The loss values for the quantiles in the interquartile range are fairly good, at roughly 0.1, but this is still not a great level of accuracy, as the mean value of GHGEmissionsIntensity is 1.1289.
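The asymmetric penalty described above can be sketched in a few lines of plain Python. The function names and the example values are hypothetical and chosen only to illustrate the 0.05 versus 0.95 weighting; the notebook may compute the loss differently, for example with a library routine.

```python
def pinball_loss(y_true, y_pred, quantile):
    """Pinball loss for one prediction at the given quantile (0 < quantile < 1)."""
    error = y_true - y_pred
    if error >= 0:
        # Underestimation: the prediction is below the actual value.
        return quantile * error
    # Overestimation: the prediction is above the actual value.
    return (1 - quantile) * (-error)


def mean_pinball_loss(y_true, y_pred, quantile):
    """Average the pinball loss over all data points."""
    losses = [pinball_loss(t, p, quantile) for t, p in zip(y_true, y_pred)]
    return sum(losses) / len(losses)


# Hypothetical values, chosen only to show the asymmetry at the 0.05 quantile.
actual = 1.0
under = pinball_loss(actual, 0.0, 0.05)  # prediction one unit too low
over = pinball_loss(actual, 2.0, 0.05)   # prediction one unit too high
print(under, over)
```

With an error of one unit in each direction, the overestimate costs nineteen times as much as the underestimate at the 0.05 quantile, which is why a model fitted to that quantile leans toward low predictions.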

Training Pinball Losses:<br>
Quantile 0.05: 0.51768<br>
Quantile 0.25: 0.11619<br>
Quantile 0.5: 0.11082<br>
Quantile 0.75: 0.11424<br>
Quantile 0.95: 0.36396<br>

Test Pinball Losses:<br>
Quantile 0.05: 0.47994<br>
Quantile 0.25: 0.10342<br>
Quantile 0.5: 0.09820<br>
Quantile 0.75: 0.10208<br>
Quantile 0.95: 0.35855<br>

Overall, quantile regression performs fairly well, with the 0.5 quantile performing best in terms of mean pinball loss. This is encouraging, as it means the model is best at capturing the central tendency of the data. In this sense, the model is a good option for our goal of predicting the emission intensity of a building. On the other hand, the two tail quantiles, 0.05 and 0.95, perform much worse than the others and are not very strong estimates. As mentioned earlier, this can be partially explained by the existence of outliers and the nature of the penalty that pinball loss uses, but it also indicates that the regression is quite limited when it comes to capturing very low and very high values. As one of the tangential goals of our project is to learn what can be done to minimize a building's emission intensity, this is concerning. If we cannot capture these values well, the model does not provide a good explanation for the best- and worst-performing buildings, which are exactly the buildings we want to focus on: identifying what the low-emission buildings have in common, and what the high-emission buildings have in common, is how we can detect the factors that make a building good at handling emissions. Because of this, it would be beneficial to look into other models that can better predict these important tail values while maintaining a good level of accuracy on the central values.

### Neural Network

[Jupyter Notebook of Neural Network](https://github.com/libbyqstephan/DATA422_Fall2024_Team5/blob/c8221c9ea4e2275d76d2ea0575b73efd2b6cfa11/Model%20Building%20Part%201/NeuralNetwork.ipynb)

Due to the lackluster results of the naive Bayes models, we decided to test a neural network to compare results. Neural networks are extremely popular, and for good reason: they can handle complex, high-dimensional data with nonlinear relationships, all while automatically learning the features or relationships that may be present in the data. The main downside of these models is that their inner workings are frequently a black box. That is, users cannot easily tell which decisions are made in the prediction process or why. As a result, this model would not be particularly useful if we would like to know which features influence GHG emission intensity; however, if accurate, a neural network may be an excellent choice for simply predicting how well a building will perform.

The two important metrics for evaluating this model are the loss and the root mean squared error (RMSE). The loss tells us how well the model is predicting values (predicted versus actual). In this case, we chose mean squared error (MSE), which is the average squared difference between the predicted and actual values. The RMSE evaluates the performance of the regression model; it is the square root of the average squared difference between the predicted and actual values.

The two are closely related, RMSE being the square root of MSE, but they serve different functions in the training process. The loss (MSE) is minimized by the optimizer during training, while RMSE is used afterwards to evaluate the model. RMSE expresses the typical magnitude of the errors in the same unit as the target, while MSE, by squaring each error, weights large errors more heavily.
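The relationship between the two metrics can be illustrated with a small pure-Python sketch. The function names and the intensity values below are hypothetical and are not taken from the notebook; the two prediction sets have the same total absolute error, but one concentrates it in a single large miss.

```python
import math


def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sum(squared) / len(squared)


def root_mean_squared_error(y_true, y_pred):
    """Square root of the MSE, expressed in the same unit as the target."""
    return math.sqrt(mean_squared_error(y_true, y_pred))


# Hypothetical intensities, for illustration only.
actual = [1.0, 1.0, 1.0, 1.0]
spread_out = [2.0, 2.0, 2.0, 2.0]    # four errors of 1 unit each
one_big_miss = [1.0, 1.0, 1.0, 5.0]  # a single error of 4 units

for name, predicted in [("spread out", spread_out), ("one big miss", one_big_miss)]:
    mse = mean_squared_error(actual, predicted)
    rmse = root_mean_squared_error(actual, predicted)
    print(f"{name}: MSE={mse} RMSE={rmse}")
```

Both prediction sets are off by 4 units in total, yet the single large miss quadruples the MSE and doubles the RMSE, which is the sense in which squaring emphasizes large errors.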

In this case, the neural network does not perform as well as hoped. The best-performing model had an RMSE of just under 3, which means that its predictions for emission intensity are typically off by about 3 units. Considering that the standard deviation of the GHG emission intensity is 2.11, the model's errors are larger than the typical spread of the data itself, so its estimates are not as accurate as hoped.
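One way to read the comparison with the standard deviation: a trivial baseline that always predicts the mean of a set of values has an RMSE equal to the population standard deviation of those values. The sketch below checks this on hypothetical intensity values, not on the project's data. It assumes the reported standard deviation and the RMSE are measured on comparable data, which the write-up does not state.

```python
import math
import statistics


def rmse(y_true, y_pred):
    """Root mean squared error between actual and predicted values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical intensity values, for illustration only.
intensity = [0.2, 0.5, 0.9, 1.1, 1.4, 6.0]

# Baseline: predict the mean for every building.
mean_value = statistics.fmean(intensity)
baseline_predictions = [mean_value] * len(intensity)

baseline_rmse = rmse(intensity, baseline_predictions)
spread = statistics.pstdev(intensity)  # population standard deviation
print(baseline_rmse, spread)
```

Under that assumption, an RMSE above the standard deviation would mean the network is doing no better than this mean-only baseline.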

However, this model could be further refined using hyperparameter tuning in order to get the training and validation loss to converge better. Currently, the high validation loss suggests that the model is underfitting. Lowering the validation loss would mean more accurate predictions on the validation set.

### Conclusion

With quantile regression being the only model to show decent accuracy, the models we have created tell us several things about our next steps. The first is that our most accurate classification model may be a simple decision tree with a handful of variables rather than all of them. This is evident in the large decrease in accuracy between the decision tree made in the exploratory phase and the one made here. The second consideration is the high number of extreme outliers in our data set. One sound explanation for the poor results of our models is the sheer volume of outliers and the degree to which these points stray from the rest. There are a couple of solutions to this barrier. The first is to remove this chunk of data points to at least reduce the number of extreme values. How strictly we define an outlier may vary depending on how much of our data we would need to truncate. The other option is to narrow our scope to residential buildings only (e.g. apartments). This is because most of the outliers in this set are non-residential properties; the two highest extrema are laboratories for the University of Washington. If we select this approach, we would sacrifice some generality in the model. That is, we would only be able to predict GHG emission intensity for residential buildings. However, this may be a worthwhile limitation if we obtain a model that performs significantly better.