**If you lost points on the last checkpoint you can get them back by responding to TA/IA feedback**  

Update the relevant sections where you lost those points, and make sure you respond on GitHub Issues to your TA/IA to call their attention to the changes you made here.

Please update your Timeline... no battle plan survives contact with the enemy, so make sure we understand how your plans have changed.

# COGS 108 - Data Checkpoint

# Names

- Kim Lim
- Ronnie Volman
- Milton Iwama
- Saul Sanchez
- Owen Connor

# Research Question

Can we predict the average monthly household energy consumption in San Diego County based on time of year, weather patterns and geographical location?

## Background and Prior Work

Energy consumption in households is influenced by a variety of factors including but not limited to seasonal changes, weather patterns and regional behaviors. San Diego, in particular, is home to a surprising variety of microclimates, each offering a distinct weather experience despite their proximity to one another<sup><a href="#cite_ref1">[1]</a></sup>. Unlike regions with clearly defined seasons, San Diego experiences subtle but varied weather shifts, such as dry heat in one area and humidity in another during summer, or cool winters without snow. These mild yet diverse conditions may lead to unique, localized trends in energy use.

Despite the significance of these patterns, most research on energy consumption has focused on larger metropolitan areas or regions with more extreme weather conditions. Alaska, for instance, experiences extreme cold and has two metropolitan areas. According to the U.S. Energy Information Administration, Alaska had the highest per capita total primary energy consumption of any state in 2022, at about 987 MMBtu (million British thermal units) per person, as well as the highest per capita transportation energy use<sup><a href="./#cite_ref2">[2]</a></sup>. In contrast, temperate coastal cities like San Diego remain understudied, presenting an opportunity to examine how subtle weather fluctuations affect energy demand.

Past research, such as a study published in Frontiers in Energy Research (2022), has shown that household energy consumption can vary significantly based on appliance usage and individual behavior, emphasizing the need for accurate energy forecasting at the local level<sup><a href="./#cite_ref3">[3]</a></sup>. To predict energy demand more accurately, researchers have used models such as time series regression, which uses past energy demand to identify patterns, and random forests, which can handle complex data and intercorrelated features. Other approaches combine ARIMA (a statistical time series model) with LSTM neural networks (a type of machine learning model)<sup><a href="./#cite_ref4">[4]</a></sup>.

Understanding these localized patterns in energy consumption, especially in regions like San Diego, can empower San Diego residents to make more cost-effective and sustainable energy decisions. Energy use typically peaks at certain times, when utilities must rely on the most expensive and resource-intensive electricity sources to meet demand<sup><a href="./#cite_ref5">[5]</a></sup>. By predicting when energy demand will peak due to subtle weather changes, residents can shift their usage to off-peak hours, saving money and easing pressure on the electrical grid. On a larger scale, this helps lower reliance on high-emission energy sources, contributing to a more resilient, low-carbon infrastructure. As climate conditions become less predictable and energy needs continue to rise, exploring how microclimate-driven behaviors affect consumption becomes a relevant and essential study.

<p id="cite_ref1"><sup>[1]</sup> Amber Coakley. “Experience All of San Diego County’s Unique Microclimates in One Day.” <i>FOX 5 San Diego & KUSI News</i>, 2 Mar. 2025. <a href="https://fox5sandiego.com/weather/experience-all-of-san-diego-countys-unique-microclimates-in-one-day/">https://fox5sandiego.com/weather/experience-all-of-san-diego-countys-unique-microclimates-in-one-day/</a></p>    
<p id="cite_ref2"><sup>[2]</sup> “Frequently Asked Questions (FAQs) - U.S. Energy Information Administration (EIA).” Www.eia.gov, <a href="https://www.eia.gov/tools/faqs/faq.php?id=85&t=1">https://www.eia.gov/tools/faqs/faq.php?id=85&t=1</a></p>
<p id="cite_ref3"><sup>[3]</sup> K. Qian, C. Zhou, M. Allan, and Y. Yuan. “Modeling of load profiles in residential energy consumption: A case study”, <a href="https://doi.org/10.3389/fenrg.2022.1113733">https://doi.org/10.3389/fenrg.2022.1113733</a></p>
<p id="cite_ref4"><sup>[4]</sup> M. Zhelezniak, L. Ivanchenko, and R. Fedorenko. “Hybrid forecasting of national electricity demand combining ARIMA and LSTM approaches”, <a href="https://arxiv.org/abs/2304.05174">https://arxiv.org/abs/2304.05174</a></p>
<p id="cite_ref5"><sup>[5]</sup> “Why Does It Matter What Time of Day You Use Power?” Efficiencyvermont.com, 2020, <a href="https://www.efficiencyvermont.com/blog/our-insights/why-does-it-matter-what-time-of-day-you-use-power">https://www.efficiencyvermont.com/blog/our-insights/why-does-it-matter-what-time-of-day-you-use-power</a></p>

# Hypothesis


Average household energy consumption in San Diego County can be accurately predicted using time of year, weather variables, and general location, with higher consumption expected during hotter months due to increased cooling needs. We expect more energy consumption in hotter months because, in addition to cooling living spaces, households also use energy to cool electronics that could otherwise overheat.

## Data overview

This study will analyze variables such as zip codes, monthly average electricity consumption, average temperature, humidity, precipitation, date, month, and year. Data will be aggregated at the zip code level across San Diego County. The analysis will focus on the period from 2021 to the present, excluding data from 2020 due to the pandemic’s impact on energy consumption behaviors. Since people were largely confined to their homes in 2020, electricity usage patterns were atypical and could skew the results. Similarly, we will exclude pre-pandemic data, as energy consumption habits may have been significantly different prior to 2020. Beginning the analysis in 2021 allows us to establish a more stable and representative baseline that reflects post-pandemic adjustments. With that said, the following are the datasets we will be using for our analysis.

- SDGE dataset
  - Link to the dataset: https://energydata.sdge.com/
  - Number of observations: 1719
  - Number of variables: 8
- Weather dataset
  - Link to the dataset: https://open-meteo.com/en/docs/historical-forecast-api
  - Number of observations: 5865
  - Number of variables: 15

The SDG&E dataset provides total and average monthly gas and electricity consumption aggregated by ZIP code. The original data was available as quarterly files, so we merged all datasets from 2021 to the present. In total, 34 files were downloaded. To keep the repository organized, we uploaded a cleaned and consolidated version of the dataset. The original data included four customer classes, but we retained only the residential class, as the others were not the focus of our interest.
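
The consolidation described above could look roughly like the following sketch. It is not the exact cleaning script: the helper name `combine_quarterly_files` is hypothetical, the in-memory CSV strings stand in for the downloaded quarterly files, and `'R'` is assumed to mark the residential class (as it appears in the cleaned SDG&E table in this notebook).

```python
import io
import pandas as pd

def combine_quarterly_files(files, customer_class="R"):
    """Concatenate quarterly SDG&E extracts and keep one customer class.

    `files` is any iterable of paths or file-like objects readable by
    pd.read_csv. Hypothetical helper for illustration only.
    """
    frames = [pd.read_csv(f) for f in files]
    combined = pd.concat(frames, ignore_index=True)
    # Keep only rows for the requested class (assumed 'R' = residential).
    residential = combined[combined["CustomerClass"] == customer_class]
    return residential.reset_index(drop=True)

# Two tiny stand-ins for quarterly files (values are illustrative only).
q1 = io.StringIO(
    "ZipCode,Month,Year,CustomerClass,TotalkWh\n"
    "91901,1,2021,R,100\n"
    "91901,1,2021,C,900\n"
)
q2 = io.StringIO(
    "ZipCode,Month,Year,CustomerClass,TotalkWh\n"
    "91901,4,2021,R,120\n"
)

sdge_residential = combine_quarterly_files([q1, q2])
print(sdge_residential)
```

With real files, the list passed to the helper would be the paths of the downloaded quarterly CSVs instead of the `StringIO` objects.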

The weather dataset, on the other hand, contains daily records of weather variables such as average temperature, precipitation, and daylight hours, also broken down by ZIP code. We will aggregate the weather data to a monthly level to align it with the SDG&E dataset. The two datasets will then be merged using ZIP code, month, and year as keys. Any records with missing weather or energy consumption data will be excluded from the analysis to maintain data integrity.
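
As a minimal sketch of the aggregation and merge described above, the example below uses small made-up tables. The column names `date` and `temperature_mean` are assumptions for illustration, and taking the monthly mean is only one possible choice (totals such as precipitation might instead be summed).

```python
import pandas as pd

# Illustrative daily weather records; column names are assumptions.
daily = pd.DataFrame({
    "zipcode": [91901, 91901, 91901, 91902],
    "date": pd.to_datetime(
        ["2021-01-01", "2021-01-02", "2021-02-01", "2021-01-01"]
    ),
    "temperature_mean": [10.0, 12.0, 14.0, 15.0],
})

# Derive the merge keys from the date, then average within each month.
daily["year"] = daily["date"].dt.year
daily["month"] = daily["date"].dt.month
monthly_weather = daily.groupby(
    ["zipcode", "year", "month"], as_index=False
)["temperature_mean"].mean()

# Illustrative monthly energy records, one of them with a missing value.
energy = pd.DataFrame({
    "zipcode": [91901, 91901, 91902],
    "year": [2021, 2021, 2021],
    "month": [1, 2, 1],
    "averagekwh": [669.0, 491.0, None],
})

# Inner merge on the shared keys, then drop rows missing either measure.
merged = monthly_weather.merge(
    energy, on=["zipcode", "year", "month"], how="inner"
)
merged = merged.dropna(
    subset=["temperature_mean", "averagekwh"]
).reset_index(drop=True)
print(merged)
```

The inner merge already discards ZIP code and month combinations that appear in only one table; the `dropna` call handles rows that matched but still carry a missing value.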

### Weather Dataset

In [None]:
%pip install pandas

In [12]:
import pandas as pd

weather_df = pd.read_csv('./datasets/comp_weather_data.csv')
weather_df.head()

Unnamed: 0,location_id,year,month,temperature_2m_mean (°C),apparent_temperature_mean (°C),daylight_duration (s),sunshine_duration (s),precipitation_sum (mm),rain_sum (mm),snowfall_sum (cm),precipitation_hours (h),wind_speed_10m_max (km/h),wind_gusts_10m_max (km/h),wind_direction_10m_dominant (°),zipcode
0,0,2021,1,11.419355,8.406452,36998.017742,30892.716452,4.516129,3.690323,0.578065,3.451613,16.306452,38.741935,135.129032,91901.0
1,0,2021,2,11.9,9.507143,39717.795714,32507.791786,0.489286,0.489286,0.0,1.535714,14.685714,34.310714,180.464286,91901.0
2,0,2021,3,10.596774,8.16129,43200.236452,35699.556129,5.919355,5.877419,0.029355,4.870968,17.564516,41.729032,228.935484,91901.0
3,0,2021,4,15.103333,13.406667,46836.152333,39640.753333,0.686667,0.686667,0.0,2.4,15.546667,38.2,270.2,91901.0
4,0,2021,5,16.345161,15.548387,49880.583548,43571.595484,0.212903,0.212903,0.0,1.483871,15.429032,38.822581,276.0,91901.0


### SDGE Dataset

In [13]:
sdge_df = pd.read_csv('./datasets/merged_sdge_cleaned.csv')
sdge_df.head()

Unnamed: 0,ZipCode,Month,Year,CustomerClass,Combined,TotalCustomers,TotalkWh,AveragekWh,energy_type,TotalTherms,AverageTherms
0,91901,1,2021,R,Y,8377,5606431.0,669.0,electricity,0.0,0.0
1,91902,1,2021,R,N,6213,3277281.0,527.0,electricity,0.0,0.0
2,91905,1,2021,R,Y,301,221271.0,735.0,electricity,0.0,0.0
3,91906,1,2021,R,Y,2024,1634976.0,808.0,electricity,0.0,0.0
4,91910,1,2021,R,Y,25984,10347161.0,398.0,electricity,0.0,0.0


### Weather and SDGE dataset combined

In [21]:
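# Lowercase the SDGE column names so the merge keys match the weather data
sdge_df.columns = sdge_df.columns.str.lower()

# Cast ZIP codes to int in both frames (the weather file stores them as floats)
weather_df['zipcode'] = weather_df['zipcode'].astype(int)
sdge_df['zipcode'] = sdge_df['zipcode'].astype(int)

# Keep only ZIP code / month / year combinations present in both datasets
sdge_weather_df = pd.merge(weather_df, sdge_df, on=['zipcode', 'month', 'year'], how='inner')
# Drop columns that are redundant or not needed for the analysis
sdge_weather_df = sdge_weather_df.drop(columns=['location_id','customerclass', 'combined'])
sdge_weather_df.head()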

Unnamed: 0,year,month,temperature_2m_mean (°C),apparent_temperature_mean (°C),daylight_duration (s),sunshine_duration (s),precipitation_sum (mm),rain_sum (mm),snowfall_sum (cm),precipitation_hours (h),wind_speed_10m_max (km/h),wind_gusts_10m_max (km/h),wind_direction_10m_dominant (°),zipcode,totalcustomers,totalkwh,averagekwh,energy_type,totaltherms,averagetherms
0,2021,1,11.419355,8.406452,36998.017742,30892.716452,4.516129,3.690323,0.578065,3.451613,16.306452,38.741935,135.129032,91901,8377,5606431.0,669.0,electricity,0.0,0.0
1,2021,1,11.419355,8.406452,36998.017742,30892.716452,4.516129,3.690323,0.578065,3.451613,16.306452,38.741935,135.129032,91901,2255,0.0,0.0,gas,129525.0,57.0
2,2021,2,11.9,9.507143,39717.795714,32507.791786,0.489286,0.489286,0.0,1.535714,14.685714,34.310714,180.464286,91901,8308,4081593.0,491.0,electricity,0.0,0.0
3,2021,2,11.9,9.507143,39717.795714,32507.791786,0.489286,0.489286,0.0,1.535714,14.685714,34.310714,180.464286,91901,2230,0.0,0.0,gas,98057.0,44.0
4,2021,3,10.596774,8.16129,43200.236452,35699.556129,5.919355,5.877419,0.029355,4.870968,17.564516,41.729032,228.935484,91901,8238,4250283.0,516.0,electricity,0.0,0.0


The columns location_id, customerclass, and combined were dropped from the dataset because they were either redundant or not relevant to the analysis:
- location_id: An internal identifier that duplicated information already represented by ZIP code, offering no additional analytical value.
- customerclass: Since we filtered the dataset to include only the Residential class, this column became redundant, containing a single constant value.
- combined: This column contained categorical values such as 'Y' or 'N', but its purpose was not clearly documented. As it was not essential for our current analysis and lacked interpretable meaning, we chose to drop it to avoid introducing ambiguity.
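
One way to back up the redundancy claims above before dropping columns is sketched below. The helper names `is_constant` and `is_one_to_one` are hypothetical, and the small `sample` frame is made up for illustration; the same checks could be run on the merged frame before the `drop` call.

```python
import pandas as pd

def is_constant(df, column):
    """Return True if the column holds at most one distinct non-null value."""
    return df[column].nunique(dropna=True) <= 1

def is_one_to_one(df, left, right):
    """Return True if each `left` value pairs with exactly one `right` value and vice versa."""
    pairs = df[[left, right]].drop_duplicates()
    return bool(pairs[left].is_unique and pairs[right].is_unique)

# Made-up rows mirroring the three dropped columns plus the ZIP code.
sample = pd.DataFrame({
    "location_id": [0, 0, 1, 1],
    "zipcode": [91901, 91901, 91902, 91902],
    "customerclass": ["R", "R", "R", "R"],
    "combined": ["Y", "N", "Y", "Y"],
})

# customerclass should be constant after filtering to residential rows.
print(is_constant(sample, "customerclass"))
# location_id should map one-to-one onto zipcode if it is truly redundant.
print(is_one_to_one(sample, "location_id", "zipcode"))
# combined varies, so it is dropped for lack of documentation, not redundancy.
print(is_constant(sample, "combined"))
```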

# Ethics & Privacy

There may be several potential biases and privacy concerns associated with the proposed energy consumption data. The data might contain personally identifiable information (PII), which would require careful handling to protect individual privacy. Additionally, there could be potential biases related to how the data was collected and who is represented within it. For instance, certain neighborhoods, particularly higher-income areas with smart meters, solar panels, or better internet access, may be overrepresented, while lower-income communities without advanced infrastructure could be underrepresented. Demographic biases may also be embedded in the energy usage patterns due to historic housing segregation, affecting the equity of any resulting analysis. To address these concerns, steps will be taken to detect biases throughout the project lifecycle, including before, during, and after analysis, and especially when communicating findings. Specific strategies include focusing on transparency by clearly stating data sources, preprocessing methods, and known limitations, as well as working with aggregated or anonymized data whenever possible to minimize privacy risks and ensure more equitable outcomes.

Failure to rigorously protect individual privacy in this study could increase concerns about the misuse of residents' energy data and potentially erode trust in both energy providers and research institutions. Even anonymized data, if mishandled or combined with other sources, could inadvertently lead to the re-identification of households or to discriminatory practices. Furthermore, biases inherent in the data collection, potentially over-representing certain geographical areas or demographics within San Diego County, could result in prediction models that reinforce existing socioeconomic disparities. This could lead to inequitable resource allocation, the development of ineffective or even discriminatory energy policies, and the potential for social stigma against certain communities. Ultimately, we recognize that the value of our study on predicting average household energy consumption in San Diego County hinges not only on its accuracy but also on its ethical grounding. Therefore, we will proactively integrate considerations of these societal implications throughout our research process, from data acquisition and analysis to the dissemination of findings.


# Team Expectations 

* Communication is through Discord. The expected response time is 24 hours.
* We meet during discussion sections, with Discord meetings when needed.
* Assigned tasks need to be done by our internal due date (before the actual due date). If others depend on your task, you have an obligation to complete it ahead of time so everyone has enough time to complete their own tasks. Failure will be documented, which can result in a lower grade. If the issue persists, you’ll be reported to the TA, who can decide what to do.
* Tasks will be listed in GitHub Issues.
* Decisions will be made by majority vote. If an individual fails to respond within 12 hours, their vote is null. For urgent issues, a 6-hour window will be set. If a quick decision with a hard deadline (such as a submission to Canvas/GitHub) has to be made, the individual must respond 3 hours before the deadline so there is time to complete the task.
* People struggling to complete their tasks must contact the group right away so the work does not fall behind. It is better to ask for help sooner rather than later.
* Tasks will be assigned to the person who is best at them, but everyone will do a bit of everything. This is a GROUP project. We should be working together to help each other out and finish tasks accordingly.

# Project Timeline Proposal

| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 04/28 | 3 PM | When to meet schedule, Think about the research question  | Determine expectations for group, decide on what topic to do, discuss hypothesis, do background research | 
| 04/30 | 4 PM | Edit and finalize project proposal | Search for dataset, tidy and clean dataset | 
| 05/02 | 4 PM | Dataset should be tidy and clean |   Data exploration, discuss what features to use, assign tasks to members |
| 05/07 | 4 PM | NA | Progress check in |
| 05/09 | 4 PM | Data exploration should be complete | Data preprocessing, discuss techniques to properly complete this task |
| 05/14 | 4 PM | NA | Progress check in |
| 05/16 | 4 PM | Data preprocessing should be complete | Explore different machine learning models, discuss which one is best |
| 05/21 | 4 PM | Model tuned and finalized | Review the whole project, discuss if changes or improvements need to be made |
| 05/23 | 4 PM | Edits should be done | Review again |
| 05/28 | 4 PM | Finalize project and turn in | NA |