This project provides recommendations for analyzing temperature data recorded at specific hive locations throughout the country. While recommendations were the primary objective, I also wanted to dive in and confirm that I could actually build the models I've suggested. The write-up covers three parts:
- Data cleaning
- Modeling
- Analysis
Data cleaning involved importing a CSV file and scraping additional climate data from NOAA and Weather Underground based on the zip codes provided. (Note: the BeautifulSoup Python package was used to parse the text scraped from these sites.) It was assumed that hive data was recorded either at a single location within a zip code or at multiple locations spread across a single zip code. The graph below reflects all of the different hives recording data throughout the year-long test period.
The data was fairly messy because most date/time/zip combinations had between 2 and 500 temperature values associated with them. This was addressed by taking the mean of all values sharing the same date, time, and zip code, because the main goal was to establish baseline hive values. Once NaN columns were eliminated and the additional weather data was merged in, preliminary modeling could begin.
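The cleaning steps described above can be sketched with pandas. This is a minimal illustration rather than the project's actual code: the column names (`date`, `time`, `zip`, `temperature`, `notes`, `outside_temp`) and the sample values are hypothetical, and it assumes that "NaN columns" means columns containing no values at all.

```python
import numpy as np
import pandas as pd

# Hypothetical raw hive readings: several values per date/time/zip combination.
raw = pd.DataFrame({
    "date": ["2017-01-01", "2017-01-01", "2017-01-01", "2017-01-02"],
    "time": ["08:00", "08:00", "08:00", "08:00"],
    "zip": ["97201", "97201", "97201", "97201"],
    "temperature": [34.0, 35.0, 36.0, 33.0],
    "notes": [np.nan, np.nan, np.nan, np.nan],  # an entirely empty column
})

# Drop columns that contain only NaN (assumption: this is what was eliminated).
cleaned = raw.dropna(axis=1, how="all")

# Collapse duplicate readings to one mean value per date/time/zip combination.
hive = cleaned.groupby(["date", "time", "zip"], as_index=False)["temperature"].mean()

# Hypothetical daily outside weather, keyed by date and zip code.
weather = pd.DataFrame({
    "date": ["2017-01-01", "2017-01-02"],
    "zip": ["97201", "97201"],
    "outside_temp": [40.0, 38.0],
})

# A left merge keeps every hive reading even if weather data is missing for it.
merged = hive.merge(weather, on=["date", "zip"], how="left")
print(merged)
```

A left merge is used here so that gaps in the scraped weather data show up as NaN instead of silently dropping hive readings.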
Autoregressive integrated moving average (ARIMA) models and Long Short-Term Memory (LSTM) networks were used to perform time-series analysis on the data.
The ARIMA model had a test RMSE of 0.1138. The multivariate LSTM network, on the other hand, had a test RMSE of 10.752.
The model parameters have not been tuned yet, so these numbers will likely improve in the future.
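For reference, the RMSE metric used to compare the models can be computed as follows. This is a small standard-library sketch; the `rmse` helper and the sample values are illustrative and are not taken from the project's code or data.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one value is required")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Illustrative values only, not results from the hive data.
actual = [34.0, 35.0, 36.0]
predicted = [34.0, 37.0, 34.0]
print(rmse(actual, predicted))
```

RMSE is expressed in the units of the target variable, so scores from different models are only directly comparable when both are evaluated on the same scale.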
The univariate ARIMA model currently performs best of the three models based on RMSE. Future directions would likely include:
- Regression analysis to fill in missing time series data
- Categorical variables on hive health
- Pulling outside weather information minute by minute rather than day by day
- Fine-tuning model parameters