# Will it be an early Spring?

On February 2<sup>nd</sup> every year, Punxsutawney Phil predicts whether there will be an early Spring or whether Winter will continue for 6 more weeks (until about mid-March). He is, however, not very accurate (well, according to [The Inner Circle](https://www.groundhog.org/inner-circle) he is 100% correct, but the human handler may not interpret his response correctly). The overall goal is to predict whether it will be an early Spring.

For this project you must go through most steps in the checklist. You must write a response for every item, although sometimes the response will simply be "does not apply". Some parts are more nebulous; for those, simply show that you have done the work in general (the order doesn't really matter). Keep your progress and thoughts organized in this document and use formatting as appropriate (using markdown to add headers and sub-headers for each major part). Do not do the final part (launching the product). Your presentation will be written in a dedicated section of this document, with no slides or anything like that. It should, however, include the best summary plots/graphics/data points.

You are intentionally given very little information thus far. You must communicate with your client (me) for additional information as necessary. Make sure your communications are efficient, thought out, and not redundant, as your client might get frustrated and "fire" you (this only applies to getting information from your client; it does not necessarily apply to asking for help with the actual project itself, and you should continuously ask questions to get help).

You must submit all data files and a pickled final model along with this notebook.
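
A minimal sketch of the pickling step, using only the standard library. The `final_model` dict and the `final_model.pkl` file name are placeholders chosen for illustration; in the notebook the object would be whatever fitted model the project ends up with.

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in for a fitted model (hypothetical); in the notebook this would be
# the trained estimator object rather than a plain dict.
final_model = {"kind": "threshold_rule", "feature": "feb_mean_temp", "threshold": 30.0}

# A temporary directory is used here; the real file would sit next to the notebook.
model_path = Path(tempfile.mkdtemp()) / "final_model.pkl"

with open(model_path, "wb") as f:
    pickle.dump(final_model, f)

# Reload the file to confirm it round-trips before submitting it.
with open(model_path, "rb") as f:
    restored_model = pickle.load(f)
```

Reloading the file right after writing it is a cheap check that the submitted pickle can actually be opened.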

The group with the best results on the 10% of the data that I kept for myself will earn +5 extra credit (if multiple groups are close, points may be given to multiple groups).

Questions For Client 
====================
- Can you provide more context on how these predictions were made? For example, are these predictions based solely on Punxsutawney Phil, or do they also consider meteorological data?

- How accurate are these predictions historically when compared to actual meteorological outcomes?

- Are there additional datasets you would recommend or provide that could help enhance our analysis, such as meteorological data for each of these years?

- Are there any benchmarks or previous studies on this topic that we should compare our results against?


Questions For Professor 
=======================
- What specific performance metrics are most important for evaluating our predictive model in this case?

- Beyond the model and the documentation, are there specific forms of visualizations or types of analysis you expect in our final report?

- Would you recommend any specific statistical tests or analytical methods to understand the impact of various factors on Phil's predictions?

- Do you have any preferred data sources? Or are there constraints around data collection?

- About how many years of historical data should we analyze to make these predictions?

- What kind of performance improvements are expected in comparison to Punxsutawney Phil’s historical accuracy?

- Should we focus on any particular region (e.g., Pennsylvania, where Phil makes predictions), or should we generalize for wider use?




Frame the Problem and Look at the Big Picture
=====================================

1. **Define the objective in business terms:** The ultimate goal is to improve upon Punxsutawney Phil’s prediction accuracy, leading to more reliable planning for businesses and individuals affected by seasonal changes (e.g., agriculture, retail, event planning). This can reduce costs and increase efficiency in sectors where weather predictions directly impact operations.
2. **How will your solution be used?** The solution would likely be used by stakeholders whose plans depend on when Winter ends, such as those in agriculture, retail, and event planning.
4. **How should you frame this problem?** As a binary classification problem where the goal is to predict one of two possible outcomes: early Spring or continued Winter.
5. **How should performance be measured? Is the performance measure aligned with the business objective?** Performance should be measured using accuracy, precision, recall, and F1-score for the binary classification model. Accuracy would give us a general sense of how well the model is performing, but precision and recall would help us understand the balance between predicting early Spring and extended Winter. If one class is more frequent than the other, focusing only on accuracy might lead to skewed results.
6. **What would be the minimum performance needed to reach the business objective?** A minimum acceptable performance level might be set around 60% accuracy, which would represent a significant improvement provided Phil's historical accuracy is confirmed to be well below that. However, achieving an accuracy closer to 70-80% would provide more confidence and a stronger business case for users to adopt the model.
7. **What are comparable problems? Can you reuse (personal or readily available) experience or tools?** Comparable problems include weather forecasting models that predict seasonal shifts from historical data and meteorological patterns. Existing tools that can be reused include Python libraries such as scikit-learn for machine learning, pandas for data analysis, and matplotlib/seaborn for visualization.
8. **Is human expertise available?** Yes, meteorology experts could help identify the most relevant weather patterns to consider, though we do not expect to consult them for this project.
9. **How would you solve the problem manually?** 
    This is essentially an almanac, which has already been done.
    1. Collect historical weather data and Punxsutawney Phil’s past predictions.
    2. Identify the key variables that might influence the seasonal shift (e.g., temperature, precipitation, snow cover, atmospheric pressure).
    3. Analyze the patterns in the data and look for correlations between weather conditions in February and the onset of Spring.
    4. Make our own predictions based on the observed trends and attempt to manually classify whether it will be an early Spring or extended Winter.
10. **List the assumptions you (or others) have made so far. Verify assumptions if possible.** 
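
The performance measures named in the checklist above (accuracy, precision, recall, F1) can be sketched in plain Python as follows. The labels are made up for illustration (1 = early Spring, 0 = six more weeks of Winter) and are not real outcomes; the helper name `binary_metrics` is hypothetical. The example also shows why accuracy alone can mislead when one class is more frequent: a baseline that always predicts Winter scores well on accuracy but never finds an early Spring.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or never occurs.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels for illustration only: 7 Winter years, 3 early-Spring years.
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]

# Baseline that always predicts "more Winter".
always_winter = [0] * len(y_true)

# Hypothetical model output with one false positive and one false negative.
some_model = [0, 0, 0, 0, 0, 1, 0, 1, 1, 0]

baseline = binary_metrics(y_true, always_winter)
model = binary_metrics(y_true, some_model)

print("baseline:", baseline)
print("model:   ", model)
```

If scikit-learn is used for the project, its `sklearn.metrics` module provides the same measures; the hand-written version is only meant to make the definitions explicit.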

Get the Data
============
- List the data you need and how much you need

- Find and document where you can get that data

- Get access authorizations

- Create a workspace (with enough storage space)

- Get the data

- Convert the data to a format you can easily manipulate (without changing the data itself)

- Ensure sensitive information is deleted or protected (e.g. anonymized)

- Check the size and type of data (time series, geographical, ...)

- Sample a test set, put it aside, and never look at it (no data snooping!)
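
One way to set a test set aside is a stratified split, which keeps the proportion of early-Spring and Winter years roughly the same in both parts. The sketch below uses only the standard library; the row layout, the `early_spring` column name, the 20% test fraction, and the helper `stratified_split` are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_split(rows, label_key, test_fraction=0.2, seed=42):
    """Split rows into (train, test), sampling each label group separately."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label][:]
        rng.shuffle(group)
        # Take at least one row per label so every class appears in the test set.
        n_test = max(1, round(len(group) * test_fraction))
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


# Made-up rows for illustration only: 20 years, 5 of them labelled early Spring.
rows = [{"year": 2000 + i, "early_spring": 1 if i % 4 == 0 else 0} for i in range(20)]

train_rows, test_rows = stratified_split(rows, "early_spring")
print(len(train_rows), "training rows,", len(test_rows), "test rows")
```

Because the data is yearly, a chronological split (holding out the most recent years) may be a reasonable alternative; which one fits best depends on how the client's held-out 10% was chosen.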







Explore the Data
================


1. Copy the data for exploration, downsampling to a manageable size if necessary.
2. Study each attribute and its characteristics: Name; Type (categorical, numerical, bounded, text, structured, ...); % of missing values; Noisiness and type of noise (stochastic, outliers, rounding errors, ...); Usefulness for the task; Type of distribution (Gaussian, uniform, logarithmic, ...)
3. For supervised learning tasks, identify the target attribute(s)
4. Visualize the data
5. Study the correlations between attributes
6. Study how you would solve the problem manually
7. Identify the promising transformations you may want to apply
8. Identify extra data that would be useful (go back to “Get the Data”)
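
Two of the exploration steps above, the percentage of missing values per attribute and the correlation between an attribute and the target, can be sketched as follows. The column names `feb_mean_temp` and `early_spring` and all values are made up for illustration, and `None` is assumed to mark a missing entry.

```python
import math


def missing_fraction(values):
    """Fraction of entries that are missing (represented here as None)."""
    return sum(v is None for v in values) / len(values)


def pearson(xs, ys):
    """Pearson correlation over the pairs where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for a constant column
    return cov / math.sqrt(var_x * var_y)


# Made-up exploration copy for illustration only.
feb_mean_temp = [28.0, 31.5, None, 35.0, 26.0, 33.0]  # hypothetical attribute
early_spring = [0, 1, 0, 1, 0, 1]                      # hypothetical target

pct_missing = 100 * missing_fraction(feb_mean_temp)
r = pearson(feb_mean_temp, early_spring)
print(f"feb_mean_temp: {pct_missing:.1f}% missing, correlation with target = {r:.2f}")
```

With pandas, `DataFrame.isna().mean()` and `DataFrame.corr()` cover the same ground for a whole table at once; the plain-Python version is shown only to make each step explicit.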
