# "Winter" Data Workout: Exploratory Analysis and Modeling

In this workout, choose one of two "Winter" datasets:

* Penguins
* Winter Olympics

## Dataset Overview: Palmer Penguins

![Two Penguins](../images/stock/pexels-pixabay-48814.jpg)

The penguins dataset contains size measurements for three penguin species observed in the Palmer Archipelago, Antarctica. It was collected by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER.

### Location
* __Seaborn__: The dataset may be accessed via `sns.load_dataset("penguins")`

### __Key Features (Columns)__:
* __species__: The penguin species (Adélie, Chinstrap, or Gentoo).

* __island__: The island in the Palmer Archipelago where the penguin was observed (Biscoe, Dream, or Torgersen).

* __bill_length_mm__: The length of the penguin's bill (also known as the culmen) in millimeters.

* __bill_depth_mm__: The depth of the penguin's bill in millimeters.

* __flipper_length_mm__: The length of the penguin's flipper in millimeters.

* __body_mass_g__: The body mass of the penguin in grams.

* __sex__: The sex of the penguin (male or female).
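
As a minimal sketch of getting started, the snippet below wraps the `sns.load_dataset("penguins")` call and checks that the columns listed above are present. The helper names `load_penguins` and `missing_columns` are hypothetical, introduced only for illustration. The first call to `sns.load_dataset` downloads the file, so it needs network access.

```python
import pandas as pd
import seaborn as sns

# Column names as listed in the "Key Features" section above.
EXPECTED_COLUMNS = [
    "species", "island", "bill_length_mm", "bill_depth_mm",
    "flipper_length_mm", "body_mass_g", "sex",
]


def missing_columns(df: pd.DataFrame) -> list[str]:
    """Return the expected columns that are absent from df (hypothetical helper)."""
    return [col for col in EXPECTED_COLUMNS if col not in df.columns]


def load_penguins() -> pd.DataFrame:
    """Load the dataset through seaborn (hypothetical wrapper; downloads on first use)."""
    return sns.load_dataset("penguins")


# In a notebook you might then run:
# penguins = load_penguins()
# print(missing_columns(penguins))  # an empty list means every column was found
# penguins.head()
```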

### __Why it’s great for this workout__:
* __Classification__: Can you predict the species based on physical measurements?

* __Regression__: Can you predict body mass using flipper length?

## Dataset Overview: Olympics

![Winter Olympics](../images/stock/pexels-kimmo-vainio-314143536-36080375.jpg)

Source: [Keith Galli's Olympics Dataset](https://github.com/KeithGalli/Olympics-Dataset)

This is a comprehensive historical record of Olympic athletes and their performances. Because it is split across two files, you will need to merge them on the `athlete_id` column.

### Folder Location
* `../data/winter_olympics`

### __Key Data Files__:
* `bios.csv`: Biographical details for over 70,000 athletes.

* `results.csv`: A row-by-row breakdown of every event performance (over 160,000 rows).

### __Key Features (`bios.csv`)__:
* __athlete_id__: Unique athlete identifier.
  
* __name__: The athlete's full name.

* __height_cm__: In centimeters (expect many missing values!).

* __weight_kg__: In kilograms (expect many missing values!).

* __NOC__: The 3-letter country code (e.g., USA, NOR, GER).

### __Key Features (`results.csv`)__:
* __year__: The year of the Olympic Games (1896–2022).

* __type__: The season (you should filter for "Winter").

* __discipline__: The sport (e.g., Alpine Skiing, Ice Hockey).

* __event__: The specific competition (e.g., Men's 500 metres).
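
The filter-then-merge step described above might look like the sketch below. It uses two tiny made-up frames in place of the real files so that it runs anywhere. With the real data you would presumably read `bios.csv` and `results.csv` from `../data/winter_olympics` with `pd.read_csv`, assuming the files sit directly in that folder. The athlete names and event labels here are placeholders, not real records.

```python
import pandas as pd

# Made-up stand-ins for bios.csv and results.csv (placeholder values only).
bios = pd.DataFrame({
    "athlete_id": [1, 2, 3],
    "name": ["Athlete A", "Athlete B", "Athlete C"],
    "height_cm": [180.0, None, 165.0],
    "weight_kg": [75.0, None, 58.0],
    "NOC": ["NOR", "USA", "GER"],
})
results = pd.DataFrame({
    "athlete_id": [1, 1, 2, 3],
    "year": [2018, 2022, 1924, 2016],
    "type": ["Winter", "Winter", "Winter", "Summer"],
    "discipline": ["Alpine Skiing", "Alpine Skiing", "Ice Hockey", "Athletics"],
    "event": ["Event W1", "Event W1", "Event W2", "Event S1"],
})

# Keep only Winter Games rows before merging, so the merge handles less data.
winter = results[results["type"] == "Winter"]

# Many result rows map to one bio row; validate= raises an error if
# athlete_id is unexpectedly duplicated in bios.
merged = winter.merge(bios, on="athlete_id", how="left", validate="many_to_one")

print(merged[["athlete_id", "name", "year", "discipline", "height_cm"]])
```

A left merge keeps every Winter result even when the matching bio has missing height or weight, which leaves the drop-versus-fill decision to a later cleaning step.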


### __Why it’s great for this workout__:
* __Classification__: Can you predict if an athlete will win a medal based on their age, height, and weight?

* __Regression__: Can you predict an athlete's height or weight based on the sport they compete in?

* __Data Wrangling__: This dataset is "messy." Filtering for the Winter season and deciding how to handle missing physical stats for athletes from the early 20th century will be a unique challenge.

## Goals Checklist

![Curling](../images/stock/pexels-shvets-production-7544435.jpg)

### 1. __Data Foundations__
*  __Load the Data__: Successfully import the dataset (penguins or winter olympics).
*  __Initial Exploration__: Use `.info()`, `.describe()`, and `.head()` or `.sample()` to understand your features.
*  __Data Cleaning__: Handle missing values (drop vs. fill) and fix any data type inconsistencies.
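
The exploration and cleaning steps above can be sketched as follows. The frame is a small made-up example with penguin-style column names, not the real data, and the median fill and `"Unknown"` placeholder are just one possible choice, not a recommendation.

```python
import numpy as np
import pandas as pd

# Made-up toy frame with two typical problems: missing values and
# a numeric column stored as text.
df = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap", "Gentoo"],
    "flipper_length_mm": [181.0, np.nan, 195.0, 217.0],
    "body_mass_g": ["3750", "5000", "3800", "5400"],
    "sex": ["Male", None, "Female", "Male"],
})

# Initial exploration.
df.info()
print(df.describe(include="all"))
print(df.head())
print(df.isna().sum())  # missing values per column

# Fix the data type: text -> numbers (unparseable entries become NaN).
df["body_mass_g"] = pd.to_numeric(df["body_mass_g"], errors="coerce")

# Option 1: drop every row that has any missing value.
dropped = df.dropna()

# Option 2: fill instead, median for numbers and a placeholder for categories.
filled = df.copy()
median_flipper = filled["flipper_length_mm"].median()
filled["flipper_length_mm"] = filled["flipper_length_mm"].fillna(median_flipper)
filled["sex"] = filled["sex"].fillna("Unknown")
```

Dropping is simpler but loses rows. Filling keeps them at the cost of adding values that were never measured.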

### 2. __Visual Analysis__
*  __Feature Distributions__: Create histograms for all numerical columns.
*  __Correlation Matrix__: Generate a heatmap to identify relationships between variables.
*  __Custom Insights__: Create at least 3 additional unique visualizations.
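
A possible starting point for the histograms and the correlation heatmap is sketched below. It runs on synthetic data generated with NumPy (an assumption made so the example is self-contained), so swap in your own cleaned DataFrame.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in data: body mass is built to rise with flipper length.
rng = np.random.default_rng(0)
n = 100
flipper = rng.normal(200, 14, n)
df = pd.DataFrame({
    "flipper_length_mm": flipper,
    "body_mass_g": 50 * flipper - 5800 + rng.normal(0, 300, n),
    "bill_depth_mm": rng.normal(17, 2, n),
})

# Work only with the numerical columns.
numeric = df.select_dtypes(include="number")

# Feature distributions: one histogram per numerical column.
numeric.hist(bins=20, figsize=(9, 6))
plt.tight_layout()

# Correlation matrix drawn as an annotated heatmap.
corr = numeric.corr()
fig, ax = plt.subplots(figsize=(5, 4))
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
ax.set_title("Correlation matrix")

# In a script, call plt.show(); notebooks usually render figures automatically.
```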

### 3. __Machine Learning Pipeline__
*  __ML Prep__: Define your target variable ($y$) and features ($X$).
*  __Train/Test Split__: Divide your data (typically 80/20 or 70/30) to ensure unbiased evaluation.
*  __Build Model__: Initialize and fit your chosen model.
*  __Make Predictions__: Run your model on the test set.
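
The four pipeline steps above could be sketched like this for the regression question "can you predict body mass using flipper length?". The data is synthetic (an assumption so the block runs on its own), and `LinearRegression` is just one reasonable choice of model.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with a known linear trend plus noise.
rng = np.random.default_rng(42)
flipper = rng.uniform(170, 230, 200)
df = pd.DataFrame({
    "flipper_length_mm": flipper,
    "body_mass_g": 50 * flipper - 5800 + rng.normal(0, 300, 200),
})

# ML prep: X must be 2-D (a DataFrame), y is the 1-D target.
X = df[["flipper_length_mm"]]
y = df["body_mass_g"]

# Train/test split: hold out 20% of the rows for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Build model: fit on the training rows only.
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions on rows the model has not seen.
y_pred = model.predict(X_test)
print(f"slope: {model.coef_[0]:.1f} g per mm, intercept: {model.intercept_:.0f} g")
```

Setting `random_state` makes the split reproducible, which helps when comparing models.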

### 4. __Evaluation & Reflection__
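*  __Quantitative Evaluation__: Calculate metrics that match your task (e.g., accuracy and a confusion matrix for classification; MAE, RMSE, or $R^2$ for regression).
*  __Visual Performance__: Plot your results (e.g., predicted vs. actual values, or a confusion matrix heatmap).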
*  __Conclusion__: Write a brief summary of what the model tells us about the data.
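
As a rough illustration of the quantitative evaluation step, the sketch below computes common scikit-learn metrics for both task types. The true and predicted values are small hand-written examples, not model output. In your notebook you would pass `y_test` and your model's predictions instead.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    mean_absolute_error,
    r2_score,
)

# Regression: hand-written example values (grams).
y_true_reg = np.array([3500.0, 4000.0, 4500.0, 5000.0])
y_pred_reg = np.array([3600.0, 3900.0, 4600.0, 4900.0])

mae = mean_absolute_error(y_true_reg, y_pred_reg)  # average absolute error
r2 = r2_score(y_true_reg, y_pred_reg)              # share of variance explained
print(f"MAE: {mae:.1f} g, R^2: {r2:.3f}")

# Classification: hand-written example labels.
labels = ["Adelie", "Chinstrap", "Gentoo"]
y_true_cls = ["Adelie", "Adelie", "Gentoo", "Chinstrap", "Gentoo"]
y_pred_cls = ["Adelie", "Chinstrap", "Gentoo", "Chinstrap", "Gentoo"]

acc = accuracy_score(y_true_cls, y_pred_cls)
# Rows are true classes, columns are predicted classes, in `labels` order.
cm = confusion_matrix(y_true_cls, y_pred_cls, labels=labels)
print(f"Accuracy: {acc:.2f}")
print(cm)
```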