<a href="https://colab.research.google.com/github/wbandabarragan/EPIC_5/blob/main/Dorothy_Coding_Challenge/Stage_1/stage1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Stage 1 Challenge

Please provide your stage-1 group solutions within a single Jupyter notebook (*.ipynb).

When you finish, please upload your solution notebook here:

https://forms.gle/bmvUYeACiE9igb4T7


**DEADLINE:** 11 May 2025 (by 23h59)



## Exploratory Data Analysis (EDA): Summer Olympics Dataset

### What is Exploratory Data Analysis (EDA)?
EDA is the essential first step in analyzing a dataset. It involves examining the data to identify patterns, detect outliers, and verify assumptions using summary statistics and scientific visualization. The findings then inform which machine learning models are worth evaluating.

### A Brief Outline Of The Dataset

- This dataset provides a comprehensive record of the modern Olympic Games, covering events from the 1896 Athens Olympics up to the 2016 Rio Olympics.

- The dataset includes 271,116 entries and 15 attributes in total.

- Each entry represents an individual athlete participating in a specific Olympic event (i.e., athlete-event combinations).
The attributes capture various details such as:

    - ID – A unique identifier for each athlete
    - Name – Athlete’s full name
    - Sex – Gender (M or F)
    - Age – Age in years
    - Height – Measured in centimeters
    - Weight – Measured in kilograms
    - Team – Name of the team or country represented
    - Medal – Type of medal won (Gold, Silver, Bronze)

## Downloading The Data And Importing Libraries

##  Task 1: Load Data & Import Libraries

**Goal:** Set up your environment and download the dataset for analysis.

### Instructions:

1. Download the dataset using `opendatasets`.  
   - Dataset URL: https://www.kaggle.com/datasets/heesoo37/120-years-of-olympic-history-athletes-and-results  
   - You may need to provide your Kaggle API credentials.


2. Set file paths for:
   - `athlete_data_filename` → `athlete_events.csv`
   - `regions_data_filename` → `noc_regions.csv`


3. Install and import the following libraries:
   - `pandas`, `numpy`
   - `matplotlib.pyplot`, `seaborn`
   - `plotly.express`
   - `ListedColormap` from `matplotlib.colors`


**Expected Output:**

- Dataset downloaded

- All libraries successfully imported

- File paths assigned
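
A minimal sketch of the download and path setup, under stated assumptions: `opendatasets` is installed, and `od.download` saves the files in a folder named after the dataset slug inside the working directory. The names `DATASET_URL`, `data_dir`, and `download_dataset` are illustrative and not part of the challenge.

```python
import os

DATASET_URL = (
    "https://www.kaggle.com/datasets/heesoo37/"
    "120-years-of-olympic-history-athletes-and-results"
)

# Assumption: opendatasets saves the files under ./<dataset-slug>/
data_dir = os.path.join(".", DATASET_URL.rstrip("/").split("/")[-1])


def download_dataset(url=DATASET_URL):
    """Download the Kaggle dataset; may prompt for Kaggle API credentials."""
    import opendatasets as od  # install with: pip install opendatasets

    od.download(url)


# File paths used in the later tasks
athlete_data_filename = os.path.join(data_dir, "athlete_events.csv")
regions_data_filename = os.path.join(data_dir, "noc_regions.csv")
```

Call `download_dataset()` once in its own cell. If the files land somewhere else on your machine, adjust `data_dir` accordingly.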


##  Task 2: Set Plot Style and Color Palette

**Goal:** Customize the appearance of your plots for consistent and clean visuals.

### Instructions:
1. Set the global style for all plots using `plt.style.use()`.  
   - Use `'ggplot'` for a simple, clean base style.

2. Define a custom color palette:
   - Use a list of hex color codes.
   - Example: `["#0a2e36", "#27FB6B", "#14cc60", "#036d19", "#09a129"]`

3. Apply the color palette using Seaborn:
   - Use `sns.set_palette()`.

**Expected Output:**

- Style applied.

- Custom color palette preview shown.
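
One possible way to carry out these steps is sketched below. The `matplotlib.use("Agg")` line is only an assumption for running outside a notebook and can be dropped in Jupyter or Colab; `custom_palette` and `custom_cmap` are illustrative names.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; omit this line inside a notebook

import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.colors import ListedColormap

# 1. Global style (set it before the palette, since a style resets the color cycle)
plt.style.use("ggplot")

# 2. Custom palette as a list of hex codes (the example from the instructions)
custom_palette = ["#0a2e36", "#27FB6B", "#14cc60", "#036d19", "#09a129"]

# 3. Apply the palette to all subsequent plots
sns.set_palette(custom_palette)

# Preview the active palette as a row of color swatches
sns.palplot(sns.color_palette())

# The same colors as a colormap, e.g. for heatmaps
custom_cmap = ListedColormap(custom_palette)
```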


## Task 3: Data Preparation

**Goal:** Prepare data for analysis.

###  Steps:

1. Load the file using pandas.

2. Inspect basic information about the data and its columns.

3. Fix any missing or incorrect values.

4. Identify which data types are present in the dataset and how many columns have each type.

5. List the minimum age of the athletes in the competition.

This EDA project focuses solely on the Summer Olympics, so filter out all Winter Olympics records from the dataset.
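
The inspection and filtering steps could look roughly like this. The rows below are made-up stand-ins so the cell runs without the download; the column names `Age` and `Season` match `athlete_events.csv`, and `athletes_df` and `summer_df` are illustrative names.

```python
import pandas as pd

# Toy stand-in for athlete_events.csv (made-up rows, real column names).
# With the real file: athletes_df = pd.read_csv(athlete_data_filename)
athletes_df = pd.DataFrame(
    {
        "ID": [1, 2, 3, 4],
        "Name": ["Athlete A", "Athlete B", "Athlete C", "Athlete D"],
        "Age": [24.0, None, 13.0, 30.0],
        "Season": ["Summer", "Winter", "Summer", "Summer"],
    }
)

# Column names, non-null counts, and dtypes
athletes_df.info()

# How many columns there are of each data type
dtype_counts = athletes_df.dtypes.astype(str).value_counts()

# Minimum age (missing values are skipped by default)
min_age = athletes_df["Age"].min()

# Keep only the Summer Games rows
summer_df = athletes_df[athletes_df["Season"] == "Summer"].reset_index(drop=True)
```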

## Task 4: Merging The Two Datasets Into One

**Goal:** Merge datasets.

Before we can begin analyzing the data, we need to combine the two datasets:  
- `athlete_events.csv` (athlete information)
- `noc_regions.csv` (region/country information)

Use the `pandas.merge()` function to do this.

###  Steps:

1. **Call `pd.merge()`**  
   This function merges two DataFrames based on one or more common columns (known as keys).

2. **Set merge type and key**  
   We'll perform a **left join** on the `NOC` column:
   - This keeps **all records** from `athlete_events` (left DataFrame).
   - It adds matching `region` data from `noc_regions` (right DataFrame).
   - Rows with no match in the right DataFrame will have `NaN` values in those columns.
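
A small sketch of the left join described above, using made-up rows in place of the two CSV files. The NOC code `XXX` is deliberately fictitious to show what an unmatched row looks like.

```python
import pandas as pd

# Toy stand-ins for athlete_events.csv and noc_regions.csv
athletes = pd.DataFrame(
    {
        "Name": ["Athlete A", "Athlete B", "Athlete C"],
        "NOC": ["IND", "USA", "XXX"],  # "XXX" has no entry in the regions table
    }
)
regions = pd.DataFrame({"NOC": ["IND", "USA"], "region": ["India", "USA"]})

# Left join on NOC: every athlete row is kept, region is added where it matches
merged_df = pd.merge(athletes, regions, on="NOC", how="left")

# Rows whose NOC had no match end up with NaN in the region column
unmatched = merged_df[merged_df["region"].isna()]
```

Checking `unmatched` after the real merge is a quick way to spot NOC codes that are missing from `noc_regions.csv`.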



## Task 5: Finding and Replacing The Null Values In Our Dataset

**Goal:** Data cleaning and exploratory analysis.

### Cleaning Tasks:

- Visualize the distribution of missing values using pie charts or bar plots.

- Calculate and list the percentage of null values for each column. Replace missing values with the mean of the respective column when appropriate.

- Remove duplicate entries from the dataset to ensure accuracy.
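
The cleaning tasks can be sketched on toy data as follows. Two choices here are assumptions rather than requirements: duplicates are dropped before imputing so they do not bias the mean, and a missing `Medal` is labelled `"No Medal"` instead of being mean-filled, since the mean only makes sense for numeric columns.

```python
import pandas as pd

# Toy data: the last two rows are exact duplicates
df = pd.DataFrame(
    {
        "ID": [1, 2, 3, 3],
        "Height": [180.0, None, 170.0, 170.0],
        "Medal": [None, "Gold", None, None],
    }
)

# Percentage of null values per column
null_pct = df.isna().mean() * 100

# Remove exact duplicate rows
df = df.drop_duplicates().reset_index(drop=True)

# Numeric column: replace missing values with the column mean
df["Height"] = df["Height"].fillna(df["Height"].mean())

# Categorical column: a missing medal is treated as "no medal won"
df["Medal"] = df["Medal"].fillna("No Medal")
```

`null_pct.plot(kind="bar")` is one simple way to visualize the distribution of missing values.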

### Exploratory Questions:

1. Which country has sent the most athletes to the Summer Olympics?

2. How has the number of athletes, countries, and events changed over time?

3. Which nations have won the most Olympic medals?

4. How has participation by male and female athletes evolved over time?

5. What is the correlation between the height and weight of Olympic participants?

6. In which sports has India won Olympic medals?

7. Which sports have contributed the most medals overall?
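
As a starting point for question 2, the sketch below counts distinct athletes, countries, and events per year on made-up rows. The column names match the dataset; `toy` and `per_year` are illustrative names.

```python
import pandas as pd

# Toy athlete-event rows (one row per athlete per event)
toy = pd.DataFrame(
    {
        "ID": [1, 1, 2, 3, 4, 5],
        "NOC": ["IND", "IND", "USA", "USA", "FRA", "FRA"],
        "Year": [2012, 2012, 2012, 2016, 2016, 2016],
        "Event": ["E1", "E2", "E1", "E1", "E3", "E3"],
    }
)

# Distinct athletes, countries, and events for each Games year
per_year = toy.groupby("Year").agg(
    athletes=("ID", "nunique"),
    countries=("NOC", "nunique"),
    events=("Event", "nunique"),
)
```

Using `nunique` rather than a plain row count matters because an athlete who enters several events appears in several rows.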

##  Task 6: Exploratory Analysis and Visualisations

**Goal:** Data analysis and visualization.

### 1. Create a word-cloud that graphically shows the nations that have sent the maximum number of athletes over the years.

### 2. Show the relation between various features and labels in the Olympics dataset and infer and discuss any trends and correlations.
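
A minimal sketch of a correlation table for the numeric columns. The values are invented and perfectly linear, so the resulting coefficients say nothing about the real data.

```python
import pandas as pd

# Toy numeric columns (made-up, perfectly linear values)
toy = pd.DataFrame(
    {
        "Age": [20, 25, 30, 35],
        "Height": [160.0, 170.0, 180.0, 190.0],
        "Weight": [55.0, 65.0, 75.0, 85.0],
    }
)

# Pairwise Pearson correlation between the numeric features
corr = toy[["Age", "Height", "Weight"]].corr()
```

Passing `corr` to `sns.heatmap(corr, annot=True)` gives a compact visual summary.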

### 3. Make a plot of the overall spread of the age of athletes in the Summer Olympics and discuss your findings.

### 4. Make a plot of the number of participants in the Summer Olympics over the years and discuss the overall trends.

### 5. Describe the variation in the number of female participants over the years in the Summer Olympics.

### 6. Show graphically the variation of the number of female participants in comparison to male participants over the years.
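
One way to prepare the data for tasks 5 and 6 is a year-by-sex table of distinct athletes, sketched here on made-up rows. The `female_share` column is an illustrative addition.

```python
import pandas as pd

# Toy athlete-event rows
toy = pd.DataFrame(
    {
        "ID": [1, 1, 2, 3, 4, 5],
        "Sex": ["M", "M", "F", "F", "M", "F"],
        "Year": [1900, 1900, 1900, 2016, 2016, 2016],
    }
)

# Distinct athletes per year and sex, with one column per sex
by_sex = toy.groupby(["Year", "Sex"])["ID"].nunique().unstack(fill_value=0)

# Share of female athletes in each Games
by_sex["female_share"] = by_sex["F"] / (by_sex["F"] + by_sex["M"])
```

`by_sex[["F", "M"]].plot()` then draws both series on a single set of axes.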

### 7. Create a scatter plot of the relationship between Height Vs Weight Vs Age of participants across sports. Any conclusions?


### 8. Find and list the top 10 nations that have won the most Gold, Silver, and Bronze Medals, respectively, in the history of the Summer Olympics.
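
A possible approach, sketched on made-up rows. Because each row is an athlete-event combination, a team medal appears once per team member; the de-duplication step below is an optional assumption that counts such a medal once per event. `top_nations` is a hypothetical helper.

```python
import pandas as pd

# Toy merged rows: the two India rows are teammates in the same event
toy = pd.DataFrame(
    {
        "Year": [2016, 2016, 2016, 2012, 2016, 2016],
        "Event": ["Hockey", "Hockey", "100m", "100m", "100m", "200m"],
        "region": ["India", "India", "USA", "USA", "Jamaica", "USA"],
        "Medal": ["Gold", "Gold", "Gold", "Gold", "Silver", None],
    }
)

# Keep only rows where a medal was won
medals = toy.dropna(subset=["Medal"])

# Optional: count a team medal once per event instead of once per team member
medals = medals.drop_duplicates(subset=["Year", "Event", "region", "Medal"])


def top_nations(medal_type, n=10):
    """Return the n regions with the most medals of the given type."""
    counts = medals.loc[medals["Medal"] == medal_type, "region"].value_counts()
    return counts.head(n)


top_gold = top_nations("Gold")
top_silver = top_nations("Silver")
```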

### 9. Create a word-cloud showing sports in which India has won medals over the years.

### 10. Look up and list the top 3 female athletes by the number of awarded medals across all sports.
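
A short sketch for this ranking on invented rows: filter to female medal winners, count rows per athlete, and keep the three largest counts. Grouping by both `ID` and `Name` is a precaution in case two athletes share a name.

```python
import pandas as pd

# Toy athlete-event rows (made-up names and medals)
toy = pd.DataFrame(
    {
        "ID": [1, 1, 1, 2, 2, 3, 4, 4, 4, 4, 5],
        "Name": ["Athlete A"] * 3 + ["Athlete B"] * 2 + ["Athlete C"]
        + ["Athlete D"] * 4 + ["Athlete E"],
        "Sex": ["F"] * 6 + ["M"] * 4 + ["F"],
        "Medal": ["Gold", "Silver", "Bronze", "Gold", "Gold", "Bronze",
                  "Gold", "Gold", "Gold", "Gold", None],
    }
)

# Female athletes with at least one medal, ranked by medal count
top_female = (
    toy[(toy["Sex"] == "F") & toy["Medal"].notna()]
    .groupby(["ID", "Name"])
    .size()
    .sort_values(ascending=False)
    .head(3)
)
```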

# Optional: Additional tasks

- Make a choropleth map marking the participating countries, and explain why there is a sudden, sharp drop in the number of participants at three points in the historical timeline of the Olympics.

- Does Wealth (GDP) have any effect on a country's performance in the Olympics?

You can use this [GDP dataset](https://github.com/bhushanrane29/Summer-Olympics-EDA/blob/master/gdp_data.csv), which can be merged with the dataset above to perform this analysis.

- What is the relation between a country's climate and its Olympic medal tally?

- Does home advantage give countries an edge in their medal tally? (Linear curve)

- Does an athlete's height play any role in winning an Olympic medal? (Heatmap)

- You can also add the [Paralympics dataset](https://www.kaggle.com/shivagovindasamy/2020-tokyo-paralympics) to this data.

- Replace the pie charts with sunburst charts wherever possible.