<a href="https://www.kaggle.com/code/mohamedmahmoud111/olympics-eda-ml-model?scriptVersionId=244337974" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Olympics EDA & ML Model

### Introduction
The Olympic Games are a global celebration of athletic excellence, uniting countries through competition across a wide range of sports. This project explores historical Olympic data from both the Summer and Winter Games to uncover patterns in country participation and medal performance. It also joins in each country's population and GDP per capita to examine how socio-economic factors relate to Olympic success.

### Objectives
- Merge and clean Summer and Winter Olympic datasets for unified analysis.
- Handle missing values and detect outliers in population and GDP data.
- Perform Exploratory Data Analysis (EDA) to examine:
  - Athlete participation trends.
  - Medal distribution across countries and seasons.
  - Relationships between population, GDP per capita, and medal counts.
- Engineer new features such as Medals per Capita and Medals per GDP.
- Apply and evaluate machine learning models (Linear Regression, Decision Tree, Random Forest) to predict medal counts based on socio-economic factors.
- Visualize key relationships and model results using informative plots.


# Importing libraries & data

In [None]:
import numpy as np  
import pandas as pd  

import matplotlib.pyplot as plt 
import seaborn as sns

from math import pi

import warnings  
warnings.filterwarnings("ignore")  

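### Analysis of Summer Olympics Data
The Summer Olympics dataset and the country dictionary were cleaned by:
- Filling missing country codes in the Summer data using the forward fill method.
- Replacing missing population and GDP per capita values in the country dictionary with the column mean.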
### Analysis of Summer and Winter Olympics Data
The combined analysis of the Summer and Winter Olympics datasets involved:
- Visualizing the top 10 countries in terms of total medals and diversity of sports.
- Analyzing medal distribution based on gender.

### Summary of Data Processing and Analysis:
- Cleaning and analyzing both Summer and Winter Olympics data.
- Handling missing values in country, population, and GDP information.
- Exploring medal distribution by gender.
- Identifying top-performing countries in medals and sports participation.
- Leveraging data visualizations to uncover key insights.


# A Quick Dive into the Winter Olympics


In [None]:
df_winter = pd.read_csv('/kaggle/input/olympic-sports-and-medals/winter.csv')  
# Reading the CSV file "winter.csv" from the specified path and storing it in a DataFrame.

df_winter  # Displaying the DataFrame content

In [None]:
df_winter.info()
# Displays a concise summary of the DataFrame, including column names, non-null counts, data types, and memory usage.

In [None]:
df_winter['Athlete'] = df_winter['Athlete'].str.split(', ').str[::-1].str.join(' ')  
# Splitting the athlete's name by ", ", reversing the order (from "Last, First" to "First Last"), and joining it back as a single string.

df_winter['Athlete'] = df_winter['Athlete'].str.title()  
# Converting the athlete's name to title case (first letter of each word capitalized).

df_winter.head()  # Displaying the first 5 rows of the DataFrame to check the changes.

In [None]:
Gender_Medal = df_winter.groupby("Gender")['Medal'].value_counts().reset_index()  
# Grouping the data by 'Gender' and counting the occurrences of each medal type, then resetting the index to return a structured DataFrame.

Gender_Medal  # Displaying the resulting DataFrame.

In [None]:
plt.figure(figsize=(10, 7))  
# Creating a figure with a specified size (10 inches by 7 inches).

sns.countplot(data=df_winter, x='Gender', hue='Medal', palette='Set2')  
# Creating a count plot using Seaborn to show the number of medals won by each gender.
# The 'hue' parameter separates the medal types (Gold, Silver, Bronze) with different colors.
# 'palette' is set to 'Set2' for a visually appealing color scheme.

plt.show()  
# Displaying the plot.

In [None]:
Top_Medal = (
    df_winter.groupby('Country')['Medal']
    .count()
    .nlargest(10)
    .reset_index()
)

plt.figure(figsize=(12, 6))
sns.barplot(
    data=Top_Medal,
    x='Country',
    y='Medal',
    palette='pastel'
)

plt.title('Top 10 Countries by Total Winter Olympic Medals', fontsize=14, fontweight='bold')
plt.xlabel('Country', fontsize=12)
plt.ylabel('Total Medals', fontsize=12)
plt.xticks(rotation=45)
plt.grid(axis='y', linestyle='--', alpha=0.5)

plt.tight_layout()
plt.show()


In [None]:
Top_Sport = (
    df_winter.groupby('Country')['Sport']
    .count()
    .nlargest(10)
    .reset_index()
)

plt.figure(figsize=(12, 6))
sns.barplot(
    data=Top_Sport,
    x='Country',
    y='Sport',
    palette='deep')

plt.title('Top 10 Countries by Winter Olympic Sport Participation', fontsize=14, fontweight='bold')
plt.xlabel('Country', fontsize=12)
plt.ylabel('Number of Sport Entries', fontsize=12)
plt.xticks(rotation=45)
plt.grid(axis='y', linestyle='--', alpha=0.5)

plt.tight_layout()
plt.show()


# A Quick Dive into the Summer Olympics


In [None]:
df_summer = pd.read_csv("/kaggle/input/olympic-sports-and-medals/summer.csv")  
# Reading the CSV file "summer.csv" from the specified path and storing it in a DataFrame.

df_summer.head()  
# Displaying the first 5 rows of the DataFrame to preview the data.

In [None]:
df_summer.info()  
# Displays a concise summary of the DataFrame, including:
# - The number of non-null values in each column.
# - The data types of each column.
# - The total memory usage of the DataFrame.

In [None]:
df_summer.isna().sum()  
# Checking for missing values in each column of the DataFrame.
# This returns the total count of NaN (null) values for each column.

In [None]:
df_summer[df_summer['Country'].isna()]  
# Filtering and displaying rows where the 'Country' column has missing (NaN) values.

In [None]:
df_summer['Country'] = df_summer['Country'].ffill()
# Filling missing values in the 'Country' column using forward fill (ffill).
# This propagates the last valid value forward to fill NaN values.
# Assigning the result back replaces the deprecated fillna(method='ffill', inplace=True) call.

df_summer.isna().sum()  
# Checking again for missing values in the DataFrame to ensure they have been handled.

In [None]:
df_summer['Athlete'] = df_summer['Athlete'].str.split(', ').str[::-1].str.join(' ')  
# Splitting the athlete's name by ", ", reversing the order (from "Last, First" to "First Last"), and joining it back as a single string.

df_summer['Athlete'] = df_summer['Athlete'].str.title()  
# Converting the athlete's name to title case (capitalizing the first letter of each word).

df_summer.head()  
# Displaying the first 5 rows of the DataFrame to check the changes.

In [None]:
Gender_Medal = df_summer.groupby("Gender")['Medal'].value_counts().reset_index()  
# Grouping the data by 'Gender' and counting the occurrences of each medal type.
# Resetting the index to return a structured DataFrame.

Gender_Medal  # Displaying the resulting DataFrame.

In [None]:
sns.countplot(data=df_summer, x='Gender', hue='Medal', palette='Set2')  
# Creating a count plot to visualize the number of medals won by each gender.
# The 'hue' parameter differentiates medal types (Gold, Silver, Bronze) with different colors.
# 'palette' is set to 'Set2' for a visually appealing color scheme.

plt.show()  
# Displaying the plot.

# A Quick Dive into the Country Dictionary


In [None]:
df_dictionary = pd.read_csv('/kaggle/input/olympic-sports-and-medals/dictionary.csv')  
# Reading the CSV file "dictionary.csv" from the specified path and storing it in a DataFrame.

df_dictionary  # Displaying the DataFrame.

In [None]:
df_dictionary.isna().sum()  
# Checking for missing values in each column of the DataFrame.
# This returns the total count of NaN (null) values for each column.

In [None]:
df_dictionary['Population'] = df_dictionary['Population'].fillna(df_dictionary['Population'].mean())
# Filling missing values in the 'Population' column with the column's mean value.
# Assigning the result back to the column applies the change reliably across pandas versions.

In [None]:
df_dictionary.loc[:, 'GDP per Capita'] = df_dictionary['GDP per Capita'].fillna(df_dictionary['GDP per Capita'].mean())  
# Filling missing values in the 'GDP per Capita' column with the column's mean value.
# The `loc` method ensures the assignment is done correctly without triggering a warning.

In [None]:
df_dictionary.isna().sum()  
# Checking again for missing values in the DataFrame to ensure they have been handled.

# Data Collection and Cleaning in Olympic Data

#### 1. Data Merging
- Added **Season** column to both summer and winter datasets:

  ```python
  df_summer['Season'] = 'Summer'
  df_winter['Season'] = 'Winter'
  ```
- Merged the summer and winter datasets into one DataFrame using **pd.concat()**.

#### 2. Column Renaming
- Renamed the **Country_x** column to **Country** to maintain consistency.

#### 3. Missing Values Detection
- Displayed DataFrame information using **df.info()** to check for missing values.
- Visualized missing values using **Heatmap** from Seaborn.

#### 4. Handling Missing Values
- Filled missing values in the **Population** column with the **Mean**.
- Filled missing values in the **GDP per Capita** column with the **Mean**.

#### 5. Data Validation
- Rechecked DataFrame information to ensure missing values were properly handled.
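The merging steps above can be sketched on toy data. The rows, the `lookup` table, and the population figures below are hypothetical stand-ins for the real files. The `indicator=True` argument is an addition not used in the notebook; it shows one way to find country codes that have no match in the dictionary, which is where the missing **Population** and **GDP per Capita** values come from.

```python
import pandas as pd

# Toy stand-ins for summer.csv, winter.csv and dictionary.csv (hypothetical rows).
summer = pd.DataFrame({"Country": ["USA", "FRA"], "Medal": ["Gold", "Silver"]})
winter = pd.DataFrame({"Country": ["NOR", "URS"], "Medal": ["Gold", "Bronze"]})
lookup = pd.DataFrame({
    "Country": ["United States", "France", "Norway"],
    "Code": ["USA", "FRA", "NOR"],
    "Population": [321_000_000, 66_000_000, 5_000_000],  # rounded, illustrative
})

# Tag each dataset with its season, then stack them.
summer["Season"] = "Summer"
winter["Season"] = "Winter"
games = pd.concat([summer, winter], ignore_index=True)

# Left merge keeps every medal row; indicator=True adds a '_merge' column
# that flags rows whose code has no match in the lookup table.
merged = games.merge(
    lookup, how="left", left_on="Country", right_on="Code", indicator=True
)

# Both frames have a 'Country' column, so pandas suffixes them with _x and _y.
unmatched = merged.loc[merged["_merge"] == "left_only", "Country_x"].unique()
print(unmatched)
```

Listing the unmatched codes before imputing makes it clear which countries would receive an averaged population or GDP value.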



In [None]:
df_summer['Season'] = 'Summer'  
# Adding a new column 'Season' to the summer dataset and setting its value to 'Summer' for all rows.

df_winter['Season'] = 'Winter'  
# Adding a new column 'Season' to the winter dataset and setting its value to 'Winter' for all rows.

# Merging Summer and Winter Olympics Data


In [None]:
olympic_df = pd.concat([df_summer, df_winter], ignore_index=True)  
# Combining the summer and winter Olympics datasets into a single DataFrame.
# 'ignore_index=True' resets the index to maintain a continuous sequence.

olympic_df  # Displaying the merged DataFrame.

# Merging Olympics Data with Country Dictionary


In [None]:
df_olympic = pd.merge(olympic_df, df_dictionary, how='left', left_on='Country', right_on='Code')  
# Merging the Olympic dataset with the dictionary dataset based on the 'Country' column in olympic_df 
# and the 'Code' column in df_dictionary.
# 'how="left"' ensures all records from olympic_df are kept, adding matching details from df_dictionary.

df_olympic  # Displaying the merged DataFrame.

In [None]:
df_olympic.drop(columns=['Country_y', 'Code'], inplace=True)  
# Dropping unnecessary columns: 
# - 'Country_y' (duplicate from the merge process).
# - 'Code' (since 'Country_x' already represents the country).

df_olympic  # Displaying the updated DataFrame.


In [None]:
df_olympic.rename(columns={'Country_x': 'Country'}, inplace=True)  
# Renaming the 'Country_x' column to 'Country' for clarity after the merge.

# Inspecting the Olympic Data


In [None]:
df_olympic.info()  
# Displays a concise summary of the DataFrame, including:
# - Column names and data types.
# - The number of non-null values in each column.
# - The total memory usage of the DataFrame.

In [None]:
sns.heatmap(df_olympic.isna())  
# Creating a heatmap to visualize missing values in the DataFrame.
# Missing values are highlighted, making it easier to identify patterns of missing data.

plt.savefig('my_figure.png')  # The file extension selects the format: png / jpg / pdf / svg.
# Saving before plt.show(), because the figure may be empty once it has been displayed.

plt.show()  
# Displaying the heatmap.


In [None]:
df_olympic[df_olympic['Population'].isna()]  
# Filtering and displaying rows where the 'Population' column has missing (NaN) values.

# Handling Missing Values 

In [None]:
df_olympic['Population'] = df_olympic['Population'].fillna(df_olympic['Population'].mean())
# Filling missing values in the 'Population' column with the column's mean value.

df_olympic['GDP per Capita'] = df_olympic['GDP per Capita'].fillna(df_olympic['GDP per Capita'].mean())
# Filling missing values in the 'GDP per Capita' column with the column's mean value.

In [None]:
df_olympic.info()  
# Displays a concise summary of the DataFrame, ensuring all missing values have been handled.

# Handling Outliers in Olympic Data

#### 1. Population Outliers Detection
- Used **Boxplot** to visualize outliers in the **Population** column.
- Calculated **IQR (Interquartile Range)** to identify outliers:
  - Q1: First quartile (25%)
  - Q3: Third quartile (75%)
  - IQR = Q3 - Q1
  - Values greater than Q3 + 1.5 * IQR are considered outliers; only this upper bound is checked.

#### 2. GDP per Capita Outliers Detection
- Visualized outliers using **Boxplot** for the **GDP per Capita** column.
- Identified outliers using the same IQR method applied to **Population**.
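The IQR rule described above can be wrapped in a small helper. This is a sketch on toy numbers: the function name `iqr_outlier_mask` is hypothetical, and unlike the notebook it also flags values below the lower fence (Q1 - 1.5 * IQR).

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask that is True where a value lies outside the IQR fences."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    # Flag values below the lower fence or above the upper fence.
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Toy data: five similar values and one extreme value.
toy = pd.Series([10, 12, 11, 13, 12, 100])
mask = iqr_outlier_mask(toy)
print(toy[mask])
```

Because the merged dataset has one row per medal, quartiles computed on it are weighted by how many medals a country has won. Applying the rule to one row per country would give different fences.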

#### 3. Relationship Between Population and Total Medals
- Created **Scatterplot** to show the relationship between population and total medals.
- Grouped data using **groupby** on the **Country** column.

#### 4. Medals Per Capita
- Converted the **Population** column to **Numeric** using **errors='coerce'** to handle non-numeric values as NaN.
- Calculated medals per capita with the formula:

  ```
  Medals_Per_Capita = Total_Medals / Population
  ```
- Filled missing values with **fillna(0)**.
- Visualized the relationship using **Scatterplot**.
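The medals-per-capita formula needs a numeric medal total per country, whereas the merged dataset holds one row per medal with a text label. A minimal sketch of the aggregation, using made-up country codes and populations:

```python
import pandas as pd

# One row per medal, as in the merged dataset (toy values, hypothetical codes).
rows = pd.DataFrame({
    "Country": ["AAA", "AAA", "AAA", "BBB"],
    "Medal": ["Gold", "Silver", "Gold", "Bronze"],
    "Population": [1_000_000, 1_000_000, 1_000_000, 500_000],
})

# Collapse to one row per country: count the medals, keep the population.
per_country = (
    rows.groupby("Country")
    .agg(Total_Medals=("Medal", "count"), Population=("Population", "first"))
    .reset_index()
)

# Medals_Per_Capita = Total_Medals / Population
per_country["Medals_Per_Capita"] = (
    per_country["Total_Medals"] / per_country["Population"]
)
print(per_country)
```

Taking the first population value per country assumes the value is constant within a country, which holds when it comes from a single dictionary row.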

#### 5. GDP per Capita Handling
- Converted the **GDP per Capita** column to **Numeric**.
- Filled missing values with the **Median**.
- Reset index using **reset_index(drop=True)**.

#### 6. Medals Per GDP
- Calculated medals per GDP using the formula:

  ```
  Medals_Per_GDP = Total_Medals / GDP per Capita
  ```
- Filled missing values with **fillna(0)**.
- Visualized the relationship using **Scatterplot** with purple color.
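One detail of the ratio features is worth checking: in pandas, dividing a non-zero number by zero yields `inf` rather than `NaN`, so `fillna(0)` alone does not remove it. The sketch below uses toy values and adds a `replace` step that is not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy values: a normal case, a zero denominator, and a missing denominator.
toy = pd.DataFrame({
    "Total_Medals": [10, 5, 0],
    "GDP per Capita": [2000.0, 0.0, np.nan],
})

ratio = toy["Total_Medals"] / toy["GDP per Capita"]
# ratio is now [0.005, inf, NaN]

# Convert infinities to NaN first, then fill all NaN values with 0.
cleaned = ratio.replace([np.inf, -np.inf], np.nan).fillna(0)
print(cleaned)
```

Whether 0 is the right fill value is a modelling choice; it treats countries with unknown GDP as having no medals per unit of GDP.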

In [None]:
plt.figure(figsize=(12, 6))  
# Creating a figure with a specified size (12 inches by 6 inches).

# Plotting the boxplot for Population
plt.subplot(2, 1, 1)  
sns.boxplot(x=df_olympic['Population'])  
# Creating a boxplot to visualize outliers in the 'Population' column.
plt.title('Outliers in Population')  
# Setting the title for the Population boxplot.

# Plotting the boxplot for GDP per Capita
plt.subplot(2, 1, 2)  
sns.boxplot(x=df_olympic['GDP per Capita'])  
# Creating a boxplot to visualize outliers in the 'GDP per Capita' column.
plt.title('Outliers in GDP per Capita')  
# Setting the title for the GDP per Capita boxplot.

plt.tight_layout()  
# Adjusting layout to prevent overlap between subplots.

plt.show()  
# Displaying the plots.

In [None]:
Q1 = df_olympic['Population'].quantile(0.25)  
# Calculating the first quartile (Q1) of the 'Population' column.

Q3 = df_olympic['Population'].quantile(0.75)  
# Calculating the third quartile (Q3) of the 'Population' column.

IQR = Q3 - Q1  
# Calculating the interquartile range (IQR), which is the difference between Q3 and Q1.

outliers_population = df_olympic[df_olympic['Population'] > Q3 + 1.5 * IQR]  
# Identifying outliers where the 'Population' value is greater than the upper bound (Q3 + 1.5 * IQR).

outliers_population[['Country', 'Population']]  
# Displaying only the 'Country' and 'Population' columns for the identified outlier rows.

In [None]:
Q1_gdp = df_olympic['GDP per Capita'].quantile(0.25)  
# Calculating the first quartile (Q1) of the 'GDP per Capita' column.

Q3_gdp = df_olympic['GDP per Capita'].quantile(0.75)  
# Calculating the third quartile (Q3) of the 'GDP per Capita' column.

IQR_gdp = Q3_gdp - Q1_gdp  
# Calculating the interquartile range (IQR), which is the difference between Q3 and Q1.

outliers_gdp = df_olympic[df_olympic['GDP per Capita'] > Q3_gdp + 1.5 * IQR_gdp]  
# Identifying outliers where the 'GDP per Capita' value is greater than the upper bound (Q3 + 1.5 * IQR).

outliers_gdp[['Country', 'GDP per Capita']]  
# Displaying only the 'Country' and 'GDP per Capita' columns for the identified outlier rows.

In [None]:
medals_per_country = df_olympic.groupby('Country')['Medal'].count().reset_index()  
# Grouping the data by 'Country' and counting the total number of medals won by each country.
# Resetting the index to return a structured DataFrame.

medals_per_country.columns = ['Country', 'Total_Medals']  
# Renaming the columns for clarity.

medals_per_country  # Displaying the resulting DataFrame.

In [None]:
plt.figure(figsize=(10, 6))  
# Creating a figure with a specified size (10 inches by 6 inches).

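country_stats = df_olympic.groupby('Country').agg(Total_Medals=('Medal', 'count'), Population=('Population', 'first')).reset_index()
# Aggregating to one row per country: the total medal count and the country's population.

sns.scatterplot(x='Population', y='Total_Medals', data=country_stats, alpha=0.7)  
# Creating a scatter plot to visualize the relationship between a country's population and the number of medals won.
# 'alpha=0.7' sets the transparency level to make overlapping points more visible.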

plt.title('Population vs Total Medals')  
# Setting the title of the plot.

plt.xlabel('Population')  
# Labeling the x-axis.

plt.ylabel('Total Medals')  
# Labeling the y-axis.



plt.show()  
# Displaying the plot.

In [None]:
df_olympic['Population'] = pd.to_numeric(df_olympic['Population'], errors='coerce')  
# Converting the 'Population' column to a numeric data type.
# 'errors="coerce"' ensures that any non-numeric values are converted to NaN instead of raising an error.

df_olympic['Total_Medals'] = df_olympic.groupby('Country')['Medal'].transform('count')
# Counting each country's medals and broadcasting that total to every row of the country.
# 'Medal' holds text labels (Gold, Silver, Bronze), so pd.to_numeric would turn every value into NaN.

In [None]:
df_olympic['Medals_Per_Capita'] = df_olympic['Total_Medals'] / df_olympic['Population']  
# Calculating the number of medals won per person by dividing 'Total_Medals' by 'Population'.

df_olympic['Medals_Per_Capita'] = df_olympic['Medals_Per_Capita'].fillna(0)
# Replacing any NaN values (resulting from missing data) with 0.

In [None]:
plt.figure(figsize=(10, 6))  
# Creating a figure with a specified size (10 inches by 6 inches).

sns.scatterplot(data=df_olympic, x='Population', y='Medals_Per_Capita')  
# Creating a scatter plot to visualize the relationship between population size and medals per capita.

plt.title('Population vs Medals Per Capita')  
# Setting the title of the plot.

plt.xlabel('Population')  
# Labeling the x-axis.

plt.ylabel('Medals Per Capita')  
# Labeling the y-axis.

plt.show()  
# Displaying the plot.

In [None]:
df_olympic['GDP per Capita'] = pd.to_numeric(df_olympic['GDP per Capita'], errors='coerce')  
# Converting the 'GDP per Capita' column to a numeric data type.
# 'errors="coerce"' ensures that any non-numeric values are converted to NaN instead of raising an error.

df_olympic['GDP per Capita'] = df_olympic['GDP per Capita'].fillna(df_olympic['GDP per Capita'].median())
# Filling missing values in the 'GDP per Capita' column with the median value of the column.

df_olympic.reset_index(drop=True, inplace=True)  
# Resetting the index of the DataFrame after modifications.

df_olympic['GDP per Capita'].describe()  
# Generating summary statistics for the 'GDP per Capita' column, including count, mean, std, min, and quartiles.

In [None]:
df_olympic['Medals_Per_GDP'] = df_olympic['Total_Medals'] / df_olympic['GDP per Capita']  
# Calculating the number of medals won per unit of GDP per capita.

df_olympic['Medals_Per_GDP'] = df_olympic['Medals_Per_GDP'].fillna(0)
# Replacing any NaN values (resulting from missing data) with 0.

plt.figure(figsize=(10, 6))  
# Creating a figure with a specified size (10 inches by 6 inches).

sns.scatterplot(data=df_olympic, x='GDP per Capita', y='Medals_Per_GDP', color='purple')  
# Creating a scatter plot to visualize the relationship between GDP per capita and medals per GDP.

plt.title('GDP per Capita vs Medals Per GDP')  
# Setting the title of the plot.

plt.xlabel('GDP per Capita')  
# Labeling the x-axis.

plt.ylabel('Medals Per GDP')  
# Labeling the y-axis.

plt.grid(True)  
# Adding grid lines for better readability.

plt.show()  
# Displaying the plot.

# Olympics Data Analysis

- Exploratory Data Analysis (EDA) of the Olympic Games dataset
- Analyzing trends, athlete participation, medal distributions, and top-performing countries.
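Most of the medal-distribution plots in this section start from a country-by-medal count table. A minimal sketch of building one with `pd.crosstab`, using made-up country codes:

```python
import pandas as pd

# One row per medal (toy values, hypothetical codes).
rows = pd.DataFrame({
    "Country": ["AAA", "AAA", "BBB", "BBB", "BBB"],
    "Medal": ["Gold", "Gold", "Silver", "Bronze", "Gold"],
})

# Count rows for every (country, medal type) pair.
counts = pd.crosstab(rows["Country"], rows["Medal"])

# Put the columns in podium order; reindex also adds a zero column
# if some medal type never occurs in the data.
counts = counts.reindex(columns=["Gold", "Silver", "Bronze"], fill_value=0)

# Row totals, useful for ranking countries.
counts["Total"] = counts.sum(axis=1)
print(counts)
```

Using `reindex` instead of selecting `counts[['Gold', 'Silver', 'Bronze']]` avoids a `KeyError` when a filtered subset contains no medals of one type.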

# EDA

In [None]:
gold_df = df_olympic[df_olympic['Medal'] == 'Gold']
silver_df = df_olympic[df_olympic['Medal'] == 'Silver']
bronze_df = df_olympic[df_olympic['Medal'] == 'Bronze']


In [None]:
import matplotlib.pyplot as plt

noc_medals = df_olympic.pivot_table(index='Country', columns='Medal', aggfunc='size', fill_value=0)
noc_medals = noc_medals[['Gold', 'Silver', 'Bronze']]

noc_medals.head(50).plot(kind='bar', stacked=True, figsize=(8, 5), color=['gold', 'silver', '#cd7f32'])
plt.title('Medal Distribution by Country')
plt.xlabel('Country')
plt.ylabel('Number of Medals')
plt.xticks(rotation=90, fontsize=8)
plt.yticks(fontsize=10, fontweight='bold')
plt.legend(['Gold', 'Silver', 'Bronze'])
plt.show()


In [None]:
noc_medals = df_olympic.pivot_table(index='Country', columns='Medal', aggfunc='size', fill_value=0)

noc_medals = noc_medals[['Gold', 'Silver', 'Bronze']]

noc_totals = noc_medals.sum(axis=1).sort_values(ascending=False)

top_nocs = noc_totals.head(10)

top_nocs.plot(
    kind='pie',
    autopct='%1.1f%%',
    startangle=140,
    colors=plt.cm.tab20.colors,
    figsize=(8, 8),
    legend=False
)

plt.title('Top 10 Countries by Total Medals')
plt.ylabel('')
plt.show()


In [None]:
medals_by_year = df_olympic.pivot_table(index='Year', columns='Medal', aggfunc='size', fill_value=0)


medals_by_year = medals_by_year[['Gold', 'Silver', 'Bronze']]

medals_by_year.plot(
    kind='area',
    stacked=True,
    figsize=(13,10),
    color=['gold', 'silver', '#cd7f32'],
    alpha=0.7
)

plt.title('Medal Proportions by Year')
plt.xlabel('Year')
plt.ylabel('Number of Medals')
plt.xticks(rotation=45, fontsize=10, fontweight='bold')
plt.yticks(fontsize=10, fontweight='bold')
plt.show()


In [None]:
# Counting medal records per year (each row is one medal awarded to one athlete),
# used here as a proxy for the number of competitions
num_of_comp = df_olympic.groupby('Year').size().reset_index()  # Groups data by 'Year' and counts rows
num_of_comp.columns = ['Year', 'Count']  # Renaming columns to 'Year' and 'Count'

# Creating the plot
plt.figure(figsize=(10, 5))  # Setting figure size to 10x5 inches
sns.lineplot(data=num_of_comp, x='Year', y='Count', marker="o", linestyle="-", color="b")  
# Drawing a line plot with year on the x-axis and count on the y-axis
# 'marker="o"' adds circular markers to each point
# 'linestyle="-"' connects points with a solid line
# 'color="b"' sets the line color to blue

# Customizing the title and labels
plt.title("Number of Competitions Over Years", fontsize=14)  # Setting the plot title
plt.xlabel("Year", fontsize=12)  # Labeling the x-axis
plt.ylabel("Number of Competitions", fontsize=12)  # Labeling the y-axis
plt.grid(True)  # Enabling grid lines for better readability

# Displaying the plot
plt.show()

# Printing the DataFrame to inspect the number of competitions per year
num_of_comp

In [None]:
# Grouping the number of unique athletes per year
num_of_athletes = df_olympic.groupby('Year')['Athlete'].nunique().reset_index()  
# Groups data by 'Year' and calculates the number of unique athletes per year
# 'nunique()' counts distinct athletes in each year
# 'reset_index()' resets the index to return a DataFrame

# Creating the plot
plt.figure(figsize=(10, 5))  # Setting figure size to 10x5 inches
sns.lineplot(data=num_of_athletes, x='Year', y='Athlete', marker="o", linestyle="-", color="b")  
# Drawing a line plot with 'Year' on the x-axis and 'Athlete' count on the y-axis
# 'marker="o"' adds circular markers to each point
# 'linestyle="-"' connects points with a solid line
# 'color="b"' sets the line color to blue

# Customizing the title and labels
plt.title("Trend of Unique Athletes Over Years", fontsize=14)  # Setting the plot title
plt.xlabel("Year", fontsize=12)  # Labeling the x-axis
plt.ylabel("Number of Unique Athletes", fontsize=12)  # Labeling the y-axis
plt.grid(True)  # Enabling grid lines for better readability

# Displaying the plot
plt.show()

In [None]:
# Grouping the number of unique sports per country
num_of_sports_by_country = df_olympic.groupby('Country')['Sport'].nunique().reset_index()  
# Groups data by 'Country' and calculates the number of unique sports each country has participated in
# 'nunique()' counts distinct sports for each country
# 'reset_index()' resets the index to return a DataFrame

# Sorting the data by the number of unique sports in descending order
num_of_sports_by_country = num_of_sports_by_country.sort_values(by='Sport', ascending=False)  
# Sorting to show countries with the highest number of unique sports first
max_sports = num_of_sports_by_country['Sport'].max()  # Get the maximum number of unique sports

# Creating the plot
plt.figure(figsize=(35, 17.5))  # Setting figure size to 35x17.5 inches for better readability
sns.barplot(data=num_of_sports_by_country, x='Country', y='Sport', palette="viridis")  
# Creating a bar plot with 'Country' on the x-axis and 'Sport' count on the y-axis
# 'palette="viridis"' applies a visually appealing color gradient

# Customizing the title and labels
plt.title("Number of Unique Sports Per Country", fontsize=14)  # Setting the plot title
plt.xlabel("Country", fontsize=12)  # Labeling the x-axis
plt.ylabel("Number of Unique Sports", fontsize=12)  # Labeling the y-axis
plt.xticks(rotation=90)  # Rotating country names for better visibility
plt.grid(axis='y', linestyle='--', alpha=0.7)  # Adding horizontal grid lines for clarity
plt.ylim(0, max_sports + 5)  # Setting Y-axis limits with an extra margin above the maximum value

# Displaying the plot
plt.show()

In [None]:
# Grouping the total number of medals per country
num_of_medal_by_country = df_olympic.groupby('Country')['Medal'].count().reset_index()  
# Groups data by 'Country' and counts the total number of medals each country has won
# 'count()' counts the occurrences of medals for each country
# 'reset_index()' resets the index to return a DataFrame

# Sorting the data by the number of medals in descending order
num_of_medal_by_country = num_of_medal_by_country.sort_values(by="Medal", ascending=False)  
# Sorting to display countries with the highest number of medals first

# Creating the plot
plt.figure(figsize=(28, 12))  # Setting figure size to 28x12 inches for better visibility
sns.barplot(data=num_of_medal_by_country, x='Country', y='Medal', palette="viridis")  
# Creating a bar plot with 'Country' on the x-axis and 'Medal' count on the y-axis
# 'palette="viridis"' applies a visually appealing color gradient

# Customizing the title and labels
plt.title("Total Number of Medals Per Country", fontsize=14)  # Setting the plot title
plt.xlabel("Country", fontsize=12)  # Labeling the x-axis
plt.ylabel("Number of Medals", fontsize=12)  # Labeling the y-axis
plt.xticks(rotation=90)  # Rotating country names for better readability

# Adding horizontal grid lines for clarity
plt.grid(axis='y', linestyle='--', alpha=0.7)  

# Displaying the plot
plt.show()

# Printing the DataFrame to inspect the total number of medals per country
num_of_medal_by_country

In [None]:
# Grouping the total number of medals per sport
medals_per_sport = df_olympic.groupby('Sport')['Medal'].count().reset_index()  
# Groups data by 'Sport' and counts the total number of medals awarded in each sport
# 'count()' counts the occurrences of medals for each sport
# 'reset_index()' resets the index to return a DataFrame

# Creating a pivot table for the heatmap
p_table = medals_per_sport.pivot_table(values='Medal', index='Sport', aggfunc='sum')  
# Pivoting the data so that 'Sport' becomes the index, and the sum of 'Medal' values is used

# Creating the heatmap
plt.figure(figsize=(10, 12))  # Setting figure size to 10x12 inches
sns.heatmap(p_table, cmap="coolwarm", annot=True, linewidths=0.5, fmt="d")  
# 'cmap="coolwarm"' sets the color gradient
# 'annot=True' displays numerical values inside the heatmap cells
# 'linewidths=0.5' adds lines between cells for better readability
# 'fmt="d"' ensures that numbers are displayed as integers

# Customizing the title and labels
plt.title("Heatmap of Sports vs. Number of Medals", fontsize=14)  # Setting the plot title
plt.xlabel("Medals", fontsize=12)  # Labeling the x-axis (not needed for a heatmap, but added for clarity)
plt.ylabel("Sport", fontsize=12)  # Labeling the y-axis

# Displaying the heatmap
plt.show()

In [None]:
# Grouping the total number of medals by gender
medals_by_gender = df_olympic.groupby('Gender')['Medal'].count().reset_index()  
# Groups data by 'Gender' and counts the total number of medals won by each gender
# 'count()' counts the occurrences of medals for each gender
# 'reset_index()' resets the index to return a DataFrame

# Creating the plot
plt.figure(figsize=(8, 5))  # Setting figure size to 8x5 inches
sns.barplot(data=medals_by_gender, x='Gender', y='Medal', palette="coolwarm")  
# Creating a bar plot with 'Gender' on the x-axis and 'Medal' count on the y-axis
# 'palette="coolwarm"' applies a color gradient for better visualization

# Customizing the title and labels
plt.title("Number of Medals by Gender", fontsize=14)  # Setting the plot title
plt.xlabel("Gender", fontsize=12)  # Labeling the x-axis
plt.ylabel("Number of Medals", fontsize=12)  # Labeling the y-axis

# Adding horizontal grid lines for better readability
plt.grid(axis='y', linestyle='--', alpha=0.7)

# Displaying the plot
plt.show()

In [None]:
# Selecting the top 10 countries with the highest number of medals
top_countries = df_olympic.groupby('Country')['Medal'].count().nlargest(10).index  
# Groups data by 'Country' and counts the total medals per country
# 'nlargest(10)' selects the top 10 countries with the highest medal counts
# '.index' extracts the country names (indexes) of the top 10 countries

# Filtering the dataset to include only the top 10 countries
df_top = df_olympic[df_olympic['Country'].isin(top_countries)]  
# Keeps only rows where 'Country' is in the top_countries list

# Grouping medals by year and country
medals_by_year = df_top.groupby(['Year', 'Country'])['Medal'].count().reset_index()  
# Groups data by 'Year' and 'Country' and counts the number of medals each country won in each year
# 'reset_index()' ensures the result is a DataFrame

# Creating the plot
plt.figure(figsize=(25, 10))  # Setting figure size to 25x10 inches
sns.lineplot(data=medals_by_year, x='Year', y='Medal', hue='Country', marker="o")  
# Creating a line plot with 'Year' on the x-axis and 'Medal' count on the y-axis
# 'hue="Country"' differentiates lines by country
# 'marker="o"' adds circular markers at data points

# Customizing the title and labels
plt.title("Number of Medals Over Years for Top Countries", fontsize=14)  # Setting the plot title
plt.xlabel("Year", fontsize=12)  # Labeling the x-axis
plt.ylabel("Number of Medals", fontsize=12)  # Labeling the y-axis

# Adding a legend to identify countries
plt.legend(title="Country")  

# Adding horizontal grid lines for better readability
plt.grid(axis='y', linestyle='--', alpha=0.7)

# Displaying the plot
plt.show()

In [None]:
import pandas as pd

# Count the number of medals for each country and each medal type
medals_count = pd.crosstab(df_olympic['Country'], df_olympic['Medal'])

# Merge the counts back into the original data on 'Country'
df_olympic = df_olympic.merge(medals_count, how='left', left_on='Country', right_index=True)

# The new columns are: 'Gold', 'Silver', 'Bronze'
print(df_olympic[['Country', 'Gold', 'Silver', 'Bronze']].head())


In [None]:
# Top athletes by total medals
athlete_medals = df_olympic.groupby("Athlete")["Medal"].count().reset_index()  # Grouping by athlete and counting medals
athlete_medals = athlete_medals.sort_values(by="Medal", ascending=False).head(10)  # Sorting and selecting top 10 athletes

# Top countries over time
df_olympic["Total Medals"] = 1  # Adding a helper column for medal count aggregation
country_medals = df_olympic.groupby(["Year", "Country"])["Total Medals"].count().reset_index()  # Grouping by year and country

def plot_top_athletes():
    """Function to visualize the top athletes by total medals."""
    plt.figure(figsize=(10, 5))
    sns.barplot(x="Medal", y="Athlete", data=athlete_medals, palette="viridis")  # Bar plot for top athletes
    plt.xlabel("Total Medals")  # X-axis label
    plt.ylabel("Athlete")  # Y-axis label
    plt.title("Top Athletes by Total Medals")  # Plot title
    plt.show()

def plot_top_countries():
    """Function to visualize the performance of top countries over time."""
    top_countries = df_olympic["Country"].value_counts().head(5).index  # Selecting top 5 countries
    top_countries_data = country_medals[country_medals["Country"].isin(top_countries)]  # Filtering data for top countries
    
    plt.figure(figsize=(12, 6))
    sns.lineplot(x="Year", y="Total Medals", hue="Country", data=top_countries_data, marker="o")  # Line plot over time
    plt.xlabel("Year")  # X-axis label
    plt.ylabel("Total Medals")  # Y-axis label
    plt.title("Top Countries' Performance Over Time")  # Plot title
    plt.legend(title="Country")  # Legend title
    plt.grid(True)  # Adding grid for better readability
    plt.show()

# Plot visualizations
plot_top_athletes()  # Display top athletes visualization
plot_top_countries()  # Display top countries performance over time


In [None]:
# Count how many medals of each type were awarded
medal_counts = df_olympic.groupby('Medal').size()

labels = ['Gold', 'Silver', 'Bronze']
values = [medal_counts.get(medal, 0) for medal in labels]

num_vars = len(labels)
angles = np.linspace(0, 2 * np.pi, num_vars, endpoint=False).tolist()
angles += angles[:1]  

values += values[:1]

fig, ax = plt.subplots(figsize=(8, 8), subplot_kw=dict(polar=True))
ax.fill(angles, values, color='skyblue', alpha=0.4)
ax.plot(angles, values, color='blue', linewidth=2)

ax.set_theta_offset(pi / 2)
ax.set_theta_direction(-1)
ax.set_rlabel_position(0)

plt.xticks(angles[:-1], labels, fontsize=12, fontweight='bold')
plt.yticks(color='grey', size=10)
plt.title('Radar Chart of Medal Counts by Type', size=15, fontweight='bold', pad=20)

plt.show()
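
The radar chart relies on repeating the first angle and the first value so that the outline closes into a polygon. A minimal sketch of that step, using made-up counts rather than the notebook's results (the names `closed_angles` and `closed_values` are hypothetical):

```python
import numpy as np

labels = ["Gold", "Silver", "Bronze"]
values = [10, 8, 6]  # Made-up counts, for illustration only

# One evenly spaced angle per label, without repeating the full circle
angles = np.linspace(0, 2 * np.pi, len(labels), endpoint=False).tolist()

# Append the first point again so the last segment returns to the start
closed_angles = angles + angles[:1]
closed_values = values + values[:1]

print(closed_angles)
print(closed_values)
```

Without the repeated point, `ax.plot` would leave a gap between the last label and the first one.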


# Data Preprocessing


In [None]:
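# Drop medal-count columns left over from the earlier merge so re-running does not create duplicates
df_olympic = df_olympic.drop(columns=['Gold', 'Silver', 'Bronze'], errors='ignore')
# Re-attach the per-country medal counts and sum them into a single total
df_olympic = df_olympic.merge(medals_count, how='left', left_on='Country', right_index=True)
df_olympic['Total_Medals_By_Country'] = df_olympic['Gold'] + df_olympic['Silver'] + df_olympic['Bronze']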


In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

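# Features: socio-economic indicators plus the engineered medal ratios.
# Note: if the two ratio features were computed from medal counts, they may leak target information.
X = df_olympic[['Population', 'GDP per Capita', 'Medals_Per_Capita', 'Medals_Per_GDP']]
y = df_olympic['Total_Medals_By_Country']  # Or whichever column is the correct target for your data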

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
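
As an alternative to scaling by hand, the scaler and the model can be chained so the scaling statistics are always learned from the training rows only. The following is a minimal sketch on synthetic data, assuming scikit-learn's `make_pipeline`; the names `X_demo`, `y_demo`, and `demo_pipeline` are hypothetical and not part of the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and target (not the Olympic data)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 4))
y_demo = X_demo @ np.array([3.0, -2.0, 0.5, 1.0]) + rng.normal(scale=0.1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# fit() learns the scaling statistics from the training rows only;
# score() reuses those statistics when transforming the test rows
demo_pipeline = make_pipeline(StandardScaler(), LinearRegression())
demo_pipeline.fit(X_tr, y_tr)
demo_r2 = demo_pipeline.score(X_te, y_te)
print(f"Demo R²: {demo_r2:.2f}")
```

This keeps the fit-on-train, transform-on-test order in one object, which makes it harder to apply the scaler to the wrong split by accident.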


# Machine Learning Models

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score

# Fit on the scaled training features, then predict on the scaled test features
lr_model = LinearRegression()
lr_model.fit(X_train_scaled, y_train)

y_pred_lr = lr_model.predict(X_test_scaled)

mae_lr = mean_absolute_error(y_test, y_pred_lr)
r2_lr = r2_score(y_test, y_pred_lr)

print(f'Linear Regression - MAE: {mae_lr:.2f}, R²: {r2_lr:.2f}')


In [None]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error, r2_score

dt_model = DecisionTreeRegressor(
    max_depth=5,             
    min_samples_split=10,     
    min_samples_leaf=5,      
    random_state=42
)

# Tree-based models do not depend on feature scale, so the unscaled features are used here
dt_model.fit(X_train, y_train)
y_pred_dt = dt_model.predict(X_test)

mae_dt = mean_absolute_error(y_test, y_pred_dt)
r2_dt = r2_score(y_test, y_pred_dt)

print(f'Decision Tree - MAE: {mae_dt:.2f}, R²: {r2_dt:.2f}')


In [None]:
from sklearn.ensemble import RandomForestRegressor

rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

y_pred_rf = rf_model.predict(X_test)

mae_rf = mean_absolute_error(y_test, y_pred_rf)
r2_rf = r2_score(y_test, y_pred_rf)

print(f'Random Forest - MAE: {mae_rf:.2f}, R²: {r2_rf:.2f}')
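
The three model cells repeat the same fit, predict, and score steps. One possible way to collect the metrics side by side is sketched below on synthetic data; `evaluate_models` is a hypothetical helper, not part of the notebook, and the demo data has no relation to the Olympic dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


def evaluate_models(models, X_train, X_test, y_train, y_test):
    """Fit each model and return one row of MAE and R² per model (hypothetical helper)."""
    rows = []
    for name, model in models.items():
        model.fit(X_train, y_train)
        preds = model.predict(X_test)
        rows.append({
            "Model": name,
            "MAE": mean_absolute_error(y_test, preds),
            "R2": r2_score(y_test, preds),
        })
    return pd.DataFrame(rows).set_index("Model")


# Synthetic regression data, used only to exercise the helper
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(300, 4))
y_demo = 2.0 * X_demo[:, 0] + X_demo[:, 1] + rng.normal(scale=0.5, size=300)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

demo_models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(max_depth=5, min_samples_leaf=5, random_state=42),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
}
results = evaluate_models(demo_models, X_tr, X_te, y_tr, y_te)
print(results.round(2))
```

A single table like this makes it easier to compare MAE and R² across models than reading three separate print statements.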


In [None]:
# Compare actual vs. predicted values, one panel per model
predictions = {
    'Linear Regression': y_pred_lr,
    'Decision Tree': y_pred_dt,
    'Random Forest': y_pred_rf,
}

fig, axes = plt.subplots(1, 3, figsize=(18, 5))
limits = [y_test.min(), y_test.max()]

for ax, (title, y_pred) in zip(axes, predictions.items()):
    ax.scatter(y_test, y_pred, alpha=0.7, edgecolors='k')
    ax.plot(limits, limits, 'r--', lw=2)  # Diagonal marks perfect predictions
    ax.set_title(title)
    ax.set_xlabel('Actual')
    ax.set_ylabel('Predicted')

plt.tight_layout()
plt.show()
