# Project Overview

Our project focuses on data from major domestic and international gymnastics competitions from the 2022 and 2023 seasons. 

For context on gymnastics terms, here is a guide to the apparatus abbreviations used in the data:<br />
BB = Balance Beam<br />
VT = Vault<br />
FX = Floor Exercise<br />
UB = Uneven Bars

Another helpful thing to keep in mind is how a routine is scored: Difficulty Score + Execution Score - Penalty = Final Score. For this project, we focus only on women's scores. As a brief overview of the technical components of our project, we will:<br /> 
● Use SQL and visualizations to identify the top 5 gymnasts who would optimize a team score (Fantasy Gymnastics!) <br /> 
● Use visualizations to explore difficulty vs execution tradeoffs to maximize score potential<br /> 
● Use neural networks and statistical models to predict scores<br /> 
● Analyze score distributions across athletes <br /> 
● Use Dash for users to view our visualizations <br /> 
● Use Dash, SQL, and visualizations for Monte Carlo Simulations<br /> 

<img src="Screenshot 2025-03-21 at 10.16.12 PM.png" alt="flowchart"  height="300">

Our project focuses on the concept of Fantasy Gymnastics. We set out to create a program that identifies the group of 5 athletes who would optimize success for the USA Olympic Women’s Artistic Gymnastics team. We created an analytics model that can be used to identify and compare the expected medal count in the 4 women's medal events (vault, uneven bars, balance beam, and floor exercise). It is built on a Monte Carlo simulation, a mathematical technique that uses repeated random sampling to predict possible outcomes. We run 1000 simulations in which gymnasts compete against each other, and use the results to find the best-performing gymnasts who should make the team!

A consistent theme in this project is being able to filter to see a particular country's results. By analyzing the score distributions across athletes and apparatuses, one can see each athlete's strengths and weaknesses. We also used PyTorch to predict scores, which is helpful for gymnasts looking to improve their routines. We wanted our project to be useful to users, which is why we incorporated a user-facing Dash app with interactive features.

## Check out the GitHub Repository

[Here's a link to our GitHub repo](https://github.com/megaminding?tab=repositories)

## Instructions

To run the project yourself, clone this repository to your local computer and move into it using the following line:
`git clone https://github.com/megaminding/gymnastics.git && cd gymnastics`

To access our functions and classes in a notebook, import them from the `myProject` module, for example with `import myProject`.

To run the project, download `index.ipynb` and run all cells.

# Let's Add Imports and Read in the Data

In [1]:
import pandas as pd
import sqlite3

import plotly.io as pio
# pio.renderers.default="iframe"

maindf = pd.read_csv("data_2022_2023.csv")
maindf


Unnamed: 0,LastName,FirstName,Gender,Country,Date,Competition,Round,Location,Apparatus,Rank,D_Score,E_Score,Penalty,Score
0,AAS,Fredrik,m,NOR,24-27 Feb 2022,2022 Cottbus World Cup,qual,"Cottbus, Germany",HB,18.0,3.9,8.266,,12.166
1,AAS,Fredrik,m,NOR,24-27 Feb 2022,2022 Cottbus World Cup,qual,"Cottbus, Germany",PB,23.0,3.9,6.900,,10.800
2,AAS,Fredrik,m,NOR,24-27 Feb 2022,2022 Cottbus World Cup,qual,"Cottbus, Germany",PH,33.0,4.2,6.666,,10.866
3,AAS,Fredrik,m,NOR,23-26 Feb 2023,2023 Cottbus World Cup,qual,"Cottbus, Germany",HB,39.0,4.6,6.700,,11.300
4,AAS,Fredrik,m,NOR,23-26 Feb 2023,2023 Cottbus World Cup,qual,"Cottbus, Germany",PH,44.0,4.4,7.800,,12.200
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
24429,ÖNDER,Ahmet,m,TUR,1-4 Jun 2023,2023 Tel Aviv World Challenge Cup,final,"Tel Aviv, Israel",FX,8.0,4.8,7.050,,11.850
24430,ÖNDER,Ahmet,m,TUR,1-4 Jun 2023,2023 Tel Aviv World Challenge Cup,qual,"Tel Aviv, Israel",FX,3.0,5.8,7.950,0.1,13.650
24431,ÖNDER,Ahmet,m,TUR,1-4 Jun 2023,2023 Tel Aviv World Challenge Cup,qual,"Tel Aviv, Israel",HB,10.0,5.0,6.250,,11.250
24432,ÖNDER,Ahmet,m,TUR,1-4 Jun 2023,2023 Tel Aviv World Challenge Cup,final,"Tel Aviv, Israel",PB,1.0,6.3,8.050,,14.350
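
As a quick check on the columns shown above, the final score should equal the difficulty score plus the execution score, minus any penalty. Here is a minimal sketch using two rows copied from the table. Treating a blank `Penalty` as zero is our assumption about how the dataset encodes "no deduction".

```python
import pandas as pd

# Two rows copied from the table above (one without a penalty, one with).
rows = pd.DataFrame({
    "D_Score": [3.9, 5.8],
    "E_Score": [8.266, 7.950],
    "Penalty": [float("nan"), 0.1],  # NaN stands for a blank penalty cell
    "Score":   [12.166, 13.650],
})

# Assumption: a missing penalty means no deduction, so fill it with zero.
rebuilt = rows["D_Score"] + rows["E_Score"] - rows["Penalty"].fillna(0.0)

# Compare with a small tolerance because these are floating-point sums.
matches = (rebuilt - rows["Score"]).abs() < 1e-6
print(matches.all())
```

The same comparison could be run on the full dataframe to flag rows where the published score does not match its components.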


## Let's Clean the Data

We'll focus on only women's data and change VT1 and VT2 to just VT.

In [3]:
import inspect
from myProject import data_cleaning
print(inspect.getsource(data_cleaning)) 

maindf = data_cleaning(maindf)
print(maindf)

def data_cleaning(maindf):
    """
    This function cleans the data by dropping rows with NA values in the key columns, relabeling apparatus values 'VT1' and 'VT2' as 'VT', and keeping only women gymnasts.

    Args:
        maindf: dataframe

    Returns:
        dataframe
    """
    maindf = maindf.dropna(subset=['Apparatus', 'Score', 'Country', 'D_Score', 'E_Score'])  # drop rows missing key values (returns a new frame, leaving the caller's data untouched)
    maindf = maindf[maindf['Gender'] == 'w'].copy()  # women's results only
    maindf['Apparatus'] = maindf['Apparatus'].replace({'VT1': 'VT', 'VT2': 'VT'})  # treat both vaults as 'VT'
    return maindf

         LastName  FirstName Gender Country                      Date  \
144    ABDELSALAM       Jana      w     EGY  29 Oct 2022 - 6 Nov 2022   
145    ABDELSALAM       Jana      w     EGY  29 Oct 2022 - 6 Nov 2022   
146    ABDELSALAM       Jana      w     EGY  29 Oct 2022 - 6 Nov 2022   

# First Technical Component: SQL Database 

Our first technical component is the SQL database. We created two main functions: the first filters by country and is used later for visualizations, and the second pivots the data and is used later to run Monte Carlo simulations.

The first function filters the data to a specified country. It also computes new variables such as the standard deviation of the score and the competition count. These are helpful in analyzing the data, since gymnasts with a lower standard deviation perform more consistently and gymnasts with a higher competition count have more experience. We use the resulting table to populate a visualization; the creation and analysis of the visualizations are described in a later section of this blog post.
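
To make these aggregates concrete, here is a small pandas sketch that computes the same four quantities on made-up scores. The toy names and values are assumptions for illustration only. Note that `ddof=0` gives the population standard deviation, which is what the expression `SQRT(AVG(x*x) - AVG(x)*AVG(x))` evaluates to.

```python
import pandas as pd

# Toy long-form scores; the names and values are made up for illustration.
scores = pd.DataFrame({
    "LastName":  ["A", "A", "A", "B"],
    "Apparatus": ["BB", "BB", "BB", "BB"],
    "Date":      ["d1", "d1", "d2", "d1"],
    "Score":     [12.0, 13.0, 14.0, 11.0],
})

summary = (
    scores.groupby(["LastName", "Apparatus"])
    .agg(
        PredictedScore=("Score", "mean"),
        Maxscore=("Score", "max"),
        # Population standard deviation, matching the SQL formula.
        StdDevScore=("Score", lambda s: s.std(ddof=0)),
        # Number of distinct competition dates.
        CompetitionsCount=("Date", "nunique"),
    )
    .reset_index()
)
print(summary)
```

A gymnast with a single score, like "B" here, gets a standard deviation of 0, which matches the rows with `CompetitionsCount` of 1 in our query output.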



In [None]:
with sqlite3.connect("gym") as conn:
    maindf.to_sql("gym", conn, if_exists = "replace", index = False) 
    

In [None]:
from myProject import query_gym_country_database
print(inspect.getsource(query_gym_country_database)) 

query_gym_country_database("NOR")


def query_gym_country_database(country):
    """
    This function queries the SQL database for one country and computes, per gymnast and apparatus, the 'PredictedScore' (average score), 'Maxscore', 'StdDevScore', and 'CompetitionsCount'.

    Args:
        country: country code, e.g. "NOR"

    Returns:
        returns df
    """
    # The ? placeholder lets sqlite3 bind the country value instead of pasting it into the SQL string
    cmd = \
    '''
    SELECT LastName, FirstName, Apparatus, AVG(Score) AS PredictedScore, MAX(Score) as Maxscore, 
    SQRT(AVG(score * score) - AVG(score) * AVG(score)) AS StdDevScore,
    COUNT(DISTINCT Date) AS CompetitionsCount
    FROM gym 
    WHERE Country = ?
    GROUP BY LastName, FirstName, Apparatus 
    ORDER BY Apparatus, Maxscore DESC
    '''
    with sqlite3.connect('gym') as conn:  # run the query while the connection is open
        df = pd.read_sql_query(cmd, conn, params=(country,))
    return df



Unnamed: 0,LastName,FirstName,Apparatus,PredictedScore,Maxscore,StdDevScore,CompetitionsCount
0,TRONRUD,Maria,BB,12.333,13.233,0.525935,10
1,MADSØ,Julie,BB,12.24975,12.633,0.284493,2
2,NEURAUTER,Mali,BB,11.50775,12.366,0.586156,4
3,ROENBECK,Marie,BB,11.3198,12.266,0.722424,4
4,KANTER,Mari,BB,11.4455,12.233,0.931899,7
5,TØSSEBRO,Juliane,BB,12.05,12.1,0.05,2
6,TOESSEBRO,Juliane,BB,11.388667,11.8,0.315591,3
7,MADSOE,Julie,BB,11.566,11.566,0.0,1
8,HALVORSEN,Selma,BB,9.9,9.9,0.0,1
9,TRONRUD,Maria,FX,11.375714,12.666,0.930657,6


## Pivoting using SQL
Our second function's purpose is to perform pivoting, meaning to transform the data from a long-form table to a wide-form table. This lets us consolidate the multiple rows of an athlete’s scores across different events into a single row, with the predicted score for each event in its own column. 
 
<img src="Screenshot 2025-03-20 at 11.53.06 PM.png" alt="pivoting diagram"  height="300">

Preparing the data in this format makes it easier for us to perform Monte Carlo simulations, which are described in a later section of this blog post.
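
For readers who have not seen a pivot before, here is a small pandas sketch of the same long-to-wide idea. This is not the SQL our function runs; the toy data is made up, and averaging the scores per event is an assumption based on the `PredictedScore` column names.

```python
import pandas as pd

# Long form: one row per routine (toy values for illustration).
long_df = pd.DataFrame({
    "LastName":  ["A", "A", "A", "B", "B"],
    "Apparatus": ["BB", "BB", "VT", "BB", "FX"],
    "Score":     [12.0, 13.0, 14.0, 11.0, 12.5],
})

# Wide form: one row per gymnast, one averaged column per apparatus.
wide_df = (
    long_df.pivot_table(index="LastName", columns="Apparatus",
                        values="Score", aggfunc="mean")
    .add_suffix("_PredictedScore")
    .reset_index()
)
wide_df.columns.name = None  # drop the leftover "Apparatus" axis label
print(wide_df)
```

Events a gymnast never competed in come out as missing values, just like the empty cells in the pivoted table.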

In [None]:
import sqlite3
import pandas as pd

from myProject import query_pivoted_database
print(inspect.getsource(query_pivoted_database)) 

pivoted_table = query_pivoted_database()
(pivoted_table)


def query_pivoted_database():
    """
    This function connects to the SQL database and pivots it from a long-form table to a wide-form table, consolidating each athlete’s rows across different events into a single row with one predicted score column per event.

    Args:
        None

    Returns:
        df
    """
    with sqlite3.connect('gym') as conn:  # pivot the long-form 'gym' table into one row per athlete

        cmd = f'''
        WITH AthleteScores AS (
            SELECT 
                LastName, 
                FirstName,
                Apparatus,
                AVG(Score) AS PredictedScore,
                COUNT(DISTINCT Date) AS CompetitionsCount,
         

Unnamed: 0,LastName,FirstName,Country,BB_PredictedScore,VT_PredictedScore,FX_PredictedScore,UB_PredictedScore
0,ABDELSALAM,Jana,EGY,11.678750,12.3665,11.50800,10.775857
1,ABDULLAHI,Ayesha,GBR,10.200000,,,10.200000
2,ABEYRATNE,Kumudi Imanya,SRI,9.100000,,7.35000,5.500000
3,ABOELHASAN,Jana,EGY,10.500000,11.6000,11.45800,10.900000
4,ABREU,Yamilet,DOM,10.100000,12.5670,12.36700,11.733000
...,...,...,...,...,...,...,...
856,ZIVADINOVIC,Kristina,SRB,9.333167,,11.35825,
857,ZLOBEC,Evandra,CAN,11.200000,11.9415,12.65000,11.933500
858,ZONNEVELD,Maya,CAN,,,,10.550000
859,ZUO,Tong,CHN,13.288667,12.8330,13.16600,13.819600


# Second Technical Component: Visualization

Our second technical component is a set of interactive visualizations. We created several, including a scatterplot filtered to a chosen country, country medal counts from the Monte Carlo simulations, and a plot exploring difficulty vs execution tradeoffs to maximize score potential.

### Scatterplot of gymnasts' predicted scores in a certain country
For this we used the country-filtering SQL query, which allowed us to create a scatterplot of gymnasts' predicted (average) scores across the different apparatuses for a given country. Because Fantasy Gymnastics is about picking the best gymnasts, this scatterplot gives us a way to see the highest performers across the 4 events. In the scatterplot, each gymnast is represented by a color and the size of each bubble indicates competition count.





In [None]:
import plotly
from plotly import express as px
import plotly.io as pio

from myProject import scatterplot_of_country
print(inspect.getsource(scatterplot_of_country)) 
scatterplot_of_country("USA")

def scatterplot_of_country(country):
    """
    This function shows a scatterplot of the gymnast's predicted score. 

    Args:
        country

    Returns:
        shows figure
    """
    fig = px.scatter(query_gym_country_database(country), # shows a scatterplot of the gymnast's predicted score. 
                x = "Apparatus", 
                y = "PredictedScore", 
                color="LastName",
                size='CompetitionsCount', hover_data=['PredictedScore'])


    fig.update_layout(
        title=f"Scatterplot of Gymnasts' Predicted Score in Country '{country}'",  # plot title names the selected country
        yaxis_title="Predicted Score of Gymnasts",
        )


    fig.show()



### Scatterplot of difficulty vs execution
For this we created a scatterplot showing the relationship between D score and E score to identify optimal performance zones. There is a somewhat positive association on every apparatus, although the fitted trendlines differ in slope and y-intercept from one apparatus to another.
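
To put numbers on the slopes and intercepts, one option is to fit a straight line per apparatus. The sketch below does this with `np.polyfit` on made-up values; it is an illustration of the idea rather than the fit Plotly draws (Plotly's `trendline="ols"` relies on the `statsmodels` package).

```python
import numpy as np
import pandas as pd

# Toy routines; the values are made up so the fitted lines are easy to verify.
toy = pd.DataFrame({
    "Apparatus": ["BB"] * 3 + ["VT"] * 3,
    "D_Score":   [4.0, 5.0, 6.0, 4.0, 5.0, 6.0],
    "E_Score":   [7.0, 7.5, 8.0, 8.0, 8.2, 8.4],
})

# Least-squares line E = slope * D + intercept, fitted separately per apparatus.
fits = {}
for apparatus, group in toy.groupby("Apparatus"):
    slope, intercept = np.polyfit(group["D_Score"], group["E_Score"], deg=1)
    fits[apparatus] = (slope, intercept)

for apparatus, (slope, intercept) in fits.items():
    print(f"{apparatus}: slope={slope:.2f}, intercept={intercept:.2f}")
```

A steeper slope means execution scores rise faster with difficulty on that apparatus.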



In [None]:
import plotly
from plotly import express as px
import plotly.io as pio

from myProject import difficultyVsExecutionPlot
print(inspect.getsource(difficultyVsExecutionPlot)) 
difficultyVsExecutionPlot(maindf)

def difficultyVsExecutionPlot(maindf):
    """
    This function  is a scatterplot that shows the relationship between difficulty and execution

    Args:
        maindf: dataframe

    Returns:
        shows figure
    """
    fig = px.scatter(maindf,  #scatterplot that shows the relationship between difficulty and execution
                  x="D_Score", 
                  y="E_Score", 
                  color="Apparatus", 
                  hover_data=['LastName', 'FirstName'],
                  title="Difficulty vs Execution Tradeoff Across Apparatuses",
                  trendline="ols")

    fig.update_layout(
        xaxis_title="Difficulty Score (D_Score)",  # each point is one routine, not an average
        yaxis_title="Execution Score (E_Score)",
        legend_title="Apparatus",
    )

    fig.show()



### Monte Carlo simulations
For this we used the pivoted SQL table, which gives us a single row per gymnast with a predicted score for each event in its own column. For each event and gymnast, we computed a simulated score by adding random noise to their predicted score. We then took the top three simulated scores in each event and awarded gold, silver, and bronze medals. After repeating this 1000 times across the 4 events, we have medal counts for each gymnast, which we use to pick the gymnasts predicted to bring a country's team the most success.
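
The steps above can be sketched in a compact, vectorized form. This is a simplified illustration rather than our `monte_carlo` function: the name `simulate_medals`, the `seed` parameter, and the toy table are assumptions, while the noise standard deviation of 0.1 and the 4 events follow the description above.

```python
import numpy as np
import pandas as pd

def simulate_medals(df, events=("BB", "VT", "FX", "UB"),
                    n_sims=1000, noise_sd=0.1, seed=0):
    """Count simulated gold/silver/bronze medals per row of a pivoted table."""
    rng = np.random.default_rng(seed)
    medals = pd.DataFrame(0, index=df.index, columns=["gold", "silver", "bronze"])
    for event in events:
        # Only gymnasts with a predicted score compete in this event.
        competitors = df[f"{event}_PredictedScore"].dropna()
        if competitors.empty:
            continue
        # One row per simulation, one column per competitor.
        noise = rng.normal(0.0, noise_sd, size=(n_sims, len(competitors)))
        simulated = competitors.to_numpy() + noise
        # Column positions ordered from highest to lowest simulated score.
        order = np.argsort(-simulated, axis=1)
        for place, medal in enumerate(["gold", "silver", "bronze"]):
            if place >= len(competitors):
                break
            winners = competitors.index[order[:, place]]
            counts = pd.Series(winners).value_counts()
            medals[medal] = medals[medal].add(counts, fill_value=0).astype(int)
    return df.join(medals)

# Toy pivoted table with well-separated scores (made-up values).
toy = pd.DataFrame({
    "LastName": ["A", "B", "C", "D"],
    "BB_PredictedScore": [15.0, 14.0, 13.0, 12.0],
    "VT_PredictedScore": [15.0, 14.0, 13.0, np.nan],
    "FX_PredictedScore": [15.0, 14.0, 13.0, 12.0],
    "UB_PredictedScore": [15.0, 14.0, 13.0, 12.0],
})
result = simulate_medals(toy, n_sims=200)
print(result[["LastName", "gold", "silver", "bronze"]])
```

Counting medals by row index rather than by name avoids any ambiguity between gymnasts who share a surname, and fixing the seed makes a run reproducible.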



In [None]:
import inspect
import numpy as np
from myProject import monte_carlo
print(inspect.getsource(monte_carlo)) 

pivoted_table = monte_carlo(pivoted_table)

def monte_carlo(df):
    """
    This function allows users to run Monte Carlo simulations to see the expected medal count of each gymnast

    Args:
        df: dataframe
      

    Returns:
        dataframe
    """
    list_of_events = ['BB', 'VT', 'FX', 'UB']

    num_simulations = 1000

    df['gold'] = 0
    df['silver'] = 0
    df['bronze'] = 0

    for i in range(num_simulations): 
        for event in list_of_events:
            event_data = df[df[f'{event}_PredictedScore'].notna()].copy() #only select data that is the specific event type
            event_data['simulated_score'] = event_data[f'{event}_PredictedScore'] + np.random.normal(0, 0.1, size=len(event_data)) #add noise to create simulated score
            event_data = event_data.sort_values(by='simulated_score', ascending=False).reset_index(drop=True) #sort simulated scores from highest to lowest

            #award medals to top three scorers in simulated score
            df.loc[df['LastName'] == event_data.loc[0,

In [None]:
import inspect
from myProject import medal_count_by_country
print(inspect.getsource(medal_count_by_country)) 
medal_count_by_country(pivoted_table, "USA")


def medal_count_by_country(df, country):
    """
    This function visualizes the athletes who were able to win medals for their country in the simulation

    Args:
        df: dataframe with 'gold', 'silver', and 'bronze' columns from monte_carlo
        country
      

    Returns:
        plotly figure
    """
    results_by_country = (df[df['Country']==country]).head()  # first five rows for this country
    print(results_by_country)

    fig = px.histogram(results_by_country, 
                    x="LastName", 
                    y=["gold", 'silver', 'bronze'], 
                    title=f"Medal Count per Athlete for Country '{country}'",
                    color_discrete_sequence=['gold', 'silver', '#CD7F32']
                    )

    fig.update_layout(
        xaxis_title="Gymnast Last Name", 
        yaxis_title="Predicted Olympic Medal Count out of 1000 Simulations",
        legend_title="Medal Type",
    )

    return fig

     LastName FirstName Country  BB_PredictedScore  VT_PredictedScore  \
64      BILES    Simone     USA          14.599857          14.983111   
853

# Third Technical Component: Dash App

Our third technical component is our Dash app. It has two main features: a visualization that can be toggled between scatterplot and box plot form, and user input fields for running Monte Carlo simulations.

### Visualizations for Gymnastics Score Analysis 
This visualization allows users to select from a dropdown menu of countries to see the Gymnastics Score Analysis for a particular country. Here's an example of the dropdown code in use. 

```
  dcc.Dropdown(
            id='country-dropdown',
            options=[{'label': country, 'value': country} for country in df['Country'].unique()],
            value='USA',  # Default selection
            clearable=False
        ),
```

Users can then toggle between a scatterplot and a box plot to see the visualized data. We implemented this toggle using the `dcc.RadioItems` component from Dash. 
```
        dcc.RadioItems(
            id="plot-type",
            options=[
                {"label": "Scatter Plot", "value": "scatter"},
                {"label": "Box Plot", "value": "box"}
            ],
            value="scatter",  # Default selection
            inline=True
        ),
```

### Adding hypothetical gymnasts to the Monte Carlo simulation
Because our dataset is limited, we wanted users to have the option of adding more gymnasts to see how they would perform in comparison to the gymnasts already in our data. This is helpful in determining whether they could be a good fit for the USA Olympic team. We added user input fields asking for the athlete's name, country, and expected score on each of the four events. After hitting submit, users see a visualization of whether their gymnast wins any medals when competing at a global-scale event against gymnasts from different countries.

For the submit button, we used `if n_clicks > 0:` to check whether the button was clicked. If it was, we add the new gymnast's data to the pivoted table and rerun the simulations on the updated table.
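
The step of adding a gymnast to the pivoted table might look roughly like the sketch below. It is a hypothetical stand-in for our `add_user_entry` function, whose source is not shown in this post; the name `add_user_entry_sketch`, the upper-casing of the last name, and the toy table are all assumptions.

```python
import pandas as pd

def add_user_entry_sketch(pivoted, first_name, last_name, country, bb, vt, fx, ub):
    """Return a copy of the pivoted table with one extra gymnast appended."""
    new_row = pd.DataFrame([{
        "LastName": last_name.strip().upper(),  # assumption: match the dataset's upper-case surnames
        "FirstName": first_name.strip(),
        "Country": country,
        "BB_PredictedScore": bb,
        "VT_PredictedScore": vt,
        "FX_PredictedScore": fx,
        "UB_PredictedScore": ub,
    }])
    # concat builds a new frame, so the original table is left unchanged.
    return pd.concat([pivoted, new_row], ignore_index=True)

# Toy pivoted table with one made-up gymnast.
pivoted = pd.DataFrame({
    "LastName": ["EXAMPLE"],
    "FirstName": ["Ann"],
    "Country": ["USA"],
    "BB_PredictedScore": [13.0],
    "VT_PredictedScore": [13.5],
    "FX_PredictedScore": [12.8],
    "UB_PredictedScore": [13.1],
})
updated = add_user_entry_sketch(pivoted, "Test", "gymnast", "USA", 14.0, 14.2, 13.9, 14.1)
print(updated)
```

Returning a new table instead of modifying the original keeps each button click independent of the previous one.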


In [None]:
import pandas as pd
import plotly.express as px
import dash
from dash import Dash, dcc, html, Input, Output, State, callback
# Import the function from visualizations.py
from myProject import scatterplot_by_country
from myProject import query_pivoted_database
from myProject import add_user_entry
from myProject import monte_carlo
from myProject import medal_count_by_country

def DashApp():
    '''
    Dash App to show visualizations of gymnastics
    
    First one is the option for either a scatterplot or a box plot
    
    Second one is user submission to enter a hypothetical athlete, and to see their medal count within their country
    '''
    # Load dataset to get the list of unique countries
    df = pd.read_csv("data_2022_2023.csv")
    df['Country'] = df['Country'].str.strip().str.upper()  # Normalize country names

    # Initialize Dash app
    app = dash.Dash(__name__)

    # Layout
    app.layout = html.Div([
        html.H1("Gymnastics Score Analysis"),
        
        html.Label("Select a Country:"),
        dcc.Dropdown(
            id='country-dropdown',
            options=[{'label': country, 'value': country} for country in df['Country'].unique()],
            value='USA',  # Default selection
            clearable=False
        ),

        
        html.Label("Select Plot Type:"),
        dcc.RadioItems(
            id="plot-type",
            options=[
                {"label": "Scatter Plot", "value": "scatter"},
                {"label": "Box Plot", "value": "box"}
            ],
            value="scatter",  # Default selection
            inline=True
        ),

        dcc.Graph(id='country-plot'),  
        html.H1("Enter a Hypothetical Athlete to Run Simulations on Country Medal Count"),
        
        html.P("Your gymnast's first name:"), 
        dcc.Textarea(
            id='FirstName',
            style={'width': 500, 'height': 20}, 
        ),
        html.P("Your gymnast's last name:"), 
        dcc.Textarea(
            id='LastName',
            style={'width': 500, 'height': 20}, 
        ),
        html.P("Your gymnast's country:"),
        dcc.Dropdown(
            id='Country',
            options=[{'label': country, 'value': country} for country in df['Country'].unique()],
            value='USA', 
            clearable=False
        ),
        html.P("Your gymnast's expected balance beam score: "),
        dcc.Textarea(
            id='BB_PredictedScore',
            style={'width': 500, 'height': 20},
        ),
        html.P("Your gymnast's expected vault score: "),
        dcc.Textarea(
            id='VT_PredictedScore',
            style={'width': 500, 'height': 20},
        ),
        html.P("Your gymnast's expected floor exercise score: "),
        dcc.Textarea(
            id='FX_PredictedScore',
            style={'width': 500, 'height': 20},
        ),
        html.P("Your gymnast's expected uneven bars score: "),
        dcc.Textarea(
            id='UB_PredictedScore',
            style={'width': 500, 'height': 20},
        ),
        html.Button('Submit', id='submit-button', n_clicks=0,  #keeping track of when the user clicked on the button
                        style={ 
            'backgroundColor': '#643843', #changing background color to dark pink
            'color': 'white', #text is white
            'padding': '10px 20px', #padding for better design
            'borderRadius': '5px', #more rounded edges
        }),
        html.Div(id='textarea-output', style={'whiteSpace': 'pre-line', 'margin': 40, 'border': 50}), #margin for more white space
        dcc.Graph(id='medals-plot')  # This graph updates based on the dropdown & toggle
    ], style={'whiteSpace': 'pre-line', 'margin': 40, 'border': 50})

    # Callback to update the plot based on user selection
    @app.callback(
        dash.Output('country-plot', 'figure'),
        [dash.Input('country-dropdown', 'value'),
        dash.Input('plot-type', 'value')]
    )
    def update_plot(selected_country, plot_type):
        filtered_data = df[df['Country'] == selected_country]

        if filtered_data.empty:
            return px.scatter(title=f"No Data Available for {selected_country}")

        if plot_type == "scatter":
            return scatterplot_by_country(selected_country)  # Calls scatter function
        else:
            # Restore the original Box Plot
            fig = px.box(
                filtered_data, 
                x="Apparatus",
                y="Score", 
                color="Apparatus",
                title=f"Gymnastics Score Distribution for {selected_country}",
                labels={"Apparatus": "Event", "Score": "Final Score"},
                points="all"  # Show all points (outliers included)
            )

            # Layout adjustments for spacing & readability
            fig.update_layout(
                xaxis={'tickangle': -45},  # Rotate event labels for better spacing
                xaxis_title="Event",
                yaxis_title="Final Score",
                margin=dict(l=40, r=40, t=60, b=120)
            )

            # Restore annotation box for explaining Box Plot
            fig.add_annotation(
                x=0.5, y=-0.2,
                text="🔹 Box represents the middle 50% of scores (Q1 to Q3).<br>"
                    "🔹 Line inside the box = Median (middle score).<br>"
                    "🔹 Whiskers extend to non-outlier min/max scores.<br>"
                    "🔹 Dots outside whiskers = Outliers (exceptionally high/low scores).",
                showarrow=False,
                xref="paper", yref="paper",
                font=dict(size=14, color="black"),
                align="center",
                bordercolor="black",
                borderwidth=2,
                bgcolor="white",
                opacity=0.95
            )

            return fig
    @app.callback(
        Output('medals-plot', 'figure'),
        Input('submit-button', 'n_clicks'), 
        State('FirstName', 'value'), 
        State('LastName', 'value'), 
        State('Country', 'value'), 
        State('BB_PredictedScore', 'value'), 
        State('VT_PredictedScore', 'value') ,
        State('FX_PredictedScore', 'value') ,
        State('UB_PredictedScore', 'value') ,
    )

    def update_medals_plot(n_clicks, FirstName, LastName, Country, BB_PredictedScore, VT_PredictedScore, FX_PredictedScore, UB_PredictedScore):
        if n_clicks == 0:
            return px.scatter(title="No data available yet. Please fill in the athlete's details.")
        
        if None in [FirstName, LastName, Country, BB_PredictedScore, VT_PredictedScore, FX_PredictedScore, UB_PredictedScore]:
            return px.scatter(title="Please fill in all fields.")


        if n_clicks > 0:
            try:
                # Text fields arrive as strings; an empty string counts as 0.0
                BB_PredictedScore = float(BB_PredictedScore) if BB_PredictedScore else 0.0
                VT_PredictedScore = float(VT_PredictedScore) if VT_PredictedScore else 0.0
                FX_PredictedScore = float(FX_PredictedScore) if FX_PredictedScore else 0.0
                UB_PredictedScore = float(UB_PredictedScore) if UB_PredictedScore else 0.0
            except ValueError:
                # Non-numeric input: show a message instead of raising inside the callback
                return px.scatter(title="Scores must be numbers, for example 13.5.")

            pivoted_database = query_pivoted_database()
            new_pivoted_database = add_user_entry(pivoted_database,FirstName, LastName, Country, BB_PredictedScore, VT_PredictedScore, FX_PredictedScore, UB_PredictedScore)
            medal_pivoted_database = monte_carlo(new_pivoted_database)
            return medal_count_by_country(medal_pivoted_database, Country)  # Calls medal count by country
        


    if __name__ == '__main__':
        app.run(debug=True, port=8051)  # app.run is the current name for the older app.run_server
DashApp()

Selected Country: USA
Filtered Data: 3352 rows


# Fourth Technical Component: PyTorch

Here we show the code snippets for PyTorch. First, we load the data and convert the `Event` column into numerical values using `LabelEncoder` to prepare it for model training. Next, we set `X` to contain all features except the `Score` column and `y` to contain the target variable, `Score`. We then split the data into training and testing sets in an 80-20 ratio using `train_test_split`, standardize the features, convert everything to tensors, and wrap the tensors in `DataLoader` objects for batch processing.

We define the neural network as the `OlympicNN` class and initialize it. We then train it for 1000 epochs using the loss function `nn.MSELoss` and the optimizer `optim.Adam`, and evaluate it on the test set with mean squared error.

Feature Standardization:

Standard scaling is applied using `StandardScaler` so that all features are on a similar scale, which improves model performance. The scaler is fit on the training set only and then applied to the test set.
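
To show what standard scaling does to the numbers, here is a minimal NumPy sketch of the same calculation on made-up values. The key point is that the mean and standard deviation come from the training rows only and are then reused for the test rows.

```python
import numpy as np

# Toy feature matrices (made-up values): rows are routines, columns are features.
X_train = np.array([[4.0, 0.0],
                    [5.0, 1.0],
                    [6.0, 2.0]])
X_test = np.array([[5.5, 3.0]])

# Statistics are computed from the training set only.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)  # population std (ddof=0), as StandardScaler uses

# Apply the same shift and scale to both sets.
X_train_scaled = (X_train - mean) / std
X_test_scaled = (X_test - mean) / std

print(X_train_scaled.mean(axis=0))  # close to 0 for every feature
print(X_train_scaled.std(axis=0))   # close to 1 for every feature
print(X_test_scaled)
```

Reusing the training statistics keeps information about the test set from leaking into the model's inputs.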

```
import pandas as pd
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

# loading dataset
data = pd.read_csv('gymnastics.csv') 


# encoding categorical col into numerical vals 
label_e = LabelEncoder() 
data['Event'] = label_e.fit_transform(data['Event'])

# select features (X) and target (y) as NumPy arrays
X = data.drop('Score', axis=1).values
y = data['Score'].values


# split into training and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


# standardize the features (fit on the training set only)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train) 
X_test = scaler.transform(X_test) 


# convert data to torch tensors
# y_train and y_test are already NumPy arrays, so they are passed directly (no .values)
X_train_tensor = torch.tensor(X_train, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train, dtype=torch.float32).view(-1, 1)

X_test_tensor = torch.tensor(X_test, dtype=torch.float32)
y_test_tensor = torch.tensor(y_test, dtype=torch.float32).view(-1, 1)


# datasets and dataloaders to help with batch processing 
train_data = TensorDataset(X_train_tensor, y_train_tensor) 
test_data = TensorDataset(X_test_tensor, y_test_tensor) 

# note: the training loop below feeds the full training tensor each epoch;
# these loaders are available for mini-batch training and evaluation
train_load = DataLoader(train_data, batch_size=32, shuffle=True) 
test_load = DataLoader(test_data, batch_size=32, shuffle=False)

#create neural network model 
class OlympicNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(OlympicNN, self).__init__()
        self.hidden = nn.Linear(input_size, hidden_size)  # Hidden layer
        self.output = nn.Linear(hidden_size, output_size)  # Output layer
        
    def forward(self, x):
        x = torch.relu(self.hidden(x))  # Apply ReLU activation function to the hidden layer
        x = self.output(x)  # Output layer
        return x


# Initialize the model
input_size = X_train.shape[1]  # Number of features
hidden_size = 64  # Size of the hidden layer
output_size = 1  # Output size depending on target variable 

model = OlympicNN(input_size, hidden_size, output_size)

# Loss function and optimizer
criterion = nn.MSELoss()  # Mean Squared Error for regression, or CrossEntropyLoss for classification
optimizer = optim.Adam(model.parameters(), lr=0.001)


# train model
num_epochs = 1000  # adjusting this based on convergence

for epoch in range(num_epochs):
    model.train() # setting model to training mode 
    # Forward pass
    predictions = model(X_train_tensor)
    
    # Compute the loss
    loss = criterion(predictions, y_train_tensor)
    
    # Backward pass
    optimizer.zero_grad()  # Zero the gradients before backward pass
    loss.backward()  # Backpropagation
    
    # Update the weights
    optimizer.step()
    
    # Print the loss every 100 epochs (for monitoring progress)
    if (epoch + 1) % 100 == 0:
        print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}')


#evaluate model 
# Test the model
model.eval()  # Set the model to evaluation mode
with torch.no_grad():  # We don't need gradients for evaluation
    y_pred = model(X_test_tensor)
    
# Convert predictions to numpy for easy evaluation
y_pred_np = y_pred.numpy()

# Calculate Mean Squared Error (or any other metric)
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y_test, y_pred_np)
print(f'Mean Squared Error on Test Set: {mse:.4f}')

```

Here's a code snippet for how we created a visualization on predicted vs actual values. 



```
import matplotlib.pyplot as plt

# create visualization of results
# Plot actual vs predicted values (for regression tasks)
plt.scatter(y_test, y_pred_np, color='blue', label='Predicted vs Actual')
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], color='red', label='Perfect Fit')
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.legend()
plt.show()


#save model
torch.save(model.state_dict(), 'olympic_nn_model.pth')
```

Here's a picture of the visualization for Comparing E_Score, D_Score, and Overall Score in Gymnastics Dataset: <br>
<img src="Screenshot 2025-03-21 at 10.03.42 PM.png" alt="visualization for Comparing E_Score, D_Score, and Overall Score in Gymnastics Dataset"  height="300">

# Conclusion and discussion of ethical ramifications

In conclusion, our project covered four different technical components, which helped us learn a lot! We built on our Dash skills by learning how to incorporate user input fields and plug them into functions that build visualizations. Learning how to package our functions as an importable module was another challenge, but a rewarding one. We had a lot of success coming up with fun and interactive visualizations that would be helpful to any gymnast or sports lover, and we had a lot of fun with this project since gymnastics is such a widely loved sport. 

In terms of challenges, we faced a limited dataset, which made it hard to come up with interesting analyses. We aimed to build a program that accepted inputs like the apparatus (floor, beam, vault, bars), the athlete's skill level, and risk. However, after searching the web, we found it difficult to locate a complete dataset that included these parameters. Most datasets only included the athlete’s score for each apparatus and how that score was made up of difficulty and execution points.

Although we weren’t able to meet all of our initial goals, we were able to pivot and create new goals, like analyzing the relationship between difficulty and execution scores. Another way in which we were able to build upon our previous goal was to redirect ourselves into the world of fantasy gymnastics. 

In terms of ethical ramifications, gymnasts who don’t have access to technology may be harmed. As part of the technological divide, this project might widen the gap between gymnasts of varying socioeconomic statuses by giving a further advantage to gymnasts with more resources. However, technology is fairly accessible nowadays, so we don’t see this as a major concern. 

This project would help make the world a better place. Gymnasts who can’t afford coaches, or who don’t have the time to analyze their performance themselves, can use technology like this to analyze gymnastics data instead.



