Group: Paolo Aquino, Mattia Christen

### Library import

In [6]:
#Import library
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go

from typing import Final
from plotly.subplots import make_subplots

### Importing Data From CSV

In [7]:
# Import Data
# https://www.kaggle.com/datasets/mysarahmadbhat/online-chess-games?resource=download
# Load the dataset
CSV: Final = pd.read_csv("chess_games.csv", low_memory=False)

### DataSet: Columns & Shape

In [8]:
print(CSV.columns)
print(CSV.shape)

Index(['game_id', 'rated', 'turns', 'victory_status', 'winner',
       'time_increment', 'white_id', 'white_rating', 'black_id',
       'black_rating', 'moves', 'opening_code', 'opening_moves',
       'opening_fullname', 'opening_shortname', 'opening_response',
       'opening_variation'],
      dtype='object')
(20058, 17)


---
### **Understanding the Dataset**

The `describe()` function provides a quick overview of the dataset, helping to:
- Identify central tendencies (mean, median).
- Understand variability (standard deviation).
- Detect potential outliers (minimum, maximum, and percentiles).

#### **Key Metrics**:
- **`count`**: Total non-missing entries in the column.
- **`mean`**: Average value of the column.
- **`min`/`max`**: Smallest and largest values in the column.

#### **Additional Metrics**:
- **`std` (Standard Deviation)**:
  - Measures how spread out the data is around the mean.
  - A smaller value means data is tightly clustered, while a larger value indicates wider variability.
  - Example: For `turns`, `std = 33.57`, so game lengths typically differ from the mean of 60.47 by roughly 34 turns.

- **`25%`, `50%`, `75%` (Percentiles/Quartiles)**:
  - Divide the data into four equal parts.
  - `25%`: 25% of values fall below this (e.g., `white_rating = 1398`).
  - `50%`: The median; 50% of values are above and below this (e.g., `turns = 55`).
  - `75%`: 75% of values fall below this (e.g., `opening_moves = 6`). 
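
To make these definitions concrete, the following minimal sketch recomputes the same statistics on a small made-up series (the values are illustrative and not taken from the dataset). It assumes pandas' defaults: `std` is the sample standard deviation (`ddof=1`) and percentiles use linear interpolation.

```python
import pandas as pd

# Hypothetical toy data standing in for a numeric column such as `turns`
toy = pd.Series([10, 20, 30, 40, 100], name="toy_turns")

summary = toy.describe()

# describe() reports the sample standard deviation (ddof=1)
sample_std = toy.std(ddof=1)

# Quartiles are the 0.25 / 0.50 / 0.75 quantiles; the 50% value is the median
q1, median, q3 = toy.quantile([0.25, 0.50, 0.75])

print(summary)
print(f"std={sample_std:.2f}, Q1={q1}, median={median}, Q3={q3}")
```

Note how the single large value (100) pulls the mean above the median, which is one reason to read both.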

In [9]:
#Describe

CSV.describe()

| | game_id | turns | white_rating | black_rating | opening_moves |
|---|---|---|---|---|---|
| count | 20058.0 | 20058.0 | 20058.0 | 20058.0 | 20058.0 |
| mean | 10029.5 | 60.465999 | 1596.631868 | 1588.831987 | 4.816981 |
| std | 5790.390185 | 33.570585 | 291.253376 | 291.036126 | 2.797152 |
| min | 1.0 | 1.0 | 784.0 | 789.0 | 1.0 |
| 25% | 5015.25 | 37.0 | 1398.0 | 1391.0 | 3.0 |
| 50% | 10029.5 | 55.0 | 1567.0 | 1562.0 | 4.0 |
| 75% | 15043.75 | 79.0 | 1793.0 | 1784.0 | 6.0 |
| max | 20058.0 | 349.0 | 2700.0 | 2723.0 | 28.0 |


---
### **Data Overview: Column Types**

The table provides an overview of the columns in our dataset along with their respective data types.

- **Integer Columns**: 
  - `game_id`: Unique identifier for each game.
  - `turns`: Number of turns in the game.
  - `white_rating`: Rating of the white player.
  - `black_rating`: Rating of the black player.
  - `opening_moves`: Number of moves in the opening phase.

- **Boolean Columns**: 
  - `rated`: Indicates if the game was rated (True/False).

- **String Columns**: 
  - `victory_status`: Outcome of the game (e.g., mate, resignation, etc.).
  - `winner`: The winning player (e.g., white, black, or draw).
  - `time_increment`: The time increment for the game (e.g., 10+0 for 10 minutes no increment).
  - `white_id`: Identifier for the white player.
  - `black_id`: Identifier for the black player.
  - `moves`: List or sequence of moves played in the game.
  - `opening_code`: The code of the chess opening (e.g., "C20").
  - `opening_fullname`: Full name of the chess opening.
  - `opening_shortname`: Shortened name or abbreviation of the opening.
  - `opening_response`: Likely the initial moves by the opponent.
  - `opening_variation`: Variation of the chess opening.
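
As a cross-check on the listing above, pandas can classify each column through its dtype instead of inspecting the first value. The sketch below uses a tiny made-up frame (column names borrowed from the dataset, values invented) and a hypothetical helper `kind`; it is one possible approach, not the method used in the cell that follows.

```python
import pandas as pd
from pandas.api import types as ptypes

# Hypothetical three-row sample; values are invented for illustration
sample = pd.DataFrame({
    "turns": [13, 16, 61],
    "rated": [False, True, True],
    "winner": ["White", "Black", "White"],
})

def kind(column: pd.Series) -> str:
    """Classify a column as boolean, integer, string, or other."""
    if ptypes.is_bool_dtype(column):
        return "boolean"
    if ptypes.is_integer_dtype(column):
        return "integer"
    if ptypes.is_string_dtype(column):
        return "string"
    return "other"

kinds = {name: kind(sample[name]) for name in sample.columns}

print(sample.dtypes)
print(kinds)
```

Checking dtypes covers the whole column, whereas looking at the first non-missing value only reflects a single row.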

In [10]:
#Data Overview

df = CSV.copy()

# Prepare table data
table_data = {
    "Column": df.columns,
    "Data Type": [
        str(type(df[col].dropna().iloc[0])) if not df[col].dropna().empty else "None" 
        for col in df.columns
    ]
}

# Split the data into two parts
split_idx = len(table_data["Column"]) // 2
table_data_1 = {
    "Column": table_data["Column"][:split_idx],
    "Data Type": table_data["Data Type"][:split_idx]
}
table_data_2 = {
    "Column": table_data["Column"][split_idx:],
    "Data Type": table_data["Data Type"][split_idx:]
}

# Define row height and layout
row_height = 30
height = row_height * len(df.columns) - 25
width = 1000  # Wider layout to accommodate both tables

# Create subplots for two tables
fig = make_subplots(
    rows=1, cols=2, 
    column_widths=[0.5, 0.5], 
    horizontal_spacing=0.1,
    specs=[[{"type": "table"}, {"type": "table"}]]
)

# Add the first table
fig.add_trace(
    go.Table(
        header=dict(
            values=["Column", "Data Type"],
            fill_color="paleturquoise",
            align="left"
        ),
        cells=dict(
            values=[table_data_1["Column"], table_data_1["Data Type"]],
            fill_color="lavender",
            align="left"
        )
    ),
    row=1, col=1
)

# Add the second table
fig.add_trace(
    go.Table(
        header=dict(
            values=["Column", "Data Type"],
            fill_color="paleturquoise",
            align="left"
        ),
        cells=dict(
            values=[table_data_2["Column"], table_data_2["Data Type"]],
            fill_color="lavender",
            align="left"
        )
    ),
    row=1, col=2
)

# Update layout
fig.update_layout(
    width=width,
    height=height,
    title_text="Column and Data Types"
)

# Show the figure
fig.show()


---
### **Correlation Analysis of Chess Game Features**

In the following visualization, we present a correlation heatmap for the numerical features in the chess game dataset. The heatmap highlights the relationships between these features, providing insight into how they interact with each other.

#### **Observations**
The correlation heatmap reveals the strength and direction of relationships between features such as player ratings, game turns, and opening moves. While some features, such as `white_rating` and `black_rating`, exhibit a moderate positive correlation, other features show weaker relationships, indicating limited linear dependence.

#### **Purpose of This Analysis**
Understanding correlations in the dataset is essential for:
- Identifying highly correlated features that may provide redundant information.
- Highlighting patterns that could inform strategic analysis or model building.
- Detecting weakly correlated features, which may require additional transformations to uncover hidden relationships.
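
For readers who want to see what the heatmap values mean, here is a minimal sketch on invented numbers. It assumes the default Pearson method of `DataFrame.corr()`, recomputes one coefficient by hand, and shows how a lower-triangle mask hides the redundant upper half of the symmetric matrix.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values; not taken from the dataset
toy = pd.DataFrame({
    "white_rating": [1200, 1500, 1800, 2100],
    "black_rating": [1250, 1450, 1900, 2000],
    "turns": [30, 80, 45, 60],
})

corr = toy.corr()  # Pearson correlation by default

# Pearson r by hand: co-variation divided by the product of the spreads
x, y = toy["white_rating"], toy["black_rating"]
dx, dy = x - x.mean(), y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Keep the lower triangle (including the diagonal); everything else becomes NaN
lower = corr.where(np.tril(np.ones_like(corr, dtype=bool)))

print(round(r_manual, 3))
print(lower)
```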

In [11]:
#Correlation Analysis

df = CSV.copy()

# Calculate the correlation matrix for numerical columns
numerical_columns = df.select_dtypes(include=['int64', 'float64'])
correlation_matrix = numerical_columns.corr()

# Mask the top triangle of the correlation matrix
mask = np.tril(np.ones_like(correlation_matrix, dtype=bool))

correlation_matrix = correlation_matrix.where(mask)

# Create the heatmap using Plotly
fig = px.imshow(
    correlation_matrix,
    text_auto=True,
    color_continuous_scale="RdBu",  # Diverging colormap
    title="Heatmap of Correlation",
    labels=dict(color="Correlation"),
    zmin=-1, zmax=1  # Set limits for diverging colors
)

# Update the layout for better visualization
fig.update_layout(
    width=800,  # Set the width of the figure
    height=600,  # Set the height of the figure
    coloraxis_colorbar=dict(
        title="Correlation",
        tickvals=[-1, -0.5, 0, 0.5, 1],
        ticktext=["-1", "-0.5", "0", "0.5", "1"]
    ),
)

# Display the heatmap
fig.show()

#scatter plot white rating vs black rating

---
### **Scatter Plot of Rated vs. Unrated Chess Games**

This visualization compares **Rated Games** and **Unrated Games** by plotting the **White Rating** (y-axis) against the **Black Rating** (x-axis) in two side-by-side scatter plots.

#### **Key Observations**:
1. **Rated Games**:
   - Ratings are more clustered, particularly in higher Elo ranges, indicating competitive matchmaking between similarly skilled players.
   - This reflects the structured nature of rated games, where the system likely enforces tighter rating brackets.

2. **Unrated Games**:
   - Ratings are more spread out, showing games between players with diverse skill levels.
   - The broader distribution suggests that unrated games are more casual, often played without strict pairing rules.

#### **Insights**:
- **Rated games** highlight a competitive environment with tighter matchmaking.
- **Unrated games** exhibit more variability, offering insights into casual or exploratory gameplay behavior.
- This visualization provides a clear distinction between the dynamics of rated and unrated games.
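
One way to check the "tighter matchmaking" reading numerically, rather than by eye, is to compare the absolute rating gap between the two players in each group. The sketch below does this on a handful of invented rows; the column names `game_type` and `rating_gap` are introduced here for illustration only.

```python
import pandas as pd

# Hypothetical rows; ratings are invented for illustration
games = pd.DataFrame({
    "rated": [True, True, True, False, False, False],
    "white_rating": [1500, 1620, 1800, 1500, 1300, 2000],
    "black_rating": [1520, 1600, 1790, 1900, 1350, 1400],
})

# Readable group labels and the absolute rating difference per game
games["game_type"] = games["rated"].map({True: "Rated", False: "Unrated"})
games["rating_gap"] = (games["white_rating"] - games["black_rating"]).abs()

gap_summary = games.groupby("game_type")["rating_gap"].agg(["median", "mean"])
print(gap_summary)
```

A smaller median gap for rated games would support the observation above; with the real data the result may or may not match the visual impression.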

In [12]:
# Scatter: Rated vs. Unrated Games

df = CSV.copy()

# Divide the dataset into rated and unrated games
rated_games = df[df['rated'] == True]
unrated_games = df[df['rated'] == False]

# Create subplots
fig = make_subplots(
    rows=1, cols=2, 
    subplot_titles=("Rated Games", "Unrated Games"),  # Titles for the subplots
    horizontal_spacing=0.2
)

# Scatter plot for rated games
fig.add_trace(
    go.Scatter(
        x=rated_games['black_rating'],
        y=rated_games['white_rating'],
        mode='markers',
        marker=dict(size=3, opacity=0.5),  # Default color is used
        name="Rated"
    ),
    row=1, col=1
)

# Scatter plot for unrated games
fig.add_trace(
    go.Scatter(
        x=unrated_games['black_rating'],
        y=unrated_games['white_rating'],
        mode='markers',
        marker=dict(size=3, opacity=0.5),  # Default color is used
        name="Unrated"
    ),
    row=1, col=2
)

# Update layout
fig.update_layout(
    title_text="Rated vs. Unrated Games",
    width=1000, height=500,
    xaxis_title="Black Rating",
    yaxis_title="White Rating",
    showlegend=False  # Hide legends
)

# Update axes titles for each subplot
fig.update_xaxes(title_text="Black Rating", row=1, col=1)
fig.update_xaxes(title_text="Black Rating", row=1, col=2)
fig.update_yaxes(title_text="White Rating", row=1, col=1)
fig.update_yaxes(title_text="White Rating", row=1, col=2)

# Show the plot
fig.show()


---
### **Boxplot: Turns for Rated vs. Unrated Games**

This boxplot compares the distribution of game lengths (measured in **turns**) for **Rated** and **Unrated** games. The `rated` column has been mapped to show `"Rated"` and `"Unrated"` for better readability.

  - The **Rated** box (on the left):
    - Shows the typical range of game lengths for rated matches.
    - The median value indicates the middle point of game lengths for these games.
  - The **Unrated** box (on the right):
    - Displays the same metrics for unrated games.
    - The box may indicate if unrated games are typically shorter, longer, or more variable than rated games.
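
The quantities a boxplot draws can also be computed directly. The following sketch, on invented data, derives the quartiles, the interquartile range, and the usual 1.5 × IQR upper fence per group. It assumes pandas' linear-interpolation quantiles; Plotly's box traces have their own quartile convention (`quartilemethod`), so the figure's values may differ slightly.

```python
import pandas as pd

# Hypothetical game lengths; values are invented for illustration
toy = pd.DataFrame({
    "rated": ["Rated"] * 5 + ["Unrated"] * 5,
    "turns": [20, 40, 60, 80, 200, 10, 30, 50, 70, 90],
})

# Quartiles per group, one row per game type
quartiles = toy.groupby("rated")["turns"].quantile([0.25, 0.50, 0.75]).unstack()
quartiles.columns = ["q1", "median", "q3"]

# Points above q3 + 1.5 * IQR are conventionally drawn as outliers
quartiles["iqr"] = quartiles["q3"] - quartiles["q1"]
quartiles["upper_fence"] = quartiles["q3"] + 1.5 * quartiles["iqr"]

toy["is_high_outlier"] = toy["turns"] > toy["rated"].map(quartiles["upper_fence"])
high_outliers = toy.groupby("rated")["is_high_outlier"].sum()

print(quartiles)
print(high_outliers)
```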

In [13]:
#Boxplot: Turns for Rated vs. Unrated Games

df = CSV.copy()

# Map the 'rated' column to display "Rated" or "Unrated"
df['rated'] = df['rated'].map({True: "Rated", False: "Unrated"})

# Boxplot: Turns for Rated and Unrated Games
fig = px.box(
    df,
    x='rated',
    y='turns',
    title="Boxplot: Turns for Rated vs Unrated Games",
    labels={'rated': 'Game Type', 'turns': 'Number of Turns'},
    category_orders={'rated': ['Rated', 'Unrated']}  # Set the order explicitly
)

# Update layout
fig.update_layout(width=800, height=600)

# Show the plot
fig.show()


---
### **Bar Chart: Count of Matches by Winner**

This bar chart shows the number of chess matches ending in each outcome: a **White** win, a **Black** win, or a **Draw**.

- The chart provides a quick comparison of how often each outcome occurs.
- It highlights whether White or Black wins more frequently and how often matches end in a draw, offering a straightforward overview of game results.

In [14]:
#Bar: Count of Matches by Winner

df = CSV.copy()

# Count the number of matches won by each winner
winner_counts = df['winner'].value_counts().reset_index()
winner_counts.columns = ['winner', 'count']

# Create a bar chart using Plotly
fig = px.bar(
    winner_counts,
    x='winner',
    y='count',
    title="Count of Matches by Winner",
    labels={'winner': 'Winner', 'count': 'Number of Matches'},
    color='winner',  # Adds color per winner
)

# Update layout for better visualization
fig.update_layout(
    width=800,
    height=600,
    xaxis_title="Winner",
    yaxis_title="Number of Matches",
    showlegend=False
)

# Show the plot
fig.show()


---
### **Pie Chart: Victory Status Distribution**

This visualization presents the distribution of game outcomes based on the **victory status** column. The pie chart provides a clear representation of how games typically end, breaking them into distinct categories such as checkmate, resignation, timeout, etc.

- This chart helps identify the most common ways games conclude (e.g., checkmate, resignation).
- It provides a quick comparison of the prevalence of each victory status, offering valuable insights into player behavior and game dynamics.

In [15]:
#Pie: Victory Status Distribution

df = CSV.copy()

# Count victory statuses
victory_status_counts = df['victory_status'].value_counts().reset_index()
victory_status_counts.columns = ['victory_status', 'count']

fig4 = px.pie(
    victory_status_counts, 
    names='victory_status', 
    values='count', 
    title="Victory Status Distribution"
)

fig4.update_layout(width=600, height=600)
fig4.show()



---
### **Boxplot: Turns by Victory Status**

This boxplot visualizes the distribution of game lengths (measured in **number of turns**) across different victory statuses. The `victory_status` column categorizes games into four outcomes: **Resign**, **Mate**, **Out of Time**, and **Draw**.

**Victory Status Categories**:
   - **Resign**: Games ended with a player resigning.
   - **Mate**: Games concluded with a checkmate.
   - **Out of Time**: Games ended due to a time forfeit.
   - **Draw**: Games resulted in a draw.

**Median Comparison**:
   - The median game length varies across victory statuses, with draws generally showing the highest median turns.

**Outliers**:
   - Longer games are common in draws and some checkmates, as indicated by the presence of extreme outlier points.

In [16]:
#Boxplot: Turns by Victory Status

df = CSV.copy()

fig = px.box(
    df,
    x='victory_status',
    y='turns',
    title="Boxplot: Turns by Victory Status",
    labels={'victory_status': 'Victory Status', 'turns': 'Number of Turns'},
    category_orders={'victory_status': ['Resign', 'Mate', 'Out of Time', 'Draw']}
)

fig.update_layout(width=800, height=600)
fig.show()

---
### **Histogram: Distribution of White and Black Elo Ratings**

This visualization presents a side-by-side comparison of the **White Elo Ratings** and **Black Elo Ratings** in the dataset. Each histogram displays the frequency of games played across different Elo ranges for both White and Black players.

**Left Plot (White Elo Ratings)**:
   - Focuses on the frequency of matches based on the Elo ratings of White players.

**Right Plot (Black Elo Ratings)**:
   - Highlights the distribution of Elo ratings for Black players in a similar fashion.

This visualization helps identify:
- The skill level distribution of players in the dataset.
- Any disparities between White and Black players' Elo ratings.
- Patterns in game participation across different Elo ranges.

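Comparing the two histograms suggests that the dataset is balanced in terms of skill levels for White and Black players. This agrees with the `describe()` output above: the mean ratings are about 1597 and 1589, and the standard deviation is about 291 for both.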
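
The visual comparison can be backed by numbers. The sketch below, on invented ratings, bins both columns on shared edges with `np.histogram` so the counts are directly comparable, and prints a few summary statistics side by side. The choice of three bins is arbitrary and only keeps the toy output small.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings; values are invented for illustration
toy = pd.DataFrame({
    "white_rating": [1000, 1200, 1400, 1600, 1800, 2000],
    "black_rating": [1050, 1150, 1450, 1650, 1750, 2050],
})

# Shared bin edges so both histograms cover exactly the same ranges
low = min(toy["white_rating"].min(), toy["black_rating"].min())
high = max(toy["white_rating"].max(), toy["black_rating"].max())
edges = np.linspace(low, high, num=4)  # 4 edges -> 3 equal-width bins

white_counts, _ = np.histogram(toy["white_rating"], bins=edges)
black_counts, _ = np.histogram(toy["black_rating"], bins=edges)

print(edges)
print(white_counts, black_counts)
print(toy.describe().loc[["mean", "std", "50%"]])
```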

In [17]:
#Histogram: Distribution of White and Black Elo Ratings

df = CSV.copy()

# Find the minimum and maximum Elo ratings for both white and black
min_elo = min(df['white_rating'].min(), df['black_rating'].min())
max_elo = max(df['white_rating'].max(), df['black_rating'].max())

# Create subplots (make_subplots is already imported at the top of the notebook)
fig = make_subplots(
    rows=1, cols=2,  # One row, two columns
    subplot_titles=("White Elo Ratings", "Black Elo Ratings"),
    horizontal_spacing=0.2
)

# Histogram for White Elo Ratings
fig.add_trace(
    px.histogram(
        df,
        x='white_rating',
        nbins=30,  # Number of bins for the histogram
        color_discrete_sequence=["blue"],  # Set color
    ).data[0],  # Add the histogram to the first subplot
    row=1, col=1
)

# Histogram for Black Elo Ratings
fig.add_trace(
    px.histogram(
        df,
        x='black_rating',
        nbins=30,  # Number of bins for the histogram
        color_discrete_sequence=["red"],  # Set color
    ).data[0],  # Add the histogram to the second subplot
    row=1, col=2
)

# Update layout for the subplots
fig.update_layout(
    title_text="Distribution of White and Black Elo Ratings",
    width=1000,
    height=500,
    showlegend=False
)

# Update axes
fig.update_xaxes(title_text="White Elo Rating", row=1, col=1, range=[min_elo, max_elo])
fig.update_yaxes(title_text="Number of Matches", row=1, col=1)

fig.update_xaxes(title_text="Black Elo Rating", row=1, col=2, range=[min_elo, max_elo])
fig.update_yaxes(title_text="Number of Matches", row=1, col=2)

# Show the plot
fig.show()


---
### **Analysis of White Elo Ratings by Skill Level**

This visualization categorizes **White Elo Ratings** into five distinct skill levels: **Beginner**, **Novice**, **Intermediate**, **Advanced**, and **Expert**. The x-axis represents these skill levels, while the y-axis indicates the **number of matches** within each category.

- **What are Quintiles?**
  - Quintiles divide the data into five groups of roughly equal size, each containing approximately 20% of the games.
  - In this graph, each skill level represents a quintile of the White Elo Ratings distribution.

- **Elo Ranges by Skill Level**:
  - The Elo range for each quintile is calculated using `pd.qcut()`:
    - **Beginner**: Lowest 20% of White Elo Ratings.
    - **Novice**: 20%-40% of players.
    - **Intermediate**: 40%-60% of players.
    - **Advanced**: 60%-80% of players.
    - **Expert**: Top 20% of White Elo Ratings.

Two points to keep in mind when reading the chart:

1. **Equal Distribution**:
   - Because the data is divided into quintiles, the bars are roughly the same height across all skill levels.

2. **Player Distribution**:
   - The Elo range covered by each skill level (shown on hover) indicates how players are distributed across rating levels.
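
As a small illustration of how `pd.qcut()` forms quintiles, the sketch below bins ten invented ratings into the same five labels. Each interval produced by `qcut` includes its upper edge, and the first interval also includes the minimum value.

```python
import pandas as pd

# Hypothetical, evenly spaced ratings so the bin edges are easy to verify by hand
ratings = pd.Series([1000, 1100, 1200, 1300, 1400, 1500, 1600, 1700, 1800, 1900])
labels = ["Beginner", "Novice", "Intermediate", "Advanced", "Expert"]

# retbins=True also returns the six edges that delimit the five quintiles
levels, edges = pd.qcut(ratings, q=5, labels=labels, retbins=True)

# Count games per level, kept in skill order
counts = levels.value_counts().reindex(labels)

print(edges)
print(counts)
```

With real ratings, ties at a bin edge can make the groups slightly unequal, which is why the bars in the chart are only roughly the same height.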

In [18]:
#Bar: Distribution of White Elo Ratings by Skill Level

df = CSV.copy()

# Define the labels for the skill levels
range_labels = ["Beginner", "Novice", "Intermediate", "Advanced", "Expert"]

# Calculate quintiles for 'white_rating' and get bins
df['elo_range'], bins = pd.qcut(
    df['white_rating'], 
    q=5, 
    retbins=True,  # Get bin edges
    labels=False   # Temporarily use numeric labels for range manipulation
)

# Create human-readable labels for Elo ranges; qcut bins are right-closed, i.e. (lower, upper]
elo_labels = [f"{int(bins[i]) + 1}-{int(bins[i+1])}" for i in range(len(bins)-1)]
elo_labels[0] = f"{int(bins[0])}-{int(bins[1])}"  # The first bin also includes the lower boundary
df['elo_range'] = pd.qcut(df['white_rating'], q=5, labels=elo_labels)  # Apply Elo range labels

# Apply skill level labels
df['skill_level'] = pd.qcut(df['white_rating'], q=5, labels=range_labels)  # Apply skill level labels

# Aggregate data for the bar chart
skill_level_counts = df['skill_level'].value_counts().reset_index()
skill_level_counts.columns = ['skill_level', 'count']

# Add Elo ranges to the skill level counts for hover data
skill_level_counts = skill_level_counts.sort_values(by='skill_level')  # Ensure proper order
skill_level_counts['elo_range'] = elo_labels  # Assign Elo range labels in sorted order

# Create the bar chart based on skill levels
fig = px.bar(
    skill_level_counts,
    x='skill_level',  # Skill levels are displayed on the x-axis
    y='count',  # Number of matches on the y-axis
    title="Distribution of White Elo Ratings by Skill Level",
    labels={'skill_level': 'Skill Level', 'count': 'Number of Matches'},
    hover_data={'elo_range': True},  # Show Elo range on hover
    color='skill_level',  # Color by skill level
    category_orders={'skill_level': range_labels}  # Ensure proper order of skill levels
)

# Update layout for better visualization
fig.update_layout(
    width=800,
    height=600,
    xaxis_title="Skill Level",
    yaxis_title="Number of Matches",
    showlegend=False
)

# Show the plot
fig.show()


In [19]:
#Bar: Distribution of Black Elo Ratings by Skill Level

df = CSV.copy()

# Define the labels for the skill levels
range_labels = ["Beginner", "Novice", "Intermediate", "Advanced", "Expert"]

# Calculate quintiles for 'black_rating' and get bins
df['elo_range'], bins = pd.qcut(
    df['black_rating'], 
    q=5, 
    retbins=True,  # Get bin edges
    labels=False   # Temporarily use numeric labels for range manipulation
)

# Create human-readable labels for Elo ranges; qcut bins are right-closed, i.e. (lower, upper]
elo_labels = [f"{int(bins[i]) + 1}-{int(bins[i+1])}" for i in range(len(bins)-1)]
elo_labels[0] = f"{int(bins[0])}-{int(bins[1])}"  # The first bin also includes the lower boundary
df['elo_range'] = pd.qcut(df['black_rating'], q=5, labels=elo_labels)  # Apply Elo range labels

# Apply skill level labels
df['skill_level'] = pd.qcut(df['black_rating'], q=5, labels=range_labels)  # Apply skill level labels

# Aggregate data for the bar chart
skill_level_counts = df['skill_level'].value_counts().reset_index()
skill_level_counts.columns = ['skill_level', 'count']

# Add Elo ranges to the skill level counts for hover data
skill_level_counts = skill_level_counts.sort_values(by='skill_level')  # Ensure proper order
skill_level_counts['elo_range'] = elo_labels  # Assign Elo range labels in sorted order

# Create the bar chart based on skill levels
fig = px.bar(
    skill_level_counts,
    x='skill_level',  # Skill levels are displayed on the x-axis
    y='count',  # Number of matches on the y-axis
    title="Distribution of Black Elo Ratings by Skill Level",
    labels={'skill_level': 'Skill Level', 'count': 'Number of Matches'},
    hover_data={'elo_range': True},  # Show Elo range on hover
    color='skill_level',  # Color by skill level
    category_orders={'skill_level': range_labels}  # Ensure proper order of skill levels
)

# Update layout for better visualization
fig.update_layout(
    width=800,
    height=600,
    xaxis_title="Skill Level",
    yaxis_title="Number of Matches",
    showlegend=False
)

# Show the plot
fig.show()


### **Win Rates by Average Elo Range**

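This bar chart shows how match outcomes are distributed across skill levels based on the **average Elo rating** (calculated as the mean of `white_rating` and `black_rating`). The bars show outcome counts; because each quintile contains roughly the same number of games, the counts can be compared across skill levels much like rates.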

**Dominance of White Wins**:
   - Across all skill levels, the **White player** consistently has the highest win rate, followed by Black wins and Draws.
   - This pattern suggests a potential inherent advantage for the White player.

**Draw Rates**:
   - **Draws** are rare across all skill levels, but they slightly increase as skill levels improve, peaking at higher Elo ranges.
   - This trend could reflect more evenly matched games between higher-rated players.

**Black Wins**:
   - The frequency of **Black wins** is relatively stable across all skill levels but remains lower than White wins.

**Skill Progression**:
   - The distribution of wins appears consistent across Elo ranges, suggesting that player advantages (e.g., playing as White) hold true at all levels of expertise.

This visualization highlights the distribution of match outcomes across player skill levels, providing insights into the relationship between Elo ratings and win rates. It reveals consistent patterns in win dominance (White > Black > Draw) and subtle differences in the draw rates at higher skill levels.
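
To turn outcome counts into actual rates, one option is `pd.crosstab` with `normalize="index"`, which makes every skill level's row sum to 1. The sketch below uses invented rows and only two skill levels to keep the output short; it is an alternative view, not the computation used in the chart.

```python
import pandas as pd

# Hypothetical outcomes; values are invented for illustration
toy = pd.DataFrame({
    "elo_range": ["Beginner"] * 4 + ["Expert"] * 4,
    "winner": ["White", "White", "Black", "Draw",
               "White", "Black", "Black", "Black"],
})

# Rows: skill level, columns: outcome; each row is normalized to sum to 1
win_rates = pd.crosstab(toy["elo_range"], toy["winner"], normalize="index")

print((win_rates * 100).round(1))
```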

In [20]:
#Bar: Win Rates by Average Elo Range

df = CSV.copy()

# Create a combined average Elo rating for both white and black players
df['average_rating'] = (df['white_rating'] + df['black_rating']) / 2

# Categorize the average Elo rating for win rate analysis
df['elo_range'] = pd.qcut(
    df['average_rating'], 
    q=5, 
    labels=["Beginner", "Novice", "Intermediate", "Advanced", "Expert"]
)

# Count outcomes per Elo range; observed=True keeps only the categories present in the data and silences the pandas warning
elo_win_rate = df.groupby(['elo_range', 'winner'], observed=True).size().reset_index(name='count')

# Create the bar chart
fig = px.bar(
    elo_win_rate, 
    x='elo_range', 
    y='count', 
    color='winner', 
    barmode='group',
    title="Win Rates by Average Elo Range",
    labels={'elo_range': 'Elo Range', 'count': 'Number of Matches', 'winner': 'Winner'}
)

fig.update_layout(width=800, height=500)
fig.show()


---
### **Top 10 Most Used Chess Openings**

This horizontal bar chart visualizes the **top 10 most used chess openings** based on the number of games in which each opening was employed. The y-axis lists the chess openings, while the x-axis represents the **number of matches** where each opening was played.

**Most Popular Opening**:
   - "Van't Kruijs Opening" is the most used opening, appearing in the largest number of games.
   - This indicates its wide popularity among players.

**Diversity of Strategies**:
   - Openings like the "Sicilian Defense" and its variations (e.g., "Bowdler Attack") are prominent, showing the versatility of the Sicilian Defense as a favored strategy.

**Balanced Distribution**:
   - While some openings dominate, the distribution among the top 10 is relatively balanced, indicating a diverse set of opening preferences in the dataset.

In [21]:
#Bar: Top 10 Most Used Chess Openings

df = CSV.copy()

# Count the most used openings
opening_counts = df['opening_fullname'].value_counts().reset_index()
opening_counts.columns = ['opening', 'count']

# Select the top 10 most used openings
top_openings = opening_counts.head(10)

# Create a horizontal bar chart for the most used openings
fig = px.bar(
    top_openings,
    y='opening',  # Use 'opening' on the y-axis
    x='count',  # Use 'count' on the x-axis
    title="Top 10 Most Used Chess Openings",
    labels={'opening': 'Opening', 'count': 'Number of Games'},
    orientation='h',  # Horizontal bar chart
    color='opening'  # Optional: Add color for each opening
)

# Update layout for better visualization
fig.update_layout(
    width=1250,
    height=600,
    xaxis_title="Number of Matches",
    yaxis_title="Opening",
    showlegend=False
)

# Show the plot
fig.show()


---
### **Most Used Chess Openings by Skill Level**

This bar chart visualizes the **most commonly used chess opening for each skill level**, where the skill levels are quintiles of the White player's Elo rating. Here's what the chart represents:

**Beginner Level:**
   - **Van’t Kruijs Opening** is the most commonly used opening, with a noticeably higher count than the top opening at any other level.

**Novice Level:**
   - **Sicilian Defense**, a well-known reply to 1. e4, emerges as the most used opening.

**Intermediate Level:**
   - **Scotch Game** takes the lead, possibly indicating a focus on learning structured openings at this level.

**Advanced Level:**
   - **Sicilian Defense** reappears as the favorite, showing its versatility and complexity favored by experienced players.

**Expert Level:**
   - **Queen's Pawn Game: Mason Attack** becomes the go-to choice, aligning with the preference for positional play and deeper strategies.
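
Picking the top opening per skill level can be sketched in a few lines. The example below uses invented rows with placeholder opening names and selects the row with the largest count in each group via `idxmax`; this is one possible formulation, and on ties it keeps the first opening encountered.

```python
import pandas as pd

# Hypothetical games; opening names are placeholders
toy = pd.DataFrame({
    "skill_level": ["Beginner", "Beginner", "Beginner",
                    "Expert", "Expert", "Expert"],
    "opening_fullname": ["Opening A", "Opening A", "Opening B",
                         "Opening C", "Opening B", "Opening C"],
})

# Number of games per (skill level, opening) pair
counts = (
    toy.groupby(["skill_level", "opening_fullname"])
    .size()
    .reset_index(name="count")
)

# Row label of the largest count within each skill level
top_rows = counts.loc[counts.groupby("skill_level")["count"].idxmax()]

print(top_rows)
```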

In [22]:
#Bar: Most Used Chess Opening for Each Skill Level

df = CSV.copy()

# Define skill levels based on Elo ratings
range_labels = ["Beginner", "Novice", "Intermediate", "Advanced", "Expert"]

# Categorize players by Elo rating into quintiles and assign skill levels
df['skill_level'] = pd.qcut(
    df['white_rating'],
    q=5,
    labels=range_labels
)

# Group by skill level and count the most used opening for each skill level
most_used_openings = (
    df.groupby(['skill_level', 'opening_fullname'], observed=True)
    .size()
    .reset_index(name='count')
    .sort_values(by=['skill_level', 'count'], ascending=[True, False])
    .groupby('skill_level', observed=True).first().reset_index()
)


# Create a bar chart
fig = px.bar(
    most_used_openings,
    x='skill_level',  # Skill levels on the x-axis
    y='count',  # Count of games on the y-axis
    color='opening_fullname',  # Color based on opening name
    title="Most Used Chess Opening for Each Skill Level",
    labels={
        'skill_level': 'Skill Level',
        'count': 'Number of Games',
        'opening_fullname': 'Opening'
    }
)

# Update layout for better readability
fig.update_layout(
    xaxis_title="Skill Level",
    yaxis_title="Number of Games",
    legend_title="Opening",
    width=800,
    height=500
)

# Display the plot
fig.show()


---
### **Opening Effectiveness Analysis**

This visualization evaluates the **effectiveness of the top 10 most used chess openings**, presenting the win percentage for each outcome (White win, Black win, or Draw) as a stacked horizontal bar chart. 

**Effectiveness of Openings**:
   - The chart highlights how often each opening leads to a White win, Black win, or Draw.
   - Openings with a high percentage of White wins (e.g., *Sicilian Defense*) might indicate an inherent advantage for White players when using this strategy.

**Draw Rates**:
   - Some openings have a higher percentage of draws compared to others, suggesting they are more balanced or lead to evenly matched games.

**Black Win Percentages**:
   - Openings with higher Black win percentages may indicate counterplay opportunities for Black or effective defensive strategies.

In [23]:
#Bar: Opening Effectiveness

df = CSV.copy()

# Count the most used openings and select the top 10
top_openings = df['opening_fullname'].value_counts().head(10).index

# Filter the dataset to only include the top 10 openings
top_openings_df = df[df['opening_fullname'].isin(top_openings)]

# Calculate outcomes for the top 10 openings
opening_effectiveness = top_openings_df.groupby(['opening_fullname', 'winner']).size().reset_index(name='count')

# Normalize counts to calculate percentages
opening_effectiveness['percentage'] = opening_effectiveness.groupby('opening_fullname')['count'].transform(lambda x: (x / x.sum()) * 100)

# Create a horizontal bar chart for opening effectiveness
fig = px.bar(
    opening_effectiveness,
    y='opening_fullname',  # Openings on the y-axis
    x='percentage',  # Percentage on the x-axis
    color='winner',
    title="Opening Effectiveness (Top 10 Openings)",
    labels={'opening_fullname': 'Opening', 'percentage': 'Win Percentage', 'winner': 'Outcome'},
    barmode='stack',
    orientation='h'  # Horizontal orientation
)

fig.update_layout(
    width=1250,
    height=600,
    xaxis_title="Win Percentage (%)",
    yaxis_title="Opening",
    showlegend=True
)

fig.show()
