# **Loading the Data**

The goal of this project is to create a machine learning model that can predict the outcome of Valorant games based on a single player's statistics. Through exploratory data analysis (EDA) and statistical testing with the pandas and NumPy libraries, we will identify the features that matter most for the model. EDA will also help us define our target, that is, what we want the model to tell us.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
df = pd.read_csv('/kaggle/input/my-first-1000-valorant-games/valorant_games.csv')

# **Previewing the Data**

In [None]:
df.head()

# **Handling Missing and Duplicate Values**

In [None]:
df[df.duplicated()]

In [None]:
# count the missing values in each column
df.isna().sum()
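
The two checks in this section can be tried on a small frame first. This is a minimal sketch on made-up values (the `toy` frame and its numbers are hypothetical, not taken from the match data): `duplicated()` flags rows that repeat an earlier row, and `isna().sum()` counts missing values per column.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the match data (hypothetical values).
toy = pd.DataFrame({
    "game_id": [1, 2, 2, 3],
    "kills": [10, 15, 15, np.nan],
})

# Number of rows that exactly repeat an earlier row.
n_duplicates = int(toy.duplicated().sum())

# Number of missing values in each column.
missing_per_column = toy.isna().sum()

print(n_duplicates)
print(missing_per_column)
```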

# **Exploratory Data Analysis** 

In [None]:
wins_df = df[df['outcome'] == 'Win']
win_counts = wins_df['map'].value_counts()

plt.figure(figsize=(8, 8))
plt.pie(win_counts, labels=win_counts.index, autopct='%1.1f%%', startangle=140, colors=sns.color_palette('Set3', len(win_counts)))
plt.title('Distribution of Wins Across Maps')
plt.axis('equal')  # Equal aspect ratio ensures that pie chart is circular.
plt.show()

The user tends to win the most on Ascent, with Lotus coming in a close second.
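
Note that the pie chart shows each map's share of all wins, which is not the same as the win rate on that map: a map that is played more often can collect more wins without being won more often. The sketch below illustrates the difference on made-up data (the `toy` frame is hypothetical; only the `map` and `outcome` column names come from the notebook).

```python
import pandas as pd

# Hypothetical games: Ascent is played twice as often as Lotus.
toy = pd.DataFrame({
    "map": ["Ascent"] * 4 + ["Lotus"] * 2,
    "outcome": ["Win", "Win", "Loss", "Loss", "Win", "Win"],
})

is_win = toy["outcome"].eq("Win")

# Share of all wins that happened on each map (what a pie chart of wins shows).
share_of_wins = toy.loc[is_win, "map"].value_counts(normalize=True)

# Fraction of games on each map that were won.
win_rate_by_map = is_win.groupby(toy["map"]).mean()

print(share_of_wins)
print(win_rate_by_map)
```

In this toy example both maps hold half of the wins, yet the win rate on Lotus is twice the win rate on Ascent.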

In [None]:
wins_df = df[df['outcome'] == 'Win']
win_counts_by_agent = wins_df['agent'].value_counts()
plt.figure(figsize=(8, 8))
sns.barplot(x=win_counts_by_agent.index, y=win_counts_by_agent.values, palette='Set3')

# Add labels and title
plt.xlabel('Agent')
plt.ylabel('Number of Wins')
plt.title('Number of Wins for Each Agent')
plt.show()

In [None]:
#finding the average headshot and damage per game
headshot_avg = df['headshot_pct'].mean()
dmg_avg = df['avg_dmg'].mean()

print(f"Player's average headshot percentage is: {headshot_avg:.2f}%")
print(f"Player's average damage per game is: {dmg_avg:.2f}")

In [None]:
#finding the user's top five games by damage
def top_five_games_by_dmg():
    top_5_games = df.sort_values(by = 'avg_dmg' , ascending = False).head()
    return top_5_games[['game_id' , 'date' , 'agent' , 'map' , 'avg_dmg' , 'headshot_pct', 'kdr', 'kills' , 'deaths', 'outcome']]

top_five_games_by_dmg()

The player's best game was game number 435 where he produced 373 damage, had a KDR of 9.7 and had 29 kills and three deaths.

In [None]:
#does a better headshot percentage result in more winning games?
# Convert 'outcome' to a binary format (1 for Win, 0 for Loss)
df['outcome_binary'] = df['outcome'].map({'Win': 1, 'Loss': 0})

# plotting the relationship between headshot percentage and game outcome
plt.figure(figsize=(10, 6))
sns.boxplot(data=df, x='outcome', y='headshot_pct')

# adding labels and title (seaborn already labels each box with its 'outcome' category,
# so the tick labels are not overridden by hand)
plt.xlabel('Game Outcome')
plt.ylabel('Headshot Percentage')
plt.title('Headshot Percentage vs. Game Outcome')

plt.show()

In [None]:
#does landing headshots lead to winning more games?
# finding the correlation coefficient
correlation = df['headshot_pct'].corr(df['outcome_binary'])
print(f"Correlation between headshot percentage and outcome: {correlation:.2f}")

Since the correlation is -0.01, there is little to no linear relationship between headshot percentage and winning. The box plot agrees: the median headshot percentage is similar whether the user wins, loses, or draws (green). In this data, a better headshot percentage is not associated with winning more games.
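
Correlating a continuous column with a 0/1 column is known as a point-biserial correlation, and it is numerically the same as the Pearson coefficient. The sketch below checks this on synthetic data (the arrays are randomly generated, not the player's games) and also returns a p-value, which `Series.corr` does not provide.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for the notebook's columns (not real match data).
rng = np.random.default_rng(0)
headshot_pct = rng.uniform(5, 35, size=200)
outcome_binary = rng.integers(0, 2, size=200)

# Pearson correlation, as computed by Series.corr in the notebook.
r_pearson = np.corrcoef(headshot_pct, outcome_binary)[0, 1]

# Point-biserial correlation: binary variable first, continuous second.
r_pb, p_value = stats.pointbiserialr(outcome_binary, headshot_pct)

print(f"Pearson r = {r_pearson:.4f}, point-biserial r = {r_pb:.4f}, p = {p_value:.3f}")
```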

In [None]:
#in how many games did the player not feed (have a KDR > 1)?
#filtering for KDR where KDR > 1
kdr_above_1_df = df[df['kdr'] > 1]
print(kdr_above_1_df[['game_id', 'rank' , 'outcome' , 'agent']])

In [None]:
#finding the number of times KDR was above 1 by filtering
games_above_1_kdr = df[df['kdr'] > 1]
#counting the rows in the filtered DataFrame
num_games_above_1_kdr = len(games_above_1_kdr)
print(f"The user had {num_games_above_1_kdr} games where his KDR was above 1.")

#looking up the row with the best KDR
best_kdr_row = games_above_1_kdr.loc[games_above_1_kdr['kdr'].idxmax()]
print(f"The player's best KDR was: {best_kdr_row['kdr']}")

There were 528 games where the user's KDR was above 1, meaning he got more kills than deaths. His best game by KDR was with Cypher on the Sunset map, where he went 29-3-7 with a KDR of 9.7 in Diamond 1.

In [None]:
#what is the player's highest and lowest rank (outside of Placements)?
#create a dictionary that maps each rank to a numeric level
rank_order = {
    'Bronze 1': 1, 'Bronze 2': 2, 'Bronze 3': 3,
    'Silver 1': 4, 'Silver 2': 5, 'Silver 3': 6,
    'Gold 1': 7, 'Gold 2': 8, 'Gold 3': 9,
    'Platinum 1': 10, 'Platinum 2': 11, 'Platinum 3': 12,
    'Diamond 1': 13, 'Diamond 2': 14, 'Diamond 3': 15,
    'Ascendant 1': 16, 'Ascendant 2': 17, 'Ascendant 3': 18
}

# numerically ranking them
df['rank_numeric'] = df['rank'].map(rank_order)

# finding the lowest and highest rank based on the numeric scale
lowest_rank = df.loc[df['rank_numeric'].idxmin()]['rank']
highest_rank = df.loc[df['rank_numeric'].idxmax()]['rank']

#print the results
print(f"The lowest rank is: {lowest_rank}")
print(f"The highest rank is: {highest_rank}")
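
An alternative to a hand-written rank dictionary is an ordered categorical, which lets pandas compare ranks directly. This is a sketch under the assumption that the rank names follow the `"<Tier> <1-3>"` pattern used in the dictionary above; the `toy_ranks` values are hypothetical.

```python
import pandas as pd

# Rank levels from lowest to highest, built from the tiers used above.
tiers = ["Bronze", "Silver", "Gold", "Platinum", "Diamond", "Ascendant"]
rank_levels = [f"{tier} {n}" for tier in tiers for n in (1, 2, 3)]

# Hypothetical rank column; 'Placement' is not a level, so it becomes NaN.
toy_ranks = pd.Series(["Gold 2", "Placement", "Silver 1", "Ascendant 2"])
ordered = pd.Series(pd.Categorical(toy_ranks, categories=rank_levels, ordered=True))

# Drop unranked games, then take the extremes of the ordered categories.
played = ordered.dropna()
lowest, highest = played.min(), played.max()

print(f"The lowest rank is: {lowest}")
print(f"The highest rank is: {highest}")
```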

In [None]:
#is there a difference in damage output from rank to rank?
#filtering and creating new datasets for players highest and lowest ranks
ascendant_data = df[df['rank'].isin(['Ascendant 1', 'Ascendant 2', 'Ascendant 3'])]
silver_data = df[df['rank'].isin(['Silver 1', 'Silver 2', 'Silver 3'])]

#plotting average damage dealt
plt.figure(figsize = (10,6))
plt.hist(ascendant_data['avg_dmg'] , bins = 20 , alpha = 0.5, color = 'pink', edgecolor= 'black', label = 'Ascendant')
plt.hist(silver_data['avg_dmg'] , bins = 20 , alpha = 0.5, color = 'purple' , edgecolor = 'black', label = 'Silver')
plt.title('Average Damage in Ascendant and Silver Ranks')
plt.xlabel('avg_dmg')
plt.ylabel('Frequency')
plt.legend()  # uses the label given to each histogram
plt.show()

Ascendant games are far more frequent in the data than Silver games, and the damage produced in a single game also tends to be higher in Ascendant than in Silver.
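
Because the two groups differ so much in size, raw frequency histograms are hard to compare by eye. A summary table per tier is one way to put the groups side by side. The sketch below uses made-up damage values (the `toy` frame is hypothetical; `avg_dmg` is the only name taken from the notebook).

```python
import pandas as pd

# Hypothetical damage values for a large and a small group.
toy = pd.DataFrame({
    "tier": ["Ascendant"] * 4 + ["Silver"] * 2,
    "avg_dmg": [150.0, 160.0, 170.0, 180.0, 90.0, 110.0],
})

# Group size, mean and median damage for each tier.
summary = toy.groupby("tier")["avg_dmg"].agg(["count", "mean", "median"])
print(summary)
```

Passing `density=True` to `plt.hist` is another option when the plot itself should be comparable across groups of different sizes.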

In [None]:
#what agents does the player use in their highest rank?
#counting the instances of agent use
agent_counts = ascendant_data['agent'].value_counts()

#print the results
print("Most Played Agents in Ascendant Rank:")
print(agent_counts)

# plotting the results using a bar plot
plt.figure(figsize=(12, 6))
agent_counts.plot(kind='bar', color='skyblue', edgecolor='black')

The player has used Cypher the most in his Ascendant games.

In [None]:
#what agents did the player use to get out of his lowest rank?
#counting the instances of agent use
agent_counts = silver_data['agent'].value_counts()

#print the results
print("Most Played Agents in Silver Rank:")
print(agent_counts)

# plotting the results using a bar plot
plt.figure(figsize=(12, 6))
agent_counts.plot(kind='bar', color='pink', edgecolor='black')

The player used KAY/O to climb out of Silver rank. While he did not use KAY/O for the majority of his games, it was the agent he used most. He also climbed out of Silver in 17 games. His win rate in Silver is examined next.

In [None]:
#what is the players win rate in his highest rank?
outcomes = ascendant_data['outcome'].value_counts()

# extracting the number of wins and losses
total_wins = outcomes.get('Win', 0)  # Default to 0 if 'Win' is not found
total_losses = outcomes.get('Loss', 0)  # Default to 0 if 'Loss' is not found

# printing the results
print(f"Total Wins in Ascendant Rank: {total_wins}")
print(f"Total Losses in Ascendant Rank: {total_losses}")

In [None]:
#calculating win rate from the counts computed in the previous cell
wins = total_wins      # 116 in the original run
losses = total_losses  # 117 in the original run

#defining total games variable
total_games = wins + losses

#finding win rate via division
win_rate = wins / total_games
print(f"Total Win Rate in Ascendant Rank: {win_rate:.2f}")

The player has a 50% win rate in Ascendant rank, meaning he is hardstuck Ascendant. 

In [None]:
#what is the players win rate in his lowest rank?
outcomes = silver_data['outcome'].value_counts()

# extracting the number of wins and losses
total_wins = outcomes.get('Win', 0)  # Default to 0 if 'Win' is not found
total_losses = outcomes.get('Loss', 0)  # Default to 0 if 'Loss' is not found

# print the results
print(f"Total Wins in Silver Rank: {total_wins}")
print(f"Total Losses in Silver Rank: {total_losses}")

In [None]:
#calculating win rate from the counts computed in the previous cell
#defining wins and losses
wins = total_wins      # 7 in the original run
losses = total_losses  # 10 in the original run

#creating a total games variable
total_games = wins + losses

#finding win rate via division
win_rate = wins / total_games
print(f"Total Win Rate in Silver Rank: {win_rate:.2f}")

The player has a 41% win rate in Silver, and it took him seven wins to get out of that rank. He may have climbed out of Silver because those wins came consecutively, despite losing more games than he won there. Notably, his win rate in Silver, his lowest rank, is worse than in his highest rank (Ascendant).
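
The same win-rate calculation is repeated for several subsets of the data, so it can be wrapped in a small helper. This is a sketch, not part of the original notebook: `win_rate` is a hypothetical function name, and the toy subset reuses the 7 wins and 10 losses reported for Silver plus one made-up draw to show that draws are ignored.

```python
import pandas as pd

def win_rate(games: pd.DataFrame) -> float:
    """Return wins / (wins + losses) for a subset of games, ignoring draws."""
    counts = games["outcome"].value_counts()
    wins = counts.get("Win", 0)
    losses = counts.get("Loss", 0)
    total = wins + losses
    # An empty subset has no defined win rate.
    return wins / total if total else float("nan")

# Toy subset: 7 wins and 10 losses as reported above, plus one hypothetical draw.
toy_silver = pd.DataFrame({"outcome": ["Win"] * 7 + ["Loss"] * 10 + ["Draw"]})
silver_rate = win_rate(toy_silver)
print(f"Total Win Rate in Silver Rank: {silver_rate:.2f}")
```

With the notebook's data, the helper could be called as `win_rate(ascendant_data)` or `win_rate(silver_data)` instead of copying counts from the printed output.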

In [None]:
# did the player deserve Silver based on Placement matches?
#making a df of placement matches
placement_matches = df[df['rank'].isin(['Placement', ''])]
placement_matches

In [None]:
#making datasets for the placement matches of every season (is called "episode" in valorant)
episode_6 = placement_matches[placement_matches['episode'] == 6]
episode_7 = placement_matches[placement_matches['episode'] == 7]
episode_8 = placement_matches[placement_matches['episode'] == 8]
episode_9 = placement_matches[placement_matches['episode'] == 9]

#checking to see if datasets are empty
# print(f"Placement Matches in Episode 6:\n{episode_6.head()}")
# print(f"Placement Matches in Episode 7:\n{episode_7.head()}")
# print(f"Placement Matches in Episode 8:\n{episode_8.head()}")
# print(f"Placement Matches in Episode 9:\n{episode_9.head()}")

In [None]:
#checking the first season played only, since that is where player placed in silver
#episode 6 placements win rate
outcomes = episode_6['outcome'].value_counts()

# extracting the number of wins and losses
total_wins = outcomes.get('Win', 0)  # Default to 0 if 'Win' is not found
total_losses = outcomes.get('Loss', 0)  # Default to 0 if 'Loss' is not found

# print the results
print(f"Total Wins in Episode 6 Placements: {total_wins}")
print(f"Total Losses in Episode 6 Placements: {total_losses}")

In [None]:
#finding win rate from the counts computed in the previous cell
#defining wins and losses
wins = total_wins      # 2 in the original run
losses = total_losses  # 4 in the original run

#defining total games
total_games = wins + losses

#finding win rate via division
win_rate = wins / total_games
print(f"The player's win rate in Episode 6 placements is: {win_rate:.2f}")

In [None]:
avg_dmg = episode_6['avg_dmg'].mean()
print(f"The player's average damage per game in Episode 6 placements was: {avg_dmg:.2f}")

The player's win rate in his Episode 6 placements was 33%, with an average damage of 89. Both statistics are relatively low; 89 damage is less than the 100 health of an unarmored agent in Valorant, so on average it is not enough for a single kill. Given that this was the player's first episode in Valorant, it is reasonable to conclude from these statistics that the player did deserve a Silver ranking to start.

In [None]:
#is there a correlation between damage and outcome in player's episode 6 placement matches?
#finding the correlation coefficient
correlation = episode_6['avg_dmg'].corr(episode_6['outcome_binary'])
print(f"Correlation between damage and outcome: {correlation:.2f}")

A correlation of 0.14 between damage and outcome suggests a very weak positive relationship: there is a slight tendency for higher damage to be associated with winning, but the relationship is too weak to be of practical importance. With only six placement games in the sample (2 wins, 4 losses), the estimate is also very noisy. The weak correlation implies that other factors are likely more influential in determining match outcomes.

# **Stats Test**

In [None]:
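# Welch's t-test: compare average damage in won vs. lost Episode 6 placement games
dmg_wins = episode_6.loc[episode_6['outcome'] == 'Win', 'avg_dmg']
dmg_losses = episode_6.loc[episode_6['outcome'] == 'Loss', 'avg_dmg']
t_stat, p_value = stats.ttest_ind(dmg_wins, dmg_losses, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")

# rejecting or failing to reject the null hypothesis (equal mean damage in wins and losses)
if p_value < 0.05:
    print("There is a significant difference in average damage between wins and losses.")
else:
    print("There is no significant difference in average damage between wins and losses.")

The two-tailed test asks whether average damage differs between won and lost placement games. With only six games (2 wins, 4 losses) the test has very little power, so its result is indicative at most. Either way, average damage alone is not enough to predict wins: we can add it to our features for machine learning, but it cannot be the sole feature.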
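
For a t-test to say something about damage and winning, the two samples being compared should be the damage values from won games and the damage values from lost games. The sketch below shows that setup on made-up numbers (the `toy` frame is hypothetical) using Welch's variant, which does not assume equal variances.

```python
import pandas as pd
from scipy import stats

# Hypothetical damage values for four wins and four losses.
toy = pd.DataFrame({
    "avg_dmg": [150.0, 162.0, 171.0, 158.0, 120.0, 131.0, 118.0, 127.0],
    "outcome": ["Win"] * 4 + ["Loss"] * 4,
})

# Split the damage column by outcome.
dmg_wins = toy.loc[toy["outcome"] == "Win", "avg_dmg"]
dmg_losses = toy.loc[toy["outcome"] == "Loss", "avg_dmg"]

# Welch's two-sample t-test (two-tailed by default).
t_stat, p_value = stats.ttest_ind(dmg_wins, dmg_losses, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```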

# **Machine Learning**

In [None]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# drop both encodings of the target ('outcome' and 'outcome_binary') so the label cannot leak into the features
df_encoded = pd.get_dummies(df.drop(columns=['outcome', 'outcome_binary'], errors='ignore'), drop_first=True)  # One-Hot Encoding for categorical columns

# separate features and target
X = df_encoded  # Features (independent variables)
y = df['outcome']  # Target variable (dependent)

# splitting data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# handling missing values (the imputer is fit on the training split only)
imputer = SimpleImputer(strategy='mean')  # Using 'mean' to fill NaN values
X_train_imputed = imputer.fit_transform(X_train)
X_test_imputed = imputer.transform(X_test)

# scaling every feature column, including the one-hot columns
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_imputed)
X_test_scaled = scaler.transform(X_test_imputed)

# training a Logistic Regression model (imported at the top of the notebook)
model = LogisticRegression(class_weight='balanced', C=0.1, penalty='l1', solver='liblinear', max_iter=1000)
model.fit(X_train_scaled, y_train)

# predicting and evaluating the model
y_pred = model.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")
print(classification_report(y_test, y_pred , zero_division = 1))



In [None]:
# evaluating the model
conf_matrix = confusion_matrix(y_test, y_pred)

# printing evaluation results
print("Confusion Matrix:")
print(conf_matrix)


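The run reported here was 99% accurate at predicting the outcome of a match. It predicted wins 100% of the time and losses 97% of the time, and it did not predict draws accurately at all because draws are so rare in the dataset. These figures should be read with caution: the features are meant to be KDR, damage, headshot percentage, agent, and map, and any column that encodes the result itself, such as `outcome_binary`, has to be kept out of the feature matrix. Otherwise the model is reading the answer rather than predicting it, and near-perfect accuracy is the typical symptom.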
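
Accuracy this close to 100% is worth a sanity check for target leakage. The following is a minimal sketch on synthetic data, not the notebook's dataset: the features are pure noise, so an honest model should score near chance, while adding a column that copies the label pushes accuracy toward 100%. The helper name `holdout_accuracy` and all arrays are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 400
noise = rng.normal(size=(n, 3))          # features unrelated to the target
y = rng.integers(0, 2, size=n)           # synthetic win/loss labels

X_clean = noise
X_leaky = np.column_stack([noise, y])    # last column is a copy of the target

def holdout_accuracy(X, y):
    """Fit a logistic regression and return accuracy on a held-out split."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return model.score(X_te, y_te)

acc_clean = holdout_accuracy(X_clean, y)
acc_leaky = holdout_accuracy(X_leaky, y)
print(f"Without the leaked column: {acc_clean:.2f}")
print(f"With the leaked column:    {acc_leaky:.2f}")
```

If dropping a suspicious column from the real feature matrix makes accuracy fall sharply in the same way, that column was most likely leaking the outcome.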

Through EDA we discovered that the player's best map was Ascent, which accounts for 15.5% of his wins across all maps in the game. His best agent by far is Cypher, with a 51% win rate overall; however, he used KAY/O to climb out of Silver, which was his lowest rank. His highest rank, where he collected the most wins with Cypher, was Ascendant 2.

Because his highest rank is Ascendant 2, the model can only predict outcomes of games up to that rank, which is one limitation of the data. The other limitation is predicting draws, because they are so infrequent in the data. A standard game of Valorant lasts at most 24 rounds before overtime and the first team to win 13 rounds wins, so the probability of a draw is low, low enough that draws do not meaningfully affect the predictions or the dataset as a whole. Statistical analysis showed only a weak correlation between damage output and match outcome: the player can deal a lot of damage and still lose, which suggests that outcome is also affected by things the data cannot track, like team synergy and tactical strategy.