<a href="https://www.kaggle.com/code/duynhatvo/clashroyale?scriptVersionId=93721569" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Clash Royale Dataset Analysis

### TODO

1) Average Elixir (Classification)
- compute average elixir for each deck and try to figure out if there is a specific range which is better than others

2) Troop types
- air/ground/spell/building

3) Rarity
- common/rare/epic/legendary/champion

## 1. Introduction

In this notebook, we analyze the [Clash Royale Dataset](https://www.kaggle.com/datasets/nonrice/clash-royale-battles-upper-ladder-december-2021). This dataset documents over 700 thousand matches of the mobile game [Clash Royale](https://clashroyale.com/) by Supercell. We first go over the basic mechanics of the game and the goal of this project, then present the analysis and conclusions.

## 2. Objectives:
### a. Clash Royale:
Clash Royale is a real-time strategy mobile game in which 2 players fight over a 3-minute match to destroy their opponent's towers. Each player has a pre-chosen deck of 8 cards and continuously deploys these cards, which are either minions or spells. The game ends after 3 minutes, or if a player's main tower is destroyed.

As simple as it may sound, predicting who will win a match is very difficult, due to the huge number of possible card combinations (there are 106 cards in total) and the differences in players' skills. In this notebook, we analyze the dataset under the following assumptions:
- The matches happen within a short time span (December 2021), during which no updates (tweaks to each card's stats) were made.
- The matches happen between the top-ranked players in the world, who all have very similar skills.

### b. Our approach:
In this notebook, we try to see whether we can predict the outcome of a match using the decks and the rank of each player.

In [6]:
import json
import os

import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt

# Input data files are available in the read-only "../input/" directory.
# Uncomment the lines below to list all files under the input directory:
# for dirname, _, filenames in os.walk('/kaggle/input'):
#     for filename in filenames:
#         print(os.path.join(dirname, filename))

In [9]:
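# Assumption: cards.json lives next to cards_stats.json in the same Kaggle dataset;
# the original path '/data/cards.json' could not be read (ValueError: Expected object or value).
cards = pd.read_json('/kaggle/input/clash-royale-dataset/data/cards.json')
cards.drop(columns=['arena', 'description'])

In [3]:
cards['type'].unique()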

In [None]:
with open('/kaggle/input/clash-royale-dataset/data/cards_stats.json') as file:
    cards_stats = json.load(file)
for i in cards_stats['troop']:
    if i['key'] == 'barbarian-barrel':
        print(i)

In [None]:
cards_stats

## 3. Preliminary Analysis:
### a. A brief overview:
We first take a quick look to get a sense of the data. As the dataset has over 700 thousand points, we start with a random sample of 50 thousand matches to reduce computation time, before generalizing to the full dataset.

We then have a look at the dataset, and its correlation matrix.

In [None]:
df_full = pd.read_csv('../input/clash-royale-battles-upper-ladder-december-2021/data_ord.csv')
df_full.drop(columns='Unnamed: 0', inplace=True)
card_names = pd.read_csv('../input/clash-royale-battles-upper-ladder-december-2021/cardlist.csv')['card']
N_CARDS = card_names.shape[0]

In [None]:
SIZE = 50000
df = df_full.sample(SIZE, random_state=84, ignore_index=True)
df

In [None]:
df.describe()

In [None]:
corr = df.corr()

In [None]:
f = plt.figure(figsize=(9, 7))
plt.imshow(np.abs(corr))
plt.title('Correlation Matrix')
plt.xticks(np.arange(df.shape[1]), df.columns, rotation=45)
plt.yticks(np.arange(df.shape[1]), df.columns)
plt.colorbar()
plt.show()

#### Remarks:
From the plotted correlation matrix, we observe that there seems to be a clear relationship between:
- The current card and the next card(s) in each player's deck
- The two players' rankings (trophies)
- Players' rankings (trophies) and the cards in their decks

The first relation is intuitive: cards often have synergies (i.e. some work well together, with one supporting the other, while others do not). The result suggests that each card tends to be paired with another specific card that complements it as part of a strategy (this is, after all, a strategy game). The correlation also extends to cards further along in the deck, although it gets weaker, which is consistent with decks being built around a specific strategy.

The players' rankings are also correlated, as the graph shows. This is expected because the game's matchmaking is based on player rankings. Players of different rankings also seem to use similar decks of cards, although it is not yet clear why.

Unfortunately, this analysis shows almost no correlation between the outcome and any other attribute of the dataset. We can still apply some well-known classification methods and check whether they give us promising results.
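
To make the last remark concrete, the correlations with the outcome can be read straight off the correlation matrix and sorted. The snippet below is only a minimal sketch on a small synthetic frame: the random values and the column name `p1trophies` are assumptions made for illustration, not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the battle table (hypothetical columns, random values).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'p1card1': rng.integers(0, 106, size=1000),
    'p2card1': rng.integers(0, 106, size=1000),
    'p1trophies': rng.normal(6500, 200, size=1000),  # hypothetical column name
    'outcome': rng.integers(0, 2, size=1000),
})

# Absolute correlation of every other column with the outcome, strongest first.
outcome_corr = (
    toy.corr()['outcome']
    .drop('outcome')
    .abs()
    .sort_values(ascending=False)
)
print(outcome_corr)
```

Applied to the real `df`, the same chain (`df.corr()['outcome']`) would list which attributes, if any, carry a linear signal about the outcome.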

### b. Basic Algorithms:
In the following cells we do the following:
1. We first encode each player's deck as a $1 \times 106$ vector $V$, where $V_i = 1$ if the $i^{th}$ card is in the deck and $0$ otherwise. This seems to be a better choice than one-hot encoding entire decks, as there are $\binom{106}{8} \approx 3 \times 10^{11}$ possible decks.
2. We then split the whole dataset into a training set and a test set for supervised classification. We also keep a list of all the decks people used, to see whether unsupervised learning gives us better insights into the dataset.
3. We then apply some well-known simple classification methods (e.g. KNN, decision trees, etc.) to see if we get a good result.
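
The encoding in step 1 can be sketched on its own as follows. This is a minimal illustration, assuming cards are identified by integer indices from 0 to 105; the helper `multi_hot` and the two toy decks are hypothetical and not part of the dataset.

```python
import math
import numpy as np

N_CARDS = 106   # number of cards stated in the text
DECK_SIZE = 8

# Number of distinct 8-card decks: one-hot encoding whole decks
# would need one column per deck, i.e. this many columns.
n_decks = math.comb(N_CARDS, DECK_SIZE)
print(f'{n_decks:.3e}')

def multi_hot(card_ids, n_cards=N_CARDS):
    """Encode an (n_decks, deck_size) array of card indices as 0/1 vectors of length n_cards."""
    card_ids = np.asarray(card_ids)
    encoded = np.zeros((card_ids.shape[0], n_cards))
    rows = np.arange(card_ids.shape[0])[:, None]  # one row index per deck
    encoded[rows, card_ids] = 1
    return encoded

# Two made-up decks, given as card indices.
toy_decks = [[0, 5, 9, 20, 33, 47, 80, 105],
             [1, 5, 9, 21, 34, 48, 81, 104]]
vectors = multi_hot(toy_decks)
print(vectors.shape, vectors.sum(axis=1))
```

Each encoded deck has exactly 8 ones, so two decks sharing many cards end up close to each other in this space, which a one-hot encoding of whole decks would not capture.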

In [None]:
from sklearn.model_selection import train_test_split
y_train = df['outcome']
x_train, decks = [], []

# Encode each deck as a 1x106 vector, where V[i] = 1 if the ith card is in the deck.
# Each match is then the concatenation of both players' deck vectors (212 features).
for i in range(df.shape[0]):
    p1 = np.zeros(106)
    p2 = np.zeros(106)
    for j in range(1, 9):
        p1[df[f'p1card{j}'][i]] = 1
        p2[df[f'p2card{j}'][i]] = 1
    decks.append(p1)
    decks.append(p2)
    x_train.append(np.concatenate((p1, p2)))
    
x_train = np.array(x_train)
decks = np.array(decks)
x_train, x_test, y_train, y_test = train_test_split(x_train, y_train, test_size=0.2, random_state=6233)
d_train, d_test = train_test_split(decks, test_size=0.2, random_state=47483)

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, ExtraTreesClassifier, RandomForestClassifier

In [None]:
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(x_train, y_train)
print('Training accuracy:\t', knn.score(x_train, y_train))
print('Testing accuracy:\t', knn.score(x_test, y_test))

In [None]:
# ada = AdaBoostClassifier(DecisionTreeClassifier(max_depth=20))
# ada.fit(x_train, y_train)
# print('Training accuracy:\t', ada.score(x_train, y_train))
# print('Testing accuracy:\t', ada.score(x_test, y_test))

In [None]:
bagging = BaggingClassifier(DecisionTreeClassifier(max_depth=20))
bagging.fit(x_train, y_train)
print('Training accuracy:\t', bagging.score(x_train, y_train))
print('Testing accuracy:\t', bagging.score(x_test, y_test))

In [None]:
extra = ExtraTreesClassifier(max_depth=20)
extra.fit(x_train, y_train)
print('Training accuracy:\t', extra.score(x_train, y_train))
print('Testing accuracy:\t', extra.score(x_test, y_test))

In [None]:
rf = RandomForestClassifier(max_depth=20)
rf.fit(x_train, y_train)
print('Training accuracy:\t', rf.score(x_train, y_train))
print('Testing accuracy:\t', rf.score(x_test, y_test))

#### Remarks:
Unfortunately, our models do not seem to give good predictions overall (the scores are only slightly better than random guessing). In the following parts of the notebook, we are going to see whether we can obtain a better fit on this dataset.
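
One way to quantify "random guessing" is to score a trivial baseline on the same split. The sketch below uses scikit-learn's `DummyClassifier` on synthetic 0/1 data, which is an assumption standing in for the encoded matches; with the real data, `x_train`, `y_train`, `x_test` and `y_test` would be used instead.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded matches: random 0/1 features and labels.
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 2, size=(2000, 212))
y_toy = rng.integers(0, 2, size=2000)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

# Always predict the most frequent training label; a useful model should beat this.
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_tr, y_tr)
baseline_acc = baseline.score(X_te, y_te)
print('Baseline accuracy:\t', baseline_acc)
```

Comparing each classifier's test accuracy against this number shows how much of the score comes from the model rather than from class imbalance.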

## 4. Decks Analysis:
Let us dig deeper into the dataset by analyzing the decks, whose cards seem to be the most correlated features.

We first visualize the correlation matrix (now with the new encoding).

In [None]:
d_corr = np.corrcoef(decks.T)
f = plt.figure(figsize=(16, 12))
plt.imshow(np.abs(d_corr))
plt.title('Correlation Matrix')
plt.colorbar()
plt.show()

#### Remarks:
We want to observe which pairs have the highest synergy. The following cells set all correlations on and below the diagonal to 0, then sort the remaining ones by value.

In [None]:
d_corr_u = np.triu(d_corr, k=1)  # keep only the strictly upper triangle, so each pair appears once
f = plt.figure(figsize=(16, 12))
plt.imshow(np.abs(d_corr_u))
plt.title('Correlation Matrix')
plt.colorbar()
plt.show()

In [None]:
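d_corr_flat = d_corr_u.flatten()
d_corr_argsort = np.argsort(d_corr_flat)
# Recover the (row, column) card indices from each flat index; 106 is the number of cards.
d_corr_high = [(x // 106, x % 106, d_corr_flat[x]) for x in d_corr_argsort[:-31:-1]]  # 30 highest values
d_corr_low = [(x // 106, x % 106, d_corr_flat[x]) for x in d_corr_argsort[:30]]  # 30 lowest values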

In [None]:
[(card_names[x], card_names[y], z) for (x, y, z) in d_corr_high]

In [None]:
[(card_names[x], card_names[y], z) for (x, y, z) in d_corr_low]
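
The pair ranking above can also be written with `np.triu_indices`, which selects only the pairs with `i < j` so that the zeros of the masked lower triangle never enter the sort. The following is a minimal sketch on a made-up 4x4 matrix; the helper name `rank_pairs` is hypothetical.

```python
import numpy as np

def rank_pairs(corr, k=3):
    """Return the k most and k least correlated (i, j, value) pairs with i < j."""
    rows, cols = np.triu_indices(corr.shape[0], k=1)  # strictly upper triangle
    values = corr[rows, cols]
    order = np.argsort(values)  # ascending
    lowest = [(int(rows[n]), int(cols[n]), float(values[n])) for n in order[:k]]
    highest = [(int(rows[n]), int(cols[n]), float(values[n])) for n in order[::-1][:k]]
    return highest, lowest

# Toy symmetric correlation matrix with made-up values.
toy_corr = np.array([
    [1.0, 0.8, -0.2, 0.1],
    [0.8, 1.0, 0.05, -0.6],
    [-0.2, 0.05, 1.0, 0.3],
    [0.1, -0.6, 0.3, 1.0],
])
highest, lowest = rank_pairs(toy_corr, k=2)
print(highest)
print(lowest)
```

With the real data, `rank_pairs(d_corr, k=30)` would play the role of `d_corr_high` and `d_corr_low`, and the indices could be mapped through `card_names` in the same way.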

#### Remarks:
We observe that the highly correlated card pairs are very common in the strongest decks (according to https://www.deckshop.pro/).

## 5. A deeper look at Outcome:
### a. Applying PCA:

In [None]:
# Stack the features (one row per feature) with the outcome as the last row.
data = np.vstack([x_train.T, y_train])

In [None]:
corr = np.corrcoef(data)
f = plt.figure(figsize=(16, 12))
plt.imshow(np.abs(corr))
plt.title('Correlation Matrix')
plt.colorbar()
plt.show()

In [None]:
from sklearn.decomposition import PCA
pca = PCA(n_components=3)
x_train_pca = pca.fit_transform(x_train)
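
Before interpreting the projected data, it helps to check how much of the variance the three components retain. The sketch below does this with the `explained_variance_ratio_` attribute of scikit-learn's `PCA`, on synthetic 0/1 data that is only an assumed stand-in for `x_train`.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the multi-hot match matrix (random 0/1 entries).
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 2, size=(500, 212)).astype(float)

pca_toy = PCA(n_components=3)
pca_toy.fit(X_toy)

# Fraction of the total variance captured by each component, and their sum.
ratios = pca_toy.explained_variance_ratio_
print(ratios, ratios.sum())
```

On the real data, `pca.explained_variance_ratio_.sum()` gives the same figure for the fitted `pca`; a small value would mean that most of the information in the decks is lost in the 3-dimensional projection.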

In [None]:
# Stack the principal components (one row each) with the outcome as the last row.
data = np.vstack([x_train_pca.T, y_train])
corr = np.corrcoef(data)
f = plt.figure(figsize=(16, 12))
plt.imshow(np.abs(corr))
plt.title('Correlation Matrix')
plt.colorbar()
plt.show()

In [None]:
fig = plt.figure()
ax = fig.add_subplot(projection='3d')
ax.scatter(x_train_pca[:,0].flatten(), x_train_pca[:, 1].flatten(), x_train_pca[:, 2].flatten(), c = y_train.tolist())

In [None]:
plt.scatter(x_train_pca[:,0].flatten(), x_train_pca[:, 1].flatten(), c = y_train.tolist())

In [None]:
# ada = AdaBoostClassifier(DecisionTreeClassifier(max_depth=15))
# ada.fit(x_train_pca, y_train)
# print('Training accuracy:\t', ada.score(x_train_pca, y_train))
# print('Testing accuracy:\t', ada.score(pca.transform(x_test), y_test))

In [None]:
extra = ExtraTreesClassifier(max_depth=15)
extra.fit(x_train_pca, y_train)
print('Training accuracy:\t', extra.score(x_train_pca, y_train))
print('Testing accuracy:\t', extra.score(pca.transform(x_test), y_test))

### b. How about Neural Network?
Let's see if we can instead obtain a good score using a neural network.

We first try a naive neural network built from dense layers, with two Dropout layers.

In [None]:
from tensorflow import keras
nn = keras.Sequential([
    keras.layers.Dense(128, activation='relu'),
    keras.layers.Dense(32, activation='tanh'),
    keras.layers.Dropout(0.2),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(16, activation='relu'),
    keras.layers.Dropout(0.2),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(4, activation='relu'),
    keras.layers.Dense(1, activation='sigmoid'),
])
early_stopping = keras.callbacks.EarlyStopping(monitor='loss', patience=2)
nn.compile(
    optimizer=keras.optimizers.SGD(learning_rate=0.0001),
    loss="binary_crossentropy",
    metrics=['accuracy'],
)
nn.fit(x_train, y_train, validation_split=0.1, epochs=10, batch_size=16, callbacks=[early_stopping])

In [None]:
nn.evaluate(x_test, y_test)

#### Remarks: 
The previous network did not give a very good score, so let's see if a different network structure helps.

From the previous analysis of decks, we observed that a deck is characterized by a small number of cards. With that in mind, it makes sense to try a CNN on this data, as in the following cell.

In [None]:
from tensorflow import keras
cnn = keras.Sequential([
    keras.layers.Dense(212, activation='relu'),
    keras.layers.Conv1D(32, 4, padding='same', activation='relu'),
    keras.layers.Conv1D(32, 4, activation='relu'),
    keras.layers.Dropout(0.2),
    keras.layers.BatchNormalization(),
    
    keras.layers.MaxPool1D(8),
    keras.layers.Conv1D(64, 4, activation='relu'),
    keras.layers.Conv1D(64, 4, activation='relu'),
    keras.layers.Dropout(0.2),
    keras.layers.BatchNormalization(),
    
    keras.layers.MaxPool1D(8),
    keras.layers.Flatten(),  # collapse (steps, channels) so the head emits one probability per match
    keras.layers.Dense(128, activation='relu'),
    keras.layers.Dense(32, activation='relu'),
    keras.layers.Dropout(0.2),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(16, activation='relu'),
    keras.layers.Dropout(0.2),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(4, activation='relu'),
    keras.layers.Dense(1, activation='sigmoid'),
])
early_stopping = keras.callbacks.EarlyStopping(monitor='accuracy', patience=5)
cnn.compile(
    optimizer=keras.optimizers.SGD(learning_rate=0.001),
    loss="binary_crossentropy",
    metrics=['accuracy'],
)
cnn.fit(x_train.reshape((x_train.shape[0], x_train.shape[1], 1)), y_train, validation_split=0.1, epochs=30, batch_size=16, callbacks=[early_stopping])

In [None]:
cnn.evaluate(x_test.reshape((x_test.shape[0], x_test.shape[1], 1)), y_test)
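
To see how much of the 212-long input survives the convolution and pooling stages of the CNN above, the sequence length can be traced by hand. The sketch below is plain Python under the usual assumptions for these layers (stride 1 for `Conv1D`, and stride equal to the pool size with no padding for `MaxPool1D`); the helper names are hypothetical.

```python
def conv_len(length, kernel, padding='valid'):
    """Output length of a stride-1 Conv1D."""
    return length if padding == 'same' else length - kernel + 1

def pool_len(length, pool):
    """Output length of MaxPool1D with stride equal to the pool size and no padding."""
    return length // pool

length = 212                          # two concatenated 106-card deck vectors
length = conv_len(length, 4, 'same')  # Conv1D(32, 4, padding='same')
length = conv_len(length, 4)          # Conv1D(32, 4)
length = pool_len(length, 8)          # MaxPool1D(8)
length = conv_len(length, 4)          # Conv1D(64, 4)
length = conv_len(length, 4)          # Conv1D(64, 4)
length = pool_len(length, 8)          # MaxPool1D(8)
print(length)
```

Under these rules the sequence axis shrinks from 212 positions to 2 before the dense layers, so the two pooling stages discard most of the positional detail.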

#### Remarks:
The result indeed looks much better than that of our naive model, which suggests that it is better to look at a match in terms of the main cards used in each battle.

Unfortunately, however, we still cannot get a much better prediction of the outcome than with decision trees.

Still, given our previous analysis, it seems that we could get good results at deck building.

## 6. Deck Building:

In [None]:
synergies = np.array([(i, j, d_corr[i][j]) for i in range(N_CARDS) for j in range(i+1, N_CARDS)])

In [None]:
import seaborn as sns
x_train, y_train = synergies[:, :2], synergies[:, 2]
fig = plt.figure(figsize=(12, 8))
sns.histplot(y_train, kde=True)  # distplot is deprecated in recent seaborn releases