<div class="alert alert-danger" role="alert">
    <span style="font-size:20px">&#9888;</span> <span style="font-size:16px">This is a read-only notebook! If you want to make and save changes, save a copy by clicking on <b>File</b> &#8594; <b>Save a copy</b>. If this is already a copy, you can delete this cell.</span>
</div>

In [None]:
# Load in Python libraries
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import seaborn as sns

sns.set_style("white")

In [None]:
# Temporary Developer's code
%load_ext autoreload
%autoreload 2

In [None]:
# Add the '../utilities' folder to the path from which we can import modules
import sys
sys.path.append('../utilities')
from clustering import KMeansClustering


# Read in data

In [None]:
# Load in the data
soccer_data = pd.read_csv("sample_inputs/soccer.csv")

In [None]:
# Look at the top 10 rows
pd.set_option('display.max_columns', None) # Show all columns -- only use if needed and data is not extremely wide
soccer_data.head(10)

### Other types of optional pre-processing to be added

# Perform K-means analysis for a specific number of clusters

In [None]:
from clustering.kmeans import KMeansClustering


variable_names = [
    'crossing', 'finishing', 'heading_accuracy', 'short_passing',
    'dribbling', 'free_kick_accuracy', 'sprint_speed', 'ball_control', 
    'reactions', 'agility', 'sliding_tackle'
]

kmeans_result = KMeansClustering.execute_k_means(
    data=soccer_data,    
    variables=variable_names,
    num_clusters=5, 
    standardize_vars=True, 
    generate_charts=True,
    save_results_to_excel=True,
    random_state=0
)

# TODO: Add the dimensionality reduction methods to the notebooks (add as a requirement for the clustering template)

# NOTE: nbstripout can be applied conditionally on size, e.g.:
#   nbstripout --max-size 10000k

# TODO: Add nbstripout to the template scaffold


In [None]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans


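# Work on a copy so that scaling does not modify the data stored in kmeans_result
data = kmeans_result["raw_data"].copy()

# SCALING
# Rescale every numeric column to [0, 1] so that the heatmap colors are comparable
for col in data.columns:
    if col == "Cluster_assigned":
        continue

    if data[col].dtype == object:
        continue

    scaler = MinMaxScaler()
    data[col] = scaler.fit_transform(data[col].values.reshape(-1, 1)).ravel()


# Average only the numeric columns; non-numeric columns cannot be averaged
cluster_mean = data.groupby("Cluster_assigned").mean(numeric_only=True)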


# SORTING VARIABLES 
# according to similarity (variable clustering)
means = cluster_mean.T
km = KMeans(n_clusters=10, random_state=0)
means['labels'] = km.fit_predict(means)

means = (
    means
    .reset_index()
    .set_index('labels')
    .sort_index()
    .reset_index(drop=True)
    .set_index('index')
)

# PLOTTING
plt.subplots(1, 1, figsize=(14, 18))
sns.heatmap(means, vmin=0, vmax=1, cmap="Greens");


# 0 == attacker
# 1 == centre-forward
# 2 == goalkeeper
# 3 == midfielder
# 4 == centre-back

# data.query("Cluster_assigned == 0")

### Exploring results in depth

The `KMeansClustering.execute_k_means` function returns a dictionary with additional results (beyond plots) that can be used for further analysis of the clustering.
Namely:
* `model`: The scikit-learn model that can be used for prediction on new data and also to access metrics.
* `data`: The data used to train the clustering algorithm.
* `centroids`: A dataframe with the centroids of each cluster found.
* `cluster_n`: A dataframe containing the point counts for each cluster.
* `scores`: Clustering scores calculated for the obtained clustering.
* `cluster_plot`: Matplotlib figure object of the 3D/2D scatter plot of the principal components (PCA) with cluster colors.
* `cluster_plot_info`: Additional information about the cluster plot and PCA.
* `factor_plot`: Seaborn FacetGrid object with the box plots of each factor for each cluster.
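
As a rough illustration of how these entries relate to each other, the sketch below builds a comparable dictionary with plain scikit-learn on synthetic data. The names `demo`, `demo_result`, `var_a`, and `var_b` are made up for this example, the silhouette score is only an assumed example of what `scores` could hold, and the real `execute_k_means` output may be structured differently.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: two well-separated groups in two variables.
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(8, 1, (50, 2))]),
    columns=["var_a", "var_b"],
)

# Standardize the variables, then fit K-means on the scaled values.
scaler = StandardScaler().fit(demo)
scaled = scaler.transform(demo)
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scaled)

# Hypothetical result dictionary mirroring some of the keys listed above.
demo_result = {
    "model": model,
    "data": demo.assign(Cluster_assigned=model.labels_),
    "centroids": pd.DataFrame(model.cluster_centers_, columns=demo.columns),
    "cluster_n": pd.Series(model.labels_).value_counts().sort_index(),
    "scores": {"silhouette": silhouette_score(scaled, model.labels_)},
}

print(demo_result["centroids"])
print(demo_result["cluster_n"])
print(demo_result["scores"])

# The fitted model can label new observations, provided they are scaled
# the same way as the training data (an assumption of this sketch).
new_points = pd.DataFrame({"var_a": [0.0, 8.0], "var_b": [0.0, 8.0]})
new_labels = demo_result["model"].predict(scaler.transform(new_points))
print(new_labels)
```

Whether new data must be standardized before calling `predict` on the returned `model` depends on how the template fits it, so check this against your own results before relying on it.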

In [None]:
kmeans_result.keys()

Getting the model scores:

In [None]:
kmeans_result["scores"]

# Perform K-means analysis for a range of clusters

In [None]:
from clustering.kmeans import KMeansClustering

results = KMeansClustering.k_means_range(
    dataset=soccer_data, 
    variables=variable_names,
    min_clusters=2, max_clusters=6,
    standardize_vars=True, 
    generate_charts=True,
    save_results_to_excel=True,
    export_charts=True
)

# OK: Add samples of how to access centroid data from inside the "results" variable returned by the function


### Exploring results in depth

The `KMeansClustering.k_means_range` function returns a dictionary containing the same information as the output of `KMeansClustering.execute_k_means`, for each number of clusters in the executed range.

If you want to access the details of the algorithm for 3 clusters, index the dictionary by that number: `results[3]`
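
The sketch below mimics that access pattern with a stand-in dictionary built directly from scikit-learn, since the exact contents of `results` are not shown here. It assumes integer keys, an inclusive range of 2 to 6, and a `centroids` entry per key. `demo`, `demo_results`, `var_a`, and `var_b` are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in data with three well-separated groups.
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    np.vstack([rng.normal(center, 1, (40, 2)) for center in (0, 6, 12)]),
    columns=["var_a", "var_b"],
)

# Hypothetical stand-in for the range output: one entry per number of clusters.
demo_results = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(demo)
    demo_results[k] = {
        "model": model,
        "centroids": pd.DataFrame(model.cluster_centers_, columns=demo.columns),
    }

# Access the centroids of the 3-cluster solution.
centroids_3 = demo_results[3]["centroids"]
print(centroids_3)

# Compare the solutions, for example by within-cluster sum of squares (inertia).
inertia_by_k = {k: res["model"].inertia_ for k, res in demo_results.items()}
print(inertia_by_k)
```

With the real output, the same indexing would read `results[3]["centroids"]`, assuming the per-cluster dictionaries expose the keys listed for `execute_k_means`.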

## Larger datasets

If you are working with larger datasets, the regular KMeans might run for a long time or consume too many computational resources (sometimes more than are available). In this case, you can use `MiniBatchKMeansClustering`.

It is an implementation that processes the data progressively in smaller chunks. You can control the chunk size and the maximum number of iterations via keyword arguments. Every extra kwarg is forwarded to the [sklearn MiniBatchKMeans](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.MiniBatchKMeans.html?highlight=kmeans#sklearn.cluster.MiniBatchKMeans) class. Please refer to the sklearn documentation for advanced options.

The output format is the same as for the regular KMeans above.
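
For reference, this is roughly what the two forwarded options control when scikit-learn's `MiniBatchKMeans` is used directly. It is a minimal sketch on synthetic data, not the template's implementation. `batch_size` and `max_iter` are real scikit-learn parameters, while the array `X` and its shape are made up for illustration.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Synthetic stand-in data: 10,000 rows in two well-separated groups.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (5000, 3)), rng.normal(10, 1, (5000, 3))])

mbk = MiniBatchKMeans(
    n_clusters=2,
    batch_size=1024,  # number of rows processed in each mini-batch
    max_iter=100,     # maximum number of passes over the full dataset
    n_init=3,
    random_state=0,
)
labels = mbk.fit_predict(X)

print(mbk.cluster_centers_)
print(np.bincount(labels))
```

Smaller batches reduce the work done per update at the cost of noisier centroid estimates, so the resulting clusters can differ slightly from those of the regular KMeans.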

In [None]:
from clustering.mini_batch_k_means import MiniBatchKMeansClustering


variable_names = [
    'crossing', 'finishing', 'heading_accuracy', 'short_passing',
    'dribbling', 'free_kick_accuracy', 'sprint_speed', 'ball_control', 
    'reactions', 'agility', 'sliding_tackle'
]

kmeans_result = MiniBatchKMeansClustering.execute_mini_batch_k_means(
    data=soccer_data,    
    variables=variable_names,
    num_clusters=2,
    standardize_vars=True, 
    generate_charts=True,
    save_results_to_excel=True,
    
    # extra kwargs
    max_iter=100, 
    batch_size=1024
)

The same holds when running over a range of cluster numbers:

In [None]:
from clustering.mini_batch_k_means import MiniBatchKMeansClustering


variable_names = [
    'crossing', 'finishing', 'heading_accuracy', 'short_passing',
    'dribbling', 'free_kick_accuracy', 'sprint_speed', 'ball_control', 
    'reactions', 'agility', 'sliding_tackle'
]

kmeans_result = MiniBatchKMeansClustering.mini_batch_k_means_range(
    data=soccer_data,    
    variables=variable_names,
    min_clusters=2, max_clusters=5,
    standardize_vars=True, 
    generate_charts=False,
    save_results_to_excel=False,
    
    # extra kwargs
    max_iter=100, 
    batch_size=1024
)