## Unsupervised Learning

In this section we begin exploring some unsupervised learning concepts.

We have two main goals for this section:
1. Feature Importance Determination
2. Identifying Similar Player Profiles among athletes with and without NIL Evaluations

In [1]:
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler, StandardScaler

#### Feature Importance Determination

The outputs of this feature importance determination will be crucial for plotting and visualizing the results of the KMeans effort below!

Since our goal is to reduce the problem down for visualization purposes, we are going to explore options in which the number of components is either 2 or 3. A further analysis of feature importance should also be performed.

- Principal Components Analysis (PCA)
    - Linear projection onto the directions of greatest variance in the data
- Multidimensional Scaling (MDS)
    - Distance preserving low-dimensional projection
    - Locally influenced
    - If clusters are far apart in the original data, then they will be far apart in the MDS projection
    - Weights can be introduced to handle missing data in MDS! We have lots of missing data in this project...
- t-Distributed Stochastic Neighbor Embedding (t-SNE)
    - Finds a low-dimensional output
    - Preserves similarities in high-dimensional data
    - It is focused on preserving LOCAL DISTANCES between neighbors, not so much global structure
    - Perplexity is the key parameter
        - Experiment with multiple values here!
    - Can give different output every time it runs, unless a random seed is fixed!
- Uniform Manifold Approximation & Projection (UMAP)
    - Similar to t-SNE --> Preserves local neighborhood structure
    - But it also preserves some global structure too!
    - It is a useful technique for plotting AS WELL AS clustering, whereas t-SNE should not be used for clustering...
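
The note about experimenting with perplexity can be sketched as follows. This is a minimal illustration, not the project's pipeline: `X_demo` is a synthetic stand-in for the athlete feature matrix, and the perplexity values are arbitrary choices for demonstration. Fixing `random_state` is what makes repeated runs comparable.

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the real feature matrix (150 rows, 20 columns).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(150, 20))

# Standardize so every feature contributes on a comparable scale.
X_scaled = StandardScaler().fit_transform(X_demo)

# Fit one 2-D embedding per perplexity value. Perplexity must be smaller
# than the number of samples. random_state is fixed so that differences
# between embeddings come from perplexity rather than from the random seed.
embeddings = {}
for perplexity in (5, 30, 50):
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=0)
    embeddings[perplexity] = tsne.fit_transform(X_scaled)

for perplexity, emb in embeddings.items():
    print(perplexity, emb.shape)
```

Plotting each embedding side by side is a common way to judge which perplexity gives the most readable picture for a given dataset.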


For very large, high-dimensional datasets it is sometimes good to apply PCA first and then run t-SNE/UMAP on the reduced data!
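
A rough sketch of that two-stage idea, assuming synthetic data in place of the real features. The array `X_demo`, the choice of 10 intermediate components, and the perplexity of 30 are all illustrative assumptions rather than tuned values.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the real feature matrix (200 rows, 40 columns).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 40))

# Standardize so no single feature dominates the variance.
X_scaled = StandardScaler().fit_transform(X_demo)

# Stage 1: PCA compresses the 40 features down to 10 components.
X_pca = PCA(n_components=10, random_state=0).fit_transform(X_scaled)

# Stage 2: t-SNE embeds the PCA output in 2 dimensions for plotting.
X_embedded = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X_pca)

print(X_pca.shape, X_embedded.shape)
```

The PCA stage mainly cuts noise and runtime before the more expensive neighbor embedding; the number of intermediate components is something to experiment with.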

In [None]:
# ===== MDS =====
from sklearn.manifold import MDS
# X_norm = StandardScaler().fit_transform(X_not_norm)

# mds = MDS(n_components=2)
# X_mds = mds.fit_transform(X_norm)

# plot(X_mds, output, [NAMES_OF_CLUSTERS])
# ===============

# ==== tSNE =====
from sklearn.manifold import TSNE
# tsne = TSNE(random_state=0)
# X_tsne = tsne.fit_transform(X_norm)

# plot(X_tsne, output, [NAMES_OF_CLUSTERS])
# ===============

#### Similar Player Profiles

Notes regarding KMeans clustering:
- Different initializations can result in different solutions. Performing multiple runs is a good idea.
    - Be careful about where you start
    - One option is to place the first centroid randomly and each later one as far away as possible from those already placed
- The centroid is typically the mean of the points in the cluster.
    - This works only when the values are continuous in nature. K-Medoids can be used if non-continuous columns are used.
- "Closeness" can take the form of Euclidean Distance, Cosine Similarity, Correlation, etc.

KMeans works well on simple clusters that are similar in size, well separated, and globular. Complex shapes... not so much...
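
The point about initialization can be tried out with a small sketch. It assumes synthetic, well-separated blobs (`X_demo`) rather than the real athlete data. In scikit-learn, `n_init` controls how many initializations are run (the best one by inertia is kept), and `init="k-means++"` spreads the starting centroids apart, which is similar in spirit to the "far away as possible" idea above, though it samples probabilistically rather than picking the farthest point.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic, globular clusters: the easy case for KMeans (hypothetical data).
X_demo, _ = make_blobs(n_samples=300, centers=5, cluster_std=1.0, random_state=0)

# A single run from purely random starting centroids.
single = KMeans(n_clusters=5, init="random", n_init=1, random_state=0).fit(X_demo)

# Ten runs with k-means++ seeding; scikit-learn keeps the run with the
# lowest inertia (sum of squared distances to the assigned centroid).
multi = KMeans(n_clusters=5, init="k-means++", n_init=10, random_state=0).fit(X_demo)

print("single run inertia:", single.inertia_)
print("best of 10 inertia:", multi.inertia_)
```

Comparing the two inertia values across a few seeds gives a feel for how sensitive a dataset is to the starting centroids.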

##### Visualization Tips

When it comes to plotting, it will be difficult to show the normal ol KMeans results directly because our data has lots of columns.
- PCA would work to reduce the data down to maybe two principal components
- t-SNE would also work to reduce the data down to something that is more interpretable
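
One possible way to combine the two steps is sketched below: cluster in the full feature space, then project to two principal components only for display. The data (`X_demo`), the cluster count, and the helper `plot_clusters_2d` are hypothetical and only illustrate the idea; the helper assumes matplotlib is installed.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

# Hypothetical stand-in for the real feature matrix (150 rows, 12 columns).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(150, 12))

# Cluster on the scaled, full-dimensional data.
X_scaled = MinMaxScaler().fit_transform(X_demo)
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X_scaled)

# Project points and centroids with the same fitted PCA so they share axes.
pca = PCA(n_components=2)
coords = pca.fit_transform(X_scaled)
centers_2d = pca.transform(kmeans.cluster_centers_)


def plot_clusters_2d(coords, labels, centers):
    """Scatter the 2-D projection, colored by cluster label (hypothetical helper)."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting

    fig, ax = plt.subplots()
    ax.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="tab10", s=15)
    ax.scatter(centers[:, 0], centers[:, 1], c="black", marker="x", s=80)
    ax.set_xlabel("PC 1")
    ax.set_ylabel("PC 2")
    return fig
```

Calling `plot_clusters_2d(coords, kmeans.labels_, centers_2d)` would draw the figure. Swapping PCA for t-SNE is possible for the points, but t-SNE has no `transform` method, so the centroids could not be projected the same way.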

In [None]:
# We must specify the number of clusters, K, in advance
# We need to pick K as well as the K points that will act as the initial centroids

def Custom_KMeans(X_train, RANDOM_STATE=0):
    """
        :: Input(s) ::
            X_train - training data
        :: Output(s) ::
            A scatter plot showing the clustering
    """

    # Remember that X_train does not contain any NIL information, that is perfectly okay because we are interested in all athletes
    # regardless of their NIL evaluation

    # We may want to normalize the features we have in some way, shape, or form...
    X = X_train
    X_normalized = MinMaxScaler().fit_transform(X)

    # KMeans Setup
    clusters = 5
    kmeans = KMeans(n_clusters=clusters,
                    random_state=RANDOM_STATE)
    # Fit on the normalized features so no single column dominates the distances
    kmeans.fit(X_normalized)

    # kmeans.labels_

    # PCA or T-SNE Setup

    return None