# Clustering Crypto

# Background:
Accountability Accounting, a prominent investment bank, is interested in offering a new cryptocurrency investment portfolio for its customers. The company, however, is lost in the vast universe of cryptocurrencies. 

# Problem Statement:
Identify what cryptocurrencies are on the trading market and how they could be grouped to create a classification system for this new investment.

# Deliverables:
* Deliverable 1: Preprocessing the Data for PCA
* Deliverable 2: Reducing Data Dimensions Using PCA
* Deliverable 3: Clustering Cryptocurrencies Using K-means
* Deliverable 4: Visualizing Cryptocurrencies Results

In [1]:
# pip install plotly-express

In [2]:
# pip install hvplot

In [3]:
# import initial imports
import numpy as np
import pandas as pd
import hvplot.pandas
# from path import Path
import plotly.express as px
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans




## Preprocessing the Data for PCA (Deliverable 1) 

### Questions for Data Preparation:
* What knowledge do we hope to glean from running an unsupervised learning model on this dataset?
* What data is available? What type? What is missing? What can be removed?
* Is the data in a format that can be passed into an unsupervised learning model?
* Can I quickly hand off this data for others to use?

In [None]:
# Load the crypto_data.csv dataset.
# data source: https://min-api.cryptocompare.com/data/all/coinlist
file_path = "./Resources/crypto_data.csv"

df_crypto = pd.read_csv(file_path)
df_crypto

### Question: What knowledge do we hope to glean from running an unsupervised learning model on this dataset?

Answer: Identify what cryptocurrencies are on the trading market and how they could be grouped to create a classification system for a new investment.

### Question: What type of data is available?

In [None]:
# identify missing values and dtypes using info() method
df_crypto.info()

### Question: What data is missing?

In [None]:
# find null values using isnull() method
for column in df_crypto.columns:
    print(f"Column {column} has {df_crypto[column].isnull().sum()} null values")
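
The per-column loop above can also be written as a single vectorized call. A minimal sketch, assuming a small made-up frame in place of `df_crypto` (the coin names and values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the crypto table; values are invented for illustration.
df = pd.DataFrame({
    "CoinName": ["Alpha", "Beta", "Gamma"],
    "Algorithm": ["Scrypt", "X11", "SHA-256"],
    "TotalCoinsMined": [41.0, np.nan, 0.0],
    "TotalCoinSupply": ["42", "1000", None],
})

# isnull() marks each missing cell; sum() counts them per column.
null_counts = df.isnull().sum()
print(null_counts)
```

The result is a Series indexed by column name, so it can be filtered or sorted like any other pandas object.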
    

In [None]:
# view the values counts for "IsTrading"
df_crypto["IsTrading"].value_counts()

### Question: What data can be removed?

#### Keep all the cryptocurrencies that are being traded.
Drop the cryptocurrencies that are not trading.

In [None]:
# Keep all the cryptocurrencies that are being traded ("True").
df_crypto_trading = df_crypto.loc[(df_crypto["IsTrading"] == True)]
df_crypto_trading["IsTrading"].value_counts()

In [None]:
df_crypto_trading.info()

In [None]:
# Remove the "IsTrading" column. 
new_df_crypto = df_crypto_trading.drop(['IsTrading'], axis="columns")
new_df_crypto

#### Keep all the cryptocurrencies that have a working algorithm.
Drop those that don't have a working algorithm (NaN).

In [None]:
# view value_counts for Algorithm
new_df_crypto["Algorithm"].value_counts()

# note: there are no null values

#### Remove rows that have at least one null value.

In [None]:
new_df_crypto.info()

In [None]:
new_df_crypto.columns

In [None]:
# check for null values for "ProofType"
new_df_crypto["ProofType"].isnull().value_counts()

In [None]:
# check for null values for "TotalCoinsMined"
new_df_crypto["TotalCoinsMined"].isnull().value_counts()

In [None]:
# check for null values for "TotalCoinSupply"
new_df_crypto["TotalCoinSupply"].isnull().value_counts()

#### Keep all the cryptocurrencies that are being mined
Drop those where TotalCoinsMined is less than or equal to zero.

In [None]:
# Remove rows that have at least one null value.
# this will be all rows where "TotalCoinsMined" is not a numeric value
clean_df_crypto = new_df_crypto.dropna(how='any', axis='rows')
clean_df_crypto.info()

In [None]:
# remove all the rows that do not have coins being mined
clean_df_crypto = clean_df_crypto[clean_df_crypto["TotalCoinsMined"] > 0]
clean_df_crypto.info()
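
The filtering steps so far (keep trading coins, drop the flag column, drop rows with nulls, keep coins with a positive mined total) can be chained into one expression. The sketch below uses a hypothetical four-row frame rather than the real dataset, so the names and numbers are assumptions:

```python
import numpy as np
import pandas as pd

# Hypothetical rows standing in for df_crypto.
toy = pd.DataFrame({
    "CoinName": ["Alpha", "Beta", "Gamma", "Delta"],
    "IsTrading": [True, False, True, True],
    "TotalCoinsMined": [41.0, 10.0, np.nan, 0.0],
})

cleaned = (
    toy.loc[toy["IsTrading"]]          # keep coins that are trading
       .drop(columns="IsTrading")      # the flag is constant after filtering
       .dropna(how="any")              # drop rows with at least one null
       .query("TotalCoinsMined > 0")   # keep coins that have been mined
)
print(cleaned)
```

Only "Alpha" survives here: "Beta" is not trading, "Gamma" has a null, and "Delta" has no mined coins.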

In [None]:
# recast "TotalCoinSupply" as numeric
# pandas.to_numeric(arg, errors='raise', downcast=None)
supply_clean_df_crypto = clean_df_crypto.copy()
supply_clean_df_crypto["TotalCoinSupply"] = pd.to_numeric(supply_clean_df_crypto["TotalCoinSupply"], errors='coerce')
supply_clean_df_crypto.info()
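
With `errors='coerce'`, any string that cannot be parsed becomes `NaN` instead of raising an error. The sketch below shows that behavior on a hypothetical Series (the value "unlimited" is an invented example, not taken from the dataset):

```python
import pandas as pd

# Hypothetical supply values stored as text.
supply = pd.Series(["42", "21000000", "unlimited"])

# Unparseable entries are replaced by NaN rather than raising ValueError.
numeric = pd.to_numeric(supply, errors="coerce")
print(numeric)
print(numeric.isna().sum())
```

If any value in the real column fails to parse, the conversion would introduce nulls after the earlier `dropna()` call, so a second null check on `TotalCoinSupply` may be worthwhile.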

#### Create a new DataFrame that holds only the cryptocurrency names
use the crypto_df DataFrame index as the index for this new DataFrame.

In [None]:
# Create a new DataFrame that holds only the cryptocurrencies names.
df_crypto_names = supply_clean_df_crypto.copy()
df_crypto_names = df_crypto_names[["CoinName", "Unnamed: 0"]]

# use the crypto_df DataFrame index as the index for this new DataFrame.
df_crypto_names.set_index("Unnamed: 0", drop=True, inplace=True)

df_crypto_names

#### Remove the CoinName column from the crypto_df DataFrame 
since it's not going to be used on the clustering algorithm.

In [None]:
crypto_df = supply_clean_df_crypto.copy()
crypto_df.drop(columns=["CoinName"], inplace=True)
crypto_df.info()

In [None]:
# set index to "Unnamed: 0"
crypto_df.set_index("Unnamed: 0", drop=True, inplace=True)
crypto_df

#### Use the get_dummies() method 
to create variables for the text features, which are then stored in a new DataFrame, X_encoded

In [None]:
# Use get_dummies() to create variables for text features.
print(f"crypto_df shape: {crypto_df.shape}")

X = crypto_df.copy()
X.info()

In [None]:
# convert the string values into numerical ones using the get_dummies() method.
X_encoded = pd.get_dummies(X, columns=["Algorithm", "ProofType"])
X_encoded.shape
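
To make the effect of `get_dummies()` concrete, here is a minimal sketch on a hypothetical three-row frame (the tickers and values are invented). Each distinct text value becomes its own indicator column, while numeric columns pass through unchanged:

```python
import pandas as pd

# Hypothetical stand-in for crypto_df.
toy = pd.DataFrame(
    {
        "Algorithm": ["Scrypt", "X11", "Scrypt"],
        "ProofType": ["PoW", "PoW/PoS", "PoS"],
        "TotalCoinsMined": [41.0, 1.0e9, 2.9e10],
    },
    index=["AAA", "BBB", "CCC"],
)

# One indicator column per distinct value of each listed text column.
encoded = pd.get_dummies(toy, columns=["Algorithm", "ProofType"])
print(encoded.columns.tolist())
print(encoded.shape)
```

Two algorithms and three proof types yield five indicator columns plus the one numeric column, which is why the encoded frame is much wider than the original when there are many categories.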

In [None]:
# explore the scale of column values using describe method
X_encoded.describe()

# note: when there are large differences in scale it will affect our machine learning model
# the solution is to scale using StandardScaler()

#### Standardize the features from the X_encoded DataFrame 
using the StandardScaler fit_transform() method

In [None]:
# Standardize the data with StandardScaler().
# create an instance of the StandardScaler class
data_scaler = StandardScaler()

# train and transform the data_scaler using the fit_transform() method
X_scaled = data_scaler.fit_transform(X_encoded)
X_scaled[: 5]
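
As a sanity check on what `StandardScaler` does, the sketch below compares it with the manual formula `(x - mean) / std` on a small made-up array. The data is hypothetical; the comparison relies on NumPy's default population standard deviation (`ddof=0`), which is what the scaler uses:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two hypothetical features on very different scales.
X = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])

scaled = StandardScaler().fit_transform(X)

# Manual standardization, column by column (np.std defaults to ddof=0).
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(scaled, manual))
```

After scaling, each column has mean 0 and standard deviation 1, so neither feature dominates distance calculations.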

#### Confirm data processing is complete:
* Null values are handled.
* Only numerical data is used.
* Values are scaled. In other words, each feature has been standardized to zero mean and unit variance so that differences in scale between columns won't skew the results.


## Reducing Data Dimensions Using PCA (Deliverable 2)
PCA reduces the number of dimensions by transforming a large set of variables into a smaller one that contains most of the information in the original large set.

In [None]:
# Using PCA to reduce dimension to three principal components.

# initialize the PCA model
pca = PCA(n_components= 3, random_state=5)

# use X_scaled array and the pca model to reduce the components to 3
X_pca = pca.fit_transform(X_scaled)

# Note: These new components are just the three main dimensions of variation 
# that contain most of the information in the original dataset.

X_pca

In [None]:
# Create a DataFrame with the principal components.
# Transform PCA data into a DataFrame
# use the index from X_encoded (Unnamed: 0)
           
pcs_df = pd.DataFrame(
        data=X_pca,
        columns= ["PC 1", "PC 2", "PC 3"],
        index= X_encoded.index
)

pcs_df

In [None]:
# fetch the explained variance
pca.explained_variance_ratio_

#### What this tells us is:
* the first principal component contains 2.79% of the variance 
* the second contains 2.14% 
* the third contains 2.05%

Together, the three principal components retain only about 7% of the variance in the scaled data, which is very low, but reducing to three components allows plotting in 3D.
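
One way to see how many components a given share of variance would need is to look at the cumulative explained variance ratio. This is a sketch on synthetic random data, not the crypto dataset; the 90% threshold is an arbitrary assumption chosen for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for X_scaled: 200 samples, 10 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))

# Fit with all components and accumulate the variance ratios.
pca_full = PCA(svd_solver="full").fit(X)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# A float n_components asks PCA to keep enough components to exceed that share.
pca_90 = PCA(n_components=0.90, svd_solver="full").fit(X)
print(pca_90.n_components_)
print(cumulative[pca_90.n_components_ - 1])
```

For visualization, three components are still the practical choice; this check only quantifies how much information is left out.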


## Clustering Cryptocurrencies Using K-Means (Deliverable 3)

#### Finding the Best Value for `k` Using the Elbow Curve

In [None]:
# Create an elbow curve to find the best value for K.

# use the elbow in the curve with the generated principal components (pcs_df)
# store values of K to plot in an empty list
inertia = []
k = list(range(1, 11)) # generates a list 1 to 10

# iterate through the K values and record the inertia for each
for i in k:
    km = KMeans(n_clusters=i, random_state=0)
    km.fit(pcs_df)
    inertia.append(km.inertia_)
    
# create a dataframe (using a dict) to plot the elbow in the curve

# create the dictionary
elbow_data = {"k": k, "inertia": inertia}

df_elbow_data = pd.DataFrame(elbow_data)
df_elbow_data
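
The inertia stored for each `k` is the sum of squared distances from every point to its assigned cluster center. The sketch below recomputes it by hand on synthetic, well-separated blobs (the data and the choice of `n_init=10` are assumptions for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

# Three synthetic, well-separated blobs of 30 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Look up each point's assigned center, then sum the squared distances.
assigned = km.cluster_centers_[km.labels_]
manual_inertia = ((X - assigned) ** 2).sum()

print(manual_inertia, km.inertia_)
```

Inertia generally falls as `k` grows, which is why the elbow (the point where the decrease flattens) is used rather than the minimum.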

In [None]:
# graph the elbow data
df_elbow_data.hvplot.line(x="k", y="inertia", xticks=k, title="Elbow Curve")

#### Use the principal components data with the K-means algorithm with a K value of 4. 

Running K-Means with `k=4`

In [None]:
# Initialize the K-Means model.
model = KMeans(n_clusters= 4, random_state=0)

# Fit the model with pcs_df
model.fit(pcs_df)

# Predict clusters
predictions = model.predict(pcs_df)
predictions
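
Fitting and then predicting on the same data can be combined with `fit_predict()`. The sketch below, on synthetic two-cluster data (an assumption, not the PCA output), shows that the returned labels match `labels_`:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two synthetic, well-separated groups of 20 points each.
rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(20, 3)),
    rng.normal(loc=8.0, scale=0.3, size=(20, 3)),
])

km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X)  # fit the model and return the training labels

print(np.array_equal(labels, km.labels_))
```

The cluster numbers themselves are arbitrary identifiers; only the grouping they induce is meaningful.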

#### Create a new DataFrame named clustered_df 
by concatenating the crypto_df and pcs_df DataFrames along the columns (axis=1), which aligns rows on the shared index. The index should be the same as the crypto_df DataFrame.

In [None]:
# Create a new DataFrame (called clustered_df) using concat() method
# that concats cryptocurrencies features (crypto_df and pcs_df)

clustered_df = pd.concat([crypto_df, pcs_df], axis=1) # see defaults for concat
clustered_df
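
With `axis=1`, `pd.concat()` matches rows by index label rather than by position. A minimal sketch with two hypothetical frames whose rows are in different orders:

```python
import pandas as pd

# Hypothetical feature and component frames sharing the same index labels.
features = pd.DataFrame({"Algorithm": ["Scrypt", "X11"]}, index=["AAA", "BBB"])
components = pd.DataFrame({"PC 1": [0.5, -0.3]}, index=["BBB", "AAA"])

# Rows are aligned on the index labels, not on row position.
combined = pd.concat([features, components], axis=1)
print(combined)
```

Here "AAA" receives -0.3 even though that value is second in `components`. This is why building `pcs_df` with the index of `X_encoded` matters: without matching labels, the concatenation would produce rows filled with NaN.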

In [None]:
# Break out the CoinName values from the DataFrame, indexed by "Unnamed: 0"
# Add a new column, "CoinName" to the clustered_df DataFrame that holds the names of the cryptocurrencies. 
# this requires df_crypto_names
df_crypto_names

In [None]:
#  Add a new column, "CoinName" to the clustered_df DataFrame that holds the names of the cryptocurrencies. 
clustered_df["CoinName"] = df_crypto_names["CoinName"]

#  Add a new column, "Class" to the clustered_df DataFrame that holds the predictions.
# using the model.labels_ from the KMeans model fit with the pcs_df data
clustered_df["Class"] = model.labels_

# Print the shape of the clustered_df
print(clustered_df.shape)
clustered_df.head(10)

## Visualizing Cryptocurrencies Results (Deliverable 4)

#### Create a 3D scatter plot 
using the Plotly Express scatter_3d() function to plot the clusters from the clustered_df DataFrame.
https://plotly.com/python-api-reference/generated/plotly.express.scatter_3d.html#plotly-express-scatter-3d

Add the CoinName and Algorithm columns to the hover_name and hover_data parameters, respectively, so each data point shows the CoinName and Algorithm on hover.

In [None]:
import plotly.io as pio
pio.kaleido.scope.default_format = "svg"

In [None]:
# Creating a 3D-Scatter with the PCA data and the clusters
# plot the result in 3D
fig = px.scatter_3d(
    clustered_df,
    x="PC 1",
    y="PC 2",
    z="PC 3",
    color="Class",
    symbol="Class",
    hover_name="CoinName",
    hover_data= ["Algorithm"],
    width = 1000,
    height = 750,
)

fig.update_layout(legend=dict(x=0,y=1))

# save an image of the figure
fig.write_image("./Images/fig1.svg")
fig.show()

#### Create a table with tradable cryptocurrencies 
using the hvplot.table() function.
https://hvplot.holoviz.org/reference/pandas/table.html#table

In [None]:
clustered_df.head()

In [None]:
# Create a table with tradable cryptocurrencies (by coin name).
# example from docs df.hvplot.table(columns=['origin', 'name', 'yr'], sortable=True, selectable=True)
# 

# to save plot
plot = clustered_df[[
    "CoinName",
    "Algorithm",
    "ProofType",
    "TotalCoinsMined",
    "TotalCoinSupply",
    "Class"
]].hvplot.table(sortable=True, selectable=True)

hvplot.save(plot, './Images/Crypto_table.html')

# to display in notebook
clustered_df[[
    "CoinName",
    "Algorithm",
    "ProofType",
    "TotalCoinsMined",
    "TotalCoinSupply",
    "Class"
]].hvplot.table(sortable=True, selectable=True)

#### Print the total number of tradable cryptocurrencies in the clustered_df DataFrame.

In [None]:
# Print the total number of tradable cryptocurrencies (in data source: https://min-api.cryptocompare.com/data/all/coinlist)
clustered_df.shape
print(f"There are {clustered_df.shape[0]} total tradable cryptocurrencies in the dataset")

#### Scale the TotalCoinSupply and TotalCoinsMined columns between the given range of zero and one.
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html#sklearn.preprocessing.MinMaxScaler.fit_transform

In [None]:
# Scaling data to create the scatter plot with tradable cryptocurrencies.
# Use the MinMaxScaler().fit_transform method to scale the TotalCoinSupply and TotalCoinsMined columns between the given range of zero and one.
scaler = MinMaxScaler()

# use fit_transform and identify X and y
scatter_plot_data = scaler.fit_transform(
    clustered_df[["TotalCoinSupply", "TotalCoinsMined"]])
                 
scatter_plot_data
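
`MinMaxScaler` with its default range maps each column to `[0, 1]` using `(x - min) / (max - min)`. The sketch below checks that against a manual calculation on a small hypothetical array:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical supply and mined totals (columns) for three coins (rows).
X = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 40.0]])

scaled = MinMaxScaler().fit_transform(X)

# Manual min-max scaling, column by column.
manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(np.allclose(scaled, manual))
```

Because the mapping depends on the column minimum and maximum, a single extreme coin supply can compress all other values toward zero in the scatter plot.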

#### Create a new DataFrame using the clustered_df DataFrame index that contains the scaled data created above.

#### Add the CoinName column from the clustered_df DataFrame to the new DataFrame.

#### Add the Class column from the clustered_df DataFrame to the new DataFrame.

In [None]:
# Create a new DataFrame that has the scaled data with the clustered_df DataFrame index.
df_for_plot = pd.DataFrame(
    scatter_plot_data,
    columns=["TotalCoinSupply", "TotalCoinsMined"],
    index=clustered_df.index    
)

df_for_plot

# Add the "CoinName" column from the clustered_df DataFrame to the new DataFrame.
df_for_plot["CoinName"] = clustered_df["CoinName"]

# Add the "Class" column from the clustered_df DataFrame to the new DataFrame. 
df_for_plot["Class"] = clustered_df["Class"]

df_for_plot.head(10)

#### Create an hvplot scatter plot 
with x="TotalCoinsMined", y="TotalCoinSupply", and by="Class", and have it show the CoinName when you hover over the data point.
https://hvplot.holoviz.org/user_guide/Customization.html

In [None]:
# Create a hvplot.scatter plot using x="TotalCoinsMined" and y="TotalCoinSupply".
# hover_cols includes "CoinName" so the coin name appears in the hover tooltip.
plot2 = df_for_plot.hvplot.scatter(
    x="TotalCoinsMined",
    y="TotalCoinSupply",
    hover_cols=["CoinName", "Class"],
    by="Class"
)

# to save plot
hvplot.save(plot2, './Images/Crypto_ScatterPlot.png')

# to display in notebook
plot2