In [1]:
# Initial imports
import pandas as pd
from pathlib import Path

## Data Preparation

Read crypto_data.csv into Pandas

In [2]:
file_path = Path("crypto_data.csv")
df = pd.read_csv(file_path)
df.head()

Unnamed: 0.1,Unnamed: 0,CoinName,Algorithm,IsTrading,ProofType,TotalCoinsMined,TotalCoinSupply
0,42,42 Coin,Scrypt,True,PoW/PoS,41.99995,42
1,365,365Coin,X11,True,PoW/PoS,,2300000000
2,404,404Coin,Scrypt,True,PoW/PoS,1055185000.0,532000000
3,611,SixEleven,SHA-256,True,PoW,,611000
4,808,808,SHA-256,True,PoW/PoS,0.0,0


Discard all cryptocurrencies that are not being traded.

In other words, filter for currencies that are currently being traded.

Once you have done this, drop the IsTrading column from the dataframe.

In [3]:
# Keep only the currencies that are currently being traded
df = df.loc[df["IsTrading"] == True]
df.head()

Unnamed: 0.1,Unnamed: 0,CoinName,Algorithm,IsTrading,ProofType,TotalCoinsMined,TotalCoinSupply
0,42,42 Coin,Scrypt,True,PoW/PoS,41.99995,42
1,365,365Coin,X11,True,PoW/PoS,,2300000000
2,404,404Coin,Scrypt,True,PoW/PoS,1055185000.0,532000000
3,611,SixEleven,SHA-256,True,PoW,,611000
4,808,808,SHA-256,True,PoW/PoS,0.0,0


In [4]:
# Remove the IsTrading Column
df = df.drop(columns=["IsTrading"])
df.head()

Unnamed: 0.1,Unnamed: 0,CoinName,Algorithm,ProofType,TotalCoinsMined,TotalCoinSupply
0,42,42 Coin,Scrypt,PoW/PoS,41.99995,42
1,365,365Coin,X11,PoW/PoS,,2300000000
2,404,404Coin,Scrypt,PoW/PoS,1055185000.0,532000000
3,611,SixEleven,SHA-256,PoW,,611000
4,808,808,SHA-256,PoW/PoS,0.0,0


Remove all rows that have at least one null value.

In [5]:
# Count the null values in each column before removing any rows
for column in df.columns:
    print(f"Column {column} has {df[column].isnull().sum()} null values.")

Column Unnamed: 0 has 0 null values.
Column CoinName has 0 null values.
Column Algorithm has 0 null values.
Column ProofType has 0 null values.
Column TotalCoinsMined has 459 null values.
Column TotalCoinSupply has 0 null values.


In [6]:
df = df.dropna(axis=0, how='any')
df.head()

Unnamed: 0.1,Unnamed: 0,CoinName,Algorithm,ProofType,TotalCoinsMined,TotalCoinSupply
0,42,42 Coin,Scrypt,PoW/PoS,41.99995,42
2,404,404Coin,Scrypt,PoW/PoS,1055185000.0,532000000
4,808,808,SHA-256,PoW/PoS,0.0,0
5,1337,EliteCoin,X13,PoW/PoS,29279420000.0,314159265359
7,BTC,Bitcoin,SHA-256,PoW,17927180.0,21000000


In [7]:
for column in df.columns:
    print(f"Column {column} has {df[column].isnull().sum()} null values.")

Column Unnamed: 0 has 0 null values.
Column CoinName has 0 null values.
Column Algorithm has 0 null values.
Column ProofType has 0 null values.
Column TotalCoinsMined has 0 null values.
Column TotalCoinSupply has 0 null values.


Filter for cryptocurrencies that have been mined. 

That is, the total coins mined should be greater than zero.

In [8]:
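# Keep only coins that have been mined: strictly greater than zero,
# which also excludes any negative values
df = df[df['TotalCoinsMined'] > 0]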
print(df)

     Unnamed: 0     CoinName    Algorithm ProofType  TotalCoinsMined  \
0            42      42 Coin       Scrypt   PoW/PoS     4.199995e+01   
2           404      404Coin       Scrypt   PoW/PoS     1.055185e+09   
5          1337    EliteCoin          X13   PoW/PoS     2.927942e+10   
7           BTC      Bitcoin      SHA-256       PoW     1.792718e+07   
8           ETH     Ethereum       Ethash       PoW     1.076842e+08   
...         ...          ...          ...       ...              ...   
1238       ZEPH       ZEPHYR      SHA-256      DPoS     2.000000e+09   
1242        GAP      Gapcoin       Scrypt   PoW/PoS     1.493105e+07   
1245        BDX       Beldex  CryptoNight       PoW     9.802226e+08   
1246        ZEN      Horizen     Equihash       PoW     7.296538e+06   
1247        XBC  BitcoinPlus       Scrypt       PoS     1.283270e+05   

     TotalCoinSupply  
0                 42  
2          532000000  
5       314159265359  
7           21000000  
8                  0

In order for your dataset to be comprehensible to a machine learning algorithm, its data should be numeric. 

Since the coin names do not contribute to the analysis of the data, drop the CoinName column from the dataframe.

In [9]:
df = df.drop(columns=["CoinName"])
df.head()

Unnamed: 0.1,Unnamed: 0,Algorithm,ProofType,TotalCoinsMined,TotalCoinSupply
0,42,Scrypt,PoW/PoS,41.99995,42
2,404,Scrypt,PoW/PoS,1055185000.0,532000000
5,1337,X13,PoW/PoS,29279420000.0,314159265359
7,BTC,SHA-256,PoW,17927180.0,21000000
8,ETH,Ethash,PoW,107684200.0,0


Your next step in data preparation is to convert the remaining features with text values, Algorithm and ProofType, into numerical data.

To accomplish this task, use Pandas to create dummy variables. 

Examine the number of rows and columns of your dataset now. How did they change?

In [10]:
df['Algorithm'].value_counts()

Scrypt                 182
X11                     73
SHA-256                 48
CryptoNight             19
X13                     17
                      ... 
TRC10                    1
SkunkHash v2 Raptor      1
Argon2d                  1
Ouroboros                1
POS 2.0                  1
Name: Algorithm, Length: 72, dtype: int64

In [11]:
df['ProofType'].value_counts()

PoW                     237
PoW/PoS                 176
PoS                      86
DPoS                      9
PoC                       3
PoS/PoW                   2
LPoS                      1
dPoW/PoW                  1
TPoS                      1
Proof of Trust            1
DPoC                      1
Proof of Authority        1
PoW + Hive                1
PoW/PoW                   1
PoST                      1
PoW and PoS               1
PoW/PoS                   1
PoS/PoW/PoT               1
DPOS                      1
PoW/nPoS                  1
HPoW                      1
POBh                      1
PoS/LPoS                  1
Zero-Knowledge Proof      1
Pos                       1
PoA                       1
Name: ProofType, dtype: int64
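The value counts above show many distinct categories in both text columns, so each one becomes its own indicator column when dummy variables are created. The following is a minimal sketch of that step with `pd.get_dummies`, run on a small hypothetical frame (`sample`) rather than the real `df`; the values are copied from the first rows shown earlier, and the ticker column (`Unnamed: 0`) is assumed to have been dropped or moved to the index beforehand, since it is also text.

```python
import pandas as pd

# Hypothetical miniature frame standing in for the cleaned `df`
sample = pd.DataFrame({
    "Algorithm": ["Scrypt", "X13", "SHA-256", "Ethash"],
    "ProofType": ["PoW/PoS", "PoW/PoS", "PoW", "PoW"],
    "TotalCoinsMined": [41.99995, 2.927942e10, 1.792718e07, 1.076842e08],
    "TotalCoinSupply": [42, 314159265359, 21000000, 0],
})

# One indicator column per distinct Algorithm and ProofType value
X = pd.get_dummies(sample, columns=["Algorithm", "ProofType"])

# The row count is unchanged; the column count grows with the number of categories
print(sample.shape, "->", X.shape)
print(list(X.columns))
```

On the real dataset the same call would be applied to `df`, and comparing `df.shape` before and after answers the question of how the rows and columns changed.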

Standardize your dataset so that columns that contain larger values do not unduly influence the outcome.
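One common way to standardize is scikit-learn's `StandardScaler`, which rescales every column to zero mean and unit variance. The sketch below assumes that approach and uses a small hypothetical matrix (`X`) in place of the encoded dataset; the numbers are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric matrix: two large-valued columns and one dummy column
X = np.array([
    [4.199995e01, 4.2e01, 1.0],
    [1.055185e09, 5.32e08, 1.0],
    [2.927942e10, 3.14159265359e11, 0.0],
    [1.792718e07, 2.1e07, 0.0],
])

# Fit the scaler and transform in one step
X_scaled = StandardScaler().fit_transform(X)

# After scaling, each column has mean ~0 and standard deviation ~1
print(X_scaled.mean(axis=0).round(6))
print(X_scaled.std(axis=0).round(6))
```

Without this step, columns such as TotalCoinSupply, whose values span many orders of magnitude, would dominate the variance that PCA and k-means rely on.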

## Dimensionality Reduction

Creating dummy variables above dramatically increased the number of features in your dataset. 

Perform dimensionality reduction with PCA.

Rather than specify the number of principal components when you instantiate the PCA model, it is possible to state the desired explained variance. 

For example, say that a dataset has 100 features. Using PCA(n_components=0.99) creates a model that keeps the smallest number of principal components needed to explain at least 99% of the variance, whether that turns out to be 80 components or 3.
    
For this project, preserve 90% of the explained variance in dimensionality reduction. How did the number of features change?
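A minimal sketch of the 90% setting follows. It runs on synthetic data (a hypothetical `X_scaled` with 20 features built from 3 underlying factors plus noise), not on the crypto dataset, so the component count it prints says nothing about the real result.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the standardized data: 20 features driven by 3 factors
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 20))
X_scaled = latent @ mixing + 0.1 * rng.normal(size=(200, 20))

# A float in (0, 1) asks PCA to keep enough components to reach that variance share
pca = PCA(n_components=0.90)
X_pca = pca.fit_transform(X_scaled)

print(f"{X_scaled.shape[1]} features -> {pca.n_components_} principal components")
print(f"explained variance kept: {pca.explained_variance_ratio_.sum():.3f}")
```

The fitted attribute `n_components_` reports how many components were kept, which is the number to compare against the original feature count.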

Next, further reduce the dataset dimensions with t-SNE and visually inspect the results.

In order to accomplish this task, run t-SNE on the principal components: the output of the PCA transformation. 


Then create a scatter plot of the t-SNE output. Observe whether there are distinct clusters or not.
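The t-SNE step and its scatter plot could look roughly like the sketch below. The input `X_pca` here is random hypothetical data standing in for the PCA output, and the `perplexity` and `learning_rate` values are illustrative choices, not settings taken from this project.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch also runs outside a notebook
import matplotlib.pyplot as plt
import numpy as np
from sklearn.manifold import TSNE

# Hypothetical stand-in for the PCA output: 150 rows, 10 principal components
rng = np.random.default_rng(0)
X_pca = rng.normal(size=(150, 10))

# Reduce to two dimensions for plotting
tsne = TSNE(n_components=2, perplexity=30, learning_rate=200.0,
            init="pca", random_state=0)
embedding = tsne.fit_transform(X_pca)

# Scatter plot of the two t-SNE dimensions
fig, ax = plt.subplots()
ax.scatter(embedding[:, 0], embedding[:, 1], s=10)
ax.set_xlabel("t-SNE dimension 1")
ax.set_ylabel("t-SNE dimension 2")
ax.set_title("t-SNE of principal components")
```

In a notebook the `matplotlib.use("Agg")` line can be dropped so the figure renders inline. Because t-SNE is stochastic, fixing `random_state` keeps the picture reproducible between runs.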

## Cluster Analysis with k-Means

Create an elbow plot to identify the best number of clusters. Use a for-loop to determine the inertia for each k from 1 through 10.

Determine, if possible, where the elbow of the plot is, and at which value of k it appears.
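The loop and elbow plot can be sketched as follows. The data comes from `make_blobs` as a hypothetical substitute for the reduced crypto dataset, so the location of the elbow here reflects only the synthetic blobs.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch also runs outside a notebook
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical data with 4 well-separated groups
X_demo, _ = make_blobs(n_samples=300, centers=4, n_features=2, random_state=0)

# Fit k-means for k = 1..10 and record the inertia of each model
k_values = list(range(1, 11))
inertia = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    model.fit(X_demo)
    inertia.append(model.inertia_)

# Elbow plot: inertia against the number of clusters
fig, ax = plt.subplots()
ax.plot(k_values, inertia, marker="o")
ax.set_xticks(k_values)
ax.set_xlabel("Number of clusters (k)")
ax.set_ylabel("Inertia")
ax.set_title("Elbow plot")
```

The elbow is the value of k after which inertia stops dropping sharply. If the curve bends smoothly with no clear corner, that is itself evidence that the data may not separate into distinct clusters.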

## Recommendation

Based on your findings, make a brief (1-2 sentences) recommendation to your clients. 

Can the cryptocurrencies be clustered together? 

If so, into how many clusters?