### Libraries and extensions used in this project

This section imports the libraries required for the data analysis and clustering tasks.

- `pandas` is a powerful library for data manipulation and analysis.
- `hvplot.pandas` is an extension for pandas that enables interactive plotting using the HoloViews library.
- `KMeans` is an implementation of the K-means clustering algorithm, a popular unsupervised machine learning algorithm for clustering data points into groups.
- `PCA` is used for Principal Component Analysis, a dimensionality reduction technique.
- `StandardScaler` standardizes the features by removing the mean and scaling to unit variance. This matters for distance-based clustering algorithms such as K-means, where features on larger scales would otherwise dominate the result.


In [205]:
# Import required libraries and dependencies
import pandas as pd

# hvplot.pandas is an extension for pandas that enables interactive plotting using the HoloViews library.
import hvplot.pandas

# KMeans is an implementation of the K-means clustering algorithm,
# a popular unsupervised machine learning algorithm for clustering data points into groups.
from sklearn.cluster import KMeans

# PCA is used for Principal Component Analysis,
# a dimensionality reduction technique.
from sklearn.decomposition import PCA

# StandardScaler standardizes the features by removing the mean and scaling to unit variance.
# This matters for distance-based algorithms such as K-means.
from sklearn.preprocessing import StandardScaler

# Suppress warning messages
import warnings
warnings.filterwarnings("ignore")



In [206]:
# Load the data into a Pandas DataFrame
df_market_data = pd.read_csv(
    "Resources/crypto_market_data.csv",
    index_col="coin_id")

# Display sample data
df_market_data.head(10)

coin_id,price_change_percentage_24h,price_change_percentage_7d,price_change_percentage_14d,price_change_percentage_30d,price_change_percentage_60d,price_change_percentage_200d,price_change_percentage_1y
bitcoin,1.08388,7.60278,6.57509,7.67258,-3.25185,83.5184,37.51761
ethereum,0.22392,10.38134,4.80849,0.13169,-12.8889,186.77418,101.96023
tether,-0.21173,0.04935,0.0064,-0.04237,0.28037,-0.00542,0.01954
ripple,-0.37819,-0.60926,2.24984,0.23455,-17.55245,39.53888,-16.60193
bitcoin-cash,2.90585,17.09717,14.75334,15.74903,-13.71793,21.66042,14.49384
binancecoin,2.10423,12.85511,6.80688,0.05865,36.33486,155.61937,69.69195
chainlink,-0.23935,20.69459,9.30098,-11.21747,-43.69522,403.22917,325.13186
cardano,0.00322,13.99302,5.55476,10.10553,-22.84776,264.51418,156.09756
litecoin,-0.06341,6.60221,7.28931,1.21662,-17.2396,27.49919,-12.66408
bitcoin-cash-sv,0.9253,3.29641,-1.86656,2.88926,-24.87434,7.42562,93.73082


In [207]:
# Generate summary statistics
df_market_data.describe()

Unnamed: 0,price_change_percentage_24h,price_change_percentage_7d,price_change_percentage_14d,price_change_percentage_30d,price_change_percentage_60d,price_change_percentage_200d,price_change_percentage_1y
count,41.0,41.0,41.0,41.0,41.0,41.0,41.0
mean,-0.269686,4.497147,0.185787,1.545693,-0.094119,236.537432,347.667956
std,2.694793,6.375218,8.376939,26.344218,47.365803,435.225304,1247.842884
min,-13.52786,-6.09456,-18.1589,-34.70548,-44.82248,-0.3921,-17.56753
25%,-0.60897,0.04726,-5.02662,-10.43847,-25.90799,21.66042,0.40617
50%,-0.06341,3.29641,0.10974,-0.04237,-7.54455,83.9052,69.69195
75%,0.61209,7.60278,5.51074,4.57813,0.65726,216.17761,168.37251
max,4.84033,20.69459,24.23919,140.7957,223.06437,2227.92782,7852.0897


In [208]:
# Plot your data to see what's in your DataFrame
df_market_data.hvplot.line(
    width=800,
    height=400,
    rot=90
)

---

### Preparing the data 


This code normalizes the seven price-change columns of the `df_market_data` DataFrame with `StandardScaler` from scikit-learn.

 `scaled_data = StandardScaler().fit_transform(df_market_data[['price_change_percentage_24h', 'price_change_percentage_7d', ... 'price_change_percentage_1y']])`: 
 The double square brackets select a subset of columns from `df_market_data` as a DataFrame. The selected columns, `['price_change_percentage_24h', 'price_change_percentage_7d', ..., 'price_change_percentage_1y']`, hold the numerical data to be standardized.

   - `StandardScaler()` creates an instance of the `StandardScaler` class. Standardization rescales each column so that it has a mean of 0 and a standard deviation of 1.
   - `fit_transform()` is a method of the `StandardScaler` class. It computes the mean and standard deviation of each selected column and then applies the transformation. The result is a NumPy array, stored here in `scaled_data`; a later cell wraps it in the `df_scaled_data` DataFrame.
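
To make the transformation concrete, the sketch below reproduces the standardization formula by hand with NumPy on three rows and two columns taken from the sample data. It is an illustration only: because it uses three rows instead of all 41, the scaled values differ from the notebook output. The names `X` and `X_scaled` are chosen for this example.

```python
import numpy as np

# Three rows (bitcoin, ethereum, tether) and two columns (24h, 7d) from the sample data.
X = np.array([
    [1.08388, 7.60278],
    [0.22392, 10.38134],
    [-0.21173, 0.04935],
])

# StandardScaler divides by the population standard deviation (ddof=0).
mean = X.mean(axis=0)
std = X.std(axis=0, ddof=0)
X_scaled = (X - mean) / std

print(X_scaled.mean(axis=0))  # approximately 0 for each column
print(X_scaled.std(axis=0))   # approximately 1 for each column
```

Each column ends up centered on zero with unit spread, so no single time window dominates the distance calculations used later by K-means.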


### Prepare the Data

In [209]:
# Use `StandardScaler()` from scikit-learn to normalize the data from the CSV file
# First, list the column names
df_market_data.columns

scaled_data = StandardScaler().fit_transform(df_market_data[['price_change_percentage_24h', 'price_change_percentage_7d',
       'price_change_percentage_14d', 'price_change_percentage_30d',
       'price_change_percentage_60d', 'price_change_percentage_200d',
       'price_change_percentage_1y']])
scaled_data[:5]

array([[ 0.50852937,  0.49319307,  0.77220043,  0.23545963, -0.0674951 ,
        -0.35595348, -0.25163688],
       [ 0.18544589,  0.93444504,  0.55869212, -0.05434093, -0.27348273,
        -0.11575947, -0.19935211],
       [ 0.02177396, -0.70633685, -0.02168042, -0.06103015,  0.00800452,
        -0.55024692, -0.28206051],
       [-0.04076438, -0.81092807,  0.24945797, -0.05038797, -0.37316402,
        -0.45825882, -0.29554614],
       [ 1.19303608,  2.00095907,  1.76061001,  0.54584206, -0.29120287,
        -0.49984776, -0.27031695]])

This code standardizes the columns of `df_market_data`, which matters for distance-based methods such as K-means, where features on larger scales would otherwise dominate the result. `fit_transform()` returns a NumPy array (`scaled_data`); the next cell wraps it in the `df_scaled_data` DataFrame for further analysis.

#### Step 1: Standardizing the Data

The code begins by standardizing the selected columns of `df_market_data` and then creates a new DataFrame, `df_scaled_data`, to hold the scaled values.

#### Step 2: Copying the Coin IDs

Next, the code copies the coin IDs from the index of the original data into a new column named "coin_id" in `df_scaled_data`.

```python
df_scaled_data["coin_id"] = df_market_data.index
```

#### Step 3: Setting the Index

The code sets the "coin_id" column as the index and stores the result in a new DataFrame, `df_scaled_indexed`.

```python
df_scaled_indexed = df_scaled_data.set_index("coin_id")
```

#### Step 4: Displaying Sample Data (Optional)

Finally, the code can display a sample of the scaled data using the `head()` method.
This provides an overview of the standardized data for further analysis or visualization.
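
As an alternative to adding a column and then calling `set_index`, the index and column names can be passed straight to the `pd.DataFrame` constructor. The sketch below shows this on a hypothetical two-coin stand-in for `df_market_data` (the names `df_demo` and `df_demo_scaled` are assumptions for this example), and it computes the z-scores with pandas so that it runs without scikit-learn.

```python
import pandas as pd

# Hypothetical stand-in for df_market_data with two coins and two columns.
df_demo = pd.DataFrame(
    {
        "price_change_percentage_24h": [1.08388, 0.22392],
        "price_change_percentage_7d": [7.60278, 10.38134],
    },
    index=pd.Index(["bitcoin", "ethereum"], name="coin_id"),
)

# Population z-scores (ddof=0), the same formula StandardScaler applies.
scaled = ((df_demo - df_demo.mean()) / df_demo.std(ddof=0)).to_numpy()

# Passing columns= and index= keeps the labels without a separate set_index step.
df_demo_scaled = pd.DataFrame(scaled, columns=df_demo.columns, index=df_demo.index)
print(df_demo_scaled)
```

Both approaches give a DataFrame indexed by `coin_id`; the constructor form simply avoids the intermediate column.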



In [210]:
# Create a DataFrame with the scaled data
df_scaled_data = pd.DataFrame(
    scaled_data , columns=['price_change_percentage_24h', 'price_change_percentage_7d',
       'price_change_percentage_14d', 'price_change_percentage_30d',
       'price_change_percentage_60d', 'price_change_percentage_200d',
       'price_change_percentage_1y'] 
)

# Copy the crypto names from the original data
df_scaled_data["coin_id"] = df_market_data.index

# Set the coin_id column as the index
df_scaled_indexed = df_scaled_data.set_index("coin_id")

# Display sample data
df_scaled_indexed.head()

coin_id,price_change_percentage_24h,price_change_percentage_7d,price_change_percentage_14d,price_change_percentage_30d,price_change_percentage_60d,price_change_percentage_200d,price_change_percentage_1y
bitcoin,0.508529,0.493193,0.7722,0.23546,-0.067495,-0.355953,-0.251637
ethereum,0.185446,0.934445,0.558692,-0.054341,-0.273483,-0.115759,-0.199352
tether,0.021774,-0.706337,-0.02168,-0.06103,0.008005,-0.550247,-0.282061
ripple,-0.040764,-0.810928,0.249458,-0.050388,-0.373164,-0.458259,-0.295546
bitcoin-cash,1.193036,2.000959,1.76061,0.545842,-0.291203,-0.499848,-0.270317










---

### Find the Best Value for k Using the Original Scaled DataFrame

This code runs K-means clustering for several values of `k` (the number of clusters) and stores the inertia of each fitted model in a list.

1. `k = list(range(1, 11))`: creates a list `k` containing the numbers from 1 to 10, the candidate values for the number of clusters.

2. `inertia = []`: initializes an empty list to store the inertia of each K-means model.
    Inertia is the sum of squared distances from each data point to the centroid of its assigned cluster. It always decreases as `k` grows, so a lower value alone does not mean a better model; what matters is where the decrease starts to flatten.

3. For loop:
   - The loop iterates over each value in the list `k`.
   - Inside the loop, the following steps run for each value `i`:

     a. `k_model = KMeans(n_clusters=i, random_state=3)`: creates a KMeans model with `i` clusters. Fixing `random_state` makes the centroid initialization reproducible.

     b. `k_model.fit(df_scaled_indexed)`: fits the model to the scaled data in `df_scaled_indexed`. The algorithm groups the data points into `i` clusters by assigning each point to its nearest centroid.

     c. `inertia.append(k_model.inertia_)`: appends the fitted model's inertia to the `inertia` list.

After the loop finishes, the `inertia` list holds one value per `k`. For `k = 1` on standardized data, the inertia equals the number of samples times the number of features (41 × 7 = 287), which matches the first value in the output. Plotting inertia against `k` reveals the "elbow" point, where the curve starts to level off; this point is a common heuristic for choosing the number of clusters.
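
To show what the fit step and the inertia value mean, here is a minimal NumPy sketch of the assign-and-update cycle behind K-means. It is a simplified illustration, not scikit-learn's implementation: it starts from the first `k` points (scikit-learn uses k-means++ initialization by default) and runs a fixed number of iterations. The function `simple_kmeans` and the toy data are hypothetical.

```python
import numpy as np

def simple_kmeans(X, k, n_iter=20):
    """Minimal Lloyd's algorithm; returns (labels, centroids, inertia)."""
    # Simplification: use the first k points as the starting centroids.
    centroids = X[:k].copy()
    for _ in range(n_iter):
        # Assignment step: squared distance from every point to every centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    # Final assignment and inertia (sum of squared distances to the nearest centroid).
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)
    inertia = dists[np.arange(len(X)), labels].sum()
    return labels, centroids, inertia

# Toy data: two well-separated groups of three points each.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])

_, _, inertia_k1 = simple_kmeans(X, 1)
labels_k2, _, inertia_k2 = simple_kmeans(X, 2)
print(inertia_k1, inertia_k2)
print(labels_k2)
```

Going from one cluster to two removes almost all of the inertia on this toy data, which is the kind of sharp drop the elbow method looks for.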


This code plots a line chart with hvPlot, which builds on HoloViews, to visualize the inertia values computed for the elbow method.

```python
df_elbow_data.hvplot.line(
    x="k",
    y="inertia",
    title="Elbow method"
)
```

- `df_elbow_data`: the DataFrame containing the inertia for each value of "k" (number of clusters).

- `hvplot.line()`: the hvPlot method that draws the data from `df_elbow_data` as a line chart.

- `x="k"` and `y="inertia"`: the columns used for the x-axis (number of clusters) and the y-axis (inertia).

- `title="Elbow method"`: sets the title of the plot to "Elbow method".

The resulting chart shows the inertia for each value of "k" and helps identify the "elbow" point, the value of "k" where the inertia starts to level off. That point suggests a good balance between the number of clusters and the compactness of each cluster.
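
The elbow is usually read off the chart by eye, but a simple numeric check can support that reading. The sketch below uses the inertia values from this notebook's output (rounded) and picks the `k` after which the drop in inertia shrinks the most. This is only one possible heuristic, assumed here for illustration; on these values `k = 3` comes a close second, so the result is supporting evidence rather than a definitive rule.

```python
# Inertia values copied from this notebook's output for k = 1..10 (rounded).
k_values = list(range(1, 11))
inertia = [287.0, 195.820218, 123.190482, 79.022435, 63.858668,
           53.057788, 44.550278, 36.79107, 32.71517, 28.374754]

# drops[j] is the reduction in inertia gained by moving to k_values[j + 1] clusters.
drops = [inertia[i - 1] - inertia[i] for i in range(1, len(inertia))]

# shrink[j] compares the drop into k_values[j + 1] with the drop into the next k.
shrink = [drops[j] - drops[j + 1] for j in range(len(drops) - 1)]

# Heuristic: the elbow is the k after which the drop shrinks the most.
elbow_k = k_values[shrink.index(max(shrink)) + 1]
print(elbow_k)
```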

To display the inertia values directly without plotting, use the following line:

```python
df_elbow_data["inertia"]
```

This will display the inertia values for each corresponding value of "k" in the `df_elbow_data` DataFrame.

In [211]:
# Create a list with the k-values from 1 to 10
k = list(range(1,11))
df_scaled_indexed.head()

coin_id,price_change_percentage_24h,price_change_percentage_7d,price_change_percentage_14d,price_change_percentage_30d,price_change_percentage_60d,price_change_percentage_200d,price_change_percentage_1y
bitcoin,0.508529,0.493193,0.7722,0.23546,-0.067495,-0.355953,-0.251637
ethereum,0.185446,0.934445,0.558692,-0.054341,-0.273483,-0.115759,-0.199352
tether,0.021774,-0.706337,-0.02168,-0.06103,0.008005,-0.550247,-0.282061
ripple,-0.040764,-0.810928,0.249458,-0.050388,-0.373164,-0.458259,-0.295546
bitcoin-cash,1.193036,2.000959,1.76061,0.545842,-0.291203,-0.499848,-0.270317


In [212]:
# Create an empty list to store the inertia values
inertia = []

# Create a for loop to compute the inertia with each possible value of k
for i in k :
    
    # 1. Create a KMeans model using the loop counter for the n_clusters
    k_model = KMeans(n_clusters=i, random_state=3)
    
    # 2. Fit the model to the data using `df_scaled_indexed`
    k_model.fit(df_scaled_indexed)

    # 3. Append the model.inertia_ to the inertia list
    inertia.append(k_model.inertia_)
inertia

[287.0,
 195.82021818036043,
 123.19048183836958,
 79.02243535120975,
 63.85866780584264,
 53.05778846567061,
 44.5502780207748,
 36.791069782922925,
 32.71517024852882,
 28.374754158229536]

In [213]:
# Create a dictionary with the data to plot the Elbow curve
elbow_data = {
    "k" : k,
    "inertia" : inertia
}

# Create a DataFrame with the data to plot the Elbow curve
df_elbow_data = pd.DataFrame(elbow_data)
df_elbow_data

Unnamed: 0,k,inertia
0,1,287.0
1,2,195.820218
2,3,123.190482
3,4,79.022435
4,5,63.858668
5,6,53.057788
6,7,44.550278
7,8,36.79107
8,9,32.71517
9,10,28.374754


In [214]:
# Plot a line chart with all the inertia values computed with
# the different values of k to visually identify the optimal value for k.
elbow_data_plot = df_elbow_data.hvplot.line(
    x="k",
    y="inertia",
    title="Elbow method"
)
elbow_data_plot



#### Answer the following question: 

**Question:** What is the best value for `k`?

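**Answer:** Four. The inertia drops sharply up to `k = 4` (from 123.19 at `k = 3` to 79.02) and declines much more slowly afterwards (63.86 at `k = 5`).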

---

### Cluster Cryptocurrencies with K-means Using the Original Scaled Data

```python
k_means = KMeans(n_clusters=4)
k_means.fit(df_scaled_indexed)
k_means_predict = k_means.predict(df_scaled_indexed)

k_means_predict

predicted_scaled_df = df_scaled_indexed.copy()
predicted_scaled_df["predicted clusters"] = k_means_predict
predicted_scaled_df.head()

predicted_scaled_df.hvplot.scatter(
    x="price_change_percentage_24h",
    y="price_change_percentage_7d",
    by="predicted clusters",
    title="Predicted clusters"
)
```

1. Initialize K-Means Model:
   - A KMeans model is created with 4 clusters (`n_clusters=4`).

2. Fit the Model:
   - The KMeans model is fitted to the scaled data (`df_scaled_indexed`), grouping the cryptocurrencies into 4 clusters based on similarity.

3. Predict Clusters:
   - The model predicts the cluster for each data point in `df_scaled_indexed`, and the results are stored in `k_means_predict`.

4. Add Predicted Clusters:
   - `df_scaled_indexed` is copied to `predicted_scaled_df`, and a new column "predicted clusters" is added to store the cluster assignments.

5. Display Sample Data:
   - The first few rows of `predicted_scaled_df` with the "predicted clusters" column are displayed to inspect the data.

6. Visualize Clusters:
   - A scatter plot is created with hvPlot (`hvplot.scatter()`) to show the data points from `predicted_scaled_df`.
   - The x-axis represents "price_change_percentage_24h" and the y-axis represents "price_change_percentage_7d".
   - Data points are colored and grouped by the "predicted clusters" column.
   - The title of the plot is set to "Predicted clusters".

This code uses K-means clustering to group cryptocurrencies by their price change percentages over different time intervals. The resulting clusters highlight coins whose prices moved similarly, and the scatter plot makes the groups easier to interpret. Because this model is created without a `random_state`, the cluster numbering can change between runs.
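
A quick way to inspect the result is to count how many coins fall into each cluster. The sketch below uses the label array printed by this run of the notebook; since the model has no fixed `random_state`, a rerun may number the clusters differently.

```python
from collections import Counter

# Cluster labels copied from the k_means_predict output of this run.
k_means_predict = [0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0,
                   1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 2, 0, 1, 1, 3, 1, 1, 1, 1]

# Count the number of coins assigned to each cluster label.
cluster_sizes = Counter(k_means_predict)
for label in sorted(cluster_sizes):
    print(f"cluster {label}: {cluster_sizes[label]} coins")
```

Two of the four clusters contain a single coin each, which suggests that those coins are outliers relative to the rest of the market data.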



In [215]:
# Initialize the K-Means model using the best value for k
k_means = KMeans(n_clusters= 4)

In [216]:
# Fit the K-Means model using the scaled data
k_means.fit(df_scaled_indexed)

In [217]:
# Predict the clusters to group the cryptocurrencies using the scaled data
k_means_predict = k_means.predict(df_scaled_indexed)

# Print the resulting array of cluster values.
k_means_predict


array([0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0,
       1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 2, 0, 1, 1, 3, 1, 1, 1, 1],
      dtype=int32)

In [218]:
# Create a copy of the DataFrame
predicted_scaled_df = df_scaled_indexed.copy()
predicted_scaled_df.head()

coin_id,price_change_percentage_24h,price_change_percentage_7d,price_change_percentage_14d,price_change_percentage_30d,price_change_percentage_60d,price_change_percentage_200d,price_change_percentage_1y
bitcoin,0.508529,0.493193,0.7722,0.23546,-0.067495,-0.355953,-0.251637
ethereum,0.185446,0.934445,0.558692,-0.054341,-0.273483,-0.115759,-0.199352
tether,0.021774,-0.706337,-0.02168,-0.06103,0.008005,-0.550247,-0.282061
ripple,-0.040764,-0.810928,0.249458,-0.050388,-0.373164,-0.458259,-0.295546
bitcoin-cash,1.193036,2.000959,1.76061,0.545842,-0.291203,-0.499848,-0.270317


In [219]:
# Add a new column to the DataFrame with the predicted clusters
predicted_scaled_df["predicted clusters"] = k_means_predict

# Display sample data
predicted_scaled_df.head()

coin_id,price_change_percentage_24h,price_change_percentage_7d,price_change_percentage_14d,price_change_percentage_30d,price_change_percentage_60d,price_change_percentage_200d,price_change_percentage_1y,predicted clusters
bitcoin,0.508529,0.493193,0.7722,0.23546,-0.067495,-0.355953,-0.251637,0
ethereum,0.185446,0.934445,0.558692,-0.054341,-0.273483,-0.115759,-0.199352,0
tether,0.021774,-0.706337,-0.02168,-0.06103,0.008005,-0.550247,-0.282061,1
ripple,-0.040764,-0.810928,0.249458,-0.050388,-0.373164,-0.458259,-0.295546,1
bitcoin-cash,1.193036,2.000959,1.76061,0.545842,-0.291203,-0.499848,-0.270317,0


In [220]:
# Create a scatter plot using hvPlot by setting 
# `x="price_change_percentage_24h"` and `y="price_change_percentage_7d"`. 
# Color the graph points with the labels found using K-Means and 
# add the crypto name in the `hover_cols` parameter to identify 
# the cryptocurrency represented by each data point.
predicted_scaled_df_plot = predicted_scaled_df.hvplot.scatter(
    x="price_change_percentage_24h",
    y="price_change_percentage_7d",
    by="predicted clusters",
    hover_cols=["coin_id"],
    title="Predicted clusters"
)
predicted_scaled_df_plot


---

### Optimize Clusters with Principal Component Analysis

This code uses a PCA model with `n_components=3` to reduce the dimensionality of `predicted_scaled_df`, displays the first five rows of the transformed data (`pca_fit`), and retrieves the explained variance to show how much information each principal component retains.

1. `pca = PCA(n_components=3)`: creates a PCA model instance that reduces the data to three principal components.

2. `pca_fit = pca.fit_transform(predicted_scaled_df)`: fits the PCA model to `predicted_scaled_df` and transforms the data into a three-dimensional space. The result is stored in `pca_fit`. Note that `predicted_scaled_df` contains the "predicted clusters" column in addition to the seven scaled price-change columns, so the cluster labels are treated as an eighth feature here. Fitting on `df_scaled_indexed` instead would restrict PCA to the original features and would change the numbers reported in the following cells.

3. `pca_fit[:5]`: displays the first five rows of the transformed data to inspect how the principal components represent the original features.

4. Retrieve Explained Variance:
   - The explained variance tells you how much information (variance) can be attributed to each principal component.
   - It is obtained from the `explained_variance_ratio_` attribute of the PCA model.

```python
# Retrieve explained variance
explained_variance = pca.explained_variance_ratio_
```

The `explained_variance` will be an array containing the explained variance ratio for each of the three principal components. The explained variance ratio represents the proportion of the total variance in the data that each principal component explains.
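
The explained variance ratio can also be derived by hand, which helps clarify what the attribute reports. The sketch below is a minimal NumPy illustration on hypothetical toy data (the array `X` is not taken from the notebook): it centers the data, takes the singular values, and converts them into the share of variance carried by each component.

```python
import numpy as np

# Hypothetical toy data: 6 samples, 3 features, the first two strongly correlated.
X = np.array([
    [2.0, 1.9, 0.5],
    [1.0, 1.1, 0.4],
    [0.0, 0.1, 0.6],
    [-1.0, -0.9, 0.5],
    [-2.0, -2.1, 0.4],
    [0.0, -0.1, 0.6],
])

# PCA works on mean-centered data.
X_centered = X - X.mean(axis=0)

# The squared singular values are proportional to the variance along each component.
singular_values = np.linalg.svd(X_centered, compute_uv=False)
component_variance = singular_values ** 2 / (len(X) - 1)
explained_variance_ratio = component_variance / component_variance.sum()

# Cumulative share of variance retained by the first 1, 2, and 3 components.
cumulative = np.cumsum(explained_variance_ratio)
print(explained_variance_ratio)
print(cumulative)
```

Summing the first few ratios, as the notebook does for three components, gives the share of the total variance that the reduced data retains.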

To understand PCA further and explore the concept of explained variance in detail, you can refer to the following resources:

1. Scikit-learn documentation on PCA:
   - https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html

2. Towards Data Science article on PCA:
   - https://towardsdatascience.com/a-one-stop-shop-for-principal-component-analysis-5582fb7e0a9c

3. PCA Explained (Visual Guide) by StatQuest with Josh Starmer:
   - Video: https://www.youtube.com/watch?v=_UVHneBUBW0
   - Written explanation: https://statquest.org/pca-principal-component-analysis-step-by-step/

In [221]:
# Create a PCA model instance and set `n_components=3`.
pca = PCA(n_components= 3)

In [222]:
# Use the PCA model with `fit_transform` to reduce to 
# three principal components.
pca_fit = pca.fit_transform(predicted_scaled_df)
# View the first five rows of the DataFrame. 
pca_fit[:5] 

array([[-0.82258702,  0.83270555,  0.56655379],
       [-0.70990347,  0.45421229,  1.05640205],
       [-0.30789987, -0.18460919, -0.74081973],
       [-0.353704  , -0.23981488, -0.59614094],
       [-1.48496628,  2.01848451,  1.78701064]])

In [223]:
# Retrieve the explained variance to determine how much information
# can be attributed to each principal component.
explained_variance = pca.explained_variance_ratio_

# Calculate the total explained variance
total_explained_var =  explained_variance[0] + explained_variance[1] + explained_variance[2]
print(f" {round(total_explained_var * 100 , 2)}% is the total var")

 88.81% is the total var


#### Answer the following question: 

**Question:** What is the total explained variance of the three principal components?

**Answer:** The three principal components together explain about 88.81% of the variance.

In [224]:
# Create a new DataFrame with the PCA data.
pca_df = pd.DataFrame(pca_fit , columns=["PC1", "PC2", "PC3"])


# Copy the crypto names from the original data
pca_df["coin_id"] = df_market_data.index

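# Note: set_index returns a new DataFrame and the result is not assigned here,
# so pca_df keeps coin_id as a regular column. The indexed copy
# (pca_df_indexed) is created in a later cell.
pca_df.set_index("coin_id")

# Display sample data
pca_df.head()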

Unnamed: 0,PC1,PC2,PC3,coin_id
0,-0.822587,0.832706,0.566554,bitcoin
1,-0.709903,0.454212,1.056402,ethereum
2,-0.3079,-0.184609,-0.74082,tether
3,-0.353704,-0.239815,-0.596141,ripple
4,-1.484966,2.018485,1.787011,bitcoin-cash


In [225]:
pca_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41 entries, 0 to 40
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   PC1      41 non-null     float64
 1   PC2      41 non-null     float64
 2   PC3      41 non-null     float64
 3   coin_id  41 non-null     object 
dtypes: float64(3), object(1)
memory usage: 1.4+ KB


---

### Find the Best Value for k Using the PCA Data

#### Code Explanation

1. **Create k-values and Compute Inertia:**

   - A list of k-values from 1 to 10 is created with `list(range(1, 11))`.
   - An empty list `inertia` is initialized to store the inertia values.

2. **Prepare Data for K-Means Clustering:**

   - The "coin_id" column of `pca_df` is set as the index with `pca_df_indexed = pca_df.set_index("coin_id")`.
   - The first five rows of the indexed DataFrame are displayed with `pca_df_indexed.head()`.

3. **K-Means Clustering with Different k-values:**

   - For each value of k, a K-means model is fitted so that the inertia values can be compared.
   - A KMeans model is initialized with the current value of k (`n_clusters=i`) and `random_state=3`.
   - The model is fitted to the indexed DataFrame with `k_model.fit(pca_df_indexed)`.
   - The inertia for each k is appended to the `inertia` list.

4. **Create DataFrame for Elbow Curve:**

   - A DataFrame named `df_elbow_pca` stores the k-values and the corresponding inertia values.
   - The DataFrame has two columns: "k" and "inertia".

5. **Plot the Elbow Curve:**

   - A line plot of the elbow curve is created with hvPlot (`hvplot.line()`).
   - The x-axis shows the k-values, the y-axis shows the inertia values, and the title is set to "PCA Elbow Curve".

6. **Display the Elbow Curve:**

   - The elbow curve plot (`elbow_pca_plot`) is displayed.

7. **Visualize the Optimal k:**

   - The plot shows the inertia for each k-value to help identify the optimal number of clusters.

Note: This code closely follows the elbow-curve code used for the original scaled data; see that section for a more detailed explanation.


In [226]:
# Create a list with the k-values from 1 to 10
k = list(range(1,11))
inertia = []
pca_df_indexed = pca_df.set_index("coin_id")
pca_df_indexed.head()

coin_id,PC1,PC2,PC3
bitcoin,-0.822587,0.832706,0.566554
ethereum,-0.709903,0.454212,1.056402
tether,-0.3079,-0.184609,-0.74082
ripple,-0.353704,-0.239815,-0.596141
bitcoin-cash,-1.484966,2.018485,1.787011


In [227]:
# Create a for loop to compute the inertia with each possible value of k
for i in k :
    k_model = KMeans(n_clusters= i , random_state= 3)
    k_model.fit(pca_df_indexed)
    inertia.append(k_model.inertia_)
inertia[:5]


[268.7074810431747,
 172.81344714093473,
 98.89990525165226,
 46.68514843492797,
 35.91889760360917]

In [228]:
# Create a DataFrame with the data to plot the Elbow curve
df_elbow_pca = pd.DataFrame({
    "k" : k,
    "inertia" : inertia
})
df_elbow_pca



Unnamed: 0,k,inertia
0,1,268.707481
1,2,172.813447
2,3,98.899905
3,4,46.685148
4,5,35.918898
5,6,26.535907
6,7,19.704952
7,8,16.306595
8,9,13.319168
9,10,10.632533


In [229]:
# Plot a line chart with all the inertia values computed with
# the different values of k to visually identify the optimal value for k.
elbow_pca_plot = df_elbow_pca.hvplot.line(
    x="k",
    y="inertia",
    title="PCA Elbow Curve"
)
elbow_pca_plot


#### Answer the following questions: 

* **Question:** What is the best value for `k` when using the PCA data?

  * **Answer:** Four. The inertia falls from 98.90 at `k = 3` to 46.69 at `k = 4` and decreases much more slowly afterwards.


* **Question:** Does it differ from the best k value found using the original data?

  * **Answer:** No. Both elbow curves point to `k = 4`.
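
Raw inertia values from the two data sets are not directly comparable, because the PCA data has fewer dimensions and therefore less total variance. One way to put the curves on a common scale, assumed here for illustration, is to divide each inertia by the `k = 1` inertia of the same data set. The values below are the first five rows of the two elbow tables in this notebook (rounded).

```python
# Inertia values copied from the two elbow tables in this notebook (k = 1..5, rounded).
inertia_scaled = [287.0, 195.820218, 123.190482, 79.022435, 63.858668]
inertia_pca = [268.707481, 172.813447, 98.899905, 46.685148, 35.918898]

# Share of the k=1 inertia that remains at each k, per data set.
shares_scaled = [value / inertia_scaled[0] for value in inertia_scaled]
shares_pca = [value / inertia_pca[0] for value in inertia_pca]

for k, (share_s, share_p) in enumerate(zip(shares_scaled, shares_pca), start=1):
    print(f"k={k}: original {share_s:.3f}, PCA {share_p:.3f}")
```

At `k = 4` a smaller share of the initial inertia remains for the PCA data than for the original scaled data, so the elbow is more pronounced on the PCA curve even though both curves bend at the same `k`.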

### Cluster Cryptocurrencies with K-means Using the PCA Data

#### Code Explanation

1. **Initialize K-Means Model:**

   - A KMeans model is initialized with `n_clusters=4` and `random_state=3`.
   - The model will attempt to find four clusters in the data based on the PCA-transformed features.

2. **Fit K-Means Model:**

   - The KMeans model is fitted to the indexed DataFrame `pca_df_indexed`.
   - The model will identify the clusters and the centroids for each cluster.

3. **Predict Clusters Using PCA Data:**

   - The KMeans model is used to predict the cluster for each data point in `pca_df_indexed`.
   - The resulting cluster assignments are stored in the array `k_3`.

4. **Display Cluster Predictions:**

   - The array `k_3` is printed, showing the resulting cluster assignments for each data point.

5. **Create DataFrame Copy for Predictions:**

   - A copy of the DataFrame `pca_df_indexed` is created and stored in `pca_df_predections`.

6. **Add Predicted Clusters to DataFrame:**

   - A new column "predicted_clusters" is added to `pca_df_predections`, containing the cluster assignments obtained from K-means clustering.

7. **Display DataFrame with Cluster Predictions:**

   - The first rows of `pca_df_predections` are displayed, showing the PCA-transformed data with the newly added "predicted_clusters" column.

8. **Create Scatter Plot for Predictions:**

   - A scatter plot is created using HoloViews (`hvplot.scatter()`) to visualize the PCA-transformed data from `pca_df_predictions`.
   - The x-axis represents "PC1" (first principal component), the y-axis represents "PC2" (second principal component).
   - Data points are colored and grouped by the "predicted_clusters" column, indicating the cluster assignments obtained from K-means clustering.
   - The title of the plot is set to "PCA Predictions scatter".

9. **Display Scatter Plot:**

   - The PCA Predictions scatter plot (`pca_predic_plot`) is displayed, showing the data points grouped by the predicted clusters.

The code demonstrates how to apply K-means clustering to the PCA-transformed data to group cryptocurrencies into four clusters. The scatter plot visualizes the results, helping to identify the distinct clusters formed by the K-means algorithm based on the reduced PCA features.


In [230]:
# Initialize the K-Means model using the best value for k
k_means = KMeans( n_clusters=4, random_state=3)

In [231]:
# Fit the K-Means model using the PCA data
k_means.fit(pca_df_indexed)

In [232]:
# Predict the clusters to group the cryptocurrencies using the PCA data
k_3 = k_means.predict(pca_df_indexed)
# Print the resulting array of cluster values.
k_3

array([1, 1, 3, 3, 1, 1, 1, 1, 1, 3, 3, 3, 3, 1, 3, 1, 3, 3, 1, 3, 3, 1,
       3, 3, 3, 3, 3, 3, 1, 3, 3, 3, 0, 1, 3, 3, 2, 3, 3, 3, 3],
      dtype=int32)

In [238]:
# Create a copy of the DataFrame with the PCA data
pca_df_predections = pca_df_indexed.copy()


# Add a new column to the DataFrame with the predicted clusters
pca_df_predections["predicted_clusters"] = k_3

# Display sample data
pca_df_predections.head()

coin_id,PC1,PC2,PC3,predicted_clusters
bitcoin,-0.822587,0.832706,0.566554,1
ethereum,-0.709903,0.454212,1.056402,1
tether,-0.3079,-0.184609,-0.74082,3
ripple,-0.353704,-0.239815,-0.596141,3
bitcoin-cash,-1.484966,2.018485,1.787011,1


In [234]:
# Create a scatter plot using hvPlot by setting
# `x="PC1"` and `y="PC2"`.
# Color the graph points with the labels found using K-Means and
# add the crypto name in the `hover_cols` parameter to identify
# the cryptocurrency represented by each data point.
pca_predic_plot = pca_df_predections.hvplot.scatter(
    x="PC1",
    y="PC2",
    by="predicted_clusters",
    hover_cols=["coin_id"],
    title="PCA Predictions scatter"
)
pca_predic_plot


### Visualize and Compare the Results

In this section, you will visually analyze the cluster analysis results by contrasting the outcome with and without using the optimization techniques.

In [235]:
# Composite plot to contrast the Elbow curves
elbow_pca_plot + elbow_data_plot

In [236]:
# Composite plot to contrast the clusters
pca_predic_plot + predicted_scaled_df_plot

#### Answer the following question: 

  * **Question:** After visually analyzing the cluster analysis results, what is the impact of using fewer features to cluster the data using K-Means?

  * **Answer:** Clustering on fewer features gave the same grouping with a simpler representation. Both elbow curves point to `k = 4`, and the cluster assignments from the PCA data match those from the original scaled data up to a renumbering of the labels. The elbow is more pronounced on the PCA data, where the inertia at `k = 4` is 46.69 compared with 79.02 on the original scaled data, although part of that difference simply reflects the variance discarded by the projection. PCA transforms the original high-dimensional data into a lower-dimensional space while preserving most of its variance (about 88.81% here).

The key impacts of using fewer features are as follows:

1. **Simplified Representation:** Reducing the number of features gives a more compact representation of the data, which is easier to visualize and interpret.

2. **Improved Computation Efficiency:** With fewer dimensions, the computational cost decreases and algorithms run faster. This matters mostly for large datasets; with 41 coins the difference is negligible.

3. **Noise Reduction:** High-dimensional data might contain noise or less relevant features. Keeping only the leading principal components focuses on the directions that carry the most variance, which may lead to cleaner clusters.

4. **Visual Interpretation:** In a lower-dimensional space, the clusters and their relationships are easier to see. A scatter plot of PC1 against PC2 summarizes more of the data than a plot of two of the original columns.

5. **Enhanced Clustering Performance:** Clustering can be negatively affected by irrelevant or redundant features. With fewer, more informative dimensions, algorithms like K-Means may perform better.

However, reducing the number of features also discards some of the variability present in the original data. It is important to balance simplicity against information retention, because reducing the dimensionality too far can remove important patterns and details.

In conclusion, using fewer features through PCA offers simplicity, lower computational cost, noise reduction, and easier visual interpretation. The effect on clustering quality depends on the dataset and on how much variance the retained components explain, so the number of components should be chosen with care.
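
Cluster numbers are arbitrary, so two runs can produce the same grouping under different labels. The sketch below checks this for the two label arrays printed in this notebook by building the mapping between them. One caveat: in this notebook PCA was fitted on a DataFrame that already contained the "predicted clusters" column, which may contribute to the agreement.

```python
# Cluster labels copied from the two prediction outputs in this notebook.
labels_scaled = [0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0,
                 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 2, 0, 1, 1, 3, 1, 1, 1, 1]
labels_pca = [1, 1, 3, 3, 1, 1, 1, 1, 1, 3, 3, 3, 3, 1, 3, 1, 3, 3, 1, 3, 3, 1,
              3, 3, 3, 3, 3, 3, 1, 3, 3, 3, 0, 1, 3, 3, 2, 3, 3, 3, 3]

# Collect, for every original label, the set of PCA labels it is paired with.
mapping = {}
for original, reduced in zip(labels_scaled, labels_pca):
    mapping.setdefault(original, set()).add(reduced)

# The partitions are identical if each original label maps to exactly one PCA label
# and no two original labels share the same PCA label.
one_to_one = all(len(targets) == 1 for targets in mapping.values())
targets = [next(iter(t)) for t in mapping.values()]
same_partition = one_to_one and len(set(targets)) == len(targets)

print(mapping)
print(same_partition)
```

For this run the two clusterings place every coin in the same group; only the label numbers differ.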

