# Clustering Stocks Based On Value At Risk
Team Members
1. Abdullah Nasih Jasir (5025211111)
2. Mohammad Ahnaf Fauzan (5025211170)
3. Al-Ferro Yudisthira Putra (5025211176)
# ---------------------------------------------------------------------------

### Import Libraries

In [None]:
import os
import glob
import random
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.dates as mdates
from scipy.stats import norm
from matplotlib import pyplot as plt
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.preprocessing import StandardScaler
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score
from sklearn.metrics import davies_bouldin_score

# Visualization style
plt.style.use('ggplot')
%matplotlib inline

pd.set_option('display.max_rows', None)

# ---------------------------------------------------------------------------
## **PREPROCESSING**
# ---------------------------------------------------------------------------

### **Data Gathering**
The following code imports the closing prices from the CSV files stored in the `saham` folder

In [None]:
path = 'saham'
all_rec = glob.iglob(path + '/*.csv', recursive=True)
count = 0

prices_df = pd.DataFrame()
for f in all_rec:
    count = count + 1
    df = pd.read_csv(f, index_col='date', usecols=['date', 'close'])
    colname = os.path.basename(f).replace('.csv', '')
    df.rename(columns={'close': colname}, inplace=True)
    prices_df = pd.concat([prices_df, df], axis=1, sort=False)

# Convert the 'date' index to datetime format
prices_df.index = pd.to_datetime(prices_df.index)

# Define the date range
start_date = pd.to_datetime("2022-03-24")
end_date = pd.to_datetime("2023-03-24")

# Filter the data to include only the specified date range
prices_df = prices_df[(prices_df.index >= start_date) & (prices_df.index <= end_date)]

# Filter stocks with at least 200 data points within the date range
valid_stocks = prices_df.columns[prices_df.count() >= 200]

# Create a new DataFrame with the selected date range and valid stocks
prices_train = prices_df.loc[(prices_df.index >= start_date) & (prices_df.index <= end_date),valid_stocks]
prices_train.head(10)

### **Assessing Data**

In [None]:
prices_train.info(verbose=True)

In [None]:
prices_train.isnull().sum().sort_values(ascending=False)

In [None]:
prices_train.duplicated().sum()

In [None]:
prices_train.describe()

### **Cleaning Data**

Some columns contain missing values, so we fill them using linear interpolation. By default, linear interpolation only fills forward, which leaves missing values at the start of a column empty because no earlier value exists. We therefore pass limit_direction='backward', which also fills those leading NaNs from the first available value

In [None]:
# Interpolate to fill in the missing values
prices_train = prices_train.interpolate(method='linear', limit_direction='backward')
prices_train.isnull().sum().sort_values(ascending=False)
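
To make the effect of `limit_direction` concrete, here is a minimal sketch on a made-up series (`toy` and its values are illustrative assumptions, not taken from the data set). It suggests that the default direction leaves leading gaps empty, while `'backward'` fills them but leaves a trailing gap untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical toy series: two leading NaNs, one interior NaN, one trailing NaN
toy = pd.Series([np.nan, np.nan, 10.0, np.nan, 14.0, np.nan])

# Default direction (forward): leading NaNs stay empty
forward_filled = toy.interpolate(method='linear')

# Backward direction: leading NaNs are filled, the trailing NaN stays empty
backward_filled = toy.interpolate(method='linear', limit_direction='backward')

print(pd.DataFrame({'forward': forward_filled, 'backward': backward_filled}))
```

If a column could also end with missing values, `limit_direction='both'` would cover both ends; whether that is appropriate for this data set is a judgment call.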


# ---------------------------------------------------------------------------
## **EXPLORATORY DATA ANALYSIS**
# ---------------------------------------------------------------------------

### **FINDING THE VALUE AT RISK**

To compute the VaR, we first need three quantities:
1. Expected Values
2. Mean of Expected Values
3. Standard Deviation

### Expected Values
The expected value here is the daily return of each stock, i.e. the relative change in the closing price from one trading day to the next

In [None]:
# Expected Value = (Value(t) - Value(t-1)) / Value(t-1)
expected_df = (prices_train.diff() / prices_train.shift(1)).shift(-1)
expected_df.columns = [f'{col}' for col in expected_df.columns]
expected_df.dropna(how='all', inplace=True)
expected_df.head()
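
As a quick sanity check of the formula, the sketch below applies it to three made-up prices (`toy_prices` and the ticker `TOY` are hypothetical) and compares the result with pandas' built-in `pct_change`. The notebook additionally shifts the result by one row, which only changes the date each return is aligned to.

```python
import numpy as np
import pandas as pd

# Hypothetical closing prices for one stock over three days
toy_prices = pd.DataFrame({'TOY': [100.0, 110.0, 99.0]})

# Same formula as in the cell above: (Value(t) - Value(t-1)) / Value(t-1)
manual_returns = toy_prices.diff() / toy_prices.shift(1)

# pandas built-in for the same quantity
builtin_returns = toy_prices.pct_change()

print(manual_returns)
print(builtin_returns)
```

The first row is NaN in both cases because there is no previous price to compare against.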

### Mean Expected Value
The mean of the expected values is computed for each stock

In [None]:
# Calculate the mean (expected value) for each column in expected_df
expected_means = expected_df.mean()
expected_means.head()

### Standard Deviation
The standard deviation of the expected values is computed for each stock

In [None]:
# Calculate the standard deviation of daily returns for each stock
std_deviation = expected_df.std()
std_deviation.head()

### Value at Risk
Using the results above, we can compute the value at risk as shown below

In [None]:
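# Calculate the parametric (normal) Value at Risk at the 99% confidence level
# norm.ppf(0.01) is the 1% quantile of the standard normal distribution (about -2.33)
value_at_risk = -(expected_means + std_deviation * norm.ppf(0.01))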
value_at_risk.head(10)
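
To illustrate the arithmetic for a single stock, the following sketch uses made-up statistics (`mean_return` and `std_return` are assumptions chosen for readability, not values from the data set).

```python
from scipy.stats import norm

# Hypothetical daily return statistics for a single stock
mean_return = 0.001   # average daily return (0.1%)
std_return = 0.02     # daily standard deviation (2%)

# 1% quantile of the standard normal distribution (roughly -2.33)
z_score = norm.ppf(0.01)

# Same expression as in the cell above
var_99 = -(mean_return + std_return * z_score)

print(f"z = {z_score:.4f}, 99% one-day VaR = {var_99:.4f}")
```

With these inputs the VaR comes out at roughly 0.0455, i.e. a one-day loss of about 4.6% that should be exceeded only about 1% of the time if returns were normally distributed.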

# ---------------------------------------------------------------------------
### **CLUSTERING K-MEANS**

With the VaR values in hand, we cluster the stocks using three methods:
1. K-Means Algorithm
2. Agglomerative Algorithm
3. Gaussian Mixture Model (GMM) Algorithm

The elbow method is applied first to choose the number of clusters.

In [None]:
# Convert it to a DataFrame with a single column
Elbow = value_at_risk.to_frame()

# Initialize an empty list to store the within-cluster sum of squares (WCSS) for different values of k
wcss = []

# Define the range of k values to test
k_values = range(1, 11)  # Testing k values from 1 to 10

# Iterate over each value of k
for k in k_values:
    # Initialize KMeans with the current value of k (fixed seed for reproducibility)
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
    
    # Fit KMeans to the data
    kmeans.fit(Elbow)
    
    # Append the WCSS (inertia_) to the list
    wcss.append(kmeans.inertia_)

# Plot the elbow curve
plt.plot(k_values, wcss, marker='o')
plt.title('Elbow Method')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Within-Cluster Sum of Squares (WCSS)')
plt.xticks(np.arange(min(k_values), max(k_values)+1, 1.0))
plt.grid(True)
plt.show()

### **K-Means Algorithm**
The following code clusters the stocks with the K-Means method

In [None]:
# Number of Clusters
K = 5

# Reshape the pandas Series into a 2D array with a single column
X = value_at_risk.values.reshape(-1, 1)

# Perform K-means clustering
kmeans = KMeans(n_clusters=K, random_state=42)
clusters_kmeans = kmeans.fit_predict(X)
clusters_kmeans = clusters_kmeans + 1

# Display the resulting clusters
result_df_kmeans = pd.DataFrame({'VaR': value_at_risk, 'Cluster': clusters_kmeans})
print(result_df_kmeans)


### Scatter Plot of Every Stock Based On K-Means Algorithm
The following code plots every stock by its VaR and mean expected value, colored by cluster

In [None]:
# Plot the graph for each cluster with a consistent color palette
plt.figure(figsize=(8, 6))
num_clusters = result_df_kmeans['Cluster'].nunique()
color_palette = plt.get_cmap('tab10', num_clusters)  # plt.cm.get_cmap was removed in recent Matplotlib releases

for i, cluster in enumerate(result_df_kmeans['Cluster'].unique()):
    cluster_data_kmeans = result_df_kmeans[result_df_kmeans['Cluster'] == cluster]
    plt.scatter(
        cluster_data_kmeans['VaR'],
        expected_means[cluster_data_kmeans.index],
        color=color_palette(i),
        label=f'Cluster {cluster}' 
    )

plt.title('VaR vs. Expected Value by Cluster')
plt.xlabel('Value at Risk (VaR)')
plt.ylabel('Expected Value (Column Means)')
plt.legend()
plt.grid(True)
plt.show()

### Test Clustering Based On Selected Stocks on K-Means Algorithm

In [None]:
stock_name = "BBCA"
cluster_series = result_df_kmeans.loc[stock_name, 'Cluster'] if stock_name in result_df_kmeans.index else None

if cluster_series is not None:
    cluster = cluster_series
    print(f"The stock {stock_name} is in Cluster {cluster}\n")
    
    # Find stocks that are in the same cluster as the selected one
    same_cluster_stocks_kmeans = result_df_kmeans[result_df_kmeans['Cluster'] == cluster]

    # Display stocks that are in the same cluster as the selected one
    print(f"Other stocks in Cluster {cluster}:")
    print(same_cluster_stocks_kmeans.head(5))
else:
    print(f"The stock {stock_name} was not found in any cluster")

# ---------------------------------------------------------------------------

### **CLUSTERING AGGLOMERATIVE ALGORITHM**
The code below clusters the stocks with the Agglomerative algorithm

In [None]:
# Number of Clusters
K = 5

# Perform Agglomerative Clustering
agg_cluster = AgglomerativeClustering(n_clusters=K)
clusters_aglo = agg_cluster.fit_predict(value_at_risk.values.reshape(-1, 1))
clusters_aglo = clusters_aglo + 1

# Display the resulting clusters
result_df_aglo = pd.DataFrame({'VaR': value_at_risk, 'Cluster': clusters_aglo})
print(result_df_aglo)

### Scatter Plot of Every Stock Based On Agglomerative Algorithm
The code below plots every stock by its VaR and mean expected value, colored by the cluster assigned by the Agglomerative algorithm

In [None]:
# Pick cluster count that is unique
num_clusters_aglo = result_df_aglo['Cluster'].nunique()

# Use the Matplotlib color palette in order
color_palette_aglo = plt.cm.tab10(np.linspace(0, 1, num_clusters_aglo))

# Plot the graph for each cluster with a consistent color palette
plt.figure(figsize=(8, 6))
for i, cluster in enumerate(result_df_aglo['Cluster'].unique()):
    cluster_data_aglo = result_df_aglo[result_df_aglo['Cluster'] == cluster]
    plt.scatter(
        cluster_data_aglo['VaR'],
        expected_means[cluster_data_aglo.index],
        color=color_palette_aglo[i], 
        label=f'Cluster {cluster}'
    )

plt.title('VaR vs. Expected Value by Cluster (Agglomerative)')
plt.xlabel('Value at Risk (VaR)')
plt.ylabel('Expected Value (Column Means)')
plt.legend()
plt.grid(True)
plt.show()

### Test Clustering Based On Selected Stocks on Agglomerative Algorithm

In [None]:
stock_name = "BBCA"
cluster_series_aglo = result_df_aglo.loc[stock_name, 'Cluster'] if stock_name in result_df_aglo.index else None

if cluster_series_aglo is not None:
    cluster_aglo = cluster_series_aglo
    print(f"The stock {stock_name} is in Cluster {cluster_aglo}\n")
    
    # Find stocks that are in the same cluster as the selected one
    same_cluster_stocks_aglo = result_df_aglo[result_df_aglo['Cluster'] == cluster_aglo]

    # Display stocks that are in the same cluster as the selected one
    print(f"Other stocks in Cluster {cluster_aglo}:")
    print(same_cluster_stocks_aglo.head(5))
else:
    print(f"The stock {stock_name} was not found in any cluster")

# ---------------------------------------------------------------------------

### **GMM Algorithm**
The code below is useful for implementing the GMM algorithm in clustering

In [None]:
# Number of Clusters
K_gmm = 5

# Perform GMM Clustering
gmm_cluster = GaussianMixture(n_components=K_gmm, random_state=42)
clusters_gmm = gmm_cluster.fit_predict(value_at_risk.values.reshape(-1, 1))
clusters_gmm = clusters_gmm + 1

# Display the resulting clusters
result_df_gmm = pd.DataFrame({'VaR': value_at_risk, 'Cluster': clusters_gmm})
print(result_df_gmm)

### Scatter Plot of Every Stock Based On GMM Algorithm
The following plots every stock by its VaR and mean expected value, colored by the cluster assigned by the GMM algorithm

In [None]:
# Define colors for each cluster
num_clusters_gmm = len(result_df_gmm['Cluster'].unique())
color_palette_gmm = plt.cm.tab10(np.linspace(0, 1, num_clusters_gmm))

# Plot the graph for each cluster with a consistent color palette
plt.figure(figsize=(8, 6))
for i, cluster in enumerate(result_df_gmm['Cluster'].unique()):
    cluster_data_gmm = result_df_gmm[result_df_gmm['Cluster'] == cluster]
    plt.scatter(
        cluster_data_gmm['VaR'],
        expected_means[cluster_data_gmm.index],
        color=color_palette_gmm[i], 
        label=f'Cluster {cluster}'
    )

plt.title('VaR vs. Expected Value by Cluster (GMM)')
plt.xlabel('Value at Risk (VaR)')
plt.ylabel('Expected Value (Column Means)')
plt.legend()
plt.grid(True)
plt.show()

### Test Clustering Based On Selected Stocks on GMM Algorithm

In [None]:
stock_name = "BBCA"
cluster_series_gmm = result_df_gmm.loc[stock_name, 'Cluster'] if stock_name in result_df_gmm.index else None

if cluster_series_gmm is not None:
    cluster_gmm = cluster_series_gmm
    print(f"The stock {stock_name} is in Cluster {cluster_gmm}\n")
    
    # Find stocks that are in the same cluster as the selected one
    same_cluster_stocks_gmm = result_df_gmm[result_df_gmm['Cluster'] == cluster_gmm]

    # Display stocks that are in the same cluster as the selected one
    print(f"Other stocks in Cluster {cluster_gmm}:")
    print(same_cluster_stocks_gmm.head(5))
else:
    print(f"The stock {stock_name} was not found in any cluster")

# ---------------------------------------------------------------------------
## COMPARISON
# ---------------------------------------------------------------------------

### Mean VaR of Clusters Analysis Based On Algorithm
To see how similar the clusters produced by each algorithm are, we compare the mean VaR of every cluster

### K-Means Algorithm
The following bar chart shows the mean VaR of each K-Means cluster

In [None]:
grouped_df_kmeans = result_df_kmeans.groupby('Cluster')['VaR'].mean().reset_index()
cluster_labels = [f'Cluster {label}' for label in grouped_df_kmeans['Cluster']]

# Create a bar plot
plt.figure(figsize=(8, 6))
plt.bar(cluster_labels, grouped_df_kmeans['VaR'], color='skyblue')
plt.xlabel('Cluster')
plt.ylabel('Mean VaR')
plt.title('Mean VaR by Cluster')
plt.xticks(cluster_labels) 
plt.grid(axis='y', linestyle='--', alpha=0.7)

# Display plot
plt.show()

### Agglomerative Algorithm
The following bar chart shows the mean VaR of each Agglomerative cluster

In [None]:
grouped_df_aglo = result_df_aglo.groupby('Cluster')['VaR'].mean().reset_index()
cluster_labels_aglo = [f'Cluster {label}' for label in grouped_df_aglo['Cluster']]

# Create a bar plot
plt.figure(figsize=(8, 6))
plt.bar(cluster_labels_aglo, grouped_df_aglo['VaR'], color='skyblue')
plt.xlabel('Cluster')
plt.ylabel('Mean VaR')
plt.title('Mean VaR by Cluster')
plt.xticks(cluster_labels_aglo)
plt.grid(axis='y', linestyle='--', alpha=0.7)

# Display plot
plt.show()

### GMM Algorithm
The following bar chart shows the mean VaR of each GMM cluster

In [None]:
grouped_df_gmm = result_df_gmm.groupby('Cluster')['VaR'].mean().reset_index()
cluster_labels_gmm = [f'Cluster {label}' for label in grouped_df_gmm['Cluster']]

# Create a bar plot
plt.figure(figsize=(8, 6))
plt.bar(cluster_labels_gmm, grouped_df_gmm['VaR'], color='skyblue')
plt.xlabel('Cluster')
plt.ylabel('Mean VaR')
plt.title('Mean VaR by Cluster')
plt.xticks(cluster_labels_gmm) 
plt.grid(axis='y', linestyle='--', alpha=0.7)

# Display plot
plt.show()

### Comparison of 3 Algorithms
The following is a comparison graph of Mean VaR for each cluster

In [None]:
bar_width = 0.2

# Cluster position on x-axis
x_clusters = np.arange(len(grouped_df_kmeans))

plt.figure(figsize=(10, 6))

# Bar plot for K-means
plt.bar(x_clusters - bar_width, grouped_df_kmeans['VaR'], width=bar_width, label='K-means')

# Bar plot for Agglomerative clustering
plt.bar(x_clusters, grouped_df_aglo['VaR'], width=bar_width, label='Agglomerative')

# Bar plot for GMM
plt.bar(x_clusters + bar_width, grouped_df_gmm['VaR'], width=bar_width, label='GMM')

# Set labels and title
plt.xlabel('Cluster')
plt.ylabel('Mean VaR')
plt.title('VaR Values by Cluster for Different Methods')
plt.xticks(x_clusters, [f'Cluster {label}' for label in grouped_df_gmm['Cluster']])
plt.legend()
plt.grid(axis='y', linestyle='--', alpha=0.7)

# Display Plot
plt.show()


#### Note
The cluster labels (1 to 5) assigned by each algorithm do not follow any order from lowest to highest risk; they are essentially arbitrary. Clusters from different algorithms can still share similar characteristics even when their labels differ.

### Cluster Sorting Based On The Mean VaR
To give matching clusters the same name across algorithms, we sort the clusters of each algorithm by their mean VaR

In [None]:
def get_cluster_summary(df, algorithm_name):
    # Extract numeric columns
    numeric_columns = df.select_dtypes(include='number')
    
    # Calculate mean values for each cluster
    cluster_means = numeric_columns.groupby('Cluster').mean()['VaR']
    
    # Explicitly sort the mean values
    cluster_means_sorted = cluster_means.sort_values()
    
    # Create a DataFrame with mean values and cluster counts
    summary = pd.DataFrame({
        f'Mean VaR': cluster_means_sorted,
        f'Counts': df['Cluster'].value_counts().reindex(cluster_means_sorted.index)
    })
    
    return summary

# Get cluster summaries for each algorithm
kmeans_summary = get_cluster_summary(result_df_kmeans, 'K-Means')
aglo_summary = get_cluster_summary(result_df_aglo, 'Agglomerative')
gmm_summary = get_cluster_summary(result_df_gmm, 'GMM')

# Display the individual summaries
print("K-Means Cluster Summary:")
print(kmeans_summary)

print("\nAgglomerative Cluster Summary:")
print(aglo_summary)

print("\nGMM Cluster Summary:")
print(gmm_summary)

### Change of Cluster Order based On The Sorted Mean VaR
Using these summaries, we relabel the clusters of all three algorithms so that cluster 1 has the lowest mean VaR and cluster 5 the highest

In [None]:
# Create copy of DataFrame without changing the original one
dfs = [result_df_kmeans, result_df_aglo, result_df_gmm]
new_dfs = [df.copy() for df in dfs]

# Cluster summaries sorted by mean VaR; their index order defines the new labels
clusters = [kmeans_summary, aglo_summary, gmm_summary]

# Map each old label to its rank (1 = lowest mean VaR) in a single step
for i, df in enumerate(new_dfs):
    df['Cluster'] = df['Cluster'].map({old: new for new, old in enumerate(clusters[i].index, start=1)})

# Save the result on the variable created earlier
newresult_df_kmeans, newresult_df_aglo, newresult_df_gmm = new_dfs

# get_cluster_summary from the previous cell is reused here

# Get cluster summaries for each algorithm after reassigning clusters
new_kmeans_summary = get_cluster_summary(newresult_df_kmeans, 'K-Means')
new_aglo_summary = get_cluster_summary(newresult_df_aglo, 'Agglomerative')
new_gmm_summary = get_cluster_summary(newresult_df_gmm, 'GMM')

# Plot bar plot for mean VaR by cluster and algorithm after reassigning clusters
plt.figure(figsize=(12, 6))

bar_width = 0.2
x_kmeans = np.arange(len(new_kmeans_summary))
x_aglo = np.arange(len(new_aglo_summary)) + bar_width
x_gmm = np.arange(len(new_gmm_summary)) + 2 * bar_width

plt.bar(x_kmeans, new_kmeans_summary['Mean VaR'], width=bar_width, label='K-Means')
plt.bar(x_aglo, new_aglo_summary['Mean VaR'], width=bar_width, label='Agglomerative')
plt.bar(x_gmm, new_gmm_summary['Mean VaR'], width=bar_width, label='GMM')

plt.xlabel('Cluster')
plt.ylabel('Mean VaR')
plt.title('Mean VaR by Cluster')
plt.xticks(np.arange(len(new_gmm_summary)) + bar_width, new_gmm_summary.index)
plt.legend()
plt.grid(axis='y', linestyle='--', alpha=0.7)

plt.show()
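
As a compact illustration of the relabelling idea, the sketch below ranks clusters by mean VaR on a made-up result table (`toy_result`, its tickers, and its values are hypothetical). It is one possible way to derive the mapping, not necessarily how the notebook has to do it.

```python
import pandas as pd

# Hypothetical clustering result with arbitrary labels
toy_result = pd.DataFrame(
    {'VaR': [0.01, 0.02, 0.09, 0.08, 0.05], 'Cluster': [3, 3, 1, 1, 2]},
    index=['AAA', 'BBB', 'CCC', 'DDD', 'EEE'],
)

# Mean VaR per original label, ranked from lowest (1) to highest
mean_var = toy_result.groupby('Cluster')['VaR'].mean()
label_map = mean_var.rank(method='first').astype(int).to_dict()

# Apply the mapping in one step so that swapped labels cannot collide
relabelled = toy_result.assign(Cluster=toy_result['Cluster'].map(label_map))

print(label_map)
print(relabelled)
```

Here the original label 3 has the lowest mean VaR and becomes cluster 1, while the original label 1 becomes cluster 3.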


### Comparison of The Number of Stocks in Every Cluster
The following is an illustration of the number of stocks in each Cluster

In [None]:
# Count the number of stocks in each cluster for each DataFrame
kmeans_cluster_counts = newresult_df_kmeans['Cluster'].value_counts().sort_index()
aglo_cluster_counts = newresult_df_aglo['Cluster'].value_counts().sort_index()
gmm_cluster_counts = newresult_df_gmm['Cluster'].value_counts().sort_index()

# Create a bar plot to compare the number of stocks in each cluster for each algorithm
plt.figure(figsize=(10, 6))

bar_width = 0.2
index = kmeans_cluster_counts.index  

plt.bar(index - bar_width, kmeans_cluster_counts, width=bar_width, label='K-Means')
plt.bar(index, aglo_cluster_counts, width=bar_width, label='Agglomerative')
plt.bar(index + bar_width, gmm_cluster_counts, width=bar_width, label='GMM')

# Display count values on top of each bar
for i, v in enumerate(kmeans_cluster_counts):
    plt.text(i + 1 - bar_width, v + 0.5, str(v), ha='center', va='bottom')
for i, v in enumerate(aglo_cluster_counts):
    plt.text(i+1, v + 0.5, str(v), ha='center', va='bottom')
for i, v in enumerate(gmm_cluster_counts):
    plt.text(i + 1 + bar_width, v + 0.5, str(v), ha='center', va='bottom')

plt.xlabel('Cluster')
plt.ylabel('Number of Stocks')
plt.title('Number of Stocks in Each Cluster for Different Algorithms')
plt.xticks(index)
plt.legend()
plt.tight_layout()
plt.show()


### Stock Cluster 1
The following lists the stocks that all three algorithms place in cluster 1

In [None]:
newresult_df_kmeans['Stocks'] = newresult_df_kmeans.index
newresult_df_aglo['Stocks'] = newresult_df_aglo.index
newresult_df_gmm['Stocks'] = newresult_df_gmm.index

# Choose stocks with the cluster of 1 in each algorithm
stocks_kmeans_cluster1 = newresult_df_kmeans[newresult_df_kmeans['Cluster'] == 1]['Stocks']
stocks_aglo_cluster1 = newresult_df_aglo[newresult_df_aglo['Cluster'] == 1]['Stocks']
stocks_gmm_cluster1 = newresult_df_gmm[newresult_df_gmm['Cluster'] == 1]['Stocks']

# Find stocks that are in the cluster 1 on every algorithm
stocks_in_clusters1 = set(stocks_kmeans_cluster1) & set(stocks_aglo_cluster1) & set(stocks_gmm_cluster1)

# Display the stocks that are in cluster 1 in every algorithm
print("The following stocks are in cluster 1 in all three algorithms:")
print(', '.join(f"'{stock}'" for stock in stocks_in_clusters1))

### Stock Cluster 2
The following lists the stocks that all three algorithms place in cluster 2

In [None]:
# Choose stocks with the cluster of 2 in each algorithm
stocks_kmeans_cluster2 = newresult_df_kmeans[newresult_df_kmeans['Cluster'] == 2]['Stocks']
stocks_aglo_cluster2 = newresult_df_aglo[newresult_df_aglo['Cluster'] == 2]['Stocks']
stocks_gmm_cluster2 = newresult_df_gmm[newresult_df_gmm['Cluster'] == 2]['Stocks']

# Find stocks that are in the cluster 2 on every algorithm
stocks_in_clusters2 = set(stocks_kmeans_cluster2) & set(stocks_aglo_cluster2) & set(stocks_gmm_cluster2)

# Display the stocks that are in cluster 2 in every algorithm
print("The following stocks are in cluster 2 in all three algorithms:")
print(', '.join(f"'{stock}'" for stock in stocks_in_clusters2))

### Stock Cluster 3
The following lists the stocks that all three algorithms place in cluster 3

In [None]:
# Choose stocks with the cluster of 3 in each algorithm
stocks_kmeans_cluster3 = newresult_df_kmeans[newresult_df_kmeans['Cluster'] == 3]['Stocks']
stocks_aglo_cluster3 = newresult_df_aglo[newresult_df_aglo['Cluster'] == 3]['Stocks']
stocks_gmm_cluster3 = newresult_df_gmm[newresult_df_gmm['Cluster'] == 3]['Stocks']

# Find stocks that are in the cluster 3 on every algorithm
stocks_in_clusters3 = set(stocks_kmeans_cluster3) & set(stocks_aglo_cluster3) & set(stocks_gmm_cluster3)

# Display the stocks that are in cluster 3 in every algorithm
print("The following stocks are in cluster 3 in all three algorithms:")
print(', '.join(f"'{stock}'" for stock in stocks_in_clusters3))

### Stock Cluster 4
The following lists the stocks that all three algorithms place in cluster 4

In [None]:
# Choose stocks with the cluster of 4 in each algorithm
stocks_kmeans_cluster4 = newresult_df_kmeans[newresult_df_kmeans['Cluster'] == 4]['Stocks']
stocks_aglo_cluster4 = newresult_df_aglo[newresult_df_aglo['Cluster'] == 4]['Stocks']
stocks_gmm_cluster4 = newresult_df_gmm[newresult_df_gmm['Cluster'] == 4]['Stocks']

# Find stocks that are in the cluster 4 on every algorithm
stocks_in_clusters4 = set(stocks_kmeans_cluster4) & set(stocks_aglo_cluster4) & set(stocks_gmm_cluster4)

# Display the stocks that are in cluster 4 in every algorithm
print("The following stocks are in cluster 4 in all three algorithms:")
print(', '.join(f"'{stock}'" for stock in stocks_in_clusters4))

### Stock Cluster 5
The following lists the stocks that all three algorithms place in cluster 5

In [None]:
# Choose stocks with the cluster of 5 in each algorithm
stocks_kmeans_cluster5 = newresult_df_kmeans[newresult_df_kmeans['Cluster'] == 5]['Stocks']
stocks_aglo_cluster5 = newresult_df_aglo[newresult_df_aglo['Cluster'] == 5]['Stocks']
stocks_gmm_cluster5 = newresult_df_gmm[newresult_df_gmm['Cluster'] == 5]['Stocks']

# Find stocks that are in the cluster 5 on every algorithm
stocks_in_clusters5 = set(stocks_kmeans_cluster5) & set(stocks_aglo_cluster5) & set(stocks_gmm_cluster5)

# Display the stocks that are in cluster 5 in every algorithm
print("The following stocks are in cluster 5 in all three algorithms:")
print(', '.join(f"'{stock}'" for stock in stocks_in_clusters5))

# ---------------------------------------------------------------------------
## Features
# ---------------------------------------------------------------------------

### Note
Value at Risk (VaR) estimates the potential loss of an investment at a given confidence level. For example, a VaR of 0.56 at a 99% confidence level means there is a 99% probability that the loss will not exceed 56% over the period considered (one trading day here, since the VaR is derived from daily returns).

In general, the lower the VaR value, the more stable and less volatile the stock price, indicating lower risk. Conversely, the higher the VaR value, the greater the share price fluctuations, indicating a higher level of risk in daily stock price changes.
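
The 99% interpretation can be checked with a small simulation. The sketch below assumes normally distributed returns and uses made-up statistics (`mean_return`, `std_return`, and the sample size are illustrative assumptions); real returns may deviate from this assumption.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical daily return statistics (not taken from the data set)
mean_return, std_return = 0.001, 0.02
var_99 = -(mean_return + std_return * norm.ppf(0.01))

# Simulate normally distributed daily returns with a fixed seed
rng = np.random.default_rng(42)
simulated = rng.normal(mean_return, std_return, size=200_000)

# Share of simulated days on which the loss is larger than the VaR
breach_rate = np.mean(simulated < -var_99)

print(f"VaR: {var_99:.4f}, share of days with a larger loss: {breach_rate:.4f}")
```

The share of days with a loss beyond the VaR should land close to 1%, which is the complement of the 99% confidence level.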

### Merging All The Clusters Into One Data Frame
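The code below merges the three relabelled results into one dataframe. Because the merge is an inner join on `VaR`, `Stocks`, and `Cluster`, only stocks that all three algorithms place in the same cluster are kept

In [None]:
# Inner-merge the three DataFrames on the 'VaR', 'Stocks', and 'Cluster' columns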
finalresult_df = pd.merge(newresult_df_kmeans, newresult_df_aglo, on=['VaR', 'Stocks', 'Cluster'])
finalresult_df = pd.merge(finalresult_df, newresult_df_gmm, on=['VaR', 'Stocks', 'Cluster'])

print(finalresult_df)
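
To show what merging on all three columns does, here is a minimal sketch with two made-up result tables (`toy_a`, `toy_b`, and their tickers are hypothetical). The outer merge with `indicator=True` is only an auditing aid and is not part of the notebook's pipeline.

```python
import pandas as pd

# Hypothetical per-algorithm results; 'CCC' is labelled differently by the second one
toy_a = pd.DataFrame({'Stocks': ['AAA', 'BBB', 'CCC'], 'VaR': [0.01, 0.05, 0.09], 'Cluster': [1, 2, 3]})
toy_b = pd.DataFrame({'Stocks': ['AAA', 'BBB', 'CCC'], 'VaR': [0.01, 0.05, 0.09], 'Cluster': [1, 2, 2]})

# Inner merge on all three columns keeps only rows that match exactly
agreed = pd.merge(toy_a, toy_b, on=['VaR', 'Stocks', 'Cluster'])

# Outer merge with an indicator shows which rows the inner merge would drop
audit = pd.merge(toy_a, toy_b, on=['VaR', 'Stocks', 'Cluster'], how='outer', indicator=True)

print(agreed)
print(audit)
```

In this toy case `CCC` disappears from `agreed` because the two tables disagree on its cluster, so stocks missing from the merged dataframe are those without a consensus label.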

### Feature 1
#### Look up which cluster a specific stock falls into and recommend other stocks from the same cluster.

In [None]:
def print_cluster_info(cluster, cluster_data):
    risk_levels = {
        1: "Very Low",
        2: "Low",
        3: "Medium",
        4: "High",
        5: "Very High"
    }

    print(f"Cluster {cluster}: Investments in this cluster carry a {risk_levels.get(cluster, 'unknown')} risk.\n")

    # cluster_data is already sorted by ascending VaR, so head(5) gives the lowest-risk stocks
    lowest_var_stocks = cluster_data.head(5)
    
    print(f"5 stocks with the lowest risk in Cluster {cluster}:")
    print(lowest_var_stocks[['VaR', 'Stocks', 'Cluster']], end='\n\n')

stock_name = "BBCA"
cluster_final = finalresult_df[finalresult_df.Stocks == stock_name]['Cluster']

if not cluster_final.empty:
    clusters_found = cluster_final.unique()
    cluster_data_dict = {}

    for cluster in clusters_found:
        print(f"The stock {stock_name} is in Cluster {cluster}")
        same_cluster_final = finalresult_df[finalresult_df['Cluster'] == cluster]
        sorted_cluster = same_cluster_final.sort_values(by='VaR', ascending=True)
        cluster_data_dict[cluster] = sorted_cluster.head(5)

        print_cluster_info(cluster, cluster_data_dict[cluster])

    additional_stocksfix = [cluster_data['Stocks'].tolist() for cluster_data in cluster_data_dict.values()]
    # Flatten the lists, add the selected stock, and drop duplicates while keeping order
    stocks_to_plot = list(dict.fromkeys([item for sublist in additional_stocksfix for item in sublist] + [stock_name]))

    selected_stocks = prices_df[stocks_to_plot]

    plt.figure(figsize=(12, 6))
    for stock in stocks_to_plot:
        plt.semilogy(selected_stocks.index, selected_stocks[stock], label=stock)

    plt.title('Stock Price Chart')
    plt.xlabel('Date')
    plt.ylabel('Stock Price')
    plt.legend()
    plt.grid(True)
    plt.gca().xaxis.set_major_locator(mdates.MonthLocator())
    plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m'))
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()

else:
    print(f"The stock {stock_name} was not found in any cluster")

### Feature 2
#### Recommend stocks for a chosen cluster risk level.

In [None]:
# Choose the risk level
tingkat_resiko = "High"

# Dictionary mapping each risk level to its cluster
tingkat_resiko_to_cluster = {
    "Very Low": 1,
    "Low": 2,
    "Medium": 3,
    "High": 4,
    "Very High": 5
}

# Check if the chosen risk level is valid
if tingkat_resiko in tingkat_resiko_to_cluster:
    cluster = tingkat_resiko_to_cluster[tingkat_resiko]
    emiten_cluster = finalresult_df[finalresult_df['Cluster'] == cluster]['Stocks']
    
    if not emiten_cluster.empty:
        if len(emiten_cluster) < 10:
            print(f"Recommended issuers for risk level '{tingkat_resiko}':")
            for emiten in emiten_cluster:
                print(emiten)
        else:
            # Pick 10 issuers at random when the cluster has 10 or more
            emiten_rekomendasi = random.sample(emiten_cluster.tolist(), k=10)
            print(f"10 recommended issuers for risk level '{tingkat_resiko}':")
            for emiten in emiten_rekomendasi:
                print(emiten)
    else:
        print(f"No issuers found for risk level '{tingkat_resiko}'")
else:
    print("Invalid risk level")


# ---------------------------------------------------------------------------
## EVALUATION
# ---------------------------------------------------------------------------


### Silhouette Score

In [None]:
# Take the clustering results from all the algorithms
test1_clusters_kmeans = result_df_kmeans['Cluster']  # Clustering result of K-Means
test1_clusters_gmm = result_df_gmm['Cluster']   # Clustering result of GMM
test1_clusters_aglo = result_df_aglo['Cluster']   # Clustering result of Agglomerative

# Compute the silhouette score of every clustering method
silhouette_kmeans = silhouette_score(value_at_risk.values.reshape(-1, 1), test1_clusters_kmeans)
silhouette_gmm = silhouette_score(value_at_risk.values.reshape(-1, 1), test1_clusters_gmm)
silhouette_aglo = silhouette_score(value_at_risk.values.reshape(-1, 1), test1_clusters_aglo)

# Display the result
print(f"Silhouette Score K-Means: {silhouette_kmeans}")
print(f"Silhouette Score GMM: {silhouette_gmm}")
print(f"Silhouette Score Agglomerative Clustering: {silhouette_aglo}")

methods = ['K-Means', 'GMM', 'Agglomerative']
scores = [silhouette_kmeans, silhouette_gmm, silhouette_aglo]

# Create a bar chart
plt.barh(methods, scores)
plt.xlabel('Silhouette Score')
plt.title('Silhouette score for Different Clustering Methods')
plt.xlim([0, max(scores) + 0.1])  # Adjust the x-axis limits if needed

# Display the plot
plt.show()

In [None]:
clusters=range(2,35,1)

# Compute the silhouette score for K means cluster within the range
scores = []
for k in clusters:
    km = KMeans(n_clusters=k,random_state=0, n_init=10)
    labels = km.fit_predict(value_at_risk.values.reshape(-1, 1))
    score = silhouette_score(value_at_risk.values.reshape(-1, 1),labels)
    scores.append(score)


plt.figure(figsize=(20,10))
plt.plot(clusters,scores)

# Compute the silhouette score for Agglomerative cluster within the range
scores = []
for k in clusters:
    ag = AgglomerativeClustering(n_clusters=k)
    labels = ag.fit_predict(value_at_risk.values.reshape(-1, 1))
    score = silhouette_score(value_at_risk.values.reshape(-1, 1),labels)
    scores.append(score)

plt.plot(clusters,scores)

# Compute the silhouette score for GMM cluster within the range
scores = []
for k in clusters:
    gm =GaussianMixture(n_components=k, random_state=42)
    labels = gm.fit_predict(value_at_risk.values.reshape(-1, 1))
    score = silhouette_score(value_at_risk.values.reshape(-1, 1),labels)
    scores.append(score)

plt.plot(clusters,scores)

plt.title('Silhouette Scores')
plt.xlabel('Number of Clusters')
plt.ylabel('Silhouette Score')
plt.legend(['K-Means','Agglomerative','GMM'])
plt.show()

### Davies-Bouldin Index

In [None]:
# Compute the Davies-Bouldin Index for every clustering method
db_index_kmeans = davies_bouldin_score(value_at_risk.values.reshape(-1, 1), test1_clusters_kmeans)
db_index_gmm = davies_bouldin_score(value_at_risk.values.reshape(-1, 1), test1_clusters_gmm)
db_index_aglo = davies_bouldin_score(value_at_risk.values.reshape(-1, 1), test1_clusters_aglo)

# Display the result
print(f"Davies-Bouldin Index K-Means: {db_index_kmeans}")
print(f"Davies-Bouldin Index GMM: {db_index_gmm}")
print(f"Davies-Bouldin Index Agglomerative Clustering: {db_index_aglo}")

methods = ['K-Means', 'GMM', 'Agglomerative']
scores = [db_index_kmeans, db_index_gmm, db_index_aglo]

# Create a bar chart
plt.barh(methods, scores)
plt.xlabel('Davies-Bouldin Index Score')
plt.title('Davies-Bouldin Index for Different Clustering Methods')
plt.xlim([0, max(scores) + 0.1])  # Adjust the x-axis limits if needed

# Display the plot
plt.show()
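
As a reading aid for the two metrics, the sketch below scores a sensible and a deliberately poor labelling of made-up one-dimensional data (`X`, `good_labels`, and `poor_labels` are hypothetical). A higher Silhouette score and a lower Davies-Bouldin Index both indicate better clustering.

```python
import numpy as np
from sklearn.metrics import silhouette_score, davies_bouldin_score

# Hypothetical 1-D data with two obvious groups, shaped like the VaR input
X = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2]).reshape(-1, 1)

good_labels = np.array([0, 0, 0, 1, 1, 1])   # matches the two groups
poor_labels = np.array([0, 1, 0, 1, 0, 1])   # mixes the two groups

# Store (silhouette, Davies-Bouldin) for each labelling
results = {}
for name, labels in [('good', good_labels), ('poor', poor_labels)]:
    results[name] = (silhouette_score(X, labels), davies_bouldin_score(X, labels))
    print(f"{name}: silhouette={results[name][0]:.3f}, Davies-Bouldin={results[name][1]:.3f}")
```

The good labelling should come out with the higher Silhouette score and the lower Davies-Bouldin Index, which is the direction to keep in mind when reading the charts in this section.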


In [None]:
clusters=range(3,35,1)

# Compute the Davies Bouldin score for K means cluster within the range 
scores_dbi = []
for k in clusters:
    km = KMeans(n_clusters=k,random_state=0, n_init=10)
    labels = km.fit_predict(value_at_risk.values.reshape(-1, 1))
    score = davies_bouldin_score(value_at_risk.values.reshape(-1, 1),labels)
    scores_dbi.append(score)


plt.figure(figsize=(20,10))
plt.plot(clusters,scores_dbi)

# Compute the Davies Bouldin score for Agglomerative cluster within the range
scores_dbi = []
for k in clusters:
    ag = AgglomerativeClustering(n_clusters=k)
    labels = ag.fit_predict(value_at_risk.values.reshape(-1, 1))
    score = davies_bouldin_score(value_at_risk.values.reshape(-1, 1),labels)
    scores_dbi.append(score)

plt.plot(clusters,scores_dbi)

# Compute the Davies Bouldin score for GMM cluster within the range
scores_dbi = []
for k in clusters:
    gm =GaussianMixture(n_components=k, random_state=42)
    labels = gm.fit_predict(value_at_risk.values.reshape(-1, 1))
    score = davies_bouldin_score(value_at_risk.values.reshape(-1, 1),labels)
    scores_dbi.append(score)

plt.plot(clusters,scores_dbi)

plt.title('Davies-Bouldin Score')
plt.xlabel('Number of Clusters')
plt.ylabel('Davies-Bouldin Score')
plt.legend(['K-Means','Agglomerative','GMM'])
plt.show()

# Summary

Our results show that with five clusters, K-Means achieves a better Silhouette score than the other algorithms, meaning its stocks sit closer to their own cluster than to neighbouring ones. On the Davies-Bouldin Index (DBI), where lower values indicate more compact and better separated clusters, GMM scores better than the other two algorithms.

In conclusion, based on our analysis of the score graphs, K-Means seems to be the best choice overall because it behaves more consistently within each cluster when evaluated with both the Silhouette and DBI metrics.