# Predict which stock will provide greatest rate of return

This data set has 750 rows and 16 columns.

This dataset contains weekly data for the Dow Jones Industrial Index. It has been used in computational investing research.

In this dataset, each record (row) is data for a week. Each record also has the percentage of return that stock has in the following week (percent_change_next_weeks_price). 

Ideally, this could be used to determine which stock will produce the greatest rate of return in the following week. 

Key metrics from each sector (primary, secondary and tertiary) to identify:

1) Stock returns
2) Volatility (standard deviation) - measures the degree of variation of a financial instrument's returns over time
3) Range of Returns (min and max) - measure the difference between the minimum and maximum values of returns within a dataset.

We must also determine or highlight each sector performance, creating a bar chart showing average returns of each sector.

Key words: stock returns, volatility, range of returns, diversification, market risk, investors

In [56]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression # for regression
from sklearn.cluster import KMeans # for clustering

# Data Preprocessing

In [256]:
# Define the data types for each column
data_types = {
    'quarter': int,
    'stock': str,
    'date': str,
    'open': str,
    'high': str,
    'low': str,
    'close': str,
    'volume': int,
    'percent_change_price': float,
    'percent_change_volume_over_last_wk': float,
    'previous_week_volume': int,
    'next_weeks_open': str,
    'next_weeks_close': str,
    'percent_change_next_weeks_price': float,
    'days_to_next_dividend': int,
    'percent_return_next_dividend': float
}

#define missing value markers
missing_values = ['?']

stock = pd.read_csv('stock_1.csv', delimiter = ',', dtype=data_types, na_values=missing_values)

In [257]:
stock

Unnamed: 0,quarter,stock,date,open,high,low,close,volume,percent_change_price,percent_change_volume_over_last_wk,previous_weeks_volume,next_weeks_open,next_weeks_close,percent_change_next_weeks_price,days_to_next_dividend,percent_return_next_dividend
0,1,AA,1/7/2011,$15.82,$16.72,$15.78,$16.42,239655616,3.79267,,,$16.71,$15.97,-4.428490,26,0.182704
1,1,AA,1/14/2011,$16.71,$16.71,$15.64,$15.97,242963398,-4.42849,1.380223,239655616.0,$16.19,$15.79,-2.470660,19,0.187852
2,1,AA,1/21/2011,$16.19,$16.38,$15.60,$15.79,138428495,-2.47066,-43.024959,242963398.0,$15.87,$16.13,1.638310,12,0.189994
3,1,AA,1/28/2011,$15.87,$16.63,$15.82,$16.13,151379173,1.63831,9.355500,138428495.0,$16.18,$17.14,5.933250,5,0.185989
4,1,AA,2/4/2011,$16.18,$17.39,$16.18,$17.14,154387761,5.93325,1.987452,151379173.0,$17.33,$17.37,0.230814,97,0.175029
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
745,2,XOM,5/27/2011,$80.22,$82.63,$80.07,$82.63,68230855,3.00424,-21.355713,86758820.0,$83.28,$81.18,-2.521610,75,0.568801
746,2,XOM,6/3/2011,$83.28,$83.75,$80.18,$81.18,78616295,-2.52161,15.221032,68230855.0,$80.93,$79.78,-1.420980,68,0.578960
747,2,XOM,6/10/2011,$80.93,$81.87,$79.72,$79.78,92380844,-1.42098,17.508519,78616295.0,$80.00,$79.02,-1.225000,61,0.589120
748,2,XOM,6/17/2011,$80.00,$80.82,$78.33,$79.02,100521400,-1.22500,8.811952,92380844.0,$78.65,$76.78,-2.377620,54,0.594786


In [258]:
columns_to_clean = ['open', 'high', 'low', 'close', 'next_weeks_open','next_weeks_close']
for column in columns_to_clean:
    stock[column] = stock[column].str.replace('$', '').astype(float)


The default value of regex will change from True to False in a future version. In addition, single character regular expressions will *not* be treated as literal strings when regex=True.



In [259]:
stock

Unnamed: 0,quarter,stock,date,open,high,low,close,volume,percent_change_price,percent_change_volume_over_last_wk,previous_weeks_volume,next_weeks_open,next_weeks_close,percent_change_next_weeks_price,days_to_next_dividend,percent_return_next_dividend
0,1,AA,1/7/2011,15.82,16.72,15.78,16.42,239655616,3.79267,,,16.71,15.97,-4.428490,26,0.182704
1,1,AA,1/14/2011,16.71,16.71,15.64,15.97,242963398,-4.42849,1.380223,239655616.0,16.19,15.79,-2.470660,19,0.187852
2,1,AA,1/21/2011,16.19,16.38,15.60,15.79,138428495,-2.47066,-43.024959,242963398.0,15.87,16.13,1.638310,12,0.189994
3,1,AA,1/28/2011,15.87,16.63,15.82,16.13,151379173,1.63831,9.355500,138428495.0,16.18,17.14,5.933250,5,0.185989
4,1,AA,2/4/2011,16.18,17.39,16.18,17.14,154387761,5.93325,1.987452,151379173.0,17.33,17.37,0.230814,97,0.175029
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
745,2,XOM,5/27/2011,80.22,82.63,80.07,82.63,68230855,3.00424,-21.355713,86758820.0,83.28,81.18,-2.521610,75,0.568801
746,2,XOM,6/3/2011,83.28,83.75,80.18,81.18,78616295,-2.52161,15.221032,68230855.0,80.93,79.78,-1.420980,68,0.578960
747,2,XOM,6/10/2011,80.93,81.87,79.72,79.78,92380844,-1.42098,17.508519,78616295.0,80.00,79.02,-1.225000,61,0.589120
748,2,XOM,6/17/2011,80.00,80.82,78.33,79.02,100521400,-1.22500,8.811952,92380844.0,78.65,76.78,-2.377620,54,0.594786


In [260]:
stock[['percent_change_volume_over_last_wk', 'previous_weeks_volume']] = stock[['percent_change_volume_over_last_wk', 'previous_weeks_volume']].fillna(stock[['percent_change_volume_over_last_wk', 'previous_weeks_volume']].mean())


In [261]:
stock

Unnamed: 0,quarter,stock,date,open,high,low,close,volume,percent_change_price,percent_change_volume_over_last_wk,previous_weeks_volume,next_weeks_open,next_weeks_close,percent_change_next_weeks_price,days_to_next_dividend,percent_return_next_dividend
0,1,AA,1/7/2011,15.82,16.72,15.78,16.42,239655616,3.79267,5.593627,1.173876e+08,16.71,15.97,-4.428490,26,0.182704
1,1,AA,1/14/2011,16.71,16.71,15.64,15.97,242963398,-4.42849,1.380223,2.396556e+08,16.19,15.79,-2.470660,19,0.187852
2,1,AA,1/21/2011,16.19,16.38,15.60,15.79,138428495,-2.47066,-43.024959,2.429634e+08,15.87,16.13,1.638310,12,0.189994
3,1,AA,1/28/2011,15.87,16.63,15.82,16.13,151379173,1.63831,9.355500,1.384285e+08,16.18,17.14,5.933250,5,0.185989
4,1,AA,2/4/2011,16.18,17.39,16.18,17.14,154387761,5.93325,1.987452,1.513792e+08,17.33,17.37,0.230814,97,0.175029
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
745,2,XOM,5/27/2011,80.22,82.63,80.07,82.63,68230855,3.00424,-21.355713,8.675882e+07,83.28,81.18,-2.521610,75,0.568801
746,2,XOM,6/3/2011,83.28,83.75,80.18,81.18,78616295,-2.52161,15.221032,6.823086e+07,80.93,79.78,-1.420980,68,0.578960
747,2,XOM,6/10/2011,80.93,81.87,79.72,79.78,92380844,-1.42098,17.508519,7.861630e+07,80.00,79.02,-1.225000,61,0.589120
748,2,XOM,6/17/2011,80.00,80.82,78.33,79.02,100521400,-1.22500,8.811952,9.238084e+07,78.65,76.78,-2.377620,54,0.594786


In [262]:
stock.describe()

Unnamed: 0,quarter,open,high,low,close,volume,percent_change_price,percent_change_volume_over_last_wk,previous_weeks_volume,next_weeks_open,next_weeks_close,percent_change_next_weeks_price,days_to_next_dividend,percent_return_next_dividend
count,750.0,750.0,750.0,750.0,750.0,750.0,750.0,750.0,750.0,750.0,750.0,750.0,750.0,750.0
mean,1.52,53.65184,54.669987,52.64016,53.729267,117547800.0,0.050262,5.593627,117387600.0,53.70244,53.88908,0.238468,52.525333,0.691826
std,0.499933,32.638852,33.215994,32.119277,32.788787,158438100.0,2.517809,39.723229,156010700.0,32.778111,33.016677,2.679538,46.335098,0.305482
min,1.0,10.59,10.94,10.4,10.52,9718851.0,-15.4229,-61.433175,9718851.0,10.52,10.52,-15.4229,0.0,0.065574
25%,1.0,29.83,30.6275,28.72,30.365,30866240.0,-1.288053,-18.890959,31272650.0,30.315,30.4625,-1.222068,24.0,0.534549
50%,2.0,45.97,46.885,44.8,45.93,53060880.0,0.0,1.801868,55128820.0,46.015,46.125,0.101193,47.0,0.681067
75%,2.0,72.715,74.2875,71.0375,72.6675,132721800.0,1.650888,19.984489,129617000.0,72.715,72.915,1.845562,69.0,0.854291
max,2.0,172.11,173.54,167.82,170.58,1453439000.0,9.88223,327.408924,1453439000.0,172.11,174.54,9.88223,336.0,1.56421


In [263]:
stock['date'] = pd.to_datetime(stock['date'])

In [264]:
stock

Unnamed: 0,quarter,stock,date,open,high,low,close,volume,percent_change_price,percent_change_volume_over_last_wk,previous_weeks_volume,next_weeks_open,next_weeks_close,percent_change_next_weeks_price,days_to_next_dividend,percent_return_next_dividend
0,1,AA,2011-01-07,15.82,16.72,15.78,16.42,239655616,3.79267,5.593627,1.173876e+08,16.71,15.97,-4.428490,26,0.182704
1,1,AA,2011-01-14,16.71,16.71,15.64,15.97,242963398,-4.42849,1.380223,2.396556e+08,16.19,15.79,-2.470660,19,0.187852
2,1,AA,2011-01-21,16.19,16.38,15.60,15.79,138428495,-2.47066,-43.024959,2.429634e+08,15.87,16.13,1.638310,12,0.189994
3,1,AA,2011-01-28,15.87,16.63,15.82,16.13,151379173,1.63831,9.355500,1.384285e+08,16.18,17.14,5.933250,5,0.185989
4,1,AA,2011-02-04,16.18,17.39,16.18,17.14,154387761,5.93325,1.987452,1.513792e+08,17.33,17.37,0.230814,97,0.175029
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
745,2,XOM,2011-05-27,80.22,82.63,80.07,82.63,68230855,3.00424,-21.355713,8.675882e+07,83.28,81.18,-2.521610,75,0.568801
746,2,XOM,2011-06-03,83.28,83.75,80.18,81.18,78616295,-2.52161,15.221032,6.823086e+07,80.93,79.78,-1.420980,68,0.578960
747,2,XOM,2011-06-10,80.93,81.87,79.72,79.78,92380844,-1.42098,17.508519,7.861630e+07,80.00,79.02,-1.225000,61,0.589120
748,2,XOM,2011-06-17,80.00,80.82,78.33,79.02,100521400,-1.22500,8.811952,9.238084e+07,78.65,76.78,-2.377620,54,0.594786


In [265]:
next_week_price = stock[['quarter', 'stock', 'date','percent_change_next_weeks_price']]

In [267]:
# create dictionary that maps stock symbols to industry or stock
industry_sector_mapping = {'MMM': 'Industrial',
    'AXP': 'Financial',
    'AA': 'Industrial',
    'T': 'Telecommunications',
    'BAC': 'Financial',
    'BA': 'Aerospace',
    'CAT': 'Industrial',
    'CVX': 'Energy',
    'CSCO': 'Technology',
    'KO': 'Beverage',
    'DD': 'Chemical',
    'XOM': 'Energy',
    'GE': 'Industrial',
    'HPQ': 'Technology',
    'HD': 'Retail',
    'INTC': 'Technology',
    'IBM': 'Technology',
    'JNJ': 'Healthcare',
    'JPM': 'Financial',
    'KRFT': 'Food',
    'MCD': 'Restaurant',
    'MRK': 'Pharmaceutical',
    'MSFT': 'Technology',
    'PFE': 'Pharmaceutical',
    'PG': 'Consumer',
    'TRV': 'Financial',
    'UTX': 'Aerospace',
    'VZ': 'Telecommunications',
    'WMT': 'Retail',
    'DIS': 'Entertainment'
}

# create industry sector column in next_week_price dataframe
next_week_price['industry_sector'] = next_week_price['stock'].map(industry_sector_mapping)



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



In [268]:
next_week_price

Unnamed: 0,quarter,stock,date,percent_change_next_weeks_price,industry_sector
0,1,AA,2011-01-07,-4.428490,Industrial
1,1,AA,2011-01-14,-2.470660,Industrial
2,1,AA,2011-01-21,1.638310,Industrial
3,1,AA,2011-01-28,5.933250,Industrial
4,1,AA,2011-02-04,0.230814,Industrial
...,...,...,...,...,...
745,2,XOM,2011-05-27,-2.521610,Energy
746,2,XOM,2011-06-03,-1.420980,Energy
747,2,XOM,2011-06-10,-1.225000,Energy
748,2,XOM,2011-06-17,-2.377620,Energy


In [269]:
# subsetting again
# create 'sector' column by mapping the stock symbols to sector categories
next_week_price['sector']=next_week_price['industry_sector'].map(industry_sector_mapping)
def classify_sector(sector):
    primary_sectors = ['Telecommunications', 'Beverage', 'Healthcare']
    secondary_sectors = ['Industrial', 'Aerospace', 'Energy', 'Technology', 'Chemical']
    if sector in primary_sectors:
        return 'Primary Sector'
    elif sector in secondary_sectors:
        return 'Secondary Sector'
    else:
        return 'Tertiary Sector'
    
next_week_price['sector']=next_week_price['industry_sector'].apply(classify_sector)



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



In [270]:
next_week_price

Unnamed: 0,quarter,stock,date,percent_change_next_weeks_price,industry_sector,sector
0,1,AA,2011-01-07,-4.428490,Industrial,Secondary Sector
1,1,AA,2011-01-14,-2.470660,Industrial,Secondary Sector
2,1,AA,2011-01-21,1.638310,Industrial,Secondary Sector
3,1,AA,2011-01-28,5.933250,Industrial,Secondary Sector
4,1,AA,2011-02-04,0.230814,Industrial,Secondary Sector
...,...,...,...,...,...,...
745,2,XOM,2011-05-27,-2.521610,Energy,Secondary Sector
746,2,XOM,2011-06-03,-1.420980,Energy,Secondary Sector
747,2,XOM,2011-06-10,-1.225000,Energy,Secondary Sector
748,2,XOM,2011-06-17,-2.377620,Energy,Secondary Sector


In [271]:
# filter by sector
pri_sector_next_wk = next_week_price[next_week_price['sector'] == 'Primary Sector']
sec_sector_next_wk = next_week_price[next_week_price['sector'] == 'Secondary Sector']
ter_sector_next_wk = next_week_price[next_week_price['sector'] == 'Tertiary Sector']

In [272]:
pri_sector_next_wk

Unnamed: 0,quarter,stock,date,percent_change_next_weeks_price,industry_sector,sector
168,1,JNJ,2011-01-07,0.417402,Healthcare,Primary Sector
169,1,JNJ,2011-01-14,0.723356,Healthcare,Primary Sector
170,1,JNJ,2011-01-21,-4.076090,Healthcare,Primary Sector
171,1,JNJ,2011-01-28,1.130320,Healthcare,Primary Sector
172,1,JNJ,2011-02-04,-0.295664,Healthcare,Primary Sector
...,...,...,...,...,...,...
719,2,VZ,2011-05-27,-3.467890,Telecommunications,Primary Sector
720,2,VZ,2011-06-03,-1.068320,Telecommunications,Primary Sector
721,2,VZ,2011-06-10,0.794777,Telecommunications,Primary Sector
722,2,VZ,2011-06-17,1.838760,Telecommunications,Primary Sector


In [152]:
sec_sector_next_wk

Unnamed: 0,quarter,stock,date,percent_change_next_weeks_price,industry_sector,sector
0,1,AA,2011-01-07,-4.428490,Industrial,Secondary Sector
1,1,AA,2011-01-14,-2.470660,Industrial,Secondary Sector
2,1,AA,2011-01-21,1.638310,Industrial,Secondary Sector
3,1,AA,2011-01-28,5.933250,Industrial,Secondary Sector
4,1,AA,2011-02-04,0.230814,Industrial,Secondary Sector
...,...,...,...,...,...,...
745,2,XOM,2011-05-27,-2.521610,Energy,Secondary Sector
746,2,XOM,2011-06-03,-1.420980,Energy,Secondary Sector
747,2,XOM,2011-06-10,-1.225000,Energy,Secondary Sector
748,2,XOM,2011-06-17,-2.377620,Energy,Secondary Sector


In [153]:
ter_sector_next_wk

Unnamed: 0,quarter,stock,date,percent_change_next_weeks_price,industry_sector,sector
12,1,AXP,2011-01-07,4.638010,Financial,Tertiary Sector
13,1,AXP,2011-01-14,-0.065175,Financial,Tertiary Sector
14,1,AXP,2011-01-21,-4.755700,Financial,Tertiary Sector
15,1,AXP,2011-01-28,-0.702470,Financial,Tertiary Sector
16,1,AXP,2011-02-04,6.346680,Financial,Tertiary Sector
...,...,...,...,...,...,...
732,2,WMT,2011-05-27,-2.223030,Retail,Tertiary Sector
733,2,WMT,2011-06-03,-2.116600,Retail,Tertiary Sector
734,2,WMT,2011-06-10,-0.170100,Retail,Tertiary Sector
735,2,WMT,2011-06-17,-0.550285,Retail,Tertiary Sector


# Exploratory Data Analysis

In [225]:
# primary sector
# pri_sector_next_wk


fig = px.line(pri_sector_next_wk, 
              x='date',
              y='percent_change_next_weeks_price',
              title='Predicted Weekly Price Change from Jan to Jun 2011 in Primary Sector',
              color='stock',
              labels={'date': 'Date', 'percent_change_next_weeks_price': '% Change'},
             markers=True)

fig.update_layout(
    width=800,
    height=400,
plot_bgcolor='lightgray',
    legend_title_text='Stocks')



fig.show()

In [203]:
fig = px.bar(
pri_sector_next_wk,
x='stock',
y='percent_change_next_weeks_price',
title='Percentage Change in Stock Prices in Primary Sector',
color='stock',
labels={'percent_change_next_weeks_price': '% Change', 
       'stock': 'Companies in Primary Sector'})
fig.update_layout(
plot_bgcolor='lightgray',
xaxis=dict(tickangle=0),
    legend_title_text='Stocks'
)


fig.show()

In [226]:
pri_sector_next_wk.describe()

Unnamed: 0,quarter,percent_change_next_weeks_price,month
count,100.0,100.0,100.0
mean,1.52,0.314336,3.52
std,0.502117,1.906633,1.68463
min,1.0,-4.07609,1.0
25%,1.0,-0.83209,2.0
50%,2.0,0.271484,4.0
75%,2.0,1.529085,5.0
max,2.0,5.98842,6.0


Volatility = In primary sector, the standard deviation is approx. 1.9066. This indicates that the weekly returns in this sector vary moderately around the mean value of 0.3143. 

Range of Returns using min and max values = the range is approx. 10.065

In [198]:
color_sequence = px.colors.qualitative.Set1

fig = px.line(sec_sector_next_wk, 
              x='date',
              y='percent_change_next_weeks_price',
              title='Predicted Weekly Price Change from Feb to Jun 2011 in Secondary Sector',
              color='stock',
              color_discrete_sequence=color_sequence,
              labels={'date': 'Date', 'percent_change_next_weeks_price': '% Change'},
             markers=True)
fig.update_layout(
    width=800,
    height=500,
plot_bgcolor='lightgray',
    legend_title_text='Stocks'
)

fig.show()

In [202]:
color_sequence = px.colors.qualitative.Set1

fig = px.bar(
sec_sector_next_wk,
x='stock',
y='percent_change_next_weeks_price',
title='Percentage Change in Stock Prices in Secondary Sector',
color='stock',
    color_discrete_sequence=color_sequence,
labels={'percent_change_next_weeks_price': '% Change', 
       'stock': 'Companies in Secondary Sector'})
fig.update_layout(
plot_bgcolor='lightgray',
xaxis=dict(tickangle=0),
    legend_title_text='Stocks'
)


fig.show()

In [227]:
sec_sector_next_wk.describe()

Unnamed: 0,quarter,percent_change_next_weeks_price
count,350.0,350.0
mean,1.52,0.199503
std,0.500315,3.043459
min,1.0,-15.4229
25%,1.0,-1.660718
50%,2.0,0.11758
75%,2.0,2.047655
max,2.0,9.88223


Volatility = In secondary sector, the standard deviation is approx. higher at 3.0435. This indicates that the weekly returns in this sector are more volatile or have a larger degree of variation compared to the primary sector.

Range of Returns using min and max values = the range is larger approx. 25.305

In [197]:
color_sequence = px.colors.qualitative.Set1



fig = px.line(ter_sector_next_wk, 
              x='date',
              y='percent_change_next_weeks_price',
              title='Predicted Weekly Price Change from Feb to Jun 2011 in Tertiary Sector',
              color='stock',
              color_discrete_sequence=color_sequence,
              labels={'date': 'Date', 'percent_change_next_weeks_price': '% Change'},
             markers=True)
fig.update_layout(
    width=900,
    height=800,
plot_bgcolor='lightgray',
    legend_title_text='Stocks'
)

fig.show()

In [201]:
color_sequence=px.colors.qualitative.Set1

fig = px.bar(
ter_sector_next_wk,
x='stock',
y='percent_change_next_weeks_price',
title='Percentage Change in Stock Prices in Tertiary Sector',
color='stock',
    color_discrete_sequence=color_sequence,
labels={'percent_change_next_weeks_price': '% Change', 
       'stock': 'Companies in Tertiary Sector'})
fig.update_layout(
plot_bgcolor='lightgray',
xaxis=dict(tickangle=0),
    legend_title_text='Stocks'
)


fig.show()

In [228]:
ter_sector_next_wk.describe()

Unnamed: 0,quarter,percent_change_next_weeks_price
count,300.0,300.0
mean,1.52,0.258638
std,0.500435,2.442643
min,1.0,-8.13204
25%,1.0,-1.141463
50%,2.0,0.05024
75%,2.0,1.6585
max,2.0,7.93978


Volatility = In tertiary sector, the standard deviation is approx. higher at 2.4426. This indicates that the weekly returns in this sector are moderately volatile or have a moderate degree of variation in weekly returns.

Range of Returns using min and max values = the range is second largest to that of secondary sector, approx. 16.072

What's the take away from this?

The secondary sector has the largest standard deviation and range of returns compared to other sectors, suggesting higher volatility. This means that the returns of stocks in the secondary sector are more spread out and can experience more significant fluctuations compared to the primary and tertiary sectors. This implies that the stocks in secondary sector are riskier.

Investors typically focus to volatility when assessing risk and making investment decisions. Risk assessment is often undertaken to identify which sectors are more or less volatile and how they compare to the market's risk. 

Investors may also manage risk by choosing to diversity their portfolio, involving spreading investments across different sectors. This is done by including stocks from the primary and tertiary sectors in their portfolio to balance the higher volatility represented by the secondary sector.

With risk assessment taken, the information they get from said assessment can be valuable for investors, financial analysts, and portfolio managers when making informed decisions about their investment startegies.

In other words, interpreting the statistics offer valuable insights into the risk associated with different sectors, particularly in stock returns. This helps highlight the importance of considering volatility and risk when making investment decisions and suggests potential strategies for managing and diversifying risk in investment portfolios.

In [232]:
# Bar and Pie chart
sectors = ['Primary', 'Secondary', 'Tertiary']
mean_returns = [0.314336, 0.199503, 0.258638]
total_average_returns = 0.314336 + 0.199503 + 0.258638
primary_percent = (0.314336 / total_average_returns) * 100
secondary_percent = (0.199503 / total_average_returns) * 100
tertiary_percent = (0.258638 / total_average_returns) * 100

In [245]:
print(primary_percent)
print(secondary_percent)
print(tertiary_percent)

40.69195587700346
25.826400009320665
33.48164411367587


In [246]:
# define percentages using the values above for pie chart
percentages = [40.69, 25.83, 33.48]

In [248]:
# define colors
colors =['blue','green','red']

In [250]:
bar = px.bar(x=sectors,y=mean_returns,labels={'x':'Sector', 'y':'Mean Returns'})
bar.update_layout(title='Average Returns by Sector')
bar.show()

In [247]:
# pie chart
pie = px.pie(names=sectors, values=percentages, title='Average Returns by Sector')
pie.show()

What's the takeaway from this?

In terms of Average Returns, the primary sector (40.7%) showed the highest average returns among the three sectors, followed by the tertiary sector (33.5%) and then the secondary sector (25.8%).

However, in terms of volatility, the secondary sector still exhibited the highest volatility due to larger standard deviation and range of returns copared to the other sectors from PREVIOUS statistic results.

Interpretation:

The primary sector's higher average returns indicate the potential for greater short-term gains. This makes the primary sector an excellent choice for investors who seek higher returns.

Whereas, the secondary sector's greater volatility suggests that investments in this sector is riskier, as returns can be more unpredictable. This is a bad choice to invest, as it may lead to lower returns. Thus, investors are likely to avoid making investment in this sector.

The tertiary sector, on the other hand, strikes a balance between moderate returns and relatively lower volatility, making it a good choice for investors who seek for a more stable investment option.

# Feature Engineering

In [286]:
# lets transform the data by creating binary columns
# load the primary, secondary, and tertiary sector datasets
# replace primary df, secondary df, and tertiary df with the actual dataframe names

# perform one-hot encoding for each sector dataset via pd.get_dummies, columns and prefix
primary_df = pd.get_dummies(pri_sector_next_wk, columns=['sector'], prefix='')
secondary_df = pd.get_dummies(sec_sector_next_wk, columns=['sector'],prefix='')
tertiary_df = pd.get_dummies(ter_sector_next_wk, columns=['sector'],prefix='')

In [311]:
primary_df = pd.get_dummies(pri_sector_next_wk, columns=['sector'], dtype=bool, drop_first=False)
secondary_df = pd.get_dummies(sec_sector_next_wk, columns=['sector'],dtype=bool, drop_first=False)
tertiary_df = pd.get_dummies(ter_sector_next_wk, columns=['sector'], dtype=bool, drop_first=False)

In [304]:
combined_df = pd.concat([primary_df, secondary_df, tertiary_df], axis=1)

In [312]:
primary_df

Unnamed: 0,quarter,stock,date,percent_change_next_weeks_price,industry_sector,sector_Primary Sector
168,1,JNJ,2011-01-07,0.417402,Healthcare,True
169,1,JNJ,2011-01-14,0.723356,Healthcare,True
170,1,JNJ,2011-01-21,-4.076090,Healthcare,True
171,1,JNJ,2011-01-28,1.130320,Healthcare,True
172,1,JNJ,2011-02-04,-0.295664,Healthcare,True
...,...,...,...,...,...,...
719,2,VZ,2011-05-27,-3.467890,Telecommunications,True
720,2,VZ,2011-06-03,-1.068320,Telecommunications,True
721,2,VZ,2011-06-10,0.794777,Telecommunications,True
722,2,VZ,2011-06-17,1.838760,Telecommunications,True


In [313]:
combined_df = primary_df.merge(secondary_df, on=['stock','date', 'quarter', 'percent_change_next_weeks_price','industry_sector'], how='outer')
combined_df = combined_df.merge(tertiary_df, on=['stock','date', 'quarter', 'percent_change_next_weeks_price','industry_sector'], how='outer')

In [315]:
# Created binary columns and replace or 'fill' NaN with False

combined_df[['sector_Primary Sector','sector_Secondary Sector','sector_Tertiary Sector']] = combined_df[['sector_Primary Sector','sector_Secondary Sector','sector_Tertiary Sector']].fillna(False)

In [337]:
combined_df

Unnamed: 0,quarter,stock,date,percent_change_next_weeks_price,industry_sector,sector_Primary Sector,sector_Secondary Sector,sector_Tertiary Sector
0,1,JNJ,2011-01-07,0.417402,Healthcare,True,False,False
1,1,JNJ,2011-01-14,0.723356,Healthcare,True,False,False
2,1,JNJ,2011-01-21,-4.076090,Healthcare,True,False,False
3,1,JNJ,2011-01-28,1.130320,Healthcare,True,False,False
4,1,JNJ,2011-02-04,-0.295664,Healthcare,True,False,False
...,...,...,...,...,...,...,...,...
745,2,WMT,2011-05-27,-2.223030,Retail,False,False,True
746,2,WMT,2011-06-03,-2.116600,Retail,False,False,True
747,2,WMT,2011-06-10,-0.170100,Retail,False,False,True
748,2,WMT,2011-06-17,-0.550285,Retail,False,False,True


In [350]:
# Data splitting
train_size = int(0.7 * len(combined_df)) # 70% for training
val_size = int(0.15 * len(combined_df)) # 15% for validation
test_size = len(combined_df) - train_size - val_size # remaining 5% for testing
 
train_data = combined_df[:train_size]
val_data = combined_df[train_size:train_size + val_size]
test_data = combined_df[train_size + val_size:train_size + val_size + test_size]

train_data.reset_index(drop=True, inplace=True)
val_data.reset_index(drop=True, inplace=True)
test_data.reset_index(drop=True, inplace=True)

In [351]:
train_data['day'] = train_data['date'].dt.day
train_data['month'] = train_data['date'].dt.month
train_data['year'] = train_data['date'].dt.year
train_data['day_of_week'] = train_data['date'].dt.dayofweek

val_data['day'] = train_data['date'].dt.day
val_data['month'] = train_data['date'].dt.month
val_data['year'] = train_data['date'].dt.year
val_data['day_of_week'] = train_data['date'].dt.dayofweek

test_data['day'] = train_data['date'].dt.day
test_data['month'] = train_data['date'].dt.month
test_data['year'] = train_data['date'].dt.year
test_data['day_of_week'] = train_data['date'].dt.dayofweek



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/

In [361]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

In [374]:
y_train

0      0.417402
1      0.723356
2     -4.076090
3      1.130320
4     -0.295664
         ...   
520    1.559450
521    0.829346
522   -0.255892
523    0.482251
524   -1.306400
Name: percent_change_next_weeks_price, Length: 525, dtype: float64

In [357]:
# split train_data into X_train and y_train
# use sector_Primary_Sector, sector_Secondary_Sector, sector_Tertiary_Sector, quarter as X
# use percent_change_next_weeks_price as Y

X_train = train_data.drop(columns=['percent_change_next_weeks_price', 'industry_sector', 'date','stock'])
y_train = train_data['percent_change_next_weeks_price']

In [356]:
# split val_data and test_data into X_val and y_val & X_test and y_test
X_val = val_data.drop(columns=['percent_change_next_weeks_price','industry_sector', 'stock', 'date'])
y_val = val_data['percent_change_next_weeks_price']

X_test = test_data.drop(columns=['percent_change_next_weeks_price','industry_sector', 'stock', 'date'])
y_test = test_data['percent_change_next_weeks_price']

In [None]:
# Experimenting using train_test_split to see if the results differ from above

In [359]:
linear_model = LinearRegression()
linear_model.fit(X_train, y_train)

In [363]:
linear_model = LinearRegression()
linear_model.fit(X_train, y_train)

y_pred = linear_model.predict(X_val)
mse = mean_squared_error(y_val, y_pred)
rmse= np.sqrt(mse)
mae= mean_absolute_error(y_val, y_pred)
r_squared = r2_score(y_val, y_pred)

print(f"Mean Squared Error: {mse}")
print(f"Root Mean Squared Error: {rmse}")
print(f"Mean Absolute Error: {mae}")
print(f"R-Squared: {r_squared}")

Mean Squared Error: 6.86525904030559
Root Mean Squared Error: 2.620163933860931
Mean Absolute Error: 1.943607740215439
R-Squared: -0.13553384762941434


In [364]:
y_pred_test = linear_model.predict(X_test)

In [366]:
coefficients = linear_model.coef_
intercept = linear_model.intercept_

# Decision Tree Regressor

In [368]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

In [369]:
decision_tree_model = DecisionTreeRegressor(random_state=42)
decision_tree_model.fit(X_train, y_train)

In [370]:
y_pred = decision_tree_model.predict(X_val)

In [371]:
mse = mean_squared_error(y_val, y_pred)
rmse= np.sqrt(mse)
mae= mean_absolute_error(y_val, y_pred)
r_squared = r2_score(y_val, y_pred)

print(f"Mean Squared Error: {mse}")
print(f"Root Mean Squared Error: {rmse}")
print(f"Mean Absolute Error: {mae}")
print(f"R-Squared: {r_squared}")

Mean Squared Error: 9.67712289739015
Root Mean Squared Error: 3.1108074349580286
Mean Absolute Error: 2.4337293924532313
R-Squared: -0.6006243221329359


# Random Forest Regressor

In [372]:
from sklearn.ensemble import RandomForestRegressor

In [373]:
rf_model = RandomForestRegressor(random_state=42)
rf_model.fit(X_train,y_train)


y_pred = rf_model.predict(X_val)

mse = mean_squared_error(y_val, y_pred)
rmse= np.sqrt(mse)
mae= mean_absolute_error(y_val, y_pred)
r_squared = r2_score(y_val, y_pred)

print(f"Mean Squared Error: {mse}")
print(f"Root Mean Squared Error: {rmse}")
print(f"Mean Absolute Error: {mae}")
print(f"R-Squared: {r_squared}")

Mean Squared Error: 9.235121954926365
Root Mean Squared Error: 3.0389343452806554
Mean Absolute Error: 2.390187313072879
R-Squared: -0.527516078451953


Traditional regression methods, including linear regression, decision tree regression, and random forest regression, were initially considered for analyzing stock returns. However, these models provded unsuitable for several reasons.

1) The dataset's complex composition, with multiple sectors and clusters, introduced non-linearity that regression models could not effectively capture.

2) The inherent characteristics of stock returns and the interplay of various factors violated key assumptions of linear regression, rendering it inappropriate for the data.

3) Linear regresson, decision tree and random forest regression models does not fit the data due to negative R-squared values, limiting their ability to generalize unseen data effectively.

4) I decided to identify meaningful clusters within the data for which agglomerative clustering is suited for. This offers an approach towards a more cluster-centric analysus which aligns with the dataset's natural structure.

# Agglomerative Clustering Algorithm (unsupervised learning)

The agglomerative clustering algortithm (unsupervised learning) allows me to group stocks (companies) into clusters based on their price changes; cluster that performs better and the other cluster that performs less. I found that each sector was divided into 2 clusters that can be analyzed further to understand how the stocks within those 2 clusters differ in terms of price changes.

Dendogram, while initially considered, had to be discarded due to not providing a clear and interpretable representation of the data in favour of bar plots with standard error.

In [442]:
# Create separate dataframes for each sector
pri_sector_df = combined_df[combined_df['sector_Primary Sector'] == 1]
sec_sector_df = combined_df[combined_df['sector_Secondary Sector'] == 1]
ter_sector_df = combined_df[combined_df['sector_Tertiary Sector'] == 1]

In [507]:
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

silhouette_scores = []
range_n_clusters = range(2,11)
for n_clusters in range_n_clusters:
    agg_cluster= AgglomerativeClustering(n_clusters=n_clusters, linkage='ward')
    cluster_labels= agg_cluster.fit_predict(pri_sector_df)
    silhouette_avg = silhouette_score(pri_sector_df, cluster_labels)
    silhouette_scores.append(silhouette_avg)
    
optimal_n_clusters = range_n_clusters[silhouette_scores.index(max(silhouette_scores))]

print("Optimal No of clusters for primary sector", optimal_n_clusters)

Optimal No of clusters for primary sector 2


In [508]:
silhouette_avg = silhouette_score(pri_sector_df, cluster_labels)
print(f"Silhouette Score: {silhouette_avg:.2f}")

Silhouette Score: 0.40


In [509]:
silhouette_scores = []
range_n_clusters = range(2,11)
for n_clusters in range_n_clusters:
    agg_cluster= AgglomerativeClustering(n_clusters=n_clusters, linkage='ward')
    cluster_labels= agg_cluster.fit_predict(sec_sector_df)
    silhouette_avg = silhouette_score(sec_sector_df, cluster_labels)
    silhouette_scores.append(silhouette_avg)
    
optimal_n_clusters = range_n_clusters[silhouette_scores.index(max(silhouette_scores))]

print("Optimal No of clusters for secondary sector", optimal_n_clusters)

Optimal No of clusters for secondary sector 2


In [510]:
silhouette_avg = silhouette_score(sec_sector_df, cluster_labels)
print(f"Silhouette Score: {silhouette_avg:.2f}")

Silhouette Score: 0.32


In [511]:
silhouette_scores = []
range_n_clusters = range(2,11)
for n_clusters in range_n_clusters:
    agg_cluster= AgglomerativeClustering(n_clusters=n_clusters, linkage='ward')
    cluster_labels= agg_cluster.fit_predict(ter_sector_df)
    silhouette_avg = silhouette_score(ter_sector_df, cluster_labels)
    silhouette_scores.append(silhouette_avg)
    
optimal_n_clusters = range_n_clusters[silhouette_scores.index(max(silhouette_scores))]

print("Optimal No of clusters for tertiary sector", optimal_n_clusters)

Optimal No of clusters for tertiary sector 2


In [512]:
silhouette_avg = silhouette_score(ter_sector_df, cluster_labels)
print(f"Silhouette Score: {silhouette_avg:.2f}")

Silhouette Score: 0.34


In [514]:
import time
start_time = time.time()
n_clusters = 2
agg_cluster = AgglomerativeClustering(n_clusters=n_clusters, linkage='ward')
pri_cluster_labels = agg_cluster.fit_predict(pri_sector_df)
training_time = time.time() - start_time

pri_sector_df_copy = pri_sector_df.copy()
pri_sector_df_copy['Cluster']= pri_cluster_labels
print(f"Training time: {training_time:.2f} seconds")

Training time: 0.00 seconds


In [475]:
# Create a box plot for each cluster
fig = px.box(pri_sector_df_copy, x='Cluster', y='percent_change_next_weeks_price')

# Set labels
fig.update_xaxes(title_text= 'Cluster')
fig.update_yaxes(title_text='Percent Change in Stock Price')
fig.update_layout(title='Primary Sector Stock Price by Cluster')

# Show the plot
fig.show()


In [515]:
import time
start_time = time.time()
n_clusters = 2
agg_cluster = AgglomerativeClustering(n_clusters=n_clusters, linkage='ward')

sec_cluster_labels = agg_cluster.fit_predict(sec_sector_df)
training_time = time.time() - start_time

sec_sector_df_copy = sec_sector_df.copy()
sec_sector_df_copy['Cluster']= sec_cluster_labels
print(f"Training time: {training_time:.2f} seconds")

Training time: 0.01 seconds


In [476]:
# Create a box plot for each cluster
fig = px.box(sec_sector_df_copy, x='Cluster', y='percent_change_next_weeks_price')

# Set labels
fig.update_xaxes(title_text= 'Cluster')
fig.update_yaxes(title_text='Percent Change in Stock Price')
fig.update_layout(title='Secondary Sector Stock Price by Cluster')
# Show the plot
fig.show()

In [516]:
import time
start_time = time.time()
n_clusters = 2
agg_cluster = AgglomerativeClustering(n_clusters=n_clusters, linkage='ward')

ter_cluster_labels = agg_cluster.fit_predict(ter_sector_df)
training_time = time.time() - start_time

ter_sector_df_copy = ter_sector_df.copy()
ter_sector_df_copy['Cluster']= ter_cluster_labels
print(f"Training time: {training_time:.2f} seconds")

# Create a box plot for each cluster
fig = px.box(ter_sector_df_copy, x='Cluster', y='percent_change_next_weeks_price')

# Set labels
fig.update_xaxes(title_text= 'Cluster')
fig.update_yaxes(title_text='Percent Change in Stock Price')
fig.update_layout(title='Tertiary Sector Stock Price by Cluster')
# Show the plot
fig.show()

Training time: 0.00 seconds


In [494]:
import plotly.subplots as sp
import plotly.graph_objects as go

fig = sp.make_subplots(rows=1, cols=3, subplot_titles=("Primary Sector", "Secondary Sector", "Tertiary Sector"))
fig.add_trace(go.Box(x=pri_sector_df_copy['Cluster'], y=pri_sector_df_copy['percent_change_next_weeks_price'], name='Primary'), row=1, col =1)
fig.add_trace(go.Box(x=sec_sector_df_copy['Cluster'], y=sec_sector_df_copy['percent_change_next_weeks_price'], name='Secondary'), row=1, col =2)
fig.add_trace(go.Box(x=ter_sector_df_copy['Cluster'], y=ter_sector_df_copy['percent_change_next_weeks_price'], name='Tertiary'), row=1, col =3)

fig.update_xaxes(title_text='Cluster', row=1, col=1)
fig.update_xaxes(title_text='Cluster', row=1, col=2)
fig.update_xaxes(title_text='Cluster', row=1, col=3)
fig.update_yaxes(title_text='Percent Change in Stock Price', row=1, col=1)
fig.update_layout(title_text='Sector Stock Price by Cluster')
fig.show()

# ANOVA and Post-hoc tests

In [505]:
# ANOVA Test among clusters in each sector
from scipy.stats import f_oneway

cluster_0_pri = pri_sector_df_copy[pri_sector_df_copy['Cluster'] == 0]['percent_change_next_weeks_price']
cluster_1_pri = pri_sector_df_copy[pri_sector_df_copy['Cluster'] == 1]['percent_change_next_weeks_price']

cluster_0_sec = sec_sector_df_copy[sec_sector_df_copy['Cluster'] == 0]['percent_change_next_weeks_price']
cluster_1_sec = sec_sector_df_copy[sec_sector_df_copy['Cluster'] == 1]['percent_change_next_weeks_price']

cluster_0_ter = ter_sector_df_copy[ter_sector_df_copy['Cluster'] == 0]['percent_change_next_weeks_price']
cluster_1_ter = ter_sector_df_copy[ter_sector_df_copy['Cluster'] == 1]['percent_change_next_weeks_price']

f_statistic, p_value = f_oneway(cluster_0_pri,cluster_1_pri)
if p_value < 0.05:
    print("Primary Sector: Reject the null hypothesis. There are significant differences.")
else:
    print("Primary Sector: Null hypothesis accepted. No significant differences.")
print(f_statistic)

f_statistic, p_value = f_oneway(cluster_0_sec,cluster_1_sec)
if p_value < 0.05:
    print("Secondary Sector: Reject the null hypothesis. There are significant differences.")
else:
    print("Secondary Sector: Null hypothesis accepted. No significant differences.")
print(f_statistic) 

f_statistic, p_value = f_oneway(cluster_0_ter,cluster_1_ter)
if p_value < 0.05:
    print("Tertiary Sector: Reject the null hypothesis. There are significant differences.")
else:
    print("Tertiary Sector: Null hypothesis accepted. No significant differences.")
print(f_statistic)


Primary Sector: Null hypothesis accepted. No significant differences.
0.21135205925053208
Secondary Sector: Reject the null hypothesis. There are significant differences.
27.516224104792183
Tertiary Sector: Null hypothesis accepted. No significant differences.
0.005346724905993331


In [499]:
from statsmodels.stats.multicomp import MultiComparison

mc = MultiComparison(sec_sector_df_copy['percent_change_next_weeks_price'], sec_sector_df_copy['Cluster'])
result = mc.tukeyhsd()
print(result)

Multiple Comparison of Means - Tukey HSD, FWER=0.05
group1 group2 meandiff p-adj  lower  upper  reject
--------------------------------------------------
     0      1  -1.6463   0.0 -2.2636 -1.029   True
--------------------------------------------------


# Findings

Whats the take away from this?

In the primary sector, cluster 0 have a higher average percent change than cluster 1 which has one outlier below the bar, suggesting that cluster 0 performs better. Again, this indicates that primary sector is an excellent choice for investors seeking higher returns which suggests the potential for greater short-term gains.

In the secondary sector, cluster 0 has higher average percent change and has one outlier above the bar, indicating a potentially significant positive performance. In comparison, cluster 1 has more outliers with lower values, indicating higher volatility and potentially worse performance overall.

In the tertiary sector, cluster 1 has slightly higher average percent change and has outliers both above and below the bar. Cluster 0, on the other hand, has more extreme outliers in both directions, indicating it as a more volatile cluster. This, however, aligns with the idea that tertiary sector is a good choice for investors seeking a more stable or safe investment option.


ANOVA:
    a) Primary Sector: Null hypothesis accepted. No significant differences among clusters.
    b) Secondary Sector: Reject the null hypothesis. There are significant differences among clusters, so there might be distinct price change patterns.
    c) Tertiary Sector: Null hypothesis accepted. No significant differences among clusters.
    
   There are significant differences in stock price changes among clusters within the secondary sector, suggesting distinct patterns or variations in stock price changes within said sector compared to the primary and tertiary sectors.
   
Post-Hoc (Tukey's HSD) Test:
    The test revealed that cluster 0 and cluster 1 in the secondary sector have significantly different means in terms of price changes. This confirms that these 2 clusters show distinct price change patterns.
    
Interpretation: The significant differences identified in the secondary sector suggests that there are subgroups of stocks within this sector that perform differently in terms of price changes. The signficant differences in the means of secondary sector's cluster 0 and cluster 1 suggests that there are subsets of stocks that have varying price change dynamics. This implies that some stocks or companies in the secondary sector may experience more significant price fluctuation than others, resulting in different average returns.

Referring back to previous findings, the secondary sector had the lowest average returns of 25.8% among the three sectors, which aligns with its observed highest volatility indicated by its larger standard deviation and range of returns. 

Moreover, the secondary sector's observed highest volatility aligns with the presence of more outliers with lower outliers below its cluster 1, suggesting potential worse performance, as indicated by barplot. 

Furthermore, the secondary sector's greater volatility suggests that investments in this sector is riskier, as returns can be more unpredictable. This is a bad choice to invest, as it may lead to lower returns. Thus, investors are likely to avoid making investment in this sector.

Given the significant differences within the secondary sector, investors should cautiously assess the specific clusters and their risk-return profiles within this sector. Some stocks or companies in cluster 0 may offer higher returns with a potential for positive performance, but other in cluster 1, while capable of stable returns, come with higher volatility.

Therefore, investors with different risk tolerances might choose to invest selectively within the secondary sector based on their objectives. Some may seek stocks or companies with slight potential for higher returns and positive performance, while others may prefer to invest in cluster 1, which, despite its high volatility, offers the possibility of more stable returns.