# COGS 118B - Final Project

# Insert title here

## Group members

- Griffin Barros-King
- Alex Vo
- Xinyi Zhang
- Ryan Wong
- Immanuel Chai


# Abstract 
Unsupervised machine learning presents a viable approach to optimizing stock portfolios by analyzing risk-return profiles without explicit guidance or predefined labels. This research explores the effectiveness of unsupervised learning algorithms for categorizing stocks by their risk-return variability. Using historical stock market data, such as daily prices and volatility, together with financial ratios, we identify correlations that can inform portfolio optimization. By applying clustering techniques and dimensionality reduction methods, we aim to group stocks into distinct categories that reflect their risk-return dynamics. The performance of this approach will be evaluated by comparing the overall stability of the resulting portfolios with that of portfolios built using traditional portfolio optimization methods. This study also aims to provide insights into the applicability of unsupervised learning in financial portfolio management.



# Background

Stock trading involves buying and selling financial instruments such as stocks, bonds, or derivatives with the goal of profiting from fluctuations in their prices. A stock portfolio is a collection of such financial assets held by an investor, constructed to achieve a specific investment objective such as maximizing returns, minimizing risk, or balancing the two. Unsupervised machine learning techniques, such as clustering or dimensionality reduction, can assist in portfolio construction by grouping assets with similar risk-return profiles or by reducing the dimensionality of the feature space while preserving relevant information. This can aid in the creation of diversified portfolios that balance risk and return effectively.

The application of unsupervised machine learning to stock portfolio optimization has been gaining attention because of its potential to outperform traditional statistical methods in portfolio management. Pattern recognition, anomaly detection, and feature extraction applied to historical stock market data can reveal shifts in patterns or sudden changes in the data, giving traders and portfolio managers signals they can act on.

Prior research supports the use of machine learning methods on historical financial data. In one study, K-Means clustering was used to reduce the number of loan risk categories in order to minimize business losses <a name="ieee"></a>[<sup>[1]</sup>](#ieeenote). The approach predicted loan risk with high accuracy, which suggests that similar clustering could help refine stock portfolios. Unsupervised machine learning methods can also extract meaningful features from raw financial data, such as technical indicators, sentiment from news articles, or fundamental ratios from financial statements <a name="oleh"></a>[<sup>[2]</sup>](#olehnote). These extracted features can be used as inputs to predictive models for forecasting stock prices or estimating future returns. By leveraging these methods, investors can gain deeper insight into financial markets, enhance their trading strategies, and potentially achieve better risk-adjusted returns than they would by relying only on traditional portfolio optimization methods.



# Problem Statement

We aim to optimize stock portfolios according to an investor's risk-return preferences. An individual investor often does not know how to optimize a portfolio because the abundance of financial advice online promotes conflicting investing styles. One machine learning solution is to use a clustering technique such as K-Means or a Gaussian Mixture Model to group stocks by their risk-return profiles. The problem is measurable because the resulting portfolios can be scored with the Sharpe ratio and the Compound Annual Growth Rate (CAGR). The experiment can be easily reproduced because our dataset is large: it contains daily data for the stocks and ETFs on the NASDAQ exchange up until 2020.

# Data

1. 
- The dataset we are using is a stock market dataset that contains price and volume metrics for every stock and ETF on the NASDAQ exchange for every trading day from 11/08/1999 to 04/01/2020.
- Link to dataset: https://www.kaggle.com/datasets/jacksoncrow/stock-market-dataset
- The dataset has 7 variables and 7450 observations (per stock/ETF; 5884 stocks and 2165 ETFs).
- Each observation consists of the date along with the opening price, closing price, high price, low price, adjusted closing price, and volume traded on that day.
- To measure volatility in each stock, we will likely look at the difference between the high and low prices to see how much a stock can change in a given day. We will also look at how the opening and closing prices change over a given set of observations; prices that rise consistently over time imply a good return.
- The dataset is already relatively clean. We may want to normalize the prices so that volatility and returns can be compared between stocks more easily in terms of percent change. We will also likely replace the stock tickers with the official names for readability.
2. 
- Another dataset we may want would contain similar historical data but with multiple observations per day. This would be more useful for analyzing volatility on a shorter-term basis.
- It would contain the same variables but more observations, since each day would be sampled multiple times.
- It would also allow us to look at more informative volatility metrics than daily highs and lows, such as the average price change per hour (or per whatever interval the intraday observations use).
3. 
- A dataset that could be useful for non-analysis purposes would be a list of each stock and ETF on the NASDAQ exchange with its ticker and its official name.
- This would consist of 2 variables and 8049 observations (one for each stock/ETF).
- The variables would simply be the ticker and the official name of each stock/ETF.
- This dataset would ideally already be clean and would be used to clean up our other datasets and improve readability in any tables or graphs that we present in the final project.
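
As a rough illustration of the volatility and return measures described for the first dataset, the sketch below computes an intraday high-low range and a day-over-day percent change for a single stock. The column names (`Open`, `High`, `Low`, `Close`, `Adj Close`, `Volume`) are assumed to match the per-ticker CSV files, the helper `add_daily_features` is hypothetical, and the prices are invented toy values.

```python
import pandas as pd

# Toy data standing in for one per-ticker CSV; all values are invented.
prices = pd.DataFrame(
    {
        "Date": pd.to_datetime(["2020-03-30", "2020-03-31", "2020-04-01"]),
        "Open": [100.0, 102.0, 101.0],
        "High": [104.0, 103.0, 105.0],
        "Low": [99.0, 100.0, 100.0],
        "Close": [102.0, 101.0, 104.0],
        "Adj Close": [102.0, 101.0, 104.0],
        "Volume": [1000, 1200, 900],
    }
).set_index("Date")


def add_daily_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with two extra per-day columns (hypothetical helper)."""
    out = df.copy()
    # Intraday range as a fraction of the opening price: a simple volatility proxy.
    out["range_pct"] = (out["High"] - out["Low"]) / out["Open"]
    # Day-over-day percent change of the adjusted close; the first row has no prior day.
    out["return_pct"] = out["Adj Close"].pct_change(fill_method=None)
    return out


features = add_daily_features(prices)
print(features[["range_pct", "return_pct"]])
```

Expressing both quantities as fractions of price, rather than in dollars, is what makes them comparable between a \$5 stock and a \$500 stock.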


# Proposed Solution


To address the problem of optimizing stock portfolios using unsupervised machine learning techniques, we propose the following solution:

Data preprocessing: 

- Clean the dataset: remove unnecessary data and handle missing values to keep the data consistent.
- Normalize the prices: convert the prices into percentage changes so that volatility and returns can be compared easily across different stocks.
- Calculate additional features describing each stock, for example price volatility and average price change.
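
The preprocessing steps above could be sketched as follows. This is a minimal example under stated assumptions, not the project's final pipeline: the input is assumed to be a wide table of adjusted closing prices with one column per ticker, the 10% missing-data threshold and the 252 trading days per year are illustrative choices, and `risk_return_features` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

TRADING_DAYS = 252  # assumed number of trading days per year


def risk_return_features(adj_close: pd.DataFrame, max_missing: float = 0.1) -> pd.DataFrame:
    """Summarize each ticker by annualized mean return and volatility."""
    # Cleaning: drop tickers whose fraction of missing prices exceeds the threshold.
    kept = adj_close.loc[:, adj_close.isna().mean() <= max_missing]
    # Fill remaining gaps with the last known price.
    kept = kept.ffill()
    # Normalizing: convert price levels into daily percent changes.
    returns = kept.pct_change(fill_method=None).dropna(how="all")
    # Feature engineering: one row per ticker.
    return pd.DataFrame(
        {
            "ann_return": returns.mean() * TRADING_DAYS,
            "ann_volatility": returns.std() * np.sqrt(TRADING_DAYS),
        }
    )


# Toy prices for three made-up tickers.
dates = pd.date_range("2020-01-01", periods=5, freq="B")
adj_close = pd.DataFrame(
    {
        "AAA": [10.0, 10.1, 10.2, 10.1, 10.3],
        "BBB": [50.0, np.nan, 52.0, 49.0, 51.0],
        "CCC": [np.nan, np.nan, np.nan, 5.0, 5.1],
    },
    index=dates,
)
stock_features = risk_return_features(adj_close, max_missing=0.25)
print(stock_features)
```

In this toy run `CCC` is dropped because 60% of its prices are missing, while `BBB` is kept and forward-filled.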

Unsupervised Learning:

- Utilize clustering techniques such as K-Means or Gaussian Mixture Models to categorize stocks based on their risk-return profiles.
- Implement dimensionality reduction methods like Principal Component Analysis (PCA) to reduce the feature space while preserving relevant information.
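
A minimal sketch of this step, assuming a per-stock feature matrix is already available: the features are standardized, projected onto two principal components, and then clustered with both K-Means and a Gaussian Mixture Model from scikit-learn. The data here is synthetic (three well-separated groups of 30 made-up "stocks"), and the choice of three clusters and two components is purely illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic feature matrix: 3 groups x 30 stocks, 4 invented features per stock.
centers = np.array(
    [
        [0.02, 0.10, 0.5, 1.0],
        [0.10, 0.40, 1.5, 2.0],
        [0.25, 0.80, 3.0, 0.5],
    ]
)
X = np.vstack([c + 0.02 * rng.standard_normal((30, 4)) for c in centers])

# Standardize so that no single feature dominates the distance computations.
X_scaled = StandardScaler().fit_transform(X)

# Reduce the 4 features to 2 principal components.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

# Hard cluster assignments from K-Means.
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_2d)

# Assignments from a Gaussian Mixture Model, which also allows elliptical clusters.
gmm_labels = GaussianMixture(n_components=3, random_state=0).fit_predict(X_2d)

print("explained variance ratio:", pca.explained_variance_ratio_)
print("K-Means cluster sizes:", np.bincount(kmeans_labels))
```

On real stock features the number of clusters would need to be chosen from the data rather than fixed in advance.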

Benchmark Model:

- Compare the performance of our unsupervised learning-based approach with traditional portfolio optimization methods.

Portfolio Optimization:

- Construct portfolios by selecting stocks from each cluster to achieve desired risk-return profiles.
- Evaluate portfolio performance using metrics such as Sharpe ratio, cumulative returns, and portfolio volatility.
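
One possible way to turn cluster labels into a portfolio is sketched below. The selection rule (pick the stock with the highest return-to-volatility ratio in each cluster, then weight the picks equally) is an assumption made for illustration, not a rule fixed by this proposal; the helper `pick_per_cluster`, the tickers, and all numbers are hypothetical.

```python
import pandas as pd


def pick_per_cluster(features: pd.DataFrame, labels) -> list:
    """Pick one ticker per cluster: the one with the best return/volatility ratio."""
    scored = features.assign(
        cluster=labels,
        score=features["ann_return"] / features["ann_volatility"],
    )
    # idxmax returns the index label (ticker) of the best score in each cluster.
    return scored.groupby("cluster")["score"].idxmax().tolist()


# Invented per-stock features and cluster labels.
features = pd.DataFrame(
    {
        "ann_return": [0.05, 0.08, 0.20, 0.15],
        "ann_volatility": [0.10, 0.10, 0.40, 0.50],
    },
    index=["AAA", "BBB", "CCC", "DDD"],
)
labels = [0, 0, 1, 1]
picks = pick_per_cluster(features, labels)

# Invented daily returns for the same tickers.
daily_returns = pd.DataFrame(
    {
        "AAA": [0.00, 0.01],
        "BBB": [0.01, -0.02],
        "CCC": [0.03, 0.00],
        "DDD": [-0.01, 0.02],
    }
)

# Equal-weight portfolio: average the daily returns of the selected tickers.
portfolio_returns = daily_returns[picks].mean(axis=1)
print(picks)
print(portfolio_returns)
```

Drawing one stock from every cluster is what provides the diversification: each pick represents a different risk-return profile.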

Implementation:
- Utilize Python libraries such as pandas, NumPy, and scikit-learn for data preprocessing, feature engineering, and model implementation.
- For clustering and dimensionality reduction, use scikit-learn's implementations of K-Means and PCA.
- Visualize results using matplotlib or seaborn to provide insights into portfolio composition and performance.

Testing:
- Split the dataset into training and testing sets to evaluate model performance.
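
Because the data is a time series, one reasonable way to split it is chronologically, so that clusters are fit on earlier dates and portfolios are evaluated on later ones. The sketch below assumes that approach; the helper `chronological_split`, the split date, and the toy prices are all hypothetical.

```python
import pandas as pd


def chronological_split(df: pd.DataFrame, split_date: str):
    """Split a date-indexed frame into rows before and on/after split_date."""
    cutoff = pd.Timestamp(split_date)
    train = df.loc[df.index < cutoff]
    test = df.loc[df.index >= cutoff]
    return train, test


# Six business days of invented prices for one made-up ticker.
dates = pd.date_range("2019-12-27", periods=6, freq="B")
prices = pd.DataFrame({"AAA": [10.0, 10.1, 10.2, 10.1, 10.3, 10.4]}, index=dates)

train, test = chronological_split(prices, "2020-01-01")
print(len(train), "training rows;", len(test), "testing rows")
```

A random shuffle would leak future prices into the training set, which is why the split is made by date here.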

# Evaluation Metrics


Total return and Compound Annual Growth Rate (CAGR):

- $\text{Total return} = \frac{P_f}{P_i} - 1$
- $\text{CAGR} = \left( \frac{P_f}{P_i} \right)^{\frac{1}{n}} - 1$
where $P_i$ is the initial investment value, $P_f$ is the final investment value, and $n$ is the number of years in the investment period.
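
As a small worked example of the growth metrics, the sketch below computes total return (final value divided by initial value, minus one) and CAGR for an invented investment that grows from 10,000 to 14,641 over 4 years. The function names are illustrative.

```python
def total_return(p_i: float, p_f: float) -> float:
    """Overall fractional gain from initial value p_i to final value p_f."""
    return p_f / p_i - 1


def cagr(p_i: float, p_f: float, n_years: float) -> float:
    """Compound annual growth rate over n_years."""
    return (p_f / p_i) ** (1 / n_years) - 1


# Invented example: 10,000 grows to 14,641 in 4 years.
print(f"total return: {total_return(10_000, 14_641):.4f}")  # 0.4641
print(f"CAGR:         {cagr(10_000, 14_641, 4):.4f}")       # 0.1000
```

A 46.41% total return over four years corresponds to 10% growth compounded annually, since $1.1^4 = 1.4641$.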

Sharpe ratio:

- $\text{Sharpe ratio} = \frac{R_p - R_f}{\sigma_p}$
where $R_p$ is the average return of the portfolio, $R_f$ is the risk-free rate of return, and $\sigma_p$ is the standard deviation of the portfolio's excess return.

For the benchmark model and the solution model, we can calculate the Sharpe ratio for each portfolio constructed using their respective optimization methods. We then compare the Sharpe ratios to assess which model generates portfolios with superior risk-adjusted returns.
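
The comparison could be carried out along the following lines. This is a sketch under assumptions: returns are daily, the risk-free rate is quoted annually and spread evenly over 252 trading days, and the ratio is annualized by multiplying by $\sqrt{252}$. The two return series are invented and do not represent results from either model.

```python
import numpy as np


def sharpe_ratio(returns, risk_free_rate: float = 0.0, periods_per_year: int = 252) -> float:
    """Annualized Sharpe ratio from per-period returns (assumed daily by default)."""
    # Excess return over the per-period risk-free rate.
    excess = np.asarray(returns, dtype=float) - risk_free_rate / periods_per_year
    # Mean excess return divided by its sample standard deviation, then annualized.
    return float(np.sqrt(periods_per_year) * excess.mean() / excess.std(ddof=1))


# Invented daily returns for two hypothetical portfolios.
benchmark_returns = [0.010, 0.020, -0.010, 0.000]
solution_returns = [0.010, 0.012, 0.008, 0.010]

benchmark_sharpe = sharpe_ratio(benchmark_returns)
solution_sharpe = sharpe_ratio(solution_returns)
print(f"benchmark: {benchmark_sharpe:.2f}, solution: {solution_sharpe:.2f}")
```

In this toy case the second series has a higher Sharpe ratio because its returns are steadier, even though the best single day belongs to the first series.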

# Results




# Discussion

### Interpreting the result


### Limitations


### Ethics & Privacy

One of the main ethical issues we see is a strong bias toward American markets. Since we only have data from the NASDAQ exchange, any useful information or model we produce may not be very effective in foreign markets. This could unfairly benefit people who invest in American markets over those who do not or cannot. This could be mitigated by finding datasets that cover markets other than the NASDAQ. We could also test a few other markets to see whether the conclusions we reach for American markets carry over to them. If so, we could show that the benefit of any model we develop extends to other markets.

Another potential issue is unfairness between those who have the technology to run whatever model we develop and those who do not. While the world has seen a great increase in access to technology, there are still large differences in the quality of the technology people can access. Because the models we develop will run on large datasets and may become quite complex, they may be more beneficial to those who can run such simulations faster on better hardware. This could create an unfair financial advantage for people who are likely to already be better off financially, since they have access to better technology.

Another potential issue is that if the model is effective enough at predicting and managing stocks, it could lead to job losses in the financial sector. One way to mitigate both this issue and the previous one would be to restrict the use of our model to research purposes, thereby preventing anyone from gaining an unfair market advantage from it.

There are no data privacy issues with our datasets. All the information we use is publicly available, as required by law, and the dataset contains no personally identifiable information (PII).


### Conclusion


# Footnotes

<a name="ieeenote"></a>1.[^](#ieee): S. H. Sudjono, F. H. Adrian, C. A. Sunarya, G. F. Ariyanto and N. T. M. Sagala, "Comparison of Different Machine Learning Algorithms for Predicting Loan Risk Categories," 2023 International Conference on Computer Science, Information Technology and Engineering (ICCoSITE), Jakarta, Indonesia, 2023, pp. 773-778, doi: 10.1109/ICCoSITE57641.2023.10127758.

<a name="olehnote"></a>2.[^](#oleh): Oleh Onyshchak. (2020). Stock Market Dataset [Data set]. Kaggle. https://doi.org/10.34740/KAGGLE/DSV/1054465



