# COGS 118B - Project Proposal

# Project Description

You will design and execute a machine learning project. There are a few constraints on the nature of the allowed project. 
- The problem addressed will not be a "toy problem" or "common training students problem" like mtcars, iris, palmer penguins etc.
- The dataset will have >1k observations and >5 variables. I'd prefer more like >10k observations and >10 variables. A general rule is that if you have >100x more observations than variables, your solution will likely generalize a lot better. The goal of training an unsupervised machine learning model is to learn the underlying pattern in a dataset in order to generalize well to unseen data, so choosing a large dataset is very important.

- The project must include some elements of unsupervised learning, but you are welcome to include some supervised or other learning approaches as well.
- The project will include a model selection and/or feature selection component where you will be looking for the best setup to maximize the performance of your ML system.
- You will evaluate the performance of your ML system using more than one appropriate metric
- You will be writing a report describing and discussing these accomplishments


Feel free to delete this description section when you hand in your proposal.

# Names


- Griffin Barros-King
- Alex Vo
- Johan Cruyff
- Roberto Carlos
- Franz Beckenbaur

# Abstract 
Unsupervised machine learning offers a viable approach to optimizing stock portfolios by analyzing risk-return profiles without explicit guidance or predefined labels. This project explores how effectively unsupervised learning algorithms can categorize stocks by their risk-return variability. Using historical stock market data, such as daily prices and volatility, together with financial ratios, we aim to identify correlations that can inform portfolio optimization. By applying clustering techniques and dimensionality reduction methods, we aim to group stocks into distinct categories that reflect their risk-return dynamics. We will evaluate the approach by comparing the overall stability of the resulting portfolios against portfolios built with traditional optimization methods. The study also aims to provide insight into the applicability of unsupervised learning to financial portfolio management.

# Background

Stock trading involves buying and selling financial instruments such as stocks, bonds, or derivatives with the goal of profiting from fluctuations in their prices. A stock portfolio refers to the collection of such financial assets held by an investor, carefully constructed to achieve specific investment objectives, such as maximizing returns, minimizing risk, or balancing the two. Unsupervised machine learning techniques, such as clustering or dimensionality reduction, can assist in portfolio construction by grouping assets with similar risk-return profiles or by reducing the dimensionality of the feature space while preserving relevant information. This can aid in the creation of diversified portfolios that balance risk and return effectively.

The application of unsupervised machine learning methods to stock portfolio optimization has been gaining considerable attention because of its potential to outperform traditional statistical methods in portfolio management. Prior research supports the use of machine learning methods on historical financial data. For example, K-Means clustering was applied to loan risk categories with the aim of minimizing business losses <a name="ieee"></a>[<sup>[1]</sup>](#ieeenote), and it reportedly achieved high accuracy in predicting loan risk, which suggests that similar methods could help refine stock portfolios. Unsupervised machine learning methods can also extract meaningful features from raw financial data, such as technical indicators, sentiment from news articles, or fundamental ratios from financial statements <a name="oleh"></a>[<sup>[2]</sup>](#olehnote). These extracted features can be used as inputs to predictive models for forecasting stock prices or estimating future returns. By leveraging these methods, investors can gain deeper insight into financial markets, enhance their trading strategies, and ultimately achieve better risk-adjusted returns than they would by relying only on traditional portfolio optimization methods.
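
To make the idea of grouping assets by risk-return profile concrete, here is a minimal sketch using scikit-learn. It is an illustration, not the project's final pipeline: the feature matrix is synthetic, the four feature columns are hypothetical, and the choices of two principal components and two clusters are assumptions made only for this toy example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic per-stock features (NOT real market data). Hypothetical columns:
# mean daily return, return volatility, mean intraday range, mean log volume.
rng = np.random.default_rng(0)
low_risk = rng.normal(
    loc=[0.0003, 0.010, 0.015, 13.0],
    scale=[0.0001, 0.002, 0.003, 0.5],
    size=(50, 4),
)
high_risk = rng.normal(
    loc=[0.0010, 0.040, 0.060, 11.0],
    scale=[0.0003, 0.005, 0.008, 0.5],
    size=(50, 4),
)
features = np.vstack([low_risk, high_risk])

# Standardize so that no single feature dominates the distance computation.
scaled = StandardScaler().fit_transform(features)

# Reduce the feature space to two principal components.
pca = PCA(n_components=2, random_state=0)
components = pca.fit_transform(scaled)

# Group the stocks into k clusters (k=2 is an arbitrary choice for this toy data).
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(components)

print("explained variance ratio:", pca.explained_variance_ratio_)
print("cluster sizes:", np.bincount(labels))
```

In a real run, the number of clusters and the number of components would be chosen during model selection rather than fixed in advance.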

# Problem Statement

Clearly describe the problem that you are solving. Avoid ambiguous words. The problem described should be well defined and should have at least one ML-relevant potential solution. Additionally, describe the problem thoroughly such that it is clear that the problem is quantifiable (the problem can be expressed in mathematical or logical terms), measurable (the problem can be measured by some metric and clearly observed), and replicable (the problem can be reproduced and occurs more than once).

# Data

1. 
- The dataset we are using is a stock market dataset that contains price and volume metrics for every stock and ETF on the NASDAQ exchange for every day from 11/08/1999 to 04/01/2020.
- Link to dataset: https://www.kaggle.com/datasets/jacksoncrow/stock-market-dataset
- The dataset has 7 variables and 7450 observations per stock/ETF (5884 stocks and 2165 ETFs).
- Each observation consists of the date along with the opening price, closing price, high price, low price, adjusted closing price, and volume traded on that day.
- To measure each stock's volatility, we will likely look at the difference between the daily high and low prices, which shows how much a stock can move in a single day. To measure return, we will track how the opening and closing prices change over a given window of observations; prices that rise consistently over time imply a good return.
- The dataset is already relatively clean. The main transformation we may want is to normalize prices so that volatility and returns are expressed as percent changes, which makes them easier to compare between stocks. We will also likely replace the stock tickers with the official names for readability.
2. 
- Another dataset we may want is similar historical data with multiple observations per day, which would be more useful for analyzing volatility on a shorter-term basis.
- It would contain the same variables but more observations, since each day would be sampled several times.
- It would also let us compute more informative volatility metrics than daily highs and lows, such as the average price change per hour (or per whatever interval separates the intraday observations).
3. 
- A dataset that could be useful for non-analysis purposes is a list of each stock and ETF on the NASDAQ exchange with its ticker symbol and official name.
- It would consist of 2 variables and 8049 observations (one for each stock/ETF).
- The variables would simply be the ticker symbol and the official name of each stock/ETF.
- This dataset would ideally already be clean and would be used to clean up our other datasets and improve readability in any tables or graphs that we present in the final project.
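
The volatility and return measures described in the first dataset, and the ticker-to-name lookup described in the third, could be computed roughly as follows with pandas. This is a sketch under stated assumptions: the price table is a tiny made-up example, the tickers and names are invented, and the column names (`Open`, `High`, `Low`, `Adj Close`) are assumed to mirror the variables listed above rather than taken from the actual files.

```python
import pandas as pd

# Tiny synthetic price history for two made-up tickers (NOT real data).
# Column names are assumptions that mirror the variables described above.
prices = pd.DataFrame({
    "Ticker": ["AAA"] * 3 + ["BBB"] * 3,
    "Date": pd.to_datetime(["2020-01-02", "2020-01-03", "2020-01-06"] * 2),
    "Open": [10.0, 10.5, 11.0, 50.0, 48.0, 52.0],
    "High": [10.8, 11.2, 11.5, 55.0, 53.0, 56.0],
    "Low": [9.9, 10.4, 10.9, 45.0, 44.0, 49.0],
    "Adj Close": [10.5, 11.0, 11.55, 48.0, 52.0, 49.4],
})
prices = prices.sort_values(["Ticker", "Date"])

# Intraday range as a fraction of the opening price: a scale-free
# volatility proxy that can be compared across differently priced stocks.
prices["range_pct"] = (prices["High"] - prices["Low"]) / prices["Open"]

# Day-over-day percent change in adjusted close, computed within each ticker.
prices["daily_return"] = prices.groupby("Ticker")["Adj Close"].pct_change()

# One row per ticker: average return, return volatility, average intraday range.
features = prices.groupby("Ticker").agg(
    mean_return=("daily_return", "mean"),
    return_volatility=("daily_return", "std"),
    mean_range_pct=("range_pct", "mean"),
)

# Hypothetical ticker-to-name lookup standing in for the third dataset.
names = pd.DataFrame({
    "Ticker": ["AAA", "BBB"],
    "Name": ["Example Corp A", "Example Fund B"],
})
features = features.reset_index().merge(names, on="Ticker", how="left")
print(features)
```

Expressing both measures as fractions of price is what makes them comparable between a $10 stock and a $50 stock, which is the purpose of the normalization step mentioned above.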



# Proposed Solution

In this section, clearly describe a solution to the problem. The solution should be applicable to the project domain and appropriate for the dataset(s) or input(s) given. Provide enough detail (e.g., algorithmic description and/or theoretical properties) to convince us that your solution is applicable. Why might your solution work? Make sure to describe how the solution will be tested.  

If you know details already, describe how (e.g., library used, function calls) you plan to implement the solution in a way that is reproducible.

If it is appropriate to the problem statement, describe a benchmark model<a name="sota"></a>[<sup>[3]</sup>](#sotanote) against which your solution will be compared. 

# Evaluation Metrics

Propose at least one evaluation metric that can be used to quantify the performance of both the benchmark model and the solution model. The evaluation metric(s) you propose should be appropriate given the context of the data, the problem statement, and the intended solution. Describe how the evaluation metric(s) are derived and provide an example of their mathematical representations (if applicable). Complex evaluation metrics should be clearly defined and quantifiable (can be expressed in mathematical or logical terms).

# Ethics & Privacy

One of the main ethical issues we see is a strong bias towards American markets. Since we only have data from the NASDAQ exchange, any information or model we produce may not be effective in foreign markets. This could unfairly benefit people who invest in American markets over those who don't or can't. We could mitigate this by finding datasets that cover markets other than the NASDAQ. We could also test a few other markets to see whether the conclusions we reach for American markets carry over. If they do, we may be able to show that the benefit of any model we develop extends to other markets.

Another potential issue is the gap between those who have the technology to run whatever model we develop and those who don't. While the world has seen a great increase in access to technology, there are still large differences in the quality of the technology people can access. Because our models will be trained on large datasets and may become quite complex, they may be more beneficial to those who can run similar simulations faster on better hardware. This could create an unfair financial advantage for people who are already more likely to be financially better off, since they have access to better technology.

A further potential issue is that if the model is effective enough at predicting and managing stocks, it could lead to job losses in the financial sector. One way to mitigate both this issue and the previous one would be to restrict the use of our model to research purposes, thereby preventing any market advantage from being gained by those who would unfairly benefit from the model.

We do not see any data privacy issues with our datasets. All the information we use is publicly available and must be by law, and the dataset contains no personally identifiable information (PII).

# Team Expectations 

* Mandatory weekly meetings on Tuesdays
* Main form of communication through Discord
* Even split of workload/parts
* Complete assigned parts at least a day before official deadline for any group discussion/review
* Midweek group check-ins to see how everyone is doing
* If any conflicts arise or any help is needed, message the group chat on Discord

# Project Timeline Proposal

Replace this with something meaningful that is appropriate for your needs. It doesn't have to be something that fits this format.  It doesn't have to be set in stone... "no battle plan survives contact with the enemy". But you need a battle plan nonetheless, and you need to keep it updated so you understand what you are trying to accomplish, who's responsible for what, and what the expected due dates are for each item.

| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 1/20  |  1 PM |  Brainstorm topics/questions (all)  | Determine best form of communication; Discuss and decide on final project topic; discuss hypothesis; begin background research | 
| 1/26  |  10 AM |  Do background research on topic (Pelé) | Discuss ideal dataset(s) and ethics; draft project proposal | 
| 2/1  | 10 AM  | Edit, finalize, and submit proposal; Search for datasets (Beckenbaur)  | Discuss Wrangling and possible analytical approaches; Assign group members to lead each specific part   |
| 2/14  | 6 PM  | Import & Wrangle Data, do some EDA (Maradonna) | Review/Edit wrangling/EDA; Discuss Analysis Plan   |
| 2/23  | 12 PM  | Finalize wrangling/EDA; Begin programming for project (Cruyff) | Discuss/edit project code; Complete project |
| 3/13  | 12 PM  | Complete analysis; Draft results/conclusion/discussion (Carlos)| Discuss/edit full project |
| 3/19  | Before 11:59 PM  | NA | Turn in Final Project  |

# Footnotes

<a name="ieeenote"></a>1.[^](#ieee): S. H. Sudjono, F. H. Adrian, C. A. Sunarya, G. F. Ariyanto and N. T. M. Sagala, "Comparison of Different Machine Learning Algorithms for Predicting Loan Risk Categories," 2023 International Conference on Computer Science, Information Technology and Engineering (ICCoSITE), Jakarta, Indonesia, 2023, pp. 773-778, doi: 10.1109/ICCoSITE57641.2023.10127758.

<a name="olehnote"></a>2.[^](#oleh): Oleh Onyshchak. (2020). Stock Market Dataset [Data set]. Kaggle. https://doi.org/10.34740/KAGGLE/DSV/1054465


Template:

<a name="sotanote"></a>3.[^](#sota): Perhaps the current state of the art solution such as you see on [Papers with code](https://paperswithcode.com/sota). Or maybe not SOTA, but rather a standard textbook/Kaggle solution to this kind of problem<br>
<a name="lorenznote"></a>4.[^](#lorenz): Lorenz, T. (9 Dec 2021) Birds Aren’t Real, or Are They? Inside a Gen Z Conspiracy Theory. *The New York Times*. https://www.nytimes.com/2021/12/09/technology/birds-arent-real-gen-z-misinformation.html<br>
<a name="admonishnote"></a>5.[^](#admonish): Also refs should be important to the background, not some randomly chosen vaguely related stuff. Include a web link if possible in refs as above.