# COGS 108 - Project Proposal

# Names

- Aditya Saini
- Saumya Sadh
- Ojasvi Tewari
- Natalia Avanni
- Mariella Avanni

# Research Question

How can machine learning models be used to predict short-term stock price movements and provide real-time investment recommendations tailored to different risk profiles?
OR
How can machine learning models analyze real-time market data, sentiment, and stock performance to provide actionable investment recommendations tailored to different risk profiles without relying on price predictions?



## Background and Prior Work


Investment strategies that rely heavily on price predictions face significant challenges because of market volatility and the unpredictable nature of financial markets and data. To address these challenges, machine learning models are increasingly used not only to predict prices but also to analyze real-time data such as market trends, risk exposure, and sentiment shifts, offering investors actionable guidance tailored to their risk profiles.

In their paper “Picking Funds with Confidence,” Timmermann et al. (2023) introduce a Fund Confidence Set (FCS) approach that identifies high-performing mutual funds through a series of pairwise performance comparisons. This method avoids direct predictions of future returns and instead focuses on eliminating underperforming funds based on real-time risk-adjusted metrics. By continuously adapting to changing economic conditions, the model ensures that selected funds remain relevant and competitive. This dynamic approach highlights the importance of "evolving market states and risk conditions in shaping fund performance" (Timmermann et al., p. 6).

Machine learning research has also demonstrated the value of integrating technical and sentiment analysis for decision-making in financial markets. In “Stock Market Prediction Using Machine Learning,” Parmar et al. (2018) describe how regression and LSTM models are traditionally used for prediction. However, they emphasize that the foundation of any reliable model lies in understanding historical market variables such as trading volume and price fluctuations, rather than depending solely on forecasting (Parmar et al., p. 3). The study underlines that maintaining data integrity and focusing on diverse indicators like volatility and news sentiment can improve a model's ability to provide timely investment insights.

Building on these findings, our project aims to create a decision-support system that analyzes market data to generate investment recommendations. By leveraging both risk-adjusted performance models and real-time data streams, our system would adapt to users’ risk tolerance and portfolio needs. We emphasize dynamic data integration to guide decision-making without speculating on exact future price levels. This framework seeks to improve investor confidence and stability in uncertain market environments.



# Hypothesis


We hypothesize that there is a strong relationship between market data analysis and improved investment decision-making for different risk profiles. By continuously evaluating sentiment, risk-adjusted performance, and technical indicators, the system will enable more effective investment recommendations without the need for price predictions. This approach is expected to result in more stable portfolio performance across varying market conditions.
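
To make "stable portfolio performance" measurable, one option is to track a risk-adjusted return and a worst-case loss. The sketch below is only an illustration under our own assumptions: the choice of Sharpe ratio and maximum drawdown, the function names, and the sample values are hypothetical and not fixed parts of the project.

```python
import statistics


def sharpe_ratio(returns, risk_free_per_period=0.0):
    """Mean excess return divided by its sample standard deviation.

    Needs at least two returns, and they must not all be identical
    (otherwise the standard deviation is zero).
    """
    excess = [r - risk_free_per_period for r in returns]
    return statistics.mean(excess) / statistics.stdev(excess)


def max_drawdown(portfolio_values):
    """Largest peak-to-trough decline, as a fraction of the running peak."""
    peak = portfolio_values[0]
    worst = 0.0
    for value in portfolio_values:
        peak = max(peak, value)
        worst = max(worst, (peak - value) / peak)
    return worst


# Hypothetical portfolio values, for illustration only.
values = [100.0, 120.0, 90.0, 110.0]
period_returns = [(b - a) / a for a, b in zip(values, values[1:])]

print("Sharpe ratio (per period):", sharpe_ratio(period_returns))
print("Max drawdown:", max_drawdown(values))
```

Under this framing, a lower maximum drawdown and a higher Sharpe ratio across different market regimes would count as evidence for the hypothesis.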

# Data

1. Describe the **ideal** dataset you would want to answer this question. (This should include: What variables? How many observations? Who/what/how would these data be collected? How would these data be stored/organized?)

 **Machine Learning Analysis of S&P 500 Stocks for Investment Recommendations**

 **Objective**

Our research aims to analyze S&P 500 stocks using machine learning models to provide actionable investment recommendations based on real-time market trends, sentiment analysis, and risk-adjusted performance metrics. The dataset will focus on S&P 500 companies, leveraging historical and live data to develop a robust decision-support system.

 **Dataset Components**

The dataset should include:

- **Market Data for S&P 500**  
- **Sentiment Analysis Data**  
- **Macroeconomic Indicators**  
- **Risk & Portfolio Management Metrics**  
- **Company Equity Analysis**, including:
  - Capital Asset Pricing Model (CAPM)
  - Price-to-Earnings (P/E) Ratio
  - Return on Equity (ROE)
- **Volume Trends**  
- **Financial Performance Metrics**, including:
  - Revenue  
  - Net Income  
  - Earnings Per Share (EPS)  
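
The equity metrics listed above can be derived from a few raw fields. The following is a minimal sketch using textbook definitions; the function names and input figures are hypothetical, and the EPS formula is simplified (it ignores preferred dividends and share-count weighting).

```python
def capm_expected_return(risk_free_rate, beta, market_return):
    """CAPM: E[R_i] = R_f + beta_i * (E[R_m] - R_f)."""
    return risk_free_rate + beta * (market_return - risk_free_rate)


def earnings_per_share(net_income, shares_outstanding):
    """Simplified EPS: net income divided by shares outstanding."""
    return net_income / shares_outstanding


def price_to_earnings(share_price, eps):
    """P/E ratio: share price divided by earnings per share."""
    return share_price / eps


def return_on_equity(net_income, shareholder_equity):
    """ROE: net income divided by shareholders' equity."""
    return net_income / shareholder_equity


# Made-up figures for a hypothetical company (not real data).
net_income = 500.0          # in millions
shares_outstanding = 100.0  # in millions
shareholder_equity = 2500.0 # in millions
share_price = 75.0

eps = earnings_per_share(net_income, shares_outstanding)
print("EPS:", eps)
print("P/E:", price_to_earnings(share_price, eps))
print("ROE:", return_on_equity(net_income, shareholder_equity))
print("CAPM expected return:", capm_expected_return(0.03, 1.2, 0.08))
```

Computing these ourselves from filings, rather than relying on a vendor's precomputed values, would keep the definitions consistent across all companies in the dataset.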

 **Data Sources**

To ensure data accuracy and comprehensiveness, we will use the following sources:

- **S&P 500 Market Data**:  
  - [Yahoo Finance API](https://www.yahoofinanceapi.com/)  
  - [Alpha Vantage](https://www.alphavantage.co/)  
  - [Quandl](https://www.quandl.com/)  

- **News & Sentiment Analysis**:  
  - [Twitter API](https://developer.twitter.com/en/docs/twitter-api)  
  - [NewsAPI](https://newsapi.org/)  
  - [FinBERT](https://huggingface.co/ProsusAI/finbert) for NLP-based sentiment analysis  

- **Economic Indicators**:  
  - [FRED (Federal Reserve Economic Data)](https://fred.stlouisfed.org/)  
  - [International Monetary Fund (IMF)](https://www.imf.org/en/Data)  
  - [World Bank Economic Data](https://databank.worldbank.org/)  

- **Risk & Portfolio Analysis**:  
  - [Morningstar](https://www.morningstar.com/)  
  - [SEC EDGAR Filings](https://www.sec.gov/edgar.shtml)  

 **Size and Structure**
- **Historical Data**: Minimum of **5 years** of historical market data  
- **Live Data**: Real-time streaming for up-to-date analysis  
- **Storage Formats**:  
  - **CSV** for easy processing and sharing  
  - **MySQL** or **MongoDB** for structured and efficient querying  

This structured dataset will serve as the foundation for training machine learning models to generate dynamic investment insights.



2. Search for potential **real** datasets that could provide you with something useful for this project.  You do not have to find every piece of data you will use, but you do need to have demonstrated some idea that (a) this data is gettable and (b) that this data may be different from what your ideal is.

 **S&P 500 Historical Data Sources**

For our research on analyzing S&P 500 stocks using machine learning models, we have identified the following Kaggle datasets that provide extensive historical data:

 **Available Datasets**

 1. [S&P 500 Historical Data (1927-2020)](https://www.kaggle.com/datasets/henryhan117/sp-500-historical-data)
- Contains nearly a century of S&P 500 stock data from **1927 to 2020**.
- Includes key financial metrics and market performance indicators.
- Useful for long-term trend analysis and historical market performance.

 2. [Recent S&P 500 Stock Data](https://www.kaggle.com/datasets/camnugent/sandp500)
- A **smaller, more recent** dataset focused on modern S&P 500 stock data.
- Ideal for short-term trend analysis and modern market insights.

 3. [S&P 500 Daily Close Price (1986-2018)](https://www.kaggle.com/datasets/pdquant/sp500-daily-19862018)
- Contains daily **closing prices** of the S&P 500 Index from **1986 to 2018**.
- Useful for time-series analysis and modeling stock price fluctuations.

 **Integration in Research**

These datasets will serve as the foundation for:
- **Machine Learning Models** to analyze historical market trends.
- **Sentiment and Risk Analysis** for investment decision-making.
- **Feature Engineering** to extract relevant insights for stock predictions.

We plan to combine these datasets with **real-time market data** from sources like Yahoo Finance, Alpha Vantage, and Quandl to enhance our predictive capabilities.
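
As an illustration of the feature engineering step, daily closing prices can be turned into returns and a rolling volatility estimate. This is a sketch under our own assumptions: the window length, the 252-trading-day annualization convention, and the sample prices are hypothetical choices, not values taken from the datasets above.

```python
import math
import statistics


def simple_returns(closes):
    """Period-over-period percentage change of a closing-price series."""
    return [(curr - prev) / prev for prev, curr in zip(closes, closes[1:])]


def rolling_volatility(returns, window):
    """Sample standard deviation of returns over each trailing window."""
    if window < 2:
        raise ValueError("window must be at least 2")
    return [
        statistics.stdev(returns[end - window:end])
        for end in range(window, len(returns) + 1)
    ]


def annualize(daily_volatility, trading_days=252):
    """Scale a daily volatility to a yearly figure (square-root-of-time rule)."""
    return daily_volatility * math.sqrt(trading_days)


# Hypothetical closing prices, for illustration only.
closes = [100.0, 102.0, 101.0, 103.0, 104.0, 102.0]
returns = simple_returns(closes)
vols = rolling_volatility(returns, window=3)

print("Daily returns:", returns)
print("Rolling 3-day volatility:", vols)
print("Latest annualized volatility:", annualize(vols[-1]))
```

Features like these describe market conditions without forecasting a price level, which fits the second framing of our research question.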



# Ethics & Privacy

- Thoughtful discussion of ethical concerns included
- Ethical concerns consider the whole data science process (question asked, data collected, data being used, the bias in data, analysis, post-analysis, etc.)
- How your group handled bias/ethical concerns clearly described

Acknowledge and address any ethics & privacy related issues of your question(s), proposed dataset(s), and/or analyses. Use the information provided in lecture to guide your group discussion and thinking. If you need further guidance, check out [Deon's Ethics Checklist](http://deon.drivendata.org/#data-science-ethics-checklist). In particular:

- **Are there any biases/privacy/terms of use issues with the data you proposed?**  
Financial datasets often contain historical biases, such as overrepresentation of certain sectors or market conditions, that may not generalize well to future periods. This could lead to recommendations that unfairly favor specific industries or companies.

Sentiment scores from news and social media may be influenced by misinformation, biased reporting, or exaggerated public reactions, leading to skewed analysis. Additionally, models used for financial prediction are often **“black boxes”**: the companies that develop them tend to keep their internals proprietary to protect intellectual property. This opacity reduces trust in the system and limits accountability.

There are also ethical concerns when models do not allow for transparency in decision-making. If biases exist in the dataset, they could negatively impact companies or industries by leading to incorrect or skewed recommendations. To mitigate such effects, there must be **regular updates, monitoring, and model adjustments** to ensure fairness and accuracy.

- **Are there potential biases in your dataset(s), in terms of who it composes, and how it was collected, that may be problematic in terms of it allowing for equitable analysis?**  
Most financial market data is publicly available, but certain **investor behavior data** (e.g., trade logs, institutional fund movements) may have **privacy concerns**, especially if sourced from individual accounts. 

Sentiment analysis using data from platforms like **Twitter and Reddit** raises privacy issues since it involves analyzing individual user-generated content. These platforms may also **amplify extreme opinions** rather than providing a balanced view of market sentiment. Additionally, the **underrepresentation of minority-owned firms or smaller businesses** in market indices like the S&P 500 could introduce bias in analysis, disproportionately focusing on large-cap stocks.

- **How will you set out to detect these specific biases before, during, and after/when communicating your analysis?**  
1. **Before Analysis:** Conduct exploratory data analysis (EDA) to **identify sector overrepresentation, missing data patterns, and biases** in the dataset. Use statistical tests to assess bias in financial metrics across industries and company sizes.
2. **During Analysis:** Apply fairness-aware machine learning techniques, such as re-weighting underrepresented data points, debiasing sentiment analysis models (e.g., adjusting for extreme opinions), and running sensitivity analysis to test model robustness.
3. **After Analysis:** Communicate uncertainty in findings, use **explainable AI (XAI) techniques** to increase transparency, and regularly audit predictions against real-world market performance. We will also **monitor model impact** to ensure it does not inadvertently contribute to market distortions, similar to the **Quant Meltdown of 2007**, where over-reliance on quantitative models led to significant financial losses.
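
The sector check and the re-weighting step described above could look roughly like the following. This is a minimal sketch: the inverse-frequency weighting scheme, the function names, and the sector labels are our assumptions for illustration, and other re-weighting schemes would be equally valid.

```python
from collections import Counter


def sector_shares(sectors):
    """Fraction of rows belonging to each sector."""
    counts = Counter(sectors)
    total = len(sectors)
    return {sector: n / total for sector, n in counts.items()}


def inverse_frequency_weights(sectors):
    """One weight per row, chosen so every sector has equal total weight.

    weight = total_rows / (number_of_sectors * rows_in_that_sector)
    """
    counts = Counter(sectors)
    total = len(sectors)
    n_sectors = len(counts)
    return [total / (n_sectors * counts[sector]) for sector in sectors]


# Hypothetical sector labels, one per company row.
sectors = ["Tech", "Tech", "Tech", "Tech", "Energy", "Utilities"]

print("Sector shares:", sector_shares(sectors))
print("Row weights:", inverse_frequency_weights(sectors))
```

In this toy example the four Tech rows each receive a weight of 0.5 while the single Energy and Utilities rows each receive 2.0, so every sector contributes the same total weight during training.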

- **Are there any other issues related to your topic area, data, and/or analyses that are potentially problematic in terms of data privacy and equitable impact?**  
1. **Data Privacy Concerns:** Financial data aggregation and sentiment analysis from social media could pose privacy risks if individual users' posts are analyzed without consent.
2. **Market Manipulation Risks:** If the model's investment recommendations gain widespread adoption, they could impact stock movements, leading to artificial demand or market distortions.
3. **Regulatory and Ethical Compliance:** The use of financial data must align with **SEC regulations** and **data protection laws** (e.g., GDPR, CCPA) to ensure compliance.

- **How will you handle issues you identified?**  
  - **Regular Audits:** Implement periodic audits of model predictions and fairness metrics to detect emerging biases.
  - **Explainable AI (XAI):** Utilize interpretable machine learning techniques (e.g., SHAP, LIME) to make financial recommendations more transparent.
  - **Ethical Oversight:** Establish an **ethical review framework** to assess the fairness and impact of investment recommendations, ensuring no unfair advantages are given to certain companies or investor groups.
  - **Data Privacy Safeguards:** Aggregate sentiment data in an anonymized manner, following data protection best practices to prevent privacy violations.
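
One possible form of the anonymized aggregation is sketched below: only per-ticker, per-day averages are kept, and user identifiers are never copied into the output. The record fields (`user`, `ticker`, `date`, `score`) and the `min_posts` suppression threshold are hypothetical; the real schema will depend on the data source.

```python
from collections import defaultdict
from statistics import mean


def aggregate_sentiment(posts, min_posts=3):
    """Average sentiment per (ticker, date), dropping user identifiers.

    Groups with fewer than `min_posts` posts are suppressed so that a
    published average cannot be traced back to one or two individuals.
    """
    buckets = defaultdict(list)
    for post in posts:
        # Only the score is retained; the "user" field is ignored.
        buckets[(post["ticker"], post["date"])].append(post["score"])
    return {
        key: {"mean_score": mean(scores), "n_posts": len(scores)}
        for key, scores in buckets.items()
        if len(scores) >= min_posts
    }


# Hypothetical posts with made-up sentiment scores in [-1, 1].
posts = [
    {"user": "u1", "ticker": "AAPL", "date": "2024-01-02", "score": 0.5},
    {"user": "u2", "ticker": "AAPL", "date": "2024-01-02", "score": -0.1},
    {"user": "u3", "ticker": "AAPL", "date": "2024-01-02", "score": 0.2},
    {"user": "u4", "ticker": "MSFT", "date": "2024-01-02", "score": 0.9},
]

for key, summary in aggregate_sentiment(posts).items():
    print(key, summary)
```

Here the single MSFT post is suppressed because it falls below the threshold, while the three AAPL posts are reduced to one average and a count.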


# Team Expectations 


Read over the [COGS108 Team Policies](https://github.com/COGS108/Projects/blob/master/COGS108_TeamPolicies.md) individually. Then, include your group’s expectations of one another for successful completion of your COGS108 project below. Discuss and agree on what all of your expectations are. Discuss how your team will communicate throughout the quarter and consider how you will communicate respectfully should conflicts arise. By including each member’s name above and by adding their name to the submission, you are indicating that you have read the COGS108 Team Policies, accept your team’s expectations below, and have every intention to fulfill them. These expectations are for your team’s use and benefit — they won’t be graded for their details.

* *Commitment to Deadlines and Contributions*   
  Each team member agrees to complete their assigned tasks on time and contribute equally to the project. Members are expected to communicate early if they encounter difficulties meeting deadlines, allowing the team to reassign tasks or provide support to ensure project progress is not hindered.

* *Effective Communication*  
  The team will maintain regular communication through agreed platforms (e.g., group chats, email, or weekly virtual meetings). Updates, feedback, and concerns should be shared openly and respectfully. In case of conflicts, the team will hold a dedicated meeting to discuss and resolve issues constructively, ensuring that all viewpoints are heard.

* *Quality and Accountability*  
  All members will ensure that their contributions meet the project's quality standards. Members are expected to review their work for accuracy and completeness before submission. The team will perform peer reviews to maintain high standards, holding each other accountable while offering constructive feedback to improve the final deliverable.


# Project Timeline Proposal

Specify your team's specific project timeline. An example timeline has been provided. Change the dates, times, names, and details to fit your group's plan.

If you think you will need any special resources or training outside what we have covered in COGS 108 to solve your problem, then your proposal should state these clearly. For example, if you have selected a problem that involves implementing multiple neural networks, please state this so we can make sure you know what you’re doing and so we can point you to resources you will need to implement your project. Note that you are not required to use outside methods.



| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 1/20  |  1 PM | Read & Think about COGS 108 expectations; brainstorm topics/questions  | Determine best form of communication; Discuss and decide on final project topic; discuss hypothesis; begin background research | 
| 1/26  |  10 AM |  Do background research on topic | Discuss ideal dataset(s) and ethics; draft project proposal | 
| 2/1  | 10 AM  | Edit, finalize, and submit proposal; Search for datasets  | Discuss Wrangling and possible analytical approaches; Assign group members to lead each specific part   |
| 2/14  | 6 PM  | Import & Wrangle Data (Aditya); EDA (Ojas) | Review/Edit wrangling/EDA; Discuss Analysis Plan   |
| 2/23  | 12 PM  | Finalize wrangling/EDA; Begin Analysis (Aditya; Ojas) | Discuss/edit Analysis; Complete project check-in |
| 3/13  | 12 PM  | Complete analysis; Draft results/conclusion/discussion (Saumya, Natalia, Mariella)| Discuss/edit full project |
| 3/20  | Before 11:59 PM  | NA | Turn in Final Project & Group Project Surveys |