# Business and data understanding
------------

The initial phase covers the tasks needed to define the business objectives and translate them into ML objectives, to collect the data and verify its quality, and finally to assess the project feasibility.

![](https://i.imgur.com/55J7fBc.jpeg)

## Terminology

----------

### The tasks

Compile a glossary of terminology relevant to the project. This may include two components:
(1) A glossary of relevant business terminology, which forms part of the business understanding available to the project. Constructing this glossary is a useful "knowledge elicitation" and education exercise.
(2) A glossary of machine learning terminology, illustrated with examples relevant to the business problem in question.

### The output

**Business terminology:**

* **Dividend Yield**: A financial ratio that shows how much a company pays out in dividends each year relative to its stock price.

* **Earnings Per Share (EPS)**: The portion of a company's profit allocated to each outstanding share of common stock.

* **Market Capitalization**: The total market value of a company's outstanding shares, calculated as stock price multiplied by the total number of shares outstanding.

* **Sentiment Analysis**: The process of computationally identifying and categorizing opinions expressed in text to determine whether the writer's attitude towards a particular topic is positive, negative, or neutral.

* **Arbitrage**: The simultaneous purchase and sale of the same or equivalent assets in different markets to profit from price differences.

* **Quantitative Easing (QE)**: A monetary policy whereby a central bank buys government securities or other securities from the market to increase the money supply and encourage lending and investment.

* **Volatility**: The degree of variation of a trading price series over time, usually measured by the standard deviation of returns.

* **The Global Industry Classification Standard (GICS)**: A method for assigning each company to the economic sector and industry group that best describe its business operations.

* **Bull Market**: A financial market in which prices are rising or are expected to rise.
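
Several of the business terms above are simple ratios, so a short sketch can make them concrete. The function names are hypothetical and the figures are made up for illustration; none of them come from the project dataset.

```python
import statistics


def dividend_yield(annual_dividend_per_share: float, share_price: float) -> float:
    """Annual dividends per share relative to the current share price."""
    return annual_dividend_per_share / share_price


def earnings_per_share(net_income: float, shares_outstanding: float) -> float:
    """Profit allocated to each outstanding share (preferred dividends ignored here)."""
    return net_income / shares_outstanding


def market_capitalization(share_price: float, shares_outstanding: float) -> float:
    """Share price multiplied by the number of shares outstanding."""
    return share_price * shares_outstanding


def volatility(prices: list[float]) -> float:
    """Sample standard deviation of simple period-over-period returns."""
    returns = [(curr - prev) / prev for prev, curr in zip(prices, prices[1:])]
    return statistics.stdev(returns)


# Illustrative numbers only.
print(dividend_yield(2.0, 50.0))               # 0.04, i.e. a 4% yield
print(earnings_per_share(1_000_000, 250_000))  # 4.0
print(market_capitalization(50.0, 250_000))    # 12500000.0
print(round(volatility([100.0, 102.0, 101.0, 105.0]), 4))
```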

**ML-terminology:**

* **Feature:** An individual measurable property or characteristic of a phenomenon being observed, such as stock price, volume, or sentiment score.

*  **Label**: The outcome variable that is being predicted or classified, such as stock price movement or sentiment classification.

* **Training Set:** A subset of the dataset used to train machine learning models.

* **Test Set**: A subset of the dataset used to evaluate the performance of trained machine learning models.

* **Validation Set**: A subset of the dataset used to tune the hyperparameters of the model.

* **Regression:** The task of predicting a continuous value, such as the future stock price.

* **Exploratory data analysis (EDA)**: An approach to analyzing data sets to summarize their main characteristics, often using statistical graphics and other data visualization methods.

* **Model:** A mathematical representation of a process, built by training on data.

* **ML inference:** The process of running data points through a trained machine learning model to calculate an output, such as a single numerical score.

* **Overfitting**: When a model learns the details and noise in the training data to the extent that it performs poorly on new data.

If further business terminology comes up during exploratory data analysis and model building, its definitions and explanations will be added to this glossary.
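
To make the training, validation, and test set definitions above concrete, here is a minimal sketch of a chronological three-way split. The 70/15/15 proportions and the synthetic rows are assumptions for illustration, not project settings; the split is done without shuffling because the data is a time series.

```python
def chronological_split(rows, train_pct=70, val_pct=15):
    """Split time-ordered rows into train/validation/test parts without shuffling.

    The percentages are illustrative assumptions, not project settings.
    """
    n = len(rows)
    train_end = n * train_pct // 100
    val_end = n * (train_pct + val_pct) // 100
    return rows[:train_end], rows[train_end:val_end], rows[val_end:]


# Each row pairs a feature (opening price) with a label (closing price).
rows = [(100.0 + day, 100.5 + day) for day in range(20)]
train, val, test = chronological_split(rows)
print(len(train), len(val), len(test))  # 14 3 3
```

Keeping the most recent rows for the test set means the model is always evaluated on data that comes after everything it was trained on.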

## Scope of the project
----------

### The tasks
- Explore the background of the business.
- Define business problem
- Define business objectives
- Translate business objectives into ML objectives

The objective here is to thoroughly understand, from a business perspective, what the client really wants to accomplish. Often the client has many competing objectives and constraints that must be properly balanced. The goal is to uncover important factors, at the beginning, that can influence the outcome of the project. A possible consequence of neglecting this step is to expend a great deal of effort producing the right answers to the wrong questions.

### The output

#### 1. Background
  >X is an investment bank. They offer a wide range of services, including investment banking, securities trading, wealth management, and asset management. Despite their strong market position, X is currently struggling with the volatility and unpredictability of financial markets.

#### 2. Business problem
  >X Investment Group is facing significant challenges in navigating the increasingly volatile and unpredictable financial markets. The firm needs to enhance its predictive accuracy for stock prices to optimize trading strategies and mitigate risks. Traditional models have proven inadequate in processing and analyzing the vast amounts of real-time data generated by the markets, leading to suboptimal trading decisions and missed opportunities. To maintain its competitive edge and improve financial performance, X requires a sophisticated machine learning model that can accurately predict closing stock prices and provide actionable insights for better decision-making.

#### 3. Business objectives
  >What is the impact of market sentiment on stock prices?
  
  >How do corporate earnings announcements affect stock prices?

#### 4. ML objectives
  > Predict the closing prices of stocks using historical market data, trade volumes, and relevant news data.


## Success Criteria
-------------

### The tasks
- Describe the success criteria of the ML project on three different levels: the business success criteria, the ML success criteria and the economic success criteria.

### The output. Success Criteria
#### 1. Business success criteria
- Increase user engagement and satisfaction by 10% within the next 6 months
- Improve investment performance and outperform the market by 2% within a year


#### 2. ML success criteria
- Improve prediction accuracy so that the prediction error is below 30%
- Achieve a predictive precision of at least 70% for daily stock price predictions.
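
The criteria above do not name the error metric. As a sketch, assuming the error is measured as mean absolute percentage error (MAPE), the 30% threshold could be checked as follows; the price values are illustrative only.

```python
def mean_absolute_percentage_error(actual, predicted):
    """Average of |actual - predicted| / |actual|, expressed in percent."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


MAX_ERROR_PCT = 30.0  # threshold taken from the ML success criterion above

actual_close = [100.0, 200.0, 50.0]     # illustrative values
predicted_close = [110.0, 180.0, 55.0]  # illustrative values

mape = mean_absolute_percentage_error(actual_close, predicted_close)
print(f"MAPE = {mape:.1f}%")
print("criterion met" if mape < MAX_ERROR_PCT else "criterion not met")
```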

#### 3. Economic success criteria
- Increase the owners' profit by 5% within a month and have the model deliver a positive return on investment
- Deliver operational efficiency and integrate the model into existing workflows without significant delays





## Data collection
----------

### The tasks
- Specify the data sources
- Collect the data
- Version control on the data

### The output

#### Data collection report:

**Data Source:** Stock-NewsEventsSentiment (SNES) 1.0 is a dataset of market and news time series data for S&P 500 companies, which X collected over a period of 21 months, from October 2020 to July 2022.

**Data Type:** The data consists of numerical values, categorical data, and dates. Numerical values represent the stock prices during the day (Open, High, Low, Close), the date is the date of the deal, and categorical values represent the industry sector.

**Data Size:** The dataset contains more than 200,000 records with 27 features each.

**Data Collection Method:** News articles were aggregated from various financial news sources, and natural language processing techniques were used to extract sentiment from the articles.

#### Data version control report:

**Data Version:** The current data version is v1.1, which was updated on June 21, 2024.

**Data Change Log:** The data change log shows that the date-of-the-deal feature was updated on June 21, 2024, to encode time as a categorical feature.

**Data Backup:** The company has a daily backup of the data stored on a PC.

**Data Archiving:** The company archives data older than five years to a cloud storage service for long-term retention.

**Data Access Control:** The company uses role-based access control to ensure that only developers and employees can access data.
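
One lightweight way to tie a version label such as v1.1 to the exact file contents is to record a checksum next to each change-log entry. The sketch below uses SHA-256 from the standard library on a throwaway file with a hypothetical name; it is not the team's actual versioning setup, and a dedicated data versioning tool could serve the same purpose.

```python
import hashlib
import tempfile
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo on a throwaway file; "snes_sample.csv" is a hypothetical name.
with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "snes_sample.csv"
    data_file.write_text("Date,Open,Close\n2020-10-01,100.0,101.5\n")
    before = file_sha256(data_file)
    data_file.write_text("Date,Open,Close\n2020-10-01,100.0,101.6\n")
    after = file_sha256(data_file)

print(before != after)  # True: any change in the data changes the checksum
```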

## Data quality verification
--------

### The tasks
- Describe data
- Define data requirements
- Explore the data
- Verify the data quality

### The output

#### 1. Data description

The data acquired for this project includes a dataset of more than 40000 records (without splitting into batches for the data-update simulation) with 27 fields each. The fields include the date of the deal, stock prices during the day (High, Low, Open, Close), the GICS sector of the stock, and article data. The data is in CSV format and is stored in a local database.

The table:

| Column name       | Description                                         |
|-------------------|-----------------------------------------------------|
| Date              | Date of the news event                              |
| Open              | The opening price                                   |
| High              | The highest price of the stock during the day       |
| Low               | The lowest price of the stock during the day        |
| Close             | The price at the end of the day (the closing price) |
| Adj_close         | The adjusted closing price                          |
| Volume            | The number of shares traded                         |
| Symbol            | The stock ticker symbol                             |
| Security          | The name of the security                            |
| GICS Sector       | The Global Industry Classification Standard sector  |
| GICS Sub-Industry | The GICS sub-industry                               |

After those columns, the dataset contains news features grouped by theme: all news volume, volume, positive sentiment, negative sentiment, new products, layoffs, analyst comments, stocks, dividends, corporate earnings, mergers & acquisitions, store openings, product recalls, adverse events, personnel changes, stock rumors. All of these columns are numerical (float64).


#### 2. Data Exploration
An interesting finding from the initial data exploration is the positive correlation between the closing price ("Close") and the opening price ("Open"), the lowest price ("Low"), and the highest price ("High") of the day. This suggests that stocks with higher opening prices also tend to have higher closing prices, lows, and highs throughout the trading day. Conversely, stocks that open lower tend to stay within a lower price range for the day. This initial observation highlights a relationship between a stock's starting point and its overall price movement during a trading session. We built a heatmap for the first 100 entries in the dataset to demonstrate the correlation:

![Correlation](https://raw.githubusercontent.com/paket2004/Stock-market-prediction/images/images/correlation.jpg)
![Correlation](https://raw.githubusercontent.com/paket2004/Stock-market-prediction/images/images/correlation2.jpg)
![Correlation](https://raw.githubusercontent.com/paket2004/Stock-market-prediction/images/images/dependencies.jpg)

However, other features also contribute to the closing price. Our goal is to reduce the percentage error, so we must take all features into account rather than rely only on existing prices, possibly assigning more weight to news or GICS values.
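
The pairwise correlations behind a heatmap like the ones above can be computed directly. The sketch below implements the Pearson correlation with the standard library only and runs it on a few made-up price rows rather than the project data; the column names follow the data description.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up rows for illustration; not taken from the SNES dataset.
columns = {
    "Open":  [100.0, 101.5, 99.0, 103.0, 104.5],
    "High":  [102.0, 102.5, 101.0, 105.0, 106.0],
    "Low":   [99.0, 100.0, 98.0, 102.0, 103.5],
    "Close": [101.0, 100.5, 100.5, 104.5, 105.0],
}

names = list(columns)
matrix = {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Print the correlation matrix row by row.
for a in names:
    print(a.ljust(6), " ".join(f"{matrix[a][b]:5.2f}" for b in names))
```

The same matrix can then be passed to any plotting library to draw the heatmap.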

#### 3. Data requirements



*   Date should be stored in year-month-day format.
*   Open, High, Low, and Close prices should be float numbers (> 0).
*   No feature column should contain more than 5% missing values.
*   Data should contain numerical, categorical, and time features that will be converted or encoded.
*   Data should be stored in CSV format.
*   GICS sector is categorical text data and should correspond to a valid sector, for example: 'Energy', 'Materials', 'Industrials'.
*   News headline is text data, and its length is bounded from 10 to 500 characters.
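
Several of the requirements above can be checked automatically. The following is a minimal sketch, assuming rows are loaded as dictionaries with missing values represented as `None`; the `validate` helper is hypothetical, and the sector set contains only the three examples named above and would need to be extended with the full GICS list.

```python
from datetime import datetime

VALID_SECTORS = {"Energy", "Materials", "Industrials"}  # examples only; extend as needed
MAX_MISSING_FRACTION = 0.05
PRICE_COLUMNS = ("Open", "High", "Low", "Close")


def is_iso_date(value):
    """True if value parses as a year-month-day date string."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except (TypeError, ValueError):
        return False


def validate(rows):
    """Return a list of human-readable violations for a list of dict rows."""
    problems = []
    for i, row in enumerate(rows):
        if not is_iso_date(row.get("Date")):
            problems.append(f"row {i}: Date is not in year-month-day format")
        for col in PRICE_COLUMNS:
            price = row.get(col)
            if price is not None and not (isinstance(price, float) and price > 0):
                problems.append(f"row {i}: {col} must be a float greater than 0")
        sector = row.get("GICS Sector")
        if sector is not None and sector not in VALID_SECTORS:
            problems.append(f"row {i}: unknown GICS sector {sector!r}")
    # Missing-value budget per column.
    columns = {col for row in rows for col in row}
    for col in sorted(columns):
        missing = sum(1 for row in rows if row.get(col) is None)
        if rows and missing / len(rows) > MAX_MISSING_FRACTION:
            problems.append(f"column {col}: {missing / len(rows):.0%} missing exceeds 5%")
    return problems


rows = [
    {"Date": "2020-10-01", "Open": 100.0, "High": 102.0, "Low": 99.0,
     "Close": 101.0, "GICS Sector": "Energy"},
    {"Date": "01/10/2020", "Open": -5.0, "High": 102.0, "Low": 99.0,
     "Close": None, "GICS Sector": "Energy"},
]
for problem in validate(rows):
    print(problem)
```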










#### 4. Data quality verification report

**Completeness:** The data is complete in the sense that it covers all the required cases. It contains the necessary fields, stays within the allowed share of missing values, and follows the required data types and constraints.

**Correctness:** The data appears to be correct, with no obvious errors. However, a manual review is still needed to check for typos.

**Missing Values:** There are missing values in the data, but their share is not significant. A missing-value filling technique was applied to handle them.

Overall, the data quality is high, and the data is suitable for analysis and modeling. Our team is confident that we spent adequate effort evaluating this dataset.

## Project feasibility
-------------
This task involves more detailed fact-finding about all of the resources, constraints, assumptions, and other factors that should be considered in determining the data analysis goal and project plan. In the previous task, the objective was to quickly get to the crux of the situation. Here, the aim is to flesh out the details.

### The tasks
- Assess the project feasibility
- Create POC (Proof-of-concept) model

### The output
#### 1. Inventory of resources

**PERSONNEL:**

>Data Experts: Data scientists and analysts skilled in data cleaning, preprocessing, and feature engineering.

>Machine Learning Personnel: Data scientists and machine learning engineers experienced in building models suitable for the task.

>Business experts: Individuals with domain knowledge in finance and stock market trends.

**DATA:**

>Fixed Extracts: Access to the "Stock-NewsEventsSentiment (SNES) 1.0" dataset (5 different versions).

**COMPUTING RESOURCES:**

>Hardware Platforms: High-performance servers or cloud computing resources, and local machine resources for data processing and model training.

>Storage: Sufficient storage for datasets and backups on local machines.

>Software: ML library documentation and Google Colab for teamwork.

#### 2. Requirements, assumptions and constraints

_The project timeline_:

Phase I - Business and data understanding - week 1

Phase II - Data engineering/preparation - week 2

Phase III - Model engineering - week 3

Phase IV - Model validation - week 4

Phase V - Model deployment - week 5

Phase VI - Model monitoring and maintenance - week 6

**Comprehensibility and quality of results:** Clear, interpretable results and high-quality, accurate sentiment analysis.

**Security and legal issues:** To avoid data leaks, we must ensure that the data is used safely, that it is not stored in publicly accessible repositories, and that our actions comply with company policies and legal regulations.

**Data usage permission:** Confirm permission to use the "Stock-NewsEventsSentiment (SNES) 1.0" dataset for analysis and model development, and review the terms under which the data may be used.

**Assumptions:**

>Data quality: The dataset is assumed to be of high quality, with minimal missing or erroneous values (see the data quality verification report).

>Relevance of sentiment: Treat sentiment analysis as an _important_ factor in the formation of the Close price (the target), so that the model does not rely mostly on the "High" and "Low" prices and produces more accurate predictions.

>Business conditions: Assume that the financial market conditions remain relatively stable during the project duration.

**Constraints:**

>Resource availability: Limited availability of personnel and computing resources, and a shortage of data.

>Data size: Practical constraints on the size of data that can be processed and modeled effectively within the available computing resources. A heavily overloaded database might become a bottleneck in model training, so we need to find a trade-off between underfitting (too little data) and time-consuming training (too much data).

>Technological constraints: Limitations related to software compatibility and integration with existing systems. A model that is too complex and 'heavy' might be too slow or unsuitable for integration into already-built processes.

#### 3. Risks and Contingencies

**Risks:**

>Data Issues: Incomplete, incorrect, or irrelevant data can lead to inaccurate models. Data leaks must be ruled out. Having too little or too much data is also a risk.

>Resource Limitations: Insufficient computational resources or personnel availability.

>Technological Failures: Hardware or software malfunctions during critical phases of the project. Model-related failures also belong here: a wrong choice of model, underfitting or overfitting, and low precision.

>Regulatory Changes: New regulations that restrict data usage or access, which could halt training and force the team to find a new dataset for the project.

**Contingencies:**

>Data Issues: Implement data validation checks and cleaning procedures. Source additional data if necessary.

>Resource Limitations: Prioritize tasks and consider cloud-based solutions to scale resources dynamically.

>Technological Failures: Regular backups and use of reliable infrastructure with redundancy.

>Regulatory Changes: Stay informed about regulatory updates and adjust data usage practices accordingly.

#### 4. Costs and Benefits

**Costs:**

>Computing Resources: Costs of servers for model training.

**Benefits:**

>Enhanced Decision-Making: Improved sentiment analysis can lead to better investment decisions and higher earnings.

>Competitive Advantage: Early identification of market trends through sentiment analysis.

>Cost Savings: Efficient data preprocessing and analysis can reduce the need for manual analysis. For example, we must remove zero values that cannot influence the results, encode categorical features with appropriate encoders, keep the data clean, and drop uninformative features.

>Scalability: Development of a scalable model that can be adapted for real-time sentiment analysis or integrated into real systems.

**Cost-Benefit Analysis:**

![Cost-Benefit_Table](https://raw.githubusercontent.com/paket2004/Stock-market-prediction/images/images/costbenefir.jpg)


We estimate the cost of this project at \$36,640 (see the proof of concept in the next section) and consider it promising. We have budgeted for unforeseen situations and are confident that we can handle them if they arise. An accurate model is worth the effort spent on data work, model building, and process automation, so we expect the project to be profitable. The total cost consists of server rent, working hours, and a reserve for unforeseen situations.

**Potential Benefits:**

- Increased revenue through better investment decisions: \$500,000 annually.
- Operational efficiency savings: \$50,000 annually.
- Intangible benefits such as enhanced reputation and market intelligence.

#### 5. Feasibility report. Proof of Concept model

The plan follows the article linked in the project chat: https://habr.com/ru/companies/ods/articles/438212/

ROI (return on investment) is an indicator of the profitability of the project, equal to the ratio of income to the investment spent.


##### 1) Desires.
Bank X struggles to anticipate the closing prices of stocks in volatile markets. Early, reliable estimates allow proactive measures to be taken, such as optimizing trading strategies and mitigating risks. Currently, closing price prediction relies on manual analysis by financial experts, which is time-consuming and subjective. The bank therefore needs automatically predicted 'Close' values.

Our task: develop a machine learning model to automate the prediction of Close prices for the bank.

Input: readily available financial data points (already present in the provided dataset) for prediction

Output: predicted Close prices

##### 2) Experiment.
Business process: Close prices prediction and prediction automation

ML task: regression

Training data for the experiment: about 300 records, with a feature vector built for each of them.

Technical metrics: MSE, RMSE

Business metrics: prediction accuracy as a percentage (the difference between predicted and actual values, in percent)

##### 3) Data.
The data is provided by bank X, which kept daily records and agreed to share its observations with us. Since this is real data, we do not need to check whether the training data differs from real-world data. We assume that the dataset is well structured, deep, and consistent.

##### 4) Model building.
Feature engineering: prepare the dataset, encode categorical variables, drop some redundant columns.

Model selection: choose the simplest regression model as a baseline.

Model training: train the model on the prepared data, splitting it into training and testing sets.
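
The plan does not name the model beyond "the simplest regression model". As a sketch, assuming ordinary least squares on a single feature (Open) to predict Close, with an 80/20 chronological split, the training step could look like this. The data is synthetic and the function name is hypothetical.

```python
def fit_simple_linear_regression(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic, time-ordered data: Close is Open plus a small alternating offset.
open_prices = [100.0 + 0.5 * day for day in range(50)]
close_prices = [o + (0.3 if day % 2 == 0 else -0.2) for day, o in enumerate(open_prices)]

split = len(open_prices) * 80 // 100  # first 80% for training, the rest for testing
x_train, x_test = open_prices[:split], open_prices[split:]
y_train, y_test = close_prices[:split], close_prices[split:]

slope, intercept = fit_simple_linear_regression(x_train, y_train)
predictions = [slope * x + intercept for x in x_test]

mean_abs_error = sum(abs(p - y) for p, y in zip(predictions, y_test)) / len(y_test)
print(f"slope={slope:.3f} intercept={intercept:.3f} test MAE={mean_abs_error:.3f}")
```

A baseline like this gives a reference error that any more complex model, including one that uses the news features, has to beat.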

##### 5) Model evaluation
Cross-validation: perform cross-validation to detect and limit overfitting.

Metrics assessment: calculate technical metrics (MSE, RMSE) and business metrics (average prediction accuracy within a defined price range) on the testing set.
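
The technical and business metrics named above can be computed as follows. The 2% tolerance band is a hypothetical stand-in for the "defined price range", which the plan does not specify, and the price values are illustrative.

```python
import math


def mse(actual, predicted):
    """Mean squared error."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error, in the same units as the prices."""
    return math.sqrt(mse(actual, predicted))


def share_within_tolerance(actual, predicted, tolerance_pct):
    """Fraction of predictions whose relative error is within tolerance_pct percent."""
    hits = sum(1 for a, p in zip(actual, predicted)
               if abs(a - p) / abs(a) * 100.0 <= tolerance_pct)
    return hits / len(actual)


actual_close = [100.0, 102.0, 98.0, 105.0]     # illustrative values
predicted_close = [101.0, 100.0, 98.0, 110.0]  # illustrative values

print(mse(actual_close, predicted_close))  # (1 + 4 + 0 + 25) / 4 = 7.5
print(round(rmse(actual_close, predicted_close), 3))
print(share_within_tolerance(actual_close, predicted_close, tolerance_pct=2.0))  # 0.75
```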

##### 6) ROI evaluation
Since the bank is in charge of data collection, our model can easily get new data for new predictions. However, we accept the risk that problems with the data may make some predictions less effective and beneficial.

Costs:

- Development costs:
    - Model training and optimization.
    - Integration development with existing systems.
    - Automation development for seamless operation.
- Operational costs:
    - Infrastructure upkeep for model deployment.
    - Model retraining at defined intervals to maintain accuracy.
    - Data access fees or subscription costs (if applicable).

Calculating existing costs:

Assume that manual calculation of the Close price takes 3 hours, server maintenance 2 hours, and data tracking 2 hours. In total, analysts currently spend 7 hours on Close price prediction. The whole process costs the company \$300 per day.

Calculating the project costs:

Assume the project takes 8 hours per week for each of the three team members. Over 6 weeks, the working time is 8 * 6 * 3 = 144 hours. At an hourly rate of \$60, the project cost is \$8,640. In addition, maintenance (server, manual work, and time) costs \$3,000 per month.

ROI calculation:

Over the next 6 months, the current process costs the company 300 * 30 * 6 = \$54,000.

After the model is integrated, the company pays 8640 + 3000 * 6 + 10000 = \$36,640, where 8640 is the project cost, 3000 per month is the maintenance cost, and 10000 is a reserve for unpredictable situations (such as an extended project duration).

Additionally, the company gains more stability in financial markets.

Thus, the ROI is 54000 / 36640 ≈ 1.47, and the payback period is less than 6 months.
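
The arithmetic above can be reproduced in a few lines. The sketch assumes three team members (which yields the stated 144 hours), uses the 3000 per month maintenance figure from the cost formula, and treats the contingency reserve as an upfront cost when estimating the payback period. It also prints the conventional net ROI, (gain - cost) / cost, next to the income-to-investment ratio used in this report.

```python
HOURS_PER_WEEK_PER_MEMBER = 8
WEEKS = 6
TEAM_MEMBERS = 3  # assumption: reproduces the 144 working hours stated above
HOURLY_RATE = 60
MAINTENANCE_PER_MONTH = 3000
CONTINGENCY = 10000
MANUAL_COST_PER_DAY = 300
DAYS_PER_MONTH = 30
HORIZON_MONTHS = 6

working_hours = HOURS_PER_WEEK_PER_MEMBER * WEEKS * TEAM_MEMBERS
project_cost = working_hours * HOURLY_RATE
model_cost = project_cost + MAINTENANCE_PER_MONTH * HORIZON_MONTHS + CONTINGENCY
manual_cost = MANUAL_COST_PER_DAY * DAYS_PER_MONTH * HORIZON_MONTHS

ratio = manual_cost / model_cost                      # ROI as defined in this report
net_return = (manual_cost - model_cost) / model_cost  # conventional net ROI

# Payback: upfront spending divided by the monthly saving over the manual process.
upfront = project_cost + CONTINGENCY  # assumption: contingency is spent upfront
monthly_saving = MANUAL_COST_PER_DAY * DAYS_PER_MONTH - MAINTENANCE_PER_MONTH
payback_months = upfront / monthly_saving

print(working_hours, project_cost, model_cost, manual_cost)  # 144 8640 36640 54000
print(round(ratio, 2), round(net_return, 2), round(payback_months, 1))  # 1.47 0.47 3.1
```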


## Project plan
----------------

### task

Describe the intended plan for achieving the machine learning goals and thereby achieving the business goals. The plan should specify the anticipated set of steps to be performed during the rest of the project including an initial selection of tools and techniques.

### output

#### 1. Project plan
- List the stages to be executed in the project, together with their duration, resources required, inputs, outputs, and dependencies. Where possible, try and make explicit the large-scale iterations in the machine learning process, for example, repetitions of the modeling and evaluation phases. As part of the project plan, it is also important to analyze dependencies between time schedule and risks. Mark results of these analyses explicitly in the project plan, ideally with actions and recommendations if the risks are manifested. Decide at this point which evaluation strategy will be used in the evaluation phase. Your project plan will be a dynamic document. At the end of each phase you’ll review progress and achievements and update the project plan accordingly. Specific review points for these updates should be part of the project plan.

- Build a Gantt chart for the project tasks and phases using an online platform such as TeamGantt, Jira, GoodDay, or Trello.

- Add all of your team members and provisionally assign tasks to them. Then check the progress daily.

#### 2. ML project Canvas
At the end of the first phase, you should create a canvas for the project as a summary of this phase.


> ##### Example
> Follow the link: https://github.com/louisdorard/machine-learning-canvas/blob/master/churn.pdf


### Output

Stages:

1) Data acquisition. Obtain the dataset and reach agreements with the data provider (bank X). Distribute the workload among team members, build the plan, and define the business problem. Duration: 1 week. Input: project idea. Output: the acquired dataset. Resources: 4-5 working hours, personal computers, software (Kaggle). Risks: the bank refuses to let us use its data and software. Solution: find another problem and reach an agreement with a different company.

2) Data preparation. Exploratory data analysis, data preprocessing, and feature engineering ahead of modeling. Duration: 1 week. Input: the acquired dataset. Output: cleaned and prepared dataset. Resources: 3-4 working hours, personal computers, and additional software (Google Colab, library documentation, etc.). Risks: the dataset is too large or too small. Solution: reduce the dataset using an appropriate technique, or find more data and combine it with the existing data.

3) Model building. Model selection and training. Duration: 1 week. Input: prepared dataset, split into training and test parts. Output: trained model, ready for testing and possibly for integration. Resources: 4-5 working hours, possibly server usage. Risks: insufficient hardware. Solution: use Google Colab GPUs or obtain a new server.

4) Model validation. Compute the metrics on test data and evaluate the model. Check whether the results are acceptable and meet the success criteria. Duration: 1 week. Input: test data. Output: metrics and a decision about model deployment. Process: analysis of the results and model tuning. Resources: 3-4 hours, hardware (probably a server), and software. Risks: low accuracy. Solution: check the prepared dataset, extract as many features as possible, and tune the model.

5) Model deployment. Integrate the tuned model into existing processes. Automate it to handle newly arriving data. Check for any inconsistencies. Duration: 1 week. Input: server and built model. Output: fully automated and integrated model. Resources: server, some working hours. Risks: integration problems; the model is too big or too slow. Solution: apply optimization techniques.

6) Model monitoring. Check that the model handles new data well and that the work process is not broken. Duration: 1 week. Input: integrated model. Output: completed analysis of the model; project finish. Risks: problems with the model. Solution: consult developers who have experience with such problems.

Workload distribution per each team member:

Ilia Mitrokhin: full stack and ML developer. Team leader.

Alie Ablaeva: ML developer.

Anastasia Pichugina: in charge of reports and data preparation.


![Table1](https://raw.githubusercontent.com/paket2004/Stock-market-prediction/images/images/table1.png)


![Table2](https://raw.githubusercontent.com/paket2004/Stock-market-prediction/images/images/table2.png)

Gantt Chart Link: https://teamflame.ru/66868eca1229a1f6e892b9b6/p/66868eca1229a1f6e892b9b9/b/66868eca1229a1f6e892b9bb

