An ML model and tool that analyzes topics in the news and predicts stock price movements, which can then be used for trading. The project extends STTM, an approach that detects associations between topics in the news and stock price movements, by using those associations as a signal and suggesting trades.
I also made a trading simulation (backtest) to assess the performance. In my experience the results are very noisy, but some are promising when using Kommersant news and data from before 2022 (the Russo-Ukrainian war).
Inspired by STTM. Article link: https://peerj.com/articles/cs-1156/
-
Setup
-
Install python and pip.
-
Then create a virtual Python environment inside this project:
python3.9 -m venv .venv
Note: the project was made using Python 3.9, but any python3 should also work.
-
Activate the virtual environment:
source .venv/bin/activate
-
Install requirements:
pip install -r requirements.txt
-
Download Yandex MyStem for your system. Download link: https://yandex.ru/dev/mystem/
-
Place it in core/lib/ and name it mystem.
-
On Linux/macOS make sure to mark it executable:
chmod +x mystem
-
Place raw news and market series data in data/raw/. The data can be found online or by contacting me.
Expected structure of data/raw/:
data/raw
├── market_series
│   ├── AFKS.xlsx
│   ├── AFLT.xlsx
│   ├── ALRS.xlsx
│   ...
└── news
    └── kommersant
        ├── 2016-01-01
        │   ├── 0001__2016-01-01.txt
        │   ├── 0002__2016-01-01.txt
        │   ├── 0003__2016-01-01.txt
        │   ...
        ├── 2016-01-02
        ├── 2016-01-04
        ...
Make sure the sequence numbers in the news file names are zero-padded to 4 digits; otherwise the files will not be parsed in the correct order. You can use the bash script data/fix_file_names to fix this.
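To illustrate what the padding requirement means, here is a minimal Python sketch that renames files in one date directory. It is not the contents of data/fix_file_names (that script is bash and may work differently); the function names and the assumed file name pattern `<sequence>__<YYYY-MM-DD>.txt` are inferred from the example tree above.

```python
import re
from pathlib import Path

# Assumed file name pattern: <sequence>__<YYYY-MM-DD>.txt
_NAME_RE = re.compile(r"^(\d+)__(\d{4}-\d{2}-\d{2})\.txt$")


def padded_name(name: str, width: int = 4) -> str:
    """Return `name` with its sequence number zero-padded to `width` digits."""
    match = _NAME_RE.match(name)
    if match is None:
        return name  # leave unrecognized names untouched
    seq, day = match.groups()
    return f"{int(seq):0{width}d}__{day}.txt"


def fix_directory(day_dir: Path, width: int = 4) -> int:
    """Rename files in one date directory; return how many were renamed."""
    renamed = 0
    for path in sorted(day_dir.iterdir()):
        if not path.is_file():
            continue
        target = path.with_name(padded_name(path.name, width))
        # Skip files that are already padded, and never overwrite an existing file.
        if target == path or target.exists():
            continue
        path.rename(target)
        renamed += 1
    return renamed
```

Without padding, a plain lexicographic sort puts `10__2016-01-01.txt` before `2__2016-01-01.txt`, which is why the order breaks.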
-
Config
Open core/config.py and define your training years range, depending on your data. The system will automatically find the dates with overlapping news and market data.
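As a rough illustration of what "overlapping dates" means, the sketch below intersects two date collections and keeps only the configured years. The function and argument names are hypothetical, and the actual logic in core/ may differ.

```python
from datetime import date


def overlapping_dates(news_dates, market_dates, start_year, end_year):
    """Return sorted dates present in both sources within [start_year, end_year]."""
    common = set(news_dates) & set(market_dates)
    return sorted(d for d in common if start_year <= d.year <= end_year)


# Hypothetical sample data: news exists on three days, market data on two.
news = [date(2016, 1, 1), date(2016, 1, 2), date(2016, 1, 4)]
market = [date(2016, 1, 4), date(2016, 1, 5)]

# Only 2016-01-04 appears in both lists.
usable = overlapping_dates(news, market, 2016, 2016)
```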
Training
Run the training scripts in sequence, and after each one check whether the data it generates looks clean or malformed.
-
python -m core.run_news_preprocessing processes your news data and stores output in data/preprocessed_news/.
-
python -m core.run_lda_modeling builds LDA models and stores output in and data/models/.
-
python -m core.run_sttm_training builds the STTM index and stores output in a few directories under data/.
-
-
Trading simulation
-
Open core/run_backtest.py and, near the start, configure which training dataset you want to use and on which data you want to run the test (change _TRAINED_DATASET_START_YEAR, _TRAINED_DATASET_END_YEAR and _TEST_YEAR).
-
python -m core.run_backtest runs the backtest and stores data in data/backtest_results/. It will also store run logs in logs/.
-
-
Results and statistics
-
Open core/run_results_calc.py and specify the years from which you want to consider results (edit _RESULTS_YEARS).
-
python -m core.run_results_calc calculates statistics for your results: mean weekly returns, annualized returns, z-score and p-value. Output is stored in data/aggregated_results/.
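The sketch below shows one common way to compute these four statistics from a list of weekly returns. The formulas are assumptions made for illustration (compounding over 52 weeks, a z-test of the mean against zero, a two-sided p-value under a normal approximation); core/run_results_calc.py may define them differently, and the function name is hypothetical.

```python
import math
from statistics import mean, stdev


def summarize_weekly_returns(weekly_returns):
    """Summarize weekly returns given as fractions (0.01 means +1%)."""
    n = len(weekly_returns)
    if n < 2:
        raise ValueError("need at least two weekly returns")
    mu = mean(weekly_returns)
    sd = stdev(weekly_returns)  # sample standard deviation
    if sd == 0:
        raise ValueError("returns have zero variance; z-score is undefined")
    annualized = (1 + mu) ** 52 - 1  # assumes 52 weeks, compounded
    z = mu / (sd / math.sqrt(n))  # tests the mean return against zero
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided, normal approximation
    return {
        "mean_weekly_return": mu,
        "annualized_return": annualized,
        "z_score": z,
        "p_value": p,
    }
```

With few weeks of data the normal approximation is rough, so a small p-value from a short backtest should be read with caution.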
-
The original repo published by the authors is here.