### Markov Chain to predict the stock market, based on Pranab Ghosh's post "Customer Conversion Prediction with Markov Chain Classifier" (https://pkghosh.wordpress.com/2015/07/06/customer-conversion-prediction-with-markov-chain-classifier/)
### -->  binary classification with two first-order transition matrices, one positive and one negative

### 1) First-Order Transition Matrix

In [None]:
# insert image using python package
from IPython.display import Image
Image(filename="./first-order-matrix.png", width=400)
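
As a minimal sketch of what a first-order transition matrix holds, the cell below counts how often each state is followed by each other state and then normalises every row. The `states` list is a hypothetical toy sequence, not market data.

```python
import pandas as pd

# Hypothetical toy sequence of daily states (L = low, M = medium, H = high)
states = ["L", "M", "M", "H", "L", "M", "H", "H", "L"]

# Pair each state with its successor: (L,M), (M,M), (M,H), ...
pairs = pd.DataFrame({"from": states[:-1], "to": states[1:]})

# Count transitions, then divide each row by its total so that rows sum to 1
counts = pd.crosstab(pairs["from"], pairs["to"])
transition_matrix = counts.div(counts.sum(axis=1), axis=0)
print(transition_matrix.round(2))
```

Each row reads as "given today's state, how likely is each state tomorrow"; first-order means only the current state is used.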

### 2) Cataloging Patterns Using Market Data
#### &#x23f5; 10 years of S&P 500 index data represent only one sequence of many events leading to the last quoted price. Breaking the data into many samples of sequences that lead to different price patterns lets the model learn richer and more diverse patterns. I use the moving average to understand this.

### 3) Example
#### &#x23f5; 2012-10-18 to 2012-11-21 
1417.26 -> 1428.39 -> 1394.53 -> 1377.51 -> Next Day Volume Up
#### &#x23f5; 2016-08-12 to 2016-08-22
2184.05 -> 2190.15 -> 2178.15 -> 2182.22 -> 2187.02 -> Next Day Volume Up 
#### &#x23f5; 2014-04-04 to 2014-04-10
1865.09 -> 1854.04 -> Next Day Volume Down

### 4) If a similar sequence of price moves up and down is found in the historical data, I treat it as a pattern
### 5) Binning Values into n(3) Buckets
Pranab Ghosh's approach is to simplify each event within a sequence into a single feature. He splits the values into 3 groups: Low, Medium, High. The value here is the percent difference between one day's price and the previous day's. Once all of them are collected, they are binned into three groups of equal frequency (each group holds roughly the same number of observations) using the InfoTheo package. The Python code in this notebook uses `pd.qcut` for the same purpose.
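
A small sketch of equal-frequency binning with `pd.qcut`, under the assumption that the input is a series of day-over-day percent changes. The `prices` values are made up for illustration.

```python
import pandas as pd

# Hypothetical closing prices, for illustration only
prices = pd.Series([100.0, 101.0, 100.5, 102.0, 101.0,
                    103.0, 102.5, 104.0, 103.0, 105.0])

# Percent difference between each day's price and the previous day's
pct = prices.pct_change().dropna()

# Three equal-frequency bins: each label receives about a third of the observations
bins = pd.qcut(pct, 3, labels=["L", "M", "H"])
print(bins.value_counts().sort_index())
```

With nine distinct changes, each of the three labels ends up with three observations.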
### 6) Example
#### &#x23f5; With closes, opens, highs and lows, we end up with a feature containing four letters, for example "MLHL"
#### &#x23f5; String together all the feature events of a sequence, along with the observed outcome, to get something like: "HMLL" -> "MHHL" -> "LLLH" -> "HMMM" -> Volume Up
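
The step from binned features to an event sequence can be sketched as follows. The `days` frame is hypothetical and simply reproduces the four events from the example above.

```python
import pandas as pd

# Hypothetical binned features for four consecutive days
days = pd.DataFrame({
    "close": ["H", "M", "L", "H"],
    "open":  ["M", "H", "L", "M"],
    "high":  ["L", "H", "L", "M"],
    "low":   ["L", "L", "H", "M"],
})

# One event per day: concatenate the letters column by column
events = days["close"] + days["open"] + days["high"] + days["low"]

# One sequence per sample: join the daily events in order
sequence = ",".join(events)
print(sequence)  # HMLL,MHHL,LLLH,HMMM
```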
### 7) Creating Two Markov Chains, One for Days with Volume Jumps, and another for Volume Drops
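
To make the idea of two chains concrete, here is a minimal sketch of the log-odds decision rule: each transition in a new sequence is scored as log(p_up / p_down), and the sum decides the class. The probabilities in `p_up` and `p_down`, the helper `log_odds` and its `floor` parameter are all hypothetical and only illustrate the arithmetic.

```python
import math

# Hypothetical transition probabilities, for illustration only
# (in the notebook they come from the "volume up" and "volume down" sequences)
p_up = {("L", "H"): 0.6, ("H", "H"): 0.5, ("H", "L"): 0.2}
p_down = {("L", "H"): 0.3, ("H", "H"): 0.25, ("H", "L"): 0.4}

def log_odds(sequence, pos, neg, floor=1e-5):
    """Sum log(pos / neg) over consecutive pairs; unseen pairs use a small floor."""
    total = 0.0
    for pair in zip(sequence[:-1], sequence[1:]):
        total += math.log(pos.get(pair, floor) / neg.get(pair, floor))
    return total

score = log_odds(["L", "H", "H", "L"], p_up, p_down)
prediction = 1 if score > 0 else 0  # 1 = volume up, 0 = volume down
print(round(score, 3), prediction)
```

A positive total means the sequence looks more like the "volume up" chain than the "volume down" chain.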

In [None]:
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

import io, base64, os, json, re
import pandas as pd
import numpy as np
from pandas_datareader.data import DataReader
import datetime
from random import randint

### &#x23fa; Loading Data

In [None]:
# data extraction
start_date = "2012-12-01"
end_date = "2022-12-01"
symbol = "SPY"
data = DataReader(name=symbol, data_source="yahoo", start=start_date, end=end_date)
data = data[["Open", "High", "Low", "Adj Close", "Volume"]]

In [None]:
data.head()

In [None]:
# add return and range 
df = data.copy()
# shift(1) leaves the first row without a previous value, so its return is NaN and the row is dropped below
df["Returns"] = (df["Adj Close"] / df["Adj Close"].shift(1)) - 1
df["Range"] = (df["High"] / df["Low"]) - 1
df.dropna(inplace=True)
df.head()
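
A tiny worked example of the return calculation, assuming three made-up adjusted closes. It shows that dividing by the shifted series matches `pct_change()` and that only the first row is undefined.

```python
import pandas as pd

prices = pd.Series([100.0, 102.0, 99.96])  # hypothetical adjusted closes

manual = prices / prices.shift(1) - 1      # same formula as the Returns column
builtin = prices.pct_change()              # pandas shortcut for the same quantity

print(manual.tolist())
print(builtin.tolist())
```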

In [None]:
# show all the columns; Date is not listed, so it has been recognized as the index and needs to be turned into a regular column
df.columns

In [None]:
df.index

In [None]:
# reset index 
df.reset_index(inplace=True)

In [None]:
# convert the Date column to datetime type
df["Date"]  = pd.to_datetime(df["Date"])

In [None]:
df.head()

In [None]:
# take 100,000 random windows of 10 to 30 consecutive rows from the price data
new_set = []
for row_set in range(0, 100000):
    if row_set%2000==0: print(row_set)
    row_quant = randint(10, 30)
    row_start = randint(0, len(df)-row_quant)
    market_subset = df.iloc[row_start:row_start+row_quant]

    Close_Date = max(market_subset['Date'])
    if row_set%2000==0: print(Close_Date)
    
    # Close_Gap = (market_subset['Close'] - market_subset['Close'].shift(1)) / market_subset['Close'].shift(1)
    Close_Gap = market_subset['Adj Close'].pct_change()
    High_Gap = market_subset['High'].pct_change()
    Low_Gap = market_subset['Low'].pct_change() 
    Volume_Gap = market_subset['Volume'].pct_change() 
    Daily_Change = (market_subset['Adj Close'] - market_subset['Open']) / market_subset['Open']
    Outcome_Next_Day_Direction = (market_subset['Volume'].shift(-1) - market_subset['Volume'])
    
    new_set.append(pd.DataFrame({'Sequence_ID':[row_set]*len(market_subset),
                            'Close_Date':[Close_Date]*len(market_subset),
                           'Close_Gap':Close_Gap,
                           'High_Gap':High_Gap,
                           'Low_Gap':Low_Gap,
                           'Volume_Gap':Volume_Gap,
                           'Daily_Change':Daily_Change,
                           'Outcome_Next_Day_Direction':Outcome_Next_Day_Direction}))

In [None]:
len(market_subset)

In [None]:
new_set_df = pd.concat(new_set)
print(new_set_df.shape)
new_set_df = new_set_df.dropna(how='any') 
print(new_set_df.shape)
new_set_df.tail(20)

In [None]:
# new dataset
new_set_df.head()

In [None]:
# confirm sequence
# new_set_df[new_set_df['Close_Date'] == '1973-06-27'] {HLH, HLH, HHH, HHH, LLL, LML, LML, LLL, LHL, ...

### Create a new dataset that converts the sequential data into categorical data

In [None]:
# create sequences
# simplify the data by binning values into three groups
 
# Close_Gap
new_set_df['Close_Gap_LMH'] = pd.qcut(new_set_df['Close_Gap'], 3, labels=["L", "M", "H"])

# High_Gap - not used in this example
new_set_df['High_Gap_LMH'] = pd.qcut(new_set_df['High_Gap'], 3, labels=["L", "M", "H"])

# Low_Gap - not used in this example
new_set_df['Low_Gap_LMH'] = pd.qcut(new_set_df['Low_Gap'], 3, labels=["L", "M", "H"])

# Volume_Gap
new_set_df['Volume_Gap_LMH'] = pd.qcut(new_set_df['Volume_Gap'], 3, labels=["L", "M", "H"])
 
# Daily_Change
new_set_df['Daily_Change_LMH'] = pd.qcut(new_set_df['Daily_Change'], 3, labels=["L", "M", "H"])

# new set (copy so that adding a column below does not write to a view of the old frame)
new_set_df = new_set_df[["Sequence_ID", 
                         "Close_Date", 
                         "Close_Gap_LMH", 
                         "Volume_Gap_LMH", 
                         "Daily_Change_LMH", 
                         "Outcome_Next_Day_Direction"]].copy()

new_set_df['Event_Pattern'] = new_set_df['Close_Gap_LMH'].astype(str) + new_set_df['Volume_Gap_LMH'].astype(str) + new_set_df['Daily_Change_LMH'].astype(str)

In [None]:
# inspect the last rows, including the new Event_Pattern column
new_set_df.tail(10)

In [None]:
new_set_df["Outcome_Next_Day_Direction"].describe()

In [None]:
# reduce the set
compressed_set = new_set_df.groupby(['Sequence_ID', 
                                     'Close_Date'])['Event_Pattern'].apply(lambda x: "{%s}" % ', '.join(x)).reset_index()

print(compressed_set.shape)
compressed_set.head() 

In [None]:
#compressed_outcomes = new_set_df[['Sequence_ID', 'Close_Date', 'Outcome_Next_Day_Direction']].groupby(['Sequence_ID', 'Close_Date']).agg()

compressed_outcomes = new_set_df.groupby(['Sequence_ID', 'Close_Date'])['Outcome_Next_Day_Direction'].mean()
compressed_outcomes = compressed_outcomes.to_frame().reset_index()
print(compressed_outcomes.shape)
compressed_outcomes.describe()

In [None]:
compressed_set = pd.merge(compressed_set, compressed_outcomes, on=['Sequence_ID', 'Close_Date'], how='inner')
print(compressed_set.shape)
compressed_set.head()

In [None]:
# # reduce set  again
# compressed_set = new_set_df.groupby(['Sequence_ID', 'Close_Date','Outcome_Next_Day_Direction'])['Event_Pattern'].apply(lambda x: "{%s}" % ', '.join(x)).reset_index()

compressed_set['Event_Pattern'] = [''.join(e.split()).replace('{','')
                                   .replace('}','') for e in compressed_set['Event_Pattern'].values]
compressed_set.head()

In [None]:
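# use the last 90 days of data for validation (out-of-sample pattern recognition)
# note: the cutoff is relative to today's date, so with a fixed end_date the
# validation set is empty once the notebook is run more than 90 days after end_date
compressed_set_validation = compressed_set[compressed_set['Close_Date'] >= datetime.datetime.now() 
                                           - datetime.timedelta(days=90)]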

compressed_set_validation.shape

### Check the shape of the training set after removing the validation window

In [None]:
compressed_set = compressed_set[compressed_set['Close_Date'] < datetime.datetime.now() 
                                           - datetime.timedelta(days=90)]  
compressed_set.shape
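
One possible alternative, sketched here under the assumption that the split should follow the data rather than the wall clock, is to anchor the 90-day cutoff to the newest `Close_Date`. The `frame` below is a hypothetical stand-in for `compressed_set`.

```python
import pandas as pd

# Hypothetical frame standing in for compressed_set
frame = pd.DataFrame({
    "Sequence_ID": range(6),
    "Close_Date": pd.to_datetime([
        "2022-01-03", "2022-05-02", "2022-08-15",
        "2022-09-30", "2022-11-01", "2022-12-01",
    ]),
})

# Anchor the cutoff to the newest date in the data rather than to today's date
cutoff = frame["Close_Date"].max() - pd.Timedelta(days=90)
train = frame[frame["Close_Date"] < cutoff]
validation = frame[frame["Close_Date"] >= cutoff]

print(cutoff.date(), len(train), len(validation))
```

With this variant the split stays the same no matter when the notebook is rerun.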

In [None]:
list(compressed_set)

In [None]:
# drop the date field; keep only the ID, the pattern and the outcome
compressed_set = compressed_set[['Sequence_ID', 'Event_Pattern','Outcome_Next_Day_Direction']]
compressed_set_validation = compressed_set_validation[['Sequence_ID', 'Event_Pattern','Outcome_Next_Day_Direction']]

### Keep only big moves (rising or dropping) and build the outcome

In [None]:
compressed_set['Outcome_Next_Day_Direction'].describe()

In [None]:
print(len(compressed_set['Outcome_Next_Day_Direction']))
len(compressed_set[abs(compressed_set['Outcome_Next_Day_Direction']) > 10000000])

In [None]:
# keep only big/interesting moves (absolute outcome above 10,000,000)
print('all moves:', len(compressed_set))
compressed_set = compressed_set[abs(compressed_set['Outcome_Next_Day_Direction']) > 10000000]
compressed_set['Outcome_Next_Day_Direction'] = np.where((compressed_set['Outcome_Next_Day_Direction'] > 0), 1, 0)
compressed_set_validation['Outcome_Next_Day_Direction'] = np.where((compressed_set_validation['Outcome_Next_Day_Direction'] > 0), 1, 0)
print('big moves only:', len(compressed_set))  

In [None]:
compressed_set.head()

In [None]:
# create two data sets: volume up (1) and volume down (0)
compressed_set_pos = compressed_set[compressed_set['Outcome_Next_Day_Direction']==1][['Sequence_ID', 'Event_Pattern']]
print(compressed_set_pos.shape)
compressed_set_neg = compressed_set[compressed_set['Outcome_Next_Day_Direction']==0][['Sequence_ID', 'Event_Pattern']]
print(compressed_set_neg.shape)

In [None]:
flat_list = [item.split(',') for item in compressed_set['Event_Pattern'].values ]
unique_patterns = ','.join(str(r) for v in flat_list for r in v)
unique_patterns = list(set(unique_patterns.split(',')))
len(unique_patterns)

In [None]:
compressed_set['Outcome_Next_Day_Direction'].head() 

### Build the Markov Chain grid (first-order matrix)

In [None]:
# build the markov transition grid
def build_transition_grid(compressed_grid, unique_patterns):
    # count how often each "from,to" pair of events occurs in the sequences

    patterns = []
    counts = []
    for from_event in unique_patterns:

        # how many times does from_event lead directly to to_event?
        for to_event in unique_patterns:
            pattern = from_event + ',' + to_event # e.g. MMM,MLM

            ids_matches = compressed_grid[compressed_grid['Event_Pattern'].str.contains(pattern)]
            found = 0
            if len(ids_matches) > 0:
                Event_Pattern = '---'.join(ids_matches['Event_Pattern'].values)
                found = Event_Pattern.count(pattern)
            patterns.append(pattern)
            counts.append(found)

    # create to/from grid
    grid_Df = pd.DataFrame({'pairs':patterns, 'counts': counts})

    # split "from,to" into two columns (n and expand passed by keyword for current pandas)
    grid_Df[['x', 'y']] = grid_Df['pairs'].str.split(',', n=1, expand=True)

    grid_Df = grid_Df.pivot(index='x', columns='y', values='counts')

    grid_Df.columns = [col for col in grid_Df.columns]
    grid_Df.index.name = None

    # replace all NaN with zeros
    grid_Df.fillna(0, inplace=True)

    # divide each row by its row sum so that every "from" row sums to 1;
    # rows that were never observed stay at zero
    grid_Df = grid_Df.div(grid_Df.sum(axis=1), axis=0).fillna(0)
    return grid_Df

In [None]:
grid_pos = build_transition_grid(compressed_set_pos, unique_patterns) 
grid_neg = build_transition_grid(compressed_set_neg, unique_patterns) 
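
As a cross-check on the grid, transitions can also be counted pair by pair instead of by substring search. Note that Python's `str.count` counts non-overlapping matches, so a run such as "A,A,A" contributes one "A,A" match although it contains two transitions. The sketch below uses hypothetical `sequences` in the same comma-separated form as `Event_Pattern`; it is an alternative way to count, not the function used above.

```python
import pandas as pd

# Hypothetical event sequences in the same comma-separated form as Event_Pattern
sequences = ["LLH,MMH,LLH,HHL", "MMH,LLH,HHL"]

# Collect every (from, to) pair of neighbouring events within each sequence
pairs = []
for seq in sequences:
    events = seq.split(",")
    pairs.extend(zip(events[:-1], events[1:]))

pairs_df = pd.DataFrame(pairs, columns=["from", "to"])
counts = pd.crosstab(pairs_df["from"], pairs_df["to"])

# Make the grid square over all observed events, then normalise each row
events_all = sorted(set(pairs_df["from"]) | set(pairs_df["to"]))
counts = counts.reindex(index=events_all, columns=events_all, fill_value=0)
row_sums = counts.sum(axis=1)
grid = counts.div(row_sums.where(row_sums > 0, 1), axis=0)
print(grid.round(2))
```

Rows are the "from" event and columns the "to" event; an event that never starts a transition keeps a row of zeros.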

### Display the positive and negative grids separately

In [None]:
grid_neg.head()

In [None]:
grid_pos.head()

In [None]:
# compressed_set_validation[compressed_set_validation['Sequence_ID' == seq_id]]

In [None]:
from sklearn.metrics import confusion_matrix

def safe_log(x, y):
    # log(x / y), falling back to 0 when the ratio cannot be computed
    # (helper kept from the original; the loop below handles zeros explicitly)
    try:
        lg = np.log(x / y)
    except Exception:
        lg = 0
    return lg

# predict on out-of-sample data
actual = []
predicted = []
for seq_id in compressed_set_validation['Sequence_ID'].values:
    row = compressed_set_validation[compressed_set_validation['Sequence_ID'] == seq_id]
    patterns = row['Event_Pattern'].values[0].split(',')
    outcome = row['Outcome_Next_Day_Direction'].values[0]
    pos = []
    neg = []
    log_odds = []

    for i in range(0, len(patterns) - 1):
        from_event, to_event = patterns[i], patterns[i + 1]
        # log odds of one transition: logOdds = log(tp(i,j) / tn(i,j))
        # defaults cover transitions that are missing from either grid
        numerator = denominator = 0
        log_value = 0
        if (from_event in grid_pos.index and to_event in grid_pos.columns
                and from_event in grid_neg.index and to_event in grid_neg.columns):
            # rows are the "from" event, columns the "to" event
            numerator = grid_pos.loc[from_event, to_event]
            denominator = grid_neg.loc[from_event, to_event]
            if numerator == 0 and denominator == 0:
                log_value = 0
            elif denominator == 0:
                log_value = np.log(numerator / 0.00001)
            elif numerator == 0:
                log_value = np.log(0.00001 / denominator)
            else:
                log_value = np.log(numerator / denominator)

        log_odds.append(log_value)
        pos.append(numerator)
        neg.append(denominator)

    print('outcome:', outcome)
    print(sum(pos) / sum(neg) if sum(neg) else float('nan'))
    print(sum(log_odds))

    actual.append(outcome)
    predicted.append(sum(log_odds))

confusion_matrix(actual, [1 if p > 0 else 0 for p in predicted])

### Compute the accuracy and plot the confusion matrix

In [None]:
from sklearn.metrics import accuracy_score
score = accuracy_score(actual, [1 if p > 0 else 0 for p in predicted])
print('Accuracy:', round(score * 100,2), '%')

In [None]:
import seaborn as sns
cm = confusion_matrix(actual, [1 if p > 0 else 0 for p in predicted])
fig, ax = plt.subplots(figsize=(5, 5))
sns.heatmap(cm, annot=True, ax = ax, fmt='g')

ax.set_title('Confusion Matrix') 
ax.set_xlabel('Predicted')
ax.set_ylabel('Actual')

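# confusion_matrix orders the classes as [0, 1]: 0 = next-day volume down, 1 = next-day volume up
ax.xaxis.set_ticklabels(['volume down','volume up'])
ax.yaxis.set_ticklabels(['volume down','volume up'])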
ax.set_yticklabels(ax.get_yticklabels(), rotation = 0, fontsize = 8)
ax.set_xticklabels(ax.get_xticklabels(), rotation = 90, fontsize = 8)  
plt.show()

In [None]:
import pandas as pd 
import numpy as np 

from pyhhmm.gaussian import GaussianHMM 
from pandas_datareader.data import DataReader

import matplotlib.pyplot as plt

### Data

In [None]:
# data extraction 
start_date = '2017-01-01'
end_date = '2022-06-01'
symbol = "SPY"
data = DataReader(name=symbol, data_source="yahoo", start=start_date, end=end_date)
data = data[["Open", "High", "Low", "Adj Close", "Volume"]]

In [None]:
# add return and range 
df = data.copy()
df["Returns"] = (df["Adj Close"] / df["Adj Close"].shift(1)) - 1 # daily return: today's close relative to the previous row's close
df["Range"] = (df["High"] / df["Low"]) - 1
df.dropna(inplace=True)
df.head()

In [None]:
# structure data 
X_train = df[["Returns", "Range"]]
X_train.head()

### HMM Learning

In [None]:
# Train Model 
model = GaussianHMM(n_states=4, covariance_type='full', n_emissions=2)
model.train([np.array(X_train.values)])

In [None]:
# check results 
hidden_states = model.predict([X_train.values])[0]
hidden_states[:40]
len(hidden_states)

In [None]:
# regime state means for each feature 
model.means

In [None]:
# regime state covariances for each feature
model.covars

In [None]:
# helper function
# dir(model)

### Data Visualization

In [None]:
# structure the prices for plotting 
i = 0
labels_0 = []
labels_1 = [] 
labels_2 = [] 
labels_3 = []
prices = df["Adj Close"].values.astype(float)
print("Correct number of rows:", len(prices) == len(hidden_states))
for s in hidden_states:
    if s == 0:
        labels_0.append(prices[i])
        labels_1.append(float('nan'))
        labels_2.append(float('nan'))
        labels_3.append(float('nan'))
    if s == 1:
        labels_0.append(float('nan'))
        labels_1.append(prices[i])
        labels_2.append(float('nan'))
        labels_3.append(float('nan'))
    if s == 2:
        labels_0.append(float('nan'))
        labels_1.append(float('nan'))
        labels_2.append(prices[i])
        labels_3.append(float('nan'))
    if s == 3:
        labels_0.append(float('nan'))
        labels_1.append(float('nan'))
        labels_2.append(float('nan'))
        labels_3.append(prices[i])
    i += 1
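
The same per-state series can be built without an explicit loop. This is a sketch with `np.where`, using small hypothetical arrays in place of the notebook's `prices` and `hidden_states`.

```python
import numpy as np

# Hypothetical stand-ins for the notebook's prices and hidden_states arrays
prices = np.array([10.0, 11.0, 12.0, 13.0, 14.0])
hidden_states = np.array([0, 1, 1, 3, 0])

# For each state, keep the price where that state is active and NaN elsewhere
labels = {s: np.where(hidden_states == s, prices, np.nan) for s in range(4)}

for s, series in labels.items():
    print(s, series)
```

Each entry of `labels` has the same length as `prices`, so it can be passed to `plt.plot` directly; matplotlib leaves gaps at the NaN positions.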

In [None]:
# plot chart 
fig = plt.figure(figsize=(18, 8))
plt.plot(labels_0, color="green")
plt.plot(labels_1, color="red")
plt.plot(labels_2, color="orange")
plt.plot(labels_3, color="black")
plt.show()

In [None]:
# moving average 
# https://www.youtube.com/watch?v=r3Ulu0jZCJI
# take one day as the period; for a 20-day moving average, add up the values of 20 consecutive days and divide by 20
# the result is conventionally plotted against the last day of that 20-day window
# to move forward a day, drop the oldest day and include the day that follows the current last day,
# then repeat the calculation
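
A short sketch of a simple moving average with pandas, using a made-up price series and a window of 3 instead of 20 so the numbers are easy to follow. By default `rolling` aligns each average with the last row of its window.

```python
import pandas as pd

prices = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])  # hypothetical prices
window = 3  # a short window so the arithmetic is easy to follow

# The first window - 1 entries are NaN because the window is not yet full
sma = prices.rolling(window=window).mean()
print(sma)
```

The third value is the mean of 1, 2 and 3; moving forward a day drops the 1 and adds the 4.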

In [None]:
# interesting way of using markov chain
# https://www.youtube.com/watch?v=sdp49vTanSk

In [None]:
# first order matrix 


In [None]:
# binning data: make continuous data into categorical data
# https://www.youtube.com/watch?v=iv_ec0EfXcE

# equal-frequency binning in python
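
To show why the bin type matters, this sketch contrasts equal-width bins (`pd.cut`) with equal-frequency bins (`pd.qcut`) on a hypothetical skewed series with one outlier.

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])  # hypothetical skewed data

# Equal width: the value range is cut into three intervals of the same size
equal_width = pd.cut(values, 3, labels=["L", "M", "H"])

# Equal frequency: the cut points are quantiles, so each bin gets a similar count
equal_freq = pd.qcut(values, 3, labels=["L", "M", "H"])

print(pd.DataFrame({"value": values, "cut": equal_width, "qcut": equal_freq}))
```

With equal-width bins the outlier pushes almost everything into "L", while equal-frequency bins keep three values per label.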

In [None]:
# https://setosa.io/blog/2014/07/26/markov-chains/

In [None]:
# https://www.youtube.com/watch?v=WT6jI8UgROI