In [8]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler


# read the CSV file
data = pd.read_csv('data/prices.csv')

# display the first few rows of the dataframe
data.head()


Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume
0,2015-02-17,2.97006,2.99706,2.902558,2.902558,2.349932,55057948.0
1,2015-02-18,2.920559,2.97906,2.916059,2.95656,2.393652,31801480.0
2,2015-02-19,3.006061,3.006061,2.96556,2.99256,2.422798,37002206.0
3,2015-02-20,2.98806,2.98806,2.929559,2.938559,2.379078,30158007.0
4,2015-02-23,2.866558,2.889058,2.857558,2.875558,2.328072,83954148.0


The dataset consists of stock prices and trading volume. Scaling is not strictly necessary, especially if interpretability is a concern, but it puts all features on similar ranges and can speed up convergence of gradient-based training.

The recommendation is therefore to scale the data as a precaution, for consistent behavior across different datasets and potentially better model performance. If interpretability is the top priority and training converges without trouble, leaving the data unscaled is also reasonable.

The next cell includes the scaling step in the preprocessing.

In [9]:
print("Data shape:", data.shape)
print("\nData types:")
print(data.dtypes)
print("\nMissing values:")
print(data.isnull().sum())

# select numeric columns for scaling
numeric_cols = ['Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume']

# initialize scaler
scaler = StandardScaler()

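# scale numeric features
# StandardScaler ignores NaNs when fitting and leaves them as NaN; they are imputed in a later cell
# note: the scaler is fit on the full dataset, before the train/test split
data[numeric_cols] = scaler.fit_transform(data[numeric_cols])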

# describe the final dataset
print("\nFinal dataset overview:")
print(data.describe())


Data shape: (2057, 7)

Data types:
Date          object
Open         float64
High         float64
Low          float64
Close        float64
Adj Close    float64
Volume       float64
dtype: object

Missing values:
Date         0
Open         1
High         1
Low          1
Close        1
Adj Close    1
Volume       1
dtype: int64

Final dataset overview:
               Open          High           Low         Close     Adj Close  \
count  2.056000e+03  2.056000e+03  2.056000e+03  2.056000e+03  2.056000e+03   
mean   1.105903e-16 -1.658855e-16  1.105903e-16  2.211806e-16 -1.658855e-16   
std    1.000243e+00  1.000243e+00  1.000243e+00  1.000243e+00  1.000243e+00   
min   -7.369443e-01 -7.338389e-01 -7.615335e-01 -7.356884e-01 -6.718673e-01   
25%   -4.783748e-01 -4.747312e-01 -4.817743e-01 -4.785460e-01 -4.674504e-01   
50%   -2.856022e-01 -2.861363e-01 -2.845076e-01 -2.862686e-01 -2.840696e-01   
75%   -9.717587e-03 -2.199153e-02 -2.954564e-03 -1.290984e-02 -4.776187e-02   
max    5.374

Data Shape: The dataset contains 2057 rows and 7 columns.

Data Types: All columns except 'Date' are numeric (float64).

Missing Values: Each of the six numeric columns has one missing value; 'Date' has none. We'll need to handle these missing values before proceeding with further analysis.

Final Dataset Overview:

Mean: The mean of each scaled column shown is approximately 0 (on the order of 1e-16), indicating that the data has been centered around zero. 'Volume' was scaled the same way, but its column is cut off in the output above.
Standard Deviation (std): The standard deviation of each column is approximately 1, indicating that the data has been scaled to unit variance. The reported value is 1.000243 rather than exactly 1 because describe() uses the sample standard deviation (dividing by n - 1), while StandardScaler divides by n.
Min/Max: The minimum and maximum values vary by column, but after scaling they are expressed in standard deviations from the mean and so fall within a comparable range.

Interpretation of Scaling:

By scaling the data, we've transformed each feature to have mean 0 and standard deviation 1. This puts the features on a similar scale, which can improve the performance of gradient-based algorithms such as logistic regression.
Scaling helps prevent features with larger magnitudes from dominating the optimization process.
For instance, without scaling, the 'Volume' column, whose raw values are in the tens of millions while the prices in the rows shown are in single digits, could disproportionately influence the gradient updates. Scaling mitigates this by putting all features on a comparable scale.
In summary, scaling lets the logistic regression model learn from all features without being biased by differences in feature magnitude, and it tends to improve convergence.
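
As a minimal sketch of what standardization computes, the following uses a small made-up price array (not the notebook's data) and plain NumPy instead of StandardScaler. It also checks the n versus n - 1 convention that would explain a std of about 1.000243 for 2056 non-missing rows.

```python
import numpy as np

# Made-up values for illustration only; not taken from prices.csv.
prices = np.array([2.90, 2.96, 2.99, 2.94, 2.88, 3.10])

# Standardize by hand: subtract the mean, divide by the population std (ddof=0),
# which is the convention StandardScaler uses.
z = (prices - prices.mean()) / prices.std(ddof=0)

pop_std = z.std(ddof=0)     # population std of the scaled values
sample_std = z.std(ddof=1)  # sample std, the convention pandas describe() uses

print("mean:", z.mean())
print("population std:", pop_std)
print("sample std:", sample_std)

# With n rows, the sample std of standardized data is sqrt(n / (n - 1)).
n = 2056  # non-missing rows per column in the notebook's output
expected_describe_std = np.sqrt(n / (n - 1))
print("expected describe() std for n = 2056:", expected_describe_std)
```

The scaled values themselves are unchanged by the choice of convention; only the summary statistic that describe() reports differs.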

Handle Missing Values: Impute missing values with the mean of each numeric column, ensuring that the dataset is complete for analysis.

Split the Data into Training and Testing Sets: Define features (X) and the target variable (y), then split the data into training and testing sets. Here, you can choose a relevant column as the target variable. Since we are dealing with stock price prediction, you might want to predict the future close price based on historical data. Therefore, you should keep the 'Close' column as the target variable.

Discretize the Target Variable y: Since you want to predict whether to buy, sell, or hold stocks, you need to discretize the 'Close' column into multiple classes, for example by categorizing price movements (the price increases, decreases, or remains the same). The number of bins depends on how finely you want to categorize. Note that the next cell uses pd.cut with 4 equal-width bins on the (scaled) 'Close' values themselves, so its classes describe price levels rather than day-to-day movements.

Check Unique Values: After splitting the data and discretizing the target variable, it's essential to verify the unique values in both y_test (the actual target values in the testing set) and predictions (the predicted values). This check helps ensure that the model outputs and actual targets align properly and allows for further analysis of the classification results.

So, the 'Close' column already exists in your dataset, and you'll use it as the target variable for predicting future price movements. Then, you'll discretize this column into multiple classes representing different price movements (e.g., buy, sell, hold) to perform classification.
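
One possible way to build labels from price movements, as opposed to price levels, is sketched below. This is an illustration under stated assumptions and not the notebook's method: the closing prices are made up, the 0.5% threshold is hypothetical, and the labels are derived from the next day's relative change in the raw (unscaled) close.

```python
import numpy as np
import pandas as pd

# Made-up closing prices for illustration; in the notebook this would be the raw
# (unscaled) 'Close' column.
close = pd.Series([100.0, 101.0, 101.2, 99.0, 99.1, 102.0])

# Relative change from each day to the next trading day.
next_day_return = close.shift(-1) / close - 1.0

# Hypothetical threshold: moves within +/-0.5% count as "hold".
threshold = 0.005

labels = pd.cut(
    next_day_return,
    bins=[-np.inf, -threshold, threshold, np.inf],
    labels=["sell", "hold", "buy"],
)

# The last row has no next day, so its return and label are NaN and are dropped.
labeled = pd.DataFrame(
    {"close": close, "next_return": next_day_return, "label": labels}
).dropna()
print(labeled)
```

Because each label depends only on the following day's price, the same-day close could be used as a feature without duplicating the target.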

In [10]:
# handle missing values
# impute missing values using the mean of each numeric column (all of them)
numeric_cols = data.select_dtypes(include=['float64']).columns
data[numeric_cols] = data[numeric_cols].fillna(data[numeric_cols].mean())

X = data.drop(columns=['Date'])  # don't need the date column for further processing
# note: X still contains 'Close', the column the target below is derived from
y = data['Close']  # 'Close' variable is the target variable for prediction

# discretize the target variable y
# use the cut function with 4 equal-width bins; labels=False returns integer codes 0..3
y_bins = pd.cut(y, bins=4, labels=False)

# split the data into 80% training and 20% testing
X_train, X_test, y_train, y_test = train_test_split(X, y_bins, test_size=0.2, random_state=42)

# display the shapes of the training and testing sets
print("Training set shape:", X_train.shape, y_train.shape)
print("Testing set shape:", X_test.shape, y_test.shape)

# check unique values in y_test
print("Unique values in y_test:", np.unique(y_test))
# check unique values of y_bins
print("Unique values of y_bins:", np.unique(y_bins))

Training set shape: (1645, 6) (1645,)
Testing set shape: (412, 6) (412,)
Unique values in y_test: [0 1 2 3]
Unique values of y_bins: [0 1 2 3]


Training set shape: (1645, 6) (1645,)

The training set contains 1645 samples (rows) with 6 features (columns) each, plus one target value per sample.

Testing set shape: (412, 6) (412,)

The testing set contains 412 samples with the same 6 features, plus one target value per sample.

In summary, the shapes give the number of samples, features, and target values in each set. The sets are now ready for training and evaluating a machine learning model, such as logistic regression, for stock price prediction.
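
The reported sizes can be reproduced with a little arithmetic. The sketch below assumes the test share is rounded up and the remainder goes to training, which is consistent with the shapes printed above.

```python
import math

n_samples = 2057  # rows reported by data.shape
test_size = 0.2   # fraction passed to train_test_split

# Round the test share up and give the remainder to training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print("train:", n_train, "test:", n_test)
```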

We chose an 80/20 split for the following reasons:

Balancing Training and Testing Data: An 80/20 split is a common choice that strikes a balance between having enough data for training the model effectively and having a sufficient amount of data for evaluating its performance. With 80% of the data allocated for training, the model can learn from a substantial portion of the dataset, while the 20% testing set provides a reasonable amount of data for evaluating the model's generalization performance.

Trade-off between Bias and Variance: A larger training set (e.g., 80%) can help reduce bias by providing more data for the model to learn from, potentially leading to a more accurate model. However, a smaller testing set (e.g., 20%) can increase the variance of performance estimates, as there is less data available for evaluation. Nonetheless, in practice, an 80/20 split is often sufficient for obtaining reliable performance estimates while ensuring efficient use of data for training and testing.

Computational Efficiency: Using a smaller testing set can lead to faster model evaluation, which is advantageous when experimenting with different models or hyperparameters.

Overall, the 80/20 split is a widely used and practical choice for dividing data into training and testing sets, offering a good balance between model training and evaluation requirements. If you have specific considerations or requirements that suggest a different split ratio, feel free to adjust it accordingly.
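
Since the rows are ordered by date, one consideration that might suggest a different split is keeping the test set strictly later in time than the training set. The helper below is a hypothetical sketch of such a chronological 80/20 split using row positions only; `chronological_split` is not part of the notebook or of scikit-learn.

```python
import numpy as np

def chronological_split(n_rows, test_fraction=0.2):
    """Return (train_idx, test_idx) where the test set is the most recent rows."""
    n_test = int(np.ceil(test_fraction * n_rows))
    split_at = n_rows - n_test
    indices = np.arange(n_rows)
    return indices[:split_at], indices[split_at:]

# 2057 is the row count reported by data.shape.
train_idx, test_idx = chronological_split(2057, test_fraction=0.2)
print(len(train_idx), len(test_idx))
print("last training row:", train_idx[-1], "first test row:", test_idx[0])
```

With a DataFrame sorted by date, these positions could be passed to `.iloc` to select the two sets.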






In [11]:
# define the Binary Logistic Regression class
class BinaryLogisticRegression:
    def __init__(self, eta, iterations=20):
        self.eta = eta
        self.iters = iterations
        # Internally store the weights as self.w_ to keep with sklearn conventions
    
    def __str__(self):
        if hasattr(self, 'w_'):
            return 'Binary Logistic Regression Object with coefficients:\n' + str(self.w_)  # If we have trained the object
        else:
            return 'Untrained Binary Logistic Regression Object'
        
    def _sigmoid(self, theta):
        return 1 / (1 + np.exp(-theta))
    
    def _add_intercept(self, X):
        return np.hstack((np.ones((X.shape[0], 1)), X))
    
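    def _get_gradient(self, X, y):
        # average gradient of the log-likelihood over all samples
        gradient = np.zeros(self.w_.shape)
        for (xi, yi) in zip(X, y):
            # per-sample term: (label - predicted probability) * features
            gradi = (yi - self._sigmoid(xi @ self.w_)) * xi
            gradient += gradi.reshape(self.w_.shape)

        return gradient / float(len(y))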
       
    def fit(self, X, y):
        Xb = self._add_intercept(X)
        num_samples, num_features = Xb.shape
        self.w_ = np.zeros((num_features, 1))
        
        for _ in range(self.iters):
            gradient = self._get_gradient(Xb, y)
            self.w_ += gradient * self.eta
            
    def predict_proba(self, X):
        Xb = self._add_intercept(X)
        return self._sigmoid(Xb @ self.w_)
    
    def predict(self, X):
        return (self.predict_proba(X) > 0.5).astype(int)

# define the MultiClass Logistic Regression class
class MultiClassLogisticRegression:
    def __init__(self, eta=0.1, iterations=20):
        self.eta = eta
        self.iters = iterations
        self.classifiers_ = []
        self.unique_ = None
    
    def __str__(self):
        if hasattr(self, 'w_'):
            return 'MultiClass Logistic Regression Object with coefficients:\n' + str(self.w_)  # If we have trained the object
        else:
            return 'Untrained MultiClass Logistic Regression Object'
        
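    def fit(self, X, y):
        self.unique_ = np.unique(y)
        self.classifiers_ = []  # reset so that calling fit again does not keep old classifiers
        # one-vs-rest: train one binary classifier per class
        for yval in self.unique_:
            binary_y = (y == yval).astype(int)
            blr = BinaryLogisticRegression(eta=self.eta, iterations=self.iters)
            blr.fit(X, binary_y)
            self.classifiers_.append(blr)
        # stack the per-class weight vectors, one row per class
        self.w_ = np.hstack([x.w_ for x in self.classifiers_]).T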
        
    def predict_proba(self, X):
        probs = []
        for blr in self.classifiers_:
            probs.append(blr.predict_proba(X).reshape(-1, 1))
        return np.hstack(probs)
    
    def predict(self, X):
        return self.unique_[np.argmax(self.predict_proba(X), axis=1)]
    
# train the logistic regression model
mlr = MultiClassLogisticRegression()
mlr.fit(X_train, y_train)

# evaluate the trained model on the held-out test set
predictions = mlr.predict(X_test)
# y_bins[X_test.index] selects the same labels as y_test, so compare against y_test directly
print('Accuracy:', accuracy_score(y_test, predictions))

# All credit to instructor code

Accuracy: 0.9514563106796117
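
An accuracy figure is easier to interpret next to a simple baseline. The sketch below, using made-up labels in place of the notebook's y_train and y_test, computes the accuracy of always predicting the most frequent training class. Equal-width bins can produce unbalanced classes, in which case this baseline may already be high.

```python
import numpy as np

# Made-up labels for illustration; in the notebook these would be y_train and y_test.
y_train_demo = np.array([0, 0, 0, 0, 0, 0, 1, 1, 2, 3])
y_test_demo = np.array([0, 0, 0, 1, 2])

# Find the most frequent class in the training labels.
classes, counts = np.unique(y_train_demo, return_counts=True)
majority_class = classes[np.argmax(counts)]

# Accuracy of a "model" that always predicts that class.
baseline_accuracy = np.mean(y_test_demo == majority_class)

print("class counts:", dict(zip(classes.tolist(), counts.tolist())))
print("majority class:", majority_class)
print("baseline accuracy:", baseline_accuracy)
```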


In [16]:
# define actions based on predicted labels
# labels 0..3 are the four pd.cut bins of 'Close', from lowest to highest
actions = {0: "BUY THE STOCK", 1: "SELL THE STOCK", 2: "HOLD ONTO THE STOCK", 3: "NO ACTION"}

# print interpretations of predictions along with stock prices and correctness
for idx, (actual, predicted) in enumerate(zip(y_test[:10], predictions[:10])):
    stock_price = X_test.iloc[idx]['Close']  # 'Close' was standardized earlier, so this is a z-score, not a raw price
    correct = "Correct" if actual == predicted else "Incorrect"
    print(f"Instance {idx + 1}: Stock Price: {stock_price} | Predicted Action: {actions[predicted]} | Actual Action: {actions[actual]} | Prediction: {correct}")


Instance 1: Stock Price: -0.38356534094497124 | Predicted Action: BUY THE STOCK | Actual Action: BUY THE STOCK | Prediction: Correct
Instance 2: Stock Price: 2.881975321865618 | Predicted Action: HOLD ONTO THE STOCK | Actual Action: HOLD ONTO THE STOCK | Prediction: Correct
Instance 3: Stock Price: -0.5851094679218563 | Predicted Action: BUY THE STOCK | Actual Action: BUY THE STOCK | Prediction: Correct
Instance 4: Stock Price: -0.4854956803452672 | Predicted Action: BUY THE STOCK | Actual Action: BUY THE STOCK | Prediction: Correct
Instance 5: Stock Price: -0.02680915304393465 | Predicted Action: BUY THE STOCK | Actual Action: BUY THE STOCK | Prediction: Correct
Instance 6: Stock Price: -0.37198258182643495 | Predicted Action: BUY THE STOCK | Actual Action: BUY THE STOCK | Prediction: Correct
Instance 7: Stock Price: 2.9797852877554796 | Predicted Action: HOLD ONTO THE STOCK | Actual Action: HOLD ONTO THE STOCK | Prediction: Correct
Instance 8: Stock Price: -0.4391641290818278 | Predi