# Logistic Regression

## Import the required modules

In [2]:
import pandas as pd
from pathlib import Path
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

### Step 1: Read in the CSV file from the Resources folder into a Pandas DataFrame.

In [3]:
# Read the csv file into a Pandas DataFrame
arbilogreg_df = pd.read_csv(
    Path('../Resources/encoded_arbi_data.csv')
)

# Review the DataFrame
arbilogreg_df.head() 

| Unnamed: 0 | hour_of_day | volume | buy_price | total_purchase_amount | sell_price | total_sale_amount | price_difference | is_profitable | gold_close | spy_close | ... | buy_exchange_Binanceus | buy_exchange_Bitstamp | buy_exchange_Gemini | buy_exchange_Kraken | buy_exchange_Poloniex | sell_exchange_Binanceus | sell_exchange_Bitstamp | sell_exchange_Gemini | sell_exchange_Kraken | sell_exchange_Poloniex |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 23 | 0.008658 | 69300.0 | 600.0 | 69474.95 | 601.514719 | 174.95 | 0 | 17.95 | 518.51 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 1 | 23 | 989.184242 | 0.9103 | 900.454416 | 0.91553 | 905.627849 | 0.00523 | 0 | 17.95 | 518.51 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 2 | 23 | 2.512527 | 179.193 | 450.227208 | 179.66 | 451.400558 | 0.467 | 0 | 17.95 | 518.51 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 23 | 2.51784 | 179.193 | 451.179225 | 179.66 | 452.355056 | 0.467 | 0 | 17.95 | 518.51 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 23 | 2.520005 | 179.109 | 451.3556 | 179.66 | 452.744123 | 0.551 | 0 | 17.95 | 518.51 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |


### Step 2: Create a Series named y that contains the data from the "is_profitable" column of the original DataFrame. Note that this Series will contain the labels. Create a new DataFrame named X that contains the remaining columns from the original DataFrame. Note that this DataFrame will contain the features.

In [4]:
# Split the data into X (features) and y (labels)

# The y variable should focus on the is_profitable column
y = arbilogreg_df['is_profitable']

# The X variable should include all features except the is_profitable column
X = arbilogreg_df.drop(columns=['is_profitable'])


### Step 3: Split the features and labels into training and testing sets.

In [5]:
# Split the data into training and testing sets; stratify on y so both sets keep the same class balance
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, stratify=y)
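
To make the effect of `stratify=y` concrete, here is a minimal standard-library sketch of a stratified split. It is not scikit-learn's implementation: the function name `stratified_split`, the 25% test fraction, and the 960/40 label list are assumptions chosen for illustration only.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.25, seed=1):
    """Return (train_idx, test_idx), splitting each class in the same proportion."""
    rng = random.Random(seed)

    # Group row indices by their class label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical imbalanced labels: 960 negatives and 40 positives.
labels = [0] * 960 + [1] * 40
train_idx, test_idx = stratified_split(labels)

train_rate = sum(labels[i] for i in train_idx) / len(train_idx)
test_rate = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx))  # 750 250
print(train_rate, test_rate)          # both 0.04
```

Because each class is split separately, the rare class cannot end up concentrated in only one of the two sets.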

### Step 4: Check the magnitude of imbalance in the data set by viewing the number of distinct values (value_counts) for the labels.

In [6]:
# Count the distinct values in the training labels
y_train.value_counts()

is_profitable
0    13422
1      313
Name: count, dtype: int64
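
The size of the imbalance can be quantified from these counts. The sketch below hard-codes the training counts shown above and the test-set support values reported in the classification report in Step 10; the variable and function names are illustrative.

```python
# Counts copied from this notebook's outputs (training value_counts and test support).
train_counts = {0: 13422, 1: 313}
test_counts = {0: 4475, 1: 104}


def positive_rate(counts):
    """Share of rows labeled 1 (profitable)."""
    return counts[1] / (counts[0] + counts[1])


train_rate = positive_rate(train_counts)
test_rate = positive_rate(test_counts)
imbalance_ratio = train_counts[0] / train_counts[1]

print(f"profitable share, train: {train_rate:.4f}")          # 0.0228
print(f"profitable share, test:  {test_rate:.4f}")           # 0.0227
print(f"non-profitable per profitable trade: {imbalance_ratio:.1f}")  # 42.9
```

The near-identical shares in the two sets are what the stratified split in Step 3 is expected to produce.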

### Step 5: Create a Logistic Regression Model

In [7]:
# Instantiate the logistic regression model with default settings and a fixed random_state
lrmodel = LogisticRegression(random_state=1)
lrmodel

### Step 6: Fit the Model

In [8]:
# Fit the model to the training data
lrmodel.fit(X_train, y_train)

### Step 7: Make predictions

In [9]:
# Predict outcomes for test data set
predictions = lrmodel.predict(X_test)

### Step 8: Calculate the Accuracy Score

In [10]:
# Display the accuracy score for the test dataset.
accuracy_score(y_test, predictions)

0.9995632234112252

### Step 9: Generate a Confusion Matrix

In [11]:
# Rows are actual classes, columns are predicted classes
print('Confusion Matrix')
print(confusion_matrix(y_test, predictions))

Confusion Matrix
[[4475    0]
 [   2  102]]
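
The layout of this matrix (rows are actual classes, columns are predicted classes) can be illustrated with a small standard-library tally. The helper `confusion_counts` and the label lists are hypothetical and only mirror the layout that `confusion_matrix` uses for binary labels.

```python
def confusion_counts(y_true, y_pred):
    """Tally a 2x2 matrix: rows = actual class, columns = predicted class."""
    matrix = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix


# Hypothetical labels, chosen so every cell is easy to check by eye.
y_true = [0, 0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0]

matrix = confusion_counts(y_true, y_pred)
(tn, fp), (fn, tp) = matrix
print(matrix)          # [[3, 1], [1, 2]]
print(tn, fp, fn, tp)  # 3 1 1 2
```

Read the same way, the matrix above has 4475 true negatives, 0 false positives, 2 false negatives, and 102 true positives.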


### Step 10: Print the classification report

In [14]:
print('Classification Report - Logistic Regression Model')
print(classification_report(y_test, predictions, digits=4))

Classification Report - Logistic Regression Model
              precision    recall  f1-score   support

           0     0.9996    1.0000    0.9998      4475
           1     1.0000    0.9808    0.9903       104

    accuracy                         0.9996      4579
   macro avg     0.9998    0.9904    0.9950      4579
weighted avg     0.9996    0.9996    0.9996      4579
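
Every figure in this report can be recomputed from the four confusion-matrix counts in Step 9. The following sketch does so with plain arithmetic; the helper name `precision_recall_f1` is illustrative, and the counts are copied from the notebook output.

```python
# Counts from the Step 9 confusion matrix (rows = actual, columns = predicted).
tn, fp, fn, tp = 4475, 0, 2, 102


def precision_recall_f1(true_pos, false_pos, false_neg):
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Class 1 treats label 1 as the positive class; class 0 swaps the roles.
p1, r1, f1_1 = precision_recall_f1(tp, fp, fn)
p0, r0, f1_0 = precision_recall_f1(tn, fn, fp)

support0, support1 = tn + fp, fn + tp
total = support0 + support1

accuracy = (tn + tp) / total
macro_f1 = (f1_0 + f1_1) / 2
weighted_f1 = (support0 * f1_0 + support1 * f1_1) / total

print(f"class 0: precision={p0:.4f} recall={r0:.4f} f1={f1_0:.4f}")
print(f"class 1: precision={p1:.4f} recall={r1:.4f} f1={f1_1:.4f}")
print(f"accuracy={accuracy:.4f} macro f1={macro_f1:.4f} weighted f1={weighted_f1:.4f}")
```

The macro average weights both classes equally, while the weighted average is dominated by class 0 because of its much larger support.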



# Evaluation

### What is Supervised Learning?

Supervised learning is a type of machine learning where we teach a computer using labeled data. This means we provide the computer with examples where both the input data (features) and the correct output (labels) are known. The computer learns from these examples by identifying patterns and relationships between the features and the labels. Once the model is trained on this labeled data, it can then make predictions or decisions when given new, unseen data. The goal is to create a model that can accurately map inputs to outputs based on the patterns it has learned from the training examples.

### What are the labels?

In our dataset, the "is_profitable" column holds the labels. Each label records whether a trade resulted in a profit: '0' marks a trade that was not profitable, and '1' marks a trade that was profitable. This binary labeling gives every trade a clear financial outcome for the model to learn from and predict.
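
As a rough illustration of how such a label could be derived, the sketch below applies the 5% spread threshold mentioned in the Logistic Regression section. The spread formula (price gap relative to the buy price) and the function names are assumptions; the notebook does not show how "is_profitable" was actually computed.

```python
PROFIT_THRESHOLD_PCT = 5.0  # threshold stated in this write-up


def spread_pct(buy_price, sell_price):
    """Assumed definition: price gap as a percentage of the buy price."""
    return (sell_price - buy_price) / buy_price * 100


def label_trade(buy_price, sell_price, threshold_pct=PROFIT_THRESHOLD_PCT):
    """Return 1 (profitable) when the spread exceeds the threshold, else 0."""
    return 1 if spread_pct(buy_price, sell_price) > threshold_pct else 0


# First row of the DataFrame preview: buy at 69300.0, sell at 69474.95.
print(round(spread_pct(69300.0, 69474.95), 2))  # 0.25
print(label_trade(69300.0, 69474.95))           # 0

# Hypothetical trade with a 6% spread.
print(label_trade(100.0, 106.0))                # 1
```

Under this assumed formula, the first previewed row has a spread of about 0.25%, which agrees with its "is_profitable" value of 0.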

### Logistic Regression

Logistic Regression is a statistical method for predicting binary outcomes from input data. Our dataset includes a binary label that marks each trade as profitable or not, which makes logistic regression a natural choice for this prediction task.

- The label behind this binary classification is defined by the trade's spread percentage: a trade is marked profitable when its spread is higher than 5%. This threshold separates the two classes and lets the model learn the patterns associated with profitable trades.
- To prepare the data for the model, we split our dataset into training and testing sets using the "train_test_split" function, stratifying on the labels so both sets keep the same share of profitable trades. One set is used to train the model and the other to test its predictions.
- The logistic regression process involves three main steps: object creation, fitting, and prediction. First, we create a logistic regression object. Then, we fit this object to the training data, teaching it to recognize the classification patterns related to profitable trades. Finally, we use the fitted model to make predictions.
- After fitting the model, we make predictions on the test dataset and measure them with the "accuracy_score" function. The model achieved an accuracy of 99.96% (4,577 of 4,579 test trades classified correctly). Because only about 2.3% of the test trades are profitable, a model that always predicted "not profitable" would already score 97.73%, so accuracy should be read alongside the precision and recall for class 1. No model is infallible: even with high accuracy, some trades may still result in losses, so the predictions should inform decisions while we continue to weigh potential risks and monitor trade outcomes closely.
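
The create, fit, and predict steps above can be sketched from scratch with a one-feature model trained by plain gradient descent. This is an illustration under stated assumptions, not what scikit-learn does internally (its `LogisticRegression` defaults to the lbfgs solver with L2 regularization). The class name `TinyLogisticRegression`, the hyperparameters, and the spread values are all hypothetical.

```python
import math


def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


class TinyLogisticRegression:
    """Hypothetical one-feature logistic regression fit by gradient descent."""

    def __init__(self, learning_rate=0.1, n_iter=5000):
        self.learning_rate = learning_rate
        self.n_iter = n_iter
        self.weight = 0.0
        self.bias = 0.0

    def fit(self, xs, ys):
        n = len(xs)
        for _ in range(self.n_iter):
            grad_w = grad_b = 0.0
            for x, y in zip(xs, ys):
                # Gradient of the log loss with respect to the linear score.
                error = sigmoid(self.weight * x + self.bias) - y
                grad_w += error * x
                grad_b += error
            self.weight -= self.learning_rate * grad_w / n
            self.bias -= self.learning_rate * grad_b / n
        return self

    def predict_proba(self, x):
        return sigmoid(self.weight * x + self.bias)

    def predict(self, xs):
        return [1 if self.predict_proba(x) >= 0.5 else 0 for x in xs]


# Hypothetical spread percentages with labels: profitable above roughly 5%.
train_x = [1.0, 2.0, 3.0, 4.0, 6.0, 7.0, 8.0, 9.0]
train_y = [0, 0, 0, 0, 1, 1, 1, 1]

model = TinyLogisticRegression()       # 1. create the object
model.fit(train_x, train_y)            # 2. fit it to labeled training data
predictions = model.predict([2.5, 7.5])  # 3. predict on unseen inputs
print(predictions)  # [0, 1]
```

The fitted weight and bias define a decision boundary where the predicted probability crosses 0.5; inputs on either side receive different labels.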

### Examine Confusion Matrix

The confusion matrix breaks down the model's predictions against the actual outcomes: 4475 true negatives, 0 false positives, 2 false negatives, and 102 true positives.
- **False Positives (FP):**
    - **Advantage:** On the test set the model produced zero false positives, meaning it never labeled a non-profitable trade as profitable (precision of 1.0 for class 1). This is valuable because acting on a false positive means entering a trade that might cost us money.
    - **Disadvantage:** While having no false positives helps with risk mitigation, a model that is too conservative in predicting class 1 can pass over trades that would have been profitable.
- **False Negatives (FN):**
    - **Advantage:** The model has only two false negatives: it correctly identified 102 of the 104 profitable trades in the test set, a recall of 98.08% for class 1.
    - **Disadvantage:** Each false negative is a profitable trade the model labeled as non-profitable, which represents a missed opportunity for gains.
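
The trade-off between false positives and false negatives is governed by the probability cutoff used to assign class 1. The sketch below shows the effect on made-up probabilities; in scikit-learn, such probabilities would come from the model's `predict_proba` method, and `predict` corresponds to a cutoff of 0.5. The function name `count_errors` and all values here are hypothetical, not results from this notebook.

```python
def count_errors(probabilities, labels, threshold):
    """Return (false_positives, false_negatives) for a given cutoff."""
    fp = sum(1 for p, y in zip(probabilities, labels) if p >= threshold and y == 0)
    fn = sum(1 for p, y in zip(probabilities, labels) if p < threshold and y == 1)
    return fp, fn


# Hypothetical predicted probabilities of class 1 and the true labels.
probabilities = [0.05, 0.20, 0.45, 0.60, 0.35, 0.55, 0.80, 0.95]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

results = {}
for threshold in (0.3, 0.5, 0.7):
    results[threshold] = count_errors(probabilities, labels, threshold)
    fp, fn = results[threshold]
    print(f"threshold={threshold}: FP={fp}, FN={fn}")
# threshold=0.3: FP=2, FN=0
# threshold=0.5: FP=1, FN=1
# threshold=0.7: FP=0, FN=2
```

Raising the cutoff removes false positives at the price of more missed profitable trades, and lowering it does the reverse.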