# Exploratory Data Analysis of Stock Options Flow Dataset

## What is Exploratory Data Analysis (EDA)
Exploratory Data Analysis (EDA) is a crucial step in the data analysis process that involves the visual and statistical examination of data to identify patterns, relationships, and anomalies. It provides an understanding of the underlying structure of the data and allows for insights to be gathered and hypotheses to be formed before proceeding to more complex analysis.

## Introduction to the Project
This project performs EDA on a stock options flow dataset to gain insights into the stock options market. The dataset contains information on the trading volume, open interest, and pricing of various stock options contracts.

## Contents
1. Overview of the Stock Options Flow Dataset
2. Data Cleaning and Preprocessing
3. Univariate Analysis
4. Bivariate Analysis
5. Multivariate Analysis

## Some Advanced Stuff
1. Detailed Filtering
2. Clustering and Dimensionality Reduction (***TODO***)


## Main Idea
The main idea of this project is to perform exploratory data analysis (EDA) on a stock options flow dataset to gain insights into the stock options market. The project covers various data analysis techniques, including univariate, bivariate, multivariate, and time series analysis, as well as advanced techniques such as clustering and dimensionality reduction (***TODO***).

## Learning Objectives
1. To gain hands-on experience in performing EDA on a real-world stock options flow dataset
2. To understand the importance of data cleaning and preprocessing in the data analysis process
3. To be able to apply various data analysis techniques, including univariate, bivariate, multivariate, and time series analysis
4. To understand advanced techniques such as clustering and dimensionality reduction and their applications in stock options analysis
5. To gain insights into the stock options market through the analysis of the stock options flow dataset

## How it's Helpful
This project is helpful in gaining a deeper understanding of the stock options market and the techniques used in data analysis. The project provides hands-on experience in performing exploratory data analysis, which is a valuable skill for anyone interested in data analysis, finance, or stock options. The insights gained through the analysis of the stock options flow dataset can be useful for investors and traders in making informed decisions in the stock options market.

## Disclaimer
This project is for educational and informational purposes only. The findings and conclusions are based solely on the data provided by [tradytics.com](https://tradytics.com) and should not be taken as financial advice of any sort.

All rights to the data belong to [tradytics.com](https://tradytics.com), and it is a paid software. The data used in this project is for ***educational purposes only and should not be used for commercial or financial gain***.

In [1]:
# Import needed libraries
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

# Set the default line width and line style for plots
mpl.rcParams['lines.linewidth'] = 2
mpl.rcParams['lines.linestyle'] = '-'

# Get data into notebook
options_flow_data = "https://raw.githubusercontent.com/muhammadanas0716/Machine-Learning-101/main/Projects/Stock%20Options%20Flow%20Analysis%20(EDA)/OptionsFlow.csv"
df_options_flow = pd.read_csv(options_flow_data)

## 1. Overview of the Stock Options Flow Dataset


In [3]:
# First 5 rows of our data
df_options_flow.head(5)

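| | Uid | Time | Symbol | C/P | Trade | Side | Str | Spot | Exp | Size | Prems | Price | Bid | Ask | Vol | OI | Ivol | Label | Events | Details |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14479225 | 2023-02-06 10:51:33 | UBER | CALL | NaN | TO ASK | 35.0 | 33.09 | 2023-09-15 | 100 | 46.10K | 4.61 | 4.55 | 4.65 | 528 | 3146 | 48.28 | NaN | NaN | NaN |
| 1 | 14479223 | 2023-02-06 10:51:33 | UBER | CALL | NaN | TO BID | 37.5 | 33.09 | 2024-01-19 | 100 | 49.10K | 4.91 | 4.85 | 5.0 | 344 | 7081 | 45.92 | NaN | NaN | NaN |
| 2 | 14479230 | 2023-02-06 10:51:31 | AAPL | CALL | SWEEP | AT ASK | 150.0 | 152.24 | 2023-02-17 | 85 | 39.52K | 4.65 | 4.6 | 4.65 | 2454 | 50589 | 32.19 | NaN | NaN | NaN |
| 3 | 14479214 | 2023-02-06 10:51:31 | GM | CALL | NaN | TO BID | 40.0 | 41.24 | 2023-04-21 | 99 | 32.47K | 3.28 | 3.25 | 3.35 | 432 | 2827 | 33.23 | NaN | NaN | NaN |
| 4 | 14479224 | 2023-02-06 10:51:30 | QQQ | PUT | SWEEP | BLW BID | 305.0 | 304.42 | 2023-02-06 | 268 | 41.17K | 1.54 | 1.6 | 1.62 | 19016 | 11550 | 41.74 | NaN | NaN | NaN |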


In [4]:
# Last 5 rows of our data
df_options_flow.tail(5)

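| | Uid | Time | Symbol | C/P | Trade | Side | Str | Spot | Exp | Size | Prems | Price | Bid | Ask | Vol | OI | Ivol | Label | Events | Details |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 11474 | 14464870 | 2023-02-06 08:41:23 | SPX | PUT | NaN | TO ASK | 4080.0 | 4105.16 | 2023-02-06 | 177 | 100.89K | 5.7 | 5.5 | 5.8 | 1109 | 1304 | 31.91 | NaN | NaN | NaN |
| 11475 | 14464869 | 2023-02-06 08:32:26 | SPX | PUT | NaN | TO BID | 4100.0 | 4108.64 | 2023-02-06 | 150 | 165.00K | 11.0 | 10.9 | 11.2 | 848 | 6038 | 31.23 | NaN | NaN | NaN |
| 11476 | 14464868 | 2023-02-06 08:26:34 | SPX | CALL | SPLIT | ABV ASK | 4150.0 | 4110.88 | 2023-02-06 | 819 | 195.50K | 2.39 | 2.2 | 2.35 | 4558 | 3935 | 28.98 | NaN | NaN | NaN |
| 11477 | 14464864 | 2023-02-06 08:22:53 | SPX | CALL | NaN | TO BID | 4065.0 | 4112.13 | 2023-02-06 | 100 | 473.80K | 47.38 | 46.9 | 47.9 | 200 | 169 | 20.37 | NaN | NaN | NaN |
| 11478 | 14464863 | 2023-02-06 08:21:28 | SPX | CALL | NaN | TO BID | 4065.0 | 4113.62 | 2023-02-06 | 100 | 487.30K | 48.73 | 48.2 | 49.3 | 100 | 169 | 18.46 | NaN | NaN | NaN |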


In [24]:
# Shape (dimensions) of our data
print(f"Rows: {df_options_flow.shape[0]}")
print(f"Columns: {df_options_flow.shape[1]}")

# Total Elements in our data
print(f"Total Elements: {df_options_flow.size}")

# Column names in our data
column_list = df_options_flow.columns.tolist()
print(column_list)

Rows: 11479
Columns: 20
Total Elements: 229580
['Uid', 'Time', 'Symbol', 'C/P', 'Trade', 'Side', 'Str', 'Spot', 'Exp', 'Size', 'Prems', 'Price', 'Bid', 'Ask', 'Vol', 'OI', 'Ivol', 'Label', 'Events', 'Details']


The columns listed above have the following meanings. Several descriptions are inferred from the sample rows rather than taken from the data provider's documentation, so treat them as working assumptions.

1. Uid: Unique identifier for each trade record
2. Time: Timestamp of the trade
3. Symbol: Ticker symbol of the underlying security
4. C/P: Option type, either CALL or PUT
5. Trade: How the order was executed, for example SWEEP or SPLIT; missing for many rows
6. Side: Where the trade price fell relative to the quote, for example TO ASK, AT ASK, TO BID, BLW BID (below the bid), or ABV ASK (above the ask)
7. Str: Strike price of the option
8. Spot: Price of the underlying security at the time of the trade
9. Exp: Expiration date of the option
10. Size: Number of contracts traded
11. Prems: Total premium paid for the trade, stored as text with a suffix (e.g. 46.10K); in the sample rows it is roughly Size × Price × 100
12. Price: Price at which the option traded, quoted per share
13. Bid: Highest bid price for the option
14. Ask: Lowest ask price for the option
15. Vol: Trading volume for the option
16. OI: Open interest for the option, which is the total number of contracts outstanding
17. Ivol: Implied volatility of the option, apparently in percent (e.g. 48.28)
18. Label: A classification label for the trade; populated for only 1,316 of the 11,479 rows, and its exact meaning is not documented here
19. Events: Events related to the trade, such as earnings announcements or dividend payments; populated for only 422 rows
20. Details: Additional details about the trade; empty for every row in this dataset
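The `Side` values in the sample rows look like a description of where the trade price sits relative to the quoted bid and ask. The sketch below encodes one possible reading of that rule. The function name `classify_side`, the `AT BID` label (added by symmetry, it does not appear in the printed rows), and the midpoint tie-break are all assumptions, not the data provider's documented logic. The rule has only been checked against the rows shown above.

```python
import math


def classify_side(price: float, bid: float, ask: float) -> str:
    """Guess a Side label from the trade price and the quote (hypothetical rule)."""
    if math.isclose(price, ask):
        return "AT ASK"
    if math.isclose(price, bid):
        return "AT BID"  # assumed by symmetry; not seen in the sample rows
    if price > ask:
        return "ABV ASK"
    if price < bid:
        return "BLW BID"
    # Inside the spread: label by the nearer side of the quote.
    # A price exactly at the midpoint goes to the ask, an arbitrary choice.
    mid = (bid + ask) / 2
    return "TO ASK" if price >= mid else "TO BID"


# (Price, Bid, Ask, Side) copied from the head() and tail() outputs
sample = [
    (4.61, 4.55, 4.65, "TO ASK"),
    (4.91, 4.85, 5.00, "TO BID"),
    (4.65, 4.60, 4.65, "AT ASK"),
    (3.28, 3.25, 3.35, "TO BID"),
    (1.54, 1.60, 1.62, "BLW BID"),
    (2.39, 2.20, 2.35, "ABV ASK"),
]
for price, bid, ask, expected in sample:
    guess = classify_side(price, bid, ask)
    print(f"price={price} bid={bid} ask={ask} -> {guess} (dataset: {expected})")
```

If the guesses disagree with the `Side` column on the full dataset, the rule above is wrong or incomplete and should be revised rather than the data.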

In [16]:
# Some info about our data
df_options_flow.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11479 entries, 0 to 11478
Data columns (total 20 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   Uid      11479 non-null  int64  
 1   Time     11479 non-null  object 
 2   Symbol   11479 non-null  object 
 3   C/P      11479 non-null  object 
 4   Trade    6887 non-null   object 
 5   Side     9596 non-null   object 
 6   Str      11479 non-null  float64
 7   Spot     11479 non-null  float64
 8   Exp      11479 non-null  object 
 9   Size     11479 non-null  int64  
 10  Prems    11479 non-null  object 
 11  Price    11479 non-null  float64
 12  Bid      11479 non-null  float64
 13  Ask      11479 non-null  float64
 14  Vol      11479 non-null  int64  
 15  OI       11479 non-null  int64  
 16  Ivol     11479 non-null  float64
 17  Label    1316 non-null   object 
 18  Events   422 non-null    object 
 19  Details  0 non-null      float64
dtypes: float64(7), int64(4), object(9)
memory usage: 1

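The output is mostly self-explanatory, but three things are worth noting:

`Time`, `Exp`, and `Prems` are stored as `object` (strings), so they need to be converted before any date arithmetic or numeric analysis.

`Trade`, `Side`, `Label`, and `Events` are only partially populated, and `Details` has no non-null values at all, which is why pandas reads it as `float64`.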

***GLOSSARY***:
1. **int64**: Integers (Whole Numbers)
2. **float64**: Float Values (Decimals)
3. **object**: Strings (Characters, Words etc.)
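Because `info()` reports `Time`, `Exp`, and `Prems` as `object`, a conversion step is needed before they can be used numerically. The following is a minimal sketch on three rows copied from the `head()` output. The helper `parse_premium`, the new column names `Prems_num` and `DTE`, and the `M`/`B` suffixes are assumptions for illustration; only the `K` suffix appears in the printed rows.

```python
import pandas as pd

# Three rows copied from the head() output
df = pd.DataFrame({
    "Time": ["2023-02-06 10:51:33", "2023-02-06 10:51:31", "2023-02-06 10:51:30"],
    "Exp": ["2023-09-15", "2023-02-17", "2023-02-06"],
    "Prems": ["46.10K", "39.52K", "41.17K"],
})

# Suffix multipliers; "M" and "B" are assumed, only "K" is seen in the sample
SUFFIXES = {"K": 1e3, "M": 1e6, "B": 1e9}


def parse_premium(text: str) -> float:
    """Convert a string such as '46.10K' to a float (hypothetical helper)."""
    text = text.strip().upper()
    if text and text[-1] in SUFFIXES:
        return round(float(text[:-1]) * SUFFIXES[text[-1]], 2)
    return float(text)


df["Time"] = pd.to_datetime(df["Time"])
df["Exp"] = pd.to_datetime(df["Exp"])
df["Prems_num"] = df["Prems"].map(parse_premium)

# Calendar days from the trade date to expiration
df["DTE"] = (df["Exp"] - df["Time"].dt.normalize()).dt.days

print(df.dtypes)
print(df[["Prems_num", "DTE"]])
```

The same three assignments can be applied to `df_options_flow` once the suffixes present in the full `Prems` column have been checked.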


In [25]:
df_options_flow.describe()

| | Uid | Str | Spot | Size | Price | Bid | Ask | Vol | OI | Ivol | Details |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 11479.0 | 11479.0 | 11479.0 | 11479.0 | 11479.0 | 11479.0 | 11479.0 | 11479.0 | 11479.0 | 11479.0 | 0.0 |
| mean | 14472040.0 | 389.847818 | 397.68025 | 404.51102 | 7.505449 | 7.415742 | 7.588984 | 7575.078056 | 11923.239481 | 60.007839 | NaN |
| std | 4176.802 | 853.361029 | 882.065843 | 2121.509902 | 15.060311 | 14.981002 | 15.1378 | 16800.937335 | 20521.796878 | 51.064679 | NaN |
| min | 14464860.0 | 0.5 | 1.01 | 50.0 | 0.03 | 0.0 | 0.0 | 2.0 | 0.0 | 8.28 | NaN |
| 25% | 14468420.0 | 100.0 | 102.25 | 100.0 | 2.05 | 2.02 | 2.08 | 321.5 | 1598.0 | 28.385 | NaN |
| 50% | 14472030.0 | 192.5 | 193.71 | 150.0 | 4.15 | 4.1 | 4.2 | 1220.0 | 4998.0 | 44.11 | NaN |
| 75% | 14475690.0 | 305.5 | 304.35 | 299.0 | 7.32 | 7.25 | 7.4 | 5695.5 | 12610.0 | 82.935 | NaN |
| max | 14479240.0 | 5000.0 | 4362.95 | 115561.0 | 637.0 | 635.2 | 637.4 | 138003.0 | 318507.0 | 739.71 | NaN |


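Some statistical numbers for our data. The `Uid` statistics carry no meaning because it is an identifier. `Size`, `Vol`, and `OI` are heavily right-skewed: the mean `Size` (about 404.5) is far above the median (150), and the maximum is 115,561. The minimum `Size` of 50 suggests the flow only includes trades of at least 50 contracts. `Details` has a count of 0, so it contains no data.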



In [27]:
# Checking for NULL/NAN values
df_options_flow.isna().sum()

Uid            0
Time           0
Symbol         0
C/P            0
Trade       4592
Side        1883
Str            0
Spot           0
Exp            0
Size           0
Prems          0
Price          0
Bid            0
Ask            0
Vol            0
OI             0
Ivol           0
Label      10163
Events     11057
Details    11479
dtype: int64
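Raw counts are easier to judge as a share of the 11,479 rows: `Trade` is missing in about 40.0% of rows, `Side` in 16.4%, `Label` in 88.5%, `Events` in 96.3%, and `Details` in 100%. The sketch below shows how those percentages can be computed and how a fully empty column can be dropped. The frame `toy` and its values are illustrative stand-ins that mimic the missingness pattern, not the real data.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the missingness pattern (values are illustrative)
toy = pd.DataFrame({
    "Symbol": ["UBER", "AAPL", "GM", "QQQ"],
    "Trade": [np.nan, "SWEEP", np.nan, "SWEEP"],
    "Details": [np.nan, np.nan, np.nan, np.nan],
})

# Share of missing values per column, as a percentage
missing_pct = toy.isna().mean().mul(100).round(1)
print(missing_pct)

# Drop columns that contain no data at all
cleaned = toy.dropna(axis=1, how="all")
print(cleaned.columns.tolist())
```

Replacing `toy` with `df_options_flow` gives the percentages for the real dataset. Whether to drop or fill partially populated columns such as `Trade` depends on what a missing value means there, which the output alone does not settle.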

In [28]:
df_options_flow["Trade"]

0          NaN
1          NaN
2        SWEEP
3          NaN
4        SWEEP
         ...  
11474      NaN
11475      NaN
11476    SPLIT
11477      NaN
11478      NaN
Name: Trade, Length: 11479, dtype: object

In [2]:
!pip install polars

Defaulting to user installation because normal site-packages is not writeable
Collecting polars
  Downloading polars-0.16.6-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (15.2 MB)
Installing collected packages: polars
Successfully installed polars-0.16.6


In [3]:
import numpy as np
import pandas as pd
import polars as pl

# Generate the data once so both libraries aggregate identical values
n = 10_000_000
data = {
    'A': np.random.randint(0, 100, n),
    'B': np.random.randint(0, 100, n),
    'C': np.random.randn(n),
}

pandas_df = pd.DataFrame(data)
polars_df = pl.DataFrame(data)


In [None]:
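# Using Pandas
%timeit pandas_df.groupby('A').agg({'B': 'sum', 'C': 'mean'})

# Using Polars: aggregations are passed as expressions, not as a dict
# (`groupby` matches the installed 0.16.6; newer releases call it `group_by`)
%timeit polars_df.groupby('A').agg([pl.col('B').sum(), pl.col('C').mean()])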