In [1]:
import pandas as pd

DATA = '/kaggle/input/microsoft-stock/microsoft_stock_synthetic.csv'
df = pd.read_csv(filepath_or_buffer=DATA, parse_dates=['Date'])
df['year'] = df['Date'].dt.year
df['month'] = df['Date'].dt.month
df.head()

Unnamed: 0,Date,Open,High,Low,Close,Volume,AdjustedClose,Dividend,SplitCoefficient,DailyReturnPercent,year,month
0,2020-01-01,150.56,150.61,150.01,150.14,7878334,151.2,0.0,1.0,0.09,2020,1
1,2020-01-02,149.83,149.89,149.39,149.64,2780798,150.01,0.0,1.0,-0.33,2020,1
2,2020-01-03,149.32,150.22,148.76,150.03,11714555,151.22,0.0,1.0,0.26,2020,1
3,2020-01-04,151.86,152.53,151.67,151.75,24082477,152.38,0.0,1.0,1.15,2020,1
4,2020-01-05,151.89,153.84,151.13,152.63,43187302,153.35,0.0,1.0,0.58,2020,1


First, let's look at the price/volume correlations.

In [2]:
df[['Open', 'High', 'Low', 'Close', 'AdjustedClose', 'Volume']].corr()

Unnamed: 0,Open,High,Low,Close,AdjustedClose,Volume
Open,1.0,0.998923,0.998881,0.998089,0.987086,-0.051173
High,0.998923,1.0,0.997797,0.998141,0.987062,-0.049765
Low,0.998881,0.997797,1.0,0.998222,0.987184,-0.051307
Close,0.998089,0.998141,0.998222,1.0,0.988703,-0.049996
AdjustedClose,0.987086,0.987062,0.987184,0.988703,1.0,-0.046289
Volume,-0.051173,-0.049765,-0.051307,-0.049996,-0.046289,1.0


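What do these correlations tell us? First, the price columns are very highly correlated with one another (every pairwise value is above 0.98), so they carry largely the same information, although none is perfectly correlated with another and so none is strictly redundant. Second, prices and volume are essentially uncorrelated: every price/volume correlation is about -0.05.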
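To make the "nearly redundant" reading concrete, the sketch below lists column pairs whose absolute correlation meets a threshold. The helper name `highly_correlated_pairs`, the 0.99 default, and the demo frame are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def highly_correlated_pairs(frame: pd.DataFrame, threshold: float = 0.99) -> pd.Series:
    """Return column pairs whose absolute Pearson correlation is >= threshold.

    Hypothetical helper for illustration; the threshold is an arbitrary choice.
    """
    corr = frame.corr()
    # Keep only the strict upper triangle so each pair appears once
    # and the diagonal (always 1.0) is excluded.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    # NaN entries (if any survive stacking) compare as False and are dropped here.
    return pairs[pairs.abs() >= threshold]


# Deterministic demo data (not the notebook's data): b is a linear function of a,
# while c alternates and is nearly uncorrelated with both.
a = np.arange(100.0)
demo = pd.DataFrame({"a": a, "b": 2.0 * a + 1.0, "c": np.tile([1.0, -1.0], 50)})
flagged = highly_correlated_pairs(demo, threshold=0.99)
print(flagged)
```

On the notebook's data this could be called as `highly_correlated_pairs(df[['Open', 'High', 'Low', 'Close', 'AdjustedClose']])`. Judging from the matrix above, a 0.99 threshold would flag the six pairs among Open, High, Low, and Close, while every AdjustedClose pair falls just below it.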

In [3]:
from plotly import express
from plotly.offline import init_notebook_mode

init_notebook_mode(connected=True)
express.scatter(data_frame=df, x='Date', y='AdjustedClose', color='year').show(renderer='iframe_connected',)

What do we see? We have only three years of data, and MSFT moved up and down quite a bit over the period of interest but ended not much changed relative to where it began. Let's take a look at the volume time series.
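The visual impression ("three years", "not much changed") can be checked numerically. This is a minimal sketch; `period_summary` is a hypothetical helper and the three-row demo frame is made-up data that only mirrors the notebook's `Date` and `AdjustedClose` column names.

```python
import pandas as pd


def period_summary(frame: pd.DataFrame,
                   date_col: str = "Date",
                   price_col: str = "AdjustedClose") -> dict:
    """Summarize the span of the data and the net price change over it."""
    ordered = frame.sort_values(date_col)  # do not assume the rows are already sorted
    first = float(ordered[price_col].iloc[0])
    last = float(ordered[price_col].iloc[-1])
    return {
        "years": int(ordered[date_col].dt.year.nunique()),
        "first_price": first,
        "last_price": last,
        "net_change_pct": 100.0 * (last / first - 1.0),
    }


# Made-up demo rows, not the notebook's data.
demo = pd.DataFrame({
    "Date": pd.to_datetime(["2020-01-01", "2021-06-01", "2022-12-30"]),
    "AdjustedClose": [100.0, 140.0, 105.0],
})
summary = period_summary(demo)
print(summary)
```

In the notebook, `period_summary(df)` would report the same quantities for the actual file.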

In [4]:
express.scatter(data_frame=df, x='Date', y='Volume', color='year').show(renderer='iframe_connected',)

Wow. That looks a lot like random data. Let's see what its distribution looks like.

In [5]:
express.histogram(data_frame=df, x='Volume', facet_col='year').show(renderer='iframe_connected',)

That looks very random. Even a cursory look at the data card tells us that this isn't actual MSFT data at all, but synthetic data. That explains how prices and volume could be uncorrelated; we usually see some correlation (around -20 percent) for stocks with prices that rise over the long term, although price/volume correlations over the short term can be unpredictable.
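One way to see how much short-term price/volume correlation moves around is a rolling correlation. The sketch below assumes a 60-row window, which is an arbitrary choice, and runs on randomly generated demo data rather than the notebook's file; `rolling_price_volume_corr` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd


def rolling_price_volume_corr(frame: pd.DataFrame,
                              price_col: str = "AdjustedClose",
                              volume_col: str = "Volume",
                              window: int = 60) -> pd.Series:
    """Pearson correlation of price and volume over a trailing window of rows."""
    return frame[price_col].rolling(window).corr(frame[volume_col])


# Random demo data (an assumption for illustration): a random-walk price
# and volume drawn independently of it.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "AdjustedClose": 150.0 + rng.normal(size=250).cumsum(),
    "Volume": rng.integers(1_000_000, 50_000_000, size=250),
})
rolling_corr = rolling_price_volume_corr(demo, window=60)
print(rolling_corr.describe())
```

The first `window - 1` values are NaN because the window is not yet full. On the notebook's data, `rolling_price_volume_corr(df)` could be plotted against `df['Date']` in the same way as the other series.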

Let's take a look at the daily return; we expect that to be random even for real stock data, but we expect its distribution to look closer to Gaussian than uniform.
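The Gaussian-versus-uniform distinction can also be read off two summary statistics instead of a histogram. A uniform distribution has excess kurtosis of -1.2 and a Gaussian has 0; both have zero skew. The sketch below uses pandas' `Series.skew` and `Series.kurt` (which reports excess kurtosis) on generated demo samples; `shape_summary` is a hypothetical helper, not something from the notebook.

```python
import numpy as np
import pandas as pd


def shape_summary(values: pd.Series) -> dict:
    """Sample skewness and excess kurtosis (0 for a Gaussian, -1.2 for a uniform)."""
    return {
        "skew": float(values.skew()),
        "excess_kurtosis": float(values.kurt()),
    }


# Generated reference samples (assumptions for illustration only).
rng = np.random.default_rng(0)
uniform_shape = shape_summary(pd.Series(rng.uniform(size=20_000)))
gaussian_shape = shape_summary(pd.Series(rng.normal(size=20_000)))
print("uniform :", uniform_shape)
print("gaussian:", gaussian_shape)
```

Applied to the notebook's columns, `shape_summary(df['Volume'])` and `shape_summary(df['DailyReturnPercent'])` would give a rough numeric counterpart to the two histograms.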

In [6]:
express.histogram(data_frame=df, x='DailyReturnPercent', facet_col='year').show(renderer='iframe_connected',)

This is actually more Gaussian than we would expect daily returns to be, since real daily returns tend to have fatter tails than a normal distribution. That is another clue, though a difficult one to prove, that we're working with synthetic data.
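A rough way to probe the "more Gaussian than expected" impression is to count how often returns land more than three standard deviations from their mean and compare that with the Gaussian expectation of about 0.27 percent. The sketch below is an assumption-laden illustration: `tail_fraction` and `gaussian_tail_fraction` are hypothetical helpers, and a Student-t sample with 5 degrees of freedom stands in for fat-tailed returns.

```python
import math

import numpy as np
import pandas as pd


def tail_fraction(values: pd.Series, k: float = 3.0) -> float:
    """Fraction of observations more than k sample standard deviations from the mean."""
    z = (values - values.mean()) / values.std()
    return float((z.abs() > k).mean())


def gaussian_tail_fraction(k: float = 3.0) -> float:
    """Two-sided probability that a Gaussian variable lies beyond k standard deviations."""
    return math.erfc(k / math.sqrt(2.0))


# Generated demo samples, not market data.
rng = np.random.default_rng(0)
gaussian_demo = pd.Series(rng.normal(size=50_000))
fat_tailed_demo = pd.Series(rng.standard_t(df=5, size=50_000))

print("Gaussian expectation:", gaussian_tail_fraction(3.0))
print("Gaussian sample     :", tail_fraction(gaussian_demo))
print("fat-tailed sample   :", tail_fraction(fat_tailed_demo))
```

If `tail_fraction(df['DailyReturnPercent'])` comes out close to the Gaussian expectation, that is consistent with generated data, though it is suggestive rather than conclusive.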