First let's load up our data and do a little feature engineering. We may want to use the year to indicate the passage of time in some of our plots that aren't time series plots, so let's add that.

In [1]:
import pandas as pd

KO = '/kaggle/input/coca-cola-complete-stocks-dataweekly-updated/KO_1919-09-06_2025-01-31.csv'
df = pd.read_csv(filepath_or_buffer=KO, parse_dates=['Date'])
df['year'] = df['Date'].dt.year
df.head()

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume,year
0,1962-01-02,0.263021,0.270182,0.263021,0.263021,0.046381,806400,1962
1,1962-01-03,0.259115,0.259115,0.253255,0.257161,0.045348,1574400,1962
2,1962-01-04,0.257813,0.261068,0.257813,0.259115,0.045692,844800,1962
3,1962-01-05,0.259115,0.26237,0.252604,0.253255,0.044659,1420800,1962
4,1962-01-08,0.251302,0.251302,0.245768,0.250651,0.0442,2035200,1962


Next let's look at our daily price and volume correlations. This will tell us if we have any redundant data, and it will tell us a little about how volume behaves as a function of the price.

In [2]:
df[['Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume']].corr()

Unnamed: 0,Open,High,Low,Close,Adj Close,Volume
Open,1.0,0.999946,0.999937,0.999894,0.969414,0.471763
High,0.999946,1.0,0.999915,0.999947,0.969053,0.473107
Low,0.999937,0.999915,1.0,0.999947,0.969846,0.46989
Close,0.999894,0.999947,0.999947,1.0,0.969424,0.471387
Adj Close,0.969414,0.969053,0.969846,0.969424,1.0,0.434369
Volume,0.471763,0.473107,0.46989,0.471387,0.434369,1.0


What do we see? None of our price series are perfectly correlated, so none of them is strictly redundant, although Open, High, Low, and Close all correlate with each other above 0.9998 and carry nearly the same information. The difference between the closing price and the adjusted closing price usually represents dividends, and KO has historically been known as a dividend stock, so it isn't surprising that the lowest price correlations (about 0.97) are the ones between Adj Close and the other four prices. Finally, it is a little surprising that the prices and the volume are moderately positively correlated, at roughly 0.43 to 0.47.
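To make the comparison above easier to read off, here is a minimal sketch that lists each pair of columns once, sorted from lowest to highest correlation. The helper name `ranked_correlation_pairs` and the `toy` frame are made up for illustration; the numbers are not the KO data.

```python
import numpy as np
import pandas as pd


def ranked_correlation_pairs(frame: pd.DataFrame) -> pd.Series:
    """Return each unordered column pair's Pearson correlation, lowest first."""
    corr = frame.corr()
    # Keep only the strict upper triangle so each pair appears exactly once.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    return corr.where(upper).stack().dropna().sort_values()


# Made-up numbers for illustration only; this is not the KO data.
toy = pd.DataFrame({
    'Close': [1.0, 2.0, 3.0, 4.0, 5.0],
    'High': [1.1, 2.1, 3.2, 4.1, 5.3],
    'Volume': [5.0, 3.0, 4.0, 1.0, 2.0],
})
pairs = ranked_correlation_pairs(toy)
print(pairs)
```

On the real data, the same helper could be called on `df[['Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume']]`.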

Let's look at the price time series. We'll use the adjusted closing price because it best represents the total return (including splits and dividends) over time.

In [3]:
from plotly import express
from plotly.offline import init_notebook_mode

init_notebook_mode(connected=True)
express.line(data_frame=df, x='Date', y='Adj Close', color='year').show(renderer='iframe_connected')

Because there has been so much price appreciation, the first twenty years of our data look flat. Let's use a log plot for the price instead, to see whether any price volatility shows up in those years.

In [4]:
express.line(data_frame=df, x='Date', y='Adj Close', color='year', log_y=True).show(renderer='iframe_connected')
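As a side note on why the log axis helps: on a log scale, equal percentage moves cover equal vertical distance, whatever the price level. The sketch below checks this with two hypothetical price pairs (the values and the helper name `log_axis_gap` are assumptions for illustration, not taken from the data).

```python
import math


def log_axis_gap(low: float, high: float) -> float:
    """Vertical distance between two prices on a base-10 log axis."""
    return math.log10(high) - math.log10(low)


# Hypothetical prices: a doubling at a low level and a doubling at a high level.
early = log_axis_gap(0.05, 0.10)
late = log_axis_gap(30.0, 60.0)

# On a linear axis these moves differ by a factor of 600 (0.05 vs 30.0);
# on a log axis both are the same height.
print(early, late)
```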

What do we see? KO looks like a steady upward climb with occasional retrenchments, some of them lasting a decade or more. Coloring by the year doesn't really do what we want it to do, so let's try again with a scatter plot.

In [5]:
express.scatter(data_frame=df, x='Date', y='Adj Close', color='year', log_y=True).show(renderer='iframe_connected')

Much better. Now time seems to pass more or less continuously, rather than abruptly at the end of each calendar year.

Next let's look at the volume time series.

In [6]:
express.scatter(data_frame=df, x='Date', y='Volume', color='year', log_y=False).show(renderer='iframe_connected')

Volume looks almost flat, with a slight upward slope over time, but the overall shape of the plot is dominated by outliers. Let's try again with a log plot, to see if we can put some of that volume volatility in relative terms.

In [7]:
express.scatter(data_frame=df, x='Date', y='Volume', color='year', log_y=True).show(renderer='iframe_connected')

What do we see? Again the log plot makes relative change more prominent than absolute change, and we can clearly see that volume has gradually risen on average over time. Let's go back to our price/volume correlations and take another look, with price and volume in the same plot.

In [8]:
express.scatter(data_frame=df, x='Adj Close', y='Volume', color='year', log_y=True).show(renderer='iframe_connected')

What do we see? Based on the overall Pearson correlation we calculated above, we might expect to see this plot trending upward as the price rises, but in fact the slope is essentially flat, suggesting a weak to negligible correlation between price and the log of the volume, except for the very early years.
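The plot puts volume on a log axis, while the Pearson correlation above was computed on raw volume, so the two are not measuring quite the same thing. A minimal sketch of how one might compute the correlation against the log of the volume directly follows; the helper name `price_log_volume_corr` and the `toy` numbers are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def price_log_volume_corr(frame: pd.DataFrame,
                          price: str = 'Adj Close',
                          volume: str = 'Volume') -> float:
    """Pearson correlation between price and log10(volume)."""
    # The log is undefined for zero volume, so keep positive rows only.
    positive = frame[frame[volume] > 0]
    return positive[price].corr(np.log10(positive[volume]))


# Made-up numbers: volume grows tenfold for each unit of price.
toy = pd.DataFrame({
    'Adj Close': [1.0, 2.0, 3.0, 4.0],
    'Volume': [10, 100, 1000, 10000],
})
raw_corr = toy['Adj Close'].corr(toy['Volume'])
log_corr = price_log_volume_corr(toy)
print(raw_corr, log_corr)
```

On the real data this would be called as `price_log_volume_corr(df)`.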

Let's look at how the correlation behaves if we compute it over only the data after each year in turn.

In [9]:
# Correlation of Adj Close and Volume over the rows strictly after each year
years = df['year'].unique().tolist()[:-1]
tail_corr = [df[df['year'] > year]['Adj Close'].corr(df[df['year'] > year]['Volume']) for year in years]
express.line(x=years, y=tail_corr).show(renderer='iframe_connected')

What do we see? The positive correlation dies off as we drop the early years, reaches a minimum, and then oscillates. This suggests that the price/volume correlation depends heavily on the period we measure it over, which is not something we can see from the overall daily Pearson correlation or from the price/volume plot.
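One thing worth keeping in mind when reading that plot: each window is shorter than the one before it, so the last few correlations rest on very few rows and may be noisy for that reason alone. Here is a sketch that reports the window size next to each correlation; the helper name `tail_window_summary` and the `toy` frame are hypothetical.

```python
import pandas as pd


def tail_window_summary(frame: pd.DataFrame,
                        a: str = 'Adj Close',
                        b: str = 'Volume',
                        key: str = 'year') -> pd.DataFrame:
    """Correlation of a and b over the rows strictly after each year, with row counts."""
    rows = []
    for start in sorted(frame[key].unique())[:-1]:
        window = frame[frame[key] > start]
        rows.append({
            'after_year': int(start),
            'n_rows': len(window),
            'correlation': window[a].corr(window[b]),
        })
    return pd.DataFrame(rows)


# Made-up data with two rows per year, for illustration only.
toy = pd.DataFrame({
    'year': [2000, 2000, 2001, 2001, 2002, 2002],
    'Adj Close': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    'Volume': [6.0, 5.0, 7.0, 3.0, 4.0, 9.0],
})
summary = tail_window_summary(toy)
print(summary)
```

In the toy frame the final window has only two rows, and a two-point correlation is always plus or minus one, which is an extreme case of the same effect.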

Let's look at the same data, but with the annual price/volume correlation.

In [10]:
# One row per calendar year: the within-year correlation of Adj Close and Volume
annual = pd.DataFrame(data=[
    {'year': year, 'correlation': df.loc[df['year'] == year, 'Adj Close'].corr(df.loc[df['year'] == year, 'Volume'])}
    for year in range(1962, 2025)])
express.scatter(data_frame=annual, x='year', y='correlation').show(renderer='iframe_connected')

Interestingly, the annual price/volume correlation looks essentially random.
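For reference, the same per-year calculation can also be written with a `groupby`, which avoids hard-coding the year range. This is a sketch under the assumption that the frame has the `year`, `Adj Close`, and `Volume` columns used above; the helper name `annual_correlation` and the `toy` data are made up.

```python
import pandas as pd


def annual_correlation(frame: pd.DataFrame,
                       a: str = 'Adj Close',
                       b: str = 'Volume',
                       key: str = 'year') -> pd.Series:
    """Pearson correlation of a and b within each group of key."""
    # Select the two columns first so apply only sees the data it needs.
    return frame.groupby(key)[[a, b]].apply(lambda g: g[a].corr(g[b]))


# Made-up data: perfectly negative in 2000, perfectly positive in 2001.
toy = pd.DataFrame({
    'year': [2000, 2000, 2000, 2001, 2001, 2001],
    'Adj Close': [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
    'Volume': [3.0, 2.0, 1.0, 2.0, 4.0, 6.0],
})
by_year = annual_correlation(toy)
print(by_year)
```

Unlike the `range(1962, 2025)` loop, this version includes every year present in the data, so a partial final year would need to be filtered out beforehand if it should be excluded.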