Madusanka Madiligama 01/08/2024

In [7]:
# library imports
import os
import gc
import io
import requests
import zipfile
import datetime

import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [2]:
def set_plot_aesthetics():
    """Configure plot aesthetics for matplotlib and seaborn."""
    plt.rcParams['figure.figsize'] = (10, 8)  
    plt.rcParams['xtick.labelsize'] = 13      
    plt.rcParams['ytick.labelsize'] = 13      
    plt.rcParams['axes.labelsize'] = 14       
    sns.set_palette('tab10')                  

# Apply the plot settings
set_plot_aesthetics()
colors = list(sns.color_palette('tab10')) 

In [3]:
# Convert a date string in MM/DD/YYYY format into a datetime object
def convert_to_date(x):
    return datetime.datetime.strptime(x, '%m/%d/%Y')

In [4]:
# date_parser is deprecated in pandas 2.x; date_format expresses the same MM/DD/YYYY parsing.
df = pd.read_csv('https://raw.githubusercontent.com/srivatsan88/YouTubeLI/master/dataset/amazon_revenue_profit.csv',
                 parse_dates=['Quarter'], date_format='%m/%d/%Y')
df.head()


,Quarter,Revenue,Net Income
0,2020-03-31,75452,2535
1,2019-12-31,87437,3268
2,2019-09-30,69981,2134
3,2019-06-30,63404,2625
4,2019-03-31,59700,3561


In [5]:

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 61 entries, 0 to 60
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   Quarter     61 non-null     datetime64[ns]
 1   Revenue     61 non-null     int64         
 2   Net Income  61 non-null     int64         
dtypes: datetime64[ns](1), int64(2)
memory usage: 1.6 KB
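
The `head()` output above lists the most recent quarter first, whereas time series routines generally expect observations in chronological order. The following is a minimal sketch, assuming a small hypothetical frame `sample` built from the five rows shown above. It is not part of the original notebook, and the test results reported later were produced on the original ordering.

```python
import pandas as pd

# Hypothetical frame holding the five rows displayed by df.head() above.
sample = pd.DataFrame({
    'Quarter': pd.to_datetime(['2020-03-31', '2019-12-31', '2019-09-30',
                               '2019-06-30', '2019-03-31']),
    'Revenue': [75452, 87437, 69981, 63404, 59700],
    'Net Income': [2535, 3268, 2134, 2625, 3561],
})

# Sort oldest-to-newest and rebuild a clean 0..n-1 index.
sample_sorted = sample.sort_values('Quarter').reset_index(drop=True)
print(sample_sorted)
```

The same call on the full frame would be `df.sort_values('Quarter').reset_index(drop=True)`.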


Time series analysis can be performed using various modeling techniques, each suited to specific characteristics of the data:

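ARMA (Autoregressive Moving Average): The ARMA model is appropriate for stationary time series, that is, series whose mean and variance stay constant over time and that show no trend or seasonal pattern. It combines two components: autoregression (AR), which regresses the current value on its own past values, and moving average (MA), which models the current value in terms of past forecast errors.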

ARIMA (Autoregressive Integrated Moving Average): Suitable for non-stationary time series with a trend, the ARIMA model extends ARMA with an integration component (I). The series is first differenced to remove the trend, and an ARMA model is then fitted to the differenced series.
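
The differencing step described above can be illustrated with a short sketch. It uses a hypothetical series `y` with a purely linear trend (the intercept 5.0 and slope 2.0 are made-up values, not taken from the revenue data) and shows that one round of first differencing leaves a constant series.

```python
import numpy as np

# Hypothetical trending series: y_t = 5 + 2t for t = 0..19 (made-up values).
t = np.arange(20)
y = 5.0 + 2.0 * t

# First difference: y_t - y_{t-1}. This corresponds to the "I" step with d = 1.
dy = np.diff(y)

print(dy[:5])    # each difference equals the slope of the trend
print(dy.std())  # spread of the differenced series
```

Real data will not difference to an exact constant, but the same operation removes a roughly linear trend and leaves fluctuations that an ARMA model can describe.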

In [8]:
fig = px.scatter(df, x='Quarter', y='Revenue', title='Amazon Revenue')
fig.update_traces(mode='lines+markers', marker=dict(color='rgb(102,194,165)'))
fig.update_xaxes(rangeslider_visible=True)
fig.show()


The upward trend in the revenue data suggests that the time series is non-stationary. To confirm this, we apply statistical tests, starting with the KPSS (Kwiatkowski-Phillips-Schmidt-Shin) test. With `regression='c'`, its hypotheses are:
- Null Hypothesis ($H_0$): The series is stationary around a constant level.
- Alternative Hypothesis ($H_1$): The series is non-stationary.

In [9]:
from statsmodels.tsa.stattools import kpss

test_stat, p_val, lags, crit_vals = kpss(df.Revenue, regression='c')


The test statistic is outside of the range of p-values available in the
look-up table. The actual p-value is smaller than the p-value returned.




In [10]:
print(f'Test statistic: {test_stat}')
print(f'p-value: {p_val}')
print(f'Critical values: {crit_vals}')

# The KPSS null hypothesis is stationarity, so a small p-value rejects it.
if p_val < 0.05:
    print('Series is non-stationary')
else:
    print('Series is stationary')

Test statistic: 1.1700203698692262
p-value: 0.01
Critical values: {'10%': 0.347, '5%': 0.463, '2.5%': 0.574, '1%': 0.739}
Series is non-stationary


To cross-check this result, we apply the Augmented Dickey-Fuller (ADF) test, whose hypotheses are the reverse of those of the KPSS test:

- Null Hypothesis ($H_0$): The series has a unit root, indicating it is non-stationary.
- Alternative Hypothesis ($H_1$): The series does not have a unit root and is, therefore, stationary.

A p-value above 0.05 means the unit root cannot be rejected, which is consistent with a non-stationary series.
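
For reference, the unit-root idea can be written as a regression. The following is the standard textbook form of the ADF regression with a constant term; the symbols $\alpha$, $\gamma$, $\delta_i$ and the lag order $p$ are notation introduced here for illustration, not taken from the notebook.

```latex
\Delta y_t = \alpha + \gamma\, y_{t-1} + \sum_{i=1}^{p} \delta_i\, \Delta y_{t-i} + \varepsilon_t,
\qquad H_0:\ \gamma = 0 \quad \text{vs.} \quad H_1:\ \gamma < 0
```

Under $H_0$ the series has a unit root; a sufficiently negative test statistic for $\gamma$ rejects it.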

In [11]:
from statsmodels.tsa.stattools import adfuller

In [12]:
results = adfuller(df.Revenue)

print(f'Test statistic: {results[0]}')
print(f'p-value: {results[1]}')
print(f'Critical values: {results[4]}')

# The ADF null hypothesis is a unit root, so a large p-value fails to reject non-stationarity.
if results[1] > 0.05:
    print('Series is non-stationary')
else:
    print('Series is stationary')

Test statistic: -2.444836038197237
p-value: 0.1294794312183868
Critical values: {'1%': -3.568485864, '5%': -2.92135992, '10%': -2.5986616}
Series is non-stationary
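
Because the two tests have opposite null hypotheses, their p-values can be read together. The helper below is a minimal sketch of one common way to combine them; the function name `interpret_stationarity` and the default threshold of 0.05 are assumptions for illustration, not part of the original notebook.

```python
def interpret_stationarity(kpss_p, adf_p, alpha=0.05):
    """Combine KPSS and ADF p-values into one verdict (hypothetical helper)."""
    kpss_rejects_stationarity = kpss_p < alpha  # KPSS H0: series is stationary
    adf_rejects_unit_root = adf_p < alpha       # ADF H0: series has a unit root

    if kpss_rejects_stationarity and not adf_rejects_unit_root:
        return 'non-stationary'
    if adf_rejects_unit_root and not kpss_rejects_stationarity:
        return 'stationary'
    # The tests disagree, so no clear verdict is returned.
    return 'inconclusive'


# p-values reported above for the revenue series.
verdict = interpret_stationarity(kpss_p=0.01, adf_p=0.1294794312183868)
print(verdict)
```

With the p-values reported above, both tests point the same way, which supports differencing the series before fitting an ARMA-type model.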
