<a href="https://colab.research.google.com/github/ludawg44/jigsawlabs/blob/master/07Apr20_4_djia_modeling_distributions_lab_CLEAN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Modeling A Distribution Lab

### Introduction

In this lesson, we'll put our knowledge of modeling distributions to use by estimating the probability of different daily fluctuations in the stock market.

Let's start by loading our data.

In [0]:
!pip install gcsfs

Collecting gcsfs
  Downloading https://files.pythonhosted.org/packages/18/3b/454be7c97d05e15eb20a0099f425f0ed6b7552e352c77adb923c3872ba14/gcsfs-0.6.1-py2.py3-none-any.whl
Installing collected packages: gcsfs
Successfully installed gcsfs-0.6.1


In [0]:
import pandas as pd
url = "gs://curriculum-assets/mod-2/upload_DJIA_table.csv"
stocks_df = pd.read_csv(url)

In [0]:
stocks_df.columns

Index(['Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'Adj Close'], dtype='object')

Make sure that the `Date` column is stored as a datetime type, and convert it if it isn't.

In [0]:
date_col = None

stocks_df = None
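As a rough illustration of the check-and-convert step, here is a minimal sketch on a small made-up frame. The name `sample_df` and its values are illustrative assumptions, not the lab's data.

```python
import pandas as pd

# Hypothetical stand-in for stocks_df: dates arrive as strings after read_csv.
sample_df = pd.DataFrame({
    "Date": ["2016-07-01", "2016-06-30", "2016-06-29"],
    "Close": [17949.37, 17929.99, 17694.68],
})

# Inspect the dtype before converting.
print(sample_df["Date"].dtype)

# pd.to_datetime parses the strings into datetime64 values;
# assign returns a new frame with the converted column.
date_col = pd.to_datetime(sample_df["Date"])
sample_df = sample_df.assign(Date=date_col)

print(sample_df["Date"].dtype)
```

Once the column is a datetime type, the `.dt` accessor (for example `.dt.year`) becomes available on it.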

Before using the data for our analysis, let's get a sense of the range of years, months, and days it covers.

Let's start with the year.  Display all of the years listed and the number of rows for each year.

> Do so without creating another column in your dataframe.

In [0]:
# write code here

# 2015    252
# 2014    252
# 2013    252
# 2011    252
# 2010    252
# 2009    252
# 2012    250
# 2016    126
# 2008    101
# Name: Date, dtype: int64
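One possible approach, sketched on a tiny made-up sample (the `sample_df` frame is hypothetical), is to extract the year with the `.dt` accessor and tally it directly, which avoids adding a column:

```python
import pandas as pd

# Hypothetical sample; the lab itself uses stocks_df with a datetime Date column.
dates = pd.to_datetime(
    ["2015-01-02", "2015-01-05", "2016-03-01", "2014-12-31", "2015-06-30"]
)
sample_df = pd.DataFrame({"Date": dates})

# .dt.year pulls the year out of each row without modifying the frame;
# value_counts tallies the rows per year, most frequent first.
rows_per_year = sample_df["Date"].dt.year.value_counts()
print(rows_per_year)
```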

We can also use `groupby` to make some calculations.  Here we group by the month and then count the rows.



In [0]:
stocks_df.groupby(lambda idx: stocks_df.iloc[idx]['Date'].month).count()['Date']

> Notice that `groupby` accepts a lambda function.  The argument passed to the function is the index label of each row, so we select the row using `iloc`, then return the month value for that row.  Pandas groups the rows by that returned month value.
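For comparison, `groupby` also accepts a Series of keys. The sketch below, on a hypothetical four-row frame, suggests that the lambda form and a vectorized `.dt.month` key give the same counts.

```python
import pandas as pd

# Hypothetical sample with a default integer index, so iloc and labels agree.
sample_df = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2015-01-02", "2015-01-05", "2015-02-03", "2016-02-01"]
    ),
    "Close": [1.0, 2.0, 3.0, 4.0],
})

# Lambda form: the function receives each index label and returns the month.
by_lambda = sample_df.groupby(
    lambda idx: sample_df.iloc[idx]["Date"].month
).count()["Date"]

# Series form: group directly by the month extracted from every row at once.
by_series = sample_df.groupby(sample_df["Date"].dt.month)["Date"].count()

print(by_lambda)
print(by_series)
```

The Series form avoids a Python-level function call per row, which tends to matter on larger frames.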

Ok, so we have daily stock data for the years 2008 through 2016, with only partial coverage of 2008 (101 rows) and 2016 (126 rows).

### Calculating Daily Change

Let's take another look at our dataframe.

In [0]:
stocks_df[:3]

| | Date | Open | High | Low | Close | Volume | Adj Close |
|---|---|---|---|---|---|---|---|
| 0 | 2016-07-01 | 17924.240234 | 18002.380859 | 17916.910156 | 17949.369141 | 82160000 | 17949.369141 |
| 1 | 2016-06-30 | 17712.759766 | 17930.609375 | 17711.800781 | 17929.990234 | 133030000 | 17929.990234 |
| 2 | 2016-06-29 | 17456.019531 | 17704.509766 | 17456.019531 | 17694.679688 | 106380000 | 17694.679688 |


Now we'd like to track the daily fluctuations in the stock market.  Add a column that shows the difference between the open and close of the stock market, and name the column `Movement`.  Assign the new dataframe to the variable `df_with_movement`.

In [0]:
movement = None
df_with_movement = None
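A minimal sketch of adding such a column, using three rows rounded from the preview above. The sign convention (close minus open) is an assumption here, since the lab does not state which way the difference is taken.

```python
import pandas as pd

# Open and Close values rounded from the three-row preview above.
sample_df = pd.DataFrame({
    "Open": [17924.24, 17712.76, 17456.02],
    "Close": [17949.37, 17929.99, 17694.68],
})

# Assumption: movement is Close minus Open, so a positive value is an up day.
movement = sample_df["Close"] - sample_df["Open"]

# assign returns a new frame and leaves sample_df unchanged.
df_with_movement = sample_df.assign(Movement=movement)
print(df_with_movement)
```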

Now that we have a column that shows daily changes in the stock market, let's plot a frequency distribution (a histogram) of our `Movement` column.

Answer: <img src="https://github.com/jigsawlabs-student/modeling-distributions/blob/master/movement-hist.png?raw=1" width="50%">

Next, let's plot the probability density function of the daily fluctuation in prices.

Answer: <img src="https://github.com/jigsawlabs-student/modeling-distributions/blob/master/movement-density.png?raw=1" width="50%">
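The two plots can be sketched as follows on synthetic data. The sample here is drawn from a normal distribution purely for illustration, and the kernel density estimate via `scipy.stats.gaussian_kde` is one possible way to draw a density curve, not necessarily how the answer images were produced.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

# Synthetic stand-in for the Movement column (illustrative values only).
rng = np.random.default_rng(0)
sample_movement = rng.normal(loc=0, scale=140, size=1000)

fig, (ax_hist, ax_density) = plt.subplots(1, 2, figsize=(10, 4))

# Frequency distribution: raw counts of observations per bin.
counts, bin_edges, _ = ax_hist.hist(sample_movement, bins=30)
ax_hist.set_title("Frequency distribution")

# Density: a smooth kernel density estimate evaluated on a grid.
kde = gaussian_kde(sample_movement)
grid = np.linspace(sample_movement.min(), sample_movement.max(), 200)
ax_density.plot(grid, kde(grid))
ax_density.set_title("Estimated density")
```

The histogram's y-axis is in counts, while the density curve is scaled so that the area under it is one.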

### Modeling our Distribution

Now let's choose a normal distribution to begin modeling our random variable.  Begin by initializing a random variable whose location and spread are the mean and standard deviation of our `Movement` column.  Assign it to the variable `norm_dist_djia`.

In [0]:
import scipy.stats as stats
import numpy as np
norm_dist_djia = None
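In SciPy, calling `stats.norm` with `loc` and `scale` returns a "frozen" distribution with those parameters fixed. A small sketch on synthetic data (the names `sample_movement` and `sample_norm_dist` are hypothetical):

```python
import numpy as np
import scipy.stats as stats

# Synthetic stand-in for the Movement column (illustrative values only).
rng = np.random.default_rng(1)
sample_movement = rng.normal(loc=-4, scale=140, size=2000)

# Freeze a normal distribution at the sample's mean and standard deviation.
sample_norm_dist = stats.norm(
    loc=sample_movement.mean(),
    scale=sample_movement.std(),
)

# The frozen object reports the parameters it was built with.
print(sample_norm_dist.mean(), sample_norm_dist.std())
```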

In [0]:
norm_dist_djia.mean()
# -3.9162063846153483

norm_dist_djia.std()
# 141.2279379239883

141.2279379239883

Now create a list of 100 evenly spaced values between the 1st percentile and the 99th percentile of the distribution.

Then pass these values to the distribution's `pdf` function to get the pdf value at each of the provided points.

In [0]:
x = None
pdf_nums_norm = None
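One way to build such a grid is with the distribution's `ppf` (the inverse CDF) and `np.linspace`. The sketch below plugs in the mean and standard deviation printed above; the names `dist`, `x_vals`, and `pdf_vals` are illustrative.

```python
import numpy as np
import scipy.stats as stats

# Mean and standard deviation as reported by norm_dist_djia above.
dist = stats.norm(loc=-3.9162063846153483, scale=141.2279379239883)

# ppf(q) returns the value below which a fraction q of the distribution lies.
x_vals = np.linspace(dist.ppf(0.01), dist.ppf(0.99), 100)

# Evaluate the density at every grid point in one vectorized call.
pdf_vals = dist.pdf(x_vals)

print(x_vals[0], x_vals[-1])
```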

Then plot the modeled normal distribution along with a frequency distribution of our sample.

In [0]:
from scipy.stats import norm
import matplotlib.pyplot as plt
fig = plt.figure()
# ax = fig.add_subplot(111)


<Figure size 432x288 with 0 Axes>

Answer: <img src="https://github.com/jigsawlabs-student/modeling-distributions/blob/master/djia-normal.png?raw=1" width="50%">

Let's see the probability, according to our modeled distribution, of a daily movement lower than -750.

In [0]:
import numpy as np
p_less_than_750 = None

np.format_float_positional(p_less_than_750)
# '0.00000006360021876390582'

'0.00000006360021876390582'
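The probability of falling below a threshold is the CDF evaluated at that threshold. As a sketch, using the mean and standard deviation reported earlier (the variable names are illustrative), the same number can also be reached through the z-score:

```python
import scipy.stats as stats

# Mean and standard deviation as reported by norm_dist_djia above.
dist = stats.norm(loc=-3.9162063846153483, scale=141.2279379239883)

# P(X < -750) read directly from the frozen distribution's CDF.
p_below = dist.cdf(-750)

# Equivalent route: standardize the threshold, then use the standard normal CDF.
z_score = (-750 - dist.mean()) / dist.std()
p_from_z = stats.norm.cdf(z_score)

print(p_below, z_score)
```

A z-score a little beyond five standard deviations below the mean is why the modeled probability is so small.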

Let's compare this to the empirical CDF of our actual data.  The statsmodels library provides the `ECDF` class, which allows us to do so.

In [0]:
import numpy as np
from statsmodels.distributions.empirical_distribution import ECDF
# select the Movement column added earlier
movement_col = df_with_movement['Movement']
e_movement = ECDF(movement_col)

In [0]:
e_movement(-750)

0.001005530417295123
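Conceptually, the empirical CDF at a point is the fraction of observations less than or equal to that point. A NumPy-only sketch on made-up values (the helper `empirical_cdf` is hypothetical and not part of statsmodels):

```python
import numpy as np

# Made-up daily movements, for illustration only.
sample = np.array([-800.0, -300.0, -50.0, 0.0, 20.0, 150.0, 400.0, 900.0])

def empirical_cdf(values, threshold):
    """Return the fraction of observations less than or equal to threshold."""
    return np.mean(np.asarray(values) <= threshold)

# One of the eight observations is at or below -750.
p_emp = empirical_cdf(sample, -750)
print(p_emp)
```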

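This is a very big difference: the empirical probability (about 0.001) is roughly 16,000 times larger than the probability under our normal model (about 0.00000006), so the normal distribution badly underestimates large drops.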

### Changing our Distribution

Let's try a Student's t distribution instead of modeling with a normal distribution.  We'll set the t distribution with 2 degrees of freedom, but you can experiment with different values for this.  We'll keep the location and scale the same as with our normal distribution (note that for a t distribution the scale parameter is not the standard deviation).

In [0]:
from scipy.stats import t
import numpy as np
t_dist = t(df=2, loc=movement_col.mean(), scale=movement_col.std())
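To get a feel for the degrees-of-freedom choice, the sketch below evaluates the left-tail probability at -750 for a few values of `df`. It uses the mean and standard deviation printed earlier as `loc` and `scale`; the particular `df` values are arbitrary picks for illustration.

```python
from scipy.stats import t, norm

# Location and scale taken from the normal model's reported mean and std.
loc, scale = -3.9162063846153483, 141.2279379239883

# Tail probability P(X < -750) for several degrees-of-freedom settings.
tail_by_df = {}
for df in (2, 5, 10, 30):
    tail_by_df[df] = t(df=df, loc=loc, scale=scale).cdf(-750)
    print(df, tail_by_df[df])

# The normal model with the same loc and scale, for reference.
normal_tail = norm(loc=loc, scale=scale).cdf(-750)
print("normal", normal_tail)
```

Smaller `df` values put more probability in the tails, and the tail probability moves toward the normal model's value as `df` grows.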

Then we proceed with defining a range of x values, and finding the corresponding array of pdf values.

In [0]:

x = np.linspace(t_dist.ppf(0.0001), t_dist.ppf(0.9999), 100)
y_1 = t_dist.pdf(x) 

Then we plot our distributions to see how they line up.

In [0]:
import matplotlib.pyplot as plt
%matplotlib inline

fig_1 = plt.figure()
ax_1 = fig_1.add_subplot(111)
# plot of normal distribution, re-evaluated on the new x grid
ax_1.plot(x, norm_dist_djia.pdf(x),
'r-', alpha=0.6, label='norm pdf')
# plot of t distribution
ax_1.plot(x,y_1)

# plot of histogram
ax_1.hist(movement_col, density = True, alpha = .5)


> The t distribution (in orange) puts more weight in the tails of the distribution than the normal distribution does.

Now let's look at the probability of a fluctuation less than -750.

In [0]:
t_less_than_750 = t_dist.cdf(-750)

np.format_float_positional(t_less_than_750)

# '0.017006983460026732'

'0.017006983460026732'

In [0]:
e_movement(-750)

0.001005530417295123
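To put the three numbers side by side, here is a small sketch that recomputes both modeled probabilities from the reported mean and standard deviation and expresses each as a ratio to the empirical value printed above. The dictionary names are illustrative.

```python
import scipy.stats as stats

# Parameters and empirical value as reported in the cells above.
loc, scale = -3.9162063846153483, 141.2279379239883
empirical = 0.001005530417295123  # e_movement(-750)

models = {
    "normal": stats.norm(loc=loc, scale=scale),
    "t (df=2)": stats.t(df=2, loc=loc, scale=scale),
}

# Ratio above 1 means the model overstates the tail; below 1, it understates it.
ratios = {}
for name, dist in models.items():
    p_model = dist.cdf(-750)
    ratios[name] = p_model / empirical
    print(f"{name}: p={p_model:.3g}, ratio to empirical={ratios[name]:.3g}")
```

Under these parameters the normal model understates the empirical tail while the t distribution with 2 degrees of freedom overstates it, which suggests trying intermediate `df` values.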

### Resources

As an alternative to our aggregated dataset, [also see the UCI dataset](http://archive.ics.uci.edu/ml/datasets/Dow+Jones+Index) for daily prices of individual stocks.