# Time Series Review

In [96]:
import pandas as pd
import numpy as np


In [97]:
# get the dataset
# ! curl https://raw.githubusercontent.com/jbrownlee/Datasets/master/daily-min-temperatures.csv -o 'temps.csv'

In [98]:
ts = pd.read_csv('temps.csv')
ts.head()

Unnamed: 0,Date,Temp
0,1981-01-01,20.7
1,1981-01-02,17.9
2,1981-01-03,18.8
3,1981-01-04,14.6
4,1981-01-05,15.8


## What is the first thing we do with a time series dataset?
Convert the Date column to a datetime type and set it as the index.


In [None]:
# Your code here
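
One possible way to do the conversion is sketched below. The `demo` frame is a hypothetical stand-in built from the five rows shown above, so the cell runs without `temps.csv`; the same two calls should apply to `ts`.

```python
import pandas as pd

# Hypothetical stand-in for temps.csv: the first five rows shown above
demo = pd.DataFrame({
    "Date": ["1981-01-01", "1981-01-02", "1981-01-03", "1981-01-04", "1981-01-05"],
    "Temp": [20.7, 17.9, 18.8, 14.6, 15.8],
})

# Parse the date strings into timestamps, then use them as the row labels
demo["Date"] = pd.to_datetime(demo["Date"])
demo = demo.set_index("Date")

# A DatetimeIndex allows date-based selection such as string slicing
first_three_days = demo.loc["1981-01-01":"1981-01-03"]
print(first_three_days)
```

Passing `parse_dates=['Date']` and `index_col='Date'` to `pd.read_csv` is another route to the same result.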

In [101]:
# What is the frequency of the time series?
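
A minimal sketch of checking the frequency with `pd.infer_freq`, using two small hypothetical indexes. The values in `gappy_series` are placeholders, not real temperatures.

```python
import pandas as pd

# Hypothetical indexes: one regular daily index, one with a missing day
regular = pd.date_range("1981-01-01", periods=5, freq="D")
gappy = pd.DatetimeIndex(
    ["1981-01-01", "1981-01-02", "1981-01-03", "1981-01-05", "1981-01-06"]
)

regular_freq = pd.infer_freq(regular)  # a frequency string for evenly spaced dates
gappy_freq = pd.infer_freq(gappy)      # None when no single frequency fits

# asfreq reindexes to a regular daily grid and exposes the gap as NaN
gappy_series = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], index=gappy)
filled = gappy_series.asfreq("D")
print(regular_freq, gappy_freq, int(filled.isna().sum()))
```

If `pd.infer_freq(ts.index)` returns `None` on the real data, that suggests the index has gaps or irregular spacing worth inspecting.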

In [103]:
# Plot the series
import matplotlib.pyplot as plt


What types of patterns do you see in this data?
- Trend?
- Seasonality?
- Change in variance?
- Cyclical?

## Visual Tests for Stationarity

The basic time series models (Autoregressive, Moving Average, ARMA) expect the input time series to be stationary. We can check for stationarity visually by plotting the rolling mean and rolling standard deviation: for a stationary series, both should stay roughly constant over time.

In [106]:
# Save the rolling mean and std to the variables defined below

rolling_mean = None
rolling_std = None
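
The rolling statistics can be sketched as follows. The synthetic `series` (a yearly cycle plus noise) and the 30-day window are assumptions for illustration, not values taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the temperature series: a yearly cycle plus noise
rng = np.random.default_rng(0)
idx = pd.date_range("1981-01-01", periods=730, freq="D")
temps = 11 + 4 * np.cos(2 * np.pi * np.arange(730) / 365.25) + rng.normal(0, 2, 730)
series = pd.Series(temps, index=idx, name="Temp")

window = 30  # assumed window of roughly one month
demo_rolling_mean = series.rolling(window=window).mean()
demo_rolling_std = series.rolling(window=window).std()

# The first window - 1 values are NaN because the window is not yet full
print(int(demo_rolling_mean.isna().sum()))
```

Here the rolling mean drifts with the seasonal cycle, which is the kind of visual cue that argues against stationarity.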

## Manipulating time series
There are many ways we can manipulate our time series, including resampling and differencing.

## Downsample the data so that it is weekly then plot


In [293]:
# Your code here
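
A small sketch of downsampling with `resample`, on a hypothetical two-week series whose values are just 0 through 13. Using the mean as the aggregate is an assumption; `max`, `min`, or `median` work the same way.

```python
import pandas as pd

# Hypothetical daily series starting on a Monday so the weeks line up
idx = pd.date_range("1981-01-05", periods=14, freq="D")
daily = pd.Series(range(14), index=idx, dtype=float)

# "W" bins the data into weeks ending on Sunday; mean() aggregates each bin
weekly = daily.resample("W").mean()
print(weekly)
```

On the real data, `ts['Temp'].resample('W').mean().plot()` would draw the weekly series once the index is a DatetimeIndex.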

## Difference the data with different lags. Plot each differenced series along with its rolling mean and std


In [None]:
# your code here
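
One way to organize this exercise is a small helper that differences at a given lag and attaches the rolling statistics. `difference_with_rolling` is a hypothetical name, and the linear `trend` series is made up to show that differencing removes a trend.

```python
import pandas as pd

def difference_with_rolling(series, lag, window):
    """Hypothetical helper: difference at `lag`, then add rolling mean and std."""
    diffed = series.diff(periods=lag).dropna()
    return pd.DataFrame({
        "diff": diffed,
        "rolling_mean": diffed.rolling(window).mean(),
        "rolling_std": diffed.rolling(window).std(),
    })

# Made-up series with a pure linear trend (it rises by 2 each day)
idx = pd.date_range("1981-01-01", periods=8, freq="D")
trend = pd.Series([1.0, 3.0, 5.0, 7.0, 9.0, 11.0, 13.0, 15.0], index=idx)

lag1 = difference_with_rolling(trend, lag=1, window=3)
lag2 = difference_with_rolling(trend, lag=2, window=3)
print(lag1["diff"].tolist())
print(lag2["diff"].tolist())
```

Calling `.plot()` on the returned frame draws the three columns together.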

In [321]:
# Run a Dickey-Fuller test on the original series and interpret the result.
# Does the Dickey-Fuller result agree with your intuition?
# What is the null hypothesis of the Dickey-Fuller test?

from statsmodels.tsa.stattools import adfuller

# adfuller needs the series as its first argument; here, the original Temp column
dftest = adfuller(ts['Temp'])

# Extract and display test results in a user friendly manner
dfoutput = pd.Series(dftest[0:4], index=['Test Statistic', 'p-value', '#Lags Used', 'Number of Observations Used'])
for key,value in dftest[4].items():
    dfoutput['Critical Value (%s)'%key] = value
print(dftest)

(-4.444804924611681, 0.00024708263003611787, 20, 3629, {'1%': -3.4321532327220154, '5%': -2.862336767636517, '10%': -2.56719413172842}, 16642.822304301197)
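
The printed tuple can be read as follows. This sketch copies the numbers from the output above and compares the test statistic with each critical value. The null hypothesis of the augmented Dickey-Fuller test is that the series has a unit root, that is, it is non-stationary.

```python
# Values copied from the printed adfuller result above
test_statistic = -4.444804924611681
p_value = 0.00024708263003611787
critical_values = {
    "1%": -3.4321532327220154,
    "5%": -2.862336767636517,
    "10%": -2.56719413172842,
}

# Reject the unit-root null when the statistic is more negative
# than the critical value at that significance level
rejections = {level: test_statistic < cv for level, cv in critical_values.items()}
reject_at_5_percent = p_value < 0.05

for level, rejected in rejections.items():
    decision = "reject" if rejected else "fail to reject"
    print(f"{level}: {decision} the unit-root null")
```

Rejecting a unit root does not rule out seasonality, so a series can pass this test and still show a repeating yearly pattern in the plots.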


## Basic Time Series Models

## Autoregressive Model
$$\large Y_t = \mu + \phi * Y_{t-1}+\epsilon_t$$

- 1st order: Predict today's value based on yesterday's value
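
To make the recursion concrete, here is a simulation sketch of the first-order model above. The parameter values, the unit-variance Gaussian noise, and the helper names `simulate_ar1` and `lag1_autocorr` are assumptions for illustration.

```python
import numpy as np

def simulate_ar1(mu, phi, n, seed=0):
    """Simulate Y_t = mu + phi * Y_{t-1} + eps_t with standard normal noise."""
    rng = np.random.default_rng(seed)
    eps = rng.normal(0.0, 1.0, size=n)
    y = np.empty(n)
    y[0] = mu / (1.0 - phi)  # start at the long-run mean (requires |phi| < 1)
    for t in range(1, n):
        y[t] = mu + phi * y[t - 1] + eps[t]
    return y

def lag1_autocorr(y):
    """Sample autocorrelation at lag 1."""
    centered = y - y.mean()
    return float(np.sum(centered[1:] * centered[:-1]) / np.sum(centered ** 2))

ar_series = simulate_ar1(mu=2.0, phi=0.8, n=5000)
ar_r1 = lag1_autocorr(ar_series)
print(round(ar_r1, 2))
```

For a stationary AR(1), the lag-1 autocorrelation should sit close to $\phi$ and the series should fluctuate around $\mu / (1 - \phi)$.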

## Moving Average Model

$$\large Y_t = \mu +\epsilon_t + \theta * \epsilon_{t-1}$$

- 1st order: Predict today's value based on a weighted sum of today's and yesterday's errors
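
A matching simulation sketch for the first-order moving average model. The parameter values and the helper names `simulate_ma1` and `autocorr` are assumptions for illustration.

```python
import numpy as np

def simulate_ma1(mu, theta, n, seed=0):
    """Simulate Y_t = mu + eps_t + theta * eps_{t-1} with standard normal noise."""
    rng = np.random.default_rng(seed)
    eps = rng.normal(0.0, 1.0, size=n + 1)
    return mu + eps[1:] + theta * eps[:-1]

def autocorr(y, lag):
    """Sample autocorrelation at a positive lag."""
    centered = y - y.mean()
    return float(np.sum(centered[lag:] * centered[:-lag]) / np.sum(centered ** 2))

ma_series = simulate_ma1(mu=2.0, theta=0.6, n=5000)
ma_r1 = autocorr(ma_series, 1)
ma_r2 = autocorr(ma_series, 2)
print(round(ma_r1, 2), round(ma_r2, 2))
```

An MA(1) process has lag-1 autocorrelation $\theta / (1 + \theta^2)$ and essentially no autocorrelation beyond lag 1, which is why its ACF cuts off sharply.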


To determine the order, we look at partial autocorrelation and autocorrelation plots.

Use statsmodels to plot the PACF and ACF of the residuals. We use residuals because the basic AR, MA, and ARMA models expect stationary series as inputs.

What do the plots suggest are the correct terms for MA and AR?


In [None]:
# Your code here
from statsmodels.graphics.tsaplots import plot_pacf, plot_acf

# NLP Review 

In [140]:
# Import nltk, our favorite Natural Language Processing library
import nltk

# Look at the Project Gutenberg texts in NLTK
nltk.corpus.gutenberg.fileids()

['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

In [206]:
# Let's look at the Jane Austen novels

austen_file_ids = nltk.corpus.gutenberg.fileids()[:3]

austen_docs = [nltk.corpus.gutenberg.raw(file_id)[:5000] for file_id in austen_file_ids]
austen_docs[0]

"[Emma by Jane Austen 1816]\n\nVOLUME I\n\nCHAPTER I\n\n\nEmma Woodhouse, handsome, clever, and rich, with a comfortable home\nand happy disposition, seemed to unite some of the best blessings\nof existence; and had lived nearly twenty-one years in the world\nwith very little to distress or vex her.\n\nShe was the youngest of the two daughters of a most affectionate,\nindulgent father; and had, in consequence of her sister's marriage,\nbeen mistress of his house from a very early period.  Her mother\nhad died too long ago for her to have more than an indistinct\nremembrance of her caresses; and her place had been supplied\nby an excellent woman as governess, who had fallen little short\nof a mother in affection.\n\nSixteen years had Miss Taylor been in Mr. Woodhouse's family,\nless as a governess than a friend, very fond of both daughters,\nbut particularly of Emma.  Between _them_ it was more the intimacy\nof sisters.  Even before Miss Taylor had ceased to hold the nominal\noffice of 

## What are the preprocessing steps?


In [165]:
# Your answer here

## Let's remove unwanted tokens (for example, punctuation and stopwords) manually with list comprehensions

In [290]:
# your code here
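
A minimal sketch of typical manual preprocessing: lowercasing, stripping punctuation, tokenizing on whitespace, and filtering stopwords. The text is the opening sentence of Emma from the output above. `stopwords_demo` is a hypothetical mini list for illustration, since NLTK's full stopword list needs a separate download.

```python
import string

# Opening sentence of Emma, taken from the output above
text = (
    "Emma Woodhouse, handsome, clever, and rich, with a comfortable home "
    "and happy disposition, seemed to unite some of the best blessings "
    "of existence;"
)

# Hypothetical mini stopword list; NLTK provides a fuller one
stopwords_demo = {"and", "with", "a", "to", "some", "of", "the"}

lowered = text.lower()
no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
tokens = no_punct.split()
clean_tokens = [tok for tok in tokens if tok not in stopwords_demo]
print(clean_tokens)
```

Stemming or lemmatization would be a natural next step after this filtering.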

## We can also use sklearn's built-in tools.
First, build a CountVectorizer, then change some of its parameters to see how the output changes.


In [292]:
# your code here
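
The sketch below compares a default `CountVectorizer` with one that removes English stopwords and adds bigrams. `toy_docs` is a made-up corpus so the cell runs without NLTK data; `austen_docs` could be passed in the same way.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up toy corpus for illustration
toy_docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

# Defaults: lowercase, keep tokens of two or more characters, unigrams only
default_cv = CountVectorizer()
default_counts = default_cv.fit_transform(toy_docs)

# Variation: drop English stopwords and include bigrams
tuned_cv = CountVectorizer(stop_words="english", ngram_range=(1, 2))
tuned_counts = tuned_cv.fit_transform(toy_docs)

print(sorted(default_cv.vocabulary_))
print(sorted(tuned_cv.vocabulary_))
```

Other parameters worth trying include `max_features`, `min_df`, and `max_df`.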

## What is the cosine similarity of the texts based on the CountVectorizer output?

In [322]:
# your code here
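
Cosine similarity is the dot product of two vectors divided by the product of their lengths. This numpy sketch spells that out on a made-up count matrix; `cosine_similarity_matrix` is a hypothetical helper and assumes no row is all zeros.

```python
import numpy as np

def cosine_similarity_matrix(counts):
    """Pairwise cosine similarity between the rows of a dense count matrix."""
    counts = np.asarray(counts, dtype=float)
    norms = np.linalg.norm(counts, axis=1, keepdims=True)
    unit = counts / norms  # assumes every row has at least one nonzero entry
    return unit @ unit.T

# Made-up counts: three documents over a three-word vocabulary
toy_counts = np.array([
    [1, 1, 0],
    [1, 0, 1],
    [2, 2, 0],
])
sims = cosine_similarity_matrix(toy_counts)
print(np.round(sims, 2))
```

Rows 0 and 2 point in the same direction, so their similarity is 1 even though the counts differ. `sklearn.metrics.pairwise.cosine_similarity` computes the same matrix and accepts the sparse output of a vectorizer directly.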

**_Term Frequency_** is calculated with the following formula:

$$\large Term\ Frequency(t) = \frac{number\ of\ times\ t\ appears\ in\ a\ document} {total\ number\ of\ terms\ in\ the\ document} $$ 

**_Inverse Document Frequency_** is calculated with the following formula:

$$\large IDF(t) = \log_e\left(\frac{Total\ Number\ of\ Documents}{Number\ of\ Documents\ with\ t\ in\ it}\right)$$

The **_TF-IDF_** value for a given word in a given document is just found by multiplying the two!
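
The two formulas can be worked through by hand on a tiny corpus. `toy_corpus` and the three helper functions are hypothetical and follow the definitions above exactly. Note that `inverse_document_frequency` divides by zero for a term that appears in no document.

```python
import math

# Made-up corpus of three tokenized documents
toy_corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
]

def term_frequency(term, doc):
    """Times the term appears divided by the number of terms in the document."""
    return doc.count(term) / len(doc)

def inverse_document_frequency(term, corpus):
    """Natural log of total documents over documents containing the term."""
    n_containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_containing)

def tf_idf(term, doc, corpus):
    return term_frequency(term, doc) * inverse_document_frequency(term, corpus)

cat_score = tf_idf("cat", toy_corpus[0], toy_corpus)  # (1/3) * ln(3/2)
the_score = tf_idf("the", toy_corpus[0], toy_corpus)  # (1/3) * ln(3/3)
dog_score = tf_idf("dog", toy_corpus[1], toy_corpus)  # (1/3) * ln(3/1)
print(cat_score, the_score, dog_score)
```

A word that appears in every document gets an IDF of zero, so it contributes nothing regardless of how often it occurs.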

## Perform the same transformation with TF-IDF

In [288]:
# your code here
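
A sketch with `TfidfVectorizer` on a made-up corpus. With its default settings, sklearn smooths the IDF term and L2-normalizes each row, so the weights will not match the plain formula above.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up toy corpus for illustration
toy_docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(toy_docs)

# Each row is scaled to unit length by default (norm="l2")
row_norms = np.linalg.norm(weights.toarray(), axis=1)
print(weights.shape, np.round(row_norms, 2))
```

The same parameters explored with the count vectorizer, such as `stop_words` and `ngram_range`, are available here too.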

## Look at similarity with TF-IDF


In [289]:
# your code here
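
The similarity step can be sketched with sklearn's `cosine_similarity` applied to the TF-IDF matrix. The toy corpus is made up; the third document shares no tokens with the first, so their similarity should come out as zero.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up toy corpus for illustration
toy_docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

weights = TfidfVectorizer().fit_transform(toy_docs)
tfidf_sims = cosine_similarity(weights)  # accepts the sparse matrix directly
print(np.round(tfidf_sims, 2))
```

Comparing this matrix with the count-based one shows how down-weighting common words changes which documents look alike.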