# Before you start:
- Read the README.md file
- Comment as much as you can and use the resources (README.md file)
- Happy learning!

In [None]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose
%matplotlib inline


# Challenge 1 - Loading and Evaluating The Data

In this lab, we will look at a dataset of sensor data from a cellular phone. The phone was carried in the subject's pocket for a few minutes while they walked around. As usual, download the file from [here](https://drive.google.com/file/d/1G44c7-GImWpFjiQ0bctRNJXKus_WTOlx/view?usp=sharing) and place it in the provided data folder.

To load the data, run the code below.

In [None]:
# Load the file sub_1.csv and drop the "Unnamed: 0" column. Run this code:

sensor = pd.read_csv('../data/sub_1.csv')
sensor.drop(columns=['Unnamed: 0'], inplace=True)

Examine the data using the `head` function.

In [None]:
# Your code here:

sensor.head()

Check whether there is any missing data. If there is any missing data, remove the rows containing missing data.

In [None]:
# Your code here:
sensor.isnull().sum()
# There are no missing values
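
The check above found no missing values, so nothing had to be removed. Had there been any, the removal step might look like the sketch below. It uses a small made-up frame (`toy` and its column names are illustrative assumptions, not part of the lab data).

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value in each column.
toy = pd.DataFrame({
    "a": [0.1, np.nan, 0.3, 0.4],
    "b": [1.0, 2.0, np.nan, 4.0],
})

# Count missing values per column, as done for `sensor` above.
missing_per_column = toy.isnull().sum()

# Drop every row that contains at least one missing value.
cleaned = toy.dropna()

print(missing_per_column)
print(cleaned.shape)
```

On the lab data the equivalent call would be `sensor.dropna(inplace=True)`, which leaves the frame unchanged when nothing is missing.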

How many rows and columns are in our data?

In [None]:
# Your code here:
rows, columns = sensor.shape

rows, columns 

# Number of rows is 1751
# Number of columns is 12

To perform time series analysis on the data, we must change the index from a range index to a time series index. In the cell below, create a time series index using the `pd.date_range` function. Create a time series index starting at 1/1/2018 00:00:00 and ending at 1/1/2018 00:29:10. The number of periods is equal to the number of rows in `sensor`.

In [None]:
# Your code here:

# Use len(sensor) rather than a hard-coded 1751 so the index always matches the number of rows
time_sensordata = pd.date_range(start='1-1-2018 00:00:00', end='1-1-2018 00:29:10', periods=len(sensor))

time_sensordata 
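
As a sanity check on the index, the sketch below rebuilds it without the lab data and confirms the spacing. The value `n_rows = 1751` is taken from the row count reported above, and `idx_alt` is an illustrative alternative construction, not something the exercise requires.

```python
import pandas as pd

n_rows = 1751  # number of rows reported for `sensor` above

# start + end + periods: pandas spaces the timestamps evenly between the endpoints.
idx = pd.date_range(start="2018-01-01 00:00:00", end="2018-01-01 00:29:10", periods=n_rows)

# 29 min 10 s is 1750 s, so 1751 points are one second apart.
step = idx[1] - idx[0]

# The same timestamps built from a start, a length, and an explicit one-second step.
idx_alt = pd.date_range(start="2018-01-01 00:00:00", periods=n_rows, freq=pd.Timedelta(seconds=1))

print(step, idx[-1], idx_alt[-1])
```

So the requested start, end, and number of periods imply one reading per second.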

Assign the time series index to the dataframe's index.

In [None]:
# Your code here:

sensor.set_index(time_sensordata, inplace=True)
sensor.tail()


Our next step is to decompose the time series and evaluate the patterns in the data. Load the `statsmodels.api` submodule and plot the decomposition of `userAcceleration.x`. Set `freq=60` in the `seasonal_decompose` function (statsmodels 0.11 and later call this argument `period`). Your graph should look like the one below.

[time series decomposition](https://drive.google.com/file/d/1tiOAggkGBE7ZzQ0QaOj4jMpZ4AJ4cGp1/view?usp=sharing)

In [None]:
# Your code here:
import statsmodels.api as sm

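# The seasonal period is 60 observations. statsmodels 0.11 and later name this
# argument `period`; older releases used `freq`, which is now deprecated.
res = sm.tsa.seasonal_decompose(sensor["userAcceleration.x"], period=60)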
resplot = res.plot()


Plot the decomposed time series of `rotationRate.x` also with a frequency of 60.

# Challenge 2 - Modelling the Data

To model our data, we should check a few assumptions. First, let's use `lag_plot` to detect any autocorrelation. Do this for `userAcceleration.x`.

In [None]:
import pandas as pd
from pandas.plotting import lag_plot

In [None]:
lag_plot(sensor["userAcceleration.x"], lag=1)  

Create a lag plot for `rotationRate.x`

In [None]:
# Your code here:

lag_plot(sensor["rotationRate.x"], lag=1)  

What are your conclusions from both visualizations?

In [None]:
# Your conclusions here:

# Both plots show autocorrelation, which suggests that each series has an autoregressive structure.
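
A lag plot is a visual check. The same idea can be put into a number with the lag-1 autocorrelation. The sketch below compares white noise with a simulated autoregressive series; both are synthetic, and the coefficient 0.9 is an arbitrary assumption chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1000

# White noise: consecutive values are unrelated.
noise = pd.Series(rng.normal(size=n))

# AR(1) process x[t] = 0.9 * x[t-1] + e[t]: consecutive values are strongly related.
e = rng.normal(size=n)
x = np.zeros(n)
for t in range(1, n):
    x[t] = 0.9 * x[t - 1] + e[t]
ar1 = pd.Series(x)

# Series.autocorr(lag=1) is the Pearson correlation between the series
# and a copy of itself shifted by one step.
noise_r = noise.autocorr(lag=1)
ar1_r = ar1.autocorr(lag=1)
print(noise_r, ar1_r)
```

On the lab data, `sensor["userAcceleration.x"].autocorr(lag=1)` would give the corresponding value for the first lag plot.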

The next step is to test both variables for stationarity. Perform the Augmented Dickey-Fuller test on both variables below.

In [None]:
from statsmodels.tsa.stattools import adfuller

# adfuller returns a tuple; element [1] is the p-value of the test
print(adfuller(sensor['userAcceleration.x'])[1])
print(adfuller(sensor['rotationRate.x'])[1])

What are your conclusions from this test?

In [None]:
# Your conclusions here:

# Considering the following hypotheses:
# 𝐻0 : Data is not stationary (the series has a unit root)
# 𝐻1 : Data is stationary

# From the above, we see that the p-values for both 'userAcceleration.x' and 'rotationRate.x' are very small.
# Both are below 0.05, so at the 5% significance level we reject the null
# hypothesis and conclude that the data is stationary.

Finally, we'll create an ARMA model for `userAcceleration.x`. Load the `ARMA` class from `statsmodels`. The order of the model is (2, 1). Split the data into training and test sets. Use the last 10 observations as the test set and all other observations as the training set.

In [None]:
train, test = sensor['userAcceleration.x'][:-10], sensor['userAcceleration.x'][-10:]

In [None]:
# ARMA (AutoRegressive Moving Average) 

from statsmodels.tsa.arima_model import ARMA

# fit model

model = ARMA(sensor['userAcceleration.x'], order=(2, 1))      # AR 2, MA 1
# Paolo: OK, good approach, but be careful here: you should build the model using only the training data
# rather than the entire column, i.e. pass sensor['userAcceleration.x'][:-10] (the `train` series) instead of
# sensor['userAcceleration.x'] to the model.
model_fit = model.fit(disp=False)

# make prediction

predictions = model_fit.predict(len(sensor['userAcceleration.x'])-10, len(sensor['userAcceleration.x'])-1)

pd.DataFrame({'observed':sensor['userAcceleration.x'][-10:], 'predicted':predictions})


To compare our predictions with the observed data, we can compute the RMSE (Root Mean Squared Error) from the submodule `statsmodels.tools.eval_measures`. You can read more about this function [here](https://www.statsmodels.org/dev/generated/statsmodels.tools.eval_measures.rmse.html). Compute the RMSE for the last 10 rows of the data by comparing the observed and predicted data for the `userAcceleration.x` column.

In [None]:

from statsmodels.tools.eval_measures import rmse
rmse(sensor['userAcceleration.x'][-10:],predictions, axis=0)
# Paolo: if you fit on the training data as indicated above, the error you calculate here should be larger
# than what you get now (0.0938..)
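
RMSE is the square root of the mean of the squared differences between observed and predicted values. The sketch below computes it by hand with NumPy on three made-up numbers. For one-dimensional input this should agree with `rmse` from `statsmodels.tools.eval_measures`.

```python
import numpy as np

# Made-up values for illustration only.
observed = np.array([1.0, 2.0, 3.0])
predicted = np.array([1.0, 2.0, 5.0])

# The errors are 0, 0 and -2, so the mean squared error is 4 / 3.
errors = observed - predicted
rmse_manual = np.sqrt(np.mean(errors ** 2))

print(rmse_manual)
```

Squaring the errors makes RMSE more sensitive to a few large misses than to many small ones.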

In [None]:
# Paolo: great lab, well done!