# Before you start:
- Read the README.md file
- Comment as much as you can and use the resources (README.md file)
- Happy learning!

In [None]:
# Import libraries
import numpy as np 
import pandas as pd 
import statsmodels.api as sm
from pandas.plotting import lag_plot
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.arima_model import ARMA
from statsmodels.tools.eval_measures import rmse

# Challenge 1 - Loading and Evaluating The Data

In this lab, we will look at a dataset of sensor data from a cellular phone. The phone was carried in the subject's pocket for a few minutes while they walked around. As usual, download the file from [here](https://drive.google.com/file/d/1G44c7-GImWpFjiQ0bctRNJXKus_WTOlx/view?usp=sharing) and place it in the provided data folder.

To load the data, run the code below.

In [None]:
# Load the file sub_1.csv and drop the "Unnamed: 0" column. Run this code:
sensor = pd.read_csv('../data/sub_1.csv')
sensor.drop(columns=['Unnamed: 0'], inplace=True)

Examine the data using the `head` function.

In [None]:
# Your code here:
sensor.head()

Check whether there is any missing data. If there is any missing data, remove the rows containing missing data.

In [None]:
# Your code here:
# Count missing values per column, then drop any incomplete rows.
# Zeros are valid sensor readings, so they are not treated as missing.
print(sensor.isnull().sum())
start_rows = len(sensor)
sensor = sensor.dropna(axis=0).reset_index(drop=True)
remove_rows = start_rows - len(sensor)
print('Removed', remove_rows, 'rows that had missing values.')
print('This was', (remove_rows / start_rows * 100), '% of the total data.')
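
As a minimal sketch of the missing-data check, the snippet below counts missing values per column and drops incomplete rows. The frame `toy` is a made-up stand-in for `sensor` (only the column names are borrowed from the lab), and it assumes that a reading of exactly 0 is a legitimate value rather than a missing one.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `sensor`; the values are invented.
toy = pd.DataFrame({
    "userAcceleration.x": [0.29, 0.0, np.nan, 0.11],
    "rotationRate.x": [0.31, 0.05, 0.12, np.nan],
})

# Count missing values per column; zeros are not counted as missing.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Keep only the rows that have no missing values.
clean = toy.dropna(axis=0).reset_index(drop=True)
print(len(toy) - len(clean), "rows removed")
```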

How many rows and columns are in our data?

In [None]:
# Your code here:
sensor.info()
#11 Columns; 1751 rows 
#Paolo:ok

To perform time series analysis on the data, we must change the index from a range index to a time series index. In the cell below, create a time series index using the `pd.date_range` function. Create a time series index starting at 1/1/2018 00:00:00 and ending at 1/1/2018 00:29:10. The number of periods is equal to the number of rows in `sensor`.

In [None]:
# Your code here:
timeseriesindex = pd.date_range(start='1-1-2018 00:00:00', end='1-1-2018 00:29:10', freq='s')
timeseriesindex
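
The index can be built either from a fixed step or from a fixed number of periods. The sketch below compares the two, assuming the row count of 1751 reported earlier in this lab (`n_rows` is a hard-coded stand-in for `len(sensor)`).

```python
import pandas as pd

n_rows = 1751  # assumed row count; in the lab this would be len(sensor)

# Option 1: fix both endpoints and the step; the length follows from them.
by_step = pd.date_range(start="2018-01-01 00:00:00",
                        end="2018-01-01 00:29:10", freq="s")

# Option 2: fix both endpoints and the number of periods; the step is derived.
by_periods = pd.date_range(start="2018-01-01 00:00:00",
                           end="2018-01-01 00:29:10", periods=n_rows)

print(len(by_step), len(by_periods))
```

Passing `periods=len(sensor)` ties the index length to the data, so a mismatch shows up when the index is created rather than when it is assigned.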

Assign the time series index to the dataframe's index.

In [None]:
# Your code here:
sensor['Time Series'] = timeseriesindex
sensor.set_index(['Time Series'],inplace= True)
sensor.head()


Our next step is to decompose the time series and evaluate the patterns in the data. Load the `statsmodels.api` submodule and plot the decomposition of `userAcceleration.x`. Set `freq=60` in the `seasonal_decompose` function (recent versions of statsmodels call this argument `period`). Your graph should look like the one below.

[time series decomposition](https://drive.google.com/file/d/1tiOAggkGBE7ZzQ0QaOj4jMpZ4AJ4cGp1/view?usp=sharing)

In [None]:
# Your code here:
#patterns = sm.tsa.seasonal_decompose(sensor['userAcceleration.x'],freq=60)
#FutureWarning: the 'freq' keyword is deprecated, use 'period' instead
patterns = sm.tsa.seasonal_decompose(sensor['userAcceleration.x'],period=60)
patterns.plot()

Plot the decomposed time series of `rotationRate.x` also with a frequency of 60.

In [None]:
patterns2 = sm.tsa.seasonal_decompose(sensor['rotationRate.x'],period=60)
patterns2.plot()

# Challenge 2 - Modelling the Data

To model our data, we should check a few assumptions. First, let's draw a lag plot with `lag_plot` to detect any autocorrelation. Do this for `userAcceleration.x`.

In [None]:
# Your code here:
lag_plot(sensor['userAcceleration.x'], lag=1)

Create a lag plot for `rotationRate.x`

In [None]:
# Your code here:
lag_plot(sensor['rotationRate.x'], lag=1)

What are your conclusions from both visualizations?

In [None]:
# Your conclusions here:
#Both lag plots show a strong positive linear relationship between y(t) and y(t+1),
#i.e. each series is positively autocorrelated at lag 1.
#Paolo: yes, it means there is also a correlation
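
A lag plot can be backed up with a number: the lag-1 autocorrelation is the Pearson correlation between y(t) and y(t+1). The sketch below uses two made-up series (white noise and a smoothed version of it) rather than the sensor columns.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical stand-ins: white noise has no memory, a rolling mean of it does.
noise = pd.Series(rng.normal(size=1000))
smooth = noise.rolling(window=10).mean().dropna()

# Correlation between y(t) and y(t+1), the quantity a lag-1 plot visualizes.
noise_r1 = noise.autocorr(lag=1)
smooth_r1 = smooth.autocorr(lag=1)
print(round(noise_r1, 3), round(smooth_r1, 3))
```

A value near 0 matches a shapeless cloud in the lag plot, and a value near 1 matches points hugging the diagonal. In the lab the same call would be `sensor['userAcceleration.x'].autocorr(lag=1)`.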

The next step will be to test both variables for stationarity. Perform the Augmented Dickey Fuller test on both variables below.

In [None]:
# Your code here:
adfuller(sensor['userAcceleration.x'])[1]

In [None]:
adfuller(sensor['rotationRate.x'])[1]

What are your conclusions from this test?

In [None]:
# Your conclusions here:
#Both p-values are very small (to 2 decimal places, 2.82e-30 and 6.32e-06 respectively).
#We therefore reject the null hypothesis of a unit root: both series can be treated as stationary.
#Paolo: yes

Finally, we'll create an ARMA model for `userAcceleration.x`. Load the `ARMA` class from `statsmodels`. The order of the model is (2, 1). Split the data into a training set and a test set. Use the last 10 observations as the test set and all other observations as the training set.

In [None]:
# Your code here:
# Hold out the last 10 observations as the test set.
train = sensor['userAcceleration.x'][:-10]
test = sensor['userAcceleration.x'][-10:]
model = ARMA(train, order=(2, 1))
model_fit = model.fit(disp=False)
# Predict the 10 held-out positions (out-of-sample forecasts).
predictions = model_fit.predict(start=len(train), end=len(train) + len(test) - 1)
arma = pd.DataFrame({'observed': test, 'predicted': predictions})
arma.head()
#no freq specified; so inferred freq S used 

To compare our predictions with the observed data, we can compute the RMSE (Root Mean Squared Error) with the `rmse` function from the submodule `statsmodels.tools.eval_measures`. You can read more about this function [here](https://www.statsmodels.org/dev/generated/statsmodels.tools.eval_measures.rmse.html). Compute the RMSE for the last 10 rows of the data by comparing the observed and predicted data for the `userAcceleration.x` column.

In [None]:
# Your code here:
rmse(arma['observed'], arma['predicted'], axis=0) #compare

In [None]:
rmse(sensor['userAcceleration.x'][-10:],predictions, axis=0) # Double check if it's the same vs original userAcceleration.x
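
For reference, the RMSE is the square root of the mean of the squared differences between observed and predicted values. The helper below (`rmse_manual`, a hypothetical name) is a minimal NumPy sketch of that formula on made-up numbers, handy for checking the library result by hand.

```python
import numpy as np

def rmse_manual(observed, predicted):
    """Root mean squared error: sqrt(mean((observed - predicted) ** 2))."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((observed - predicted) ** 2)))

# Errors are 1, -1, 2 -> squared errors 1, 1, 4 -> mean 2 -> RMSE sqrt(2).
example = rmse_manual([2.0, 3.0, 5.0], [1.0, 4.0, 3.0])
print(example)
```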

In [None]:
#Paolo: great lab!