# Creating Time Series Forecast using Python

This course is divided into the following sections:
- Understanding Time Series
- Data Exploration
- Time Series Forecasting using different methods

## Introduction to Time Series
Which of the following do you think is an example of time series?

![ts-1](images/ts-1.PNG)

__A time series is generally data that is collected over time and depends on it.__

Here we see that the count of cars is independent of time, hence it is not a time series. The CO2 level, on the other hand, increases with respect to time, hence it is a time series.

__Definition of time series:__ A series of data points collected in time order is known as a __time series__. Many businesses work with time series data to analyze figures such as next year's sales, website traffic, traffic counts, and the number of calls received. Time series data can be used for forecasting.

Not all data collected with respect to time represents a time series.

***
***

Some of the examples of time series are:

![ts-2](images/ts-2.PNG)

![ts-3](images/ts-3.PNG)

![ts-4](images/ts-4.PNG)

![ts-5](images/ts-5.PNG)

***
***

Now that we understand what a time series is and how it differs from a non-time series, let’s look at the components of a time series.

## Components of a Time Series

1. __Trend:__ A trend is the general direction in which something is developing or changing. In the time series below, the passenger count increases over the years, so the series has an increasing trend.

![ts-6](images/ts-6.PNG)

Example: Here the red line represents an increasing trend of the time series.

2. __Seasonality:__ Another clear pattern can be seen in the above time series: it repeats at regular time intervals. This is known as seasonality. Any predictable change or pattern in a time series that recurs over a specific time period is called seasonality. Let’s visualize the seasonality of the time series:

![ts-7](images/ts-7.PNG)

Example:  
We can see that the time series repeats its pattern every 12 months, i.e., there is a peak every year in January and a trough every year in September. Hence this time series has a seasonality of 12 months.
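
The two components can also be illustrated on synthetic data. The following is a minimal sketch, assuming a made-up monthly series (it is not the passenger data shown in the figures) built from a linear trend plus a 12-month sine pattern. All variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly series covering 4 years (illustrative, not the data in the figures)
months = pd.date_range("2015-01-01", periods=48, freq="MS")
t = np.arange(48)
trend = 100 + 2.0 * t                        # steady upward drift
seasonal = 10 * np.sin(2 * np.pi * t / 12)   # pattern that repeats every 12 months
series = pd.Series(trend + seasonal, index=months)

# A 12-month rolling mean averages over one full seasonal cycle,
# so the seasonal part cancels out and only the trend remains
rolling_trend = series.rolling(window=12).mean().dropna()

# Removing the known trend and averaging by calendar month exposes the seasonal shape
detrended = series - trend
seasonal_profile = detrended.groupby(detrended.index.month).mean()
```

Plotting `series` together with `rolling_trend` would show the smooth upward line running through the repeating yearly pattern, while `seasonal_profile` holds one average value per calendar month.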

***
***

## Difference between a time series and regression problem

Since the target variable is numerical, you might think it can be predicted using regression techniques. However, a time series problem differs from a regression problem in the following ways:

- The main difference is that a time series is time dependent, so the basic assumption of a linear regression model that the observations (x1, x2, etc.) are independent does not hold in this case.
- Along with an increasing or decreasing trend, most time series have some form of seasonality, i.e., variations specific to a particular time frame.


Time series methods also account for the autocorrelation between successive observations, which is typically present in such data, whereas ordinary regression presumes that the errors are serially independent, or at least that their dependence is minimal.

Reference: [AV discussion forum](https://discuss.analyticsvidhya.com/t/difference-between-regression-and-time-series/82364)

So, predicting a time series using plain regression techniques that ignore this time dependence is not a good approach.
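
The dependence between successive observations can be measured with the lag-1 autocorrelation. The following is a minimal sketch on simulated data (both series are hypothetical and generated only for illustration), using the pandas method `Series.autocorr`:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
n = 500

# Independent observations, as a linear regression model assumes
independent = pd.Series(rng.normal(size=n))

# Time-dependent observations: noise of the same scale on top of an upward trend
time_dependent = pd.Series(0.05 * np.arange(n) + rng.normal(size=n))

# Correlation of each series with itself shifted by one time step
ac_independent = independent.autocorr(lag=1)
ac_time_dependent = time_dependent.autocorr(lag=1)

print(f"independent:    {ac_independent:.2f}")
print(f"time dependent: {ac_time_dependent:.2f}")
```

For the independent series the value should be close to 0, while for the trending series it should be close to 1, which is the kind of dependence that violates the independence assumption.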

Time series analysis comprises methods for analyzing time series data in order to extract meaningful statistics and other characteristics of the data. Time series forecasting is the use of a model to predict future values based on previously observed values.

***
***

## Problem Statement
Unicorn Investors wants to make an investment in a new form of transportation - JetRail. JetRail uses jet propulsion technology to run rails and move people at high speed! The investment would only make sense if JetRail can get more than 1 million monthly users within the next 18 months. In order to help Unicorn Investors in their decision, you need to forecast the traffic on JetRail for the next 7 months. You are provided with traffic data of JetRail since inception in the train file.

You can get the dataset [here](https://datahack.analyticsvidhya.com/contest/practice-problem-time-series-2/).
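
Because the investment criterion is expressed in monthly users, counts recorded at a finer granularity eventually have to be aggregated to months. The following is a minimal sketch, assuming hourly observations and treating the summed passenger count as a stand-in for monthly users. Both are assumptions, and the numbers are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly passenger counts for 90 days (made-up numbers, not the JetRail data)
idx = pd.date_range("2020-01-01", periods=24 * 90, freq=pd.Timedelta(hours=1))
hourly = pd.Series(np.full(len(idx), 1400), index=idx)

# Sum the hourly counts within each calendar month ("MS" labels each bin by its month start)
monthly = hourly.resample("MS").sum()

TARGET = 1_000_000  # threshold taken from the problem statement
meets_target = monthly > TARGET

print(monthly)
print(meets_target)
```

With a constant hourly count, the shorter month falls below the threshold while the longer ones exceed it, which shows why the comparison should be made on monthly totals rather than on hourly or daily averages.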

## Table of Contents
1. Understanding Data:
    - Hypothesis Generation
    - Getting the system ready and loading the data
    - Dataset Structure and Content
    - Feature Extraction
    - Exploratory Analysis
2. Forecasting using Multiple Modeling Techniques:
    - Splitting the data into training and validation part
    - Modeling techniques
    - Holt’s Linear Trend Model on daily time series
    - Holt Winter’s Model on daily time series
    - Introduction to ARIMA model
    - Parameter tuning for ARIMA model
    - SARIMAX model on daily time series

### Hypothesis Generation 
Hypothesis generation is the process of listing all the possible factors that can affect the outcome, i.e., our dependent variable.

Hypothesis generation is done before looking at the data in order to avoid any bias that may result from observing it.

Below are some of the hypotheses which I think can affect the passenger count (the dependent variable for this time series problem) on JetRail:

- There will be an increase in traffic as the years pass.  
    Explanation - Population has a general upward trend over time, so I can expect more people to travel by JetRail. Also, companies generally expand their businesses over time, leading to more customers travelling by JetRail.
- The traffic will be high from May to October.  
    Explanation - Tourist visits generally increase during this time period.
- Traffic on weekdays will be higher than on weekends/holidays.  
    Explanation - People go to the office on weekdays, so the traffic will be higher.
- Traffic during the peak hours will be high.  
    Explanation - People travel to work and college.

We will try to validate each of these hypotheses based on the dataset. Now let’s have a look at the dataset.
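
Each hypothesis above corresponds to a feature that can be derived from a timestamp (year, month, day of week, hour). The following is a minimal sketch of how such a check could look, using a hypothetical two-week hourly series in which the weekday and peak-hour effects are built in by construction. The column names `hour`, `day_of_week` and `weekend`, and the chosen peak hours, are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical hourly timestamps for two weeks, starting on a Monday (not the JetRail data)
idx = pd.date_range("2014-01-06", periods=24 * 14, freq=pd.Timedelta(hours=1))
df = pd.DataFrame({"Datetime": idx})

# Features derived from the timestamp
df["hour"] = df["Datetime"].dt.hour
df["day_of_week"] = df["Datetime"].dt.dayofweek        # Monday=0 ... Sunday=6
df["weekend"] = (df["day_of_week"] >= 5).astype(int)   # 1 for Saturday/Sunday

# Made-up counts: a base level, a weekday bonus, and a peak-hour bonus
is_peak = df["hour"].isin([8, 9, 17, 18])
df["Count"] = 10 + 5 * (1 - df["weekend"]) + 8 * is_peak.astype(int)

# Compare average traffic across the groups named in the hypotheses
weekday_vs_weekend = df.groupby("weekend")["Count"].mean()
hourly_profile = df.groupby("hour")["Count"].mean()

print(weekday_vs_weekend)
print(hourly_profile.idxmax())
```

On real data the same group-wise means (or bar plots of them) would indicate whether a hypothesis is supported, rather than confirming an effect that was put in by hand.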

### Getting the system ready and loading the data

In [1]:
import pandas as pd              # For loading and manipulating tabular data
import numpy as np               # For mathematical calculations
import matplotlib.pyplot as plt  # For plotting graphs
from datetime import datetime    # To access datetime
from pandas import Series        # To work on series
%matplotlib inline
import warnings                  # To control warning messages

In [2]:
# Now let’s read the train and test data
train = pd.read_csv("Train_SU63ISt.csv")
test = pd.read_csv("Test_0qrQsBZ.csv")

Let’s make a copy of the train and test data so that even if we make changes to these datasets we do not lose the originals.

In [3]:
train_original = train.copy()
test_original = test.copy()
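
The reason for calling `.copy()` rather than assigning with `=` can be seen in a small sketch on a hypothetical toy DataFrame (not the JetRail data):

```python
import pandas as pd

# Hypothetical toy frame, for illustration only
original = pd.DataFrame({"ID": [0, 1, 2], "Count": [8, 2, 6]})

alias = original          # no copy: both names refer to the same object
backup = original.copy()  # independent copy of the data (deep by default)

# Modify the original in place
original.loc[0, "Count"] = 999

alias_changed = alias.loc[0, "Count"] == 999     # the alias sees the change
backup_preserved = backup.loc[0, "Count"] == 8   # the copy keeps the old value

print(alias_changed, backup_preserved)
```

Plain assignment only creates a second name for the same DataFrame, so later edits would also alter the supposed backup.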

### Dataset Structure and Content

In [4]:
train.columns

Index(['ID', 'Datetime', 'Count'], dtype='object')

In [5]:
test.columns

Index(['ID', 'Datetime'], dtype='object')

Let’s understand each feature first:

- __ID__ is the unique number given to each observation point.
- __Datetime__ is the time of each observation.
- __Count__ is the passenger count corresponding to each Datetime.

Let’s look at the data types of each feature.

In [6]:
train.dtypes

ID           int64
Datetime    object
Count        int64
dtype: object

In [7]:
test.dtypes

ID           int64
Datetime    object
dtype: object

__ID__ and __Count__ are in integer format, while __Datetime__ is in object format in the train file. The test file has the same types for __ID__ and __Datetime__.
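
An object-typed Datetime column holds plain text, so it usually has to be converted before any date-based operation. The following is a minimal sketch using `pd.to_datetime` on hypothetical rows. The day-month-year format string `"%d-%m-%Y %H:%M"` is an assumption about how the timestamps are written and should be checked against the actual file.

```python
import pandas as pd

# Hypothetical rows mimicking the column layout (values are made up)
sample = pd.DataFrame({
    "ID": [0, 1, 2],
    "Datetime": ["25-08-2020 00:00", "25-08-2020 01:00", "25-08-2020 02:00"],
    "Count": [8, 2, 6],
})

# Parse the text column into proper timestamps (format string is an assumption)
sample["Datetime"] = pd.to_datetime(sample["Datetime"], format="%d-%m-%Y %H:%M")

print(sample.dtypes)
print(sample["Datetime"].dt.hour.tolist())
```

Passing an explicit `format` avoids ambiguity between day-first and month-first dates and makes parsing fail loudly if a value does not match.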

Now let’s look at the shape of the dataset.

In [8]:
train.shape

(18288, 3)

In [9]:
test.shape

(5112, 2)