# Data Acquisition and Processing Systems (DaPS) (ELEC0136)    
### Final Assignment
---

<div class="alert alert-heading alert-info">

## Task 1: Data Acquisition

You will first have to acquire the necessary data for conducting your study. One essential type of
data that you will need is the stock prices for each company from April 2017 to April 2021, as
described in Section 1. Since these companies are public, the data is made available online. The
first task is for you to search and collect this data, finding the best way to access and download
it. A good place to look is on platforms that provide free data relating to the stock market such as
Google Finance or Yahoo! Finance.

[Optional] Providing more than one method to acquire the very same or different data, e.g. from
a downloaded comma-separated-value file and a web API, will result in a higher score.

There are many valuable sources of information for analysing the stock market. In addition to time
series depicting the evolution of stock prices, acquire auxiliary data that is likely to be useful for
the forecast, such as:

- Social media, e.g., Twitter: This can be used to uncover the public’s sentiment
towards the stock market
- Financial reports: This can help explain what kind of factors are likely to affect the stock
market the most
- News: This can be used to draw links between current affairs and the stock market
- Climate data: Sometimes weather data is directly correlated with some companies’ stock
prices and should therefore be taken into account in financial analysis
- Others: anything that can justifiably support your analysis.

Remember, you are looking for historical data, not live data.
   
    
</div>
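
As one possible starting point for the file-based route, the sketch below parses a downloaded CSV export and keeps only the rows inside the study window. The column names `Date` and `Close`, the sample values, and the helper name `parse_price_csv` are assumptions made for illustration; check them against the file you actually download.

```python
import csv
import io
from datetime import date

# Assumed study window: April 2017 to April 2021, inclusive.
START, END = date(2017, 4, 1), date(2021, 4, 30)

def parse_price_csv(text, start=START, end=END):
    """Parse CSV text with 'Date' and 'Close' columns into (date, close) pairs.

    Rows outside [start, end] or with an unparseable close are skipped.
    """
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        day = date.fromisoformat(record["Date"])
        if not (start <= day <= end):
            continue
        try:
            close = float(record["Close"])
        except ValueError:
            continue  # e.g. a literal "null" placeholder in the export
        rows.append((day, close))
    return sorted(rows)

# Made-up sample values, only to show the expected shape of the input.
sample = """Date,Open,Close
2017-03-31,10.0,10.5
2017-04-03,10.5,11.0
2017-04-04,11.0,null
2021-04-30,20.0,21.5
2021-05-03,21.5,22.0
"""
prices = parse_price_csv(sample)
```

A second acquisition method, such as a web API, could return data in the same `(date, close)` shape so that the later steps do not depend on where the data came from.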

In [1]:
def acquire():
    # Implement me, and remove the exception below.
    # Make sure you return what you need.
    raise NotImplementedError()

<div class="alert alert-heading alert-info">
    
## Task 2: Data Storage

Once you have found a way to acquire the relevant data, you need to decide on how to store it.
You should choose a format that allows efficient read access for training a parametric
model. The data corpus should also be easy to inspect. Data can be stored
locally on your computer.
    
</div>
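
A minimal sketch of one storage option, assuming a local SQLite database accessed through Python's built-in `sqlite3` module. The table layout and the helper names `store_prices` and `load_prices` are hypothetical; other formats may suit your data better.

```python
import sqlite3

def store_prices(conn, ticker, rows):
    """Insert (ISO date string, close) pairs for one ticker, replacing duplicates."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS prices ("
        "ticker TEXT NOT NULL, day TEXT NOT NULL, close REAL NOT NULL, "
        "PRIMARY KEY (ticker, day))"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO prices (ticker, day, close) VALUES (?, ?, ?)",
        [(ticker, day, close) for day, close in rows],
    )
    conn.commit()

def load_prices(conn, ticker):
    """Return the stored series for one ticker in chronological order."""
    cursor = conn.execute(
        "SELECT day, close FROM prices WHERE ticker = ? ORDER BY day", (ticker,)
    )
    return cursor.fetchall()

# ":memory:" keeps the example self-contained; pass a file path to persist to disk.
conn = sqlite3.connect(":memory:")
store_prices(conn, "EXAMPLE", [("2017-04-04", 11.2), ("2017-04-03", 11.0)])
store_prices(conn, "EXAMPLE", [("2017-04-04", 11.5)])  # overwrites the earlier row
stored = load_prices(conn, "EXAMPLE")
```

ISO-formatted date strings sort chronologically, which is why `ORDER BY day` is enough here, and the resulting file can be inspected with any SQLite client.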

In [None]:
def store(data):
    # Implement me, and remove the exception below.
    # Make sure you return what you need.
    raise NotImplementedError()

<div class="alert alert-heading alert-warning">

[Optional] Create a simple API that allows Al to retrieve the full set of data you collected. It is enough to provide a single access point that returns all the data; you do not need to implement a query mechanism. The API must be accessible from the web, so if you attempt this task, the data must be stored online.
    
</div>
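
The sketch below illustrates the single-access-point idea with Python's standard `http.server` module. It only serves on the local machine, so it is not a web-accessible deployment; the `/data` path, the `DATASET` payload, and the `DataHandler` class are assumptions made for illustration.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Placeholder payload; in practice this would be loaded from your data store.
DATASET = {"EXAMPLE": [{"date": "2017-04-03", "close": 11.0}]}

class DataHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # A single endpoint that returns everything; no query mechanism.
        if self.path != "/data":
            self.send_error(404, "Unknown endpoint")
            return
        body = json.dumps(DATASET).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # silence per-request logging in the notebook

# Port 0 asks the OS for any free port; the server runs in a background thread.
server = HTTPServer(("127.0.0.1", 0), DataHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Client side: fetch the whole dataset with one GET request.
client = http.client.HTTPConnection("127.0.0.1", server.server_port, timeout=5)
client.request("GET", "/data")
fetched = json.loads(client.getresponse().read().decode("utf-8"))
client.close()

server.shutdown()
server.server_close()
```

To satisfy the requirement that the API be reachable from the web, the same handler logic would need to run on a publicly hosted service of your choice.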

In [3]:
def retrieve(data):
    # Implement me, and remove the exception below.
    # Make sure you return what you need.    
    raise NotImplementedError()

<div class="alert alert-heading alert-info">

## Task 3: Data Preprocessing

Now that you have the data stored, you can start preprocessing it. Think about what features to
keep, which ones to transform, combine or discard. Make sure your data is clean and consistent
(e.g., are there many outliers? any missing values?). You are expected to:

1. Clean the data from missing values and outliers, if any.
2. Provide useful visualisations of the data. Plots should be saved to disk rather than
displayed in the Jupyter notebook.
3. Transform your data (e.g., using normalization, dimensionality reduction, etc.) to improve
the forecasting performance.

</div>
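
The cleaning and transformation steps above could be sketched as follows, using only the standard library. The choice of forward filling, the 1.5 × IQR outlier rule, and min-max scaling are assumptions made for illustration; whether they are appropriate depends on your data.

```python
from statistics import quantiles

def forward_fill(values):
    """Replace None with the most recent observed value (leading gaps stay None)."""
    filled, last = [], None
    for value in values:
        if value is not None:
            last = value
        filled.append(last)
    return filled

def iqr_outlier_mask(values, k=1.5):
    """Flag values lying more than k * IQR outside the first or third quartile."""
    q1, _, q3 = quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [not (low <= v <= high) for v in values]

def min_max_scale(values):
    """Linearly rescale values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Made-up series with one missing value and one obvious spike.
raw = [10.0, None, 10.4, 10.2, 95.0, 10.6, 10.8]
filled = forward_fill(raw)
mask = iqr_outlier_mask(filled)
# Dropping flagged points is one option; replacing them is another.
cleaned = [v for v, is_outlier in zip(filled, mask) if not is_outlier]
scaled = min_max_scale(cleaned)
```

If you scale the data, compute the scaling parameters on the training portion only, so that no information from the forecast window leaks into training.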

In [None]:
def process(data):
    # Implement me, and remove the exception below.
    # Make sure you return what you need.
    raise NotImplementedError()

<div class="alert alert-heading alert-info">
    
## Task 4: Data Exploration

After ensuring that the data is well preprocessed, it is time to start exploring it in order to form
hypotheses and intuitions about the patterns it may contain. Depending on the data,
different EDA (exploratory data analysis) techniques can be applied, and a large amount of
information can be extracted.
For example, you could do the following analysis:

    
- Time series data is normally a combination of several components:
  - Trend represents the overall tendency of the data to increase or decrease over time.
  - Seasonality is related to the presence of recurrent patterns that appear after regular
intervals (like seasons).
  - Random noise is often hard to explain and represents all those changes in the data
that seem unexpected. Sometimes sudden changes are related to fixed or predictable
events (e.g., public holidays).
- Feature correlation provides additional insight into the data structure. Scatter plots and
boxplots are useful tools to spot relevant information.
- Explain unusual behaviour.
- Explore the correlation between stock price data and other external data that you can
collect (as listed in Sec 2.1)
- Use hypothesis testing to better understand the composition of your dataset and its
representativeness.

    
At the end of this step, provide key insights on the data. This data exploration procedure should
inform the subsequent data analysis/inference procedure, allowing one to establish a predictive
relationship between variables.

</div>
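
To make the trend and correlation ideas concrete, here is a small sketch on synthetic data. The series, the window length, and the `aux` signal are made up for illustration; real stock data will not decompose this cleanly.

```python
from math import sqrt
from statistics import mean

def moving_average(values, window):
    """Trailing moving average; returns len(values) - window + 1 points."""
    return [mean(values[i - window:i]) for i in range(window, len(values) + 1)]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Synthetic series: a linear trend plus a repeating pattern of period 4.
pattern = [1.0, -1.0, 2.0, -2.0]
series = [0.5 * t + pattern[t % 4] for t in range(24)]

# Averaging over one full period cancels the repeating pattern, leaving the trend.
trend = moving_average(series, window=4)

# What remains is the repeating pattern plus a constant offset from the trailing window.
detrended = [s - m for s, m in zip(series[3:], trend)]

# Hypothetical auxiliary signal that is an exact linear function of the series.
aux = [2.0 * s + 1.0 for s in series]
rho = pearson(series, aux)
```

A correlation close to 1 or -1 between the price series and an auxiliary series suggests a linear relationship worth investigating, although it does not by itself establish that one drives the other.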

In [None]:
def explore(data):
    # Implement me, and remove the exception below.
    # Make sure you return what you need.
    raise NotImplementedError()

<div class="alert alert-heading alert-info">

## Task 5: Inference

Train a model to predict the closing stock price on each day for the data you have already
collected, stored, preprocessed and explored in the previous steps. The data must span
April 2017 to April 2021.
You should develop two separate models:


1. A model for predicting the closing stock price on each day for a 1-month time window (until
    end of May 2021), using only time series of stock prices.
2. A model for predicting the closing stock price on each day for a 1-month time window (until
    end of May 2021), using the time series of stock prices and the auxiliary data you collected.

Which model is performing better? How do you measure performance and why? How could you
further improve the performance? Are the models capable of predicting the closing stock prices
far into the future?

[IMPORTANT NOTE] For these tasks, you are not expected to compare model architectures, but to
examine and analyse the differences when training the same model with different data attributes
and information from different sources. Therefore, you should decide on a single model suitable for
time series data to solve the tasks described above. Please see the lecture slides for tips on model
selection and feel free to experiment before selecting one.
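
One way to keep the model fixed while varying its inputs is to build two feature sets from the same series, one with and one without the auxiliary data. The sketch below does this with lagged prices; the helper names, the number of lags, and the `sentiment` values are hypothetical.

```python
def make_lagged_rows(prices, n_lags, aux=None):
    """Build (features, target) pairs where the target is the next closing price.

    Features are the previous n_lags prices, optionally followed by the
    auxiliary value observed on the last of those days.
    """
    rows = []
    for t in range(n_lags, len(prices)):
        features = list(prices[t - n_lags:t])
        if aux is not None:
            features.append(aux[t - 1])
        rows.append((features, prices[t]))
    return rows

def chronological_split(rows, train_fraction=0.8):
    """Split without shuffling so that validation data is strictly later in time."""
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]

prices = [10.0, 10.5, 10.2, 10.8, 11.0, 11.3, 11.1]
sentiment = [0.1, 0.3, -0.2, 0.4, 0.2, 0.5, 0.0]  # hypothetical auxiliary series

price_only = make_lagged_rows(prices, n_lags=2)
with_aux = make_lagged_rows(prices, n_lags=2, aux=sentiment)
train_rows, val_rows = chronological_split(price_only)
```

Training the same model once on `price_only` and once on `with_aux`, with identical splits, isolates the effect of the auxiliary data from the effect of the model choice.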

The following would help you evaluate your approach and highlight potential weaknesses in your
process:

1. Evaluate the performance of your model using different metrics, e.g. mean squared error,
    mean absolute error or R-squared.
2. Use ARIMA and Facebook Prophet to explore the uncertainty on your model’s predicted
    values by employing confidence bands.
3. Result visualization: create joint plots showing marginal distributions to understand the
    correlation between actual and predicted values.
4. Finding the mean, median and skewness of the residual distribution might provide
    additional insight into the predictive capability of the model.
</div>
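
The metrics in items 1 and 4 can be computed directly from paired actual and predicted values. The following is a minimal sketch with made-up numbers; the function names are hypothetical, and the skewness shown is the population (biased) estimate.

```python
from statistics import mean, median

def regression_metrics(actual, predicted):
    """Return MSE, MAE and R-squared for paired actual/predicted values."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    mse = mean(r ** 2 for r in residuals)
    mae = mean(abs(r) for r in residuals)
    centre = mean(actual)
    ss_tot = sum((a - centre) ** 2 for a in actual)
    ss_res = sum(r ** 2 for r in residuals)
    return {"mse": mse, "mae": mae, "r2": 1.0 - ss_res / ss_tot}

def residual_summary(actual, predicted):
    """Mean, median and population skewness of the residuals."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    mu = mean(residuals)
    m2 = mean((r - mu) ** 2 for r in residuals)
    m3 = mean((r - mu) ** 3 for r in residuals)
    skew = m3 / m2 ** 1.5 if m2 > 0 else 0.0
    return {"mean": mu, "median": median(residuals), "skewness": skew}

# Made-up values, only to show the calls.
actual = [10.0, 12.0, 14.0, 16.0]
predicted = [11.0, 12.0, 13.0, 16.0]
metrics = regression_metrics(actual, predicted)
summary = residual_summary(actual, predicted)
```

A residual mean far from zero points to systematic over- or under-prediction, while strong skewness indicates that errors in one direction are larger than in the other.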

In [None]:
def train(model, train_data):
    # Implement me, and remove the exception below.
    # Make sure you return what you need.
    raise NotImplementedError()

In [None]:
def evaluate(trained_model, val_data):
    # Implement me, and remove the exception below.
    # Make sure you return what you need.
    raise NotImplementedError()

<div class="alert alert-heading alert-danger">

## Autorun

</div>

In [5]:
def main():
    """This function will be called to check that your work is reproducible, and it is the only function that will be called by us.
    It should perform all the work that you used to support your Report, e.g. storing data, running experiments, saving figures.
    
    Example:
    ```
    data = acquire()
    store(data)
    data = process(data)
    explore(data)
    model = create_model()
    trained_model = train(model, data)
    performance = evaluate(trained_model, data)
    ```
    """
    raise NotImplementedError()