<a href="https://colab.research.google.com/github/purple-affogato/RezoomAI/blob/main/Copy_of_ACM_AIML_Series_Demo1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ML Workshop 1

Welcome to ACM's Data Analysis & Engineering demo! Follow along with the prompts. :)

# Imports

What's nice about Google Colab is that most of the packages we need are already installed, so there's no need to worry about managing packages and virtual environments.

Libraries we'll be using:
- pandas
- matplotlib.pyplot
- datetime
- seaborn
- numpy

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import datetime as dt
import seaborn as sns
import numpy as np

# Loading Stock Data

We'll be using historical stock prices of Apple (AAPL) for this demo! Download "aapl_10y.csv" from [here](https://drive.google.com/file/d/1346lULourghgUn0dEvVbG6nfOYPCAQpk/view?usp=drive_link).

In [None]:
# start & end dates of our data
start = dt.date(2015, 1, 1)
end = dt.date(2025, 1, 1)

In [None]:
# read in csv

In [None]:
# look at the dataframe!
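
As a rough sketch of the two cells above: `pd.read_csv` takes a file path or a file-like object and returns a dataframe. The snippet below reads a tiny in-memory stand-in instead of the real file, and the column names and values are made up for illustration, so your actual "aapl_10y.csv" may look different.

```python
import io
import pandas as pd

# Stand-in for "aapl_10y.csv". Column names and values are toy assumptions.
csv_text = """Date,Open,High,Low,Close,Volume
2015-01-02,100.0,101.0,99.0,100.5,1000
2015-01-05,100.5,102.0,100.0,101.5,1100
2015-01-06,101.5,103.0,101.0,102.5,1200
"""

# In the workshop this would be something like pd.read_csv("aapl_10y.csv")
aapl = pd.read_csv(io.StringIO(csv_text))

# In a notebook, writing `aapl` on the last line of a cell displays it
print(aapl)
```

In Colab, you'd first upload the file through the Files sidebar so that the path resolves.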

# Basic Functions

- info
- describe
- head
- tail

## Getting a Summary of Your Data

How can we see a summary of our dataset?

In [None]:
# i

How can we see the descriptive stats of our dataset?

In [None]:
# d

Get the first five entries of your dataframe.

In [None]:
# h

Get the last five entries of your dataframe.

In [None]:
# t

We can easily make boxplots in pandas as well for each feature.

In [None]:
# box
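
Here's a minimal sketch of the four basic functions, run on a toy dataframe rather than the real AAPL data (the `Close` and `Volume` columns and their values are assumptions for illustration).

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the AAPL data (hypothetical columns and values)
aapl = pd.DataFrame({
    "Close": np.arange(100.0, 110.0),   # 100.0, 101.0, ..., 109.0
    "Volume": np.arange(1000, 1010),
})

aapl.info()              # prints column names, dtypes, and non-null counts
stats = aapl.describe()  # count, mean, std, min, quartiles, max per numeric column
first = aapl.head()      # first five rows by default
last = aapl.tail()       # last five rows by default

print(stats)
```

For the boxplot cell, `aapl.boxplot()` draws one box per numeric column using matplotlib.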

## Accessing Features and Entries

In [None]:
# access a feature (column)


In [None]:
# let's try another feature!

In [None]:
# access an entry (row) using integer indices

In [None]:
# you can also put a range to get entries
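
One possible way to fill in these cells, shown on a toy dataframe (the column names and values are assumptions): square brackets select a column, and `.iloc` selects rows by integer position.

```python
import pandas as pd

# Toy frame with hypothetical columns
aapl = pd.DataFrame({
    "Close": [100.0, 101.0, 102.0, 103.0],
    "Volume": [1000, 1100, 1200, 1300],
})

close = aapl["Close"]          # a feature (column) comes back as a Series
volume = aapl["Volume"]        # another feature
first_row = aapl.iloc[0]       # an entry (row) by integer position
first_three = aapl.iloc[0:3]   # a range of rows; the end position is excluded

print(first_three)
```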

# Resampling

Right now, our dataset only has entries for dates when the stock market is open (it excludes weekends and holidays). To give our data a regular daily frequency, we want to **impute** rows for the missing dates and deal with the resulting NaN values.

## Cast Index into Datetime

In [None]:
# first cast the column into datetime

In [None]:
# set aapl with updated datetime column

Set the Date column to the index.

In [None]:
# set index to Date

In [None]:
# double check your dataframe
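
The steps above could look roughly like this. It's a sketch on toy data, assuming the date column is named `Date` and holds strings like `"2015-01-02"`.

```python
import pandas as pd

# Toy frame; the "Date" column starts out as plain strings
aapl = pd.DataFrame({
    "Date": ["2015-01-02", "2015-01-05", "2015-01-06"],
    "Close": [100.0, 101.0, 102.0],
})

# Cast the column into datetime and store it back on the dataframe
aapl["Date"] = pd.to_datetime(aapl["Date"])

# Set the Date column as the index
aapl = aapl.set_index("Date")

print(aapl.index)
```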

## Imputing

First, let's only consider the "Close" feature. For clarification, "Close" refers to the **closing price** of a stock at the end of a trading day.

**Reindex** your dataset to add in missing dates with NaN values.

In [None]:
# reindex

In [None]:
# double check dataframe
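
A small sketch of reindexing, assuming a dataframe with a datetime index and a single `Close` column (toy values). `pd.date_range` builds every calendar day in the range, and `reindex` adds the missing days as NaN rows.

```python
import pandas as pd

# Toy data: Friday, then the following Monday and Tuesday (the weekend is missing)
close = pd.DataFrame(
    {"Close": [100.0, 103.0, 104.0]},
    index=pd.to_datetime(["2015-01-02", "2015-01-05", "2015-01-06"]),
)

# Every calendar day between the first and last date
all_days = pd.date_range(start=close.index.min(), end=close.index.max(), freq="D")

# Dates that were not in the original index get NaN
close_daily = close.reindex(all_days)

print(close_daily)
```

With the real data, you could pass the `start` and `end` dates defined earlier to `pd.date_range` instead.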

## Dealing with NaN Values

Two options:
- dropping
- interpolating

Dropping NaN values is very simple with pandas.

In [None]:
# dropping NaN values

However, dropping NaN values only makes sense when the entries we drop are insignificant. Here, dropping would simply undo the reindexing, so it's better to **impute** values. For this demo, we'll be using **spline interpolation** to fill in NaN values. Feel free to look up how it works after the workshop!

In [None]:
# create a copy of reindexed dataframe
# interpolate
# show the new dataframe

Read the [pandas interpolate documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.interpolate.html).
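
To make the two options concrete, here's a hedged sketch on toy data. Note the swap: it uses `method="time"` (linear interpolation based on the datetime index) as a stand-in, because pandas' spline method needs SciPy and more data points than this tiny example has. The spline call is left as a comment.

```python
import numpy as np
import pandas as pd

# Toy reindexed data with a two-day gap
idx = pd.date_range("2015-01-02", "2015-01-06", freq="D")
close_daily = pd.DataFrame(
    {"Close": [100.0, np.nan, np.nan, 103.0, 104.0]}, index=idx
)

# Option 1: drop the rows that contain NaN
dropped = close_daily.dropna()

# Option 2: interpolate on a copy so the reindexed frame stays untouched
filled = close_daily.copy()
filled["Close"] = filled["Close"].interpolate(method="time")

# Spline version for the full dataset (needs SciPy and more known points than `order`):
# filled["Close"] = filled["Close"].interpolate(method="spline", order=3)

print(filled)
```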

## Visualizing Data

After resampling our data, we can get a good sense of it by visualizing it. This is very easy to do in Python using matplotlib!

In [None]:
# make a plot of our interpolated data with a title and axis labels
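
A minimal plotting sketch with toy data. The title and label text are just placeholders, and the `Agg` backend line is only there so the snippet runs outside a notebook (you don't need it in Colab).

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in Colab
import matplotlib.pyplot as plt
import pandas as pd

# Toy daily series standing in for the interpolated closing prices
idx = pd.date_range("2015-01-01", periods=10, freq="D")
close = pd.Series(range(100, 110), index=idx, name="Close")

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(close.index, close.values)
ax.set_title("AAPL closing price (toy data)")
ax.set_xlabel("Date")
ax.set_ylabel("Close (USD)")
fig.autofmt_xdate()  # tilt the date labels so they don't overlap
```

In a notebook, the figure shows up under the cell automatically.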

## Groupby using Datetime Index

Right now, our graph looks very bumpy because it shows every daily fluctuation. To make it smoother, we can use **groupby** to aggregate the data so that each entry is a month apart.

In [None]:
# make our datetime index monthly

In [None]:
# check the dataframe!

[Documentation of pandas Grouper](https://pandas.pydata.org/docs/reference/api/pandas.Grouper.html#pandas.Grouper)

[Frequencies for Groupers](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases)

MS means "month start". From the link above, let's try to **group by year**.

In [None]:
# make our datetime index yearly

In [None]:
# check the dataframe
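
Here's one way the monthly and yearly grouping could look, on toy data. Taking the mean of each group is an assumption here; any other aggregation (like `first` or `median`) plugs in the same way. `"MS"` and `"YS"` are the month-start and year-start offset aliases.

```python
import numpy as np
import pandas as pd

# Toy daily data covering two full years
idx = pd.date_range("2015-01-01", "2016-12-31", freq="D")
daily = pd.DataFrame({"Close": np.arange(len(idx), dtype=float)}, index=idx)

# Group the datetime index by month start, then average each month
monthly = daily.groupby(pd.Grouper(freq="MS")).mean()

# Same idea with year start
yearly = daily.groupby(pd.Grouper(freq="YS")).mean()

print(monthly.head())
print(yearly)
```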

Let's compare our daily and monthly data by graphing the two different features.

In [None]:
# graph two features on one plot to compare
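
A sketch of plotting both series on the same axes, again with toy data and placeholder labels. Calling `ax.plot` twice on the same `ax` overlays the lines, and `label` plus `ax.legend()` tells them apart.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in Colab
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Toy daily series with some wiggle, plus its monthly mean
idx = pd.date_range("2015-01-01", "2015-12-31", freq="D")
daily = pd.Series(100 + np.sin(np.arange(len(idx)) / 5.0), index=idx)
monthly = daily.groupby(pd.Grouper(freq="MS")).mean()

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(daily.index, daily.values, label="Daily")
ax.plot(monthly.index, monthly.values, label="Monthly mean")
ax.set_title("Daily vs. monthly (toy data)")
ax.set_xlabel("Date")
ax.set_ylabel("Close")
ax.legend()
```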

# Correlation

We can gain a better understanding of our data by testing correlation against other features and previous entries (autocorrelation). This can help us decide what kinds of models we can fit our data to.

## Merging Dataframes

Finding if our data correlates with other sources of data can help us make better decisions on how to use our data and build our models.

For example, the way our dataframe is right now, we can only use time series ML algorithms. Adding other features to the dataframe opens up the possibility of using other algorithms and maybe creating more accurate models.

In this example, we'll be using monthly GDP to compare with AAPL's stock prices. To do so, we need to first [download the data](https://drive.google.com/drive/folders/1xattZsDEh-ZMBNX794Ib5-7Vqvqtwv-J?usp=sharing) and merge it with our current dataframe.

First, we need to read in our GDP data and cast its index into datetime64.

In [None]:
# read in GDP.csv

In [None]:
# check the dataframe

In [None]:
# check dataframe's info

In [None]:
# cast date column to datetime

In [None]:
# set the index

In [None]:
# double check the dataframe

Now that our two dataframes have the same index, we can merge them.

In [None]:
# merge (inner join)

Read the [pandas merge documentation](https://pandas.pydata.org/docs/reference/api/pandas.merge.html#pandas.merge).

SQL Joins reference:

![SQL joins](https://knowthecode.io/wp-content/uploads/2016/10/sql-joins-diagram.png)

In [None]:
# check the dataframe
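
A rough sketch of an inner join on the index, using two toy dataframes (the `Close` and `GDP` column names and all values are assumptions). Only dates that appear in both indexes survive an inner join.

```python
import pandas as pd

# Toy monthly stock data
aapl_monthly = pd.DataFrame(
    {"Close": [100.0, 101.0, 102.0]},
    index=pd.to_datetime(["2015-01-01", "2015-02-01", "2015-03-01"]),
)

# Toy GDP data whose dates only partly overlap with the stock data
gdp = pd.DataFrame(
    {"GDP": [1000.0, 1010.0, 1020.0]},
    index=pd.to_datetime(["2015-02-01", "2015-03-01", "2015-04-01"]),
)

# Join on the index of both frames, keeping only the shared dates
merged = pd.merge(aapl_monthly, gdp, left_index=True, right_index=True, how="inner")

print(merged)
```

Swapping `how="inner"` for `"left"`, `"right"`, or `"outer"` gives the other joins from the diagram.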

## Testing for Correlation

First, let's test if there's any direct correlation between AAPL stock prices and GDP.

To do so, we can make pairplots and calculate the direct correlation coefficient.

In [None]:
# make a pairplot

In [None]:
# use numpy to get the direct correlation coefficient

Does it seem that AAPL stock prices and GDP have correlation?
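
As a sketch of the coefficient step, `np.corrcoef` returns a 2x2 correlation matrix for two series, and the off-diagonal entry is the correlation coefficient. The toy values below are perfectly linear on purpose, so they say nothing about the real data.

```python
import numpy as np
import pandas as pd

# Toy merged frame (hypothetical columns and values)
merged = pd.DataFrame({
    "Close": [100.0, 102.0, 104.0, 106.0],
    "GDP": [1000.0, 1010.0, 1020.0, 1030.0],
})

# 2x2 matrix: the diagonal is 1, the off-diagonal is the correlation between the two
corr_matrix = np.corrcoef(merged["Close"], merged["GDP"])
r = corr_matrix[0, 1]

print(r)
```

For the pairplot cell, `sns.pairplot(merged)` draws a scatterplot for every pair of columns.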

## Spurious Correlation

However, correlation is not causation! To test for **true** correlation, we can check whether the differences between consecutive data entries are correlated. If they aren't, then what we saw earlier is just **spurious correlation**.

In [None]:
# get the differences

In [None]:
# check the dataframe!

Now let's test for correlation again!

In [None]:
# make another pairplot!

In [None]:
# use numpy to get the direct correlation coefficient

Now, is GDP **really** correlated with AAPL stock prices?
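
To illustrate the idea (not the real AAPL and GDP result), here's a sketch with two independent, randomly generated series that both trend upward. The column names `A` and `B` and all numbers are made up. Their levels look strongly correlated, but their differences don't.

```python
import numpy as np
import pandas as pd

# Two unrelated toy series that both drift upward over time
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "A": np.cumsum(1.0 + 0.5 * rng.standard_normal(n)),
    "B": np.cumsum(2.0 + 0.5 * rng.standard_normal(n)),
})

# Correlation of the raw levels: dominated by the shared upward trend
r_levels = np.corrcoef(df["A"], df["B"])[0, 1]

# Differences between consecutive entries; the first row is NaN, so drop it
diffs = df.diff().dropna()

# Correlation of the differences: the trend is gone
r_diffs = np.corrcoef(diffs["A"], diffs["B"])[0, 1]

print(r_levels, r_diffs)
```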

## Autocorrelation

With time series data, we can test to see if data entries are correlated with data entries on previous dates/times.

This is your homework, as we won't have enough time to cover this in the workshop!
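
If you want a starting point, pandas has `Series.autocorr`, which correlates a series with a shifted copy of itself. This is just a hint on toy data, not the homework solution.

```python
import numpy as np
import pandas as pd

# A pure upward trend: each value closely follows the previous one
trend = pd.Series(np.arange(50, dtype=float))
lag1_trend = trend.autocorr(lag=1)

# An alternating series: each value is the opposite of the previous one
flip = pd.Series([1.0, -1.0] * 25)
lag1_flip = flip.autocorr(lag=1)

print(lag1_trend, lag1_flip)
```

Try different values of `lag` on the differenced closing prices and see what changes.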