# EDA - Stock Data

This notebook performs an exploratory analysis of the collected data, to understand its characteristics and how it can be prepared to obtain a better LSTM-based Deep Learning model.

### Setup

Notebooks run in isolation and may fail to import modules or functions from other files in the project. A common workaround is to add the project path to `sys.path`.


In [1]:
import os
import sys

src_path = os.path.abspath(os.path.join('..', 'src'))

# check the path is not already in sys.path, to avoid duplicates
if src_path not in sys.path:
    sys.path.insert(0, src_path)

## Collecting the data

With the end goal in mind (a Deep Learning model), we looked for a company with a long price history, so we chose [Coca-Cola (KO)](https://finance.yahoo.com/quote/KO/), whose data goes back to January 1962.

As recommended in the Tech Challenge reference material, we use [yfinance](https://pypi.org/project/yfinance/) to collect the data.

In [2]:
import pandas as pd
import yfinance as yf

ticker_symbol = 'KO'  # Coca-Cola

data: pd.DataFrame = yf.download(ticker_symbol, start='1960-01-01', end='2024-11-01')
data

[*********************100%***********************]  1 of 1 completed


Price,Adj Close,Close,High,Low,Open,Volume
Ticker,KO,KO,KO,KO,KO,KO
Date,,,,,,
1962-01-02 00:00:00+00:00,0.046733,0.263021,0.270182,0.263021,0.263021,806400
1962-01-03 00:00:00+00:00,0.045692,0.257161,0.259115,0.253255,0.259115,1574400
1962-01-04 00:00:00+00:00,0.046039,0.259115,0.261068,0.257813,0.257813,844800
1962-01-05 00:00:00+00:00,0.044998,0.253255,0.262370,0.252604,0.259115,1420800
1962-01-08 00:00:00+00:00,0.044535,0.250651,0.251302,0.245768,0.251302,2035200
...,...,...,...,...,...,...
2024-10-25 00:00:00+00:00,66.919998,66.919998,67.699997,66.790001,67.070000,11138100
2024-10-28 00:00:00+00:00,66.669998,66.669998,67.400002,66.599998,66.959999,10761400
2024-10-29 00:00:00+00:00,65.559998,65.559998,66.339996,65.519997,66.290001,16525900
2024-10-30 00:00:00+00:00,65.919998,65.919998,66.540001,65.320000,65.510002,14177800


In [3]:
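# yfinance returns two-level columns (Price, Ticker); drop the Ticker level
data.columns = data.columns.droplevel(1)
# turn the Date index into a regular column
data = data.reset_index()
# name the column axis, which is displayed above the row index
data.columns.name = 'Id'
data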

Id,Date,Adj Close,Close,High,Low,Open,Volume
0,1962-01-02 00:00:00+00:00,0.046733,0.263021,0.270182,0.263021,0.263021,806400
1,1962-01-03 00:00:00+00:00,0.045692,0.257161,0.259115,0.253255,0.259115,1574400
2,1962-01-04 00:00:00+00:00,0.046039,0.259115,0.261068,0.257813,0.257813,844800
3,1962-01-05 00:00:00+00:00,0.044998,0.253255,0.262370,0.252604,0.259115,1420800
4,1962-01-08 00:00:00+00:00,0.044535,0.250651,0.251302,0.245768,0.251302,2035200
...,...,...,...,...,...,...,...
15812,2024-10-25 00:00:00+00:00,66.919998,66.919998,67.699997,66.790001,67.070000,11138100
15813,2024-10-28 00:00:00+00:00,66.669998,66.669998,67.400002,66.599998,66.959999,10761400
15814,2024-10-29 00:00:00+00:00,65.559998,65.559998,66.339996,65.519997,66.290001,16525900
15815,2024-10-30 00:00:00+00:00,65.919998,65.919998,66.540001,65.320000,65.510002,14177800
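
The column flattening above can be reproduced offline on a small frame. This is a minimal sketch: the `toy` DataFrame and its values are made up for illustration, and only mimic the two-level (Price, Ticker) column layout shown in the download output.

```python
import pandas as pd

# Synthetic two-level columns mimicking the (Price, Ticker) layout;
# the values are invented for illustration only.
columns = pd.MultiIndex.from_product(
    [["Close", "Volume"], ["KO"]], names=["Price", "Ticker"]
)
toy = pd.DataFrame(
    [[1.0, 100], [2.0, 200]],
    index=pd.to_datetime(["2024-01-02", "2024-01-03"]).rename("Date"),
    columns=columns,
)

# Drop the ticker level (level 1) so only the price field names remain.
flat = toy.copy()
flat.columns = flat.columns.droplevel(1)

# Move the Date index into a regular column, leaving a 0..n-1 integer index.
flat = flat.reset_index()
print(flat)
```

Dropping the ticker level is safe here because a single ticker is downloaded; with several tickers the field names would no longer be unique after this step.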


In [4]:
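from utils import get_project_root

file_name = 'historical_stock_data.csv'
project_root = get_project_root()
data_path = os.path.join(project_root, 'data', 'raw', file_name)

# create data/raw if it does not exist yet, then save the raw data
os.makedirs(os.path.dirname(data_path), exist_ok=True)
data.to_csv(data_path)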
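
The `get_project_root` helper comes from the project's `utils` module and is not shown in this notebook. As a rough idea of what such a helper might do, the sketch below walks up the directory tree until it finds a marker directory. The name `find_project_root` and the `src` marker are assumptions made for illustration, not the project's actual implementation.

```python
import tempfile
from pathlib import Path


def find_project_root(start: Path, marker: str = "src") -> Path:
    """Return the nearest ancestor of `start` (inclusive) that contains `marker`.

    Hypothetical helper: the real `get_project_root` may work differently.
    """
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"no directory containing {marker!r} above {start}")


# Usage on a throwaway layout: <root>/src and <root>/notebooks.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve()
    (root / "src").mkdir()
    (root / "notebooks").mkdir()
    found = find_project_root(root / "notebooks")
```

Searching for a marker, rather than hard-coding `..`, keeps the result the same regardless of which folder the notebook is launched from.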

## Loading the data

In [6]:
stock_data = pd.read_csv(data_path, index_col=0)
stock_data

,Date,Adj Close,Close,High,Low,Open,Volume
0,1962-01-02 00:00:00+00:00,0.046733,0.263021,0.270182,0.263021,0.263021,806400
1,1962-01-03 00:00:00+00:00,0.045692,0.257161,0.259115,0.253255,0.259115,1574400
2,1962-01-04 00:00:00+00:00,0.046039,0.259115,0.261068,0.257813,0.257813,844800
3,1962-01-05 00:00:00+00:00,0.044998,0.253255,0.262370,0.252604,0.259115,1420800
4,1962-01-08 00:00:00+00:00,0.044535,0.250651,0.251302,0.245768,0.251302,2035200
...,...,...,...,...,...,...,...
15812,2024-10-25 00:00:00+00:00,66.919998,66.919998,67.699997,66.790001,67.070000,11138100
15813,2024-10-28 00:00:00+00:00,66.669998,66.669998,67.400002,66.599998,66.959999,10761400
15814,2024-10-29 00:00:00+00:00,65.559998,65.559998,66.339996,65.519997,66.290001,16525900
15815,2024-10-30 00:00:00+00:00,65.919998,65.919998,66.540001,65.320000,65.510002,14177800
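
Once the data is read back from CSV, the `Date` column contains plain text rather than timestamps, which matters for any time-based analysis later on. The sketch below shows one possible way to convert it, using two rows copied from the table above. The `sample` frame and the in-memory CSV are illustrative assumptions, not part of the original notebook.

```python
import io

import pandas as pd

# Two rows copied from the table above, in the layout written by `to_csv`.
csv_text = (
    ",Date,Adj Close,Close,High,Low,Open,Volume\n"
    "0,1962-01-02 00:00:00+00:00,0.046733,0.263021,0.270182,0.263021,0.263021,806400\n"
    "1,1962-01-03 00:00:00+00:00,0.045692,0.257161,0.259115,0.253255,0.259115,1574400\n"
)

sample = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Without further options, the Date column is read back as text.
date_is_text = not pd.api.types.is_datetime64_any_dtype(sample["Date"])

# Convert to timezone-aware timestamps (UTC, matching the +00:00 offset).
sample["Date"] = pd.to_datetime(sample["Date"], utc=True)
print(sample.dtypes)
```

The same conversion could be applied to `stock_data["Date"]` before plotting or resampling.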
