{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# üìä EDA & Feature Engineering ‚Äî Stock Market Analytics\n",
    "\n",
    "Notebook inicial para explora√ß√£o de dados de mercado, gera√ß√£o de features e salvamento de figuras em `reports/img/`.\n",
    "\n",
    "> **Nota**: este projeto √© educacional e n√£o constitui recomenda√ß√£o de investimento."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 0. Setup\n",
    "- Carrega vari√°veis do `.env`\n",
    "- Define caminhos e cria diret√≥rios caso n√£o existam\n",
    "- Importa bibliotecas necess√°rias"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "from pathlib import Path\n",
    "from datetime import datetime\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "import yfinance as yf\n",
    "from dotenv import load_dotenv\n",
    "\n",
    "BASE_DIR = Path(\"..\").resolve().parent if (Path.cwd().name == \"notebooks\") else Path(\".\")\n",
    "DATA_DIR = BASE_DIR / \"data\"\n",
    "PROCESSED_DIR = DATA_DIR / \"processed\"\n",
    "ANALYTICS_DIR = DATA_DIR / \"analytics\"\n",
    "IMG_DIR = BASE_DIR / \"reports\" / \"img\"\n",
    "\n",
    "for d in [DATA_DIR, PROCESSED_DIR, ANALYTICS_DIR, IMG_DIR]:\n",
    "    d.mkdir(parents=True, exist_ok=True)\n",
    "\n", 
    "load_dotenv(dotenv_path=BASE_DIR / \".env\", override=True)\n",
    "DATA_START = os.getenv(\"DATA_START\", \"2015-01-01\")\n",
    "TICKERS = [t.strip() for t in os.getenv(\"TICKERS\", \"AAPL,MSFT,SPY\").split(\",\") if t.strip()]\n",
    "\n",
    "print(f\"DATA_START = {DATA_START}\")\n",
    "print(f\"TICKERS    = {TICKERS}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Carregamento dos dados\n",
    "Carrega pre√ßos hist√≥ricos (OHLCV) de `data/processed/` se existir; caso contr√°rio, baixa via **yfinance** e persiste em parquet."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def load_or_download_prices(tickers, start_date):\n",
    "    parquet_path = PROCESSED_DIR / \"prices.parquet\"\n",
    "    if parquet_path.exists():\n",
    "        print(\"Lendo prices de\", parquet_path)\n",
    "        return pd.read_parquet(parquet_path)\n",
    "    print(\"Baixando dados via yfinance...\")\n",
    "    df = yf.download(\n",
    "        tickers,\n",
    "        start=start_date,\n",
    "        auto_adjust=False,\n",
    "        progress=False,\n",
    "        group_by=\"ticker\"\n",
    "    )\n",
    "    # Normaliza para formato long: index (Date), colunas multiindex (Ticker -> OHLCV)\n",
    "    frames = []\n",
    "    for t in tickers:\n",
    "        if t in df:\n",
    "            sub = df[t].copy()\n",
    "            sub.columns = [c.replace(\" \", \"_\").lower() for c in sub.columns]\n",
    "            sub = sub.rename(columns={\"adj_close\": \"adjclose\"})\n",
    "            sub[\"ticker\"] = t\n",
    "            frames.append(sub.reset_index())\n",
    "    full = pd.concat(frames, ignore_index=True)\n",
    "    full = full.rename(columns={\"Date\": \"date\"}) if \"Date\" in full.columns else full\n",
    "    full[\"date\"] = pd.to_datetime(full[\"date\"]).dt.tz_localize(None)\n",
    "    full.sort_values([\"ticker\", \"date\"], inplace=True)\n",
    "    full.to_parquet(parquet_path, index=False)\n",
    "    print(\"Salvo em\", parquet_path)\n",
    "    return full\n",
    "\n",
    "prices = load_or_download_prices(TICKERS, DATA_START)\n",
    "prices.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Feature Engineering (b√°sico)\n",
    "- Retornos (1D, 5D, 21D)\n",
    "- Volatilidade rolling (21D)\n",
    "- Sazonalidade simples (dia da semana, m√™s)\n",
    "\n",
    "> Features completas ser√£o geradas tamb√©m por `src/features/build_features.py` ‚Äî aqui fazemos um *preview* para EDA."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def make_basic_features(df: pd.DataFrame) -> pd.DataFrame:\n",
    "    df = df.copy()\n",
    "    df[\"ret_1d\"] = df.groupby(\"ticker\")[\"adjclose\"].pct_change(1)\n",
    "    df[\"ret_5d\"] = df.groupby(\"ticker\")[\"adjclose\"].pct_change(5)\n",
    "    df[\"ret_21d\"] = df.groupby(\"ticker\")[\"adjclose\"].pct_change(21)\n",
    "    df[\"vol_21d\"] = df.groupby(\"ticker\")[\"ret_1d\"].rolling(21).std().reset_index(level=0, drop=True)\n",
    "    df[\"dow\"] = df[\"date\"].dt.dayofweek\n",
    "    df[\"month\"] = df[\"date\"].dt.month\n",
    "    return df\n",
    "\n",
    "features_preview = make_basic_features(prices)\n",
    "features_preview.to_parquet(ANALYTICS_DIR / \"features_preview.parquet\", index=False)\n",
    "features_preview.tail()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. EDA ‚Äî Visualiza√ß√µes\n",
    "Gera gr√°ficos simples e salva em `reports/img/`:\n",
    "- S√©rie de pre√ßos de um ticker de refer√™ncia\n",
    "- Histograma de retornos di√°rios\n",
    "- Correla√ß√£o de retornos entre tickers"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def savefig(path: Path):\n",
    "    path.parent.mkdir(parents=True, exist_ok=True)\n",
    "    plt.tight_layout()\n",
    "    plt.savefig(path, dpi=120)\n",
    "    print(\"Figura salva em\", path)\n",
    "\n",
    "ref = TICKERS[0]\n",
    "ref_df = features_preview[features_preview[\"ticker\"] == ref]\n",
    "\n",
    "# 3.1 Pre√ßo (Adj Close)\n",
    "plt.figure(figsize=(10, 4))\n",
    "plt.plot(ref_df[\"date\"], ref_df[\"adjclose\"])  # cor padr√£o\n",
    "plt.title(f\"Adj Close ‚Äî {ref}\")\n",
    "plt.xlabel(\"Date\")\n",
    "plt.ylabel(\"Adj Close\")\n",
    "savefig(IMG_DIR / f\"price_{ref}.png\")\n",
    "plt.close()\n",
    "\n",
    "# 3.2 Histograma de retornos di√°rios\n",
    "plt.figure(figsize=(6, 4))\n",
    "ref_df[\"ret_1d\"].dropna().hist(bins=60)\n",
    "plt.title(f\"Distribui√ß√£o de Retornos Di√°rios ‚Äî {ref}\")\n",
    "plt.xlabel(\"Return 1D\")\n",
    "plt.ylabel(\"Freq\")\n",
    "savefig(IMG_DIR / f\"hist_returns_{ref}.png\")\n",
    "plt.close()\n",
    "\n",
    "# 3.3 Matriz de correla√ß√£o de retornos (1D) entre tickers\n",
    "pivot = features_preview.pivot(index=\"date\", columns=\"ticker\", values=\"ret_1d\")\n",
    "corr = pivot.corr(min_periods=100)\n",
    "\n",
    "plt.figure(figsize=(5 + 0.5*len(TICKERS), 4 + 0.3*len(TICKERS)))\n",
    "im = plt.imshow(corr, aspect=\"auto\")\n",
    "plt.xticks(range(len(TICKERS)), TICKERS, rotation=45, ha=\"right\")\n",
    "plt.yticks(range(len(TICKERS)), TICKERS)\n",
    "plt.colorbar(im, fraction=0.046, pad=0.04)\n",
    "plt.title(\"Correla√ß√£o de Retornos (1D)\")\n",
    "savefig(IMG_DIR / \"corr_returns_1d.png\")\n",
    "plt.close()\n",
    "\n",
    "corr"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Observa√ß√µes Iniciais\n",
    "- Preencha aqui achados relevantes da EDA (per√≠odos de alta volatilidade, correla√ß√µes fortes, outliers etc.).\n",
    "- Utilize as imagens exportadas em `reports/img/` no relat√≥rio `reports/eda_summary.md`."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
