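Requirements Before Running

To run this notebook locally, you need the libraries below. In particular, saving files fails if neither `pyarrow` nor `fastparquet` is installed, because `DataFrame.to_parquet` needs one of them as its engine.

Run the following command in a terminal (cmd):

```bash
pip install pandas numpy scikit-learn pyarrow fastparquet
```
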
0. Setup: Drive Mount & Path Configuration (Local Version)

In [1]:
import os
import sys
from pathlib import Path

# 1. Detect Colab vs. local
try:
    from google.colab import drive
    is_colab = True
except ImportError:
    is_colab = False

# 2. Set paths
if is_colab:
    print("Environment: Google Colab")
    drive.mount('/content/drive')

    # Adjust this to match your own Drive path.
    ROOT = Path("/content/drive/MyDrive/ML_team_project_final")
else:
    print("Environment: Local Machine")
    # Locally, use the directory containing this notebook (the current working directory).
    ROOT = Path.cwd()

# Defined outside the if/else so that both environments get these paths.
EMB_DIR = ROOT / "embedding"
EIDX_DIR = ROOT / "Economic_index"

print("PROJECT ROOT:", ROOT)
print("Embedding data ROOT:", EMB_DIR)
print("Economic index data ROOT:", EIDX_DIR)

Environment: Local Machine
PROJECT ROOT: /Users/migreeni/Library/CloudStorage/OneDrive-개인/25-2/Machine learning/20252R0136COSE36203
Embedding data ROOT: /Users/migreeni/Library/CloudStorage/OneDrive-개인/25-2/Machine learning/20252R0136COSE36203/embedding
Economic index data ROOT: /Users/migreeni/Library/CloudStorage/OneDrive-개인/25-2/Machine learning/20252R0136COSE36203/Economic_index


1. Define Input File Paths (Local Version)
- sp500.csv
- fear_greed.csv
- Four embedding sets + metadata files

In [2]:
import os

SP500_CSV_PATH      = f"{EIDX_DIR}/sp500.csv"
FEARGREED_CSV_PATH  = f"{EIDX_DIR}/fear_greed.csv"

EMB_CONFIGS = {
    "headlines": {
        "emb_path": EMB_DIR / "vector_headlines" / "embeddings.npy",
        "meta_path": EMB_DIR / "vector_headlines" / "metadata.jsonl",
    },
    "chunking": {
        "emb_path": EMB_DIR / "vector_chunking" / "embeddings.npy",
        "meta_path": EMB_DIR / "vector_chunking" / "metadata.jsonl",
    },
    "bodyText": {
        "emb_path": EMB_DIR / "vector_bodyText" / "embeddings.npy",
        "meta_path": EMB_DIR / "vector_bodyText" / "metadata.jsonl",
    },
    "paragraphs": {
        "emb_path": EMB_DIR / "vector_paragraphs" / "embeddings.npy",
        "meta_path": EMB_DIR / "vector_paragraphs" / "metadata.jsonl",
    },
}

# Lag length L
LAG_L = 5
NUM_PERSONS = 100

print("=== INPUT FILE CHECK ===")
print("sp500.csv exists?      ", os.path.exists(SP500_CSV_PATH))
print("fear_greed.csv exists? ", os.path.exists(FEARGREED_CSV_PATH))
for name, cfg in EMB_CONFIGS.items():
    print(f"{name} emb exists?  ", os.path.exists(cfg['emb_path']))
    print(f"{name} meta exists? ", os.path.exists(cfg['meta_path']))

=== INPUT FILE CHECK ===
sp500.csv exists?       True
fear_greed.csv exists?  True
headlines emb exists?   True
headlines meta exists?  True
chunking emb exists?   True
chunking meta exists?  True
bodyText emb exists?   True
bodyText meta exists?  True
paragraphs emb exists?   True
paragraphs meta exists?  True


2. Basic Imports & Date Utility Functions

In [3]:
import numpy as np
import pandas as pd
import json

def to_datetime_any(dt_str):
    """Convert a date string to a pandas.Timestamp normalized to midnight."""
    return pd.to_datetime(dt_str).normalize()

def pubdate_to_timestamp(pub_date_str: str):
    """
    Convert the metadata pub_date ('YYYY_MM_DD') to a Timestamp.
    Example: '2017_12_31' -> 2017-12-31
    """
    return pd.to_datetime(pub_date_str, format="%Y_%m_%d").normalize()

def format_date_underscore(ts: pd.Timestamp):
    """Timestamp -> 'YYYY_MM_DD' string."""
    return ts.strftime("%Y_%m_%d")
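
As a quick sanity check of the two date helpers above, the following minimal sketch round-trips one `pub_date` string. The helpers are redefined here only so the snippet runs on its own; the sample date is the one used in the docstring.

```python
import pandas as pd

def pubdate_to_timestamp(pub_date_str: str) -> pd.Timestamp:
    """Parse 'YYYY_MM_DD' into a midnight-normalized Timestamp."""
    return pd.to_datetime(pub_date_str, format="%Y_%m_%d").normalize()

def format_date_underscore(ts: pd.Timestamp) -> str:
    """Format a Timestamp back into 'YYYY_MM_DD'."""
    return ts.strftime("%Y_%m_%d")

ts = pubdate_to_timestamp("2017_12_31")
round_trip = format_date_underscore(ts)   # back to the original string
weekday_name = ts.day_name()              # 2017-12-31 fell on a Sunday
print(ts, round_trip, weekday_name)
```

Because this date is a Sunday, an article published on it has no S&P 500 row of its own, which is why the later steps map calendar dates to trading-day indices.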

3. Load S&P 500 + Create date_index
- Uses the Date and Open columns
- value = Open
- After sorting by date in ascending order, date_index = 0, 1, 2, ... (trading-day order)

In [4]:
def load_sp500(path: str) -> pd.DataFrame:
    df = pd.read_csv(path)
    df.columns = [c.strip() for c in df.columns]

    # Use only the required columns
    if not {"Date", "Open"}.issubset(df.columns):
        raise ValueError("sp500.csv must contain the 'Date' and 'Open' columns.")

    df = df[["Date", "Open"]].copy()
    df["Date"] = df["Date"].apply(to_datetime_any)
    df = df.sort_values("Date").reset_index(drop=True)

    df = df.rename(columns={"Open": "value"})
    df["date_index"] = np.arange(len(df))  # starts at 0
    df["date_str"] = df["Date"].apply(format_date_underscore)
    return df

print("=== STEP 3: Load S&P 500 ===")
sp500_df = load_sp500(SP500_CSV_PATH)
print("sp500_df.shape =", sp500_df.shape)
print(sp500_df.head(4))

=== STEP 3: Load S&P 500 ===
sp500_df.shape = (754, 4)
        Date    value  date_index    date_str
0 2017-01-03  2251.57           0  2017_01_03
1 2017-01-04  2261.60           1  2017_01_04
2 2017-01-05  2268.18           2  2017_01_05
3 2017-01-06  2271.14           3  2017_01_06


4. Load Fear-Greed Index
- Uses the Date and fear_greed_index columns
- value = fear_greed_index

In [5]:
def load_fear_greed(path: str) -> pd.DataFrame:
    df = pd.read_csv(path)
    df.columns = [c.strip().lower() for c in df.columns]

    if not {"date", "fear_greed_index"}.issubset(df.columns):
        raise ValueError("fear_greed.csv must contain the 'date' and 'fear_greed_index' columns.")

    df = df[["date", "fear_greed_index"]].copy()
    df["Date"] = pd.to_datetime(df["date"]).dt.normalize()
    df = df.rename(columns={"fear_greed_index": "value"})
    df = df.drop(columns=["date"])
    df = df.sort_values("Date").reset_index(drop=True)
    return df

print("=== STEP 4: Load Fear-Greed ===")
fear_greed_raw = load_fear_greed(FEARGREED_CSV_PATH)
print("fear_greed_raw.shape =", fear_greed_raw.shape)
print(fear_greed_raw.head(4))

=== STEP 4: Load Fear-Greed ===
fear_greed_raw.shape = (754, 2)
   value       Date
0     70 2017-01-03
1     70 2017-01-04
2     70 2017-01-05
3     68 2017-01-06


5. Load All Metadata + Check person/date Ranges

Metadata format (example): {"person": "alex_morgan", "article_id": "...", "pub_date": "2017_12_31"}

In [6]:
def load_metadata_jsonl(path: str) -> pd.DataFrame:
    rows = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            rows.append(json.loads(line.strip()))
    df = pd.DataFrame(rows)

    # Check required columns
    for col in ["person", "article_id", "pub_date"]:
        if col not in df.columns:
            raise ValueError(f"'{col}' column is missing from {path}.")

    # Date handling
    df["article_date"] = df["pub_date"].apply(pubdate_to_timestamp)
    df["date_str"] = df["article_date"].apply(format_date_underscore)
    return df

print("=== STEP 5: Load metadata for all methods ===")
meta_raw_dict = {}
all_persons = set()
all_meta_dates = []

for method_name, cfg in EMB_CONFIGS.items():
    print(f"\n[metadata] method = {method_name}")
    meta_df = load_metadata_jsonl(cfg["meta_path"])
    meta_raw_dict[method_name] = meta_df

    all_persons.update(meta_df["person"].unique())
    all_meta_dates.append(meta_df["article_date"].min())
    all_meta_dates.append(meta_df["article_date"].max())

    print("  meta_df.shape =", meta_df.shape)
    print(meta_df.head(4))

print("\nTotal number of persons:", len(all_persons))
print("First 10 persons:", sorted(all_persons)[:10])

=== STEP 5: Load metadata for all methods ===

[metadata] method = headlines
  meta_df.shape = (461270, 6)
        person                                         article_id    pub_date  \
0  alex_morgan  uk-news/2017/dec/31/new-years-eve-celebrations...  2017_12_31   
1  alex_morgan  sport/2017/dec/31/alastair-cook-david-warner-t...  2017_12_31   
2  alex_morgan  world/2017/dec/31/eight-big-ideas-for-2018-pol...  2017_12_31   
3  alex_morgan  world/2017/dec/31/look-to-the-future-what-does...  2017_12_31   

                                            headline article_date    date_str  
0  New Year's Eve celebrations to go ahead despit...   2017-12-31  2017_12_31  
1  Alastair Cook and David Warner: two Test maest...   2017-12-31  2017_12_31  
2  Eight big ideas for 2018 by Ed Miliband, Maggi...   2017-12-31  2017_12_31  
3  Look to the future: what does 2018 have in store?   2017-12-31  2017_12_31  

[metadata] method = chunking
  meta_df.shape = (460722, 5)
        person             

6. person → person_id (1..NUM_PERSONS) Mapping

Convert each person string to a fixed integer ID
so that one-hot vectors can be built later.

In [7]:
def build_person_mapping(person_names, num_persons_expected=None):
    sorted_persons = sorted(person_names)
    person_to_id = {p: i+1 for i, p in enumerate(sorted_persons)}  # 1-based

    print("=== person_to_id examples (first 10) ===")
    for p in sorted_persons[:10]:
        print(f"  {p} -> {person_to_id[p]}")
    print(f"Total number of persons = {len(sorted_persons)}")

    if num_persons_expected is not None and len(sorted_persons) != num_persons_expected:
        print(f"Warning: expected {num_persons_expected} persons but found {len(sorted_persons)}.")
    return person_to_id, sorted_persons

print("=== STEP 6: Build person_to_id mapping ===")
person_to_id, sorted_persons = build_person_mapping(all_persons, num_persons_expected=NUM_PERSONS)

=== STEP 6: Build person_to_id mapping ===
=== person_to_id examples (first 10) ===
  alex_morgan -> 1
  alicia_keys -> 2
  andres_manuel_lopez -> 3
  ann_mckee -> 4
  ashley_graham -> 5
  barbara_lynch -> 6
  barbara_rae_venter -> 7
  barry_jenkins -> 8
  benjamin_netanyahu -> 9
  bernard_tyson -> 10
Total number of persons = 100


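7. Global date range & date_index_df
- Merge the date ranges of sp500, fear_greed, and all metadata, then assign a date_index to every calendar date (Date) in that range.
- Rule: use the index of the nearest S&P trading day on or before the date (0 if there is none)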

In [8]:
def compute_global_date_range(sp500_df, fear_greed_raw, meta_raw_dict):
    dates = [
        sp500_df["Date"].min(),
        sp500_df["Date"].max(),
        fear_greed_raw["Date"].min(),
        fear_greed_raw["Date"].max(),
    ]
    for meta_df in meta_raw_dict.values():
        dates.append(meta_df["article_date"].min())
        dates.append(meta_df["article_date"].max())
    start_date = min(dates)
    end_date = max(dates)
    return start_date, end_date

print("=== STEP 7-1: Compute global date range ===")
global_start, global_end = compute_global_date_range(sp500_df, fear_greed_raw, meta_raw_dict)
print("global_start =", global_start)
print("global_end   =", global_end)

def build_date_index_df(sp500_df, start_date, end_date):
    calendar = pd.date_range(start=start_date, end=end_date, freq="D")
    df = pd.DataFrame({"Date": calendar})

    # Merge in the date_index of S&P trading days (NaN rows are forward-filled below)
    df = df.merge(sp500_df[["Date", "date_index"]], on="Date", how="left")

    # Forward-fill with the index of the nearest previous trading day
    df["date_index"] = df["date_index"].ffill()

    # Dates before the first trading day: NaN -> 0
    df["date_index"] = df["date_index"].fillna(0).astype(int)
    df["date_str"] = df["Date"].apply(format_date_underscore)
    return df

print("\n=== STEP 7-2: Build date_index_df ===")
date_index_df = build_date_index_df(sp500_df, global_start, global_end)
print("date_index_df.shape =", date_index_df.shape)
print("date_index_df.head(8):")
print(date_index_df.head(8))
print("date_index_df.tail(8):")
print(date_index_df.tail(8))

=== STEP 7-1: Compute global date range ===
global_start = 2017-01-01 00:00:00
global_end   = 2019-12-31 00:00:00

=== STEP 7-2: Build date_index_df ===
date_index_df.shape = (1095, 3)
date_index_df.head(8):
        Date  date_index    date_str
0 2017-01-01           0  2017_01_01
1 2017-01-02           0  2017_01_02
2 2017-01-03           0  2017_01_03
3 2017-01-04           1  2017_01_04
4 2017-01-05           2  2017_01_05
5 2017-01-06           3  2017_01_06
6 2017-01-07           3  2017_01_07
7 2017-01-08           3  2017_01_08
date_index_df.tail(8):
           Date  date_index    date_str
1087 2019-12-24         749  2019_12_24
1088 2019-12-25         749  2019_12_25
1089 2019-12-26         750  2019_12_26
1090 2019-12-27         751  2019_12_27
1091 2019-12-28         751  2019_12_28
1092 2019-12-29         751  2019_12_29
1093 2019-12-30         752  2019_12_30
1094 2019-12-31         753  2019_12_31
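
The "nearest trading day on or before" rule can also be written with `pd.merge_asof`. The following is a minimal sketch on toy data, not the notebook's own implementation (which uses `merge` plus `ffill`). The frame names `trading`, `calendar`, and `mapped` are hypothetical, and business days stand in for real trading days.

```python
import pandas as pd

# Toy trading days: business days from 2017-01-03 to 2017-01-09 (assumption for illustration).
trading = pd.DataFrame({"Date": pd.date_range("2017-01-03", "2017-01-09", freq="B")})
trading["date_index"] = range(len(trading))

# Full calendar, including the weekend and two days before the first trading day.
calendar = pd.DataFrame({"Date": pd.date_range("2017-01-01", "2017-01-09", freq="D")})

# direction="backward" picks the last trading day on or before each calendar date.
mapped = pd.merge_asof(calendar, trading, on="Date", direction="backward")

# Dates before the first trading day have no match; map them to index 0.
mapped["date_index"] = mapped["date_index"].fillna(0).astype(int)
toy_index = mapped["date_index"].tolist()
print(mapped)
```

The resulting pattern matches the head of `date_index_df` above: the weekend of 2017-01-07 and 2017-01-08 inherits Friday's index.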


8. Feature A: S&P 500 lag features (dataset_A)
- Input: sp500_df (indexed by trading day)
- Output: dataset_A (date_index, date_str, value, lag_1..lag_L)
- Missing past values: padded with the first value

In [9]:
def build_lag_features_sp500(sp500_df: pd.DataFrame, L: int = 5) -> pd.DataFrame:
    sp = sp500_df.sort_values("date_index").reset_index(drop=True)
    values = sp["value"].values
    T = len(values)

    data = {
        "date_index": sp["date_index"],
        "date_str": sp["date_str"],
        "value": sp["value"],
    }
    for k in range(1, L+1):
        lag_list = []
        for j in range(T):
            prev_idx = j - k
            if prev_idx < 0:
                lag_list.append(values[0])
            else:
                lag_list.append(values[prev_idx])
        data[f"lag_{k}"] = lag_list
    return pd.DataFrame(data)

print("=== STEP 8: Build dataset_A ===")
dataset_A = build_lag_features_sp500(sp500_df, L=LAG_L)
print("dataset_A.shape =", dataset_A.shape)
print(dataset_A.head(4))

=== STEP 8: Build dataset_A ===
dataset_A.shape = (754, 8)
   date_index    date_str    value    lag_1    lag_2    lag_3    lag_4  \
0           0  2017_01_03  2251.57  2251.57  2251.57  2251.57  2251.57   
1           1  2017_01_04  2261.60  2251.57  2251.57  2251.57  2251.57   
2           2  2017_01_05  2268.18  2261.60  2251.57  2251.57  2251.57   
3           3  2017_01_06  2271.14  2268.18  2261.60  2251.57  2251.57   

     lag_5  
0  2251.57  
1  2251.57  
2  2251.57  
3  2251.57  
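
The nested loop in `build_lag_features_sp500` can be expressed more compactly with `Series.shift`. This is a sketch of an equivalent formulation on the first four values shown above, using a shorter lag length than `LAG_L` to keep the example small.

```python
import pandas as pd

values = pd.Series([2251.57, 2261.60, 2268.18, 2271.14], name="value")
L = 2  # toy lag length, shorter than LAG_L

lags = pd.DataFrame({"value": values})
for k in range(1, L + 1):
    # shift(k) moves values down by k rows; the first k rows become NaN
    # and are padded with the first observed value, as in the loop version.
    lags[f"lag_{k}"] = values.shift(k).fillna(values.iloc[0])

print(lags)
```

Note that padding with the first value makes the earliest rows carry no real history, so a model may want to skip the first L rows during training.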


9. Fear-Greed: Align to S&P Trading Days + Lag
- fg_on_sp: the Fear-Greed value for each S&P date_index
- fg_lag_df: date_index, fg_value, fg_lag_1..fg_lag_L

In [10]:
def align_fear_greed_to_sp500(sp500_df, fg_raw, date_index_df):
    # Build the Fear-Greed series over the global calendar
    fg_full = fg_raw.set_index("Date").reindex(date_index_df["Date"])
    fg_full["value"] = fg_full["value"].ffill().bfill()
    fg_full = fg_full.reset_index().rename(columns={"value": "fg_value"})

    # Fetch the Fear-Greed values for the S&P trading days
    sp = sp500_df[["Date", "date_index"]].merge(
        fg_full[["Date", "fg_value"]],
        on="Date",
        how="left"
    )
    return sp

print("=== STEP 9-1: Align Fear-Greed to S&P ===")
fg_on_sp = align_fear_greed_to_sp500(sp500_df, fear_greed_raw, date_index_df)
print("fg_on_sp.shape =", fg_on_sp.shape)
print(fg_on_sp.head(4))

def build_lag_features_fear_greed(fg_on_sp: pd.DataFrame, L: int = 5) -> pd.DataFrame:
    fg = fg_on_sp.sort_values("date_index").reset_index(drop=True)
    values = fg["fg_value"].values
    T = len(values)

    data = {
        "date_index": fg["date_index"],
        "fg_value": fg["fg_value"],
    }
    for k in range(1, L+1):
        lag_list = []
        for j in range(T):
            prev_idx = j - k
            if prev_idx < 0:
                lag_list.append(values[0])
            else:
                lag_list.append(values[prev_idx])
        data[f"fg_lag_{k}"] = lag_list
    return pd.DataFrame(data)

print("\n=== STEP 9-2: Build fg_lag_df ===")
fg_lag_df = build_lag_features_fear_greed(fg_on_sp, L=LAG_L)
print("fg_lag_df.shape =", fg_lag_df.shape)
print(fg_lag_df.head(4))

=== STEP 9-1: Align Fear-Greed to S&P ===
fg_on_sp.shape = (754, 3)
        Date  date_index  fg_value
0 2017-01-03           0      70.0
1 2017-01-04           1      70.0
2 2017-01-05           2      70.0
3 2017-01-06           3      68.0

=== STEP 9-2: Build fg_lag_df ===
fg_lag_df.shape = (754, 7)
   date_index  fg_value  fg_lag_1  fg_lag_2  fg_lag_3  fg_lag_4  fg_lag_5
0           0      70.0      70.0      70.0      70.0      70.0      70.0
1           1      70.0      70.0      70.0      70.0      70.0      70.0
2           2      70.0      70.0      70.0      70.0      70.0      70.0
3           3      68.0      70.0      70.0      70.0      70.0      70.0
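
The alignment step relies on `reindex` followed by `ffill().bfill()`. The following toy sketch, with made-up readings on only two days, shows what each fill does. It also explains why `fg_value` appears as a float (70.0) in the output above: `reindex` introduces NaN, which forces a float column.

```python
import pandas as pd

calendar = pd.date_range("2017-01-01", "2017-01-05", freq="D")

# Hypothetical readings on 2017-01-03 and 2017-01-05 only.
fg_raw = pd.DataFrame({"Date": calendar[[2, 4]], "value": [70, 68]})

fg_full = fg_raw.set_index("Date").reindex(calendar)  # missing days become NaN
# ffill carries the last reading forward; bfill then fills the leading gap
# (dates before the first reading) with the first available reading.
fg_full["value"] = fg_full["value"].ffill().bfill()
filled = fg_full["value"].tolist()
print(fg_full)
```

The `bfill` part uses a later value for earlier dates, so it is only appropriate for the short leading gap before the first reading.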


10. Attach date_index & person_id to Metadata
- Merge article_date with date_index_df to obtain date_index
- Map the person string to person_id (1..NUM_PERSONS)

In [11]:
def attach_date_index_and_person_id(meta_df, date_index_df, person_to_id, method_name=""):
    df = meta_df.merge(
        date_index_df[["Date", "date_index"]],
        left_on="article_date",
        right_on="Date",
        how="left"
    )
    df = df.drop(columns=["Date"])

    # If date_index is NaN, fill with the last index (in theory this should almost never happen)
    last_idx = date_index_df["date_index"].max()
    df["date_index"] = df["date_index"].fillna(last_idx).astype(int)

    df["person_id"] = df["person"].map(person_to_id)
    if df["person_id"].isna().any():
        unknown = df[df["person_id"].isna()]["person"].unique()
        raise ValueError(f"[{method_name}] Some persons are missing from the person_to_id mapping: {unknown}")
    return df

print("=== STEP 10: Attach date_index & person_id to metadata ===")
meta_with_idx_dict = {}
for method_name, meta_df in meta_raw_dict.items():
    print(f"\n[method: {method_name}]")
    meta_with = attach_date_index_and_person_id(meta_df.copy(), date_index_df, person_to_id, method_name)
    meta_with_idx_dict[method_name] = meta_with
    print("meta_with.shape =", meta_with.shape)
    print(meta_with.head(4))

=== STEP 10: Attach date_index & person_id to metadata ===

[method: headlines]
meta_with.shape = (461270, 8)
        person                                         article_id    pub_date  \
0  alex_morgan  uk-news/2017/dec/31/new-years-eve-celebrations...  2017_12_31   
1  alex_morgan  sport/2017/dec/31/alastair-cook-david-warner-t...  2017_12_31   
2  alex_morgan  world/2017/dec/31/eight-big-ideas-for-2018-pol...  2017_12_31   
3  alex_morgan  world/2017/dec/31/look-to-the-future-what-does...  2017_12_31   

                                            headline article_date    date_str  \
0  New Year's Eve celebrations to go ahead despit...   2017-12-31  2017_12_31   
1  Alastair Cook and David Warner: two Test maest...   2017-12-31  2017_12_31   
2  Eight big ideas for 2018 by Ed Miliband, Maggi...   2017-12-31  2017_12_31   
3  Look to the future: what does 2018 have in store?   2017-12-31  2017_12_31   

   date_index  person_id  
0         250          1  
1         250          1

11. [Function Definition] Embedding Load Function

In [12]:
def load_embedding_with_metadata(config: dict, meta_df: pd.DataFrame, method_name: str):
    emb = np.load(config["emb_path"])
    N, d = emb.shape
    if len(meta_df) != N:
        raise ValueError(
            f"[{method_name}] embeddings (N={N}) and metadata (len={len(meta_df)}) have different lengths."
        )
    df = meta_df.copy().reset_index(drop=True)
    df["idx"] = np.arange(N)
    df["embedding"] = list(emb)

    print(f"[{method_name}] emb_df.shape = {df.shape}, embedding_dim = {d}")
    print(df.head(3))
    return df, d

print("=== STEP 11: Function 'load_embedding_with_metadata' defined. ===")

=== STEP 11: Function 'load_embedding_with_metadata' defined. ===
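
If loading a full `embeddings.npy` strains memory, one option is NumPy's memory-mapped loading. This is a hedged sketch rather than part of the notebook's pipeline: the temporary file and its small array are hypothetical stand-ins for a real embedding file.

```python
import os
import tempfile
import numpy as np

# Hypothetical small array standing in for embeddings.npy.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "embeddings.npy")
np.save(path, np.arange(12, dtype=np.float32).reshape(4, 3))

# mmap_mode="r" opens the file as a read-only memory map; data stays on disk.
emb = np.load(path, mmap_mode="r")
N, d = emb.shape

# Only the requested slice is copied into memory.
batch = np.asarray(emb[0:2])
print(N, d, batch.shape)
```

Note that `list(emb)` as used in `load_embedding_with_metadata` would still create one Python object per row, so the memory map helps most when rows are consumed in batches.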


12. [Function Definition] Dataset B Build Function

In [13]:
# ============================
# STEP 12 — Build dataset_B (RAM-SAFE & Logic Fixed)
# ============================

def build_dataset_B_for_method_light(method_name: str, emb_df: pd.DataFrame, dataset_A: pd.DataFrame):
    """
    Memory-saving version: performs only the merge, without expanding the embedding into 1024 columns.
    Change: 'date_str' is excluded from the merge key (fixes the weekend-article matching problem).
    """
    print(f"\n--- Building dataset_B for {method_name} ---")

    # Exclude the redundant date_str column from the S&P 500 frame when merging;
    # only date_index, value, lag_1, ... are brought in.
    cols_to_use = ["date_index", "value"] + [f"lag_{k}" for k in range(1, LAG_L+1)]
    base_for_merge = dataset_A[cols_to_use]

    print(" base_for_merge shape:", base_for_merge.shape)

    # Merge on date_index
    dataset_B = emb_df.merge(
        base_for_merge,
        on="date_index",
        how="left"
    )

    # print(f"[{method_name}] dataset_B.shape =", dataset_B.shape)
    # Check that the value column has no NaN (apart from the leading padding range)
    nan_count = dataset_B['value'].isna().sum()
    # print(f"[{method_name}] Rows with NaN value (should be 0 or very small): {nan_count}")

    return dataset_B


print("=== STEP 12: Function 'build_dataset_B_for_method_light' defined.(MEMORY SAFE) ===")

=== STEP 12: Function 'build_dataset_B_for_method_light' defined.(MEMORY SAFE) ===
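
The Dataset B merge is many-to-one: many articles share one trading day. A minimal sketch of how that assumption could be checked with pandas' `validate` and `indicator` options is shown below; the toy frames and their values are hypothetical.

```python
import pandas as pd

# Toy article rows: a1 and a2 fall on the same trading day.
articles = pd.DataFrame({
    "article_id": ["a1", "a2", "a3"],
    "date_index": [3, 3, 4],
})
# Toy price rows: exactly one row per date_index.
prices = pd.DataFrame({
    "date_index": [3, 4],
    "value": [100.0, 101.0],
    "lag_1": [99.0, 100.0],
})

# validate raises MergeError if date_index is duplicated on the right side;
# indicator adds a _merge column that flags rows without a price match.
merged = articles.merge(prices, on="date_index", how="left",
                        validate="many_to_one", indicator=True)
n_unmatched = int((merged["_merge"] == "left_only").sum())
print(merged, n_unmatched)
```

A nonzero `n_unmatched` would correspond to the NaN count that the commented-out check in `build_dataset_B_for_method_light` is meant to report.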


13. Feature C: B + person one-hot
- Append person_id (1..NUM_PERSONS) as a one-hot vector (100 dimensions).

In [14]:
# ============================
# STEP 13 — Build dataset_C (one-hot)
# ============================

def add_person_one_hot(dataset_B: pd.DataFrame, num_persons: int = NUM_PERSONS):
    print("\n--- Building dataset_C (one-hot) ---")
    print(" input dataset_B.shape:", dataset_B.shape)
    print(dataset_B.head(3))

    # dtype=int keeps every one-hot column numeric (newer pandas versions default to bool)
    oh = pd.get_dummies(dataset_B["person_id"], prefix="person", dtype=int)

    expected_cols = [f"person_{i}" for i in range(1, num_persons + 1)]
    for col in expected_cols:
        if col not in oh.columns:
            oh[col] = 0
    oh = oh[expected_cols]

    dataset_C = pd.concat([dataset_B, oh], axis=1)

    print("dataset_C.shape =", dataset_C.shape)
    print(dataset_C.head(3))
    return dataset_C, expected_cols
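
The "add missing columns, then reorder" loop in `add_person_one_hot` can be sketched more compactly with `reindex`. This is an illustrative alternative on toy IDs, not the notebook's code; `num_persons = 4` is chosen only to keep the output small.

```python
import pandas as pd

person_id = pd.Series([1, 3, 3], name="person_id")
num_persons = 4  # toy value; the notebook uses NUM_PERSONS = 100
expected_cols = [f"person_{i}" for i in range(1, num_persons + 1)]

oh = pd.get_dummies(person_id, prefix="person", dtype=int)
# reindex adds the columns for persons absent from this data (filled with 0)
# and fixes the column order in a single step.
oh = oh.reindex(columns=expected_cols, fill_value=0)
print(oh)
```

Fixing the column set this way keeps the feature layout identical across embedding methods, even if one method's metadata lacks some persons.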

14. Feature D: C + Fear-Greed lag
- Merge fg_lag_df into dataset_C on date_index

In [15]:
# ============================
# STEP 14 — Build dataset_D (FG lags)
# ============================

def add_fear_greed_lags_to_dataset_C(dataset_C: pd.DataFrame, fg_lag_df: pd.DataFrame, L: int = 5):
    print("\n--- Building dataset_D (FG lags) ---")
    print(" input dataset_C.shape:", dataset_C.shape)
    print(dataset_C.head(3))

    merged = dataset_C.merge(
        fg_lag_df,
        on="date_index",
        how="left"
    )

    print("merged (dataset_D).shape =", merged.shape)
    print(merged.head(3))

    fg_cols = [f"fg_lag_{k}" for k in range(1, L + 1)]
    return merged, fg_cols

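15. Dimensionality Reduction - [Function Definition] PCA Fit and Transform Functions
- The number of components is chosen automatically so that the cumulative explained variance ratio reaches the target: 90% by default (variance_ratio=0.90), while the pipeline in step 16 passes variance_ratio=0.95.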

In [16]:
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from pathlib import Path
import gc

# ==============================================================================
# STEP 15: PCA Functions (Sampling + Batch Processing)
# ==============================================================================
# These functions are written to save memory.

def fit_pca_on_sample(dataset_B, variance_ratio=0.90, sample_size=50000, random_state=42):
    """Fit PCA on a random sample of rows only."""
    print(f"   [PCA Fit] Sampling {sample_size} rows...")
    n_total = len(dataset_B)

    # Draw the sampling indices
    if n_total <= sample_size:
        indices = np.arange(n_total)
    else:
        np.random.seed(random_state)
        indices = np.random.choice(n_total, sample_size, replace=False)

    # Extract the embeddings (stacked into a 2-D array)
    X_sample = np.stack(dataset_B.iloc[indices]['embedding'].values)

    # Fit
    pca = PCA(n_components=variance_ratio, random_state=random_state)
    pca.fit(X_sample)

    print(f"   [PCA Fit] Components selected: {pca.n_components_}")
    return pca

def transform_in_batches(dataset_B, pca_model, batch_size=10000):
    """Apply the PCA transform to the full data in batches."""
    print("   [PCA Transform] transforming in batches...")
    n_total = len(dataset_B)
    pca_results = []

    for start_idx in range(0, n_total, batch_size):
        end_idx = min(start_idx + batch_size, n_total)

        # Extract and stack the batch
        batch_emb = np.stack(dataset_B.iloc[start_idx:end_idx]['embedding'].values)

        # Transform
        batch_pca = pca_model.transform(batch_emb)
        pca_results.append(batch_pca)

        # Free memory
        del batch_emb

    # Concatenate all batches
    X_pca_full = np.vstack(pca_results)

    # Build the DataFrame
    pca_cols = [f"pca_{i}" for i in range(X_pca_full.shape[1])]
    df_pca = pd.DataFrame(X_pca_full, columns=pca_cols, index=dataset_B.index)

    # Merge with the original columns (excluding 'embedding')
    cols_to_keep = [c for c in dataset_B.columns if c != 'embedding']
    dataset_B_pca = pd.concat([dataset_B[cols_to_keep], df_pca], axis=1)

    return dataset_B_pca
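
To illustrate how a fractional `n_components` selects the dimensionality, here is a minimal sketch on synthetic data. The data shape and variance profile are assumptions for illustration; `svd_solver="full"` is set explicitly so the fractional threshold is handled by the full SVD path.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical data: 500 samples, 20 features with decaying variance.
scales = np.linspace(5.0, 0.1, 20)
X = rng.normal(size=(500, 20)) * scales

# A float in (0, 1) asks for the smallest number of components whose
# cumulative explained variance ratio exceeds that threshold.
pca = PCA(n_components=0.90, svd_solver="full")
pca.fit(X)

k = pca.n_components_
explained = pca.explained_variance_ratio_.sum()
explained_without_last = pca.explained_variance_ratio_[:-1].sum()
print(k, explained, explained_without_last)
```

Dropping the last selected component falls back to or below the threshold, which confirms that `k` is the minimal count for the requested ratio.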

16. Save Results

- PCA is not run separately for Dataset C and D; df_C_pca and df_D_pca are built from the PCA-transformed Dataset B (df_B_pca).

- File naming: files are saved with the suffix _orig (original embeddings) or _pca (dimension-reduced).

In [17]:
import gc
from pathlib import Path

# ==============================================================================
# STEP 16: Generate All Datasets & Save (Sequential Processing)
# ==============================================================================

OUTPUT_ROOT = Path(ROOT) / "feature_datasets"
OUTPUT_ROOT.mkdir(exist_ok=True, parents=True)

print("=== STEP 16: Sequential Pipeline Started ===")
print(f"Output Directory: {OUTPUT_ROOT}")

# 1. Save Dataset A
print(f"\n[Saving] Dataset A...")
dataset_A.to_parquet(OUTPUT_ROOT / "dataset_A.parquet", index=False)

# 2. Process each method sequentially
methods = ["headlines", "chunking", "bodyText", "paragraphs"]

for method in methods:
    print(f"\n>>> Processing Method: {method.upper()}")

    # Check that the metadata (the result of step 10) exists.
    if method not in meta_with_idx_dict:
        print(f"Skipping {method} (Metadata not found or excluded in Config)")
        continue

    # -------------------------------------------------------
    # A. Load data (embeddings): loading one method at a time here keeps memory usage bounded
    # -------------------------------------------------------
    cfg = EMB_CONFIGS[method]
    emb_df, _ = load_embedding_with_metadata(cfg, meta_with_idx_dict[method], method)

    # -------------------------------------------------------
    # B. Build datasets (original)
    # -------------------------------------------------------
    # The functions are defined in steps 12, 13, and 14.
    df_B_orig = build_dataset_B_for_method_light(method, emb_df, dataset_A)

    # The embedding DataFrame has been merged, so delete it immediately
    del emb_df
    gc.collect()

    df_C_orig, _ = add_person_one_hot(df_B_orig, NUM_PERSONS)
    df_D_orig, _ = add_fear_greed_lags_to_dataset_C(df_C_orig, fg_lag_df, LAG_L)

    # -------------------------------------------------------
    # C. Save (original)
    # -------------------------------------------------------
    print(f"   [Saving] Original datasets for {method}...")
    df_B_orig.to_parquet(OUTPUT_ROOT / f"dataset_B_{method}_orig.parquet", index=False)
    df_C_orig.to_parquet(OUTPUT_ROOT / f"dataset_C_{method}_orig.parquet", index=False)
    df_D_orig.to_parquet(OUTPUT_ROOT / f"dataset_D_{method}_orig.parquet", index=False)

    # -------------------------------------------------------
    # D. Fit and apply PCA
    # -------------------------------------------------------
    print(f"   [Processing] PCA for {method}...")
    # The functions are defined in step 15
    pca_model = fit_pca_on_sample(df_B_orig, variance_ratio=0.95)
    df_B_pca = transform_in_batches(df_B_orig, pca_model)

    # Derived datasets (PCA version)
    df_C_pca, _ = add_person_one_hot(df_B_pca, NUM_PERSONS)
    df_D_pca, _ = add_fear_greed_lags_to_dataset_C(df_C_pca, fg_lag_df, LAG_L)

    # -------------------------------------------------------
    # E. Save (PCA)
    # -------------------------------------------------------
    print(f"   [Saving] PCA datasets for {method}...")
    df_B_pca.to_parquet(OUTPUT_ROOT / f"dataset_B_{method}_pca.parquet", index=False)
    df_C_pca.to_parquet(OUTPUT_ROOT / f"dataset_C_{method}_pca.parquet", index=False)
    df_D_pca.to_parquet(OUTPUT_ROOT / f"dataset_D_{method}_pca.parquet", index=False)

    # -------------------------------------------------------
    # F. Memory cleanup (important)
    # -------------------------------------------------------
    print(f"   [Cleanup] Clearing memory for {method}...")
    del df_B_orig, df_C_orig, df_D_orig
    del df_B_pca, df_C_pca, df_D_pca, pca_model
    gc.collect()

print("\nAll datasets processed and saved successfully!")

=== STEP 16: Sequential Pipeline Started ===
Output Directory: /Users/migreeni/Library/CloudStorage/OneDrive-개인/25-2/Machine learning/20252R0136COSE36203/feature_datasets

[Saving] Dataset A...

>>> Processing Method: HEADLINES
[headlines] emb_df.shape = (461270, 10), embedding_dim = 1024
        person                                         article_id    pub_date  \
0  alex_morgan  uk-news/2017/dec/31/new-years-eve-celebrations...  2017_12_31   
1  alex_morgan  sport/2017/dec/31/alastair-cook-david-warner-t...  2017_12_31   
2  alex_morgan  world/2017/dec/31/eight-big-ideas-for-2018-pol...  2017_12_31   

                                            headline article_date    date_str  \
0  New Year's Eve celebrations to go ahead despit...   2017-12-31  2017_12_31   
1  Alastair Cook and David Warner: two Test maest...   2017-12-31  2017_12_31   
2  Eight big ideas for 2018 by Ed Miliband, Maggi...   2017-12-31  2017_12_31   

   date_index  person_id  idx  \
0         250          1

In [18]:
# What this notebook finally produces:
# dataset_A (S&P only), dataset_B_* (A + embedding), dataset_C_* (B + person one-hot), dataset_D_* (C + Fear-Greed lag)
# In the later modeling stage, each model (Linear, LightGBM, GRU, SARIMAX, TFT) is trained directly on the chosen dataset (A/B/C/D, per embedding method).

# Example for Linear Regression, LightGBM, GRU
# X = dataset.drop(columns=['value', 'fg_value', 'date_str', ...]) # remove the target from the inputs
# y = dataset['value'] # prepare the target separately
# model.fit(X, y)

# Time-series models (SARIMAX, TFT)
# For these models, value is not dropped; instead the model must be told that it is the target.
# They are designed to predict the future from the past trajectory of value, so the value column is passed to the model whole, but with a different role assigned.
#
# SARIMAX
# Inputs are split into endog (endogenous variable, the prediction target) and exog (exogenous variables, supporting features).
# Endog: pass the whole value column. (The model learns from its past values on its own.)
# Exog: pass embeddings, person_vector, etc. value and fg_value must be excluded here. (The target must not be mixed into the supporting variables.)

# TFT (Temporal Fusion Transformer)
# TFT takes the whole DataFrame, but roles are assigned through configuration. Set value as the Target.
# Time Varying Known Inputs (future values known in advance): date, day of week, etc.
# Time Varying Unknown Inputs (known only for the past): lag_1, embedding, etc.
# Here too, value must not be configured as an input feature; it must be set only as the Target to avoid data leakage.
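
The X/y split described in the notes above can be sketched as follows for the Linear Regression case. The miniature frame, its values, the decision to also drop `date_index`, and the split point are all assumptions for illustration; a real run would load one of the saved parquet files instead.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical miniature frame shaped like a dataset_D without embeddings.
df = pd.DataFrame({
    "date_index": np.arange(8),
    "date_str": [f"2017_01_{d:02d}" for d in range(3, 11)],
    "value": np.linspace(100.0, 107.0, 8),
    "fg_value": [70, 70, 70, 68, 66, 65, 64, 63],
})
df["lag_1"] = df["value"].shift(1).fillna(df["value"].iloc[0])
df["fg_lag_1"] = df["fg_value"].shift(1).fillna(df["fg_value"].iloc[0])

# Drop the target and same-day columns so only past information remains.
target = "value"
drop_cols = [target, "fg_value", "date_str", "date_index"]
X = df.drop(columns=drop_cols)
y = df[target]

# Chronological split: train on the first rows, predict the last ones.
split = 6
model = LinearRegression().fit(X.iloc[:split], y.iloc[:split])
pred = model.predict(X.iloc[split:])
print(list(X.columns), pred)
```

Splitting by time order rather than at random keeps later rows out of the training set, which matters because the lag columns of neighboring rows overlap.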