<a href="https://colab.research.google.com/github/jacobgreen4477/Construction-Equipment-Oil-Condition-Classification-AI-Competition/blob/main/ETRI_v1_0_0.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

> title : The 4th ETRI Human Understanding AI Paper Competition <br>
> author : hjy <br>

In our study, we used smartphones, smartwatches, sleep sensors, and self-recording apps to collect daily life logs and sleep health records of study participants in 2024. The data collection procedures and methods followed an approach similar to that of previous studies. Here, we publicly provide the following 12 data items, which comprise a total of 700 days' worth of lifelog data, for non-commercial academic research purposes only.
- mACStatus: Indicates whether the smartphone is currently being charged.
- mActivity: Value calculated by the Google Activity Recognition API.
- mAmbience: Ambient sound identification labels and their respective probabilities.
- mBle: Bluetooth devices detected near each subject.
- mGps: Multiple GPS coordinates measured within a single minute using the smartphone.
- mLight: Ambient light measured by the smartphone.
- mScreenStatus: Indicates whether the smartphone screen is in use.
- mUsageStats: Indicates which apps were used on the smartphone and for how long.
- mWifi: Wi-Fi devices detected near each subject.
- wHr: Heart rate readings recorded by the smartwatch.
- wLight: Ambient light measured by the smartwatch.
- wPedo: Step data recorded by the smartwatch.

For the purpose of training a learning model to predict sleep health, fatigue, and stress, the following six metrics were derived from sleep sensor data and self-reported survey records. Each metric consists of values categorized into either two levels (0, 1) or three levels (0, 1, 2), depending on the specific metric. The detailed classification criteria for each metric's levels will be provided in a separate document.
- Q1: Overall sleep quality as perceived by a subject immediately after waking up.
- Q2: Physical fatigue of a subject just before sleep.
- Q3: Stress level experienced by a subject just before sleep.
- S1: Adherence to sleep guidelines for total sleep time (TST).
- S2: Adherence to sleep guidelines for sleep efficiency (SE).
- S3: Adherence to sleep guidelines for sleep onset latency (SOL, or SL).

### 📦 Libraries

In [1]:
! pip install haversine
import pandas as pd
import numpy as np
import os
import sys
import matplotlib.pyplot as plt
from datetime import datetime, timedelta
import warnings
from tqdm.auto import tqdm
from collections import Counter
from scipy.stats import entropy
from haversine import haversine  # requires installation: pip install haversine

warnings.filterwarnings('ignore')



In [2]:
import re
import ast
from tqdm import tqdm  # ← added
from math import radians, cos, sin, asin, sqrt
from datetime import time
from datetime import timedelta
from functools import reduce

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [4]:
# pandas options
pd.set_option('display.max_columns', 999)
pd.set_option('display.max_rows', 999)
pd.set_option('display.max_colwidth', None)
pd.set_option('display.float_format', lambda x: '%0.4f' % x)

### 📦 Loading the data

In [5]:
path = '/content/drive/MyDrive/data/ch2025_data_items/'

# 1
mACStatus = pd.read_parquet(path+'ch2025_mACStatus.parquet')
mActivity = pd.read_parquet(path+'ch2025_mActivity.parquet')
mAmbience = pd.read_parquet(path+'ch2025_mAmbience.parquet')
mBle = pd.read_parquet(path+'ch2025_mBle.parquet')
mGps = pd.read_parquet(path+'ch2025_mGps.parquet')
mLight = pd.read_parquet(path+'ch2025_mLight.parquet')
mScreenStatus = pd.read_parquet(path+'ch2025_mScreenStatus.parquet')
mUsageStats = pd.read_parquet(path+'ch2025_mUsageStats.parquet')
mWifi = pd.read_parquet(path+'ch2025_mWifi.parquet')
wHr = pd.read_parquet(path+'ch2025_wHr.parquet')
wLight = pd.read_parquet(path+'ch2025_wLight.parquet')
wPedo = pd.read_parquet(path+'ch2025_wPedo.parquet')

# 2
train = pd.read_csv('/content/drive/MyDrive/data/ch2025_metrics_train.csv')
test = pd.read_csv('/content/drive/MyDrive/data/ch2025_submission_sample.csv')

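### ✅ mACStatus: smartphone charging status
- Indicates whether the smartphone is currently being charged.
- m_charging : 0/1 state
- What might a long charging period mean?
 - The subject stayed in one place for a long time.
 - The subject did not use the phone for a long time.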
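
One way to quantify a "long charging period" is the length of each consecutive run of `m_charging == 1`. The snippet below is a minimal sketch of that idea on a toy sequence; the helper name `run_lengths_of_ones` and the sample values are hypothetical, and it assumes the samples are already sorted by time.

```python
import numpy as np

def run_lengths_of_ones(status):
    """Return the length of each consecutive run of 1s in a 0/1 sequence."""
    status = np.asarray(status, dtype=int)
    # Pad with zeros so every run has a detectable start and end.
    padded = np.concatenate(([0], status, [0]))
    changes = np.diff(padded)
    starts = np.flatnonzero(changes == 1)   # 0 -> 1 transitions
    ends = np.flatnonzero(changes == -1)    # 1 -> 0 transitions
    return (ends - starts).tolist()

# Hypothetical samples: two charging sessions.
toy_status = [0, 1, 1, 1, 0, 0, 1, 1]
lengths = run_lengths_of_ones(toy_status)
print(lengths)
```

The mean and maximum of these lengths correspond to the average and longest charging session, measured in samples.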

In [6]:
mACStatus['lifelog_date'] = mACStatus['timestamp'].astype(str).str[:10]
mACStatus.head(1)

Unnamed: 0,subject_id,timestamp,m_charging,lifelog_date
0,id01,2024-06-26 12:03:00,0,2024-06-26


In [7]:
def process_mACStatus(df):
    df['timestamp'] = pd.to_datetime(df['timestamp'])
    df = df.sort_values(['subject_id', 'timestamp'])

    results = []

    for (subj, lifelog_date), group in df.groupby(['subject_id', 'lifelog_date']):
        status = group['m_charging'].values  # 0/1 state
        times = group['timestamp'].values

        ratio_charging = status.mean()
        sum_charging = status.sum()

        # Number of state transitions
        transitions = (status[1:] != status[:-1]).sum()

        # Lengths of consecutive runs of state 1
        lengths = []
        current_len = 0
        for val in status:
            if val == 1:
                current_len += 1
            elif current_len > 0:
                lengths.append(current_len)
                current_len = 0
        if current_len > 0:
            lengths.append(current_len)

        avg_charging_duration = np.mean(lengths) if lengths else 0
        max_charging_duration = np.max(lengths) if lengths else 0

        results.append({
            'subject_id': subj,
            'lifelog_date': lifelog_date,
            'charging_ratio': ratio_charging,
            'charging_sum': sum_charging,
            'charging_transitions': transitions,
            'avg_charging_duration': avg_charging_duration,
            'max_charging_duration': max_charging_duration,
        })

    return pd.DataFrame(results)

mACStatus2 = process_mACStatus(mACStatus)

# check
print(f'# mACStatus2 shape: {mACStatus2.shape}')
mACStatus2.head(1)

# mACStatus2 shape: (700, 7)


Unnamed: 0,subject_id,lifelog_date,charging_ratio,charging_sum,charging_transitions,avg_charging_duration,max_charging_duration
0,id01,2024-06-26,0.2159,147,22,13.3636,41


### ✅ mActivity: estimated activity
- Value calculated by the Google Activity Recognition API.
 - 0 : IN_VEHICLE
 - 1 : ON_BICYCLE
 - 2 : ON_FOOT
 - 3 : STILL (not moving)
 - 4 : UNKNOWN
 - 5 : TILTING (This often occurs when a device is picked up from a desk or a user who is sitting stands up.)
 - 7 : WALKING
 - 8 : RUNNING

- Working hours    : 07:00 to 18:00
- After-work hours : 18:00 to 24:00
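
The time windows above, together with the 00:00 to 07:00 sleep window used later in this notebook, can be assigned in one vectorized step. This is a sketch only: the helper name `assign_time_period` is hypothetical, and `pd.cut` is used here as an alternative to a row-wise `apply`.

```python
import pandas as pd

def assign_time_period(hours):
    """Map hour-of-day (0-23) to 'sleeptime', 'worktime', or 'afterwork'."""
    return pd.cut(
        hours,
        bins=[0, 7, 18, 24],
        right=False,  # intervals are [0, 7), [7, 18), [18, 24)
        labels=['sleeptime', 'worktime', 'afterwork'],
    ).astype(str)

# Hypothetical hours covering each boundary.
toy_hours = pd.Series([0, 6, 7, 17, 18, 23])
periods = assign_time_period(toy_hours)
print(periods.tolist())
```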

In [8]:
mActivity['lifelog_date'] = mActivity['timestamp'].astype(str).str[:10]
mActivity.head(1)

Unnamed: 0,subject_id,timestamp,m_activity,lifelog_date
0,id01,2024-06-26 12:03:00,4,2024-06-26


In [9]:
def process_mActivity_by_timezones(df):
    df = df.copy()
    df['timestamp'] = pd.to_datetime(df['timestamp'])
    df['lifelog_date'] = df['timestamp'].dt.date
    df['hour'] = df['timestamp'].dt.hour

    def summarize(group_df, prefix=''):
        summary = []
        for (subj, date), group in group_df.groupby(['subject_id', 'lifelog_date']):
            counts = group['m_activity'].value_counts(normalize=True)
            counts2 = group['m_activity'].value_counts()
            row = {'subject_id': subj, 'lifelog_date': date}

            for i in range(9):
                row[f'{prefix}activity_{i}_ratio'] = counts.get(i, 0)
                row[f'{prefix}activity_{i}_count'] = counts2.get(i, 0)

            row[f'{prefix}dominant_activity'] = group['m_activity'].mode()[0] if not group['m_activity'].mode().empty else None
            row[f'{prefix}num_unique_activities'] = group['m_activity'].nunique()

            summary.append(row)
        return pd.DataFrame(summary)

    # Summary over all data (no prefix)
    total_summary = summarize(df)

    # Working hours (07:00 ~ 18:00)
    worktime_df = df[(df['hour'] >= 7) & (df['hour'] < 18)]
    worktime_summary = summarize(worktime_df, prefix='worktime_')

    # After-work hours (18:00 ~ 24:00)
    afterwork_df = df[(df['hour'] >= 18) & (df['hour'] < 24)]
    afterwork_summary = summarize(afterwork_df, prefix='afterwork_')

    # Sleep hours (00:00 ~ 07:00)
    sleeptime_df = df[(df['hour'] >= 0) & (df['hour'] < 7)]
    sleeptime_summary = summarize(sleeptime_df, prefix='sleeptime_')

    # Merge everything on subject_id and lifelog_date
    result = total_summary.copy()
    result = pd.merge(result, worktime_summary, on=['subject_id', 'lifelog_date'], how='left')
    result = pd.merge(result, afterwork_summary, on=['subject_id', 'lifelog_date'], how='left')
    result = pd.merge(result, sleeptime_summary, on=['subject_id', 'lifelog_date'], how='left')

    # Order the columns
    cols = ['subject_id', 'lifelog_date'] + [col for col in result.columns if col not in ['subject_id', 'lifelog_date']]
    result = result[cols]

    # Handle missing values
    result = result.fillna(0)

    return result

mActivity2 = process_mActivity_by_timezones(mActivity)

# check
print(f'# mActivity2 shape: {mActivity2.shape}')
mActivity2.head(1)

# mActivity2 shape: (700, 82)


Unnamed: 0,subject_id,lifelog_date,activity_0_ratio,activity_0_count,activity_1_ratio,activity_1_count,activity_2_ratio,activity_2_count,activity_3_ratio,activity_3_count,activity_4_ratio,activity_4_count,activity_5_ratio,activity_5_count,activity_6_ratio,activity_6_count,activity_7_ratio,activity_7_count,activity_8_ratio,activity_8_count,dominant_activity,num_unique_activities,worktime_activity_0_ratio,worktime_activity_0_count,worktime_activity_1_ratio,worktime_activity_1_count,worktime_activity_2_ratio,worktime_activity_2_count,worktime_activity_3_ratio,worktime_activity_3_count,worktime_activity_4_ratio,worktime_activity_4_count,worktime_activity_5_ratio,worktime_activity_5_count,worktime_activity_6_ratio,worktime_activity_6_count,worktime_activity_7_ratio,worktime_activity_7_count,worktime_activity_8_ratio,worktime_activity_8_count,worktime_dominant_activity,worktime_num_unique_activities,afterwork_activity_0_ratio,afterwork_activity_0_count,afterwork_activity_1_ratio,afterwork_activity_1_count,afterwork_activity_2_ratio,afterwork_activity_2_count,afterwork_activity_3_ratio,afterwork_activity_3_count,afterwork_activity_4_ratio,afterwork_activity_4_count,afterwork_activity_5_ratio,afterwork_activity_5_count,afterwork_activity_6_ratio,afterwork_activity_6_count,afterwork_activity_7_ratio,afterwork_activity_7_count,afterwork_activity_8_ratio,afterwork_activity_8_count,afterwork_dominant_activity,afterwork_num_unique_activities,sleeptime_activity_0_ratio,sleeptime_activity_0_count,sleeptime_activity_1_ratio,sleeptime_activity_1_count,sleeptime_activity_2_ratio,sleeptime_activity_2_count,sleeptime_activity_3_ratio,sleeptime_activity_3_count,sleeptime_activity_4_ratio,sleeptime_activity_4_count,sleeptime_activity_5_ratio,sleeptime_activity_5_count,sleeptime_activity_6_ratio,sleeptime_activity_6_count,sleeptime_activity_7_ratio,sleeptime_activity_7_count,sleeptime_activity_8_ratio,sleeptime_activity_8_count,sleeptime_dominant_activity,sleeptime_num_unique_activities
0,id01,2024-06-26,0.1252,89,0.0014,1,0,0,0.6723,478,0.1575,112,0,0,0,0,0.0436,31,0.0,0,3,5,0.07,25.0,0.0028,1.0,0.0,0.0,0.8768,313.0,0.0028,1.0,0.0,0.0,0.0,0.0,0.0476,17.0,0.0,0.0,3.0,5.0,0.1808,64.0,0.0,0.0,0.0,0.0,0.4661,165.0,0.3136,111.0,0.0,0.0,0.0,0.0,0.0395,14.0,0.0,0.0,3.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


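### ✅ mAmbience: estimated ambient sound
- Ambient sound identification labels and their respective probabilities.
- Does it matter which kind of sound occurred?
- Or does it matter that any sound at all occurred in the early morning?
- Do the sound labels also include noise?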
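
One way to separate "which sound" from "any sound" is to look at both the top label and the entropy of the probabilities: low entropy suggests one label dominates, high entropy suggests no clear label. The sketch below uses hypothetical labels and probabilities, not values from the dataset.

```python
import numpy as np
from scipy.stats import entropy

# Hypothetical label/probability pairs for one time window.
toy_labels = ['Speech', 'Music', 'Vehicle', 'Silence']
uniform_probs = np.array([0.25, 0.25, 0.25, 0.25])
peaked_probs = np.array([0.97, 0.01, 0.01, 0.01])

# scipy.stats.entropy normalizes its input to sum to 1 before computing.
h_uniform = entropy(uniform_probs, base=2)  # maximal for 4 labels
h_peaked = entropy(peaked_probs, base=2)    # much lower: one label dominates
top_label = toy_labels[int(np.argmax(peaked_probs))]

print(h_uniform, h_peaked, top_label)
```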

In [None]:
def extract_labels_and_probs(row):
    items = row['m_ambience']
    labels = [item[0] for item in items]
    probs = [item[1] for item in items]
    return pd.Series({'labels': labels, 'prob': probs})

mAmbience[['labels', 'prob']]  = mAmbience.apply(extract_labels_and_probs, axis=1)
mAmbience['lifelog_date'] = mAmbience['timestamp'].astype(str).str[:10]
mAmbience = mAmbience.drop(columns=['m_ambience'])
mAmbience.head(1)

In [None]:
def process_mAmbience(df, top_n=3, special_labels=None):
    df = df.copy()

    # Time-derived features
    df['timestamp'] = pd.to_datetime(df['timestamp'])
    df['hour'] = df['timestamp'].dt.hour
    df['weekday'] = df['timestamp'].dt.weekday
    df['is_weekend'] = df['weekday'] >= 5

    def map_time_period(row):
        if 0 <= row['hour'] < 7:
            return 'sleeptime'
        elif 7 <= row['hour'] < 18:
            return 'worktime'
        else:
            return 'afterwork'

    df['time_period'] = df.apply(map_time_period, axis=1)

    # Timestamp difference (in seconds)
    df = df.sort_values(['subject_id', 'timestamp'])
    df['duration_sec'] = df.groupby(['subject_id'])['timestamp'].diff().dt.total_seconds()
    df['duration_sec'] = df['duration_sec'].fillna(0)

    def process_group(group):
        result = {}

        time_period = group['time_period'].iloc[0]  # get the group's time_period

        labels = group['labels'].tolist()
        probs = group['prob'].tolist()
        durations = group['duration_sec'].tolist()

        flat_labels = [label for labels_list in labels for label in labels_list]
        flat_probs = [float(prob) for probs_list in probs for prob in probs_list]
        repeated_durations = [dur for labels_list, dur in zip(labels, durations) for _ in labels_list]

        if not flat_labels or not flat_probs:
            return pd.Series()

        flat_probs = np.array(flat_probs)

        # Set the prefix
        prefix = f'{time_period}_'

        result[prefix + 'max_label'] = flat_labels[np.argmax(flat_probs)]
        result[prefix + 'max_prob'] = np.max(flat_probs)
        result[prefix + 'entropy'] = entropy(flat_probs, base=2)
        result[prefix + 'label_count'] = len(flat_labels)

        # Regular expressions
        vehicle_pattern = re.compile(r'car|vehicle|truck|bus|motorcycle|bicycle|boat|ship|train|subway|aircraft|helicopter|engine', re.IGNORECASE)
        environment_pattern = re.compile(r'outside|inside|environment|ocean|rain|waterfall|wind|fire|waves|rustling|earthquake|explosion|thunder|smoke', re.IGNORECASE)
        human_pattern = re.compile(r'speech|baby|child|laughter|crying|shout|screaming|groan|cough|hubbub|snoring|whimper|talking|singing|giggle|chatter|babbling', re.IGNORECASE)
        music_pattern = re.compile(r'music|song|singing|choir|instrument|guitar|piano|violin|vocal|jazz|hip hop|pop|rock|opera|blues|folk|electronic|dubstep|reggae|house|metal|disco|dance|flamenco|trance|techno|saxophone|synthesizer|harp|accordion|clarinet|flute|drum|tabla|tambourine|maraca|steelpan|xylophone|orchestra|band', re.IGNORECASE)

        # Classification
        vehicle_related = [(label, dur) for label, dur in zip(flat_labels, repeated_durations) if vehicle_pattern.search(label)]
        environment_related = [(label, dur) for label, dur in zip(flat_labels, repeated_durations) if environment_pattern.search(label)]
        human_related = [(label, dur) for label, dur in zip(flat_labels, repeated_durations) if human_pattern.search(label)]
        music_related = [(label, dur) for label, dur in zip(flat_labels, repeated_durations) if music_pattern.search(label)]

        # has (presence flag)
        result[prefix + 'has_vehicle_related'] = int(len(vehicle_related) > 0)
        result[prefix + 'has_environment_related'] = int(len(environment_related) > 0)
        result[prefix + 'has_human_related'] = int(len(human_related) > 0)
        result[prefix + 'has_music_related'] = int(len(music_related) > 0)

        # sum (count)
        result[prefix + 'sum_vehicle_related'] = len(vehicle_related)
        result[prefix + 'sum_environment_related'] = len(environment_related)
        result[prefix + 'sum_human_related'] = len(human_related)

        return pd.Series(result)

    features = df.groupby(['subject_id', 'lifelog_date', 'time_period']).apply(process_group).reset_index()
    features = features.pivot(index=['subject_id', 'lifelog_date'], columns='level_3', values=0).reset_index()

    return features

In [None]:
mAmbience2 = process_mAmbience(mAmbience)

# check
print(f'# mAmbience2 shape: {mAmbience2.shape}')
mAmbience2.head(1)

### ✅ mBle: Bluetooth
- Bluetooth devices around individual subject.
 - 7936 : Wearable, Headset, AV Device
 - 1796 : Peripheral (input devices)
 - 0 : No information or unknown
 - 1084 : Audio/Video (speakers, headsets, earphones, TVs, etc.)
 - 524 : Phone (mobile phones, smartphones)
 - 1060 : Headphones
 - 284 : Computer (PC, laptop, PDA)
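
If `device_class` follows the standard Bluetooth Class of Device encoding (an assumption here, to be checked against the dataset documentation), the major device class sits in bits 8 to 12 and can be decoded with a mask. The sketch below could serve as a cross-check of the descriptions listed above; the helper name `major_device_class` is hypothetical.

```python
# Major device classes from the Bluetooth Class of Device field (bits 8-12).
MAJOR_CLASSES = {
    0: 'Miscellaneous', 1: 'Computer', 2: 'Phone', 3: 'LAN/Network',
    4: 'Audio/Video', 5: 'Peripheral', 6: 'Imaging', 7: 'Wearable',
    8: 'Toy', 9: 'Health', 31: 'Uncategorized',
}

def major_device_class(device_class):
    """Decode the major class, assuming a standard Class of Device integer."""
    major = (int(device_class) & 0x1F00) >> 8
    return MAJOR_CLASSES.get(major, 'Reserved')

codes = [0, 284, 524, 1060, 1084, 1796, 7936]
decoded = {code: major_device_class(code) for code in codes}
for code, name in decoded.items():
    print(code, name)
```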

In [None]:
def extract_mble_info(row):
    m_data = row['m_ble']
    address = [item['address'] for item in m_data]
    device_class = [item['device_class'] for item in m_data]
    rssi = [item['rssi'] for item in m_data]
    return pd.Series({'address': address, 'device_class': device_class, 'rssi': rssi})

mBle[['address','device_class','rssi']] = mBle.apply(extract_mble_info, axis=1)
mBle['lifelog_date'] = mBle['timestamp'].astype(str).str[:10]
mBle.head(1)

In [None]:
def process_mBle(df):
    df = df.copy()
    df['timestamp'] = pd.to_datetime(df['timestamp'])
    df['lifelog_date'] = df['timestamp'].dt.date
    df['hour'] = df['timestamp'].dt.hour

    # Classify the time period
    def map_time_period(row):
        if 0 <= row['hour'] < 7:
            return 'sleeptime'
        elif 7 <= row['hour'] < 18:
            return 'worktime'
        else:
            return 'afterwork'

    df['time_period'] = df.apply(map_time_period, axis=1)

    features = []

    for idx, row in df.iterrows():
        entry = ast.literal_eval(row['m_ble']) if isinstance(row['m_ble'], str) else row['m_ble']

        rssi_list = []
        class_0_cnt = 0
        class_other_cnt = 0

        for device in entry:
            try:
                rssi = int(device['rssi'])
                rssi_list.append(rssi)

                device_class = str(device['device_class'])
                if device_class == '0':
                    class_0_cnt += 1
                else:
                    class_other_cnt += 1
            except (KeyError, TypeError, ValueError):
                continue  # malformed record

        feature = {
            'subject_id': row['subject_id'],
            'lifelog_date': row['lifelog_date'],
            'time_period': row['time_period'],
            'device_class_0_cnt': class_0_cnt,
            'device_class_others_cnt': class_other_cnt,
            'device_count': len(rssi_list),
            'rssi_mean': np.mean(rssi_list) if rssi_list else np.nan,
            'rssi_min': np.min(rssi_list) if rssi_list else np.nan,
            'rssi_max': np.max(rssi_list) if rssi_list else np.nan,
        }
        features.append(feature)

    return pd.DataFrame(features)

def summarize_mBle_daily(df):

    # Extract row-level BLE features
    df = process_mBle(df)

    # Group by day and time period
    grouped = df.groupby(['subject_id', 'lifelog_date', 'time_period']).agg({
        'device_class_0_cnt': 'sum',
        'device_class_others_cnt': 'sum',
        'rssi_mean': 'mean',
        'rssi_min': 'min',
        'rssi_max': 'max',
    }).reset_index()

    # Compute the total, then the ratios
    total_cnt = grouped['device_class_0_cnt'] + grouped['device_class_others_cnt']
    grouped['device_class_0_ratio'] = grouped['device_class_0_cnt'] / total_cnt.replace(0, np.nan)
    grouped['device_class_others_ratio'] = grouped['device_class_others_cnt'] / total_cnt.replace(0, np.nan)

    # Drop the cnt columns that are no longer needed
    grouped.drop(columns=[
        'device_class_0_cnt',
        'device_class_others_cnt'
    ], inplace=True)

    # Pivot to spread the features across time_period
    final = grouped.pivot(index=['subject_id', 'lifelog_date'], columns='time_period')
    final.columns = ['_'.join(col).strip() for col in final.columns.values]
    final = final.reset_index()

    return final

In [None]:
mBle2 = summarize_mBle_daily(mBle)

# check
print(f'\n # mBle2 shape: {mBle2.shape}')
mBle2.head(1)

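### ✅ mGps: GPS-based smartphone location
- Multiple GPS coordinates measured within a single minute using the smartphone.
- If speed is greater than 1, the subject is considered to be moving rather than stationary.
 - 0.5-2 : walking
 - 2-5 : jogging
 - 5 or more : traveling by vehicle

- How many minutes per day did speed stay between 0.5 and 2?
- How many minutes per day did speed stay between 2 and 5? (aerobic exercise time)
- How many minutes per day did speed stay at 5 or above?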
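
The speed bands above can be turned into per-category counts with `pd.cut`. This is a minimal sketch on hypothetical speed samples; it assumes one sample per second when converting counts to minutes, and it keeps the speed unit as provided by the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical speed samples (unit as provided by the dataset).
toy_speeds = pd.Series([0.0, 0.6, 1.5, 2.0, 4.9, 5.0, 12.0, 0.3])

labels = ['walk', 'jog', 'vehicle']
# right=False gives [0.5, 2), [2, 5), [5, inf); speeds below 0.5 become NaN.
mode = pd.cut(toy_speeds, bins=[0.5, 2, 5, np.inf], labels=labels, right=False)

counts = mode.value_counts().to_dict()
# Assumption: one sample per second, so samples / 60 = minutes.
minutes = {label: counts[label] / 60 for label in labels}
print(counts, minutes)
```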

In [None]:
def extract_gps_info(row):
    m_data = row['m_gps']
    altitude = [item['altitude'] for item in m_data]
    latitude = [item['latitude'] for item in m_data]
    longitude = [item['longitude'] for item in m_data]
    speed = [item['speed'] for item in m_data]
    return pd.Series({'altitude': altitude, 'latitude': latitude, 'longitude': longitude, 'speed': speed})

mGps[['altitude','latitude','longitude','speed']] = mGps.apply(extract_gps_info, axis=1)
mGps['lifelog_date'] = mGps['timestamp'].astype(str).str[:10]
mGps = mGps.drop(columns=['m_gps'])
mGps.head(1)

In [None]:
# Distance calculation function (shadows the `haversine` imported from the haversine package)
def haversine(coord1, coord2, unit='m'):
    lat1, lon1 = coord1
    lat2, lon2 = coord2
    lat1, lon1, lat2, lon2 = map(radians, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a))
    r = 6371000  # Earth radius (m)
    return c * r if unit == 'm' else c * r / 1000

def process_mGps(df):
    df = df.copy()

    df['timestamp'] = pd.to_datetime(df['timestamp'])
    df['lifelog_date'] = df['timestamp'].dt.date
    df['week'] = df['timestamp'].dt.isocalendar().week

    expanded_rows = []

    for idx, row in tqdm(df.iterrows(), total=len(df), desc="Processing GPS data"):
        speeds = ast.literal_eval(row['speed']) if isinstance(row['speed'], str) else row['speed']
        lats = ast.literal_eval(row['latitude']) if isinstance(row['latitude'], str) else row['latitude']
        lons = ast.literal_eval(row['longitude']) if isinstance(row['longitude'], str) else row['longitude']
        alts = ast.literal_eval(row['altitude']) if isinstance(row['altitude'], str) else row['altitude']
        n = len(speeds)
        if n > 0:
            expanded_rows.append(pd.DataFrame({
                'subject_id': [row['subject_id']] * n,
                'lifelog_date': [row['lifelog_date']] * n,
                'timestamp': pd.date_range(start=row['timestamp'], periods=n, freq='1S'),
                'speed': speeds,
                'latitude': lats,
                'longitude': lons,
                'altitude': alts
            }))

    expanded_df = pd.concat(expanded_rows, ignore_index=True)

    # Vectorized
    speeds = expanded_df['speed'].values

    walk_mask = (0.5 <= speeds) & (speeds < 2)
    jog_mask = (2 <= speeds) & (speeds < 5)
    vehicle_mask = (speeds >= 5)
    le5_mask = (speeds <= 5)

    expanded_df['walk'] = walk_mask.astype(int)
    expanded_df['jog'] = jog_mask.astype(int)
    expanded_df['vehicle'] = vehicle_mask.astype(int)
    expanded_df['le5_speed'] = expanded_df['speed'].where(le5_mask)

    # Morning/evening window conditions
    expanded_df['hour'] = expanded_df['timestamp'].dt.hour
    morning_condition = (expanded_df['hour'] >= 6) & (expanded_df['hour'] < 9) & (expanded_df['speed'] >= 1)
    evening_condition = (expanded_df['hour'] >= 21) & (expanded_df['hour'] <= 23) & (expanded_df['speed'] <= 1)

    # Compute movement features
    movement_features = []
    for (subject_id, lifelog_date), group in expanded_df.groupby(['subject_id', 'lifelog_date']):
        all_speeds = group['speed'].values
        all_alts = group['altitude'].values
        all_lats = group['latitude'].values
        all_lons = group['longitude'].values

        active_mins = group.shape[0] / 60  # 1-second samples → minutes
        movement_ratio = (all_speeds > 1.0).mean() if len(all_speeds) > 0 else 0
        alt_change = all_alts[-1] - all_alts[0] if len(all_alts) > 0 else 0
        lat_change = all_lats[-1] - all_lats[0] if len(all_lats) > 0 else 0
        lon_change = all_lons[-1] - all_lons[0] if len(all_lons) > 0 else 0

        total_dist = 0.0
        if len(all_lats) > 1:
            for i in range(len(all_lats)-1):
                coord1 = (all_lats[i], all_lons[i])
                coord2 = (all_lats[i+1], all_lons[i+1])
                total_dist += haversine(coord1, coord2, unit='m')

        movement_features.append({
            'subject_id': subject_id,
            'lifelog_date': lifelog_date,
            'active_minutes': active_mins,
            'movement_ratio': movement_ratio,
            'alt_change': alt_change,
            'lat_change': lat_change,
            'lon_change': lon_change,
            'total_distance_m': total_dist
        })

    movement_df = pd.DataFrame(movement_features)

    # Groupby + Aggregation
    agg_funcs = {
        'walk_minutes': ('walk', lambda x: x.sum() / 60),
        'jog_minutes': ('jog', lambda x: x.sum() / 60),
        'vehicle_minutes': ('vehicle', lambda x: x.sum() / 60),
        'speed_le5_max': ('le5_speed', 'max'),
        'speed_le5_mean': ('le5_speed', 'mean'),
        'speed_le5_std': ('le5_speed', 'std')
    }

    grouped = expanded_df.groupby(['subject_id', 'lifelog_date']).agg(**agg_funcs).reset_index()
    grouped['exercise_flag'] = (grouped['jog_minutes'] >= 5)

    # Morning wakeup time
    morning_first_movement = (
        expanded_df[morning_condition]
        .groupby(['subject_id', 'lifelog_date'])['timestamp']
        .min()
        .reset_index()
        .rename(columns={'timestamp': 'morning_wakeup_time'})
    )


    # Final merge
    final = pd.merge(grouped, movement_df, on=['subject_id', 'lifelog_date'], how='left')
    final = pd.merge(final, morning_first_movement, on=['subject_id', 'lifelog_date'], how='left')

    # Handle the morning wakeup_time
    valid_wakeup = final['morning_wakeup_time'].dropna()
    if not valid_wakeup.empty:
        total_seconds = valid_wakeup.dt.hour * 3600 + valid_wakeup.dt.minute * 60 + valid_wakeup.dt.second
        mean_seconds = total_seconds.mean()
        mean_hour = int(mean_seconds // 3600)
        mean_minute = int((mean_seconds % 3600) // 60)
        mean_second = int(mean_seconds % 60)
        mean_wakeup_time = time(mean_hour, mean_minute, mean_second)
    else:
        mean_wakeup_time = time(7, 0, 0)

    final['morning_wakeup_time'] = final['morning_wakeup_time'].fillna(
        pd.Timestamp.combine(pd.to_datetime('today').date(), mean_wakeup_time)
    )
    final['morning_wakeup_time'] = final['morning_wakeup_time'].dt.hour * 100 + final['morning_wakeup_time'].dt.minute

    mean_wakeup_hhmm = mean_wakeup_time.hour * 100 + mean_wakeup_time.minute

    # wake_up_early_minutes
    def compute_minutes_diff(actual_hhmm, mean_hhmm):
        actual_hour = actual_hhmm // 100
        actual_minute = actual_hhmm % 100
        mean_hour = mean_hhmm // 100
        mean_minute = mean_hhmm % 100
        actual_sec = actual_hour * 3600 + actual_minute * 60
        mean_sec = mean_hour * 3600 + mean_minute * 60
        return (mean_sec - actual_sec) / 60

    final['wake_up_early_minutes'] = final['morning_wakeup_time'].apply(lambda x: compute_minutes_diff(x, mean_wakeup_hhmm))

    return final

In [None]:
%%time

mGps2 = process_mGps(mGps)

# check
print(f'\n # mGps2 shape: {mGps2.shape}')
mGps2.head(1)
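
The point-by-point distance loop in `process_mGps` can be slow on long tracks. As a possible alternative, the sketch below computes the same haversine path length with NumPy arrays; the function name `haversine_path_length_m` is hypothetical and is not part of the notebook.

```python
import numpy as np

def haversine_path_length_m(lats, lons, radius_m=6371000.0):
    """Sum of great-circle distances between consecutive points, in meters."""
    lats = np.radians(np.asarray(lats, dtype=float))
    lons = np.radians(np.asarray(lons, dtype=float))
    if lats.size < 2:
        return 0.0
    dlat = np.diff(lats)
    dlon = np.diff(lons)
    a = np.sin(dlat / 2) ** 2 + np.cos(lats[:-1]) * np.cos(lats[1:]) * np.sin(dlon / 2) ** 2
    a = np.clip(a, 0.0, 1.0)  # guard against rounding slightly outside [0, 1]
    return float(np.sum(2 * radius_m * np.arcsin(np.sqrt(a))))

# One degree of longitude along the equator is roughly 111.19 km.
dist = haversine_path_length_m([0.0, 0.0], [0.0, 1.0])
print(round(dist, 2))
```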

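### ✅ mLight: ambient brightness
- Ambient light measured by the smartphone.
 - Dark night: 0.1 to 1 lux (pitch-dark room, moonless night)
 - Street lit by streetlights: 10 to 20 lux (dim outdoor lighting)
 - Indoor lighting: 100 to 500 lux (office, typical living room)
 - Bright outdoors: 10,000 to 25,000 lux (sunlight on a clear day)
 - Direct sunlight: 30,000 to 100,000 lux (summer midday, very strong sunlight)

- Estimate from the brightness when the lights were turned off and the subject fell asleep.
- Does direct sunlight have a positive effect on sleep? (check the literature)
- No missing-value handling.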
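
A lights-off period can be approximated as a long run of consecutive zero-light samples. The sketch below finds the longest such run with a shift-and-cumsum grouping. The helper name `longest_dark_run`, the toy readings, and the 10-minute sampling interval are hypothetical; the `min_len=6` default mirrors the six-sample threshold used later in this notebook.

```python
import pandas as pd

def longest_dark_run(df, min_len=6):
    """Return (start, end, n_samples) of the longest zero-light run, or None."""
    df = df.sort_values('timestamp').reset_index(drop=True)
    is_dark = df['m_light'].eq(0)
    # A new run id starts whenever the dark/not-dark state changes.
    run_id = is_dark.ne(is_dark.shift()).cumsum()
    dark_runs = (
        df[is_dark]
        .groupby(run_id[is_dark])['timestamp']
        .agg(['min', 'max', 'count'])
    )
    dark_runs = dark_runs[dark_runs['count'] >= min_len]
    if dark_runs.empty:
        return None
    best = dark_runs.loc[dark_runs['count'].idxmax()]
    return best['min'], best['max'], int(best['count'])

# Hypothetical readings every 10 minutes starting at 22:00.
toy = pd.DataFrame({
    'timestamp': pd.date_range('2024-06-26 22:00:00', periods=12, freq='10min'),
    'm_light': [120, 80, 0, 0, 0, 0, 0, 0, 0, 35, 0, 200],
})
result = longest_dark_run(toy)
print(result)
```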

In [None]:
mLight['lifelog_date'] = mLight['timestamp'].astype(str).str[:10]
mLight.head(1)

In [None]:
mLight.loc[mLight['m_light']<1000,'m_light'].hist(bins=30)

In [None]:
def process_mLight(df):
    df = df.copy()
    df['timestamp'] = pd.to_datetime(df['timestamp'])
    df['lifelog_date'] = df['timestamp'].dt.date
    df['hour'] = df['timestamp'].dt.hour
    df['is_night'] = df['hour'].apply(lambda h: h >= 22 or h < 6)

    daily_light = df.groupby(['subject_id', 'lifelog_date']).agg(
        light_mean=('m_light', 'mean'),
        light_std=('m_light', 'std'),
        light_max=('m_light', 'max'),
        light_min=('m_light', 'min'),
        light_night_mean=('m_light', lambda x: x[df.loc[x.index, 'is_night']].mean()),
        light_day_mean=('m_light', lambda x: x[~df.loc[x.index, 'is_night']].mean()),
        light_night_ratio=('is_night', 'mean')
    ).reset_index()

    results = []

    for subject_id, group in tqdm(df.groupby('subject_id'), desc="Processing sleep detection"):
        group = group.sort_values('timestamp').reset_index(drop=True)

        sleep_time = None
        wake_time = None
        sleeping = False
        zero_count = 0
        first_zero_time = None

        for i in range(len(group)):
            light = group.loc[i, 'm_light']
            hour = group.loc[i, 'hour']

            if light == 0:
                zero_count += 1
                if zero_count == 1:
                    first_zero_time = group.loc[i, 'timestamp']
                if zero_count == 6 and not sleeping:
                    sleep_hour = first_zero_time.hour
                    if (sleep_hour >= 21) or (sleep_hour <= 2):
                        sleep_time = first_zero_time
                        sleeping = True
            else:
                if sleeping:
                    candidate_wakeup = group.loc[i, 'timestamp']
                    wake_hour = candidate_wakeup.hour
                    if 5 <= wake_hour <= 9:
                        wake_time = candidate_wakeup
                        results.append({
                            'subject_id': subject_id,
                            'lifelog_date': group.loc[i, 'lifelog_date'],
                            'sleep_time': sleep_time,
                            'wake_time': wake_time,
                            'sleep_duration_min': (wake_time - sleep_time).total_seconds() / 60
                        })
                        sleeping = False
                        zero_count = 0
                        first_zero_time = None
            if light > 0:
                zero_count = 0
                first_zero_time = None

    sleep_df = pd.DataFrame(results)

    if not sleep_df.empty:
        mean_sleep = (
            sleep_df[(sleep_df['sleep_duration_min'] >= 180) & (sleep_df['sleep_duration_min'] <= 600)]
            .groupby('subject_id')['sleep_duration_min']
            .mean()
            .to_dict()
        )

        def replace_outlier(row):
            if (row['sleep_duration_min'] > 600) or (row['sleep_duration_min'] < 180):
                return mean_sleep.get(row['subject_id'], 360)
            else:
                return row['sleep_duration_min']

        sleep_df['sleep_duration_min'] = sleep_df.apply(replace_outlier, axis=1)

    def to_hhmm(t):
        if pd.isnull(t):
            return np.nan
        return t.hour * 100 + t.minute

    sleep_df['sleep_time_hhmm'] = sleep_df['sleep_time'].apply(to_hhmm)
    sleep_df['wake_time_hhmm'] = sleep_df['wake_time'].apply(to_hhmm)

    sleep_df = sleep_df.drop(columns=['sleep_time', 'wake_time'])

    ### Shift lifelog_date back by one day: the wake-up falls on the next day, so subtract 1
    sleep_df['lifelog_date'] = sleep_df['lifelog_date'] + pd.Timedelta(days=-1)

    ### merge
    final = pd.merge(daily_light, sleep_df, on=['subject_id', 'lifelog_date'], how='left')

    ### After the merge, fill all NaNs with the per-subject mean
    # if not final.empty:
    #     for col in ['sleep_duration_min', 'sleep_time_hhmm', 'wake_time_hhmm']:
    #         group_means = final.groupby('subject_id')[col].transform('mean')
    #         final[col] = final[col].fillna(group_means)

    return final

In [None]:
mLight2 = process_mLight(mLight)

# check
print(f'\n # mLight2 shape: {mLight2.shape}')
mLight2.head(1)

In [None]:
mLight2['sleep_duration_min'].hist(bins=30)

### ✅ mScreenStatus: screen usage

- Indicates whether the smartphone screen is in use.
 - Wake-up time, bedtime, sleep duration
 - Number of phone uses, usage time
 - Number of phone uses between 00:00 and 05:00
 - No missing-value handling
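
The late-night usage idea above can be sketched as a count of screen-on samples between 00:00 and 05:00 per subject and day. The helper name `night_screen_use_count` and the toy rows are hypothetical; the sketch counts samples with `m_screen_use == 1`, not distinct usage sessions.

```python
import pandas as pd

def night_screen_use_count(df, start_hour=0, end_hour=5):
    """Count screen-on samples with start_hour <= hour < end_hour per day."""
    df = df.copy()
    df['timestamp'] = pd.to_datetime(df['timestamp'])
    df['lifelog_date'] = df['timestamp'].dt.date
    hour = df['timestamp'].dt.hour
    in_window = (hour >= start_hour) & (hour < end_hour)
    night = df[in_window & df['m_screen_use'].eq(1)]
    return (
        night.groupby(['subject_id', 'lifelog_date'])
        .size()
        .rename('night_screen_use_count')
        .reset_index()
    )

# Hypothetical rows: two screen-on samples fall inside the window.
toy = pd.DataFrame({
    'subject_id': ['id01'] * 4,
    'timestamp': ['2024-06-26 00:30:00', '2024-06-26 03:10:00',
                  '2024-06-26 04:59:00', '2024-06-26 05:00:00'],
    'm_screen_use': [1, 1, 0, 1],
})
out = night_screen_use_count(toy)
print(out)
```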

In [None]:
mScreenStatus['lifelog_date'] = mScreenStatus['timestamp'].astype(str).str[:10]
mScreenStatus.head(1)

In [None]:
def process_mScreenUse(df):
    df = df.copy()
    df['timestamp'] = pd.to_datetime(df['timestamp'])
    df['lifelog_date'] = df['timestamp'].dt.date
    df['hour'] = df['timestamp'].dt.hour

    # 1. Summarize the daily screen usage pattern
    features = []

    for (subj, lifelog_date), group in df.groupby(['subject_id', 'lifelog_date']):
        status = group['m_screen_use'].values
        ratio_on = status.mean()
        transitions = (status[1:] != status[:-1]).sum()

        # Lengths of consecutive runs of 1 (screen on)
        durations = []
        current = 0
        for val in status:
            if val == 1:
                current += 1
            elif current > 0:
                durations.append(current)
                current = 0
        if current > 0:
            durations.append(current)

        features.append({
            'subject_id': subj,
            'lifelog_date': lifelog_date,
            'screen_on_ratio': ratio_on,
            'screen_on_transitions': transitions,
            'screen_on_duration_avg': np.mean(durations) if durations else 0,
            'screen_on_duration_max': np.max(durations) if durations else 0,
        })

    daily_screen = pd.DataFrame(features)

    # 2. Estimate bedtime and wake-up time
    results = []

    for subject_id, group in tqdm(df.groupby('subject_id'), desc="Processing sleep detection (screen use)"):
        group = group.sort_values('timestamp').reset_index(drop=True)

        sleep_time = None
        wake_time = None
        sleeping = False
        zero_count = 0
        first_zero_time = None

        for i in range(len(group)):
            screen = group.loc[i, 'm_screen_use']
            hour = group.loc[i, 'hour']

            if screen == 0:
                zero_count += 1
                if zero_count == 1:
                    first_zero_time = group.loc[i, 'timestamp']
                if zero_count >= 120 and not sleeping:   # 2 hours in a row (120 consecutive screen-off records)
                    sleep_hour = first_zero_time.hour
                    if (sleep_hour >= 21) or (sleep_hour <= 2):
                        sleep_time = first_zero_time
                        sleeping = True
            else:
                if sleeping:
                    candidate_wakeup = group.loc[i, 'timestamp']
                    wake_hour = candidate_wakeup.hour
                    if 5 <= wake_hour <= 9:
                        wake_time = candidate_wakeup
                        results.append({
                            'subject_id': subject_id,
                            'lifelog_date': group.loc[i, 'lifelog_date'],
                            'sleep_time': sleep_time,
                            'wake_time': wake_time,
                            'sleep_duration_min': (wake_time - sleep_time).total_seconds() / 60
                        })
                        sleeping = False
                        zero_count = 0
                        first_zero_time = None
            if screen == 1:
                zero_count = 0
                first_zero_time = None

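    # Name the columns explicitly so the steps below still work when no sleep episode was detected
    sleep_df = pd.DataFrame(
        results,
        columns=['subject_id', 'lifelog_date', 'sleep_time', 'wake_time', 'sleep_duration_min'],
    )

    # 3. Correct sleep-duration outliers (outside 180 to 600 minutes) with the subject's mean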
    if not sleep_df.empty:
        mean_sleep = (
            sleep_df[(sleep_df['sleep_duration_min'] >= 180) & (sleep_df['sleep_duration_min'] <= 600)]
            .groupby('subject_id')['sleep_duration_min']
            .mean()
            .to_dict()
        )

        def replace_outlier(row):
            if (row['sleep_duration_min'] > 600) or (row['sleep_duration_min'] < 180):
                return mean_sleep.get(row['subject_id'], 360)
            else:
                return row['sleep_duration_min']

        sleep_df['sleep_duration_min'] = sleep_df.apply(replace_outlier, axis=1)

    # Convert sleep_time and wake_time to numeric hhmm values
    def to_hhmm(t):
        if pd.isnull(t):
            return np.nan
        return t.hour * 100 + t.minute

    sleep_df['sleep_time_hhmm'] = sleep_df['sleep_time'].apply(to_hhmm)
    sleep_df['wake_time_hhmm'] = sleep_df['wake_time'].apply(to_hhmm)

    sleep_df = sleep_df.drop(columns=['sleep_time', 'wake_time'])

    ### Shift lifelog_date back one day: the wake-up is recorded on the following day, so subtract 1
    sleep_df['lifelog_date'] = sleep_df['lifelog_date'] + pd.Timedelta(days=-1)

    # Final merge
    final = pd.merge(daily_screen, sleep_df, on=['subject_id', 'lifelog_date'], how='left')

    return final

In [None]:
mScreenStatus2 = process_mScreenUse(mScreenStatus)

# check
print(f'\n # mScreenStatus2 shape: {mScreenStatus2.shape}')
mScreenStatus2.head(1)

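### ✅ mUsageStats app usage statistics
- mUsageStats: Indicates which apps were used on the smartphone and for how long.

 - How late the phone was used before going to sleep
 - Time spent on calls (the 통화 and 전화 apps)
 - Time spent watching YouTube
 - Time spent on messaging (메시지 and 카카오톡)
 - Time spent on NAVER
 - How many more apps were used than usual
 - Exclude? -> 시스템 UI (System UI), One UI 홈 (One UI Home)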
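
The "more apps than usual" idea can be expressed as a percentage deviation from each subject's own mean. This mirrors what `process_mUsageStats` below does for total screen time. A minimal sketch, assuming a daily table with `subject_id` and the `고유앱수` (unique app count) column; the helper name `add_relative_to_subject_mean`, the output column name, and the sample values are hypothetical:

```python
import pandas as pd

def add_relative_to_subject_mean(daily, value_col, out_col):
    """Add out_col = percentage deviation of value_col from the subject's mean."""
    daily = daily.copy()
    subject_mean = daily.groupby('subject_id')[value_col].transform('mean')

    # A zero mean would divide by zero, so treat it as missing instead
    subject_mean = subject_mean.where(subject_mean != 0)

    daily[out_col] = ((daily[value_col] / subject_mean - 1) * 100).round(1)
    return daily

# Hypothetical sample: id01 averages 20 apps per day, id02 averages 5
sample = pd.DataFrame({
    'subject_id': ['id01', 'id01', 'id02', 'id02'],
    '고유앱수': [10, 30, 5, 5],
})
result = add_relative_to_subject_mean(sample, '고유앱수', 'unique_apps_vs_mean_pct')
```

Using `transform('mean')` keeps the result aligned with the daily rows, so no dictionary lookup or row-wise `apply` is needed.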

In [None]:
def extract_mUsageStats_info(row):
    m_data = row['m_usage_stats']
    app_name = [item['app_name'] for item in m_data]
    total_time = [item['total_time'] for item in m_data]
    return pd.Series({'app_name': app_name, 'total_time': total_time})

mUsageStats[['app_name', 'total_time']] = mUsageStats.apply(extract_mUsageStats_info, axis=1)
mUsageStats['lifelog_date'] = mUsageStats['timestamp'].astype(str).str[:10]
mUsageStats.head(1)

In [None]:
def process_mUsageStats(df):
    df = df.copy()
    df['lifelog_date'] = pd.to_datetime(df['lifelog_date'])
    df['timestamp'] = pd.to_datetime(df['timestamp'])
    df['요일'] = df['lifelog_date'].dt.day_name()

    # Flatten the per-row lists
    exploded_df = df.explode(['app_name', 'total_time'])
    exploded_df['total_time'] = exploded_df['total_time'].astype(float)
    exploded_df['total_time'] = exploded_df['total_time'] * 0.001 / 60  # milliseconds → seconds → minutes

    # Strip special characters from app_name
    exploded_df['app_name'] = exploded_df['app_name'].astype(str).apply(
        lambda x: re.sub(r'[^가-힣a-zA-Z0-9]', '', x)
    )

    # Remove system apps
    filtered_df = exploded_df[~exploded_df['app_name'].isin(['시스템UI'])]  # drops only '시스템UI' (OneUI홈 is kept)

    # Build the main derived features
    def calculate_daily_metrics(group):
        last_use = group['timestamp'].max()

        app_times = {
            '통화_시간(분)': group[group['app_name'] == '통화']['total_time'].sum(),
            '전화_시간(분)': group[group['app_name'] == '전화']['total_time'].sum(),
            'YouTube_시간(분)': group[group['app_name'] == 'YouTube']['total_time'].sum(),
            '메신저_시간(분)': group[group['app_name'].isin(['메시지', '카카오톡'])]['total_time'].sum(),
            'NAVER_시간(분)': group[group['app_name'] == 'NAVER']['total_time'].sum(),
            '캐시워크_시간(분)': group[group['app_name'] == '캐시워크']['total_time'].sum(),
            '성경일독Q_시간(분)': group[group['app_name'] == '성경일독Q']['total_time'].sum(),
            'OneUI홈_시간(분)': group[group['app_name'] == 'OneUI홈']['total_time'].sum(),
        }

        return pd.Series({
            **app_times,
            '고유앱수': group['app_name'].nunique(),
            '총화면시간(분)': group['total_time'].sum()
        })

    # Build the daily metrics
    daily_stats = filtered_df.groupby(['subject_id', 'lifelog_date']).apply(calculate_daily_metrics).reset_index()

    # Mean total screen time per subject_id
    avg_screen_time = daily_stats.groupby('subject_id')['총화면시간(분)'].mean().to_dict()

    # Screen usage relative to the subject's mean (%)
    def compute_screen_usage(row):
        avg_time = avg_screen_time.get(row['subject_id'], np.nan)
        if pd.isna(avg_time) or avg_time == 0:
            return np.nan
        return round((row['총화면시간(분)'] / avg_time - 1) * 100, 1)

    daily_stats['평균대비_화면사용량(%)'] = daily_stats.apply(compute_screen_usage, axis=1)

    return daily_stats

In [None]:
mUsageStats2 = process_mUsageStats(mUsageStats)

# check
print(f'\n # mUsageStats2 shape: {mUsageStats2.shape}')
mUsageStats2.head(1)

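### ✅ mWifi nearby Wi-Fi information
- Wifi devices around individual subject.
 - -30 to -50 dBm: very strong signal (optimal)
 - -51 to -60 dBm: strong signal (no problems)
 - -61 to -70 dBm: acceptable signal (may be slightly slow)
 - -71 to -80 dBm: weak signal (risk of dropouts)
 - -81 dBm or lower: very weak signal (almost no connection)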
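
The bands above can be written as a small lookup. A minimal sketch: the function name `rssi_band` and the label strings are hypothetical, and readings stronger than -30 dBm are assumed to fall into the top band.

```python
def rssi_band(rssi_dbm):
    """Map an RSSI reading in dBm to one of the five signal bands listed above."""
    if rssi_dbm >= -50:
        return 'very_strong'   # -30 to -50 dBm (stronger readings assumed to belong here too)
    if rssi_dbm >= -60:
        return 'strong'        # -51 to -60 dBm
    if rssi_dbm >= -70:
        return 'fair'          # -61 to -70 dBm
    if rssi_dbm >= -80:
        return 'weak'          # -71 to -80 dBm
    return 'very_weak'         # -81 dBm or lower

# Hypothetical readings, one per band
bands = [rssi_band(r) for r in [-45, -60, -61, -80, -90]]
```

`process_mWifi` below keeps only readings with `r > threshold`, so with `threshold=-60` a reading of exactly -60 dBm is dropped even though the list counts it as a strong signal.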

In [None]:
def extract_wifi_info(row):
    wifi_data = row['m_wifi']
    bssids = [item['bssid'] for item in wifi_data]
    rssis = [item['rssi'] for item in wifi_data]
    return pd.Series({'bssid': bssids, 'rssi': rssis})

mWifi[['bssid', 'rssi']] = mWifi.apply(extract_wifi_info, axis=1)
mWifi['lifelog_date'] = mWifi['timestamp'].astype(str).str[:10]
mWifi.head(1)

In [None]:
def process_mWifi(df,threshold):

    df = df.copy()
    df['timestamp'] = pd.to_datetime(df['timestamp'])

    def filter_strong_rssi(df,threshold):
        filtered_df = df.copy()
        def filter_row(row):
            bssids = row['bssid']
            rssis = row['rssi']
            # Keep only the entries whose RSSI is above the threshold
            filtered = [(b, r) for b, r in zip(bssids, rssis) if r > threshold]
            if filtered:
                new_bssids, new_rssis = zip(*filtered)
                return pd.Series({'bssid': list(new_bssids), 'rssi': list(new_rssis)})
            else:
                return pd.Series({'bssid': [], 'rssi': []})
        filtered_df[['bssid', 'rssi']] = filtered_df.apply(filter_row, axis=1)
        return filtered_df

    # === Remove weak Wi-Fi signals ===
    df = filter_strong_rssi(df, threshold=threshold) ####

    features = []
    grouped = df.groupby(['subject_id', 'lifelog_date'])

    for (subject_id, date), group in grouped:
        scan_count = len(group)
        bssid_flat = sum(group['bssid'], [])  # flatten
        rssi_flat = sum(group['rssi'], [])    # flatten

        unique_bssid_count = len(set(bssid_flat))
        avg_rssi = sum(rssi_flat) / len(rssi_flat) if rssi_flat else None
        max_rssi = max(rssi_flat) if rssi_flat else None
        min_rssi = min(rssi_flat) if rssi_flat else None
        strong_rssi_ratio = sum(1 for r in rssi_flat if r > -60) / len(rssi_flat) if rssi_flat else 0
        empty_scan_count = sum(1 for b in group['bssid'] if len(b) == 0)

        # Most frequently detected BSSID
        bssid_counter = Counter(bssid_flat)
        top_bssid, top_bssid_count = bssid_counter.most_common(1)[0] if bssid_counter else (None, 0)

        first_time = group['timestamp'].min()
        last_time = group['timestamp'].max()
        hour_span = (last_time - first_time).total_seconds() / 60  # in minutes

        features.append({
            'subject_id': subject_id,
            'lifelog_date': date,
            'scan_count': scan_count,
            'unique_bssid_count': unique_bssid_count,
            'avg_rssi': avg_rssi,
            'max_rssi': max_rssi,
            # 'min_rssi': min_rssi,
            # 'strong_signal_ratio': strong_rssi_ratio,
            'empty_scan_count': empty_scan_count,
            'top_bssid': top_bssid,
            'top_bssid_count': top_bssid_count,
            'hour_span_minutes': hour_span
        })

    return pd.DataFrame(features)

In [None]:
mWifi2 = process_mWifi(mWifi,threshold=-60)

# check
print(f'\n # mWifi2 shape: {mWifi2.shape}')
mWifi2.head(1)

### ✅ wHr heart rate
- Heart rate readings recorded by the smartwatch.


In [None]:
wHr['lifelog_date'] = wHr['timestamp'].astype(str).str[:10]
wHr.head(1)

In [None]:
def get_time_block(hour):
    if 0 <= hour < 6:
        return 'early_morning'
    elif 6 <= hour < 12:
        return 'morning'
    elif 12 <= hour < 18:
        return 'afternoon'
    else:
        return 'evening'

def process_wHr_by_timeblock(df):
    df = df.copy()
    df['timestamp'] = pd.to_datetime(df['timestamp'])
    df['lifelog_date'] = df['timestamp'].dt.date
    df['block'] = df['timestamp'].dt.hour.map(get_time_block)

    results = []

    for (subj, date), group in df.groupby(['subject_id', 'lifelog_date']):
        block_stats = {'subject_id': subj, 'lifelog_date': date}

        for block, block_group in group.groupby('block'):
            hr_all = []
            for row in block_group['heart_rate']:
                parsed = ast.literal_eval(row) if isinstance(row, str) else row
                hr_all.extend([int(h) for h in parsed if h is not None])

            if not hr_all:
                continue

            above_100 = [hr for hr in hr_all if hr > 100]
            block_stats[f'hr_{block}_mean'] = np.mean(hr_all)
            block_stats[f'hr_{block}_std'] = np.std(hr_all)
            block_stats[f'hr_{block}_max'] = np.max(hr_all)
            block_stats[f'hr_{block}_min'] = np.min(hr_all)
            block_stats[f'hr_{block}_above_100_ratio'] = len(above_100) / len(hr_all)

        results.append(block_stats)

    return pd.DataFrame(results)

In [None]:
wHr2 = process_wHr_by_timeblock(wHr)

# check
print(f'\n # wHr2 shape: {wHr2.shape}')
wHr2.head(1)

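### ✅ wLight ambient light
- Ambient light measured by the smartwatch.  
  - Dark night, 0.1 to 1 lux: pitch-dark room, moonless night
  - Street with street lights on, 10 to 20 lux: dim outdoor lighting
  - Indoor lighting, 100 to 500 lux: office, typical living room
  - Bright outdoors, 10,000 to 25,000 lux: sunlight on a clear day
  - Under direct sunlight, 30,000 to 100,000 lux: summer midday, very strong sunlight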
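
The reference ranges above leave gaps (for example between 1 and 10 lux), so any binning has to choose where to cut. A minimal sketch using `pd.cut`, assuming each band extends up to its listed upper bound; the edges, the label names, and the helper `lux_band` are assumptions for illustration, not part of the pipeline below:

```python
import numpy as np
import pandas as pd

# Assumed cut points: the upper bound of each reference range listed above
LUX_EDGES = [-np.inf, 1, 20, 500, 25_000, np.inf]
LUX_LABELS = ['dark', 'street_light', 'indoor', 'bright_outdoor', 'direct_sunlight']

def lux_band(lux):
    """Assign each lux reading to a labeled band (right-closed intervals)."""
    return pd.cut(pd.Series(lux, dtype=float), bins=LUX_EDGES, labels=LUX_LABELS)

# Hypothetical readings, one per band
bands = lux_band([0.5, 15, 300, 12_000, 60_000])
```

Band shares per time block could then complement the mean, std, max, and min that `process_wLight_by_timeblock` computes.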

In [None]:
wLight['lifelog_date'] = wLight['timestamp'].astype(str).str[:10]
wLight.head(1)

In [None]:
def get_time_block(hour):
    if 0 <= hour < 6:
        return 'early_morning'
    elif 6 <= hour < 12:
        return 'morning'
    elif 12 <= hour < 18:
        return 'afternoon'
    else:
        return 'evening'

def process_wLight_by_timeblock(df):
    df = df.copy()
    df['timestamp'] = pd.to_datetime(df['timestamp'])
    df['lifelog_date'] = df['timestamp'].dt.date
    df['block'] = df['timestamp'].dt.hour.map(get_time_block)

    results = []

    for (subj, date), group in df.groupby(['subject_id', 'lifelog_date']):
        block_stats = {'subject_id': subj, 'lifelog_date': date}

        for block, block_group in group.groupby('block'):
            lux = block_group['w_light'].dropna().values
            if len(lux) == 0:
                continue

            block_stats[f'wlight_{block}_mean'] = np.mean(lux)
            block_stats[f'wlight_{block}_std'] = np.std(lux)
            block_stats[f'wlight_{block}_max'] = np.max(lux)
            block_stats[f'wlight_{block}_min'] = np.min(lux)

        results.append(block_stats)

    return pd.DataFrame(results)

In [None]:
wLight2 = process_wLight_by_timeblock(wLight)

# check
print(f'\n # wLight2 shape: {wLight2.shape}')
wLight2.head(1)

### ✅ wPedo step count
- Step data recorded by the smartwatch.

In [None]:
wPedo['lifelog_date'] = wPedo['timestamp'].astype(str).str[:10]
wPedo.head(1)

In [None]:
def process_wPedo(df):
    df = df.copy()
    df['timestamp'] = pd.to_datetime(df['timestamp'])
    df['lifelog_date'] = df['timestamp'].dt.date

    summary = df.groupby(['subject_id', 'lifelog_date']).agg({
        'step': 'sum',
        'step_frequency': 'mean',
        'distance': 'sum',
        'speed': ['mean', 'max'],
        'burned_calories': 'sum'
    }).reset_index()

    # Flatten the MultiIndex column names
    summary.columns = ['subject_id', 'lifelog_date',
                       'step_sum', 'step_frequency_mean',
                       'distance_sum', 'speed_mean', 'speed_max',
                       'burned_calories_sum']

    return summary

In [None]:
wPedo2 = process_wPedo(wPedo)

# check
print(f'\n # wPedo2 shape: {wPedo2.shape}')
wPedo2.head(1)

### 📦 Merge data
- The train and test periods overlap
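
The overlap can be checked per subject by comparing date ranges. A minimal sketch, assuming `train` and `test` each have `subject_id` and `lifelog_date` columns as in the cells below; the function name `date_range_overlap` and the sample frames are hypothetical:

```python
import pandas as pd

def date_range_overlap(train, test):
    """Per subject, report the train/test date ranges and whether they overlap."""
    train = train.assign(lifelog_date=pd.to_datetime(train['lifelog_date']))
    test = test.assign(lifelog_date=pd.to_datetime(test['lifelog_date']))

    tr = train.groupby('subject_id')['lifelog_date'].agg(train_min='min', train_max='max')
    te = test.groupby('subject_id')['lifelog_date'].agg(test_min='min', test_max='max')

    # Only subjects present in both sets can overlap
    ranges = tr.join(te, how='inner')

    # Two closed intervals overlap when each one starts before the other ends
    ranges['overlap'] = (
        (ranges['train_min'] <= ranges['test_max'])
        & (ranges['test_min'] <= ranges['train_max'])
    )
    return ranges.reset_index()

# Hypothetical sample: id01 overlaps, id02 does not
train_sample = pd.DataFrame({
    'subject_id': ['id01', 'id01', 'id02', 'id02'],
    'lifelog_date': ['2024-06-26', '2024-07-10', '2024-06-01', '2024-06-05'],
})
test_sample = pd.DataFrame({
    'subject_id': ['id01', 'id01', 'id02', 'id02'],
    'lifelog_date': ['2024-07-01', '2024-07-20', '2024-06-10', '2024-06-12'],
})
overlap = date_range_overlap(train_sample, test_sample)
```

When the periods overlap, a time-based validation split may not reflect how the test rows are distributed, which is worth keeping in mind when choosing a cross-validation scheme.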

In [None]:
print('# train:',len(train))
display(train.groupby(['subject_id'])['lifelog_date'].agg(['min','max']).reset_index())

print('# test:',len(test))
display(test.groupby(['subject_id'])['lifelog_date'].agg(['min','max']).reset_index())

In [None]:
mACStatus2['lifelog_date'] = mACStatus2['lifelog_date'].astype(str)
mActivity2['lifelog_date'] = mActivity2['lifelog_date'].astype(str)

In [None]:
df_list = [
    mACStatus2,       # 1
    mActivity2,       # 2
    mAmbience2,       # 3
    mBle2,            # 4
    mGps2,            # 5
    mLight2,          # 6
    mScreenStatus2,   # 7
    mUsageStats2,     # 8
    mWifi2,           # 9
    wHr2,             # 10
    wLight2,          # 11
    wPedo2            # 12
]

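# Normalize the merge key to 'YYYY-MM-DD' strings in every frame. The process_* functions
# above return lifelog_date as str, datetime.date, or datetime64 depending on the sensor,
# and mismatched key types either fail to match or raise an error during the merge.
for df_ in df_list:
    df_['lifelog_date'] = pd.to_datetime(df_['lifelog_date']).dt.strftime('%Y-%m-%d')

data = reduce(lambda left, right: pd.merge(left, right, on=['subject_id', 'lifelog_date'], how='outer'), df_list)
data['lifelog_date'] = data['lifelog_date'].astype(str)

# Duplicate check: both shapes should report the same number of rows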
print(data.shape)
print(data[['subject_id','lifelog_date']].drop_duplicates().shape)

In [None]:
train2 = train.merge(data, on=['subject_id','lifelog_date'], how='left')

print('# train:',len(train))
display(train.groupby(['subject_id'])['lifelog_date'].agg(['min','max']).reset_index())

print('# train2:',len(train2))
display(train2.groupby(['subject_id'])['lifelog_date'].agg(['min','max']).reset_index())

print('# train2 shape:',train2.shape)
display(train2.head(1))

In [None]:
test2 = test.merge(data, on=['subject_id','lifelog_date'], how='left')

print('# test:',len(test))
display(test.groupby(['subject_id'])['lifelog_date'].agg(['min','max']).reset_index())

print('# test2:',len(test2))
display(test2.groupby(['subject_id'])['lifelog_date'].agg(['min','max']).reset_index())

print('# test2 shape:',test2.shape)
display(test2.head(1))

In [None]:
# Save
print('# test   shape:',test.shape)
print('# test2  shape:',test2.shape)
print('# train  shape:',train.shape)
print('# train2 shape:',train2.shape)
train2.to_parquet(f"{path}/train2.parquet")
test2.to_parquet(f"{path}/test2.parquet")