In [1]:
# necessary imports
import pandas as pd
import numpy as np
import os

## Data Manipulation and Extraction

### Table and column selection

Tables and respective fields needed:
- Smartphone Data:
    - AppUsageEvent, containing all the necessary information of apps used
        - name
        - category
    - Location, containing the coordinates to be able to determine where apps were used
        - longitude
        - latitude
- Wearable Smartwatch Data: 
    - AmbientLight, the light intensity in lumen per square meter
        - brightness
- Self reported emotional and cognitive states Data
    - EsmResponse
        - stress
        - valence
        - arousal
        - attention
        - pcode (participant number)

In [2]:
data_dir = os.path.join("..","data")

# Smartphone and Smartwatch data
table_names = ["AppUsageEvent", "Location", "AmbientLight", "UltraViolet"]
data_files = [f"{name}.csv" for name in table_names]
column_names = ["app_name", "app_category", "brightness", "uv_intensity", "longitude", "latitude"]
# ESM response data
esm_path = os.path.join(data_dir,"SubjData","EsmResponse.csv")

### Functions used

#### To TimeSeries

This step is **crucial** to our preprocessing: it lets us merge the different tables with the pandas merge_asof method, which matches rows on the closest timestamp instead of requiring identical ones (with the default direction="backward" used here, that is the closest earlier or equal timestamp). It is also needed to detect large gaps in the data (see the *Feature Extraction at_home* section).

- Converting timestamps to pandas.datetime objects 
- Using datetime objects as dataframe index for TimeSeries purposes
- Converting index to Korean timezone

In [3]:
def df_to_timeseries(df, timestamp_col="timestamp"):
    df[timestamp_col] = pd.to_datetime(df[timestamp_col], unit="ms") # convert timestamp to datetime
    df.set_index(timestamp_col,inplace=True) # set it as index 
    df = df.tz_localize("UTC") # need to localize a timezone
    df = df.tz_convert("Asia/Seoul") # convert it to Korean timezone
    return df
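
A minimal usage sketch of the three steps listed above, applied to made-up millisecond timestamps (the values and the `raw` and `ts` names are illustrative, not taken from the dataset). It chains the same calls as df_to_timeseries but leaves the input frame unmodified:

```python
import pandas as pd

# Made-up epoch timestamps in milliseconds (illustrative, not from the dataset)
raw = pd.DataFrame({
    "timestamp": [1556668800000, 1556672400000],  # 2019-05-01 00:00 and 01:00 UTC
    "brightness": [120, 480],
})

# Same three steps as df_to_timeseries, chained so that `raw` is left untouched
ts = (
    raw.assign(timestamp=pd.to_datetime(raw["timestamp"], unit="ms"))
       .set_index("timestamp")
       .tz_localize("UTC")         # the raw timestamps are interpreted as UTC
       .tz_convert("Asia/Seoul")   # then shifted to Korean local time (UTC+9)
)

print(ts.index[0])  # 2019-05-01 09:00:00+09:00
```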

#### Feature Extraction *at_home*

These functions determine whether app-usage events occurred at a participant's home. While exploring the heart rate (HR) data measured by the wearable smartwatch, we noticed around 7 large gaps in the HR data for each participant. Knowing that participants had to charge their wearable devices every night, we deduced that the coordinates of the last and first recordings before and after each gap would give us the participants' home coordinates. We used the ambient light table from the wearable data to find the gaps, as we were already planning to use it.

- Measures the time difference between each consecutive data recording 
- Identifies the ones with a time difference that exceeds 4 hours (assuming participants charged their wearable devices at night, when at home)
- Finds the closest corresponding indices in the coordinates table and extracts the coordinates of each identified "gap"
- Returns average (they may differ a bit, depending on the room where they took the wearable off) of the extracted coordinates

- The haversine distance function is then used to determine how far from the estimated home coordinates each app-usage event took place
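
The gap-detection steps listed above can be sketched on made-up data as follows. This is a minimal illustration, assuming the location reading closest in time to each gap is found with `Index.get_indexer(method="nearest")`, which needs a sorted index without duplicate timestamps. All timestamps and coordinates are hypothetical.

```python
import pandas as pd

# Made-up wearable recording times: an evening and a morning separated by an overnight gap
amb_index = pd.DatetimeIndex(
    ["2019-05-01 21:00", "2019-05-01 22:00", "2019-05-01 23:00",
     "2019-05-02 07:30", "2019-05-02 08:30"],
    tz="Asia/Seoul",
)

# Made-up location readings (sorted, unique timestamps)
loc = pd.DataFrame(
    {"longitude": [127.360, 127.362, 127.400], "latitude": [36.370, 36.372, 36.400]},
    index=pd.DatetimeIndex(
        ["2019-05-01 22:50", "2019-05-02 07:20", "2019-05-02 12:00"], tz="Asia/Seoul"
    ),
)

# Time elapsed since the previous recording
gaps = amb_index.to_series().diff()
# Timestamp of the first recording after each gap longer than 4 hours
gap_ends = gaps[gaps > pd.Timedelta(hours=4)].index

# Position of the location reading closest in time to each gap end
positions = loc.index.get_indexer(gap_ends, method="nearest")
home = loc.iloc[positions].mean()  # average coordinates over all detected gaps
print(home)
```

Here the only gap (23:00 to 07:30) ends closest to the 07:20 location reading, so that reading alone defines the estimated home.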

In [4]:
def get_home_coor(amb_df, loc_df):
    # Time elapsed between consecutive ambient light recordings
    time_diff = amb_df.index.to_series().diff()
    # Timestamps of the first recording after each gap longer than 4 hours
    jump_indices = time_diff[time_diff > pd.Timedelta(hours=4)].index
    # Positions of the location readings closest in time to each gap.
    # method="nearest" is required: without it get_indexer only finds exact
    # timestamp matches and returns -1 (i.e. the last row) for everything else.
    # It expects loc_df to have a sorted index without duplicate timestamps.
    positions = loc_df.index.get_indexer(jump_indices, method="nearest")

    # Extract the coordinates from loc_df at those positions and average them
    return loc_df.iloc[positions].mean()

def haversine(lon1, lat1, lon2, lat2): # Returns the haversine distance in meters between two coordinates given in degrees
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
    c = 2 * np.arcsin(np.sqrt(a))
    km = 6371 * c
    return km * 1000
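
As a quick sanity check of the distance function and of a 25 m home radius, the sketch below applies the same formula to made-up points. The home coordinates and the `events` frame are hypothetical; with an Earth radius of 6371 km, 0.0002 degrees of latitude is roughly 22 m and 0.0003 degrees roughly 33 m.

```python
import numpy as np
import pandas as pd

def haversine(lon1, lat1, lon2, lat2):
    # Same formula as above: great-circle distance in meters, Earth radius 6371 km
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    a = np.sin((lat2 - lat1) / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    return 6371 * 2 * np.arcsin(np.sqrt(a)) * 1000

home_lon, home_lat = 127.3600, 36.3700  # hypothetical home coordinates

# Three made-up events: at home, about 22 m north, about 33 m north
events = pd.DataFrame({
    "longitude": [127.3600, 127.3600, 127.3600],
    "latitude": [36.3700, 36.3702, 36.3703],
})
events["distance_m"] = haversine(events["longitude"], events["latitude"], home_lon, home_lat)
events["at_home"] = events["distance_m"] <= 25
print(events)
```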

#### Determining Entertainment app categories

The AppUsageEvent table contains a column called "category", which states the category of the application used. The values of this field were retrieved from Google Play or, for apps not available there, from application archive websites (e.g., https://apkcombo.com). The remaining apps were labeled manually (Kang et al., 2023).

Categories left out:

- PERSONALIZATION, COMMUNICATION, PHOTOGRAPHY, SYSTEM, FINANCE, TOOLS, PRODUCTIVITY, HEALTH_AND_FITNESS, TRAVEL_AND_LOCAL, MAPS_AND_NAVIGATION, LIFESTYLE, HOUSE_AND_HOME, ART_AND_DESIGN, FOOD_AND_DRINK, EDUCATION, BUSINESS, BEAUTY, AUTO_AND_VEHICLES, WEATHER
- _LIBRARIES_AND_DEMO_ (After some data exploration, we decided to remove this category from our chosen ones as it contained only one app which was the participants' university platform) 

Categories chosen as Entertainment related:

- **VIDEO_PLAYERS**,**MUSIC_AND_AUDIO**,**SOCIAL**, **GAME**, **SHOPPING**, **NEWS_AND_MAGAZINES**, **SPORTS**, **BOOKS_AND_REFERENCE**,**COMICS**
- _**ENTERTAINMENT**_ (this category was too vague, so we looked up which apps it included and moved those that belong to one of the other chosen categories: for example, "Netflix" and other video players were moved to VIDEO_PLAYERS, and other apps to SHOPPING, BOOKS_AND_REFERENCE and SOCIAL)
- _**MISC**_ (some apps from here were moved to COMICS and GAME)

Manually defined category
- **GAME_RELATED** (many apps within the ENTERTAINMENT category were related to games without being games themselves, such as game exchange platforms, game tools and game statistics)


##### Create chosen categories list and mapping dictionary (collapsible)

In [5]:
entertainment_categories = ['SOCIAL','SHOPPING','BOOKS_AND_REFERENCE',
                            'COMICS','MUSIC_AND_AUDIO', 'GAME','VIDEO_PLAYERS', 
                            'SPORTS','NEWS_AND_MAGAZINES','GAME_RELATED']

app_name_to_new_category = {
    # Target names must match entertainment_categories exactly ('VIDEO_PLAYERS', 'SOCIAL'),
    # otherwise the remapped apps are dropped by the entertainment filter
    'Netflix': 'VIDEO_PLAYERS',
    'AfreecaTV': 'VIDEO_PLAYERS',
    '카카오페이지': 'BOOKS_AND_REFERENCE',
    'Google Play 게임': 'GAME_RELATED',
    '롯데시네마': 'SHOPPING',
    '네이버TV': 'VIDEO_PLAYERS',
    'Twitch': 'VIDEO_PLAYERS',
    '어드벤처': 'GAME',
    'CGV': 'SHOPPING',
    '연애의과학': 'BOOKS_AND_REFERENCE',
    '팟빵': 'MUSIC_AND_AUDIO',
    'U+모바일tv': 'VIDEO_PLAYERS',
    'TVING': 'VIDEO_PLAYERS',
    'JAM Live': 'VIDEO_PLAYERS',
    'CashLeaflet': 'SHOPPING',
    '티켓링크': 'SHOPPING',
    'Galaxy Apps': 'APP_STORE',
    'LoL 상점': 'GAME_RELATED',
    'OP.GG': 'GAME_RELATED',
    'GGtics': 'GAME_RELATED',
    'CGV포토티켓': 'SHOPPING',
    'Prime Video': 'VIDEO_PLAYERS',
    '왓챠': 'VIDEO_PLAYERS',
    '인터파크 티켓': 'SHOPPING',
    '왓챠플레이': 'VIDEO_PLAYERS',
    'Doctor Who': 'VIDEO_PLAYERS',
    '팝콘티비': 'VIDEO_PLAYERS',
    '얼굴인식 체험판': 'LIBRARIES_AND_DEMO',
    '올레 tv 모바일': 'VIDEO_PLAYERS',
    '아이즈원 프라이빗 메일': 'VIDEO_PLAYERS',
    'V LIVE': 'VIDEO_PLAYERS',
    'FOW.KR': 'GAME_RELATED',
    'Steam': 'GAME_RELATED',
    '메가박스': 'SHOPPING',
    'Nintendo Switch Online': 'GAME_RELATED',
    '피키캐스트': 'SOCIAL',
    'HTV 3.4.6': 'VIDEO_PLAYERS',
    'Q.Feat': 'VIDEO_PLAYERS',
    'WoW BfA Talent Calculator': 'GAME_RELATED',
    '꽁음따 시즌4': 'GAME',
    '프리스타일2:플라잉덩크': 'GAME',
    'Hentoid': 'COMICS',
}

##### Function to map apps to new categories

In [6]:
def map_entertainment_to_new_category(row):
    if row["category"] == "ENTERTAINMENT":
        return app_name_to_new_category.get(row["name"], "ENTERTAINMENT")
    elif row["category"] == "MISC":
        return app_name_to_new_category.get(row["name"], "MISC")
    else:
        return row["category"]
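
The row-wise function above can also be written without `apply`. The sketch below is an alternative vectorized form, not the notebook's own method, shown on a made-up frame with a two-entry subset of the dictionary (`remap`, `SomeUnlistedApp` and `SomeGame` are hypothetical names). It remaps only rows whose category is ENTERTAINMENT or MISC and whose app name is in the dictionary, which is the same rule the function applies.

```python
import pandas as pd

# Small subset of the remapping dictionary, for illustration only
remap = {"OP.GG": "GAME_RELATED", "CGV": "SHOPPING"}

# Made-up app usage rows
apps = pd.DataFrame({
    "name": ["OP.GG", "CGV", "SomeUnlistedApp", "SomeGame"],
    "category": ["ENTERTAINMENT", "ENTERTAINMENT", "MISC", "GAME"],
})

# Only rows in the two catch-all categories are candidates for remapping
catch_all = apps["category"].isin(["ENTERTAINMENT", "MISC"])
# New category for each app name; NaN where the app is not in the dictionary
remapped = apps["name"].map(remap)
# Replace the category only where both conditions hold, keep it otherwise
apps["category"] = apps["category"].mask(catch_all & remapped.notna(), remapped)
print(apps)
```

Apps in a catch-all category that are not listed keep their original ENTERTAINMENT or MISC label and are later removed by the entertainment filter.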

#### Converting interval and continuous to categorical

All of our dependent variables (stress, valence, arousal and attention) were measured on an interval scale ranging from −3 to +3 (for example, from not stressed at all to very stressed). For easier interpretation, we converted them to binary variables (for example, stressed or not stressed), hoping to make the results of the statistical analysis more intuitive and actionable.

- **0** for values -3, -2, -1 and 0
- **1** for values 1, 2, 3

Brightness, the light intensity in lumen per square meter (lux, lx), was originally a continuous variable. For easier interpretation, we grouped it into three levels:

- LOW: Less than 300 lx
- MEDIUM: Between 300 lx and 750 lx (both inclusive)
- HIGH: Greater than 750 lx

In [7]:
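def dv_to_binary(df, dv):
    # Modifies df in place: 0 for ratings from -3 to 0, 1 for ratings from 1 to 3
    df[dv] = df[dv].apply(lambda x: 0 if x <= 0 else 1)

def brightness_to_categorical(df):
    # Returns a Series of brightness levels (LOW, MEDIUM, HIGH) with the same index as df

    def categorize_brightness(brightness):
        if brightness < 300:
            return 'LOW'
        elif 300 <= brightness <= 750:
            return 'MEDIUM'
        else:
            return 'HIGH'

    return df["brightness"].apply(categorize_brightness)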
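
The thresholds can be checked at their boundaries with a small sketch on made-up values. One point to be aware of: a missing value (NaN) fails every comparison, so the apply-based functions above label it 1 and HIGH respectively. The vectorized variant below is an alternative under the assumption that missing values should stay missing; it is not the notebook's own implementation.

```python
import numpy as np
import pandas as pd

ratings = pd.Series([-3, 0, 1, 3, np.nan])  # made-up ESM ratings
# 1 for positive ratings, 0 otherwise; missing ratings stay missing
binary = ratings.gt(0).astype(float).where(ratings.notna())

lux = pd.Series([0, 299.9, 300, 750, 750.1, np.nan])  # made-up brightness values
# Start from HIGH, then overwrite the lower ranges; missing readings stay missing
levels = (
    pd.Series("HIGH", index=lux.index)
      .mask(lux <= 750, "MEDIUM")
      .mask(lux < 300, "LOW")
      .where(lux.notna())
)
print(binary.tolist())
print(levels.tolist())
```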
    

### Manipulation and merging

In [8]:
P = {} # Dictionary that'll contain all participant data

# get in situ data and turn it to TimeSeries 
dvs = ["stress", "valence","arousal","attention"]
esm_df = df_to_timeseries(pd.read_csv(esm_path),timestamp_col="responseTime")[dvs + ["pcode"]]
for dv in dvs:
    dv_to_binary(esm_df, dv)

for p_code in os.listdir(data_dir):
    if p_code.startswith("P"): # Only get Participant Directories
        pn = int(p_code[1:])
        
        # get tables (and columns) of interest and turn them to TimeSeries
        
        # AppUsageEvent
        app_df=df_to_timeseries(pd.read_csv(os.path.join(data_dir,p_code,"AppUsageEvent.csv")))[["name","category"]]
        # apply the function to update the app_category column
        app_df["category"] = app_df.apply(map_entertainment_to_new_category, axis=1)
        # filter for entertainment related apps only
        app_df = app_df[app_df["category"].isin(entertainment_categories)] 
        
        # AmbientLight
        amb_df = df_to_timeseries(pd.read_csv(os.path.join(data_dir,p_code, "AmbientLight.csv")))
        # apply brightness_to_categorical
        amb_df = brightness_to_categorical(amb_df)
        
        # Location
        loc_df = df_to_timeseries(pd.read_csv(os.path.join(data_dir,p_code, "Location.csv")))
        loc_df = loc_df[loc_df["accuracy"] < 20][["longitude","latitude"]] # remove inaccurate readings
        
        # merge using merge_asof: with the default direction="backward", each app event is matched
        # to the most recent earlier (or equal) timestamp rather than to an identical one
        joined_df = pd.merge_asof(app_df, amb_df, left_index=True,right_index=True)
        joined_df = pd.merge_asof(joined_df, loc_df, left_index=True,right_index=True)
        #joined_df.columns = column_names
        
        # calculate distance from determined home
        home_longitude, home_latitude = get_home_coor(amb_df,loc_df)
        distances = haversine(joined_df['longitude'], joined_df['latitude'], home_longitude, home_latitude)
        joined_df["at_home"] = distances <= 25
        
        
        final_df=pd.merge_asof(joined_df, esm_df[esm_df["pcode"] == p_code], left_index=True,right_index=True)
        P[pn] = final_df.drop(["longitude", "latitude"],axis=1)
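
To make the matching behavior of the as-of merges concrete, the sketch below compares the default backward matching with `direction="nearest"` and with a `tolerance`, on made-up events and readings. The `tz_index` helper, the frames and the 30-minute tolerance are hypothetical choices for illustration.

```python
import pandas as pd

def tz_index(times):
    # Hypothetical helper: build a Korean-time index from naive time strings
    return pd.DatetimeIndex(times, tz="Asia/Seoul")

# Made-up app events and brightness readings
events = pd.DataFrame({"name": ["A", "B"]},
                      index=tz_index(["2019-05-01 10:00", "2019-05-01 10:50"]))
light = pd.DataFrame({"brightness": ["LOW", "HIGH"]},
                     index=tz_index(["2019-05-01 09:55", "2019-05-01 11:00"]))

# Default: most recent reading at or before each event
backward = pd.merge_asof(events, light, left_index=True, right_index=True)
# Closest reading in either direction
nearest = pd.merge_asof(events, light, left_index=True, right_index=True,
                        direction="nearest")
# Backward matching, but only if the reading is at most 30 minutes old
bounded = pd.merge_asof(events, light, left_index=True, right_index=True,
                        tolerance=pd.Timedelta(minutes=30))

print(backward["brightness"].tolist())  # ['LOW', 'LOW']
print(nearest["brightness"].tolist())   # ['LOW', 'HIGH']
```

Event B (10:50) is 55 minutes after the 09:55 reading and 10 minutes before the 11:00 one, so the two directions disagree, and with the 30-minute tolerance it gets no match at all.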



### Saving preprocessed data

In [9]:
# all data
df = pd.concat(P.values(), axis=0).dropna()
os.makedirs("clean_data", exist_ok=True) # create the output directory if it does not exist yet
df.to_csv(os.path.join("clean_data", "final_data.csv"))
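
When the saved file is read back, the timestamp index arrives as plain strings with a +09:00 offset. The sketch below shows one way to restore a timezone-aware index, using an in-memory buffer as a stand-in for final_data.csv and a made-up frame (the `out` and `back` names and the values are hypothetical).

```python
import io
import pandas as pd

# Made-up frame with a timezone-aware index, standing in for the final data
out = pd.DataFrame(
    {"at_home": [True, False]},
    index=pd.DatetimeIndex(["2019-05-01 09:00", "2019-05-01 21:30"],
                           tz="Asia/Seoul", name="timestamp"),
)

buffer = io.StringIO()  # in-memory stand-in for the CSV file
out.to_csv(buffer)
buffer.seek(0)

back = pd.read_csv(buffer, index_col=0)
# Parse the offset-bearing strings as UTC, then convert back to Korean time
back.index = pd.to_datetime(back.index, utc=True).tz_convert("Asia/Seoul")
print(back)
```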