# Project Intro
*Nicole Basinski, Nydia Chang, Danny Wan, Jason Xing*  
  
As part of the Erdos May Bootcamp 2022, we submit our project using gaze position data on Atari videogames, obtained from the [Atari-HEAD: Atari Human Eye-Tracking and Demonstration Dataset](https://zenodo.org/record/3451402#.YpEEB5PML0r).  
  
20 Atari games were played by 4 human players in a semi-frame-by-frame manner to obtain gaze samples, along with the action taken by the player, the current game score, and some other information. Semi-frame-by-frame gameplay allowed the players to make near-optimal game decisions, which led to scores in the range of known human records. It also produced more accurate associations between game state and action by removing the effect of human reaction time, which makes the data better suited to supervised learning algorithms.  

Regular trials were set to a 15-minute timeframe; highscore trials allowed the player to continue gameplay until they ran out of lives (up to a maximum of 2 hours). Each trial, whether regular or highscore, corresponds to a text file and a .tar.bz2 file. The text file records player actions, gaze positions, and other data for each frame during that trial, and the .tar.bz2 file contains .png images of each game frame.  
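
To make the archive side of this concrete, here is a minimal sketch of pulling the .png frames out of one trial's .tar.bz2 file into a single flat directory, using only the standard library. The function name `extract_pngs` and the paths in the usage note are hypothetical; the project's own extraction lives in `utils/clean_data.py` and may work differently.

```python
import tarfile
from pathlib import Path


def extract_pngs(archive_path, frames_dir):
    """Copy every .png member of a .tar.bz2 archive into one flat directory.

    Hypothetical helper for illustration; not the project's actual code.
    """
    frames_dir = Path(frames_dir)
    frames_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with tarfile.open(archive_path, mode="r:bz2") as archive:
        for member in archive:
            # Skip directories and anything that is not a .png image
            if not (member.isfile() and member.name.lower().endswith(".png")):
                continue
            # Keep only the base name so nested archive folders are flattened
            target = frames_dir / Path(member.name).name
            with archive.extractfile(member) as source:
                target.write_bytes(source.read())
            written.append(target)
    return sorted(written)
```

A call might look like `extract_pngs("raw_data/some_trial.tar.bz2", "final_data/frames")`, where both paths are placeholders. Writing each member by its base name avoids recreating the archive's internal folder layout.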
    
This project uses the game frames along with the associated gaze positions to model the best resulting action. We focused on the Ms. Pacman game and specifically the `highscore` trials, largely for time and computational reasons. However, much of the cleaning, EDA, and modeling included in this repo would apply to other games and trials as well.

TO DO: MAYBE WANT TO PUT A SUMMARIZED VERSION OF OUR KPIS/STAKEHOLDERS HERE??

In [1]:
# imports
import os
import pandas as pd
import shutil

## Cleaning Data
As a first step, we clean the data into a workable format. The cells below will:

1. clean the appropriate text files,
2. combine the cleaned text files into a single csv (named `combined_trial_data.csv`),
3. extract the appropriate images (placed in a single directory called `frames`), and
4. ravel the image data and place this information into a single csv (called `ravelled_image_data.csv`).

Once those steps succeed, a further cell removes the files that are no longer needed from this directory to conserve disk space. The functions in `utils/clean_data.py` can also be pulled out and adapted as needed.  
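
As an illustration of what "ravelling" the image data means, the sketch below flattens each frame into one row of a 2-D array, which is the shape a csv needs. It assumes the frames have already been loaded as equally sized NumPy arrays; the name `ravel_frames` is hypothetical, and the actual logic in `utils/clean_data.py` may differ.

```python
import numpy as np


def ravel_frames(frames):
    """Flatten each frame (for example H x W x C) into a single row.

    Hypothetical helper: `frames` is a list of NumPy arrays with the same shape.
    """
    stacked = np.stack(frames)                # shape: (n_frames, H, W, C)
    return stacked.reshape(len(frames), -1)   # shape: (n_frames, H * W * C)


# Two tiny stand-in "frames" of height 2, width 3, and 1 channel
demo_frames = [
    np.arange(6).reshape(2, 3, 1),
    np.arange(6, 12).reshape(2, 3, 1),
]
rows = ravel_frames(demo_frames)
print(rows.shape)
```

Each row can be turned back into an image with `row.reshape(height, width, channels)`, provided the original frame shape is stored somewhere.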

To run the cleaning scripts as written below, declare the parameters described here.  
- `source_dir`: Str. By default, this is `raw_data`. This is the top-level directory for the script to look to for extracting the original data.
- `target_dir`: Str. By default, this is `cleaned_data`. This is the top-level directory where the script will place the cleaned form of each file from `raw_data`.
- `final_dir`: Str. By default, this is `final_data`. This is the top-level directory where the final forms of the combined clean data will be placed.
- `game_name`: Str. The name of the game data being cleaned. This is used to look into the correct game directory under `raw_data` (derived from the directory structure of the original data).
- `highscore`: Boolean. Whether or not the trials being cleaned are highscore trials. This only affects which directory to point to, as the highscore trial data and regular trial data are in the same format.

After the cells below have been run, the cleaned data will have the following directory structure (made to imitate the original data directory structure):
```
erdos-project-2022--atari-HEAD
|___final_data
    |___{game_name}
        |   combined_trial_data.csv
        |   ravelled_image_data.csv
        |
        |___frames
        |   |   {image-1}.png
        |   |   {image-2}.png
        |   |   ...
        |
        |___highscore
            |   combined_trial_data.csv
            |   ravelled_image_data.csv
            |
            |___frames
                |   {image-1}.png
                |   {image-2}.png
                |   ...
```
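
The layout above can be expressed as a small path helper. This is a sketch only: `final_data_dir` is a hypothetical name, and it assumes, as the tree suggests, that highscore output sits in a `highscore` subdirectory of the game directory while regular-trial output sits in the game directory itself.

```python
from pathlib import Path


def final_data_dir(game_name, highscore, final_dir="final_data"):
    """Return the directory holding the combined csv files and `frames`.

    Hypothetical helper mirroring the directory tree shown above.
    """
    base = Path(final_dir) / game_name
    return base / "highscore" if highscore else base


target = final_data_dir("ms_pacman", highscore=True)
print(target / "combined_trial_data.csv")
print(target / "frames")
```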

In [2]:
# declare parameters
GAME_NAME = 'ms_pacman'
HIGHSCORE = True
SOURCE_DIR = 'raw_data'
TARGET_DIR = 'cleaned_data'
FINAL_DIR = 'final_data'

In [3]:
from utils.clean_data import clean_all_raw_data

In [8]:
# this could take a minute to run
# for reference, on the Ms Pacman highscore trials, this took ~55s
clean_all_raw_data(
    game_name = GAME_NAME,
    highscore = HIGHSCORE,
    source_dir = SOURCE_DIR,
    target_dir = TARGET_DIR,
    final_dir = FINAL_DIR,
)

In [5]:
## THIS CELL WILL DELETE THINGS
## It keeps everything in the `final_data` directory
##  but deletes all the other data files.
## Recommended to conserve disk space, but you've been warned

# use the parameters declared above so a custom directory is not missed
shutil.rmtree(SOURCE_DIR)
shutil.rmtree(TARGET_DIR)
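
Because the deletion is irreversible, it can help to confirm that the final outputs exist before removing anything. The following is a minimal sketch of such a guard; `remove_intermediate_dirs` is a hypothetical helper, not part of `utils/clean_data.py`.

```python
import shutil
from pathlib import Path


def remove_intermediate_dirs(dirs_to_remove, required_files):
    """Delete directories only if every required output file exists.

    Hypothetical helper: raises FileNotFoundError and deletes nothing
    when any file in `required_files` is missing.
    """
    missing = [str(path) for path in required_files if not Path(path).is_file()]
    if missing:
        raise FileNotFoundError(f"Refusing to delete; missing outputs: {missing}")
    removed = []
    for directory in map(Path, dirs_to_remove):
        # Skip directories that are already gone instead of failing
        if directory.is_dir():
            shutil.rmtree(directory)
            removed.append(directory)
    return removed
```

For the parameters above, the call could look like `remove_intermediate_dirs([SOURCE_DIR, TARGET_DIR], [Path(FINAL_DIR) / GAME_NAME / "highscore" / "combined_trial_data.csv"])`, assuming the output layout described earlier.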

In [7]:
import torch

## EDA
TO DO

## Modeling

### Train-Test Split

- Test set: 20% of the full data set
- Validation set: 20% of the test set
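
The proportions above can be sketched with a small hold-out helper. This is an illustration under assumptions: the split is a plain random shuffle over frame indices, `hold_out` is a hypothetical name, and the validation set is carved out of the test set exactly as the note reads. Carving validation data out of the training portion instead is a common alternative and only requires changing which list is passed to the second call.

```python
import random


def hold_out(items, fraction, seed=0):
    """Randomly split `items` into (kept, held_out).

    Hypothetical helper: `held_out` gets about `fraction` of the items.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_held = round(len(shuffled) * fraction)
    return shuffled[n_held:], shuffled[:n_held]


frame_ids = range(100)  # stand-in for the real frame indices
train_ids, test_ids = hold_out(frame_ids, 0.2)         # 20% of the full set
test_ids, val_ids = hold_out(test_ids, 0.2, seed=1)    # 20% of the test set
print(len(train_ids), len(test_ids), len(val_ids))
```

Fixing the seed keeps the split reproducible between notebook runs.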

## Appendix

**Citations:**  
  
Ruohan Zhang, Calen Walshe, Zhuode Liu, Lin Guan, Karl S. Muller, Jake A. Whritner, Luxin Zhang, Mary Hayhoe, & Dana Ballard. (2019). Atari-HEAD: Atari Human Eye-Tracking and Demonstration Dataset (Version 4) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.3451402