# CSE 158: Assignment 2

## This is an open-ended assignment in which you are expected to write a detailed report documenting your results. Please submit your solution electronically via Gradescope, on or before Dec 3 (Tuesday week 10). This assignment is worth 25% of the final grade.

## Specify the names of all of your group members when submitting. Submissions should be in the form of a written report, which is expected to be at least four pages (double column, 11pt), or roughly 2,500-3,000 words, plus figures, tables, and equations. See the example template in the lecture slides to get an idea of the expected length.

## Tasks

### 1. Identify a dataset to study, and perform an exploratory analysis of the data. Describe the dataset, including its basic statistics and properties, and report any interesting findings. This exploratory analysis should motivate the design of your model in the following sections. Datasets should be reasonably large (e.g. more than 50,000 samples).

In [8]:
from collections import defaultdict, namedtuple

import csv
import numpy as np

In [35]:
class Game:
    def __init__(self, line):
        # Game ID
        self.id = line[0]
        
        # Whether game was rated or not
        # Compare by value: `is` tests object identity, which fails for strings read from a file
        self.rated = line[1] == 'TRUE'
        
        # Time of game creation and last move
        self.created_at = float(line[2])
        self.last_move_at = float(line[3])
        
        # Number of turns in game
        self.turns = int(line[4])
        
        # Game result: 'outoftime resign mate draw'
        self.victory_status = line[5]
        
        # Game winner: 'white black draw'
        self.winner = line[6]
        
        # Time increment, describes game timing in time+increment form
        self.increment_code = line[7]
        
        # ID and rating of white and black players
        self.white_id = line[8]
        self.white_rating = int(line[9])
        self.black_id = line[10]
        self.black_rating = int(line[11])
        
        # All game moves in standard chess notation
        self.moves = line[12]
        
        # Standardised code for opening
        self.opening_eco = line[13]
        
        # Opening name and number of moves in opening phase
        self.opening_name = line[14]
        self.opening_ply = int(line[15])
    
    def __repr__(self):
        return 'Game {}, rated {}, turns {}, status {}, winner {}'.format(self.id, self.rated, self.turns,
                                                                          self.victory_status, self.winner)
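
The `increment_code` field is stored as a raw string in time+increment form (for example `'15+2'`), so it has to be split before it can be used as a numeric feature. A minimal sketch of such a helper follows. The function name `parse_increment_code` is hypothetical, and the units (base time in minutes, increment in seconds) are an assumption about the data rather than something stated in the source.

```python
from typing import Tuple


def parse_increment_code(code: str) -> Tuple[int, int]:
    """Split a 'time+increment' string such as '15+2' into two integers.

    Assumption: the first number is the base time (minutes) and the
    second is the per-move increment (seconds).
    """
    base, increment = code.split('+')
    return int(base), int(increment)


print(parse_increment_code('15+2'))  # (15, 2)
```

Keeping the two parts as separate integers lets each be used as its own feature.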

In [36]:
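def ParseCsv(filepath: str):
    # Yield each data row as a list of strings
    # newline='' is what the csv module recommends when opening files
    with open(filepath, newline='') as csv_file:
        csv_reader = csv.reader(csv_file, delimiter=',')
        next(csv_reader, None)  # skip the header row
        for row in csv_reader:
            yield row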

In [37]:
CUTOFF = 20
myGame = None
# Print the first CUTOFF rows and keep the last of them as a parsed Game
for idx, line in enumerate(ParseCsv('data/games.csv'), start=1):
    if idx > CUTOFF:
        break
    if idx == CUTOFF:
        myGame = Game(line)
    print(line)

['TZJHLljE', 'FALSE', '1.50421E+12', '1.50421E+12', '13', 'outoftime', 'white', '15+2', 'bourgris', '1500', 'a-00', '1191', 'd4 d5 c4 c6 cxd5 e6 dxe6 fxe6 Nf3 Bb4+ Nc3 Ba5 Bf4', 'D10', 'Slav Defense: Exchange Variation', '5']
['l1NXvwaE', 'TRUE', '1.50413E+12', '1.50413E+12', '16', 'resign', 'black', '5+10', 'a-00', '1322', 'skinnerua', '1261', 'd4 Nc6 e4 e5 f4 f6 dxe5 fxe5 fxe5 Nxe5 Qd4 Nc6 Qe5+ Nxe5 c4 Bb4+', 'B00', 'Nimzowitsch Defense: Kennedy Variation', '4']
['mIICvQHh', 'TRUE', '1.50413E+12', '1.50413E+12', '61', 'mate', 'white', '5+10', 'ischia', '1496', 'a-00', '1500', 'e4 e5 d3 d6 Be3 c6 Be2 b5 Nd2 a5 a4 c5 axb5 Nc6 bxc6 Ra6 Nc4 a4 c3 a3 Nxa3 Rxa3 Rxa3 c4 dxc4 d5 cxd5 Qxd5 exd5 Be6 Ra8+ Ke7 Bc5+ Kf6 Bxf8 Kg6 Bxg7 Kxg7 dxe6 Kh6 exf7 Nf6 Rxh8 Nh5 Bxh5 Kg5 Rxh7 Kf5 Qf3+ Ke6 Bg4+ Kd6 Rh6+ Kc5 Qe3+ Kb5 c4+ Kb4 Qc3+ Ka4 Bd1#', 'C20', "King's Pawn Game: Leonardis Variation", '3']
['kWKvrqYL', 'TRUE', '1.50411E+12', '1.50411E+12', '61', 'mate', 'white', '20+0', 'daniamurashov', '1439

In [38]:
print(myGame)

Game x31mXlvc, rated False, turns 25, status resign, winner white


In [39]:
sum(1 for _ in ParseCsv('data/games.csv'))

20058

In [40]:
whiteUsers = set(l[8] for l in ParseCsv('data/games.csv'))

In [42]:
blackUsers = set(l[10] for l in ParseCsv('data/games.csv'))

In [41]:
len(whiteUsers)

9438

In [43]:
len(blackUsers)

9331

In [46]:
len(whiteUsers.union(blackUsers))

15635

### In the dataset there are 20,058 games with 15,635 unique players.
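
Beyond the game and player counts, the same column indices used in the `Game` class (0 for the ID, 5 for the victory status, 6 for the winner) support a quick look at the outcome distributions and at whether any game ID appears more than once. The sketch below is one way to do this. The function name `summarize` and the four sample rows are hypothetical stand-ins, not taken from the dataset; in the notebook the rows would come from `ParseCsv('data/games.csv')`.

```python
from collections import Counter


def summarize(rows):
    """Count rows, unique game IDs, victory statuses and winners."""
    ids = Counter()
    status = Counter()
    winner = Counter()
    n_rows = 0
    for row in rows:
        n_rows += 1
        ids[row[0]] += 1      # column 0: game ID
        status[row[5]] += 1   # column 5: victory status
        winner[row[6]] += 1   # column 6: winner
    return {
        'n_rows': n_rows,
        'n_unique_ids': len(ids),
        'status': status,
        'winner': winner,
    }


# Hypothetical sample rows (illustrative only, not real data).
# Only columns 0, 5 and 6 matter here; the rest are placeholders.
sample_rows = [
    ['game_a', 'TRUE', '0', '0', '13', 'outoftime', 'white'],
    ['game_b', 'TRUE', '0', '0', '16', 'resign', 'black'],
    ['game_c', 'FALSE', '0', '0', '61', 'mate', 'white'],
    ['game_c', 'FALSE', '0', '0', '61', 'mate', 'white'],  # repeated ID
]

summary = summarize(sample_rows)
print(summary['n_rows'], summary['n_unique_ids'])
print(summary['status'].most_common())
print(summary['winner'].most_common())
```

If `n_unique_ids` turns out to be smaller than `n_rows` on the full file, the repeated rows would need to be dropped before the data is split for training and testing.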

### 2. Identify a predictive task that can be studied on this dataset. Describe how you will evaluate your model at this predictive task, what relevant baselines can be used for comparison, and how you will assess the validity of your model’s predictions. It’s fine to use models that were described in class here (i.e., you don’t have to invent anything new (though you may!)), though you should explain and justify which model was appropriate for the task. It’s also important in this section to carefully describe what features you will use and how you had to process the data to obtain them.
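
As an illustration of what a baseline and its evaluation could look like, suppose the task were to predict the winner of a game from the two player ratings. This task is an assumption made for the example, not one chosen in the text above. A trivial baseline predicts that the higher-rated player wins, and accuracy on held-out games gives a reference point for any learned model. The function names are hypothetical, and the three `(white_rating, black_rating, winner)` triples are copied from the first rows printed earlier purely to make the sketch runnable.

```python
def higher_rating_baseline(white_rating, black_rating):
    """Predict that the higher-rated player wins.

    Assumption: ties are broken in favour of white (an arbitrary choice).
    """
    return 'black' if black_rating > white_rating else 'white'


def accuracy(examples):
    """Fraction of (white_rating, black_rating, winner) triples predicted correctly."""
    correct = sum(
        1 for white, black, winner in examples
        if higher_rating_baseline(white, black) == winner
    )
    return correct / len(examples)


# Toy input: the first three rows shown in the notebook output.
# Far too small to say anything about the baseline's real performance.
toy_examples = [
    (1500, 1191, 'white'),
    (1322, 1261, 'black'),
    (1496, 1500, 'white'),
]

toy_accuracy = accuracy(toy_examples)
print(toy_accuracy)
```

A baseline like this never predicts a draw, so its accuracy is capped by the fraction of decisive games in the evaluation set.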

### 3. Describe your model. Explain and justify your decision to use the model you proposed. How will you optimize it? Did you run into any issues due to scalability, overfitting, etc.? What other models did you consider for comparison? What were your unsuccessful attempts along the way? What are the strengths and weaknesses of the different models being compared?

### 4. Describe literature related to the problem you are studying. If you are using an existing dataset, where did it come from and how was it used? What other similar datasets have been studied in the past and how? What are the state-of-the-art methods currently employed to study this type of data? Are the conclusions from existing work similar to or different from your own findings?

### 5. Describe your results and conclusions. How well does your model perform compared to alternatives, and what is the significance of the results? Which feature representations worked well and which did not? What is the interpretation of your model’s parameters? Why did the proposed model succeed where others failed (or if it failed, why did it fail)?