# Trexquant Interview Project (The Hangman Game)

* Copyright Trexquant Investment LP. All Rights Reserved. 
* Redistribution of this question without written consent from Trexquant is prohibited

## Instruction:
For this coding test, your mission is to write an algorithm that plays the game of Hangman through our API server. 

When a user plays Hangman, the server first selects a secret word at random from a list. The server then returns a row of space-separated underscores, one for each letter in the secret word, and asks the user to guess a letter. If the user guesses a letter that is in the word, the word is redisplayed with all instances of that letter shown in the correct positions, along with any letters correctly guessed on previous turns. If the letter does not appear in the word, the user is charged with an incorrect guess. The user keeps guessing letters until either (1) the user has correctly guessed all the letters in the word or (2) the user has made six incorrect guesses.
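
The rules above can be sketched as a small local simulator. This is only an illustration, not the server's code: `mask_word`, `play_round`, and `alphabetical` are hypothetical names, and charging a repeated guess as a miss is an assumption.

```python
def mask_word(secret, guessed):
    """Show guessed letters in place and '_' elsewhere, space separated."""
    return " ".join(ch if ch in guessed else "_" for ch in secret)


def play_round(secret, guess_fn, max_wrong=6):
    """Play one game; guess_fn(display, guessed) must return a letter."""
    guessed, wrong = set(), 0
    while wrong < max_wrong:
        display = mask_word(secret, guessed)
        if "_" not in display:
            return True  # every letter has been revealed
        letter = guess_fn(display, guessed)
        # Assumption: a repeated guess is charged like a miss.
        if letter in guessed or letter not in secret:
            wrong += 1
        guessed.add(letter)
    return False  # the allowed incorrect guesses are used up


def alphabetical(display, guessed):
    """Toy strategy: the first letter of the alphabet not yet guessed."""
    return next(c for c in "abcdefghijklmnopqrstuvwxyz" if c not in guessed)
```

Calling `play_round("cab", alphabetical)` plays one full game with the toy strategy; any function with the same `(display, guessed)` signature can be swapped in.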

You are required to write a "guess" function that takes the current word (with underscores) as input and returns a letter to guess. You will use the API code below to play 1,000 Hangman games. You can play practice games before you start recording your game results.

Your algorithm is permitted to use a training set of approximately 250,000 dictionary words. Your algorithm will be tested on an entirely disjoint set of 250,000 dictionary words. Please note that this means the words that you will ultimately be tested on do NOT appear in the dictionary that you are given. You are not permitted to use any dictionary other than the training dictionary we provided. This requirement will be strictly enforced by code review.
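
Because the test words never appear in the training dictionary, one way to estimate real performance is to hold out part of the training dictionary and play only on the held-out words. A minimal sketch, assuming a hypothetical helper `split_dictionary` and an arbitrary 20% hold-out fraction:

```python
import random


def split_dictionary(words, holdout_fraction=0.2, seed=0):
    """Shuffle the unique words and split them into disjoint train/held-out lists."""
    unique_words = sorted(set(words))      # drop duplicates so the halves cannot overlap
    rng = random.Random(seed)              # fixed seed keeps the split reproducible
    rng.shuffle(unique_words)
    num_holdout = int(len(unique_words) * holdout_fraction)
    cut = len(unique_words) - num_holdout
    return unique_words[:cut], unique_words[cut:]
```

The strategy would then be fitted on the first list only and evaluated by playing games on the second.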

You are provided with a basic, working algorithm. This algorithm will match the provided masked string (e.g. a _ _ l e) to all possible words in the dictionary, tabulate the frequency of letters appearing in these possible words, and then guess the letter with the highest frequency of appearance that has not already been guessed. If there are no remaining words that match then it will default back to the character frequency distribution of the entire dictionary.
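
The benchmark described above can be sketched roughly as follows. This is a reconstruction from the description, not the provided code; `baseline_guess` is a hypothetical name, and counting every occurrence of a letter (rather than one per word) is an assumption.

```python
import collections
import re


def baseline_guess(masked, dictionary, guessed):
    """Guess the most frequent unguessed letter among words matching the mask."""
    # "a _ _ l e" -> regex "a..le"
    pattern = re.compile(masked.replace(" ", "").replace("_", "."))
    candidates = [w for w in dictionary if pattern.fullmatch(w)]
    # Try the matching words first, then fall back to the whole dictionary.
    for source in (candidates, dictionary):
        counts = collections.Counter("".join(source))
        for letter, _ in counts.most_common():
            if letter not in guessed:
                return letter
    return None  # nothing left to guess
```

For example, `baseline_guess("a _ _ l e", words, {"a", "l", "e"})` only tallies letters from words such as "apple" or "angle" that fit the mask.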

This benchmark strategy is successful approximately 18% of the time. Your task is to design an algorithm that significantly outperforms this benchmark.

In [16]:
import json
import requests
import random
import string
import secrets
import time
import re
import collections
import numpy as np
from copy import deepcopy
try:
    from urllib.parse import parse_qs, urlencode, urlparse
except ImportError:
    from urlparse import parse_qs, urlparse
    from urllib import urlencode

from requests.packages.urllib3.exceptions import InsecureRequestWarning

requests.packages.urllib3.disable_warnings(InsecureRequestWarning)

In [17]:
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.dataset import random_split


# Define the alphabet
alphabet = "abcdefghijklmnopqrstuvwxyz"
alphabet_size = len(alphabet)

# Function to convert characters to one-hot vectors
def char_to_one_hot(char):
    one_hot = np.zeros(alphabet_size)
    if char in alphabet:
        one_hot[alphabet.index(char)] = 1
    return one_hot

# Function to convert strings to sequences of one-hot vectors
def string_to_one_hot(string, max_length):
    string = string.lower()
    one_hot_seq = np.zeros((max_length, alphabet_size))
    for i, char in enumerate(string):
        one_hot_seq[i] = char_to_one_hot(char)
    return one_hot_seq

# Define a function to convert indices to one-hot probability distributions
def indices_to_one_hot(indices, num_classes):
    batch_size = indices.size(0)
    one_hot = torch.zeros(batch_size, num_classes)
    one_hot.scatter_(1, indices.view(-1, 1), 1)
    return one_hot

# Generate training data with multiple letters missing and repeated examples
def generate_training_data(words, max_length, num_samples):
    X_train = []
    y_train = []
    for _ in range(num_samples):
        for word in words:
            if len(word)>1:
                # print(word)
                missing_indices = np.random.choice(len(word), size=np.random.randint(1, len(word)), replace=False)
                missing_word = ''.join('.' if i in missing_indices else char for i, char in enumerate(word))
                X_train.append(string_to_one_hot(missing_word, max_length))
                y_train.append(alphabet.index(word[np.random.choice(missing_indices,1)[0]]))
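    # Stack into a single ndarray first; building a tensor from a list of arrays is very slow
    return torch.tensor(np.array(X_train), dtype=torch.float32), torch.tensor(y_train, dtype=torch.long)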

# Generate training data

with open("words_250000_train.txt", "r") as text_file:
    words = text_file.read().splitlines()

max_length = max(len(word) for word in words)+10
num_samples = 5  # Number of samples per word
X_train, y_train = generate_training_data(words, max_length, num_samples)

# Define the size of the test set
test_set_size = 0.2  # 20% of the original training data

# Calculate the size of the test set
num_test_samples = int(len(X_train) * test_set_size)

# Create a TensorDataset for the full training set
full_train_dataset = TensorDataset(X_train, y_train)

# Use random_split to split the full training dataset into train and test datasets
train_dataset, test_dataset = random_split(full_train_dataset, [len(X_train) - num_test_samples, num_test_samples])

# Create DataLoaders for train and test sets
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)  # No need to shuffle for test set

# Per-epoch loss history, filled in by the training loop below
train_losses = []
test_losses = []


In [20]:
# Define the attention model with multi-head attention
class AttentionModel(nn.Module):
    def __init__(self, input_size, hidden_size, output_size, num_heads):
        super(AttentionModel, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, input_size)
        self.attention = nn.MultiheadAttention(input_size, num_heads)
        self.fc = nn.Linear(output_size, 1)  # output_size is the padded sequence length (max_length)
        self.softmax=nn.Softmax(dim=1)

    def forward(self, input):
        input=self.fc1(input)
        input=self.fc2(input)
        input = input.permute(1, 0, 2)  # MultiheadAttention expects (sequence_length, batch_size, input_size)
        output, _ = self.attention(input, input, input)
        # (seq_len, batch, 26) -> (batch, 26, seq_len); fc collapses the sequence axis to one score per letter
        # squeeze(-1) removes only that last axis, so a batch of size 1 keeps its batch dimension
        output = self.fc(output.permute(1, 2, 0)).squeeze(-1)  # (batch_size, alphabet_size)
        return self.softmax(output)

# Initialize the model
hidden_size = 64
num_heads = 2
model = AttentionModel(alphabet_size, hidden_size, max_length, num_heads)
# Ensure model parameters require gradients
for param in model.parameters():
    param.requires_grad = True



In [21]:
# Train the model
epochs = 100

# Define loss function and optimizer
criterion = nn.KLDivLoss(reduction='batchmean')
optimizer = optim.Adam(model.parameters(), lr=0.002)

for epoch in range(epochs):
    running_train_loss = 0.0
    running_test_loss = 0.0
    
    # Training loop
    for inputs, targets_indices in train_loader:
        optimizer.zero_grad()
        predicted = model(inputs)
        targets = indices_to_one_hot(targets_indices, alphabet_size)
        predicted_log_probs = torch.log(predicted + 1e-10)
        loss = criterion(predicted_log_probs, targets.float())  
        loss.backward()
        optimizer.step()
        running_train_loss += loss.item()
    
    # Evaluation on test set
    model.eval()  # Set model to evaluation mode
    with torch.no_grad():
        for inputs, targets_indices in test_loader:
            predicted = model(inputs)
            targets = indices_to_one_hot(targets_indices, alphabet_size)
            predicted_log_probs = torch.log(predicted + 1e-10)
            loss = criterion(predicted_log_probs, targets.float())
            running_test_loss += loss.item()
    
    # Calculate average losses
    avg_train_loss = running_train_loss / len(train_loader)
    avg_test_loss = running_test_loss / len(test_loader)
    train_losses.append(avg_train_loss)
    test_losses.append(avg_test_loss)
    
    print(f'Epoch [{epoch+1}/{epochs}], Train Loss: {avg_train_loss:.4f}, Test Loss: {avg_test_loss:.4f}')

    model.train()  # Set model back to training mode

Epoch [1/100], Train Loss: 2.8798, Test Loss: 2.8757
Epoch [2/100], Train Loss: 2.8659, Test Loss: 2.8721
Epoch [3/100], Train Loss: 2.8628, Test Loss: 2.8593
Epoch [4/100], Train Loss: 2.8612, Test Loss: 2.8598
Epoch [5/100], Train Loss: 2.8596, Test Loss: 2.8589
Epoch [6/100], Train Loss: 2.8584, Test Loss: 2.8584
Epoch [7/100], Train Loss: 2.8576, Test Loss: 2.8577
Epoch [8/100], Train Loss: 2.8571, Test Loss: 2.8573
Epoch [9/100], Train Loss: 2.8567, Test Loss: 2.8712
Epoch [10/100], Train Loss: 2.8563, Test Loss: 2.8565
Epoch [11/100], Train Loss: 2.8557, Test Loss: 2.8571
Epoch [12/100], Train Loss: 2.8556, Test Loss: 2.8542
Epoch [13/100], Train Loss: 2.8553, Test Loss: 2.8565
Epoch [14/100], Train Loss: 2.8551, Test Loss: 2.8582
Epoch [15/100], Train Loss: 2.8547, Test Loss: 2.8593
Epoch [16/100], Train Loss: 2.8549, Test Loss: 2.8561
Epoch [17/100], Train Loss: 2.8546, Test Loss: 2.8558
Epoch [18/100], Train Loss: 2.8547, Test Loss: 2.8576
Epoch [19/100], Train Loss: 2.8546, T
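
As a reference point for the losses logged above: with one-hot targets, the KL divergence used in the training cell reduces to the negative log-probability the model assigns to the target letter. The sketch below checks this with the standard library only; `kl_one_hot` is a hypothetical helper, and the `eps` term mirrors the `1e-10` added before the log in the training loop.

```python
import math


def kl_one_hot(target_index, probs, eps=1e-10):
    """KL(t || p) for a one-hot t: only the target term survives, giving -log p[target]."""
    return -math.log(probs[target_index] + eps)


# A model that spreads probability evenly over 26 letters scores ln(26).
uniform = [1.0 / 26] * 26
uniform_loss = kl_one_hot(0, uniform)
print(f"uniform-guess loss: {uniform_loss:.4f}")
```

A uniform guess over 26 letters therefore costs about 3.258, which gives a baseline to compare the logged train and test losses against.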

In [39]:
from itertools import combinations

class HangmanAPI(object):
    def __init__(self, access_token=None, session=None, timeout=None):
        self.access_token = access_token
        self.session = session or requests.Session()
        self.timeout = timeout
        self.guessed_letters = []
        self.lives=6
        self.alphabet='abcdefghijklmnopqrstuvwxyz'

        full_dictionary_location = "words_250000_train.txt"
        largedict=self.build_dictionary(full_dictionary_location) 
        random.shuffle(largedict)
        split_point = len(largedict) // 2
        self.full_dictionary = largedict#[:split_point]    
        self.guess_dictionary = largedict#[split_point:]       
        self.full_dictionary_common_letter_sorted = collections.Counter("".join(self.full_dictionary)).most_common()
        self.current_dictionary = []
        self.model=model

    def build_dictionary(self, dictionary_file_location):
        # Read one word per line; the context manager closes the file even on error
        with open(dictionary_file_location, "r") as text_file:
            full_dictionary = text_file.read().splitlines()
        return full_dictionary
    
    
    # Predict a probability distribution over letters for a masked word
    def predict_missing_letters(self,word):
        input_tensor = torch.tensor([string_to_one_hot(word, max_length)], dtype=torch.float32)
        with torch.no_grad():
            predicted = self.model(input_tensor)
        return predicted.squeeze().tolist()
    
    def guess(self,word):
        current_alphabet=set(self.alphabet)-set(self.guessed_letters)

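        # Drop the separator spaces (robust to any spacing), then mark unknown letters with "." as in training
        clean_word = word.replace(" ", "").replace("_", ".")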

        predicted_pdf = self.predict_missing_letters(clean_word)

        for let in self.guessed_letters:
            predicted_pdf[alphabet_to_index(let)]=0
            
        predicted_pdf_0=[pdf/sum(predicted_pdf) for pdf in predicted_pdf]


        # For each candidate next guess, find the largest entropy reduction over all ways of
        # placing it in the blanks (the number of placements grows as 2^blanks - 1)
        max_reduction=0
        best_guess='!'
        current_entropy=entropy(predicted_pdf_0)

        indexes_dot = [i for i, letter in enumerate(clean_word) if letter == '.']
        all_combinations = []
        for r in range(1, len(indexes_dot) + 1):
            all_combinations.extend(combinations(indexes_dot, r))
            
        if(len(self.guessed_letters)>0):
            for nextguess in current_alphabet:
                for combo in all_combinations:
                    modifword=replace_at_indexes(clean_word,combo,nextguess)
                    predicted_pdf = self.predict_missing_letters(modifword)
                    for let in list(self.guessed_letters)+[nextguess]:
                        predicted_pdf[alphabet_to_index(let)]=0
                    predicted_pdf=[pdf/sum(predicted_pdf) for pdf in predicted_pdf]
                    expected_entropy=entropy(predicted_pdf)
                    entropy_reduction=current_entropy-expected_entropy

                    if entropy_reduction>max_reduction:
                        max_reduction=entropy_reduction
                        best_guess=nextguess

        if(best_guess=='!'):
            pred=np.argmax(predicted_pdf_0)
            best_guess=alphabet[pred]

        return best_guess
 
    ##########################################################
    # You'll likely not need to modify any of the code below #
    ##########################################################
    
    def play_game(self):
        # reset guessed letters to empty set and current plausible dictionary to the full dictionary
        self.guessed_letters = set()
        self.lives=10
        self.current_dictionary = self.full_dictionary
        word = random.choice(self.guess_dictionary )
        blanks = ['_',' '] * len(word)
        while self.lives > 0:
            print(" ".join(blanks))
            guess = self.guess(" ".join(blanks))
        
            if guess in word and guess not in self.guessed_letters:
                print("Correct!")
                for i in range(len(word)):
                    if word[i] == guess:
                        blanks[2*i] = guess
            else:
                print("Incorrect!")
                self.lives -= 1
        
            if '_' not in blanks:
                print("Congratulations! You guessed the word:", word)
                return True
            self.guessed_letters.add(guess)
            
        print("Out of lives! The word was:", word)
        return False
    

def alphabet_to_index(letter):
    """
    Maps English alphabet letters to their corresponding indices.
    
    Parameters:
        letter (str): A single English alphabet letter.
        
    Returns:
        int: The index of the input letter in the English alphabet (0 for 'a', 1 for 'b', ..., 25 for 'z').
             Returns None if the input is not a valid English alphabet letter.
    """
    if len(letter) != 1 or not letter.isalpha() or not letter.isascii():
        return None
    
    # Convert lowercase letter to index
    index = ord(letter.lower()) - ord('a')
    return index

def entropy(pdf):
    """
    Compute the entropy of a probability distribution.

    Args:
    pdf (array-like): Probability distribution function.

    Returns:
    float: Entropy value.
    """
    entropy_val = 0
    for prob in pdf:
        if prob != 0:  # To avoid log(0) which is undefined
            entropy_val -= prob * np.log2(prob)
    return entropy_val


def replace_at_indexes(input_string, indexes,charac):
    # Convert string to list of characters for easy manipulation
    string_list = list(input_string)
    # Replace characters at the specified indexes with charac
    for index in indexes:
        string_list[index] = charac
    # Join the list back into a string
    return ''.join(string_list)
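
The `guess` method repeatedly zeroes out already-guessed letters, renormalizes, and measures entropy. A standalone sketch of that step, using `math` instead of `numpy`, is shown here; `renormalize` and `entropy_bits` are hypothetical names, and returning the all-zero vector when nothing remains is an added assumption (the original would divide by zero in that case).

```python
import math

ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def renormalize(pdf, guessed, alphabet=ALPHABET):
    """Zero the probability of guessed letters and rescale the rest to sum to 1."""
    masked = [0.0 if letter in guessed else p for letter, p in zip(alphabet, pdf)]
    total = sum(masked)
    if total == 0:
        return masked  # nothing left to rescale
    return [p / total for p in masked]


def entropy_bits(pdf):
    """Shannon entropy in bits, skipping zero-probability entries."""
    return -sum(p * math.log2(p) for p in pdf if p > 0)
```

For instance, removing "a" from a three-letter distribution `[0.5, 0.25, 0.25]` leaves two equally likely letters, i.e. exactly one bit of uncertainty.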

# API Usage Examples

## To start a new game:
1. Make sure you have implemented your own "guess" method.
2. Use the access_token that we sent you to create your HangmanAPI object. 
3. Start a game by calling "start_game" method.
4. If you wish to test your function without being recorded, set "practice" parameter to 1.
5. Note: You have a rate limit of 20 new games per minute. DO NOT start more than 20 new games within one minute.
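
One way to respect the limit in item 5 is to space game starts evenly. The sketch below is an assumption-laden illustration: `GameThrottle` is a hypothetical helper, not part of the provided API, and the injectable `clock` and `sleep` arguments exist only to make it testable.

```python
import time


class GameThrottle:
    """Space out calls so that starts are at least 60 / max_per_minute seconds apart."""

    def __init__(self, max_per_minute=20, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / max_per_minute
        self.clock = clock
        self.sleep = sleep
        self.last_start = None

    def wait(self):
        """Block until enough time has passed since the previous start."""
        now = self.clock()
        if self.last_start is not None:
            remaining = self.min_interval - (now - self.last_start)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last_start = now
```

Calling `throttle.wait()` immediately before each new game keeps consecutive starts at least three seconds apart at the default setting.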

In [40]:
api = HangmanAPI()

In [42]:
results=[]
for i in range(1):
    try:
        results.append(api.play_game())
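    except Exception as exc:
        # Report the failure instead of silently skipping the game
        print("Game failed:", exc)
if results:
    print(sum(results)/len(results))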

_   _   _   _   _   _   _   _   _   _   _   _   _  
Correct!
_   e   _   _   _   _   _   _   _   _   _   e   _  
_   _   _   _   _   _   _   _   _   _   _   _   _   _  
Correct!
_   _   e   _   _   _   _   _   _   _   _   _   _   _  
Correct!
_   _   e   _   _   _   u   _   _   _   u   _   _   _  
Incorrect!
_   _   e   _   _   _   u   _   _   _   u   _   _   _  
Correct!
_   v   e   _   _   _   u   _   _   _   u   _   _   _  
Correct!
_   v   e   _   _   _   u   _   _   _   u   _   _   y  
Incorrect!
_   v   e   _   _   _   u   _   _   _   u   _   _   y  
Incorrect!
_   v   e   _   _   _   u   _   _   _   u   _   _   y  
Incorrect!
_   v   e   _   _   _   u   _   _   _   u   _   _   y  
Incorrect!
_   v   e   _   _   _   u   _   _   _   u   _   _   y  
Incorrect!
_   v   e   _   _   _   u   _   _   _   u   _   _   y  
Incorrect!
_   v   e   _   _   _   u   _   _   _   u   _   _   y  
Incorrect!
_   v   e   _   _   _   u   _   _   _   u   _   _   y  
Incorrect!
_   v   e   _   _   _   

## Playing practice games:
You can use the command below to play up to 100,000 practice games.


## Playing recorded games:
Please finalize your code prior to running the cell below. Once this code executes successfully, your submission is finalized and our system will not allow you to run any additional games.

After you successfully run this block of code, subsequent runs are expected to fail with the error message "Your account has been deactivated".

Once you've run this section of the code your submission is complete. Please send us your source code via email.

In [None]:
# for i in range(1000):
#     print('Playing ', i, ' th game')
#     # Uncomment the following line to execute your final runs. Do not do this until you are satisfied with your submission
#     #api.start_game(practice=0,verbose=False)
    
#     # DO NOT REMOVE as otherwise the server may lock you out for too high frequency of requests
#     time.sleep(0.5)

## To check your game statistics
1. Call the "my_status" method.
2. It returns your total number of practice and recorded games, and the number of wins in each.

In [None]:
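[total_practice_runs,total_recorded_runs,total_recorded_successes,total_practice_successes] = api.my_status() # Get my game stats: practice/recorded runs and successes
# Guard against division by zero before any recorded game has been played
success_rate = total_recorded_successes/total_recorded_runs if total_recorded_runs else 0.0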
print('overall success rate = %.3f' % success_rate)