# Trexquant Interview Project (The Hangman Game)

* Copyright Trexquant Investment LP. All Rights Reserved. 
* Redistribution of this question without written consent from Trexquant is prohibited

## Instruction:
For this coding test, your mission is to write an algorithm that plays the game of Hangman through our API server. 

When a user plays Hangman, the server first selects a secret word at random from a list. The server then returns a row of underscores (space separated)—one for each letter in the secret word—and asks the user to guess a letter. If the user guesses a letter that is in the word, the word is redisplayed with all instances of that letter shown in the correct positions, along with any letters correctly guessed on previous turns. If the letter does not appear in the word, the user is charged with an incorrect guess. The user keeps guessing letters until either (1) the user has correctly guessed all the letters in the word
or (2) the user has made six incorrect guesses.

You are required to write a "guess" function that takes current word (with underscores) as input and returns a guess letter. You will use the API codes below to play 1,000 Hangman games. You have the opportunity to practice before you want to start recording your game results.

Your algorithm is permitted to use a training set of approximately 250,000 dictionary words. Your algorithm will be tested on an entirely disjoint set of 250,000 dictionary words. Please note that this means the words that you will ultimately be tested on do NOT appear in the dictionary that you are given. You are not permitted to use any dictionary other than the training dictionary we provided. This requirement will be strictly enforced by code review.

You are provided with a basic, working algorithm. This algorithm will match the provided masked string (e.g. a _ _ l e) to all possible words in the dictionary, tabulate the frequency of letters appearing in these possible words, and then guess the letter with the highest frequency of appearence that has not already been guessed. If there are no remaining words that match then it will default back to the character frequency distribution of the entire dictionary.

This benchmark strategy is successful approximately 18% of the time. Your task is to design an algorithm that significantly outperforms this benchmark.

In [1]:
import json
import requests
import random
import string
import secrets
import time
import re
import collections
try:
    from urllib.parse import parse_qs, urlencode, urlparse
except ImportError:
    from urlparse import parse_qs, urlparse
    from urllib import urlencode
import numpy as np
from sklearn.model_selection import train_test_split
from scipy.stats import entropy
from IPython.core.debugger import Pdb
from functools import reduce


In [4]:
class HangmanPlayer(object):
    def __init__(self):
        self.guessed_letters = []
        
        full_dictionary_location = "words_250000_train.txt"
        self.full_dictionary = self.build_dictionary(full_dictionary_location)  
        self.full_dictionary_common_letter_sorted = collections.Counter("".join(self.full_dictionary)).most_common()

        target_dictionary_location = "words_final_test.txt"
        self.target_dictionary = self.build_dictionary(target_dictionary_location)  
        
    def reset(self,length):
        """
        Resets the probability estimates to the full dictionary before a word is started        
        """
        max_length = length
        self.prob_matrix = []
        self.dictionary_i = []
        for i in range(max_length):
            # Estimate positional probabilities
            self.prob_matrix.append(self.letter_distribution_at_place(self.full_dictionary,i)) 
            # Append the dictionary for each index which is being used in the previous estimate
            self.dictionary_i.append(self.full_dictionary)
            

    def letter_distribution(self,dictionary): 
        """
        Returns the probability distribution of letters given words in a dictionary     
        
        """
        dict_string = "".join(dictionary)        
        c = collections.Counter(dict_string)
        letter_count = dict(c.most_common()) 
        for letter in string.ascii_lowercase:
            if letter in letter_count:
                letter_count[letter] = (letter_count[letter]+1)/(len(dict_string)+26)
            else:
                letter_count[letter] = 1/26
        return letter_count
    
    def unique_letter_distribution(self,dictionary):
        """
        Returns the probability distribution of unique letters ('aaa will resolve to a') for dictionary
        
        """
        remove_duplicates = lambda s: ''.join(sorted(set(s), key=s.index))
        dictionary = [remove_duplicates(word) for word in dictionary]
        return self.letter_distribution(dictionary)
        
    def letter_distribution_at_place(self,dictionary,i):
        """
        Returns the letter distribution at position i in all words
        
        """
        dictionary = [word[i] for word in dictionary if len(word) > i]
        return self.letter_distribution(dictionary)
        
    def guess(self, word): # word input example: "_ p p _ e "
        clean_word = word[::2].replace("_",".")
        len_word = len(clean_word)
        
        self.update_prob_matrix(clean_word) # Update the prob matrix
        self.prev_clean_word = clean_word 
              
        guess_letter = '!'
        
        
        # Here I sum up the prob vectors to get the expected number of hits for all letters and then
        # I choose the maximum
        prob_vector = reduce(lambda a,b : dict(collections.Counter(a)+collections.Counter(b)),self.prob_matrix[:len_word])
        prob_vector = collections.Counter(prob_vector).most_common()
        for letter,probability in prob_vector:
            if letter not in self.guessed_letters:
                guess_letter = letter
                break

        if guess_letter == '!':
            sorted_letter_count = self.full_dictionary_common_letter_sorted
            for letter,instance_count in sorted_letter_count:
                if letter not in self.guessed_letters:
                    guess_letter = letter
                    break  
                                
        return guess_letter
    
    def update_prob_matrix(self,clean_word):
        
        def regex_to_match(word,i,c):
            """
            Build a regex to match for the word for a position i and c context window
            This regex is used to filter the dictionary for the new probability estimates
            """   
            if (i == 0) or (i == len(word)-1):
                c = c+1
            start = max(i-c,0)
            end = min(i+c+1,len(word))
            regex = "."*start + word[start:end]+".*"*(end < len(word))
            return regex   
        
        char_list = list(clean_word)
        singleton_dict = dict([(letter,0) for letter in string.ascii_lowercase])
        context = 2
        for i,elem in enumerate(char_list):
            if elem != ".": # Already guessed, assign 1 probabiltiy to this letter for this index
                self.prob_matrix[i] = singleton_dict.copy()
                self.prob_matrix[i][elem] = 1
            else:
                filtered_dict = self.dictionary_i[i]
                regex = regex_to_match(clean_word,i,context) 
                prev_regex = regex_to_match(self.prev_clean_word,i,context)
                if regex != prev_regex: # Some new word has been guessed in the context window
                    filtered_dict = list(filter(lambda word: re.match(regex,word) is not None,
                                                    self.dictionary_i[i]))      
                if len(filtered_dict)<len(self.dictionary_i[i]):
                    self.prob_matrix[i] = self.letter_distribution_at_place(filtered_dict,i) # Estimate new
                    self.dictionary_i[i] = filtered_dict # Update the dictionary for this estimate
        return
    
    def build_dictionary(self, dictionary_file_location):
        text_file = open(dictionary_file_location,"r")
        full_dictionary = text_file.read().splitlines()
        text_file.close()
        return full_dictionary
                
    
    
    def start_game(self,verbose):
        self.guessed_letters = []
        
        target_word = self.target_dictionary[np.random.randint(len(self.target_dictionary))]
        #if len(target_word)<11:
        #    target_word = self.target_dictionary[np.random.randint(len(self.target_dictionary))]
        
        word = "_"*len(target_word)
        self.reset(len(word))
        self.prev_clean_word = word
        
        self.TARGET = target_word
        
        self.incorrect_guesses = []
        self.current_dictionary = self.full_dictionary
        
        if verbose:
            print("Word is ",target_word)
            print("Successfully start a new game!")
           
        tries_remains = 6
        
        while tries_remains>0:
            guess_letter = self.guess(word,tries_remains)
            self.guessed_letters.append(guess_letter)
            if verbose:
                print("Guessing letter: {0}".format(guess_letter))
            
            updated_word,correct = self.update_word(target_word,guess_letter,word)
            if correct == False:
                tries_remains = tries_remains -1 
                self.incorrect_guesses.append(guess_letter)
                if verbose:
                    print("Incorrect guess")
                    print("Tries remaining ",tries_remains)
                    
            if updated_word == target_word:
                if verbose:
                    print("Game won")
                return 1
            word = updated_word
        if verbose:
            print("Couldnt guess")
        return 0
        
    def update_word(self,target_word,guess_letter,word):
        if guess_letter in target_word:
            temp_word = ''.join(map(lambda char: char if char == guess_letter else '_', target_word))
            union = ""
            for c1, c2 in zip(temp_word, word):
                if c1 != "_":
                    union += c1
                else:
                    union += c2
            return(union,1)
        return (word,0)
        

In [5]:
np.random.seed(29)
player = HangmanPlayer()
total_won = 0
total_games = 500

winning_words = []
losing_words = []
for i in range(total_games):
    won = player.start_game(verbose=False)
    total_won = total_won + won
    if won:
        winning_words.append(player.TARGET)
    else:
        losing_words.append(player.TARGET)
    print(total_won/(i+1))
    #print(won)
    
#print(total_won/total_games)

TypeError: 'str' object cannot be interpreted as an integer

In [18]:
total_won/(i+1)

0.51340206185567

In [19]:
i

484

In [18]:
losing_words

['drinn',
 'fistnotes',
 'denizenize',
 'rattle',
 'counterstratagem',
 'wavira',
 'overdraw',
 'quodlibetic',
 'monkly',
 'deskman',
 'ungothic',
 'animalculae',
 'cucujus',
 'tophus',
 'falsen',
 'reliving',
 'hunterlike',
 'alsoon',
 'bumfuzzle',
 'steelbow',
 'epicalyxes',
 'sitcom',
 'wightness',
 'peacockish',
 'vowelless',
 'gerbilles',
 'befitted',
 'drunkenness',
 'lithuria',
 'democracy',
 'repiece',
 'tagassuidae',
 'interrepulsion',
 'stowwood',
 'syrianic',
 'enzymotic',
 'bambos',
 'unanalagous',
 'enjoiner',
 'bullshots',
 'valsaceae',
 'abjuratory',
 'sieur',
 'antoinette']

In [10]:
i

92

In [173]:
word = "asdsds"
"."*(len(word)-3) + word[len(word)-3:]

'...sds'

In [125]:
import re

string = "ello, World! Tis string contains'."
pattern = "[hyxz]"

if re.search(pattern, string):
    print("The string contains 'h', 'y', 'x', or 'z'")
else:
    print("The string does not contain 'h', 'y', 'x', or 'z'")

The string does not contain 'h', 'y', 'x', or 'z'


In [None]:

    def conditional_entropy(self):
        conditional_entropy = dict()
        for letter in string.ascii_lowercase:
            matching_dictionary = list(filter(lambda word: letter in word,self.current_dictionary))
            non_matching_dictionary = list(filter(lambda word: letter not in word,self.current_dictionary))            
            match_prob = len(matching_dictionary)/len(self.current_dictionary)
            non_match_prob = 1- match_prob
            match_letter_count = self.unique_letter_distribution(matching_dictionary)
            non_match_letter_count = self.unique_letter_distribution(non_matching_dictionary)
            match_entropy = entropy(list(match_letter_count.values()))
            non_match_entropy = entropy(list(non_match_letter_count.values()))
            conditional_entropy[letter] = match_prob*match_entropy + non_match_prob*non_match_entropy           
        return sorted(conditional_entropy.items(), key=lambda x: x[1])

In [177]:
def regex_to_match(word,i,c):
    start = max(i-c,0)
    end = min(i+c+1,len(word))
    regex = "."*start + word[start:end]+".*"*(end < len(word))
    return regex
    if i == 0:
        regex = "." + word[i+1] + ".*" 
    elif i == len(word)-1:
        regex = "."*(len(word)-2) + word[i-1:]
    else:
        regex = "."*(i-1) + word[i-1:i+2] + ".*"

    return regex



In [183]:
regex_to_match("blissful",7,2)

'.....ful'

In [288]:
incorrect_regex = "r[" + "".join(["a"]) + "]"

In [289]:
incorrect_regex

'r[a]'

In [307]:
if re.search(incorrect_regex,"e") is None:
    print("1")

1


In [399]:
import gc
gc.collect()

979