In [1]:
%pylab inline

Populating the interactive namespace from numpy and matplotlib


In [2]:
%load_ext autoreload

In [3]:
%autoreload 2

In [4]:
import seaborn as sns

In [5]:
sns.set_context('poster', font_scale=1.25)

In [6]:
import findspark as fs

In [7]:
fs.init()

In [8]:
import pyspark as ps

In [9]:
import multiprocessing as mp

In [10]:
mp.cpu_count()

12

In [11]:
config = ps.SparkConf()
config = config.setMaster('local[' + str(2*mp.cpu_count()) + ']')
config = config.setAppName('anagram_solver')

In [12]:
sc = ps.SparkContext(conf=config)

In [13]:
wlist = sc.textFile('EOWL_words.txt', minPartitions=24)

In [14]:
word_count = wlist.map(lambda x: 1)

In [15]:
word_count.sum()

128985

Ok. That is a lot of words (128,985). Calculating every permutation of each one is likely hopeless, since a word of length n has n! orderings. We probably have to do it one word at a time...not sure.
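To put a rough number on why per-word permutations look hopeless, here is a small sketch that counts orderings for a few words from the list. The helper name `permutation_count` is made up for illustration, and the count includes duplicate orderings produced by repeated letters.

```python
import math

def permutation_count(word):
    """Number of orderings itertools.permutations would yield for `word` (duplicates included)."""
    return math.factorial(len(word))

# Three words of increasing length from the head of the word list
for sample in ['aah', 'aardvark', 'aardwolves']:
    print(sample, len(sample), permutation_count(sample))
```

Even a ten-letter word already produces millions of candidate strings, before multiplying by the size of the word list.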

In [16]:
wlist.take(10)

[u'aa',
 u'aah',
 u'aal',
 u'aalii',
 u'aardvark',
 u'aardvarks',
 u'aardwolf',
 u'aardwolves',
 u'aargh',
 u'aarrghh']

Wow. We need to compute all possible anagrams of *each word* in this word list. Hardcore.

For a given word, we need to create every possible rearrangement of its letters and then keep only the rearrangements that are real words. Let's figure out how to do that.
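As a minimal sketch of that idea without Spark, the snippet below permutes one word and keeps only the rearrangements found in a word set. The function name `anagrams_by_permutation` and the five-word `sample_words` set are assumptions for illustration, not part of the notebook.

```python
import itertools

def anagrams_by_permutation(word, known_words):
    """Return the rearrangements of `word` that appear in `known_words`, excluding `word` itself."""
    # A set removes the duplicate orderings caused by repeated letters.
    candidates = {''.join(p) for p in itertools.permutations(word)}
    return sorted((candidates & known_words) - {word})

# Hypothetical mini word list, for illustration only
sample_words = {'aa', 'aah', 'aha', 'aal', 'ala'}
print(anagrams_by_permutation('aah', sample_words))
```

This is fine for short words, but the candidate set grows factorially with word length.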

In [17]:
import itertools

In [18]:
def get_list_of_all_combos(input_str):
    list_of_permutations = list(itertools.permutations(input_str))
    strings_to_test = [''.join(l) for l in list_of_permutations]
    parallel_strings = sc.parallelize(strings_to_test)
    return parallel_strings.intersection(wlist)

In [19]:
def get_list_of_all_combos(input_str):
    list_of_permutations = list(itertools.permutations(input_str))
    strings_to_test = [(input_str, ''.join(l)) for l in list_of_permutations]
    return strings_to_test

Ok. One more naive implementation...

In [20]:
local_wlist = wlist.collect()

In [21]:
broadcast_wlist = sc.broadcast(local_wlist)

In [22]:
def get_list_of_all_combos(input_str):
    list_of_permutations = list(itertools.permutations(input_str))
    # Compare plain strings: (input_str, permutation) tuples would never match the words in the list
    strings_to_test = [''.join(l) for l in list_of_permutations]
    matches = set(strings_to_test).intersection(broadcast_wlist.value)
    matches = list(matches)
    return (input_str, matches)

In [23]:
word_permutation_rdd = wlist.map(get_list_of_all_combos)

Ok, this technique is no good. We simply cannot comb through all n! permutations of every word. Let us instead check whether another word has the same length as the desired word and contains the same letters. That's an easier way to do it.

In [24]:
'soij' in 'asldkjwlekrjsoaweralwkerjawerij'

False

In [25]:
want = 'soijz'
waffle = 'asldkjwlekrjsoaweralwkerjawerij'

In [26]:
all([i in waffle for i in want])

False

Yes. This is what we want.
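One caveat, sketched below: the membership test ignores how often each letter occurs, so on its own it cannot tell an anagram from a word that merely shares letters. Comparing letter counts covers both multiplicity and length. `collections.Counter` is a standard library tool swapped in here for brevity, and the names `shares_letters` and `is_anagram` are hypothetical.

```python
from collections import Counter

def shares_letters(want, candidate):
    """Membership-only test: every letter of `want` occurs somewhere in `candidate`."""
    return all(ch in candidate for ch in want)

def is_anagram(want, candidate):
    """Count-based test: both strings use the same letters the same number of times."""
    return Counter(want) == Counter(candidate)

print(shares_letters('aah', 'aha'), is_anagram('aah', 'aha'))
print(shares_letters('aah', 'hah'), is_anagram('aah', 'hah'))
```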

In [36]:
waffle.count('a')

4

In [37]:
def get_anagrams(input_str):
    anagrams = []
    for cur_dict_word in broadcast_wlist.value:
        cond1 = all([input_str.count(i) == cur_dict_word.count(i) for i in input_str]) 
        cond2 = len(cur_dict_word) == len(input_str)
        cond3 = cur_dict_word != input_str
        if all([cond1, cond2, cond3]):
            anagrams.append(cur_dict_word)
    return (input_str, anagrams)

In [38]:
anagram_rdd = wlist.map(get_anagrams)

In [39]:
anagram_rdd.take(10)

[(u'aa', []),
 (u'aah', [u'aha']),
 (u'aal', [u'ala']),
 (u'aalii', []),
 (u'aardvark', []),
 (u'aardvarks', []),
 (u'aardwolf', []),
 (u'aardwolves', []),
 (u'aargh', []),
 (u'aarrghh', [])]

I don't think this is working unfortunately.
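One way to cross-check results like the ones above is to group words by their sorted letters, so that anagrams share a key. This is a different technique from the nested scan in `get_anagrams`, sketched here in plain Python on a hypothetical five-word sample rather than the full list. The function names are made up for illustration.

```python
from collections import defaultdict

def group_by_signature(words):
    """Map each sorted-letter signature to the words that share it."""
    groups = defaultdict(list)
    for word in words:
        groups[''.join(sorted(word))].append(word)
    return groups

def anagrams_from_groups(words):
    """Return (word, other words with the same signature) pairs, in input order."""
    groups = group_by_signature(words)
    return [(w, [o for o in groups[''.join(sorted(w))] if o != w]) for w in words]

# Hypothetical mini word list, for illustration only
sample_words = ['aa', 'aah', 'aal', 'aha', 'ala']
print(anagrams_from_groups(sample_words))
```

The grouping makes a single pass over the list instead of comparing every word against every other word.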

In [68]:
word_permutation_rdd.take(5)

Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transforamtion. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.

Ok...the permutation RDD actually fails here. Not enough memory...interesting. Yeah. No matter how I do this, there is not enough memory. Nasty!

In [36]:
word_permutation_rdd.take(10)

[(u'aa', u'aa'),
 (u'aa', u'aa'),
 (u'aah', u'aah'),
 (u'aah', u'aha'),
 (u'aah', u'aah'),
 (u'aah', u'aha'),
 (u'aah', u'haa'),
 (u'aah', u'haa'),
 (u'aal', u'aal'),
 (u'aal', u'ala')]

Good. This actually looks reasonable. We now need to check whether the second element of each pair is in the original word list. I'm not sure how to do this. Probably some sort of shuffle. Ah, I think I get it: we have to key both sides by the word, using the original word list as the key, and combine on matching keys...
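The keyed combine described above can be sketched without Spark. The `inner_join` helper below is a plain-Python stand-in for a join on keys, not PySpark's implementation. The sample pairs mirror the (word, permutation) output shown above, and the four-word `words` list is a hypothetical stand-in for the real word list.

```python
def inner_join(left_pairs, right_pairs):
    """Plain-Python stand-in for a keyed inner join: returns (key, (left_value, right_value))."""
    right_index = {}
    for key, value in right_pairs:
        right_index.setdefault(key, []).append(value)
    return [(key, (lv, rv)) for key, lv in left_pairs for rv in right_index.get(key, [])]

# (word, permutation) pairs, as in the output above
word_permutations = [('aah', 'aah'), ('aah', 'aha'), ('aah', 'haa'), ('aal', 'ala')]
# Hypothetical mini word list, for illustration only
words = ['aah', 'aha', 'aal', 'ala']

# Key the permutations by the candidate string and the word list by the word itself.
by_candidate = [(perm, word) for word, perm in word_permutations]
keyed_words = [(w, None) for w in words]

# Keep candidates that are real words and differ from the word they came from.
matches = [(orig, cand) for cand, (orig, _) in inner_join(by_candidate, keyed_words) if cand != orig]
print(matches)
```

Candidates such as 'haa' drop out because no key in the word list matches them.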

In [None]:
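# join needs (key, value) pairs on both sides, so key the permutations and the word list by candidate word
word_permutation_rdd.map(lambda kv: (kv[1], kv[0])).join(wlist.map(lambda w: (w, None))).take(10)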