# Anagrams in the English dictionary

Two words are anagrams of each other when their letters can be rearranged to turn one word into the other. For instance, **stop** can be anagrammed into **post**, **spot**, **tops**, **pots**, and **opts**. 

I'll use this simple strategy: I will define the **signature** of a word as its letters sorted alphabetically, duplicates included. So the signature of **python** would be **hnopty**. Two words are anagrams of each other if they have the same signature.

signature('python') == 'hnopty'
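As a minimal sketch, the signature could be computed by sorting the characters of the word and joining them back into a string. The helper name `signature` follows the notation above, but this particular body is an assumption rather than the final implementation:

```python
def signature(word):
    """Return the letters of word in sorted order, duplicates included."""
    return ''.join(sorted(word))

print(signature('python'))                     # hnopty
print(signature('stop') == signature('post'))  # True
```

Joining into a string rather than keeping the sorted list matters here, because strings are hashable and can therefore serve as dict keys.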

Thus, I'm going to make a Python dict of all the words in a dictionary indexed by the signature. Looking up if a word has an anagram would then be as simple as computing its signature and looking it up in the dict.

anagrams_by_signature = {'opst': {'stop', 'post', 'spot', 'tops', 'pots', 'opts'}, ...}
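To make the idea concrete, here is a small sketch of how such a dict might be built, assuming a `collections.defaultdict` of sets and a tiny hand-written word list (`sample_words` is a hypothetical stand-in for the real dictionary file):

```python
import collections

def signature(word):
    """Return the letters of word in sorted order, duplicates included."""
    return ''.join(sorted(word))

# a tiny stand-in word list; the real one comes from the dictionary file
sample_words = ['post', 'spot', 'tops', 'pots', 'opts', 'stop', 'python']

# each signature maps to the set of words that share it
anagrams_by_signature = collections.defaultdict(set)
for word in sample_words:
    anagrams_by_signature[signature(word)].add(word)

print(sorted(anagrams_by_signature[signature('stop')]))
# ['opts', 'post', 'pots', 'spot', 'stop', 'tops']
```

Note that the keys are signatures such as `'opst'`, not the words themselves, so a lookup always goes through `signature` first.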

Let's begin!

## Implementation
We begin by loading a dictionary from a file. The repository contains the 1934 English dictionary that is distributed with many Unix systems.

In [4]:
# importing the standard set of Python modules
import math
import collections
import dataclasses
import datetime

import numpy as np
import pandas as pd
import matplotlib.pyplot as pp

In Python, idioms are code constructs that have become the preferred way to achieve a certain goal. One example is looping over all the lines of a text file by iterating directly over the open file. In this case we will store the words in a list.

In [6]:
words = []
for line in open('words.txt'):
    words.append(line)

We get more than 200 000 words.

In [7]:
len(words)

235886

I do see two problems, though: every word ends with a newline character, and some words are capitalized, which will interfere with our signature scheme. We can fix both issues using Python string methods.

In [3]:
words[:10]

['A\n',
 'a\n',
 'aa\n',
 'aal\n',
 'aalii\n',
 'aam\n',
 'Aani\n',
 'aardvark\n',
 'aardwolf\n',
 'Aaron\n']
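Before changing the loop, it may help to see the two string methods on a single entry. This is only an illustration on one hard-coded string (`raw` and `cleaned` are hypothetical names): `str.strip` removes leading and trailing whitespace, including the newline, and `str.lower` converts the result to lowercase.

```python
raw = 'Aaron\n'

# strip() drops the trailing newline, lower() removes the capitalization
cleaned = raw.strip().lower()

print(repr(raw))      # 'Aaron\n'
print(repr(cleaned))  # 'aaron'
```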

We can refactor our code to address those issues.

In [8]:
words = []
for line in open('words.txt'):
    words.append(line.strip().lower())

In [9]:
words[:10]

['a',
 'a',
 'aa',
 'aal',
 'aalii',
 'aam',
 'aani',
 'aardvark',
 'aardwolf',
 'aaron']

Problem solved! Ah, I do see a duplicate, which comes from the **A** appearing both in uppercase and lowercase. One way to get rid of that is to build not a list, but a set.

In [10]:
words = set()
for line in open('words.txt'):
    words.add(line.strip().lower())
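The effect of switching from a list to a set can be checked on a handful of lines. This sketch uses a small hard-coded list (`lines`, `word_list`, and `word_set` are hypothetical names) instead of the real file:

```python
# a few raw lines in the same shape as those read from the file
lines = ['A\n', 'a\n', 'aa\n', 'Aani\n']

word_list = []
word_set = set()
for line in lines:
    cleaned = line.strip().lower()
    word_list.append(cleaned)  # the list keeps the duplicate 'a'
    word_set.add(cleaned)      # the set silently ignores it

print(word_list)                      # ['a', 'a', 'aa', 'aani']
print(len(word_list), len(word_set))  # 4 3
```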

Given that the body of the loop is just one line, we can do it more idiomatically with a comprehension.

In [11]:
words = {line.strip().lower() for line in open('words.txt')}
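A set comprehension also combines well with a `with` block, which closes the file as soon as the set has been built. The sketch below is one possible variant, not the notebook's own cell: it writes a tiny temporary file (a hypothetical stand-in for `words.txt`) so that it runs on its own.

```python
import os
import tempfile

# write a tiny stand-in for words.txt so the sketch is self-contained
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('A\na\naa\nAani\n')
    path = tmp.name

# the with block closes the file once the comprehension has consumed it
with open(path) as f:
    sample_words = {line.strip().lower() for line in f}

os.remove(path)  # clean up the temporary file

print(sorted(sample_words))  # ['a', 'aa', 'aani']
```

The comprehension evaluates the expression before `for` once per line and collects the results, so there is no need to call `add` inside it.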

In [17]:
# sets are unordered, so these are 10 arbitrary words
first_10_elements = list(words)[:10]
first_10_elements

['inconcinnately',
 'lohar',
 'slavonism',
 'davach',
 'directer',
 'constitutional',
 'shedhand',
 'semidouble',
 'heteromorphite',
 'puerperalism']
 