# Lipogram Investigation

Well, I knew [@Lipogrammatical](https://twitter.com/Lipogrammatical) was built around some kind of writing constraint, and the reference to [Anton Voyl](https://fr.wikipedia.org/wiki/La_Disparition_(roman%29) was a clue that the missing letter might be the most common one in the English language, but could I prove it?
<img src="perec.jpg">

## Getting the data
My first thought was to pull stuff from Twitter's API, but they seem a little edgy about state-sponsored agents of chaos these days, and the API is more complex than what I was looking for. I came up with a hacky solution: using [a service](https://www.allmytweets.net/connect/) to pull the last 969 tweets (a month's worth, maybe?), and then copying the raw HTML into a text file. (It requires Twitter authentication, so I wasn't sure that `requests` would do the trick.) 

In [1]:
with open('lipograms.txt', 'r', encoding='utf-8') as f:
    text = f.read()

In [2]:
import re

In [3]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(text, 'html.parser')

In [66]:
tweets = soup.find_all('li')

In [67]:
len(tweets)

969

Strip the link, span, and image tags, whose attributes may contain misleading letters (the text inside links and spans is kept):

In [68]:
# unwrap() removes a tag but keeps whatever text it contains
for tag in soup.find_all(['a', 'span']):
    tag.unwrap()
# images have no text of their own, so drop them entirely
for img in soup.find_all('img'):
    img.decompose()
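
For anyone without Beautiful Soup installed, here is a rough standard-library sketch of the same idea: keep the text of each `<li>` and ignore the tags nested inside it. It swaps in `html.parser` for Beautiful Soup, and the `ListItemText` class and the sample HTML are made up for illustration, not taken from the real page.

```python
from html.parser import HTMLParser


class ListItemText(HTMLParser):
    """Collect the visible text of each <li>, ignoring tags nested inside it."""

    def __init__(self):
        super().__init__()
        self.items = []      # text of every finished <li>
        self._parts = None   # chunks of the <li> being read, None outside one

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._parts = []

    def handle_endtag(self, tag):
        if tag == "li" and self._parts is not None:
            self.items.append("".join(self._parts))
            self._parts = None

    def handle_data(self, data):
        # Only text nodes arrive here; attributes such as alt="e" never do.
        if self._parts is not None:
            self._parts.append(data)


# Hypothetical markup, loosely shaped like the list items in this post.
sample = (
    '<ul><li>Womp, <span>womp.</span> <img src="x.png" alt="e"> Jun 20, 2018 </li>'
    '<li>A <a href="https://example.com">link</a> Jun 21, 2018 </li></ul>'
)
parser = ListItemText()
parser.feed(sample)
parser.close()
print(parser.items)
```

The point is the same as in the cell above: letters hiding in attributes (an `alt`, an `href`) should not count against the author, while text the reader actually sees should.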

In [4]:
tweets = soup.find_all('li')

In [70]:
tweets[:5]

[<li>How-To
 
 If you got hiv, say aids.  If you a girl,
 say you with child—nobody gonna stoop down
 to try and auscult a k… https://t.co/Whh4uzGFYv  Aug 02, 2018 </li>,
 <li>I RILLY DON'T CAIR DO U?  Jun 21, 2018 </li>,
 <li>Womp, womp.  Jun 20, 2018 </li>,
 <li>"I can find a lot of formulations to say what I want, and, plus, do it with an assonantial music almost total, with… https://t.co/a96r3KtAJV  Jun 06, 2018 </li>,
 <li>Philip Roth's Human Stain: (Jim) Crow=Kafka (v. Faunia's amazing fantasia) just as Nathan is and is not Roth and Silk is and isn't too.  Jun 02, 2018 </li>]

Remove dates (Feb, Sep...), URLs within tweets (which survive as plain text once the tags are stripped), and line breaks and extra spaces, just to make it look nice.

In [62]:
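tweets = [tweet.text for tweet in tweets]
# Drop the trailing date, e.g. "Aug 02, 2018 " (13 characters)
no_date = [tweet[:-13] for tweet in tweets]
# Raw string so \w reaches the regex engine intact; the dot in t.co is escaped
no_url = [re.sub(r'https?://t\.co/\w+ ', '', i) for i in no_date]
no_new_line = [re.sub('\n', ' ', i) for i in no_url]
# Collapse runs of spaces into one
final = [re.sub(' +', ' ', i) for i in no_new_line]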
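
The fixed 13-character slice works only as long as every item ends with a date in exactly that shape. A slightly more defensive sketch matches the date with a regex instead. The patterns are my guess at the format based on the sample output above, and `clean_tweet` is a hypothetical helper, not part of the original notebook. Unlike the cell above, it also trims leading and trailing spaces.

```python
import re

# Assumed formats, inferred from the sample output above (not from any docs).
DATE_RE = re.compile(r"\s*[A-Z][a-z]{2} \d{2}, \d{4}\s*$")  # e.g. " Aug 02, 2018 "
URL_RE = re.compile(r"https?://t\.co/\w+")


def clean_tweet(raw):
    """Return the tweet without its trailing date, t.co links, or extra whitespace."""
    text = DATE_RE.sub("", raw)    # strip the date only if it is really there
    text = URL_RE.sub("", text)    # remove shortened links anywhere in the text
    return re.sub(r"\s+", " ", text).strip()  # newlines and space runs become one space


# Shortened, made-up variants of the tweets shown earlier.
samples = [
    "Womp, womp.  Jun 20, 2018 ",
    "How-To\n\nIf you got hiv, say aids. https://t.co/Whh4uzGFYv  Aug 02, 2018 ",
]
cleaned = [clean_tweet(s) for s in samples]
print(cleaned)
```

If an item ever arrives without a date, the slice would silently eat the last 13 characters of the tweet, whereas the regex leaves it alone.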

In [77]:
joined = ','.join(final).lower()

In [78]:
from collections import Counter

In [84]:
alphabet = 'a b c d e f g h i j k l m n o p q r s t u v w x y z'.split(" ")

In [79]:
counts=Counter(joined)

In [85]:
for letter in sorted(counts, key=counts.get, reverse=True):
    if letter in alphabet:
        print(letter, counts[letter])

o 8493
a 8084
i 7892
t 7149
n 6307
s 5408
l 3875
r 3695
h 3400
u 3044
g 2818
d 2721
c 2463
m 2380
w 2248
y 2141
f 1804
p 1749
b 1453
k 1060
v 358
j 356
z 150
x 133
q 125
e 76


Well, we have very few e's, but still more than zero! How about that!
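
To put "very few" in proportion, here is a small sketch that turns the counts printed above into a share of all letters. The numbers are copied from that output; the variable names are mine. For comparison, e is usually quoted at somewhere around 12 to 13 percent of the letters in ordinary English prose, a ballpark figure I have not measured here.

```python
# Letter counts copied from the output above.
counts = {
    "o": 8493, "a": 8084, "i": 7892, "t": 7149, "n": 6307, "s": 5408,
    "l": 3875, "r": 3695, "h": 3400, "u": 3044, "g": 2818, "d": 2721,
    "c": 2463, "m": 2380, "w": 2248, "y": 2141, "f": 1804, "p": 1749,
    "b": 1453, "k": 1060, "v": 358, "j": 356, "z": 150, "x": 133,
    "q": 125, "e": 76,
}

total = sum(counts.values())            # all letters a-z in the sample
e_share = counts["e"] / total           # fraction of letters that are e
rarest = min(counts, key=counts.get)    # least frequent letter

print(f"{counts['e']} of {total} letters are e ({e_share:.3%}); rarest letter: {rarest}")
```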

In [89]:
with open('joined.txt', 'w', encoding='utf-8') as f:
    f.write(joined)

I created an e-dictionary (let's call it an `edict`) that maps every tweet to its number of e's.

In [97]:
# re.findall('e', ...) is case-sensitive, so only lowercase e's are counted here
edict = {tweet: len(re.findall('e', tweet)) for tweet in final}

In [98]:
for tweet in sorted(edict, key=edict.get, reverse=True):
    if edict[tweet] != 0:
        print(tweet, edict[tweet])

Eek! Reemergence. events elsewhere ended, embers excepted. Yes: depressed.  23
Well, this one is straightforward: attack was across the river from us, two blocks from the dojo we were looking at for J yesterday. #Sirens  11
It was a dark and stormy night. At which point killing upon killing began. #HeyICanDoItToo H/T Scott Martin  2
Paris January 4: Riparian SOBBING willows. 1.5 hour wait (we didn't) for MdL - gratuit aujourd'hui. To grim exposition sur la collaboration.  2
Occupy MLA? I would but I'm lazy, & anyhow I got mine, Jack! Let adjuncts & junior faculty do it - I find it too much work. #omla @occupymla  2
.@sravana Hilariously witty of you. I hadn't thought to peg 12 to 12. So no: this noonday son is - gasp! - what would be a military 4:00 pm.  2
I got at last to what you’d call a croup or Boundary of that terrifying hollow, A bank of rocks which from a mounta…  1
Dylan Thomas, bumped up a notch. Too long to avoid a photograph.  1
"Uncouth matutinal jocularity"--S. Sassoon, "

Hmm. It seems the rules are flexible...
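
A footnote on the counting itself: the `edict` pattern matches lowercase `e` only, so as far as I can tell the capital E in "Eek!" is not part of that tweet's 23. Accented letters like é would slip through too, which matters for an account that sometimes tweets in French. Below is a hedged sketch of a stricter counter. `count_e` is a hypothetical helper, and treating é or È as an e is my assumption about what a lipogram should forbid.

```python
import unicodedata


def count_e(text):
    """Count every e, including capitals and accented forms such as é or È."""
    # NFD splits "é" into a plain "e" plus a combining accent,
    # so the base letter can be compared directly.
    decomposed = unicodedata.normalize("NFD", text)
    return sum(1 for ch in decomposed if ch.lower() == "e")


sample = "Eek! Reemergence."
print(sample.count("e"), count_e(sample))   # lowercase-only vs. stricter count
print(count_e("été"))                        # accented e's are counted as well
```

Running this over `final` instead of the regex would only ever raise the per-tweet numbers, so the ranking above is a lower bound on how many e's slipped in.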