# Introduction
I like the New York Times (NYT) crossword puzzle. I try to do it daily when I can, and as often as possible I avoid using the "check puzzle" hint when I get frustrated. One of the things I find interesting about the crossword is that it's edited. Crossword curation seems like a subtle thing, particularly as the NYT adheres to a difficulty schedule. Mondays are the easiest, with difficulty increasing through Friday until (approximately) maxing out on Saturday. Sunday crosswords are then larger, but the difficulty is reduced to a Wednesday-to-Friday level to make the solve more pleasant. Assessing this difficulty, and in some sense defining what qualifies as a "difficult" crossword, is very much a learned skill rather than an expressible protocol. With that said, some broad guidelines have emerged in the modern era of crossword making.

Crossword makers have a variety of tools that they can use to control the difficulty of a puzzle. Different layouts of black squares, for example, may restrict the flow of information in a puzzle. If the upper left block of a puzzle is closed off except for a single word, solving that block exactly reveals little information about the rest of the puzzle.
FIG idea: dictionary entropy of a tight crossword puzzle with a single quadrant filled out

Furthermore, different cluings produce different difficulty levels. Writers can use puns, atypical synonyms, or sometimes trivia to clue the same phrase. The word SANDWICH could be indicated by "beach sorceress' lunch?", "they're pressed for lunch", or "Earl's culinary creation". 

The interplay of these two factors, a puzzle's layout and its cluing, controls a crossword's overall *feel*: how frustrating or satisfying it can be. I was curious to dig into these concepts more, so I decided to do a data project around them.

# Getting Data
It turns out that actually getting crossword data is a little hard. Up until a few years ago, some databases of crossword clues and solutions were maintained by single individuals, but as of 2020 those no longer seem to be available. This is speculation, but it seems like the NYT doesn't want their crossword data publicly available. I'm going to respect that and not include the dataset I've created in this GitHub repo. However, I am going to show you how I got it. Importantly, this method requires a subscription to the NYT crossword puzzle service. The subscription includes access to the bulk of the Shortz-era archive and unlimited downloads of `.puz` crossword files, which I have simply automated.

A friend of mine once remarked that any website is an API if you think about it, and this is absolutely true of the NYT crossword portal:
<img src="imgs/nyt_xword_scrot1.png" width="400"/>

Notice up in the right-hand corner there's a download button?
![image](imgs/nyt_xword_scrot2.png)

This button is actually a link which directs you to a URL of the form `https://www.nytimes.com/svc/crosswords/v2/puzzle/XXXXX.puz`, where `XXXXX` is some number. If you follow the link to this endpoint (and you're logged in to your NYT crossword subscription), it will hand over a `.puz` file containing a crossword. At the time of writing I can't quite figure out the relationship between the ID number `XXXXX` and the returned puzzle. It doesn't increase with the puzzle date, so maybe it's a hash of the puzzle data or something.

What I could determine, by just crudely increasing the number to unreasonable values, is that the IDs seem to have an upper bound. As far as I can tell, there are no puzzles with ID numbers larger than 30000 (although not every number less than 30000 has been assigned a puzzle). Given all this, scraping the puzzles seems easy enough. Just send a GET request to incrementally increasing endpoints until I hit 30000, and then stop.
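
The enumeration described above can be sketched in Go roughly as follows. This is a minimal sketch, not the program in `code/go`: the names `puzzleURL`, `scrape`, and `fetchHTTP` are hypothetical, starting at ID 1 is an assumption, and so is treating any non-200 response as "no puzzle for this ID". Authentication is left out, so as written the requests would not be served puzzles without a logged-in session.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"path/filepath"
)

// urlPattern follows the endpoint form described in the text.
const urlPattern = "https://www.nytimes.com/svc/crosswords/v2/puzzle/%d.puz"

// maxID is the empirical upper bound on puzzle IDs mentioned in the text.
const maxID = 30000

// puzzleURL builds the download URL for a given puzzle ID.
func puzzleURL(id int) string {
	return fmt.Sprintf(urlPattern, id)
}

// scrape walks every ID in [first, last]. It calls fetch for each ID and
// save for each ID that actually has a puzzle, and returns how many were saved.
func scrape(
	first, last int,
	fetch func(id int) (data []byte, found bool, err error),
	save func(id int, data []byte) error,
) (int, error) {
	saved := 0
	for id := first; id <= last; id++ {
		data, found, err := fetch(id)
		if err != nil {
			return saved, fmt.Errorf("fetching puzzle %d: %w", id, err)
		}
		if !found {
			continue // not every ID has been assigned a puzzle
		}
		if err := save(id, data); err != nil {
			return saved, fmt.Errorf("saving puzzle %d: %w", id, err)
		}
		saved++
	}
	return saved, nil
}

// fetchHTTP sends a plain GET request for one puzzle ID.
// Assumption: any status other than 200 means "no puzzle here".
func fetchHTTP(id int) ([]byte, bool, error) {
	resp, err := http.Get(puzzleURL(id))
	if err != nil {
		return nil, false, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, false, nil
	}
	data, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, false, err
	}
	return data, true, nil
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: scrape <output-dir>")
		return
	}
	outDir := os.Args[1]
	// Each puzzle is written to <output-dir>/<id>.puz.
	save := func(id int, data []byte) error {
		name := filepath.Join(outDir, fmt.Sprintf("%d.puz", id))
		return os.WriteFile(name, data, 0o644)
	}
	n, err := scrape(1, maxID, fetchHTTP, save)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("saved %d puzzles\n", n)
}
```

Passing `fetch` and `save` in as functions keeps the loop itself independent of the network and the filesystem, which makes it easy to try out on a small ID range first.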

The difficulty here for me was actually getting access to these endpoints. As I mentioned, these will only respond to GET after you've logged in with your NYT crossword subscription, and that login process includes a Google captcha. To get around this I logged in using Firefox, then raided Firefox's cookie database for everything which corresponded to the NYT website. By including all of those cookies with my automated GET requests I was able to successfully tap the endpoints. I implemented this process in a little Go program, which you can find in the `code/go` directory.
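
Attaching browser cookies to a request might look like the sketch below. It assumes the cookie names and values have already been copied out of Firefox by some other means; the `browserCookie` type, the `newPuzzleRequest` helper, and the cookie names shown are placeholders, not the real NYT cookie names or the code in `code/go`.

```go
package main

import (
	"fmt"
	"net/http"
	"os"
)

// browserCookie is a hypothetical name/value pair copied from the browser's
// cookie store for the NYT website.
type browserCookie struct {
	Name  string
	Value string
}

// newPuzzleRequest builds a GET request for url and attaches every cookie,
// so the server sees the same session the browser is logged in with.
func newPuzzleRequest(url string, cookies []browserCookie) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	for _, c := range cookies {
		req.AddCookie(&http.Cookie{Name: c.Name, Value: c.Value})
	}
	return req, nil
}

func main() {
	// Placeholder cookies; real names and values come from the browser.
	cookies := []browserCookie{
		{Name: "example-session", Value: "abc123"},
		{Name: "example-token", Value: "xyz789"},
	}
	req, err := newPuzzleRequest(
		"https://www.nytimes.com/svc/crosswords/v2/puzzle/XXXXX.puz", cookies)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// Show the Cookie header that would be sent; no network call is made here.
	fmt.Println(req.Header.Get("Cookie"))
}
```

The resulting request can then be sent with `http.DefaultClient.Do(req)`.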

# What Does That Get You?
Of the 30,000 ID numbers tried, just over 10,200 of them returned a puzzle.
