# Introduction
I like the New York Times (NYT) crossword puzzle. I got started with it as a way to divert my attention from social media, but after a year or two of pretty regular puzzling, I've started to grow an appreciation for the crossword as a craft (at least, the NYT crossword). Crossword constructors can control the puzzle-solver's experience in a number of subtle ways, both in the choices they make for an individual clue (eg. whether it's phrased misleadingly or not) and for the puzzle as a whole (eg. the distribution of easy clues, the placement of black squares). The mixture of these factors can have a serious impact on how a puzzle "feels" to solve. A witty or punny clue that you spend 10 minutes trying to work out gives you a little reward at the end, whereas the intersection of two obscure trivia clues can be an immensely frustrating experience (referred to as a [natick](https://blog.puzzlenation.com/2014/08/01/its-follow-up-friday-that-has-a-name-edition/)).

Understanding these principles and using them to curate crossword puzzles is therefore quite important, all the more so because the NYT adheres to a difficulty schedule. Mondays are the easiest, with difficulty increasing through to Saturday, where it (approximately) maxes out. Sunday crosswords are physically larger (21x21 squares, instead of the usual 15x15), but the difficulty is reduced to roughly a Wednesday-to-Friday level to make the solve more pleasant.

I'd like to know more about crossword difficulty. Partly because I'd like to start making my own crosswords, but also because I'm running out of ways to procrastinate writing my thesis. Since I'm pretty handy with data analytics tools (and looking for a job 😉), I might as well see what I can pull out of some crossword data.

# Getting Data
It turns out that actually getting crossword data is a little hard. Up until a few years ago some databases of crossword clues and solutions were maintained by single individuals, but it seems like as of 2020 those are no longer available. This is speculation, but my guess would be that the NYT doesn't want their crossword data publicly available after a [plagiarism scandal](https://fivethirtyeight.com/features/a-plagiarism-scandal-is-unfolding-in-the-crossword-world/) rocked the crossword puzzling world. I'm going to respect that and not include the dataset I've created in the GitHub repo which accompanies the project. However, I am going to show you how I got it. **Important: this method requires a subscription to the online NYT crossword puzzle service.** This service includes the bulk of the Shortz era on archive, and unlimited downloads of `.puz` crossword files. As far as I am aware simply automating this procedure doesn't violate any terms of service, but use it at your own peril.
 
A friend of mine told me that any website is an API if you think about it, and this is absolutely true of the NYT crossword portal:
<img src="imgs/nyt_xword_scrot1.png" width="400"/>

Notice up in the right hand corner there's a download button? 
![image](imgs/nyt_xword_scrot2.png)

This button is actually a link which directs you to a URL of the form `https://www.nytimes.com/svc/crosswords/v2/puzzle/XXXXX.puz`, where `XXXXX` is some crossword ID number. If you direct your internet browser to this URL (and you're logged in to your NYT crossword subscription) it will hand over a `.puz` file containing a crossword. At time of writing I can't quite figure out the relationship between the ID number `XXXXX` and the returned puzzle. It doesn't increase with the puzzle date, so maybe it's a hash of the puzzle data or something.

Let's take a closer look at the transaction which actually hands over the `.puz` file, starting from when you click the login button on the NYT crossword website:
0. The NYT website uses some JavaScript (I believe) to prompt you for your login info.
1. You enter the login info and complete a Captcha challenge, and your browser then sends your NYT Crossword username and password to the server hosting the crossword app.
2. The server sends you back some authorization tokens (cookies) which certify that you are who you say you are, so that you don't have to log in again.
3. You direct your browser to the URL `https://www.nytimes.com/svc/crosswords/v2/puzzle/XXXXX.puz`.
4. Your browser bundles the relevant NYT authorization cookies with a `GET` request to the server at the above URL.
5. If the NYT cookies are valid and have not expired, the server will then send back the file `XXXXX.puz`.

Automating steps 3-5 is fairly simple. Most popular languages (Python, R, JavaScript) have functionality (either built in or via a package) to send `GET` requests. You could even use command-line tools like `curl` or `wget` wrapped in a little Bash script to increment the puzzle's ID number.
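As a rough illustration, steps 3-5 can be sketched in Python using only the standard library. This is not the Go scraper from the repo: the function names are made up for the example, and the `cookies` dictionary is a placeholder for whatever cookies a logged-in NYT session actually holds.

```python
import urllib.request

BASE_URL = "https://www.nytimes.com/svc/crosswords/v2/puzzle/{}.puz"


def puzzle_url(puzzle_id):
    """Build the download URL for a given puzzle ID (step 3)."""
    return BASE_URL.format(puzzle_id)


def build_request(puzzle_id, cookies):
    """Bundle the session cookies with a GET request (step 4).

    `cookies` maps cookie name -> value. The real cookie names are not
    listed here; they have to come from a logged-in browser session.
    """
    cookie_header = "; ".join(f"{name}={value}" for name, value in cookies.items())
    return urllib.request.Request(
        puzzle_url(puzzle_id),
        headers={"Cookie": cookie_header},
        method="GET",
    )


def fetch_puzzle(puzzle_id, cookies, timeout=30):
    """Send the request and return the raw .puz bytes (step 5).

    urlopen raises urllib.error.HTTPError for 4xx/5xx responses, so the
    caller can catch that to skip IDs the server refuses to hand over.
    """
    request = build_request(puzzle_id, cookies)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

Looping `fetch_puzzle` over a range of ID numbers, and catching `HTTPError` for the IDs that don't return a puzzle, would cover the "increment the ID" part.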

The difficulty is really in steps 1 and 2, and specifically in getting past the Captcha. As you might expect, automating Captcha challenges is quite difficult, so we don't have much hope of handling it within our scraper script. Instead, what I ended up doing was simply logging in using the Firefox browser, and then raiding Firefox's cookie database for everything which corresponded to the NYT website. By including all of those cookies with my automated `GET` requests I was able to successfully tap the endpoints. Because I'm trying to learn it, I implemented this process using Go, and you can find source code for that in the `code/go` directory.
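The cookie raid can be sketched in Python as well (the actual implementation lives in the Go code). This sketch assumes the cookie store is the `cookies.sqlite` file in your Firefox profile directory, with a `moz_cookies` table holding `name`, `value`, and `host` columns, which is how recent Firefox versions lay it out as far as I know. The function name and the `host_suffix` parameter are made up for the example.

```python
import shutil
import sqlite3
import tempfile
from pathlib import Path


def load_cookies(cookie_db, host_suffix="nytimes.com"):
    """Return {name: value} for every cookie whose host ends with host_suffix.

    The query runs against a temporary copy of the database, since Firefox
    may hold a lock on the live file while the browser is open.
    """
    with tempfile.TemporaryDirectory() as tmp:
        copy_path = Path(tmp) / "cookies.sqlite"
        shutil.copy(cookie_db, copy_path)
        conn = sqlite3.connect(copy_path)
        try:
            rows = conn.execute(
                "SELECT name, value FROM moz_cookies WHERE host LIKE ?",
                ("%" + host_suffix,),
            ).fetchall()
        finally:
            conn.close()
    # If the same cookie name appears for several hosts, the last row wins.
    return dict(rows)
```

The resulting dictionary can be joined into a single `Cookie` header (`name=value` pairs separated by `; `) and attached to each automated `GET` request.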


# What Does That Get You?
I tried 30,000 ID numbers, and just over 10,200 of them returned a puzzle. This is approximately consistent with the number of puzzles published in the Will Shortz era (365 puzzles per year for Shortz's tenure of 26 years would be 9,490 puzzles). These puzzles are stored in the `.puz` file format, which is a popular format used by most crossword apps or software. It includes all clues (and their number and direction), the puzzle's layout, as well as the solutions (which are typically scrambled by some key phrase). To work with these files, I used the [`puzpy`](https://github.com/alexdej/puzpy) package for Python.
 
# Processing the data 
For now I'm going to ignore the spatial information of the puzzle, eg. placement of the clues and black squares. While in my dream analysis I'd figure out a way to estimate the entropy for a given empty square when some nearby squares are filled, in reality that's a lot to tackle in one blog post. So to start I used `puzpy` to pull the clue text, the date of the puzzle, the clue's orientation (across or down), and how long the answer was, and stored it in a `.csv` file. Besides these basic factors, I added two simple features to start from, one indicating whether the clue contained a "?" (what I'll call a "pun clue"), and the second indicating whether it contained a proper noun (what I'll call a "culture clue").
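The extraction step might look roughly like the sketch below. It relies on `puzpy`'s `puz.read()` and `clue_numbering()`, which yield across and down clues as dictionaries with `num`, `clue`, and `len` keys. The function names and CSV column names are invented for the example, the puzzle date is passed in by the caller rather than parsed from the file (how to recover it is left open here), and the two derived features are left out.

```python
import csv

FIELDS = ["date", "number", "orientation", "clue", "answer_length"]


def clue_rows(date, across, down):
    """Yield one flat record per clue.

    `across` and `down` are lists of dicts with 'num', 'clue', and 'len'
    keys, the shape produced by puzpy's clue_numbering().
    """
    for orientation, clues in (("across", across), ("down", down)):
        for clue in clues:
            yield {
                "date": date,
                "number": clue["num"],
                "orientation": orientation,
                "clue": clue["clue"],
                "answer_length": clue["len"],
            }


def rows_from_puz(path, date):
    """Read one .puz file with puzpy and flatten its clues."""
    import puz  # third-party (puzpy); imported lazily so the rest runs without it

    puzzle = puz.read(path)
    numbering = puzzle.clue_numbering()
    return list(clue_rows(date, numbering.across, numbering.down))


def write_csv(rows, path):
    """Write the flattened clue records to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
```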

In a pun clue, the last character of the clue is a question mark ("?"). This indicates that the clue text is not to be taken literally, ie. that it contains a pun, a joke, a play on words, etc. For example, the clue "What's up?" has the solution "SKY". The clue "Going MY way?" has the solution "EGOTRIP". Not all clues containing a pun end in question marks, but all clues ending in question marks are puns (with the rare exception that the clue quotes a question or something).
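Under that definition, the pun-clue feature is a one-line check. A minimal sketch (the function name is made up, and ignoring trailing whitespace is an assumption about how clue text might arrive from the file):

```python
def is_pun_clue(clue):
    """True when the clue's last non-whitespace character is a question mark."""
    return clue.rstrip().endswith("?")


# The first two clues come from the examples above; the third is an
# invented straightforward clue for contrast.
for clue in ["What's up?", "Going MY way?", "Capital of France"]:
    print(f"{clue!r}: {is_pun_clue(clue)}")
```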

Culture clues are a concept I came across in this cool 2014 Bachelor's thesis by Jocelyn Adams: ["A Pragmatic Analysis of Crossword Puzzle Difficulty"](https://scholarship.tricolib.brynmawr.edu/bitstream/handle/10066/15350/Adams_thesis_2015.pdf?sequence=1). The idea here was that crossword clues can be divided into straightforward hints (eg. where the clue is a synonym of the solution) and "culture clues". These might broadly be considered as "trivia clues", containing references to history, contemporary culture, sports, art, etc. While it is difficult to automatically detect whether a clue contains trivia or not, one heuristic used by Jocelyn was to determine whether all of the words in the clue were in the Scrabble dictionary. If any clue word was not a Scrabble word, then the clue was considered a culture clue. I couldn't find a fast way to check clue words against the Scrabble dictionary in Python, so I instead checked if words were proper nouns or not, which was straightforward in the NLP package I used (`spaCy`). This is probably not exactly the same as the Scrabble dictionary method, but given that neither is perfect I figured the difference would get lost in the overall error.
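For comparison, the dictionary heuristic can be sketched as a plain set lookup. To be clear, this is not the spaCy proper-noun check used for the dataset in this post. The tiny `TOY_WORDS` set is a stand-in for a real Scrabble word list (not included here), and the tokenization rule, lowercased runs of letters, is an assumption that would need extra care for contractions and possessives.

```python
import re

# Toy stand-in for a real Scrabble word list, which would be loaded
# from a file into a set for fast membership tests.
TOY_WORDS = {"capital", "of", "feline", "pet"}


def clue_words(clue):
    """Split a clue into lowercased runs of letters (assumed tokenization)."""
    return re.findall(r"[a-z]+", clue.lower())


def is_culture_clue(clue, dictionary):
    """True when at least one clue word is missing from the dictionary."""
    return any(word not in dictionary for word in clue_words(clue))


# Both clues are invented for illustration.
print(is_culture_clue("Feline pet", TOY_WORDS))
print(is_culture_clue("Capital of Peru", TOY_WORDS))
```

With this toy dictionary, the first clue is made up entirely of listed words, while the second contains "Peru", which is not in the set and so flags the clue as a culture clue.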

# Low-Hanging Fruit
Having read the data 