The final code is in the .ipynb that's called Copy1. Just a tiny joke 😉
Currently you'll need Jupyter Notebook in a Python 3.x environment to run it.
A critique of Chomsky's generative grammar made me do this project. The critique by Dixon implied that each sentence should have a frequency associated with it, i.e., how often that sentence appears in actual use. Chomsky argued that such a frequency doesn't meaningfully exist, because in most cases it would be very close to 0: we create language on the go.
So I checked a few corpora to see whether any sentences in them appear more than once. There really aren't any. Cool.
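
For reference, here's a minimal sketch of that kind of duplicate check. It's not the notebook's exact code; it assumes NLTK's Brown corpus as a stand-in for whichever corpus you want to scan:

```python
from collections import Counter

import nltk

nltk.download("brown", quiet=True)
from nltk.corpus import brown

# Normalize each sentence to a lowercase string so trivial casing
# differences don't hide duplicates.
counts = Counter(" ".join(sent).lower() for sent in brown.sents())

duplicates = {s: n for s, n in counts.items() if n > 1}
print(f"{len(duplicates)} distinct sentences appear more than once")
for sent, n in sorted(duplicates.items(), key=lambda kv: -kv[1])[:10]:
    print(n, sent)
```

Almost everything that does repeat turns out to be very short, which is what motivates the next step.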
I'm planning to refactor this project into a more thorough search and run it on a spoken-word corpus, to determine a cut-off value for sentence length. It's obvious that short sentences are often spoken using exactly the same words (just think of Honey, I'm home! in various TV shows), but as the word count of a sentence increases, duplicates drop off drastically. I want to find that cut-off value; a rough sketch of the idea follows.
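
One hypothetical way to look for the cut-off: bucket the duplicate rate by sentence length and eyeball where repetition collapses. Again this uses Brown as a stand-in (a spoken corpus would be the real target):

```python
from collections import Counter, defaultdict

import nltk

nltk.download("brown", quiet=True)
from nltk.corpus import brown

counts = Counter(" ".join(sent).lower() for sent in brown.sents())

# length (in words) -> [sentence types of that length, how many of them repeat]
by_length = defaultdict(lambda: [0, 0])
for sentence, n in counts.items():
    length = len(sentence.split())
    by_length[length][0] += 1
    if n > 1:
        by_length[length][1] += 1

for length in sorted(by_length):
    total, repeated = by_length[length]
    print(f"{length:3d} words: {repeated}/{total} sentence types repeated")
```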