# Validating the Corpus

To determine whether the generated proverbs have any validity, we need to check them against known sources. One option is to search for the n-grams on Google or DuckDuckGo and see what results, if any, come back. I ran an example search on DuckDuckGo for one of the generated proverbs, ["Never judge a person's worth by their online persona"](https://duckduckgo.com/?q=Never+judge+a+person%27s+worth+by+their+online+persona.&t=osx&ia=web). If this proves useful, the Python package [duckduckgo-search](https://pypi.org/project/duckduckgo-search/) can run these searches programmatically.
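A raw result count is not very informative on its own, because a search engine will happily return pages that only share a few words with the query. A minimal sketch of one way to tighten this, counting only results that contain the proverb verbatim. The helper names `normalize` and `count_exact_hits` are my own, the sample results are made up for illustration, and the `title`/`body` keys are an assumption about the shape of the result dicts:

```python
from typing import Iterable, Mapping


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so matching ignores case and spacing."""
    return " ".join(text.lower().split())


def count_exact_hits(phrase: str, results: Iterable[Mapping[str, str]]) -> int:
    """Count results whose title or body contains the phrase verbatim."""
    # Drop a trailing period so "…persona." also matches "…persona, they say".
    needle = normalize(phrase).rstrip(".")
    hits = 0
    for result in results:
        haystack = normalize(result.get("title", "") + " " + result.get("body", ""))
        if needle in haystack:
            hits += 1
    return hits


# Hypothetical results, NOT real search output.
sample_results = [
    {"title": "Proverbs for the digital age",
     "body": "Never judge a person's worth by their online persona, they say."},
    {"title": "Unrelated page", "body": "Judge a book by its cover."},
]

print(count_exact_hits(
    "Never judge a person's worth by their online persona.", sample_results))
# prints 1
```

With duckduckgo-search installed, the result list would come from something like `DDGS().text(f'"{phrase}"', max_results=20)`. As far as I can tell from the package documentation, that call returns dicts with `title`, `href`, and `body` keys, but treat that shape as an assumption to verify before relying on it.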

The alternative is to validate the generated proverbs against an established external corpus of English usage. I can see three candidates for this purpose, the first two of which are commonly used: the Google Ngram Viewer, the Corpus of Contemporary American English (COCA), and CommonCrawl.

The **Google Ngram Viewer** allows users to search for the frequency of n-grams in a vast collection of digitized books, which can help determine if a particular phrase has been used historically. 

* [NGRAMS](https://ngrams.dev/) is an independent project that provides an API for accessing Google Ngram data.
* [Google Books Ngram Exports](https://storage.googleapis.com/books/ngrams/books/datasetsv3.html) allows users to download raw n-gram data.
* [google-books-ngrams-2020 on the Google Cloud Marketplace](https://console.cloud.google.com/marketplace/product/bigquery-public-data/google-books-ngrams-2020?pli=1&project=stately-arc-170423)
* [Try BigQuery using the sandbox](https://docs.cloud.google.com/bigquery/docs/sandbox)
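
One practical wrinkle is that the Google Books Ngram datasets only go up to 5-grams, so a longer proverb cannot be looked up whole. A minimal sketch of one workaround is to split each proverb into overlapping 5-word windows and look each window up separately. The `tokenize` and `windows` helpers are my own illustrative names, and the regex tokenizer is an assumption: Google's own tokenization may treat punctuation and possessives differently, so the windows may need adjusting before lookup.

```python
import re

MAX_N = 5  # the Google Books Ngram datasets cover 1-grams through 5-grams


def tokenize(text: str) -> list[str]:
    """Lowercase the text and keep runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def windows(text: str, n: int = MAX_N) -> list[str]:
    """Return every overlapping n-word window, or the whole text if it is short."""
    tokens = tokenize(text)
    if not tokens:
        return []
    if len(tokens) <= n:
        return [" ".join(tokens)]
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


for window in windows("Never judge a person's worth by their online persona."):
    print(window)
```

A proverb could then be scored by, say, the lowest frequency among its windows, on the reasoning that a phrase is only as well attested as its rarest fragment. That scoring rule is a suggestion, not something I have tested yet.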

The **[Corpus of Contemporary American English (COCA)](https://www.english-corpora.org/coca/)** contains over 1 billion words of text from a variety of sources including spoken language, fiction, magazines, newspapers, and academic texts. By checking the presence of the generated proverbs in this corpus, we can assess their validity and common usage.

**CommonCrawl** is a massive repository of web crawl data that can be used to analyze the frequency and context of phrases across the internet. By searching for the generated proverbs in this dataset, we can determine their prevalence in online content.

* [Common Crawl - Get Started](https://commoncrawl.org/get-started)
* [commoncrawl/whirlwind-python: A whirlwind tour of Common Crawl's data using Python](https://github.com/commoncrawl/whirlwind-python)

What we are looking for is some measure of how often each generated proverb occurs in one or more of these corpora. If a generated proverb appears with some regularity in these established sources, we can have greater confidence in its validity as a recognized saying. Conversely, if a generated proverb does not appear at all, it is probably not widely recognized or used. We are thus looking for the opposite of novelty in this case: we want to confirm that the generated proverbs are not entirely new creations but have some basis in existing language usage. Repeated n-grams that are entirely novel will be considered *emergent*, and n-grams that are well grounded in language usage as represented by these corpora will be considered *existent*.
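
The emergent/existent split can be sketched as follows, assuming we already have reference counts for each n-gram from one of the corpora. Everything here is illustrative: the function names are my own, the three proverbs are toy inputs, and `reference_counts` holds placeholder numbers, not real corpus data.

```python
import re
from collections import Counter

Ngram = tuple[str, ...]


def ngrams(text: str, n: int) -> list[Ngram]:
    """Split text into lowercase word tokens and return all n-grams."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def repeated_ngrams(proverbs: list[str], n: int, min_repeats: int = 2) -> Counter:
    """Count in how many proverbs each n-gram occurs; keep the repeated ones."""
    counts: Counter = Counter()
    for proverb in proverbs:
        counts.update(set(ngrams(proverb, n)))  # once per proverb
    return Counter({g: c for g, c in counts.items() if c >= min_repeats})


def label(ngram: Ngram, reference_counts: dict[Ngram, int]) -> str:
    """An n-gram absent from the reference corpus is emergent, otherwise existent."""
    return "existent" if reference_counts.get(ngram, 0) > 0 else "emergent"


proverbs = [
    "Never judge a book by its cover",
    "Never judge a person by their online persona",
    "Guard your online persona well",
]
# Placeholder counts for illustration only, NOT taken from any real corpus.
reference_counts = {("never", "judge"): 5000, ("judge", "a"): 9000}

repeated = repeated_ngrams(proverbs, n=2)
labels = {gram: label(gram, reference_counts) for gram in repeated}
for gram, tag in labels.items():
    print(" ".join(gram), tag)
```

Here presence is a simple yes/no test against the reference counts; a stricter cutoff than "appears at least once" can be swapped into `label` without changing the rest.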

To distinguish between the two possibilities, we will use a frequency threshold: a generated proverb that occurs at least that often is considered valid, and one that falls below it is not. The threshold can be set per corpus and according to the level of confidence we want in the results. For example, we might require a generated proverb to appear at least 10 times in COCA to be considered valid. Applying this threshold filters out generated proverbs that do not meet the criterion and lets us focus on those with a stronger basis in existing language usage.
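
Because the corpora differ enormously in size, a raw count such as 10 does not mean the same thing in each of them. A minimal sketch of how a threshold might be normalized, assuming COCA is rounded down to 1 billion words and using a purely hypothetical 5-billion-word corpus for comparison. The function names are my own.

```python
import math

COCA_WORDS = 1_000_000_000   # "over 1 billion words", rounded down for illustration
OTHER_WORDS = 5_000_000_000  # hypothetical size of a second corpus


def per_million(count: int, corpus_words: int) -> float:
    """Express a raw count as occurrences per million words."""
    return count * 1_000_000 / corpus_words


def equivalent_count(threshold: int, from_words: int, to_words: int) -> int:
    """Scale a raw-count threshold to a corpus of a different size, rounding up."""
    return math.ceil(threshold * to_words / from_words)


def is_valid(count: int, threshold: int) -> bool:
    """A proverb passes when its count meets or exceeds the threshold."""
    return count >= threshold


print(per_million(10, COCA_WORDS))                    # 0.01 occurrences per million words
print(equivalent_count(10, COCA_WORDS, OTHER_WORDS))  # 50
print(is_valid(12, 10), is_valid(3, 10))              # True False
```

In other words, "at least 10 times in COCA" corresponds to roughly 0.01 occurrences per million words, and stating the threshold in that form makes it comparable across corpora.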