<a href="https://colab.research.google.com/github/nicholsl/KaitenZushi3D-Unity/blob/master/llama_c4_take_home_april_6_prompt.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LLaMA C4 Pre-processing Take-Home Assessment
In late February 2023, Meta AI released [LLaMA](https://ai.facebook.com/blog/large-language-model-llama-meta-ai/). It comes in multiple model sizes and highlighted the key insight that a larger parameter count isn't always better: for a fixed compute budget, training a smaller model on more data can yield better results.

The [paper](https://arxiv.org/abs/2302.13971) discusses the datasets the authors curated to train each of these models. The excerpt below is from its discussion of C4, the Colossal Clean Crawled Corpus:

> C4 [15%]. During exploratory experiments, we observed that using diverse pre-processed CommonCrawl datasets improves performance. We thus included the publicly available C4 dataset (Raffel et al., 2020) in our data. The preprocessing of C4 also ... [deduplicates the data at the line level, performs language identification with a fastText linear classifier to remove non-English pages]: the main difference with CCNet is the quality filtering, which mostly relies on heuristics such as presence of punctuation marks or the number of words and sentences in a webpage.

While the exact interpretation of these pre-processing instructions is somewhat ambiguous, for the purpose of this interview we will make the following assumptions:

1. We will operate on only one small (~30MB / ~27K docs) shard of C4.
1. We should not include duplicate sentences from the same document.
     - Two sentences are duplicates if they contain the same words in the same order, regardless of whitespace, punctuation, capitalization, etc.
1. As a simple heuristic on sentence length, assume we do not want to keep sentences with fewer than 5 words or more than 30.
1. Use a [fastText classifier](https://fasttext.cc/docs/en/language-identification.html) to keep only English documents (score > 0.5).
1. Drop documents whose sentence count is below the 5th percentile or above the 95th.
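
As a rough illustration of assumptions 2 and 3, the sketch below filters one document's sentences. It assumes the sentences have already been split (for example with `nltk.tokenize.sent_tokenize`), and it makes one extra assumption that is not in the instructions: a "word" is any run of letters or digits, so `don't` counts as two words. The names `sentence_key` and `filter_sentences` are hypothetical.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of letters/digits; punctuation and
# whitespace are ignored entirely.
_WORD_RE = re.compile(r"[^\W_]+")


def sentence_key(sentence):
    """Normalize a sentence to a tuple of lowercased words."""
    return tuple(word.lower() for word in _WORD_RE.findall(sentence))


def filter_sentences(sentences, min_words=5, max_words=30):
    """Drop too-short, too-long, and duplicate sentences within one document.

    Returns the kept sentences (original text, original order) and a
    Counter recording why each dropped sentence was dropped.
    """
    seen = set()
    kept = []
    stats = Counter()
    for sentence in sentences:
        key = sentence_key(sentence)
        if len(key) < min_words:
            stats["too_short"] += 1
        elif len(key) > max_words:
            stats["too_long"] += 1
        elif key in seen:
            stats["duplicate"] += 1
        else:
            seen.add(key)
            kept.append(sentence)
    return kept, stats
```

Keeping the first occurrence of a duplicated sentence, and checking length before duplication, are both design choices rather than requirements; either should be documented alongside the other assumptions.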

Your objective is to implement the C4 pre-processing task as described: load the C4 data (`c4-train.00000-of-00512.json.gz`), pre-process the documents following the instructions above, and return a variable containing the documents that should be trained on.
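
A minimal loading sketch, assuming the shard is gzip-compressed JSON Lines with one document per line and the document body under a `"text"` key (the layout of the public `allenai/c4` files). The function name `load_c4_shard` is hypothetical.

```python
import gzip
import json


def load_c4_shard(path):
    """Read a gzipped JSON Lines shard and return the list of document texts."""
    docs = []
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip any blank lines
                docs.append(json.loads(line)["text"])
    return docs
```

Calling `load_c4_shard("c4-train.00000-of-00512.json.gz")` and checking the length of the result against the expected document count is a cheap sanity check before any filtering.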

Further notes:
1. [nltk.tokenize](https://www.nltk.org/api/nltk.tokenize.html) has some useful tooling.
1. You do not need to consult any further research papers such as LLaMa or CCNet for further information.
1. You can make additional assumptions, but be sure to clearly document them.
1. It may be useful to keep stats on why sentences/documents get dropped.
1. There should be 26,953 documents in the shard.
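
The document-level steps (language filter, then sentence-count percentiles) and the drop stats could be sketched as follows. Several things here are assumptions: percentiles are computed over the documents that survive the language filter, using each document's already-filtered sentences; `statistics.quantiles` with `method="inclusive"` stands in for whichever percentile definition you prefer; and the names `make_fasttext_predictor` and `filter_documents` are hypothetical. The language predictor is passed in as a callable so the filtering logic can be exercised without the fastText model.

```python
from collections import Counter
from statistics import quantiles


def make_fasttext_predictor(model_path="lid.176.bin"):
    """Wrap the fastText language-ID model as text -> (language, score)."""
    import fasttext  # imported lazily so the rest runs without it

    model = fasttext.load_model(model_path)

    def predict(text):
        # fastText predicts one line at a time, so flatten newlines first.
        labels, probs = model.predict(text.replace("\n", " "), k=1)
        if len(labels) == 0:  # e.g. empty input
            return "", 0.0
        return labels[0].replace("__label__", ""), float(probs[0])

    return predict


def filter_documents(docs, predict_lang, min_score=0.5):
    """docs is a list of documents, each a list of kept sentences.

    Returns the surviving documents and a Counter of drop reasons.
    """
    stats = Counter()
    english = []
    for sentences in docs:
        lang, score = predict_lang(" ".join(sentences))
        if lang == "en" and score > min_score:
            english.append(sentences)
        else:
            stats["not_english"] += 1

    if len(english) < 2:  # quantiles() needs at least two data points
        return english, stats

    counts = [len(sentences) for sentences in english]
    cuts = quantiles(counts, n=20, method="inclusive")
    low, high = cuts[0], cuts[-1]  # 5th and 95th percentiles

    kept = []
    for sentences in english:
        n = len(sentences)
        if n < low:
            stats["too_few_sentences"] += 1
        elif n > high:
            stats["too_many_sentences"] += 1
        else:
            kept.append(sentences)
    return kept, stats
```

With the real model, `filter_documents(docs, make_fasttext_predictor())` would run both steps; the returned `Counter` can be merged with any sentence-level stats to report why content was dropped.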

In [None]:
!pip install nltk fasttext

Collecting fasttext
  Downloading fasttext-0.9.2.tar.gz (68 kB)
  Preparing metadata (setup.py) ... done
Collecting pybind11>=2.2 (from fasttext)
  Using cached pybind11-2.12.0-py3-none-any.whl (234 kB)
Building wheels for collected packages: fasttext


In [2]:
!wget https://huggingface.co/datasets/allenai/c4/resolve/main/realnewslike/c4-train.00000-of-00512.json.gz
!wget https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin

--2024-05-21 03:39:09--  https://huggingface.co/datasets/allenai/c4/resolve/main/realnewslike/c4-train.00000-of-00512.json.gz
Resolving huggingface.co (huggingface.co)... 18.154.227.67, 18.154.227.87, 18.154.227.7, ...
Connecting to huggingface.co (huggingface.co)|18.154.227.67|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://cdn-lfs.huggingface.co/datasets/allenai/c4/6666a680b0a34eb8756dcb5fd2b12f0078237f3502e8a513bd3e5b71bb92be00?response-content-disposition=attachment%3B+filename*%3DUTF-8%27%27c4-train.00000-of-00512.json.gz%3B+filename%3D%22c4-train.00000-of-00512.json.gz%22%3B&response-content-type=application%2Fgzip&Expires=1716521949&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTcxNjUyMTk0OX19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy5odWdnaW5nZmFjZS5jby9kYXRhc2V0cy9hbGxlbmFpL2M0LzY2NjZhNjgwYjBhMzRlYjg3NTZkY2I1ZmQyYjEyZjAwNzgyMzdmMzUwMmU4YTUxM2JkM2U1YjcxYmI5MmJlMDA%7EcmVzcG9uc2UtY29udGVudC1kaXNwb3NpdGlv