# 👋 Hello, LLM!

Welcome to the _"**Ask Your Documents Questions**"_ tutorial notebook series!

To kick things off, let's talk to an LLM, shall we?

After all, the [Large Language Model](https://en.wikipedia.org/wiki/Large_language_model) is at the heart of all of this.

## Choosing an LLM

This notebook supports both local LLMs and hosted LLMs (_OpenAI ChatGPT_).

When choosing an LLM, be sure to check the [licensing terms](https://huggingface.co/blog/os-llms#licensing). Some LLMs are only available for non-commercial use.

For local models, this notebook uses Hugging Face to simplify the process of downloading and running the pretrained models (_both text generation [LLM] and sentence transformers [embeddings]_).

Here are a few fully open-source models that this notebook can run locally:
- [MPT-7B](https://huggingface.co/mosaicml/mpt-7b) _(context length of 4,096 tokens; there is also a version with **65k tokens** that requires a lot of GPU VRAM)_
- [falcon-7b](https://huggingface.co/tiiuae/falcon-7b) _(context length of 2,048 tokens)_

The 7B models can usually run on consumer GPU cards with ~16GB of memory. Your results may vary.
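
To see roughly where the ~16GB figure comes from, here is a back-of-the-envelope sketch. It assumes a nominal 7 billion parameters and counts only the model weights; activations, the attention cache, and framework overhead come on top. The helper name `weight_memory_gib` and the dtype table are illustrative, not part of any library.

```python
# Approximate bytes needed to store one parameter in each precision.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1}


def weight_memory_gib(n_params: float, dtype: str = "float16") -> float:
    """Estimate the memory (in GiB) needed just to hold the model weights."""
    return n_params * BYTES_PER_PARAM[dtype] / 1024**3


# Nominal parameter count for a "7B" model (an assumption; real counts differ slightly).
N_PARAMS = 7e9

for dtype in ("float32", "float16", "int8"):
    print(f"7B parameters in {dtype}: {weight_memory_gib(N_PARAMS, dtype):.1f} GiB")
```

In half precision the weights alone come to about 13 GiB, which is why a 16GB card is a reasonable lower bound and why full 32-bit weights do not fit.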

# Local LLMs

The following cells are your _"Hello, world!"_ cells for the local LLMs.

You can skip these if you prefer. They need a capable GPU to run.  
(_This notebook was tested on an Nvidia 2080 Ti_)
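
Before loading a model, it can help to check which device is actually available. The following is a minimal sketch, assuming PyTorch may or may not be installed; the helper name `describe_device` is hypothetical, and the check only looks at the first CUDA device.

```python
from typing import Optional, Tuple


def describe_device() -> Tuple[str, Optional[float]]:
    """Return (device string, total VRAM in GiB), or ("cpu", None) without a CUDA GPU."""
    try:
        import torch  # imported lazily so this check also runs without PyTorch
    except ImportError:
        return "cpu", None

    if not torch.cuda.is_available():
        return "cpu", None

    # Total memory of the first CUDA device, converted from bytes to GiB.
    total_bytes = torch.cuda.get_device_properties(0).total_memory
    return "cuda:0", total_bytes / 1024**3


device, vram_gib = describe_device()
if vram_gib is None:
    print("No CUDA GPU detected; the local LLM cells will likely be impractical.")
else:
    print(f"Using {device} with {vram_gib:.1f} GiB of VRAM")
```

If this reports no GPU, setting the `run_*` flags to `False` skips the local model cells.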

> Note: these also require a fast network connection to download the models.
> - `MPT-7B`: 12.3GB download _(note: it takes a bit longer to prepare the pipeline after the download)_
> - `falcon-7b`: 

> Curious about `trust_remote_code=True`? See [here](https://huggingface.co/mosaicml/mpt-7b#how-to-use) and [here](https://www.reddit.com/r/LocalLLaMA/comments/13t2b67/security_psa_huggingface_models_are_code_not_just/jlu113p/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button).

In [None]:
run_MPT: bool = True
run_FALCON: bool = True

In [None]:
# This cell sets up MPT 7b and loads the model into GPU memory
if run_MPT:
    from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM, AutoConfig

    # Set up the tokenizer
    mpt_7b_tokenizer = AutoTokenizer.from_pretrained("mosaicml/mpt-7b", trust_remote_code=True)

    # Set up the model
    mpt_7b_config = AutoConfig.from_pretrained("mosaicml/mpt-7b", trust_remote_code=True)
    # mpt_7b_config.max_seq_len = 4096 # TODO: Test if this slows down generation
    mpt_7b_config.init_device = "cuda:0" # For fast initialization directly on GPU!
    mpt_7b_model = AutoModelForCausalLM.from_pretrained("mosaicml/mpt-7b", trust_remote_code=True, config=mpt_7b_config)
    
    # Create a pipeline using the simpler pipeline interface
    mpt_7b_pipeline = pipeline(
        'text-generation',
        model=mpt_7b_model,
        tokenizer=mpt_7b_tokenizer,
        device="cuda:0", # If you're using Nvidia
        trust_remote_code=True, # Required,
    )

In [None]:
# This cell can be re-run to test different prompts with MPT 7b
if run_MPT:
    mpt_7b_prompt = "What color is the sky on Mars?"

    # NOTE: This takes about 1.5 minutes on a 2080 Ti
    mpt_7b_result = mpt_7b_pipeline(
        mpt_7b_prompt,
        max_new_tokens=50, # TODO: verify generation times with this set to different values
        do_sample=True, # TODO: required or no?
        use_cache=True # TODO: required or no?
        # temperature=
    )

    # TODO: where to configure the temperature?
    print(mpt_7b_result[0]["generated_text"])
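
On the open question of where to set the temperature: with Hugging Face `transformers`, generation settings such as `temperature` can be passed as keyword arguments to the pipeline call, next to `max_new_tokens` and `do_sample`. The sketch below only builds the keyword dictionary so it runs without a model. The helper name `build_generation_kwargs` and the value `0.7` are illustrative assumptions, not settings taken from this notebook.

```python
from typing import Any, Dict, Optional


def build_generation_kwargs(
    max_new_tokens: int = 50,
    temperature: Optional[float] = None,
) -> Dict[str, Any]:
    """Build keyword arguments for a text-generation pipeline call.

    With no temperature, decoding is greedy (do_sample=False).
    With a temperature, sampling is enabled, since temperature only
    has an effect when do_sample=True.
    """
    kwargs: Dict[str, Any] = {"max_new_tokens": max_new_tokens, "use_cache": True}
    if temperature is None:
        kwargs["do_sample"] = False
    else:
        if temperature <= 0:
            raise ValueError("temperature must be a positive number")
        kwargs["do_sample"] = True
        kwargs["temperature"] = temperature
    return kwargs


greedy_kwargs = build_generation_kwargs()
sampling_kwargs = build_generation_kwargs(temperature=0.7)
print(greedy_kwargs)
print(sampling_kwargs)

# With the pipeline from the cells above, the call would look like:
# mpt_7b_result = mpt_7b_pipeline(mpt_7b_prompt, **sampling_kwargs)
```

Lower temperatures make the output more deterministic and higher values make it more varied. Greedy decoding is a reasonable default when comparing generation times, because the output does not change between runs.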

# OpenAI Hosted GPT LLM