# Large Language Models

Large language models (LLMs) are trained on very large collections of text and can interpret, generate, and translate human language fluently. In this notebook we run one on our own hardware with LangChain. LangChain is not a model architecture; it is a framework for building applications on top of LLMs, and we use it here for its flexibility.

This notebook runs both locally and on Google Colab. It also works without a GPU: wherever the setup differs, follow the instructions for either GPU or CPU, depending on your machine.

Please note that running on a CPU alone makes text generation noticeably slower.

---
## 1.&nbsp; Installations and Settings 🛠️

On Google Colab, you have access to free GPUs whenever they're available, so let's make use of them. To configure a Colab GPU, navigate to "Edit" and then "Notebook Settings". Select "GPU" and then click "Save".

To proceed, you'll need to install two libraries: LangChain and llama-cpp-python (the Python bindings for llama.cpp). When running this notebook locally, you only need to install these libraries once, and they'll remain on your computer. Colab does not ship with them, so there they must be installed in every session.

**LangChain** is a framework that simplifies the development of applications powered by large language models (LLMs).

**llama.cpp** enables us to execute quantised versions of models.

> Quantisation of LLMs is a process that reduces the precision of the numerical values in the model, such as converting 32-bit floating-point numbers to 8-bit integers. These models are therefore smaller and faster, allowing them to run on less powerful hardware with only a small loss in precision.
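
As a rough illustration of the idea, here is a minimal sketch of symmetric 8-bit quantisation in plain Python. The function names and the example weights are hypothetical, and the scheme is deliberately simplified; the quantised model files used later in this notebook rely on more elaborate formats.

```python
def quantise_int8(values):
    """Symmetric linear quantisation of floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantised = [round(v / scale) for v in values]
    return quantised, scale


def dequantise(quantised, scale):
    """Recover approximate floats from the stored integers and the scale."""
    return [q * scale for q in quantised]


weights = [0.12, -0.5, 0.33, 1.27, -1.0]  # hypothetical model weights
q, scale = quantise_int8(weights)
restored = dequantise(q, scale)

# The rounding error per weight is at most half a quantisation step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q)
print(max_error <= scale / 2 + 1e-12)
```

Each weight is stored as a small integer plus one shared scale, which is where the savings in size come from; the price is the small rounding error measured at the end.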

* If you're using a **CPU**, use the [standard installation](https://python.langchain.com/docs/integrations/llms/llamacpp#cpu-only-installation) of llama.cpp. Windows users might have to install [a couple of extra libraries too](https://python.langchain.com/docs/integrations/llms/llamacpp#installation-with-windows). Some students using Windows have also found [this guide](https://medium.com/@piyushbatra1999/installing-llama-cpp-python-with-nvidia-gpu-acceleration-on-windows-a-short-guide-0dfac475002d) useful.
* If you have an **NVIDIA GPU**, you need to [activate cuBLAS](https://python.langchain.com/docs/integrations/llms/llamacpp#installation-with-openblas-cublas-clblast) with llama.cpp. cuBLAS is a library that speeds up operations on NVIDIA GPUs.
* If you have a **Mac with Apple silicon**, you need to [enable Metal](https://python.langchain.com/docs/integrations/llms/llamacpp#installation-with-metal).

In [4]:
!pip install cmake




In [3]:
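import os
os.environ["CUDA_PATH"] = r"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.2"  # set as an environment variable; a plain Python variable is not visible to the build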

In [None]:

!pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir --verbose

In [8]:
!pip3 install -qqq langchain --progress-bar off
# As this notebook is originally on Colab, here we'll use their NVIDIA GPU and activate cuBLAS
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip3 install -qqq llama-cpp-python --force-reinstall --upgrade --no-cache-dir --progress-bar off

'CMAKE_ARGS' is not recognized as an internal or external command,
operable program or batch file.


In [5]:
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir

'CMAKE_ARGS' is not recognized as an internal or external command,
operable program or batch file.
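
The error above appears because the `VAR=value command` prefix is POSIX shell syntax, which the Windows command prompt does not understand. One portable alternative is to pass the variables through an environment mapping from Python. The sketch below only builds the command and environment; the helper name `build_install_command` is hypothetical, and the CMake flag is the one used in the cells above (other llama-cpp-python releases may expect a different flag name).

```python
import os
import sys


def build_install_command(cmake_args="-DLLAMA_CUBLAS=on"):
    """Return (command, environment) for a from-source llama-cpp-python build."""
    env = dict(os.environ)          # copy, so the current process is left untouched
    env["CMAKE_ARGS"] = cmake_args  # flag taken from the cells above
    env["FORCE_CMAKE"] = "1"
    command = [
        sys.executable, "-m", "pip", "install",
        "--upgrade", "--force-reinstall", "--no-cache-dir",
        "llama-cpp-python",
    ]
    return command, env


command, env = build_install_command()
print(" ".join(command[1:]))

# To actually run the build, on any operating system:
# import subprocess
# subprocess.run(command, env=env, check=True)
```

Using `sys.executable -m pip` also ensures the package is installed into the same Python environment that runs the notebook.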


In [10]:
import os
os.environ['CMAKE_ARGS'] = "-DLLAMA_CUBLAS=on"
os.environ['FORCE_CMAKE'] = "1"
!pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir



Collecting llama-cpp-python
  Downloading llama_cpp_python-0.2.61.tar.gz (37.4 MB)

  error: subprocess-exited-with-error
  
  × Building wheel for llama-cpp-python (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [46 lines of output]
      *** scikit-build-core 0.8.2 using CMake 3.29.2 (wheel)
      *** Configuring CMake...
      loading initial cache file C:\Users\Marvin\AppData\Local\Temp\tmptgi7769c\build\CMakeInit.txt
      -- Building for: Visual Studio 17 2022
      -- Selecting Windows SDK version 10.0.20348.0 to target Windows 10.0.19045.
      -- The C compiler identification is MSVC 19.39.33523.0
      -- The CXX compiler identification is MSVC 19.39.33523.0
      -- Detecting C compiler ABI info
      -- Detecting C compiler ABI info - done
      -- Check for working C compiler: C:/Program Files (x86)/Microsoft Visual Studio/2022/BuildTools/VC/Tools/MSVC/14.39.33519/bin/Hostx64/x64/cl.exe - skipped
      -- Detecting C compile features
      -- Detecting C compile features - done

Before we dive into the examples, let's download the large language model (LLM) we'll be using. For these exercises, we've selected a [quantised version of Mistral AI's Mistral 7B model](https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF). While this is a great choice, it's by no means the only option. We encourage you to explore and try different models to discover the unique strengths and weaknesses of each. Even models of similar size can exhibit surprisingly different capabilities.

> On Colab, the LLM has to be downloaded again for each session.
> If you're working locally, you only need to download the model once; it then stays on your computer. Change `--local-dir` to your folder of choice.
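
When working locally, the "download once" idea can be sketched as a small check before downloading. The helper names `find_local_model` and `ensure_model` are hypothetical; `hf_hub_download` is the download function from the `huggingface_hub` package and is only imported when a download is actually needed.

```python
from pathlib import Path


def find_local_model(local_dir, filename):
    """Return the path of an already-downloaded model file, or None."""
    candidate = Path(local_dir) / filename
    return candidate if candidate.is_file() else None


def ensure_model(repo_id, filename, local_dir="."):
    """Download the model file only when it is not already in local_dir."""
    existing = find_local_model(local_dir, filename)
    if existing is not None:
        return existing
    # Imported lazily so the local check works without huggingface_hub installed.
    from huggingface_hub import hf_hub_download
    return Path(hf_hub_download(repo_id=repo_id, filename=filename, local_dir=local_dir))


# Example call (this downloads several gigabytes on the first run):
# model_path = ensure_model(
#     "TheBloke/Mistral-7B-Instruct-v0.1-GGUF",
#     "mistral-7b-instruct-v0.1.Q4_K_M.gguf",
#     local_dir=".",
# )
```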

In [8]:
!pip install huggingface_hub


Collecting huggingface_hub
  Downloading huggingface_hub-0.22.2-py3-none-any.whl.metadata (12 kB)
Collecting filelock (from huggingface_hub)
  Downloading filelock-3.13.4-py3-none-any.whl.metadata (2.8 kB)
Collecting fsspec>=2023.5.0 (from huggingface_hub)
  Downloading fsspec-2024.3.1-py3-none-any.whl.metadata (6.8 kB)
Collecting tqdm>=4.42.1 (from huggingface_hub)
  Downloading tqdm-4.66.2-py3-none-any.whl.metadata (57 kB)
Downloading huggingface_hub-0.22.2-py3-none-any.whl (388 kB)

In [9]:
!huggingface-cli download TheBloke/Mistral-7B-Instruct-v0.1-GGUF mistral-7b-instruct-v0.1.Q4_K_M.gguf --local-dir C:\Users\Marvin\Documents\WBS\Data-Science-Bootcamp\8_Large_Language_Models\LLMs --local-dir-use-symlinks False


C:\Users\Marvin\Documents\WBS\Data-Science-Bootcamp\8_Large_Language_Models\LLMs\mistral-7b-instruct-v0.1.Q4_K_M.gguf


Consider using `hf_transfer` for faster downloads. This solution comes with some limitations. See https://huggingface.co/docs/huggingface_hub/hf_transfer for more details.
downloading https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/resolve/main/mistral-7b-instruct-v0.1.Q4_K_M.gguf to C:\Users\Marvin\.cache\huggingface\hub\tmpwebqpifi


---
## 2.&nbsp; Setting up your LLM 🧠

LangChain simplifies LLM deployment with its streamlined setup process. A single constructor call configures your LLM, allowing you to tailor the parameters to your specific needs.

If you want to know more about llama-cpp-python, you can [read the docs here](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/). Alternatively, here are the [LangChain docs for Llama.cpp](https://python.langchain.com/docs/integrations/llms/llamacpp).

Here's a brief overview of some of the parameters:
* **model_path:** The path to the Llama model file that will be used for generating text.
* **max_tokens:** The maximum number of tokens that the model should generate in its response.
* **temperature:** A non-negative value, typically between 0 and 1, that controls the randomness of the model's generation. A lower temperature results in more predictable, constrained output, while a higher temperature yields more creative and diverse text.
* **top_p:** A value between 0 and 1 that sets the nucleus sampling threshold: the model samples only from the smallest set of most probable tokens whose cumulative probability reaches top_p. A lower top_p restricts the choice to the most probable tokens, while a higher top_p lets the model explore a wider range of possibilities. A value of 1 disables the filter.
* **n_gpu_layers:** The number of model layers to offload to the GPU. With 0, all layers run on the CPU. With 1, one layer runs on the GPU and the rest stay on the CPU; with 2, two layers run on the GPU, and so on. -1 offloads all layers to the GPU. Experiment with different values of n_gpu_layers to find the best balance between speed and GPU memory usage for your hardware.
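
To make the two sampling parameters more concrete, here is a minimal sketch of temperature scaling and top-p (nucleus) filtering in plain Python. The function names and the example scores are hypothetical, and this is an illustration of the general technique rather than llama.cpp's actual sampler.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; requires temperature > 0."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Indices of the smallest set of most probable tokens whose mass reaches top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return kept


logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for four candidate tokens
cool = softmax_with_temperature(logits, 0.1)
warm = softmax_with_temperature(logits, 1.0)

# A low temperature concentrates almost all probability on the top token.
print(round(cool[0], 3), round(warm[0], 3))
# A low top_p keeps only the most likely token; top_p = 1.0 keeps all of them.
print(top_p_filter(warm, 0.5), top_p_filter(warm, 1.0))
```

With `temperature = 0.1` and `top_p = 1`, as in the cell below, generation is close to always picking the most likely next token.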

In [9]:
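# Newer LangChain releases expose this class as langchain_community.llms.LlamaCpp
from langchain.llms import LlamaCpp

llm = LlamaCpp(
    model_path=r"C:\Users\Marvin\Documents\WBS\Data-Science-Bootcamp\8_Large_Language_Models\LLMs\mistral-7b-instruct-v0.1.Q4_K_M.gguf",
    max_tokens=2000,     # upper limit on the number of generated tokens
    temperature=0.1,     # low value for predictable answers
    top_p=1,             # no nucleus filtering
    n_gpu_layers=-1,     # offload all layers to the GPU; use 0 for CPU only
)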

llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from C:\Users\Marvin\Documents\WBS\Data-Science-Bootcamp\8_Large_Language_Models\LLMs\mistral-7b-instruct-v0.1.Q4_K_M.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = mistralai_mistral-7b-instruct-v0.1
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_coun

If you're using a GPU, check the output of this cell ☝️
  * If you're using cuBLAS, you'll see `BLAS = 1` if it's installed correctly.
  * If you're using Metal, you'll see `NEON = 1` if it's installed correctly.

---
## 3.&nbsp; Asking your LLM questions 🤖
Play around and note how small changes make a big difference.

In [None]:
answer_1 = llm.invoke("Which animals live at the north pole?")
print(answer_1)


llama_print_timings:        load time =     232.35 ms
llama_print_timings:      sample time =     149.94 ms /   226 runs   (    0.66 ms per token,  1507.22 tokens per second)
llama_print_timings: prompt eval time =     232.30 ms /     8 tokens (   29.04 ms per token,    34.44 tokens per second)
llama_print_timings:        eval time =    5334.07 ms /   226 runs   (   23.60 ms per token,    42.37 tokens per second)
llama_print_timings:       total time =    6673.09 ms /   234 tokens




1. Polar Bears
2. Arctic Foxes
3. Walruses
4. Caribou
5. Beluga Whales
6. Narwhals
7. Seals
8. Musk Oxen
9. Arctic Hares
10. Snowy Owls
11. Reindeer
12. Beavers
13. Moose
14. Lynx
15. Wolverines
16. Arctic Wolves
17. Harp Seals
18. Dall Sheep
19. Pacific Walruses
20. Harp Porpoises
21. Pacific Salmon
22. Arctic Char
23. Dwarf Arctic Foxes
24. Arctic Poppies
25. Arctic Willows
26. Arctic Cotton
27. Arctic Cranberries
28. Arctic Blueberries
29. Arctic Crowberries
30. Arctic Huckleberries
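
The `llama_print_timings` lines printed with this answer show how fast the model generates text. As a small worked example, the throughput figures can be recomputed from the raw numbers; the regular expression below is just one possible way to pull them out of the log line.

```python
import re

# One line copied from the llama_print_timings output above.
line = (
    "llama_print_timings:        eval time =    5334.07 ms /   226 runs   "
    "(   23.60 ms per token,    42.37 tokens per second)"
)

match = re.search(r"=\s*([\d.]+) ms /\s*(\d+) (?:runs|tokens)", line)
elapsed_ms = float(match.group(1))
token_count = int(match.group(2))

tokens_per_second = token_count / (elapsed_ms / 1000)
ms_per_token = elapsed_ms / token_count
print(f"{tokens_per_second:.2f} tokens/s, {ms_per_token:.2f} ms/token")
# → 42.37 tokens/s, 23.60 ms/token
```

Comparing this number between a GPU run and a CPU run is a simple way to see how much the GPU offloading helps on your machine.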


In [None]:
answer_2 = llm.invoke("Write a poem about animals that live at the north pole.")
print(answer_2)

Llama.generate: prefix-match hit

llama_print_timings:        load time =     232.35 ms
llama_print_timings:      sample time =     166.65 ms /   299 runs   (    0.56 ms per token,  1794.17 tokens per second)
llama_print_timings: prompt eval time =     110.36 ms /    12 tokens (    9.20 ms per token,   108.73 tokens per second)
llama_print_timings:        eval time =    6941.35 ms /   298 runs   (   23.29 ms per token,    42.93 tokens per second)
llama_print_timings:       total time =    8131.73 ms /   310 tokens




In the land of snow and ice,
Where the sun doesn't shine nice,
Lives a group of animals brave,
Adapted to their cold environment.

The polar bear is the king,
Of this arctic realm so thick,
With fur so white and thick,
It keeps him warm from the cold trick.

He hunts for seals to eat,
On the frozen sea he meets,
With his powerful paws and teeth,
He can break through the thickest ice sheet.

The arctic fox is next,
With his coat of white and red,
He blends in with his surroundings,
And can run fast when he needs.

He eats lemmings and birds,
And keeps warm with his fur,
His small size helps him conserve,
Energy in this harsh world.

The caribou roam free,
With antlers tall and wide,
They eat grasses and lichens,
And can run fast when they need to hide.

The arctic hare is small,
But quick as a cheetah,
He eats the leftovers,
When the caribou have had enough.

These animals live in harmony,
In this land of snow and ice,
They have adapted to their surroundings,
And can survive in this c

In [None]:
answer_3 = llm.invoke("Explain the central limit theorem like I'm 5 years old.")
print(answer_3)

Llama.generate: prefix-match hit

llama_print_timings:        load time =     232.35 ms
llama_print_timings:      sample time =      86.72 ms /   136 runs   (    0.64 ms per token,  1568.32 tokens per second)
llama_print_timings: prompt eval time =     134.51 ms /    15 tokens (    8.97 ms per token,   111.51 tokens per second)
llama_print_timings:        eval time =    3147.49 ms /   135 runs   (   23.31 ms per token,    42.89 tokens per second)
llama_print_timings:       total time =    3836.48 ms /   150 tokens




The central limit theorem is a math rule that says when you take lots of numbers and add them up, it doesn't really matter where each number came from or how many there are - the result will still be pretty much normal. Like if you have a bunch of apples and bananas and you mix them together, the height of the pile won't be exactly average (because apples and bananas have different heights), but it will be pretty close to average. And if you have even more apples and bananas, the pile will get even closer to average. The more numbers you add, the closer the result gets to the middle.


The answers from this 7B model may not seem as impressive as those from the latest OpenAI or Google models, but given how much smaller it is, it performs very well. Models of this size may not have the most extensive knowledge base, but for our purposes, we only need them to generate coherent English. We'll then supply them with specialised knowledge on a topic of your choice, resulting in a local, specialised model that can function offline.

---
## 4.&nbsp; Challenge 😀
Play around with this and other LLMs, and keep a record of your findings:
1. Pose different questions to the model, each subtly different from the last. Observe the resulting outputs. Smaller models tend to be highly sensitive to minor changes in language and grammar.
2. Experiment with the parameters, one at a time, to assess their impact on the output.
3. Attempt to load different models: Explore the [models page on Hugging Face](https://huggingface.co/models). You can use the left-hand menu to find `Text Generation` under `Natural Language Processing`. Then use the filter bar for `GGUF` to find already quantised models.

You can alter the download command accordingly. In this notebook we used the command:

In [None]:
# !huggingface-cli download TheBloke/Mistral-7B-Instruct-v0.1-GGUF mistral-7b-instruct-v0.1.Q4_K_M.gguf --local-dir . --local-dir-use-symlinks False

This downloads the version `mistral-7b-instruct-v0.1.Q4_K_M.gguf` of the model `TheBloke/Mistral-7B-Instruct-v0.1-GGUF` from Hugging Face. You can read about the different versions on the model's `model card`.

To adapt this, just change the model and the version to your new choice.

`!huggingface-cli download {model_name} {model_version} --local-dir . --local-dir-use-symlinks False`

For example:

In [None]:
# !huggingface-cli download TheBloke/Llama-2-7B-Chat-GGUF llama-2-7b-chat.Q4_K_M.gguf --local-dir . --local-dir-use-symlinks False

The code above would download a [quantised version of Meta's Llama 2 7B chat](https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/tree/main).
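
If you plan to try several models, the template above can also be filled in programmatically. This is a minimal sketch; `build_download_command` is a hypothetical helper that only assembles the command shown in this notebook and does not run it.

```python
import shlex


def build_download_command(model_name, model_version, local_dir="."):
    """Fill in the huggingface-cli download template used in this notebook."""
    return [
        "huggingface-cli", "download", model_name, model_version,
        "--local-dir", local_dir,
        "--local-dir-use-symlinks", "False",
    ]


command = build_download_command(
    "TheBloke/Llama-2-7B-Chat-GGUF", "llama-2-7b-chat.Q4_K_M.gguf"
)
print(shlex.join(command))
```

Keeping the command as a list of arguments means it can be handed to `subprocess.run(command, check=True)` unchanged, which also copes with folder names that contain spaces.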








