<a href="https://colab.research.google.com/github/kfahn22/Colab_notebooks/blob/main/Mistral_7b_instruct_Coding_Train.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

See the LlamaIndex documentation on [using LLMs](https://docs.llamaindex.ai/en/stable/module_guides/models/llms.html).

This notebook is based on [this Colab notebook](https://colab.research.google.com/drive/1ZAdrabTJmZ_etDp10rjij_zME2Q3umAQ?usp=sharing).

Note: Responses from local models can be quite slow, especially with 8-bit quantization.

With 4-bit quantization, `mistralai/Mistral-7B-Instruct-v0.1` uses about 12 GB of VRAM and 8.5 GB of RAM. The code below loads `mistralai/Mistral-7B-Instruct-v0.2`, a model of the same size. I used a T4 High-RAM instance for this notebook.
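To put the quantization choice in perspective, here is a rough back-of-the-envelope estimate of the memory taken by the weights alone at different precisions. The parameter count of about 7.24 billion is an assumption, and the weights are only one component of total usage: activations, the KV cache, quantization constants, and the CUDA context all add to it, so measured VRAM will be higher.

```python
# Rough weight-only memory estimate for a 7B-class model.
# ASSUMPTION: ~7.24e9 parameters. Real VRAM usage is higher than these numbers.
N_PARAMS = 7.24e9  # assumed parameter count, for illustration


def weight_gb(n_params: float, bits_per_param: int) -> float:
    """Size of the weights alone in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: ~{weight_gb(N_PARAMS, bits):.2f} GB")
```

Under this assumption, going from 16-bit to 4-bit weights cuts the weight footprint to a quarter, which is what makes a 7B model practical on a single T4.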

In [1]:
!pip install huggingface_hub



In [2]:
from huggingface_hub import notebook_login

notebook_login()


In [3]:
!pip install git+https://github.com/run-llama/llama_index

Collecting git+https://github.com/run-llama/llama_index
  Cloning https://github.com/run-llama/llama_index to /tmp/pip-req-build-_umwzmtp
  Running command git clone --filter=blob:none --quiet https://github.com/run-llama/llama_index /tmp/pip-req-build-_umwzmtp
  Resolved https://github.com/run-llama/llama_index to commit e5b163daff3b9cfc3fe9396e8f48e1fced66f211
  Running command git submodule update --init --recursive -q
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting dataclasses-json (from llama-index==0.9.46)
  Downloading dataclasses_json-0.6.4-py3-none-any.whl (28 kB)
Collecting deprecated>=1.2.9.3 (from llama-index==0.9.46)
  Downloading Deprecated-1.2.14-py2.py3-none-any.whl (9.6 kB)
Collecting dirtyjson<2.0.0,>=1.0.8 (from llama-index==0.9.46)
  Downloading dirtyjson-1.0.8-py3-none-any.whl (25 kB)
Collecting httpx (from llama-index==0.9.46)
  Dow

In [4]:
!pip install transformers accelerate bitsandbytes

Collecting accelerate
  Downloading accelerate-0.26.1-py3-none-any.whl (270 kB)
Collecting bitsandbytes
  Downloading bitsandbytes-0.42.0-py3-none-any.whl (105.0 MB)
Installing collected packages: bitsandbytes, accelerate
Successfully installed accelerate-0.26.1 bitsandbytes-0.42.0


## Setup

### Data

In [5]:
from llama_index import download_loader

BeautifulSoupWebReader = download_loader("BeautifulSoupWebReader")

loader = BeautifulSoupWebReader()

# Site paths to scrape: top-level pages plus a selection of challenge pages.
challenges = [
    "challenges",
    "tracks",
    "showcase",
    "about",
    "challenges/1-starfield",
    "challenges/2-menger-sponge",
    "challenges/3-snake-game",
    "challenges/85-the-game-of-life",
    "challenges/21-mandelbrot-set-with-p5js",
    "challenges/22-julia-set",
    "challenges/168-the-mandelbulb",
    "challenges/178-climate-spiral",
    "challenges/179-wolfram-ca",
    "challenges/180-falling-sand",
]
urls = ["https://thecodingtrain.com"]
for challenge in challenges:
    urls.append(f"https://thecodingtrain.com/{challenge}")


#documents = loader.load_data(urls=['https://thecodingtrain.com/'])

documents = loader.load_data(urls)

[nltk_data] Downloading package stopwords to
[nltk_data]     /usr/local/lib/python3.10/dist-
[nltk_data]     packages/llama_index/_static/nltk_cache...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to
[nltk_data]     /usr/local/lib/python3.10/dist-
[nltk_data]     packages/llama_index/_static/nltk_cache...
[nltk_data]   Unzipping tokenizers/punkt.zip.


### LLM

This should run on a free-tier T4 instance.

In [6]:
import torch
from transformers import BitsAndBytesConfig
from llama_index.prompts import PromptTemplate
from llama_index.llms import HuggingFaceLLM

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)


llm = HuggingFaceLLM(
    model_name="mistralai/Mistral-7B-Instruct-v0.2",
    tokenizer_name="mistralai/Mistral-7B-Instruct-v0.2",
    query_wrapper_prompt=PromptTemplate("<s>[INST] {query_str} [/INST] </s>\n"),
    context_window=3900,
    max_new_tokens=256,
    model_kwargs={"quantization_config": quantization_config},
    # tokenizer_kwargs={},
    generate_kwargs={"temperature": 0.2, "top_k": 5, "top_p": 0.95},
    device_map="auto",
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


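As a quick illustration of what `query_wrapper_prompt` does, the sketch below substitutes a query into the same template string using plain `str.format`. This is a stand-in for the idea, not the LlamaIndex code path. It also shows the rough token budget left for the prompt, assuming the generation budget (`max_new_tokens`) is reserved out of `context_window`.

```python
# Stand-in for the template substitution: plain str.format, not LlamaIndex itself.
WRAPPER = "<s>[INST] {query_str} [/INST] </s>\n"  # copied from the cell above
CONTEXT_WINDOW = 3900   # copied from the cell above
MAX_NEW_TOKENS = 256    # copied from the cell above


def wrap_query(query_str: str, template: str = WRAPPER) -> str:
    """Substitute the user query into the instruction template."""
    return template.format(query_str=query_str)


# Tokens left for instructions + retrieved context + query,
# ASSUMING the generation budget is reserved out of the context window.
prompt_budget = CONTEXT_WINDOW - MAX_NEW_TOKENS

wrapped = wrap_query("What is the Coding Train?")
print(repr(wrapped))
print(prompt_budget)
```

Printing the wrapped string with `repr` makes the special tokens and the trailing newline visible, which helps when debugging a prompt format.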

The embedding model is [BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5), from the [FlagEmbedding](https://github.com/FlagOpen/FlagEmbedding) project.

In [7]:
from llama_index import ServiceContext

service_context = ServiceContext.from_defaults(llm=llm, embed_model="local:BAAI/bge-small-en-v1.5")


### Index Setup

In [12]:
from llama_index import VectorStoreIndex

vector_index = VectorStoreIndex.from_documents(documents, service_context=service_context)

In [13]:
from llama_index import SummaryIndex

summary_index = SummaryIndex.from_documents(documents, service_context=service_context)
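The two indexes differ mainly in what they hand to the LLM at query time. Here is a toy sketch of that difference: a vector-style retriever ranks nodes by cosine similarity and keeps the top few, while a summary-style retriever returns every node. The node names, the 3-dimensional embeddings, and `top_k=2` are all made up for illustration; this is not the LlamaIndex implementation.

```python
import math

# Toy node embeddings (made-up 3-d vectors; real embeddings are much larger).
NODES = {
    "about page": [0.9, 0.1, 0.0],
    "falling sand challenge": [0.1, 0.9, 0.2],
    "wolfram ca challenge": [0.0, 0.8, 0.6],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def vector_retrieve(query_vec, nodes=NODES, top_k=2):
    """Vector-index style: keep only the top_k most similar nodes."""
    ranked = sorted(nodes, key=lambda name: cosine(query_vec, nodes[name]), reverse=True)
    return ranked[:top_k]


def summary_retrieve(nodes=NODES):
    """Summary-index style: return every node, in insertion order."""
    return list(nodes)


query_vec = [0.0, 1.0, 0.3]  # made-up embedding of a query about cellular simulations
print(vector_retrieve(query_vec))
print(summary_retrieve())
```

This is why the summary index suits whole-document questions but gets expensive on large collections, while the vector index is cheap but can miss context that was not among the top matches.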

### Helpful Imports / Logging

In [16]:
from llama_index.response.notebook_utils import display_response

In [14]:
import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

## Basic Query Engine

### Compact (default)

In [17]:
query_engine = vector_index.as_query_engine(response_mode="compact")

response = query_engine.query("What is the Coding Train?")

display_response(response)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


**`Final Response:`** The Coding Train is a community dedicated to learning creative coding with beginner-friendly tutorials and projects on YouTube and more. It was started by Dan Shiffman in 2015 and is a welcoming space for beginner programmers and code-curious individuals to try their hand at expressing themselves with code. The Coding Train provides online educational content through sequenced and one-off video tutorials, live streaming events, and a Discord community where individuals can get help with their code from the Station Managers. The community also features a Passenger Showcase where individuals can share their work inspired by The Coding Train and have it featured on the site. The Coding Train also offers various ways to support the community, including becoming a YouTube Member, Patreon Supporter, or GitHub Sponsor.
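Roughly speaking, `compact` mode packs the retrieved chunks into as few prompts as will fit, so that fewer LLM calls are needed. The sketch below shows that packing step only. It measures size in characters for simplicity, which is an assumption; a real implementation would count tokens against the context window.

```python
def compact_chunks(chunks, max_chars):
    """Greedily concatenate chunks so each packed block stays within max_chars.

    A chunk that is larger than max_chars on its own still gets a block to itself.
    """
    packed, current = [], ""
    for chunk in chunks:
        candidate = chunk if not current else current + "\n\n" + chunk
        if not current or len(candidate) <= max_chars:
            current = candidate
        else:
            packed.append(current)
            current = chunk
    if current:
        packed.append(current)
    return packed


chunks = ["a" * 40, "b" * 40, "c" * 40]
packed = compact_chunks(chunks, max_chars=100)
print(len(chunks), "chunks packed into", len(packed), "prompts")
```

Here three chunks fit into two prompts, so a pass over the packed blocks needs two LLM calls instead of three.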

### Refine

In [18]:
query_engine = vector_index.as_query_engine(response_mode="refine")

response = query_engine.query("who is Daniel Shiffman?")

display_response(response)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


**`Final Response:`** Daniel Shiffman is the founder and curator of The Coding Train, a community dedicated to learning creative coding through beginner-friendly tutorials and projects on YouTube and more. He is a Professor of the Practice of Computer Science at the Institute for Advanced Study and a Visiting Artist at the Princeton University Program in Art and Archaeology. Daniel is known for his work in the processing programming language and has created a large number of tutorials and projects using it. He also takes on coding challenges in p5.js and Processing, covering topics such as algorithmic art, machine learning, simulation, and generative poetry. Daniel is an approachable and passionate teacher who loves to share his knowledge in a fun and engaging way. He is also involved in various educational programs, including the Interactive Telecommunications Program at NYU's Tisch School of the Arts and The Processing Foundation. Daniel is an active member of the coding community and can be found on social media platforms such as Twitter and GitHub. He also has a website, thecodingtrain.com, which provides educational resources and a community for coders.
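The idea behind `refine` is to visit the retrieved chunks one at a time: the first chunk produces a draft answer, and each later chunk is shown to the LLM together with the current draft so it can be updated. That would be consistent with the two generation warnings printed above, one per chunk. The prompt wording and the `fake_llm` stand-in below are hypothetical, written only to make the loop visible.

```python
def refine_answer(query, chunks, llm):
    """Build an answer incrementally, one chunk (and one LLM call) at a time."""
    answer = None
    for chunk in chunks:
        if answer is None:
            # First chunk: plain question answering.
            prompt = f"Context:\n{chunk}\n\nQuestion: {query}\nAnswer:"
        else:
            # Later chunks: ask the model to improve the existing draft.
            prompt = (
                f"Question: {query}\n"
                f"Existing answer: {answer}\n"
                f"New context:\n{chunk}\n"
                "Refine the existing answer if the new context helps."
            )
        answer = llm(prompt)
    return answer


calls = []  # records every prompt sent to the stand-in model


def fake_llm(prompt):
    """Hypothetical stand-in for the real model: returns a numbered draft."""
    calls.append(prompt)
    return f"draft {len(calls)}"


final = refine_answer("Who is Daniel Shiffman?", ["chunk A", "chunk B"], fake_llm)
print(final, "after", len(calls), "LLM calls")
```

Because every chunk costs a sequential LLM call, `refine` tends to be the slowest mode on a local model, but it lets each chunk be considered in full.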

In [19]:
query_engine = vector_index.as_query_engine(response_mode="refine")

response = query_engine.query("Looking at Challenges/Wolphram CA, what are some passenger showcases?")

display_response(response)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


**`Final Response:`** Based on the context provided, the "Wolphram CA" mentioned in the query likely refers to the Wolfram Cloud Alpha (CA), a computational knowledge engine from Wolfram Research. The Coding Train Showcase page features various projects created by viewers using Wolfram Language and other tools. Some of these projects that involve Wolfram CA include:

* "Wolfram CA Sandfall" by John Afolayan
* "Wolfram CA Rainbow colored falling dots" by Amit Sheen
* "Wolfram CA Cellular Automata with control form" by Patrick McTighe
* "Wolfram CA Rules Switching Infinite Canvas" by Esprit Orgue
* "Wolfram CA Pseudo-Islamic tiling" by Kathy McGuiness
* "Wolfram CA 3d" by Matthew Millar
* "Wolfram CA high-def scrolling" by Matthew Millar
* "Wolfram CA dog stars" by Panna
* "Wolfram CA particle systems with joystick" by alpaslan özdemir
* "Wolfr

### Tree Summarize

In [20]:
query_engine = vector_index.as_query_engine(response_mode="tree_summarize")

response = query_engine.query("Looking at the Passenger Showcase, which challenge has the most showcases")

display_response(response)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


**`Final Response:`** Based on the information provided in the context, it appears that there are a total of 32 showcases listed on the Passenger Showcase page. However, the challenge that has the most showcases associated with it is not explicitly stated in the text. Therefore, it is not possible to definitively answer the query with the given information alone.

However, we can see that some challenges have multiple showcases associated with them, such as "Falling Sand" and "Wolfram CA", which each have multiple showcases listed. So, if we were to make an educated guess based on the information provided, we could assume that these challenges may have had a greater impact or appeal to the community, leading to more submissions.

But again, this is just an assumption and not a definitive answer based on the information provided.

## Router Query Engine

In [21]:
from llama_index.tools import QueryEngineTool, ToolMetadata

vector_tool = QueryEngineTool(
    vector_index.as_query_engine(),
    metadata=ToolMetadata(
        name="vector_search",
        description="Useful for searching for specific facts."
    )
)

summary_tool = QueryEngineTool(
    summary_index.as_query_engine(response_mode="tree_summarize"),
    metadata=ToolMetadata(
        name="summary",
        description="Useful for summarizing an entire document."
    )
)

### Single Selector

In [22]:
from llama_index.query_engine import RouterQueryEngine

query_engine = RouterQueryEngine.from_defaults(
    [vector_tool, summary_tool],
    service_context=service_context,
    select_multi=False
)

response = query_engine.query("How can I contribute to the Coding Train?")

display_response(response)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


**`Final Response:`** There are several ways to contribute to the Coding Train community:

1. Submit your work: Share what you've created inspired by the Coding Train by submitting it to the Passenger Showcase. You can learn how to submit your work on the website.
2. Join the Discord: Connect with the community and get help with your code from the Station Managers by joining the Coding Train Discord.
3. Support the Coding Train: You can support the Coding Train financially by becoming a YouTube Member, Patreon Supporter, or GitHub Sponsor. These contributions help keep the community running and provide perks for supporters.
4. Contribute content: If you have expertise in a particular area of coding or design, you can contribute content to the Coding Train by creating tutorials, challenges, or other resources for the community.
5. Collaborate on projects: You can collaborate on projects with other members of the community or contribute to open-source projects related to the Coding Train.

For more information about contributing, check out the "Contribute" section on the Coding Train website.

### Multi Selector

In [25]:
from llama_index.query_engine import RouterQueryEngine

query_engine = RouterQueryEngine.from_defaults(
    [vector_tool, summary_tool],
    service_context=service_context,
    select_multi=True,
)

response = query_engine.query("Looking at \"topics\" list for the Coding challenges, which challenge is similar to Falling Sand?")

display_response(response)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


**`Final Response:`** Based on the context information provided, challenge #179 - "Wolfram CA" might be similar to the "Falling Sand" challenge as both involve coding simulations using p5.js. However, while "Falling Sand" focuses on creating a falling sand simulation using a grid of pixels and simple rules, "Wolfram CA" is about coding a visualization of the Wolfram Elementary Cellular Automaton. Although they are different simulations, they share the commonality of using p5.js for implementation.
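To make the difference between the two selector settings concrete, here is a toy router. In the notebook the LLM reads the tool descriptions and chooses; in this sketch a crude keyword-overlap score stands in for that decision, which is purely an assumption for illustration. With `select_multi=False` exactly one tool is returned; with `select_multi=True` every tool with a non-zero score is returned.

```python
import re

# Tool names and descriptions copied from the cells above.
TOOLS = {
    "vector_search": "Useful for searching for specific facts.",
    "summary": "Useful for summarizing an entire document.",
}


def _words(text):
    """Lowercased set of alphabetic words in text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def route(query, tools=TOOLS, select_multi=False):
    """Pick tool name(s) by keyword overlap (a stand-in for the LLM selector)."""
    scores = {name: len(_words(query) & _words(desc)) for name, desc in tools.items()}
    best = max(scores, key=scores.get)
    if not select_multi:
        return [best]
    chosen = [name for name, score in scores.items() if score > 0]
    return chosen or [best]


print(route("Give me specific facts about challenge 179"))
print(route("Summarizing the entire site: which specific facts stand out?", select_multi=True))
```

When several tools are selected, each one answers the query and the answers are then combined into a single response, so the multi selector costs more LLM calls.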

## SubQuestion Query Engine

In [26]:
from llama_index.tools import QueryEngineTool, ToolMetadata

vector_tool = QueryEngineTool(
    vector_index.as_query_engine(),
    metadata=ToolMetadata(
        name="vector_search",
        description="Useful for searching for specific facts."
    )
)

summary_tool = QueryEngineTool(
    summary_index.as_query_engine(response_mode="tree_summarize"),
    metadata=ToolMetadata(
        name="summary",
        description="Useful for summarizing an entire document."
    )
)

In [32]:
import nest_asyncio
nest_asyncio.apply()

In [33]:
from llama_index.query_engine import SubQuestionQueryEngine

query_engine = SubQuestionQueryEngine.from_defaults(
    [vector_tool, summary_tool],
    service_context=service_context,
    verbose=True,
)

response = query_engine.query("What is the Coding Train? Who is Daniel Shiffman?")

display_response(response)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Generated 2 sub questions.
[vector_search] Q: Who is Daniel Shiffman?

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


[vector_search] A: Daniel Shiffman is the founder of The Coding Train community and the host of its YouTube channel. He is a programmer, artist, and educator who teaches creative coding with beginner-friendly tutorials and projects. Daniel has been making videos since 2012 and launched the Coding Train YouTube channel in 2015. He also teaches at the Interactive Telecommunications Program at NYU's Tisch School of the Arts and serves on the Board of Directors of The Processing Foundation. Daniel is passionate about coding and enjoys sharing his knowledge in a fun and approachable way. He also loves music, playing rubik's cubes, and going running. Dan's social media handles are on Twitter and GitHub.
[vector_search] Q: What is the Coding Train?

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


[vector_search] A: The Coding Train is a community dedicated to learning creative coding with beginner-friendly tutorials and projects on YouTube and more. It was started by Dan Shiffman in 2015 and is a welcoming space for beginner programmers and code-curious individuals to try their hand at expressing themselves with code. The Coding Train provides online educational content through sequenced and one-off video tutorials, live streaming events, and a Discord community where individuals can get help with their code from the Station Managers. The community also features a Passenger Showcase where individuals can share their work inspired by The Coding Train and have it featured on the site. The Coding Train also offers various ways to support the community, including becoming a YouTube Member, Patreon Supporter, or GitHub Sponsor.

**`Final Response:`** The Coding Train is a community founded by Daniel Shiffman that is dedicated to learning creative coding through beginner-friendly tutorials and projects on YouTube and other platforms. Daniel Shiffman is the host of the Coding Train YouTube channel and a programmer, artist, and educator. He has been making videos since 2012 and launched the Coding Train channel in 2015. Daniel also teaches at the Interactive Telecommunications Program at NYU's Tisch School of the Arts and serves on the Board of Directors of The Processing Foundation. He is passionate about coding and enjoys sharing his knowledge in a fun and approachable way. The Coding Train community provides educational content through video tutorials, live streaming events, and a Discord community, and offers various ways to support the community.
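The flow visible in the verbose log above (generate sub-questions, answer each one, then synthesize) can be sketched in a few lines. In the notebook the LLM writes the sub-questions and picks a tool for each; here a naive split on `?` and two hypothetical stand-in functions take their place, so treat this as an outline of the control flow only.

```python
def decompose(query):
    """Naive stand-in for LLM question generation: split on question marks."""
    return [part.strip() + "?" for part in query.split("?") if part.strip()]


def sub_question_query(query, answer_fn, synthesize_fn):
    """Answer each sub-question separately, then combine the results."""
    sub_questions = decompose(query)
    qa_pairs = [(q, answer_fn(q)) for q in sub_questions]
    return synthesize_fn(query, qa_pairs)


def fake_answer(question):
    """Hypothetical stand-in for a query engine tool."""
    return f"answer to: {question}"


def fake_synthesize(query, qa_pairs):
    """Hypothetical stand-in for the final LLM synthesis step."""
    return " | ".join(answer for _, answer in qa_pairs)


query = "What is the Coding Train? Who is Daniel Shiffman?"
print(decompose(query))
print(sub_question_query(query, fake_answer, fake_synthesize))
```

Each sub-question adds at least one LLM call on top of the generation and synthesis steps, so compound questions are noticeably slower on a local model.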