In [None]:
# A Quick Exploration of Scraping and Querying a Website with LlamaIndex and a Hugging Face LLM

In [9]:
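# install LlamaIndex packages and other needed libraries
# (llama-index-llms-huggingface provides the llama_index.llms.huggingface import used below;
#  huggingface_hub provides login)
!pip install llama-index llama-index-readers-web llama-index-llms-huggingface huggingface_hub html2text IPython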



In [11]:
# import libraries (only the names used in this notebook)
from llama_index.core import SummaryIndex
from llama_index.readers.web import SimpleWebPageReader

In [13]:
# authenticate with the Hugging Face Hub
# (requires a Hugging Face account and an access token)
from getpass import getpass
from huggingface_hub import login

# Prompt to securely enter the Hugging Face token
HF_TOKEN = getpass("Enter your Hugging Face token: ")

# Log in using the token
login(token=HF_TOKEN)

Enter your Hugging Face token:  ········


The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: fineGrained).
Your token has been saved to /home/ijmg/.cache/huggingface/token
Login successful
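
The login cell above prompts for the token on every run. A minimal sketch of an alternative, assuming the token may already be stored in an environment variable, is shown here. The helper name `get_hf_token` and the default variable name `HF_TOKEN` are illustrative choices, not part of the original notebook.

```python
import os
from getpass import getpass


def get_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Return the token from the environment, or prompt for it if unset.

    `env_var` is an assumed variable name; adjust it to your own setup.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        # Fall back to the same hidden prompt the notebook already uses
        token = getpass("Enter your Hugging Face token: ").strip()
    return token


# Usage in the notebook (not executed here):
# HF_TOKEN = get_hf_token()
# login(token=HF_TOKEN)
```

Either way, the token stays out of the notebook source.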


In [119]:
# load url of website to be scraped and analyzed by llm
documents = SimpleWebPageReader(html_to_text=True).load_data(
    ["https://en.wikipedia.org/wiki/Twenty_Thousand_Leagues_Under_the_Seas"]
)
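
With `html_to_text=True`, the reader converts the fetched HTML into plain text before indexing, which is why `html2text` is installed above. The sketch below illustrates the general idea with the standard library only. It is not the conversion the reader performs, and `TextExtractor` and `html_to_plain_text` are hypothetical names.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_plain_text(html: str) -> str:
    """Return the text fragments of an HTML string joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)
```

Real pages also carry navigation, reference lists, and "adaptations" sections, all of which end up in the text the LLM sees.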

In [121]:
# build a SummaryIndex over the scraped page
index = SummaryIndex.from_documents(documents)
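
A `SummaryIndex` keeps the document's chunks (nodes) in a simple list and, by default, passes all of them to the LLM at query time rather than selecting the most relevant ones. The toy class below sketches that behaviour under simplifying assumptions (word-based chunks, no LLM call). `ToyListIndex` is a hypothetical stand-in, not LlamaIndex code.

```python
from dataclasses import dataclass, field


@dataclass
class ToyListIndex:
    """Toy stand-in for a list-style index (not LlamaIndex's implementation)."""

    nodes: list = field(default_factory=list)

    @classmethod
    def from_text(cls, text: str, chunk_size: int = 200) -> "ToyListIndex":
        # Split the text into chunks of at most `chunk_size` words
        words = text.split()
        chunks = [
            " ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)
        ]
        return cls(nodes=chunks)

    def retrieve(self, question: str) -> list:
        # The question is ignored: every node is returned, in order
        return list(self.nodes)
```

Because every chunk of the page is considered, answers may pick up material from unrelated sections, which may explain why some responses below mention film adaptations.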

In [123]:
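# create the LLM client (Hugging Face Inference API, Mixtral-8x7B-Instruct)
from llama_index.llms.huggingface import HuggingFaceInferenceAPI
llm = HuggingFaceInferenceAPI(model_name="mistralai/Mixtral-8x7B-Instruct-v0.1", token=HF_TOKEN)
# note: displaying the object shows its token field in plain text; redact it before sharing output
llm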

HuggingFaceInferenceAPI(callback_manager=<llama_index.core.callbacks.base.CallbackManager object at 0x7218fd0e8bd0>, system_prompt=None, messages_to_prompt=<function messages_to_prompt at 0x7219b7550c20>, completion_to_prompt=<function default_completion_to_prompt at 0x7219b74800e0>, output_parser=None, pydantic_program_mode=<PydanticProgramMode.DEFAULT: 'default'>, query_wrapper_prompt=None, model_name='mistralai/Mixtral-8x7B-Instruct-v0.1', token='hf_<redacted>', timeout=None, headers=None, cookies=None, task=None, context_window=3900, num_output=256, is_chat_model=False, is_function_calling_model=False)
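
The object's repr includes the `token` field, so echoing `llm` in a shared notebook can expose the credential. A minimal sketch of masking it before printing follows, assuming tokens start with `hf_` followed by letters and digits. `redact_tokens` is a hypothetical helper.

```python
import re

# Assumed token shape: "hf_" followed by one or more letters or digits
TOKEN_PATTERN = re.compile(r"hf_[A-Za-z0-9]+")


def redact_tokens(text: str, placeholder: str = "hf_<redacted>") -> str:
    """Replace anything that looks like a Hugging Face token with a placeholder."""
    return TOKEN_PATTERN.sub(placeholder, text)


# Usage in the notebook (not executed here):
# print(redact_tokens(repr(llm)))
```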

In [125]:
# set up the query engine from the indexed page and the Hugging Face LLM
query_engine = index.as_query_engine(llm=llm)
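
Each task in this notebook calls `query_engine.query(...)` with one question and prints the result. The sketch below shows how several questions could be run in one pass. `ask_all` is a hypothetical helper that assumes only that the engine has a `query` method whose result converts to text with `str()`.

```python
def ask_all(engine, questions):
    """Run each question through engine.query and return {question: answer text}."""
    return {q: str(engine.query(q)).strip() for q in questions}


# Usage in the notebook (not executed here):
# answers = ask_all(query_engine, [
#     "What is the Captain's name?",
#     "What is the name of the vessel in the story?",
# ])
# for question, answer in answers.items():
#     print(f"Q: {question}\nA: {answer}\n")
```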

In [25]:
# Task #1: Use huggingface llm to query website contents
response = query_engine.query("What is the Captain's name?")
print(response)


Captain Nemo, whose real name is Prince Dakkar, is a character in the 2012 film "Journey 2: The Mysterious Island". He is an Indian freedom fighter who leads a life of solitude beneath the seas and is the commander of the Nautilus, an advanced submarine. He is portrayed as a radical environmentalist in the film, taking action against the consequences of the destruction of the Earth. He has established a ZAD (Zone A Défendre) in the abysses of the seas, according to the film. Captain Nemo is also a character in Jules Verne's novel "Twenty Thousand Leagues Under the Sea" (1870). In the novel, he is a mysterious and complex character, who is both a scientific genius and a vengeful enemy of society. He is commander of the Nautilus, which uses to travel the seas and explore the depths of the ocean. His motives for his actions are unclear, and he remains an enigmatic figure throughout the novel.

Captain Nemo is a fictional character in Jules Verne's novel "Twenty Thousand
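
The answer above stops mid-sentence, which is consistent with the `num_output=256` limit shown in the client's repr. A minimal sketch for detecting and trimming such cut-off text follows. Both helper names are hypothetical, and the sentence-boundary rule is a rough heuristic that abbreviations such as "Mr." will fool.

```python
import re

# A sentence end: ., ! or ? plus optional closing quotes/brackets,
# followed by whitespace or the end of the text
_SENTENCE_END = re.compile(r"""[.!?]["')\]]*(?=\s|$)""")


def looks_truncated(text: str) -> bool:
    """Return True if non-empty text does not end at a sentence boundary."""
    stripped = text.rstrip()
    if not stripped:
        return False
    last = None
    for last in _SENTENCE_END.finditer(stripped):
        pass
    return last is None or last.end() != len(stripped)


def trim_to_last_sentence(text: str) -> str:
    """Cut the text after its last complete sentence, if there is one."""
    stripped = text.rstrip()
    last = None
    for last in _SENTENCE_END.finditer(stripped):
        pass
    return stripped if last is None else stripped[:last.end()]
```

Trimming only hides the cut. Getting the full answer would require a larger output limit on the client.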


In [27]:
# Task #2: Use huggingface llm to query website contents
response = query_engine.query("What is the name of the vessel in the story?")
print(response)


The name of the vessel in the story is the Nautilus.


In [127]:
# Task #3: Use huggingface llm to query website contents
summary = query_engine.query("Which oceans or seas were visited in the story?")
# str() yields the response text; the response object has no `.text` attribute
summary_text = str(summary)
print(summary_text)



The story visits the Mediterranean Sea, Red Sea, Atlantic Ocean, Indian Ocean, China Sea, and the Arctic Ocean. The novel, _Twenty Thousand Leagues Under the Sea_ by Jules Verne, is the source of the story. The expedition sails around Crete during the Cretan Revolt of 1866-1869, encounters Indian pearl divers in the Indian Ocean, and visits the Arctic Ocean during the search for the lost Franklin expedition. Captain Nemo, the commander of the Nautilus, engages in the protection of marine life. The novel has been adapted into various films, TV series, games, and series, including _Mysterious Island_ (1961), _Mysterious Island_ (2005), _Journey 2: The Mysterious Island_ (2012), _Jules Verne's Mysterious Island_ (2012), _La isla misteriosa y el capitán Nemo_ (1973), _Captain Nemo_ (1975), _Mysterious Island_ (2012), and _2


In [29]:
# Task #4: Use huggingface llm to summarize website contents
summary = query_engine.query("Summarize the contents of the document")
# str() yields the response text; the response object has no `.text` attribute
summary_text = str(summary)
print(summary_text)



Twenty Thousand Leagues Under the Sea is a science fiction novel by French author Jules Verne, first published in 1870. The novel follows Captain Nemo and his submarine, the Nautilus, as they travel 20,000 leagues under the sea. The story is divided into three parts, each focusing on different aspects of the journey, wonders of the deep sea, technology, and freedom. The 1871 Chelebourg edition of the text served as the basis for the translation, and includes 42 illustrations, published for the first time with Hetzel's illustrations and includes notes and a new introduction. Despite criticisms of technical accuracy, the novel is widely regarded as a classic of science fiction and has been translated into numerous languages. Jules Verne is also known for other works such as The Mutineers of the Bounty and the novel has inspired numerous adaptations, including films and games.


In [129]:
# Task #5: Use huggingface llm to query website contents
response = query_engine.query("What is the final resting place of the vessel in the story?")
print(response)



In the novel "The Mysterious Island" (1874-7) by Jules Verne, the Nautilus, the submarine captained by Captain Nemo, is scuttled by Nemo himself in the lagoon of the island where the submarine had been hidden for some time. This event occurs after Nemo's death, when the protagonists discover the submarine and its secrets. Nemo had previously mentioned that the Nautilus was his home and his grave, and he fulfills this prophecy by destroying it before his own death.

The Nautilus is scuttled in the lagoon of the island, which is located in the Pacific Ocean, near the volcanic island of Tabor. The exact coordinates are not provided in the novel, but it is suggested that the island is near the Line Islands, which are part of Kiribati.

The Nautilus is scuttled in the lagoon of the island, which is located in the Pacific Ocean, near the volcanic island of Tabor. The exact coordinates are not provided in the novel, but it is suggested that the island is near the Line Islands,
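
The last response repeats a paragraph and then cuts off partway through the repeat. A minimal sketch for dropping such repeats before display follows, assuming paragraphs are separated by blank lines. `drop_repeated_paragraphs` is a hypothetical helper, and its prefix rule can also discard a short paragraph that legitimately begins like an earlier one.

```python
def drop_repeated_paragraphs(text: str) -> str:
    """Remove paragraphs that repeat, or are a cut-off prefix of, an earlier one."""
    kept = []       # paragraphs to keep, in original order
    kept_keys = []  # whitespace-normalised versions used for comparison
    for para in text.split("\n\n"):
        key = " ".join(para.split())
        if not key:
            continue
        # Skip exact repeats and truncated repeats of an earlier paragraph
        if any(earlier.startswith(key) for earlier in kept_keys):
            continue
        kept_keys.append(key)
        kept.append(para.strip())
    return "\n\n".join(kept)


# Usage in the notebook (not executed here):
# print(drop_repeated_paragraphs(str(response)))
```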
