In [26]:
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.postprocessor import SimilarityPostprocessor
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

In [4]:
Settings.embed_model = HuggingFaceEmbedding('BAAI/bge-small-en-v1.5')

Settings.llm = None

Settings.chunk_size = 256

Settings.chunk_overlap = 25



LLM is explicitly disabled. Using MockLLM.


In [17]:
documents = SimpleDirectoryReader('articles').load_data()

In [18]:
print(len(documents))

71


In [20]:
for doc in documents:
    if 'Member-only story' in doc.text:
        documents.remove(doc)
        continue
    if 'The Data Entrepreneurs' in doc.text:
        documents.remove(doc)
    if 'min read' in doc.text:
        documents.remove(doc)

print(len(documents))

61


In [50]:
index = VectorStoreIndex.from_documents(documents)

### Setting up retriver

In [32]:
top_k = 3

retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=top_k,
)

query_engine = RetrieverQueryEngine(
    retriever=retriever, node_postprocessors=[SimilarityPostprocessor(similarity_cutoff=0.5)]
)

In [101]:
query = 'What is fat-tailedness?'
response = query_engine.query(query)

In [102]:
context = 'Context:\n'

for i in range(top_k):
    context = context + response.source_nodes[i].text + '\n\n'

print(context)

Context:
Some of the controversy might be explained by the observation that log-
normal distributions behave like Gaussian for low sigma and like Power Law
at high sigma [2].
However, to avoid controversy, we can depart (for now) from whether some
given data fits a Power Law or not and focus instead on fat tails.
Fat-tailedness — measuring the space between Mediocristan
and Extremistan
Fat Tails are a more general idea than Pareto and Power Law distributions.
One way we can think about it is that “fat-tailedness” is the degree to which
rare events drive the aggregate statistics of a distribution. From this point of
view, fat-tailedness lives on a spectrum from not fat-tailed (i.e. a Gaussian) to
very fat-tailed (i.e. Pareto 80 – 20).
This maps directly to the idea of Mediocristan vs Extremistan discussed
earlier.

print("mean kappa_1n = " + str(np.mean(kappa_dict[filename])))
    print("")
Mean κ (1,100) values from 1000 runs for each dataset. Image by author.
These more stable results

In [53]:
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = 'TheBloke/Mistral-7B-Instruct-v0.2-GPTQ'

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map='auto',
    trust_remote_code=False,
    revision='main',
)

config = PeftConfig.from_pretrained('shawhin/shawgpt-ft')

model = PeftModel.from_pretrained(model, 'shawhin/shawgpt-ft')

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)



In [109]:
instructions_string = f"""ShawGPT, functioning as a virtual data science consultant on YouTube, communicates in clear, accessible language, escalating to technical depth upon request. It reacts to feedback aptly and ends responses with its signature '–ShawGPT'.

ShawGPT will tailor the length of its responses to match the viewer's comment, providing concise acknowledgments to brief expressions of gratitude or feedback, thus keeping the interaction natural and engaging.

Please respond to the following comment.
"""

prompt_template = lambda comment: f"""[INST]\n{instructions_string} \n{comment}\n[/INST]"""

comment = """Help me choose the correct answer and exaplain why

Compared to the C corporation, the limited liability company is an attractive form of business ownership because:

a.
once formed, the limited liability company does not require the firm to hold annual meetings, and has the option to avoid double taxation.

b.
even though it is a little more expensive to form, it has a longer life than the C corporation.

c.
 C corporation permits one owner to own all the stock of the company, whereas a limited liability company requires several owners.

d.
once formed, the limited liability company is a legal form of business ownership, worldwide, whereas the C corporation must file for corporate status in each nation it elects to do business."""

prompt = prompt_template(comment)

print(prompt)

[INST]
ShawGPT, functioning as a virtual data science consultant on YouTube, communicates in clear, accessible language, escalating to technical depth upon request. It reacts to feedback aptly and ends responses with its signature '–ShawGPT'.

ShawGPT will tailor the length of its responses to match the viewer's comment, providing concise acknowledgments to brief expressions of gratitude or feedback, thus keeping the interaction natural and engaging.

Please respond to the following comment.
 
Help me choose the correct answer and exaplain why

Compared to the C corporation, the limited liability company is an attractive form of business ownership because:

a.
once formed, the limited liability company does not require the firm to hold annual meetings, and has the option to avoid double taxation.

b.
even though it is a little more expensive to form, it has a longer life than the C corporation.

c.
 C corporation permits one owner to own all the stock of the company, whereas a limited 

In [110]:
model.eval()

inputs = tokenizer(prompt, return_tensors='pt')
outputs = model.generate(
    input_ids=inputs['input_ids'].to('cuda'),
    max_new_tokens=280,
)

print(tokenizer.batch_decode(outputs)[0])

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


<s> [INST]
ShawGPT, functioning as a virtual data science consultant on YouTube, communicates in clear, accessible language, escalating to technical depth upon request. It reacts to feedback aptly and ends responses with its signature '–ShawGPT'.

ShawGPT will tailor the length of its responses to match the viewer's comment, providing concise acknowledgments to brief expressions of gratitude or feedback, thus keeping the interaction natural and engaging.

Please respond to the following comment.
 
Help me choose the correct answer and exaplain why

Compared to the C corporation, the limited liability company is an attractive form of business ownership because:

a.
once formed, the limited liability company does not require the firm to hold annual meetings, and has the option to avoid double taxation.

b.
even though it is a little more expensive to form, it has a longer life than the C corporation.

c.
 C corporation permits one owner to own all the stock of the company, whereas a limi

In [95]:
prompt_template_w_context = (
    lambda context,
    comment: f"""[INST]ShawGPT, functioning as a virtual data science consultant on YouTube, communicates in clear, accessible language, escalating to technical depth upon request. It reacts to feedback aptly and ends responses with its signature '–ShawGPT'. ShawGPT will tailor the length of its responses to match the viewer's comment, providing concise acknowledgments to brief expressions of gratitude or feedback, thus keeping the interaction natural and engaging.
    
{context}

Please respond to the following comment. Use the context above if it is helpful.

{comment}
[/INST]
"""
)

In [96]:
prompt = prompt_template_w_context(context, comment)

inputs = tokenizer(prompt, return_tensors='pt')
outputs = model.generate(input_ids=inputs['input_ids'].to('cuda'), max_new_tokens=280)

print(tokenizer.batch_decode(outputs)[0])

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


<s> [INST]ShawGPT, functioning as a virtual data science consultant on YouTube, communicates in clear, accessible language, escalating to technical depth upon request. It reacts to feedback aptly and ends responses with its signature '–ShawGPT'. ShawGPT will tailor the length of its responses to match the viewer's comment, providing concise acknowledgments to brief expressions of gratitude or feedback, thus keeping the interaction natural and engaging.
    
Context:
Some of the controversy might be explained by the observation that log-
normal distributions behave like Gaussian for low sigma and like Power Law
at high sigma [2].
However, to avoid controversy, we can depart (for now) from whether some
given data fits a Power Law or not and focus instead on fat tails.
Fat-tailedness — measuring the space between Mediocristan
and Extremistan
Fat Tails are a more general idea than Pareto and Power Law distributions.
One way we can think about it is that “fat-tailedness” is the degree to wh

In [97]:
print(prompt)

[INST]ShawGPT, functioning as a virtual data science consultant on YouTube, communicates in clear, accessible language, escalating to technical depth upon request. It reacts to feedback aptly and ends responses with its signature '–ShawGPT'. ShawGPT will tailor the length of its responses to match the viewer's comment, providing concise acknowledgments to brief expressions of gratitude or feedback, thus keeping the interaction natural and engaging.
    
Context:
Some of the controversy might be explained by the observation that log-
normal distributions behave like Gaussian for low sigma and like Power Law
at high sigma [2].
However, to avoid controversy, we can depart (for now) from whether some
given data fits a Power Law or not and focus instead on fat tails.
Fat-tailedness — measuring the space between Mediocristan
and Extremistan
Fat Tails are a more general idea than Pareto and Power Law distributions.
One way we can think about it is that “fat-tailedness” is the degree to which
