# Wikipedia Generation

## Setup

In [89]:
import torch
import transformers

In [90]:
model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

Device set to use cuda:0


## Text Generation

In [122]:
import wikipedia_generation_prompts

# NOTE: `topic` is not defined in any cell shown here; it must be set before this cell runs.
messages = [
    {"role": "system", "content": wikipedia_generation_prompts.ONE_SHOT.format(topic=topic).strip()},
    {"role": "user", "content": wikipedia_generation_prompts.PROMPT.format(topic=topic).strip()},
]

prompt = pipeline.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = pipeline(
    prompt,
    max_new_tokens=1000,
)

In [128]:
generated = outputs[0]['generated_text'][len(prompt):]
print(generated)

Survey: Introduction to Data Structures

Introduction

Data Structures are an essential part of any software development process. They define how data is organized and accessed within a system. This survey aims to provide an overview of the key concepts surrounding data structures, including their history, key ideas, variations, and applications.

History

Data structures have a rich history, dating back to the early days of computer programming. The first data structures were based on classical algorithms such as sorting and searching, which were used to organize and access data within a system. The concept of a data structure was introduced by Alan Perlis in the late 1960s, and further developed by many researchers during the 1970s and 1980s.

One of the most important developments in data structures was the invention of the stack, which was used to implement priority queues. This was a key concept in the development of data structures for the early internet, as it allowed for effici

## Sentencizer

In [129]:
from spacy.lang.en import English

nlp = English()
nlp.add_pipe("sentencizer")
doc = nlp(generated)

In [150]:
sentences = [str(s) for s in doc.sents]
sentences[:5]

['Survey: Introduction to Data Structures\n\nIntroduction\n\nData Structures are an essential part of any software development process.',
 'They define how data is organized and accessed within a system.',
 'This survey aims to provide an overview of the key concepts surrounding data structures, including their history, key ideas, variations, and applications.',
 '\n\nHistory\n\nData structures have a rich history, dating back to the early days of computer programming.',
 'The first data structures were based on classical algorithms such as sorting and searching, which were used to organize and access data within a system.']
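The first and fourth entries above carry section headings ("Introduction", "History") fused onto body text, because the sentencizer only breaks on sentence punctuation. One possible workaround, sketched here under the assumption that headings are always separated from body text by blank lines, is to split the generation into blocks first. `split_blocks` and the sample string are hypothetical and not part of spaCy.

```python
import re


def split_blocks(text):
    """Split text on blank lines so headings and paragraphs become separate blocks."""
    return [block.strip() for block in re.split(r"\n\s*\n", text) if block.strip()]


# Shortened sample modelled on the generated text above
sample = (
    "Survey: Introduction to Data Structures\n\n"
    "Introduction\n\n"
    "Data Structures are an essential part of any software development process. "
    "They define how data is organized and accessed within a system.\n\n"
    "History\n\n"
    "Data structures have a rich history."
)

blocks = split_blocks(sample)
for block in blocks:
    print(repr(block))
```

Each block could then be sentencized on its own, for example `[str(s) for block in blocks for s in nlp(block).sents]`, so that heading lines end up as separate entries rather than prefixes of the first sentence.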

In [2]:
import json
# export to file for RARR
input_file = "generated_input.jsonl"
output_file = "output.jsonl"
claim_field = "claim"

In [None]:
with open(input_file, "w", encoding="utf-8") as writer:
    for sentence in sentences:
        writer.write(json.dumps({"input_info": {claim_field: sentence}}, ensure_ascii=False) + "\n")
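Before handing the file to RARR, it may be worth checking that it reads back cleanly, since some sentences contain embedded newlines. The following is a minimal round-trip sketch that reuses the record shape from the cell above; `write_claims`, `read_claims`, and the demo file name are hypothetical helpers introduced only for illustration.

```python
import json
import os
import tempfile


def write_claims(path, sentences, claim_field="claim"):
    """Write one JSON object per line, using the record shape from the cell above."""
    with open(path, "w", encoding="utf-8") as writer:
        for sentence in sentences:
            record = {"input_info": {claim_field: sentence}}
            writer.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_claims(path, claim_field="claim"):
    """Read the claims back, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as reader:
        return [json.loads(line)["input_info"][claim_field] for line in reader if line.strip()]


demo_sentences = [
    "They define how data is organized and accessed within a system.",
    "\n\nHistory\n\nData structures have a rich history.",
]

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_input.jsonl")
    write_claims(demo_path, demo_sentences)
    round_tripped = read_claims(demo_path)

# json.dumps escapes newlines, so each record stays on a single line
print(round_tripped == demo_sentences)
```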

# Attribution

## RARR

In [159]:
import sys
sys.path.append("./RARR/RARR")
import run_editor_sequential

In [165]:
from unittest.mock import patch
test_args = ["./RARR/RARR/run_editor_sequential.py",
             "--input_file", input_file,
             "--output_file", output_file,
             "--claim_field", claim_field,
             "--num_rounds_qgen", str(2)
]
with patch.object(sys, "argv", test_args):
    run_editor_sequential.main()

Device set to use cuda:0
The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
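The cell above runs a command-line entry point in-process by temporarily replacing `sys.argv`. The sketch below shows the same pattern in isolation. `fake_main` is a hypothetical stand-in built on `argparse`, not RARR's actual parser, and the file names are placeholders; only the flag names are taken from the cell above.

```python
import argparse
import sys
from unittest.mock import patch


def fake_main():
    """Hypothetical stand-in for a CLI entry point; not RARR's actual parser."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--input_file", required=True)
    parser.add_argument("--output_file", required=True)
    parser.add_argument("--claim_field", default="claim")
    parser.add_argument("--num_rounds_qgen", type=int, default=1)
    # With no explicit argument list, argparse reads sys.argv[1:] at call time
    return parser.parse_args()


original_argv = list(sys.argv)
demo_args = [
    "fake_script.py",
    "--input_file", "in.jsonl",
    "--output_file", "out.jsonl",
    "--num_rounds_qgen", str(2),
]

# sys.argv is swapped only inside the with-block and restored afterwards
with patch.object(sys, "argv", demo_args):
    parsed = fake_main()

print(parsed.num_rounds_qgen)
print(sys.argv == original_argv)
```

Because the patch is scoped to the `with` block, later cells in the same session see the original `sys.argv` again.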

## Piecing Back Together

In [3]:
with open(output_file, "r", encoding="utf-8") as file:
    attr = [json.loads(line) for line in file]

In [58]:
def markdown_citations(output_dict):
    # TODO: test if this works
    text_l, url_l = "text", "url"

    # Texts of the evidences that RARR selected for this claim
    selected_texts = {e[text_l] for e in output_dict["result"]["selected_evidences"]}
    evidences = output_dict["result"]["revisions"][0]["evidences"]

    citations = []
    distinct_urls = set()  # no duplicate websites
    for e in evidences:
        if e[text_l] in selected_texts and e[url_l] not in distinct_urls:
            citations.append({text_l: e[text_l], url_l: e[url_l]})
            distinct_urls.add(e[url_l])

    # Build each Markdown link outside the f-string expression: backslashes and
    # reused quotes inside f-string braces are a syntax error before Python 3.12.
    links = []
    for i, c in enumerate(citations):
        title = c[text_l].replace('"', '\\"')
        links.append(f'[[{i}]]({c[url_l]} "Cited text: {title}")')
    return "".join(links)

def add_citation_to_sentence(output_dict):
    sentence = output_dict["result"]["revisions"][0]["revised_text"]
    citations = markdown_citations(output_dict)
    return sentence + citations + " "

In [59]:
reconstructed_article = "".join([add_citation_to_sentence(a) for a in attr])

In [60]:
md_path = "md_output.md"
with open(md_path, "w", encoding="utf-8") as file:
    file.write(reconstructed_article)
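To make the citation format concrete, here is a small standalone sketch of the Markdown link that each citation is meant to become: a bracketed index linking to the evidence URL, with the cited passage stored in the link title. `format_citation` is a hypothetical helper, and the sample passage is invented for illustration rather than taken from RARR output.

```python
def format_citation(index, url, text):
    """Build a Markdown link of the form [[i]](url "Cited text: ...")."""
    title = text.replace('"', '\\"')  # escape double quotes inside the link title
    return f'[[{index}]]({url} "Cited text: {title}")'


link = format_citation(
    0,
    "https://en.wikipedia.org/wiki/Melting_point",
    'The "melting point" of a substance',  # invented sample passage
)
print(link)
```

Escaping the double quotes matters because the link title itself is delimited by double quotes; an unescaped quote in the cited passage would end the title early.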

# Extra

## RARR Playground

In [1]:
%load_ext autoreload
%autoreload 2

In [67]:
import importlib
import sys
sys.path.append("./RARR/RARR")
sys.path.append("./RARR/RARR/utils")
sys.path.append("./RARR/RARR/prompts")
import run_editor_sequential
import rarr_prompts
import stop_criteria
import search
import agreement_gate

In [3]:
import transformers
import torch

In [4]:
# Init pipeline
model_id = "TinyLlama/TinyLlama_v1.1"
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)

Device set to use cuda:0


In [85]:
claim = "Water is a solid at room temperature."
pipeline_bundle = stop_criteria.PipelineInfoBundle(pipeline, tokenizer)
hallucinate_evidence = False

result = run_editor_sequential.main()

Questions generated.


The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


Evidence found.


The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


Revisions completed.


In [86]:
result["revisions"][0]

{'original_text': 'Water is a solid at room temperature.',
 'revised_text': 'Water is a solid at room temperature.',
 'evidences': [{'text': 'The melting point (or, rarely, liquefaction point ) of a substance is the temperature at which it changes state from solid to liquid . At the melting point the solid and liquid phase exist in equilibrium . The melting point of a substance depends on pressure and is usually specified at a standard pressure such as 1 atmosphere or 100 kPa . When considered as the temperature of the reverse change from liquid to solid, it is referred to as the freezing point or crystallization point .',
   'url': 'https://en.wikipedia.org/wiki/Melting_point',
   'query': 'What is its melting point?',
   'sents_per_passage': 4,
   'retrieval_score': 9.515480995178223,
   'score': 0.5304679274559021},
  {'text': 'The boiling point of a liquid varies according to the applied pressure; the normal boiling point is the temperature at which the vapour pressure is equal to 

## Context Cite (Unused)

In [10]:
from context_cite import ContextCiter

hallucinated_context = outputs[0]['generated_text'] # DO NOT USE AS REAL CONTEXT
query = "What are some types of data structures?"
cc = ContextCiter.from_pretrained(model_id, hallucinated_context, query, device="cuda")

In [11]:
cc.response

'Sure, here are some types of data structures:\n\n1. Arrays: An array is a collection of objects or data elements, where each object occupies a specific location in memory.\n\n2. Sets: A set is a collection of unique elements.\n\n3. Maps: A map is a collection of keys (indices) and values.\n\n4. Stacks: A stack is a collection of items that can be added to or removed from the end.\n\n5. Queues: A queue is a collection of items that can be added to the front, and removed from the back.\n\n6. Hash Tables: A hash table is a data structure that maps keys to values.\n\n7. Graphs: Graphs are used to represent networks and to analyze their properties.\n\n8. Image Processing: Data structures are used to store and manipulate images.\n\n9. Web Technologies: Web technologies use data structures to store and retrieve user data.\n\nThese are just a few examples of the many types of data structures that exist. The specific data structures used in a particular application will depend on the specific 

In [12]:
cc.get_attributions(as_dataframe=True, top_k=5)

Attributed: Sure, here are some types of data structures:

1. Arrays: An array is a collection of objects or data elements, where each object occupies a specific location in memory.

2. Sets: A set is a collection of unique elements.

3. Maps: A map is a collection of keys (indices) and values.

4. Stacks: A stack is a collection of items that can be added to or removed from the end.

5. Queues: A queue is a collection of items that can be added to the front, and removed from the back.

6. Hash Tables: A hash table is a data structure that maps keys to values.

7. Graphs: Graphs are used to represent networks and to analyze their properties.

8. Image Processing: Data structures are used to store and manipulate images.

9. Web Technologies: Web technologies use data structures to store and retrieve user data.

These are just a few examples of the many types of data structures that exist. The specific data structures used in a particular application will depend on the specific requireme



| | Score | Source |
|---|---|---|
| 0 | 21.734 | - Graph Theory - Graphs are used to represent networks and to analyze their properties. |
| 1 | 16.587 | - Web Technologies - Web technologies use data structures to store and retrieve user data. |
| 2 | 16.557 | - Arrays - An array is a collection of objects or data elements, where each object occupies a specific location in memory. |
| 3 | 7.766 | - Stacks - A stack is a collection of items that can be added to or removed from the end. |
| 4 | 7.158 | - Maps - A map is a collection of keys (indices) and values. |


In [13]:
cc.response[457:634]

'emoved from the back.\n\n6. Hash Tables: A hash table is a data structure that maps keys to values.\n\n7. Graphs: Graphs are used to represent networks and to analyze their properti'
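The hard-coded offsets 457 and 634 cut the response mid-word ("emoved", "properti"). One way to avoid that, sketched here as an assumption rather than anything Context Cite provides, is to derive the offsets from the text of the span itself. `span_of` and `demo_response` are hypothetical; the resulting pair could be passed as `start_idx` and `end_idx`.

```python
def span_of(text, snippet):
    """Return (start, end) character offsets of the first occurrence of snippet in text."""
    start = text.find(snippet)
    if start == -1:
        raise ValueError("snippet not found in text")
    return start, start + len(snippet)


# Shortened stand-in for cc.response
demo_response = (
    "5. Queues: A queue is a collection of items.\n\n"
    "6. Hash Tables: A hash table is a data structure that maps keys to values.\n\n"
    "7. Graphs: Graphs are used to represent networks."
)

target = "6. Hash Tables: A hash table is a data structure that maps keys to values."
start_idx, end_idx = span_of(demo_response, target)
print(demo_response[start_idx:end_idx])
```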

In [14]:
cc.get_attributions(start_idx=457, end_idx=634, as_dataframe=True, top_k=5)

Attributed: removed from the back.

6. Hash Tables: A hash table is a data structure that maps keys to values.

7. Graphs: Graphs are used to represent networks and to analyze their properties




| | Score | Source |
|---|---|---|
| 0 | 20.081 | - Graph Theory - Graphs are used to represent networks and to analyze their properties. |
| 1 | 1.29 | - Queues - A queue is a collection of items that can be added to the front, and removed from the back. |
| 2 | 0.123 | - Arrays - An array is a collection of objects or data elements, where each object occupies a specific location in memory. |
| 3 | 0.114 | C++ had the concept of a data structure for arrays and pointers, while Java had the concept of a stack or a queue. |
| 4 | -0.0 | By understanding these concepts, developers can work with data structures to build efficient and reliable software systems. |
