# Using a remote deployment of Llama 3 to summarize Wikipedia articles

Original links
- https://github.com/tushitdave/Text_summarization/blob/main/Llama_2_Text_Summ.ipynb
- https://medium.com/@tushitdavergtu/llama2-and-text-summarization-e3eafb51fe28

In [1]:
!python -m pip install langchain groq

'python3' is not recognized as an internal or external command,
operable program or batch file.


## Helper functions for using remote Llama 3

In [2]:
# We use Groq instead of Replicate
import os

from groq import Groq

# Do not hard-code the API key in the notebook. Set GROQ_API_KEY in the
# environment before starting the kernel.
client = Groq(
    api_key=os.environ.get("GROQ_API_KEY"),
)

def _groq_chat(model, prompt, temperature, top_p):
    # Send a single-turn user prompt to the given Groq-hosted model
    # and return the text of the first choice.
    chat_completion = client.chat.completions.create(
        messages=[
            {
                "role": "user",
                "content": prompt,
            }
        ],
        model=model,
        temperature=temperature,
        top_p=top_p,
    )

    return chat_completion.choices[0].message.content

# input_print is unused; it is kept so existing callers keep working.
def llama2(prompt, temperature=0.0, top_p=0.9, input_print=True):
    return _groq_chat("llama2-70b-4096", prompt, temperature, top_p)

def llama3_8b(prompt, temperature=0.0, top_p=0.9, input_print=True):
    return _groq_chat("llama3-8b-8192", prompt, temperature, top_p)

def llama3_70b(prompt, temperature=0.0, top_p=0.9, input_print=True):
    return _groq_chat("llama3-70b-8192", prompt, temperature, top_p)
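
The helpers above call a hosted, rate-limited API, so a request can fail simply because too many were sent in a short time. One common way to cope is to retry with exponential backoff. The following is a minimal sketch, not part of the original notebook: `with_retries` and its parameters are hypothetical names, and catching a bare `Exception` is an assumption made to keep the example independent of any client library. In practice you would likely narrow it to the client's rate-limit error.

```python
import time
from typing import Callable


def with_retries(
    call: Callable[[], str],
    max_attempts: int = 4,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Run `call`, retrying with exponential backoff when it raises."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:  # assumption: any failure is worth retrying
            # After the last attempt, surface the original error.
            if attempt == max_attempts - 1:
                raise
            # Wait base_delay, 2 * base_delay, 4 * base_delay, ... seconds.
            sleep(base_delay * (2 ** attempt))
    raise AssertionError("loop always returns or raises")
```

With the notebook's helpers it could be used as `with_retries(lambda: llama3_8b(prompt))`. The `sleep` parameter exists only so the waiting can be replaced in tests.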

In [3]:
from typing import Dict, List
from langchain.memory import ChatMessageHistory
from langchain.schema.messages import get_buffer_string
# We use Groq instead of Replicate, so langchain.llms.Replicate is not imported


DEFAULT_MODEL = "llama3_8b"

def completion(
    prompt: str,
    model: str = DEFAULT_MODEL,
    temperature: float = 0.6,
    top_p: float = 0.9,
) -> str:
    if model == "llama2":
        return llama2(prompt = prompt, temperature = temperature, top_p = top_p)
    elif model == "llama3_8b":
        return llama3_8b(prompt = prompt, temperature = temperature, top_p = top_p)
    elif model == "llama3_70b":
        return llama3_70b(prompt = prompt, temperature = temperature, top_p = top_p)
    else:
        # Fail loudly instead of silently returning an empty completion
        raise ValueError(f"Unknown model: {model}")

def chat_completion(
    messages: List[Dict],
    model = DEFAULT_MODEL,
    temperature: float = 0.6,
    top_p: float = 0.9,
) -> str:
    history = ChatMessageHistory()
    for message in messages:
        if message["role"] == "user":
            history.add_user_message(message["content"])
        elif message["role"] == "assistant":
            history.add_ai_message(message["content"])
        else:
            raise Exception("Unknown role")
    return completion(
        get_buffer_string(
            history.messages,
            human_prefix="USER",
            ai_prefix="ASSISTANT",
        ),
        model,
        temperature,
        top_p,
    )

def assistant(content: str):
    return { "role": "assistant", "content": content }

def user(content: str):
    return { "role": "user", "content": content }

def complete_and_print(prompt: str, model: str = DEFAULT_MODEL):
    print(f'==============\n{prompt}\n==============')
    response = completion(prompt, model)
    print(response, end='\n\n')
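
`chat_completion` does not send a structured message list to the model. It flattens the history into one prompt string with `USER` and `ASSISTANT` prefixes and passes that to `completion`. The sketch below approximates that flattening in plain Python so the resulting prompt is easy to inspect. `flatten_messages` is a hypothetical helper written for illustration, not a LangChain function, and it assumes one `PREFIX: content` line per message.

```python
from typing import Dict, List


def flatten_messages(
    messages: List[Dict[str, str]],
    human_prefix: str = "USER",
    ai_prefix: str = "ASSISTANT",
) -> str:
    """Join chat messages into a single 'PREFIX: content' prompt string."""
    lines = []
    for message in messages:
        if message["role"] == "user":
            prefix = human_prefix
        elif message["role"] == "assistant":
            prefix = ai_prefix
        else:
            raise ValueError(f"Unknown role: {message['role']!r}")
        lines.append(f"{prefix}: {message['content']}")
    return "\n".join(lines)


prompt = flatten_messages([
    {"role": "user", "content": "What is the capital of Lebanon?"},
    {"role": "assistant", "content": "Beirut."},
    {"role": "user", "content": "Summarize its history in one sentence."},
])
print(prompt)
# USER: What is the capital of Lebanon?
# ASSISTANT: Beirut.
# USER: Summarize its history in one sentence.
```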



## Download wiki texts

In [4]:
from transformers import LongformerTokenizer
import requests
import re


tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")

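def fetch_and_save_wiki_text(title):
    # Fetch the plain-text extract of a Wikipedia article via the MediaWiki API.
    # Despite the name, nothing is saved to disk; the text is only returned.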
    response = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "format": "json",
            "titles": title,
            "prop": "extracts",
            "explaintext": True,
        },
    ).json()
    
    page = next(iter(response["query"]["pages"].values()))
    wiki_text = page["extract"]
    
    return wiki_text

def clean_text(text):
    # Keep only ASCII letters, digits, whitespace, periods, and brackets
    text = re.sub(r'[^A-Za-z0-9\s.\(\)\[\]\{\}]+', '', text)
    # Convert to lowercase
    text = text.lower()
    # Remove extra whitespace
    text = ' '.join(text.split())
    return text

def count_tokens(text):
    tokens = tokenizer.encode(text, add_special_tokens=True)
    return len(tokens)
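
`fetch_and_save_wiki_text` indexes `page["extract"]` directly, which raises a `KeyError` when the title does not exist, because the API then returns a page entry without an `extract` field. A more defensive way to read the response could look like the sketch below. `extract_page_text` is a hypothetical helper, and the two sample responses are hand-written stand-ins for real API output (the page id `123` is a placeholder).

```python
from typing import Any, Dict, Optional


def extract_page_text(api_response: Dict[str, Any]) -> Optional[str]:
    """Return the plain-text extract of the first page, or None if absent."""
    pages = api_response.get("query", {}).get("pages", {})
    if not pages:
        return None
    page = next(iter(pages.values()))
    # Missing titles come back as a page entry without an "extract" field.
    return page.get("extract")


# Hand-written sample responses (assumed shapes, not fetched from the API).
found = {
    "query": {"pages": {"123": {"title": "Doha", "extract": "Doha is the capital of Qatar."}}}
}
missing = {
    "query": {"pages": {"-1": {"title": "Nowhereville", "missing": ""}}}
}

print(extract_page_text(found))
print(extract_page_text(missing))
```

Returning `None` lets the caller decide whether to skip the city or stop the run, instead of failing in the middle of the download loop.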

In [5]:
import pandas as pd

wonders_cities = [
    'Beirut',
    'Doha',
    'Durban',
    'Havana',
    'Kuala Lumpur',
    'La Paz',
    'Vigan',
]

data = []
for wonder_city in wonders_cities:
    info = fetch_and_save_wiki_text(wonder_city)
    # truncation=True with max_length=29999 caps num_tokens at 29,999 (see Havana below)
    tokens = tokenizer.encode(info, add_special_tokens=True, truncation=True, max_length=29999)
    num_tokens = len(tokens)
    data.append([wonder_city, info, num_tokens])

df = pd.DataFrame(data, columns=["wonder_city", "information", "num_tokens"])
df["cleaned_information"] = df["information"].apply(clean_text)
df["token_count"] = df["cleaned_information"].apply(count_tokens)
df.head()

Token indices sequence length is longer than the specified maximum sequence length for this model (12083 > 4096). Running this sequence through the model will result in indexing errors


| | wonder_city | information | num_tokens | cleaned_information | token_count |
|---|---|---|---|---|---|
| 0 | Beirut | Beirut ( bay-ROOT; Arabic: بيروت, romanized: )... | 12729 | beirut ( bayroot arabic romanized ) is the cap... | 12083 |
| 1 | Doha | Doha (Arabic: الدوحة, romanized: ad-Dawḥa [adˈ... | 11165 | doha (arabic romanized addawa [addua] or adda)... | 10190 |
| 2 | Durban | Durban ( DUR-bən; Zulu: eThekwini, from itheku... | 8352 | durban ( durbn zulu ethekwini from itheku mean... | 7737 |
| 3 | Havana | Havana (; Spanish: La Habana [la aˈβana] ; Luc... | 29999 | havana ( spanish la habana [la aana] lucumi il... | 28948 |
| 4 | Kuala Lumpur | Kuala Lumpur (Malaysian: [ˈkualə, -a ˈlumpo(r)... | 12925 | kuala lumpur (malaysian [kual a lumpo(r) (r)])... | 12674 |


In [31]:
def generate_summary(text_chunk, word_count):
    prompts = """Summarize the following text in under {} words.
        {}
    """.format(word_count, text_chunk)

    res = completion(prompts)

    return res.replace(f"Here is a summary of the text in under {word_count} words:", "")
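
`generate_summary` removes one exact preamble sentence, so any variation in the model's wording (for example "Here's a summary:") is left in the output. A slightly more tolerant approach is to strip a leading "Here is ...:" line with a regular expression. This is a minimal sketch under that assumption. `strip_preamble` is a hypothetical helper, and the pattern can misfire on a summary that itself begins with "Here is ...:".

```python
import re

# Matches a leading "Here is ...:" or "Here's ...:" up to the first colon
# on the first line, plus any whitespace that follows it.
_PREAMBLE = re.compile(r"^\s*here(?:\s+is|'s)\s+[^\n:]*:\s*", re.IGNORECASE)


def strip_preamble(response: str) -> str:
    """Drop a leading 'Here is ...:' line from a model response, if present."""
    return _PREAMBLE.sub("", response, count=1).strip()


raw = "Here is a summary of the text in under 50 words:\n\nBeirut is the capital of Lebanon."
print(strip_preamble(raw))
# Beirut is the capital of Lebanon.
```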

    

In [32]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
from tqdm import tqdm
import time

text_splitter = RecursiveCharacterTextSplitter(chunk_size=4096, chunk_overlap=50, length_function=len)

df["summary"] = ""

for index, row in tqdm(df.iterrows(), total=len(df), desc="Generating Summaries"):
    wonder_city = row["wonder_city"]
    text_chunk = row["cleaned_information"]
    chunks = text_splitter.split_text(text_chunk)
    chunk_summaries = []

    for chunk in chunks:
        summary = generate_summary(text_chunk = chunk, word_count = 50)
        time.sleep(30) # Avoid "exceed rate limit" when querying too frequently
        chunk_summaries.append(summary)

    combined_summary = "\n".join(chunk_summaries)
    # Reduce step of map-reduce: summarize the concatenated chunk summaries
    df.at[index, "summary"] = generate_summary(text_chunk = combined_summary, word_count = 250)

    # Summarizing every article would exceed the API rate limit, so we only process the first one.
    break


Generating Summaries:   0%|          | 0/7 [07:17<?, ?it/s]
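
The loop above follows a map-reduce pattern: each chunk is summarized on its own (map), then the chunk summaries are joined and summarized once more (reduce). The sketch below isolates that pattern so it can be run without an API key. It makes two assumptions: `split_into_chunks` is a naive fixed-width character splitter standing in for `RecursiveCharacterTextSplitter` (which prefers to split on separators), and the summarizer is passed in as a function. `first_words` is a hypothetical stub used in place of the model.

```python
from typing import Callable, List


def split_into_chunks(text: str, chunk_size: int, overlap: int = 0) -> List[str]:
    """Cut `text` into fixed-size character windows that overlap slightly."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


def map_reduce_summary(
    text: str,
    summarize: Callable[[str, int], str],
    chunk_size: int = 4096,
    overlap: int = 50,
) -> str:
    """Summarize each chunk, then summarize the joined chunk summaries."""
    # Map: one short summary (50 words) per chunk.
    partial = [summarize(chunk, 50) for chunk in split_into_chunks(text, chunk_size, overlap)]
    # Reduce: one longer summary (250 words) over all partial summaries.
    return summarize("\n".join(partial), 250)


def first_words(text: str, word_count: int) -> str:
    """Stub summarizer (not a model): keep only the first `word_count` words."""
    return " ".join(text.split()[:word_count])


print(split_into_chunks("abcdefghij", chunk_size=4, overlap=1))
# ['abcd', 'defg', 'ghij', 'j']
```

In the notebook, `generate_summary` would take the place of `first_words`, called as `map_reduce_summary(text, lambda chunk, n: generate_summary(chunk, n))`.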


In [33]:
df[["wonder_city", "summary"]]

| | wonder_city | summary |
|---|---|---|
| 0 | Beirut | \n\nBeirut, the capital of Lebanon, has a rich... |
| 1 | Doha | |
| 2 | Durban | |
| 3 | Havana | |
| 4 | Kuala Lumpur | |
| 5 | La Paz | |
| 6 | Vigan | |


In [34]:
from termcolor import colored
selected_columns = df[["wonder_city", "summary"]]

for index, row in selected_columns.iterrows():
    wonder_city = row["wonder_city"]
    summary = row["summary"]

    formatted_wonder_city = colored(wonder_city, "green", attrs=["bold", "underline"])
    
    formatted_summary = colored(f"Summary: {summary}", "white")
    
    print(formatted_wonder_city)
    
    print()
    
    print(formatted_summary)
    
    print("\n----------------------------------------------\n")

[4m[1m[32mBeirut[0m

[97mSummary: 

Beirut, the capital of Lebanon, has a rich history dating back over 5,000 years. The city has been influenced by various cultures and empires, including the Romans, Ottomans, and Europeans. Beirut was a significant city in the Roman Empire and later became a major port and commercial center. The city was devastated by the Lebanese Civil War, but has since undergone reconstruction and regained its status as a cultural and intellectual center. Today, Beirut is a financial hub with a diverse economy and a strong banking system. The city is also known for its vibrant nightlife, shopping, and dining scene, with popular neighborhoods like Badaro, Hamra Street, and Gemmayzeh. Beirut is a popular tourist destination, attracting visitors from around the world, and is promoting medical tourism with a 30% annual growth rate. The city is also home to numerous museums, galleries, and cultural events, making it a hub of culture, nightlife, and tourism.[0m

-




In [35]:
selected_columns = df[["wonder_city", "summary"]]

for index, row in selected_columns.iterrows():
    # Only rows that already have a summary are sent back for polishing
    if row["summary"] != "":
        prompts = """Help improve the writing of the following text.
            {}
        """.format(row["summary"])

        res = completion(prompts)

        print(res)

Here is a rewritten version of the text with some improvements:

Beirut, the capital of Lebanon, boasts a rich history spanning over 5,000 years. The city has been shaped by the influences of various cultures and empires, including the Romans, Ottomans, and Europeans. As a significant city in the Roman Empire, Beirut later flourished as a major port and commercial center. Unfortunately, the city was ravaged by the Lebanese Civil War, but it has since undergone extensive reconstruction and has regained its status as a cultural and intellectual hub. Today, Beirut is a thriving financial center with a diverse economy and a robust banking system. The city is renowned for its lively nightlife, shopping, and dining scene, with popular neighborhoods like Badaro, Hamra Street, and Gemmayzeh. A popular tourist destination, Beirut attracts visitors from around the world, and its medical tourism industry is experiencing a remarkable 30% annual growth rate. The city is also home to numerous museums, galleries, and cultural events, making it a vibrant hub of culture, nightlife, and tourism.