In [9]:
# Importing required modules
import requests
import pdfplumber
import tempfile
import openai
import tiktoken
from rouge_score import rouge_scorer

In [10]:
# Download the PDF
url = "https://arxiv.org/pdf/1711.01731"
response = requests.get(url, timeout=30)
response.raise_for_status()  # Stop early if the download failed

# Write the PDF to a temporary file (NOTE: text extraction had issues when the .pdf file was uploaded directly, regardless of the library used)
with tempfile.NamedTemporaryFile(suffix='.pdf', delete=False) as temp_pdf:
    temp_pdf.write(response.content)
    temp_pdf_path = temp_pdf.name

# Extracting the text
all_text = []
with pdfplumber.open(temp_pdf_path) as pdf:
    for page in pdf.pages:
        all_text.append(page.extract_text())

# Combine results
all_text_combined = "\n".join(filter(None, all_text))

# Output the extracted text
print(all_text_combined)


A Survey on Dialogue Systems:
Recent Advances and New Frontiers
Hongshen Chen†, Xiaorui Liu‡, Dawei Yin†, and Jiliang Tang‡
†DataScienceLab,JD.com
‡DataScienceandEngineering Lab,MichiganStateUniversity
chenhongshen@jd.com, yindawei@acm.org,{xiaorui,tangjili}@msu.edu
8102
ABSTRACT
May I know
Dialogue systems have attracted more and more attention. your name ?
Dialogue State
Recent advances on dialogue systems are overwhelmingly NLU Tracking
naJ
contributed by deep learning techniques, which have been
employed to enhance a wide range of big data applications
such as computer vision, natural language processing, and
11
recommender systems. For dialogue systems, deep learning I am Robot.
can leverage a massive amount of data to learn meaningful NLG Policy learning
feature representations and response generation strategies,
]LC.sc[
whilerequiringaminimumamountofhand-crafting. Inthis
article, we give an overview to these recent advances on di-
Figure 1: Traditional Pipeline for Task-oriented
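
The extracted text above splits some words across lines with a trailing hyphen (for example, the line ending in "di-"). A minimal sketch of one way to rejoin such words before further processing is shown here. The helper name `join_hyphenated_breaks` and the sample string are hypothetical, and the rule is a heuristic: it would also merge genuine compounds such as "task-" followed by "oriented" on the next line.

```python
import re


def join_hyphenated_breaks(text):
    """Join a word split as 'di-' + newline + 'alogue' back into 'dialogue'."""
    # Only join when the next line starts with a lowercase letter,
    # so headings and captions after a hyphen are left alone.
    return re.sub(r"(\w)-\n([a-z])", r"\1\2", text)


# Hypothetical sample string, not copied from the PDF
sample = "recent advances on di-\nalogue systems"
print(join_hyphenated_breaks(sample))  # recent advances on dialogue systems
```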

In [11]:
# Set the OpenAI API key (replace the placeholder with your own key)
openai.api_key = "your_key"

In [12]:
# Define the chunking function

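def chunking(text, chunkSize, overlap):
    """Split text into overlapping chunks of at most chunkSize tokens."""

    # A non-positive step would make range() fail or never advance
    if overlap >= chunkSize:
        raise ValueError("overlap must be smaller than chunkSize")

    # Encode the text into tokens with the cl100k_base encoding
    tt_encoding = tiktoken.get_encoding("cl100k_base")
    tokens = tt_encoding.encode(text)
    total_tokens = len(tokens)

    chunks = []

    # Slide a window of chunkSize tokens, advancing by chunkSize - overlap each step
    for i in range(0, total_tokens, chunkSize - overlap):
        chunk = tokens[i:i+chunkSize]
        chunks.append(chunk)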

    return chunks
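
To see how the sliding window in chunking() behaves, here is a small sketch that uses plain integers in place of token ids, so it runs without tiktoken. The helper `window_starts` and the toy sizes are assumptions made for illustration only; the notebook itself uses a chunk size of 3000 and an overlap of 50.

```python
def window_starts(total_tokens, chunk_size, overlap):
    """Start offsets of each window when stepping by chunk_size - overlap."""
    step = chunk_size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than chunk_size")
    return list(range(0, total_tokens, step))


# Toy "tokens": the integers 0..9 stand in for token ids
tokens = list(range(10))
starts = window_starts(len(tokens), chunk_size=4, overlap=1)
windows = [tokens[s:s + 4] for s in starts]

print(starts)   # [0, 3, 6, 9]
print(windows)  # [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9], [9]]
```

Each window starts with the last token of the previous one. Note that the final window here contains only the overlapping token, so with this stepping scheme the last chunk can be very short or fully redundant.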

In [13]:
# Run the chunking() on our text
chunks = chunking(all_text_combined, 3000, 50)

# Recreate the cl100k_base encoding (used by GPT-4) so each chunk can be decoded back to text
tt_encoding = tiktoken.get_encoding("cl100k_base")

# Summarize each chunk with gpt-4 (legacy ChatCompletion interface, openai < 1.0) and store the outputs in final_response
final_response = []

for chunk in chunks:
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are an assistant specialized in summarizing documents."},
            {"role": "user", "content": f"Please summarize the following document:\n\n{tt_encoding.decode(chunk)}"},
            {"role": "assistant", "content": "Summary:"},
        ],
        temperature=0.2,
    )
    final_response.append(response["choices"][0]["message"]["content"])

In [14]:
# Displaying our summaries
final_response

['This document is a survey on dialogue systems, focusing on recent advances and potential future research directions. Dialogue systems, which facilitate human-computer conversation, have gained increasing attention due to their potential applications and commercial value. The recent advances in these systems are largely contributed by deep learning techniques, which have been used to enhance big data applications such as computer vision, natural language processing, and recommender systems. The document categorizes dialogue systems into task-oriented and non-task-oriented models, and discusses how deep learning techniques can enhance them. It also highlights some appealing research directions that could further advance dialogue system research.',
 'The document discusses various approaches to Natural Language Generation (NLG) and dialogue systems. It mentions the use of neural network-based approaches with LSTM-based structures similar to RNNLM. The dialogue act type and its slot-valu

In [16]:
# Combine the responses into a single string
final_response_combined = ' '.join(final_response)

# Initialize a ROUGE scorer
scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeLsum'], use_stemmer = True)

# String variable for storing an actual abstract from the article
abstract = "Dialogue systems have attracted more and more attention. Recent advances in dialogue systems are overwhelmingly contributed by deep learning techniques, which have been employed to enhance a wide range of big data applications such as computer vision, natural language processing, and recommender systems. For dialogue systems, deep learning can leverage a massive amount of data to learn meaningful feature reprsentations and response generation strategies, while requiring a minimum amount of hand-crafting. In this article, we give an overview to these recent advances on dialogue systems from various perspectives and discuss some possible research directions. In particular, we generally divide existing dialogue systems into task-oriented and non-task oriented models, then detail how deep learning techniques help them with representative algorithms and finally discuss some appealing research directions that can bring the dialogue system research into a new frontier."

# String variable for storing the conclusion
conclusion = (
    "Deep learning has become a basic technique in dialogue systems. Researchers investigated on applying neural networks to the different components of a traditional task-oriented dialogue system, including natural language understanding, natural language generation, dialogue state tracking. Recent years, end-to-end frameworks become popular in not only the non-task-oriented chit-chat dialogue systems, but also the task-oriented ones. Deep learning is capable of leveraging large amount of data and is promising to build up a unified intelligent dialogue system. It is blurring the boundaries between the task-oriented dialogue systems and non-task- oriented systems. In particular, the chit-chat dialogues are modeled by the sequence-to-sequence model directly. The task completion models are also moving towards an end-to- end trainable style with reinforcement learning representing the state-action space and combing the whole pipelines. It is worth noting that current end-to-end models are still far from perfect. Despite the aforementioned achievements, the problems remain challenging. Next, we discuss some possible research directions: Swift Warm-Up. Although end-to-end models have drawn most of the recent research attention, we still need to rely on traditional pipelines in practical dialogue engineering, especially in a new domain warm- up stage. The daily conversation data is quite “big”, however, the dialogue data for a specific domain is quite limited. In particular, domain specific dialogue data collection and dialogue system construction are laborsome. Neural network based models are better at leveraging large amount of data. We need new way to bridge over the warm-up stage. It is promising that the dialogue agent has the ability to learn by itself from the interactions with human. Deep Understanding. "
    "Current neural network based dialogue systems heavily rely on the huge amount of different types of annotated data, and structured knowledge base and conversation data. They learn to speak by imitating a response again and again, just like an infant, and the responses are still lack of diversity and sometimes are not meaningful. Hence, the dialogue agent should be able to learn more effectively with a deep understanding of the language and the real world. Specifically, it remains much potential if a dialogue agent can learn from human instruction to get rid of repeatedly training. Since a great quantity of knowledge is available on the Internet, a dialogue agent can be smarter if it is capable of utilizing such unstructured knowledge resource to make comprehension. Last but not least, a dialogue agent should be able to make rea- sonable inference, find something new, share its knowledge across domains, instead of repeating the words like a parrot. Privacy Protection. Widely applied dialogue system serves a large number of people. It is quite necessary to notice the fact that we are using the same dialogue assistant. With the ability of learning through inter- actions, comprehension and inference, a dialogue assistant can inadvertently and implicitly store some of sensitive information. Hence, it is important to protect users’ privacy while building better dialogue systems."
)

# Combine abstract and conclusion
abstractconclusion = abstract + " " + conclusion

# Calculate the scores (scorer.score takes the reference first, then the candidate summary)
scores_abstract = scorer.score(abstract, final_response_combined)
scores_text = scorer.score(all_text_combined, final_response_combined)
scores_abstractconclusion = scorer.score(abstractconclusion, final_response_combined)

# Display scores using the abstract, the full article text, and the abstract plus conclusion as references
print("ROUGE FOR ABSTRACT")
print(f"ROUGE-1: {scores_abstract['rouge1']}")
print(f"ROUGE-L: {scores_abstract['rougeLsum']}")

print("ROUGE FOR FULL TEXT")
print(f"ROUGE-1: {scores_text['rouge1']}")
print(f"ROUGE-L: {scores_text['rougeLsum']}")

print("ROUGE FOR ABSTRACT PLUS CONCLUSION")
print(f"ROUGE-1: {scores_abstractconclusion['rouge1']}")
print(f"ROUGE-L: {scores_abstractconclusion['rougeLsum']}")

ROUGE FOR ABSTRACT
ROUGE-1: Score(precision=0.16129032258064516, recall=0.75, fmeasure=0.26548672566371684)
ROUGE-L: Score(precision=0.10906298003072197, recall=0.5071428571428571, fmeasure=0.17951959544879897)
ROUGE FOR FULL TEXT
ROUGE-1: Score(precision=0.9109062980030722, recall=0.06619781201160974, fmeasure=0.12342595483400978)
ROUGE-L: Score(precision=0.8924731182795699, recall=0.06485822728287564, fmeasure=0.12092829638880218)
ROUGE FOR ABSTRACT PLUS CONCLUSION
ROUGE-1: Score(precision=0.5483870967741935, recall=0.5560747663551402, fmeasure=0.5522041763341067)
ROUGE-L: Score(precision=0.25806451612903225, recall=0.2616822429906542, fmeasure=0.259860788863109)
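
The scores above differ mainly because of reference length: against the full text the summary is much shorter than the reference, so precision is high (about 0.91) while recall is low (about 0.07); against the abstract the summary is longer than the reference, so the pattern reverses. The sketch below shows how unigram precision, recall, and F-measure relate. It is a simplified illustration, assuming whitespace tokenization and no stemming, so it will not reproduce the rouge_score numbers exactly; the function name `rouge1` and the toy strings are hypothetical.

```python
from collections import Counter


def rouge1(reference, candidate):
    """Unigram-overlap precision, recall, and F-measure (simplified ROUGE-1)."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both texts
    overlap = sum((ref_counts & cand_counts).values())
    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    total = precision + recall
    fmeasure = 2 * precision * recall / total if total else 0.0
    return precision, recall, fmeasure


# Toy strings (hypothetical, not taken from the article)
reference = "the cat sat on the mat"
candidate = "the cat sat"
precision, recall, fmeasure = rouge1(reference, candidate)
print(precision, recall, round(fmeasure, 3))  # 1.0 0.5 0.667
```

Every candidate word appears in the reference, so precision is 1.0, but the candidate covers only half of the reference words, so recall is 0.5. This mirrors the full-text case, where a short summary is scored against a very long reference.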
