# Distributed data vs distributed computation

In this short, additional section we compare the distributed data paradigm with the distributed computation paradigm through a working LLM summarisation framework, while also considering how each could apply to the AI language detection problem. The data is the same dataset used throughout the project, found at https://www.kaggle.com/datasets/sunilthite/llm-detect-ai-generated-text-dataset/data. We also use data from a Wikipedia article, detailed later.

### Exploring distributed data vs distributed computation

Distributed data paradigms focus on breaking down large datasets into smaller chunks or partitions and distributing these segments across multiple computing nodes or storage systems. The primary goal is to distribute the data to make it accessible and processable across a distributed system. In comparison, distributed computation paradigms aim to perform computational tasks concurrently across multiple computing resources, leveraging parallel processing capabilities. The primary goal is to distribute computational tasks, allowing them to execute in parallel across a distributed infrastructure.

Examples of distributed data paradigms are data partitioning (splitting large texts into smaller segments for parallel processing) and distributed storage (storing data across multiple nodes using distributed file systems or object stores, e.g. HDFS, AWS S3).

An example of a distributed computation paradigm is MapReduce (dividing a computational task into smaller tasks, *mapping*, and aggregating their results, *reducing*, across distributed nodes).
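
To make the two ideas concrete, here is a minimal single-machine sketch that partitions a toy dataset and then runs a map step and a reduce step over the partitions. The function names, the toy documents and the word-count task are all illustrative assumptions; on a real cluster each partition would live on, and be processed by, a different node.

```python
from collections import Counter
from functools import reduce

def partition(items, n_parts):
    """Distributed data: split a list into n_parts contiguous, near-equal partitions."""
    size, extra = divmod(len(items), n_parts)
    parts, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < extra else 0)
        parts.append(items[start:end])
        start = end
    return parts

def map_word_count(docs):
    """Map step: count the words inside one partition only."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.lower().split())
    return counts

def reduce_counts(left, right):
    """Reduce step: merge two partial results."""
    return left + right

# Toy data (hypothetical), standing in for the essay dataset.
toy_docs = ["cars are useful", "cars pollute cities", "cities are crowded", "walk more"]

partitions = partition(toy_docs, 2)
partial_counts = [map_word_count(p) for p in partitions]  # each call could run on its own node
total = reduce(reduce_counts, partial_counts, Counter())

print(total["cars"], total["cities"], total["are"])  # prints: 2 2 2
```

The map calls never look at each other's partitions, which is what makes it possible to run them in parallel.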

First, we install the libraries we need and load in our data.

In [1]:
pip install pandas

Note: you may need to restart the kernel to use updated packages.


In [2]:
import pandas as pd
import numpy as np

df = pd.read_csv("Training_Essay_Data.csv")

Data partitioning, a popular distributed data method, can be easily applied to many different problem domains. Within the context of our problem, we'd want to check whether each document is large enough to warrant data partitioning.

In [3]:
print(df)

                                                    text  generated
0      Car-free cities have become a subject of incre...          1
1      Car Free Cities  Car-free cities, a concept ga...          1
2        A Sustainable Urban Future  Car-free cities ...          1
3        Pioneering Sustainable Urban Living  In an e...          1
4        The Path to Sustainable Urban Living  In an ...          1
...                                                  ...        ...
29140  There has been a fuss about the Elector Colleg...          0
29141  Limiting car usage has many advantages. Such a...          0
29142  There's a new trend that has been developing f...          0
29143  As we all know cars are a big part of our soci...          0
29144  Cars have been around since the 1800's and hav...          0

[29145 rows x 2 columns]

In [4]:
doc1 = df.iloc[0,0]
len(doc1)

4091

Let's also have a look at the average length of the documents we have in the dataset we've used.

In [5]:
average_length = df['text'].apply(lambda x: len(str(x))).mean()
print(average_length)

2235.996740435752


We can see that the average is just over 2,200 characters, which is approximately 300-500 words. Whilst splitting up a document can significantly aid processing for AI detection, we'd expect it to be used mainly on exceptionally large documents that may exceed the memory capacity of a single processing unit, or when weighing the scalability of the task against the efficiency of a single computing unit.
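
The rough conversion from characters to words can be checked with a little arithmetic. This is only a sketch: the figure of 5 to 7 characters per word (counting the following space) is an assumed rule of thumb for English text, and `needs_partitioning` with its 10,000-character budget is a hypothetical helper rather than part of the project code.

```python
def estimate_word_range(n_chars, min_chars_per_word=5, max_chars_per_word=7):
    """Return a (low, high) word-count estimate for a text of n_chars characters.

    The characters-per-word bounds are assumptions, not measured values.
    """
    return n_chars // max_chars_per_word, n_chars // min_chars_per_word

def needs_partitioning(text, max_chars=10_000):
    """Flag a document whose length exceeds a (hypothetical) per-unit character budget."""
    return len(text) > max_chars

print(estimate_word_range(2236))               # (319, 447)
print(needs_partitioning("x" * 4091))          # False: a typical essay fits comfortably
print(needs_partitioning("x" * 500_000))       # True: a thesis-sized document does not
```

Under these assumptions the average essay falls inside the 300-500 word range quoted above, and only documents far larger than anything in this dataset would trigger partitioning.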

For example, consider a service that offers AI detection among other functions, such as Turnitin, which is used widely across universities in the UK. It handles far larger documents (PhD theses coming in at almost half a million characters) and a far larger volume of documents as well.

### Intro to LangChain

Within the realm of Natural Language Processing (NLP), we highlight a solution to summarising extensive or multiple documents, an often formidable challenge when considering the sheer volumes of data required, which uses both distributed data and distributed computation techniques.

LangChain is a framework that addresses this problem by carrying relevant information from previous documents into the processing of the current one, establishing a comprehensive chain of interconnected documents. We'll see some of the processes it uses below.

First we want to set up the environment and install some libraries.

In [6]:
%%capture
!pip install tiktoken openai langchain

In [7]:
from langchain.chains import LLMChain, MapReduceDocumentsChain, ReduceDocumentsChain
from langchain.chains.combine_documents.stuff import StuffDocumentsChain
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import WebBaseLoader
from langchain.prompts import PromptTemplate
from langchain.text_splitter import RecursiveCharacterTextSplitter
import textwrap

Initialise OpenAI Key.

In [8]:
import os
from getpass import getpass
if "OPENAI_API_KEY" not in os.environ:  # never hard-code a secret key in a notebook
    os.environ["OPENAI_API_KEY"] = getpass("OpenAI API key: ")

Set the model parameters for the summarisation chain.

In [9]:
model_name = "gpt-3.5-turbo-16k"
temperature = 0.1

Create the chat model, then load the document.

In [10]:
llm = ChatOpenAI(model_name=model_name, temperature=temperature)

In [11]:
loader = WebBaseLoader("https://simple.wikipedia.org/wiki/Artificial_intelligence")

def load_documents(loader):
    """Fetch the page behind the loader and return it as a list of documents."""
    return loader.load()

documents = load_documents(loader)
len(documents)

1

### Splitting Document Into Chunks

In [12]:
def split_docs(documents, chunk_size=1000, chunk_overlap=20):
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    docs = text_splitter.split_documents(documents)
    return docs

docs = split_docs(documents)
print(len(docs))

16


In [13]:
docs[7]

Document(page_content='At present we use the term AI for successfully understanding human speech,[6] competing at a high level in strategic game systems (such as chess and Go), self-driving cars, and interpreting complex data.[9] Some people also consider AI a danger to humanity if it continues to progress at its current pace.[10]\nAn extreme goal of AI research is to create computer programs that can learn, solve problems, and think logically.[11][12] In practice, however, most applications have picked on problems which computers can do well. Searching databases and doing calculations are things computers do better than people. On the other hand, "perceiving its environment" in any real sense is way beyond present-day computing.', metadata={'source': 'https://simple.wikipedia.org/wiki/Artificial_intelligence', 'title': 'Artificial intelligence - Simple English Wikipedia, the free encyclopedia', 'language': 'en'})

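This is how LangChain applies its data partitioning: the document is cut into chunks of at most 1000 characters, with up to 20 characters of overlap between neighbouring chunks.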
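
The roles of `chunk_size` and `chunk_overlap` can be illustrated with a much simpler splitter. The sketch below is a plain fixed-size sliding window, not LangChain's recursive algorithm: `RecursiveCharacterTextSplitter` prefers to break on natural separators such as paragraphs and spaces, so its chunks usually end at tidier boundaries. The function name and the sample text are illustrative assumptions.

```python
def chunk_with_overlap(text, chunk_size=1000, chunk_overlap=20):
    """Cut text into windows of at most chunk_size characters.

    Each window starts chunk_overlap characters before the previous one ended,
    so neighbouring chunks share a little context.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks

# Hypothetical 2,500-character text made of repeating digits.
sample = "".join(str(i % 10) for i in range(2500))
chunks = chunk_with_overlap(sample)

print([len(c) for c in chunks])            # [1000, 1000, 540]
print(chunks[0][-20:] == chunks[1][:20])   # True: the 20-character overlap
```

The overlap means a sentence cut at a chunk boundary still appears in part in both neighbours, which helps each chunk remain understandable on its own.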

___
In the domain of document processing, breaking extensive documents into manageable segments is a fundamental necessity. Equally important are the LangChain functions that amalgamate these segments and summarise them, which is its principal function here.

## Map Reduce Chain

So we've seen the use of distributed data in LangChain. Further along in the 'chains' it uses as part of the summarisation process, LangChain employs a distributed computation paradigm in its Map Reduce Chain.

The Map Reduce Chain is a two-step solution that greatly simplifies the task of summarising a document.

### Map Chain

In the first step, known as the "map", the document has already been divided into smaller, more manageable chunks, and each of these is summarised individually.

In [14]:
map_template = """Write a concise summary of the following content:

{text}

Summary:
"""

In [15]:
map_prompt = PromptTemplate.from_template(map_template)

In [16]:
map_chain = LLMChain(prompt=map_prompt, llm=llm)
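
As a rough picture of what the map step sends to the model, the sketch below fills the same template once per chunk. Plain `str.format` stands in for `PromptTemplate` here, and `build_map_prompts` and the two sample chunks are hypothetical; no model is called.

```python
MAP_TEMPLATE = """Write a concise summary of the following content:

{text}

Summary:
"""

def build_map_prompts(chunks, template=MAP_TEMPLATE):
    """Fill the template once per chunk; every prompt is independent of the others."""
    return [template.format(text=chunk) for chunk in chunks]

# Hypothetical chunks standing in for the split Wikipedia article.
prompts = build_map_prompts([
    "First chunk of the article.",
    "Second chunk of the article.",
])

print(len(prompts))  # 2
print(prompts[0])
```

Because no prompt depends on another chunk's result, the requests can be issued in any order, or at the same time.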

### Reduce Chain

In the second step, referred to as the "reduce", we aim to combine these individual summaries into one cohesive final summary.

In [17]:
reduce_template = """The following is a set of summaries:

{doc_summaries}

Summarize the above summaries with all the key details
Summary:"""

In [18]:
reduce_prompt = PromptTemplate.from_template(reduce_template)

In [19]:
reduce_chain = LLMChain(llm=llm, prompt=reduce_prompt)

In [20]:
stuff_chain = StuffDocumentsChain(llm_chain=reduce_chain, 
                                  document_variable_name="doc_summaries")

In [21]:
reduce_chain = ReduceDocumentsChain(
    combine_documents_chain=stuff_chain
)

Attempting to merge all the chunk summaries at once can run into token limit constraints. The `ReduceDocumentsChain` defined above avoids this by first collapsing the summaries into smaller groups whenever their combined size would exceed the limit.
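
The idea of combining summaries in groups so that a size limit is never exceeded can be sketched as follows. This is an illustration under stated assumptions, not LangChain's implementation: it measures size in characters rather than tokens, and `truncate_combine` is a hypothetical stand-in for an LLM call that condenses a group of summaries.

```python
def group_by_budget(summaries, max_chars):
    """Greedily pack consecutive summaries into groups whose total length fits max_chars."""
    groups, current, current_len = [], [], 0
    for summary in summaries:
        if current and current_len + len(summary) > max_chars:
            groups.append(current)
            current, current_len = [], 0
        current.append(summary)
        current_len += len(summary)
    if current:
        groups.append(current)
    return groups

def collapse(summaries, max_chars, combine):
    """Combine groups repeatedly until all summaries together fit within max_chars."""
    while len(summaries) > 1 and sum(len(s) for s in summaries) > max_chars:
        groups = group_by_budget(summaries, max_chars)
        if len(groups) == len(summaries):
            break  # no two summaries fit together, so nothing more can be merged
        summaries = [combine(group) for group in groups]
    return summaries

def truncate_combine(group):
    """Hypothetical stand-in for an LLM summary: join the group and keep 25 characters."""
    return "".join(group)[:25]

summaries = ["a" * 30, "b" * 30, "c" * 30, "d" * 30]   # 120 characters in total
collapsed = collapse(summaries, max_chars=70, combine=truncate_combine)

print(len(collapsed), sum(len(s) for s in collapsed))  # 2 50
```

Once the collapsed summaries fit within the budget, a single final call can merge them into the overall summary.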

### Map Reduce Chain

In [22]:
map_reduce_chain = MapReduceDocumentsChain(
    llm_chain=map_chain,
    document_variable_name="text",
    reduce_documents_chain=reduce_chain)

In [23]:
#output = map_reduce_chain.run(docs)
#wrapped_text = textwrap.fill(output, 
#                             width=100,
#                             break_long_words=False,
#                             replace_whitespace=False)
#print(wrapped_text)

Running this chain requires paid access to OpenAI's API, so the cell above is left commented out. The output it would produce is below:

___
Artificial intelligence (AI) is a field of computer science that focuses on creating intelligent
machines capable of tasks that typically require human intelligence. It is used in various
industries and has the potential to greatly impact society. The term "intelligence" is debated in
AI, with different perspectives on whether it should be defined in terms of action or thinking. AI
involves computer programs that mimic human cognition and can learn, solve problems, and think
logically. It is a multidisciplinary field that encompasses various disciplines. AI research began
in 1956 but experienced a decline known as the "AI winter" before resurging in the 90s and early
2000s. Notable achievements include machines defeating human champions in chess and Jeopardy! AI has
become popular worldwide due to advancements in technology and access to more data. The content also
includes references to various topics and sources discussing AI.
___

Map Reduce takes an iterative approach, progressively combining chunk summaries while retaining the key information throughout the amalgamation process. Because each chunk is summarised independently, the map step can be performed concurrently across multiple computing resources, which allows large documents to be summarised far more quickly than in a single sequential pass.
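
A minimal sketch of running the map step concurrently is shown below, using threads from the standard library. `summarise_chunk` is a hypothetical stand-in that simply keeps the first sentence, so no API access is needed; whether LangChain itself issues its map calls concurrently depends on the version and how the chain is invoked, so this only illustrates the principle.

```python
from concurrent.futures import ThreadPoolExecutor

def summarise_chunk(chunk):
    """Hypothetical stand-in for an LLM call: keep only the first sentence."""
    return chunk.split(".")[0].strip() + "."

def map_concurrently(chunks, max_workers=4):
    """Map step: summarise the chunks in parallel; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(summarise_chunk, chunks))

def reduce_summaries(summaries):
    """Reduce step: merge the partial summaries into one text."""
    return " ".join(summaries)

# Hypothetical chunks standing in for the split article.
chunks = [
    "AI is a field of computer science. It has many uses.",
    "AI research began in 1956. Funding later declined.",
]

partial = map_concurrently(chunks)
print(reduce_summaries(partial))
# AI is a field of computer science. AI research began in 1956.
```

Threads suit this kind of work when each map call mostly waits on a remote service; for CPU-heavy map functions, separate processes or machines would be the more natural choice.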

When thinking back to our original problem, the key takeaway here is scalability, particularly in the Turnitin example, where a programme would be processing an incredibly high volume of documents. Also worth noting is the reduced memory consumption that distributed computation techniques offer: spreading memory requirements across nodes lowers the demands on any single machine and, with them, the hardware costs associated with language detection systems.

### Conclusion

In conclusion, both the LangChain example and the hypothesised use in AI language detection show the strengths of the distributed data and distributed computation paradigms, each in its own role.

It's also worth noting their interdependence: distributed computation often relies on the setup provided by distributed data paradigms to process data efficiently across distributed infrastructure.

A further consideration is that distribution techniques only pay off when the algorithms built on them are very well designed.