<a href="https://colab.research.google.com/github/michellebonat/LLM-LangChain-PP-Summ/blob/main/LLM_LangChain_SummPrivPols.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LLMs and LangChain: Summarize a long privacy policy with various approaches

**Purpose** This notebook records an experiment to understand if company privacy policies can be understood more easily by using the summarization features in LLMs and LangChain.

Company privacy policies are typically very long and written in legal jargon. As a result, it's difficult to understand as a consumer what you may be agreeing to when you agree to their terms. Let's see if we can make this easier.

**Why is this important?** The example company privacy policy we use here is TikTok. We chose this because of it's recent attention. The US government is trying to ban TikTok becuase of it's allegedly offensive ways that is uses consumer data. 170 million people in the US use TikTok and there are fears that it's owhers the Chinese government are spying on Americans through the use of this app and the capturing of their data.

As of April 2024, "the Senate passed legislation that would force TikTok's China-based parent company to sell the social media platform under the threat of a ban. A measure to outlaw the popular video-sharing app has won congressional approval and is on its way to President Biden for his signature. The measure gives Beijing-based parent company ByteDance nine months to sell the company, with a possible additional three months if a sale is in progress. If it doesn’t, TikTok will be banned." (Source: AP News) [What a TikTok ban in the US could mean for you.](https://apnews.com/article/tiktok-divestment-ban-what-you-need-to-know-5e1ff786e89da10a1b799241ae025406)

**How can this project help?** Since company privacy policies are long and confusing, perhaps US consumers did not read what they were agreeing to by using TikTok. If the information was shorter and more simple perhaps they would be more informed about the data policies and perhaps they would make different choices. They may still use it, but perhaps they would look into changing the app data settings or be more careful what they post.

References:
* The Anthropic API information and signup can be found [here.](https://www.anthropic.com/)
* The OpenAI API information and signup can be found [here.](https://platform.openai.com/)
* This notebook was adapted from Roger Oriol's blog post about summarizing long documents with LLMs which can be found [here.](https://www.ruxu.dev/articles/ai/summarize-long-docs/)

Version notes:
* For these examples we will be using LangChain and Anthropic's Claude Sonnet model via their api which we set to Claude-3-sonnet-20240229 as of May 2024. This is the medium sized option, using the Evaluation plan tier (non-production).

* For OpenAI we are using TODO (INSERT VERSION INFO HERE WHEN ADDED)


# Setup

**Install the required Anthropic LangChain**

In [2]:
!pip install -q langchain langchain-anthropic langchain-text-splitters

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m867.6/867.6 kB[0m [31m6.3 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.0/2.0 MB[0m [31m12.9 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m302.9/302.9 kB[0m [31m12.2 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m120.6/120.6 kB[0m [31m7.0 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m870.8/870.8 kB[0m [31m16.9 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m75.6/75.6 kB[0m [31m5.7 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m49.3/49.3 kB[0m [31m3.2 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m53.0/53.0 kB[0m [31m4.0 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━

**Setup to use the model**

❗ Remember to include your own API key for Anthropic in the api_key param below. Otherwise all following cells will fail. If you are doing this in a colab notebook, you can easily setup your own secrets. Find this by clicking on the left side navigation on the key icon.

In [3]:
from langchain_anthropic import ChatAnthropic
from google.colab import userdata

model = ChatAnthropic(
    temperature=0,
    api_key=userdata.get('michelle-secret-key-anthropic'),
    model_name='claude-3-sonnet-20240229')

# Two Approaches

## Summarize short text snippet directly with direct imput

Let's try a simple approach and see how useful it might be. To summarize a short bit of text, we can just copy it from the internet where the privacy policy is posted, then paste the text in the prompt in this notebook and ask the model to summarize it. We aren't using LangChain's stuff documents chain yet. Below we have manually pasted in the top part of the TikTok privacy policy.

In [4]:
prompt = """Write a concise summary of the following text:
This Privacy Policy applies to TikTok services (the “Platform”), which include TikTok apps, websites, software
and related services accessed via any platform or device that link to this Privacy Policy. The Platform is provided
 and controlled by TikTok Inc. (“TikTok”, “we” or “us”). We are committed to protecting and respecting your privacy.
  This Privacy Policy explains how we collect, use, share, and otherwise process the personal information of users
  and other individuals age 13 and over in connection with our Platform. For information about how we collect, use,
   share, and otherwise process the personal information of users under age 13 (“Children”), please refer to our
   Children’s Privacy Policy. For information about how we collect, use, share, and otherwise process consumer
   health data as defined under Washington’s My Health My Data Act and other similar state laws, please refer
   to the Consumer Health Data Privacy Policy.

Summary: """

In [5]:
response = model.invoke(prompt)
print(response.content)
total_tokens = response.response_metadata['usage']['input_tokens'] + response.response_metadata['usage']['output_tokens']
print(f"Tokens used: {total_tokens}")

This is TikTok's Privacy Policy, which outlines how the company collects, uses, shares, and processes personal information of users aged 13 and over who access TikTok's services, including apps, websites, software, and related services. It also mentions separate policies for handling data of children under 13 and consumer health data, which are covered in the Children's Privacy Policy and Consumer Health Data Privacy Policy, respectively.
Tokens used: 330


**Conclusion:** This is much shorter. However it is so short as to not be very useful. It also by default responds in a single line of text that is difficult to read (you need to scroll). Let's continue to discover some other approaches.

## Summarize the text using MapReduce and LangChain

This time we will use a longer block of text using the entire TikTok privacy policy. We will get it directly from it's source location online using LangChain's BeautifulSoup plugin.

In [25]:
import requests

page = requests.get("https://www.tiktok.com/legal/page/us/privacy-policy/en")

with open("tiktok.html", "w", encoding="utf-8") as f:
    f.write(page.text)

In [26]:
from langchain_community.document_loaders import BSHTMLLoader
import re

loader = BSHTMLLoader("tiktok.html")
data = loader.load()

data[0].page_content = re.sub("\n\n+", "\n", data[0].page_content)

Estimate the number of tokens of the text using the rough rule that a token is 3/4 of a word.

In [27]:
print(f"Estimated tokens: {int(len(data[0].page_content) * 4 / 3)}")

Estimated tokens: 38822


The number of tokens of this text is about 40k, so it might be able to be summarized in a single copied and pasted prompt from a technology perspective, but realistically that would not make for a good user experience.

To solve this problem we will use MapReduce to divide the text in multiple chunks and summarize each chunk individually. To do this, we could just use LangChain's load_summarize_chain utility. However, for the sake of visualization, we will implement each step in a simple and understandable way.

First, we will divide the text in chunks using LangChain's RecursiveCharacterTextSplitter utility.

In [28]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

chunk_size = 2000
chunk_overlap = 100
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size, chunk_overlap=chunk_overlap
)

splits = text_splitter.split_documents(data)

In [10]:
len(splits)

16

In [29]:
splits[0]

Document(page_content='Privacy Policy | TikTokU.S.Privacy PolicyLast updated: March 28, 2024This Privacy Policy applies to TikTok services (the â\x80\x9cPlatformâ\x80\x9d), which include TikTok apps, websites, software and related services accessed via any platform or device that link to this Privacy Policy. The Platform is provided and controlled by TikTok Inc. (â\x80\x9cTikTokâ\x80\x9d, â\x80\x9cweâ\x80\x9d or â\x80\x9cusâ\x80\x9d). We are committed to protecting and respecting your privacy. This Privacy Policy explains how we collect, use, share, and otherwise process the personal information of users and other individuals age 13 and over in connection with our Platform. For information about how we collect, use, share, and otherwise process the personal information of users under age 13 (â\x80\x9cChildrenâ\x80\x9d), please refer to our Childrenâ\x80\x99s Privacy Policy. For information about how we collect, use, share, and otherwise process consumer health data as defined under Was

Once the text is divided, we do a summary of each chunk. We have the LLM make a summary in bullet point format for easy readability.

In [30]:
from langchain_core.prompts import PromptTemplate

map_prompt = PromptTemplate(
    template="""Write a concise summary of the following text. The summary should be a list of bullet points. The summary cannot be more than 5 bullet points. The text is:
{text}

Summary: """,
    input_variables=['text']
)

In [31]:
summaries = []

for split in splits:
  response = model.invoke(map_prompt.format(text=split.page_content))
  summaries.append(response.content)

RateLimitError: Error code: 429 - {'type': 'error', 'error': {'type': 'rate_limit_error', 'message': 'Number of requests has exceeded your per-minute rate limit (https://docs.anthropic.com/claude/reference/rate-limits); see the response headers for current usage. Please try again later or contact sales at https://www.anthropic.com/contact-sales to discuss your options for a rate limit increase.'}}

**Conclusion:** What we are seeing here is a blocker on processing based on a rate limit error in Anthropic's Claude service. Since our experiment is to test this on the medium level plan at Anthropic, what we're seeing here is this approach is not completely viable. We are able to process some things so we will keep going and evaluate the usedfulness of the result.

We will keep moving forward despite this Anthropic processing blocker.

In [14]:
summaries[0]

'Here is a concise summary of the provided text in 5 bullet points:\n\n- This Privacy Policy applies to TikTok services, including apps, websites, software, and related services.\n- TikTok collects various types of information from users, including account/profile information, user-generated content, and automatically collected information.\n- Account/profile information includes name, age, username, password, email, phone number, and profile image.\n- User-generated content includes comments, photos, livestreams, videos, text, hashtags, and virtual item videos uploaded by users.\n- Automatically collected information includes device information, location data, usage data, and information from cookies and similar technologies.'

Now we have the summaries of at least some of the chunks. However, there are too many summaries to consolidate in a single prompt. So we will group them in groups that will fit in a single prompt. Then, we will consolidate the summaries in each group in a single summary. We will repeat this process until the number of groups is reduced to 1. This will mean that all the consolidated summaries now fit in a single summarization prompt.

In [15]:
def group_summaries(summaries, max_summaries):
  groups = []
  current_group = []
  for summary in summaries:
    current_group.append(summary)
    if len(current_group) >= 10:
      groups.append(current_group)
      current_group = []
  if len(current_group) > 0:
    groups.append(current_group)
  return groups

groups = group_summaries(summaries, 10)

In [32]:
len(groups)

1

In [33]:
combine_prompt = PromptTemplate(
    template= """The following is set of bullet-point summaries:
{docs}
Take these and distill it into a consolidated bullet-point summary of the main themes. Remove the bullet points that are not relevant to the whole text. The consolidated summary cannot be more than 7 bullet points.
Helpful Answer: """,
    input_variables=['docs']
)

In [34]:
while len(groups) > 1:
  new_summaries = []
  for group in groups:
    response = model.invoke(combine_prompt.format(docs="\n".join(group)))
    new_summaries.append(response.content)
  groups = group_summaries(new_summaries, 10)

In [35]:
groups[0][0]

'Here is a concise summary of the provided text in 5 bullet points:\n\n- This Privacy Policy applies to TikTok services, including apps, websites, software, and related services.\n- TikTok collects various types of information from users, including account/profile information, user-generated content, and automatically collected information.\n- Account/profile information includes name, age, username, password, email, phone number, and profile image.\n- User-generated content includes comments, photos, livestreams, videos, text, hashtags, and virtual item videos uploaded by users.\n- Automatically collected information includes device information, location data, usage data, and information from cookies and similar technologies.'

Finally, all summaries fit in a single prompt. All that's left to do is run a last prompt to do a final consolidation of all summaries into a single final summary. This summary will be in regular sentences, instead of bullet points, because that's what we want for this use case.

In [36]:
reduce_prompt = PromptTemplate(
    template= """The following is set of bullet-point summaries:
{docs}
Take these and distill it into a final, consolidated summary of the main themes. The final summary should be at most 15 sentences.
Helpful Answer: """,
    input_variables=['docs']
)

In [37]:
response = model.invoke(reduce_prompt.format(docs="\n".join(groups[0])))
final_summary = response.content

**Final result:** Below is the final summary of the TikTok privacy policy based on this MapReduce and LangChain approach.

In [24]:
final_summary

"Here is a consolidated 15-sentence summary of the main themes from the provided bullet points:\n\nTikTok's Privacy Policy outlines the various types of information collected from users of its services, including apps, websites, and related offerings. This includes account/profile information like name, age, username, password, email, phone number, and profile images. User-generated content such as comments, photos, videos, text, hashtags, and virtual items are also collected. \n\nAutomatically collected information encompasses device information, location data based on SIM cards and IP addresses, usage data, cookies, and identifiers assigned across devices to track user activity. User content is analyzed for objects, scenes, faces, text, and audio characteristics, and biometric data like faceprints and voiceprints may be collected with permission.\n\nOther data collected includes messages and metadata, clipboard information with permission, purchase and transaction histories, phone an

**Conclusion:** What we see here is a more useful summary than what came from the previous approaches. It is fairly short but still contains relevant data. You can see what type of profile information is collected, what types of content are also collected. You can even see what is collected automatically from your phone including location data from your SIM card and your IP address. It also shows that it analyzes user content including face and audio, transaction histories, and biometric data with permission.

# Conclusion
The information in company privacy policies is important to understand. In the example of TikTok, where concern over the application's use of customer data has come to light, we examine if the methods used here can make that easier and more transparent for users of the app to understand this risks.

We tried two approaches using the Anthropic LLM and LangChain capabilities:
* First approach: A simple summarization based on manually copying and pasting a snippet of information from the privacy policy into the cell in this notebook. The result was a short but overly summarized version that was not very useful.
* Second approach: A summarization using MapReduce and LangChain. While we did encounter some rate limiting that resulted in processing errors (keeping to a medium level Anthropic service), this approach did result in a final summary that was more useful than the first approach. We were able to gather from a single summary paragraph some impactful disclosures about using the TikTok application. These included automatic location and IP collection as well as face and audio among others.

While more investigation should continue, this is an improved approach to helping end users more easily discover what the impacts to their personal information from using TikTok specifically. It also provides some examples to build on for additional investigation.