This notebook shows how to use LangChain's `SemanticChunker` to split up text data.

In [None]:
from langchain_experimental.text_splitter import SemanticChunker
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_core.documents import Document

import pandas as pd

from dsp_interview_transcripts import PROJECT_DIR
from dsp_interview_transcripts.utils.data_cleaning import clean_data, convert_timestamp

In [None]:
# Read in the raw data
data = pd.read_csv(PROJECT_DIR / 'data/qual_af_transcripts.csv')

In [None]:
# Clean up the text a little and move audio transcriptions to the text column
interviews_df = clean_data(data)

In [None]:
# Make sure the conversations are sorted by time, so that the replies go in the right order
interviews_df['timestamp_clean'] = interviews_df['timestamp'].apply(convert_timestamp)
interviews_df = interviews_df.sort_values(['conversation', 'timestamp_clean'])

In [None]:
# Turn every conversation into one big block of text (mimics the format of other interview/focus group transcripts we might see)
df_grouped = interviews_df.groupby('conversation')['text_clean'].apply(lambda x: '. '.join(x)).reset_index()

In [None]:
# Take the first conversation as a guinea pig
text1 = df_grouped['text_clean'].iloc[0]

In [None]:
# Turn it into a langchain document
doc = Document(page_content=text1)

In [None]:
# Define the model we'll use to generate embeddings. The SemanticChunker documentation suggests OpenAI embeddings
# but we can just use HF embeddings
model_name = "sentence-transformers/all-MiniLM-L6-v2"

chunker = SemanticChunker(HuggingFaceEmbeddings(model_name=model_name), breakpoint_threshold_type="percentile")

chunked_docs = chunker.split_documents([doc])


See [the documentation](https://python.langchain.com/docs/how_to/semantic-chunker/) for info on the different breakpoint threshold types (the `breakpoint_threshold_type` argument). Percentile is the default.
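
To make the percentile option concrete, here is a minimal sketch of the idea in plain Python. It is not the library's implementation. It assumes one embedding per sentence (the toy 2-D vectors below are made up), measures the cosine distance between neighbouring sentences, and starts a new chunk wherever that distance exceeds the chosen percentile of all distances. The helper names and the cutoff of 95 are illustrative assumptions.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def percentile(values, q):
    """Percentile with linear interpolation between ranks (q in 0..100)."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100
    lower, upper = math.floor(rank), math.ceil(rank)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def split_at_percentile(sentences, embeddings, q=95):
    """Start a new chunk wherever the distance to the previous sentence is above the q-th percentile."""
    distances = [
        cosine_distance(embeddings[i], embeddings[i + 1])
        for i in range(len(embeddings) - 1)
    ]
    threshold = percentile(distances, q)

    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


# Made-up sentences and embeddings: the topic shifts between "B." and "C."
toy_sentences = ["A.", "B.", "C.", "D.", "E."]
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.2, 0.8]]

toy_chunks = split_at_percentile(toy_sentences, toy_embeddings)
print(toy_chunks)
```

With only a handful of distances, a high percentile cutoff means that only the single largest jump triggers a split, which is why short conversations may come back as one or two chunks.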

See also [this notebook](https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/tutorials/LevelsOfTextSplitting/5_Levels_Of_Text_Splitting.ipynb).

Note that [the documentation](https://api.python.langchain.com/en/latest/text_splitter/langchain_experimental.text_splitter.SemanticChunker.html) lists further arguments you can adjust:
* `breakpoint_threshold_amount`: the exact numerical value of the breakpoint threshold
* `sentence_split_regex`: the regex for sentence delimiters
* `number_of_chunks`: the number of chunks, if you have a sense of what this would be for your document
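
The sentence-delimiter regex matters for this data because the messages were joined with `'. '` above. The sketch below uses only Python's `re` module to show how a splitting pattern changes the sentences the chunker would see. The first pattern is what I believe the library uses by default (treat that as an assumption and check your installed version); the second is a hypothetical alternative that also splits on newlines. The example strings are made up.

```python
import re

# Assumed default: split on whitespace that follows '.', '?' or '!'
ASSUMED_DEFAULT_REGEX = r"(?<=[.?!])\s+"
# Hypothetical alternative: additionally treat newlines as sentence boundaries
NEWLINE_AWARE_REGEX = r"(?<=[.?!])\s+|\n+"


def split_sentences(text, pattern):
    """Split text on the given regex and drop empty pieces."""
    return [piece for piece in re.split(pattern, text) if piece]


# Joining messages with '. ' leaves '?.' where a message already ended in '?'
joined = "How was your week?. It was fine. Thanks for asking"
default_split = split_sentences(joined, ASSUMED_DEFAULT_REGEX)
print(default_split)

# A message containing a line break but no punctuation
multiline = "It was fine\nThanks for asking. Bye"
print(split_sentences(multiline, ASSUMED_DEFAULT_REGEX))
newline_split = split_sentences(multiline, NEWLINE_AWARE_REGEX)
print(newline_split)
```

A pattern that works here could then be passed to the chunker as `sentence_split_regex=...`.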


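Also, the documentation says that the chunker looks at windows of 3 sentences, while the [source code](https://github.com/langchain-ai/langchain-experimental/blob/main/libs/experimental/langchain_experimental/text_splitter.py) sets `buffer_size=1` by default. The two seem consistent if `buffer_size` counts the sentences added on each side of the current one (1 before + the sentence itself + 1 after = 3) rather than meaning 1 sentence at a time, but this is worth double-checking against the installed version.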
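
One way to read the `buffer_size` argument in the linked source is sketched below. This is a simplified re-implementation for illustration, not the library's own code, and the function name `sentence_windows` is made up. Each sentence is combined with up to `buffer_size` neighbours on either side before it is embedded.

```python
def sentence_windows(sentences, buffer_size=1):
    """Combine each sentence with up to `buffer_size` neighbours before and after it."""
    windows = []
    for i in range(len(sentences)):
        start = max(0, i - buffer_size)
        end = min(len(sentences), i + buffer_size + 1)
        windows.append(" ".join(sentences[start:end]))
    return windows


example = ["One.", "Two.", "Three.", "Four."]

# buffer_size=1: interior sentences get a window of 3, edge sentences a window of 2
windows = sentence_windows(example, buffer_size=1)
print(windows)

# buffer_size=0: every sentence stands alone
single = sentence_windows(example, buffer_size=0)
print(single)
```

Under this reading, a value of 1 yields 3-sentence windows for every sentence except the first and last.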

In [None]:
chunked_docs
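
The raw list can be hard to scan for a long conversation. As a small sketch, the hypothetical helper below summarises any list of objects that have a `page_content` attribute, which is the case for the documents in `chunked_docs`. The stand-in documents are made up so that the example runs without the embedding model.

```python
from types import SimpleNamespace


def summarise_chunks(docs, preview_chars=60):
    """Return one dict per chunk with its position, size and a short preview."""
    rows = []
    for i, doc in enumerate(docs):
        text = doc.page_content
        rows.append(
            {
                "chunk": i,
                "n_chars": len(text),
                "n_words": len(text.split()),
                "preview": text[:preview_chars],
            }
        )
    return rows


# Stand-ins for LangChain documents (only `page_content` is needed here)
fake_docs = [
    SimpleNamespace(page_content="How was your week? It was fine."),
    SimpleNamespace(page_content="Thanks for asking."),
]

summary = summarise_chunks(fake_docs)
for row in summary:
    print(row)
```

In the notebook, `summarise_chunks(chunked_docs)` could be wrapped in `pd.DataFrame(...)` to view the result as a table.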