# Parsing Scraped Transcripts

Next, we parse the scraped podcast transcripts into digestible blocks of text that we can pass to the OpenAI API.

My strategy was to use the section headings conveniently included in the transcripts to break each transcript into sections, using a regex.
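As a rough illustration of the heading-based split, here is a minimal sketch on a made-up transcript. It assumes each heading is a line followed by a blank line and a paragraph starting with an `[HH:MM:SS]` timestamp. The `split_by_headings` helper and the sample text are hypothetical, and the sketch uses the standard-library `re` module and match positions rather than the `regex` calls in the notebook cell further down.

```python
import re

# Assumed layout: a heading line, a blank line, then a paragraph that
# starts with an [HH:MM:SS] timestamp. Lines that themselves start with
# "[" are never treated as headings.
HEADING_RE = re.compile(
    r"^(?P<heading>[^\[\n].*)\n\n(?=\[\d{2}:\d{2}:\d{2}\])", re.MULTILINE
)


def split_by_headings(transcript: str) -> list[tuple[str, str]]:
    """Return (heading, section_text) pairs in document order."""
    matches = list(HEADING_RE.finditer(transcript))
    sections = []
    for idx, match in enumerate(matches):
        start = match.end()
        # A section runs until the next heading, or to the end of the text.
        end = matches[idx + 1].start() if idx + 1 < len(matches) else len(transcript)
        sections.append((match.group("heading"), transcript[start:end].strip()))
    return sections


sample = (
    "Introduction\n\n"
    "[00:01:38] Patrick: Welcome to the show.\n\n"
    "First Topic\n\n"
    "[00:02:19] Patrick: Let's begin.\n\n"
    "[00:02:40] Guest: Happy to.\n"
)
sections = split_by_headings(sample)
for heading, text in sections:
    print(heading, "->", len(text), "characters")
```

Slicing by match position avoids surprises when a heading's text also appears inside the body of a section.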

In [16]:
%pip install transformers
%pip install regex

Note: you may need to restart the kernel to use updated packages.


# Preprocessing the Document Library

I used the GPT-3 tokenizer to gauge how large my sections were. I realized later that some of the larger sections would need to be truncated because they exceed the token limit, but that should be OK.

For more info on the GPT-3 tokenizer:
- https://huggingface.co/docs/transformers/model_doc/gpt2#transformers.GPT2TokenizerFast
- https://beta.openai.com/tokenizer
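The truncation mentioned above could be sketched as follows. This is only an illustration under assumptions: `truncate_to_tokens` is a hypothetical helper, and the toy word-level `toy_encode` / `toy_decode` functions stand in for a real tokenizer so the snippet runs without downloading a model.

```python
from typing import Callable, List


def truncate_to_tokens(
    text: str,
    max_tokens: int,
    encode: Callable[[str], List[int]],
    decode: Callable[[List[int]], str],
) -> str:
    """Keep at most max_tokens tokens of text, as measured by encode."""
    token_ids = encode(text)
    if len(token_ids) <= max_tokens:
        return text
    return decode(token_ids[:max_tokens])


# Stand-in "tokenizer" (an assumption for this sketch): one token per
# whitespace-separated word, with ids assigned in order of first appearance.
vocab: List[str] = []


def toy_encode(text: str) -> List[int]:
    ids = []
    for word in text.split():
        if word not in vocab:
            vocab.append(word)
        ids.append(vocab.index(word))
    return ids


def toy_decode(ids: List[int]) -> str:
    return " ".join(vocab[i] for i in ids)


short = truncate_to_tokens("one two three four five", 3, toy_encode, toy_decode)
print(short)  # one two three
```

With the Hugging Face tokenizer, `tokenizer.encode` and `tokenizer.decode` could be passed in place of the toy functions.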

In [42]:
from transformers import GPT2TokenizerFast

# load the tokenizer once; reloading it on every call is slow
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

# counts the tokens in a text string
def count_tokens(text: str) -> int:
    return len(tokenizer(text)['input_ids'])


print(count_tokens('Hello world'))

2


Below, we process all transcripts and save the result to a CSV file with the columns `title`, `heading`, `content`, and `tokens`. This file serves as the input to the final notebook of this project.

In [61]:
import regex
import os.path
import pandas as pd

# data containers

data_dict = {
    'title': [],
    'heading': [],
    'content': [],
    'tokens': []
}

# outer loop: get title of each podcast
for i in range(1, 180):
    path = 'transcript_{}'.format(i)

    # skip missing files, since I manually deleted some transcripts that did not scrape properly
    if not os.path.isfile(path):
        continue

    # open the transcript as a long string in memory
    with open(path) as f:
        test_transcript = f.read()

    # the title is the second non-blank line after the "Released ..." line
    # (variable-length lookbehind needs the `regex` module, not `re`)
    r_title = regex.search(r'(?<=Released.*\n\n.*\n\n).*', test_transcript)

    # get section headers: non-empty lines followed by a blank line and a [HH:MM:SS] timestamp
    r = regex.findall(r'.*(?=\n\n\[[0-9]{2}\:[0-9]{2}\:[0-9]{2}\])', test_transcript)
    r = [h for h in r if h != '']

    # loop through section headers and isolate text between headers into content
    # (j rather than i, so the inner loop does not shadow the transcript index)
    for j in range(len(r)):
        if j + 1 != len(r):
            this_section = test_transcript.split(r[j])[1].split(r[j+1])[0]
        else:
            this_section = test_transcript.split(r[j])[1]

        data_dict['title'].append(r_title.group(0))
        data_dict['heading'].append(r[j])
        data_dict['content'].append(this_section)

        token_count = count_tokens(this_section)
        # if token_count > 2000:
        #     print('token count: ' + str(token_count) + '\n title: ' + r_title.group(0) + '\n heading: ' + r[j] + '\n content: \n' + this_section)

        data_dict['tokens'].append(token_count)

out_df = pd.DataFrame(data_dict)

Token indices sequence length is longer than the specified maximum sequence length for this model (2536 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1517 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (2411 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (4623 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (3464 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence leng

In [62]:
out_df[:3]

| | title | heading | content | tokens |
|---|---|---|---|---|
| 0 | Gabriel Leydon - How Web3 Onboards a Billion U... | Introduction | \n\n[00:01:38] Patrick: My guest today is Gabe... | 183 |
| 1 | Gabriel Leydon - How Web3 Onboards a Billion U... | Free-to-Own Gaming | \n\n[00:02:19] Patrick: All right, Gabe, so it... | 2536 |
| 2 | Gabriel Leydon - How Web3 Onboards a Billion U... | Three Waves of NFTs | \n\n[00:12:32] Patrick: Can you say a little b... | 1517 |


In [63]:
# remove the timestamps and newline characters, which could confuse the model
def replace_timestamps(input_str: str):
    # \[ matches a literal opening bracket before the HH:MM:SS digits
    return regex.sub(r'.*\[[0-9]{2}:[0-9]{2}:[0-9]{2}\]', '', input_str)

def remove_newlines(input_str: str):
    return input_str.replace('\n', '')

# note: the `tokens` column was computed before this cleanup, so it slightly overcounts
out_df['content'] = out_df['content'].apply(replace_timestamps)
out_df['content'] = out_df['content'].apply(remove_newlines)
out_df[:3]

| | title | heading | content | tokens |
|---|---|---|---|---|
| 0 | Gabriel Leydon - How Web3 Onboards a Billion U... | Introduction | Patrick: My guest today is Gabe Leydon, whose... | 183 |
| 1 | Gabriel Leydon - How Web3 Onboards a Billion U... | Free-to-Own Gaming | Patrick: All right, Gabe, so it's been almost... | 2536 |
| 2 | Gabriel Leydon - How Web3 Onboards a Billion U... | Three Waves of NFTs | Patrick: Can you say a little bit about this ... | 1517 |
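One thing to watch with the cleanup above: deleting newlines outright can glue the last word of one paragraph to the first word of the next. A possible variant, shown here only as a sketch, strips the timestamps and then collapses every run of whitespace to a single space. The `clean_section` helper and the sample string are hypothetical, and the standard-library `re` module is used so the snippet runs on its own.

```python
import re

# Matches a literal [HH:MM:SS] timestamp.
TIMESTAMP_RE = re.compile(r"\[\d{2}:\d{2}:\d{2}\]")


def clean_section(text: str) -> str:
    """Drop [HH:MM:SS] timestamps and collapse whitespace runs to one space."""
    without_timestamps = TIMESTAMP_RE.sub("", text)
    # str.split() with no arguments splits on any whitespace, newlines included.
    return " ".join(without_timestamps.split())


raw = "\n\n[00:01:38] Patrick: My guest today is Gabe.\n\nHe builds games."
cleaned = clean_section(raw)
print(cleaned)  # Patrick: My guest today is Gabe. He builds games.
```

If token counts matter downstream, they could be recomputed on the cleaned text, since removing timestamps changes the count.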


In [64]:
out_df.to_csv('preprocessed.csv')

In [8]:
# # test tokenizer on one of the transcripts
# with open('transcript_1') as f:
#     test_transcript = f.readlines()

# count_tokens(test_transcript)

628

In [None]:
# # get number of tokens for all transcripts
# for i in range(1, 180):
#     with open('transcript_{}'.format(i)) as f:
#         test_transcript = f.readlines()
#         print('transcript_{}'.format(i) + ' tokens: ' + str(count_tokens(test_transcript)))

# # I manually cut some of the transcripts with very few tokens


In [None]:
# # get a title for each transcript
# with open('transcript_1') as f:
#     test_transcript = f.read()
#     # print(test_transcript)

# import regex

# r = regex.search(r'(?<=Released.*\n\n.*\n\n).*', test_transcript)
# print(r.group(0))

In [None]:
# import regex
# import os.path

# # get title of each podcast
# for i in range(1, 180):

#     if not (os.path.isfile('transcript_{}'.format(i))):
#         continue 

#     with open('transcript_{}'.format(i)) as f:
#         test_transcript = f.read()
    
#     # print('transcript_{}'.format(i) + ' tokens: ' + str(count_tokens(test_transcript)))

#     r = regex.search(r'(?<=Released.*\n\n.*\n\n).*', test_transcript)
#     print('transcript_{}'.format(i) + ': ' + r.group(0))

In [None]:
# # get section headers
# with open('transcript_3') as f:
#     test_transcript = f.read()
#     # print(test_transcript)

# r = regex.findall(r'.*(?=\n\n\[[0-9]{2}\:[0-9]{2}\:[0-9]{2}\])', test_transcript)
# r = [i for i in r if i != '']
# this_section = ''
# for i in range(0, len(r)):
#     if (i + 1 != len(r)):
#         this_section = test_transcript.split(r[i])[1].split(r[i+1])[0]
#     else:
#         this_section = test_transcript.split(r[i])[1]
#     print('section title: ' + r[i])
#     print(this_section)




In [None]:
# with open('transcript_3') as f:
#     test_transcript = f.read()
#     print(test_transcript)