### Cleanup of scraped transcripts for sentiment analysis

The [transcripts that I found](https://itysldb.com/) were continuous blocks of dialogue text, not formatted scripts. Somewhere along the way, some numbers and words were dropped from the text. I rewatched each episode and filled in some of the gaps by hand, but there are probably still errors that I missed.

This notebook uses the NLP library `spaCy` to split each transcript into full sentences. I'll then run [sentiment analysis](https://colab.research.google.com/drive/1GuUbnw1pVMQrNDVKJ98bJGYvR3YWWB_D?usp=sharing) on the individual sentences.

In [1]:
import pandas as pd
import spacy # for nlp


In [2]:
# transcripts scraped from https://itysldb.com/
df = pd.read_json('transcripts.json')

In [3]:
df

Unnamed: 0,slug,name,season,episode,id,netflixLink,transcript
0,both-ways,Both Ways,1,1,1,https://www.netflix.com/watch/80986856?t=7,"Obviously, I'd love to work for you, and I app..."
1,has-this-ever-happened-to-you,Has This Ever Happened To You,1,1,2,https://www.netflix.com/watch/80986856?t=101,Have you been the victim of unfair treatment b...
2,baby-of-the-year,Baby of the Year,1,1,3,https://www.netflix.com/watch/80986856?t=208,Look at their rolls. Look at their folds. Look...
3,instagram,Instagram,1,1,4,https://www.netflix.com/watch/80986856?t=444,Let me see it. Oh yeah that's good. That's gre...
4,gift-receipt,Gift Receipt,1,1,5,https://www.netflix.com/watch/80986856?t=563,Maybe I'll go for this. That could be a good o...
...,...,...,...,...,...,...,...
81,house-party,House Party,3,5,82,https://www.netflix.com/watch/81643783?t=413,Right? It's so nice you live this close. Yeah....
82,banana-breath,Banana Breath,3,6,83,https://www.netflix.com/watch/81643784?t=7,I'm gonna show a quick scenario that has right...
83,photo-wall-of-metal-metal-motto-search,Photo Wall of Metal: Metal Motto Search,3,6,84,https://www.netflix.com/watch/81643784?t=206,"Welcome to Photo Wall of Metal, the Metal Mott..."
84,don-bondarley-king-of-the-dirty-songs,"Don Bondarley, King of the Dirty Songs",3,6,85,https://www.netflix.com/watch/81643784?t=404,"Seriously, this has been a perfect weekend so ..."


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 86 entries, 0 to 85
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   slug         86 non-null     object
 1   name         86 non-null     object
 2   season       86 non-null     int64 
 3   episode      86 non-null     int64 
 4   id           86 non-null     int64 
 5   netflixLink  86 non-null     object
 6   transcript   86 non-null     object
dtypes: int64(3), object(4)
memory usage: 4.8+ KB


In [5]:
transcripts = df['transcript'].to_list()
transcripts[-1] # peek at last item

"All right, everybody, let's get those lunch orders in. Put your order in, right into the app. Ooh, Trebinetti's. David, can I talk to you? Of course. I love their salads. Draven. Don't know if we're gonna have enough GOTV buses for this weekend without Mike Flaherty. Okay. This isn't the first time Mike flaked. Concerns me come primary day. I don't know what other options we have. I asked Lynn to quietly check with George Faust if he's willing to help. I'd love not to go down that road. You could talk to him, but I know him more than anyone so I don't know what'll help. Let me think. Thank you for bringing it to my attention. Oh, thank you. All right. Everybody all in? Everybody good? Let's go. Get motivated. Fifteen days. Fifteen days. All right! I'm mad at you. What? I'm mad at you! You're not following me on Instagram. Oh. Didn't know you were on there. I'm on there. I'll look. I'll find you. It's okay. I followed myself from your phone. You followed yourself from my phone? What th

In [6]:
transcripts[0] # peek at first item

"Obviously, I'd love to work for you, and I appreciate you taking the time to meet with me. I... I feel good about it. I hope I didn't do too much talking.\nNo, you were great. You were great.\nI hope to hear from you soon. We'll be in touch.\nReally nice meeting you. Nice to meet you as well. Okay. Oh!\nLooks like you push.\nOh, it does both.\nWhat?\nIt does both. I was here yesterday, and it actually goes both ways.\nOh, okay. Okay, see you. See? Hope to hear from you soon."

The first sketch seems to be the only one with `\n` line breaks, so I'll remove them.
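The "only one with line breaks" observation can be checked rather than eyeballed. Here is a minimal sketch that assumes the transcripts are a plain list of strings. `sample_transcripts` and both helper names are made up for illustration, and the toy strings stand in for the real data.

```python
# Toy stand-in for the `transcripts` list (hypothetical data, not the real transcripts).
sample_transcripts = [
    "No, you were great.\nYou were great.",
    "Look at their rolls. Look at their folds.",
    "Let me see it. Oh yeah that's good.",
]

def indices_with_newlines(texts):
    """Return the positions of the texts that contain at least one newline."""
    return [i for i, text in enumerate(texts) if "\n" in text]

def normalize_newlines(texts):
    """Return a new list with every newline replaced by a single space."""
    return [text.replace("\n", " ") for text in texts]

affected = indices_with_newlines(sample_transcripts)
cleaned = normalize_newlines(sample_transcripts)

print("Transcripts with line breaks:", affected)
print("Remaining after cleanup:", indices_with_newlines(cleaned))
```

Running the same check on the real list would show whether index 0 really is the only affected transcript.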

In [7]:
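# replace \n with a space, and write the cleaned text back to df,
# since the extraction loop later reads from df rather than from this list
transcripts[0] = transcripts[0].replace('\n', ' ')
df.loc[0, 'transcript'] = transcripts[0]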


In [8]:
# verify that it was replaced
transcripts[0]

"Obviously, I'd love to work for you, and I appreciate you taking the time to meet with me. I... I feel good about it. I hope I didn't do too much talking. No, you were great. You were great. I hope to hear from you soon. We'll be in touch. Really nice meeting you. Nice to meet you as well. Okay. Oh! Looks like you push. Oh, it does both. What? It does both. I was here yesterday, and it actually goes both ways. Oh, okay. Okay, see you. See? Hope to hear from you soon."

Note to self: download the model before trying to load it. From a terminal:

`python -m spacy download en_core_web_sm`
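Since spaCy pipelines such as `en_core_web_sm` are installed as ordinary Python packages, a quick guard can tell me whether the download step was skipped. This is a minimal sketch using only the standard library. The helper name `model_installed` is my own, not part of spaCy.

```python
import importlib.util

def model_installed(package_name: str) -> bool:
    """Return True if a top-level package with this name can be found."""
    return importlib.util.find_spec(package_name) is not None

if not model_installed("en_core_web_sm"):
    # Only a reminder; this sketch does not try to download anything itself.
    print("Model missing; run: python -m spacy download en_core_web_sm")
```

Checking up front gives a clearer message than waiting for the load step to fail.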

In [9]:
# load the small English pipeline (en_core_web_sm)
# then run it on a single transcript (index 0) to get a spaCy Doc

nlp = spacy.load("en_core_web_sm")
docs = nlp(transcripts[0])

In [10]:
# print each sentence (a Span in docs.sents) from one sketch transcript
for sent in docs.sents:
    print(sent)

Obviously, I'd love to work for you, and I appreciate you taking the time to meet with me.
I...
I feel good about it.
I hope I didn't do too much talking.
No, you were great.
You were great.
I hope to hear from you soon.
We'll be in touch.
Really nice meeting you.
Nice to meet you as well.
Okay.
Oh!
Looks like you push.
Oh, it does both.
What?
It does both.
I was here yesterday, and it actually goes both ways.
Oh, okay.
Okay, see you.
See?
Hope to hear from you soon.
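As a rough cross-check on the segmentation above, a naive splitter that breaks after `.`, `!` or `?` followed by whitespace gives a baseline to compare against. This is explicitly not how spaCy segments text. It is a regex stand-in, and `naive_sentences` is a hypothetical helper name.

```python
import re

def naive_sentences(text):
    """Split on whitespace that follows '.', '!' or '?'. A crude baseline only."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]

excerpt = "I... I feel good about it. I hope I didn't do too much talking. No, you were great."
for sentence in naive_sentences(excerpt):
    print(sentence)
```

On this excerpt the baseline also treats `I...` as its own sentence. A rule like this breaks on abbreviations and similar cases, so it is only useful as a sanity check.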


In [44]:
# Extract sentences from all sketches using spaCy
sentences_data = []

print(f"Processing {len(df)} sketches...")

for idx, row in df.iterrows():
    sketch_id = row['id']
    sketch_name = row['name']
    slug = row['slug']
    season = row['season']
    episode = row['episode']
    transcript = row['transcript']
    
    # Process transcript with spaCy
    doc = nlp(transcript)
    
    # Extract sentences
    # Note: 'sent' is a Span object (a slice/view of the Doc)
    # Span objects have a .text attribute that gives you the sentence as a string
    for sent_idx, sent in enumerate(doc.sents, start=1):
        sentence_text = sent.text.strip()  # sent.text extracts the string from the Span
        
        # Skip empty sentences
        if not sentence_text:
            continue
            
        sentences_data.append({
            'sketch_id': sketch_id,
            'sketch_name': sketch_name,
            'slug': slug,
            'season': season,
            'episode': episode,
            'sentence_index': sent_idx,
            'sentence_text': sentence_text
        })
    
    if (idx + 1) % 10 == 0:
        print(f"Processed {idx + 1}/{len(df)} sketches...")

print(f"Done! Extracted {len(sentences_data)} sentences from {len(df)} sketches.")


Processing 86 sketches...
Processed 10/86 sketches...
Processed 20/86 sketches...
Processed 30/86 sketches...
Processed 40/86 sketches...
Processed 50/86 sketches...
Processed 60/86 sketches...
Processed 70/86 sketches...
Processed 80/86 sketches...
Done! Extracted 6509 sentences from 86 sketches.


In [45]:
# Create a DataFrame from the extracted sentences
sentences_df = pd.DataFrame(sentences_data)

# Display summary information
print(f"\nTotal sentences extracted: {len(sentences_df)}")
print("\nSentences per sketch statistics:")
print(sentences_df.groupby('sketch_id').size().describe())
print("\nFirst few rows:")
sentences_df.head(10)



Total sentences extracted: 6509

Sentences per sketch statistics:
count     86.000000
mean      75.686047
std       38.871209
min        4.000000
25%       51.250000
50%       75.000000
75%       95.000000
max      199.000000
dtype: float64

First few rows:


Unnamed: 0,sketch_id,sketch_name,slug,season,episode,sentence_index,sentence_text
0,1,Both Ways,both-ways,1,1,1,"Obviously, I'd love to work for you, and I app..."
1,1,Both Ways,both-ways,1,1,2,I...
2,1,Both Ways,both-ways,1,1,3,I feel good about it.
3,1,Both Ways,both-ways,1,1,4,I hope I didn't do too much talking.
4,1,Both Ways,both-ways,1,1,5,"No, you were great."
5,1,Both Ways,both-ways,1,1,6,You were great.
6,1,Both Ways,both-ways,1,1,7,I hope to hear from you soon.
7,1,Both Ways,both-ways,1,1,8,We'll be in touch.
8,1,Both Ways,both-ways,1,1,9,Really nice meeting you.
9,1,Both Ways,both-ways,1,1,10,Nice to meet you as well.
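The per-sketch statistics above boil down to counting rows per `sketch_id` and summarizing those counts. Here is a library-free sketch of the same idea, assuming rows shaped like the dicts in `sentences_data`. `sample_rows` is made-up toy data, not the real output.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical miniature of `sentences_data`, for illustration only.
sample_rows = [
    {"sketch_id": 1, "sentence_text": "I feel good about it."},
    {"sketch_id": 1, "sentence_text": "We'll be in touch."},
    {"sketch_id": 1, "sentence_text": "Oh!"},
    {"sketch_id": 2, "sentence_text": "Look at their rolls."},
]

# Number of sentences per sketch, keyed by sketch_id.
counts = Counter(row["sketch_id"] for row in sample_rows)

summary = {
    "count": len(counts),                # number of sketches
    "mean": mean(counts.values()),       # average sentences per sketch
    "min": min(counts.values()),
    "median": median(counts.values()),
    "max": max(counts.values()),
}
print(summary)
```

Looking at which `sketch_id` has the minimum count is a cheap way to spot transcripts that may be truncated or badly segmented.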


In [46]:
# Save to CSV format (useful for pandas/huggingface datasets)
sentences_df.to_csv('sentences_extracted.csv', index=False)
print("Saved to 'sentences_extracted.csv'")

# Save to JSON format (useful for huggingface transformers)
sentences_df.to_json('sentences_extracted.json', orient='records', indent=2)
print("Saved to 'sentences_extracted.json'")

Saved to 'sentences_extracted.csv'
Saved to 'sentences_extracted.json'
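With `orient='records'`, the JSON file is a list of objects, one per sentence, so it can be read back without pandas. Here is a minimal sketch of that layout using the standard `json` module. It round-trips an in-memory string rather than touching the real file, and `sample_rows` is toy data.

```python
import json

# Two toy records in the same shape as the rows of sentences_df.
sample_rows = [
    {"sketch_id": 1, "sketch_name": "Both Ways", "slug": "both-ways",
     "season": 1, "episode": 1, "sentence_index": 1,
     "sentence_text": "I feel good about it."},
    {"sketch_id": 1, "sketch_name": "Both Ways", "slug": "both-ways",
     "season": 1, "episode": 1, "sentence_index": 2,
     "sentence_text": "We'll be in touch."},
]

serialized = json.dumps(sample_rows, indent=2)   # records layout: a JSON array of objects
loaded = json.loads(serialized)

# Pull out just the text column, e.g. to feed a sentiment model.
texts = [row["sentence_text"] for row in loaded]
print(texts)
```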
