# Feature Engineering for Standup Scripts
## Goal: Generate useful features related to standup comedy style.

## Table of Contents

   - [x] [imports](#imports)
   - [x] [prepare the data](#prepare-the-data)
   - [x] [word lengths](#word-lengths)
   - [x] [sentence lengths](#sentence-lengths)
   - [x] [distinct words](#distinct-words)
   - [x] [words per minute and sentences per minute](#words-per-minute-and-sentences-per-minute)
   - [ ] [repetition and phrases](#repetition-and-phrases)
       - [with gensim](#with-gensim)
       - [with sklearn](#with-sklearn)
   - [ ] [LDA topic model](#LDA-topic-model)
   - [x] [profanity](#profanity)
   - [ ] [part-of-speech frequencies](#part-of-speech-frequencies)
   - [ ] [sentence structure](#sentence-structure)
   - [ ] [point of view](#point-of-view)
   - [ ] [sentiment](#sentiment)
   - [ ] [polarity](#polarity)
   - [ ] [cosine similarity](#cosine-similarity)
   - [playground](#playground)

## imports 


In [317]:
import pickle
import numpy as np
import pandas as pd
from datetime import date
import json
from tqdm.notebook import tqdm

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

import re
from collections import Counter, defaultdict
import itertools
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize, regexp_tokenize
import gensim
from gensim.corpora.dictionary import Dictionary
import spacy

import warnings
warnings.filterwarnings('ignore')

In [181]:
sw = stopwords.words("english")

In [182]:
transcripts_df = pd.read_pickle('../data/transcripts_raw_df.pickle')

In [183]:
with open('../data/imdb_title_results_2022-05-23.pickle', 'rb') as file:
    show_meta = pickle.load(file)

In [184]:
with open('../data/metascripts_df_2022-05-28.pickle', 'rb') as file:
    metascripts = pickle.load(file)

In [185]:
metascripts.head()

Unnamed: 0,description,link,transcript,script characters,id,artist,title,fullTitle,year,image,...,genres,genreList,companies,companyList,contentRating,imDbRating,imDbRatingVotes,similars,languages,languageList
0,Jim Gaffigan: Comedy Monster (2021) | Transcript,https://scrapsfromtheloft.com/comedy/jim-gaffi...,"Thank you! Thank you! Oh, my gosh. Thank you s...",49799,tt15907298,Jim Gaffigan,Jim Gaffigan: Comedy Monster,Jim Gaffigan: Comedy Monster (2021),2021,https://imdb-api.com/images/original/MV5BMDcyN...,...,Comedy,"[{'key': 'Comedy', 'value': 'Comedy'}]",The Nacelle Company,"[{'id': 'co0649705', 'name': 'The Nacelle Comp...",TV-14,6.8,1618,"[{'id': 'tt6090102', 'title': 'Jim Gaffigan: C...",English,"[{'key': 'English', 'value': 'English'}]"
1,Louis C. K.: Sorry (2021) | Transcript,https://scrapsfromtheloft.com/comedy/louis-c-k...,♪♪ [“Like a Rolling Stone” by Bob Dylan playin...,44669,tt16491756,Louis C.K.,Sorry,Sorry (2021),2021,https://imdb-api.com/images/original/MV5BOWNkN...,...,Comedy,"[{'key': 'Comedy', 'value': 'Comedy'}]",,[],,7.7,2363,"[{'id': 'tt12087624', 'title': 'Sincerely Loui...",English,"[{'key': 'English', 'value': 'English'}]"
2,Drew Michael: Drew Michael (2018) | Transcript,https://scrapsfromtheloft.com/comedy/drew-mich...,“This is the latest I’ve stayed up in a long t...,40006,tt8563704,Drew Michael,Drew Michael: Drew Michael,Drew Michael: Drew Michael (2018),2018,https://imdb-api.com/images/original/MV5BMDkyZ...,...,Comedy,"[{'key': 'Comedy', 'value': 'Comedy'}]",A24 Television,"[{'id': 'co0702684', 'name': 'A24 Television'}]",TV-MA,5.4,368,"[{'id': 'tt16153658', 'title': 'Drew Michael: ...",English,"[{'key': 'English', 'value': 'English'}]"
3,Drew Michael: Red Blue Green (2021) | Transcript,https://scrapsfromtheloft.com/comedy/drew-mich...,(EMOTIONAL MUSIC PLAYING) (MUSIC ENDS) DREW MI...,50422,tt16153658,Drew Michael,Drew Michael: Red Blue Green,Drew Michael: Red Blue Green (2021),2021,https://imdb-api.com/images/original/MV5BNTcxM...,...,Comedy,"[{'key': 'Comedy', 'value': 'Comedy'}]","Rotten Science, HBO Films","[{'id': 'co0602462', 'name': 'Rotten Science'}...",TV-MA,6.9,261,"[{'id': 'tt8563704', 'title': 'Drew Michael: D...",English,"[{'key': 'English', 'value': 'English'}]"
4,Mo Amer: Mohammed in Texas (2021) | Transcript,https://scrapsfromtheloft.com/comedy/mo-amer-m...,[quirky flute music playing] [single note pian...,58020,tt15845288,Mo Amer,Mo Amer: Mohammed in Texas,Mo Amer: Mohammed in Texas (2021),2021,https://imdb-api.com/images/original/MV5BMDI1M...,...,Comedy,"[{'key': 'Comedy', 'value': 'Comedy'}]",A24,"[{'id': 'co0390816', 'name': 'A24'}]",TV-MA,6.5,615,"[{'id': 'tt9060526', 'title': 'Mo Amer: The Va...",English,"[{'key': 'English', 'value': 'English'}]"


In [186]:
metascripts.shape

(316, 24)

## prepare the data


In [252]:
# Remove bracketed and parenthetical content from scripts, then normalize curly quotes
metascripts['transcript'] = (metascripts['transcript']
                                 .replace(r"\[.+?\]|\(.+?\)", "", regex = True)
                                 .replace(r"’|‘", "'", regex = True)
                                 .replace(r"“|”", '"', regex = True))

In [253]:
# Fill censored words to clean up our profanity detection
with open('../data/profanity_fill.json') as file:
    profanity_fill = json.load(file)

for key, value in profanity_fill.items(): 
    metascripts['transcript'] = metascripts['transcript'].str.replace(key, value, regex = False)
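
As a rough illustration of the fill step, the sketch below applies literal (non-regex) replacements to a single string. The `toy_fill` mapping and the `fill_censored` helper are hypothetical; the actual contents of `profanity_fill.json` are not shown in this notebook.

```python
# Hypothetical mapping from censored spellings to the full word;
# the real profanity_fill.json may look different.
toy_fill = {"d*mn": "damn", "h*ll": "hell"}

def fill_censored(text, fill):
    """Replace each censored spelling literally, so '*' is not treated as a regex operator."""
    for censored, full in fill.items():
        text = text.replace(censored, full)
    return text

filled = fill_censored("Well, d*mn. What the h*ll?", toy_fill)
print(filled)
```

Literal replacement matters here because `*` is a regex quantifier, which is presumably why the cell above passes `regex = False`.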

In [254]:
transcripts_dict = dict(zip(metascripts['description'].values, metascripts['transcript'].values))

In [255]:
descriptions = list(transcripts_dict.keys())
scripts = list(transcripts_dict.values())

In [191]:
# Build a list, not a generator: a generator would be exhausted after the first pass below
parens = [re.findall(r"\(.+?\)", script) for script in scripts]
[(ind, len(matches)) for ind, matches in enumerate(parens) if len(matches) > 0]
parenscripts = (scripts[ind] for ind, matches in enumerate(parens) if len(matches) > 0)

## word lengths
word lengths are calculated as letters per word

   * tokenize words (letters only: the pattern `[a-zA-Z]+` drops numbers and splits words at apostrophes and dashes)
   * do not lemmatize
   * do not remove stopwords
   
[to the top](#Feature-Engineering-for-Standup-Scripts)

In [192]:
bow_cased = [regexp_tokenize(transcript, r"[a-zA-Z]+") for description, transcript in transcripts_dict.items()]
bow_counter = [Counter(word.lower() for word in script_words) for script_words in bow_cased]

tokenized_list = [[word.lower() for word in script_words] for script_words in bow_cased]
dictionary = Dictionary(tokenized_list)
corpus = [dictionary.doc2bow(script) for script in tokenized_list]

In [193]:
word_lengths = [[len(word) for word in script_words] for script_words in tokenized_list]

In [194]:
metascripts['mean word length'] = [np.mean(script_word_lengths) for script_word_lengths in word_lengths]
metascripts['std word length'] = [np.std(script_word_lengths) for script_word_lengths in word_lengths]

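# Label the quartile columns Q1, Q2, Q3 (quantile/0.25 would render as Q1.0, Q2.0, Q3.0)
for q, quantile in enumerate((0.25, 0.50, 0.75), start = 1):
    metascripts[f'Q{q} word length'] = [np.quantile(script_word_lengths, quantile) for script_word_lengths in word_lengths]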

metascripts['max word length'] = [np.max(script_word_lengths) for script_word_lengths in word_lengths]
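
To see what the letters-only pattern does to a single script, here is a minimal standard-library sketch. `word_length_stats` and the toy sentence are mine, not part of the pipeline; `pstdev` is the population standard deviation, which is what `np.std` computes by default.

```python
import re
from statistics import mean, pstdev

def word_length_stats(text):
    """Letters-per-word summary for one transcript, using letters-only tokens."""
    words = re.findall(r"[a-zA-Z]+", text)
    lengths = [len(word) for word in words]
    return {"mean": mean(lengths), "std": pstdev(lengths), "max": max(lengths)}

toy = "Thank you! Oh, my gosh. Don't stop."
stats = word_length_stats(toy)
print(stats)
```

Note that "Don't" is split into `Don` and `t`, so contractions add very short tokens and pull the mean down a little.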

## sentence lengths
sentence lengths are calculated as words per sentence

   * tokenize sentences, then count the word tokens in each sentence
   * do not remove stopwords
   * get arrays so we can do mean, median, boxplot values, standard deviation
   
[to the top](#Feature-Engineering-for-Standup-Scripts)

In [195]:
sent_tokenized_list = [sent_tokenize(transcript) for description, transcript in transcripts_dict.items()]
sent_words_tokenized_list = [[regexp_tokenize(sent, r"['\-\w]+") for sent in sent_script] for sent_script in sent_tokenized_list]
sent_lengths = [[len(sent) for sent in script] for script in sent_words_tokenized_list]
sent_counts = [len(script) for script in sent_tokenized_list]

In [196]:
metascripts['mean sentence length'] = [np.mean(script_sent_lengths) for script_sent_lengths in sent_lengths]
metascripts['std sentence length'] = [np.std(script_sent_lengths) for script_sent_lengths in sent_lengths]

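# Label the quartile columns Q1, Q2, Q3 (quantile/0.25 would render as Q1.0, Q2.0, Q3.0)
for q, quantile in enumerate((0.25, 0.50, 0.75), start = 1):
    metascripts[f'Q{q} sentence length'] = [np.quantile(script_sent_lengths, quantile) for script_sent_lengths in sent_lengths]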

metascripts['max sentence length'] = [np.max(script_sent_lengths) for script_sent_lengths in sent_lengths]
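
A small sketch of the words-per-sentence idea on a toy string. To keep it dependency-free I swap in a naive splitter on `.`, `!` and `?` for nltk's `sent_tokenize`; that splitter is an assumption of this example and will do worse than nltk on abbreviations and ellipses.

```python
import re

def sentence_lengths(text):
    """Words per sentence, with a naive punctuation splitter standing in for sent_tokenize."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Same word pattern as the cell above: apostrophes and dashes stay inside a word
    return [len(re.findall(r"['\-\w]+", sentence)) for sentence in sentences]

toy = "Thank you! Oh, my gosh. That's so nice."
lengths = sentence_lengths(toy)
print(lengths)
```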

## distinct words
count distinct words in each show and normalize by determining the proportion of distinct words and distinct words per sentence

   * tokenize: letters only (this reuses `bow_cased`, so numbers are dropped and words are split at apostrophes and dashes)
   * lemmatize
   
[to the top](#Feature-Engineering-for-Standup-Scripts)

In [197]:
from nltk.stem.wordnet import WordNetLemmatizer

In [198]:
lemmatizer = WordNetLemmatizer()
lem_counter = [Counter(lemmatizer.lemmatize(word.lower()) for word in script_words) for script_words in bow_cased]

In [199]:
unique_word_counts = [len(script_lem_counts) for script_lem_counts in lem_counter]
total_word_counts = [np.sum([count for lem, count in script_lem_counts.items()]) for script_lem_counts in lem_counter]
unique_total_ratio = [unique/total for unique, total in zip(unique_word_counts, total_word_counts)]
unique_per_sent = [unique/sent_count for unique, sent_count in zip(unique_word_counts, sent_counts)]

In [200]:
metascripts['unique words'] = unique_word_counts
metascripts['total words'] = total_word_counts
metascripts['proportion unique words'] = unique_total_ratio
metascripts['unique words per sentence'] = unique_per_sent
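
The four features above can be sketched for a single script as follows. This is a simplified version that skips lemmatization (so `dog` and `dogs` count as different words), and `distinct_word_features` is a hypothetical helper name.

```python
import re
from collections import Counter

def distinct_word_features(text, n_sentences):
    """Distinct-word counts and ratios for one script (no lemmatization in this sketch)."""
    counts = Counter(word.lower() for word in re.findall(r"[a-zA-Z]+", text))
    unique, total = len(counts), sum(counts.values())
    return {
        "unique words": unique,
        "total words": total,
        "proportion unique words": unique / total,
        "unique words per sentence": unique / n_sentences,
    }

toy = "Thank you. Thank you so much."
features = distinct_word_features(toy, n_sentences=2)
print(features)
```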

## words per minute and sentences per minute

In [214]:
word_tok_scripts = [regexp_tokenize(script, r"[\w'-]+") for script in scripts]
words_per_minute = [len(script_words)/minutes for script_words, minutes in zip(word_tok_scripts, metascripts['runtimeMins'].values)]

sent_tok_scripts = [sent_tokenize(script) for script in scripts]
sent_per_minute = [len(script_sentences)/minutes for script_sentences, minutes in zip(sent_tok_scripts, metascripts['runtimeMins'].values)]

In [215]:
metascripts['words per minute'] = words_per_minute
metascripts['sentences per minute'] = sent_per_minute
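
The cell above divides by `runtimeMins` directly. If any show turns out to have a missing or zero runtime (I have not checked this here), a guarded division along these lines would avoid errors or infinite rates. `per_minute` is a hypothetical helper, not something used elsewhere in the notebook.

```python
import math

def per_minute(count, runtime_minutes):
    """Return count / runtime, or NaN when the runtime is missing or not positive."""
    if runtime_minutes is None or math.isnan(runtime_minutes) or runtime_minutes <= 0:
        return float("nan")
    return count / runtime_minutes

print(per_minute(9000, 60))            # a 60-minute show with 9000 word tokens
print(per_minute(9000, float("nan")))  # missing runtime
```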

## repetition and phrases
I've found two ways to get ngrams:

   1. Using Gensim's [Phrases model](https://radimrehurek.com/gensim_3.8.3/models/phrases.html) iteratively across the corpus: each pass merges frequently co-occurring adjacent tokens, so the first pass yields bigrams and later passes can join those into longer phrases
   2. Using one of SKLearn's text feature extraction modules, [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) or [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer), which is equivalent to the CountVectorizer followed by the [TfidfTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer)
   
[to the top](#Feature-Engineering-for-Standup-Scripts)

### with gensim

In [83]:
from gensim.models import Phrases

In [84]:
tok_scripts = [regexp_tokenize(transcript, r"['\-\w]+") for description, transcript in transcripts_dict.items()]
docs_lem = [[lemmatizer.lemmatize(tok.lower()) for tok in transcript] for transcript in tok_scripts]
docs_no_lem = [[tok.lower() for tok in transcript] for transcript in tok_scripts]

In [228]:
# still not picking up anything greater than a bigram. May need to reduce the Phrases threshold.

def append_ngrams(docs, ngram):
    for idx in range(len(docs)):
        for token in ngram[docs[idx]]:
            if '_' in token:
                # if token is an ngram, add to document.
                docs[idx].append(token)
    return docs

def make_ngrams(tok_corpus, with_dict = False, lemmatize = True, max_n = 2, min_count = 5, **kwargs):
    ngram_dict = {}
    if lemmatize:
        docs = [[lemmatizer.lemmatize(tok.lower()) for tok in transcript] for transcript in tok_corpus]
    else:
        docs = [[tok.lower() for tok in transcript] for transcript in tok_corpus]
    for n in range(2, max_n+1):
        if n == 2:
            ngram_dict[f'{str(n)}grams'] = Phrases(docs, min_count = min_count, **kwargs)
        else:
            ngram_dict[f'{str(n)}grams'] = Phrases(ngram_dict[f'{str(n-1)}grams'][docs], min_count = min_count, **kwargs)
    docs = append_ngrams(docs, ngram_dict[f'{str(max_n)}grams'])
    if with_dict:
        return docs, ngram_dict
    else:
        return docs

In [229]:
docs, ngram_dict = make_ngrams(tok_scripts, with_dict = True, lemmatize = True, max_n = 4, min_count = 1, threshold = 1)

In [222]:
ngram_dict

{'2grams': <gensim.models.phrases.Phrases at 0x1dca340a1f0>,
 '3grams': <gensim.models.phrases.Phrases at 0x1dc899a40d0>,
 '4grams': <gensim.models.phrases.Phrases at 0x1dca168d6d0>}

In [230]:
for ind in range(len(docs)):
    c = Counter(tok for tok in docs[ind] if re.search("(.+_){2}", tok))
    if len(c) > 0:
        print(c)

Counter({'__quarteroid__': 2})
Counter({'lick_my_ass': 2})


In [89]:
[(descriptions[ind], len(re.findall("what is that", script.lower()))) for ind, script in enumerate(scripts) if len(re.findall("what is that", script.lower())) > 0 ][15:19]

[('JIM NORTON: AMERICAN DEGENERATE (2013) – FULL TRANSCRIPT', 1),
 ('CHRIS D’ELIA: WHITE MALE. BLACK COMIC. (2013) – FULL TRANSCRIPT', 1),
 ('Bert Kreischer: Hey Big Boy (2020) – Transcript', 1),
 ('Marc Maron: End Times Fun (2020) – Full Transcript', 4)]
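
Since the Phrases passes above surface very little beyond bigrams, one alternative for measuring repetition is a plain sliding-window count of n-grams within each script. This is a different technique from gensim's collocation scoring (it counts every run of `n` tokens, with no threshold on co-occurrence), and `repeated_ngrams` is a hypothetical helper.

```python
from collections import Counter

def repeated_ngrams(tokens, n, min_count=2):
    """Count every run of n consecutive tokens and keep those seen at least min_count times."""
    grams = zip(*(tokens[i:] for i in range(n)))
    counts = Counter(" ".join(gram) for gram in grams)
    return {gram: count for gram, count in counts.items() if count >= min_count}

toy_tokens = "what is that what is that thing".split()
repeats = repeated_ngrams(toy_tokens, n=3)
print(repeats)
```

Applied per script to the lowercased tokens, the size of the returned dict (or the sum of its counts) could serve as a simple repetition feature.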

### with sklearn

In [122]:
import pyLDAvis
import pyLDAvis.sklearn
pyLDAvis.enable_notebook()

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

In [125]:
%%time
# %%time instead of %%timeit: names assigned under %%timeit are not kept for later cells
ct_vectorizer = CountVectorizer(lowercase = True, 
                             token_pattern = r"\b[a-zA-Z][a-zA-Z\-']*\b", 
                             ngram_range = (1, 4),
                             stop_words = "english",
                            )
scripts_tf = ct_vectorizer.fit_transform(scripts)

tfidf_vectorizer = TfidfVectorizer(**ct_vectorizer.get_params())
scripts_tfidf = tfidf_vectorizer.fit_transform(scripts)


In [173]:
tfidf_vectorizer.get_feature_names_out()[8000:8015]

array(['a blusher and you', 'a bmw', 'a bmw okay', 'a bmw okay but',
       'a bmw right', 'a bmw right it', 'a bnp', 'a bnp campaign',
       'a bnp campaign are', 'a boar', 'a boar and', 'a boar and i',
       'a board', 'a board and', 'a board and bright'], dtype=object)

In [174]:
scripts_tf

<316x4761068 sparse matrix of type '<class 'numpy.int64'>'
	with 7955448 stored elements in Compressed Sparse Row format>

## LDA topic model

In [175]:
# # for TF DTM
# lda_tf = LatentDirichletAllocation(n_components=20, random_state=0)
# lda_tf.fit(scripts_tf)
# # for TFIDF DTM
# lda_tfidf = LatentDirichletAllocation(n_components=20, random_state=0)
# lda_tfidf.fit(scripts_tfidf)

In [176]:
#pyLDAvis.sklearn.prepare(lda_tfidf, scripts_tfidf, tfidf_vectorizer)

## profanity
[to the top](#Feature-Engineering-for-Standup-Scripts)

In [177]:
from profanityfilter import ProfanityFilter

In [310]:
pf = ProfanityFilter()
tok_scripts = [regexp_tokenize(transcript, r"\b[a-zA-Z'\-\w\*]+\b") for transcript in scripts]
tok_scripts_lc = [[token.lower() for token in script] for script in tok_scripts]
word_counts = [Counter(token for token in script) for script in tok_scripts_lc]

In [311]:
dictionary = Dictionary(tok_scripts_lc)
corpus = [dictionary.doc2bow(script) for script in tok_scripts_lc]

In [312]:
#[dictionary.id2token(token) for token in dictionary.iterkeys()]
corpus_overall_counts = {}
for bow in tqdm(corpus):
    # token_id rather than id, to avoid shadowing the builtin
    for token_id, count in bow:
        word = dictionary[token_id]
        corpus_overall_counts[word] = corpus_overall_counts.get(word, 0) + count






In [318]:
# profane_dict = {word: pf.is_profane(word) for word in tqdm(corpus_overall_counts)}

  0%|          | 0/54739 [00:00<?, ?it/s]

In [321]:
# with open('../data/profanity_booleans_no_lemma.pickle', 'wb') as file:
#     pickle.dump(profane_dict, file)

In [322]:
with open('../data/profanity_booleans_no_lemma.pickle', 'rb') as file:
    profane_dict = pickle.load(file)

In [340]:
profanity_counts = {description: {word:count for word, count in script_counts.items() if profane_dict[word]} for script_counts, description in zip(word_counts,descriptions)}

In [363]:
profane_words = [sum(words.values()) for description, words in profanity_counts.items()]
total_words = [sum(script_word_counts.values()) for script_word_counts in word_counts]
profane_proportion = [profane/total for profane, total in zip(profane_words, total_words)]
profane_per_sent = [profane/sent_count for profane, sent_count in zip(profane_words, sent_counts)]
profane_per_min = [profane/minutes for profane, minutes in zip(profane_words, metascripts['runtimeMins'].values)]

In [364]:
metascripts['profane count'] = profane_words
metascripts['profane proportion'] = profane_proportion
metascripts['profanity per sentence'] = profane_per_sent
metascripts['profanity per minute'] = profane_per_min
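
For reference, the profanity features for one script can be sketched without the pickled lookup. Here `toy_lexicon` is a hypothetical stand-in for the `pf.is_profane` results stored in `profane_dict`, and the sentence count and runtime are made-up numbers.

```python
from collections import Counter

# Hypothetical stand-in for the words that profane_dict marks as profane
toy_lexicon = {"damn", "hell"}

def profanity_features(tokens, n_sentences, runtime_minutes, lexicon):
    """Profane word count, proportion, and per-sentence and per-minute rates for one script."""
    counts = Counter(token.lower() for token in tokens)
    profane = sum(count for word, count in counts.items() if word in lexicon)
    total = sum(counts.values())
    return {
        "profane count": profane,
        "profane proportion": profane / total,
        "profanity per sentence": profane / n_sentences,
        "profanity per minute": profane / runtime_minutes,
    }

toy_tokens = "Damn that was a hell of a damn show".split()
toy_features = profanity_features(toy_tokens, n_sentences=1, runtime_minutes=2, lexicon=toy_lexicon)
print(toy_features)
```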

In [366]:
px.box(metascripts, x = 'profanity per minute', hover_data = ['description', 'profane count'], points = 'all')

## part-of-speech frequencies

In [368]:
import spacy

In [381]:
nlp.max_length

1000000

In [380]:
# instantiate the English model: nlp
nlp = spacy.load('en_core_web_md')

# create docs with nlp.pipe
#docs = nlp.pipe(scripts)

# get part-of-speech tags

def get_doc_pos_count(doc):
    """Count how often each coarse part-of-speech tag occurs in a spaCy Doc."""
    pos_dict = {}
    for token in doc:
        if token.pos_ in pos_dict:
            pos_dict[token.pos_] += 1
        else:
            pos_dict[token.pos_] = 1
    return pos_dict

#docs_pos_counts = {description: get_doc_pos_count(doc) for description, doc in zip(descriptions, nlp.pipe(scripts))}

doc_pos_count = {}
for description, doc in zip(descriptions, nlp.pipe(scripts)):
    doc_pos_count[description] = get_doc_pos_count(doc)

MemoryError: Error assigning 2458800 bytes

In [377]:
tok_pos_details

[[('Thank', 'VERB', 'ROOT', 'Thank'),
  ('you', 'PRON', 'dobj', 'Thank'),
  ('Thank', 'VERB', 'ROOT', 'Thank'),
  ('you', 'PRON', 'dobj', 'Thank'),
  ('Oh', 'INTJ', 'ROOT', 'Oh'),
  ('my', 'PRON', 'poss', 'gosh'),
  ('gosh', 'NOUN', 'intj', 'Oh'),
  ('Thank', 'VERB', 'ROOT', 'Thank'),
  ('you', 'PRON', 'dobj', 'Thank'),
  ('so', 'ADV', 'advmod', 'much'),
  ('much', 'ADV', 'advmod', 'Thank'),
  ('Thank', 'VERB', 'ROOT', 'Thank'),
  ('you', 'PRON', 'dobj', 'Thank'),
  ('Thank', 'VERB', 'ROOT', 'Thank'),
  ('you', 'PRON', 'dobj', 'Thank'),
  ('Aw', 'INTJ', 'intj', 'thank'),
  ('thank', 'VERB', 'ROOT', 'thank'),
  ('you', 'PRON', 'dobj', 'thank'),
  ('so', 'ADV', 'advmod', 'much'),
  ('much', 'ADV', 'advmod', 'thank'),
  ('Thank', 'VERB', 'ROOT', 'Thank'),
  ('you', 'PRON', 'dobj', 'Thank'),
  ('Aw', 'INTJ', 'ROOT', 'Aw'),
  ('That', 'PRON', 'nsubj', "'s"),
  ("'s", 'AUX', 'ROOT', "'s"),
  ('so', 'ADV', 'advmod', 'nice'),
  ('nice', 'ADJ', 'acomp', "'s"),
  ('That', 'PRON', 'nsubj', 'makes
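
Once token details like the ones above are available, turning them into part-of-speech frequencies is a short step. The sketch below assumes each entry is a `(text, pos, dep, head)` tuple, which I am inferring from the printed output; the `sample` list copies the first few tuples shown above, and `pos_frequencies` is a hypothetical helper.

```python
from collections import Counter

def pos_frequencies(token_details):
    """Relative frequency of each POS tag, given (text, pos, dep, head) tuples for one script."""
    counts = Counter(pos for _text, pos, _dep, _head in token_details)
    total = sum(counts.values())
    return {pos: count / total for pos, count in counts.items()}

sample = [
    ('Thank', 'VERB', 'ROOT', 'Thank'),
    ('you', 'PRON', 'dobj', 'Thank'),
    ('Oh', 'INTJ', 'ROOT', 'Oh'),
    ('my', 'PRON', 'poss', 'gosh'),
]
freqs = pos_frequencies(sample)
print(freqs)
```

Normalizing by the token total keeps the feature comparable across scripts of different lengths.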

## sentence structure

## point of view

## sentiment

## polarity

## cosine similarity

In [None]:
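# NOTE: `books` is not defined in this notebook, and this cell has not been run.
# Lowercase before the stopword check, since the stopword list is lowercase.
books_counters = {k: Counter([x.lower() for x in regexp_tokenize(v, r"[-'\w]+") if x.lower() not in sw]) for k, v in books.items()}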
books_df = pd.DataFrame.from_dict(books_counters, orient = 'index').fillna(0)
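
For two bag-of-words counters, cosine similarity is the dot product of the count vectors divided by the product of their lengths. A minimal standard-library sketch, with `cosine_similarity` as a hypothetical helper and toy counters in place of real scripts:

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine similarity between two word-count mappings; 0.0 if either is empty."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

x = Counter("thank you thank you".split())
y = Counter("thank you so much".split())
similarity = cosine_similarity(x, y)
print(similarity)
```

Working on the counters directly avoids building a dense DataFrame with one column per vocabulary word, which may matter at this corpus size.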

## playground
[to the top](#Feature-Engineering-for-Standup-Scripts)

### Remove brackets and parentheticals, as well as a check to ensure we don't accidentally remove too much
I'd also like to remove intro and exit music programmatically, but that's more fraught. Some shows deliberately contain music as content, and some scripts use an odd number of music signs, which makes it tough to single out lyrics.

In [197]:
fake_tok = "Thank you, thank you. [applause, laughter] Have you heard what Florida man's up to?"
re.search(r"\[.+\]", fake_tok)

<re.Match object; span=(22, 42), match='[applause, laughter]'>

In [198]:
re.sub(r"\[.+\]", "", fake_tok).strip()

"Thank you, thank you.  Have you heard what Florida man's up to?"

In [199]:
if fake_tok not in sw and re.search(r"\[.+\]", fake_tok):
    print("Yup, that's true")

Yup, that's true
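
One thing the single-annotation test string above cannot show is the difference between the greedy pattern `\[.+\]` and the lazy `\[.+?\]` used in the real cleanup. With two annotations on one line, the greedy version removes everything between the first `[` and the last `]`, which is exactly the "remove too much" case to guard against. The `line` string is a made-up example.

```python
import re

line = "Thank you. [applause] Have you heard? [laughter] Okay."

greedy = re.sub(r"\[.+\]", "", line)   # matches from the first '[' to the last ']'
lazy = re.sub(r"\[.+?\]", "", line)    # matches each bracketed annotation separately

print(repr(greedy))
print(repr(lazy))
```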


In [200]:
re.sub(r"\[.+?\]", "", transcripts_dict['Tom Papa: Human Mule (2016) – Transcript'])
re.search(r"♪.+?♪", transcripts_dict['Tom Papa: Human Mule (2016) – Transcript'])

In [201]:
gen = (script for script in transcripts_dict.values())

In [202]:
re.sub(r"\[.+?\]", "", re.sub(r"♪.+?♪", "", re.sub(r"♪♪.+?♪♪", "", transcripts_dict['Dave Chappelle: The Closer (2021) | Transcript'])))

'     \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n Thank you.  Everybody have a seat, be comfortable, relax. I got to tell you… let’s go.  Thank you. I need you guys to know something. And I’m gonna tell you the truth, and don’t get freaked out. This is going to be my last special for a minute.  It is all good. Listen to me. I did it in Detroit for that reason.  That’s right. You wanna know why? ‘Cause I talked so much shit about Detroit in the first special I figured, I might as well, do the last special here. Sorry about that, by the way.  First of all, before I even start, I’m gonna say that “I’m rich and famous.”  And the only reason I say that is ’cause the last 17 months were hell, and I cannot imagine what everybody went through. Well, I’m happy to see you and I’m happy you’re well and I hope everyone you love is okay.  I don’t want you to worry about me, I’m… vaccinated, I…  got the Johnson & Johnson vaccine.  I got to admit, that’s probably the most n*ggaish decision I’ve made in a long t

In [203]:
mm = metascripts.assign(
    modprop = lambda metascripts: (metascripts['script characters'] - metascripts['transcript'].replace("\[.+?\]|\(.+?\)","", regex = True).apply(len))/metascripts['script characters']
)

px.box(mm, x = 'modprop', hover_data = [mm.index, 'description'])

In [204]:
metascripts['transcript'][32]



In [205]:
metascripts['transcript'].replace("\[.+?\]|\(.+?\)","", regex = True)[32]

