<a href="https://colab.research.google.com/github/jcalz23/nlp_podcast_segmentation/blob/main/preprocess_TAL.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Preprocessing

Published Documents:
 - [Spotify Podcasts Dataset](https://arxiv.org/pdf/2004.04270v3.pdf)
 - [Speech Recognition Diarization](https://arxiv.org/pdf/2005.08072.pdf)
 - [Unsupervised Topic Segmentation of Meetings with BERT Embeddings](https://arxiv.org/pdf/2106.12978.pdf)


Dataset:
 - [This American Life Podcast Transcripts](https://www.kaggle.com/datasets/shuyangli94/this-american-life-podcast-transcriptsalignments?resource=download)

Citation:

 - Mao, H. H., Li, S., McAuley, J., & Cottrell, G. (2020). Speech Recognition and Multi-Speaker Diarization of Long Conversations. INTERSPEECH.

## Imported Packages and Libraries

In [None]:
!pip install nlp --quiet


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
base = '/content/drive/MyDrive/nlp_podcast_segmentation/'

In [None]:
import json
import os  # accessing directory structure
from collections import Counter
from pprint import pprint

import matplotlib.pyplot as plt  # plotting
import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import tensorflow as tf
from mpl_toolkits.mplot3d import Axes3D
from nlp import load_dataset
from sklearn.preprocessing import StandardScaler
from tensorflow import keras

# Print NumPy floats with four decimal places
float_formatter = "{:.4f}".format
np.set_printoptions(formatter={'float_kind': float_formatter})

## This American Life Podcast Dataset

### Load

In [None]:
dataset_path = base + 'data/TALDataset/'

In [None]:
full_speaker_data = dataset_path + 'full-speaker-map.json'
test_transcript_data = dataset_path + 'test-transcripts-aligned.json'
train_transcript_data = dataset_path + 'train-transcripts-aligned.json'
valid_transcript_data = dataset_path + 'valid-transcripts-aligned.json'

In [None]:
with open(test_transcript_data) as json_data:
    transcripts = json.load(json_data)

print(len(transcripts))

36


In [None]:
# Get the list of episode IDs (the top-level keys of the transcripts dict)
episode_list = list(transcripts)
print(episode_list)

['ep-11', 'ep-113', 'ep-120', 'ep-164', 'ep-171', 'ep-177', 'ep-195', 'ep-219', 'ep-242', 'ep-258', 'ep-270', 'ep-279', 'ep-343', 'ep-355', 'ep-382', 'ep-403', 'ep-416', 'ep-432', 'ep-437', 'ep-456', 'ep-475', 'ep-489', 'ep-493', 'ep-516', 'ep-522', 'ep-524', 'ep-527', 'ep-548', 'ep-558', 'ep-619', 'ep-635', 'ep-648', 'ep-665', 'ep-682', 'ep-683', 'ep-78']


In [None]:
# Inspect a single episode: size, acts, and speakers
ep = episode_list[0]
ep_df = pd.DataFrame(transcripts[ep])
print(f"Num Rows: {len(ep_df)}")
print(f"Acts: {ep_df['act'].unique()}")
print(f"Speakers: {ep_df['speaker'].unique()}\n")
ep_df.head(3)

Num Rows: 234
Acts: ['prologue' 'act1' 'act2' 'act3' 'act4' 'credits']
Speakers: ['ira glass' 'shirley jahad' 'julia sweeney' 'bob' 'david sedaris'
 'terry sweeney' 'dave' 'man' 'sarah thyre']



Unnamed: 0,episode,act,act_title,role,speaker,utterance_start,utterance_end,duration,utterance,n_sentences,n_words,has_q,ends_q,alignments
0,ep-11,prologue,Act One: Dave's Love,host,ira glass,0.93,32.51,31.58,"""I'll pour this pestilence into his ear. So wi...",6,89,False,False,"[[0.93, 2.65, 1], [2.65, 2.81, 2], [2.81, 3.17..."
1,ep-11,prologue,Act One: Dave's Love,host,ira glass,32.51,72.55,40.04,"But in our American lives, the real era of int...",5,80,False,False,"[[32.51, 32.949999999999996, 0], [32.949999999..."
2,ep-11,prologue,Act One: Dave's Love,host,ira glass,72.55,82.69,10.14,"But before we get into the body of our story, ...",3,39,False,False,"[[72.55, 72.86999999999999, 0], [72.8699999999..."


In [None]:
# Print the first three utterances of the episode with their speakers
for line in transcripts[ep][:3]:
    print(line['speaker'], ": ", line['utterance'], "\n")

ira glass :  "I'll pour this pestilence into his ear. So will I make the net that will enmesh them all." It's an adult, Iago, who says that in Othello. And it's grownups that Machiavelli was writing about when he wrote The Prince, his book about manipulating others and seizing power. Notice he titled the book The Prince, not The Little Prince. The Little Prince is actually by somebody else, if you don't know that. 

ira glass :  But in our American lives, the real era of intrigue and manipulation for most of us is not adulthood. It's adolescence, when our social circle is at its most constricting. Today on our program, a story of betrayal and of someone who holds David Koresh-like powers over others, and who is only in the seventh grade. From WBEZ in Chicago, it's Your Radio Playhouse. I'm Ira Glass. 

ira glass :  But before we get into the body of our story, we will try, as adults, to manipulate you a little bit at Pledge Central. Let's check in with Pledge Central. Shirley Jahad. 



### Process

In [None]:
# Goal: build a DataFrame of utterances and their topic (act) label for one episode
ep_df = pd.DataFrame(transcripts[ep])
ep_df = ep_df[['episode', 'act', 'speaker', 'utterance']]
ep_df.head()

Unnamed: 0,episode,act,speaker,utterance
0,ep-11,prologue,ira glass,"""I'll pour this pestilence into his ear. So wi..."
1,ep-11,prologue,ira glass,"But in our American lives, the real era of int..."
2,ep-11,prologue,ira glass,"But before we get into the body of our story, ..."
3,ep-11,prologue,shirley jahad,"Hi, Ira Glass."
4,ep-11,prologue,ira glass,Hi.


In [None]:
# Goal: for each episode, collect its utterances and a 0/1 topic-transition label per utterance
U_list = []  # one list of utterances per episode
T_list = []  # one array of transition labels per episode

for ep in episode_list:
    U = []       # list of M utterances U = {U_1, ..., U_M}
    T_temp = []  # act (topic) label of each utterance

    for line in transcripts[ep]:
        U.append(line['utterance'])
        T_temp.append(line['act'])

    # Topic-transition indicator: the first utterance is flagged, and so is
    # every utterance whose successor belongs to a different act.
    T = np.zeros(len(T_temp), dtype=int)
    for i in range(len(T_temp)):
        if i == 0:
            T[i] = 1
        if i != len(T_temp) - 1 and T_temp[i] != T_temp[i + 1]:
            T[i] = 1

    # Append this episode's utterances and labels
    U_list.append(U)
    T_list.append(T)

print(len(U_list))
print(U_list[0][:3])
print(T_list[0][:3])

36
['"I\'ll pour this pestilence into his ear. So will I make the net that will enmesh them all." It\'s an adult, Iago, who says that in Othello. And it\'s grownups that Machiavelli was writing about when he wrote The Prince, his book about manipulating others and seizing power. Notice he titled the book The Prince, not The Little Prince. The Little Prince is actually by somebody else, if you don\'t know that.', "But in our American lives, the real era of intrigue and manipulation for most of us is not adulthood. It's adolescence, when our social circle is at its most constricting. Today on our program, a story of betrayal and of someone who holds David Koresh-like powers over others, and who is only in the seventh grade. From WBEZ in Chicago, it's Your Radio Playhouse. I'm Ira Glass.", "But before we get into the body of our story, we will try, as adults, to manipulate you a little bit at Pledge Central. Let's check in with Pledge Central. Shirley Jahad."]
[1 0 0]
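
To make the labelling rule used above concrete, here is a minimal sketch on a made-up act sequence. `mark_transitions` and `toy_acts` are hypothetical names introduced for illustration only; the function is meant to reproduce the rule of the loop above (flag index 0, and flag every utterance whose successor belongs to a different act), not to replace it.

```python
def mark_transitions(acts):
    """Return a 0/1 list following the same rule as the loop above:
    index 0 is flagged, and so is every utterance whose successor
    has a different act label."""
    if not acts:
        return []
    # Compare each act with the next one; the last utterance has no successor.
    flags = [int(cur != nxt) for cur, nxt in zip(acts, acts[1:])] + [0]
    flags[0] = 1  # the first utterance is always flagged
    return flags

# Made-up act labels for six utterances (not taken from the dataset)
toy_acts = ['prologue', 'prologue', 'act1', 'act1', 'act1', 'credits']
toy_flags = mark_transitions(toy_acts)
print(toy_flags)  # [1, 1, 0, 0, 1, 0]
```

Note that, apart from index 0, a flag falls on the last utterance of an act rather than on the first utterance of the next one, which is worth keeping in mind when the labels are compared against predicted segment boundaries.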


In [None]:
# Get vocab, convert words to tokens
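
The cell above is still a placeholder. One possible way to fill it in is sketched below, under several assumptions that are not from the notebook: a simple lowercase regex tokenizer, the reserved ids 0 for `<pad>` and 1 for `<unk>`, and the hypothetical helpers `tokenize`, `build_vocab`, and `encode`. A pretrained tokenizer could be swapped in instead.

```python
import re
from collections import Counter

PAD, UNK = '<pad>', '<unk>'  # assumed special tokens (ids 0 and 1)

def tokenize(text):
    """Lowercase, then split into word tokens and single punctuation marks."""
    return re.findall(r"[a-z0-9']+|[^\sa-z0-9']", text.lower())

def build_vocab(episodes, min_count=1):
    """Map every token seen at least `min_count` times to an integer id.

    `episodes` is a list of episodes, each a list of utterance strings
    (the same shape as U_list above)."""
    counts = Counter(tok for ep in episodes for utt in ep for tok in tokenize(utt))
    vocab = {PAD: 0, UNK: 1}
    for word, n in counts.most_common():  # most frequent tokens get the lowest ids
        if n >= min_count:
            vocab[word] = len(vocab)
    return vocab

def encode(utterance, vocab):
    """Convert one utterance to a list of ids; unseen tokens map to <unk>."""
    return [vocab.get(tok, vocab[UNK]) for tok in tokenize(utterance)]

# Toy input shaped like U_list: one episode with two utterances
toy_episodes = [["Hi, Ira Glass.", "Hi."]]
vocab = build_vocab(toy_episodes)
ids = encode("Hi, Shirley.", vocab)  # "shirley" is out of vocabulary
print(ids)  # [2, 4, 1, 3]
```

In the notebook this would be called as `build_vocab(U_list)`, ideally on the training split only so that test-set words stay out of the vocabulary.
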

In [None]:
# Pad each individual utterance and each sequence/episode of utterances
max_utter = 50
max_sequence = 200
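
A minimal sketch of the two-level padding these constants suggest: each utterance is padded or truncated to `max_utter` token ids, and each episode to `max_sequence` utterances. The helpers `pad_utterance` and `pad_episode`, the pad id 0, and the choice to pad and truncate at the end are all assumptions. Keras's `pad_sequences` would also work for the inner step, but it pads and truncates at the front by default.

```python
def pad_utterance(token_ids, max_utter, pad_id=0):
    """Truncate or right-pad one utterance to exactly `max_utter` ids."""
    clipped = list(token_ids[:max_utter])
    return clipped + [pad_id] * (max_utter - len(clipped))

def pad_episode(utterances, max_utter, max_sequence, pad_id=0):
    """Truncate or right-pad an episode to `max_sequence` utterances,
    each of length `max_utter`. `utterances` is a list of id lists."""
    padded = [pad_utterance(u, max_utter, pad_id) for u in utterances[:max_sequence]]
    # Fill the remaining slots with all-padding utterances (a fresh list for each)
    padded += [[pad_id] * max_utter for _ in range(max_sequence - len(padded))]
    return padded

# Toy episode with two utterances, using small limits so the result is readable
toy_episode = [[2, 4, 5, 6, 3], [2, 3]]
toy_padded = pad_episode(toy_episode, max_utter=4, max_sequence=3)
print(toy_padded)  # [[2, 4, 5, 6], [2, 3, 0, 0], [0, 0, 0, 0]]
```

With the values above (`max_utter = 50`, `max_sequence = 200`), every episode becomes a 200 x 50 grid of ids, and the transition labels in `T_list` would need the same truncation or padding along the episode axis.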