# Method of gathering podcasts:
1) Compile a list of the top podcast channels in the Apple store (there were 24)
2) Generate 10 random numbers between 1 and 24 and select the corresponding channels (repeats are OK)
3) For each of those 10, select a random episode from that channel's 10 most recent episodes (as of 4/4/24).
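
The sampling steps above can be sketched in a few lines. This is only an illustration of the procedure, not the code that was actually used to pick the episodes; the function name `sample_episodes`, its parameters, and the seed are all hypothetical.

```python
import random


def sample_episodes(num_channels=24, num_picks=10, recent_episodes=10, seed=None):
    """Return a list of (channel_number, episode_number) pairs, both 1-based.

    Hypothetical helper: channel numbers index into the compiled list of
    top channels, and episode 1 is assumed to be the most recent episode.
    """
    rng = random.Random(seed)
    picks = []
    for _ in range(num_picks):
        channel = rng.randint(1, num_channels)     # repeats are allowed
        episode = rng.randint(1, recent_episodes)  # one of the 10 most recent
        picks.append((channel, episode))
    return picks


picks = sample_episodes(seed=0)
for channel, episode in picks:
    print(f"channel #{channel}, episode #{episode}")
```

Passing a fixed seed makes the selection reproducible, which helps if the list ever needs to be regenerated.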

Some challenges: not all of the originally selected podcasts had transcripts, and some transcripts were formatted in a less than ideal way.

I'm considering pulling additional podcasts from Lemonada because its transcript formatting is ideal.

# Podcast episodes that have been converted to .csv so far

* MisSpelling: <a href = "https://omny.fm/shows/misspelling/miss-spelling-part-1" > Part 1 </a>
* Stuff You Should Know: <a href = "https://omny.fm/shows/stuff-you-should-know-1/short-stuff-mariko-aoki-phenomenon" > Short Stuff: Mariko Aoki Phenomenon </a>
* 20/20: <a href = "https://www.happyscribe.com/public/20-20/highway-hunter" > Highway Hunter </a>
* The Sarah Silverman Podcast: <a href = "https://lemonadamedia.com/podcast/flaps-canters-kid-actors/" > Flaps, Canter’s, Kid Actors </a>
* Last Day: <a href = "https://lemonadamedia.com/podcast/christine-in-our-crisis-we-have-opportunity/" > Christine: In Our Crisis, We Have Opportunity </a>
* The Deep Dive with Jessica St. Clair and... <a href = "https://lemonadamedia.com/podcast/cat-scratch-fever/" > Cat Scratch Fever </a>
* Add to Cart with Kulap Vilaysack &... <a href = "https://lemonadamedia.com/podcast/the-wax-man-with-lamar-woods/" > The Wax Man (with Lamar Woods) </a>

**Need to spot-check the transcripts above for reliability**
* MisSpelling is pretty bad, but could potentially be fixed by merging neighboring strings together.
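
One possible way to merge neighboring strings is sketched below. It assumes that a fragment which does not end in sentence-ending punctuation belongs with the fragment that follows it; that rule has not been checked against the actual transcript, and `merge_fragments` is a hypothetical helper.

```python
def merge_fragments(fragments, enders=(".", "?", "!")):
    """Join consecutive fragments until one ends with sentence-ending punctuation."""
    merged = []
    buffer = []
    for fragment in fragments:
        fragment = fragment.strip()
        if not fragment:
            continue  # skip empty rows
        buffer.append(fragment)
        if fragment.endswith(enders):
            merged.append(" ".join(buffer))
            buffer = []
    if buffer:
        # keep any trailing text that never reached a sentence end
        merged.append(" ".join(buffer))
    return merged


# Made-up rows, only to show the behavior
rows = ["So I was thinking", "about the show.", "Right?", "And then", "we left"]
merged_rows = merge_fragments(rows)
print(merged_rows)
```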

## Podcasts whose format would require RegEx 
(as opposed to relying on line breaks)
* ZOE: <a href="https://zoe.com/learn/podcast-longevity-according-to-blue-zones#transcript"> 9 longevity practices: Secrets from the blue zones</a>
* Serial: <a href = "https://www.nytimes.com/interactive/2024/podcasts/serial-season-four-guantanamo.html" > Poor Baby Raul </a>
* Serial: <a href = "https://serialpodcast.org/season-three/1/transcript" > A Bar Fight Walks into the Justice Center </a>
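
As a rough illustration of what a RegEx-based approach could look like, the sketch below splits running text on inline speaker labels instead of line positions. The `Speaker: text` layout, the speaker names, and the `split_turns` helper are all assumptions made for the example; the real transcripts listed above have not been checked against it.

```python
import re


def split_turns(text, speakers):
    """Split running text into (speaker, spoken_part) pairs on 'Name:' labels."""
    names = "|".join(re.escape(name) for name in speakers)
    label = re.compile(r"\b(" + names + r"):\s*")
    # With one capturing group, re.split returns
    # [preamble, speaker1, text1, speaker2, text2, ...]
    parts = label.split(text)
    turns = []
    for speaker, spoken in zip(parts[1::2], parts[2::2]):
        spoken = " ".join(spoken.split())  # collapse line breaks and extra spaces
        if spoken:
            turns.append((speaker, spoken))
    return turns


# Made-up sample with line breaks in arbitrary places
sample = "Host: Welcome back.\nToday we talk food. Guest: Thanks for\nhaving me. Host: Let's start."
turns = split_turns(sample, ["Host", "Guest"])
for speaker, spoken in turns:
    print(speaker, "->", spoken)
```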

# Steps for importing chunked transcripts
1) Copy/paste transcripts into a transcripts .py file (e.g. Transcripts_english.py), each formatted as a string
2) Manually determine the line_freq and line_hit for each transcript by counting lines
3) Run eng_transcript_to_csv on each transcript with the corresponding line_freq and line_hit values
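
To make line_freq and line_hit concrete, here is a small worked example on a made-up transcript. The sample text and the `pick_lines` helper are hypothetical; the selection rule (`index % line_freq == line_hit`) is the same one used by the conversion function in this notebook.

```python
sample = """Speaker A 00:05

Hello and welcome.

Speaker B 00:09

Thanks for having me.
"""


def pick_lines(text, line_freq, line_hit):
    """Keep every line whose position within its cycle equals line_hit."""
    lines = text.strip().split("\n")
    return [line.strip() for index, line in enumerate(lines)
            if index % line_freq == line_hit]


# Print each line with its position in a 4-line cycle to help with counting
for index, line in enumerate(sample.strip().split("\n")):
    print(index % 4, repr(line))

# The cycle here is: name/time, blank, quote, blank -> line_freq = 4, line_hit = 2
spoken = pick_lines(sample, line_freq=4, line_hit=2)
print(spoken)
```

Printing the cycle position next to each line is a quick way to confirm the counts before converting a full transcript.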

# Convert English transcripts to .csv

In [2]:
import csv
import os

## Original function, which does not account for max_char_size = 5000

In [3]:
def eng_transcript_to_csv(transcript, transcript_name, line_freq, line_hit):
    """
    required packages: csv and os
    transcript: transcript text (a single string)
    transcript_name: desired name for the .csv file
    line_freq: the number of lines in a cycle (e.g. "name/time, new line, quote, name/time..." has line_freq = 3)
    line_hit: the index in the cycle with the desired line of text (*first line has index 0)
    """
    spoken_parts = []
    lines = transcript.strip().split('\n')

    # for transcripts from lemonada, the format of lines follows the pattern: "name time", "", "spoken_part", "", "name time"...
    # so it looks like quotes will always be in the same position of each cycle of lines
    # I'll use index % line_freq == line_hit to extract the correct line for the spoken_part
    for index, line in enumerate(lines):
        if index % line_freq == line_hit:
            spoken_parts.append(line.strip())

    # Save spoken parts into a CSV file (create the output folder if it is missing)
    output_dir = "English Transcript CSVs"
    os.makedirs(output_dir, exist_ok=True)
    csv_file = os.path.join(output_dir, transcript_name + "_spoken_parts.csv")
    with open(csv_file, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f, quoting=csv.QUOTE_ALL)
        for spoken_part in spoken_parts:
            writer.writerow([spoken_part])

    message = ("Spoken parts have been saved to:", csv_file)

    return message
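
The heading for this function notes that it does not account for max_char_size = 5000. One way to respect such a limit is sketched below, under the assumption that the limit is a cap on the number of characters per CSV row and that splitting on word boundaries is acceptable. `split_to_max_chars` is a hypothetical helper, not part of the functions in this notebook.

```python
def split_to_max_chars(text, max_char_size=5000):
    """Split text into chunks of at most max_char_size characters, on word boundaries."""
    chunks = []
    current = ""
    for word in text.split():
        # A single word longer than the limit has to be hard-split
        while len(word) > max_char_size:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(word[:max_char_size])
            word = word[max_char_size:]
        candidate = word if not current else current + " " + word
        if len(candidate) <= max_char_size:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


small = split_to_max_chars("aaa bbb ccc", max_char_size=7)
print(small)

long_text = "word " * 3000
long_chunks = split_to_max_chars(long_text)
print(len(long_chunks), max(len(chunk) for chunk in long_chunks))
```

Each spoken part could be passed through a helper like this before it is written, so that an over-long part becomes several rows instead of one.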

In [12]:
import csv
import os
import pandas as pd

def eng_transcript_to_csv_alt(transcript, transcript_name, line_freq, line_hit):
    """
    transcript: transcript text (a single string)
    transcript_name: desired name for the .csv file
    line_freq: the number of lines in a cycle (e.g. "name/time, new line, quote, name/time..." has line_freq = 3)
    line_hit: the index in the cycle with the desired line of text (*first line has index 0)
    """
    # Line index in the raw transcript of each extracted spoken part
    starting_indices = []

    spoken_parts = []
    lines = transcript.strip().split('\n')

    # Extract spoken parts based on line frequency and line hit
    for index, line in enumerate(lines):
        if index % line_freq == line_hit:
            spoken_parts.append(line.strip())
            starting_indices.append(index)

    # Create DataFrame
    df = pd.DataFrame(spoken_parts, columns=['Spoken_Part'])

    # Save DataFrame to CSV file (create the output folder if it is missing)
    output_dir = "English Transcript CSVs"
    os.makedirs(output_dir, exist_ok=True)
    csv_file = os.path.join(output_dir, transcript_name + "_spoken_parts.csv")
    df.to_csv(csv_file, index=False, quoting=csv.QUOTE_ALL)

    message = ("Spoken parts have been saved to:", csv_file, starting_indices)

    return message


In [None]:
# One-off call for a single transcript; the values match the Flaps entry in Eng_podcasts
eng_transcript_to_csv(Flaps, "Flaps", 6, 2)

Transcripts were saved as a **.py** file in the same directory.  They are imported below.

In [4]:
from Transcripts_english import Flaps, Christine, Mariko, Highway, Cat, Wax, Alyssa, Carrie, Unfit, Awkward

Set up a dictionary to loop over with the transcript to csv function.  The dictionary values are formatted as *["name", line_freq, line_hit]*

In [16]:
Eng_podcasts = {
    Flaps: ["Flaps", 6, 2],
    Christine: ["Christine", 6, 2],
    Mariko: ["Mariko", 5, 2],
    Highway: ["Highway", 3, 0],
    Cat: ["Cat", 6, 2],
    Wax: ["Wax", 6, 2],
    Alyssa: ["Alyssa", 6, 2],
    Carrie: ["Carrie", 4, 2],
    Unfit: ["Unfit", 4, 2],
    Awkward: ["Awkward", 4, 2],
}

# Keys are the transcript strings themselves; values are ["name", line_freq, line_hit]
for podcast, (name, line_freq, line_hit) in Eng_podcasts.items():
    eng_transcript_to_csv(podcast, name, line_freq, line_hit)

# Convert Spanish transcripts to .csv

In [6]:
import re

In [7]:
def spotify_transcript_to_csv(transcript, transcript_name):
    """
    required packages: csv, re, and os
    transcript: transcript text (a single string)
    transcript_name: desired name for the .csv file
    """
    # for transcripts from spotify, the speakers swap on timestamps with inconsistent # of lines in between
    # so split on timestamps (M:SS or MM:SS) instead of counting lines
    timestamp_pattern = r'\b\d{1,2}:\d{2}\b'

    lines = re.split(timestamp_pattern, transcript)
    lines = [line.strip() for line in lines if line.strip()]
    # Collapse each multi-line segment into a single line
    lines = [' '.join(line.split('\n')).strip() for line in lines]

    # Save spoken parts into a CSV file (create the output folder if it is missing)
    output_dir = "Spanish Transcript CSVs"
    os.makedirs(output_dir, exist_ok=True)
    csv_file = os.path.join(output_dir, transcript_name + "_spoken_parts.csv")
    with open(csv_file, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f, quoting=csv.QUOTE_ALL)
        for line in lines:
            writer.writerow([line])

    message = ("Spoken parts have been saved to:", csv_file)

    return message
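
The timestamp split used in the function above can be tried on a small made-up sample. The sample text is hypothetical. The second pattern, `hour_aware_pattern`, is not part of the notebook; it is shown only as one possible adjustment in case a transcript contains H:MM:SS timestamps, which the original pattern would match only partially.

```python
import re

timestamp_pattern = r'\b\d{1,2}:\d{2}\b'  # same pattern as in spotify_transcript_to_csv

sample = """0:00
Hola y bienvenidos
al programa.
0:07
Gracias por
la invitación.
1:15
Empecemos."""

parts = re.split(timestamp_pattern, sample)
# Drop empty pieces and collapse each multi-line segment into one line
rows = [" ".join(part.split()) for part in parts if part.strip()]
print(rows)

# Assumption: some transcripts might use H:MM:SS timestamps.
# The original pattern matches only the leading "1:02" and leaves ":03" behind.
hour_sample = "1:02:03\nHola"
leftover = re.split(timestamp_pattern, hour_sample)

# Hypothetical alternative that also accepts an optional seconds field
hour_aware_pattern = r'\b\d{1,2}(?::\d{2}){1,2}\b'
clean = re.split(hour_aware_pattern, hour_sample)
print(leftover, clean)
```

Note that either pattern would also split on a clock time that is spoken in the transcript text itself, so the output is worth spot-checking.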

In [8]:
from Transcripts_spanish import Barberias, Techos, Amponas, Damitas, Subio, Daniel, Ovnis, Fobias, Farid, Triangulo

In [9]:
Sp_podcasts = {
    Barberias: "Barberias",
    Techos: "Techos",
    Amponas: "Amponas",
    Damitas: "Damitas",
    Subio: "Subio",
    Daniel: "Daniel",
    Ovnis: "Ovnis",
    Fobias: "Fobias",
    Farid: "Farid",
    Triangulo: "Triangulo",
}

# Keys are the transcript strings themselves; values are the output names
for podcast, name in Sp_podcasts.items():
    spotify_transcript_to_csv(podcast, name)