# YouTube Apology Transcription

This notebook shows the transcription process of videos from this [YouTube playlist][]. **It does not need to be run.** 

Final results can be found in "apology_transcript.csv".

Running this notebook requires downloading the playlist as mp3 files and renaming them based on the IDs determined from [this spreadsheet](https://docs.google.com/spreadsheets/d/1zewXF_WeSIH3QBlW4c5MMv7hBsBAc8cOhAaG4aBBuKM/edit?usp=sharing). Some videos deleted from YouTube were retrieved from the Internet Archive and other locations.
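
Before transcribing, it can help to confirm that every ID has a matching audio file. The following is a minimal sketch, assuming the files are named `<ID>.mp3` and sit in a single folder; `find_missing_audio` and its `audio_dir` parameter are hypothetical names, not part of the original notebook.

```python
from pathlib import Path

def find_missing_audio(ids, audio_dir="."):
    """Return the IDs that have no matching '<ID>.mp3' file in audio_dir."""
    audio_dir = Path(audio_dir)
    # Keeps the original order of the IDs so the result is easy to compare with the spreadsheet
    return [i for i in ids if not (audio_dir / f"{i}.mp3").is_file()]
```

Once the metadata dataframe is loaded, something like `find_missing_audio(apology_info_df["ID"])` would list the apologies that still need to be downloaded or renamed.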



In [None]:
# IMPORTED LIBRARIES -- Run this

# Libraries for apology transcription
import pandas as pd
import whisper
from pathlib import Path

# Whisper Installation (only run once)

[Whisper](https://openai.com/research/whisper) is an open-source speech recognition model from OpenAI. We use it to automate transcription of the apology videos, since watching and transcribing every video by hand is impractical.

In [None]:
# Installs OpenAI's Whisper speech-recognition model
!pip install -U openai-whisper

In [None]:
# Installs setuptools-rust, a build helper for Rust-based dependencies (it does not install Rust itself)
!pip install setuptools-rust

# Data Intake

In [None]:
# Intakes CSV file as a dataframe
apology_info_df = pd.read_csv('100_YT_Sheet.csv')

# Deletes rows that do not have an ID
apology_info_df = apology_info_df.dropna(subset=['ID'])

# Converts the 'Date Posted' column to datetime values
apology_info_df['Date Posted'] = pd.to_datetime(apology_info_df['Date Posted'])

In [None]:
apology_info_df

# Initial Transcription

In [None]:
# ONLY RUN THIS BLOCK ONCE -- DO NOT RUN IF YOU HAVE ALREADY ADDED APOLOGIES TO THE LIST, THIS WILL ERASE THEM
# Creates a list to store apology transcripts
apology_list = []

In [None]:
# Loads the Whisper "base" model once so it is not reloaded for every file
model = whisper.load_model("base")

# Function that transcribes a file using Whisper
def transcribeAudio(audioFile, youtuber_id):
    apology_file = Path(audioFile)
    # Checks if the file is present in the directory and begins transcription if it is
    if apology_file.is_file():
        result = model.transcribe(audioFile)
        # Appends the transcription to the apology list along with the YouTuber ID
        transcript = dict(ID=youtuber_id, Transcript=result["text"])
        apology_list.append(transcript)
    else:
        # Reports missing files instead of skipping them silently
        print("File not found, skipping:", audioFile)
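
As an alternative design, the function could return the transcript record instead of appending to a global list, and take the transcription callable as an argument. This is only a sketch under stated assumptions: `transcribe_to_record` and its `transcribe` parameter are hypothetical names, and the callable is assumed to return a dict with a `"text"` key, as `model.transcribe` does for a loaded Whisper model.

```python
from pathlib import Path

def transcribe_to_record(audio_file, youtuber_id, transcribe):
    """Return {'ID', 'Transcript'} for audio_file, or None if the file is missing.

    `transcribe` is any callable that takes a file path string and returns
    a dict containing a "text" key (assumption: e.g. model.transcribe).
    """
    path = Path(audio_file)
    # Skips files that are not present in the directory
    if not path.is_file():
        return None
    result = transcribe(str(path))
    return {"ID": youtuber_id, "Transcript": result["text"]}
```

With a loaded model, a call might look like `record = transcribe_to_record(ID + ".mp3", ID, model.transcribe)`, followed by `apology_list.append(record)` when `record` is not `None`. Passing the callable in also makes the function easy to try out with a stand-in before running Whisper on real audio.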

In [None]:
# Creates a list of apology IDs from the IDs in the apology dataframe
apology_ids = apology_info_df["ID"]

# Transcribes apologies of YouTubers present in the apology ID list
for youtuber in apology_ids:
    # Checks whether the apology was already transcribed (exact ID match, not a substring match)
    transcribed = any(apology['ID'] == youtuber for apology in apology_list)
    # Transcribes the apology if it is not in the list
    if not transcribed:
        transcribeAudio(youtuber + ".mp3", youtuber)
        print(youtuber)
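
One detail worth illustrating in the check above: for strings, Python's `in` operator tests for a substring, so an ID that is a prefix of another ID can be reported as already transcribed. The sketch below contrasts the two checks; the IDs `YT1` and `YT10` are made up for illustration, and `already_transcribed` is a hypothetical helper.

```python
def already_transcribed(youtuber_id, records):
    """Return True if some record's ID equals youtuber_id exactly."""
    return any(record["ID"] == youtuber_id for record in records)

# Made-up example data: only "YT10" has been transcribed so far
records = [{"ID": "YT10", "Transcript": "..."}]

print("YT1" in records[0]["ID"])            # True: substring test on a string
print(already_transcribed("YT1", records))  # False: exact comparison
```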

In [None]:
# Function that exports the list of apologies into a CSV file for further manipulation
def exportApologies(exportList, exportDF=None):
    # Converts the list into a dataframe and exports it as a CSV file
    # (exportDF is rebuilt from exportList; the parameter is kept so existing calls still work)
    exportDF = pd.DataFrame.from_dict(exportList)
    exportDF.to_csv('apology_transcript.csv')

    # Exports each transcript into its own text file, named after the ID
    for apology in exportList:
        fileName = apology['ID'] + ".txt"
        # The with statement closes the file automatically
        with open(fileName, 'w', encoding='utf-8') as output:
            output.write(apology['Transcript'])

In [None]:
# Builds the dataframe of apology transcriptions and exports it into a CSV file
apology_df = pd.DataFrame(apology_list)
exportApologies(apology_list, apology_df)

# Transcription based on missing apologies in apology_transcript.csv

In [1]:
# Imports CSV of apology metadata and CSV of transcribed apologies
# (requires the imports cell at the top of the notebook to have been run first)
apology_info_df = pd.read_csv('100 YT rework.csv')
apology_transcripts_df = pd.read_csv('apology_transcript.csv')

In [None]:
# Collects the IDs that already have a transcript
transcribed_ids = set(apology_transcripts_df['ID'])

# Goes through every apology in the metadata CSV and checks if it is present in the apology transcript dataframe
# If it is not, the apology ID is added to the list of apologies that need to be transcribed
# (rows without an ID are skipped; IDs are compared exactly, not as substrings)
apologiesToDo = [
    ID for ID in apology_info_df['ID'].dropna()
    if ID not in transcribed_ids
]
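
The same comparison can also be expressed with pandas directly. The following is a minimal sketch, assuming both dataframes have an `ID` column as in the cells above; `ids_without_transcript` is a hypothetical helper name and the small dataframes are made-up example data.

```python
import pandas as pd

def ids_without_transcript(info_df, transcripts_df):
    """Return IDs present in info_df but absent from transcripts_df (exact match)."""
    # Drops rows with no ID before comparing
    ids = info_df["ID"].dropna()
    # isin marks IDs that already have a transcript; ~ inverts the mask
    return ids[~ids.isin(transcripts_df["ID"])].tolist()

# Made-up example data
info = pd.DataFrame({"ID": ["A1", "B2", None, "C3"]})
done = pd.DataFrame({"ID": ["A1"], "Transcript": ["sorry"]})
print(ids_without_transcript(info, done))  # ['B2', 'C3']
```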

In [None]:
apologiesToDo

In [None]:
# Transcribes the remaining apologies
for ID in apologiesToDo:
    transcribeAudio(ID+".mp3", ID)

In [None]:
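# Exports the transcripts into a CSV file
# Note: this overwrites apology_transcript.csv with the contents of apology_list,
# so apology_list must also hold the previously transcribed apologies
apology_df = pd.DataFrame(apology_list)
exportApologies(apology_list, apology_df)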
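
Because the export rewrites `apology_transcript.csv` from `apology_list`, transcripts made in an earlier session need to be combined with the new ones before exporting. One possible way to do that is sketched below; `merge_transcripts` is a hypothetical helper, and it assumes each record is a dict with `ID` and `Transcript` keys.

```python
def merge_transcripts(existing, new):
    """Combine two lists of {'ID', 'Transcript'} dicts, keeping one entry per ID.

    Entries in `new` replace entries in `existing` that share the same ID.
    """
    # Dicts keep insertion order, so existing IDs stay in their original position
    merged = {record["ID"]: record for record in existing}
    merged.update({record["ID"]: record for record in new})
    return list(merged.values())
```

In this notebook, the earlier records could be obtained with `apology_transcripts_df[["ID", "Transcript"]].to_dict("records")` and merged with `apology_list` before calling `exportApologies`.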