# YouTube Transcripts

This notebook uses the `youtube_transcript_api` library to fetch YouTube transcripts. Its documentation and installation instructions are at:
https://github.com/jdepoix/youtube-transcript-api

## Install relevant libraries
If you see a message like `ModuleNotFoundError: No module named 'youtube_transcript_api'`, run `! pip install <package_name>` in a notebook cell.

In [9]:
#! pip install youtube_transcript_api

## Get YouTube video IDs
These are stored in a Google Sheet.

In [12]:
import pandas as pd
import numpy as np
import gspread
from oauth2client.service_account import ServiceAccountCredentials
from datetime import datetime, timezone
import pytz

#transcriptions
from youtube_transcript_api import YouTubeTranscriptApi as yt

#writing to aws
import json
import boto3 

In [13]:
def authenticate_gsheet(gsheet_json_key, sheet_name):
    
    SCOPE = ['https://spreadsheets.google.com/feeds', 'https://www.googleapis.com/auth/drive']

    # use creds to create a client to interact with the Google Drive API
    creds = ServiceAccountCredentials.from_json_keyfile_name(gsheet_json_key, SCOPE)
    gc = gspread.authorize(creds)

    # ensure you share the spreadsheet with the project service account
    # the email you share it with can be found in the json key
    sh = gc.open(sheet_name).sheet1
    return sh

In [14]:
def import_gsheet(sheet):
    """
    given an instance of a gsheet, import it into a dataframe and return dataframe
    """
    data = sheet.get_all_values()
    df = pd.DataFrame.from_records(data)
    new_header = df.iloc[0] #grab the first row for the header
    df = df[1:] #take the data less the header row
    df.columns = new_header #set the header row as the df header
    return df
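To see what the header promotion in `import_gsheet` does without a live sheet, here is a minimal pandas-free sketch of the same idea. It assumes the sheet values arrive as a list of rows with the header in the first row; `rows_to_records` is a hypothetical helper and not part of the notebook.

```python
def rows_to_records(rows):
    """Turn [header, row1, row2, ...] into a list of dicts keyed by the header."""
    if not rows:
        return []
    header, *body = rows
    return [dict(zip(header, row)) for row in body]

# Sample rows in the assumed shape: first row is the header, the rest are data
sample_rows = [
    ["date", "link", "id"],
    ["20210617", "https://www.youtube.com/watch?v=cItoaypG3zA", "cItoaypG3zA"],
    ["20210618", "https://www.youtube.com/watch?v=H5oadwexEc4", "H5oadwexEc4"],
]

records = rows_to_records(sample_rows)
print(records[0]["date"], records[0]["id"])
```

The notebook itself keeps the data in a DataFrame, which is more convenient for the column selection done later.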

## Get the transcript for each video

In [39]:
def get_text(transcript):
    """
    Given the transcript (a list of dicts with 'text', 'start' and 'duration' keys), extract JUST the text
    returns: the transcript as a single string
    """
    text = ''
    for t in transcript:
        text += t['text'] + ' '

    return text

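def get_duration_seconds(transcript):
    """
    INPUT: the transcript (a list of dicts with 'text', 'start' and 'duration' keys)
    RETURNS: the length of the press conference in seconds
    """
    # the last entry's start time plus its duration marks the end of the video
    last_entry = list(transcript)[-1]
    last_transcription = last_entry['start']
    duration = last_entry['duration']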
    
    return last_transcription + duration
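As a quick check of the two helpers, the sketch below runs equivalent logic on a made-up two-entry transcript. It assumes each entry is a dict with `text`, `start` and `duration` keys. The sample values are invented for illustration, and this variant of `get_text` joins with single spaces, so it leaves no trailing space.

```python
def get_text(transcript):
    """Join the 'text' field of every entry into one string."""
    return ' '.join(entry['text'] for entry in transcript)

def get_duration_seconds(transcript):
    """End time of the last entry: its start plus its duration."""
    last_entry = transcript[-1]
    return last_entry['start'] + last_entry['duration']

# Invented sample data in the assumed transcript shape
sample_transcript = [
    {'text': 'good morning everyone', 'start': 0.0, 'duration': 2.5},
    {'text': 'thank you for joining us', 'start': 2.5, 'duration': 3.0},
]

print(get_text(sample_transcript))
print(get_duration_seconds(sample_transcript))
```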

## Get all past transcripts
This only has to be done once; after that, adding the latest transcript is a daily task.

In [167]:
#import the Google Sheet into a dataframe
#(sh is assumed to be the worksheet returned by authenticate_gsheet)
df = import_gsheet(sh)

#get the columns required
df = df[['date', 
         'link', 
         'id', 
        ]]

#add new columns/keys
df['duration_seconds']=0
df['transcript']=''
df['text']=''

temp_dict = df.to_dict(orient = 'records')#[STATE+'_id']
transcripts_dict = {}

for item in temp_dict:
    date = item.pop('date')
    transcripts_dict[date] = item
    
#transcripts_dict

{'20210617': {'link': 'https://www.youtube.com/watch?v=cItoaypG3zA',
  'id': 'cItoaypG3zA',
  'duration_seconds': 0,
  'transcript': '',
  'text': ''},
 '20210618': {'link': 'https://www.youtube.com/watch?v=H5oadwexEc4',
  'id': 'H5oadwexEc4',
  'duration_seconds': 0,
  'transcript': '',
  'text': ''},
 '20210619': {'link': 'https://www.youtube.com/watch?v=2RLfKycplrk',
  'id': '2RLfKycplrk',
  'duration_seconds': 0,
  'transcript': '',
  'text': ''},
 '20210620': {'link': 'https://www.youtube.com/watch?v=UZhRxCSwOqo',
  'id': 'UZhRxCSwOqo',
  'duration_seconds': 0,
  'transcript': '',
  'text': ''},
 '20210621': {'link': 'https://www.youtube.com/watch?v=f1XzcYnzWYI',
  'id': 'f1XzcYnzWYI',
  'duration_seconds': 0,
  'transcript': '',
  'text': ''},
 '20210622': {'link': 'https://www.youtube.com/watch?v=Hb0AlfE8BYk',
  'id': 'Hb0AlfE8BYk',
  'duration_seconds': 0,
  'transcript': '',
  'text': ''},
 '20210623': {'link': 'https://www.youtube.com/watch?v=fHg-sqMF3oA',
  'id': 'fHg-sqMF3o

This piece of code iterates over the YouTube IDs and gets, for each one, the corresponding:
- transcript
- duration
- text (one big string, i.e. the transcript without timestamps)

If there is an error (e.g. no video is found, or there are no captions for the video), the transcript and text are set to "error" and the duration to NaN.

In [169]:
for date in transcripts_dict.keys():
    # look up the id outside the try block so it is always defined for the error message
    video_id = transcripts_dict[date]['id']
    print("processing:", date, video_id)
    try:
        transcript = yt.get_transcript(video_id)
        duration = get_duration_seconds(transcript)
        text = get_text(transcript)
        
    except Exception as e:
        print("An exception occurred for {} ({}): {}".format(date, video_id, e))
        transcript='error'
        duration = np.nan #null
        text='error'      
        
    #store the results in the dictionary
    transcripts_dict[date]['duration_seconds'] = duration
    transcripts_dict[date]['transcript'] = transcript
    transcripts_dict[date]['text'] = text

processing: 20210617 cItoaypG3zA
processing: 20210618 H5oadwexEc4
processing: 20210619 2RLfKycplrk
processing: 20210620 UZhRxCSwOqo
processing: 20210621 f1XzcYnzWYI
processing: 20210622 Hb0AlfE8BYk
processing: 20210623 fHg-sqMF3oA
processing: 20210624 _SyzWqiI1w0
processing: 20210625 MkGwTbr_4N4
processing: 20210626 WKr8XhxfAeM
processing: 20210627 J3p9wpyHtKM
processing: 20210628 QMog1CXrlEU
processing: 20210629 hBe9xC8z3iQ
processing: 20210630 z1wqP8ERpaw
processing: 20210701 VEKL80gn1DY
processing: 20210702 SI8OuUf-9B0
processing: 20210703 VYGs32vxm2k
processing: 20210704 XpdSqtdPT3c
processing: 20210705 fbT7e_tYV1A
processing: 20210706 EIkH5VHky1c
processing: 20210707 y2PU1MfEkgo
processing: 20210708 UcBi5RdMGfw
processing: 20210709 S77u-GXSyy0
processing: 20210710 XttK2JFDSHo
processing: 20210711 pAxEm5kSIsM
processing: 20210712 F9byDzTQAR0
processing: 20210713 zQr-0vyyrrU
processing: 20210714 i5znaJdsW34
processing: 20210715 gnjHQ9WgXMI
processing: 20210716 K_mnXZS1SYI
processing
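The fallback behaviour can be tried without network access by passing the fetch step in as a function. The sketch below is an illustration under that assumption: `fetch_all` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for the real transcript call.

```python
import math

def fetch_all(video_ids_by_date, fetch):
    """Call fetch(video_id) for each date; store 'error' placeholders on failure."""
    results = {}
    for date, video_id in video_ids_by_date.items():
        try:
            transcript = fetch(video_id)
            duration = transcript[-1]['start'] + transcript[-1]['duration']
            text = ' '.join(entry['text'] for entry in transcript)
        except Exception as exc:
            print("An exception occurred for {} ({}): {}".format(date, video_id, exc))
            transcript, duration, text = 'error', math.nan, 'error'
        results[date] = {'id': video_id,
                         'duration_seconds': duration,
                         'transcript': transcript,
                         'text': text}
    return results

def fake_fetch(video_id):
    # Hypothetical stand-in for the real transcript call
    if video_id == 'missing':
        raise ValueError('no captions')
    return [{'text': 'hello', 'start': 0.0, 'duration': 1.0}]

results = fetch_all({'20210617': 'ok', '20210618': 'missing'}, fake_fetch)
print(results['20210618']['text'])
```

One failing video does not stop the loop. The remaining dates are still processed, and the failed one can be spotted later by its "error" values.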

### Write to AWS S3 bucket
Next we write to AWS S3:
- each day gets its own transcription text file
- and there is one large .json file holding the text, links, durations etc. for every day

In [None]:
STATE = 'nsw'
#DATE = '20210807'

#serialise the dictionary to UTF-8 encoded json bytes
json_object = json.dumps(transcripts_dict).encode('UTF-8')
BUCKET_NAME = 'covid19-au-press-conferences'
json_key = STATE+'/json/'+STATE+'_transcriptions.json'

#ACCESS_KEY and SECRET_KEY are not defined in this notebook; they are assumed to be set beforehand
s3_client = boto3.client('s3',
    aws_access_key_id=ACCESS_KEY,
    aws_secret_access_key=SECRET_KEY,)


#dump the transcript dictionary to s3 as a json file
s3_client.put_object(Body=json_object,
                     Bucket=BUCKET_NAME, 
                     Key=json_key, 
                     ContentType='application/json') #MIME type


#write each individual transcription as a new text file
for date in transcripts_dict.keys():
    print(date)
    txt_object = transcripts_dict[date]['text']#.encode('ascii')
    txt_key = STATE+'/text/'+date+'_'+STATE+'_press_conference.txt'

    s3_client.put_object(Body=txt_object,
                         Bucket=BUCKET_NAME, 
                         Key=txt_key,
                         ContentType='text/plain') #MIME type
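The key layout used above can be captured in two small helpers, which keeps the naming consistent wherever the keys are built. This is only a sketch: `json_key_for` and `text_key_for` are hypothetical names, and the patterns are copied from the string concatenation in the cell above.

```python
def json_key_for(state):
    """Key of the single JSON file that holds every transcript for a state."""
    return '{0}/json/{0}_transcriptions.json'.format(state)

def text_key_for(state, date):
    """Key of the plain-text transcript for one press conference."""
    return '{0}/text/{1}_{0}_press_conference.txt'.format(state, date)

print(json_key_for('nsw'))
print(text_key_for('nsw', '20210807'))
```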

## Defining some functions to interact with AWS S3

In [36]:
def download_s3_file_to_local(bucket_name, key, local_file):
    """
    using your aws access key (configured in the aws cli) this function
    downloads the file at bucket_name/key into local_file
    """
    import boto3
    import botocore

    s3 = boto3.resource('s3')

    try:
        s3.Bucket(bucket_name).download_file(key, local_file)
    except botocore.exceptions.ClientError as e:
        if e.response['Error']['Code'] == "404":
            print("The object does not exist.")
        else:
            raise
            

def download_s3_file_to_memory(bucket_name, key):
    """
    reads the object at bucket_name/key and returns its content as bytes
    """
    s3_client = boto3.client('s3')
    s3_response_object = s3_client.get_object(Bucket=bucket_name, Key=key)
    object_content = s3_response_object['Body'].read()
    return object_content


def add_file_to_s3(bucket_name, key, txt_object):
    """
    Writes a string to a bucket/key
    depending on format of key will write as txt or json    
    """
    print("starting up S3 client...")
    s3_client = boto3.client('s3')
    
    # if writing as a .json file
    if key.endswith('.json'):
        print("working a .json file")
        print("converting to string")
        json_object = json.dumps(txt_object).encode('UTF-8')
        
        print("writing to S3....")
        #dump the transcript dictionary to s3 as a json file
        s3_client.put_object(Body=json_object,
                             Bucket=bucket_name, 
                             Key=key, 
                             ContentType='application/json') #MIME type

        
    # if writing as a .txt file
    elif key.endswith('.txt'):
        s3_client.put_object(Body=txt_object,
                             Bucket=bucket_name, 
                             Key=key,
                             ContentType='text/plain') #MIME type

    # any other extension is unsupported: fail loudly rather than report success
    else:
        raise ValueError("key must end in .json or .txt: {}".format(key))

    print("Completed writing to S3")
    
def transcribe_youtube_video(youtube_id):
    """
    given a youtube id, call the youtube transcriber and return the transcript, duration and string
    """
    from youtube_transcript_api import YouTubeTranscriptApi as yt
    
    try:
        print("getting transcript...")
        transcript = yt.get_transcript(youtube_id)
        print("calculating duration...")
        duration = get_duration_seconds(transcript)
        print("extracting text from transcript")
        text = get_text(transcript)

    except Exception as e:
        # report the id that failed (the original message relied on a global `date`)
        print("An exception occurred for {}: {}".format(youtube_id, e))
        transcript='error'
        duration = np.nan #null
        text='error'      
        
    return transcript, duration, text
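The extension-based branching in `add_file_to_s3` can be separated from the upload itself, which makes it testable without AWS credentials. The sketch below is one possible arrangement: `prepare_s3_body` is a hypothetical helper that returns the bytes and content type that would be passed to `put_object`.

```python
import json

def prepare_s3_body(key, obj):
    """Return (body_bytes, content_type) for a key ending in .json or .txt."""
    if key.endswith('.json'):
        return json.dumps(obj).encode('utf-8'), 'application/json'
    if key.endswith('.txt'):
        return obj.encode('utf-8'), 'text/plain'
    raise ValueError('unsupported key extension: {}'.format(key))

body, content_type = prepare_s3_body('nsw/text/20210809_nsw_press_conference.txt',
                                     'good morning everyone')
print(content_type, len(body))
```

Raising on an unknown extension is a design choice in this sketch; it prevents a misspelled key from being skipped silently.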

## Daily Routine: Process Latest Press Conference
Ensure the YouTube link has been added to the Google Sheet at:
https://docs.google.com/spreadsheets/d/1eKGoGjkvzLmoxH9mK9UGGQXWECCV4qWWCl0QKsj87cE/edit?usp=sharing


Declare the date below (format `YYYYMMDD`, matching the `date` column in the sheet):

In [37]:
date = '20210809'
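Because the date string is used both to look up the sheet row and to build S3 keys, a typo here only surfaces later. A small check like the following could catch it early. This is a sketch, and `validate_date` is a hypothetical helper that assumes the `YYYYMMDD` format seen in the sheet.

```python
from datetime import datetime

def validate_date(date_str):
    """Return date_str if it is a real calendar date written as YYYYMMDD."""
    # strptime alone tolerates single-digit months and days, so check the shape first
    if len(date_str) != 8 or not date_str.isdigit():
        raise ValueError('expected 8 digits (YYYYMMDD), got {!r}'.format(date_str))
    datetime.strptime(date_str, '%Y%m%d')  # raises ValueError for impossible dates
    return date_str

date = validate_date('20210809')
print(date)
```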

### Transcribe the latest video
This will return 3 things:
- the transcript (with timestamps)
- the duration of the press conference
- a string (or text) of the transcription

In [40]:
# import the latest youtube ID (from googlesheet referenced above)
JSON_KEY = '/Users/liampearson/Downloads/covid19data-321603-3d98ddb02134.json'
SHEET_NAME = 'youtube_ids'
BUCKET_NAME = 'covid19-au-press-conferences'

sheet = authenticate_gsheet(JSON_KEY, SHEET_NAME)

#import the youtube id's
df = import_gsheet(sheet)

#get the youtube id and link
youtube_id = df[df['date']==date]['id'].tolist()[0]
youtube_link = df[df['date']==date]['link'].tolist()[0]

print("youtube id for {} is {}".format(date, youtube_id))

#call the youtube transcriber
transcript, duration, text = transcribe_youtube_video(youtube_id)

youtube id for 20210809 is EvRhxnmPCo0
getting transcript...
calculating duration...
extracting text from transcript


### Add to S3 bucket as .txt file

In [41]:
add_file_to_s3(bucket_name = BUCKET_NAME, 
               key = 'nsw/text/{}_nsw_press_conference.txt'.format(date),
               txt_object = text)

starting up S3 client...
Completed writing to S3


### Now get the big json file (all transcriptions) 
then:
- append the latest data to it
- push back to S3

In [42]:
#download from S3
print("downloading json from S3")
bytes_object = download_s3_file_to_memory(BUCKET_NAME, 'nsw/json/nswtranscriptions.json')

#convert to dictionary
#the file was written with json.dumps, so parse it with json.loads
#(ast.literal_eval fails on JSON values such as null, true or NaN)
print("Converting downloaded bytes to dictionary")
transcriptions_dict = json.loads(bytes_object.decode('utf-8'))

# create latest entry
print("Creating new entry (dictionary)")
new_dict = {'link':youtube_link,
                  'id':youtube_id,
                  'duration_seconds':duration,
                  'transcript': transcript,
                  'text':text
                 }

# add it to the dictionary
print("appending to large dictionary of past entries")
transcriptions_dict[date] = new_dict

#send back to S3
print("sending back to S3...")
add_file_to_s3(bucket_name = BUCKET_NAME, 
               key = 'nsw/json/nswtranscriptions.json',
               txt_object = transcriptions_dict)

print("Complete")

downloading json from S3
Converting downloaded bytes to dictionary
Creating new entry (dictionary)
appending to large dictionary of past entries
sending back to S3...
starting up S3 client...
working a .json file
converting to string
writing to S3....
Completed writing to S3
Complete
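One detail worth checking in the download, append and upload round trip is how failed entries are stored. The duration of a failed video is NaN, and Python's `json.dumps` writes that as the token `NaN`, which is not valid in strict JSON and may be rejected by other tools. The sketch below shows one way to replace NaN with `null` before serialising; `nan_to_none` is a hypothetical helper.

```python
import json
import math

def nan_to_none(value):
    """Recursively replace float NaN with None so it serialises as JSON null."""
    if isinstance(value, float) and math.isnan(value):
        return None
    if isinstance(value, dict):
        return {k: nan_to_none(v) for k, v in value.items()}
    if isinstance(value, list):
        return [nan_to_none(v) for v in value]
    return value

# Invented entry shaped like a failed transcription
entry = {'20210718': {'duration_seconds': math.nan, 'text': 'error'}}

raw = json.dumps(entry)                                   # contains the token NaN
payload = json.dumps(nan_to_none(entry)).encode('utf-8')  # contains null instead
restored = json.loads(payload.decode('utf-8'))
print(raw)
print(restored)
```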


In [25]:
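def manually_import_text(text_path):
    """
    reads a manually saved transcript from text_path and
    returns it as one string with the newlines removed
    """
    with open(text_path, 'r') as file:
        data = file.read().replace('\n', '')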
        
    return data

text_path = '/Users/liampearson/Downloads/20210718.txt'

data = manually_import_text(text_path)
data = data.replace("\ufeff", "")

date = '20210718'
BUCKET_NAME = 'covid19-au-press-conferences'


add_file_to_s3(bucket_name = BUCKET_NAME, 
               key = 'nsw/text/{}_nsw_press_conference.txt'.format(date),
               txt_object = data)

starting up S3 client...
Completed writing to S3
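The `"\ufeff"` removed above is a byte order mark (BOM) that some editors write at the start of UTF-8 files. As an alternative sketch, opening the file with the `utf-8-sig` codec drops a leading BOM during decoding. `read_text_without_bom` is a hypothetical helper, and it joins lines with single spaces rather than removing newlines outright, so words on adjacent lines stay separated.

```python
import os
import tempfile

def read_text_without_bom(path):
    """Read a UTF-8 text file, dropping a leading BOM and collapsing whitespace."""
    # 'utf-8-sig' decodes UTF-8 and skips a byte order mark if one is present
    with open(path, 'r', encoding='utf-8-sig') as f:
        return ' '.join(f.read().split())

# Write a small invented file that starts with a BOM, then read it back
with tempfile.NamedTemporaryFile('wb', suffix='.txt', delete=False) as tmp:
    tmp.write(b'\xef\xbb\xbfgood morning\neveryone\n')
    tmp_path = tmp.name

text = read_text_without_bom(tmp_path)
os.remove(tmp_path)
print(repr(text))
```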
