# Video Subtitles in Spanish




## Architecture

In this example, we use the Notebooks feature of Amazon SageMaker to create an interactive notebook with Python code. Notebooks are just one part of Amazon SageMaker, a fully managed service that covers the entire machine learning workflow: labeling and preparing your data, choosing an algorithm, training it, tuning and optimizing it for deployment, making predictions, and taking action. For the actual machine learning and prediction in this example, however, we use Amazon Transcribe to extract the subtitle text from the video and Amazon Translate to translate the subtitles into Spanish.

All input video files are read from a bucket in Amazon Simple Storage Service (Amazon S3), an object storage service that offers industry-leading scalability, data availability, security, and performance. The resulting SRT file, which contains the subtitles, is written back to the S3 bucket. This notebook is for demonstration purposes only; for a production workload, you can split this code into separate AWS Lambda functions and use AWS Step Functions to orchestrate them. You can then combine the video and the SRT file to create a video with the appropriate subtitles.

![Architecture diagram](VideoSubtitles.jpg "diagram")





In [137]:



# Import the libraries used in this notebook
import json
import re
import time
from datetime import datetime

import boto3

# Create the AWS service clients and resources
transcribe = boto3.client('transcribe')
translate = boto3.client(service_name='translate')
s3 = boto3.resource('s3')



In [138]:
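# S3 location of the input video; replace it with your own bucket and key
file_uri = 's3://nkkoshiy-demobucket/VideoSubtitles/reinvent.mp4'

# Append a timestamp so that every run gets a unique transcription job name
now = datetime.now()
current_time = now.strftime("%d%m%y%H%M%S")
job_name = 'Example-job' + current_time

# Start an asynchronous transcription job. The JSON transcript is written
# to the output bucket under the key job_name.
transcribe.start_transcription_job(
    TranscriptionJobName=job_name,
    Media={'MediaFileUri': file_uri},
    MediaFormat='mp4',
    LanguageCode='en-US',
    OutputBucketName='nkkoshiy-demobucket',
    OutputKey=job_name
)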

{'TranscriptionJob': {'TranscriptionJobName': 'Example-job181120221425',
  'TranscriptionJobStatus': 'IN_PROGRESS',
  'LanguageCode': 'en-US',
  'MediaFormat': 'mp4',
  'Media': {'MediaFileUri': 's3://nkkoshiy-demobucket/VideoSubtitles/reinvent.mp4'},
  'StartTime': datetime.datetime(2020, 11, 18, 22, 14, 26, 69000, tzinfo=tzlocal()),
  'CreationTime': datetime.datetime(2020, 11, 18, 22, 14, 26, 29000, tzinfo=tzlocal())},
 'ResponseMetadata': {'RequestId': '86e49a8a-8e63-442a-a05a-193c9aa2621e',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'content-type': 'application/x-amz-json-1.1',
   'date': 'Wed, 18 Nov 2020 22:14:26 GMT',
   'x-amzn-requestid': '86e49a8a-8e63-442a-a05a-193c9aa2621e',
   'content-length': '294',
   'connection': 'keep-alive'},
  'RetryAttempts': 0}}
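
The cell above spells out the bucket name twice, once inside `file_uri` and once in `OutputBucketName`. A minimal sketch of deriving both parts from the URI instead, assuming a hypothetical helper named `split_s3_uri` (it is not part of boto3):

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split an s3://bucket/key URI into (bucket, key). Hypothetical helper."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri!r}")
    # netloc is the bucket name; the path is the object key with a leading slash
    return parsed.netloc, parsed.path.lstrip("/")


bucket, key = split_s3_uri("s3://nkkoshiy-demobucket/VideoSubtitles/reinvent.mp4")
print(bucket)  # nkkoshiy-demobucket
print(key)     # VideoSubtitles/reinvent.mp4
```

The returned bucket name could then be passed as `OutputBucketName`, so that only `file_uri` has to change when the notebook is pointed at a different video.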

In [139]:

# Poll the job status every 10 seconds, for up to 60 tries (about 10 minutes)
max_tries = 60
while max_tries > 0:
    max_tries -= 1
    job = transcribe.get_transcription_job(TranscriptionJobName=job_name)
    job_status = job['TranscriptionJob']['TranscriptionJobStatus']
    if job_status in ['COMPLETED', 'FAILED']:
        print(f"Job {job_name} is {job_status}.")
        if job_status == 'COMPLETED':
            print(
                f"Download the transcript from\n"
                f"\t{job['TranscriptionJob']['Transcript']['TranscriptFileUri']}.")
        break
    else:
        print(f"Waiting for {job_name}. Current status is {job_status}.")
    time.sleep(10)
else:
    # The loop ran out of tries without seeing a terminal status
    print(f"Timed out waiting for {job_name}.")

Waiting for Example-job181120221425. Current status is IN_PROGRESS.
Waiting for Example-job181120221425. Current status is IN_PROGRESS.
Waiting for Example-job181120221425. Current status is IN_PROGRESS.
Waiting for Example-job181120221425. Current status is IN_PROGRESS.
Waiting for Example-job181120221425. Current status is IN_PROGRESS.
Waiting for Example-job181120221425. Current status is IN_PROGRESS.
Waiting for Example-job181120221425. Current status is IN_PROGRESS.
Job Example-job181120221425 is COMPLETED.
Download the transcript from
	https://s3.us-east-1.amazonaws.com/nkkoshiy-demobucket/Example-job181120221425.
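
The polling pattern above can be factored into a small reusable function. The following is only a sketch: `wait_for_status` and its parameters are hypothetical names, and the status source is a stand-in so that the example runs without AWS credentials.

```python
import time


def wait_for_status(get_status, done_states, max_tries=60, delay_seconds=10, sleep=time.sleep):
    """Poll get_status() until it returns a value in done_states.

    Hypothetical helper. Raises TimeoutError if no terminal status is
    seen within max_tries polls.
    """
    for attempt in range(max_tries):
        status = get_status()
        if status in done_states:
            return status
        # Do not sleep after the final attempt
        if attempt < max_tries - 1:
            sleep(delay_seconds)
    raise TimeoutError(f"No terminal status after {max_tries} tries")


# Stand-in for the Transcribe status lookup: two polls in progress, then done
fake_statuses = iter(["IN_PROGRESS", "IN_PROGRESS", "COMPLETED"])
final = wait_for_status(
    lambda: next(fake_statuses),
    {"COMPLETED", "FAILED"},
    sleep=lambda seconds: None,  # skip the real delay in this demonstration
)
print(final)  # COMPLETED
```

In the notebook, `get_status` would wrap the `transcribe.get_transcription_job` call and return `job['TranscriptionJob']['TranscriptionJobStatus']`. Raising on timeout makes it explicit when the job never reached a terminal state.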


In [140]:
# Download the transcript JSON that Amazon Transcribe wrote to the output bucket
bucketname = 'nkkoshiy-demobucket'
obj = s3.Object(bucketname, job_name)
body = obj.get()['Body'].read()
json_content = json.loads(body)
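
The later cells read only a few fields of the transcript JSON. The sketch below shows that shape with an abridged, hypothetical transcript; the values are invented for illustration, and a real transcript contains more fields than the ones this notebook uses.

```python
# Abridged, hypothetical transcript: only the fields this notebook reads are shown
sample_transcript = {
    "results": {
        "transcripts": [{"transcript": "Hello, world."}],
        "items": [
            {"type": "pronunciation", "start_time": "0.04", "end_time": "0.45",
             "alternatives": [{"content": "Hello"}]},
            {"type": "punctuation",
             "alternatives": [{"content": ","}]},
            {"type": "pronunciation", "start_time": "0.50", "end_time": "0.92",
             "alternatives": [{"content": "world"}]},
            {"type": "punctuation",
             "alternatives": [{"content": "."}]},
        ],
    }
}

for item in sample_transcript["results"]["items"]:
    content = item["alternatives"][0]["content"]
    # Punctuation items carry no timing, so .get() returns None for them
    print(item["type"], repr(content), item.get("start_time"), item.get("end_time"))
```

Note that the times are strings holding seconds, which is why the notebook converts them with `float()` before formatting them.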


In [141]:
# Load the transcript JSON that Amazon Transcribe wrote to the output bucket
bucketname = 'nkkoshiy-demobucket'
obj = s3.Object(bucketname, job_name)
body = obj.get()['Body'].read()
json_content = json.loads(body)

# Translate each item (a word or a punctuation mark) from English to Spanish
# and write the result back in place, so every item keeps its timing information.
# Note: word-by-word translation loses sentence context and makes one API call per item.
for item in json_content['results']['items']:
    alternative = item['alternatives'][0]
    result = translate.translate_text(Text=alternative['content'], SourceLanguageCode="en", TargetLanguageCode="es")
    alternative['content'] = result['TranslatedText']

# Translate the full transcript as well. Store only the translated string,
# not the whole API response. Amazon Translate limits the size of the text in
# a single request, so a long transcript may need to be split first.
transcript = json_content['results']['transcripts'][0]
result1 = translate.translate_text(Text=transcript['transcript'], SourceLanguageCode="en", TargetLanguageCode="es")
transcript['transcript'] = result1['TranslatedText']

# Save the translated transcript to the same bucket under a new key
translatedfilename = "spanish" + job_name
s3object = s3.Object(bucketname, translatedfilename)
s3object.put(Body=json.dumps(json_content).encode('utf-8'))

{'ResponseMetadata': {'RequestId': '2C0DD46F211501A1',
  'HostId': 'upg87WWEr5U7i0AnlKb+anqntP3S5E0mqEg4Dbd6KckyZ+62uXVtf1pUwlu5110hHaGWhCGsLfY=',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amz-id-2': 'upg87WWEr5U7i0AnlKb+anqntP3S5E0mqEg4Dbd6KckyZ+62uXVtf1pUwlu5110hHaGWhCGsLfY=',
   'x-amz-request-id': '2C0DD46F211501A1',
   'date': 'Wed, 18 Nov 2020 22:17:46 GMT',
   'etag': '"4724901463bc94abcadc2ae98c1cadb0"',
   'content-length': '0',
   'server': 'AmazonS3'},
  'RetryAttempts': 0},
 'ETag': '"4724901463bc94abcadc2ae98c1cadb0"'}
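
The full transcript is sent to Amazon Translate in a single `translate_text` call. The service caps the size of the text in one request, so a long transcript may have to be split first. A minimal sketch of splitting on sentence boundaries, assuming a hypothetical helper `chunk_text` and a placeholder byte budget (check the Amazon Translate documentation for the actual limit):

```python
import re


def chunk_text(text, max_bytes):
    """Split text into chunks of at most max_bytes UTF-8 bytes, breaking after sentences.

    Hypothetical helper. A single sentence longer than max_bytes is kept whole,
    so max_bytes should be well above the longest expected sentence.
    """
    # Split after ".", "!" or "?" followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate.encode("utf-8")) > max_bytes:
            # Adding this sentence would exceed the budget: close the chunk
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


print(chunk_text("One two. Three four. Five six.", max_bytes=20))
# ['One two. Three four.', 'Five six.']
```

Each chunk could then be passed to `translate.translate_text` in turn and the translated pieces joined with a space.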

In [142]:
import re

def getTimeCode(seconds):
    # Convert a number of seconds into an SRT time code (HH:MM:SS,mmm)
    thund = int(seconds % 1 * 1000)
    tseconds = int(seconds)
    thours = tseconds // 3600
    tmins = (tseconds % 3600) // 60
    tsecs = tseconds % 60
    return "%02d:%02d:%02d,%03d" % (thours, tmins, tsecs, thund)

def newPhrase():
    # A phrase is one subtitle cue: a time range plus the words shown in it
    return {'start_time': '', 'end_time': '', 'words': []}

def getPhraseText(phrase):
    # Join the words of a phrase, with a space before each word
    # but no space before punctuation
    out = ""
    for i, word in enumerate(phrase["words"]):
        # \w also matches accented letters such as "é", which occur in Spanish text
        if re.match(r'\w', word) and i > 0:
            out += " " + word
        else:
            out += word
    return out
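
An SRT time code has the form `HH:MM:SS,mmm`. As an illustration of the conversion, here is a sketch that works entirely in integer milliseconds with `divmod`, which sidesteps float rounding; `srt_timestamp` is a hypothetical name, not a function from this notebook.

```python
def srt_timestamp(seconds):
    """Format a number of seconds as an SRT time code HH:MM:SS,mmm. Hypothetical helper."""
    # Accept strings as well, since the transcript stores times as strings
    total_ms = int(round(float(seconds) * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


print(srt_timestamp("0.04"))   # 00:00:00,040
print(srt_timestamp(75.5))     # 00:01:15,500
print(srt_timestamp(3661.5))   # 01:01:01,500
```

The last call shows why the hours field matters: a time past the one-hour mark must roll over into hours rather than produce a minutes value above 59.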
	

In [145]:

items = json_content['results']['items']

# set up some variables for the first pass
phrase = newPhrase()
phrases = []
nPhrase = True
x = 0

print("==> Creating phrases from transcript...")

for item in items:

    # Punctuation doesn't contain timing information, so only words
    # (type "pronunciation") update the timing of the phrase
    if item["type"] == "pronunciation":
        # the first word of a new phrase sets its start_time
        if nPhrase:
            phrase["start_time"] = getTimeCode(float(item["start_time"]))
            nPhrase = False
        # every word moves the end_time forward, so the phrase ends
        # with its last word
        phrase["end_time"] = getTimeCode(float(item["end_time"]))

    # in either case, append the word to the phrase...
    phrase["words"].append(item['alternatives'][0]["content"])
    x += 1

    # now add the phrase to the phrases, generate a new phrase, etc.
    if x == 10:
        phrases.append(phrase)
        phrase = newPhrase()
        nPhrase = True
        x = 0

# add the last phrase, which may hold fewer than 10 items
if phrase["start_time"]:
    phrases.append(phrase)

==> Creating phrases from transcript...
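
To see the grouping on a small input, the sketch below applies the same idea to three toy items with a phrase size of 2. `group_items` and `word` are hypothetical helpers, and times are kept as floats here to keep the example short.

```python
def group_items(items, size=10):
    """Group transcript items into phrases of at most `size` items. Hypothetical helper."""
    phrases = []
    phrase = {"start_time": None, "end_time": None, "words": []}
    for item in items:
        if item["type"] == "pronunciation":
            # The first timed word sets the start; every timed word moves the end
            if phrase["start_time"] is None:
                phrase["start_time"] = float(item["start_time"])
            phrase["end_time"] = float(item["end_time"])
        phrase["words"].append(item["alternatives"][0]["content"])
        if len(phrase["words"]) == size:
            phrases.append(phrase)
            phrase = {"start_time": None, "end_time": None, "words": []}
    # Keep the final, shorter phrase as long as it contains a timed word
    if phrase["start_time"] is not None:
        phrases.append(phrase)
    return phrases


def word(text, start, end):
    """Build a toy pronunciation item in the shape the notebook reads."""
    return {"type": "pronunciation", "start_time": str(start), "end_time": str(end),
            "alternatives": [{"content": text}]}


toy_items = [word("uno", 0.0, 0.4), word("dos", 0.5, 0.9), word("tres", 1.0, 1.4)]
for p in group_items(toy_items, size=2):
    print(p)
# {'start_time': 0.0, 'end_time': 0.9, 'words': ['uno', 'dos']}
# {'start_time': 1.0, 'end_time': 1.4, 'words': ['tres']}
```

The second phrase holds the leftover item, which shows why the final partial phrase has to be appended after the loop.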


In [144]:
import re

# Write the phrases to a local SRT file, then upload the file to S3
filename = job_name + "SRTFile.srt"
x = 1

with open(filename, "w", encoding="utf-8") as e:
    for phrase in phrases:
        # write out the phrase number
        e.write(str(x) + "\n")
        x += 1

        # write out the start and end time
        e.write(phrase["start_time"] + " --> " + phrase["end_time"] + "\n")

        # write out the full phrase. Use spacing if it is a word, or punctuation without spacing
        out = getPhraseText(phrase)
        e.write(out + "\n\n")

# read the finished file back and upload it to the S3 bucket
with open(filename, "r", encoding="utf-8") as fin:
    text = fin.read()
s3.Object(bucketname, filename).put(Body=text.encode("utf-8"))

{'ResponseMetadata': {'RequestId': '37B3C2EFEB1163D5',
  'HostId': '0krG4BvAhisGR2i2EKueOdGb5OPbz2dxvopmDIIxuxm+Pi9RDIO8WJrD/Y0VWuzFJz+CkWLqNcg=',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amz-id-2': '0krG4BvAhisGR2i2EKueOdGb5OPbz2dxvopmDIIxuxm+Pi9RDIO8WJrD/Y0VWuzFJz+CkWLqNcg=',
   'x-amz-request-id': '37B3C2EFEB1163D5',
   'date': 'Wed, 18 Nov 2020 22:19:13 GMT',
   'etag': '"d05064a20694d70df5e3323d03baf545"',
   'content-length': '0',
   'server': 'AmazonS3'},
  'RetryAttempts': 0},
 'ETag': '"d05064a20694d70df5e3323d03baf545"'}
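
The SRT text can also be assembled in memory, which avoids the temporary local file. A minimal sketch, assuming a hypothetical helper `render_srt` and phrases that already carry formatted time codes and joined text:

```python
def render_srt(phrases):
    """Render phrases as SRT text. Hypothetical helper.

    Each phrase needs 'start_time' and 'end_time' (SRT time codes)
    and 'text' (the subtitle line).
    """
    blocks = []
    for number, phrase in enumerate(phrases, start=1):
        blocks.append(
            f"{number}\n"
            f"{phrase['start_time']} --> {phrase['end_time']}\n"
            f"{phrase['text']}\n"
        )
    # A blank line separates consecutive cues
    return "\n".join(blocks)


sample_phrases = [
    {"start_time": "00:00:00,040", "end_time": "00:00:02,500", "text": "Hola, ¿cómo estás?"},
    {"start_time": "00:00:02,600", "end_time": "00:00:04,100", "text": "Muy bien, gracias."},
]
srt_text = render_srt(sample_phrases)
print(srt_text)
```

The resulting string could be uploaded directly with `s3.Object(bucketname, filename).put(Body=srt_text.encode("utf-8"))`.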