# Parse Video Metadata Files

The WALDO video search tool lets users index any type of video content and search it with AI assistance.
Each video you wish to index has 2 requirements:
1. A video file with a resolution of 720p or higher
2. A metadata file containing information about the provided video file

In this notebook we parse raw metadata files (JSON format) to validate that all information is in place for successful video indexing.<br>
Parsed files are stored in the **silver** storage container and used later for video processing.

<hr />

#### What is in this notebook?
1. Connect to the Azure Blob Storage account that contains the videos and metadata files
2. Read a JSON file from Blob Storage
3. Parse the file with the `MetadataParser` class and inspect the results

#### What is the MetadataParser class?
This class iterates over the provided JSON files and makes sure that all information needed for video processing is in place.
The mandatory key list can be found in [enrichment/metadata_parser/assets/static.py](/workspaces/Waldo/src/python/common/enrichment/metadata_parser/assets/static.py)
##### Examples for keys:
1. `matching_video_name` - Name of the video file matching the metadata file
2. `video_description` - Video annotation or a short description
3. `video_languages` - For the best audio transcription results, provide the video language. If you are not sure of the language, Video Indexer will identify it for you.
4. `data_source` - Ensures that all videos from the same collection/catalogue are stored in the same location
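
For illustration, a metadata file carrying the keys above might look like the following sketch. All values are made up, and the value types (for example, `video_languages` as a list of language names) are assumptions; the mandatory key list in `static.py` remains the authoritative reference.

```python
import json

# Hypothetical metadata record; every value here is illustrative only.
sample_metadata = {
    "matching_video_name": "video1234.mp4",
    "video_description": "A person slices vegetables in a kitchen.",
    "video_languages": ["English"],  # assumed to be a list of language names
    "data_source": "msr-vtt",
}

# Serialize the record the way it might be stored in a raw metadata file
raw_json = json.dumps(sample_metadata, indent=2)
print(raw_json)

# Round-trip to confirm the content is valid JSON
loaded = json.loads(raw_json)
```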

#### Exceptions
The code throws an exception when the `matching_video_name` key is missing from the JSON.
Without this key, the metadata file cannot be connected to the relevant video file.
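
The behavior can be sketched roughly as follows. This is not the `MetadataParser` implementation: `check_matching_video_name` is a hypothetical helper, and the exception type (`KeyError`) is an assumption made for the example.

```python
def check_matching_video_name(metadata: dict) -> str:
    """Return the video name, or raise if the metadata cannot be linked to a video."""
    name = metadata.get("matching_video_name")
    if not name:
        # Assumed exception type; the real parser may raise something else.
        raise KeyError("matching_video_name is missing; cannot link metadata to a video file")
    return name


try:
    check_matching_video_name({"video_description": "no video name here"})
except KeyError as err:
    error_message = str(err)
    print(f"Rejected metadata file: {error_message}")
```

A metadata file that does contain the key passes through unchanged, so the check can run before any further parsing.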

----------------

#### Parse metadata files from a data_source named `msr-vtt`
The following environment variables are needed to connect to your Blob Storage account:

1. `WALDO_STORAGE_ACCOUNT_NAME` - Name of Blob Storage account where videos and metadata are hosted.
2. `WALDO_UPLOAD_STORAGE_KEY` - Blob Storage account access key.
3. `WALDO_CONTAINER_NAME` - Container name where the blobs are hosted.
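
Because `os.getenv` returns `None` for an unset variable, a missing value would otherwise only surface later as a malformed connection string. A minimal sketch of an early check is shown below; `missing_env_vars` is a hypothetical helper and not part of the WALDO code base.

```python
import os

REQUIRED_VARS = (
    "WALDO_STORAGE_ACCOUNT_NAME",
    "WALDO_UPLOAD_STORAGE_KEY",
    "WALDO_CONTAINER_NAME",
)


def missing_env_vars(names=REQUIRED_VARS, environ=None):
    """Return the names from `names` that are unset or empty in `environ`."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


missing = missing_env_vars()
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```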

In [17]:
"""
Copyright (c) Microsoft Corporation.
Licensed under the MIT license.
"""
from azure.storage.blob import BlobServiceClient
from enrichment.metadata_parser.json_reader import json_reader
import os
from dotenv import load_dotenv

# load env variables from local .env file
load_dotenv()

data_source = 'msr-vtt'

storage_account_name = os.getenv('WALDO_STORAGE_ACCOUNT_NAME')
account_storage_key = os.getenv('WALDO_UPLOAD_STORAGE_KEY')
container_name = os.getenv('WALDO_CONTAINER_NAME')

# connect to storage container
connection_string = f"DefaultEndpointsProtocol=https;AccountName={storage_account_name};AccountKey={account_storage_key};EndpointSuffix=core.windows.net"
service = BlobServiceClient.from_connection_string(conn_str=connection_string)
container_client = service.get_container_client(container_name)

In [None]:

# List all blobs under the data source and separate video files from JSON metadata files
data_source_files = list(container_client.list_blobs(name_starts_with=f'{data_source}/'))
video_files = [x['name'] for x in data_source_files if x['name'].lower().endswith((".mov", ".mp4"))]
metadata_files = [x['name'] for x in data_source_files if x['name'].lower().endswith(".json")]
print(f'Found {len(metadata_files)} JSON metadata files and {len(video_files)} video files')


In [None]:
# Read one JSON metadata file (index 25 is an arbitrary pick; any valid index works)
from enrichment.metadata_parser.json_reader import json_reader
file_to_read = metadata_files[25]
data = json_reader(container_client, file_to_read)

In [None]:
# Import the parser class explicitly instead of using a wildcard import
from enrichment.metadata_parser.metadata_parser import MetadataParser

parser = MetadataParser()
parsed_data = parser.parse_metadata(data)
parsed_data

#### View available keys

In [None]:
print(parsed_data.keys())

#### Check the language to use for the Video Indexer upload

In [None]:
parsed_data['video_languages'], parsed_data['language_codes']

#### Upload videos to Video Indexer using `video_hash_id` as the key
Video names are limited to 80 characters, so videos should be uploaded using the `video_hash_id` as the key.
The original name can be uploaded using the `description` field.
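
This notebook does not show how `video_hash_id` is computed. As a sketch of the general idea only, hashing a name yields a fixed-length identifier that stays under the 80-character limit regardless of how long the original name is. The helper `make_hash_id` and the choice of SHA-256 are assumptions for illustration, not the parser's actual method.

```python
import hashlib


def make_hash_id(video_name: str) -> str:
    """Hypothetical helper: derive a fixed-length id from a video name."""
    # A SHA-256 hex digest is always 64 characters long
    return hashlib.sha256(video_name.encode("utf-8")).hexdigest()


# A name well over 80 characters still maps to a 64-character id
long_name = "a_very_long_descriptive_video_file_name_" * 4 + ".mp4"
hash_id = make_hash_id(long_name)
print(len(long_name), len(hash_id))
```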

In [None]:
parsed_data['video_hash_id']