# Scene Boundary Partitioning
Movies are composed of self-contained units called scenes. A scene has a beginning and an end, and usually consists of a single conversation. Scenes most often take place in one location with a fixed number of characters. By identifying the scenes in a movie, we can begin to analyze them individually, most notably by treating a scene's dialogue as a freestanding, independent conversation.

To start, we'll just be identifying two-character dialogue scenes. These are the most basic building blocks of films: just two characters speaking together with no distractions, purely advancing the plot with their dialogue. In modern filmmaking, these scenes are usually shot in a specific manner. We can take advantage of this by looking for specific patterns of shots to identify a few two-character dialogue scenes.

In [1]:
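import sys
sys.path.append('../data_serialization')
import numpy as np  # used below to average face encodings
import face_recognition  # used below to compare face encodings
from serialization_preprocessing_io import *
from time_reference_io import *
from scene_identification_io import *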

We have saved pickle objects of various dataframes. We'll load all five dataframes into memory, but we're most interested in the two that deal with onscreen images. They each have one row per frame (screencap), with one frame per second, so each row represents one second of onscreen action.
- vision_df: contains general computer vision information on each frame, including clusterings of similar frames into "shots"
- face_df: contains information related to faces found, including their vectorized encodings, and clusters of these encodings

In [2]:
film = 'plus_one_2019'
srt_df, subtitle_df, sentence_df, vision_df, face_df = read_pickle(film)

## Identifying Scenes
In modern film, two-character dialogue scenes follow a very distinct pattern. Character A speaks, then Character B, then back to A, then to B, etc. We cut back and forth between the two characters.


### Anchor Shots: The A/B/A/B Pattern
We look for the two anchor shots: the shots of each of the two characters, which together form the A/B/A/B pattern. We'll look through every frame in the film and try to find these ABAB patterns.

The key to this lies in two columns in vision_df:
- shot_cluster:  represents clusters of similar frames, or shots. Think of a four-second shot of a character speaking. This would be represented as four rows with a common shot_cluster
- shot_id: sequential numbering of each shot (regardless of uniqueness). Every time a shot changes (and even if we've seen this shot before), the shot_id is incremented by 1

In [3]:
vision_df[202:213]

frame,blank,aspect_ratio,brightness,contrast,blue,green,red,shot_cluster,shot_id
203,,2.39,39,39,36,36,45,2,52
204,,2.39,31,28,25,27,42,2,52
205,,2.39,35,36,33,32,44,2,52
206,,2.39,29,30,25,25,39,2,52
207,,2.39,65,51,57,63,73,2,52
208,,2.39,100,61,89,95,114,71,53
209,,2.39,92,68,82,88,102,186,54
210,,2.39,92,69,81,88,103,186,54
211,,2.39,91,67,82,88,102,186,54
212,,2.39,96,68,86,92,107,186,54
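
To make the relationship between the two columns concrete, here is a minimal sketch that rebuilds the `shot_id` values from the `shot_cluster` values in the rows above. The helper `assign_shot_ids` is hypothetical (it is not part of the imported modules), and it assumes a new shot begins whenever the cluster label changes between consecutive frames; the upstream code that actually produced `shot_id` isn't shown here and may detect cuts differently.

```python
def assign_shot_ids(frame_clusters, first_shot_id=0):
    """Number shots sequentially: a new shot starts whenever the cluster label changes.

    Hypothetical helper for illustration only.
    """
    shot_ids = []
    current_id = first_shot_id
    for i, cluster in enumerate(frame_clusters):
        # a change in cluster between consecutive frames marks a new shot
        if i > 0 and cluster != frame_clusters[i - 1]:
            current_id += 1
        shot_ids.append(current_id)
    return shot_ids


# shot_cluster values for frames 203-212, copied from the rows above
clusters = [2, 2, 2, 2, 2, 71, 186, 186, 186, 186]
print(assign_shot_ids(clusters, first_shot_id=52))
# [52, 52, 52, 52, 52, 53, 54, 54, 54, 54]
```

Note that the numbering keeps counting upward even when a cluster reappears later, which is what lets us tell two visits to the same shot apart.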


The below code will generate two lists each time an ABAB pattern is found:
- alternating_pairs: the two shot_clusters
- pair_shot_ids: the beginning and ending shot_id

In [4]:
shot_id_list = vision_df.shot_id.tolist()
shot_clusters = vision_df.shot_cluster.tolist()
frame_choice = range(1, (len(vision_df) + 1))

# to check for an A/B/A/B pattern, we must store the previous three clusters in memory
prev_clust_1 = 1001
prev_clust_2 = 1002
prev_clust_3 = 1003
prev_shot_id = -1
alternate_a_list = []
alternate_b_list = []
pair_shot_ids = []
pair_found = 0

# zip our various lists into a usable data structure
for frame_file, cluster, shot_id in zip(frame_choice, shot_clusters, shot_id_list):


    # we use prev_shot_id to identify when there's a new shot (when the shot_id value changes)
    # at each new shot, look for an A/B/A/B pattern, and save the clusters of any patterns
    if shot_id != prev_shot_id:
        if cluster == prev_clust_2 and prev_clust_1 == prev_clust_3:
            if pair_found == 0:
                alternate_a_list.append(min(cluster, prev_clust_1)) # min and max are used to avoid duplicates of (1, 2), (2, 1)
                alternate_b_list.append(max(cluster, prev_clust_1))
                beginning_shot = shot_id - 3
            pair_found = 1
        else:
            if pair_found == 1:
                ending_shot = shot_id - 1
                pair_shot_ids.append([beginning_shot, ending_shot])
            pair_found = 0
        
        # every time there's a new shot, we update the cluster memory
        prev_shot_id = shot_id
        prev_clust_3 = prev_clust_2
        prev_clust_2 = prev_clust_1
        prev_clust_1 = cluster
        
    # the below print can be used for troubleshooting and visualizing the memory state at each frame
    # print(frame_file, '\t', cluster, '\t', shot_id, '\t', prev_shot_id, '\t', prev_clust_1, '\t', prev_clust_2, '\t', prev_clust_3, '\tend')

# save non-unique alternating pairs, because these must line up with pair_shot_ids
alternating_pairs = []

for a, b in zip(alternate_a_list, alternate_b_list):
    alternating_pairs.append([int(a), int(b)])

print(len(alternating_pairs))
print(len(pair_shot_ids))

108
108


In [5]:
alternating_pairs[0]

[65, 334]

In [6]:
pair_shot_ids[0]

[33, 36]
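
The same detection rule can be sketched on a list with one cluster label per shot instead of one per frame. This is an illustrative rewrite, not the code used above: `find_alternations` and the toy cluster values are hypothetical. Unlike the frame loop, this version also closes a run that lasts until the final shot.

```python
def find_alternations(shot_clusters, first_shot_id=0):
    """Return (cluster_pair, [first_shot_id, last_shot_id]) for each A/B/A/B run.

    shot_clusters holds one cluster label per shot (not per frame).
    Hypothetical helper for illustration only.
    """
    runs = []
    start = None  # position of the first shot of the run being tracked
    for i in range(len(shot_clusters)):
        # same test as the frame loop: this shot matches the one two back,
        # and the previous shot matches the one three back
        in_pattern = (
            i >= 3
            and shot_clusters[i] == shot_clusters[i - 2]
            and shot_clusters[i - 1] == shot_clusters[i - 3]
        )
        if in_pattern and start is None:
            start = i - 3
        elif not in_pattern and start is not None:
            pair = sorted(shot_clusters[start:start + 2])
            runs.append((pair, [first_shot_id + start, first_shot_id + i - 1]))
            start = None
    if start is not None:  # a run that lasts until the final shot
        pair = sorted(shot_clusters[start:start + 2])
        runs.append((pair, [first_shot_id + start, first_shot_id + len(shot_clusters) - 1]))
    return runs


# toy film: one unrelated shot, six alternating shots, one unrelated shot
toy_shots = [7, 10, 20, 10, 20, 10, 20, 12]
print(find_alternations(toy_shots))
# [([10, 20], [1, 6])]
```

Sorting the pair plays the same role as `min` and `max` above: (10, 20) and (20, 10) are stored identically.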

### Filtering for Substantial Pairs
We've found 108 instances of an ABAB pattern, but we may want to filter this down. Below, we define a threshold of 6 and keep only the pairs whose beginning and ending shot_id's differ by more than that threshold. Because the comparison is strict, we're looking for a series of at least eight alternating shots, which form an ABABABAB pattern.

In [7]:
substantial_pair_shot_ids = []
substantial_anchor_shot_clusters = []
threshold = 6

for anchor_pair, shot_id_pair in zip(alternating_pairs, pair_shot_ids):
    if shot_id_pair[1] - shot_id_pair[0] > threshold:
        substantial_pair_shot_ids.append(shot_id_pair)
        substantial_anchor_shot_clusters.append(anchor_pair)
print(len(substantial_pair_shot_ids))
print(len(substantial_anchor_shot_clusters))

22
22
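
The arithmetic behind the threshold is easy to get off by one, so here is a small sketch of it. The helpers `run_length` and `is_substantial` are hypothetical names; `[33, 36]` is the first pair found above, and the other two pairs are made up to show the boundary.

```python
def run_length(shot_id_pair):
    """Number of shots in a run, counting both endpoints."""
    first_shot, last_shot = shot_id_pair
    return last_shot - first_shot + 1


def is_substantial(shot_id_pair, threshold=6):
    """Mirror the strict comparison used in the filtering loop above."""
    return shot_id_pair[1] - shot_id_pair[0] > threshold


for pair in ([33, 36], [100, 106], [100, 107]):
    print(pair, run_length(pair), is_substantial(pair))
# [33, 36] 4 False
# [100, 106] 7 False
# [100, 107] 8 True
```

A seven-shot run has a shot_id difference of exactly 6, so it is filtered out; eight shots is the shortest run that passes.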


### Checking for Faces
Next, we check each of the 22 possible dialogue scenes to make sure they actually contain characters speaking. We do this by checking both anchor shots and making sure there are actually faces in them. First, we'll need to define functions which return the majority face cluster (and other face clusters) in each anchor. Remember, for each pair of anchor shots, one of them has a character on the left side of the screen, and the other has a different character on the right side of the screen.

We only want to keep anchor shot pairs that find faces on the left in one anchor, and other faces on the right in the other. Specifically, each function below requires more than two frames with `faces_found == 1` on its side of the screen, and the most prevalent face cluster must account for at least 50% of those frames.

In [8]:
def left_face_clusters(alternation_face_df):
    """
    returns the primary face cluster for the left character in a scene, and a list of additional matching face clusters
    primary face cluster is the most prevalent, and matching clusters are the rest
    """
    matching_left_clusters = []

    left_value_counts = alternation_face_df[(alternation_face_df['p_center_alignment'] == 'left') & (alternation_face_df['faces_found'] == 1)].p_face_cluster.value_counts(normalize=True)

    if len(alternation_face_df[(alternation_face_df['p_center_alignment'] == 'left') & (alternation_face_df['faces_found'] == 1)]) > 2:
        if left_value_counts.values[0] >= .5:
            left_anchor_face_cluster = left_value_counts.index.values[0]
            left_anchor_face_encoding = np.average(alternation_face_df.loc[(alternation_face_df['p_center_alignment'] == 'left') & (alternation_face_df['p_face_cluster'] == left_anchor_face_cluster) & (alternation_face_df['faces_found'] == 1)].face_encodings.tolist(), axis=0)[0]
            for candidate in left_value_counts.index.values[1:]:
                left_cluster_candidate = np.average(alternation_face_df.loc[(alternation_face_df['p_center_alignment'] == 'left') & (alternation_face_df['p_face_cluster'] == candidate) & (alternation_face_df['faces_found'] == 1)].face_encodings.tolist(), axis=0)[0]
                if face_recognition.compare_faces([left_anchor_face_encoding], left_cluster_candidate)[0] == True:
                    matching_left_clusters.append(candidate)
            return left_anchor_face_cluster, matching_left_clusters
        else:
            return None, None
    else:
        return None, None

In [9]:
def right_face_clusters(alternation_face_df):
    """
    returns the primary face cluster for the right character in a scene, and a list of additional matching face clusters
    primary face cluster is the most prevalent, and matching clusters are the rest
    """
    matching_right_clusters = []

    right_value_counts = alternation_face_df[(alternation_face_df['p_center_alignment'] == 'right') & (
                alternation_face_df['faces_found'] == 1)].p_face_cluster.value_counts(normalize=True)

    if len(alternation_face_df[
               (alternation_face_df['p_center_alignment'] == 'right') & (alternation_face_df['faces_found'] == 1)]) > 2:
        if right_value_counts.values[0] >= .5:
            right_anchor_face_cluster = right_value_counts.index.values[0]
            right_anchor_face_encoding = np.average(alternation_face_df.loc[
                                                        (alternation_face_df['p_center_alignment'] == 'right') & (
                                                                    alternation_face_df[
                                                                        'p_face_cluster'] == right_anchor_face_cluster) & (
                                                                    alternation_face_df[
                                                                        'faces_found'] == 1)].face_encodings.tolist(),
                                                    axis=0)[0]
            for candidate in right_value_counts.index.values[1:]:
                right_cluster_candidate = np.average(alternation_face_df.loc[
                                                         (alternation_face_df['p_center_alignment'] == 'right') & (
                                                                     alternation_face_df[
                                                                         'p_face_cluster'] == candidate) & (
                                                                     alternation_face_df[
                                                                         'faces_found'] == 1)].face_encodings.tolist(),
                                                     axis=0)[0]
                if face_recognition.compare_faces([right_anchor_face_encoding], right_cluster_candidate)[0] == True:
                    matching_right_clusters.append(candidate)
            return right_anchor_face_cluster, matching_right_clusters
        else:
            return None, None
    else:
        return None, None
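
Stripped of the DataFrame filtering, the selection rule shared by the two functions above can be sketched on a plain list of face cluster labels. `primary_cluster` is a hypothetical helper, and it simplifies two things: it assumes every frame has a cluster label, and it returns all remaining clusters as candidates, whereas the functions above keep only those whose averaged encoding passes `face_recognition.compare_faces`.

```python
from collections import Counter


def primary_cluster(face_clusters, min_frames=3, min_share=0.5):
    """Return (primary, others), or (None, None) if no cluster dominates.

    face_clusters: one face cluster label per frame on one side of the screen.
    Hypothetical helper for illustration only.
    """
    # the functions above require more than two qualifying frames
    if len(face_clusters) < min_frames:
        return None, None
    counts = Counter(face_clusters).most_common()
    primary, primary_count = counts[0]
    # the most prevalent cluster must cover at least half of the frames
    if primary_count / len(face_clusters) < min_share:
        return None, None
    others = [cluster for cluster, _ in counts[1:]]
    return primary, others


print(primary_cluster([3, 3, 11, 3, 11]))  # (3, [11])
print(primary_cluster([3, 3]))             # (None, None): too few frames
print(primary_cluster([1, 2, 3, 4]))       # (None, None): no dominant cluster
```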

In [10]:
alternating_scene_frame_pairs = []
alternating_scene_anchor_pairs = []

for pair, anchors in zip(substantial_pair_shot_ids, substantial_anchor_shot_clusters):
    first_frame = vision_df[vision_df['shot_id'].isin([pair[0], pair[1]])][:1].index[0]
    last_frame = vision_df[vision_df['shot_id'].isin([pair[0], pair[1]])][-1:].index[0]
    alternation_face_df = face_df.copy()[first_frame - 1:last_frame]
    left_right_percentage = len(
        alternation_face_df[alternation_face_df['p_center_alignment'].isin(['left', 'right'])]) / len(
        alternation_face_df) * 100
    prim_face_percentage = len(alternation_face_df[alternation_face_df['prim_char_flag'] == 1]) / len(
        alternation_face_df) * 100
    left_anchor_face_cluster, matching_left_clusters = left_face_clusters(alternation_face_df)
    right_anchor_face_cluster, matching_right_clusters = right_face_clusters(alternation_face_df)
    if left_anchor_face_cluster and right_anchor_face_cluster:
        # prim_face_percentage is on a 0-100 scale, so .8 here means 0.8%, not 80%
        if prim_face_percentage >= .8:
            alternating_scene_frame_pairs.append([first_frame, last_frame])
            alternating_scene_anchor_pairs.append(anchors)
    
print(len(alternating_scene_frame_pairs))
print(len(alternating_scene_anchor_pairs))

11
11


### Expanding the Scene
Next, we'll take a single scene and expand it beyond just the anchors (the two individual shots of Characters A and B). This will help find cutaways: shots that are still part of the scene, but aren't the anchor shots. Cutaways may include shots like a closeup of an object or a POV shot of where a character is looking.

We approach this by looking for anchor shots that fall just outside the ABAB pattern. For example, we might have a scene with ACBABAB, where C is a cutaway. We look before the first shot and after the last shot in the ABAB pattern for the anchors.
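
Before working through the frame-level version, here is a sketch of the idea on a toy list with one label per shot. Everything in it is an assumption for illustration: `expand_scene_shots` is a hypothetical helper (not the imported `expand_scene`), it works on shot positions rather than frame numbers, and the letters stand in for shot clusters.

```python
def expand_scene_shots(shot_clusters, run_start, run_end, search=6):
    """Expand an A/B/A/B run to include nearby anchors and adjoining cutaways.

    shot_clusters: one cluster label per shot.
    run_start, run_end: positions of the first and last shot in the run.
    Returns (first, last, cutaways). Hypothetical helper for illustration only.
    """
    anchors = set(shot_clusters[run_start:run_end + 1])
    # look for anchor shots within `search` shots of the run (exclusive bounds)
    lo = max(0, run_start - search + 1)
    hi = min(len(shot_clusters) - 1, run_end + search - 1)
    anchor_positions = [i for i in range(lo, hi + 1) if shot_clusters[i] in anchors]
    first, last = anchor_positions[0], anchor_positions[-1]
    # any non-anchor cluster between the outermost anchors is a cutaway
    cutaways = set(shot_clusters[first:last + 1]) - anchors
    # extend outward while the neighbouring shot is a known cutaway
    while first > 0 and shot_clusters[first - 1] in cutaways:
        first -= 1
    while last < len(shot_clusters) - 1 and shot_clusters[last + 1] in cutaways:
        last += 1
    return first, last, sorted(cutaways)


# positions:  0    1    2    3    4    5    6    7    8    9    10
toy_shots = ['X', 'C', 'A', 'C', 'B', 'A', 'B', 'A', 'B', 'C', 'Y']
print(expand_scene_shots(toy_shots, run_start=4, run_end=8, search=3))
# (1, 9, ['C'])
```

Here the run BABAB at positions 4 to 8 first grows to include the earlier anchor A at position 2, which identifies C as a cutaway; the scene then absorbs the C shots on either side and stops at the unrelated shots X and Y.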

In [11]:
alternating_scene_frame_pair = alternating_scene_frame_pairs[0]
alternating_scene_frame_pair

[492, 527]

In [12]:
anchor_search_threshold = 6

anchor_shot_cluster_pair = list(vision_df[alternating_scene_frame_pair[0] - 1:alternating_scene_frame_pair[1]].shot_cluster.unique())
anchor_shot_id_pair = [vision_df[alternating_scene_frame_pair[0] - 1:alternating_scene_frame_pair[1]].shot_id.min(),
                       vision_df[alternating_scene_frame_pair[0] - 1:alternating_scene_frame_pair[1]].shot_id.max()]
first_anchor_frame = vision_df[(vision_df['shot_id'] > anchor_shot_id_pair[0] - anchor_search_threshold) & (vision_df['shot_id'] < anchor_shot_id_pair[1] + anchor_search_threshold) & (vision_df['shot_cluster'].isin(anchor_shot_cluster_pair))].index.min()
last_anchor_frame = vision_df[
    (vision_df['shot_id'] > anchor_shot_id_pair[0] - anchor_search_threshold) & (vision_df['shot_id'] < anchor_shot_id_pair[1] + anchor_search_threshold) & (
        vision_df['shot_cluster'].isin(anchor_shot_cluster_pair))].index.max()
cutaways = vision_df[first_anchor_frame - 1:last_anchor_frame].shot_cluster.unique()
cutaways = cutaways[cutaways != anchor_shot_cluster_pair[0]] # remove the Speaker A and Speaker B clusters from this list
cutaways = cutaways[cutaways != anchor_shot_cluster_pair[1]]

scene_start_frame = first_anchor_frame
min_flag = 0

while min_flag == 0:
    try:
        # look at the frame just before the current start of the scene
        if vision_df.loc[scene_start_frame - 1].shot_cluster in cutaways:
            scene_start_frame -= 1
        else:
            min_flag = 1
    except (KeyError, TypeError):  # stop when we run off the beginning of the frame list
        min_flag = 1

scene_end_frame = last_anchor_frame
max_flag = 0
while max_flag == 0:
    try:
        # look at the frame just after the current end of the scene
        if vision_df.loc[scene_end_frame + 1].shot_cluster in cutaways:
            scene_end_frame += 1
        else:
            max_flag = 1
    except (KeyError, TypeError):  # stop when we run off the end of the frame list
        max_flag = 1

expanded_scene_frame_pair = [scene_start_frame, scene_end_frame]

expanded_scene_frame_pair


[483, 527]

Below, we perform the above expansion for all the potential scenes we found.

In [13]:
anchor_search = 6

expanded_scene_frame_pairs = []
for alternating_frame_pair in alternating_scene_frame_pairs:
    expanded_scene_frame_pairs.append(expand_scene(alternating_frame_pair, vision_df, anchor_search_threshold=anchor_search))

In [16]:
expanded_scene_frame_pairs

[[483, 527],
 [1786, 1806],
 [1822, 1842],
 [2909, 2950],
 [3187, 3227],
 [3293, 3339],
 [4369, 4458],
 [4415, 4525],
 [4973, 5050],
 [5064, 5124],
 [5095, 5198]]

## Face Check and Generating Scene Dictionaries

Now that we have a list of *potential* scenes' beginning- and end-frames, there's one final check to make sure it's a two-character dialogue scene. We check to make sure there are indeed faces in the left-anchor and right-anchor shots.

Although we assume the same character is in all the left-anchor shots, the face clustering sometimes may assign different face cluster values for different frames in a left-anchor shot (or right-anchor shot). We take the most prevalent face cluster value, and designate it as the `left_anchor_face_cluster` (or `right_anchor_face_cluster`). We also store the other face cluster values as `matching_left_face_clusters` and `matching_right_face_clusters`. We also store information on the cutaway shots.

We can create a dictionary holding information for each scene.

In [14]:
x = 1
scene_dictionary_list = []
for expanded_frame_pair, scene_anchor_pair in zip(expanded_scene_frame_pairs, alternating_scene_anchor_pairs):
    first_frame = expanded_frame_pair[0]
    last_frame = expanded_frame_pair[1]
    scene_duration = last_frame + 1 - first_frame
    expanded_face_df = face_df.copy()[first_frame - 1:last_frame]
    expanded_vision_df = vision_df.copy()[first_frame - 1:last_frame]
    left_anchor_shot_cluster = expanded_vision_df[(expanded_face_df['p_center_alignment'] == 'left') & (expanded_vision_df.shot_cluster.isin(scene_anchor_pair))].shot_cluster.value_counts().index[0]
    left_anchor_face_cluster, matching_left_face_clusters = left_face_clusters(expanded_face_df)
    right_anchor_face_cluster, matching_right_face_clusters = right_face_clusters(expanded_face_df)
    right_anchor_shot_cluster = expanded_vision_df[(expanded_face_df['p_center_alignment'] == 'right') & (expanded_vision_df.shot_cluster.isin(scene_anchor_pair))].shot_cluster.value_counts().index[0]
    cutaway_shot_clusters = vision_df[first_frame - 1:last_frame].shot_cluster.unique()
    cutaway_shot_clusters = cutaway_shot_clusters[cutaway_shot_clusters != left_anchor_shot_cluster]
    cutaway_shot_clusters = list(cutaway_shot_clusters[cutaway_shot_clusters != right_anchor_shot_cluster])
    if left_anchor_face_cluster and right_anchor_face_cluster:
        scene_dict = {'scene_id': x,
                      'first_frame': first_frame,
                      'last_frame': last_frame,
                      'scene_duration': scene_duration,
                      'left_anchor_shot_cluster': left_anchor_shot_cluster,
                      'left_anchor_face_cluster': left_anchor_face_cluster,
                      'matching_left_face_clusters': matching_left_face_clusters,
                      'right_anchor_shot_cluster': right_anchor_shot_cluster,
                      'right_anchor_face_cluster': right_anchor_face_cluster,
                      'matching_right_face_clusters': matching_right_face_clusters,
                      'cutaway_shot_clusters': cutaway_shot_clusters}
        scene_dictionary_list.append(scene_dict)
        x += 1

# index the scene dictionaries by scene_id, once all scenes have been collected
scene_dictionaries = {}
x = 1
for scene_dict in scene_dictionary_list:
    scene_dictionaries[x] = scene_dict
    x += 1
        
len(scene_dictionaries)

11

Most keys are self-explanatory, but here are some clarifications:
- left_anchor_shot_cluster: the shot cluster value of the left anchor
- left_anchor_face_cluster: the most prevalent face cluster value of the left anchor shot
- matching_left_face_clusters: all other face cluster values of the left anchor shot
- cutaway_shot_clusters: shot cluster values of all shots that aren't anchor shots

In [15]:
scene_dictionaries[1]

{'scene_id': 1,
 'first_frame': 483,
 'last_frame': 527,
 'scene_duration': 45,
 'left_anchor_shot_cluster': 179,
 'left_anchor_face_cluster': 3.0,
 'matching_left_face_clusters': [11.0],
 'right_anchor_shot_cluster': 86,
 'right_anchor_face_cluster': 15.0,
 'matching_right_face_clusters': [5.0],
 'cutaway_shot_clusters': [39]}
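
As one example of how these dictionaries might be used downstream, the sketch below maps a face cluster back to the side of the screen it belongs to. The helper `speaker_side` is hypothetical, and the `scene` literal copies only the keys it needs from the output above.

```python
def speaker_side(scene, face_cluster):
    """Return 'left' or 'right' for a face cluster in this scene, else None.

    Hypothetical helper for illustration only.
    """
    left = [scene['left_anchor_face_cluster']] + scene['matching_left_face_clusters']
    right = [scene['right_anchor_face_cluster']] + scene['matching_right_face_clusters']
    if face_cluster in left:
        return 'left'
    if face_cluster in right:
        return 'right'
    return None


# a subset of scene_dictionaries[1], copied from the output above
scene = {
    'first_frame': 483,
    'last_frame': 527,
    'scene_duration': 45,
    'left_anchor_face_cluster': 3.0,
    'matching_left_face_clusters': [11.0],
    'right_anchor_face_cluster': 15.0,
    'matching_right_face_clusters': [5.0],
}

print(speaker_side(scene, 11.0))  # left
print(speaker_side(scene, 15.0))  # right
print(speaker_side(scene, 99.0))  # None

# scene_duration counts both endpoint frames
print(scene['last_frame'] - scene['first_frame'] + 1 == scene['scene_duration'])  # True
```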