# Story Beats

## Overview 
The goal of the competition is to predict the story beat of a scene in a given movie. According to Blake Snyder's Save the Cat! categorization, there are 15 possible story beats:

    1. Opening Image
    2. Theme Stated
    3. Setup
    4. Catalyst
    5. Debate
    6. Break Into Two
    7. B Story
    8. Fun and Games
    9. Midpoint
    10. Bad Guys Close In
    11. All is Lost
    12. Dark Night of the Soul
    13. Break Into Three
    14. Finale
    15. Final Image

The student's task is to create an algorithm that predicts one of these story beats based on scene stats and a subtitle file.

## Description 
The submissions will be evaluated using classification accuracy. To create a predictive model, students should use Python and scikit-learn (or any other applicable machine learning toolkit).

In addition to submitting their solution on this site, students are required to provide a link to reproducible code in the form of a Jupyter Notebook on the Course website. This project can be done in pairs.
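
Classification accuracy is the fraction of scenes whose predicted beat matches the true beat. A minimal sketch of the metric in plain Python, using invented toy labels rather than competition data (the helper name `accuracy` is only illustrative):

```python
def accuracy(y_true, y_pred):
    """Return the fraction of positions where the prediction equals the truth."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot score an empty list of predictions")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Toy labels, invented for illustration only.
y_true = ["Opening Image", "Setup", "Catalyst", "Finale"]
y_pred = ["Opening Image", "Setup", "Debate", "Finale"]
print(accuracy(y_true, y_pred))  # 3 of 4 correct -> 0.75
```

This should agree with `sklearn.metrics.accuracy_score`, which the notebook uses later.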

## Authors
This notebook was prepared by a two-person group as project 2 of the Advanced Data Mining course.

## Import

In [26]:
import pandas as pd
import os
import glob
import re
import nltk
import contractions
import chardet
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn import svm
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, SnowballStemmer

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\user\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

Load test files - paths & variables

In [8]:
test_folder = './data/test'
test_features_path = os.path.join(test_folder, "features")

test_features_files = glob.glob(os.path.join(test_features_path, '*.csv'))
test_timestamps_files = glob.glob(os.path.join(test_folder, 'scene_timestamps', '*.csv'))
test_subtitles_files = glob.glob(os.path.join(test_folder, 'subtitles', '*.srt'))

Load training files - paths & variables

In [15]:
randomstate = 40  # fixed seed for reproducibility
train_folder = './data/train'
train_features_path =  os.path.join(train_folder, "features")

train_features_files = glob.glob(os.path.join(train_features_path, '*.csv'))
train_labels_files = glob.glob(os.path.join(train_folder, 'labels', '*.csv'))
train_timestamps_files = glob.glob(os.path.join(train_folder, 'scene_timestamps', '*.csv'))
train_subtitles_files = glob.glob(os.path.join(train_folder, 'subtitles', '*.srt'))

### Utility Functions 

#### 1. Read subtitles

In [16]:
# Handles file-reading problems and invalid characters
def read_srt_file(file_path):
    """
    Reads an SRT file, detecting its encoding and handling any issues with invalid characters.

    This function first detects the file's encoding using the chardet library, and then reads the file using the detected encoding.

    Parameters:
    file_path (str): The path to the SRT file to be read.

    Returns:
    str: The content of the SRT file as a string.
    """
    with open(file_path, 'rb') as file:
        detector = chardet.UniversalDetector()
        for line in file:
            detector.feed(line)
            if detector.done:
                break
        detector.close()

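    # chardet may return None when it cannot guess the encoding; fall back to UTF-8
    encoding = detector.result['encoding'] or 'utf-8'

    # errors='replace' keeps reading past bytes that are invalid in the detected encoding
    with open(file_path, 'r', encoding=encoding, errors='replace') as file:
        content = file.read()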

    return content
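
If `chardet` is unavailable, a dependency-free fallback is to try a short list of encodings in order. This is only a sketch under that assumption, not the approach the notebook uses; the helper name `decode_bytes` and the candidate list are illustrative choices.

```python
def decode_bytes(raw, candidates=("utf-8-sig", "cp1252")):
    """Decode raw subtitle bytes by trying each candidate encoding in turn."""
    for encoding in candidates:
        try:
            # "utf-8-sig" also strips a leading byte order mark, which would
            # otherwise end up glued to the first subtitle ID.
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a code point, so this last resort cannot fail.
    return raw.decode("latin-1")


print(decode_bytes("café".encode("cp1252")))
print(repr(decode_bytes(b"\xef\xbb\xbf1\n00:00:01,000 --> 00:00:02,000")))
```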

#### 2. Removing HTML tags

In [17]:
# Handles badly formatted text (subtitles should not contain tags)
def remove_html_tags(text):
    """
    Removes HTML tags from the given text.

    This function uses BeautifulSoup to parse the input text and extract plain text, removing any HTML tags.

    Parameters:
    text (str): The input text containing HTML tags.

    Returns:
    str: The plain text with HTML tags removed.
    """
    soup = BeautifulSoup(text, 'html.parser')
    plain_text = soup.get_text(separator=' ', strip=True)
    return plain_text

#### 3. Parsing subtitles

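This function performs the following steps for each subtitle entry:
- Parse the entry ID and convert the start and end timestamps to seconds
- Remove HTML tags
- Expand contractions (e.g. "I'll" becomes "I will")
- Lowercase the text and strip punctuation
- Remove stop words
- Lemmatize and stem the remaining words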
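
The cleaning idea can be sketched with the standard library alone. This is a simplified illustration, not the notebook's function: contraction expansion, lemmatization and stemming are omitted because they need external packages, and `TOY_STOP_WORDS` is a tiny invented list standing in for NLTK's English stop words.

```python
import string

# Tiny illustrative stop-word list (an assumption made for this sketch).
TOY_STOP_WORDS = {"i", "the", "a", "is", "will", "to", "it"}

# Translation table that deletes ASCII punctuation and leaves letters untouched.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize_line(text, stop_words=TOY_STOP_WORDS):
    """Lowercase a subtitle line, strip punctuation and drop stop words."""
    text = text.lower()
    text = text.translate(PUNCT_TABLE)
    words = [word for word in text.split() if word not in stop_words]
    return " ".join(words)


print(normalize_line("The flowers, dear... I will be back in 20 minutes!"))
# -> flowers dear be back in 20 minutes
```

Deleting punctuation through a translation table built from `string.punctuation` avoids accidentally removing letters from the text.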

In [18]:
def load_and_parse_subtitles(subtitles_files_list):
    subtitles_df = pd.DataFrame()
    for file_path in subtitles_files_list:

        name = os.path.basename(file_path).split('_')[0]  # movie ID; basename works with both path separators
        srt_content = read_srt_file(file_path)

        subtitles = re.split(r'\n\n', srt_content.strip())[:]
        data = {'id': [], 'start': [], 'end': [], 'text': []}

        for subtitle in subtitles:
            start = ""
            end = ""
            lines = subtitle.strip().split('\n')
            
            try:
                subtitle_id = int(lines[0])
            except ValueError:
                print(f"Error parsing int: {file_path}. subtitle: {subtitle}, int: {lines}")
                continue

            try:
                time_pattern = r'(\d+:\d+:\d+[,.]\d+) --> (\d+:\d+:\d+[,.]\d+)'
                start, end = re.findall(time_pattern, lines[1])[0]

                # Convert HH:MM:SS,mmm to seconds
                start_times = start.split(":")
                start_seconds = str(start_times[-1]).replace(",", ".").split(".")
                start = int(start_times[0])*60*60+int(start_times[1])*60+int(start_seconds[0]) + int(start_seconds[1])/1000

                end_times = end.split(":")
                end_seconds = str(end_times[-1]).replace(",", ".").split(".")
                end = int(end_times[0])*60*60 + int(end_times[1]) * 60 + int(end_seconds[0]) + int(end_seconds[1])/1000
            except (ValueError, IndexError):
                # IndexError covers entries with no timestamp line or no match
                print(f"file_name: {file_path}. start: {start}, end: {end}")
                continue

            # Append only after both the ID and the timestamps parsed, so the columns stay aligned
            data['id'].append(subtitle_id)
            data['start'].append(start)
            data['end'].append(end)
            
            text = ' '.join(lines[2:])
            
            # Remove unwanted characters.
            # NOTE: this string contains the letter 'i', so every "i" is stripped from the
            # text as well (hence tokens such as "wll" and "gong" in the outputs below).
            characters_to_remove = ':.,<>[];:\'"{}-i/()?!@#&*'
            sno = SnowballStemmer('english')
            lemmatizer = WordNetLemmatizer()

            text = remove_html_tags(text)
            text = contractions.fix(text)  # expand contractions, e.g. I'll -> I will
            text = text.lower()

            for char in characters_to_remove:
                text = text.replace(char, '')

            stop_words = set(stopwords.words('english'))
            # NOTE: "\a" is the bell escape in a normal string, so this entry never matches
            # the literal SRT tag \an8; a raw string (r"\an8") would be needed for that.
            stop_words.add("\an8")

            words = text.split()
            filtered_words = [word for word in words if word not in stop_words]  # drop stop words
            lemmatized_words = [lemmatizer.lemmatize(word) for word in filtered_words]
            stemmed_words = [sno.stem(word) for word in lemmatized_words]
            text = ' '.join(stemmed_words)
            
            data['text'].append(text)
            data['movie_id'] = name

        df_text = pd.DataFrame(data)
        subtitles_df = pd.concat([subtitles_df, df_text], ignore_index=True)
        
    return subtitles_df
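
The timestamp handling inside the loop can also be isolated into small helpers, which makes it easier to test on its own. The sketch below is an illustrative alternative, not the notebook's code; the names `srt_time_to_seconds` and `parse_srt_block` are hypothetical, and like the original it assumes three-digit milliseconds.

```python
import re

TIME_RE = re.compile(r"(\d+):(\d+):(\d+)[,.](\d+)")


def srt_time_to_seconds(stamp):
    """Convert 'HH:MM:SS,mmm' (or 'HH:MM:SS.mmm') to seconds as a float."""
    match = TIME_RE.fullmatch(stamp.strip())
    if match is None:
        raise ValueError(f"not an SRT timestamp: {stamp!r}")
    hours, minutes, seconds, millis = (int(part) for part in match.groups())
    return hours * 3600 + minutes * 60 + seconds + millis / 1000


def parse_srt_block(block):
    """Return (id, start, end, text) for one subtitle entry, or None if malformed."""
    lines = block.strip().split("\n")
    if len(lines) < 2:
        return None
    try:
        cue_id = int(lines[0])
        # Unpacking raises ValueError unless exactly one '-->' is present.
        start_raw, end_raw = lines[1].split("-->")
        start = srt_time_to_seconds(start_raw)
        end = srt_time_to_seconds(end_raw)
    except ValueError:
        return None
    return cue_id, start, end, " ".join(lines[2:])


good = "1\n00:02:30,320 --> 00:02:32,926\nWe'd better take this along."
bad = "Those people are\ntrying to kill us!"
print(parse_srt_block(good))
print(parse_srt_block(bad))  # malformed entry -> None
```

Returning `None` for malformed entries lets the caller skip them without leaving the ID, start and end columns at different lengths.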

#### 4. Combine dataframes

In [19]:
def filter_join_text_df(combined_df, subtitles_df, check):
    result_df_func = pd.merge(combined_df, subtitles_df, left_on=['id'], right_on=['movie_id'], how='left')
    condition_func = (
        (result_df_func['start_x'] <= result_df_func['start_y']) & 
        (result_df_func['start_y'] <= result_df_func['end_x'])
    )
    combined_df = result_df_func.copy()
    combined_df.loc[~condition_func, ['text', 'start_y', 'end_y', 'id_y']] = '', '', '',''
    combined_df = combined_df.drop_duplicates()

    combined_df = combined_df.drop(["id_y"], axis=1)
    combined_df = combined_df.rename(columns={'id_x': 'id', 'start_x': 'start', 'end_x': 'end', 'start_y': 'start_sub', 'end_y': 'end_sub'})


    # Columns that keep their first value within each (index, movie_id) group
    first_columns = ['s_dur', 'n_shots', 'ava_shot_dur', 'rel_id_loc', 'rel_t_loc',
                     'ava_char_score', 'is_prot_appear', 'id']
    if check == 0:
        first_columns.append('label_movie')  # labels exist only in the training data
    first_columns += ['start', 'end', 'scene_order', 'start_sub', 'end_sub']

    agg_spec = {column: 'first' for column in first_columns}
    agg_spec['text'] = ' '.join  # concatenate all subtitle lines of a scene

    combined_df = combined_df.groupby(['index', 'movie_id']).agg(agg_spec).reset_index()

    # sorted as in the sample submission
    #combined_df_func = combined_df.sort_values(by=['movie_id', 'index'])
    # this ordering is better for comparing labels
    combined_df_func = combined_df.sort_values(by=['index', 'movie_id'])
    combined_df_func.loc[combined_df_func['text'] == '', 'text'] = 'None'
    combined_df = combined_df_func.dropna()
    return combined_df
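
The function above attaches a subtitle to a scene when the subtitle's start time falls between the scene's start and end. The same rule can be sketched without a merge by binary-searching the scene start times. This is an illustrative alternative under the assumption that scenes are sorted by start and do not overlap; the name `assign_subtitles` and the sample values are made up.

```python
from bisect import bisect_right


def assign_subtitles(scenes, subtitles):
    """Join subtitle texts per scene.

    scenes: list of (start, end) tuples sorted by start, assumed non-overlapping.
    subtitles: list of (start, text) tuples.
    """
    scene_starts = [start for start, _ in scenes]
    texts = [[] for _ in scenes]
    for sub_start, text in subtitles:
        # Index of the last scene that starts at or before the subtitle.
        i = bisect_right(scene_starts, sub_start) - 1
        if i >= 0 and sub_start <= scenes[i][1]:
            texts[i].append(text)
    return [" ".join(parts) for parts in texts]


scenes = [(0.0, 213.171), (218.76, 228.854)]
subtitles = [
    (150.32, "better take this along"),
    (215.0, "falls between two scenes"),
    (220.1, "look in the desk"),
]
print(assign_subtitles(scenes, subtitles))
# -> ['better take this along', 'look in the desk']
```

A subtitle that starts in the gap between two scenes is dropped, which matches the condition used in the merge-based version.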

#### Load main data 
Load CSV files & normalize dataframes

In [20]:
def load_and_combine(files_list):
    combined_df = pd.DataFrame()
    for file in files_list:
        # Movie ID is the file-name prefix; avoid shadowing the built-in id()
        movie_id = os.path.basename(file).split('_')[0]
        df = pd.read_csv(file)

        df = df.rename(columns={'Unnamed: 0': 'index'})
        if 'labels' in file:
            df = df.rename(columns={'0': 'label_movie'})

        if 'timestamps' in file:
            df['scene_order'] = range(1, len(df) + 1)

        df['id'] = movie_id
        combined_df = pd.concat([combined_df, df], ignore_index=True)
    return combined_df

def normalize_df(X_combined_df):
    column_names = X_combined_df.columns
    scaler = MinMaxScaler()
    X_combined_df = scaler.fit_transform(X_combined_df)
    X_combined_df = pd.DataFrame(data=X_combined_df, columns=column_names)
    return X_combined_df
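
Note that `normalize_df` fits a new `MinMaxScaler` on whatever frame it receives, so training and test data end up scaled with different minima and maxima. A common alternative is to fit the scaler once on the training data and reuse it for the test data. A minimal sketch with invented numbers:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy feature matrices, invented for illustration.
X_train = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
X_test = np.array([[2.5, 15.0], [20.0, 30.0]])

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn min/max from the training data only
X_test_scaled = scaler.transform(X_test)        # apply the same min/max to the test data

print(X_test_scaled)
# [[0.25 0.25]
#  [2.   1.  ]]
```

Test values outside the training range map outside [0, 1], as the second row shows; that is expected when the scaler is reused.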

## Core Workflow

In [29]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\user\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

#### 1. Load training & test data

In [35]:
###
# First, the training data is loaded and combined into one dataframe
###

train_features_df = load_and_combine(train_features_files)
train_labels_df = load_and_combine(train_labels_files)
train_timestamps_df = load_and_combine(train_timestamps_files)
train_subtitles_df = load_and_parse_subtitles(train_subtitles_files)

temp_df = pd.merge(train_features_df, train_timestamps_df, on=['index', 'id'])
train_movie_scene_df = pd.merge(temp_df, train_labels_df, on=['index', 'id'])
train_movie_scene_df = filter_join_text_df(train_movie_scene_df, train_subtitles_df, 0)
train_movie_scene_df = train_movie_scene_df.reset_index(drop=True)

Error parsing int: ./data/train\subtitles\tt0097576_indiana jones and the last crusade.srt. subtitle: Those people are
trying to kill us!, int: ['Those people are', 'trying to kill us!']




In [33]:
display(train_movie_scene_df.head())

Unnamed: 0,index,movie_id,s_dur,n_shots,ava_shot_dur,rel_id_loc,rel_t_loc,ava_char_score,is_prot_appear,id,label_movie,start,end,scene_order,start_sub,end_sub,text
0,0,tt0037884,218.718333,6,36.453056,0.0,0.0,1295.579078,1,tt0037884,Opening Image,0.0,213.171,1,150.32,152.926,would better take th along gong cold farm ok m...
1,0,tt0108160,75.992333,1,75.992333,0.0,0.0,1334.892862,1,tt0108160,Opening Image,0.0,0.0,1,,,
2,0,tt0109830,108.525333,1,108.525333,0.0,0.020719,1622.83302,1,tt0109830,Opening Image,176.718,176.718,1,,,
3,0,tt0119822,18.893333,2,9.446667,0.0,0.003049,895.32838,1,tt0119822,Opening Image,25.359,36.245,1,28.779,32.617,gong get flower dear wll back n 20 mnute tulp ...
4,1,tt0037884,21.980333,3,7.326778,0.008403,0.036189,1295.579078,1,tt0037884,Opening Image,218.76,228.854,2,,,sure n closet cannot fnd well look desk


In [39]:
###
# Here the test data is loaded and combined into one dataframe
###

test_features_df = load_and_combine(test_features_files)
test_timestamps_df = load_and_combine(test_timestamps_files)
test_subtitles_df = load_and_parse_subtitles(test_subtitles_files)

test_movie_scene_df = pd.merge(test_features_df, test_timestamps_df, on=['index', 'id'])
test_movie_scene_df = filter_join_text_df(test_movie_scene_df, test_subtitles_df, 1)
test_movie_scene_df = test_movie_scene_df.reset_index(drop=True)

# Save to CSV
test_movie_scene_df.to_csv('test_movie_scene_df.csv', index=False)



Error parsing int: ./data/test\subtitles\tt1285016_the social network.srt. subtitle: , int: ['']




In [37]:
display(test_movie_scene_df)

Unnamed: 0,index,movie_id,s_dur,n_shots,ava_shot_dur,rel_id_loc,rel_t_loc,ava_char_score,is_prot_appear,id,start,end,scene_order,start_sub,end_sub,text
0,0,tt1142988,28.153333,3,9.384444,0.000000,0.011692,2727.048221,1,tt1142988,67.317,90.507,1,,,alert okay well tell wll n 15 mnute stop arg...
1,0,tt1285016,225.934333,45,5.020763,0.000000,0.000000,867.546640,1,tt1285016,0.000,216.633,1,2.829,8.829,subttl ax sames cat dd know peopl wth genus q ...
2,0,tt1568346,60.143333,6,10.023889,0.000000,0.000000,175.028161,0,tt1568346,0.000,45.879,1,30.68,33.16,knd know whte frame dark postmark last tme note
3,0,tt1632708,71.988333,29,2.482356,0.000000,0.000000,1440.813107,1,tt1632708,0.000,70.070,1,15.181,17.65,okay let u see could move th get rd kll knd fr...
4,1,tt0822832,32.240333,11,2.930939,0.008333,0.007405,2569.410356,1,tt0822832,51.343,77.035,1,,,\an8 cours exper kds even crazi hound chasng...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2465,309,tt0988595,42.375333,9,4.708370,0.980952,0.906111,993.294659,0,tt0988595,6033.945,6064.767,275,,,um mad honor um lve n wllamsburg wth roommat ...
2466,310,tt0988595,19.227333,1,19.227333,0.984127,0.912490,1045.822997,1,tt0988595,6066.936,6066.936,276,,,
2467,311,tt0988595,13.180333,4,3.295083,0.987302,0.915389,649.656064,0,tt0988595,6086.205,6095.965,277,,,
2468,312,tt0988595,7.173333,3,2.391111,0.990476,0.917377,993.294659,0,tt0988595,6099.427,6103.806,278,,,


#### 2. Text processing
Convert the cleaned subtitle text into numeric TF-IDF features that scikit-learn models can use.

In [40]:
texts = train_movie_scene_df['text']

# Fit a TfidfVectorizer on the training text
tfidf_vectorizer = TfidfVectorizer(max_features=100)  # keep only the 100 most frequent terms
tfidf_tokens = tfidf_vectorizer.fit_transform(texts)  # text column only


tfidf_texts = pd.DataFrame(tfidf_tokens.toarray(), columns= tfidf_vectorizer.get_feature_names_out().tolist())  # dense dataframe
tfidf_texts = tfidf_texts.reset_index(drop=True)  # drop=True avoids adding a second 'index' column
train_movie_scene_subtitles_df = pd.concat([train_movie_scene_df, tfidf_texts], axis=1)  # join TF-IDF features with the other features

display(train_movie_scene_subtitles_df)

Unnamed: 0,index,movie_id,s_dur,n_shots,ava_shot_dur,rel_id_loc,rel_t_loc,ava_char_score,is_prot_appear,id,...,wat,way,well,wll,work,would,wth,yeah,year,yes
0,0,tt0037884,218.718333,6,36.453056,0.000000,0.000000,1295.579078,1,tt0037884,...,0.0,0.0,0.140783,0.368224,0.0,0.294782,0.0,0.144425,0.0,0.0
1,0,tt0108160,75.992333,1,75.992333,0.000000,0.000000,1334.892862,1,tt0108160,...,0.0,0.0,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.0,0.0
2,0,tt0109830,108.525333,1,108.525333,0.000000,0.020719,1622.833020,1,tt0109830,...,0.0,0.0,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.0,0.0
3,0,tt0119822,18.893333,2,9.446667,0.000000,0.003049,895.328380,1,tt0119822,...,0.0,0.0,0.000000,0.434463,0.0,0.000000,0.0,0.000000,0.0,0.0
4,1,tt0037884,21.980333,3,7.326778,0.008403,0.036189,1295.579078,1,tt0037884,...,0.0,0.0,0.420542,0.000000,0.0,0.000000,0.0,0.000000,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3203,305,tt0106918,57.390333,13,4.414641,0.980707,0.952509,1154.750154,1,tt0106918,...,0.0,0.0,0.276736,0.000000,0.0,0.289726,0.0,0.000000,0.0,0.0
3204,306,tt0106918,46.922333,10,4.692233,0.983923,0.958702,1154.750154,1,tt0106918,...,0.0,0.0,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.0,0.0
3205,307,tt0106918,19.895333,6,3.315889,0.987138,0.963766,1154.750154,1,tt0106918,...,0.0,0.0,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.0,0.0
3206,308,tt0106918,47.672333,15,3.178156,0.990354,0.965916,1154.750154,1,tt0106918,...,0.0,0.0,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.0,0.0


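#### 3. Training on a hold-out split
Train the model on 70% of the **train** dataset and evaluate it on the remaining 30%. The split below is random: `train_test_split` is called without `stratify`, so class proportions are not guaranteed to match between the two parts.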
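
If class proportions should be preserved in the hold-out split, `train_test_split` accepts a `stratify` argument. A minimal sketch on invented toy labels (not the competition data):

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy data: 16 scenes of one beat and 4 of another, invented for illustration.
X = [[i] for i in range(20)]
y = ["Finale"] * 16 + ["Catalyst"] * 4

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Both parts keep the 4:1 class ratio of the full data.
print(Counter(y_tr))
print(Counter(y_te))
```

With rare beats such as those that have only a dozen scenes in the evaluation split, stratification makes the per-class scores less dependent on the random seed.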

In [42]:
columns_to_drop = ["index", "movie_id", "id", "label_movie", "start_sub", "end_sub", "text", "label_encoded"]
label_encoder = LabelEncoder()
train_movie_scene_subtitles_df['label_encoded'] = label_encoder.fit_transform(train_movie_scene_subtitles_df['label_movie'])
#train_movie_scene_subtitles_df.head()

X = train_movie_scene_subtitles_df.drop(columns_to_drop, axis=1)
X_norm = normalize_df(X)
y = train_movie_scene_subtitles_df['label_encoded']

X_train, X_test, y_train, y_test = train_test_split(X_norm, y, test_size=0.3, random_state=23)

criterion_options = ['entropy', 'log_loss']
param_grid = {
    'n_estimators': [150, 200, 50],
    'max_depth': [None, 50, 80],
    'min_samples_split': [10, 5, 15, 2],
    'min_samples_leaf': [1, 3, 5],
    'criterion': criterion_options,
    'max_features': ['sqrt', 'log2', None],
}

# grid_search = GridSearchCV(RandomForestClassifier(random_state=randomstate),
#                             param_grid, cv=5, scoring='accuracy', n_jobs=-1)
# grid_search.fit(X_train, y_train)
# best_params = grid_search.best_params_

best_params={'criterion': 'entropy', 'max_depth': None, 'max_features': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 200}
model = RandomForestClassifier(random_state=randomstate, **best_params)
model.fit(X_train, y_train)
# log_loss: 90,1,11,300
# entropy: 90, 1, 10, 250

y_true, y_pred = y_test, model.predict(X_test)
print(classification_report(y_true, y_pred, zero_division=0))

accuracy = accuracy_score(y_test, y_pred)
print(f"Model accuracy: {accuracy}")

              precision    recall  f1-score   support

           0       0.75      0.29      0.41        21
           1       0.54      0.73      0.62       131
           2       0.61      0.76      0.68       103
           3       0.55      0.34      0.42        35
           4       0.43      0.37      0.40        27
           5       0.00      0.00      0.00        22
           6       0.47      0.72      0.57        32
           7       0.52      0.62      0.57        61
           8       0.79      0.52      0.63        21
           9       0.84      0.95      0.89       190
          10       0.47      0.24      0.32       118
          11       0.62      0.34      0.44        29
          12       0.62      0.81      0.70        16
          13       0.77      0.82      0.79       145
          14       1.00      0.08      0.15        12

    accuracy                           0.65       963
   macro avg       0.60      0.51      0.51       963
weighted avg       0.63   
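
The row labels 0 to 14 in the report are the integer codes produced by `LabelEncoder`, which numbers classes in sorted label order. The sketch below shows how to recover the mapping on a few invented labels; in the notebook the same information is available from `label_encoder.classes_`.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels, invented for illustration.
labels = ["Opening Image", "Finale", "Catalyst", "Final Image", "Finale"]

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)

# classes_ is sorted, and the position of each label is its integer code.
for code, name in enumerate(encoder.classes_):
    print(code, name)
print(list(codes))
```

Passing `target_names=label_encoder.classes_` to `classification_report` is one way to print beat names instead of codes, assuming every class appears in the evaluated data.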

In [43]:
print(f"best params={best_params}")

best params={'criterion': 'entropy', 'max_depth': None, 'max_features': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 200}
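
The parameters above were hard-coded after an earlier search, and the search call itself is commented out in the cell. For reference, a much smaller grid search on synthetic data looks like the sketch below; the dataset, grid values and fold count are assumptions chosen to keep the example fast, not the settings used for the competition.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic three-class problem standing in for the scene features.
X, y = make_classification(
    n_samples=120, n_features=8, n_informative=4, n_classes=3, random_state=0
)

small_grid = {"n_estimators": [10, 20], "min_samples_leaf": [1, 3]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    small_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

`best_params_` can then be unpacked into a new classifier with `**search.best_params_`, the same pattern the notebook uses with `best_params`.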


#### 4. Train on the full train dataset

In [44]:
X_train = train_movie_scene_subtitles_df.drop(columns_to_drop, axis=1)
X_train_norm = normalize_df(X_train)
y_train = train_movie_scene_subtitles_df['label_encoded']

model = RandomForestClassifier(random_state=randomstate, **best_params)
model.fit(X_train_norm, y_train)

#### 5. Score prediction
Scene prediction for **test** dataset

In [45]:
test_texts = test_movie_scene_df['text']
test_tfidf_tokens = tfidf_vectorizer.transform(test_texts)  # text column only

test_tfidf_texts = pd.DataFrame(test_tfidf_tokens.toarray(), columns= tfidf_vectorizer.get_feature_names_out().tolist())  # dense dataframe
test_tfidf_texts = test_tfidf_texts.reset_index(drop=True)

test_movie_scene_subtitles_df = pd.concat([test_movie_scene_df, test_tfidf_texts], axis=1)  # join TF-IDF features with the other features

X_test = test_movie_scene_subtitles_df.drop(["index", "movie_id", "id", "start_sub", "end_sub", "text"], axis=1)
X_test_norm = normalize_df(X_test)
#display(X_test_norm)
y_pred = model.predict(X_test_norm)

predicted_labels = label_encoder.inverse_transform(y_pred)
test_movie_scene_subtitles_df['Id'] = test_movie_scene_subtitles_df['movie_id'].astype(str) + '_' + test_movie_scene_subtitles_df['index'].astype(str)

result_df = pd.DataFrame({
    'Id': test_movie_scene_subtitles_df['Id'],
    'Label': predicted_labels
})

In [46]:
display(result_df.head())

Unnamed: 0,Id,Label
0,tt1142988_0,Opening Image
1,tt1285016_0,Opening Image
2,tt1568346_0,Opening Image
3,tt1632708_0,Opening Image
4,tt0822832_1,Opening Image


### Save score

Saving the final dataframe as an output file with a structure suitable for a Kaggle competition

In [47]:
result_df.to_csv('submission.csv', index=False)

In [48]:
###
# Display summary statistics for the submission file
###

label_counts = result_df['Label'].value_counts().sort_index()
print(f"predicted: {label_counts}")
print(f"Labels used: {len(label_counts)} from 15")

predicted: Label
All Is Lost                24
B Story                   388
Bad Guys Close In         360
Break into Three           86
Break into Two             58
Catalyst                    6
Dark Night of the Soul    191
Debate                    228
Final Image                67
Finale                    448
Fun and Games             168
Midpoint                   36
Opening Image              64
Set-Up                    342
Theme Stated                4
Name: count, dtype: int64
Labels used: 15 from 15
