# Script pipeline

## Before starting

Collect your videos in MP4 format, assign each one a unique commercial ID (each file name must match the corresponding `commercial_id` value), and put them in the `videos` folder.

Then fill in the CSV file `initial_data/commercials_initial_metadata.csv` with the metadata of each video:

- `commercial_id`
- `title`
- `brand`
- `nice_class`
- `product_type_key`
- `year`
- `lustrum`
- `source`
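
Because every later step relies on file names matching `commercial_id`, a quick consistency check before running the pipeline can save time. The following is a minimal sketch, not part of the pipeline scripts: the function name `find_mismatches` is hypothetical, and it assumes a comma-delimited, UTF-8 CSV and lowercase `.mp4` extensions.

```python
import csv
from pathlib import Path


def find_mismatches(csv_path, videos_dir):
    """Compare `commercial_id` values in the CSV with the MP4 file names.

    Returns two sorted lists: IDs that have no video file, and video
    files that have no metadata row.
    """
    with open(csv_path, newline="", encoding="utf-8") as handle:
        ids = {row["commercial_id"].strip() for row in csv.DictReader(handle)}
    # The file name without its extension is expected to be the commercial ID.
    stems = {path.stem for path in Path(videos_dir).glob("*.mp4")}
    return sorted(ids - stems), sorted(stems - ids)


if __name__ == "__main__":
    metadata = Path("initial_data/commercials_initial_metadata.csv")
    if metadata.exists():
        missing_videos, missing_rows = find_mismatches(metadata, "videos")
        print("IDs without a video:", missing_videos)
        print("Videos without metadata:", missing_rows)
```

Both lists should be empty before you move on to step 1.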

## 1. Color and Thumb Extraction

- Analyze each video and collect new data (`avg_frame_rate`, `aspect_ratio`), add them to the metadata from `commercials_initial_metadata.csv`, and save the result as `general/commercials.csv`.
- Split each video into “scenes”.
- For each scene, extract the representative median frame and save it as a small WEBP image (180 px high) in a folder named after the `commercial_id` inside the `thumbnails` folder.
- From each median frame, extract a colour palette of at most 32 colours and save it in a CSV file named
  `general/commercial_palettes.csv` with the following fields:
    - `commercial_id`
    - `scene`: the progressive number ID of the scene.
    - `scene_size`: the scene duration measured in frames.
    - `start_frame`: the initial frame number of the scene.
    - `end_frame`: the final frame number of the scene.
    - `hex_code`: the hexadecimal representation of the original colour extracted.
    - `frequency_within_the_scene`: the frequency of the original colour in the scene.
    - `closest_color_ext_pal`: the closest colour from the extended palette.
    - `closest_color_ess_pal`: the closest colour from the essential palette.
    - `closest_color_bas_pal`: the closest colour from the basic palette.
    - `scene_size_norm`: the normalized scene size.
    - `frequency_within_the_commercial`: the frequency of the original colour within the video (`frequency_within_the_scene` × `scene_size_norm`).
    - `tf`: the term frequency of the `closest_color_ext_pal` value in the video.
- Save the information about each scene detected in each video to `general/scenes.csv`.
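
To make the relation between `frequency_within_the_scene`, `scene_size_norm` and `frequency_within_the_commercial` concrete, here is a small worked sketch. It assumes that `scene_size_norm` is the scene size divided by the total number of frames across all scenes of the video; the script may normalize differently. The function name and the input structure are hypothetical.

```python
def commercial_frequencies(scenes):
    """Weight each colour's in-scene frequency by the normalized scene size.

    `scenes` is a list of dicts with a `scene_size` (in frames) and a
    `colours` mapping of hex code -> frequency_within_the_scene.
    """
    total_frames = sum(scene["scene_size"] for scene in scenes)
    rows = []
    for number, scene in enumerate(scenes, start=1):
        # Assumption: normalize by the total frame count of the video.
        scene_size_norm = scene["scene_size"] / total_frames
        for hex_code, frequency in scene["colours"].items():
            rows.append({
                "scene": number,
                "hex_code": hex_code,
                "scene_size_norm": scene_size_norm,
                "frequency_within_the_commercial": frequency * scene_size_norm,
            })
    return rows


example = [
    {"scene_size": 30, "colours": {"#ff0000": 0.6, "#ffffff": 0.4}},
    {"scene_size": 90, "colours": {"#0000ff": 1.0}},
]
for row in commercial_frequencies(example):
    print(row)
```

Under this assumption, a colour that fills a long scene weighs more than one that fills a short scene, and the values for one video sum to 1 when the in-scene frequencies of each scene do.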

```python
%run '1_color_and_thumb_extraction.py'
```
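
The `closest_color_*_pal` fields map each extracted colour to its nearest entry in a reference palette. The sketch below illustrates one simple way to do such a lookup, using squared Euclidean distance in RGB. This is an assumption for illustration only: the script may use a different colour space or distance measure, and `BASIC_PALETTE` here is a made-up stand-in, not the project's basic palette.

```python
def hex_to_rgb(hex_code):
    """Convert '#rrggbb' (or 'rrggbb') to an (r, g, b) tuple of ints."""
    hex_code = hex_code.lstrip("#")
    return tuple(int(hex_code[i:i + 2], 16) for i in (0, 2, 4))


def closest_colour(hex_code, palette):
    """Return the palette name whose colour is nearest to `hex_code` in RGB."""
    target = hex_to_rgb(hex_code)

    def squared_distance(name):
        reference = hex_to_rgb(palette[name])
        return sum((a - b) ** 2 for a, b in zip(target, reference))

    return min(palette, key=squared_distance)


# Hypothetical palette, for illustration only.
BASIC_PALETTE = {
    "black": "#000000",
    "white": "#ffffff",
    "red": "#ff0000",
    "green": "#00ff00",
    "blue": "#0000ff",
}

print(closest_colour("#e01020", BASIC_PALETTE))
```

RGB distance is easy to follow but does not match human colour perception well, which is why a perceptual space such as CIELAB is a common alternative.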

## 2. Reference Palette Idf Calculation

Calculate the IDF (inverse document frequency) of each colour for each reference palette and save the results as:
- `colors/basic_palette_idfs.csv`
- `colors/essential_palette_idfs.csv`
- `colors/extended_palette_idfs.csv`
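
As a rough illustration of what an IDF value expresses here, the sketch below treats each commercial as a document and each palette colour as a term, and uses the plain formula `log(N / df)`. Both the formula (the script may apply smoothing or a different log base) and the function name `palette_idfs` are assumptions.

```python
import math
from collections import Counter


def palette_idfs(colours_by_commercial):
    """Compute log(N / df) for every colour.

    `colours_by_commercial` maps a commercial ID to the list of palette
    colours found in that commercial.
    """
    n_docs = len(colours_by_commercial)
    doc_freq = Counter()
    for colours in colours_by_commercial.values():
        # Count each colour once per commercial, however often it occurs.
        doc_freq.update(set(colours))
    return {colour: math.log(n_docs / df) for colour, df in doc_freq.items()}


example = {
    "c001": ["red", "white", "red"],
    "c002": ["white", "blue"],
    "c003": ["white"],
}
for colour, idf in sorted(palette_idfs(example).items()):
    print(colour, round(idf, 3))
```

A colour that appears in every commercial (here `white`) gets an IDF of 0, while rarer colours get higher values.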

```python
%run '2_ref_palette_idf_calculation.py'
```

## 3. Audio Feature Extraction

Export the 19 audio features of each video to the `audio/features` folder.

```python
%run '3_audio_feature_extraction.py'
```

## 4. Audio Transcription and Lemmatization

- Detect the presence of “Speech” in each video, save the result as `audio/speech_class_confidence_score.csv`, and transcribe any speech found. All transcriptions are saved in `text/transcriptions.csv`.
- Lemmatize each transcription and save the lemmas (alphabetically ordered by video) as `text/lemmas.csv`.
- Calculate the tf-idf values for each lemma and update `text/lemmas.csv`.
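
For readers unfamiliar with tf-idf, here is a minimal sketch of the last step, assuming tf is the lemma count divided by the number of lemmas in the video and idf is `log(N / df)` over all videos. The script may use a different weighting, and the function name `tf_idf` and the sample lemmas are hypothetical.

```python
import math
from collections import Counter


def tf_idf(lemmas_by_video):
    """Return {video_id: {lemma: tf-idf}} with lemmas in alphabetical order."""
    n_videos = len(lemmas_by_video)
    doc_freq = Counter()
    for lemmas in lemmas_by_video.values():
        # Each lemma counts once per video for the document frequency.
        doc_freq.update(set(lemmas))

    scores = {}
    for video_id, lemmas in lemmas_by_video.items():
        counts = Counter(lemmas)
        total = len(lemmas)
        scores[video_id] = {
            lemma: (count / total) * math.log(n_videos / doc_freq[lemma])
            for lemma, count in sorted(counts.items())
        }
    return scores


example = {
    "c001": ["buy", "car", "car"],
    "c002": ["buy", "soap"],
}
for video_id, values in tf_idf(example).items():
    print(video_id, values)
```

With this weighting, a lemma shared by every video (here `buy`) scores 0, so the highest values point to the words that distinguish one commercial from the others.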

```python
%run '4_audio_transcription_and_lemmatization.py'
```

## Finally

You can use `text/transcriptions.csv` as input for further text analysis (e.g. LIWC analysis).