<a href="https://colab.research.google.com/github/michaelachmann/social-media-lab/blob/main/notebooks/2023_11_24_Whisper.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Whisper: Automated Audio Transcription [![DOI](https://zenodo.org/badge/660157642.svg)](https://zenodo.org/badge/latestdoi/660157642)
![Notes on (Computational) Social Media Research Banner](https://raw.githubusercontent.com/michaelachmann/social-media-lab/main/images/banner.png)

## Overview

This Jupyter notebook is a part of the social-media-lab.net project, which is a work-in-progress textbook on computational social media analysis. The notebook is intended for use in my classes.

The **Whisper: Automated Audio Transcription** notebook uses OpenAI's Whisper model (`openai/whisper-large`, configured for German-language audio) to automatically transcribe social media videos.

### Project Information

- Project Website: [social-media-lab.net](https://social-media-lab.net/)
- GitHub Repository: [https://github.com/michaelachmann/social-media-lab](https://github.com/michaelachmann/social-media-lab)

## License Information

This notebook, along with all other notebooks in the project, is licensed under the following terms:

- License: [GNU General Public License version 3.0 (GPL-3.0)](https://www.gnu.org/licenses/gpl-3.0.de.html)
- License File: [LICENSE.md](https://github.com/michaelachmann/social-media-lab/blob/main/LICENSE.md)


## Citation

If you use or reference this notebook in your work, please cite it appropriately. Here is an example of the citation:

```
Michael Achmann. (2023). michaelachmann/social-media-lab: 2023-11-27 (v0.0.5). Zenodo. https://doi.org/10.5281/zenodo.8199901
```

## 1. Import Data

### Import 4CAT

In [None]:
#@markdown Read the exported `csv` file from 4CAT for metadata.

import pandas as pd

four_cat_file_path = "/content/drive/MyDrive/2023-11-24-4CAT-Metadata.csv" #@param {type:"string"}

df = pd.read_csv(four_cat_file_path)

In [None]:
df.head()

Unnamed: 0,id,thread_id,parent_id,body,author,author_fullname,author_avatar_url,timestamp,type,url,image_url,media_url,hashtags,num_likes,num_comments,num_media,location_name,location_latlong,location_city,unix_timestamp
0,CzLE8FCoO-2,CzLE8FCoO-2,CzLE8FCoO-2,Wir haben eine klare Haltung: Wir stehen zu Is...,markus.soeder,Markus Söder,https://scontent-fra3-1.cdninstagram.com/v/t51...,2023-11-03 06:01:22,photo,https://www.instagram.com/p/CzLE8FCoO-2,https://scontent-fra3-1.cdninstagram.com/v/t51...,https://scontent-fra3-1.cdninstagram.com/v/t51...,,1538,167,1,,,,1698991282
1,CzGGK2PIpou,CzGGK2PIpou,CzGGK2PIpou,An Allerseelen und Allerheiligen denke ich bes...,markus.soeder,Markus Söder,https://scontent-fra3-1.cdninstagram.com/v/t51...,2023-11-01 07:35:55,photo,https://www.instagram.com/p/CzGGK2PIpou,https://scontent-fra3-1.cdninstagram.com/v/t51...,https://scontent-fra3-1.cdninstagram.com/v/t51...,"allerheiligen,allerseelen,familie,erinnerung",14364,289,1,,,,1698824155
2,CzF7RDmpDXl,CzF7RDmpDXl,CzF7RDmpDXl,#Allerheiligen und #Allerseelen: Wir halten in...,markus.soeder,Markus Söder,https://scontent-fra3-1.cdninstagram.com/v/t51...,2023-11-01 06:00:39,photo,https://www.instagram.com/p/CzF7RDmpDXl,https://scontent-fra5-1.cdninstagram.com/v/t39...,https://scontent-fra5-1.cdninstagram.com/v/t39...,"Allerheiligen,Allerseelen",1732,30,1,,,,1698818439
3,CzEB00zu65J,CzEB00zu65J,CzEB00zu65J,Wir wollen Bayern in eine gute Zukunft führen....,markus.soeder,Markus Söder,https://scontent-fra3-1.cdninstagram.com/v/t51...,2023-10-31 12:19:29,photo,https://www.instagram.com/p/CzEB00zu65J,https://scontent-fra3-2.cdninstagram.com/v/t51...,https://scontent-fra3-2.cdninstagram.com/v/t51...,"demokratie,landtag,zusammenhalt,modernität,sta...",1415,30,1,,,,1698754769
4,CzD93SEIi-E,CzD93SEIi-E,CzD93SEIi-E,"Mitzuarbeiten für unser Land, Bayern zu entwic...",markus.soeder,Markus Söder,https://scontent-fra3-1.cdninstagram.com/v/t51...,2023-10-31 12:06:23,video,https://www.instagram.com/p/CzD93SEIi-E,https://scontent-fra3-1.cdninstagram.com/v/t51...,https://scontent-fra3-2.cdninstagram.com/o1/v/...,"bayern,landtag",7081,227,1,,,,1698753983


In [None]:
#@title Unzip and Process Videos from 4CAT Export

#@markdown This script will unzip a specified ZIP file, read a metadata JSON file, and then process and relocate video files according to the metadata.

import zipfile
import json
import os

#@markdown Enter the Path to the ZIP File
zip_file_path = '/content/drive/MyDrive/2023-11-24-4CAT-Videos.zip' #@param {type:"string"}

#@markdown Enter the Extraction Folder Path
four_cat_folder = "4cat-export/" #@param {type:"string"}

#@markdown Enter the Destination Folder Path for Videos
video_path = "media/videos" #@param {type:"string"}

# Open the ZIP file
with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
    # Extract all the contents into the specified folder
    zip_ref.extractall(four_cat_folder)

print(f"Files extracted to {four_cat_folder}")

# Specify the path to the metadata JSON file (os.path.join avoids a doubled slash)
metadata_file_path = os.path.join(four_cat_folder, '.metadata.json')

# Open the metadata file and load its content
with open(metadata_file_path, 'r') as file:
    data = json.load(file)

# Check if the destination directory for videos exists
if not os.path.exists(video_path):
    # Create the directory if it does not exist
    os.makedirs(video_path)

# Process each item in the metadata
for item in data.values():
    if item.get('success', False):
        post_id = item['post_ids'][0]
        if len(item['files']) == 1:
            filename = item['files'][0]['filename']
            print(f"Processing Post ID: {post_id}, Filename: {filename}")

            # Full path to the source file
            source_path = os.path.join(four_cat_folder, filename)

            # Full path to the destination file
            destination_path = os.path.join(video_path, f"{post_id}.mp4")

            # Move and rename the file
            os.rename(source_path, destination_path)

Files extracted to 4cat-export/
Processing Post ID: CzD93SEIi-E, Filename: https_scontent_fra3_2_cdninstagram_com_o1_v_t16_f1_m69_gicwmaar_njb7jcyajv76ikk_lqsbpr1aaaf_mp4_efg_.mp4
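
The loop above depends on the structure of the `.metadata.json` file in the 4CAT export. As a minimal sketch, the selection logic can be isolated in a small function and tried on a made-up metadata dictionary. The keys `success`, `post_ids`, `files`, and `filename` are taken from the cell above; the function name `plan_video_moves` and the sample entries are hypothetical and only serve as illustration.

```python
import os


def plan_video_moves(metadata, source_folder, target_folder):
    """Return (source, destination) path pairs for usable video entries.

    Mirrors the loop above: only entries marked as successful that
    contain exactly one file are kept, and each file is renamed after
    the first post ID.
    """
    moves = []
    for item in metadata.values():
        if not item.get("success", False):
            continue
        if len(item["files"]) != 1:
            continue
        post_id = item["post_ids"][0]
        filename = item["files"][0]["filename"]
        moves.append((
            os.path.join(source_folder, filename),
            os.path.join(target_folder, f"{post_id}.mp4"),
        ))
    return moves


# Hypothetical metadata: one usable entry and two entries that are skipped
sample_metadata = {
    "url-1": {"success": True, "post_ids": ["CzD93SEIi-E"], "files": [{"filename": "a.mp4"}]},
    "url-2": {"success": False, "post_ids": ["xyz"], "files": []},
    "url-3": {"success": True, "post_ids": ["abc"], "files": [{"filename": "b.mp4"}, {"filename": "c.mp4"}]},
}

moves = plan_video_moves(sample_metadata, "4cat-export", "media/videos")
print(moves)
```

Separating the planning step from the actual `os.rename` call makes it possible to inspect which posts would be skipped before any file is moved.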


With the next line we save the extracted video files to a new `ZIP` file following our `media/videos/` convention. This will be useful for future tasks and notebooks. Rename the file according to your needs.

In [None]:
!zip -r /content/drive/MyDrive/2023-11-24-4CAT-Images-Clean.zip media

updating: media/ (stored 0%)
  adding: media/videos/ (stored 0%)
  adding: media/videos/CzD93SEIi-E.mp4 (deflated 0%)


Here we add a new column to the metadata table that references the video file. Rows that are not videos receive an empty string.

In [None]:
df['video_file'] = df.apply(lambda row: f"media/videos/{row['id']}.mp4" if row['type'] == "video" else "", axis=1)

In [None]:
df.head()

Unnamed: 0,id,thread_id,parent_id,body,author,author_fullname,author_avatar_url,timestamp,type,url,...,media_url,hashtags,num_likes,num_comments,num_media,location_name,location_latlong,location_city,unix_timestamp,video_file
0,CzLE8FCoO-2,CzLE8FCoO-2,CzLE8FCoO-2,Wir haben eine klare Haltung: Wir stehen zu Is...,markus.soeder,Markus Söder,https://scontent-fra3-1.cdninstagram.com/v/t51...,2023-11-03 06:01:22,photo,https://www.instagram.com/p/CzLE8FCoO-2,...,https://scontent-fra3-1.cdninstagram.com/v/t51...,,1538,167,1,,,,1698991282,
1,CzGGK2PIpou,CzGGK2PIpou,CzGGK2PIpou,An Allerseelen und Allerheiligen denke ich bes...,markus.soeder,Markus Söder,https://scontent-fra3-1.cdninstagram.com/v/t51...,2023-11-01 07:35:55,photo,https://www.instagram.com/p/CzGGK2PIpou,...,https://scontent-fra3-1.cdninstagram.com/v/t51...,"allerheiligen,allerseelen,familie,erinnerung",14364,289,1,,,,1698824155,
2,CzF7RDmpDXl,CzF7RDmpDXl,CzF7RDmpDXl,#Allerheiligen und #Allerseelen: Wir halten in...,markus.soeder,Markus Söder,https://scontent-fra3-1.cdninstagram.com/v/t51...,2023-11-01 06:00:39,photo,https://www.instagram.com/p/CzF7RDmpDXl,...,https://scontent-fra5-1.cdninstagram.com/v/t39...,"Allerheiligen,Allerseelen",1732,30,1,,,,1698818439,
3,CzEB00zu65J,CzEB00zu65J,CzEB00zu65J,Wir wollen Bayern in eine gute Zukunft führen....,markus.soeder,Markus Söder,https://scontent-fra3-1.cdninstagram.com/v/t51...,2023-10-31 12:19:29,photo,https://www.instagram.com/p/CzEB00zu65J,...,https://scontent-fra3-2.cdninstagram.com/v/t51...,"demokratie,landtag,zusammenhalt,modernität,sta...",1415,30,1,,,,1698754769,
4,CzD93SEIi-E,CzD93SEIi-E,CzD93SEIi-E,"Mitzuarbeiten für unser Land, Bayern zu entwic...",markus.soeder,Markus Söder,https://scontent-fra3-1.cdninstagram.com/v/t51...,2023-10-31 12:06:23,video,https://www.instagram.com/p/CzD93SEIi-E,...,https://scontent-fra3-2.cdninstagram.com/o1/v/...,"bayern,landtag",7081,227,1,,,,1698753983,media/videos/CzD93SEIi-E.mp4
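
The `video_file` column is built from the post ID alone, so it may point to files that were never part of the ZIP export. The following sketch shows one way to check this. The helper name `find_missing_files` is hypothetical, and a temporary folder stands in for the real `media/videos` folder.

```python
import os
import tempfile


def find_missing_files(paths):
    """Return every non-empty path that does not exist on disk."""
    return [p for p in paths if p and not os.path.exists(p)]


# Demonstration with a temporary folder instead of the real media/videos folder
with tempfile.TemporaryDirectory() as tmp:
    existing = os.path.join(tmp, "present.mp4")
    open(existing, "wb").close()
    absent = os.path.join(tmp, "absent.mp4")

    # The empty string stands for a row without a video, as in the table above
    missing = find_missing_files([existing, absent, ""])

print(missing)
```

In the notebook, calling `find_missing_files(df['video_file'])` should list the rows whose video is not available locally.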


### Import Stories (Zeeschuimer-F)

In [None]:
import pandas as pd

df_filepath = '/content/drive/MyDrive/2022-11-09-Stories-Exported.csv'
df = pd.read_csv(df_filepath)

In [None]:
!unzip /content/drive/MyDrive/2023-11-09-Story-Media-Export.zip

In [None]:
df['video_file'] = df.apply(lambda row: f"media/videos/{row['Username']}/{row['ID']}.mp4" if row['Type of Content'] == "Video" else "", axis=1)

In [None]:
df[df['video_file'] != ""].head()

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,ID,Time of Posting,Type of Content,video_url,image_url,Username,Video Length (s),Expiration,Caption,Is Verified,Stickers,Accessibility Caption,Attribution URL,video_file,audio_file,duration,sampling_rate
6,6,6,3234541302917417720_327693598,2023-11-12 16:43:05,Video,,,abcnews,15.0,2023-11-13 16:43:05,,True,[],,,media/videos/abcnews/3234541302917417720_32769...,No Audio,15.0,-1.0
10,10,10,3234680898213756200_327693598,2023-11-12 21:20:29,Video,,,abcnews,15.0,2023-11-13 21:20:29,,True,[],,,media/videos/abcnews/3234680898213756200_32769...,No Audio,15.0,-1.0
36,36,36,3235088803252443144_1483455177,2023-11-13 10:51:05,Video,,,rmf24.pl,4.182,2023-11-14 10:51:05,,False,[],,,media/videos/rmf24.pl/3235088803252443144_1483...,3235088803252443144_1483455177.mp3,4.29,44100.0
37,37,37,3235088958903068251_1483455177,2023-11-13 10:51:24,Video,,,rmf24.pl,7.104,2023-11-14 10:51:24,,False,[],,,media/videos/rmf24.pl/3235088958903068251_1483...,3235088958903068251_1483455177.mp3,7.21,44100.0
38,38,38,3235089133277048351_1483455177,2023-11-13 10:51:45,Video,,,rmf24.pl,5.142,2023-11-14 10:51:45,,False,[],,,media/videos/rmf24.pl/3235089133277048351_1483...,3235089133277048351_1483455177.mp3,5.25,44100.0


### Other formats
I can provide more examples for reading metadata and media files collected using `instaloader` and CrowdTangle as needed.

## 2. Extract Audio from Video File

After loading the metadata and media files from Google Drive, we extract the audio from each video file to prepare for the automated transcription.

In [None]:
!pip install -q moviepy

In [None]:
import os

# Set audio directory path
audio_path = "media/audio/"

# Check if the directory exists
if not os.path.exists(audio_path):
    # Create the directory if it does not exist
    os.makedirs(audio_path)

In [None]:
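# Import only what is needed (moviepy 1.x; in moviepy 2.x use `from moviepy import VideoFileClip`)
from moviepy.editor import VideoFileClip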

for index, row in df.iterrows():
    if row['video_file'] != "":
        # Load the video file
        video = VideoFileClip(row['video_file'])
        filename = row['video_file'].split('/')[-1]

        # Extract the audio from the video file
        audio = video.audio

        if audio is not None:
            sampling_rate = audio.fps
            # Swap only the file extension for "mp3"; the rest of the name stays untouched
            new_filename = filename.rsplit(".", 1)[0] + ".mp3"

            # Save the audio to a file
            audio.write_audiofile("{}{}".format(audio_path, new_filename))
        else:
            new_filename = "No Audio"
            sampling_rate = -1

        # Update DataFrame inplace
        df.at[index, 'audio_file'] = new_filename
        df.at[index, 'duration'] = video.duration
        df.at[index, 'sampling_rate'] = sampling_rate

        df.at[index, 'video_file'] = row['video_file'].split('/')[-1]

        # Close the video file
        video.close()


MoviePy - Writing audio in media/audio/CzD93SEIi-E.mp3
MoviePy - Done.




We've extracted the audio track of each video file to an `mp3` file in the `media/audio` folder. Each audio file keeps the name of its video file. We added new columns to the metadata for the audio duration and `sampling_rate`. If a video has no audio track, `sampling_rate` is set to `-1`, which we use to filter the `df` when transcribing the files.
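
The naming convention can be illustrated with a short sketch based on `pathlib`. The helper name `audio_name_for` and the second example filename are hypothetical. Unlike a plain string replacement, `with_suffix` only touches the file extension, so an `mp4` that happens to occur elsewhere in the name is left alone.

```python
from pathlib import Path


def audio_name_for(video_filename):
    """Swap the file extension for .mp3 and keep the rest of the name."""
    return Path(video_filename).with_suffix(".mp3").name


print(audio_name_for("CzD93SEIi-E.mp4"))
print(audio_name_for("media/videos/mp4_clip.mp4"))
```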

In [None]:
df[df['video_file'] != ""].head()

Unnamed: 0,id,thread_id,parent_id,body,author,author_fullname,author_avatar_url,timestamp,type,url,...,num_comments,num_media,location_name,location_latlong,location_city,unix_timestamp,video_file,audio_file,duration,sampling_rate
4,CzD93SEIi-E,CzD93SEIi-E,CzD93SEIi-E,"Mitzuarbeiten für unser Land, Bayern zu entwic...",markus.soeder,Markus Söder,https://scontent-fra3-1.cdninstagram.com/v/t51...,2023-10-31 12:06:23,video,https://www.instagram.com/p/CzD93SEIi-E,...,227,1,,,,1698753983,CzD93SEIi-E.mp4,CzD93SEIi-E.mp3,67.89,44100.0


Let's update the `ZIP` archive to include the audio files.

In [None]:
!zip -r /content/drive/MyDrive/2023-11-24-4CAT-Images-Clean.zip media

updating: media/ (stored 0%)
updating: media/videos/ (stored 0%)
updating: media/videos/CzD93SEIi-E.mp4 (deflated 0%)
  adding: media/audio/ (stored 0%)
  adding: media/audio/CzD93SEIi-E.mp3 (deflated 1%)


Finally, we save the updated metadata file. **When working with stories, change the filename here**, since `four_cat_file_path` is only defined in the 4CAT import.

In [None]:
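# index=False avoids adding another "Unnamed: 0" column each time the file is saved and reloaded
df.to_csv(four_cat_file_path, index=False)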

## 3. Transcriptions using Whisper

> The Whisper model was proposed in Robust Speech Recognition via Large-Scale Weak Supervision by Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, Ilya Sutskever.

> The abstract from the paper is the following:

>>  We study the capabilities of speech processing systems trained simply to predict large amounts of transcripts of audio on the internet. When scaled to 680,000 hours of multilingual and multitask supervision, the resulting models generalize well to standard benchmarks and are often competitive with prior fully supervised results but in a zeroshot transfer setting without the need for any finetuning. When compared to humans, the models approach their accuracy and robustness. We are releasing models and inference code to serve as a foundation for further work on robust speech processing.

-- https://huggingface.co/docs/transformers/model_doc/whisper

In [None]:
!pip install -q transformers

The next code snippet initializes the Whisper model. The `transcribe_audio` function is applied to each row of the dataframe where `sampling_rate` > `0`, i.e. only to rows that reference an audio file. Each audio file is transcribed with Whisper, and the result, a single text string, is saved to the `transcript` column.

**Adjust the `language` parameter according to your needs!** The model is also capable of automated translation: setting `language` to `english` when processing German content results in an English translation of the speech. (Additionally, the `task` argument accepts `translate`.)

In [None]:
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import librosa

# Set device to GPU if available, else use CPU
device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Folder containing the audio files extracted in section 2
audio_folder = "media/audio"

# Load processor and model for multilingual support and move the model to the device
processor = WhisperProcessor.from_pretrained("openai/whisper-large")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large").to(device)

# Function to read, transcribe, and handle longer audio files in different languages
def transcribe_audio(filename, language='german', task='transcribe'):
    try:
        # Load the audio file and resample it to the 16 kHz expected by Whisper
        file_path = f"{audio_folder}/{filename}"
        waveform, original_sample_rate = librosa.load(file_path, sr=None, mono=True)
        waveform_resampled = librosa.resample(waveform, orig_sr=original_sample_rate, target_sr=16000)

        # Get forced decoder IDs for the specified language and task
        forced_decoder_ids = processor.get_decoder_prompt_ids(language=language, task=task)

        # Process the audio file in chunks and transcribe
        transcription = ""
        for i in range(0, len(waveform_resampled), 16000 * 30):  # 30 seconds chunks
            chunk = waveform_resampled[i:i + 16000 * 30]
            input_features = processor(chunk, sampling_rate=16000, return_tensors="pt").input_features.to(device)
            predicted_ids = model.generate(input_features, forced_decoder_ids=forced_decoder_ids)
            chunk_transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
            transcription += " " + chunk_transcription

        return transcription.strip()
    except Exception as e:
        print(f"Error processing file {filename}: {e}")
        return ""


# Filter the DataFrame (sampling_rates < 0 identify items without audio)
filtered_index = df['sampling_rate'] > 0

# Apply the transcription function to each row in the filtered DataFrame
df.loc[filtered_index, 'transcript'] = df.loc[filtered_index, 'audio_file'].apply(transcribe_audio)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
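
The transcription function cuts the resampled waveform into 30-second windows before passing each one to the model. The following sketch reproduces only this arithmetic, without loading any audio or model. The helper name `chunk_bounds` is hypothetical, and the duration of 67.89 seconds is taken from the example video in the metadata table.

```python
def chunk_bounds(num_samples, sample_rate=16000, chunk_seconds=30):
    """Return (start, end) sample indices for consecutive fixed-length chunks."""
    step = sample_rate * chunk_seconds
    return [
        (start, min(start + step, num_samples))
        for start in range(0, num_samples, step)
    ]


# Number of samples of a 67.89 s recording at 16 kHz
num_samples = round(67.89 * 16000)
bounds = chunk_bounds(num_samples)
print(len(bounds), bounds)
```

Because the cuts are made at fixed positions, a word that straddles a boundary may be split between two chunks, which is worth keeping in mind when reading the transcripts.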


In [None]:
df[df['video_file'] != ""].head()

Unnamed: 0,id,thread_id,parent_id,body,author,author_fullname,author_avatar_url,timestamp,type,url,...,num_media,location_name,location_latlong,location_city,unix_timestamp,video_file,audio_file,duration,sampling_rate,transcript
4,CzD93SEIi-E,CzD93SEIi-E,CzD93SEIi-E,"Mitzuarbeiten für unser Land, Bayern zu entwic...",markus.soeder,Markus Söder,https://scontent-fra3-1.cdninstagram.com/v/t51...,2023-10-31 12:06:23,video,https://www.instagram.com/p/CzD93SEIi-E,...,1,,,,1698753983,CzD93SEIi-E.mp4,CzD93SEIi-E.mp3,67.89,44100.0,Ich bitte auf den abgelagerten Vortrag der Maa...


In [None]:
df.loc[4, 'transcript']

'Ich bitte auf den abgelagerten Vortrag der Maaßen-Söder-Entfühlen ein.  Erfüllung meiner Amtspflichten, so wahr mir Gott helfe. Ich schwöre Treue der Verfassung des Freistaates Bayern, Gehorsam den Gesetzen und gewissenhafte Erfüllung meiner Amtspflichten, so wahr mir Gott helfe. Herr Ministerpräsident, ich darf Ihnen im Namen des ganzen Hauses ganz persönlich die herzlichsten Glückwünsche aussprechen und wünsche Ihnen viel Erfolg und gute Nerven auch bei Ihrer Aufgabe. Herzlichen Dank.  Applaus'
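
The transcript contains double spaces, presumably where the per-chunk results were joined. As a small optional clean-up sketch (the helper name `normalize_whitespace` is hypothetical), runs of whitespace can be collapsed before further text analysis:

```python
def normalize_whitespace(text):
    """Collapse runs of whitespace into single spaces and trim both ends."""
    return " ".join(text.split())


raw = "Herzlichen Dank.  Applaus"
print(normalize_whitespace(raw))
```

In the notebook, this could be applied to the whole column with `df['transcript'].dropna().apply(normalize_whitespace)`.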