<a href="https://colab.research.google.com/github/michaelachmann/social-media-lab/blob/main/notebooks/2024_01_12_Visual_BERTopic.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Visual BERTopic [![DOI](https://zenodo.org/badge/660157642.svg)](https://zenodo.org/badge/latestdoi/660157642)
![Notes on (Computational) Social Media Research Banner](https://raw.githubusercontent.com/michaelachmann/social-media-lab/main/images/banner.png)

## Overview

This Jupyter notebook is part of the social-media-lab.net project, a work-in-progress textbook on computational social media analysis. The notebook is intended for use in my classes.

The **Visual BERTopic** notebook uses BERTopic, CLIP, and the `vit-gpt2-image-captioning` model to arrange images into topics based on their content. The image captioning model generates a textual description of the image content, which is then used for the topic modeling.

### Project Information

- Project Website: [social-media-lab.net](https://social-media-lab.net/)
- GitHub Repository: [https://github.com/michaelachmann/social-media-lab](https://github.com/michaelachmann/social-media-lab)

## License Information

This notebook, along with all other notebooks in the project, is licensed under the following terms:

- License: [GNU General Public License version 3.0 (GPL-3.0)](https://www.gnu.org/licenses/gpl-3.0.de.html)
- License File: [LICENSE.md](https://github.com/michaelachmann/social-media-lab/blob/main/LICENSE.md)


## Citation

If you use or reference this notebook in your work, please cite it appropriately. Here is an example of the citation:

```
Michael Achmann. (2024). michaelachmann/social-media-lab: 2024-1-15 (v0.0.9). Zenodo. https://doi.org/10.5281/zenodo.8199901
```

## 1. Data Import

### From 4CAT

In [None]:
#@markdown Read the exported `csv` file from 4CAT for metadata.

import pandas as pd

four_cat_file_path = "/content/drive/MyDrive/2024-01-09-Bauernproteste/2024-01-09-Combined.csv" #@param {type:"string"}

df = pd.read_csv(four_cat_file_path)

In [None]:
df.head()

In [None]:
#@title Unzip and Process Images from 4CAT Export

#@markdown This script unzips the specified ZIP file, reads the metadata JSON file, and then renames and relocates the image files according to the metadata.

import zipfile
import json
import os

#@markdown Enter the Path to the ZIP File
zip_file_path = '/content/drive/MyDrive/2024-01-09-Bauernproteste/2024-01-09-#Bauern-Bilder.zip' #@param {type:"string"}
output_zip_file_path = '/content/drive/MyDrive/2024-01-09-Bauernproteste/2024-01-09-Images-Clean.zip' #@param {type:"string"}


#@markdown Enter the Extraction Folder Path
four_cat_folder = "4cat-export/" #@param {type:"string"}

#@markdown Enter the Destination Folder Path for Images
video_path = "media/images" #@param {type:"string"}

# Open the ZIP file
with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
    # Extract all the contents into the specified folder
    zip_ref.extractall(four_cat_folder)

print(f"Files extracted to {four_cat_folder}")

# Specify the path to the metadata JSON file
metadata_file_path = os.path.join(four_cat_folder, '.metadata.json')

# Open the metadata file and load its content
with open(metadata_file_path, 'r') as file:
    data = json.load(file)

# Create the destination directory for images if it does not exist yet
os.makedirs(video_path, exist_ok=True)

# Process each item in the metadata
for item in data.values():
    if item.get('success', False):
        post_id = item['post_ids'][0]
        filename = item['filename']
        print(f"Processing Post ID: {post_id}, Filename: {filename}")

        # Full path to the source file
        source_path = os.path.join(four_cat_folder, filename)

        # Full path to the destination file
        destination_path = os.path.join(video_path, f"{post_id}.jpg")

        # Move and rename the file
        os.rename(source_path, destination_path)
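
The loop above relies on three keys per entry in `.metadata.json`: `success`, `post_ids`, and `filename`. As a minimal sketch, the same renaming logic can be separated into a function that only plans the moves, so the result can be inspected before any file is touched. The sample metadata below is hypothetical and merely mirrors the keys read above; `plan_moves` is an illustrative helper, not part of 4CAT.

```python
import os

def plan_moves(metadata, source_dir, target_dir, extension=".jpg"):
    """Return (source, destination) path pairs for successfully downloaded files."""
    moves = []
    for item in metadata.values():
        # Skip entries whose download failed or that carry no post ID.
        if not item.get("success", False) or not item.get("post_ids"):
            continue
        source = os.path.join(source_dir, item["filename"])
        destination = os.path.join(target_dir, f"{item['post_ids'][0]}{extension}")
        moves.append((source, destination))
    return moves

# Hypothetical metadata with the same keys the notebook reads.
sample_metadata = {
    "https://example.org/a.jpg": {"success": True, "post_ids": ["111"], "filename": "a.jpg"},
    "https://example.org/b.jpg": {"success": False, "post_ids": ["222"], "filename": "b.jpg"},
}

planned = plan_moves(sample_metadata, "4cat-export", "media/images")
for source, destination in planned:
    print(source, "->", destination)
```

Only the first entry is planned, because the second one is marked as unsuccessful. Each pair could then be passed to `os.rename` as in the cell above.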

The next cell saves the extracted image files to a new `ZIP` file following our `media/images/` convention, which will be useful for future tasks and notebooks. The archive is written to `output_zip_file_path` as set above; change that value according to your needs.

In [None]:
!zip -r "{output_zip_file_path}" media

Here we add a new column to the metadata table, referencing the image file.

In [None]:
df['image_file'] = df.apply(lambda row: f"media/images/{row['id']}.jpg", axis=1)

In [None]:
df[['id', 'body', 'Transcript', 'image_file']].head()

| | id | body | Transcript | image_file |
|---|---|---|---|---|
| 0 | 7321692663852404001 | #Fakten #mutzurwahrheit #ulrichsiegmund #AfD #... | Liebe Freunde, schaut euch das an, das ist der... | /content/media/images/7321692663852404001.jpg |
| 1 | 7320593840212151584 | Unstoppable 🇩🇪 #deutschland #8januar2024 #baue... | the next, video!! | /content/media/images/7320593840212151584.jpg |
| 2 | 7321341957333060896 | 08.01.2024 Streik - Hoss & Hopf #hossundhopf #... | scheiß Bauern, die, was weiß ich, ich habe auc... | /content/media/images/7321341957333060896.jpg |
| 3 | 7321355364950117665 | #streik #2024 #bauernstreik2024 #deutschland #... | 😎😎😎😎😎😎😎😎😎 | /content/media/images/7321355364950117665.jpg |
| 4 | 7321656341590789409 | #🌞❤️ #sunshineheart #sunshineheartforever #🇩🇪 ... | | /content/media/images/7321656341590789409.jpg |
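
Because later steps load every path in `image_file`, it can help to confirm that each referenced file actually exists on disk. The following is a minimal sketch; `find_missing_files` is a hypothetical helper name, and the temporary directory only stands in for `media/images`.

```python
import os
import tempfile

def find_missing_files(paths):
    """Return the subset of paths that do not point to an existing file."""
    return [path for path in paths if not os.path.isfile(path)]

# Demonstration with a temporary folder standing in for media/images.
with tempfile.TemporaryDirectory() as folder:
    existing = os.path.join(folder, "1.jpg")
    with open(existing, "wb") as handle:
        handle.write(b"placeholder")
    absent = os.path.join(folder, "2.jpg")
    missing = find_missing_files([existing, absent])
    missing_names = [os.path.basename(path) for path in missing]

print(missing_names)
```

In the notebook, this could be called as `find_missing_files(df['image_file'])`, and the affected rows could be dropped before modeling.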


### From Zeeschuimer-F

In [None]:
import pandas as pd

df_filepath = '/content/drive/MyDrive/2022-11-09-Stories-Exported.csv'
df = pd.read_csv(df_filepath)

In [None]:
!unzip /content/drive/MyDrive/2023-11-09-Story-Media-Export.zip

In [None]:
df['image_file'] = df.apply(lambda row: f"media/images/{row['Username']}/{row['ID']}.mp4", axis=1)

In [None]:
df.head()

### Previously Cleaned Files

In [None]:
!unzip /content/drive/MyDrive/2024-01-09-Bauernproteste/2024-01-09-Images-Clean.zip

In [None]:
#@markdown Read the cleaned `csv` file.

import pandas as pd

clean_file_path = "/content/drive/MyDrive/2024-01-09-Bauernproteste/2024-01-09-Combined.csv" #@param {type:"string"}

df = pd.read_csv(clean_file_path)

In [None]:
df['image_file'] = df.apply(lambda row: f"/content/media/images/{row['id']}.jpg", axis=1)

In [None]:
df.head()

## BERTopic

In [None]:
!pip install bertopic[vision]

### Images Only

In [None]:
from bertopic.representation import KeyBERTInspired, VisualRepresentation
from bertopic.backend import MultiModalBackend

# Image embedding model
embedding_model = MultiModalBackend('clip-ViT-B-32', batch_size=32)

# Image to text representation model
representation_model = {
    "Visual_Aspect": VisualRepresentation(image_to_text_model="nlpconnect/vit-gpt2-image-captioning")
}


Could not find image processor class in the image processor config or the model config. Loading based on pattern matching with the model's feature extractor configuration.


In [None]:
# We create a copy of our dataframe and collect the image paths.
# For the images-only model no text column is needed.
image_only_df = df.copy()
images = image_only_df['image_file'].to_list()

In [None]:
from bertopic import BERTopic

# Train our model with images only
topic_model = BERTopic(embedding_model=embedding_model, representation_model=representation_model, min_topic_size=5)
topics, probs = topic_model.fit_transform(documents=None, images=images)

100%|██████████| 7/7 [02:33<00:00, 21.88s/it]
100%|██████████| 7/7 [00:02<00:00,  2.99it/s]
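
`fit_transform` returns one topic ID per image, in the same order as `images`. A quick way to see how the images are distributed is to count these IDs; in BERTopic, the ID `-1` collects outliers that were not assigned to any topic. The sketch below uses a made-up `example_topics` list in place of the real `topics` output, and `topic_sizes` is an illustrative helper.

```python
from collections import Counter

def topic_sizes(topic_ids):
    """Count images per topic, largest first; -1 is BERTopic's outlier topic."""
    return Counter(topic_ids).most_common()

# Made-up topic assignments standing in for the `topics` list.
example_topics = [0, 1, 0, -1, 2, 0, 1, -1]
sizes = topic_sizes(example_topics)
for topic_id, count in sizes:
    label = "outliers" if topic_id == -1 else f"topic {topic_id}"
    print(f"{label}: {count} images")
```

Applied to the notebook, `topic_sizes(topics)` would give a first impression of whether `min_topic_size=5` yields a sensible number of topics for the dataset.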


In [None]:
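import base64
from io import BytesIO
from IPython.display import HTML
from PIL import Image

def get_thumbnail(path, size=(150, 150)):
    # Helper for images given as file paths: load from disk and shrink for display
    im = Image.open(path).convert("RGB")
    im.thumbnail(size)
    return im

def image_base64(im):
    # Encode a PIL image (or an image file path) as a base64 JPEG string
    if isinstance(im, str):
        im = get_thumbnail(im)
    with BytesIO() as buffer:
        im.save(buffer, 'jpeg')
        return base64.b64encode(buffer.getvalue()).decode()


def image_formatter(im):
    # Wrap the encoded image in an inline HTML <img> tag
    return f'<img src="data:image/jpeg;base64,{image_base64(im)}">'

# Extract dataframe without the columns we do not want to display
topic_df = topic_model.get_topic_info().drop(columns=["Representative_Docs", "Name"])

# Visualize the images
HTML(topic_df.to_html(formatters={'Visual_Aspect': image_formatter}, escape=False))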
