<a href="https://colab.research.google.com/github/michaelachmann/social-media-lab/blob/main/notebooks/2024_11_11_Preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Preprocessing for Posts and Stories [![DOI](https://zenodo.org/badge/660157642.svg)](https://zenodo.org/badge/latestdoi/660157642)
![Notes on (Computational) Social Media Research Banner](https://raw.githubusercontent.com/michaelachmann/social-media-lab/main/images/banner.png)

## Overview

This Jupyter notebook is a part of the social-media-lab.net project, which is a work-in-progress textbook on computational social media analysis. The notebook is intended for use in my classes.

The **Preprocessing** Notebook uses the `easyocr` package to recognize and transcribe text embedded in images and stories and OpenAI's Whisper API to transcribe videos.

### Project Information

- Project Website: [social-media-lab.net](https://social-media-lab.net/)
- GitHub Repository: [https://github.com/michaelachmann/social-media-lab](https://github.com/michaelachmann/social-media-lab)

## License Information

This notebook, along with all other notebooks in the project, is licensed under the following terms:

- License: [GNU General Public License version 3.0 (GPL-3.0)](https://www.gnu.org/licenses/gpl-3.0.de.html)
- License File: [LICENSE.md](https://github.com/michaelachmann/social-media-lab/blob/main/LICENSE.md)


## Citation

If you use or reference this notebook in your work, please cite it appropriately. Here is an example of the citation:

```
Michael Achmann. (2024). michaelachmann/social-media-lab: 2023-11-11 (v0.0.14). Zenodo. https://doi.org/10.5281/zenodo.8199901
```

## Setup

In [1]:
!pip install -q easyocr

[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/2.9 MB[0m [31m?[0m eta [36m-:--:--[0m[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m [32m2.9/2.9 MB[0m [31m142.7 MB/s[0m eta [36m0:00:01[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.9/2.9 MB[0m [31m79.8 MB/s[0m eta [36m0:00:00[0m
[?25h[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/307.2 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m307.2/307.2 kB[0m [31m29.1 MB/s[0m eta [36m0:00:00[0m
[?25h[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/912.2 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m912.2/912.2 kB[0m [31m49.1 MB/s[0m eta [36m0:00:00[0m
[?25h[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/286.8 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

In [2]:
!pip install -q openai backoff

## Posts

We assume the import of Zeeschuimer collected posts [based on Notebook](https://colab.research.google.com/github/michaelachmann/social-media-lab/blob/main/notebooks/2023_11_03_Zeeschuimer_Import.ipynb).

**Warning** The current version only incorporates the *first* image of a post. The notebook above needs an update to deal with multiple images.

In [4]:
!unzip -q 2024-11-11-posts.zip

In [5]:
import pandas as pd

df_posts = pd.read_csv('/content/2024-11-11-ig-zeeschuimer-export.csv')

In [6]:
print(df_posts.iloc[0].to_markdown())

|                   | 0                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
|:------------------|:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

### Extract First Frame

In [7]:
import os
import cv2
from tqdm.notebook import tqdm

# Define the paths
videos_root_path = 'posts/videos'
images_root_path = 'posts/images'

# Collect all video files in the directory and subdirectories
video_files = []
for root, dirs, files in os.walk(videos_root_path):
    for file in files:
        if file.endswith(('.mp4', '.avi', '.mov', '.mkv')):  # Add more video file extensions if needed
            video_files.append(os.path.join(root, file))

# Loop through each video file and extract the first frame
for video_path in tqdm(video_files, desc="Extracting frames"):
    # Open the video file
    cap = cv2.VideoCapture(video_path)

    # Check if the video opened successfully
    if not cap.isOpened():
        print(f"Error opening video file: {video_path}")
        continue

    # Read the first frame of the video
    ret, frame = cap.read()

    # Check if the frame was read successfully
    if not ret:
        print(f"Error reading first frame of video: {video_path}")
        continue

    # Create the output image path
    relative_path = os.path.relpath(video_path, videos_root_path)
    image_path = os.path.join(images_root_path, os.path.splitext(relative_path)[0] + '.jpg')

    # Ensure the output directory exists
    os.makedirs(os.path.dirname(image_path), exist_ok=True)

    # Save the first frame as a JPEG image
    cv2.imwrite(image_path, frame)

    # Release the video capture object
    cap.release()

print("First frame extraction completed.")


Extracting frames:   0%|          | 0/1 [00:00<?, ?it/s]

First frame extraction completed.


In [8]:
!zip -r --update posts.zip posts/

  adding: posts/ (stored 0%)
  adding: posts/images/ (stored 0%)
  adding: posts/images/fw_bayern/ (stored 0%)
  adding: posts/images/fw_bayern/DCBmLuTv_7C.jpg (deflated 12%)
  adding: posts/images/dh.news.catcher/ (stored 0%)
  adding: posts/images/dh.news.catcher/Cl8QQffoAv1.jpg (deflated 0%)
  adding: posts/images/dh.news.catcher/Cl06_FgImCM.jpg (deflated 0%)
  adding: posts/images/dh.news.catcher/CmG5tP3ohLS.jpg (deflated 0%)
  adding: posts/images/news24/ (stored 0%)
  adding: posts/images/news24/DCOcihAOOfr.jpg (deflated 0%)
  adding: posts/images/news24/DBwG6eEuPIg.jpg (deflated 0%)
  adding: posts/images/ludwighartmann/ (stored 0%)
  adding: posts/images/ludwighartmann/DB8hUbWNcXS.jpg (deflated 3%)
  adding: posts/images/katrin.ebnersteiner/ (stored 0%)
  adding: posts/images/katrin.ebnersteiner/DCEHp67sb1U.jpg (deflated 2%)
  adding: posts/images/dielinke.bayern/ (stored 0%)
  adding: posts/images/dielinke.bayern/DCMbVnSNKgL.jpg (deflated 0%)
  adding: posts/images/kathaschulz

### OCR

In [15]:
import pandas as pd
import easyocr
import os
from tqdm.notebook import tqdm

# Define the path to the images folder
images_root_path = 'posts/images'

# Initialize the EasyOCR reader
reader = easyocr.Reader(['de'])

# Initialize a dictionary to store OCR results
ocr_results = {}

# Loop through each subfolder in the images folder
for root, dirs, files in os.walk(images_root_path):
    for file in tqdm(files, desc=f"Processing images in {root}"):
        if file.endswith(('.jpg', '.jpeg', '.png')):  # Add more image file extensions if needed
            image_path = os.path.join(root, file)
            author = os.path.basename(root)
            image_id, _ = os.path.splitext(file)

            # Read the image using EasyOCR
            text = reader.readtext(image_path)

            # Extracted text as a single string
            extracted_text = ' '.join([line[1] for line in text])

            # Store the result in the dictionary
            ocr_results[(author, image_id)] = extracted_text

Processing images in posts/images: 0it [00:00, ?it/s]

Processing images in posts/images/fw_bayern:   0%|          | 0/1 [00:00<?, ?it/s]

Processing images in posts/images/dh.news.catcher:   0%|          | 0/3 [00:00<?, ?it/s]

Processing images in posts/images/news24:   0%|          | 0/2 [00:00<?, ?it/s]

Processing images in posts/images/ludwighartmann:   0%|          | 0/1 [00:00<?, ?it/s]

Processing images in posts/images/katrin.ebnersteiner:   0%|          | 0/1 [00:00<?, ?it/s]

Processing images in posts/images/dielinke.bayern:   0%|          | 0/1 [00:00<?, ?it/s]

Processing images in posts/images/kathaschulze:   0%|          | 0/3 [00:00<?, ?it/s]

Processing images in posts/images/gruenebayern:   0%|          | 0/1 [00:00<?, ?it/s]

Processing images in posts/images/bayernspd:   0%|          | 0/1 [00:00<?, ?it/s]

In [18]:
# Add a new column for OCR text in the dataframe
df_posts['ocr_text'] = df_posts.apply(lambda row: ocr_results.get((row['author'], row['id']), ''), axis=1)

In [19]:
df_posts.head()

Unnamed: 0.1,Unnamed: 0,id,thread_id,parent_id,body,author,author_fullname,author_avatar_url,timestamp,type,...,media_url,hashtags,num_likes,num_comments,num_media,location_name,location_latlong,location_city,unix_timestamp,ocr_text
0,0,DBwPNDuNdAg,DBwPNDuNdAg,DBwPNDuNdAg,Hallo Heidelberg! Zum ersten Mal zu viert hier...,kathaschulze,Katharina Schulze,https://scontent.cdninstagram.com/v/t51.2885-1...,2024-10-30 15:42:29,photo,...,https://scontent.cdninstagram.com/v/t51.2885-1...,"heidelberg,schlossheidelberg,badenwürttemberg,...",3816,51,1,Heidelberg,"49.4122,8.71",,1730302949,
1,1,DCOcihAOOfr,DCOcihAOOfr,DCOcihAOOfr,When the police at the Palm Ridge Magistrate's...,news24,News24,https://scontent.cdninstagram.com/v/t51.2885-1...,2024-11-11 09:16:17,photo,...,https://scontent.cdninstagram.com/v/t51.2885-1...,,358,13,1,,,,1731316577,news24 Unlucky escape: Alleged serial rapist's...
2,2,DCHWFYTta-b,DCHWFYTta-b,DCHWFYTta-b,Gemeinsam kämpfen wir für soziale Gerechtigkei...,bayernspd,BayernSPD,https://scontent.cdninstagram.com/v/t51.2885-1...,2024-11-08 15:05:08,photo,...,https://scontent.cdninstagram.com/v/t51.29350-...,,3,3,1,,,,1731078308,"DER BESTE MOMENT, MITGLIED ZU WERDEN, WAR GEST..."
3,3,DCEHp67sb1U,DCEHp67sb1U,DCEHp67sb1U,Katrin Ebner-Steiner: Die Chaos-Ampel ist zerb...,katrin.ebnersteiner,"Katrin Ebner-Steiner, MdL",https://scontent.cdninstagram.com/v/t51.2885-1...,2024-11-07 09:01:20,photo,...,https://scontent.cdninstagram.com/v/t39.30808-...,,108,7,1,,,,1730970080,DIE CHAOS-AMPEL IST ZERBROCHENI DEUTSCHLAND BR...
4,4,DCBmO3HOcxk,DCBmO3HOcxk,DCBmO3HOcxk,Die USA hat gewählt und sich für nationalistis...,gruenebayern,GRÜNE Bayern,https://scontent-fra3-1.cdninstagram.com/v/t51...,2024-11-06 09:30:48,photo,...,https://scontent-fra5-2.cdninstagram.com/v/t39...,"USWahl,Trump,Feminismus,Frauen,Politik,Grüne",1774,71,1,,,,1730885448,"Wenn die Welt verrückt spielt, braucht es eine..."


In [20]:
df_posts.to_csv('2024-11-11-Posts.csv')

### Whisper

In [23]:
import openai
from openai import OpenAI
from google.colab import userdata
import backoff

api_key = userdata.get('openai-forschung-mad')

client = OpenAI(api_key=api_key)


@backoff.on_exception(backoff.expo, (openai.RateLimitError, openai.APIError))
def run_request(audio_file):
    return client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file
    )

In [24]:
import os
from tqdm.notebook import tqdm
import moviepy.editor as mp
import pandas as pd


# Define the paths
videos_root_path = 'posts/videos'
images_root_path = 'posts/images'
audio_save_path = 'posts/audio'

# Ensure the audio directory exists
os.makedirs(audio_save_path, exist_ok=True)

# Initialize a dictionary to store transcription results
transcription_results = {}

# Loop through each subfolder and video file in the videos folder
for root, dirs, files in os.walk(videos_root_path):
    for file in tqdm(files, desc=f"Processing videos in {root}"):
        if file.endswith(('.mp4', '.avi', '.mov', '.mkv')):  # Add more video file extensions if needed
            video_path = os.path.join(root, file)
            author = os.path.basename(root)
            video_id, _ = os.path.splitext(file)

            # Extract audio from the video and save as MP3
            try:
                video_clip = mp.VideoFileClip(video_path)
                audio_path = os.path.join(audio_save_path, f"{video_id}.mp3")
                video_clip.audio.write_audiofile(audio_path, codec='libmp3lame')

                # Transcribe the audio using OpenAI Whisper
                audio_file = open(audio_path, "rb")
                response = run_request(audio_file)
                transcription_text = response.text

                # Store the result in the dictionary
                transcription_results[(author, video_id)] = transcription_text
            except Exception as e:
                print(f"Error processing video {video_path}: {e}")

Processing videos in posts/videos: 0it [00:00, ?it/s]

Processing videos in posts/videos/kathaschulze:   0%|          | 0/1 [00:00<?, ?it/s]

MoviePy - Writing audio in posts/audio/DCHYJgitebc.mp3



chunk:   0%|          | 0/143 [00:00<?, ?it/s, now=None][A
chunk:  71%|███████▏  | 102/143 [00:00<00:00, 1016.47it/s, now=None][A
                                                                    [A

MoviePy - Done.


In [26]:
# Add a new column for transcription text in the dataframe
df_posts['transcription_text'] = df_posts.apply(lambda row: transcription_results.get((row['author'], row['id']), ''), axis=1)

In [27]:
df_posts.sample(10)

Unnamed: 0.1,Unnamed: 0,id,thread_id,parent_id,body,author,author_fullname,author_avatar_url,timestamp,type,...,hashtags,num_likes,num_comments,num_media,location_name,location_latlong,location_city,unix_timestamp,ocr_text,transcription_text
5,5,DB8hUbWNcXS,DB8hUbWNcXS,DB8hUbWNcXS,Humusverlust auf Bayerns Feldern: Eine Gefahr ...,ludwighartmann,Ludwig Hartmann,https://scontent-fra3-2.cdninstagram.com/v/t51...,2024-11-04 10:11:40,photo,...,"Landwirtschaft,Humus,Bodenschutz,Klimaschutz,H...",805,50,5,,,,1730715100,ludwighartmannde CSU Lmndwirten wichtige Förde...,
10,10,Cl06_FgImCM,Cl06_FgImCM,Cl06_FgImCM,"""Keepin' up with news from around the world! T...",dh.news.catcher,DH News Collector,https://scontent-fra3-2.cdninstagram.com/v/t51...,2022-12-06 12:42:59,photo,...,"RobotReading,CoffeeAndNewspaper,LearningMoreEv...",1,0,1,,,,1670330579,,
4,4,DCBmO3HOcxk,DCBmO3HOcxk,DCBmO3HOcxk,Die USA hat gewählt und sich für nationalistis...,gruenebayern,GRÜNE Bayern,https://scontent-fra3-1.cdninstagram.com/v/t51...,2024-11-06 09:30:48,photo,...,"USWahl,Trump,Feminismus,Frauen,Politik,Grüne",1774,71,1,,,,1730885448,"Wenn die Welt verrückt spielt, braucht es eine...",
0,0,DBwPNDuNdAg,DBwPNDuNdAg,DBwPNDuNdAg,Hallo Heidelberg! Zum ersten Mal zu viert hier...,kathaschulze,Katharina Schulze,https://scontent.cdninstagram.com/v/t51.2885-1...,2024-10-30 15:42:29,photo,...,"heidelberg,schlossheidelberg,badenwürttemberg,...",3816,51,1,Heidelberg,"49.4122,8.71",,1730302949,,
6,6,DCB1aieNF-o,DCB1aieNF-o,DCB1aieNF-o,"Was für ein Horror. \n \nFühlt ihr euch auch, ...",kathaschulze,Katharina Schulze,https://scontent-fra5-1.cdninstagram.com/v/t51...,2024-11-06 11:43:28,photo,...,,2622,140,1,,,,1730893408,,
11,11,DCHYJgitebc,DCHYJgitebc,DCHYJgitebc,#kanzlerera \nIch freu mich auf den Bundestags...,kathaschulze,Katharina Schulze,https://scontent.cdninstagram.com/v/t51.2885-1...,2024-11-08 15:26:11,video,...,kanzlerera,3140,163,1,"Bayern, Germany","48.894107570617,11.583000803261",,1731079571,,Are you ready for it?
1,1,DCOcihAOOfr,DCOcihAOOfr,DCOcihAOOfr,When the police at the Palm Ridge Magistrate's...,news24,News24,https://scontent.cdninstagram.com/v/t51.2885-1...,2024-11-11 09:16:17,photo,...,,358,13,1,,,,1731316577,news24 Unlucky escape: Alleged serial rapist's...,
8,8,CmG5tP3ohLS,CmG5tP3ohLS,CmG5tP3ohLS,"Taking the time to appreciate the morning, one...",dh.news.catcher,DH News Collector,https://scontent-fra3-2.cdninstagram.com/v/t51...,2022-12-13 12:18:08,photo,...,"RobotLife,UpliftingNews,aiart,stablediffusion",4,0,1,,,,1670933888,36 ELNE AK8 HCSTFOIO A 1a6 KFoB. HEA An; EPST ...,
13,13,DCBmLuTv_7C,DCBmLuTv_7C,DCBmLuTv_7C,#Klartext von @hubertaiwanger\n\n#FREIEWÄHLER ...,fw_bayern,FREIE WÄHLER Bayern,https://scontent.cdninstagram.com/v/t51.2885-1...,2024-11-06 09:30:26,photo,...,"Klartext,FREIEWÄHLER,Aiwanger,Trump,USAElectio...",599,15,1,,,,1730885426,Hubert Aiwanger @HubertAiwanger #Trump #USWahl...,
12,12,DBwG6eEuPIg,DBwG6eEuPIg,DBwG6eEuPIg,"Carel Benjamin Schoeman, the attorney accused ...",news24,News24,https://scontent.cdninstagram.com/v/t51.2885-1...,2024-10-30 14:30:06,photo,...,,4439,569,1,,,,1730298606,news24 Meet Carel Schoeman; the attorney accus...,


In [28]:
df_posts.to_csv('2024-11-11-Posts.csv')

In [29]:
!zip -r --update posts.zip posts/

updating: posts/ (stored 0%)
  adding: posts/audio/ (stored 0%)
  adding: posts/audio/DCHYJgitebc.mp3 (deflated 2%)


### Create the Text Master

In [30]:
import pandas as pd

# Melt the dataframe
df_long = pd.melt(df_posts, id_vars=['id'],
                  value_vars=['body', 'ocr_text', 'transcription_text'],
                  var_name='Text Type',
                  value_name='Text')

# Map the Text Type to more descriptive names
df_long['Text Type'] = df_long['Text Type'].map({
    'body': 'Caption',
    'ocr_text': 'OCR',
    'transcription_text': 'Transcription'
})

df_long['Image'] = df_long['id'].apply(lambda x: f'{x}.jpg')

df_long.rename(columns={'id': 'Identifier'}, inplace=True)

df_long['Post Type'] = 'Post'

In [31]:
df_long = df_long[df_long['Text'].apply(lambda x: isinstance(x, str) and len(x) > 0)]

In [32]:
df_long.sample(10)

Unnamed: 0,Identifier,Text Type,Text,Image,Post Type
15,DCOcihAOOfr,OCR,news24 Unlucky escape: Alleged serial rapist's...,DCOcihAOOfr.jpg,Post
1,DCOcihAOOfr,Caption,When the police at the Palm Ridge Magistrate's...,DCOcihAOOfr.jpg,Post
7,DCMbVnSNKgL,Caption,"Unser Spitzenduo für die Bundestagswahl, @heid...",DCMbVnSNKgL.jpg,Post
0,DBwPNDuNdAg,Caption,Hallo Heidelberg! Zum ersten Mal zu viert hier...,DBwPNDuNdAg.jpg,Post
39,DCHYJgitebc,Transcription,Are you ready for it?,DCHYJgitebc.jpg,Post
27,DCBmLuTv_7C,OCR,Hubert Aiwanger @HubertAiwanger #Trump #USWahl...,DCBmLuTv_7C.jpg,Post
9,Cl8QQffoAv1,Caption,"Sometimes the world can be a dark place, but I...",Cl8QQffoAv1.jpg,Post
17,DCEHp67sb1U,OCR,DIE CHAOS-AMPEL IST ZERBROCHENI DEUTSCHLAND BR...,DCEHp67sb1U.jpg,Post
5,DB8hUbWNcXS,Caption,Humusverlust auf Bayerns Feldern: Eine Gefahr ...,DB8hUbWNcXS.jpg,Post
8,CmG5tP3ohLS,Caption,"Taking the time to appreciate the morning, one...",CmG5tP3ohLS.jpg,Post


In [33]:
df_long.to_csv('2024-11-11-Text-Master.csv')

## Stories

In [34]:
!unzip /content/tidaltales.zip

Archive:  /content/tidaltales.zip
   creating: tidaltales/
  inflating: __MACOSX/._tidaltales   
   creating: tidaltales/bayernspd/
  inflating: __MACOSX/tidaltales/._bayernspd  
   creating: tidaltales/ludwighartmann/
  inflating: __MACOSX/tidaltales/._ludwighartmann  
   creating: tidaltales/wsj/
  inflating: __MACOSX/tidaltales/._wsj  
   creating: tidaltales/bild/
  inflating: __MACOSX/tidaltales/._bild  
   creating: tidaltales/sz/
  inflating: __MACOSX/tidaltales/._sz  
   creating: tidaltales/spiegelmagazin/
  inflating: __MACOSX/tidaltales/._spiegelmagazin  
   creating: tidaltales/timesofindia/
  inflating: __MACOSX/tidaltales/._timesofindia  
   creating: tidaltales/fw_bayern/
  inflating: __MACOSX/tidaltales/._fw_bayern  
   creating: tidaltales/markus.soeder/
  inflating: __MACOSX/tidaltales/._markus.soeder  
   creating: tidaltales/bbcnews/
  inflating: __MACOSX/tidaltales/._bbcnews  
  inflating: tidaltales/bayernspd/3498340580813225047.jpg  
  inflating: __MACOSX/tidalta

In [36]:
import pandas as pd

df_stories = pd.read_csv('/content/tidaltales_export_20241111T104741.csv')

In [37]:
df_stories.head()

Unnamed: 0,ID,Time of Posting,Type of Content,video_path,image_path,Username,Video Length (s),Expiration,Caption,Is Verified,Stickers,Accessibility Caption,Attribution URL,Story Media,Story Hashtags,Story Questions,Story Sliders,Story CTA,Story Countdowns,Story Locations
0,3498187238895924847,2024-11-10T11:00:34.000Z,Image,,bild/3498187238895924847.jpg,bild,,2024-11-11T11:00:34.000Z,,True,[],"Photo by BILD on November 10, 2024. Ist möglic...",,[],[],[],[],[],[],[]
1,3498230535983173346,2024-11-10T12:26:34.000Z,Image,,bild/3498230535983173346.jpg,bild,,2024-11-11T12:26:34.000Z,,True,"[{""x"":0,""y"":0,""width"":0,""height"":0,""rotation"":...","Photo by BILD on November 10, 2024. Ist möglic...",,[],[],[],[],[],[],[]
2,3498235320048180286,2024-11-10T12:36:02.000Z,Image,,bild/3498235320048180286.jpg,bild,,2024-11-11T12:36:02.000Z,,True,"[{""x"":-0.377450262761927,""y"":0.503325569767475...","Photo by BILD on November 10, 2024. Ist möglic...",,[],[],[],[],[],[],[]
3,3498249690941594229,2024-11-10T13:03:55.000Z,Video,bild/3498249690941594229.mp4,bild/3498249690941594229.jpg,bild,41.2,2024-11-11T13:03:55.000Z,,True,[],,,"[{""x"":0.44480300381452303,""y"":0.45220631792168...",[],[],[],[],[],[]
4,3498316699821691823,2024-11-10T15:17:46.000Z,Image,,bild/3498316699821691823.jpg,bild,,2024-11-11T15:17:46.000Z,,True,[],"Photo by BILD on November 10, 2024. Ist möglic...",,"[{""x"":0.5,""y"":0.5,""width"":0.5,""height"":0.5,""ro...",[],[],[],[],[],[]


### Extract First Frame

**Not Necessary -- We have an image for each Story from Instagram!**

### OCR

In [35]:
import pandas as pd
import easyocr
import os
from tqdm.notebook import tqdm

# Define the path to the images folder
images_root_path = '/content/tidaltales'

# Initialize the EasyOCR reader
reader = easyocr.Reader(['de'])

# Initialize a dictionary to store OCR results
ocr_results = {}

# Loop through each subfolder in the images folder
for root, dirs, files in os.walk(images_root_path):
    for file in tqdm(files, desc=f"Processing images in {root}"):
        if file.endswith(('.jpg', '.jpeg', '.png')):  # Add more image file extensions if needed
            image_path = os.path.join(root, file)
            author = os.path.basename(root)
            image_id, _ = os.path.splitext(file)

            # Read the image using EasyOCR
            text = reader.readtext(image_path)

            # Extracted text as a single string
            extracted_text = ' '.join([line[1] for line in text])

            # Store the result in the dictionary
            ocr_results[(author, image_id)] = extracted_text

Processing images in /content/tidaltales: 0it [00:00, ?it/s]

Processing images in /content/tidaltales/fw_bayern:   0%|          | 0/8 [00:00<?, ?it/s]

Processing images in /content/tidaltales/spiegelmagazin:   0%|          | 0/33 [00:00<?, ?it/s]

Processing images in /content/tidaltales/bild:   0%|          | 0/23 [00:00<?, ?it/s]

Processing images in /content/tidaltales/timesofindia:   0%|          | 0/14 [00:00<?, ?it/s]

Processing images in /content/tidaltales/ludwighartmann:   0%|          | 0/6 [00:00<?, ?it/s]

Processing images in /content/tidaltales/wsj:   0%|          | 0/14 [00:00<?, ?it/s]

Processing images in /content/tidaltales/bbcnews:   0%|          | 0/34 [00:00<?, ?it/s]

Processing images in /content/tidaltales/sz:   0%|          | 0/8 [00:00<?, ?it/s]

Processing images in /content/tidaltales/markus.soeder:   0%|          | 0/24 [00:00<?, ?it/s]

Processing images in /content/tidaltales/bayernspd:   0%|          | 0/10 [00:00<?, ?it/s]

In [40]:
ocr_results

{('fw_bayern',
  '3498390774081971063'): 'fw_bayern FREIE Weil ihr in den WÄHLER Kommunen recht habt! FREIE WÄHLER in den Bundestag: fw_bayern #FREIEWÄHLER sind die Stimme der #Kommunen! FREIE WÄHLER in den Bundestag.',
 ('fw_bayern',
  '3498292921506668375'): 'marina jakob mdl fwaugsburgland tagesschau ; ; Während Elon Musk seine Raketen rück- wärts einparken lässt; versinkt Deutschland im Chaos, wir warten müssen, bis die Bundesdruckerei Wahlscheine ausgedruckt hat das ist absurd! Fabian Mehring, Freie Wähler Digital-Staatsminister tagesschau Bayern Quelle: Augsburger Allgemeine tagesschau Wahlbenachrichtigung, Briefwahlunterlagen, Stimmzettel: Für eine Bundestagswahl werden viele Tonne. fw_bayern fwlandtag weil',
 ('fw_bayern',
  '3498293642876497971'): 'bayern Ohne unsere Bauern bleibt der Teller leer! FREIE WÄHLER Bayerns starke Mitte. fw_bayern #FREIEWÄHLER Wir stehen hinter unseren #Bauern! fw_',
 ('fw_bayern',
  '3498293154785290767'): 'Deutschland braucht, wie Bayern, eine bür

In [48]:
# Add a new column for OCR text in the dataframe
df_stories['ocr_text'] = df_stories.apply(lambda row: ocr_results.get((row['Username'], str(row['ID'])), ''), axis=1)

In [49]:
df_stories.head()

Unnamed: 0,ID,Time of Posting,Type of Content,video_path,image_path,Username,Video Length (s),Expiration,Caption,Is Verified,...,Accessibility Caption,Attribution URL,Story Media,Story Hashtags,Story Questions,Story Sliders,Story CTA,Story Countdowns,Story Locations,ocr_text
0,3498187238895924847,2024-11-10T11:00:34.000Z,Image,,bild/3498187238895924847.jpg,bild,,2024-11-11T11:00:34.000Z,,True,...,"Photo by BILD on November 10, 2024. Ist möglic...",,[],[],[],[],[],[],[],Bild Bachelor-Babe will Walentina verkloppen 2...
1,3498230535983173346,2024-11-10T12:26:34.000Z,Image,,bild/3498230535983173346.jpg,bild,,2024-11-11T12:26:34.000Z,,True,...,"Photo by BILD on November 10, 2024. Ist möglic...",,[],[],[],[],[],[],[],FiGhTING Can Der Kampf in voller Länge Blut-Sc...
2,3498235320048180286,2024-11-10T12:36:02.000Z,Image,,bild/3498235320048180286.jpg,bild,,2024-11-11T12:36:02.000Z,,True,...,"Photo by BILD on November 10, 2024. Ist möglic...",,[],[],[],[],[],[],[],Bild CmA Y 1 1 Fame Fighting RMIETUNG ALS WNer...
3,3498249690941594229,2024-11-10T13:03:55.000Z,Video,bild/3498249690941594229.mp4,bild/3498249690941594229.jpg,bild,41.2,2024-11-11T13:03:55.000Z,,True,...,,,"[{""x"":0.44480300381452303,""y"":0.45220631792168...",[],[],[],[],[],[],"@'WJI Y bild , walentinadoroninaofficial und b..."
4,3498316699821691823,2024-11-10T15:17:46.000Z,Image,,bild/3498316699821691823.jpg,bild,,2024-11-11T15:17:46.000Z,,True,...,"Photo by BILD on November 10, 2024. Ist möglic...",,"[{""x"":0.5,""y"":0.5,""width"":0.5,""height"":0.5,""ro...",[],[],[],[],[],[],Fotor Bild Hype immer irrer Dubai-Schokolade a...


In [50]:
df_stories.to_csv('2024-11-11-Stories-Export.csv')

### Transcription

In [51]:
import openai
from openai import OpenAI
from google.colab import userdata
import backoff

api_key = userdata.get('openai-forschung-mad')

client = OpenAI(api_key=api_key)


@backoff.on_exception(backoff.expo, (openai.RateLimitError, openai.APIError))
def run_request(audio_file):
    return client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file
    )

In [52]:
import os
from tqdm.notebook import tqdm
import moviepy.editor as mp
import pandas as pd


# Define the paths
videos_root_path = '/content/tidaltales'
audio_save_path = '/content/tidaltales/audio'

# Ensure the audio directory exists
os.makedirs(audio_save_path, exist_ok=True)

# Initialize a dictionary to store transcription results
transcription_results = {}

# Loop through each subfolder and video file in the videos folder
for root, dirs, files in os.walk(videos_root_path):
    for file in tqdm(files, desc=f"Processing videos in {root}"):
        if file.endswith(('.mp4', '.avi', '.mov', '.mkv')):  # Add more video file extensions if needed
            video_path = os.path.join(root, file)
            author = os.path.basename(root)
            video_id, _ = os.path.splitext(file)

            # Extract audio from the video and save as MP3
            try:
                video_clip = mp.VideoFileClip(video_path)
                audio_path = os.path.join(audio_save_path, f"{video_id}.mp3")
                video_clip.audio.write_audiofile(audio_path, codec='libmp3lame')

                # Transcribe the audio using OpenAI Whisper
                audio_file = open(audio_path, "rb")
                response = run_request(audio_file)
                transcription_text = response.text

                # Store the result in the dictionary
                transcription_results[(author, video_id)] = transcription_text
            except Exception as e:
                print(f"Error processing video {video_path}: {e}")

Processing videos in /content/tidaltales: 0it [00:00, ?it/s]

Processing videos in /content/tidaltales/fw_bayern:   0%|          | 0/8 [00:00<?, ?it/s]

Processing videos in /content/tidaltales/audio: 0it [00:00, ?it/s]

Processing videos in /content/tidaltales/spiegelmagazin:   0%|          | 0/33 [00:00<?, ?it/s]

MoviePy - Writing audio in /content/tidaltales/audio/3498292772298853867.mp3



chunk:   0%|          | 0/223 [00:00<?, ?it/s, now=None][A
                                                        [A

MoviePy - Done.
MoviePy - Writing audio in /content/tidaltales/audio/3498292815122779364.mp3



chunk:   0%|          | 0/293 [00:00<?, ?it/s, now=None][A
                                                        [A

MoviePy - Done.
MoviePy - Writing audio in /content/tidaltales/audio/3498293566012283142.mp3



chunk:   0%|          | 0/226 [00:00<?, ?it/s, now=None][A
                                                        [A

MoviePy - Done.
MoviePy - Writing audio in /content/tidaltales/audio/3498292828485804820.mp3



chunk:   0%|          | 0/226 [00:00<?, ?it/s, now=None][A
                                                        [A

MoviePy - Done.
MoviePy - Writing audio in /content/tidaltales/audio/3498292761704068720.mp3



chunk:   0%|          | 0/226 [00:00<?, ?it/s, now=None][A
                                                        [A

MoviePy - Done.
MoviePy - Writing audio in /content/tidaltales/audio/3498292726035846459.mp3



chunk:   0%|          | 0/226 [00:00<?, ?it/s, now=None][A
                                                        [A

MoviePy - Done.
MoviePy - Writing audio in /content/tidaltales/audio/3498292749389736873.mp3



chunk:   0%|          | 0/230 [00:00<?, ?it/s, now=None][A
chunk:  68%|██████▊   | 156/230 [00:00<00:00, 1559.33it/s, now=None][A
                                                                    [A

MoviePy - Done.
MoviePy - Writing audio in /content/tidaltales/audio/3498292784403609711.mp3



chunk:   0%|          | 0/225 [00:00<?, ?it/s, now=None][A
                                                        [A

MoviePy - Done.
MoviePy - Writing audio in /content/tidaltales/audio/3498292798169486078.mp3



chunk:   0%|          | 0/226 [00:00<?, ?it/s, now=None][A
                                                        [A

MoviePy - Done.


Processing videos in /content/tidaltales/bild:   0%|          | 0/23 [00:00<?, ?it/s]

MoviePy - Writing audio in /content/tidaltales/audio/3498371279830118768.mp3



chunk:   0%|          | 0/1326 [00:00<?, ?it/s, now=None][A
chunk:  12%|█▏        | 155/1326 [00:00<00:00, 1547.43it/s, now=None][A
chunk:  23%|██▎       | 310/1326 [00:00<00:00, 1430.06it/s, now=None][A
chunk:  36%|███▌      | 478/1326 [00:00<00:00, 1537.93it/s, now=None][A
chunk:  49%|████▉     | 648/1326 [00:00<00:00, 1599.92it/s, now=None][A
chunk:  61%|██████    | 809/1326 [00:00<00:00, 1426.47it/s, now=None][A
chunk:  72%|███████▏  | 955/1326 [00:00<00:00, 1357.90it/s, now=None][A
chunk:  85%|████████▍ | 1122/1326 [00:00<00:00, 1434.26it/s, now=None][A
chunk:  97%|█████████▋| 1291/1326 [00:00<00:00, 1508.26it/s, now=None][A
                                                                      [A

MoviePy - Done.
MoviePy - Writing audio in /content/tidaltales/audio/3498249690941594229.mp3



chunk:   0%|          | 0/909 [00:00<?, ?it/s, now=None][A
chunk:  20%|█▉        | 180/909 [00:00<00:00, 1798.61it/s, now=None][A
chunk:  40%|███▉      | 360/909 [00:00<00:00, 1764.89it/s, now=None][A
chunk:  60%|█████▉    | 542/909 [00:00<00:00, 1780.90it/s, now=None][A
chunk:  80%|███████▉  | 725/909 [00:00<00:00, 1800.09it/s, now=None][A
                                                                    [A

MoviePy - Done.
MoviePy - Writing audio in /content/tidaltales/audio/3498904357774220621.mp3



chunk:   0%|          | 0/334 [00:00<?, ?it/s, now=None][A
chunk:  28%|██▊       | 93/334 [00:00<00:00, 918.62it/s, now=None][A
chunk:  57%|█████▋    | 191/334 [00:00<00:00, 934.75it/s, now=None][A
chunk:  86%|████████▌ | 287/334 [00:00<00:00, 945.64it/s, now=None][A
                                                                   [A

MoviePy - Done.


Processing videos in /content/tidaltales/timesofindia:   0%|          | 0/14 [00:00<?, ?it/s]

Processing videos in /content/tidaltales/ludwighartmann:   0%|          | 0/6 [00:00<?, ?it/s]

Processing videos in /content/tidaltales/wsj:   0%|          | 0/14 [00:00<?, ?it/s]

Processing videos in /content/tidaltales/bbcnews:   0%|          | 0/34 [00:00<?, ?it/s]

Processing videos in /content/tidaltales/sz:   0%|          | 0/8 [00:00<?, ?it/s]

Processing videos in /content/tidaltales/markus.soeder:   0%|          | 0/24 [00:00<?, ?it/s]

MoviePy - Writing audio in /content/tidaltales/audio/3498339312236287047.mp3



chunk:   0%|          | 0/369 [00:00<?, ?it/s, now=None][A
chunk:  27%|██▋       | 101/369 [00:00<00:00, 996.00it/s, now=None][A
chunk:  54%|█████▍    | 201/369 [00:00<00:00, 981.01it/s, now=None][A
chunk:  83%|████████▎ | 306/369 [00:00<00:00, 992.93it/s, now=None][A
                                                                   [A

MoviePy - Done.
MoviePy - Writing audio in /content/tidaltales/audio/3498303470340137697.mp3



chunk:   0%|          | 0/334 [00:00<?, ?it/s, now=None][A
chunk:  28%|██▊       | 93/334 [00:00<00:00, 919.81it/s, now=None][A
chunk:  55%|█████▌    | 185/334 [00:00<00:00, 908.43it/s, now=None][A
chunk:  84%|████████▍ | 280/334 [00:00<00:00, 926.47it/s, now=None][A
                                                                   [A

MoviePy - Done.
MoviePy - Writing audio in /content/tidaltales/audio/3498513851939297300.mp3



chunk:   0%|          | 0/334 [00:00<?, ?it/s, now=None][A
chunk:  49%|████▉     | 163/334 [00:00<00:00, 1627.02it/s, now=None][A
                                                                    [A

MoviePy - Done.
MoviePy - Writing audio in /content/tidaltales/audio/3498302272253733623.mp3



chunk:   0%|          | 0/397 [00:00<?, ?it/s, now=None][A
chunk:  21%|██▏       | 85/397 [00:00<00:00, 827.65it/s, now=None][A
chunk:  45%|████▍     | 177/397 [00:00<00:00, 880.45it/s, now=None][A
chunk:  67%|██████▋   | 266/397 [00:00<00:00, 880.28it/s, now=None][A
chunk:  91%|█████████ | 362/397 [00:00<00:00, 911.18it/s, now=None][A
                                                                   [A

MoviePy - Done.


Processing videos in /content/tidaltales/bayernspd:   0%|          | 0/10 [00:00<?, ?it/s]

MoviePy - Writing audio in /content/tidaltales/audio/3498551344396277774.mp3



chunk:   0%|          | 0/334 [00:00<?, ?it/s, now=None][A
chunk:  39%|███▉      | 130/334 [00:00<00:00, 1272.40it/s, now=None][A
chunk:  77%|███████▋  | 258/334 [00:00<00:00, 1272.84it/s, now=None][A
                                                                    [A

MoviePy - Done.
MoviePy - Writing audio in /content/tidaltales/audio/3498550915973289629.mp3



chunk:   0%|          | 0/334 [00:00<?, ?it/s, now=None][A
chunk:  33%|███▎      | 109/334 [00:00<00:00, 1086.05it/s, now=None][A
chunk:  75%|███████▌  | 252/334 [00:00<00:00, 1287.83it/s, now=None][A
                                                                    [A

MoviePy - Done.


In [53]:
# Add a new column for transcription text in the dataframe
df_stories['transcription_text'] = df_stories.apply(lambda row: transcription_results.get((row['Username'], str(row['ID'])), ''), axis=1)

In [54]:
df_stories.head()

Unnamed: 0,ID,Time of Posting,Type of Content,video_path,image_path,Username,Video Length (s),Expiration,Caption,Is Verified,...,Attribution URL,Story Media,Story Hashtags,Story Questions,Story Sliders,Story CTA,Story Countdowns,Story Locations,ocr_text,transcription_text
0,3498187238895924847,2024-11-10T11:00:34.000Z,Image,,bild/3498187238895924847.jpg,bild,,2024-11-11T11:00:34.000Z,,True,...,,[],[],[],[],[],[],[],Bild Bachelor-Babe will Walentina verkloppen 2...,
1,3498230535983173346,2024-11-10T12:26:34.000Z,Image,,bild/3498230535983173346.jpg,bild,,2024-11-11T12:26:34.000Z,,True,...,,[],[],[],[],[],[],[],FiGhTING Can Der Kampf in voller Länge Blut-Sc...,
2,3498235320048180286,2024-11-10T12:36:02.000Z,Image,,bild/3498235320048180286.jpg,bild,,2024-11-11T12:36:02.000Z,,True,...,,[],[],[],[],[],[],[],Bild CmA Y 1 1 Fame Fighting RMIETUNG ALS WNer...,
3,3498249690941594229,2024-11-10T13:03:55.000Z,Video,bild/3498249690941594229.mp4,bild/3498249690941594229.jpg,bild,41.2,2024-11-11T13:03:55.000Z,,True,...,,"[{""x"":0.44480300381452303,""y"":0.45220631792168...",[],[],[],[],[],[],"@'WJI Y bild , walentinadoroninaofficial und b...","Ich glaube, dass der Gegner ihn immer noch drü..."
4,3498316699821691823,2024-11-10T15:17:46.000Z,Image,,bild/3498316699821691823.jpg,bild,,2024-11-11T15:17:46.000Z,,True,...,,"[{""x"":0.5,""y"":0.5,""width"":0.5,""height"":0.5,""ro...",[],[],[],[],[],[],Fotor Bild Hype immer irrer Dubai-Schokolade a...,


In [55]:
df_stories.to_csv('2024-11-11-Stories-Export.csv')

In [56]:
!zip -r tidaltales.zip tidaltales

updating: tidaltales/ (stored 0%)
updating: tidaltales/bayernspd/ (stored 0%)
updating: tidaltales/ludwighartmann/ (stored 0%)
updating: tidaltales/wsj/ (stored 0%)
updating: tidaltales/bild/ (stored 0%)
updating: tidaltales/sz/ (stored 0%)
updating: tidaltales/spiegelmagazin/ (stored 0%)
updating: tidaltales/timesofindia/ (stored 0%)
updating: tidaltales/fw_bayern/ (stored 0%)
updating: tidaltales/markus.soeder/ (stored 0%)
updating: tidaltales/bbcnews/ (stored 0%)
updating: tidaltales/bayernspd/3498340580813225047.jpg
 (deflated 2%)
updating: tidaltales/bayernspd/3498550915973289629.jpg
 (deflated 1%)
updating: tidaltales/bayernspd/3498551344396277774.jpg
 (deflated 7%)
updating: tidaltales/bayernspd/3498550915973289629.mp4
 (deflated 0%)
updating: tidaltales/bayernspd/3498898649922355318.json
 (deflated 76%)
updating: tidaltales/bayernspd/3498340580813225047.json
 (deflated 77%)
updating: tidaltales/bayernspd/3498550915973289629.json
 (deflated 73%)
updating: tidaltales/bayernspd/34

### Create the Text Master

In [57]:
import pandas as pd

master_df = pd.read_csv('2024-11-11-Text-Master.csv', index_col=0)

In [58]:
master_df.head()

Unnamed: 0,Identifier,Text Type,Text,Image,Post Type
0,DBwPNDuNdAg,Caption,Hallo Heidelberg! Zum ersten Mal zu viert hier...,DBwPNDuNdAg.jpg,Post
1,DCOcihAOOfr,Caption,When the police at the Palm Ridge Magistrate's...,DCOcihAOOfr.jpg,Post
2,DCHWFYTta-b,Caption,Gemeinsam kämpfen wir für soziale Gerechtigkei...,DCHWFYTta-b.jpg,Post
3,DCEHp67sb1U,Caption,Katrin Ebner-Steiner: Die Chaos-Ampel ist zerb...,DCEHp67sb1U.jpg,Post
4,DCBmO3HOcxk,Caption,Die USA hat gewählt und sich für nationalistis...,DCBmO3HOcxk.jpg,Post


In [59]:
len(master_df)

24

Let's remove duplicate Captions

In [60]:
import pandas as pd

# Step 1: Filter the DataFrame to include only rows where 'Text Type' is 'Caption'
caption_df = master_df[master_df['Text Type'] == 'Caption']

# Step 2: Remove duplicates based on the 'Identifier' column within this subset
caption_df_dedup = caption_df.drop_duplicates(subset='Identifier')

# Step 3: Combine this deduplicated subset back with the rest of the original DataFrame
non_caption_df = master_df[master_df['Text Type'] != 'Caption']
master_df = pd.concat([caption_df_dedup, non_caption_df])

# Optionally, sort the resulting DataFrame to maintain original order
master_df = master_df.sort_index()

In [61]:
len(master_df)

24

In [62]:
master_df.head()

Unnamed: 0,Identifier,Text Type,Text,Image,Post Type
0,DBwPNDuNdAg,Caption,Hallo Heidelberg! Zum ersten Mal zu viert hier...,DBwPNDuNdAg.jpg,Post
1,DCOcihAOOfr,Caption,When the police at the Palm Ridge Magistrate's...,DCOcihAOOfr.jpg,Post
2,DCHWFYTta-b,Caption,Gemeinsam kämpfen wir für soziale Gerechtigkei...,DCHWFYTta-b.jpg,Post
3,DCEHp67sb1U,Caption,Katrin Ebner-Steiner: Die Chaos-Ampel ist zerb...,DCEHp67sb1U.jpg,Post
4,DCBmO3HOcxk,Caption,Die USA hat gewählt und sich für nationalistis...,DCBmO3HOcxk.jpg,Post


And now for the Stories

In [63]:
df_stories = pd.read_csv('2024-11-11-Stories-Export.csv')

In [64]:
df_stories.head()

Unnamed: 0.1,Unnamed: 0,ID,Time of Posting,Type of Content,video_path,image_path,Username,Video Length (s),Expiration,Caption,...,Attribution URL,Story Media,Story Hashtags,Story Questions,Story Sliders,Story CTA,Story Countdowns,Story Locations,ocr_text,transcription_text
0,0,3498187238895924847,2024-11-10T11:00:34.000Z,Image,,bild/3498187238895924847.jpg,bild,,2024-11-11T11:00:34.000Z,,...,,[],[],[],[],[],[],[],Bild Bachelor-Babe will Walentina verkloppen 2...,
1,1,3498230535983173346,2024-11-10T12:26:34.000Z,Image,,bild/3498230535983173346.jpg,bild,,2024-11-11T12:26:34.000Z,,...,,[],[],[],[],[],[],[],FiGhTING Can Der Kampf in voller Länge Blut-Sc...,
2,2,3498235320048180286,2024-11-10T12:36:02.000Z,Image,,bild/3498235320048180286.jpg,bild,,2024-11-11T12:36:02.000Z,,...,,[],[],[],[],[],[],[],Bild CmA Y 1 1 Fame Fighting RMIETUNG ALS WNer...,
3,3,3498249690941594229,2024-11-10T13:03:55.000Z,Video,bild/3498249690941594229.mp4,bild/3498249690941594229.jpg,bild,41.2,2024-11-11T13:03:55.000Z,,...,,"[{""x"":0.44480300381452303,""y"":0.45220631792168...",[],[],[],[],[],[],"@'WJI Y bild , walentinadoroninaofficial und b...","Ich glaube, dass der Gegner ihn immer noch drü..."
4,4,3498316699821691823,2024-11-10T15:17:46.000Z,Image,,bild/3498316699821691823.jpg,bild,,2024-11-11T15:17:46.000Z,,...,,"[{""x"":0.5,""y"":0.5,""width"":0.5,""height"":0.5,""ro...",[],[],[],[],[],[],Fotor Bild Hype immer irrer Dubai-Schokolade a...,


In [67]:
import pandas as pd

# Melt the dataframe
df_long = pd.melt(df_stories, id_vars=['ID', 'image_path'],
                  value_vars=['ocr_text', 'transcription_text'],
                  var_name='Text Type',
                  value_name='Text')

# Map the Text Type to more descriptive names
df_long['Text Type'] = df_long['Text Type'].map({
    'ocr_text': 'OCR',
    'transcription_text': 'Transcription'
})

df_long['Image'] = df_long['image_path'].apply(lambda url: url.split('/')[-1])

df_long.rename(columns={'ID': 'Identifier'}, inplace=True)

df_long['Post Type'] = 'Story'

In [68]:
df_long.head()

Unnamed: 0,Identifier,image_path,Text Type,Text,Image,Post Type
0,3498187238895924847,bild/3498187238895924847.jpg,OCR,Bild Bachelor-Babe will Walentina verkloppen 2...,3498187238895924847.jpg,Story
1,3498230535983173346,bild/3498230535983173346.jpg,OCR,FiGhTING Can Der Kampf in voller Länge Blut-Sc...,3498230535983173346.jpg,Story
2,3498235320048180286,bild/3498235320048180286.jpg,OCR,Bild CmA Y 1 1 Fame Fighting RMIETUNG ALS WNer...,3498235320048180286.jpg,Story
3,3498249690941594229,bild/3498249690941594229.jpg,OCR,"@'WJI Y bild , walentinadoroninaofficial und b...",3498249690941594229.jpg,Story
4,3498316699821691823,bild/3498316699821691823.jpg,OCR,Fotor Bild Hype immer irrer Dubai-Schokolade a...,3498316699821691823.jpg,Story


In [69]:
df_long = df_long[df_long['Text'].apply(lambda x: isinstance(x, str) and len(x) > 0)]

In [70]:
df_long.head()

Unnamed: 0,Identifier,image_path,Text Type,Text,Image,Post Type
0,3498187238895924847,bild/3498187238895924847.jpg,OCR,Bild Bachelor-Babe will Walentina verkloppen 2...,3498187238895924847.jpg,Story
1,3498230535983173346,bild/3498230535983173346.jpg,OCR,FiGhTING Can Der Kampf in voller Länge Blut-Sc...,3498230535983173346.jpg,Story
2,3498235320048180286,bild/3498235320048180286.jpg,OCR,Bild CmA Y 1 1 Fame Fighting RMIETUNG ALS WNer...,3498235320048180286.jpg,Story
3,3498249690941594229,bild/3498249690941594229.jpg,OCR,"@'WJI Y bild , walentinadoroninaofficial und b...",3498249690941594229.jpg,Story
4,3498316699821691823,bild/3498316699821691823.jpg,OCR,Fotor Bild Hype immer irrer Dubai-Schokolade a...,3498316699821691823.jpg,Story


In [71]:
master_df = pd.concat([master_df, df_long], ignore_index=True)

In [72]:
len(master_df)

120

In [73]:
master_df['uuid'] = master_df.index

In [74]:
master_df.head()

Unnamed: 0,Identifier,Text Type,Text,Image,Post Type,image_path,uuid
0,DBwPNDuNdAg,Caption,Hallo Heidelberg! Zum ersten Mal zu viert hier...,DBwPNDuNdAg.jpg,Post,,0
1,DCOcihAOOfr,Caption,When the police at the Palm Ridge Magistrate's...,DCOcihAOOfr.jpg,Post,,1
2,DCHWFYTta-b,Caption,Gemeinsam kämpfen wir für soziale Gerechtigkei...,DCHWFYTta-b.jpg,Post,,2
3,DCEHp67sb1U,Caption,Katrin Ebner-Steiner: Die Chaos-Ampel ist zerb...,DCEHp67sb1U.jpg,Post,,3
4,DCBmO3HOcxk,Caption,Die USA hat gewählt und sich für nationalistis...,DCBmO3HOcxk.jpg,Post,,4


In [75]:
master_df.to_csv('2024-11-11-Text-Master.csv', index=False)