<a href="https://colab.research.google.com/github/michaelachmann/social-media-lab/blob/main/notebooks/2023_11_24_OCR.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# OCR for Posts and Stories [![DOI](https://zenodo.org/badge/660157642.svg)](https://zenodo.org/badge/latestdoi/660157642)
![Notes on (Computational) Social Media Research Banner](https://raw.githubusercontent.com/michaelachmann/social-media-lab/main/images/banner.png)

## Overview

This Jupyter notebook is part of the social-media-lab.net project, a work-in-progress textbook on computational social media analysis. The notebook is intended for use in my classes.

The **OCR for Posts and Stories** Notebook uses the `easyocr` package to recognize and transcribe text embedded in images and stories.

### Project Information

- Project Website: [social-media-lab.net](https://social-media-lab.net/)
- GitHub Repository: [https://github.com/michaelachmann/social-media-lab](https://github.com/michaelachmann/social-media-lab)

## License Information

This notebook, along with all other notebooks in the project, is licensed under the following terms:

- License: [GNU General Public License version 3.0 (GPL-3.0)](https://www.gnu.org/licenses/gpl-3.0.de.html)
- License File: [LICENSE.md](https://github.com/michaelachmann/social-media-lab/blob/main/LICENSE.md)


## Citation

If you use or reference this notebook in your work, please cite it appropriately. Here is an example citation:

```
Michael Achmann. (2023). michaelachmann/social-media-lab: 2023-11-27 (v0.0.5). Zenodo. https://doi.org/10.5281/zenodo.8199901
```

## 1. Data Import

### From 4CAT

In [None]:
#@markdown Read the exported `csv` file from 4CAT for metadata.

import pandas as pd

four_cat_file_path = "/content/drive/MyDrive/2023-11-24-4CAT-Metadata.csv" #@param {type:"string"}

df = pd.read_csv(four_cat_file_path)

In [None]:
df.head()

Unnamed: 0,id,thread_id,parent_id,body,author,author_fullname,author_avatar_url,timestamp,type,url,image_url,media_url,hashtags,num_likes,num_comments,num_media,location_name,location_latlong,location_city,unix_timestamp
0,CzLE8FCoO-2,CzLE8FCoO-2,CzLE8FCoO-2,Wir haben eine klare Haltung: Wir stehen zu Is...,markus.soeder,Markus Söder,https://scontent-fra3-1.cdninstagram.com/v/t51...,2023-11-03 06:01:22,photo,https://www.instagram.com/p/CzLE8FCoO-2,https://scontent-fra3-1.cdninstagram.com/v/t51...,https://scontent-fra3-1.cdninstagram.com/v/t51...,,1538,167,1,,,,1698991282
1,CzGGK2PIpou,CzGGK2PIpou,CzGGK2PIpou,An Allerseelen und Allerheiligen denke ich bes...,markus.soeder,Markus Söder,https://scontent-fra3-1.cdninstagram.com/v/t51...,2023-11-01 07:35:55,photo,https://www.instagram.com/p/CzGGK2PIpou,https://scontent-fra3-1.cdninstagram.com/v/t51...,https://scontent-fra3-1.cdninstagram.com/v/t51...,"allerheiligen,allerseelen,familie,erinnerung",14364,289,1,,,,1698824155
2,CzF7RDmpDXl,CzF7RDmpDXl,CzF7RDmpDXl,#Allerheiligen und #Allerseelen: Wir halten in...,markus.soeder,Markus Söder,https://scontent-fra3-1.cdninstagram.com/v/t51...,2023-11-01 06:00:39,photo,https://www.instagram.com/p/CzF7RDmpDXl,https://scontent-fra5-1.cdninstagram.com/v/t39...,https://scontent-fra5-1.cdninstagram.com/v/t39...,"Allerheiligen,Allerseelen",1732,30,1,,,,1698818439
3,CzEB00zu65J,CzEB00zu65J,CzEB00zu65J,Wir wollen Bayern in eine gute Zukunft führen....,markus.soeder,Markus Söder,https://scontent-fra3-1.cdninstagram.com/v/t51...,2023-10-31 12:19:29,photo,https://www.instagram.com/p/CzEB00zu65J,https://scontent-fra3-2.cdninstagram.com/v/t51...,https://scontent-fra3-2.cdninstagram.com/v/t51...,"demokratie,landtag,zusammenhalt,modernität,sta...",1415,30,1,,,,1698754769
4,CzD93SEIi-E,CzD93SEIi-E,CzD93SEIi-E,"Mitzuarbeiten für unser Land, Bayern zu entwic...",markus.soeder,Markus Söder,https://scontent-fra3-1.cdninstagram.com/v/t51...,2023-10-31 12:06:23,video,https://www.instagram.com/p/CzD93SEIi-E,https://scontent-fra3-1.cdninstagram.com/v/t51...,https://scontent-fra3-2.cdninstagram.com/o1/v/...,"bayern,landtag",7081,227,1,,,,1698753983


In [19]:
#@title Unzip and Process Images from 4CAT Export

#@markdown This script unzips the specified ZIP file, reads the `.metadata.json` file, and then renames and moves the image files according to the metadata.

import zipfile
import json
import os

#@markdown Enter the Path to the ZIP File
zip_file_path = '/content/drive/MyDrive/2023-11-24-4CAT-Images.zip' #@param {type:"string"}
output_zip_file_path = '/content/drive/MyDrive/2023-11-24-4CAT-Images-Clean.zip' #@param {type:"string"}


#@markdown Enter the Extraction Folder Path
four_cat_folder = "4cat-export/" #@param {type:"string"}

#@markdown Enter the Destination Folder Path for Images
image_path = "media/images" #@param {type:"string"}

# Open the ZIP file
with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
    # Extract all the contents into the specified folder
    zip_ref.extractall(four_cat_folder)

print(f"Files extracted to {four_cat_folder}")

# Specify the path to the metadata JSON file
metadata_file_path = os.path.join(four_cat_folder, '.metadata.json')

# Open the metadata file and load its content
with open(metadata_file_path, 'r') as file:
    data = json.load(file)

# Create the destination directory for images if it does not exist
os.makedirs(image_path, exist_ok=True)

# Process each item in the metadata
for item in data.values():
    if item.get('success', False):
        post_id = item['post_ids'][0]
        filename = item['filename']
        print(f"Processing Post ID: {post_id}, Filename: {filename}")

        # Full path to the source file
        source_path = os.path.join(four_cat_folder, filename)

        # Full path to the destination file
        destination_path = os.path.join(image_path, f"{post_id}.jpg")

        # Move and rename the file
        os.rename(source_path, destination_path)

Files extracted to 4cat-export/
Processing Post ID: CzLE8FCoO-2, Filename: 399099252_3485122821751295_8263370465096499787_n.jpg
Processing Post ID: CzGGK2PIpou, Filename: 397070079_1007045093913241_8656729052771444150_n.jpg
Processing Post ID: CzF7RDmpDXl, Filename: 397558733_891808195641036_5096515791342710211_n.jpg
Processing Post ID: CzEB00zu65J, Filename: 397941460_1287150215325443_5323885461182628527_n.jpg
Processing Post ID: CzD93SEIi-E, Filename: 398280880_670591648497089_6560234879451222622_n.jpg
Processing Post ID: CzD8s01ot-7, Filename: 396755440_2427417250772617_1757731681175310218_n.jpg
Processing Post ID: CzDWdN-hU7Y, Filename: 396991696_891203805701475_6074511894218926822_n.jpg
Processing Post ID: CzB-cm4orLY, Filename: 397325327_711690393845940_4774729516376905427_n.jpg
Processing Post ID: CzB7LhPofE8, Filename: 396960658_651191003728578_6818450362956077157_n.jpg
Processing Post ID: CzB3Xqeobx9, Filename: 397963405_1773828079748506_4462604717543654421_n.jpg
Processing Po
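
Because the cell above moves files, running it a second time fails once the sources are gone. A dry run that only lists the planned renames can help to inspect the metadata first. The following is a minimal sketch, assuming each metadata entry has the `success`, `post_ids` and `filename` keys used above; the helper name `plan_renames` and the example entries are made up for illustration.

```python
import os

def plan_renames(metadata, source_dir, target_dir, extension=".jpg"):
    """Return (source, destination) path pairs for all successful downloads.

    Assumes entries shaped like the ones used in the cell above:
    {"success": bool, "post_ids": [...], "filename": "..."}.
    """
    pairs = []
    for item in metadata.values():
        if not item.get("success", False):
            continue  # skip failed downloads
        post_ids = item.get("post_ids") or []
        if not post_ids:
            continue  # no post ID to name the file after
        source = os.path.join(source_dir, item["filename"])
        destination = os.path.join(target_dir, f"{post_ids[0]}{extension}")
        pairs.append((source, destination))
    return pairs


# Made-up example entries (not real 4CAT output)
example = {
    "a": {"success": True, "post_ids": ["POST1"], "filename": "111_n.jpg"},
    "b": {"success": False, "post_ids": ["POST2"], "filename": "222_n.jpg"},
}
for source, destination in plan_renames(example, "4cat-export", "media/images"):
    print(source, "->", destination)
```

In the notebook, `plan_renames(data, four_cat_folder, "media/images")` would list the planned moves without touching any file.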

With the next line we save the extracted image files to a new `ZIP` file following our `media/images/` convention. This will be useful for future tasks and notebooks. Rename the file according to your needs.

In [22]:
!zip -r /content/drive/MyDrive/2023-11-24-4CAT-Images-Clean.zip media

  adding: media/ (stored 0%)
  adding: media/images/ (stored 0%)
  adding: media/images/CzB-cm4orLY.jpg (deflated 0%)
  adding: media/images/CzD93SEIi-E.jpg (deflated 1%)
  adding: media/images/CzF7RDmpDXl.jpg (deflated 0%)
  adding: media/images/CzB7LhPofE8.jpg (deflated 0%)
  adding: media/images/CzDWdN-hU7Y.jpg (deflated 2%)
  adding: media/images/CzEB00zu65J.jpg (deflated 1%)
  adding: media/images/Cy--YrdIp_7.jpg (deflated 0%)
  adding: media/images/CzGGK2PIpou.jpg (deflated 1%)
  adding: media/images/CzB3Xqeobx9.jpg (deflated 1%)
  adding: media/images/CzLE8FCoO-2.jpg (deflated 3%)
  adding: media/images/CzBUAchuUG9.jpg (deflated 1%)
  adding: media/images/CzD8s01ot-7.jpg (deflated 2%)


Here we add a new column to the metadata table, referencing the image file.

In [25]:
# The 4CAT export stores the post ID in the lowercase `id` column
df['image_file'] = df.apply(lambda row: f"media/images/{row['id']}.jpg", axis=1)

In [26]:
df.head()



### Import Stories (Zeeschuimer-F)

The following cells load the metadata and media files from Google Drive. Replace the file names to match yours.

In [1]:
import pandas as pd

df_filepath = '/content/drive/MyDrive/2022-11-09-Stories-Exported.csv'
df = pd.read_csv(df_filepath)

In [None]:
!unzip /content/drive/MyDrive/2023-11-09-Story-Media-Export.zip

Here we add a new column to the metadata table, referencing the image file.

In [4]:
df['image_file'] = df.apply(lambda row: f"media/images/{row['Username']}/{row['ID']}.jpg", axis=1)

In [5]:
df.head()

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,ID,Time of Posting,Type of Content,video_url,image_url,Username,Video Length (s),Expiration,Caption,Is Verified,Stickers,Accessibility Caption,Attribution URL,video_file,audio_file,duration,sampling_rate,image_file
0,0,0,3234500408402516260_1383567706,2023-11-12 15:21:53,Image,,,news24,,2023-11-13 15:21:53,,True,[],"Photo by News24 on November 12, 2023. May be a...",https://www.threads.net/t/CzjB80Zqme0,,,,,media/images/news24/3234500408402516260_138356...
1,1,1,3234502795095897337_8537434,2023-11-12 15:26:39,Image,,,bild,,2023-11-13 15:26:39,,True,[],"Photo by BILD on November 12, 2023. May be an ...",,,,,,media/images/bild/3234502795095897337_8537434.jpg
2,2,2,3234503046678453705_8537434,2023-11-12 15:27:10,Image,,,bild,,2023-11-13 15:27:10,,True,[],"Photo by BILD on November 12, 2023. May be an ...",,,,,,media/images/bild/3234503046678453705_8537434.jpg
3,3,3,3234503930728728807_8537434,2023-11-12 15:28:55,Image,,,bild,,2023-11-13 15:28:55,,True,[],"Photo by BILD on November 12, 2023. May be an ...",,,,,,media/images/bild/3234503930728728807_8537434.jpg
4,4,4,3234504185910204562_8537434,2023-11-12 15:29:25,Image,,,bild,,2023-11-13 15:29:25,,True,[],"Photo by BILD on November 12, 2023. May be an ...",,,,,,media/images/bild/3234504185910204562_8537434.jpg
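
Before running OCR it can be worth checking that every path in `image_file` points to an existing file, since some rows may have no image on disk (an assumption, for instance for video stories). A minimal sketch with a hypothetical helper `find_missing_files`:

```python
import os

def find_missing_files(paths):
    """Return the paths that do not point to an existing file."""
    return [path for path in paths if not os.path.isfile(path)]

# Usage with the metadata table from above:
# missing = find_missing_files(df["image_file"])
# print(f"{len(missing)} of {len(df)} image files are missing")
```

Rows with missing files could then be filtered out before the OCR step.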


## 2. OCR

We're using [easyocr](https://github.com/JaidedAI/EasyOCR). See the documentation for more complex configurations. On CPU only, this process takes from minutes to hours, depending on the number of images. OCR may also be outsourced (e.g. using the Google Vision API); see future sessions (and Memespector) for this.

In [6]:
!pip -q install easyocr


In [7]:
# Imports for OCR
import easyocr
reader = easyocr.Reader(['de','en'])



Progress: |██████████████████████████████████████████████████| 100.0% Complete



Progress: |██████████████████████████████████████████████████| 100.0% Complete

We define a very simple function that returns a single string with all text recognized in an image. The `readtext` method returns a list of text areas; in this example we simply concatenate them, so the order of words is sometimes incorrect.

We also save the table to Google Drive to keep our results.

In [8]:
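def run_ocr(image_path):
    # detail=0 returns only the recognized strings, without boxes or confidence scores
    ocr_result = reader.readtext(image_path, detail=0)
    # Join all text areas into a single string
    ocr_text = " ".join(ocr_result)
    return ocr_text

df['ocr_text'] = df['image_file'].apply(run_ocr)

# Saving Results to Drive (index=False avoids adding another "Unnamed: 0" column on every save)
df.to_csv('/content/drive/MyDrive/2022-11-09-Stories-Exported.csv', index=False)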
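
If the word order matters, one option is to use the bounding boxes that `readtext` returns by default (`detail=1`) and sort the text areas before joining them. The following is a heuristic sketch, not part of easyocr: the function name `ocr_reading_order` and the `line_tolerance` parameter are made up, and the example boxes are invented.

```python
def ocr_reading_order(results, line_tolerance=0.5):
    """Join OCR boxes top-to-bottom, left-to-right.

    `results` is a list of (bbox, text, confidence) entries, where bbox is a
    list of four [x, y] corner points, as returned by readtext with detail=1.
    `line_tolerance` is a hypothetical tuning knob: two boxes count as the same
    line if their vertical centres differ by less than this fraction of the
    height of the first box in that line.
    """
    boxes = []
    for bbox, text, _confidence in results:
        xs = [point[0] for point in bbox]
        ys = [point[1] for point in bbox]
        boxes.append({
            "left": min(xs),
            "center_y": (min(ys) + max(ys)) / 2,
            "height": max(ys) - min(ys),
            "text": text,
        })

    # Visit boxes from top to bottom and group them into lines
    boxes.sort(key=lambda box: box["center_y"])
    lines = []
    for box in boxes:
        if lines:
            reference = lines[-1][0]
            if abs(box["center_y"] - reference["center_y"]) < line_tolerance * reference["height"]:
                lines[-1].append(box)
                continue
        lines.append([box])

    # Within each line, read from left to right
    words = []
    for line in lines:
        for box in sorted(line, key=lambda box: box["left"]):
            words.append(box["text"])
    return " ".join(words)


# Made-up boxes: "world" is listed before "Hello", "again" sits on a second line
example = [
    ([[60, 10], [110, 10], [110, 30], [60, 30]], "world", 0.9),
    ([[0, 40], [50, 40], [50, 60], [0, 60]], "again", 0.8),
    ([[0, 12], [50, 12], [50, 32], [0, 32]], "Hello", 0.95),
]
print(ocr_reading_order(example))  # Hello world again
```

Inside `run_ocr`, this would replace the plain join: `ocr_reading_order(reader.readtext(image_path))`. Layouts with several columns or rotated text will likely need a different strategy.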

In [27]:
df.head()

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,ID,Time of Posting,Type of Content,video_url,image_url,Username,Video Length (s),Expiration,...,Is Verified,Stickers,Accessibility Caption,Attribution URL,video_file,audio_file,duration,sampling_rate,image_file,ocr_text
0,0,0,3234500408402516260_1383567706,2023-11-12 15:21:53,Image,,,news24,,2023-11-13 15:21:53,...,True,[],"Photo by News24 on November 12, 2023. May be a...",https://www.threads.net/t/CzjB80Zqme0,,,,,media/images/3234500408402516260_1383567706.jpg,Keee WEEKEND NEWS24 PLUS: TESTING FORDS RANGER...
1,1,1,3234502795095897337_8537434,2023-11-12 15:26:39,Image,,,bild,,2023-11-13 15:26:39,...,True,[],"Photo by BILD on November 12, 2023. May be an ...",,,,,,media/images/3234502795095897337_8537434.jpg,Dieses Auto ist einfach der Horror Du glaubst ...
2,2,2,3234503046678453705_8537434,2023-11-12 15:27:10,Image,,,bild,,2023-11-13 15:27:10,...,True,[],"Photo by BILD on November 12, 2023. May be an ...",,,,,,media/images/3234503046678453705_8537434.jpg,Touchdown bei Taylor Swift und Travis Kelce De...
3,3,3,3234503930728728807_8537434,2023-11-12 15:28:55,Image,,,bild,,2023-11-13 15:28:55,...,True,[],"Photo by BILD on November 12, 2023. May be an ...",,,,,,media/images/3234503930728728807_8537434.jpg,Horror-Diagnose für Barton Cowperthwaite Netfl...
4,4,4,3234504185910204562_8537434,2023-11-12 15:29:25,Image,,,bild,,2023-11-13 15:29:25,...,True,[],"Photo by BILD on November 12, 2023. May be an ...",,,,,,media/images/3234504185910204562_8537434.jpg,3v Bilde GG JJ Besorgniserregende Ufo-Aktivitä...
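
Since OCR on CPU can take from minutes to hours, an interrupted Colab session may lose a lot of work. One way to make the step resumable is to cache each result as soon as it is available. This is a minimal sketch under assumptions: the helper `ocr_with_cache` and the cache file are hypothetical, and the OCR function is passed in as a parameter so that `run_ocr` from above can be used unchanged.

```python
import json
import os

def ocr_with_cache(image_paths, ocr_function, cache_path):
    """Run `ocr_function` on each path, skipping paths already in the cache.

    The cache is a JSON file mapping image path -> recognized text. It is
    rewritten after every image, so an interrupted run loses at most one result.
    """
    cache = {}
    if os.path.exists(cache_path):
        with open(cache_path, "r", encoding="utf-8") as file:
            cache = json.load(file)

    for path in image_paths:
        if path in cache:
            continue  # already processed in an earlier run
        cache[path] = ocr_function(path)
        with open(cache_path, "w", encoding="utf-8") as file:
            json.dump(cache, file, ensure_ascii=False)
    return cache

# Usage in the notebook (the cache file name is only an example):
# texts = ocr_with_cache(df["image_file"], run_ocr, "/content/drive/MyDrive/ocr-cache.json")
# df["ocr_text"] = df["image_file"].map(texts)
```

Rewriting the whole file after each image is simple but slow for very large collections; writing every few images would be a reasonable trade-off.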
