<a href="https://colab.research.google.com/github/michaelachmann/social-media-lab/blob/main/notebooks/2023_11_24_OCR.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# OCR for Posts and Stories [![DOI](https://zenodo.org/badge/660157642.svg)](https://zenodo.org/badge/latestdoi/660157642)
![Notes on (Computational) Social Media Research Banner](https://raw.githubusercontent.com/michaelachmann/social-media-lab/main/images/banner.png)

## Overview

This Jupyter notebook is part of the social-media-lab.net project, a work-in-progress textbook on computational social media analysis. It is intended for use in my classes.

The **OCR for Posts and Stories** notebook uses the `easyocr` package to recognize and transcribe text embedded in post images and stories.

### Project Information

- Project Website: [social-media-lab.net](https://social-media-lab.net/)
- GitHub Repository: [https://github.com/michaelachmann/social-media-lab](https://github.com/michaelachmann/social-media-lab)

## License Information

This notebook, along with all other notebooks in the project, is licensed under the following terms:

- License: [GNU General Public License version 3.0 (GPL-3.0)](https://www.gnu.org/licenses/gpl-3.0.de.html)
- License File: [LICENSE.md](https://github.com/michaelachmann/social-media-lab/blob/main/LICENSE.md)


## Citation

If you use or reference this notebook in your work, please cite it appropriately. Here is an example of the citation:

```
Michael Achmann. (2023). michaelachmann/social-media-lab: 2023-11-27 (v0.0.5). Zenodo. https://doi.org/10.5281/zenodo.8199901
```

## 1. Data Import

### From 4CAT

In [None]:
#@markdown Read the exported `csv` file from 4CAT for metadata.

import pandas as pd

four_cat_file_path = "/content/drive/MyDrive/2023-11-24-4CAT-Metadata.csv" #@param {type:"string"}

df = pd.read_csv(four_cat_file_path)

In [None]:
df.head()

Unnamed: 0.1,Unnamed: 0,id,thread_id,parent_id,body,author,author_fullname,author_avatar_url,timestamp,type,...,num_comments,num_media,location_name,location_latlong,location_city,unix_timestamp,video_file,audio_file,duration,sampling_rate
0,0,CzLE8FCoO-2,CzLE8FCoO-2,CzLE8FCoO-2,Wir haben eine klare Haltung: Wir stehen zu Is...,markus.soeder,Markus Söder,https://scontent-fra3-1.cdninstagram.com/v/t51...,2023-11-03 06:01:22,photo,...,167,1,,,,1698991282,,,,
1,1,CzGGK2PIpou,CzGGK2PIpou,CzGGK2PIpou,An Allerseelen und Allerheiligen denke ich bes...,markus.soeder,Markus Söder,https://scontent-fra3-1.cdninstagram.com/v/t51...,2023-11-01 07:35:55,photo,...,289,1,,,,1698824155,,,,
2,2,CzF7RDmpDXl,CzF7RDmpDXl,CzF7RDmpDXl,#Allerheiligen und #Allerseelen: Wir halten in...,markus.soeder,Markus Söder,https://scontent-fra3-1.cdninstagram.com/v/t51...,2023-11-01 06:00:39,photo,...,30,1,,,,1698818439,,,,
3,3,CzEB00zu65J,CzEB00zu65J,CzEB00zu65J,Wir wollen Bayern in eine gute Zukunft führen....,markus.soeder,Markus Söder,https://scontent-fra3-1.cdninstagram.com/v/t51...,2023-10-31 12:19:29,photo,...,30,1,,,,1698754769,,,,
4,4,CzD93SEIi-E,CzD93SEIi-E,CzD93SEIi-E,"Mitzuarbeiten für unser Land, Bayern zu entwic...",markus.soeder,Markus Söder,https://scontent-fra3-1.cdninstagram.com/v/t51...,2023-10-31 12:06:23,video,...,227,1,,,,1698753983,CzD93SEIi-E.mp4,CzD93SEIi-E.mp3,67.89,44100.0


In [None]:
#@title Unzip and Process Images from 4CAT Export

#@markdown This script unzips the specified ZIP file, reads the `.metadata.json` file it contains, and then renames and moves each image file according to that metadata.

import zipfile
import json
import os

#@markdown Enter the Path to the ZIP File
zip_file_path = '/content/drive/MyDrive/2023-11-24-4CAT-Images.zip' #@param {type:"string"}
output_zip_file_path = '/content/drive/MyDrive/2023-11-24-4CAT-Images-Clean.zip' #@param {type:"string"}


#@markdown Enter the Extraction Folder Path
four_cat_folder = "4cat-export/" #@param {type:"string"}

#@markdown Enter the Destination Folder Path for Images
image_folder = "media/images" #@param {type:"string"}

# Extract the full contents of the ZIP file into the extraction folder
with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
    zip_ref.extractall(four_cat_folder)

print(f"Files extracted to {four_cat_folder}")

# Load the metadata file that ships with the export
metadata_file_path = os.path.join(four_cat_folder, '.metadata.json')

with open(metadata_file_path, 'r') as file:
    data = json.load(file)

# Create the destination directory if it does not exist yet
os.makedirs(image_folder, exist_ok=True)

# Process each successfully downloaded item in the metadata
for item in data.values():
    if item.get('success', False):
        # Name the image after the first post ID it belongs to
        post_id = item['post_ids'][0]
        filename = item['filename']
        print(f"Processing Post ID: {post_id}, Filename: {filename}")

        source_path = os.path.join(four_cat_folder, filename)
        destination_path = os.path.join(image_folder, f"{post_id}.jpg")

        # Move and rename the file
        os.rename(source_path, destination_path)

Files extracted to 4cat-export/
Processing Post ID: CzLE8FCoO-2, Filename: 399099252_3485122821751295_8263370465096499787_n.jpg
Processing Post ID: CzGGK2PIpou, Filename: 397070079_1007045093913241_8656729052771444150_n.jpg
Processing Post ID: CzF7RDmpDXl, Filename: 397558733_891808195641036_5096515791342710211_n.jpg
Processing Post ID: CzEB00zu65J, Filename: 397941460_1287150215325443_5323885461182628527_n.jpg
Processing Post ID: CzD93SEIi-E, Filename: 398280880_670591648497089_6560234879451222622_n.jpg
Processing Post ID: CzD8s01ot-7, Filename: 396755440_2427417250772617_1757731681175310218_n.jpg
Processing Post ID: CzDWdN-hU7Y, Filename: 396991696_891203805701475_6074511894218926822_n.jpg
Processing Post ID: CzB-cm4orLY, Filename: 397325327_711690393845940_4774729516376905427_n.jpg
Processing Post ID: CzB7LhPofE8, Filename: 396960658_651191003728578_6818450362956077157_n.jpg
Processing Post ID: CzB3Xqeobx9, Filename: 397963405_1773828079748506_4462604717543654421_n.jpg
...
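
The loop above names each image after the first entry in `post_ids`. As a minimal sketch of the same bookkeeping without touching the file system, the function below turns a metadata dictionary into a mapping from source filename to target path. The helper name `plan_renames` and the sample entries are made up for illustration; they only mirror the keys the cell above relies on (`success`, `post_ids`, `filename`), and a real `.metadata.json` may contain more fields.

```python
import os

def plan_renames(metadata, target_folder="media/images"):
    """Map each successfully downloaded file to '<target_folder>/<post_id>.jpg'."""
    plan = {}
    for item in metadata.values():
        # Skip failed downloads and entries without any post ID
        if not item.get("success", False) or not item.get("post_ids"):
            continue
        post_id = item["post_ids"][0]
        plan[item["filename"]] = os.path.join(target_folder, f"{post_id}.jpg")
    return plan

# Hypothetical sample data (not taken from a real export)
sample_metadata = {
    "item_1": {"success": True, "post_ids": ["POST_A"], "filename": "a_n.jpg"},
    "item_2": {"success": False, "post_ids": ["POST_B"], "filename": "b_n.jpg"},
}

for source, target in plan_renames(sample_metadata).items():
    print(f"{source} -> {target}")
```

Printing the plan before moving anything makes it easier to spot cases where several files would end up under the same target name because they share a first post ID.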

The next cell saves the renamed image files to a new `ZIP` file that follows our `media/images/` convention, so that later tasks and notebooks can reuse it. Change the file name to suit your needs.

In [None]:
!zip -r /content/drive/MyDrive/2023-11-24-4CAT-Images-Clean.zip media

updating: media/ (stored 0%)
updating: media/images/ (stored 0%)
updating: media/images/CzB-cm4orLY.jpg (deflated 0%)
updating: media/images/CzD93SEIi-E.jpg (deflated 1%)
updating: media/images/CzF7RDmpDXl.jpg (deflated 0%)
updating: media/images/CzB7LhPofE8.jpg (deflated 0%)
updating: media/images/CzDWdN-hU7Y.jpg (deflated 2%)
updating: media/images/CzEB00zu65J.jpg (deflated 1%)
updating: media/images/Cy--YrdIp_7.jpg (deflated 0%)
updating: media/images/CzGGK2PIpou.jpg (deflated 1%)
updating: media/images/CzB3Xqeobx9.jpg (deflated 1%)
updating: media/images/CzLE8FCoO-2.jpg (deflated 3%)
updating: media/images/CzBUAchuUG9.jpg (deflated 1%)
updating: media/images/CzD8s01ot-7.jpg (deflated 2%)
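
The same kind of archive can also be written from Python instead of the `zip` shell command. The sketch below uses `shutil.make_archive` from the standard library; the helper name `zip_media_folder` and the folder and archive names in the demo are placeholders created in a temporary directory, not the Drive paths used above.

```python
import os
import shutil
import tempfile
import zipfile

def zip_media_folder(root_dir, archive_base, base_dir="media"):
    """Zip `root_dir/base_dir` into `<archive_base>.zip`, keeping the `media/...` prefix."""
    return shutil.make_archive(archive_base, "zip", root_dir=root_dir, base_dir=base_dir)

# Demo with a throwaway folder structure
workdir = tempfile.mkdtemp()
os.makedirs(os.path.join(workdir, "media", "images"))
with open(os.path.join(workdir, "media", "images", "example.jpg"), "wb") as f:
    f.write(b"not a real image")

archive_path = zip_media_folder(workdir, os.path.join(workdir, "images-clean"))
with zipfile.ZipFile(archive_path) as zf:
    archived_names = zf.namelist()
print(archived_names)
```

Because `base_dir` is `media`, the entries inside the archive keep the `media/images/...` prefix, so unzipping it in another notebook recreates the same folder layout.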


Here we add a new column to the metadata table, referencing the image file.

In [None]:
df['image_file'] = df.apply(lambda row: f"media/images/{row['id']}.jpg", axis=1)

In [None]:
df.head()

Unnamed: 0.1,Unnamed: 0,id,thread_id,parent_id,body,author,author_fullname,author_avatar_url,timestamp,type,...,num_media,location_name,location_latlong,location_city,unix_timestamp,video_file,audio_file,duration,sampling_rate,image_file
0,0,CzLE8FCoO-2,CzLE8FCoO-2,CzLE8FCoO-2,Wir haben eine klare Haltung: Wir stehen zu Is...,markus.soeder,Markus Söder,https://scontent-fra3-1.cdninstagram.com/v/t51...,2023-11-03 06:01:22,photo,...,1,,,,1698991282,,,,,media/images/CzLE8FCoO-2.jpg
1,1,CzGGK2PIpou,CzGGK2PIpou,CzGGK2PIpou,An Allerseelen und Allerheiligen denke ich bes...,markus.soeder,Markus Söder,https://scontent-fra3-1.cdninstagram.com/v/t51...,2023-11-01 07:35:55,photo,...,1,,,,1698824155,,,,,media/images/CzGGK2PIpou.jpg
2,2,CzF7RDmpDXl,CzF7RDmpDXl,CzF7RDmpDXl,#Allerheiligen und #Allerseelen: Wir halten in...,markus.soeder,Markus Söder,https://scontent-fra3-1.cdninstagram.com/v/t51...,2023-11-01 06:00:39,photo,...,1,,,,1698818439,,,,,media/images/CzF7RDmpDXl.jpg
3,3,CzEB00zu65J,CzEB00zu65J,CzEB00zu65J,Wir wollen Bayern in eine gute Zukunft führen....,markus.soeder,Markus Söder,https://scontent-fra3-1.cdninstagram.com/v/t51...,2023-10-31 12:19:29,photo,...,1,,,,1698754769,,,,,media/images/CzEB00zu65J.jpg
4,4,CzD93SEIi-E,CzD93SEIi-E,CzD93SEIi-E,"Mitzuarbeiten für unser Land, Bayern zu entwic...",markus.soeder,Markus Söder,https://scontent-fra3-1.cdninstagram.com/v/t51...,2023-10-31 12:06:23,video,...,1,,,,1698753983,CzD93SEIi-E.mp4,CzD93SEIi-E.mp3,67.89,44100.0,media/images/CzD93SEIi-E.jpg
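
Not every row necessarily has a matching file on disk, for example when a download failed. A minimal sketch for listing the missing files up front is shown below; the helper name `find_missing_files` and the demo paths are illustrative, and in the notebook one would pass `df['image_file']` instead of the demo list.

```python
import os
import tempfile

def find_missing_files(paths):
    """Return the paths that do not point to an existing file."""
    return [path for path in paths if not os.path.isfile(path)]

# Demo: one file that exists and one that does not
demo_dir = tempfile.mkdtemp()
existing = os.path.join(demo_dir, "present.jpg")
with open(existing, "wb") as f:
    f.write(b"placeholder")
missing = os.path.join(demo_dir, "absent.jpg")

print(find_missing_files([existing, missing]))
```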


### Import Stories (Zeeschuimer-F)

The following cells load the metadata and media files from Google Drive. Replace the file names to match yours.

In [None]:
import pandas as pd

df_filepath = '/content/drive/MyDrive/2022-11-09-Stories-Exported.csv'
df = pd.read_csv(df_filepath)

In [None]:
!unzip /content/drive/MyDrive/2023-11-09-Story-Media-Export.zip

Here we add a new column to the metadata table, referencing the image file.

In [None]:
df['image_file'] = df.apply(lambda row: f"media/images/{row['Username']}/{row['ID']}.jpg", axis=1)

In [None]:
df.head()

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,ID,Time of Posting,Type of Content,video_url,image_url,Username,Video Length (s),Expiration,Caption,Is Verified,Stickers,Accessibility Caption,Attribution URL,video_file,audio_file,duration,sampling_rate,image_file
0,0,0,3234500408402516260_1383567706,2023-11-12 15:21:53,Image,,,news24,,2023-11-13 15:21:53,,True,[],"Photo by News24 on November 12, 2023. May be a...",https://www.threads.net/t/CzjB80Zqme0,,,,,media/images/news24/3234500408402516260_138356...
1,1,1,3234502795095897337_8537434,2023-11-12 15:26:39,Image,,,bild,,2023-11-13 15:26:39,,True,[],"Photo by BILD on November 12, 2023. May be an ...",,,,,,media/images/bild/3234502795095897337_8537434.jpg
2,2,2,3234503046678453705_8537434,2023-11-12 15:27:10,Image,,,bild,,2023-11-13 15:27:10,,True,[],"Photo by BILD on November 12, 2023. May be an ...",,,,,,media/images/bild/3234503046678453705_8537434.jpg
3,3,3,3234503930728728807_8537434,2023-11-12 15:28:55,Image,,,bild,,2023-11-13 15:28:55,,True,[],"Photo by BILD on November 12, 2023. May be an ...",,,,,,media/images/bild/3234503930728728807_8537434.jpg
4,4,4,3234504185910204562_8537434,2023-11-12 15:29:25,Image,,,bild,,2023-11-13 15:29:25,,True,[],"Photo by BILD on November 12, 2023. May be an ...",,,,,,media/images/bild/3234504185910204562_8537434.jpg


### Import CrowdTangle Data & Images

In [1]:
import pandas as pd

df = pd.read_csv('/content/drive/MyDrive/2023-11-30-Export-Posts-Crowd-Tangle.csv')

In [2]:
!unzip /content/drive/MyDrive/2023-11-30-LTW23-CrowdTangle-Post-Images.zip

Archive:  /content/drive/MyDrive/2023-11-30-LTW23-CrowdTangle-Post-Images.zip
   creating: media/
   creating: media/images/
   creating: media/images/fw_bayern/
  inflating: media/images/fw_bayern/Cw6wUkzMt8I.jpg  
  inflating: media/images/fw_bayern/Cx7JT0sNaUP.jpg  
  inflating: media/images/fw_bayern/CxvefBFObQk.jpg  
  inflating: media/images/fw_bayern/CxkCuTRsfm6.jpg  
  inflating: media/images/fw_bayern/CyBDII3sCyu.jpg  
  inflating: media/images/fw_bayern/CxGMt9wNzX-.jpg  
  inflating: media/images/fw_bayern/Cw2tpI8Miim.jpg  
  inflating: media/images/fw_bayern/CyFfVqktLxN.jpg  
  inflating: media/images/fw_bayern/CxUhY_6MwkB.jpg  
  inflating: media/images/fw_bayern/Cw_8KBnMs4X.jpg  
  inflating: media/images/fw_bayern/Cx9sYF4OHw3.jpg  
  inflating: media/images/fw_bayern/CxruPuqMI5q.jpg  
  inflating: media/images/fw_bayern/Cxe1tQSM7rl.jpg  
  inflating: media/images/fw_bayern/CxQrVAZNeKj.jpg  
  inflating: media/images/fw_bayern/CyMAe_tufcR.jpg  
...

In [3]:
df['shortcode'] = df['URL'].apply(lambda x: x.split("/")[4])
df['image_file'] = df.apply(lambda row: f"media/images/{row['User Name']}/{row['shortcode']}.jpg", axis=1)

In [4]:
df.head()

Unnamed: 0,Account,User Name,Followers at Posting,Post Created,Post Created Date,Post Created Time,Type,Total Interactions,Likes,Comments,...,Link,Photo,Title,Description,Image Text,Sponsor Id,Sponsor Name,Overperforming Score (weighted — Likes 1x Comments 1x ),shortcode,image_file
0,FREIE WÄHLER Bayern,fw_bayern,9138,2023-10-09 20:10:19 CEST,2023-10-09,20:10:19,Photo,566,561,5,...,https://www.instagram.com/p/CyMAe_tufcR/,https://scontent-sea1-1.cdninstagram.com/v/t51...,,#Landtagswahl23 🤩🧡🙏 #FREIEWÄHLER #Aiwanger #Da...,"FREIE WAHLER 15,8 %",,,2.95,CyMAe_tufcR,media/images/fw_bayern/CyMAe_tufcR.jpg
1,Junge Liberale JuLis Bayern,julisbayern,4902,2023-10-09 19:48:02 CEST,2023-10-09,19:48:02,Album,320,310,10,...,https://www.instagram.com/p/CyL975vouHU/,https://scontent-sea1-1.cdninstagram.com/v/t51...,,Die Landtagswahl war für uns als Liberale hart...,,,,1.41,CyL975vouHU,media/images/julisbayern/CyL975vouHU.jpg
2,Junge Union Deutschlands,junge_union,44414,2023-10-09 19:31:59 CEST,2023-10-09,19:31:59,Photo,929,925,4,...,https://www.instagram.com/p/CyL8GWWJmci/,https://scontent-sea1-1.cdninstagram.com/v/t39...,,Nach einem starken Wahlkampf ein verdientes Er...,HERZLICHEN GLÜCKWUNSCH! Unsere JUler im bayris...,,,1.17,CyL8GWWJmci,media/images/junge_union/CyL8GWWJmci.jpg
3,Katharina Schulze,kathaschulze,37161,2023-10-09 19:29:02 CEST,2023-10-09,19:29:02,Photo,1074,1009,65,...,https://www.instagram.com/p/CyL7wyJtTV5/,https://scontent-sea1-1.cdninstagram.com/v/t51...,,So viele Menschen am Odeonsplatz heute mit ein...,,,,1.61,CyL7wyJtTV5,media/images/kathaschulze/CyL7wyJtTV5.jpg
4,Junge Union Deutschlands,junge_union,44414,2023-10-09 18:01:34 CEST,2023-10-09,18:01:34,Album,1655,1644,11,...,https://www.instagram.com/p/CyLxwHuvR4Y/,https://scontent-sea1-1.cdninstagram.com/v/t39...,,Herzlichen Glückwunsch zu diesem grandiosen Wa...,,,,2.34,CyLxwHuvR4Y,media/images/junge_union/CyLxwHuvR4Y.jpg
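
The `x.split("/")[4]` call above relies on every URL having exactly the form `https://www.instagram.com/p/<shortcode>/`. A slightly more defensive sketch parses the URL path instead and returns `None` when no shortcode is found. The helper name `extract_shortcode` and the accepted path prefixes (`p`, `reel`, `tv`) are assumptions for illustration; the export above only shows `/p/` links.

```python
from urllib.parse import urlparse

def extract_shortcode(url, prefixes=("p", "reel", "tv")):
    """Return the path segment following a known prefix, or None if absent."""
    # Missing values in a dataframe column are not strings
    if not isinstance(url, str):
        return None
    segments = [s for s in urlparse(url).path.split("/") if s]
    # Look for a known prefix that is followed by another segment
    for position, segment in enumerate(segments[:-1]):
        if segment in prefixes:
            return segments[position + 1]
    return None

print(extract_shortcode("https://www.instagram.com/p/CyMAe_tufcR/"))  # CyMAe_tufcR
print(extract_shortcode("https://www.instagram.com/"))                # None
```

In the notebook this could stand in for the lambda, e.g. `df['URL'].apply(extract_shortcode)`, after which rows without a shortcode can be inspected before the image paths are built.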


## 2. OCR

We use [easyocr](https://github.com/JaidedAI/EasyOCR); see its documentation for more complex configurations. On CPU only, this step takes anywhere from minutes to hours, depending on the number of images. OCR can also be outsourced, e.g. to the Google Vision API; see future sessions (and Memespector) for this.

In [5]:
!pip -q install easyocr

In [6]:
# Imports for OCR
import easyocr
reader = easyocr.Reader(['de','en'])



Progress: |██████████████████████████████████████████████████| 100.0% Complete



Progress: |██████████████████████████████████████████████████| 100.0% Complete

We define a very simple function that returns all recognized text as a single string. The `readtext` method returns a list of text areas; in this example we simply concatenate them, so the word order is sometimes incorrect.

We also write the table back to Google Drive to keep our results.

In [10]:
import os
from tqdm import tqdm

# Register `progress_apply` on pandas objects
tqdm.pandas()

def run_ocr(image_path):
    """Return all text recognized in the image as one space-separated string."""
    # Skip rows whose image file is missing
    if not os.path.exists(image_path):
        print(f"File does not exist: {image_path}")
        return ""

    try:
        # detail=0 returns only the recognized strings, without boxes or confidences
        ocr_result = reader.readtext(image_path, detail=0)
        return " ".join(ocr_result)
    except Exception as e:
        print(f"OCR failed for {image_path}: {e}")
        return ""

# Apply the OCR function to each image file in the dataframe
df['ocr_text'] = df['image_file'].progress_apply(run_ocr)

# Note: this is the same path the CrowdTangle export was read from, so that file is overwritten
output_file = '/content/drive/MyDrive/2023-11-30-Export-Posts-Crowd-Tangle.csv'

# Save the results; index=False avoids adding another `Unnamed: 0` column
df.to_csv(output_file, index=False)

 35%|███▍      | 486/1408 [04:11<04:41,  3.27it/s]

File does not exist: media/images/markus.soeder/CxupI6boQfV.jpg


 42%|████▏     | 595/1408 [05:16<09:06,  1.49it/s]

File does not exist: media/images/jubayern/Cxnf4w-oxuJ.jpg
File does not exist: media/images/jubayern/Cxnfy4UoYnK.jpg
File does not exist: media/images/jubayern/CxnfqEuoOLk.jpg


 75%|███████▌  | 1057/1408 [09:07<03:41,  1.59it/s]

File does not exist: media/images/markus.soeder/CxH9_qTIWsn.jpg


 76%|███████▌  | 1072/1408 [09:13<02:14,  2.50it/s]

File does not exist: media/images/spdde/CxGS8WjNhXv.jpg


 77%|███████▋  | 1081/1408 [09:19<03:09,  1.72it/s]

File does not exist: media/images/jubayern/CxFqhfvo0xr.jpg
File does not exist: media/images/jubayern/CxFqgBYo3s9.jpg


100%|██████████| 1408/1408 [12:16<00:00,  1.91it/s]
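
Because the recognized text areas are joined in the order `readtext` returns them, words can end up out of sequence. One possible remedy, sketched here under assumptions, is to request the detailed output (bounding box, text, confidence) and sort the areas top to bottom and left to right before joining. The helper name `join_in_reading_order`, the row tolerance of 20 pixels, and the sample boxes are invented for illustration, and the row grouping is a simple heuristic rather than part of EasyOCR.

```python
def join_in_reading_order(results, row_tolerance=20):
    """Join detailed OCR results top to bottom, then left to right.

    `results` is a list of (bounding_box, text, confidence) entries, where
    bounding_box holds four (x, y) corner points, as in EasyOCR's detailed output.
    """
    def top_left(entry):
        box = entry[0]
        return min(p[1] for p in box), min(p[0] for p in box)  # (y, x)

    # Sort by vertical position first
    ordered = sorted(results, key=top_left)

    rows, current_row, row_y = [], [], None
    for entry in ordered:
        y, _ = top_left(entry)
        # Start a new row when the box sits clearly below the current row
        if row_y is not None and y - row_y > row_tolerance:
            rows.append(current_row)
            current_row = []
            row_y = None
        if row_y is None:
            row_y = y
        current_row.append(entry)
    if current_row:
        rows.append(current_row)

    # Within each row, order the boxes from left to right
    words = []
    for row in rows:
        for entry in sorted(row, key=lambda e: top_left(e)[1]):
            words.append(entry[1])
    return " ".join(words)

# Invented sample in the detailed output format
sample = [
    ([[120, 12], [200, 12], [200, 40], [120, 40]], "WÄHLER", 0.9),
    ([[10, 60], [90, 60], [90, 90], [10, 90]], "15,8 %", 0.8),
    ([[10, 10], [100, 10], [100, 40], [10, 40]], "FREIE", 0.95),
]
print(join_in_reading_order(sample))  # FREIE WÄHLER 15,8 %
```

With a real reader, the input would come from `reader.readtext(image_path)`, which returns the detailed format by default. The tolerance would need tuning to the image resolution, and multi-column layouts will still be read row by row.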


In [11]:
df.head()

Unnamed: 0,Account,User Name,Followers at Posting,Post Created,Post Created Date,Post Created Time,Type,Total Interactions,Likes,Comments,...,Photo,Title,Description,Image Text,Sponsor Id,Sponsor Name,Overperforming Score (weighted — Likes 1x Comments 1x ),shortcode,image_file,ocr_text
0,FREIE WÄHLER Bayern,fw_bayern,9138,2023-10-09 20:10:19 CEST,2023-10-09,20:10:19,Photo,566,561,5,...,https://scontent-sea1-1.cdninstagram.com/v/t51...,,#Landtagswahl23 🤩🧡🙏 #FREIEWÄHLER #Aiwanger #Da...,"FREIE WAHLER 15,8 %",,,2.95,CyMAe_tufcR,media/images/fw_bayern/CyMAe_tufcR.jpg,"FREIE WAHLER 15,8 %"
1,Junge Liberale JuLis Bayern,julisbayern,4902,2023-10-09 19:48:02 CEST,2023-10-09,19:48:02,Album,320,310,10,...,https://scontent-sea1-1.cdninstagram.com/v/t51...,,Die Landtagswahl war für uns als Liberale hart...,,,,1.41,CyL975vouHU,media/images/julisbayern/CyL975vouHU.jpg,Freie EDP Demokraten BDB FDP FB FDP DANKE FÜR ...
2,Junge Union Deutschlands,junge_union,44414,2023-10-09 19:31:59 CEST,2023-10-09,19:31:59,Photo,929,925,4,...,https://scontent-sea1-1.cdninstagram.com/v/t39...,,Nach einem starken Wahlkampf ein verdientes Er...,HERZLICHEN GLÜCKWUNSCH! Unsere JUler im bayris...,,,1.17,CyL8GWWJmci,media/images/junge_union/CyL8GWWJmci.jpg,HERZLICHEN GLÜCKWUNSCH! Unsere JUler im bayris...
3,Katharina Schulze,kathaschulze,37161,2023-10-09 19:29:02 CEST,2023-10-09,19:29:02,Photo,1074,1009,65,...,https://scontent-sea1-1.cdninstagram.com/v/t51...,,So viele Menschen am Odeonsplatz heute mit ein...,,,,1.61,CyL7wyJtTV5,media/images/kathaschulze/CyL7wyJtTV5.jpg,Juo I W
4,Junge Union Deutschlands,junge_union,44414,2023-10-09 18:01:34 CEST,2023-10-09,18:01:34,Album,1655,1644,11,...,https://scontent-sea1-1.cdninstagram.com/v/t39...,,Herzlichen Glückwunsch zu diesem grandiosen Wa...,,,,2.34,CyLxwHuvR4Y,media/images/junge_union/CyLxwHuvR4Y.jpg,12/12 der hessischen JU-Kandidaten ziehen in d...
