<a href="https://colab.research.google.com/github/michaelachmann/ig-tutorial/blob/main/03_AutomatedExportLabelStudio" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Automated Export to LabelStudio [![DOI](https://zenodo.org/badge/573008138.svg)](https://zenodo.org/badge/latestdoi/573008138)

![A robot conducting visual social media analysis](https://user-images.githubusercontent.com/8556092/229534426-846f4b4d-61b6-499b-8465-65e93b278c35.png)

## Overview

This Jupyter notebook is a part of a tutorial series on computational social media analysis. The notebook is intended for use with my [Medium articles](https://achmann.dev/).

The **Automated Export to LabelStudio** Notebook helps to programmatically create a LabelStudio Project, and to add annotation tasks and a simple interface.

### Project Information

- Project Website: [achmann.dev](https://achmann.dev/)
- GitHub Repository: [https://github.com/michaelachmann/ig-tutorial](https://github.com/michaelachmann/ig-tutorial)

## License Information

This notebook, along with all other notebooks in the project, is licensed under the following terms:

- License: [GNU General Public License version 3.0 (GPL-3.0)](https://www.gnu.org/licenses/gpl-3.0.de.html)
- License File: [LICENSE.md](https://github.com/michaelachmann/ig-tutorial/blob/main/LICENSE.md)


## Citation

If you use or reference this notebook in your work, please cite it appropriately. Here is an example of the citation:

```
Michael Achmann. (2023). michaelachmann/ig-tutorial: First Release (v1.0.0). Zenodo. https://doi.org/10.5281/zenodo.8199595

```

In [3]:
!pip -q install gcloud label-studio-sdk

  Preparing metadata (setup.py) ... done
  Building wheel for gcloud (setup.py) ... done


In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [4]:
#@title ## Gcloud Setup
#@markdown Please specify the path to the credentials file that is used to upload images to the Google Cloud Storage bucket.

import json
from gcloud import storage
from oauth2client.service_account import ServiceAccountCredentials

gcloud_credentials_path = 'public-ws2223.json' #@param {type: "string"}
gcloud_bucket = 'ws2223-labelstudio' #@param {type: "string"}

with open(gcloud_credentials_path, 'rb') as f:
  credentials_dict = json.loads(f.read())

credentials = ServiceAccountCredentials.from_json_keyfile_dict(
    credentials_dict
)
client = storage.Client(credentials=credentials, project='local-grove-153811')
bucket = client.get_bucket(gcloud_bucket)

In [16]:
#@title ## LabelStudio Setup
#@markdown Please specify the URL and API key for your LabelStudio instance.

labelstudio_key = "YOUR-API-KEY" #@param {type: "string"}
labelstudio_url = "https://app.heartex.com/" #@param {type: "string"}


# Image Upload to Google Cloud Bucket

Please choose the right cells based on whether you have collected your data using `instaloader` according to my Tutorial or using CrowdTangle.

## Instaloader
We're assuming that you have transformed the Instaloader files using the [DataCollection Notebook](https://github.com/michaelachmann/ig-tutorial/blob/main/01_DataCollection.ipynb). Image files are located at `working_dir/username/`
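
As a minimal sketch of the naming scheme used by the upload cell further down: each local file `image_dir/username/filename.jpg` is uploaded as the blob `username-id.jpg`, which LabelStudio later reads through the URI `gs://bucket/username-id.jpg`. The helpers `plan_upload` and `missing_sources` are hypothetical and not part of the notebook, and the user name, IDs, file names and bucket name are made up. The sketch only uses the standard library and makes no network calls, so it can be used to spot missing local files before the upload starts.

```python
import os
import tempfile

def plan_upload(row, image_dir, bucket_name):
    """Return (local source path, blob name, gs:// URI) for one post.

    `row` is any mapping with 'username', 'id' and 'filename' keys,
    mirroring the columns the upload cell reads from the CSV.
    """
    blob_name = "{}-{}.jpg".format(row["username"], row["id"])
    source = os.path.join(image_dir, row["username"], "{}.jpg".format(row["filename"]))
    uri = "gs://{}/{}".format(bucket_name, blob_name)
    return source, blob_name, uri

def missing_sources(rows, image_dir, bucket_name):
    """List the local source paths that do not exist on disk."""
    return [
        source
        for source, _, _ in (plan_upload(r, image_dir, bucket_name) for r in rows)
        if not os.path.isfile(source)
    ]

# Example with made-up values (hypothetical user, IDs and bucket).
rows = [
    {"username": "alice", "id": "111", "filename": "2023-01-01_12-00-00_UTC"},
    {"username": "alice", "id": "222", "filename": "2023-01-02_12-00-00_UTC"},
]
with tempfile.TemporaryDirectory() as image_dir:
    os.makedirs(os.path.join(image_dir, "alice"))
    # Create only the first image so that the second is reported as missing.
    open(os.path.join(image_dir, "alice", "2023-01-01_12-00-00_UTC.jpg"), "wb").close()
    missing = missing_sources(rows, image_dir, "my-bucket")
    print(plan_upload(rows[0], image_dir, "my-bucket")[1:])
    print(len(missing), "file(s) missing")
```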

In [8]:
#@title ## Optional: Unzip from GDrive
#@markdown Upload your data as a `zip` file to Google Drive and provide the path. Run the cell to unzip it in this VM.

import shutil

instaloader_zip = '/content/drive/MyDrive/ESNCOM-Gruppe.zip' #@param {type:"string"}
working_dir = '/content/working_dir/' #@param {type: "string"}

shutil.unpack_archive(instaloader_zip, working_dir)

In [9]:
#@title ## Data Setup
#@markdown Please upload the ZIP file with all media / image files and provide the path to your CSV-File created with the DataCollection notebook.

from google.colab import files
import shutil
import os
import pandas as pd

csv_file = '/content/drive/MyDrive/ESNCOM-Gruppeinstaloader_output.csv' #@param {type: "string"}
df = pd.read_csv(csv_file)

In [None]:
#@title ## Image Upload
#@markdown Next we will upload all images to the cloud storage. Please provide the path to your image files (check on the left-hand side). This takes some time.

image_dir = "/content/working_dir/ESNCOM-Gruppe"  #@param {type: "string"}
# Upload images to Bucket
df["image"] = df.apply(lambda x: "gs://{}/{}-{}.jpg".format(gcloud_bucket, x['username'], x['id']), axis=1)

uploaded = 0  # images uploaded in this run
skipped = 0   # images that already exist in the bucket
for index, row in df.iterrows():
  filename = "{}-{}.jpg".format(row['username'], row['id'])
  source_filename = "{}/{}/{}.jpg".format(image_dir,row['username'], row['filename'])
  blob = bucket.blob(filename)

  if blob.exists(client):
    skipped += 1
    continue

  try:
    blob.upload_from_filename(source_filename)
    uploaded += 1
  except Exception as e:
    # Most often the local file is missing; report the error and continue.
    print("Uploading {} failed: {}".format(source_filename, e))

print("Uploaded {} images successfully, skipped {} existing files.".format(uploaded, skipped))

## CrowdTangle
The next cells assume that you exported a DataFrame from CrowdTangle and used my [CrowdTangleDownload Notebook](https://github.com/michaelachmann/ig-tutorial/blob/main/02_CrowdTangleDownload.ipynb) to download the images.
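
The upload cell below takes each post's ID from the second-to-last `/`-separated segment of the `URL` column, which only gives the right result when the URL ends with a trailing slash. As a hedged sketch, the hypothetical helper `post_id_from_url` extracts the last path segment whether or not that slash is present. The example URLs are made up and assume the pattern `https://www.instagram.com/p/<shortcode>/`; check a few rows of your own export before relying on it.

```python
from urllib.parse import urlparse

def post_id_from_url(url):
    """Return the last non-empty path segment of a post URL."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    if not segments:
        raise ValueError("No path segment found in URL: {!r}".format(url))
    return segments[-1]

# Made-up example URLs following the assumed pattern.
with_slash = "https://www.instagram.com/p/AbC123xyz/"
without_slash = "https://www.instagram.com/p/AbC123xyz"

print(post_id_from_url(with_slash))     # AbC123xyz
print(post_id_from_url(without_slash))  # AbC123xyz
print(with_slash.split("/")[-2])        # AbC123xyz (the notebook's approach)
print(without_slash.split("/")[-2])     # p (wrong segment without the slash)
```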

In [None]:
#@title ## Optional: Unzip from GDrive
#@markdown Upload your data as a `zip` file to Google Drive and provide the path. Run the cell to unzip it in this VM.

import shutil

crowdtangle_zip = '/content/drive/MyDrive/bodypositivity.zip' #@param {type:"string"}
working_dir = '/content/working_dir/' #@param {type: "string"}

shutil.unpack_archive(crowdtangle_zip, working_dir)

In [None]:
#@title ## Upload images to Cloud

import pandas as pd

crowdtangle_csv_file = '/content/working_dir/bodypositivity/bodypositivity.csv' #@param {type:"string"}
image_dir = '/content/working_dir/bodypositivity/images' #@param {type: "string"}

df = pd.read_csv(crowdtangle_csv_file)

df['ocr_text'] = df['Image Text']
df['username'] = df['User Name']
df['id'] = df['URL'].apply(lambda x: x.split("/")[-2])
df["image"] = df.apply(lambda x: "gs://{}/{}-{}.jpg".format(gcloud_bucket, x['username'], x['id']), axis=1)

uploaded = 0  # images uploaded in this run
skipped = 0   # images that already exist in the bucket
for index, row in df.iterrows():
  filename = "{}-{}.jpg".format(row['username'], row['id'])
  source_filename = "{}/{}/{}.jpg".format(image_dir,row['username'], row['id'])
  blob = bucket.blob(filename)
  if blob.exists(client):
    skipped += 1
    continue

  try:
    blob.upload_from_filename(source_filename)
    uploaded += 1
  except Exception as e:
    # Most often the local file is missing; report the error and continue.
    print("Uploading {} failed: {}".format(source_filename, e))

print("Uploaded {} images successfully, skipped {} existing files.".format(uploaded, skipped))

## Create LabelStudio Interface
Before creating the LabelStudio project you will need to define your labelling interface. Once the project is set up you will only be able to edit the interface in LabelStudio.
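
Because the interface is assembled from string fragments across several cells, a typo or a label containing `&` or `"` can produce an invalid configuration. The following sketch, which is not part of the notebook, escapes attribute values with `xml.sax.saxutils.quoteattr` and checks that the finished string is well-formed XML before it is sent to LabelStudio. The helper names `choices_block` and `check_interface` are hypothetical, and well-formedness alone does not guarantee that LabelStudio accepts the configuration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

def choices_block(name, values, choice="single"):
    """Build a <Choices> fragment with safely quoted attribute values."""
    parts = ["<Header value={} />".format(quoteattr(name))]
    parts.append('<Choices name={} choice={} toName="image">'.format(
        quoteattr(name), quoteattr(choice)))
    for value in values.split(","):
        parts.append("<Choice value={} />".format(quoteattr(value.strip())))
    parts.append("</Choices>")
    return "".join(parts)

def check_interface(config):
    """Parse the config and return the tag names it contains.

    Raises xml.etree.ElementTree.ParseError if the XML is not well-formed.
    """
    root = ET.fromstring(config.strip())
    return [element.tag for element in root.iter()]

# Same layout as the notebook's interface, with one made-up coding block.
config = """
<View style="display:flex;">
  <View style="flex:33%">
    <Image name="image" value="$image"/>
  </View>
  <View style="flex:66%">
    {}
  </View>
</View>
""".format(choices_block("Scene", "Indoor, Outdoor, Food & Drink", "multiple"))

print(check_interface(config))
```

Running `check_interface(interface)` right before the project is created would surface an unclosed `<View>` or an unescaped character as a `ParseError` instead of a failed API call.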

In [11]:
interface = """
<View style="display:flex;">
  <View style="flex:33%">
    <Image name="image" value="$image"/>
  </View>
  <View style="flex:66%">
"""

In [12]:
#@title ### Bounding Boxes
#@markdown Do you want to tag persons / objects in the image using bounding boxes? Please provide a name for the bounding boxes. Add **all** possible values in a **comma-separated** list! <br/> **By running this cell multiple times you're able to add multiple bounding box types (not recommended)**

bb_name = "Objects" #@param {type:"string"}
bb_values = "Food, Beverages, Humans, Cats, Dogs" #@param {type:"string"}


bb_interface = ' <Header value="{}" /><Rectangle name="{}_bbox" toName="image" strokeWidth="3"/><RectangleLabels name="{}" toName="image">'.format(bb_name,bb_name,bb_name)

for value in bb_values.split(","):
  value = value.strip()
  bb_interface += '<Label value="{}" />'.format(value)

bb_interface += "</RectangleLabels>"

interface += bb_interface

print("Added {}".format(bb_name))

Added Objects


In [13]:
#@title ### Codes
#@markdown Do you want to add codes (classification) to the images? Please name your coding instance and add options. <br/> **By running this cell multiple times you're able to add multiple coding instances (not recommended)**

coding_name = "Scene" #@param {type:"string"}
coding_values = "Indoor, Outdoor, Unsure" #@param {type:"string"}
coding_choice = "multiple" #@param ["single", "multiple"]

coding_interface = '<Header value="{}" /><Choices name="{}" choice="{}" toName="image">'.format(coding_name, coding_name,coding_choice)

for value in coding_values.split(","):
  value = value.strip()
  coding_interface += '<Choice value="{}" />'.format(value)

coding_interface += "</Choices>"

interface += coding_interface

print("Added {}".format(coding_name))

Added Scene


In [None]:
#@title ### OCR
#@markdown Do you want to use (and correct) the OCR results?
use_ocr = True #@param {type:"boolean"}

if use_ocr:
  interface += """
    <Header value="Correct the OCR of the Image"/>
    <TextArea name="ocr" value="$ocr_text" toName="image" placeholder="Enter any text here" rows="5"/>
  """


In [14]:
interface += """
        </View>
    </View>
    """

In [None]:
#@title ## Create LabelStudio Project
#@markdown In this step we will create a LabelStudio project and configure cloud storage and the interface.
from label_studio_sdk import Client
import contextlib
import io

project_name = "Medium Test"  #@param {type: "string"}

ls = Client(url=labelstudio_url, api_key=labelstudio_key)

# Create the project
project = ls.start_project(
    title=project_name,
    label_config=interface,
    sampling="Uniform sampling"
)

# Configure Cloud Storage (in order to be able to view the images)
project.connect_google_import_storage(bucket=gcloud_bucket, google_application_credentials=json.dumps(credentials_dict))


# Import all tasks
df_tasks = df[['image', 'ocr_text', 'username']]
df_tasks = df_tasks.fillna("")


with contextlib.redirect_stdout(io.StringIO()):
  project.import_tasks(
        df_tasks.to_dict('records')
      )

print(f"All done, you're set up! Visit {labelstudio_url.rstrip('/')}/projects/{project.id}/ and get started labelling!")
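
The `$image` and `$ocr_text` placeholders in the interface are filled from the keys of each imported task, so every placeholder needs a matching key in the task records. As a small sketch under that assumption, the hypothetical helpers `placeholders` and `missing_keys` compare a configuration with a list of records before anything is imported. The configuration and the records below are made up for illustration.

```python
import re

def placeholders(config):
    """Return the set of $variable names referenced in attribute values."""
    return set(re.findall(r'="\$(\w+)"', config))

def missing_keys(config, records):
    """Map each record index to the placeholders it has no key for."""
    needed = placeholders(config)
    problems = {}
    for index, record in enumerate(records):
        absent = needed - set(record)
        if absent:
            problems[index] = sorted(absent)
    return problems

config = """
<View>
  <Image name="image" value="$image"/>
  <TextArea name="ocr" value="$ocr_text" toName="image"/>
</View>
"""
# Made-up task records; the second one lacks the OCR text.
records = [
    {"image": "gs://my-bucket/alice-111.jpg", "ocr_text": "", "username": "alice"},
    {"image": "gs://my-bucket/alice-222.jpg", "username": "alice"},
]
print(sorted(placeholders(config)))   # ['image', 'ocr_text']
print(missing_keys(config, records))  # {1: ['ocr_text']}
```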
