# Zeeschuimer Data Import [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.8199901.svg)](https://doi.org/10.5281/zenodo.8199901)

![Notes on (Computational) Social Media Research Banner](https://raw.githubusercontent.com/michaelachmann/social-media-lab/main/images/banner.png)

## Overview

This Jupyter notebook is a part of the social-media-lab.net project, which is a work-in-progress textbook on computational social media analysis. The notebook is intended for use in my classes.

The **Zeeschuimer Data Import** notebook handles *ndjson* files provided by the [Zeeschuimer](https://github.com/digitalmethodsinitiative/zeeschuimer) plugin for collecting Instagram posts. After importing the files, we can download images (same as 4CAT). Additionally, we can convert the data to be compatible with other notebooks in the course.

**TODO:** At the moment, we can only download one image per post. Future versions of this notebook should be capable of downloading all images and other media from albums.

See [social-media-lab.net](https://social-media-lab.net/data-collection/ig-posts.html#zeeschuimer-4cat) for more information.

### Project Information

- Project Website: [social-media-lab.net](https://social-media-lab.net/)
- GitHub Repository: [https://github.com/michaelachmann/social-media-lab](https://github.com/michaelachmann/social-media-lab)

## License Information

This notebook, along with all other notebooks in the project, is licensed under the following terms:

- License: [GNU General Public License version 3.0 (GPL-3.0)](https://www.gnu.org/licenses/gpl-3.0.de.html)
- This notebook incorporates code taken from the [4CAT repository](https://github.com/digitalmethodsinitiative/4cat/), licensed under the Mozilla Public License 2.0.
- License File: [LICENSE.md](https://github.com/michaelachmann/social-media-lab/blob/main/LICENSE.md)


## Citation

If you use or reference this notebook in your work, please cite it appropriately. Here is an example of the citation:

```
Michael Achmann. (2023). michaelachmann/social-media-lab: 06.11.2023 (v0.0.3). Zenodo. https://doi.org/10.5281/zenodo.8199901
```

In [1]:
import pandas as pd
import json

In [61]:
#@title Import Zeeschuimer Data
#@markdown **Keep 4CAT as format to download images within this notebook.** <br> **Formats:** *MA* corresponds to the CrowdTangle and instaloader notebooks and is compatible with future notebooks for data annotation. *4CAT* corresponds to the 4CAT CSV format. <br> *TODO: Import multiple images.* <br> This importer is based on 4CAT code, published under Mozilla Public License 2.0 at https://github.com/digitalmethodsinitiative/4cat/tree/master/datasources/instagram

from tqdm.auto import tqdm

import_format = "4CAT" # @param ["MA", "4CAT"]
import_filename = '/content/zeeschuimer-export-instagram.com-2023-11-03T131020.ndjson'  # @param {type: "string"}
export_filename = '/content/2023-11-03-ig-zeeschuimer-export.csv'  # @param {type: "string"}


"""
Import scraped Instagram data

This code has been taken from 4CAT, available under Mozilla Public License 2.0 at https://github.com/digitalmethodsinitiative/4cat/tree/master/datasources/instagram
"""
import datetime
import re

# some magic numbers instagram uses
MEDIA_TYPE_PHOTO = 1
MEDIA_TYPE_VIDEO = 2
MEDIA_TYPE_CAROUSEL = 8

def parse_graph_item(node):
    """
    Parse Instagram post in Graph format

    :param node:  Data as received from Instagram
    :return dict:  Mapped item
    """
    try:
        caption = node["edge_media_to_caption"]["edges"][0]["node"]["text"]
    except IndexError:
        caption = ""

    num_media = 1 if node["__typename"] != "GraphSidecar" else len(node["edge_sidecar_to_children"]["edges"])

    # get media url
    # for carousels, get the first media item, for videos, get the video
    # url, for photos, get the highest resolution
    if node["__typename"] == "GraphSidecar":
        media_node = node["edge_sidecar_to_children"]["edges"][0]["node"]
    else:
        media_node = node

    if media_node["__typename"] == "GraphVideo":
        media_url = media_node["video_url"]
    elif media_node["__typename"] == "GraphImage":
        resources = media_node.get("display_resources", media_node.get("thumbnail_resources"))
        try:
            media_url = resources.pop()["src"]
        except AttributeError:
            media_url = media_node.get("display_url", "")
    else:
        media_url = media_node["display_url"]

    # type, 'mixed' means carousel with video and photo
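    # "GraphImage" is mapped explicitly so that single photos and photo-only carousels are not labelled "unknown"
    type_map = {"GraphSidecar": "photo", "GraphImage": "photo", "GraphVideo": "video"}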
    if node["__typename"] != "GraphSidecar":
        media_type = type_map.get(node["__typename"], "unknown")
    else:
        media_types = set([s["node"]["__typename"] for s in node["edge_sidecar_to_children"]["edges"]])
        media_type = "mixed" if len(media_types) > 1 else type_map.get(media_types.pop(), "unknown")

    location = {"name": "", "latlong": "", "city": ""}
    # location has 'id', 'has_public_page', 'name', and 'slug' keys in tested examples; no lat long or "city" though name seems
    if node.get("location"):
        location["name"] = node["location"].get("name")
        # Leaving this though it does not appear to be used in this type; maybe we'll be surprised in the future...
        location["latlong"] = str(node["location"]["lat"]) + "," + str(node["location"]["lng"]) if node[
            "location"].get("lat") else ""
        location["city"] = node["location"].get("city")

    mapped_item = {
        "id": node["shortcode"],
        "thread_id": node["shortcode"],
        "parent_id": node["shortcode"],
        "body": caption,
        "author": node["owner"]["username"],
        "timestamp": datetime.datetime.fromtimestamp(node["taken_at_timestamp"]).strftime("%Y-%m-%d %H:%M:%S"),
        "author_fullname": node["owner"].get("full_name", ""),
        "author_avatar_url": node["owner"].get("profile_pic_url", ""),
        "type": media_type,
        "url": "https://www.instagram.com/p/" + node["shortcode"],
        "image_url": node["display_url"],
        "media_url": media_url,
        "hashtags": ",".join(re.findall(r"#([^\s!@#$%ˆ&*()_+{}:\"|<>?\[\];'\,./`~']+)", caption)),
        # "usertags": ",".join(
        #     [u["node"]["user"]["username"] for u in node["edge_media_to_tagged_user"]["edges"]]),
        "num_likes": node["edge_media_preview_like"]["count"],
        "num_comments": node.get("edge_media_preview_comment", {}).get("count", 0),
        "num_media": num_media,
        "location_name": location["name"],
        "location_latlong": location["latlong"],
        "location_city": location["city"],
        "unix_timestamp": node["taken_at_timestamp"]
    }

    return mapped_item

def parse_itemlist_item(node):
    """
    Parse Instagram post in 'item list' format

    :param node:  Data as received from Instagram
    :return dict:  Mapped item
    """
    num_media = 1 if node["media_type"] != MEDIA_TYPE_CAROUSEL else len(node["carousel_media"])
    caption = "" if not node.get("caption") else node["caption"]["text"]

    # get media url
    # for carousels, get the first media item, for videos, get the video
    # url, for photos, get the highest resolution
    if node["media_type"] == MEDIA_TYPE_CAROUSEL:
        media_node = node["carousel_media"][0]
    else:
        media_node = node

    if media_node["media_type"] == MEDIA_TYPE_VIDEO:
        media_url = media_node["video_versions"][0]["url"]
        if "image_versions2" in media_node:
            display_url = media_node["image_versions2"]["candidates"][0]["url"]
        else:
            # no image links at all :-/
            # video is all we have
            display_url = media_node["video_versions"][0]["url"]
    elif media_node["media_type"] == MEDIA_TYPE_PHOTO:
        media_url = media_node["image_versions2"]["candidates"][0]["url"]
        display_url = media_url
    else:
        media_url = ""
        display_url = ""

    # type, 'mixed' means carousel with video and photo
    type_map = {MEDIA_TYPE_PHOTO: "photo", MEDIA_TYPE_VIDEO: "video"}
    if node["media_type"] != MEDIA_TYPE_CAROUSEL:
        media_type = type_map.get(node["media_type"], "unknown")
    else:
        media_types = set([s["media_type"] for s in node["carousel_media"]])
        media_type = "mixed" if len(media_types) > 1 else type_map.get(media_types.pop(), "unknown")

    if "comment_count" in node:
        num_comments = node["comment_count"]
    elif "comments" in node and type(node["comments"]) is list:
        num_comments = len(node["comments"])
    else:
        num_comments = -1

    location = {"name": "", "latlong": "", "city": ""}
    if node.get("location"):
        location["name"] = node["location"].get("name")
        location["latlong"] = str(node["location"]["lat"]) + "," + str(node["location"]["lng"]) if node[
            "location"].get("lat") else ""
        location["city"] = node["location"].get("city")

    mapped_item = {
        "id": node["code"],
        "thread_id": node["code"],
        "parent_id": node["code"],
        "body": caption,
        "author": node["user"]["username"],
        "author_fullname": node["user"]["full_name"],
        "author_avatar_url": node["user"]["profile_pic_url"],
        "timestamp": datetime.datetime.fromtimestamp(node["taken_at"]).strftime("%Y-%m-%d %H:%M:%S"),
        "type": media_type,
        "url": "https://www.instagram.com/p/" + node["code"],
        "image_url": display_url,
        "media_url": media_url,
        "hashtags": ",".join(re.findall(r"#([^\s!@#$%ˆ&*()_+{}:\"|<>?\[\];'\,./`~']+)", caption)),
        # "usertags": ",".join(
        #     [u["node"]["user"]["username"] for u in node["edge_media_to_tagged_user"]["edges"]]),
        "num_likes": node["like_count"],
        "num_comments": num_comments,
        "num_media": num_media,
        "location_name": location["name"],
        "location_latlong": location["latlong"],
        "location_city": location["city"],
        "unix_timestamp": node["taken_at"]
    }

    return mapped_item

def map_item(item):
    """
    Map Instagram item

    Instagram importing is a little bit roundabout since we can expect
    input in two separate and not completely overlapping formats - an "edge
    list" or an "item list", and posts are structured differently between
    those, and do not contain the same data. So we find a middle ground
    here... each format has its own handler function

    :param dict item:  Item to map
    :return:  Mapped item
    """
    link = item.get("link", "")
    if (item.get("product_type", "") == "ad") or \
            (link and link.startswith("https://www.facebook.com/ads/ig_redirect")):

        return None

    is_graph_response = "__typename" in item and item["__typename"] not in ("XDTMediaDict",)

    if is_graph_response:
        return parse_graph_item(item)
    else:
        return parse_itemlist_item(item)



##########
# CODE MA
##########

data = []

# List to store all the dictionaries
data_list = []

# Read the ndjson file line by line
with open(import_filename, 'r', encoding='utf-8') as file:
    for line in file:
        # Parse the JSON line and extract the 'data' field
        json_line = json.loads(line)
        data_field = json_line['data']

        # Add the 'data' field to your list
        data_list.append(data_field)


# Next, we convert the data into a table
for element in tqdm(data_list):
  item = map_item(element)

  if item:
    if import_format == "MA":
      data.append({
          'shortcode': item.get("id", ""),
          'username': item.get("author", ""),
          'timestamp': item.get("unix_timestamp", None),
          'caption': item.get("body", ""),
          'location': item.get("location_name", None),
      })

    else:
      data.append(item)

posts_df = pd.DataFrame(data)
posts_df.to_csv(export_filename)

print(f"Imported Zeeschuimer Data. Saved export to {export_filename}")


Imported Zeeschuimer Data. Saved export to /content/2023-11-03-ig-zeeschuimer-export.csv

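The import cell above reads the file line by line and takes the `data` field of every record. The following is a minimal sketch of the same idea that additionally skips blank or malformed lines instead of stopping with an exception. The helper name `read_ndjson_data` and the two sample records are made up for illustration; real Zeeschuimer exports contain many more fields.

```python
import io
import json


def read_ndjson_data(handle):
    """Yield the 'data' field of each valid ndjson line (hypothetical helper)."""
    for line_number, line in enumerate(handle, start=1):
        line = line.strip()
        if not line:
            # Blank lines carry no record
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            print(f"Skipping malformed line {line_number}")
            continue
        # Only keep records that actually have a 'data' field
        if isinstance(record, dict) and "data" in record:
            yield record["data"]


# Made-up sample standing in for an exported file
sample = io.StringIO(
    '{"data": {"code": "abc"}}\n'
    '\n'
    'not json\n'
    '{"data": {"code": "def"}}\n'
)

items = list(read_ndjson_data(sample))
print(items)
```

With a real export, the same function can be called on an open file handle, for example `open(import_filename, encoding="utf-8")`.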

In [64]:
#| label: fig-bill-marginal
posts_df.head()

| | id | thread_id | parent_id | body | author | author_fullname | author_avatar_url | timestamp | type | url | image_url | media_url | hashtags | num_likes | num_comments | num_media | location_name | location_latlong | location_city | unix_timestamp |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | CzLuRzfrukl | CzLuRzfrukl | CzLuRzfrukl | Dear citizens of Israel, \n\nYou have been our... | stateofisrael | Israel | https://scontent.cdninstagram.com/v/t51.2885-1... | 2023-11-03 12:04:11 | video | https://www.instagram.com/p/CzLuRzfrukl | https://scontent.cdninstagram.com/v/t51.2885-1... | https://scontent.cdninstagram.com/o1/v/t16/f1/... | | 2775 | 807 | 1 | | | | 1699013051 |
| 1 | CzGoQoFrGQR | CzGoQoFrGQR | CzGoQoFrGQR | אני עוד חי, חי, חי, \nעם ישראל חי.\n\n📸 @arnon... | stateofisrael | Israel | https://scontent.cdninstagram.com/v/t51.2885-1... | 2023-11-01 12:33:49 | photo | https://www.instagram.com/p/CzGoQoFrGQR | https://scontent.cdninstagram.com/v/t51.2885-1... | https://scontent.cdninstagram.com/v/t51.2885-1... | | 20749 | 6436 | 1 | | | | 1698842029 |
| 2 | CzEIzgitG5w | CzEIzgitG5w | CzEIzgitG5w | Unsere Demokratie steht im Feuer. Wir als poli... | kathaschulze | Katharina Schulze | https://scontent.cdninstagram.com/v/t51.2885-1... | 2023-10-31 13:20:28 | photo | https://www.instagram.com/p/CzEIzgitG5w | https://scontent.cdninstagram.com/v/t39.30808-... | https://scontent.cdninstagram.com/v/t39.30808-... | parlament,landtag,bayern,demokratie,rede | 570 | 27 | 1 | Bayerischer Landtag | 48.136341,11.594312 | | 1698758428 |
| 3 | CzLGA9jtJuj | CzLGA9jtJuj | CzLGA9jtJuj | Hass und Desinformation dürfen im Netz nicht g... | spdde | SPD | https://scontent.cdninstagram.com/v/t51.2885-1... | 2023-11-03 06:10:47 | photo | https://www.instagram.com/p/CzLGA9jtJuj | https://scontent.cdninstagram.com/v/t51.2885-1... | https://scontent.cdninstagram.com/v/t51.2885-1... | | 747 | 46 | 1 | | | | 1698991847 |
| 4 | CzGC5dxth82 | CzGC5dxth82 | CzGC5dxth82 | | jungealternativebayern | Junge Alternative Bayern ❌ | https://scontent.cdninstagram.com/v/t51.2885-1... | 2023-11-01 07:07:20 | photo | https://www.instagram.com/p/CzGC5dxth82 | https://scontent.cdninstagram.com/v/t39.30808-... | https://scontent.cdninstagram.com/v/t39.30808-... | | 164 | 4 | 1 | | | | 1698822440 |

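The `timestamp` column is built with `datetime.datetime.fromtimestamp`, which converts to the local timezone of the machine running the notebook, so the same export can yield different strings on different machines. The sketch below shows one way to derive a timezone-independent string from `unix_timestamp`. The helper name `to_utc_string` is an assumption for illustration and not part of the 4CAT code.

```python
import datetime


def to_utc_string(unix_timestamp):
    """Format a Unix timestamp as a UTC date string (hypothetical helper)."""
    moment = datetime.datetime.fromtimestamp(unix_timestamp, tz=datetime.timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S")


# unix_timestamp of the first row in the table above
utc_string = to_utc_string(1699013051)
print(utc_string)  # 2023-11-03 12:04:11
```

The result matches the `timestamp` value shown for that row, which suggests the notebook was run on a machine whose clock was set to UTC.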

In [58]:
#@title Download Videos and Images
#@markdown This cell downloads all images and, if available, videos to the folders below. <br> **TODO: Download multiple images per Post**

import requests
from pathlib import Path
import pandas as pd

video_folder = "/content/posts/videos"  # @param {type: "string"}
image_folder = "/content/posts/images"  # @param {type: "string"}

# Set up the base paths for media from the form parameters above
base_video_path = Path(video_folder)
base_image_path = Path(image_folder)

# Ensure the base directories exist before entering the loop
base_video_path.mkdir(parents=True, exist_ok=True)
base_image_path.mkdir(parents=True, exist_ok=True)

with requests.Session() as session:
    for index, row in posts_df.iterrows():
        # Determine the media type and set up path and url accordingly
        if row['type'] == "video":
            path = base_video_path / row['author']
            media_url = row['media_url']
            file_extension = ".mp4"
        else:  # assuming all non-video types are images
            path = base_image_path / row['author']
            media_url = row['image_url']
            file_extension = ".jpg"

        # Ensure the author-specific directory exists
        path.mkdir(parents=True, exist_ok=True)

        # Construct the file path
        filename = f"{row['id']}{file_extension}"
        file_path = path / filename

        # Make the request and save the file
        try:
            # A timeout keeps a stalled download from blocking the whole loop
            response = session.get(media_url, allow_redirects=True, timeout=30)
            response.raise_for_status()  # Will raise an HTTPError if the HTTP request returned an unsuccessful status code
            with open(file_path, 'wb') as f:
                f.write(response.content)
        except requests.HTTPError as e:
            print(f"HTTP Error for {media_url}: {e}")
        except requests.RequestException as e:
            print(f"Request Exception for {media_url}: {e}")

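The open TODO for this cell is downloading every item of an album rather than only the first. As a possible starting point, the sketch below collects one URL per media item from a post in the "item list" format. It relies only on the fields already used by `parse_itemlist_item` (`media_type`, `carousel_media`, `video_versions`, `image_versions2`). The function name `collect_media_urls` is hypothetical, the example post and its URLs are placeholders, and the Graph format is not covered.

```python
# Media type codes, as defined in the import cell above
MEDIA_TYPE_PHOTO = 1
MEDIA_TYPE_VIDEO = 2
MEDIA_TYPE_CAROUSEL = 8


def collect_media_urls(node):
    """Return one URL per media item of an 'item list' post (hypothetical helper)."""
    if node.get("media_type") == MEDIA_TYPE_CAROUSEL:
        children = node.get("carousel_media", [])
    else:
        children = [node]

    urls = []
    for child in children:
        videos = child.get("video_versions") or []
        images = (child.get("image_versions2") or {}).get("candidates") or []
        if child.get("media_type") == MEDIA_TYPE_VIDEO and videos:
            # Videos: take the first listed version, as the parser above does
            urls.append(videos[0]["url"])
        elif images:
            # Photos (or videos without a video link): take the first image candidate
            urls.append(images[0]["url"])
    return urls


# Made-up post with placeholder URLs, only to show the expected shape
example_post = {
    "media_type": MEDIA_TYPE_CAROUSEL,
    "carousel_media": [
        {"media_type": MEDIA_TYPE_PHOTO,
         "image_versions2": {"candidates": [{"url": "https://example.com/1.jpg"}]}},
        {"media_type": MEDIA_TYPE_VIDEO,
         "video_versions": [{"url": "https://example.com/2.mp4"}],
         "image_versions2": {"candidates": [{"url": "https://example.com/2.jpg"}]}},
    ],
}

media_urls = collect_media_urls(example_post)
print(media_urls)
```

To use this in the download loop, the raw items would have to be kept next to the mapped rows, because `posts_df` only stores a single `media_url` per post.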

## ZIP files for download

In [63]:
!zip -r posts.zip /content/posts/

  adding: content/posts/ (stored 0%)
  adding: content/posts/images/ (stored 0%)
  adding: content/posts/images/jungealternativebayern/ (stored 0%)
  adding: content/posts/images/jungealternativebayern/CzGC5dxth82.jpg (deflated 1%)
  adding: content/posts/images/kathaschulze/ (stored 0%)
  adding: content/posts/images/kathaschulze/CzEIzgitG5w.jpg (deflated 1%)
  adding: content/posts/images/spdde/ (stored 0%)
  adding: content/posts/images/spdde/CzLGA9jtJuj.jpg (deflated 2%)
  adding: content/posts/images/stateofisrael/ (stored 0%)
  adding: content/posts/images/stateofisrael/CzGoQoFrGQR.jpg (deflated 1%)
  adding: content/posts/videos/ (stored 0%)
  adding: content/posts/videos/stateofisrael/ (stored 0%)
  adding: content/posts/videos/stateofisrael/CzLuRzfrukl.mp4 (deflated 0%)

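As an alternative to the `zip` shell command, the archive can also be created from Python with the standard library, which works outside Colab as well. This is a small sketch using `shutil.make_archive`. The helper name `zip_folder` is an assumption, and the demo runs on a temporary folder with a placeholder file rather than on real downloads.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def zip_folder(folder, archive_stem):
    """Zip `folder` into `<archive_stem>.zip` and return the archive path (hypothetical helper)."""
    folder = Path(folder)
    return shutil.make_archive(
        str(archive_stem), "zip",
        root_dir=str(folder.parent), base_dir=folder.name,
    )


with tempfile.TemporaryDirectory() as tmp:
    # Build a tiny stand-in for /content/posts
    demo_folder = Path(tmp) / "posts"
    (demo_folder / "images").mkdir(parents=True)
    (demo_folder / "images" / "example.jpg").write_bytes(b"placeholder")

    archive_path = zip_folder(demo_folder, Path(tmp) / "posts")
    with zipfile.ZipFile(archive_path) as archive:
        archived_names = archive.namelist()

print(archived_names)
```

In this notebook, `zip_folder("/content/posts", "/content/posts")` would write `/content/posts.zip`. Paths inside that archive start with `posts/` rather than `content/posts/`.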

The file `posts.zip` will appear in the files pane on the left. Right-click the file to download it. If Google Drive is mounted, the file can also be copied there (faster!): `!cp posts.zip /content/drive/MyDrive/`.