<a href="https://colab.research.google.com/github/michaelachmann/social-media-lab/blob/main/notebooks/2023_11_07_Firebase_Interface.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Firebase Interface Notebook [![DOI](https://zenodo.org/badge/660157642.svg)](https://zenodo.org/doi/10.5281/zenodo.8199901)

![Notes on (Computational) Social Media Research Banner](https://raw.githubusercontent.com/michaelachmann/social-media-lab/main/images/banner.png)

## Overview

This Jupyter notebook is a part of the social-media-lab.net project, which is a work-in-progress textbook on computational social media analysis. The notebook is intended for use in my classes.

The **Firebase Interface Notebook** provides an interface to the backend for creating projects and exporting data from Firebase.


### Project Information

- Project Website: [social-media-lab.net](https://social-media-lab.net/)
- GitHub Repository: [https://github.com/michaelachmann/social-media-lab](https://github.com/michaelachmann/social-media-lab)

## License Information

This notebook, along with all other notebooks in the project, is licensed under the following terms:

- License: [GNU General Public License version 3.0 (GPL-3.0)](https://www.gnu.org/licenses/gpl-3.0.de.html)
- License File: [LICENSE.md](https://github.com/michaelachmann/social-media-lab/blob/main/LICENSE.md)


## Citation

If you use or reference this notebook in your work, please cite it appropriately. Here is an example of the citation:

```
Michael Achmann. (2023). michaelachmann/social-media-lab: 2023-11-09 (v0.0.4). Zenodo. https://doi.org/10.5281/zenodo.8199901
```

In [1]:
!pip -q install firebase-admin

In [1]:
# @title Connecting to Firebase
# @markdown Please provide the path to your credentials file

import firebase_admin
from firebase_admin import credentials, firestore

credentials_path = '/content/XXXX-firebase-adminsdk-ZZZ-YYYYY.json'  # @param {type: "string"}

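cred = credentials.Certificate(credentials_path)
try:
    # Reuse the default app if this cell has already been run in this session
    firebase_admin.get_app()
except ValueError:
    firebase_admin.initialize_app(cred)
db = firestore.client()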
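
Before the path is handed to `credentials.Certificate`, it can help to check that the file looks like a service-account key. The following is only a sketch and not part of the original notebook: `check_service_account_file` is a hypothetical helper, and the list of required keys is an assumption based on the fields usually found in Google service-account JSON files.

```python
import json
import os
import tempfile

# Assumption: fields expected in a service-account key file.
REQUIRED_KEYS = {"type", "project_id", "private_key", "client_email"}


def check_service_account_file(path):
    """Return a list of problems found in the key file (empty if none)."""
    if not os.path.isfile(path):
        return [f"file not found: {path}"]
    try:
        with open(path, encoding="utf-8") as f:
            content = json.load(f)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(content, dict):
        return ["top-level JSON value is not an object"]
    problems = [f"missing key: {key}" for key in sorted(REQUIRED_KEYS - set(content))]
    if content.get("type") != "service_account":
        problems.append("'type' is not 'service_account'")
    return problems


# Demonstration with a deliberately incomplete throwaway file
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo-key.json")
    with open(demo_path, "w", encoding="utf-8") as f:
        json.dump({"type": "service_account", "project_id": "demo"}, f)
    demo_problems = check_service_account_file(demo_path)

print(demo_problems)
```

An empty list would suggest the file is at least structurally plausible; it does not prove that the key is valid or has the necessary permissions.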

In [10]:
# @title Project Creation
# @markdown Please fill in the following form in order to create a new project on the backend.

from IPython.display import display, Markdown
import pandas as pd

alert_email = 'michael@achmann.me'  # @param {type: "string"}
project_name = 'Forschungsseminar23 Test'  # @param {type: "string"}

# Create Project
import uuid

# Generate a UUID for the document
project_id = str(uuid.uuid4())
api_key = str(uuid.uuid4())

# Your data
data = {
    "api_key": api_key,
    "email": alert_email,
    "name": project_name
}

# Add a new document with a UUID as the document name (ID)
doc_ref = db.collection('projects').document(project_id)
doc_ref.set(data)

display(Markdown("### Project Created:"))
display(Markdown(f"**Project Name:** {project_name}"))
display(Markdown(f"**Alert Email:** {alert_email}"))
display(Markdown(f"**Project ID:** {project_id}"))
display(Markdown(f"**API-Key:** {api_key}"))

### Project Created:

**Project Name:** Forschungsseminar23 Test

**Alert Email:** michael@achmann.me

**Project ID:** 959466fe-4088-4099-a6b2-3cbe058889d3

**API-Key:** 554fbce8-fb15-44f1-bb4d-54cdc57554f2
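
The record written to the `projects` collection can also be assembled separately from the Firestore call, so that it can be inspected before anything is written. This is a minimal sketch under that assumption; `build_project_record` is a hypothetical helper name, the example values are made up, and Firestore is not contacted.

```python
import uuid


def build_project_record(project_name, alert_email):
    """Return (project_id, data) shaped like the project document created above."""
    project_id = str(uuid.uuid4())
    data = {
        "api_key": str(uuid.uuid4()),  # a second, independent UUID serves as API key
        "email": alert_email,
        "name": project_name,
    }
    return project_id, data


# Made-up example values; nothing is sent to the backend here
demo_id, demo_data = build_project_record("Demo Project", "someone@example.com")
print(demo_id, demo_data["name"])
```

With a live client, the pair could then be passed to `db.collection('projects').document(demo_id).set(demo_data)`, as in the cell above.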

In [2]:
# @title Project Export
# @markdown Please fill in project ID and export path to download all JSON files and store them locally.

from tqdm.auto import tqdm
import os
import json

PROJECT_ID = '959466fe-4088-4099-a6b2-3cbe058889d3'  # @param {type: "string"}
export_path = '/content/export'  # @param {type: "string"}


def fetch_stories(project_id):
    stories_ref = db.collection('projects').document(project_id).collection('stories')
    docs = stories_ref.stream()

    stories = []
    for doc in docs:
        stories.append(doc.to_dict())

    return stories

# Keep the Firestore client in `db`; store the fetched documents separately
stories = fetch_stories(PROJECT_ID)

# Create the export directory if it does not exist yet
os.makedirs(export_path, exist_ok=True)

# Iterate over each exported story
for element in tqdm(stories, desc='Exporting elements'):
    # Serialize the element to JSON
    element_json = json.dumps(element, indent=4)

    # Write to a file named {id}.json
    with open(os.path.join(export_path, f"{element['id']}.json"), 'w') as f:
        f.write(element_json)

Exporting elements:   0%|          | 0/22 [00:00<?, ?it/s]

In [3]:
# @title Convert to DataFrame
# @markdown After running the above cell, you may convert the data to a pandas DataFrame.

import pandas as pd
from datetime import datetime, timedelta, timezone


df_export_path = '/content/2022-11-09-Stories-Exported.csv'  # @param {type: "string"}

def process_instagram_story(data):
    # Convert the Unix timestamp (UTC) once and reuse it below
    taken_at = datetime.fromtimestamp(data['taken_at'], tz=timezone.utc)

    # Extract relevant information
    story_info = {
        'ID': data.get("id"),
        'Time of Posting': taken_at.strftime('%Y-%m-%d %H:%M:%S'),
        'Type of Content': 'Video' if 'video_duration' in data else 'Image',
        'video_url': None,
        'image_url': None,
        'Username': data['user']['username'],
        'Video Length (s)': data.get('video_duration', None),
        'Expiration': (taken_at + timedelta(hours=24)).strftime('%Y-%m-%d %H:%M:%S'),
        'Caption': data.get('caption', None),
        'Is Verified': data['user']['is_verified'],
        'Stickers': data.get('story_bloks_stickers', []),
        'Accessibility Caption': data.get('accessibility_caption', ''),
        'Attribution URL': data.get('attribution_content_url', '')
    }

    return story_info

rows = [process_instagram_story(element) for element in stories]

df = pd.DataFrame(rows)
df.to_csv(df_export_path)
print(f"Successfully exported {len(df)} rows as CSV.")

Successfully exported 22 rows as CSV.
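
The CSV keeps only a subset of each story's fields, while the per-story JSON files written during the export hold the full records. They can be read back later without a Firebase connection. A minimal sketch, assuming one `{id}.json` file per story in a single directory; `load_exported_stories` is a hypothetical helper and the demo records are made up.

```python
import json
import os
import tempfile


def load_exported_stories(export_dir):
    """Read every *.json file in export_dir and return the parsed records."""
    records = []
    for name in sorted(os.listdir(export_dir)):
        if name.endswith(".json"):
            with open(os.path.join(export_dir, name), encoding="utf-8") as f:
                records.append(json.load(f))
    return records


# Demonstration with two made-up records in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    for story_id in ("a_1", "b_2"):
        with open(os.path.join(tmp, f"{story_id}.json"), "w", encoding="utf-8") as f:
            json.dump({"id": story_id}, f)
    demo_records = load_exported_stories(tmp)

print([record["id"] for record in demo_records])
```

In the notebook, the same function could be pointed at the export directory chosen in the export cell.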


In [6]:
df.head()

| | ID | Time of Posting | Type of Content | video_url | image_url | Username | Video Length (s) | Expiration | Caption | Is Verified | Stickers | Accessibility Caption | Attribution URL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3231585718932790545_1483455177 | 2023-11-08 14:50:59 | Image | | https://storage.googleapis.com/zeeschuimer-fb-... | rmf24.pl | | 2023-11-09 14:50:59 | | False | [] | Photo by Fakty RMF FM \| Rozmowy \| Podcasty on ... | |
| 1 | 3231585778860997221_1483455177 | 2023-11-08 14:51:06 | Image | | https://storage.googleapis.com/zeeschuimer-fb-... | rmf24.pl | | 2023-11-09 14:51:06 | | False | [] | Photo by Fakty RMF FM \| Rozmowy \| Podcasty on ... | |
| 2 | 3231750838597692854_1349651722 | 2023-11-08 20:19:00 | Video | https://storage.googleapis.com/zeeschuimer-fb-... | | tagesschau | 13.3 | 2023-11-09 20:19:00 | | True | [] | | |
| 3 | 3231750989408058657_1349651722 | 2023-11-08 20:19:18 | Video | https://storage.googleapis.com/zeeschuimer-fb-... | | tagesschau | 15.267 | 2023-11-09 20:19:18 | | True | [] | | |
| 4 | 3231751135118088390_1349651722 | 2023-11-08 20:19:35 | Video | https://storage.googleapis.com/zeeschuimer-fb-... | | tagesschau | 17.0 | 2023-11-09 20:19:35 | | True | [] | | |
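
The `Time of Posting` and `Expiration` columns are both derived from the story's `taken_at` Unix timestamp, with the expiration set 24 hours later. The sketch below isolates that arithmetic using a timezone-aware conversion; `posting_and_expiration` is a hypothetical helper, and the timestamp is an illustrative value chosen to match the first row, assuming the times are in UTC.

```python
from datetime import datetime, timedelta, timezone


def posting_and_expiration(taken_at):
    """Return (posting time, expiration time) as formatted UTC strings."""
    posted = datetime.fromtimestamp(taken_at, tz=timezone.utc)
    expires = posted + timedelta(hours=24)  # 24-hour lifetime, as in the cell above
    fmt = "%Y-%m-%d %H:%M:%S"
    return posted.strftime(fmt), expires.strftime(fmt)


demo_times = posting_and_expiration(1699455059)
print(demo_times)  # ('2023-11-08 14:50:59', '2023-11-09 14:50:59')
```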


In [10]:
# @title Download Images and Videos
# @markdown Once the data has been converted to a DataFrame, you may download the images and videos.

storage_bucket = "XXX.appspot.com"  # @param {type: "string"}
media_export_path = '/content/media/'  # @param {type: "string"}

from firebase_admin import storage
import os
import requests

bucket = storage.bucket(storage_bucket)

def generate_signed_url(username, content_id, file_type):
    if file_type not in ['images', 'videos']:
        raise ValueError("Invalid file type specified")

    ext = 'jpeg' if file_type == 'images' else 'mp4'
    blob_path = f"projects/{PROJECT_ID}/stories/{file_type}/{username}/{content_id}.{ext}"
    blob = bucket.blob(blob_path)
    # Set the expiration of the link. Here, it's set to 24 hours.
    return blob.generate_signed_url(expiration=timedelta(hours=24), method='GET')

# Create a function to be applied across DataFrame rows
def apply_generate_signed_url(row):
    image_url = generate_signed_url(row['Username'], row['ID'], 'images')
    video_url = generate_signed_url(row['Username'], row['ID'], 'videos') if row['Type of Content'] == 'Video' else pd.NA
    return pd.Series({'image_url': image_url, 'video_url': video_url})

# Apply the function along the axis=1 (row-wise)
df[['image_url', 'video_url']] = df.apply(apply_generate_signed_url, axis=1)

# Build the lists of image and video records to download
data_images = df.loc[df['image_url'].notna(), ['ID', 'image_url', 'Username', 'Time of Posting']] \
               .rename(columns={'image_url': 'url', 'Time of Posting': 'datetime'}) \
               .to_dict('records')

data_videos = df.loc[df['video_url'].notna(), ['ID', 'video_url', 'Username', 'Time of Posting']] \
               .rename(columns={'video_url': 'url', 'Time of Posting': 'datetime'}) \
               .to_dict('records')


def create_directories(base_path, entries, subdir):
    usernames = set(entry['Username'] for entry in entries)
    for username in usernames:
        os.makedirs(os.path.join(base_path, subdir, username), exist_ok=True)

def download_file(entry, media_type, media_export_path, session):
    directory = os.path.join(media_export_path, media_type, entry['Username'])
    ext = 'jpg' if media_type == 'images' else 'mp4'
    filename = os.path.join(directory, f"{entry['ID']}.{ext}")

    with session.get(entry['url'], stream=True) as response:
        if response.status_code == 200:
            with open(filename, 'wb') as file:
                for chunk in response.iter_content(8192):
                    file.write(chunk)
        else:
            print(f"Failed to download {entry['url']}. Status code: {response.status_code}")

session = requests.Session()
# Pre-create directories
create_directories(media_export_path, data_images, 'images')
create_directories(media_export_path, data_videos, 'videos')

# Download images
for entry in tqdm(data_images, desc="Downloading Images", unit="file"):
    download_file(entry, 'images', media_export_path, session)

# Download videos
for entry in tqdm(data_videos, desc="Downloading Videos", unit="file"):
    download_file(entry, 'videos', media_export_path, session)

print("Download complete!")


Downloading Images:   0%|          | 0/22 [00:00<?, ?file/s]

Downloading Videos:   0%|          | 0/11 [00:00<?, ?file/s]

Download complete!
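
Failed downloads are only printed by the cell above, so "Download complete!" does not guarantee that every file arrived. A follow-up check can list the expected files that are missing on disk. This is a sketch that mirrors the local path layout used by `download_file`; `find_missing_files` is a hypothetical helper and the demo entries are made up.

```python
import os
import tempfile


def find_missing_files(entries, media_type, base_path):
    """Return the expected local paths that do not exist on disk."""
    ext = "jpg" if media_type == "images" else "mp4"
    missing = []
    for entry in entries:
        path = os.path.join(base_path, media_type, entry["Username"], f"{entry['ID']}.{ext}")
        if not os.path.isfile(path):
            missing.append(path)
    return missing


# Demonstration: two expected images, only one of which exists
with tempfile.TemporaryDirectory() as tmp:
    demo_entries = [
        {"ID": "1_1", "Username": "demo_user"},
        {"ID": "2_1", "Username": "demo_user"},
    ]
    os.makedirs(os.path.join(tmp, "images", "demo_user"))
    with open(os.path.join(tmp, "images", "demo_user", "1_1.jpg"), "wb") as f:
        f.write(b"placeholder")
    demo_missing = find_missing_files(demo_entries, "images", tmp)

print(len(demo_missing))
```

In the notebook, the function could be called with `data_images` or `data_videos` and `media_export_path` to decide which downloads to retry.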


## Prepare Downloadable ZIP
Run the following to ZIP all files. *Optionally*, copy the archive to Google Drive; this requires Drive to be mounted at `/content/drive`.

In [None]:
!zip -r 2023-11-09-Story-Media-Export.zip media/*

In [None]:
!cp 2023-11-09-Story-Media-Export.zip /content/drive/MyDrive/
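
Outside Colab, or where the `zip` command is unavailable, the archive can also be built from Python with the standard library. The sketch below uses `shutil.make_archive` on a temporary demo directory; the directory layout and file names are placeholders, not the notebook's actual export.

```python
import os
import shutil
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as tmp:
    # Build a small placeholder media tree to archive
    media_dir = os.path.join(tmp, "media")
    os.makedirs(os.path.join(media_dir, "images", "demo_user"))
    with open(os.path.join(media_dir, "images", "demo_user", "1_1.jpg"), "wb") as f:
        f.write(b"placeholder")

    # Create <tmp>/demo-export.zip containing the media/ directory
    archive_path = shutil.make_archive(
        os.path.join(tmp, "demo-export"), "zip", root_dir=tmp, base_dir="media"
    )

    # Inspect the archive before the temporary directory is removed
    with zipfile.ZipFile(archive_path) as zf:
        archived_names = zf.namelist()

print(archived_names)
```

Passing `root_dir` and `base_dir` keeps the paths inside the archive relative to `media/`, similar to running `zip -r` from the parent directory.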