# Section 1 — Report Header & Hypothesis

**Report Title:** _Replace with your title_  
**Your Name:** _Replace with your name_  
**Date:** _2025-10-07_

### Hypothesis
Write one testable hypothesis that can be evaluated using data available via the Bluesky API.  
_Example:_ “Accounts that post more frequently receive a higher average number of likes per post.”

### Theoretical Rationale
Explain the theory or reasoning behind your hypothesis. Cite any relevant concepts or readings.

### Statistical Application
Explain how your hypothesis could be tested statistically (e.g., group comparison, correlation). <br>
Which variables (columns) will you use?


> Tip: You do not need to fully execute the analysis now, but you should articulate how you would test it.
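
As one way to picture the correlation route, the sketch below uses invented numbers (not Bluesky data) and assumed column names `author_did` and `likeCount`. It aggregates posts to one row per author and computes a Pearson correlation between posting frequency and mean likes per post. Your own variables and test may well differ.

```python
import pandas as pd

# Toy data, invented for illustration only: one row per post.
posts = pd.DataFrame({
    "author_did": ["a", "a", "a", "a", "b", "b", "c"],
    "likeCount": [10, 14, 12, 12, 5, 7, 2],
})

# One row per author: number of posts and mean likes per post.
per_author = posts.groupby("author_did").agg(
    n_posts=("likeCount", "size"),
    mean_likes=("likeCount", "mean"),
).reset_index()

# Pearson correlation between posting frequency and mean likes per post.
r = per_author["n_posts"].corr(per_author["mean_likes"])
print(per_author)
print("Pearson r:", round(r, 3))
```

With only a handful of authors a coefficient like this is not meaningful; the point is the shape of the table you would need before running any test.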


# Section 2 — Endpoint Plan (Design Your Data Collection)

Identify the **Bluesky API endpoints** you will use and why they are suitable for testing your hypothesis.  
Link: https://docs.bsky.app/docs/category/http-reference

**Planned endpoints (examples; replace with your own):**
- `app.bsky.feed.searchPosts` — to collect posts matching a topic, hashtag, or keyword set.
- `app.bsky.actor.getProfiles` — to enrich authors with profile metadata (e.g., displayName, followersCount).
- `app.bsky.feed.getAuthorFeed` — to get posts authored by a specific actor (for longitudinal behavior).

For **each endpoint**, specify:
1. The key **request parameters** you will use, e.g., the search query `q` for `app.bsky.feed.searchPosts`, or the `actors` list of DIDs or handles for `app.bsky.actor.getProfiles`.
2. The **response objects/fields** you will extract, e.g., the `posts` array returned by `app.bsky.feed.searchPosts`.
3. Why these fields map to the variables in your hypothesis.
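
The three points above can be written down as a small plan before any request is sent. The sketch below is only an illustration: `ENDPOINT_PLAN` and `build_url` are hypothetical names, and the query values are placeholders. It also shows that a list parameter such as `actors` is sent as a repeated query key.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.bsky.app/xrpc"

# Hypothetical plan: endpoint -> parameters to send and response fields to keep.
ENDPOINT_PLAN = {
    "app.bsky.feed.searchPosts": {
        "params": {"q": "climate", "limit": 25},
        "fields": ["posts"],
    },
    "app.bsky.actor.getProfiles": {
        # `actors` is a list; it is sent as a repeated query parameter.
        "params": {"actors": ["bsky.app", "skyfeed.xyz"]},
        "fields": ["profiles"],
    },
}

def build_url(endpoint, params):
    """Return the full GET URL; list values become repeated keys."""
    return f"{BASE_URL}/{endpoint}?{urlencode(params, doseq=True)}"

for name, spec in ENDPOINT_PLAN.items():
    print(build_url(name, spec["params"]))
```

Printing the URLs first is a cheap way to confirm that parameter names match the endpoint reference.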

## Reliability and Bias 
Discuss how the data might be **reliable** and **unreliable**. Consider:
- Missingness or unavailable fields; rate limits; unauthenticated vs authenticated access.
- Bot/spam accounts, deleted posts, or moderation effects.
- Ethical considerations and terms of service (collect only what you need; avoid sensitive data).
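
Rate limits are one reliability concern that can be handled in code. The sketch below is a minimal retry helper, assuming the server signals rate limiting with HTTP status 429; `get_with_backoff` is a hypothetical name, and the delays are arbitrary choices rather than values from the Bluesky documentation.

```python
import time

def get_with_backoff(fetch, max_tries=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it returns a non-429 response or tries run out.

    fetch is any zero-argument callable returning an object with a
    .status_code attribute, for example:
        lambda: requests.get(url, params=params, timeout=30)
    """
    resp = None
    for attempt in range(max_tries):
        resp = fetch()
        if resp.status_code != 429:
            return resp
        if attempt < max_tries - 1:
            # Exponential backoff between attempts: 1s, 2s, 4s, ...
            sleep(base_delay * 2 ** attempt)
    # Still rate limited after max_tries; return the last response to the caller.
    return resp
```

Passing the request as a callable keeps the helper independent of any particular HTTP library and makes it easy to try out without network access.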

## Limitations
List any **caveats** in the response objects (e.g., fields not guaranteed, delayed counts, missing information) that could affect your analysis.
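
One way to back up a statement about missing fields is to measure them. The sketch below uses invented records in place of flattened API responses and reports the share of missing values per column.

```python
import pandas as pd

# Toy records standing in for flattened API responses; None marks a missing field.
rows = [
    {"did": "did:example:1", "displayName": "A", "followersCount": 10},
    {"did": "did:example:2", "displayName": None, "followersCount": 3},
    {"did": "did:example:3", "displayName": None, "followersCount": None},
]
df = pd.DataFrame(rows)

# Share of missing values per column, highest first.
missing_share = df.isna().mean().sort_values(ascending=False)
print(missing_share)
```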

# Section 3 — Data Collection
Collect posts that match a query. Adjust `QUERY`, `MAX_POSTS`, and any filters your hypothesis requires.


In [1]:
# imports
import requests 
import time
import json as js
import pandas as pd

BASE_URL = "https://api.bsky.app/xrpc"

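## Data Collection (Endpoint 1): 
e.g. `app.bsky.feed.searchPosts`  
Flatten the key fields of the returned objects (PostView objects for `searchPosts`) into one row per item. The example cells below call `app.bsky.feed.getSuggestedFeeds` instead, so they flatten the items under `feeds`.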

In [12]:
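# To search posts instead, switch to the commented-out endpoint and set "q" in params.
#endpoint = f"{BASE_URL}/app.bsky.feed.searchPosts"
endpoint = f"{BASE_URL}/app.bsky.feed.getSuggestedFeeds"
headers = {"User-Agent": "EMAT-Teaching/1.0 (+contact@example.com)"}
# Request parameters; left empty here, so the API defaults apply.
params = {}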

resp = requests.get(endpoint, params=params, headers=headers, timeout=30)

print("Status:", resp.status_code)

data = resp.json()

print("Top-level keys:", list(data.keys()))

Status: 200
Top-level keys: ['feeds', 'cursor']
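
The `cursor` key in the response suggests the result is paginated: passing the returned value back as the `cursor` parameter should yield the next page. The sketch below shows that loop in a network-free form. `collect_items` and `fetch_page` are hypothetical names; in the notebook, `fetch_page` would wrap the `requests.get` call and add `cursor` to `params` when it is not `None`.

```python
def collect_items(fetch_page, items_key, max_items=200):
    """Follow `cursor` values until none is returned or max_items is reached.

    fetch_page(cursor) must return the parsed JSON dict for one page;
    cursor is None for the first request.
    """
    items, cursor = [], None
    while len(items) < max_items:
        page = fetch_page(cursor)
        batch = page.get(items_key, [])
        items.extend(batch)
        cursor = page.get("cursor")
        # Stop when there is no next page or the page came back empty.
        if not cursor or not batch:
            break
    return items[:max_items]

# Offline demo with two fake pages (no network access needed).
FAKE_PAGES = {
    None: {"feeds": [{"uri": "feed-1"}, {"uri": "feed-2"}], "cursor": "page-2"},
    "page-2": {"feeds": [{"uri": "feed-3"}]},
}
demo = collect_items(lambda cursor: FAKE_PAGES[cursor], "feeds")
print([f["uri"] for f in demo])
```

The `max_items` cap keeps a collection run bounded, which also helps with rate limits.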


In [17]:
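# This response has no "posts" key (its top-level keys are 'feeds' and 'cursor'),
# so the default empty list is printed.
posts = data.get("posts", [])
print(posts)

# getSuggestedFeeds returns its items under "feeds".
feeds = data.get("feeds", [])
#print(feeds)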


[]


In [18]:
## Flatten the items returned by the endpoint (here: the feeds list)
rows = []
for p in feeds:
    #print(js.dumps(p, indent=2))
    # "creator" may be missing; fall back to an empty dict so .get() cannot fail on None
    creator = p.get("creator") or {}
    stats = {
        "post_uri": p.get("uri"),
        "post_cid": p.get("cid"),
        # Uncomment these when flattening PostView objects (e.g. from searchPosts):
        #"text": p.get("record", {}).get("text"),
        #"likeCount": p.get("likeCount"),
        #"repostCount": p.get("repostCount"),
        "creator_did": creator.get("did"),
        "creator_handle": creator.get("handle"),
        "creator_displayName": creator.get("displayName"),
    }
    rows.append(stats)
posts_df = pd.DataFrame(rows)
posts_df.head(5)
## my endpoint 1 dataframe is posts_df

Unnamed: 0,post_uri,post_cid,creator_did,creator_handle,creator_displayName
0,at://did:plc:z72i7hdynmk6r22z27h6tvur/app.bsky...,bafyreievgu2ty7qbiaaom5zhmkznsnajuzideek3lo7e6...,did:plc:z72i7hdynmk6r22z27h6tvur,bsky.app,Bluesky
1,at://did:plc:z72i7hdynmk6r22z27h6tvur/app.bsky...,bafyreigreonzn577vy6i4qh2so7aqfjztqrrj4jpssg2j...,did:plc:z72i7hdynmk6r22z27h6tvur,bsky.app,Bluesky
2,at://did:plc:tenurhgjptubkk5zf5qhi3og/app.bsky...,bafyreifegrnk7edkfbomkhp3q7prqovpmn66sku63owr3...,did:plc:tenurhgjptubkk5zf5qhi3og,skyfeed.xyz,Sky Feeds
3,at://did:plc:jfhpnnst6flqway4eaeqzj2a/app.bsky...,bafyreihkf7336jzjp6o3qqfmah34jltrcytonakhnq6gi...,did:plc:jfhpnnst6flqway4eaeqzj2a,bossett.social,Bossett
4,at://did:plc:y7crv2yh74s7qhmtx3mvbgv5/app.bsky...,bafyreif6iro4tgb2wtdjbyi6yawssengptawfixxcwx32...,did:plc:y7crv2yh74s7qhmtx3mvbgv5,bsky.art,@bsky.art


## Data Collection (Endpoint 2): 

e.g. `app.bsky.actor.getProfiles`
- Enrich the post data with profile attributes (followers count, display name, etc.).  
- We gather unique author identifiers (`did`) from the posts and request them in batches.
- NOTE: Will this be a for loop? The example cell below uses one, calling `app.bsky.actor.getProfile` once per `did`.
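
The batching idea in the list above can be sketched as follows. This is an illustration, not the notebook's method: `chunked` and `fetch_profiles_batched` are hypothetical helpers, `get_json` stands for whatever function performs the request and returns parsed JSON, and the batch size of 25 is an assumption to verify against the `app.bsky.actor.getProfiles` reference.

```python
def chunked(values, size):
    """Yield consecutive slices of `values` with at most `size` items each."""
    for start in range(0, len(values), size):
        yield values[start:start + size]

def fetch_profiles_batched(dids, get_json, batch_size=25):
    """Fetch profiles in batches; get_json(params) returns a parsed JSON dict."""
    profiles = []
    for batch in chunked(dids, batch_size):
        # A list of tuples makes `actors` a repeated query parameter.
        params = [("actors", d) for d in batch]
        profiles.extend(get_json(params).get("profiles", []))
    return profiles

# Offline demo: a fake get_json that echoes the requested DIDs back as profiles.
def fake_get_json(params):
    return {"profiles": [{"did": did} for _, did in params]}

demo_dids = [f"did:example:{i}" for i in range(60)]
demo_profiles = fetch_profiles_batched(demo_dids, fake_get_json)
print(len(demo_profiles))
```

With `requests`, `get_json` could be a small wrapper around `requests.get(f"{BASE_URL}/app.bsky.actor.getProfiles", params=params, timeout=30).json()`. So the answer is still a loop, but over batches rather than single DIDs.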


In [21]:
## Get profile data for every creator found in the previous step
# Unique creator identifiers (DIDs)
unique_dids = posts_df["creator_did"].dropna().unique().tolist()
#print(unique_dids)

# Fetch one profile per DID
all_profiles = []
for d in unique_dids:
    params = {"actor": d}
    r = requests.get(f"{BASE_URL}/app.bsky.actor.getProfile", params=params, timeout=30)
    if r.status_code != 200:
        # Skip failed lookups instead of appending a row of None values
        continue
    data = r.json()
    #print(js.dumps(data, indent=2))

    # Flatten the profile fields we need and append them to the list
    all_profiles.append({
        "did": data.get("did"),
        "handle": data.get("handle"),
        "displayName": data.get("displayName"),
        "followersCount": data.get("followersCount"),
        "followsCount": data.get("followsCount"),
        "postsCount": data.get("postsCount"),
        "createdAt": data.get("createdAt"),
        "description": data.get("description"),
    })

all_profiles_df = pd.DataFrame(all_profiles)
# One request per DID, so this can take a while for long lists!
all_profiles_df.head(5)

Unnamed: 0,did,handle,displayName,followersCount,followsCount,postsCount,createdAt,description
0,did:plc:z72i7hdynmk6r22z27h6tvur,bsky.app,Bluesky,29606723,3,681,2023-04-12T04:53:57.057Z,official Bluesky account (check username👆)\n\n...
1,did:plc:tenurhgjptubkk5zf5qhi3og,skyfeed.xyz,Sky Feeds,8199,2,50,2023-05-20T12:29:20.940Z,A collection of custom feeds to enhance your B...
2,did:plc:jfhpnnst6flqway4eaeqzj2a,bossett.social,Bossett,10224,931,19693,2023-05-27T07:05:12.214Z,Profile labeller: @profile-labels.bossett.soci...
3,did:plc:y7crv2yh74s7qhmtx3mvbgv5,bsky.art,@bsky.art,32794,1480,415,2023-05-21T14:29:53.828Z,"Artists for your Skyline!\nFor more info, clic..."
4,did:plc:kkf4naxqmweop7dv4l2iqqf5,aendra.com,ændra.,39653,945,8440,2023-05-04T16:59:41.121Z,"Creator of 📰 News feeds, @xblock.aendra.dev, @..."


# Section 4 — Build DataFrames

Use a pandas method to combine your DataFrames. Use your own endpoints and dataframes. Adjust based on your plan:
- **merge** on a key (`author_did`), or
- **concat** to stack rows from multiple endpoints, or
- **join** to add columns using an index.
- **Wrangling** (select, clean, sort)
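
For comparison with `merge`, the sketch below shows `concat`, `join`, and a short wrangling chain on invented DataFrames. All column names and values are placeholders chosen for illustration.

```python
import pandas as pd

posts_a = pd.DataFrame({"author_did": ["d1", "d2"], "likeCount": [3, 5]})
posts_b = pd.DataFrame({"author_did": ["d3"], "likeCount": [8]})
profiles = pd.DataFrame({"did": ["d1", "d2"], "followersCount": [100, 250]})

# concat: stack rows from two collections that share the same columns.
all_posts = pd.concat([posts_a, posts_b], ignore_index=True)

# join: add columns by matching a left column against the right index.
joined = all_posts.join(profiles.set_index("did"), on="author_did")

# Wrangling: select columns, drop rows without a profile, sort by likes.
tidy = (
    joined[["author_did", "likeCount", "followersCount"]]
    .dropna(subset=["followersCount"])
    .sort_values("likeCount", ascending=False)
    .reset_index(drop=True)
)
print(tidy)
```

Here `d3` has no profile, so its `followersCount` is `NaN` after the join and the row is dropped in the wrangling step.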

  


In [22]:
# Classic pandas stitch:
# merge joins rows from the two dataframes based on matching key values.
posts_enriched = posts_df.merge(
    # add_prefix adds "author_" to every column name in all_profiles_df.
    # This avoids name collisions (both dataframes could have handle, displayName)
    # and makes the origin obvious: every profile column now starts with author_.
    all_profiles_df.add_prefix("author_"),
    # left_on="creator_did": use posts_df["creator_did"] as the join key on the left.
    left_on="creator_did",
    # right_on="author_did": use the prefixed key from the right dataframe (formerly did).
    right_on="author_did",
    # how="left": a left join. Keep every row from posts_df,
    # even if there is no matching profile. If a profile is missing,
    # the author columns become NaN.
    # This is what you want for enrichment: don't drop rows just because the profile lookup failed.
    how="left"
)

posts_enriched.head(5)

Unnamed: 0,post_uri,post_cid,creator_did,creator_handle,creator_displayName,author_did,author_handle,author_displayName,author_followersCount,author_followsCount,author_postsCount,author_createdAt,author_description
0,at://did:plc:z72i7hdynmk6r22z27h6tvur/app.bsky...,bafyreievgu2ty7qbiaaom5zhmkznsnajuzideek3lo7e6...,did:plc:z72i7hdynmk6r22z27h6tvur,bsky.app,Bluesky,did:plc:z72i7hdynmk6r22z27h6tvur,bsky.app,Bluesky,29606723,3,681,2023-04-12T04:53:57.057Z,official Bluesky account (check username👆)\n\n...
1,at://did:plc:z72i7hdynmk6r22z27h6tvur/app.bsky...,bafyreigreonzn577vy6i4qh2so7aqfjztqrrj4jpssg2j...,did:plc:z72i7hdynmk6r22z27h6tvur,bsky.app,Bluesky,did:plc:z72i7hdynmk6r22z27h6tvur,bsky.app,Bluesky,29606723,3,681,2023-04-12T04:53:57.057Z,official Bluesky account (check username👆)\n\n...
2,at://did:plc:tenurhgjptubkk5zf5qhi3og/app.bsky...,bafyreifegrnk7edkfbomkhp3q7prqovpmn66sku63owr3...,did:plc:tenurhgjptubkk5zf5qhi3og,skyfeed.xyz,Sky Feeds,did:plc:tenurhgjptubkk5zf5qhi3og,skyfeed.xyz,Sky Feeds,8199,2,50,2023-05-20T12:29:20.940Z,A collection of custom feeds to enhance your B...
3,at://did:plc:jfhpnnst6flqway4eaeqzj2a/app.bsky...,bafyreihkf7336jzjp6o3qqfmah34jltrcytonakhnq6gi...,did:plc:jfhpnnst6flqway4eaeqzj2a,bossett.social,Bossett,did:plc:jfhpnnst6flqway4eaeqzj2a,bossett.social,Bossett,10224,931,19693,2023-05-27T07:05:12.214Z,Profile labeller: @profile-labels.bossett.soci...
4,at://did:plc:y7crv2yh74s7qhmtx3mvbgv5/app.bsky...,bafyreif6iro4tgb2wtdjbyi6yawssengptawfixxcwx32...,did:plc:y7crv2yh74s7qhmtx3mvbgv5,bsky.art,@bsky.art,did:plc:y7crv2yh74s7qhmtx3mvbgv5,bsky.art,@bsky.art,32794,1480,415,2023-05-21T14:29:53.828Z,"Artists for your Skyline!\nFor more info, clic..."
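
A left join silently leaves `NaN` where a profile lookup failed, so it can help to check the result. The sketch below, on invented data, uses two optional `merge` parameters: `validate` to assert that each DID appears at most once on the profile side, and `indicator` to list rows without a match.

```python
import pandas as pd

left = pd.DataFrame({"creator_did": ["d1", "d1", "d2", "d9"]})
right = pd.DataFrame({
    "author_did": ["d1", "d2"],
    "author_followersCount": [100, 250],
})

checked = left.merge(
    right,
    left_on="creator_did",
    right_on="author_did",
    how="left",
    validate="many_to_one",  # raises MergeError if a DID repeats on the right
    indicator=True,          # adds a _merge column: "both" or "left_only"
)

# DIDs that had no matching profile row.
unmatched = checked.loc[checked["_merge"] == "left_only", "creator_did"].tolist()
print(unmatched)
```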


# Section 5 — Conclusion

Describe any patterns you observe in the collected data and how they relate to your hypothesis. <br>
Describe challenges you faced.
