# Section 1 - API Data Report & Hypothesis

**Report Title:** API Data Report <br>
**Name:** Morgan Rawski <br>
**Date:** 10/8/2025

### Hypothesis
I believe that posts about dogs receive more likes than posts about cats because there are more dog owners than cat owners in the United States.

### Theoretical Rationale
I wanted to test this hypothesis because I have a dog at home and tend to like dogs more than cats. I wanted to see whether other online users share that preference, since whether people like cats or dogs more is a common debate. According to the American Veterinary Medical Association, there are more dog owners than cat owners in the United States, so this difference in ownership may influence online engagement with posts about each animal (AVMA, 2024, https://www.avma.org/resources-tools/reports-statistics/us-pet-ownership-statistics).

### Statistical Application
This hypothesis can be tested statistically by comparing the two groups of posts: checking whether the mean like count for dog posts is higher than the mean like count for cat posts. The columns that will be used are the author handle, the author `displayName`, the text of the post, and the `likeCount` of the post.
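One possible way to put a number on the comparison is a permutation test on the difference in mean likes. This is a swapped-in technique rather than the method used in this report, and the function name `permutation_test` and the two short lists are illustrative assumptions (the values mirror the first ten rows previewed in Section 3, not the full 100-post samples).

```python
import random
from statistics import fmean


def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """One-sided permutation test for mean(group_a) > mean(group_b).

    Returns (observed_difference, estimated_p_value).
    """
    rng = random.Random(seed)
    observed = fmean(group_a) - fmean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_large = 0
    for _ in range(n_permutations):
        # Shuffle the pooled likes and split them into two relabeled groups
        rng.shuffle(pooled)
        diff = fmean(pooled[:n_a]) - fmean(pooled[n_a:])
        if diff >= observed:
            at_least_as_large += 1
    # Add-one correction keeps the estimated p-value strictly above zero
    p_value = (at_least_as_large + 1) / (n_permutations + 1)
    return observed, p_value


# Illustrative like counts only (assumption: not the full collected samples)
dog_likes = [0, 1, 1, 0, 1, 2, 0, 0, 2, 0]
cat_likes = [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]

diff, p = permutation_test(dog_likes, cat_likes)
print(f"difference in means: {diff:.2f}, one-sided p-value: {p:.3f}")
```

A permutation test makes no assumption that like counts are normally distributed, which seems reasonable here because most posts have zero or one like and a few may have many.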

# Section 2 - Endpoint Plan (Design Your Data Collection)
Identify the **Bluesky API endpoints** you will use and why they are suitable for testing your hypothesis.
Link: https://docs.bsky.app/docs/category/http-reference

**Planned endpoints:** 
- `app.bsky.feed.searchPosts`: I would call this endpoint twice to collect posts matching a topic, hashtag, or keyword set. The first call uses the query for dog posts with a limit of 100 posts; the second uses the query for cat posts with a limit of 100 posts. From each response I will extract the `posts` list and compute the mean `likeCount` for each set of posts. This is the key variable for my hypothesis because it gives me all of the material to compare the two categories of posts.
- `app.bsky.feed.getLikes`: to see who liked a given post. If I need more detail than the `likeCount` value, I would pass a post's `uri` as the parameter and look at the likes from that end. For that I could extract the `likes` list and look at the `actor` on each like for each post topic. This could help my hypothesis because it could give me more insight into how the likes were distributed for each post and category.
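Because a single call returns at most the requested `limit`, and the response carries a `cursor` key next to `posts`, a larger sample could be gathered by following the cursor across pages. The sketch below assumes a hypothetical helper `collect_posts` and uses a stand-in `fake_fetch` function with made-up data so that it runs offline; it is not the collection code used in this report.

```python
def collect_posts(fetch_page, query, max_posts=300, page_size=100):
    """Follow the `cursor` value across pages until max_posts is reached.

    fetch_page(params) must return a dict shaped like the searchPosts
    response: {"posts": [...], "cursor": "..."} (cursor may be absent).
    """
    posts, cursor = [], None
    while len(posts) < max_posts:
        params = {"q": query, "limit": min(page_size, max_posts - len(posts))}
        if cursor:
            params["cursor"] = cursor
        page = fetch_page(params)
        batch = page.get("posts", [])
        posts.extend(batch)
        cursor = page.get("cursor")
        if not batch or not cursor:
            break  # no more results to request
    return posts[:max_posts]


# Stand-in for the network call (hypothetical: pretends 250 posts exist)
def fake_fetch(params):
    start = int(params.get("cursor", 0))
    end = min(start + params["limit"], 250)
    posts = [{"uri": f"at://example/{i}", "likeCount": i % 3} for i in range(start, end)]
    if end < 250:
        return {"posts": posts, "cursor": str(end)}
    return {"posts": posts}


dog_posts = collect_posts(fake_fetch, "dogs", max_posts=300)
print(len(dog_posts))
```

In the notebook, `fetch_page` could wrap the existing `requests.get(endpoint, params=params, headers=headers, timeout=30).json()` call, ideally with a short `time.sleep` between pages to stay within rate limits.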

## Reliability and Bias
Discuss how the data might be **reliable** and **unreliable**. Consider:

- Missingness or unavailable fields; rate limits; unauthenticated vs authenticated access.
- Bot/spam accounts, deleted posts, or moderation effects.
- Ethical considerations and terms of service (collect only what you need; avoid sensitive data).

I think that this data could be reliable because Bluesky tracks trends and posts live, so the endpoint can give me up-to-date `likeCount` values and average likes per kind of post. The searchPosts endpoint also supports keyword search, which makes it easy to find and organize the data. A downside is that I don't know how the endpoint selects the posts it returns, so a post could mention cats or dogs while the post or video itself is not really about either. There is also the question of how broad the search is: if users write "puppy" or "kitty" instead of "dog" or "cat", I don't know whether the endpoint includes those posts. Fake or private accounts raise a similar concern, because if I am trying to get a true representation of all posts related to cats versus dogs, these accounts could keep the endpoint from returning the most accurate data.

## Limitations
List any **caveats** in the response objects (e.g., fields not guaranteed, delayed counts, missing information) that could affect your analysis.

I think that one of the biggest caveats for these response objects is that `likeCount` updates continuously, so my data can be outdated for the current level of engagement unless the code is rerun. Other caveats include sampling bias, the lack of location control, and issues with keyword filtering. Sampling bias is a concern because the searchPosts endpoint does not return every post on the network, so there is no guarantee that it finds all of the posts related to cats or dogs. This leads into the issue of keyword filtering and missing information: there is no guarantee that the endpoint returns posts that use related keywords such as "puppy" or "kitty", which means I could be missing key parts of the post population. Lastly, my hypothesis rests on there being more dog owners than cat owners in the U.S., but the searchPosts endpoint does not provide a way to filter by location, so I don't know where the users behind these posts are from.
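One way to reduce the keyword-filtering caveat would be to run the search once per related keyword (for example "dogs" and "puppy") and merge the results, dropping posts that match more than one keyword. The sketch below assumes a hypothetical helper `merge_unique_posts` and made-up posts; it relies only on the `uri` field that appears in each returned post.

```python
def merge_unique_posts(*post_lists):
    """Combine several lists of posts, keeping the first copy of each `uri`."""
    seen, merged = set(), []
    for posts in post_lists:
        for post in posts:
            uri = post.get("uri")
            if uri in seen:
                continue  # already collected under another keyword
            seen.add(uri)
            merged.append(post)
    return merged


# Hypothetical results for two dog-related keywords; one post matches both
dogs_hits = [
    {"uri": "at://example/1", "likeCount": 2},
    {"uri": "at://example/2", "likeCount": 0},
]
puppy_hits = [
    {"uri": "at://example/2", "likeCount": 0},
    {"uri": "at://example/3", "likeCount": 5},
]

dog_posts = merge_unique_posts(dogs_hits, puppy_hits)
print([p["uri"] for p in dog_posts])
```

Deduplicating on `uri` keeps a post that mentions both "dogs" and "puppy" from being counted twice in the mean.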

# Section 3 - Data Collection

Collect posts that match a query. Adjust `QUERY`, `MAX_POSTS`, and any filters your hypothesis requires.

In [17]:
# imports
import requests
import time
import json as js
import pandas as pd

BASE_URL = "https://api.bsky.app/xrpc"

## Data Collection (Endpoint 1):

`app.bsky.feed.searchPosts` 

- Using the searchPosts endpoint to gather the data for the posts that include the keyword of dogs
- Looking to get the author, text, and likeCount for each post to later compare with the cat posts

In [18]:
endpoint = f"{BASE_URL}/app.bsky.feed.searchPosts"
headers = {"User-Agent": "EMAT-Teaching/1.0 (+contact@example.com)"}
params = {
    "q": "dogs",
    "limit": 100, 
}

resp = requests.get(endpoint, params=params, headers=headers, timeout=30)

print("Status:", resp.status_code)

data = resp.json()

print("Top-level keys:", list(data.keys()))

Status: 200
Top-level keys: ['posts', 'cursor']


In [19]:
posts_dogs = data.get("posts", [])
print(posts_dogs)

feeds = data.get("feeds", [])
# print(feeds)

[{'uri': 'at://did:plc:355ajandtdzkot5vboahjdp4/app.bsky.feed.post/3m4745fwzj22v', 'cid': 'bafyreielradxkxaufoeev4nf5wtzkki4qv4o6cs3ogjdkriohwg4grimcm', 'author': {'did': 'did:plc:355ajandtdzkot5vboahjdp4', 'handle': 'mims-news.bsky.social', 'displayName': '', 'avatar': 'https://cdn.bsky.app/img/avatar/plain/did:plc:355ajandtdzkot5vboahjdp4/bafkreieht4fwknkv3tuoiysj7tqmrupiakymub7serkdztha5ippbaj56e@jpeg', 'associated': {'activitySubscription': {'allowSubscriptions': 'followers'}}, 'labels': [], 'createdAt': '2025-06-27T18:22:35.133Z'}, 'record': {'$type': 'app.bsky.feed.post', 'createdAt': '2025-10-27T19:01:27.021Z', 'langs': ['en'], 'reply': {'parent': {'cid': 'bafyreibsxfjsfmvdm3cwu55hrcuhhg5gu5n4jlvsfbxlrsfpmakp2gbj7a', 'uri': 'at://did:plc:udnac33pmf2iwcblpeai5a5p/app.bsky.feed.post/3m46nlh4kyc2e'}, 'root': {'cid': 'bafyreibsxfjsfmvdm3cwu55hrcuhhg5gu5n4jlvsfbxlrsfpmakp2gbj7a', 'uri': 'at://did:plc:udnac33pmf2iwcblpeai5a5p/app.bsky.feed.post/3m46nlh4kyc2e'}}, 'text': 'This is about

In [20]:
## Flatten the posts
# print(posts_dogs)
rows = []
for p in posts_dogs:
    # print(js.dumps(p, indent=2))
    author = p.get("author", {})  # default to {} so a missing author does not raise
    stats = {
        #"post_uri": p.get("uri"),
        "author_handle": author.get("handle"),
        "author_displayName": author.get("displayName"),
        #"post_cid": p.get("cid"),
        "text": p.get("record", {}).get("text"),
        "likeCount": p.get("likeCount"),
        #"repostCount": p.get("repostCount"),
        #"author_did": author.get("did"),
    }
    rows.append(stats)
posts_dogs_df = pd.DataFrame(rows)
posts_dogs_df.head(20)
## my endpoint 1 dataframe is posts_dogs_df

Unnamed: 0,author_handle,author_displayName,text,likeCount
0,mims-news.bsky.social,,"This is about FERREL DOGS,\n\nBut when ferrel ...",0
1,karent.bsky.social,,"Dogs, not just terriers as I remember 🤪😂! 🩵",1
2,dogeveryhour.bsky.social,Dog every hour,🐶 Hourly Dog Picture! 🐶 #dogs #dogsky #dogsofb...,1
3,yulehorn.bsky.social,Yule Horn,"I have to know, do the Jonas Brothers put cats...",0
4,joygregory.bsky.social,,Experience tells me that Danielle Smith also d...,1
5,peterashman.bsky.social,Peter Ashman,Barney’s getting ready for Halloween 🎃\n #dogs...,2
6,mpulskamp.bsky.social,,About an hour’s drive S/E of Sacramento. \nJus...,0
7,diagonist.bsky.social,The Linker,"Condolences. We've lost many a cat over time,...",0
8,miigumi.bsky.social,michelle ☁️,Hi I’m an asian artist and ux/ui designer with...,2
9,jpcentresouth.bsky.social,JP Centre/South Main Streets,Thanks to all the dogs and dog owners who part...,0


## Data Collection (Endpoint 2):

`app.bsky.feed.searchPosts`

- Using the searchPosts endpoint again to gather the data for the posts that include the keyword of cats
- Looking to get the author, text, and likeCount for each post to later compare with the dog posts

In [21]:
endpoint = f"{BASE_URL}/app.bsky.feed.searchPosts"
headers = {"User-Agent": "EMAT-Teaching/1.0 (+contact@example.com)"}
params = {
    "q": "cats",
    "limit": 100, 
}

resp = requests.get(endpoint, params=params, headers=headers, timeout=30)

print("Status:", resp.status_code)

data = resp.json()

print("Top-level keys:", list(data.keys()))

posts_cats = data.get("posts", [])
print(posts_cats)

feeds = data.get("feeds", [])
# print(feeds)

Status: 200
Top-level keys: ['posts', 'cursor']
[{'uri': 'at://did:plc:34dyig3f2puz4vm5krkedocw/app.bsky.feed.post/3m4746qovxs2s', 'cid': 'bafyreia3hshqumtupllusezqclc3437v4nqkyxiy4gwpl3v2bialir6nti', 'author': {'did': 'did:plc:34dyig3f2puz4vm5krkedocw', 'handle': 'tolstoy-fangirl.bsky.social', 'displayName': 'tolstoy-fangirl', 'avatar': 'https://cdn.bsky.app/img/avatar/plain/did:plc:34dyig3f2puz4vm5krkedocw/bafkreie33rta4cdjdqmj6vtc7huauvo73uihzhjoakwmumxc3a6dklubza@jpeg', 'associated': {'activitySubscription': {'allowSubscriptions': 'followers'}}, 'labels': [], 'createdAt': '2024-11-29T22:06:53.146Z'}, 'record': {'$type': 'app.bsky.feed.post', 'createdAt': '2025-10-27T19:02:11.843Z', 'embed': {'$type': 'app.bsky.embed.external', 'external': {'description': 'ALT: a woman with gray hair says useful idiot in front of her face', 'thumb': {'$type': 'blob', 'ref': {'$link': 'bafkreifoznhpcaicmdebjvev6bownazmh5pnt5oxh7fhuqkltovinapy5a'}, 'mimeType': 'image/jpeg', 'size': 756781}, 'title': '

In [22]:
## Flatten the posts
# print(posts_cats)
rows = []
for p in posts_cats:
    # print(js.dumps(p, indent=2))
    author = p.get("author", {})  # default to {} so a missing author does not raise
    stats = {
        #"post_uri": p.get("uri"),
        "author_handle": author.get("handle"),
        "author_displayName": author.get("displayName"),
        #"post_cid": p.get("cid"),
        "text": p.get("record", {}).get("text"),
        "likeCount": p.get("likeCount"),
        #"repostCount": p.get("repostCount"),
        #"author_did": author.get("did"),
    }
    rows.append(stats)
posts_cats_df = pd.DataFrame(rows)
posts_cats_df.head(20)
## my endpoint 2 dataframe is posts_cats_df

Unnamed: 0,author_handle,author_displayName,text,likeCount
0,tolstoy-fangirl.bsky.social,tolstoy-fangirl,"How about ""not waving your finger in someone's...",0
1,grahamfluster.bsky.social,Graham Fluster Writer,"Puffin says ""Sorry, the stairs are out of orde...",0
2,linty.bsky.social,linty 💨🔞 (comms open),little animals are quick to learn the easiest ...,1
3,kronn.bsky.social,Kurt Ronn,Mazie cannot believe anyone still supports Tru...,0
4,justcallmemike178.bsky.social,JustCallMeMike,"Sorry to hear that, takes time but both my cat...",0
5,markhathaway1.bsky.social,Mark Hathaway,The cats are very smart. They choose where to ...,0
6,ireneadler45.bsky.social,,Best cats 🧹,0
7,mrpussy.xyz,mr pussy,ugh see this possibility sucks because I have ...,0
8,abc10news.bsky.social,ABC 10News,Forty cats were rescued from a San Diego apart...,0
9,zadescrivner.bsky.social,Zade Scrivner,#NationalBlackCatDay #BlackCatPhotography #Cat...,0


# Section 4 — Build DataFrames

Use a pandas method to combine your DataFrames. Use your own endpoints and dataframes. Adjust based on your plan:

- **merge** on a key (`author_did`), or
- **concat** to stack rows from multiple endpoints, or
- **join** to add columns using an index.
- **Wrangling** (select, clean, sort)

In [23]:
# Classic pandas stitch:
# concat with axis=1 places the two dataframes side by side, aligned on the row index
# (dog post columns on the left, cat post columns on the right).
posts_combined = pd.concat([posts_dogs_df, posts_cats_df], axis=1)

posts_combined.head(20)

Unnamed: 0,author_handle,author_displayName,text,likeCount,author_handle.1,author_displayName.1,text.1,likeCount.1
0,mims-news.bsky.social,,"This is about FERREL DOGS,\n\nBut when ferrel ...",0,tolstoy-fangirl.bsky.social,tolstoy-fangirl,"How about ""not waving your finger in someone's...",0
1,karent.bsky.social,,"Dogs, not just terriers as I remember 🤪😂! 🩵",1,grahamfluster.bsky.social,Graham Fluster Writer,"Puffin says ""Sorry, the stairs are out of orde...",0
2,dogeveryhour.bsky.social,Dog every hour,🐶 Hourly Dog Picture! 🐶 #dogs #dogsky #dogsofb...,1,linty.bsky.social,linty 💨🔞 (comms open),little animals are quick to learn the easiest ...,1
3,yulehorn.bsky.social,Yule Horn,"I have to know, do the Jonas Brothers put cats...",0,kronn.bsky.social,Kurt Ronn,Mazie cannot believe anyone still supports Tru...,0
4,joygregory.bsky.social,,Experience tells me that Danielle Smith also d...,1,justcallmemike178.bsky.social,JustCallMeMike,"Sorry to hear that, takes time but both my cat...",0
5,peterashman.bsky.social,Peter Ashman,Barney’s getting ready for Halloween 🎃\n #dogs...,2,markhathaway1.bsky.social,Mark Hathaway,The cats are very smart. They choose where to ...,0
6,mpulskamp.bsky.social,,About an hour’s drive S/E of Sacramento. \nJus...,0,ireneadler45.bsky.social,,Best cats 🧹,0
7,diagonist.bsky.social,The Linker,"Condolences. We've lost many a cat over time,...",0,mrpussy.xyz,mr pussy,ugh see this possibility sucks because I have ...,0
8,miigumi.bsky.social,michelle ☁️,Hi I’m an asian artist and ux/ui designer with...,2,abc10news.bsky.social,ABC 10News,Forty cats were rescued from a San Diego apart...,0
9,jpcentresouth.bsky.social,JP Centre/South Main Streets,Thanks to all the dogs and dog owners who part...,0,zadescrivner.bsky.social,Zade Scrivner,#NationalBlackCatDay #BlackCatPhotography #Cat...,0


In [24]:
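# posts_combined has two columns named "likeCount" (dogs first, then cats),
# so this selection returns both and .mean() gives one average per column.
mean_likes = posts_combined["likeCount"].mean()
mean_likes.head()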

likeCount    2.51
likeCount    1.25
dtype: float64
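An alternative layout that keeps both groups in one DataFrame with clearly labeled averages is to stack the rows and add a `topic` column, then group by it. This is a sketch under assumptions: `dogs_df` and `cats_df` are small made-up stand-ins for `posts_dogs_df` and `posts_cats_df`, and the `topic` column name is introduced here for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for posts_dogs_df and posts_cats_df
dogs_df = pd.DataFrame({"author_handle": ["a", "b", "c"], "likeCount": [0, 1, 2]})
cats_df = pd.DataFrame({"author_handle": ["d", "e", "f", "g"], "likeCount": [0, 0, 1, 1]})

# Label each row with its topic, then stack the rows (axis=0 is the default)
posts_long = pd.concat(
    [dogs_df.assign(topic="dogs"), cats_df.assign(topic="cats")],
    ignore_index=True,
)

# One row per topic: number of posts, mean likeCount, and median likeCount
summary = posts_long.groupby("topic")["likeCount"].agg(["count", "mean", "median"])
print(summary)
```

With this shape the two groups do not need the same number of rows, and the column names stay unique, so the means are labeled by topic instead of appearing as two entries both called `likeCount`.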

# Section 5 — Conclusion

Describe any patterns you observe in the collected data and how they relate to your hypothesis. <br>
Describe challenges you faced.

After putting together these dataframes, I am not sure whether my hypothesis was correct. Because the Bluesky API returns live data, every time I rerun the collection cells the endpoint pulls new posts and the averages are recalculated from those new `likeCount` values. If I were to take this project further, I would need to make sure that the endpoint considered all variations of the cat and dog keywords, as I was worried that some posts were missed because they did not contain the exact keywords. I also learned that the searchPosts endpoint does not include location among its parameters, so I would need some other endpoint or data source to restrict the posts to users in the United States and support the reasoning behind my hypothesis.

With the data currently in my dataframes, `mean_likes` shows a higher average number of likes for dog posts than for cat posts: the first `likeCount` value (2.51) is the mean for the dog posts dataframe and the second (1.25) is the mean for the cat posts dataframe. The combined dataframe above shows the same pattern, with dog posts on the left and cat posts on the right, and the `likeCount` values are generally higher on the dog side. Representing this comparison was something I struggled with, because I didn't know how to show both averages in the same dataframe without merging the data together and then adding another row or column for the average of each `likeCount` column.

Without knowing whether these posts came from users in the United States, though, I cannot say for certain that my hypothesis was correct. I was testing whether dog posts receive more likes than cat posts because there are more dog owners in the U.S., and this data is not limited to U.S. users.
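Since the averages change on every rerun, one option for making a result repeatable would be to save each collection as a timestamped snapshot and analyze the saved file instead of the live response. The sketch below uses only the standard library; the helper names `save_snapshot` and `load_snapshot`, the file naming scheme, and the example post are assumptions for illustration.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def save_snapshot(posts, query, folder):
    """Write posts to a timestamped JSON file and return its path."""
    collected_at = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    path = Path(folder) / f"{query}_{collected_at}.json"
    payload = {"query": query, "collected_at": collected_at, "posts": posts}
    path.write_text(json.dumps(payload, ensure_ascii=False, indent=2), encoding="utf-8")
    return path


def load_snapshot(path):
    """Read back the list of posts from a snapshot file."""
    return json.loads(Path(path).read_text(encoding="utf-8"))["posts"]


# Hypothetical post, saved to a temporary folder for illustration
example_posts = [{"uri": "at://example/1", "likeCount": 2}]
with tempfile.TemporaryDirectory() as tmp:
    snapshot_path = save_snapshot(example_posts, "dogs", tmp)
    reloaded = load_snapshot(snapshot_path)

print(reloaded == example_posts)
```

Recording the collection time alongside the posts also documents how old the `likeCount` values are when the means are reported.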