### Drawing Snowball Samples From Facebook 'Seed' Page(s) 

If you are interested in analyzing social media, one of the main challenges is figuring out where to start!

Perhaps you want to identify which sources of information are influential within partisan Facebook pages or groups, uncover behavioral patterns that may indicate coordination by Facebook pages to advance partisan narratives, identify information echo chambers that are susceptible to partisan information (and disinformation), or extract topics that are being pushed, or are gaining traction, within partisan ecosystems.

These questions all require some way of identifying partisan ecosystems (ideologically like-minded Facebook pages and groups) for research and analysis purposes. Unfortunately, there is no census of Facebook pages and groups from which you could draw a random sample and reach generalizable conclusions. In the absence of a comprehensive sampling frame, you could use a form of 'snowball sampling', a non-random approach that relies on 'referrals' from identified sources, to find like-minded pages and groups.
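
As a rough illustration of the idea, snowball sampling can be sketched as a breadth-first walk over 'referrals' that stops after a fixed number of waves. The `snowball_sample` function and the referral data below are hypothetical and only meant to show the mechanics; the approach in this notebook corresponds to a single wave (accounts that shared content from the seed pages).

```python
from collections import deque

def snowball_sample(referrals, seeds, max_waves=1):
    """Collect accounts reachable from the seeds within max_waves referral steps."""
    sampled = set(seeds)
    frontier = deque((seed, 0) for seed in seeds)
    while frontier:
        account, wave = frontier.popleft()
        if wave == max_waves:
            continue  # do not follow referrals beyond the last wave
        for referred in referrals.get(account, []):
            if referred not in sampled:
                sampled.add(referred)
                frontier.append((referred, wave + 1))
    return sampled

# Hypothetical referral data: account -> accounts that shared its content
referrals = {
    "seed_page": ["group_a", "page_b"],
    "group_a": ["group_c"],
}
one_wave = snowball_sample(referrals, ["seed_page"], max_waves=1)
two_waves = snowball_sample(referrals, ["seed_page"], max_waves=2)
```

With one wave the sample contains the seed plus its direct sharers; a second wave would also pull in accounts that shared content from those sharers.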

To draw a snowball sample of Facebook pages and groups that may form a partisan ecosystem, we first need to identify 'seed' pages that have large numbers of followers and are understood to be representative of a partisan agenda. For example, if we wanted to study the 'Pro-Trump' ecosystem, a 'seed' would be Donald Trump's Facebook page. Once we have identified one or more seed pages, we then want to find pages and groups that have recently shared content from them and that, for the most part, are like-minded supporters and/or share similar political views.*

**This is not universally true: sometimes non-supporters share content from seed pages in order to condemn it, or media outlets share content because it is newsworthy. These non-supporter pages and groups need to be weeded out through Social Network Analysis and Community Detection techniques, which will be demonstrated in the next post.*

In this example, we are interested in analyzing social media behavior among pro-Georgia Republican groups engaged in political discourse ahead of the January 5, 2021 runoff election. We will draw a snowball sample to identify Pro-Republican Facebook pages and groups, with a specific goal of identifying pro-Georgia Republican Facebook groups for deeper analysis. We will start with three 'seed' pages: The Georgia Republican Party's Facebook page (130K+ followers), David Perdue's Facebook page (80K+) and Kelly Loeffler's Facebook page (35K+).  

To draw a snowball sample of Facebook pages and groups, we will rely on the CrowdTangle API* to find the last 1,000 shares of post content from these seed pages. Once we have found pages and groups that have shared content from these seed pages, we will upload the snowball pages and groups to lists in CrowdTangle ("Republican Snowball Pages" and "Republican Snowball Groups").

**CrowdTangle is an intermediary for accessing data from Facebook. You will need to create a CrowdTangle dashboard for this analysis, along with lists to which you will upload the batches of pages and groups you want to analyze. To use the CrowdTangle API, obtain the API token for the dashboard via "API Access" in the settings menu. If you don't have API access, ask the CrowdTangle administrator in your organization.*

Once these snowball pages and groups are in CrowdTangle, we grab posts from the sample groups to analyze behavior. Specifically, we'll grab the last 10,000 posts from a snowball sample of groups. Given the API rate limits imposed by CrowdTangle, we'll create a database in which to store the posts data obtained from sample groups, for efficiency and accessibility in future posts.

First, we'll import the packages needed for the script, add the API token from the CrowdTangle dashboard you created, and add a database connection to store the posts data we collect from sample groups.

In [10]:
import requests
import json
import pandas as pd
import collections
from datetime import datetime, timedelta 
import time
import pyodbc 
from sqlalchemy import create_engine

#crowdtangle token:
token = 'your_dashboard_token'

#database connection:
connection_string = 'your_database_connection_string'
engine = create_engine(connection_string) #the 'encoding' argument was removed in SQLAlchemy 2.0
db = engine.connect()

Then we will create a 'get_links' function (CrowdTangle Links API wrapper) to get the last 1,000 post shares from a 'seed' page.

In [2]:
def get_links(token, link, platforms='facebook', count=1000):
    #build the Links API request URL
    api_url_base = "https://api.crowdtangle.com/links?token="
    link_pre = '&link='
    count_pre = '&count='
    plat_pre = '&platforms='
    api_url = f'{api_url_base}{token}{link_pre}{link}{plat_pre}{platforms}{count_pre}{count}'
    response = requests.get(api_url)
    #return the parsed JSON on success, otherwise None
    if response.status_code == 200:
        return response.json()
    else:
        return None
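
The wrapper above concatenates the query string by hand. A minimal sketch of an alternative, assuming the same parameter names (`token`, `link`, `platforms`, `count`), is to let the standard library encode the parameters so that characters such as `/` in the link are escaped. The helper name `build_links_url` is hypothetical.

```python
from urllib.parse import urlencode

def build_links_url(token, link, platforms='facebook', count=1000):
    """Build a Links API URL with URL-encoded query parameters."""
    params = {'token': token, 'link': link, 'platforms': platforms, 'count': count}
    return 'https://api.crowdtangle.com/links?' + urlencode(params)

url = build_links_url('your_dashboard_token', 'facebook.com/GAGOP')
print(url)
```

Passing a `params` dictionary to `requests.get` would achieve the same encoding without building the URL yourself.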

Then we pass the link for each 'seed' page to the 'get_links' function to grab the last 1,000 post shares from that page.

In [3]:
gagop_snowball = get_links(token, link='facebook.com/GAGOP') #Georgia Republican Party 
loeffler_snowball = get_links(token, link='facebook.com/KellyLoefflerGA') #Kelly Loeffler 
perdue_snowball = get_links(token, link='facebook.com/perduesenate') #David Perdue 
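
Because 'get_links' returns None when the request does not succeed, indexing the result directly would raise an error after a failed call. The sketch below shows one way to guard against that; `extract_posts` is a hypothetical helper, and the mock responses only imitate the 'result' / 'posts' structure used in this notebook.

```python
def extract_posts(response):
    """Return the list of posts in a Links API response, or [] if the call failed."""
    if not response:
        return []
    return response.get('result', {}).get('posts', [])

# Mock responses shaped like the structure used in this notebook
ok_response = {'status': 200, 'result': {'posts': [{'account': {'name': 'Example Page'}}]}}
failed_response = None

print(len(extract_posts(ok_response)))
print(len(extract_posts(failed_response)))
```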

We can take a quick peek at the names of some of the pages and groups that were drawn in our snowball sample because they have shared content from the Georgia Republican Facebook page. 

In [30]:
gagop_snowball_sample = pd.DataFrame.from_dict(gagop_snowball['result']['posts'])
gagop_snowball_sample = pd.concat([gagop_snowball_sample.drop(['account'], axis=1), gagop_snowball_sample['account'].apply(pd.Series)], axis=1)
gagop_snowball_sample.groupby(['name', 'accountType']).size().to_frame().reset_index().rename(columns={0: 'shares'}).sort_values(by='shares', ascending=False).head(10)

| | name | accountType | shares |
|---|---|---|---|
| 51 | David Perdue | facebook_page | 424 |
| 109 | Law Enforcement Officers and patriots for Pres... | facebook_page | 41 |
| 89 | HOLD THE LINE January 5th | facebook_group | 37 |
| 87 | Gwinnett County Republican Party | facebook_page | 25 |
| 93 | Harris County, Georgia, Republican Party | facebook_page | 18 |
| 78 | Georgia Republican Party | facebook_page | 17 |
| 64 | Floyd County Republican Party | facebook_page | 14 |
| 139 | Pickens County Georgia Republican Party | facebook_page | 13 |
| 108 | Latinos For Trump -Georgia | facebook_page | 11 |
| 52 | Dear President Trump | facebook_group | 10 |


Looks like a pretty good snowball sample for studying Georgia Republican pages and groups!
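
The share counts above come from flattening the nested 'account' field and grouping by account. As a sketch of the same idea on made-up data, `pd.json_normalize` can flatten the nested dictionaries in one step; the resulting column names are prefixed with `account.`, which differs from the `apply(pd.Series)` approach used in the notebook.

```python
import pandas as pd

# Mock posts shaped like the 'posts' entries used above (values are made up)
posts = [
    {'message': 'm1', 'account': {'name': 'Page A', 'accountType': 'facebook_page'}},
    {'message': 'm2', 'account': {'name': 'Page A', 'accountType': 'facebook_page'}},
    {'message': 'm3', 'account': {'name': 'Group B', 'accountType': 'facebook_group'}},
]

# Nested keys become columns named 'account.name' and 'account.accountType'
flat = pd.json_normalize(posts)

# Count shares per account and sort from most to fewest shares
shares = (flat.groupby(['account.name', 'account.accountType'])
              .size()
              .reset_index(name='shares')
              .sort_values(by='shares', ascending=False))
print(shares)
```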

Next we create a 'prep_batch' function to prepare the pages and groups that have shared posts from these pages for batch upload to the CrowdTangle dashboard we created (be sure that you have created empty lists to store these pages and groups in your CrowdTangle dashboard!)

In [21]:
def prep_batch(data, atype='pages', minsize=0, listname='null'):
    #map the atype option to the accountType value used in the API response
    account_types = {'pages': 'facebook_page', 'groups': 'facebook_group'}
    if atype not in account_types:
        raise ValueError("atype must be 'pages' or 'groups'")
    df = pd.DataFrame.from_dict(data['result']['posts'])
    df = pd.concat([df.drop(['account'], axis=1), df['account'].apply(pd.Series)], axis=1)
    #count shares per account
    df = df.groupby(['name', 'url', 'accountType']).size().to_frame().reset_index().sort_values(by=0, ascending=False)
    #keep accounts of the requested type with more than 'minsize' shares
    df1 = df.loc[(df['accountType'] == account_types[atype]) & (df[0] > minsize)].copy()
    df1['List'] = listname
    df1 = df1.rename(columns={"url": "Page or Account URL"}).reset_index(drop=True)
    return df1[['Page or Account URL', 'List']]

Next, we run the stored results of post shares through the batch upload prep function. With minsize=1, only accounts that shared seed content more than once are kept.

In [95]:
gagop_snowball_pg_batch = prep_batch(gagop_snowball, atype='pages', minsize=1, listname='Republican Snowball Pages')
gagop_snowball_gp_batch = prep_batch(gagop_snowball, atype='groups', minsize=1, listname='Republican Snowball Groups')
loeffler_snowball_pg_batch = prep_batch(loeffler_snowball, atype='pages', minsize=1, listname='Republican Snowball Pages')
loeffler_snowball_gp_batch = prep_batch(loeffler_snowball, atype='groups', minsize=1, listname='Republican Snowball Groups')
perdue_snowball_pg_batch = prep_batch(perdue_snowball, atype='pages', minsize=1, listname='Republican Snowball Pages')
perdue_snowball_gp_batch = prep_batch(perdue_snowball, atype='groups', minsize=1, listname='Republican Snowball Groups')

Export CSV files for batch upload to CrowdTangle -- this will export 6 CSV files for batch upload. Don't worry about duplication - CrowdTangle will handle duplicates and will not include the same page or group twice in a list!

In [28]:
gagop_snowball_pg_batch.to_csv("gagop_snowball_pages.csv", index=False)
gagop_snowball_gp_batch.to_csv("gagop_snowball_groups.csv", index=False)
loeffler_snowball_pg_batch.to_csv("loeffler_snowball_pages.csv", index=False)
loeffler_snowball_gp_batch.to_csv("loeffler_snowball_groups.csv", index=False)
perdue_snowball_pg_batch.to_csv("perdue_snowball_pages.csv", index=False)
perdue_snowball_gp_batch.to_csv("perdue_snowball_groups.csv", index=False)
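
Although CrowdTangle handles duplicates on upload, you may prefer a single de-duplicated file per list. The sketch below assumes batches with the two columns produced by 'prep_batch'; the URLs and the output file name are placeholders.

```python
import pandas as pd

# Stand-ins for the outputs of prep_batch (URLs are placeholders)
batch_1 = pd.DataFrame({'Page or Account URL': ['https://example.com/a', 'https://example.com/b'],
                        'List': 'Republican Snowball Groups'})
batch_2 = pd.DataFrame({'Page or Account URL': ['https://example.com/b', 'https://example.com/c'],
                        'List': 'Republican Snowball Groups'})

# Stack the batches and keep the first occurrence of each URL
combined = (pd.concat([batch_1, batch_2], ignore_index=True)
              .drop_duplicates(subset='Page or Account URL')
              .reset_index(drop=True))
# combined.to_csv('snowball_groups.csv', index=False)  # hypothetical file name
print(combined)
```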

Once you have uploaded these pages and groups via batch upload to CrowdTangle, we can grab posts from the snowball sample. For this example, we'll grab the last 10,000 posts from the Republican Snowball Groups sample.

We create a 'get_list_info' function (CrowdTangle Lists API wrapper) to access all of the lists that have been created in the dashboard. Then we grab the id of the list of sample groups from which we want to collect up to the last 10,000 posts.

In [31]:
def get_list_info():
    #request all lists in the dashboard tied to the API token
    ctapi_list = 'https://api.crowdtangle.com/lists?token='
    api_url = f'{ctapi_list}{token}'
    response = requests.get(api_url)
    #return the parsed JSON on success, otherwise None
    if response.status_code == 200:
        return response.json()
    else:
        return None
    
list_info = get_list_info()
list_info['result']['lists']

[{'id': 1490375, 'title': 'Republican Snowball Pages', 'type': 'LIST'},
 {'id': 1490376, 'title': 'Republican Snowball Groups', 'type': 'LIST'}]

The list id for the Republican Snowball Groups sample we just created is '1490376'.
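
Rather than reading the id off the output by eye, you could look it up by title. This is a small sketch using the list metadata shown above; `find_list_id` is a hypothetical helper, not part of the CrowdTangle API.

```python
def find_list_id(lists, title):
    """Return the id of the first list whose title matches, or None."""
    for entry in lists:
        if entry['title'] == title:
            return entry['id']
    return None

# The list metadata shown in the output above
lists = [{'id': 1490375, 'title': 'Republican Snowball Pages', 'type': 'LIST'},
         {'id': 1490376, 'title': 'Republican Snowball Groups', 'type': 'LIST'}]

idx = str(find_list_id(lists, 'Republican Snowball Groups'))
print(idx)  # 1490376
```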

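We create a 'get_posts' function (CrowdTangle Posts API wrapper) to get up to 10,000 posts from the sample during the date range provided. Note that the request sets sortBy to total_interactions, so within the date range the posts are requested in order of engagement rather than recency.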

We'll set a custom date range even though we will only get the last 10,000 posts available (NB: it's unnecessary for this example, but the custom date range is included here for illustrative purposes to highlight how you can adjust the date range to a specific window of time to retrieve posts).
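
If you would rather compute the window than type the dates by hand, a minimal sketch with the standard library looks like this. The `date_window` helper is hypothetical; it simply returns the 'YYYY-MM-DD' strings used for the start and end dates in this notebook.

```python
from datetime import date, timedelta

def date_window(end_day, days):
    """Return (start, end) as 'YYYY-MM-DD' strings for a window ending on end_day."""
    start_day = end_day - timedelta(days=days)
    return start_day.isoformat(), end_day.isoformat()

start, end = date_window(date(2020, 12, 28), days=29)
print(start, end)  # 2020-11-29 2020-12-28
```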

In [82]:
idx = '1490376' #Republican Snowball Groups
start = '2020-11-29' #Nov 29 2020
end = '2020-12-28' #Dec 28 2020
allposts = []
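def get_posts():
    ctapi_posts = 'https://api.crowdtangle.com/posts?token='
    start_date = '&startDate='
    end_date = '&endDate='
    listids = '&listIds='
    count = '&count='
    n = '100' #posts per request
    offset = '&offset='
    sortBy = '&sortBy='
    sort = 'total_interactions'
    api_url = f'{ctapi_posts}{token}{listids}{idx}{start_date}{start}{end_date}{end}{count}{n}{sortBy}{sort}{offset}'
    #page through results 100 posts at a time, up to 10,000 posts
    for o in range(0,10000,100):
        api_call = api_url + str(o)
        response = requests.get(api_call).json()
        time.sleep(10) #pause between requests to stay within API rate limits
        #stop once a request returns no posts (end of results or a failed call)
        if not response.get('result', {}).get('posts'):
            break
        allposts.append(response)
        print(f'retrieved posts at offset {o}') #avoids printing the API token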
        
def posts_toframe(allposts):
    temp = pd.DataFrame(allposts)
    temp = pd.concat([temp.drop(['result'], axis=1), temp['result'].apply(pd.Series)], axis=1)
    temp = temp.explode('posts')
    temp = pd.concat([temp.drop(['posts'], axis=1), temp['posts'].apply(pd.Series)], axis=1)
    temp = temp.rename(columns={"subscriberCount": "initialSubscriberCount", "id": "initialId", "platformId": "initialPlatformId", "platform": "initialPlatform"})
    #expand account data into individual columns
    temp = pd.concat([temp.drop(['account'], axis=1), temp['account'].apply(pd.Series)], axis=1)
    #expand statistics data into individual columns
    temp = pd.concat([temp.drop(['statistics'], axis=1), temp['statistics'].apply(pd.Series)], axis=1)
    temp = pd.concat([temp.drop(['actual'], axis=1), temp['actual'].apply(pd.Series)], axis=1)
    temp['date'] = pd.to_datetime(temp.date)
    temp['updated'] = pd.to_datetime(temp.updated)
    temp['id'] = temp['id'].astype(object)
    temp = temp.drop(['status', 'pagination'], axis=1)
    return temp

Now we'll run the get_posts function to retrieve the last 10,000 posts. 
Note: because the function pauses for 10 seconds after each of its up to 100 requests to stay within API rate limits, it takes 15+ minutes to run (now's a good time to brew another pot of coffee).

In [96]:
get_posts() 
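
The 15+ minute estimate follows from the paging loop: 10,000 posts at 100 posts per request is 100 requests, each followed by a 10 second pause. The sketch below just works through that arithmetic; `paging_plan` is a hypothetical helper and ignores the time spent on the requests themselves.

```python
def paging_plan(total_posts, page_size, pause_seconds):
    """Return the request offsets and the minimum time spent pausing between calls."""
    offsets = list(range(0, total_posts, page_size))
    pause_minutes = len(offsets) * pause_seconds / 60
    return offsets, pause_minutes

offsets, pause_minutes = paging_plan(total_posts=10000, page_size=100, pause_seconds=10)
print(len(offsets), offsets[-1], round(pause_minutes, 1))  # 100 9900 16.7
```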

We'll convert the data we just obtained to a dataframe and have a quick peek at the number of groups for which we have post data. 

In [85]:
republican_snowball_group_posts = posts_toframe(allposts)
republican_snowball_group_posts['name'].nunique() #how many unique groups
republican_snowball_group_posts['name'].unique() #unique group names

array(['Sarah Sanders Fox News Fans', 'Flip it Red California',
       'HOLD THE LINE January 5th', 'We the People of Georgia Group',
       'Keep Georgia Red', 'TRUMP ~ The Next Four Years',
       'Nikki Haley for POTUS, 2024', 'Grassroots For Doug Collins',
       'Trump Women Landslide 2020', 'THE TRUMPERS!!!',
       'Georgia Democrats', 'The Silent Majority Group',
       'WAYCROSS & BLACKSHEAR GA NEWS', 'OfficialLatinosForTrump',
       'Friends Who Like Sean Hannity', 'Georgians for Kelly Loeffler',
       'Hart County Republicans', 'Patriots For Trump',
       'Citizens of Berrien County, Georgia',
       'Georgia Republicans United',
       'Republican National Hispanic Assembly - Official Group',
       'Southest GA Conservatives',
       'Nationwide Support for Donald J. Trump',
       'Australians for Donald Trump', 'Keep Cobb Conservative',
       'Concerned Citizens of Whitfield and Murray County',
       'Georgia 14th District Conservative Patriots', 'TRUMP VICTORY USA'

We'll also take a peek at a sample of the messages in some of our sample groups.

In [94]:
republican_snowball_group_posts[['message']][republican_snowball_group_posts['message'].notnull()].sample(n=20, random_state=1)

| | message |
|---|---|
| 15 | This is fantastic news about the election. I a... |
| 86 | https://www.thedailybeast.com/value-of-sen-kel... |
| 17 | A takeaway from a local meeting tonight: URGEN... |
| 59 | HAPPENING NOW!! Fulton County voting machines ... |
| 69 | If you are going to watch ANYTHING today, WATC... |
| 61 | “Target Date: January 6, 2021” by Joe Esposito... |
| 72 | If I were Brian Kemp, and thank God I ain't, I... |
| 35 | "Incredible as it may seem, the future of our ... |
| 41 | Does anyone know which legislators are backing... |
| 83 | I emailed all the Republican Senators around t... |


Looks like we've pulled a sample with a decent amount of post content related to the Georgia runoffs as well as a mixture of other political content.

Finally, we'll process the data we just collected and prepare it for storage in our database for further analysis.

In [65]:
#republican_snowball_group_posts = republican_snowball_group_posts.drop([0], axis=1) #removes all columns with name '0'
republican_snowball_group_posts = republican_snowball_group_posts.drop(['expandedLinks', 'media', 'expected'], axis=1) #removes columns which have list objects as values
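
The columns dropped above are hard-coded. If you are unsure which columns hold list or dictionary values in your own pull, a sketch like the following could find them automatically. The helper `list_valued_columns` and the mock frame are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

def list_valued_columns(df):
    """Return names of columns that contain at least one list or dict value."""
    return [col for col in df.columns
            if df[col].apply(lambda v: isinstance(v, (list, dict))).any()]

# Mock frame: 'media' holds lists, the other columns hold scalars
df = pd.DataFrame({'message': ['a', 'b'],
                   'media': [[{'type': 'photo'}], None],
                   'score': [1.0, 2.0]})

to_drop = list_valued_columns(df)
clean = df.drop(columns=to_drop)
print(to_drop)  # ['media']
```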

In [67]:
republican_snowball_group_posts.to_sql("republican_groups_snowball_posts", db, if_exists='replace', schema=None, index=False, chunksize=500)
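
To check that the data can be read back, you could query the table with pandas. The sketch below uses an in-memory SQLite database and made-up rows as a stand-in for the notebook's database connection, which is an assumption made purely so the example is self-contained; only the table name is taken from the cell above.

```python
import sqlite3
import pandas as pd

# In-memory SQLite stands in for the notebook's database (an assumption for illustration)
conn = sqlite3.connect(':memory:')
posts = pd.DataFrame({'name': ['Group A', 'Group A', 'Group B'],
                      'message': ['m1', 'm2', 'm3']})
posts.to_sql('republican_groups_snowball_posts', conn, if_exists='replace', index=False)

# Read the stored posts back, counting posts per group
per_group = pd.read_sql(
    'SELECT name, COUNT(*) AS posts FROM republican_groups_snowball_posts '
    'GROUP BY name ORDER BY posts DESC', conn)
conn.close()
print(per_group)
```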

We've now got posts from our snowball sample of Republican groups stored in our database for easy access and analysis. 

In the next notebook, we'll conduct a Social Network Analysis and use a Community Detection algorithm to identify sub-clusters of interest for even deeper analysis. 