This data is of flagged and unflagged tweets from the social media monitoring technology Social Sentinel. 

Social Sentinel is an service sold to schools which scans social media messages which tries to detect threats of violence or self-harm. The company has said in the past that it scans more than a billion posts on social media every day against more than 450,000 words and phrases in its “Language of Harm.” 

So far as I can tell from publicly-available documents and news articles, Social Sentinel’s system consists of two pieces:<ol>  
<li>An nlp classifier system that detects potentially threatening messages posted to social media</li>
<li>Keywords inputted by the school which connect a given threatening post to a client college.</li></ol>

According to other data I obtained, Social Sentinel has been used by at least 37 colleges in the past six years, 32 of which are public. (UCLA may or may not have purchased the service in late 2016). This number is quite small when compared to the hundreds of K12 school districts which have used the service over the same time period, but not insignificant. Some of the country’s largest and most well-known colleges have used Social Sentinel, including UNC-Chapel Hill, the University of Virginia, Michigan State University, MIT and Arizona State University.
    
There is mounting evdience that Social Sentinel is not only being used to stop suicides and shootings, but for suppressing activism and protests as well. For more on this read the following stories, which I authored:<ol>
    <li><a href="https://www.dallasnews.com/news/investigations/2021/09/02/texas-schools-are-watching-millions-of-students-online-often-without-their-knowledge-or-consent/">Texas schools are surveilling students online, often without their knowledge or consent</a></li>
    <li><a href="https://www.nbcnews.com/news/education/unc-campus-police-used-geofencing-tech-monitor-antiracism-protestors-n1105746">UNC campus police used geofencing tech to monitor antiracism protestors</a></li></ol>       
        
The data used here comes from two sources:<ol>
    <li>Public records obtained by Buzzfeed News reporter's Peter Aldhous and Lam Vo, which were parsed from PDFs, MSGs and EMLs into RData, then exported as CSVs and given to me.</li>
    <li>Tweets from the same time period from the same users who have flagged tweets from the Buzzfeed dataset, which were scraped using the Twint package.</li></ol>

The data are in two columns "content", which is the actual tweet itself and "flagged", which is encoded as a binary 0 (for no ie. the tweet wasn't  classified as a potential threat by Social Sentinel) and 1 (for yes, ie. the tweet was classified as a threat by Social Sentinel). 

In [1]:
import pandas as pd
import re
from datetime import datetime
import Levenshtein as levenshtein

In [2]:
#bring in data
df = pd.read_csv('~/Desktop/Berkeley/year_2/fall/journ 221/GitHub/ari-sen/data/hobbs_flagler_unique.csv')
df_2 = pd.read_csv('~/Desktop/Berkeley/year_2/fall/journ 221/GitHub/ari-sen/data/flagged.csv')

In [3]:
#convert strings to datetimes
df['time_posted'] = pd.to_datetime(df['time_posted'])
df.dtypes

district               object
fmt                    object
alert_num               int64
source                 object
time_posted    datetime64[ns]
author_text            object
site                   object
screen_name            object
profile_url            object
locations              object
geodata                object
rmap_match             object
content                object
link                   object
dtype: object

In [4]:
#add a column for flagged
df['flagged'] = 1

In [5]:
#convert string to datetime
df_2['created_at '] = pd.to_datetime(df_2['created_at '])
df_2.dtypes

district               object
screen_name            object
text                   object
created_at     datetime64[ns]
status_id             float64
user_id               float64
dtype: object

In [6]:
#add a flagged column
df_2['flagged'] = 1

In [7]:
#get a subset of the data that is present in both sheets
df = df[['district', 'screen_name', 'time_posted', 'content', 'flagged']]
df_2 = df_2[['district', 'screen_name','created_at ', 'text', 'flagged']]
df.head()

Unnamed: 0,district,screen_name,time_posted,content,flagged
0,hobbs_nm,Cpt_Martinez,2017-12-08 10:27:00,You have gotta be the dumbest Bitch I know and...,1
1,hobbs_nm,TheNewErikT93,2017-12-10 11:09:00,Dez can kiss my ass im done defending him,1
2,hobbs_nm,Cpt_Martinez,2017-12-11 17:55:00,I just wanna take you out an show you off,1
3,hobbs_nm,TheNewErikT93,2017-12-11 21:02:00,A womans idea of shooting her shot is liking a...,1
4,hobbs_nm,martiinez_25,2017-12-11 23:53:00,I get mad when marco carries his truck in the ...,1


In [8]:
#rename so the columns is both sheets are called the same thing
df_2 = df_2.rename(columns = {'created_at ':'time_posted', 'text':'content'})
df_2.head()

Unnamed: 0,district,screen_name,time_posted,content,flagged
0,hobbs_nm,Cpt_Martinez,2017-12-08 17:27:00,You have gotta be the dumbest Bitch I know🤦🏻‍♂...,1
1,hobbs_nm,SchnaubertM,2017-12-12 18:31:00,@petehudson101 Man he needs his ass kicked! No...,1
2,hobbs_nm,MayraWestAf,2017-12-13 01:00:00,Nolen is about to give me a heart attack man h...,1
3,hobbs_nm,Ramy529,2017-12-14 19:28:00,Catch me dropping out and selling drugs just t...,1
4,hobbs_nm,RicardoNino2,2017-12-15 07:06:00,G Eazy fucking murdered his album. Don’t @ me....,1


In [9]:
#combine the flagged tweet CSVs
df = pd.concat([df, df_2])

In [10]:
#convert string to datetime
df['time_posted'] = pd.to_datetime(df['time_posted'])
df.dtypes

district               object
screen_name            object
time_posted    datetime64[ns]
content                object
flagged                 int64
dtype: object

In [11]:
#sort by date
df = df.sort_values(by='time_posted')

In [12]:
#first row date is where we start scraping, last row date is where we stop scraping
df.head(887)

Unnamed: 0,district,screen_name,time_posted,content,flagged
0,hobbs_nm,Cpt_Martinez,2017-12-08 10:27:00,You have gotta be the dumbest Bitch I know and...,1
0,hobbs_nm,Cpt_Martinez,2017-12-08 17:27:00,You have gotta be the dumbest Bitch I know🤦🏻‍♂...,1
1,hobbs_nm,TheNewErikT93,2017-12-10 11:09:00,Dez can kiss my ass im done defending him,1
2,hobbs_nm,Cpt_Martinez,2017-12-11 17:55:00,I just wanna take you out an show you off,1
3,hobbs_nm,TheNewErikT93,2017-12-11 21:02:00,A womans idea of shooting her shot is liking a...,1
...,...,...,...,...,...
180,danville_va,nayaalece,2019-09-25 00:34:00,So this man followed me home today &amp; calle...,1
158,danville_va,HonxhoD,2019-09-25 17:22:00,@tequila_titi Lmfaoo ion get cheated on cause ...,1
173,fairfield_oh,briannnanicole4,2019-09-26 12:56:00,Havent even been up for an hour &amp; im alrea...,1
156,fairfield_oh,EdCarroll51,2019-09-26 15:18:00,Willing to pay upwards to several chickens (so...,1


In [13]:
#get all the unique usernames from our dataset
usernames = df.groupby('screen_name').count().reset_index()
usernames.head(500)

Unnamed: 0,screen_name,district,time_posted,content,flagged
0,1.18E+20,1,1,1,1
1,ABQJournal,6,6,6,6
2,AlanLunin,1,1,1,1
3,AlbuquerqueNMRR,1,1,1,1
4,AlexisBradyknit,1,1,1,1
...,...,...,...,...,...
300,yalss71,1,1,1,1
301,yipi_csgo,1,1,1,1
302,yungmasedgod,3,3,3,3
303,zairaa_m,1,1,1,1


In [77]:
import twint
import nest_asyncio

In [18]:
nest_asyncio.apply()

In [19]:
#scrape tweets with Twint
c  = twint.Config()

for i in usernames['screen_name']:
    try:
        c.Username = i

        c.Since = "2017-12-08"

        c.Until = "2019-09-27"

        c.Limit = 1000

        c.Lang = "en"

        c.Output = f'{i}.csv'
        c.Store_csv = True

        twint.run.Search(c)
    except ValueError:
        pass

AttributeError: module 'twint' has no attribute 'Config'

In [14]:
#concat scraped tweet CSVs to a single unflagged df
unflagged = pd.concat(map(pd.read_csv, ['zmelot18.csv', 'yalss71.csv', 'ttime72.csv', 'teehattt.csv', 'tabatha_sams.csv', 'slaguerre3.csv', 'shreyagupta_.csv', 'ryan_kyler_.csv', 'poison_ivey1998.csv', 'pebdog5.csv', 'owillis.csv', 'msksq.csv', 'mmbrandenburg.csv', 'mjbaxter_.csv', 'martiinez_25.csv', 'leah_beah3.csv', 'leacountydems1.csv', 'lainemobleyy.csv', 'kyotoamatsukami.csv', 'kennedysara__.csv', 'kaytlovee1.csv', 'kaylapbaker.csv', 'katy77539.csv', 'jeffrey_poe.csv', 'ihoop__24.csv', 'iam_GRIZZLY.csv', 'iahn_walkerB.csv', 'estefaniaday.csv', 'ehoffman81.csv', 'donato_pacifico.csv', 'chris11fisher.csv', 'chimerakim.csv', 'chelsey_maria02.csv', 'briannnanicole4.csv', 'avtistrv.csv', '_jennifuurrr.csv', '_gracielee_.csv', '_ericuhh_.csv', '_Kyle7.csv', 'WonkaWinsTwitch.csv', 'WGXAnews.csv', 'WBAY.csv', 'TyChxbbs.csv', 'Timexdistance2.csv', 'Taylorl1219.csv', 'T_Mulherin.csv', 'SheriffRicStaly.csv', 'SheeshCP.csv', 'ScottLeeCupp.csv', 'SchnaubertM.csv', 'SOSA_CB_.csv', 'Ranzaaa_.csv', 'RafaelV90.csv', 'PatrickHenryCC.csv', 'Past2Present.csv', 'Mxkkie.csv', 'MarkVSlaughter.csv', 'LeoTheLion_11.csv', 'KimVarghese.csv', 'JoplinNewsFirst.csv', 'JessicaLeigh001.csv', 'JellyJam_24.csv', 'JayceBrandt.csv', 'JasonDiG_.csv', 'JPrincePhysed.csv', 'IvetteRios2_26.csv', 'Hellscr333m.csv', 'HawkeyeBrooke.csv', 'Franceborn.csv', 'EricHeggie.csv', 'Dansoffthewall.csv', 'Dannyjimmerado.csv', 'ChaseBurnett8.csv', 'CashWRLD.csv', 'BroadmoorSchool.csv', 'BIGDADDY197569.csv', 'AunistyN.csv', 'ABQJournal.csv']), ignore_index=True)
#take only the subset we need
unflagged =  unflagged[['username','created_at','tweet']]
#add an unflagged column
unflagged['flagged'] = 0
#add a district column
unflagged['district'] = 'n/a'
#reorder columns so the match the flagged df
unflagged = unflagged[['username', 'district', 'created_at', 'tweet', 'flagged']]
#rename so it matches the flagged df
unflagged = unflagged.rename(columns = {'username':'screen_name', 'created_at':'time_posted', 'tweet':'content'})

unflagged.head()

Unnamed: 0,screen_name,district,time_posted,content,flagged
0,zmelot18,,2019-09-25 19:22:32 EDT,@WiscoCooper @J_Sammour @AdamSchefter Philly r...,0
1,zmelot18,,2019-09-25 04:15:52 EDT,@jfons1567 https://t.co/Roz1oD9kfC,0
2,zmelot18,,2019-09-25 04:15:32 EDT,"@Pxrola @mariokarttourEN Ik, they ducked up wi...",0
3,zmelot18,,2019-09-25 04:15:19 EDT,@ShaaarellSMB https://t.co/cINet1BqbM,0
4,zmelot18,,2019-09-25 04:15:08 EDT,@ShaaarellSMB Shnopeeee,0


In [15]:
#concat flagged and unflagged
tweets = pd.concat([df, unflagged])
len(tweets)

3547

In [16]:
def dedupe_and_id(tweets):
    #where tweets is a df of tweets
    drop_idx = []

    for i in range(0, len(tweets['content'])):
        tweet_id = 0
        for j in range(0, len(tweets['content'])):
            try:
                similarity_ratio = levenshtein.ratio(tweets['content'][i], tweets['content'][j])
                if similarity_ratio >  0.9 and i != j:
                    drop_idx.append([i,j])
                else: 
                    pass
            except:
                similarity_ratio = 0

    drop_idx = sorted(drop_idx)

    for i in range(0, len(drop_idx)):
        try:
            x = drop_idx[i][0]
            y = drop_idx[i][1]
            drop_idx[:] = (value for value in drop_idx if value != [y,x])
        except:
            pass

    dont_drop = []
    for i in drop_idx:
        j = i[0]
        k = i[1]
        try:
            if tweets['screen_name'][j].lower() != tweets['screen_name'][k].lower():
                dont_drop.append([j,k])
            else: 
                pass
        except:
            pass

    for i in dont_drop:
        drop_idx[:] = (value for value in drop_idx if value != i)

    drop = []
    for l in range(0, len(drop_idx)-1):
        drop.append(drop_idx[l][1])

    drop_list = list(dict.fromkeys(drop))
    drop_list = sorted(drop_list)
    tweets = tweets.drop(index = drop_list)
    tweets.to_csv('tweet_data.csv')
    tweets = pd.read_csv('tweet_data.csv')
    tweets.rename(columns={'Unnamed: 0':'tweet_id'}, inplace=True)
    return tweets

In [17]:
dedupe_and_id(tweets)

Unnamed: 0,tweet_id,district,screen_name,time_posted,content,flagged
0,0,hobbs_nm,Cpt_Martinez,2017-12-08 10:27:00,You have gotta be the dumbest Bitch I know and...,1
1,0,hobbs_nm,Cpt_Martinez,2017-12-08 17:27:00,You have gotta be the dumbest Bitch I know🤦🏻‍♂...,1
2,1,hobbs_nm,TheNewErikT93,2017-12-10 11:09:00,Dez can kiss my ass im done defending him,1
3,2,hobbs_nm,Cpt_Martinez,2017-12-11 17:55:00,I just wanna take you out an show you off,1
4,3,hobbs_nm,TheNewErikT93,2017-12-11 21:02:00,A womans idea of shooting her shot is liking a...,1
...,...,...,...,...,...,...
3533,2654,,abqjournal,2019-09-24 15:55:53 EDT,AG: Parents allowed boy to be sexually abused ...,0
3534,2655,,abqjournal,2019-09-24 15:10:29 EDT,UNM enrollment declines again https://t.co/Vv...,0
3535,2656,,abqjournal,2019-09-24 14:25:23 EDT,NM agency to appeal after judge reaffirms medi...,0
3536,2657,,abqjournal,2019-09-24 12:35:07 EDT,3 APD officers honored for going 'above and be...,0


In [None]:
tweets.to_csv('tweet_data.csv', index=False)