This dataset contains tweets that were flagged, and tweets that were not flagged, by the social media monitoring service Social Sentinel.

Social Sentinel is a service sold to schools that scans social media messages and tries to detect threats of violence or self-harm. The company has said in the past that it scans more than a billion social media posts every day against more than 450,000 words and phrases in its “Language of Harm.”

As far as I can tell from publicly available documents and news articles, Social Sentinel’s system consists of two pieces:<ol>
<li>An NLP classifier that detects potentially threatening messages posted to social media</li>
<li>Keywords entered by the school, which connect a given threatening post to a client college.</li></ol>
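
To make the two-piece description concrete, here is a toy sketch of such a pipeline. It is purely illustrative: the phrase list, the keyword map and all three function names are hypothetical, plain substring matching stands in for whatever classifier the company actually uses, and nothing here is drawn from Social Sentinel's code.

```python
# Toy illustration only: hypothetical phrases, keywords and function names.
HARM_PHRASES = {"shoot up", "hurt myself"}  # stand-in for a "Language of Harm" list
CLIENT_KEYWORDS = {  # stand-in for school-supplied keywords
    "example_college": {"example college", "#examplecollege"},
}


def looks_threatening(text):
    """Stage 1: crude substring check standing in for an NLP classifier."""
    lowered = text.lower()
    return any(phrase in lowered for phrase in HARM_PHRASES)


def matching_clients(text):
    """Stage 2: return the clients whose keywords appear in the post."""
    lowered = text.lower()
    return sorted(
        client
        for client, keywords in CLIENT_KEYWORDS.items()
        if any(keyword in lowered for keyword in keywords)
    )


def route_post(text):
    """Return the clients to alert, or an empty list if the post is not flagged."""
    if not looks_threatening(text):
        return []
    return matching_clients(text)
```

Under this sketch a post is only routed to a school when both stages match, so a flagged post that mentions none of a school's keywords reaches no one.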

According to other data I obtained, Social Sentinel has been used by at least 37 colleges in the past six years, 32 of which are public. (UCLA may or may not have purchased the service in late 2016.) This number is quite small compared with the hundreds of K-12 school districts that have used the service over the same time period, but it is not insignificant. Some of the country’s largest and best-known colleges have used Social Sentinel, including UNC-Chapel Hill, the University of Virginia, Michigan State University, MIT and Arizona State University.
    
There is mounting evidence that Social Sentinel is being used not only to stop suicides and shootings but also to suppress activism and protests. For more on this, read the following stories, which I wrote:<ol>
    <li><a href="https://www.dallasnews.com/news/investigations/2021/09/02/texas-schools-are-watching-millions-of-students-online-often-without-their-knowledge-or-consent/">Texas schools are surveilling students online, often without their knowledge or consent</a></li>
    <li><a href="https://www.nbcnews.com/news/education/unc-campus-police-used-geofencing-tech-monitor-antiracism-protestors-n1105746">UNC campus police used geofencing tech to monitor antiracism protestors</a></li></ol>       
        
The data used here comes from two sources:<ol>
    <li>Public records obtained by BuzzFeed News reporters Peter Aldhous and Lam Vo, which were parsed from PDFs, MSGs and EMLs into RData, then exported as CSVs and given to me.</li>
    <li>Tweets from the same time period by the same users whose tweets were flagged in the BuzzFeed dataset, scraped using the Twint package.</li></ol>

The data are in two columns: "content", which is the tweet itself, and "flagged", which is encoded as a binary 0 (no, i.e. the tweet wasn't classified as a potential threat by Social Sentinel) or 1 (yes, i.e. the tweet was classified as a potential threat by Social Sentinel).
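
As a quick sanity check on that layout, the sketch below loads a tiny in-memory stand-in for the final CSV and counts rows per label. The three sample rows are invented for illustration; only the column names `content` and `flagged` come from the description above.

```python
import io

import pandas as pd

# Invented sample rows standing in for the real file.
sample_csv = io.StringIO(
    "content,flagged\n"
    "first example tweet,1\n"
    "second example tweet,0\n"
    "third example tweet,0\n"
)

sample = pd.read_csv(sample_csv)

# The label should only ever be 0 (not flagged) or 1 (flagged).
assert set(sample["flagged"].unique()) <= {0, 1}

# Count tweets per label to see how balanced the two classes are.
label_counts = sample["flagged"].value_counts().to_dict()
print(label_counts)
```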

In [258]:
import pandas as pd
import re
from datetime import datetime
import Levenshtein as levenshtein
from datetime import timedelta

In [259]:
#bring in data
df = pd.read_csv('data/CSVs/Raw/hobbs_flagler_unique.csv')
df_2 = pd.read_csv('data/CSVs/Raw/flagged.csv')

In [260]:
#convert strings to datetimes
df['time_posted'] = pd.to_datetime(df['time_posted'])
df.dtypes

district               object
fmt                    object
alert_num               int64
source                 object
time_posted    datetime64[ns]
author_text            object
site                   object
screen_name            object
profile_url            object
locations              object
geodata                object
rmap_match             object
content                object
link                   object
dtype: object

In [261]:
#add a column for flagged
df['flagged'] = 1

In [262]:
#convert string to datetime (the column name in this CSV has a trailing space)
df_2['created_at '] = pd.to_datetime(df_2['created_at '])
df_2.dtypes

district               object
screen_name            object
text                   object
created_at     datetime64[ns]
status_id             float64
user_id               float64
dtype: object

In [263]:
#add a flagged column
df_2['flagged'] = 1

In [264]:
#get a subset of the data that is present in both sheets
df = df[['district', 'screen_name', 'time_posted', 'content', 'flagged']]
df_2 = df_2[['district', 'screen_name','created_at ', 'text', 'flagged']]
df.head()

Unnamed: 0,district,screen_name,time_posted,content,flagged
0,hobbs_nm,Cpt_Martinez,2017-12-08 10:27:00,You have gotta be the dumbest Bitch I know and...,1
1,hobbs_nm,TheNewErikT93,2017-12-10 11:09:00,Dez can kiss my ass im done defending him,1
2,hobbs_nm,Cpt_Martinez,2017-12-11 17:55:00,I just wanna take you out an show you off,1
3,hobbs_nm,TheNewErikT93,2017-12-11 21:02:00,A womans idea of shooting her shot is liking a...,1
4,hobbs_nm,martiinez_25,2017-12-11 23:53:00,I get mad when marco carries his truck in the ...,1


In [265]:
#rename so the columns in both sheets have the same names
df_2 = df_2.rename(columns = {'created_at ':'time_posted', 'text':'content'})
df_2.head()

Unnamed: 0,district,screen_name,time_posted,content,flagged
0,hobbs_nm,Cpt_Martinez,2017-12-08 17:27:00,You have gotta be the dumbest Bitch I know🤦🏻‍♂...,1
1,hobbs_nm,SchnaubertM,2017-12-12 18:31:00,@petehudson101 Man he needs his ass kicked! No...,1
2,hobbs_nm,MayraWestAf,2017-12-13 01:00:00,Nolen is about to give me a heart attack man h...,1
3,hobbs_nm,Ramy529,2017-12-14 19:28:00,Catch me dropping out and selling drugs just t...,1
4,hobbs_nm,RicardoNino2,2017-12-15 07:06:00,G Eazy fucking murdered his album. Don’t @ me....,1


In [267]:
#combine the flagged tweet CSVs
df = pd.concat([df, df_2])

In [268]:
#convert strings to datetime
df['time_posted'] = pd.to_datetime(df['time_posted'])
df.dtypes

district               object
screen_name            object
time_posted    datetime64[ns]
content                object
flagged                 int64
dtype: object

In [269]:
#sort by date
df = df.sort_values(by='time_posted')

In [270]:
#first row date is where we start scraping, last row date is where we stop scraping
df.head(887)

Unnamed: 0,district,screen_name,time_posted,content,flagged
0,hobbs_nm,Cpt_Martinez,2017-12-08 10:27:00,You have gotta be the dumbest Bitch I know and...,1
0,hobbs_nm,Cpt_Martinez,2017-12-08 17:27:00,You have gotta be the dumbest Bitch I know🤦🏻‍♂...,1
1,hobbs_nm,TheNewErikT93,2017-12-10 11:09:00,Dez can kiss my ass im done defending him,1
2,hobbs_nm,Cpt_Martinez,2017-12-11 17:55:00,I just wanna take you out an show you off,1
3,hobbs_nm,TheNewErikT93,2017-12-11 21:02:00,A womans idea of shooting her shot is liking a...,1
...,...,...,...,...,...
180,danville_va,nayaalece,2019-09-25 00:34:00,So this man followed me home today &amp; calle...,1
158,danville_va,HonxhoD,2019-09-25 17:22:00,@tequila_titi Lmfaoo ion get cheated on cause ...,1
173,fairfield_oh,briannnanicole4,2019-09-26 12:56:00,Havent even been up for an hour &amp; im alrea...,1
156,fairfield_oh,EdCarroll51,2019-09-26 15:18:00,Willing to pay upwards to several chickens (so...,1


In [271]:
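#find the first and last flagged tweet time for each user
result = df.groupby('screen_name').agg({'time_posted':['min', 'max']})

#flatten the MultiIndex columns into time_posted_min and time_posted_max
result.columns = list(map('_'.join, result.columns.values))

#attach each user's first and last times to every row
df = pd.merge(df, result, on="screen_name", how="left")
len(df)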

888

In [272]:
#drop rows that have no screen name, since they can't be scraped
df.dropna(subset=['screen_name'], inplace=True)
len(df)

482
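
The per-user first and last timestamps can also be computed with named aggregation, which avoids flattening the MultiIndex columns by hand. This is a sketch on invented rows, not the notebook's data; the column names mirror the ones used above, and the final `reset_index` is an addition that the cells above do not make.

```python
import pandas as pd

# Invented rows standing in for the combined flagged-tweet table.
flagged = pd.DataFrame(
    {
        "screen_name": ["user_a", "user_a", "user_b", None],
        "time_posted": pd.to_datetime(
            ["2018-01-01 10:00", "2018-01-20 09:30", "2018-03-05 12:00", "2018-04-01 08:00"]
        ),
    }
)

# Named aggregation yields flat column names directly.
windows = (
    flagged.groupby("screen_name")["time_posted"]
    .agg(time_posted_min="min", time_posted_max="max")
    .reset_index()
)

# Left-merge the window back onto every row, then drop rows with no author.
# Resetting the index gives the remaining rows contiguous labels 0..n-1.
merged = (
    flagged.merge(windows, on="screen_name", how="left")
    .dropna(subset=["screen_name"])
    .reset_index(drop=True)
)
```

Resetting the index matters for any later code that looks rows up with `df['screen_name'][i]`: that lookup is label-based, and after `dropna` the labels are no longer guaranteed to run from 0 to `len(df) - 1`.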

In [273]:
import twint
import nest_asyncio

In [274]:
nest_asyncio.apply()

In [275]:
#scrape tweets with Twint
uf = pd.DataFrame()
c = twint.Config()

screen_names = []
for i in range(0, len(df)):
    try:
        screen_name = df['screen_name'][i]
        if screen_name not in screen_names:
            #scrape from the user's first flagged tweet to their last
            min_time = df['time_posted_min'][i].date()
            max_time = df['time_posted_max'][i].date()

            #widen the window to a week if the flagged tweets span less than that
            if (max_time - min_time).days < 7:
                max_time = min_time + timedelta(7)

            c.Username = screen_name
            c.Since = str(min_time)
            c.Until = str(max_time)
            c.Lang = "en"
            c.Pandas = True

            twint.run.Search(c)

            #collect this user's tweets
            Tweets_df = twint.storage.panda.Tweets_df
            uf = pd.concat([uf, Tweets_df])

            screen_names.append(screen_name)
    except Exception:
        #skip rows that can't be looked up or users that can't be scraped
        pass

uf.head()

[!] No more data! Scraping will stop now.
found 0 deleted tweets in this search.
941516350572367872 2017-12-14 19:52:45 -0800 <_gracielee_> @t_reeev999  why’s this me 100%.?  https://t.co/9nm1kxtTKa
941515048639516673 2017-12-14 19:47:34 -0800 <_gracielee_> @t_reeev999 why is this me 100%  https://t.co/9nm1kxtTKa
941510068469723136 2017-12-14 19:27:47 -0800 <_gracielee_> Anyone wanna buy me a new truck or car.????
940785138539089921 2017-12-12 19:27:10 -0800 <_gracielee_> Me and @jerakah_madron  are having Jeff Dunham wars ❤️
940677214852456448 2017-12-12 12:18:19 -0800 <_gracielee_> Why does Hobbs gotta act and say they are gonna shoot up the school.? I’m not in class now getting the good grades I️ need to graduate, cause y’all dumbass’s like to 

959090737706160128 2018-02-01 07:47:05 -0800 <PennjrJr> Happy Heavenly Birthday Lil Bro/BruckShot👊🏽💙  keep watching over yo big bro🙏🏽💯
958919850780430336 2018-01-31 20:28:02 -0800 <PennjrJr> Just want something real with you shawdy 😏
958914775706144768 2018-01-31 20:07:52 -0800 <PennjrJr> Shawdy would you ride for a 🙎🏾‍♂️?
958898868934053889 2018-01-31 19:04:40 -0800 <PennjrJr> Need me a queen who was moving with Latifah 👑
958772990312960000 2018-01-31 10:44:28 -0800 <PennjrJr> Say less do more
958749654736424966 2018-01-31 09:11:45 -0800 <PennjrJr> The world is a stage and I just playing my part ‼️
958748827657351174 2018-01-31 09:08:27 -0800 <PennjrJr> Is like the more I sleep the more I get tired 🤦🏽‍♂️
958547637833224193 2018-01-30 19:49:00 -0800 <PennjrJr> Keep da fake love away from me ‼️
958539557670178816 2018-01-30 19:16:54 -0800 <PennjrJr> 60 pts and a triple double
958539486958358529 2018-01-30 19:16:37 -0800 <PennjrJr> Harden is not a joke😨
958534884196323328 2018-01-30 18:5

949720774016946177 2018-01-06 11:14:12 -0800 <kbreed89> Amazing artwork! (Spiegelman 35)  https://t.co/ZFVd9Qmhed
949711295326179330 2018-01-06 10:36:32 -0800 <kbreed89> Starting my fourth book of the year! Maus I is a graphic novel about the Holocaust. Here we go!  https://t.co/g66YEaLH3W
949526197410680832 2018-01-05 22:21:01 -0800 <kbreed89> “The Cask of Amontillado” by Edgar Allan Poe “The thousand injuries of Fortunato I had borne as I best could; but when he ventured upon insult, I vowed revenge.” (14)
949517827312152576 2018-01-05 21:47:45 -0800 <kbreed89> “Young Goodman Brown” by Nathaniel Hawthorne “Depending upon one another’s hearts, ye had still hoped that virtue were not all a dream.  Now are ye undeceived. Evil is the nature of mankind. Evil must be your only happiness. Welcome again...to the communion of your race.” (11)
949505909180502016 2018-01-05 21:00:24 -0800 <kbreed89> Next on the list, 40 Short Stories: A Portable Anthology! I’ll take my time with this one, maybe

[!] No more data! Scraping will stop now.
found 0 deleted tweets in this search.
956051581987971072 2018-01-23 22:30:34 -0800 <JoplinNewsFirst> Al Zar produced/promoted shows in Joplin where he raised his kids. If you went to a concert in the 80s or 90s at Memorial Hall that was big? It was Al! R.I.P.  https://t.co/iy9rPw8TQG
956049191280750592 2018-01-23 22:21:04 -0800 <JoplinNewsFirst> (64804) — “It literally happened right in front of my car. The guy had to be going every bit of 50-60 mph through the red light. I was scared to death. JPD and everyone did great!” She told us full of emotion. “I was driving and we are expecting a baby in June.”  https://t.co/iPTB3RTuXB
955995197493559296 2018-01-23 18:46:31 -0800 <JoplinNewsFirst> HOUSE OF WORSHIP: Security Team Options -Protecting Your Congregation from an Active Shooter -Held by the Jasper...  https://t.co/XXAB4WpRlK
955946609191333888 2018-01-23 15:33:26 -0800 <JoplinNewsFirst> (64804) — Possible hit and run injury crash near McCle

990247379197575169 2018-04-28 08:12:27 -0800 <Destiiny03> @destiny4399 I love youuuuu more!!!💓
990236386774126593 2018-04-28 07:28:46 -0800 <Destiiny03> Happy birthday BFF✨ hanks for doing life with me. May god bless you with so many more yrs so that I may be blessed with you by my side for so many more yrs! Love youuu!!!  https://t.co/K1HIVBkRTD
990008623303553024 2018-04-27 16:23:43 -0800 <Destiiny03> I just want a puppy so it can be my bestfriend!😭
990001914619224064 2018-04-27 15:57:04 -0800 <Destiiny03> @destiny4399 Shut up, you’re perfect🙄
989965706962780160 2018-04-27 13:33:11 -0800 <Destiiny03> My mom is going to see @maluma tonight without me 😭
989928866515832832 2018-04-27 11:06:48 -0800 <Destiiny03> I find it so funny that people think they know me 😂  Lemme tell you something hun,  YOU DONT GOT IT LIKE THAT:))
989922270427557890 2018-04-27 10:40:35 -0800 <Destiiny03> @suunnieraai @Stayclasssyyy @alleahbaca107 They usually change when it’s too late....
989863211464777730 2018

971531422124134405 2018-03-07 15:41:55 -0800 <KSBY> Drop box for unused medication installed at #Lompoc hospital  https://t.co/u6YZdgJIcY
971511462035324928 2018-03-07 14:22:37 -0800 <KSBY> #MostWantedWednesday: Kevin Bruce Duckworth; wanted by @SLOSheriff   https://t.co/Ea0cbYgPr6  https://t.co/0LBukrhzQJ
971494803178508291 2018-03-07 13:16:25 -0800 <KSBY> The fire broke out on Santa Rita Road at about 12:30 p.m.  https://t.co/cJvG6HQBv2
971486900954517504 2018-03-07 12:45:01 -0800 <KSBY> San Luis Obispo County voters to decide on #cannabis business tax  https://t.co/lCTnDrLCyV
971472187294208000 2018-03-07 11:46:33 -0800 <KSBY> New center opening in #Montecito to as

977002821974482944 2018-03-22 19:03:19 -0800 <jesmtlsd> JES Art Night! #leboproud of these creative kiddos!  https://t.co/RO5WM2Tt2j
976583583149289472 2018-03-21 15:17:24 -0800 <jesmtlsd> We’re getting some plants ready for the garden! (Winter, please go away). Thank you to the JES PTA for helping us plant this week! #leboproud of these gardeners @LetsMovePGH  https://t.co/vR01GgKaUq
976447570594729984 2018-03-21 06:16:57 -0800 <jesmtlsd> MTLSD Schools now CLOSED Wednesday, March 21, 2018 due to weather conditions.  https://t.co/igDAmTvN3b
976400867934031872 2018-03-21 03:11:22 -0800 <jesmtlsd> The PTA meeting that was scheduled for this morning has been cancelled. It has been rescheduled for Tuesday, March 27th.
976397230960758784 2018-03-21 02:56:55 -0800 <jesmtlsd> The Mt. Lebanon School District will have a 10AM start wit

[!] No more data! Scraping will stop now.
found 0 deleted tweets in this search.
981280636269940737 2018-04-03 14:21:49 -0800 <wobm> DEVELOPING STORY:  https://t.co/jgzq7U0XJp
981270359210016768 2018-04-03 13:40:59 -0800 <wobm> NJ weather this week: Rain, thunderstorms, warm, cold, wind, snow: A warm front Tuesday night will be followed by a cold front Wednesday afternoon. Continue reading…  https://t.co/ohxhBOBt3d  https://t.co/1MLeyWvs4r
981242123843489793 2018-04-03 11:48:47 -0800 <wobm> Don't resign yourself to losing the money that you have on unused Toys "R" Us gift cards, but you'll have to act fast to take advantage of this exchange program that is good right here in Ocean County:  https://t.co/Px6izheI33
981229861971939330 2018-04-03 11:00:04 -0800 <wobm> Stephen Paul will spend 30-years in a state prison following the plea to first degree aggravated manslaughter.  https://t.co/yo2A9o0VOg
981214763291152384 2018-04-03 10:00:04 -0800 <wobm> The self-employed painter was found w

980750291690774528 2018-04-02 03:14:25 -0800 <wobm> Here's what you need to know about today's weather.  https://t.co/CnCXdfqLWP
980746419450474496 2018-04-02 02:59:02 -0800 <wobm> NJ school closings and delays for Monday, April 2: A quick snowfall on Monday morning could mean delayed openings and closures for New Jersey schools on Monday. Continue reading…  https://t.co/ieLObJf6lN  https://t.co/N8AH4Dnghi
980746414522167296 2018-04-02 02:59:01 -0800 <wobm> Congratulations To Michael Gavini Jr., Our Warrior of the Week: Congratulations to Michael Gavini Jr., Our Warrior of the Week for the week of April 2, 2018. Continue reading…  https://t.co/C7ImBGXI8q  https://t.co/glVI8xVFpP
980746408587227136 2018-04-02 02:58:59 -0800 <wobm> Student of the Week: 92.7 WOBM and Gateway Toyota of Toms River honor ANGEL SANCHES of Central Regional High School as the Student of the Week. Continue reading…  https://t.co/taDW4655LE  https://t.co/hcmb13TCZa
980746404325818368 2018-04-02 02:58:58 -0800 <wo

[!] No more data! Scraping will stop now.
found 0 deleted tweets in this search.
981428180673290240 2018-04-04 00:08:06 -0800 <JRuiz2444> @jcb1_ @jeffrey_poe I’m gonna have nightmares #24
981405355086438401 2018-04-03 22:37:24 -0800 <JRuiz2444> @FabeBaezaS12 But those “bums” beat y’all twice this year👀
981404719448035329 2018-04-03 22:34:53 -0800 <JRuiz2444> @FabeBaezaS12 The spurs fucking suck
981357760569888768 2018-04-03 19:28:17 -0800 <JRuiz2444> @jordan_ruiz_  https://t.co/oV0G5trcbP
981275470862962688 2018-04-03 14:01:18 -0800 <JRuiz2444> @jeffrey_poe I’ll give you $21 since your turning 21. It’s basically the same thing
980362164463693824 2018-04-01 01:32:08 -0800 <JRuiz2444> @jeffrey_poe @jahernandez0218  https://t.co/BnTA5VawjO
979856513832452097 2018-03-30 16:02:52 -0800 <JRuiz2444> @jahernandez0218 @jeffrey_poe We just got a sniper dub without you. You’re the weakest link
979829589701734400 2018-03-30 14:15:53 -0800 <JRuiz2444> @jeffrey_poe I hate myself too.
979792403388358

998759437685977089 2018-05-21 19:56:20 -0800 <T_Mulherin> Can't win games if you don't rebound.
998730582136971266 2018-05-21 18:01:40 -0800 <T_Mulherin> Marcus Smart's immense impact isn't always positive. Yeah, good box out on Thompson and some good overall D. But those two forced 3s he missed broke up a string of Boston getting consistent buckets to start heating up.
998728338020732934 2018-05-21 17:52:45 -0800 <T_Mulherin> I don't even know why J.R. Smith takes regular open shots anymore because he only ever makes the low percentage ones
998031034498387968 2018-05-19 19:41:55 -0800 <T_Mulherin> @BrookQuinones40 I respect that, but talking about the royal wedding during a professional basketball game is a good way to get muted
998029579372650496 2018-05-19 19:36:08 -0800 <T_Mulherin> @MattVautour424 Only thing that would make that better would be if it projected an image of your face. Or my face, for that matter.
998028400882257920 2018-05-19 19:31:27 -0800 <T_Mulherin> @MattVautour

988949738610479110 2018-04-24 18:16:06 -0800 <kaytlovee1> Some people just suck the freaking nice out of me
988610792047116289 2018-04-23 19:49:14 -0800 <kaytlovee1> It’s like I’m in a bad dream and can’t seem to wake up
988607467180445701 2018-04-23 19:36:02 -0800 <kaytlovee1> Guys are so cruel
988607214985375745 2018-04-23 19:35:02 -0800 <kaytlovee1> #NewProfilePic  https://t.co/ORsq01dZaU
988570907944812550 2018-04-23 17:10:45 -0800 <kaytlovee1> @lexy_18 @L_Eazy5
988570818006454272 2018-04-23 17:10:24 -0800 <kaytlovee1> Guys must be on their period today 😂
[!] No more data! Scraping wi

997620242288467969 2018-05-18 16:29:35 -0800 <fwt1776> I’m literally so tired I have lost any resemblance of caring.
997245886685102080 2018-05-17 15:42:01 -0800 <fwt1776> @EmilyErinDill I’ll be a witness 😳
997099945353269248 2018-05-17 06:02:06 -0800 <fwt1776> @JohnCooper0610 👏👏👏👏
997090521335156736 2018-05-17 05:24:39 -0800 <fwt1776> @GovMikeHuckabee @SchultheisKathy It’s a good day when you get the lube. 😂😂
997090251851132928 2018-05-17 05:23:35 -0800 <fwt1776> @AMike4761 Just hook it up to some electricity. That’ll do it. Or a fire hose. That’d be fun. Like that game at the fairs where the more you shoot off the bigger the balloon gets....good times
997089649008013312 2018-05-17 05:21:12 -0800 <fwt1776> @gr8tjude @ArizonaKayte 😂😂😂 I dare the New York Yankees to play me. Lol. See how stupid that sounds @JimAcosta - Yankess 

1001794226315780096 2018-05-30 04:55:30 -0800 <ShaunECarter> Skeptical people stay broker while wealthy people find ways to invest.  Dr Jamal Harrison Bryant saw the vision..local pastors, what about you?  https://t.co/yaV8DSUy7x
1001598076036026368 2018-05-29 15:56:04 -0800 <ShaunECarter> I posted a new video to Facebook  https://t.co/uxr8lhkRfW
1001229710624460805 2018-05-28 15:32:19 -0800 <ShaunECarter> Another business partne

1004086740578131968 2018-06-05 12:45:08 -0800 <SOSA_CB_> @nathan031599 @aboogie728 @TBEScuffable Get lebron on the block, stop making him bring the ball up court when you have POINT GUARDS to do that... Driving &amp; Kicking is the main we people try to get assist that’s why they shoot as soon as he passes the can make EXTRA PASSES ONE pass on MOST POSSESSIONS isn’t good BBALL
1004077563952672768 2018-06-05 12:08:40 -0800 <SOSA_CB_> @TGodfrey93 @Tsmackthebarber @MyOpini97849657 @ItsMeCathi @JanaBlade1 Right, you basically just blamed the woman for being raped and can the females in your family shoot a gun while being unconscious? If so they’re TREMENDOUSLY TALENTED
1004077126163881986 2018-06-05 12:06:56 -0800 <SOSA_CB_> @TGodfrey93 @Tsmackthebarber @MyOpini97849657 @ItsMeCathi @JanaBlade1 That was not the MAIN SUBJECT of this quoted tweet that you responded on....did you read what the young lady wrote? and Why can’t I call him African American if tests and job applications in the US d

Unnamed: 0,id,conversation_id,created_at,date,timezone,place,tweet,language,hashtags,cashtags,...,geo,source,user_rt_id,user_rt,retweet_id,reply_to,retweet_date,translate,trans_src,trans_dest
0,941516350572367872,941516350572367872,1513310000000.0,2017-12-14 19:52:45,-800,,@t_reeev999 why’s this me 100%.? https://t.c...,en,[],[],...,,,,,,[],,,,
1,941515048639516673,941515048639516673,1513310000000.0,2017-12-14 19:47:34,-800,,@t_reeev999 why is this me 100% https://t.co/...,en,[],[],...,,,,,,[],,,,
2,941510068469723136,941510068469723136,1513308000000.0,2017-12-14 19:27:47,-800,,Anyone wanna buy me a new truck or car.????,en,[],[],...,,,,,,[],,,,
3,940785138539089921,940785138539089921,1513136000000.0,2017-12-12 19:27:10,-800,,Me and @jerakah_madron are having Jeff Dunham...,en,[],[],...,,,,,,[],,,,
4,940677214852456448,940677214852456448,1513110000000.0,2017-12-12 12:18:19,-800,,Why does Hobbs gotta act and say they are gonn...,en,[],[],...,,,,,,[],,,,
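
The window logic in the scraping loop can be isolated into a small helper and checked without touching the network. This sketch is an assumption about intent: `scrape_window` is a hypothetical name, and it simply reproduces the rule above that a span shorter than seven days is widened to seven days from the first flagged tweet.

```python
from datetime import date, timedelta


def scrape_window(first_seen, last_seen, min_days=7):
    """Return (since, until) as ISO date strings for one user.

    If the span between the first and last flagged tweet is shorter than
    min_days, the end date is pushed out to first_seen + min_days.
    """
    if (last_seen - first_seen).days < min_days:
        last_seen = first_seen + timedelta(days=min_days)
    return str(first_seen), str(last_seen)


# A user flagged twice within three days gets a full seven-day window.
short_span = scrape_window(date(2018, 1, 1), date(2018, 1, 4))
# A user flagged over a longer period keeps the original span.
long_span = scrape_window(date(2018, 1, 1), date(2018, 2, 1))
print(short_span, long_span)
```

The returned strings are in the same `YYYY-MM-DD` form that the loop passes to `c.Since` and `c.Until`.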


In [276]:
len(uf)

780

In [277]:
uf['unflagged'] = 0
uf.head()

Unnamed: 0,id,conversation_id,created_at,date,timezone,place,tweet,language,hashtags,cashtags,...,source,user_rt_id,user_rt,retweet_id,reply_to,retweet_date,translate,trans_src,trans_dest,unflagged
0,941516350572367872,941516350572367872,1513310000000.0,2017-12-14 19:52:45,-800,,@t_reeev999 why’s this me 100%.? https://t.c...,en,[],[],...,,,,,[],,,,,0
1,941515048639516673,941515048639516673,1513310000000.0,2017-12-14 19:47:34,-800,,@t_reeev999 why is this me 100% https://t.co/...,en,[],[],...,,,,,[],,,,,0
2,941510068469723136,941510068469723136,1513308000000.0,2017-12-14 19:27:47,-800,,Anyone wanna buy me a new truck or car.????,en,[],[],...,,,,,[],,,,,0
3,940785138539089921,940785138539089921,1513136000000.0,2017-12-12 19:27:10,-800,,Me and @jerakah_madron are having Jeff Dunham...,en,[],[],...,,,,,[],,,,,0
4,940677214852456448,940677214852456448,1513110000000.0,2017-12-12 12:18:19,-800,,Why does Hobbs gotta act and say they are gonn...,en,[],[],...,,,,,[],,,,,0


In [278]:
uf.columns

Index(['id', 'conversation_id', 'created_at', 'date', 'timezone', 'place',
       'tweet', 'language', 'hashtags', 'cashtags', 'user_id', 'user_id_str',
       'username', 'name', 'day', 'hour', 'link', 'urls', 'photos', 'video',
       'thumbnail', 'retweet', 'nlikes', 'nreplies', 'nretweets', 'quote_url',
       'search', 'near', 'geo', 'source', 'user_rt_id', 'user_rt',
       'retweet_id', 'reply_to', 'retweet_date', 'translate', 'trans_src',
       'trans_dest', 'unflagged'],
      dtype='object')

In [279]:
uf = uf[['username', 'date', 'tweet', 'unflagged']]

In [280]:
uf = uf.rename(columns = {'username':'screen_name', 'date':'time_posted','tweet':'content', 'unflagged':'flagged'})
uf.head()

Unnamed: 0,screen_name,time_posted,content,flagged
0,_gracielee_,2017-12-14 19:52:45,@t_reeev999 why’s this me 100%.? https://t.c...,0
1,_gracielee_,2017-12-14 19:47:34,@t_reeev999 why is this me 100% https://t.co/...,0
2,_gracielee_,2017-12-14 19:27:47,Anyone wanna buy me a new truck or car.????,0
3,_gracielee_,2017-12-12 19:27:10,Me and @jerakah_madron are having Jeff Dunham...,0
4,_gracielee_,2017-12-12 12:18:19,Why does Hobbs gotta act and say they are gonn...,0


In [281]:
df = df[['screen_name', 'time_posted', 'content', 'flagged']]

In [282]:
df.head()

Unnamed: 0,screen_name,time_posted,content,flagged
0,Cpt_Martinez,2017-12-08 10:27:00,You have gotta be the dumbest Bitch I know and...,1
1,Cpt_Martinez,2017-12-08 17:27:00,You have gotta be the dumbest Bitch I know🤦🏻‍♂...,1
2,TheNewErikT93,2017-12-10 11:09:00,Dez can kiss my ass im done defending him,1
3,Cpt_Martinez,2017-12-11 17:55:00,I just wanna take you out an show you off,1
4,TheNewErikT93,2017-12-11 21:02:00,A womans idea of shooting her shot is liking a...,1


In [283]:
df = pd.concat([df, uf])
print(len(df))

1262
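
One detail worth checking after this concatenation is the index: `pd.concat` keeps each frame's original labels by default, so the combined table can contain repeated labels. The sketch below, on invented rows, shows the difference that `ignore_index=True` makes.

```python
import pandas as pd

# Invented rows; both frames carry the default labels 0 and 1.
flagged = pd.DataFrame({"content": ["a", "b"], "flagged": [1, 1]})
unflagged = pd.DataFrame({"content": ["c", "d"], "flagged": [0, 0]})

# Default behaviour: labels 0 and 1 each appear twice.
stacked = pd.concat([flagged, unflagged])
print(stacked.index.is_unique)  # False

# With ignore_index=True the result gets a fresh 0..n-1 index.
combined = pd.concat([flagged, unflagged], ignore_index=True)
print(combined.index.is_unique)  # True
```

With repeated labels, an expression such as `stacked['content'][0]` returns a Series of two rows rather than a single string, which affects any later code that compares tweets by label.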


In [None]:
tweets = df 

In [284]:
def dedupe_and_id(tweets):
    #where tweets is a df of tweets
    drop_idx = []

    #compare every pair of tweets and record pairs that are more than 90% similar
    for i in range(0, len(tweets['content'])):
        for j in range(0, len(tweets['content'])):
            try:
                similarity_ratio = levenshtein.ratio(tweets['content'][i], tweets['content'][j])
                if similarity_ratio > 0.9 and i != j:
                    drop_idx.append([i,j])
            except Exception:
                #pairs that can't be compared are treated as not similar
                similarity_ratio = 0

    drop_idx = sorted(drop_idx)

    #each similar pair was recorded twice, as [i,j] and [j,i]; remove the mirrored copy
    for i in range(0, len(drop_idx)):
        try:
            x = drop_idx[i][0]
            y = drop_idx[i][1]
            drop_idx[:] = (value for value in drop_idx if value != [y,x])
        except Exception:
            pass

    #similar tweets from different users are not duplicates, so keep them
    dont_drop = []
    for i in drop_idx:
        j = i[0]
        k = i[1]
        try:
            if tweets['screen_name'][j].lower() != tweets['screen_name'][k].lower():
                dont_drop.append([j,k])
        except Exception:
            pass

    for i in dont_drop:
        drop_idx[:] = (value for value in drop_idx if value != i)

    #collect the second tweet of each remaining pair (all but the last pair)
    drop = []
    for l in range(0, len(drop_idx)-1):
        drop.append(drop_idx[l][1])

    #remove repeated indices, then drop those rows
    drop_list = list(dict.fromkeys(drop))
    drop_list = sorted(drop_list)
    tweets = tweets.drop(index = drop_list)

    #write out and read back so the old index becomes a column, then name it tweet_id
    tweets.to_csv('data/CSVs/Created/tweet_data.csv')
    tweets = pd.read_csv('data/CSVs/Created/tweet_data.csv')
    tweets.rename(columns={'Unnamed: 0':'tweet_id'}, inplace=True)
    return tweets

In [285]:
tweets = dedupe_and_id(tweets)

In [287]:
len(tweets)

1236
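
For comparison, here is a compact sketch of the same idea: drop a tweet when an earlier tweet by the same author is more than 90% similar. It is not the function above. It uses the standard library's `difflib.SequenceMatcher` in place of the Levenshtein package (the two ratios are computed differently, so borderline pairs may be treated differently), `drop_near_duplicates` is a hypothetical name, the sample rows are invented, and it assumes no missing values in `screen_name` or `content`.

```python
from difflib import SequenceMatcher

import pandas as pd


def drop_near_duplicates(tweets, threshold=0.9):
    """Keep the first of any same-author tweets whose text is near-identical."""
    # Fresh 0..n-1 labels so each lookup below returns exactly one value.
    tweets = tweets.reset_index(drop=True)
    keep = []
    for i, row in tweets.iterrows():
        is_duplicate = any(
            tweets.at[j, "screen_name"].lower() == row["screen_name"].lower()
            and SequenceMatcher(None, tweets.at[j, "content"], row["content"]).ratio() > threshold
            for j in keep
        )
        if not is_duplicate:
            keep.append(i)
    # Renumber the surviving rows and expose that number as tweet_id.
    result = tweets.loc[keep].reset_index(drop=True)
    result.index.name = "tweet_id"
    return result.reset_index()


sample = pd.DataFrame(
    {
        "screen_name": ["user_a", "User_A", "user_b"],
        "content": [
            "You have gotta be kidding me right now",
            "You have gotta be kidding me right now!",
            "You have gotta be kidding me right now",
        ],
        "flagged": [1, 1, 0],
    }
)
deduped = drop_near_duplicates(sample)
```

In the sample, the second row is dropped as a near-copy by the same author (the name comparison ignores case), while the third row survives because it comes from a different user.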

In [291]:
tweets.to_csv('data/CSVs/Created/tweet_data.csv', index=False)