X, formerly known as Twitter, has a set of rules it expects users of the platform to abide by. Last year, I scraped over 64M tweets and wrote a set of programs to identify which of those tweets were deleted by (then) Twitter. I undertook this project to verify that the platform was following its own rules when deciding which tweets to delete and was not biased in its approach. I documented my findings in a blog that can be found here. In summary, most, if not all, of the deleted tweets I found could be labelled as some sort of hate speech, and my concerns about bias in Twitter’s approach dissipated.

When reading through some of the “censored” tweets (and wiping from my eyes the blood drawn by all the hate speech), I found some whose authors might have benefited from a warning that their tweet could be deleted. For example, an author may be trying to get a point across to a member of Congress about something they feel strongly about; unfortunately, they probably felt too strongly, and their rhetoric “crossed the line.” I thought, “wouldn’t it be cool if a tweet author were prompted to consider rewriting their tweet so that it does not get taken down by the platform team?” This is the crux of my project proposal: train a model to predict whether a tweet will be deleted from the platform.

In this project, I will use two different techniques to vectorize the tweet data:
 - spaCy
 - scikit-learn's TF-IDF vectorizer

From there, I will apply four techniques to each of these vectorizations:
 - Custom-built KNN classifier using cosine similarity
 - scikit-learn KNN classifier using PCA
 - Cross-validation using RandomForestClassifier
 - K-means clustering
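
To make the first technique concrete, here is a minimal sketch of a KNN classifier that ranks neighbors by cosine similarity. It is not the implementation used in the project: the bag-of-words `vectorize` helper is a toy stand-in for the spaCy or TF-IDF vectors, the function names are hypothetical, and the training sentences are hand-made rather than taken from the dataset.

```python
import math
from collections import Counter


def vectorize(text):
    """Toy bag-of-words vector: lowercase token -> count (stand-in for TF-IDF or spaCy vectors)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_predict(train, query, k=3):
    """Majority vote over the k training texts most similar to the query."""
    query_vec = vectorize(query)
    scored = sorted(
        ((cosine_similarity(vectorize(text), query_vec), label) for text, label in train),
        key=lambda pair: pair[0],
        reverse=True,
    )
    top_labels = [label for _, label in scored[:k]]
    return Counter(top_labels).most_common(1)[0][0]


# Hypothetical, hand-made examples (1 = deleted, 0 = kept); not taken from the dataset.
train = [
    ("you are awful and i hate you", 1),
    ("i hate everyone like you", 1),
    ("what a lovely sunny day", 0),
    ("enjoying a lovely walk today", 0),
    ("great game last night", 0),
]

prediction = knn_predict(train, "i hate you", k=3)
```

Cosine similarity compares the direction of two vectors rather than their length, which is why it is a common choice for text, where tweet length varies a lot.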
 
For more information about how I gathered this data, check out my blog and GitHub:
 - Blog: https://inthegraey.com/
 - GitHub: https://github.com/madecero/thegraey
 
NOTE: This notebook only takes data from my local database and produces a CSV that will be used for model development. Refer to predictions.ipynb for the analysis that uses the CSV produced by the code below.

### Query local database to obtain all tweets along with their delete reason code (if applicable)

In [52]:
import sqlite3
import pandas as pd

In [53]:
# Establish a connection to the SQLite database
conn = sqlite3.connect('de0project.db')

# Define your SQL query
query = 'SELECT ID, Text, CreatedAt, deleteReason FROM deleteView'

# Execute the query and store the results in a Pandas DataFrame
df = pd.read_sql_query(query, conn)

# Close the database connection
conn.close()
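
Because the query above pulls every row into memory, it can help to first check which delete reasons exist and how often each occurs. The sketch below does that count inside SQLite. It runs against a toy in-memory table that borrows the `deleteView` name and columns; the rows and the short reason text are made up for illustration.

```python
import sqlite3

# Toy in-memory stand-in for the local database; the real project reads 'de0project.db'.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE deleteView (ID INTEGER, Text TEXT, CreatedAt TEXT, deleteReason TEXT)"
)
conn.executemany(
    "INSERT INTO deleteView VALUES (?, ?, ?, ?)",
    [
        (1, "first tweet", "Sat Dec 25 00:45:31 2021", None),
        (2, "second tweet", "Sat Dec 25 00:45:31 2021", None),
        (3, "third tweet", "Tue Jan 25 08:36:19 2022", "violated the rules"),  # hypothetical reason
    ],
)

# Count rows per delete reason inside SQLite, so only the summary reaches Python.
rows = conn.execute(
    "SELECT deleteReason, COUNT(*) FROM deleteView GROUP BY deleteReason"
).fetchall()
reason_counts = dict(rows)
conn.close()
```

Rows with no delete reason come back under the key `None`, which matches how the missing values appear after the data is loaded into pandas.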

In [54]:
#What does this dataframe look like?
df.head()

Unnamed: 0,ID,Text,CreatedAt,deleteReason
0,1474541784210362368,RT @Fukkard: Top Two Belongs to Pra’BOSS’✊\n\n...,Sat Dec 25 00:45:31 2021,
1,1474541784181182468,RT @texan40: I swear... only in south Texas 😂 ...,Sat Dec 25 00:45:31 2021,
2,1474541784176861185,RT @kdramadaisy: choi woong is the standard.\n...,Sat Dec 25 00:45:31 2021,
3,1474541784156020741,RT @nft_ray: ANY #MAYC OWNERS INTERESTED IN TR...,Sat Dec 25 00:45:31 2021,
4,1474541784155963396,RT @methnpizza: Spare 11? 1k would be a wonder...,Sat Dec 25 00:45:31 2021,


In [55]:
#What is the shape?
df.shape

(64163912, 4)

In [56]:
#Let's make sure the delete reasons we care about came through
deletedf = df[df['deleteReason'] == 'Twitter API returned a 404 (Not Found), This Tweet is no longer available because it violated the Twitter Rules.']

In [57]:
deletedf.head()

Unnamed: 0,ID,Text,CreatedAt,deleteReason
2468695,1485894286365368328,That will really help stop the huge surge of m...,Tue Jan 25 08:36:19 2022,"Twitter API returned a 404 (Not Found), This T..."
2524858,1485996263380238341,"@ebonykayxxxx Scotland: fried mars bar, Gordon...",Tue Jan 25 15:21:32 2022,"Twitter API returned a 404 (Not Found), This T..."
2536038,1486011588780134400,Hey Sunghoon! Don't you dare to be closer with...,Tue Jan 25 16:22:26 2022,"Twitter API returned a 404 (Not Found), This T..."
2632183,1486939783540588544,@BevSutphin78 @MethyNurse @catherinenunya @Can...,Fri Jan 28 05:50:45 2022,"Twitter API returned a 404 (Not Found), This T..."
2710292,1487083580220186625,@Kasoulis1 @pskrill @Gala_heart @tariqnasheed ...,Fri Jan 28 15:22:08 2022,"Twitter API returned a 404 (Not Found), This T..."


In [58]:
deletedf.shape

(1153, 4)

### Transform our target variable to binary

In [59]:
# Convert the Target column based on substring presence
df['deleteReason'] = df['deleteReason'].apply(
    lambda x: 1 if x is not None and "Twitter API returned a 404 (Not Found), This Tweet is no longer available because it violated the Twitter Rules." in x else 0)
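
As an alternative to the row-by-row `apply`, the same labeling can be sketched with pandas' vectorized string methods. This is an illustration on a toy frame, not a change to the notebook: the `toy` frame, the `target` column, and the "some other reason" value are assumptions made for the example.

```python
import pandas as pd

# Exact reason string used in the notebook to mark rule-violation deletions.
VIOLATION = (
    "Twitter API returned a 404 (Not Found), This Tweet is no longer "
    "available because it violated the Twitter Rules."
)

# Hypothetical toy frame; the third reason string is made up for illustration.
toy = pd.DataFrame({"deleteReason": [None, VIOLATION, "some other reason"]})

# regex=False treats the parentheses and periods literally;
# na=False maps missing values to False instead of NaN.
toy["target"] = (
    toy["deleteReason"].str.contains(VIOLATION, regex=False, na=False).astype(int)
)
labels = toy["target"].tolist()
```

Writing the result to a new column, as done here, also keeps the original reason text available for later inspection.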

In [60]:
df.head()

Unnamed: 0,ID,Text,CreatedAt,deleteReason
0,1474541784210362368,RT @Fukkard: Top Two Belongs to Pra’BOSS’✊\n\n...,Sat Dec 25 00:45:31 2021,0
1,1474541784181182468,RT @texan40: I swear... only in south Texas 😂 ...,Sat Dec 25 00:45:31 2021,0
2,1474541784176861185,RT @kdramadaisy: choi woong is the standard.\n...,Sat Dec 25 00:45:31 2021,0
3,1474541784156020741,RT @nft_ray: ANY #MAYC OWNERS INTERESTED IN TR...,Sat Dec 25 00:45:31 2021,0
4,1474541784155963396,RT @methnpizza: Spare 11? 1k would be a wonder...,Sat Dec 25 00:45:31 2021,0


In [61]:
deletedf = df[df['deleteReason'] == 1]

In [62]:
deletedf.head()

Unnamed: 0,ID,Text,CreatedAt,deleteReason
2468695,1485894286365368328,That will really help stop the huge surge of m...,Tue Jan 25 08:36:19 2022,1
2524858,1485996263380238341,"@ebonykayxxxx Scotland: fried mars bar, Gordon...",Tue Jan 25 15:21:32 2022,1
2536038,1486011588780134400,Hey Sunghoon! Don't you dare to be closer with...,Tue Jan 25 16:22:26 2022,1
2632183,1486939783540588544,@BevSutphin78 @MethyNurse @catherinenunya @Can...,Fri Jan 28 05:50:45 2022,1
2710292,1487083580220186625,@Kasoulis1 @pskrill @Gala_heart @tariqnasheed ...,Fri Jan 28 15:22:08 2022,1


In [63]:
df.shape

(64163912, 4)

In [64]:
deletedf.shape

(1153, 4)

### Let's create a df that is only records that have a target variable of 0 (not deleted by X)

In [65]:
sampledf = df[df['deleteReason'] == 0]

In [66]:
sampledf.head()

Unnamed: 0,ID,Text,CreatedAt,deleteReason
0,1474541784210362368,RT @Fukkard: Top Two Belongs to Pra’BOSS’✊\n\n...,Sat Dec 25 00:45:31 2021,0
1,1474541784181182468,RT @texan40: I swear... only in south Texas 😂 ...,Sat Dec 25 00:45:31 2021,0
2,1474541784176861185,RT @kdramadaisy: choi woong is the standard.\n...,Sat Dec 25 00:45:31 2021,0
3,1474541784156020741,RT @nft_ray: ANY #MAYC OWNERS INTERESTED IN TR...,Sat Dec 25 00:45:31 2021,0
4,1474541784155963396,RT @methnpizza: Spare 11? 1k would be a wonder...,Sat Dec 25 00:45:31 2021,0


In [67]:
sampledf.shape

(64162759, 4)

### Let's pull a sample of 10k rows from sampledf so that our algorithms can handle the smaller load. 64M rows take too long to run, and we are not distributing the load for this project because we want to keep costs at $0.

In [68]:
sampledf_10k = sampledf.sample(n=10000)
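
The call above draws a different random sample on every run. If a repeatable sample is wanted, one option is to pass `random_state` to `sample`. A minimal sketch on a toy frame follows; the frame and the seed value 42 are arbitrary choices for illustration, not settings used in the project.

```python
import pandas as pd

# Hypothetical stand-in for the non-deleted tweets.
toy = pd.DataFrame({"ID": range(100), "deleteReason": 0})

# The same random_state yields the same rows on every run; 42 is an arbitrary choice.
first = toy.sample(n=10, random_state=42)
second = toy.sample(n=10, random_state=42)
same_rows = first.index.equals(second.index)
```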

In [69]:
sampledf_10k.head()

Unnamed: 0,ID,Text,CreatedAt,deleteReason
12040219,1521947259306491904,RT @ogm4xb_: women with good pussy can’t drive,Wed May 04 20:17:57 2022,0
30815750,1575088295302012929,RT @LisaPeruBP: Typa girl that came straight o...,Wed Sep 28 11:41:27 2022,0
20688716,1539416811099328513,@StrictLiable @ClaireFosterPHD @Timcast she's ...,Wed Jun 22 01:15:43 2022,0
6171573,1498186913702084608,RT @Pradeep_Dmk: Women priceless Bus 🔥\n#Stal...,Mon Feb 28 06:42:49 2022,0
19794827,1532365059241832448,RT @ladyincrypto: $50 GIVEAWAY ~ 4 HOURS ⏳ \n...,Thu Jun 02 14:14:34 2022,0


In [70]:
#let's sort it by index

sampledf_10k = sampledf_10k.sort_index()

In [71]:
sampledf_10k.head()

Unnamed: 0,ID,Text,CreatedAt,deleteReason
1315,1474541778120450049,RT @McQueenRH: new active list now! come get a...,Sat Dec 25 00:45:30 2021,0
4328,1474549365578645506,RT @thatboicandy: @mooncat2878 @tezos @TempleW...,Sat Dec 25 01:15:39 2021,0
6789,1474553150841376770,RT @btsvotingorg: Collect more hearts and regi...,Sat Dec 25 01:30:41 2021,0
21157,1474579650475134983,Instead of saying English is not a desi langua...,Sat Dec 25 03:15:59 2021,0
40552,1474621084993343494,🎶 It came upon a midnight cloudy...\n\n...the ...,Sat Dec 25 06:00:38 2021,0


### We now have a dataframe of 10k tweets that were not deleted. Let's concatenate it with the 1153 tweets that were deleted to build the dataframe we will use to run our models.

In [72]:
projectdf = pd.concat([sampledf_10k, deletedf])

In [73]:
projectdf = projectdf.sort_index()

In [74]:
projectdf.head()

Unnamed: 0,ID,Text,CreatedAt,deleteReason
1315,1474541778120450049,RT @McQueenRH: new active list now! come get a...,Sat Dec 25 00:45:30 2021,0
4328,1474549365578645506,RT @thatboicandy: @mooncat2878 @tezos @TempleW...,Sat Dec 25 01:15:39 2021,0
6789,1474553150841376770,RT @btsvotingorg: Collect more hearts and regi...,Sat Dec 25 01:30:41 2021,0
21157,1474579650475134983,Instead of saying English is not a desi langua...,Sat Dec 25 03:15:59 2021,0
40552,1474621084993343494,🎶 It came upon a midnight cloudy...\n\n...the ...,Sat Dec 25 06:00:38 2021,0


In [75]:
projectdf.shape

(11153, 4)

### Write to a CSV to be used for our final project analysis

In [76]:
projectdf.to_csv('projectdf.csv', index = False)
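
Tweet text often contains newlines and commas, so it is worth confirming that the CSV reads back with the same rows and values. The sketch below round-trips a small, made-up frame through an in-memory buffer instead of a file; the two `Text` values are hypothetical, and `index=False` mirrors the setting used above.

```python
import io

import pandas as pd

# Hypothetical rows; the embedded newline mimics the "\n" seen in real tweet text.
toy = pd.DataFrame(
    {
        "ID": [1474541784210362368, 1485894286365368328],
        "Text": ["line one\nline two", "plain text, with a comma"],
        "deleteReason": [0, 1],
    }
)

buffer = io.StringIO()
toy.to_csv(buffer, index=False)  # same index=False setting as the notebook
buffer.seek(0)

restored = pd.read_csv(buffer)
text_matches = restored["Text"].tolist() == toy["Text"].tolist()
ids_match = restored["ID"].tolist() == toy["ID"].tolist()
```

Note that `index=False` drops the original row index, so the position of each tweet in the full 64M-row dataframe is not carried into the CSV.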

# Please now refer to DSC478_FinalProject_mdecero.ipynb for the rest of the project. This notebook exists only to create a CSV for the analysis. This way, the grader can replicate the steps using the produced CSV rather than a local database that they will not have access to.