# Table of Contents
- [Project Definition](#Project-Definition)
- [Analysis](#Analysis)
- [Conclusion](#Conclusion)
- [Future Work](#Future-Work)
- [Guides](#Guides)

# Project Definition

Utilizing and analyzing a dataset of classified Tweets from [Dataturks through Kaggle](https://www.kaggle.com/dataturks/dataset-for-detection-of-cybertrolls) to build a model that classifies Tweets between good or bad and display results in a Flask web application.

# Analysis

Some additional analysis has been done on Kaggle:
<br>https://www.kaggle.com/kevinlwebb/cybertrolls-exploration-and-ml

In [1]:
import pandas as pd
import numpy as np

from sqlalchemy import create_engine
from joblib import dump, load

import re
import nltk

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

  regargs, varargs, varkwargs, defaults, formatvalue=lambda value: ""
  from collections import Sequence, defaultdict
  from collections import Counter, Iterable
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/kevinwebb/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /Users/kevinwebb/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/kevinwebb/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [2]:
def tokenize(text):
    tokens = word_tokenize(text)
    lemmatizer = WordNetLemmatizer()

    clean_tokens = []
    for tok in tokens:
        clean_tok = lemmatizer.lemmatize(tok).lower().strip()
        clean_tokens.append(clean_tok)

    return clean_tokens

In [3]:
# load data
engine = create_engine('sqlite:///data/TweetSentiment.db')
df = pd.read_sql_table('tweets', engine)

# load model
model = load("models/classifier.pkl")

In [4]:
df.head()

Unnamed: 0,cleaned_tweet,label
0,Get fucking real dude.,1
1,She is as dirty as they come and that crook ...,1
2,why did you fuck it up. I could do it all day...,1
3,Dude they dont finish enclosing the fucking s...,1
4,WTF are you talking about Men? No men thats n...,1


In [5]:
df.groupby("label").count()

Unnamed: 0_level_0,cleaned_tweet
label,Unnamed: 1_level_1
0,12179
1,7822


In [6]:
bad_per = (len(df[df.label == '1']) / len(df)) * 100
print("Percent of bad tweets: {}".format(bad_per))

good_per = (len(df[df.label == '0']) / len(df)) * 100
print("Percent of good tweets: {}".format(good_per))

Percent of bad tweets: 39.10804459777012
Percent of good tweets: 60.89195540222989


In [7]:
sw = stopwords.words("english")
text = df.cleaned_tweet.str.cat(sep=' ')
text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
str_list = text.split(" ")
s = pd.Series(str_list)
s = s[s != ""]
s = s[~s.isin(sw)]

top_10 = s.value_counts()[:10]
word_counts = s.value_counts()[:10].tolist()
word_names = s.value_counts()[:10].index.tolist()

In [8]:
print("Top 10 General Words")
print(top_10)

Top 10 General Words
hate       2833
damn       2485
ass        1820
sucks      1537
fuck       1494
lol        1440
like       1440
get        1046
fucking     997
u           984
dtype: int64


In [9]:
text = df[df.label == '1'].cleaned_tweet.str.cat(sep=' ')
text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
str_list = text.split(" ")
s = pd.Series(str_list)
s = s[s != ""]
s = s[~s.isin(sw)]

bad_top_10 = s.value_counts()[:10]
bad_word_counts = s.value_counts()[:10].tolist()
bad_word_names = s.value_counts()[:10].index.tolist()

In [10]:
print("Top 10 Bad Words")
print(bad_top_10)

Top 10 Bad Words
hate       1326
damn       1110
fuck       1070
ass        1070
sucks       724
fucking     634
lol         596
u           512
bitch       502
like        500
dtype: int64


In [11]:
text = df[df.label == '0'].cleaned_tweet.str.cat(sep=' ')
text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
str_list = text.split(" ")
s = pd.Series(str_list)
s = s[s != ""]
s = s[~s.isin(sw)]

good_top_10 = s.value_counts()[:10]
good_word_counts = s.value_counts()[:10].tolist()
good_word_names = s.value_counts()[:10].index.tolist()

In [12]:
print("Top 10 Good Words")
print(good_top_10)

Top 10 Good Words
hate     1507
damn     1375
like      940
lol       844
sucks     813
ass       750
would     694
get       599
one       528
know      515
dtype: int64


In [25]:
query = "I hate you"

if model.predict([query])[0] == '0':
    print("The Tweet '{}' is a good tweet".format(query))
else:
    print("The Tweet '{}' is a bad tweet".format(query))

The Tweet 'I hate you' is a good tweet


# Conclusion

The divide of data between good and bad is great, but after seeing that 'hate' and other bad words are shared between classified bad and good, the data is still skewed. The model was tested above with a sentence with clear bad sentiment, but because of the skewed data (and other factors), the model predicted that the sentence has 'good' sentiment.

Below, you'll find different methods and solutions that may help in correctly classifying Tweets.

# Future Work

- Implement analysis on live Tweets
- Switch from classification to regression
- Switch and / or add new dataset
  - [sentiment140 dataset](https://www.kaggle.com/kazanova/sentiment140)
- Use more libraries and tools
  - textblob
- Utilize Hashtags, Emoticons, and Emojis

In [13]:
import os
from tweepy import Stream, OAuthHandler, API
from tweepy.streaming import StreamListener
import json
import pandas as pd
import csv
import re

In [14]:
#Twitter credentials for the app
consumer_key = os.environ['twitter_consumer_key']
consumer_secret = os.environ['twitter_consumer_secret']
access_key= os.environ['twitter_access_key']
access_secret = os.environ['twitter_access_secret']

In [15]:
#pass twitter credentials to tweepy
auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_key, access_secret)
api = API(auth)

In [16]:
#HappyEmoticons
emoticons_happy = set([
    ':-)', ':)', ';)', ':o)', ':]', ':3', ':c)', ':>', '=]', '8)', '=)', ':}',
    ':^)', ':-D', ':D', '8-D', '8D', 'x-D', 'xD', 'X-D', 'XD', '=-D', '=D',
    '=-3', '=3', ':-))', ":'-)", ":')", ':*', ':^*', '>:P', ':-P', ':P', 'X-P',
    'x-p', 'xp', 'XP', ':-p', ':p', '=p', ':-b', ':b', '>:)', '>;)', '>:-)',
    '<3'
    ])

In [17]:
# Sad Emoticons
emoticons_sad = set([
    ':L', ':-/', '>:/', ':S', '>:[', ':@', ':-(', ':[', ':-||', '=L', ':<',
    ':-[', ':-<', '=\\', '=/', '>:(', ':(', '>.<', ":'-(", ":'(", ':\\', ':-c',
    ':c', ':{', '>:\\', ';('
    ])

In [18]:
timeline = api.user_timeline(user_id="realDonaldTrump", count=200)

In [19]:
for what in timeline:
    print(what.text, "\n")

https://t.co/61KbAnpncb
Hey Google! Where's my shirt? 😁
#GCP #GoogleCloud @GCPcloud https://t.co/y8s4UhkT5m 

#ATTLIVE17 amazing things happen here https://t.co/21TovQK4Ub 

#ATTLIVE17 caught in the act https://t.co/VEcGwbbOoC 

#ATTLIVE17 morning views https://t.co/dcf6Q5N1wX 

@DTgolfstar @CauseWereGuys better goalie than I'll ever be. It should be Central's goalkeeper coach 👀 

RT @bayer04fussball: #Chicharito kommt von @ManUtd und unterschreibt bei der #Werkself bis 2018 || @CH14_ joins #Bayer04 until 2018! http:/… 

@destiny_belle getting there at 11 vs getting a decent parking spot. Your choice 😆 

@destiny_belle get there earlier 

RT @Lambdas1975: Congratulations to Delta Zeta Chapter (University of West Georgia) for earning the highest fraternity chapter GPA... http:… 

@Andy_Hartsfield you forgot #kevinislame gosh! 

@VivaDat_Stud how you should feel http://t.co/G5Z5GMpq4z 

@destiny_belle so... is this confirmation to do it? Or nah? 

@destiny_belle demand a year's worth of 

In [20]:
search = api.search(q="Trump", tweet_mode = 'extended', lang="en", rpp=5)

In [21]:
for tweet in search:
    print(tweet.full_text, "\n---------------------------\n")

RT @NewsCorpse: Trump says he was "distracted" by his impeachment from addressing the coronavirus. If that's true, it means he cared more a… 
---------------------------

RT @JenAshleyWright: There’s always a Trump tweet that aged poorly. https://t.co/RwkCkYQNTO 
---------------------------

@AdyBarkan @JoeBiden Nothing, I’m only voting for him because he’s not Trump. He’s not progressive so whatever. 
---------------------------

RT @AngelaBelcamino: @realDonaldTrump @nytimes To clarify to you all why DJT is throwing a hissing fit baby tantrum... It's because the Wal… 
---------------------------

RT @Barnes_Law: How Clinton Allies Hijacked Policy Response to Pandemic to Try and Sink Trump's Re-election https://t.co/b5zTgKTZVn 
---------------------------

RT @HowardA_Esq: Today, acting president Cuomo, his voice breaking, spoke about the horrifying number of New Yorkers, 799, that died in the… 
---------------------------

RT @mitchellvii: President Trump needs smart people like Dr. 

In [22]:
search = api.search(q="Alexandria Ocasio-Cortez", lang="en", tweet_mode = 'extended', include_rts=False, rpp=5)

In [23]:
for tweet in search:
    print(tweet.truncated)
    print(tweet.full_text)
    print("\n---------\n")

False
RT @DianeLong22: '🦠🦠🦠🦠AOC in the Middle of a Pandemic?? 🦠🦠🦠🦠🦠
I Pledge Allegiance To The Drag': Alexandria Ocasio-Cortez to Be Guest Judge…

---------

False
RT @TribulationThe: AOC is definitely ONE EVIL and CRAZY WOMAN!!!😡😡😡

'I Pledge Allegiance To The Drag': Alexandria Ocasio-Cortez to Be Gue…

---------

False
RT @DianeLong22: '🦠🦠🦠🦠AOC in the Middle of a Pandemic?? 🦠🦠🦠🦠🦠
I Pledge Allegiance To The Drag': Alexandria Ocasio-Cortez to Be Guest Judge…

---------

False
RT @DianeLong22: '🦠🦠🦠🦠AOC in the Middle of a Pandemic?? 🦠🦠🦠🦠🦠
I Pledge Allegiance To The Drag': Alexandria Ocasio-Cortez to Be Guest Judge…

---------

False
Tonight at 11PM EST / 8PM PST @Diddy will host a virtual town hall joined by Angela Rye, Congressperson Alexandria Ocasio-Cortez, Killer Mike, Charles Blow, Van Jones and more https://t.co/KnU9qKDpI4

---------

False
RT @DianeLong22: '🦠🦠🦠🦠AOC in the Middle of a Pandemic?? 🦠🦠🦠🦠🦠
I Pledge Allegiance To The Drag': Alexandria Ocasio-Cortez to Be Guest Judge…

--

# Guides

https://towardsdatascience.com/extracting-twitter-data-pre-processing-and-sentiment-analysis-using-python-3-0-7192bd8b47cf

https://towardsdatascience.com/tweepy-for-beginners-24baf21f2c25