<a href="https://colab.research.google.com/github/joshuacalloway/dsc540groupproject/blob/main/StartingTrumpTweets.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Using NLP on Trump's Tweets
- Joshua Calloway
- DSC 540, Fall Quarter - DePaul


# Motivation
What problem are you tackling, and what's the setting you're considering? What data are you working on? Did anything change from the proposal regarding data, objectives, and methods that you will apply?


We are looking Trump's tweets and applying NLP to see if we can determine the following
- Sentiment Analysis
- Subjectivity Analysis ( How objective or not are the tweets )
- @readDonaldTrump contains tweets by Trump and also his publicity staff.  See if we can determine which are by Trump

We are using tweets from thetrumparchive, which has about 50,0000 tweets and also another set with 1000 of Trump's latest tweets.  This is in alignment with the project proposal to use NLP on Trump

# We use Azure Predictive Analysis to create the GroundTruth sentiment labels
- We get either positive, negative, or neutral

# We then apply the following methods to build various sentiment classifiers and measure the accuracy against the GroundTruth test data

## 1. Tweepy and TextBlob for analyzing sentiment
## 2. LogisticsRegression using the Tweepy/TextBlob to fit the model
## 3. NLTK Toolkit and NaiveBayesClassifier using the Tweepy/TextBlob labels to fit the model
## 4. FastText with single label sentiment using the TweepyTextBlob labels to fit the model

# The Data, Trump Tweets ( either 1000 or 55,0000 tweets )
Here we use thetrumparchive to either fetch 1000 or larger set of 55,000 tweets.  The tweets come back as JSON in format of
<code>
{
  id: 1
  text: 'Lets win Michigan'
  isRetweet: True
  isDeleted: False
  device: iPhone
  favorites: 323,
  retweets: 2
  date: 2020-11-02
}
</code>

In [141]:
import urllib.request, json
from sklearn.model_selection import train_test_split
from pandas import DataFrame

In [142]:
# If LargeData is True, then we fetch 55,0000 tweets
def fetch_data(largeData=False):
    if largeData:
        with open('tweets_11-06-2020.json') as f:
            data = json.load(f)  # fetch 50,0000 tweets
    else:
        with urllib.request.urlopen("https://www.thetrumparchive.com/latest-tweets") as url:
            data = json.loads(url.read().decode())
    return DataFrame(data)

# get_tweet_text = lambda tweet : tweet['text']


### We can either fetch 1000 or 55,000 tweets by switching the flag largeData

In [143]:
# we r interested in the text for NLP
data = fetch_data(largeData=False)
data.head()

Unnamed: 0,id,text,isRetweet,isDeleted,device,favorites,retweets,date
0,1329963571250335744,https://t.co/YHscjY6G8t,False,False,Twitter for iPhone,76500,22413,2020-11-21T01:43:19.000Z
1,1329963296854847492,https://t.co/OLZnCJq93Y,False,False,Twitter for iPhone,55759,15486,2020-11-21T01:42:14.000Z
2,1329963239170564098,https://t.co/cwOQLhQNFq,False,False,Twitter for iPhone,89705,24554,2020-11-21T01:42:00.000Z
3,1329871920607744001,RT @WhiteHouse: LIVE: President @realDonaldTru...,True,False,Twitter for iPhone,0,16786,2020-11-20T19:39:08.000Z
4,1329871776889925636,"...Why won’t they do it, and why are they so f...",False,False,Twitter for iPhone,140901,23832,2020-11-20T19:38:34.000Z


# We split the data into training, validation and test


In [144]:
# we create a Y with unknown value
data['sentiment'] = 'unknown'

In [145]:
from sklearn.model_selection import train_test_split

X = data['text']
y = data['sentiment']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10, random_state=555)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.30, random_state=555)

# We use Azure Predictive Analysis to create the ground truth sentiment

In [146]:
import numpy as np

from azure.core.credentials import AzureKeyCredential
from azure.ai.textanalytics import TextAnalyticsClient

credential = AzureKeyCredential("cb61d607e5c8402b9742b8aa40207593")
text_analytics_client = TextAnalyticsClient(endpoint="https://trumptweetanalysissentiment.cognitiveservices.azure.com/", credential=credential)
    
def call_azure(list_text_only_ten_items):
    response = text_analytics_client.analyze_sentiment(list_text_only_ten_items)
    successful_responses = [doc for doc in response if not doc.is_error]
    return list_text_only_ten_items, list(map(lambda x: x['sentiment'], successful_responses))
    
# treat mixed and neutral sentiment as neutral
def combine_mixed_neutral(sentiments):
    converted = []
    for item in sentiments:
        newitem = 'neutral' if item == 'mixed' else item
        converted.append(newitem)
    return converted
           
# this is ground truth.  Using Azure sentiment
def calculate_groundtruth_sentiment(list_of_texts):
    sublists = np.split(np.array(list_of_texts.tolist()), list_of_texts.size / 10)
    retvalues = list(map(lambda ls: call_azure(list(ls)), sublists))
    sentiments = []
    for item in retvalues:
        sentiments.append(item[1])    
    sentiments = [item for items in sentiments for item in items]
    return list_of_texts, combine_mixed_neutral(sentiments)

In [147]:
X_test, Y_test_groundtruth = calculate_groundtruth_sentiment(X_test)


# Now that we have groundtruth sentiment labels on test tweets, we will then try various methods and train on the test data.

# 1. Let's try tweepy and TextBlob to add Sentiment to each Tweet


## B. Let's use a tweepy and TextBlob to add Sentiment to each Tweet

Two blogs that use tweepy and TextBlob can be found at 
- https://www.earthdatascience.org/courses/use-data-open-source-python/intro-to-apis/analyze-tweet-sentiment-in-python/
- https://medium.com/better-programming/twitter-sentiment-analysis-15d8892c0082



# C. We compute Sentiment of Tweet using TextBlob
- polarity is whether or not the tweet is positive, negative, or neutral ( scaled from 1 to -1 )

In [148]:
# We r going to use tweepy and TextBlob for tweets
import tweepy as tw
from textblob import TextBlob

# Create a function to get the subjectivity
def getSubjectivity(text):
    return TextBlob(text).sentiment.subjectivity

# Create a function to get the polarity
def getPolarity(text):
    return  TextBlob(text).sentiment.polarity

# We eliminate words less then 3 characters long and standardize all words to lowercase
def filter_words(words):
    words_filtered = [e.lower() for e in words.split() if len(e) >= 3]
    return words_filtered


# return neutral if small
def calculateSentiment(text, neutralCutoff = 0.05):
    polarity = getPolarity(text)
    if abs(polarity) < neutralCutoff:
        return 'neutral'
    if polarity > 0:
        return "positive"
    else:
        return "negative"
    

In [149]:
y_test = X_test.apply(calculateSentiment)

In [150]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

print(f'accuracy_score is {accuracy_score(Y_test_groundtruth, y_test)}')


accuracy_score is 0.6


# Here we get 0.6 accuracy by guessing the neutralCutoff = 0.05.  Let's try to see if we can do better by trying different neutralCutoffs

In [151]:
from numpy import arange

bestCutoff = 0.05
bestAccuracy = 0.6

for i in arange(0.0, 0.5, 0.02):
    y_test_i = X_test.apply(calculateSentiment, neutralCutoff=i)
    accuracy_i = accuracy_score(Y_test_groundtruth, y_test_i)
    if accuracy_i > bestAccuracy:
        bestAccuracy = accuracy_i
        bestCutoff = i

print(f'bestAccuracy is at {bestAccuracy} with neutralCutoff at {bestCutoff}')


bestAccuracy is at 0.67 with neutralCutoff at 0.22


In [152]:
# let's recalculate the y_test with best neutralCutoff at 0.22
y_test = X_test.apply(calculateSentiment, neutralCutoff=0.22)

In [153]:
print(f'confusion matrix is \n{confusion_matrix(Y_test_groundtruth, y_test)}')

confusion matrix is 
[[ 4 15  2]
 [ 2 48  4]
 [ 0 10 15]]


In [154]:
print(f'classification report is \n{classification_report(Y_test_groundtruth, y_test)}')

classification report is 
              precision    recall  f1-score   support

    negative       0.67      0.19      0.30        21
     neutral       0.66      0.89      0.76        54
    positive       0.71      0.60      0.65        25

    accuracy                           0.67       100
   macro avg       0.68      0.56      0.57       100
weighted avg       0.67      0.67      0.63       100



# 2. Let's try LogisticRegression

In [155]:
import pandas as pd

y = np.array([calculateSentiment(xi, neutralCutoff=0.22) for xi in X])

In [156]:
y

array(['neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'negative',
       'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral',
       'positive', 'neutral', 'positive', 'positive', 'neutral',
       'positive', 'neutral', 'neutral', 'neutral', 'neutral', 'negative',
       'positive', 'neutral', 'neutral', 'positive', 'neutral', 'neutral',
       'neutral', 'neutral', 'positive', 'neutral', 'positive', 'neutral',
       'neutral', 'positive', 'positive', 'positive', 'positive',
       'positive', 'positive', 'neutral', 'neutral', 'positive',
       'positive', 'neutral', 'positive', 'negative', 'neutral',
       'neutral', 'negative', 'neutral', 'neutral', 'positive',
       'positive', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral',
       'positive', 'positive', 'positive', 'neutral', 'neutral',
       'negative', 'positive', 'neutral', 'neutral', 'neutral', 'neutral',
       'positive', 'positive', 'neutral', 'neutral', 'neutral',
       'positive', 'neutr

In [157]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(
    analyzer = 'word',
    lowercase = False,
)
features = vectorizer.fit_transform(
    data['text']
)
features_nd = features.toarray() # for easy usage

In [158]:
# https://www.twilio.com/blog/2017/12/sentiment-analysis-scikit-learn.html
from sklearn.linear_model import LogisticRegression
log_model = LogisticRegression()

In [159]:
from sklearn.model_selection import train_test_split

X_train_logistic, X_test_logistic, y_train_logistic, y_test_logistic  = train_test_split(
        features_nd, 
        y,
        train_size=0.90, 
        random_state=1234)

In [160]:
log_model = log_model.fit(X=X_train_logistic, y=y_train_logistic)

In [161]:
y_pred = log_model.predict(X_test_logistic)

In [162]:
print(f'accuracy_score is {accuracy_score(Y_test_groundtruth, y_pred)}')


accuracy_score is 0.46


In [163]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

    negative       0.00      0.00      0.00         6
     neutral       0.74      0.86      0.80        73
    positive       0.18      0.10      0.12        21

    accuracy                           0.65       100
   macro avg       0.31      0.32      0.31       100
weighted avg       0.58      0.65      0.61       100



# 3. We can try the NLTK toolkit

In [164]:
import nltk

def format_sentence(sent):
    return({word: True for word in nltk.word_tokenize(sent)})

print(nltk.word_tokenize("The cat is very cute"))

def format_sentence_with_sentement(sent):
    formatted = format_sentence(sent)
    sentement = calculateSentiment(sent, neutralCutoff = 0.22)
    return [formatted, sentement]

['The', 'cat', 'is', 'very', 'cute']


In [165]:
format_sentence_with_sentement("Stars are great!")

[{'Stars': True, 'are': True, 'great': True, '!': True}, 'positive']

In [166]:
format_sentence_with_sentement("Rats smell very bad")

[{'Rats': True, 'smell': True, 'very': True, 'bad': True}, 'negative']

In [167]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /Users/jc487/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [168]:
X

0                                https://t.co/YHscjY6G8t
1                                https://t.co/OLZnCJq93Y
2                                https://t.co/cwOQLhQNFq
3      RT @WhiteHouse: LIVE: President @realDonaldTru...
4      ...Why won’t they do it, and why are they so f...
                             ...                        
995    RT @chefjclark: Amen ⁦@bkirkland7⁩! https://t....
996      https://t.co/gsFSghkmdM https://t.co/zNoPFsTnn3
997    Big problems and discrepancies with Mail In Ba...
998    RT @realDonaldTrump: The Fake News Media is ri...
999    RT @realDonaldTrump: Joe Biden called me Georg...
Name: text, Length: 1000, dtype: object

In [169]:
X_nltk = X.apply(format_sentence_with_sentement)

In [170]:
X_nltk.head()

0    [{'https': True, ':': True, '//t.co/YHscjY6G8t...
1    [{'https': True, ':': True, '//t.co/OLZnCJq93Y...
2    [{'https': True, ':': True, '//t.co/cwOQLhQNFq...
3    [{'RT': True, '@': True, 'WhiteHouse': True, '...
4    [{'...': True, 'Why': True, 'won': True, '’': ...
Name: text, dtype: object

In [171]:
training = X_nltk[:int((.9)*len(X_nltk))]
test =  X_nltk[int((.1)*len( X_nltk)):] 


In [172]:
from nltk.classify import NaiveBayesClassifier

classifier = NaiveBayesClassifier.train(training)
classifier.show_most_informative_features()

Most Informative Features
                     WIN = True           positi : neutra =     28.6 : 1.0
                    Fake = True           negati : positi =     26.1 : 1.0
                 Corrupt = True           negati : neutra =     25.0 : 1.0
                decision = True           negati : neutra =     25.0 : 1.0
                  things = True           negati : neutra =     25.0 : 1.0
                   Great = True           positi : neutra =     24.4 : 1.0
                   place = True           negati : neutra =     19.3 : 1.0
                       * = True           negati : neutra =     17.9 : 1.0
                  Andrew = True           negati : neutra =     17.9 : 1.0
                     FOR = True           negati : neutra =     17.9 : 1.0


In [173]:
example1 = "America is great"

print(classifier.classify(format_sentence(example1)))

positive


In [174]:
example1 = "Mail in ballots are fraudelent"

print(classifier.classify(format_sentence(example1)))

negative


In [175]:
from nltk.classify.util import accuracy
print(accuracy(classifier, test))

0.45555555555555555


## Let's see the accuracy vs Ground Truth

In [203]:

def nltk_predict(sent):
    return classifier.classify(format_sentence(sent))

y_pred_nltk = X_test.apply(nltk_predict)

In [205]:
print(f'accuracy_score is {accuracy_score(Y_test_groundtruth, y_pred_nltk)}')


accuracy_score is 0.63


In [207]:
print(classification_report(Y_test_groundtruth, y_pred_nltk))


              precision    recall  f1-score   support

    negative       0.48      0.94      0.64        31
     neutral       0.96      0.57      0.72        42
    positive       0.67      0.37      0.48        27

    accuracy                           0.63       100
   macro avg       0.70      0.63      0.61       100
weighted avg       0.73      0.63      0.63       100



# 4. We train a FaceBook FastText model to build a classifier

# Let try cleaning the Tweets to see if we can get a better ground truth

In [177]:
# Let's try to get better accuracy by cleaning the tweets text
# Define some cleaning methods for the Tweet Text
# Create a function to clean the tweets
import re
import string

def remove_punct(text):
    text  = "".join([char for char in text if char not in string.punctuation])
    text = re.sub('[0-9]+', '', text)
    return text

def remove_doublespace(text):
    text = re.sub('  +', ' ', text)
    return text

In [178]:
remove_doublespace('    ') == ' '

True

In [179]:
remove_punct("Hello this is josh!!!") == "Hello this is josh"

True

In [180]:
data['text'] = data['text'].apply(remove_punct)
data['text'] = data['text'].apply(remove_doublespace)

data['sentiment'] = data['text'].apply(calculateSentiment)
X = data['text']
y = data['sentiment']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10, random_state=555)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.30, random_state=555)


In [181]:
X_test, Y_test_groundtruth = calculate_groundtruth_sentiment(X_test)


In [182]:
len(X_test)

100

In [183]:
len(y_test)

100

In [184]:
len(Y_test_groundtruth)

100

In [185]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

    negative       0.50      0.12      0.20        16
     neutral       0.47      0.85      0.61        47
    positive       0.27      0.08      0.12        37

    accuracy                           0.45       100
   macro avg       0.41      0.35      0.31       100
weighted avg       0.40      0.45      0.36       100



In [186]:
X_test[357]

'Pennsylvania Party Leadership votes are this week I hope they pick very tough and smart fighters We will WIN'

In [187]:
mystr = ''
mystr

''

In [188]:
X_test[357] == mystr

False

In [189]:
not ''

True

In [190]:
import pandas as pd

d = {'text': ['hello', '', ' ']}
df = pd.DataFrame(data=d)

In [191]:
df

Unnamed: 0,text
0,hello
1,
2,


In [192]:
blank = df[(df['text'] == '') | (df['text'].str.isspace())]


In [193]:
blank

Unnamed: 0,text
1,
2,


In [194]:
data['text']

0                                       httpstcoYHscjYGt
1                                       httpstcoOLZnCJqY
2                                     httpstcocwOQLhQNFq
3      RT WhiteHouse LIVE President realDonaldTrump g...
4      Why won’t they do it and why are they so fast ...
                             ...                        
995     RT chefjclark Amen ⁦bkirkland⁩ httpstcohWKODkDmT
996                 httpstcogsFSghkmdM httpstcozNoPFsTnn
997    Big problems and discrepancies with Mail In Ba...
998    RT realDonaldTrump The Fake News Media is ridi...
999    RT realDonaldTrump Joe Biden called me George ...
Name: text, Length: 1000, dtype: object

In [195]:
y

0       neutral
1       neutral
2       neutral
3      positive
4      negative
         ...   
995     neutral
996     neutral
997     neutral
998    negative
999     neutral
Name: sentiment, Length: 1000, dtype: object

In [196]:
print(accuracy_score(y_test, y_pred))

0.45


In [197]:
print(accuracy_score(Y_test_groundtruth, y_pred))

0.39


In [198]:
len(Y_test_groundtruth)

100

In [199]:
len(y_pred)

100

In [200]:
len(X_test)

100

In [201]:
y_pred

array(['neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral',
       'positive', 'neutral', 'neutral', 'neutral', 'positive', 'neutral',
       'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral',
       'neutral', 'neutral', 'neutral', 'neutral', 'negative', 'neutral',
       'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral',
       'neutral', 'neutral', 'positive', 'neutral', 'neutral', 'neutral',
       'neutral', 'neutral', 'negative', 'positive', 'neutral', 'neutral',
       'neutral', 'neutral', 'positive', 'neutral', 'neutral', 'neutral',
       'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral',
       'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral',
       'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral',
       'positive', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral',
       'positive', 'positive', 'neutral', 'positive', 'neutral',
       'neutral', 'neutral', 'neutral', 'neutral', 

In [202]:
Y_test_groundtruth

['positive',
 'negative',
 'negative',
 'positive',
 'positive',
 'positive',
 'negative',
 'neutral',
 'positive',
 'neutral',
 'negative',
 'neutral',
 'negative',
 'neutral',
 'neutral',
 'positive',
 'neutral',
 'positive',
 'positive',
 'neutral',
 'neutral',
 'positive',
 'neutral',
 'positive',
 'positive',
 'positive',
 'neutral',
 'positive',
 'neutral',
 'negative',
 'negative',
 'negative',
 'neutral',
 'neutral',
 'positive',
 'positive',
 'negative',
 'negative',
 'negative',
 'neutral',
 'positive',
 'positive',
 'negative',
 'neutral',
 'positive',
 'negative',
 'positive',
 'neutral',
 'neutral',
 'positive',
 'neutral',
 'negative',
 'negative',
 'negative',
 'negative',
 'negative',
 'neutral',
 'positive',
 'negative',
 'neutral',
 'neutral',
 'neutral',
 'neutral',
 'positive',
 'neutral',
 'neutral',
 'negative',
 'positive',
 'neutral',
 'neutral',
 'neutral',
 'neutral',
 'negative',
 'neutral',
 'neutral',
 'negative',
 'neutral',
 'positive',
 'negative',
 'neg