<a href="https://colab.research.google.com/github/joshuacalloway/dsc540groupproject/blob/main/StartingTrumpTweets.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Using NLP on Trump's Tweets
- Joshua Calloway
- DSC 540, Fall Quarter - DePaul


# Motivation
What problem are you tackling, and what's the setting you're considering? What data are you working on? Did anything change from the proposal regarding data, objectives, and methods that you will apply?


We are looking at Trump's tweets and applying NLP to see if we can determine the following:
- Sentiment analysis
- Subjectivity analysis (how objective or subjective the tweets are)
- Authorship: @realDonaldTrump contains tweets by Trump and also by his publicity staff, so we want to see if we can determine which are by Trump

We are using tweets from thetrumparchive, which has about 50,000 tweets, and also another set with 1000 of Trump's latest tweets. This is in alignment with the project proposal to use NLP on Trump's tweets.

# Next steps [10]: Given your preliminary results, what are the next steps that you're considering?
- Improve accuracy through better tweet cleaning and/or by handling tweet-specific language such as emojis or shorthand
- Use GPT-2 for tweet generation
- Figure out a way to measure accuracy and use cross-validation to come up with better models
- Tune the hyperparameters of the models
- Apply the models and methods to the larger dataset of about 50,000 Trump tweets

## A. Fetch Trump's Tweets
Here we use thetrumparchive to fetch either the latest 1000 tweets or the larger set of about 55,000 tweets. The tweets come back as JSON in the following format:
<code>
{
  "id": 1,
  "text": "Lets win Michigan",
  "isRetweet": true,
  "isDeleted": false,
  "device": "iPhone",
  "favorites": 323,
  "retweets": 2,
  "date": "2020-11-02"
}
</code>

In [117]:
import urllib.request, json
from sklearn.model_selection import train_test_split
from pandas import DataFrame

In [118]:
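# If largeData is True, load the larger archive (about 55,000 tweets) from a local file;
# otherwise fetch the latest 1000 tweets from thetrumparchive.
def fetch_data(largeData=False):
    if largeData:
        with open('tweets_11-06-2020.json') as f:
            data = json.load(f)  # load the full archive from the local JSON file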
    else:
        with urllib.request.urlopen("https://www.thetrumparchive.com/latest-tweets") as url:
            data = json.loads(url.read().decode())
    return DataFrame(data)

# get_tweet_text = lambda tweet : tweet['text']


In [119]:
# We are interested in the tweet text for NLP
data = fetch_data(largeData=False)
data.head()

| | id | text | isRetweet | isDeleted | device | favorites | retweets | date |
|---|---|---|---|---|---|---|---|---|
| 0 | 1329233502139715586 | Look at this in Wisconsin! A day AFTER the ele... | False | False | Twitter for iPhone | 199249 | 67689 | 2020-11-19T01:22:17.000Z |
| 1 | 1329200307453239297 | RT @AndrewHittGOP: BREAKING: WI Elections Comm... | True | False | Twitter for iPhone | 0 | 13679 | 2020-11-18T23:10:23.000Z |
| 2 | 1329158254660898816 | RT @aricnesbitt: This afternoon, I joined my c... | True | False | Twitter for iPhone | 0 | 17227 | 2020-11-18T20:23:17.000Z |
| 3 | 1329150897641951233 | Thank you @DarrellIssa, so nice! \nhttps://t.c... | False | False | Twitter for iPhone | 103783 | 20416 | 2020-11-18T19:54:03.000Z |
| 4 | 1329145487325339649 | RT @realDonaldTrump: More reports of voting ma... | True | False | Twitter for iPhone | 0 | 29076 | 2020-11-18T19:32:33.000Z |


In [120]:
data['text']



0      Look at this in Wisconsin! A day AFTER the ele...
1      RT @AndrewHittGOP: BREAKING: WI Elections Comm...
2      RT @aricnesbitt: This afternoon, I joined my c...
3      Thank you @DarrellIssa, so nice! \nhttps://t.c...
4      RT @realDonaldTrump: More reports of voting ma...
                             ...                        
995    RT @TeamTrump: President @realDonaldTrump: How...
996    RT @TeamTrump: President @realDonaldTrump: We ...
997    RT @WhiteHouse: In Maine, Medicare health plan...
998      https://t.co/gsFSgh2KPc https://t.co/KTe3etEKhs
999    RT @TeamTrump: President @realDonaldTrump: In ...
Name: text, Length: 1000, dtype: object

In [121]:
blank = data[(data['text'] == '') | (data['text'].str.isspace())]



In [122]:
(data['text'] == '').sum()


0

In [123]:
len(blank)

0

In [124]:
len(data['text'])

1000

## B. Use tweepy and TextBlob to add sentiment to each tweet

Two blog posts that use tweepy and TextBlob:
- https://www.earthdatascience.org/courses/use-data-open-source-python/intro-to-apis/analyze-tweet-sentiment-in-python/
- https://medium.com/better-programming/twitter-sentiment-analysis-15d8892c0082

## C. Compute the sentiment of each tweet using TextBlob
- Polarity indicates whether the tweet is positive, negative, or neutral (scaled from -1 to 1)

In [125]:
# We are going to use tweepy and TextBlob for the tweets
import tweepy as tw
from textblob import TextBlob

# Create a function to get the subjectivity
def getSubjectivity(text):
    return TextBlob(text).sentiment.subjectivity

# Create a function to get the polarity
def getPolarity(text):
    return  TextBlob(text).sentiment.polarity

# We eliminate words shorter than 3 characters and standardize all words to lowercase
def filter_words(words):
    words_filtered = [e.lower() for e in words.split() if len(e) >= 3]
    return words_filtered


# Label a tweet as neutral when the polarity magnitude is below 0.05, otherwise label it by sign
def calculateSentiment(text):
    polarity = getPolarity(text)
    if abs(polarity) < 0.05:
        return 'neutral'
    if polarity > 0:
        return "positive"
    else:
        return "negative"
    
    
# Add a 'sentiment' column with the TextBlob-based label for each tweet
data['sentiment'] = data['text'].apply(calculateSentiment)


# Show the dataframe with the new 'sentiment' column
data

| | id | text | isRetweet | isDeleted | device | favorites | retweets | date | sentiment |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1329233502139715586 | Look at this in Wisconsin! A day AFTER the ele... | False | False | Twitter for iPhone | 199249 | 67689 | 2020-11-19T01:22:17.000Z | negative |
| 1 | 1329200307453239297 | RT @AndrewHittGOP: BREAKING: WI Elections Comm... | True | False | Twitter for iPhone | 0 | 13679 | 2020-11-18T23:10:23.000Z | neutral |
| 2 | 1329158254660898816 | RT @aricnesbitt: This afternoon, I joined my c... | True | False | Twitter for iPhone | 0 | 17227 | 2020-11-18T20:23:17.000Z | positive |
| 3 | 1329150897641951233 | Thank you @DarrellIssa, so nice! \nhttps://t.c... | False | False | Twitter for iPhone | 103783 | 20416 | 2020-11-18T19:54:03.000Z | positive |
| 4 | 1329145487325339649 | RT @realDonaldTrump: More reports of voting ma... | True | False | Twitter for iPhone | 0 | 29076 | 2020-11-18T19:32:33.000Z | positive |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | 1320444583638163457 | RT @TeamTrump: President @realDonaldTrump: How... | True | False | Twitter for iPhone | 0 | 6504 | 2020-10-25T19:18:16.000Z | neutral |
| 996 | 1320444583717777410 | RT @TeamTrump: President @realDonaldTrump: We ... | True | False | Twitter for iPhone | 0 | 6298 | 2020-10-25T19:18:16.000Z | negative |
| 997 | 1320444583814156288 | RT @WhiteHouse: In Maine, Medicare health plan... | True | False | Twitter for iPhone | 0 | 7706 | 2020-10-25T19:18:16.000Z | negative |
| 998 | 1320436393814806529 | https://t.co/gsFSgh2KPc https://t.co/KTe3etEKhs | False | False | Twitter for iPhone | 64516 | 11361 | 2020-10-25T18:45:43.000Z | neutral |


# We now split the data into training, validation, and test sets

In [126]:
from sklearn.model_selection import train_test_split

X = data['text']
y = data['sentiment']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10, random_state=555)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.30, random_state=555)


In [127]:
len(X_train)

630

In [128]:
len(X_val)

270
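
The sizes above follow from applying the two splits in sequence. As a rough check, the arithmetic can be sketched in plain Python. `split_sizes` is a hypothetical helper, and it assumes that scikit-learn rounds the held-out share up to a whole number of samples and leaves the remainder on the training side.

```python
import math

def split_sizes(n_samples, test_pct):
    """Return (n_kept, n_held_out) for a split that holds out test_pct percent.

    Assumption: the held-out count is rounded up and the remainder is kept,
    which is how train_test_split is understood to treat a float test_size.
    """
    n_held_out = math.ceil(n_samples * test_pct / 100)
    return n_samples - n_held_out, n_held_out

n_total = 1000
n_rest, n_test = split_sizes(n_total, 10)   # first split: hold out 10% for test
n_train, n_val = split_sizes(n_rest, 30)    # second split: 30% of the rest for validation

print(n_train, n_val, n_test)  # 630 270 100
```

The effective proportions are therefore 63% training, 27% validation, and 10% test.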

In [129]:
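import os

import numpy as np

from azure.core.credentials import AzureKeyCredential
from azure.ai.textanalytics import TextAnalyticsClient

# Read the Azure key from the environment rather than hard-coding it in the notebook.
# AZURE_TEXT_ANALYTICS_KEY is an assumed variable name; set it before running this cell.
credential = AzureKeyCredential(os.environ["AZURE_TEXT_ANALYTICS_KEY"])
text_analytics_client = TextAnalyticsClient(endpoint="https://trumptweetanalysissentiment.cognitiveservices.azure.com/", credential=credential)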
    
def call_azure(list_text_only_ten_items):
    response = text_analytics_client.analyze_sentiment(list_text_only_ten_items)
    # Documents that Azure could not process are skipped, so the returned list of
    # sentiments can be shorter than the input list.
    successful_responses = [doc for doc in response if not doc.is_error]
    return list_text_only_ten_items, [doc.sentiment for doc in successful_responses]
    
# treat mixed and neutral sentiment as neutral
def combine_mixed_neutral(sentiments):
    converted = []
    for item in sentiments:
        newitem = 'neutral' if item == 'mixed' else item
        converted.append(newitem)
    return converted
           
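# Use Azure's sentiment labels as the ground truth.
def calculate_groundtruth_sentiment(list_of_texts):
    # Azure is called with batches of 10 texts. np.split needs a whole number of
    # sections and requires the number of texts to be a multiple of 10.
    sublists = np.split(np.array(list_of_texts.tolist()), list_of_texts.size // 10)
    retvalues = list(map(lambda ls: call_azure(list(ls)), sublists))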
    sentiments = []
    print(f'retvalues is {len(retvalues)}')
    #print(f'retvalues = {retvalues}')
    for item in retvalues:
        print(f'item is {item}')
        sentiments.append(item[1])    
    sentiments = [item for items in sentiments for item in items]
    print(f'len of sentiments is {len(sentiments)}')
    return list_of_texts, combine_mixed_neutral(sentiments)
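
`np.split` only works here when the number of texts divides evenly into groups of 10. A minimal sketch of a batching helper that also handles a final, shorter batch is shown below. `batched` is a hypothetical name, and the batch size of 10 is taken from the parameter name `list_text_only_ten_items`, not from a confirmed service limit.

```python
def batched(items, batch_size=10):
    """Yield consecutive slices of items, each with at most batch_size elements."""
    items = list(items)
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# 23 placeholder texts are split into batches of 10, 10, and 3.
texts = [f"tweet {i}" for i in range(23)]
batches = list(batched(texts))
print([len(batch) for batch in batches])  # [10, 10, 3]
```

Each batch could then be passed to `call_azure` in place of the `np.split` sublists.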


In [134]:
X_test, Y_test_groundtruth = calculate_groundtruth_sentiment(X_test)


retvalues is 10
item is (['https://t.co/EfYcyJtoJz', 'No way! https://t.co/Dwvb57mgMz', 'WHAT IS THIS ALL ABOUT? https://t.co/6487pYLZnL', '.@JakeLaTurner will be a phenomenal Congressman for the people of Kansas! He will help us Lower your Taxes, Support our Brave Law Enforcement, Build the Wall, and Protect and Defend your Second Amendment. Jake has my Complete and Total Endorsement! #KS02 https://t.co/Nzpbh7DqUN', 'THANK YOU! #MAGA https://t.co/EEmSU6uPAy', '“A Biden win would mean the end of Fracking in Pennsylvania, Texas, and everywhere else. Millions of jobs would be lost, and Energy prices would soar.” @OANN  And we would no longer be Energy Independent!!!', 'Flip Michigan back to TRUMP. Detroit, not surprisingly, has tremendous problems! https://t.co/RHhuoSMICg', 'Nate Simington, a very smart and qualified individual, is having his Senate hearing today. Republicans will hopefully confirm him to the FCC ASAP! We need action NOW on this very important nomination!!\xa0@SenatorWic

In [135]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

print(f'accuracy_score is {accuracy_score(Y_test_groundtruth, y_test)}')


accuracy_score is 0.65


In [136]:
print(f'confusion matrix is \n{confusion_matrix(Y_test_groundtruth, y_test)}')

confusion matrix is 
[[12 13  0]
 [ 1 40 13]
 [ 1  7 13]]


In [137]:
print(f'classification report is \n{classification_report(Y_test_groundtruth, y_test)}')

classification report is 
              precision    recall  f1-score   support

    negative       0.86      0.48      0.62        25
     neutral       0.67      0.74      0.70        54
    positive       0.50      0.62      0.55        21

    accuracy                           0.65       100
   macro avg       0.67      0.61      0.62       100
weighted avg       0.68      0.65      0.65       100
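
The figures in the classification report can be recomputed by hand from the confusion matrix above, which helps when reading it. The sketch below assumes scikit-learn's convention that rows are the true labels (here the Azure ground truth) and columns are the predicted labels (here TextBlob), with classes in alphabetical order.

```python
labels = ["negative", "neutral", "positive"]
# Confusion matrix copied from the output above: rows = Azure, columns = TextBlob.
cm = [
    [12, 13, 0],
    [1, 40, 13],
    [1, 7, 13],
]

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(len(labels))) / total

metrics = {}
for i, label in enumerate(labels):
    true_count = sum(cm[i])                      # row sum: tweets Azure labeled this way
    predicted_count = sum(row[i] for row in cm)  # column sum: tweets TextBlob labeled this way
    recall = cm[i][i] / true_count
    precision = cm[i][i] / predicted_count
    metrics[label] = (precision, recall, true_count)
    print(f"{label}: precision={precision:.2f} recall={recall:.2f} support={true_count}")

print(f"accuracy={accuracy:.2f}")
```

Read this way, 13 of the 25 tweets that Azure labeled negative were labeled neutral by TextBlob, which accounts for the low negative recall of 0.48.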



# Let's try cleaning the tweets to see if we can get better agreement with the ground truth

In [181]:
# Let's try to get better accuracy by cleaning the tweet text.
# Define some cleaning functions for the tweet text.
import re
import string

def remove_punct(text):
    # Drop ASCII punctuation, then drop digits (despite the name, numbers are removed too)
    text = "".join([char for char in text if char not in string.punctuation])
    text = re.sub(r'[0-9]+', '', text)
    return text

def remove_doublespace(text):
    # Collapse runs of two or more spaces into a single space
    text = re.sub(r'  +', ' ', text)
    return text

In [182]:
remove_doublespace('    ') == ' '

True

In [183]:
remove_punct("Hello this is josh!!!") == "Hello this is josh"

True

In [184]:
data['text'] = data['text'].apply(remove_punct)
data['text'] = data['text'].apply(remove_doublespace)

data['sentiment'] = data['text'].apply(calculateSentiment)
X = data['text']
y = data['sentiment']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10, random_state=555)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.30, random_state=555)


In [185]:
X_test, Y_test_groundtruth = calculate_groundtruth_sentiment(X_test)


retvalues is 10
item is (['httpstcoEfYcyJtoJz', 'No way httpstcoDwvbmgMz', 'WHAT IS THIS ALL ABOUT httpstcopYLZnL', 'JakeLaTurner will be a phenomenal Congressman for the people of Kansas He will help us Lower your Taxes Support our Brave Law Enforcement Build the Wall and Protect and Defend your Second Amendment Jake has my Complete and Total Endorsement KS httpstcoNzpbhDqUN', 'THANK YOU MAGA httpstcoEEmSUuPAy', '“A Biden win would mean the end of Fracking in Pennsylvania Texas and everywhere else Millions of jobs would be lost and Energy prices would soar” OANN And we would no longer be Energy Independent', 'Flip Michigan back to TRUMP Detroit not surprisingly has tremendous problems httpstcoRHhuoSMICg', 'Nate Simington a very smart and qualified individual is having his Senate hearing today Republicans will hopefully confirm him to the FCC ASAP We need action NOW on this very important nomination\xa0SenatorWicker MarshaBlackburn senatemajldr', '¡Mi AmericanDreamPlan es una promesa p

In [186]:
len(X_test)

100

In [187]:
len(y_test)

100

In [188]:
len(Y_test_groundtruth)

100

In [189]:
print(f'accuracy_score is {accuracy_score(Y_test_groundtruth, y_test)}')


accuracy_score is 0.64
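
The cleaned batch printed above still contains tokens such as `httpstcoEfYcyJtoJz`, because `remove_punct` strips the punctuation out of each link and leaves the letters behind. One possible refinement, sketched here as an assumption and not something the notebook does, is to drop URLs before removing punctuation. `clean_tweet` and the URL pattern are illustrative.

```python
import re
import string

URL_PATTERN = re.compile(r"https?://\S+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean_tweet(text):
    """Remove links, ASCII punctuation, and digits, then normalize whitespace."""
    text = URL_PATTERN.sub(" ", text)          # drop links before punctuation is stripped
    text = text.translate(PUNCT_TABLE)         # same punctuation set as remove_punct
    text = re.sub(r"[0-9]+", "", text)         # remove_punct also removes digits
    return re.sub(r"\s+", " ", text).strip()   # collapse whitespace left behind

print(clean_tweet("No way! https://t.co/Dwvb57mgMz"))           # No way
print(clean_tweet("THANK YOU! #MAGA https://t.co/EEmSU6uPAy"))  # THANK YOU MAGA
```

A tweet that consists only of a link becomes an empty string under this scheme, so the blank-text check used earlier would be worth rerunning before the texts are sent to a sentiment service.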


In [190]:
X_test

357                                   httpstcoEfYcyJtoJz
81                               No way httpstcoDwvbmgMz
436                WHAT IS THIS ALL ABOUT httpstcopYLZnL
657    JakeLaTurner will be a phenomenal Congressman ...
654                     THANK YOU MAGA httpstcoEEmSUuPAy
                             ...                        
742                              Rigged httpstcoKXSIrouV
34     The only thing secure about our Election was t...
132    The Fake News Media hardly even discussed the ...
530               Landing in North Carolina See you soon
886    Thank you to the most incredible people on ear...
Name: text, Length: 100, dtype: object

In [178]:
X_test[357]

'httpstcoEfYcyJtoJz'

In [179]:
mystr = ''
mystr

''

In [164]:
X_test[357] == mystr

False

In [149]:
not ''

True

In [150]:
import pandas as pd

d = {'text': ['hello', '', ' ']}
df = pd.DataFrame(data=d)

In [151]:
df

| | text |
|---|---|
| 0 | hello |
| 1 | |
| 2 | |


In [152]:
blank = df[(df['text'] == '') | (df['text'].str.isspace())]


In [153]:
blank

| | text |
|---|---|
| 1 | |
| 2 | |


In [197]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(
    analyzer = 'word',
    lowercase = False,
)
features = vectorizer.fit_transform(
    data
)
features_nd = features.toarray() # for easy usage

In [204]:
features_nd

array([[0, 0, 0, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 1, 0, 0, 0],
       [0, 0, 0, 0, 1, 0, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 1, 0, 0],
       [1, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 1, 0]])

In [198]:
# https://www.twilio.com/blog/2017/12/sentiment-analysis-scikit-learn.html
from sklearn.linear_model import LogisticRegression
log_model = LogisticRegression()

In [206]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test  = train_test_split(
        features_nd, 
        ['positive', 'neutral', 'negative'],
        train_size=0.80, 
        random_state=1234)

ValueError: Found input variables with inconsistent numbers of samples: [9, 3]

In [193]:
log_model = log_model.fit(X=X_train, y=y_train)

ValueError: could not convert string to float: 'WATCH DC Cops Direct TrumpSupporters into Gauntlet of Protesters Do Nothing When They Are Assaulted httpstcoyDCbZou via BreitbartNews These thugs and lowlifes only stalked and attacked when most of the tens of thousands of people had left town Ran away earlier'
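
Both errors above look like consequences of what was passed in, not of the models themselves. Fitting `CountVectorizer` on the whole DataFrame iterates over its 9 column names, which would explain the 9x9 matrix. The split then pairs those 9 rows with a list of only 3 labels, and `LogisticRegression` appears to have been fitted on raw tweet strings instead of a numeric matrix. A minimal sketch of a consistent pipeline follows, using a small made-up list in place of `data['text']` and `data['sentiment']`. The toy texts, the labels, and `max_iter=1000` are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for data['text'] and data['sentiment'].
texts = [
    "great rally tonight thank you",
    "thank you so nice",
    "wonderful people great job",
    "terrible fake news media",
    "rigged election very bad",
    "bad deal terrible result",
    "landing in north carolina",
    "press conference at noon",
    "meeting at the white house",
]
labels = ["positive"] * 3 + ["negative"] * 3 + ["neutral"] * 3

# Split texts and labels together so that every text keeps its own label.
X_train_text, X_test_text, y_train, y_test = train_test_split(
    texts, labels, train_size=0.80, random_state=1234
)

# Turn the text into word-count features; the vocabulary comes from the training texts.
vectorizer = CountVectorizer(analyzer="word", lowercase=False)
X_train = vectorizer.fit_transform(X_train_text)
X_test = vectorizer.transform(X_test_text)

# Fit on the numeric features, not on the raw strings.
log_model = LogisticRegression(max_iter=1000)
log_model.fit(X_train, y_train)
predictions = log_model.predict(X_test)
print(list(zip(X_test_text, predictions)))
```

Fitting the vectorizer on the training texts only keeps test-set vocabulary out of the model. That is a choice made in this sketch and is not taken from the notebook.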