
# Using NLP on Trump's Tweets
- Joshua Calloway
- DSC 540, Fall Quarter - DePaul


# Motivation
What problem are you tackling, and what's the setting you're considering? What data are you working on? Did anything change from the proposal regarding data, objectives, and methods that you will apply?


We are looking at Trump's tweets and applying NLP to see whether we can determine the following:
- Sentiment analysis
- Subjectivity analysis (how objective or opinionated the tweets are)
- Authorship: @realDonaldTrump contains tweets by Trump and also by his publicity staff, so we want to see whether we can determine which are by Trump

We are using tweets from thetrumparchive, which has about 50,000 tweets, along with a second set of Trump's 1,000 latest tweets. This is in alignment with the project proposal to use NLP on Trump's tweets.

# Next steps [10]: Given your preliminary results, what are the next steps that you're considering?
- Improve accuracy through better tweet cleaning and/or by handling tweet-specific language such as emojis and shorthand
- Use GPT-2 for tweet generation
- Find a way to measure accuracy and use cross-validation to arrive at better models
- Tune the hyperparameters of the models
- Apply the models and methods to the larger set of about 50,000 Trump tweets

## A. Fetch Trump's Tweets
Here we use thetrumparchive to fetch either the latest 1,000 tweets or the larger set of about 55,000 tweets. The tweets come back as JSON in the following format:
<code>
{
  "id": 1,
  "text": "Lets win Michigan",
  "isRetweet": true,
  "isDeleted": false,
  "device": "iPhone",
  "favorites": 323,
  "retweets": 2,
  "date": "2020-11-02"
}
</code>
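
As a minimal sketch of working with records in this shape, the snippet below parses a small JSON payload with the standard library and keeps only the text of tweets that are not retweets. The second sample record is made up for illustration, and `original_texts` is a hypothetical helper that is not part of the notebook.

```python
import json

# Sample payload: the first record follows the format shown above,
# the second is made up so that the filter has something to keep.
SAMPLE_JSON = """
[
  {"id": 1, "text": "Lets win Michigan", "isRetweet": true, "isDeleted": false,
   "device": "iPhone", "favorites": 323, "retweets": 2, "date": "2020-11-02"},
  {"id": 2, "text": "Big rally tonight", "isRetweet": false, "isDeleted": false,
   "device": "iPhone", "favorites": 10, "retweets": 1, "date": "2020-11-02"}
]
"""

def original_texts(records):
    """Return the text of every record that is not flagged as a retweet."""
    return [record["text"] for record in records if not record["isRetweet"]]

records = json.loads(SAMPLE_JSON)
print(original_texts(records))  # ['Big rally tonight']
```

Filtering on `isRetweet` is only one possible cleaning step; whether retweets should be excluded depends on the question being asked of the data.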

In [98]:
import urllib.request, json
from sklearn.model_selection import train_test_split
from pandas import DataFrame

In [99]:
# If largeData is True, load the larger set (about 55,000 tweets) from a local JSON file;
# otherwise fetch the latest tweets from thetrumparchive.
def fetch_data(largeData=False):
    if largeData:
        with open('tweets_11-06-2020.json') as f:
            data = json.load(f)  # larger tweet set stored locally
    else:
        with urllib.request.urlopen("https://www.thetrumparchive.com/latest-tweets") as url:
            data = json.loads(url.read().decode())  # latest tweets only
    return DataFrame(data)

# get_tweet_text = lambda tweet : tweet['text']


In [100]:
# We are interested in the tweet text for NLP
data = fetch_data(largeData=False)
data.head()

Unnamed: 0,id,text,isRetweet,isDeleted,device,favorites,retweets,date
0,1328545700012027904,RT @txelectionlaw: Having practiced law for al...,True,False,Twitter for iPhone,0,24869,2020-11-17T03:49:12.000Z
1,1328483862490574849,Big victory moments ago in the State of Nevada...,False,False,Twitter for iPhone,178766,49758,2020-11-16T23:43:29.000Z
2,1328448530504163329,The rate of rejected Mail-In Ballots is 30 X’s...,False,False,Twitter for iPhone,145809,37941,2020-11-16T21:23:05.000Z
3,1328382287331856384,Georgia won’t let us look at the all important...,False,False,Twitter for iPhone,253094,57272,2020-11-16T16:59:52.000Z
4,1328370030678044675,European Countries are sadly getting clobbered...,False,False,Twitter for iPhone,237437,39993,2020-11-16T16:11:09.000Z


## B. Use tweepy and TextBlob to add sentiment to each tweet

Two blog posts that use tweepy and TextBlob can be found at:
- https://www.earthdatascience.org/courses/use-data-open-source-python/intro-to-apis/analyze-tweet-sentiment-in-python/
- https://medium.com/better-programming/twitter-sentiment-analysis-15d8892c0082



## C. Compute the sentiment of each tweet using TextBlob
- subjectivity is how opinionated a tweet is (scaled from 0 to 1, where 0 is fully objective and 1 is fully subjective)
- polarity is whether the tweet is negative or positive (scaled from -1 to 1)

In [102]:
# We are going to use tweepy and TextBlob for tweets
import tweepy as tw
from textblob import TextBlob

# Create a function to get the subjectivity
def getSubjectivity(text):
    return TextBlob(text).sentiment.subjectivity

# Create a function to get the polarity
def getPolarity(text):
    return  TextBlob(text).sentiment.polarity

# Drop words shorter than 3 characters and convert the remaining words to lowercase
def filter_words(words):
    words_filtered = [e.lower() for e in words.split() if len(e) >= 3]
    return words_filtered


# Label the tweet neutral when |polarity| < 0.05; otherwise label it by the sign of the polarity
def calculateSentiment(text):
    polarity = getPolarity(text)
    if abs(polarity) < 0.05:
        return 'neutral'
    if polarity > 0:
        return "positive"
    else:
        return "negative"
    
    
# Create a new 'sentiment' column from the TextBlob polarity of each tweet
data['sentiment'] = data['text'].apply(calculateSentiment)


# Show the dataframe with the new 'sentiment' column
data

Unnamed: 0,id,text,isRetweet,isDeleted,device,favorites,retweets,date,sentiment
0,1328545700012027904,RT @txelectionlaw: Having practiced law for al...,True,False,Twitter for iPhone,0,24869,2020-11-17T03:49:12.000Z,neutral
1,1328483862490574849,Big victory moments ago in the State of Nevada...,False,False,Twitter for iPhone,178766,49758,2020-11-16T23:43:29.000Z,positive
2,1328448530504163329,The rate of rejected Mail-In Ballots is 30 X’s...,False,False,Twitter for iPhone,145809,37941,2020-11-16T21:23:05.000Z,neutral
3,1328382287331856384,Georgia won’t let us look at the all important...,False,False,Twitter for iPhone,253094,57272,2020-11-16T16:59:52.000Z,neutral
4,1328370030678044675,European Countries are sadly getting clobbered...,False,False,Twitter for iPhone,237437,39993,2020-11-16T16:11:09.000Z,negative
...,...,...,...,...,...,...,...,...,...
995,1320162439200149504,Thank you OHIO! #VOTE\nhttps://t.co/gsFSghkmdM...,False,False,Twitter for iPhone,74088,12731,2020-10-25T00:37:07.000Z,neutral
996,1320162252868255745,What a terrible thing for Biden to say! Rigged...,False,False,Twitter for iPhone,59144,14564,2020-10-25T00:36:23.000Z,negative
997,1320146431584301062,RT @TeamTrump: TONIGHT: Watch President @realD...,True,False,Twitter for iPhone,0,4558,2020-10-24T23:33:31.000Z,neutral
998,1320143847356182528,RT @chefjclark: People ask why the President f...,True,False,Twitter for iPhone,0,12139,2020-10-24T23:23:14.000Z,negative


# We now split data into training, test, and validation

In [103]:
from sklearn.model_selection import train_test_split

# The tweets and their TextBlob labels live in the `data` dataframe built above
X = data['text']
y = data['sentiment']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10, random_state=555)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.30, random_state=555)


In [104]:
len(X_train)

630

In [105]:
len(X_val)

270
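
The sizes above follow from applying the two splits in sequence: 10% of the 1,000 tweets are held out for testing, and 30% of the remaining 900 go to validation. The sketch below reproduces that arithmetic in plain Python. It assumes the held-out portion is rounded up to a whole number of samples, which is our understanding of how scikit-learn treats a fractional `test_size`; `split_sizes` is a hypothetical helper.

```python
import math

def split_sizes(n_samples, held_out_fraction):
    """Return (n_kept, n_held_out) for one fractional split.

    Assumption: the held-out count is rounded up and the rest is kept.
    """
    n_held_out = math.ceil(held_out_fraction * n_samples)
    return n_samples - n_held_out, n_held_out

n_total = 1000                                # size of the small tweet set
n_rest, n_test = split_sizes(n_total, 0.10)   # first split: hold out the test set
n_train, n_val = split_sizes(n_rest, 0.30)    # second split: carve out validation

print(n_train, n_val, n_test)  # 630 270 100
```

Overall this is a 63% / 27% / 10% split into training, validation, and test data.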

In [106]:
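import os
import numpy as np

from azure.core.credentials import AzureKeyCredential
from azure.ai.textanalytics import TextAnalyticsClient

# Read the API key from the environment instead of hard-coding it in the notebook.
# AZURE_TEXT_ANALYTICS_KEY is an assumed variable name; set it before running this cell.
credential = AzureKeyCredential(os.environ["AZURE_TEXT_ANALYTICS_KEY"])
text_analytics_client = TextAnalyticsClient(endpoint="https://trumptweetanalysissentiment.cognitiveservices.azure.com/", credential=credential)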
    
def call_azure(list_text_only_ten_items):
    response = text_analytics_client.analyze_sentiment(list_text_only_ten_items)
    successful_responses = [doc for doc in response if not doc.is_error]
    return list_text_only_ten_items, list(map(lambda x: x['sentiment'], successful_responses))
    
# treat mixed and neutral sentiment as neutral
def combine_mixed_neutral(sentiments):
    converted = []
    for item in sentiments:
        newitem = 'neutral' if item == 'mixed' else item
        converted.append(newitem)
    return converted
           
# Ground truth: sentiment labels from Azure, requested in batches of ten texts
def calculate_groundtruth_sentiment(list_of_texts):
    texts = list_of_texts.tolist()
    # Slicing also handles a final batch with fewer than ten texts
    sublists = [texts[i:i + 10] for i in range(0, len(texts), 10)]
    retvalues = [call_azure(batch) for batch in sublists]
    # call_azure returns (texts, sentiments); flatten the sentiments of all batches
    sentiments = [item for _, batch_sentiments in retvalues for item in batch_sentiments]
    return list_of_texts, combine_mixed_neutral(sentiments)
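
Because `call_azure` drops documents that come back as errors, the returned labels can end up shorter than the input and no longer line up with the tweets. One way to guard against this, sketched below with stand-in response objects rather than the real Azure client, is to keep a placeholder for each failed document and then drop text and label together. `FakeDoc`, `aligned_sentiments`, and `drop_unlabeled` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class FakeDoc:
    """Stand-in for one per-document result; only the fields used here."""
    is_error: bool
    sentiment: Optional[str] = None

def aligned_sentiments(docs: List[FakeDoc]) -> List[Optional[str]]:
    """Return one label per document, using None where the document failed."""
    return [None if doc.is_error else doc.sentiment for doc in docs]

def drop_unlabeled(texts: List[str], labels: List[Optional[str]]) -> Tuple[List[str], List[str]]:
    """Remove texts whose label is None so that both lists stay aligned."""
    pairs = [(text, label) for text, label in zip(texts, labels) if label is not None]
    return [text for text, _ in pairs], [label for _, label in pairs]

texts = ["tweet a", "tweet b", "tweet c"]
response = [FakeDoc(False, "positive"), FakeDoc(True), FakeDoc(False, "mixed")]

labels = aligned_sentiments(response)
print(labels)  # ['positive', None, 'mixed']

kept_texts, kept_labels = drop_unlabeled(texts, labels)
print(kept_texts, kept_labels)  # ['tweet a', 'tweet c'] ['positive', 'mixed']
```

With this approach the same filtering would also need to be applied to the TextBlob labels before comparing them against the Azure labels.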


In [107]:
X_test2, sentiments = calculate_groundtruth_sentiment(X_test)


In [108]:
Y_test_groundtruth = sentiments
Y_test_groundtruth[:5]

['negative', 'neutral', 'positive', 'neutral', 'neutral']

In [109]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

print(f'accuracy_score is {accuracy_score(Y_test_groundtruth, y_test)}')


accuracy_score is 0.61


In [110]:
print(f'confusion matrix is \n{confusion_matrix(Y_test_groundtruth, y_test)}')

confusion matrix is 
[[14  6  8]
 [ 4 26 14]
 [ 1  6 21]]


In [111]:
print(f'classification report is \n{classification_report(Y_test_groundtruth, y_test)}')

classification report is 
              precision    recall  f1-score   support

    negative       0.74      0.50      0.60        28
     neutral       0.68      0.59      0.63        44
    positive       0.49      0.75      0.59        28

    accuracy                           0.61       100
   macro avg       0.64      0.61      0.61       100
weighted avg       0.64      0.61      0.61       100
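
The accuracy and the per-class precision and recall reported above can be recomputed directly from the confusion matrix. The sketch below does this in plain Python, assuming rows are the Azure ground-truth labels and columns are the TextBlob labels, in the alphabetical order negative, neutral, positive that scikit-learn uses by default for string labels.

```python
labels = ["negative", "neutral", "positive"]

# Confusion matrix copied from the output above: rows = ground truth, columns = predicted.
cm = [
    [14, 6, 8],
    [4, 26, 14],
    [1, 6, 21],
]

total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(len(labels)))
accuracy = correct / total
print(f"accuracy = {accuracy:.2f}")  # accuracy = 0.61

metrics = {}
for i, label in enumerate(labels):
    support = sum(cm[i])                    # tweets whose true label is `label`
    predicted = sum(row[i] for row in cm)   # tweets predicted as `label`
    precision = cm[i][i] / predicted
    recall = cm[i][i] / support
    metrics[label] = (precision, recall)
    print(f"{label}: precision = {precision:.2f}, recall = {recall:.2f}")
# negative: precision = 0.74, recall = 0.50
# neutral: precision = 0.68, recall = 0.59
# positive: precision = 0.49, recall = 0.75
```

Read this way, the positive precision of 0.49 means that roughly half of the tweets TextBlob labels positive are not positive according to Azure.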

