# Analyzing Social Influence Tactics in Twitter Using Artificial Intelligence

**Adapted from: Emmanuel Dufourq** (edufourq@gmail.com - [www.emmanueldufourq.com](http://www.emmanueldufourq.com) )

July 2018

*Made for the Theoretical Foundations of Data Science 2018 (African Institute for Mathematical Sciences)*

Adapted from https://cloud.google.com/blog/big-data/2017/10/intro-to-text-classification-with-keras-automatically-tagging-stack-overflow-posts

### Objectives:
* This notebook will examine a set of tweets related to the Masters golf tournament and analyze influence tactics using text classification models

* Classify tweet data related to an event into 4 classes: three influence tactics (Minimizing, Denial, Diversion) plus Neither.

**Clarification of Data sources**

This example uses the Twitter API, accessed through a Google Sheets add-on, to generate the CSV file. The add-on and its tutorial can be found here: [How to Save Tweets for any Twitter Hashtag in a Google Sheet](https://www.labnol.org/internet/save-twitter-hashtag-tweets/6505/).

The gathered data was then categorized by hand, which is why the data set contains only about 300 entries (301, according to the class counts shown later in this notebook).

All the collected data sets are available for download in [this](https://github.com/jaortiz117/Twitter-Influence-tactic-classification) GitHub repository.

## How does it work?
The general concept of this project is to have a computer determine influence tactics given a specific context (in this case, the Masters golf tournament). We use supervised learning for this example, which restricts us to the contexts we have trained on. In the future, unsupervised versions could be implemented, which would allow for dynamic context recognition and training.

## Imports

In [0]:
import pandas as pd
from sklearn.model_selection import train_test_split
import keras
from sklearn.preprocessing import LabelBinarizer, LabelEncoder
from keras.preprocessing import text, sequence
from keras import utils
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
import numpy as np
import string
import re

## Download the data

In [0]:
df = pd.read_csv("https://raw.githubusercontent.com/jaortiz117/Twitter-Influence-tactic-classification/master/classified_tweets.csv")

## Look at some of the data

In this data set, for ease of use, the influence tactics have been abbreviated to M, D, V and N, which stand for Minimizing, Denial, Diversion and Neither respectively.

In [0]:
df.head()

Unnamed: 0,tweet,tactic
0,Listen for the sound! Insane speed! Love worki...,N
1,Too cool! Check this out from @TheFlippist #Th...,V
2,What it was like to be at Augusta for Tiger's ...,N
3,Love this! @TheMasters @TigerWoods #tiger #The...,N
4,Tiger Woods Masters win with ALL THE FEELS ❤️ ...,N
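
When reading the outputs in the rest of the notebook, it can help to expand the one-letter codes back into full names. The snippet below is a minimal sketch of such a lookup. The names `TACTIC_NAMES` and `expand_tactic` are hypothetical helpers introduced here for illustration and are not part of the original notebook; only the four codes and their meanings come from the data set.

```python
# Hypothetical lookup table: the codes and names come from the data set,
# the dictionary and helper function are illustrative additions.
TACTIC_NAMES = {
    "M": "Minimizing",
    "D": "Denial",
    "V": "Diversion",
    "N": "Neither",
}


def expand_tactic(code):
    """Return the full tactic name for a one-letter code."""
    return TACTIC_NAMES[code]


print(expand_tactic("V"))  # Diversion
```

With the DataFrame loaded above, something like `df['tactic'].map(TACTIC_NAMES)` should give a column of full names.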


## Clean the data (remove symbols, numbers, emojis)

In [0]:
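def remove_punct_emoji(tweet):
    # Drop ASCII punctuation (this also strips the @ and # of mentions and hashtags)
    tweet = "".join([char for char in tweet if char not in string.punctuation])
    # Drop digits
    tweet = re.sub(r'[0-9]+', '', tweet)
    # Drop anything outside printable ASCII, which removes emojis
    tweet = "".join([char for char in tweet if char in string.printable])
    return tweet

df['tweet_clean'] = df['tweet'].apply(remove_punct_emoji)
df.head(10)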

Unnamed: 0,tweet,tactic,tweet_clean
0,Listen for the sound! Insane speed! Love worki...,N,Listen for the sound Insane speed Love working...
1,Too cool! Check this out from @TheFlippist #Th...,V,Too cool Check this out from TheFlippist TheMa...
2,What it was like to be at Augusta for Tiger's ...,N,What it was like to be at Augusta for Tigers e...
3,Love this! @TheMasters @TigerWoods #tiger #The...,N,Love this TheMasters TigerWoods tiger TheMaste...
4,Tiger Woods Masters win with ALL THE FEELS ❤️ ...,N,Tiger Woods Masters win with ALL THE FEELS vi...
5,"After injuries, an affair, a DUI, addiction, a...",M,After injuries an affair a DUI addiction and s...
6,Simply perfect. ⛳️ No one didn't think Tigerrr...,D,Simply perfect No one didnt think Tigerrrrrrr...
7,"After a second-place finish at #TheMasters, be...",N,After a secondplace finish at TheMasters betti...
8,Oprah's Flipbook Club award winner. #Tiger #Th...,V,Oprahs Flipbook Club award winner Tiger TheMas...
9,Reminiscing about this shot last Saturday at A...,M,Reminiscing about this shot last Saturday at A...
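
To see what the three cleaning steps do to a single string, the sketch below applies the same logic to a made-up tweet. The sample text is an assumption chosen for illustration and is not taken from the data set.

```python
import re
import string


def remove_punct_emoji(text):
    # Step 1: drop ASCII punctuation (including ', !, @ and #)
    text = "".join(ch for ch in text if ch not in string.punctuation)
    # Step 2: drop digits
    text = re.sub(r"[0-9]+", "", text)
    # Step 3: drop anything outside printable ASCII (emojis, accents, ...)
    text = "".join(ch for ch in text if ch in string.printable)
    return text


# Made-up example; \u26f3 is the golf-flag emoji
sample = "Tiger's 5th win!! \u26f3 #TheMasters 2019"
cleaned = remove_punct_emoji(sample)
print(repr(cleaned))
```

Note two side effects that are also visible in the table above: removing digits turns "5th" into "th", and removing an emoji leaves a double space behind.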


## Print out the unique tactics

In [0]:
df['tactic'].unique()

array(['N', 'V', 'M', 'D'], dtype=object)

## Determine the number of classes

In [0]:
num_classes = len(df['tactic'].unique())

In [0]:
num_classes

4

## Check how many instances for each class

In [0]:
df['tactic'].value_counts()

V    132
N    114
M     44
D     11
Name: tactic, dtype: int64
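
The counts above are far from balanced, which is worth keeping in mind when reading the accuracy later on. As a rough sketch, the share of each class can be computed directly from these four numbers; the variable names below are illustrative.

```python
# Class counts copied from the value_counts() output above
counts = {"V": 132, "N": 114, "M": 44, "D": 11}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}
majority = max(counts, key=counts.get)

print("Total tweets:", total)
for label, share in shares.items():
    print(f"{label}: {share:.1%}")
print("Most frequent class:", majority)
```

On data with these proportions, a model that always predicts the most frequent class would be right roughly 44% of the time, which gives a simple reference point for the test accuracy.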

## Determine the number of words in each instance

In [0]:
df['Word Count'] = df['tweet_clean'].apply(lambda x: len(x.split(' ')))

In [0]:
df.sort_values(by=['Word Count'], ascending=False)

Unnamed: 0,tweet,tactic,tweet_clean,Word Count
227,I took in every moment of #TheMasters . It was...,N,I took in every moment of TheMasters It was a...,79
274,My boss is the 🐐 want to know what Jordan thou...,N,My boss is the want to know what Jordan thoug...,64
285,The GOLFBUDDY aim L10 was used during #TheMast...,V,The GOLFBUDDY aim L was used during TheMasters...,59
138,Great to talk with the guys about my journey i...,V,Great to talk with the guys about my journey i...,58
186,Wrote this piece on @TigerWoods for @theslantm...,V,Wrote this piece on TigerWoods for theslantmed...,58
264,Great look at #TexasTech vs BU with @MattRober...,V,Great look at TexasTech vs BU with MattRoberts...,58
189,So proud to see @TigerWoods using @tagboard to...,V,So proud to see TigerWoods using tagboard to d...,55
167,Thank you @RobHodgetts and @cnnsport @CNN for ...,V,Thank you RobHodgetts and cnnsport CNN for tak...,54
178,#ListenIn to @brandonsgorall and I yak about t...,V,ListenIn to brandonsgorall and I yak about thi...,52
93,Between today's spot-on storm forecast and hel...,N,Between todays spoton storm forecast and helpi...,50


## Convert the data into X and Y

In [0]:
X = df['tweet_clean'].values

In [0]:
Y = df['tactic']

## Split the data into training and testing

In [0]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

In [0]:
X_train[0]

' Passionate Historic Legendary Simply Fantastic And thats not just the AWESOME weekend offers we have on instore right now TigerWoods themasters majorseason majoroffers '

## Tokenize

The Keras `Tokenizer` counts the unique words in a corpus and assigns a unique integer index to each of them. We can cap the number of words it keeps, in which case the most frequent words are retained. In our case, we cap the vocabulary size at 1000. The documentation is here: https://keras.io/preprocessing/text/#tokenizer

In [0]:
max_words = 1000
tokenize = text.Tokenizer(num_words=max_words, char_level=False)
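
To make the idea of a frequency-ranked vocabulary concrete, here is a simplified plain-Python stand-in. It is not the Keras implementation: the function name `build_word_index` and the three example sentences are assumptions made for illustration, and it only lowercases and splits on whitespace.

```python
from collections import Counter


def build_word_index(texts):
    """Map each word to a rank, with 1 being the most frequent word."""
    counts = Counter()
    for t in texts:
        counts.update(t.lower().split())
    # most_common() orders words by descending frequency
    ranked = counts.most_common()
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}


# Made-up corpus
docs = ["tiger wins the masters", "the masters is golf", "the comeback"]
word_index = build_word_index(docs)
print(word_index)
```

As in the `word_index` printed later in the notebook, indices start at 1 and the most frequent word gets the lowest index.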

Now we can convert each tweet in our dataset into a vector of length *max_words*. The vector is made up of 0's and 1's, with a 1 at the index of every vocabulary word that appears in the tweet. For example, if the vocabulary is [what, I, you, where, cat], then the sentence "where is the cat" is converted into [0, 0, 0, 1, 1], which indicates that the words "where" and "cat" are present. Put differently, the tokenizer builds a vocabulary, and each position in the vector records whether the corresponding vocabulary word occurs in the text. The tokenizer first needs to be fitted to some data, so we use the training data:

In [0]:
tokenize.fit_on_texts(X_train) 
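
The binary encoding described above can be sketched in a few lines of plain Python, using the same five-word vocabulary as the example. The function name `to_binary_vector` is a hypothetical helper, not a Keras call.

```python
vocabulary = ["what", "I", "you", "where", "cat"]


def to_binary_vector(sentence, vocabulary):
    """Return 1 for each vocabulary word present in the sentence, else 0."""
    words = set(sentence.split())
    return [1 if word in words else 0 for word in vocabulary]


vector = to_binary_vector("where is the cat", vocabulary)
print(vector)  # [0, 0, 0, 1, 1]
```

One difference from this sketch: Keras numbers words from 1, so position 0 of the vectors it produces is not assigned to any word.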

We can take a look at the words and the indices in the vocabulary here:

In [0]:
tokenize.word_index

{'themasters': 1,
 'the': 2,
 'to': 3,
 'and': 4,
 'tigerwoods': 5,
 'of': 6,
 'a': 7,
 'tiger': 8,
 'in': 9,
 'golf': 10,
 'on': 11,
 'at': 12,
 'for': 13,
 'is': 14,
 'this': 15,
 'with': 16,
 'woods': 17,
 'it': 18,
 'you': 19,
 'was': 20,
 'i': 21,
 'masters': 22,
 'from': 23,
 'his': 24,
 'out': 25,
 'sports': 26,
 'that': 27,
 'my': 28,
 'week': 29,
 'one': 30,
 'we': 31,
 'be': 32,
 'win': 33,
 'now': 34,
 'more': 35,
 'podcast': 36,
 'won': 37,
 'our': 38,
 'not': 39,
 'all': 40,
 'new': 41,
 'rt': 42,
 'last': 43,
 'what': 44,
 'winning': 45,
 'like': 46,
 'comeback': 47,
 'about': 48,
 'time': 49,
 'us': 50,
 'just': 51,
 'after': 52,
 'augusta': 53,
 'back': 54,
 'nbaplayoffs': 55,
 'tigers': 56,
 'episode': 57,
 'but': 58,
 'victory': 59,
 'he': 60,
 'when': 61,
 'do': 62,
 'him': 63,
 'nba': 64,
 'by': 65,
 'your': 66,
 'if': 67,
 'were': 68,
 'pgatour': 69,
 'are': 70,
 'weekend': 71,
 'great': 72,
 'listen': 73,
 'talk': 74,
 'so': 75,
 'dont': 76,
 'have': 77,
 'how': 7

Then, we go ahead and convert the training and testing features into their corresponding vectors. The size of these vectors is based on the size of the vocabulary, in our case 1000.

In [0]:
X_train_token = tokenize.texts_to_matrix(X_train)
X_test_token = tokenize.texts_to_matrix(X_test)

In [0]:
X_train_token[0]

array([0., 1., 1., 0., 1., 1., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.,
       1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 1., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0.

Check the size of the vector:

In [0]:
len(X_train_token[0])

1000

Now we need to convert the labels (targets) into their corresponding one-hot encoded values. One way to do this is to convert each label into a number, and then convert the number into a one-hot encoded vector.

## Encode the targets

In [0]:
# Use sklearn utility to convert label strings to numbered index
encoder = LabelEncoder()
encoder.fit(Y_train)
Y_train_encoded = encoder.transform(Y_train)
Y_test_encoded = encoder.transform(Y_test)

In [0]:
Y_train_encoded[0]

3

Now convert into one-hot encoded vectors

In [0]:
Y_train_hot = utils.to_categorical(Y_train_encoded, num_classes)
Y_test_hot = utils.to_categorical(Y_test_encoded, num_classes)

In [0]:
Y_train_hot[0]

array([0., 0., 0., 1.], dtype=float32)
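
The two-step conversion (label to integer, integer to one-hot vector) can also be sketched with NumPy alone. The five sample labels are made up for illustration; like `LabelEncoder`, `np.unique` returns the classes in sorted order, so D, M, N, V map to 0, 1, 2, 3.

```python
import numpy as np

# Made-up sample of labels
labels = np.array(["V", "N", "M", "D", "V"])

# Step 1: label -> integer index (classes come back sorted)
classes, indices = np.unique(labels, return_inverse=True)

# Step 2: integer index -> one-hot row, by picking rows of an identity matrix
one_hot = np.eye(len(classes), dtype="float32")[indices]

print(classes)
print(indices)
print(one_hot[0])
```

Under this ordering, a label of V becomes index 3 and the one-hot vector `[0., 0., 0., 1.]`, which matches the output above.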

Check the shapes.

There are 240 training samples and 61 testing samples.

Each feature sample is a vector of length 1000 and each target is a vector of length 4 (since there are 4 unique classes and the values have been one-hot encoded).

In [0]:
print('x_train shape:', X_train_token.shape)
print('x_test shape:', X_test_token.shape)
print('y_train shape:', Y_train_hot.shape)
print('y_test shape:', Y_test_hot.shape)

x_train shape: (240, 1000)
x_test shape: (61, 1000)
y_train shape: (240, 4)
y_test shape: (61, 4)
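
The sample counts in these shapes follow from the split fractions. The arithmetic below is a sketch that assumes the test share is rounded up and that the validation share (the `validation_split=0.1` passed to `model.fit` further down) is taken from what remains after truncating the training share; under those assumptions it reproduces the numbers reported in this notebook.

```python
import math

n_samples = 301          # 132 + 114 + 44 + 11 from value_counts()
test_size = 0.2          # passed to train_test_split
validation_split = 0.1   # passed to model.fit later in the notebook

# Train/test split (assumption: test share rounded up)
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

# Validation split inside the training set (assumption: truncation)
n_fit = int(n_train * (1 - validation_split))
n_val = n_train - n_fit

print(n_train, n_test)  # 240 61
print(n_fit, n_val)     # 216 24
```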


## Hyper-parameters

In [0]:
batch_size = 16
epochs = 10

## Build the model

In [0]:
model = Sequential()
model.add(Dense(512, input_shape=(max_words,)))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(num_classes))
model.add(Activation('softmax'))

model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
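
For a sense of scale, the number of trainable parameters in this network can be worked out by hand: each `Dense` layer has one weight per input-output pair plus one bias per output, while the `Activation` and `Dropout` layers add none. The helper name `dense_params` is a hypothetical addition for this sketch.

```python
max_words = 1000     # input vector length
hidden_units = 512   # size of the first Dense layer
num_classes = 4      # size of the output layer


def dense_params(n_in, n_out):
    """Weights (n_in * n_out) plus biases (n_out) of a fully connected layer."""
    return n_in * n_out + n_out


hidden = dense_params(max_words, hidden_units)
output = dense_params(hidden_units, num_classes)
total = hidden + output

print(hidden, output, total)  # 512512 2052 514564
```

That is roughly half a million parameters fitted on 240 training tweets, so results on such a small data set should be read with some caution.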



In [0]:
history = model.fit(X_train_token, Y_train_hot,batch_size=batch_size,
                    epochs=epochs,verbose=1,
                    validation_split=0.1)

Train on 216 samples, validate on 24 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


## Check accuracy

In [0]:
# Evaluate the accuracy of our trained model
score = model.evaluate(X_test_token, Y_test_hot,
                       batch_size=batch_size, verbose=1)
print('Test accuracy:', score[1])

Test accuracy: 0.5901639349147921


In [0]:
Y_test.values

array(['N', 'N', 'V', 'V', 'N', 'M', 'V', 'V', 'V', 'V', 'M', 'V', 'V',
       'N', 'N', 'V', 'N', 'D', 'M', 'N', 'N', 'M', 'V', 'N', 'N', 'V',
       'N', 'V', 'N', 'D', 'D', 'V', 'N', 'M', 'N', 'V', 'V', 'N', 'N',
       'V', 'N', 'M', 'N', 'V', 'N', 'V', 'V', 'V', 'V', 'V', 'M', 'M',
       'N', 'V', 'M', 'V', 'V', 'M', 'V', 'V', 'M'], dtype=object)

## Predict

In [0]:
text_labels = encoder.classes_ 
for i in range(10):
    prediction = model.predict(np.array([X_test_token[i]]))
    predicted_label = text_labels[np.argmax(prediction)]
    print('Text: ',X_test[i])
    print('Actual Rating: ' + str(Y_test.values[i]))
    print('Predicted Rating: ' + str(predicted_label) + '\n')

Text:  Still watching The Masters  Rd  Highlights Just cant imagine all of TigerWoods golf colleague waiting for him at the end of the road to congratulate him TheMasters
Actual Rating: N
Predicted Rating: N

Actual Rating: N
Predicted Rating: V

Text:  LIVE Chasing Builds Run It  TheDivision   RyoTFAM NerddomUnited newbstreamteam Xbox ninja streaming Blackout SupportSmallStreamers Themasters ApexLegends Apex gameofthrones forthethrone
Actual Rating: V
Predicted Rating: V

Text:  VIRA Cambridge Golf press  Cambridge Golf espn sports golfcentral ad wsj nytimes reuters bloomberg thestreet forbes nasdaq IHubStockPosts pgatour business cnn bet foxnews bitcoin blockchain themasters cannabis marijuana CBD
Actual Rating: V
Predicted Rating: V

Text:  Only  days ago history was made at TheMasters Check out our Masters Recap Podcast special w a great lineup of guests from Augusta and beyond  Please SHARE the link with others you know were watching history unfold
Actual Rating: N
Predicted Ratin
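
The decoding step used in the loop above, taking the position of the largest softmax output and looking it up in the label array, can be illustrated with NumPy alone. The probabilities below are made up for the example, and the label order assumes the sorted classes D, M, N, V.

```python
import numpy as np

# Assumed label order (sorted, as LabelEncoder stores it)
text_labels = np.array(["D", "M", "N", "V"])

# Made-up softmax output for a single tweet, shape (1, 4)
prediction = np.array([[0.05, 0.10, 0.60, 0.25]])
predicted_label = text_labels[np.argmax(prediction)]
print(predicted_label)  # N

# The same idea for several tweets at once: argmax along the class axis
batch = np.array([[0.05, 0.10, 0.60, 0.25],
                  [0.10, 0.10, 0.10, 0.70]])
batch_labels = text_labels[np.argmax(batch, axis=1)]
print(batch_labels)  # ['N' 'V']
```

Decoding a whole batch this way avoids calling the model once per tweet.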

## Predicting user input

Now you get to add your own "tweet" and let the program predict its influence tactic!

Keep in mind that for this to work, your "tweet" has to be within the context of the Masters golf tournament.

In [0]:
new_input = input("write tweet: ")
new_rating = input("Write tactic used \n(M, D, V, N) Minimizing, Denial, Diversion or Neither respectively: ")
new_input = remove_punct_emoji(new_input)

write tweet: Cant believe he won!! amazing job by tiger woods
Write tactic used 
(M, D, V, N) Minimizing, Denial, Diversion or Neither respectively: N


In [0]:
new_input_token = tokenize.texts_to_matrix([new_input])
prediction = model.predict(np.array([new_input_token[0]]))
predicted_label = text_labels[np.argmax(prediction)]

In [0]:
print("Text: " + new_input)
print("Actual Rating: " + new_rating)
print("Predicted Rating: " + str(predicted_label) + "\n")

Text: Cant believe he won amazing job by tiger woods
Actual Rating: N
Predicted Rating: N
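
The clean, tokenize, predict and decode steps above could be bundled into one function so that new tweets can be classified with a single call. This is a sketch rather than part of the original notebook: `predict_tactic` is a hypothetical name, and it only assumes that the tokenizer object has a `texts_to_matrix` method and the model object has a `predict` method, as the Keras objects used above do.

```python
import re
import string

import numpy as np


def remove_punct_emoji(text):
    # Same cleaning steps as earlier in the notebook
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"[0-9]+", "", text)
    text = "".join(ch for ch in text if ch in string.printable)
    return text


def predict_tactic(tweet, tokenizer, model, labels):
    """Clean one tweet, vectorize it, and return the most likely label."""
    cleaned = remove_punct_emoji(tweet)
    # Binary bag-of-words row for this single tweet
    features = np.asarray(tokenizer.texts_to_matrix([cleaned]))
    # Class probabilities, shape (1, number of classes)
    probabilities = np.asarray(model.predict(features))
    return labels[int(np.argmax(probabilities[0]))]
```

With the objects defined in this notebook, the call would look like `predict_tactic(new_input, tokenize, model, text_labels)`.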

