## Twitter data exploration part 2

### Working with Text Data and Naive Bayes in scikit-learn
#### Idea: can we pull in other tweets and classify them as pro-ISIS or not?

Find a dataset of general tweets, for example: http://help.sentiment140.com/for-students/

Alternatively, collect your own tweets using the Twitter API.

See the scikit-learn docs for background on text feature extraction: http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction

In [6]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")

In [7]:
random_tweets = pd.read_csv('new_tweets.csv', encoding='iso-8859-1')

In [8]:
random_tweets.shape

(1048575, 3)

In [9]:
# let's take a smaller subset of this DataFrame; .copy() avoids assigning to a slice of random_tweets
new_tweets = random_tweets[:17410].copy()
# add a label so we know which dataset each tweet came from when we combine and predict later
new_tweets['label'] = 0

In [10]:
pro_isis=pd.read_csv('tweets.csv')

In [11]:
pro_isis['label']=1

In [12]:
combined = pd.concat([new_tweets, pro_isis[['username', 'time', 'tweets','label']]], axis=0)

In [13]:
combined.head()

| | label | time | tweets | username |
|---|---|---|---|---|
| 0 | 0 | Mon Apr 06 22:19:49 PDT 2009 | is upset that he can't update his Facebook by ... | scotthamilton |
| 1 | 0 | Mon Apr 06 22:19:53 PDT 2009 | @Kenichan I dived many times for the ball. Man... | mattycus |
| 2 | 0 | Mon Apr 06 22:19:57 PDT 2009 | my whole body feels itchy and like its on fire | ElleCTF |
| 3 | 0 | Mon Apr 06 22:19:57 PDT 2009 | @nationwideclass no, it's not behaving at all.... | Karoli |
| 4 | 0 | Mon Apr 06 22:20:00 PDT 2009 | @Kwesidei not the whole crew | joy_wolf |


In [14]:
combined.shape

(34820, 4)

In [15]:
# The code below is adapted from an in-class exercise from my data science class at General Assembly - DC - March 2016

In [16]:
# define X and y
# X is the column containing the text of the tweet
X = combined.tweets
# y is the label we added: 0 if from the non-ISIS Twitter dataset, 1 if from the ISIS fan Twitter dataset
y = combined.label

In [17]:
# split into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, stratify=y)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

(26115,) (26115,)
(8705,) (8705,)
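The `stratify=y` argument is what keeps the two classes balanced across the training and testing sets. A minimal sketch of that behaviour, using hypothetical toy labels (`X_toy` and `y_toy` are made up for illustration and are not the tweet data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 80 examples of class 0 and 20 of class 1.
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# Same call pattern as the notebook: default 75/25 split, stratified on the labels.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, random_state=1, stratify=y_toy
)

# With stratify, each split keeps the original share of class 1.
train_share = y_tr.mean()
test_share = y_te.mean()
print(len(y_tr), len(y_te), train_share, test_share)
```

Without `stratify`, the class shares in each split would only match the original proportions approximately.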


In [18]:
from sklearn.feature_extraction.text import CountVectorizer

In [19]:
# instantiate the vectorizer
vect = CountVectorizer()

In [20]:
# learn training data vocabulary, then create document-term matrix
# combine fit and transform into a single step
X_train_dtm = vect.fit_transform(X_train)
X_train_dtm

<26115x44513 sparse matrix of type '<class 'numpy.int64'>'
	with 359288 stored elements in Compressed Sparse Row format>

In [21]:
# transform testing data (using fitted vocabulary) into a document-term matrix
X_test_dtm = vect.transform(X_test)
X_test_dtm

<8705x44513 sparse matrix of type '<class 'numpy.int64'>'
	with 110240 stored elements in Compressed Sparse Row format>
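Because the testing data is transformed with the vocabulary learned from the training data, both matrices have the same number of columns, and any token that appears only in the testing data is dropped. A small sketch on hypothetical toy sentences (the names `toy_vect`, `train_docs` and `test_docs` are illustrative, not part of the notebook):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents standing in for tweets.
train_docs = ["call me tonight", "please call a cab"]
test_docs = ["please call me a taxi"]

toy_vect = CountVectorizer()
toy_train_dtm = toy_vect.fit_transform(train_docs)  # learns the vocabulary
toy_test_dtm = toy_vect.transform(test_docs)        # reuses it, learns nothing new

# The default tokenizer ignores one-character tokens such as "a".
vocab = sorted(toy_vect.vocabulary_)
print(vocab)
print(toy_test_dtm.toarray())  # "taxi" was never seen in training, so it has no column
```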

In [22]:
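# store token names
# note: get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out()
X_train_tokens = vect.get_feature_names()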

In [23]:
# first 50 tokens
print(X_train_tokens[:50])

['00', '000', '0000', '000s', '002', '00am', '00hhqvgwng', '00klaxxhb8', '00pm', '00rebel_umm', '00wytaqd0b', '01', '01sak_', '01yyag5osu', '02', '02am', '02hijbrguk', '03', '0332zzzopm', '04', '04_8_1437', '05', '055brigade', '05am', '05om1fdrsf', '05pm', '06afxzoivs', '06fnh', '06kill56', '06zufezmpq', '07', '07899000930', '07am', '07dxtibpab', '07fzl09eny', '07ld37pwca', '08', '08am', '08ecl', '08gmtkagnu', '08swnllkjy', '09', '096', '0_0', '0a7knioyll', '0ahoovyt5b', '0ajaf2ys6i', '0aqck2x09r', '0b2', '0b6bnzn1xs']


In [24]:
# last 50 tokens
print(X_train_tokens[-50:])

['ھلاک', 'ھے', 'ہاتھوں', 'ہزیان', 'ہمدردی', 'ہو', 'ہوا', 'ہوگئی', 'ہوگئے', 'ہوگیا', 'ہوے', 'ہیڈ', 'ہیں', 'ہے', 'ہےجسکاتعلق', 'یا', 'یلدا', 'یہ', 'یہی', 'আখ', 'আল', 'ইক', 'ওয', 'শসহ', 'ḥalab', 'ḥamzah', 'ṣadiq', 'ṭāġūt', '新宿高島屋11階の綺麗なムッサラーで礼拝', '日本男児たるもの', 'ﺃﻟﻔﺎ', 'ﺍﻋﺘﺒﺮ', 'ﺍﻟﻔﺎ', 'ﺍﻧﺘﻢ', 'ﺛﻢ', 'ﺣﺬﻭﻛﻢ', 'ﻋﺒﺮﺓ', 'ﻓﻮﺍﻟﻠﻪ', 'ﻟﻤﻦ', 'ﻟﻦ', 'ﻟﻨﺴﺤﺒﻨﻜﻢ', 'ﻣﻨﻜﻢ', 'ﻧﺒﻖ', 'ﻧﺬﺭ', 'ﻭﺍﻟﻠﻪ', 'ﻭﻟﻦ', 'ﻭﻟﻨﺠﻌﻠﻨﻜﻢ', 'ﻭﻣﻦ', 'ﻮﻥ', 'ﻳﺤﺬﻭ']


In [25]:
# view X_train_dtm as a dense matrix
X_train_dtm.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ..., 
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [26]:
# count how many times EACH token appears across ALL tweets in X_train_dtm
# summing the sparse matrix directly avoids building the full dense array in memory
X_train_counts = np.asarray(X_train_dtm.sum(axis=0)).ravel()
X_train_counts

array([17, 74,  1, ...,  1,  1,  1], dtype=int64)

In [27]:
# optional: uncomment to create a DataFrame of tokens with their counts
#token_counts = pd.DataFrame({'token':X_train_tokens, 'count':X_train_counts}).sort_values(by='count', ascending=True)

The multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work.

http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html
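To illustrate the remark about fractional counts, here is a rough sketch that feeds tf-idf weights instead of raw counts into `MultinomialNB`. The toy texts and labels are hypothetical, and this is only one possible setup, not the approach used in this notebook:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: label 1 = promotional text, label 0 = everyday text.
toy_texts = [
    "free prize call now",
    "win a free prize",
    "lunch at noon today",
    "see you at lunch",
]
toy_labels = [1, 1, 0, 0]

# TfidfVectorizer produces fractional feature values; MultinomialNB accepts them.
tfidf_nb = make_pipeline(TfidfVectorizer(), MultinomialNB())
tfidf_nb.fit(toy_texts, toy_labels)

toy_pred = tfidf_nb.predict(["free prize", "lunch at noon"])
print(toy_pred)
```

Wrapping the vectorizer and the classifier in a pipeline also makes it harder to fit the vocabulary on the testing data by accident.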

In [28]:
# train a Naive Bayes model using X_train_dtm
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [29]:
# make class predictions for X_test_dtm
y_pred_class = nb.predict(X_test_dtm)

In [30]:
# calculate accuracy of class predictions
from sklearn import metrics
print(metrics.accuracy_score(y_test, y_pred_class))

0.972085008616


In [31]:
print(metrics.classification_report(y_test, y_pred_class))

             precision    recall  f1-score   support

          0       0.95      0.99      0.97      4352
          1       0.99      0.95      0.97      4353

avg / total       0.97      0.97      0.97      8705



In [32]:
# confusion matrix
print(metrics.confusion_matrix(y_test, y_pred_class))

[[4322   30]
 [ 213 4140]]
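The accuracy, precision and recall reported earlier can be recomputed by hand from the confusion matrix above (rows are the true class, columns are the predicted class). A short sketch of that arithmetic, with the four counts copied from the output:

```python
import numpy as np

# Counts copied from the confusion matrix printed above.
cm = np.array([[4322, 30],
               [213, 4140]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all tweets classified correctly
precision_1 = tp / (tp + fp)      # of tweets predicted as class 1, share that truly are
recall_1 = tp / (tp + fn)         # of true class 1 tweets, share that were found
print(accuracy, precision_1, recall_1)
```

The 213 false negatives are what pull recall for class 1 down to about 0.95, while the 30 false positives leave its precision at about 0.99.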


In [33]:
# predict (poorly calibrated) probabilities
y_pred_prob = nb.predict_proba(X_test_dtm)[:, 1]
y_pred_prob

array([  1.00000000e+00,   1.00000000e+00,   1.00000000e+00, ...,
         8.28461746e-08,   7.35653454e-01,   1.35103253e-03])

In [34]:
# calculate AUC
print(metrics.roc_auc_score(y_test, y_pred_prob))

0.993953787364


In [35]:
# print tweet text for the false positives (true label 0, predicted 1)
X_test[y_test < y_pred_class]

10005    Hrm, scanner traffic diminishes greatly when y...
10233            @Ishme3t these are the mysteries of life 
13396              The Indians scored 13 runs in 1 inning 
12955                                     @ShAiNaBeLu Sep 
3557     @itspink WHAT?BOYZONE ARE REFORMING???I'm neve...
16304    University of Texas at Austin or South Western...
4723     CBI is Congress Bureau of Investigation. ...It...
3719          March sales reports done... hardly worth it 
5020     Why can't Excel 2003 handle more than 7 nested...
3781     NBCNews reporting Pres Chief Econ guy #Summers...
3319                         soluna is slower than accord 
9192                                    is @ the AA Hotel 
17168    @AgentMan1 You're really into the US Armed For...
14339    @IvyEnvy the cubs are rated 8th in MLB, 4th in...
12566         Wants to be followed by the King of Twitter 
1667     @grahamcracker  If only you were working in th...
17309        @zjelektra the news...  the pronounciation.

In [36]:
# print tweet text for the false negatives (true label 1, predicted 0)
X_test[y_test > y_pred_class]

5773     @mawilner do you have any link of video about ...
16346    @SimNasr I got it; I already deleted my tweet....
6104     @WarReporter1 I dont think that Arab tribal le...
13153                       @SaxHorse666 and my life too..
10167                                   forgot my email :(
3820     @DawlateMohamedi videos r frm fresh past and r...
9773                @PalmyraRev1 thanks for your great job
3769     Pakistan always come up with this : they drop ...
15117    It wasn't totally ununderstandable that some w...
12469    @WarReporter1 This is so dramatic. Drama, intr...
8677     @syrmukhabarat @leithfadel very Dumb fanboy yo...
4932     @7layers_ this is fantastic how they can move ...
5222     Sweet is Revenge,                             ...
14262    @support__7220 interview? what sort of interview?
3314     You want to insult yourself as a cricketer? Ap...
9732                  @Remy8289 lol. It's a guys only. Lol
7807     If you are having a bad night tonight,guess wh.

In [37]:
# these false positives and false negatives need a closer look; here is the first false positive in full
X_test[10005]

'Hrm, scanner traffic diminishes greatly when your public services are on strike. '

In [39]:
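# the classifier's accuracy is very high, almost too good to be true
# the two classes come from different source datasets, so the model may be learning dataset differences rather than pro-ISIS content
# it is unclear whether it would work on real-world tweets
# please email me with ideas and findings: dills_julia@bah.com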
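One possible way to follow up on the "too good to be true" concern is to look at which tokens the model treats as the strongest evidence for each class. The sketch below does this on a hypothetical toy corpus using the `feature_log_prob_` attribute of a fitted `MultinomialNB`; the texts, labels and variable names are assumptions for illustration, and the same idea would apply to `nb` and `vect` from this notebook:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus: label 0 = everyday tweets, label 1 = news-style tweets.
toy_texts = [
    "going to the game tonight",
    "the game was great",
    "great weather tonight",
    "breaking news from the city",
    "news report from the front",
    "breaking report",
]
toy_labels = [0, 0, 0, 1, 1, 1]

toy_vect = CountVectorizer()
toy_dtm = toy_vect.fit_transform(toy_texts)
toy_nb = MultinomialNB().fit(toy_dtm, toy_labels)

# Tokens in column order, so they line up with feature_log_prob_.
tokens = np.array(sorted(toy_vect.vocabulary_, key=toy_vect.vocabulary_.get))

# Positive values favour class 1, negative values favour class 0.
log_ratio = toy_nb.feature_log_prob_[1] - toy_nb.feature_log_prob_[0]
order = np.argsort(log_ratio)

top_class1 = set(tokens[order[-4:]])
top_class0 = set(tokens[order[:3]])
print(top_class1)
print(top_class0)
```

If the highest-ranked tokens on the real data turn out to be usernames, link fragments or date-specific words rather than topical vocabulary, that would suggest the model is separating the two source datasets rather than the content of the tweets.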