## Importing Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
dataset = pd.read_csv('stocks.csv')
dataset.head()

|   | Text | Sentiment |
|---|------|-----------|
| 0 | Kickers on my watchlist XIDE TIT SOQ PNK CPW B... | 1 |
| 1 | user: AAP MOVIE. 55% return for the FEA/GEED i... | 1 |
| 2 | user I'd be afraid to short AMZN - they are lo... | 1 |
| 3 | MNTA Over 12.00 | 1 |
| 4 | OI Over 21.37 | 1 |


Note: Only the first 2,000 rows are used here. You can use more data if your device can handle a larger dataset.

In [3]:
dataset = dataset.iloc[:2000,:]
dataset.shape

(2000, 2)

Check the dataset for null values. If any are present, remove those rows and reset the index.

In [4]:
dataset.isnull().sum()

Text         0
Sentiment    0
dtype: int64
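
This dataset has no nulls, so nothing needs to be removed here. For a dataset that does contain them, the removal and index reset could look roughly like the sketch below. The `toy` frame and its missing entry are made up for illustration.

```python
import pandas as pd

# Toy frame standing in for the real data; the None entry is invented.
toy = pd.DataFrame({
    "Text": ["MNTA Over 12.00", None, "OI Over 21.37"],
    "Sentiment": [1, 0, 1],
})

# Drop every row that has a null in any column, then rebuild a 0..n-1 index
# so that lookups by position such as toy_clean['Text'][i] keep working.
toy_clean = toy.dropna().reset_index(drop=True)

print(len(toy_clean))
print(list(toy_clean.index))
```

Resetting the index matters for the cleaning loop later on, which accesses rows as `dataset['Text'][i]` for `i` from 0 to `len(dataset) - 1`.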

## Adding libraries for NLP

These libraries are used to clean the dataset. The cleaning process includes:
- Removing content that is not relevant (punctuation, digits, capital letters, and stop words)
- Stemming (reducing different forms of the same word to a common root; for example, "loved" becomes "love" after stemming)

Stop words are common words that carry little meaning for an NLP model.

In [5]:
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

[nltk_data] Downloading package stopwords to
[nltk_data]     /data/user/0/ru.iiec.pydroid3/app_HOME/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Convert every entry of the Text column to a string.

In [6]:
dataset.loc[:,'Text'] = dataset.loc[:,'Text'].apply(str)

In [7]:
ps = PorterStemmer()
stop_words = set(stopwords.words('english')) # build the stop-word set once instead of on every iteration
corpus = [] # will hold one cleaned string per row

for i in range(0,len(dataset)):
    review = re.sub('[^a-zA-Z]',' ',dataset['Text'][i]) # replace every non-letter character (digits, punctuation) with a space
    review = review.lower() # convert the string to lowercase
    review = review.split() # split the sentence into a list of words
    # drop the stop words and stem the words that remain
    review = [ps.stem(word) for word in review if word not in stop_words]
    review = ' '.join(review)
    corpus.append(review)
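
To see what each step of the loop does, it can help to trace a single row by hand. The sketch below follows row 3 of the dataset ("MNTA Over 12.00"). It leaves out stemming and uses a tiny made-up stop-word set (`toy_stopwords`) in place of NLTK's English list, so it runs without any downloads.

```python
import re

# Hypothetical mini stop-word set; the notebook uses NLTK's full English list.
toy_stopwords = {"over", "to", "be", "they", "are"}

raw = "MNTA Over 12.00"
letters_only = re.sub('[^a-zA-Z]', ' ', raw)  # digits and punctuation become spaces
lowered = letters_only.lower()
tokens = lowered.split()                      # split on runs of whitespace
kept = [w for w in tokens if w not in toy_stopwords]
cleaned = ' '.join(kept)

print(tokens)
print(cleaned)
```

The price "12.00" disappears in the first step and "over" is dropped as a stop word, which leaves only the ticker.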
    

In [8]:
corpus

['kicker watchlist xide tit soq pnk cpw bpz aj trade method method see prev post',
 'user aap movi return fea geed indic trade year awesom',
 'user afraid short amzn look like near monopoli ebook infrastructur servic',
 'mnta',
 'oi',
 'pgnx',
 'aap user current downtrend break otherwis short term correct med term downtrend',
 'monday rel weak nyx win tie tap ice int bmc aon c chk biib',
 'goog ower trend line channel test volum support',
 'aap watch tomorrow ong entri',
 'assum fcx open tomorrow trigger buy still much like setup',
 'realli worri everyon expect market ralli usual exact opposit happen everi time shall see soon bac spx jpm',
 'aap gamco arri haverti appl extrem cheap great video',
 'user maykiljil post agre msft go higher possibl north',
 'momentum come back etfc broke resist solid volum friday ong set',
 'ha hit mean resum target level',
 'user gameplan shot today like trend break may c h break oc weekli trend break back juli',
 'fcx gap well ideal entri look pull least

In [9]:
len(corpus)

2000

## Creating bag of words model

A bag-of-words model simplifies each review to the words it contains and ignores their order.

We collect every unique word in the corpus and create one column per word. In the resulting table, each row corresponds to a review and each column corresponds to one of the distinct words found in the corpus.

The result is a sparse matrix: each review contains only a few of the words in the whole vocabulary, so most cells in the matrix are zero after tokenization.

Tokenization - the process of splitting each review into its words and creating one column for each distinct word.
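
A small hand-built version of this table shows where the zeros come from. The sketch below uses three of the cleaned entries from the corpus and plain Python lists. It only illustrates the idea and is not how `CountVectorizer` is implemented.

```python
docs = ['mnta', 'oi', 'aap watch tomorrow ong entri']

# One column per unique word, in sorted order.
vocab = sorted({word for doc in docs for word in doc.split()})
column = {word: j for j, word in enumerate(vocab)}

# One row per document; each cell counts how often that word occurs in it.
matrix = [[0] * len(vocab) for _ in docs]
for i, doc in enumerate(docs):
    for word in doc.split():
        matrix[i][column[word]] += 1

zeros = sum(row.count(0) for row in matrix)
print(vocab)
for row in matrix:
    print(row)
print(f"{zeros} of {len(docs) * len(vocab)} cells are zero")
```

Even with three short documents, two thirds of the cells are zero. With 2000 rows and thousands of columns the share of zeros is far higher.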

In [10]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()
x = cv.fit_transform(corpus).toarray()
y = dataset.loc[:,'Sentiment'].values

In [11]:
x.shape, y.shape

((2000, 3292), (2000,))

In [12]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size = 0.2,random_state = 42)

In [13]:
# Naive bayes model
from sklearn.naive_bayes import GaussianNB
c1 = GaussianNB()
c1.fit(x_train,y_train)
y1_pred = c1.predict(x_test)
y1_pred

array([1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1,
       1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0,
       1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1,
       1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0,
       1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1,
       1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0,
       1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1,
       0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1,
       1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1,
       0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1,
       0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1,
       1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0,

In [14]:
from sklearn.metrics import confusion_matrix
cm1 = confusion_matrix(y_test,y1_pred)
cm1

array([[104,  32],
       [114, 150]])

In [15]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test,y1_pred)

0.635

63.5% accuracy using the Naive Bayes classifier
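
The accuracy can also be read straight off the confusion matrix. The sketch below copies the matrix printed above and assumes scikit-learn's convention that rows are true labels and columns are predicted labels. The variable names are only for illustration.

```python
# Confusion matrix printed above for GaussianNB.
cm1 = [[104, 32],
       [114, 150]]

tn, fp = cm1[0]   # true class 0: predicted 0, predicted 1
fn, tp = cm1[1]   # true class 1: predicted 0, predicted 1
total = tn + fp + fn + tp

accuracy = (tn + tp) / total
recall_class0 = tn / (tn + fp)   # share of true 0s predicted as 0
recall_class1 = tp / (fn + tp)   # share of true 1s predicted as 1

print(round(accuracy, 3))
print(round(recall_class0, 3), round(recall_class1, 3))
```

The second line of output shows that most of the errors are class 1 samples predicted as class 0 (114 of 264).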

In [16]:
# SVM with linear kernel

from sklearn.svm import SVC
c2 = SVC(kernel = 'linear')
c2.fit(x_train,y_train)
y2_pred = c2.predict(x_test)
y2_pred

array([1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1,
       1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0,
       1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1,
       0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1,
       1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1,
       1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1,
       1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0,
       1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1,
       1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0,
       1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1,
       0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0,
       1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0,
       1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0,

In [17]:
cm2 = confusion_matrix(y_test,y2_pred)
cm2

array([[ 93,  43],
       [ 31, 233]])

In [18]:
accuracy_score(y_test,y2_pred)

0.815

81.5% accuracy by SVM classifier

In [19]:
# Decision tree model

from sklearn.tree import DecisionTreeClassifier
c3 = DecisionTreeClassifier()
c3.fit(x_train,y_train)
y3_pred = c3.predict(x_test)
y3_pred

array([1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0,
       1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0,
       1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1,
       0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1,
       1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1,
       0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1,
       1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0,
       0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1,
       1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1,
       0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1,
       1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1,
       1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0,
       1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1,

In [20]:
cm3 = confusion_matrix(y_test,y3_pred)
cm3

array([[ 87,  49],
       [ 33, 231]])

In [21]:
accuracy_score(y_test,y3_pred)

0.795

79.5% accuracy by Decision tree classifier

In [22]:
# Random forest 

from sklearn.ensemble import RandomForestClassifier
c4 = RandomForestClassifier()
c4.fit(x_train,y_train)
y4_pred = c4.predict(x_test)
y4_pred

array([1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1,
       1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1,
       1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1,
       1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1,
       1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1,
       1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0,
       1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1,
       0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1,
       0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1,
       1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1,

In [23]:
cm4 = confusion_matrix(y_test,y4_pred)
cm4

array([[ 81,  55],
       [ 15, 249]])

In [24]:
accuracy_score(y_test,y4_pred)

0.825

82.5% accuracy using the Random forest classifier
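
The four confusion matrices printed so far can be compared side by side. The sketch below copies them from the outputs above and recomputes each accuracy. The decision tree and random forest were created without a `random_state`, so a rerun of the notebook may give slightly different numbers.

```python
# Confusion matrices copied from the notebook outputs (rows: true, columns: predicted).
matrices = {
    "Naive Bayes":   [[104, 32], [114, 150]],
    "Linear SVM":    [[93, 43], [31, 233]],
    "Decision tree": [[87, 49], [33, 231]],
    "Random forest": [[81, 55], [15, 249]],
}

accuracies = {}
for name, ((tn, fp), (fn, tp)) in matrices.items():
    accuracies[name] = (tn + tp) / (tn + fp + fn + tp)
    print(f"{name:<14} accuracy={accuracies[name]:.3f}  "
          f"false positives={fp}  false negatives={fn}")

best = max(accuracies, key=accuracies.get)
print("highest accuracy:", best)
```

Random forest has the highest accuracy and the fewest false negatives, but also the most false positives of the four models.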

## Using LSTM

In [25]:
import keras

Using TensorFlow backend.


In [26]:
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.preprocessing.text import one_hot
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Embedding
from keras.layers import Dropout
from keras.layers import Bidirectional

The vocabulary size (voc_size) is set to 5000. The one-hot representation needs a fixed dictionary size, because every word is assigned an index within that range.

In [27]:
voc_size = 5000

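The one-hot representation assigns each word in the corpus an integer index that is smaller than the dictionary size (voc_size). Despite its name, one_hot returns integer indices rather than one-hot vectors. It obtains them by hashing each word, so two different words can occasionally share the same index.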
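
The idea of turning words into indices by hashing can be sketched in a few lines. The function `toy_one_hot` below is hypothetical and uses MD5 only to get a stable hash. It is not Keras's implementation, and its indices will not match the ones the notebook prints.

```python
import hashlib

def toy_one_hot(text, n):
    """Map each whitespace-separated word to an integer in 1..n-1 by hashing."""
    indices = []
    for word in text.split():
        digest = hashlib.md5(word.encode("utf-8")).hexdigest()
        # Modulo (n - 1) plus 1 keeps index 0 unused, so 0 can serve as padding.
        indices.append(int(digest, 16) % (n - 1) + 1)
    return indices

voc_size = 5000
encoded = toy_one_hot("kicker watchlist trade method method", voc_size)
print(encoded)
```

The same word always gets the same index (the two occurrences of "method" match), while unrelated words may collide when the vocabulary is larger than `voc_size`.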

In [28]:
onehot_repr = [one_hot(words,voc_size) for words in corpus]
onehot_repr

[[1406,
  374,
  2318,
  2528,
  3819,
  706,
  4936,
  3629,
  1739,
  3033,
  1082,
  1082,
  2250,
  2797,
  3960],
 [1564, 3282, 865, 1263, 3741, 3391, 1715, 3033, 1752, 4153],
 [1564, 307, 104, 1799, 2642, 1552, 135, 4601, 997, 414, 232],
 [4253],
 [4442],
 [3026],
 [3282, 1564, 1391, 2423, 4793, 1647, 104, 644, 553, 802, 644, 2423],
 [3294,
  2742,
  1416,
  1259,
  3569,
  2513,
  3077,
  3229,
  2411,
  3911,
  4450,
  3156,
  16,
  788],
 [3023, 4315, 4227, 4508, 3154, 1211, 4999, 306],
 [3282, 3829, 4634, 2832, 389],
 [1641, 2540, 3662, 4634, 4900, 3277, 2329, 35, 1552, 3585],
 [4201,
  1331,
  3859,
  3307,
  4475,
  2274,
  471,
  1390,
  3278,
  2409,
  770,
  84,
  2010,
  2250,
  433,
  2789,
  2017,
  1756],
 [3282, 1683, 2654, 417, 1733, 3812, 2447, 1289, 888],
 [1564, 2204, 3960, 3816, 1419, 2726, 4156, 4253, 3427],
 [4463, 4463, 985, 4699, 2954, 3649, 501, 4999, 2441, 2832, 2267],
 [2704, 1053, 4589, 2859, 4733, 3972],
 [1564,
  2399,
  3039,
  1346,
  1552,
  4227,


## Embedding Representation

If a sentence in the corpus has a length of less than 20 (i.e. sen_len), zeros are added to bring its length up to sen_len, either at the beginning or at the end of the sentence.

This process of adding zeros is called padding, and it is of two types:
- Pre-padding - when zeros are added at the beginning of the sentence.
- Post-padding - when zeros are added at the end of the sentence.
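
The two padding types can be sketched with plain lists. The helper `pad` below is a hypothetical stand-in and not the Keras function. It assumes that sequences longer than the target length lose items from the front, which is my understanding of the default `pad_sequences` behaviour. A target length of 8 is used instead of 20 to keep the output short.

```python
def pad(seq, maxlen, padding="pre"):
    """Zero-pad a list of ints to maxlen; keep only the last maxlen items if it is too long."""
    seq = seq[-maxlen:]                    # truncate from the front when needed
    zeros = [0] * (maxlen - len(seq))
    return zeros + seq if padding == "pre" else seq + zeros

# Encoding of 'aap watch tomorrow ong entri' from the one-hot output above.
row = [3282, 3829, 4634, 2832, 389]

print(pad(row, 8, "pre"))
print(pad(row, 8, "post"))
print(pad(row, 3, "pre"))
```

Pre-padding keeps the real tokens at the end of the sequence, so they are the last inputs an LSTM reads.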

In [29]:
sen_len = 20
embedded_docs = pad_sequences(onehot_repr,padding = 'pre', maxlen = sen_len)

In [30]:
x_final = np.array(embedded_docs)
x_final.shape

(2000, 20)

Creating LSTM model

In [32]:
embedding_vector_feature = 40
c5 = Sequential()
c5.add(Embedding(voc_size,embedding_vector_feature,input_length = sen_len))
c5.add(LSTM(128))
c5.add(Dense(1,activation = 'sigmoid'))
c5.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])
print(c5.summary())

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 20, 40)            200000    
_________________________________________________________________
lstm_1 (LSTM)                (None, 128)               86528     
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 129       
Total params: 286,657
Trainable params: 286,657
Non-trainable params: 0
_________________________________________________________________
None
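
The parameter counts in the summary can be checked by hand. The sketch below recomputes them from the layer sizes used above, assuming the standard LSTM layout of four weight sets (three gates plus the candidate cell state), each with input weights, recurrent weights, and a bias.

```python
voc_size = 5000
embedding_dim = 40     # embedding_vector_feature in the notebook
lstm_units = 128

# Embedding: one vector of length embedding_dim per vocabulary index.
embedding_params = voc_size * embedding_dim

# LSTM: 4 weight sets, each with (inputs + recurrent units) weights
# and one bias for every unit.
lstm_params = 4 * ((embedding_dim + lstm_units) * lstm_units + lstm_units)

# Dense: one weight per LSTM unit plus a single bias.
dense_params = lstm_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```

The results match the `Param #` column and the total of 286,657 in the summary.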


In [37]:
X_train,X_test,Y_train,Y_test = train_test_split(x_final,y,test_size = 0.3,random_state = 42)

In [38]:
c5.fit(X_train,Y_train,validation_data = (X_test,Y_test),batch_size = 64, epochs = 10)

Train on 1400 samples, validate on 600 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.callbacks.History at 0x750e290d30>

The LSTM model reaches an accuracy of 82.6% on the test split.