
# Project 6: Analyzing Stock Sentiment from Twits

## Instructions

Each problem consists of a function to implement and instructions on how to implement the function. The parts of the function that need to be implemented are marked with a # TODO comment.Your code will be checked for the correct solution when you submit it to Udacity.


## Packages
When you implement the functions, you'll only need to you use the packages you've used in the classroom, like Torch, NLTK. These packages will be imported for you. We recommend you don't add any import statements, otherwise the grader might not be able to run your code.


In [1]:
import json
import re
import nltk
import torch
import random

## Introduction


You've been considering using sentiment around specific stocks in your models. You've been subscribed to StockTwits for a while to stay up to date on trading news. You can collect these twits (similar to tweets) and now you want to build a model that can predict the sentiment of the text in the twits. Many of the existing models perform feature extraction manually, basically assigning sentiment scores to individual words by hand. Instead you'd like to train a neural network to learn the features itself then have it predict sentiment. This means you'll need labeled data.

You collected a bunch of twits, then hand labeled the sentiment of each with the help of some interns. You wanted to capture the degree of sentiment so you decided to use a five-point scale: very negative, negative, neutral, positive, very positive. Each tweet is labeled -2 to 2 in steps of 1, from very negative to very positive. 

Here then, you'll build a sentiment analysis model that will learn to assign sentiment to tweets on its own, using this labeled data.



Load in the twits. This is a JSON object with structure like so:

```
{'data':
  {'message_body': 'Tweet body text here',
   'sentiment': 0},
  {'message_body': 'Happy tweet body text here',
   'sentiment': 1},
   ...
}
```

## Import Twits 

In [2]:
with open('twits.json', 'r') as f:
    twits = json.load(f)

Fields in our individual tweets:

* `'message_body'`: The actual text in the tweet
* `'sentiment'`: Score on the sentiment of the tweet, ranges from -2 to 2 in steps of 1, with 0 being neutral

Remember that we want our network to look at some text and predict the sentiment. Our training input will be the message bodies, and we can use the sentiment score as training label for our data.

### View Data 

To see what tweets look like, let's load 10 tweets from the list. 


In [3]:
data = twits['data']

In [4]:
"""loading 10 twits from the list"""
# TODO Implement

for i in range(len(data)):
    print (data[i])
    if ( i >= 9):
        break;

{'message_body': 'RT @google Our annual look at the year in Google blogging (and beyond) http://t.co/sptHOAh8 $GOOG', 'sentiment': 0, 'timestamp': '2012-01-01T00:06:01Z'}
{'message_body': '$GOOG http://stks.co/1jQs Many market leaders appear extended. Are these moves sustainable? I think not.', 'sentiment': -1, 'timestamp': '2012-01-01T00:18:17Z'}
{'message_body': '"Deconstructing A Trade: $AAPL 12/29/2011"-New Blog Post.  Yeah it\'s New Year\'s Eve but I\'m married with kids http://t.co/6VV31tBY $STUDY', 'sentiment': 0, 'timestamp': '2012-01-01T00:26:18Z'}
{'message_body': 'My prediction for 2012 is that the $spx and $djia ($spy and $dia) will make all time highs. And $aapl right along with them.', 'sentiment': 2, 'timestamp': '2012-01-01T00:30:36Z'}
{'message_body': 'RT @bclund &quot;Deconstructing A Trade: $AAPL 12/29/2011&quot;New Blog Post. Yeah it&#39;s NY&#39;s Eve but I&#39;m married with kids http://stks.co/1jQy $STUDY', 'sentiment': 0, 'timestamp': '2012-01-01T00:34:52Z'}
{'m

### Length of Data 
Now let's look at the number of twits in dataset. 

In [5]:
"""print out the number of twits"""
# TODO Implement 
len(data)

3000000

### Load data

If you get it right, we have 3 million tweets over all. For development purposes, let's only use the first 2 million tweets.

In [6]:
"""print out the first 2 million tweets"""
# TODO Implement 
tweets2million = data[0:2000000] 
len(tweets2million)

2000000

In [7]:
# Just run this cell and do not change anything

# This is the data we'll train & test on
messages = [twit['message_body'] for twit in data]
# Adding 2 here to scale the sentiments to 0 to 4 for use in our network
sentiments = [twit['sentiment'] + 2 for twit in data]


## Data Pre-Processing


With our data in hand we need to preprocess our text. These tweets are collected by filtering on ticker symbols where these are denoted with a leader $ symbol in the tweet itself. For example,

`{'message_body': 'RT @google Our annual look at the year in Google blogging (and beyond) http://t.co/sptHOAh8 $GOOG',
 'sentiment': 0}`


The ticker symbols don't provide information on the sentiment, and they are in every tweet, so we should remove them. This tweet also has the `@google` username, again not providing sentiment information, so we should also remove it. And, we see a URL `http://t.co/sptHOAh8`, let's remove these too.



The easiest way to remove specific words or phrases is with regex, the `re` module. You can sub out specific patterns with a space:

```python
re.sub(pattern, ' ', text)
```
This will substitute a space with anywhere the pattern matches in the text. Later when we tokenize the text, we'll split appropriately on those spaces.


### Load Data 

In [8]:
# Just run this cell and do not change anything

nltk.download('wordnet')
wnl = nltk.stem.WordNetLemmatizer()
train_on_gpu = False

[nltk_data] Downloading package wordnet to /Users/tkmal0o/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


### Pre-Processing

In [9]:
""" *This function takes a string as input, then performs these operations: 
    * lowercase
    * remove URLs
    * remove ticker symbols 
    * removes punctuation
    * tokenize by spliting the string on whitespace 
    * removes any single character tokens
"""

def preprocess(message):
    #TODO: Implement 
    
    # Lowercase 
    
    text = message.lower()
    
    # Match and remove URLs
    text = re.sub(r'http\S+', '', text, flags=re.MULTILINE)
    # Match and remove ticker symbols that start with $
    text = re.sub(r'\$\w+', '', text, flags=re.MULTILINE) 
    
    # Match and remove twitter usernames that start with @
    text = re.sub(r'@[a-z]*', '', text) 

    # Replace punctuation and numbers (anything not a letter) with spaces
    text = re.sub(r'[^\w\s]','',text)

    # Tokenize by splitting the string on whitespace
    tokens = text.split()

    # Lemmatize and remove any tokens with only one character
    tokens = [wnl.lemmatize(token) for token in tokens ] 
    return tokens

### Preprocess All the Twits 
Now we can preprocess each of the twits in our dataset. This will take a while since we have millions of twits.

In [27]:
# TODO Implement 
i = 0
tokenized = []
for message in messages:
    tokenized.append(preprocess(message))
    if i > 1000000:
        break
    i += 1
#print(tokenized)    

### Collect the Vocabulary

Now with all of our messages tokenized, we want to create a vocabulary and count up how often each word appears in our entire corpus.


In [28]:
from collections import Counter

### Bag of Word

In [29]:
# TODO: Implement 

"""
Create a vocabulary by using Bag of words
Use for loop to update your tokens

"""
## Build a dictionary that maps words to integers

bow = Counter()
for words in tokenized:
    for word in words:
        bow[word] += 1
vocab = sorted(bow, key=bow.get, reverse=True)
vocab_to_int = {word: ii for ii, word in enumerate(vocab, 1)}
#print (vocab_to_int)

### Remove Common and Rare Words

With our vocabulary, now we'll remove some of the most common words such as 'the', 'and', 'it', etc. These words don't contribute to identfying sentiment and are really common, resulting in a lot of noise in our input. If we can filter these out, then our network should have an easier time learning. It's up to you to decide how many to remove.

We also want to remove really rare words that show up in a only a few tweets. Here you'll want to divide the count of each word by the number of messages. Then remove words that only appear in some small fraction of the messages. Again, it's up to you how much you want to keep.

 ### Frequency of Words Appearing in Message

In [30]:
# TODO Implement 

"""

Parameters
----------
    freqs: frequency of words appearing in messages
    low_cutoff: here we assign frequency cutoff to 0.00005, this is probably better as a percenet of the low frequency words. 
    high_cutoff: set a number for high frequency words instead of a frequency threshold. 
    The distribution of word counts is peaked at the most frequent words with a really long tail, so it's usually better just cut off the first K words.
    K_most_common 

Return
------
print the K most common words in the vocab

THIS IS NOT very clear, what needs to be done????

"""
K_most_common = 10 #arbitrary number choosen??

# Frequency of words appearing in messages

freqs = bow
        
# Frequency cutoff 
low_cutoff = 0.00005
        
# K high frequency words 
high_cutoff = 10 # Since there is only 10-12 stopwords in English language

K_most_common_words = Counter(freqs).most_common(K_most_common) 

print(K_most_common_words)


[('the', 370317), ('to', 316779), ('a', 275359), ('is', 201971), ('in', 167707), ('on', 162790), ('it', 161728), ('and', 161387), ('for', 160739), ('of', 156851)]


### Filtering High and Low Frequency Words

In [31]:
### TODO Implement 

"""
Filter high and low frequency words.

Parameters
----------
Filtered words 

Return
------
Length of filtered words 

"""
filtered_words = freqs 
rare_word = len(filtered_words) * low_cutoff
filter_high = [key for key,cnt in K_most_common_words] 
for key, cnts in list(filtered_words.items()):
    if ( key in filter_high or cnts < rare_word):
        del filtered_words[key]
print(len(filtered_words))

19192


### Updaing Vocabulary by Removing Filtered Words

In [32]:
 #TODO Implement
"""Go through all the data and remove words that aren't in our vocab"""
#  Not clear what is filtered?
vocab = filtered_words

id2vocab = {word: ii for ii, word in enumerate(vocab, 1)}

filtered = list(filtered_words.elements())
#print(list(filtered.items()))

### Balancing the classes

Let's do a few last pre-processing steps. If we look at how our tweets are labeled, we'll find that 50% of them are neutral. This means that our network will be 50% accurate just by guessing 0 every single time. To help our network learn appropriately, we'll want to balance our classes.
That is, make sure each of our different sentiment scores show up roughly as frequently in the data.

What we can do here is go through each of our examples and randomly drop tweets with neutral sentiment. What should be the probability we drop these tweets if we want to get around 20% neutral tweets starting at 50% neutral? We should also take this opportunity to remove messages with length 0.

In [33]:
# Just run this cell and do not change anything 
n_neutral = sum(1 for each in sentiments if each == 2)
N_examples = len(sentiments)
keep_prob = (N_examples - n_neutral)/4/n_neutral

In [34]:
balanced = {'messages': [], 'sentiments':[]}

for idx, sentiment in enumerate(sentiments):
        
    message = filtered[idx]
    if len(message) == 0:
        # skip this message because it has length zero
        continue
    elif sentiment != 2 or random.random() < keep_prob:
        balanced['messages'].append(message)
        balanced['sentiments'].append(sentiment) 

If you did it correctly, you should see the following result 

In [35]:
# Just run this cell and do not change anything 
n_neutral = sum(1 for each in balanced['sentiments'] if each == 2)
N_examples = len(balanced['sentiments'])
n_neutral/N_examples

0.20021373590881839

Finally let's convert our tokens into integer ids which we can pass to the network.

In [36]:
token_ids = [[vocab[word] for word in message] for message in balanced['messages']]
sentiments = balanced['sentiments']

## Neural Network

Now we have our vocabulary which means we can transform our tokens into ids, which are then passed to our network. So, let's define the network now!

Here is a nice diagram showing the network we'd like to build: 
#### Embed -> RNN -> Dense -> Softmax


In [37]:
import torch
from torch import nn, optim
import torch.nn.functional as F

### Implement the text classifier 

Before we build text classifier, if you remember from the other network that you built in  "Sentiment Analysis with an RNN"  exercise  - which there, the network called " SentimentRNN", here we named it "TextClassifer" - consists of three main parts: 1) init function `__init__` 2) forward pass `forward`  3) hidden state `init_hidden`. 

This network is pretty similar to the network you built expect in the  `forward` pass, we use softmax instead of sigmoid. The reason we are not using sigmoid is that the output of NN is not a binary. In our network, sentiment scores have 5 possible outcomes. We are looking for an outcome with the highest probability thus softmax is a better choice.

In [38]:
# TODO Implement 
"""
 1. Define __init__  
 2. Define forward  
 3. Define init_hidden

"""
"""
Initialize the model by setting up the layers.

    Parameters
    ---------- 
    Use the following parameters as the arguments for __init__ function: 
    (self, vocab_size, embed_size, lstm_size, output_size, lstm_layers=1, dropout=0.1)
    - You can add more layers or change the dropout rate. 

"""

class TextClassifier(nn.Module):
     

#  __init__ function    

    def __init__(self, vocab_size, embed_size, lstm_size, output_size, lstm_layers=1, dropout=0.1):
#    def __init__(self, vocab_size, output_size, embed_size, lstm_size, lstm_layers=1, dropout=0.1):
        """
        Initialize the model by setting up the layers.
        """
        super(TextClassifier, self).__init__()

        self.output_size = output_size
        self.lstm_layers = lstm_layers
        self.lstm_size = lstm_size
        
        # embedding and LSTM layers
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, lstm_size, lstm_layers, 
                            dropout=dropout, batch_first=True)
        
        
        # dropout layer
        self.dropout = nn.Dropout(dropout)
        
        # linear and softmax layers
        self.fc = nn.Linear(lstm_size, output_size)
        self.softmax = nn.Softmax()
        
        

#Perform a forward pass of our model on some input and hidden state.

#Parameters
#----------
#      Use softmax instead of sigmoid

#Returns
#-------
#    return last softmax 'logps' output and hidden state 

    
 # forward pass 

    def forward(self, x, hidden):

        batch_size = x.size(0)
        # embeddings and lstm_out
        print('batch_size: ', batch_size)
        print(x.size())
        embeds = self.embedding(x)
        print('embed', embeds.size())
        
        lstm_out, hidden = self.lstm(embeds, hidden)
    
        # stack up lstm outputs
        lstm_out = lstm_out.contiguous().view(-1, self.lstm_size)
        
        # dropout and fully-connected layer
        out = self.dropout(lstm_out)
        out = self.fc(out)
        # softmax function
        softmax_out = self.softmax(out)
        
        # reshape to be batch_size first
        softmax_out = softmax_out.view(batch_size, -1)
        softmax_out = softmax_out[:, -1] # get last batch of labels
        
        # return last sigmoid output and hidden state
        return softmax_out, hidden
  
 
# hidden state 

 
#Initializes hidden state 

#    Parameters
#    ----------
#            - Create two new tensors with sizes n_layers x batch_size x hidden_dim,
#            - Initialized to zero, for hidden state and cell state of LSTM
#    Returns
#    -------
#        hidden 
    def init_hidden(self, batch_size):

        # Create two new tensors with sizes n_layers x batch_size x hidden_dim,
        # initialized to zero, for hidden state and cell state of LSTM
        weight = next(self.parameters()).data
        
        if (train_on_gpu):
            hidden = (weight.new(self.lstm_layers, batch_size, self.lstm_size).zero_().cuda(),
                  weight.new(self.lstm_layers, batch_size, self.lstm_size).zero_().cuda())
        else:
            hidden = (weight.new(self.lstm_layers, batch_size, self.lstm_size).zero_(),
                      weight.new(self.lstm_layers, batch_size, self.lstm_size).zero_())
        
        return hidden




### View Model


In [39]:
# Just run this cell and do not change anything

model = TextClassifier(len(vocab), 10, 6, 5, dropout=0.1, lstm_layers=2)
print ( 'model: ', model)
model.embedding.weight.data.uniform_(-1, 1)
input_ = torch.randint(0, 1000, (5, 5), dtype=torch.int64)
hidden = model.init_hidden(5)

logps, _ = model.forward(input_, hidden)
print(logps)

model:  TextClassifier(
  (embedding): Embedding(19192, 10)
  (lstm): LSTM(10, 6, num_layers=2, batch_first=True, dropout=0.1)
  (dropout): Dropout(p=0.1)
  (fc): Linear(in_features=6, out_features=5, bias=True)
  (softmax): Softmax()
)
batch_size:  5
torch.Size([5, 5])
embed torch.Size([5, 5, 10])
tensor([ 0.1914,  0.1861,  0.1884,  0.1826,  0.1888])




### DataLoaders and Batching - Optional 


Now we should build a generator that we can use to loop through our data. It'll be more efficient if we can pass our sequences in as a batch, that is some number of sequences all at the same time. Our input tensors should look like `(sequence_length, batch_size)`. So if our sequences are 40 tokens long and we pass in 25 sequences, then we'd have an input size of `(40, 25)`.

If we set our sequence length to 40, what do we do with messages that are more or less than 40 tokens? For messages with fewer than 40 tokens, we will pad the empty spots with zeros. We should be sure to **left** pad so that the RNN starts from nothing before going through the data. If the message has 20 tokens, then the first 20 spots of our 40 long sequence will be 0. If a message has more than 40 tokens, we'll just keep the first 40 tokens.

In [40]:
import numpy as np
def pad_features(msg_ints, seq_length=30):
    ''' Return features of review_ints, where each review is padded with 0's 
        or truncated to the input seq_length.
    '''
    
    # getting the correct rows x cols shape
    features = np.zeros((len(msg_ints), seq_length), dtype=int)

    # for each review, I grab that review and 
    for i, row in enumerate(msg_ints):
        features[i, -len(row):] = np.array(row)[:seq_length]
    
    return features   


In [128]:
# Optional: TODO implement 

""" 
Build a dataloader

    Parameters
    ----------
    define dataloader
    Use the following parameters as the arguments for dataloader function:
    (messages, labels, sequence_length=30, batch_size=32, shuffle=False)

    Returns
    -------
    batch, label_tensor       

"""

def dataloader(messages, labels, sequence_length=30, batch_size=32, shuffle=False):
    
    #TODO Implement 
   `yield batch, label_tensor

SyntaxError: invalid syntax (<ipython-input-128-c1cb35ef3a89>, line 21)

### Training, and  Validation
With our data in nice shape, we'll split it into training and validation sets.

In [41]:
# TODO Implement 

"""
split data into training and validation sets 

 Parameters
 ----------
 token_ids for valid_split

"""   
## split data into training, validation, and test data (features and labels, x and y)

split_frac = 0.8
sequence_length = 20
features = pad_features(token_ids,sequence_length)  
split_idx = int(len(features)*split_frac)

train_features, remaining_features = features[:split_idx], features[split_idx:]
train_labels, remaining_labels = sentiments[:split_idx], sentiments[split_idx:]

test_idx = int(len(remaining_features)*0.5)

valid_features, test_x = remaining_features[:test_idx], remaining_features[test_idx:]
valid_labels, test_y = remaining_labels[:test_idx], remaining_labels[test_idx:]

## print out the shapes of your resultant feature data
print("\t\t\tFeature Shapes:")
print("Train set: \t\t{}".format(train_features.shape), 
      "\nValidation set: \t{}".format(valid_features.shape),
      "\nTest set: \t\t{}".format(test_x.shape))

			Feature Shapes:
Train set: 		(1551821, 20) 
Validation set: 	(193978, 20) 
Test set: 		(193978, 20)


In [42]:
# Just run this cell and do not change anything
from torch.utils.data import TensorDataset, DataLoader

train_data = TensorDataset(torch.from_numpy(train_features), torch.from_numpy(np.asarray(train_labels)))
valid_data = TensorDataset(torch.from_numpy(valid_features), torch.from_numpy(np.asarray(valid_labels)))
test_data = TensorDataset(torch.from_numpy(test_x), torch.from_numpy(np.asarray(test_y)))




#text_batch, labels = next(iter(dataloader(train_features, train_labels, sequence_length=20, batch_size=64)))
train_loader = DataLoader(train_data, shuffle=True, batch_size=64)
dataiter = iter(train_loader)
text_batch, labels = dataiter.next()

print('Sample input size: ', text_batch.size()) # batch_size, seq_length
print('Sample input: \n', text_batch)
print()
print('Sample label size: ', labels.size()) # batch_size
print('Sample label: \n', labels)



Sample input size:  torch.Size([64, 20])
Sample input: 
 tensor([[ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,
          1.1080e+03,  1.1220e+03],
        [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  6.2780e+03,
          2.6542e+04,  7.2900e+02],
        [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  1.4930e+03,
          2.0560e+03,  1.0280e+03],
        ...,
        [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,
          0.0000e+00,  1.4560e+05],
        [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  1.2661e+04,
          2.7520e+03,  6.6490e+03],
        [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  2.0560e+03,
          4.1270e+03,  2.6700e+02]])

Sample label size:  torch.Size([64])
Sample label: 
 tensor([ 3,  3,  3,  4,  1,  3,  1,  3,  0,  1,  0,  3,  1,  4,
         3,  3,  4,  0,  0,  2,  0,  2,  0,  1,  2,  4,  0,  1,
         0,  2,  2,  4,  3,  3,  3,  4,  2,  0,  1,  1,  2,  4,
         1,  3,  4,  1,  0,  4,  0,  1,  2,  3,  4,  0,  3,  2,
 

In [43]:
model = TextClassifier(len(vocab)+1, 200, 128, 5, dropout=0.)
print(model)
print ( text_batch.size(), ' : ', (type(hidden)))
hidden = model.init_hidden(64)
logps, hidden = model.forward(text_batch, hidden)

TextClassifier(
  (embedding): Embedding(19193, 200)
  (lstm): LSTM(200, 128, batch_first=True)
  (dropout): Dropout(p=0.0)
  (fc): Linear(in_features=128, out_features=5, bias=True)
  (softmax): Softmax()
)
torch.Size([64, 20])  :  <class 'tuple'>
batch_size:  64
torch.Size([64, 20])


RuntimeError: index out of range at /Users/soumith/minicondabuild3/conda-bld/pytorch_1524590658547/work/aten/src/TH/generic/THTensorMath.c:343

## Training


In [44]:
# Just run this cell and do not change anything

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = TextClassifier(len(vocab)+1, 1024, 512, 5, lstm_layers=2, dropout=0.2)
model.embedding.weight.data.uniform_(-1, 1)
epochs = 5
criterion = nn.NLLLoss()
optimizer = optim.Adam(model.parameters(), lr=0.003)
print_every = 100
batch_size = 256

In [45]:
# TODO Implement 
"""Train your model with dropout, and monitor the training progress with the validation loss and accuracy"""
# loss and optimization functions
lr=0.001

criterion = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=lr)


In [46]:
epochs = 4 # 3-4 is approx where I noticed the validation loss stop decreasing

counter = 0
print_every = 100
clip=5 # gradient clipping

# move model to GPU, if available
if(train_on_gpu):
    model.cuda()

model.train()
# train for some number of epochs
for e in range(epochs):
    # initialize hidden state
    h = model.init_hidden(batch_size)

    # batch loop
    for inputs, labels in train_loader:
        counter += 1

        if(train_on_gpu):
            inputs, labels = inputs.cuda(), labels.cuda()

        # Creating new variables for the hidden state, otherwise
        # we'd backprop through the entire training history
        h = tuple([each.data for each in h])

        # zero accumulated gradients
        model.zero_grad()

        # get the output from the model
        output, h = model(inputs, h)

        # calculate the loss and perform backprop
        loss = criterion(output.squeeze(), labels.float())
        loss.backward()
        # `clip_grad_norm` helps prevent the exploding gradient problem in RNNs / LSTMs.
        nn.utils.clip_grad_norm_(model.parameters(), clip)
        optimizer.step()

        # loss stats
        if counter % print_every == 0:
            # Get validation loss
            val_h = model.init_hidden(batch_size)
            val_losses = []
            model.eval()
            for inputs, labels in valid_loader:

                # Creating new variables for the hidden state, otherwise
                # we'd backprop through the entire training history
                val_h = tuple([each.data for each in val_h])

                if(train_on_gpu):
                    inputs, labels = inputs.cuda(), labels.cuda()

                output, val_h = model(inputs, val_h)
                val_loss = criterion(output.squeeze(), labels.float())

                val_losses.append(val_loss.item())

            model.train()
            print("Epoch: {}/{}...".format(e+1, epochs),
                  "Step: {}...".format(counter),
                  "Loss: {:.6f}...".format(loss.item()),
                  "Val Loss: {:.6f}".format(np.mean(val_losses)))

batch_size:  64
torch.Size([64, 20])


RuntimeError: index out of range at /Users/soumith/minicondabuild3/conda-bld/pytorch_1524590658547/work/aten/src/TH/generic/THTensorMath.c:343

## Making Predictions

Okay, now that you have a trained model, try it on some new tweets and see if it works appropriately. Remember that for any new text, you'll need to preprocess it first before passing it to the network. You should also think about how to handle input words that aren't in your vocabulary.

We also want to use these sentiment scores in a larger ensemble model which you'll be learning about next. For this, we'll need the output to be some continuous value, typically on a scale of -3 to 3. Our model is predicting the probability on this discrete 5 value scale. Since we have a probability distribution, a good way to convert this to a continuous value is using the expectated value of the score. The expected value $\bar{s}$ is the sum of each score $s_i$ multiplied by the probability $p_i$ of getting that score.

$$
\large \bar{s} = \sum_i p_i s_i
$$

This is nice because it captures uncertainty in our model's predictions. For example, if it predicts 50% in positive ($s = 1$) and 50% in strongly positive ($s=2$), the expected value will be in between at 1.5.

### Prediction 

In [None]:
# TODO Implement 

""" 
    Prints out whether a twit is predicted to be 
    positive or negative in sentiment, using a trained model.

    Parameters
    ----------
    - define predict function 
    text 
    model 
    vocab 

    Returns
    -------
    expectation.item() 
"""
    
def predict(text, model, vocab): 
        
    # TODO Implement   
     tokens = preprocess(text)

    
    # Filter non-vocab words and convert to ids
    tokens = pass 
    if len(tokens) == 0:
        return None, None
        
    # Adding a batch dimension
    text_input = pass 
    hidden = pass 
    logps, = pass 
    ps = pass 

    # Sentiment expectation
    expectation = pass 
    
    return expectation.item()

In [None]:
#Just run this cell and do not change anything 
text = "Good earnings this year, I'm bullish on $goog"
model.to("cpu")
predict(text, model, vocab)

Now we have a trained model and we can make predictions. We can use this model to track the sentiments of various stocks by predicting the sentiments of twits as they are coming in. Now we have a stream of twits. For each of those twits, pull out the stocks mentioned in them and keep track of the sentiments. Remember that in the twits, ticker symbols are encoded with a dollar sign as the first character, all caps, and 2-4 letters, like $AAPL. Ideally, you'd want to track the sentiments of the stocks in your universe and use this as a signal in your larger model(s).


## Testing 

### Load the Data 

In [None]:
with open('test_twits.json', 'r') as f:
    test_data = json.load(f)

### Twit Stream

In [None]:
def twit_stream():
    for twit in test_data['data'][:1000]:
        yield twit

In [None]:
next(twit_stream())

You have voc, and you need create the steem of signals for the tweets 

In [None]:
# TODO Implement 

""" 
Given a stream of twits and a universe of tickers, return sentiment scores for tickers in the universe.

Parameters
----------
define score_twits 
Use the following arguments for your function:
(stream, model, vocab, universe):

Returns
-------
score_twits 
"""

def score_twits(stream, model, vocab, universe):
   
            for twit in stream:

        # Get the message_body of twits 
        text = pass
        # use re.findall method in re 
        symbols = pass 
        score = pass 

        for symbol in symbols:
            if symbol in universe:
            yield symbols, score, twit['timestamp']


In [None]:
# Just run this cell and do not change anything
universe = {'$BBRY', '$AAPL', '$AMZN', '$BABA', '$YHOO', '$LQMT', '$FB', '$GOOG', '$BBBY', '$JNUG', '$SBUX', '$MU'}
score_stream = score_twits(twit_stream(), model, vocab, universe)

In [None]:
# Just run this cell and do not change anything
next(score_streem)