# <span style="color:red">Twitter Sentiment Extractor</span>

> This is an **extractive question-answering problem** where `text` (the tweet) is the context, `sentiment` is the question, and `selected_text` is the answer.
>
> In this notebook, I will **create different feature sets** using different word representations such as bag-of-words, tf-idf, n-grams, and GloVe embeddings. Then I will try **a recurrent neural network built from LSTM cells** on these feature sets.

In [17]:
# basic imports
import pandas as pd
import numpy as np

import tensorflow as tf
from tensorflow.keras.layers import LSTM, GRU, Dense, Input, Embedding
from tensorflow.keras.models import Model

In [2]:
df = pd.read_csv("data/train.csv")
df.head(2)

Unnamed: 0,textID,text,selected_text,sentiment
0,cb774db0d1,"I`d have responded, if I were going","I`d have responded, if I were going",neutral
1,549e992a42,Sooo SAD I will miss you here in San Diego!!!,Sooo SAD,negative


In [3]:
df.shape

(27481, 4)

In [14]:
df.fillna("",inplace=True)

# 1. Preparing the data / Preprocessing.

> The **text (context) field** and the **sentiment (question) field** are straightforward to use as model inputs. The **selected_text (answer) field** is a bit trickier to use as the output.
>
> We need to generate labels for the answer. The **labels are the start and end token positions** of the answer inside the context: **the index of the token where the answer starts and the index of the token where the answer ends.**
>
> And the model will be tasked to **predict one start and end logit per token** in the input context.
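
To make the label format concrete, here is a minimal sketch that assumes plain whitespace tokenization. The helpers `token_span` and `decode_span` are hypothetical names used only for illustration, and the logits are made up; a trained model would supply them.

```python
def token_span(context, answer):
    """Return (start, end) word indices of `answer` inside `context`, or None."""
    ctx_tokens = context.split()
    ans_tokens = answer.split()
    n = len(ans_tokens)
    if n == 0:
        return None
    for start in range(len(ctx_tokens) - n + 1):
        if ctx_tokens[start:start + n] == ans_tokens:
            return start, start + n - 1  # end index is inclusive
    return None


def decode_span(start_logits, end_logits):
    """Pick the highest-scoring (start, end) pair with start <= end."""
    best, best_score = (0, 0), float("-inf")
    for s, s_logit in enumerate(start_logits):
        for e in range(s, len(end_logits)):
            score = s_logit + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best


context = "Sooo SAD I will miss you here in San Diego!!!"
labels = token_span(context, "Sooo SAD")  # training labels for this row

# Made-up logits, one start/end pair per token (10 tokens in this tweet).
start_logits = [2.0, 0.5, 0.1, 0.0, 0.3, 0.0, 0.0, 0.0, 0.0, 0.0]
end_logits = [0.2, 3.0, 0.1, 0.0, 0.4, 0.0, 0.0, 0.0, 0.0, 0.0]
start, end = decode_span(start_logits, end_logits)
predicted = " ".join(context.split()[start:end + 1])
print(labels, (start, end), predicted)
```

The exhaustive search over pairs is quadratic in the number of tokens, which is acceptable for tweet-length inputs.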

## 1.1 Train-Validation Split.

In [121]:
from sklearn.model_selection import train_test_split



X_train, X_val = train_test_split(df,test_size=0.1,random_state=42)

print("Shape of the train data: ", X_train.shape)
print("Shape of the validation data: ", X_val.shape)

Shape of the train data:  (24732, 4)
Shape of the validation data:  (2749, 4)


In [122]:
X_train.head(2)

Unnamed: 0,textID,text,selected_text,sentiment
14619,ddc3017ca5,WTF facebook just cleared out my whole survey ...,WTF facebook just cleared out my whole survey ...,positive
25779,3e6cc1a2d8,Back from LAAANDAN. Miss it already check o...,Miss it already,negative


In [123]:
X_val.head(2)

Unnamed: 0,textID,text,selected_text,sentiment
1588,a7f72a928a,WOOOOOOOOOO are you coming to Nottingham at...,t? lovelovelove,positive
23879,ef42dee96c,resting had a whole day of walking,resting had a whole day of walking,neutral


In [124]:
## saving data.
X_train.to_csv("data/X_train.csv",index=False)
X_val.to_csv("data/X_val.csv", index = False)


## 1.2 Cleaning Data.

In [207]:
import re
import string

def clean_text(text):
    '''Make text lowercase, remove text in square brackets, links, HTML tags,
    punctuation, newlines and words containing numbers, then collapse extra spaces.'''
    text = text.lower()
    # raw strings keep the regex escapes from being read as string escapes
    text = re.sub(r'\[.*?\]', '', text)                                # [bracketed text]
    text = re.sub(r'https?://\S+|www\.\S+', '', text)                  # links
    text = re.sub(r'<.*?>+', '', text)                                 # HTML tags
    text = re.sub(r'[%s]' % re.escape(string.punctuation), '', text)   # punctuation
    text = re.sub(r'\n', '', text)                                     # newlines
    text = re.sub(r'\w*\d\w*', '', text)                               # words containing digits
    text = re.sub(r' +', ' ', text)                                    # repeated spaces

    return text.strip()

In [208]:
# clean_text only needs one column, so apply it to the Series directly
X_train["cleaned_text"] = X_train["text"].apply(clean_text)
X_val["cleaned_text"] = X_val["text"].apply(clean_text)

X_train["cleaned_selected_text"] = X_train["selected_text"].apply(clean_text)
X_val["cleaned_selected_text"] = X_val["selected_text"].apply(clean_text)
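
Because the context and the answer are cleaned independently, the cleaned answer is not guaranteed to remain a substring of the cleaned context, for example when the answer starts in the middle of a word that the digit rule removes. The sketch below illustrates this with only that one rule. `drop_words_with_digits` and `find_unmatched` are hypothetical helpers, and the second row is an invented example, not a row from the dataset.

```python
import re


def drop_words_with_digits(text):
    # Same rule as in clean_text: remove any word containing a digit,
    # then collapse repeated spaces.
    text = re.sub(r"\w*\d\w*", "", text)
    return re.sub(r" +", " ", text).strip()


def find_unmatched(pairs):
    """Return indices of (context, answer) pairs where the answer is not a substring."""
    return [i for i, (context, answer) in enumerate(pairs) if answer not in context]


raw_pairs = [
    ("Sooo SAD I will miss you", "Sooo SAD"),
    ("party at 5pm tonight", "pm tonight"),  # invented row: answer starts mid-word
]
cleaned = [(drop_words_with_digits(t), drop_words_with_digits(s)) for t, s in raw_pairs]
unmatched = find_unmatched(cleaned)
print(cleaned[1], unmatched)
```

Running a check like this on the real `cleaned_text` / `cleaned_selected_text` columns would show how many rows need special handling before labels are generated.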


In [209]:
X_train.head(2)

Unnamed: 0,textID,text,selected_text,sentiment,cleaned_text,cleaned_selected_text
14619,ddc3017ca5,WTF facebook just cleared out my whole survey ...,WTF facebook just cleared out my whole survey ...,positive,wtf facebook just cleared out my whole survey ...,wtf facebook just cleared out my whole survey ...
25779,3e6cc1a2d8,Back from LAAANDAN. Miss it already check o...,Miss it already,negative,back from laaandan miss it already check out m...,miss it already


In [210]:
X_val.head(2)

Unnamed: 0,textID,text,selected_text,sentiment,cleaned_text,cleaned_selected_text
1588,a7f72a928a,WOOOOOOOOOO are you coming to Nottingham at...,t? lovelovelove,positive,woooooooooo are you coming to nottingham at an...,t lovelovelove
23879,ef42dee96c,resting had a whole day of walking,resting had a whole day of walking,neutral,resting had a whole day of walking,resting had a whole day of walking


## 1.3 Generating Labels: start and end position

In [211]:
"a" == "a"

True

In [212]:
print(X_train["cleaned_text"].iloc[1][0])

b


In [213]:
len(X_train["cleaned_selected_text"].iloc[0])

121

In [202]:
X_train["cleaned_text"].iloc[0]

'wtf facebook just cleared out my whole survey and i was on the last q this night gets better and better  what else is next'

In [203]:
X_train["cleaned_selected_text"].iloc[0]

'wtf facebook just cleared out my whole survey and i was on the last q this night gets better and better  what else is next'

In [204]:
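start_positions = []
end_positions = []

offset_map = []
for idx in range(X_train.shape[0]):
    text = X_train["cleaned_text"].iloc[idx]
    selected_text = X_train["cleaned_selected_text"].iloc[idx]

    ## creating offset map: a (start_char, end_char) tuple per word, end exclusive.
    offset = []
    cursor = 0
    for word in text.split():
        start_char = text.index(word, cursor)
        end_char = start_char + len(word)
        offset.append((start_char, end_char))
        cursor = end_char
    offset_map.append(offset)

    ## character span of the answer inside the context.
    start_char = text.find(selected_text)
    if start_char == -1 or not selected_text or not offset:
        ## assumption: if the answer cannot be located, label the whole tweet.
        start_positions.append(0)
        end_positions.append(max(len(offset) - 1, 0))
        continue
    end_char = start_char + len(selected_text)

    ## labels: index of the first and last word overlapping the answer span (inclusive).
    overlapping = [i for i, (s, e) in enumerate(offset) if s < end_char and e > start_char]
    start_positions.append(overlapping[0])
    end_positions.append(overlapping[-1])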


## 1.4 Tokenization

# 2. Creating the Recurrent Neural Network using LSTM cell. 

# 3. Training the RNN.

# 4. Evaluation using Jaccard Similarity metric.
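
As a reference point for this section, here is a minimal sketch of a word-level Jaccard score, assuming the metric compares the sets of lowercased, whitespace-separated words in the predicted and true `selected_text`. The function name and the handling of two empty strings are assumptions, not taken from the notebook.

```python
def jaccard(predicted, actual):
    """Word-level Jaccard similarity between two strings."""
    a = set(predicted.lower().split())
    b = set(actual.lower().split())
    if not a and not b:
        return 1.0  # assumption: two empty strings count as a perfect match
    overlap = a & b
    # |A ∩ B| / |A ∪ B|, with the union size written as |A| + |B| - |A ∩ B|
    return len(overlap) / (len(a) + len(b) - len(overlap))


print(jaccard("Sooo SAD", "Sooo SAD"))
print(jaccard("miss it already", "miss it"))
print(jaccard("good", "bad"))
```

Averaging this score over all validation rows gives a single number for comparing models.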

# 5. Result(s)

# 6. Choosing the best model to predict on test set. 