### SST5 Dataset - Preprocessing

The SST5 dataset (the five-class version of the Stanford Sentiment Treebank) has ~12k sentences. Each sentence is further subdivided into phrases, giving 215,154 phrases in total, and each phrase is assigned a sentiment value between 0 and 1. These values are mapped to five classes using the intervals [0, 0.2], (0.2, 0.4], (0.4, 0.6], (0.6, 0.8], (0.8, 1.0] for very negative, negative, neutral, positive and very positive, respectively.

In [9]:
!mkdir -p ./dataset
!curl -O -J -L http://nlp.stanford.edu/~socherr/stanfordSentimentTreebank.zip
!mv stanfordSentimentTreebank.zip ./dataset/

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0   329    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 6223k  100 6223k    0     0  1217k      0  0:00:05  0:00:05 --:--:-- 1510k


In [16]:
!unzip ./dataset/stanfordSentimentTreebank.zip -d ./dataset/

Archive:  ./dataset/stanfordSentimentTreebank.zip
   creating: ./dataset/stanfordSentimentTreebank/
  inflating: ./dataset/stanfordSentimentTreebank/datasetSentences.txt  
   creating: ./dataset/__MACOSX/
   creating: ./dataset/__MACOSX/stanfordSentimentTreebank/
  inflating: ./dataset/__MACOSX/stanfordSentimentTreebank/._datasetSentences.txt  
  inflating: ./dataset/stanfordSentimentTreebank/datasetSplit.txt  
  inflating: ./dataset/__MACOSX/stanfordSentimentTreebank/._datasetSplit.txt  
  inflating: ./dataset/stanfordSentimentTreebank/dictionary.txt  
  inflating: ./dataset/__MACOSX/stanfordSentimentTreebank/._dictionary.txt  
  inflating: ./dataset/stanfordSentimentTreebank/original_rt_snippets.txt  
  inflating: ./dataset/__MACOSX/stanfordSentimentTreebank/._original_rt_snippets.txt  
  inflating: ./dataset/stanfordSentimentTreebank/README.txt  
  inflating: ./dataset/__MACOSX/stanfordSentimentTreebank/._README.txt  
  inflating: ./dataset/stanfordSentimentTreebank/sentiment_labels
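
The listing above shows that `unzip` also extracts `__MACOSX/` entries, which only hold macOS metadata files. As a rough sketch, the same extraction could be done from Python with the standard `zipfile` module while skipping those entries. The helper name `extract_clean` is hypothetical and not part of the original notebook.

```python
import zipfile
from pathlib import Path


def extract_clean(zip_path, dest_dir):
    """Extract a zip archive, skipping macOS metadata entries.

    Hypothetical helper for illustration; returns the extracted entry names.
    """
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            # Entries under "__MACOSX/" are Finder metadata, not dataset files.
            if name.startswith("__MACOSX/"):
                continue
            archive.extract(name, dest_dir)
            extracted.append(name)
    return extracted


if __name__ == "__main__":
    # Path taken from the download cell above; nothing happens if it is absent.
    zip_file = Path("dataset/stanfordSentimentTreebank.zip")
    if zip_file.exists():
        for name in extract_clean(zip_file, "dataset"):
            print(name)
```

With this approach the `dataset/` folder would contain only the `stanfordSentimentTreebank/` directory.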

#### Understanding the SST5 Dataset

The [README.txt](dataset/stanfordSentimentTreebank/README.txt) describes the files in the dataset. The ones relevant here are:
- [datasetSentences.txt](dataset/stanfordSentimentTreebank/datasetSentences.txt)
- [sentiment_labels.txt](dataset/stanfordSentimentTreebank/sentiment_labels.txt)
- [dictionary.txt](dataset/stanfordSentimentTreebank/dictionary.txt)
- [datasetSplit.txt](dataset/stanfordSentimentTreebank/datasetSplit.txt)

datasetSentences.txt contains 11,855 sentences. dictionary.txt maps each phrase to a phrase ID, and sentiment_labels.txt gives the sentiment value for each phrase ID. The dictionary contains every phrase of a given sentence and, as the example below shows, the whole sentence is itself a phrase. To label a sentence, we therefore look it up as a phrase in the dictionary, take the associated phrase ID, and look that ID up in sentiment_labels.txt to get its sentiment value, which lies between 0 and 1.

datasetSplit.txt is used later to divide the sentences into train and test sets, once each sentence has been mapped to its sentiment value.

**Example of a sentence and its constituent phrases, as given in dictionary.txt:**

A completely spooky piece of business|62875 \
A completely spooky piece of business that gets under your skin and , some plot blips aside|62876 \
A completely spooky piece of business that gets under your skin and , some plot blips aside ,|62877 \
A completely spooky piece of business that gets under your skin and , some plot blips aside , stays there for the duration .|62878

**Corresponding sentiment values for the phrases above, from sentiment_labels.txt:**

62875|0.48611\
62876|0.68056\
62877|0.70833\
62878|0.65278
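
The two-step lookup described above (sentence → phrase ID → sentiment value) can be sketched with plain dictionaries, using two of the example lines copied from the excerpts above. The function name `sentence_sentiment` is a hypothetical helper, and splitting on the last `|` is an assumption made to stay safe if a phrase itself ever contains that character.

```python
# Lines copied from the dictionary.txt and sentiment_labels.txt excerpts above.
dictionary_lines = [
    "A completely spooky piece of business|62875",
    "A completely spooky piece of business that gets under your skin and , "
    "some plot blips aside , stays there for the duration .|62878",
]
label_lines = ["62875|0.48611", "62878|0.65278"]

# phrase -> phrase ID; rsplit on the last "|" keeps the phrase text intact.
phrase_to_id = {}
for line in dictionary_lines:
    phrase, phrase_id = line.rsplit("|", 1)
    phrase_to_id[phrase] = int(phrase_id)

# phrase ID -> sentiment value
id_to_value = {}
for line in label_lines:
    phrase_id, value = line.split("|")
    id_to_value[int(phrase_id)] = float(value)


def sentence_sentiment(sentence):
    """Look the sentence up as a phrase, then fetch its sentiment value."""
    return id_to_value[phrase_to_id[sentence]]


sentence = (
    "A completely spooky piece of business that gets under your skin and , "
    "some plot blips aside , stays there for the duration ."
)
print(sentence_sentiment(sentence))  # 0.65278
```

The notebook performs the same lookup at scale with two pandas merges rather than dictionaries.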

**Given a sentiment value, determining the sentiment class** <br>
[0,0.2], (0.2, 0.4], (0.4, 0.6], (0.6, 0.8], (0.8, 1.0] : very negative, negative, neutral, positive, very positive

62878|0.65278 <br>
Sentiment Class for the sentence: <br>
*'A completely spooky piece of business that gets under your skin and , some plot blips aside , stays there for the duration .'* <br>
**POSITIVE**
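
As a minimal sketch of the interval rule above, a single value can be mapped to its class by searching the upper bounds of the five intervals. The function name `sentiment_class` is hypothetical; the class names follow the ones used later in this notebook.

```python
from bisect import bisect_left

# Upper bounds of [0, 0.2], (0.2, 0.4], (0.4, 0.6], (0.6, 0.8], (0.8, 1.0]
UPPER_BOUNDS = [0.2, 0.4, 0.6, 0.8, 1.0]
CLASS_NAMES = ["Very_Negative", "Negative", "Neutral", "Positive", "Very_Positive"]


def sentiment_class(value):
    """Return (class index, class name) for a sentiment value in [0, 1]."""
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"sentiment value out of range: {value!r}")
    # bisect_left finds the first bound >= value, so each interval is
    # closed on the right, e.g. 0.2 falls in the first class.
    index = bisect_left(UPPER_BOUNDS, value)
    return index, CLASS_NAMES[index]


print(sentiment_class(0.65278))  # (3, 'Positive')
```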


#### Loading the dataset for Preprocessing

In [7]:

import pandas as pd
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.expand_frame_repr', False)
pd.set_option('display.max_colwidth', None)

# Note the capital "S" in datasetSentences.txt: it matters on case-sensitive file systems.
# on_bad_lines replaces error_bad_lines, which was deprecated in pandas 1.3 and removed in 2.0.
df_sentences = pd.read_csv("dataset/stanfordSentimentTreebank/datasetSentences.txt", sep='\t', engine="python", on_bad_lines='warn')
display(df_sentences.head())
df_sentences['sentence'] = df_sentences['sentence'].astype(str)

# Skip the header line and name the columns explicitly
df_sentiments = pd.read_csv("dataset/stanfordSentimentTreebank/sentiment_labels.txt", sep='|', engine="python", on_bad_lines='warn', header=None, skiprows=1, names=["phrase_id", "sentiment_value"])
display(df_sentiments.head())
df_sentiments['sentiment_value'] = df_sentiments['sentiment_value'].astype(float)
df_sentiments['phrase_id'] = df_sentiments['phrase_id'].astype(int)

# dictionary.txt has no header line
df_phrases = pd.read_csv("dataset/stanfordSentimentTreebank/dictionary.txt", sep='|', engine="python", on_bad_lines='warn', header=None, names=["phrase", "phrase_id"])
# Show a full sentence together with some of its sub-phrases
display(df_phrases[12119:12123])
df_phrases['phrase'] = df_phrases['phrase'].astype(str)
df_phrases['phrase_id'] = df_phrases['phrase_id'].astype(int)


Unnamed: 0,sentence_index,sentence
0,1,"The Rock is destined to be the 21st Century 's new `` Conan '' and that he 's going to make a splash even greater than Arnold Schwarzenegger , Jean-Claud Van Damme or Steven Segal ."
1,2,The gorgeously elaborate continuation of `` The Lord of the Rings '' trilogy is so huge that a column of words can not adequately describe co-writer\/director Peter Jackson 's expanded vision of J.R.R. Tolkien 's Middle-earth .
2,3,Effective but too-tepid biopic
3,4,"If you sometimes like to go to the movies to have fun , Wasabi is a good place to start ."
4,5,"Emerges as something rare , an issue movie that 's so honest and keenly observed that it does n't feel like one ."


Unnamed: 0,phrase_id,sentiment_value
0,0,0.5
1,1,0.5
2,2,0.44444
3,3,0.5
4,4,0.42708


Unnamed: 0,phrase,phrase_id
12119,... a rich and intelligent film that uses its pulpy core conceit to probe questions,62555
12120,... a rich and intelligent film that uses its pulpy core conceit to probe questions of attraction and interdependence,62556
12121,... a rich and intelligent film that uses its pulpy core conceit to probe questions of attraction and interdependence and,62557
12122,... a rich and intelligent film that uses its pulpy core conceit to probe questions of attraction and interdependence and how the heart accomodates practical needs .,221782


#### Merging the DatasetSentences, Dictionary and Sentiment Labels

In [8]:
#Merging the 3 Dataframes with inner joins
df_merged = pd.merge(df_sentences, df_phrases, left_on='sentence', right_on='phrase')  
df = pd.merge(df_merged, df_sentiments, on='phrase_id')
display(df.head())


Unnamed: 0,sentence_index,sentence,phrase,phrase_id,sentiment_value
0,1,"The Rock is destined to be the 21st Century 's new `` Conan '' and that he 's going to make a splash even greater than Arnold Schwarzenegger , Jean-Claud Van Damme or Steven Segal .","The Rock is destined to be the 21st Century 's new `` Conan '' and that he 's going to make a splash even greater than Arnold Schwarzenegger , Jean-Claud Van Damme or Steven Segal .",226166,0.69444
1,2,The gorgeously elaborate continuation of `` The Lord of the Rings '' trilogy is so huge that a column of words can not adequately describe co-writer\/director Peter Jackson 's expanded vision of J.R.R. Tolkien 's Middle-earth .,The gorgeously elaborate continuation of `` The Lord of the Rings '' trilogy is so huge that a column of words can not adequately describe co-writer\/director Peter Jackson 's expanded vision of J.R.R. Tolkien 's Middle-earth .,226300,0.83333
2,3,Effective but too-tepid biopic,Effective but too-tepid biopic,13995,0.51389
3,4,"If you sometimes like to go to the movies to have fun , Wasabi is a good place to start .","If you sometimes like to go to the movies to have fun , Wasabi is a good place to start .",14123,0.73611
4,5,"Emerges as something rare , an issue movie that 's so honest and keenly observed that it does n't feel like one .","Emerges as something rare , an issue movie that 's so honest and keenly observed that it does n't feel like one .",13999,0.86111
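
An inner join silently drops any sentence whose text does not match a dictionary phrase exactly. The split sizes reported later in this notebook add up to 11,286 rows rather than 11,855, which suggests that some sentences are lost at this step. One way to check, sketched here on toy data with made-up rows, is a left join with `indicator=True`; the frame names `sentences`, `phrases`, `checked` and `unmatched` are illustrative.

```python
import pandas as pd

# Toy stand-ins for df_sentences and df_phrases (made-up rows).
sentences = pd.DataFrame({
    "sentence_index": [1, 2, 3],
    "sentence": ["good film .", "bad film .", "odd film ."],
})
phrases = pd.DataFrame({
    "phrase": ["good film .", "bad film .", "film"],
    "phrase_id": [10, 11, 12],
})

# how="left" keeps every sentence; indicator=True adds a "_merge" column
# that is "left_only" for sentences with no matching phrase.
checked = pd.merge(sentences, phrases, left_on="sentence", right_on="phrase",
                   how="left", indicator=True)
unmatched = checked[checked["_merge"] == "left_only"]

print(len(unmatched), "of", len(sentences), "sentences have no dictionary entry")
print(unmatched["sentence"].tolist())
```

Running the same check on `df_sentences` and `df_phrases` would list the sentences that need extra normalisation before they can be labelled.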


In [9]:
# Mapping sentiment values to classes
# [0,0.2], (0.2, 0.4], (0.4, 0.6], (0.6, 0.8], (0.8, 1.0]
# very negative, negative, neutral, positive, very positive

df["sentiment_class"] = ""
df["sentiment_cls_value"] = -1

for i, row in df.iterrows():
    senti_value = row["sentiment_value"]
    # elif makes the branches mutually exclusive
    if 0 <= senti_value <= 0.2:
        s_cls, s_val = "Very_Negative", 0
    elif 0.2 < senti_value <= 0.4:
        s_cls, s_val = "Negative", 1
    elif 0.4 < senti_value <= 0.6:
        s_cls, s_val = "Neutral", 2
    elif 0.6 < senti_value <= 0.8:
        s_cls, s_val = "Positive", 3
    elif 0.8 < senti_value <= 1.0:
        s_cls, s_val = "Very_Positive", 4
    else:
        # Without this branch, an out-of-range or missing value would
        # silently reuse the class of the previous row
        raise ValueError(f"Unexpected sentiment value {senti_value!r} in row {i}")

    df.at[i, "sentiment_class"] = s_cls
    df.at[i, "sentiment_cls_value"] = s_val
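
As an alternative to the row-by-row loop, the same binning can be sketched with `pd.cut`, which assigns every value in a column to an interval in one call. This is a swapped-in technique shown on a toy frame, not the notebook's own method; `BIN_EDGES`, `CLASS_NAMES` and `toy` are illustrative names.

```python
import pandas as pd

BIN_EDGES = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
CLASS_NAMES = ["Very_Negative", "Negative", "Neutral", "Positive", "Very_Positive"]

# Toy frame standing in for the merged df
toy = pd.DataFrame({"sentiment_value": [0.0, 0.2, 0.43056, 0.65278, 0.86111]})

# right=True closes each bin on the right; include_lowest=True also admits 0.0.
# labels=False returns the integer bin index (0 to 4) instead of an interval.
toy["sentiment_cls_value"] = pd.cut(
    toy["sentiment_value"],
    bins=BIN_EDGES,
    labels=False,
    right=True,
    include_lowest=True,
)

# Translate the bin index into the class name
toy["sentiment_class"] = toy["sentiment_cls_value"].map(dict(enumerate(CLASS_NAMES)))
print(toy)
```

Values outside [0, 1] come back as NaN from `pd.cut`, so they are easy to spot afterwards.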
    

In [10]:
### Final Merged Dataset
df.sample(10)

Unnamed: 0,sentence_index,sentence,phrase,phrase_id,sentiment_value,sentiment_class,sentiment_cls_value
2620,2736,"See Scratch for the history , see Scratch for the music , see Scratch for a lesson in scratching , but , most of all , see it for the passion .","See Scratch for the history , see Scratch for the music , see Scratch for a lesson in scratching , but , most of all , see it for the passion .",68678,0.86111,Very_Positive,4
4614,4826,"You can watch , giggle and get an adrenaline boost without feeling like you 've completely lowered your entertainment standards .","You can watch , giggle and get an adrenaline boost without feeling like you 've completely lowered your entertainment standards .",111345,0.66667,Positive,3
1688,1761,Moonlight Mile gives itself the freedom to feel contradictory things .,Moonlight Mile gives itself the freedom to feel contradictory things .,46204,0.79167,Positive,3
4559,4768,"Upsetting and thought-provoking , the film has an odd purity that does n't bring you into the characters so much as it has you study them .","Upsetting and thought-provoking , the film has an odd purity that does n't bring you into the characters so much as it has you study them .",110843,0.61111,Positive,3
8977,9401,The movie itself appears to be running on hypertime in reverse as the truly funny bits get further and further apart .,The movie itself appears to be running on hypertime in reverse as the truly funny bits get further and further apart .,188451,0.375,Negative,1
9162,9598,This film was made by and for those folks who collect the serial killer cards and are fascinated by the mere suggestion of serial killers .,This film was made by and for those folks who collect the serial killer cards and are fascinated by the mere suggestion of serial killers .,226640,0.54167,Neutral,2
1935,2017,As simple and innocent a movie as you can imagine .,As simple and innocent a movie as you can imagine .,44593,0.66667,Positive,3
9857,10337,The film desperately sinks further and further into comedy futility .,The film desperately sinks further and further into comedy futility .,188168,0.18056,Very_Negative,0
9611,10073,It looks much more like a cartoon in the end than The Simpsons ever has .,It looks much more like a cartoon in the end than The Simpsons ever has .,185311,0.43056,Neutral,2
7034,7363,Too much of the humor falls flat .,Too much of the humor falls flat .,150459,0.26389,Negative,1


In [11]:
df.to_csv("dataset/sst5.csv", index=False)

#### Split Dataset into Train and Test 
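The split is taken from datasetSplit.txt, and the two subsets are written to separate files. According to the dataset README, splitset_label is 1 for train, 2 for test and 3 for dev, so the code below, which treats everything that is not train as test, puts both the test and the dev sentences into the test file.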

In [17]:
import pandas as pd
df = pd.read_csv("dataset/sst5.csv")
df_split = pd.read_csv("dataset/stanfordSentimentTreebank/datasetSplit.txt")
df = pd.merge(df,df_split, on='sentence_index')
df.head()


Unnamed: 0,sentence_index,sentence,phrase,phrase_id,sentiment_value,sentiment_class,sentiment_cls_value,splitset_label
0,1,"The Rock is destined to be the 21st Century 's new `` Conan '' and that he 's going to make a splash even greater than Arnold Schwarzenegger , Jean-Claud Van Damme or Steven Segal .","The Rock is destined to be the 21st Century 's new `` Conan '' and that he 's going to make a splash even greater than Arnold Schwarzenegger , Jean-Claud Van Damme or Steven Segal .",226166,0.69444,Positive,3,1
1,2,The gorgeously elaborate continuation of `` The Lord of the Rings '' trilogy is so huge that a column of words can not adequately describe co-writer\/director Peter Jackson 's expanded vision of J.R.R. Tolkien 's Middle-earth .,The gorgeously elaborate continuation of `` The Lord of the Rings '' trilogy is so huge that a column of words can not adequately describe co-writer\/director Peter Jackson 's expanded vision of J.R.R. Tolkien 's Middle-earth .,226300,0.83333,Very_Positive,4,1
2,3,Effective but too-tepid biopic,Effective but too-tepid biopic,13995,0.51389,Neutral,2,2
3,4,"If you sometimes like to go to the movies to have fun , Wasabi is a good place to start .","If you sometimes like to go to the movies to have fun , Wasabi is a good place to start .",14123,0.73611,Positive,3,2
4,5,"Emerges as something rare , an issue movie that 's so honest and keenly observed that it does n't feel like one .","Emerges as something rare , an issue movie that 's so honest and keenly observed that it does n't feel like one .",13999,0.86111,Very_Positive,4,2


In [21]:
# splitset_label == 1 marks the training sentences
df_train = df[df['splitset_label']==1]
print("Length of training set: %d -- %0.2f%% of the dataset" % (len(df_train), len(df_train)/len(df)*100))
df_train.to_csv("dataset/train.csv", index=False)

# Everything else (labels 2 and 3) goes into the test file
df_test = df[df['splitset_label']!=1]
print("Length of test set: %d -- %0.2f%% of the dataset" % (len(df_test), len(df_test)/len(df)*100))
df_test.to_csv("dataset/test.csv", index=False)

Length of training set: 8117 -- 71.92% of the dataset
Length of test set: 3169 -- 28.08% of the dataset
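
If a separate validation set is wanted, the split could instead keep all three labels apart. The sketch below assumes the label meanings given in the dataset README (1 = train, 2 = test, 3 = dev) and runs on a toy frame; `SPLIT_NAMES`, `toy` and `splits` are illustrative names, and the file-writing line is left commented out.

```python
import pandas as pd

# Label meanings as documented in the dataset README
SPLIT_NAMES = {1: "train", 2: "test", 3: "dev"}

# Toy stand-in for the merged df (made-up rows)
toy = pd.DataFrame({
    "sentence_index": [1, 2, 3, 4, 5],
    "splitset_label": [1, 1, 2, 3, 1],
})

# One sub-frame per label, keyed by its readable name
splits = {
    SPLIT_NAMES[int(label)]: part
    for label, part in toy.groupby("splitset_label")
}

for name, part in splits.items():
    print(name, len(part))
    # part.to_csv(f"dataset/{name}.csv", index=False)
```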


In [23]:
df_test.head()

Unnamed: 0,sentence_index,sentence,phrase,phrase_id,sentiment_value,sentiment_class,sentiment_cls_value,splitset_label
2,3,Effective but too-tepid biopic,Effective but too-tepid biopic,13995,0.51389,Neutral,2,2
3,4,"If you sometimes like to go to the movies to have fun , Wasabi is a good place to start .","If you sometimes like to go to the movies to have fun , Wasabi is a good place to start .",14123,0.73611,Positive,3,2
4,5,"Emerges as something rare , an issue movie that 's so honest and keenly observed that it does n't feel like one .","Emerges as something rare , an issue movie that 's so honest and keenly observed that it does n't feel like one .",13999,0.86111,Very_Positive,4,2
5,6,The film provides some great insight into the neurotic mindset of all comics -- even those who have reached the absolute top of the game .,The film provides some great insight into the neurotic mindset of all comics -- even those who have reached the absolute top of the game .,14498,0.59722,Neutral,2,2
6,7,Offers that rare combination of entertainment and education .,Offers that rare combination of entertainment and education .,14351,0.83333,Very_Positive,4,2
