# ECE 570: Assignment 1

## **Instructions**
1. Please follow the thread in Piazza for detailed usage of Google Colab.
2. All submissions should be uploaded to Gradescope as a PDF version of your current Jupyter notebook. In this assignment you only need to submit section 3, 4, 5, and 6. 
3. Have fun!


## 1. Background
In this assignment, we are trying to do simple sentiment analysis. Sentiment analysis is the process of detecting positive or negative sentiment in text. It’s often used by businesses to detect sentiment in social data, gauge brand reputation, and understand customers.

The dataset we will be using is called the [***Stanford Sentiment Treebank***](https://nlp.stanford.edu/sentiment/code.html). This dataset is collected from movie reviews on *Rotten Tomatoes* and includes over 20k sentences. All reviews were later re-organized as distinct phrases with a label as a number between 0.0 and 1.0. Labels can later be divided in to five intervals [0, 0.2], (0.2, 0.4], (0.4, 0.6], (0.6, 0.8], (0.8, 1.0] which means very negative, negative, neutral, positive, and very positive, respectively.

The dataset we are using in this assignment is a subset of Stanford Sentiment Treebank and consists of only **400 phrases**. For simplicity, the dataset only contains extreme sentiments; half of them are extremely positive reviews (have corresponding range (0.9, 1.0]), and the other half are extremely negative reviews (have corresponding range [0.0, 0.1]).

In machine learning (ML), the process of optimizing the model given some dataset is known as *training* or *learning*---thus the data used during training is merely called the *training data* (e.g., historicial reviews). This is analagous to a student seeing worked-out problems in class to help understand a concept.  However, after the model has been trained, it will be used to predict on new data (e.g., new reviews)---this is what is often called *test data*. This process is analagous to a student taking a test and requires them to understand the concepts rather than just memorizing the problems worked-out in class. We will cover these concepts in more detail later on in the lectures.

To *simulate* this process and evaluate your model *as if it was deployed*, we will split the dataset into two parts: a training dataset (50%) and test dataset (50%), where both have an equal number of positive and negative reviews.

Your job is to manually construct a simple function that determines whether a single phrase (input) has a positive or negative sentiment (output/return value). You are only allowed to see and optimize using the **training dataset** (simulating model training in ML), but we provide a function that performs a final evaluation of your model on the **testing dataset** (simulating the model being deployed in the real world). 

*Note: We obfuscate (i.e., make secret) the testing function because in the real world, you won't know what the new data will look like.*

## 2. Mounting your google drive on Colab
Since colab is running on a remote server on Google, you need to mount your google drive on Colab to serve as a 'local directory' to your coding environment. Luckily, it is as simple as two steps! Try to run this block and follow the instructions that got popped out.

Note: This part is not necessary if you are using your own Python environment or other remote python environment.

In [None]:
from google.colab import drive 
drive.mount('/content/drive')

## 3. Load data (10/100 points)
Now, we need to load the data from the "train.txt" and "test.txt" file. Please change the location for **dir_root** in the following code block to where you saved all your files.

The train dataset is stored in the "train.txt" file which stores 100 positive phrases and 100 negative phrases. Each line in the file is consist of a phrase and the corresponding sentiment positive(1) or negative(-1) followed by a separation mark '|'. 

Tip: It is helpful and sometimes necessary to have a separate folder for each assignment!

In [1]:
import os                                                                       # for better path controls
#########################        YOUR CODE        ######################### 
import pandas as pd
dir_root = 'D:\Jupyter\ECE570\Assignment-1'    # change this root directory
#########################      END YOUR CODE      #########################
train_dir = os.path.join(dir_root, 'train.txt')                                 # locate the train.txt file

In [2]:
# use built-in function "open" to read files 
#  and use the "with" syntax to automatically close the file after the block
with open(train_dir, 'r') as f:
    train_lines = f.readlines()

# construct two lists to store phrases and labels seperately
train_data, train_label = [], []
for line in train_lines:
    line_sec = line.split("|", -1)
    train_data.append(line_sec[0])
    train_label.append(int(line_sec[1]))

# preview some data here
preview = 10                 # feel free to toggle this number to see more/less data
for i, (phrase, label) in enumerate(zip(train_data[:preview], train_label[:preview])):
    print(f'Phrase {i:03} \"{phrase}\" has the sentiment {label}')

Phrase 000 "Astonishingly skillful and moving" has the sentiment 1
Phrase 001 "are incredibly beautiful to look at" has the sentiment 1
Phrase 002 "as the most magical and most fun family fare of this or any recent holiday season" has the sentiment 1
Phrase 003 "It shows that some studios firmly believe that people have lost the ability to think and will forgive any shoddy product as long as there 's a little girl-on-girl action ." has the sentiment -1
Phrase 004 "Will assuredly rank as one of the cleverest , most deceptively amusing comedies of the year ." has the sentiment 1
Phrase 005 "disintegrates into a dreary , humorless soap opera" has the sentiment -1
Phrase 006 "The editing is chaotic , the photography grainy and badly focused , the writing unintentionally hilarious , the direction unfocused ," has the sentiment -1
Phrase 007 "The film is often filled with a sense of pure wonderment and excitement not often seen in today 's cinema du sarcasm" has the sentiment 1
Phrase 008 "i

## 4. Classifier (80/100 points)
Please fill in code in the provided skeleton for the function `sentiment_analysis` which has the following structure:
* Input: a single string `phrase`
* Output/Return value: an integer `-1` or `1`. `-1` stands for negative sentiment and `1` stands for positive sentiment

This task is similar to the hand-crafted knowledge phase of AI in which you are manually creating rules or performing manual computation by using your intuition and looking at some examples. You should not use any ML packages for this assignment (simple Python code should be enough).

Notes:
1. To receive full credit, your training accuracy must be greater than 60%. (After some effort, Prof. Inouye was able to get 66.5%. Can you do better?)
2. Your code should be less than **50 lines without importing any additional packages** (i.e, this assignment does not require you to perform any complicated model analysis).  Only 10-20 lines is likely required if written concisely.
3. You can view all the training phrases by opening file *'train.txt'* in the provided zip file. This may help understand what could be used in your function.
4. Throughout the design of your algorithm, **you should only have access to the train dataset** stored in "train.txt". The test dataset stored in 'test.npy' is used in the next evaluation section but you should not look at this (see discussion above). Again, you can think that train dataset is what we would actually have to learn from (like course materials and lectures) while test is new data that simulates real-world posts (where we wouldn’t usually know the true labels). 

You might find the following hints helpful (not required to use them):
1. The Python keyword `in` can be used to determine if a string is within another string. e.g., `'ece' in 'hello ece 570'` would evaluate to `True`.
2. The `lower()` or `upper()` methods of string can be helpful, e.g., if `a = 'HeLlo'` then `print(a.lower())` would print `hello`.
3. A partial frequency table for all words in the training dataset is given as the follow:

Word | # positive | # negative | # total
--- | --- | --- | ---
best|12|0|12
i|0|11|11
are|9|1|10
most|9|1|10
bad|0|10|10
at|2|6|8
his|7|1|8
has|5|3|8
about|2|6|8
have|1|6|7
from|2|4|6
worst|0|6|6
does|2|4|6
brilliant|6|0|6
films|6|0|6
any|1|4|5
enough|1|4|5
what|4|1|5
work|5|0|5
great|4|1|5
time|1|4|5
or|1|3|4
some|1|3|4
will|3|1|4
sense|3|1|4
cinema|3|1|4
comedy|1|3|4
just|1|3|4
first|4|0|4
masterpiece|3|1|4
my|0|4|4
want|1|3|4
if|0|4|4
something|3|1|4
story|3|1|4
love|4|0|4
filmmaking|2|2|4
their|4|0|4
when|0|4|4
than|1|3|4
look|1|2|3
recent|3|0|3
product|0|3|3
into|0|3|3
hilarious|2|1|3
often|3|0|3
easily|3|0|3
performances|3|0|3
deserves|3|0|3



In [45]:
def train_analysis():
    result_word = []
    result_score = []
    
    for i in range(len(data.Word)):
        if data.positive[i] > data.negative[i]:
            result_word.append(data.Word[i])
            result_score.append(1)

        elif data.positive[i] < data.negative[i]:
            result_word.append(data.Word[i])
            result_score.append(-1)
        else:
            result_word.append(data.Word[i])
            result_score.append(0)
    
    return result_word,result_score

In [77]:
import pandas as pd

data = pd.read_csv('freq_table.csv')

In [78]:
def sentiment_analysis(phrase):
    """
    sentiment_analysis function determines whether a phrase is positive (1) or negative (-1).

    :param1(string) phrase: a single phrase in the format of string
    :return(int)          : 1 if the phrase is postive or -1 if the phrase is negative
    """ 
    #########################        YOUR CODE        ######################### 
    low_phrase = phrase.lower()
    score = 0

    punctuations = '''!()-[]{};:'"\,<>./?@#$%^&*_~'''
    
    words = ['the','he','be','to','of','and','a','in','that','have','I','it','for','not','on','with','he','as','you','do','at','this','but','his','by','from','they','we','say','her','she','or','an','will','my','one','all','would','there','their','what','so','up','out','if','about','who','get','which','go','me','when','make','can','like','time','no','just','him']
    
    for x in low_phrase:
        if x in punctuations:
            low_phrase = low_phrase.replace(x, "")
    
    
    #result_word,result_score = train_analysis()
    res = low_phrase.split()
    for x in res:
        print("words = ",x)
        if x in words:
            res.remove(x)
    print("\n",res,"\n")
    for i in res:
        #for l in range(len(z.Word)):
        for l in range(len(data.Word)):
            
            if i == data.Word[l]:
                
                if data.positive[l] > data.negative[l]:
                    #if (len(i) > 3):
                    print("word: ",data.Word[l],"positive: ",data.positive[l])
                        #data.positive[l] = data.positive[l] + 1
                        #score = score + 1
                        #if data.positive[l] > 5:
                         #   score = score + 5
                        #else:
                    score = score + data.positive[l]
                    break
                elif data.negative[l] > data.positive[l]:
                    #if (len(i) > 3):
                    print("word: ",data.Word[l],"negative: ",data.negative[l])
                        #data.negative[l] = data.negative[l] + 1
                        #score = score - 1
                        #if data.negative[l] > 5:
                         #   score = score - 5
                        #else:
                    score = score - data.negative[l]
                        
                    break
        #if i in result_word:
            #score = score + result_score[result_word.index(i)]
            
    if score < 0:
        
        for i in res:
            for k in range(len(data.Word)):
                if i == data.Word[k]:
            #print(data.Word[s],"\n")
                    data.negative[k] = data.negative[k] + 1
                    data.total[k] = data.total[k] + 1
                    break
        
            if i != data.Word[k]:
                print(i)
                data.loc[len(data)] = {'Word': i,'positive': 0,'negative': 1,'total': 1}
                #data = data.append({'Word': i,'positive': 0,'negative': 1,'total': 1}, ignore_index = True)
                #l = l + 1
                #data.Word[l] = i
                #data.negative[l] = 1
                #data.positive[l] = 0
                #data.total[l] = 1
            
        
        return -1
            
    elif score > 0:
        for i in res:
            for k in range(len(data.Word)):
                if i == data.Word[k]:
            #print(data.Word[s],"\n")
                    data.positive[k] = data.positive[k] + 1
                    data.total[k] = data.total[k] + 1
                    break
                    
            if i != data.Word[k]:
                print(i)
                data.loc[len(data)] = {'Word': i,'positive': 1,'negative': 0,'total': 1}
               
            #data = data.append({'Word': i,'positive': 1,'negative': 0,'total': 1})
                
                #data.Word[l] = i
                #data.positive[l] = 1
                #data.negative[l] = 0
                #data.total[l] = 1
            
        return 1     # change this line and add lines above
    
    
    
    else:
        print("score = ",score)
        return 0
    #########################      END YOUR CODE      ######################### 


def evaluate(func, data, label):
    score = 0
    
    for x, y in zip(data, label):
        score += (func(x) == y)
    return score/len(data)

train_acc = evaluate(sentiment_analysis, train_data, train_label)
print(f"Your method has a training accuracy of {train_acc*100}%")

words =  astonishingly
words =  skillful
words =  and

 ['astonishingly', 'skillful', 'moving'] 

score =  0
words =  are
words =  incredibly
words =  beautiful
words =  to
words =  at

 ['are', 'incredibly', 'beautiful', 'look'] 

word:  are positive:  9
word:  look negative:  2
incredibly
beautiful
words =  as
words =  most
words =  magical
words =  and
words =  fun
words =  family
words =  fare
words =  of
words =  or
words =  recent
words =  holiday
words =  season

 ['the', 'most', 'magical', 'most', 'fun', 'family', 'fare', 'this', 'any', 'recent', 'holiday', 'season'] 

word:  most positive:  9
word:  most positive:  9
word:  any negative:  4
word:  recent positive:  3
the
magical
fun
family
fare
this
holiday
season
words =  it
words =  that
words =  studios
words =  firmly
words =  believe
words =  that
words =  have
words =  the
words =  to
words =  and
words =  forgive
words =  any
words =  shoddy
words =  product
words =  as
words =  as
words =  s
words =  a
words =  girlong

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


long
there
s
little
girlongirl
action
words =  will
words =  rank
words =  as
words =  of
words =  cleverest
words =  most
words =  deceptively
words =  amusing
words =  comedies
words =  of
words =  year

 ['assuredly', 'rank', 'one', 'the', 'cleverest', 'most', 'deceptively', 'amusing', 'comedies', 'the', 'year'] 

word:  the positive:  1
word:  most positive:  11
word:  the positive:  1
assuredly
rank
one
cleverest
deceptively
amusing
comedies
year
words =  disintegrates
words =  into
words =  a
words =  humorless
words =  soap
words =  opera

 ['disintegrates', 'into', 'dreary', 'humorless', 'soap', 'opera'] 

word:  into negative:  3
disintegrates
dreary
humorless
soap
opera
words =  the
words =  is
words =  chaotic
words =  the
words =  grainy
words =  and
words =  focused
words =  the
words =  unintentionally
words =  hilarious
words =  the
words =  unfocused

 ['editing', 'is', 'chaotic', 'photography', 'grainy', 'badly', 'focused', 'writing', 'unintentionally', 'hilarious', 'd

frame
words =  this
words =  is
words =  hopeless

 ['movie', 'is', 'hopeless'] 

word:  movie positive:  3
word:  is positive:  4
hopeless
words =  we
words =  a
words =  bad
words =  bad
words =  movie

 ['have', 'bad', 'bad', 'bad', 'movie'] 

word:  have negative:  6
word:  bad negative:  12
word:  bad negative:  12
word:  bad negative:  12
word:  movie positive:  4
words =  has
words =  all
words =  scenic
words =  appeal
words =  of
words =  cesspool

 ['has', 'the', 'scenic', 'appeal', 'a', 'cesspool'] 

word:  has positive:  5
word:  the positive:  4
word:  a negative:  2
scenic
appeal
cesspool
words =  a
words =  and
words =  movie

 ['cumbersome', 'clicheridden', 'movie'] 

word:  movie positive:  4
cumbersome
clicheridden
words =  starts
words =  out
words =  spirals
words =  downward
words =  and
words =  to
words =  bottom
words =  of
words =  pool
words =  with
words =  utterly
words =  incompetent
words =  conclusion

 ['starts', 'mediocre', 'spirals', 'downward', 'thuds

words =  worst
words =  cinematic
words =  tragedies

 ['year', 's', 'worst', 'cinematic', 'tragedies'] 

word:  s negative:  9
word:  worst negative:  7
cinematic
tragedies
words =  ladles
words =  on
words =  local
words =  flavour
words =  with
words =  hugely
words =  enjoyable
words =  film
words =  about
words =  times
words =  clashing
words =  cultures
words =  and
words =  pleasures
words =  of
words =  wellmade
words =  pizza

 ['ladles', 'the', 'local', 'flavour', 'a', 'hugely', 'enjoyable', 'film', 'changing', 'times', 'clashing', 'cultures', 'the', 'pleasures', 'a', 'wellmade', 'pizza'] 

word:  the positive:  10
word:  a positive:  4
word:  film positive:  5
word:  the positive:  10
word:  a positive:  4
ladles
local
flavour
hugely
enjoyable
changing
times
clashing
cultures
pleasures
wellmade
pizza
words =  a
words =  of
words =  films
words =  like
words =  should
words =  be

 ['model', 'what', 'films', 'this', 'should', 'like'] 

word:  what positive:  4
word:  films p

words =  fantastic

 ['work', 'is', 'fantastic'] 

word:  work positive:  6
word:  is positive:  16
fantastic
words =  for
words =  its
words =  technical
words =  virtuosity
words =  the
words =  is
words =  so
words =  in
words =  and
words =  pedagogy
words =  that
words =  s
words =  enough
words =  to
words =  one
words =  for
words =  day
words =  when
words =  can
words =  longer
words =  handle
words =  the
words =  of

 ['all', 'its', 'technical', 'virtuosity', 'film', 'is', 'mired', 'juvenile', 'nearxenophobic', 'pedagogy', 'it', 's', 'enough', 'make', 'pine', 'day', 'godard', 'no', 'longer', 'handle', 'the', 'rigors', 'filmmaking'] 

word:  its negative:  3
word:  film positive:  8
word:  is positive:  17
word:  it positive:  1
word:  s negative:  17
word:  enough negative:  6
word:  make negative:  1
word:  no positive:  1
word:  the positive:  19
word:  filmmaking positive:  3
technical
virtuosity
mired
juvenile
nearxenophobic
pedagogy
pine
day
godard
longer
handle
rigors


word:  is positive:  21
word:  nt negative:  2
repeated
least
four
every
joke
annoying
words =  ranks
words =  with
words =  worst

 ['ranks', 'the', 'worst'] 

word:  the positive:  28
word:  worst negative:  8
ranks
words =  the
words =  wondrous
words =  of
words =  hollywood
words =  fantasies

 ['most', 'wondrous', 'all', 'hollywood', 'fantasies'] 

word:  most positive:  15
word:  all positive:  2
wondrous
hollywood
fantasies
words =  one
words =  the
words =  romantic
words =  comedies
words =  of
words =  past
words =  decade

 ['greatest', 'romantic', 'comedies', 'of', 'the', 'past', 'decade'] 

word:  the positive:  29
greatest
romantic
past
decade
words =  have
words =  admit
words =  i
words =  walked
words =  out
words =  runteldat

 ['to', 'admit', 'i', 'walked', 'of', 'runteldat'] 

word:  to positive:  2
word:  i negative:  12
word:  of positive:  2
admit
walked
runteldat
words =  the
words =  master
words =  of
words =  chase
words =  sequence
words =  returns
words = 

stinks
start
finish
wet
burlap
sack
gloom
words =  send
words =  it
words =  cranky
words =  we
words =  nt
words =  get
words =  enough
words =  to
words =  through
words =  crap
words =  like

 ['send', 'cranky', 'do', 'nt', 'paid', 'enough', 'to', 'sit', 'through', 'crap', 'this'] 

word:  nt positive:  4
word:  enough negative:  6
word:  to positive:  2
word:  through positive:  2
word:  this positive:  3
send
cranky
do
paid
sit
crap
words =  delightfully
words =  rendered

 ['delightfully', 'rendered'] 

word:  delightfully negative:  1
rendered
words =  a
words =  manipulative
words =  stinker
words =  the
words =  passes
words =  time
words =  it
words =  time
words =  an
words =  finale
words =  of
words =  metal

 ['predictable', 'manipulative', 'stinker', 'story', 'passes', 'until', 's', 'for', 'absurd', 'finale', 'twisted', 'metal'] 

word:  story positive:  6
word:  s negative:  17
predictable
manipulative
stinker
passes
until
for
absurd
finale
twisted
metal
words =  wonder

word:  large positive:  1
word:  filmmaking positive:  5
word:  are positive:  17
word:  likely negative:  1
word:  see positive:  2
examples
artful
format
anytime
soon
words =  with
words =  candylike
words =  taste
words =  of
words =  fading
words =  faster
words =  than
words =  25cent
words =  bubble
words =  gum
words =  i
words =  realized
words =  this
words =  a
words =  movie
words =  that
words =  nt
words =  stand
words =  the
words =  of
words =  it
words =  a

 ['candylike', 'taste', 'fading', 'faster', 'than', '25cent', 'bubble', 'gum', 'i', 'realized', 'is', 'throwaway', 'movie', 'wo', 'nt', 'stand', 'the', 'test', 'time', 'it', 's', 'trifle'] 

word:  than negative:  4
word:  i negative:  17
word:  realized negative:  1
word:  is positive:  33
word:  movie positive:  17
word:  nt positive:  6
word:  the positive:  42
word:  time negative:  4
word:  it positive:  4
word:  s negative:  20
candylike
taste
fading
faster
25cent
bubble
gum
throwaway
wo
stand
test
trifle
word

word:  the positive:  50
word:  own positive:  1
passing
interest
events
shaping
world
beyond
horizons
words =  the
words =  of
words =  during
words =  which
words =  want
words =  to
words =  your
words =  head
words =  on
words =  seat
words =  in
words =  of
words =  at
words =  cluelessness
words =  at
words =  idiocy
words =  at
words =  utterly
words =  misplaced
words =  earnestness

 ['kind', 'movie', 'during', 'you', 'want', 'bang', 'your', 'head', 'the', 'seat', 'front', 'you', 'its', 'cluelessness', 'its', 'idiocy', 'its', 'utterly', 'misplaced', 'earnestness'] 

word:  kind positive:  1
word:  movie positive:  20
word:  want negative:  5
word:  your positive:  1
word:  the positive:  51
word:  its positive:  4
word:  its positive:  4
word:  its positive:  4
word:  utterly positive:  1
during
bang
head
seat
front
cluelessness
idiocy
misplaced
earnestness
words =  is
words =  great
words =  in
words =  role

 ['is', 'great', 'his', 'role'] 

word:  is positive:  38
word:  gr

In [69]:
words = ['the','be','to','of','and','a','in','that','have','I','it','for','not','on','with','he','as','you','do','at','this','but','his','by','from','they','we','say','her','she','or','an','will','my','one','all','would','there','their','what','so','up','out','if','about','who','get','which','go','me','when','make','can','like','time','no','just','him']
stat = 'to he be continue'


res = stat.split()
print(res)
for x in res:
    if x in words:
        res.remove(x)

print(res)

['to', 'he', 'be', 'continue']
['he', 'continue']


## 5. Evaluate (10/100 points)
You may already notice that there is an extra evaluation function in the above coding block which helps calculate the accuracy for your algorithm on the training dataset. The metric that we used to evaluate is straightforward:    
$$\text{accuracy} = \frac{\text{# of correct prediction}}{\text{# of total cases}}$$
Now, let's test the performance of your algorithm on the test dataset (as if deployed on new reviews)! 

Try to get the **test accuracy** to be higher than 55% to receive **full credit**!

Note: You should not have the accuracy to be lower than 50%! (Why?)

In [79]:
import sys
sys.path.append(dir_root)
from top_classified_file import super_secret_function

test_dir = os.path.join(dir_root, 'test.npy')
test_acc = super_secret_function(test_dir, sentiment_analysis)

print(f"Your method has a test accuracy of {test_acc*100}%")

words =  the
words =  films
words =  of

 ['worst', 'films', '2002'] 

word:  worst negative:  9
word:  films positive:  11
word:  2002 positive:  1
words =  have
words =  frida
words =  with
words =  visual
words =  style
words =  unique
words =  and
words =  to
words =  titular
words =  character
words =  s
words =  paintings
words =  and
words =  the
words =  created
words =  a
words =  work
words =  of
words =  of
words =  own

 ['infused', 'frida', 'visual', 'style', 'unique', 'inherent', 'titular', 'character', 's', 'paintings', 'in', 'the', 'process', 'created', 'a', 'masterful', 'work', 'art', 'their', 'own'] 

word:  visual positive:  2


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


word:  character positive:  2
word:  s negative:  26
word:  in positive:  1
word:  the positive:  53
word:  process positive:  1
word:  a positive:  25
word:  work positive:  10
word:  their positive:  5
word:  own positive:  2
infused
frida
style
unique
inherent
titular
paintings
created
masterful
art
words =  what
words =  idea
words =  what
words =  thrill
words =  ride
words =  this
words =  a
words =  fascinating
words =  look
words =  at
words =  future
words =  than
words =  bladerunner
words =  and
words =  of
words =  most
words =  highconcept
words =  sci
words =  fi
words =  adventures
words =  attempted
words =  for
words =  screen

 ['an', 'idea', 'thrill', 'ride', 'is', 'a', 'more', 'fascinating', 'look', 'the', 'future', 'than', 'bladerunner', 'one', 'the', 'most', 'highconcept', 'sci', 'fi', 'adventures', 'attempted', 'the', 'screen'] 

word:  an positive:  8
word:  idea negative:  1
word:  is positive:  42
word:  a positive:  26
word:  more positive:  2
word:  fascinat

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


matter
fans
comic
find
uninteresting
words =  s
words =  almost
words =  impossible
words =  not
words =  be
words =  away
words =  by
words =  sheer
words =  beauty
words =  of
words =  images

 ['s', 'almost', 'impossible', 'to', 'swept', 'away', 'the', 'sheer', 'beauty', 'his', 'images'] 

word:  s negative:  28
word:  almost positive:  2
word:  to positive:  3
word:  the positive:  57
word:  his positive:  10
impossible
swept
away
sheer
beauty
images
words =  putrid
words =  it
words =  not
words =  the
words =  of
words =  match
words =  that
words =  be
words =  to
words =  every
words =  print
words =  of
words =  film

 ['putrid', 'is', 'worth', 'price', 'the', 'match', 'should', 'used', 'burn', 'every', 'print', 'the', 'film'] 

word:  is positive:  46
word:  worth positive:  1
word:  the positive:  58
word:  should positive:  1
word:  every positive:  3
word:  the positive:  58
word:  film positive:  20
putrid
price
match
used
burn
print
words =  and
words =  delightful

 ['t

heck
samuel
l
jackson
words =  inconsistent
words =  meandering
words =  and
words =  dry
words =  plot

 ['inconsistent', 'meandering', 'sometimes', 'dry', 'plot'] 

word:  sometimes positive:  1
inconsistent
meandering
dry
plot
words =  calling
words =  this
words =  brainless
words =  would
words =  paying
words =  it
words =  compliment

 ['calling', 'movie', 'brainless', 'be', 'paying', 'a', 'compliment'] 

word:  movie positive:  25
word:  be positive:  3
word:  a positive:  29
calling
brainless
paying
compliment
words =  adultery

 ['adultery'] 

score =  0
words =  appalling

 ['appalling'] 

word:  appalling negative:  2
words =  s
words =  a
words =  overused
words =  cocktail
words =  using
words =  the
words =  olives
words =  since
words =  1962
words =  as
words =  not
words =  is
words =  entry
words =  number
words =  twenty
words =  the
words =  of
words =  brosnan
words =  bunch

 ['s', 'stale', 'overused', 'cocktail', 'using', 'same', 'olives', 'since', '1962', 'garn

word:  ll positive:  4
word:  refreshing positive:  1
word:  a positive:  36
yourself
remembering
visit
sunshine
state
words =  there
words =  no
words =  the

 ['s', 'saving', 'movie'] 

word:  s negative:  30
word:  movie positive:  28
saving
words =  idiots
words =  who
words =  nt
words =  care
words =  what
words =  of
words =  they
words =  into
words =  their
words =  gullets
words =  to
words =  sustenance

 ['idiots', 'do', 'nt', 'care', 'kind', 'sewage', 'shovel', 'into', 'mental', 'gullets', 'simulate', 'sustenance'] 

word:  do positive:  2
word:  nt positive:  9
word:  kind positive:  3
word:  into negative:  5
idiots
care
sewage
shovel
mental
gullets
simulate
sustenance
words =  plays
words =  like
words =  extended
words =  dialogue
words =  exercise
words =  in
words =  101

 ['plays', 'an', 'extended', 'dialogue', 'exercise', 'retard', '101'] 

word:  plays positive:  1
word:  an positive:  13
word:  dialogue positive:  1
word:  exercise positive:  1
extended
retard
10

word:  s negative:  34
word:  reason positive:  1
word:  why positive:  1
word:  anyone positive:  4
word:  should positive:  2
word:  remembering positive:  1
after
seeing
tell
bother
words =  one
words =  the
words =  exciting
words =  action
words =  films
words =  to
words =  out
words =  china
words =  in
words =  years

 ['of', 'most', 'exciting', 'action', 'films', 'come', 'of', 'china', 'recent', 'years'] 

word:  of positive:  8
word:  most positive:  22
word:  exciting positive:  1
word:  action positive:  3
word:  films positive:  14
word:  come negative:  1
word:  of positive:  8
word:  recent positive:  6
word:  years positive:  1
china
words =  whether
words =  it
words =  the
words =  movie
words =  of
words =  i
words =  ca
words =  nt
words =  say
words =  sure
words =  memories
words =  of
words =  have
words =  and
words =  skipped
words =  country
words =  bears
words =  but
words =  new
words =  jangle
words =  of
words =  mayhem
words =  and
words =  must
words = 

word:  movie positive:  35
word:  is positive:  62
word:  a positive:  42
word:  disaster negative:  2
onesided
offers
simplistic
explanations
stylistically
words =  me
words =  to
words =  the
words =  in
words =  first
words =  10
words =  minutes

 ['want', 'bolt', 'theater', 'the', 'first', '10', 'minutes'] 

word:  want negative:  5
word:  theater negative:  2
word:  the positive:  76
word:  first positive:  8
word:  minutes negative:  1
bolt
10
words =  good
words =  car
words =  chases
words =  great
words =  fight
words =  scenes

 ['good', 'car', 'chases', 'great', 'fight', 'scenes'] 

word:  good positive:  2
word:  car positive:  1
word:  chases positive:  1
word:  great positive:  11
word:  scenes positive:  1
fight
words =  you
words =  have
words =  wonder
words =  how
words =  on
words =  anyone
words =  anywhere
words =  could
words =  have
words =  they
words =  make
words =  guffaw
words =  with
words =  script
words =  as
words =  diabolical
words =  as

 ['really', 

word:  infantile negative:  1
word:  makes positive:  3
word:  director positive:  3
word:  writer negative:  1
word:  s negative:  38
word:  diapers negative:  1
sort
words =  cinema
words =  s
words =  finest
words =  this

 ['cinema', 's', 'finest', 'year'] 

word:  cinema positive:  8
word:  s negative:  38
word:  year positive:  8
finest
words =  nonsensical
words =  dull
words =  ``
words =  cyberhorror
words =  flick
words =  is
words =  a
words =  hollow
words =  exercise
words =  in
words =  scares
words =  and
words =  acting

 ['nonsensical', 'dull', '``', 'cyberhorror', 'flick', 'is', 'grim', 'hollow', 'exercise', 'flat', 'scares', 'bad', 'acting'] 

word:  nonsensical positive:  1
word:  dull positive:  3
word:  `` positive:  3
word:  is positive:  67
word:  exercise positive:  3
word:  flat positive:  1
word:  bad negative:  25
word:  acting positive:  2
cyberhorror
flick
grim
hollow
scares
words =  the
words =  equivalent
words =  of
words =  to
words =  dinner
words =  

word:  is positive:  74
word:  pretty negative:  1
word:  its positive:  12
word:  acting positive:  4
word:  dull positive:  5
set
sea
ghost
ship
landbound
leaden
exposition
telegraphed
surprises
words =  how
words =  did
words =  it
words =  get

 ['how', 'did', 'ever', 'made'] 

word:  how positive:  3
word:  did negative:  2
word:  ever positive:  4
word:  made positive:  3
words =  a
words =  incoherent
words =  selfindulgent
words =  mess
words =  of
words =  movie
words =  in
words =  a
words =  of
words =  windbags
words =  drone
words =  on
words =  for
words =  hours
words =  a
words =  of
words =  meaningless
words =  prattle

 ['dreary', 'incoherent', 'selfindulgent', 'mess', 'movie', 'which', 'bunch', 'pompous', 'windbags', 'drone', 'inanely', 'two', 'hours', 'a', 'cacophony', 'pretentious', 'meaningless', 'prattle'] 

word:  movie positive:  40
word:  bunch positive:  2
word:  two positive:  2
word:  hours positive:  2
word:  a positive:  53
word:  pretentious positive:  

word:  holes positive:  1
word:  completely positive:  1
word:  lacking negative:  1
chills
words =  directed
words =  action
words =  sequences
words =  and
words =  of
words =  worst
words =  dialogue
words =  in
words =  memory

 ['directed', 'action', 'sequences', 'some', 'the', 'worst', 'dialogue', 'recent', 'memory'] 

word:  action positive:  6
word:  some positive:  5
word:  the positive:  91
word:  worst negative:  9
word:  dialogue positive:  4
word:  recent positive:  7
directed
sequences
memory
words =  the
words =  undercover
words =  brother
words =  a
words =  mix
words =  of
words =  night
words =  livestyle
words =  parody
words =  70s
words =  blaxploitation
words =  films
words =  and
words =  action
words =  comedy
words =  gone
words =  wild
words =  dishes
words =  out
words =  ton
words =  of
words =  that
words =  can

 ['enjoyable', 'undercover', 'brother', 'zany', 'mix', 'saturday', 'night', 'livestyle', 'parody', '70s', 'blaxploitation', 'films', 'goofball', 

word:  s negative:  46
word:  brilliant positive:  12
word:  has positive:  21
word:  it positive:  10
aside
simone
niccol
antihollywood
satire
wickedly
eccentric
enchantment
words =  priceless
words =  entertainment

 ['priceless', 'entertainment'] 

word:  entertainment positive:  1
priceless
words =  fascinating
words =  and

 ['fascinating'] 

word:  fascinating positive:  5
words =  one
words =  the
words =  ingenious
words =  and
words =  thrillers
words =  i
words =  ve
words =  seen
words =  in
words =  a
words =  time

 ['of', 'most', 'ingenious', 'entertaining', 'thrillers', 'i', 've', 'seen', 'quite', 'long'] 

word:  of positive:  19
word:  most positive:  28
word:  i negative:  25
word:  ve positive:  4
word:  seen positive:  4
word:  quite positive:  3
word:  long positive:  3
ingenious
thrillers
words =  for
words =  for
words =  the
words =  woody
words =  allen
words =  was
words =  once
words =  a
words =  of
words =  fresh
words =  sometimes
words =  funny
words =  a

## 6. (Optional) Did you notice something interesting?
1. During your design, was the training accuracy always higher than test accuracy? Can you explain why that might be true?
2. Was the sentiment analysis task harder than you expected? Why or why not?
3. Anything else you learned or feedback you may have?


## 7. (Optional) Further reading and exploration
If you are interested in text data, you could look into the scikit-learn tutorial on analyzing text: https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html  We will discuss scikit-learn a little later in the semester.