# ECE 473 Assignment 1

## **Instructions**
1. Please follow the thread on Piazza for detailed usage of Google Colab.
2. All submissions should be uploaded to Gradescope as a PDF version of your current Jupyter notebook (see `uploader.ipynb`). In this assignment you only need to submit sections 3, 4 and 5. **Make sure to select the correct corresponding pages for each question on Gradescope and make sure your code and output are visible on the PDF.**
3. Have fun!


## 1. Background
In this assignment, we are trying to do simple sentiment analysis. Sentiment analysis is the process of detecting positive or negative sentiment in text. It’s often used by businesses to detect sentiment in social data, gauge brand reputation, and understand customers.

The dataset we will be using is called [***Stanford Sentiment Treebank***](https://nlp.stanford.edu/sentiment/code.html). It was collected from movie reviews on *Rotten Tomatoes* and covers over 20k sentences. The reviews were later reorganized into distinct phrases, each labeled with a score from 0.0 to 1.0. These scores can be divided into five intervals, [0, 0.2], (0.2, 0.4], (0.4, 0.6], (0.6, 0.8], (0.8, 1.0], which correspond to very negative, negative, neutral, positive, and very positive, respectively.
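
To make the interval boundaries concrete, here is a minimal sketch that maps a score to its class name. The helper `score_to_class` is a hypothetical name used only for illustration; it is not part of the dataset's tooling or of this assignment's starter code.

```python
def score_to_class(score):
    """Map a sentiment score in [0, 1] to one of the five classes listed above."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be in [0, 1], got {score}")
    # Upper bounds are inclusive, matching the half-open intervals (a, b] in the text.
    bounds = [
        (0.2, "very negative"),
        (0.4, "negative"),
        (0.6, "neutral"),
        (0.8, "positive"),
        (1.0, "very positive"),
    ]
    for upper, name in bounds:
        if score <= upper:
            return name


print(score_to_class(0.05))  # very negative
print(score_to_class(0.4))   # negative
print(score_to_class(0.95))  # very positive
```

Note that a boundary value such as 0.4 falls into the lower class, because each interval includes its upper end.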

The dataset we are using in this assignment is a subset of the Stanford Sentiment Treebank and consists of **400 phrases**. The train/test split ratio is 50/50, and in both the train and the test dataset, half of the phrases are extremely positive reviews (scores in the range (0.9, 1.0]) and the other half are extremely negative reviews (scores in the range [0.0, 0.1]). Your job is to construct a simple function, **designed using the train dataset only**, that takes a single phrase as input and outputs whether the phrase has positive or negative sentiment.

**The goal of this assignment is to attempt to implement a method from the 1st wave of AI, namely handcrafted knowledge systems. Thus, you will be trying to create a rule-based function for this task based on your prior knowledge and some examples.**

## 2. Mounting your Google Drive on Colab
Since Colab runs on a remote Google server, you need to mount your Google Drive on Colab so that it can serve as a 'local directory' for your coding environment. Luckily, it takes only two steps! Run the code block below and follow the instructions that pop up.

Note: This part is not necessary if you are using your own Python environment or another remote Python environment.

In [None]:
from google.colab import drive 
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## 3. Load data (20/100 points)
Now, we need to load the training data from the "train.txt" file. Please change **dir_root** in the following code block to the location where you saved all your files.

The train dataset is stored in the "train.txt" file, which contains 100 positive phrases and 100 negative phrases. Each line in the file consists of a phrase, the separator '|', and the corresponding sentiment label: positive (1) or negative (-1).
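
As a minimal sketch of that line format, the hypothetical helper `parse_line` below splits one line into a phrase and an integer label. It assumes each line ends with a newline character and that the label is the text after the last '|'; the sample line is written by hand to match the format described above, not read from the file.

```python
def parse_line(line):
    """Split one 'phrase|label' line into (phrase, int label)."""
    # Remove the trailing newline first, then split on the LAST '|' so that a
    # '|' inside the phrase itself (if one ever occurs) does not break the parse.
    phrase, label = line.rstrip("\n").rsplit("|", 1)
    return phrase.strip(), int(label)


sample = "Astonishingly skillful and moving|1\n"  # hand-written sample line (assumption)
print(parse_line(sample))  # ('Astonishingly skillful and moving', 1)
```

Without removing the newline, the label would be read as the string `'1\n'` rather than `1`, which is easy to miss when printing.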

Tips: It is helpful and sometimes necessary to have a separate folder for each assignment!

In [None]:
import os                                           # for better path controls
#########################        YOUR CODE        #########################
dir_root = '/content/drive/MyDrive/Colab Notebooks/ECE473/Assignment-1'        # change this root directory
#########################      END YOUR CODE      #########################
train_dir = os.path.join(dir_root, 'train.txt')     # locate the train.txt file

In [None]:
# use the built-in function "open" to read the file; "with" closes it automatically
with open(train_dir, 'r') as f:
    train_lines = f.readlines()

# construct two lists to store phrases and labels separately
train_data, train_label = [], []
#### YOUR CODE HERE ####
# Populate the train_data and train_label lists by splitting
# each line of train.txt by the "|" character and adding
# the phrase and label to the lists
for line in train_lines:
  # strip the trailing newline so it does not end up in the label
  line_val = line.strip().split('|')
  train_data.append(line_val[0])
  train_label.append(int(line_val[1]))

#### END YOUR CODE ####

# preview some data here
preview = 10                 # feel free to toggle this number to see more/less data
for i in range(preview):
    print(f'Phrase \"{train_data[i]}\" has the sentiment {train_label[i]}')

Phrase "Astonishingly skillful and moving" has the sentiment 1

Phrase "are incredibly beautiful to look at" has the sentiment 1

Phrase "as the most magical and most fun family fare of this or any recent holiday season" has the sentiment 1

Phrase "It shows that some studios firmly believe that people have lost the ability to think and will forgive any shoddy product as long as there 's a little girl-on-girl action ." has the sentiment -1

Phrase "Will assuredly rank as one of the cleverest , most deceptively amusing comedies of the year ." has the sentiment 1

Phrase "disintegrates into a dreary , humorless soap opera" has the sentiment -1

Phrase "The editing is chaotic , the photography grainy and badly focused , the writing unintentionally hilarious , the direction unfocused ," has the sentiment -1

Phrase "The film is often filled with a sense of pure wonderment and excitement not often seen in today 's cinema du sarcasm" has the sentiment 1

Phrase "is as appalling as any ` come

## 4. Handcrafted / Hardcoded Classifier (40/100 points)
Please fill in code in the provided skeleton for the function `sentiment_analysis_model` which has the following structure:
* Input: a single string `phrase`
* Output: an integer, `-1` or `1`, where `-1` stands for negative sentiment and `1` stands for positive sentiment

Importantly, this is meant to be like the *first wave of AI* with **hardcoded / handcrafted rules**. You should not use any ML or AI package for this assignment. You can manually look at the train dataset to understand words or phrases that might be positive or negative and can then hardcode these words and possibly weights into your classifier.

Second, fill in the function `evaluate` to evaluate the accuracy of your proposed model using the comments in the function.

Notes:
1. Try to constrain your code for `sentiment_analysis_model` to within **50 lines without importing any additional packages** (i.e., this assignment does not require you to perform any complicated model analysis).
2. You can view all the training phrases by opening the file *'train.txt'* in the provided zip file.
3. Throughout the design of your algorithm, **you should only have access to the train dataset** stored in "train.txt". The test dataset stored in 'test.npy' should only be used in the next evaluation section. You can think of the train dataset as what we actually have to learn from (like course materials and lectures), while the test dataset is new data that simulates real-world posts (where we usually wouldn’t know the true labels).

You might find the following hints helpful (not required to use them):
1. Part of the frequency table for all words in the training dataset is given below:

Word | # of times in positive | # of times in negative | total #
--- | --- | --- | ---
best|12|0|12
i|0|11|11
are|9|1|10
most|9|1|10
bad|0|10|10
at|2|6|8
his|7|1|8
has|5|3|8
about|2|6|8
have|1|6|7
from|2|4|6
worst|0|6|6
does|2|4|6
brilliant|6|0|6
films|6|0|6
any|1|4|5
enough|1|4|5
what|4|1|5
work|5|0|5
great|4|1|5
time|1|4|5
or|1|3|4
some|1|3|4
will|3|1|4
sense|3|1|4
cinema|3|1|4
comedy|1|3|4
just|1|3|4
first|4|0|4
masterpiece|3|1|4
my|0|4|4
want|1|3|4
if|0|4|4
something|3|1|4
story|3|1|4
love|4|0|4
filmmaking|2|2|4
their|4|0|4
when|0|4|4
than|1|3|4
look|1|2|3
recent|3|0|3
product|0|3|3
into|0|3|3
hilarious|2|1|3
often|3|0|3
easily|3|0|3
performances|3|0|3
deserves|3|0|3

2. You might want to use the Python keyword `in` for seeing if one string is in another.
3. You might want to use the `lower()` or `upper()` string functions.
4. Manually define (i.e., hand-craft) your own rules/criteria for good vs. bad reviews (e.g., you may want to consider words that are usually good or bad).
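
One possible way to combine these hints is sketched below. It is only an illustration, not the expected solution: the weights are an assumption (positive count minus negative count for a few rows of the table above), and `WORD_WEIGHTS` and `weighted_sentiment` are hypothetical names. The sketch also shows that `in` applied to a string is a substring test, while `in` applied to a list of tokens matches whole words only.

```python
# Illustrative weights (assumption): (# of times in positive) - (# of times in negative)
# for a few rows of the frequency table.
WORD_WEIGHTS = {
    "best": 12, "brilliant": 6, "love": 4, "great": 3,
    "bad": -10, "worst": -6,
}


def weighted_sentiment(phrase):
    """Return 1 if the summed word weights are positive, otherwise -1."""
    # Lowercase for case-insensitive matching, then split on whitespace so
    # that only whole words are looked up.
    tokens = phrase.lower().split()
    score = sum(WORD_WEIGHTS.get(tok, 0) for tok in tokens)
    # A score of zero (no known words, or a tie) defaults to negative.
    return 1 if score > 0 else -1


# Substring test vs. whole-word test:
print("his" in "this film")          # True  (substring match inside "this")
print("his" in "this film".split())  # False (no token equals "his")

print(weighted_sentiment("One of the BEST films of the year"))  # 1
print(weighted_sentiment("a bad , bad movie"))                  # -1
```

Whether whole-word or substring matching works better on the training phrases is something to check empirically; short entries such as 'i' behave very differently under the two tests.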

In [None]:
def sentiment_analysis_model(phrase):
    """
    sentiment_analysis_model determines whether a phrase is positive (1) or negative (-1).

    :param1(string) phrase: a single phrase in the format of string
    :return(int)          : 1 if the phrase is positive or -1 if the phrase is negative
    """ 

    #########################        YOUR CODE        ######################### 
    # Hand-picked words from the training frequency table that appear mostly
    # in positive phrases (more_pos) or mostly in negative phrases (more_neg).
    more_pos = ['best', 'are', 'most', 'his', 'has', 'brilliant', 'films', 'what', 'work', 'great', 'will', 'sense', 'cinema', 'first', 'masterpiece', 'story', 'love', 'their', 'recent', 'hilarious', 'often', 'easily', 'performances', 'deserves']
    more_neg = ['i', 'bad', 'have', 'worst', 'any', 'enough', 'time', 'some', 'just', 'my', 'if', 'when', 'than', 'product']

    # Lowercase once so that matching is case-insensitive.
    # Note: `in` on strings is a substring test, so short entries such as 'i'
    # or 'his' also match inside longer words (e.g. 'his' in 'this').
    text = phrase.lower()
    num_pos = sum(1 for pos in more_pos if pos in text)
    num_neg = sum(1 for neg in more_neg if neg in text)

    # Predict positive only when positive hits strictly outnumber negative hits;
    # ties default to negative.
    if num_neg < num_pos:
      return 1
    else:
      return -1

    #########################      END YOUR CODE      ######################### 


def evaluate(func, data, label):
    #########################       YOUR CODE      ######################### 
    # Evaluate the accuracy of the model (passed as the function `func`) 
    #   on the given phrases (`data`) and corresponding labels (`label`)
    # For each phrase in `data`, compute the model's prediction for the phrase
    #   and then determine if the prediction is equal to the true corresponding 
    #   label from `label`.
    # Count the number of correct predictions and divide by the total number
    #   of phrases to get the accuracy.
    tot = 0
    for corr_data, corr_label in zip(data, label):
      # labels read from the file may be strings, so compare both as integers
      if int(func(corr_data)) == int(corr_label):
        tot += 1
    accuracy = tot / len(data)

    #########################   END YOUR CODE      ######################### 
    return accuracy

train_acc = evaluate(sentiment_analysis_model, train_data, train_label)
print(f"Your method has the training accuracy of {train_acc*100}%")

Your method has the training accuracy of 65.5%


## 5. Evaluate (40/100 points)
You may have already noticed that the code block above includes an extra evaluation function, which calculates the accuracy of your algorithm on the training dataset. The metric that we use for evaluation is straightforward:
$$\text{Accuracy} = \frac{\#\text{ of correct predictions}}{\#\text{ of total cases}}$$
Now, let's test the performance of your algorithm on the test dataset!
Try to get the **test accuracy** above 55% to receive **full credit**!

Note: Your accuracy should not be lower than 50%!
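
As a small worked example of this metric, the sketch below computes accuracy on four made-up predictions and labels; `accuracy`, `preds`, and `labels` are hypothetical names and the values are not taken from the dataset. Formatting with `:.1f` is optional and only keeps the printed percentage short.

```python
def accuracy(predictions, labels):
    """Fraction of positions where the prediction equals the true label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)


preds = [1, -1, 1, 1]     # made-up model outputs
labels = [1, -1, -1, 1]   # made-up true labels
acc = accuracy(preds, labels)
print(f"{acc * 100:.1f}%")  # 75.0%
```

Here three of the four predictions match, so the accuracy is 3 / 4 = 0.75.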

In [None]:
import sys
sys.path.append(dir_root)
from top_classified_file import super_secret_function

test_dir = os.path.join(dir_root, 'test.npy')
test_acc = super_secret_function(test_dir, sentiment_analysis_model)

print(f"Your method has the test accuracy of {test_acc*100}%")

Your method has the test accuracy of 56.49999999999999%


## 6. Did you notice something interesting? (Optional)
1. During your design, is the training accuracy always a little bit higher than the test accuracy? Why?
2. Is the sentiment analysis task a little bit harder than you expected?
3. ... something else you would like to talk about