# Instructions for Project 1 - Sentiment Analysis

Hello everyone, this is Zhaowei Wang. I am glad to host the first project. My email is *zwanggy@connect.ust.hk*. Feel free to send me an email if you have any problem regarding this project.

In this project, you will try to work on a sentiment analysis task.
You will build a model to predict the scores (a.k.a. the "label" column in datasets, from 1 to 5) of each review.
For each review, you are given a piece of text. You can consider the predicted variables as categorical, ordinal or numerical.

Just a kind note: The codes and techniques introduced in the previous tutorials may come in handy. You can refer to the .ipynb notebooks for details.

## Important dates, submission requirements and grading policy 
**Important dates:**
- *March 16, 2024 (Saturday)*: Project starts
- *March 23, 2024 (Saturday)* Release the validation score of baselines
- *April 6, 2024, 23:59 (Saturday)*: `Submission Deadline`

**Submission requirements:**  
Each team leader is required to submit the groupNo.zip file on the Canvas. It shoud contain 
- `pred.csv`: Predictions on test data (please make sure you can successfully evaluate your validation predictions on the validation data with the help of evaluate.py). The file should contain two so-called columns, which are `id`
and `label`.
- report (1-2 pages of pdf)
- code (Frameworks and programming languages are not restricted)

**Grading policy:**  
We will check your report with your code and your model performance (in terms of Accuracy) on the test set.

| Grade | Classifier (80%)                                                   | Report (20%)                      |
|-------|--------------------------------------------------------------------|-----------------------------------|
| 50%   | Example code in tutorials or in Project 1 without any modification | submission                        |
| 75%   | A method that can outperform the easy baseline  | algorithm you used                |
| 95%   | A method that can outperform the hard baseline                     | detailed explanation and analysis, such as explorative data analysis, hyperparameters and ablation studies  |
| 100%  | A method that can outperform the hard baseline with at least one excellent idea  | excellent ideas, detailed explanation and solid analysis |

## Instruction Content
In this notebook, you are provided with the code snippets to start with.

The content follows previous lectures and tutorials. But some potentially useful python packages are also mentioned.

1. Loading data and saving predictions
    1. Loading data
    1. Saving predictions to file
1. Preprocessing
    1. Text data processing recap
    1. Explorative data analysis
1. Learning Baselines

## 1. Loading data and saving predictions

The same as previous tutorials, we use `pandas` as the basic tool to load & dump the data.
The key ingredient of our operation is the `DataFrame` in pandas.

In [1]:
import pandas as pd

In [3]:
# if you use Google Colab, un-comment this cell, modify `path_to_data` if needed, and run to mount data to `data`
# from google.colab import drive
# drive.mount('/content/drive')

# path_to_data = '/content/drive/MyDrive/HKUST stuff/COMP4332_Project1/data'
# !rm -f data
# !ln -s '/content/drive/MyDrive/HKUST stuff/COMP4332_Project1/data' data

### A. Loading data

The following code shows how to load the datasets for this project.  
Among which, we do not release the labels (the "label" column) for the test set. 
You may evaluate your trained model on the validation set instead.
However, your submitted predictions (``pred.csv``) should be generated on the test set.

Each year we release different data, so old models are not guaranteed to solve the new data.

In [2]:
def load_data(split_name='train', columns=['text', 'label'], folder='data'):
    '''
        "split_name" may be set as 'train', 'valid' or 'test' to load the corresponding dataset.
        
        You may also specify the column names to load any columns in the .csv data file.
        Among many, "text" can be used as model input, and "label" column is the labels (sentiment). 
    '''
    try:
        print(f"select [{', '.join(columns)}] columns from the {split_name} split")
        df = pd.read_csv(f'{folder}/{split_name}.csv')
        df = df.loc[:,columns]
        print("Success")
        return df
    except:
        print(f"Failed loading specified columns... Returning all columns from the {split_name} split")
        df = pd.read_csv(f'{folder}/{split_name}.csv')
        return df

Then you can extract the data by specifying the desired split and columns

In [23]:
train_df = load_data('train', columns=['text', 'label'], folder='data')
valid_df = load_data('valid', columns=['id','text', 'label'], folder='data')
# the test set labels (the 'label' column) are unavailable! So the following code will instead return all columns
test_df = load_data('test_no_label', columns=['id', 'text'], folder='data')

select [text, label] columns from the train split
Success
select [id, text, label] columns from the valid split
Success
select [id, text] columns from the test_no_label split
Success


In [6]:
train_df.head()

Unnamed: 0,text,label
0,Two Wolfgang Petersen directed films together ...,5
1,For fans of the series and the movies\nthis fi...,4
2,"I love the movie. The Blu-ray was fine, but it...",3
3,You don't know what is going on until the end ...,3
4,"We only watched a few minutes of the movie, du...",1


In [7]:
test_df.head()

Unnamed: 0,id,text
0,A3EMGD8RAEOK64_2907,"On our trip this past summer to Lunenberg, Nov..."
1,A2BOWU2PX28BET_5501,Excellent!! Most remakes fall short of the ori...
2,A100WO06OQR8BQ_10469,I started to watch this movie but it is such a...
3,A2H4LKU7CPIUU9_11364,Well! I must be terribly jaded. Or I am comple...
4,A14RF11JYGDKI8_23751,Dark and grim -- not a fun movie. Watch it fo...


In [8]:
print(len(train_df), len(valid_df), len(test_df))

18000 2000 4000


### B. Saving predictions to file

Your submitted predictions are supposed to be a .csv file containing two columns, i.e. (``id`` and ``label``). 

Here, as an example, we generate some random predictions as our answer, which are put in a DataFrame and output to a .csv file

After getting your model predictions on the test set, you may follow these steps to generate your ``pred.csv`` file. (By replacing the random predictions with your model predictions)

In [4]:
import numpy as np

In [10]:
random_pred = pd.DataFrame(data={
    'id': test_df['id'],
    'label': np.random.randint(0, 6, size=len(test_df))
})

In [11]:
random_pred.head()

Unnamed: 0,id,label
0,A3EMGD8RAEOK64_2907,5
1,A2BOWU2PX28BET_5501,5
2,A100WO06OQR8BQ_10469,0
3,A2H4LKU7CPIUU9_11364,1
4,A14RF11JYGDKI8_23751,2


In [12]:
random_pred.to_csv(f'random_pred.csv', index=False)

Then, you will get a ``random_pred.csv`` in your folder.

## 2. Preprocessing

Here are some preprocessing examples for your reference. For more details you may refer to the previous tutorials.

### A. Text data processing recap
In the tutorials, we have shown how to extract textual features using the `nltk` package

Remember to use the NLTK Downloader to obtain the resource first:
```
  >>> import nltk
  >>> nltk.download('stopwords')
  >>> nltk.download('punkt')
```

In [5]:
import nltk
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
stopwords = set(stopwords.words('english'))
ps = PorterStemmer()

def lower(s):
    """
    :param s: a string.
    return a string with lower characters
    Note that we allow the input to be nested string of a list.
    e.g.
    Input: 'Text mining is to identify useful information.'
    Output: 'text mining is to identify useful information.'
    """
    if isinstance(s, list):
        return [lower(t) for t in s]
    if isinstance(s, str):
        return s.lower()
    else:
        raise NotImplementedError("unknown datatype")


def tokenize(text):
    """
    :param text: a doc with multiple sentences, type: str
    return a word list, type: list
    e.g.
    Input: 'Text mining is to identify useful information.'
    Output: ['Text', 'mining', 'is', 'to', 'identify', 'useful', 'information', '.']
    """
    return nltk.word_tokenize(text)


def stem(tokens):
    """
    :param tokens: a list of tokens, type: list
    return a list of stemmed words, type: list
    e.g.
    Input: ['Text', 'mining', 'is', 'to', 'identify', 'useful', 'information', '.']
    Output: ['text', 'mine', 'is', 'to', 'identifi', 'use', 'inform', '.']
    """
    ### equivalent code
    # results = list()
    # for token in tokens:
    #     results.append(ps.stem(token))
    # return results

    return [ps.stem(token) for token in tokens]

def n_gram(tokens, n=1):
    """
    :param tokens: a list of tokens, type: list
    :param n: the corresponding n-gram, type: int
    return a list of n-gram tokens, type: list
    e.g.
    Input: ['text', 'mine', 'is', 'to', 'identifi', 'use', 'inform', '.'], 2
    Output: ['text mine', 'mine is', 'is to', 'to identifi', 'identifi use', 'use inform', 'inform .']
    """
    if n == 1:
        return tokens
    else:
        results = list()
        for i in range(len(tokens)-n+1):
            # tokens[i:i+n] will return a sublist from i th to i+n th (i+n th is not included)
            results.append(" ".join(tokens[i:i+n]))
        return results

def filter_stopwords(tokens):
    """
    :param tokens: a list of tokens, type: list
    return a list of filtered tokens, type: list
    e.g.
    Input: ['text', 'mine', 'is', 'to', 'identifi', 'use', 'inform', '.']
    Output: ['text', 'mine', 'identifi', 'use', 'inform', '.']
    """
    ### equivalent code
    # results = list()
    # for token in tokens:
    #     if token not in stopwords and not token.isnumeric():
    #         results.append(token)
    # return results

    return [token for token in tokens if token not in stopwords and not token.isnumeric()]

import numpy as np

def get_onehot_vector(feats, feats_dict):
    """
    :param data: a list of features, type: list
    :param feats_dict: a dict from features to indices, type: dict
    return a feature vector,
    """
    # initialize the vector as all zeros
    vector = np.zeros(len(feats_dict), dtype=np.float)
    for f in feats:
        # get the feature index, return -1 if the feature is not existed
        f_idx = feats_dict.get(f, -1)
        if f_idx != -1:
            # set the corresponding element as 1
            vector[f_idx] = 1
    return vector

Note that you can use the `map` function to apply your preprocessing functions into the dataframe.

In [14]:
for i in range(len(test_df)):
    try:
        tokenize(test_df.loc[i, 'text'])
    except: 
        print(i)

In [15]:
print(test_df.loc[1155])

id                                     A8NQVLIE0QVT4_7949
text    Great movie, even better dubb. Blu ray is the ...
Name: 1155, dtype: object


In [16]:
test_df['tokens'] = test_df['text'].map(tokenize).map(filter_stopwords).map(lower)
print(test_df['tokens'].head().to_string())

0    [on, trip, past, summer, lunenberg, ,, nova, s...
1    [excellent, !, !, most, remakes, fall, short, ...
2    [i, started, watch, movie, lousy, movie, i, st...
3    [well, !, i, must, terribly, jaded, ., or, i, ...
4    [dark, grim, --, fun, movie, ., watch, perform...


Besides `nltk`, `SpaCy` may also be useful.

You can explore it at https://spacy.io/

Let's install it with the following command (in terminal)

```bash
python -m pip install spacy
python -m spacy download en_core_web_sm
```

You may use spacy to extract linguistic features from texts

Example:

In [None]:
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

fmt = "{:10s},\t " * 8
print(fmt.format('raw', 'stem', 'PartOfSpeech', 'dependency', 'shape', 'is alpha', 'is stop', 'its childrens in the parsing tree'))
print('-'*140)
for token in doc:
    print(fmt.format(token.text, token.lemma_, token.pos_, token.dep_,
            token.shape_, str(token.is_alpha), str(token.is_stop), str(list(token.children))))

raw       ,	 stem      ,	 PartOfSpeech,	 dependency,	 shape     ,	 is alpha  ,	 is stop   ,	 its childrens in the parsing tree,	 
--------------------------------------------------------------------------------------------------------------------------------------------
Apple     ,	 Apple     ,	 PROPN     ,	 nsubj     ,	 Xxxxx     ,	 True      ,	 False     ,	 []        ,	 
is        ,	 be        ,	 AUX       ,	 aux       ,	 xx        ,	 True      ,	 True      ,	 []        ,	 
looking   ,	 look      ,	 VERB      ,	 ROOT      ,	 xxxx      ,	 True      ,	 False     ,	 [Apple, is, at, startup],	 
at        ,	 at        ,	 ADP       ,	 prep      ,	 xx        ,	 True      ,	 True      ,	 [buying]  ,	 
buying    ,	 buy       ,	 VERB      ,	 pcomp     ,	 xxxx      ,	 True      ,	 False     ,	 [U.K.]    ,	 
U.K.      ,	 U.K.      ,	 PROPN     ,	 dobj      ,	 X.X.      ,	 False     ,	 False     ,	 []        ,	 
startup   ,	 startup   ,	 NOUN      ,	 dep       ,	 xxxx      ,	 True      ,	 False  

SpaCy also allows you to use the embeddings for both sentence and words

Example:

In [None]:
print(doc, doc.vector[:5], '...')
for t in doc:
    print(t, t.vector[:5], '...')

Apple is looking at buying U.K. startup for $1 billion [-0.49226812  0.40478638  0.5446301   0.2650897   0.5588461 ] ...
Apple [-1.231103   -1.1917272   0.15840513  0.3598817   0.680532  ] ...
is [-1.0020912  -0.24935524  0.2847814   0.7584369  -0.5807612 ] ...
looking [-0.3423702  1.0666494  0.7334783  0.0921919 -1.0159137] ...
at [ 1.4327683   1.9650179   0.528621   -1.103754   -0.29277676] ...
buying [ 0.21374054  0.97006285 -0.37104425  0.25935912 -0.5281231 ] ...
U.K. [-0.4769047 -0.6881261 -1.0802059  0.9870316  1.3138596] ...
startup [-0.7687981   0.16186625  0.20556203 -0.70367986 -0.56370217] ...
for [-0.12731966  0.20830332  1.3329215  -0.43356493 -0.7824546 ] ...
$ [-0.5493651   1.2996746   0.19532208  0.46639207  1.8706077 ] ...
1 [-1.1942618   0.7871655   5.5716734   0.19170347  3.1280797 ] ...
billion [-1.3692445   0.12311826 -1.568583    2.0419886   2.91796   ] ...


For more usage of SpaCy, you can refer to its documentation at this link: https://spacy.io/usage

## 2. Baselines

Finally, we provide two example baselines for your reference. The first baseline extracts TF-iDF features from texts and use logistic regression to generate prediction. The second baseline uses Convolutional Neural Networks (CNNs) to generate prediction from texts.


We only consider its first 3k training samples. It is just an example, you can use the data as you like.

### TF-IDF + LR

In [17]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from sklearn.linear_model import LogisticRegression

In [24]:
x_train = train_df['text']
y_train = train_df['label']
x_valid = valid_df['text']
y_valid = valid_df['label']

In [25]:
from sklearn.decomposition import TruncatedSVD
tfidf = TfidfVectorizer(tokenizer=tokenize)
lr = LogisticRegression(tol=5e-3,max_iter=1000)
svd = TruncatedSVD(n_components=500)
steps = [('tfidf', tfidf),('Truncated SVD',svd),('lr', lr)]
pipe = Pipeline(steps)
print(pipe)

Pipeline(steps=[('tfidf',
                 TfidfVectorizer(tokenizer=<function tokenize at 0x000001DEF69F4220>)),
                ('Truncated SVD', TruncatedSVD(n_components=500)),
                ('lr', LogisticRegression(max_iter=1000, tol=0.005))])


In [26]:
pipe.fit(x_train, y_train)



In [33]:
y_pred = pipe.predict(x_valid)

print(classification_report(y_valid, y_pred))
print("\n\n")
print(confusion_matrix(y_valid, y_pred))
print('accuracy', np.mean(y_valid == y_pred))
results = pd.DataFrame({'id': valid_df['id'], 'text': valid_df['text'], 'label': y_pred})
results.to_csv('data/valid_pred.csv', index=False)
test_pred = pipe.predict(test_df['text'])
test_results = pd.DataFrame({'id': test_df['id'], 'text': test_df['text'], 'label': test_pred})
test_results.to_csv('pred.csv', index=False)
print(test_df.head())

              precision    recall  f1-score   support

           1       0.58      0.53      0.55       295
           2       0.40      0.13      0.19       198
           3       0.46      0.58      0.51       508
           4       0.46      0.42      0.44       523
           5       0.58      0.68      0.63       476

    accuracy                           0.51      2000
   macro avg       0.50      0.47      0.47      2000
weighted avg       0.50      0.51      0.50      2000




[[156  17  70  23  29]
 [ 57  25  90  16  10]
 [ 26  19 294 120  49]
 [ 14   0 145 219 145]
 [ 15   1  38  97 325]]
accuracy 0.5095
                     id                                               text
0   A3EMGD8RAEOK64_2907  On our trip this past summer to Lunenberg, Nov...
1   A2BOWU2PX28BET_5501  Excellent!! Most remakes fall short of the ori...
2  A100WO06OQR8BQ_10469  I started to watch this movie but it is such a...
3  A2H4LKU7CPIUU9_11364  Well! I must be terribly jaded. Or I am comple...
4