# HW3: Preprocessing

 <div class="alert alert-block alert-warning">Each assignment needs to be completed independently. Never ever copy others' work (even with minor modification, e.g. changing variable names). Anti-Plagiarism software will be used to check all submissions. </div>

## Problem Description

Generative models such as GPT may be used to fabricate indistinguishable fake customer reviews at a much lower cost. In this assignment, we'll use what we learned in Preprocessing module to compare ChatGPT-generated reviews with human-generated reviews. 
- A dataset with 300 reviews has been provided for you to use.
- Label: Binary label indicating the class (0=authentic, 1=machine-generated).
- The dataset can be found in this paper: https://arxiv.org/abs/2401.08825.

Hint: you may find it is convenient to use `Spacy` package for this assignment. Outputs displayed here are ONLY for your refereces! You may get slightly different outputs if you use different packages. 

In [1]:
import pandas as pd
import spacy
import nltk
import numpy as np
import string
from sklearn.preprocessing import normalize
from nltk.stem import WordNetLemmatizer 
nlp = spacy.load("en_core_web_sm")
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [2]:
data = pd.read_csv("review.csv")
data.head()

Unnamed: 0,text,label
0,The food is delicious and extremely flavorful....,0
1,"Okay, so I didn't expect much in terms of Kore...",0
2,I recently visited the new Veggie Grill in NYC...,1
3,I'm a fan of 5Napkin Burger for 2 distinct tim...,0
4,Joe's Pizza offers the finest slice in Times S...,1


## Q1. Tokenize function `(2 points)`

Define a function `tokenize(doc, lemmatized = False, remove_stopword = False, remove_punct = True, pos_tag = False)`  as follows:
   - Take five parameters: 
       - `doc`: a document (e.g., a review)
       - `lemmatized`: an optional boolean parameter to indicate if tokens are lemmatized. The default value is False (i.e., tokens are not lemmatized).
       - `remove_stopword`: an optional boolean parameter to remove stop words. The default value is False (i.e., do not remove stop words). 
       - `remove_punct`: an optional boolean parameter to remove punctuations. The default value is True (i.e., remove punctuations).
      - `pos_tag`: whether the POS tag of each token is returned. Please use the universal POS tag. The default is False (i.e., do not return POS tag).
   - Split the input document into unigrams and also clean up tokens as follows:
       - if `lemmatized` is turned on, lemmatize all unigrams. 
       - if `remove_stopword` is set to True, remove all stop words. 
       - if `remove_punct` is set to True, remove all punctuation tokens. 
       - remove all empty tokens and lowercase all the tokens.
       - if `pos_tag = True`, retrieve the POS tag for each of the resulting tokens and make a tuple (token, pos_tag) for the token.
   - Return the list of tokens (including POS tag if the option is on) obtained for the document after all the processing. 
   
(Hint: you can use spacy package for this task. For reference, check https://spacy.io/api/token#attributes)

In [3]:
def tokenize(doc, lemmatized=False, remove_stopword=False, remove_punct = True, pos_tag = False):
    doc = nlp(doc.lower()) 
    tokens = [token for token in doc if not token.is_space]
    
    if remove_punct:
        tokens = [token.text for token in tokens if not token.is_punct]
        
    if remove_stopword:
        spacy_stopwords = list(spacy.lang.en.stop_words.STOP_WORDS)
        if type(tokens[0]) == str:
            tokens = [token for token in tokens if token not in spacy_stopwords]
        else:
            tokens = [token.text for token in tokens if token.text not in spacy_stopwords]

    if lemmatized:
        if type(tokens[0]) != spacy.tokens.token.Token:
            tokens = [nlp(token)[0] for token in tokens]
        tokens = [token.lemma_ for token in tokens]
        if remove_stopword:
            tokens = [token for token in tokens if token not in spacy_stopwords]
    
    if pos_tag:
        if type(tokens[0]) != spacy.tokens.token.Token:
            tokens = [nlp(token)[0] for token in tokens]
        tokens = [(token.text, token.pos_) for token in tokens]

    return tokens

Test your function with different parameter configuration and observe the differences in the resulting tokens.

In [4]:
# For simplicity, We will test one document generated by GPT

print(data["text"].iloc[2] + "\n")

print(f"1.lemmatized=False, remove_stopword=False, remove_punct = True,  pos_tag = False:\n \
{tokenize(data['text'].iloc[0], lemmatized=False, remove_stopword=False, remove_punct = True, pos_tag = False)}\n")

print(f"2.lemmatized=False, remove_stopword=False, remove_punct = True,  pos_tag = True:\n \
{tokenize(data['text'].iloc[0], lemmatized=False, remove_stopword=False, remove_punct = True, pos_tag = True)}\n")

print(f"3.lemmatized=True, remove_stopword=True, remove_punct = True, pos_tag = False:\n \
{tokenize(data['text'].iloc[0], lemmatized=True, remove_stopword=True, remove_punct = True, pos_tag = False)}\n")

print(f"4.lemmatized=True, remove_stopword=True, remove_punct = True, pos_tag = True:\n \
{tokenize(data['text'].iloc[0], lemmatized=True, remove_stopword=True, remove_punct = True, pos_tag = True)}\n")


I recently visited the new Veggie Grill in NYC and it was an absolute delight! The Nashville hot 'chicken' blew me away with its spot-on flavors and textures – it's a must-try. The vegan buffalo 'wings' were the highlight for me; truly the best I've ever had, hands down. The Mac and cheese also did not disappoint, rounding out a meal that has me eager to return. Anyone on the fence about Veggie Grill needs to take the plunge. Get ready to be wowed by a vegan menu that hits all the right notes. I'm already planning my next visit to conquer the rest of the menu. This place is a solid 10/10.

1.lemmatized=False, remove_stopword=False, remove_punct = True,  pos_tag = False:
 ['the', 'food', 'is', 'delicious', 'and', 'extremely', 'flavorful', 'the', 'portion', 'sizes', 'are', 'huge', 'you', 'wo', "n't", 'leave', 'hungry', 'i', 'ordered', 'chicken', 'and', 'waffles', 'cornbread', 'is', 'served', 'here', 'for', 'free', 'i', 'went', 'with', 'my', 'sister', 'and', 'she', 'got', 'meatloaf', 'and

## Q2. Quantify concreteness (3 points)


`Concreteness` can increase a message's persuasion. The concreteness can be measured by the use of :
- `article (DET)` (e.g., a, an, and the), 
- `adpositions (ADP)` (e.g., in, at, of, on, etc), and
- `adjectives(ADJ)` before `nouns(NN)`, i.e., a bigram where the first word is adjective and the second one is noun.

Note: corresponding POS tags have been provided in the parentheses above.

Define a function `compute_concreteness(doc)` as follows:
- Input argument is a document, i.e., `doc`
- Call your function defined in Q1 with `lemmatized=False, remove_stopword=False, remove_punct = False, pos_tag = True` to generate a list of tokens with POS tags for the input document.
- Generate bigrams out of the token list
- Find unigrams with tags `article` or `adposition` 
- Find bigrams where the first word is `adjective` and the second is `noun`.
- Compute `concereness` score as:  `((the counts of unigrams found) +  2 * (the number of bigrams found))/(total non-punctuation tokens)`.
- return the concreteness score, the list of article tokens, the list of adposition tokens, and the list of bigrams.
- Do you think, overall, ChatGPT-generated answers are more concrete than human answers? Test your hypothesis statistically (e.g., t-test) and explain your finding in text. 


In [5]:
def compute_concreteness(doc):
    tokens = tokenize(doc, lemmatized=False, remove_stopword=False, remove_punct=False, pos_tag=True)
    #find unigrams with tags article or adposition
    articles = [(token, tag) for (token, tag) in tokens if tag=='DET']
    adpositions = [(token, tag) for (token, tag) in tokens if tag=='ADP']
    #find bigrams with first word adjective and second word noun
    bigrams=list(nltk.bigrams(tokens))
    phrases=[(x[0],y[0]) for (x,y) in bigrams if x[1]==('ADJ') and y[1]==('NOUN')]
    #compute concreteness score ((counts of unigrams) + 2 * (number of bigrams)) / (total non-punct tokens)
    nonPunct = [token for (token, tag) in tokens if tag != 'PUNCT']
    concreteness = ((len(articles)+len(adpositions))+2*(len(phrases)))/ (len(nonPunct))
    # return concreteness score, list of article tokens, list of adposition tokens, list of bigrams
    return (concreteness, articles, adpositions, phrases)
    

In [6]:
concreteness, articles, adpositions, quantifier = compute_concreteness(data["text"].iloc[1])
print(f"Question: {data['text'].iloc[2]} \n\nConcreteness: {concreteness :.4f} \n\nArticles:  {articles} \n\nAdpositions: {adpositions} \n\n(ADJ, NOUNS): {quantifier}")


Question: I recently visited the new Veggie Grill in NYC and it was an absolute delight! The Nashville hot 'chicken' blew me away with its spot-on flavors and textures – it's a must-try. The vegan buffalo 'wings' were the highlight for me; truly the best I've ever had, hands down. The Mac and cheese also did not disappoint, rounding out a meal that has me eager to return. Anyone on the fence about Veggie Grill needs to take the plunge. Get ready to be wowed by a vegan menu that hits all the right notes. I'm already planning my next visit to conquer the rest of the menu. This place is a solid 10/10. 

Concreteness: 0.3030 

Articles:  [('this', 'DET'), ('the', 'DET'), ('the', 'DET'), ('the', 'DET'), ('the', 'DET'), ('the', 'DET'), ('the', 'DET'), ('the', 'DET'), ('all', 'DET'), ('the', 'DET'), ('a', 'DET'), ('a', 'DET'), ('any', 'DET'), ('the', 'DET'), ('a', 'DET'), ('the', 'DET'), ('a', 'DET'), ('the', 'DET'), ('the', 'DET'), ('the', 'DET'), ('the', 'DET'), ('a', 'DET'), ('no', 'DET'),

In [7]:
data['concrete'] = data['text'].apply(lambda x: compute_concreteness(x)[0])

In [8]:
data.head()

Unnamed: 0,text,label,concrete
0,The food is delicious and extremely flavorful....,0,0.227941
1,"Okay, so I didn't expect much in terms of Kore...",0,0.30303
2,I recently visited the new Veggie Grill in NYC...,1,0.307692
3,I'm a fan of 5Napkin Burger for 2 distinct tim...,0,0.284091
4,Joe's Pizza offers the finest slice in Times S...,1,0.326087


In [9]:
# Is LLM-generated review more concrete than human review? What is your finding

from scipy import stats
human = data[data['label']==0]['concrete']
gpt = data[data['label']==1]['concrete']

(statistic, pValue) = stats.ttest_ind(gpt, human)
print ("t-statistic: ", statistic, "  p-value: ", pValue)
if (pValue<0.01):
    print("The concreteness score of ChatGPT generated text is on average greater than the concreteness score of human generated text, to a statistically significant degree.\nWe can say that LLM-generated text is more concrete than human reviews.")

t-statistic:  14.446783496031298   p-value:  3.192650740626651e-36
The concreteness score of ChatGPT generated text is on average greater than the concreteness score of human generated text, to a statistically significant degree.
We can say that LLM-generated text is more concrete than human reviews.


## Q3. Sentiment Analysis (3 points)


Let's check if there is any difference in sentiment between ChatGPT-generated and human-generated text.


Define a function `compute_sentiment(tokenized_docs, pos, neg )` as follows:
- take three parameters:
    - `tokenized_docs` is the tokenized reviews by the `tokenize` function in Q1.
    - `pos` (`neg`) is the lists of positive (negative) words, which can be find in Canvas Preprocessing module.
- for each doc, compute the sentiment as `(#pos - #neg )/(#pos + #neg)`, where `#pos`(`#neg`) is the number of positive (negative) words. If a doc contains none of the positive or negative words, set the sentiment to 0.
- return the sentiment column "sentiment" of DataFrame.


Analysis: 
- Try different tokenization parameter configurations (lemmatized, remove_stopword, remove_punct), and observe how sentiment results change.
- Do you think, in general, which tokenization configuration should be used? Why does this combination make the most senese?
- Do you think, overall, ChatGPT-generated reviews are more posive or negative than human-generated ones? Use data to support your conclusion.

In [10]:
def compute_sentiment(tokenized_docs, pos, neg ):
  sentimentCol=[]
  pos=list(pos[0])
  neg=list(neg[0])
  for doc in tokenized_docs:
    positives = [token for token in doc if token in pos]
    countPos=len(positives)
    negatives = [token for token in doc if token in neg]
    countNeg = len(negatives)
    if countPos==0 and countNeg==0:
      sentiment=0
    else:
      sentiment = (countPos-countNeg)/(countPos+countNeg)
    sentimentCol.append(sentiment)
  return sentimentCol

In [11]:
pos = pd.read_csv("positive-words-1.txt", header = None)
pos.head()

neg = pd.read_csv("negative-words-1.txt", header = None)
neg.head()

Unnamed: 0,0
0,a+
1,abound
2,abounds
3,abundance
4,abundant


Unnamed: 0,0
0,2-faced
1,2-faces
2,abnormal
3,abolish
4,abominable


In [12]:
# get tokens, try different parameters, for example:
docs=data["text"]
# use your function
tokenized_docs = docs.apply(tokenize)
sentimentsList=compute_sentiment(tokenized_docs, pos, neg)
print(sentimentsList)

[0.25, 0.7272727272727273, 0.7777777777777778, 0.6666666666666666, 1.0, 0.6666666666666666, 1.0, 0.5, 0.5, 1.0, 0.8666666666666667, 0.15789473684210525, 0.2, 1.0, 0.5789473684210527, 1.0, 1.0, 0.7142857142857143, 0.5454545454545454, 0.8666666666666667, 0.45454545454545453, 0.5555555555555556, 0.8461538461538461, 0.8461538461538461, 0.7142857142857143, 0.45454545454545453, 0.5, 0.3333333333333333, 0.8181818181818182, 0.8, 0.6842105263157895, 1.0, 0.0, 1.0, 0.75, 0.7333333333333333, 0.8571428571428571, 0.6666666666666666, 0.8461538461538461, 0.7142857142857143, 0.6666666666666666, 1.0, 0.6666666666666666, 0.55, 0.6, 0.5384615384615384, 0.7142857142857143, 1.0, 0.5384615384615384, 0.75, 0.18181818181818182, 0.6, 0.7142857142857143, 0.8181818181818182, 0.8461538461538461, 0.25, 1.0, -0.1111111111111111, 0.5, 0.25, 0.5555555555555556, 0.75, 1.0, 0.6923076923076923, 0.5714285714285714, 1.0, 1.0, 0.6, 0.7142857142857143, 1.0, 0.5, 0.3684210526315789, 1.0, 1.0, 1.0, 0.7142857142857143, 0.6, 0.

In [13]:
tokensNoStop = docs.apply(lambda x: tokenize(x, lemmatized=False, remove_stopword=True, remove_punct = False))
sentimentNoStop = compute_sentiment(tokensNoStop, pos, neg)
print(sentimentNoStop)

[0.25, 0.6842105263157895, 0.7777777777777778, 0.6666666666666666, 1.0, 0.6666666666666666, 1.0, 0.5, 0.5, 1.0, 0.8571428571428571, 0.15789473684210525, 0.2, 1.0, 0.5789473684210527, 1.0, 1.0, 0.6666666666666666, 0.5454545454545454, 0.8666666666666667, 0.45454545454545453, 0.5555555555555556, 0.8461538461538461, 0.8333333333333334, 0.6666666666666666, 0.3333333333333333, 0.5, 0.3333333333333333, 0.8181818181818182, 0.8, 0.6842105263157895, 1.0, 0.0, 1.0, 0.75, 0.6923076923076923, 0.8571428571428571, 0.6521739130434783, 0.8333333333333334, 0.7142857142857143, 0.6666666666666666, 1.0, 0.6666666666666666, 0.5263157894736842, 0.6, 0.5384615384615384, 0.7142857142857143, 1.0, 0.5384615384615384, 0.75, 0.18181818181818182, 0.6, 0.6470588235294118, 0.8181818181818182, 0.8461538461538461, 0.25, 1.0, -0.1111111111111111, 0.5, 0.25, 0.5555555555555556, 0.75, 1.0, 0.6923076923076923, 0.5714285714285714, 1.0, 1.0, 0.6, 0.7142857142857143, 1.0, 0.5, 0.3333333333333333, 1.0, 1.0, 1.0, 0.692307692307

In [14]:
tokensNoPunct =  docs.apply(lambda x: tokenize(x, lemmatized=False, remove_stopword=False, remove_punct=True))
sentimentNoPunct = compute_sentiment(tokensNoPunct, pos, neg)
print(sentimentNoPunct)

[0.25, 0.7272727272727273, 0.7777777777777778, 0.6666666666666666, 1.0, 0.6666666666666666, 1.0, 0.5, 0.5, 1.0, 0.8666666666666667, 0.15789473684210525, 0.2, 1.0, 0.5789473684210527, 1.0, 1.0, 0.7142857142857143, 0.5454545454545454, 0.8666666666666667, 0.45454545454545453, 0.5555555555555556, 0.8461538461538461, 0.8461538461538461, 0.7142857142857143, 0.45454545454545453, 0.5, 0.3333333333333333, 0.8181818181818182, 0.8, 0.6842105263157895, 1.0, 0.0, 1.0, 0.75, 0.7333333333333333, 0.8571428571428571, 0.6666666666666666, 0.8461538461538461, 0.7142857142857143, 0.6666666666666666, 1.0, 0.6666666666666666, 0.55, 0.6, 0.5384615384615384, 0.7142857142857143, 1.0, 0.5384615384615384, 0.75, 0.18181818181818182, 0.6, 0.7142857142857143, 0.8181818181818182, 0.8461538461538461, 0.25, 1.0, -0.1111111111111111, 0.5, 0.25, 0.5555555555555556, 0.75, 1.0, 0.6923076923076923, 0.5714285714285714, 1.0, 1.0, 0.6, 0.7142857142857143, 1.0, 0.5, 0.3684210526315789, 1.0, 1.0, 1.0, 0.7142857142857143, 0.6, 0.

In [15]:
tokensNoPnoS=  docs.apply(lambda x: tokenize(x, lemmatized=False, remove_stopword=True, remove_punct=True))
sentimentNoPnoS = compute_sentiment(tokensNoPnoS, pos, neg)
print(sentimentNoPnoS)

[0.25, 0.6842105263157895, 0.7777777777777778, 0.6666666666666666, 1.0, 0.6666666666666666, 1.0, 0.5, 0.5, 1.0, 0.8571428571428571, 0.15789473684210525, 0.2, 1.0, 0.5789473684210527, 1.0, 1.0, 0.6666666666666666, 0.5454545454545454, 0.8666666666666667, 0.45454545454545453, 0.5555555555555556, 0.8461538461538461, 0.8333333333333334, 0.6666666666666666, 0.3333333333333333, 0.5, 0.3333333333333333, 0.8181818181818182, 0.8, 0.6842105263157895, 1.0, 0.0, 1.0, 0.75, 0.6923076923076923, 0.8571428571428571, 0.6521739130434783, 0.8333333333333334, 0.7142857142857143, 0.6666666666666666, 1.0, 0.6666666666666666, 0.5263157894736842, 0.6, 0.5384615384615384, 0.7142857142857143, 1.0, 0.5384615384615384, 0.75, 0.18181818181818182, 0.6, 0.6470588235294118, 0.8181818181818182, 0.8461538461538461, 0.25, 1.0, -0.1111111111111111, 0.5, 0.25, 0.5555555555555556, 0.75, 1.0, 0.6923076923076923, 0.5714285714285714, 1.0, 1.0, 0.6, 0.7142857142857143, 1.0, 0.5, 0.3333333333333333, 1.0, 1.0, 1.0, 0.692307692307

In [16]:
tokensLemmatized =  docs.apply(lambda x: tokenize(x, lemmatized=True, remove_stopword=False, remove_punct=False))
sentimentLemmatized = compute_sentiment(tokensLemmatized, pos, neg)
print(sentimentLemmatized)


[0.4, 0.7142857142857143, 0.6, 0.6666666666666666, 0.8, 0.6666666666666666, 1.0, 0.5714285714285714, 0.7142857142857143, 1.0, 0.8666666666666667, 0.2222222222222222, 0.5, 1.0, 0.5, 1.0, 1.0, 0.7142857142857143, 0.5454545454545454, 0.75, 0.38461538461538464, 0.6, 0.8666666666666667, 0.8461538461538461, 0.5, 0.45454545454545453, 0.5, 0.3333333333333333, 0.8181818181818182, 0.6363636363636364, 0.6666666666666666, 1.0, 0.0, 1.0, 0.5555555555555556, 0.7333333333333333, 0.7333333333333333, 0.68, 0.8461538461538461, 0.6, 0.6, 1.0, 0.6666666666666666, 0.5238095238095238, 0.6, 0.5714285714285714, 0.8461538461538461, 1.0, 0.4666666666666667, 1.0, 0.18181818181818182, 0.6, 0.7272727272727273, 0.8181818181818182, 0.8571428571428571, 0.25, 1.0, -0.05263157894736842, 0.5454545454545454, 0.42857142857142855, 0.5555555555555556, 0.75, 0.8, 0.6923076923076923, 0.4117647058823529, 1.0, 1.0, 0.6, 0.7142857142857143, 0.7777777777777778, 0.6363636363636364, 0.4, 1.0, 0.5, 1.0, 0.7142857142857143, 0.6, 0.33

In [17]:
tokensAllTrue =  docs.apply(lambda x: tokenize(x, lemmatized=True, remove_stopword=True, remove_punct=True))
sentimentAllTrue = compute_sentiment(tokensAllTrue, pos, neg)
print(sentimentAllTrue)

[0.4, 0.875, 0.5555555555555556, 0.6, 0.8, 0.6666666666666666, 1.0, 0.5714285714285714, 0.5, 1.0, 0.8571428571428571, 0.2, 0.5, 1.0, 0.5, 1.0, 1.0, 0.6666666666666666, 0.6190476190476191, 0.75, 0.5454545454545454, 0.6, 0.8666666666666667, 1.0, 0.42857142857142855, 0.3333333333333333, 0.3333333333333333, 0.3333333333333333, 0.8181818181818182, 0.6363636363636364, 0.7777777777777778, 1.0, 0.0, 1.0, 0.75, 0.6923076923076923, 0.7142857142857143, 0.6521739130434783, 0.8181818181818182, 0.6, 0.6, 1.0, 0.6666666666666666, 0.5384615384615384, 0.6, 0.5714285714285714, 0.8461538461538461, 1.0, 0.4666666666666667, 1.0, 0.14285714285714285, 0.3333333333333333, 0.6470588235294118, 0.8181818181818182, 0.8666666666666667, 0.42857142857142855, 1.0, -0.1111111111111111, 0.5238095238095238, 0.25, 0.4, 0.75, 0.7777777777777778, 0.6923076923076923, 0.375, 1.0, 1.0, 0.6, 0.7142857142857143, 0.75, 0.6363636363636364, 0.3333333333333333, 1.0, 0.3333333333333333, 0.7142857142857143, 0.5384615384615384, 0.4666

In [28]:
spacy_stopwords = list(spacy.lang.en.stop_words.STOP_WORDS)
common = [x for x in spacy_stopwords if x in list(pos[0]) or x in list(neg[0])]
print(common)

['well', 'enough', 'top']


In [19]:
# compare between ChatGPT-generated reviews and human reviews
data['sentiment'] = sentimentsList
data['sentimentNoPunct'] = sentimentNoPunct
data['sentimentNoStop'] = sentimentNoStop
data['sentimentNoPnoS'] = sentimentNoPnoS
data['sentimentLemmatized'] = sentimentLemmatized
data['sentimentAllTrue'] = sentimentAllTrue
data.head(10)


Unnamed: 0,text,label,concrete,sentiment,sentimentNoPunct,sentimentNoStop,sentimentNoPnoS,sentimentLemmatized,sentimentAllTrue
0,The food is delicious and extremely flavorful....,0,0.227941,0.25,0.25,0.25,0.25,0.4,0.4
1,"Okay, so I didn't expect much in terms of Kore...",0,0.30303,0.727273,0.727273,0.684211,0.684211,0.714286,0.875
2,I recently visited the new Veggie Grill in NYC...,1,0.307692,0.777778,0.777778,0.777778,0.777778,0.6,0.555556
3,I'm a fan of 5Napkin Burger for 2 distinct tim...,0,0.284091,0.666667,0.666667,0.666667,0.666667,0.666667,0.6
4,Joe's Pizza offers the finest slice in Times S...,1,0.326087,1.0,1.0,1.0,1.0,0.8,0.8
5,I recently dined at this gem of a restaurant a...,1,0.423077,0.666667,0.666667,0.666667,0.666667,0.666667,0.666667
6,"If you like pad Thai, I highly recommend this ...",0,0.230769,1.0,1.0,1.0,1.0,1.0,1.0
7,"Tucked away off the beaten path, this eatery o...",1,0.451327,0.5,0.5,0.5,0.5,0.571429,0.571429
8,The hostess was gracious and the place was emp...,0,0.219512,0.5,0.5,0.5,0.5,0.714286,0.5
9,Square pepperoni pizza?!? Sayyyy what?!? Stayi...,0,0.290598,1.0,1.0,1.0,1.0,1.0,1.0


In [25]:
# based on comparing the columns in the table above, we can see that removing punctuation has no bearning on sentiment score (as expected)  so we should keep columns that 
# do remove punctuation for computational purposes and focus mostly on the difference between lemmatization, removing stop words, or doing both

data.drop(columns=["sentimentNoPunct", "sentimentNoPnoS"])

Unnamed: 0,text,label,concrete,sentiment,sentimentNoStop,sentimentLemmatized,sentimentAllTrue,mean_sentiment
0,The food is delicious and extremely flavorful....,0,0.227941,0.250000,0.250000,0.400000,0.400000,0.325000
1,"Okay, so I didn't expect much in terms of Kore...",0,0.303030,0.727273,0.684211,0.714286,0.875000,0.750192
2,I recently visited the new Veggie Grill in NYC...,1,0.307692,0.777778,0.777778,0.600000,0.555556,0.677778
3,I'm a fan of 5Napkin Burger for 2 distinct tim...,0,0.284091,0.666667,0.666667,0.666667,0.600000,0.650000
4,Joe's Pizza offers the finest slice in Times S...,1,0.326087,1.000000,1.000000,0.800000,0.800000,0.900000
...,...,...,...,...,...,...,...,...
295,I recently dined at this charming spot for a w...,1,0.323529,0.875000,0.875000,0.684211,0.666667,0.775219
296,IHop has better food the further you go down s...,0,0.194444,0.333333,0.333333,0.333333,0.000000,0.250000
297,Favorite new spot. Everything was really good ...,0,0.262295,1.000000,1.000000,0.800000,0.636364,0.859091
298,"Tavolino, the sibling establishment to the bel...",1,0.401639,0.857143,0.846154,0.866667,0.846154,0.854029


In [24]:
data['mean_sentiment'] = data[['sentiment', 'sentimentNoStop','sentimentLemmatized', 'sentimentAllTrue']].mean(axis=1)
data.head(10)

Unnamed: 0,text,label,concrete,sentiment,sentimentNoPunct,sentimentNoStop,sentimentNoPnoS,sentimentLemmatized,sentimentAllTrue,mean_sentiment
0,The food is delicious and extremely flavorful....,0,0.227941,0.25,0.25,0.25,0.25,0.4,0.4,0.325
1,"Okay, so I didn't expect much in terms of Kore...",0,0.30303,0.727273,0.727273,0.684211,0.684211,0.714286,0.875,0.750192
2,I recently visited the new Veggie Grill in NYC...,1,0.307692,0.777778,0.777778,0.777778,0.777778,0.6,0.555556,0.677778
3,I'm a fan of 5Napkin Burger for 2 distinct tim...,0,0.284091,0.666667,0.666667,0.666667,0.666667,0.666667,0.6,0.65
4,Joe's Pizza offers the finest slice in Times S...,1,0.326087,1.0,1.0,1.0,1.0,0.8,0.8,0.9
5,I recently dined at this gem of a restaurant a...,1,0.423077,0.666667,0.666667,0.666667,0.666667,0.666667,0.666667,0.666667
6,"If you like pad Thai, I highly recommend this ...",0,0.230769,1.0,1.0,1.0,1.0,1.0,1.0,1.0
7,"Tucked away off the beaten path, this eatery o...",1,0.451327,0.5,0.5,0.5,0.5,0.571429,0.571429,0.535714
8,The hostess was gracious and the place was emp...,0,0.219512,0.5,0.5,0.5,0.5,0.714286,0.5,0.553571
9,Square pepperoni pizza?!? Sayyyy what?!? Stayi...,0,0.290598,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [26]:
mean_human = data[data['label'] == 0]['mean_sentiment'].mean()
mean_gpt = data[data['label'] == 1]['mean_sentiment'].mean()
print("Mean sentiment score for human reviews:", mean_human)
print("Mean sentiment score for GPT text:", mean_gpt)



mean_human_lemmatized = data[data['label'] == 0]['sentimentLemmatized'].mean()
mean_gpt_lemmatized = data[data['label'] == 1]['sentimentLemmatized'].mean()
print("mean human lemmatized: ", mean_human_lemmatized)
print("mean gpt lemmatized: ", mean_gpt_lemmatized)


mean_human_allTrue = data[data['label'] == 0]['sentimentAllTrue'].mean()
mean_gpt_allTrue = data[data['label'] == 1]['sentimentAllTrue'].mean()
print("mean human allTrue: ", mean_human_allTrue)
print("mean gpt allTrue: ", mean_gpt_allTrue)

Mean sentiment score for human reviews: 0.6445651959414616
Mean sentiment score for GPT text: 0.6618849946640805
mean human lemmatized:  0.6569835523952601
mean gpt lemmatized:  0.6542593729552788
mean human allTrue:  0.640534850156153
mean gpt allTrue:  0.6430618691186898


## Q4 Compute TF-IDF and Compare Top Adjectives `(2 point)`

Define a function `compute_tf_idf(tokenized_docs)` as follows: 
- Take paramter `tokenized_docs`, i.e., a list of tokenized documents by `tokenize` function in Q1
- Calculate tf_idf weights as shown in lecture notes 
- Return the smoothed normalized `tf_idf` array, where each row stands for a document and each column denotes a word.
- Use the `tokenize` function in Q1 with pos tags, extract the top adjectives by tfidf weights for GPT reviews, and do the same for human reviews. Can you see any differences? What types of adjectives does ChatGPT prefer? Explain in text. 

In [36]:
def compute_tfidf(tokenized_docs):
    #calculate tf_idf
# Step 1. get tokens of each document as list   
# create token count dictionary
# step 2. process all documents to get a dictionary of dictionaries
    docs_tokens={idx: nltk.FreqDist(doc) for idx,doc in enumerate(tokenized_docs)}
    # print(docs_tokens)
# step 3. get document-term matrix, contruct a document-term matrix where each row is a doc each column is a token and the value is the frequency of the token
    dtm=pd.DataFrame.from_dict(docs_tokens, orient="index")
    dtm=dtm.fillna(0)
    dtm = dtm.sort_index(axis = 0) # sort by index (i.e. doc id)
# step 4. get normalized term frequency (tf) matrix, convert dtm to numpy arrays
    dtm2=dtm.values
    doc_len=dtm2.sum(axis=1) # sum the value of each row
# divide dtm matrix by the doc length matrix
    tf=np.divide(dtm2, doc_len[:,None])  # set float precision to print nicely
    np.set_printoptions(precision=2)
# step 5. get idf get document frequency
    df=np.where(dtm2>0,1,0)
    smoothed_idf=np.log(np.divide(len(docs)+1, np.sum(df, axis=0)+1))+1
 # step 6. get tf-idf, by default normalize by row
    smoothed_tf_idf=normalize(tf*smoothed_idf)
    return smoothed_tf_idf    # return smoothed normalized tf_idf array where each row stands for a document and each column denotes a word    

In [37]:
# Try different tokenization options to see how these options affect TFIDF matrix:
tokens1 = data["text"].apply(lambda x: tokenize(x, lemmatized=False, remove_stopword=False, remove_punct = False, pos_tag = False))
dtm1 = compute_tfidf(tokens1)
print(f"1.lemmatized=False, remove_stopword=False, remove_punct = True:\n \
Shape: {dtm1.shape}\n")


tokens2 = data["text"].apply(lambda x: tokenize(x, lemmatized=True, remove_stopword=True, remove_punct = True, pos_tag = False))
dtm2 = compute_tfidf(tokens2)
print(f"2.lemmatized=True, remove_stopword=True, remove_punct = True:\n \
Shape: {dtm2.shape}\n")

1.lemmatized=False, remove_stopword=False, remove_punct = True:
 Shape: (300, 48831)

2.lemmatized=True, remove_stopword=True, remove_punct = True:
 Shape: (300, 4020)



In [38]:
# Use the tokenize function in Q1 with pos tags
tokens3 = data["text"].apply(lambda x: tokenize(x, lemmatized=True, remove_stopword=True, remove_punct = True, pos_tag = True))
dtm3 = compute_tfidf(tokens3)

In [253]:
# Extract the top adjectives by tfidf weights for GPT reviews, and do the same for human reviews. 
# Can you see any differences? Explain in text.









top 20 words for ChatGPP reviews: [('delightful', 0.036282600079039454), ('perfect', 0.020108343933400736), ('culinary', 0.019677386314729), ('exceptional', 0.01847081801336628), ('flavorful', 0.017229612475560308), ('short', 0.01580233980043591), ('satisfying', 0.015228795036514574), ('charming', 0.013435044398147465), ('pleasant', 0.013299705117961734), ('cozy', 0.01285143061425164), ('worth', 0.012482041226243723), ('enjoyable', 0.012410384009811218), ('classic', 0.011914560557052473), ('attentive', 0.010968466561321439), ('local', 0.010921851806039296), ('recent', 0.010803043588672578), ('quick', 0.010671676294506128), ('impressive', 0.010629782650871002), ('decent', 0.010627216418500953), ('japanese', 0.010227080091596617)]

top 20 words for Human reviews: [('good', 0.04456433434603564), ('great', 0.029143511024871446), ('delicious', 0.02279444210006634), ('nice', 0.02145287434723795), ('amazing', 0.01749131399980889), ('small', 0.015366299020570089), ('little', 0.0149177042824373

## Q5 (Bonus): Further Analysis (Open question, 2 points)


Can you investigate the following linguistic differences between human and chatgpt-generated answers:
- Readability
- Coherence

You need to implement your ideas.