Positive/negative word lists provided by:

>   Minqing Hu and Bing Liu. "Mining and Summarizing Customer Reviews." 
>       Proceedings of the ACM SIGKDD International Conference on Knowledge 
>       Discovery and Data Mining (KDD-2004), Aug 22-25, 2004, Seattle, 
>       Washington, USA.
>   Bing Liu, Minqing Hu and Junsheng Cheng. "Opinion Observer: Analyzing 
>       and Comparing Opinions on the Web." Proceedings of the 14th 
>       International World Wide Web conference (WWW-2005), May 10-14, 
>       2005, Chiba, Japan.

***Baseline Model***

This is our baseline model. It counts the distinct positive and negative lexicon words that occur in each review and labels the review positive or negative based on the difference between the two counts.
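As a rough illustration of the counting rule, here is a minimal, self-contained sketch. The tiny word sets and the `classify_by_counts` helper are hypothetical stand-ins; the notebook itself uses the Hu and Liu lexicons and its own preprocessing.

```python
# Toy lexicons (hypothetical); the notebook loads the Hu and Liu word lists instead.
POSITIVE = {"good", "great", "wonderful"}
NEGATIVE = {"bad", "boring", "awful"}

def classify_by_counts(text, threshold=0):
    """Label a review by comparing distinct positive and negative word matches."""
    words = set(text.lower().split())
    positive_n = len(words & POSITIVE)
    negative_n = len(words & NEGATIVE)
    # Ties (and anything at or below the threshold) fall to 'negative'.
    return "positive" if positive_n - negative_n > threshold else "negative"

print(classify_by_counts("a wonderful and great film with one boring scene"))
print(classify_by_counts("bad plot and awful acting"))
```

With a threshold of 0, a review with equally many positive and negative matches is labelled negative, which is one reason a rule like this may benefit from tuning the threshold.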

In [1]:
# Load the libraries
import numpy as np
import pandas as pd
import swifter  # registers the .swifter accessor used below; harmless if utils already imports it
# https://online.stat.psu.edu/stat504/lesson/1/1.7
from utils import preprocesser_text, train_test_split, evaluate

In [2]:
positive_words = pd.read_csv('data/positive-words.txt', skiprows=29, header=None, names=['words'])
positive_words

Unnamed: 0,words
0,a+
1,abound
2,abounds
3,abundance
4,abundant
...,...
2001,youthful
2002,zeal
2003,zenith
2004,zest


In [3]:
positive_words = preprocesser_text(positive_words, to_prepro='words')
positive_words.head(5)

Pandas Apply: 100%|██████████| 2006/2006 [00:00<00:00, 9034.06it/s]
Pandas Apply: 100%|██████████| 2006/2006 [00:00<00:00, 333192.37it/s]
Pandas Apply: 100%|██████████| 2006/2006 [00:00<00:00, 287483.30it/s]
Pandas Apply: 100%|██████████| 2006/2006 [00:00<00:00, 32893.67it/s]
Pandas Apply: 100%|██████████| 2006/2006 [00:00<00:00, 3996.20it/s]


Unnamed: 0,words
0,
1,abound
2,abound
3,abund
4,abund


In [4]:
negative_words = pd.read_csv('data/negative-words.txt', skiprows=29, header=None, names=['words'])
negative_words.head(5)

Unnamed: 0,words
0,2-faced
1,2-faces
2,abnormal
3,abolish
4,abominable


In [5]:
negative_words = preprocesser_text(negative_words, to_prepro='words')
negative_words.head(5)

Pandas Apply: 100%|██████████| 4783/4783 [00:00<00:00, 11752.08it/s]
Pandas Apply: 100%|██████████| 4783/4783 [00:00<00:00, 478311.86it/s]
Pandas Apply: 100%|██████████| 4783/4783 [00:00<00:00, 366960.36it/s]
Pandas Apply: 100%|██████████| 4783/4783 [00:00<00:00, 30464.86it/s]
Pandas Apply: 100%|██████████| 4783/4783 [00:01<00:00, 4049.99it/s]


Unnamed: 0,words
0,2face
1,2face
2,abnorm
3,abolish
4,abomin


In [6]:
negative_words.drop_duplicates(inplace=True)
positive_words.drop_duplicates(inplace=True)

In [7]:
# Load the IMDB review dataset
imdb_data = pd.read_csv('data/IMDB Dataset.csv')
print(imdb_data.shape)
imdb_data.head(10)

(50000, 2)


Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
5,"Probably my all-time favorite movie, a story o...",positive
6,I sure would like to see a resurrection of a u...,positive
7,"This show was an amazing, fresh & innovative i...",negative
8,Encouraged by the positive comments about this...,negative
9,If you like original gut wrenching laughter yo...,positive


In [8]:
imdb_data = preprocesser_text(imdb_data)

Pandas Apply: 100%|██████████| 50000/50000 [00:07<00:00, 6268.77it/s]
Pandas Apply: 100%|██████████| 50000/50000 [00:00<00:00, 423722.71it/s]
Pandas Apply: 100%|██████████| 50000/50000 [00:01<00:00, 46423.79it/s]
Pandas Apply: 100%|██████████| 50000/50000 [03:06<00:00, 268.14it/s]
Pandas Apply: 100%|██████████| 50000/50000 [00:48<00:00, 1028.00it/s]


In [9]:
norm_train_reviews, norm_test_reviews = train_test_split(imdb_data)

In [10]:
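def check_sentiment_by_counting(tokens, positive=True, negative=True, return_as_str=False, threshold=0):
    """Count distinct lexicon words in a whitespace-tokenized review.

    np.intersect1d returns unique values, so each matched word counts once.
    """
    words = tokens.split()
    # Default to 0 so the difference below is defined even if one side is disabled.
    positive_n = len(np.intersect1d(words, positive_words.values)) if positive else 0
    negative_n = len(np.intersect1d(words, negative_words.values)) if negative else 0
    if return_as_str:
        # 'positive' only if positive matches exceed negative matches by more than threshold.
        return 'positive' if positive_n - negative_n > threshold else 'negative'
    if positive:
        return positive_n
    if negative:
        return negative_n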

def count_positive_negative_words(df):
    positive = df['review'].swifter.apply(check_sentiment_by_counting, positive=True, negative=False)
    negative = df['review'].swifter.apply(check_sentiment_by_counting, positive=False, negative=True)
    print("Positive and Negative Words: ", positive.sum(), negative.sum())
    return positive, negative
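`np.intersect1d` returns the sorted unique values common to both inputs, so the function above counts each lexicon word at most once per review, however often it is repeated. The sketch below contrasts that with counting every occurrence, using only the standard library. The `lexicon` set and both helper names are hypothetical and not part of the notebook.

```python
from collections import Counter

lexicon = {"good", "great"}  # hypothetical toy lexicon

def count_distinct(tokens, lexicon):
    """Number of different lexicon words present (mirrors the intersect1d behaviour)."""
    return len(set(tokens.split()) & lexicon)

def count_occurrences(tokens, lexicon):
    """Number of tokens that are lexicon words, counting repeats."""
    counts = Counter(tokens.split())
    return sum(n for word, n in counts.items() if word in lexicon)

review = "good good good film great cast"
print(count_distinct(review, lexicon))     # 2
print(count_occurrences(review, lexicon))  # 4
```

Which variant works better for the baseline is an empirical question; the distinct count is less sensitive to a single word repeated many times.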


In [11]:
positive, negative = count_positive_negative_words(norm_test_reviews)

Pandas Apply: 100%|██████████| 10000/10000 [00:10<00:00, 984.15it/s]
Pandas Apply: 100%|██████████| 10000/10000 [00:24<00:00, 407.01it/s]

Positive and Negative Words:  126354 103148





In [23]:
# Threshold: ratio of total positive to total negative word counts from In [11] (about 1.225)
norm_test_reviews['sentiment_pred'] = norm_test_reviews['review'].swifter.apply(check_sentiment_by_counting, return_as_str=True, threshold=126354/103148)

Pandas Apply: 100%|██████████| 10000/10000 [00:39<00:00, 254.28it/s]
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  norm_test_reviews['sentiment_pred'] = norm_test_reviews['review'].swifter.apply(check_sentiment_by_counting, return_as_str=True, threshold=126354/103148)
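The threshold used in this cell is the ratio of the two totals printed in In [11]. Because the score `positive_n - negative_n` is an integer, a threshold of roughly 1.225 behaves like requiring a margin of at least 2. A quick check of that arithmetic (the variable names are only for illustration):

```python
import math

total_positive = 126354  # from the In [11] output
total_negative = 103148

threshold = total_positive / total_negative
print(round(threshold, 3))  # 1.225

# Smallest integer difference that is strictly greater than the threshold.
min_margin = math.floor(threshold) + 1
print(min_margin)  # 2

# Show which integer differences end up labelled 'positive'.
for diff in range(-1, 4):
    label = "positive" if diff > threshold else "negative"
    print(diff, label)
```

In other words, a review is predicted positive only if it contains at least two more distinct positive words than distinct negative words.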


In [24]:
evaluate(norm_test_reviews['sentiment_pred'], norm_test_reviews['sentiment'])[0]

0.7131

In [25]:
norm_test_reviews['sentiment_pred'].value_counts()

positive    5474
negative    4526
Name: sentiment_pred, dtype: int64

In [26]:
# Classification report for the word-count baseline
print(evaluate(norm_test_reviews['sentiment_pred'], norm_test_reviews['sentiment'])[1])

              precision    recall  f1-score   support

    Negative       0.67      0.73      0.70      4526
    Positive       0.76      0.70      0.73      5474

    accuracy                           0.71     10000
   macro avg       0.71      0.71      0.71     10000
weighted avg       0.72      0.71      0.71     10000
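One detail worth checking: the support column above (4526 negative, 5474 positive) equals the predicted label counts from In [25]. Support normally counts the true labels, so this would be consistent with predictions and ground truth being swapped somewhere inside `evaluate`, in which case the precision and recall columns would be swapped as well. That cannot be confirmed without the source of `utils.evaluate`. The sketch below, with made-up labels and a hypothetical `per_class_report` helper, shows how swapping the two arguments swaps precision and recall and changes the support:

```python
def per_class_report(y_true, y_pred, label):
    """Return (precision, recall, support) for one label."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    predicted = sum(p == label for p in y_pred)
    support = sum(t == label for t in y_true)  # counted over the true labels
    precision = tp / predicted if predicted else 0.0
    recall = tp / support if support else 0.0
    return precision, recall, support

# Made-up labels for illustration only.
y_true = ["pos", "pos", "pos", "neg", "neg"]
y_pred = ["pos", "pos", "neg", "pos", "pos"]

print(per_class_report(y_true, y_pred, "pos"))  # intended argument order
print(per_class_report(y_pred, y_true, "pos"))  # arguments swapped
```

Accuracy is symmetric in the two arguments, so the 0.7131 reported earlier would be unaffected either way.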

