# NLP: Sentiment analysis (v1)

A sample machine learning notebook for Natural Language Processing (NLP).

- Purpose: testing and confirming the environment setup.

## Dataset

Bag of Words Meets Bags of Popcorn
> Use Google's Word2Vec for movie reviews

https://www.kaggle.com/c/word2vec-nlp-tutorial

In [1]:
import pandas as pd
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
import nltk

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.base import BaseEstimator
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, precision_score

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB


In [2]:
pd.set_option("display.max_colwidth", 200)

nltk.download('stopwords')

STOPWORDS = stopwords.words('english')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [3]:
# Load Train Dataset
df_train = pd.read_csv(
    "./raw_data/labeledTrainData.tsv",
    delimiter="\t",
    na_filter=False
)

display(df_train.head(10))

Unnamed: 0,id,sentiment,review
0,5814_8,1,"With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just wa..."
1,2381_9,1,"\The Classic War of the Worlds\"" by Timothy Hines is a very entertaining film that obviously goes to great effort and lengths to faithfully recreate H. G. Wells' classic book. Mr. Hines succeeds i..."
2,7759_3,0,"The film starts with a manager (Nicholas Bell) giving welcome investors (Robert Carradine) to Primal Park . A secret project mutating a primal animal using fossilized DNA, like ¨Jurassik Park¨, an..."
3,3630_4,0,"It must be assumed that those who praised this film (\the greatest filmed opera ever,\"" didn't I read somewhere?) either don't care for opera, don't care for Wagner, or don't care about anything e..."
4,9495_8,1,"Superbly trashy and wondrously unpretentious 80's exploitation, hooray! The pre-credits opening sequences somewhat give the false impression that we're dealing with a serious and harrowing drama, ..."
5,8196_8,1,"I dont know why people think this is such a bad movie. Its got a pretty good plot, some good action, and the change of location for Harry does not hurt either. Sure some of its offensive and gratu..."
6,7166_2,0,"This movie could have been very good, but comes up way short. Cheesy special effects and so-so acting. I could have looked past that if the story wasn't so lousy. If there was more of a background..."
7,10633_1,0,I watched this video at a friend's house. I'm glad I did not waste money buying this one. The video cover has a scene from the 1975 movie Capricorn One. The movie starts out with several clips of ...
8,319_1,0,"A friend of mine bought this film for £1, and even then it was grossly overpriced. Despite featuring big names such as Adam Sandler, Billy Bob Thornton and the incredibly talented Burt Young, this..."
9,8713_10,1,"<br /><br />This movie is full of references. Like \Mad Max II\"", \""The wild one\"" and many others. The ladybug´s face it´s a clear reference (or tribute) to Peter Lorre. This movie is a masterpie..."


In [4]:
# Methods preparation
def clean_text_html_tag_removing(text: str) -> str:
    """Clean text by removing HTML tags."""
    soup = BeautifulSoup(text, "html.parser")
    return soup.get_text()


def clean_text_lowercase_conversion(text: str) -> str:
    """Clean text by converting it to lower case."""
    return text.lower()


def clean_text_escape_double_quote_removing(text: str) -> str:
    """Clean text by converting escaped double quotes to normal double quotes."""
    # Convert an escaped double quote (backslash + quote) to a plain double quote
    text = text.replace('\\\"', '\"')
    # Restore double quotes that were lost, leaving only a lone backslash
    text = text.replace('\\', '\"')
    return text


def clean_text_stopwords_removing(text: str) -> str:
    """Clean text by removing stopwords."""
    words = text.split()
    words = [
        word for word in words if word not in STOPWORDS
    ]
    return ' '.join(words)


def evaluate_trained_model(
    model: BaseEstimator,
    X_val_data: list,
    y_val_data: list
) -> None:
    """Evaluate a trained Machine Learning model using various metrics.

    This function prints:
    - Accuracy Score: Measures how accurately the class labels are predicted.
    - Precision Score: Evaluates how many of the items predicted as positive are actually positive.
    - Confusion Matrix: Provides a matrix representing TP, FP, FN, TN for each class.
    - Classification Report: Generates a detailed report including Precision, Recall, F1-score, and Support for each class.

    Args:
        model: Trained machine learning model.
        X_val_data, y_val_data: Validation data and labels.
    """
    y_pred = model.predict(X_val_data)

    print(f"Evaluation: {model.__class__.__name__}\n")
    print("Accuracy:", accuracy_score(y_val_data, y_pred))
    print("Precision:", precision_score(y_val_data, y_pred))
    print("Confusion Matrix:\n", confusion_matrix(y_val_data, y_pred))
    print("Classification Report:\n", classification_report(y_val_data, y_pred))


In [5]:
# Data Preprocessing
df_train['review'] = df_train['review'].apply(clean_text_html_tag_removing)

display(df_train.head(10))

Unnamed: 0,id,sentiment,review
0,5814_8,1,"With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just wa..."
1,2381_9,1,"\The Classic War of the Worlds\"" by Timothy Hines is a very entertaining film that obviously goes to great effort and lengths to faithfully recreate H. G. Wells' classic book. Mr. Hines succeeds i..."
2,7759_3,0,"The film starts with a manager (Nicholas Bell) giving welcome investors (Robert Carradine) to Primal Park . A secret project mutating a primal animal using fossilized DNA, like ¨Jurassik Park¨, an..."
3,3630_4,0,"It must be assumed that those who praised this film (\the greatest filmed opera ever,\"" didn't I read somewhere?) either don't care for opera, don't care for Wagner, or don't care about anything e..."
4,9495_8,1,"Superbly trashy and wondrously unpretentious 80's exploitation, hooray! The pre-credits opening sequences somewhat give the false impression that we're dealing with a serious and harrowing drama, ..."
5,8196_8,1,"I dont know why people think this is such a bad movie. Its got a pretty good plot, some good action, and the change of location for Harry does not hurt either. Sure some of its offensive and gratu..."
6,7166_2,0,"This movie could have been very good, but comes up way short. Cheesy special effects and so-so acting. I could have looked past that if the story wasn't so lousy. If there was more of a background..."
7,10633_1,0,I watched this video at a friend's house. I'm glad I did not waste money buying this one. The video cover has a scene from the 1975 movie Capricorn One. The movie starts out with several clips of ...
8,319_1,0,"A friend of mine bought this film for £1, and even then it was grossly overpriced. Despite featuring big names such as Adam Sandler, Billy Bob Thornton and the incredibly talented Burt Young, this..."
9,8713_10,1,"This movie is full of references. Like \Mad Max II\"", \""The wild one\"" and many others. The ladybug´s face it´s a clear reference (or tribute) to Peter Lorre. This movie is a masterpiece. We´ll ta..."


In [6]:
# Data Preprocessing
df_train['review'] = df_train['review'].apply(clean_text_lowercase_conversion)

display(df_train.head(2))

Unnamed: 0,id,sentiment,review
0,5814_8,1,"with all this stuff going down at the moment with mj i've started listening to his music, watching the odd documentary here and there, watched the wiz and watched moonwalker again. maybe i just wa..."
1,2381_9,1,"\the classic war of the worlds\"" by timothy hines is a very entertaining film that obviously goes to great effort and lengths to faithfully recreate h. g. wells' classic book. mr. hines succeeds i..."


In [7]:
# Data Preprocessing
df_train['review'] = df_train['review'].apply(
    clean_text_escape_double_quote_removing
)

display(df_train.head(2))

Unnamed: 0,id,sentiment,review
0,5814_8,1,"with all this stuff going down at the moment with mj i've started listening to his music, watching the odd documentary here and there, watched the wiz and watched moonwalker again. maybe i just wa..."
1,2381_9,1,"""the classic war of the worlds"" by timothy hines is a very entertaining film that obviously goes to great effort and lengths to faithfully recreate h. g. wells' classic book. mr. hines succeeds in..."


In [8]:
# Data Preprocessing
df_train['review'] = df_train['review'].apply(clean_text_stopwords_removing)

display(df_train.head(2))

Unnamed: 0,id,sentiment,review
0,5814_8,1,"stuff going moment mj i've started listening music, watching odd documentary there, watched wiz watched moonwalker again. maybe want get certain insight guy thought really cool eighties maybe make..."
1,2381_9,1,"""the classic war worlds"" timothy hines entertaining film obviously goes great effort lengths faithfully recreate h. g. wells' classic book. mr. hines succeeds so. i, watched film me, appreciated f..."
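
The effect of the last two cleaning steps on the second review can be traced by hand. The following is a minimal sketch, not the notebook's own code: `unescape_double_quotes` and `remove_stopwords` are hypothetical stand-ins for the `clean_text_*` helpers, and `SAMPLE_STOPWORDS` is an assumed tiny subset standing in for NLTK's English stopword list, so that the example runs without any downloads.

```python
# Assumed tiny stand-in for nltk's English stopword list (not the real list).
SAMPLE_STOPWORDS = {"the", "of", "is", "a", "by"}


def unescape_double_quotes(text: str) -> str:
    """Apply the same two replacements as the notebook's quote cleaner."""
    text = text.replace('\\"', '"')  # backslash + quote -> quote
    text = text.replace('\\', '"')   # remaining lone backslash -> quote
    return text


def remove_stopwords(text: str, stopwords: set) -> str:
    """Drop whitespace-separated tokens that exactly match a stopword."""
    return ' '.join(word for word in text.split() if word not in stopwords)


# Lowercased start of the second review: a lone backslash opens the title,
# and an escaped double quote closes it.
raw = '\\the classic war of the worlds\\" by timothy hines is a very entertaining film'

unescaped = unescape_double_quotes(raw)
cleaned = remove_stopwords(unescaped, SAMPLE_STOPWORDS)

print(unescaped)  # "the classic war of the worlds" by timothy hines is a very entertaining film
print(cleaned)    # "the classic war worlds" timothy hines very entertaining film
```

Because the text is split on whitespace only, a token such as `"the` keeps its leading quote and does not match the stopword `the`, which is why it survives in the output above.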


In [9]:
# Feature Engineering: TF-IDF Vectorization
vectorizer = TfidfVectorizer(max_features=5000)

X = vectorizer.fit_transform(df_train['review'])
y = df_train['sentiment']
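
To make the weighting concrete, here is a minimal pure-Python sketch of the formula that `TfidfVectorizer` applies with its default settings: a smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, multiplied by the raw term count, followed by L2 normalization of each row. The three-document corpus is made up for illustration, and the whitespace tokenizer is a simplification of scikit-learn's default token pattern.

```python
import math
from collections import Counter

# Made-up toy corpus, for illustration only.
docs = ["great film great cast", "bad film", "great fun"]
tokenized = [doc.split() for doc in docs]  # simplification: whitespace tokens

n_docs = len(tokenized)
vocab = sorted({token for doc in tokenized for token in doc})

# Document frequency: in how many documents each term appears.
doc_freq = Counter(token for doc in tokenized for token in set(doc))

# Smoothed IDF, matching scikit-learn's default (smooth_idf=True).
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}


def tfidf_row(tokens: list) -> list:
    """Return the L2-normalized TF-IDF vector of one document."""
    term_freq = Counter(tokens)
    raw = [term_freq[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]


matrix = [tfidf_row(doc) for doc in tokenized]

for doc, row in zip(docs, matrix):
    weights = {t: round(w, 3) for t, w in zip(vocab, row) if w > 0}
    print(doc, "->", weights)
```

In the second document, the rarer term `bad` receives a larger weight than `film`, which appears in two documents. In the notebook itself, `max_features=5000` additionally restricts the vocabulary to the 5000 terms with the highest term frequency across the corpus.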

In [10]:
# Model Building: split data
X_train, X_val, y_train, y_val = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42
)

In [11]:
# Model Building: train a Logistic Regression model
model_lr = LogisticRegression()
model_lr.fit(X_train, y_train)

# Evaluation
evaluate_trained_model(
    model_lr,
    X_val,
    y_val
)

Evaluation: LogisticRegression

Accuracy: 0.8836
Precision: 0.8784681516217272
Confusion Matrix:
 [[2170  311]
 [ 271 2248]]
Classification Report:
               precision    recall  f1-score   support

           0       0.89      0.87      0.88      2481
           1       0.88      0.89      0.89      2519

    accuracy                           0.88      5000
   macro avg       0.88      0.88      0.88      5000
weighted avg       0.88      0.88      0.88      5000
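
The reported scores can be re-derived from the confusion matrix alone. The sketch below assumes scikit-learn's layout for binary labels, where rows are true classes and columns are predicted classes, so the matrix reads `[[TN, FP], [FN, TP]]`. The row sums (2481 and 2519) match the support column of the report, which is consistent with that layout.

```python
# Confusion matrix printed for LogisticRegression above: [[TN, FP], [FN, TP]].
tn, fp = 2170, 311
fn, tp = 271, 2248

total = tn + fp + fn + tp

accuracy = (tp + tn) / total                        # share of correct predictions
precision = tp / (tp + fp)                          # predicted positives that are right
recall = tp / (tp + fn)                             # actual positives that are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"Accuracy:  {accuracy:.4f}")   # Accuracy:  0.8836
print(f"Precision: {precision:.4f}")  # Precision: 0.8785
print(f"Recall:    {recall:.4f}")     # Recall:    0.8924
print(f"F1:        {f1:.4f}")         # F1:        0.8854
```

Rounded to two decimals, the recall and F1 values give the 0.89 shown for class 1 in the classification report.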



In [12]:
# Model Building: Support Vector Machine
model_svm = SVC()
model_svm.fit(X_train, y_train)

# Evaluation
evaluate_trained_model(
    model_svm,
    X_val,
    y_val
)

Evaluation: SVC

Accuracy: 0.8886
Precision: 0.882307092751364
Confusion Matrix:
 [[2179  302]
 [ 255 2264]]
Classification Report:
               precision    recall  f1-score   support

           0       0.90      0.88      0.89      2481
           1       0.88      0.90      0.89      2519

    accuracy                           0.89      5000
   macro avg       0.89      0.89      0.89      5000
weighted avg       0.89      0.89      0.89      5000



In [13]:
# Model Building: Random Forest
model_rf = RandomForestClassifier()
model_rf.fit(X_train, y_train)

# Evaluation
evaluate_trained_model(
    model_rf,
    X_val,
    y_val
)

Evaluation: RandomForestClassifier

Accuracy: 0.843
Precision: 0.8493150684931506
Confusion Matrix:
 [[2107  374]
 [ 411 2108]]
Classification Report:
               precision    recall  f1-score   support

           0       0.84      0.85      0.84      2481
           1       0.85      0.84      0.84      2519

    accuracy                           0.84      5000
   macro avg       0.84      0.84      0.84      5000
weighted avg       0.84      0.84      0.84      5000



In [14]:
# Model Building: Multinomial Naive Bayes
model_nb = MultinomialNB()
model_nb.fit(X_train, y_train)

# Evaluation
evaluate_trained_model(
    model_nb,
    X_val,
    y_val
)

Evaluation: MultinomialNB

Accuracy: 0.85
Precision: 0.8440295604823026
Confusion Matrix:
 [[2080  401]
 [ 349 2170]]
Classification Report:
               precision    recall  f1-score   support

           0       0.86      0.84      0.85      2481
           1       0.84      0.86      0.85      2519

    accuracy                           0.85      5000
   macro avg       0.85      0.85      0.85      5000
weighted avg       0.85      0.85      0.85      5000



## Consideration of Model evaluation results

From these results, __Support Vector Machine (SVC)__ achieves the highest accuracy (0.8886) on the validation set.
Its precision, recall, and F1 scores are also high and well balanced across both classes.

For this __v1__, we shall select __Support Vector Machine (SVC)__ model `model_svm`.
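
As a quick cross-check of this choice, the four models can be ranked directly from the confusion matrices printed above. This is a small sketch using only those reported numbers; the `confusion` dictionary and the `accuracy` helper are names introduced here for illustration.

```python
# Confusion matrices copied from the evaluation outputs: [[TN, FP], [FN, TP]].
confusion = {
    "LogisticRegression": [[2170, 311], [271, 2248]],
    "SVC": [[2179, 302], [255, 2264]],
    "RandomForestClassifier": [[2107, 374], [411, 2108]],
    "MultinomialNB": [[2080, 401], [349, 2170]],
}


def accuracy(matrix: list) -> float:
    """Correct predictions (the diagonal) divided by all predictions."""
    correct = matrix[0][0] + matrix[1][1]
    return correct / sum(sum(row) for row in matrix)


ranking = sorted(confusion, key=lambda name: accuracy(confusion[name]), reverse=True)

for name in ranking:
    print(f"{name:<24} {accuracy(confusion[name]):.4f}")

# Gap between the top two models, in correctly classified validation reviews.
best, runner_up = ranking[0], ranking[1]
margin = (confusion[best][0][0] + confusion[best][1][1]) - (
    confusion[runner_up][0][0] + confusion[runner_up][1][1]
)
print(f"{best} beats {runner_up} by {margin} reviews")  # SVC beats LogisticRegression by 25 reviews
```

The gap between SVC and Logistic Regression is 25 of 5000 validation reviews, so the ranking of the top two could plausibly change with a different `random_state`.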

In [15]:
# Load Test Dataset
df_test = pd.read_csv(
    "./raw_data/testData.tsv",
    delimiter="\t",
    na_filter=False
)

display(df_test.head())

Unnamed: 0,id,review
0,12311_10,"Naturally in a film who's main themes are of mortality, nostalgia, and loss of innocence it is perhaps not surprising that it is rated more highly by older viewers than younger ones. However there..."
1,8348_2,"This movie is a disaster within a disaster film. It is full of great action scenes, which are only meaningful if you throw away all sense of reality. Let's see, word to the wise, lava burns you; s..."
2,5828_4,"All in all, this is a movie for kids. We saw it tonight and my child loved it. At one point my kid's excitement was so great that sitting was impossible. However, I am a great fan of A.A. Milne's ..."
3,7186_2,"Afraid of the Dark left me with the impression that several different screenplays were written, all too short for a feature length film, then spliced together clumsily into this Frankenstein's mon..."
4,12128_7,"A very accurate depiction of small time mob life filmed in New Jersey. The story, characters and script are believable but the acting drops the ball. Still, it's worth watching, especially for the..."


In [16]:
# Data Preprocessing
df_test['review'] = df_test['review'].apply(
    clean_text_html_tag_removing
).apply(
    clean_text_lowercase_conversion
).apply(
    clean_text_escape_double_quote_removing
).apply(
    clean_text_stopwords_removing
)

display(df_test.head())

Unnamed: 0,id,review
0,12311_10,"naturally film who's main themes mortality, nostalgia, loss innocence perhaps surprising rated highly older viewers younger ones. however craftsmanship completeness film anyone enjoy. pace steady ..."
1,8348_2,"movie disaster within disaster film. full great action scenes, meaningful throw away sense reality. let's see, word wise, lava burns you; steam burns you. can't stand next lava. diverting minor la..."
2,5828_4,"all, movie kids. saw tonight child loved it. one point kid's excitement great sitting impossible. however, great fan a.a. milne's books subtle hide wry intelligence behind childlike quality leadin..."
3,7186_2,"afraid dark left impression several different screenplays written, short feature length film, spliced together clumsily frankenstein's monster.at best, protagonist, lucas, creepy. hard draw bead s..."
4,12128_7,"accurate depiction small time mob life filmed new jersey. story, characters script believable acting drops ball. still, worth watching, especially strong images, still even though first viewed 25 ..."
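
The test data must go through exactly the same cleaning steps, in the same order, as the training data. One way to keep the two in sync is to bundle the steps into a single function. The following is a minimal sketch of that idea: `compose_cleaners` is a hypothetical helper, and `lowercase` and `collapse_spaces` are simple stand-ins for the notebook's `clean_text_*` functions so that the example runs on its own.

```python
from functools import reduce
from typing import Callable, Sequence

Cleaner = Callable[[str], str]


def compose_cleaners(steps: Sequence[Cleaner]) -> Cleaner:
    """Return one function that applies each cleaning step in order."""
    def run(text: str) -> str:
        return reduce(lambda acc, step: step(acc), steps, text)
    return run


# Stand-ins for the notebook's cleaning helpers (assumed for illustration).
def lowercase(text: str) -> str:
    return text.lower()


def collapse_spaces(text: str) -> str:
    return ' '.join(text.split())


clean_review = compose_cleaners([lowercase, collapse_spaces])

print(clean_review("  A  Very   GOOD film "))  # a very good film
```

In this notebook, the list passed to `compose_cleaners` would hold the four `clean_text_*` functions, and the resulting function could be applied once to the `review` column of both `df_train` and `df_test`.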


In [17]:
# Feature Engineering: TF-IDF Vectorization
X_test = vectorizer.transform(df_test['review'])

In [18]:
# Prediction: using SVC trained model
y_test_predicted = model_svm.predict(X_test)

In [19]:
df_final = pd.DataFrame({
    'id': df_test['id'],
    'sentiment': y_test_predicted
})

display(df_final)

Unnamed: 0,id,sentiment
0,12311_10,1
1,8348_2,0
2,5828_4,1
3,7186_2,1
4,12128_7,1
...,...,...
24995,2155_10,1
24996,59_10,1
24997,2531_1,0
24998,7772_8,1


In [20]:
print(
    df_final[df_final['sentiment'] == 1].shape[0]
)

12531


In [21]:
df_final.to_csv(
    'v1_trial_submit.csv',
    index=False
)

- Result

> Score: 0.8800