In [2]:
import pandas as pd 

In [3]:
df = pd.read_csv("IMDB Dataset.csv")

In [4]:
df = df.replace({"positive": 1, "negative": 0})
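A narrower way to encode the labels is to map only the `sentiment` column, which also guarantees an integer column. This is a minimal sketch on a made-up two-row frame (`toy` and its contents are assumptions, not the real dataset):

```python
import pandas as pd

# Made-up stand-in for the IMDB frame; the real one comes from read_csv above.
toy = pd.DataFrame({
    "review": ["positive vibes but a dull plot", "negative space used beautifully"],
    "sentiment": ["negative", "positive"],
})

# Convert only the label column; the review text is left untouched.
toy["sentiment"] = toy["sentiment"].map({"positive": 1, "negative": 0})

print(toy["sentiment"].tolist())
print(toy["sentiment"].dtype)
```

Any label missing from the mapping would become `NaN`, which makes unexpected values easy to spot.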



In [5]:
df.head(5)

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,1
1,A wonderful little production. <br /><br />The...,1
2,I thought this was a wonderful way to spend ti...,1
3,Basically there's a family where a little boy ...,0
4,"Petter Mattei's ""Love in the Time of Money"" is...",1


## Let's get started with some natural language processing!

In [6]:
def metrics(actual, predicted):
    """Print accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    # Lengths must be the same
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")

    true_positive = 0
    true_negative = 0
    false_positive = 0
    false_negative = 0

    # Count each cell of the confusion matrix
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            true_positive += 1
        elif a == 0 and p == 0:
            true_negative += 1
        elif a == 0 and p == 1:
            false_positive += 1
        elif a == 1 and p == 0:
            false_negative += 1

    try:
        accuracy = (true_positive + true_negative) / len(predicted)
        precision = true_positive / (true_positive + false_positive)
        recall = true_positive / (true_positive + false_negative)
        f1 = 2 * (precision * recall) / (precision + recall)
    except ZeroDivisionError as err:
        # Happens for empty input, or when no positives are predicted or present
        raise ZeroDivisionError("a metric is undefined (empty denominator)") from err

    print(f"accuracy:{accuracy:.4f}, Precision:{precision:.4f}, Recall:{recall:.4f}, F1:{f1:.4f}")
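As a sanity check on the hand-rolled counts above, the same four numbers can be computed with `sklearn.metrics`. The labels below are made up for illustration: they contain 3 true positives, 3 true negatives, 1 false positive and 1 false negative, so every metric should come out as 0.75.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Made-up labels: TP=3, TN=3, FP=1, FN=1
actual    = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 1, 0, 1, 0]

accuracy = accuracy_score(actual, predicted)    # (TP + TN) / total = 6 / 8
precision = precision_score(actual, predicted)  # TP / (TP + FP)    = 3 / 4
recall = recall_score(actual, predicted)        # TP / (TP + FN)    = 3 / 4
f1 = f1_score(actual, predicted)                # harmonic mean of precision and recall

print(f"accuracy:{accuracy:.4f}, Precision:{precision:.4f}, Recall:{recall:.4f}, F1:{f1:.4f}")
```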




## Step one: Logistic Regression and Bag of Words!


In [7]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(df["review"], df["sentiment"], test_size=0.2)

# Convert text to BoW features
vectorizer = CountVectorizer(max_features=5000)
X_train_bow = vectorizer.fit_transform(X_train)
X_test_bow = vectorizer.transform(X_test)

# Train classifier
model = LogisticRegression()
model.fit(X_train_bow, y_train)

y_pred = model.predict(X_test_bow)

metrics(y_test.tolist(), y_pred.tolist())

accuracy:0.8850, Precision:0.8803, Recall:0.8919, F1:0.8860


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Logistic Regression might be one of the simplest classifiers out there, but isn't it fantastic?

These values will be our baseline!

accuracy:0.8848, Precision:0.8804, Recall:0.8892, F1:0.8848
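Two details of the baseline are worth a small sketch. The convergence warning goes away if the solver gets more iterations (`max_iter`), and the scores probably shift a little between runs because `train_test_split` reshuffles the data unless `random_state` is fixed. The corpus, the seed and the helper `split_and_fit` below are all made up for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Tiny made-up corpus standing in for the reviews
texts = ["great movie", "loved it", "wonderful acting", "a great cast",
         "terrible movie", "hated it", "awful acting", "a boring cast"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]


def split_and_fit(seed):
    """Hypothetical helper: split with a fixed seed, then fit the baseline."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        texts, labels, test_size=0.25, random_state=seed, stratify=labels
    )
    vec = CountVectorizer()
    clf = LogisticRegression(max_iter=1000)  # more room for the solver to converge
    clf.fit(vec.fit_transform(X_tr), y_tr)
    return X_te, clf.predict(vec.transform(X_te)).tolist()


first = split_and_fit(42)
second = split_and_fit(42)
print(first == second)  # same seed, same split, same predictions
```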

In [8]:
from sklearn.naive_bayes import GaussianNB

model = GaussianNB()

# GaussianNB needs a dense array, so convert the sparse matrices
X_train_dense = X_train_bow.toarray()
X_test_dense = X_test_bow.toarray()

model.fit(X_train_dense, y_train)

y_pred = model.predict(X_test_dense)

metrics(y_test.tolist(), y_pred.tolist())

accuracy:0.7540, Precision:0.8351, Recall:0.6345, F1:0.7211


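Maybe not super surprising: Gaussian Naive Bayes is not well suited to this data!

One thing we have forgotten about is that bag-of-words features are sparse word counts, while GaussianNB assumes every feature is roughly normally distributed. Of course it struggles!

We could use an SVM; however, because a kernel SVM scales with roughly $n_{features} \times n^2_{samples}$, the computation is way too slow for my machine :(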
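For count features, scikit-learn also ships `MultinomialNB`, which models word counts directly and accepts sparse matrices, so no `toarray()` call is needed. A minimal sketch on a made-up corpus (not the IMDB data, so it says nothing about how it would score there):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up corpus standing in for the reviews
train_texts = ["great movie", "loved it", "wonderful acting", "a great cast",
               "terrible movie", "hated it", "awful acting", "a boring cast"]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]
test_texts = ["great wonderful", "terrible boring"]

vec = CountVectorizer()
X_train_counts = vec.fit_transform(train_texts)  # stays a sparse matrix
X_test_counts = vec.transform(test_texts)

nb = MultinomialNB()
nb.fit(X_train_counts, train_labels)

preds = nb.predict(X_test_counts).tolist()
print(preds)
```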


## Next step: TfidfVectorizer

In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()

X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

model = LogisticRegression()
model.fit(X_train_vec, y_train)

y_pred = model.predict(X_test_vec)

metrics(y_test.tolist(), y_pred.tolist())

accuracy:0.8955, Precision:0.8885, Recall:0.9050, F1:0.8967


A new best model! 

How exciting! Our new best model is:

accuracy:0.8966, Precision:0.8827, Recall:0.9129, F1:0.8975
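To see what TF-IDF changes compared with raw counts, here is a small sketch on three made-up sentences. Words that appear in every document (such as "the") get a lower weight than words that appear in fewer documents (such as "great"), even though each occurs once in the first sentence.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up mini corpus: "the" and "was" occur in all three documents
docs = [
    "the movie was great",
    "the movie was terrible",
    "the plot was great fun",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# TF-IDF weights of the words in the first document
row = X[0].toarray().ravel()
weights = {word: float(row[idx])
           for word, idx in vectorizer.vocabulary_.items()
           if row[idx] > 0}

for word, weight in sorted(weights.items(), key=lambda kv: kv[1]):
    print(f"{word:>6}: {weight:.3f}")
```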

Let's try something interesting: a PCA?

Problem: the PCA takes forever! Let's try a truncated SVD, which is much like a PCA!

In [16]:
from sklearn.decomposition import TruncatedSVD


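svd = TruncatedSVD(n_components=5, random_state=42)
X_train_reduced = svd.fit_transform(X_train_vec)
# Project the test set with the components learned from the training set.
# Calling fit_transform here would refit the SVD on the test data, so train and
# test would end up in different coordinate systems.
# The score printed below is from a run that used fit_transform on the test set.
X_test_reduced = svd.transform(X_test_vec)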

model.fit(X_train_reduced, y_train)

y_pred = model.predict(X_test_reduced)


metrics(y_test.tolist(), y_pred.tolist())

accuracy:0.6012, Precision:0.5973, Recall:0.6271, F1:0.6118


Our worst model yet! 

accuracy:0.6012, Precision:0.5973, Recall:0.6271, F1:0.6118

Our data is high-dimensional, and squeezing it down to five components makes things worse. Who could have guessed?

These values are only slightly better than a random guess, which would land around 50%.
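A small sketch of the reduction step on a made-up corpus: the SVD is fitted on the training features only, the test features are projected with the same components, and `explained_variance_ratio_` gives a rough idea of how much variance the kept components retain. The corpus and `n_components=2` are assumptions chosen to keep the example tiny.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny made-up corpus standing in for the reviews
train_docs = ["great movie", "loved it", "wonderful acting", "a great cast",
              "terrible movie", "hated it", "awful acting", "a boring cast"]
test_docs = ["great cast", "awful movie"]

vec = TfidfVectorizer()
X_tr = vec.fit_transform(train_docs)
X_te = vec.transform(test_docs)

svd = TruncatedSVD(n_components=2, random_state=42)
X_tr_reduced = svd.fit_transform(X_tr)  # learn the components on training data only
X_te_reduced = svd.transform(X_te)      # project test data onto those same components

kept = float(svd.explained_variance_ratio_.sum())
print(X_tr_reduced.shape, X_te_reduced.shape)
print(f"variance kept by 2 components: {kept:.2%}")
```

If the retained share is small, a classifier trained on the reduced features has little signal left to work with.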

Later on in this project I will create my very own neural network and see if it outperforms the TruncatedSVD model.

## SentimentIntensityAnalyzer, AKA the ready-made VADER model from NLTK

In [30]:
from nltk.sentiment import SentimentIntensityAnalyzer
import nltk

nltk.download('vader_lexicon')
Semtiment = SentimentIntensityAnalyzer()


# Test a few obvious ones first
print(f"first: {Semtiment.polarity_scores("If you like original gut wrenching laughter you will like this movie. If you are young or old then you will love this movie, hell even my mom liked it.<br /><br />Great Camp!!!")}")
print(f"second: {Semtiment.polarity_scores(("Encouraged by the positive comments about this film on here I was looking forward to watching this film. Bad mistake. I've seen 950+ films and this is truly one of the worst of them - it's awful in almost every way: editing, pacing, storyline, 'acting,' soundtrack (the film's only song - a lame country tune - is played no less than four times). The film looks cheap and nasty and is boring in the extreme. Rarely have I been so happy to see the end credits of a film. <br /><br />The only thing that prevents me giving this a 1-score is Harvey Keitel - while this is far from his best performance he at least seems to be making a bit of an effort. One for Keitel obsessives only."))}")
print(f"Third: {Semtiment.polarity_scores(("Besides being boring, the scenes were oppressive and dark. The movie tried to portray some kind of moral, but fell flat with its message. What were the redeeming qualities?? On top of that, I don't think it could make librarians look any more unglamorous than it did."))}")


first: {'neg': 0.094, 'neu': 0.531, 'pos': 0.375, 'compound': 0.9149}
second: {'neg': 0.166, 'neu': 0.662, 'pos': 0.172, 'compound': 0.2362}
Third: {'neg': 0.079, 'neu': 0.876, 'pos': 0.045, 'compound': -0.168}


[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\felik\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


In [31]:
def vader_sentiment(text):
    score = Semtiment.polarity_scores(text)
    return 1 if score['neg'] < score["pos"] else 0

y_pred = X_test.apply(vader_sentiment)


metrics(y_test.tolist(), y_pred.tolist())

accuracy:0.6982, Precision:0.6509, Recall:0.8581, F1:0.7403


In [33]:
def vader_sentiment2(text):
    score = Semtiment.polarity_scores(text)
    return 1 if score['compound'] > 0 else 0

y_pred = X_test.apply(vader_sentiment2)


metrics(y_test.tolist(), y_pred.tolist())


accuracy:0.6996, Precision:0.6539, Recall:0.8510, F1:0.7396
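Both decision rules give nearly the same result, and the three scores printed earlier hint at why. The sketch below replays the two rules on those printed scores, so it needs no NLTK download. The "expected" labels are my own reading of the three reviews, and the stricter threshold of 0.5 is an untuned value picked purely for illustration.

```python
# Scores copied from the printed output above; expected labels are my reading of the reviews
examples = {
    "first":  ({"neg": 0.094, "neu": 0.531, "pos": 0.375, "compound": 0.9149}, 1),
    "second": ({"neg": 0.166, "neu": 0.662, "pos": 0.172, "compound": 0.2362}, 0),
    "third":  ({"neg": 0.079, "neu": 0.876, "pos": 0.045, "compound": -0.168}, 0),
}


def label_by_pos_neg(score):
    """Rule from vader_sentiment: positive if 'pos' outweighs 'neg'."""
    return 1 if score["neg"] < score["pos"] else 0


def label_by_compound(score, threshold=0.0):
    """Rule from vader_sentiment2, with an adjustable (hypothetical) threshold."""
    return 1 if score["compound"] > threshold else 0


by_pos_neg = {name: label_by_pos_neg(s) for name, (s, _) in examples.items()}
by_compound = {name: label_by_compound(s) for name, (s, _) in examples.items()}
by_strict = {name: label_by_compound(s, threshold=0.5) for name, (s, _) in examples.items()}

for name, (_, expected) in examples.items():
    print(name, "expected:", expected, "pos>neg:", by_pos_neg[name],
          "compound>0:", by_compound[name], "compound>0.5:", by_strict[name])
```

On these three examples, both original rules call the clearly negative second review positive, which fits the pattern of high recall and low precision in the scores above. Whether a stricter threshold helps on the full test set would need to be checked there.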
