This file combines **Sentiment Analysis** with **Aspect Extraction** and **Emotional Analysis**. It also uses **Topic Modeling** to attempt to categorize each review.

**Importing** Necessary **Libraries**

In [117]:
import nltk 
import pandas as pd
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from collections import Counter
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF


**Downloading** VADER Lexicon

In [38]:
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\MasonLonoff\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

**Initializing** Sentiment Intensity Analyzer

In [39]:
sid = SentimentIntensityAnalyzer()

**Reading** In the Dataset

In [40]:
df = pd.read_csv('IMDB Dataset.csv')
df

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
...,...,...
49995,I thought this movie did a down right good job...,positive
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,I am a Catholic taught in parochial elementary...,negative
49998,I'm going to have to disagree with the previou...,negative


Getting **value counts** of the **actual** sentiments 

In [41]:
df['sentiment'].value_counts()

positive    25000
negative    25000
Name: sentiment, dtype: int64

Checking for **Nulls**

In [42]:
df.isnull().sum()

review       0
sentiment    0
dtype: int64

Checking for **Blanks**

In [43]:
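blanks = []
# itertuples yields (index, review, sentiment) for each row
for i, rv, _ in df.itertuples():
    # Record the index of any review made up only of whitespace
    if isinstance(rv, str) and rv.isspace():
        blanks.append(i)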

**Dropping Blanks** if Needed

In [44]:
df.drop(blanks, inplace=True)
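The same blank check can be written without an explicit loop. The sketch below is one possible vectorized version, run on a made-up toy frame (`toy` and its values are assumptions for illustration, not part of the dataset):

```python
import pandas as pd

# Toy frame standing in for the review data (values are made up for illustration)
toy = pd.DataFrame({
    'review': ['Great film', '   ', 'Awful plot', '\t\n'],
    'sentiment': ['positive', 'negative', 'negative', 'positive'],
})

# Boolean mask: True where the review consists only of whitespace
blank_mask = toy['review'].str.isspace()

# Keep only the rows that are not blank
cleaned = toy[~blank_mask]
print(cleaned.index.tolist())
```

As with the loop above, an empty string is not flagged, because `isspace()` is false for it.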

Creating the **Emotional Lexicon** used for the **Emotional Analysis** part of the project. Each **Emotion** maps to a list of **Sub-Emotions** (keywords), and the occurrences of those keywords are counted in each review. The resulting counts give insight into the **Emotional Aspect** of each review.

In [45]:
emotion_lexicon = {
    'joy': ['happiness', 'happy', 'ecstasy', 'euphoria', 'glee', 'delight', 'bliss', 'cheerfulness', 'exhilaration', 'elation', 'joy', 'smile', 'laugh', 'upbeat', 'jubilation', 'mirth'],
    'sadness': ['grief', 'sorrow', 'heartache', 'desolation', 'melancholy', 'tears', 'anguish', 'regret', 'despair', 'cry', 'depressed', 'mourning', 'bereavement', 'loss', 'agony', 'grieving', 'tragedy', 'wretched', 'downcast', 'despondent'],
    'anger': ['rage', 'outrage', 'fury', 'indignation', 'irritation', 'resentment', 'hostility', 'wrath', 'exasperation', 'angry', 'frustration', 'outraged', 'infuriated', 'irate', 'furious'],
    'fear': ['anxiety', 'terror', 'dread', 'apprehension', 'panic', 'unease', 'phobia', 'trepidation', 'horror', 'afraid', 'scared', 'frightened', 'terrified', 'petrified', 'paranoia'],
    'surprise': ['astonishment', 'amazement', 'shock', 'awe', 'stunning', 'unexpected', 'startling', 'bewilderment', 'surprised', 'wow', 'astounded', 'stunned', 'flabbergasted', 'jaw-dropping'],
    'disgust': ['revulsion', 'abhorrence', 'repulsion', 'contempt', 'dislike', 'displeasure', 'loathing', 'aversion', 'disgusted', 'nausea', 'repelled', 'repugnant', 'horrified', 'appalled'],
    'love': ['affection', 'passion', 'devotion', 'admiration', 'intimacy', 'tenderness', 'romance', 'infatuation', 'love', 'caring', 'adoration', 'warmth', 'desire', 'compassion'],
    'excitement': ['enthusiasm', 'thrill', 'eagerness', 'anticipation', 'vibrancy', 'exhilaration', 'stimulation', 'excited', 'enthralling', 'passionate', 'electrifying', 'energized', 'enthusiastic'],
    'hope': ['optimism', 'aspiration', 'expectation', 'confidence', 'yearning', 'faith', 'positivity', 'hopeful', 'anticipate', 'optimistic', 'dream', 'desire', 'aspiring', 'uplifting'],
    'disappointment': ['letdown', 'frustration', 'regret', 'dissatisfaction', 'despondency', 'displeasure', 'disillusionment', 'disappointed', 'let down', 'unfulfilled', 'discouraged', 'heartbroken', 'defeated'],
    'nostalgia': ['sentimentality', 'longing', 'remembrance', 'reminiscence', 'wistfulness', 'yearning', 'nostalgic', 'memories'],
    'pride': ['satisfaction', 'accomplishment', 'achievement', 'triumph', 'self-esteem', 'confidence', 'proud', 'success'],
    'admiration': ['respect', 'appreciation', 'esteem', 'approval', 'awe', 'veneration', 'applause', 'admire', 'respectful'],
    'confusion': ['bewilderment', 'puzzlement', 'perplexity', 'uncertainty', 'disorientation', 'mystification', 'confused', 'puzzled'],
    'curiosity': ['inquisitiveness', 'intrigue', 'interest', 'wonder', 'fascination', 'desire to explore', 'curious', 'explore']
}


**Extracting Aspects** from the Reviews. This part of the code defines a fixed list of aspects. For each review, it counts how many times each aspect appears as a substring of the lowercased text and returns the most frequent one, or 'general' if none appears. Aspect extraction was an attempt to get deeper insight into which part of the movie each review viewed positively or negatively.

In [46]:
def extract_aspects(review):
    review_lower = review.lower() 
    
    # Define the aspects and their corresponding scores
    aspect_scores = {
        'acting': 0,
        'performance': 0,
        'plot': 0,
        'storyline': 0,
        'dialogue': 0,
        'director': 0,
        'direction': 0,
        'cinematography': 0,
        'writing': 0,
        'visual effects': 0,
        'special effects': 0,
        'soundtrack': 0,
        'editing': 0,
        'pacing': 0,
        'transitions': 0,
        'sets': 0,
        'costumes': 0,
        'props': 0,
        'music': 0,
        'sound effects': 0,
        'originality': 0,
        'social issues': 0,
        'genre': 0,
        'character development': 0,
        'production design': 0,
        'humor': 0,
        'emotional impact': 0,
        'ending': 0,
        'chemistry': 0,
        'performances': 0,
        'screenplay': 0,
        'action': 0,
        'makeup': 0,
        'character arc': 0,
        'world building': 0,
        'twists': 0,
        'villain': 0,
        'comedy': 0,
        'suspense': 0,
        'romance': 0,
        'set design': 0,
        'cinematic experience': 0,
        'narrative': 0,
        'sequences': 0,
        'climax': 0,
        'finale': 0,
        'theme': 0,
        'setting': 0,
        'point of view': 0,
        'mood': 0,
        'symbolism': 0,
        'conflict': 0,
        'lighting': 0,
        'adventure': 0,
        'animated': 0,
        'drama': 0,
        'fantasy': 0,
        'historical': 0,
        'horror': 0,
        'musical': 0,
        'science fiction': 0,
        'thriller': 0,
        'western': 0,
        'violence': 0
    }


    # Update the scores based on keyword matching (substring counts)
    for aspect in aspect_scores:
        aspect_scores[aspect] += review_lower.count(aspect)
    
    # Select the aspect with the highest score
    most_common_aspect = max(aspect_scores, key=aspect_scores.get)
    
    if aspect_scores[most_common_aspect] == 0:
        return 'general'

    return most_common_aspect
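Because `str.count` matches substrings, an aspect such as 'sets' is also counted inside words like 'upsets'. The sketch below shows one possible alternative that counts whole words or phrases only. The helper `count_aspect` and the sample sentence are hypothetical and are not used elsewhere in this notebook:

```python
import re

# Hypothetical helper: counts whole-word (or whole-phrase) occurrences only
def count_aspect(aspect, text):
    pattern = r'\b' + re.escape(aspect) + r'\b'
    return len(re.findall(pattern, text.lower()))

sample = "The ending upsets me, but the sets and the acting were great."

substring_count = sample.lower().count('sets')  # also matches inside "upsets"
whole_word_count = count_aspect('sets', sample)  # matches the word "sets" only

print(substring_count, whole_word_count)
```

Swapping this into `extract_aspects` would change the aspect assigned to some reviews, so the counts shown later in this notebook would no longer be reproduced exactly.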

Adding the **Aspect Column** that shows the top aspect per review

In [47]:
df['aspect'] = df['review'].apply(extract_aspects)

Getting the **Sentiment Scores** 

In [48]:
# Perform sentiment analysis: each score is a dict with 'neg', 'neu', 'pos' and 'compound' keys
df['score'] = df['review'].apply(sid.polarity_scores)

Let's check out the new df

In [49]:
df.head()

Unnamed: 0,review,sentiment,aspect,score
0,One of the other reviewers has mentioned that ...,positive,violence,"{'neg': 0.203, 'neu': 0.748, 'pos': 0.048, 'co..."
1,A wonderful little production. <br /><br />The...,positive,editing,"{'neg': 0.053, 'neu': 0.776, 'pos': 0.172, 'co..."
2,I thought this was a wonderful way to spend ti...,positive,comedy,"{'neg': 0.094, 'neu': 0.714, 'pos': 0.192, 'co..."
3,Basically there's a family where a little boy ...,negative,drama,"{'neg': 0.138, 'neu': 0.797, 'pos': 0.065, 'co..."
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive,acting,"{'neg': 0.052, 'neu': 0.801, 'pos': 0.147, 'co..."


Pulling out the **Compound** score

In [50]:
df['compound'] = df['score'].apply(lambda d:d['compound'])

**Dropping** the other scores from the df

In [51]:
df.drop(columns='score', inplace=True)

Let's check out the df again

In [52]:
df.head()

Unnamed: 0,review,sentiment,aspect,compound
0,One of the other reviewers has mentioned that ...,positive,violence,-0.9951
1,A wonderful little production. <br /><br />The...,positive,editing,0.9641
2,I thought this was a wonderful way to spend ti...,positive,comedy,0.9605
3,Basically there's a family where a little boy ...,negative,drama,-0.9213
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive,acting,0.9744


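Assigning **Positive** and **Negative** labels to each review based on the **compound** score. The labels are stored in a new *comp_score* column: a compound score of 0 or higher is labeled positive, anything below 0 negative.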

In [53]:
df['comp_score'] = df['compound'].apply(lambda score: 'positive' if score >= 0 else 'negative')
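To make the labeling rule concrete, the sketch below applies it to the compound scores of the five rows displayed earlier and compares the result with the actual sentiments. The function name `label_from_compound` and its `threshold` parameter are illustrative assumptions. The notebook itself uses a fixed cutoff of 0.

```python
# Hypothetical helper mirroring the lambda above, with the cutoff exposed as a parameter
def label_from_compound(score, threshold=0.0):
    return 'positive' if score >= threshold else 'negative'

# Compound scores and actual sentiments of the first five rows shown earlier
compound = [-0.9951, 0.9641, 0.9605, -0.9213, 0.9744]
actual = ['positive', 'positive', 'positive', 'negative', 'positive']

predicted = [label_from_compound(s) for s in compound]
matches = sum(p == a for p, a in zip(predicted, actual))

print(predicted)
print(f'{matches} of {len(actual)} labels agree')
```

The first review is the mismatch: it is labeled positive in the dataset, but its strongly negative compound score puts it in the negative class.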

Let's examine the df again

In [54]:
df.head()

Unnamed: 0,review,sentiment,aspect,compound,comp_score
0,One of the other reviewers has mentioned that ...,positive,violence,-0.9951,negative
1,A wonderful little production. <br /><br />The...,positive,editing,0.9641,positive
2,I thought this was a wonderful way to spend ti...,positive,comedy,0.9605,positive
3,Basically there's a family where a little boy ...,negative,drama,-0.9213,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive,acting,0.9744,positive


Let's check how **Accurate** our predictions in *comp_score* were

In [55]:
accuracy_score(df['sentiment'], df['comp_score'])

0.69626

In [56]:
print(classification_report(df['sentiment'], df['comp_score']))

              precision    recall  f1-score   support

    negative       0.79      0.54      0.64     25000
    positive       0.65      0.86      0.74     25000

    accuracy                           0.70     50000
   macro avg       0.72      0.70      0.69     50000
weighted avg       0.72      0.70      0.69     50000



In [57]:
confusion_matrix(df['sentiment'], df['comp_score'])

array([[13410, 11590],
       [ 3597, 21403]], dtype=int64)
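The accuracy, precision and recall reported above can be recomputed by hand from this confusion matrix. The sketch below does the arithmetic, assuming scikit-learn's layout of rows as actual classes and columns as predicted classes, with labels in sorted order (negative, then positive):

```python
# Confusion matrix copied from the output above:
# rows = actual (negative, positive), columns = predicted (negative, positive)
cm = [[13410, 11590],
      [3597, 21403]]

tn, fp = cm[0]  # actual negatives: correctly labeled, wrongly labeled positive
fn, tp = cm[1]  # actual positives: wrongly labeled negative, correctly labeled
total = tn + fp + fn + tp

accuracy = (tn + tp) / total
precision_pos = tp / (tp + fp)
recall_pos = tp / (tp + fn)
precision_neg = tn / (tn + fn)
recall_neg = tn / (tn + fp)

print(f'accuracy={accuracy:.5f}')
print(f'positive: precision={precision_pos:.2f}, recall={recall_pos:.2f}')
print(f'negative: precision={precision_neg:.2f}, recall={recall_neg:.2f}')
```

The low negative recall comes from the 11,590 negative reviews that received a compound score of 0 or higher and were therefore labeled positive.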

Now, we will apply the **Emotional Analysis**. The code below uses the **emotion lexicon** to count the number of times that each **emotion** appears in a review. Specifically, it counts each occurrence of a **sub-emotion** keyword. For example, if **joy=2**, then **2 instances** of its **sub-emotions** were found in that review. Because the match is a case-sensitive substring search on the raw review text, a capitalized keyword such as "Love" is not counted, while a keyword inside a longer word is.

In [58]:
# Perform emotion analysis using the custom emotion lexicon
for emotion, keywords in emotion_lexicon.items():
    df[emotion] = df['review'].apply(lambda review: sum(review.count(keyword) for keyword in keywords))
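To illustrate how the matching rule affects the counts, the sketch below compares the substring count used above with a case-insensitive whole-word count. The `mini_lexicon` excerpt, the helper `count_emotion` and the sample review are assumptions for illustration only:

```python
import re

# Small excerpt of the lexicon above, for illustration only
mini_lexicon = {
    'joy': ['happy', 'laugh', 'joy'],
    'love': ['love', 'romance'],
}

# Hypothetical helper: case-insensitive, whole-word keyword count
def count_emotion(text, keywords):
    text = text.lower()
    return sum(len(re.findall(r'\b' + re.escape(k) + r'\b', text)) for k in keywords)

review = "Love it! A glove-puppet comedy that made me laugh. Happy ending."

# Same rule as the cell above: case-sensitive substring counts
substring_counts = {e: sum(review.count(k) for k in kws) for e, kws in mini_lexicon.items()}
# Alternative rule: lowercase the text and match whole words only
word_counts = {e: count_emotion(review, kws) for e, kws in mini_lexicon.items()}

print(substring_counts)
print(word_counts)
```

In this sample the substring rule misses "Happy" and "Love" because of capitalization, and counts 'love' inside "glove". The whole-word rule fixes both, at the cost of missing inflected forms such as "laughed".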


Let's check out the df again

In [59]:
df.head()

Unnamed: 0,review,sentiment,aspect,compound,comp_score,joy,sadness,anger,fear,surprise,disgust,love,excitement,hope,disappointment,nostalgia,pride,admiration,confusion,curiosity
0,One of the other reviewers has mentioned that ...,positive,violence,-0.9951,negative,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0
1,A wonderful little production. <br /><br />The...,positive,editing,0.9641,positive,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1
2,I thought this was a wonderful way to spend ti...,positive,comedy,0.9605,positive,1,0,1,0,0,0,1,0,0,1,0,0,0,0,2
3,Basically there's a family where a little boy ...,negative,drama,-0.9213,negative,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive,acting,0.9744,positive,2,0,0,0,1,0,0,0,0,0,0,1,0,0,0


Let's examine how the different aspects were assigned

In [60]:
df.aspect.value_counts()

general              8608
acting               7804
plot                 5539
performance          3996
director             3221
                     ... 
social issues           9
set design              8
production design       7
transitions             5
character arc           2
Name: aspect, Length: 61, dtype: int64

## Topic Modeling

To begin **Topic Modeling**, we need to create a list of **stopwords**. I combined NLTK's built-in English stopword list with my own custom **stopword** list.

In [126]:
nltk.download('wordnet')
nltk.download('stopwords')


# Get the default English stopwords from nltk
default_stopwords = set(nltk.corpus.stopwords.words('english'))

# Define your custom stopwords
custom_stopwords = ['10', 've', 'll', 'br', 'wa', 'ha']

# Combine custom stopwords with default English stopwords
# (converted to a list, since recent scikit-learn versions expect a list for stop_words)
stopwords = list(default_stopwords.union(custom_stopwords))

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\MasonLonoff\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\MasonLonoff\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


The custom stopword list was created through trial and error. In the topics printed below, some words introduced noise and hurt the interpretability of the topics. I removed them over several iterations of running the model.

**Initializing** the TF-IDF vectorizer with the following parameters:
- max_df = 0.95: Words occurring in more than 95% of the documents are ignored
- min_df = 2: A word must appear in at least 2 documents to be included in the analysis
- stop_words = stopwords: The combined stopword list built above

In [98]:
tfidf = TfidfVectorizer(max_df=0.95, min_df=2, stop_words=stopwords)
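The effect of `max_df` and `min_df` is easiest to see on a tiny corpus. The sketch below uses four made-up sentences (an assumption for illustration) and the same two settings, without the custom stopword list:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up mini corpus: "the" and "was" occur in every document,
# while "plot", "fun" and "acting" each occur in only one
docs = [
    'the movie was great',
    'the movie was bad',
    'the plot was great fun',
    'the acting was bad',
]

vec = TfidfVectorizer(max_df=0.95, min_df=2)
matrix = vec.fit_transform(docs)

# vocabulary_ holds only the terms that survived both filters
kept_terms = sorted(vec.vocabulary_)
print(kept_terms)
print(matrix.shape)
```

Terms found in all four documents exceed the 95% ceiling and are dropped, as are terms found in a single document. Only 'bad', 'great' and 'movie' remain as features.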

**Downloading** the necessary resources for **lemmatization** from NLTK. **Initializing** and applying the **lemmatizer**.

In [123]:
nltk.download('wordnet')

# Lemmatize the text data
lemmatizer = WordNetLemmatizer()
df['lemmatized_review'] = df['review'].apply(lambda x: ' '.join([lemmatizer.lemmatize(word) for word in x.split()]))

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\MasonLonoff\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Applying TF-IDF Vectorization on the reviews

In [90]:
dtm = tfidf.fit_transform(df['lemmatized_review'])

**Instantiating** the **Non-Negative Matrix Factorization (NMF)** model

In [91]:
nmf = NMF(n_components=6, random_state=42)
# 6 different topics

**Fitting** the NMF

In [92]:
nmf.fit(dtm)



NMF(n_components=6, random_state=42)
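NMF approximates the document-term matrix as the product of two non-negative matrices: a document-topic matrix and a topic-term matrix. The sketch below shows the resulting shapes on a small random matrix. The stand-in data, the variable names and the `max_iter` value are assumptions for illustration:

```python
import numpy as np
from sklearn.decomposition import NMF

# Random non-negative stand-in for a document-term matrix: 8 documents, 5 terms
rng = np.random.RandomState(0)
X = rng.rand(8, 5)

model = NMF(n_components=2, random_state=42, max_iter=500)
W = model.fit_transform(X)  # document-topic weights, shape (8, 2)
H = model.components_       # topic-term weights, shape (2, 5)

# Each document is assigned the topic with the largest weight in its row of W
assigned = W.argmax(axis=1)

print(W.shape, H.shape)
print(assigned)
```

In this notebook, `nmf.components_` plays the role of `H` (6 topics by vocabulary size), and `nmf.transform(dtm)` produces the per-review topic weights corresponding to `W`.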

Printing out the **Top 15** words for each topic

In [80]:
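# Note: get_feature_names_out() needs scikit-learn >= 1.0; older releases use get_feature_names()
feature_names = tfidf.get_feature_names_out()
for index, topic in enumerate(nmf.components_):
    print(f'The top 15 words for index # {index}:')
    # argsort is ascending, so the last 15 indices are the highest-weighted words
    print([feature_names[i] for i in topic.argsort()[-15:]])
    print('\n')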



The top 15 words for index # 0:
['make', 'girl', 'take', 'way', 'scene', 'time', 'go', 'people', 'two', 'woman', 'like', 'man', 'life', 'get', 'one']


The top 15 words for index # 1:
['movies', 'one', 'made', 'watching', 'people', 'saw', 'ever', 'time', 'think', 'would', 'seen', 'like', 'see', 'watch', 'movie']


The top 15 words for index # 2:
['see', 'people', 'first', 'television', 'would', 'watching', 'think', 'like', 'watch', 'funny', 'tv', 'season', 'series', 'episode', 'show']


The top 15 words for index # 3:
['waste', 'script', 'one', 'guy', 'awful', 'like', 'terrible', 'horror', 'plot', 'really', 'even', 'worst', 'good', 'acting', 'bad']


The top 15 words for index # 4:
['many', 'one', 'make', 'ever', 'director', 'like', 'saw', 'time', 'watch', 'would', 'films', 'made', 'seen', 'see', 'film']


The top 15 words for index # 5:
['excellent', 'book', 'also', 'role', 'cast', 'love', 'performance', 'best', 'really', 'actor', 'well', 'character', 'good', 'story', 'great']




Giving each topic a name and then applying it to each review

In [93]:
# Get the topic distributions for each review
topic_distributions = nmf.transform(dtm)

# Define the topic names
topic_names = [
    'Relationships and Life',
    'General Watching Experience',
    'TV Shows',
    'Criticism',
    'Filmmaking and Film Appreciation',
    'Performances and Storytelling'
]

# Assign each review to its related topic
assigned_topics = topic_distributions.argmax(axis=1)

# Map topic indices to topic names
topic_mapping = {idx: topic_names[idx] for idx in range(len(topic_names))}

# Replace numeric topic values with topic names
df['topic'] = [topic_mapping[idx] for idx in assigned_topics]

df.head()

Unnamed: 0,review,sentiment,aspect,compound,comp_score,joy,sadness,anger,fear,surprise,...,excitement,hope,disappointment,nostalgia,pride,admiration,confusion,curiosity,topic,lemmatized_review
0,One of the other reviewer ha mentioned that af...,positive,violence,-0.9951,negative,0,0,0,0,0,...,0,0,0,0,0,0,0,0,TV Shows,One of the other reviewer ha mentioned that af...
1,A wonderful little production. <br /><br />The...,positive,editing,0.9641,positive,0,0,0,0,0,...,0,1,0,0,0,0,0,1,Performances and Storytelling,A wonderful little production. <br /><br />The...
2,I thought this wa a wonderful way to spend tim...,positive,comedy,0.9605,positive,1,0,1,0,0,...,0,0,1,0,0,0,0,2,Performances and Storytelling,I thought this wa a wonderful way to spend tim...
3,Basically there's a family where a little boy ...,negative,drama,-0.9213,negative,0,0,0,0,0,...,2,0,0,0,0,0,0,0,General Watching Experience,Basically there's a family where a little boy ...
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive,acting,0.9744,positive,2,0,0,0,1,...,0,0,0,0,1,0,0,0,Relationships and Life,"Petter Mattei's ""Love in the Time of Money"" is..."


### Next Steps:

Our accuracy score of roughly 0.70 leaves a lot of room for improvement. I would need to try out different models and compare how accurate each one is. I would want to experiment with Logistic Regression models, Support Vector Machines, Naive Bayes models and more.
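As a rough starting point for that comparison, the sketch below trains the three model families on TF-IDF features and scores each on held-out text. The tiny in-line dataset, the variable names and the default hyperparameters are all assumptions for illustration. On the real data, the texts and labels would come from a train/test split of `df['review']` and `df['sentiment']`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up toy data standing in for a train/test split of the reviews
train_texts = [
    'a wonderful, moving film',
    'great acting and a great story',
    'terrible plot and awful dialogue',
    'a boring waste of time',
    'loved every minute',
    'the worst movie this year',
]
train_labels = ['positive', 'positive', 'negative', 'negative', 'positive', 'negative']
test_texts = ['great story', 'awful waste']
test_labels = ['positive', 'negative']

candidates = {
    'logistic_regression': LogisticRegression(max_iter=1000),
    'linear_svm': LinearSVC(),
    'naive_bayes': MultinomialNB(),
}

results = {}
for name, clf in candidates.items():
    # Each pipeline learns its own TF-IDF vocabulary from the training texts only
    pipe = make_pipeline(TfidfVectorizer(), clf)
    pipe.fit(train_texts, train_labels)
    results[name] = pipe.score(test_texts, test_labels)  # accuracy on held-out texts

print(results)
```

Unlike VADER, these models learn from the labeled reviews, so their accuracy has to be measured on reviews that were held out from training.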