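# Package Installation
This cell installs two Python packages, **"nltk"** and **"sklearn,"** using the Python package manager, **"pip."** Note that the `sklearn` name on PyPI is a deprecated placeholder rather than the library itself (the log below shows version `0.0.post9`). The library is published as `scikit-learn`, which a later cell installs explicitly.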

In [1]:
#  Packages installation
!pip install nltk
!pip install sklearn

Collecting sklearn
  Downloading sklearn-0.0.post9.tar.gz (3.6 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: sklearn
  Building wheel for sklearn (setup.py) ... done
  Created wheel for sklearn: filename=sklearn-0.0.post9-py3-none-any.whl size=2952 sha256=23accfc0d1f2ac3734329b587e6b5733408aa30196ec35033ded7d335a81e645
  Stored in directory: /root/.cache/pip/wheels/33/a3/d2/092b519e9522b4c91608b7dcec0dd9051fa1bff4c45f4502d1
Successfully built sklearn
Installing collected packages: sklearn
Successfully installed sklearn-0.0.post9


# Treebank Corpus
This code snippet uses the Natural Language Toolkit (NLTK) library to download the **"treebank"** corpus, a POS-tagged dataset commonly used in linguistic research and natural language processing (NLP). The code then uses NLTK's **"FreqDist"** and **"ConditionalFreqDist"** classes to build frequency tables. **"FreqDist"** counts how often each item occurs; here it counts the POS tags in the corpus. **"ConditionalFreqDist"** keeps a separate frequency table for each condition; here it counts, for each word, how often that word carries each tag, and later, for each tag, how often each other tag follows it. These counts are the raw material for the tagging model built in the following sections.
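
To make the two table shapes concrete, here is a minimal sketch on hand-made data. It uses `collections.Counter` and `defaultdict` from the standard library as stand-ins for `FreqDist` and `ConditionalFreqDist` (in NLTK these classes derive from `Counter` and `defaultdict` respectively); the toy sentences are an assumption for illustration and are not taken from the corpus.

```python
from collections import Counter, defaultdict

# Toy tagged sentences in the same (word, tag) format as treebank.tagged_sents().
toy_sentences = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("barks", "NNS"), ("echo", "VBP")],
]

tag_counts = Counter()                  # stand-in for FreqDist: tag -> count
word_tag_counts = defaultdict(Counter)  # stand-in for ConditionalFreqDist: word -> tag -> count

for sentence in toy_sentences:
    for word, tag in sentence:
        tag_counts[tag] += 1
        word_tag_counts[word][tag] += 1

print(tag_counts["DT"])                # 2
print(dict(word_tag_counts["barks"]))  # {'VBZ': 1, 'NNS': 1}
```

The second print shows why the conditional table is useful: an ambiguous word such as "barks" keeps a separate count for each tag it has been seen with.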

In [2]:
# Downloading Corpus from NLTK and different FreqDist
import nltk
from nltk.corpus import treebank
from nltk.probability import FreqDist, ConditionalFreqDist

# Download the required dataset
nltk.download('treebank')

[nltk_data] Downloading package treebank to /root/nltk_data...
[nltk_data]   Unzipping corpora/treebank.zip.


True

In [3]:
# Load the Treebank corpus for training, loading the Treebank tagged sentences
treebank_tagged_sentences = treebank.tagged_sents()
# print(treebank_tagged_sentences[0:1])
# print(type(treebank_tagged_sentences))
# for sen in treebank_tagged_sentences:
#   print(sen)
#   print("\n")
# FreqDist is an NLTK class that counts how often each element occurs (a frequency table)
# f = FreqDist({'word1': 2, 'word2': 1, 'word3': 7})
# print(f['word1']) --> output: 2; f.keys() returns 'word1', 'word2', ...

freq_of_tags = FreqDist() # frequencies of the different POS tags in the corpus

# ConditionalFreqDist is another NLTK class; it counts events conditioned on another set of items.
# c = ConditionalFreqDist()
# c[condition][event] += 1  # event is the thing whose frequency is counted under the condition
# For POS tagging we want c[word][pos_tag]: the frequency of each POS tag conditioned on the word
# c['word1'] would then hold counts such as {'VB': 3, 'NN': 5}

freq_of_word_tag_pairs = ConditionalFreqDist() # frequencies of the different word-tag pairs in the corpus
prob_transition = ConditionalFreqDist() # transition table indexed [tag1][tag2]; after normalization it holds P(tag2|tag1)
prob_emission = ConditionalFreqDist() # emission table indexed [word][tag]; it is normalized per word later, so it holds P(tag|word) rather than the textbook P(word|tag)

all_tags = set(tag for sentence in treebank_tagged_sentences for _, tag in sentence)

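# Seed every tag with one self-transition, presumably so that each tag appears
# as a condition in prob_transition; note this adds 1 to every tag -> same-tag count.
for tag in all_tags:
  prob_transition[tag][tag] = 1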

for each_sentence in treebank_tagged_sentences:
  for word_part, tag_part in each_sentence:
      freq_of_tags[tag_part] = freq_of_tags[tag_part] + 1
      freq_of_word_tag_pairs[word_part][tag_part] = freq_of_word_tag_pairs[word_part][tag_part]+1


# for tag, count in freq_of_tags.items():
#     print(f"{tag}: {count}")   # POS_tags: freq_count


# for condition in freq_of_word_tag_pairs.conditions():
#     print(condition, end=': ')
#     for event in freq_of_word_tag_pairs[condition]:
#         count = freq_of_word_tag_pairs[condition][event]
#         print(f'{event}({count})', end=' ')
#     print()  # Move to the next line for the next condition   word: POS_tag(freq_count  `)



# print("T")
# print(freq_of_tags)
# print("HH")
# print(freq_of_word_tag_pairs)

# Probabilistic Modeling of POS Tags

#### Emission Probabilities:
The first part of the code iterates through the tagged sentences in the `treebank` corpus, examining each word-tag pair within a sentence. For each pair it increments `prob_emission[word][tag]`, so the counts accumulate in the `prob_emission` table. In a textbook HMM, the emission probability is the likelihood of observing a specific word given a particular POS tag, P(word | tag). Because this table is indexed by word and later normalized per word, the values it ends up holding are P(tag | word).

#### Transition Probabilities:
The second part of the code focuses on transition probabilities between POS tags, that is, the probability of moving from one POS tag to the next within a sentence. The `prob_transition` table stores these values. To compute them, the code examines consecutive word-tag pairs within a sentence. If an earlier tag is present (the pair is not the first in its sentence), it increments the count for the transition from the earlier tag to the current tag, `prob_transition[earlier_tag][current_tag]`.
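
As a small illustration of the transition counting described above, the sketch below counts tag bigrams for one hand-made sentence. It uses plain `Counter` objects instead of `ConditionalFreqDist`, and the sentence is an assumption for illustration only.

```python
from collections import Counter, defaultdict

toy_sentence = [("the", "DT"), ("dog", "NN"), ("barks", "VBZ"), (".", ".")]

transition_counts = defaultdict(Counter)  # earlier_tag -> current_tag -> count
tags = [tag for _, tag in toy_sentence]

# zip(tags, tags[1:]) yields each (earlier_tag, current_tag) pair exactly once.
# The first tag has no predecessor, so it is never counted as a current_tag.
for earlier_tag, current_tag in zip(tags, tags[1:]):
    transition_counts[earlier_tag][current_tag] += 1

for earlier_tag, following in transition_counts.items():
    print(earlier_tag, dict(following))
```

A sentence of four tokens produces three transitions, which matches the notebook's loop: the `earlier_tag_present is not None` check skips the first token of every sentence.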

#### Normalization:
After counting, both tables are normalized by dividing each count by the total for its condition. For `prob_emission`, the probabilities over tags then sum to 1 for each word; for `prob_transition`, the probabilities over next tags sum to 1 for each current tag. The resulting values are relative frequencies.
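
The normalization step can be sketched on a toy count table as follows. The helper name `normalize` and the counts are assumptions for illustration. Unlike the notebook, which divides the counts in place, this version returns a new table.

```python
def normalize(counts):
    """Turn a {condition: {event: count}} table into relative frequencies."""
    probabilities = {}
    for condition, events in counts.items():
        total = sum(events.values())
        if total > 0:
            probabilities[condition] = {
                event: count / total for event, count in events.items()
            }
    return probabilities


toy_counts = {"DT": {"NN": 3, "JJ": 1}, "NN": {"VBZ": 2}}
toy_probs = normalize(toy_counts)

print(toy_probs["DT"])  # {'NN': 0.75, 'JJ': 0.25}
print(toy_probs["NN"])  # {'VBZ': 1.0}
```

Returning a new table keeps the raw counts intact, which may be convenient in a notebook: re-running a cell that counts into an already normalized table would mix counts and probabilities.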

In [4]:
for each_sentence in treebank_tagged_sentences:
  # print("Sentence ",each_sentence)
  earlier_tag_present = None  # at the beginning of each sentence there is no earlier tag
  for word_part, tag_part in each_sentence:
    # print(f"{word_part} : {tag_part}")
    prob_emission[word_part][tag_part] = prob_emission[word_part][tag_part] + 1
    if earlier_tag_present is not None: # an earlier tag exists, i.e. this is not the first (word, tag) pair
      prob_transition[earlier_tag_present][tag_part] = prob_transition[earlier_tag_present][tag_part] + 1
    earlier_tag_present = tag_part



# for condition in prob_emission.conditions():
#     print(condition, end=': ') # word:
#     for event in prob_emission[condition]:
#         count = prob_emission[condition][event]
#         print(f'{event}({count})', end=' ') #tag(count)
#     print()

# this prints word1: tag11(count_tag11) tag21(count_tag21)
#             word2: tag12(count_tag12) tag22(count_tag22)

# for condition in prob_transition.conditions():
#     print(condition, end=': ') # current_tag
#     for event in prob_transition[condition]:  # event are next POS tags that can follow current tag
#         count = prob_transition[condition][event]
#         print(f'{event}({count})', end=' ') #how many times transition from current tag to next tag has occured
#     print("\n")


# Normalize the counts so that the probabilities for each condition sum to 1.
# The tables then hold relative frequencies, which keeps them consistent with the basic rules of probability.

# for tag in prob_emission:
#   print(tag, end =": ")
#   totalCount = sum(prob_emission[tag].values())
#   print(totalCount, end = ": ")
#   if totalCount > 0:
#     for word in prob_emission[tag]:
#       print(word)
#       prob_transition[tag][word] /= totalCount
# print("\n\n\n\n\n\n")
for word in prob_emission:
    total_count = sum(prob_emission[word].values())
    if total_count > 0:
        for tag in prob_emission[word]:
            prob_emission[word][tag] /= total_count



for current_tag in prob_transition:
    total_count = sum(prob_transition[current_tag].values())
    if total_count > 0:
        for next_tag in prob_transition[current_tag]:
            prob_transition[current_tag][next_tag] /= total_count

# for tag in prob_transition:
#   totalCount = sum(prob_transition[tag].values())
#   if totalCount > 0:
#     for tag1 in prob_transition[tag]:
#       prob_transition[tag][tag1] /= totalCount

# for tag in prob_transition:
#     for tag1 in prob_transition[tag]:
#         total_count = prob_transition[tag].freq(tag1)
#         if total_count > 0:
#             prob_transition[tag][tag1] /= total_count



# for condition in prob_transition.conditions():
#     print(condition, end=': ') # word:
#     for event in prob_transition[condition]:
#         count = prob_transition[condition][event]
#         print(f'{event}({count})', end=' ') #tag(count)
#     print()


In [None]:


# import numpy as np
# def algorithm_viterb_non_optimised(line, prob_trans, prob_emis):
#     words_of_line = line.split() # number of words separated by space
#     words_number = len(words_of_line)
#     print("Words number: ",words_number)
#     #  Getting list of all POS tags we would be using
#     empty_list = []
#     if words_number == 0:
#       return empty_list # no words present to be tagged
#     tags_all_possible = prob_transition.conditions()
#     tags_number = len(tags_all_possible)


#     # 2 tables used in Dynamic programming viterbi algorithm
#     # Initialising these 2 matrices/tables
#     matrix_viterbi = np.zeros((tags_number, words_number)) # storing highest probability of a specific tag at a given word position
#     backpointer = np.zeros((tags_number, words_number), dtype=int)  # stores the previous/earlier tag leading upto highest probability

#     # in dynamic programming we are intitialise the first column/row depending on the problem

#     # initialise first column of Viterbi matrix
#     for index, tag in enumerate(tags_all_possible):
#         word_at_index = words_of_line[0]
#         p_e = prob_emission[word_at_index].freq(tag)

#         if p_e == 0:
#             p_e = 1e-10  #Not having zero probability but very close to zero
#         matrix_viterbi[index][0] = np.log(p_e)  # for 1st word entry in table use log probabilities
#         print(f"martrix_viterbi[{index}][0] = ",matrix_viterbi[index][0])

#     # Fill in the rest of the viterbi and backpointer matrices
#     for word_index in range(1, words_number):
#         for i, this_tag in enumerate(tags_all_possible):
#             maximum_prob = float('-inf')
#             optimum_state_earlier = 0
#             for j, earlier_tag in enumerate(tags_all_possible):
#                 p_t = prob_trans[earlier_tag].freq(this_tag)
#                 p_em = prob_emis[words_of_line[word_index]].freq(this_tag)
#                 if p_t == 0:
#                     p_t = 1e-10  # Avoid zero probability
#                 if p_em == 0:
#                     p_em = 1e-10  # Avoid zero probability
#                 logarithm_probability = matrix_viterbi[j][word_index - 1] + np.log(p_t) + np.log(p_em)  # Use log probabilities
#                 if logarithm_probability > maximum_prob:
#                     print("Logarithm probability ",logarithm_probability)
#                     maximum_prob = logarithm_probability
#                     optimum_state_earlier = j
#                     print("optimum earlier state ",optimum_state_earlier)
#             matrix_viterbi[i][word_index] = maximum_prob
#             backpointer[i][word_index] = optimum_state_earlier

#     # Find the best path by backtracking
#     possible_path_optimal = []
#     maximum_log_final_prob = float('-inf')
#     optimal_final_state = 0
#     for i, tag in enumerate(tags_all_possible):
#       print(f"{i}, {tag} ::::")
#       if matrix_viterbi[i][words_number - 1] > maximum_log_final_prob:
#           maximum_log_final_prob = matrix_viterbi[i][words_number - 1]
#           print("max_log")
#           optimal_final_state = i
#     possible_path_optimal.append(tags_all_possible[optimal_final_state])
#     for t in range(words_number - 1, 0, -1):
#         print(f"t is {t}")
#         optimum_state_earlier = backpointer[optimal_final_state][t]
#         possible_path_optimal.insert(0, tags_all_possible[optimum_state_earlier])
#         optimal_final_state = optimum_state_earlier

#     return possible_path_optimal


# Viterbi Algorithm for POS Tagging

The code below implements the Viterbi algorithm for Part-of-Speech (POS) tagging, a fundamental task in natural language processing. Its key components are:

#### Initialization:
The code begins by splitting the input sentence on whitespace into individual words. It then initializes two matrices, `matrix_viterbi` and `backpointer`, which store the best log probabilities and track the best path of POS tags.

#### Probability Computation:
Before the main loop, the code computes logarithmic probabilities for both emissions (`log_prob_emis`) and transitions (`log_prob_trans`). These values are precomputed for efficiency. A probability that is zero, or missing from the table, is replaced by a small floor value (1e-10) so that its logarithm stays finite.
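
A short sketch of why log probabilities with a floor are used: multiplying many small probabilities underflows to zero, while summing their logarithms stays well within floating-point range. The helper name `safe_log` and the numbers are assumptions for illustration; the floor value 1e-10 is the one used in the notebook.

```python
import math

LOG_FLOOR = math.log(1e-10)  # floor value taken from the notebook


def safe_log(prob):
    """Log of a probability, with zero mapped to the floor instead of -inf."""
    return math.log(prob) if prob > 0 else LOG_FLOOR


step_probs = [1e-5] * 100  # 100 steps, each with a small probability

product = 1.0
for p in step_probs:
    product *= p  # underflows to 0.0 before the loop ends

log_score = sum(safe_log(p) for p in step_probs)

print(product)    # 0.0
print(log_score)  # about -1151.29
```

With the raw product, every long path would score 0.0 and the paths could not be compared; the log scores remain distinct and can simply be added.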

#### Viterbi Algorithm Main Loop:
The core of the Viterbi algorithm is implemented in a loop. It iterates through each word in the input sentence, considering all possible POS tags for each word. For each word and tag combination, it takes the maximum, over all tags of the previous word, of the previous score plus the log transition probability, and then adds the log emission probability.

#### Backtracking:
While filling the matrices, the code also maintains a backpointer that records the optimal previous state (POS tag) for each state at the current word. This information is needed to recover the best path once the entire sentence has been processed.

#### Optimal Path Reconstruction:
After processing all words in the sentence, the code identifies the final state with the highest score. This state corresponds to the last POS tag in the optimal path. The code then backtracks through the backpointer matrix to reconstruct the complete optimal path of POS tags.
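
The steps above can be traced on a tiny hand-made model. The sketch below is a simplified stand-in for the notebook's function, not a copy of it: it uses plain dictionaries instead of NLTK tables, and the names `toy_viterbi` and `log_lookup` as well as all probabilities are assumptions for illustration. It follows the notebook's convention of indexing the emission table by word first.

```python
import math

LOG_FLOOR = math.log(1e-10)  # same floor value as in the notebook


def log_lookup(table, outer, inner):
    """Log of table[outer][inner]; zero or missing entries map to the floor."""
    prob = table.get(outer, {}).get(inner, 0.0)
    return math.log(prob) if prob > 0 else LOG_FLOOR


def toy_viterbi(words, tags, trans, emit):
    if not words:
        return []
    # scores[t][tag]: best log score of any tag path that ends in `tag` at word t
    scores = [{tag: log_lookup(emit, words[0], tag) for tag in tags}]
    backpointers = [{}]
    for t in range(1, len(words)):
        column, pointers = {}, {}
        for tag in tags:
            best_prev = max(
                tags,
                key=lambda prev: scores[t - 1][prev] + log_lookup(trans, prev, tag),
            )
            column[tag] = (
                scores[t - 1][best_prev]
                + log_lookup(trans, best_prev, tag)
                + log_lookup(emit, words[t], tag)
            )
            pointers[tag] = best_prev
        scores.append(column)
        backpointers.append(pointers)
    # Start from the best final tag and follow the backpointers to the first word.
    path = [max(tags, key=lambda tag: scores[-1][tag])]
    for t in range(len(words) - 1, 0, -1):
        path.append(backpointers[t][path[-1]])
    path.reverse()
    return path


tags = ["DT", "NN", "NNS", "VBZ"]
trans = {"DT": {"NN": 0.9, "NNS": 0.1}, "NN": {"VBZ": 0.7, "NNS": 0.3}}
emit = {"the": {"DT": 1.0}, "dog": {"NN": 1.0}, "barks": {"VBZ": 0.6, "NNS": 0.4}}

print(toy_viterbi(["the", "dog", "barks"], tags, trans, emit))
# ['DT', 'NN', 'VBZ']
```

In this toy model "barks" is ambiguous between `VBZ` and `NNS`; the path through `NN -> VBZ` wins because both its transition and its emission value are larger.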

In [5]:
import numpy as np
# The Viterbi Algorithm is a dynamic programming approach that finds the most likely sequence of POS tags for a given sentence.

# Here's a step-by-step breakdown of the Viterbi Algorithm:

# Initialize Tables: Initialize the Viterbi table and backpointer table.

# Initialize the First Column: For the first word in the sentence, set the Viterbi score of every tag from its emission log probability (this implementation uses no start-of-sentence transition).

# Fill in the Rest of the Table: Iterate through the remaining words in the sentence, updating the Viterbi scores for each tag at each step.

# Backtracking: Once you have filled the table, backtrack through it to find the best sequence of POS tags.

def algorithm_viterbi(line, prob_trans, prob_emis):
    words_of_line = line.split()
    words_number = len(words_of_line)

    if words_number == 0:
        return []  # Return an empty list if there are no words.

    # Getting list of all POS tags we would be using
    tags_all_possible = list(prob_trans.conditions())
    tags_number = len(tags_all_possible)

    # Initialize matrices using Numpy arrays
    matrix_viterbi = np.zeros((tags_number, words_number))
    backpointer = np.zeros((tags_number, words_number), dtype=int)

    # Precompute log probabilities for emissions and transitions
    log_prob_emis = {word: {tag: np.log(prob) if prob > 0 else np.log(1e-10) for tag, prob in prob_emis[word].items()} for word in words_of_line}
    log_prob_trans = {tag1: {tag2: np.log(prob) if prob > 0 else np.log(1e-10) for tag2, prob in prob_trans[tag1].items()} for tag1 in tags_all_possible}

    # Initialize the first column of Viterbi matrix
    for index, tag in enumerate(tags_all_possible):
        word_at_index = words_of_line[0]
        p_e = log_prob_emis[word_at_index].get(tag, np.log(1e-10))
        matrix_viterbi[index][0] = p_e

    # Fill in the rest of the Viterbi and backpointer matrices
    for word_index in range(1, words_number):
        for i, this_tag in enumerate(tags_all_possible):
            maximum_prob = float('-inf')
            optimum_state_earlier = 0
            for j, earlier_tag in enumerate(tags_all_possible):
                p_t = log_prob_trans[earlier_tag].get(this_tag, np.log(1e-10))
                p_em = log_prob_emis[words_of_line[word_index]].get(this_tag, np.log(1e-10))
                logarithm_probability = matrix_viterbi[j][word_index - 1] + p_t + p_em
                if logarithm_probability > maximum_prob:
                    maximum_prob = logarithm_probability
                    optimum_state_earlier = j
            matrix_viterbi[i][word_index] = maximum_prob
            backpointer[i][word_index] = optimum_state_earlier

    # Find the best path by backtracking
    possible_path_optimal = []
    maximum_log_final_prob = float('-inf')
    optimal_final_state = 0
    for i, tag in enumerate(tags_all_possible):
        if matrix_viterbi[i][words_number - 1] > maximum_log_final_prob:
            maximum_log_final_prob = matrix_viterbi[i][words_number - 1]
            optimal_final_state = i
    possible_path_optimal.append(tags_all_possible[optimal_final_state])
    for t in range(words_number - 1, 0, -1):
        optimum_state_earlier = backpointer[optimal_final_state][t]
        possible_path_optimal.insert(0, tags_all_possible[optimum_state_earlier])
        optimal_final_state = optimum_state_earlier

    return possible_path_optimal


# Test Example Sentence

Given an `example` sentence, we call `algorithm_viterbi` to get a list of POS tags, one per word, and store it in `tagged`.
This is a quick check that the function runs correctly.



In [10]:
example = "In October , South Korea's economy reflected sluggishness . Trade deficits cast doubt on export-oriented growth . Newsweek announced new ad rates ."
# print(example)
# Print the ConditionalFreqDist in a custom format
# for condition in prob_emission.conditions():
#     print(condition, end=': ')
#     for event in prob_emission[condition]:
#         print(f'{event}({prob_emission[condition][event]})', end=' ')
#     print()  # Move to the next line for the next condition
# # print(prob_transition)
# # count=0
# # for i in prob_transition:
#     print(i)
#     count+=1
#     if count == 40:
#       break
tagged = algorithm_viterbi(example,prob_transition,prob_emission)
print(example)
print(tagged)
# example=example.split()
# for i in range(len(example)):
#   word = example[i]
#   tag = tagged[i]
#   print(f"{word}/{tag}", end=" ")

In October , South Korea's economy reflected sluggishness . Trade deficits cast doubt on export-oriented growth . Newsweek announced new ad rates .
['IN', 'NNP', ',', 'NNP', 'NNP', 'NN', 'VBD', 'NN', '.', 'NNP', 'VBD', 'VBN', 'NN', 'IN', 'JJ', 'NN', '.', 'NNP', 'VBD', 'JJ', 'NN', 'NNS', '.']
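
The two printed lines above are easier to inspect when each word is paired with its tag, which is what the commented-out loop at the end of the cell does. A minimal sketch using `zip`, with the sentence and tag list copied from the recorded output:

```python
example = ("In October , South Korea's economy reflected sluggishness . "
           "Trade deficits cast doubt on export-oriented growth . "
           "Newsweek announced new ad rates .")
tagged = ['IN', 'NNP', ',', 'NNP', 'NNP', 'NN', 'VBD', 'NN', '.',
          'NNP', 'VBD', 'VBN', 'NN', 'IN', 'JJ', 'NN', '.',
          'NNP', 'VBD', 'JJ', 'NN', 'NNS', '.']

# split() matches the tokenization used inside algorithm_viterbi,
# so the two lists have the same length and line up one to one.
pairs = list(zip(example.split(), tagged))
print(" ".join(f"{word}/{tag}" for word, tag in pairs))
```

Reading the pairs side by side makes questionable tags easy to spot, for example `deficits/VBD`. One possible explanation is that a word never seen in the Treebank sample receives the floor value for every tag, so only the transition scores decide its tag.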


# Setting up the Movie Reviews Dataset and Classification Models

In [15]:
!pip install scikit-learn
import nltk
import numpy as np
from nltk.corpus import movie_reviews
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score , classification_report
# Step 0: Load the movie_reviews corpus
nltk.download('movie_reviews')



[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!


True

# Preparing the Movie Reviews Dataset

In this code section, we prepare the movie reviews dataset for sentiment analysis:

We start with a list called `documents`, where each element is a tuple containing the list of words of one movie review and its category (`pos` or `neg`).

Next, we split the dataset into training, validation, and test sets using the `train_test_split` function from scikit-learn. The first call holds out 20% of the documents as the test set; the second call holds out 20% of the remainder as the validation set, leaving roughly 64% for training, 16% for validation, and 20% for testing.

We extract text data and labels from each of these subsets, joining the words in each review into a single string and storing them in `X_train`, `X_val`, and `X_test` for text data, and `y_train`, `y_val`, and `y_test` for labels.

To convert the text data into numerical features, we use TF-IDF vectorization with a maximum of 10,000 features. The vectorizer is fitted on the training text only and then applied to the validation and test text, which yields `X_train_tfidf`, `X_val_tfidf`, and `X_test_tfidf`, ready for use in machine learning models.

In summary, this code splits the movie reviews dataset into subsets and converts the text into TF-IDF features for sentiment analysis.
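
The sizes produced by the two-stage split can be checked with a little arithmetic. The sketch below assumes that the held-out part is rounded up and the rest goes to the other side, which is my understanding of how `train_test_split` treats a fractional `test_size`; the helper name `two_stage_split_sizes` is hypothetical. The movie reviews corpus is generally described as containing 2,000 reviews.

```python
import math


def two_stage_split_sizes(n_documents, test_pct=20, val_pct=20):
    """Sizes after holding out test_pct, then val_pct of the remainder."""
    n_test = math.ceil(n_documents * test_pct / 100)
    n_remaining = n_documents - n_test
    n_val = math.ceil(n_remaining * val_pct / 100)
    n_train = n_remaining - n_val
    return n_train, n_val, n_test


print(two_stage_split_sizes(2000))  # (1280, 320, 400)
print(two_stage_split_sizes(400))   # (256, 64, 80)
```

Under these assumptions the full corpus gives 1,280 training reviews. The `Training:  256` line in the recorded output further down would instead be consistent with a run on a 400-review subsample, such as the commented-out alternative split in the code cell produces.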

In [16]:

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
# print(type(documents))

# for i in range(1):
#   print(documents[i])

# if i=1: documents[i][0] is the words of review for item number i and documents[i][1] is the category pos/neg of that item i


# Step 1: Split the dataset into train, validation, and test sets
# You can adjust the train-validation-test split ratios as needed.



train_documents, test_documents = train_test_split(documents, test_size=0.2, random_state=42)
train_documents, val_documents = train_test_split(train_documents, test_size=0.2, random_state=42)


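# Alternative kept for reference: first keep only a 20% subsample of the corpus
# (presumably to shorten the tagging run), then split that subsample as above.
# random, train_documents = train_test_split(documents,test_size=0.2,random_state=42)
# train_documents, test_documents = train_test_split(train_documents, test_size=0.2, random_state=42)
# train_documents, val_documents = train_test_split(train_documents, test_size=0.2, random_state=42)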


# Extracting text data and labels from train_documents
X_train = []  # To store text data for training
y_train = []  # To store labels for training

for document, category in train_documents:
    # Join the list of words in the document into a single string
    text_data = ' '.join(document)
    X_train.append(text_data)
    y_train.append(category)

# Extracting text data and labels from val_documents
X_val = []  # To store text data for validation
y_val = []  # To store labels for validation

for document, category in val_documents:
    # Join the list of words in the document into a single string
    text_data = ' '.join(document)
    X_val.append(text_data)
    y_val.append(category)

# Extracting text data and labels from test_documents
X_test = []  # To store text data for testing
y_test = []  # To store labels for testing

for document, category in test_documents:
    # Join the list of words in the document into a single string
    text_data = ' '.join(document)
    X_test.append(text_data)
    y_test.append(category)


tfidf_vectorizer = TfidfVectorizer(max_features=10000)
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_val_tfidf = tfidf_vectorizer.transform(X_val)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

# Part-of-Speech Tagging for Training Data

In this section, we perform Part-of-Speech (POS) tagging for the training data using the Viterbi algorithm. The code iterates through the entries of `X_train`, each of which is a whole review joined into one string, and applies `algorithm_viterbi` to assign a POS tag to every whitespace-separated token.

The result is stored in the `pos_tags_train` list, which contains one list of POS tags per training review.

In [17]:
print("Training: " ,len(X_train))
print("\n\n\n")
# t = """synopsis : al simmons , top - notch assasin with a guilty conscience , dies in a fiery explosion and goes to hell . making a pact with malebolgia , a chief demon there , simmons returns to earth 5 years later reborn as spawn , a general in hell ' s army donning a necroplasmic costume replete with knives , chains , and a morphing cape . sullen , wise cogliostro and flatulating , wisecracking violator vy for spawn ' s attention . comments : when todd mcfarlane left marvel comics ( where he had made a name for himself as a first - rate comic book penciller on the " spider - man " titles ) to join the newly - formed , creator - owned image comics , a new comic book legend was born : spawn . mcfarlane ' s " spawn " immediately became a commercial and critical success and a defining comic book series of the 1990s . mcfarlane created a hero who was not only original but visually intricate , allowing mcfarlane to utilize his knack for artistic detail to the max . the early " spawn " issues brilliantly capture mcfarlane ' s genius at illustration and show his early attempts at writing . with the popularity of " spawn " and the success of the current warner bros . ' s batman film franchise , a movie version of some sort seemed inevitable for spawn . in the summer of 1997 , hence , new line cinema released spawn , a live - action film based on the groundbreaking series . this topheavy exercise in violence and special effects unfortunately topples quickly and leaves fans of the comic book , like me , numbed by how much spawn misses the mark . what happened ? why is spawn so bad ? todd mcfarlane himself executive produced this disappointing misfire and even appears in a cameo . i don ' t think , however , that his presence necessarily hurt ( or helped ) the film . i place the blame , in part , on the recent hollywood trend , fueled by public demand apparently , for special effects blow - out movies utilizing the latest computer technology . 
# these films focus upon the effects at the expense of everything else : character , plot , dialogue , etc . spawn , reflecting this trend , shows the audience one gratuitous scene after another populated with morphing characters and filled with unnecessary pyrotechnics . hardly a minute goes by in this film without fires , explosions , knives and chains appearing out of nowhere , glowing eyes , or constantly transforming demons . a lot of it is visually interesting and technically solid , don ' t get me wrong , but , because the script and cast aren ' t engaging , spawn ultimately comes across like overwrought wallpaper ( the surface may capture the eye , but nothing exists underneath ) . spawn ' s translation of the comic book suffers the most at the storyline level . mcfarlane ' s spawn was a tortured hero . a mercenary by trade , al simmons was nonetheless a warm man in love with the beautiful wanda . having died and journeyed to hell , he made a pact to return to earth to be with wanda . simmons , however , discovers that his memories are fragmented , his body a creepy mess , and his wife married . despite his sometimes violent nature , readers couldn ' t help but feel sympathetic toward his plight as the spawn of the underworld . spawn attempts to show all of this but does not spend nearly the time it should to do so . when the characters are developed , they seem absurd rather than touching . the cartoonish dialogue and implausible subplot ( a general possesses the antidote to a supervirus called heat - 16 which he wishes to unleash to enslave the world ) do not help matters . spawn , in an apparent attempt to duplicate the success of batman , also unwisely spends too much time on a villain , the violator ( batman favored the joker over batman ) . john leguizamo , like jack nicholson in batman , receives top billing in the cast as the violator ; michael jai white ( al simmons / spawn ) is second .
# i ordinarily find leguizamo an intensely annoying presence in films which seems to make him a perfect candidate for the violator . the film , however , spends so much time on the violator ' s offensive antics that they grate on the nerves . apparently meant to be the comic relief in the film ( as nicholson was in batman ) , especially when contrasted with the sullen spawn , the violator ' s lines are oftentimes grotesque and unfunny , leaving the audience wishing he would leave . leguizamo does a satisfactory job in the role , but he is seen far too often in the film . michael jai white , a relative newcomer to theatrical releases , seems to be an appealing actor , and he handles his role adequately , but we see little of him without various masks on . more time needed to be spent on white ' s character before he became spawn for the movie to pull at the heartstrings . a special note should be made about martin sheen as the over - the - top , obnoxious , evil general wynn . easily the hammiest performance in the movie , it ' s hard to imagine how sheen mucked up his role so much ; after all , he played a vietnam assasin brilliantly in the great apocalypse now . sheen ' s excessive demeanor do not help the audience accept him as a mastermind villain and comes as a surprise considering his extensive career in film . many other elements conspire with the disappointing script and abundant special effects to drag spawn down . mtv - style , jerky , in - your - face editing is one of them . flames , for example , roll across the screen sometimes to announce a shift in setting . cogliostro , unlikely wannabe guide for spawn , serves as a poor narrator for the film . he goofily tells the audience , at one point , that " how much of [ spawn ' s ] humanity is left remains to be seen , " as if the audience really cares as one violent sequence leads to another . the music , finally , assaults the audience as much as the manic violence and offensive dialogue .
# loud and obnoxious hard rock fused with drum loops dominate some scenes . to be fair , however , marilyn manson ' s " long hard road out of hell " effectively compliments spawn ' s return to earth , while filter and the crystal method ' s " ( can ' t you ) trip like i do " proves a surprisingly fitting theme song . for as good a comic book as it is , " spawn " did not spawn a good movie . spawn , instead , suffers from too much pomp and circumstance , and too little plot and character development . it receives two stars for its technically well - done special effects . many other films , though , have equal , if not superior , special effects and are much better . rated pg - 13 , spawn seems more violent than many r - rated movies and probably wouldn ' t be appropriate for the very young ."""
# tag = algorithm_viterbi(t, prob_transition, prob_emission)
# print(tag)
pos_tags_train = []
count=0
for sentence in X_train:
    print("Train ",count,end=": ")
    print(sentence)
    count+=1
    tags = algorithm_viterbi(sentence, prob_transition, prob_emission)
    pos_tags_train.append(tags)

# print("\n\n\nValidation\n\n\n")
# pos_tags_val = []
# for sentence in X_val:
#     tags = algorithm_viterbi(sentence, prob_transition, prob_emission)
#     pos_tags_val.append(tags)

# print("\n\n\nTesting\n\n\n")
# pos_tags_test = []
# for sentence in X_test:
#     tags = algorithm_viterbi(sentence, prob_transition, prob_emission)
#     pos_tags_test.append(tags)

# # Step 2: Feature Engineering
# # You should have your TF-IDF vectors (X_train_tfidf, X_val_tfidf, X_test_tfidf) from your previous code.
# # Now, convert POS tags to a suitable format (e.g., one-hot encoding or embeddings).

# # Step 3: Combine Features
# # Combine TF-IDF vectors with POS tag features for each dataset (train, validation, test).
# X_train_combined = combine_features(X_train_tfidf, pos_tags_train)
# X_val_combined = combine_features(X_val_tfidf, pos_tags_val)
# X_test_combined = combine_features(X_test_tfidf, pos_tags_test)

# # Step 4: Train the Classifier
# svm_classifier = SVC(kernel="linear")
# svm_classifier.fit(X_train_combined, y_train)
# svm_val_predictions = svm_classifier.predict(X_val_combined)
# accuracy_svm = accuracy_score(y_val, svm_val_predictions)
# print(f"SVM Validation Accuracy: {accuracy_svm:.2f}")

# # Make predictions on the test set
# svm_test_predictions = svm_classifier.predict(X_test_combined)

# # Calculate and print the test accuracy
# accuracy_svm_test = accuracy_score(y_test, svm_test_predictions)
# print(f"SVM Test Accuracy: {accuracy_svm_test:.2f}")

Training:  256




Train  0: kirk douglas is one of those rare american actors who can say more with a simple glance than most can say with pages of dialogue . all he has to do is look at someone with a raised eyebrow , and you instantly know what he ' s thinking . " detective story " features one of kirk douglas ' s finest performances . he stars as a new york detective that has his whole world fall apart in one night . the film is based on a play , and this is quite evident , as most of the movie takes place in the one - room flat that the detective ' s work in . the film opens with the douglas character getting ready to go home to his wife , but through a series of events , he never quite makes it there . the bulk of the movie follows the case of a man named schneider , a surgeon who routinely performs abortions with a high fatality rate . however , this schneider character has a connection in the past of douglas ' s wife ; a connection douglas himself is not aware of . many secrets

#POS-Tagging for Validation and Testing Data

In [18]:
# Tag every validation review with the Viterbi decoder
print("\n\n\nValidation ", len(X_val))
print("\n\n\n")
pos_tags_val = []
for count, sentence in enumerate(X_val):
    print("Validate ", count, end=": ")
    print(sentence)
    tags = algorithm_viterbi(sentence, prob_transition, prob_emission)
    pos_tags_val.append(tags)

# Tag every test review with the Viterbi decoder
print("\n\n\nTesting: ", len(X_test))
print("\n\n\n")
pos_tags_test = []
for count, sentence in enumerate(X_test):
    print("Test ", count, end=": ")
    print(sentence)
    tags = algorithm_viterbi(sentence, prob_transition, prob_emission)
    pos_tags_test.append(tags)

# The remaining steps are carried out in the next code cell:
#   Step 2: Feature Engineering - convert the POS tags to a numeric format (e.g., one-hot encoding).
#   Step 3: Combine Features - join the TF-IDF vectors (X_train_tfidf, X_val_tfidf, X_test_tfidf)
#           with the POS tag features for each dataset (train, validation, test).
#   Step 4: Train the Classifier - fit a linear SVM and report validation and test accuracy.




Validation  64




Validate  0: synopsis : a humorless police officer ' s life changes when he befriends a super - smart , super - adorable golden retriever named einstein and a cute , young blond scientist . unfortunately , einstein shares a psychic link with a bigfoot - sized ape - creature trained by the blond scientist to be an unstoppable killing machine , and this rogaine - nightmare is loose and after the dog and the girl . meanwhile , a group of white , chain - smoking , gun - toting nsa agents in sunglasses and business suits tries to kill all the other characters in the movie . comments : watchers reborn , a cheaply made direct - to - video turkey , is the fourth sequel to the first film version of dean koontz ' s bestselling novel watchers . technically , this should have been called watchers v , but it seems that this cycle of horror movies , much like many other sequel - crazy film series , has decided to drop the numbers from the titles . ( even the star trek movies dr

#Combining POS Tags with Text Data

In this section, we concatenate the Part-of-Speech (POS) tag sequences generated for the training, validation, and test sets into a single list called `all_pos_tags`. Fitting one encoder on this combined list ensures that the same POS tag encoding scheme, with the same columns in the same order, is applied to all three datasets.
A detailed explanation of the combining step is given in the report.
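The idea of encoding all three sets against one shared tag vocabulary and then slicing the result back apart can be sketched in plain Python. This is only an illustration, not the notebook's implementation: the toy tag lists and the helper `multi_hot` are hypothetical, and the sketch assumes that `algorithm_viterbi` returns a list of tag strings per review.

```python
# Toy tag sequences (hypothetical); the real ones come from algorithm_viterbi.
pos_tags_train = [["DT", "NN", "VBZ"], ["NN", "NN"]]
pos_tags_val = [["JJ", "NN"]]
pos_tags_test = [["VBZ", "RB"]]


def multi_hot(tag_lists):
    """Return (sorted vocabulary, one 0/1 presence row per tag list)."""
    vocab = sorted({tag for tags in tag_lists for tag in tags})
    rows = []
    for tags in tag_lists:
        present = set(tags)
        rows.append([1 if tag in present else 0 for tag in vocab])
    return vocab, rows


# Encode everything against one shared vocabulary...
all_pos_tags = pos_tags_train + pos_tags_val + pos_tags_test
vocab, encoded = multi_hot(all_pos_tags)

# ...then slice the rows back into the three original sets.
n_train, n_val = len(pos_tags_train), len(pos_tags_val)
train_enc = encoded[:n_train]
val_enc = encoded[n_train:n_train + n_val]
test_enc = encoded[n_train + n_val:]

print(vocab)      # ['DT', 'JJ', 'NN', 'RB', 'VBZ']
print(train_enc)  # [[1, 0, 1, 0, 1], [0, 0, 1, 0, 0]]
```

Note that a presence/absence encoding like this one discards how often each tag occurs: the second toy review contains `NN` twice but still gets a single 1 in the `NN` column.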

#SVM Classifier and Evaluation
##( POS-TAG ENHANCED MODEL )

In this section, we use an SVM classifier with a linear kernel to perform sentiment analysis. The classifier is trained on the TF-IDF features concatenated with the encoded POS tag features (`X_train_combined`). We evaluate the model on the validation and test sets, reporting accuracy along with classification reports (`classification_rep_val` and `classification_rep_test`) that give precision, recall, F1-score, and support for each sentiment class. This assessment gauges the effectiveness of the POS-tag-enhanced sentiment analysis model.
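To make the numbers in a classification report easier to read, the following minimal sketch computes accuracy and per-class precision, recall, F1-score, and support by hand. The function `per_class_report` and the toy labels are hypothetical and serve only as an illustration; the notebook itself relies on scikit-learn's `accuracy_score` and `classification_report`.

```python
def per_class_report(y_true, y_pred):
    """Return (accuracy, {label: precision/recall/f1/support}) for two label lists."""
    pairs = list(zip(y_true, y_pred))
    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        # Support is the number of true examples of this class.
        report[label] = {"precision": precision, "recall": recall,
                         "f1": f1, "support": tp + fn}
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    return accuracy, report


# Toy labels (hypothetical): one "neg" review is misclassified as "pos".
y_true = ["neg", "neg", "neg", "pos", "pos"]
y_pred = ["neg", "pos", "neg", "pos", "pos"]

accuracy, report = per_class_report(y_true, y_pred)
print(f"accuracy: {accuracy:.2f}")
for label, scores in report.items():
    print(label, {name: round(value, 2) for name, value in scores.items()})
```

In this toy case every `neg` prediction is correct (precision 1.0) but one true `neg` is missed (recall 2/3), which is the kind of asymmetry the per-class rows of the report expose and overall accuracy hides.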

In [19]:
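# Imports needed by this cell (harmless if already imported earlier in the notebook)
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer

# Combine POS tags for training, validation, and test sets into one list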
all_pos_tags = pos_tags_train + pos_tags_val + pos_tags_test

# Create a MultiLabelBinarizer and fit it to all POS tags
mlb = MultiLabelBinarizer()
pos_tags_encoded = mlb.fit_transform(all_pos_tags)

# Split the encoded POS tags back into training, validation, and test sets
pos_tags_train_encoded = pos_tags_encoded[:len(pos_tags_train)]
pos_tags_val_encoded = pos_tags_encoded[len(pos_tags_train):len(pos_tags_train) + len(pos_tags_val)]
pos_tags_test_encoded = pos_tags_encoded[len(pos_tags_train) + len(pos_tags_val):]

# Combine the TF-IDF features with the encoded POS tag features
X_train_combined = np.hstack((X_train_tfidf.toarray(), pos_tags_train_encoded))
X_val_combined = np.hstack((X_val_tfidf.toarray(), pos_tags_val_encoded))
X_test_combined = np.hstack((X_test_tfidf.toarray(), pos_tags_test_encoded))

# SVM classifier
svm_classifier = SVC(kernel="linear")
svm_classifier.fit(X_train_combined, y_train)
print("\nPOS-TAG ENHANCED MODEL\n")
# Validation
svm_val_predictions = svm_classifier.predict(X_val_combined)
accuracy_svm = accuracy_score(y_val, svm_val_predictions)
print(f"\nSVM Validation Accuracy: {accuracy_svm:.4f}")

classification_rep_val = classification_report(y_val, svm_val_predictions)
print("Classification Report for Validation:\n", classification_rep_val)
# Testing
svm_test_predictions = svm_classifier.predict(X_test_combined)
accuracy_svm_test = accuracy_score(y_test, svm_test_predictions)
print(f"\nSVM Test Accuracy: {accuracy_svm_test:.4f}")

classification_rep_test = classification_report(y_test, svm_test_predictions)
print("Classification Report for Testing:\n", classification_rep_test)



POS-TAG ENHANCED MODEL


SVM Validation Accuracy: 0.7344
Classification Report for Validation:
               precision    recall  f1-score   support

         neg       0.69      0.76      0.72        29
         pos       0.78      0.71      0.75        35

    accuracy                           0.73        64
   macro avg       0.73      0.74      0.73        64
weighted avg       0.74      0.73      0.73        64


SVM Test Accuracy: 0.6000
Classification Report for Testing:
               precision    recall  f1-score   support

         neg       0.67      0.58      0.62        45
         pos       0.54      0.63      0.58        35

    accuracy                           0.60        80
   macro avg       0.60      0.60      0.60        80
weighted avg       0.61      0.60      0.60        80



#Evaluation
##( BASELINE MODEL WITH NO TAGS )

In [20]:
import nltk
from nltk.corpus import movie_reviews
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report

# Build TF-IDF features; the vectorizer is fit on the training set only
tfidf_vectorizer = TfidfVectorizer(max_features=10000)
X_train_tf = tfidf_vectorizer.fit_transform(X_train)
X_val_tf = tfidf_vectorizer.transform(X_val)
X_test_tf = tfidf_vectorizer.transform(X_test)

# SVM classifier trained on TF-IDF features only (no POS tags)
svm_classifier = SVC(kernel="linear")
svm_classifier.fit(X_train_tf, y_train)

print("\nBASELINE MODEL WITHOUT TAGS\n")
svm_val_predictions = svm_classifier.predict(X_val_tf)
accuracy_svm = accuracy_score(y_val, svm_val_predictions)
print(f"\nSVM Validation Accuracy: {accuracy_svm:.4f}")
classification_rep_val = classification_report(y_val, svm_val_predictions)
print("Classification Report for Validation:\n", classification_rep_val)


# Make predictions on the test set
svm_test_predictions = svm_classifier.predict(X_test_tf)

# Calculate and print the test accuracy
accuracy_svm_test = accuracy_score(y_test, svm_test_predictions)

print(f"\nSVM Test Accuracy: {accuracy_svm_test:.4f}")

classification_rep_test = classification_report(y_test, svm_test_predictions)
print("Classification Report for Testing:\n", classification_rep_test)


BASELINE MODEL WITHOUT TAGS


SVM Validation Accuracy: 0.6719
Classification Report for Validation:
               precision    recall  f1-score   support

         neg       0.65      0.59      0.62        29
         pos       0.68      0.74      0.71        35

    accuracy                           0.67        64
   macro avg       0.67      0.66      0.67        64
weighted avg       0.67      0.67      0.67        64


SVM Test Accuracy: 0.6000
Classification Report for Testing:
               precision    recall  f1-score   support

         neg       0.71      0.49      0.58        45
         pos       0.53      0.74      0.62        35

    accuracy                           0.60        80
   macro avg       0.62      0.62      0.60        80
weighted avg       0.63      0.60      0.60        80

