<a href="https://colab.research.google.com/github/hazieon/CSS-Clock/blob/master/Assignment_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Hazel Andrew - Natural Language Processing Assignment 3

##  Identifying Subjectivity in Text

- First, let's import and load all the necessary modules and datasets

In [1]:
import nltk
from nltk.corpus import subjectivity
from nltk.classify import NaiveBayesClassifier
from nltk.classify.util import accuracy as nltk_accuracy
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

In [2]:
# nltk.download('movie_reviews')  #Dataset of movie reviews
nltk.download('stopwords') # Common words that usually don't carry much meaning
nltk.download('punkt') # Data for tokenizing (splitting text into words)
nltk.download('subjectivity')
# need to download non-nltk packages, not import

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package subjectivity to /root/nltk_data...
[nltk_data]   Unzipping corpora/subjectivity.zip.


True

## Exploring the dataset [Task 1]

- Now let's explore the contents and structure of the data set. First I used the help() function.  Doing this, I found I can use other built in functions to to see what categories exist, and look at the first 10 sentences in the dataset.

In [3]:
# help(nltk.corpus.subjectivity)
# this shows we can use the methods categories() and sents() to explore the data

subjectivity.categories()
# ['obj', 'subj']

# Look at the first 10 sentences:
subjectivity.sents()[:10]

# Create a list of tuples for each category, subjective and objective
subjective_data = [(sentence, 'subj') for sentence in subjectivity.sents(categories='subj')]
objective_data = [(sentence, 'obj') for sentence in subjectivity.sents(categories='obj')]

# Total number of subjective sentences:
print(f"Number of subjective sentences: {len(subjective_data)}")
print(f"Number of objective sentences: {len(objective_data)}")

# Examples of subjective and objective sentences:
print(f"Example subjective sentence: {' '.join(subjective_data[0][0])}")
print(f"Example objective sentence: {' '.join(objective_data[0][0])}")


Number of subjective sentences: 5000
Number of objective sentences: 5000
Example subjective sentence: smart and alert , thirteen conversations about one thing is a small gem .
Example objective sentence: the movie begins in the past where a young boy named sam attempts to save celebi from a hunter .


## Feature extraction [Task 2]

- Next, let's create a set of stopwords and define a function to help us remove stopwords from the data set

In [4]:
stopwords_set = set(stopwords.words('english'))
# Exclude common stopwords to improve the classifier's focus
# print(stopwords_set)

# Implement a feature extraction function that converts sentences into a suitable format for training:
# Create a dictionary containing only non-stopword words
def extract_features(words):
  # Create a dictionary where the keys are words NOT in stopwords set
  return {word: True for word in words if word not in stopwords_set}


- Prepare to store subjective and objective sentences by initialising lists

In [10]:
# Initialise two lists (arrays) to hold the subjective and objective snippets
features_subjective = []
features_objective = []

- Now define a loop to categorise and organise subjective sentences

In [16]:
for f in subjectivity.fileids('subj'):
  #Extract words
  words = subjectivity.words(fileids=[f])
  #Add the feature to the subjective list with a label
  features_subjective.append((extract_features(words),'Subjective'))

  # look at the data to check it is as expected (my own habit):
  print(features_subjective[:10])



- Repeat this process to categorise and organise objective sentences

In [17]:
for f in subjectivity.fileids('obj'):
  #Extract words from the review
  words = subjectivity.words(fileids=[f])
  #Add the feature to the subjective list with a label
  features_objective.append((extract_features(words),'Objective'))

  # look at the data to check it is as expected (my own habit):
  print(features_objective[:10])



## Separate training & testing data [Task 3]


- First establish the split point to divide the data


In [18]:
#Split the data into training and testing set (80% training, 20% testing)
threshold_factor = 0.80
threshold_subjective = int(threshold_factor * len(features_subjective))
threshold_objective = int(threshold_factor * len(features_objective))

In [20]:
#Split the data
#The training set includes the first threshold_subjective features from subjective data and the first threshold_objective features from objective data.
#The testing set includes the remaining features from both subjective and objective data.
features_train = features_subjective[:threshold_subjective] + features_objective[:threshold_objective]
features_test = features_subjective[threshold_subjective:] + features_objective[threshold_objective:]

## Time to begin the training

In [28]:
# Train a Naive Bayes classifier using the training data
classifier = NaiveBayesClassifier.train(features_train)

# Evaluate and print the accuracy of the classifier on the test data
print(f'Accuracy: {nltk_accuracy(classifier, features_test):.2f}')
#gives an accuracy of 1.00, but must consider if there are any false positives if further using the data

Accuracy: 1.00


In [32]:
#Show which features (words) are most informative in determing the sentiment of a review
most_informative = classifier.most_informative_features()

for feature, weight in most_informative:
    print(f'{feature}: {weight}')

#: None
#1: None
#626: None
#9: True
$10: None
$100: None
$15m: None
$20: None
$200: None
$25: None
$30: None
$300: None
$40: None
$5: None
$50-million: None
$65: None
$7: True
$8: True
$9: True
%20john: None
%20laurie: None
&: True
&#171: None
&#173: None
&#193: None
&#227: None
&#237: None
&#38: None
'20: True
'50s: True
'90s: True
'[sex: True
'[the: True
'a': None
'accident': None
'alabama': True
'alaipayuthe': None
'all: True
'alternate: True
'amateur': True
'an: True
'analyze: True
'anime': None
'anthony: True
'arroz: None
'artistically': True
'artÃ­stico': True
'atlantis: True
'bad: None
'bartleby': True
'been: True
'belgium's: True
'best: True
'big: None
'blade: True
'blonde: None
'bloody: None
'blue: True
'bold': True
'boys': None
'brass: None
'brazil: True
'broadway': None
'businessman': None
'butterfingered': True
'buy': None
'casting: None
'catch: True
'challenging': True
'chan: True
'characters: True
'chick: True
'chris: True
'christian: True
'comedy': True
'cq: True
'cultu