I chose to increase the accuracy of the K-Nearest Neighbours (KNN) algorithm on the validation set.

I began by repeating the original first steps of loading the IMDB dataset and printing the last instance.


In [4]:
from datasets import load_dataset

imdb_dataset = load_dataset("imdb")['train']
print(imdb_dataset[-1])

{'text': 'The story centers around Barry McKenzie who must go to England if he wishes to claim his inheritance. Being about the grossest Aussie shearer ever to set foot outside this great Nation of ours there is something of a culture clash and much fun and games ensue. The songs of Barry McKenzie(Barry Crocker) are highlights.', 'label': 1}


Once again, I converted the data to a list of input texts (the raw reviews) and a list of associated output labels.

In [5]:
train_data = []
train_data_labels = []
for item in imdb_dataset:
  train_data.append(item['text'])
  train_data_labels.append(item['label'])
print(train_data[-1])
print(train_data_labels[-1])

The story centers around Barry McKenzie who must go to England if he wishes to claim his inheritance. Being about the grossest Aussie shearer ever to set foot outside this great Nation of ours there is something of a culture clash and much fun and games ensue. The songs of Barry McKenzie(Barry Crocker) are highlights.
1


The accuracy of the KNN model from part two was **0.6092**.

To improve this figure I experimented with a variety of additional parameters of scikit-learn's `CountVectorizer` class.

Unfortunately, for this particular model, some of these additions and alterations did not improve the accuracy score.

Here are the changes I tried to implement that did not improve the performance of the model:

**Adding n-grams:** `ngram_range=(1, 2)`

Using only individual words can miss sentiment that is expressed through combinations of words (e.g. "good" vs "not good"). I attempted to capture this added context by including two-word expressions (bigrams) in the feature list. For this particular data, it is likely that adding bigrams introduced uninformative word combinations, such as "the movie", that carry little to no sentiment and are therefore not useful for classification.
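As a rough illustration of what `ngram_range=(1, 2)` adds to the feature list, the plain-Python sketch below extracts unigrams and bigrams from one sentence. The helper `word_ngrams` is hypothetical and uses simple whitespace splitting, so it is only an approximation of how `CountVectorizer` tokenises text.

```python
def word_ngrams(text, n_min=1, n_max=2):
    """Return all word n-grams of length n_min..n_max (hypothetical helper)."""
    tokens = text.lower().split()  # simplification: split on whitespace only
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


unigrams = word_ngrams("the film was not good", 1, 1)
with_bigrams = word_ngrams("the film was not good", 1, 2)

print(unigrams)
print(with_bigrams)
```

The bigram list contains the useful feature "not good", but also low-value pairs such as "the film", which matches the trade-off described above.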

**Changing the number of features:** `max_features=1000` and `max_features=250`

I tinkered with the number of most common words to include in the feature list. I tried increasing it to 1000 to include more distinctive words, but this did not improve the accuracy. I also tried decreasing it to 250 in case the original figure of 500 was too large and was overfitting the data. This also lowered the accuracy, suggesting that 500 was the best of the three values I tried for this model and dataset.
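To make the effect of `max_features` concrete, here is a minimal plain-Python sketch that keeps only the `k` most frequent words of a toy corpus, which is the idea behind the parameter (the vocabulary is ordered by term frequency across the corpus). The function `top_k_vocabulary` and the three toy reviews are assumptions made up for illustration.

```python
from collections import Counter


def top_k_vocabulary(docs, k):
    """Keep the k most frequent words across all documents (hypothetical helper)."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.lower().split())
    return [word for word, _ in counts.most_common(k)]


# Toy corpus: great=4, film=3, dull=2, plot=1
docs = [
    "great film great plot",
    "great film dull",
    "great film dull",
]

small_vocab = top_k_vocabulary(docs, 2)
larger_vocab = top_k_vocabulary(docs, 3)
print(small_vocab)   # ['great', 'film']
print(larger_vocab)  # ['great', 'film', 'dull']
```

With a small `k`, a sentiment word like "dull" is dropped. A larger `k` keeps it, but on real data it also admits many rarer words that may only add noise.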

The parameter change that had the most positive influence on predictive accuracy was:

**Removing Stop Words:** `stop_words='english'`

Scikit-learn provides a stop list of common English words (e.g. "the", "this", "and") that are typically considered irrelevant for text classification because they carry no sentiment. I used this argument to filter these words out of the data so that they were not included in the feature list. This meant that the 500 feature slots went to words that carry more sentiment information, allowing the model to find more meaningful patterns and ultimately increasing the overall accuracy score.
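The filtering step can be sketched in a few lines of plain Python. The stop list below is a tiny hypothetical one chosen for this example. Scikit-learn's built-in English list is much longer, and its tokenisation differs from the simple whitespace split used here.

```python
# Hypothetical mini stop list for illustration only (not scikit-learn's list)
STOP_WORDS = {"the", "this", "and", "was", "a", "is"}


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Lowercase the text, split on whitespace and drop stop words."""
    return [tok for tok in text.lower().split() if tok not in stop_words]


tokens = remove_stop_words("This film was dull and the ending is awful")
print(tokens)  # ['film', 'dull', 'ending', 'awful']
```

Only the content-bearing words remain, so they are the ones competing for a place in the limited feature list.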

After testing the model with various combinations and values of these CountVectorizer parameters, this was the version of the classifier that returned the highest accuracy:

In [7]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

vectorizer = CountVectorizer(analyzer='word',max_features=500,lowercase=True,stop_words='english')
features = vectorizer.fit_transform(train_data).toarray()
X_train, X_val, y_train, y_val = train_test_split(features,train_data_labels,train_size=0.9,random_state=123)
model = KNeighborsClassifier(n_neighbors=3)
model = model.fit(X=X_train,y=y_train)
y_pred = model.predict(X_val)

print(accuracy_score(y_val,y_pred))

0.622


As you can see, the improvements to the CountVectorizer parameters only marginally increased the accuracy from 0.6092 to 0.622 (+0.0128).
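To put the size of this gain in context, the short calculation below converts it into a relative improvement and a number of reviews. It assumes the IMDB training split holds 25,000 reviews, so the 10% validation set contains 2,500 of them.

```python
# Assumption: 25,000 training reviews, so train_size=0.9 leaves 2,500 for validation
n_val = 2500
baseline_acc = 0.6092  # CountVectorizer model from part two
tuned_acc = 0.622      # CountVectorizer with stop words removed

absolute_gain = round(tuned_acc - baseline_acc, 4)
relative_gain = round(absolute_gain / baseline_acc * 100, 1)
extra_correct = round(tuned_acc * n_val) - round(baseline_acc * n_val)

print(absolute_gain)   # 0.0128
print(relative_gain)   # 2.1 (percent)
print(extra_correct)   # 32
```

Under that assumption, the tuned vectorizer classifies only 32 more validation reviews correctly than the baseline.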

However, by using TfidfVectorizer as an alternative to CountVectorizer, I was able to significantly enhance the model. 

I chose TfidfVectorizer as it adjusts the weight of words based on how many reviews they appear in throughout the dataset. By reducing the weight of common words such as "the", distinctive words with more sentiment have more influence. 
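The weighting idea can be shown with a small plain-Python sketch of the inverse document frequency (IDF) term. It assumes the smoothed formula that scikit-learn documents as its default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, and leaves out the term-frequency factor and the final row normalisation. The helper `smoothed_idf` and the toy reviews are hypothetical.

```python
import math


def smoothed_idf(docs):
    """Smoothed IDF per term: ln((1 + n_docs) / (1 + doc_freq)) + 1."""
    n_docs = len(docs)
    doc_freq = {}
    for doc in docs:
        for term in set(doc.lower().split()):  # count each term once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}


reviews = ["the plot was awful", "the cast was superb", "the film was fine"]
idf = smoothed_idf(reviews)

print(idf["the"])    # appears in every review, so it gets the minimum weight
print(idf["awful"])  # appears in one review, so it gets a higher weight
```

A word that appears in every review receives the lowest possible weight, while a word confined to a few reviews is weighted up.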

Once again, I experimented with the stop word, n-gram and maximum feature parameters to maximise the algorithm's accuracy. As expected, the stop word filtering was very beneficial, but this time increasing the maximum number of features to 1000 also increased the accuracy score.




The final area I looked at to improve the model was altering the settings of the specific algorithm, in this case KNN.

I varied K from its original value of 3 to see whether a different value would increase the accuracy.

A value of K that is too small makes predictions sensitive to noise in the dataset, while a value that is too large can smooth over local patterns.

In the end, I found 5 was the value for K that gave me the highest accuracy score.
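The trade-off in choosing K can be seen on a toy example. The sketch below is a minimal plain-Python majority-vote KNN, not scikit-learn's `KNeighborsClassifier`, and the points and labels are invented for illustration: two positive points sit next to the query, one mislabelled point sits closer still, and four negative points lie far away.

```python
from collections import Counter
import math


def knn_predict(train_points, train_labels, query, k):
    """Predict the majority label among the k nearest points (Euclidean distance)."""
    order = sorted(range(len(train_points)),
                   key=lambda i: math.dist(train_points[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Hypothetical 2-D data: a positive cluster near (1, 1) containing one noisy
# negative point, plus a distant negative cluster near (3, 3)
points = [(1.0, 1.0), (1.2, 0.9), (1.1, 1.1),
          (3.0, 3.0), (3.1, 2.8), (2.9, 3.2), (3.2, 3.1)]
labels = [1, 1, 0, 0, 0, 0, 0]
query = (1.1, 1.05)

pred_k1 = knn_predict(points, labels, query, k=1)  # follows the single noisy neighbour
pred_k3 = knn_predict(points, labels, query, k=3)  # the local majority wins
pred_k7 = knn_predict(points, labels, query, k=7)  # the distant cluster outvotes the local one
print(pred_k1, pred_k3, pred_k7)  # 0 1 0
```

Here K=1 is misled by the noisy point and K=7 ignores the local cluster, while an intermediate K recovers the local pattern. On real data the best value has to be found empirically on the validation set, as was done above.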

The final version of the model that gave me the highest predictive accuracy over the validation set was:

In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

vectorizer = TfidfVectorizer(analyzer='word',max_features=1000,lowercase=True,stop_words='english')
features = vectorizer.fit_transform(train_data).toarray()
X_train, X_val, y_train, y_val = train_test_split(features,train_data_labels,train_size=0.9,random_state=123)
model = KNeighborsClassifier(n_neighbors=5)
model = model.fit(X=X_train,y=y_train)
y_pred = model.predict(X_val)

print(accuracy_score(y_val,y_pred))

0.7056


The model has been substantially improved from an accuracy score of 0.6092 to 0.7056 (+0.0964).

To apply the model to the test data, I loaded the IMDB dataset again, this time selecting the "test" split instead of the "train" split.

In [21]:
imdb_test = load_dataset("imdb")['test']
print(imdb_test[-1])
print(len(imdb_test))



{'text': 'I caught this movie on the Sci-Fi channel recently. It actually turned out to be pretty decent as far as B-list horror/suspense films go. Two guys (one naive and one loud mouthed a**) take a road trip to stop a wedding but have the worst possible luck when a maniac in a freaky, make-shift tank/truck hybrid decides to play cat-and-mouse with them. Things are further complicated when they pick up a ridiculously whorish hitchhiker. What makes this film unique is that the combination of comedy and terror actually work in this movie, unlike so many others. The two guys are likable enough and there are some good chase/suspense scenes. Nice pacing and comic timing make this movie more than passable for the horror/slasher buff. Definitely worth checking out.', 'label': 1}
25000


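I then converted the test set into a list of review texts and a list of corresponding labels, in the same way as for the training set. The texts are turned into feature vectors in the final step, using the vectorizer already fitted on the training data.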

In [17]:
test_data = []
test_data_labels = []
for item in imdb_test:
  test_data.append(item['text'])
  test_data_labels.append(item['label'])
print(test_data[-1])
print(test_data_labels[-1])

I caught this movie on the Sci-Fi channel recently. It actually turned out to be pretty decent as far as B-list horror/suspense films go. Two guys (one naive and one loud mouthed a**) take a road trip to stop a wedding but have the worst possible luck when a maniac in a freaky, make-shift tank/truck hybrid decides to play cat-and-mouse with them. Things are further complicated when they pick up a ridiculously whorish hitchhiker. What makes this film unique is that the combination of comedy and terror actually work in this movie, unlike so many others. The two guys are likable enough and there are some good chase/suspense scenes. Nice pacing and comic timing make this movie more than passable for the horror/slasher buff. Definitely worth checking out.
1


Finally, I applied the model to the test data and printed the final accuracy score.

In [23]:
# Reuse the vectorizer fitted on the training data: transform only, never refit on the test set
test_pred = model.predict(vectorizer.transform(test_data).toarray())
print(accuracy_score(test_data_labels, test_pred))

0.68888


The final accuracy score was 0.68888. 

Although this score is far from perfect, it is considerably better than classifying the reviews at random, which would give an accuracy score of ~0.5 on a dataset with balanced classes. This shows that the model has learned genuine sentiment signal from the review text.
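The ~0.5 baseline can be checked with a quick simulation. The sketch below assumes the test set is balanced, with 12,500 negative and 12,500 positive reviews, and uses an arbitrary seed. It compares random guesses against that label distribution and does not touch the real data.

```python
import random

random.seed(0)  # arbitrary seed so the simulation is repeatable

# Assumption: a balanced test set of 12,500 negative and 12,500 positive reviews
true_labels = [0] * 12500 + [1] * 12500
random_preds = [random.randint(0, 1) for _ in true_labels]

random_acc = sum(p == t for p, t in zip(random_preds, true_labels)) / len(true_labels)
model_acc = 0.68888  # test accuracy reported above

print(round(random_acc, 3))
print(round(model_acc - random_acc, 3))
```

The simulated random classifier lands very close to 0.5, so the model's margin over chance is close to 0.19.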

The model has also been considerably refined to increase its performance. The accuracy score has been improved from 0.6092 originally on the validation set to 0.68888 on the final test data. Although these two figures are measured on different sets, this is an increase of roughly 0.08 (0.0797), achieved through sensibly selecting the vectorizer class, its parameters and the KNN settings.