# CNN Implementation for Text Classification
A Convolutional Neural Network (CNN) is generally used for image classification, where convolution filters slide across the pixel matrix to pick up local patterns.

We were unable to find an image dataset related to the entertainment dataset we chose, so this notebook applies a CNN to text classification instead.

## Step 1 - Data Cleaning
We chose the gaming dataset and classify its reviews as positive or negative.

Understanding the data

In [3]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras import layers
import numpy as np

data = pd.read_csv('cleaned_game_dataset.csv')
reviews = data['0']
print(reviews.head())

0    The first playthrough of elden ring is one of ...
1    a replay solidified my love for elden ring so ...
2    The game is absolutely beautiful with so much ...
3    Took everything great about the Soulsborne gam...
4    I play with my overlevelled friend every time ...
Name: 0, dtype: object


## Step 2 - Creating Dataset
To train the model we need a dataset with both sentences and labels. The sentences come from cleaning the dataset obtained from Kaggle. To label each sentence as positive or negative, we compare the words it contains against lists of positive and negative words taken from the mentioned GitHub repository.

We use this labelled dataset to train a Logistic Regression (LR) model and a Convolutional Neural Network model, then compare and analyze the results of the two. If required, we also tune the CNN's hyperparameters so that it yields better results than LR.
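
The labelling rule described above can be sketched on a toy example. This is a minimal illustration, not the notebook's exact code: the word sets `POSITIVE` and `NEGATIVE` and the helper `label_review` are hypothetical, and the sketch matches whole words, whereas the cell below uses substring matching (so "good" would also match inside "goodbye").

```python
import re

# Tiny illustrative word sets; the real lists come from the lexicon files.
POSITIVE = {"good", "great", "love"}
NEGATIVE = {"bad", "boring", "lose"}

def label_review(review, positive=POSITIVE, negative=NEGATIVE):
    """Return 1 (positive), 0 (negative), -1 (tie) or None (no lexicon words)."""
    words = re.findall(r"[a-z']+", review.lower())
    pos = sum(word in positive for word in words)
    neg = sum(word in negative for word in words)
    if pos == 0 and neg == 0:
        return None  # nothing to base a label on
    if pos > neg:
        return 1
    if neg > pos:
        return 0
    return -1  # equal counts: left for manual labelling

print(label_review("I love this great game"))        # 1
print(label_review("boring and bad"))                # 0
print(label_review("good start but boring ending"))  # -1
print(label_review("goodbye"))                       # None
```

Whole-word matching is one possible refinement; it would change which reviews get labelled, so the counts reported below apply only to the substring version.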


In [8]:
import csv

# Load the positive and negative word lists (one word per line).
with open('positive-words.txt', 'r') as file:
    positive_keys = set(file.read().splitlines())

with open('negative-words.txt', 'r') as file:
    negative_keys = set(file.read().splitlines())

confused_dataset = []  # reviews with as many positive as negative matches
sorted_dataset = []    # reviews labelled 1 (positive) or 0 (negative)

positive_flag = 0  # number of reviews labelled positive
negative_flag = 0  # number of reviews labelled negative
for review in reviews:
    # Substring matches: a key counts wherever it occurs inside the review text.
    pos_count = sum(review.count(key) for key in positive_keys)
    neg_count = sum(review.count(key) for key in negative_keys)
    if pos_count == 0 and neg_count == 0:
        continue  # no lexicon words found, skip the review
    difference = pos_count - neg_count
    if difference > 0:
        review_flag = 1
        positive_flag += 1
    elif difference < 0:
        review_flag = 0
        negative_flag += 1
    else:
        # Tie: keep the review aside with a placeholder label of -1.
        confused_dataset.append([review, -1])
        continue
    sorted_dataset.append([review, review_flag])

with open('sorted_games.csv', 'w', encoding='UTF-8', newline='') as file:
    writer = csv.writer(file)
    writer.writerows(sorted_dataset)

with open('confused_data.csv', 'w', encoding='UTF-8', newline='') as file:
    writer = csv.writer(file)
    writer.writerows(confused_dataset)

print(sorted_dataset[0])
print(confused_dataset[0])
print(f"positive reviews count: {positive_flag} negative reviews count: {negative_flag}")

['The first playthrough of elden ring is one of the best eperiences gaming can offer you but after youve explored everything in the open world and you ve experienced all of the surprises you lose motivation to go exploring on repeat playthroughs which takes lot away from the replayability which is very important thing for from games imo ', 0]
['People tell me this game gets really really good at some point but ve beaten entire games in the amount of time gave this game ', -1]
positive reviews count: 1278 negative reviews count: 1301


After sorting the dataset, we have a roughly equal number of positive (1278) and negative (1301) reviews. We use these to train and test the LR model against the CNN model.
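
As a quick check on the class balance, the shares of each label can be computed from the two counts printed above. This is only arithmetic on the reported numbers; the variable names are ours.

```python
# Counts taken from the output of the labelling cell.
positive_count = 1278
negative_count = 1301
total = positive_count + negative_count

positive_share = positive_count / total
negative_share = negative_count / total

print(f"total labelled reviews: {total}")
print(f"positive share: {positive_share:.3f}")
print(f"negative share: {negative_share:.3f}")
```

A classifier that always predicts the larger class would score roughly the larger share on data with this mix, which gives a baseline for reading the accuracies reported later.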

We also created a confused dataset, which is sorted manually and then used to test both the LR model and the CNN model.
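
Because the confused reviews are written out with a placeholder label of -1, it can help to verify that every row has been relabelled before scoring a model on them. The sketch below is an assumption about how such a check might look; `load_labelled_rows` is a hypothetical helper, and the in-memory strings stand in for the real CSV file.

```python
import csv
import io

def load_labelled_rows(handle):
    """Read (sentence, label) rows and reject any label other than 0 or 1."""
    rows = []
    for line_number, row in enumerate(csv.reader(handle), start=1):
        sentence, label = row[0], int(row[1])
        if label not in (0, 1):
            raise ValueError(f"row {line_number} still has label {label}")
        rows.append((sentence, label))
    return rows

# Hypothetical relabelled data: every row carries a 0 or a 1.
relabelled = io.StringIO("slow start but worth it,1\nnever clicked for me,0\n")
print(load_labelled_rows(relabelled))

# A row that was not relabelled is rejected instead of silently scored.
unresolved = io.StringIO("hard to say,-1\n")
try:
    load_labelled_rows(unresolved)
except ValueError as error:
    print("rejected:", error)
```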

## Step 3 - Training the Model
We now split the data created above into training and test sets, train a Logistic Regression classifier and a Convolutional Neural Network on it, and compare the results of both.
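
The two models consume different inputs: LR works on word counts, while the CNN works on fixed-length sequences of word indices. The following pure-Python sketch shows the idea behind the second representation. It is a simplified stand-in, not the Keras implementation; the helpers `build_word_index`, `to_sequence` and `pad_post` are hypothetical names.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to a rank (1 = most frequent); 0 is reserved for padding."""
    counts = Counter(word for text in texts for word in text.lower().split())
    ranked = counts.most_common()
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

def to_sequence(text, word_index):
    """Replace known words by their index and drop unknown words."""
    return [word_index[w] for w in text.lower().split() if w in word_index]

def pad_post(sequence, maxlen):
    """Keep the last maxlen items, then pad with zeros at the end."""
    trimmed = sequence[-maxlen:]
    return trimmed + [0] * (maxlen - len(trimmed))

train_texts = ["great game great story", "boring game"]
word_index = build_word_index(train_texts)
print(word_index)  # {'great': 1, 'game': 2, 'story': 3, 'boring': 4}

sequence = to_sequence("great boring sequel", word_index)
print(sequence)               # [1, 4]  ("sequel" was never seen in training)
print(pad_post(sequence, 5))  # [1, 4, 0, 0, 0]
```

Building the index from the training texts only, as here, keeps words that appear solely in the test set from leaking into the vocabulary.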

In [9]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras import layers
import numpy as np

data = pd.read_csv('sorted_games.csv', names=['sentence', 'label'])
test_data = pd.read_csv('confused_data.csv', names=['sentence', 'label'])
#print(data)
review = data['sentence'].values
label = data['label'].values
# split data into test and train
review_train, review_test, label_train, label_test = train_test_split(review, label, test_size=0.30, random_state=2000)

manual_test_sent = test_data['sentence'].values
manual_test_lab = test_data['label'].values

# Bag-of-words features for Logistic Regression; the vocabulary is fit on the training split only
review_vectorizer = CountVectorizer()
review_vectorizer.fit(review_train)
Xlr_train = review_vectorizer.transform(review_train)
Xlr_test = review_vectorizer.transform(review_test)

LRmodel = LogisticRegression()
LRmodel.fit(Xlr_train, label_train)
score = LRmodel.score(Xlr_test, label_test)
# Score the same model on the manually labelled (confused) reviews
xlr_test_data = review_vectorizer.transform(manual_test_sent)
manual_score = LRmodel.score(xlr_test_data, manual_test_lab)
print("Accuracy:", score)
print("Manual Data Accuracy:", manual_score)

# CNN implementation: turn each review into a padded sequence of word indices

tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(review_train)
Xcnn_train = tokenizer.texts_to_sequences(review_train)
Xcnn_test = tokenizer.texts_to_sequences(review_test)
Xcnn_test_data = tokenizer.texts_to_sequences(manual_test_sent)
vocab_size = len(tokenizer.word_index) + 1
print(review_train[1])
print(Xcnn_train[1])
maxlen = 100
Xcnn_train = pad_sequences(Xcnn_train, padding='post', maxlen=maxlen)
Xcnn_test = pad_sequences(Xcnn_test, padding='post', maxlen=maxlen)
Xcnn_test_data = pad_sequences(Xcnn_test_data, padding='post', maxlen=maxlen)
print(Xcnn_train[0, :])
embedding_dim = 200
textcnnmodel = Sequential()
textcnnmodel.add(layers.Embedding(vocab_size, embedding_dim, input_length=maxlen))
textcnnmodel.add(layers.Conv1D(256, 5, activation='relu'))
textcnnmodel.add(layers.GlobalMaxPooling1D())
textcnnmodel.add(layers.Dense(15, activation='relu'))
textcnnmodel.add(layers.Dense(1, activation='sigmoid'))
textcnnmodel.compile(optimizer='adam',
               loss='binary_crossentropy',
               metrics=['accuracy'])
textcnnmodel.summary()

textcnnmodel.fit(Xcnn_train, label_train,
                     epochs=50,
                     verbose=False,
                     validation_data=(Xcnn_test, label_test),
                     batch_size=20)
loss, accuracy = textcnnmodel.evaluate(Xcnn_train, label_train, verbose=False)
print("Training Accuracy: {:.4f}".format(accuracy))
loss, accuracy = textcnnmodel.evaluate(Xcnn_test, label_test, verbose=False)
print("Testing Accuracy:  {:.4f}".format(accuracy))
loss, accuracy = textcnnmodel.evaluate(Xcnn_test_data, manual_test_lab, verbose=False)
print("Manual Testing Accuracy:  {:.4f}".format(accuracy))

Accuracy: 0.8888888888888888
Manual Data Accuracy: 0.5142857142857142
Unique gameplay encounters and cool setting make this one of the better mgs titles in the series What thrill 
[176, 42, 1078, 2, 107, 379, 115, 8, 30, 3, 1, 81, 2628, 475, 11, 1, 77, 65, 2629]
[ 38 322 211   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0]
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 100, 200)          1520600   
                                                                 
 conv1d (Conv1D)             (None, 96, 256)  

From the above, Logistic Regression scores better than the CNN model on the test set. However, on the manually classified dataset the CNN performs better than LR. This might be due to various reasons, but mainly to the hyperparameter tuning done for the CNN. The results for the different hyperparameters are discussed and shown in the project readme.
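
The shapes in the model summary can be checked by hand from the layer definitions. The sketch below recomputes them; only the embedding parameter count and the `(None, 96, 256)` shape appear in the summary output, so the remaining parameter counts are our own derivation from the standard formulas for `Conv1D` and `Dense` layers with biases, not values copied from the notebook.

```python
# Values taken from the training cell and the model summary.
maxlen = 100
embedding_dim = 200
filters = 256
kernel_size = 5
embedding_params = 1_520_600

# Embedding: one vector of size embedding_dim per vocabulary entry.
vocab_size = embedding_params // embedding_dim

# Conv1D without padding and with stride 1 shortens the sequence by kernel_size - 1.
conv_length = maxlen - kernel_size + 1

# Derived counts (weights plus one bias per output unit).
conv_params = (kernel_size * embedding_dim + 1) * filters
dense_params = filters * 15 + 15
output_params = 15 * 1 + 1

print("vocab_size:", vocab_size)        # 7603
print("conv output length:", conv_length)  # 96
print("conv params:", conv_params)      # 256256
print("dense params:", dense_params)    # 3855
print("output params:", output_params)  # 16
```

Note that the implied vocabulary size (7603) is larger than the `num_words=5000` passed to the tokenizer. As far as we understand, the tokenizer only emits indices below `num_words`, so some embedding rows would never be looked up; passing the smaller value to the embedding layer is one option to consider.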