# CNN Implementation for Text Classification

A Convolutional Neural Network (CNN) is generally used for image classification, where its filters scan across every region and dimension of the pixel matrix.

We were unable to find an image dataset related to the entertainment dataset we have chosen, so we apply the CNN to text classification instead.

## Step 1 - Data Cleaning

We have chosen the gaming dataset and classify the reviews it provides as positive or negative.

1. Understanding the data

In [1]:
import pandas as pd
import requests

def read_games_data():
    games = pd.read_csv('https://raw.githubusercontent.com/dmml-heriot-watt/group-coursework-ha/main/data/games.csv?token=GHSAT0AAAAAACHSRUHFAUTZK4ITZJBLXAHUZI5INFQ')
    print(games)
    return games

In [2]:
games_data = read_games_data()

      Unnamed: 0                                    Title  Release Date  \
0              0                               Elden Ring  Feb 25, 2022   
1              1                                    Hades  Dec 10, 2019   
2              2  The Legend of Zelda: Breath of the Wild  Mar 03, 2017   
3              3                                Undertale  Sep 15, 2015   
4              4                            Hollow Knight  Feb 24, 2017   
...          ...                                      ...           ...   
1507        1507             Back to the Future: The Game  Dec 22, 2010   
1508        1508                        Team Sonic Racing  May 21, 2019   
1509        1509                           Dragon's Dogma  May 22, 2012   
1510        1510                          Baldur's Gate 3  Oct 06, 2020   
1511        1511                 The LEGO Movie Videogame  Feb 04, 2014   

                                                   Team  Rating Times Listed  \
0        ['Bandai N

2. Select only the required column, which is the reviews, and inspect the data structure of its values

In [3]:
reviews = games_data['Reviews']
print(reviews)

0       ["The first playthrough of elden ring is one o...
1       ['convinced this is a roguelike for people who...
2       ['This game is the game (that is not CS:GO) th...
3       ['soundtrack is tied for #1 with nier automata...
4       ["this games worldbuilding is incredible, with...
                              ...                        
1507    ['Very enjoyable game. The story adds onto the...
1508    ['jogo morto mas bom', 'not my cup of tea', "C...
1509    ['Underrated.', 'A grandes rasgos, es como un ...
1510    ['Bu türe bu oyunla girmeye çalışmak hataydı s...
1511    ['Legal', 'Pretty Average Lego Game But It Was...
Name: Reviews, Length: 1512, dtype: object


3. Each value in the Reviews column is a string holding a list of multiple reviews for a single game.
4. Cleaning the data includes the following:
    - Separating each value into individual reviews
    - Removing single characters, which don't add meaning
    - Removing special characters
    - Removing new lines and multiple spaces
    - Removing single-word reviews for a better dataset

In [4]:
import re

cleaned_reviews = []

for review in reviews:
    texts = review[1:]
    texts = texts[:-1]
    if(len(texts) == 0):
        break
    line_present = True
    while(line_present):
        try:
            current_review = (texts[1:texts.find(texts[0], texts.find(texts) + 1)]).strip() # removing the quotes
            original_texts = current_review
            if(len(current_review) > 0):
                current_review = re.sub(r'\W', ' ', str(current_review)) # replacing special (non-word) characters with spaces
                current_review = re.sub(r'\s+[a-zA-Z]\s+', ' ', current_review) # removing single letters surrounded by whitespace
                current_review = re.sub(r'\^[a-zA-Z]\s+', ' ', current_review) # removing a single letter after a literal '^' (\^ is escaped, so it does not anchor to the start)
                if(len(current_review.strip()) == 0):
                    # if the current sentence length is zero skip and move to next sentence if present
                    remaining_texts = texts.split(original_texts, 1)
                    if (len(remaining_texts) > 1):
                        texts = remaining_texts[1][2:]
                        texts = texts.strip()
                        if (len(texts) == 0):
                            line_present = False
                    continue
                current_review = current_review.replace("\\n", ' ') # removing new lines
                current_review = re.sub(r"\s+", " ", current_review) # removing multiple spaces
                cleaned_reviews.append(current_review)
                remaining_texts = texts.split(original_texts, 1)
                if(len(remaining_texts) > 1):
                    texts = remaining_texts[1][2:]
                    texts = texts.strip()
                    if(len(texts) == 0):
                        line_present = False
                else:
                    line_present = False
            else:
                line_present = False
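        except Exception:
            # print the text that could not be parsed and stop, so a failure cannot loop forever
            print(texts)
            line_present = False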

print(len(cleaned_reviews))

4149


5. After cleaning the dataset, we have 4149 reviews that can be used for training the CNN.
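The cleaning steps listed above can also be sketched more compactly. This is a minimal alternative sketch, not the code used in this notebook: it assumes each cell is a valid Python list literal and parses it with `ast.literal_eval` instead of slicing on quote characters. The helper names `clean_review` and `split_and_clean` and the sample cell are hypothetical, and the resulting review count may differ from the one reported above.

```python
import ast
import re

def clean_review(text):
    """Apply the cleaning steps to one review; return None if nothing usable remains."""
    text = re.sub(r"\W", " ", text)             # special characters (and new lines) -> spaces
    text = re.sub(r"\b[a-zA-Z]\b", " ", text)   # stand-alone single letters
    text = re.sub(r"\s+", " ", text).strip()    # collapse multiple spaces
    # drop empty and single-word reviews
    return text if len(text.split()) > 1 else None

def split_and_clean(raw_cell):
    """Parse one cell (assumed to be a list literal) and clean each review in it."""
    try:
        items = ast.literal_eval(raw_cell)
    except (ValueError, SyntaxError):
        return []                                # cell is not a valid list literal
    if not isinstance(items, (list, tuple)):
        return []
    cleaned = (clean_review(str(item)) for item in items)
    return [c for c in cleaned if c is not None]

# Hypothetical cell in the same shape as the Reviews column
sample_cell = "['Underrated.', 'not my cup of tea', \"Can't stop playing, 10/10\"]"
print(split_and_clean(sample_cell))
# ['not my cup of tea', 'Can stop playing 10 10']
```

Parsing the list literal avoids tracking quote positions by hand, at the cost of returning nothing for any cell that is not a well-formed list.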

## Step 2 - Classifying the data into Positive and Negative

In order to classify the data into positive and negative, we match the words present in the reviews against lists of positive and negative words.

The lists are taken from public GitHub gists, as mentioned below.

> Negative Words Source - https://gist.github.com/mkulakowski2/4289441
> Positive Words Source - https://gist.github.com/mkulakowski2/4289437

We use the following logic to classify:
1. If the sentence contains only positive words, we assign the value 1.
2. If the sentence contains only negative words, we assign the value 0.
3. If the sentence contains both positive and negative words, we count each kind and assign the value of whichever occurs more often.
4. If the sentence contains an equal number of positive and negative words, we skip the sentence.
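The four rules above can be sketched as a single comparison of counts. This is a minimal illustration, not the notebook's own implementation: the function name `classify_review` and the tiny word sets are hypothetical stand-ins for the full lists, and it assumes that a review with no matching words at all is skipped too, since its counts are equal at zero.

```python
def classify_review(review, positive_words, negative_words):
    """Return 1 (positive), 0 (negative), or None when the review should be skipped."""
    words = review.lower().split()
    positive_count = sum(1 for word in words if word in positive_words)
    negative_count = sum(1 for word in words if word in negative_words)
    if positive_count > negative_count:
        return 1      # covers rule 1 and the positive majority in rule 3
    if negative_count > positive_count:
        return 0      # covers rule 2 and the negative majority in rule 3
    return None       # rule 4: equal counts (including zero matches, an assumption)

# Hypothetical miniature word lists, for illustration only
positive_words = {"good", "great", "enjoyable"}
negative_words = {"bad", "boring", "broken"}

print(classify_review("Very enjoyable game", positive_words, negative_words))
print(classify_review("boring and broken but good music", positive_words, negative_words))
print(classify_review("good story bad controls", positive_words, negative_words))
```

With these sample sets the three calls give 1, 0 and None respectively. Comparing the two counts handles the "only positive" and "only negative" cases without separate branches, because the missing kind simply counts as zero.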

In [11]:
import urllib.request

def get_negative_keys():
    negative_list = []
    negative_text = urllib.request.urlopen('https://gist.githubusercontent.com/mkulakowski2/4289441/raw/dad8b64b307cd6df8068a379079becbb3f91101a/negative-words.txt')
    for line in negative_text:
        decoded_line = line.decode("utf-8")
        # skip lines with more than one word, keeping only single-word entries
        if(len(decoded_line.split()) > 1):
            continue
        # skip blank lines
        if(len(decoded_line.strip()) == 0):
            continue
        # skip lines containing ';'
        if(';' in decoded_line):
            continue
        negative_list.append(decoded_line.strip())
    # return a set so that duplicates are dropped and lookups are fast
    return set(negative_list)



