# Introduction

Hello everyone, welcome to this kernel! In this kernel I am going to predict the scores of user reviews. I am going to use Keras as the deep learning library. Before starting, let's look at the contents.

# Content
1. Importing Libraries and The Data
1. Data Overview
1. Label Processing
    * Splitting X and Y
    * Preparing Label Classes
1. Natural Language Processing
    * Tokenizing
    * Choosing The Size Of Tokens
    * Padding
    * Train Test Split
1. Training Deep Learning (RNN) Model
    * Building GRU Model
    * Fitting GRU Model
    * Predicting and Evaluating Results
1. Conclusion

So let's start.

# Importing Libraries and The Data

In this section I am going to import the libraries and the data that I will use. The dataset is stored as a Parquet file, and we can read .parquet files using pandas' read_parquet() function.

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

"""
Natural Language Processing
"""
from tensorflow.python.keras.preprocessing.text import Tokenizer
from tensorflow.python.keras.preprocessing.sequence import pad_sequences
from nltk.corpus import stopwords

"""
Deep Learning - Keras
"""

from tensorflow.python.keras.models import Sequential
from tensorflow.python.keras.layers import Embedding,Dense,CuDNNGRU

"""
Other Tools
"""
from sklearn.model_selection import train_test_split


# Tokenizer pad sequence naive bayes svm lr rf deep learning

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/valorant-metacritic-reviews/user_reviews.parquet


In [2]:
# Reading data
data = pd.read_parquet('/kaggle/input/valorant-metacritic-reviews/user_reviews.parquet')
data.head()

Unnamed: 0,username,review_type,published_date,score,votes,review_text,profile_url
0,Xalencelph,user,"Jun 11, 2020",2,0,A good game to play if you like CS GO.\r<br/>E...,https://www.metacritic.com/user/Xalencelph
1,DrPiipocO,user,"Jun 11, 2020",9,0,Muito bom porém não gostei muito dos mapas poi...,https://www.metacritic.com/user/DrPiipocO
2,Mirzahan,user,"Jun 11, 2020",2,0,I was expecting much more at the begin because...,https://www.metacritic.com/user/Mirzahan
3,Uncleho,user,"Jun 11, 2020",8,1,Good FPS game and i really like the cartoon gr...,https://www.metacritic.com/user/Uncleho
4,rpzdylilim,user,"Jun 11, 2020",2,1,The game is a bastardized version of CSGO. Ins...,https://www.metacritic.com/user/rpzdylilim


# Data Overview

In this section I am going to examine the details of the dataset.

In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1009 entries, 0 to 1008
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   username        1009 non-null   object
 1   review_type     1009 non-null   object
 2   published_date  1009 non-null   object
 3   score           1009 non-null   object
 4   votes           1009 non-null   object
 5   review_text     1009 non-null   object
 6   profile_url     1009 non-null   object
dtypes: object(7)
memory usage: 55.3+ KB


* There are 7 columns in the dataset, and all of them are stored as strings (object). I will only use the review text and our label (score).
* There are 1009 rows in the dataset. This is small enough to handle with a simple GRU.

In [4]:
data["score"].value_counts()

10    285
0     264
1      73
9      68
2      64
8      63
4      55
5      45
3      44
7      28
6      20
Name: score, dtype: int64

* Scores run from 0 to 10, so there are 11 classes in the dataset.
* I am going to convert them into two classes, 0 and 1.
* If a score is lower than 5 it becomes 0, and if it is higher than 5 it becomes 1.
* If a score is exactly 5, I am going to drop that review.


# Label Processing
In this section I am going to prepare our label so that we can use it for deep learning. I will follow these two steps:

* Splitting X and Y
* Converting Labels Into 1 And 0

In [5]:
# Splitting X and Y

data = data[data["score"] != "5"] 
# Rows with a score of 5 are dropped here. The comparison uses the string "5" because scores are stored as strings in the data.


x = data.review_text.values
y = data.score.values

print(x.shape)
print(y.shape)

(964,)
(964,)


In [6]:
# Converting labels into 0 and 1

new_y = []

for score in y:
    
    if int(score) < 5:
        
        new_y.append(0)
        
    elif int(score) > 5:
        new_y.append(1)
        

fin_y = np.array(new_y)

print(fin_y.shape)
print(x.shape)
print(y[:5])
print(fin_y[:5])

(964,)
(964,)
['2' '9' '2' '8' '2']
[0 1 0 1 0]
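
The same conversion can also be written without an explicit loop. The following is only a minimal sketch on a small hypothetical sample (`raw_scores` is made up for illustration), and it assumes that rows with a score of exactly 5 have already been dropped:

```python
import numpy as np

# Hypothetical sample of raw scores, stored as strings like in the dataset
raw_scores = np.array(["2", "9", "2", "8", "2", "0", "10"])

# Cast to integers, then mark scores above 5 as 1 and everything else as 0.
# Assumption: no score is exactly 5, because those rows were removed earlier.
scores_int = raw_scores.astype(int)
binary_labels = np.where(scores_int > 5, 1, 0)

print(binary_labels)
```

On the notebook's data, `raw_scores` would simply be `y`.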


# Natural Language Processing

In this section I am going to prepare the texts. I will follow these three steps:

* Tokenization
* Choosing The Size Of Tokens
* Padding


## Tokenization

In this section I am going to tokenize the texts.

### Little Knowledge About Tokenization
In machine learning we need numerical features to train any model, but our texts are strings. So we must convert our strings into numbers. But how?

At this point, we can use tokenization. In tokenization we label each distinct word with a number. Let me give an example.

* You will be great father and you will be a real hero

If we tokenize this sentence, the output will look like this:
* [1 2 3 4 5 6 1 2 3 7 8 9]
* It looks meaningless on its own, but it makes sense once you know the word list behind it:

                 1 You
                 2 Will
                 3 Be
                 4 Great
                 5 Father
                 6 And
                 7 A
                 8 Real
                 9 Hero

I hope this is clear. Tokenization is one of the most important steps in NLP.
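
To make the idea concrete, here is a minimal pure-Python sketch that reproduces the token list from the example sentence. The helper `build_word_index` is hypothetical and numbers words in order of first appearance; the Keras `Tokenizer` used later in this kernel numbers words by frequency instead, so its indices will generally differ.

```python
def build_word_index(sentence):
    """Assign each distinct word an integer, in order of first appearance."""
    word_index = {}
    for word in sentence.lower().split():
        if word not in word_index:
            # Start counting at 1 so that 0 stays free for padding
            word_index[word] = len(word_index) + 1
    return word_index


sentence = "You will be great father and you will be a real hero"
word_index = build_word_index(sentence)

# Replace every word with its number
tokens = [word_index[word] for word in sentence.lower().split()]

print(word_index)
print(tokens)
```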

Before tokenizing, I will clean the texts and drop stopwords. As you may remember, I imported stopwords from nltk.



In [7]:
import re
import nltk

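# Use a separate name so the imported nltk stopwords module is not overwritten
stop_words = set(stopwords.words('english'))
clean_text = []
pattern = "[^a-zA-Z0-9]"  # anything that is not a letter or a digit

for text in x:
    text = re.sub(pattern, " ", text)   # replace punctuation and symbols with spaces
    text = text.lower()
    text = nltk.word_tokenize(text)     # split the text into words
    text = [word for word in text if word not in stop_words]
    text = " ".join(text)
    clean_text.append(text)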
    

# showing some random samples

print(clean_text[32])
print("\n\n")
print(clean_text[66])
    

played csgo 5 years game feels refreshing compared game everyone still recommend anyone likes competitive fps games



great game alot potential reason give 10 even tho still balancing game perfecting something keep cheaters way mind evasive anti cheat running computer br second reason listen community work maintain unlike valve takes players br br big con prizes skins take consideration people economic problems fewer resources options people want skins spend 10 20 euros


In [8]:
clean_text = np.array(clean_text)
print(type(clean_text))

<class 'numpy.ndarray'>


* Now we can tokenize our dataset.

In [9]:
num_words = 5000 # Only consider most used 5000 words
tokenizer = Tokenizer(num_words = num_words)

tokenizer.fit_on_texts(clean_text)

tokenizer.word_index

{'game': 1,
 'br': 2,
 'like': 3,
 'cs': 4,
 'play': 5,
 'valorant': 6,
 'games': 7,
 'good': 8,
 'go': 9,
 'csgo': 10,
 'overwatch': 11,
 'riot': 12,
 'abilities': 13,
 'people': 14,
 'fun': 15,
 'really': 16,
 'even': 17,
 'feel': 18,
 'gameplay': 19,
 'maps': 20,
 'great': 21,
 'get': 22,
 'one': 23,
 'time': 24,
 '10': 25,
 'much': 26,
 'fps': 27,
 'graphics': 28,
 'also': 29,
 'bad': 30,
 'anti': 31,
 'cheat': 32,
 'feels': 33,
 'playing': 34,
 'new': 35,
 'better': 36,
 'beta': 37,
 'map': 38,
 'played': 39,
 'make': 40,
 'think': 41,
 'players': 42,
 'would': 43,
 'well': 44,
 'boring': 45,
 'characters': 46,
 'many': 47,
 'still': 48,
 'competitive': 49,
 'shooter': 50,
 'way': 51,
 'lot': 52,
 'want': 53,
 'slow': 54,
 'made': 55,
 'say': 56,
 'every': 57,
 '1': 58,
 '2': 59,
 'different': 60,
 'give': 61,
 'skins': 62,
 'design': 63,
 'first': 64,
 'see': 65,
 'something': 66,
 'gun': 67,
 'makes': 68,
 'know': 69,
 'issues': 70,
 'things': 71,
 'player': 72,
 'pretty': 73,
 

* Our tokenizer is ready, but we have not converted our x yet.

In [10]:
x_token = tokenizer.texts_to_sequences(clean_text)

print(x_token[321]) # Checking random sample
print(type(x_token))

[58, 960, 2, 59, 2]
<class 'list'>


* We converted our texts into tokens but we still have a problem. Let's discover the problem together.

In [11]:
print("Len of 321th entry is ",len(x_token[321]))
print("Len of 231th entry is ",len(x_token[231]))
print("Len of 450th entry is",len(x_token[450]))

Len of 321th entry is  5
Len of 231th entry is  12
Len of 450th entry is 16


* As we can see, each entry can have a different length, but the model expects every input to have the same length.
* To solve this problem, we will use padding.

### Little Knowledge About Padding
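I am going to explain padding using an example:
* Let's assume that we have tokenized sentences like these:
    * [1,4,3,7],
    * [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
* We want every sequence to have length 10.
* If we add six zeros to the first list, it will have length 10.
* And if we drop six tokens from the second list, it will have length 10 too. Note that tokens are dropped by position, not by how rarely they are used. By default, Keras adds the zeros at the front and drops tokens from the front.
* That is padding.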
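
The two cases above can be sketched with plain NumPy. The helper `pad_or_truncate` is hypothetical and only imitates the default behaviour of Keras' `pad_sequences` (zeros added at the front, extra tokens dropped from the front) for a single sequence; it is not the library's implementation.

```python
import numpy as np


def pad_or_truncate(tokens, maxlen):
    """Return a length-`maxlen` array: zeros in front, or leading tokens dropped."""
    tokens = list(tokens)[-maxlen:]        # keep only the last `maxlen` tokens
    padded = np.zeros(maxlen, dtype=int)
    if tokens:
        padded[-len(tokens):] = tokens     # right-align the tokens
    return padded


short = pad_or_truncate([1, 4, 3, 7], maxlen=10)
long_ = pad_or_truncate(range(1, 17), maxlen=10)

print(short)
print(long_)
```

The short list gets six leading zeros, and the long list keeps its last ten tokens.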



## Choosing The Size Of Tokens

In the previous section I explained why we should use padding. In this section I am going to determine the length that every token sequence will be padded to. I will check some values in order to do this.

In [12]:
len_of_tokens = [len(tokens) for tokens in x_token]
print("There are {} tokenized texts in our dataset".format(len(len_of_tokens)))


There are 964 tokenized texts in our dataset


In [13]:
len_tokens = np.array(len_of_tokens)

len_tokens.mean()

46.29356846473029

* The mean length is about 46 tokens. But we should choose a bigger value, so that most texts fit without being cut. Let's examine what would happen if we choose 60 as the size.

In [14]:
def padding_determiner(word_len):
    count = 0
    for len_ in len_tokens:
    
        if len_<word_len:
        
            count+=1
        
    print("%",count*100 / 964, " of the texts containts words less than {} ".format(word_len),sep = "")

padding_determiner(60)

%77.80082987551867 of the texts containts words less than 60 


77% is a bit low. Let's try 200.

In [15]:
padding_determiner(200)

%96.57676348547717 of the texts containts words less than 200 


* About 97% of the texts are shorter than 200 tokens, so we can use 200 as our size. Let's move on to padding!
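
The same check can be run for several candidate sizes at once. This is only a sketch: `lengths` is a small hypothetical array standing in for `len_tokens`, and `coverage` is a made-up helper name.

```python
import numpy as np

# Hypothetical token counts per review; in this kernel it would be len_tokens
lengths = np.array([5, 12, 16, 40, 58, 75, 120, 210, 33, 9])


def coverage(lengths, size):
    """Percentage of texts that are shorter than `size` tokens."""
    return 100 * np.mean(lengths < size)


# Compare a few candidate sizes side by side
for size in (60, 100, 200):
    print(size, coverage(lengths, size))
```

A larger size keeps more texts intact, but it also means more zeros for the model to process, so the choice is a trade-off.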

## Padding

In [16]:
maxlen = 200
x_pad = pad_sequences(x_token,maxlen=maxlen)

print(x_pad[321])



[  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0  58 960   2
  59   2]


* Our data is ready, now let's split our data into train and test.

## Train Test Split

Our data is prepared and we can use it to train our model. But if we use all of it for training, how do we test our model? To solve this problem, we will randomly split our data into a training set and a test set.

In [17]:
# Let's recall the shape of our y
y.shape

(964,)

In [18]:
x_train,x_test,y_train,y_test = train_test_split(np.array(x_pad),fin_y,test_size=0.2,random_state=1)
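
To show what a random split does conceptually, here is a small NumPy sketch. `random_split` is a hypothetical helper, not scikit-learn's implementation, and it will not reproduce the exact rows that `train_test_split` selects; it only illustrates shuffling the row order and holding out a fraction for testing.

```python
import numpy as np


def random_split(features, labels, test_size=0.2, seed=1):
    """Shuffle the row indices, then hold out a fraction of rows for testing."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(features))
    n_test = int(round(len(features) * test_size))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return features[train_idx], features[test_idx], labels[train_idx], labels[test_idx]


# Hypothetical toy data: 10 samples with 2 features each
features = np.arange(20).reshape(10, 2)
labels = np.arange(10) % 2

x_tr, x_te, y_tr, y_te = random_split(features, labels)
print(x_tr.shape, x_te.shape, y_tr.shape, y_te.shape)
```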


# Training Deep Learning Model

In this section we will build our model using our prepared data. We will use a simple GRU.

In [19]:
model = Sequential()
model.add(Embedding(input_dim=num_words
                   ,output_dim=50
                   ,input_length=maxlen))

model.add(CuDNNGRU(units=16,return_sequences = True))

model.add(CuDNNGRU(units=8))

model.add(Dense(1,activation="sigmoid"))

model.compile(loss="binary_crossentropy",optimizer="adam",metrics=["accuracy"])

model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 200, 50)           250000    
_________________________________________________________________
cu_dnngru (CuDNNGRU)         (None, 200, 16)           3264      
_________________________________________________________________
cu_dnngru_1 (CuDNNGRU)       (None, 8)                 624       
_________________________________________________________________
dense (Dense)                (None, 1)                 9         
Total params: 253,897
Trainable params: 253,897
Non-trainable params: 0
_________________________________________________________________
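
The parameter counts in the summary can be checked by hand. The sketch below uses hypothetical helper names and assumes that each CuDNNGRU layer has three gates, each with input weights, recurrent weights and two bias vectors; under that assumption the numbers match the summary above.

```python
def embedding_params(vocab_size, embedding_dim):
    # One embedding vector per word in the vocabulary
    return vocab_size * embedding_dim


def cudnn_gru_params(input_dim, units):
    # Assumption: 3 gates, each with input weights, recurrent weights and two biases
    return 3 * (units * input_dim + units * units + 2 * units)


def dense_params(input_dim, units):
    # Weights plus one bias per output unit
    return input_dim * units + units


counts = [
    embedding_params(5000, 50),   # embedding
    cudnn_gru_params(50, 16),     # first GRU layer
    cudnn_gru_params(16, 8),      # second GRU layer
    dense_params(8, 1),           # output layer
]

print(counts, sum(counts))
```

Almost all of the parameters sit in the embedding layer, which is worth keeping in mind with only 964 reviews.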


## Fitting GRU Model

In this section I am going to fit the model that I created in the previous section.

In [20]:
model.fit(x_train,y_train,epochs=20,batch_size=50)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<tensorflow.python.keras.callbacks.History at 0x7fc9ec035d50>

## Predicting and Evaluating Results

Our model is ready, now we can do some predictions.

In [21]:
from sklearn.metrics import accuracy_score
y_pred = model.predict_classes(x_test)

accuracy_score(y_test,y_pred)

0.7668393782383419

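* Our accuracy is about 77% (0.767). That is not bad for such a small dataset: in the whole dataset, 500 of the 964 remaining reviews have a score below 5, so always predicting the majority class would give only about 52%.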
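
For a model with a single sigmoid output, turning predicted probabilities into class labels comes down to thresholding at 0.5. Here is a minimal NumPy sketch with hypothetical values (`probabilities` and `true_labels` are made up); it can be useful on newer TensorFlow releases, where `predict_classes` is no longer available and `model.predict` returns probabilities.

```python
import numpy as np

# Hypothetical sigmoid outputs, shaped (n_samples, 1) like the output of a Dense(1) layer
probabilities = np.array([[0.91], [0.12], [0.55], [0.49], [0.73]])
true_labels = np.array([1, 0, 0, 0, 1])

# Probabilities above 0.5 become class 1, the rest become class 0
predicted = (probabilities > 0.5).astype(int).ravel()

# Accuracy is the share of predictions that match the true labels
accuracy = np.mean(predicted == true_labels)

print(predicted)
print(accuracy)
```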

# Conclusion

Thanks for your attention. If you liked this kernel, please upvote.

And if you have any questions, please ask. I will definitely answer them as well as I can.