<a href="https://colab.research.google.com/github/ithabibi/Persian-Opinion-Mining-and-Sentiment-Analysis/blob/main/Persian-Sentiment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Persian Sentiment Analysis with a fastText Language Model and an LSTM Neural Network
### A step-by-step guide to Persian sentiment analysis


---


We will go through five steps together.

## Step 1) Choose and prepare a word embedding model
In this step we prepare a [word embedding](https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa) model.
There are many ways to train a word embedding model, for example:

1.   Fasttext
2.   ELMo (Embeddings from Language Models)
3.   Universal Sentence Encoder 
4.   Word2Vec
5.   GloVe (Global Vector)

If you want to know more, read [this article by Thomas Wolf](https://medium.com/huggingface/universal-word-sentence-embeddings-ce48ddc8fc3a). Here we use fastText because Facebook provides a pretrained Persian model that we can use directly. There is nothing special about this model: it is fairly easy to train one yourself on your own corpus. Facebook trained this one on Persian Wikipedia and other web-crawled text, so using it simply saves us work.



In [None]:
#@title Download, extract and load Fasttext word embedding model

!wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.fa.300.bin.gz
!gunzip /content/cc.fa.300.bin.gz

!pip install pybind11==2.11.1
!pip install fasttext==0.9.2 

import fasttext 

# %time is a line magic, so it has to sit on the same line as the statement it times
%time model = fasttext.load_model("/content/cc.fa.300.bin")

### Test the fastText model with the word تتو
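
The nearest-neighbour query in the cell below ranks words by how similar their vectors are. As a rough illustration of that idea, here is a minimal sketch using cosine similarity on made-up 3-dimensional vectors. The toy words, their values, and the helper names `cosine` and `nearest_neighbors` are assumptions for illustration and are not part of the fastText API.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_neighbors(word, vectors, k=2):
    """Return up to k (similarity, word) pairs, most similar first."""
    query = vectors[word]
    scored = [(cosine(query, vec), other)
              for other, vec in vectors.items() if other != word]
    return sorted(scored, reverse=True)[:k]

# Hypothetical toy embeddings; real fastText vectors have 300 dimensions
toy_vectors = {
    "good": [0.9, 0.1, 0.0],
    "great": [0.8, 0.2, 0.1],
    "bad": [-0.7, 0.3, 0.0],
}

neighbors = nearest_neighbors("good", toy_vectors)
print(neighbors)
```

In this toy space "great" ends up closer to "good" than "bad" does, which is the kind of ranking you should expect to see for real words in the next cell.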

In [None]:
print(model.get_dimension())
print(model.get_word_vector('تتو').shape)
print(model['تتو']) # get the vector of the word 'تتو'

model.get_nearest_neighbors('تتو')


## Step 2) Normalize and prepare the dataset
In this step we use a dataset crawled by [@minasmz](https://github.com/minasmz). Its quality is not great, and I used only 450 positive and 450 negative reviews from it. Here we download the dataset and split it into train and test sets (I create the train and test arrays first, then fill them with data).
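
To make the sampling and split concrete, here is a minimal sketch of the idea under stated assumptions: it takes the same number of reviews from each class, shuffles them, and cuts the result into a train part and a test part. The function name `balanced_split`, its parameters, and the toy reviews are hypothetical; the notebook itself does this with `revlist_shuffle` and NumPy arrays.

```python
import random

def balanced_split(positive, negative, per_class, train_fraction, seed=0):
    """Take per_class items from each class, shuffle, and split into train/test."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    combined = positive[:per_class] + negative[:per_class]
    rng.shuffle(combined)
    train_size = int(train_fraction * len(combined))
    return combined[:train_size], combined[train_size:]

def one_hot(label):
    """[1, 0] for a negative review (label 3), [0, 1] otherwise, as in this notebook."""
    return [1, 0] if label == 3 else [0, 1]

# Toy data: (text, label) pairs with label 1 = positive and 3 = negative
toy_positive = [("good product %d" % i, 1) for i in range(10)]
toy_negative = [("bad product %d" % i, 3) for i in range(8)]

train, test = balanced_split(toy_positive, toy_negative,
                             per_class=8, train_fraction=0.75)
train_labels = [one_hot(label) for _, label in train]
print(len(train), len(test))
```

Capping both classes at the size of the smaller one keeps the two labels balanced, which is why the notebook stops at 450 reviews per class.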

In [None]:
#@title Download and prepare the dataset in Google Colab
!wget https://raw.githubusercontent.com/ithabibi/Persian-Opinion-Mining-and-Sentiment-Analysis/main/sentiment_tagged_dataset.csv

!pip install pandas==1.5.3
!pip install numpy==1.23
!pip install hazm==0.7.0

import pandas
import random
import numpy
import hazm

# Load the sentiment_tagged_dataset.csv file from /content/ in Google Colab.
# The dataset has three columns: Text, Score, Suggestion
csv_dataset = pandas.read_csv("/content/sentiment_tagged_dataset.csv")

# Create the normalizer once instead of on every call
_normalizer = hazm.Normalizer()

def CleanPersianText(text):
  return _normalizer.normalize(text)

# Clean the dataset and build a new list of [text, suggestion] pairs (the third column, "Score", is dropped).
# The pairs come from zip --> zip(csv_dataset['Text'], csv_dataset['Suggestion'])
# Suggestion takes the values 1, 2, 3, meaning positive, neutral, negative
revlist = list(map(lambda x: [CleanPersianText(x[0]),x[1]],zip(csv_dataset['Text'],csv_dataset['Suggestion'])))

# Separate positive, neutral and negative suggestions
positive=list(filter(lambda x: x[1] == 1,revlist))
neutral=list(filter(lambda x: x[1] == 2,revlist))
negative=list(filter(lambda x: x[1] == 3,revlist))

# Print the number of elements in the positive, negative, neutral and revlist lists
print("Positive count {}".format(len(positive)))
print("Negative count {}".format(len(negative)))
print("Neutral  count {}".format(len(neutral)))
print("Total dataset count {}".format(len(revlist)))

# Combine the first 450 positive and the first 450 negative suggestions.
# We chose 450 because that is the maximum number of negative comments available
revlist_shuffle = positive[:450] + negative[:450]
random.shuffle(revlist_shuffle)
random.shuffle(revlist_shuffle)#double shuffle
print("Total shuffle count {}".format(len(revlist_shuffle)))

# Print a random element from each of the positive, negative and neutral lists
# (randrange(len(...)) can also pick index 0, which randrange(1, ...) skipped)
print("Positive sample : ","\n",positive[random.randrange(len(positive))])
print("Negative sample : ","\n",negative[random.randrange(len(negative))])
print("Neutral  sample : ","\n",neutral[random.randrange(len(neutral))])

In [None]:
#@title Create the train and test arrays, filled with zeros
vector_size = 300 #@param {type:"integer"}
max_no_tokens = 20 #@param {type:"integer"}
import numpy as np
import keras.backend as K
train_size = int(0.95*(len(revlist_shuffle)))
test_size = int(0.05*(len(revlist_shuffle)))

x_train = np.zeros((train_size, max_no_tokens, vector_size), dtype=K.floatx())
y_train = np.zeros((train_size, 2), dtype=np.int32)

x_test = np.zeros((test_size, max_no_tokens, vector_size), dtype=K.floatx())
y_test = np.zeros((test_size, 2), dtype=np.int32)

In [None]:
#@title Fill X_Train, X_Test, Y_Train, Y_Test with digi-kala Dataset
indexes = set(np.random.choice(len(revlist_shuffle), train_size + test_size, replace=False))

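vocab = set(model.words) # build once: a set lookup is far faster than scanning the word list

for data_item, index in enumerate(indexes): # indexes holds train_size + test_size items
  Sentence = hazm.word_tokenize(revlist_shuffle[index][0])
  for word in range(0,len(Sentence)):
    if word >= max_no_tokens:
      break
    if Sentence[word] not in vocab: # skip words missing from the fastText vocabulary
      continue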
    if data_item < train_size:
      x_train[data_item, word, :] = model.get_word_vector(Sentence[word])
    else:
      x_test[data_item - train_size, word, :] = model.get_word_vector(Sentence[word])

  # one-hot labels: [1, 0] = negative (Suggestion == 3), [0, 1] = positive
  if data_item < train_size:
    y_train[data_item, :] = [1.0, 0.0] if revlist_shuffle[index][1] == 3 else [0.0, 1.0]
  else:
    y_test[data_item - train_size, :] = [1.0, 0.0] if revlist_shuffle[index][1] == 3 else [0.0, 1.0]
    
x_train.shape,x_test.shape,y_train.shape,y_test.shape

## Step 3) Configure, compile, and fit the LSTM model
Now we create our LSTM model and then feed it our training data.
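
Before building the network, it can help to trace how the tensor shape changes from layer to layer. The following is a small sketch of that arithmetic for the layers defined in the next cells, assuming Keras defaults (a `MaxPooling1D` stride equal to `pool_size` with no padding, and `Bidirectional` concatenating both directions). The helper `maxpool1d_length` is hypothetical and only mirrors the usual output-length formula.

```python
def maxpool1d_length(length, pool_size):
    """Output length of 1D max pooling with stride == pool_size and no padding."""
    return (length - pool_size) // pool_size + 1

max_no_tokens, vector_size = 20, 300

# Input: one review is a (20, 300) matrix of word vectors
steps, channels = max_no_tokens, vector_size

# Three Conv1D(32, padding='same') layers keep the 20 steps and output 32 channels
channels = 32

# MaxPooling1D(pool_size=3) shortens the sequence
steps = maxpool1d_length(steps, pool_size=3)

# Bidirectional(LSTM(512)): forward and backward outputs are concatenated
lstm_features = 2 * 512

# Three Dense(512) layers, then Dense(2) with softmax for negative/positive
dense_features = 512
num_classes = 2

print((steps, channels), lstm_features, dense_features, num_classes)
```

Comparing these numbers with the output of `model.summary()` is a quick way to check that the layers are wired as intended.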

In [None]:
#@title Set batchSize and epochs
batch_size = 500 #@param {type:"integer"}
no_epochs = 200 #@param {type:"integer"}
fasttext_model = model
del model

In [None]:
#@title Building Layer of LSTM Model
from keras.models import Sequential
from keras.layers import Conv1D, Dropout, Dense, Flatten, LSTM, MaxPooling1D, Bidirectional
from keras.optimizers import Adam
from keras.callbacks import EarlyStopping, TensorBoard


model = Sequential()

model.add(Conv1D(32, kernel_size=3, activation='elu', padding='same',
                 input_shape=(max_no_tokens, vector_size)))
model.add(Conv1D(32, kernel_size=3, activation='elu', padding='same'))
model.add(Conv1D(32, kernel_size=3, activation='relu', padding='same'))
model.add(MaxPooling1D(pool_size=3))

model.add(Bidirectional(LSTM(512, dropout=0.2, recurrent_dropout=0.3)))

model.add(Dense(512, activation='sigmoid'))
model.add(Dropout(0.2))
model.add(Dense(512, activation='sigmoid'))
model.add(Dropout(0.25))
model.add(Dense(512, activation='sigmoid'))
model.add(Dropout(0.25))

model.add(Dense(2, activation='softmax'))

# learning_rate replaces the deprecated `lr` argument name
model.compile(loss='categorical_crossentropy', optimizer=Adam(learning_rate=0.0001, decay=1e-6), metrics=['accuracy'])

# tensorboard = TensorBoard(log_dir='logs/', histogram_freq=0, write_graph=True, write_images=True)

model.summary()

In [None]:
model.fit(x_train, y_train, batch_size=batch_size, shuffle=True, epochs=no_epochs,
         validation_data=(x_test, y_test))

## Step 4) Evaluate and save our model
In this step we evaluate the LSTM model's loss and accuracy metrics. Example result:
loss: 0.5849 - accuracy: 0.8333
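
For reference, here is a minimal sketch of what these two numbers mean, computed by hand on made-up predictions. The helper names and the toy values are assumptions for illustration; Keras computes these metrics internally and may differ in details such as clipping.

```python
import math

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of -sum(true * log(pred)) over all samples."""
    losses = []
    for true_row, pred_row in zip(y_true, y_pred):
        losses.append(-sum(t * math.log(max(p, eps))
                           for t, p in zip(true_row, pred_row)))
    return sum(losses) / len(losses)

def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted class matches the true class."""
    hits = sum(argmax(t) == argmax(p) for t, p in zip(y_true, y_pred))
    return hits / len(y_true)

# Toy one-hot labels ([1, 0] = negative, [0, 1] = positive) and softmax outputs
y_true = [[1, 0], [0, 1], [0, 1]]
y_pred = [[0.8, 0.2], [0.3, 0.7], [0.6, 0.4]]

loss = categorical_crossentropy(y_true, y_pred)
acc = accuracy(y_true, y_pred)
print(loss, acc)
```

The third toy sample is misclassified, so it lowers the accuracy and contributes the largest share of the loss.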

In [None]:
model.metrics_names

In [None]:
model.evaluate(x=x_test, y=y_test, batch_size=32, verbose=1)

In [None]:
model.save('learned-persian-sentiment-fasttext.model')#Save the model for future use

## Step 5) Test our model
There are two examples below. They are only for demonstration; there is no difference in how they work.
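
Both cells follow the same recipe: normalize and tokenize the text, look up a vector for each of the first `max_no_tokens` tokens, leave zeros for unknown words and padding, then read the two softmax outputs as percentages. Here is a minimal sketch of the padding and percentage steps, using a hypothetical toy lookup table `toy_vectors` in place of the fastText model and made-up probabilities in place of `model.predict`.

```python
def sentence_to_matrix(tokens, vectors, max_tokens, dim):
    """Return max_tokens rows of length dim; unknown words and padding stay zero."""
    matrix = [[0.0] * dim for _ in range(max_tokens)]
    for position, token in enumerate(tokens[:max_tokens]):  # truncate long inputs
        if token in vectors:
            matrix[position] = list(vectors[token])
    return matrix

def to_percent(probability):
    """Format a probability the way the notebook does: a truncated integer percent."""
    return str(int(probability * 100)) + " % "

# Hypothetical 2-dimensional embeddings standing in for the 300-dimensional model
toy_vectors = {"good": [1.0, 0.0], "phone": [0.0, 1.0]}
matrix = sentence_to_matrix(["good", "unknown", "phone", "extra"],
                            toy_vectors, max_tokens=3, dim=2)

fake_result = [[0.25, 0.75]]  # stand-in for model.predict: [negative, positive]
pos_percent = to_percent(fake_result[0][1])
neg_percent = to_percent(fake_result[0][0])
print(matrix, pos_percent, neg_percent)
```

Note that the fourth token is dropped by truncation and the unknown word keeps an all-zero row, which is the same behavior as the `break` and `continue` branches in the cells below.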

In [None]:
user_text = "\u062E\u06CC\u0644\u06CC \u06AF\u0648\u0634\u06CC\u0647 \u062E\u0648\u0628\u06CC\u0647. \u062A\u0634\u062E\u06CC\u0635 \u0686\u0647\u0631\u0647 \u062F\u0627\u0631\u0647. \u062F\u0627\u062E\u0644 \u062C\u0639\u0628\u0647 \u06A9\u0627\u0648\u0631 \u06AF\u0648\u0634\u06CC \u0648 \u0645\u062D\u0627\u0641\u0638 \u0635\u0641\u062D\u0647 \u062F\u0627\u0631\u0647. \u0645\u0646 \u062F\u06CC\u0631\u0648\u0632 \u0628\u0647 \u062F\u0633\u062A\u0645 \u0631\u0633\u06CC\u062F\u0647 \u0639\u0627\u0644\u06CC\u0647 \u0645\u0631\u0633\u06CC \u0627\u0632 \u062F\u06CC\u062C\u06CC \u06A9\u0627\u0644\u0627" #@param {type:"string"}
from IPython.display import display, HTML # IPython.core.display is deprecated for these imports
_normalizer = hazm.Normalizer()
if not user_text=="":
  text_for_test = _normalizer.normalize(user_text)
  text_for_test_words = hazm.word_tokenize(text_for_test)
  x_text_for_test_words = np.zeros((1,max_no_tokens,vector_size),dtype=K.floatx())
  for word in range(0,len(text_for_test_words)):
    if word >= max_no_tokens:
      break
    if text_for_test_words[word] not in fasttext_model.words:
      continue
    
    x_text_for_test_words[0, word, :] = fasttext_model.get_word_vector(text_for_test_words[word])
  # print(x_text_for_test_words.shape)
  # print(text_for_test_words)
  result = model.predict(x_text_for_test_words)
  pos_percent = str(int(result[0][1]*100))+" % "
  neg_percent = str(int(result[0][0]*100))+" % "
  display(HTML("<div style='text-align: center'><div style='display:inline-block'><img height='64px' width='64px' src='https://image.flaticon.com/icons/svg/260/260205.svg'/><h4>{}</h4></div> | <div style='display:inline-block'><img height='64px' width='64px' src='https://image.flaticon.com/icons/svg/260/260206.svg'/><h4>{}</h4></div></div>".format(pos_percent,neg_percent)))
else:
  print("Please enter your text")

In [None]:
user_text = "\u062E\u06CC\u0644\u06CC \u062C\u0627\u0644\u0628\u0647 \u0627\u06CC\u0646 \u0645\u0648\u0628\u0627\u06CC\u0644 \u0627\u0635\u0644\u0627 \u0647\u0645\u0647 \u0686\u06CC \u062A\u0645\u0627\u0645\u0647 \u0645\u0646 \u06A9\u0647 \u067E\u0633\u0646\u062F\u06CC\u062F\u0645 \u0627\u06CC\u0646 \u0645\u0648\u0628\u0627\u06CC\u0644 \u0632\u06CC\u0628\u0627 \u0631\u0648" #@param {type:"string"}
from IPython.display import display, HTML # IPython.core.display is deprecated for these imports
_normalizer = hazm.Normalizer()
if not user_text=="":
  text_for_test = _normalizer.normalize(user_text)
  text_for_test_words = hazm.word_tokenize(text_for_test)
  x_text_for_test_words = np.zeros((1,max_no_tokens,vector_size),dtype=K.floatx())
  for word in range(0,len(text_for_test_words)):
    if word >= max_no_tokens:
      break
    if text_for_test_words[word] not in fasttext_model.words:
      continue
    
    x_text_for_test_words[0, word, :] = fasttext_model.get_word_vector(text_for_test_words[word])
  # print(x_text_for_test_words.shape)
  # print(text_for_test_words)
  result = model.predict(x_text_for_test_words)
  pos_percent = str(int(result[0][1]*100))+" % "
  neg_percent = str(int(result[0][0]*100))+" % "
  display(HTML("<div style='text-align: center'><div style='display:inline-block'><img height='64px' width='64px' src='https://image.flaticon.com/icons/svg/260/260205.svg'/><h4>{}</h4></div> | <div style='display:inline-block'><img height='64px' width='64px' src='https://image.flaticon.com/icons/svg/260/260206.svg'/><h4>{}</h4></div></div>".format(pos_percent,neg_percent)))
else:
  print("Please enter your text")