### Open in Google Colab
<a href="https://colab.research.google.com/github/ithabibi/Persian-Opinion-Mining-and-Sentiment-Analysis/blob/main/Persian-Sentiment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Persian Sentiment Analysis with a fastText Language Model and an LSTM Neural Network
### A step-by-step guide to Persian sentiment analysis


---


#### We will go through five steps together

## Step 1) Choose and Prepare a Word Embedding Model

In this step we prepare a [word embedding model](https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa).
There are many word embedding models to choose from, for example:

1.   Fasttext
2.   ELMo (Embeddings from Language Models)
3.   Universal Sentence Encoder 
4.   Word2Vec
5.   GloVe (Global Vector)

If you want to know more, read [this article from Thomas Wolf](https://medium.com/huggingface/universal-word-sentence-embeddings-ce48ddc8fc3a). Here we use fastText because Facebook has already published a pretrained Persian model that we can use directly. There is nothing to worry about with this model: it is fairly easy to train one yourself on your own corpus, but Facebook trained this one on Persian Wikipedia and other data, so using it is simply easier for us.
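
To make the idea of "similar words have nearby vectors" concrete, here is a minimal sketch in plain Python. The three-dimensional vectors are made up for illustration (real fastText vectors have 300 dimensions), and `nearest_neighbors` is a hypothetical helper that only mimics the `(score, word)` shape returned by fastText's `get_nearest_neighbors`.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy "embeddings", not real fastText values
toy_vectors = {
    "good": [0.9, 0.1, 0.0],
    "great": [0.8, 0.2, 0.1],
    "phone": [0.0, 0.2, 0.9],
}

def nearest_neighbors(word, vectors, k=2):
    """Rank every other word by cosine similarity to `word`, most similar first."""
    query = vectors[word]
    scored = [(cosine_similarity(query, vec), other)
              for other, vec in vectors.items() if other != word]
    return sorted(scored, reverse=True)[:k]

for score, word in nearest_neighbors("good", toy_vectors):
    print(word, round(score, 3))
```

In this toy vocabulary "great" ranks above "phone" for the query "good", because its vector points in almost the same direction.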

In [None]:
#@title Download, extract and load Fasttext 2016 word embedding model
# There are also newer models of fasttext in Persian language

!pip install pybind11==2.11.1
!pip install fasttext==0.9.2

!pip install keras==2.12.0 
!pip install pandas==1.5.3
!pip install numpy==1.23
!pip install hazm==0.7.0

!wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.fa.300.bin.gz
!gunzip /content/cc.fa.300.bin.gz

import fasttext 

# %time must be on the same line as the statement it measures
%time fasttext_model = fasttext.load_model("/content/cc.fa.300.bin")

In [None]:
#@title Test the fastText word embedding model with similar words

phrase = "\u062A\u062A\u0648" #@param {type:"string"}
print("dimension of " + phrase + " is " +str(fasttext_model.get_dimension()))
print(fasttext_model.get_word_vector(phrase).shape)
print(fasttext_model[phrase]) # get the vector of the word 

# get the most similar words
fasttext_model.get_nearest_neighbors(phrase)

## Step 2) Normalization and Preparation of Data Sets

In this step we load a dataset crawled by [@minasmz](https://github.com/minasmz). The dataset is not of high quality, so I only used 460 positive and 460 negative reviews from it. Below we download the dataset and split it into train and test sets (I create the empty train and test arrays first, then fill them with data).
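
The balance-then-split idea can be sketched on its own with toy data. This is an illustration under assumptions, not the notebook's exact code: `balanced_split` is a hypothetical helper, the reviews are placeholders, and the test set is simply whatever remains after the training slice.

```python
import random

def balanced_split(positive, negative, train_fraction=0.95, seed=0):
    """Take equally many items from each class, shuffle, then split into train/test."""
    n = min(len(positive), len(negative))          # size of the smaller class
    pool = positive[:n] + negative[:n]             # balanced pool
    random.Random(seed).shuffle(pool)              # seeded so the split is reproducible
    train_size = int(train_fraction * len(pool))
    return pool[:train_size], pool[train_size:]

# Placeholder reviews: label 1 = positive, label 3 = negative (as in the dataset)
toy_positive = [("pos review %d" % i, 1) for i in range(10)]
toy_negative = [("neg review %d" % i, 3) for i in range(6)]

train, test = balanced_split(toy_positive, toy_negative, train_fraction=0.75)
print(len(train), len(test))
```

With 10 positive and 6 negative toy reviews the pool holds 12 items, so a 0.75 fraction yields 9 training and 3 test items.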

In [1]:
#@title Download the dataset in Google Colab and prepare it
!wget https://raw.githubusercontent.com/ithabibi/Persian-Opinion-Mining-and-Sentiment-Analysis/main/sentiment_tagged_dataset.csv

import pandas
import random
import numpy
import hazm

# Load the sentiment_tagged_dataset.csv file from /content/ in Google Colab.
# The dataset has three columns: Comment, Score, Suggestion. Comment is the feature and Suggestion is the label.
csv_dataset = pandas.read_csv("/content/sentiment_tagged_dataset.csv")

def CleanPersianText(text):
  _normalizer = hazm.Normalizer()
  text = _normalizer.normalize(text)
  return text

# Clean the dataset and build a new list of [Comment, Suggestion] pairs (the third column, Score, is dropped).
# The pairs come from zip(csv_dataset['Comment'], csv_dataset['Suggestion']).
# Suggestion values: 1 = positive, 2 = neutral, 3 = negative
revlist = list(map(lambda x: [CleanPersianText(x[0]),x[1]],zip(csv_dataset['Comment'],csv_dataset['Suggestion'])))

# Separate the positive, neutral and negative suggestions
positive=list(filter(lambda x: x[1] == 1,revlist))
neutral=list(filter(lambda x: x[1] == 2,revlist))
negative=list(filter(lambda x: x[1] == 3,revlist))

# Print the number of elements in the positive, negative, neutral and revlist lists
print("*" * 88)
print("Positive count {}".format(len(positive)))
print("Negative count {}".format(len(negative)))
print("Neutral  count {}".format(len(neutral)))
print("Total dataset count {}".format(len(revlist)))

# Combine the first 460 positive and the first 460 negative comments into one balanced list.
# We chose 460 because there are at most 460 negative comments.
revlist_shuffle = positive[:460] + negative[:460]
random.shuffle(revlist_shuffle)  # one shuffle already gives a uniformly random order
print("Total shuffle count {}".format(len(revlist_shuffle)),"\n")

# Print a random element from each of the positive, negative and neutral lists
# (random.choice can pick any element, including the first one)
print("Random positive comment: ","\n",random.choice(positive))
print("Random negative comment: ","\n",random.choice(negative))
print("Random neutral  comment: ","\n",random.choice(neutral))


In [None]:
#@title Create the train and test arrays, initialised with zeros
embedding_dim = 300 #@param {type:"integer"}
max_vocab_token = 20 #@param {type:"integer"}
import numpy as np
import keras.backend as K

train_size = int(0.95*(len(revlist_shuffle)))
test_size = int(0.05*(len(revlist_shuffle)))

# x_train holds the features (model input) and y_train holds the labels (model output).
# x_train has 3 dimensions (874,20,300): (number_of_comments, number_of_words, dimension_of_fasttext)
# y_train has 2 dimensions (874,2): (number_of_comments, one-hot sentiment class)
# The suggestions are 1 or 3: 1 is positive and 3 is negative.
x_train = np.zeros((train_size, max_vocab_token, embedding_dim), dtype=K.floatx())
y_train = np.zeros((train_size, 2), dtype=np.int32)

x_test = np.zeros((test_size, max_vocab_token, embedding_dim), dtype=K.floatx())
y_test = np.zeros((test_size, 2), dtype=np.int32)

In [None]:
#@title Fill X_Train, X_Test, Y_Train, Y_Test with the Digikala dataset
indexes = set(np.random.choice(len(revlist_shuffle), train_size + test_size, replace=False)) # for random selection
print("data_item is: " + str(len(indexes)),"\n")

# Build the vocabulary set once; testing membership against the plain word list for every token is very slow
fasttext_vocab = set(fasttext_model.words)

for data_item, index in enumerate(indexes): # indexes holds the positions of the 920 selected comments
  comment = hazm.word_tokenize(revlist_shuffle[index][0]) # [0] is the "Comment" field of the .csv file
  for vocabs in range(0,len(comment)):
    if vocabs >= max_vocab_token: 
      break # If the comment has more than twenty words, only the first twenty are considered
    if comment[vocabs] not in fasttext_vocab:
      continue # If the word is not in the fastText vocabulary, its 300-element vector stays zero
    if data_item < train_size:
      x_train[data_item, vocabs, :] = fasttext_model.get_word_vector(comment[vocabs])
    else:
      x_test[data_item - train_size, vocabs, :] = fasttext_model.get_word_vector(comment[vocabs])

  if data_item < train_size:
    y_train[data_item, :] = [1.0, 0.0] if revlist_shuffle[index][1] == 3 else [0.0, 1.0]
  else:
    y_test[data_item - train_size, :] = [1.0, 0.0] if revlist_shuffle[index][1] == 3 else [0.0, 1.0]
    
print (x_train.shape,x_test.shape,y_train.shape,y_test.shape)
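
The padding and truncation logic used to fill the arrays can be isolated in a small function. The sketch below is an assumption-laden illustration: `vectorize_tokens` and `toy_lookup` are hypothetical names, and a plain dictionary with two-dimensional vectors stands in for the fastText model.

```python
import numpy as np

def vectorize_tokens(tokens, lookup, max_tokens, dim):
    """Return a (max_tokens, dim) matrix: one row per token, zeros for padding and unknown words."""
    matrix = np.zeros((max_tokens, dim), dtype="float32")
    for position, token in enumerate(tokens[:max_tokens]):  # truncate long inputs
        vector = lookup.get(token)
        if vector is not None:                               # unknown words keep a zero row
            matrix[position] = vector
    return matrix

# Hypothetical stand-in for the embedding model (2 dimensions instead of 300)
toy_lookup = {"good": [1.0, 0.0], "phone": [0.0, 1.0]}

short = vectorize_tokens(["good", "unknown", "phone"], toy_lookup, max_tokens=4, dim=2)
long = vectorize_tokens(["good"] * 10, toy_lookup, max_tokens=4, dim=2)
print(short)
print(long.shape)
```

Short inputs are padded with zero rows at the end, and inputs longer than `max_tokens` are cut off, so every comment ends up with the same shape.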

## Step 3) Config & Compile & Fit the LSTM Model

Now we create our LSTM model and then feed it our training data.

This code builds a neural network with an LSTM that predicts the sentiment of a comment, i.e. whether the opinion is positive or negative.
First, we create the LSTM_model and add layers to it sequentially.
A Conv1D layer comes first; it takes the sequence of word vectors (20x300) as input and extracts local features from neighbouring words.
Two more Conv1D layers follow; like the first, each uses 32 filters with a kernel size of 3.
A MaxPooling1D layer with a window size of 3 is then added because it shortens the sequence (i.e. makes processing easier).
Then a bidirectional LSTM layer with 512 units per direction is added, which reads the whole sequence for prediction.
Then three fully connected (Dense) layers with sigmoid activations are added. Each of these layers has 512 units.
To prevent overfitting, three Dropout layers with rates of 0.2, 0.25 and 0.25 are used.
Finally, a Dense layer is added whose size is the number of output classes (2 in this case), with softmax as its activation, which returns the class probabilities.
The compile function sets the training configuration: categorical_crossentropy is the loss function and Adam is the optimizer.
At the end, the model's summary method prints a summary of the model structure.
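
As a sanity check on the architecture described above, the output shape and parameter count of each layer can be worked out by hand. The sketch below uses the standard formulas for Conv1D, LSTM and Dense layers and assumes Keras defaults (pooling stride equal to the pool size, bidirectional outputs concatenated). The helper names are hypothetical, and the numbers have not been compared against an actual model summary.

```python
def conv1d_params(kernel_size, in_channels, filters):
    """Kernel weights plus one bias per filter."""
    return kernel_size * in_channels * filters + filters

def lstm_params(units, input_dim):
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * (units * (input_dim + units) + units)

def dense_params(in_features, out_features):
    """Weight matrix plus one bias per output unit."""
    return in_features * out_features + out_features

max_vocab_token, embedding_dim = 20, 300
pooled_steps = max_vocab_token // 3  # MaxPooling1D(pool_size=3), default stride, 'valid' padding

layers = [
    ("Conv1D #1", (max_vocab_token, 32), conv1d_params(3, embedding_dim, 32)),
    ("Conv1D #2", (max_vocab_token, 32), conv1d_params(3, 32, 32)),
    ("Conv1D #3", (max_vocab_token, 32), conv1d_params(3, 32, 32)),
    ("MaxPooling1D", (pooled_steps, 32), 0),
    ("Bidirectional LSTM", (2 * 512,), 2 * lstm_params(512, 32)),
    ("Dense #1", (512,), dense_params(2 * 512, 512)),
    ("Dense #2", (512,), dense_params(512, 512)),
    ("Dense #3", (512,), dense_params(512, 512)),
    ("Dense softmax", (2,), dense_params(512, 2)),
]

for name, shape, params in layers:
    print(f"{name:20s} output={shape!s:10s} params={params}")

total_params = sum(params for _, _, params in layers)
print("total:", total_params)
```

Dropout layers are left out because they have no trainable parameters and do not change the shape. Most of the parameters sit in the bidirectional LSTM, which is worth keeping in mind given that the training set has fewer than a thousand comments.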

In [None]:
#@title Set batchSize and epochs
# batch_size: the number of samples used in each training step
batch_size = 500 #@param {type:"integer"}
no_epochs = 200 #@param {type:"integer"}
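
To see what these two settings mean in practice, the number of weight updates can be computed directly. This is a small sketch: `updates_per_epoch` is a hypothetical helper, and 874 is the training-set size given in the array comments earlier.

```python
import math

def updates_per_epoch(n_samples, batch_size):
    """Each epoch processes the data in batches; the last batch may be smaller."""
    return math.ceil(n_samples / batch_size)

train_size, batch_size, no_epochs = 874, 500, 200
steps = updates_per_epoch(train_size, batch_size)
print(steps, "updates per epoch,", steps * no_epochs, "updates in total")
```

With a batch size of 500 there are only two updates per epoch, which is one reason a large number of epochs is used here.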

In [None]:
#@title Building Layers of LSTM Model
from keras.models import Sequential
from keras.layers import Conv1D, Dropout, Dense, Flatten, LSTM, MaxPooling1D, Bidirectional
from keras.optimizers import Adam
from keras.callbacks import EarlyStopping, TensorBoard

LSTM_model = Sequential() 

# First we add a convolutional layer over the word vectors and set its hyperparameters
# We use Conv1D because a sentence is a one-dimensional sequence: the input is 20x300, with 32 filters of kernel_size=3
LSTM_model.add(Conv1D(32, kernel_size=3, activation='elu', padding='same', input_shape=(max_vocab_token, embedding_dim)))

LSTM_model.add(Conv1D(32, kernel_size=3, activation='elu', padding='same'))
LSTM_model.add(Conv1D(32, kernel_size=3, activation='relu', padding='same'))
LSTM_model.add(MaxPooling1D(pool_size=3)) # Down sampling

# Add a bidirectional LSTM layer with 512 units per direction; dropout is used to prevent overfitting
LSTM_model.add(Bidirectional(LSTM(512, dropout=0.2, recurrent_dropout=0.3)))

# "Dense" refers to a fully connected layer
LSTM_model.add(Dense(512, activation='sigmoid')) # sigmoid is the hidden-layer activation here
LSTM_model.add(Dropout(0.2)) # Dropout is used to prevent overfitting
LSTM_model.add(Dense(512, activation='sigmoid'))
LSTM_model.add(Dropout(0.25))
LSTM_model.add(Dense(512, activation='sigmoid'))
LSTM_model.add(Dropout(0.25))

# Dense 2 --> this layer decides between the two classes.
LSTM_model.add(Dense(2, activation='softmax')) # softmax --> returns the probability of each class for a comment.

# The categorical_crossentropy loss function is used for multi-class classification problems.
# The Adam optimizer is used with learning_rate=0.0001. The legacy decay=1e-6 argument (a per-step learning-rate
# decay) is only accepted by keras.optimizers.legacy.Adam in Keras 2.11 and later.
LSTM_model.compile(loss='categorical_crossentropy', optimizer=Adam(learning_rate=0.0001), metrics=['accuracy'])

# Show Dashboard
#tensorboard = TensorBoard(log_dir='logs/', histogram_freq=0, write_graph=True, write_images=True)

LSTM_model.summary() # summary() prints the table itself and returns None

In [None]:
#@title Start training

LSTM_model.fit(x_train, y_train, batch_size=batch_size, shuffle=True, epochs=no_epochs,validation_data=(x_test, y_test))

## Step 4) Evaluate and Save our Model

In this step we evaluate the loss and accuracy of the LSTM model on the test set. A sample result:
loss: 0.5849 - accuracy: 0.8333
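
For readers who want to see what these two numbers measure, here is a minimal plain-Python sketch of softmax, categorical cross-entropy and accuracy for one-hot labels. It illustrates the definitions only; Keras computes them in its own batched, numerically safer way, and the function names and toy values below are assumptions.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = [x - max(logits) for x in logits]   # subtract the max for numerical stability
    exps = [math.exp(x) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(one_hot, probs, eps=1e-7):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(one_hot, probs))

def accuracy(labels, predictions):
    """Share of samples whose most probable class matches the true class."""
    correct = sum(1 for t, p in zip(labels, predictions)
                  if t.index(max(t)) == p.index(max(p)))
    return correct / len(labels)

# Toy example: columns are [negative, positive], as in y_train and y_test
labels = [[1, 0], [0, 1], [0, 1]]
predictions = [softmax([2.0, 0.0]), softmax([0.0, 1.0]), softmax([1.0, 0.0])]

mean_loss = sum(categorical_crossentropy(t, p) for t, p in zip(labels, predictions)) / len(labels)
print(round(mean_loss, 4), accuracy(labels, predictions))
```

A lower loss means the model puts more probability on the correct class, while accuracy only counts whether the most probable class is the right one.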

In [None]:
LSTM_model.metrics_names

In [None]:
# Evaluate the model on the test set
LSTM_model.evaluate(x=x_test, y=y_test, batch_size=32, verbose=1)

In [None]:
# Save the model for future use
LSTM_model.save('learned-persian-sentiment-fasttext.model') 

## Step 5) Test our Model

### The three cells below are only a showcase; there is no difference in how they work.

In [4]:
#@title using model

user_text = "\u062E\u06CC\u0644\u06CC \u06AF\u0648\u0634\u06CC\u0647 \u062E\u0648\u0628\u06CC\u0647. \u062A\u0634\u062E\u06CC\u0635 \u0686\u0647\u0631\u0647 \u062F\u0627\u0631\u0647. \u062F\u0627\u062E\u0644 \u062C\u0639\u0628\u0647 \u06A9\u0627\u0648\u0631 \u06AF\u0648\u0634\u06CC \u0648 \u0645\u062D\u0627\u0641\u0638 \u0635\u0641\u062D\u0647 \u062F\u0627\u0631\u0647. \u0645\u0646 \u062F\u06CC\u0631\u0648\u0632 \u0628\u0647 \u062F\u0633\u062A\u0645 \u0631\u0633\u06CC\u062F\u0647 \u0639\u0627\u0644\u06CC\u0647 \u0645\u0631\u0633\u06CC \u0627\u0632 \u062F\u06CC\u062C\u06CC \u06A9\u0627\u0644\u0627" #@param {type:"string"}
from IPython.display import display, HTML
_normalizer = hazm.Normalizer()
if not user_text=="":
  normal_text = _normalizer.normalize(user_text)
  tokenized_text = hazm.word_tokenize(normal_text)
  
  # Create a three-dimensional zero tensor (1,20,300): (1, number_of_words, dimension_of_fasttext)
  vector_text = np.zeros((1,max_vocab_token,embedding_dim),dtype=K.floatx())


  for vocabs in range(0,len(tokenized_text)):
    if vocabs >= max_vocab_token:
      break # If the comment is more than twenty words, only the first twenty words will be considered
    if tokenized_text[vocabs] not in fasttext_model.words:
      continue # If vocab does not exist in fasttext, every 300 elements of that word's vector remain zero
    
    vector_text[0, vocabs, :] = fasttext_model.get_word_vector(tokenized_text[vocabs])

  # print(vector_text.shape)
  # print(vector_text)
  result = LSTM_model.predict(vector_text) # result has two elements: [0][0] (negative) and [0][1] (positive)
  pos_percent = str(int(result[0][1]*100))+" % 😍"
  neg_percent = str(int(result[0][0]*100))+" % 🤕"
  display(HTML("<div style='text-align: center'><div style='display:inline-block'><img height='64px' width='64px' src='https://images.rawpixel.com/image_png_1000/cHJpdmF0ZS9sci9pbWFnZXMvd2Vic2l0ZS8yMDIyLTEwL3JtNTg2LWlubG92ZWZhY2UtMDFfMS1sOWQzYzlxMC5wbmc.png'/><h4>{}</h4></div> | <div style='display:inline-block'><img height='64px' width='64px' src='https://images.rawpixel.com/image_png_1000/cHJpdmF0ZS9sci9pbWFnZXMvd2Vic2l0ZS8yMDIyLTEwL3JtNTg2LWNyeWluZ2ZhY2UtMDFfMi1sOWQzYnh0MC5wbmc.png'/><h4>{}</h4></div></div>".format(pos_percent,neg_percent)))
else:
  print("Please enter your text")


In [None]:
#@title using model

user_text = "\u062E\u06CC\u0644\u06CC \u06AF\u0648\u0634\u06CC\u0647 \u062E\u0648\u0628\u06CC\u0647. \u062A\u0634\u062E\u06CC\u0635 \u0686\u0647\u0631\u0647 \u062F\u0627\u0631\u0647. \u062F\u0627\u062E\u0644 \u062C\u0639\u0628\u0647 \u06A9\u0627\u0648\u0631 \u06AF\u0648\u0634\u06CC \u0648 \u0645\u062D\u0627\u0641\u0638 \u0635\u0641\u062D\u0647 \u062F\u0627\u0631\u0647. \u0645\u0646 \u062F\u06CC\u0631\u0648\u0632 \u0628\u0647 \u062F\u0633\u062A\u0645 \u0631\u0633\u06CC\u062F\u0647 \u0639\u0627\u0644\u06CC\u0647 \u0645\u0631\u0633\u06CC \u0627\u0632 \u062F\u06CC\u062C\u06CC \u06A9\u0627\u0644\u0627" #@param {type:"string"}
from IPython.display import display, HTML
_normalizer = hazm.Normalizer()
if not user_text=="":
  normal_text = _normalizer.normalize(user_text)
  tokenized_text = hazm.word_tokenize(normal_text)
  
  # Create a three-dimensional zero tensor (1,20,300): (1, number_of_words, dimension_of_fasttext)
  vector_text = np.zeros((1,max_vocab_token,embedding_dim),dtype=K.floatx())


  for vocabs in range(0,len(tokenized_text)):
    if vocabs >= max_vocab_token:
      break # If the comment is more than twenty words, only the first twenty words will be considered
    if tokenized_text[vocabs] not in fasttext_model.words:
      continue # If vocab does not exist in fasttext, every 300 elements of that word's vector remain zero
    
    vector_text[0, vocabs, :] = fasttext_model.get_word_vector(tokenized_text[vocabs])

  # print(vector_text.shape)
  # print(vector_text)
  result = LSTM_model.predict(vector_text) # result has two elements: [0][0] (negative) and [0][1] (positive)
  pos_percent = str(int(result[0][1]*100))+" % 😍"
  neg_percent = str(int(result[0][0]*100))+" % 🤕"
  display(HTML("<div style='text-align: center'><div style='display:inline-block'><img height='64px' width='64px' src='https://image.flaticon.com/icons/svg/260/260205.svg'/><h4>{}</h4></div> | <div style='display:inline-block'><img height='64px' width='64px' src='https://image.flaticon.com/icons/svg/260/260206.svg'/><h4>{}</h4></div></div>".format(pos_percent,neg_percent)))
else:
  print("Please enter your text")

In [None]:
user_text = "\u062E\u06CC\u0644\u06CC \u062C\u0627\u0644\u0628\u0647 \u0627\u06CC\u0646 \u0645\u0648\u0628\u0627\u06CC\u0644 \u0627\u0635\u0644\u0627 \u0647\u0645\u0647 \u0686\u06CC \u062A\u0645\u0627\u0645\u0647 \u0645\u0646 \u06A9\u0647 \u067E\u0633\u0646\u062F\u06CC\u062F\u0645 \u0627\u06CC\u0646 \u0645\u0648\u0628\u0627\u06CC\u0644 \u0632\u06CC\u0628\u0627 \u0631\u0648" #@param {type:"string"}
from IPython.display import display, HTML
_normalizer = hazm.Normalizer()
if not user_text=="":
  normal_text = _normalizer.normalize(user_text)
  tokenized_text = hazm.word_tokenize(normal_text)
  vector_text = np.zeros((1,max_vocab_token,embedding_dim),dtype=K.floatx())
  for vocabs in range(0,len(tokenized_text)):
    if vocabs >= max_vocab_token:
      break
    if tokenized_text[vocabs] not in fasttext_model.words:
      continue
    
    vector_text[0, vocabs, :] = fasttext_model.get_word_vector(tokenized_text[vocabs])
  # print(vector_text.shape)
  # print(vector_text)
  result = LSTM_model.predict(vector_text)
  pos_percent = str(int(result[0][1]*100))+" % "
  neg_percent = str(int(result[0][0]*100))+" % "
  display(HTML("<div style='text-align: center'><div style='display:inline-block'><img height='64px' width='64px' src='https://image.flaticon.com/icons/svg/260/260205.svg'/><h4>{}</h4></div> | <div style='display:inline-block'><img height='64px' width='64px' src='https://image.flaticon.com/icons/svg/260/260206.svg'/><h4>{}</h4></div></div>".format(pos_percent,neg_percent)))
else:
  print("Please enter your text")
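
The last step of each cell above, turning the model output into percentages, can be factored into a small function. This is a sketch under assumptions: `describe_prediction` is a hypothetical helper, and the input is a hand-written stand-in for what `LSTM_model.predict` returns, with column 0 as negative and column 1 as positive.

```python
def describe_prediction(result):
    """Summarise a (1, 2) prediction whose columns are [negative, positive]."""
    negative_prob, positive_prob = result[0]
    label = "positive" if positive_prob >= negative_prob else "negative"
    return {
        "label": label,
        "positive_percent": int(positive_prob * 100),  # truncated, as in the cells above
        "negative_percent": int(negative_prob * 100),
    }

# Hand-written stand-in for a model output
summary = describe_prediction([[0.25, 0.75]])
print(summary)
```

Because the percentages are truncated rather than rounded, the two values can add up to 99 instead of 100 for some predictions.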