## [phase 2] Apply the proposed method to the test data
According to phase 1, **TF-IDF** with **SVM(rbf)** achieved the best result among all 18 models, so here we simply apply that model to the test data.

*Note* that you may need some variables or functions from phase 1 in phase 2.
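As a compact view of the model named above, the two steps (TF-IDF, then an RBF-kernel SVM) can be chained in a scikit-learn pipeline. This is only a sketch on hypothetical toy English sentences, not the notebook's data or the exact phase 1 code; `toy_texts`, `toy_labels`, and `pipeline` are illustrative names.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical toy data; the real comments are Persian and come from train.csv
toy_texts = ["great place", "very good food", "bad service", "dirty and bad"]
toy_labels = ["Positive", "Positive", "Negative", "Negative"]

# TF-IDF features feed directly into an SVM with an RBF kernel
pipeline = make_pipeline(TfidfVectorizer(), SVC(kernel="rbf"))
pipeline.fit(toy_texts, toy_labels)

toy_pred = pipeline.predict(["good place", "bad food"])
print(toy_pred)
```

The cells below perform the same two steps separately, which keeps the fitted vectorizer and classifier available for the later prediction function.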

In [1]:
import os
import pandas as pd
import numpy as np
import re

from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

### 2-1 Read & Prepare train and test data

In [2]:
train_data = pd.read_csv("sentiment_data/train.csv")
test_data = pd.read_csv("sentiment_data/test.csv")

X_train = np.array(train_data.iloc[:,0])
y_train = np.array(train_data.iloc[:,1])
X_test = np.array(test_data.iloc[:,0])

print(y_train)
LE2 = LabelEncoder()
y_train = LE2.fit_transform(y_train)
print(y_train)

print(f'X_train : {X_train.shape}')
print(f'y_train : {y_train.shape}')
print(f'X_test  : {X_test.shape}')

['Negative' 'Negative' 'Positive' ... 'Positive' 'Negative' 'Positive']
[0 0 2 ... 2 0 2]
X_train : (2543,)
y_train : (2543,)
X_test  : (449,)
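
The printed arrays show `'Negative'` encoded as 0 and `'Positive'` as 2. `LabelEncoder` assigns codes in sorted class order, which places `'Neutral'` at 1. A minimal sketch on a hypothetical four-label list (the names `encoder`, `codes`, and `mapping` are illustrative):

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
codes = encoder.fit_transform(["Negative", "Negative", "Positive", "Neutral"])

# classes_ holds the labels in sorted order; the position is the integer code
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)         # {'Negative': 0, 'Neutral': 1, 'Positive': 2}
print(codes.tolist())  # [0, 0, 2, 1]
```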


### 2-2 Clean data

In [3]:
emoji_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002702-\U000027B0"  # dingbats
        u"\U000024C2-\U0001F251"
        u"\U0001f926-\U0001f937"
        u"\U00010000-\U0010ffff"
        u"\u2640-\u2642"
        u"\u2600-\u2B55"
        u"\u200d"  # zero-width joiner
        u"\u23cf"
        u"\u23e9"
        u"\u231a"
        u"\ufe0f"  # variation selector-16
        u"\u3030"
        "]+", flags=re.UNICODE)

vfunc = np.vectorize(lambda s: emoji_pattern.sub(r'', s))
X_test = vfunc(X_test)
X_train = vfunc(X_train)
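
To illustrate what this cleaning step does, here is a small sketch using a deliberately reduced pattern (`demo_pattern`, an assumption for illustration, covers only a few of the ranges above). It strips emoji from a sample comment and leaves the zero-width non-joiner (U+200C), which Persian spelling relies on, untouched.

```python
import re

# Reduced, illustrative pattern: emoticons, pictographs, ZWJ, variation selector
demo_pattern = re.compile("[\U0001F600-\U0001F64F\U0001F300-\U0001F5FF\u200d\ufe0f]+")

sample = "خیلی خوب بود 😀👍"
cleaned = demo_pattern.sub("", sample)
print(repr(cleaned))

# U+200C (zero-width non-joiner) is not in the pattern, so it survives
word = "می\u200cروم"
assert demo_pattern.sub("", word) == word
```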

In [4]:
def read_lines(path):
    # Read a UTF-8 file as a list of non-empty lines; an empty entry would make
    # str.replace('', ' ') insert a space between every character
    with open(path, mode='r', encoding='utf-8') as f:
        return [line for line in f.read().split('\n') if line]

stop_words = set(read_lines('nlp_files/stop_words-fa.txt')) # a set gives fast membership tests
stop_puncs = read_lines('nlp_files/stop_puncs-fa.txt')
stop_chars = read_lines('nlp_files/stop_chars-fa.txt')

def remove_stops(string):
    for char in stop_chars:
        string = string.replace(char, '') # remove stop-chars
        
    for punc in stop_puncs:
        string = string.replace(punc, ' ') # replace stop-punctuations with a space
        
    words = [word.strip() for word in string.split(' ')] # split the string into trimmed words
    words = [word for word in words if word and word not in stop_words] # drop empty words and stop-words
    
    return ' '.join(words)

vfunc = np.vectorize(remove_stops)
X_test = vfunc(X_test)
X_train = vfunc(X_train)
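
One detail worth checking when the stop lists are loaded with `split('\n')`: a file that ends with a newline yields a trailing empty string, and `str.replace('', ' ')` inserts a space between every character. The sketch below uses a hypothetical in-memory string (`raw`) in place of the real files to show the effect and a simple guard.

```python
# Hypothetical file content that ends with a newline
raw = "!\n?\n"

puncs = raw.split("\n")
print(puncs)  # ['!', '?', '']

# Replacing the empty string pads every character with a space
broken = "abc".replace("", " ")
print(repr(broken))  # ' a b c '

# Dropping empty entries avoids the problem
safe_puncs = [p for p in puncs if p]
print(safe_puncs)  # ['!', '?']
```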

### 2-3 Vectorization with TF-IDF

In [5]:
vectorizer = TfidfVectorizer()
vectorizer.fit(X_train)

X_train_vec = vectorizer.transform(X_train).toarray()
X_test_vec = vectorizer.transform(X_test).toarray()

print(f'X_train_vec : {X_train_vec.shape}')
print(f'X_test_vec  : {X_test_vec.shape}')

X_train_vec : (2543, 7319)
X_test_vec  : (449, 7319)
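
Both matrices have 7319 columns because the vocabulary is learned from the training comments only; words that appear only in the test comments are ignored at transform time. A minimal sketch on hypothetical English toy sentences (`toy_vectorizer` and `test_row` are illustrative names):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

toy_vectorizer = TfidfVectorizer()
toy_vectorizer.fit(["good place", "bad place"])  # vocabulary comes from training text only

vocabulary = sorted(toy_vectorizer.vocabulary_)
print(vocabulary)  # ['bad', 'good', 'place']

# "unknown" was never seen during fit, so it contributes no column
test_row = toy_vectorizer.transform(["good unknown"]).toarray()
print(test_row.shape)  # (1, 3)
print(test_row)
```

Only the `good` column is non-zero in `test_row`, and the row keeps the training width of three columns.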


### 2-4 Classification with SVM(rbf)

In [6]:
clf_svm = SVC(kernel='rbf')
clf_svm.fit(X_train_vec, y_train)
y_pred = clf_svm.predict(X_test_vec) # encoded labels (0, 1, 2)

y_pred_categorized = LE2.inverse_transform(y_pred) # back to the original label strings
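
A side note on memory: the dense training matrix holds 2543 × 7319 values, roughly 149 MB as float64. `SVC` also accepts the SciPy sparse matrix that `TfidfVectorizer` produces, so the `.toarray()` calls are optional. The sketch below, on hypothetical toy data, fits one classifier on sparse features and one on dense features; the two should give matching predictions, though that is not asserted here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Hypothetical toy data with encoded labels (2 = Positive, 0 = Negative)
toy_texts = ["good clean place", "very good food", "bad dirty place", "very bad food"]
toy_labels = [2, 2, 0, 0]

toy_vectorizer = TfidfVectorizer()
sparse_features = toy_vectorizer.fit_transform(toy_texts)  # SciPy sparse matrix

sparse_clf = SVC(kernel="rbf").fit(sparse_features, toy_labels)
dense_clf = SVC(kernel="rbf").fit(sparse_features.toarray(), toy_labels)

query = toy_vectorizer.transform(["good place", "bad food"])
sparse_pred = sparse_clf.predict(query)           # sparse-trained model, sparse input
dense_pred = dense_clf.predict(query.toarray())   # dense-trained model, dense input
print(sparse_pred, dense_pred)
```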

### 2-5 Store predictions

In [7]:
file = 'sentiment_data/test-filled.csv'
if os.path.isfile(file):
    os.remove(file) # mode='x' below refuses to overwrite an existing file
    
test_data['sentiment'] = y_pred_categorized

test_data.to_csv(file, index=False, mode='x')
test_data

Unnamed: 0,comment,sentiment
0,مکانی زیبا و دیدنی با چشم اندازی زیبا برای علا...,Positive
1,روز جمعه داخل نشان زده بود باز است ولی بسته بود,Negative
2,سلام\nمتاسفانه با اپراتور متقلبی در این پمپ بن...,Negative
3,محوطه بزرگ و فضای سبز خوبی دارد,Positive
4,مزه خوبی نداره شیرینی وکیک هاش,Negative
...,...,...
444,مکانی بسیار زیبا و دلنشین که علاوه بر کاخ موزه...,Positive
445,حیف غذا یک خاطر ساز است واشپز کسی که باعشق خا...,Negative
446,افتضاح و گرون تر از بقیه,Negative
447,مسجد کرمانی یکی از مسجدهای قدیمی شهر تربت جام ...,Neutral
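
The `os.remove` call in the cell above is needed because `mode='x'` opens the file for exclusive creation and fails if it already exists. A small sketch of that behavior, using a temporary directory and a hypothetical two-row frame instead of the real predictions:

```python
import os
import tempfile

import pandas as pd

frame = pd.DataFrame({"comment": ["a", "b"], "sentiment": ["Positive", "Negative"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.csv")
    frame.to_csv(path, index=False, mode="x")  # first write creates the file

    try:
        frame.to_csv(path, index=False, mode="x")  # second exclusive write
        second_write_failed = False
    except FileExistsError:
        second_write_failed = True

    frame.to_csv(path, index=False, mode="w")  # mode='w' overwrites instead
    round_trip = pd.read_csv(path)

print(second_write_failed)
print(list(round_trip.columns))
```

If overwriting is acceptable, the default `mode='w'` would make the explicit removal unnecessary.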


### 2-6 Individual prediction function

In [8]:
def predict(comment: str):
    comment = emoji_pattern.sub(r'', comment)
    comment = remove_stops(comment)
    
    vec = vectorizer.transform([comment]).toarray()
    
    res = clf_svm.predict(vec)
    category = LE2.inverse_transform(res)[0]
    return category

print(predict('بسیار جای خوبی بود من راضی بودم همه چیز عالی'))
print(predict('واقعا بد بود خیلی کثیف و نا مرتب'))

Positive
Negative
