# Demo SSentiA_Sp model for CorpusCine dataset

This notebook demonstrates an automated model for Spanish sentiment analysis that requires no labels. Specifically, we test our approach on the CorpusCine dataset, which comprises 3878 Spanish-language movie reviews collected from the MuchoCine website (https://muchocine.net/). Each document is rated with an integer from 1 (unpleasant movie) to 5 (excellent movie). We treat CorpusCine as a binary classification problem: samples rated one or two are considered negative reviews, and samples rated four or five are considered positive reviews. The dataset is publicly available at http://www.lsi.us.es/~fermin/corpusCine.zip.

# Part 1. Load Python Packages

## 1.1 Install the required packages

In [None]:
!pip install -U pip setuptools wheel
!pip install spacy
!python -m spacy download en_core_web_sm
!pip install deep-translator
!python -m pip install urllib3[secure]
!pip install spacytextblob
!pip install vaderSentiment
!python -m spacy download es_core_news_sm

## 1.2 Import packages

In [1]:
import pandas as pd
import numpy as np
import sys
import csv

sys.path.append('../Model')
from SSentiA_Sp import sSentiA_Sp
from urllib.request import urlopen
import json

# Part 2. Loading the data

## 2.1 Load data

In [2]:
DataName = 'MoviesReview'

path = '../Data/' + DataName + '.xlsx'

Data = pd.read_excel(path)
numpy_array = Data.values
X = numpy_array[:,0]
Y = np.asarray(numpy_array[:,1])

## 2.2 Process the labels

Originally, this dataset poses a five-class classification problem. We convert it into a binary problem as follows: first, we discard samples with a label equal to three; then, samples with labels one and two are assigned to the negative class (0), and samples with labels four and five are assigned to the positive class (1).

In [3]:
X = X[Y != 3]
Y = Y[Y != 3]
Y[Y < 3] = 0
Y[Y > 3] = 1
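As a quick sanity check of this mapping, the sketch below applies the same rule to a small hand-made rating array. The ratings and the placeholder review names are illustrative only and are not taken from CorpusCine. It uses `np.where` on the filtered ratings, so the result does not depend on the order of in-place assignments.

```python
import numpy as np

# Hypothetical ratings and placeholder texts, one per review (illustrative only).
ratings = np.array([1, 2, 3, 4, 5, 3, 5])
texts = np.array(["r1", "r2", "r3", "r4", "r5", "r6", "r7"])

# Drop the neutral reviews (rating 3), as in the cell above.
keep = ratings != 3
texts_bin = texts[keep]
ratings_kept = ratings[keep]

# Ratings 1-2 become 0 (negative); ratings 4-5 become 1 (positive).
labels = np.where(ratings_kept > 3, 1, 0)

print(texts_bin.tolist())
print(labels.tolist())
```

In the in-place version used in the cell above, the order of the two assignments matters: swapping them would overwrite the freshly assigned 1s with 0, because 1 is itself smaller than 3.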

# Part 3. Applying lexicon-based approaches to CorpusCine dataset

## 3.1 Textblob lexicon

In [None]:
from LRsentiA_TextBlob import LexicalAnalyzer
r = LexicalAnalyzer('PaperReview')
predictions, pred_confidence_scores = r.classify_binary_dataset(X,Y)

## 3.2 VADER lexicon

In [None]:
from LRsentiA_VADER import LexicalAnalyzer
r = LexicalAnalyzer('PaperReview')
predictions, pred_confidence_scores = r.classify_binary_dataset(X,Y)

## 3.3 Spanish lexicon

In [None]:
from LRSentiA_Spanish import LexicalAnalyzer
r = LexicalAnalyzer('PaperReview')
predictions, pred_confidence_scores = r.classify_binary_dataset(X,Y)

# Part 4. Our hybrid approach

In this work, we employ a self-supervised approach based on the Self-supervised Sentiment Analyzer for classification from unlabeled data (SSentiA). The approach generates pseudo-labels using a lexicon-based method and then refines these labels using a supervised classification scheme.
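The two-stage idea can be illustrated with a minimal, dependency-free sketch. Everything in it is an assumption made for illustration: the toy lexicon, the confidence formula, the 0.5 threshold, and the small naive Bayes classifier are not the components of `SSentiA_Sp`. The few-labels experiment in this notebook uses TF-IDF features with logistic regression or SVM. The sketch only shows the general pattern: lexicon hits produce pseudo-labels with a confidence score, confident samples train a classifier, and the classifier labels the uncertain samples.

```python
import math
from collections import Counter

# Toy polarity lexicon (hypothetical; not the lexicon used by SSentiA_Sp).
POSITIVE = {"excelente", "buena", "genial"}
NEGATIVE = {"mala", "aburrida", "horrible"}

def lexicon_score(text):
    """Return (pseudo_label, confidence) computed from lexicon hits."""
    tokens = text.lower().split()
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    if pos + neg == 0:
        return 1, 0.0  # arbitrary label with zero confidence
    label = 1 if pos >= neg else 0
    return label, abs(pos - neg) / (pos + neg)

def train_naive_bayes(texts, labels):
    """Collect per-class word counts for a multinomial naive Bayes model."""
    counts = {0: Counter(), 1: Counter()}
    docs = Counter(labels)
    for text, label in zip(texts, labels):
        counts[label].update(text.lower().split())
    vocab = set(counts[0]) | set(counts[1])
    return counts, docs, vocab

def predict_naive_bayes(model, text):
    """Predict 0 or 1 using Laplace-smoothed log-probabilities."""
    counts, docs, vocab = model
    total_docs = sum(docs.values())
    best_label, best_logp = None, -math.inf
    for label in (0, 1):
        logp = math.log((docs[label] + 1) / (total_docs + 2))
        denom = sum(counts[label].values()) + len(vocab)
        for tok in text.lower().split():
            if tok in vocab:  # words unseen in training are ignored
                logp += math.log((counts[label][tok] + 1) / denom)
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label

reviews = [
    "una pelicula excelente con guion genial",
    "historia buena y fotografia excelente",
    "pelicula mala y aburrida con ritmo lento",
    "horrible guion y ritmo lento",
    "fotografia cuidada y guion solido",     # no lexicon hit
    "ritmo lento durante toda la pelicula",  # no lexicon hit
]

# Stage 1: lexicon-based pseudo-labels with a confidence score.
scored = [lexicon_score(r) for r in reviews]
threshold = 0.5  # hypothetical cut-off between confident and uncertain samples
confident = [(r, lab) for r, (lab, conf) in zip(reviews, scored) if conf >= threshold]
uncertain = [r for r, (lab, conf) in zip(reviews, scored) if conf < threshold]

# Stage 2: train on the confident pseudo-labels, then label the uncertain samples.
model = train_naive_bayes([r for r, _ in confident], [lab for _, lab in confident])
refined = {r: predict_naive_bayes(model, r) for r in uncertain}
for review, label in refined.items():
    print(label, review)
```

The uncertain reviews contain no lexicon word, so the lexicon alone cannot label them. The classifier assigns them a label from the vocabulary they share with the confidently labeled reviews.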

## 4.1 No labels

We first test our approach under the scenario of having no labels.

In [None]:
from LRsentiA_Sp import LexicalAnalyzer
r = LexicalAnalyzer('PaperReview')
predictions, pred_confidence_scores = r.classify_binary_dataset(X,Y)
df1, df2, df3, df4, df5 = r.distribute_predictions_into_bins(X,Y,predictions, pred_confidence_scores)

In [None]:
s = sSentiA_Sp()
s.apply_SSSentiA(df1, df2, df3, df4, df5)

## 4.2 Few labels

Finally, to evaluate how our hybrid proposal behaves when only limited labeled data is available, we run an additional experiment in which we vary the fraction of labeled samples.
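The effect of the labeled fraction can be sketched without any external library. In the snippet below, the helper names `mix_labels` and `agreement` are hypothetical, and so are the label values. It uses `random.sample` as a stand-in for the `train_test_split` call used in this notebook. It reveals the true label for a random fraction of the samples, keeps the pseudo-label for the rest, and reports how closely the resulting training labels match the ground truth.

```python
import random

def mix_labels(true_labels, pseudo_labels, ratio, seed=0):
    """Use true labels for a random `ratio` of samples and pseudo-labels for the rest."""
    n = len(true_labels)
    rng = random.Random(seed)
    revealed = set(rng.sample(range(n), round(ratio * n)))
    mixed = [true_labels[i] if i in revealed else pseudo_labels[i] for i in range(n)]
    return mixed, revealed

def agreement(a, b):
    """Fraction of positions where two label lists coincide."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

# Hypothetical labels: the pseudo-labels disagree with the truth on 6 of 20 samples.
true_labels = [0, 1] * 10
pseudo_labels = list(true_labels)
for i in (0, 3, 7, 8, 12, 19):
    pseudo_labels[i] = 1 - pseudo_labels[i]

for ratio in (0.0, 0.25, 0.5, 1.0):
    mixed, revealed = mix_labels(true_labels, pseudo_labels, ratio)
    print(ratio, len(revealed), round(agreement(mixed, true_labels), 2))
```

With a ratio of 0 the training labels are exactly the pseudo-labels, and with a ratio of 1 they are exactly the true labels. Intermediate ratios can only reduce the label noise, never increase it.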

In [None]:
from random import randint

from sklearn.model_selection import train_test_split

from supervisedalgorithm import (
    Logistic_Regression_Classifier,
    Performance,
    SVM_Classifier,
    TF_IDF,
)

In [None]:
df = [df1, df2, df3, df4, df5]

# Stack the five confidence bins back into flat arrays.
# Column order follows the indexing used here: 0 = text, 1 = true label,
# 2 = presumably the lexicon-based prediction, 3 = presumably its confidence score.
contents = [data.values for data in df]
X = np.concatenate([content[:, 0] for content in contents])
Sc = np.concatenate([content[:, 3] for content in contents])
Y = np.concatenate([content[:, 1] for content in contents]).astype('int')  # true labels
Z = np.concatenate([content[:, 2] for content in contents]).astype('int')

P = np.arange(0.05,0.9,0.1)
N = len(P)
Cla = ['LR', 'SVM']
Data_1 = []
for cl in Cla:
    if cl == 'LR':
        ml_classifier = Logistic_Regression_Classifier() 
    else:
        ml_classifier = SVM_Classifier()
    mean_F1 = np.zeros(N)
    min_F1 = np.zeros(N)
    max_F1 = np.zeros(N)
    std_F1 = np.zeros(N)
    for j, p in enumerate(P):
        aux_F1 = np.zeros(5)
        for i in range(5):
            X_true, X_, _, Sc_, y_true, y_, _, Z_ = train_test_split(X, Sc, Y, Z, test_size=1-p, random_state=randint(100, 1000))
            X1, Y1, Z1, X2, Y2, Z2, X3, Y3, Z3, X4, Y4, Z4, X5, Y5, Z5 = LexicalAnalyzer('PaperReview').distribute_predictions_into_bins_1(X_, y_, Z_, Sc_)
            
            X1 = np.concatenate((X_true, X1))
            Y1 = np.concatenate((y_true, Y1))
            Z1 = np.concatenate((y_true, Z1))
            
            bin_size_1_2 = len(X1) + len(X2)  # bins 1 and 2 form the training set
            print("---",bin_size_1_2)
            
            
            data = np.concatenate((X1,X2,X3), axis=None)
            label = np.concatenate((Z1,Z2,Y3), axis=None)
            
            tf_idf = TF_IDF()
            data = tf_idf.get_tf_idf(data)
            
            X_train = data[:bin_size_1_2]
            Y_train = label[:bin_size_1_2]
            
            X_test = data[bin_size_1_2:]
            Y_test = label[bin_size_1_2:]
            
            prediction_bin_3 = ml_classifier.predict(X_train, Y_train, X_test)
    
            print("Bin-3 Results")
            performance = Performance()
            _,precision,  recall, f1_score, acc = performance.get_results(Y_test, prediction_bin_3)
            print("Total: ", round(precision,4),  round(recall,4), round(f1_score,4),round(acc,4) )

            data = np.concatenate((X1,X2,X3,X4,X5), axis=None)
            label = np.concatenate((Z1,Z2,prediction_bin_3,Y4,Y5), axis=None)
            
            
            tf_idf = TF_IDF()
            data = tf_idf.get_tf_idf(data)
        
            bin_1_2_3_training_data = len(X1) + len(X2) + len(X3)  
            
            X_train = data[:bin_1_2_3_training_data]
            Y_train = label[:bin_1_2_3_training_data]
            
            X_test = data[bin_1_2_3_training_data:]
            Y_test = label[bin_1_2_3_training_data:]
    
 
            print("Bin-4 Results")
            prediction_bin_4_5 = ml_classifier.predict(X_train, Y_train, X_test)
            _,precision,  recall, f1_score, acc = performance.get_results(Y_test[:len(X4)], prediction_bin_4_5[:len(X4)])
            print("Total: ", round(precision,4),  round(recall,4), round(f1_score,4),round(acc,4) )
            aux_F1[i] = acc  # despite the variable name, accuracy is stored (and plotted below)
            
        mean_F1[j] = np.mean(aux_F1)
        std_F1[j] = np.std(aux_F1)
        min_F1[j] = mean_F1[j] - 2*np.std(aux_F1)
        max_F1[j] = mean_F1[j] + 2*np.std(aux_F1)
        
    Data_ = np.concatenate((P.reshape(N,1), mean_F1.reshape(N,1), max_F1.reshape(N,1), min_F1.reshape(N,1)), axis=1)
    Dat = pd.DataFrame(Data_,columns =None, index=None)
    Data_1.append([P.reshape(N,1), mean_F1.reshape(N,1), std_F1.reshape(N,1)])
#     Name_ = cl + '_Paper.dat'
#     Dat.to_csv(Name_,index=False, header=False,sep = " ")
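
The loop above summarizes each labeled ratio by the mean over five repetitions and a band of two standard deviations around it. The sketch below reproduces that summary with the standard library only. The helper name `summarize_runs` and the scores are made up for illustration. Note that `np.std` defaults to the population standard deviation (`ddof=0`), which corresponds to `statistics.pstdev` rather than `statistics.stdev`.

```python
import statistics

def summarize_runs(scores, width=2.0):
    """Return (mean, std, lower, upper) using a band of `width` population standard deviations."""
    mean = statistics.fmean(scores)
    std = statistics.pstdev(scores)  # population std, matching np.std's default ddof=0
    return mean, std, mean - width * std, mean + width * std

# Hypothetical accuracies from five repetitions at one labeled ratio.
runs = [0.80, 0.82, 0.78, 0.81, 0.79]
mean, std, lower, upper = summarize_runs(runs)
print(f"mean={mean:.4f} std={std:.4f} band=[{lower:.4f}, {upper:.4f}]")
```

With only five repetitions per ratio, the population and sample standard deviations differ noticeably, so it helps to state which one a reported band uses.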

In [None]:
import matplotlib.pyplot as plt
for i in range(2):
    plt.errorbar(Data_1[i][0].flatten(), Data_1[i][1].flatten(), yerr=Data_1[i][2].flatten(),capsize=4)
plt.legend(['LR', 'SVM'])
plt.title('CorpusCine')
plt.ylabel('Accuracy')
plt.xlabel('Ratio of labeled data')
plt.show()