# Sentiment and product analysis in Instagram comments

Now that we have collected all the information from OFFCORSS and its competitors and organized it in CSV files, it's time to run the sentiment analysis (using the model that performed best among those we tried) and the product identifier (based on `OFFCORSS_products.csv`).

## Importing libraries

In [1]:
import pandas as pd
import numpy as np
import os
import pickle
from sentiment_analysis import *
from data_exploring_functions import *

[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## Reading the files
We are going to use the `instagram_responses.csv` file that we generated earlier. From it, we create a `processed_comment` column with the cleaned text of each comment that customers have left.

In [2]:
comments = pd.read_csv(os.path.join('data', 'instagram_responses.csv')).drop(columns="Unnamed: 0")
comments.head()

Unnamed: 0,rootPost_id,parentPost_id,response_id,time,text,username,likes,brand_username
0,1947952244502703605,1.801764e+16,1.795374e+16,2019-01-03 04:59:56,Hola @aleja.vasquez911 la camiseta de laya la ...,offcorss,3,offcorss
1,1948284380053077836,1.790035e+16,1.802043e+16,2019-01-03 05:01:31,Hola @delgadoa953 lo encuentras por $69.900. D...,offcorss,1,offcorss
2,1948805488864773596,1.796902e+16,1.806395e+16,2019-05-05 17:34:08,@offcorss no es extraña para mí su respuesta.....,mapu0831,0,offcorss
3,1948805488864773596,1.799455e+16,1.784674e+16,2019-01-20 01:44:42,Hola @erika_vanessa_rios te enviaremos un enla...,offcorss,1,offcorss
4,1948805488864773596,1.78886e+16,1.799713e+16,2019-01-20 01:45:05,Hola @jenny_paez te enviaremos un enlace para ...,offcorss,1,offcorss


In [3]:
comments['processed_comment'] = comments.text.str.lower()
comments = stopwords_correction(comments, 'processed_comment')

working on it!


In [4]:
comments.head()

Unnamed: 0,rootPost_id,parentPost_id,response_id,time,text,username,likes,brand_username,processed_comment
0,1947952244502703605,1.801764e+16,1.795374e+16,2019-01-03 04:59:56,Hola @aleja.vasquez911 la camiseta de laya la ...,offcorss,3,offcorss,hola @ aleja.vasquez911 camiseta laya encuentr...
1,1948284380053077836,1.790035e+16,1.802043e+16,2019-01-03 05:01:31,Hola @delgadoa953 lo encuentras por $69.900. D...,offcorss,1,offcorss,hola @ delgadoa953 encuentras $ 69.900 . dispo...
2,1948805488864773596,1.796902e+16,1.806395e+16,2019-05-05 17:34:08,@offcorss no es extraña para mí su respuesta.....,mapu0831,0,offcorss,@ offcorss extraña respuesta ... deberían sabe...
3,1948805488864773596,1.799455e+16,1.784674e+16,2019-01-20 01:44:42,Hola @erika_vanessa_rios te enviaremos un enla...,offcorss,1,offcorss,hola @ erika_vanessa_rios enviaremos enlace pu...
4,1948805488864773596,1.78886e+16,1.799713e+16,2019-01-20 01:45:05,Hola @jenny_paez te enviaremos un enlace para ...,offcorss,1,offcorss,hola @ jenny_paez enviaremos enlace puedas ver...
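The implementation of `stopwords_correction` is not shown in this notebook. As a rough, minimal sketch of the kind of cleaning it appears to perform (lowercasing, tokenizing, and dropping stop words), the example below uses a tiny hand-written stop-word set and a hypothetical helper named `remove_stopwords`; both are assumptions for illustration, not the project's actual code.

```python
import re

# Hypothetical, tiny stop-word list for illustration only; the notebook's
# helper presumably relies on the NLTK Spanish stop words downloaded above.
SPANISH_STOPWORDS = {"la", "de", "lo", "por", "te", "un", "para", "que", "es", "no"}

def remove_stopwords(text, stopwords=SPANISH_STOPWORDS):
    """Lowercase the text, split it into word/punctuation tokens, drop stop words."""
    # Words (letters, digits, underscores, dots) or single punctuation marks.
    tokens = re.findall(r"[\w.]+|[^\w\s]", text.lower())
    return " ".join(tok for tok in tokens if tok not in stopwords)

cleaned = remove_stopwords("Hola @jenny_paez te enviaremos un enlace")
print(cleaned)
```

A real stop-word list would be much longer, so the exact output for full comments may differ from the `processed_comment` column shown above.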


## Importing the model and needed CSV
We are going to import the trained model (`classifier.pickle`) and the related CSV files needed for the lexicon analysis and the product recognition.

In [7]:
products_file  = os.path.join('model', 'offcorss_products.csv')
products_words = pd.read_csv(products_file).drop(columns = 'Unnamed: 0')

known_words = pd.read_csv(os.path.join("model", "OFFCORSS_lexicon.csv"))

# Load the trained classifier; the context manager closes the file for us
with open(os.path.join("model", "classifier.pickle"), 'rb') as f:
    logit_fit = pickle.load(f)

## Analyzing the sentiment and recognizing the products
First, we extract from each comment the features that the trained model needs. In this part of the code we also extract a product characterization and a tokenized version of the comment (which will be used later in the dashboard to build the word clouds).
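The helper `get_comment_features` is defined in `sentiment_analysis` and is not reproduced here. Judging only by the feature names used in this notebook (`mean`, `sum`, `size`, `min`, `max`), a lexicon-based extractor could look roughly like the sketch below. The function `lexicon_features`, the toy lexicon, and the choice to score unknown tokens as 0 are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical lexicon: word -> polarity score in [-1, 1].
toy_lexicon = {"precioso": 1.0, "hola": 0.0, "malo": -1.0}

def lexicon_features(tokens, lexicon):
    """Return (mean, sum, size, min, max) of the per-token polarity scores.

    Tokens missing from the lexicon are scored 0.0 (an assumption).
    """
    scores = np.array([lexicon.get(tok, 0.0) for tok in tokens], dtype=float)
    if scores.size == 0:
        # Empty comment: return neutral features instead of NaN.
        return 0.0, 0.0, 0, 0.0, 0.0
    return scores.mean(), scores.sum(), scores.size, scores.min(), scores.max()

features = lexicon_features(["monito", "precioso", "!"], toy_lexicon)
print(features)
```

The real helper may count tokens differently (for example by including n-grams), so feature values from this sketch are not expected to match the notebook's output.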

In [8]:
data_responses = comments.reset_index(drop = True)

vt_comment = []
vt_tokens  = []
vt_products= np.empty((0,len(products_words)), int)

vt_mean    = []
vt_sum     = []
vt_size    = []
vt_min     = []
vt_max     = []
vt_sentim  = []


products_stm = stem_tokens(products_words.products)
nm_lenData   = len(data_responses)

for idx in range(0,nm_lenData):    
    
    # Get comment
    cur_comm = data_responses.loc[idx, 'text']
    
    # Get tokenized comment
    tokens, words = get_comment_tokens(cur_comm)
    
    # get products array
    # np.isin replaces np.in1d, which is deprecated in recent NumPy releases
    product_inComments = np.isin(products_stm, stem_tokens(words), assume_unique=True)
    product_inComments = product_inComments.astype(int) 
    product_inComments = product_inComments.reshape(1, len(products_words))

      
    # get classifier features    
    features_comment = get_comment_features(cur_comm,known_words)
    
    nm_mean = features_comment[0]
    nm_sum  = features_comment[1]
    nm_size = features_comment[2]
    nm_min  = features_comment[3]
    nm_max  = features_comment[4]
    
    # append classifier features
    vt_comment.append(cur_comm)
    vt_mean.append(nm_mean)
    vt_sum.append(nm_sum)
    vt_size.append(nm_size)
    vt_min.append(nm_min)
    vt_max.append(nm_max)
    
    # append comments tokens and products
    vt_tokens.append(tokens)
    vt_products = np.append(vt_products, product_inComments, axis=0)
    
    
    
features_df = pd.DataFrame({'comment':vt_comment,'mean':vt_mean,'sum':vt_sum,'size':vt_size,
                            'min':vt_min,'max':vt_max})

comments_df = pd.DataFrame({'comment':vt_comment,'tokens':vt_tokens})
products_df = pd.DataFrame(vt_products, columns = products_words.products)
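To make the product-recognition step easier to follow, here is a small self-contained sketch of the same idea on toy data: each comment becomes a row of 0/1 flags, one per known product. The stemmed strings below are invented stand-ins for the output of `stem_tokens`, and collecting rows in a list before a single `np.vstack` is a suggested alternative to calling `np.append` inside the loop, not what the cell above does.

```python
import numpy as np

# Toy stand-ins for the stemmed product list and two stemmed comments.
products = np.array(["camiset", "vestid", "zapat"])
comments_tokens = [["hol", "vestid", "lind"], ["preci", "zapat", "camiset"]]

# One 0/1 row per comment: 1 where the product stem appears in the comment.
# Rows are collected in a list and stacked once at the end, which avoids
# re-allocating the whole array on every iteration.
rows = [np.isin(products, tokens).astype(int) for tokens in comments_tokens]
product_matrix = np.vstack(rows)
print(product_matrix)
```

With tens of thousands of comments, stacking once tends to be noticeably faster than growing the array row by row.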

Now that we have the features, we compute a score for each comment. Based on that score, we classify the comment as `good` (score of at least 2/3), `bad` (score of at most 1/3), or `neutral` (otherwise).

In [9]:
vt_scoreLim = [1/3,2/3]

comment_score = features_df.copy()
comment_score['Intercept'] = 1.0

comment_score["score"] = logit_fit.predict(comment_score[['Intercept',"mean", "sum", "max"]])

vt_good     = (comment_score["score"] >= vt_scoreLim[1])
vt_bad      = (comment_score["score"] <= vt_scoreLim[0])
vt_neutral  = (comment_score["score"] > vt_scoreLim[0]) & (comment_score["score"] < vt_scoreLim[1])

comment_score["class"] = 'na'
# Assign with .loc to avoid chained indexing (and the SettingWithCopyWarning it triggers)
comment_score.loc[vt_good, "class"]    = 'good'
comment_score.loc[vt_bad, "class"]     = 'bad'
comment_score.loc[vt_neutral, "class"] = 'neutral'

comment_score


Unnamed: 0,comment,mean,sum,size,min,max,Intercept,score,class
0,Hola @aleja.vasquez911 la camiseta de laya la ...,0.025794,1.625,63,0.000,0.750,1.0,0.434061,neutral
1,Hola @delgadoa953 lo encuentras por $69.900. D...,0.028846,1.125,39,0.000,0.750,1.0,0.411969,neutral
2,@offcorss no es extraña para mí su respuesta.....,-0.019841,-2.500,126,-1.000,0.500,1.0,0.122020,bad
3,Hola @erika_vanessa_rios te enviaremos un enla...,0.066667,3.000,45,0.000,0.750,1.0,0.707983,good
4,Hola @jenny_paez te enviaremos un enlace para ...,0.064394,2.125,33,0.000,0.750,1.0,0.642856,neutral
...,...,...,...,...,...,...,...,...,...
35262,Hola,0.000000,0.000,1,0.000,0.000,1.0,0.448265,neutral
35263,Hasta que talla viene ese camibuso?,0.031250,0.375,12,0.000,0.250,1.0,0.534695,neutral
35264,Precio,-0.125000,-0.125,1,-0.125,-0.125,1.0,0.085874,bad
35265,Monito precioso!!!!!! 😍,0.083333,2.000,24,0.000,1.000,1.0,0.637073,neutral
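The thresholding rule can also be written in a single step. The sketch below is one possible alternative using `np.select`, applied to four of the scores from the table above; the variable names `scores`, `low`, and `high` are introduced here only for illustration.

```python
import numpy as np
import pandas as pd

# Scores copied from the first rows of the output above.
scores = pd.Series([0.434061, 0.122020, 0.707983, 0.642856])
low, high = 1 / 3, 2 / 3

# Conditions are checked in order; anything matching neither is 'neutral'.
labels = np.select(
    [scores >= high, scores <= low],
    ["good", "bad"],
    default="neutral",
)
print(labels)
```

Because every score falls into exactly one of the three ranges, no placeholder value such as `'na'` is needed with this approach.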


## Generating the CSV file

Now that we have a sentiment score, a classification based on that score, the tokenized text, and the product flags, we are going to merge all the information into one dataframe and export it as a CSV file.

In [10]:
ig_responses = pd.merge(comments, comment_score[['score','class']], left_index = True, right_index=True)
ig_responses = pd.merge(ig_responses, comments_df[['tokens']], left_index = True, right_index=True)
ig_responses = pd.merge(ig_responses, products_df, left_index = True, right_index=True)

ig_responses

Unnamed: 0,rootPost_id,parentPost_id,response_id,time,text,username,likes,brand_username,processed_comment,score,...,tendido,tenis,termo,toalla,tobillera,top,tutu,vestido,visera,zapato
0,1947952244502703605,1.801764e+16,1.795374e+16,2019-01-03 04:59:56,Hola @aleja.vasquez911 la camiseta de laya la ...,offcorss,3,offcorss,hola @ aleja.vasquez911 camiseta laya encuentr...,0.434061,...,0,0,0,0,0,0,0,0,0,0
1,1948284380053077836,1.790035e+16,1.802043e+16,2019-01-03 05:01:31,Hola @delgadoa953 lo encuentras por $69.900. D...,offcorss,1,offcorss,hola @ delgadoa953 encuentras $ 69.900 . dispo...,0.411969,...,0,0,0,0,0,0,0,0,0,0
2,1948805488864773596,1.796902e+16,1.806395e+16,2019-05-05 17:34:08,@offcorss no es extraña para mí su respuesta.....,mapu0831,0,offcorss,@ offcorss extraña respuesta ... deberían sabe...,0.122020,...,0,0,0,0,0,0,0,1,0,0
3,1948805488864773596,1.799455e+16,1.784674e+16,2019-01-20 01:44:42,Hola @erika_vanessa_rios te enviaremos un enla...,offcorss,1,offcorss,hola @ erika_vanessa_rios enviaremos enlace pu...,0.707983,...,0,0,0,0,0,0,0,0,0,0
4,1948805488864773596,1.788860e+16,1.799713e+16,2019-01-20 01:45:05,Hola @jenny_paez te enviaremos un enlace para ...,offcorss,1,offcorss,hola @ jenny_paez enviaremos enlace puedas ver...,0.642856,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
35262,2434697119937799889,1.785538e+16,1.785538e+16,2020-11-05 12:38:02,Hola,patricia.garzon.520,0,politokids,hola,0.448265,...,0,0,0,0,0,0,0,0,0,0
35263,2434697119937799889,1.784572e+16,1.784572e+16,2020-11-04 10:43:20,Hasta que talla viene ese camibuso?,andrea.0884,1,politokids,talla viene camibuso ?,0.534695,...,0,0,0,0,0,0,0,0,0,0
35264,2435118154851812205,1.790226e+16,1.790226e+16,2020-11-05 01:54:37,Precio,6859.andrea,0,politokids,precio,0.085874,...,0,0,0,0,0,0,0,0,0,0
35265,2435118154851812205,1.817026e+16,1.817026e+16,2020-11-04 18:30:23,Monito precioso!!!!!! 😍,lilimesab,1,politokids,monito precioso ! ! ! ! ! ! 😍,0.637073,...,0,0,0,0,0,0,0,0,0,0
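Since all the intermediate dataframes share the same default integer index, the three index-based merges could also be expressed as one column-wise concatenation. The following is a minimal sketch on toy frames (the `*_toy` names and values are made up) with a quick check that no rows were gained or lost.

```python
import pandas as pd

# Toy frames sharing the same default RangeIndex, like the real ones.
comments_toy = pd.DataFrame({"text": ["Hola", "Precio"]})
scores_toy   = pd.DataFrame({"score": [0.45, 0.09], "class": ["neutral", "bad"]})
products_toy = pd.DataFrame({"vestido": [0, 0], "zapato": [0, 1]})

# Align on the index and place the columns side by side.
merged = pd.concat([comments_toy, scores_toy, products_toy], axis=1)

# Sanity check: the row count must match the original comments.
assert len(merged) == len(comments_toy)
print(merged)
```

Note that `pd.concat` keeps the union of the indexes by default, while `pd.merge` on the index keeps their intersection; the two agree only when the indexes are identical, as they are here.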


In [11]:
ig_responses.to_csv(os.path.join("data", "instagram_responses_analized.csv"), encoding="utf-8-sig")