Create a SparkSession. A SparkSession initializes both a SparkContext and a SQLContext, giving access to Spark's RDD-based and DataFrame-based functionality.

In [1]:
import pyspark as ps
spark = (ps.sql.SparkSession.builder 
        .master("local[4]") 
        .appName("sparkSQL exercise") 
        .getOrCreate()
        )
sc = spark.sparkContext

# Sentiment Analysis using Naive Bayes
Here we build a text indexing pipeline for user reviews using the Spark ML library. We first load the reviews into a DataFrame and check its structure, and the columns detected in the JSON content, with .printSchema().

In [2]:
# read json
reviews_df = spark.read.json('data/data/reviews_Musical_Instruments_5.json.gz')

# prints the schema
reviews_df.printSchema()

# some RDD-style functions, such as count(), are still valid on a DataFrame
print("line count: {}".format(reviews_df.count()))

# show the table in a oh-so-nice format
#reviews_df.show()

root
 |-- asin: string (nullable = true)
 |-- helpful: array (nullable = true)
 |    |-- element: long (containsNull = true)
 |-- overall: double (nullable = true)
 |-- reviewText: string (nullable = true)
 |-- reviewTime: string (nullable = true)
 |-- reviewerID: string (nullable = true)
 |-- reviewerName: string (nullable = true)
 |-- summary: string (nullable = true)
 |-- unixReviewTime: long (nullable = true)

line count: 10261


We keep only the reviewText and overall columns, then check the result to verify that we get what we need.

In [3]:
new_df = reviews_df.select(['reviewText', 'overall'])

In [4]:
new_df.printSchema()

root
 |-- reviewText: string (nullable = true)
 |-- overall: double (nullable = true)



In [5]:
new_df.show()

+--------------------+-------+
|          reviewText|overall|
+--------------------+-------+
|Not much to write...|    5.0|
|The product does ...|    5.0|
|The primary job o...|    5.0|
|Nice windscreen p...|    5.0|
|This pop filter i...|    5.0|
|So good that I bo...|    5.0|
|I have used monst...|    5.0|
|I now use this ca...|    3.0|
|Perfect for my Ep...|    5.0|
|Monster makes the...|    5.0|
|Monster makes a w...|    5.0|
|I got it to have ...|    4.0|
|If you are not us...|    3.0|
|I love it, I used...|    5.0|
|I bought this to ...|    5.0|
|I bought this to ...|    2.0|
|This Fender cable...|    4.0|
|wanted it just on...|    5.0|
|I've been using t...|    5.0|
|Fender cords look...|    5.0|
+--------------------+-------+
only showing top 20 rows



### Create a label for classification and a balanced dataset
- This dataset is made of user reviews and ratings:
    - The reviewText column (string) contains the raw text of the review.
    - The overall column (double) contains the rating given by the user, in {1.0, 2.0, 3.0, 4.0, 5.0}
    - We are going to focus on the extreme ratings {1.0, 5.0}, which will be referred to as the negative and positive class respectively

In [6]:
from pyspark.sql import functions as F
df_out = (new_df.groupBy("overall")
                  .agg(F.count("reviewText"))
                  .orderBy("overall", ascending=False)
         )

df_out.show()

+-------+-----------------+
|overall|count(reviewText)|
+-------+-----------------+
|    5.0|             6938|
|    4.0|             2084|
|    3.0|              772|
|    2.0|              250|
|    1.0|              217|
+-------+-----------------+



A rating of 5 has a count of 6938, which is about 32 times the count for a rating of 1 (217). Hence, to have a balanced set, we need 217 samples for each of the two ratings we keep.

### Process to balance data
- We will create two dataframes:
    - One for the reviews having an overall of 1.0 (we will call them the neg class),
    - Another for the reviews having an overall of 5.0 (we will call them the pos class).
    - We will limit the number of reviews in each dataframe to the 217 identified previously, and shuffle before applying the limit
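
The shuffle-then-limit idea can be sketched in plain Python, independently of Spark. This is only an illustration under assumptions: `rows` is a hypothetical list of (text, rating) tuples standing in for the DataFrame, and `sample_class` is a helper name introduced here. The class counts are the ones from the table above.

```python
import random

# Class counts taken from the groupBy output above.
counts = {5.0: 6938, 4.0: 2084, 3.0: 772, 2.0: 250, 1.0: 217}

# The smaller of the two extreme classes sets the per-class sample size.
target = min(counts[1.0], counts[5.0])
ratio = counts[5.0] / counts[1.0]


def sample_class(rows, rating, n, seed=42):
    """Keep rows with the given rating, shuffle them, and return at most n."""
    rng = random.Random(seed)
    selected = [row for row in rows if row[1] == rating]
    # Attach a random key to every row, sort on it, then apply the limit.
    keyed = [(rng.gauss(0.0, 1.0), row) for row in selected]
    keyed.sort(key=lambda pair: pair[0], reverse=True)
    return [row for _, row in keyed[:n]]


# Hypothetical toy data: 10 five-star and 3 one-star reviews.
rows = [("pos review %d" % i, 5.0) for i in range(10)]
rows += [("neg review %d" % i, 1.0) for i in range(3)]

pos_sample = sample_class(rows, 5.0, n=3)
neg_sample = sample_class(rows, 1.0, n=3)
balanced = pos_sample + neg_sample
print(target, round(ratio, 1), len(balanced))
```

Sorting on a random key and then truncating mirrors a random-column, orderBy and limit sequence on a DataFrame; the fixed seed only makes the toy run repeatable.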


In [9]:
df_pos = new_df.filter(new_df.overall == 5)
df_pos.show()

+--------------------+-------+
|          reviewText|overall|
+--------------------+-------+
|Not much to write...|    5.0|
|The product does ...|    5.0|
|The primary job o...|    5.0|
|Nice windscreen p...|    5.0|
|This pop filter i...|    5.0|
|So good that I bo...|    5.0|
|I have used monst...|    5.0|
|Perfect for my Ep...|    5.0|
|Monster makes the...|    5.0|
|Monster makes a w...|    5.0|
|I love it, I used...|    5.0|
|I bought this to ...|    5.0|
|wanted it just on...|    5.0|
|I've been using t...|    5.0|
|Fender cords look...|    5.0|
|The Fender 18 Fee...|    5.0|
|Got this cable to...|    5.0|
|When I was search...|    5.0|
|The ends of the m...|    5.0|
|Just trying to fi...|    5.0|
+--------------------+-------+
only showing top 20 rows



In [10]:
from pyspark.sql.functions import randn
df_temp = df_pos.withColumn('randn', randn(seed=42))
df_temp.show()

+--------------------+-------+--------------------+
|          reviewText|overall|               randn|
+--------------------+-------+--------------------+
|Not much to write...|    5.0|  0.4085363219031828|
|The product does ...|    5.0|  0.8811793095417685|
|The primary job o...|    5.0|  -2.013921870967947|
|Nice windscreen p...|    5.0|  1.6641751435679302|
|This pop filter i...|    5.0| -1.0878600404148453|
|So good that I bo...|    5.0|  1.1432831717404852|
|I have used monst...|    5.0| -0.1668332100007998|
|Perfect for my Ep...|    5.0|  0.9728134830024971|
|Monster makes the...|    5.0| -1.8922625416463206|
|Monster makes a w...|    5.0|  -1.406958171706003|
|I love it, I used...|    5.0|  0.5598396336190025|
|I bought this to ...|    5.0| 0.25154049516128324|
|wanted it just on...|    5.0| -1.0231123693572317|
|I've been using t...|    5.0| -0.5507468559455683|
|Fender cords look...|    5.0|    2.80044811525585|
|The Fender 18 Fee...|    5.0| -0.4441804987714544|
|Got this ca

In [11]:
df_217_pos = df_temp.orderBy(df_temp.randn.desc()).limit(217)
df_217_pos.show()

+--------------------+-------+------------------+
|          reviewText|overall|             randn|
+--------------------+-------+------------------+
|I bought one for ...|    5.0| 3.605892795878719|
|Exactly what I ne...|    5.0|3.5847812457770263|
|I don't have a lo...|    5.0| 3.485952639155781|
|In my house alone...|    5.0|3.4816327449385627|
|Other than that, ...|    5.0|3.2466669637174808|
|After reading num...|    5.0| 3.234441808486619|
|I bought this to ...|    5.0| 3.228942703491076|
|it is lined and h...|    5.0| 3.211540930369533|
|I am self taught,...|    5.0|3.1805432487134673|
|These are guitar ...|    5.0| 3.160484834634083|
|I bought a tusq n...|    5.0|3.1086052096584713|
|Very nice product...|    5.0|3.0697456406156816|
|I have used Fende...|    5.0| 3.067345236139749|
|I've been playing...|    5.0| 3.062071141776958|
|Super sturdy, wel...|    5.0| 3.058307421766882|
|JF-15:again, OMFG...|    5.0| 3.057082631122537|
|It definitely cha...|    5.0|2.9939447504471213|


In [12]:
df_217_pos = df_217_pos.select(['reviewText', 'overall'])

In [13]:
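# df_neg holds the 1.0-star reviews (reviewText, overall); the cell that builds it is not shown above
pos_neg_df = df_217_pos.union(df_neg)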
pos_neg_df.show()

+--------------------+-------+
|          reviewText|overall|
+--------------------+-------+
|I bought one for ...|    5.0|
|Exactly what I ne...|    5.0|
|I don't have a lo...|    5.0|
|In my house alone...|    5.0|
|Other than that, ...|    5.0|
|After reading num...|    5.0|
|I bought this to ...|    5.0|
|it is lined and h...|    5.0|
|I am self taught,...|    5.0|
|These are guitar ...|    5.0|
|I bought a tusq n...|    5.0|
|Very nice product...|    5.0|
|I have used Fende...|    5.0|
|I've been playing...|    5.0|
|Super sturdy, wel...|    5.0|
|JF-15:again, OMFG...|    5.0|
|It definitely cha...|    5.0|
|This is a very us...|    5.0|
|I like mic stands...|    5.0|
|I've bought quite...|    5.0|
+--------------------+-------+
only showing top 20 rows



In [14]:
df_out = (pos_neg_df.groupBy("overall")
                  .agg(F.count("reviewText"))
                  .orderBy("overall", ascending=False)
         )

df_out.show()

+-------+-----------------+
|overall|count(reviewText)|
+-------+-----------------+
|    5.0|              217|
|    1.0|              217|
+-------+-----------------+



The counts show that the dataset is now balanced between the positive and negative classes.

In [15]:
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType, FloatType

def classifier(x):
    if x == 1.0:
        return 0.0
    else:
        return 1.0
    
my_class_udf = udf(classifier, DoubleType())
merge_df = pos_neg_df.withColumn('label', my_class_udf(pos_neg_df['overall']))
merge_df.show()

+--------------------+-------+-----+
|          reviewText|overall|label|
+--------------------+-------+-----+
|I bought one for ...|    5.0|  1.0|
|Exactly what I ne...|    5.0|  1.0|
|I don't have a lo...|    5.0|  1.0|
|In my house alone...|    5.0|  1.0|
|Other than that, ...|    5.0|  1.0|
|After reading num...|    5.0|  1.0|
|I bought this to ...|    5.0|  1.0|
|it is lined and h...|    5.0|  1.0|
|I am self taught,...|    5.0|  1.0|
|These are guitar ...|    5.0|  1.0|
|I bought a tusq n...|    5.0|  1.0|
|Very nice product...|    5.0|  1.0|
|I have used Fende...|    5.0|  1.0|
|I've been playing...|    5.0|  1.0|
|Super sturdy, wel...|    5.0|  1.0|
|JF-15:again, OMFG...|    5.0|  1.0|
|It definitely cha...|    5.0|  1.0|
|This is a very us...|    5.0|  1.0|
|I like mic stands...|    5.0|  1.0|
|I've bought quite...|    5.0|  1.0|
+--------------------+-------+-----+
only showing top 20 rows



In [16]:
merge_df.printSchema()

root
 |-- reviewText: string (nullable = true)
 |-- overall: double (nullable = true)
 |-- label: double (nullable = true)



### Define functions for data pipeline

In [19]:
import pyspark as ps    # for the pyspark suite
from pyspark.sql.functions import udf, col
from pyspark.sql.types import ArrayType, StringType
import string
import unicodedata

import nltk

from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.util import ngrams
from nltk import pos_tag
from nltk import RegexpParser

from pyspark.ml.feature import CountVectorizer
from pyspark.ml.feature import IDF

import sys

def extract_bow_from_raw_text(text_as_string):
    """Extracts bag-of-words from a raw text string.
    Parameters
    ----------
    text_as_string (str): a text document given as a string
    Returns
    -------
    list : the list of the tokens extracted and filtered from the text
    """
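    if text_as_string is None:
        return []

    if len(text_as_string) < 1:
        return []

    # strip accents: decompose to NFKD, then drop the non-ASCII code points
    if sys.version_info[0] < 3:
        nfkd_form = unicodedata.normalize('NFKD', unicode(text_as_string))
    else:
        nfkd_form = unicodedata.normalize('NFKD', str(text_as_string))

    # note: on Python 3, str() of a bytes object yields its repr ("b'...'"),
    # which is where the 'b' and "b'this" tokens in the outputs below come from;
    # nfkd_form.encode('ASCII', 'ignore').decode('ASCII') would avoid this
    text_input = str(nfkd_form.encode('ASCII', 'ignore'))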

    sent_tokens = sent_tokenize(text_input)

    tokens = list(map(word_tokenize, sent_tokens))

    sent_tags = list(map(pos_tag, tokens))

    grammar = r"""
        SENT: {<(J|N).*>}                # chunk each adjective (J*) or noun (N*) token
    """

    cp = RegexpParser(grammar)
    ret_tokens = list()
    stemmer_snowball = SnowballStemmer('english')

    for sent in sent_tags:
        tree = cp.parse(sent)
        for subtree in tree.subtrees():
            if subtree.label() == 'SENT':
                t_tokenlist = [tpos[0].lower() for tpos in subtree.leaves()]
                t_tokens_stemsnowball = list(map(stemmer_snowball.stem, t_tokenlist))
                #t_token = "-".join(t_tokens_stemsnowball)
                #ret_tokens.append(t_token)
                ret_tokens.extend(t_tokens_stemsnowball)
            #if subtree.label() == 'V2V': print(subtree)
    #tokens_lower = [map(string.lower, sent) for sent in tokens]

    return(ret_tokens)


def indexing_pipeline(input_df, **kwargs):
    """Runs a full text indexing pipeline on a collection of texts contained in a DataFrame.
    Parameters
    ----------
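    input_df (DataFrame): a DataFrame that contains the text column (named 'text' unless inputCol is given)
    **kwargs: optional inputCol (default "text"), vocabSize (default 5000), minDF (default 2.0)
    Returns
    -------
    df : the same DataFrame with added 'bow', 'vector_tf' and 'features' (TF-IDF) columns for each document
    wordlist : the list of words in the vocabulary, indexed as in the feature vectors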
    """
    inputCol_ = kwargs.get("inputCol", "text")
    vocabSize_ = kwargs.get("vocabSize", 5000)
    minDF_ = kwargs.get("minDF", 2.0)

    # ugly: add this path on the worker nodes so that they find the bootstrapped nltk_data
    nltk.data.path.append('/home/hadoop/nltk_data')

    extract_bow_from_raw_text("")  # ugly: for instantiating all dependencies of this function
    tokenizer_udf = udf(extract_bow_from_raw_text, ArrayType(StringType()))
    df_tokens = input_df.withColumn("bow", tokenizer_udf(col(inputCol_)))

    cv = CountVectorizer(inputCol="bow", outputCol="vector_tf", vocabSize=vocabSize_, minDF=minDF_)
    cv_model = cv.fit(df_tokens)
    df_features_tf = cv_model.transform(df_tokens)

    idf = IDF(inputCol="vector_tf", outputCol="features")
    idfModel = idf.fit(df_features_tf)
    df_features = idfModel.transform(df_features_tf)

    return(df_features, cv_model.vocabulary)


if (__name__ == "__main__"):
    spark = ps.sql.SparkSession.builder \
                .master("local[4]") \
                .appName("df lecture") \
                .getOrCreate()

    dfMusical = spark.read.json('data/data/reviews_Musical_Instruments_5.json.gz')
    df = dfMusical.select('reviewText', 'overall').withColumnRenamed("reviewText", "text").limit(100)

    df, wordlist = indexing_pipeline(df)

    print("wordlist={}".format(wordlist[0:10]))


wordlist=['cabl', 'b', 'good', 'qualiti', 'guitar', 'price', 'cord', 'hosa', 'great', 'end']
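
To make the two feature stages of indexing_pipeline more concrete, here is a minimal pure-Python sketch of term counting followed by IDF weighting. It is an illustration, not Spark's implementation: `tf_idf` is a hypothetical helper, the toy documents are made up, the smoothed IDF formula log((n + 1) / (df + 1)) is the one given in the Spark MLlib documentation, and the alphabetical tie-breaking in the vocabulary order is an assumption made only for determinism.

```python
import math
from collections import Counter


def tf_idf(docs, min_df=2):
    """Return (vocabulary, features) for a list of token lists."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each token.
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    # Corpus-wide term counts, used to order the vocabulary.
    term_freq = Counter(tok for doc in docs for tok in doc)
    # Keep tokens seen in at least min_df documents; most frequent first.
    # Alphabetical tie-breaking is an assumption made for determinism.
    vocab = sorted((t for t in term_freq if doc_freq[t] >= min_df),
                   key=lambda t: (-term_freq[t], t))
    index = {tok: i for i, tok in enumerate(vocab)}
    # Smoothed IDF: log((n_docs + 1) / (doc_freq + 1)).
    idf = [math.log((n_docs + 1) / (doc_freq[t] + 1)) for t in vocab]
    features = []
    for doc in docs:
        tf = Counter(tok for tok in doc if tok in index)
        # Sparse representation: {term index: tf * idf}.
        features.append({index[t]: c * idf[index[t]] for t, c in tf.items()})
    return vocab, features


# Hypothetical stemmed bags of words.
docs = [["guitar", "good", "guitar"], ["guitar", "cabl"], ["cabl", "bad"]]
vocab, features = tf_idf(docs)
print(vocab)
print(features)
```

With min_df=2, the tokens 'good' and 'bad' are dropped because each appears in a single document, which is the role minDF plays in the CountVectorizer call above.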


Now we apply this function to our previous DataFrame to index every review and obtain the vocabulary.

In [21]:
(index_df, vocab) = indexing_pipeline(merge_df,inputCol='reviewText')
index_df.show()

+--------------------+-------+-----+--------------------+--------------------+--------------------+
|          reviewText|overall|label|                 bow|           vector_tf|            features|
+--------------------+-------+-----+--------------------+--------------------+--------------------+
|I bought one for ...|    5.0|  1.0|[b, tenor, differ...|(1281,[1,7,11,19,...|(1281,[1,7,11,19,...|
|Exactly what I ne...|    5.0|  1.0|[guitar, home, sa...|(1281,[0,21,32,19...|(1281,[0,21,32,19...|
|I don't have a lo...|    5.0|  1.0|[b, lot, experi, ...|(1281,[1,2,4,6,7,...|(1281,[1,2,4,6,7,...|
|In my house alone...|    5.0|  1.0|[hous, bass, play...|(1281,[0,7,11,18,...|(1281,[0,7,11,18,...|
|Other than that, ...|    5.0|  1.0|[10-32, rack, scr...|(1281,[65,91,142,...|(1281,[65,91,142,...|
|After reading num...|    5.0|  1.0|[b, numer, review...|(1281,[0,1,4,6,8,...|(1281,[0,1,4,6,8,...|
|I bought this to ...|    5.0|  1.0|[b, axess, mp1504...|(1281,[1,4,16,53,...|(1281,[1,4,16,53,...|


In [24]:
print("vocabulary: {}".format(vocab[0:10]))

vocabulary: ['guitar', 'b', 'pedal', 'string', 'good', 'great', 'sound', 'other', 'amp', 'time']
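
The 'b' entry in this vocabulary is most likely an artifact rather than a real word: on Python 3, calling str() on a bytes object returns its printed representation, prefix and quotes included. The following standalone sketch, using a hypothetical input string, shows the difference between str() and decode() after the same NFKD normalization step.

```python
import unicodedata

text = "café"  # hypothetical input with an accented character

# Decompose accented characters, then drop everything outside ASCII.
nfkd_form = unicodedata.normalize("NFKD", text)
ascii_bytes = nfkd_form.encode("ASCII", "ignore")

as_repr = str(ascii_bytes)             # repr of the bytes object
as_text = ascii_bytes.decode("ASCII")  # the actual ASCII text

print(as_repr)
print(as_text)
```

Tokenizing the first form splits off the leading b and the quote characters as tokens, while the decoded form contains only the review text.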


### Applying the Naive Bayes algorithm for sentiment analysis
- Building an ML pipeline in Spark relies on instantiating, step by step, classes drawn from the pyspark.ml library. We will now use the class:
    - pyspark.ml.classification.NaiveBayes : implements the Naive Bayes algorithm

In [29]:
from pyspark.ml.classification import NaiveBayes
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

splits = index_df.randomSplit([0.7, 0.3], 24)
merge_df_train = splits[0]
merge_df_test = splits[1]

# create the trainer and set its parameters
nb = NaiveBayes(smoothing=1.0, modelType="multinomial")

# train the model
nf_fit = nb.fit(merge_df_train)

# apply the model on the test set
trans_test_df = nf_fit.transform(merge_df_test)

trans_test_df.show()

+--------------------+-------+-----+--------------------+--------------------+--------------------+--------------------+--------------------+----------+
|          reviewText|overall|label|                 bow|           vector_tf|            features|       rawPrediction|         probability|prediction|
+--------------------+-------+-----+--------------------+--------------------+--------------------+--------------------+--------------------+----------+
|*Incredible quali...|    5.0|  1.0|[b'*incred, quali...|(1281,[12,14,16,3...|(1281,[12,14,16,3...|[-700.22958515685...|[0.99999266163137...|       0.0|
|-  This tuner is ...|    5.0|  1.0|[b, tuner, afford...|(1281,[0,1,5,7,11...|(1281,[0,1,5,7,11...|[-1433.8085621809...|[0.99999999847579...|       0.0|
|A little nicer th...|    5.0|  1.0|[b, littl, nicer,...|(1281,[1,17,27,36...|(1281,[1,17,27,36...|[-229.78352598638...|[8.18921424978878...|       1.0|
|After reading num...|    5.0|  1.0|[b, numer, review...|(1281,[0,1,4,6,8,...|(128

Use the MulticlassClassificationEvaluator to evaluate the accuracy of the classification.

In [30]:
# labelCol defaults to "label"
evaluator = MulticlassClassificationEvaluator(predictionCol="prediction")
evaluator.evaluate(trans_test_df)    # default metric (f1); result not displayed
evaluator.evaluate(trans_test_df, {evaluator.metricName: "accuracy"})

0.7647058823529411

### Interpretation of the NaiveBayes results
- The NaiveBayes model provides an internal matrix model.theta (here nf_fit.theta) that we can convert into a numpy array with model.theta.toArray(). This matrix contains two rows corresponding to the two classes, 0 for negative and 1 for positive, and one column per vocabulary term

- The values inside that matrix are, for each class, the class-conditional probabilities of the terms, which are used to compute the likelihood that a document belongs to the class. In this implementation, the model.theta matrix doesn't hold the probabilities themselves, but their logarithms

- We use this model.theta matrix, combined with the vocabulary obtained above from CountVectorizer, to obtain words that are related to the positive class, and words that are related to the negative class
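
As a rough illustration of what such a matrix of log-probabilities contains, here is a minimal pure-Python sketch of multinomial Naive Bayes with additive smoothing. It is a sketch under assumptions, not Spark's code: `fit_multinomial_nb` and `predict` are hypothetical helpers, the count vectors are made up, and the names pi and theta are used for the log priors and the per-class log term probabilities.

```python
import math


def fit_multinomial_nb(X, y, n_classes=2, smoothing=1.0):
    """Return (pi, theta): log class priors and log P(term | class)."""
    n_docs, n_terms = len(X), len(X[0])
    pi, theta = [], []
    for c in range(n_classes):
        docs_c = [x for x, label in zip(X, y) if label == c]
        # Smoothed log prior of class c.
        pi.append(math.log((len(docs_c) + smoothing)
                           / (n_docs + n_classes * smoothing)))
        # Total weight of each term in class c, then smoothed log-probabilities.
        term_sums = [sum(x[j] for x in docs_c) for j in range(n_terms)]
        denom = sum(term_sums) + n_terms * smoothing
        theta.append([math.log((s + smoothing) / denom) for s in term_sums])
    return pi, theta


def predict(x, pi, theta):
    """Pick the class with the highest log prior + sum(x_j * theta_cj)."""
    scores = [p + sum(xj * t for xj, t in zip(x, row))
              for p, row in zip(pi, theta)]
    return scores.index(max(scores))


# Hypothetical term-count vectors over the vocabulary ['bad', 'great', 'guitar'].
X = [[2, 0, 1], [1, 0, 1], [0, 2, 1], [0, 1, 2]]
y = [0, 0, 1, 1]  # 0 = negative, 1 = positive

pi, theta = fit_multinomial_nb(X, y)
print(predict([1, 0, 0], pi, theta), predict([0, 1, 0], pi, theta))
```

Each row of theta exponentiates to a probability distribution over the vocabulary, so comparing the two rows column by column shows which class gives a term more weight.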

In [31]:
idx = nf_fit.theta.toArray()
idx.shape

(2, 1281)

In [32]:
idx

array([[-5.00133423, -5.79306503, -5.26479709, ..., -7.8855047 ,
        -7.8855047 , -7.8855047 ],
       [-4.70935861, -5.74273032, -4.33579478, ..., -7.63164235,
        -7.63164235, -9.41951657]])

In [33]:
import numpy as np
vocab_arr = np.array(vocab)
vocab_arr[idx[1,:] > idx[0,:]]    #vocab for positive class

array(['guitar', 'b', 'pedal', 'string', 'good', 'great', 'sound',
       'price', 'tuner', 'more', 'differ', 'fender', 'tone', 'mic',
       'year', 'pick', 'capo', 'better', 'case', 'nice', 'littl', 'one',
       'lot', 'set', 'mani', 'last', 'small', 'box', 'unit', 'use',
       'light', 'bit', 'fret', 'effect', 'music', 'bass', 'end', 'day',
       'batteri', 'sure', 'easi', 'high', 'perfect', 'bag', 'line',
       'power', 'peopl', 'best', 'electr', 'screw', 'right', 'gig',
       'boss', 'size', 'most', 'adjust', 'place', 'board', 'sever',
       'heavi', 'level', 'type', 'joyo', 'microphon', "b'this", 'strat',
       'hand', 'expens', 'cool', 'fact', 'solid', 'onli', 'clean',
       'weight', 'color', 'origin', 'player', 'finger', 'full', 'long',
       'stock', 'practic', 'stuff', 'overdr', 'audio', 'ukulel', 'anyon',
       'gain', 'behring', 'big', 'interfac', 'accur', 'hour', 'erni',
       'home', 'everyth', 'side', 'bright', 'happi', 'simpl', 'store',
       'featur', 'str

Many of these words carry a positive connotation (good, great, nice, perfect, best, happi), which is consistent with our assumption that a 5-star review is positive.

In [34]:
vocab_arr[idx[1,:] < idx[0,:]]   #vocab for neg class

array(['other', 'amp', 'time', 'product', 'thing', 'cabl', 'qualiti',
       'strap', 'stand', 'problem', 'way', 'review', 'few', 'i\\',
       'cheap', 'same', 'acoust', 'someth', 'new', 'distort', 'money',
       'issu', 'note', 'volum', 'neck', 'much', 'amazon', 'instrument',
       'i', 'record', 'piec', 'first', 'work', 'low', 'tune', 'metal',
       'tube', 'week', 'bad', 'brand', 'ball', 'it\\', 'fine', 'plug',
       'noth', 'cord', 'pickup', 'item', 'nois', 'plastic', 'less',
       'standard', 'model', 'part', 'packag', 'hard', 'kind', 'planet',
       'bottom', 'seller', 'star', 'month', 'mine', 'old', 'anyth',
       'idea', 'next', "b'the", 'bridg', 'buck', 'top', 'real', 'play',
       'purchas', 'bodi', 'design', 'feel', 'second', 'wrong', 'button',
       'own', 'point', 'junk', 'back', 'wave', 'not', 'defect', 'style',
       'input', 'extra', 'chord', 'head', 'dollar', 'useless', 'dunlop',
       'jack', 'minut', 'posit', 'disappoint', 'e', 'signal', 'abl',
       'co

Many of these words carry a negative connotation (problem, cheap, bad, junk, defect, useless, disappoint), which is consistent with our assumption that a 1-star review is negative. Other words in this list include nasti, cheaply, break, etc.

We can rank these words by decreasing probability under each class.

In [36]:
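thetaarray = nf_fit.theta.toArray().T    # shape (vocab_size, 2): column 0 = neg, column 1 = pos

vocab_size = len(vocab)

# 'S10' stores each word as at most 10 bytes, so longer words are truncated
dtype = [('label', 'S10'), ('neg', float), ('pos', float)]
# neg score: P(word | neg) * (1 - P(word | pos)); pos score is the symmetric expression
prob_values = [ (vocab[i],
                 np.exp(thetaarray[i,0])*(1-np.exp(thetaarray[i,1])),
                 (1-np.exp(thetaarray[i,0]))*np.exp(thetaarray[i,1]))
               for i in range(vocab_size) ]

a = np.array(prob_values, dtype=dtype)       # create a structured array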

In [37]:
prob_values[0]

('guitar', 0.006668331316865159, 0.008949923326246001)
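
As a sanity check, the two scores for 'guitar' can be recomputed by hand from the first column of the theta array printed earlier. The sketch below only uses the standard library; the two log values are copied from that output, so the result agrees with the tuple above up to the rounding of the printed values.

```python
import math

# Log-probabilities for 'guitar', read from the theta array printed above.
log_p_neg = -5.00133423  # row 0: negative class
log_p_pos = -4.70935861  # row 1: positive class

p_neg = math.exp(log_p_neg)
p_pos = math.exp(log_p_pos)

# Same expressions as in prob_values.
neg_score = p_neg * (1 - p_pos)
pos_score = (1 - p_neg) * p_pos

print(round(neg_score, 6), round(pos_score, 6))
```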

In [38]:
np.sort(a, order='pos')[::-1][0:20]

array([(b'pedal', 5.10275341e-03, 0.01302378),
       (b'string', 4.18879502e-03, 0.01078995),
       (b'guitar', 6.66833132e-03, 0.00894992),
       (b'great', 1.64973165e-03, 0.00762815),
       (b'amp', 7.55513350e-03, 0.00623695),
       (b'case', 2.40428010e-03, 0.00614058),
       (b'sound', 5.01402974e-03, 0.00588801),
       (b'pick', 2.61484237e-03, 0.00564948),
       (b'price', 2.28746169e-03, 0.00538022),
       (b'joyo', 5.63118321e-04, 0.00494086),
       (b'tone', 2.20021242e-03, 0.00481678),
       (b'tuner', 3.53516182e-03, 0.00476615),
       (b'other', 5.04294263e-03, 0.00476033),
       (b'good', 3.49316226e-03, 0.00474767),
       (b'differ', 1.56750224e-03, 0.0045025 ),
       (b'stand', 5.74731133e-03, 0.00434902),
       (b'more', 3.37425667e-03, 0.00418864),
       (b'easi', 7.33212339e-04, 0.00398313),
       (b'microphon', 9.24316288e-04, 0.0039802 ),
       (b'overdr', 6.26869052e-05, 0.0039709 )],
      dtype=[('label', 'S10'), ('neg', '<f8'), ('pos', '<f8'

In [39]:
np.sort(a, order='neg')[::-1][0:20]

array([(b'amp', 0.00755513, 6.23695382e-03),
       (b'guitar', 0.00666833, 8.94992333e-03),
       (b'strap', 0.00611594, 3.86792398e-03),
       (b'stand', 0.00574731, 4.34901810e-03),
       (b'problem', 0.0053888 , 1.74449215e-03),
       (b'product', 0.00520243, 1.96740529e-03),
       (b'pedal', 0.00510275, 1.30237764e-02),
       (b'other', 0.00504294, 4.76033418e-03),
       (b'sound', 0.00501403, 5.88800899e-03),
       (b'cabl', 0.00460681, 3.66733539e-03),
       (b'thing', 0.00441564, 2.48073296e-03),
       (b'issu', 0.00439096, 1.59803926e-03),
       (b'cheap', 0.00426314, 1.23743653e-03),
       (b'string', 0.0041888 , 1.07899492e-02),
       (b'i\\', 0.00399233, 3.69778244e-03),
       (b'few', 0.00398753, 1.95065656e-03),
       (b'bad', 0.00390939, 5.30020555e-04),
       (b'time', 0.00387453, 3.20540758e-03),
       (b'way', 0.00380267, 1.50586247e-03),
       (b'seller', 0.00372379, 8.08231106e-05)],
      dtype=[('label', 'S10'), ('neg', '<f8'), ('pos', '<f8')])

We can observe some words that clearly carry a positive or negative feeling, but they are mixed with other words that only relate to the products. This is because we ran the analysis on a dataset restricted to musical instruments. The positive/negative concept is therefore biased by the terms related to the products people generally evaluate positively (or negatively).