# A&A Project: Sentiment Analysis of Apple M1 on Twitter

Author: Hongshen Lee

Date:  2020/11/21

## Step 2: Generate a Sentiment Model

This part builds the model used for sentiment analysis. The model takes the text of one review or one tweet as input and outputs a sentiment label, either positive or negative.

To train it, I use the [Amazon Reviews](https://www.kaggle.com/bittlingmayer/amazonreviews) data set from Kaggle, which contains 4,000,000 records. The model itself is a Naive Bayes classifier.

In [38]:
import bz2
import re
import os

import nltk
from nltk.corpus import stopwords
from nltk.classify import SklearnClassifier
from string import punctuation

import pickle

## Step 2.1: Data Phase

### Step 2.1.1: Read Data

Read the data from the bz2 files and decode it as UTF-8.


In [39]:
data_path = "./data/reviews"

print(os.listdir(data_path))

train_file_path = data_path+ "/train.ft.txt.bz2"
test_file_path = data_path + "/test.ft.txt.bz2"

['test.ft.txt.bz2', 'train.ft.txt.bz2']


In [40]:
def read_data_from_BZ2File(train_file_path,test_file_path):
    # Context managers close the archives once all lines are read
    with bz2.BZ2File(train_file_path) as train_file:
        train_file_lines = train_file.readlines()
    with bz2.BZ2File(test_file_path) as test_file:
        test_file_lines = test_file.readlines()
    
    # Convert from raw binary strings to strings that can be parsed
    train_file_lines  = [x.decode('utf-8') for x in train_file_lines]
    test_file_lines   = [x.decode('utf-8') for x in test_file_lines]
    
    return train_file_lines,test_file_lines
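
Since only a small slice of the data is used later, reading whole archives into memory may be avoidable. The following is a minimal sketch, not the notebook's method: `read_first_lines` is a hypothetical helper, and the demo archive is a throwaway file created only for illustration.

```python
import bz2
import os
import tempfile
from itertools import islice


def read_first_lines(path, limit):
    """Return at most `limit` decoded lines without loading the whole file."""
    # bz2.open in text mode decompresses and decodes lazily, line by line.
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        return list(islice(handle, limit))


# Demonstration on a tiny throwaway archive with `__label__N` prefixed lines.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.ft.txt.bz2")
    with bz2.open(demo_path, "wt", encoding="utf-8") as handle:
        handle.write("__label__2 Great CD: love it\n")
        handle.write("__label__1 Waste of money: broke in a week\n")
        handle.write("__label__2 Works fine\n")
    demo_lines = read_first_lines(demo_path, 2)

print(demo_lines)
```

With this approach the number of lines held in memory is bounded by `limit` rather than by the size of the archive.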

### Step 2.1.2: Clean Data

- Lowercase the text and replace every digit with `0`
- Remove raw URL strings
- Remove stopwords and stand-alone punctuation characters
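
The cleaning steps can be traced on a single line. The sketch below mirrors the logic of `clean_data` for one input, with two assumptions: `DEMO_STOPWORDS` is a tiny stand-in for the NLTK stopword list (so the example runs without downloading the corpus), and `clean_line` is a hypothetical helper name.

```python
import re
from string import punctuation

# Assumption: a small stand-in stopword set instead of the NLTK English list.
DEMO_STOPWORDS = {"this", "is", "a", "of", "at", "the"}

# Same URL pattern as in the notebook's clean_data function.
URL_PATTERN = re.compile(r"(www.|https|http)?:\/\/(\w|\.|\/|\?|\=|\&|\%)*\b")


def clean_line(line, stop_words=DEMO_STOPWORDS):
    """Split one '__label__N text' line into (label, tokens)."""
    raw_label, text = line.rstrip("\n").split(" ", 1)
    label = 0 if raw_label == "__label__1" else 1  # 0 = negative, 1 = positive
    text = text.lower()
    text = re.sub(r"\d", "0", text)   # every digit becomes 0
    text = URL_PATTERN.sub("", text)  # strip raw URLs
    tokens = [w for w in text.split(" ") if w not in stop_words]
    # Substring test against the punctuation string: drops tokens such as "!"
    # and empty strings, but keeps words with attached punctuation.
    tokens = [w for w in tokens if w not in punctuation]
    return label, tokens


label, tokens = clean_line(
    "__label__1 Terrible. A WASTE of 25 dollars ! See http://example.com/x\n"
)
print(label, tokens)  # 0 ['terrible.', 'waste', '00', 'dollars', 'see']
```

Note that `terrible.` keeps its trailing period: only tokens that are themselves punctuation are removed, which is why features such as `contains(horrible.)` can appear in the trained model.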

In [41]:
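def clean_data(origin_data):
    sw = set(stopwords.words('english'))  # a set gives fast membership tests
    # '__label__1' marks a negative review (0); any other label is positive (1)
    labels = [0 if s.split(' ')[0] == '__label__1' else 1 for s in origin_data]
    # Drop the label and the trailing newline, then lowercase
    sentences = [s.split(' ', 1)[1][:-1].lower() for s in origin_data]
    
    for i in range(len(sentences)):
        # Replace every digit with 0 (raw string avoids an invalid-escape warning)
        sentences[i] = re.sub(r'\d','0',sentences[i])
        # Strip raw URLs
        sentences[i]  = re.sub(r'(www.|https|http)?:\/\/(\w|\.|\/|\?|\=|\&|\%)*\b', '',
                            sentences[i], flags=re.MULTILINE)                
                
        sentence    = [w for w in sentences[i].split(' ') if w not in sw]
        # Substring test: drops stand-alone punctuation and empty tokens only
        sentences[i]= [w for w in sentence if w not in punctuation]
#         porter = nltk.PorterStemmer() 
#         sentences[i] = [porter.stem(w) for w in sentences[i]]
    return labels,sentences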

In [42]:
train_file_lines,test_file_lines = read_data_from_BZ2File(train_file_path,test_file_path)

In [43]:
## Data Sample
print(train_file_lines[0])
print(test_file_lines[0])

__label__2 Stuning even for the non-gamer: This sound track was beautiful! It paints the senery in your mind so well I would recomend it even to people who hate vid. game music! I have played the game Chrono Cross but out of all of the games I have ever played it has the best music! It backs away from crude keyboarding and takes a fresher step with grate guitars and soulful orchestras. It would impress anyone who cares to listen! ^_^

__label__2 Great CD: My lovely Pat has one of the GREAT voices of her generation. I have listened to this CD for YEARS and I still LOVE IT. When I'm in a good mood it makes me feel better. A bad mood just evaporates like sugar in the rain. This CD just oozes LIFE. Vocals are jusat STUUNNING and lyrics just kill. One of life's hidden gems. This is a desert isle CD in my book. Why she never made it big is just beyond me. Everytime I play this, no matter black, white, young, old, male, female EVERYBODY says one thing "Who was that singing ?"



### Step 2.1.3: Prepare Datasets

- Create train and test datasets
- Determine dataset size

In [44]:
# Use a small subset so the code runs quickly and stays within the memory limit
train_set_size=5000
test_set_size=500

test_labels,test_sentences   = clean_data(test_file_lines[:test_set_size])
train_labels,train_sentences = clean_data(train_file_lines[:train_set_size])

In [45]:
# Pair each tokenized sentence with its label
train_set = list(zip(train_sentences, train_labels))
test_set  = list(zip(test_sentences, test_labels))

## Step 2.2: Model Phase

### Step 2.2.1: Feature Extraction

Define the word features and apply them to every sentence in both data sets.

In [46]:
# Define the features of the model
def get_word_features(sentence_list):
    wordlist=[]
    for sentence in sentence_list:
        wordlist.extend(sentence)
    wordlist = nltk.FreqDist(wordlist)
    features = wordlist.keys()
    return features

# Extract and apply the features to the data
def extract_features(document):
    document_words = set(document)
    features = {}
    for word in w_features:
        features['contains(%s)' % word] = (word in document_words)
    return features

In [47]:
# Note: the slice starts at index 1, so the first training sentence is left out of the vocabulary
w_features=get_word_features(train_sentences[1:train_set_size])

In [48]:
# Apply the features to the dataset
train_set = [(extract_features(sentence), label) for (sentence, label) in train_set]
test_set = [(extract_features(sentence), label) for (sentence, label) in test_set]
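
To make the feature representation concrete, here is a small sketch of the same bag-of-words idea on toy data. It uses only the standard library; `build_vocabulary` and `to_features` are hypothetical names standing in for `get_word_features` and `extract_features`.

```python
from collections import Counter


def build_vocabulary(tokenized_sentences):
    """Collect every distinct token, in first-seen order."""
    counts = Counter()
    for tokens in tokenized_sentences:
        counts.update(tokens)
    return list(counts)


def to_features(tokens, vocabulary):
    """Map a token list to {'contains(word)': bool} over the whole vocabulary."""
    present = set(tokens)
    return {"contains(%s)" % word: (word in present) for word in vocabulary}


toy_sentences = [["waste", "money"], ["great", "sound"]]
vocabulary = build_vocabulary(toy_sentences)
features = to_features(["great", "money", "unseen"], vocabulary)
print(features)
```

Words outside the vocabulary (here `unseen`) are ignored, and every document receives one boolean entry per vocabulary word, so memory use grows with the vocabulary size multiplied by the number of documents.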

### Step 2.2.2: Train the Model

In [49]:
classifier = nltk.NaiveBayesClassifier.train(train_set)
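
For intuition about what training does, the following is a minimal Naive Bayes sketch over boolean features. It is an illustration under stated assumptions, not NLTK's implementation: it uses add-one smoothing, whereas NLTK applies its own probability estimator, so the numbers would differ. The function names and toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled_featuresets):
    """Count labels and, per label, how often each (feature, value) pair occurs."""
    label_counts = Counter()
    value_counts = defaultdict(Counter)  # label -> Counter of (feature, value)
    for features, label in labeled_featuresets:
        label_counts[label] += 1
        for name, value in features.items():
            value_counts[label][(name, value)] += 1
    return label_counts, value_counts


def classify(features, label_counts, value_counts):
    """Return the label with the highest log posterior (add-one smoothing)."""
    total = sum(label_counts.values())
    scores = {}
    for label, count in label_counts.items():
        score = math.log(count / total)  # log prior
        for name, value in features.items():
            # Boolean features have two possible values, hence the "+ 2".
            seen = value_counts[label][(name, value)]
            score += math.log((seen + 1) / (count + 2))
        scores[label] = score
    return max(scores, key=scores.get)


toy_train = [
    ({"contains(waste)": True, "contains(great)": False}, 0),
    ({"contains(waste)": True, "contains(great)": False}, 0),
    ({"contains(waste)": False, "contains(great)": True}, 1),
    ({"contains(waste)": False, "contains(great)": True}, 1),
]
label_counts, value_counts = train_naive_bayes(toy_train)
prediction = classify({"contains(waste)": True, "contains(great)": False},
                      label_counts, value_counts)
print(prediction)
```

The classifier multiplies a class prior by one independent likelihood per feature (summed in log space here), which is the "naive" independence assumption the method is named after.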

### Step 2.2.3: Evaluate the Model

In [50]:
def evaluate_model(correct_vs_prediction):
    tn=0
    tp=0
    fn=0
    fp=0
    for (label, guess) in sorted(correct_vs_prediction):
        if(guess==0):
            if(label==0):
                tn=tn+1
            else:
                fn=fn+1
        else:
            if(label==1):
                tp=tp+1
            else:
                fp=fp+1
    precision=tp/(tp+fp)
    accuracy=(tp+tn)/(tp+tn+fp+fn)
    recall =tp/(tp+fn)
    print('precision={:<8f} accuracy={:<8f} recall={:<8f}'.format(precision, accuracy, recall))   

In [51]:
classifier.show_most_informative_features(26)

Most Informative Features
         contains(waste) = True                0 : 1      =     30.7 : 1.0
         contains(worst) = True                0 : 1      =     21.3 : 1.0
     contains(horrible.) = True                0 : 1      =     18.6 : 1.0
     contains(terrible.) = True                0 : 1      =     18.0 : 1.0
     contains(defective) = True                0 : 1      =     16.3 : 1.0
         contains(throw) = True                0 : 1      =     15.7 : 1.0
      contains(classic:) = True                1 : 0      =     14.4 : 1.0
         contains(awful) = True                0 : 1      =     14.0 : 1.0
        contains(doesnt) = True                0 : 1      =     14.0 : 1.0
          contains(bad.) = True                0 : 1      =     13.9 : 1.0
            contains(:)) = True                1 : 0      =     13.6 : 1.0
       contains(higgins) = True                0 : 1      =     12.3 : 1.0
        contains(wasted) = True                0 : 1      =     12.2 : 1.0

In [52]:
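# Collect (true label, predicted label) pairs for the whole test set
label_guess_pairs = []
for (features, label) in test_set:
    guess = classifier.classify(features)
    label_guess_pairs.append( (label, guess) )
evaluate_model(label_guess_pairs)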

precision=0.881057 accuracy=0.828000 recall=0.772201
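
The three metrics follow directly from the confusion-matrix counts. The sketch below recomputes them with a hypothetical `compute_metrics` helper that also guards against division by zero. The counts are back-solved from the reported metrics and the test set size of 500; they are an inference, not output from the notebook.

```python
def compute_metrics(tp, tn, fp, fn):
    """Return (precision, accuracy, recall), using 0.0 when a denominator is zero."""
    total = tp + tn + fp + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    return precision, accuracy, recall


# Assumption: counts inferred from the reported metrics on 500 test samples.
precision, accuracy, recall = compute_metrics(tp=200, tn=214, fp=27, fn=59)
print('precision={:<8f} accuracy={:<8f} recall={:<8f}'.format(precision, accuracy, recall))
# precision=0.881057 accuracy=0.828000 recall=0.772201
```

Read this way, the model misses more positive reviews (59 false negatives) than it mislabels negative ones (27 false positives), which is why recall is lower than precision.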


### Step 2.2.4: Save the Model

In [58]:
# The context manager closes the file even if pickling fails
with open('my_classifier.pickle', 'wb') as f:
    pickle.dump(classifier, f)

In [57]:
print(len(w_features))
# One feature word per line; avoid shadowing the built-in name `file`
with open('features.txt','w',encoding="utf-8") as feature_file:
    feature_file.write('\n'.join(w_features))

41276
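
The next part needs both saved artifacts: the pickled classifier and the feature words. The following is a minimal sketch of how they could be loaded back, assuming the file layout used above. `load_artifacts` is a hypothetical helper, and the round trip uses a stand-in dictionary in place of the real NLTK classifier so that it runs on its own.

```python
import os
import pickle
import tempfile


def load_artifacts(model_path, features_path):
    """Load a pickled classifier and the newline-separated feature words."""
    with open(model_path, "rb") as model_file:
        model = pickle.load(model_file)
    with open(features_path, encoding="utf-8") as features_file:
        words = features_file.read().split("\n")
    return model, words


# Round trip with a stand-in object; in the notebook the pickle holds the classifier.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "my_classifier.pickle")
    features_path = os.path.join(tmp_dir, "features.txt")
    with open(model_path, "wb") as model_file:
        pickle.dump({"stand_in": "classifier"}, model_file)
    with open(features_path, "w", encoding="utf-8") as features_file:
        features_file.write("\n".join(["waste", "great"]))
    model, words = load_artifacts(model_path, features_path)

print(model, words)
```

The feature words must be restored together with the model, because new text has to be converted into the same `contains(word)` features before it can be classified. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.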


## Step 2.3: Conclusion

This part completes the model training and saves the model for the next part.

The model's accuracy is 0.828, which is not very high. Several changes might improve the result:

1. A better environment that can handle more training data
2. Better feature engineering
3. A stronger model, such as a neural network