# Lab2_1

### Start by implementing the parseReview and the preProcess functions. Given a line of a tab-separated text file, parseReview should return a triple containing the identifier of the review (as an integer), the review text itself, and the label (either ‘fake’ or ‘real’). The preProcess function should turn a review text (a string) into a list of tokens.
Hint: you can start by tokenising on white space; but you might want to think about some simple normalisation too.
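
As a rough illustration of the hint, the sketch below parses one tab-separated line and tokenises the review text with a simple lowercase normalisation. The function names `parse_review_line` and `simple_preprocess` are hypothetical, and the column positions (0 for the identifier, 1 for the label, 8 for the text) are taken from the `parseReview` cell further down; treat them as assumptions about the file layout.

```python
import re

def parse_review_line(line):
    # Split one raw line on tabs; column positions are assumed (0 = id, 1 = label, 8 = text).
    fields = line.rstrip("\n").split("\t")
    return int(fields[0]), fields[8], fields[1]

def simple_preprocess(text):
    # Lowercase first, then keep runs of letters, digits and apostrophes as tokens.
    # Punctuation and white space act as separators and are discarded.
    return re.findall(r"[a-z0-9']+", text.lower())
```

Calling `int()` on the identifier only works for data rows, so a header row would need to be skipped before parsing.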

# Lab2_2

### The next step is to implement the toFeatureVector function. Given a preprocessed review (that is, a list of tokens), it will return a Python dictionary that has as its keys the tokens, and as values the weight of those tokens in the preprocessed reviews. The weight could be simply the number of occurrences of a token in the preprocessed review, or it could give more weight to specific words. While building up this feature vector, you may want to incrementally build up a global featureDict, which should be a list or dictionary that keeps track of all the tokens in the whole review dataset. While a global feature dictionary is not strictly required for this coursework, it will help you understand which features (and how many!) you are using to train your classifier and can help understand possible performance issues you encounter on the way.
Hint: start by using binary feature values; 1 if the feature is present, 0 if it’s not.
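
The difference between binary and count-based weights can be sketched as follows. Both helper names are hypothetical; a feature that is absent is simply left out of the dictionary rather than stored with weight 0.

```python
from collections import Counter

def to_binary_features(tokens):
    # Every token that occurs at least once gets weight 1.
    return {token: 1 for token in tokens}

def to_count_features(tokens):
    # Every token is weighted by how often it occurs in the review.
    return dict(Counter(tokens))
```

For the tokens `["good", "good", "phone"]` the binary version weights both words equally, while the count version gives `good` twice the weight of `phone`.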

# Lab2_3

### Using the loadData function already present in the template file, you are now ready to process the review data from amazon_reviews.txt. In order to train a good classifier, finish the implementation of the crossValidate function to do a 10-fold cross validation on the training data. Make use of the given functions trainClassifier and predictLabels to do the cross-validation. Make sure that your program stores the (average) precision, recall, f1 score, and accuracy of your classifier in a variable cv_results.
Hint: the package sklearn.metrics contains many utilities for evaluation metrics - you could try precision_recall_fscore_support to start with.
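
The fold bookkeeping behind k-fold cross-validation can be sketched independently of the classifier. The generator name `k_fold_splits` is hypothetical, and the sketch assumes the dataset is a list that has already been shuffled. Samples left over when the length is not divisible by the number of folds stay in every training split and are never held out.

```python
def k_fold_splits(dataset, folds):
    # Yield (train, test) pairs; each fold is held out exactly once.
    fold_size = len(dataset) // folds
    for k in range(folds):
        start = k * fold_size
        end = start + fold_size
        test = dataset[start:end]
        train = dataset[:start] + dataset[end:]
        yield train, test
```

Each yielded pair can then be passed to a training function and a prediction function, with the metrics collected per fold and averaged at the end.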

# Lab2_4

### Now that you have the numbers for accuracy of your classifier, think of ways to improve this score. Things to consider:
• Improve the preprocessing. Which tokens might you want to throw out or preserve?

• What about punctuation? Do not forget normalisation and lemmatising - what aspects of this might be useful?

• Think about the features: what could you use other than unigram tokens from the review texts? It may be useful to look beyond single words to combinations of words or characters. Also the feature weighting scheme: what could you do other than using binary values?

• You could consider playing with the parameters of the SVM (cost parameter? per-class weighting?)

Report what methods you tried and what the effect was on the classifier performance.
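
One way to look beyond unigrams, as suggested above, is to append word bigrams to the token list before building the feature vector. This is only a sketch; the helper name `add_bigrams` and the underscore used as a joiner are assumptions, not part of the template.

```python
def add_bigrams(tokens):
    # Pair each token with its successor and join the pair with an underscore.
    bigrams = [first + "_" + second for first, second in zip(tokens, tokens[1:])]
    # Return a new list so the caller's token list is left unchanged.
    return tokens + bigrams
```

Bigrams such as `not_good` keep some local word order that unigrams lose, at the cost of a much larger feature space.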

# Lab2_5

### Now look beyond textual features of the review. The data set contains a number of other features for each review (rating, verified purchase, product category, product ID, product title, review title). How can the inclusion of these features improve your classifier’s performance? Pick three of these metadata types to use as additional features and report how they improve the classifier performance.
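
Metadata can be added to the feature dictionary under its own prefixed keys, so that it never collides with tokens from the review text. The sketch below is one possible approach, not the one used in the cells that follow (those concatenate the metadata into the review text before preprocessing). The function name and the key prefixes are hypothetical.

```python
def add_metadata_features(feature_vector, rating, verified_purchase, title_tokens):
    # Work on a copy so the original feature vector is not modified.
    features = dict(feature_vector)
    # Categorical metadata becomes one binary feature per observed value.
    features["RATING=" + rating] = 1
    features["VERIFIED=" + verified_purchase] = 1
    # Title tokens are counted separately from body tokens via a prefix.
    for token in title_tokens:
        key = "TITLE_" + token
        features[key] = features.get(key, 0) + 1
    return features
```

A prefixed key such as `RATING=5` also survives token filters that would drop a bare numeric token, for example a filter based on `str.isalpha()`.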

In [2]:
'''
Without any operation:
Current average accuracy is 0.618154761904762
Current average precision is 0.6185732469315799
Current average recall is 0.618154761904762
Current average fscore is 0.6181181376909236

For question 4:
After applying the following operations: removing stop words, removing punctuation, converting all letters to lower case, using the standard stemmer from NLTK,
synonym replacement, and changing the parameters of the SVM
Current average accuracy is 0.642202380952381
Current average precision is 0.6432759799590053
Current average recall is 0.642202380952381
Current average fscore is 0.6416551643326958
This gives some improvement, but the effect is small (accuracy rises from about 0.618 to about 0.642)

For question 5:
I chose "Rating","VERIFIED_PURCHASE","REVIEW_TITLE"
Current average accuracy is 0.8036309523809523
Current average precision is 0.8085790394518841
Current average recall is 0.8036309523809523
Current average fscore is 0.8028697389613784
This is a large improvement for the classifier, so I think the better way to improve its performance is to add more features
'''


In [3]:
import csv                               # csv reader
from sklearn.svm import LinearSVC
from nltk.classify import SklearnClassifier
from random import shuffle
from sklearn.pipeline import Pipeline
from sklearn.metrics import precision_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk import PorterStemmer
from nltk.stem import WordNetLemmatizer
#path = "/Users/irelia/Desktop/jupyter-notebook/NLP/test"

In [4]:
# load data from a file and append it to the rawData
def loadData(path, Text=None):
    with open(path) as f:
        reader = csv.reader(f, delimiter='\t')
        for line in reader:
            (ID, Text, Label,Rating,VERIFIED_PURCHASE,REVIEW_TITLE) = parseReview(line)
            rawData.append((ID, Text, Label,Rating,VERIFIED_PURCHASE,REVIEW_TITLE))
            preprocessedData.append((ID, preProcess(Text), Label,Rating,VERIFIED_PURCHASE,preProcess(REVIEW_TITLE)))
        del preprocessedData[0]
        del rawData[0]
        
def splitData(percentage):
    dataSamples = len(rawData)
    halfOfData = int(len(rawData)/2)
    trainingSamples = int((percentage*dataSamples)/2)
    for (_, Text, Label,Rating,VERIFIED_PURCHASE,REVIEW_TITLE) in rawData[:trainingSamples] + rawData[halfOfData:halfOfData+trainingSamples]:
        w = Text+' '+Rating+' '+VERIFIED_PURCHASE+' '+REVIEW_TITLE
        trainData.append((toFeatureVector(preProcess(w)),Label))
    for (_, Text, Label,Rating,VERIFIED_PURCHASE,REVIEW_TITLE) in rawData[trainingSamples:halfOfData] + rawData[halfOfData+trainingSamples:]:
        w = Text+' '+Rating+' '+VERIFIED_PURCHASE+' '+REVIEW_TITLE
        testData.append((toFeatureVector(preProcess(w)),Label))

In [5]:
# QUESTION 1

# Convert a line from the input file into a tuple of the fields used later
def parseReview(reviewLine):
    # Returns (ID, Text, Label, Rating, VERIFIED_PURCHASE, REVIEW_TITLE), all as strings.
    # The ID is not converted to an integer because the header row also passes through here.
    ID, Text, Label = reviewLine[0], reviewLine[8], reviewLine[1]
    Rating, VERIFIED_PURCHASE, REVIEW_TITLE = reviewLine[2], reviewLine[3], reviewLine[7]
    return (ID, Text, Label, Rating, VERIFIED_PURCHASE, REVIEW_TITLE)


In [6]:
# TEXT PREPROCESSING AND FEATURE VECTORIZATION

# Input: a string of one review
stop_words = set(stopwords.words('english'))
wordnet_lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

def preProcess(text):
    # Should return a list of tokens
    text = word_tokenize(text)
    text_preprocessed = []
    for word in text:
        if word.isalpha(): # keep alphabetic tokens only (drops punctuation and numbers)
            if word not in stop_words: # drop stop words; checked before lowercasing, so capitalised ones such as "The" are kept
                word = word.lower() # convert all letters to lower case
                word = wordnet_lemmatizer.lemmatize(word) # lemmatise with WordNet
                word = stemmer.stem(word) # stem with the Porter stemmer from NLTK
                text_preprocessed.append(word)
    return text_preprocessed

In [7]:
# QUESTION 2
featureDict = {} # A global dictionary of features
vectorizer = CountVectorizer(min_df=1)
def toFeatureVector(tokens):
    # Should return a dictionary containing features as keys, and weights as values
    featureVector = {}
    for token in tokens:
        if token not in featureVector:
            featureVector[token] = 1.0
        else:
            featureVector[token] = float(featureVector[token] + 1)

        if token not in featureDict:
            featureDict[token] = 1.0
        else:
            featureDict[token] = float(featureDict[token] + 1)
    return featureVector

In [7]:
# TRAINING AND VALIDATING OUR CLASSIFIER
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
def trainClassifier(trainData):
    print("Training Classifier...")
    pipeline =  Pipeline([('svc', LinearSVC(C = 0.001, class_weight = "balanced"))])
    return SklearnClassifier(pipeline).train(trainData)#return a classifier

In [8]:
# QUESTION 3
'''
p = precision_score(y_true, y_pred, average='binary')
r = recall_score(y_true, y_pred, average='binary')
f1score = f1_score(y_true, y_pred, average='binary')
accuracy_score(y_true, y_pred)
'''
def crossValidate(dataset, folds):
    shuffle(dataset)
    cv_results = []
    foldSize = int(len(dataset)/folds)
    #print(len(dataset))
    # Train and test on each fold in turn. Stopping at foldSize*folds gives exactly
    # `folds` iterations even when len(dataset) is not divisible by folds.
    for i in range(0,foldSize*folds,foldSize):
        print("fold start %d foldSize %d" %(i,foldSize))
        data_Test = dataset[i:i+foldSize]
        data_Train = dataset[:i]+dataset[i+foldSize:]
        classifier = trainClassifier(data_Train)
        y_true = list(map(lambda x : x[1],data_Test))
        y_pred = classifier.classify_many(map(lambda x : x[0],data_Test))
        cv_results.append(accuracy_score(y_true,y_pred))
        cv_results.append(precision_score(y_true,y_pred,average = 'weighted'))
        cv_results.append(recall_score(y_true,y_pred,average = 'weighted'))
        cv_results.append(f1_score(y_true,y_pred,average = 'weighted'))
    return cv_results
    

In [9]:
# PREDICTING LABELS GIVEN A CLASSIFIER

def predictLabels(reviewSamples, classifier):
    return classifier.classify_many(map(lambda t: toFeatureVector(preProcess(t[1])), reviewSamples))

def predictLabel(reviewSample, classifier):
    return classifier.classify(toFeatureVector(preProcess(reviewSample)))

In [10]:
# MAIN

# loading reviews
rawData = []          # the filtered data from the dataset file (should be 21000 samples)
preprocessedData = [] # the preprocessed reviews (just to see how your preprocessing is doing)
trainData = []        # the training data as a percentage of the total dataset (currently 80%, or 16800 samples)
testData = []         # the test data as a percentage of the total dataset (currently 20%, or 4200 samples)
# the output classes
fakeLabel = 'fake'
realLabel = 'real'
# references to the data files
reviewPath = "amazon_reviews.txt"

## Do the actual stuff
# We parse the dataset and put it in a raw data list
print("Now %d rawData, %d trainData, %d testData" % (len(rawData), len(trainData), len(testData)),
      "Preparing the dataset...",sep='\n')
loadData(reviewPath) 
# We split the raw dataset into a set of training data and a set of test data (80/20)
print("Now %d rawData, %d trainData, %d testData" % (len(rawData), len(trainData), len(testData)),
      "Preparing training and test data...",sep='\n')
splitData(0.8)
#print(trainData)
# We print the number of training samples and the number of features
print("Now %d rawData, %d trainData, %d testData" % (len(rawData), len(trainData), len(testData)),
      "Training Samples: ", len(trainData), "Features: ", len(featureDict), sep='\n')

Now 0 rawData, 0 trainData, 0 testData
Preparing the dataset...
Now 21000 rawData, 0 trainData, 0 testData
Preparing training and test data...
Now 21000 rawData, 16800 trainData, 4200 testData
Training Samples: 
16800
Features: 
21570


In [11]:
cv_results = crossValidate(trainData,10)
cv_results = np.asarray(cv_results)
print(cv_results)

fold start 0 foldSize 1680
Training Classifier...
fold start 1680 foldSize 1680
Training Classifier...
fold start 3360 foldSize 1680
Training Classifier...
fold start 5040 foldSize 1680
Training Classifier...
fold start 6720 foldSize 1680
Training Classifier...
fold start 8400 foldSize 1680
Training Classifier...
fold start 10080 foldSize 1680
Training Classifier...
fold start 11760 foldSize 1680
Training Classifier...
fold start 13440 foldSize 1680
Training Classifier...
fold start 15120 foldSize 1680
Training Classifier...
[0.8        0.80544467 0.8        0.79924925 0.81190476 0.81630518
 0.81190476 0.8114304  0.81488095 0.81941665 0.81488095 0.81428832
 0.79047619 0.79428571 0.79047619 0.78965774 0.80714286 0.81361231
 0.80714286 0.80675986 0.80297619 0.80650833 0.80297619 0.80225241
 0.80535714 0.81042821 0.80535714 0.80403581 0.80595238 0.81030051
 0.80595238 0.80498811 0.7875     0.79224787 0.7875     0.78676253
 0.81130952 0.81655725 0.81130952 0.81046756]


In [12]:
cv_results = cv_results.reshape(10,4)
print("accuracy,precision,recall and fscore is",'\n',cv_results)

accuracy,precision,recall and fscore is 
 [[0.8        0.80544467 0.8        0.79924925]
 [0.81190476 0.81630518 0.81190476 0.8114304 ]
 [0.81488095 0.81941665 0.81488095 0.81428832]
 [0.79047619 0.79428571 0.79047619 0.78965774]
 [0.80714286 0.81361231 0.80714286 0.80675986]
 [0.80297619 0.80650833 0.80297619 0.80225241]
 [0.80535714 0.81042821 0.80535714 0.80403581]
 [0.80595238 0.81030051 0.80595238 0.80498811]
 [0.7875     0.79224787 0.7875     0.78676253]
 [0.81130952 0.81655725 0.81130952 0.81046756]]


In [13]:
print("Current average accuracy is " + str(np.mean(cv_results[:,0], axis=0)))
print("Current average precision is " + str(np.mean(cv_results[:,1], axis=0)))
print("Current average recall is " + str(np.mean(cv_results[:,2], axis=0)))
print("Current average fscore is " + str(np.mean(cv_results[:,3], axis=0)))

Current average accuracy is 0.80375
Current average precision is 0.8085106702118777
Current average recall is 0.80375
Current average fscore is 0.802989199381378
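
The accuracy and recall columns above are identical in every fold. That follows from using `average = 'weighted'`: weighting each class's recall by its share of the true labels reduces to the overall fraction of correct predictions. The pure-Python sketch below checks this on a toy example; the helper names and the toy labels are made up for illustration.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    # Fraction of predictions that match the true label.
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def weighted_recall(y_true, y_pred):
    # Per-class recall, weighted by the class's share of the true labels.
    support = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    total = len(y_true)
    return sum((support[c] / total) * (correct[c] / support[c]) for c in support)

y_true = ["fake", "fake", "fake", "real", "real"]
y_pred = ["fake", "real", "fake", "real", "fake"]
print(accuracy(y_true, y_pred), weighted_recall(y_true, y_pred))
```

Reporting per-class or macro-averaged recall instead would give a figure that carries information beyond accuracy.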
