In [1]:
import csv                               # csv reader
from sklearn.svm import LinearSVC
from nltk.classify import SklearnClassifier
from random import shuffle
from sklearn.pipeline import Pipeline

In [2]:
# load data from a file and append it to the rawData
def loadData(path, Text=None):
    with open(path, encoding='utf8') as f:
        reader = csv.reader(f, delimiter='\t')
        for line in reader:
            if line[0] == "DOC_ID":  # skip the header
                continue
            (Id,Text,PRODUCT_Category,Rating,VerifiedReview,Label) = parseReview(line)
            rawData.append((Id,Text,PRODUCT_Category,Rating,VerifiedReview,Label))


def splitData(percentage):
    # A method to split the data between trainData and testData 
    dataSamples = len(rawData)
    halfOfData = int(len(rawData)/2)
    trainingSamples = int((percentage*dataSamples)/2)
    for (_, Text,PRODUCT_Category,Rating,VerifiedReview,Label) in rawData[:trainingSamples] + rawData[halfOfData:halfOfData+trainingSamples]:
        trainData.append((toFeatureVector(preProcess(Text),PRODUCT_Category,Rating,VerifiedReview),Label))
    for (_, Text,PRODUCT_Category, Rating, VerifiedReview,Label) in rawData[trainingSamples:halfOfData] + rawData[halfOfData+trainingSamples:]:
        testData.append((toFeatureVector(preProcess(Text),PRODUCT_Category, Rating, VerifiedReview),Label))

Various feature combinations can be considered, and I experimented with several. Using PRODUCT_ID, RATING and PRODUCT_TITLE gave Precision: 0.55, Recall: 0.55, F-score: 0.55, Accuracy: 0.35 on the training data, and Precision: 0.32, Recall: 0.35, F-score: 0.32, Accuracy: 0.35 on the test data.

I also experimented with PRODUCT_ID, PRODUCT_CATEGORY and REVIEW_TITLE, which gave lower results than the combination above: Precision: 0.53, Recall: 0.53, F-score: 0.53, Accuracy: 0.35 on the training data, and Precision: 0.28, Recall: 0.28, F-score: 0.28, Accuracy: 0.28 on the test data.

The best results came from the features used in this notebook, PRODUCT_Category, Rating and VerifiedReview: Precision: 0.78, Recall: 0.78, F-score: 0.77, Accuracy: 0.81 on the training data, and Precision: 0.81, Recall: 0.81, F-score: 0.81, Accuracy: 0.81 on the test data.

# Question 1

In [3]:
# Convert a line from the input file into an (id, text, category, rating, verified, label) tuple
def parseReview(reviewLine):
    # Returns a six-tuple of strings: document ID, review text, product category,
    # rating, verified-purchase flag, and label ('real' for __label2__, otherwise 'fake')
    if reviewLine[1] == '__label2__':
        reviewLine[1] = realLabel
    else:
        reviewLine[1] = fakeLabel
    return (reviewLine[0], reviewLine[8],reviewLine[4],reviewLine[2],reviewLine[3],reviewLine[1])

The parseReview function takes reviewLine as its argument and returns six fields: the document ID, the review text, the product category, the rating, the verified-purchase flag, and the label. Because csv.reader yields every column as a string, the ID and the rating are returned as strings, not integers.
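As a rough illustration of the column mapping, the sketch below parses one made-up tab-separated row. The column positions are taken from the indices used in parseReview. The row contents and the helper name `parse_review_typed` are hypothetical, and converting the ID and rating to `int` is an assumption for illustration, not something the notebook does.

```python
import csv
import io

FAKE_LABEL, REAL_LABEL = "fake", "real"

def parse_review_typed(row):
    """Hypothetical variant of parseReview that converts ID and rating to int."""
    label = REAL_LABEL if row[1] == "__label2__" else FAKE_LABEL
    # Same column indices as parseReview: 0=id, 8=text, 4=category, 2=rating, 3=verified
    return (int(row[0]), row[8], row[4], int(row[2]), row[3], label)

# One invented row; columns 5-7 are not used by parseReview.
sample = "1\t__label1__\t5\tN\tGrocery\tx\ty\tz\tGreat chocolate assortment\n"
row = next(csv.reader(io.StringIO(sample), delimiter="\t"))
parsed = parse_review_typed(row)
print(parsed)
```

Keeping the rating as a string, as the notebook does, makes it behave as a categorical feature, whereas an integer rating would be treated as a numeric weight.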

In [4]:
# TEXT PREPROCESSING AND FEATURE VECTORIZATION
import re, nltk, string
from nltk.util import ngrams
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
# Input: a string of one review
def preProcess(text):
    # Returns a list of bigram strings built from the cleaned,
    # stemmed and lemmatized tokens of the review
    
    #tokens=text.split(' ')
    
    #Remove HTML tags
    HTML_TAG_RE = r"<[^>]*>"
    text = re.sub(HTML_TAG_RE, ' ', text)
    #Remove @ tags
    TAG_RE = r"@\S+"
    text = re.sub(TAG_RE, ' ', text)
    #REMOVE WHITE SPACES
    text = text.strip()
    #REMOVE PUNCTUATION
    text = text.translate(text.maketrans('', '', string.punctuation))
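    #REMOVING WEBSITE LINKS AND NON-ALPHANUMERIC CHARACTERS
    # (punctuation, including ':', is already stripped above, so in practice
    # only the non-alphanumeric alternative can still match here)
    LINKS_RE = r"https?:\S+|[^A-Za-z0-9]+"
    text=re.sub(LINKS_RE, ' ', text)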
    text = text.lower()
    tokens = nltk.word_tokenize(text)
    #Stop WORD REMOVAL
    stop_words = set(stopwords.words('english'))
    tokens = [i for i in tokens if i not in stop_words]
    #PORTER STEMMING
    stemmer = PorterStemmer()
    tokens = [stemmer.stem(j) for j in tokens]
    #Lemmatization
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(k) for k in tokens]
    #Stemming gave better results than Lemmatization or by using both
    bigrams = zip(*[tokens[i:] for i in range (2)])
    tokens = [" ".join(bigram) for bigram in bigrams]
    return tokens

The preprocessing function applies several cleaning steps: it removes HTML tags and @ tags from the text, strips leading and trailing whitespace, removes punctuation and any remaining non-alphanumeric characters, and converts the text to lower case. The text is then tokenized, stop words are removed, and the tokens are stemmed and lemmatized. Finally, adjacent tokens are joined into bigrams, which are the features returned.

I experimented with the preprocessing function by keeping the HTML tags, by keeping the @ tags, and by keeping both, but none of these changed the results significantly. Skipping the stemming step, however, produced a noticeable change on the training data:
With Stemming:-
Precision: 0.77, Recall: 0.77, Fscore: 0.77, Accuracy: 0.80
Without Stemming:-
Precision: 0.80, Recall: 0.80, Fscore: 0.80, Accuracy: 0.81
Since there was no noticeable change on the test data, I kept both stemming and lemmatization in the notebook.
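The last step of preProcess, joining adjacent tokens into bigrams, can be sketched on its own as follows. The helper name `to_bigrams` is hypothetical, and the example tokens are copied from the first test instance printed later in this notebook.

```python
def to_bigrams(tokens):
    """Join each pair of adjacent tokens into one space-separated string."""
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

stemmed = ["assort", "realli", "hershey", "best"]
bigrams = to_bigrams(stemmed)
print(bigrams)  # ['assort realli', 'realli hershey', 'hershey best']
```

A review that is reduced to a single token produces no bigrams at all, so its feature vector would contain only the metadata fields.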

# Question 2

In [5]:
featureDictglobal = {} # A global dictionary of features

def toFeatureVector(tokens,PRODUCT_Category,Rating,VerifiedReview):
    # Returns a dictionary mapping each feature to its weight:
    # every bigram token maps to its count in this review, and the three
    # metadata fields are stored under their own keys
    localfeatureDict = {}
    for t in tokens:
        # .get() keeps the global and the local counts independent of each other
        featureDictglobal[t] = featureDictglobal.get(t, 0) + 1
        localfeatureDict[t] = localfeatureDict.get(t, 0) + 1
    metadata = {'PRODUCT_Category':PRODUCT_Category,'Rating':Rating,'VerifiedReview':VerifiedReview}
    featureDictglobal.update(metadata)
    localfeatureDict.update(metadata)
    return localfeatureDict

A global dictionary of features is created as featureDictglobal, and a local dictionary, localfeatureDict, is created for each review.
The toFeatureVector function takes tokens, PRODUCT_Category, Rating and VerifiedReview as arguments and builds a dictionary with features as keys and weights as values. The weight of a token is its number of occurrences in the review: it is increased by 1 each time the token appears. Finally, both the global and the local dictionary are updated with the PRODUCT_Category, Rating and VerifiedReview features.

I also experimented with adding 1.0/len(tokens) instead of 1, but this slightly decreased the precision, recall and F-score on the training data, so I stayed with adding 1.
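The two weighting schemes compared above can be sketched side by side. This is a minimal illustration, not the notebook's code: the helper names `count_weights` and `normalized_weights` and the example tokens are hypothetical.

```python
from collections import Counter

def count_weights(tokens):
    """Weight of a token = number of times it occurs in the review."""
    return dict(Counter(tokens))

def normalized_weights(tokens):
    """Weight of a token = its count divided by the number of tokens."""
    n = len(tokens)
    return {t: c / n for t, c in Counter(tokens).items()} if n else {}

tokens = ["good tast", "tast good", "good tast", "veri good"]
counts = count_weights(tokens)
normalized = normalized_weights(tokens)
print(counts)      # {'good tast': 2, 'tast good': 1, 'veri good': 1}
print(normalized)  # {'good tast': 0.5, 'tast good': 0.25, 'veri good': 0.25}
```

Normalizing makes the weights of long and short reviews comparable, while raw counts keep review length as an implicit signal.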

In [6]:
# TRAINING AND VALIDATING OUR CLASSIFIER
def trainClassifier(trainData):
    print("Training Classifier...")
    pipeline =  Pipeline([('svc', LinearSVC(penalty='l2',max_iter=2000, loss='hinge',dual=True,random_state=100,
                                            verbose=1,C=0.001,class_weight='balanced', fit_intercept=True,
                                            intercept_scaling=1,multi_class='ovr'))])
    return SklearnClassifier(pipeline).train(trainData)

The function above trains the classifier, a LinearSVC wrapped in a Pipeline, with a few non-default arguments. max_iter is raised to 2000 because with fewer iterations the solver asked for the number of iterations to be increased. penalty is set to 'l2' and loss to 'hinge', which is the standard SVM loss; this combination requires dual=True. random_state=100 fixes the random seed so that runs are reproducible, and C=0.001 applies strong regularization. class_weight is set to 'balanced' instead of the default None because it automatically adjusts the class weights inversely proportional to the class frequencies in the input data. fit_intercept=True makes the model estimate an intercept, and intercept_scaling is left at 1.
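To make the effect of class_weight='balanced' concrete, the sketch below reproduces the formula documented by scikit-learn, n_samples / (n_classes * count_of_class), in plain Python. The helper name `balanced_class_weights` and the label lists are hypothetical.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight per class = n_samples / (n_classes * number of samples in that class)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

skewed = ["fake", "fake", "fake", "real"]
even = ["fake", "real", "fake", "real"]
skewed_weights = balanced_class_weights(skewed)
even_weights = balanced_class_weights(even)
print(skewed_weights)  # the rarer class 'real' gets the larger weight
print(even_weights)    # both classes get weight 1.0
```

When the two classes are equally frequent in the training data, both weights are 1.0 and the option has no effect.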

# Question 3

In [7]:
from sklearn.model_selection import KFold
from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import accuracy_score

crossValidateActual=[]
def crossValidate(dataset, folds):
    shuffle(dataset)
    cv_results = []
    foldSize = int(len(dataset)/folds)
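    # Each contiguous slice of foldSize samples is used once as the test fold,
    # while the remaining samples are used to train the classifier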
    
    for i in range(0,len(dataset),foldSize):
        TestData = dataset[i:i+foldSize]
        TrainData = dataset[:i] + dataset[i+foldSize:]
        classifier = trainClassifier(TrainData)
        Actual = [x[1] for x in TestData]
        PredictedLabels = predictLabels(TestData, classifier)
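        # NOTE: cv_results is overwritten on every iteration, so the scores
        # returned (and the accuracy printed below) come from the last fold only
        cv_results = (precision_recall_fscore_support(Actual,PredictedLabels, average='weighted'))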
    print("Accuracy: %f" % accuracy_score(Actual,PredictedLabels))
    return cv_results

The crossValidate function takes two arguments: the input dataset and the number of folds (10 in this notebook). It first shuffles the data, then computes the fold size by dividing the length of the dataset by the number of folds. A for loop steps from 0 to the length of the dataset in increments of the fold size. On each iteration the current slice becomes TestData, the rest becomes TrainData, and trainClassifier is called on TrainData. Actual holds the true labels of the test fold and PredictedLabels holds the labels predicted for it. The function returns the weighted precision, recall and F-score. As written, cv_results is overwritten on every iteration, so the returned scores and the printed accuracy come from the last fold and are not an average over all folds.
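A common variant of this procedure averages the score over all folds. The sketch below shows one way to do that in plain Python. It is an illustration under stated assumptions, not the notebook's implementation: `fold_slices`, `cross_validate_mean`, the majority-label stand-in classifier and the toy dataset are all hypothetical, and only accuracy is computed to keep the example short. It assumes the number of folds does not exceed the number of samples.

```python
from collections import Counter
from random import Random

def fold_slices(n_samples, folds):
    """Return (start, stop) index pairs; the last fold absorbs any remainder."""
    fold_size = n_samples // folds
    return [(i * fold_size, n_samples if i == folds - 1 else (i + 1) * fold_size)
            for i in range(folds)]

def cross_validate_mean(dataset, folds, train_fn, predict_fn, seed=0):
    """Train and test once per fold, then average the per-fold accuracies."""
    data = list(dataset)            # copy so the caller's list is not reordered
    Random(seed).shuffle(data)
    accuracies = []
    for start, stop in fold_slices(len(data), folds):
        test = data[start:stop]
        train = data[:start] + data[stop:]
        model = train_fn(train)
        predicted = predict_fn(model, [features for features, _ in test])
        correct = sum(p == label for p, (_, label) in zip(predicted, test))
        accuracies.append(correct / len(test))
    return sum(accuracies) / len(accuracies), accuracies

# Stand-in classifier: always predicts the most frequent training label.
def train_majority(train):
    return Counter(label for _, label in train).most_common(1)[0][0]

def predict_majority(model, samples):
    return [model for _ in samples]

toy = [({"f": i}, "real" if i % 2 else "fake") for i in range(20)]
mean_acc, per_fold = cross_validate_mean(toy, 5, train_majority, predict_majority)
print(mean_acc, per_fold)
```

Collecting one score per fold also makes it possible to report the spread across folds, which a single last-fold score cannot show.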

In [8]:
# PREDICTING LABELS GIVEN A CLASSIFIER

def predictLabels(reviewSamples, classifier):
    return classifier.classify_many(map(lambda t: t[0], reviewSamples))

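def predictLabel(reviewSample, classifier, PRODUCT_Category, Rating, VerifiedReview):
    # toFeatureVector needs the review's metadata as well as its tokens
    return classifier.classify(toFeatureVector(preProcess(reviewSample), PRODUCT_Category, Rating, VerifiedReview))

The functions above predict labels with a trained classifier. predictLabels takes a list of (feature vector, label) pairs that have already been preprocessed and returns one predicted label per sample. predictLabel takes a single raw review text together with its product category, rating and verified-purchase flag, preprocesses it, and classifies it.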

In [9]:
# MAIN

# loading reviews
# initialize global lists that will be appended to by the methods below
rawData = []          # the filtered data from the dataset file (should be 21000 samples)
trainData = []        # the pre-processed training data as a percentage of the total dataset (currently 80%, or 16800 samples)
testData = []         # the pre-processed test data as a percentage of the total dataset (currently 20%, or 4200 samples)

# the output classes
fakeLabel = 'fake'
realLabel = 'real'

# references to the data files
reviewPath = 'amazon_reviews.txt'

# Do the actual stuff (i.e. call the functions we've made)
# We parse the dataset and put it in a raw data list
print("Now %d rawData, %d trainData, %d testData" % (len(rawData), len(trainData), len(testData)),
      "Preparing the dataset...",sep='\n')
loadData(reviewPath) 

# We split the raw dataset into a set of training data and a set of test data (80/20)
# You do the cross validation on the 80% (training data)
# We print the number of training samples and the number of features before the split
print("Now %d rawData, %d trainData, %d testData" % (len(rawData), len(trainData), len(testData)),
      "Preparing training and test data...",sep='\n')
splitData(0.8)
# We print the number of training samples and the number of features after the split
print("After split, %d rawData, %d trainData, %d testData" % (len(rawData), len(trainData), len(testData)),
      "Training Samples: ", len(trainData), "Features: ", len(featureDictglobal), sep='\n')

# QUESTION 3 - Make sure there is a function call here to the
# crossValidate function on the training set to get your results
validationResults=crossValidate(trainData,10)
print("Precision: %f\nRecall: %f\nF Score:%f" % validationResults[:3])


Now 0 rawData, 0 trainData, 0 testData
Preparing the dataset...
Now 21000 rawData, 0 trainData, 0 testData
Preparing training and test data...
After split, 21000 rawData, 16800 trainData, 4200 testData
Training Samples: 
16800
Features: 
434109
Training Classifier...
[LibLinear]Training Classifier...
[LibLinear]Training Classifier...
[LibLinear]Training Classifier...
[LibLinear]Training Classifier...
[LibLinear]Training Classifier...
[LibLinear]Training Classifier...
[LibLinear]Training Classifier...
[LibLinear]Training Classifier...
[LibLinear]Training Classifier...
[LibLinear]Accuracy: 0.781548
Precision: 0.784388
Recall: 0.781548
F Score:0.781139


In the main cell above, the output classes fakeLabel and realLabel are defined, along with the global lists rawData, trainData and testData that the functions append to. The dataset is parsed into rawData, which is then split 80/20 into training data and test data. The sizes of the three lists are printed before and after the split. Finally, the crossValidate function is called on the training data and the resulting Precision, Recall and F-score are printed.

The Precision, Recall and F-score on the training data improved considerably after improving the preprocessing function and adding more features.

# Evaluate on test set

In [10]:
# Finally, check the accuracy of your classifier by training on all the training data
# and testing on the test set
# Will only work once all functions are complete
functions_complete = True  # set to True once you're happy with your methods for cross val
if functions_complete:
    print(testData[0])   # have a look at the first test data instance
    classifier = trainClassifier(trainData)  # train the classifier
    TTrue = [t[1] for t in testData]   # get the ground-truth labels from the data
    Pred = predictLabels(testData, classifier)  # classify the test data to get predicted labels
    finalScores = precision_recall_fscore_support(TTrue, Pred, average='weighted') # evaluate
    print("Done training!")
    print("Precision: %f\nRecall: %f\nF Score:%f" % finalScores[:3])
    print("Accuracy: %f" % accuracy_score(TTrue, Pred))

({'assort realli': 1, 'realli hershey': 1, 'hershey best': 1, 'best littl': 1, 'littl one': 1, 'one alway': 1, 'alway excit': 1, 'excit whenev': 1, 'whenev holiday': 1, 'holiday come': 1, 'PRODUCT_Category': 'Grocery', 'Rating': '5', 'VerifiedReview': 'N'}, 'fake')
Training Classifier...
[LibLinear]Done training!
Precision: 0.814340
Recall: 0.808095
F Score:0.807137
Accuracy: 0.808095


The classifier is then checked by training on all the training data and testing on the test set.
This only works once all the functions are complete. When they are, the first test instance is printed, the classifier is trained on the training data, the ground-truth labels are extracted from the test data, labels are predicted for the test data, and the Precision, Recall, F-score and Accuracy are printed.

We can deduce that increasing the number of features and improving the preprocessing function improved the Precision, Recall and F-score significantly.
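In both outputs above the reported Recall equals the Accuracy (0.781548 in cross-validation, 0.808095 on the test set). This follows from the average='weighted' setting: weighting each class's recall by its share of the samples gives the total number of correct predictions divided by the number of samples. The sketch below recomputes the weighted scores in plain Python to show this. The helper name `weighted_scores` and the label lists are hypothetical.

```python
from collections import Counter

def weighted_scores(actual, predicted):
    """Support-weighted precision, recall and F1, plus plain accuracy."""
    n = len(actual)
    precision = recall = f1 = 0.0
    for label, support in Counter(actual).items():
        tp = sum(a == label and p == label for a, p in zip(actual, predicted))
        predicted_as_label = sum(p == label for p in predicted)
        p_c = tp / predicted_as_label if predicted_as_label else 0.0
        r_c = tp / support
        f_c = 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0
        weight = support / n        # each class counts by its share of the samples
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    accuracy = sum(a == p for a, p in zip(actual, predicted)) / n
    return precision, recall, f1, accuracy

actual = ["real", "real", "real", "real", "fake", "fake"]
predicted = ["real", "real", "real", "fake", "fake", "real"]
precision, recall, f1, accuracy = weighted_scores(actual, predicted)
print(precision, recall, f1, accuracy)
```

Because of this identity, the weighted Recall adds no information beyond Accuracy; per-class scores (average=None in scikit-learn) would show how the 'fake' and 'real' classes differ.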

# Questions 4 and 5
Once you're happy with your functions for Questions 1 to 3, it's advisable you make a copy of this notebook to make a new notebook, and then within it adapt and improve all three functions in the ways asked for in questions 4 and 5.