In [1]:
import csv                               # csv reader
from sklearn.svm import LinearSVC
from nltk.classify import SklearnClassifier
from random import shuffle
from sklearn.pipeline import Pipeline

In [2]:
# load data from a file and append it to the rawData
def loadData(path, Text=None):
    with open(path, encoding='utf8') as f:
        reader = csv.reader(f, delimiter='\t')
        for line in reader:
            if line[0] == "DOC_ID":  # skip the header
                continue
            (Id, Text, Label) = parseReview(line)
            rawData.append((Id, Text, Label))


def splitData(percentage):
    # A method to split the data between trainData and testData 
    dataSamples = len(rawData)
    halfOfData = int(len(rawData)/2)
    trainingSamples = int((percentage*dataSamples)/2)
    for (_, Text, Label) in rawData[:trainingSamples] + rawData[halfOfData:halfOfData+trainingSamples]:
        trainData.append((toFeatureVector(preProcess(Text)),Label))
    for (_, Text, Label) in rawData[trainingSamples:halfOfData] + rawData[halfOfData+trainingSamples:]:
        testData.append((toFeatureVector(preProcess(Text)),Label))

# Question 1

In [3]:
# Convert line from input file into an id/text/label tuple
def parseReview(reviewLine):
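    # Should return a triple of an integer, a string containing the review, and a string indicating the label
    # DESCRIBE YOUR METHOD IN WORDS
    # Column 0 is the document ID (converted to int), column 8 the review text, column 1 the label
    return (int(reviewLine[0]), reviewLine[8], reviewLine[1])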

The function parseReview takes one row of the tab-separated file (reviewLine, a list of column values) and returns a triple: the document ID from column 0, the review text from column 8, and the label from column 1.
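
As a rough illustration of the column mapping, the sketch below parses a made-up tab-separated row. The sample row and its filler columns are assumptions for demonstration only; just the indices 0, 1 and 8 come from the function above.

```python
import csv
import io

# Hypothetical row with nine tab-separated columns; only columns 0, 1 and 8 matter here.
sample = "1\t__label1__\tc2\tc3\tc4\tc5\tc6\tc7\tGreat product, works well.\n"

def parse_review(review_line):
    # Same index mapping as parseReview: ID from column 0, text from column 8, label from column 1.
    return (review_line[0], review_line[8], review_line[1])

row = next(csv.reader(io.StringIO(sample), delimiter="\t"))
doc_id, text, label = parse_review(row)
print(doc_id, label, text)
```

Because the delimiter is a tab, the comma inside the review text does not split the field.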

In [4]:
# TEXT PREPROCESSING AND FEATURE VECTORIZATION
import re, nltk
# Input: a string of one review
def preProcess(text):
    # Should return a list of tokens
    # DESCRIBE YOUR METHOD IN WORDS
    tokens=text.split(' ')
    return tokens

The preProcess function takes the review text as a string and tokenizes it by splitting on single space characters with the split method. It returns the resulting list of tokens. No lowercasing or punctuation handling is applied, so a token such as 'best.' keeps its trailing full stop.
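
To make the behaviour of this tokenizer concrete, here is a minimal sketch. The function name pre_process and the sample strings are illustrative assumptions; the splitting call is the same as in preProcess.

```python
def pre_process(text):
    # Same approach as preProcess: split on single space characters only.
    return text.split(' ')

tokens = pre_process("at their best.")
print(tokens)

# Two consecutive spaces yield an empty-string token between them.
double_space_tokens = pre_process("really  good")
print(double_space_tokens)
```

Punctuation stays attached to the neighbouring word, and repeated spaces produce empty tokens, which is worth keeping in mind when reading the feature counts.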

# Question 2

In [5]:
featureDictglobal = {} # A global dictionary of features

def toFeatureVector(tokens):
    # Should return a dictionary containing features as keys, and weights as values
    # DESCRIBE YOUR METHOD IN WORDS
    localfeatureDict={}
    for t in tokens:
        # Use .get with a default of 0 so that a token that is new to this review
        # but already present globally does not reset the global count to 1
        featureDictglobal[t] = featureDictglobal.get(t, 0) + 1
        localfeatureDict[t] = localfeatureDict.get(t, 0) + 1
    return localfeatureDict

A global dictionary of features, featureDictglobal, is created, and a local dictionary, localfeatureDict, is built for each review.
The toFeatureVector function takes a list of tokens and returns the local dictionary, with features (tokens) as keys and weights as values. The weight is the number of occurrences of the token in the review: the first occurrence sets the count to 1, and each later occurrence increases it by one. The global dictionary is updated alongside the local one, and its length is later reported as the number of features.
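
The per-review counting can also be sketched with collections.Counter from the standard library. This is an alternative illustration, not the notebook's method: the name to_feature_vector and the sample tokens are assumptions, and this version does not update any global dictionary.

```python
from collections import Counter

def to_feature_vector(tokens):
    # Counter tallies occurrences; dict() gives a plain mapping of token -> count.
    return dict(Counter(tokens))

features = to_feature_vector(["the", "little", "ones", "love", "the", "candy"])
print(features)
```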

In [6]:
# TRAINING AND VALIDATING OUR CLASSIFIER
def trainClassifier(trainData):
    print("Training Classifier...")
    pipeline =  Pipeline([('svc', LinearSVC())])
    return SklearnClassifier(pipeline).train(trainData)

The function above builds a scikit-learn Pipeline whose only step is a LinearSVC, wraps it in NLTK's SklearnClassifier, trains it on the given training data, and returns the trained classifier. Validation is handled separately by crossValidate.
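
Before a linear model can use the feature dictionaries, they have to be mapped to fixed-length numeric rows. The plain-Python sketch below illustrates that idea only; it is not the library's implementation, and the names build_vocabulary and vectorize are hypothetical.

```python
def build_vocabulary(feature_dicts):
    # Assign each distinct feature name a column index, in sorted order.
    names = sorted({name for d in feature_dicts for name in d})
    return {name: idx for idx, name in enumerate(names)}

def vectorize(feature_dict, vocabulary):
    # Dense row of counts; features missing from the vocabulary are ignored.
    row = [0] * len(vocabulary)
    for name, weight in feature_dict.items():
        if name in vocabulary:
            row[vocabulary[name]] = weight
    return row

samples = [{"good": 2, "toy": 1}, {"bad": 1, "toy": 1}]
vocab = build_vocabulary(samples)
matrix = [vectorize(d, vocab) for d in samples]
print(vocab)
print(matrix)
```

With tens of thousands of distinct tokens, most entries in each row are zero, which is why real libraries typically store such rows in a sparse format.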

# Question 3

In [7]:
from sklearn.model_selection import KFold
from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import accuracy_score
crossValidateActual=[]
def crossValidate(dataset, folds):
    shuffle(dataset)
    cv_results = []
    foldSize = int(len(dataset)/folds)
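    # DESCRIBE YOUR METHOD IN WORDS
    for i in range(0,len(dataset),foldSize):
        TestData = dataset[i:i+foldSize]                 # the held-out fold
        TrainData = dataset[:i] + dataset[i+foldSize:]   # everything outside the fold
        classifier = trainClassifier(TrainData)
        Actual = [x[1] for x in TestData]
        PredictedLabels = predictLabels(TestData, classifier)
        # NOTE: cv_results is reassigned on every iteration, so only the last fold's scores are kept
        cv_results = (precision_recall_fscore_support(Actual, PredictedLabels, average='weighted'))
    # Accuracy is printed once, after the loop, so it also refers to the last fold only
    print("Accuracy: %f" % accuracy_score(Actual,PredictedLabels))
    return cv_results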

The crossValidate function takes two arguments: the dataset and the number of folds (10 in the call made from the main cell). It first shuffles the dataset in place, then computes the fold size by dividing the length of the dataset by the number of folds. A for loop steps from 0 to the length of the dataset in increments of the fold size. On each iteration the current slice becomes TestData, the rest becomes TrainData, and trainClassifier is run on TrainData. Actual holds the true labels of the test fold, and PredictedLabels holds the labels that predictLabels returns for it. Precision, recall and F score are computed with weighted averaging. Because cv_results is reassigned on every iteration and the accuracy is printed after the loop, the values returned and printed are those of the last fold, not an average over all folds.
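
One common variant collects a score for every fold and reports the mean. The sketch below shows that pattern in plain Python under stated assumptions: k_fold_average and the evaluate callback are hypothetical names, and the toy scoring function stands in for training and scoring a real classifier.

```python
def k_fold_average(dataset, folds, evaluate):
    # evaluate(train, test) is a hypothetical callback returning one score per fold.
    fold_size = len(dataset) // folds
    scores = []
    for k in range(folds):
        start = k * fold_size
        # The last fold absorbs any remainder so every sample is tested exactly once.
        end = len(dataset) if k == folds - 1 else start + fold_size
        test = dataset[start:end]
        train = dataset[:start] + dataset[end:]
        scores.append(evaluate(train, test))
    return sum(scores) / len(scores)

# Toy evaluation: fraction of test items that are even numbers.
data = list(range(10))
mean_score = k_fold_average(
    data, 5, lambda train, test: sum(x % 2 == 0 for x in test) / len(test)
)
print(mean_score)
```

Iterating over the fold index instead of stepping through the data by fold size also guarantees exactly `folds` iterations when the dataset length is not a multiple of the fold count.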

In [8]:
# PREDICTING LABELS GIVEN A CLASSIFIER

def predictLabels(reviewSamples, classifier):
    return classifier.classify_many(map(lambda t: t[0], reviewSamples))

def predictLabel(reviewSample, classifier):
    return classifier.classify(toFeatureVector(preProcess(reviewSample)))

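The functions above predict labels with a trained classifier. predictLabels takes a list of samples that have already been converted to (feature vector, label) pairs, passes the feature vectors to classify_many, and returns the predicted labels. predictLabel takes the raw text of a single review, runs it through preProcess and toFeatureVector, and returns one predicted label.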
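
The input shape expected by the batch function can be illustrated with a stand-in classifier. MajorityStub below is a hypothetical stub that only mimics the classify and classify_many method names used above; it is not a trained model.

```python
class MajorityStub:
    # Hypothetical stand-in exposing the two methods used above; it always answers 'real'.
    def classify(self, featureset):
        return 'real'

    def classify_many(self, featuresets):
        return [self.classify(fs) for fs in featuresets]

def predict_labels(samples, classifier):
    # samples are (feature_dict, label) pairs; only the feature dicts are passed on.
    return classifier.classify_many([fs for fs, _ in samples])

samples = [({"good": 1}, "real"), ({"bad": 2}, "fake")]
predicted = predict_labels(samples, MajorityStub())
print(predicted)
```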

In [9]:
# MAIN

# loading reviews
# initialize global lists that will be appended to by the methods below
rawData = []          # the filtered data from the dataset file (should be 21000 samples)
trainData = []        # the pre-processed training data as a percentage of the total dataset (currently 80%, or 16800 samples)
testData = []         # the pre-processed test data as a percentage of the total dataset (currently 20%, or 4200 samples)

# the output classes
fakeLabel = 'fake'
realLabel = 'real'

# references to the data files
reviewPath = 'amazon_reviews.txt'

# Do the actual stuff (i.e. call the functions we've made)
# We parse the dataset and put it in a raw data list
print("Now %d rawData, %d trainData, %d testData" % (len(rawData), len(trainData), len(testData)),
      "Preparing the dataset...",sep='\n')
loadData(reviewPath) 

# We split the raw dataset into a set of training data and a set of test data (80/20)
# You do the cross validation on the 80% (training data)
# We print the number of training samples and the number of features before the split
print("Now %d rawData, %d trainData, %d testData" % (len(rawData), len(trainData), len(testData)),
      "Preparing training and test data...",sep='\n')
splitData(0.8)
# We print the number of training samples and the number of features after the split
print("After split, %d rawData, %d trainData, %d testData" % (len(rawData), len(trainData), len(testData)),
      "Training Samples: ", len(trainData), "Features: ", len(featureDictglobal), sep='\n')

# QUESTION 3 - Make sure there is a function call here to the
# crossValidate function on the training set to get your results
validationResults=crossValidate(trainData,10)
print("Precision: %f\nRecall: %f\nF Score:%f" % validationResults[:3])

Now 0 rawData, 0 trainData, 0 testData
Preparing the dataset...
Now 21000 rawData, 0 trainData, 0 testData
Preparing training and test data...
After split, 21000 rawData, 16800 trainData, 4200 testData
Training Samples: 
16800
Features: 
89043
Training Classifier...
Training Classifier...
Training Classifier...




Training Classifier...
Training Classifier...
Training Classifier...
Training Classifier...
Training Classifier...
Training Classifier...
Training Classifier...
Accuracy: 0.648214
Precision: 0.650270
Recall: 0.648214
F Score:0.648442


In the main cell above, the output class names fakeLabel and realLabel are defined, and the global lists rawData, trainData and testData are initialized so that the functions can append to them. The dataset is parsed into the rawData list, which is then split into training data and test data at 80 and 20 percent respectively. The sizes of the three lists are printed before loading, after loading and after the split, together with the number of features. Finally, the crossValidate function is called on the training data with 10 folds, and the resulting precision, recall and F score are printed.
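
The sizes reported after the split follow from the slicing in splitData, which takes the same fraction from the first and from the second half of rawData. The sketch below reproduces only the index arithmetic; split_indices is a hypothetical helper, and it works on positions instead of reviews.

```python
def split_indices(n_samples, percentage):
    # Mirrors splitData: take the first `percentage` of each half for training.
    half = int(n_samples / 2)
    training = int((percentage * n_samples) / 2)
    train = list(range(0, training)) + list(range(half, half + training))
    test = list(range(training, half)) + list(range(half + training, n_samples))
    return train, test

train_idx, test_idx = split_indices(21000, 0.8)
print(len(train_idx), len(test_idx))
```

If the file groups the two classes into those two halves (an assumption, since the notebook does not say so), this scheme keeps the class proportions equal in the training and test sets.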

# Evaluate on test set

In [10]:
# Finally, check the accuracy of your classifier by training on all the training data
# and testing on the test set
# Will only work once all functions are complete
functions_complete = True  # set to True once you're happy with your methods for cross val
if functions_complete:
    print(testData[0])   # have a look at the first test data instance
    classifier = trainClassifier(trainData)  # train the classifier
    TTrue = [t[1] for t in testData]   # get the ground-truth labels from the data
    Pred = predictLabels(testData, classifier)  # classify the test data to get predicted labels
    finalScores = precision_recall_fscore_support(TTrue, Pred, average='weighted') # evaluate
    print("Done training!")
    print("Precision: %f\nRecall: %f\nF Score:%f" % finalScores[:3])
    print("Accuracy: %f" % accuracy_score(TTrue, Pred))

({'This': 1, 'assortment': 1, 'is': 1, 'really': 1, "Hershey's": 1, 'at': 1, 'their': 1, 'best.': 1, 'The': 1, 'little': 1, 'ones': 1, 'are': 1, 'always': 1, 'excited': 1, 'whenever': 1, 'the': 1, 'holidays': 1, 'come': 1, 'because': 1, 'of': 1, 'this.': 1}, '__label1__')
Training Classifier...
Done training!
Precision: 0.615498
Recall: 0.615476
F Score:0.615458
Accuracy: 0.615476


Here the classifier is checked by training on all the training data and then testing on the held-out test set.
The cell only runs once functions_complete is set to True. It prints the first test instance, trains the classifier on the training data, collects the ground-truth labels from the test data, predicts labels for the test data, and prints the precision, recall, F score and accuracy.
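
In both sets of results above, the recall equals the accuracy (0.648214 and 0.615476). This is expected with weighted averaging: each class's recall is weighted by its share of the true labels, so the sum reduces to the fraction of correct predictions. The sketch below checks this on toy labels; the helper names and the label lists are illustrative assumptions.

```python
from collections import Counter

def accuracy(actual, predicted):
    # Fraction of positions where the prediction matches the true label.
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

def weighted_recall(actual, predicted):
    support = Counter(actual)  # number of true samples per class
    total = 0.0
    for label, count in support.items():
        # Correct predictions among the samples whose true class is `label`.
        hits = sum(a == p == label for a, p in zip(actual, predicted))
        total += (count / len(actual)) * (hits / count)
    return total

actual = ['fake', 'fake', 'real', 'real', 'real']
predicted = ['fake', 'real', 'real', 'real', 'fake']
acc = accuracy(actual, predicted)
rec = weighted_recall(actual, predicted)
print(acc, rec)
```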

# Questions 4 and 5
Once you're happy with your functions for Questions 1 to 3, it's advisable you make a copy of this notebook to make a new notebook, and then within it adapt and improve all three functions in the ways asked for in questions 4 and 5.