# Car Reviews

## Introduction

This workbook illustrates how a classifier can be built to conduct sentiment analysis on car reviews. Each review is labelled either ‘Pos’ or ‘Neg’ to indicate whether the sentiment it expresses has been assessed as positive or negative. The workbook begins by preparing the data so that a general-purpose model can classify the reviews. In part one a Multinomial Naive Bayes model is used to classify the reviews. In part two a new model is introduced in an attempt to improve upon the results from part one.

In [1]:
#Import modules
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split

from nltk.stem import PorterStemmer
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score, cross_val_predict, GridSearchCV
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.pipeline import Pipeline

## Load the data

In [2]:
df = pd.read_csv('car-reviews.csv')

In [3]:
df.head()

Unnamed: 0,Sentiment,Review
0,Neg,In 1992 we bought a new Taurus and we really ...
1,Neg,The last business trip I drove to San Franci...
2,Neg,My husband and I purchased a 1990 Ford F250 a...
3,Neg,I feel I have a thorough opinion of this truc...
4,Neg,AS a mother of 3 all of whom are still in ca...


In [4]:
#The positive and negative classes are equally balanced
df['Sentiment'].value_counts()

Neg    691
Pos    691
Name: Sentiment, dtype: int64

## Preprocessing

Start by converting the positive and negative reviews into a binary encoding of 0 and 1 where Negative is 0 and Positive is 1.

In [5]:
#Binary coding
df['Sentiment'] = np.where(df['Sentiment']=='Neg', 0, 1)
df['Sentiment'].value_counts()

0    691
1    691
Name: Sentiment, dtype: int64

The next step is to split the data into training and testing sets. This is done at the beginning of the workbook to prevent any data leakage.  

The data is randomly split into 80% training and 20% testing. A seed (55) is used so that the results can be reproduced.

In [6]:
X_train, X_test, y_train, y_test = train_test_split(df.Review, df.Sentiment, test_size=0.2, random_state = 55)

In [7]:
print(f'Train dimensions: {X_train.shape, y_train.shape}')
print(f'Test dimensions: {X_test.shape, y_test.shape}')

# Check out target distribution
print(y_train.value_counts())
print(y_test.value_counts())

Train dimensions: ((1105,), (1105,))
Test dimensions: ((277,), (277,))
1    556
0    549
Name: Sentiment, dtype: int64
0    142
1    135
Name: Sentiment, dtype: int64


Rounded to the nearest percent, the split is 80% training (1105 reviews) and 20% test (277 reviews). Note, however, that within each set the split between positive and negative reviews is no longer exactly 50:50.
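If an exact class balance in both sets were wanted, the split could be stratified by label. The sketch below is a hand-rolled illustration using only the standard library; `stratified_split` is a hypothetical helper, not part of this workbook. In scikit-learn the same effect is available through the `stratify` argument of `train_test_split`. The workbook did not use it, so the numbers reported here come from the unstratified split.

```python
import random
from collections import Counter

def stratified_split(labels, test_size=0.2, seed=55):
    """Return (train_idx, test_idx) with the same class ratio in both parts."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same fraction of every class for the test set
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels mirroring the 691/691 class balance of the review data
labels = [0] * 691 + [1] * 691
train_idx, test_idx = stratified_split(labels)

print(Counter(labels[i] for i in train_idx))
print(Counter(labels[i] for i in test_idx))
```

Each class contributes the same number of reviews to the test set, so both sets stay at 50:50.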

## Part One

# 1.

### Removal of words and punctuation that do not affect sentiment, and conversion of all words to lower case.

In [8]:
# Create an instance of RegexpTokenizer for alphanumeric tokens
tokeniser = RegexpTokenizer(r'\w+')

To demonstrate how stemming will be used without altering the original training set, a copy of X_train is created.

In [9]:
demo =  X_train.copy()
demo.reset_index(inplace=True, drop=True)

In [10]:
demo

0        im not exactly an unbiased author  ive owned ...
1        I am a young woman and a first time car buyer...
2        This vehicle was a first time lease for my hu...
3        Currently own Dark Wedgewood  Eddie Bauer Edi...
4        We bought this vehicle 6 months ago after the...
                              ...                        
1100     I bought this car new in 1997 to replace my N...
1101     I bought this car as a repairable for  3k and...
1102     I am a soccer mom  And I got this car at a di...
1103     This car is terrible  It is falling apart and...
1104      NOTE read this part of the review first then...
Name: Review, Length: 1105, dtype: object

The processing and stemming of each review is first demonstrated on the first review only. The same process is then scaled up and applied to all the reviews. As the data set is large, only the first review is used at the demo stage, and examples of lower casing and stemming are shown on this review.

In [11]:
# Uncomment to print out the first review
# print(demo[0])

### Sample of how tokenisation, stemming and removal of stopwords is conducted.  

A function called sample_stem is created for the purpose of this demo. A more general-purpose function that performs the same steps is created later when processing all the data.  

#### Tokenisation

In this function the first step is to take the entire review, which is currently one string, and tokenise it. This places each word into its own string.  
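As a rough illustration of what the `\w+` pattern does, the sketch below uses the standard-library `re` module rather than NLTK. The sample sentence and the `tokenise` helper are made up for this example.

```python
import re

def tokenise(text):
    """Return every maximal run of word characters (letters, digits, underscore)."""
    return re.findall(r'\w+', text)

# Hypothetical sentence, not taken from the review data
sample = "I've owned 3 Mustangs; they're great!"
print(tokenise(sample))
```

Punctuation is dropped and contractions are split at the apostrophe, so "I've" becomes the two tokens 'I' and 've'.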

#### Stemming

Stemming essentially trims a word down in an attempt to reach the 'stem' of the word. In this case the Porter stemming algorithm is used (Porter, 2006). Not all stems are real English words, so they may not make sense to the reader. The purpose of the algorithm is instead to find a common stem between similar words. As the examples from the first review show below, many variations of the same word can be condensed into one stem.  

#### Lower case

Lower casing is done in the same step as stemming: every word is converted to lower case.  

#### Removal of stop words 

In this step the nltk stopwords corpus is used. It contains 179 English stop words; for example, 'we' and 'you' are considered stop words that contribute little to the sentiment of a review. These stop words are then stripped from each review.

In [12]:
def sample_stem(review):
    tokens = tokeniser.tokenize(review)
    porter = PorterStemmer()
    #Lower case and stemming is done in one step
    stems = [porter.stem(token.lower()) for token in tokens]
    print("Number of stems: ")
    print(len(stems))
    #Build the stop word set once, then filter the stems with a list comprehension
    stop_words = set(stopwords.words('english'))
    keywords = [stem for stem in stems if stem not in stop_words]
    print('The first 10 keywords are now: ')
    print(keywords[:10])
    print('The number of keywords is: ')
    print(len(keywords))
    return tokens, keywords

In [13]:
first_review = sample_stem(demo[0])

Number of stems: 
2068
The first 10 keywords are now: 
['im', 'exactli', 'unbias', 'author', 'ive', 'corral', 'late', 'model', 'perform', 'mustang']
The number of keywords is: 
1213


# 2.

### Words with the same stem are treated as variations of the same stem. 

#### Demonstration on 3 different stems:

- mustang
- comfort
- servic

In [14]:
# Count each keyword from previous step
post_stem = {word: first_review[1].count(word) for word in set(first_review[1])}

In [15]:
# Compare with pre stem 
pre_stem = {word: first_review[0].count(word) for word in set(first_review[0])}

In [16]:
#Display the key and value
print('With stemming:')
print("")
keys = ['mustang','mustangs']
for key in keys:
    print(key,'=', post_stem.get(key))

With stemming:

mustang = 14
mustangs = None


In [17]:
print('Without stemming:')
print("")
keys = ['Mustang','Mustangs']
for key in keys:
    print(key,'=', pre_stem.get(key))

Without stemming:

Mustang = 8
Mustangs = 6


In [18]:
#Display the key and value
print('With stemming:')
print("")
keys = ['comfort','comfortable', 'Comfort', 'Comfortable']
for key in keys:
    print(key,'=', post_stem.get(key))

With stemming:

comfort = 2
comfortable = None
Comfort = None
Comfortable = None


In [19]:
print('Without stemming:')
print("")
keys = ['comfort','comfortable', 'Comfort', 'Comfortable']
for key in keys:
    print(key,'=', pre_stem.get(key))

Without stemming:

comfort = None
comfortable = 1
Comfort = 1
Comfortable = None


In [20]:
print('With stemming:')
print("")
keys = ['servicing','service','servic','services']
for key in keys:
    print(key,'=', post_stem.get(key))

With stemming:

servicing = None
service = None
servic = 4
services = None


In [21]:
print('Without stemming:')
print("")
keys = ['servicing','service','servic','services']
for key in keys:
    print(key,'=', pre_stem.get(key))

Without stemming:

servicing = 1
service = 2
servic = None
services = 1
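The counts printed above can be tallied to check that stemming and lower casing simply merge the surface forms. This is a minimal sketch: the `stem_of` lookup is hand-written from the outputs above and stands in for `PorterStemmer().stem(token.lower())`, so it covers only the words in this example.

```python
from collections import Counter

# Surface-form counts copied from the "Without stemming" outputs above
pre_stem_counts = Counter({
    'Mustang': 8, 'Mustangs': 6,
    'comfortable': 1, 'Comfort': 1,
    'servicing': 1, 'service': 2, 'services': 1,
})

# Hand-written stand-in for the Porter stemmer (assumption: only these words)
stem_of = {
    'Mustang': 'mustang', 'Mustangs': 'mustang',
    'comfortable': 'comfort', 'Comfort': 'comfort',
    'servicing': 'servic', 'service': 'servic', 'services': 'servic',
}

post_stem_counts = Counter()
for word, count in pre_stem_counts.items():
    post_stem_counts[stem_of[word]] += count

print(post_stem_counts)
```

The merged totals (14, 2 and 4) match the "With stemming" outputs, which is what lets the vectoriser treat each group as a single feature.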


Now that we have demonstrated how stemming works, we create a function that takes text as input and can transform the full training set by tokenising, removing punctuation, converting to lower case, stemming and removing stop words.  

Placing these steps in a function will help with the pipeline when the process is repeated on the testing set.

In [22]:
def preprocess_text(text):
    '''
    Take some text as input and return a list of lower-cased, stemmed tokens
    with punctuation and stop words removed.
    '''
    # Tokenise words while ignoring punctuation
    tokeniser = RegexpTokenizer(r'\w+')
    tokens = tokeniser.tokenize(text)

    # Lower case, then Porter stemming
    porter = PorterStemmer()
    stems = [porter.stem(token.lower()) for token in tokens]

    # Remove stop words (build the set once per call for fast lookups)
    stop_words = set(stopwords.words('english'))
    keywords = [stem for stem in stems if stem not in stop_words]
    return keywords

# 3.

### Output to illustrate a vector for each review has been created.  

Each element in the vector is either a binary variable indicating the presence of a word/stem or the number of times it appears. Only a small sample of reviews will be displayed.  

Prior to placing the elements into a vector it is helpful to transform each word into a score that better reflects how important the word is in the context of the review. In this workbook term frequency-inverse document frequency (Tf-IDf) is used.  

The Tf-IDf score given to each word in every review is based on how many times the word appears in that review, and is then reduced when the word also occurs in other reviews.

### Term frequency-inverse document frequency formula:

 The formula for the term frequency is as follows:  

$$tf(w,d) = \log(1+f(w,d))$$

where $d$ is the document (in this case a review), $w$ is the word and $f(w,d)$ is the number of times $w$ appears in $d$.  

The inverse document frequency is then calculated:  

$$idf(w, D) = \log\left(\frac{N}{f(w,D)}\right)$$

where $N$ is the number of reviews, $D$ is the collection of all reviews and $f(w,D)$ is the number of reviews in $D$ that contain $w$.  

Lastly, the score is:  


$$tfidf(w,d,D) = tf(w,d) \times idf(w,D)$$  
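The three formulas can be sketched directly in plain Python. The toy corpus below is hypothetical, and the functions follow the formulas as written here. Note that `TfidfVectorizer` with default settings uses a slightly different variant (raw counts for the term frequency, a smoothed idf of $\ln\frac{1+N}{1+f(w,D)} + 1$, and L2 normalisation of each row), so the scores it produces will not match these values exactly.

```python
import math

# Toy corpus of already-tokenised reviews (hypothetical data)
docs = [
    ['mustang', 'fast', 'mustang'],
    ['car', 'fast'],
    ['car', 'slow', 'car'],
]

def tf(word, doc):
    """tf(w, d) = log(1 + f(w, d)), with f the raw count of w in d."""
    return math.log(1 + doc.count(word))

def idf(word, docs):
    """idf(w, D) = log(N / f(w, D)); assumes w occurs in at least one doc."""
    n_containing = sum(1 for doc in docs if word in doc)
    return math.log(len(docs) / n_containing)

def tfidf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

# 'mustang' is frequent in the first review and rare elsewhere;
# 'fast' also appears in a second review, so its score is lower
for word in ['mustang', 'fast']:
    print(word, round(tfidf(word, docs[0], docs), 4))
```

A word that appears in every review gets an idf of $\log(1) = 0$ under this formula, so it contributes nothing to the score.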

This is now applied to the training data:

In [23]:
# Create an instance of TfidfVectorizer
vectoriser = TfidfVectorizer(analyzer=preprocess_text)
# Fit to the data and transform to feature matrix
X_train_tfidf = vectoriser.fit_transform(X_train)
X_train_tfidf.shape

(1105, 10023)

 A sample of the thirty highest scores in the vector for the first training review can be seen below. 

In [24]:
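# Note: get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out()
df_tfidf = pd.DataFrame(X_train_tfidf[0].T.todense(), index=vectoriser.get_feature_names(), columns=["Tf-IDf"])
df_tfidf = df_tfidf.sort_values('Tf-IDf', ascending=False)
print(df_tfidf.head(30))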

             Tf-IDf
6l         0.221023
amp        0.201446
mustang    0.198180
gt         0.153084
perform    0.146717
hatchback  0.124528
exclud     0.120055
model      0.117719
0l         0.114280
4          0.111049
91         0.108844
2000       0.103618
system     0.103168
packag     0.094137
25k        0.090041
rear       0.089710
gaug       0.088516
rev        0.088416
option     0.087536
structur   0.085307
curiou     0.085307
style      0.084744
rigid      0.080288
get        0.080132
im         0.079742
provid     0.079349
mayb       0.079120
wheel      0.078813
due        0.076333
rpm        0.075411


# 4. 

### Use the Multinomial Naive Bayes Model to classify the car reviews

This is a probabilistic learning method that is commonly applied to text analysis. It is based on Bayes' theorem and is used here to predict whether the sentiment of a car review is positive or negative.  

The model is used with its [defaults](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html). In this case $\alpha = 1.0$ gives Laplace smoothing, which ensures that a word which never appears in the training reviews of one class is not assigned a probability of zero for that class.
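To see what the smoothing parameter does, here is a minimal sketch of the smoothed likelihood $P(w \mid c) = \frac{\text{count}(w, c) + \alpha}{\text{total}(c) + \alpha |V|}$ on a made-up vocabulary. The token lists and the `smoothed_likelihood` helper are hypothetical. `MultinomialNB` applies the same idea to per-class feature totals, which in this workbook are Tf-IDf weights rather than integer counts.

```python
from collections import Counter

# Hypothetical stemmed tokens pooled per class
tokens_by_class = {
    'neg': ['terribl', 'break', 'terribl', 'repair'],
    'pos': ['love', 'comfort', 'reliabl'],
}
vocabulary = sorted({t for tokens in tokens_by_class.values() for t in tokens})

def smoothed_likelihood(word, label, alpha=1.0):
    """P(word | label) = (count + alpha) / (total + alpha * |V|)."""
    counts = Counter(tokens_by_class[label])
    total = sum(counts.values())
    return (counts[word] + alpha) / (total + alpha * len(vocabulary))

# 'love' never occurs in a negative review, yet with alpha = 1 its
# probability is non-zero; with alpha = 0 it collapses to zero
print(smoothed_likelihood('love', 'neg'))
print(smoothed_likelihood('love', 'neg', alpha=0.0))
```

Without smoothing, a single unseen word would force the whole product of likelihoods for that class to zero, regardless of the other words in the review.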

In [25]:
M_NB = MultinomialNB()
mnb_clf = M_NB.fit(X_train_tfidf, y_train)

In [26]:
mnb_clf_scores = cross_val_score(mnb_clf, X_train_tfidf, y_train, cv=5)
print(mnb_clf_scores)
print("Accuracy: %0.2f (+/- %0.2f)" % (mnb_clf_scores.mean(), mnb_clf_scores.std() * 2))

[0.760181   0.79638009 0.78280543 0.74660633 0.79638009]
Accuracy: 0.78 (+/- 0.04)


The initial model has a mean cross-validated accuracy of 78%, with a spread of +/- 4 percentage points (two standard deviations across the five folds).
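The summary line printed above is the mean of the five fold scores plus or minus two standard deviations. The sketch below reproduces it from the printed scores using only the standard library; `pstdev` is the population standard deviation, which is also what NumPy's `.std()` computes by default.

```python
from statistics import mean, pstdev

# Fold accuracies copied from the cross-validation output above
fold_scores = [0.760181, 0.79638009, 0.78280543, 0.74660633, 0.79638009]

centre = mean(fold_scores)
spread = 2 * pstdev(fold_scores)

print(f"Accuracy: {centre:.2f} (+/- {spread:.2f})")
print(f"Range: {centre - spread:.3f} to {centre + spread:.3f}")
```

The spread describes how much the accuracy varies between folds; it is not a statement about how often the model is right.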

In [27]:
mnb_clf_pred = cross_val_predict(mnb_clf, X_train_tfidf, y_train, cv=5)

# 5.

### Labelled confusion matrix showing the performance of the Naive Bayes classifier.  

In [28]:
tn_NBtraining, fp_NBtraining, fn_NBtraining, tp_NBtraining = confusion_matrix(y_train, mnb_clf_pred).ravel()

In [29]:
print("Number of True positives:", tp_NBtraining)
print("Number of True negatives:", tn_NBtraining)
print("Number of False positives:", fp_NBtraining)
print("Number of False negatives:", fn_NBtraining)

Number of True positives: 467
Number of True negatives: 391
Number of False positives: 158
Number of False negatives: 89
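The four counts above can be condensed into the usual summary metrics. This is a small sketch with a hypothetical `summarise` helper; precision and recall are computed for the positive class.

```python
def summarise(tp, tn, fp, fn):
    """Return accuracy, precision and recall for the positive class."""
    total = tp + tn + fp + fn
    return {
        'accuracy': (tp + tn) / total,
        'precision': tp / (tp + fp),
        'recall': tp / (tp + fn),
    }

# Cross-validated counts on the training set, copied from the output above
train_cv = summarise(tp=467, tn=391, fp=158, fn=89)
for name, value in train_cv.items():
    print(f"{name}: {value:.3f}")
```

The accuracy agrees with the cross-validated mean of about 78%. Recall is higher than precision because the model produces more false positives (158) than false negatives (89), meaning it leans towards predicting a positive review.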


Use the test data to see how the model performs on an unseen test set. 

In [30]:
pipe = Pipeline([('vectoriser', vectoriser),
                 ('classifier', mnb_clf)])
pipe.fit(X_train, y_train)

Pipeline(steps=[('vectoriser',
                 TfidfVectorizer(analyzer=<function preprocess_text at 0x7fa988e9e1f0>)),
                ('classifier', MultinomialNB())])

In [31]:
y_test_pred_NB = pipe.predict(X_test)
print("Accuracy: %0.2f" % (accuracy_score(y_test, y_test_pred_NB)))

Accuracy: 0.81


Here we can see that the Naive Bayes model achieves an accuracy of 81% on the test set. Although this is higher than the mean cross-validated accuracy on the training set, it is still within two standard deviations of it. Overall, this is a good result, especially since the test results closely match the training results.

In [32]:
tn_NBtest, fp_NBtest, fn_NBtest, tp_NBtest = confusion_matrix(y_test, y_test_pred_NB).ravel()

In [33]:
#The confusion matrix
print("Number of True positives:", tp_NBtest)
print("Number of True negatives:", tn_NBtest)
print("Number of False positives:", fp_NBtest)
print("Number of False negatives:", fn_NBtest)

Number of True positives: 111
Number of True negatives: 112
Number of False positives: 30
Number of False negatives: 24


## Part Two

# 1.

### Using stochastic gradient descent to find a better performing model to classify the car reviews.

In part two stochastic gradient descent will be used in an attempt to fit multiple different models with a range of different hyperparameters. The goal is to select the best performing model on the training data. This will then be compared to the original Multinomial Naive Bayes model from part one.

[Stochastic gradient descent](https://scikit-learn.org/stable/modules/sgd.html) works by taking a single randomly chosen training point at each iteration and computing the gradient of the loss at that point. This serves as an approximation to the true gradient of the training error $E(w,b)$ over all the data. At each iteration the model parameters are updated as follows:

$$w \leftarrow w - \eta \left[\alpha \frac{\partial R(w)}{\partial w}
+ \frac{\partial L(w^T x_i + b, y_i)}{\partial w}\right]$$

where $\eta$ is the learning rate, $b$ is the intercept, $R$ is the regularisation (penalty) term, $L$ is the loss function and $\alpha$ is a non-negative regularisation hyperparameter.
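A single update of this rule can be written out in a few lines. The sketch below assumes the hinge loss, an L2 penalty $R(w) = \frac{1}{2}\lVert w \rVert^2$, labels coded as -1 or +1, and a fixed learning rate; these are illustrative choices, and `SGDClassifier` itself uses an adaptive learning rate schedule by default. The `sgd_step` helper is hypothetical.

```python
def sgd_step(w, b, x, y, eta=0.1, alpha=0.0001):
    """One update for hinge loss with an L2 penalty R(w) = 0.5 * ||w||^2.

    x is a feature vector and y is the label coded as -1 or +1.
    """
    margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
    if margin < 1:
        # Hinge loss is active: dL/dw = -y * x and dL/db = -y
        grad_w = [-y * xi for xi in x]
        grad_b = -y
    else:
        # Point is outside the margin: the loss contributes no gradient
        grad_w = [0.0 for _ in x]
        grad_b = 0.0
    # w <- w - eta * (alpha * dR/dw + dL/dw), with dR/dw = w
    new_w = [wi - eta * (alpha * wi + gi) for wi, gi in zip(w, grad_w)]
    new_b = b - eta * grad_b
    return new_w, new_b

w, b = [0.0, 0.0], 0.0
w, b = sgd_step(w, b, x=[1.0, 2.0], y=1)
print(w, b)
```

Starting from zero weights, one misclassified positive point moves the weights in the direction of its feature vector. Repeating the step over randomly chosen points is what drives the fit.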

# 2.

### Stochastic gradient descent with its default settings:

In the default case, the loss function used by stochastic gradient descent is the hinge loss, which gives a linear Support Vector Machine (SVM). The data points closest to the hyperplane are called support vectors, and they influence the position and orientation of the hyperplane (Suykens and Vandewalle, 1999). The default penalty is $l2$, which shrinks coefficients but does not eliminate any of them. The constant $\alpha$ that multiplies the regularization term is 0.0001. The intercept is set to True and will be fitted. Early stopping is False, meaning that no validation split is held out to halt training when the validation score stops improving.  

Here there is no need to scale the data: Tf-IDf scores are already on a comparable scale, and `TfidfVectorizer` normalises each review's vector to unit length by default.
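For intuition, the default hinge loss can be compared with two other losses that `SGDClassifier` also supports, the squared hinge and the logistic (log) loss. The functions below are a plain-Python sketch written in terms of the margin $y(w^T x + b)$ with $y$ coded as -1 or +1; the function names are chosen for this example.

```python
import math

def hinge(margin):
    """Linear SVM loss: zero once the margin reaches 1."""
    return max(0.0, 1.0 - margin)

def squared_hinge(margin):
    """Hinge loss squared: penalises large violations more heavily."""
    return max(0.0, 1.0 - margin) ** 2

def log_loss(margin):
    """Logistic regression loss: small for large margins but never exactly zero."""
    return math.log(1.0 + math.exp(-margin))

# A negative margin means the point is on the wrong side of the hyperplane
for margin in (-1.0, 0.0, 1.0, 2.0):
    print(margin, hinge(margin), squared_hinge(margin), round(log_loss(margin), 4))
```

With the hinge loss only points with a margin below 1 influence the fit, which is why the hyperplane depends on the points closest to it.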

In [34]:
# fit the baseline model
sgd_clf = SGDClassifier(random_state=55)
sgf_clf_scores = cross_val_score(sgd_clf, X_train_tfidf, y_train, cv=10)
print(sgf_clf_scores)
print("Accuracy: %0.2f (+/- %0.2f)" % (sgf_clf_scores.mean(), sgf_clf_scores.std() * 2))

[0.79279279 0.76576577 0.77477477 0.86486486 0.81081081 0.82727273
 0.79090909 0.78181818 0.78181818 0.79090909]
Accuracy: 0.80 (+/- 0.06)


In [35]:
sgf_clf_pred = cross_val_predict(sgd_clf, X_train_tfidf, y_train, cv=5)
print(confusion_matrix(y_train, sgf_clf_pred))

[[429 120]
 [108 448]]


# 3.

### Using grid search cross validation to improve the results of the SGD Classifier.

In [36]:
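# Note: scikit-learn 1.1+ names the logistic loss 'log_loss'; 'log' matches the version used for the output below
grid = {'fit_intercept': [True,False],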
        'early_stopping': [True, False],
        'loss' : ['hinge', 'log', 'squared_hinge'],
        'penalty' : ['l2', 'l1', 'none']}
search = GridSearchCV(estimator=sgd_clf, param_grid=grid, cv=5)
search.fit(X_train_tfidf, y_train)
search.best_params_

{'early_stopping': True, 'fit_intercept': True, 'loss': 'log', 'penalty': 'l2'}

After testing a range of different parameters the best loss function is now the logistic regression loss, with early stopping (fit intercept and penalty $l2$ remain unchanged).
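A grid search simply tries every combination of the listed values. The standard-library sketch below enumerates the same grid to show how many candidates are compared; it does not fit any models.

```python
from itertools import product

grid = {
    'fit_intercept': [True, False],
    'early_stopping': [True, False],
    'loss': ['hinge', 'log', 'squared_hinge'],
    'penalty': ['l2', 'l1', 'none'],
}

# Every combination of one value per hyperparameter
combinations = [dict(zip(grid, values)) for values in product(*grid.values())]
n_folds = 5

print(len(combinations), "candidate settings")
print(len(combinations) * n_folds, "model fits during cross-validation")
print(combinations[0])
```

With 36 candidates and 5 folds the search fits 180 models, plus one final refit of the best setting on the full training set. Choosing the winner from that many candidates on the same folds is one reason its cross-validated score can be optimistic.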

In [37]:
grid_sgd_clf_scores = cross_val_score(search.best_estimator_, X_train_tfidf, y_train, cv=5)
print(grid_sgd_clf_scores)
print("Accuracy: %0.2f (+/- %0.2f)" % (grid_sgd_clf_scores.mean(), grid_sgd_clf_scores.std() * 2))

[0.77375566 0.80995475 0.83257919 0.81447964 0.7918552 ]
Accuracy: 0.80 (+/- 0.04)


The mean accuracy of this new model on the training data is 80%, as before with the linear SVM. The spread across folds is now +/- 4% rather than the +/- 6% seen earlier, although the earlier figure came from 10-fold rather than 5-fold cross-validation, so the two are not directly comparable.

# 4.

### Looking at the test data results with the new model.

In [38]:
pipe2 = Pipeline([('vectoriser', vectoriser),
                 ('classifier', search.best_estimator_)])
pipe2.fit(X_train, y_train)

Pipeline(steps=[('vectoriser',
                 TfidfVectorizer(analyzer=<function preprocess_text at 0x7fa988e9e1f0>)),
                ('classifier',
                 SGDClassifier(early_stopping=True, loss='log',
                               random_state=55))])

In [39]:
y_test_pred_SGD = pipe2.predict(X_test)
print("Accuracy: %0.2f" % (accuracy_score(y_test, y_test_pred_SGD)))

Accuracy: 0.78


In [40]:
tn_SGDtest, fp_SGDtest, fn_SGDtest, tp_SGDtest = confusion_matrix(y_test, y_test_pred_SGD).ravel()
print("Number of True positives:", tp_SGDtest)
print("Number of True negatives:", tn_SGDtest)
print("Number of False positives:", fp_SGDtest)
print("Number of False negatives:", fn_SGDtest)

Number of True positives: 107
Number of True negatives: 108
Number of False positives: 34
Number of False negatives: 28


The test data results show an accuracy of 78%, with 107 true positives (positive reviews correctly identified), 108 true negatives (negative reviews correctly identified), 34 false positives and 28 false negatives.

Interestingly, while the SGD classifier performed better than Naive Bayes in cross-validation on the training data, its accuracy on the test data is 3 percentage points lower. It also has 4 more false positives (34 instead of the 30 from part one's Multinomial NB model) and 4 more false negatives (28 rather than 24).  

We expected better performance from fitting a range of support vector machine and logistic regression models with stochastic gradient descent and then using grid search with cross-validation to find the best parameters. However, the better results were only seen on the training data and did not hold up on the test data.  

These results suggest that in part two we may in fact be overfitting the training data: although we received better results there, this did not translate into a better model overall and subsequently led to more false positives and false negatives.
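To put the size of the gap in context, the sketch below attaches a rough binomial standard error to each test accuracy. It assumes each of the 277 test reviews is an independent trial, which is only an approximation; because both models are scored on the same reviews, a paired comparison would be more appropriate for a formal test. The `two_se` helper is hypothetical.

```python
import math

n_test = 277
# Correct predictions = true positives + true negatives from the outputs above
correct = {'MultinomialNB': 111 + 112, 'SGD (log loss)': 107 + 108}

def two_se(n_correct, n):
    """Two binomial standard errors of an accuracy estimated from n reviews."""
    acc = n_correct / n
    return 2 * math.sqrt(acc * (1 - acc) / n)

for name, n_correct in correct.items():
    print(f"{name}: {n_correct / n_test:.3f} +/- {two_se(n_correct, n_test):.3f}")

gap_reviews = correct['MultinomialNB'] - correct['SGD (log loss)']
gap = gap_reviews / n_test
print(f"Gap: {gap:.3f} ({gap_reviews} reviews)")
```

The two models differ on the test set by 8 reviews, which is smaller than the rough uncertainty on either accuracy, so the ranking of the models should be read with caution.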

## Conclusion

In this project we have shown how a supervised classification model can be used to classify car reviews. The initial data set contains 691 reviews labelled positive and 691 reviews labelled negative. This was split into a training set (80%) and a test set (20%). The data was then preprocessed to better enable a classification model to be fitted on the training set. Two models were used: the first a Multinomial Naive Bayes model and the second a Stochastic Gradient Descent model.

While the second model performs better on the training data, in practice the MultinomialNB model still gives a better result on the test data. This suggests that the new model is perhaps overfitting the training data and therefore does not hold up on the test data. Although it was expected that the second model would perform better on the training data, the "naive" element of the Naive Bayes model is perhaps what prevents it from overfitting the training data and ultimately lets it provide a better result on the test set. Overall, it must be pointed out that the test accuracy of both models is within the error margins estimated from the training data for the first and second models, which were within +/- 4%. 

## References  

Porter, M.F., 2006. An algorithm for suffix stripping. Program.  

Pedregosa, F. et al., 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, pp.2825-2830.  

Suykens, J.A. and Vandewalle, J., 1999. Least squares support vector machine classifiers. Neural Processing Letters, 9(3), pp.293-300.