## Introduction

When working with cleaned and tagged data for sentiment analysis, the question that tends to surface towards the end is which model to use. Models have different strengths and weaknesses, and some are better suited to particular tasks. Some, such as decision trees, make it easy to explain to a lay audience what the model is doing under the hood. While a neural network can also be drawn as a simple visual, the math, especially in complex networks, can quickly go over an audience's head. Moreover, if the model needs to explain why a result was reached, this may take longer to derive.

In text data, due to the high dimensionality, there are some models that are commonly used. This is because as dimensionality grows, so too does the time it takes the computer to converge on the final model. Naive Bayes, for one, can quickly and easily work through the high dimensionality and produce results. Along with Naive Bayes, the support vector machine is another model that can quickly produce results in various circumstances. The math, however, does take a bit longer to converge than Naive Bayes. The question is: can Support Vector Machines (SVM) outperform Naive Bayes in text sentiment analysis?

Over the next several paragraphs and some Python code, the two models will be compared, first by vectorizing the text to just unigrams and later with bigrams. Lastly, the SVM will be put through a larger test to see if it can produce better results by adjusting the cost parameter, which allows the model to become more generalizable and hopefully perform better on the larger dataset.

In [1]:
#@title Calling of the packages { display-mode: "form" }
%%capture
# import the packages
import pandas as pd
import os
import numpy as np
import nltk
import warnings
warnings.filterwarnings("ignore")
nltk.download(['stopwords','punkt','wordnet'])
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.feature_extraction.text import CountVectorizer
# explicitly require this experimental feature
from sklearn.experimental import enable_halving_search_cv # noqa
# now you can import normally from model_selection
from sklearn.model_selection import cross_val_predict, cross_val_score, train_test_split
os.chdir('/content/drive/MyDrive/Graduate School/IST736 Text Mining/Week 7/Week 7 Homework')

# The Data

The data for the testing tasks comes from a Kaggle data set of Twitter users who have posted about their airline travel experience. The data is already tagged as positive, negative, or neutral.

The data set contains a little over 14 thousand tweets for the analysis. The first check was how many positive, negative, and neutral tweets there were. The data set is heavily biased toward negative sentiment, with 9,178 negative tweets, 3,099 neutral, and only 2,363 positive.

In order to deal with the imbalance between classes, each record was assigned a weight: 1 for negative, 3 for neutral, and 4 for positive. The data was then sampled using the pandas sample function with replacement, asking for 18,000 records. This roughly doubles the number of neutral and positive records while still allowing the majority of negative tweets to come through.

In [2]:
#@title Displaying the Data
# Import and inspect the data
data = pd.read_csv('Tweets.csv', encoding='latin1')
data.head()

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
1,570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada)
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada)
3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada)
4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada)


In [3]:
#@title Count of Classes Pre-Balancing
# Reduce the columns
data = data[['text','airline_sentiment']]
#Check the Balance of the classes
data['airline_sentiment'].value_counts()

negative    9178
neutral     3099
positive    2363
Name: airline_sentiment, dtype: int64

In [4]:
#@title Count of Classes Post Balancing
# Balancing the Data Set
weights = {'negative':1, 'neutral':3, 'positive':4} # set the weights
data['weights'] = data['airline_sentiment'].apply(lambda x: weights[x])
data = data.sample(n = 18000, weights=data['weights'], replace= True, random_state=24)
data['airline_sentiment'].value_counts()

positive    6081
neutral     5967
negative    5952
Name: airline_sentiment, dtype: int64
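
As a rough check on why these weights give nearly equal classes, the expected class counts can be worked out by hand. The sketch below assumes that weighted sampling with replacement draws each row with probability proportional to its weight; the class counts and weights are copied from the cells above, and the variable names are illustrative only.

```python
# Class counts before balancing (from the value_counts output above)
counts = {"negative": 9178, "neutral": 3099, "positive": 2363}
# Per-row sampling weights used in the balancing cell
weights = {"negative": 1, "neutral": 3, "positive": 4}
n_samples = 18000

# Total weight mass contributed by each class
mass = {label: counts[label] * weights[label] for label in counts}
total_mass = sum(mass.values())

# Expected number of sampled rows per class when drawing with replacement
expected = {label: n_samples * mass[label] / total_mass for label in counts}

for label, value in expected.items():
    print(f"{label:>8}: {value:8.1f}")
```

The expected counts come out close to 6,000 per class, which is in line with the observed 5,952, 5,967, and 6,081.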

## Vectorization

Since computers cannot readily utilize text for analysis, the next step is to vectorize the written text. This process separates the text into words or short phrases called tokens. When a token represents a single word it is called a unigram, two words a bigram, and so forth through n-grams, where n represents the number of words in the token.
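
A minimal sketch of the n-gram idea, using a plain whitespace split instead of the nltk tokenizer used later in this notebook. The `ngrams` helper and the sample sentence are illustrative and not part of the analysis.

```python
def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the flight was not late".split()
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)

print(unigrams)
print(bigrams)
```

Note how the bigram "not late" keeps the negation attached to the word it modifies, which a unigram representation cannot do.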

The vectorization can then use various methods of representing a token in a corpus, or text body, among the larger corpora of text. There are three commonly seen methods. The first is binary: if the token exists in the corpus, it is represented with True or 1, and if it does not, with False or 0. The second uses counts, where a token is counted each time it appears in the text. The last is TF-IDF, which counts the occurrences of a token in a respective corpus and scales that count down by how frequently the token is seen across the corpora. This in effect weakens common words while those that are less frequent stand out.
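
The three representations can be sketched on a toy corpus with the standard library alone. This uses the textbook weighting `count * log(N / df)`, which is an assumption for illustration; library implementations such as sklearn's typically add smoothing and normalization, so their numbers will differ.

```python
import math
from collections import Counter

docs = [
    "great flight great crew",
    "late flight",
    "lost bag late flight",
]
tokenized = [doc.split() for doc in docs]
vocab = sorted({tok for doc in tokenized for tok in doc})

# Raw counts: how many times each vocabulary token occurs in each document
count_rows = [[Counter(doc)[tok] for tok in vocab] for doc in tokenized]

# Binary: 1 if the token occurs at all, else 0
binary_rows = [[1 if c > 0 else 0 for c in row] for row in count_rows]

# Document frequency: number of documents containing each token
doc_freq = [sum(row[j] for row in binary_rows) for j in range(len(vocab))]

# Textbook TF-IDF: count scaled by the log of the inverse document frequency
n_docs = len(docs)
tfidf_rows = [
    [c * math.log(n_docs / df) for c, df in zip(row, doc_freq)]
    for row in count_rows
]

print(vocab)
print(count_rows[0], binary_rows[0])
print([round(v, 3) for v in tfidf_rows[0]])
```

In the first document "flight" scores zero under TF-IDF because it appears in every document, while "great" is boosted because it is both repeated and rare across the corpus.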

In order to maintain an apples-to-apples comparison, the same tokenization will be applied to the corpora for all three tasks.

As sklearn includes a vectorization tool that implements the tokenization, removal of stopwords, and conversion to n-grams in one step, the first thing is to define the items that are going to be utilized.

First come tokenization and lemmatization, the latter being the returning of a word to its base form so that words like buying and buy are represented by the single token buy. This reduces the number of features, which text data has in large quantities and which can make models slow to converge.

Using nltk's built-in word tokenizer, which keeps punctuation, a number of punctuation characters were removed so that they do not become their own tokens. The hash and at signs were specifically added to this list, as this data set comes from Twitter and these may create additional one-off tokens.

The last vectorization preparatory step was to create a stopword list. Stopwords are common words such as "the", "I", "can" and so on that create sentence structure but often carry little meaning in sentiment analysis tasks. The exception is words that negate, such as "not", "don't", and "can't", since English generally requires helper words for negation rather than morphemes. Nltk's built-in stopword list was called and the negation stopwords were removed from it. The remaining words will be passed to the vectorizer.

In [5]:
#@title Defines Tokenization and Lemmatization in 1 Step
#Found on Github as sample to incorporate both
#nltk.word_tokenizer and Lemmatization for call in Vectorizer
#git location: https://gist.github.com/4OH4/f727af7dfc0e6bb0f26d2ea41d89ee55

#Removing the "#" and "@" sign will remove features specific to twitter.

class LemmaTokenizer:
    ignore_tokens = [',', '.', ';', ':', '"', '``', "''", '`',"!","?", "'", "#","@"]
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    def __call__(self, doc):
        return [self.wnl.lemmatize(t) for t in nltk.word_tokenize(doc) if t not in self.ignore_tokens]

Tokenizer = LemmaTokenizer()


In [6]:
#@title Defines Stopwords
# Managing the stopword list
#import from nltk package
stop_words = stopwords.words('english')
#gather negative stopwords, english helper words
negativeStopwords = ['ain', 'aren', "aren't", 'couldn', "couldn't", 'didn',
"didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven',
"haven't", 'isn', "isn't",'mightn', "mightn't", 'mustn', "mustn't", 'needn',
"needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't",
'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't", "not", "don't","no",
"nor","don"]
#remove them from the list
for word in negativeStopwords:
 stop_words.remove(word)

In [7]:
#@title Creates the defined Vectors for Task 1 and 2
#Setting up the vectorizers to use the Tokenizer defined two blocks up and the stopwords defined in the previous block
#By default the vectorizer lowercases the tokens and uses just unigrams;
#ngram_range is set explicitly for the bigram vectorizer defined below
unigramBinary = CountVectorizer(tokenizer=Tokenizer, stop_words=stop_words, binary=True)
unigramCounts = CountVectorizer(tokenizer=Tokenizer, stop_words=stop_words, binary=False)
bigramVectorizer = CountVectorizer(tokenizer=Tokenizer, stop_words=stop_words, binary=False, ngram_range=(2,2)) #Only Bigrams


## Training and Test Sets

For Task 1 and Task 2, training and test sets will be created using a 60-40 split, where 60 percent of the data will be used for training and 40 percent will be used to test the model. Sklearn's train_test_split function shuffles the records pseudo-randomly. So that the results can be reproduced by anyone who runs this later, a random_state is set that will produce the same split as long as the records are in the same order.
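
The role of the seed can be illustrated with a standard-library sketch. This is not sklearn's shuffling algorithm, and `split_60_40` is a hypothetical helper; it only shows that a fixed seed plus a fixed input order yields the same split every time.

```python
import random

def split_60_40(records, seed):
    """Shuffle a copy of the records with a fixed seed, then cut at 60 percent."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 60 // 100
    return shuffled[:cut], shuffled[cut:]

records = list(range(10))
train_a, test_a = split_60_40(records, seed=37)
train_b, test_b = split_60_40(records, seed=37)

print(train_a == train_b, test_a == test_b)
print(len(train_a), len(test_a))
```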

In [8]:
#@title Creates the Test and Training Sets used in Tasks 1 and 2
X_train, X_test, y_train, y_test = train_test_split(data['text'], data['airline_sentiment'], test_size = 0.4, random_state = 37)

Now that the test and training sets are created, the text entries need to be vectorized. This starts with the training set, which is used to fit the vectorizer through the fit_transform method. This creates the features used by both sets.

Next the test set was passed to the transform method, where the vectorizer checks for the learned features and transforms the data set.
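
The distinction between fitting and transforming can be sketched without sklearn. The two helpers below are hypothetical stand-ins for what a count vectorizer does conceptually: the vocabulary is learned from the training text only, and any test token outside that vocabulary is dropped.

```python
from collections import Counter

def fit_vocabulary(train_docs):
    """Learn the token-to-column mapping from the training documents only."""
    tokens = sorted({tok for doc in train_docs for tok in doc.split()})
    return {tok: col for col, tok in enumerate(tokens)}

def transform(docs, vocabulary):
    """Count tokens per document, dropping tokens that are not in the vocabulary."""
    rows = []
    for doc in docs:
        counts = Counter(tok for tok in doc.split() if tok in vocabulary)
        rows.append([counts[tok] for tok in vocabulary])
    return rows

train_docs = ["late flight", "great crew"]
test_docs = ["great flight delayed"]

vocabulary = fit_vocabulary(train_docs)
train_rows = transform(train_docs, vocabulary)
test_rows = transform(test_docs, vocabulary)

print(vocabulary)
print(test_rows)
```

The token "delayed" never appears in the training text, so it has no column and does not contribute to the test row. This keeps the training and test matrices the same width.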

# The Analysis/Models

## Task 1

This task is focused on the comparison of unigram models for both MultinomialNB and Support Vector Machines. For this test the two models were tested below using their default settings.

For instance, MultinomialNB is instantiated with a default smoothing of 1, and the same default of 1 is used for the cost in the support vector machines algorithm. The difference between the two runs is that the first uses the binary representation of the token within the corpus and the latter uses the count of how many times the token appears in the corpus. This was to see if any changes happen to the model between the two sets of results.
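
To make the smoothing parameter concrete, the sketch below applies the usual additive (Laplace) smoothing formula for multinomial Naive Bayes, log((count + alpha) / (class_total + alpha * vocab_size)). The numbers are hypothetical and the helper name is illustrative.

```python
import math

def smoothed_log_prob(token_count, class_total, vocab_size, alpha=1.0):
    """Log-probability of a token within a class under additive smoothing."""
    return math.log((token_count + alpha) / (class_total + alpha * vocab_size))

# Hypothetical class: 1,000 token occurrences spread over a 50-token vocabulary
seen = smoothed_log_prob(token_count=20, class_total=1000, vocab_size=50)
unseen = smoothed_log_prob(token_count=0, class_total=1000, vocab_size=50)

print(round(seen, 4), round(unseen, 4))
```

With a smoothing of 1, a token that never occurs in a class still gets a small non-zero probability, and every such token in that class shares the same floor value.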

In [9]:
#@title Vectorization of Test and Training for Task 1
X_binary = unigramBinary.fit_transform(X_train)
X_counts = unigramCounts.fit_transform(X_train)

X_binary_test=unigramBinary.transform(X_test)
X_counts_test=unigramCounts.transform(X_test)

In [10]:
#@title Binary Unigram Model
#Instantiate the Model
nb_clf = MultinomialNB()
svm_clf = SVC(kernel='linear')

#Fit the binary models
nb_clf.fit(X_binary, y_train)
svm_clf.fit(X_binary, y_train)

#Create the predictions
nb_y_pred = nb_clf.predict(X_binary_test)
svm_y_pred = svm_clf.predict(X_binary_test)


In [11]:
#@title Multinomial Naive Bayes Results
#Checking the results of the MultinomialNB model
print("Confusion Matrix:\n", confusion_matrix(y_test, nb_y_pred))
print("Classification Report:\n", classification_report(y_test, nb_y_pred))

Confusion Matrix:
 [[2082  186  113]
 [ 451 1686  249]
 [ 157  115 2161]]
Classification Report:
               precision    recall  f1-score   support

    negative       0.77      0.87      0.82      2381
     neutral       0.85      0.71      0.77      2386
    positive       0.86      0.89      0.87      2433

    accuracy                           0.82      7200
   macro avg       0.83      0.82      0.82      7200
weighted avg       0.83      0.82      0.82      7200



In [12]:
#@title Support Vector Machine's Linear Model Results
#Checking the results of the SVM Model
print("Confusion Matrix:\n", confusion_matrix(y_test, svm_y_pred))
print("Classification Report:\n", classification_report(y_test, svm_y_pred))

Confusion Matrix:
 [[2004  292   85]
 [ 190 2058  138]
 [  71  100 2262]]
Classification Report:
               precision    recall  f1-score   support

    negative       0.88      0.84      0.86      2381
     neutral       0.84      0.86      0.85      2386
    positive       0.91      0.93      0.92      2433

    accuracy                           0.88      7200
   macro avg       0.88      0.88      0.88      7200
weighted avg       0.88      0.88      0.88      7200



The Naive Bayes model, based on the results with binary data, shows a good accuracy of around 82%. However, it struggles with classifying neutral sentiment. The confusion matrix also shows that the model's incorrect predictions tend to be overweighted in the neutral category. The negative category also shows a significant drop in precision when classifying newly seen data.

The Support Vector Machine also shows its lowest score in the neutral category at .85, but each class's results are much closer to the model's overall 88% accuracy. Incorrect predictions are fairly balanced among all three classes. Precision and recall scores are closer to the overall results of the model for all classes. One thing of note: the model converges in about 10 seconds, which is almost 10x longer than the NB model.
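
The figures in the classification report can be recomputed directly from a confusion matrix, which helps when reading the two side by side. The sketch below uses the Naive Bayes matrix from the binary unigram run, with rows as actual classes and columns as predicted classes.

```python
labels = ["negative", "neutral", "positive"]
# Naive Bayes confusion matrix from the binary unigram run
# (rows = actual class, columns = predicted class)
cm = [
    [2082, 186, 113],
    [451, 1686, 249],
    [157, 115, 2161],
]

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(3)) / total

precision = {}
recall = {}
for i, label in enumerate(labels):
    predicted_as_label = sum(cm[r][i] for r in range(3))  # column sum
    actually_label = sum(cm[i])                           # row sum
    precision[label] = cm[i][i] / predicted_as_label
    recall[label] = cm[i][i] / actually_label

print(f"accuracy: {accuracy:.2f}")
for label in labels:
    print(f"{label:>8}: precision={precision[label]:.2f} recall={recall[label]:.2f}")
```

The low negative precision comes from the first column: 451 neutral tweets were predicted as negative, which is also what pulls neutral recall down.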

### The Most indicative Features

The advantage of both of these models is that it is easy to pull the most indicative features from them, so long as SVC is run with a linear kernel.

In [13]:
#@title Most  Indicative Features
nb_features = sorted(zip(nb_clf.coef_[0], unigramBinary.get_feature_names_out()), reverse=True)
print("Top 10 Most Negative Features: ", nb_features[:10])
print("Top 10 Most Postive Features: ", nb_features[-10:])
svm_features = sorted(zip(svm_clf.coef_[0], unigramBinary.get_feature_names_out()), reverse=True)
print(svm_features[:10])
print(svm_features[-10:])

Top 10 Most Negative Features:  [(-3.8906343660885954, 'flight'), (-3.9163710807326444, 'united'), (-4.106544612809285, 'usairways'), (-4.191102000837348, 'americanair'), (-4.398741365615592, "n't"), (-4.719649085695694, 'southwestair'), (-4.734309034106378, 'no'), (-4.751330721675809, 'not'), (-4.762118312804806, 'wa'), (-4.884249181397293, 'hour')]
Top 10 Most Postive Features:  [(-10.895516355801455, "'til"), (-10.895516355801455, "'request"), (-10.895516355801455, "'noooo"), (-10.895516355801455, "'kewl"), (-10.895516355801455, "'just"), (-10.895516355801455, "'em"), (-10.895516355801455, "'customer"), (-10.895516355801455, "'crashing"), (-10.895516355801455, "'blumanity"), (-10.895516355801455, "'bluemanity")]
[(<1x10348 sparse matrix of type '<class 'numpy.float64'>'
	with 5700 stored elements in Compressed Sparse Row format>, '$')]
[(<1x10348 sparse matrix of type '<class 'numpy.float64'>'
	with 5700 stored elements in Compressed Sparse Row format>, '$')]


The top 10 most indicative words for the negative class include tokens like flight, united, usairways, and americanair. These are mostly airline names, which appear in many of the reviews on Twitter and are not very indicative of negative sentiment. The top 10 negative words do include negations, which are the words that were kept in the text by removing them from the stopword list.

The 10 entries printed as most positive are the lowest-scoring tokens in the same coefficient row, all tied at an identical value. They appear to be rare tokens specific to tweets about airlines and not words that would normally be considered positive by the population at large. Likely the model would perform poorly when introduced to other types of text data outside of airline tweets.

### Task 1 with Counts

Now that the models have been run with binary true/false values for tokens with high accuracy, the next step is to see whether counts of token occurrence provide better results. For multinomial Naive Bayes the base model will be utilized. However, since the most indicative features are requested for support vector machines, the linear kernel will need to be used.

In [14]:
#@title Counts Unigram Models
#Instantiate the Model
nb_clf = MultinomialNB()
svm_clf = SVC(kernel='linear')

#Fit the count models
nb_clf.fit(X_counts, y_train)
svm_clf.fit(X_counts, y_train)

#Create the predictions
nb_y_pred = nb_clf.predict(X_counts_test)
svm_y_pred = svm_clf.predict(X_counts_test)

In [15]:
#@title Multinomial Model Results
#Checking the results of the MultinomialNB model
print("Confusion Matrix:\n", confusion_matrix(y_test, nb_y_pred))
print("Classification Report:\n", classification_report(y_test, nb_y_pred))

Confusion Matrix:
 [[2076  189  116]
 [ 454 1674  258]
 [ 154  112 2167]]
Classification Report:
               precision    recall  f1-score   support

    negative       0.77      0.87      0.82      2381
     neutral       0.85      0.70      0.77      2386
    positive       0.85      0.89      0.87      2433

    accuracy                           0.82      7200
   macro avg       0.82      0.82      0.82      7200
weighted avg       0.82      0.82      0.82      7200



In [16]:
#@title Support Vector Machine Results
#Checking the results of the SVM Model
print("Confusion Matrix:\n", confusion_matrix(y_test, svm_y_pred))
print("Classification Report:\n", classification_report(y_test, svm_y_pred))

Confusion Matrix:
 [[1991  307   83]
 [ 187 2070  129]
 [  69  101 2263]]
Classification Report:
               precision    recall  f1-score   support

    negative       0.89      0.84      0.86      2381
     neutral       0.84      0.87      0.85      2386
    positive       0.91      0.93      0.92      2433

    accuracy                           0.88      7200
   macro avg       0.88      0.88      0.88      7200
weighted avg       0.88      0.88      0.88      7200



Based on the model output, counting token occurrences or using a binary representation of whether the token exists in the text shows essentially the same results.

#### Most indicative Features

As the models' performance numbers didn't change, the next step is to review the models' most indicative features.

In [17]:
#@title Most informative features
nb_features = sorted(zip(nb_clf.coef_[0], unigramCounts.get_feature_names_out()), reverse=True)
print(nb_features[:10]) #Negative
print(nb_features[-10:]) #Positive
svm_features = sorted(zip(svm_clf.coef_[0], unigramCounts.get_feature_names_out()), reverse=True)
print(svm_features[:10])
print(svm_features[-10:])

[(-3.7590428114861085, 'flight'), (-3.86741084820789, 'united'), (-4.111864186108851, 'usairways'), (-4.212352584942677, 'americanair'), (-4.346057573609645, "n't"), (-4.628199465685811, 'no'), (-4.675333543360264, 'wa'), (-4.7067086659280175, 'not'), (-4.743223878903114, 'southwestair'), (-4.8708694393503755, 'hour')]
[(-10.925308785619746, "'til"), (-10.925308785619746, "'request"), (-10.925308785619746, "'noooo"), (-10.925308785619746, "'kewl"), (-10.925308785619746, "'just"), (-10.925308785619746, "'em"), (-10.925308785619746, "'customer"), (-10.925308785619746, "'crashing"), (-10.925308785619746, "'blumanity"), (-10.925308785619746, "'bluemanity")]
[(<1x10348 sparse matrix of type '<class 'numpy.float64'>'
	with 5666 stored elements in Compressed Sparse Row format>, '$')]
[(<1x10348 sparse matrix of type '<class 'numpy.float64'>'
	with 5666 stored elements in Compressed Sparse Row format>, '$')]


The 10 most informative features show the same tokens; however, the order is slightly different. This likely drives why the models with and without counts are showing similar results.

## Task 2 - Bigrams

For this task the measures of Task 1 are kept the same, with one change in the count vectorization from unigrams to bigrams, and the previous models are rerun. The CountVectorizer was set up earlier along with the stopword list and tokenization steps.

In this round, given that no difference was seen in the model output between token counts and boolean values, and since the boolean value is implied by whether a count is above zero, only the counts will be used in the creation of the models.

The same training and test sets defined earlier will remain, so these models stay comparable to those in Task 1.

In [18]:
#@title Bigram Vectorization and Model Execution
#Vectorize the text to Bigrams
X_train_bigrams = bigramVectorizer.fit_transform(X_train)
X_test_bigrams = bigramVectorizer.transform(X_test)

#instantiate the models
nb_clf = MultinomialNB()
svm_clf = SVC(kernel='linear')

#Fit the models
nb_clf.fit(X_train_bigrams, y_train)
svm_clf.fit(X_train_bigrams, y_train)

bigram_ypred_nb = nb_clf.predict(X_test_bigrams)
bigram_ypred_svm = svm_clf.predict(X_test_bigrams)

In [19]:
#@title Multinomial Model
print("Confusion Matrix:\n", confusion_matrix(y_test, bigram_ypred_nb))
print("Classification Report:\n", classification_report(y_test, bigram_ypred_nb))

Confusion Matrix:
 [[1975  210  196]
 [ 239 1902  245]
 [ 107   93 2233]]
Classification Report:
               precision    recall  f1-score   support

    negative       0.85      0.83      0.84      2381
     neutral       0.86      0.80      0.83      2386
    positive       0.84      0.92      0.87      2433

    accuracy                           0.85      7200
   macro avg       0.85      0.85      0.85      7200
weighted avg       0.85      0.85      0.85      7200



In [20]:
#@title Support Vector Machines
print("Confusion Matrix:\n", confusion_matrix(y_test, bigram_ypred_svm))
print("Classification Report:\n", classification_report(y_test, bigram_ypred_svm))

Confusion Matrix:
 [[1731  557   93]
 [ 122 2186   78]
 [  60  184 2189]]
Classification Report:
               precision    recall  f1-score   support

    negative       0.90      0.73      0.81      2381
     neutral       0.75      0.92      0.82      2386
    positive       0.93      0.90      0.91      2433

    accuracy                           0.85      7200
   macro avg       0.86      0.85      0.85      7200
weighted avg       0.86      0.85      0.85      7200



Interestingly, both models perform with the same overall accuracy. There is a large misclassification of negative sentiment as neutral in the SVM model. In addition, while the SVM precision and recall are pretty close in the positive category, the negative and neutral categories struggle with well below average recall and precision scores respectively.

The multinomial NB is generally more balanced across the board in its results, although, as seen in the unigram model, it still struggles a bit in the classification of sentiment.

That said, between Classification Task 1 with unigrams and Classification Task 2 with bigrams, the Task 1 SVM model has so far performed the best of all the variations that have been tested.

### Most informative Features

In [21]:
#@title Multinomial Features
nb_features = sorted(zip(nb_clf.coef_[0], bigramVectorizer.get_feature_names_out()), reverse=True)
print(nb_features[:10]) #Negative
print(nb_features[-10:])

[(-6.088558058226031, '& amp'), (-6.148877848459554, 'cancelled flightled'), (-6.148877848459554, "ca n't"), (-6.169821022304797, 'customer service'), (-6.957220741110932, 'late flight'), (-7.148888160323124, 'cancelled flighted'), (-7.237180767468803, 'flight wa'), (-7.317223475142339, "wo n't"), (-7.317223475142339, "n't get"), (-7.35112502681802, 'late flightr')]
[(-11.41156803736444, '$ 440'), (-11.41156803736444, '$ 39'), (-11.41156803736444, '$ 33'), (-11.41156803736444, '$ 25.00'), (-11.41156803736444, '$ 20'), (-11.41156803736444, '$ 192'), (-11.41156803736444, '$ 17.58'), (-11.41156803736444, '$ 1000cost-'), (-11.41156803736444, '$ 1000'), (-11.41156803736444, '$ .50')]


In [22]:
#@title Support Vector Machine Features
svm_features = sorted(zip(svm_clf.coef_[0], bigramVectorizer.get_feature_names_out()), reverse=True)
print(svm_features[:10])
print(svm_features[-10:])

[(<1x48715 sparse matrix of type '<class 'numpy.float64'>'
	with 35590 stored elements in Compressed Sparse Row format>, '$ $')]
[(<1x48715 sparse matrix of type '<class 'numpy.float64'>'
	with 35590 stored elements in Compressed Sparse Row format>, '$ $')]


Negative features are more indicative of negative events like cancelled flights, can't, customer service (which, while not a negative topic, is often only consulted when there is a problem), late flight, etc.

The positive indications may explain the poorer performance, as many of them show dollar values of what appear to be fees mentioned in the tweets. None of them are words traditionally thought of as positive in English.

## Task 3

For Task 3 the ask is to utilize the parameters available for the SVM model to see if a better model can be produced. So far, the best model is the Task 1 unigram SVM, which has seen an 88% accuracy, with both recall and precision in the mid 80s or higher for all sentiment classifications.

For Task 3 a combination of both bigrams and unigrams was submitted. In addition, to try to achieve the best model, SVC's kernel and cost were adjusted using a for loop, outputting the classification report for each combination. This allows various parameters to be tested in one run of the code. The computation time is longer, but it makes it quick to see which model performs the best.

For SVC the cost was selected from a range of .5 to 2 in increments of .5. Cost tells the model how tightly or loosely it should fit the training data. Normally speaking, a higher cost fits the model more tightly to the training data and risks overfitting, while a lower cost allows more generalized results.
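
For reference, the role of the cost can be read off the standard soft-margin objective, shown here in its two-class form as a textbook formulation rather than as the exact problem the library solves internally. Here $w$ and $b$ define the separating boundary, $\xi_i$ is the slack allowed for training example $i$, and $C$ is the cost.

```latex
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad
y_i\,(w^{\top} x_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0 .
```

A larger $C$ makes each unit of slack more expensive, so the optimizer works harder to classify every training point correctly. A smaller $C$ tolerates more slack in exchange for a wider margin.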

In addition, the other factor tested was the kernel. Linear and radial basis function (RBF) kernels are both tested to see if one of them produces a better outcome.

To best measure accuracy, k-fold cross validation is done, where the model is built using the majority of the folds while holding one out to test the results. This happens for each section of the data, and the results are combined to help identify how well the model is likely to generalize to unseen data.
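
The fold mechanics can be sketched with plain index lists. The helper below is hypothetical and uses simple contiguous folds; when given an integer `cv` and a classifier, sklearn defaults to a stratified variant that also keeps class proportions similar in each fold.

```python
def k_fold_indices(n_records, k):
    """Yield (train_indices, test_indices) for k contiguous, near-equal folds."""
    fold_sizes = [n_records // k + (1 if i < n_records % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_records) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(n_records=10, k=5))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Each record lands in the held-out fold exactly once, so every prediction in the final report comes from a model that did not see that record during training.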

In [23]:
#@title SVM Cost and Kernel Comparisons
#Create vectorization feature
Vectorizer = CountVectorizer(tokenizer=Tokenizer, stop_words=stop_words, binary=False, ngram_range=(1,2))
X_vect = Vectorizer.fit_transform(data['text'])
y_train_all = data['airline_sentiment']

#testing parameters
cost = [.5,1,1.5,2]
kernal = ['linear', 'rbf']

for k in kernal:
  for c in cost:
    svm_clf = SVC(kernel=k, C=c) #instantiates the model
    y_preds = cross_val_predict(svm_clf, X_vect, y_train_all, cv = 5)
    print("For Cost of: ", c, "\n For kernel: ", k, "\n Classifcation Report: \n", classification_report(y_train_all, y_preds, digits=5))

For Cost of:  0.5 
 For kernel:  linear 
 Classifcation Report: 
               precision    recall  f1-score   support

    negative    0.94035   0.87668   0.90740      5952
     neutral    0.88030   0.92559   0.90238      5967
    positive    0.94123   0.95609   0.94860      6081

    accuracy                        0.91972     18000
   macro avg    0.92063   0.91945   0.91946     18000
weighted avg    0.92074   0.91972   0.91966     18000

For Cost of:  1 
 For kernel:  linear 
 Classifcation Report: 
               precision    recall  f1-score   support

    negative    0.94016   0.87634   0.90713      5952
     neutral    0.88064   0.92609   0.90279      5967
    positive    0.94204   0.95691   0.94942      6081

    accuracy                        0.92006     18000
   macro avg    0.92095   0.91978   0.91978     18000
weighted avg    0.92106   0.92006   0.91998     18000

For Cost of:  1.5 
 For kernel:  linear 
 Classifcation Report: 
               precision    recall  f1-scor

The RBF model shows increasing accuracy and fairly stable precision and recall in all three categories. It is worth noting that the computation time for RBF is longer than for linear. Using cross validation tends to extend the time: linear with 10 fold cross validation took about 2 minutes and 30 seconds to run, while RBF took around 3 minutes to converge with 10 fold cross validation. That additional 1/5 of the linear version's total time on the smaller data set only helps a few extra tweets get classified with the correct sentiment. However, in a larger dataset that trade-off might be well worth the additional time cost.

The linear model does show a very modest increase as cost increases and the model fits the training data more tightly. There is a slight decrease between a cost of 1 and 1.5 in overall accuracy. That said, the model shows so little change in overall performance that the results had to be expanded to 5 decimal places to see it. At lower cost it outperforms the RBF version of the model. However, RBF continues to improve as the cost increases. Lastly, while both models show some challenge in differentiating between negative and neutral sentiment, for linear this gap is wider, with the reported precision and recall suggesting the model is more likely to label a negative tweet as neutral.


# Conclusion

In Tasks 1 and 2 above, it has been seen that Support Vector Machines perform as well as Naive Bayes or slightly better. The improvement, however, does come at a time cost; while small, Naive Bayes can generally build the model in 1.3 seconds and SVM takes around 10 seconds. This expands greatly, as seen in Task 3, where cross validation is done as part of the model evaluation.

In Task 3, the kernel comparison showed that RBF produced better results on the held-out data, both in terms of allowing the fitted model to become more general and in the ability to better separate the three tags.

