# Linear SVM Sentiment Scoring Model
<p> This notebook trains a linear Support Vector Machine (SVM) sentiment classification model on tweets from a Kaggle competition. The trained model is saved and exported at the end for reuse. </p>

## Packages

In [1]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split

## Import Test and Training Data

#### Our training data consists of the following attributes:
<ul> 
    <li><b>target:</b> the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)</li>
    <li><b>ids:</b> the id of the tweet (2087)</li>
    <li><b>date:</b> the date of the tweet (Sat May 16 23:58:44 UTC 2009)</li>
    <li><b>flag:</b> the query (lyx); if there is no query, this value is NO_QUERY</li>
    <li><b>user:</b> the user who tweeted (robotickilldozr)</li>
    <li><b>text:</b> the text of the tweet (Lyx is cool)</li>
</ul>
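
As a minimal sketch of how this six-column layout maps onto a headerless CSV, the column names can be supplied at load time through the `names` argument of `pd.read_csv`. The in-memory sample below is an assumption for illustration: it reuses two rows shown later in this notebook rather than reading `training_XL.csv`, and the `COLUMNS` and `sample` names are hypothetical.

```python
import io

import pandas as pd

COLUMNS = ["target", "ids", "date", "flag", "user", "text"]

# Two illustrative rows in the same six-column layout, with no header row.
sample_csv = io.StringIO(
    '0,1985822080,Sun May 31 17:50:20 PDT 2009,NO_QUERY,bradbonnell,'
    '"argh, will my voice ever recover?"\n'
    "0,2055355782,Sat Jun 06 09:10:00 PDT 2009,NO_QUERY,Hillary411,"
    "ahhh work work work\n"
)

# header=None tells pandas the first line is data; names labels the columns.
sample = pd.read_csv(sample_csv, header=None, names=COLUMNS)
print(sample[["target", "user", "text"]])
```

Quoted fields keep embedded commas intact, which matters for the `text` column.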

In [2]:
filename = "training_XL.csv"
data_set = pd.read_csv(filename, delimiter=',', encoding='ISO-8859-1', header=None)

## Process Data

In [4]:
data_set.columns = ["target","ids","date","flag","user","text"]
# Shuffle the rows: sample(frac=1) returns 100% of the rows in random order,
# and reset_index(drop=True) renumbers them from 0.
data_set = data_set.sample(frac=1).reset_index(drop=True)
y = data_set['target'].values
X = data_set['text'].values
data_set.head()

| | target | ids | date | flag | user | text |
|---|---|---|---|---|---|---|
| 0 | 0 | 1974556938 | Sat May 30 13:22:31 PDT 2009 | NO_QUERY | dougc84 | @maeband did you guy's show sell out? i can't... |
| 1 | 4 | 1468252598 | Tue Apr 07 00:33:29 PDT 2009 | NO_QUERY | AiyerChitra | @jnarin oh i didnt know what you were talking ... |
| 2 | 0 | 1985822080 | Sun May 31 17:50:20 PDT 2009 | NO_QUERY | bradbonnell | argh, will my voice ever recover? |
| 3 | 0 | 2055355782 | Sat Jun 06 09:10:00 PDT 2009 | NO_QUERY | Hillary411 | ahhh work work work |
| 4 | 4 | 1966523555 | Fri May 29 17:55:29 PDT 2009 | NO_QUERY | musiclove18 | Waiting for Jonas To Come On Un-Broke : What Y... |
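
The shuffle in the cell above is unseeded, so the row order (and the `head()` output) will differ between runs. If repeatable order is wanted, one option is to pass `random_state` to `sample`. The sketch below assumes that change and uses a small toy frame (`toy`, with made-up values) in place of `data_set`.

```python
import pandas as pd

# Toy frame standing in for data_set; the values are illustrative only.
toy = pd.DataFrame({"target": [0, 4, 0, 4, 0], "text": ["a", "b", "c", "d", "e"]})

# random_state is an assumed addition: it pins the permutation so reruns match.
shuffled_a = toy.sample(frac=1, random_state=0).reset_index(drop=True)
shuffled_b = toy.sample(frac=1, random_state=0).reset_index(drop=True)

# Same seed gives the same order, and shuffling drops no rows.
print(shuffled_a.equals(shuffled_b))
print(sorted(shuffled_a["text"]) == sorted(toy["text"]))
```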


## Prepare Data For Holdout Test
<p> Remember that X is the tweet text and y is the sentiment label (the <b>target</b> column). </p>

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
print(X_train[0])
print(y_train[0])
print(X_test[0])
print(y_test[0])

(1072000,) (1072000,) (528000,) (528000,)
Omg. Cleaning out my car was like cleaning up after world war 2!! 
0
im seeing people subscribe to my twitter, while ur at it why not subscribe to our youtube   Link on the side
4
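
With `test_size=0.33`, roughly one third of the 1,600,000 rows (528,000) are held out, as the shapes above show. The split above is purely random; a possible variant, shown here only as a sketch on toy data, passes `stratify` so that both partitions keep the same label proportions. The `stratify` argument and the `X_toy`/`y_toy` arrays are assumptions for illustration and are not part of the original cell.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 placeholder texts with balanced labels 0 and 4.
X_toy = np.array([f"tweet {i}" for i in range(100)])
y_toy = np.array([0, 4] * 50)

# stratify=y_toy is an assumed addition; it preserves the label ratio
# in both the training and the holdout partition.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.33, random_state=0, stratify=y_toy
)

print(X_tr.shape, X_te.shape)
print(np.unique(y_te, return_counts=True))
```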


In [6]:
training_labels = set(y_train)
print(training_labels)
training_category_dist = np.unique(y_train, return_counts=True)
print(training_category_dist)

{0, 4}
(array([0, 4]), array([535917, 536083]))
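
Only labels 0 and 4 appear in the training split, and the two counts are nearly equal. As a small sketch, the counts printed above can be turned into proportions to make the balance explicit; the `values`, `counts`, and `proportions` names are illustrative.

```python
import numpy as np

# Label values and counts copied from the printed output above.
values = np.array([0, 4])
counts = np.array([535917, 536083])

# Divide each count by the total to get the share of each label.
proportions = counts / counts.sum()
for label, share in zip(values, proportions):
    print(f"label {label}: {share:.4f}")
```

Both shares are within a tenth of a percentage point of 0.5, so plain accuracy is a reasonable first metric on this split.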
