<a id='top'></a>

# Twitter Sentiment Analysis in Python: The Base Model

This notebook will eventually be populated with:
1. TF-IDF with logistic regression and regularized logistic regression

## Contents

1. TF-IDF
    1. [bag of shapes](#tfidf)
    2. [twitter TF-IDF](#tfidftweet)
2. logistic regression
    1. [derivation](#log)
    2. [string vectorization](#vectorization)
    3. [bag of shapes](#bag)
    4. [twitter logistic regression](#logtweet)

You can read more about TF-IDF and simple regressions in this [paper](http://www.cs.ubc.ca/~nando/540-2013/projects/p9.pdf) and this blog [post](https://www.ocf.berkeley.edu/~janastas/supervised-learning-with-text-1-03-01-sheet.html).

And, well, while we're at it, I've enjoyed the documentation for [gensim](https://radimrehurek.com/gensim/models/tfidfmodel.html).

<a id='tfidf'></a>

### Term Frequency Inverse Document Frequency (TF-IDF)

[back to top](#top)

The basic idea of TF-IDF is to score words in a document based on how well they distinguish that document from the rest of the corpus. You can think of it as picking out, for each bag in a set of bags, the objects that set that bag apart from the others:

<img src="tfidf.png" alt="Drawing" width="200"/>

In this case, the highest scoring shapes for the left bag would be the red triangles, and the highest scoring shape in the right bag would be the orange chord. Let's work this out mathematically.



In [1]:
bag_a = ["triangle", "triangle", "circle", "circle", "star"]
bag_b = ["chord", "star", "circle", "circle", "circle"]

The term frequency (TF) and inverse document frequency (IDF) terms are computed individually before being combined. The TF, the IDF, and their combination can each be computed in a variety of ways to suit a particular corpus or desired analysis. Here I'll demonstrate the simplest case.



**Term frequency**

The TF can be computed as follows:

$TF_{t,d} = \frac{\sum_{i=1}^{w_{d}}1(w_{i} = t)}{w_{d}}$

where we are tabulating the frequency of word (shape) $t$ in document (bag) $d$, and $w_{d}$, the number of words in the document, normalizes the count.

In [2]:
from collections import Counter
def tf_shapes(shape, bag):
    return Counter(bag)[shape] / sum(Counter(bag).values())

Let's investigate the scores for each of our shapes:

In [3]:
print("triangle tf for bag a:\t {}".format(tf_shapes("triangle", bag_a)))
print("circle tf for bag a:\t {}".format(tf_shapes("circle", bag_a)))
print("star tf for bag a:\t {}".format(tf_shapes("star", bag_a)))

triangle tf for bag a:	 0.4
circle tf for bag a:	 0.4
star tf for bag a:	 0.2


You'll notice that for `bag_a`, circle and triangle receive the same score. In the context of the set of bags, we know that triangle should be more distinctive than circle, since circles also appear in `bag_b`. This discrepancy is addressed by the IDF.

**Inverse document frequency**

The IDF can be computed as follows:

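$IDF_{t,D} = \ln\left(\frac{D}{1 + \sum_{j=1}^{D}1(t \in d_{j})}\right)$

where $D$ is the number of documents in the corpus and the sum counts the documents $d_j$ that contain the word $t$. The logarithm dampens the ratio, and the 1 in the denominator guards against division by zero for unseen words. Note that the toy implementation below departs from this formula: its numerator is the total number of shapes across all bags, and it omits the 1 in the denominator.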
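
As a minimal sketch, the formula can also be transcribed literally, taking $D$ as the number of bags and counting the bags that contain a shape. The name `idf_smoothed` is hypothetical and is not used elsewhere in this notebook.

```python
from math import log

bag_a = ["triangle", "triangle", "circle", "circle", "star"]
bag_b = ["chord", "star", "circle", "circle", "circle"]

def idf_smoothed(shape, set_of_bags):
    """IDF as written in the formula: ln(D / (1 + number of bags containing shape))."""
    n_bags = len(set_of_bags)
    n_containing = sum(1 for bag in set_of_bags if shape in bag)
    return log(n_bags / (1 + n_containing))

for shape in ("triangle", "circle", "star", "chord"):
    print(shape, round(idf_smoothed(shape, [bag_a, bag_b]), 3))
```

With only two bags, the 1 in the denominator pushes shapes that appear in both bags below zero, so on a corpus this small the ranking is more informative than the absolute values.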


In [4]:
from math import log as ln
def idf_shapes(shape, set_of_bags):
    return ln(sum(len(bag) for bag in set_of_bags) / sum(1 for bag in set_of_bags if shape in bag))

Let's investigate the scores for each of our shapes:

In [5]:
print("triangle idf for bag a:\t {:.2}".format(idf_shapes("triangle", [bag_a, bag_b])))
print("circle idf for bag a:\t {:.2}".format(idf_shapes("circle", [bag_a, bag_b])))
print("star idf for bag a:\t {:.2}".format(idf_shapes("star", [bag_a, bag_b])))

triangle idf for bag a:	 2.3
circle idf for bag a:	 1.6
star idf for bag a:	 1.6


We see now that triangle scores higher than circle within the set of bags (2.3 vs 1.6). Our last step is to see how this translates into a final TF-IDF score.

The simplest TF-IDF can be computed by multiplying the TF and IDF together:

$TFIDF_{t,d,D} = TF_{t,d} \times IDF_{t,D}$

In [6]:
def tfidf_shapes(shape, bag, set_of_bags):
    return tf_shapes(shape, bag) * idf_shapes(shape, set_of_bags)

Let's view the final tf-idf for each of our shapes in bag_a:

In [7]:
print("triangle tf-idf for bag a:\t {:.2}".format(tfidf_shapes("triangle", bag_a, [bag_a, bag_b])))
print("circle tf-idf for bag a:\t {:.2}".format(tfidf_shapes("circle", bag_a, [bag_a, bag_b])))
print("star tf-idf for bag a:\t\t {:.2}".format(tfidf_shapes("star", bag_a, [bag_a, bag_b])))

triangle tf-idf for bag a:	 0.92
circle tf-idf for bag a:	 0.64
star tf-idf for bag a:		 0.32


Voilà! We've correctly identified triangle as the most discerning word in `bag_a`.
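
The numerator of `idf_shapes` is the total number of shapes rather than the number of bags, so shapes shared by both bags still receive a positive weight. As a hedged alternative, here is a sketch that ranks every shape in a bag using the plain document-count IDF, $\ln(N / df)$. That variant is an assumption on my part, and `idf_plain` and `rank_shapes` are hypothetical names.

```python
from collections import Counter
from math import log

bag_a = ["triangle", "triangle", "circle", "circle", "star"]
bag_b = ["chord", "star", "circle", "circle", "circle"]

def tf(shape, bag):
    """Share of the bag taken up by this shape."""
    return Counter(bag)[shape] / len(bag)

def idf_plain(shape, bags):
    """Assumed variant: ln(number of bags / number of bags containing the shape)."""
    n_containing = sum(1 for bag in bags if shape in bag)
    return log(len(bags) / n_containing)

def rank_shapes(bag, bags):
    """Return (shape, score) pairs for one bag, highest TF-IDF first."""
    scores = {shape: tf(shape, bag) * idf_plain(shape, bags) for shape in set(bag)}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

bags = [bag_a, bag_b]
print("top shape in bag a:", rank_shapes(bag_a, bags)[0][0])
print("top shape in bag b:", rank_shapes(bag_b, bags)[0][0])
```

Under this variant, any shape that appears in every bag scores exactly zero, so only the shapes unique to a bag carry weight.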

<a id='tfidftweet'></a>

### TF-IDF Analysis on Twitter Feed ###

[back to top](#top)

The documents in our corpus are very short (at most 280 characters), so the term frequency calculation will really only differentiate stop words, which are the only words likely to appear multiple times in a given tweet.

In [8]:
import pandas as pd 
import numpy as np
data = pd.read_csv("../../core/data/tweet_global_warming.csv", encoding="latin") #load the corpus
print("Total tweets: {}".format(data.shape[0]))

Total tweets: 6090


Let's use a simplified text processing library to convert our above tfidf code into something higher-performing:

In [9]:
import math
from textblob import TextBlob as tb #text processing library

def tf(word, blob):
    return blob.words.count(word) / len(blob.words)

def n_containing(word, bloblist):
    return sum(1 for blob in bloblist if word in blob.words)

def idf(word, bloblist):
    return math.log(len(bloblist) / (1 + n_containing(word, bloblist)))

def tfidf(word, blob, bloblist):
    return tf(word, blob) * idf(word, bloblist)

In the following cell I've truncated our loop so as not to print out 6090 tweets!

In [10]:
bloblist = list(map(tb, data.iloc[:,0]))
for i, blob in enumerate(bloblist):
    print("Top words in tweet {}".format(i + 1))
    scores = {word: tfidf(word, blob, bloblist) for word in blob.words}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    for word, score in sorted_words[:3]:
        print("\tWord: {}, TF-IDF: {}".format(word, round(score, 5)))
    if i == 10:
        break

Top words in tweet 1
	Word: act|BRUSSELS, TF-IDF: 0.45801
	Word: Belgium, TF-IDF: 0.45801
	Word: hunger, TF-IDF: 0.45801
Top words in tweet 2
	Word: poverty, TF-IDF: 0.73515
	Word: Fighting, TF-IDF: 0.66839
	Word: Africa, TF-IDF: 0.66415
Top words in tweet 3
	Word: Vatican, TF-IDF: 0.51245
	Word: failed, TF-IDF: 0.51245
	Word: offsets, TF-IDF: 0.48534
Top words in tweet 4
	Word: Vatican, TF-IDF: 0.51245
	Word: failed, TF-IDF: 0.51245
	Word: offsets, TF-IDF: 0.48534
Top words in tweet 5
	Word: URUGUAY, TF-IDF: 0.58289
	Word: Tools, TF-IDF: 0.58289
	Word: Needed, TF-IDF: 0.58289
Top words in tweet 6
	Word: JaymiHeimbuch, TF-IDF: 0.48854
	Word: sejorg, TF-IDF: 0.46151
	Word: Intensifying, TF-IDF: 0.42745
Top words in tweet 7
	Word: around, TF-IDF: 0.61988
	Word: us|A, TF-IDF: 0.36861
	Word: doubters, TF-IDF: 0.33369
Top words in tweet 8
	Word: Migratory, TF-IDF: 0.81423
	Word: Stay, TF-IDF: 0.76918
	Word: Strategy, TF-IDF: 0.73722
Top words in tweet 9
	Word: Competing, TF-IDF: 0.48854
	Wo

Our TF-IDF analysis is behaving like we'd expect it to: unusual words (in the context of climate sentiment tweets) score highest. But you'll notice textblob is improperly tokenizing some of our text (us|A, India|Ludhiana, etc.). Let's take advantage of another text processing library to properly preprocess our twitter feed:

In [11]:
import gensim
import gensim.downloader as api
from gensim.models import TfidfModel
from gensim.corpora import Dictionary

def read_data(data_file):
    for line in data_file:
        yield gensim.utils.simple_preprocess(line)
        
dataset = list(read_data(data['tweet']))
dct = Dictionary(dataset)
corpus = [dct.doc2bow(line) for line in dataset]
model = TfidfModel(corpus)
vector = model[corpus[0]]

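In case we're interested, we can now check out the scores for each word in the first tweet:

In [12]:
# Note: vector holds (token_id, score) pairs ordered by token id, while dataset[0]
# is in tweet order, so pairing them by position can mislabel scores;
# dct[vector[i][0]] looks up the word that belongs to each score.
for i in range(len(vector)):
    print("{!s:.5}\t{}".format(vector[i][1], dataset[0][i]))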

0.270	global
0.090	warming
0.243	report
0.372	urges
0.372	governments
0.297	to
0.034	act
0.318	brussels
0.347	belgium
0.287	ap
0.089	the
0.174	world
0.055	faces
0.062	increased
0.331	hunger
0.035	and
0.172	link


Stop words like 'and' and 'the' score extremely low, while words appearing far less often in the twitter corpus, like 'hunger' and 'governments', score much higher. As we would expect in this corpus filtered for climate sentiment, 'warming' receives a very low score, on par with the stop words.

<a id='log'></a>

### Logistic Regression

[back to top](#top)

We've reviewed TF-IDF and its implementation on our twitter data. Eventually, we will use it in a preprocessing step when creating a classification model. Before that, let's review the simplest classification model, logistic regression.

Logistic regression models the log odds of the probability of an outcome as a linear combination of features (independent variables).

The logistic function is defined as:

$\sigma(z) = \frac{e^{z}}{e^{z}+1}$

where $z$ is a linear combination of the features:

$z = \theta^Tx = \theta_0 + \theta_1x_1 + \theta_2x_2 + ... + \theta_nx_n$

The model's prediction, $h_{\theta}(x)$, is called the hypothesis in machine learning parlance.

The inverse of the logistic function is called the logit; it maps a probability back to its log odds.
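
To make the inverse relationship concrete, here is a small sketch using only the standard library. The function names `sigmoid` and `logit` are my own choices for illustration.

```python
from math import exp, log

def sigmoid(z):
    """Logistic function: maps any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + exp(-z))

def logit(p):
    """Log odds of a probability; the inverse of the logistic function."""
    return log(p / (1.0 - p))

for z in (-2.0, 0.0, 0.7, 3.0):
    p = sigmoid(z)
    # the last column should reproduce z, up to floating point error
    print(z, round(p, 4), round(logit(p), 4))
```

Applying `logit` to the output of `sigmoid` recovers the original linear score, which is why a logistic regression can be read as a linear model of the log odds.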

In a linear regression, we would minimize the least-squares function (the error between real output values and predicted output values) to determine the coefficients for the model:

$J(\theta)=\frac{1}{2}\sum\limits_{i=1}^{m} (h_\theta(x^i)-y^i)^2 \to \min$

Unfortunately, there is no closed-form solution for minimizing the logistic cost function (or, equivalently, maximizing its negative, the log-likelihood) aside from some very special [cases](https://www.tandfonline.com/doi/abs/10.1080/02664763.2014.932760).

To circumvent this, we use an iterative optimization algorithm such as gradient descent.

#### Gradient Descent ####

Adapting the least-squares function above to reflect our output mapping from 0 to 1, the cost function in logistic regression is:

$J(\theta)=\frac{1}{m}\sum\limits_{i=1}^{m} \left[-y^i\ln(h_\theta(x^i))-(1-y^i)\ln(1-h_\theta(x^i))\right]$

In [13]:
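def cost(X, Y, theta, lambda_=0):
    # lambda_ is accepted but not used in this unregularized version
    m = len(Y)  # number of samples
    H = sigmoid(np.dot(X, theta))  # the hypothesis for every sample
    # mean cross-entropy between the labels Y and the predictions H
    J = np.sum(-Y * np.log(H) - (1 - Y) * np.log(1 - H)) / m
    return J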

where the hypothesis $h_\theta$:

$h_\theta(x)=\sigma(\theta^Tx)= \frac{1}{1 + e^{-\theta^Tx}}$

includes the logistic function that confines our hypothesis to the range 0 to 1:

$\sigma(z)=\frac{1}{1+e^{-z}}$

In [14]:
def sigmoid(z):    
    g = 1/(1+np.exp(-z))
    return g

Technically, this is the standard logistic function, which machine learning texts usually just call the sigmoid function (there are other smooth, S-shaped functions we could have chosen).

Finally, we need to pull these functions together and form a gradient (the partial derivatives of the cost function with respect to each value of theta) so that we can optimize our values for theta:

In [15]:
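def gradient(X, Y, theta, lambda_=0):
    # lambda_ is accepted but not used in this unregularized version
    m = len(Y)  # number of samples
    H = sigmoid(np.dot(X, theta))  # predictions for every sample
    grad = np.dot(H - Y, X) / m  # partial derivative of the cost for each theta
    return np.ravel(grad)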
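
This notebook hands the cost and gradient to a solver later on, so gradient descent itself never appears in code. As a hedged illustration, here is a minimal batch gradient descent loop in plain Python. The toy data, the learning rate, and the iteration count are all made up for this sketch.

```python
from math import exp, log

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def predict_proba(theta, x):
    """Hypothesis: sigmoid of the linear combination theta . x."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))

def cost(X, Y, theta):
    """Mean cross-entropy over all samples."""
    return sum(-y * log(predict_proba(theta, x)) - (1 - y) * log(1 - predict_proba(theta, x))
               for x, y in zip(X, Y)) / len(Y)

def gradient(X, Y, theta):
    """Partial derivative of the cost with respect to each theta."""
    grad = [0.0] * len(theta)
    for x, y in zip(X, Y):
        error = predict_proba(theta, x) - y
        for j, xj in enumerate(x):
            grad[j] += error * xj / len(Y)
    return grad

# Made-up toy data: an intercept column of ones plus one feature.
X = [[1.0, 0.5], [1.0, 1.5], [1.0, 2.0], [1.0, 3.0], [1.0, 3.5], [1.0, 4.5]]
Y = [0, 0, 1, 0, 1, 1]

theta = [0.0, 0.0]
learning_rate = 0.1  # assumed step size
initial_cost = cost(X, Y, theta)
for _ in range(2000):
    grad = gradient(X, Y, theta)
    # step against the gradient to reduce the cost
    theta = [t - learning_rate * g for t, g in zip(theta, grad)]
final_cost = cost(X, Y, theta)
print(initial_cost, final_cost, theta)
```

Each pass moves theta a small step against the gradient; a dedicated solver does the same job with a better choice of step direction and size.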

<a id='vectorization'></a>

#### Orthogonality in Numerical Text Representation

In order to continue our bag of shapes example, we'll have to create some data. It is here that we are forced to make a decision about how to represent our string data numerically. In the simplest case, we could imagine that we represent every unique string by a unique number:

* triangle = 1
* circle = 2
* star = 3

But now we have these spurious arithmetic relationships between our features:

* star = circle + triangle
* circle = (triangle + star) / 2
* etc...

To circumvent this, we can represent our strings with vectors. The particular vectorization I'm going to use is one-hot encoding. The resulting vectors are orthogonal (the dot product of any two different vectors is zero, so no shape can be written in terms of the others) because they are zero everywhere aside from one position that is unique to that vector:

In [33]:
triangle = [0, 0, 0, 1] # one hot encode shapes
circle = [0, 0, 1, 0]
star = [0, 1, 0, 0]
chord = [1, 0, 0, 0]
bag_type_a = [triangle, triangle, circle, circle, star]
bag_type_b = [chord, star, circle, circle, circle]
shapes = [triangle, circle, star, chord]
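
As a quick check of the orthogonality claim, the sketch below builds the same one-hot vectors from a vocabulary list and prints every pairwise dot product. The helpers `one_hot` and `dot` are hypothetical names, and the vocabulary order is chosen so the vectors match the ones above.

```python
vocabulary = ["chord", "star", "circle", "triangle"]

def one_hot(word, vocabulary):
    """Vector of zeros with a single 1 at the word's position in the vocabulary."""
    return [1 if word == entry else 0 for entry in vocabulary]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

vectors = {word: one_hot(word, vocabulary) for word in vocabulary}
for first in vocabulary:
    # each row has a 1 only where a vector meets itself
    print(first, [dot(vectors[first], vectors[second]) for second in vocabulary])
```

Every off-diagonal dot product is zero, so the encoding introduces no arithmetic relationship between different shapes.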

Lastly, we have to flatten our vectors in order to perform logistic regression. Since I have four unique shapes and five shapes in a bag, this results in a feature vector of length 20 for every bag in my dataset. I'm going to generate a sample of 100 bags. In this simple demonstration, I won't use a test set or introduce noise (irreducible error) into my sample set.

In [34]:
X = np.zeros(((100,20))) # initialize X
y = np.random.randint(0,2,size=100)
for index, value in enumerate(y): #fill X based on Y
    if value == 0:
        np.random.shuffle(bag_type_a) #shuffle bag contents
        X[index] = np.ndarray.flatten(np.array(bag_type_a))
    else: 
        np.random.shuffle(bag_type_b) #shuffle bag contents
        X[index] = np.ndarray.flatten(np.array(bag_type_b))

Now that we have our data, we can look at what each of my functions outputs before we go ahead and pass the gradient and cost function to a solver. You'll notice that in the following cell, I manually add the y-intercept to our theta vector (it is one element longer than our feature vector):

In [38]:
m, n = X.shape
X = np.insert(X, 0, np.ones(len(X)), 1)
theta = np.zeros(n + 1) #insert y-intercept

The cost function is computing the error between our prediction from the current parameters in the logistic regression and the true y-labeling:

In [37]:
cost(X, y, np.array(theta))

0.693147180559946
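
The value 0.693 is not a coincidence. With theta set to all zeros, every prediction is sigmoid(0) = 0.5, and the cross-entropy of a 0.5 prediction is ln(2) whatever the label. A small sketch with made-up labels (`cross_entropy` is a hypothetical helper):

```python
from math import log

def cross_entropy(labels, probabilities):
    """Mean of -y*ln(p) - (1-y)*ln(1-p) over all samples."""
    return sum(-y * log(p) - (1 - y) * log(1 - p)
               for y, p in zip(labels, probabilities)) / len(labels)

labels = [0, 1, 1, 0, 1]             # made-up labels
probabilities = [0.5] * len(labels)  # sigmoid(0) for every sample
print(cross_entropy(labels, probabilities), log(2))
```

This makes ln(2) a handy sanity check for the starting cost of any logistic regression initialized at zero.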

The gradient computes the first derivative of the cost function with respect to each value of theta:

In [39]:
gradient(X, y, theta)

array([-0.01 , -0.01 , -0.01 , -0.07 ,  0.02 , -0.025,  0.065, -0.045,
       -0.005, -0.045,  0.085, -0.05 ,  0.02 , -0.1  ,  0.12 , -0.05 ,
       -0.02 , -0.07 ,  0.13 , -0.04 , -0.025, -0.035,  0.09 ])
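
A common way to gain confidence in a hand-written gradient is to compare it with a finite-difference estimate of the cost. The sketch below does this in plain Python on made-up data; `numerical_gradient` and the step size `eps` are assumptions for illustration and are not part of the notebook's code.

```python
from math import exp, log

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def hypothesis(theta, x):
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))

def cost(X, Y, theta):
    return sum(-y * log(hypothesis(theta, x)) - (1 - y) * log(1 - hypothesis(theta, x))
               for x, y in zip(X, Y)) / len(Y)

def gradient(X, Y, theta):
    """Analytic gradient: mean of (prediction - label) * feature."""
    grad = [0.0] * len(theta)
    for x, y in zip(X, Y):
        error = hypothesis(theta, x) - y
        for j, xj in enumerate(x):
            grad[j] += error * xj / len(Y)
    return grad

def numerical_gradient(X, Y, theta, eps=1e-6):
    """Central-difference estimate of each partial derivative of the cost."""
    estimates = []
    for j in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[j] += eps
        down[j] -= eps
        estimates.append((cost(X, Y, up) - cost(X, Y, down)) / (2 * eps))
    return estimates

# Made-up data: intercept column plus two binary features.
X = [[1.0, 0.0, 1.0], [1.0, 1.0, 0.0], [1.0, 1.0, 1.0], [1.0, 0.0, 0.0]]
Y = [0, 1, 1, 0]
theta = [0.1, -0.2, 0.3]

analytic = gradient(X, Y, theta)
numeric = numerical_gradient(X, Y, theta)
max_diff = max(abs(a - n) for a, n in zip(analytic, numeric))
print(max_diff)
```

If the two gradients disagree by more than a tiny amount, the analytic gradient (or the cost) probably contains a bug.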

#### Optimization Using fmin_ncg

Alright, we've almost arrived. We need to choose a solver to minimize our cost function. I'm going to use the scipy function [fmin_ncg](https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.optimize.fmin_ncg.html), which plays the role that `fminunc` plays in MATLAB/Octave, to optimize our values for theta:

In [40]:
import scipy.optimize
def mycost(t):
    return cost(X, y, t)

def mygrad(t):
    return gradient(X, y, t)

# X already carries the intercept column and theta its extra element from the
# earlier cell, so they are not rebuilt here (a second insert would duplicate the column)
optimal_theta = scipy.optimize.fmin_ncg(mycost, theta, fprime=mygrad)

         Current function value: 0.000138
         Iterations: 8
         Function evaluations: 9
         Gradient evaluations: 1067
         Hessian evaluations: 0


Now we can check if our function is able to predict if a bag is a-type or b-type.

In [19]:
def predict(x):
    T = np.insert(x, 0, 1)
    H = sigmoid(np.dot(T, optimal_theta))       
    if H >= 0.5:
        p = 'bag_b'
    else:
        p = 'bag_a'    
    return p

No matter how we shake our bag contents, our model gets it right every time:

In [32]:
np.random.shuffle(bag_type_b)
predict(np.ndarray.flatten(np.array(bag_type_b)))

'bag_b'

We know a priori that we could categorize our bags with a subset of the information (e.g. we know bag_a has two circles whereas bag_b has three, bag_b has no triangles, bag_a has no chords, etc.). Can we use our TF-IDF to preprocess our bag data to reflect this? Of course we can. The most aggressive approach would retain only the highest TF-IDF shape for each bag. We could progressively relax this by retaining more and more of the top TF-IDF shapes.

<a id='logtweet'></a>

## Logistic Regression on Twitter Data

Let's now perform this analysis on our twitter data. 

Recall that our simple sandbox problem of four shapes in a bag of five shapes resulted in feature vectors of length 20. In the case of our twitter data, we have thousands of unique words and "bags" of up to 280 characters. The vectors we would create with our prior method would be enormous and sparse. Common methods to circumvent this problem are to:

* preprocess text with stemming (word reduction)
* remove stop words (word reduction)
* retain only top occurring words (word reduction)
* vectorize words into meaningful vector space (sparsity reduction)

...and there are new methods being explored all the time. Removal of stop words is a no-brainer, and we will do that here. Stemming is a harder call. Imagine the case where we condense "warm", "warmed", and "warming" into a single vector representation. These words could encode vastly different sentiment depending on whether a tweeter believes "warming" is occurring or that the planet has "warmed" in the past, etc.
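
To see the collapse in action, here is a deliberately crude suffix stripper. It is a toy stand-in written for this illustration, not a real stemming algorithm, and `toy_stem` is a hypothetical name.

```python
def toy_stem(word):
    """Strip a few common English suffixes. A toy stand-in, not a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        # keep at least three characters so very short words are left alone
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["warm", "warmed", "warming"]
print({word: toy_stem(word) for word in words})
```

All three words map to the same stem, so any difference in tense or aspect is invisible to a model trained on the stemmed tokens.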

Retaining only the top occurring words is a straightforward method of reducing our word space. We may intuit, however, that the discerning words in our corpus are what we're really after for inclusion in our model. We can use our TF-IDF analysis to find them.



#### First Pass on the Twitter Feed
##### Where we introduce keras, tokenizer, and logistic regression with sklearn

Before we apply all these preprocessing steps, let's see what we get in our first pass. We'll integer-represent all the words in our dataset and include 30 words per tweet, padding the short tweets with zeros and truncating the long tweets:

In [293]:
from sklearn.model_selection import train_test_split
from keras.preprocessing import sequence
from keras.preprocessing.text import Tokenizer
data = pd.read_csv("../../core/data/tweet_global_warming.csv", encoding="latin")
print("Full dataset: {}".format(data.shape[0]))
data['existence'].replace(('Y', 'N'), ('Yes', 'No'), inplace=True) 
data.dropna(inplace=True) #remove ambiguous tweets
tweets = data.iloc[:,0]
sentiment = data.iloc[:,1]
# note: np.hstack(tweets) yields whole tweets, so this counts unique tweet strings
print("Number of unique words: {}".format(len(np.unique(np.hstack(tweets)))))

top_words = 20000
max_words = 30 #max/min vector length
test_split = 0.25

#convert X to ints
token = Tokenizer(num_words=top_words, filters='!"#$%&()*+,-./:;<=>?@[\]^_`{|}~',
                  lower=True, split=' ', char_level=False, oov_token=None)
token.fit_on_texts(texts=tweets)
X = token.texts_to_sequences(texts=tweets)

X_train, X_test, Y_train, Y_test = train_test_split(X,sentiment, test_size=test_split)

X_train = sequence.pad_sequences(X_train, maxlen=max_words)
X_test = sequence.pad_sequences(X_test, maxlen=max_words)
print(np.unique(sentiment,return_counts=True))

Full dataset: 6090
Number of unique words: 3889
(array(['No', 'Yes'], dtype=object), array([1114, 3111]))
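
To make the integer encoding and padding steps concrete, here is a simplified pure-Python sketch. It assigns ids in order of first appearance, which is a simplification (the Keras `Tokenizer` ranks words by frequency), and `pad_or_truncate` mimics what I understand to be the default `pad_sequences` behavior of padding and truncating at the front. All three helper names are hypothetical.

```python
def build_index(texts):
    """Assign integer ids starting at 1 in order of first appearance (0 is reserved for padding)."""
    index = {}
    for text in texts:
        for word in text.lower().split():
            index.setdefault(word, len(index) + 1)
    return index

def encode(text, index):
    """Replace each known word with its integer id."""
    return [index[word] for word in text.lower().split() if word in index]

def pad_or_truncate(seq, maxlen):
    """Left-pad with zeros, or keep only the last maxlen ids."""
    if len(seq) >= maxlen:
        return seq[-maxlen:]
    return [0] * (maxlen - len(seq)) + seq

texts = ["Global warming is real", "warming"]
index = build_index(texts)
encoded = [pad_or_truncate(encode(text, index), 3) for text in texts]
print(index)
print(encoded)
```

Every tweet ends up as a fixed-length row of integers, which is what lets us stack them into a feature matrix.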


We're going to use the logistic regression model from sklearn:

In [294]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(solver='liblinear', max_iter=10000,
                           C=.0001, penalty='l1')
model.fit(X_train, Y_train)
print("training score: {:.3}".format(model.score(X_train, Y_train)))
print("testing score: {:.3}".format(model.score(X_test, Y_test)))
print(np.unique(model.predict(X_test),return_counts=True))

training score: 0.737
testing score: 0.724
(array(['No', 'Yes'], dtype=object), array([  50, 1007]))


The classification is biased toward 'Yes': the model labels 1007 of the 1057 test tweets 'Yes', so its testing score of 0.724 is close to what we would get by always guessing the majority class (3111 of the 4225 labeled tweets, about 0.736, are 'Yes'). Our algorithm is running into the issues that motivated us to use vector representations in the bag of shapes example: there are unwarranted mathematical relationships between words and there is no protection against the shuffling of words (i.e. the same $\theta_1$ value operates on any word appearing first within the tweet regardless of what word it is).
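
The shuffling problem is easy to demonstrate on a three-word example. In the sketch below, the word ids and the helpers `as_sequence` and `as_counts` are made up for illustration; the second representation is a simple bag-of-words count vector.

```python
from collections import Counter

index = {"warming": 1, "is": 2, "real": 3}  # made-up word ids
tweet = ["warming", "is", "real"]
shuffled = ["real", "warming", "is"]

def as_sequence(tokens):
    """Positional integer encoding: feature k is the id of the k-th word."""
    return [index[token] for token in tokens]

def as_counts(tokens):
    """Bag-of-words encoding: feature k is how often word k occurs."""
    counts = Counter(tokens)
    return [counts[word] for word in sorted(index, key=index.get)]

print(as_sequence(tweet), as_sequence(shuffled))
print(as_counts(tweet), as_counts(shuffled))
```

The positional encoding changes when the words are reordered, while the count vector does not, which is the property the one-hot bags had in the shapes example.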

#### Word Vectorization and Selection with TF-IDF

Let's revisit our work with TF-IDF.

In [295]:
dataset = list(read_data(data['tweet']))
dct = Dictionary(dataset)
corpus = [dct.doc2bow(line) for line in dataset]
model = TfidfModel(corpus)

After building our corpus, we loop through our tweets and keep only the five highest-scoring TF-IDF words in each:

In [296]:
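top_words = 5
lst = [[] for _ in range(len(dataset))]  # retained words for each tweet
X = [[] for _ in range(len(dataset))]  # retained token ids for each tweet
for j in range(len(dataset)):
    vector = model[corpus[j]] #create tfidf vector for tweet
    sorted_words = sorted(vector, key=lambda x: x[1], reverse=True) #sort by score, highest first
    for i in range(len(vector)):
        # keep entries whose score is among the top_words highest for this tweet
        if vector[i][1] in [elem for sublist in sorted_words[:top_words] for elem in sublist]:
            # note: vector is ordered by token id, so dataset[j][i] may not be the word
            # that owns this score; dct[vector[i][0]] looks it up by id
            lst[j].append(dataset[j][i])
            X[j].append(vector[i][0])
tokenized_tweets = lst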

If we only keep the top 5 words for every tweet in the corpus, we're left with 4704 unique words! To represent these orthogonally we would need 4704-length vectors, which is extremely sparse.

In [297]:
flat_list = [item for sublist in lst for item in sublist]
len(np.unique(np.array(flat_list)))

4704

With so many unique words, the best way to represent them would probably be with Google's [word2vec](https://code.google.com/archive/p/word2vec/) model. That implementation is probably worth a post/notebook of its own. To tie this work up, let's continue to represent our words with integers:

In [299]:
test_split = 0.25
X_train, X_test, Y_train, Y_test = train_test_split(X, sentiment, test_size=test_split)
X_train = sequence.pad_sequences(X_train, maxlen=top_words)
X_test = sequence.pad_sequences(X_test, maxlen=top_words)
model = LogisticRegression(solver='liblinear', max_iter=10000,
                           C=.00001, penalty='l1')
model.fit(X_train, Y_train)
print("training score: {:.3}".format(model.score(X_train, Y_train)))
print("testing score: {:.3}".format(model.score(X_test, Y_test)))
print(np.unique(model.predict(X_test),return_counts=True))

training score: 0.742
testing score: 0.719
(array(['Yes'], dtype=object), array([1057]))


So, after distilling our tweets down to the top 5 TF-IDF words for every tweet, our logistic regression breaks: we've lost so much detail that the model can only predict the majority class ('Yes') of our corpus.
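
A quick arithmetic check makes the point. The class counts printed earlier (1114 'No', 3111 'Yes') give the accuracy of a model that always answers 'Yes':

```python
class_counts = {"No": 1114, "Yes": 3111}  # counts printed earlier in the notebook

total = sum(class_counts.values())
majority_label = max(class_counts, key=class_counts.get)
baseline_accuracy = class_counts[majority_label] / total
print(majority_label, round(baseline_accuracy, 3))
```

The training and testing scores above sit on either side of this baseline, which is what we would expect from a model that has stopped using its features; the exact value differs between splits because the split is random.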

Let's sum this up by reverting back to the original dataset and see how we do limiting our preprocessing to the removal of stop words:

In [300]:
from nltk import word_tokenize
from nltk.corpus import stopwords
import string
stop = stopwords.words('english') + list(string.punctuation)
tokenized_tweets = []
for index, tweet in enumerate(tweets):
    tokens = [i for i in word_tokenize(tweet.lower()) if i not in stop]
    tokenized_tweets.append(tokens)

In [304]:
top_words = 20000
max_words = 30
test_split = 0.5

#convert X to ints
token = Tokenizer(num_words=top_words, filters='!"#$%&()*+,-./:;<=>?@[\]^_`{|}~',
                  lower=True, split=' ', char_level=False, oov_token=None)
token.fit_on_texts(texts=tokenized_tweets)
X = token.texts_to_sequences(texts=tokenized_tweets)

X_train, X_test, Y_train, Y_test = train_test_split(X,sentiment, test_size=test_split)

X_train = sequence.pad_sequences(X_train, maxlen=max_words)
X_test = sequence.pad_sequences(X_test, maxlen=max_words)

from sklearn.linear_model import LogisticRegression
model = LogisticRegression(solver='liblinear', max_iter=10000,
                           C=.1, penalty='l1')
model.fit(X_train, Y_train)
print("training score: {:.3}".format(model.score(X_train, Y_train)))
print("testing score: {:.3}".format(model.score(X_test, Y_test)))
print(np.unique(model.predict(X_test),return_counts=True))

training score: 0.734
testing score: 0.741
(array(['No', 'Yes'], dtype=object), array([  38, 2075]))


Although we get a small boost in our testing score, our regression still relies heavily on predicting the majority class (only 38 of the 2113 test predictions are 'No'). We can also try stemming our words:

In [306]:
from nltk.stem.lancaster import LancasterStemmer
from nltk.tokenize import RegexpTokenizer

# Tokenize and stem
tkr = RegexpTokenizer('[a-zA-Z0-9@]+')
stemmer = LancasterStemmer()

tokenized_corpus = []

for i, tweet in enumerate(tweets):
    tokens = [stemmer.stem(t) for t in tkr.tokenize(tweet) if not t.startswith('@')]
    tokenized_corpus.append(tokens)

This results in a regression that predicts 'No' more often, but at the expense of accuracy:

In [307]:
top_words = 20000
max_words = 30
test_split = 0.5

#convert X to ints
token = Tokenizer(num_words=top_words, filters='!"#$%&()*+,-./:;<=>?@[\]^_`{|}~',
                  lower=True, split=' ', char_level=False, oov_token=None)
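# note: this fits on tokenized_tweets (stop words removed); the stemmed tokens
# from the previous cell are stored in tokenized_corpus
token.fit_on_texts(texts=tokenized_tweets)
X = token.texts_to_sequences(texts=tokenized_tweets)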

X_train, X_test, Y_train, Y_test = train_test_split(X,sentiment, test_size=test_split)

X_train = sequence.pad_sequences(X_train, maxlen=max_words)
X_test = sequence.pad_sequences(X_test, maxlen=max_words)

model = LogisticRegression(solver='liblinear', max_iter=10000,
                           C=.0001, penalty='l1')
model.fit(X_train, Y_train)
print("training score: {:.3}".format(model.score(X_train, Y_train)))
print("testing score: {:.3}".format(model.score(X_test, Y_test)))
print(np.unique(model.predict(X_test),return_counts=True))

training score: 0.721
testing score: 0.709
(array(['No', 'Yes'], dtype=object), array([ 132, 1981]))


At this point we've pretty much exhausted what we can get with integer representation of our words. Next up: applying Google's word2vec 300-length vector representation with logistic regression.