## Vectorization

Before vectorizing, I have to pre-process the `text` column:

- Convert all letters to lower case.
- Split the tweets into tokens. Tokens are the words in a text, separated by spaces.
- Remove unwanted characters, such as punctuation marks, special characters and extra white space.
- Remove stop words, as defined by the nltk library.
- Apply lemmatization, which returns a word to its base or dictionary form. Example: better -> good.
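
The steps above can be sketched in a few lines. This is only an illustration, not the pre-processing code from the previous notebook: `STOP_WORDS` and `LEMMAS` are tiny hypothetical stand-ins for the nltk stop-word list and a real lemmatizer, so that the sketch runs without downloading any nltk corpora.

```python
import re

# Illustrative stand-ins (assumptions): the notebook relies on nltk's English
# stop-word list and a real lemmatizer instead of these hard-coded values.
STOP_WORDS = {"i", "am", "are", "the", "a", "is", "to", "and", "than"}
LEMMAS = {"better": "good", "running": "run", "cats": "cat"}

def preprocess(tweet):
    """Lower-case, strip unwanted characters, tokenize, drop stop words, lemmatize."""
    lowered = tweet.lower()
    # Keep letters and white space only; punctuation, digits and symbols become spaces
    cleaned = re.sub(r"[^a-z\s]", " ", lowered)
    tokens = cleaned.split()  # split() also collapses repeated white space
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Map each token to its base form when the lookup table knows it
    return [LEMMAS.get(t, t) for t in tokens]

print(preprocess("The cats are RUNNING better than I am!!!"))  # ['cat', 'run', 'good']
```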

Import the packages used in this notebook:

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import nltk
# nltk.download('stopwords')
from nltk.corpus import stopwords

import warnings
warnings.filterwarnings('ignore')

Import the preprocessed data from the previous step and randomly shuffle the dataset. In the next step, the dataset is split into chunks that are partially fitted to the model one at a time. The dataframe is arranged so that the first half of the entries carry positive sentiment and the second half negative sentiment, so without shuffling a chunk could contain tweets of a single sentiment only (all 0 or all 1):

In [2]:
preprocess_train = pd.read_csv('data/preprocess_data.csv')
preprocess_train = preprocess_train.sample(frac=1,random_state=1).reset_index(drop=True)
preprocess_train.head()

Unnamed: 0,sentiment,text
0,0,Running heat recommended Although falling gett...
1,0,need dye hair
2,1,whenever get bored ill alot since school need
3,0,bye homework
4,1,taylor swiftkeith urban boston MA 80 sure im g...
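
To see why the shuffle matters for chunked training, here is a small sketch on a toy dataframe. The frame, the chunk size of 250 and the helper `classes_per_chunk` are assumptions made for illustration; only the `sample(frac=1, random_state=1)` call mirrors the cell above.

```python
import pandas as pd

# Toy frame (assumption): first half positive, second half negative
toy = pd.DataFrame({"sentiment": [1] * 500 + [0] * 500})

def classes_per_chunk(frame, chunk_size=250):
    """Return the number of distinct sentiment labels in each consecutive chunk."""
    return [frame["sentiment"].iloc[i:i + chunk_size].nunique()
            for i in range(0, len(frame), chunk_size)]

unshuffled = classes_per_chunk(toy)
shuffled = classes_per_chunk(toy.sample(frac=1, random_state=1).reset_index(drop=True))

print(unshuffled)  # [1, 1, 1, 1]: every chunk holds a single label
print(shuffled)    # chunks of the shuffled frame mix both labels
```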


In [3]:
preprocess_train['sentiment'].value_counts()

1    797022
0    795809
Name: sentiment, dtype: int64

In [5]:
# Set up our predictor and target columns
X = preprocess_train.drop(['sentiment'],axis=1)
y = preprocess_train[['sentiment']]

### Data for Incremental Learning

For a fair model evaluation, the whole dataset is split into train and validation sets before vectorization. Otherwise, the vectorizer would be fitted on all of the data, and information from the validation set (its vocabulary and term frequencies) would leak into training:

In [5]:
# Split the dataset into train and validation
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

X_train = X_train.reset_index(drop=True)
X_test = X_test.reset_index(drop=True)
y_train = y_train.reset_index(drop=True)
y_test = y_test.reset_index(drop=True)
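
A minimal sketch of the leakage that splitting first avoids, on a made-up three-document corpus: when the vectorizer is fitted on all documents, a word that only occurs in the held-out document still ends up in the vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (assumption): the word "zebra" occurs only in the held-out document
train_docs = ["happy sunny day", "sad rainy day"]
test_docs = ["zebra day"]

# Leaky variant: fitting on all documents lets the held-out vocabulary in
leaky = TfidfVectorizer().fit(train_docs + test_docs)

# Variant used in this notebook: fit on the train documents only
clean = TfidfVectorizer().fit(train_docs)

print("zebra" in leaky.vocabulary_)  # True
print("zebra" in clean.vocabulary_)  # False
```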

Now I can vectorize the preprocessed `text` column of the train set, with the following conditions set:
- max_features: only the top 10,000 features, ordered by term frequency across the corpus, are considered in the model.
- min_df and max_df: ignore terms that appear in fewer than 0.01% of the documents (`min_df=0.0001`) and terms that appear in more than 20% of the documents (`max_df=0.2`).
- stop_words: pass in a list of stop words, containing frequently appearing terms like 'we', 'have' and 'and', which are then ignored.
- token_pattern: ignore tokens that contain one or more digits.
- strip_accents: remove accents and perform other character normalization on characters that have a direct ASCII mapping (https://www.ascii-code.com/).
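
To make the `token_pattern` condition concrete, the regular expression can be tried directly with Python's `re` module. The sample text is made up; the pattern is the one passed to the vectorizer in this notebook.

```python
import re

# Same pattern as the `token_pattern` argument used in this notebook
TOKEN_PATTERN = re.compile(r'\b[^\d\W]+\b')

sample = "see u 2nite at 8pm gr8 day"  # made-up tweet text
tokens = TOKEN_PATTERN.findall(sample)

# Tokens containing a digit ("2nite", "8pm", "gr8") are dropped as a whole
print(tokens)  # ['see', 'u', 'at', 'day']
```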

In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer  
# Vectorizing the text column
X_text = X_train['text'].astype(str)

vectorizer = TfidfVectorizer(max_features=10000,
                             min_df=0.0001, max_df=0.2,
                             stop_words=stopwords.words('english'),
                             token_pattern=r'\b[^\d\W]+\b',
                             strip_accents="ascii")
X_text = vectorizer.fit_transform(X_text)

# Transforming the sparse matrix into a dataframe
# (on scikit-learn < 1.0, use vectorizer.get_feature_names() instead)
X_text = pd.DataFrame(columns=vectorizer.get_feature_names_out(),data=X_text.toarray())
X_text.shape

(1274264, 5598)

In [7]:
X_text

Unnamed: 0,aa,aaa,aaah,aah,aaron,abandoned,abc,ability,abit,able,...,yup,z,zac,zach,zealand,zero,zombie,zombies,zone,zoo
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1274259,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1274260,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1274261,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1274262,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
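
A rough back-of-the-envelope estimate shows how heavy this dense dataframe is. It assumes 8-byte float64 values (the default dtype of `TfidfVectorizer`) and ignores the pandas overhead.

```python
# Shape reported for X_text above
n_rows, n_cols = 1_274_264, 5_598
bytes_per_float64 = 8

dense_bytes = n_rows * n_cols * bytes_per_float64
dense_gb = dense_bytes / 1e9

print(f"{dense_gb:.1f} GB")  # 57.1 GB
```

Almost all of these values are zeros, which is why the vectorizer returns a sparse matrix in the first place.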


In [6]:
# Retrieving numerical features from the preprocessed dataframe
X_num = X_train.drop(['text'],axis=1)
X_num.shape

(1277304, 7)

In [21]:
# Save the vocabulary of the vectorizer, for transforming the twitter data
import pickle
pickle.dump(vectorizer.vocabulary_,open("new_pickle/feature.pkl","wb"))
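
One possible way to reuse the saved vocabulary later is sketched below, on a toy corpus and with an in-memory pickle round trip instead of a file. Only the term-to-column mapping is stored, so in this sketch the IDF weights are re-estimated from the new text; pickling the whole fitted vectorizer would be an alternative if the original weights are needed.

```python
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (assumption) standing in for the train tweets
train_docs = ["good morning sunshine", "bad rainy morning", "good coffee"]
fitted = TfidfVectorizer().fit(train_docs)

# Round-trip the vocabulary through pickle (in memory here; the notebook writes to a file)
restored_vocab = pickle.loads(pickle.dumps(fitted.vocabulary_))

# New vectorizer restricted to the saved vocabulary; fit_transform re-estimates
# the IDF weights from the new text but keeps the same columns
reloaded = TfidfVectorizer(vocabulary=restored_vocab)
new_matrix = reloaded.fit_transform(["good evening", "rainy coffee break"])

print(new_matrix.shape)  # (2, 6): one column per saved term
```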

Then I will merge the vectorized columns with the numerical columns of the train set. Since the dataframe is too big to concatenate in one go, I split the train data into chunks of 100,000 entries each (13 chunks, the last one partial), concatenate them chunk by chunk, and save each chunk for modelling:

In [8]:
# X_num split
n = 100000  #chunk row size
list_num = [X_num[i:i+n] for i in range(0,X_num.shape[0],n)]

In [9]:
# X_text split
n = 100000  #chunk row size
list_text = [X_text[i:i+n] for i in range(0,X_text.shape[0],n)]
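
The chunk boundaries produced by this slicing can be checked without loading any data. The sketch below only uses the row count reported for `X_text` and the chunk size `n` from the cells above.

```python
import math

n_rows = 1_274_264  # rows in X_text, as reported above
n = 100_000         # chunk row size used in this notebook

starts = list(range(0, n_rows, n))
chunk_sizes = [min(n, n_rows - start) for start in starts]

print(len(chunk_sizes))  # 13
print(chunk_sizes[-1])   # 74264: the last chunk is partial
assert len(chunk_sizes) == math.ceil(n_rows / n)
```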

In [11]:
#for i in range(9,13):
#    # Keep the TF-IDF weights as floats; casting to int64 would truncate them to 0
#    X_chunk = pd.concat([list_num[i],list_text[i]],axis=1)
#    X_chunk.to_csv(f'/Users/JennaFu/Desktop/DataScience/BrainStation/Capstone/data/X_{i}.csv')

In [12]:
#batch_size = 100000
#for i in range(0, len(y_train), batch_size):
#    # Use a separate name so y_train is not overwritten inside the loop
#    y_chunk = y_train[i:i+batch_size]
#    y_chunk.to_csv(f'/Users/JennaFu/Desktop/DataScience/BrainStation/Capstone/data/y_{i // batch_size}.csv')

After saving the train data in chunks, I will do the same with the test data. Before splitting it into chunks, I have to vectorize the test dataset as well, using the vectorizer fitted on the train data, so that the train and test datasets have the same features. Transforming the test data with this vectorizer keeps only the features that were learned from the train data.

In [16]:
# Transform the test set based on the vectorizer
#X_test_text = X_test['text'].values.astype('U')
#X_test_text = vectorizer.transform(X_test_text)
#X_test_text = pd.DataFrame(columns=vectorizer.get_feature_names_out(),data=X_test_text.toarray())

# Retrieve numerical features in train and test sets
#X_train_num = X_train.drop(['text'],axis=1)
#X_test_num = X_test.drop(['text'],axis=1)

Then I will be concatenating the text and numerical features, and saving the test data for modelling.

In [15]:
#X_test = pd.concat([X_test_num,X_test_text],axis=1)

In [None]:
#X_test.to_csv(f'/Users/JennaFu/Desktop/DataScience/BrainStation/Capstone/data/X_test.csv')
#y_test.to_csv(f'/Users/JennaFu/Desktop/DataScience/BrainStation/Capstone/data/y_test.csv')

### Data for Reduced Model

In [6]:
# Reset our predictor and target columns
X = preprocess_train.drop(['sentiment'],axis=1)
y = preprocess_train[['sentiment']]

Besides using incremental learning to train the model on the entire dataset (1.6 million entries), I also want to compare the results when the model is trained on a smaller subset. I have chosen a sample of 600,000 entries from the dataset. Then, using the same process as above, I have split the sample into train and test sets.

In [7]:
# Create a sample dataset X and y (the same random_state keeps their rows aligned)
X = X.sample(n=600000, random_state=7).reset_index(drop=True)
y = y.sample(n=600000, random_state=7).reset_index(drop=True)

In [8]:
# Split the dataset into train and test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

X_train = X_train.reset_index(drop=True)
X_test = X_test.reset_index(drop=True)
y_train = y_train.reset_index(drop=True)
y_test = y_test.reset_index(drop=True)

Using the same conditions, except that `ngram_range=(1,2)` adds two-word terms and `max_df` is set to 0.5, I have vectorized the preprocessed text column of the subset:

In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer  
# Vectorizing the text column
X_train_text = X_train['text'].astype(str)

vectorizer = TfidfVectorizer(max_features=10000,
                             ngram_range=(1,2),
                             min_df=0.0001, max_df=0.5,
                             stop_words=stopwords.words('english'),
                             token_pattern=r'\b[^\d\W]+\b',
                             strip_accents="ascii")
X_train_text = vectorizer.fit_transform(X_train_text)

# Transforming the sparse matrix into a dataframe
# (on scikit-learn < 1.0, use vectorizer.get_feature_names() instead)
X_train_text = pd.DataFrame(columns=vectorizer.get_feature_names_out(),data=X_train_text.toarray())
X_train_text.shape

(480000, 8808)
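
The vectorizer for the reduced dataset sets `ngram_range=(1,2)`, which adds pairs of adjacent words to the features. A small sketch on a made-up document, with all other parameters left at their defaults, shows the effect.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up single-document corpus for illustration
docs = ["not good movie"]

unigrams_only = TfidfVectorizer(ngram_range=(1, 1)).fit(docs)
with_bigrams = TfidfVectorizer(ngram_range=(1, 2)).fit(docs)

print(sorted(unigrams_only.vocabulary_))  # ['good', 'movie', 'not']
print(sorted(with_bigrams.vocabulary_))   # ['good', 'good movie', 'movie', 'not', 'not good']
```

Bigrams such as "not good" can carry sentiment that the single words lose, at the cost of a larger vocabulary.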

In [10]:
# Save the vocabulary of the vectorizer, for transforming the twitter data
import pickle
pickle.dump(vectorizer.vocabulary_,open("pickle/reduced_new_feature.pkl","wb"))

I will transform the test data using the vectorizer fitted on the train data of the reduced dataset, again so that the train and test sets share the same features.

In [13]:
# Transform the test set based on the vectorizer
X_test_text = X_test['text'].values.astype('U')
X_test_text = vectorizer.transform(X_test_text)
X_test_text = pd.DataFrame(columns=vectorizer.get_feature_names_out(),data=X_test_text.toarray())
X_test_text.shape

(120000, 8808)
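
The matching column counts above (8808 for both sets) follow from calling `transform` with the already fitted vectorizer. The toy sketch below, with made-up documents, shows that words unseen during fitting are simply dropped.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["good morning sunshine", "bad rainy morning"]
test_docs = ["good rainy evening", "zzz qqq"]  # "evening", "zzz" and "qqq" are unseen

vectorizer = TfidfVectorizer()
train_matrix = vectorizer.fit_transform(train_docs)
test_matrix = vectorizer.transform(test_docs)  # reuses the train vocabulary

same_width = train_matrix.shape[1] == test_matrix.shape[1]
unseen_row_weight = test_matrix[1].sum()  # second test document has no known word

print(train_matrix.shape, test_matrix.shape)  # (2, 5) (2, 5)
print(unseen_row_weight)
```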

## Model with Reduced Dataset (Without Numerical Features)

In [14]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# Classifiers used
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

In [15]:
X_test = X_test_text
X_train = X_train_text

### Hyperparameter Tuning

In [None]:
from collections import Counter
import numpy as np
from sklearn.svm import LinearSVC

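# Fit the scaler on the train set and transform both sets
# (the returned arrays are not assigned, so X_train and X_test stay unscaled)
scaler = RobustScaler()
scaler.fit_transform(X_train)
scaler.transform(X_test)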

In [40]:
# Compare three types of scalers with logistic regression
for scaler in [StandardScaler(), MinMaxScaler(), RobustScaler()]:
    
    # Fit the scaler on the train set and transform both sets.
    # The returned arrays are not assigned, so X_train and X_test stay
    # unscaled and every scaler yields the same scores below.
    scaler.fit_transform(X_train)
    scaler.transform(X_test)
    
    # Instantiate the model
    Logistic = LogisticRegression()
    
    # Fit the model to the train set
    Logistic.fit(X_train,y_train)
    
    # Score the model
    print(scaler)
    print(Logistic.score(X_train,y_train))
    print(Logistic.score(X_test,y_test))

StandardScaler(copy=True, with_mean=True, with_std=True)
0.802925
0.76145
MinMaxScaler(copy=True, feature_range=(0, 1))
0.802925
0.76145
RobustScaler(copy=True, quantile_range=(25.0, 75.0), with_centering=True,
             with_scaling=True)
0.802925
0.76145
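
The scalers do not modify their input in place: `fit_transform` and `transform` return new arrays. A minimal sketch on toy numbers (the arrays and the names `X_train_scaled` and `X_test_scaled` are assumptions for illustration) shows how the scaled data could be captured and passed on to a model.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy data (assumption) standing in for the TF-IDF features
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test = np.array([[2.0, 40.0]])

scaler = MinMaxScaler()
scaler.fit_transform(X_train)  # return value discarded: X_train itself is unchanged
print(X_train[0])              # [ 1. 10.]

# Assign the returned arrays so that later steps can use the scaled values
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # uses the min and max learned from the train set
print(X_train_scaled[1])       # [0.5 0.5]
print(X_test_scaled[0])        # [0.5 1.5]
```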


In [41]:
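# Compare three types of scalers with a linear SVC
for scaler in [StandardScaler(), MinMaxScaler(), RobustScaler()]:
    
    # Fit the scaler on the train set and transform both sets.
    # The returned arrays are not assigned, so X_train and X_test stay
    # unscaled and every scaler yields the same scores below.
    scaler.fit_transform(X_train)
    scaler.transform(X_test)
    
    # Instantiate the model
    SVC = LinearSVC()
    
    # Fit the model to the train set
    SVC.fit(X_train,y_train)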
    
    # Score the model
    print(scaler)
    print(SVC.score(X_train,y_train))
    print(SVC.score(X_test,y_test))

StandardScaler(copy=True, with_mean=True, with_std=True)
0.8159125
0.75245
MinMaxScaler(copy=True, feature_range=(0, 1))
0.8159125
0.75245
RobustScaler(copy=True, quantile_range=(25.0, 75.0), with_centering=True,
             with_scaling=True)
0.8159125
0.75245


In [30]:
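# Grid search over the regularization strength C and the weight of class 1
svc_comb = []     # [C, class-1 weight] combinations tried
svc_counter = []  # predicted label counts on the test set
svc_score = []    # test accuracy of each combination
for i in [0.0001,0.001,0.01,0.1,1,10]:
    for j in range(2,6):
        clf = LinearSVC(C = i ,class_weight={0: 1, 1: j})
        clf = clf.fit(X_train, y_train)
        # Record the combination, the prediction counts and the test score
        svc_comb.append([i,j])
        svc_counter.append(Counter(clf.predict(X_test)))
        svc_score.append(clf.score(X_test,y_test))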
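
The loop above tries 6 values of `C` with 4 class weights each, 24 combinations in total. One way to read off the best one is sketched below. The helper `best_combination` is hypothetical, and the scores are placeholders, not results from this notebook.

```python
from itertools import product

C_values = [0.0001, 0.001, 0.01, 0.1, 1, 10]
class1_weights = range(2, 6)

# Same order as the nested loops: C in the outer loop, class weight in the inner one
combos = [[c, w] for c, w in product(C_values, class1_weights)]

def best_combination(combos, scores):
    """Return the (combination, score) pair with the highest score."""
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return combos[best_index], scores[best_index]

# Placeholder scores (NOT results from this notebook), one per combination
placeholder_scores = [0.50 + 0.01 * k for k in range(len(combos))]

print(len(combos))  # 24
print(best_combination(combos, placeholder_scores)[0])  # [10, 5] for these placeholder scores
```

In the notebook itself, `svc_comb` and `svc_score` would take the place of `combos` and `placeholder_scores`.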

In [12]:
# Fit the scaler on the train set and transform both sets
# (the scaled test array is displayed below but not assigned)
scaler = RobustScaler()
scaler.fit_transform(X_train)
scaler.transform(X_test)

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [32]:
# Instantiate the model
clf = LinearSVC(C = 10)
clf = clf.fit(X_train, y_train)

# Score the model
print(clf.score(X_train,y_train))
print(clf.score(X_test,y_test))

0.78615625
0.7734166666666666


In [34]:
import pickle
filename = 'pickle/svc_new_model.sav'
# save the model to disk
pickle.dump(clf, open(filename, 'wb'))

In [33]:
from sklearn.metrics import confusion_matrix
y_pred = clf.predict(X_test)
confusion_matrix(y_test, y_pred)

array([[44686, 15507],
       [11683, 48124]])
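
The confusion matrix can be turned into a few summary metrics by hand. The sketch below copies the four counts from the output above and assumes scikit-learn's convention that rows are true labels and columns are predicted labels.

```python
# Confusion matrix printed above: rows are true labels, columns are predictions
(tn, fp), (fn, tp) = [[44686, 15507], [11683, 48124]]

accuracy = (tn + tp) / (tn + fp + fn + tp)
precision_pos = tp / (tp + fp)  # share of predicted positives that are correct
recall_pos = tp / (tp + fn)     # share of true positives that are found

print(round(accuracy, 4))  # 0.7734, matching the test score above
print(round(precision_pos, 4))
print(round(recall_pos, 4))
```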