# Tweet Sentiment NLP

This notebook covers sentiment analysis for a collection of 1.6 million tweets, sourced from this Kaggle [dataset](https://www.kaggle.com/kazanova/sentiment140). My goal is to build a series of machine learning models that predict document polarity with reasonable accuracy, each attacking the classification problem with a different natural language processing approach and level of sophistication. The code below is split into three sections: generalized text preprocessing, single-layer model evaluation, and multi-layer model evaluation.

## Generalized Text Preprocessing

### Basic Corpus Cleaning

For the generalized text preprocessing section, the only modifications made are those that do not intentionally "lose" tweet information (i.e. no dropping words or other destructive changes). This leaves only formatting, standardization, and other pure data-cleaning methods. See the "general_preprocess" function for the full list and order of data preparation techniques used. In summary, the changes are: replacing urls, usernames, emoticons, and unrecognized characters with words; expanding contractions and common abbreviations; removing all non-alphanumeric characters; and truncating egregiously long sequences of repeated characters. The corpus should then consist solely of recognized English language words and numbers.
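As a rough illustration of the regex-based steps described above, here is a minimal sketch that uses only the standard library. The helper name `sketch_clean` and the sample tweet are made up for this example; the patterns follow the ones in `general_preprocess`, but the emoticon, contraction, and abbreviation steps are left out because they depend on external files and packages.

```python
import re

def sketch_clean(text):
    # hypothetical helper: a subset of the steps in general_preprocess
    mod_text = ' ' + text + ' '
    # replace usernames and urls with placeholder words
    mod_text = re.sub(r'(?<=\s)(@\S+)(?=\s)', ' USER ', mod_text)
    mod_text = re.sub(r'(?<=\s)(https?://\S+)(?=\s)', ' URL ', mod_text)
    # drop non-alphanumeric characters and lower case the text
    mod_text = re.sub(r'[^a-zA-Z0-9]', ' ', mod_text).lower()
    # collapse runs of three or more identical characters down to two
    mod_text = re.sub(r'(.)\1{2,}', r'\1\1', mod_text)
    # squeeze repeated spaces and trim
    return re.sub(r' +', ' ', mod_text).strip()

print(sketch_clean('@bob sooooo goooood!!! http://example.com/x'))
# prints: user soo good url
```

Note that the placeholders end up lower case ("user", "url") because lower casing runs after the replacement, which matches the processed output shown later in the notebook.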

In [1]:
# check if modules are installed
from subprocess import Popen, PIPE

output = Popen("pip list | awk '{print $1}'", shell = True, stdout=PIPE).stdout.read().split()
packages = [x.decode('utf-8') for x in output][2:]
modules = ['contractions', 'demoji', 'kaggle', 'nltk', 'pandas', 'scikit-learn']
for nm in modules:
    if nm not in packages:
        ! pip install {nm}

In [17]:
# importing modules and setup
import contractions
import csv
import demoji
import glob
import nltk
import os
import pandas as pd
import re
import zipfile
from kaggle.api.kaggle_api_extended import KaggleApi
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedShuffleSplit

In [3]:
# downloading dataset
api = KaggleApi()
api.authenticate()
dataset = 'kazanova/sentiment140'
csv_name = 'training.1600000.processed.noemoticon.csv'
try:
    os.remove(csv_name)
except:
    pass
api.dataset_download_file(dataset, file_name=csv_name, path='./')
fn = glob.glob('train*.zip', recursive = True)[0]
with zipfile.ZipFile(fn) as zip_file:
    for file in zip_file.namelist():
        if file == csv_name:
            zip_file.extract(csv_name)
os.remove(fn)

In [4]:
# importing data
columns = ['target', 'text']
df_data = pd.read_csv(csv_name, usecols = [0, 5], header = None, names = columns)
print(f'df_data dimensions: {df_data.shape}')
df_data.head()

df_data dimensions: (1600000, 2)


| | target | text |
|---|---|---|
| 0 | 0 | @switchfoot http://twitpic.com/2y1zl - Awww, t... |
| 1 | 0 | is upset that he can't update his Facebook by ... |
| 2 | 0 | @Kenichan I dived many times for the ball. Man... |
| 3 | 0 | my whole body feels itchy and like its on fire |
| 4 | 0 | @nationwideclass no, it's not behaving at all.... |


In [5]:
# importing emoticon descriptions
df_repl = pd.read_csv('emoticon_descriptions.csv', header = 0, usecols =[0, 1], names = ['emoticon', 'description'])
print(f'df_repl dimensions: {df_repl.shape}')
df_repl.head()

df_repl dimensions: (138, 2)


| | emoticon | description |
|---|---|---|
| 0 | :?) | smile |
| 1 | :) | smile |
| 2 | :-] | smile |
| 3 | :] | smile |
| 4 | :-3 | smile |


In [6]:
# moving to dictionary
dict_emot = {a:b for a, b in zip(df_repl.iloc[:, 0], df_repl.iloc[:, 1])}

In [7]:
# adding html replacements
dict_emot['&quot;'] = 'quote'
dict_emot['&amp;'] = 'and'
dict_emot['&lt;'] = 'less than'
dict_emot['&gt;'] = 'greater than'

In [8]:
# creating dictionary of common abbreviations
df_repl = pd.read_csv('common_abbreviations.csv').applymap(lambda x: x.lower())
dict_abbr = {a:b for a, b in zip(df_repl.iloc[:, 0], df_repl.iloc[:, 1])}

In [9]:
# general preprocessing tweet body text
def general_preprocess(text):
    '''
    Returns a generally-applicable preprocessed version of the passed string.

        Parameters:
            text (str) : passed string
        
        Returns:
            mod_text (str) : preprocessed string
    '''

    # add leading and trailing whitespace
    mod_text = ' ' + text + ' '

    # replace usernames
    mod_text = re.sub(r'(?<=\s)(@\S+)(?=\s)', ' USER ', mod_text)
    
    # replace urls
    mod_text = re.sub(r'(?<=\s)(https?:\/\/\S+)(?=\s)', ' URL ', mod_text)
    
    # replace non-space whitespace
    mod_text = re.sub(r'\s+', ' ', mod_text)
    
    # replace emoticons with text
    for i, k in dict_emot.items():
        mod_text = mod_text.replace(i, f' {k} ')
    
    # replace unrecognized characters
    mod_text = mod_text.replace('İ', 'I')
    
    # expand contractions
    mod_text = contractions.fix(mod_text)

    # remove non-alphanumeric characters
    mod_text = re.sub(r'[^a-zA-Z0-9]', ' ', mod_text)

    # lower case text
    mod_text = mod_text.lower()

    # truncate repeated characters (runs of 3+ collapse to 2; this also applies to digits, e.g. 1000 -> 100)
    mod_text = re.sub(r'(.)\1{2,}', r'\1\1', mod_text)

    # replace common abbreviations
    for i, k in dict_abbr.items():
        mod_text = mod_text.replace(f' {i} ', f' {k} ')
    
    # remove repeated spaces
    mod_text = re.sub(r'( )\1+', ' ', mod_text)

    # trim leading / trailing whitespace
    mod_text = mod_text.strip()

    return mod_text


In [10]:
# apply function to dataframe
df_data.loc[:, 'text'] = df_data['text'].apply(general_preprocess)
df_data.head()

| | target | text |
|---|---|---|
| 0 | 0 | user url aww that is a bummer you shoulda got ... |
| 1 | 0 | is upset that he can not update his facebook b... |
| 2 | 0 | user i dived many times for the ball managed t... |
| 3 | 0 | my whole body feels itchy and like its on fire |
| 4 | 0 | user no it is not behaving at all i am mad why... |


In [12]:
# saving dataframe to pickle
df_data.to_pickle('df_data.pkl')

## Single Layer Machine Learning Models

### More Processing

The most basic form of natural language encoding represents each document by its set of unique word tokens and their counts. The ML models in this section will all be trained on a weighted variant of this method, term frequency-inverse document frequency (TF-IDF) vectors. TF-IDF scales each token count by how rare the token is across the corpus, highlighting words that are more distinctive to a document, while vectorization converts the string of text into a numeric vector that a model can consume.
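To make the weighting concrete, here is a small pure-Python sketch of TF-IDF on a toy corpus. The three documents and the helper `tfidf_vector` are invented for illustration. The smoothed idf formula and L2 normalization are meant to mirror what I understand to be scikit-learn's `TfidfVectorizer` defaults, so treat the numbers as an approximation of that behavior rather than a replacement for it.

```python
import math
from collections import Counter

# toy corpus (assumption, not taken from the tweet dataset)
docs = ['good day', 'bad day', 'good good fun']

n_docs = len(docs)
tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})

# document frequency: number of documents containing each token
doc_freq = Counter(t for doc in tokenized for t in set(doc))

# smoothed inverse document frequency: rarer tokens get larger weights
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

def tfidf_vector(tokens):
    # hypothetical helper: raw count times idf, then L2-normalized
    counts = Counter(tokens)
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(w * w for w in raw))
    return [w / norm if norm else 0.0 for w in raw]

vectors = [tfidf_vector(doc) for doc in tokenized]
for doc, vec in zip(docs, vectors):
    print(doc, [round(w, 3) for w in vec])
```

In the second document, "bad" appears in only one document while "day" appears in two, so "bad" receives the larger weight even though both tokens occur once.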

Two further steps will be used to filter and simplify our corpus dictionary: stopword removal and lemmatization. Both techniques lose some contextual information, hence their absence from the prior section, but this is a necessary tradeoff to keep our vector space small.
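The effect on vocabulary size can be sketched with toy stand-ins. The stopword set, the lemma lookup, and the helper `toy_filter` below are all assumptions made up for this example; the notebook itself relies on NLTK's stopword list and `WordNetLemmatizer`, which need downloaded data.

```python
# toy stand-ins (assumptions) for NLTK's stopword list and lemmatizer
toy_stop_words = {'the', 'is', 'are', 'on', 'a'}
toy_lemmas = {'cats': 'cat', 'mats': 'mat'}

def toy_filter(text):
    # drop stopwords and single characters, then map tokens to their lemma
    kept = [toy_lemmas.get(t, t) for t in text.split()
            if len(t) > 1 and t not in toy_stop_words]
    return ' '.join(kept)

corpus = ['the cats are on the mats', 'a cat is on a mat']
vocab_before = {t for doc in corpus for t in doc.split()}
vocab_after = {t for doc in corpus for t in toy_filter(doc).split()}

print(len(vocab_before), len(vocab_after))
# prints: 9 2
```

Both sentences collapse to "cat mat", which shows the tradeoff: the vocabulary shrinks sharply, but distinctions such as singular versus plural are gone.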

In [13]:
# downloading stopwords and lemmas from nltk
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
nltk.download('wordnet')
lemma = WordNetLemmatizer()

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/parkernisbet/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /home/parkernisbet/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [14]:
# additional preprocessing for machine learning methods
def sl_preprocess(text):
    '''
    Returns a preprocessed version of the passed string useful for single layer methods.
    
        Parameters:
            text (str) : passed string
        
        Returns:
            mod_text (str) : preprocessed string
    '''
    
    # split string into tokens, removing stopwords and single characters, lemmatize
    mod_text = ' '.join(lemma.lemmatize(i) for i in text.split() if len(i) > 1 and i not in stop_words)

    return mod_text

In [15]:
# applying function to dataframe
df_data.loc[:, 'text'] = df_data['text'].apply(sl_preprocess)
df_data.head()

| | target | text |
|---|---|---|
| 0 | 0 | user url aww bummer shoulda got david carr thi... |
| 1 | 0 | upset update facebook texting might cry result... |
| 2 | 0 | user dived many time ball managed save 50 rest... |
| 3 | 0 | whole body feel itchy like fire |
| 4 | 0 | user behaving mad see |


In [18]:
# saving to pickle
df_data.to_pickle('df_data_sl.pkl')

In [19]:
# test train split
sss = StratifiedShuffleSplit(n_splits = 1, test_size = .2, random_state = 3)
# n_splits = 1, so take the single (train, test) index pair from the generator
train_ind, test_ind = next(sss.split(df_data.text, df_data.target))
print(f'Train_ind shape: {train_ind.shape}\nTest_ind shape: {test_ind.shape}')


Train_ind shape: (1280000,)
Test_ind shape: (320000,)


In [23]:
# tfidf vectorizing
vectorizer = TfidfVectorizer(analyzer = 'word', max_features = 300000)
X_train = vectorizer.fit_transform(df_data.text[train_ind])
y_train = df_data.target[train_ind]
# transform only: reuse the vocabulary and idf weights fitted on the training split
X_test = vectorizer.transform(df_data.text[test_ind])
y_test = df_data.target[test_ind]

### Model Evaluation