# Naive Bayes - Trained on the Sentiment140 data set
In this section we'll attempt to build a custom Naive Bayes model for text classification. Naive Bayes classifiers estimate conditional probabilities from pre-tagged data, so we will use a pre-tagged data set containing about 1.6 million tweets tagged for negative, positive, and neutral sentiment. We'll try training and running a multinomial Naive Bayes model from sklearn. To "vectorize" the text, we must first apply a TF-IDF (term frequency/inverse document frequency) transformation, which represents the text as numeric values.

https://www.kaggle.com/datasets/kazanova/sentiment140?resource=download
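
As a rough illustration of the approach described above, here is a minimal sketch of TF-IDF vectorization followed by a multinomial Naive Bayes classifier. The toy corpus and the names `toy_texts`, `toy_labels`, and `toy_model` are invented for this example and are not part of the Sentiment140 data; the labels follow the data set's convention of 0 for negative and 4 for positive.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy corpus (invented for illustration; not taken from Sentiment140)
toy_texts = [
    "i love this movie",
    "what a great day",
    "i hate this weather",
    "this is a terrible movie",
]
toy_labels = [4, 4, 0, 0]  # 0 = negative, 4 = positive

toy_model = Pipeline([
    ("vectorizer", TfidfVectorizer()),  # text -> sparse TF-IDF matrix
    ("classifier", MultinomialNB()),    # per-class term likelihoods
])
toy_model.fit(toy_texts, toy_labels)

# One row per input text, one column per class in toy_model.classes_
toy_probabilities = toy_model.predict_proba(["great movie"])
print(toy_model.classes_, toy_probabilities)
```

With only four training sentences the probabilities mean little; the point is just the shape of the workflow: raw strings go in, TF-IDF turns them into numeric features, and the classifier returns one probability per class.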

In [20]:
import os
import pandas as pd
from comment_scraper import get_sql_table
from matplotlib import pyplot as plt
plt.style.use('seaborn-notebook')

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import FunctionTransformer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn.pipeline import Pipeline
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import TweetTokenizer # tweet tokenizer is good for handling common words/symbols/emoticons found in social media data
from nltk.corpus import stopwords
from nltk.tag import pos_tag

In [21]:
DATA_DIR = 'C:\\Users\\keatu\\Regis_archive\\practicum2_data\\'
dbname = os.path.join(DATA_DIR, "Youtube_Data_msnbc.db")

In [12]:
colnames = ["target","id","date","flag","user","text"]
s140_test = pd.read_csv(os.path.join(DATA_DIR,"resources","s140_test.csv"), names = colnames)
s140_train = pd.read_csv(os.path.join(DATA_DIR,"resources","s140_train.csv"), names = colnames, encoding='latin-1')

In [34]:
colnames = ["target","id","date","flag","user","text"]
s140df = pd.read_csv(os.path.join(DATA_DIR,"resources","s140_train.csv"), names = colnames, encoding='latin-1')
sample = pd.concat([s140df.head(100000),s140df.tail(100000)])
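
Taking the head and the tail only gives a balanced sample if the file is ordered by `target`, which is an assumption worth checking before training. The sketch below shows one way to check, using a small invented frame (`toy_df`) in place of `s140df`; running the same `value_counts()` call on `sample['target']` would show the real class counts.

```python
import pandas as pd

# Toy stand-in for s140df, assumed to be ordered by target
toy_df = pd.DataFrame({
    "target": [0] * 5 + [4] * 5,
    "text": [f"tweet {i}" for i in range(10)],
})

# Same head/tail pattern as above, with 2 rows from each end
toy_sample = pd.concat([toy_df.head(2), toy_df.tail(2)])

# Count how many rows of each class ended up in the sample
class_counts = toy_sample["target"].value_counts().sort_index()
print(class_counts)
```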

In [35]:
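# Pre-processing function applied to a collection of text strings
def clean_data(text):
    """Tokenize, POS-tag, and lemmatize each string in `text`.

    `text` must be an iterable of strings (e.g. a list or pandas Series);
    returns a list with one cleaned, space-joined string per input.
    """
    #stopwords = stopwords.words('english')
    lemmatizer = WordNetLemmatizer()
    tokenizer = TweetTokenizer()
    text_vector = []
    for each_text in text:
        lemmatized_tokens = []
        tokens = tokenizer.tokenize(each_text.lower())
        pos_tags = pos_tag(tokens)
        for each_token, tag in pos_tags:
            # Map Penn Treebank tags to the WordNet POS codes the lemmatizer expects
            if tag.startswith('NN'):
                pos = 'n'  # noun
            elif tag.startswith('VB'):
                pos = 'v'  # verb
            else:
                pos = 'a'  # everything else is treated as an adjective
            lemmatized_token = lemmatizer.lemmatize(each_token, pos)
            lemmatized_tokens.append(lemmatized_token)
        text_vector.append(' '.join(lemmatized_tokens))
    return text_vector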

In [31]:
clean_data("I love movies")

['i', '', 'l', 'o', 'v', 'e', '', 'm', 'o', 'v', 'i', 'e', 's']
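
The character-by-character output above is what happens when a bare string is passed in: the `for each_text in text` loop iterates over the characters of the string rather than over documents, and the spaces come back as empty strings because they produce no tokens. A minimal sketch of one way to guard against this is shown below; `as_text_list` is a hypothetical helper, not part of the original notebook.

```python
def as_text_list(text):
    """Hypothetical helper: wrap a bare string so it is treated as one document."""
    if isinstance(text, str):
        return [text]
    return list(text)

# Iterating over a bare string yields characters, not documents
characters = [each_text for each_text in "I love movies"]

# Wrapping the string first yields a single document
documents = [each_text for each_text in as_text_list("I love movies")]

print(characters)
print(documents)
```

Calling `clean_data(["I love movies"])`, or `clean_data(as_text_list("I love movies"))`, would process the sentence as a single document.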

In [36]:
estimators=[('cleaner', FunctionTransformer(clean_data)), 
            ('vectorizer', TfidfVectorizer(max_features=100000, ngram_range=(1, 2)))]
preprocessing_pipeline=Pipeline(estimators)

In [37]:
X=sample['text']
y=sample['target']
X_train, X_test, y_train, y_test=train_test_split(X, y)
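
The call above relies on the defaults of `train_test_split`, which hold out 25% of the rows and shuffle differently on every run. If a repeatable, class-balanced split is wanted, the `stratify` and `random_state` parameters can be set explicitly. The sketch below uses invented toy data (`toy_X`, `toy_y`) and an arbitrary seed.

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for sample['text'] and sample['target']
toy_X = [f"tweet {i}" for i in range(20)]
toy_y = [0] * 10 + [4] * 10

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y,
    test_size=0.25,    # the default, written out explicitly
    stratify=toy_y,    # keep class proportions similar in both splits
    random_state=42,   # arbitrary seed so the split is repeatable
)
print(len(X_tr), len(X_te))
```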

In [38]:
X_train

48600      I want to go to Comicon in sd, but none of my ...
1599298    english exam went okay        revising for fre...
21753      my CALVIN KLEIN specs have decided to start fa...
6818       i seem to be lossing loads of followers at pre...
1517729                                  @AnnaaaC Good girl 
                                 ...                        
1521295    Guess who's awake?!?! -_-, my sleep pattern su...
1552832    @methodusti  I read one of your tweets as bein...
99328      I want to take a walk  I hope someone wants to...
827        Work laptop is officially dead .. Not happy at...
98529      baba has decided that mixing paint and coffee ...
Name: text, Length: 150000, dtype: object

In [39]:
# Fit and transform the pipeline
X_train_transformed=preprocessing_pipeline.fit_transform(X_train)

ValueError: could not convert string to float: 'I want to go to Comicon in sd, but none of my friends will go.. never gone, if I DO go, i dont know what to dress up as..  GOTTA dress up!'
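
This `ValueError` is consistent with `FunctionTransformer` validating its input as a numeric array before calling `clean_data`, which was the default behavior in older scikit-learn releases (`validate=True`). That diagnosis is an assumption based on the error message, not something confirmed in this notebook. A minimal sketch of a possible fix is to pass `validate=False` so the raw strings reach the function unchanged. To keep the example self-contained, `toy_clean` is a hypothetical stand-in for `clean_data` that only lower-cases the text.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

def toy_clean(texts):
    """Hypothetical stand-in for clean_data: lower-case each document."""
    return [each_text.lower() for each_text in texts]

toy_pipeline = Pipeline([
    # validate=False skips the numeric array check on the raw strings
    ("cleaner", FunctionTransformer(toy_clean, validate=False)),
    ("vectorizer", TfidfVectorizer(ngram_range=(1, 2))),
])

# Returns a sparse matrix with one row per input document
toy_matrix = toy_pipeline.fit_transform(["I love movies", "I hate Mondays"])
print(toy_matrix.shape)
```

Applied to this notebook, the equivalent change would be `FunctionTransformer(clean_data, validate=False)` in the `estimators` list.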