#### [Natural Language Processing with Disaster Tweets](https://www.kaggle.com/competitions/nlp-getting-started/data): Basic Naive Bayes Model

Using this [data exploration notebook](https://www.kaggle.com/mitchvanlee/basic-data-exploration/edit), I found that:

* The keyword column is populated with a non-null value for most rows
* Certain keywords are highly predictive of whether or not there is a true disaster
* The text column is non-null for all rows
* Tweets are short
* Tweets tend not to repeat words
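
The findings above can be checked with a few pandas one-liners. The following is a minimal sketch, not the code from the exploration notebook: `toy_df` and its rows are made up for illustration, and in practice the same expressions would run against `train.csv`.

```python
import pandas as pd

# Toy rows invented for illustration; the real check would use train.csv.
toy_df = pd.DataFrame({
    "keyword": ["ablaze", "ablaze", "ablaze", None, "wreck", "wreck"],
    "text": [
        "Wholesale markets ablaze",
        "Building set ablaze downtown",
        "My mixtape is ablaze",
        "Just had lunch",
        "I am a wreck today",
        "What a wreck of a game",
    ],
    "target": [1, 1, 0, 0, 0, 0],
})

# Share of rows with a non-null value in each column.
non_null_share = toy_df[["keyword", "text"]].notna().mean()

# Disaster rate per keyword: values near 0 or 1 mark predictive keywords.
# Rows with a missing keyword are dropped by groupby.
keyword_rate = toy_df.groupby("keyword")["target"].agg(["mean", "count"])

# Tweet length in whitespace-separated tokens, and the share of
# tokens in each tweet that are unique (1.0 means no repeated words).
tokens = toy_df["text"].str.lower().str.split()
token_count = tokens.str.len()
unique_share = tokens.apply(lambda t: len(set(t)) / len(t))

print(non_null_share)
print(keyword_rate)
print(token_count.describe())
print(unique_share.mean())
```

The per-keyword `count` column matters as much as the `mean`: a keyword that appears only a handful of times can look perfectly predictive by chance.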


In this notebook, I am going to create a Naive Bayes classification model. As I develop more complicated models, this will provide a good benchmark for comparison.

#### Import packages

In [1]:
import numpy as np
import pandas as pd

from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

#### Load Data

In [2]:
train_df = pd.read_csv("/kaggle/input/nlp-getting-started/train.csv")
test_df = pd.read_csv("/kaggle/input/nlp-getting-started/test.csv")

train_df.sort_values("keyword").head(n=5)

| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 31 | 48 | ablaze | Birmingham | @bbcmtd Wholesale Markets ablaze http://t.co/l... | 1 |
| 51 | 74 | ablaze | India | Man wife get six years jail for setting ablaze... | 1 |
| 52 | 76 | ablaze | Barbados | SANTA CRUZ ÛÓ Head of the St Elizabeth Police... | 0 |
| 53 | 77 | ablaze | Anaheim | Police: Arsonist Deliberately Set Black Church... | 1 |
| 54 | 78 | ablaze | Abuja | Noches El-Bestia '@Alexis_Sanchez: happy to se... | 0 |


#### Helper Functions

In [3]:
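def prep_features(df):
    """Return a dense matrix of word counts for "text" plus one-hot "keyword" columns."""

    # vectorize "text" as bag-of-words counts, dropping English stop words
    count_vectorizer = CountVectorizer(stop_words="english")
    text_features = count_vectorizer.fit_transform(df["text"]).toarray()

    # one-hot-encode "keyword"; missing keywords become their own "" category.
    # fillna returns a new frame, so the caller's DataFrame is not modified.
    keyword = df[["keyword"]].fillna("")

    one_hot_encoder = OneHotEncoder()
    one_hot_keyword = one_hot_encoder.fit_transform(keyword).toarray()

    # columns: text counts first, then keyword indicators
    X = np.concatenate([text_features, one_hot_keyword], axis=1)

    return X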
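
The helper above fits the vectorizer and encoder on whatever rows it is given and converts the result to dense arrays. One possible alternative, sketched here under assumptions, is to fit the transformers on the training rows only and keep the output sparse, which `MultinomialNB` accepts directly. The names `toy_train`, `toy_test`, `fill_keyword`, and `featurizer` are hypothetical, and the rows are made up for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder

# Toy rows invented for illustration only.
toy_train = pd.DataFrame({
    "keyword": ["ablaze", None, "flood"],
    "text": [
        "Markets ablaze tonight",
        "Lovely sunny morning",
        "Flood warning issued downtown",
    ],
})
toy_test = pd.DataFrame({
    "keyword": ["earthquake"],  # keyword never seen during fitting
    "text": ["Flood waters rising downtown"],
})


def fill_keyword(df):
    # Return a copy with missing keywords replaced by "".
    return df.assign(keyword=df["keyword"].fillna(""))


featurizer = ColumnTransformer(
    [
        # A string column name (not a list) hands CountVectorizer
        # a 1-D series of documents, which is what it expects.
        ("text", CountVectorizer(stop_words="english"), "text"),
        # handle_unknown="ignore" encodes unseen keywords as all
        # zeros instead of raising an error at transform time.
        ("keyword", OneHotEncoder(handle_unknown="ignore"), ["keyword"]),
    ],
    sparse_threshold=1.0,  # keep the stacked output sparse
)

# Fit on training rows only, then reuse the fitted transformers.
X_toy_train = featurizer.fit_transform(fill_keyword(toy_train))
X_toy_test = featurizer.transform(fill_keyword(toy_test))

print(X_toy_train.shape, X_toy_test.shape)
```

Both matrices end up with the same columns because the test rows go through the already-fitted transformers. Words and keywords that appear only in the test rows are simply ignored.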

#### Preprocess Data

In [4]:
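# Combine train and test so both share one vocabulary and one set of keyword columns
all_df = pd.concat([train_df, test_df], ignore_index=True)

X = prep_features(all_df)

# Rows from train.csv have a target; rows from test.csv do not
is_train = all_df["target"].notna().to_numpy()

# Train Values
X_train = X[is_train, :]
y_train = all_df.loc[is_train, "target"].to_numpy()

# Test Values
X_test = X[~is_train, :]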

#### Train Model

In [5]:
nb = MultinomialNB()
nb.fit(X_train, y_train)
nb.score(X_train, y_train)

0.9096282674372783
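
The score above is measured on the same rows the model was fit on, so it tends to be optimistic. A common way to get a more honest estimate without a submission is k-fold cross-validation. The following is a minimal sketch on text features only: the tweets and labels in `toy_texts` and `toy_labels` are made up for illustration, and `model` is a hypothetical pipeline, not the exact setup used in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy tweets and labels invented for illustration only.
toy_texts = [
    "Forest fire spreading near the town",
    "Earthquake damages buildings downtown",
    "Flood waters rising after the storm",
    "Wildfire evacuation ordered for residents",
    "Explosion reported at the chemical plant",
    "Storm leaves thousands without power",
    "I love this sunny morning",
    "My new mixtape is fire",
    "Great game last night",
    "Coffee with friends this afternoon",
    "This traffic is a total disaster lol",
    "Watching movies all weekend",
]
toy_labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Putting the vectorizer inside the pipeline means each fold refits
# it on that fold's training rows only.
model = make_pipeline(CountVectorizer(stop_words="english"), MultinomialNB())

# Accuracy on each of 3 held-out folds (stratified by label).
cv_scores = cross_val_score(
    model, toy_texts, toy_labels, cv=3, scoring="accuracy"
)

# Accuracy on the rows used for fitting, for comparison.
train_score = model.fit(toy_texts, toy_labels).score(toy_texts, toy_labels)

print(cv_scores.mean(), train_score)
```

A large gap between the training score and the mean cross-validated score is a sign of overfitting that can be seen before submitting.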

#### Generate Predictions

The score on the test data was around 78%, well below the training accuracy of about 91% shown above. This gap indicates that the model is overfitting the training data.

In [6]:
y_test_pred = nb.predict(X_test)

preds_df = pd.DataFrame()
preds_df["id"] = test_df["id"]
preds_df["target"] = y_test_pred.astype("int")

submission_path = "/kaggle/working/submission.csv"

preds_df.to_csv(submission_path, header=True, index=False)