<a href="https://colab.research.google.com/github/mxinburritos/Disaster-Tweet-Sorting/blob/master/Live_Prediction_Final_Code.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Natural Language Processing: Twitter Classification
The goal is to detect whether a tweet is relevant or irrelevant to a disaster. People often use disaster-related words figuratively in tweets without referring to a real disaster.

DATASET: http://bit.ly/Twitter_Dataset

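Rather than use a neural network, we used a Naive Bayes algorithm, specifically Bernoulli Naive Bayes: a probabilistic classifier that treats each feature as a binary value (word present or absent) and assumes the features are conditionally independent given the class.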
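
To make the Bernoulli model concrete, here is a minimal from-scratch sketch of the decision rule. It is not the notebook's pipeline (which uses scikit-learn's `BernoulliNB`); the toy tweets, the labels, and the helper names `train_bernoulli_nb` and `predict` are assumptions made up for illustration.

```python
import math

def train_bernoulli_nb(docs, labels, alpha=1.0):
    """Estimate class priors and per-word presence probabilities.

    docs is a list of word sets, labels a list of class ids,
    alpha the Laplace smoothing constant (hypothetical helper).
    """
    vocab = sorted(set().union(*docs))
    classes = sorted(set(labels))
    model = {"vocab": vocab, "classes": classes, "log_prior": {}, "p_word": {}}
    for c in classes:
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        n = len(class_docs)
        model["log_prior"][c] = math.log(n / len(docs))
        # Smoothed probability that each vocabulary word appears in a tweet of class c
        model["p_word"][c] = {
            w: (sum(w in d for d in class_docs) + alpha) / (n + 2 * alpha)
            for w in vocab
        }
    return model

def predict(model, doc):
    """Return the class with the highest log-posterior for a set of words."""
    scores = {}
    for c in model["classes"]:
        score = model["log_prior"][c]
        for w in model["vocab"]:
            p = model["p_word"][c][w]
            # Present words contribute log p, absent words contribute log(1 - p)
            score += math.log(p) if w in doc else math.log(1.0 - p)
        scores[c] = score
    return max(scores, key=scores.get)

# Toy data, invented for this sketch: 1 = relevant, 0 = irrelevant
docs = [
    {"forest", "fire", "evacuation"},
    {"earthquake", "damage", "rescue"},
    {"mixtape", "fire"},
    {"wet", "like", "tsunami"},
]
labels = [1, 1, 0, 0]

model = train_bernoulli_nb(docs, labels)
prediction = predict(model, {"earthquake", "rescue", "teams"})
print(prediction)
```

Unlike a multinomial model, the Bernoulli variant also scores the words that are absent from a tweet, which is why the inner loop runs over the whole vocabulary rather than only over the words in `doc`.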

# Twitter Training Dataset


1.   Number of Instances: 7416 instances
2.   Number of Attributes: 2 plus the class attribute
3.   Attribute Information: Attribute Domain

      1. Index of tweet
      2. Class: 0 for irrelevant, 1 for relevant
      3. Text: Text of tweet
      
4. Missing Attribute Value: N/A
5. Class Distribution:

  Instances belonging to class 0: 4,305 (58.1%)
  
  Instances belonging to class 1: 3,111 (41.9%)
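
The class balance gives a useful reference point for the accuracy reported later. As a small sketch using only the counts listed above, the following recomputes the percentages and the accuracy that a model always predicting the majority class would reach on the training set; the variable names are illustrative.

```python
# Class counts as reported in the dataset summary above
class_counts = {0: 4305, 1: 3111}

total = sum(class_counts.values())
percentages = {label: round(100 * n / total, 1) for label, n in class_counts.items()}

# Accuracy of a trivial classifier that always predicts the most common class
majority_baseline = max(class_counts.values()) / total

print(total, percentages, round(majority_baseline, 3))
```

Any trained classifier should beat this baseline by a clear margin. The test set may have a different class balance, so the figure is only a rough guide.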






In [0]:
# Imports
            
import pandas as pd
import numpy as np
import nltk
import sklearn
import warnings
import matplotlib.pyplot as plt

%matplotlib inline

nltk.download("stopwords")
nltk.download('punkt')
nltk.download('wordnet')

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.naive_bayes import BernoulliNB
from nltk.corpus import stopwords
from nltk import word_tokenize


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [0]:
# Data Collection

df = pd.read_csv("train.csv")
df2 = pd.read_csv("test.csv")

# Pull the tweet text into Python lists and the class labels into NumPy arrays

tweets = df['text'].tolist()
labels = df['class_label'].to_numpy()
testTweets = df2['text'].tolist()
testLabels = df2['class_label'].to_numpy()

print(df.head())
print(df.tail())

   Unnamed: 0  class_label                                               text
0        8525            0                       she keep it wet like tsunami
1        5008            1  when ur friend and u are talking about forest ...
2        8803            0  but i will be uploading these videos asap so y...
3        6795            0              i'm interested   is it through yahoo?
4        4603            0                   holy fuck someone set me on fire
      Unnamed: 0  ...                                               text
7411        7850  ...  and if your best evidence is the word of a guy...
7412        3611  ...  i'm gonna drown myself in leftover chilis wish...
7413        5969  ...                  i look like a mass murderer in it
7414        5435  ...  who's that shadow holdin me hostage i've been ...
7415        7618  ...  i liked a  video  boeing 737 takeoff in snowst...

[5 rows x 3 columns]


# Preprocessing
We define a lemmatizing preprocessing function and a list of English stopwords to remove. A TF-IDF vectorizer, configured with this preprocessing function and the stopword list, converts the tweets into a numeric matrix. Each row represents one tweet and each column represents one term from the learned vocabulary.
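
As a rough illustration of what the vectorizer produces, the sketch below runs `TfidfVectorizer` on three made-up tweets with a tiny hand-written stopword list. The toy data and the `toy_` variable names are assumptions for this example and are not part of the notebook's pipeline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up tweets, for illustration only
toy_tweets = [
    "forest fire near the town",
    "this mixtape is fire",
    "earthquake damage near the coast",
]

# A tiny hand-written stopword list keeps the example easy to follow
toy_vectorizer = TfidfVectorizer(stop_words=["the", "is", "this"])
toy_matrix = toy_vectorizer.fit_transform(toy_tweets)

# Columns follow the alphabetically sorted vocabulary learned from the tweets
vocabulary = sorted(toy_vectorizer.vocabulary_)
print(toy_matrix.shape)  # one row per tweet, one column per vocabulary term
print(vocabulary)

# Words never seen during fitting ("flood") are silently ignored at transform time
new_row = toy_vectorizer.transform(["flood near town"])
print(new_row.nnz)  # number of non-zero entries in the new row
```

The same behavior applies to the live predictor at the end of the notebook: any word in a new tweet that was not in the training vocabulary contributes nothing to the prediction.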

In [0]:
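# TF-IDF Vectorizer
def preprocess(s):
  # WordNetLemmatizer.lemmatize treats its argument as a single word,
  # so a whole tweet passed in here typically comes back unchanged.
  lemmatizer = nltk.WordNetLemmatizer()
  return lemmatizer.lemmatize(s)

# A list is the documented type for stop_words (some scikit-learn versions reject a set)
stop = stopwords.words('english')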


"""
# Stopwords part

def not_stopword(s):
  s = s.strip()
  v = stopwords.words('english')
  result = ""
  words = nltk.word_tokenize(s)
  for word in words:
    if word not in v:
      result += word + " "
  return result.strip()

i=0
for token in tokens:
  token = preprocess(token)
  
finalsentence = ' '.join(tweet.split())


print(finalsentence)
"""


vectorizer = TfidfVectorizer(stop_words=stop, analyzer='word', max_features=20000, dtype=np.float32, preprocessor=preprocess)

data = vectorizer.fit_transform(tweets).toarray()
testData = vectorizer.transform(testTweets).toarray()
print(type(data), data)




<class 'numpy.ndarray'> [[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


In [0]:
#X_train, X_test, Y_train, Y_test = train_test_split(data, labels, test_size=0.3, random_state=10)
X_train = data
Y_train = labels
X_test = testData
Y_test = testLabels

print(len(X_test[0]))
print(Y_test)


14839
[1 1 1 ... 0 0 0]


In [0]:

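# Train a Bernoulli Naive Bayes classifier with scikit-learn's default hyperparameters
clf = BernoulliNB()
clf.fit(X_train, Y_train)
# This expression builds a separate, unfitted model and discards it; it does not change clf
BernoulliNB(alpha=2.0, binarize=0.0, class_prior=None, fit_prior=True)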

predictions = clf.predict(X_test)

print(predictions)

print(clf.score(X_test, Y_test))

print(recall_score(Y_test, predictions, average='macro'))
print(precision_score(Y_test, predictions, average='macro'))

[1 1 1 ... 1 0 0]
0.8058252427184466
0.7808352143656005
0.8184909531575053
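
The last two numbers printed above are the macro-averaged recall and precision: each metric is computed per class and the per-class values are then averaged with equal weight. The following sketch reproduces that calculation by hand on a small invented label set; the function name `macro_precision_recall` and the toy labels are assumptions, not output from the notebook.

```python
def macro_precision_recall(y_true, y_pred):
    """Compute macro-averaged precision and recall without scikit-learn."""
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls = [], []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == c and p == c for t, p in pairs)  # correctly predicted as c
        fp = sum(t != c and p == c for t, p in pairs)  # wrongly predicted as c
        fn = sum(t == c and p != c for t, p in pairs)  # missed instances of c
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    # Every class counts equally, regardless of how many instances it has
    return sum(precisions) / len(classes), sum(recalls) / len(classes)

# Invented labels, for illustration only
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]

macro_precision, macro_recall = macro_precision_recall(y_true, y_pred)
print(macro_precision, macro_recall)
```

Because both classes count equally, macro averaging penalizes a model that does well on the majority class but poorly on the minority class.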


In [0]:
# Live predictor: classify tweets typed by the user; submit an empty line to stop
while True:
  liveTester = input("Enter a tweet to see if it's relevant to a disaster: \n")

  if not liveTester.strip():
    break

  # Vectorize the tweet with the vocabulary learned during training
  liveData = vectorizer.transform([liveTester]).toarray()

  # predict returns an array with one label per input row
  ans = clf.predict(liveData)[0]

  if ans == 0:
    print("\nTweet not relevant to a disaster\n")
  else:
    print("\nTweet relevant to a disaster\n")



Enter a tweet to see if it's relevant to a disaster: 
manage large analytics data sets in moden efficient fashion with #ONTAPAI and #Omnisci

Tweet not relevant to a disaster

