In this project, we're going to build an accident detector for tweets using the multinomial Naive Bayes algorithm. Our goal is to write a program that classifies new tweets with an accuracy greater than 50%.

To train the algorithm, we'll use a dataset of 7,613 tweets that have already been classified by humans.

In [27]:
import pandas as pd

tweet = pd.read_csv('accident.csv')

tweet.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


text - the text of the tweet

In [28]:
tweet['keyword'].value_counts(dropna=False)

NaN                      61
fatalities               45
deluge                   42
armageddon               42
sinking                  41
                         ..
forest%20fire            19
epicentre                12
threat                   11
inundation               10
radiation%20emergency     9
Name: keyword, Length: 222, dtype: int64

In [29]:
tweet['keyword'] = tweet['keyword'].fillna('None')

In [30]:
tweet['location'].value_counts(dropna=False)

NaN                             2533
USA                              104
New York                          71
United States                     50
London                            45
                                ... 
Surulere Lagos,Home Of Swagg       1
MontrÌ©al, QuÌ©bec                 1
Montreal                           1
ÌÏT: 6.4682,3.18287                1
Lincoln                            1
Name: location, Length: 3342, dtype: int64

We will drop the 'location', 'id' and 'keyword' columns because they aren't useful for our Naive Bayes algorithm.

In [31]:
tweet = tweet.drop(['location','id','keyword'],axis=1)

In [32]:
tweet.head()

Unnamed: 0,text,target
0,Our Deeds are the Reason of this #earthquake M...,1
1,Forest fire near La Ronge Sask. Canada,1
2,All residents asked to 'shelter in place' are ...,1
3,"13,000 people receive #wildfires evacuation or...",1
4,Just got sent this photo from Ruby #Alaska as ...,1


target - denotes whether a tweet is about a real disaster (1) or not (0)

In [33]:
tweet['target'] = tweet['target'].astype('string')

In [34]:
status_replace ={
     'target':{
         '1' : 'accident',
         '0' : 'non_accident',
     }
}
tweet = tweet.replace(status_replace)

In [35]:
tweet['target'].value_counts(normalize=True)

non_accident    0.57034
accident        0.42966
Name: target, dtype: Float64

We see that about 57% of the tweets are non_accident and the remaining 43% are accident.

We are now going to split our dataset into a training set and a test set, using 80% of the tweets for training and the remaining 20% for testing.

In [36]:
# Randomize the dataset
data_randomized = tweet.sample(frac=1, random_state=1)

# Calculate index for split
training_test_index = round(len(data_randomized) * 0.8)

# Training/Test split
training_set = data_randomized[:training_test_index].reset_index(drop=True)
test_set = data_randomized[training_test_index:].reset_index(drop=True)

print(training_set.shape)
print(test_set.shape)

(6090, 2)
(1523, 2)


Data Cleaning 

In [37]:
training_set['text'] = training_set['text'].str.lower()
training_set.head()

Unnamed: 0,text,target
0,goulburn man henry van bilsen missing: emergen...,accident
1,the things we fear most in organizations--fluc...,non_accident
2,@tsunami_esh ?? hey esh,non_accident
3,@potus you until you drown by water entering t...,non_accident
4,crawling in my skin\nthese wounds they will no...,accident


In [38]:
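# Drop duplicate tweets and reset the index so that later steps,
# which rely on row position, line up with the index labels
training_set = training_set.drop_duplicates(subset=['text']).reset_index(drop=True)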

Creating the Vocabulary

In [39]:
training_set['text'] = training_set['text'].str.split()

vocabulary = []
for sms in training_set['text']:
    for word in sms:
        vocabulary.append(word)
        
vocabulary = list(set(vocabulary))

In [40]:
len(vocabulary)

23731
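
Whitespace splitting keeps punctuation attached to tokens, which is why entries such as mistreated. and happening! show up among the word-count columns in this notebook. The classification functions in this notebook strip non-word characters with re.sub before splitting, so the two steps can produce different tokens for the same text. A minimal sketch of the difference, using a made-up sentence and a hypothetical helper named tokenize that is not part of the original notebook:

```python
import re

def tokenize(text):
    """Hypothetical helper: lowercase, replace non-word characters with spaces, split."""
    return re.sub(r'\W', ' ', text.lower()).split()

sample = "Forest fire! Evacuate now."

# Tokens from lowercasing and whitespace splitting only (training-side cleaning)
split_only = sample.lower().split()

# Tokens after also stripping non-word characters (classification-side cleaning)
stripped = tokenize(sample)

print(split_only)   # ['forest', 'fire!', 'evacuate', 'now.']
print(stripped)     # ['forest', 'fire', 'evacuate', 'now']

# Only tokens present in both lists would be matched at prediction time
print(set(split_only) & set(stripped))
```

Applying one shared helper to both the training text and new messages would keep the two token sets consistent.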

The Final Training DataSet

In [41]:
word_counts_per_sms = {unique_word: [0] * len(training_set['text']) for unique_word in vocabulary}

for index, sms in enumerate(training_set['text']):
    for word in sms:
        word_counts_per_sms[word][index] += 1

In [42]:
word_counts = pd.DataFrame(word_counts_per_sms)
word_counts.head()

Unnamed: 0,@ladyfleur,grew,mistreated.,happening!,ten-4.,#japan,legionnaire's,150+,body-bagging,http://t.co/gyzpisbi1u,...,steamship,clerical,http://t.co/vg7jnah0iw,trail,û÷exceptionalûª,melanie,http://t.co/te2yerugsi,gillibrand,fatally,http://t.co/jmkywhv7mp
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [43]:
training_set_clean = pd.concat([training_set, word_counts], axis=1)
training_set_clean.head()

Unnamed: 0,text,target,@ladyfleur,grew,mistreated.,happening!,ten-4.,#japan,legionnaire's,150+,...,steamship,clerical,http://t.co/vg7jnah0iw,trail,û÷exceptionalûª,melanie,http://t.co/te2yerugsi,gillibrand,fatally,http://t.co/jmkywhv7mp
0,"[goulburn, man, henry, van, bilsen, missing:, ...",accident,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,"[the, things, we, fear, most, in, organization...",non_accident,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"[@tsunami_esh, ??, hey, esh]",non_accident,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,"[@potus, you, until, you, drown, by, water, en...",non_accident,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"[crawling, in, my, skin, these, wounds, they, ...",accident,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
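
A note on row alignment: pd.concat with axis=1 matches rows by index label, not by position. A frame that has gone through drop_duplicates keeps its original labels (with gaps), while a DataFrame built from a dictionary of lists gets a fresh 0..n-1 index. When the two indexes differ, the result contains NaN rows and integer counts are upcast to floats. A minimal sketch with made-up data (the frames left and right are illustrative only):

```python
import pandas as pd

# Toy frame with a duplicate row; dropping it leaves index labels 0 and 2
left = pd.DataFrame({'text': ['a', 'a', 'b']}).drop_duplicates(subset=['text'])

# Counts built separately get a fresh index with labels 0 and 1
right = pd.DataFrame({'count': [10, 20]})

# Misaligned: rows are matched by label, producing NaN where labels differ
misaligned = pd.concat([left, right], axis=1)
print(misaligned)

# Aligned: resetting the index first makes labels equal to positions
aligned = pd.concat([left.reset_index(drop=True), right], axis=1)
print(aligned)
```

Resetting the index right after dropping duplicates is one simple way to avoid the mismatch.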


In [44]:
training_set_clean = training_set_clean.loc[:,~training_set_clean.columns.duplicated()]

Below, we'll use our training set to calculate:
P(Accident) and P(Non_accident), N_Accident, N_Non_Accident, N_Vocabulary
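
For reference, these quantities feed the usual multinomial Naive Bayes formulas with Laplace smoothing. The notation below is an assumption chosen to mirror the variable names in the code: alpha is the smoothing constant and N_{w_i | Accident} is the number of times word w_i occurs in accident tweets.

```latex
P(\text{Accident} \mid w_1, \dots, w_n) \propto P(\text{Accident}) \prod_{i=1}^{n} P(w_i \mid \text{Accident})

P(w_i \mid \text{Accident}) = \frac{N_{w_i \mid \text{Accident}} + \alpha}{N_{\text{Accident}} + \alpha \cdot N_{\text{Vocabulary}}}
```

The formulas for Non_accident are analogous, with N_Non_Accident in the denominator.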

In [45]:
# Isolating accident and non_accident tweets first
accident_tweet = training_set_clean.copy()[training_set_clean['target'] == 'accident']
non_accident_tweet = training_set_clean.copy()[training_set_clean['target'] == 'non_accident']

# P(Accident) and P(Non_accident): class priors, used later when classifying
p_accident = len(accident_tweet) / len(training_set_clean)
p_non_accident = len(non_accident_tweet) / len(training_set_clean)

# N_Accident
n_words_per_accident_tweet = accident_tweet['text'].apply(len)
n_accident = n_words_per_accident_tweet.sum()

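# N_Non_Accident: total number of words in non_accident tweets
# (count words in the 'text' column, not characters in the 'target' label)
n_words_per_non_accident_tweet = non_accident_tweet['text'].apply(len)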
n_non_accident = n_words_per_non_accident_tweet.sum()

# N_Vocabulary
n_vocabulary = len(vocabulary)

# Laplace smoothing
alpha = 1

In [46]:
columns_1 = accident_tweet.columns.drop(['text','target'])
columns_2 = non_accident_tweet.columns.drop(['text','target'])

In [47]:
accident_tweet = accident_tweet.dropna()
non_accident_tweet = non_accident_tweet.dropna()

In [48]:
# Initiate parameters
parameters_accident = {unique_word:0 for unique_word in vocabulary}
parameters_non_accident = {unique_word:0 for unique_word in vocabulary}

In [49]:
for column in columns_1:
    n_word_given_accident = accident_tweet[column].sum()   
    p_word_given_accident = (n_word_given_accident + alpha) / (n_accident + alpha*n_vocabulary)
    parameters_accident[column] = p_word_given_accident
 

In [50]:
for column in columns_2:
    n_word_given_non_accident = non_accident_tweet[column].sum()   
    p_word_given_non_accident = (n_word_given_non_accident + alpha) / (n_non_accident + alpha*n_vocabulary)
    parameters_non_accident[column] = p_word_given_non_accident
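
The same smoothed estimates can also be computed without a wide word-count table by counting words per class directly. The sketch below is an alternative under stated assumptions, not the method used in this notebook: the toy tweets and the names toy_tweets and smoothed_probabilities are made up for illustration.

```python
from collections import Counter

# Made-up, already tokenized tweets with their labels
toy_tweets = [
    (['forest', 'fire', 'near', 'town'], 'accident'),
    (['fire', 'evacuation', 'ordered'], 'accident'),
    (['my', 'mixtape', 'is', 'fire'], 'non_accident'),
]

alpha = 1  # Laplace smoothing constant
vocabulary = {word for words, _ in toy_tweets for word in words}

def smoothed_probabilities(label):
    """Return P(word | label) for every vocabulary word, with Laplace smoothing."""
    counts = Counter(
        word for words, tweet_label in toy_tweets if tweet_label == label
        for word in words
    )
    n_label = sum(counts.values())  # total number of words in this class
    denominator = n_label + alpha * len(vocabulary)
    return {word: (counts[word] + alpha) / denominator for word in vocabulary}

params_accident = smoothed_probabilities('accident')
params_non_accident = smoothed_probabilities('non_accident')

print(params_accident['fire'])      # (2 + 1) / (7 + 1 * 9)
print(params_non_accident['fire'])  # (1 + 1) / (4 + 1 * 9)
```

Because every vocabulary word receives a smoothed value, the probabilities within each class sum to 1.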

Classifying A New Message 

In [51]:
import re

def classify(message):
    '''
    message: a string
    '''
    
    # Raw string avoids an invalid-escape warning; \W matches non-word characters
    message = re.sub(r'\W', ' ', message)
    message = message.lower().split()
    
    p_accident_given_message = p_accident
    p_non_accident_given_message = p_non_accident

    for word in message:
        if word in parameters_accident:
            p_accident_given_message *= parameters_accident[word]
            
        if word in parameters_non_accident:
            p_non_accident_given_message *= parameters_non_accident[word]
            
    print('P(Accident|message):', p_accident_given_message)
    print('P(Non_accident|message):', p_non_accident_given_message)
    
    if p_non_accident_given_message > p_accident_given_message:
        print('Label: Non_accident')
    elif p_non_accident_given_message < p_accident_given_message:
        print('Label: Accident')
    else:
        print('Equal probabilities, have a human classify this!')
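
Multiplying many probabilities smaller than 1 can underflow to 0.0 for long inputs. A common remedy is to add logarithms instead of multiplying, which preserves the ranking of the classes. The sketch below shows the idea under stated assumptions: the priors and word probabilities are made-up numbers, and classify_log is a hypothetical function, not part of the original notebook.

```python
import math
import re

# Made-up parameters for illustration only
priors = {'accident': 0.43, 'non_accident': 0.57}
word_probs = {
    'accident': {'fire': 0.020, 'forest': 0.010, 'love': 0.001},
    'non_accident': {'fire': 0.005, 'forest': 0.002, 'love': 0.015},
}

def classify_log(message):
    """Return the label with the highest log-score; unknown words are skipped."""
    words = re.sub(r'\W', ' ', message).lower().split()
    scores = {}
    for label, prior in priors.items():
        score = math.log(prior)
        for word in words:
            if word in word_probs[label]:
                score += math.log(word_probs[label][word])
        scores[label] = score
    # Note: on an exact tie, max() returns the first label instead of flagging it
    return max(scores, key=scores.get), scores

label, scores = classify_log('Forest fire near the town!')
print(label)
print(scores)
```

The scores are log-probabilities up to a shared constant, so only their order matters when picking the label.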

In [54]:
test_set.head()

Unnamed: 0,text,target
0,@DwarfOnJetpack I guess I can say you and me m...,non_accident
1,E1.1.2 Particulate=Break up of Solid Combust F...,accident
2,7.Beyonce Is my pick for http://t.co/thoYhrHkf...,non_accident
3,@KirCut1 lets get a dope picture together and ...,non_accident
4,Breast milk is the original #superfood but rat...,non_accident


In [55]:
test_set = test_set.loc[:,~test_set.columns.duplicated()]

In [59]:
def classify_test_set(message):    
    '''
    message: a string
    '''
    
    # Raw string avoids an invalid-escape warning; \W matches non-word characters
    message = re.sub(r'\W', ' ', message)
    message = message.lower().split()
    
    p_accident_given_message = p_accident
    p_non_accident_given_message = p_non_accident

    for word in message:
        if word in parameters_accident:
            p_accident_given_message *= parameters_accident[word]
            
        if word in parameters_non_accident:
            p_non_accident_given_message *= parameters_non_accident[word]
    
    if p_non_accident_given_message > p_accident_given_message:
        return 'non_accident'
    elif p_accident_given_message > p_non_accident_given_message:
        return 'accident'
    else:
        return 'needs human classification'

In [60]:
test_set['predicted'] = test_set['text'].apply(classify_test_set)
test_set.head()

Unnamed: 0,text,target,predicted
0,@DwarfOnJetpack I guess I can say you and me m...,non_accident,non_accident
1,E1.1.2 Particulate=Break up of Solid Combust F...,accident,non_accident
2,7.Beyonce Is my pick for http://t.co/thoYhrHkf...,non_accident,non_accident
3,@KirCut1 lets get a dope picture together and ...,non_accident,non_accident
4,Breast milk is the original #superfood but rat...,non_accident,non_accident


In [61]:
correct = 0
total = test_set.shape[0]
    
for row in test_set.iterrows():
    row = row[1]
    if row['target'] == row['predicted']:
        correct += 1
        
print('Correct:', correct)
print('Incorrect:', total - correct)
print('Accuracy:', correct/total)

Correct: 862
Incorrect: 661
Accuracy: 0.5659881812212738
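
Accuracy alone hides which class the errors fall in, and it is easier to interpret next to a majority-class baseline. A minimal sketch with made-up labels (the lists actual and predicted are illustrative and are not this notebook's results):

```python
from collections import Counter

# Made-up labels for illustration only
actual    = ['accident', 'non_accident', 'non_accident', 'accident', 'non_accident']
predicted = ['non_accident', 'non_accident', 'non_accident', 'accident', 'accident']

# Count each (actual, predicted) pair: a confusion matrix in dictionary form
confusion = Counter(zip(actual, predicted))
for (true_label, pred_label), count in sorted(confusion.items()):
    print(f'actual={true_label:<13} predicted={pred_label:<13} count={count}')

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)

# Baseline: always predict the most frequent actual label
majority_label, majority_count = Counter(actual).most_common(1)[0]
baseline_accuracy = majority_count / len(actual)

print('Accuracy:', accuracy)
print('Baseline:', baseline_accuracy, f'(always predicting {majority_label})')
```

A classifier is only adding value when its accuracy clearly exceeds this baseline.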


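The accuracy is about 57%, which meets our target of more than 50%. However, it is no better than always predicting non_accident, the majority class at about 57% of the dataset, so the classifier still needs improvement.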