# DISASTER TWEETS NB DATASET
Using the above dataset to build a Naive Bayes Model to predict the nature of the tweets.

## BUSINESS OBJECTIVE
* Maximize Accurate prediction
* Minimize Fake news 
* Identify False Claims

## CONSTRAINTS
* Authenticating information


## DATA DICTIONARY

| **slno** | **Name of Feature** | **Description**                                         | **Type**     | **Relevance** |
|:--------:|:--------------------|:--------------------------------------------------------|:------------:|:-------------:|
| 1        | id                  | Id of the tweet                                         | Count        | Irrelevant    |
| 2        | keyword             | Keywords in the tweet                                   | Categorical  | Irrelevant    |
| 3        | location            | Location from which the tweet was posted                | Categorical  | Irrelevant    |
| 4        | text                | Text content of the tweet                               | Categorical  | Relevant      |
| 5        | target              | 1 for disaster and 0 for no disaster                    | Binary       | Relevant      |

Importing the required libraries.

In [1]:
import pandas as pd
import numpy as np
from termcolor import colored
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer
from sklearn.naive_bayes import MultinomialNB as MB
from sklearn.metrics import accuracy_score

Loading the dataset using the pandas library and confirming the dataset has been loaded properly using the 'head' function

In [2]:

df0 = pd.read_csv(r"D:\360Digitmg\ASSIGNMENTS\Ass14\Disaster_tweets_NB.csv")
df=df0.copy()
df.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


### EXPLORATORY DATA ANALYSIS

The below three lines give a general idea about the dataset like the shape, type and non null values.

In [3]:
df.shape

(7613, 5)

In [4]:
df.dtypes

id           int64
keyword     object
location    object
text        object
target       int64
dtype: object

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7613 entries, 0 to 7612
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        7613 non-null   int64 
 1   keyword   7552 non-null   object
 2   location  5080 non-null   object
 3   text      7613 non-null   object
 4   target    7613 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 297.5+ KB


The describe function gives the count, min, max, mean, standard deviation and quantile values of the dataset.

In [6]:
df.describe()

Unnamed: 0,id,target
count,7613.0,7613.0
mean,5441.934848,0.42966
std,3137.11609,0.49506
min,1.0,0.0
25%,2734.0,0.0
50%,5408.0,0.0
75%,8146.0,1.0
max,10873.0,1.0


Checking the Number of Duplicates in the Dataset.

In [7]:
duplicate_values=df.duplicated(subset=None,keep='first').sum()
print(colored(' Number of Duplicate values: ','blue',attrs=['bold']),duplicate_values)

[1m[34m Number of Duplicate values: [0m 0


Checking the Number and Percentage of Missing Values in the Dataset.

In [8]:
missing=df.isna().sum().sort_values(ascending=False)
print(colored("Number of Missing Values\n\n",'blue',attrs=['bold']),missing)

[1m[34mNumber of Missing Values

[0m location    2533
keyword       61
id             0
text           0
target         0
dtype: int64


In [9]:
print(colored('Number of Unique Values:\n\n','blue',attrs=['bold']),df.nunique())

[1m[34mNumber of Unique Values:

[0m id          7613
keyword      221
location    3341
text        7503
target         2
dtype: int64


Dropping the below three columns as they are irrelevant for the analysis.

In [10]:
df.drop(df[['id','keyword','location']],axis=1,inplace=True)
df.head()

Unnamed: 0,text,target
0,Our Deeds are the Reason of this #earthquake M...,1
1,Forest fire near La Ronge Sask. Canada,1
2,All residents asked to 'shelter in place' are ...,1
3,"13,000 people receive #wildfires evacuation or...",1
4,Just got sent this photo from Ruby #Alaska as ...,1


Converting the text data into lower case text data. 

In [11]:
df['text']=df['text'].apply(lambda x: " ".join(x.lower() for x in x.split()))
df.head()

Unnamed: 0,text,target
0,our deeds are the reason of this #earthquake m...,1
1,forest fire near la ronge sask. canada,1
2,all residents asked to 'shelter in place' are ...,1
3,"13,000 people receive #wildfires evacuation or...",1
4,just got sent this photo from ruby #alaska as ...,1


Removing all the extra characters keeping only the alphabet data. 

In [12]:
df['text']=df['text'].str.replace('[^a-z" "]+','')
df.head()

  df['text']=df['text'].str.replace('[^a-z" "]+','')


Unnamed: 0,text,target
0,our deeds are the reason of this earthquake ma...,1
1,forest fire near la ronge sask canada,1
2,all residents asked to shelter in place are be...,1
3,people receive wildfires evacuation orders in...,1
4,just got sent this photo from ruby alaska as s...,1


Loading custom stop words as a list. 

In [13]:
stop_words = []
# Load the custom built Stopwords
with open(r"C:\Users\lenny\Downloads\stop_words_english.txt","r",encoding='utf-8') as sw:
    stop_words = sw.read()

stop_words = stop_words.split("\n")

In [14]:
stop_words

['able',
 'about',
 'above',
 'abroad',
 'according',
 'accordingly',
 'across',
 'actually',
 'adj',
 'after',
 'afterwards',
 'again',
 'against',
 'ago',
 'ahead',
 "ain't",
 'all',
 'allow',
 'allows',
 'almost',
 'alone',
 'along',
 'alongside',
 'already',
 'also',
 'although',
 'always',
 'am',
 'amid',
 'amidst',
 'among',
 'amongst',
 'an',
 'and',
 'another',
 'any',
 'anybody',
 'anyhow',
 'anyone',
 'anything',
 'anyway',
 'anyways',
 'anywhere',
 'apart',
 'appear',
 'appreciate',
 'appropriate',
 'are',
 "aren't",
 'around',
 'as',
 "a's",
 'aside',
 'ask',
 'asking',
 'associated',
 'at',
 'available',
 'away',
 'awfully',
 'back',
 'backward',
 'backwards',
 'be',
 'became',
 'because',
 'become',
 'becomes',
 'becoming',
 'been',
 'before',
 'beforehand',
 'begin',
 'behind',
 'being',
 'believe',
 'below',
 'beside',
 'besides',
 'best',
 'better',
 'between',
 'beyond',
 'both',
 'brief',
 'but',
 'by',
 'came',
 'can',
 'cannot',
 'cant',
 "can't",
 'caption',
 'cau

Removing all the stop words in the text data. 

In [15]:
df['text']=df['text'].apply(lambda x: " ".join(x for x in x.split() if x not in stop_words ))
df.head()

Unnamed: 0,text,target
0,deeds reason earthquake allah forgive,1
1,forest la ronge sask canada,1
2,residents asked shelter place notified officer...,1
3,people receive wildfires evacuation orders cal...,1
4,photo ruby alaska smoke wildfires pours school,1


### MODEL BUILDING

Splitting the dataframe into test and train dataframe

In [16]:
df_train, df_test= train_test_split(df,test_size=0.2,random_state=1000,stratify=df.target)

Creating a custom function to tokenize the text in each rows of the dataframe. 

In [17]:
def split_into_words(i):
    return [word for word in i.split(" ")]

Converting texts in to word count matrix format i.e Bag of words using CountVectorizer. 

In [18]:
df_bow=CountVectorizer(analyzer=split_into_words).fit(df.text)

BOW for entire Dataset.

In [19]:
df_matrix=df_bow.transform(df.text)

BOW for training Dataset

In [20]:
df_train_matrix=df_bow.transform(df_train.text)

BOW for test dataset. 

In [21]:
df_test_matrix=df_bow.transform(df_test.text)

Learning Term Weighting and normalizing on entire dataset. 

In [22]:
tfidf_transformer=TfidfTransformer().fit(df_matrix)

Preparing TFIDF for train dataset. 

In [23]:
df_train_tfidf=tfidf_transformer.transform(df_train_matrix)
df_train_tfidf.shape

(6090, 20877)

Preparing TFIDF for test dataset. 

In [24]:
df_test_tfidf=tfidf_transformer.transform(df_test_matrix)
df_test_tfidf.shape

(1523, 20877)

Preparing a naive bayes model on training dataset. 

In [25]:
classifier_mb=MB()
classifier_mb.fit(df_train_tfidf, df_train.target)

MultinomialNB()

Evaluation on test data. 

In [26]:
test_pred=classifier_mb.predict(df_test_tfidf)

In [27]:
pd.crosstab(test_pred, df_test.target,rownames = ['Predictions'], colnames= ['Actuals'])

Actuals,0,1
Predictions,Unnamed: 1_level_1,Unnamed: 2_level_1
0,770,209
1,99,445


Test data accuracy

In [28]:
accuracy_score(test_pred, df_test.target)

0.7977675640183848

Train Data accuracy. 

In [29]:
train_pred=classifier_mb.predict(df_train_tfidf)
accuracy_score(train_pred,df_train.target)

0.9203612479474549

In [30]:
pd.crosstab(train_pred,df_train.target,rownames = ['Predictions'], colnames= ['Actuals'])

Actuals,0,1
Predictions,Unnamed: 1_level_1,Unnamed: 2_level_1
0,3394,406
1,79,2211


### HYPERPARAMETER TUNING

Apply laplace smoothing to carryout hyperparameter tuning and evaluating both train and test dataset. 

In [31]:
classifier_mb_lap=MB(alpha=1)
classifier_mb_lap.fit(df_train_tfidf, df_train.target)

MultinomialNB(alpha=1)

In [32]:
test_pred_lap=classifier_mb_lap.predict(df_test_tfidf)

In [34]:
accuracy_score(test_pred_lap, df_test.target)

0.7977675640183848

In [35]:
pd.crosstab(test_pred_lap, df_test.target,rownames = ['Predictions'], colnames= ['Actuals'])

Actuals,0,1
Predictions,Unnamed: 1_level_1,Unnamed: 2_level_1
0,770,209
1,99,445


In [36]:
train_pred_lap=classifier_mb_lap.predict(df_train_tfidf)
accuracy_score(train_pred_lap,df_train.target)

0.9203612479474549

In [37]:
pd.crosstab(train_pred_lap, df_train.target,rownames = ['Predictions'], colnames= ['Actuals'])

Actuals,0,1
Predictions,Unnamed: 1_level_1,Unnamed: 2_level_1
0,3394,406
1,79,2211


### CONCLUSION

This model is not good for prediction as it is overfitting , so the best option is to try another model.This analysis would help to identify the emergency tweets so that help would be given at the earliest. 