# 1.Problem Statement

# 2.Business Problem

In this case study, the objective is to identify whether a tweet regarding a disaster is real (1) or fake (0) using a Naïve Bayes model. This classification problem has significant implications, such as:

Timely Response to Real Disasters: By identifying genuine disaster-related tweets, emergency services, government agencies, and humanitarian organizations can prioritize and allocate resources effectively.

Combating Misinformation: Classifying fake disaster tweets helps prevent the spread of panic, misinformation, and malicious content, ensuring accurate information dissemination.

Automation in Social Media Analysis: Automating disaster tweet classification reduces the need for manual monitoring, saving time and effort.

# 3.Constraints

Data Quality: The dataset must be clean and free from noise, such as irrelevant tweets or ambiguous language.

Feature Representation: Tweets need to be converted into a suitable numerical format (e.g., Bag of Words, TF-IDF, etc.) for the Naïve Bayes model, which assumes independence between features.

Class Imbalance: The dataset might have an unequal distribution of real (1) and fake (0) tweets, which could affect the model's performance.

Ambiguity in Text: Tweets are often short and may contain slang, abbreviations, or sarcasm, making it challenging to interpret the context.

Computational Efficiency: Naïve Bayes is computationally efficient, but processing a large volume of tweets can require significant preprocessing and optimization.

Generalizability: The model should perform well on unseen data and not just on the given dataset.

In [1]:
import pandas as pd 
import numpy as np 
import seaborn as sns 
import matplotlib.pyplot as plt 


In [2]:
df=pd.read_csv('Disaster_tweets_NB.csv')
df.sample(3)

Unnamed: 0,id,keyword,location,text,target
2807,4038,disaster,los angeles,Keeps askin me what this means\nNot like i got...,1
4986,7114,military,,Ford : Other Military VERY NICE M151A1 MUTT wi...,0
1430,2063,casualty,,Property/casualty insurance rates up 1% in Jul...,1


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7613 entries, 0 to 7612
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        7613 non-null   int64 
 1   keyword   7552 non-null   object
 2   location  5080 non-null   object
 3   text      7613 non-null   object
 4   target    7613 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 297.5+ KB


In [4]:
df.isnull().sum()  # missing values per column; candidates for imputation (e.g. SimpleImputer) or removal

id             0
keyword       61
location    2533
text           0
target         0
dtype: int64

In [9]:
df.shape

(7613, 5)

In [11]:
df.describe()

Unnamed: 0,id,target
count,7613.0,7613.0
mean,5441.934848,0.42966
std,3137.11609,0.49506
min,1.0,0.0
25%,2734.0,0.0
50%,5408.0,0.0
75%,8146.0,1.0
max,10873.0,1.0


In [13]:
df.duplicated().sum()

0

# 4.Exploratory Data Analysis

In [18]:
df.sample(3)

Unnamed: 0,id,keyword,location,text,target
4912,6994,massacre,Norway,Is this the creepiest youth camp ever?. http:/...,0
4630,6580,injury,Russia,Our big baby climbed up on this thing on wheel...,0
1799,2585,crash,Galatians 2:20,Please keep Josh the Salyers/Blair/Hall famili...,0


In [20]:
df['target'].value_counts()

0    4342
1    3271
Name: target, dtype: int64
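As a quick check on the class-imbalance constraint, the counts above can be turned into proportions. This is a small sketch that hard-codes the two counts from the `value_counts()` output rather than reloading the CSV; the variable names are illustrative.

```python
import pandas as pd

# Class counts copied from the value_counts() output above
counts = pd.Series({0: 4342, 1: 3271}, name="target")

proportions = counts / counts.sum()
print(proportions.round(3))

# Accuracy of a trivial model that always predicts the majority class
majority_baseline = proportions.max()
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
```

The split is roughly 57% to 43%, so the imbalance is mild, and any model should be judged against a baseline of about 0.57 on the full dataset rather than 0.50.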

In [22]:
df['text'][df['target']==1][0]

'Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all'

In [24]:
df['text'][df['target']==1][1]

'Forest fire near La Ronge Sask. Canada'

In [26]:
df['text'][df['target']==1][3]

'13,000 people receive #wildfires evacuation orders in California '

In [28]:
df['text'][df['target']==1][5]

'#RockyFire Update => California Hwy. 20 closed in both directions due to Lake County fire - #CAfire #wildfires'

In [30]:
df['text'][df['target']==1].shape

(3271,)

In [32]:
df['text'][df['target']==1]

0       Our Deeds are the Reason of this #earthquake M...
1                  Forest fire near La Ronge Sask. Canada
2       All residents asked to 'shelter in place' are ...
3       13,000 people receive #wildfires evacuation or...
4       Just got sent this photo from Ruby #Alaska as ...
                              ...                        
7608    Two giant cranes holding a bridge collapse int...
7609    @aria_ahrary @TheTawniest The out of control w...
7610    M1.94 [01:04 UTC]?5km S of Volcano Hawaii. htt...
7611    Police investigating after an e-bike collided ...
7612    The Latest: More Homes Razed by Northern Calif...
Name: text, Length: 3271, dtype: object

In [34]:
df[df['target']==1].shape

(3271, 5)

In [36]:
df['text'][df['target']==1][7608]

'Two giant cranes holding a bridge collapse into nearby homes http://t.co/STfMbbZFB5'

In [44]:
df.sample(4)

Unnamed: 0,id,keyword,location,text,target
6559,9385,survived,,RT THR 'RT THRArchives: 1928: When Leo the MGM...,1
6237,8906,snowstorm,"Louisiana, USA",you're the snowstorm I'm purified. the darkest...,0
107,157,aftershock,304,'Nobody remembers who came in second.' Charles...,0
6925,9933,trouble,"Davis, California",Strawberries are in big trouble. Scientists ra...,1


In [46]:
# trimming the values


# 5.Data Preprocessing

In [49]:
df=df.dropna()

In [51]:
df.shape

(5080, 5)
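Note that `dropna()` reduces the data from 7613 to 5080 rows even though `text` is never missing; the rows are lost only because `keyword` or `location` is empty. One possible alternative, sketched here on an invented toy frame, is to fill the missing strings with an empty value so every tweet is kept. Whether that helps the model is an assumption that would need testing.

```python
import pandas as pd

# Toy frame, invented for illustration, with the same column names as the dataset
toy = pd.DataFrame({
    "keyword": ["fire", "calm", "flood"],
    "location": [None, "Canada", None],
    "text": ["forest fire", "all good here", "river flooding"],
    "target": [1, 0, 1],
})

dropped = toy.dropna()                                 # keeps only fully populated rows
filled = toy.fillna({"keyword": "", "location": ""})   # keeps every row

print("after dropna:", dropped.shape)
print("after fillna:", filled.shape)
```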

In [53]:
df.isnull().sum()

id          0
keyword     0
location    0
text        0
target      0
dtype: int64

In [55]:
df.sample(4)

Unnamed: 0,id,keyword,location,text,target
6587,9432,survivors,Shanghai,Survivors of Shanghai Ghetto reunite after 70 ...,0
3377,4835,evacuation,EIU Chucktown/LaSalle IL,@Eric_Bulak @jaclynsonne @_OliviaAnn_ I was lo...,0
5269,7531,oil%20spill,"Los Angeles, CA",Refugio oil spill may have been costlier bigge...,1
2019,2898,damage,Your Conversation,This real shit will damage a bitch,0


In [57]:
X=df.iloc[:,:-1]
y=df['target']

In [59]:
# splitting the data
from sklearn.model_selection import train_test_split 
X_train,X_test,y_train,y_test =train_test_split(X,y,test_size=0.2,random_state=42)
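Since the classes are not perfectly balanced, one option is to pass `stratify=y` so that both splits keep the same class ratio. The sketch below uses synthetic labels (60 zeros, 40 ones) purely to show the effect; it is not the split used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data for illustration: 100 samples, 60 of class 0 and 40 of class 1
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([0] * 60 + [1] * 40)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Both splits preserve the 60/40 ratio
print("train positives:", y_tr.sum(), "of", len(y_tr))
print("test positives: ", y_te.sum(), "of", len(y_te))
```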

In [60]:
# vectorizing the words

In [61]:
from sklearn.feature_extraction.text import CountVectorizer 
from sklearn.feature_extraction.text import TfidfTransformer 

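In [65]:
vector=CountVectorizer()
# NOTE: X_train is a DataFrame, so the vectorizer iterates over its column names
# rather than the tweets; this is why x_train_cv ends up with shape (4, 4) below
x_train_cv = vector.fit_transform(X_train)

In [67]:
# use transform (not fit_transform) so the test set reuses the training vocabulary
x_test_cv=vector.transform(X_test)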

In [69]:
x_train_cv.shape
X_train.shape

(4064, 4)

# 6.Model Selection

In [72]:
from sklearn.naive_bayes import MultinomialNB 
clf1=MultinomialNB()
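
For reference, the decision rule behind `MultinomialNB`, as described in the scikit-learn documentation, can be written as follows. Here $x_i$ is the count of term $i$ in a tweet, $N_{ci}$ is the total count of term $i$ in training tweets of class $c$, $N_c = \sum_i N_{ci}$, $n$ is the vocabulary size, and $\alpha$ is the smoothing parameter (1.0 by default):

$$
\hat{y} = \arg\max_{c}\left[\log P(c) + \sum_{i=1}^{n} x_i \log \hat{\theta}_{ci}\right],
\qquad
\hat{\theta}_{ci} = \frac{N_{ci} + \alpha}{N_c + \alpha n}
$$

The sum over terms is where the independence assumption mentioned under Constraints enters: each term contributes to the score separately, given the class.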

In [75]:
y_train

5732    1
3873    0
3382    0
6664    0
3233    1
       ..
6604    1
709     0
4602    0
5609    1
1312    0
Name: target, Length: 4064, dtype: int64

In [77]:
x_train_cv.shape

(4, 4)
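The `(4, 4)` shape appears because iterating over a DataFrame yields its column names, so the vectorizer treated `id`, `keyword`, `location`, and `text` as four one-word documents. A minimal sketch of the intended pattern, using invented toy frames, is to fit on a single text column of the training set and only transform the test set:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Toy frames, invented for illustration, mirroring the dataset's feature columns
X_train_toy = pd.DataFrame({
    "id": [1, 2, 3],
    "keyword": ["fire", "flood", "crash"],
    "location": ["Canada", "USA", "Norway"],
    "text": ["forest fire near town", "flood warning issued", "car crash on highway"],
})
X_test_toy = pd.DataFrame({
    "id": [4], "keyword": ["fire"], "location": ["USA"],
    "text": ["fire warning near highway"],
})

# Passing the whole DataFrame: the 4 column names become 4 "documents"
wrong = CountVectorizer().fit_transform(X_train_toy)
print("whole DataFrame:", wrong.shape)

# Passing one column: one row per tweet, one column per vocabulary term
vec = CountVectorizer()
train_cv = vec.fit_transform(X_train_toy["text"])  # learn vocabulary on training text
test_cv = vec.transform(X_test_toy["text"])        # reuse it for the test text
print("text column:", train_cv.shape, test_cv.shape)
```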

In [79]:
# drop the id column; keep keyword, location and text
X=X.iloc[:,1:]

In [81]:
# bag-of-words counts for the keyword column (the same vectorizer is refit for each column below)
k=vector.fit_transform(X['keyword'])

In [83]:
numerical_df = pd.DataFrame(k.toarray(), columns=vector.get_feature_names_out())

In [85]:
numerical_df.shape

(5080, 239)

In [87]:
X.shape

(5080, 3)

In [89]:
k=vector.fit_transform(X['text'])

In [91]:
numerical_df1 = pd.DataFrame(k.toarray(), columns=vector.get_feature_names_out())

In [93]:
numerical_df1.shape

(5080, 16420)

In [94]:
k=vector.fit_transform(X['location'])
numerical_df2 = pd.DataFrame(k.toarray(), columns=vector.get_feature_names_out())
numerical_df2.shape

(5080, 3261)

In [95]:
new_df =pd.DataFrame()

In [99]:
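# default axis=0 stacks the three frames row-wise (3 x 5080 rows), which no longer lines up with y;
# the column-wise concatenation in the next cell is the one used for training
combined = pd.concat([numerical_df, numerical_df1, numerical_df2], ignore_index=True)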


In [100]:
combined.shape

(15240, 18418)

In [101]:
combined_columns = pd.concat([numerical_df, numerical_df1, numerical_df2], axis=1)
combined_columns.shape

(5080, 19920)
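The dense frame above holds about 101 million cells (5080 × 19920), and because 239 + 16420 + 3261 = 19920 while the row-wise frame had only 18418 distinct columns, some column names are repeated across the three blocks. A possible alternative, sketched on an invented toy frame, keeps each block sparse and joins them with `scipy.sparse.hstack`, using one vectorizer per column so no vocabulary is overwritten. The dictionary of vectorizers is an illustrative choice, not the notebook's method.

```python
import pandas as pd
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer

# Toy frame, invented for illustration
toy = pd.DataFrame({
    "keyword": ["fire", "flood", "crash"],
    "location": ["Canada", "USA", "Norway"],
    "text": ["forest fire near town", "flood warning issued", "car crash on highway"],
})

# One vectorizer per column, so each vocabulary is kept for later transform() calls
vectorizers = {col: CountVectorizer() for col in ["keyword", "text", "location"]}
blocks = [vectorizers[col].fit_transform(toy[col]) for col in vectorizers]

# Column-wise concatenation that stays sparse
features = hstack(blocks).tocsr()
print([b.shape for b in blocks], "->", features.shape)
```

In a full run, the vectorizers would ideally be fitted on the training rows only and then applied to the test rows with `transform`, so that the test vocabulary does not leak into training.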

In [102]:
from sklearn.model_selection import train_test_split 
# re-split using the combined keyword/text/location count features
X_train,X_test,y_train,y_test =train_test_split(combined_columns,y,test_size=0.2,random_state=42)

In [103]:
from sklearn.naive_bayes import MultinomialNB 
clf1=MultinomialNB()

In [104]:
clf1.fit(X_train,y_train)

In [105]:
y_pre=clf1.predict(X_test)

In [106]:
from sklearn.metrics import accuracy_score 
accuracy_score(y_test,y_pre)

0.8031496062992126
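Accuracy alone does not show how errors are distributed between the two classes. The sketch below computes a confusion matrix, precision, and recall on a small set of invented labels; applying the same calls to `y_test` and `y_pre` would be the natural next step, but those results are not reproduced here.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Invented labels for illustration only (not the notebook's predictions)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are true classes (0, 1); columns are predicted classes (0, 1)
cm = confusion_matrix(y_true, y_pred)
precision = precision_score(y_true, y_pred)  # of tweets flagged as real, how many were real
recall = recall_score(y_true, y_pred)        # of real disaster tweets, how many were flagged

print(cm)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

For a disaster-response use case, recall on class 1 is arguably the more important figure, since a missed real disaster tweet is costlier than a false alarm.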

# 7.Conclusion

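Using the Naïve Bayes model for classifying disaster-related tweets is a practical approach because of its simplicity and effectiveness in handling text data. In this notebook, a MultinomialNB classifier trained on bag-of-words counts of the keyword, text, and location columns reached an accuracy of about 0.80 on the 20% held-out test split, using the 5080 rows that have no missing values. Once trained, the model can:

1.Predict whether a disaster tweet is real or fake, assisting in faster and more consistent decision-making.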

2.Reduce misinformation spread by identifying fake tweets early.

3.Be implemented as part of an automated system to process large volumes of tweets in real time.
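
As a rough illustration of the third point, vectorization and classification can be bundled into a single scikit-learn pipeline so that raw tweet text goes in and labels come out. The training tweets and the helper name `classify_tweets` are hypothetical and exist only for this sketch; a real system would load a model trained on the full dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny invented training set: 1 = real disaster, 0 = not a disaster
train_texts = [
    "forest fire evacuation",
    "earthquake damage reported",
    "love this song",
    "great movie tonight",
]
train_labels = [1, 1, 0, 0]

# The pipeline fits the vocabulary and the classifier together
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

def classify_tweets(texts):
    """Hypothetical helper: return a 0/1 label for each raw tweet string."""
    return model.predict(texts).tolist()

print(classify_tweets(["fire damage", "love this movie"]))
```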