# Analysis of Disaster-Related Terms in Social Media Data using NER

**Objectives**:
1. Apply Named Entity Recognition (NER) to identify disaster types and related terms.
1. Analyze and compare the frequency of different disaster terms to gauge public awareness and concerns.

**Steps**:

1. Load the dataset from `'tweets.csv'`.
1. Preprocess and clean the text by removing punctuation and emojis.
1. Utilize the `spacy` library to lemmatize the text and filter it based on parts of speech.
1. Analyze the data to identify the most common disaster keywords.
1. Extract and count terms related to the top five disaster keywords.
1. Organize and display the findings for each disaster type, showing the most common related terms and their frequencies.

In [None]:
import pandas as pd

# Load the data
data = pd.read_csv('tweets.csv')

In [33]:
import re
import string  # needed for string.punctuation in preprocess_text

def remove_emojis(text):
    emoji_pattern = re.compile(
        "["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F700-\U0001F77F"  # alchemical symbols
        u"\U0001F780-\U0001F7FF"  # Geometric Shapes Extended
        u"\U0001F800-\U0001F8FF"  # Supplemental Arrows-C
        u"\U0001F900-\U0001F9FF"  # Supplemental Symbols and Pictographs
        u"\U0001FA00-\U0001FA6F"  # Chess Symbols
        u"\U0001FA70-\U0001FAFF"  # Symbols and Pictographs Extended-A
        u"\U00002702-\U000027B0"  # Dingbats
        u"\U000024C2-\U0001F251"  # very wide range: also matches CJK and other non-emoji characters
        "]+", 
        flags=re.UNICODE
    )
    return emoji_pattern.sub(r'', text)

# Strip ASCII punctuation, then emojis, from a single tweet
def preprocess_text(text):
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Remove emojis
    text = remove_emojis(text)
    return text
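
A quick check of the last character range in `remove_emojis` may be useful here. The range `\U000024C2-\U0001F251` spans far more than emoji, so it also removes CJK and other non-Latin text. The sketch below isolates that one range; the sample string is made up for illustration and does not come from `tweets.csv`.

```python
import re

# Only the final range from remove_emojis, isolated for inspection
broad_range = re.compile("[\U000024C2-\U0001F251]+")

sample = "flood 洪水 warning"  # hypothetical text, not from tweets.csv
cleaned = broad_range.sub("", sample)
print(repr(cleaned))  # 'flood  warning': the CJK characters are stripped
```

For English-only tweets this is probably harmless; for multilingual data, narrower ranges would likely be safer.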


In [34]:
import spacy

nlp = spacy.load('en_core_web_sm')

def lemmatize_text(text):
    doc = nlp(text)
    return " ".join([token.lemma_ for token in doc])

In [35]:
def filter_pos(text):
    doc = nlp(text)
    return " ".join([token.text for token in doc if token.pos_ in ['NOUN', 'VERB', 'ADJ']])

In [36]:
from collections import Counter

# Count tweets per keyword, skipping rows with no keyword
disaster_counts = Counter(data['keyword'].dropna())
top_5_disasters = disaster_counts.most_common(5)
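
Some keywords in this dataset appear URL-encoded (for example `mass%20murder`). If readable labels are wanted, one option is to decode them before counting. This is a minimal sketch using the standard library's `urllib.parse.unquote`; the `keywords` list is a hypothetical sample standing in for `data['keyword']`.

```python
from collections import Counter
from urllib.parse import unquote

# Hypothetical sample standing in for data['keyword']
keywords = ['mass%20murder', 'thunderstorm', 'mass%20murder', 'flattened']

# Decode percent-escapes such as %20 before counting
decoded_counts = Counter(unquote(k) for k in keywords)
print(decoded_counts.most_common(1))  # [('mass murder', 2)]
```

If decoding is applied, the same decoded value would also have to be used when filtering `data['keyword']` later on.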

In [37]:
def get_related_terms(disaster_type):
    filtered_data = data[data['keyword'] == disaster_type]
    terms = []
    for text in filtered_data['text']:
        text = preprocess_text(text)
        text = lemmatize_text(text)
        text = filter_pos(text)
        terms.extend(text.split())
    term_counts = Counter(terms)
    return term_counts

disaster_term_counts = {disaster: get_related_terms(disaster) for disaster, _ in top_5_disasters}
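
Terms are counted case-sensitively, which is why the output lists `thunderstorm` and `Thunderstorm` as separate rows. If case variants should be merged, a small post-processing step could fold them together. The helper name `merge_case_variants` is an assumption for illustration; the three counts are taken from the thunderstorm output.

```python
from collections import Counter

def merge_case_variants(term_counts):
    """Return a new Counter with counts of case variants summed under the lowercase term."""
    merged = Counter()
    for term, count in term_counts.items():
        merged[term.lower()] += count
    return merged

# Three rows from the thunderstorm output
counts = Counter({'thunderstorm': 34, 'weather': 23, 'Thunderstorm': 10})
merged = merge_case_variants(counts)
print(merged.most_common(2))  # [('thunderstorm', 44), ('weather', 23)]
```

Lowercasing the text inside `preprocess_text` would have a similar effect, though it may change how `spacy` tags some tokens.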

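The printing cell that follows reads from `disaster_dataframes`, which is not defined in the cells shown. A minimal sketch of how it might be built, assuming one DataFrame per disaster with the `Term` and `Frequency` columns seen in the output. `counts_to_frame` is a hypothetical helper name, and the toy counts are illustrative rather than taken from the dataset.

```python
from collections import Counter

import pandas as pd

def counts_to_frame(term_counts):
    """Turn a Counter of terms into a two-column DataFrame (Term, Frequency)."""
    return pd.DataFrame(list(term_counts.items()), columns=['Term', 'Frequency'])

# Toy counts standing in for disaster_term_counts (hypothetical values)
example_counts = {'flood': Counter({'water': 3, 'rain': 2, 'road': 1})}
example_frames = {name: counts_to_frame(c) for name, c in example_counts.items()}

print(example_frames['flood'].sort_values(by='Frequency', ascending=False))
```

In the notebook, the same helper could be applied as `disaster_dataframes = {d: counts_to_frame(c) for d, c in disaster_term_counts.items()}`.
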

In [39]:
# Iterate through each disaster type and its DataFrame to print the terms and frequencies
for disaster, df in disaster_dataframes.items():
    print(f"Disaster Type: {disaster}")
    print(df.sort_values(by='Frequency', ascending=False).head(10))  # Show top 10 terms for each disaster
    print('\n')  # Print a newline to separate results for different disaster types


Disaster Type: thunderstorm
                Term  Frequency
7       thunderstorm         34
29           weather         23
97           include         15
107           severe         10
171               pm         10
167         continue         10
18      Thunderstorm         10
187  nwsseveretstorm          8
20            affect          8


Disaster Type: flattened
         Term  Frequency
2     flatten         65
34       home         13
33    tornado         13
108      kill         12
104    prayer         12
105    closet         12
109      many         12
106   survive         12
107  powerful         12
25       have          6


Disaster Type: mass%20murder
        Term  Frequency
3       mass         43
6     murder         42
0         be         16
150    thank          7
24    people          6
42       use          5
228     want          5
32    regime          5
51      mean          5
28   iranian          5


Disaster Type: stretcher
          Term  Frequency
4 