# TOPIC CLASSIFICATION

## 1. Introduction

### Objective
While the app's main functionality centers on capturing and analyzing images of utility poles for maintenance and safety concerns, the integrated text box feature serves a crucial supplementary role, it allows customers to provide additional context, report specific issues, and express their concerns or feedback regarding the utility services.

On top of sentiment and urgency text analysis we will implement the following to further enhance the use of the text:
- Support Vector Machine for Topic Classification - This model will help identify the urgency of an asset's condition by categorizing the text into 'infrastructure_and_utility_damage' or 'requests_or_urgent_needs'

### Datset Description

This project utilizes the HumAID Twitter dataset, which consists of manually annotated tweets collected during eleven significant natural disaster events from 2016 to 2019. 
We use this HumAID Twitter dataset as a stand-in for actual customer data from SDG&E due to the current lack of available customer interactions. Our approach utilizes these annotated tweets as a proxy to develop and refine our sentiment analysis and topic classification models.

Link: https://crisisnlp.qcri.org/humaid_dataset

The dataset is organized into separate folders for each of the following natural disaster events:

- Canada Wildfires (2016)
- Cyclone Idai (2019)
- Ecuador Earthquake (2016)
- Hurricane Harvey (2017)
- Hurricane Irma (2017)
- Hurricane Maria (2017)
- Hurricane Matthew (2016)
- Italy Earthquake (August 2016)
- Kaikoura Earthquake (2016)
- Puebla Mexico Earthquake (2017)
- Sri Lanka Floods (2017)

For each event, the dataset is divided into two subsets: Training Set, and Test Set.

Each subset consists of tab-separated values (TSV) files containing 3 columns: tweet_id, tweet_text, class_label.

## 2. Setup

### Library Import

In [11]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from nltk.corpus import stopwords
import re
from bs4 import BeautifulSoup
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.metrics import classification_report

### Preprocessing

In [2]:
events = [
    "canada_wildfires_2016", "cyclone_idai_2019", "ecuador_earthquake_2016",
    "hurricane_harvey_2017", "hurricane_irma_2017", "hurricane_maria_2017",
    "hurricane_matthew_2016", "italy_earthquake_aug_2016", "kaikoura_earthquake_2016",
    "puebla_mexico_earthquake_2017", "srilanka_floods_2017"
]

dataframes = {event: {} for event in events}

# Loop through each event and load train, dev, and test sets
for event in events:
    for set_type in ['train', 'dev', 'test']:
        file_path = f'../data/HumAID/{event}/{event}_{set_type}.tsv'
        # Load the dataset and store it in the dictionary under the appropriate event and set type
        dataframes[event][set_type] = pd.read_csv(file_path, sep='\t')

In [3]:
dataframes['canada_wildfires_2016']['test']

Unnamed: 0,tweet_id,tweet_text,class_label
0,728674116773904384,RT @FoothillsFCU23: In response the to the #Fo...,rescue_volunteering_or_donation_effort
1,729787427829612544,Redcross is offering charitable donation recei...,rescue_volunteering_or_donation_effort
2,730510385544085505,RT @globeandmail: Red Cross to transfer $50-mi...,rescue_volunteering_or_donation_effort
3,733705874594746368,Live: Emergency operations briefing on north A...,other_relevant_information
4,730606066023665665,"$9bn fire damage to Fort McMurray, ‘the beast’...",infrastructure_and_utility_damage
...,...,...,...
440,729062171993374720,I feel sad Mom &amp; I r donating money to the...,rescue_volunteering_or_donation_effort
441,728733841230311425,This is the best way to help. The Red Cross wi...,rescue_volunteering_or_donation_effort
442,730078294259961856,RT @TheGrayGroup: Donations for Fort McMurray ...,rescue_volunteering_or_donation_effort
443,730002964035723264,Local volleyball team members raise money to h...,rescue_volunteering_or_donation_effort


In [4]:
all_dfs = []

# Iterate over each event and set type, adding the event name as a column
for event, sets in dataframes.items():
    for set_type, df in sets.items():
        df['event'] = event  # Add the event name as a column

        all_dfs.append(df)

# Concatenate the lists of DataFrames into a single DataFrame
all_df = pd.concat(all_dfs, ignore_index=True)

Filter the DataFrame that only contains the labels 'infrastructure_and_utility_damage' and 'requests_or_urgent_needs'

In [5]:
all_df = all_df[(all_df['class_label'] == 'infrastructure_and_utility_damage') | (all_df['class_label'] == 'requests_or_urgent_needs')]

### Data Overview

In [6]:
all_df.head()

Unnamed: 0,tweet_id,tweet_text,class_label,event
16,729658483948302336,Great tale from @ReutersWinnipeg: #Canada fire...,infrastructure_and_utility_damage,canada_wildfires_2016
21,731202296210685952,RT @CBCNorth: Justin Trudeau in Fort McMurray ...,infrastructure_and_utility_damage,canada_wildfires_2016
32,728697237153447936,Wildfire official Chad Morrison says between 1...,infrastructure_and_utility_damage,canada_wildfires_2016
39,733413736686358528,RT @WaterTrends: CANADA ALBERTA: Wildfire cont...,infrastructure_and_utility_damage,canada_wildfires_2016
45,728604991397584896,Catastrophic Canadian Wildfire Is a Sign of De...,infrastructure_and_utility_damage,canada_wildfires_2016


In [7]:
all_df.count()

tweet_id       7554
tweet_text     7554
class_label    7554
event          7554
dtype: int64

## 3. Topic Classification

We want to clean the text in our dataset, so we will create a function to remove stop words, lowercase, and remove punctuation

In [13]:
REPLACE_BY_SPACE_RE = re.compile('[/(){}\[\]\|@,;]')
BAD_SYMBOLS_RE = re.compile('[^0-9a-z #+_]')
STOPWORDS = set(stopwords.words('english'))

def clean_text(text):
    """
        text: a string
        
        return: modified initial string
    """
    text = BeautifulSoup(text, "lxml").text # HTML decoding
    text = text.lower() # lowercase text
    text = REPLACE_BY_SPACE_RE.sub(' ', text) # replace REPLACE_BY_SPACE_RE symbols by space in text
    text = BAD_SYMBOLS_RE.sub('', text) # delete symbols which are in BAD_SYMBOLS_RE from text
    text = ' '.join(word for word in text.split() if word not in STOPWORDS) # delete stopwors from text
    return text

In [14]:
all_df['tweet_text'] = all_df['tweet_text'].apply(clean_text)

FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?

Split the data into train and test sets

In [15]:
X = all_df['tweet_text']
y = all_df['class_label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state = 42)

In [16]:
sgd = Pipeline([('vect', CountVectorizer()),
                ('tfidf', TfidfTransformer()),
                ('clf', SGDClassifier(loss='hinge', penalty='l2',alpha=1e-3, random_state=42, max_iter=10, tol=None)),
               ])
sgd.fit(X_train, y_train)

y_pred = sgd.predict(X_test)
my_tags = list(all_df['class_label'].unique())
print('accuracy %s' % accuracy_score(y_pred, y_test))
print(classification_report(y_test, y_pred,target_names=my_tags))

accuracy 0.9638288486987208
                                   precision    recall  f1-score   support

infrastructure_and_utility_damage       0.96      0.99      0.98      1768
         requests_or_urgent_needs       0.96      0.87      0.91       499

                         accuracy                           0.96      2267
                        macro avg       0.96      0.93      0.95      2267
                     weighted avg       0.96      0.96      0.96      2267



We can see that SVM gives us an accuracy of 96%

In [17]:
sgd.predict(['This pole has a slight tilt to it'])

array(['infrastructure_and_utility_damage'], dtype='<U33')

In [19]:
import pickle
with open('../models/svm_model.pkl', 'wb') as f:
    pickle.dump(sgd, f)