<a href="https://colab.research.google.com/github/jzfrank/h4g-idmc-articleClassifier/blob/main/idmc_article_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.23.1-py3-none-any.whl (5.3 MB)
Collecting huggingface-hub<1.0,>=0.10.0
  Downloading huggingface_hub-0.10.1-py3-none-any.whl (163 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.1-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.10.1 tokenizers-0.13.1 transformers-4.23.1


In [2]:
import torch

In [3]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

Defining Classes

In [4]:
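class IsDisasterClassifier:
  """Binary classifier: does the text describe a disaster?"""
  def __init__(self):
    # The checkpoint is loaded from TensorFlow weights, hence from_tf=True.
    self.tokenizer = AutoTokenizer.from_pretrained("sacculifer/dimbat_disaster_distilbert")
    self.model = AutoModelForSequenceClassification.from_pretrained("sacculifer/dimbat_disaster_distilbert", from_tf=True)
  def isDisaster(self, text):
    # Truncate to the tokenizer's maximum length so long articles do not overflow the model.
    inputs = self.tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
      logits = self.model(**inputs).logits
    predicted_class_id = logits.argmax().item()
    # Class 1 is mapped to "disaster", class 0 to "not a disaster".
    return {
        1: True,
        0: False
    }[
        predicted_class_id
    ]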
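
`isDisaster` keeps only the argmax of the logits, so the caller never sees how confident the model was. The sketch below shows, in plain Python, how logits can be turned into a class id plus a probability with a softmax. The function names and the example logits are hypothetical and are not part of the notebook or of `transformers`.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities (shifted by the max for numerical stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_with_confidence(logits):
    """Return (class_id, probability) for the highest-scoring class."""
    probs = softmax(logits)
    class_id = max(range(len(probs)), key=probs.__getitem__)
    return class_id, probs[class_id]

# Hypothetical logits for a two-class (not disaster / disaster) model.
class_id, confidence = predict_with_confidence([-1.2, 2.3])
print(class_id, round(confidence, 3))
```

With the real model, the list passed in would be something like `logits[0].tolist()`; a threshold on the returned probability could then be used to flag borderline articles for manual review.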

In [5]:
class DisasterTypeClassifier:
  """Multi-class classifier: which kind of disaster does the text describe?"""
  def __init__(self):
    # The checkpoint is loaded from TensorFlow weights, hence from_tf=True.
    self.tokenizer = AutoTokenizer.from_pretrained("sacculifer/dimbat_disaster_type_distilbert")
    self.model = AutoModelForSequenceClassification.from_pretrained("sacculifer/dimbat_disaster_type_distilbert", from_tf=True)
  def disasterType(self, text):
    # Truncate to the tokenizer's maximum length so long articles do not overflow the model.
    inputs = self.tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
      logits = self.model(**inputs).logits
    predicted_class_id = logits.argmax().item()
    # Map the predicted class id to a human-readable label.
    return {
        0: "haze",
        1: "disease",
        2: "earthquake",
        3: "flood",
        4: "hurricane & tornado",
        5: "wildfire",
        6: "industrial accident",
        7: "societal crime",
        8: "transportation accident",
        9: "meteor crash"
    }[
        predicted_class_id
    ]


In [6]:
class ArticleClassifier:
  def __init__(self):
    self.isDisasterClassifier = IsDisasterClassifier()
    self.disasterTypeClassifier = DisasterTypeClassifier()
  def isDisaster(self, text: str) -> bool:
    return self.isDisasterClassifier.isDisaster(text)
  def disasterType(self, text: str) -> str:
    if not self.isDisaster(text):
      return "not a disaster"
    return self.disasterTypeClassifier.disasterType(text)
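
`ArticleClassifier` is a two-stage cascade: the binary model gates the type model. Calling `isDisaster` and then `disasterType` on the same text runs the binary model twice. A minimal sketch of a single-pass variant follows; `classify_article` and the two `fake_*` functions are hypothetical stand-ins so that the example runs without downloading the models.

```python
from typing import Callable, Dict

def classify_article(
    text: str,
    is_disaster: Callable[[str], bool],
    disaster_type: Callable[[str], str],
) -> Dict[str, object]:
    """Run the binary check once, then the type model only for disasters."""
    flagged = is_disaster(text)
    label = disaster_type(text) if flagged else "not a disaster"
    return {"isDisaster": flagged, "disasterType": label}

# Hypothetical stand-ins for the two DistilBERT models, for illustration only.
def fake_is_disaster(text: str) -> bool:
    return "fire" in text.lower()

def fake_disaster_type(text: str) -> str:
    return "wildfire"

result = classify_article(
    "KRQE News: Dog Head Fire: Information for evacuees",
    fake_is_disaster,
    fake_disaster_type,
)
print(result)
```

With the classes defined in this notebook, the two callables would be `ac.isDisasterClassifier.isDisaster` and `ac.disasterTypeClassifier.disasterType`.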

In [7]:
ac = ArticleClassifier()

All TF 2.0 model weights were used when initializing DistilBertForSequenceClassification.

All the weights of DistilBertForSequenceClassification were initialized from the TF 2.0 model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use DistilBertForSequenceClassification for predictions without further training.


All TF 2.0 model weights were used when initializing DistilBertForSequenceClassification.

All the weights of DistilBertForSequenceClassification were initialized from the TF 2.0 model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use DistilBertForSequenceClassification for predictions without further training.


In [8]:
examples = [
    "NBC: Evacuations Lifted in 1,100-Acre Brush Fire in Santa Clarita Valley",
    "KRQE News: Dog Head Fire: Information for evacuees",
    "France 24: Hurricane Fiona batters Turks and Caicos after devastating Puerto Rico - 21/09/2022",
    "Russia-Ukraine War Explosion Damages Crimea Bridge, Imperiling Russian Supply Route",
    "Arizona court halts enforcement of near-total abortion ban",
    "Wow, Google Really, Really Wants to Be Cooler Than Apple",
    "The Hack4Good coordinator is in charge of facilitating the communication between the H4G Organization Committee to address any organizational issue which might arise."
]

In [15]:
for example in examples:
  isDisaster = ac.isDisaster(example)
  disasterType = ac.disasterType(example)
  print(f"{example} \n isDisaster? {isDisaster}\n disasterType? {disasterType} \n\n")

NBC: Evacuations Lifted in 1,100-Acre Brush Fire in Santa Clarita Valley 
 isDisaster? True
 disasterType? wildfire 


KRQE News: Dog Head Fire: Information for evacuees 
 isDisaster? True
 disasterType? wildfire 


France 24: Hurricane Fiona batters Turks and Caicos after devastating Puerto Rico - 21/09/2022 
 isDisaster? True
 disasterType? hurricane & tornado 


Russia-Ukraine War Explosion Damages Crimea Bridge, Imperiling Russian Supply Route 
 isDisaster? True
 disasterType? industrial accident 


Arizona court halts enforcement of near-total abortion ban 
 isDisaster? False
 disasterType? not a disaster 


Wow, Google Really, Really Wants to Be Cooler Than Apple 
 isDisaster? False
 disasterType? not a disaster 


The Hack4Good coordinator is in charge of facilitating the communication between the H4G Organization Committee to address any organizational issue which might arise. 
 isDisaster? False
 disasterType? not a disaster 




In [10]:
# IsDisplacementClassifier
class IsDisplacementClassifier:
  def __init__(self):
    self.displacementKeywords = ["refugee", "evacuate", "displace", "flee"]
  def isDisplacement(self, text):
    pass
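
`isDisplacement` is left as a stub above. One possible way to fill it in, shown here only as a sketch, is a case-insensitive match on word stems derived from the keyword list. The stems, the added `fled` form, and the name `is_displacement` are assumptions for illustration, not the author's implementation.

```python
import re

# Hypothetical stems derived from the keyword list above. "evacu" and "displac"
# also catch "evacuees", "evacuation" and "displaced"; "fled" is added because
# "flee" does not match its own past tense.
DISPLACEMENT_PATTERN = re.compile(
    r"\b(?:refugee|evacu|displac|flee|fled)\w*", re.IGNORECASE
)

def is_displacement(text: str) -> bool:
    """Return True if any displacement-related stem starts a word in the text."""
    return DISPLACEMENT_PATTERN.search(text) is not None

print(is_displacement("KRQE News: Dog Head Fire: Information for evacuees"))
print(is_displacement("Wow, Google Really, Really Wants to Be Cooler Than Apple"))
```

Prefix matching is crude: it would also fire on unrelated words such as "fleet" or "fledgling", so a stemmer or lemmatizer such as the ones used later in this notebook may be a better basis for matching.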

In [32]:

# import the necessary libraries
import nltk
import string
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

In [40]:
nltk.download('popular')

[nltk_data] Downloading collection 'popular'
[nltk_data]    | 
[nltk_data]    | Downloading package cmudict to /root/nltk_data...
[nltk_data]    |   Package cmudict is already up-to-date!
[nltk_data]    | Downloading package gazetteers to /root/nltk_data...
[nltk_data]    |   Package gazetteers is already up-to-date!
[nltk_data]    | Downloading package genesis to /root/nltk_data...
[nltk_data]    |   Package genesis is already up-to-date!
[nltk_data]    | Downloading package gutenberg to /root/nltk_data...
[nltk_data]    |   Package gutenberg is already up-to-date!
[nltk_data]    | Downloading package inaugural to /root/nltk_data...
[nltk_data]    |   Package inaugural is already up-to-date!
[nltk_data]    | Downloading package movie_reviews to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package movie_reviews is already up-to-date!
[nltk_data]    | Downloading package names to /root/nltk_data...
[nltk_data]    |   Package names is already up-to-date!
[nltk_data]    | Do

True

In [41]:
class TextPreprocessor:
  """Small collection of text-cleaning steps built on NLTK."""
  def __init__(self):
    self.stemmer = PorterStemmer()
    self.lemmatizer = WordNetLemmatizer()
  def text_lowercase(self, text):
    return text.lower()
  def remove_numbers(self, text):
    # Drop every run of digits.
    result = re.sub(r'\d+', '', text)
    return result
  def remove_punctuation(self, text):
    # Delete all ASCII punctuation characters.
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator)
  def remove_whitespace(self, text):
    # Collapse runs of whitespace into single spaces and trim both ends.
    return " ".join(text.split())
  def remove_stopwords(self, text):
    stop_words = set(stopwords.words("english"))
    word_tokens = word_tokenize(text)
    filtered_text = [word for word in word_tokens if word not in stop_words]
    return " ".join(filtered_text)
  def stem_words(self, text):
    word_tokens = word_tokenize(text)
    stems = [self.stemmer.stem(word) for word in word_tokens]
    return " ".join(stems)
  def lemmatize_word(self, text):
    word_tokens = word_tokenize(text)
    # provide context i.e. part-of-speech: every token is lemmatized as a verb
    lemmas = [self.lemmatizer.lemmatize(word, pos='v') for word in word_tokens]
    return " ".join(lemmas)
  def pipeline(self, text):
    # Apply each step in order, printing the intermediate result after every step.
    for f in [
        self.text_lowercase, self.remove_numbers, self.remove_punctuation,
        self.remove_whitespace, self.remove_stopwords,
        self.lemmatize_word
        ]:
      text = f(text)
      print(text)
    return text

In [42]:
textProcessor = TextPreprocessor()
textProcessor.pipeline("This is a sample sentence and we are going to remove the stopwords from this")

this is a sample sentence and we are going to remove the stopwords from this
this is a sample sentence and we are going to remove the stopwords from this
this is a sample sentence and we are going to remove the stopwords from this
this is a sample sentence and we are going to remove the stopwords from this
sample sentence going remove stopwords
sample sentence go remove stopwords


'sample sentence go remove stopwords'

In [30]:
textProcessor.stem_words("data science uses scientific methods algorithms and many types of processes")

'data scienc use scientif method algorithm and mani type of process'