# Earth Analytics Final Project
## 2024 Summer

## Project Overview
For my final project, I am interested in exploring the narrative text fields of the ICS-209-PLUS-WILDFIRES dataset. ICS-209-PLUS-WILDFIRES is a fire-focused subset of the [all-hazards dataset](https://www.nature.com/articles/s41597-023-01955-0) mined from the US National Incident Management System, 1999–2020, by St. Denis et al. (2023). I will process the narrative fields to convert text about societal impacts into a format suitable for natural language processing, then analyze the text to find links between fire hazard characteristics, incident response, and societal impacts and threats incrementally across all phases of active response. The main question driving the analysis is:

How can we connect societal impacts and geophysical metrics by applying topic modeling methods to the narrative fields of incident reports?

<center>
    <img src="graphics/workflowwhite.png" alt="Project Workflow" width="700"/>
</center>

## Installation
The project will run in the public [earth-analytics-python-env](https://github.com/earthlab/earth-analytics-python-env), which contains the dependencies and libraries needed for the project. If other libraries become necessary, they will be listed here as additional requirements.

## Data
For my project, I plan to focus heavily on text-based narrative data (ICS-209-PLUS-WILDFIRES) and to supplement it with geospatial data (such as Monitoring Trends in Burn Severity).

### ICS-209-PLUS-WILDFIRES

- Standardized, on-scene Incident Status Summary
- Part of the National Incident Management System (NIMS)
- Wealth of text-based narrative data
- Science-grade situation reports focusing on large wildfires
- Daily “informational snapshots” of fire response/management
- View into the decision-making process
    - Large fire event development and response

### Monitoring Trends in Burn Severity

Monitoring Trends in Burn Severity (MTBS) is a comprehensive dataset that maps the burn severity and perimeters of large wildfires in the United States across all ownerships. The MTBS vector datasets include burn scar boundaries delineated from satellite imagery and burn severity index data at a map scale of 1:24,000 to 1:50,000. I will primarily use the vector burn area boundaries.


## Analysis

Text preprocessing is a crucial step in Natural Language Processing (NLP) that transforms text into a format that is suitable for further analysis. Here are some common techniques:

1. **Tokenization**: This is the process of breaking down text into individual words (or tokens). This is usually the first step in text preprocessing.

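    ```python
    from nltk.tokenize import word_tokenize
    # Requires NLTK's Punkt data: nltk.download('punkt') ('punkt_tab' on newer releases)
    # `text` is a single narrative string
    tokens = word_tokenize(text)
    ```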

2. **Lowercasing**: This prevents the same word from being counted as several distinct tokens. For example, 'Hello' and 'hello' should be treated as the same word.

    ```python
    text = text.lower()
    ```

3. **Stopwords Removal**: Stopwords are common words that carry little meaning on their own and are usually removed from texts. Examples of stopwords are 'is', 'the', and 'and'.

    ```python
    from nltk.corpus import stopwords
    # Requires the stopword lists: nltk.download('stopwords')
    stop_words = set(stopwords.words('english'))
    # NLTK's stopwords are lowercase, so compare lowercased tokens
    filtered_text = [word for word in tokens if word.lower() not in stop_words]
    ```

4. **Stemming**: This is the process of reducing inflected (or sometimes derived) words to their word stem or root form. For example, 'jumps', 'jumping', 'jumped' are all transformed to 'jump'.

    ```python
    from nltk.stem import PorterStemmer
    stemmer = PorterStemmer()
    stemmed_text = [stemmer.stem(word) for word in filtered_text]
    ```

5. **Lemmatization**: Similar to stemming, but it reduces each word to its dictionary form (lemma) and can take the word's part of speech into account. It is usually more accurate but slower than stemming.

    ```python
    from nltk.stem import WordNetLemmatizer
    # Requires the WordNet corpus: nltk.download('wordnet')
    lemmatizer = WordNetLemmatizer()
    # Without a pos argument, every word is treated as a noun
    lemmatized_text = [lemmatizer.lemmatize(word) for word in filtered_text]
    ```

6. **Removing Punctuation**: Punctuation provides grammatical context that helps human readers, but a vectorizer that only counts words gains nothing from it, so it is usually removed.

    ```python
    import string
    text = text.translate(str.maketrans('', '', string.punctuation))
    ```

7. **Removing HTML tags**: When the text comes from HTML, the tags must be stripped so that only the visible text remains.

    ```python
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(text, "html.parser")
    text = soup.get_text()
    ```
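
The order of these steps matters: stopword lists are lowercase, so lowercasing has to happen before filtering, and HTML tags should be stripped before punctuation is removed. The following is a minimal, dependency-free sketch of one possible ordering, not the project's final pipeline. It makes several assumptions for illustration: the `preprocess` function name, the small `STOP_WORDS` set (NLTK's English list is much longer), a regex in place of BeautifulSoup, a whitespace split in place of `word_tokenize`, and a made-up example sentence rather than a real ICS-209 narrative. Stemming and lemmatization are left out.

```python
import html
import re
import string
from collections import Counter

# Illustrative stopword subset (assumption); NLTK's English list is much longer.
STOP_WORDS = {"a", "an", "and", "are", "is", "of", "the", "to", "in", "on"}

# Map every punctuation character to a space so 'north/east' splits into two tokens.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def preprocess(text):
    """Return a list of cleaned tokens from one narrative string."""
    # Strip tags with a crude regex (stand-in for BeautifulSoup), then decode entities.
    text = re.sub(r"<[^>]+>", " ", text)
    text = html.unescape(text)
    # Lowercase before stopword filtering so 'The' matches 'the'.
    text = text.lower()
    # Replace punctuation with spaces.
    text = text.translate(PUNCT_TO_SPACE)
    # Whitespace split (stand-in for word_tokenize).
    tokens = text.split()
    return [tok for tok in tokens if tok not in STOP_WORDS]


# Made-up sentence for illustration only, not taken from the dataset.
narrative = (
    "<p>The fire is threatening structures north/east of Hwy 12 "
    "&amp; evacuations are in effect.</p>"
)
tokens = preprocess(narrative)
counts = Counter(tokens)  # simple bag-of-words counts for this one narrative
print(tokens)
```

Replacing punctuation with spaces rather than deleting it keeps slash- or hyphen-joined terms from fusing into a single token, which may or may not be desirable for incident narratives.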