# Getting Fallout 4 Dialogue in Pandas
Using the `requests` library alongside `BeautifulSoup`, the webpage for the dialogue of a given Fallout 4 NPC or character may be gathered. To achieve this, CSS selectors were utilized with the `.select()` method found in the `BeautifulSoup` library.

First, the webscraping libraries were imported and an LXML parser read the contents of the website. The character [Cait](https://fallout-archive.fandom.com/wiki/Cait) had been selected for this Jupyter notebook, but any character from [the list of Fallout 4 characters](https://fallout-archive.fandom.com/wiki/Fallout_4_characters) should suffice, provided that the character actually has dialogue in-game and that the source page contains said in-game dialogue.

In [1]:
import requests
from bs4 import BeautifulSoup

character = 'Cait'
url = f'https://fallout-archive.fandom.com/wiki/{character}%27s_dialogue'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'lxml')
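
Character names that contain spaces or punctuation need escaping before they go into the URL. The following is a minimal sketch using only the standard library. `build_dialogue_url` is a hypothetical helper, and the assumption that every dialogue page follows the `<Name>'s_dialogue` pattern (with underscores for spaces) is inferred from the URL above, not confirmed for all characters.

```python
from urllib.parse import quote

BASE = 'https://fallout-archive.fandom.com/wiki/'

def build_dialogue_url(character: str) -> str:
    # Hypothetical helper: assumes the "<Name>'s_dialogue" page pattern.
    page = f"{character.replace(' ', '_')}'s_dialogue"
    # quote() percent-encodes the apostrophe as %27 and leaves '_' alone
    return BASE + quote(page)

print(build_dialogue_url('Cait'))
# https://fallout-archive.fandom.com/wiki/Cait%27s_dialogue
```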

The returned contents had HTML tags in most entries. Thus, it proved convenient to create a function to strip the entries of common tags.

In [2]:
def trim_tags(sample: str) -> str:
    '''
    Given a string, from a BeautifulSoup select query on a CSS selector,
    trim the <th>, </th>, <td>, and </td> tags along with the
    newline character \\n.
    
    Parameters
    ==========
    @sample, str: The string form of one element returned by the query
    '''
    return sample.replace('<th>', '').replace('</th>', '').replace('<td>', '').replace('</td>', '').replace('\n', '')
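
A quick check on hand-written strings (not taken from the page) shows what the chained replacements do and do not remove. The function body is copied from the cell above and only re-wrapped across lines; the sample inputs are invented for illustration.

```python
def trim_tags(sample: str) -> str:
    # Same chained replacements as the notebook's function
    return (sample.replace('<th>', '').replace('</th>', '')
                  .replace('<td>', '').replace('</td>', '')
                  .replace('\n', ''))

# Hand-written samples, not taken from the page
print(trim_tags('<td>Scene\n</td>'))     # Scene
print(trim_tags('<th>CATEGORY\n</th>'))  # CATEGORY

# An opening tag with attributes is not matched by the literal '<td>'
print(trim_tags('<td class="x">Data File</td>'))  # <td class="x">Data File
```

The last case suggests why cells carrying a `class` attribute need separate handling later on.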

Below are the column headers obtained by applying the `trim_tags` function in a list comprehension, with the first two matches sliced off.

In [3]:
columns_ = [trim_tags(str(child)) for child in soup.select('tr th:nth-child(n)')][2:]
columns_

['CATEGORY',
 'TYPE',
 'SUBTYPE',
 'PROMPT',
 'DIALOGUE BEFORE',
 'RESPONSE TEXT',
 'DIALOGUE AFTER',
 'SCRIPT NOTES',
 'SCENE']

Typically, a CSS selector would allow one to simply gather all $n$ children of a given tag. For example, `'td:nth-child(n)'` matches every `td` element, whatever its position among its siblings:

In [4]:
soup.select('table tbody td:nth-child(n)')[15:25]

[<td class="va-infobox-spacing-v" colspan="3"></td>,
 <td class="va-infobox-label" colspan="1" style="" title="">Data File</td>,
 <td class="va-infobox-spacing-h"></td>,
 <td class="va-infobox-content" colspan="1" style="" title="">fallout4.esm</td>,
 <td class="va-infobox-spacing-v"></td>,
 <td>Scene
 </td>,
 <td>SceneDialogue
 </td>,
 <td>Custom
 </td>,
 <td>
 </td>,
 <td>Player Default: It was America's pasttime. A sport that united families on warm summer days. And it wasn't violent. Mostly.
 </td>]

Admittedly, all $n$ children had been returned; however, the result is a single flat list in document order, with infobox cells mixed in and the table cells interleaved row by row. Splitting such a list into columns would require cyclic indexing (for example, with the modulo operator `%`). An easier solution exists.
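
For comparison, the cyclic-indexing approach could be sketched as follows. This is a toy illustration with invented values, and it assumes the flat list holds only cells of the dialogue table (any infobox cells would have to be dropped first).

```python
# Toy flat list in document order: three rows of a three-column table
flat = ['Scene', 'Custom', 'Hello',
        'Scene', 'Custom', 'Goodbye',
        'Misc',  'Shared', 'Hm?']
n_columns = 3

# Cell i belongs to column i % n_columns
columns = [[] for _ in range(n_columns)]
for i, cell in enumerate(flat):
    columns[i % n_columns].append(cell)

print(columns[2])  # ['Hello', 'Goodbye', 'Hm?']
```

The same split can be written with extended slices, `flat[k::n_columns]` for each column index `k`.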

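Consider measuring the total number of `td` elements matched by `nth-child(n)` and dividing that value by the number of elements matched at one fixed position $b$ (here, $b = 3$): the quotient essentially solves for $n$, the number of columns. This approach is valid because each column has the same length, so the choice of $b$ is entirely arbitrary. The quotient is rounded up with `math.ceil` because cells outside the dialogue table, such as the infobox cells above, can keep the division from being exact.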

In [5]:
import math
# nth-child(n) not working as desired... so how many n's do we need?
num_children = math.ceil(len(soup.select('table tbody td:nth-child(n)')) / len(soup.select('table tbody td:nth-child(3)')))
num_children

9
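
The arithmetic can be checked on made-up counts. The numbers below are hypothetical (the notebook does not report the real counts); they only illustrate why rounding up recovers the column count when a few stray cells are present.

```python
import math

rows, cols = 100, 9                 # hypothetical dialogue table
extra_total, extra_third = 12, 2    # hypothetical stray cells (e.g. infobox)

total_cells = rows * cols + extra_total        # all td elements
third_position_cells = rows + extra_third      # td elements at position 3

ratio = total_cells / third_position_cells
print(round(ratio, 3), math.ceil(ratio))  # 8.941 9
```

This only holds while the stray cells are few relative to the table, which is an assumption rather than a guarantee.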

An even easier (perhaps obvious) approach is to note that `columns_` has nine elements: there is one child position for each column. With $n$ known, a list comprehension collects the contents of each child position into its own list by iterating through the integer values $i = 0, 1, \ldots, n - 1$ and selecting `nth-child(i + 1)`, since CSS child positions start at 1.

In [6]:
# It should be nine, could have taken the number of columns from above!
massive_list_ = [soup.select(f'table tbody td:nth-child({idx + 1})') for idx in range(num_children)]

In [7]:
massive_list_[8][:5]

[<td>MoeGreetSceneBaseball02
 </td>,
 <td>MoeGreetSceneBaseball02
 </td>,
 <td>MoeGreetSceneBaseball02
 </td>,
 <td>08MayorIntroScene2a
 </td>,
 <td>08MayorIntroScene2a
 </td>]
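
The per-position selection can be tried on a small inline table. This is a sketch on made-up HTML, parsed with the built-in `html.parser` instead of `lxml` so that no extra parser is needed. It uses `get_text(strip=True)` as an alternative to string replacement, which is a swapped-in technique and not the notebook's method.

```python
from bs4 import BeautifulSoup

# Made-up table mimicking the trailing newlines seen in the real cells
html = """
<table>
  <tr><th>CATEGORY</th><th>TYPE</th></tr>
  <tr><td>Scene
</td><td>SceneDialogue
</td></tr>
  <tr><td>Misc
</td><td>Hello
</td></tr>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')

headers = [th.get_text(strip=True) for th in soup.select('tr th')]
# nth-child(k) picks the k-th cell of every row, i.e. one column
first_col = [td.get_text(strip=True) for td in soup.select('td:nth-child(1)')]
second_col = [td.get_text(strip=True) for td in soup.select('td:nth-child(2)')]

print(headers)     # ['CATEGORY', 'TYPE']
print(first_col)   # ['Scene', 'Misc']
print(second_col)  # ['SceneDialogue', 'Hello']
```

Unlike `trim_tags`, `get_text` also drops any nested tags inside a cell, which may or may not be wanted.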

Fortunately, a function had already been created to conveniently remove the HTML tags. Similarly, a function that marks each non-dialogue entry with a single asterisk `*` is defined below. For the purpose of creating a `pandas` DataFrame of Fallout 4 dialogue, any element that is *not* dialogue (with the exception of those that give context to the dialogue) is of little use when training a model.

In [8]:
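def purge_nondialogue(entry: list) -> list:
    '''
    Given a list of elements, from a BeautifulSoup select query on a CSS
    selector, replace each element with '*' if there is any indication it
    is not dialogue (that is, if its <td> tag carries a class attribute).
    The list is modified in place and also returned.
    
    Parameters
    ==========
    @entry, list: The elements selected for one column
    '''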
    for idx in range(len(entry)):
        if '<td class=' in str(entry[idx]):
            entry[idx] = '*'
    return entry

In [9]:
purged_massive_list = [purge_nondialogue(massive_list_[idx]) for idx in range(len(massive_list_))]
purged_massive_list[0][:15]

['*',
 '*',
 '*',
 '*',
 '*',
 '*',
 '*',
 '*',
 '*',
 '*',
 <td>Scene
 </td>,
 <td>Scene
 </td>,
 <td>Scene
 </td>,
 <td>Scene
 </td>,
 <td>Scene
 </td>]
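
Because `purge_nondialogue` overwrites its input list, `massive_list_` is changed as well. A non-mutating variant could look like the sketch below. `mark_nondialogue` is a hypothetical name, and the sample cells are hand-written stand-ins for the string form of the selected elements.

```python
def mark_nondialogue(cells: list) -> list:
    # Hypothetical non-mutating variant of purge_nondialogue:
    # returns a new list and leaves the input untouched.
    return ['*' if '<td class=' in str(cell) else cell for cell in cells]

# Hand-written stand-ins for the str() form of the selected elements
cells = ['<td class="va-infobox-label">Data File</td>',
         '<td>Scene\n</td>',
         '<td class="va-infobox-spacing-v"></td>']

marked = mark_nondialogue(cells)
print(marked)    # ['*', '<td>Scene\n</td>', '*']
print(cells[0])  # <td class="va-infobox-label">Data File</td>
```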

Reassigning the non-dialogue entries to asterisks will help in standardizing the length of each column, as each column *must* have the same length (as does the source page). After removing non-dialogue entries, all common HTML tags were also removed.

In [10]:
trimmed_purge = [[trim_tags(str(entry)) for entry in purged_massive_list[idx]] for idx in range(len(purged_massive_list))]
trimmed_purge[0][:15]

['*',
 '*',
 '*',
 '*',
 '*',
 '*',
 '*',
 '*',
 '*',
 '*',
 'Scene',
 'Scene',
 'Scene',
 'Scene',
 'Scene']

Keep in mind that the `*` character had been used to identify entries that have been *made* to be empty due to data processing; any entries that are actually empty (i.e., `''`) are to be empty in the finalized `pandas` DataFrame.

Thus, a `while` loop was used to iteratively remove all `*` entries, reducing each offending column to the size of its corresponding source column (that is, the matching column on the website).

In [11]:
# Purge the *s; the empty values are actually empty in the source database! Beware!
for idx in range(len(trimmed_purge)):
    while '*' in trimmed_purge[idx]:
        trimmed_purge[idx].remove('*')

In [12]:
trimmed_purge[0][:5]

['Scene', 'Scene', 'Scene', 'Scene', 'Scene']
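
Since `list.remove` rescans the list from the start on every call, a single filtering pass per column is a possible alternative. The sketch below uses toy data and, like the notebook, assumes that no genuine cell consists of exactly `'*'`.

```python
# Toy column data: '*' marks purged cells, '' is a genuinely empty cell
toy_columns = [['*', '*', 'Scene', '', 'Scene'],
               ['*', 'Custom', '*', 'Custom', '']]

# One pass per column; empty strings are kept because only '*' is dropped
cleaned = [[cell for cell in column if cell != '*'] for column in toy_columns]

print(cleaned)  # [['Scene', '', 'Scene'], ['Custom', 'Custom', '']]
```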

Before the final steps dealing with DataFrame construction, it was ensured that every column had the same length. Any column shorter than the longest one had blank strings appended until it reached the proper size. Missing data should be no new issue for a data scientist and can be addressed when the time arises.

In [13]:
# Get the max length of an entry
lens = []
for entry in trimmed_purge:
    lens.append(len(entry))
max_len = max(lens)

# Adjusts shorter entries to match max length
for entry in trimmed_purge:
    if len(entry) < max_len:
        delta = max_len - len(entry)
        for idx in range(delta):
            entry.append('')
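
The same padding can be written more compactly with `list.extend`. This is a sketch on toy data with invented values, not a replacement the notebook itself uses.

```python
toy_columns = [['a', 'b', 'c'], ['x'], ['p', 'q']]

max_len = max(len(column) for column in toy_columns)
for column in toy_columns:
    # Multiplying a list by zero yields an empty list,
    # so full-length columns are left as they are
    column.extend([''] * (max_len - len(column)))

print(toy_columns)  # [['a', 'b', 'c'], ['x', '', ''], ['p', 'q', '']]
```

Either way, padding at the end keeps rows aligned only if the missing cells really belong at the end of the column, which is worth checking against the source page.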

With the data fully processed, only the construction of the DataFrame remained; however, instantiating the DataFrame directly from `trimmed_purge` with `columns_` would have raised an error. `trimmed_purge` holds one inner list per column, whereas the constructor treats each inner list as a row, so the data are misaligned against the columns and the dimensions do not match.

Instead, an empty DataFrame with the correct columns had been created first.

In [14]:
import pandas as pd

df = pd.DataFrame([], columns = columns_)
df.head()

| | CATEGORY | TYPE | SUBTYPE | PROMPT | DIALOGUE BEFORE | RESPONSE TEXT | DIALOGUE AFTER | SCRIPT NOTES | SCENE |
|---|---|---|---|---|---|---|---|---|---|


An enumeration was then used to iterate through the sections of `trimmed_purge`, correctly mapping each entry with its corresponding column.

In [15]:
for idx, cat in enumerate(df.columns):
    df[cat] = trimmed_purge[idx]

In [16]:
df.head()

| | CATEGORY | TYPE | SUBTYPE | PROMPT | DIALOGUE BEFORE | RESPONSE TEXT | DIALOGUE AFTER | SCRIPT NOTES | SCENE |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Scene | SceneDialogue | Custom | | Player Default: It was America's pasttime. A s... | Sounds stupid. Without a little violence, what... | Moe: ... I like my version better. | | MoeGreetSceneBaseball02 |
| 1 | Scene | SceneDialogue | Custom | | Player Default: The teams would also beat the ... | Now that's my kind of action. | Moe: ... I like my version better. | | MoeGreetSceneBaseball02 |
| 2 | Scene | SceneDialogue | Custom | | Player Default: There were balls, strikes, thr... | No hittin' or beatin'? What's the damn point? | Moe: ... I like my version better. | | MoeGreetSceneBaseball02 |
| 3 | Scene | SceneDialogue | Custom | | Player Default: Always believed in freedom of ... | We use newspapers at the Combat Zone... put it... | Mayor: Oh, I didn't mean to bring you into thi... | | 08MayorIntroScene2a |
| 4 | Scene | SceneDialogue | Custom | | Player Default: Newspapers just like to stir u... | I tried to read a newspaper once but I couldn'... | Mayor: Oh, I didn't mean to bring you into thi... | | 08MayorIntroScene2a |
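
The orientation issue can be seen without `pandas`. In the sketch below (shortened header list, invented scene names), the column-major data is reshaped in two ways that line the values up with their headers.

```python
columns = ['CATEGORY', 'TYPE', 'SCENE']           # shortened header list
column_data = [['Scene', 'Scene'],                # one inner list per column
               ['SceneDialogue', 'SceneDialogue'],
               ['SceneA', 'SceneB']]              # invented values

# Option 1: map each header to its column
as_mapping = dict(zip(columns, column_data))

# Option 2: transpose the columns into rows
as_rows = [list(row) for row in zip(*column_data)]

print(as_mapping['TYPE'])  # ['SceneDialogue', 'SceneDialogue']
print(as_rows[0])          # ['Scene', 'SceneDialogue', 'SceneA']
```

Passing `as_mapping` to `pd.DataFrame`, or `as_rows` together with `columns=columns`, should give the same table as the column-by-column assignment above.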


Finally, the DataFrame was exported to a `.csv` file.

In [17]:
df.to_csv(f'{character.lower()}_dialogue.csv', index = False)
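
Empty cells deserve a second look after export. The sketch below uses only the standard `csv` module and an in-memory buffer in place of a file, with two response lines borrowed from the table above, to show that an empty cell is written as an empty field and read back as `''`.

```python
import csv
import io

header = ['PROMPT', 'RESPONSE TEXT']
rows = [['', "Now that's my kind of action."],   # empty PROMPT cell
        ['', "No hittin' or beatin'? What's the damn point?"]]

buffer = io.StringIO(newline='')   # stands in for the .csv file
writer = csv.writer(buffer)
writer.writerow(header)
writer.writerows(rows)

buffer.seek(0)
read_back = list(csv.reader(buffer))
print(read_back[1])  # ['', "Now that's my kind of action."]
```

When the file is loaded again with `pd.read_csv`, empty fields are read as `NaN` by default; passing `keep_default_na=False` keeps them as empty strings.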

## Closing Remarks
As mentioned at the beginning, the character Cait was selected for this Jupyter notebook, but any character from the list of Fallout 4 characters should suffice, provided that the character actually has dialogue in-game and that the source page contains said in-game dialogue. This notebook illustrates the transformation of a Fallout 4 character's full dialogue table from the associated website into a `pandas` DataFrame.

Ultimately, this entire Jupyter notebook may be reduced to a Python script for use in a Natural Language Processing (NLP) project. A DataFrame of character dialogue provides text that a library such as `nltk` or `spaCy` may process, or train models on, to learn about characters and content related to the Fallout universe.