# **SentimentArcs (Part 1): Text Preprocessing**

```
Jon Chun
12 Jun 2021: Started
04 Mar 2022: Last Update
```

Welcome! 

SentimentArcs is a methodology and software framework for analyzing narrative in text. Virtually all long text contains narrative elements...(TODO: Insert excerpts from Paper Abstract/Intro Sections here)

***

* **SentimentArcs: Cloning the GitHub repository to your gDrive**

If this is the first time you are using SentimentArcs, you will need to copy the software from our GitHub.com repository (github repo). The default recommended gDrive path is './gdrive/MyDrive/research/sentiment_arcs/'.

The first time you run this notebook and connect your Google gDrive, it will allow you to specify the path to your SentimentArcs subdirectory. If it does not exist, this notebook will copy/clone the SentimentArcs github repository code to your gDrive at the path you specify.


***

* **NovelText: A Reference Corpus of 24 Diverse Novels**

SentimentArcs comes with a carefully curated reference corpus of novels that illustrates the diachronic sentiment analysis characteristic of long-form fictional narratives. This corpus of 24 diverse novels also provides a baseline for exploring and comparing new novels with sentiment analysis using SentimentArcs.

***

* **Preparing New Novels: Formatting and adding to subdirectory**

To analyze new novels with SentimentArcs, the body of the text should consist of plain text organized into blocks separated by two newlines, which visually look like a single blank line between blocks. These blocks are usually paragraphs but can also include title headers, separate lines of dialog or quotes. Please reference any of the 24 novels in the NovelText corpus for examples of this expected format.
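
As an illustration of this layout, the following minimal sketch splits a plain-text string into blocks on blank lines. The function name `split_into_blocks` is hypothetical and is not part of SentimentArcs; the notebook's own reader (`textfile2df`) may treat edge cases differently.

```python
import re

def split_into_blocks(raw_text):
    """Split plain text into blocks separated by one or more blank lines."""
    # Normalize Windows/old-Mac line endings before splitting
    normalized = raw_text.replace('\r\n', '\n').replace('\r', '\n')
    # A blank line is two newlines, possibly with stray spaces/tabs in between
    blocks = re.split(r'\n[ \t]*\n+', normalized.strip())
    # Collapse line breaks inside each block and drop empty blocks
    return [' '.join(b.split()) for b in blocks if b.strip()]

sample = "CHAPTER 1\n\nIt was a dark night.\nThe wind howled.\n\n\"Who goes there?\"\n"
for block in split_into_blocks(sample):
    print(block)
```

A file that yields a single huge block here probably lacks the blank lines between paragraphs that the expected format requires.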

Once the new novel is correctly formatted as a plain text file, it should follow this standard file naming convention:

[first letter of first name]+[full lastname]_[abbreviated book title].txt

Examples:

* fdouglass_narrativelifeofaslave.txt
* fscottfitzgerald_thegreatgatsby.txt
* vwoolf_mrsdalloway.txt
* homer-ewilson_odyssey.txt (trans. E.Wilson)
* mproust-mtreharne_3guermantesway.txt (Book 3, trans. M.Treharne)
* staugustine_confessions9end.txt (Up to and incl. Book 9)

Note the optional author suffix (-translator) and optional title suffix (-selected chapters/books)
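
A quick way to catch naming mistakes before running the notebook is to test filenames against a regular expression. The sketch below encodes one reading of the convention above (lowercase letters and digits only, an optional `-suffix` on both the author and title parts); the pattern and the function name `is_valid_novel_filename` are assumptions made for illustration, not rules enforced by SentimentArcs.

```python
import re

# Assumed pattern: author[-translator]_title[-suffix].txt, lowercase letters/digits only
FILENAME_RE = re.compile(r'^[a-z0-9]+(-[a-z0-9]+)?_[a-z0-9]+(-[a-z0-9]+)?\.txt$')

def is_valid_novel_filename(fname):
    """Return True if fname follows the assumed naming convention."""
    return FILENAME_RE.match(fname) is not None

for fname in ['vwoolf_mrsdalloway.txt',
              'homer-ewilson_odyssey.txt',
              'Virginia Woolf - Mrs Dalloway.txt',
              'vwoolf_mrsdalloway']:
    print(f'{fname!r}: {is_valid_novel_filename(fname)}')
```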

***

* **Adding New Novels: Add file to subdirectory and Update this Notebook**

Once you have a cleaned text file named according to the standard rule above, you must move that file to the subdirectory of all input novels and update the global variable in this notebook that defines which novels to analyze.

First, copy your cleaned text file to the subdirectory containing all novels read by this notebook. This subdir is defined by the program variable 'subdir_novels' with the default value './in1_novels/'

Second, update the program variable 'novels_dt'. This is a Dictionary data structure that follows the pattern below:
```
novels_dt = {
  'cdickens_achristmascarol':['A Christmas Carol by Charles Dickens ',1843,1399],
  # ... one entry per novel ...
}
```
The first string (the dictionary key) must match the filename root without the '.txt' suffix (e.g. cdickens_achristmascarol). The Dictionary value after the ':' is a list of three elements:

* A nicely formatted, human-friendly string of the form '(title) by (full first and last name of author)', used to label plots and saved files.

* The (publication year).

* The (sentence count).

The publication year and sentence count are optional, but should be set to the placeholder value 0 if unknown. They are intended for future reference and analytics.

Your future self will thank you if you insert new novels into 'novels_dt' in alphabetical order for faster and more accurate reference.
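
The rules for 'novels_dt' can be checked mechanically before a long run. The sketch below is a hypothetical helper (`check_novels_dt` is not part of SentimentArcs) that reports keys without a matching text file, values that are not three-element lists, and keys that are out of alphabetical order. The second dictionary entry is made up for the example.

```python
def check_novels_dt(novels_dt, txt_filenames):
    """Return a list of human-readable problems found in novels_dt."""
    problems = []
    # Filename roots are the filenames with the '.txt' suffix removed
    file_roots = {f[:-4] for f in txt_filenames if f.endswith('.txt')}
    for key, value in novels_dt.items():
        if key not in file_roots:
            problems.append(f'{key}: no matching {key}.txt file')
        if not (isinstance(value, list) and len(value) == 3):
            problems.append(f'{key}: value must be a list of 3 elements')
    if list(novels_dt) != sorted(novels_dt):
        problems.append('keys are not in alphabetical order')
    return problems

novels_dt = {
    'cdickens_achristmascarol': ['A Christmas Carol by Charles Dickens', 1843, 1399],
    # Illustrative entry; 0 is the placeholder for an unknown sentence count
    'vwoolf_mrsdalloway': ['Mrs. Dalloway by Virginia Woolf', 1925, 0],
}
print(check_novels_dt(novels_dt, ['cdickens_achristmascarol.txt']))
```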

***

* **How to Execute SentimentArcs Notebooks:**

This is a Jupyter Notebook created to run on Google's free Colab service using only a browser and your existing Google email account. We chose Google Colab because it is relatively fast, free, easy to use and makes collaboration as simple as web browsing.

A few reminders about using Jupyter Notebooks in general and SentimentArcs in particular:

* All cells must be run ***in order*** as later code cells often depend upon the output of earlier code cells

* ***Cells that take more time to execute*** (> 1 min) usually begin with *%%time* which outputs the *total execution time* of the last run.  This timing output is deleted and recalculated each time the code cell is executed.

* **[OPTIONAL]** at the top of a cell indicates you *may* change a setting in that cell to customize behavior.

* **[CUSTOMIZE]** at the top of a cell indicates you *must* change a setting in that cell.

* **[RESTART REQUIRED]** at the top of a cell indicates you *may* see a *[RESTART REQUIRED] button* at the end of the output. *If you see this button, you must select [Runtime]->[Restart Runtime] from the top menubar.*

* **[INPUT REQUIRED]** at the top of a cell indicates you will be required to take some action for execution to proceed, usually by clicking a button or entering the response to a prompt.

In code cells, these tags appear as a prefix in the top comment, for example `# [OPTIONAL]:` (you may change a setting) or `# [CUSTOMIZE]:` (you must set or change a setting).

* SentimentArcs divides workflow into a series of chronological Jupyter Notebooks that must be run in order. Here is an overview of the workflow:

***

**SentimentArcs Notebooks Workflow**
1. Notebook #1: Preprocess Text
2. Notebook #2: Compute Sentiment Values (Simple Models/CPUs)
3. Notebook #3: Compute Sentiment Values (Complex Models/GPUs)
4. Notebook #4: Combine all Sentiment Values, perform Time Series analysis, and extract Crux points and surrounding text

If you are unfamiliar with setting up and using Google Colab or Jupyter Notebooks, here are a series of resources to quickly bring you up to speed. If you are using SentimentArcs with the Cambridge University Press Elements textbook, there are also a series of videos by Prof. Elkins and Chun stepping you through these notebooks.

***

**Additional Resources and Tutorials**


**Google Colab and Jupyter Resources:**

* Coming...
* [IPython, Python Data Science Handbook by Jake VanderPlas](https://jakevdp.github.io/PythonDataScienceHandbook/01.00-ipython-beyond-normal-python.html) 

**Cambridge University Press Videos:**

* Coming...




# **[STEP 1] Manual Configuration/Setup**



## (Popups) Connect Google gDrive

In [None]:
# [INPUT REQUIRED]: Authorize access to Google gDrive

# Connect this Notebook to your permanent Google Drive
#   so all generated output is saved to permanent storage there

try:
  from google.colab import drive
  IN_COLAB = True
except ImportError:
  # google.colab can only be imported inside a Colab runtime
  IN_COLAB = False

if IN_COLAB:
  print("Attempting to attach your Google gDrive to this Colab Jupyter Notebook")
  drive.mount('/gdrive', force_remount=True)
else:
  print("Not running in Google Colab: skipping the gDrive mount")

Attempting to attach your Google gDrive to this Colab Jupyter Notebook


## (3 Inputs) Define Directory Tree

In [None]:
# [CUSTOMIZE]: Set Path_to_SentimentArcs below to match the full path to your gDrive
#              subdirectory, which should be the root directory cloned from the
#              SentimentArcs github repo. The '%cd' command (change directory) then moves there.

# NOTE: Make sure this subdirectory already exists and there are 
#       no typos, spaces or illegal characters (e.g. periods) in the full path

# NOTE: To avoid errors, each directory name in the path should begin with a letter
#         and contain only letters, numbers and underscore ('_') characters.

# #@markdown **Instructions**

# #@markdown Set Directory and Corpus names:
# #@markdown <li> Set <b>Path_to_SentimentArcs</b> to the project root in your **GDrive folder**
# #@markdown <li> Set <b>Corpus_Genre</b> = [novels, finance, social_media]
# #@markdown <li> <b>Corpus_Type</b> = [reference_corpus, new_corpus]
# #@markdown <li> <b>Corpus_Number</b> = [1-20] (id number if a new_corpus)

# #@markdown <hr>

# Step #1: Get full path to SentimentArcs subdir on gDrive
# =======
#@markdown **Accept default path on gDrive or Enter new one:**

Path_to_SentimentArcs = "/gdrive/MyDrive/cdh/sentiment_arcs/" #@param ["/gdrive/MyDrive/sentiment_arcs/"] {allow-input: true}


#@markdown Set this to the project root in your <b>GDrive folder</b>
#@markdown <br> (e.g. /<wbr><b>gdrive/MyDrive/research/sentiment_arcs/</b>)

#@markdown <hr>

#@markdown **Which type of texts are you cleaning?** \

Corpus_Genre = "novels" #@param ["novels", "social_media", "finance"]


Corpus_Type = "new" #@param ["new", "reference"]


Corpus_Number = 2 #@param {type:"slider", min:0, max:10, step:1}


#@markdown Put in the corresponding Subdirectory under **./text_raw**:
#@markdown <li> All Texts as clean <b>plaintext *.txt</b> files 
#@markdown <li> A <b>YAML Configuration File</b> describing each Text

#@markdown Please verify the required textfiles and YAML file exist in the correct subdirectories before continuing.

print('Current Working Directory:')
%cd $Path_to_SentimentArcs

print('\n')

if Corpus_Type == 'reference':
  SUBDIR_TEXT_RAW_CORPUS = f'text_raw_{Corpus_Genre}_ref'
else:
  SUBDIR_TEXT_RAW_CORPUS = f'text_raw_{Corpus_Genre}_{Corpus_Type}_corpus{Corpus_Number}/'


PATH_TEXT_RAW_CORPUS = f'./text_raw/{SUBDIR_TEXT_RAW_CORPUS}'


print(f'SUBDIR_TEXT_RAW_CORPUS:\n  [{SUBDIR_TEXT_RAW_CORPUS}]')
print(f'PATH_TEXT_RAW_CORPUS:\n  [{PATH_TEXT_RAW_CORPUS}]')
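
To preview which subdirectory a given combination of settings maps to, the naming rule from the cell above can be wrapped in a small function. This is only a sketch: `corpus_subdir` is a hypothetical helper name, and it mirrors the cell as written, including the trailing slash that appears only for non-reference corpora.

```python
def corpus_subdir(corpus_genre, corpus_type, corpus_number):
    """Mirror the naming rule used for SUBDIR_TEXT_RAW_CORPUS in the cell above."""
    if corpus_type == 'reference':
        # Reference corpora are not numbered
        return f'text_raw_{corpus_genre}_ref'
    return f'text_raw_{corpus_genre}_{corpus_type}_corpus{corpus_number}/'

for settings in [('novels', 'reference', 0), ('novels', 'new', 2), ('finance', 'new', 1)]:
    print(f'{settings} -> ./text_raw/{corpus_subdir(*settings)}')
```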

# **[STEP 2] Automatic Configuration/Setup**

In [None]:
# Add PATH for ./utils subdirectory

import sys
import os

!python --version

print('\n')

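# os.path.join avoids a doubled '/' when Path_to_SentimentArcs ends with a slash
PATH_UTILS = os.path.join(Path_to_SentimentArcs, 'utils')
print(f'PATH_UTILS: {PATH_UTILS}')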

sys.path.append(PATH_UTILS)

print('Contents of Subdirectory [./sentiment_arcs/utils/]\n')
!ls $PATH_UTILS

# More Specific than PATH for searching libraries
# !echo $PYTHONPATH
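
The effect of `sys.path.append` can be seen in isolation with a throwaway directory. In this sketch the temporary directory stands in for `./utils`, and the module name `demo_utils_module` is invented for the example.

```python
import importlib
import os
import sys
import tempfile

# Create a throwaway directory containing a tiny module (a stand-in for ./utils)
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, 'demo_utils_module.py'), 'w') as fp:
    fp.write('GREETING = "hello from demo_utils_module"\n')

# Appending the directory to sys.path makes its modules importable by name
if demo_dir not in sys.path:
    sys.path.append(demo_dir)
importlib.invalidate_caches()  # make sure the import system notices the new file

demo_module = importlib.import_module('demo_utils_module')
print(demo_module.GREETING)
```

Because the directory is appended rather than inserted at the front, a module with the same name found earlier on `sys.path` would take precedence.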

In [None]:
# Review Global Variables and set the first few

import global_vars

global_vars.SUBDIR_SENTIMENTARCS = Path_to_SentimentArcs
global_vars.Corpus_Genre = Corpus_Genre
global_vars.Corpus_Type = Corpus_Type
global_vars.Corpus_Number = Corpus_Number

global_vars.SUBDIR_TEXT_RAW_CORPUS = SUBDIR_TEXT_RAW_CORPUS
global_vars.PATH_TEXT_RAW_CORPUS = PATH_TEXT_RAW_CORPUS

dir(global_vars)

## Custom Libraries & Define Globals

In [None]:
# Import SentimentArcs Utilities to define Directory Structure
#   based on the selected Corpus Genre, Type and Number

!pwd 
print('\n')

# from utils import sa_config # .sentiment_arcs_utils
from utils import sa_config

print('Objects in sa_config()')
print(dir(sa_config))
print('\n')

# Directory Structure for the Selected Corpus Genre, Type and Number
sa_config.get_subdirs(Corpus_Genre, Corpus_Type, Corpus_Number, 'none')


In [None]:
# Call SentimentArcs Utility to define Global Variables

sa_config.set_globals()

# Verify sample global var set
print(f'MIN_PARAG_LEN: {global_vars.MIN_PARAG_LEN}')
print(f'STOPWORDS_ADD_EN: {global_vars.STOPWORDS_ADD_EN}')
print(f'TEST_WORDS_LS: {global_vars.TEST_WORDS_LS}')
print(f'SLANG_DT: {global_vars.SLANG_DT}')

## Configure Jupyter Notebook

In [None]:
# Configure Jupyter

# Ignore warnings
import warnings
warnings.filterwarnings('ignore')

# Enable multiple outputs from one code cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

from IPython.display import display
from IPython.display import Image
from ipywidgets import widgets, interactive

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

## Read YAML Configuration for Corpus and Models 

In [None]:
# from utils import sa_config # .sentiment_arcs_utils

import yaml

from utils import read_yaml

print('Objects in read_yaml()')
print(dir(read_yaml))
print('\n')

# Directory Structure for the Selected Corpus Genre, Type and Number
read_yaml.read_corpus_yaml(Corpus_Genre, Corpus_Type, Corpus_Number)

print('SentimentArcs Model Ensemble ------------------------------\n')
model_titles_ls = global_vars.models_titles_dt.keys()
print('\n'.join(model_titles_ls))


print('\n\nCorpus Texts ------------------------------\n')
corpus_titles_ls = global_vars.corpus_titles_dt.keys()
print('\n'.join(corpus_titles_ls))


print(f'\n\nThere are {len(model_titles_ls)} Models in the SentimentArcs Ensemble above.\n')
print(f'\nThere are {len(corpus_titles_ls)} Texts in the Corpus above.\n')
print('\n')


## Install Libraries

In [None]:
# Library to Read R datafiles from within Python programs

!pip install pyreadr

In [None]:
# Powerful Industry-Grade NLP Library

!pip install -U spacy

In [None]:
# NLP Library to Simplify Cleaning Text

!pip install texthero

In [None]:
# Advanced Sentence Boundary Detection Python Library
#   for splitting raw text into grammatical sentences
#   (can be difficult due to common motifs like Mr., ..., ?!?, etc)

!pip install pysbd
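
To see why a dedicated sentence boundary library helps, consider a deliberately naive splitter. This sketch uses only Python's `re` module and is not how PySBD works; it simply splits after '.', '!' or '?' followed by whitespace, so it breaks on abbreviations such as "Mr.".

```python
import re

def naive_sentence_split(text):
    """Split after '.', '!' or '?' followed by whitespace (deliberately simplistic)."""
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

text = "Mr. Darcy arrived at noon. Was he late?!? Nobody knew..."
for sent in naive_sentence_split(text):
    print(repr(sent))
```

The first "sentence" returned is just `'Mr.'`, which is the kind of error a rule-based segmenter is designed to avoid.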

In [None]:
# Python Library to expand contractions to aid in Sentiment Analysis
#   (e.g. aren't -> are not, can't -> cannot)

!pip install contractions

In [None]:
# Library for dealing with Emoticons (punctuation) and Emojis (icons)

!pip install emot

## Load Libraries

In [None]:
# Core Python Libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

import re
import string
from datetime import datetime
import os
import sys
import glob
import json
from pathlib import Path
from copy import deepcopy

In [None]:
# More advanced Sentence Tokenizer Object from PySBD
from pysbd.utils import PySBDFactory

In [None]:
# Simpler Sentence Tokenizer Object from NLTK
import nltk 
from nltk.tokenize import sent_tokenize

# Download required NLTK tokenizer data
nltk.download('punkt')

In [None]:
# Instantiate and Import Text Cleaning Objects into Global Variable space
import texthero as hero
from texthero import preprocessing

In [None]:
# Expand contractions (e.g. can't -> cannot)
import contractions

# Translate emoticons :0 and emoji icons to text
import emot 
emot_obj = emot.core.emot() 

from emot.emo_unicode import UNICODE_EMOJI, EMOTICONS_EMO

# Test
text = "I love python ☮ 🙂 ❤ :-) :-( :-)))" 
emot_obj.emoticons(text)

In [None]:
# Import spaCy, language model and setup minimal pipeline

import spacy

nlp = spacy.load('en_core_web_sm', disable=['tagger', 'parser', 'ner'])
# nlp.max_length = 1027203
nlp.max_length = 2054406
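# spaCy v3 (installed by 'pip install -U spacy') adds components by name;
#   spaCy v2 used nlp.add_pipe(nlp.create_pipe('sentencizer'))
nlp.add_pipe('sentencizer') # https://stackoverflow.com/questions/51372724/how-to-speed-up-spacy-lemmatization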

# Test some edge cases, try to find examples that break spaCy
doc= nlp(u"Apples and oranges are similar. Boots and hippos aren't.")
print('\n')
print("Token Attributes: \n", "token.text, token.pos_, token.tag_, token.dep_, token.lemma_")
for token in doc:
    # Print the text and the predicted part-of-speech tag
    print("{:<12}{:<12}{:<12}{:<12}{:<12}".format(token.text, token.pos_, token.tag_, token.dep_, token.lemma_))

print('\nAnother Test:\n')
doc = nlp(u"Apples and oranges are similar. Boots and hippos aren't.")

for token in doc:
    print("{:<12}{:<30}{:<12}".format(token.text, token.lemma, token.lemma_))

## Define/Customize Stopwords

In [None]:
# Define Globals
"""
# Main data structure: Dictionary (key=text_name) of DataFrames (cols: text_raw, text_clean)
corpus_texts_dt = {}

# Verify in SentimentArcs Root Directory
os.chdir('/gdrive/MyDrive/cdh/sentiment_arcs/')

%run -i './utils/get_globals.py'

SLANG_DT.keys()
""";

In [None]:
global_vars.SLANG_DT.keys()

In [None]:
dir(global_vars)

In [None]:
%whos

In [None]:
# Verify English Stopword List

stopwords_spacy_en_ls = nlp.Defaults.stop_words

','.join([x for x in stopwords_spacy_en_ls])

stopwords_en_ls = stopwords_spacy_en_ls

print(f'\n\nThere are {len(stopwords_spacy_en_ls)} default English Stopwords from spaCy\n')

In [None]:
# Customize Default SpaCy English Stopword List

print(f'\n\nThere are {len(stopwords_spacy_en_ls)} default English Stopwords from spaCy\n')

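# [CUSTOMIZE] Stopwords to ADD or DELETE from default spaCy English stopword list
# NOTE: ADD is applied after DEL below, so a word listed in both sets stays in the final stopword list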
LOCAL_STOPWORDS_DEL_EN = set(global_vars.STOPWORDS_DEL_EN).union(set(['a','an','the','but','yet']))
print(f'    Deleting these stopwords: {LOCAL_STOPWORDS_DEL_EN}')
LOCAL_STOPWORDS_ADD_EN = set(global_vars.STOPWORDS_ADD_EN).union(set(['a','an','the','but','yet']))
print(f'    Adding these stopwords: {LOCAL_STOPWORDS_ADD_EN}\n')

stopwords_en_ls = list(set(stopwords_spacy_en_ls).difference(set(LOCAL_STOPWORDS_DEL_EN)).union(set(LOCAL_STOPWORDS_ADD_EN)))
print(f'Final Count: {len(stopwords_en_ls)} Stopwords')
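
The order of the set operations matters when a word appears in both the delete and add sets. The toy sets below are made up for illustration (they are not spaCy's defaults) and walk through the same difference-then-union logic.

```python
# Toy stand-in for spaCy's default stopword set
default_stopwords = {'a', 'an', 'the', 'but', 'not', 'very'}
stopwords_del = {'not', 'but'}       # words to remove from the stopword list
stopwords_add = {'chapter', 'but'}   # extra words to treat as stopwords

# Same order as the cell above: delete first, then add
final_stopwords = default_stopwords.difference(stopwords_del).union(stopwords_add)
print(sorted(final_stopwords))
```

Here 'not' is removed from the stopword list, while 'but' survives because it is also in the add set, which is applied last.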

## Setup Matplotlib Style

In [None]:
# Configure Matplotlib

# View available styles
# plt.style.available

# Ensure we are in the SentimentArcs Root Directory chosen in [STEP 1]
os.chdir(Path_to_SentimentArcs)

%run -i './utils/config_matplotlib.py'

config_matplotlib()

print('Matplotlib Configuration ------------------------------')
print('\n  (Uncomment to view)')
# plt.rcParams.keys()
print('\n  Edit ./utils/config_matplotlib.py to change')

## Setup Seaborn Style

In [None]:
# Configure Seaborn

# Ensure we are in the SentimentArcs Root Directory chosen in [STEP 1]
os.chdir(Path_to_SentimentArcs)

%run -i './utils/config_seaborn.py'

config_seaborn()

print('Seaborn Configuration ------------------------------\n')
# print('\n  Update ./utils/config_seaborn.py to display seaborn settings')

## **Utility Functions**

### Generate Convenient Data Lists

In [None]:
# Derive List of Texts in Corpus a)keys and b)full author and titles

print('Dictionary: corpus_titles_dt')
global_vars.corpus_titles_dt
print('\n')

corpus_texts_ls = list(global_vars.corpus_titles_dt.keys())
print(f'\nCorpus Texts:')
for akey in corpus_texts_ls:
  print(f'  {akey}')
print('\n')

print(f'\nNatural Corpus Titles:')
corpus_titles_ls = [x[0] for x in list(global_vars.corpus_titles_dt.values())]
for akey in corpus_titles_ls:
  print(f'  {akey}')


In [None]:
# Get Model Families of Ensemble

from utils.get_model_families import get_ensemble_model_famalies

global_vars.models_ensemble_dt = get_ensemble_model_famalies(global_vars.models_titles_dt)

print('\nTest: Lexicon Family of Models:')
global_vars.models_ensemble_dt['lexicon']

### Text Cleaning 

In [None]:
# [VERIFY]: Texthero preprocessing pipeline

hero.preprocessing.get_default_pipeline()



# Create Default and Custom Stemming TextHero pipeline

# Create a custom cleaning pipeline
def_pipeline = [preprocessing.fillna
                , preprocessing.lowercase
                , preprocessing.remove_digits
                , preprocessing.remove_punctuation
                , preprocessing.remove_diacritics
                # , preprocessing.remove_stopwords
                , preprocessing.remove_whitespace]

# Create a second custom cleaning pipeline that also removes stopwords and stems
stem_pipeline = [preprocessing.fillna
                , preprocessing.lowercase
                , preprocessing.remove_digits
                , preprocessing.remove_punctuation
                , preprocessing.remove_diacritics
                , preprocessing.remove_stopwords
                , preprocessing.remove_whitespace
                , preprocessing.stem]
                   
# Test: pass the custom_pipeline to the pipeline argument
# df['clean_title'] = hero.clean(df['title'], pipeline = custom_pipeline)
# df.head()
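
The idea behind such a pipeline is an ordered list of functions applied one after another. The sketch below shows that idea with plain-Python stand-ins; these functions are written for illustration and are not TextHero's implementations, so their output may differ from `hero.clean` in detail.

```python
import re
import string

def lowercase(text):
    return text.lower()

def remove_digits(text):
    return re.sub(r'\d+', ' ', text)

def remove_punctuation(text):
    # Replace every ASCII punctuation character with a space
    return text.translate(str.maketrans(string.punctuation, ' ' * len(string.punctuation)))

def remove_whitespace(text):
    # Collapse runs of whitespace and trim both ends
    return ' '.join(text.split())

toy_pipeline = [lowercase, remove_digits, remove_punctuation, remove_whitespace]

def apply_pipeline(text, pipeline):
    """Apply each cleaning step in order, feeding each result to the next step."""
    for step in pipeline:
        text = step(text)
    return text

print(apply_pipeline("The RAin in SPain: 42 days,  non-stop!", toy_pipeline))
```

Order matters: running `remove_whitespace` last cleans up the extra spaces left behind by the digit and punctuation steps.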

In [None]:
# Test Text Cleaning Functions
# NOTE: These functions rely on big imports made in this main notebook (e.g. NLTK, SpaCy, TextHero)
#       therefore we execute/define them inline with %run rather than as modules with better separation 

# Ensure we are in the SentimentArcs Root Directory chosen in [STEP 1]
os.chdir(Path_to_SentimentArcs)

%run -i './utils/text_cleaners.py'

test_suite_ls = ['text2lemmas',
                 'text_str2sents',
                 'textfile2df',
                 'emojis2text',
                 'all_emos2text',
                 'expand_slang',
                 'clean_text',
                 'lemma_pipe'
                 ]

# test_suite_ls = []

# Test: text2lemmas()
if 'text2lemmas' in test_suite_ls:
  text2lemmas('I am going to start studying more often and working harder.', lowercase=True, remove_stopwords=False)
  print('\n')

# Test: text_str2sents()
if 'text_str2sents' in test_suite_ls:
  text_str2sents('Hello. You are a great dude! WTF?\n\n You are a goat. What is a goat?!? A big lazy GOAT... No way-', pysbd_only=False) # !?! Dr. and Mrs. Elipses...', pysbd_only=True)
  print('\n')

# Test: textfile2df()
if 'textfile2df' in test_suite_ls:
  # ???
  print('\n')

# Test: emojis2text()
if 'emojis2text' in test_suite_ls:
  test_str = "Hilarious 😂. The feeling of making a sale 😎, The feeling of actually ;) fulfilling orders 😒"
  test_str = emojis2text(test_str)
  print(f'test_str: [{test_str}]')
  print('\n')

# Test: all_emos2text()
if 'all_emos2text' in test_suite_ls:
  test_str = "Hilarious 😂. The feeling :o of making a sale 😎, The feeling :( of actually ;) fulfilling orders 😒"
  all_emos2text(test_str)
  print('\n')

# Test: expand_slang():
if 'expand_slang' in test_suite_ls:
  expand_slang('idk LOL you suck!')
  print('\n')

# Test: clean_text()
if 'clean_text' in test_suite_ls:
  test_df = pd.DataFrame({'text_dirty':['The RAin in SPain','WTF?!?! Do you KnoW...']})
  clean_text(test_df, 'text_dirty', text_type='formal')
  print('\n')

# Test: lemma_pipe()
if 'lemma_pipe' in test_suite_ls:
  print('\nTest #1:\n')
  test_ls = ['I am running late for a meetings with all the many people.',
            'What time is it when you fall down running away from a growing problem?',
            "You've got to be kidding me - you're joking right?"]
  lemma_pipe(test_ls)
  print('\nTest #2:\n')
  texts = pd.Series(["I won't go and you can't make me.", "Billy is running really quickly and with great haste.", "Eating freshly caught seafood."])
  for doc in nlp.pipe(texts):
    print([tok.lemma_ for tok in doc])
  print('\nTest #3:\n')
  lemma_pipe(texts)


### File Functions

In [None]:
# Ensure we are in the SentimentArcs Root Directory chosen in [STEP 1]

os.chdir(Path_to_SentimentArcs)

# Method #1: Preferred isolation using module directories
# TODO: list individual methods() used to min polluting global namespace
from utils.file_utils import get_fullpath, textfile2df, write_dict_dfs, read_dict_dfs

# Method #2: Run code in-line by executing file
# %run -i './utils/file_utils.py'

# TODO: Not used? Delete?
# get_fullpath(text_title_str, ftype='data_clean', fig_no='', first_note = '',last_note='', plot_ext='png', no_date=False)

# **[STEP 3] Read in Corpus and Clean**

## Create List of Raw Textfiles

In [None]:
# Current key Directories

print('Current Subdirectory:')
!pwd
print('\n')

print(f'SentimentArcs root Subdirectory: [{global_vars.SUBDIR_SENTIMENTARCS}]\n')

path_text_raw = './' + '/'.join(global_vars.SUBDIR_TEXT_RAW.split('/')[1:-1])
print(f'path_text_raw: [{path_text_raw}]\n')
# SUBDIR_TEXT_RAW = path_text_raw + '/'
print(f'Full Path to Corpus text_raw: [{global_vars.SUBDIR_SENTIMENTARCS}/text_raw/{global_vars.SUBDIR_TEXT_RAW_CORPUS}]')

In [None]:
# Get a list of all the Textfile filename roots in Subdir text_raw

# Ensure we are in the SentimentArcs Root Directory chosen in [STEP 1]
os.chdir(Path_to_SentimentArcs)

corpus_titles_ls = list(global_vars.corpus_titles_dt.keys())

print(f'Corpus_Genre: {global_vars.Corpus_Genre}')
print(f'Corpus_Type: {global_vars.Corpus_Type}\n')

# Build path to Corpus Subdir
# TODO: Temp fix until print(f'Original: {SUBDIR_TEXT_RAW}\n')
# path_text_raw = './' + '/'.join(SUBDIR_TEXT_RAW.split('/')[1:-1]) + '/' + SUBDIR_TEXT_RAW_CORPUS
path_text_raw = './text_raw/' + global_vars.SUBDIR_TEXT_RAW_CORPUS
# print(f'Corpus Subdir: {path_text_raw}')

# Create a List (texts_raw_root_ls) of the filename roots of all raw text files
# NOTE: glob returns an empty list (it does not raise) when nothing matches,
#       so check for that case explicitly
texts_raw_root_ls = sorted(Path(x).stem for x in glob.glob(f'{path_text_raw}/*.txt'))
if not texts_raw_root_ls:
  raise RuntimeError(f'No *.txt files found in {path_text_raw}')

# print(f'\ntexts_raw_root_ls:\n  {texts_raw_root_ls}\n')

print('Texts found in Corpus Subdirectory:')
print('-----------------------------------')

text_ct = 0
for afile_root in texts_raw_root_ls:
  # file_root = file_fullpath.split('/')[-1].split('.')[0]
  text_ct += 1
  print(f'{afile_root}: ') # {corpus_titles_dt[afile_root]}')

print(f'\nThe Corpus has [{text_ct}] Texts found in the Subdirectory:')
print(f'---------------------------------------------------\n  {global_vars.SUBDIR_TEXT_RAW}')
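
The file-listing step can be tried outside the notebook with a temporary directory. This sketch uses `pathlib` and made-up filenames; `list_text_roots` is a hypothetical helper, not a SentimentArcs function.

```python
import tempfile
from pathlib import Path

def list_text_roots(corpus_dir):
    """Return the sorted filename roots of all *.txt files in corpus_dir."""
    roots = sorted(p.stem for p in Path(corpus_dir).glob('*.txt'))
    if not roots:
        raise RuntimeError(f'No *.txt files found in {corpus_dir}')
    return roots

with tempfile.TemporaryDirectory() as tmp_dir:
    # Two text files plus one file that should be ignored
    for name in ['vwoolf_mrsdalloway.txt', 'cdickens_achristmascarol.txt', 'notes.md']:
        (Path(tmp_dir) / name).write_text('placeholder text\n')
    found_roots = list_text_roots(tmp_dir)
    print(found_roots)
```

Comparing the returned roots against the keys of the corpus titles dictionary is a simple way to spot a text file that was never registered, or a registered text whose file is missing.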

## Read and Segment into Sentences

In [None]:
from utils.text_cleaners import textfile2df

In [None]:
%%time
# %xmode Verbose
# %debug

%run -i './utils/text_cleaners.py'


# Read all Corpus Textfiles and Segment each into Sentences

# NOTE: 3m30s Entire Corpus of 25 
#       7m30s Ref Corpus 32 Novels
#       7m24s Ref Corpus 32 Novels
#       1m00s New Corpus1 2 Novels
#      ~1m30s New Corpus2 3 Novels

# Read all text files into a Dictionary of DataFrames
#   Dict.keys() are text names
#   Dict.values() are DataFrames with one row per Sentence

# Continue here ONLY if last cell completed WITHOUT ERROR

# Take the text names from the titles dictionary read from the YAML file
#   (corpus_texts_dt is the dictionary being filled by this loop)
global_vars.corpus_titles_ls = list(global_vars.corpus_titles_dt.keys())
for i, file_root in enumerate(global_vars.corpus_titles_ls):
  file_fullpath = f'{global_vars.SUBDIR_TEXT_RAW}{file_root}.txt'
  # print(f'Processing Novel #{i}: {file_fullpath}')
  # print(f"  Size: {os.path.getsize(file_fullpath)}")

  global_vars.corpus_texts_dt[file_root] = textfile2df(file_fullpath)

# Verify First Text is Segmented into text_raw Sentences
print('\n\n')
global_vars.corpus_texts_dt[global_vars.corpus_titles_ls[0]].head()


In [None]:
list(global_vars.corpus_texts_dt.keys())


## Clean Sentences

In [None]:
%%time

# NOTE: (no stem) 4m09s (24 Novels)
#       (w/ stem) 4m24s (24 Novels)

i = 0

for key_novel, atext_df in global_vars.corpus_texts_dt.items():

  print(f'Processing Novel #{i}: {key_novel}...')

  atext_df['text_clean'] = clean_text(atext_df, 'text_raw', text_type='formal')
  atext_df['text_clean'] = lemma_pipe(atext_df['text_clean'])
  atext_df['text_clean'] = atext_df['text_clean'].astype('string')

  # TODO: Fill in all blank 'text_clean' rows with filler semaphore
  atext_df.text_clean = atext_df.text_clean.fillna('empty_placeholder')

  atext_df.head(2)

  print(f'  shape: {atext_df.shape}')

  i += 1

In [None]:
# Verify the first Text in Corpus is cleaned

global_vars.corpus_texts_dt[global_vars.corpus_titles_ls[0]].head(20)
global_vars.corpus_texts_dt[global_vars.corpus_titles_ls[0]].info()

## Save Cleaned Corpus

In [None]:
# Ensure we are in the SentimentArcs Root Directory chosen in [STEP 1]
os.chdir(Path_to_SentimentArcs)

print('Currently in SentimentArcs root directory:')
!pwd

# Verify the Subdir to save Cleaned Texts into and the Texts to be saved

print(f'\nSaving Clean Texts to Subdir: {global_vars.SUBDIR_TEXT_CLEAN}')
print(f'\nSaving these Texts:\n  {global_vars.corpus_texts_dt.keys()}')

In [None]:
# Save the cleaned Textfiles

for i, (key_novel, anovel_df) in enumerate(global_vars.corpus_texts_dt.items()):
  anovel_fname = f'{key_novel}.csv'

  anovel_fullpath = f'{global_vars.SUBDIR_TEXT_CLEAN}{anovel_fname}'
  print(f'Saving Novel #{i} to {anovel_fullpath}')
  anovel_df.to_csv(anovel_fullpath)

# **[END OF NOTEBOOK]**