# **SentimentArcs (Part 1): Text Preprocessing**

```
Jon Chun
12 Jun 2021: Started
04 Mar 2022: Last Update
```

Welcome! 

SentimentArcs is a methodlogy and software framework for analyzing narrative in text. Virtually all long text contains narrative elements...(TODO: Insert excerpts from Paper Abstract/Intro Sections here)

***

* **SentimentArcs: Cloning the Github repository to your gDrive**

If this is the first time using SentimentArcs, you will need to copy the software from our Github.com repository (github repo). The default recommended gDrive path is ./gdrive/MyDrive/research/sentiment_arcs/'. 

The first time you run this notebook and connect your Google gDrive, it will allow to to specify the path to your SentimentArcs subdirectory. If it does not exists, this notebook will copy/clone the SentimentArcs github repository code to your gDrive at the path you specify.


***

* **NovelText: A Reference Corpus of 24 Diverse Novel**

Sentiment Arcs comes with a carefully curated reference corpus of Novels to illustrate the unique diachronic sentiment analysis characteristic of long form fictional narrativeas. This corpus of 24 diverse novels also provides a baseline for exploring and comparing new novels with sentiment analysis using SentimentArcs.

***

* **Preparing New Novels: Formatting and adding to subdirectory**

To analyze new novels with SentimentArcs, the body of the text should consist of plain text organized in to blocks separated by two newlines which visually look like a single blank line between blocks. These blocks are usually paragraphs but can also include title headers, separate lines of dialog or quotes. Please reference any of the 24 novels in the NovelText corpus for examples of this expected format.

Once the new novel is correctly formatted as a plain text file, it should follow this standard file naming convention:

[first letter of first name]+[full lastname]_[abbreviated book title].txt

Examples:

* fdouglass_narrativelifeofaslave
* fscottfitzgerald_thegreatgatsby.txt
* vwoolf_mrsdalloway.txt
* homer-ewilson_odyssey.txt (trans. E.Wilson)
* mproust-mtreharne_3guermantesway.txt (Book 3, trans. M.Treharne)
* staugustine_confessions9end.txt (Upto and incl Book 9)

Note the optional author suffix (-translator) and optional title suffix (-selected chapters/books)

***

* **Adding New Novels: Add file to subdirectory and Update this Notebook**

Once you have a cleaned and text file named according the standard rule above, you must move that file to the subdirectory of all input novels and update the global variable in this notebook that defines which novels to analyze.

First, copy your cleaned text file to the subdirectory containing all novels read by this notebook. This subdir is defined by the program variable 'subdir_novels' with the default value './in1_novels/'

Second, update the program variable 'novels_dt'. This is a Dictionary data structure that following the pattern below:
```
novels_dt = {
  'cdickens_achristmascarol':['A Christmas Carol by Charles Dickens ',1843,1399],
```
Where the first string (the dictionary key) must match the filename root without the '.txt' suffix (e.g. cdickens_achristmascarol). The Dictionary value after the ':' is a list of three elements:

* A nicely formatted string of the form '(title) by (full first and last name of author)' that should be a human friendly string used to label plots and saved files.

* The (publication year) and the (sentence count). Both are optional, but should have placeholder string '0' if unknown. These are intended for future reference and analytics.

* Your future self will thank you if you insert new novels into the 'novels_dt' in alphabetic order for faster and more accurate reference.

***

* **How to Execute SentimentArcs Notebooks:**

This is a Jupyter Notebook created to run on Google's free Colab service using only a browers and your exiting Google email account. We chose Google Colab because it is relatively, fast, free, easy to use and makes collaboration as simple as web browsing.

A few reminders about using Jupyter Notebooks general and SentimentArcs in particular:

* All cells must be run ***in order*** as later code cells often depend upon the output of earlier code cells

* ***Cells that take more time to execute*** (> 1 min) usually begin with *%%time* which outputs the *total execution time* of the last run.  This timing output is deleted and recalculated each time the code cell is executed.

* **[OPTIONAL]** at the top of a cell indicates you *may* change a setting in that cell to customize behavior.

* **[CUSTOMIZE]** at the top of a cell indicates you *must* change a setting in that cell.

* **[RESTART REQUIRED]** at the top of a cell indicates you *may* see a *[RESTART REQUIRED] button* at the end of the output. *If you see this button, you must select [Runtime]->[Restart Runtime] from the top menubar.

* **[INPUT REQUIRED]** at the top of a cell indicates you will be required to take some action for execution to proceed, usually by clicking a button or entering the response to a prompt.

All cells with a top comment prefixed with # [OPTIONAL]: indicates that you can change a setting to customize behavior, the prefix [CUSTOMIZE] indicates you MUST set/change a setting

* SentimentArcs divides workflow into a series of chronological Jupyter Notebooks that must be run in order. Here is an overview of the workflow:

***

**SentimentArcs Notebooks Workflow**
1. Notebook #1: Preprocess Text
2. Notebook #2: Compute Sentiment Values (Simple Models/CPUs)
3. Notebook #3: Compute Sentiment Values (Complex Models/GPUs)
4. Notebook #4: Combine all Sentiment Values, perform Time Series analysis, and extract Crux points and surrounding text

If you are unfamilar with setting up and using Google Colab or Jupyter Notebooks, here are a series of resources to quickly bring you up to speed. If you are using SentimentArcs with the Cambridge University Press Elements textbook, there are also a series of videos by Prof Elkins and Chun stepping you through these notebooks.

***

**Additional Resources and Tutorials**


**Google Colab and Jupyter Resources:**

* Coming...
* [IPython, Python Data Science Handbook by Jake VanderPlas](https://jakevdp.github.io/PythonDataScienceHandbook/01.00-ipython-beyond-normal-python.html) 

**Cambridge University Press Videos:**

* Coming...




# **[STEP 1] Manual Configuration/Setup**



## (Popups) Connect Google gDrive

In [None]:
# [INPUT REQUIRED]: Authorize access to Google gDrive

# Connect this Notebook to your permanent Google Drive
#   so all generated output is saved to permanent storage there

try:
  from google.colab import drive
  IN_COLAB=True
except:
  IN_COLAB=False

if IN_COLAB:
  print("Attempting to attach your Google gDrive to this Colab Jupyter Notebook")
  drive.mount('/gdrive', force_remount=True)
else:
  print("Your Google gDrive is attached to this Colab Jupyter Notebook")

Attempting to attach your Google gDrive to this Colab Jupyter Notebook


## (3 Inputs) Define Directory Tree

In [2]:
# [CUSTOMIZE]: Change the text after the Unix '%cd ' command below (change directory)
#              to math the full path to your gDrive subdirectory which should be the 
#              root directory cloned from the SentimentArcs github repo.

# NOTE: Make sure this subdirectory already exists and there are 
#       no typos, spaces or illegals characters (e.g. periods) in the full path after %cd

# NOTE: In Python all strings must begin with an upper or lowercase letter, and only
#         letter, number and underscores ('_') characters should appear afterwards.
#         Make sure your full path after %cd obeys this constraint or errors may appear.

# #@markdown **Instructions**

# #@markdown Set Directory and Corpus names:
# #@markdown <li> Set <b>Path_to_SentimentArcs</b> to the project root in your **GDrive folder**
# #@markdown <li> Set <b>Corpus_Genre</b> = [novels, finance, social_media]
# #@markdown <li> <b>Corpus_Type</b> = [reference_corpus, new_corpus]
# #@markdown <li> <b>Corpus_Number</b> = [1-20] (id nunmber if a new_corpus)

#@markdown <hr>

# Step #1: Get full path to SentimentArcs subdir on gDrive
# =======
#@markdown **Accept default path on gDrive or Enter new one:**

Path_to_SentimentArcs = "/gdrive/MyDrive/cdh/sentiment_arcs/" #@param ["/gdrive/MyDrive/sentiment_arcs/"] {allow-input: true}


#@markdown Set this to the project root in your <b>GDrive folder</b>
#@markdown <br> (e.g. /<wbr><b>gdrive/MyDrive/research/sentiment_arcs/</b>)

#@markdown <hr>

#@markdown **Which type of texts are you cleaning?** \

Corpus_Genre = "novels" #@param ["novels", "social_media", "finance"]


Corpus_Type = "new" #@param ["new", "reference"]


Corpus_Number = 2 #@param {type:"slider", min:0, max:10, step:1}


#@markdown Put in the corresponding Subdirectory under **./text_raw**:
#@markdown <li> All Texts as clean <b>plaintext *.txt</b> files 
#@markdown <li> A <b>YAML Configuration File</b> describing each Texts

#@markdown Please verify the required textfiles and YAML file exist in the correct subdirectories before continuing.

print('Current Working Directory:')
%cd $Path_to_SentimentArcs

print('\n')

if Corpus_Type == 'reference':
  SUBDIR_TEXT_RAW_CORPUS = f'text_raw_{Corpus_Genre}_ref'
else:
  SUBDIR_TEXT_RAW_CORPUS = f'text_raw_{Corpus_Genre}_{Corpus_Type}_corpus{Corpus_Number}/'


PATH_TEXT_RAW_CORPUS = f'./text_raw/{SUBDIR_TEXT_RAW_CORPUS}'


print(f'SUBDIR_TEXT_RAW_CORPUS:\n  [{SUBDIR_TEXT_RAW_CORPUS}]')
print(f'PATH_TEXT_RAW_CORPUS:\n  [{PATH_TEXT_RAW_CORPUS}]')

Current Working Directory:
/gdrive/MyDrive/cdh/sentiment_arcs


SUBDIR_TEXT_RAW_CORPUS:
  [text_raw_novels_new_corpus2/]
PATH_TEXT_RAW_CORPUS:
  [./text_raw/text_raw_novels_new_corpus2/]


In [3]:
# Add PATH for ./utils subdirectory

import sys
import os

!python --version

print('\n')

PATH_UTILS = f'{Path_to_SentimentArcs}/utils'
PATH_UTILS

sys.path.append(PATH_UTILS)

print('Contents of Subdirectory [./sentiment_arcs/utils/]\n')
!ls $PATH_UTILS

# More Specific than PATH for searching libraries
# !echo $PYTHONPATH

Python 3.7.12


Contents of Subdirectory [./sentiment_arcs/utils/]

config_matplotlib.py   global_constants.py    sa_config_bu.py
config_seaborn.py      global_vars.py	      sa_config.py
file_utils.py	       imdb50k_lemmas.csv     sentiment_analysis.py
get_fullpath.py        imdb50k_lemmas_df.csv  sentiment_arcs_config.py
get_model_families.py  imdb50k_stems.csv      set_globals.py
get_sentimentr.R       __init__.py	      subdir_constants.py
get_sentiments.py      __pycache__	      text_cleaners.py
get_subdirs.py	       read_yaml.py


In [4]:
# Review Global Variables and set the first few

import global_vars as global_vars

global_vars.SUBDIR_SENTIMENTARCS = Path_to_SentimentArcs
global_vars.Corpus_Genre = Corpus_Genre
global_vars.Corpus_Type = Corpus_Type
global_vars.Corpus_Number = Corpus_Number

global_vars.SUBDIR_TEXT_RAW_CORPUS = SUBDIR_TEXT_RAW_CORPUS
global_vars.PATH_TEXT_RAW_CORPUS = PATH_TEXT_RAW_CORPUS

dir(global_vars)

['Corpus_Genre',
 'Corpus_Number',
 'Corpus_Type',
 'FNAME_SENTIMENT_RAW',
 'MIN_PARAG_LEN',
 'MIN_SENT_LEN',
 'NotebookModels',
 'PATH_TEXT_RAW_CORPUS',
 'SLANG_DT',
 'STOPWORDS_ADD_EN',
 'STOPWORDS_DEL_EN',
 'SUBDIR_DATA',
 'SUBDIR_GRAPHS',
 'SUBDIR_SENTIMENTARCS',
 'SUBDIR_SENTIMENT_CLEAN',
 'SUBDIR_SENTIMENT_RAW',
 'SUBDIR_TEXT_CLEAN',
 'SUBDIR_TEXT_RAW',
 'SUBDIR_TEXT_RAW_CORPUS',
 'SUBDIR_TIMESERIES_CLEAN',
 'SUBDIR_TIMESERIES_RAW',
 'SUBDIR_UTILS',
 'TEST_SENTENCES_LS',
 'TEST_WORDS_LS',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__spec__',
 'corpus_titles_dt',
 'lexicons_dt',
 'model_titles_dt']

# **[STEP 2] Automatic Configuration/Setup**

## Custom Libraries & Define Globals

In [5]:
# Import SentimentArcs Utilities to define Directory Structure
#   based the Selected Corpus Genre, Type and Number

!pwd 
print('\n')

# from utils import sa_config # .sentiment_arcs_utils
from utils import sa_config

print('Objects in sa_config()')
print(dir(sa_config))
print('\n')

# Directory Structure for the Selected Corpus Genre, Type and Number
sa_config.get_subdirs(Corpus_Genre, Corpus_Type, Corpus_Number, 'none')


/gdrive/MyDrive/cdh/sentiment_arcs


Objects in sa_config()
['__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__spec__', 'get_subdirs', 'global_vars', 'set_globals']


Verify the Directory Structure:

-------------------------------

           [Corpus Genre]: novels

            [Corpus Type]: new


    [FNAME_SENTIMENT_RAW]: [NONE]




INPUTS:
-------------------------------

   [SUBDIR_SENTIMENTARCS]: /gdrive/MyDrive/cdh/sentiment_arcs


STEP 1: Clean Text
--------------------

        [SUBDIR_TEXT_RAW]: ./text_raw/text_raw_novels_new_corpus2/

      [SUBDIR_TEXT_CLEAN]: ./text_clean/text_clean_novels_new_corpus2/


STEP 2: Get Sentiments
--------------------

   [SUBDIR_SENTIMENT_RAW]: ./sentiment_raw/sentiment_raw_novels_new_corpus2/

 [SUBDIR_SENTIMENT_CLEAN]: ./sentiment_clean/sentiemnt_clean_novels_new_corpus2/


STEP 3: Smooth Time Series and Get Crux Points
--------------------

  [SUBDIR_TIMESERIES_RAW]: ./sentiment_raw/sentiment

In [6]:
# Call SentimentArcs Utility to define Global Variables

sa_config.set_globals()

# Verify sample global var set
print(f'MIN_PARAG_LEN: {global_vars.MIN_PARAG_LEN}')
print(f'STOPWORDS_ADD_EN: {global_vars.STOPWORDS_ADD_EN}')
print(f'TEST_WORDS_LS: {global_vars.TEST_WORDS_LS}')
print(f'SLANG_DT: {global_vars.SLANG_DT}')

MIN_PARAG_LEN: 10
STOPWORDS_ADD_EN: ['a', 'the', 'an']
TEST_WORDS_LS: ['Love', 'Hate', 'bizarre', 'strange', 'furious', 'elated', 'curious', 'beserk', 'gambaro']
SLANG_DT: {'$': ' dollar ', '€': ' euro ', '4ao': 'for adults only', 'a.m': 'before midday', 'a3': 'anytime anywhere anyplace', 'aamof': 'as a matter of fact', 'acct': 'account', 'adih': 'another day in hell', 'afaic': 'as far as i am concerned', 'afaict': 'as far as i can tell', 'afaik': 'as far as i know', 'afair': 'as far as i remember', 'afk': 'away from keyboard', 'app': 'application', 'approx': 'approximately', 'apps': 'applications', 'asap': 'as soon as possible', 'asl': 'age, sex, location', 'atk': 'at the keyboard', 'ave.': 'avenue', 'aymm': 'are you my mother', 'ayor': 'at your own risk', 'b&b': 'bed and breakfast', 'b+b': 'bed and breakfast', 'b.c': 'before christ', 'b2b': 'business to business', 'b2c': 'business to customer', 'b4': 'before', 'b4n': 'bye for now', 'b@u': 'back at you', 'bae': 'before anyone else', '

## Configure Jupyter Notebook

In [7]:
# Configure Jupyter

# Ignore warnings
import warnings
warnings.filterwarnings('ignore')

# Enable multiple outputs from one code cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

from IPython.display import display
from IPython.display import Image
from ipywidgets import widgets, interactive

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

## Read YAML Configuration for Corpus and Models 

In [8]:
!ls utils

config_matplotlib.py   global_constants.py    sa_config_bu.py
config_seaborn.py      global_vars.py	      sa_config.py
file_utils.py	       imdb50k_lemmas.csv     sentiment_analysis.py
get_fullpath.py        imdb50k_lemmas_df.csv  sentiment_arcs_config.py
get_model_families.py  imdb50k_stems.csv      set_globals.py
get_sentimentr.R       __init__.py	      subdir_constants.py
get_sentiments.py      __pycache__	      text_cleaners.py
get_subdirs.py	       read_yaml.py


In [9]:
!head -n 40 ./utils/read_yaml.py

import global_vars

import yaml

def read_corpus_yaml(Corpus_Genre, Corpus_Type, Corpus_Number):
  '''
  Given a Corpus_Genre (e.g. novels), Corpus_Type (new or reference) and Corpus_Number (for new)
  Read and return the long-form titles for both Models and Corpus Texts
  '''

  if Corpus_Type == 'new':
    path_info_yaml = f'text_raw/text_raw_{Corpus_Genre}_{Corpus_Type}_corpus{Corpus_Number}'
    file_yaml = f'text_raw_{Corpus_Genre}_{Corpus_Type}_corpus{Corpus_Number}_info.yaml'
  elif Corpus_Type == 'reference':
    path_info_yaml = f'text_raw/text_raw_{Corpus_Genre}_{Corpus_Type}'
    file_yaml = f'text_raw_{Corpus_Genre}_{Corpus_Type}_info.yaml'
  else:
    print(f'ERROR: Illegal value for Corpus_Type = {Corpus_Type}')
    return

  # Read Models Ensemble YAML Config Files 
  with open("./config/models_ref_info.yaml", "r") as stream:
    try:
      global_vars.models_titles_dt = yaml.safe_load(stream)
    except yaml.YAMLError as exc:
      print(exc)

  # Read Corpus Texts YAML

In [10]:
%whos

Variable                 Type             Data/Info
---------------------------------------------------
Corpus_Genre             str              novels
Corpus_Number            int              2
Corpus_Type              str              new
IN_COLAB                 bool             True
Image                    type             <class 'IPython.core.display.Image'>
InteractiveShell         MetaHasTraits    <class 'IPython.core.inte<...>eshell.InteractiveShell'>
PATH_TEXT_RAW_CORPUS     str              ./text_raw/text_raw_novels_new_corpus2/
PATH_UTILS               str              /gdrive/MyDrive/cdh/sentiment_arcs//utils
Path_to_SentimentArcs    str              /gdrive/MyDrive/cdh/sentiment_arcs/
SUBDIR_TEXT_RAW_CORPUS   str              text_raw_novels_new_corpus2/
display                  function         <function display at 0x7f41da133290>
drive                    module           <module 'google.colab.dri<...>s/google/colab/drive.py'>
global_vars              module          

In [11]:
# from utils import sa_config # .sentiment_arcs_utils

import yaml

from utils import read_yaml

print('Objects in read_yaml()')
print(dir(read_yaml))
print('\n')

# Directory Structure for the Selected Corpus Genre, Type and Number
read_yaml.read_corpus_yaml(Corpus_Genre, Corpus_Type, Corpus_Number)

print('SentimentArcs Model Ensemble ------------------------------\n')
model_titles_ls = global_vars.models_titles_dt.keys()
print('\n'.join(model_titles_ls))


print('\n\nCorpus Texts ------------------------------\n')
corpus_titles_ls = global_vars.corpus_titles_dt.keys()
print('\n'.join(corpus_titles_ls))


print(f'\n\nThere are {len(model_titles_ls)} Models in the SentimentArcs Ensemble above.\n')
print(f'\nThere are {len(corpus_titles_ls)} Texts in the Corpus above.\n')
print('\n')


Objects in read_yaml()
['__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__spec__', 'global_vars', 'read_corpus_yaml', 'yaml']


YAML Directory: text_raw/text_raw_novels_new_corpus2
YAML File: text_raw_novels_new_corpus2_info.yaml
SentimentArcs Model Ensemble ------------------------------

AutoGluon_Text
BERT_2IMDB
BERT_Dual_Coding
BERT_Multilingual
BERT_Yelp
CNN_DNN
Distilled_BERT
FLAML_AutoML
Fully_Connected_Network
HyperOpt_CNN_Flair_AutoML
LSTM_DNN
Logistic_Regression
Logistic_Regression_CV
Multilingual_CNN_Stanza_AutoML
Multinomial_Naive_Bayes
Pattern
Random_Forest
RoBERTa_Large_15DB
RoBERTa_XML_8Language
SentimentR_JockersRinker
SentimentR_Jockers
SentimentR_Bing
SentimentR_NRC
SentimentR_SentiWord
SentimentR_SenticNet
SentimentR_LMcD
SentimentR_SentimentR
PySentimentR_JockersRinker
PySentimentR_Huliu
PySentimentR_NRC
PySentimentR_SentiWord
PySentimentR_SenticNet
PySentimentR_LMcD
SyuzhetR_AFINN
SyuzhetR_Bing
SyuzhetR_NRC
SyuzhetR_Sy

## Install Libraries

In [12]:
# Library to Read R datafiles from within Python programs

!pip install pyreadr

Collecting pyreadr
  Downloading pyreadr-0.4.4-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (361 kB)
[K     |████████████████████████████████| 361 kB 5.0 MB/s 
Installing collected packages: pyreadr
Successfully installed pyreadr-0.4.4


In [13]:
# Powerful Industry-Grade NLP Library

!pip install -U spacy

Collecting spacy
  Downloading spacy-3.2.3-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (6.0 MB)
[K     |████████████████████████████████| 6.0 MB 5.3 MB/s 
[?25hCollecting pathy>=0.3.5
  Downloading pathy-0.6.1-py3-none-any.whl (42 kB)
[K     |████████████████████████████████| 42 kB 1.2 MB/s 
Collecting catalogue<2.1.0,>=2.0.6
  Downloading catalogue-2.0.6-py3-none-any.whl (17 kB)
Collecting srsly<3.0.0,>=2.4.1
  Downloading srsly-2.4.2-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (451 kB)
[K     |████████████████████████████████| 451 kB 55.9 MB/s 
[?25hCollecting thinc<8.1.0,>=8.0.12
  Downloading thinc-8.0.15-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (653 kB)
[K     |████████████████████████████████| 653 kB 60.2 MB/s 
Collecting langcodes<4.0.0,>=3.2.0
  Downloading langcodes-3.3.0-py3-none-any.whl (181 kB)
[K     |████████████████████████████████| 181 kB 58.5 MB/s 
Collecting pydantic!=1.8,!=1.8.1,<1.9.0,>=1.7.4
  Downloading pydantic

In [14]:
# NLP Library to Simply Cleaning Text

!pip install texthero

Collecting texthero
  Downloading texthero-1.1.0-py3-none-any.whl (24 kB)
Collecting nltk>=3.3
  Downloading nltk-3.7-py3-none-any.whl (1.5 MB)
[K     |████████████████████████████████| 1.5 MB 7.2 MB/s 
Collecting spacy<3.0.0
  Downloading spacy-2.3.7-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (10.4 MB)
[K     |████████████████████████████████| 10.4 MB 44.9 MB/s 
Collecting unidecode>=1.1.1
  Downloading Unidecode-1.3.4-py3-none-any.whl (235 kB)
[K     |████████████████████████████████| 235 kB 64.6 MB/s 
Collecting regex>=2021.8.3
  Downloading regex-2022.3.15-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (749 kB)
[K     |████████████████████████████████| 749 kB 48.2 MB/s 
Collecting srsly<1.1.0,>=1.0.2
  Downloading srsly-1.0.5-cp37-cp37m-manylinux2014_x86_64.whl (184 kB)
[K     |████████████████████████████████| 184 kB 46.9 MB/s 
Collecting thinc<7.5.0,>=7.4.1
  Downloading thinc-7.4.5-cp37-cp37m-manylinux2014_x86_64.whl (1.0 MB)
[K     |█████████████

In [15]:
# Advanced Sentence Boundry Detection Pythn Library
#   for splitting raw text into grammatical sentences
#   (can be difficult due to common motifs like Mr., ..., ?!?, etc)

!pip install pysbd

Collecting pysbd
  Downloading pysbd-0.3.4-py3-none-any.whl (71 kB)
[?25l[K     |████▋                           | 10 kB 22.8 MB/s eta 0:00:01[K     |█████████▏                      | 20 kB 22.2 MB/s eta 0:00:01[K     |█████████████▉                  | 30 kB 9.7 MB/s eta 0:00:01[K     |██████████████████▍             | 40 kB 8.9 MB/s eta 0:00:01[K     |███████████████████████         | 51 kB 4.6 MB/s eta 0:00:01[K     |███████████████████████████▋    | 61 kB 5.5 MB/s eta 0:00:01[K     |████████████████████████████████| 71 kB 3.6 MB/s 
[?25hInstalling collected packages: pysbd
Successfully installed pysbd-0.3.4


In [16]:
# Python Library to expand contractions to aid in Sentiment Analysis
#   (e.g. aren't -> are not, can't -> can not)

!pip install contractions

Collecting contractions
  Downloading contractions-0.1.68-py2.py3-none-any.whl (8.1 kB)
Collecting textsearch>=0.0.21
  Downloading textsearch-0.0.21-py2.py3-none-any.whl (7.5 kB)
Collecting anyascii
  Downloading anyascii-0.3.0-py3-none-any.whl (284 kB)
[K     |████████████████████████████████| 284 kB 5.4 MB/s 
[?25hCollecting pyahocorasick
  Downloading pyahocorasick-1.4.4-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (106 kB)
[K     |████████████████████████████████| 106 kB 36.3 MB/s 
[?25hInstalling collected packages: pyahocorasick, anyascii, textsearch, contractions
Successfully installed anyascii-0.3.0 contractions-0.1.68 pyahocorasick-1.4.4 textsearch-0.0.21


In [17]:
# Library for dealing with Emoticons (punctuation) and Emojis (icons)

!pip install emot

Collecting emot
  Downloading emot-3.1-py3-none-any.whl (61 kB)
[?25l[K     |█████▎                          | 10 kB 24.1 MB/s eta 0:00:01[K     |██████████▋                     | 20 kB 16.3 MB/s eta 0:00:01[K     |████████████████                | 30 kB 11.2 MB/s eta 0:00:01[K     |█████████████████████▎          | 40 kB 4.9 MB/s eta 0:00:01[K     |██████████████████████████▋     | 51 kB 4.9 MB/s eta 0:00:01[K     |████████████████████████████████| 61 kB 5.6 MB/s eta 0:00:01[K     |████████████████████████████████| 61 kB 13 kB/s 
[?25hInstalling collected packages: emot
Successfully installed emot-3.1


## Load Libraries

In [18]:
# Core Python Libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

import re
import string
from datetime import datetime
import os
import sys
import glob
import json
from pathlib import Path
from copy import deepcopy

In [19]:
# More advanced Sentence Tokenizier Object from PySBD
from pysbd.utils import PySBDFactory

In [20]:
# Simplier Sentence Tokenizer Object from NLTK
import nltk 
from nltk.tokenize import sent_tokenize

# Download required NLTK tokenizer data
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [21]:
# Instantiate and Import Text Cleaning Ojects into Global Variable space
import texthero as hero
from texthero import preprocessing

2022-03-16 09:14:22,847 : INFO : 'pattern' package not found; tag filters are not available for English
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [22]:
# Expand contractions (e.g. can't -> can not)
import contractions

# Translate emoticons :0 and emoji icons to text
import emot 
emot_obj = emot.core.emot() 

from emot.emo_unicode import UNICODE_EMOJI, EMOTICONS_EMO

# Test
text = "I love python ☮ 🙂 ❤ :-) :-( :-)))" 
emot_obj.emoticons(text)

{'flag': True,
 'location': [[20, 23], [24, 27], [28, 33]],
 'mean': ['Happy face smiley',
  'Frown, sad, andry or pouting',
  'Very very Happy face or smiley'],
 'value': [':-)', ':-(', ':-)))']}

In [41]:
# Import spaCy, language model and setup minimal pipeline

import spacy

nlp = spacy.load('en_core_web_sm', disable=['tagger', 'parser', 'ner'])
# nlp.max_length = 1027203
nlp.max_length = 2054406
nlp.add_pipe(nlp.create_pipe('sentencizer')) # https://stackoverflow.com/questions/51372724/how-to-speed-up-spacy-lemmatization

# Test some edge cases, try to find examples that break spaCy
doc= nlp(u"Apples and oranges are similar. Boots and hippos aren't.")
print('\n')
print("Token Attributes: \n", "token.text, token.pos_, token.tag_, token.dep_, token.lemma_")
for token in doc:
    # Print the text and the predicted part-of-speech tag
    print("{:<12}{:<12}{:<12}{:<12}{:<12}".format(token.text, token.pos_, token.tag_, token.dep_, token.lemma_))

print('\nAnother Test:\n')
doc = nlp(u"Apples and oranges are similar. Boots and hippos aren't.")

for token in doc:
    print("{:<12}{:<30}{:<12}".format(token.text, token.lemma, token.lemma_))



Token Attributes: 
 token.text, token.pos_, token.tag_, token.dep_, token.lemma_
Apples                                          Apples      
and                                             and         
oranges                                         orange      
are                                             be          
similar                                         similar     
.                                               .           
Boots                                           Boots       
and                                             and         
hippos                                          hippo       
are         AUX         VBP                     be          
n't         PART        RB                      not         
.                                               .           

Another Test:

Apples      9297668116247400838           Apples      
and         2283656566040971221           and         
oranges     2208928596161743350           orange      
are 

## Define/Customize Stopwords

In [None]:
# Define Globals
"""
# Main data structure: Dictionary (key=text_name) of DataFrames (cols: text_raw, text_clean)
corpus_texts_dt = {}

# Verify in SentimentArcs Root Directory
os.chdir('/gdrive/MyDrive/cdh/sentiment_arcs/')

%run -i './utils/get_globals.py'

SLANG_DT.keys()
""";

dict_keys(['$', '€', '4ao', 'a.m', 'a3', 'aamof', 'acct', 'adih', 'afaic', 'afaict', 'afaik', 'afair', 'afk', 'app', 'approx', 'apps', 'asap', 'asl', 'atk', 'ave.', 'aymm', 'ayor', 'b&b', 'b+b', 'b.c', 'b2b', 'b2c', 'b4', 'b4n', 'b@u', 'bae', 'bak', 'bbbg', 'bbc', 'bbias', 'bbl', 'bbs', 'be4', 'bfn', 'blvd', 'bout', 'brb', 'bros', 'brt', 'bsaaw', 'btw', 'bwl', 'c/o', 'cet', 'cf', 'cia', 'csl', 'cu', 'cul8r', 'cv', 'cwot', 'cya', 'cyt', 'dae', 'dbmib', 'diy', 'dm', 'dwh', 'e123', 'eet', 'eg', 'embm', 'encl', 'encl.', 'etc', 'faq', 'fawc', 'fb', 'fc', 'fig', 'fimh', 'ft.', 'ft', 'ftl', 'ftw', 'fwiw', 'fyi', 'g9', 'gahoy', 'gal', 'gcse', 'gfn', 'gg', 'gl', 'glhf', 'gmt', 'gmta', 'gn', 'g.o.a.t', 'goat', 'goi', 'gps', 'gr8', 'gratz', 'gyal', 'h&c', 'hp', 'hr', 'hrh', 'ht', 'ibrb', 'ic', 'icq', 'icymi', 'idc', 'idgadf', 'idgaf', 'idk', 'ie', 'i.e', 'ifyp', 'IG', 'iirc', 'ilu', 'ily', 'imho', 'imo', 'imu', 'iow', 'irl', 'j4f', 'jic', 'jk', 'jsyk', 'l8r', 'lb', 'lbs', 'ldr', 'lmao', 'lmfao', 

In [25]:
global_vars.SLANG_DT.keys()

dict_keys(['$', '€', '4ao', 'a.m', 'a3', 'aamof', 'acct', 'adih', 'afaic', 'afaict', 'afaik', 'afair', 'afk', 'app', 'approx', 'apps', 'asap', 'asl', 'atk', 'ave.', 'aymm', 'ayor', 'b&b', 'b+b', 'b.c', 'b2b', 'b2c', 'b4', 'b4n', 'b@u', 'bae', 'bak', 'bbbg', 'bbc', 'bbias', 'bbl', 'bbs', 'be4', 'bfn', 'blvd', 'bout', 'brb', 'bros', 'brt', 'bsaaw', 'btw', 'bwl', 'c/o', 'cet', 'cf', 'cia', 'csl', 'cu', 'cul8r', 'cv', 'cwot', 'cya', 'cyt', 'dae', 'dbmib', 'diy', 'dm', 'dwh', 'e123', 'eet', 'eg', 'embm', 'encl', 'encl.', 'etc', 'faq', 'fawc', 'fb', 'fc', 'fig', 'fimh', 'ft.', 'ft', 'ftl', 'ftw', 'fwiw', 'fyi', 'g9', 'gahoy', 'gal', 'gcse', 'gfn', 'gg', 'gl', 'glhf', 'gmt', 'gmta', 'gn', 'g.o.a.t', 'goat', 'goi', 'gps', 'gr8', 'gratz', 'gyal', 'h&c', 'hp', 'hr', 'hrh', 'ht', 'ibrb', 'ic', 'icq', 'icymi', 'idc', 'idgadf', 'idgaf', 'idk', 'ie', 'i.e', 'ifyp', 'IG', 'iirc', 'ilu', 'ily', 'imho', 'imo', 'imu', 'iow', 'irl', 'j4f', 'jic', 'jk', 'jsyk', 'l8r', 'lb', 'lbs', 'ldr', 'lmao', 'lmfao', 

In [27]:
dir(global_vars)

['Corpus_Genre',
 'Corpus_Number',
 'Corpus_Type',
 'FNAME_SENTIMENT_RAW',
 'MIN_PARAG_LEN',
 'MIN_SENT_LEN',
 'NotebookModels',
 'PATH_TEXT_RAW_CORPUS',
 'SLANG_DT',
 'STOPWORDS_ADD_EN',
 'STOPWORDS_DEL_EN',
 'SUBDIR_DATA',
 'SUBDIR_GRAPHS',
 'SUBDIR_SENTIMENTARCS',
 'SUBDIR_SENTIMENT_CLEAN',
 'SUBDIR_SENTIMENT_RAW',
 'SUBDIR_TEXT_CLEAN',
 'SUBDIR_TEXT_RAW',
 'SUBDIR_TEXT_RAW_CORPUS',
 'SUBDIR_TIMESERIES_CLEAN',
 'SUBDIR_TIMESERIES_RAW',
 'SUBDIR_UTILS',
 'TEST_SENTENCES_LS',
 'TEST_WORDS_LS',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__spec__',
 'corpus_titles_dt',
 'lexicons_dt',
 'model_titles_dt',
 'models_titles_dt']

In [26]:
%whos

Variable                 Type             Data/Info
---------------------------------------------------
Corpus_Genre             str              novels
Corpus_Number            int              2
Corpus_Type              str              new
EMOTICONS_EMO            dict             n=221
IN_COLAB                 bool             True
Image                    type             <class 'IPython.core.display.Image'>
InteractiveShell         MetaHasTraits    <class 'IPython.core.inte<...>eshell.InteractiveShell'>
PATH_TEXT_RAW_CORPUS     str              ./text_raw/text_raw_novels_new_corpus2/
PATH_UTILS               str              /gdrive/MyDrive/cdh/sentiment_arcs//utils
Path                     type             <class 'pathlib.Path'>
Path_to_SentimentArcs    str              /gdrive/MyDrive/cdh/sentiment_arcs/
PySBDFactory             type             <class 'pysbd.utils.PySBDFactory'>
SUBDIR_TEXT_RAW_CORPUS   str              text_raw_novels_new_corpus2/
UNICODE_EMOJI            dic

In [28]:
# Verify English Stopword List

stopwords_spacy_en_ls = nlp.Defaults.stop_words

','.join([x for x in stopwords_spacy_en_ls])

stopwords_en_ls = stopwords_spacy_en_ls

print(f'\n\nThere are {len(stopwords_spacy_en_ls)} default English Stopwords from spaCy\n')

"re,’s,becomes,or,towards,already,thence,before,next,their,noone,namely,none,he,alone,against,fifteen,very,ours,moreover,ourselves,really,nobody,various,for,him,twenty,to,else,into,from,whole,amongst,when,i,was,until,done,had,becoming,six,others,whereafter,only,former,all,hereby,more,whoever,her,any,did,once,otherwise,whereby,elsewhere,therefore,twelve,’d,move,rather,you,often,below,where,whither,his,who,sometimes,go,n’t,further,seemed,during,onto,whether,take,around,give,my,will,full,except,‘re,such,first,which,seem,anyhow,something,of,same,does,though,hence,’m,on,nothing,show,besides,thereafter,least,be,been,someone,due,they,have,with,should,are,n‘t,do,up,every,please,am,third,again,less,nevertheless,sixty,'ll,regarding,whenever,anywhere,while,can,five,‘ve,themselves,thereupon,latterly,among,another,it,per,here,many,ten,’re,well,wherever,upon,now,whereas,mine,what,’ll,sometime,one,still,in,those,must,doing,anything,seeming,himself,yourselves,several,and,neither,get,either,out,just,is



There are 326 default English Stopwords from spaCy



## (Optional) Customize Stopword List (add/del)

In [30]:
# Customize Default SpaCy English Stopword List

print(f'\n\nThere are {len(stopwords_spacy_en_ls)} default English Stopwords from spaCy\n')

# [CUSTOMIZE] Stopwords to ADD or DELETE from default spaCy English stopword list
LOCAL_STOPWORDS_DEL_EN = set(global_vars.STOPWORDS_DEL_EN).union(set(['a','an','the','but','yet']))
print(f'    Deleting these stopwords: {LOCAL_STOPWORDS_DEL_EN}')
LOCAL_STOPWORDS_ADD_EN = set(global_vars.STOPWORDS_ADD_EN).union(set(['a','an','the','but','yet']))
print(f'    Adding these stopwords: {LOCAL_STOPWORDS_ADD_EN}\n')

stopwords_en_ls = list(set(stopwords_spacy_en_ls).difference(set(LOCAL_STOPWORDS_DEL_EN)).union(set(LOCAL_STOPWORDS_ADD_EN)))
print(f'Final Count: {len(stopwords_en_ls)} Stopwords')



There are 326 default English Stopwords from spaCy

    Deleting these stopwords: {'a', 'the', 'yet', 'but', 'jimmy', 'an', 'dean'}
    Adding these stopwords: {'yet', 'the', 'an', 'a', 'but'}

Final Count: 326 Stopwords


## Setup Matplotlib Style

In [31]:
# Configure Matplotlib

# View available styles
# plt.style.available

# Verify in SentimentArcs Root Directory
os.chdir('/gdrive/MyDrive/cdh/sentiment_arcs/')

%run -i './utils/config_matplotlib.py'

config_matplotlib()

print('Matplotlib Configuration ------------------------------')
print('\n  (Uncomment to view)')
# plt.rcParams.keys()
print('\n  Edit ./utils/config_matplotlib.py to change')




 New figure size:  (20, 10)
Matplotlib Configuration ------------------------------

  (Uncomment to view)

  Edit ./utils/config_matplotlib.py to change


## Setup Seaborn Style

In [32]:
# Configure Seaborn

# Verify in SentimentArcs Root Directory
os.chdir('/gdrive/MyDrive/cdh/sentiment_arcs/')

%run -i './utils/config_seaborn.py'

config_seaborn()

print('Seaborn Configuration ------------------------------\n')
# print('\n  Update ./utils/config_seaborn.py to display seaborn settings')




Seaborn Configuration ------------------------------



## **Utility Functions**

### Generate Convenient Data Lists

In [33]:
# Derive List of Texts in Corpus a)keys and b)full author and titles

print('Dictionary: corpus_titles_dt')
global_vars.corpus_titles_dt
print('\n')

corpus_texts_ls = list(global_vars.corpus_titles_dt.keys())
print(f'\nCorpus Texts:')
for akey in corpus_texts_ls:
  print(f'  {akey}')
print('\n')

print(f'\nNatural Corpus Titles:')
corpus_titles_ls = [x[0] for x in list(global_vars.corpus_titles_dt.values())]
for akey in corpus_titles_ls:
  print(f'  {akey}')


Dictionary: corpus_titles_dt


{'cliu_threebodyproblem': ['The Three Body Problem by Cixin Liu', 2008, 0],
 'sking_doctorsleep': ['Doctor Sleep by Stphen King', 2013, 0],
 'tmorrison_songofsolomon': ['Song of Solomon by Toni Morrison', 1977, 0]}




Corpus Texts:
  tmorrison_songofsolomon
  cliu_threebodyproblem
  sking_doctorsleep



Natural Corpus Titles:
  Song of Solomon by Toni Morrison
  The Three Body Problem by Cixin Liu
  Doctor Sleep by Stphen King


In [36]:
# Get Model Families of Ensemble

from utils.get_model_families import get_ensemble_model_famalies

global_vars.models_ensemble_dt = get_ensemble_model_famalies(global_vars.models_titles_dt)

print('\nTest: Lexicon Family of Models:')
global_vars.models_ensemble_dt['lexicon']


There are 12 Lexicon Models
  Lexicon Model #0: sentimentr_sentimentr
  Lexicon Model #1: pysentimentr_jockersrinker
  Lexicon Model #2: pysentimentr_huliu
  Lexicon Model #3: pysentimentr_nrc
  Lexicon Model #4: pysentimentr_sentiword
  Lexicon Model #5: pysentimentr_senticnet
  Lexicon Model #6: pysentimentr_lmcd
  Lexicon Model #7: syuzhetr_afinn
  Lexicon Model #8: syuzhetr_bing
  Lexicon Model #9: syuzhetr_nrc
  Lexicon Model #10: syuzhetr_syuzhetr
  Lexicon Model #11: afinn

There are 9 Heuristic Models
  Heuristic Model #0: pattern
  Heuristic Model #1: sentimentr_jockersrinker
  Heuristic Model #2: sentimentr_jockers
  Heuristic Model #3: sentimentr_bing
  Heuristic Model #4: sentimentr_nrc
  Heuristic Model #5: sentimentr_sentiword
  Heuristic Model #6: sentimentr_senticnet
  Heuristic Model #7: sentimentr_lmcd
  Heuristic Model #8: vader

There are 8 Traditional ML Models
  Traditional ML Model #0: autogluon
  Traditional ML Model #1: flaml
  Traditional ML Model #2: logreg


['sentimentr_sentimentr',
 'pysentimentr_jockersrinker',
 'pysentimentr_huliu',
 'pysentimentr_nrc',
 'pysentimentr_sentiword',
 'pysentimentr_senticnet',
 'pysentimentr_lmcd',
 'syuzhetr_afinn',
 'syuzhetr_bing',
 'syuzhetr_nrc',
 'syuzhetr_syuzhetr',
 'afinn']

### Text Cleaning 

In [68]:
# Test Text Cleaning Functions

# Verify in SentimentArcs Root Directory
os.chdir('/gdrive/MyDrive/cdh/sentiment_arcs/')

%run -i './utils/text_cleaners.py'

test_suite_ls = ['text2lemmas',
                 'text_str2sents',
                 'textfile2df',
                 'emojis2text',
                 'all_emos2text',
                 'expand_slang',
                 'clean_text',
                 'lemma_pipe'
                 ]

# test_suite_ls = []

# Test: text2lemmas()
if 'text2lemmas' in test_suite_ls:
  text2lemmas('I am going to start studying more often and working harder.', lowercase=True, remove_stopwords=False)
  print('\n')

# Test: text_str2sents()
if 'text_str2sents' in test_suite_ls:
  text_str2sents('Hello. You are a great dude! WTF?\n\n You are a goat. What is a goat?!? A big lazy GOAT... No way-', pysbd_only=False) # !?! Dr. and Mrs. Elipses...', pysbd_only=True)
  print('\n')

# Test: textfile2df()
if 'textfile2df' in test_suite_ls:
  # ???
  print('\n')

# Test: emojis2text()
if 'emojis2text' in test_suite_ls:
  test_str = "Hilarious 😂. The feeling of making a sale 😎, The feeling of actually ;) fulfilling orders 😒"
  test_str = emojis2text(test_str)
  print(f'test_str: [{test_str}]')
  print('\n')

# Test: all_emos2text()
if 'all_emos2text' in test_suite_ls:
  test_str = "Hilarious 😂. The feeling :o of making a sale 😎, The feeling :( of actually ;) fulfilling orders 😒"
  all_emos2text(test_str)
  print('\n')

# Test: expand_slang():
if 'expand_slang' in test_suite_ls:
  expand_slang('idk LOL you suck!')
  print('\n')

# Test: clean_text()
if 'clean_text' in test_suite_ls:
  test_df = pd.DataFrame({'text_dirty':['The RAin in SPain','WTF?!?! Do you KnoW...']})
  clean_text(test_df, 'text_dirty', text_type='formal')
  print('\n')

# Test: lemma_pipe()
if 'lemma_pipe' in test_suite_ls:
  print('\nTest #1:\n')
  test_ls = ['I am running late for a meetings with all the many people.',
            'What time is it when you fall down running away from a growing problem?',
            "You've got to be kidding me - you're joking right?"]
  lemma_pipe(test_ls)
  print('\nTest #2:\n')
  texts = pd.Series(["I won't go and you can't make me.", "Billy is running really quickly and with great haste.", "Eating freshly caught seafood."])
  for doc in nlp.pipe(texts):
    print([tok.lemma_ for tok in doc])
  print('\nTest #3:\n')
  lemma_pipe(texts)


SyntaxError: ignored

NameError: ignored

In [37]:
# [VERIFY]: Texthero preprocessing pipeline

hero.preprocessing.get_default_pipeline()



# Create Default and Custom Stemming TextHero pipeline

# Create a custom cleaning pipeline
def_pipeline = [preprocessing.fillna
                , preprocessing.lowercase
                , preprocessing.remove_digits
                , preprocessing.remove_punctuation
                , preprocessing.remove_diacritics
                # , preprocessing.remove_stopwords
                , preprocessing.remove_whitespace]

# Create a custom cleaning pipeline
stem_pipeline = [preprocessing.fillna
                , preprocessing.lowercase
                , preprocessing.remove_digits
                , preprocessing.remove_punctuation
                , preprocessing.remove_diacritics
                , preprocessing.remove_stopwords
                , preprocessing.remove_whitespace
                , preprocessing.stem]
                   
# Test: pass the custom_pipeline to the pipeline argument
# df['clean_title'] = hero.clean(df['title'], pipeline = custom_pipeline)df.head()

[<function texthero.preprocessing.fillna>,
 <function texthero.preprocessing.lowercase>,
 <function texthero.preprocessing.remove_digits>,
 <function texthero.preprocessing.remove_punctuation>,
 <function texthero.preprocessing.remove_diacritics>,
 <function texthero.preprocessing.remove_stopwords>,
 <function texthero.preprocessing.remove_whitespace>]

In [50]:
# Test Text Cleaning Functions

from utils.text_cleaners import text2lemmas, text_str2sents, emojis2text, expand_slang, clean_text, lemma_pipe

test_suite_ls = ['text2lemmas',
                 'text_str2sents',
                 'textfile2df',
                 'emojis2text',
                 'all_emos2text',
                 'expand_slang',
                 'clean_text',
                 'lemma_pipe'
                 ]

# Comment out this line to active tests above
# test_suite_ls = []

"""
# Test: text2lemmas()
if 'text2lemmas' in test_suite_ls:
  text2lemmas('I am going to start studying more often and working harder.', lowercase=True, remove_stopwords=False)
  print('\n')
"""

# Test: text_str2sents()
if 'text_str2sents' in test_suite_ls:
  text_str2sents('Hello. You are a great dude! WTF?\n\n You are a goat. What is a goat?!? A big lazy GOAT... No way-', pysbd_only=False) # !?! Dr. and Mrs. Elipses...', pysbd_only=True)
  print('\n')

# Test: textfile2df()
if 'textfile2df' in test_suite_ls:
  # ???
  print('\n')

# Test: emojis2text()
if 'emojis2text' in test_suite_ls:
  test_str = "Hilarious 😂. The feeling of making a sale 😎, The feeling of actually ;) fulfilling orders 😒"
  test_str = emojis2text(test_str)
  print(f'test_str: [{test_str}]')
  print('\n')

# Test: all_emos2text()
if 'all_emos2text' in test_suite_ls:
  test_str = "Hilarious 😂. The feeling :o of making a sale 😎, The feeling :( of actually ;) fulfilling orders 😒"
  all_emos2text(test_str)
  print('\n')

# Test: expand_slang():
if 'expand_slang' in test_suite_ls:
  expand_slang('idk LOL you suck!')
  print('\n')

# Test: clean_text()
if 'clean_text' in test_suite_ls:
  test_df = pd.DataFrame({'text_dirty':['The RAin in SPain','WTF?!?! Do you KnoW...']})
  clean_text(test_df, 'text_dirty', text_type='formal')
  print('\n')
"""
# Test: lemma_pipe()
if 'lemma_pipe' in test_suite_ls:
  print('\nTest #1:\n')
  test_ls = ['I am running late for a meetings with all the many people.',
            'What time is it when you fall down running away from a growing problem?',
            "You've got to be kidding me - you're joking right?"]
  lemma_pipe(test_ls)
  print('\nTest #2:\n')
  texts = pd.Series(["I won't go and you can't make me.", "Billy is running really quickly and with great haste.", "Eating freshly caught seafood."])
  for doc in nlp.pipe(texts):
    print([tok.lemma_ for tok in doc])
  print('\nTest #3:\n')
  lemma_pipe(texts)
"""

"\n# Test: text2lemmas()\nif 'text2lemmas' in test_suite_ls:\n  text2lemmas('I am going to start studying more often and working harder.', lowercase=True, remove_stopwords=False)\n  print('\n')\n"

NameError: ignored

### File Functions

In [48]:
# Verify in SentimentArcs Root Directory
os.chdir('/gdrive/MyDrive/cdh/sentiment_arcs/')

from utils.file_utils import *

# %run -i './utils/file_utils.py'

# TODO: Not used? Delete?
# get_fullpath(text_title_str, ftype='data_clean', fig_no='', first_note = '',last_note='', plot_ext='png', no_date=False)

# **[STEP 2] Read in Corpus and Clean**

## Create List of Raw Textfiles

In [51]:
global_vars.SUBDIR_SENTIMENTARCS

'/gdrive/MyDrive/cdh/sentiment_arcs'

In [53]:
# TODO: Temp fix until print(f'Original: {SUBDIR_TEXT_RAW}\n')
path_text_raw = './' + '/'.join(global_vars.SUBDIR_TEXT_RAW.split('/')[1:-1])
print(f'path_text_raw: {path_text_raw}\n')
# SUBDIR_TEXT_RAW = path_text_raw + '/'
print(f'Full Path to Corpus text_raw: ./text_raw/{global_vars.SUBDIR_TEXT_RAW_CORPUS}')

path_text_raw: ./text_raw/text_raw_novels_new_corpus2

Full Path to Corpus text_raw: ./text_raw/text_raw_novels_new_corpus2/


In [54]:
!pwd

/gdrive/MyDrive/cdh/sentiment_arcs


In [57]:
# Get a list of all the Textfile filename roots in Subdir text_raw

# Verify in SentimentArcs Root Directory
os.chdir('/gdrive/MyDrive/cdh/sentiment_arcs/')

corpus_titles_ls = list(global_vars.corpus_titles_dt.keys())

print(f'Corpus_Genre: {global_vars.Corpus_Genre}')
print(f'Corpus_Type: {global_vars.Corpus_Type}\n')

# Build path to Corpus Subdir
# TODO: Temp fix until print(f'Original: {SUBDIR_TEXT_RAW}\n')
# path_text_raw = './' + '/'.join(SUBDIR_TEXT_RAW.split('/')[1:-1]) + '/' + SUBDIR_TEXT_RAW_CORPUS
path_text_raw = './text_raw/' + global_vars.SUBDIR_TEXT_RAW_CORPUS
print(f'Corpus Subdir: {path_text_raw}')

# Create a List (preprocessed_ls) of all preprocessed text files
try:
  # texts_raw_ls = glob.glob(f'{SUBDIR_TEXT_RAW}*.txt')
  texts_raw_root_ls = glob.glob(f'{path_text_raw}/*.txt')
  texts_raw_root_ls = [x.split('/')[-1] for x in texts_raw_root_ls]
  texts_raw_root_ls = [x.split('.')[0] for x in texts_raw_root_ls]
except IndexError:
  raise RuntimeError('No *.txt files found')

print(f'\ntexts_raw_root_ls:\n  {texts_raw_root_ls}\n')

text_ct = 0
for afile_root in texts_raw_root_ls:
  # file_root = file_fullpath.split('/')[-1].split('.')[0]
  text_ct += 1
  print(f'{afile_root}: ') # {corpus_titles_dt[afile_root]}')

print(f'\nThere are {text_ct} Texts defined in SentmentArcs [corpus_dt] and found in the subdir: [SUBDIR_TEXT_RAW]')

Corpus_Genre: novels
Corpus_Type: new

Corpus Subdir: ./text_raw/text_raw_novels_new_corpus2/

texts_raw_root_ls:
  ['sking_doctorsleep', 'cliu_threebodyproblem', 'tmorrison_songsolomon']

sking_doctorsleep: 
cliu_threebodyproblem: 
tmorrison_songsolomon: 

There are 3 Texts defined in SentmentArcs [corpus_dt] and found in the subdir: [SUBDIR_TEXT_RAW]


In [58]:
path_text_raw

'./text_raw/text_raw_novels_new_corpus2/'

In [59]:
!ls $path_text_raw

cliu_threebodyproblem.txt  text_raw_novels_new_corpus2_info.yaml
sking_doctorsleep.txt	   tmorrison_songsolomon.txt


In [60]:
glob.glob(f'{path_text_raw}/*.txt')

['./text_raw/text_raw_novels_new_corpus2/sking_doctorsleep.txt',
 './text_raw/text_raw_novels_new_corpus2/cliu_threebodyproblem.txt',
 './text_raw/text_raw_novels_new_corpus2/tmorrison_songsolomon.txt']

## Read and Segment into Sentences

In [None]:
%run -i 

In [64]:
%%time

# Read all Corpus Textfiles and Segment each into Sentences

# NOTE: 3m30s Entire Corpus of 25 
#       7m30s Ref Corpus 32 Novels
#       7m24s Ref Corpus 32 Novels
#       1m00s New Corpus 2 Novels

# Read all novel files into a Dictionary of DataFrames
#   Dict.keys() are novel names
#   Dict.values() are DataFrames with one row per Sentence

# Continue here ONLY if last cell completed WITHOUT ERROR

# anovel_df = pd.DataFrame()

for i, file_root in enumerate(corpus_titles_ls):
  file_fullpath = f'{global_vars.SUBDIR_TEXT_RAW}{file_root}.txt'
  # print(f'Processing Novel #{i}: {file_fullpath}') # {file_root}')
  # fullpath_str = novels_subdir + asubdir + '/' + asubdir + '.txt'
  # print(f"  Size: {os.path.getsize(file_fullpath)}")

  corpus_texts_dt[file_root] = textfile2df(file_fullpath)
  
# corpus_dt.keys()

# Verify First Text is Segmented into text_raw Sentences
print('\n\n')
corpus_texts_dt[corpus_titles_ls[0]].head()


2022-03-16 09:52:47,024 : ERROR : Internal Python error in the inspect module.
Below is the traceback from this internal error.

2022-03-16 09:52:47,032 : INFO : 
Unfortunately, your original traceback can not be constructed.



Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/IPython/core/interactiveshell.py", line 2882, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-64-3dbd978afc86>", line 1, in <module>
    get_ipython().run_cell_magic('time', '', '\n# Read all Corpus Textfiles and Segment each into Sentences\n\n# NOTE: 3m30s Entire Corpus of 25 \n#       7m30s Ref Corpus 32 Novels\n#       7m24s Ref Corpus 32 Novels\n#       1m00s New Corpus 2 Novels\n\n# Read all novel files into a Dictionary of DataFrames\n#   Dict.keys() are novel names\n#   Dict.values() are DataFrames with one row per Sentence\n\n# Continue here ONLY if last cell completed WITHOUT ERROR\n\n# anovel_df = pd.DataFrame()\n\nfor i, file_root in enumerate(corpus_titles_ls):\n  file_fullpath = f\'{global_vars.SUBDIR_TEXT_RAW}{file_root}.txt\'\n  # print(f\'Processing Novel #{i}: {file_fullpath}\') # {file_root}\')\n  # fullpath_str = novels_subdir + asubdir + \'/\' +

NameError: ignored

## Clean Sentences

In [None]:
%%time

# NOTE: (no stem) 4m09s (24 Novels)
#       (w/ stem) 4m24s (24 Novels)

i = 0

for key_novel, atext_df in corpus_texts_dt.items():

  print(f'Processing Novel #{i}: {key_novel}...')

  atext_df['text_clean'] = clean_text(atext_df, 'text_raw', text_type='formal')
  atext_df['text_clean'] = lemma_pipe(atext_df['text_clean'])
  atext_df['text_clean'] = atext_df['text_clean'].astype('string')

  # TODO: Fill in all blank 'text_clean' rows with filler semaphore
  atext_df.text_clean = atext_df.text_clean.fillna('empty_placeholder')

  atext_df.head(2)

  print(f'  shape: {atext_df.shape}')

  i += 1

Processing Novel #0: scollins_thehungergames1...
  shape: (9021, 2)
Processing Novel #1: cmieville_thecityandthecity...
  shape: (10125, 2)
CPU times: user 6.92 s, sys: 140 ms, total: 7.07 s
Wall time: 7.75 s


In [None]:
# Verify the first Text in Corpus is cleaned

corpus_texts_dt[corpus_titles_ls[0]].head(20)
corpus_texts_dt[corpus_titles_ls[0]].info()

Unnamed: 0,text_raw,text_clean
0,"""THE TRIBUTES""",the tribute
1,"When I wake up, the other side of the bed is c...",when i wake up the other side of the bed be cold
2,"My fingers stretch out, seeking Prims warmth b...",my finger stretch out seek prims warmth but fi...
3,She must have had bad dreams and climbed in wi...,she must have have bad dream and climb in with...
4,"Of course, she did.",of course she do
5,This is the day of the reaping.,this be the day of the reap
6,I prop myself up on one elbow.,i prop myself up on one elbow
7,Theres enough light in the bedroom to see them.,there be enough light in the bedroom to see them
8,"My little sister, Prim, curled up on her side,...",my little sister prim curl up on her side coco...
9,"In sleep, my mother looks younger, still worn ...",in sleep my mother look young still wear but n...


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9021 entries, 0 to 9020
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   text_raw    9021 non-null   object
 1   text_clean  9021 non-null   string
dtypes: object(1), string(1)
memory usage: 141.1+ KB


## Save Cleaned Corpus

In [None]:
# Verify in SentimentArcs Root Directory
os.chdir('/gdrive/MyDrive/cdh/sentiment_arcs/')

print('Currently in SentimentArcs root directory:')
!pwd

# Verify Subdir to save Cleaned Texts and Texts into..

print(f'\nSaving Clean Texts to Subdir: {SUBDIR_TEXT_CLEAN}')
print(f'\nSaving these Texts:\n  {corpus_texts_dt.keys()}')

Currently in SentimentArcs root directory:
/gdrive/MyDrive/cdh/sentiment_arcs

Saving Clean Texts to Subdir: ./text_clean/novels_text_new_clean/

Saving these Texts:
  dict_keys(['scollins_thehungergames1', 'cmieville_thecityandthecity'])


In [None]:
# Save the cleaned Textfiles

i = 0
for key_novel, anovel_df in corpus_texts_dt.items():
  anovel_fname = f'{key_novel}.csv'

  anovel_fullpath = f'{SUBDIR_TEXT_CLEAN}{anovel_fname}'
  print(f'Saving Novel #{i} to {anovel_fullpath}')
  corpus_texts_dt[key_novel].to_csv(anovel_fullpath)
  i += 1

Saving Novel #0 to ./text_clean/novels_text_new_clean/scollins_thehungergames1.csv
Saving Novel #1 to ./text_clean/novels_text_new_clean/cmieville_thecityandthecity.csv


# **[END OF NOTEBOOK]**