# **SentimentArcs: Self-Supervising Time Series Sentiment Analysis**

Welcome! 

SentimentArcs is a methodology and software framework for analyzing narrative in text. Virtually all long text contains narrative elements...(TODO: Insert excerpts from Paper Abstract/Intro Sections here)

***

* **SentimentArcs: Cloning the Github repository to your gDrive**

If this is your first time using SentimentArcs, you will need to copy the software from our GitHub repository. The recommended default gDrive path is './gdrive/MyDrive/research/sentiment_arcs/'.

The first time you run this notebook and connect your Google gDrive, you can specify the path to your SentimentArcs subdirectory. If it does not exist, this notebook will clone the SentimentArcs GitHub repository to your gDrive at the path you specify.


***

* **NovelText: A Reference Corpus of 24 Diverse Novels**

SentimentArcs comes with a carefully curated reference corpus of novels to illustrate the unique diachronic sentiment analysis characteristics of long-form fictional narratives. This corpus of 24 diverse novels also provides a baseline for exploring and comparing new novels with sentiment analysis using SentimentArcs.

***

* **Preparing New Novels: Formatting and adding to subdirectory**

To analyze new novels with SentimentArcs, the body of the text should consist of plain text organized into blocks separated by two newlines, which look like a single blank line between blocks. These blocks are usually paragraphs but can also include title headers, separate lines of dialog, or quotes. Please refer to any of the 24 novels in the NovelText corpus for examples of the expected format.
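
To make the block format concrete, here is a minimal sketch of how such a file could be split into blocks. The helper name `split_into_blocks` and the sample text are hypothetical and are not part of SentimentArcs; the notebooks may split text differently.

```python
import re

def split_into_blocks(raw_text):
    """Split plain text into blocks separated by one or more blank lines."""
    # Normalize Windows and old Mac line endings so blank lines are detected reliably
    text = raw_text.replace('\r\n', '\n').replace('\r', '\n')
    # A "blank" line may still hold stray spaces or tabs, so allow whitespace between newlines
    blocks = re.split(r'\n\s*\n', text.strip())
    return [block.strip() for block in blocks if block.strip()]

# Hypothetical sample: a header, a two-line paragraph, and a line of dialog
sample = 'CHAPTER 1\n\nIt was a dark night.\nThe rain fell.\n\n\n"Who is there?" she asked.\n'
blocks = split_into_blocks(sample)
for i, block in enumerate(blocks):
    print(i, repr(block))
```

A single newline inside a paragraph does not start a new block; only a blank line does.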

Once the new novel is correctly formatted as a plain text file, it should follow this standard file naming convention:

[first letter of first name]+[full lastname]_[abbreviated book title].txt

Examples:

* fdouglass_narrativelifeofaslave.txt
* fscottfitzgerald_thegreatgatsby.txt
* vwoolf_mrsdalloway.txt
* homer-ewilson_odyssey.txt (trans. E.Wilson)
* mproust-mtreharne_3guermantesway.txt (Book 3, trans. M.Treharne)
* staugustine_confessions9end.txt (Up to and incl. Book 9)

Note the optional author suffix (-translator) and the optional title suffix (selected chapters/books).
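
The naming convention above can be checked mechanically. The sketch below is one possible reading of the rule, assuming names use only lowercase letters and digits, with an optional `-translator` part after the author. The regular expression and the helper `parse_novel_filename` are illustrative assumptions, not part of SentimentArcs.

```python
import re

# Assumed pattern: author[-translator]_title.txt, lowercase letters and digits only
FILENAME_RE = re.compile(
    r'^(?P<author>[a-z]+)(?:-(?P<translator>[a-z]+))?_(?P<title>[a-z0-9]+)\.txt$'
)

def parse_novel_filename(filename):
    """Return the parts of a correctly named novel file, or None if the name does not match."""
    match = FILENAME_RE.match(filename)
    return match.groupdict() if match else None

for name in ['vwoolf_mrsdalloway.txt', 'homer-ewilson_odyssey.txt', 'Mrs Dalloway.txt']:
    print(name, '->', parse_novel_filename(name))
```

A name that returns `None` (for example one containing spaces or capital letters) should be renamed before it is added to the corpus.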

***

* **Adding New Novels: Add file to subdirectory and Update this Notebook**

Once you have a cleaned text file named according to the standard rule above, you must move that file to the subdirectory of all input novels and update the global variable in this notebook that defines which novels to analyze.

First, copy your cleaned text file to the subdirectory containing all novels read by this notebook. This subdirectory is defined by the program variable 'subdir_novels', with the default value './in1_novels/'.

Second, update the program variable 'novels_dt'. This is a dictionary that follows the pattern below:
```
novels_dt = {
  'cdickens_achristmascarol':['A Christmas Carol by Charles Dickens ',1843,1399],
  # ... one entry per novel ...
}
```
Here the first string (the dictionary key) must match the filename root without the '.txt' suffix (e.g. cdickens_achristmascarol). The dictionary value after the ':' is a list of three elements:

* A nicely formatted string of the form '(title) by (full first and last name of author)'. This human-friendly string is used to label plots and saved files.

* The (publication year) and the (sentence count). Both are optional, but use the placeholder value 0 if unknown. These are intended for future reference and analytics.

* Your future self will thank you if you insert new novels into the 'novels_dt' in alphabetic order for faster and more accurate reference.
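
As an illustration of the rules above, the sketch below builds a small `novels_dt`, sorts it alphabetically by key, and checks each entry's shape. The helper `validate_novels_dt` and the second entry are hypothetical examples, not code from SentimentArcs.

```python
# Hypothetical two-entry dictionary; 0 is the placeholder for an unknown sentence count
novels_dt = {
    'vwoolf_mrsdalloway': ['Mrs Dalloway by Virginia Woolf', 1925, 0],
    'cdickens_achristmascarol': ['A Christmas Carol by Charles Dickens', 1843, 1399],
}

def validate_novels_dt(novels_dt):
    """Return a list of human-readable problems found in the dictionary (empty if none)."""
    problems = []
    for key, value in novels_dt.items():
        if key.endswith('.txt'):
            problems.append(f"{key}: key must not include the '.txt' suffix")
        if not (isinstance(value, list) and len(value) == 3):
            problems.append(f'{key}: value must be a list of three elements')
            continue
        if ' by ' not in str(value[0]):
            problems.append(f"{key}: label should look like '(title) by (author)'")
    return problems

# Keep entries in alphabetical order by key (dicts preserve insertion order in Python 3.7+)
novels_dt = dict(sorted(novels_dt.items()))
print(list(novels_dt))
print(validate_novels_dt(novels_dt))
```

An empty list from `validate_novels_dt` means every entry has the expected three-element shape.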

***

* **How to Execute SentimentArcs Notebooks:**

This is a Jupyter Notebook created to run on Google's free Colab service using only a browser and your existing Google email account. We chose Google Colab because it is relatively fast, free, and easy to use, and it makes collaboration as simple as web browsing.

A few reminders about using Jupyter Notebooks in general and SentimentArcs in particular:

* All cells must be run ***in order*** as later code cells often depend upon the output of earlier code cells

* ***Cells that take more time to execute*** (> 1 min) usually begin with *%%time* which outputs the *total execution time* of the last run.  This timing output is deleted and recalculated each time the code cell is executed.

* **[OPTIONAL]** at the top of a cell indicates you *may* change a setting in that cell to customize behavior.

* **[CUSTOMIZE]** at the top of a cell indicates you *must* change a setting in that cell.

* **[RESTART REQUIRED]** at the top of a cell indicates you *may* see a *[RESTART REQUIRED] button* at the end of the output. If you see this button, you must select [Runtime]->[Restart Runtime] from the top menubar.

* **[INPUT REQUIRED]** at the top of a cell indicates you will be required to take some action for execution to proceed, usually by clicking a button or entering the response to a prompt.

In code cells, a top comment prefixed with # [OPTIONAL]: means you can change a setting to customize behavior, while the prefix # [CUSTOMIZE]: means you MUST set or change a setting.

* SentimentArcs divides workflow into a series of chronological Jupyter Notebooks that must be run in order. Here is an overview of the workflow:

***

**SentimentArcs Notebooks Workflow**
1. Notebook #1: Preprocess Text
2. Notebook #2: Compute Sentiment Values (Simple Models/CPUs)
3. Notebook #3: Compute Sentiment Values (Complex Models/GPUs)
4. Notebook #4: Combine all Sentiment Values, perform Time Series analysis, and extract Crux points and surrounding text

If you are unfamiliar with setting up and using Google Colab or Jupyter Notebooks, here is a series of resources to quickly bring you up to speed. If you are using SentimentArcs with the Cambridge University Press Elements textbook, there is also a series of videos by Profs. Elkins and Chun stepping you through these notebooks.

***

**Additional Resources and Tutorials**


**Google Colab and Jupyter Resources:**

* Coming...
* [IPython, Python Data Science Handbook by Jake VanderPlas](https://jakevdp.github.io/PythonDataScienceHandbook/01.00-ipython-beyond-normal-python.html) 

**Cambridge University Press Videos:**

* Coming...




# **[STEP 1] Configuration and Setup**



## Configure Jupyter Notebook

In [None]:
# Ignore warnings

import warnings
warnings.filterwarnings('ignore')

In [None]:
# Configure Jupyter

# Enable multiple outputs from one code cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

from IPython.display import display
from IPython.display import Image
from ipywidgets import widgets, interactive

## [INPUT] Connect Google gDrive to this Jupyter Notebook

In [None]:
# [INPUT REQUIRED]: Authorize access to Google gDrive

# Connect this Notebook to your permanent Google Drive
#   so all generated output is saved to permanent storage there

try:
  from google.colab import drive
  IN_COLAB = True
except ImportError:
  # google.colab can only be imported inside a Colab runtime
  IN_COLAB = False

if IN_COLAB:
  print("Attempting to attach your Google gDrive to this Colab Jupyter Notebook")
  drive.mount('/gdrive')
else:
  print("Not running in Google Colab: skipping the Google gDrive mount")

Attempting to attach your Google gDrive to this Colab Jupyter Notebook
Mounted at /gdrive


In [None]:
# [CUSTOMIZE]: Change the text after the Unix '%cd ' command below (change directory)
#              to match the full path to your gDrive subdirectory, which should be the 
#              root directory cloned from the SentimentArcs github repo.

# NOTE: Make sure this subdirectory already exists and there are 
#       no typos, spaces or illegal characters (e.g. periods) in the full path after %cd

# NOTE: To be safe, each directory name in the path should begin with an upper or
#         lowercase letter and contain only letters, numbers and underscores ('_').
#         Make sure your full path after %cd obeys this constraint or errors may appear.



# Step #1: Get full path to SentimentArcs subdir on gDrive
# =======
#@markdown **Accept default path on gDrive or Enter new one:**

Path_to_SentimentArcs = "/gdrive/MyDrive/cdh/sentiment_arcs/" #@param ["/gdrive/MyDrive/sentiment_arcs/"] {allow-input: true}

#@markdown (e.g. /gdrive/MyDrive/research/sentiment_arcs/)



# Step #2: Move to Parent directory of Sentiment_Arcs
# =======
parentdir_sentiment_arcs = '/'.join(Path_to_SentimentArcs.split('/')[:-2])
print(f'subdir_parent: {parentdir_sentiment_arcs}')
%cd $parentdir_sentiment_arcs


# Step #3: If project sentiment_arcs subdir does not exist, 
#          clone it from github
# =======
import os

if not os.path.isdir('sentiment_arcs'):
  # !git clone https://github.com/jon-chun/sentiment_arcs.git

  # Test on open access github repo
  !git clone https://github.com/jon-chun/nabokov_palefire.git


# Step #4: Change into sentiment_arcs subdir
# =======
# %cd ./sentiment_arcs
# Test on open access github repo
%cd ./nabokov_palefire

# Step #5: Confirm contents of sentiment_arcs subdir
# =======
!ls


subdir_parent: /gdrive/MyDrive/cdh
/gdrive/MyDrive/cdh
fatal: destination path 'nabokov_palefire' already exists and is not an empty directory.
/gdrive/MyDrive/cdh/nabokov_palefire
Foreword_Text.txt  palefire_clean_parts  Poem.txt  README.md
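
The parent-directory step in the cell above relies on the path ending in a trailing slash. As a hedged alternative, the sketch below derives the parent and project directory names with `pathlib`, which gives the same result with or without the trailing slash. The helper `split_project_path` is a hypothetical name introduced only for illustration.

```python
import os
from pathlib import PurePosixPath

def split_project_path(path_to_project):
    """Return (parent_dir, project_dir_name) for a path with or without a trailing slash."""
    path = PurePosixPath(path_to_project)
    return str(path.parent), path.name

parent_dir, project_name = split_project_path('/gdrive/MyDrive/cdh/sentiment_arcs/')
print(parent_dir)
print(project_name)

# Clone only when the project directory is missing. Use 'not' for this test:
# '~' is bitwise negation, so ~True is -2 and ~False is -1, and both are truthy.
needs_clone = not os.path.isdir(os.path.join(parent_dir, project_name))
print(needs_clone)
```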


In [None]:
# [VERIFY]: Ensure that all the manually preprocessed novels are in plain text
#   files and that file names are formatted correctly

%cd ../sentiment_arcs
!pwd
!ls ./text_raw

/gdrive/MyDrive/cdh/sentiment_arcs
/gdrive/MyDrive/cdh/sentiment_arcs
finance_text_raw      novels_text_raw	   social_text_raw
finance_text_ref_raw  novels_text_ref_raw  social_text_ref_raw


### Define Directory Tree Structure

In [None]:
#@markdown **Sentiment Arcs Directory Structure** \
#@markdown \
#@markdown **1. Input Directories:** \
#@markdown (a) Raw textfiles in subdir: ./text_raw/(text_type)/  \
#@markdown (b) Cleaned textfiles in subdir: ./text_clean/(text_type)/ \
#@markdown \
#@markdown **2. Output Directories** \
#@markdown (1) Raw Sentiment time series datafiles and plots in subdir: ./sentiment_raw/(text_type) \
#@markdown (2) Cleaned Sentiment time series datafiles and plots in subdir: ./sentiment_clean/(text_type) \
#@markdown \
#@markdown **Which type of texts are you cleaning?** \

Text_Type = "novels" #@param ["novels", "social_media", "finance"]

Corpus = "new_texts" #@param ["reference_corpora", "new_texts"]

#@markdown Please check that the required textfiles and datafiles exist in the correct subdirectories before continuing.


In [None]:
# Create Directory CONSTANTS based On Document Type

if Corpus == "new_texts":
  Corpus_Type = "new"
else:
  Corpus_Type = "ref"

SUBDIR_TEXT_RAW = f"./text_raw/{Text_Type}_text_{Corpus_Type}_raw/"
SUBDIR_TEXT_CLEAN = f"./text_clean/{Text_Type}_text_{Corpus_Type}_clean/"
SUBDIR_SENTIMENT_RAW = f"./sentiment_raw/{Text_Type}_sentiment_{Corpus_Type}_raw/"
SUBDIR_SENTIMENT_CLEAN = f"./sentiment_clean/{Text_Type}_sentiment_{Corpus_Type}_clean/"
SUBDIR_PLOTS = f"./plots/{Text_Type}/plots/"

# Verify Directory Structure

print('Verify the Directory Structure:\n')
print('-------------------------------\n')

print(f'             [Text Type]: {Text_Type}\n')
print(f'       [SUBDIR_TEXT_RAW]: {SUBDIR_TEXT_RAW}\n')
print(f'     [SUBDIR_TEXT_CLEAN]: {SUBDIR_TEXT_CLEAN}\n')
print(f'  [SUBDIR_SENTIMENT_RAW]: {SUBDIR_SENTIMENT_RAW}\n')
print(f'[SUBDIR_SENTIMENT_CLEAN]: {SUBDIR_SENTIMENT_CLEAN}\n')
print(f'          [SUBDIR_PLOTS]: {SUBDIR_PLOTS}\n')

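
The directory constants above depend on `Text_Type` and `Corpus` being defined by the earlier form cell, so running this cell first raises a `NameError`. A minimal sketch of the same naming scheme wrapped in a function follows; the function `build_subdirs` is a hypothetical helper, and the existence check is an added convenience rather than something the notebook does.

```python
import os

def build_subdirs(text_type, corpus):
    """Return the SentimentArcs subdirectory paths for a text type and corpus choice."""
    corpus_type = 'new' if corpus == 'new_texts' else 'ref'
    return {
        'SUBDIR_TEXT_RAW': f'./text_raw/{text_type}_text_{corpus_type}_raw/',
        'SUBDIR_TEXT_CLEAN': f'./text_clean/{text_type}_text_{corpus_type}_clean/',
        'SUBDIR_SENTIMENT_RAW': f'./sentiment_raw/{text_type}_sentiment_{corpus_type}_raw/',
        'SUBDIR_SENTIMENT_CLEAN': f'./sentiment_clean/{text_type}_sentiment_{corpus_type}_clean/',
        'SUBDIR_PLOTS': f'./plots/{text_type}/plots/',
    }

subdirs = build_subdirs('novels', 'new_texts')
for name, path in subdirs.items():
    # Flag directories that do not exist yet relative to the current working directory
    status = 'ok' if os.path.isdir(path) else 'MISSING'
    print(f'{name:>24}: {path} [{status}]')
```

Passing the two settings as arguments makes the dependency on the form cell explicit.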

### Read YAML Configuration for Corpus and Models 

In [None]:
!pip install pyyaml
import yaml



In [None]:
# Read SentimentArcs YAML Config Files for Different Corpora Types(3) and Text Files Details

# Read SentimentArcs YAML Config Files on Models

# Model in SentimentArcs Ensemble
with open("./config/models_ref_info.yaml", "r") as stream:
  try:
    models_titles_dt = yaml.safe_load(stream)
  except yaml.YAMLError as exc:
    print(exc)

if Text_Type == 'novels':

  # Novel Text Files
  if Corpus == 'new_texts':
    # Corpus of New Novels
    with open("./config/novels_new_info.yaml", "r") as stream:
      try:
        corpus_titles_dt = yaml.safe_load(stream)
      except yaml.YAMLError as exc:
        print(exc)
  else:
    # Corpus of Reference Novels
    with open("./config/novels_ref_info.yaml", "r") as stream:
      try:
        corpus_titles_dt = yaml.safe_load(stream)
      except yaml.YAMLError as exc:
        print(exc)    

elif Text_Type == 'finance':

  # Finance Text Files
  if Corpus == 'new_texts':
    # Corpus of New Finance Texts
    with open("./config/finance_new_info.yaml", "r") as stream:
      try:
        corpus_titles_dt = yaml.safe_load(stream)
      except yaml.YAMLError as exc:
        print(exc)
  else:
    # Corpus of Reference Finance Texts
    with open("./config/finance_ref_info.yaml", "r") as stream:
      try:
        corpus_titles_dt = yaml.safe_load(stream)
      except yaml.YAMLError as exc:
        print(exc)

elif Text_Type == 'social_media':

  # Social Media Text Files
  if Corpus == 'new_texts':
    # Corpus of New Social Media Texts
    with open("./config/social_new_info.yaml", "r") as stream:
      try:
        corpus_titles_dt = yaml.safe_load(stream)
      except yaml.YAMLError as exc:
        print(exc)
  else:
    # Corpus of Reference Social Media Texts
    with open("./config/social_ref_info.yaml", "r") as stream:
      try:
        corpus_titles_dt = yaml.safe_load(stream)
      except yaml.YAMLError as exc:
        print(exc)

else:
  
  print(f"ERROR: Illegal Text_Type: {Text_Type}\n")

print(f'Corpus Titles Dictionary =')
corpus_titles_dt.keys()

print(f'\n\nThe Corpus Titles contains [{len(corpus_titles_dt.keys())} {Text_Type}] textfiles ')
print(f'\nFirst Text in Corpus:')
print(corpus_titles_dt[next(iter(corpus_titles_dt))])

Corpus Titles Dictionary =


dict_keys(['scollins_thehungergames1', 'cmieville_thecityandthecity'])



The Corpus Titles contains [2 novels] textfiles 

First Text in Corpus:
['The Hunger Games 1 by Suzanne Collins ', 2008, 0]


### Define Working Corpus

In [None]:
# This contains the titles and metadata for each Text in the Corpus

corpus_titles_dt

{'cmieville_thecityandthecity': ['The City and The City by China Mieville',
  2009,
  0],
 'scollins_thehungergames1': ['The Hunger Games 1 by Suzanne Collins ',
  2008,
  0]}

### Normalize Sentiment Analysis Model Names

In [None]:
# Mapping to standardize col/model names

cols_map_dt = {'syuzhet':'syuzhetr',
               'huliu':'bing_sentimentr',
               'sentiword':'sentiword_sentimentr',
               'senticnet':'senticnet_sentimentr',
               'lmcd':'lmcd_sentimentr',
               'jockers':'jockers_sentimentr',
               'jockers_rinker':'jockersrinker_sentimentr'
               }

cols_missing_ls = ['nrc_sentimentr']
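
A mapping like this is typically applied when renaming columns. The sketch below assumes the keys are the incoming names and the values are the standardized names; that direction is an assumption, as are the helper `standardize_names` and the sample column list.

```python
# Two entries copied from the mapping above, for a self-contained example
cols_map_dt = {'syuzhet': 'syuzhetr',
               'huliu': 'bing_sentimentr'}

def standardize_names(names, mapping):
    """Replace each name found in the mapping; leave unknown names unchanged."""
    return [mapping.get(name, name) for name in names]

raw_cols = ['sent_no', 'syuzhet', 'huliu', 'vader']  # hypothetical column names
print(standardize_names(raw_cols, cols_map_dt))
```

With a pandas DataFrame, the same effect could be obtained with `df.rename(columns=cols_map_dt)`, which also leaves unmapped columns unchanged.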

In [None]:
# Review All Sentiment Analysis Models in the Ensemble

models_titles_dt

{'AutoGluon_Text': ['autogluon', 'tradml', 'autogluon_text'],
 'BERT_2IMDB': ['imdb2way', 'transformer', 'bert'],
 'BERT_Dual_Coding': ['hinglish', 'transformer', 'bert'],
 'BERT_Multilingual': ['nlptown', 'transformer', 'bert'],
 'BERT_Yelp': ['yelp', 'transformer', 'bert'],
 'CNN_DNN': ['cnn', 'dnn', 1315937],
 'Distilled_BERT': ['huggingface', 'transformer', 'bert'],
 'FLAML_AutoML': ['flaml', 'tradml', 'flaml'],
 'Fully_Connected_Network': ['fcn', 'dnn', 6287671],
 'HyperOpt_CNN_Flair_AutoML': ['flair', 'dnn', 0],
 'LSTM_DNN': ['lstm', 'dnn', 7109089],
 'Logistic_Regression': ['logreg', 'tradml', 'scikit'],
 'Logistic_Regression_CV': ['logreg_cv', 'tradml', 'scikit'],
 'Multilingual_CNN_Stanza_AutoML': ['stanza', 'dnn', 0],
 'Multinomial_Naive_Bayes': ['multinb', 'tradml', 'scikit'],
 'Pattern': ['pattern', 'lexicon', 2918],
 'Random_Forest': ['rf', 'tradml', 'scikit'],
 'RoBERTa_Large_15DB': ['roberta15lg', 'transformer', 'roberta'],
 'RoBERTa_XML_8Language': ['robertaxml8lang', '

## Install Libraries

In [None]:
# [RESTART REQUIRED] May be required after installing this library

# Library to Read R datafiles from within Python programs

!pip install pyreadr

Collecting pyreadr
  Downloading pyreadr-0.4.4-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (361 kB)

In [None]:
# [RESTART REQUIRED] May be required after installing this library

# Powerful Industry-Grade NLP Library

!pip install -U spacy

Collecting spacy
  Downloading spacy-3.2.2-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (6.0 MB)
Collecting typer<0.5.0,>=0.3.0
  Downloading typer-0.4.0-py3-none-any.whl (27 kB)
Collecting spacy-legacy<3.1.0,>=3.0.8
  Downloading spacy_legacy-3.0.9-py2.py3-none-any.whl (20 kB)
Collecting srsly<3.0.0,>=2.4.1
  Downloading srsly-2.4.2-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (451 kB)
Collecting thinc<8.1.0,>=8.0.12
  Downloading thinc-8.0.13-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (628 kB)
Collecting catalogue<2.1.0,>=2.0.6
  Downloading catalogue-2.0.6-py3-none-any.whl (17 kB)
Collecting langcodes<4.0.0,>=3.2.0
  Downloading langcodes-3.3.0-py3-none-any.whl (181 kB)
Collecting spacy-logge

In [None]:
# [RESTART REQUIRED] May be required after installing this library

# NLP Library to Simplify Cleaning Text

!pip install texthero

Collecting texthero
  Downloading texthero-1.1.0-py3-none-any.whl (24 kB)
Collecting unidecode>=1.1.1
  Downloading Unidecode-1.3.3-py3-none-any.whl (235 kB)
Collecting spacy<3.0.0
  Downloading spacy-2.3.7-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (10.4 MB)
Collecting nltk>=3.3
  Downloading nltk-3.7-py3-none-any.whl (1.5 MB)
Collecting regex>=2021.8.3
  Downloading regex-2022.1.18-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (748 kB)
Collecting catalogue<1.1.0,>=0.0.7
  Downloading catalogue-1.0.0-py2.py3-none-any.whl (7.7 kB)
Collecting thinc<7.5.0,>=7.4.1
  Downloading thinc-7.4.5-cp37-cp37m-manylinux2014_x86_64.whl (1.0 MB)
Collecting srsly<1.1.0,>=1.0.

In [None]:
# [RESTART REQUIRED] May be required after installing this library

# Advanced Sentence Boundary Detection Python Library
#   for splitting raw text into grammatical sentences
#   (can be difficult due to common motifs like Mr., ..., ?!?, etc)

!pip install pysbd

Collecting pysbd
  Downloading pysbd-0.3.4-py3-none-any.whl (71 kB)
Installing collected packages: pysbd
Successfully installed pysbd-0.3.4


In [None]:
# [RESTART REQUIRED] May be required after installing this library

# Python Library to expand contractions to aid in Sentiment Analysis
#   (e.g. aren't -> are not, can't -> can not)

!pip install contractions

Collecting contractions
  Downloading contractions-0.1.66-py2.py3-none-any.whl (8.0 kB)
Collecting textsearch>=0.0.21
  Downloading textsearch-0.0.21-py2.py3-none-any.whl (7.5 kB)
Collecting pyahocorasick
  Downloading pyahocorasick-1.4.4-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (106 kB)
Collecting anyascii
  Downloading anyascii-0.3.0-py3-none-any.whl (284 kB)
Installing collected packages: pyahocorasick, anyascii, textsearch, contractions
Successfully installed anyascii-0.3.0 contractions-0.1.66 pyahocorasick-1.4.4 textsearch-0.0.21


In [None]:
# [RESTART REQUIRED] May be required after installing this library

# Library for dealing with Emoticons (punctuation) and Emojis (icons)

!pip install emot

Collecting emot
  Downloading emot-3.1-py3-none-any.whl (61 kB)
Installing collected packages: emot
Successfully installed emot-3.1


## Load Libraries

In [None]:
# Core Python Libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [None]:
# Additional Python Libraries

import re
import string

from datetime import datetime
import os
import sys
import glob
import json
from pathlib import Path
from copy import deepcopy

In [None]:
# NLP Specialized Libraries

# More advanced Sentence Tokenizer Object from PySBD

from pysbd.utils import PySBDFactory

In [None]:
# Simpler Sentence Tokenizer Object from NLTK

import nltk 
from nltk.tokenize import sent_tokenize

In [None]:
# Download required NLTK tokenizer data

nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
# Instantiate and Import Text Cleaning Objects into Global Variable space

import texthero as hero
from texthero import preprocessing

import contractions

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
# Instantiate and Import Text Cleaning Objects into Global Variable space

import emot 
emot_obj = emot.core.emot() 

from emot.emo_unicode import UNICODE_EMOJI, EMOTICONS_EMO

# Test
text = "I love python ☮ 🙂 ❤ :-) :-( :-)))" 
emot_obj.emoticons(text)

{'flag': True,
 'location': [[20, 23], [24, 27], [28, 33]],
 'mean': ['Happy face smiley',
  'Frown, sad, andry or pouting',
  'Very very Happy face or smiley'],
 'value': [':-)', ':-(', ':-)))']}

In [None]:
# Import spaCy, language model and setup minimal pipeline

import spacy

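nlp = spacy.load('en_core_web_sm', disable=['tagger', 'parser', 'ner'])
# nlp.max_length = 1027203
nlp.max_length = 2054406  # raise the character limit so long texts can be processed

# Add a rule-based sentence splitter, since the parser is disabled above
# https://stackoverflow.com/questions/51372724/how-to-speed-up-spacy-lemmatization
if spacy.__version__.startswith('2.'):
  nlp.add_pipe(nlp.create_pipe('sentencizer'))  # spaCy v2 API
else:
  nlp.add_pipe('sentencizer')  # spaCy v3+ API takes the component name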

# Test some edge cases, try to find examples that break spaCy
doc= nlp(u"Apples and oranges are similar. Boots and hippos aren't.")
print('\n')
print("Token Attributes: \n", "token.text, token.pos_, token.tag_, token.dep_, token.lemma_")
for token in doc:
    # Print the text and the predicted part-of-speech tag
    print("{:<12}{:<12}{:<12}{:<12}{:<12}".format(token.text, token.pos_, token.tag_, token.dep_, token.lemma_))

print('\nAnother Test:\n')
doc = nlp(u"Apples and oranges are similar. Boots and hippos aren't.")

for token in doc:
    print("{:<12}{:<30}{:<12}".format(token.text, token.lemma, token.lemma_))



Token Attributes: 
 token.text, token.pos_, token.tag_, token.dep_, token.lemma_
Apples                                          Apples      
and                                             and         
oranges                                         orange      
are                                             be          
similar                                         similar     
.                                               .           
Boots                                           Boots       
and                                             and         
hippos                                          hippo       
are         AUX         VBP                     be          
n't         PART        RB                      not         
.                                               .           

Another Test:

Apples      9297668116247400838           Apples      
and         2283656566040971221           and         
oranges     2208928596161743350           orange      
are 

In [None]:
# [CUSTOMIZE]: spaCy stopword list
#   Can add/remove specific stopwords

stops = nlp.Defaults.stop_words
','.join(stops)
print('\n')
type(stops)
stopwords_ls = list(stops)
print('\n')
type(stopwords_ls)


"become,except,otherwise,could,for,due,along,because,twenty,beforehand,keep,whoever,fifteen,across,out,many,you,might,us,'d,during,she,part,an,ever,anyway,others,those,was,about,once,most,our,on,be,at,none,themselves,into,take,'s,whence,say,whatever,ca,whenever,after,rather,each,since,thru,mine,give,'m,does,or,would,get,sixty,from,beyond,very,much,his,thereupon,however,himself,again,latterly,formerly,never,perhaps,thence,neither,both,namely,among,somehow,whose,already,the,will,over,behind,towards,he,enough,yourselves,’s,per,‘s,have,before,this,often,of,down,front,therein,ten,‘ll,several,‘ve,mostly,‘d,becoming,doing,forty,'ll,which,fifty,their,nevertheless,yet,eight,n't,must,now,own,where,latter,else,been,am,but,less,they,anywhere,also,same,every,twelve,whom,no,not,them,n’t,please,somewhere,is,her,with,almost,seems,may,yourself,herself,too,him,anything,seem,beside,side,then,’re,other,‘re,six,whole,therefore,really,whereby,put,even,show,re,former,nothing,whereafter,how,upon,well,back,eve





set





list

In [None]:
# Process

# https://stackoverflow.com/questions/45605946/how-to-do-text-pre-processing-using-spacy

# import spacy #load spacy
# nlp = spacy.load("en", disable=['parser', 'tagger', 'ner'])

# stops = stopwords.words("english")

stops = nlp.Defaults.stop_words
# type(stops)
stopwords_ls = list(stops)


def text2lemmas(comment, lowercase, remove_stopwords):
    """Return the space-joined lemmas of `comment`, optionally lowercased and without stopwords."""
    if lowercase:
        comment = comment.lower()
    doc = nlp(comment)
    lemmatized = []
    for token in doc:
        lemma = token.lemma_.strip()
        # Skip empty lemmas (e.g. whitespace tokens) and, if requested, stopwords
        if lemma and not (remove_stopwords and lemma in stopwords_ls):
            lemmatized.append(lemma)
    return " ".join(lemmatized)


# Test
# Data['Text_After_Clean'] = Data['Text'].apply(text2lemmas, lowercase=True, remove_stopwords=True)

text2lemmas('I am going to start studying more often and working harder.', lowercase=True, remove_stopwords=False)

'i be go to start study much often and work hard .'

## Setup Matplotlib Style

* https://matplotlib.org/stable/tutorials/introductory/customizing.html

In [None]:
from cycler import cycler

colors = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd', '#8c564b', '#e377c2', '#7f7f7f', '#bcbd22', '#17becf']   
linestyles = ['-', '--', ':', '-.','-', '--', ':', '-.','-', '--']

cycle = plt.cycler("color", colors) + plt.cycler("linestyle", linestyles)

# View previous matplotlib configuration
print('\n Old Matplotlib Configuration Settings:\n')
# plt.rc.show
print('\n\n')

# Update and view new matplotlib configuration
print('\n New Matplotlib Configuration Settings:\n')
myparams = {'axes.prop_cycle': cycle}
plt.rcParams.update(myparams)

plt.rcParams["axes.titlesize"] = 16
plt.rcParams['figure.figsize'] = 20,10
plt.rcParams["legend.fontsize"] = 10
plt.rcParams["xtick.labelsize"] = 12
plt.rcParams["ytick.labelsize"] = 12
plt.rcParams["axes.labelsize"] = 12



 Old Matplotlib Configuration Settings:





 New Matplotlib Configuration Settings:



In [None]:
"""
import matplotlib.colors as mcolors

mcolors.TABLEAU_COLORS

all_named_colors = {}
all_named_colors.update(mcolors.TABLEAU_COLORS)

print('\n')
all_named_colors.values()
""";

## Setup Seaborn Style

In [None]:
# View previous seaborn configuration
print('\n Old Seaborn Configuration Settings:\n')
sns.axes_style()
print('\n\n')

# Update and View new seaborn configuration
print('\n New Seaborn Configuration Settings:\n')
# sns.set_style('white')
sns.set_context('paper')
sns.set_style('white')
sns.set_palette('tab10')

# Change defaults
# sns.set(style='white', context='talk', palette='tab10')


 Old Seaborn Configuration Settings:



{'axes.axisbelow': 'line',
 'axes.edgecolor': 'black',
 'axes.facecolor': 'white',
 'axes.grid': False,
 'axes.labelcolor': 'black',
 'axes.spines.bottom': True,
 'axes.spines.left': True,
 'axes.spines.right': True,
 'axes.spines.top': True,
 'figure.facecolor': (1, 1, 1, 0),
 'font.family': ['sans-serif'],
 'font.sans-serif': ['DejaVu Sans',
  'Bitstream Vera Sans',
  'Computer Modern Sans Serif',
  'Lucida Grande',
  'Verdana',
  'Geneva',
  'Lucid',
  'Arial',
  'Helvetica',
  'Avant Garde',
  'sans-serif'],
 'grid.color': '#b0b0b0',
 'grid.linestyle': '-',
 'image.cmap': 'viridis',
 'lines.solid_capstyle': 'projecting',
 'patch.edgecolor': 'black',
 'patch.force_edgecolor': False,
 'text.color': 'black',
 'xtick.bottom': True,
 'xtick.color': 'black',
 'xtick.direction': 'out',
 'xtick.top': False,
 'ytick.color': 'black',
 'ytick.direction': 'out',
 'ytick.left': True,
 'ytick.right': False}





 New Seaborn Configuration Settings:



In [None]:
# Seaborn: Set Theme (the first argument is the plotting context, which scales fonts and lines)

sns.set_theme(context='paper')  # paper, notebook, talk, poster


# Seaborn: Set Context
# sns.set_context("notebook")



# Seaborn: Set Style

# sns.set_style('ticks') # darkgrid, whitegrid, dark, white, and ticks

In [None]:
# Seaborn: Current Palette (seaborn's default palette is 'deep')

sns.color_palette()

In [None]:
# Seaborn: Set to High-Contrast Palette (more Vision Impaired Friendly)

sns.set_palette('tab10')
sns.color_palette()

In [None]:
plt.style.available

['Solarize_Light2',
 '_classic_test_patch',
 'bmh',
 'classic',
 'dark_background',
 'fast',
 'fivethirtyeight',
 'ggplot',
 'grayscale',
 'seaborn',
 'seaborn-bright',
 'seaborn-colorblind',
 'seaborn-dark',
 'seaborn-dark-palette',
 'seaborn-darkgrid',
 'seaborn-deep',
 'seaborn-muted',
 'seaborn-notebook',
 'seaborn-paper',
 'seaborn-pastel',
 'seaborn-poster',
 'seaborn-talk',
 'seaborn-ticks',
 'seaborn-white',
 'seaborn-whitegrid',
 'tableau-colorblind10']

In [None]:
plt.style.use('seaborn-whitegrid')

## Define Global Parameters

In [None]:
# Define minimum paragraph and sentence lengths for data cleaning;
#   any paragraph or sentence shorter than these minimums will be ignored/blanked

MIN_PARAG_LEN = 10
MIN_SENT_LEN = 3
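
How these thresholds might be applied is sketched below. The unit is not stated in this notebook, so the sketch assumes lengths are measured in characters after stripping whitespace; the helper `blank_short_items` and the sample paragraphs are hypothetical.

```python
MIN_PARAG_LEN = 10
MIN_SENT_LEN = 3

def blank_short_items(items, min_len):
    """Replace any string shorter than min_len characters (after stripping) with ''."""
    return [item if len(item.strip()) >= min_len else '' for item in items]

paragraphs = ['* * *', 'The rain fell all night.', 'I.']  # hypothetical sample blocks
print(blank_short_items(paragraphs, MIN_PARAG_LEN))
```

Blanking an item instead of dropping it keeps list positions aligned with the original text, which can help when tracing a cleaned block back to its source.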

In [None]:
# Master Data Structure: Store all texts in Dictionary of DataFrames (one per novel)

corpus_texts_dt = {}
corpus_titles_dt

{'cmieville_thecityandthecity': ['The City and The City by China Mieville',
  2009,
  0],
 'scollins_thehungergames1': ['The Hunger Games 1 by Suzanne Collins ',
  2008,
  0]}

## **Utility Functions**

### Generate Convenient Data Lists

In [None]:
# Verify in SentimentArcs root directory
!pwd

print('\n')

# Count the number of texts in the corpus
print('Number of texts in the corpus:')
!ls $SUBDIR_TEXT_RAW | wc -l

/gdrive/MyDrive/cdh/sentiment_arcs


Number of texts in the corpus:
2


In [None]:
# Derive lists of the texts in the corpus: (a) keys and (b) full author and title strings

corpus_titles_ls = list(corpus_titles_dt.keys())
print(f'\nCorpus Keys:')
for akey in corpus_titles_ls:
  print(f'  {akey}')
print('\n')

print('\nNatural Text Titles:')
texts_full_ls = [x[0] for x in corpus_titles_dt.values()]
for atitle in texts_full_ls:
  print(f'  {atitle}')


Corpus Keys:
  scollins_thehungergames1
  cmieville_thecityandthecity



Natural Text Titles:
  The Hunger Games 1 by Suzanne Collins 
  The City and The City by China Mieville


In [None]:
# Convenience lists for each type of model

# Lexicon Models
models_lexicon_ls = [x[0] for x in models_titles_dt.values() if x[1] == 'lexicon']
print(f'\nThere are {len(models_lexicon_ls)} Lexicon Models')
for i,amodel in enumerate(models_lexicon_ls):
  print(f'  Lexicon Model #{i}: {amodel}')

# Heuristic Models
models_heuristic_ls = [x[0] for x in models_titles_dt.values() if x[1] == 'heuristic']
print(f'\nThere are {len(models_heuristic_ls)} Heuristic Models')
for i,amodel in enumerate(models_heuristic_ls):
  print(f'  Heuristic Model #{i}: {amodel}')

# Traditional ML Models
models_tradml_ls = [x[0] for x in models_titles_dt.values() if x[1] == 'tradml']
print(f'\nThere are {len(models_tradml_ls)} Traditional ML Models')
for i,amodel in enumerate(models_tradml_ls):
  print(f'  Traditional ML Model #{i}: {amodel}')

# DNN Models
models_dnn_ls = [x[0] for x in models_titles_dt.values() if x[1] == 'dnn']
print(f'\nThere are {len(models_dnn_ls)} DNN Models')
for i,amodel in enumerate(models_dnn_ls):
  print(f'  DNN Model #{i}: {amodel}')

# Transformer Models
models_transformer_ls = [x[0] for x in models_titles_dt.values() if x[1] == 'transformer']
print(f'\nThere are {len(models_transformer_ls)} Transformer Models')
for i,amodel in enumerate(models_transformer_ls):
  print(f'  Transformer Model #{i}: {amodel}')

# All Models

models_ensemble_ls = models_lexicon_ls + models_heuristic_ls + models_tradml_ls + models_dnn_ls + models_transformer_ls

print(f'\nThere are {len(models_ensemble_ls)} Total Models:')
for i,amodel in enumerate(models_ensemble_ls):
  print(f'  Model #{i:>2}: {amodel}')

print(f'\nThere are {len(models_ensemble_ls)} Total Models (+1 for Ensemble Mean)')


There are 6 Lexicon Models
  Lexicon Model #0: pattern
  Lexicon Model #1: sentimentr
  Lexicon Model #2: afinn
  Lexicon Model #3: bing
  Lexicon Model #4: nrc
  Lexicon Model #5: syuzhetr

There are 7 Heuristic Models
  Heuristic Model #0: bing_sentimentr
  Heuristic Model #1: jockers_sentimentr
  Heuristic Model #2: jockersrinker_sentimentr
  Heuristic Model #3: lmcd_sentimentr
  Heuristic Model #4: sentiword_sentimentr
  Heuristic Model #5: senticnet_sentimentr
  Heuristic Model #6: vader

There are 8 Traditional ML Models
  Traditional ML Model #0: autogluon
  Traditional ML Model #1: flaml
  Traditional ML Model #2: logreg
  Traditional ML Model #3: logreg_cv
  Traditional ML Model #4: multinb
  Traditional ML Model #5: rf
  Traditional ML Model #6: textblob
  Traditional ML Model #7: xgb

There are 5 DNN Models
  DNN Model #0: cnn
  DNN Model #1: fcn
  DNN Model #2: flair
  DNN Model #3: lstm
  DNN Model #4: stanza

There are 8 Transformer Models
  Transformer Model #0: imdb2wa

### File Functions

In [None]:
def get_fullpath(ftype='data_clean', fig_no='', first_note='', last_note='', plot_ext='png', no_date=False):
  '''
  Given a required file type (ftype: ['data_clean','data_raw','plot','crux_text']) and
    optional fig_no: figure number inserted into the plot filename prefix
             first_note: str inserted after Title and before (optional) SMA/Standardization info
             last_note: str inserted after (optional) SMA/Standardization info and before (optional) datetime stamp
             plot_ext: change default *.png extension of plot file
             no_date: don't add trailing datetime stamp to filename
  Generate and return a fullpath (/subdir/filename.ext) to save file to
  '''

  # String with full path/filename.ext to return
  fname = ''

  # Get current datetime stamp as a string
  if no_date:
    date_dt = ''
  else:
    date_dt = f'_{datetime.now().strftime("%Y_%m_%d-%I_%M_%S_%p")}'

  # Clean optional file notation if passed in:
  #   replace whitespace and periods with underscores, then lowercase
  if first_note:
    fnote_str = first_note.replace(' ', '_')
    fnote_str = '_'.join(fnote_str.split())
    fnote_str = '_'.join(fnote_str.split('.'))
    fnote_str = '_'.join(fnote_str.split('__'))
    fnote_str = fnote_str.lower()

  # Use the current novel key as the base of the filename
  novel_title_str = Novel_Key

  # Append the cleaned note to the title
  if first_note:
    novel_title_str = f'{novel_title_str}_{fnote_str}'

  # Option (a): Cleaned Model Data (Smoothed then Standardized)
  if ftype == 'data_clean':
    subdir_path = data_clean_subdir
    fprefix = 'sa_clean_'
    fname_str = f'{subdir_path}{fprefix}{novel_title_str}_{Model_Standardization_Method.lower()}_sma{Window_Percent}'
    if last_note:
      fname = f'{fname_str}_{last_note}{date_dt}.csv'
    else:
      fname = f'{fname_str}{date_dt}.csv'

  # Option (b): Raw Model Data
  elif ftype == 'data_raw':
    subdir_path = data_raw_subdir
    fprefix = 'sa_raw_'
    fname_str = f'{subdir_path}{fprefix}{novel_title_str}'
    if last_note:
      fname = f'{fname_str}_{last_note}{date_dt}.csv'
    else:
      fname = f'{fname_str}{date_dt}.csv'

  # Option (c): Plot Figure
  elif ftype == 'plot':
    subdir_path = plots_subdir
    if fig_no:
      fprefix = f'plot_{fig_no}_'
    else:
      fprefix = 'plot_'
    fname_str = f'{subdir_path}{fprefix}{novel_title_str}'
    if last_note:
      fname = f'{fname_str}_{last_note}{date_dt}.{plot_ext}'
    else:
      fname = f'{fname_str}{date_dt}.{plot_ext}'

  # Option (d): Crux Text
  elif ftype == 'crux_text':
    subdir_path = crux_subdir
    fprefix = 'crux_'
    fname_str = f'{subdir_path}{fprefix}{novel_title_str}'
    if last_note:
      fname = f'{fname_str}_{last_note}{date_dt}.txt'
    else:
      fname = f'{fname_str}{date_dt}.txt'

  else:
    print(f'ERROR: In get_fullpath() with illegal arg ftype:[{ftype}]')
    return f'ERROR: ftype:[{ftype}]'

  return fname
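
The naming pattern used by `get_fullpath()` (subdirectory + prefix + title, then an optional note and an optional datetime stamp, then the extension) can be sketched in isolation. The helper name `build_filename`, its parameters, and the example paths below are hypothetical and only illustrate the pattern; the function above additionally relies on notebook globals such as `Novel_Key`.

```python
from datetime import datetime

def build_filename(subdir, prefix, title, note='', ext='csv', stamp=None):
    """Hypothetical helper: subdir + prefix + title [+ _note] [+ _stamp] + .ext"""
    stem = f'{subdir}{prefix}{title}'
    if note:
        stem += f'_{note}'
    if stamp is not None:
        # Same strftime pattern as get_fullpath(): 12-hour clock plus AM/PM marker
        stem += f'_{stamp.strftime("%Y_%m_%d-%I_%M_%S_%p")}'
    return f'{stem}.{ext}'

# A fixed datetime keeps the example reproducible
fixed = datetime(2022, 3, 1, 14, 5, 9)

print(build_filename('./sentiment_raw/', 'sa_raw_', 'vwoolf_mrsdalloway'))
print(build_filename('./plots/', 'plot_', 'vwoolf_mrsdalloway', note='raw', ext='png', stamp=fixed))
```

Passing the timestamp in as an argument (rather than calling `datetime.now()` inside the helper) is a design choice that makes the output testable.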


## Text Cleaning

In [None]:
def text_str2sents(text_str, pysbd_only=False):
  '''
  Given a long text string (e.g. a novel) and pysbd_only flag
  Return a list of every Sentence defined by (a) 2+ newlines as paragraph separators, 
                                             (b) SpaCy+PySBD Pipeline, and 
                                             (c) Optionally, NLTK sentence tokenizer
  '''

  parags_ls = []
  sents_ls = []

  from pysbd.utils import PySBDFactory
  nlp = spacy.blank('en')
  nlp.add_pipe(PySBDFactory(nlp))

  print(f'BEFORE stripping out headings len: {len(text_str)}')

  parags_ls = re.split(r'[\n]{2,}', text_str)

  parags_ls = [x.strip() for x in parags_ls]

  # Strip out non-printing characters
  parags_ls = [re.sub(f'[^{re.escape(string.printable)}]', '', x) for x in parags_ls]

  # Filter out empty or too-short Paragraphs
  parags_ls = [x for x in parags_ls if (len(x.strip()) >= MIN_PARAG_LEN)]

  print(f'   Parag count before processing sents: {len(parags_ls)}')
  # FIRST PASS at Sentence Tokenization with PySBD
  for i, aparag in enumerate(parags_ls):

    # Collapse newlines within a Paragraph into single spaces
    aparag_nonl = re.sub('[\n]{1,}', ' ', aparag)
    doc = nlp(aparag_nonl)
    aparag_sents_pysbd_ls = list(doc.sents)
    print(f'pysbd found {len(aparag_sents_pysbd_ls)} Sentences in Paragraph #{i}')

    # Strip off whitespace from Sentences
    aparag_sents_pysbd_ls = [str(x).strip() for x in aparag_sents_pysbd_ls]

    # Filter out empty line Sentences
    aparag_sents_pysbd_ls = [x for x in aparag_sents_pysbd_ls if (len(x.strip()) > MIN_SENT_LEN)]

    print(f'      {len(aparag_sents_pysbd_ls)} Sentences remain after cleaning')

    sents_ls += aparag_sents_pysbd_ls

  # (OPTIONAL) SECOND PASS at Sentence Tokenization with NLTK
  if pysbd_only:
    # Only do one pass of SpaCy/PySBD Sentence tokenizer
    pass
  else:
    # Tokenize again with NLTK to catch any Sentence boundaries missed by PySBD
    aparag_sents_pysbd_ls = deepcopy(sents_ls)
    sents_ls = []
    for asent in aparag_sents_pysbd_ls:
      print(f'Processing asent: {asent}')
      aparag_sents_nltk_ls = sent_tokenize(asent)

      # Strip off whitespace from Sentences
      aparag_sents_nltk_ls = [str(x).strip() for x in aparag_sents_nltk_ls]

      # Filter out empty line Sentences
      aparag_sents_nltk_ls = [x for x in aparag_sents_nltk_ls if (len(x.strip()) > MIN_SENT_LEN)]

      sents_ls += aparag_sents_nltk_ls

  print(f'About to return sents_ls with len = {len(sents_ls)}')
  return sents_ls

# Test and example of why both SpaCy+PySBD and the NLTK sentence tokenizer are needed: 'What is a goat?!? A big lazy GOAT...'
test_ls = text_str2sents('Hello. You are a great dude! WTF?\n\n You are a goat. What is a goat?!? A big lazy GOAT... No way-', pysbd_only=False)
test_ls

BEFORE stripping out headings len: 96
   Parag count before processing sents: 2
pysbd found 3 Sentences in Paragraph #0
      3 Sentences remain after cleaning
pysbd found 3 Sentences in Paragraph #1
      3 Sentences remain after cleaning
Processing asent: Hello.
Processing asent: You are a great dude!
Processing asent: WTF?
Processing asent: You are a goat.
Processing asent: What is a goat?!? A big lazy GOAT...
Processing asent: No way-
About to return sents_ls with len = 7


['Hello.',
 'You are a great dude!',
 'WTF?',
 'You are a goat.',
 'What is a goat?!?',
 'A big lazy GOAT...',
 'No way-']
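
The paragraph-splitting step of `text_str2sents()` can be examined on its own with only the standard library. The sketch below also shows one side effect worth knowing about: typographic (curly) quotes are not in `string.printable`, so the non-printing filter removes them, which would be consistent with contractions such as `Youve` and `cant` appearing in the tokenizer output later in this notebook. The function name `split_parags`, the `normalize_quotes` flag, and the partial quote map are assumptions made for illustration, not part of SentimentArcs.

```python
import re
import string

# Assumed, partial map of typographic quotes to ASCII equivalents
QUOTE_MAP = str.maketrans({'\u2018': "'", '\u2019': "'", '\u201c': '"', '\u201d': '"'})

# Same character filter as in text_str2sents(): anything outside string.printable
NON_PRINTABLE_RE = re.compile(f'[^{re.escape(string.printable)}]')

def split_parags(text_str, normalize_quotes=True):
    """Split on 2+ newlines, optionally normalize curly quotes, then strip non-printables."""
    parags = [p.strip() for p in re.split(r'\n{2,}', text_str)]
    if normalize_quotes:
        parags = [p.translate(QUOTE_MAP) for p in parags]
    parags = [NON_PRINTABLE_RE.sub('', p) for p in parags]
    return [p for p in parags if p]

sample = 'You\u2019ve got me confused.\n\n\nI can\u2019t work it out.'
print(split_parags(sample, normalize_quotes=False))  # apostrophes are lost
print(split_parags(sample))                          # apostrophes are kept as ASCII
```

Whether to normalize quotes before filtering is a judgment call: keeping apostrophes lets a later contraction-expansion step recognize forms such as `can't`.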

In [None]:
def textfile2df(fullpath_str):
  '''
  Given a full path to a *.txt file
  Return a DataFrame with one Sentence per row
  '''

  textfile_df = pd.DataFrame()

  with open(fullpath_str,'r') as fp:
    content_str = fp.read() # .replace('\n',' ')

  sents_ls = text_str2sents(content_str)

  textfile_df['text_raw'] = pd.Series(sents_ls)

  return textfile_df

In [None]:
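def emojis2text(atext):
  '''
  Given a text string with embedded emojis
  Return the text string with each emoji replaced by its text description
  '''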
  for emot, text_desc in UNICODE_EMOJI.items():
    atext = atext.replace(emot, ' '.join(text_desc.replace(",", "").split()))

  atext = atext.replace('_', ' ').replace(':','')

  return atext

# Test
test_str = "Hilarious 😂. The feeling of making a sale 😎, The feeling of actually ;) fulfilling orders 😒"
test_str = emojis2text(test_str)
print(f'test_str: [{test_str}]')

test_str: [Hilarious face with tears of joy. The feeling of making a sale smiling face with sunglasses, The feeling of actually ;) fulfilling orders unamused face]


In [None]:
def all_emos2text(atext):
  '''
  Given a text string with embedded emojis and/or emoticons
  Return an expanded text string with all emojis/emoticons translated into text
  '''

  # First, convert emoticons to text
  for emot, text_desc in EMOTICONS_EMO.items():
    atext = atext.replace(emot, ' ' + ' '.join(text_desc.replace(",", " ").split()))

  # Second, convert emojis to text
  for emot, text_desc in UNICODE_EMOJI.items():
    atext = atext.replace(emot, ' ' + ' '.join(text_desc.replace(",", " ").split()))

  atext = re.sub(r':([A-Za-z_]*):',r'\1',atext)
  # atext = re.sub(r'([\w]+)([_])([\w]+)',r'\1 \3',atext)
  atext = re.sub(r'_', ' ', atext)
  atext = ' '.join(atext.split())

  return atext

# Test
test_str = "Hilarious 😂. The feeling :o of making a sale 😎, The feeling :( of actually ;) fulfilling orders 😒"
all_emos2text(test_str)

'Hilarious face with tears of joy. The feeling Surprise of making a sale smiling face with sunglasses, The feeling Frown sad andry or pouting of actually Wink or smirk fulfilling orders unamused face'
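
The final cleanup in `all_emos2text()` (removing the colons around a shortcode, replacing underscores with spaces, and collapsing whitespace) can be tried without the emoji dictionaries. This is a minimal sketch of that step only; the function name `shortcode2text` and the sample strings are hypothetical.

```python
import re

def shortcode2text(atext):
    """Turn ':some_shortcode:' into 'some shortcode' and collapse whitespace."""
    atext = re.sub(r':([A-Za-z_]*):', r'\1', atext)  # drop the surrounding colons
    atext = atext.replace('_', ' ')                  # underscores to spaces
    return ' '.join(atext.split())                   # collapse runs of whitespace

print(shortcode2text('Great sale :smiling_face_with_sunglasses: today'))

# Caveat: any letters between two colons match, not only emoji shortcodes
print(shortcode2text('note:this:that'))
```

The second call shows that ordinary text containing two colons is also rewritten, so this pattern is best applied only after emoji replacement, as in the function above.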

In [None]:
# Abbreviation / Slang
# https://www.kaggle.com/nmaguette/up-to-date-list-of-slangs-for-text-preprocessing/notebook

slang = {
    "$" : " dollar ",
    "€" : " euro ",
    "4ao" : "for adults only",
    "a.m" : "before midday",
    "a3" : "anytime anywhere anyplace",
    "aamof" : "as a matter of fact",
    "acct" : "account",
    "adih" : "another day in hell",
    "afaic" : "as far as i am concerned",
    "afaict" : "as far as i can tell",
    "afaik" : "as far as i know",
    "afair" : "as far as i remember",
    "afk" : "away from keyboard",
    "app" : "application",
    "approx" : "approximately",
    "apps" : "applications",
    "asap" : "as soon as possible",
    "asl" : "age, sex, location",
    "atk" : "at the keyboard",
    "ave." : "avenue",
    "aymm" : "are you my mother",
    "ayor" : "at your own risk", 
    "b&b" : "bed and breakfast",
    "b+b" : "bed and breakfast",
    "b.c" : "before christ",
    "b2b" : "business to business",
    "b2c" : "business to customer",
    "b4" : "before",
    "b4n" : "bye for now",
    "b@u" : "back at you",
    "bae" : "before anyone else",
    "bak" : "back at keyboard",
    "bbbg" : "bye bye be good",
    "bbc" : "british broadcasting corporation",
    "bbias" : "be back in a second",
    "bbl" : "be back later",
    "bbs" : "be back soon",
    "be4" : "before",
    "bfn" : "bye for now",
    "blvd" : "boulevard",
    "bout" : "about",
    "brb" : "be right back",
    "bros" : "brothers",
    "brt" : "be right there",
    "bsaaw" : "big smile and a wink",
    "btw" : "by the way",
    "bwl" : "bursting with laughter",
    "c/o" : "care of",
    "cet" : "central european time",
    "cf" : "compare",
    "cia" : "central intelligence agency",
    "csl" : "can not stop laughing",
    "cu" : "see you",
    "cul8r" : "see you later",
    "cv" : "curriculum vitae",
    "cwot" : "complete waste of time",
    "cya" : "see you",
    "cyt" : "see you tomorrow",
    "dae" : "does anyone else",
    "dbmib" : "do not bother me i am busy",
    "diy" : "do it yourself",
    "dm" : "direct message",
    "dwh" : "during work hours",
    "e123" : "easy as one two three",
    "eet" : "eastern european time",
    "eg" : "example",
    "embm" : "early morning business meeting",
    "encl" : "enclosed",
    "encl." : "enclosed",
    "etc" : "and so on",
    "faq" : "frequently asked questions",
    "fawc" : "for anyone who cares",
    "fb" : "facebook",
    "fc" : "fingers crossed",
    "fig" : "figure",
    "fimh" : "forever in my heart", 
    "ft." : "feet",
    "ft" : "featuring",
    "ftl" : "for the loss",
    "ftw" : "for the win",
    "fwiw" : "for what it is worth",
    "fyi" : "for your information",
    "g9" : "genius",
    "gahoy" : "get a hold of yourself",
    "gal" : "get a life",
    "gcse" : "general certificate of secondary education",
    "gfn" : "gone for now",
    "gg" : "good game",
    "gl" : "good luck",
    "glhf" : "good luck have fun",
    "gmt" : "greenwich mean time",
    "gmta" : "great minds think alike",
    "gn" : "good night",
    "g.o.a.t" : "greatest of all time",
    "goat" : "greatest of all time",
    "goi" : "get over it",
    "gps" : "global positioning system",
    "gr8" : "great",
    "gratz" : "congratulations",
    "gyal" : "girl",
    "h&c" : "hot and cold",
    "hp" : "horsepower",
    "hr" : "hour",
    "hrh" : "his royal highness",
    "ht" : "height",
    "ibrb" : "i will be right back",
    "ic" : "i see",
    "icq" : "i seek you",
    "icymi" : "in case you missed it",
    "idc" : "i do not care",
    "idgadf" : "i do not give a damn fuck",
    "idgaf" : "i do not give a fuck",
    "idk" : "i do not know",
    "ie" : "that is",
    "i.e" : "that is",
    "ifyp" : "i feel your pain",
    "ig" : "instagram",
    "iirc" : "if i remember correctly",
    "ilu" : "i love you",
    "ily" : "i love you",
    "imho" : "in my humble opinion",
    "imo" : "in my opinion",
    "imu" : "i miss you",
    "iow" : "in other words",
    "irl" : "in real life",
    "j4f" : "just for fun",
    "jic" : "just in case",
    "jk" : "just kidding",
    "jsyk" : "just so you know",
    "l8r" : "later",
    "lb" : "pound",
    "lbs" : "pounds",
    "ldr" : "long distance relationship",
    "lmao" : "laugh my ass off",
    "lmfao" : "laugh my fucking ass off",
    "lol" : "laughing out loud",
    "ltd" : "limited",
    "ltns" : "long time no see",
    "m8" : "mate",
    "mf" : "motherfucker",
    "mfs" : "motherfuckers",
    "mfw" : "my face when",
    "mofo" : "motherfucker",
    "mph" : "miles per hour",
    "mr" : "mister",
    "mrw" : "my reaction when",
    "ms" : "miss",
    "mte" : "my thoughts exactly",
    "nagi" : "not a good idea",
    "nbc" : "national broadcasting company",
    "nbd" : "not big deal",
    "nfs" : "not for sale",
    "ngl" : "not going to lie",
    "nhs" : "national health service",
    "nrn" : "no reply necessary",
    "nsfl" : "not safe for life",
    "nsfw" : "not safe for work",
    "nth" : "nice to have",
    "nvr" : "never",
    "nyc" : "new york city",
    "oc" : "original content",
    "og" : "original",
    "ohp" : "overhead projector",
    "oic" : "oh i see",
    "omdb" : "over my dead body",
    "omg" : "oh my god",
    "omw" : "on my way",
    "p.a" : "per annum",
    "p.m" : "after midday",
    "pm" : "prime minister",
    "poc" : "people of color",
    "pov" : "point of view",
    "pp" : "pages",
    "ppl" : "people",
    "prw" : "parents are watching",
    "ps" : "postscript",
    "pt" : "point",
    "ptb" : "please text back",
    "pto" : "please turn over",
    "qpsa" : "what happens", #"que pasa",
    "ratchet" : "rude",
    "rbtl" : "read between the lines",
    "rlrt" : "real life retweet", 
    "rofl" : "rolling on the floor laughing",
    "roflol" : "rolling on the floor laughing out loud",
    "rotflmao" : "rolling on the floor laughing my ass off",
    "rt" : "retweet",
    "ruok" : "are you ok",
    "sfw" : "safe for work",
    "sk8" : "skate",
    "smh" : "shake my head",
    "sq" : "square",
    "srsly" : "seriously", 
    "ssdd" : "same stuff different day",
    "tbh" : "to be honest",
    "tbs" : "tablespoonful",
    "tbsp" : "tablespoonful",
    "tfw" : "that feeling when",
    "thks" : "thank you",
    "tho" : "though",
    "thx" : "thank you",
    "tia" : "thanks in advance",
    "til" : "today i learned",
    "tl;dr" : "too long i did not read",
    "tldr" : "too long i did not read",
    "tmb" : "tweet me back",
    "tntl" : "trying not to laugh",
    "ttyl" : "talk to you later",
    "u" : "you",
    "u2" : "you too",
    "u4e" : "yours for ever",
    "utc" : "coordinated universal time",
    "w/" : "with",
    "w/o" : "without",
    "w8" : "wait",
    "wassup" : "what is up",
    "wb" : "welcome back",
    "wtf" : "what the fuck",
    "wtg" : "way to go",
    "wtpa" : "where the party at",
    "wuf" : "where are you from",
    "wuzup" : "what is up",
    "wywh" : "wish you were here",
    "yd" : "yard",
    "ygtr" : "you got that right",
    "ynk" : "you never know",
    "zzz" : "sleeping bored and tired"
}

In [None]:
def expand_slang(astring):
  '''
  Given a text string
  Return a lowercased text string with known slang/abbreviations expanded
  '''
  words_expanded_ls = []

  for aword in astring.split():
    aword_lower = aword.lower()
    # Look up each whitespace-delimited token; tokens with attached punctuation (e.g. 'lol!') will not match
    words_expanded_ls.append(slang.get(aword_lower, aword_lower))

  astring_expanded = ' '.join(words_expanded_ls)

  return astring_expanded

# Test
expand_slang('idk LOL you suck!')

'i do not know laughing out loud you suck!'
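
Because `expand_slang()` looks up whole whitespace-delimited tokens, a token such as `LOL!` is not expanded. One possible variant, sketched below under stated assumptions, tries an exact match first (so keys that themselves end in punctuation, such as `w/`, still work) and then retries with trailing punctuation set aside. The function name `expand_slang_punct` and the three-entry `SLANG_DEMO` table are hypothetical stand-ins for the full `slang` dictionary.

```python
import string

# Tiny hypothetical subset of the slang table, for illustration only
SLANG_DEMO = {'idk': 'i do not know', 'lol': 'laughing out loud', 'w/': 'with'}

def expand_slang_punct(astring, table=SLANG_DEMO):
    """Expand slang tokens even when followed by punctuation, e.g. 'lol!'."""
    expanded = []
    for word in astring.split():
        lowered = word.lower()
        if lowered in table:
            # Exact match first (covers keys such as 'w/')
            expanded.append(table[lowered])
            continue
        # Otherwise set trailing punctuation aside and retry the lookup
        core = lowered.rstrip(string.punctuation)
        tail = lowered[len(core):]
        expanded.append(table[core] + tail if core in table else lowered)
    return ' '.join(expanded)

print(expand_slang_punct('idk, LOL! w/ you'))
```

Like the original, this variant lowercases every token, so it should run before any step that depends on capitalization.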

In [None]:
# [VERIFY]: Hero text preprocessing pipeline

hero.preprocessing.get_default_pipeline()

[<function texthero.preprocessing.fillna>,
 <function texthero.preprocessing.lowercase>,
 <function texthero.preprocessing.remove_digits>,
 <function texthero.preprocessing.remove_punctuation>,
 <function texthero.preprocessing.remove_diacritics>,
 <function texthero.preprocessing.remove_stopwords>,
 <function texthero.preprocessing.remove_whitespace>]

In [None]:
from nltk.corpus import stopwords

print(f'stopwords_ls: {len(stopwords_ls)}')
print(f"nltk stopwords: {len(stopwords.words('english'))}")

stopwords_ls: 326
nltk stopwords: 179


In [None]:
# Create Default and Custom Stemming TextHero pipelines

# Default cleaning pipeline (stopwords are kept)
def_pipeline = [preprocessing.fillna
                , preprocessing.lowercase
                , preprocessing.remove_digits
                , preprocessing.remove_punctuation
                , preprocessing.remove_diacritics
                # , preprocessing.remove_stopwords
                , preprocessing.remove_whitespace]

# Stemming pipeline: also removes stopwords, then stems
stem_pipeline = [preprocessing.fillna
                , preprocessing.lowercase
                , preprocessing.remove_digits
                , preprocessing.remove_punctuation
                , preprocessing.remove_diacritics
                , preprocessing.remove_stopwords
                , preprocessing.remove_whitespace
                , preprocessing.stem]
                   
# Test: pass a pipeline to the pipeline argument
# df['clean_title'] = hero.clean(df['title'], pipeline=def_pipeline)

In [None]:
def clean_text(text_df, text_col, text_type='formal'): 
  '''
  Given a DataFrame with a Text Column of raw text of type (formal, informal, tweet)
  Return a Series of clean texts
  '''

  text_clean_ser = pd.Series(dtype='object')

  # Extra processing steps for 'informal' and 'tweet' types of text
  if text_type in ['informal', 'tweet']:

    # Remove URLs
    text_clean_ser = hero.remove_urls(text_df[text_col])

    # Emoticons and then Emojis to Text
    text_clean_ser = text_clean_ser.apply(lambda x : all_emos2text(x))

    # Expand Slang/Abbr
    text_clean_ser = text_clean_ser.apply(lambda x : expand_slang(x))

  else:

    text_clean_ser = text_df[text_col]


  # Expand Contractions
  text_clean_ser = text_clean_ser.apply(lambda x : contractions.fix(x))

  # Clean text: lowercase, remove punctuation/numbers, etc
  # text_clean_ser = text_clean_ser.pipe(hero.clean, hero_pre_pipeline)
  text_clean_ser = hero.clean(text_clean_ser, pipeline = def_pipeline)

  return text_clean_ser

# Test
# clean_text(tweets_user_df.iloc[:5], 'text', text_type='tweet')

In [None]:
def lemma_pipe(texts):
  '''
  Given an iterable of text strings
  Return a list of text strings with all tokens lemmatized using SpaCy pipe for speed
  Called after clean_text() to lemmatize the cleaned text
  '''
  # https://prrao87.github.io/blog/spacy/nlp/performance/2020/05/02/spacy-multiprocess.html

  lemma_tokens = []
  # for doc in nlp.pipe(docs, batch_size=32, n_process=3, disable=["tagger", "parser", "ner"]):
  for doc in nlp.pipe(texts, batch_size=200, n_process=3, disable=["tagger", "parser", "ner"]):
    # lemma_tokens.append([str(tok.lemma_).lower() if tok.lemma_ != '-PRON-' else str(tok.orth_).lower() for tok in doc])
    temp_ls = [str(tok.lemma_).lower() if tok.lemma_ != '-PRON-' else str(tok.orth_).lower() for tok in doc]
    lemma_tokens.append(' '.join(temp_ls))

  return lemma_tokens
      
# Tests
print('\nTest #1:\n')
test_ls = ['I am running late for a meetings with all the many people.',
           'What time is it when you fall down running away from a growing problem?',
           "You've got to be kidding me - you're joking right?"]

lemma_pipe(test_ls)

print('\nTest #2:\n')

texts = pd.Series(["I won't go and you can't make me.", "Billy is running really quickly and with great haste.", "Eating freshly caught seafood."])

for doc in nlp.pipe(texts):
  print([tok.lemma_ for tok in doc])

print('\nTest #3:\n')
lemma_pipe(texts)



Test #1:



['i be run late for a meeting with all the many people .',
 'what time be it when you fall down run away from a grow problem ?',
 'you have get to be kid me - you be joke right ?']


Test #2:

['I', 'will', 'not', 'go', 'and', 'you', 'can', 'not', 'make', 'me', '.']
['Billy', 'be', 'run', 'really', 'quickly', 'and', 'with', 'great', 'haste', '.']
['Eating', 'freshly', 'catch', 'seafood', '.']

Test #3:



['i will not go and you can not make me .',
 'billy be run really quickly and with great haste .',
 'eating freshly catch seafood .']

# **[STEP 2] Read Corpus and Define Subcorpus**


## Read in Text Files

In [None]:
# Verify in SentimentArcs root directory
!pwd
print('\n')
!ls
print('\n')
print(f'[SUBDIR_TEXT_RAW]: {SUBDIR_TEXT_RAW}')
print('\n')
print(f'Count of the number of docs in this SUBDIR:')
!ls $SUBDIR_TEXT_RAW | wc -l

/gdrive/MyDrive/cdh/sentiment_arcs


config		  models__info.yaml  sentiment_clean  text_raw
fastText-0.9.2	  notebooks	     sentiment_raw    v0.9.2.zip
get_sentimentr.R  plots		     text_clean


[SUBDIR_TEXT_RAW]: ./text_raw/novels_text_new_raw/


Count of the number of docs in this SUBDIR:
2


In [None]:
print(f'Text_Type: {Text_Type}')
print(f'Corpus: {Corpus_Type}')

Text_Type: novels
Corpus: new


In [None]:
corpus_titles_dt.keys()

dict_keys(['scollins_thehungergames1', 'cmieville_thecityandthecity'])

In [None]:
corpus_titles_ls

['scollins_thehungergames1', 'cmieville_thecityandthecity']

In [None]:
# List the texts that are both defined in the corpus_titles_dt Dictionary
#   and present as *.txt files in SUBDIR_TEXT_RAW

corpus_titles_ls = list(corpus_titles_dt.keys())

text_ct = 0
for file_fullpath in glob.glob(SUBDIR_TEXT_RAW + "*.txt"):
  file_root = file_fullpath.split('/')[-1].split('.')[0]
  if file_root in corpus_titles_ls:
    text_ct += 1
    print(f'{file_root}: {corpus_titles_dt[file_root]}')

print(f'\nThere are {text_ct} Texts defined in SentimentArcs [corpus_dt] and found in the subdir: [SUBDIR_TEXT_RAW]')

cmieville_thecityandthecity: ['The City and The City by China Mieville', 2009, 0]
scollins_thehungergames1: ['The Hunger Games 1 by Suzanne Collins ', 2008, 0]

There are 2 Texts defined in SentimentArcs [corpus_dt] and found in the subdir: [SUBDIR_TEXT_RAW]
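
The cell above derives each corpus key from the filename with `split('/')[-1].split('.')[0]`. A possible alternative, sketched here, is `pathlib.Path.stem`, which drops only the final extension. The helper name `file_root_from_path` is hypothetical; the example path combines the raw-text subdirectory shown earlier with a filename from the naming convention.

```python
from pathlib import Path

def file_root_from_path(path_str):
    """Return the filename without its directory and without its final extension."""
    return Path(path_str).stem

example = './text_raw/novels_text_new_raw/vwoolf_mrsdalloway.txt'
print(file_root_from_path(example))

# The two approaches differ when a filename contains more than one dot
dotted = './text_raw/novels_text_new_raw/a.b.txt'
print(dotted.split('/')[-1].split('.')[0], file_root_from_path(dotted))
```

For filenames that follow the `author_title.txt` convention both approaches give the same key, so the choice only matters for names with extra dots.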


# **[STEP 3] Clean Corpus**

## Tokenize Sentences

In [None]:
# Verify the novels found and to be processed by SentimentArcs

corpus_titles_ls

['scollins_thehungergames1', 'cmieville_thecityandthecity']

In [None]:
%%time

# NOTE: 3m30s Entire Corpus of 25 
#       7m30s Ref Corpus 32 Novels
#       7m24s Ref Corpus 32 Novels
#       1m00s New Corpus 2 Novels

# Read all novel files into a Dictionary of DataFrames
#   Dict.keys() are novel names
#   Dict.values() are DataFrames with one row per Sentence

# Continue here ONLY if last cell completed WITHOUT ERROR

# anovel_df = pd.DataFrame()

for i, file_root in enumerate(corpus_titles_ls):
  file_fullpath = f'{SUBDIR_TEXT_RAW}{file_root}.txt'
  print(f'Processing Novel #{i}: {file_root}')
  # fullpath_str = novels_subdir + asubdir + '/' + asubdir + '.txt'
  print(f"  Size: {os.path.getsize(file_fullpath)}")

  corpus_texts_dt[file_root] = textfile2df(file_fullpath)
  
# corpus_dt.keys()


Streaming output truncated to the last 5000 lines.
Processing asent: Stay careful, he said.
Processing asent: He left and they stared after him, five faces frightened and bewildered, one of them bloody and dripping.
Processing asent: My own was set, I suspect, from the effort of not showing anything.
Processing asent: Youve got me confused, Borl.
Processing asent: He drove much more slowly than we had come on the return.
Processing asent: I cant work out what just happened.
Processing asent: You backed away from that, and it was our best lead.
Processing asent: The only thing that makes sense is that youre worried about complicity.
Processing asent: Because sure, if you got a call and went with it, if you took them up on that information, then yeah thats breach.
Processing asent: But no ones going to give a shit about you, Borl.
Processing asent: Its a little tiny breach, and you know as well as I do that theyll let that go if we sort out something bigger.
Processing asen

In [None]:
# Verify all novels Sentence tokenized correctly

for key, value in corpus_texts_dt.items():
  print(f'key: {key} = value: {value}')

key: scollins_thehungergames1 = value:                                                text_raw
0                                        "THE TRIBUTES"
1     When I wake up, the other side of the bed is c...
2     My fingers stretch out, seeking Prims warmth b...
3     She must have had bad dreams and climbed in wi...
4                                   Of course, she did.
...                                                 ...
9016                              His voice isnt angry.
9017                        Its hollow, which is worse.
9018  Already the boy with the bread is slipping awa...
9019  I take his hand, holding on tightly, preparing...
9020                                    END OF BOOK ONE

[9021 rows x 1 columns]
key: cmieville_thecityandthecity = value:                                                 text_raw
0                                             Annotation
1      The city is Beszel, a rundown metropolis on th...
2      The other city is Ul Qoma, a modern Eastern 

In [None]:
# Verify First Text is Segmented into text_raw Sentences

corpus_texts_dt[corpus_titles_ls[0]].head()

|   | text_raw |
|---|----------|
| 0 | "THE TRIBUTES" |
| 1 | When I wake up, the other side of the bed is c... |
| 2 | My fingers stretch out, seeking Prims warmth b... |
| 3 | She must have had bad dreams and climbed in wi... |
| 4 | Of course, she did. |


## Clean Sentences

In [None]:
%%time

# NOTE: (no stem) 4m09s
#       (w/ stem) 4m24s

for i, (key_novel, atext_df) in enumerate(corpus_texts_dt.items()):

  print(f'Processing Novel #{i}: {key_novel}...')

  atext_df['text_clean'] = clean_text(atext_df, 'text_raw', text_type='formal')

  atext_df['text_clean'] = lemma_pipe(atext_df['text_clean'])
  atext_df['text_clean'] = atext_df['text_clean'].astype('string')

  # TODO: Fill in all blank 'text_clean' rows with filler semaphore
  #       (fillna() only replaces missing values, not empty strings)
  atext_df.text_clean = atext_df.text_clean.fillna('this_blank')

  print(f'  shape: {atext_df.shape}')

Processing Novel #0: scollins_thehungergames1...
  shape: (9021, 2)
Processing Novel #1: cmieville_thecityandthecity...
  shape: (10125, 2)
CPU times: user 8.07 s, sys: 207 ms, total: 8.28 s
Wall time: 11.3 s


In [None]:
# Verify the first Text in Corpus is cleaned

corpus_texts_dt[corpus_titles_ls[0]].head(20)
corpus_texts_dt[corpus_titles_ls[0]].info()

|   | text_raw | text_clean |
|---|----------|------------|
| 0 | "THE TRIBUTES" | the tribute |
| 1 | When I wake up, the other side of the bed is c... | when i wake up the other side of the bed be cold |
| 2 | My fingers stretch out, seeking Prims warmth b... | my finger stretch out seek prims warmth but fi... |
| 3 | She must have had bad dreams and climbed in wi... | she must have have bad dream and climb in with... |
| 4 | Of course, she did. | of course she do |
| 5 | This is the day of the reaping. | this be the day of the reap |
| 6 | I prop myself up on one elbow. | i prop myself up on one elbow |
| 7 | Theres enough light in the bedroom to see them. | there be enough light in the bedroom to see them |
| 8 | My little sister, Prim, curled up on her side,... | my little sister prim curl up on her side coco... |
| 9 | In sleep, my mother looks younger, still worn ... | in sleep my mother look young still wear but n... |


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9021 entries, 0 to 9020
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   text_raw    9021 non-null   object
 1   text_clean  9021 non-null   string
dtypes: object(1), string(1)
memory usage: 141.1+ KB


## Save Cleaned Corpus

In [None]:
!pwd
!ls

/gdrive/MyDrive/cdh/sentiment_arcs
config		  models__info.yaml  sentiment_clean  text_raw
fastText-0.9.2	  notebooks	     sentiment_raw    v0.9.2.zip
get_sentimentr.R  plots		     text_clean


In [None]:
# Verify the save subdir for Cleaned Texts and the list of Texts to save

print(f'Saving Clean Texts to Subdir: {SUBDIR_TEXT_CLEAN}')
print(f'\nSaving these Texts:\n  {corpus_texts_dt.keys()}')

Saving Clean Texts to Subdir: ./text_clean/novels_text_new_clean/

Saving these Texts:
  dict_keys(['scollins_thehungergames1', 'cmieville_thecityandthecity'])


In [None]:
for i, (key_novel, anovel_df) in enumerate(corpus_texts_dt.items()):
  anovel_fname = f'{key_novel}.csv'

  anovel_fullpath = f'{SUBDIR_TEXT_CLEAN}{anovel_fname}'
  print(f'Saving Novel #{i} to {anovel_fullpath}')
  anovel_df.to_csv(anovel_fullpath)

Saving Novel #0 to ./text_clean/novels_text_new_clean/scollins_thehungergames1.csv
Saving Novel #1 to ./text_clean/novels_text_new_clean/cmieville_thecityandthecity.csv


# **[END OF NOTEBOOK]**