# **SentimentArcs (Part 5): Time Series Feature Analysis**

By: Jon Chun
12 Jun 2021

References:

* Coming...

TODO:
* add global SUBDIR_TIMESERIES_RAW(_CLEAN), SUBDIR_SENTIMENTARCS
* rename all processed files to: timeseries_clean/timeseries_clean_novels_ref
* drop text_clean, text_raw, vader_rstd from sentiment for size
* no autosave(playground sandbox) https://stackoverflow.com/questions/60867546/save-failed-in-google-colab
* ---
* Demo datafiles
* Error detection around Crux points context (out of bounds)
* lex_discrete2continous (research binary->gaussian transformation fn)
* Text Preprocessing hints/tips/flowchart
* Clearly document workflow and partition across notebooks/libraries
* Code review and extraction to libraries
* Corpus ingestion for any format
* XAI (mlm false peak 1717SyuzhetR/1732SentimentR/1797robertalg15 adam watches war argument at dinner) 
* Centralize and Standardize Model name lists
* Normalize model SA Series lengths
* Standardize all SA Series with the same method
* Seamless report generation/file saving
* Get raw text from SentimentR
* Filter out non-printable characters
* Roll-over Crux-Points (SentNo+Sent/Parag) (plotly)
* Label/Roll-over Chapter/Sect No at Boundaries
* Generate Report PDF/csv
* Option to select raw or discrete2continous transformation (Bing)
* Annotation functionality + Share/Collaboration of findings/research
* clusters, centroids = kmeans1d.cluster(np.array(corpus_sentimentr_df['jockers_rinker']), k)
* plotly preferred library to save dynamic images: kaleido
* Correlation heatmaps: Justify choice of Spearman, Pearson, or other algo

Facts:
* SyuzhetR vs SentimentTime Clean/Preprocess
* V.Woolf - To The Lighthouse
* SyuzhetR Clean: 3511 (SyuzhetR Preprocessed) Sentences (SentimentTime Preprocessed) 3403
* SentimentTime Clean: (Raw) 3402  (Clean) 3402


Preprocessing of Corpus Textfile
* Put headers in ALL CAPS
* Put \n\n between each CHAPTER/BOOK or SECTION header or Paragraphs
* Keep your format/spacing consistent
* Try to use utf-8 (not cp1252) and Unix line endings (e.g. convert \r\n to \n)
* No leading blank lines, one trailing blank line at end of textfile
* Check for illegal, non-printable or other problematic code (e.g. curly single/double quotes)
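The checklist above can be partly automated. The following is a minimal sketch using only the standard library; `normalize_corpus_text` and `QUOTE_MAP` are hypothetical names (not part of SentimentArcs), the cp1252 fallback is an assumption about how non-utf-8 files were saved, and "one trailing blank line" is interpreted here as a single final newline.

```python
import re

# Hypothetical mapping from curly quotes to plain ASCII quotes
QUOTE_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'",
    "\u201c": '"', "\u201d": '"',
})

def normalize_corpus_text(raw_bytes: bytes) -> str:
    """Apply the preprocessing hints to the raw bytes of one corpus textfile."""
    # Prefer utf-8; fall back to cp1252 (assumed legacy encoding)
    try:
        text = raw_bytes.decode("utf-8")
    except UnicodeDecodeError:
        text = raw_bytes.decode("cp1252", errors="replace")
    # Convert Windows and old Mac line endings to "\n"
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Replace curly single/double quotes with straight quotes
    text = text.translate(QUOTE_MAP)
    # Drop non-printable characters, keeping newlines and tabs
    text = "".join(ch for ch in text if ch in "\n\t" or ch.isprintable())
    # Collapse runs of 3+ newlines into exactly one blank line
    text = re.sub(r"\n{3,}", "\n\n", text)
    # No leading blank lines, a single newline at the end
    return text.strip("\n") + "\n"
```

Headers are not upper-cased here, since detecting which lines are headers depends on the individual text.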

# **Reference Code**

Surveys:
* https://github.com/prrao87/fine-grained-sentiment (20210409) Fine-grained SA (7 Models)


Other:
* https://github.com/annabiancajones/GA_capstone_project/blob/master/part3_mine_refine.ipynb
* https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6 CV

# **[RESTART RUNTIME] May be Required for these Libraries**

In [None]:
# If you see [Interactive namespace is empty] in response to the [%whos] command below,
#   you're working with a fresh Linux Virtual Machine,
#   any previous work is lost,
#   and you need to SEQUENTIALLY execute EVERY cell in this Notebook from the beginning

%whos

In [None]:
# Takes far too long for inference, 
#   currently not used

# !pip install moepy

In [None]:
!pip install dtaidistance

In [None]:
!pip install sktime

In [None]:
# [RESTART RUNTIME] May be Required (only needed for Plotly)

# Security hole in older versions of PyYAML, must upgrade to use plotly

!pip install pyyaml==5.4.1

In [None]:
# To Reduce Time Series Dimensionality

!pip install lttb

In [None]:
!pip install tslearn

# [STEP 1] Manual Configuration

## [INPUT] Connect Google gDrive to this Jupyter Notebook

In [None]:
# [INPUT REQUIRED]: Authorize access to Google gDrive

# Connect this Notebook to your permanent Google Drive
#   so all generated output is saved to permanent storage there

try:
  from google.colab import drive
  IN_COLAB = True
except ImportError:
  # google.colab is only importable inside a Colab runtime
  IN_COLAB = False

if IN_COLAB:
  print("Attempting to attach your Google gDrive to this Colab Jupyter Notebook")
  drive.mount('/gdrive')
else:
  print("Not running in Google Colab: skipping the gDrive mount")

In [None]:
# [CUSTOMIZE]: Change the text after the Unix '%cd ' command below (change directory)
#              to match the full path to your gDrive subdirectory, which should be the
#              root directory cloned from the SentimentArcs github repo.

# NOTE: Make sure this subdirectory already exists and there are
#       no typos, spaces or illegal characters (e.g. periods) in the full path after %cd

# NOTE: To be safe, each directory name in the path should begin with an upper or lowercase
#         letter, and only letters, numbers and underscores ('_') should appear afterwards.
#         Make sure your full path after %cd obeys this constraint or errors may appear.



# Step #1: Get full path to SentimentArcs subdir on gDrive
# =======
#@markdown **Accept default path on gDrive or Enter new one:**

Path_to_SentimentArcs = "/gdrive/MyDrive/cdh/sentiment_arcs/" #@param ["/gdrive/MyDrive/sentiment_arcs/"] {allow-input: true}

#@markdown (e.g. /gdrive/MyDrive/research/sentiment_arcs/)

%cd $Path_to_SentimentArcs

print('\n\n')

!ls

print(f'\n\nVERIFY that this is the correct SentimentArcs Subdirectory')
"""

# Step #2: Move to Parent directory of Sentiment_Arcs
# =======
parentdir_sentiment_arcs = '/'.join(Path_to_SentimentArcs.split('/')[:-2])
print(f'subdir_parent: {parentdir_sentiment_arcs}\n')
%cd $parentdir_sentiment_arcs

# TODO: 
# Step #3: If project sentiment_arcs subdir does not exist, 
#          clone it from github
# =======
import os

if not os.path.isdir('sentiment_arcs'):
  # NOTE: This will not work until SentimentArcs becomes an open sourced PUBLIC repo
  # !git clone https://github.com/jon-chun/sentiment_arcs.git

  # Test on open access github repo
  !git clone https://github.com/jon-chun/nabokov_palefire.git


# Step #4: Change into sentiment_arcs subdir
# =======
%cd ./sentiment_arcs
# Test on open acess github repo
# %cd ./nabokov_palefire

# Step #5: Confirm contents of sentiment_arcs subdir
# =======
!ls 

""";

## [INPUT] Define Directory and Input Corpus

In [None]:

# Step #1: Get full path to SentimentArcs subdir on gDrive
# =======
#@markdown **Accept default path on gDrive or Enter new one:**

Path_to_SentimentArcs = "/gdrive/MyDrive/cdh/sentiment_arcs/" #@param ["/gdrive/MyDrive/sentiment_arcs/"] {allow-input: true}

#@markdown (e.g. /gdrive/MyDrive/research/sentiment_arcs/)

#@markdown **Which type of texts are you cleaning?** \

Corpus_Genre = "novels" #@param ["novels", "social_media", "finance"]

Corpus_Type = "new" #@param ["new", "reference"]

#@markdown Please check that the required textfiles and datafiles exist in the correct subdirectories before continuing.




In [None]:
# [CUSTOMIZE]: Change the path assigned to Path_to_SentimentArcs below
#              to match the full path to your gDrive subdirectory, which should be the
#              root directory cloned from the SentimentArcs github repo.

# NOTE: Make sure this subdirectory already exists and there are
#       no typos, spaces or illegal characters (e.g. periods) in the full path

# NOTE: To be safe, each directory name in the path should begin with an upper or lowercase
#         letter, and only letters, numbers and underscores ('_') should appear afterwards.
#         Make sure your full path obeys this constraint or errors may appear.



# Step #1: Get full path to SentimentArcs subdir on gDrive
# =======
#@markdown **Accept default path on gDrive or Enter new one:**

Path_to_SentimentArcs = "/gdrive/MyDrive/cdh/sentiment_arcs/" #@param ["/gdrive/MyDrive/sentiment_arcs/"] {allow-input: true}

#@markdown (e.g. /gdrive/MyDrive/research/sentiment_arcs/)



#@markdown **Sentiment Arcs Directory Structure** \
#@markdown \
#@markdown **1. Input Directories:** \
#@markdown (a) Raw textfiles in subdir: ./text_raw/(text_type)/  \
#@markdown (b) Cleaned textfiles in subdir: ./text_clean/(text_type)/ \
#@markdown \
#@markdown **2. Output Directories** \
#@markdown (1) Raw Sentiment time series datafiles and plots in subdir: ./sentiment_raw/(text_type) \
#@markdown (2) Cleaned Sentiment time series datafiles and plots in subdir: ./sentiment_clean/(text_type) \
#@markdown \
#@markdown **Which type of texts are you cleaning?** \

Corpus_Genre = "novels" #@param ["novels", "social_media", "finance"]

Corpus_Type = "new" #@param ["new", "reference"]

#@markdown Please check that the required textfiles and datafiles exist in the correct subdirectories before continuing.
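
The directory layout described above can be checked before continuing. This is only a sketch: `build_subdirs` is a hypothetical helper (the notebook itself relies on `./utils/get_subdirs.py`, whose internals are not shown here), and the root path is the default from the form field.

```python
from pathlib import Path

def build_subdirs(root: str, text_type: str) -> dict:
    """Return the input/output subdirectories for one text type (hypothetical helper)."""
    root_path = Path(root)
    return {
        "text_raw": root_path / "text_raw" / text_type,
        "text_clean": root_path / "text_clean" / text_type,
        "sentiment_raw": root_path / "sentiment_raw" / text_type,
        "sentiment_clean": root_path / "sentiment_clean" / text_type,
    }

subdirs = build_subdirs("/gdrive/MyDrive/cdh/sentiment_arcs", "novels")
for name, path in subdirs.items():
    # is_dir() is False when the directory is missing (e.g. gDrive not mounted)
    print(f"{name:16s} {path.as_posix()}  exists={path.is_dir()}")
```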


In [None]:
# Global Variable
"""
SUBDIR_SENTIMENTARCS = '/gdrive/MyDrive/cdh/sentiment_arcs'
SUBDIR_TIMESERIES_RAW = f'/timeseries_raw/timeseries_raw_{Corpus_Genre}_{Corpus_Type}'
SUBDIR_TIMESERIES_CLEAN = f'/timeseries_clean/timeseries_clean_{Corpus_Genre}_{Corpus_Type}'
SUBDIR_TIMESERIES_RAW
""";

# **[TEST] Download Books**

* Books http://glozman.com/textpages.html

* Web Script: 


In [None]:
!pip install -U grab

In [None]:
import logging

from grab import Grab

logging.basicConfig(level=logging.DEBUG)

g = Grab()

url_root = 'http://glozman.com/textpages.html'
url_base = 'http://glozman.com/'
g.go(url_root)

# g.go('https://github.com/login')
# g.doc.set_input('login', '****')
# g.doc.set_input('password', '****')
# g.doc.submit()

# g.doc.save('/tmp/x.html')

# g.doc('//ul[@id="user-links"]//button[contains(@class, "signout")]').assert_exists()

# home_url = g.doc('//a[contains(@class, "header-nav-link name")]/@href').text()
# repo_url = home_url + '?tab=repositories'

# g.go(repo_url)

url_books_ls = []
for elem in g.doc.select('//li/a'):
  # print('%s: %s' % (elem.text(), g.make_url_absolute(elem.attr('href'))))
  aurl_book = elem.attr('href')
  print(f"elem.attr(): {aurl_book}")
  afullurl_book = f"{url_base.rstrip('/')}/{aurl_book.lstrip('/')}"  # avoid a double slash in the joined URL
  print(f'afullurl_book: {afullurl_book}')
  url_books_ls.append(afullurl_book)

In [None]:
url_books_ls[0]

In [None]:
import urllib.request
import re

In [None]:
!pwd

In [None]:
!mkdir books
%cd books
!pwd

In [None]:
for aurl in url_books_ls:
  aurl_clean = ('%20').join(aurl.split(' '))
  print(f'Trying to get aurl: {aurl_clean}')
  try:
    with urllib.request.urlopen(aurl_clean) as f:
      file_text = f.read().decode(errors="ignore") # 'windows-1252')) # 'utf-8'))
    # Build a clean local filename from the last component of the URL path
    file_out = aurl.split('/')[-1]
    file_out = file_out.rsplit('.', 1)[0]   # drop the file extension
    file_out = '_'.join(file_out.split(' '))
    file_out = re.sub(r"[^a-zA-Z0-9_ ]", "", file_out)
    file_out = re.sub(r"[_]+", "_", file_out)
    file_out = file_out.lower()
    file_out = file_out + '.txt'
    print(f'Writing to: {file_out}')
    with open(file_out, 'w', encoding='utf-8') as fp:
      fp.write(file_text)
  except urllib.error.URLError as e:
    print(e.reason)

In [None]:
!ls -altr *.txt

In [None]:
!ls -altr *.txt | wc -l

In [None]:
!head -n 20 clancy_tom_patriot_games.txt

print('\n\n')

!tail -n 20 clancy_tom_patriot_games.txt

In [None]:
!head -n 20 clancy_tom_red_storm_rising.txt

print("\n\n")

!tail -n 20 clancy_tom_red_storm_rising.txt

# **[STEP 2] Automatic Configuration/Setup**

In [None]:
# Define all Sub/Dir global CONSTANTS

import os

# Verify in SentimentArcs Root Directory
os.chdir('/gdrive/MyDrive/cdh/sentiment_arcs/')

%run -i './utils/get_subdirs.py'

get_subdirs(Corpus_Genre, Corpus_Type, 'lex2ml')

## Configure Jupyter Notebook

In [None]:
# Configure Jupyter

# Ignore warnings
import warnings
warnings.filterwarnings('ignore')

# Enable multiple outputs from one code cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

from IPython.display import display
from IPython.display import Image
from ipywidgets import widgets, interactive

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

## Read YAML Configuration for Corpus and Models 

In [None]:
# Define all Corpus Texts & Ensemble Models as global CONSTANTS

import yaml

# Verify in SentimentArcs Root Directory
os.chdir('/gdrive/MyDrive/cdh/sentiment_arcs/')

%run -i './utils/read_yaml.py'

read_yaml(Corpus_Genre, Corpus_Type)

print('SentimentArcs Model Ensemble ------------------------------\n')
model_titles_ls = models_titles_dt.keys()
print('\n'.join(model_titles_ls))


print('\n\nCorpus Texts ------------------------------\n')
corpus_titles_ls = corpus_titles_dt.keys()
print('\n'.join(corpus_titles_ls))


print(f'\n\nThere are {len(model_titles_ls)} Models in the SentimentArcs Ensemble above.\n')
print(f'\nThere are {len(corpus_titles_ls)} Texts in the Corpus above.\n')
print('\n')

## Install Python Libraries

In [None]:
# Intentionally left blank

## Load Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
pd.set_option('max_colwidth', 100) # -1)

from pandas.core.arrays.numeric import T

from glob import glob
import copy
# import yaml # Already done above
import json
from itertools import groupby


In [None]:
# Time Series Smoothing and Scaling

from statsmodels.nonparametric.smoothers_lowess import lowess as sm_lowess
from statsmodels import robust as sm_robust

# Too Slow
# from moepy import lowess, eda
# lowess_moepy = lowess.Lowess()

from sklearn.preprocessing import MinMaxScaler   # To normalize time series
from sklearn.preprocessing import StandardScaler # To standardize time series
from sklearn.preprocessing import RobustScaler   # To deal with outliers

scaler_minmax = MinMaxScaler()
scaler_zscore = StandardScaler()
scaler_robust = RobustScaler()

In [None]:
# Time Series Dimensionality Reduction and Clustering

import lttb
from lttb.validators import *

from dtaidistance import clustering

In [None]:
# Distance/Similarity Metrics for Time Series

from tslearn.metrics import dtw, soft_dtw, soft_dtw_alignment, ctw, lcss, gak

# calculating euclidean distance between vectors
from scipy.spatial.distance import euclidean

# calculating manhattan distance between vectors
from scipy.spatial.distance import cityblock
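
For intuition about how dynamic time warping (DTW) differs from the point-by-point Euclidean and Manhattan distances imported above, here is a minimal NumPy sketch of the classic DTW recurrence. `dtw_distance` is a hypothetical helper that uses absolute differences as the step cost, so its values are not expected to match `tslearn.metrics.dtw`; it only illustrates the idea.

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping cost between two 1-D series (absolute-difference step cost)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    # cost[i, j] = cheapest alignment of a[:i] with b[:j]
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: insertion, deletion, or match
            cost[i, j] = step + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

# Same shape, different lengths: DTW can align them, point-by-point distances cannot
short_arc = [0, 1, 2]
stretched_arc = [0, 1, 1, 2]
print(dtw_distance(short_arc, stretched_arc))
```

Because DTW tolerates stretching along the time axis, it can compare sentiment arcs from texts with different sentence counts without first resampling them to a common length.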

In [None]:
# Plotly Visualizations
# Note: Security hole in the default PyYAML version, must upgrade above
#       !pip install pyyaml==5.4.1

import plotly.graph_objects as go
import plotly.express as px
import plotly

## Define Global Parameters




In [None]:
# Define Globals

# Main data structure: Dictionary (key=text_name) of DataFrames (cols: text_raw, text_clean)
corpus_texts_dt = {}

# corpus_sa_dt = {}

# Verify in SentimentArcs Root Directory
os.chdir('/gdrive/MyDrive/cdh/sentiment_arcs/')

%run -i './utils/get_globals.py'

SLANG_DT.keys()


## Setup Matplotlib Style

* https://matplotlib.org/stable/tutorials/introductory/customizing.html

In [None]:
# Configure Matplotlib

# View available styles
# plt.style.available

# Verify in SentimentArcs Root Directory
os.chdir('/gdrive/MyDrive/cdh/sentiment_arcs/')

%run -i './utils/config_matplotlib.py'

config_matplotlib()

print('Matplotlib Configuration ------------------------------\n')
plt.rcParams.keys()
print('\n  Edit ./utils/config_matplotlib.py to change')

## Setup Seaborn Style

In [None]:
# Configure Seaborn

# Verify in SentimentArcs Root Directory
os.chdir('/gdrive/MyDrive/cdh/sentiment_arcs/')

%run -i './utils/config_seaborn.py'

config_seaborn()

print('Seaborn Configuration ------------------------------\n')
# print('\n  Update ./utils/config_seaborn.py to display seaborn settings')

## Utility Functions

### Generate Convenient Data Lists

In [None]:
# Derive List of Texts in Corpus a)keys and b)full author and titles

print('Dictionary: corpus_titles_dt')
corpus_titles_dt
print('\n')

corpus_texts_ls = list(corpus_titles_dt.keys())
print(f'\nCorpus Texts:')
for akey in corpus_texts_ls:
  print(f'  {akey}')
print('\n')

print(f'\nNatural Corpus Titles:')
corpus_titles_ls = [x[0] for x in list(corpus_titles_dt.values())]
for akey in corpus_titles_ls:
  print(f'  {akey}')


In [None]:
# get_model_families()

# Verify in SentimentArcs Root Directory
os.chdir('/gdrive/MyDrive/cdh/sentiment_arcs/')

%run -i './utils/get_model_families.py'

ensemble_models_dt = get_model_famalies(models_titles_dt)

print('\nTest: Lexicon Family of Models:')
ensemble_models_dt['lexicon']

### File Functions

In [None]:
# Verify in SentimentArcs Root Directory
os.chdir('/gdrive/MyDrive/cdh/sentiment_arcs/')

%run -i './utils/file_utils.py'

# TODO: Not used? Delete?
# get_fullpath(text_title_str, ftype='data_clean', fig_no='', first_note = '',last_note='', plot_ext='png', no_date=False)

In [None]:
# Encode text for JSON.dump()

class NumpyEncoder(json.JSONEncoder):
    """ Special json encoder for numpy types """
    # https://stackoverflow.com/questions/57269741/typeerror-object-of-type-ndarray-is-not-json-serializable
    def default(self, obj):
        # np.integer / np.floating cover all sized NumPy ints and floats
        #   (np.float_ was removed in NumPy 2.0)
        if isinstance(obj, np.integer):
            return int(obj)
        elif isinstance(obj, np.floating):
            return float(obj)
        elif isinstance(obj, (np.ndarray,)):
            return obj.tolist()
        return json.JSONEncoder.default(self, obj)

# **[STEP 3] Read, Clean & EDA Visualizations**

## Read All Raw Sentiment Data

In [None]:
# Verify in SentimentArcs Root Directory
os.chdir('/gdrive/MyDrive/cdh/sentiment_arcs/')

print(f'Reading from SUBDIR_SENTIMENT_RAW: {SUBDIR_SENTIMENT_RAW}\n')
sentiment_raw_datafile_ls = os.listdir(SUBDIR_SENTIMENT_RAW)

corpus_texts_dt = {}
first_bo = True

!ls -altr $SUBDIR_SENTIMENT_RAW
print('\n')

for i,afile in enumerate(sentiment_raw_datafile_ls):
  temp_dt = {}
  print(f'File #{i}: {afile}')
  afile_key = afile.split('.')[0].split('_')[-1]
  print(f'         {afile_key}')
  temp_dt = read_dict_dfs(in_file=afile, in_dir=SUBDIR_SENTIMENT_RAW)
  corpus_titles_ls = list(temp_dt.keys())
  # print(corpus_titles_ls)
  for j, atitle in enumerate(corpus_titles_ls):
    print(f'    Text #{j}: {atitle}')
    if (atitle in corpus_texts_dt.keys()):
      print(f'Append:')
      corpus_texts_dt[atitle] = pd.concat([corpus_texts_dt[atitle],temp_dt[atitle]], axis=1)
      # corpus_texts_dt[atitle] = temp_dt[atitle]
      # first_bo = False
    else:
      print(f'  New:')
      corpus_texts_dt[atitle] = temp_dt[atitle].copy()

In [None]:
corpus_texts_dt.keys()

In [None]:
corpus_texts_dt[corpus_titles_ls[1]].info()

In [None]:
ensemble_ls = list(set(corpus_texts_dt[corpus_titles_ls[1]].columns) - set(['text_clean','text_raw']))
ensemble_ls

## Delete Duplicate Columns

In [None]:
# Check for any duplicated columns/models

# corpus_texts_dt[atext].columns.duplicated()
print(corpus_texts_dt[corpus_titles_ls[0]].columns.value_counts())

print('\n')
next(iter(zip(corpus_texts_dt[corpus_titles_ls[0]].columns.duplicated(), corpus_texts_dt[corpus_titles_ls[0]].columns)))

In [None]:
# Check for Duplicate Models/Col names

column_list = pd.Series()

for atext_no in range(len(corpus_texts_dt.keys())):
  atext = corpus_texts_ls[atext_no]
  df = corpus_texts_dt[atext]
  print(f'\n\n\nProcessing #{atext_no}: {atext}')
  # col_dups_ct = corpus_texts_dt[atext].columns.duplicated().sum()
  # col_dups_ls = corpus_texts_dt[atext].columns[corpus_texts_dt[atext].columns.duplicated()]

  col_dups_ct = df.columns.duplicated().sum()

  if col_dups_ct > 0:
    col_dups_ls = df.columns[df.columns.duplicated()]
    print(f'\nBEFORE: {col_dups_ct} duplicated columns')
    print(f'  {", ".join(col_dups_ls)}')  

    # get names of duplicated columns
    column_list = df.columns.value_counts()
    col_del_ls = column_list[column_list>1]
    col_del_ls

    # Add '_dup' suffix to repeated columns (the first occurrence keeps its original name)
    my_suffix = '_dup'
    df.columns = [name if (not duplicated and not name.endswith(my_suffix)) else name + my_suffix for duplicated, name in zip(df.columns.duplicated(), df.columns)]

    # Drop all columns with '_dup' suffix
    col_drop_ls = [x for x in df.columns if x.endswith('_dup')]
    df.drop(columns=col_drop_ls, inplace=True)

    # Check that no duplicate column names remain (duplicates prohibit upload to BigQuery) - TRUE Desired
    print(f'  All column names unique: {len(df.columns) == len(set(df.columns))}')

  else:
    print(f'No Duplicated Columns')

In [None]:
# After deleting duplicates, check remaining cols/models of last text processed

corpus_texts_dt[corpus_titles_ls[0]].info()

## Interpolate NaN/None Values

In [None]:
# TODO:

## Visually Verify Data

In [None]:
for i, atext in enumerate(corpus_texts_ls):
    
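  # Use a 10% window so the rolling mean matches the plot title
  win_10per = int(0.10 * corpus_texts_dt[atext].shape[0])
  atitle = f'{corpus_titles_dt[atext][0]}\nSentiment Analysis (SMA 10%)\nNo Normalization/Standardization'
  corpus_texts_dt[atext][ensemble_ls].rolling(win_10per, center=True).mean().plot(title=atitle)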
  plt.grid(True, alpha=0.7) 
  plt.show();              

In [None]:
# Check for clean DataFrame

# Check for NaN values
# TODO:
"""
print(f'Any Null values: [{corpus_texts_dt[corpus_titles_ls[0]].isnull().values.any()}]')

print('\n')

corpus_texts_dt[corpus_titles_ls[0]].columns.duplicated()

print('\n')

print(corpus_texts_dt[corpus_titles_ls[0]].columns.value_counts())

print('\n')

next(iter(zip(corpus_texts_dt[atext].columns.duplicated(), corpus_texts_dt[atext].columns)))
""";

## Make Robust to Outliers

In [None]:
# Clip Outliers based on Median Absolute Deviation (MAD), compared with RobustScaler()

def clip_outliers(floats_ser):
  '''
  Given a pd.Series of float values
  Return the values with outliers clipped to within 2.5 median absolute deviations of the median
  (the original values are returned unchanged as a list if the MAD is 0)
  '''
  # https://www.statsmodels.org/stable/generated/statsmodels.robust.scale.mad.html#statsmodels.robust.scale.mad

  # Old mean/std, less robust
  # ser_std = floats_ser.std()
  # ser_median = floats_ser.mean() # TODO: more robust: asym/outliers -> median/IQR or median/median abs deviation

  floats_np = np.array(floats_ser)
  ser_median = floats_ser.median()
  ser_mad = sm_robust.mad(floats_np)
  # print(f'ser_median = {ser_median}')
  # print(f'ser_mad = {ser_mad}')

  if ser_mad == 0:
    # for TS with small ranges (e.g. -1.0 to +1.0) Median Abs Deviation = 0
    #   so pass back the original time series
    floats_clip_ls = list(floats_ser)

  else:
    ser_oldmax = floats_ser.max()
    ser_oldmin = floats_ser.min()
    # print(f'ser_max = {ser_oldmax}')
    # print(f'ser_min = {ser_oldmin}')

    ser_upperlim = ser_median + 2.5*ser_mad
    ser_lowerlim = ser_median - 2.5*ser_mad
    # print(f'ser_upperlim = {ser_upperlim}')
    # print(f'ser_lowerlim = {ser_lowerlim}')

    # Clip outliers to max or min values
    floats_clip_ls = np.clip(floats_np, ser_lowerlim, ser_upperlim)
    # print(f'max floast_ls {floats_ls.max()}')

    # def map2range(value, low, high, new_low, new_high):
    #   '''map a value from one range to another'''
    #   return value * 1.0 / (high - low + 1) * (new_high - new_low + 1)

    # Map all float values to range [-1.0 to 1.0]
    # floats_clip_sig_ls = [map2range(i, ser_oldmin, ser_oldmax, ser_upperlim, ser_lowerlim) for i in floats_clip_ls]

    # listmax_fl = float(max(floats_ls))
    # floats_ls = [i/listmax_fl for i in floats_ls]
    #floats_ls = [1/(1+math.exp(-i)) for i in floats_ls]

  return floats_clip_ls  # floats_clip_sig_ls

# Test
# Will not work on first run as corpus_sents_df is not defined yet

# data = np.array([1, 4, 4, 7, 12, 13, 16, 19, 22, 24])
# test_ls = clip_outliers(pd.Series(data))

print('Comparison Test: (a) Manual MAD Clipping vs (b) RobustScaler()')
# Plot #1: Clipped Outliers with MAD
test_ls = clip_outliers(corpus_texts_dt[corpus_texts_ls[0]]['vader'])
# test_ls = clip_outliers(corpus_texts_dt[corpus_texts_ls[0]]['afinn'].iloc[0])
# print(f'new min is {min(test_ls)}')
# print(f'new max is {max(test_ls)}')
pd.Series(test_ls).rolling(300, center=True, min_periods=0).mean().plot(label='clipped', alpha=0.7);
corpus_texts_dt[corpus_texts_ls[0]]['vader'].rolling(300, center=True, min_periods=0).mean().plot(label='original', alpha=0.7)

# transformer = scaler_robust.fit(corpus_texts_dt[corpus_texts_ls[0]]['vader'].values.reshape(-1, 1))

# Plot #2: Scale Outliers with RobustScaler()
test_df = corpus_texts_dt[corpus_texts_ls[0]][['vader']].copy(deep=True)  # keep a DataFrame so a column can be added
# test_df = pd.DataFrame({'vader': scaler_robust.fit_transform(np.array(corpus_texts_dt[corpus_texts_ls[0]]['vader']).reshape(-1, 1))})
test_df['vader_std'] = scaler_robust.fit_transform(test_df[['vader']].to_numpy()).flatten()
test_df['vader_std'].rolling(300, center=True, min_periods=0).mean().plot(label='RobustScaler', alpha=0.7)

plt.title('Dealing with Outliers in Sentiment Time Series\n(a) Manually Clip with MAD or,\n (b) Scale with RobustScaler()')
plt.grid(True, alpha=0.7)
plt.legend()
plt.show();
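
The two options compared above can be sketched with NumPy alone. `mad_clip` and `robust_scale` are hypothetical helpers written for illustration: the first mirrors `clip_outliers` (statsmodels divides the MAD by about 0.6745 so that it estimates a normal standard deviation), and the second mirrors the default behaviour of scikit-learn's `RobustScaler` (centre on the median, divide by the interquartile range). The data values are made up.

```python
import numpy as np

def mad_clip(values, n_mads=2.5):
    """Clip values to within n_mads normalized MADs of the median."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    # Normalized MAD: median absolute deviation divided by ~0.6745
    mad = np.median(np.abs(values - median)) / 0.6745
    if mad == 0:
        # Degenerate case: return the series unchanged
        return values.copy()
    return np.clip(values, median - n_mads * mad, median + n_mads * mad)

def robust_scale(values):
    """Centre on the median and scale by the interquartile range (IQR)."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    centred = values - np.median(values)
    return centred if iqr == 0 else centred / iqr

data = np.array([1, 4, 4, 7, 12, 13, 16, 19, 22, 240])  # last value is an outlier
clipped = mad_clip(data)
scaled = robust_scale(data)
print(clipped)
print(scaled)
```

Clipping changes only the extreme values and keeps the original units, while robust scaling keeps every value (including outliers) but re-expresses them relative to the median and IQR.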

In [None]:
corpus_texts_dt[corpus_titles_ls[0]].info()

In [None]:
# Deal with Outliers: (a) Manually clip with MAD, or (b) Automatically Scale with RobustScaler()

for i, atext in enumerate(corpus_texts_dt.keys()):
  print(f'Text #{i}: {atext}')
  
  win_10per = int(0.10 * corpus_texts_dt[atext].shape[0])  # 10% window of the current text

  fig = plt.figure()
  ax = plt.subplot(111)

  models_rstd_ls = []
  for j, amodel in enumerate(ensemble_ls):
    amodel_rstd = f'{amodel}_rstd'
    # print(f'  Model #{j}: {amodel} (Model_Std: {amodel_rstd})')
    # clip_outliers(corpus_texts_dt[corpus_texts_ls[0]]['vader'])

    # Option (a): Manually Clip with 2.5*MAD
    # corpus_texts_dt[atext][amodel_rstd] = pd.Series(clip_outliers(corpus_texts_dt[atext][amodel])) # .reshape(-1,1)).flatten())

    # Option (b): Automatically Scale with scikit-learn's RobustScaler()
    corpus_texts_dt[atext][amodel_rstd] = pd.Series(scaler_robust.fit_transform(np.array(corpus_texts_dt[atext][amodel]).reshape(-1,1)).flatten())

    # Plot
    _ = ax.plot(corpus_texts_dt[atext][amodel_rstd].rolling(win_10per, center=True, min_periods=0).mean(), label=amodel_rstd, alpha=0.3)

    models_rstd_ls.append(amodel_rstd)

  # Plot Median of Ensemble
  _ = ax.plot(corpus_texts_dt[atext][models_rstd_ls].median(axis=1).rolling(win_10per, center=True, min_periods=0).mean(), label='Ensemble Median', color='r', linewidth=3)

  # Shrink current axis by 20%
  # box = ax.get_position()
  # ax.set_position([box.x0, box.y0, box.width * 0.8, box.height])

  # Put a legend to the right of the current axis
  _ = ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))

  plt.grid(True, alpha=0.7)
  atitle = plt.title(f'{corpus_titles_dt[atext][0]}\nSentimentArc Ensemble of {len(ensemble_ls)} Models\nSmoothed: SMA (window=10%)\nScaled with RobustScaler')
  plt.show();


## Standardization with zScore
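
Z-score standardization rescales each model's series to mean 0 and standard deviation 1 so that models with different native ranges can be plotted and compared on one axis. The cell below uses scikit-learn's `StandardScaler`; as a minimal sketch of the same arithmetic, the hypothetical helper `zscore` uses the population standard deviation (`ddof=0`), which is what `StandardScaler` uses.

```python
import numpy as np

def zscore(values):
    """Standardize a 1-D series to mean 0 and (population) std 1."""
    values = np.asarray(values, dtype=float)
    std = values.std()  # ddof=0, i.e. population standard deviation
    if std == 0:
        # A constant series carries no variation to scale
        return np.zeros_like(values)
    return (values - values.mean()) / std

z = zscore([1, 2, 3, 4, 5])
print(z.mean(), z.std())
```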

In [None]:
%%time

# NOTE:

# zScore Standardization (mean=0, std=1)

for i, atext in enumerate(corpus_texts_dt.keys()):
  print(f'Text #{i}: {atext}')

  fig = plt.figure()
  ax = plt.subplot(111)

  models_std_ls = []
  for j, amodel in enumerate(ensemble_ls):
    amodel_rstd = f'{amodel}_rstd'
    amodel_rzstd = f'{amodel}_rzstd'
    # print(f'  Model #{j}: {amodel} (Model_Std: {amodel_rzstd})')
    # clip_outliers(corpus_texts_dt[corpus_texts_ls[0]]['vader'])

    # Get SMA 10% window length
    win_10per = int(0.10 * corpus_texts_dt[atext][amodel_rstd].shape[0])


    # UNCOMMENT only ONE of these TWO Options
    # ---------------------------------------
    # Option (a): zScore standardize the RobustScaler-scaled '_rstd' series (mean=0, std=1)
    corpus_texts_dt[atext][amodel_rzstd] = scaler_zscore.fit_transform(np.array(corpus_texts_dt[atext][amodel_rstd]).reshape(-1,1)).flatten()

    # Option (b): Only scale the original series with scikit-learn's RobustScaler(), no zScore
    # corpus_texts_dt[atext][amodel_rzstd] = pd.Series(scaler_robust.fit_transform(np.array(corpus_texts_dt[atext][amodel]).reshape(-1,1)).flatten())

    # Plot amodel_rzstd
    _ = ax.plot(corpus_texts_dt[atext][amodel_rzstd].rolling(win_10per, center=True, min_periods=0).mean(), label=amodel_rzstd, alpha=0.3)

    models_std_ls.append(amodel_rzstd)

  # Plot Median of Ensemble
  _ = ax.plot(corpus_texts_dt[atext][models_std_ls].median(axis=1).rolling(win_10per, center=True, min_periods=0).mean(), label='Ensemble Median', color='r', linewidth=3)

  # Put a legend to the right of the current axis
  ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))

  plt.grid(True, alpha=0.7)
  plt.title(f'{corpus_titles_dt[atext][0]}\nSentimentArc Ensemble of {len(ensemble_ls)} Models\nSmoothed: SMA (window=10%)\nRobustScaler + zScore Standardized')
  plt.show();


In [None]:
# Drop the Robust '_rstd' Columns for all Texts

for i, atext in enumerate(corpus_texts_dt.keys()):
  print(f'Text #{i}: {atext}')

  models_rstd_ls = [x for x in corpus_texts_dt[atext].columns if '_rstd' in x]
  # models_rstd_ls

  corpus_texts_dt[atext].drop(columns=models_rstd_ls, inplace=True)
  corpus_texts_dt[atext].info()

  # Verify no '_rstd' columns exist
  [x for x in corpus_texts_dt[atext].columns if '_rstd' in x]
  print('\n')

## [TEMP] Numpy Experiments

In [None]:
# Test numpy: row vector vs column vector

x1_test = np.linspace(0,1,100)
x1_test.shape
print(x1_test[:5])
x2_test = np.linspace(0,1,100).reshape(-1,1)
x2_test.shape
x2_test[:5]

In [None]:
# Test numpy

A = np.array([1,4,9,16,25,36,49,64,81,100,121,144,169,196,225,256])
B = A.reshape(-1,1)
B
C = np.array(list(range(len(A))))
C
D = np.dstack((C,A))
D

E = np.array(list(zip(C,A)))
E.shape
F= scaler_minmax.fit_transform(E)
F
plt.plot(F[:,0], F[:,1])

## SMA + LOWESS Smoothing

In [None]:
#@markdown **Select Smoothing Parameters:**

#@markdown Simple Moving Average/Rolling Mean (default 10%):

Window_Percent = "10" #@param ["5", "10", "15", "20"]

#@markdown Second: LOWESS (default 20):

Inv_Fraction = "20" #@param ["5", "10","15","20","30", "40"]

#@markdown **NOTE:** frac = 1.0/Inv_Fraction (e.g. Inv_Fraction = 30 gives frac = 1./30 ≈ 0.033)
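
To make the two parameters above concrete, here is a small sketch of how they turn into a window length and a LOWESS fraction, together with a plain NumPy centred moving average. `centered_sma` is a hypothetical helper and only approximates pandas `rolling(..., center=True, min_periods=0)` (window alignment can differ by one sample for even windows); `n_sentences` is a made-up text length. The LOWESS step itself is left to statsmodels.

```python
import numpy as np

def centered_sma(values, window):
    """Centred moving average; the window shrinks at both ends of the series."""
    values = np.asarray(values, dtype=float)
    half = window // 2
    out = np.empty_like(values)
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        out[i] = values[lo:hi].mean()
    return out

Window_Percent = "10"   # form default
Inv_Fraction = "20"     # form default
n_sentences = 3000      # hypothetical number of sentences in a text

window = int(int(Window_Percent) / 100 * n_sentences)  # SMA window in sentences
frac = 1.0 / int(Inv_Fraction)                         # share of points per local LOWESS fit
print(window, frac)
print(centered_sma([0, 0, 3, 0, 0], 3))
```

A larger `Window_Percent` or a smaller `Inv_Fraction` (larger `frac`) both give a smoother arc at the cost of flattening short-lived peaks and valleys.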

In [None]:
%%time

# timeit_res = %timeit -n1 -r1 -o sum(range(1000000))


# NOTE:      1m30s  19:10 on 20220308 Colab Pro/CPU w/moepy (1 Novel  x 2 Models)
#            5m28s  19:12 on 20220308 Colab Pro/CPU w/moepy (2 Novels x 2 Models)
#            5m29s  19:39 on 20220308 Colab Pro/CPU w/moepy (2 Novels x 2 Models) ~ 1m20s per Model per Novel
#        ~1h24m29s  19:39 on 20220308 Colab Pro/CPU w/moepy (2 Novels x 32 Models) 
#            4m23s  06:34 on 20220309 Colab Pro/CPU w/statsmodels (2 Novels x 32 Models) ~ 4.5s per Model per Novel
#            4m48s  11:53 on 20220309 Colab Pro/CPU w/statsmodels (2 novels x 32 Models)

# SMA + LOWESS smoothing of the Robust+zScore Standardized ('_rzstd') Time Series

# Smoothing Parameters
win_per = int(Window_Percent)/100
afrac_inv = int(Inv_Fraction) # [10, 20, 30]


# for i, atext in enumerate(corpus_texts_dt.keys()):
for i, atext in enumerate(corpus_titles_ls):
  print(f'Text #{i}: {atext}')

  fig = plt.figure()
  ax = plt.subplot(111)

  models_smalowess_rzstd_ls = []
  models_sma_rzstd_ls = []
  for j, amodel in enumerate(ensemble_ls):
    # amodel_rstd = f'{amodel}_rstd'
    # Assume ensemble_ls only contains model roots, if not
    #   could add check to prevent duplicating models with suffixes
    amodel_rzstd = f'{amodel}_rzstd'
    amodel_sma_rzstd = f'{amodel}_sma_rzstd'
    amodel_smalowess_rzstd = f'{amodel}_smalowess_rzstd'
    print(f'  Model #{j}: {amodel} (Model_Std: {amodel_rzstd})')

    # Save SMA of Robust+zScore Standardized Models
    sent_ct = int(win_per * corpus_texts_dt[atext][amodel].shape[0])
    corpus_texts_dt[atext][amodel_sma_rzstd] = corpus_texts_dt[atext][amodel_rzstd].rolling(sent_ct, center=True, min_periods=0).mean()

    # Get x/y values as numpy arrays
    x = np.array(range(corpus_texts_dt[atext][amodel].shape[0]))
    # y_sma_rzstd = corpus_texts_dt[atext][amodel_rzstd].rolling(win_10per, center=True, min_periods=0).mean().to_numpy()
    y_sma_rzstd = corpus_texts_dt[atext][amodel_sma_rzstd].to_numpy()

    # Get LOWESS smoothing of SMA smoothed Sentiment
    # UNCOMMENT only ONE of these TWO Options
    # ---------------------------------------

    # Option (a): statsmodels LOWESS
    sm_x, y_smalowess_rzstd_pred = sm_lowess(y_sma_rzstd, x,  frac=1./afrac_inv, it=5).T

    # Option (b): moepy LOWESS
    # print('Computing MOEPy LOWESS for RStd Time Series')
    # lowess_moepy.fit(x, y_sma_rzstd, frac=1./afrac_inv)
    # y_sma_rzstd_pred = lowess_moepy.predict(x)
    # moepy fit _sma_rstd
    # print('Computing MOEPy LOWESS for SMA_RStd Time Series')
    ## lowess_moepy.fit(x, y_sma_rzstd, frac=1./afrac_inv)
    ## y_smalowess_rzstd_pred = lowess_moepy.predict(x)

    # Save SMA smoothed Robust+zScore Standardized Time Series for Model=amodel
    # (if > 5 models comment out) corpus_texts_dt[atext][amodel_sma_rzstd] = pd.Series(y_sma_rzstd)

    # Save SMA+LOWESS smoothed Robust+zScore Standardized Time Series for Model=amodel
    corpus_texts_dt[atext][amodel_smalowess_rzstd] = pd.Series(y_smalowess_rzstd_pred)

    # Plot amodel_smalowess_rzstd
    # atitle = f'{corpus_titles_dt[atext][0]}\nSentimentArcs Model: {amodel}\nLOWESS Smoothed (frac=1./{afrac_inv})\nClipped (2.5*IQR) and Standardized (zScore)'
    # _ = ax.plot(y_sma_rzstd, label=amodel_sma_rzstd, alpha=0.3) # .rolling(win_10per, center=True, min_periods=0).mean(), label=amodel_rzstd, alpha=0.3)
    _ = ax.plot(corpus_texts_dt[atext][amodel_sma_rzstd], label=amodel_sma_rzstd, alpha=0.3) # .rolling(win_10per, center=True, min_periods=0).mean(), label=amodel_rzstd, alpha=0.3)
    _ = ax.plot(corpus_texts_dt[atext][amodel_smalowess_rzstd], label=amodel_smalowess_rzstd, alpha=0.3) # .rolling(win_10per, center=True, min_periods=0).mean(), label=amodel_rzstd, alpha=0.3)

    print(f'  Appending {amodel_smalowess_rzstd} to models_smalowess_rzstd_ls')
    models_smalowess_rzstd_ls.append(amodel_smalowess_rzstd)
    models_sma_rzstd_ls.append(amodel_sma_rzstd)

  # Plot Ensemble Median(SMA+LOWESS)
  print('Plotting Ensemble Median(SMA+LOWESS)')
  models_smalowess_rzstd_ls_median = corpus_texts_dt[atext][models_smalowess_rzstd_ls].median(axis=1)
  _ = ax.plot(models_smalowess_rzstd_ls_median, label='Ensemble Median(SMA+LOWESS)', color='r', linewidth=3) # .rolling(win_10per, center=True, min_periods=0).mean(), label='Ensemble Median', color='r', linewidth=3)

  # Plot Ensemble LOWESS(Median(SMA+LOWESS))
  print('Plotting Ensemble LOWESS(Median(SMA+LOWESS))')

  # Get x/y values as numpy arrays
  x = np.array(range(corpus_texts_dt[atext][amodel].shape[0]))
  y_lowess_smalowess_median = models_smalowess_rzstd_ls_median.to_numpy()

  # Get LOWESS smoothing of SMA smoothed Sentiment
  # UNCOMMENT only ONE of these TWO Options
  # ---------------------------------------

  # Option (a): statsmodels LOWESS
  sm_x, y_lowess_smalowess_median_pred = sm_lowess(y_lowess_smalowess_median, x,  frac=1./afrac_inv, it=5).T

  # Option (b): moepy LOWESS
  # print('Computing MOEPy LOWESS for RStd Time Series')
  # lowess_moepy.fit(x, y_lowess_smalowess_median, frac=1./afrac_inv)
  # y_lowess_smalowess_median = lowess_moepy.predict(x)
  # moepy fit _sma_rstd
  # print('Computing MOEPy LOWESS for SMA_RStd Time Series')
  ## lowess_moepy.fit(x, y_lowess_smalowess_median, frac=1./afrac_inv)
  ## y_lowess_smalowess_median_pred = lowess_moepy.predict(x)

  # Save SMA smoothed Robust+zScore Standardized Time Series for Model=amodel
  # (if > 5 models comment out) corpus_texts_dt[atext][amodel_sma_rzstd] = pd.Series(y_sma_median)

  # Save SMA+LOWESS smoothed Robust+zScore Standardized Time Series for Model=amodel
  # corpus_texts_dt[atext][f'{amodel}_sma_median_pred'] = pd.Series(y_lowess_smalowess_median_pred)

  lowesssma_ls_median = y_lowess_smalowess_median_pred # lowess(sma_ls_median)
  _ = ax.plot(lowesssma_ls_median, label='Ensemble LOWESS(Median(SMA+LOWESS))', color='b', linewidth=3) # .rolling(win_10per, center=True, min_periods=0).mean(), label='Ensemble Median', color='r', linewidth=3)

  # Put a legend to the right of the current axis
  ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))

  plt.grid(True, alpha=0.7)
  plt.title(f'{corpus_titles_dt[atext][0]}\nSentimentArc Ensemble of {len(ensemble_ls)} Models\nSmoothed: SMA (window={Window_Percent}%) + LOWESS (frac=1./{Inv_Fraction})\nClipped with IQR + zScore Standardized')
  plt.show();
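
The two-stage smoothing used in the cell above (a centered SMA followed by LOWESS) can be sketched on a single synthetic series. This is a minimal illustration, not the notebook's pipeline: the data are random, and `window_percent` and `inv_fraction` are hypothetical stand-ins for the `Window_Percent` and `Inv_Fraction` form values.

```python
import numpy as np
import pandas as pd
from statsmodels.nonparametric.smoothers_lowess import lowess as sm_lowess

rng = np.random.default_rng(0)
n = 500
x = np.arange(n)
# Synthetic stand-in for one standardized sentiment series (hypothetical data)
raw = pd.Series(np.sin(2 * np.pi * x / n) + rng.normal(0, 1.0, n))

window_percent = 10   # assumed to play the role of Window_Percent
inv_fraction = 20     # assumed to play the role of Inv_Fraction

# Step 1: centered simple moving average over window_percent % of the series
sent_ct = int(window_percent / 100 * n)
sma = raw.rolling(sent_ct, center=True, min_periods=0).mean()

# Step 2: LOWESS over the SMA output; lowess() takes (endog=y, exog=x)
# and returns an (n, 2) array of sorted x values and fitted y values
sm_x, sma_lowess = sm_lowess(sma.to_numpy(), x, frac=1.0 / inv_fraction, it=5).T

print(f'std raw={raw.std():.3f}  sma={sma.std():.3f}  sma+lowess={sma_lowess.std():.3f}')
```

A larger `inv_fraction` means a smaller LOWESS neighborhood, so the fitted curve follows the SMA more closely.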

In [None]:
%whos list


In [None]:
models_sma_ls = [x for x in corpus_texts_dt[atext].columns if x.endswith('_sma_rzstd')]
models_sma_ls

In [None]:
# Verify LOWESS(SMA+LOWESS) is a good fit (Adjust outer/last LOWESS frac as necessary)

# atext = corpus_texts_ls[0]
model_ct = 2

win_per = int(int(Window_Percent)/100 * corpus_texts_dt[atext].shape[0])  # Window_Percent is a percentage, so divide by 100

models_smalowess_ls = [x for x in corpus_texts_dt[atext].columns if x.endswith('_smalowess_rzstd')]
models_sma_ls = [x for x in corpus_texts_dt[atext].columns if x.endswith('_sma_rzstd')]
# sma_ls = [x.replace('_smalowess_', '_') for x in models_smalowess_ls]

# afrac_inv = 10 # (defined above)


# for i, atext in enumerate(corpus_texts_dt.keys()):
for i, atext in enumerate(corpus_titles_ls):
  print(f'Text #{i}: {atext}')

  fig = plt.figure()
  ax = plt.subplot(111)

  # for i, amodel_smalowess in enumerate(smalowess_ls[:model_ct]):
  for i, amodel_smalowess in enumerate(models_smalowess_ls[:model_ct]):
    # for i, amodel_smalowess in enumerate(models_smalowess_ls):

    # _ = ax.plot(models_smalowess_rzstd_ls_median, label='Ensemble Median', color='r', linewidth=3) # .rolling(win_per, center=True, min_periods=0).mean(), label='Ensemble Median', color='r', linewidth=3)
    smalowess_label = f'sma+lowess: {models_sma_ls[i]}'
    _ = ax.plot(corpus_texts_dt[atext][models_smalowess_ls[i]], label=smalowess_label, alpha=0.3, linewidth=3)
    sma_label = f'sma: {models_sma_ls[i]}'
    # _ = ax.plot(corpus_texts_dt[atext][sma_models_sma_lsls[i]].rolling(win_per, center=True, min_periods=0).mean(), label=sma_label, alpha=0.3, linewidth=3)
    _ = ax.plot(corpus_texts_dt[atext][models_sma_ls[i]], label=sma_label, alpha=0.3, linewidth=3)

  # Put a legend to the right of the current axis
  _ = ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))

  _ = ax.grid(True, alpha=0.7)
  _ = ax.set_title(f'{corpus_titles_dt[atext][0]}\nSentimentArc Ensemble of {model_ct} Models\nSmoothed: SMA only (vs) SMA (window=10%) + LOWESS (frac=1./{afrac_inv})\nClipped with IQR + zScore Standardized')
  plt.show(); 

print('Verify that the smooth LOWESS curves relatively accurately follow the jagged SMA curves below')
print('  If not, go back and adjust the parameters for LOWESS/SMA so they more closely match (e.g. increase LOWESS 20->30)')

In [None]:
# Create lists of Model Types

# Get Types of Models
print('Model Types (e.g. for VADER):')
[x for x in corpus_texts_dt[corpus_texts_ls[0]].columns if 'vader' in x]
print('\n\n')

models_smalowess_ls = [x for x in corpus_texts_dt[atext].columns if x.endswith('_smalowess_rzstd')]
print(f'\n\nModel Type: models_smalowess_ls=[*_smalowess_rzstd]\n  {models_smalowess_ls}')
print(f'   Model Count: {len(models_smalowess_ls)}\n\n')

models_rzstd_ls = [x for x in corpus_texts_dt[atext].columns if (x.endswith('_rzstd') and not x.endswith('_smalowess_rzstd') and not x.endswith('_sma_rzstd'))]
print(f'\n\nModel Type=[*_rzstd]\n  {models_rzstd_ls}')
print(f'   Model Count: {len(models_rzstd_ls)}\n\n')
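
The suffix-based grouping above can be written as one pass that checks the most specific suffix first, so that `*_smalowess_rzstd` and `*_sma_rzstd` columns are never counted as plain `*_rzstd`. This is a small sketch with hypothetical column names; `columns_by_type` is an illustrative helper, not a function from this notebook.

```python
# Hypothetical column names following the notebook's suffix convention
columns = ['vader', 'vader_rzstd', 'vader_sma_rzstd', 'vader_smalowess_rzstd',
           'textblob', 'textblob_rzstd', 'textblob_sma_rzstd', 'textblob_smalowess_rzstd']

def columns_by_type(cols):
    """Group column names by processing stage, testing the longest suffix first."""
    groups = {'smalowess_rzstd': [], 'sma_rzstd': [], 'rzstd': [], 'raw': []}
    for col in cols:
        if col.endswith('_smalowess_rzstd'):
            groups['smalowess_rzstd'].append(col)
        elif col.endswith('_sma_rzstd'):
            groups['sma_rzstd'].append(col)
        elif col.endswith('_rzstd'):
            groups['rzstd'].append(col)
        else:
            groups['raw'].append(col)
    return groups

groups = columns_by_type(columns)
for stage, names in groups.items():
    print(f'{stage}: {names}')
```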


In [None]:
# Verify Ensemble SMA+LOWESS Arcs for all Texts

for i, atext in enumerate(corpus_titles_ls):
  print(f'Plotting {corpus_titles_dt[atext][0]}:')

  fig = plt.figure()
  ax = plt.subplot(111)

  for j, amodel in enumerate(models_smalowess_ls):
    _ = ax.plot(corpus_texts_dt[atext][amodel], label=amodel, alpha=0.3)
    # _ = ax.plot(corpus_texts_dt[atext][models_smalowess_ls].plot(title=f'{corpus_titles_dt[atext][0]}')

  _ = ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))

  _ = ax.grid(True, alpha=0.7)
  _ = ax.set_title(f'{corpus_titles_dt[atext][0]}\nSentimentArc Ensemble of {len(ensemble_ls)} Models\nSmoothed: SMA (window={Window_Percent}%) + LOWESS (frac=1./{Inv_Fraction})\nClipped with IQR + zScore Standardized')
  fig.show();

In [None]:
# Global Variable to hold different Ensembles Centrality Series for each Text

corpus_centrality_dt = {}

In [None]:
models_smalowess_ls

In [None]:
%whos list

In [None]:
%%time

# NOTE: 

# Compute, Plot, and Save Different measures of Centrality

# TODO: Recompute models_smalowess_ls here to make this code cell independent of previous cell execution?
# models_smalowess_ls


# Individual Model Smoothed Arcs
# 1. SMA(rzstd) - temporal accuracy
# 2. LOWESS(SMA(rzstd)) - peak detection

# Collective Ensemble Smoothed Medians
# a. Median(LOWESS(SMA(rzstd)))
# b. LOWESS(Median(LOWESS(SMA(rzstd))))
# c. LOWESS(Median(SMA(rzstd)))
# d. LOWESS(Median(rzstd))

# Plot SMA+LOWESS Smoothed Sentiment Arcs

# Smoothing Parameters
win_per = int(Window_Percent)/100
afrac_inv = int(Inv_Fraction) # [10, 20, 30]


# for i, atext in enumerate(corpus_texts_dt.keys()):
for i, atext in enumerate(corpus_titles_ls):
  print(f'Text #{i}: {atext}')

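  # Create a key:(empty)value entry to avoid a key error upon insert
  # (explicit dtype avoids the default-dtype warning for an empty Series)
  corpus_centrality_dt[atext] = pd.Series(dtype=object)

  # SMA window (in sentences) for Robust+zScore Standardized Models
  # (row count taken from the DataFrame itself, not from a stale `amodel` left over from an earlier cell)
  sent_ct = int(win_per * corpus_texts_dt[atext].shape[0])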
  # corpus_texts_dt[atext][amodel_sma_rzstd] = corpus_texts_dt[atext][amodel_rzstd].rolling(sent_ct, center=True, min_periods=0).mean()
  # win_per = int(int(Window_Percent)/100 * corpus_texts_dt[atext].shape[0])
  # std_scaler_fl = 4.5 # 5.0
  # from socket import AF_AX25

  # smalowess_ls = [x for x in corpus_texts_dt[atext].columns if x.endswith('_smalowess_rzstd')]
  # smalowess_ls


  fig, ax  = plt.subplots(nrows=2, ncols=1, sharex=True, gridspec_kw={'height_ratios': [3, 1]})


  # Top Plot: Ensemble Sentiment Time Series

  # _ = ax.plot(models_smalowess_rzstd_ls_median, label='Ensemble Median', color='r', linewidth=3) # .rolling(win_10per, center=True, min_periods=0).mean(), label='Ensemble Median', color='r', linewidth=3)
  for amodel in models_smalowess_ls:
    _ = ax[0].plot(corpus_texts_dt[atext][amodel], label=amodel, alpha=0.3, linewidth=1)

  # TOP PLOTS
  # -------------------------------------------------- 

  # Take Median immediately after Robust IQR+zScore Standardization (rzstd), before Smoothing
  # Common Series(3 x rzstd_median_ser): Median(rzstd)
  # a. SMA(Median(rzstd)) 
  # b. LOWESS(Median(rzstd))
  # c. LOWESS(SMA(Median(rzstd)))

  # Take Median after Smoothing
  # SMA - better for localization, so do first
  # LOWESS -better for peak detection, so do last
  # apply median() first to minimize outliers, thereafter use mean() to avoid further compression of features (except for Rolling Mean=SMA, which uses mean() even if it is the first transformation step)
  # omit _rzstd_ where possible, meaning clear
  # Common Series(3 x sma_rzstd_ser): SMA(rzstd)
  # d. Median(LOWESS(SMA(rzstd)))
  # e. LOWESS(Median(LOWESS(SMA(rzstd))))
  # f. LOWESS(Median(SMA(rzstd)))

  # Reusable subcomponents
  rzstd_median_ser = corpus_texts_dt[atext][models_rzstd_ls].median(axis=1)
  sma_rzstd_ser = corpus_texts_dt[atext][models_rzstd_ls].rolling(sent_ct, center=True, min_periods=0).mean()
  x = np.array(range(corpus_texts_dt[atext][amodel].shape[0]))

  # Plot: (a) SMA(Median(rzstd))
  # TODO: Debug problem with corpus_titles_ls[0] (too high, seems unnormed)
  plota_sma_median = corpus_texts_dt[atext][models_rzstd_ls].median(axis=1).rolling(sent_ct, center=True, min_periods=0).mean()
  corpus_centrality_dt[atext]['sma_median_rzstd'] = plota_sma_median
  _ = ax[0].plot(plota_sma_median, label='SMA(Median(rzstd))', color='r', linewidth=3, alpha=0.7) # .rolling(win_10per, center=True, min_periods=0).mean(), label='Ensemble Median', color='r', linewidth=3)

  # Plot: (b) LOWESS(Median(rzstd))
  # Get x/y values as numpy arrays
  # x = np.array(range(corpus_texts_dt[atext][amodel].shape[0]))
  y = rzstd_median_ser.to_numpy()
  # LOWESS fit
  sm_x, y_sm = sm_lowess(y, x,  frac=1./afrac_inv, it=5).T
  plotb_lowess_median = y_sm
  corpus_centrality_dt[atext]['lowess_median_rzstd'] = plotb_lowess_median
  # plot
  _ = ax[0].plot(plotb_lowess_median, label='LOWESS(Median(rzstd))', color='c', linewidth=3, alpha=0.7) # .rolling(win_10per, center=True, min_periods=0).mean(), label='Ensemble Median', color='r', linewidth=3)

  # Plot: (c) LOWESS(SMA(Median(rzstd)))
  # Get x/y values as numpy arrays
  # x = np.array(range(corpus_texts_dt[atext][amodel].shape[0]))
  y = rzstd_median_ser.rolling(sent_ct, center=True, min_periods=0).mean().to_numpy()
  # LOWESS fit
  sm_x, y_sm = sm_lowess(y, x,  frac=1./afrac_inv, it=5).T
  plotc_lowess_sma_median = y_sm
  corpus_centrality_dt[atext]['lowess_sma_median_rzstd'] = plotc_lowess_sma_median
  # plot
  _ = ax[0].plot(plotc_lowess_sma_median, label='LOWESS(SMA(Median(rzstd)))', color='b', linewidth=3, alpha=0.7) # .rolling(win_10per, center=True, min_periods=0).mean(), label='Ensemble Median', color='r', linewidth=3)

  # Plot: (d) Median(LOWESS(SMA(rzstd)))
  plotd_median_lowess_sma = corpus_texts_dt[atext][models_smalowess_ls].median(axis=1)
  corpus_centrality_dt[atext]['median_lowess_sma_rzstd'] = plotd_median_lowess_sma
  _ = ax[0].plot(plotd_median_lowess_sma, label='Median(LOWESS(SMA(rzstd)))', color='m', linewidth=3, alpha=0.7) # magenta, to stay distinct from plot (a), which uses 'r'

  # Plot: (e) LOWESS(Median(LOWESS(SMA(rzstd)))) smalowess_ls
  # Get x/y values as numpy arrays
  # x = np.array(range(corpus_texts_dt[atext][amodel].shape[0]))
  y = plotd_median_lowess_sma.to_numpy()
  # LOWESS fit
  sm_x, y_sm = sm_lowess(y, x,  frac=1./afrac_inv, it=5).T
  plote_lowess_sma_median = y_sm
  corpus_centrality_dt[atext]['lowess_median_lowess_sma_rzstd'] = plote_lowess_sma_median
  # plot
  _ = ax[0].plot(plote_lowess_sma_median, label='LOWESS(Median(LOWESS(SMA(rzstd))))', color='g', linewidth=3, alpha=0.7) # .rolling(win_10per, center=True, min_periods=0).mean(), label='Ensemble Median', color='r', linewidth=3)

  # Plot: (f) LOWESS(Median(SMA(rzstd)))
  # Get x/y values as numpy arrays
  # x = np.array(range(corpus_texts_dt[atext][amodel].shape[0]))
  y = sma_rzstd_ser.median(axis=1).to_numpy()
  # LOWESS fit
  sm_x, y_sm = sm_lowess(y, x,  frac=1./afrac_inv, it=5).T
  plotf_lowess_sma_median = y_sm
  corpus_centrality_dt[atext]['lowess_median_sma_rzstd'] = plotf_lowess_sma_median
  # plot
  _ = ax[0].plot(plotf_lowess_sma_median, label='LOWESS(Median(SMA(rzstd)))', color='y', linewidth=3, alpha=0.7) # yellow, to stay distinct from plot (e), which uses 'g'


  # Put a legend to the right of the current axis
  _ = ax[0].legend(loc='center left', bbox_to_anchor=(1, 0.5))
  _ = ax[0].grid(True, alpha=0.3)
  _ = ax[0].set_title(f'{corpus_titles_dt[atext][0]}\nSentimentArc Ensemble of {len(ensemble_ls)} Models\nSmoothed: SMA (window=10%) + LOWESS (frac=1./{afrac_inv})\nClipped with IQR + zScore Standardized')



  # BOTTOM PLOTS: Ensemble Std and MinMax Range Time Series
  # --------------------------------------------------

  # Already computed several cells above as models_smalowess_ls
  # smalowess_ls = [x for x in corpus_texts_dt[atext].columns if x.endswith('_smalowess_rzstd')]
  # smalowess_ls

  # Get list
  # sma_ls = [x.replace('_smalowess_', '_') for x in smalowess_ls]

  # TODO: MinMaxScaler() for std_ser and minmax_ser to put on same scale
  # Reshape numpy with reshape(-1, 1), flatten(), or ravel()
  # test_df = pd.DataFrame({'vader': scaler_robust.fit_transform(np.array(corpus_texts_dt[atext]['vader']).reshape(-1, 1))})
  # test_df['vader_std'] = pd.Series(scaler_robust.fit_transform(np.array(corpus_texts_dt[atext]['vader']).reshape(-1, 1)).flatten())
  # test_df['vader_std'].rolling(300, center=True, min_periods=0).mean().plot(label='RobustScaler', alpha=0.7)
  # test_df = pd.DataFrame({'vader': scaler_robust.fit_transform(np.array(corpus_texts_dt[atext]['vader']).reshape(-1, 1))})
  std_minmax_ser = pd.Series(scaler_minmax.fit_transform(np.array(corpus_texts_dt[atext][models_smalowess_ls].std(axis=1)).reshape(-1, 1)).flatten())
  range_minmax_ser = pd.Series(scaler_minmax.fit_transform(np.array(corpus_texts_dt[atext][models_smalowess_ls].max(axis=1).values - corpus_texts_dt[atext][models_smalowess_ls].min(axis=1).values).reshape(-1, 1)).flatten())

  # _ = ax.plot(models_smalowess_rzstd_ls_median, label='Ensemble Median', color='r', linewidth=3) # .rolling(win_10per, center=True, min_periods=0).mean(), label='Ensemble Median', color='r', linewidth=3)
  _ = ax[1].plot(std_minmax_ser, label=f'Std Dev', color='r', alpha=0.7, linewidth=3)
  _ = ax[1].plot(range_minmax_ser, label='MinMax Range', alpha=0.7)
  # _ = ax.plot(corpus_texts_dt[atext][sma_ls[:model_ct]].rolling(win_10per, center=True, min_periods=0).mean(), label='Ensemble Median', alpha=0.3, linewidth=3)

  # Put a legend to the right of the current axis
  _ = ax[1].legend(loc='center left', bbox_to_anchor=(1, 0.5))

  # ax[1].set_xticks(major_ticks)
  # ax[1].set_xticks(minor_ticks, minor=True)
  # ax[1].set_xticklabels(ax[1].get_xticklabels(), rotation=90, ha='center')

  _ = ax[1].grid(True, alpha=0.7)
  # plt.title(f'{corpus_titles_dt[atext][0]}\nSentimentArc Ensemble of {model_ct} Models\nSmoothed: SMA only (vs) SMA (window=10%) + LOWESS (frac=1./{afrac_inv}\nClipped with IQR + zScore Standardized')
  _ = ax[1].set_title('Standard Deviations vs MinMax Range (Both Standardized)')
  plt.show();  
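
The centrality variants above differ mainly in the order in which the median and the smoother are applied. The sketch below shows, on a synthetic ensemble, that `SMA(Median(...))` and `Median(SMA(...))` are not interchangeable, and how the two dispersion series in the bottom panel can be put on a common [0, 1] scale. It uses SMA only (no LOWESS) to stay dependency-light; the data, the column names, and the `sma` and `minmax` helpers are illustrative assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n, n_models = 400, 5
trend = np.sin(np.linspace(0, 3 * np.pi, n))
# Hypothetical ensemble: a shared trend plus independent noise per model
ensemble_df = pd.DataFrame(
    {f'model{k}_rzstd': trend + rng.normal(0, 1.0, n) for k in range(n_models)})

sent_ct = int(0.10 * n)   # 10% window, as in the notebook's default

def sma(obj):
    """Centered simple moving average for a Series or each column of a DataFrame."""
    return obj.rolling(sent_ct, center=True, min_periods=0).mean()

def minmax(ser):
    """Rescale a Series to the [0, 1] range."""
    return (ser - ser.min()) / (ser.max() - ser.min())

# Order of operations: the median is nonlinear, so these two generally differ
sma_of_median = sma(ensemble_df.median(axis=1))
median_of_sma = sma(ensemble_df).median(axis=1)

# Dispersion across models at each sentence, rescaled for a shared axis
smoothed_df = sma(ensemble_df)
std_scaled = minmax(smoothed_df.std(axis=1))
range_scaled = minmax(smoothed_df.max(axis=1) - smoothed_df.min(axis=1))

print('max |SMA(Median) - Median(SMA)|:', (sma_of_median - median_of_sma).abs().max())
```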



## Grid Search

In [None]:
%whos list

In [None]:
ensemble_ls

In [None]:
[x for x in corpus_texts_dt[corpus_titles_ls[0]].columns if 'vader' in x]

In [None]:
#@title 
import ipywidgets as widgets
from ipywidgets import HBox, Label
from ipywidgets import Layout, Button, Box, FloatText, Textarea, Dropdown, Label, IntSlider
import time
import pandas as pd

#Create DF
df = pd.DataFrame(columns = ['Dropdown_column', 'Float_column'])
df

# Layout
form_item_layout = Layout(
    display='flex',
    flex_flow='row',
    justify_content='space-between',
)


button_item_layout = Layout(
    display='flex',
    flex_flow='row',
    justify_content='center',
    padding = '5%'
)


# Independent dropdown item

drop_down_input = 'Dropdown_input_1'

drop_down = widgets.Dropdown(options=('Dropdown_input_1', 'Dropdown_input_2'))

def dropdown_handler(change):
    global drop_down_input
    print('\r','Dropdown: ' + str(change.new),end='')
    drop_down_input = change.new

drop_down.observe(dropdown_handler, names='value')


# Dependent drop down

# Dependent drop down elements

dependent_drop_down_elements = {}
dependent_drop_down_elements['Dropdown_input_1'] = ensemble_ls # ['A', 'B']
dependent_drop_down_elements['Dropdown_input_2'] = ['C', 'D', 'E'] 

# Define dependent drop down

dependent_drop_down = widgets.Dropdown(options=(dependent_drop_down_elements['Dropdown_input_1']))

def dependent_dropdown_handler(change):
    # Refresh the dependent dropdown's options whenever the independent dropdown changes
    dependent_drop_down.options = dependent_drop_down_elements[change.new]
drop_down.observe(dependent_dropdown_handler, names='value')



# Button

button = widgets.Button(description='Add row to dataframe')
out = widgets.Output()
def on_button_clicked(b):
    global df
    button.description = 'Row added'
    time.sleep(1)
    with out:
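      # NOTE: float_input is not defined in this cell; it must be set elsewhere before the button is clicked
      new_row = {'Dropdown_column': drop_down_input, 'Float_column': float_input}
      # DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent
      df = pd.concat([df, pd.DataFrame([new_row])], ignore_index=True)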
      button.description = 'Add row to dataframe'
      out.clear_output()  
      display(df)
button.on_click(on_button_clicked)

# Form items

form_items = [         
    Box([Label(value='Independent dropdown'),
         drop_down], layout=form_item_layout),
    Box([Label(value='Dependent dropdown'),
         dependent_drop_down], layout=form_item_layout)
         ]

form = Box(form_items, layout=Layout(
    display='flex',
    flex_flow='column',
    border='solid 1px',
    align_items='stretch',
    width='30%',
    padding = '1%'
))
display(form)
display(out)

In [None]:
%whos list

In [None]:
models_smalowess_rzstd_ls.sort()
models_smalowess_rzstd_str = ','.join([f'"{x}"' for x in models_smalowess_rzstd_ls])
models_smalowess_rzstd_str

# models_smalowess_rzstd_str = ','.join(models_smalowess_rzstd_ls)
# models_smalowess_rzstd_str

In [None]:
# https://www.semicolonworld.com/question/83905/how-to-create-a-dynamic-dependent-dropdown-menu-using-ipywidgets

In [None]:
#@title 
import ipywidgets as widgets
from ipywidgets import HBox, Label
from ipywidgets import Layout, Button, Box, FloatText, Textarea, Dropdown, Label, IntSlider
import time
import pandas as pd

#Create DF
# df = df = pd.DataFrame(columns = ['Dropdown_column', 'Float_column'])
# df

# Layout
form_item_layout = Layout(
    display='flex',
    flex_flow='row',
    justify_content='space-between',
)


# Independent dropdown item

drop_down_input = 'Dropdown_input_1'

drop_down = widgets.Dropdown(options=(models_smalowess_rzstd_ls)) # 'Dropdown_input_1', 'Dropdown_input_2'))

def dropdown_handler(change):
    global drop_down_input
    print('\r','Dropdown: ' + str(change.new),end='')
    drop_down_input = change.new

drop_down.observe(dropdown_handler, names='value')

# Form items

form_items = [         
    Box([Label(value='Model Preprocessed'),
         drop_down], layout=form_item_layout),
         ]

form = Box(form_items, layout=Layout(
    display='flex',
    flex_flow='column',
    border='solid 1px',
    align_items='stretch',
    width='30%',
    padding = '1%'
))
display(form)
display(out)

In [None]:
#@markdown **Instructions:**

#@markdown <li> Select a Smoothing Technique
#@markdown <li> Choose the Parameters for the Smoothing Technique Selected
#@markdown <li> Execute this code cell

#@markdown <hr>

#@markdown **Smoothing Technique**
Smoothing_Algo = "LOWESS" #@param ["SMA", "LOWESS"]

Sample_Model = "roberta15lg_smalowess_rzstd" #@param models ["afinn_smalowess_rzstd","flair_smalowess_rzstd","hinglish_smalowess_rzstd","huggingface_smalowess_rzstd","imdb2way_smalowess_rzstd","nlptown_smalowess_rzstd","pattern_smalowess_rzstd","pysentimentr_huliu_smalowess_rzstd","pysentimentr_jockersrinker_smalowess_rzstd","pysentimentr_lmcd_smalowess_rzstd","pysentimentr_nrc_smalowess_rzstd","pysentimentr_senticnet_smalowess_rzstd","pysentimentr_sentiword_smalowess_rzstd","roberta15lg_smalowess_rzstd","robertaxml8lang_smalowess_rzstd","sentimentr_huliu_smalowess_rzstd","sentimentr_jockers_smalowess_rzstd","sentimentr_jockersrinker_smalowess_rzstd","sentimentr_loughran_mcdonald_smalowess_rzstd","sentimentr_nrc_smalowess_rzstd","sentimentr_senticnet_smalowess_rzstd","sentimentr_sentiword_smalowess_rzstd","sentimentr_socal_google_smalowess_rzstd","stanza_smalowess_rzstd","syuzhetr_afinn_smalowess_rzstd","syuzhetr_bing_smalowess_rzstd","syuzhetr_nrc_smalowess_rzstd","syuzhetr_syuzhet_smalowess_rzstd","t5imdb50k_smalowess_rzstd","textblob_smalowess_rzstd","vader_smalowess_rzstd","yelp_smalowess_rzstd"]

#@markdown <hr>

#@markdown **Window Percent for SMA Smoothing (default 10%)**
Window_Start = 6 #@param {type:"slider", min:2, max:10, step:1}
Window_End = 14 #@param {type:"slider", min:10, max:20, step:1}
Window_Step = 2 #@param {type:"slider", min:1, max:5, step:1}

win_start_int = int(Window_Start)
win_end_int = int(Window_End)
win_step_int = int(Window_Step)

#@markdown <hr>

#@markdown **Inverse Fraction for LOWESS Smoothing (default 20)**
Inv_Frac_Start = 10 #@param {type:"slider", min:2, max:20, step:1}
Inv_Frac_End = 30 #@param {type:"slider", min:20, max:30, step:1}
Inv_Frac_Step = 2 #@param {type:"slider", min:1, max:5, step:1}

invfrac_start_int = int(Inv_Frac_Start)
invfrac_end_int = int(Inv_Frac_End) # + 1
invfrac_step_int = int(Inv_Frac_Step)



In [None]:
# Grid Search over the Smoothing Hyperparameter Space

if Smoothing_Algo == 'SMA':

  for i, win_size in enumerate(range(win_start_int, win_end_int, win_step_int)):
    print(f'Loop #{i}: {win_size}')

    # Verify Ensemble SMA+LOWESS Arcs for all Texts

    for i, atext in enumerate(corpus_titles_ls):
      print(f'Plotting {corpus_titles_dt[atext][0]}:')

      fig = plt.figure()
      ax = plt.subplot(111)


      _ = ax.plot(corpus_texts_dt[atext][Sample_Model], label=Sample_Model, alpha=0.3)
      # _ = ax.plot(corpus_texts_dt[atext][models_smalowess_ls].plot(title=f'{corpus_titles_dt[atext][0]}')

      _ = ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))

      _ = ax.grid(True, alpha=0.7)
      _ = ax.set_title(f'{corpus_titles_dt[atext][0]}\nSentimentArc Ensemble of {len(ensemble_ls)} Models\nSmoothed: SMA (window={Window_Percent}%) + LOWESS (frac=1./{Inv_Fraction})\nClipped with IQR + zScore Standardized')
      fig.show();

elif Smoothing_Algo == 'LOWESS':

  for i, frac_size in enumerate(range(invfrac_start_int, invfrac_end_int, invfrac_step_int)):
    print(f'Loop #{i}: {frac_size}')

    # Verify Ensemble SMA+LOWESS Arcs for all Texts

    for i, atext in enumerate(corpus_titles_ls):
      print(f'Plotting {corpus_titles_dt[atext][0]}:')

      fig = plt.figure()
      ax = plt.subplot(111)

      for j, amodel in enumerate(models_smalowess_ls):
        _ = ax.plot(corpus_texts_dt[atext][amodel], label=amodel, alpha=0.3)
        # _ = ax.plot(corpus_texts_dt[atext][models_smalowess_ls].plot(title=f'{corpus_titles_dt[atext][0]}')

      _ = ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))

      _ = ax.grid(True, alpha=0.7)
      _ = ax.set_title(f'{corpus_titles_dt[atext][0]}\nSentimentArc Ensemble of {len(ensemble_ls)} Models\nSmoothed: SMA (window={Window_Percent}%) + LOWESS (frac=1./{Inv_Fraction})\nClipped with IQR + zScore Standardized')
      fig.show();

else:
  print(f'ERROR: Illegal value for Smoothing_Algo: {Smoothing_Algo}')
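
The loops above iterate over the parameter range but plot series that were already smoothed with fixed parameters. A grid search that recomputes the smoothing for each candidate value might look like the sketch below. It covers only the SMA branch, on synthetic data; the `roughness` score (mean absolute first difference) is a hypothetical summary chosen for illustration, not a metric defined in this notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 600
# Synthetic stand-in for one standardized sentiment series (hypothetical data)
series = pd.Series(np.sin(np.linspace(0, 4 * np.pi, n)) + rng.normal(0, 0.8, n))

win_start, win_end, win_step = 6, 14, 2   # percent, like Window_Start/End/Step

results = []
for win_pct in range(win_start, win_end + 1, win_step):   # + 1 makes the end inclusive
    sent_ct = int(win_pct / 100 * n)                       # window size in sentences
    smoothed = series.rolling(sent_ct, center=True, min_periods=0).mean()
    roughness = smoothed.diff().abs().mean()               # hypothetical summary score
    results.append({'win_pct': win_pct, 'sent_ct': sent_ct, 'roughness': roughness})

grid_df = pd.DataFrame(results)
print(grid_df)
```

Python's `range` excludes its stop value, so a loop written as `range(win_start_int, win_end_int, win_step_int)` never evaluates `Window_End` itself; the `+ 1` here is one way to include it.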

## [TEMP] Rename folders under SentimentArcs to be more intuitive

In [None]:
os.chdir('/gdrive/MyDrive/cdh/sentiment_arcs/')

In [None]:
# %cd ../sentiment_clean
!ls -altr

In [None]:
# !mv social_clean_social_new sentiment_clean_social_new
# !mv finance_clean_social_new sentiment_clean_finance_new

In [None]:
# !mkdir social_clean_social_new

In [None]:
"""
!mkdir timeseries_clean_novels_new
!mkdir timeseries_clean_novels_ref
!mkdir timeseries_clean_social_new
!mkdir timeseries_clean_social_ref
!mkdir timeseries_clean_finance_new
!mkdir timeseries_clean_finance_ref

!ls -altr
""";

## Save Checkpoint (Dict of DataFiles)

In [None]:
# Review all variations of each model being saved

[x for x in corpus_texts_dt[corpus_titles_ls[0]] if 'vader' in x]

In [None]:
# TODO: Delete text_clean, text_raw from all files downstream of text_clean to save space

[x for x in corpus_texts_dt[corpus_titles_ls[0]].columns if 'text' in x]

In [None]:
# Global Variable
"""
SUBDIR_SENTIMENTARCS = '/gdrive/MyDrive/cdh/sentiment_arcs'
SUBDIR_TIMESERIES_RAW = f'/timeseries_raw/timeseries_raw_{Corpus_Genre}_{Corpus_Type}'
SUBDIR_TIMESERIES_CLEAN = f'/timeseries_clean/timeseries_clean_{Corpus_Genre}_{Corpus_Type}'
SUBDIR_TIMESERIES_RAW
""";

In [None]:
# Note Structure of corpus_distance_dt[atext] = 
# (model, distance) for all models
# (distance, similarity_metric e.g. manhattan)
# (baseline, time_series of Ensemble Centrality Series, long)

# Verify in SentimentArcs Root Directory
os.chdir(SUBDIR_SENTIMENTARCS)

print(f'\nSaving to Subdirectory:\n  {SUBDIR_TIMESERIES_RAW}\n\n')
print('--------------------------------------------------\n\n')

# Save Each Text's Time Series DataFrame as a Separate File
for i, atext in enumerate(corpus_texts_dt.keys()):
  print(f'Saving Timeseries #{i}: [{atext}]')
  # Generate Unique Filename
  filename_save = f'timeseries_raw_{Corpus_Genre}_{Corpus_Type}_{atext}.csv'
  print(f'             to file: [{filename_save}]')
  print(f'           in subdir: [{SUBDIR_TIMESERIES_RAW}]\n\n')
  corpus_texts_dt[atext].head()
  corpus_texts_dt[atext].to_csv(f'.{SUBDIR_TIMESERIES_RAW}/{filename_save}')

  # pretty = json.dumps(corpus_texts_dt[atext], indent=4)
  # print(pretty)
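
The checkpoint pattern above (one CSV per text, named from genre, type, and text key) can be exercised end to end with a temporary directory. This is a sketch under stated assumptions: `texts_dt`, `genre`, and `ctype` are hypothetical stand-ins for `corpus_texts_dt`, `Corpus_Genre`, and `Corpus_Type`, and `pathlib` is used in place of `os.chdir` plus relative paths.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-ins for corpus_texts_dt, Corpus_Genre and Corpus_Type
texts_dt = {
    'text_a': pd.DataFrame({'vader': [0.1, -0.2, 0.3]}),
    'text_b': pd.DataFrame({'vader': [0.0, 0.5]}),
}
genre, ctype = 'novels', 'new'

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / 'timeseries_raw' / f'timeseries_raw_{genre}_{ctype}'
    out_dir.mkdir(parents=True, exist_ok=True)   # create the subdirectory if missing

    # Save each text's DataFrame to its own CSV file
    for name, frame in texts_dt.items():
        frame.to_csv(out_dir / f'timeseries_raw_{genre}_{ctype}_{name}.csv')

    # Read the files back; index_col=0 restores the index column written by to_csv
    restored_dt = {p.stem.split(f'{ctype}_', 1)[1]: pd.read_csv(p, index_col=0)
                   for p in sorted(out_dir.glob('*.csv'))}

print(sorted(restored_dt))
```

Creating the directory explicitly avoids a failed `to_csv` when the target subdirectory does not exist yet.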


# **[STEP 4] Evaluate Ensemble on Texts**

* (code) ts_dtw_clustering_dtaidistance_emo_20211201(1).ipynb

## Select Ensemble Baseline and Model Preprocessing

In [None]:
#@markdown **Choose Type of Model Preprocessing (default - SMA+LOWESS):**

Normalized_Series = 'SMA+LOWESS' #@param ['SMA', 'SMA+LOWESS']

#@markdown **Choose Type of Ensemble Centrality (default - lowess_median_lowess_sma_rzstd):**

Centrality_Series = 'lowess_median_lowess_sma_rzstd' #@param ['sma_median_rzstd', 'lowess_median_rzstd', 'lowess_sma_median_rzstd', 'median_lowess_sma_rzstd', 'lowess_median_lowess_sma_rzstd', 'lowess_median_sma_rzstd']


for atext in corpus_titles_ls:
  # atitle = f'Sentiment Analysis\n{corpus_titles_dt[atext][0]}\nCentrality Series: {Centrality_Series}'
  
  fig = plt.figure()
  ax = plt.subplot(111)

  if Normalized_Series == 'SMA':
    models_preproc_ls = models_sma_rzstd_ls
  elif Normalized_Series == 'SMA+LOWESS':
    models_preproc_ls = models_smalowess_rzstd_ls
  else:
    print(f'ERROR: Invalid value for Normalized_Series: {Normalized_Series}')

  # Plot ALL Ensemble Models for a Text 
  for amodel in models_preproc_ls:
    _ = ax.plot(corpus_texts_dt[atext][amodel], label=amodel, alpha=0.3)

  # Plot ONE Ensemble Baseline for a Text
  _ = ax.plot(corpus_centrality_dt[atext][Centrality_Series], label=Centrality_Series, color='red', linewidth=3, alpha=0.7)

  # Put a legend to the right of the current axis
  _ = ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))

  plt.grid(True, alpha=0.7)
  _ = ax.set_title(f'{corpus_titles_dt[atext][0]}\nSentimentArc Ensemble of {len(ensemble_ls)} Models\nSmoothed: SMA (window={Window_Percent}%) + LOWESS (frac=1./{Inv_Fraction})\nClipped with IQR + zScore Standardized')
  plt.show();

In [None]:
# Verify Ensemble Models and Variations for each Model

temp_ls = corpus_texts_dt[corpus_texts_ls[0]].columns

[x for x in temp_ls if '_rzstd' in x]

print(f'\n\nThere are {len(temp_ls)} Columns/Models\n\n')

[x for x in corpus_texts_dt[corpus_texts_ls[0]].columns if 'vader' in x]

## Simple Correlation Matrices

In [None]:
# Clustered Heatmap of Correlations between the rzstd (IQR Clipped + zScore Standardized) Sentiment Series
# Builds 'col_rzstd_ls' from 'ensemble_ls' below

Correlation_Metric = "spearman" #@param ["pearson", "spearman", "kendall"]
# corr_methods_ls = ['pearson', 'spearman', 'kendall']

col_rzstd_ls = []
for amodel in ensemble_ls:
  acol_rzstd = f'{amodel}_rzstd'
  col_rzstd_ls.append(acol_rzstd)
print(f'\n\nDEFAULT Models:\n    col_rzstd_ls: {col_rzstd_ls}')

# OPTIONAL EDIT: Manually select problematic model to remove from analysis 
#                (e.g. Pattern can misbehave at times)
model_root_bad = 'zzz'
# model_root_bad = ''
col_rzstd_ls = [x for x in col_rzstd_ls if model_root_bad not in x]
print(f'\n\nMODIFIED Models:\n    col_rzstd_ls: {col_rzstd_ls}\n\n')

for atext in corpus_texts_ls:

  corr_df = corpus_texts_dt[atext][col_rzstd_ls].dropna(axis=0, how='any').corr(method=Correlation_Metric)

  # Customize the clustered heatmap of the corr_df correlation matrix

  heat_fig = sns.clustermap(corr_df, 
                      # corpus_sents_df[col_rzstd_ls].dropna(axis=0, how='any').corr(method=corr_method),
                      row_cluster=True,
                      col_cluster=True,
                      annot=True,
                      # annot_kws={"size": 15},
                      figsize=(20, 20))

  # plt.setp(fig.ax_heatmap.xaxis.get_majorticklabels(), rotation=90)
  # plt.setp(fig.ax_heatmap.yaxis.get_majorticklabels(), rotation=0)

  heat_title = f'{corpus_titles_dt[atext][0]}\n{Correlation_Metric.capitalize()} Correlation of Sentiment Time Series\nIQR Clip + zScore Standardization'
  _ = heat_fig.fig.suptitle(heat_title, y=1.05, fontsize=20)
  
  plt.show();
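
The `Correlation_Metric` choice matters when models score sentiment on differently shaped scales. The sketch below compares Pearson and Spearman on synthetic columns: Spearman works on ranks, so it is unchanged by any strictly increasing rescaling of a series, while Pearson measures linear association only. The column names and data are illustrative assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
base = rng.normal(size=300)

# Hypothetical "models": the same signal on a linear scale, on a
# monotone but nonlinear scale, and with added independent noise
df = pd.DataFrame({
    'linear': base,
    'monotone': np.exp(base),
    'noisy': base + rng.normal(0, 1.0, 300),
})

pearson_df = df.corr(method='pearson')
spearman_df = df.corr(method='spearman')

print('Pearson:\n', pearson_df.round(3))
print('Spearman:\n', spearman_df.round(3))
```

Here `linear` and `monotone` carry identical rank information, so their Spearman correlation is 1 while their Pearson correlation is lower.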


## Rank Distance from Ensemble Median

In [None]:
# Global variable

corpus_distance_dt = {}

In [None]:
# TODO: Compute Ensemble Median and each Time Series
# https://stats.stackexchange.com/questions/185912/alternate-distance-metrics-for-two-time-series

# https://tslearn.readthedocs.io/en/stable/gen_modules/tslearn.metrics.html 

# https://www.semanticscholar.org/paper/A-Formally-Robust-Time-Series-Distance-Metric-Toller-Geiger/a6b598bfa5d679003b20a292839a93e7c0cd3705 (2020 0c)
# https://www.semanticscholar.org/paper/A-review-on-distance-based-time-series-Abanda-Mori/1a22c4fcff260a8623a1fd677f2e7a7916343dae (2018 75c)

# Magnitude and Temporal Alignment: Euclidean (better for removed outliers/smoothed arcs)

# Overall Shape w/o excessive distortions: DTW with Global Constraints (Sakoe-Chiba band and the Itakura Parallelogram)

# LCSS - Longest Common Subsequence - has been originally developed to analyse string similarity but can also be used for numerical time series
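
A minimal sketch of the trade-off noted above, assuming two equal-length toy arcs: Euclidean distance penalizes a small horizontal shift, while DTW restricted to a Sakoe-Chiba band can absorb it. The function `dtw_distance` and its `band` parameter are illustrative names written for this note, not the library functions called in the next cell.

```python
import numpy as np

def dtw_distance(a, b, band=None):
  """DTW distance between two 1-D series with an optional Sakoe-Chiba band.

  band is the largest allowed |i - j| offset between aligned indices
  (None = unconstrained). Local cost is the squared difference and the
  result is the square root of the accumulated cost.
  """
  a = np.asarray(a, dtype=float)
  b = np.asarray(b, dtype=float)
  n, m = len(a), len(b)
  if band is None:
    band = max(n, m)
  band = max(band, abs(n - m))  # band must be wide enough to reach cell (n, m)
  cost = np.full((n + 1, m + 1), np.inf)
  cost[0, 0] = 0.0
  for i in range(1, n + 1):
    for j in range(max(1, i - band), min(m, i + band) + 1):
      d = (a[i - 1] - b[j - 1]) ** 2
      cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
  return float(np.sqrt(cost[n, m]))

# Toy arcs (assumption): the same bump, shifted right by 5 steps
x = np.arange(100)
baseline = np.exp(-((x - 45) / 8.0) ** 2)
shifted = np.exp(-((x - 50) / 8.0) ** 2)

euclid = float(np.linalg.norm(baseline - shifted))
dtw_zero = dtw_distance(baseline, shifted, band=0)   # band=0 forces the diagonal path
dtw_band = dtw_distance(baseline, shifted, band=10)
dtw_free = dtw_distance(baseline, shifted)

print(f'euclidean={euclid:.3f}  dtw(band=0)={dtw_zero:.3f}  '
      f'dtw(band=10)={dtw_band:.3f}  dtw(free)={dtw_free:.3f}')
```

With `band=0` only the diagonal alignment is allowed, so the result matches the Euclidean distance; widening the band can only lower the distance, which is why the band width is a knob between strict temporal alignment and shape-only comparison.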


In [None]:
%%time

# NOTE:      >1s @18:05 on 20220310 Colab Pro/CPU:euclidean   (2 Novels/32 Models)
#            >1s @18:06 on 20220310 Colab Pro/CPU:manhattan   (2 Novels/32 Models)
#          1m28s @15:50 on 20220310 Colab Pro/CPU:dtw    (2 Novels/32 Models)
#          3m17s @17:47 ON 20220310 Colab Pro/CPU:ctw    (2 Novels/32 Models)
#          1m20s @17:47 ON 20220310 Colab Pro/CPU:lcss   (2 Novels/32 Models) returns matrix
#         11m47s @17:53 ON 20220310 Colab Pro/CPU:soft_dtw  (2 Novels/32 Models) all same
#        ~10m20s @17:47 ON 20220310 Colab Pro/CPU:soft_dtw_alignment  (2 Novels/32 Models) - returns matrix

# Rank Each Model by Distance from Ensemble Baseline (Chosen Centrality Metric)

corpus_distance_dt = {}

#@markdown **Select a Metric to measure Distance from the Baseline:** 
# Distance_Metric = "lcss" #@param ["euclidean", "manhattan", "dtw", "soft_dtw", "soft_dtw_alignment", "ctw", "lcss", "gak"]
Distance_Metric = "euclidean" #@param ["euclidean", "manhattan", "dtw", "ctw"]

#@markdown **NOTE:** Distances are stored in a Dict of Dicts with 2 metadata keys (distance_metric, baseline)

print('REVIEW Selections:')
print('------------------\n')
print(f'Ensemble Baseline Centrality Metric: {Centrality_Series}')
print(f'    Model Preprocessing & Smoothing: {Normalized_Series}')
print(f'                 Correlation Metric: {Correlation_Metric.capitalize()}')
print(f'                    Distance Metric: {Distance_Metric}\n\n')

# Get and Compare each Model Series with the Ensemble Baseline Centrality Series
for i, atext in enumerate(corpus_texts_ls):
  print(f'\n\nProcessing Text: #{i} {atext}')
  print('----------------------------------------------')

  # Get Ensemble Baseline Centrality Series
  # ensemble_baseline = corpus_centrality_dt[atext][Centrality_Series]

  # Get the Preprocessed & Smoothed Model Series 
  if Normalized_Series == 'SMA':
    models_preproc_ls = models_sma_rzstd_ls
  elif Normalized_Series == 'SMA+LOWESS':
    models_preproc_ls = models_smalowess_rzstd_ls
  else:
    raise ValueError(f'Invalid value for Normalized_Series: {Normalized_Series}')

  # Compare each Preprocessed & Smoothed Model with the Chosen Ensemble Baseline Centrality Series
  models_distance_dt = {}
  for j, amodel in enumerate(models_preproc_ls):
    print(f'          Model: #{j} {amodel}')
    print(f'      Baseline: {Centrality_Series}\n      vs Model: {amodel}')

    if Distance_Metric == 'euclidean':
      adist = euclidean(np.array(corpus_texts_dt[atext][amodel]), np.array(corpus_centrality_dt[atext][Centrality_Series]))
    elif Distance_Metric == 'manhattan':
      adist = cityblock(np.array(corpus_texts_dt[atext][amodel]), np.array(corpus_centrality_dt[atext][Centrality_Series]))
    elif Distance_Metric == 'dtw':
      adist = dtw(np.array(corpus_texts_dt[atext][amodel]), np.array(corpus_centrality_dt[atext][Centrality_Series]))
    elif Distance_Metric == 'soft_dtw':
      adist = soft_dtw(np.array(corpus_texts_dt[atext][amodel]), np.array(corpus_centrality_dt[atext][Centrality_Series]))
    elif Distance_Metric == 'soft_dtw_alignment':
      adist = soft_dtw_alignment(np.array(corpus_texts_dt[atext][amodel]), np.array(corpus_centrality_dt[atext][Centrality_Series]))
    elif Distance_Metric == 'ctw':
      adist = ctw(np.array(corpus_texts_dt[atext][amodel]), np.array(corpus_centrality_dt[atext][Centrality_Series]))
    elif Distance_Metric == 'lcss':
      adist = lcss(np.array(corpus_texts_dt[atext][amodel]), np.array(corpus_centrality_dt[atext][Centrality_Series]))    
    else:
      raise ValueError(f'Illegal value for Distance_Metric: {Distance_Metric}')
  
    print(f'      Distance: {adist}\n\n')
    models_distance_dt[amodel] = float(adist)
  
  # Sort models_distance_dt in ascending order, models closest to ensemble central baseline first
  # ONLY Python 3.7+: {k: v for k, v in sorted(x.items(), key=lambda item: item[1])} 
  models_distance_sorted_dt = {k:v for k,v in sorted(models_distance_dt.items(), key=lambda item: float(item[1]))}
  models_distance_sorted_dt['distance_metric'] = Distance_Metric
  models_distance_sorted_dt['baseline'] = corpus_centrality_dt[atext][Centrality_Series]
  corpus_distance_dt[atext] = copy.deepcopy(models_distance_sorted_dt)


In [None]:
# Verify Sample Text Model Count and Model Names

atext
print('\n')
corpus_distance_dt[atext].keys()
print('\n')
print(f'[{len(list(corpus_distance_dt[atext].keys()))}] Models')

In [None]:
# Remove metadata keys (distance_metric, baseline) and rank the remaining Models by chosen distance metric

corpus_distance_dt.keys()
print('\n')
corpus_distance_dt[corpus_titles_ls[0]].items()
print('\n')

for i, atext in enumerate(corpus_texts_ls):
  # print(f'\n\nProcessing Text: #{i} {atext}')
  # print('----------------------------------------------')

  fig = plt.figure()
  ax = plt.subplot(111)

  D = copy.deepcopy(corpus_distance_dt[atext])
  # Pop off the two metadata key:values that are not Model Time Series
  distance = D.pop('distance_metric', None)
  baseline = D.pop('baseline', None)

  _ = ax.bar(range(len(D)), list(D.values()), align='center')
  _ = ax.set_xticks(np.arange(len(D)))
  # _ = ax.set_xticks(range(len(D)), list(D.values()))
  # _ = ax.set_xticklabels(ax.get_xticklabels(), rotation=90, ha='right')
  labels_clean = [x.split('_sma')[0] for x in D.keys()]
  # _ = ax.set_xticklabels(D.keys(), fontsize=12 , rotation=90, ha='right')
  _ = ax.set_xticklabels(labels_clean, fontsize=12 , rotation=90, ha='right')
  # _ = ax.set_xticklabels(rotation=90, ha='right')
  # _ = ax.bar_label(ax.containers[0]) # matplotlib >=3.4.0
  # Label Bars (use k so the outer loop index i is not shadowed)
  for k, v in enumerate(D.values()):
    _ = ax.text(k, v + .25, f'{v:.3f}', color='white', fontweight='bold', rotation=90, va='top')

  atitle = f"{corpus_titles_dt[atext][0]}\nModel Distance from SentimentArc Ensemble Center Baseline of {len(ensemble_ls)} Models\nPreprocessing & Smoothing: ({Normalized_Series})\nEnsemble Central Baseline: ({Centrality_Series}) using Metric: ({Distance_Metric})"

  _ = ax.grid(True, alpha=0.5)
  _ = ax.set_title(atitle)
  plt.show();

In [None]:
corpus_texts_ls

In [None]:
corpus_distance_dt.keys()

In [None]:
corpus_distance_dt['scollins_thehungergames1'].keys()

In [None]:
# Verify plain-text ranking of each model's distance from the ensemble central baseline

for i, atext in enumerate(corpus_texts_ls):
  print(f'\n\nProcessing Text: #{i} {atext}')
  print('----------------------------------------------')

  amodel_distance_dt = {}

  D = corpus_distance_dt[atext]
  print(f'Text: {atext} (Sorted by distance to Ensemble Central Baseline):\n')
  for amodel,adistance in D.items():
    if amodel in ['baseline', 'distance_metric']:
      continue
    else:
      print(f' {amodel :.<50}{adistance:.3f}')

    amodel_distance_dt[amodel] = adistance
  
  # corpus_distance_dt[atext] was already sorted in ascending order (closest models first),
  # so amodel_distance_dt keeps that order; store it without the two metadata keys
  corpus_distance_dt[atext] = amodel_distance_dt

## Save Checkpoint (Dict of Dicts)

In [None]:
# Write Model Distances Dict of Dicts to .json file

# Change to the SentimentArcs Root Directory
os.chdir(SUBDIR_SENTIMENTARCS)

for i, atext in enumerate(corpus_distance_dt.keys()):
  print(f'Saving Timeseries #{i}: [{atext}]')
  # Generate Unique Filename
  fname_distance_json = f'timeseries_distance_{Distance_Metric.lower()}_raw_{Corpus_Genre}_{Corpus_Type}_{atext}.json'
  fname_distance_json_path = f'.{SUBDIR_TIMESERIES_RAW}/{fname_distance_json}'
  print(f'             to file: [{fname_distance_json}]')
  print(f'           in subdir: [{SUBDIR_TIMESERIES_RAW}]\n\n')

  atext_distance_dt = corpus_distance_dt[atext]
  dumped_dt2json = json.dumps(atext_distance_dt, cls=NumpyEncoder)
  with open(fname_distance_json_path, 'w') as fp:
      fp.write(dumped_dt2json + '\n') 

In [None]:
# Verify Saved json dump by reading back in

test_dt = {}

for i, atext in enumerate(corpus_distance_dt.keys()):
  print(f'     Text #{i}: {atext}')
  atext_distance_dt = corpus_distance_dt[atext]
  # print(f'        Type: {type(list(atext_distance_dt.keys()))}')
  # atext_distance_dt.keys()

  fname_distance_json = f'timeseries_distance_{Distance_Metric.lower()}_raw_{Corpus_Genre}_{Corpus_Type}_{atext}.json'
  fname_distance_json_path = f'.{SUBDIR_TIMESERIES_RAW}/{fname_distance_json}'

  print(f'Reading File: {fname_distance_json}\n')
  with open(fname_distance_json_path, 'r') as fp:
    test_dt = json.load(fp)

  # print(json.dumps(test_dt, indent=4))
  # pprint.pprint(test_dt, width=1)
  print(f'Just read {atext}:')
  # A bare expression inside a loop is not displayed, so print explicitly
  print(test_dt)

In [None]:
# [SKIP]

In [None]:
lowess_grid_dt = {}
crux_ct_ls = []
# temp_df['sent_no'] = pd.Series([x for x in corpus_sents_df['sent_no']])
temp_df['avg_stdscaler'] = corpus_sents_df[models_subset_ls].mean()

fig = plt.figure()
ax = plt.axes()


for afrac in range(frac_start_int, frac_end_int, frac_step_int):
  print(f'Processing afrac = {afrac}')
  # Compute error between subset of models
  afrac_fl = afrac/100
  temp_df = get_lowess(corpus_sents_df, models_ls=models_subset_ls, text_unit='sentence', afrac=afrac_fl, do_plot=False);
  temp_df['minmax_diff'] = temp_df.max(axis=1) - temp_df.min(axis=1)
  diff_sum = temp_df['minmax_diff'].sum()
  print(f"  Sum(minmax_diff): {diff_sum}");
  lowess_grid_dt[afrac] = diff_sum
  # Compute Crux Points
  temp_df['sent_no'] = pd.Series(list(range(temp_df.shape[0])))
  crux_ls = get_crux_points(temp_df,
                            'median',
                            text_type='sentence', 
                            win_per=5, 
                            sec_y_labels=False, 
                            sec_y_height=0, 
                            subtitle_str=' ', 
                            do_plot=False,
                            save2file=False)
  ax.plot(temp_df['sent_no'], temp_df['median'], label=f'frac={afrac}')
  # plt.plot(data=temp_df, x='sent_no', y='median', label=f'frac={afrac}')
  crux_ct_ls.append(len(crux_ls))
  print(f'  {len(crux_ls)} Crux Points')

plt.title(f"{CORPUS_FULL} \n LOWESS Smoothing Grid Search (frac={Frac_Start} to {Frac_End})")
plt.legend()
plt.show()

In [None]:
# np.array(corpus_texts_dt[atext][amodel_std]).reshape(-1,1).shape

np.array(corpus_texts_dt[atext][amodel_std]).reshape(-1,).shape

np.array(corpus_texts_dt[atext][amodel_std]).flatten().shape

## Agglomerative Hierarchical Clustering

In [None]:
# Global Variable

corpus_lttb_dt = {}

In [None]:
# Verify Ensemble Models and Variations for each Model

temp_ls = corpus_texts_dt[corpus_texts_ls[0]].columns

[x for x in temp_ls if '_rzstd' in x]

print(f'\n\nThere are {len(temp_ls)} Columns/Models\n\n')

[x for x in corpus_texts_dt[corpus_texts_ls[0]].columns if 'vader' in x]

In [None]:
# Get list of model values that are (a) Clipped with IQR to treat outliers and (b) zScore Standardized

models_rzstd_ls = [x for x in corpus_texts_dt[corpus_titles_ls[0]].columns if (x.endswith('_rzstd') and not x.endswith('_smalowess_rzstd') and not x.endswith('_sma_rzstd'))]
print('\n'.join(models_rzstd_ls))

print('\n\nModels with values that are (a) Clipped with IQR to treat outliers and (b) zScore Standardized')

### LTTB

* Dimensionality Reduction
* Extract Features (Peaks)
* Overall Arc & Vertical Feature Similarity w/DTW
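
To make the dimensionality-reduction step concrete, here is a minimal NumPy sketch of Largest-Triangle-Three-Buckets (LTTB). It is an independent illustration under simple assumptions (sorted x-values, `n_out >= 3`), not the code of the `lttb` package imported below; the name `lttb_sketch` and the toy series are hypothetical.

```python
import numpy as np

def lttb_sketch(data, n_out):
  """Downsample an (N, 2) array of (x, y) points to n_out points with LTTB."""
  data = np.asarray(data, dtype=float)
  n = len(data)
  if n_out >= n or n_out < 3:
    return data.copy()
  # n_out - 2 buckets over the interior points; first and last points are always kept
  edges = np.linspace(1, n - 1, n_out - 1).astype(int)
  out = np.empty((n_out, 2))
  out[0], out[-1] = data[0], data[-1]
  for i in range(n_out - 2):
    bucket = data[edges[i]:edges[i + 1]]
    # Third triangle vertex: mean of the next bucket, or the final point
    if i < n_out - 3:
      nxt = data[edges[i + 1]:edges[i + 2]].mean(axis=0)
    else:
      nxt = data[-1]
    prev = out[i]
    # Twice the triangle area for every candidate in the current bucket
    areas = np.abs((prev[0] - nxt[0]) * (bucket[:, 1] - prev[1])
                   - (prev[0] - bucket[:, 0]) * (nxt[1] - prev[1]))
    out[i + 1] = bucket[np.argmax(areas)]
  return out

# Toy series (assumption): a smooth arc with one sharp peak at x=500
x = np.arange(1000, dtype=float)
y = np.sin(x / 50.0)
y[500] = 5.0
small = lttb_sketch(np.column_stack((x, y)), n_out=100)
print(small.shape)
print(small[:3])
```

Each selected row is an original `(x, y)` point, so column 0 of the result already holds the sentence numbers of the retained points and the peak survives the reduction. Because different models keep different x-values, the reduced series are irregular and are better compared with DTW than point by point.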

In [None]:
import lttb # https://git.sr.ht/~javiljoen/lttb-numpy (202110 )

# import lttbc # https://github.com/dgoeries/lttbc (20201020 7stars)
from lttb.validators import *

# Global Variable
LTTB_CT = 100

corpus_lttb_dt = {}

In [None]:
# Compare SMA vs SMA+LOWESS Smoothed Sentiments

_ = corpus_texts_dt[corpus_titles_ls[0]]['vader_smalowess_rzstd'].plot(label='SMA+LOWESS', alpha=0.3)
_ = corpus_texts_dt[corpus_titles_ls[0]]['vader_rzstd'].rolling(300, center=True, min_periods=0).mean().plot(label='SMA', alpha=0.3)

In [None]:
# LTTB_CT = 100

# Get np.arrays of test Sentiment TS
test_ts_vals_np = corpus_texts_dt[corpus_titles_ls[0]]['vader_rzstd'].rolling(300, center=True, min_periods=0).mean().values # .reshape(-1,1)
# type(test_ts_vals_np)
test_ts_vals_np.shape
test_ts_indx_np = np.arange(test_ts_vals_np.shape[0])
test_ts_indx_np.shape

test_ts_np = np.column_stack((test_ts_indx_np, test_ts_vals_np))
test_ts_np.shape

# Downsample with stricter validators
test_ts_sm_np = lttb.downsample(test_ts_np, n_out=LTTB_CT, validators=[has_two_columns, x_is_regular])
type(test_ts_sm_np)
test_ts_sm_indx_np = np.arange(test_ts_sm_np.shape[0])
test_ts_sm_indx_np.shape

# Plot
plt.plot(test_ts_indx_np, test_ts_vals_np.transpose())
# plt.plot(test_ts_indx_np, test_ts_vals_np.transpose())
plt.show()

plt.plot(test_ts_sm_indx_np, test_ts_sm_np.transpose()[1])
# plt.plot(test_ts_indx_np, test_ts_vals_np[1].transpose()[1])
plt.show();

In [None]:
scale_mult*100

In [None]:
corpus_texts_dt[atext][amodel_rzstd].shape

In [None]:
amodel_rzstd

In [None]:
# Test and see how original Model (*_rzstd) compares to 100 point downsampled LTTB

atext = corpus_titles_ls[0]
amodel_rzstd = 'vader_rzstd'

temp_lttb_df = pd.DataFrame()

# Get 10% Window Size for SMA
win_10per = int(0.10 * corpus_texts_dt[atext].shape[0])

# Get Sentiment Values Col
temp_ts_vals_np = corpus_texts_dt[atext][amodel_rzstd].rolling(win_10per, center=True, min_periods=0).mean().values # .reshape(-1,1)
# type(temp_ts_vals_np)
# temp_ts_np.shape
# Generate Index Col
temp_ts_indx_np = np.arange(temp_ts_vals_np.shape[0])
# temp_ts_indx_np.shape
# Col Stack
temp_ts_np = np.column_stack((temp_ts_indx_np, temp_ts_vals_np))
# temp_ts_np.shape

# Downsample with stricter validators
# NOTE: 'x_is_regular' only validates that the input x-values are evenly spaced; it does not change the output
temp_ts_vals_sm_np = lttb.downsample(temp_ts_np, n_out=LTTB_CT, validators=[has_two_columns, x_is_regular])
# type(temp_ts_vals_sm_np)
temp_ts_sm_indx_np = np.arange(temp_ts_vals_sm_np.shape[0])
# temp_ts_sm_indx_np.shape

# Impute Sentence Number corresponding to Datapoints
orig_len = temp_ts_vals_np.shape[0]
temp_sm_len = temp_ts_vals_sm_np.shape[0]
scale_mult = orig_len/temp_sm_len
# print(f'scale_mult: {scale_mult}')

# Use the x-values retained by LTTB as the sentence numbers; rescaling the
# downsampled index shifted the arc horizontally, especially late in the series
sentno_np = temp_ts_vals_sm_np[:, 0]

# Save
# temp_lttb_dt = {"sentno" : list(sentno_np), "sentiment" : list(temp_ts_vals_sm_np)}
# temp_lttb_df[amodel_rzstd] = pd.DataFrame(temp_lttb_dt)
temp_lttb_df[amodel_rzstd] = pd.Series(list(temp_ts_vals_sm_np))

# Plot
plt_title = f'{corpus_titles_dt[atext][0]} Standardized Sentiment\nModel: {amodel_rzstd} Smoothed (SMA 10%)\n{temp_ts_indx_np.shape[0]} Datapoints'
_ = plt.plot(temp_ts_indx_np, temp_ts_vals_np.transpose())
_ = plt.title(plt_title)
# plt.plot(temp_ts_indx_np, temp_ts_vals_np.transpose())
# plt.show()

plt_title = f'{corpus_titles_dt[atext][0]} Standardized Sentiment\nModel: {amodel_rzstd} Smoothed (SMA 10%)\nReduced to {LTTB_CT} Datapoints via LTTB'
# plt.plot(temp_ts_sm_indx_np, temp_ts_sm_np.transpose()[1])
_ = plt.plot(sentno_np, temp_ts_vals_sm_np.transpose()[1])
_ = plt.title(plt_title)
# plt.plot(temp_ts_indx_np, temp_ts_vals_np[1].transpose()[1])
plt.show();

In [None]:
sentno_np[-5:]

In [None]:
test_ts_sm_indx_np

In [None]:
%%time

# NOTE: 43s

# Compare SMA with LTTB Approximation

corpus_lttb_dt = {}

LTTB_CT = 100 # [50, 75, 100, 200, 500, 1000]

for i, atext in enumerate(corpus_texts_dt.keys()):
  print(f'Text #{i}: {atext}')

  # Get 10% Window Size for SMA
  win_10per = int(0.10 * corpus_texts_dt[atext].shape[0])
  
  # Get list of model values that are (a) Clipped with IQR to treat outliers and (b) zScore Standardized
  models_rzstd_ls = [x for x in corpus_texts_dt[atext].columns if (x.endswith('_rzstd') and not x.endswith('_smalowess_rzstd') and not x.endswith('_sma_rzstd'))]


  fig = plt.figure()
  ax = plt.subplot(111)

  # models_std_ls = []
  temp_lttb_df = pd.DataFrame()
  for j, amodel_rzstd in enumerate(models_rzstd_ls):

    # Get Sentiment Values Col
    temp_ts_vals_np = corpus_texts_dt[atext][amodel_rzstd].rolling(win_10per, center=True, min_periods=0).mean().values # .reshape(-1,1)
    
    # type(temp_ts_vals_np)
    # temp_ts_np.shape
    # Generate Index Col
    temp_ts_indx_np = np.arange(temp_ts_vals_np.shape[0])
    # temp_ts_indx_np.shape
    # Col Stack
    temp_ts_np = np.column_stack((temp_ts_indx_np, temp_ts_vals_np))
    # temp_ts_np.shape

    # Downsample with stricter validators
    # NOTE: 'x_is_regular' only validates that the input x-values are evenly spaced; it does not change the output
    temp_ts_vals_sm_np = lttb.downsample(temp_ts_np, n_out=LTTB_CT, validators=[has_two_columns, x_is_regular])
    # type(temp_ts_vals_sm_np)
    temp_ts_sm_indx_np = np.arange(temp_ts_vals_sm_np.shape[0])
    # temp_ts_sm_indx_np.shape

    # Impute Sentence Number corresponding to Datapoints
    orig_len = temp_ts_vals_np.shape[0]
    temp_sm_len = temp_ts_vals_sm_np.shape[0]
    scale_mult = orig_len/temp_sm_len
    # print(f'scale_mult: {scale_mult}')

    # Use the x-values retained by LTTB as the sentence numbers; rescaling the
    # downsampled index shifted the arc horizontally, especially late in the series
    sentno_np = temp_ts_vals_sm_np[:, 0]

    # Save
    # temp_lttb_dt = {"sentno" : list(sentno_np), "sentiment" : list(temp_ts_vals_sm_np)}
    # temp_lttb_df[amodel_rzstd] = pd.DataFrame(temp_lttb_dt)
    temp_lttb_df[amodel_rzstd] = list(temp_ts_vals_sm_np)

    # Plot
    plt_title = f'{corpus_titles_dt[atext][0]} Standardized Sentiment\nModel: {amodel_rzstd} Smoothed (SMA 10%)\n{temp_ts_indx_np.shape[0]} Datapoints'
    _ = plt.plot(temp_ts_indx_np, temp_ts_vals_np.transpose())
    _ = plt.title(plt_title)
    # plt.plot(temp_ts_indx_np, temp_ts_vals_np.transpose())
    # plt.show()

    plt_title = f'{corpus_titles_dt[atext][0]} Standardized Sentiment\nModel: {amodel_rzstd} Smoothed (SMA 10%)\nReduced to {LTTB_CT} Datapoints via LTTB'
    # plt.plot(temp_ts_sm_indx_np, temp_ts_sm_np.transpose()[1])
    _ = plt.plot(sentno_np, temp_ts_vals_sm_np.transpose()[1])
    _ = plt.title(plt_title)
    # plt.plot(temp_ts_indx_np, temp_ts_vals_np[1].transpose()[1])
    plt.show();

  # Add median Col/Model
  # median_x_val = ... Ensemble has multiple irregular time series - difficult to find median
  # median_y_val = corpus_texts_dt[atext][models_rzstd_ls].median(axis=1)
  # temp_lttb_df['median_rzstd'] = zip(median_x_val, median_y_val)
  corpus_lttb_dt[atext] = temp_lttb_df

In [None]:
temp_lttb_df.columns

In [None]:
corpus_lttb_dt.keys()

In [None]:
corpus_lttb_dt[corpus_titles_ls[0]].info()

In [None]:
corpus_lttb_dt[corpus_titles_ls[0]]['vader_rzstd']

In [None]:
corpus_lttb_dt[corpus_titles_ls[0]]['roberta15lg_rzstd']

In [None]:
# corpus_lttb_dt[corpus_titles_ls[0]]['median_rzstd']

In [None]:
print(corpus_lttb_dt[corpus_titles_ls[0]]['vader_rzstd'].to_list()[0])

In [None]:
# Verify different Models have different/non-uniform x-values 
atext = corpus_titles_ls[0]
amodel_rzstd1 = 'vader_rzstd'
amodel_rzstd2 = 'roberta15lg_rzstd'
# amodel_rzstd3 = 'median'

x_vals_ls = [alist[0] for alist in corpus_lttb_dt[atext][amodel_rzstd1].to_list()]
y_vals_ls = [alist[1] for alist in corpus_lttb_dt[atext][amodel_rzstd1].to_list()]
plt.plot(x_vals_ls, y_vals_ls, label='vader_rzstd')

x_vals_ls = [alist[0] for alist in corpus_lttb_dt[atext][amodel_rzstd2].to_list()]
y_vals_ls = [alist[1] for alist in corpus_lttb_dt[atext][amodel_rzstd2].to_list()]
plt.plot(x_vals_ls, y_vals_ls, label='roberta15lg_rzstd')

# x_vals_ls = [alist[0] for alist in corpus_lttb_dt[atext][amodel_rzstd3].to_list()]
# y_vals_ls = [alist[1] for alist in corpus_lttb_dt[atext][amodel_rzstd3].to_list()]
# plt.plot(x_vals_ls, y_vals_ls, label='median')

plt_title = f'{atext} Standardized Sentiment\nModel: {amodel_rzstd2} Smoothed (SMA 10%)\nReduced to {LTTB_CT} Datapoints via LTTB'
_ = plt.title(plt_title)

plt.grid(True, alpha=0.7)
plt.legend()
plt.show();

In [None]:
%%time
"""
# NOTE: 41s

# Compare SMA with LTTB Approximation

LTTB_CT = 100

for i, atext in enumerate(corpus_texts_dt.keys()):
  print(f'Text #{i}: {atext}')

  win_10per = int(0.10 * corpus_texts_dt[corpus_titles_ls[0]].shape[0])

  fig = plt.figure()
  ax = plt.subplot(111)

  models_std_ls = []
  for j, amodel_rzstd in enumerate(models_rzstd_ls):

    # Get Sentiment Values Col
    temp_ts_vals_np = corpus_texts_dt[atext][amodel_rzstd].rolling(win_10per, center=True, min_periods=0).mean().values # .reshape(-1,1)
    # type(temp_ts_vals_np)
    # temp_ts_np.shape
    # Generate Index Col
    temp_ts_indx_np = np.arange(temp_ts_vals_np.shape[0])
    # temp_ts_indx_np.shape
    # Col Stack
    temp_ts_np = np.column_stack((temp_ts_indx_np, temp_ts_vals_np))
    # temp_ts_np.shape

    # Downsample with stricter validators
    temp_ts_vals_sm_np = lttb.downsample(temp_ts_np, n_out=LTTB_CT) #, validators=[has_two_columns, x_is_regular])
    # type(temp_ts_vals_sm_np)
    temp_ts_sm_indx_np = np.arange(temp_ts_vals_sm_np.shape[0])
    # temp_ts_sm_indx_np.shape

    # Impute Sentence Number corresponding to Datapoints
    orig_len = temp_ts_vals_np.shape[0]
    temp_sm_len = temp_ts_vals_sm_np.shape[0]
    scale_mult = orig_len/temp_sm_len
    # print(f'scale_mult: {scale_mult}')

    sentno_np = np.round(scale_mult * temp_ts_sm_indx_np)

    # Save
    corpus_lttb_dt[amodel_rzstd] = pd.DataFrame()

    # Plot
    plt_title = f'{corpus_titles_dt[atext][0]} Standardized Sentiment\nModel: {amodel_rzstd} Smoothed (SMA 10%)\n{temp_ts_indx_np.shape[0]} Datapoints'
    _ = plt.plot(temp_ts_indx_np, temp_ts_vals_np.transpose())
    _ = plt.title(plt_title)
    # plt.plot(temp_ts_indx_np, temp_ts_vals_np.transpose())
    # plt.show()

    plt_title = f'{corpus_titles_dt[atext][0]} Standardized Sentiment\nModel: {amodel_rzstd} Smoothed (SMA 10%)\nReduced to {LTTB_CT} Datapoints via LTTB'
    # plt.plot(temp_ts_sm_indx_np, temp_ts_sm_np.transpose()[1])
    _ = plt.plot(sentno_np, temp_ts_vals_sm_np.transpose()[1])
    _ = plt.title(plt_title)
    # plt.plot(temp_ts_indx_np, temp_ts_vals_np[1].transpose()[1])
    plt.show();

  corpus_lttb_dt[atext] = temp_lttb_df
""";

### DTW

Code:

* https://towardsdatascience.com/how-to-apply-hierarchical-clustering-to-time-series-a5fe2a7d8447 (DTW) *
* https://cran.r-project.org/web/packages/dtwclust/vignettes/dtwclust.pdf 
* https://www.sktime.org/en/latest/examples/mrseql.html Classification Mr-SEQL

Guidance:

* https://stats.stackexchange.com/questions/63546/comparing-hierarchical-clustering-dendrograms-obtained-by-different-distances 

Ranks:
* https://paperswithcode.com/task/time-series-clustering

Papers:
* https://reader.elsevier.com/reader/sd/pii/S2666827020300013?token=2286F6993FF63B6B3096B72F09503A950095DD5F1C1BB11146BDC0EAAA8E4D942BAD3A216FDBE65978BAE3B5C9F2363F&originRegion=us-east-1&originCreation=20210817184824
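
A minimal sketch of the approach the references above describe, assuming SciPy is available: build a pairwise DTW distance matrix, convert it to condensed form, and feed it to agglomerative (average-linkage) clustering. The compact `dtw_abs` function, the toy series, and their names are illustrative assumptions; the cells below use `dtaidistance` instead.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def dtw_abs(a, b):
  """Unconstrained DTW with absolute-difference local cost."""
  n, m = len(a), len(b)
  cost = np.full((n + 1, m + 1), np.inf)
  cost[0, 0] = 0.0
  for i in range(1, n + 1):
    for j in range(1, m + 1):
      cost[i, j] = abs(a[i - 1] - b[j - 1]) + min(
          cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
  return float(cost[n, m])

# Toy arcs (assumption): two rising-first and two falling-first shapes
phase = np.linspace(0, 2 * np.pi, 60)
toy_series = {
  'rise_a': np.sin(phase),
  'rise_b': np.sin(phase - 0.3),
  'fall_a': -np.sin(phase),
  'fall_b': -np.sin(phase - 0.3),
}
names = list(toy_series)

# Symmetric pairwise distance matrix with a zero diagonal
k = len(names)
dist = np.zeros((k, k))
for i in range(k):
  for j in range(i + 1, k):
    dist[i, j] = dist[j, i] = dtw_abs(toy_series[names[i]], toy_series[names[j]])

# linkage() expects the condensed (upper-triangle) form of the matrix
Z = linkage(squareform(dist), method='average')
labels = fcluster(Z, t=2, criterion='maxclust')
clusters = dict(zip(names, labels))
print(clusters)
```

The linkage matrix `Z` can also be passed to `scipy.cluster.hierarchy.dendrogram` for plotting. The linkage method is a judgment call: Ward linkage assumes Euclidean geometry, so average or complete linkage is the safer default with DTW distances.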

In [None]:
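# NOTE: this rebinds the name `dtw` to the dtaidistance module; any `dtw` function
#       imported earlier for the distance ranking is shadowed after this cell runs
from dtaidistance import dtw, clustering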

series = [
    np.array([0, 0, 1, 2, 1, 0, 1, 0, 0], dtype=np.double),
    np.array([0.0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0]),
    np.array([0.0, 0, 1, 2, 1, 0, 0, 0])]
    
ds = dtw.distance_matrix_fast(series)
ds

In [None]:
corpus_lttb_dt[corpus_titles_ls[0]]['vader_rzstd']

In [None]:
x_vals_ls = [data_ls[0] for data_ls in corpus_lttb_dt[corpus_titles_ls[0]]['vader_rzstd'].to_list()]
y_vals_ls = [data_ls[1] for data_ls in corpus_lttb_dt[corpus_titles_ls[0]]['vader_rzstd'].to_list()]

plt.plot(x_vals_ls, y_vals_ls)

In [None]:
print(np.array(x_vals_ls))

In [None]:
# Check for dup Cols/Models

from itertools import groupby

cols_ls = corpus_lttb_dt[corpus_titles_ls[0]].columns

# Count how many times each column name occurs (any count > 1 is a duplicate)
for aname, agroup in groupby(sorted(cols_ls)):
  print(f'{aname}: {len(list(agroup))}')

# py3.8 res = [(x, count) for x, g in groupby(sorted(l)) if (lambda x : count = len(list(g)); count > 1]

In [None]:
series_ls = []
for acol in corpus_lttb_dt[corpus_titles_ls[0]].columns:
  # Each cell holds an (x, y) pair; keep only the y (sentiment) values
  y_vals_ls = [pt[1] for pt in corpus_lttb_dt[corpus_titles_ls[0]][acol]]
  series_ls.append(np.array(y_vals_ls, dtype=np.double))

len(series_ls)
print('\n')
len(series_ls[0])

In [None]:
len(corpus_lttb_dt[corpus_titles_ls[0]].columns)

In [None]:
corpus_lttb_dt.keys()

In [None]:
# Create Distance Matrices

# Global Variable
corpus_dtai_dt = {}
models_ls = []

for i, atext in enumerate(corpus_lttb_dt.keys()):
  print(f'\n\nText #{i}: {atext}')

  series_ls = []

  models_rzstd_ls = corpus_lttb_dt[atext].columns
  for amodel_rzstd in models_rzstd_ls:
    models_ls.append(amodel_rzstd)
    # length = len(series_list[i])
    # series_list[i] = series_list[i].values.reshape((length, 1))

    # Read from the current text (atext) rather than always the first text
    x_vals_ls = [data_ls[0] for data_ls in corpus_lttb_dt[atext][amodel_rzstd].to_list()]
    y_vals_ls = [data_ls[1] for data_ls in corpus_lttb_dt[atext][amodel_rzstd].to_list()]
    # print(f'len(y_vals_ls): {len(y_vals_ls)}')
    series_ls.append(np.array(y_vals_ls, dtype=np.double))
        
  dist_np = dtw.distance_matrix_fast(series_ls)

  corpus_dtai_dt[atext] = dist_np

"""
  # Initialize distance matrix
  n_series = len(series_ls)
  distance_matrix = np.zeros(shape=(n_series, n_series))

  # Build distance matrix
  for i in range(n_series):
    for j in range(n_series):
      x = series_ls[i]
      y = series_ls[j]
      if i != j:
        dist = dtw_distance(x, y)
        distance_matrix[i, j] = dist
""";

In [None]:
corpus_dtai_dt.keys()

In [None]:
corpus_dtai_dt['cmieville_thecityandthecity'][0].shape

In [None]:
# Verify the number of series and the shape/type of the first one

len(series_ls)
print('\n')
series_ls[0].shape
type(series_ls[0])

### Agglomerative Hierarchical Clustering

In [None]:
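atext = corpus_titles_ls[0]

# The clustering models compute their own DTW distances, so pass the LTTB-reduced
# sentiment series rather than the precomputed matrix in corpus_dtai_dt[atext]
series = [np.array([pt[1] for pt in corpus_lttb_dt[atext][acol]], dtype=np.double)
          for acol in corpus_lttb_dt[atext].columns]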

# Custom Hierarchical clustering
model1 = clustering.Hierarchical(dtw.distance_matrix_fast, {})
cluster_idx = model1.fit(series)

# Augment Hierarchical object to keep track of the full tree
model2 = clustering.HierarchicalTree(model1)
cluster_idx = model2.fit(series)

# SciPy linkage clustering
model3 = clustering.LinkageTree(dtw.distance_matrix_fast, {})
cluster_idx = model3.fit(series)

In [None]:
# Customize and Label Hierarchical Plot (Option #3: SciPy Linkage)

fig, ax = plt.subplots(nrows=1, ncols=2, gridspec_kw={'width_ratios': [3, 1]}, figsize=(30, 20))
# show_ts_label = lambda idx: "ts-" + str(idx)
show_ts_label = lambda idx: models_ls[idx]
model3.plot("hierarchy.png", axes=ax, show_ts_label=show_ts_label,
           show_tr_label=True, ts_label_margin=-8,
           ts_left_margin=2, ts_sample_length=1)

In [None]:
# ts_label_margin
# ts_left_margin
# ts_sample_length

fig, ax = plt.subplots(nrows=1, ncols=2, gridspec_kw={'width_ratios': [1,2]}, figsize=(30, 20))

# show_ts_label = lambda idx: "ts-" + str(idx)
# show_ts_label = lambda idx: ts_labels[idx]
show_ts_label = lambda idx: models_ls[idx]
model3.plot("hierarchy.png", axes=ax, show_ts_label=show_ts_label,
           show_tr_label=True, ts_label_margin=-10,
           ts_left_margin=10, ts_sample_length=1)

Image(filename='hierarchy.png') 

In [None]:
# [SKIP]

In [None]:
# ts_label_margin
# ts_left_margin
# ts_sample_length

fig, ax = plt.subplots(nrows=1, ncols=2, gridspec_kw={'width_ratios': [1,3]}, figsize=(30, 10))

# show_ts_label = lambda idx: "ts-" + str(idx)
# show_ts_label = lambda idx: ts_labels[idx]
show_ts_label = lambda idx: models_ls[idx]
model3.plot("hierarchy.png", axes=ax, show_ts_label=show_ts_label,
           show_tr_label=True, ts_label_margin=-6,
           ts_left_margin=2, ts_sample_length=1)

Image(filename='hierarchy.png') 

In [None]:
# ts_label_margin
# ts_left_margin
# ts_sample_length

fig, ax = plt.subplots(nrows=1, ncols=2, gridspec_kw={'width_ratios': [1, 2]}, figsize=(30, 20))

# show_ts_label = lambda idx: "ts-" + str(idx)
# show_ts_label = lambda idx: ts_labels[idx]
show_ts_label = lambda idx: models_ls[idx]
model3.plot("hierarchy.png", axes=ax, show_ts_label=show_ts_label,
           show_tr_label=True, ts_label_margin=-2,
           ts_left_margin=0, ts_sample_length=4)

Image(filename='hierarchy.png') 

In [None]:
# 

fig, ax = plt.subplots(nrows=1, ncols=2, gridspec_kw={'width_ratios': [1, 2]}, figsize=(30, 20))

# show_ts_label = lambda idx: "ts-" + str(idx)
# show_ts_label = lambda idx: ts_labels[idx]
show_ts_label = lambda idx: models_ls[idx]
model3.plot("hierarchy.png", axes=ax, show_ts_label=show_ts_label,
           show_tr_label=True, ts_label_margin=-2,
           ts_left_margin=0, ts_sample_length=1)

Image(filename='hierarchy.png') 

In [None]:
fig, ax = plt.subplots(nrows=1, ncols=2, gridspec_kw={'width_ratios': [1, 2]}, figsize=(30, 20))

# show_ts_label = lambda idx: "ts-" + str(idx)
# show_ts_label = lambda idx: ts_labels[idx]
show_ts_label = lambda idx: models_ls[idx]
model3.plot("hierarchy.png", axes=ax, show_ts_label=show_ts_label,
           show_tr_label=True, ts_label_margin=0,
           ts_left_margin=0, ts_sample_length=1)

Image(filename='hierarchy.png') 

In [None]:
fig, ax = plt.subplots(nrows=1, ncols=2, gridspec_kw={'width_ratios': [1, 2]}, figsize=(30, 20))

# show_ts_label = lambda idx: "ts-" + str(idx)
# show_ts_label = lambda idx: ts_labels[idx]
show_ts_label = lambda idx: models_ls[idx]
model3.plot("hierarchy.png", axes=ax, show_ts_label=show_ts_label,
           show_tr_label=True, ts_label_margin=0,
           ts_left_margin=4, ts_sample_length=1)

Image(filename='hierarchy.png') 

In [None]:
fig, ax = plt.subplots(nrows=1, ncols=2, gridspec_kw={'width_ratios': [1, 2]}, figsize=(30, 20))

# show_ts_label = lambda idx: "ts-" + str(idx)
# show_ts_label = lambda idx: ts_labels[idx]
show_ts_label = lambda idx: models_ls[idx]
model3.plot("hierarchy.png", axes=ax, show_ts_label=show_ts_label,
           show_tr_label=True, ts_label_margin=-8.5,
           ts_left_margin=0, ts_sample_length=1)

Image(filename='hierarchy.png') 

In [None]:
ts_labels = ['SentimentR',
             'SyuzhetR',
             'TextBlob',
             'Flair',
             'Stanza',
             'Logistic Regression',
             'LSTM',
             'CNN',
             'RoBERTa 15 Large',
             'T5']


fig, ax = plt.subplots(nrows=1, ncols=2, gridspec_kw={'width_ratios': [1, 2]}, figsize=(30, 20))
# show_ts_label = lambda idx: "ts-" + str(idx)
# show_ts_label = lambda idx: ts_labels[idx]
show_ts_label = lambda idx: models_ls[idx]
model3.plot("hierarchy.png", axes=ax, show_ts_label=show_ts_label,
           show_tr_label=True, ts_label_margin=-8.5,
           ts_left_margin=4, ts_sample_length=1)

Image(filename='hierarchy.png') 

In [None]:
Image(filename='hierarchy.png') 

In [None]:
fig, ax = plt.subplots(nrows=1, ncols=2, gridspec_kw={'width_ratios': [1, 2]}, figsize=(30, 20))

# show_ts_label = lambda idx: "ts-" + str(idx)
# show_ts_label = lambda idx: ts_labels[idx]
show_ts_label = lambda idx: models_ls[idx]
model3.plot("hierarchy.png", axes=ax, show_ts_label=show_ts_label,
           show_tr_label=True, ts_label_margin=-8.5,
           ts_left_margin=4, ts_sample_length=3)

Image(filename='hierarchy.png') 

## SentimentArcs Metrics

* (code) sentimentarcs_part7_join_norm.ipynb

### Model-Corpus Compatibility (MCC)

MCC(amodel, corpus) = 1 / ( sum( | amodel(corpus) - median(corpus) | ) / len(corpus) )

For each Corpus, compute a Coherence Metric for all Models by:
* Computing the Euclidean Distance of each zScore/SMA Model from the zScore/SMA Median
* Sum all Euclidean Distances
* Identify and record furthest outliers per Corpus/Model
* Sum all Euclidean Distances after removing 2-3 of ~35 outliers (5-10% discard)
* Normalize 2 Sums of Euclidean Distances over the entire set of Corpora
* Rank order the Corpora in terms of Coherence
* Rank order Models in terms of Outlier frequency
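
As a minimal sketch of the MCC formula above, the toy example below scores three made-up model series against their row-wise median. It follows the code cell further down in summing absolute differences (rather than a strict Euclidean distance). The column names, the values, and the helper `mcc_score` are hypothetical and used for illustration only.

```python
import numpy as np
import pandas as pd

# Toy z-scored sentiment series for three hypothetical models (not real results)
toy_df = pd.DataFrame({
    'model_a_z': [0.0, 1.0, 2.0, 1.0],
    'model_b_z': [0.0, 1.0, 1.0, 1.0],
    'model_c_z': [2.0, 1.0, 0.0, 1.0],
})
model_cols = list(toy_df.columns)

# The row-wise median across models serves as the ensemble reference series
toy_df['median_z'] = toy_df[model_cols].median(axis=1)

def mcc_score(model_ser, median_ser):
    """Inverse of the mean absolute deviation from the median series."""
    area = np.sum(np.abs(model_ser - median_ser))
    mean_dev = area / len(model_ser)
    # A model identical to the median has zero deviation; report inf explicitly
    return np.inf if mean_dev == 0 else 1.0 / mean_dev

toy_mcc = {col: mcc_score(toy_df[col], toy_df['median_z']) for col in model_cols}
print(toy_mcc)
```

A model that coincides with the median gets an infinite score, which is why infinite values need handling before any later scaling step.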

In [None]:
# TODO

# Drop all 'vader_rstd' cols/models
# Add 'vader_sma_rzstd' cols/models
# ? add 'median_' col/model?
# Rank Models by MCC Metric per Text
# Rank Families by MFC Metric per Text
# Norm all Metrics
# Create 3x3x3 Grid of Metrics and guidelines to proceed for each situation

In [None]:
[x for x in corpus_texts_dt[corpus_titles_ls[0]].columns if 'vader' in x]

In [None]:
[x for x in corpus_texts_dt[corpus_titles_ls[0]].columns if '_rzstd' in x]

In [None]:
corpus_texts_dt[corpus_titles_ls[0]].drop(columns=['median_rzstd'], inplace=True)

In [None]:
for i, atext in enumerate(corpus_titles_ls):
  print(f'\n\nProcessing #{i}: {atext}')
  win_10per = int(0.10 * corpus_texts_dt[atext].shape[0])
  # Select standardized model columns, excluding the LOWESS-smoothed variants
  models_rzstd_ls = [x for x in corpus_texts_dt[atext].columns if (x.endswith('_rzstd') and '_smalowess_' not in x)]
  # Add Median if missing (row-wise median across models, one value per sentence)
  if 'median_rzstd' not in corpus_texts_dt[atext].columns:
    corpus_texts_dt[atext]['median_rzstd'] = corpus_texts_dt[atext][models_rzstd_ls].median(axis=1)
    models_rzstd_ls.append('median_rzstd')

  fig = plt.figure()
  ax = plt.subplot(111)

  for j, amodel_rzstd in enumerate(models_rzstd_ls):
    temp_np = corpus_texts_dt[atext][amodel_rzstd].rolling(win_10per, center=True, min_periods=0).mean()
    if amodel_rzstd == 'median_rzstd':
      _ = ax.plot(temp_np, label=amodel_rzstd, color='r', linewidth=3, alpha=1)
    else:
      _ = ax.plot(temp_np, label=amodel_rzstd, alpha=0.3)

  # Put a legend to the right of the current axis
  _ = ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))

  _ = ax.grid(True, alpha=0.3)
  _ = ax.set_title(f'{corpus_titles_dt[atext][0]}\nSentimentArc Ensemble of {len(ensemble_ls)} Models\nSmoothed: SMA (window=10%)\nClipped with IQR + zScore Standardized')
  plt.show();

In [None]:
models_smalowess_rzstd_ls = [x for x in corpus_texts_dt[corpus_titles_ls[0]].columns if x.endswith('_smalowess_rzstd')]
corpus_texts_dt[corpus_titles_ls[0]][models_smalowess_rzstd_ls].plot(alpha=0.3)
corpus_texts_dt[corpus_titles_ls[0]][models_smalowess_rzstd_ls].median(axis=1).plot(color='r', linewidth=5)
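
The ensemble median needs one value per sentence, so it has to be taken across model columns (`axis=1`). The sketch below, on random toy data with hypothetical column names, contrasts that with the default `axis=0` and applies the same centered 10% simple moving average used in the plots above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_sent = 200

# Three hypothetical standardized model series (random toy data)
toy_df = pd.DataFrame({f'model{k}_rzstd': rng.normal(size=n_sent) for k in range(3)})

# Window of 10% of the series length, as in the plotting loop
win_10per = int(0.10 * toy_df.shape[0])

# Centered simple moving average; min_periods=1 keeps the edges defined
sma_df = toy_df.rolling(win_10per, center=True, min_periods=1).mean()

# axis=1: one median per sentence (what an ensemble arc needs)
median_per_sentence = toy_df.median(axis=1)

# axis=0 (the default): one median per model, which is not a time series
median_per_model = toy_df.median()

print(len(median_per_sentence), len(median_per_model))
```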

In [None]:
# Calculate MCC Metric for each corpus:model combination

subdir_out = 'data_corpora_all'

median_model_area_ls = []
mcc_ls = []                   # Model-Corpus Compatibility (MCC) 

corpora_median_area_dt = {}
corpora_mcc_dt = {}

for i, acorpus in enumerate(corpora_ls):
  print(f'Processing Corpus #{i}: {acorpus}')

  # median_model_area_ls = []
  mcc_ls = []

  for j, amodel_z in enumerate(model_z_cols_ls):
    print(f'  with Model #{j}: {amodel_z}')

    median_model_area = np.sum(np.abs(corpora_all_dt[acorpus][amodel_z] - corpora_all_dt[acorpus]['median_z']))
    print(f'    Area between Median: {median_model_area}')

    mcc = 1 / (median_model_area / corpora_all_dt[acorpus].shape[0])

    # median_model_area_ls.append((amodel_z, median_model_area))
    mcc_ls.append((amodel_z, mcc)) 
    # print(f'      Growing list: {median_model_area_ls}')
    print(f'      Growing mcc_ls: {mcc_ls}')

  # median_model_area_sorted_ls = copy.deepcopy(median_model_area_ls) # .sort(key=lambda x:x[1]) #  # .sort(key=lambda x: float(x[1])) # .sort(key=lambda y: y[1]) # .sort(key=lambda y: y[1])
  # median_model_area_sorted_ls.sort(key=lambda x:x[1], reverse=True)
  # print(f'        Copying sorted list: {median_model_area_sorted_ls}')

  # Sort a copy by MCC, highest (most compatible) first
  mcc_sorted_ls = sorted(mcc_ls, key=lambda x: x[1], reverse=True)
  print(f'        Copying sorted list: {mcc_sorted_ls}')


  # corpora_median_area_dt[acorpus] = copy.deepcopy(median_model_area_sorted_ls)

  corpora_mcc_dt[acorpus] = copy.deepcopy(mcc_sorted_ls)

  # corpora_all_dt[acorpus][model_z_cols_ls].head(2)


In [None]:
np.sum(np.abs(corpus_texts_dt[corpus_titles_ls[0]]['vader_z'] - corpus_texts_dt[corpus_titles_ls[0]]['vader_z']))

#### MCC Ranked Models for Each Corpus

In [None]:
print(temp_maxmcc_df.model_z.to_list())

In [None]:

plt.rcParams["figure.figsize"] = (10, 8)

save_plot = False

maxmc_dt = {}

for acorpus in corpora_ls:

  temp_df = corpora_mcc_df[corpora_mcc_df['corpus']==acorpus]
  temp_df = temp_df.sort_values('mcc', ascending=False)
  temp_df = temp_df[temp_df.model_z != 'median_z']
  # print(f'temp_df: {temp_df.shape}')
  # if (acorpus == 'cdickens_achristmascarol'):
  temp_maxmcc_df = temp_df[['model_z','mcc']].reset_index(drop=True)
  maxmc_dt[acorpus] = temp_maxmcc_df.copy(deep=True)
  # print(f'For Corpus: {acorpus}:\n\n  {maxmc_dt[acorpus]}')
  # model_ls = temp_df['model_z']
  # mcc_ls = temp_df['mcc']
  # merged_list = tuple(zip(model_ls, mcc_ls)) 
  # print(f'merged_list: {merged_list}')

  # temp_maxmcc_df.plot()
  # plt.title(f'{acorpus.upper()} MCC Ranked Models')
  # plt.legend('off')
  # plt.xticks('model_z')

  # fig, ax = plt.subplots(1, 1)
  # fig = plt.figure()
  # ax = fig.add_subplot(111)

  ax = temp_maxmcc_df.plot(label=acorpus, linewidth=3)
  # ax = corpora_all_dt[acorpus]['median_z'].rolling(win10per, center=True, min_periods=1).mean().plot(label='z-Score Median', style=['r'], linewidth=3, alpha=0.9)

  ax.grid(True)
  ax.set_title(f'Model Rank by MCC: {acorpus}', fontsize=18)
  # ax.set(xlabel='Decade', ylabel='Weighted Percent of Top Songs', fontsize=10)
  # ax.set_xlabel('Line Number', fontsize=14)
  ax.set_ylabel('MCC Score', fontsize=14)
  # ax.set_xticks(df.Date.values)
  # xticks_ls = temp_maxmcc_df.model_z.to_list()
  # ax.set_xticklabels(temp_maxmcc_df.model_z.values, size=8, rotation=90)
  ax.set_xticks(range(len(temp_maxmcc_df)))  # one tick per ranked model
  ax.set_xticklabels(temp_maxmcc_df.model_z, rotation=90) # , size=8, rotation=90)
  # ax.set_xticklabels(xticks_ls, size=6, rotation=90)
  if ax.get_legend() is not None:
    ax.get_legend().remove()  # ax.legend('off') draws a legend instead of hiding it

  if save_plot:
    filename_plt = f'./{subdir_name}/plt_metric_mcc_ranked_{acorpus}.png'
    plt.savefig(filename_plt)
    print(f'Saved plot to filepath: {filename_plt}\n\n')

  plt.show();



In [None]:
print(list(corpora_mcc_df.groupby(['corpus'])))

In [None]:
len(corpora_mcc_dt.keys())

In [None]:
len(list(corpora_mcc_dt.values())[0])

In [None]:
len(list(corpora_mcc_dt.items()))

In [None]:
# Gather the Error/Area between Model zScore and Median zScore for all Corpora in one DataFrame

# corpora_median_area_df = pd.DataFrame()
corpora_mcc_df = pd.DataFrame()

first_loop_fl = True

# for acorpus, area_tup_ls in corpora_median_area_dt.items():
for acorpus, model_mccls_tup_ls in corpora_mcc_dt.items():
  
  print(f'\nCorpus: {acorpus}\n') #   {area_tup_ls}')
  print(f'\nlen(model_mccls_tup_ls): {len(model_mccls_tup_ls)}\n') #   {area_tup_ls}')

  # areas_ls = [i[1] for i in area_tup_ls]
  mcc_ls = [i[1] for i in model_mccls_tup_ls]
  models_ls = [i[0] for i in model_mccls_tup_ls]

  # print(f'  areas_ls: {areas_ls}')
  print(f'  len(models_ls): {len(models_ls)}')
  print(f'  len(mcc_ls): {len(mcc_ls)}')

  # temp_df = pd.DataFrame({'model_z' : models_ls,'area_z' : areas_ls})
  temp_df = pd.DataFrame({'model_z' : models_ls,'mcc' : mcc_ls})
  print(f'    temp_df.shape(): {temp_df.shape}')
  temp_len = temp_df.shape[0]
  # temp_df['area_z_norm'] = temp_df['area_z']/corpora_all_dt[acorpus].shape[0]
  model_col = [acorpus] * temp_len
  first_ser = pd.Series(model_col)
  temp_df = pd.concat([first_ser, temp_df], axis=1)
  temp_df.rename(columns={0:'corpus'}, inplace=True)
  print(f'     temp_df.shape() after horizontal concat(): {temp_df.shape}')
  
  temp_df.head()

  if first_loop_fl:
    print(f'  Adding {acorpus} as first DataFrame')
    # corpora_model_area_df = temp_df.copy(deep=True)
    corpora_mcc_df = temp_df.copy(deep=True)
    first_loop_fl = False
  else:
    # temp_copy_df = temp_df.copy(deep=True)
    print(f'  Adding {acorpus} as successive DataFrame')
    # corpora_model_area_df = pd.concat([corpora_model_area_df, temp_df], axis=0)
    corpora_mcc_df = pd.concat([corpora_mcc_df, temp_df], axis=0)

  # pd.DataFrame(model_col, columns=['corpus'])], axis=1, join='inner')

  # lst = pd.Series([0.25,1.24865,2.541,3.1,4.4582]) # <-converted to series
  # pd.concat([pd.Series(lst), df], axis=1)

  temp_df.head()
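
The loop above builds the long-format table one corpus at a time with repeated `pd.concat` calls. As an alternative sketch, the same (corpus, model, MCC) rows can be flattened in a single comprehension. `toy_mcc_dt` is a hypothetical stand-in for `corpora_mcc_dt` with made-up values.

```python
import pandas as pd

# Hypothetical stand-in for corpora_mcc_dt: corpus -> list of (model, mcc) tuples
toy_mcc_dt = {
    'corpus_a': [('m1_z', 1.5), ('m2_z', 2.0)],
    'corpus_b': [('m1_z', 0.5), ('m2_z', 3.0)],
}

# Flatten the nested structure into one record per (corpus, model) pair
records = [
    (acorpus, amodel, mcc)
    for acorpus, model_mcc_ls in toy_mcc_dt.items()
    for amodel, mcc in model_mcc_ls
]

toy_mcc_df = pd.DataFrame(records, columns=['corpus', 'model_z', 'mcc'])
print(toy_mcc_df)
```

Built this way, the frame has a unique 0..n-1 index rather than an index that restarts for every corpus.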

In [None]:
# corpora_median_area_df.head()

corpora_mcc_df.head()

In [None]:
# should be 875

corpora_mcc_df.shape

In [None]:
# corpora_model_area_df.tail()
corpora_mcc_df.tail()

In [None]:
corpora_mcc_df[(corpora_mcc_df['corpus']=='vwoolf_tothelighthouse') & (corpora_mcc_df['model_z']=='xgb_z')]

In [None]:
corpora_ls

In [None]:
model_z_cols_ls

In [None]:
# should be 875

corpora_mcc_df.shape

In [None]:
# Save MCC for all Models

subdir_out = 'data_corpora_all'
filename_out = f'models_mcc.csv'

fullpath_out = f'./{subdir_out}/{filename_out}'

print(f'\nSaving MCC in file: {fullpath_out}')
corpora_mcc_df.to_csv(fullpath_out)
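
By default `to_csv` also writes the DataFrame index, and a frame assembled with `pd.concat(..., axis=0)` carries an index that repeats for each corpus. The sketch below uses an in-memory buffer and made-up rows to show what comes back from `read_csv` with and without the index.

```python
import io
import pandas as pd

# Two hypothetical per-corpus frames stacked vertically; the index repeats 0, 1, 0, 1
toy_df = pd.concat([
    pd.DataFrame({'corpus': ['corpus_a'] * 2, 'model_z': ['m1_z', 'm2_z'], 'mcc': [1.5, 2.0]}),
    pd.DataFrame({'corpus': ['corpus_b'] * 2, 'model_z': ['m1_z', 'm2_z'], 'mcc': [0.5, 3.0]}),
], axis=0)

# Default: the index is written as an unnamed first column
buf_with_index = io.StringIO()
toy_df.to_csv(buf_with_index)
buf_with_index.seek(0)
with_index_df = pd.read_csv(buf_with_index)

# index=False: only the data columns are written
buf_clean = io.StringIO()
toy_df.to_csv(buf_clean, index=False)
buf_clean.seek(0)
clean_df = pd.read_csv(buf_clean)

print(list(with_index_df.columns))
print(list(clean_df.columns))
```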

#### MCC Statistics

In [None]:
corpora_mcc_df.head()

In [None]:
# corpora_ls = list(corpora_mcc_df.corpus.unique())
corpora_ls
print('\n')
len(corpora_ls)
print('\n')
type(corpora_ls)

In [None]:
type(corpora_ls)

In [None]:
corpora_mcc_df.head()

In [None]:
temp_df = pd.DataFrame(corpora_mcc_df.groupby('model_z'))
temp_df.head()

In [None]:
# Standardize (MinMax) MCC Metrics

corpora_mcc_minmax_df = corpora_mcc_df.pivot_table(index='corpus', columns='model_z', values='mcc').T

# Replacing infinite with nan
corpora_mcc_minmax_df.replace([np.inf, -np.inf], np.nan, inplace=True)
  
# Dropping all the rows with nan values
corpora_mcc_minmax_df.dropna(inplace=True)

corpora_mcc_minmax_df



In [None]:
model_ls = list(corpora_mcc_minmax_df.index)

In [None]:
# corpora_mcc_minmax_df.groupby('model_z').min()

# from scipy.stats import zscore
# corpora_mcc_std_df.apply(zscore)

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

numeric_cols = corpora_mcc_minmax_df.select_dtypes(include=[np.number]).columns
# model_col = list(corpora_mcc_minmax_df.index)

corpora_mcc_minmax_df = pd.DataFrame(scaler.fit_transform(corpora_mcc_minmax_df), columns=numeric_cols)
corpora_mcc_minmax_df.index = pd.Series(model_ls)
corpora_mcc_minmax_df
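
`MinMaxScaler` rescales each column independently to the range [0, 1], so with models as rows and corpora as columns the scaling is per corpus. The sketch below reproduces that arithmetic in plain pandas on a made-up table; the model and corpus names are hypothetical.

```python
import pandas as pd

# Hypothetical MCC table: rows are models, columns are corpora
toy_mcc_df = pd.DataFrame(
    {'corpus_a': [1.0, 2.0, 3.0], 'corpus_b': [10.0, 30.0, 20.0]},
    index=['model_x_z', 'model_y_z', 'model_w_z'],
)

# Column-wise min-max: (value - column min) / (column max - column min)
col_min = toy_mcc_df.min(axis=0)
col_max = toy_mcc_df.max(axis=0)
toy_minmax_df = (toy_mcc_df - col_min) / (col_max - col_min)

print(toy_minmax_df)
```

Within each corpus the best model maps to 1 and the worst to 0, so the scaled values are comparable across corpora only in this relative sense.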


In [None]:
corpora_mcc_rank_dt = {}

for acorpus in corpora_ls:
  corpora_mcc_rank_dt[acorpus] = corpora_mcc_minmax_df[acorpus].rank(ascending=False)

corpora_mcc_rank_dt.keys()

corpora_mcc_rank_df = pd.DataFrame(corpora_mcc_rank_dt)
corpora_mcc_rank_df.head()
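
With `rank(ascending=False)` the largest scaled MCC receives rank 1, and tied values share the average of the ranks they span (the pandas default, `method='average'`). A small sketch on made-up values:

```python
import pandas as pd

# Hypothetical scaled MCC values for four models within one corpus
toy_scaled_ser = pd.Series([0.0, 1.0, 0.5, 0.5], index=['m1_z', 'm2_z', 'm3_z', 'm4_z'])

# Highest value gets rank 1; the two tied models share rank (2 + 3) / 2
toy_rank_ser = toy_scaled_ser.rank(ascending=False)

print(toy_rank_ser)
```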

In [None]:
corpora_mcc_rank_df['cdickens_achristmascarol'].sort_values()

In [None]:
corpora_mcc_rank_df.T.describe()

In [None]:
# https://lost-stats.github.io/Presentation/Figures/line_graph_with_labels_at_the_beginning_or_end.html 

"""

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import matplotlib.dates as mdates

plt.rcParams['figure.figsize'] = 20,10

# Read in the data
df = pd.read_csv('https://raw.githubusercontent.com/LOST-STATS/LOST-STATS.github.io/master/Presentation/Figures/Data/Line_Graph_with_Labels_at_the_Beginning_or_End_of_Lines/Research_Nobel_Google_Trends.csv', parse_dates=['date'])
# df = corpora_mcc_rank_df.T

# Create the column we wish to plot
title = 'Log of Google Trends Index'
df[title] = np.log(df['hits'])

# Set a style for the plot
plt.style.use('ggplot')
plt.style.use('default')

# Make a plot
fig, ax = plt.subplots()

# Add lines to it
sns.lineplot(ax=ax, data=df, x="corpus", y=title, hue="name", legend=None)

# Add the text--for each line, find the end, annotate it with a label, and
# adjust the chart axes so that everything fits on.
for line, name in zip(ax.lines, df.columns.tolist()):
	y = line.get_ydata()[-1]
	x = line.get_xdata()[-1]
	if not np.isfinite(y):
	    y=next(reversed(line.get_ydata()[~line.get_ydata().mask]),float("nan"))
	if not np.isfinite(y) or not np.isfinite(x):
	    continue     
	text = ax.annotate(name,
		       xy=(x, y),
		       xytext=(0, 0),
		       color=line.get_color(),
		       xycoords=(ax.get_xaxis_transform(),
				 ax.get_yaxis_transform()),
		       textcoords="offset points")
	text_width = (text.get_window_extent(
	fig.canvas.get_renderer()).transformed(ax.transData.inverted()).width)
	if np.isfinite(text_width):
		ax.set_xlim(ax.get_xlim()[0], text.xy[0] + text_width * 1.05)

# Format the date axis to be prettier.
ax.xaxis.set_major_formatter(mdates.DateFormatter('%b-%d'))
ax.xaxis.set_minor_locator(mdates.DayLocator())
ax.xaxis.set_major_locator(mdates.AutoDateLocator(interval_multiples=False))
plt.tight_layout()
plt.show()

""";

In [None]:
plt.style.use('default')

In [None]:
corpora_mcc_rank_df.T.head()
temp_df.head(5)

In [None]:
temp_df['corpus'] = temp_df.index
temp_df.head()

In [None]:
temp_df = corpora_mcc_rank_df.T
temp_cols = temp_df.columns
# Make a plot
fig, ax = plt.subplots()
sns.lineplot(ax=ax, data=temp_df, x=temp_df.index, y=temp_df.flair_z ,palette='Accent', linewidth=1, alpha=0.9)
plt.show()

In [None]:
temp_models_ls = temp_df.columns.to_list()
temp_models_ls = [x for x in temp_models_ls if x.endswith('_z')]
temp_models_ls

In [None]:
temp_df.head()

In [None]:
temp_df['corpus'] = temp_df.index
temp_df.head()

In [None]:
# pd.melt(df, id_vars='date', value_vars=['AA', 'BB', 'CC'])

# Melt every model column, keeping 'corpus' as the identifier column
temp_models_ls = [x for x in temp_df.columns if x != 'corpus']

temp_tall_df = pd.melt(temp_df, id_vars='corpus', value_vars=temp_models_ls)
temp_tall_df.head()

In [None]:
temp_df.melt(id_vars='corpus')
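
A minimal sketch of the wide-to-long reshape used here, on a made-up rank table with corpora as rows and models as columns. The names passed to `var_name` and `value_name` are illustrative choices, not names used elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical rank table: one row per corpus, one column per model
toy_wide_df = pd.DataFrame(
    {'model_x_z': [1.0, 2.0], 'model_y_z': [2.0, 1.0]},
    index=['corpus_a', 'corpus_b'],
)

# melt needs the identifier as a regular column, not as the index
toy_wide_df['corpus'] = toy_wide_df.index

# One row per (corpus, model) pair
toy_tall_df = toy_wide_df.melt(id_vars='corpus', var_name='model_z', value_name='rank')

print(toy_tall_df)
```

The long format is what `sns.lineplot` expects when `x`, `y`, and `hue` are given as column names.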

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import matplotlib.dates as mdates

# Reshape the rank table to long format: one row per (corpus, model) pair
df = temp_df.melt(id_vars='corpus')

# Set a style for the plot
# plt.style.use('ggplot')
plt.style.use('default')
plt.rcParams['figure.figsize'] = 10,8
# plt.grid(True)

# Make a plot
fig, ax = plt.subplots()

# Add lines to it
# sns.lineplot(ax=ax, data=df, x="date", y=title, hue="name", legend=None)
# sns.lineplot(ax=ax, data=df, x="date", y=title, hue="name", legend=None)
sns.lineplot(ax=ax, data=df, x='value', y='variable', hue='corpus', legend=None) # palette='Accent', linewidth=1, alpha=0.9)

# Add the text--for each line, find the end, annotate it with a label, and
# adjust the chart axes so that everything fits on.
# for line, name in zip(ax.lines, df.columns.tolist()):
for line, name in zip(ax.lines, df.columns.tolist()):
	y = line.get_ydata()[-1]
	x = line.get_xdata()[-1]
	if not np.isfinite(y):
	    y=next(reversed(line.get_ydata()[~line.get_ydata().mask]),float("nan"))
	if not np.isfinite(y) or not np.isfinite(x):
	    continue     
	text = ax.annotate(name,
		       xy=(x, y),
		       xytext=(0, 0),
		       color=line.get_color(),
		       xycoords=(ax.get_xaxis_transform(),
				 ax.get_yaxis_transform()),
		       textcoords="offset points")
	text_width = (text.get_window_extent(
	fig.canvas.get_renderer()).transformed(ax.transData.inverted()).width)
	if np.isfinite(text_width):
		ax.set_xlim(ax.get_xlim()[0], text.xy[0] + text_width * 1.05)

# Format the date axis to be prettier.
# ax.xaxis.set_major_formatter(mdates.DateFormatter('%b-%d'))
# ax.xaxis.set_minor_locator(mdates.DayLocator())
# ax.xaxis.set_major_locator(mdates.AutoDateLocator(interval_multiples=False))
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()



In [None]:
"""

seaborn palettes:

'Accent', 'Accent_r', 'Blues', 'Blues_r', 'BrBG', 'BrBG_r', 'BuGn', 'BuGn_r', 'BuPu', 'BuPu_r', 'CMRmap', 'CMRmap_r', 'Dark2', 'Dark2_r', 
'GnBu', 'GnBu_r', 'Greens', 'Greens_r', 'Greys', 'Greys_r', 'OrRd', 'OrRd_r', 'Oranges', 'Oranges_r', 'PRGn', 'PRGn_r', 'Paired', 
'Paired_r', 'Pastel1', 'Pastel1_r', 'Pastel2', 'Pastel2_r', 'PiYG', 'PiYG_r', 'PuBu', 'PuBuGn', 'PuBuGn_r', 'PuBu_r', 'PuOr', 
'PuOr_r', 'PuRd', 'PuRd_r', 'Purples', 'Purples_r', 'RdBu', 'RdBu_r', 'RdGy', 'RdGy_r', 'RdPu', 'RdPu_r', 'RdYlBu', 'RdYlBu_r', 
'RdYlGn', 'RdYlGn_r', 'Reds', 'Reds_r', 'Set1', 'Set1_r', 'Set2', 'Set2_r', 'Set3', 'Set3_r', 'Spectral', 'Spectral_r', 'Wistia', 
'Wistia_r', 'YlGn', 'YlGnBu', 'YlGnBu_r', 'YlGn_r', 'YlOrBr', 'YlOrBr_r', 'YlOrRd', 'YlOrRd_r', 'afmhot', 'afmhot_r', 'autumn', 
'autumn_r', 'binary', 'binary_r', 'bone', 'bone_r', 'brg', 'brg_r', 'bwr', 'bwr_r', 'cividis', 'cividis_r', 'cool', 'cool_r', 'coolwarm', 
'coolwarm_r', 'copper', 'copper_r', 'crest', 'crest_r', 'cubehelix', 'cubehelix_r', 'flag', 'flag_r', 'flare', 'flare_r', 'gist_earth', 
'gist_earth_r', 'gist_gray', 'gist_gray_r', 'gist_heat', 'gist_heat_r', 'gist_ncar', 'gist_ncar_r', 'gist_rainbow', 'gist_rainbow_r', 
'gist_stern', 'gist_stern_r', 'gist_yarg', 'gist_yarg_r', 'gnuplot', 'gnuplot2', 'gnuplot2_r', 'gnuplot_r', 'gray', 'gray_r', 'hot', 
'hot_r', 'hsv', 'hsv_r', 'icefire', 'icefire_r', 'inferno', 'inferno_r', 'jet', 'jet_r'

""";

In [None]:
plt.rcParams['figure.figsize'] = 12,14

save_plot = True

subdir_name = 'data_corpora_plots'

# plt.figure(facecolor='white')

ax = sns.lineplot(data=corpora_mcc_rank_df.T, palette='Accent', linewidth=3) # , alpha=0.5)
ax.grid(True, alpha=0.3) # True, alpha=0.3)
ax.set_title('Model MCC Rank Across Corpora', fontsize=20)
# ax.set(xlabel='Decade', ylabel='Weighted Percent of Top Songs', fontsize=10)
# ax.set_xlabel('Decade', fontsize=20)
ax.set_ylabel('MCC Rank', fontsize=15)
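ax.set_xticks(range(len(corpora_mcc_rank_df.columns)))  # one tick per corpus
ax.set_xticklabels(corpora_mcc_rank_df.columns, size=12, rotation=90)
# ax.legend(bbox_to_anchor=(1.05, 1), loc='upper left', borderaxespad=0)
# 'top' is not a valid legend location; anchor the upper-left corner outside the axes
leg = ax.legend(fontsize=10, title='Model', title_fontsize='14', bbox_to_anchor=(1.05, 1), loc='upper left')
for line in leg.get_lines():
  line.set_linewidth(3.0)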

if save_plot:
  filename_plt = f'./{subdir_name}/plt_metric_mcc_line_rank.png'
  plt.savefig(filename_plt)
  print(f'Saved plot to filepath: {filename_plt}\n\n')

plt.show();

In [None]:
plt.style.use('default')

In [None]:
plt.rcParams['figure.figsize'] = 12,8

ax = corpora_mcc_rank_df.T.plot()
ax.grid(True, alpha=0.3)
ax.set_title('Model MCC Rank Across Corpora (Starting at Bottom)', fontsize=20)
# ax.set(xlabel='Decade', ylabel='Weighted Percent of Top Songs', fontsize=10)
# ax.set_xlabel('Decade', fontsize=20)
ax.set_ylabel('Rank', fontsize=15)
ax.set_xticks(range(len(corpora_mcc_rank_df.columns)))  # one tick per corpus
ax.set_xticklabels(corpora_mcc_rank_df.columns, size=10, rotation=90)
# ax.legend(bbox_to_anchor=(1.05, 1), loc='upper left', borderaxespad=0)
ax.legend(fontsize=10, title='Model', title_fontsize='14', bbox_to_anchor=(1.05, 1), loc='upper left');

In [None]:
ax = corpora_mcc_rank_df.T.plot()
ax.legend(loc='best')

In [None]:
corpora_mcc_rank_df.T.describe().loc['mean']

In [None]:
corpora_mcc_rank_df.T.head()

In [None]:
describe_num_df = corpora_mcc_rank_df.T.describe(include=['int64','float64'])
describe_num_df.head(2)

In [None]:
describe_num_df.rename(columns={'index':'stat'}, inplace=True)
describe_num_df.head(2)

In [None]:
# describe_num_df = describe_num_df.set_index('stat')
# describe_num_df.head(2)

In [None]:
describe_num_T_df = describe_num_df.T
describe_num_T_df

In [None]:
describe_num_df.columns

In [None]:
# Boxplot of Model MCC Statistics

plt.rcParams["figure.figsize"] = (10,10)

save_plot = True
subdir_name = 'data_corpora_plots'

describe_num_sorted_df = describe_num_df.T.sort_values(['mean'])
# describe_num_sorted_df.T.boxplot()
# plt.xticks(rotation=90)

ax = describe_num_sorted_df.T.boxplot()
ax.grid(True, alpha=0.3)
ax.set_title('Sorted Model MCC Statistics', fontsize=20)
# ax.set(xlabel='Decade', ylabel='Weighted Percent of Top Songs', fontsize=10)
# ax.set_xlabel('Decade', fontsize=20)
ax.set_ylabel('MCC Metric', fontsize=15)
ax.set_xticklabels(describe_num_sorted_df.T.columns, size=10, rotation=90)
# ax.legend(bbox_to_anchor=(1.05, 1), loc='upper left', borderaxespad=0)
# ax.legend(fontsize=10, title='Model', title_fontsize='14', bbox_to_anchor=(1.05, 1), loc='upper left');

if save_plot:
  filename_plt = f'./{subdir_name}/plt_metric_mcc_box_stats.png'
  plt.savefig(filename_plt)
  print(f'Saved plot to filepath: {filename_plt}\n\n')

plt.show();

In [None]:

save_plot = False


describe_num_sorted_df = describe_num_df.T.sort_values(['mean'])
# describe_num_sorted_df.T.boxplot()
# plt.xticks(rotation=90)

ax = describe_num_sorted_df.T.boxplot()
plt.xticks(rotation=90)

ax = corpora_mcc_rank_df.T.plot()
ax.grid(True, alpha=0.3)
ax.set_title('Model MCC Rank Across Corpora (Starting at Bottom)', fontsize=20)
# ax.set(xlabel='Decade', ylabel='Weighted Percent of Top Songs', fontsize=10)
# ax.set_xlabel('Decade', fontsize=20)
ax.set_ylabel('Rank', fontsize=15)
ax.set_xticks(range(len(corpora_mcc_rank_df.columns)))  # one tick per corpus
ax.set_xticklabels(corpora_mcc_rank_df.columns, size=10, rotation=90)
# ax.legend(bbox_to_anchor=(1.05, 1), loc='upper left', borderaxespad=0)
ax.legend(fontsize=10, title='Model', title_fontsize='14', bbox_to_anchor=(1.05, 1), loc='upper left');


In [None]:
describe_num_df.head(2)

In [None]:
corpora_mcc_rank_df.head()

In [None]:
describe_num_df.head()

In [None]:
describe_num_T_df

In [None]:
describe_num_df.head()

In [None]:
describe_num_df.T.head()

In [None]:
describe_num_df = describe_num_df.sort_values('mean')
describe_num_df.head()

In [None]:
describe_num_T_df = describe_num_df.describe(include=['int64','float64'])
describe_num_T_df


In [None]:
describe_num_df = describe_num_df.sort_values('mean')

describe_num_T_df = describe_num_df.describe(include=['int64','float64'])
describe_num_T_df.head(2)

# describe_num_df = describe_num_df.sort_values(['mean','std'])
# describe_num_T_df.sort_values('mean')
describe_num_T_df.T.boxplot()
plt.xticks(rotation=90)

In [None]:
# https://medium.com/analytics-vidhya/how-to-visualize-pandas-descriptive-statistics-functions-480c3f2ea87c

describe_num_df = corpora_mcc_rank_df.T.describe(include=['int64','float64'])

# describe_num_df.reset_index(inplace=True)

# Make a plot
# fig, ax = plt.subplots(5,5)
# fig, axes = plt.subplots(5,5, figsize=(15,10))

# Add lines to it
# sns.lineplot(ax=ax, data=df, x="corpus", y=title, hue="name", legend=None)


sns.boxplot(data=describe_num_df)

"""
# To remove any variable from plot
# describe_num_df = describe_num_df[describe_num_df['index'] != 'count']
num_col = describe_num_df.columns
for i,acol in enumerate(num_col):
  if acol in ['index','count']:
    continue  
  ax_row = i // 5
  ax_col = i % 5
  axes[ax_row, ax_col] = sns.factorplot(x='index', y=acol, data=describe_num_df.reset_index())  
  plt.ylabel('Numeric Value')
  plt.xlabel('Statistic')
  plt.title(f'MCC Descriptive Statistics\n{acol}')
  # sns.regplot(data = df_comments.reset_index(), x = 'index', y = 'score')
  plt.show();
""";

In [None]:
# https://medium.com/analytics-vidhya/how-to-visualize-pandas-descriptive-statistics-functions-480c3f2ea87c

describe_num_df = corpora_mcc_rank_df.T.describe(include=['int64','float64'])

# Move the statistic names (count, mean, std, ...) into a regular column named 'index'
stats_df = describe_num_df.reset_index()

# To remove any variable from plot
# stats_df = stats_df[stats_df['index'] != 'count']
num_col = [acol for acol in stats_df.columns if acol != 'index']

# Size the grid from the number of models so no panel index runs out of bounds
n_cols = 5
n_rows = int(np.ceil(len(num_col) / n_cols))
fig, axes = plt.subplots(n_rows, n_cols, figsize=(15, 2 * n_rows), squeeze=False)

for i, acol in enumerate(num_col):
  ax = axes[i // n_cols, i % n_cols]
  # sns.factorplot is no longer available in seaborn; pointplot draws onto a given Axes
  sns.pointplot(x='index', y=acol, data=stats_df, ax=ax)
  ax.set_ylabel('Numeric Value')
  ax.set_xlabel('Statistic')
  ax.set_title(f'MCC Descriptive Statistics\n{acol}')
  # sns.regplot(data = df_comments.reset_index(), x = 'index', y = 'score')

plt.tight_layout()
plt.show();

In [None]:
corrmat = corpora_mcc_rank_df.T.corr()
# print(corrmat)
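
`DataFrame.corr()` with no arguments computes Pearson correlation between columns. Since the choice between Pearson and Spearman is still open for these heatmaps, the sketch below compares the two on random toy data (hypothetical column names) and checks that Spearman is the same as Pearson applied to column-wise ranks.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Hypothetical scores: 25 rows (corpora) by 4 columns (models), random toy data
toy_scores_df = pd.DataFrame(rng.normal(size=(25, 4)), columns=['m1_z', 'm2_z', 'm3_z', 'm4_z'])

# Rank each column separately, then correlate the ranks with Pearson
toy_ranks_df = toy_scores_df.rank()
pearson_on_ranks = toy_ranks_df.corr(method='pearson')

# Spearman on the raw scores should give the same matrix
spearman_on_scores = toy_scores_df.corr(method='spearman')

# Plain Pearson on the raw scores generally differs
pearson_on_scores = toy_scores_df.corr()

print(np.allclose(pearson_on_ranks, spearman_on_scores))
```

Note that the ranks in `corpora_mcc_rank_df` are taken across models within each corpus, not within each model column, so the default `corr()` call above is a Pearson correlation of those rank values rather than a Spearman correlation.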

In [None]:
plt.rcParams['figure.figsize'] = (20,20)

ax = sns.heatmap(corrmat, vmax=1, annot=True, linewidths=.5)

# ax.grid(True, alpha=0.3)
ax.set_title('Model MCC Correlation Matrix', fontsize=20)
# ax.set(xlabel='Decade', ylabel='Weighted Percent of Top Songs', fontsize=10)
# ax.set_xlabel('Decade', fontsize=20)
# ax.set_ylabel('Rank', fontsize=15)
# ax.set_xticklabels(corpora_mcc_rank_df.columns, size=10, rotation=40)
# ax.legend(bbox_to_anchor=(1.05, 1), loc='upper left', borderaxespad=0)
# ax.legend(fontsize=10, title='Model', title_fontsize='14', bbox_to_anchor=(1.05, 1), loc='upper left');

In [None]:
plt.figure(figsize=(30, 30))

sns.heatmap(corrmat, vmax=1, annot=True, linewidths=.5)
plt.xticks(rotation=30) # , horizontalalignment='right')
plt.title('Model MCC Correlation Matrix')
plt.show();

In [None]:
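temp_df = pd.DataFrame()

for acorpus in corpora_ls:
  # The original comparison was left incomplete; filtering rows by the loop's
  # corpus is an assumption based on the print statement below
  temp_df = corpora_mcc_df[corpora_mcc_df['corpus'] == acorpus]
  print(f'{acorpus}: \n\ntemp_df: {temp_df.head()}')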

In [None]:
# Average only the numeric 'mcc' column; 'corpus' holds strings
corpora_mcc_df.groupby('model_z')['mcc'].mean().sort_values()

#### MCC Ranking of Models for each Corpus

In [None]:
corpora_mcc_df.columns

In [None]:
# Plot Rank of Model Area from Median per Corpus

save_plot = True

plt.rcParams["figure.figsize"] = (8, 8)

subdir_name = 'data_corpora_plots'

models_all_order_dt = {}
for acorpus in corpora_ls:
  if acorpus != 'median_z':
    models_all_order_dt[acorpus] = []
  # for amodel_z in model_z_cols_ls:
  #   models_all_order_dt[acorpus][amodel_z] = []

temp_df = pd.DataFrame()

for i, acorpus in enumerate(corpora_ls):
  print(f'Plotting Model-Median Area for Corpus #{i}: {acorpus}')
  # temp_df = corpora_model_area_df[corpora_model_area_df.corpus == acorpus]
  temp_df = corpora_mcc_df[corpora_mcc_df.corpus == acorpus]

  # Drop the 'median_z' row
  temp_df = temp_df[temp_df.model_z != 'median_z']

  # Sort by MCC, best first (reassign rather than sort a filtered slice in place)
  temp_df = temp_df.sort_values(by=['mcc'], ascending=False)
  # temp_df.head(2) # area_z.plot(kind='bar')
  # temp_df.area_z.plot(kind='bar' x=model_z, y=area_z)

  # Store order for each Model. The list must stay in MCC order because the
  # position in this list is used as the model's rank in the next section;
  # sorting it alphabetically would give every model the same rank in every corpus
  models_order_ls = temp_df.model_z.to_list()
  # for j, amodel_ord in enumerate(models_order_ls):
  #   models_all_order_dt[acorpus][amodel_ord].append(str(j)) 
  models_all_order_dt[acorpus] = models_order_ls

  #ax = temp_df.plot.bar(x='model_z', y='area_z', rot=90)

  plt.barh('model_z', 'mcc', data=temp_df)
  # plt.xticks(fontsize=20) # , rotation=0)
  # plt.yticks(fontsize=20) # , rotation=0)
  plt.rcParams.update({'font.size': 8})
  plt.title(f'{corpora_full_dt[acorpus]}\n Model-Corpus Compatibility (MCC) Metric', pad=20, fontdict={'fontsize':10})

  if save_plot:
    subdir_name = 'data_corpora_plots'
    filename_plt = f'./{subdir_name}/plt_mcc_rank_{acorpus}.png'
    plt.savefig(filename_plt)

  plt.show()
  plt.close();

In [None]:
print(models_all_order_dt['cdickens_achristmascarol'])

#### Plot Rank and Spread by Model over the Corpora

In [None]:
# Create a Dict (Models) of Lists (Ranks)

models_all_rank_dt = {}

models_z_rank_ls = models_all_order_dt['cdickens_achristmascarol']
for amodel_z in models_z_rank_ls:
  models_all_rank_dt[amodel_z] = []

for key, values in models_all_order_dt.items():
  print(f'Corpus: {key}')
  for i, amodel in enumerate(values):
    print(f' Model #{i}: {amodel}')
    models_all_rank_dt[amodel].append(i)
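
The loop above turns each corpus's ordered model list into per-model rank positions. A minimal sketch on two made-up orderings follows; the corpus and model names are hypothetical. The result is only informative when each list is ordered best to worst by MCC, since an alphabetically sorted list would put every model at the same position in every corpus.

```python
import numpy as np

# Hypothetical best-to-worst model orderings for two corpora
toy_order_dt = {
    'corpus_a': ['m2_z', 'm1_z', 'm3_z'],
    'corpus_b': ['m1_z', 'm2_z', 'm3_z'],
}

# One empty rank list per model, keyed from any one corpus's model list
toy_rank_dt = {amodel: [] for amodel in toy_order_dt['corpus_a']}

# Position in the ordered list is the (zero-based) rank in that corpus
for acorpus, ordered_models in toy_order_dt.items():
    for rank_pos, amodel in enumerate(ordered_models):
        toy_rank_dt[amodel].append(rank_pos)

toy_mean_rank = {amodel: float(np.mean(ranks)) for amodel, ranks in toy_rank_dt.items()}
toy_std_rank = {amodel: float(np.std(ranks)) for amodel, ranks in toy_rank_dt.items()}

print(toy_rank_dt)
print(toy_mean_rank)
```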

In [None]:
# models_all_rank_dt.keys()

models_z_rank_ls = models_all_order_dt['cdickens_achristmascarol']
models_mean_ls = []
models_std_ls = []
models_max_ls = []
models_min_ls = []
models_ranks_ls_ls = []

for i, amodel_z in enumerate(models_z_rank_ls):
  print(f'For Model: {amodel_z}:')
  model_mean = np.mean(models_all_rank_dt[amodel_z])
  print(f'  Mean: {model_mean}')
  models_mean_ls.append(model_mean)

  model_std = np.std(models_all_rank_dt[amodel_z])
  print(f'  STD: {model_std}')
  models_std_ls.append(model_std)

  model_min = np.min(models_all_rank_dt[amodel_z])
  print(f'  Min: {model_min}')
  models_min_ls.append(model_min)

  model_max = np.max(models_all_rank_dt[amodel_z])
  print(f'  Max: {model_max}')
  models_max_ls.append(model_max)

  model_ranks_ls = models_all_rank_dt[amodel_z]
  print(f' Ranks: {model_ranks_ls}')
  models_ranks_ls_ls.append(model_ranks_ls)


models_rank_dt = {'model_z': models_z_rank_ls,
                  'mean':models_mean_ls,
                  'std':models_std_ls,
                  'min':models_min_ls,
                  'max':models_max_ls,
                  'ranks':models_ranks_ls_ls}

models_rank_df = pd.DataFrame.from_dict(models_rank_dt)

In [None]:
models_rank_df.head()

In [None]:
models_labels_ls = ['-'.join(i.split('_')[:-1]) for i in models_rank_df.model_z.to_list()]
models_labels_ls
len(models_labels_ls)

In [None]:
len(models_labels_ls)

In [None]:
# models_labels_ls = models_rank_df.model_z.to_list()
# models_labels_ls = ['-'.join(i.split('_')[:-1]) for i in models_labels_ls]
# models_labels_ls = [w.replace('_', '-') for w in models_labels_ls]
# models_labels_ls = [i.split('_')[:-1] for i in models_rank_df.model_z.to_list()]

# models_box_ready_df = pd.DataFrame(np.array(models_rank_df.ranks.to_list()).T, columns=models_labels_ls)
models_box_ready_df = pd.DataFrame(np.array(models_rank_df.ranks.to_list()).T, columns=models_labels_ls)
models_box_ready_df

In [None]:
# Plot Rank and Spread of Models across all 25 Corpora

plt.rcParams["figure.figsize"] = (8, 8)

save_plot = True
subdir_name = 'data_corpora_plots'

ax = models_box_ready_df.plot(kind='box')
ax.grid(True, alpha=0.3)

ax.set_ylabel('Rank', fontsize=12)
# The boxes are one per model, so label them with the model names
ax.set_xticklabels(models_box_ready_df.columns, size=10, rotation=40)
# ax.legend(bbox_to_anchor=(1.05, 1), loc='upper left', borderaxespad=0)

# ax.invert_yaxis()

plt.xticks(fontsize=20, rotation=90)
# plt.yticks(fontsize=20) # , rotation=0)
# plt.rcParams.update({'font.size': 20})
# plt.title(f'{corpora_full_dt[acorpus]}\n Error Area between zScore Models and Median', pad=20, fontdict={'fontsize':24})
plt.grid(alpha=0.3)
plt.title('Rank and Spread of Models across all 25 Corpora')

ax = corpora_mcc_rank_df.T.plot()

ax.set_title('Model MCC Rank Across Corpora (Starting at Bottom)', fontsize=20)
# ax.set(xlabel='Decade', ylabel='Weighted Percent of Top Songs', fontsize=10)
# ax.set_xlabel('Decade', fontsize=20)
ax.set_ylabel('Rank', fontsize=15)
ax.set_xticks(range(len(corpora_mcc_rank_df.columns)))  # one tick per corpus
ax.set_xticklabels(corpora_mcc_rank_df.columns, size=10, rotation=40)
# ax.legend(bbox_to_anchor=(1.05, 1), loc='upper left', borderaxespad=0)
ax.legend(fontsize=10, title='Model', title_fontsize='14', bbox_to_anchor=(1.05, 1), loc='upper left');


if save_plot:
  filename_plt = f'./{subdir_name}/plt_models_all_rank_spread.png'
  plt.savefig(filename_plt)

plt.show()
plt.close();

In [None]:

models_z_rank_ls = models_all_order_dt['cdickens_achristmascarol']

for amodel_z in models_z_rank_ls:
  models_all_rank_dt[amodel_z] = []

for key, values in models_all_order_dt.items():
  print(f'Corpus: {key}')
  for i, amodel in enumerate(values):
    print(f' Model #{i}: {amodel}')
    # models_all_rank_dt[amodel].append(i)

  models_all_rank_dt['logreg_cv_z']
  print('\n')

  model_median = np.median(models_all_rank_dt['logreg_cv_z'])
  print(f'  Model Median: {model_median}')

  model_std = np.std(models_all_rank_dt['logreg_cv_z'])
  print(f'  Model STD: {model_std}')

In [None]:
# Create DataFrame from Dictionaries
# NOTE: corpora_median_area_dt must already map each corpus to a list of (model_z, area_z) tuples

corpora_median_area_df = pd.DataFrame()
first_loop_fl = True

for acorpus, area_tup_ls in corpora_median_area_dt.items():
  
  print(f'\nCorpus: {acorpus}\n   {area_tup_ls}')

  # One row per model; the area column is named after the corpus so merged columns do not collide
  temp_df = pd.DataFrame(area_tup_ls, columns=['model_z', acorpus])
  print(f'temp_df.head():\n{temp_df.head()}')
  if first_loop_fl:
    corpora_median_area_df = temp_df.copy(deep=True)
    first_loop_fl = False
  else:
    corpora_median_area_df = pd.merge(corpora_median_area_df, temp_df, how='inner', on='model_z')

# corpora_median_area_df = pd.DataFrame(corpora_median_area_dt, columns=['model_z', 'area_z'])


"""
for acorpus, area_tup_ls in corpora_median_area_dt.items():
  
  print(f'Corpus: {acorpus}\n   {area_tup_ls}')

for i, acorpus in enumerate(corpora_ls):
  print(f'Processing Corpus #{i}: {acorpus}')

  median_model_area_ls = []

  for j, amodel_z in enumerate(model_z_cols_ls):
    print(f'  with Model #{j}: {amodel_z}')

    median_model_area = np.sum(np.abs(corpora_all_dt[acorpus][amodel_z] - corpora_all_dt[acorpus]['median_z']))
    print(f'    Area between Median: {median_model_area}')

    median_model_area_ls.append((amodel_z, median_model_area))


corpora_median_area_dt['cdickens_achristmascarol'].plot(kind=bar)
""";

In [None]:
corpora_median_area_dt

In [None]:
corpora_median_area_dt.keys()

In [None]:
corpora_median_area_dt['cdickens_achristmascarol']

In [None]:
corpora_median_area_df.head()

In [None]:
print(corpora_median_area_dt['cdickens_achristmascarol'][0][1])

In [None]:
# sorted() returns a new list (list.sort() returns None); sort by area, the second tuple element
print(sorted(corpora_median_area_dt['cdickens_achristmascarol'], key=lambda x: x[1]))

In [None]:
print(corpora_median_area_dt['cdickens_achristmascarol']) #.sort(key=lambda y: y[1]))

In [None]:
# Compare the area values directly; int() would truncate fractional differences
mx = max(corpora_median_area_dt['cdickens_achristmascarol'], key=lambda e: e[1])
print(mx)

### Ensemble-Corpus Compatibility (ECC)
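
The ECC bar plot in this section sums each corpus column of `corpora_mcc_minmax_df`. A minimal sketch of that aggregation on toy data, assuming rows are models, columns are corpora, and each cell is a per-model MCC score. The `toy_*` names and all values are invented for illustration only.

```python
import pandas as pd

# Hypothetical MCC table: rows = models, columns = corpora (toy values)
toy_mcc_df = pd.DataFrame(
    {'corpus_a': [0.2, 0.5, 0.3],
     'corpus_b': [0.1, 0.1, 0.2]},
    index=['vader_z', 'afinn_z', 'bing_z'])

# ECC per corpus = sum of all model scores in that corpus column,
# sorted ascending as in the bar plot
toy_ecc_ser = toy_mcc_df.sum().sort_values()
print(toy_ecc_ser)
```

Whether a lower or higher sum is "better" depends on how the underlying MCC values were scaled upstream, which this sketch does not assume.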

In [None]:
corpora_mcc_minmax_df.head()

In [None]:
plt.rcParams["figure.figsize"] = (6, 6)

subdir_name = 'data_corpora_plots'

corpora_mcc_minmax_df.sum().sort_values().plot(kind='bar')

plt.ylabel('ECC Value')
# plt.xticks(fontsize=20) # , rotation=0)
# plt.yticks(fontsize=20) # , rotation=0)
plt.rcParams.update({'font.size': 8})
plt.title('Ensemble-Corpus Compatibility (ECC) Metric by Novel', pad=20, fontdict={'fontsize':10})

if save_plot:

  filename_plt = f'./{subdir_name}/plt_ecc_corpora.png'
  plt.savefig(filename_plt)

plt.show()
plt.close();

In [None]:
# [SKIP]

In [None]:
corpus_texts_dt[atext].iloc[:2][smalowess_ls].sum(axis=1)

In [None]:
corpus_texts_dt[atext][smalowess_ls].std().values

In [None]:
minmax_ser = pd.Series(corpus_texts_dt[atext][smalowess_ls].max(axis=1) - corpus_texts_dt[atext][smalowess_ls].min(axis=1))

In [None]:
atext = corpus_texts_ls[0]

win_10per = int(0.10 * corpus_texts_dt[atext].shape[0])

smalowess_ls = [x for x in corpus_texts_dt[corpus_texts_ls[0]].columns if x.endswith('_smalowess_rzstd')]
smalowess_ls

sma_ls = [x.replace('_smalowess_', '_') for x in smalowess_ls]

fig = plt.figure()
ax = plt.subplot(111)

std_ser = pd.Series(corpus_texts_dt[atext][smalowess_ls].std(axis=1).values)
# .rolling(win_10per, center=True, min_periods=0).mean()
# _ = ax.plot(models_smalowess_rzstd_ls_median, label='Ensemble Median', color='r', linewidth=3) # .rolling(win_10per, center=True, min_periods=0).mean(), label='Ensemble Median', color='r', linewidth=3)
_ = ax.plot(std_ser, label='Std', alpha=0.3, linewidth=3)
_ = ax.plot(minmax_ser, label='MinMax Range', alpha=0.3)
# _ = ax.plot(corpus_texts_dt[atext][sma_ls[:model_ct]].rolling(win_10per, center=True, min_periods=0).mean(), label='Ensemble Median', alpha=0.3, linewidth=3)

# Put a legend to the right of the current axis
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))

plt.grid(True, alpha=0.3)
plt.title(f'{corpus_titles_dt[atext][0]}\nSentimentArc Ensemble of {model_ct} Models\nSmoothed: SMA only (vs) SMA (window=10%) + LOWESS (frac=1./{afrac_inv})\nClipped with IQR + zScore Standardized')
plt.show();  

### Model Family Coherence (MFC)
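
The cells in this section tag each model with a family and then summarize scores per family. A minimal sketch of one way to do that directly on the score values, assuming a models-by-corpora table like `corpora_mcc_minmax_df`. The `toy_*` names and values are hypothetical, and this is not necessarily the exact statistic plotted below.

```python
import pandas as pd

# Hypothetical subset of the model -> family mapping
toy_family_map_dt = {'afinn_z': 'lexicon', 'bing_z': 'lexicon',
                     'cnn_z': 'dnn', 'fcn_z': 'dnn'}

# Hypothetical scores: rows = models, columns = corpora
toy_mcc_df = pd.DataFrame(
    {'corpus_a': [1.0, 3.0, 2.0, 6.0],
     'corpus_b': [2.0, 4.0, 4.0, 8.0]},
    index=['afinn_z', 'bing_z', 'cnn_z', 'fcn_z'])

# Long format: one row per (model, corpus) score
toy_long_df = (toy_mcc_df.rename_axis('model')
                         .reset_index()
                         .melt(id_vars='model', var_name='corpus', value_name='score'))
toy_long_df['family'] = toy_long_df['model'].map(toy_family_map_dt)

# Per-family summary computed on the raw scores
toy_family_stats_df = toy_long_df.groupby('family')['score'].agg(['mean', 'min', 'max', 'std'])
print(toy_family_stats_df)
```

Aggregating the raw scores keeps counts and quartiles from `describe()` out of the family mean.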

In [None]:
corpora_mcc_minmax_df.head()

In [None]:
corpora_mcc_minmax_df.index

In [None]:
model_family_map_dt = {'afinn_z':'lexicon',
                   'autogluon_z':'dnn', 
                   'bing_z':'lexicon', 
                   'cnn_z':'dnn', 
                   'fcn_z':'dnn', 
                   'flair_z':'dnn',
                   'flaml_z':'dnn', 
                   'hinglish_z':'transformer', 
                   'huggingface_z':'transformer', 
                   'huliu_z':'lexicon', 
                   'imdb2way_z':'transformer',
                   'jockers_rinker_z':'heuristic', 
                   'jockers_z':'lexicon', 
                   'lmcd_z':'heuristic', 
                   'logreg_cv_z':'ml', 
                   'logreg_z':'ml',
                   'lstm_z':'dnn', 
                   'multinb_z':'ml', 
                   'nlptown_z':'transformer', 
                   'nrc_z':'heuristic', 
                   'rf_z':'ml', 
                   'roberta15lg_z':'transformer',
                   'robertaxml8lang_z':'transformer', 
                   'senticnet_z':'heuristic', 
                   'sentimentr_z':'lexicon', 
                   'sentiword_z':'heuristic',
                   'stanza_z':'dnn', 
                   'syuzhet_z':'lexicon', 
                   't5imdb50k_z':'transformer', 
                   'textblob_z':'ml', 
                   'vader_z':'heuristic',
                   'xgb_z':'ml', 
                   'yelp_z':'transformer'
}

In [None]:
# corpora_mfc_df.drop(columns=['model_z'], inplace=True)

In [None]:
corpora_mfc_df = pd.DataFrame(corpora_mcc_minmax_df)

# corpora_mfc_df.insert(loc=0, column='model_z', value=corpora_mfc_df.index)

corpora_mfc_df.head()

In [None]:
# del corpora_mfc_df
corpora_mfc_df['model'] = corpora_mfc_df.index

In [None]:
corpora_mfc_df.head()

In [None]:
corpora_mfc_df.iloc[0]

In [None]:

for index, arow in corpora_mfc_df.iterrows():
  print(arow['model'])

In [None]:

zfamily_ls = []

for index, arow in corpora_mfc_df.iterrows():
  zmodel = arow['model']
  zfamily = model_family_map_dt[zmodel]
  print(f'Model: {zmodel} belongs to Family: {zfamily}')
  zfamily_ls.append(zfamily)

print(f'zfamily_ls: {zfamily_ls}')

corpora_mfc_df.insert(loc=0, column='family', value=zfamily_ls)
corpora_mfc_df.head()

In [None]:
mfc_family_df = corpora_mfc_df.groupby('family').describe()
mfc_family_df.shape
# mfc_family_df.plot() # ['mean'].plot(kind='bar')
print('\n')
mfc_family_df.head()

In [None]:
plt.rcParams["figure.figsize"] = (8, 8)

save_plot = True

subdir_name = 'data_corpora_plots'

family_types_ls = ['lexicon', 'heuristic', 'dnn', 'ml', 'transformer']

family_means_ls = []
family_mins_ls = []
family_maxs_ls = []
family_stds_ls = []

for afamily in family_types_ls:
  family_means_ls.append(mfc_family_df.T[afamily].mean())
  family_mins_ls.append(mfc_family_df.T[afamily].min())
  family_maxs_ls.append(mfc_family_df.T[afamily].max())
  family_stds_ls.append(mfc_family_df.T[afamily].std())


# family_means_df = pd.DataFrame()
family_stats_df = pd.DataFrame({'zfamily':family_types_ls, 
                                'zmean':family_means_ls,
                                'zmin':family_mins_ls,
                                'zmax':family_maxs_ls,
                                'zstd':family_stds_ls,
                                }) #  = pd.DataFrame(family_means_dt)
family_stats_df.index = family_stats_df.zfamily

family_stats_df.head()
family_stats_df.info()
# family_means_df['min'].sort_values().plot(kind='bar')

# Plot MFC 

# plt.rcParams['axes.grid'] = True
# plt.rcParams['grid.alpha'] = 1
# plt.rcParams['grid.color'] = "#cccccc"

# fig, ax = plt.subplots()

family_stats_df.sort_values(by='zmean').plot(kind='bar')
plt.grid(True, alpha=0.3)
plt.legend(loc='best')
# plt.xlabel('Model Family', size=20)
plt.ylabel('MFC Value', size=10)
# plt.axis('off')
# plt.xlabel.set_visible(False)
plt.xticks(size=10, rotation=10)
# plt.title('Model Family Coherence (MFC) Over Entire Corpus', size=16)

# plt.rcParams.update({'font.size': 8})
plt.title('Model Family Coherence (MFC) Over Entire Corpus', pad=20, fontdict={'fontsize':16})

if save_plot:

  filename_plt = f'./{subdir_name}/plt_mfc_corpora.png'
  plt.savefig(filename_plt)

plt.show();

## Save Checkpoint
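
The checkpoint cells below call `write_dict_dfs` and `read_dict_dfs`, which are defined elsewhere and not shown here. As a rough sketch of the idea only, a dictionary of DataFrames can be round-tripped through a single JSON file as follows. The `toy_*` helpers, the file name, and the `orient='split'` layout are assumptions, not the project's actual file format.

```python
import io
import json
import os
import tempfile

import pandas as pd

def toy_write_dict_dfs(adict, out_path):
    """Serialize {title: DataFrame} to one JSON file (illustrative helper)."""
    payload = {key: df.to_json(orient='split') for key, df in adict.items()}
    with open(out_path, 'w') as fp:
        json.dump(payload, fp)

def toy_read_dict_dfs(in_path):
    """Read the JSON file back into {title: DataFrame}."""
    with open(in_path) as fp:
        payload = json.load(fp)
    return {key: pd.read_json(io.StringIO(val), orient='split')
            for key, val in payload.items()}

# Toy data standing in for corpus_texts_dt
toy_dt = {'text_a': pd.DataFrame({'vader': [0.1, -0.2], 'afinn': [1.0, 2.0]})}
toy_path = os.path.join(tempfile.mkdtemp(), 'toy_sentiment.json')

toy_write_dict_dfs(toy_dt, toy_path)
toy_back_dt = toy_read_dict_dfs(toy_path)
print(toy_back_dt['text_a'])
```

Reading the file back right after writing, as the cell below does, is a cheap check that keys and columns survived the round trip.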

In [None]:
print('Current Working Directory:')
!pwd

print(f'\nSaving to subdir=SUBDIR_SENTIMENT_CLEAN:\n  {SUBDIR_SENTIMENT_CLEAN}\n')

print(f'Saving to filename=FNAME_SENTIMENT_CLEAN:\n  {FNAME_SENTIMENT_CLEAN}\n')

In [None]:
corpus_texts_dt.keys()

In [None]:
# Review Models per Text

# Unique Model per each Text
print('Unique Models for each Text:')
# A bare expression in the middle of a cell is not displayed, so print it
print([x for x in corpus_texts_dt[corpus_texts_ls[0]].columns if 'vader' in x])
print('\n\n')

# Check for duplicate Cols/Models
col_dups_ct = corpus_texts_dt[corpus_texts_ls[0]].columns.duplicated().sum()
print(f'[{col_dups_ct}] Duplicated Cols/Models')

In [None]:
# Check for duplicate Texts
print('Check for duplicated Texts in Corpus:\n')
corpus_texts_dt.keys()

# Delete if necessary
# corpus_texts_dt['pattern_smalowess_rzlstd'].head()
# del corpus_texts_dt['pattern_smalowess_rzstd']

In [None]:
%%time

# timeit_res = %timeit -n1 -r1 -o sum(range(1000000))

# NOTE:

# for i, atext in enumerate(corpus_texts_dt.keys()):

temp_df = pd.DataFrame()

for i, atext in enumerate(corpus_titles_ls):
  print(f'Text #{i}: {atext}')

  col_models_ls = []
  for j, amodel in enumerate(ensemble_ls):
  
    col_models_ls = [x for x in corpus_texts_dt[atext].columns if (x.startswith(f'{amodel}_') or x == amodel)]
    
    print(f'\n\natext: {atext}\namodel: {amodel}\ncol_models_ls: {col_models_ls}\n\n')

    temp_df = corpus_texts_dt[atext][col_models_ls]
    temp_df.info()
    print(f'type(temp_df): {type(temp_df)}')
    temp_fname = f'sentiment_clean_{Corpus_Genre}_{Corpus_Type}_{amodel}_{atext}.csv'
    print(f'filename: {temp_fname}')
    temp_df.to_csv(f'{SUBDIR_SENTIMENT_CLEAN}{temp_fname}')


"""
    # amodel_rstd = f'{amodel}_rstd'
    amodel_rzstd = f'{amodel}_rzstd'
    amodel_sma_rzstd = f'{amodel}_sma_rzstd'
    amodel_smalowess_rzstd = f'{amodel}_smalowess_rzstd'
    print(f'  Model #{j}: {amodel} (Model_Std: {amodel_rzstd})')
    win_10per = int(0.10 * corpus_texts_dt[atext][amodel].shape[0])
    # clip_outliers(corpus_texts_dt[corpus_texts_ls[0]]['vader'])
""";

In [None]:
# Verify in SentimentArcs Root Directory
os.chdir('/gdrive/MyDrive/cdh/sentiment_arcs/')

# Save sentiment values to subdir_sentiments
write_dict_dfs(corpus_texts_dt, out_file=FNAME_SENTIMENT_CLEAN, out_dir=SUBDIR_SENTIMENT_CLEAN)
print(f'Saving Corpus_Genre: {Corpus_Genre}')
print(f'        Corpus_Type: {Corpus_Type}')
print('\n')

# Verify Dictionary was saved correctly by reading back the *.json datafile
test_dt = read_dict_dfs(in_file=FNAME_SENTIMENT_CLEAN, in_dir=SUBDIR_SENTIMENT_CLEAN)
print('These Text Titles:')
print(list(test_dt.keys()))
print('\n')

# Bare expressions in the middle of a cell are not displayed, so print them
print(corpus_texts_dt[corpus_titles_ls[0]].head())
print('\n')

test_dt[corpus_titles_ls[0]].info()


In [None]:
# [SKIP]

In [None]:


# ax.plot(x, y_rstd, alpha=0.1) # , label=f'LOWESS (frac={afrac_inv})') # , color='tomato', linewidth=5)

ax.plot(corpus_texts_dt[atext][amodel_rstd].rolling(win_10per, center=True, min_periods=0).mean(), label=amodel_rstd, alpha=0.3)
ax.plot(corpus_texts_dt[atext][amodel].rolling(win_10per, center=True, min_periods=0).mean(), label=amodel, alpha=0.3)

# Plot statsmodels LOWESS
# ax.plot(sm_x, sm_y, label=f'LOWESS (frac={afrac_inv})') # , color='tomato', linewidth=5)
# Plot moepy LOWESS
ax.plot(x, y_rstd_pred, label=f'Std LOWESS (frac=1./{afrac_inv})') # , color='tomato', linewidth=5)
ax.plot(x, y_sma_rstd_pred, label=f'Std+SMA LOWESS (frac=1./{afrac_inv})') # , color='tomato', linewidth=5)

atitle = f'{corpus_titles_dt[atext][0]}\nSentimentArcs Model: {amodel}\nLOWESS Smoothed (frac=1./{afrac_inv})\nClipped (2.5*IQR) and Standardized (zScore)'
plt.title(atitle)
plt.legend()
plt.show();

In [None]:


sm_x, sm_y = sm_lowess(y, x,  frac=1./160., it=5, return_sorted = True).T
plt.plot(sm_x, sm_y, color='tomato', linewidth=5)
# sm_x, sm_y = sm_lowess(y, x,  frac=1./20., it=5, return_sorted = True).T
# plt.plot(sm_x, sm_y, color='green', linewidth=5, alpha=0.3)
# plt.plot(x,)
plt.plot(x, y, 'k.', alpha=0.1)
plt.plot(x, y_roll10, label='SMA 10%')
plt.plot(x, df.sentiment_roll10_interp, label='interpolate roll10')
plt.plot(sm_x, sm_y, label='interpolate LOWESS')
plt.plot(x, df.sentiment_roll10_ewm, label='ewm')
plt.legend(loc='best')

In [None]:
lowess_grid_dt = {}
crux_ct_ls = []
# temp_df['sent_no'] = pd.Series([x for x in corpus_sents_df['sent_no']])
temp_df['avg_stdscaler'] = corpus_sents_df[models_subset_ls].mean()

fig = plt.figure()
ax = plt.axes()


for afrac in range(frac_start_int, frac_end_int, frac_step_int):
  print(f'Processing afrac = {afrac}')
  # Compute error between subset of models
  afrac_fl = afrac/100
  temp_df = get_lowess(corpus_sents_df, models_ls=models_subset_ls, text_unit='sentence', afrac=afrac_fl, do_plot=False);
  temp_df['minmax_diff'] = temp_df.max(axis=1) - temp_df.min(axis=1)
  diff_sum = temp_df['minmax_diff'].sum()
  print(f"  Sum(minmax_diff): {diff_sum}");
  lowess_grid_dt[afrac] = diff_sum
  # Compute Crux Points
  temp_df['sent_no'] = pd.Series(list(range(temp_df.shape[0])))
  crux_ls = get_crux_points(temp_df,
                            'median',
                            text_type='sentence', 
                            win_per=5, 
                            sec_y_labels=False, 
                            sec_y_height=0, 
                            subtitle_str=' ', 
                            do_plot=False,
                            save2file=False)
  ax.plot(temp_df['sent_no'], temp_df['median'], label=f'frac={afrac}')
  # plt.plot(data=temp_df, x='sent_no', y='median', label=f'frac={afrac}')
  crux_ct_ls.append(len(crux_ls))
  print(f'  {len(crux_ls)} Crux Points')

plt.title(f"{CORPUS_FULL} \n LOWESS Smoothing Grid Search (frac={Frac_Start} to {Frac_End})")
plt.legend()
plt.show()

In [None]:
corpus_texts_dt[atext][amodel_std].shape[0]

In [None]:
# np.array(corpus_texts_dt[atext][amodel_std]).reshape(-1,1).shape

np.array(corpus_texts_dt[atext][amodel_std]).reshape(-1,).shape

np.array(corpus_texts_dt[atext][amodel_std]).flatten().shape

In [None]:
lowess_model.fit(x, y_roll10, frac=0.2)

# x_pred = x # np.linspace(0, 6448, 100)
y_pred = lowess_model.predict(x)

# Plotting
plt.plot(x, y_roll10, label='Sin Wave', zorder=2)
plt.plot(x, y_pred, '--', label='Estimate', color='k', zorder=3)
# plt.scatter(x, y_roll10, label='With Noise', color='C1', s=5, zorder=1)
plt.legend(frameon=False)
# plt.xlim(0, 5)

In [None]:
%%time

# NOTE:  1m21s @14:30 on 20220308 Colab Pro/CPU (1 Novel/1model using moepy)
#        1m21s @14:37 on 20220308 Colab Pro/CPU (1 Novel/1model using moepy)
#        1m21s @14:37 on 20220308 Colab Pro/CPU (1 Novel/1model using statsmodels)
#        1m21s @14:37 on 20220308 Colab Pro/CPU (1 Novel/1model using statsmodels)

atext = 'cmieville_thecityandthecity'
amodel = 'afinn'
amodel_rstd = 'afinn_rstd'
afrac_inv = 10

win_10per = int(0.10 * corpus_texts_dt[atext][amodel].shape[0])

x = np.array(range(corpus_texts_dt[atext][amodel].shape[0]))
y = corpus_texts_dt[atext][amodel].to_numpy()
y_std = corpus_texts_dt[atext][amodel_rstd].to_numpy()
# print(f'x.shape: {x.shape}\ny.shape: {y.shape}')
# sm_x, sm_y = sm_lowess(y, x,  frac=1./afrac_inv, it=5, return_sorted = True).T

# statsmodels
#sm_x, sm_y = sm_lowess(y, x,  frac=1./afrac_inv, it=5).T

# moepy
lowess_moepy.fit(x, y_std, frac=1./afrac_inv)
y_pred = lowess_moepy.predict(x)

atitle = f'{corpus_titles_dt[atext][0]}\nSentimentArcs Model: {amodel}\nLOWESS Smoothed (frac=1./{afrac_inv})\nClipped (2.5*IQR) and Standardized (zScore)'

fig = plt.figure()
ax = plt.subplot(111)

# ax.plot(x, y_std, alpha=0.1) # , label=f'LOWESS (frac={afrac_inv})') # , color='tomato', linewidth=5)

ax.plot(corpus_texts_dt[atext][amodel_rstd].rolling(win_10per, center=True, min_periods=0).mean(), label=amodel_rstd, alpha=0.3)
ax.plot(corpus_texts_dt[atext][amodel].rolling(win_10per, center=True, min_periods=0).mean(), label=amodel, alpha=0.3)

# Plot statsmodels LOWESS
# ax.plot(sm_x, sm_y, label=f'LOWESS (frac={afrac_inv})') # , color='tomato', linewidth=5)
# Plot moepy LOWESS
ax.plot(x, y_pred, label=f'LOWESS (frac=1./{afrac_inv})') # , color='tomato', linewidth=5)

atitle = f'{corpus_titles_dt[atext][0]}\nSentimentArcs Model: {amodel}\nLOWESS Smoothed (frac=1./{afrac_inv})\nClipped (2.5*IQR) and Standardized (zScore)'
plt.title(atitle)
plt.legend()
plt.show();



# **[STEP 5] Customize Ensemble & Smoothing**

* https://github.com/patil-suraj/exploring-T5/blob/master/t5_fine_tuning.ipynb


## Select Ensemble Subset

In [None]:
ensemble_ls.sort()
ensemble_ls

In [None]:
# Sentence Plotly Interactive/Zoom Sentiment Plots

ensemble_subset_ls = []

#@markdown **Instructions:**
#@markdown <li> Select Models for Ensemble Subset
#@markdown <li> Then execute this code cell

#@markdown <hr>

#@markdown **Lexicon Models**
# Lexicon Models from SyuzhetR
SyuzhetR_AFINN = False #@param {type:"boolean"}
SyuzhetR_Bing = True #@param {type:"boolean"}
SyuzhetR_NRC = False #@param {type:"boolean"}
SyuzhetR_Syuzhet = True #@param {type:"boolean"}

if SyuzhetR_AFINN == True:
  ensemble_subset_ls.append('syuzhetr_afinn')
if SyuzhetR_Bing == True:
  ensemble_subset_ls.append('syuzhetr_bing')
if SyuzhetR_NRC == True:
  ensemble_subset_ls.append('syuzhetr_nrc')
if SyuzhetR_Syuzhet == True:
  ensemble_subset_ls.append('syuzhetr_syuzhet')

# Lexicon Models that are Standalone
Pattern = True #@param {type:"boolean"}
AFINN = True #@param {type:"boolean"}

if Pattern == True:
  ensemble_subset_ls.append('pattern')
if AFINN == True:
  ensemble_subset_ls.append('afinn')

# Lexicons ported from SentimentR to Python
pysentimentr_JockerRinker = True #@param {type:"boolean"}
pysentimentr_HuLiu = True #@param {type:"boolean"}
pysentimentr_NRC = True #@param {type:"boolean"}
pysentimentr_SentiWord = True #@param {type:"boolean"}
pysentimentr_SenticNet = True #@param {type:"boolean"}
pysentimentr_LMcD = True #@param {type:"boolean"}

if pysentimentr_JockerRinker == True:
  ensemble_subset_ls.append('pysentimentr_jockersrinker')
if pysentimentr_HuLiu == True:
  ensemble_subset_ls.append('pysentimentr_huliu')
if pysentimentr_NRC == True:
  ensemble_subset_ls.append('pysentimentr_nrc')
if pysentimentr_SentiWord == True:
  ensemble_subset_ls.append('pysentimentr_sentiword')
if pysentimentr_SenticNet == True:
  ensemble_subset_ls.append('pysentimentr_senticnet')
if pysentimentr_LMcD == True:
  ensemble_subset_ls.append('pysentimentr_lmcd')
  
# Future -----
# https://www.liwc.app/ (LIWC)
# LIWC = False #@param {type:"boolean"}
# https://github.com/nickderobertis/pysentiment (HarvardIV, LMcD)
# HarvardIV = False #@param {type:"boolean"}
# https://mpqa.cs.pitt.edu/ (MPQA)
# MPQA_Arc = False #@param {type:"boolean"}
# http://sentistrength.wlv.ac.uk/ (SentiStrength)
# SentiStrength_Arc = False #@param {type:"boolean"}

Median_Lexicon = True #@param {type:"boolean"}

if Median_Lexicon == True:
  ensemble_subset_ls.append('median_lexicon')


#@markdown **Heuristic Models**
SentimentR_JockersRinker = True #@param {type:"boolean"}
SentimentR_Jockers = True #@param {type:"boolean"}
SentimentR_HuLiu = True #@param {type:"boolean"}
SentimentR_SenticNet = True #@param {type:"boolean"}
SentimentR_SentiWord = True #@param {type:"boolean"}
SentimentR_NRC = True #@param {type:"boolean"}
SentimentR_LoughranMcDonald = True #@param {type:"boolean"}
SentimentR_SoCal_Google = True #@param {type:"boolean"}
VADER = True #@param {type:"boolean"}


if SentimentR_JockersRinker == True:
  ensemble_subset_ls.append('sentimentr_jockersrinker')
if SentimentR_Jockers == True:
  ensemble_subset_ls.append('sentimentr_jockers')
if SentimentR_HuLiu == True:
  ensemble_subset_ls.append('sentimentr_huliu')
if SentimentR_SenticNet == True:
  ensemble_subset_ls.append('sentimentr_senticnet')
if SentimentR_SentiWord == True:
  ensemble_subset_ls.append('sentimentr_sentiword')
if SentimentR_NRC == True:
  ensemble_subset_ls.append('sentimentr_nrc')
if SentimentR_LoughranMcDonald == True:
  ensemble_subset_ls.append('sentimentr_loughran_mcdonald')
if SentimentR_SoCal_Google == True:
  ensemble_subset_ls.append('sentimentr_socal_google')
if VADER == True:
  ensemble_subset_ls.append('vader')

Median_Heuristic = True #@param {type:"boolean"}

if Median_Heuristic == True:
  ensemble_subset_ls.append('median_heuristic')

#@markdown **Statistical ML Models**
Logistic_Regression = False #@param {type:"boolean"}
Logistic_Regression_cv6 = False #@param {type:"boolean"}
Multinomial_NaiveBayes = False #@param {type:"boolean"}
Multinomial_NaiveBayes_POS = False #@param {type:"boolean"}
Random_Forest = False #@param {type:"boolean"}
XGBoost = False #@param {type:"boolean"}
AutoML_FLAML = False #@param {type:"boolean"}
AutoML_AutoGluon = False #@param {type:"boolean"}

if Logistic_Regression == True:
  ensemble_subset_ls.append('logreg')
if Logistic_Regression_cv6 == True:
  ensemble_subset_ls.append('logreg_cv6')
if Multinomial_NaiveBayes == True:
  ensemble_subset_ls.append('multinb')
if Multinomial_NaiveBayes_POS == True:
  ensemble_subset_ls.append('multinb')
if Random_Forest == True:
  ensemble_subset_ls.append('rf')
if XGBoost == True:
  ensemble_subset_ls.append('xgboost')
if AutoML_FLAML == True:
  ensemble_subset_ls.append('flaml')
if AutoML_AutoGluon == True:
  ensemble_subset_ls.append('autogluon')

Median_ML = False #@param {type:"boolean"}

if Median_ML == True:
  ensemble_subset_ls.append('median_ml')

#@markdown **DNN Models**
FCN = False #@param {type:"boolean"}
LSTM = False #@param {type:"boolean"}
CNN = False #@param {type:"boolean"}
AutoML_Stanza = True #@param {type:"boolean"}
AutoML_Flair = True #@param {type:"boolean"}

if FCN == True:
  ensemble_subset_ls.append('fcn')
if LSTM == True:
  ensemble_subset_ls.append('lstm')
if CNN == True:
  ensemble_subset_ls.append('cnn')
if AutoML_Stanza == True:
  ensemble_subset_ls.append('stanza')
if AutoML_Flair == True:
  ensemble_subset_ls.append('flair')

Median_DNN = False #@param {type:"boolean"}

if Median_DNN == True:
  ensemble_subset_ls.append('median_dnn')


#@markdown **Transformer Models**
RoBERTaLg15 = True #@param {type:"boolean"}
Huggingface = True #@param {type:"boolean"}
NLPTown = True #@param {type:"boolean"}
T5IMDB50k = True #@param {type:"boolean"}
IMDB2way = True #@param {type:"boolean"}
Yelp = True #@param {type:"boolean"}
RoBERTaXML8lang = True #@param {type:"boolean"}
Hinglish = True #@param {type:"boolean"}

if RoBERTaLg15 == True:
  ensemble_subset_ls.append('roberta15lg')
if Huggingface == True:
  ensemble_subset_ls.append('huggingface')
if NLPTown == True:
  ensemble_subset_ls.append('nlptown')
if T5IMDB50k == True:
  ensemble_subset_ls.append('t5imdb50k')
if IMDB2way == True:
  ensemble_subset_ls.append('imdb2way')
if Yelp == True:
  ensemble_subset_ls.append('yelp')
if RoBERTaXML8lang == True:
  ensemble_subset_ls.append('robertaxml8lang')
if Hinglish == True:
  ensemble_subset_ls.append('hinglish')

Median_Transformer = True #@param {type:"boolean"}

if Median_Transformer == True:
  ensemble_subset_ls.append('median_transformer')

#@markdown <hr>

Median_Ensemble = True #@param {type:"boolean"}

if Median_Ensemble == True:
  ensemble_subset_ls.append('median_ensemble')

ensemble_subset_ls

print(f'\n\n[{len(ensemble_subset_ls)}] Models Selected Above\n')


## Smoothing

In [None]:
#@markdown **Smoothing Technique**
Smoothing_Algo = "SMA" #@param ["SMA", "LOWESS", "Both"]

#@markdown <hr>

#@markdown **For SMA Smoothing (default 10%)**
Window_Percent = 7 #@param {type:"slider", min:3, max:20, step:1}

#@markdown <hr>

#@markdown **For LOWESS Smoothing (default 0.08)**
Frac_Start = 0.08 #@param {type:"slider", min:0.01, max:0.3, step:0.01}
Frac_End = 0.2 #@param {type:"slider", min:0.01, max:0.2, step:0.01}
Frac_Step = 0.02 #@param {type:"slider", min:0.01, max:0.05, step:0.01}

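# round() before int(): 100*0.29 evaluates to 28.999999999999996, which int() alone truncates to 28
frac_start_int = int(round(100*Frac_Start))
frac_end_int = int(round(100*Frac_End)) + 1
frac_step_int = int(round(100*Frac_Step))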

In [None]:
ensemble_ls

In [None]:
models_rzstd = [x for x in corpus_texts_dt[corpus_titles_ls[0]].columns if (x.endswith('_rzstd') and ('_smalowess_' not in x))]
models_rzstd

In [None]:
ensemble_subset_ls = [x for x in models_rzstd if x[0] in ['s','r']]
ensemble_subset_ls

In [None]:
corpus_texts_dt[corpus_titles_ls[0]][ensemble_subset_ls].rolling(300, center=True, min_periods=0).mean().plot(alpha=0.3)

## Select Smoothing and Hyperparameters

### Option (a): Simple Moving Average (default window = 10%)
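
The plotting cells in this notebook smooth a series with `rolling(win, center=True, min_periods=0).mean()`, where the window is a percentage of the series length. A minimal sketch of that pattern as a reusable function; `sma_percent` and the toy series are hypothetical names, and `min_periods=1` is used here so the edges are averaged over whatever points are available.

```python
import numpy as np
import pandas as pd

def sma_percent(ser, win_per=10):
    """Centered simple moving average; window = win_per % of the series length."""
    win = max(1, int(round(win_per / 100 * len(ser))))
    return ser.rolling(win, center=True, min_periods=1).mean()

# Toy series of 30 points -> a 10% window covers 3 points
toy_ser = pd.Series(np.arange(30, dtype=float))
toy_sma = sma_percent(toy_ser, win_per=10)
print(toy_sma.head())
```

A larger `win_per` gives a smoother arc but flattens short-lived peaks and valleys, so the slider value trades detail for readability.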

In [None]:

SMA_Window_Percent = 10 #@param {type:"slider", min:1, max:20, step:1}

### Option (b): LOWESS (default frac=.08)
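
The grid search below steps through LOWESS `frac` values by converting the slider floats to integer percentages. A minimal sketch of that conversion, assuming the end value should be included in the grid. `frac_grid` is a hypothetical helper, and rounding is used because plain `int()` truncates products such as `100*0.29`.

```python
def frac_grid(frac_start, frac_end, frac_step):
    """Return LOWESS frac values from start to end (inclusive) as floats."""
    start_int = int(round(100 * frac_start))
    end_int = int(round(100 * frac_end))
    step_int = max(1, int(round(100 * frac_step)))  # guard against a zero step
    return [i / 100 for i in range(start_int, end_int + 1, step_int)]

toy_fracs = frac_grid(0.08, 0.20, 0.02)
print(toy_fracs)
```

Each value in the returned list can then be passed as the `frac` argument of whichever LOWESS implementation is in use.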

In [None]:
Frac_Start = 0.08 #@param {type:"slider", min:0.01, max:0.3, step:0.01}
Frac_End = 0.2 #@param {type:"slider", min:0.01, max:0.2, step:0.01}
Frac_Step = 0.02 #@param {type:"slider", min:0.01, max:0.05, step:0.01}

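# round() before int(): 100*0.29 evaluates to 28.999999999999996, which int() alone truncates to 28
frac_start_int = int(round(100*Frac_Start))
frac_end_int = int(round(100*Frac_End)) + 1
frac_step_int = int(round(100*Frac_Step))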

print('GRID SEARCH --------------------\n')

lowess_grid_dt = {}
crux_ct_ls = []
# temp_df['sent_no'] = pd.Series([x for x in corpus_sents_df['sent_no']])
# axis=1 averages across models for each sentence (the default averages each column instead)
temp_df['avg_stdscaler'] = corpus_sents_df[models_subset_ls].mean(axis=1)

fig = plt.figure()
ax = plt.axes()


for afrac in range(frac_start_int, frac_end_int, frac_step_int):
  print(f'Processing afrac = {afrac}')
  # Compute error between subset of models
  afrac_fl = afrac/100
  temp_df = get_lowess(corpus_sents_df, models_ls=models_subset_ls, text_unit='sentence', afrac=afrac_fl, do_plot=False);
  temp_df['minmax_diff'] = temp_df.max(axis=1) - temp_df.min(axis=1)
  diff_sum = temp_df['minmax_diff'].sum()
  print(f"  Sum(minmax_diff): {diff_sum}");
  lowess_grid_dt[afrac] = diff_sum
  # Compute Crux Points
  temp_df['sent_no'] = pd.Series(list(range(temp_df.shape[0])))
  crux_ls = get_crux_points(temp_df,
                            'median',
                            text_type='sentence', 
                            win_per=5, 
                            sec_y_labels=False, 
                            sec_y_height=0, 
                            subtitle_str=' ', 
                            do_plot=False,
                            save2file=False)
  ax.plot(temp_df['sent_no'], temp_df['median'], label=f'frac={afrac}')
  # plt.plot(data=temp_df, x='sent_no', y='median', label=f'frac={afrac}')
  crux_ct_ls.append(len(crux_ls))
  print(f'  {len(crux_ls)} Crux Points')

plt.title(f"{CORPUS_FULL} \n LOWESS Smoothing Grid Search (frac={Frac_Start} to {Frac_End})")
plt.legend()
plt.show()


In [None]:
# Plot Declining Error as a function of LOWESS frac

# lowess_grid_dt

lists = sorted(lowess_grid_dt.items()) # sorted by key, return a list of tuples

x, y = zip(*lists) # unpack a list of pairs into two tuples
# plt.plot(x, y, label='Interplot Error')

adj_factor = 40
crux_ct_adj_ls = [adj_factor * x for x in crux_ct_ls]

# create figure and axis objects with subplots()
fig,ax = plt.subplots()
# make first plot: Error
ax.plot(x, y, color="red", label='Coherence Error', marker="o")
# set x-axis label
ax.set_xlabel("LOWESS frac Hyperparameter",fontsize=14)
# set y-axis label
ax.set_ylabel("Coherence Error",color="red",fontsize=14)

# twin object for two different y-axes on the same plot
ax2=ax.twinx()

# make second plot: Crux Count, with different y-axis using second axis object
ax2.plot(x, crux_ct_ls,color="blue",label='Crux Count', marker="o")
ax2.set_ylabel("Crux Count",color="blue",fontsize=14)
plt.title(f'{CORPUS_FULL} Sentence Sentiment \n Grid Search for LOWESS [frac] Hyperparameter')
plt.legend(loc='best')
plt.show();
"""
# save the plot as a file
fig.savefig('two_different_y_axis_for_single_python_plot_with_twinx.jpg',
            format='jpeg',
            dpi=100,
            bbox_inches='tight')
""";

### Agglomerative Hierarchical Clustering

# **[STEP 6] Crux Detection and Extraction**

## **Search Corpus for Substring**

INSTRUCTIONS:

* In [Search_for_Substring] enter a Substring to search for in the Corpus

* Enter a Substring long enough/unique enough so only a reasonable number of Sentences will be returned

* Substring can contain spaces/punctuation, for example: 'in the garden'
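
Because the Substring may contain punctuation, it should be treated as literal text rather than as a regular expression. A minimal sketch of a literal search with highlighting on toy data; the DataFrame contents and the `find_and_highlight` helper are hypothetical, and only the `sent_no` and `sent_raw` column names are taken from the cell below.

```python
import re

import pandas as pd

# Toy stand-in for corpus_sents_df
toy_sents_df = pd.DataFrame({
    'sent_no': [0, 1, 2],
    'sent_raw': ['She walked in the garden.',
                 'Is it love (or not)?',
                 'Rain fell.']})

def find_and_highlight(sents_df, substring):
    """Return (sent_no, sentence) pairs with each literal match upper-cased."""
    mask = sents_df['sent_raw'].str.contains(substring, regex=False)
    # re.escape keeps characters such as '(' or '?' from being read as regex syntax
    pattern = re.compile(re.escape(substring))
    return [(int(row.sent_no), pattern.sub(lambda m: m.group(0).upper(), row.sent_raw))
            for row in sents_df[mask].itertuples()]

toy_matches = find_and_highlight(toy_sents_df, '(or not)')
print(toy_matches)
```

The search here is case-sensitive, like the cell below; a case-insensitive variant would need the same choice applied to both the filter and the highlight step.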

In [None]:
# Search Corpus Sentences for Substring

Search_for_Substring = "love" #@param {type:"string"}

sentno_matching_ls = corpus_sents_df[corpus_sents_df['sent_raw'].str.contains(Search_for_Substring, regex=False)]['sent_no']

for i, asentno in enumerate(sentno_matching_ls):
  # sentno, sentraw = asent
  print(f"\n\nMatch #{i}: Sentence #{asentno}\n\n")
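  # Literal replacement: the substring may contain regex metacharacters, and the filter above uses regex=False
  sent_highlight = corpus_sents_df.iloc[asentno]['sent_raw'].replace(Search_for_Substring, Search_for_Substring.upper())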
  print(f'    {sent_highlight}')

## **Plot Top-n Crux Peaks/Valleys for selected Model**

INSTRUCTIONS:

* Select [Crux_Window_Percent] exclusive zone around Crux Points as a percentage of Corpus length

* Select a Sentiment Analysis model in [Baseline_SMA_Model]

* Select [Anomaly_Detection] to plot raw Sentiment values to detect outlier/anomaly Sentences. Leave unchecked to plot the SMA-smoothed Sentiment arc and detect Crux points

* Select [Save_to_Report] to also save the plot to an external *.png file
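
The exclusive zone in [Crux_Window_Percent] keeps two Crux Points from being reported right next to each other. The `get_crux_points` function used below is defined elsewhere, so the following is only a rough sketch of the idea under stated assumptions: a greedy pick of the highest local maxima, skipping any candidate inside the zone of an already chosen peak. `top_n_peaks` and the toy arc are hypothetical.

```python
import numpy as np

def top_n_peaks(values, win_per=5, top_n=3):
    """Greedy peak picking with an exclusive zone of win_per % of the series length."""
    values = np.asarray(values, dtype=float)
    half_win = max(1, int(round(win_per / 100 * len(values))))
    # Interior local maxima: higher than the left neighbor, not lower than the right
    idx = np.arange(1, len(values) - 1)
    is_peak = (values[idx] > values[idx - 1]) & (values[idx] >= values[idx + 1])
    candidates = idx[is_peak]
    # Visit candidates from highest to lowest value
    candidates = candidates[np.argsort(values[candidates])[::-1]]
    chosen = []
    for i in candidates:
        # Keep a candidate only if it lies outside every chosen peak's zone
        if all(abs(i - j) > half_win for j in chosen):
            chosen.append(int(i))
        if len(chosen) == top_n:
            break
    return sorted(chosen)

# Toy arc: two full sine cycles over 200 "sentences"
toy_x = np.linspace(0, 4 * np.pi, 200)
toy_arc = np.sin(toy_x)
toy_peaks = top_n_peaks(toy_arc, win_per=5, top_n=3)
print(toy_peaks)
```

Valleys can be found the same way by passing the negated series, for example `top_n_peaks(-toy_arc)`.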

In [None]:
Crux_Window_Percent = 5 #@param {type:"slider", min:1, max:20, step:1}
Baseline_SMA_Model = "SentiWord" #@param ["SentimentR", "SyuzhetR", "Bing", "SenticNet", "SentiWord", "NRC", "AFINN", "VADER", "TextBlob", "Flair", "Pattern", "Stanza"]
Anomaly_Detection = False #@param {type:"boolean"}
Vertical_Labels = True #@param {type:"boolean"}
Vertical_Labels_Height = -0.1 #@param {type:"slider", min:-50, max:50, step:0.1}
Save_to_Report = False #@param {type:"boolean"}

# Map the display name chosen in the form to the model's column-name prefix
model_name_dt = {
  'SentimentR': 'sentimentr',
  'SyuzhetR': 'syuzhet',
  'Bing': 'bing',
  'SenticNet': 'senticnet',
  'SentiWord': 'sentiword',
  'NRC': 'nrc',
  'AFINN': 'afinn',
  'VADER': 'vader',
  'TextBlob': 'textblob',
  'Flair': 'flair',
  'Pattern': 'pattern',
  'Stanza': 'stanza',
}
model_selected = model_name_dt[Baseline_SMA_Model]

if not Anomaly_Detection:
  # (a) Use Sentence SMA smoothed Sentiment models to detect Crux Points
  model_selected_fullname = f'{model_selected}_stdscaler_{roll_str}'
else:
  # (b) Use Sentence Raw Sentiment models to detect outliers
  model_selected_fullname = model_selected


# TODO: enable multiple overlay crux points with underlying mean/median arc
corpus_models_selected_ls = [model_selected_fullname]

# Warning: requires definitions of: x, section_sents_df
#          so Baseline models must be run first

for amodel in corpus_models_selected_ls:
  corpus_cruxes_all_dt[amodel] = get_crux_points(ts_df=corpus_sents_df, 
                                         col_series=[amodel], 
                                         text_type='sentence', 
                                         win_per=Crux_Window_Percent, 
                                         sec_y_labels=Vertical_Labels,
                                         sec_y_height=Vertical_Labels_Height, 
                                         subtitle_str=f'{Crux_Window_Percent}% Crux ',  # label tracks the slider value
                                         do_plot=True, 
                                         save2file=Save_to_Report);
  
# Crux points of the selected model, used by the report cells below
model_crux_ls = corpus_cruxes_all_dt[model_selected_fullname]
# model_crux_ls;

## **Context around Top-n Crux Peaks/Valleys**

INSTRUCTIONS:

* Select [Get_Peak_Cruxes] to retrieve Peaks (if unchecked, Valleys are retrieved)

* [Get_n_Cruxes] determines how many Top-n Cruxes to retrieve

* Select [Sort_by_SentenceNo] to list Cruxes in Sentence order (if unchecked, they are sorted by Sentiment value)

* Enter [No_Paragraphs_on_Each_Side] to retrieve this many Paragraphs before and after the Paragraph containing your Crux Sentence (e.g. 2 will bring back 5 Paragraphs centered on the Paragraph containing the Crux Sentence)

* Select [Highlight_Sentence] to have the Crux Sentence converted to ALL CAPS for easier identification. The Paragraph containing the Crux Sentence will also be prefaced with a '<*>'.

* Select [Save_to_Report] to also save output to an external *.txt file
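
The interaction between the Peak/Valley choice, the Top-n limit and the sort order can be sketched as follows. This is a minimal illustration that assumes each Crux is a `(sent_no, sentiment_val)` tuple; the actual structure of `model_crux_ls` and the logic inside `crux_sortsents_report` may differ, and `select_cruxes` is a hypothetical helper.

```python
def select_cruxes(cruxes, get_peaks=True, top_n=3, sort_by="sent_no"):
    """Select the top_n strongest peaks or valleys from a list of cruxes.

    cruxes: list of (sent_no, sentiment_val) tuples (assumed format).
    """
    # Peaks: highest values first; valleys: lowest values first
    ranked = sorted(cruxes, key=lambda c: c[1], reverse=get_peaks)
    top = ranked[:top_n]
    if sort_by == "sent_no":
        # Re-order the selected cruxes by their position in the text
        return sorted(top, key=lambda c: c[0])
    return top


# Toy crux list (hypothetical data)
cruxes = [(10, 0.8), (40, -1.2), (75, 1.5), (120, -0.3), (200, 0.1)]

print(select_cruxes(cruxes, get_peaks=True, top_n=2, sort_by="sent_no"))
print(select_cruxes(cruxes, get_peaks=False, top_n=2, sort_by="sentiment_val"))
```

Sorting by Sentence number reads the Cruxes in narrative order, while sorting by Sentiment value puts the most extreme Crux first.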

In [None]:
# Crux Point Details
Get_Peak_Cruxes = False #@param {type:"boolean"}
Get_n_Cruxes = 20 #@param {type:"slider", min:1, max:20, step:1}
Sort_by_SentenceNo = True #@param {type:"boolean"}

# Context Details
No_Paragraphs_on_Each_Side = 5 #@param {type:"slider", min:0, max:5, step:1}
Highlight_Sentence = True #@param {type:"boolean"}
Save_to_Report = False #@param {type:"boolean"}


# Sort the report by position in the text or by Sentiment value
sort_on = 'sent_no' if Sort_by_SentenceNo else 'sentiment_val'


print(f'Crux Report --------------------\n')
print(f'            Corpus: {CORPUS_FULL}')
print(f'            Model: {Baseline_SMA_Model}')
print(f'            Crux Win%: {Crux_Window_Percent}')
print(f'            SMA Win%: {roll_str}')

# Arguments shared by the on-screen report and the captured report
crux_report_kwargs = dict(ts_df=corpus_sents_df,
                          library_type='baseline',
                          top_n=Get_n_Cruxes,
                          get_peaks=Get_Peak_Cruxes,
                          sort_by=sort_on, # sent_no, or abs(polarity)
                          n_sideparags=No_Paragraphs_on_Each_Side,
                          sentence_highlight=Highlight_Sentence)

if not Save_to_Report:
  crux_sortsents_report(model_crux_ls, **crux_report_kwargs)
else:
  # Capture everything the report prints; redirect_stdout restores stdout even if the report raises
  from contextlib import redirect_stdout
  from io import StringIO
  temp_out = StringIO()
  with redirect_stdout(temp_out):
    crux_sortsents_report(model_crux_ls, **crux_report_kwargs)
  print(temp_out.getvalue())
  # TODO: save temp_out.getvalue() to a generated filename


In [None]:
asent_no = 124
corpus_df = corpus_sents_df
asent_raw = str(corpus_df[corpus_df['sent_no'] == int(asent_no)]['sent_raw'].values[0])
asent_raw

## **Zoom in on Context surrounding a particular Crux Point**

INSTRUCTIONS:

* Enter [Crux_Sentence_No] that matches a Crux point/Sentence No you want to explore

* Enter [No_Paragraphs_on_Each_Side] to retrieve this many Paragraphs before and after the Paragraph containing your Crux Sentence (e.g. 2 will bring back 5 paragraphs centered around the Paragraph containing the Crux Sentence)

* Select [Highlight_Crux_Sentence] to have the Crux Sentence converted to ALL CAPS for easier identification. The Paragraph containing the Crux Sentence will be prefaced with a '<*>' as well.

* Select [Save_to_Report] to also save output to an external *.txt file
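
The Paragraph window described above can be sketched with explicit bounds handling, so that a Crux near the start or end of the Corpus returns fewer Paragraphs instead of failing. This is a minimal illustration, assuming the Corpus is a plain list of Paragraph strings; `context_window` is a hypothetical helper and not the notebook's `get_sentnocontext_report`.

```python
def context_window(paragraphs, crux_parag_no, n_side, crux_sentence=None):
    """Return the paragraphs within n_side of crux_parag_no, clamped to the corpus."""
    if not 0 <= crux_parag_no < len(paragraphs):
        raise IndexError(
            f"Paragraph {crux_parag_no} is outside 0..{len(paragraphs) - 1}"
        )
    start = max(0, crux_parag_no - n_side)
    end = min(len(paragraphs), crux_parag_no + n_side + 1)
    window = []
    for i in range(start, end):
        text = paragraphs[i]
        if i == crux_parag_no:
            if crux_sentence:
                # Highlight the crux sentence in ALL CAPS
                text = text.replace(crux_sentence, crux_sentence.upper())
            # Mark the paragraph that contains the crux sentence
            text = "<*> " + text
        window.append(text)
    return window


# Toy corpus (hypothetical data)
paragraphs = [
    "First parag.",
    "Second parag.",
    "Third parag. It was dark.",
    "Fourth parag.",
    "Fifth parag.",
    "Sixth parag.",
]

for parag in context_window(paragraphs, 2, 2, crux_sentence="It was dark."):
    print(parag)
```

With `n_side=2` and a Crux in the middle of the Corpus this returns 5 Paragraphs; at either boundary the window simply shrinks.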

In [None]:
# Select details about the Crux Point Context to Retrieve

# print(f'Last Sentence No: {corpus_sents_df.shape[0]}')
Crux_Sentence_No = 200 #@param {type:"number"}
No_Paragraphs_on_Each_Side = 4 #@param {type:"slider", min:0, max:10, step:1}
Highlight_Crux_Sentence = True #@param {type:"boolean"}
Save_to_Report = False #@param {type:"boolean"}

# Guard against a Sentence No that is not in the Corpus before retrieving its context
if Crux_Sentence_No in corpus_sents_df['sent_no'].values:
  get_sentnocontext_report(ts_df=corpus_sents_df, the_sent_no=Crux_Sentence_No, the_n_sideparags=No_Paragraphs_on_Each_Side, the_sent_highlight=Highlight_Crux_Sentence)
else:
  print(f'ERROR: [Crux_Sentence_No] {Crux_Sentence_No} is not a Sentence No in the Corpus.\n\n       Try again with a different value.')

# **END OF NOTEBOOK**