---
title: "Forecaster Analysis"
author: "Michelle Gelman"
date: "04/28/2025"
format: 
  html:
    code-fold: true
    execute:
      eval: false
      echo: True
      cache: True
jupyter: python3
---

In [34]:
%%capture
import os
import pickle
import pprint as pp
import sys
from collections import defaultdict

# Add the current working directory to the import path before loading local modules
sys.path.append(os.path.abspath("."))

import import_ipynb
import pandas as pd
import torch
from IPython.display import display
from tqdm import tqdm

# ConvoKit imports
from convokit import Corpus, Speaker, TextCleaner, Utterance
from convokit.fighting_words.fightingWords import FightingWords
from convokit.forecaster.CRAFTModel import CRAFTModel
from convokit.forecaster.forecaster import Forecaster

# Local project modules
from modules import CorpusUtils as corp
from modules.DataPreprocesser import DataPreprocesser

# **Down-Sampled Performance Check**

# **Fine-Tuned (KODIS) Performance Check**

### Configuration:
- 80/20/10 train/test/validation split
- "craft-wiki-pretrained" model with the following settings:
  - `dropout`: 0.1
  - `batch_size`: 64
  - `clip`: 50.0
  - `learning_rate`: 1e-05
  - `print_every`: 10
  - `finetune_epochs`: 30
  - `validation_size`: 0.2

# **Fighting Words Analysis**

### How to read:
- Y-axis: z-score of each n-gram. The magnitude (distance from 0) measures how strongly the n-gram is associated with one class over the other.
- X-axis: frequency of occurrence across both classes combined. Points further right are n-grams that occur more often overall.
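To make the y-axis concrete, here is a minimal sketch of the z-scored log-odds ratio with a Dirichlet prior (Monroe et al., "Fightin' Words"), the statistic that ConvoKit's `FightingWords` is based on. The function name `log_odds_z_scores`, the uniform prior of 0.1, and the toy token lists are assumptions made for illustration; this is not the library's code.

```python
import math
from collections import Counter


def log_odds_z_scores(counts1, counts2, prior=0.1):
    """Z-scored log-odds ratio per n-gram; positive values favour class 1."""
    vocab = set(counts1) | set(counts2)
    a0 = prior * len(vocab)          # total prior mass (uniform prior, an assumption)
    n1 = sum(counts1.values())       # total n-gram count in class 1
    n2 = sum(counts2.values())       # total n-gram count in class 2
    z_scores = {}
    for ngram in vocab:
        y1 = counts1.get(ngram, 0)
        y2 = counts2.get(ngram, 0)
        # Smoothed log-odds of the n-gram within each class
        log_odds1 = math.log((y1 + prior) / (n1 + a0 - y1 - prior))
        log_odds2 = math.log((y2 + prior) / (n2 + a0 - y2 - prior))
        # Approximate variance of the log-odds difference
        variance = 1.0 / (y1 + prior) + 1.0 / (y2 + prior)
        z_scores[ngram] = (log_odds1 - log_odds2) / math.sqrt(variance)
    return z_scores


# Toy unigram counts, invented for illustration only
unsuccessful = Counter("walk away walk away no deal".split())
successful = Counter("thank you deal thank you".split())
scores = log_odds_z_scores(unsuccessful, successful)

for ngram, z in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{ngram:>6}  {z:+.3f}")
```

N-grams used only by the first class get positive scores, n-grams used only by the second class get negative scores, and an n-gram shared by both ("deal" here) stays close to 0.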



- May need to pre-process the corpus to tag utterances containing the "submit agreement box" as "submission", so that they can be excluded from the analysis.
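A possible shape for that pre-processing step, sketched with plain dictionaries standing in for ConvoKit utterances. The marker string `"submit agreement"`, the `is_submission` flag, and the sample texts are hypothetical; the real marker has to be checked against the KODIS utterances.

```python
# Hypothetical marker text; verify against the actual KODIS data before use
SUBMIT_MARKER = "submit agreement"


def tag_submissions(utterances, marker=SUBMIT_MARKER):
    """Return copies of the utterance dicts with an 'is_submission' flag added."""
    tagged = []
    for utt in utterances:
        is_submission = marker in utt["text"].lower()
        tagged.append({**utt, "is_submission": is_submission})
    return tagged


# Invented sample utterances, for illustration only
utterances = [
    {"id": "u1", "text": "I will walk away unless I get a refund."},
    {"id": "u2", "text": "Submit agreement: buyer gets full refund."},
]

tagged = tag_submissions(utterances)

# Keep only the utterances that should enter the n-gram analysis
for_analysis = [utt for utt in tagged if not utt["is_submission"]]
print([utt["id"] for utt in for_analysis])
```

With a real corpus, the same flag could be stored in each utterance's `meta` dictionary and used to filter before fitting.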

### KODIS Corpus

In [38]:
%%capture
TextCleaner(verbosity=50000).transform(kodis_corp_wiki)
fw = FightingWords(obj_type='conversation')
fw.fit(
    corpus=kodis_corp_wiki, 
    class1_func=lambda conv: conv.meta.get('label') == 1, 
    class2_func=lambda conv: conv.meta.get('label') == 0
)

In [None]:
# Z-scores for every n-gram: positive values lean towards class 1 ('Unsuccessful'),
# negative values towards class 2 ('Successful').
df_z = fw.get_ngram_zscores(class1_name='Unsuccessful', class2_name='Successful')

# Sort by z-score and turn the n-gram index into a regular column.
df = (
    df_z.sort_values('z-score', ascending=False)
        .reset_index()
        .rename(columns={'index': 'ngram'})
)

# Split into one DataFrame per class, strongest markers first.
df_unsucc = (
    df[df['class'] == 'Unsuccessful']
      .loc[:, ['ngram', 'z-score']]
      .sort_values('z-score', ascending=False)   # largest positive first
      .reset_index(drop=True)
)
df_succ = (
    df[df['class'] == 'Successful']
      .loc[:, ['ngram', 'z-score']]
      .sort_values('z-score', ascending=True)    # most negative first
      .reset_index(drop=True)
)

# Show the top 20 n-grams of each class side by side.
combined = pd.concat(
    [df_unsucc.head(20), df_succ.head(20)],
    axis=1,
    keys=['Unsuccessful', 'Successful']
)
print("\n=== Side-by-Side Comparison ===")
display(combined)



=== Side-by-Side Comparison ===


| | Unsuccessful ngram | Unsuccessful z-score | Successful ngram | Successful z-score |
|---|---|---|---|---|
| 0 | away | 25.414886 | thank | -12.585395 |
| 1 | walk | 20.197809 | thank you | -12.553772 |
| 2 | walk away | 19.796939 | gets full refund | -10.209181 |
| 3 | will not | 10.190836 | gets full | -10.209181 |
| 4 | proof | 8.773605 | buyer gets full | -10.209181 |
| 5 | reject | 8.66537 | review seller did | -9.939158 |
| 6 | apologize reject | 8.58433 | and buyer did | -9.398857 |
| 7 | apologize reject deal | 8.58433 | full refund seller | -9.111432 |
| 8 | reject deal | 8.523136 | yes | -8.330688 |
| 9 | didn apologize reject | 8.234699 | partial | -8.18213 |


### KODIS with TextCleaner

In [39]:
# Z-scores for every n-gram: positive values lean towards class 1 ('Unsuccessful'),
# negative values towards class 2 ('Successful').
df_z = fw.get_ngram_zscores(class1_name='Unsuccessful', class2_name='Successful')

# Sort by z-score and turn the n-gram index into a regular column.
df = (
    df_z.sort_values('z-score', ascending=False)
        .reset_index()
        .rename(columns={'index': 'ngram'})
)

# Split into one DataFrame per class, strongest markers first.
df_unsucc = (
    df[df['class'] == 'Unsuccessful']
      .loc[:, ['ngram', 'z-score']]
      .sort_values('z-score', ascending=False)   # largest positive first
      .reset_index(drop=True)
)
df_succ = (
    df[df['class'] == 'Successful']
      .loc[:, ['ngram', 'z-score']]
      .sort_values('z-score', ascending=True)    # most negative first
      .reset_index(drop=True)
)

# Show the top 20 n-grams of each class side by side.
combined = pd.concat(
    [df_unsucc.head(20), df_succ.head(20)],
    axis=1,
    keys=['Unsuccessful', 'Successful']
)
print("\n=== Side-by-Side Comparison ===")
display(combined)



=== Side-by-Side Comparison ===


| | Unsuccessful ngram | Unsuccessful z-score | Successful ngram | Successful z-score |
|---|---|---|---|---|
| 0 | away | 25.425381 | thank | -12.57323 |
| 1 | walk | 20.203544 | thank you | -12.541661 |
| 2 | walk away | 19.802453 | gets full refund | -10.198932 |
| 3 | will not | 10.203604 | buyer gets full | -10.198932 |
| 4 | proof | 8.783694 | gets full | -10.198932 |
| 5 | reject | 8.677553 | review seller did | -9.928268 |
| 6 | apologize reject deal | 8.596404 | and buyer did | -9.389413 |
| 7 | apologize reject | 8.596404 | full refund seller | -9.102135 |
| 8 | reject deal | 8.535226 | yes | -8.319438 |
| 9 | didn apologize reject | 8.244375 | partial | -8.167295 |


### Wiki Corpus

In [35]:
%%capture

TextCleaner(verbosity=50000).transform(test_corp_wiki)
fw_wiki = FightingWords(obj_type='conversation')
fw_wiki.fit(
    corpus=test_corp_wiki, 
    class1_func=lambda conv: conv.meta.get('conversation_has_personal_attack') == 1, 
    class2_func=lambda conv: conv.meta.get('conversation_has_personal_attack') == 0
)

In [32]:
# Class 1 ('Unsuccessful') = conversations with a personal attack,
# class 2 ('Successful') = conversations without one.
df_z = fw_wiki.get_ngram_zscores(class1_name='Unsuccessful', class2_name='Successful')

# Sort by z-score and turn the n-gram index into a regular column.
df = (
    df_z.sort_values('z-score', ascending=False)
        .reset_index()
        .rename(columns={'index': 'ngram'})
)

# Split into one DataFrame per class, strongest markers first.
df_unsucc = (
    df[df['class'] == 'Unsuccessful']
      .loc[:, ['ngram', 'z-score']]
      .sort_values('z-score', ascending=False)   # largest positive first
      .reset_index(drop=True)
)
df_succ = (
    df[df['class'] == 'Successful']
      .loc[:, ['ngram', 'z-score']]
      .sort_values('z-score', ascending=True)    # most negative first
      .reset_index(drop=True)
)

# Show the top 20 n-grams of each class side by side.
combined = pd.concat(
    [df_unsucc.head(20), df_succ.head(20)],
    axis=1,
    keys=['Unsuccessful', 'Successful']
)
print("\n=== Side-by-Side Comparison ===")
display(combined)



=== Side-by-Side Comparison ===


| | Unsuccessful ngram | Unsuccessful z-score | Successful ngram | Successful z-score |
|---|---|---|---|---|
| 0 | you are | 16.014538 | mentality | -8.207016 |
| 1 | you have | 10.923482 | greeks | -7.79778 |
| 2 | you re | 10.505819 | greek | -7.741245 |
| 3 | stop | 10.078668 | italian | -7.64815 |
| 4 | me | 9.790722 | albania | -7.578888 |
| 5 | stupid | 9.167015 | bulgarians | -7.382297 |
| 6 | are you | 8.402324 | url | -7.382251 |
| 7 | that you | 8.367206 | campaign | -7.379276 |
| 8 | crap | 8.256748 | the war | -7.022344 |
| 9 | yourself | 7.987503 | ottoman | -6.906538 |


### Wiki with TextCleaner

In [37]:
# Class 1 ('Unsuccessful') = conversations with a personal attack,
# class 2 ('Successful') = conversations without one.
df_z = fw_wiki.get_ngram_zscores(class1_name='Unsuccessful', class2_name='Successful')

# Sort by z-score and turn the n-gram index into a regular column.
df = (
    df_z.sort_values('z-score', ascending=False)
        .reset_index()
        .rename(columns={'index': 'ngram'})
)

# Split into one DataFrame per class, strongest markers first.
df_unsucc = (
    df[df['class'] == 'Unsuccessful']
      .loc[:, ['ngram', 'z-score']]
      .sort_values('z-score', ascending=False)   # largest positive first
      .reset_index(drop=True)
)
df_succ = (
    df[df['class'] == 'Successful']
      .loc[:, ['ngram', 'z-score']]
      .sort_values('z-score', ascending=True)    # most negative first
      .reset_index(drop=True)
)

# Show the top 20 n-grams of each class side by side.
combined = pd.concat(
    [df_unsucc.head(20), df_succ.head(20)],
    axis=1,
    keys=['Unsuccessful', 'Successful']
)
print("\n=== Side-by-Side Comparison ===")
display(combined)



=== Side-by-Side Comparison ===


| | Unsuccessful ngram | Unsuccessful z-score | Successful ngram | Successful z-score |
|---|---|---|---|---|
| 0 | you are | 16.054688 | mentality | -8.199172 |
| 1 | you have | 10.987531 | greeks | -7.781885 |
| 2 | you re | 10.538478 | page number | -7.738768 |
| 3 | stop | 10.104394 | greek | -7.718951 |
| 4 | me | 9.845657 | italian | -7.633363 |
| 5 | stupid | 9.171591 | albania | -7.571467 |
| 6 | are you | 8.424154 | bulgarians | -7.374899 |
| 7 | that you | 8.380463 | campaign | -7.366674 |
| 8 | crap | 8.261644 | url | -7.34913 |
| 9 | yourself | 8.006851 | number number | -7.254474 |
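To see what changed between the two Wiki runs, the top-10 lists can be diffed directly. The sketch below uses the "Successful" n-grams copied from the two Wiki tables; the variable names are made up for illustration.

```python
# 'Successful' top-10 n-grams copied from the first Wiki table
top10_first = [
    "mentality", "greeks", "greek", "italian", "albania",
    "bulgarians", "url", "campaign", "the war", "ottoman",
]
# 'Successful' top-10 n-grams copied from the second Wiki table
top10_second = [
    "mentality", "greeks", "page number", "greek", "italian",
    "albania", "bulgarians", "campaign", "url", "number number",
]

# N-grams that entered or left the top 10 between the two runs
added = [ngram for ngram in top10_second if ngram not in top10_first]
dropped = [ngram for ngram in top10_first if ngram not in top10_second]

print("entered top 10:", added)
print("left top 10:   ", dropped)
```

The two entries that enter the list, "page number" and "number number", look like placeholder tokens rather than topical vocabulary, presumably introduced by the cleaning step; that reading is a guess and has not been checked against the cleaned text.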


In [36]:
| | Unsuccessful ngram | Unsuccessful z-score | Successful ngram | Successful z-score |
|---|---|---|---|---|
| 0 | you are | 16.014538 | mentality | -8.207016 |
| 1 | you have | 10.923482 | greeks | -7.797780 |
| 2 | you re | 10.505819 | greek | -7.741245 |
| 3 | stop | 10.078668 | italian | -7.648150 |
| 4 | me | 9.790722 | albania | -7.578888 |
| 5 | stupid | 9.167015 | bulgarians | -7.382297 |
| 6 | are you | 8.402324 | url | -7.382251 |
| 7 | that you | 8.367206 | campaign | -7.379276 |
| 8 | crap | 8.256748 | the war | -7.022344 |
| 9 | yourself | 7.987503 | ottoman | -6.906538 |
| 10 | why | 7.673333 | the italian | -6.463210 |
| 11 | and you | 7.532976 | western | -6.397741 |
| 12 | if you | 7.441177 | tribes | -6.243172 |
| 13 | re | 7.413716 | the current | -6.214181 |
| 14 | hell | 7.000701 | awards | -6.212305 |
| 15 | don you | 6.796298 | journal | -6.122374 |
| 16 | you don | 6.747247 | 2015 | -5.976605 |
| 17 | ridiculous | 6.584792 | japan | -5.921183 |
| 18 | nonsense | 6.386323 | star | -5.857119 |
| 19 | please | 6.373570 | sea | -5.829312 |


### Model Imports from saved runs

In [None]:

# Corpora saved from earlier runs
SAVED_CORPORA = "/Users/mishkin/Desktop/Research/Convo_Kit/ConvoKit_Disputes/data/saved_corpora"

kodis_wiki = f"{SAVED_CORPORA}/KODIS_wiki_corpus_results"
kodis_cmv = f"{SAVED_CORPORA}/KODIS_cmv_corpus_results"
wiki_test = f"{SAVED_CORPORA}/wiki_corpus_test_results"
cmv_test = f"{SAVED_CORPORA}/cmv_corpus_test_results"

kodis_corp_wiki = Corpus(kodis_wiki)
kodis_corp_cmv = Corpus(kodis_cmv)


In [30]:
test_corp_wiki = Corpus(wiki_test)
test_corp_cmv = Corpus(cmv_test)