# Classification performance of the non-answers function

The data and code here produce the in- and out-of-sample classification performance statistics reported in [Non-answers during Conference Calls](https://doi.org/10.1111/1475-679X.12371).

We import the `non_answers` function from the `ling_features` package, which we have made available on PyPI [here](https://pypi.org/project/ling-features/).
You can use `pip install ling_features` to install this on your system.
We also use the Natural Language Toolkit (`nltk`) to *tokenize* text into sentences before applying our `non_answers` function.
Finally, we use Pandas here for its core functionality with data frames.

In [1]:
from ling_features import non_answers, get_regexes_df
from nltk import sent_tokenize
import pandas as pd

Our **gold standard** contains information on whether a response or set of responses contained a non-answer and, if so, which categories of non-answers were present (i.e., "unable", "refuse", and "after-call"). 
Note that the data are aggregated across several values of `speaker_number` to account for the possibility that more than one speaker might address a given question or that there might be back-and-forth between a given analyst and management. 
The values in `answer_nums` reflect this aggregation of `speaker_number` values (see data for `gold_standard_text` below).

In [2]:
gold_standard = pd.read_csv("gold_standard.csv")
gold_standard

Unnamed: 0,file_name,section,answer_nums,obs_type,is_unable,is_refuse,is_after_call,is_nonans
0,5070597_T,1,"{23,25}",train,False,False,False,False
1,1313830_T,1,"{13,14,15,17}",train,False,True,False,True
2,3307581_T,1,{32},test,False,False,False,False
3,1338716_T,1,"{17,19,21}",test,False,False,False,False
4,2012178_T,1,"{131,133}",test,False,False,False,False
...,...,...,...,...,...,...,...,...
1791,2088527_T,1,"{75,77,79,80}",train,False,False,False,False
1792,3224390_T,1,{7},train,False,False,False,False
1793,3420595_T,1,{17},test,False,False,False,False
1794,1564480_T,1,"{118,120,121}",train,False,False,False,False


The underlying text associated with the gold standard is found in `gold_standard_text.csv`. As you can see, each observation has a `speaker_number` value, which indicates the row in the underlying data from which it comes, and a value for `answer_nums`, which reflects the aggregation of a response discussed above.
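The `answer_nums` values appear in the CSV as strings such as `"{23,25}"`. The following is a minimal sketch of how such a string could be turned into a list of `speaker_number` values, assuming the brace-and-comma format shown above holds for every row. The helper `parse_answer_nums` is hypothetical and is not part of `ling_features`.

```python
def parse_answer_nums(value):
    """Convert a string such as "{13,14,15,17}" into a sorted list of ints.

    Assumes the brace-and-comma format seen in the gold standard output.
    """
    inner = value.strip().strip("{}")
    if not inner:
        return []
    return sorted(int(part) for part in inner.split(","))


parsed = parse_answer_nums("{13,14,15,17}")
print(parsed)
```

Each integer in the result should correspond to one `speaker_number` row in `gold_standard_text` for the same `file_name` and `section`.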

These data are derived from XML files provided by StreetEvents using code available [here](https://github.com/iangow/se_core).
The column `response_to_analyst` indicates whether the utterance in `speaker_text` was made in response to a question from an *analyst*. (This column is `True` in all but two cases, which are utterances made by management during the Q&A portion of their respective calls, but not clearly in response to a question from an analyst.)

In [3]:
gold_standard_text = pd.read_csv("gold_standard_text.csv")
gold_standard_text

Unnamed: 0,file_name,section,context,answer_nums,speaker_number,speaker_text,response_to_analyst
0,5070597_T,1,qa,"{23,25}",23,No. Not more than what we have guided to at th...,True
1,5070597_T,1,qa,"{23,25}",25,(multiple speakers). We have kind of got our l...,True
2,1313830_T,1,qa,"{13,14,15,17}",13,"Okay, Energy Services do you want to answer Pa...",True
3,1313830_T,1,qa,"{13,14,15,17}",14,"Yes, on energy what we can tell you is that we...",True
4,1313830_T,1,qa,"{13,14,15,17}",15,Just -- if I may just add one comment on that....,True
...,...,...,...,...,...,...,...
3473,3420595_T,1,qa,{17},17,"Well, we're making decisions on -- economic de...",True
3474,1564480_T,1,qa,"{118,120,121}",118,"Sorry, what should come down?",True
3475,1564480_T,1,qa,"{118,120,121}",120,"Oh, yes, no, no, I think what you'll see is we...",True
3476,1564480_T,1,qa,"{118,120,121}",121,"Andre, I think it's fair to say that there wil...",True


The following function applies the `non_answers` function from the `ling_features` package to each utterance and then aggregates the indicator for a non-answer across values of `('file_name', 'section', 'answer_nums')` so as to line up with the data in `gold_standard`.

In [4]:
regexes = get_regexes_df()

def get_nonans_calc(df):
    
    def get_regex_ids(data):
        if data:
            regex_ids = [ eval(item)['regex_id'] for item in data]
            return [regexes['category'][i] for i in regex_ids]
        else:
            return None
        
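    def is_non_answer(types):
        # True only if at least one detected category is a non-answer type.
        # (A set is never None, so test for a non-empty intersection instead.)
        if types:
            return len(set(['REFUSE', 'UNABLE', 'AFTERCALL']).intersection(types)) > 0
        else:
            return False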
  
    
    df['non_answers'] = df['speaker_text'].apply(sent_tokenize).map(non_answers)
    df['non_answer_types'] = df['non_answers'].map(get_regex_ids)
    df['is_nonans_calc'] =  df['non_answer_types'].map(is_non_answer)
    
    return df[['file_name', 'section', 'answer_nums', 'is_nonans_calc']]. \
                groupby(by = ['file_name', 'section', 'answer_nums']).any().copy()
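The aggregation step can be illustrated on its own. This is a small sketch using made-up utterance-level flags (the values in `toy` are illustrative and do not come from the gold standard): a response group is flagged as a non-answer if any of its utterances is flagged.

```python
import pandas as pd

# Toy utterance-level flags (made-up values, not from the gold standard).
toy = pd.DataFrame({
    "file_name": ["A_T", "A_T", "B_T"],
    "section": [1, 1, 1],
    "answer_nums": ["{1,2}", "{1,2}", "{5}"],
    "is_nonans_calc": [False, True, False],
})

keys = ["file_name", "section", "answer_nums"]

# A group is True if at least one of its utterances is True.
toy_agg = toy.groupby(by=keys)["is_nonans_calc"].any().reset_index()
print(toy_agg)
```

Here the two utterances for `A_T` collapse into a single row flagged `True`, while the single utterance for `B_T` stays `False`.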

In [5]:
df_nonans_calc = get_nonans_calc(gold_standard_text)
df_nonans_calc

file_name,section,answer_nums,is_nonans_calc
1000202_T,1,{26},True
1002829_T,1,"{124,126}",False
1003362_T,1,{29},False
1006561_T,1,{24},False
1008369_T,1,{64},False
...,...,...,...
995280_T,1,{48},False
995295_T,1,"{114,115,116}",False
997566_T,1,"{78,79}",False
998196_T,1,"{24,26,28}",False


In [6]:
df = gold_standard.merge(df_nonans_calc, on = ['file_name', 'section', 'answer_nums'])
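Because `merge` performs an inner join by default, any gold-standard row without a computed counterpart would be dropped silently. The sketch below, on made-up data, shows one way to check for this with the `how`, `validate`, and `indicator` arguments of `DataFrame.merge`. The frames `gold_toy` and `calc_toy` are illustrative only, and this check is not part of the original analysis.

```python
import pandas as pd

# Made-up stand-ins for gold_standard and the computed indicators.
gold_toy = pd.DataFrame({
    "file_name": ["A_T", "B_T", "C_T"],
    "section": [1, 1, 1],
    "answer_nums": ["{1,2}", "{5}", "{9}"],
    "is_nonans": [True, False, False],
})
calc_toy = pd.DataFrame({
    "file_name": ["A_T", "B_T"],
    "section": [1, 1],
    "answer_nums": ["{1,2}", "{5}"],
    "is_nonans_calc": [True, False],
})

keys = ["file_name", "section", "answer_nums"]

# Keep every gold-standard row, require unique keys on both sides,
# and record where each row came from in the "_merge" column.
checked = gold_toy.merge(calc_toy, on=keys, how="left",
                         validate="one_to_one", indicator=True)
unmatched = checked[checked["_merge"] == "left_only"]
print(len(unmatched))
```

An empty `unmatched` frame would indicate that every gold-standard observation has a computed value.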

In [7]:
def print_stats(df):
    tn = sum((df['is_nonans'] == df['is_nonans_calc']) & ~df['is_nonans_calc'])
    fp = sum((df['is_nonans'] != df['is_nonans_calc']) & df['is_nonans_calc'])
    fn = sum((df['is_nonans'] != df['is_nonans_calc']) & ~df['is_nonans_calc'])
    tp = sum((df['is_nonans'] == df['is_nonans_calc']) & df['is_nonans_calc'])
    print("Accuracy {:.2f}%".format( 100 * (tp + tn)/(tp + tn + fp + fn)))
    print("Precision {:.2f}%".format( 100 * tp/(tp + fp)))
    print("True positive rate {:.2f}%".format( 100 * tp/(tp + fn)))
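To make the three statistics concrete, here is a small sketch that computes them from explicit confusion-matrix counts on made-up labels. The function `classification_stats` is a hypothetical helper for illustration; it assumes at least one actual positive and one predicted positive, so that no denominator is zero.

```python
import pandas as pd


def classification_stats(actual, predicted):
    """Return accuracy, precision, and true positive rate as fractions."""
    actual = pd.Series(actual, dtype=bool)
    predicted = pd.Series(predicted, dtype=bool)

    tp = int((actual & predicted).sum())    # flagged and truly a non-answer
    tn = int((~actual & ~predicted).sum())  # not flagged and not a non-answer
    fp = int((~actual & predicted).sum())   # flagged but not a non-answer
    fn = int((actual & ~predicted).sum())   # missed non-answer

    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp),
        "true_positive_rate": tp / (tp + fn),
    }


# Made-up labels: 1 true positive, 1 false negative,
# 1 false positive, 2 true negatives.
stats = classification_stats(
    actual=[True, True, False, False, False],
    predicted=[True, False, True, False, False],
)
print(stats)
```

With these toy labels, accuracy is 3/5, and precision and the true positive rate are both 1/2.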

In [8]:
print_stats(df[df['obs_type']=='test'])

Accuracy 89.20%
Precision 58.95%
True positive rate 78.87%


In [9]:
print_stats(df[df['obs_type']=='train'])

Accuracy 91.05%
Precision 68.38%
True positive rate 82.78%


Note that in our paper, we omitted all responses that were not responses to *analysts*. This affected two observations that were in our gold standard.

In [10]:
gold_standard_text_alt = gold_standard_text[gold_standard_text['response_to_analyst']].copy()
                   
df_nonans_calc = get_nonans_calc(gold_standard_text_alt)
df = gold_standard.merge(df_nonans_calc, on = ['file_name', 'section', 'answer_nums'])

This has no impact on our `test` sample.

In [11]:
print_stats(df[df['obs_type']=='test'])

Accuracy 89.20%
Precision 58.95%
True positive rate 78.87%


But omission of these data points does affect the in-sample (`train`) performance of our classifier.
The statistics reported in the paper are these more conservative values.

In [12]:
print_stats(df[df['obs_type']=='train'])

Accuracy 90.90%
Precision 68.13%
True positive rate 81.82%
