## Basic Text Similarities

This document focuses on a single debate and computes the similarity between each side's speeches and that side's for/against main points. The aim is to check whether similarity scores correlate with win/lose results.

Steps:
1. Choose a single debate to be analyzed
2. Fetch main points and debate speeches from both sides
3. Compare both sides' similarity to the main points

This approach can be extended to all debates.

### 1. Fetch Single Debate

A debate is selected by its unique debate ID (d + date), which is also the index of the metadata.

In [326]:
import pandas as pd
import ast
meta = pd.read_csv('../Meta Data/metadata_appended_main_points.csv', index_col='id')
result = pd.read_csv('../results_data/final_live.csv', index_col='id')

In [327]:
# Default debate ID (also the index into meta); any other debate ID can be used
debate_id = 'd20180514'
debate = meta.loc[debate_id]

In [328]:
# This prints a nicely formatted single debate info from meta
print('Debate ID: ', debate_id, '\nTopic: ', debate['title'])

# Also constructs a list of main points for both sides
fmains = []
print('\nFOR main points:')
for i, fmain in enumerate(ast.literal_eval(debate['For_Main_Points']), start=1):
    fmains.append(fmain.lower())
    print(i, ')', fmain.lower())
    
amains = []
print('\nAGAINST main points:')
for i, amain in enumerate(ast.literal_eval(debate['against_Main_Points']), start=1):
    amains.append(amain.lower())
    print(i, ')', amain.lower())

Debate ID:  d20180514 
Topic:  Automation Will Crash Democracy

FOR main points:
1 ) the “us versus them” populism sweeping the western world today is fueled by technological advancement: as low- and middle-skilled workers continue to lose jobs to automation, anger will manifest, leaving many concerned that democracy is no longer working in their favor.
2 ) the promise of high-paying jobs in the era of automation is a pipe dream. many who lose their jobs won’t have access to the training needed for the sophisticated jobs of the future. this will further widen wealth inequality and exacerbate the divide between globalization’s winners and losers.
3 ) anti-democratic leaders promising to bring back jobs from immigrants and robots will continue to get elected over status quo candidates, further eroding democratic institutions and empowering the rise of authoritarian societies.

AGAINST main points:
1 ) automation won’t mean the end of work, just as the advent of steam power, electricity, 

In [329]:
print('WINNER: ', result.loc[debate_id]['winner'])

WINNER:  for


### 2. Speech Text Cleaning on Single Debate

Fetch the speeches from both sides of a single debate

In [330]:
# Fetch the debate scripts
scripts = pd.read_csv('../For Against Scripts/for_against_scripts_' + debate_id + '.csv')

# Construct lists of the speeches by both sides. 
flist = [s for s in scripts.loc[scripts['side'] == 'for']['script']]
print('FOR side speeches: ', len(flist))
alist = [s for s in scripts.loc[scripts['side'] == 'against']['script']]
print('AGAINST side speeches: ', len(alist))

FOR side speeches:  32
AGAINST side speeches:  31


**Cleaning:** Tokenize each speech into words and sentences, then remove punctuation and stopwords. Refer to the cleaning scripts.

In [331]:
from nltk import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
import string

# This returns a de-duplicated list of cleaned words from the given speech string
def cleanwords(speech, debug=False):
    # strip punctuation, then tokenize into words
    words = word_tokenize(speech.translate(str.maketrans('', '', string.punctuation)))
    # remove stopwords (common words that carry little meaning)
    stop_words = set(stopwords.words('english'))
    cleaned_words = [w.lower() for w in words if w.lower() not in stop_words]
    # remove duplicate words (note: set() does not preserve word order)
    cleaned_words = list(set(cleaned_words))
    if debug:
        print('Number of cleaned words:', len(cleaned_words), ', removed words:', len(words) - len(cleaned_words))
    return cleaned_words

# This returns a list of cleaned sentences from the given speech string
def cleansents(speech, debug=False):
    # split into sentences, then strip punctuation and lowercase each one
    sentences = sent_tokenize(speech)
    sentences = [s.translate(str.maketrans('', '', string.punctuation)).lower() for s in sentences]
    if debug:
        print('Number of cleaned sentences:', len(sentences))
    return sentences

# This returns list of cleanwords and cleansents from list of speeches
def clean(speeches):
    cwlst = []
    cslst = []
    for s in speeches:
        cwlst.append(cleanwords(s))
        cslst.append(cleansents(s))
    return cwlst, cslst

# This returns cleanwords and cleansents for the main points, joined into one string
def mclean(mains):
    speech = ""
    for m in mains:
        speech += m + ' '
    cwords = cleanwords(speech)
    csents = cleansents(speech)
    return cwords, csents
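
To make the effect of the word-cleaning step concrete, here is a minimal standard-library sketch. It is not the notebook's `cleanwords`: it assumes a tiny hypothetical stopword set (`TOY_STOPWORDS`) and plain whitespace splitting in place of NLTK's stopword list and `word_tokenize`, so its output will differ on real speeches.

```python
import string

# Hypothetical, deliberately tiny stopword set for illustration only;
# the notebook uses NLTK's English stopword list instead.
TOY_STOPWORDS = {"the", "is", "a", "of", "and", "to"}

def toy_cleanwords(speech):
    # Strip ASCII punctuation, lowercase, and split on whitespace
    stripped = speech.translate(str.maketrans('', '', string.punctuation))
    words = stripped.lower().split()
    # Drop stopwords and duplicates; sort so the result is deterministic
    return sorted({w for w in words if w not in TOY_STOPWORDS})

print(toy_cleanwords("Automation is the end of work, and the end of democracy!"))
# ['automation', 'democracy', 'end', 'work']
```

Note that `string.punctuation` covers ASCII punctuation only, so typographic quotes such as the ’ in "won’t" (which appear in the main points) are not stripped by this step.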

In [332]:
# The matrix of cleaned words/sentences of each speech from both sides
fcwlst, fcslst = clean(flist)
acwlst, acslst = clean(alist)

In [333]:
# The cleaned sentences and words from main points as one single speech
mfcwords, mfcsents = mclean(fmains)
macwords, macsents = mclean(amains)

### 3. Similarities

This section tries several similarity measures for comparing speeches with main points.

#### 3.1 Jaccard Similarity
A naive measure based on word overlap: the size of the intersection of two word sets divided by the size of their union.

In [334]:
# This computes the Jaccard similarity (intersection over union); inputs are two lists of words
def jaccard(query, document):
    intersection = set(query).intersection(set(document))
    union = set(query).union(set(document))
    return len(intersection)/len(union)

# This averages the jaccard similarities between each speech and the main point
#    - cwlst: clean word matrix
#    - mcwords: cleaned word list of main point of a side
def jaccard_avg(cwlst, mcwords):
    s = 0
    for cw in cwlst:
        jac = jaccard(cw, mcwords)
        s += jac
    return s / len(cwlst)
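
As a small worked example of the measure, the sketch below applies the same intersection-over-union formula to two toy word lists. The word lists are made up for illustration, and the guard for two empty inputs is an added assumption that the notebook's `jaccard` does not have.

```python
def jaccard(query, document):
    q, d = set(query), set(document)
    union = q | d
    # Assumption for this sketch: two empty inputs score 0.0 instead of
    # raising ZeroDivisionError
    return len(q & d) / len(union) if union else 0.0

# Toy word lists, not taken from the debate data
speech_words = ["automation", "jobs", "democracy", "workers"]
main_point_words = ["automation", "jobs", "inequality"]

# intersection = {automation, jobs} (2 words); union has 5 distinct words
print(jaccard(speech_words, main_point_words))  # 0.4
```

Because the union grows with the longer input, the score can never exceed the ratio of the smaller set's size to the larger set's size. Long speeches compared against a short list of main-point words will therefore produce small values.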

In [335]:
# Most basic prediction
fsim = jaccard_avg(fcwlst, mfcwords)
asim = jaccard_avg(acwlst, macwords)
if fsim > asim:
    print('PREDICT: for', fsim - asim)
elif fsim == asim:
    print('PREDICT: undecided')
else:
    print('PREDICT: against', fsim - asim)
print('ACTUAL: ', result.loc[debate_id]['winner'])

PREDICT: for 0.0005937974807516677
ACTUAL:  for


**Note**: after manually trying several debates, the similarity scores predicted the winner a little over 50% of the time (e.g. d20171003, d20191112, d20180514).

Todo: quantitatively compute the accuracy across all debates.
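
One possible way to approach this todo is to collect the predicted winner for every debate and compare it against the recorded winner. The sketch below assumes predictions and results are available as plain dictionaries keyed by debate ID; the function name `prediction_accuracy` and the placeholder labels are hypothetical and are not results from this notebook.

```python
def prediction_accuracy(predicted, actual):
    """Fraction of debates whose predicted winner matches the recorded winner."""
    # Only score debates that appear in both dictionaries
    shared = [d for d in predicted if d in actual]
    if not shared:
        return 0.0
    hits = sum(1 for d in shared if predicted[d] == actual[d])
    return hits / len(shared)

# Made-up placeholder labels for illustration, not real debate outcomes
predicted = {"debate_a": "for", "debate_b": "against", "debate_c": "for", "debate_d": "against"}
actual = {"debate_a": "for", "debate_b": "for", "debate_c": "for", "debate_d": "against"}

print(prediction_accuracy(predicted, actual))  # 0.75
```

With two possible winners, an accuracy near 0.5 is what random guessing would give, so the measured value is best read relative to that baseline.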

#### 3.2 Count Vectorizer Method
Represents each sentence as a vector of word counts and compares sentences by the cosine similarity of those vectors.

In [336]:
# packages
from sklearn.feature_extraction.text import CountVectorizer
from scipy.spatial import distance

# Refer: adsieg's github: cosine_distance_countvectorizer_method
def cos_cv(s1, s2, debug=False):
    # put both sentences in one list so they share a vocabulary
    allsentences = [s1, s2]
    # convert text to word-count vectors
    vectorizer = CountVectorizer()
    vectors = vectorizer.fit_transform(allsentences).toarray()
    text_to_vector_v1 = vectors[0].tolist()
    text_to_vector_v2 = vectors[1].tolist()
    # cosine distance between the two vectors; similarity is 1 - distance
    cosine = distance.cosine(text_to_vector_v1, text_to_vector_v2)
    if debug:
        print('Similarity of the two sentences is', round((1-cosine)*100, 2), '%')
    return 1 - cosine

# This averages the count-vectorizer cosine similarities across all sentences
# For each speech sentence, take the maximum similarity over the main-point sentences
#    - cslst: matrix of the speeches' sentences
#    - mcsents: the sentences of the main points
def cosine_countvectorizer_avg(cslst, mcsents):
    s = 0
    sent_num = 0
    for speech in cslst:
        for sent in speech:
            simscores = [cos_cv(sent, mpoint) for mpoint in mcsents]
            s += max(simscores)
        sent_num += len(speech)
    return s / sent_num        
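
To show what the count-vector cosine computes, here is a pure-Python sketch of the same idea. It swaps in `collections.Counter` and whitespace splitting for scikit-learn's `CountVectorizer`, so tokenization details differ (for instance, `CountVectorizer` by default lowercases text and ignores single-character tokens). The names `cosine_counts` and `avg_best_match` are hypothetical, and returning 0.0 for an empty sentence is an assumption of this sketch.

```python
import math
from collections import Counter

def cosine_counts(s1, s2):
    # Word-count vectors stored sparsely as Counters
    c1, c2 = Counter(s1.split()), Counter(s2.split())
    dot = sum(c1[w] * c2[w] for w in c1.keys() & c2.keys())
    norm1 = math.sqrt(sum(v * v for v in c1.values()))
    norm2 = math.sqrt(sum(v * v for v in c2.values()))
    # Assumption: a sentence with no words has similarity 0.0
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)

def avg_best_match(speech_sents, main_sents):
    # For each speech sentence keep its best-matching main-point sentence,
    # then average those best scores
    best = [max(cosine_counts(s, m) for m in main_sents) for s in speech_sents]
    return sum(best) / len(best)

# Toy sentences: 3 shared words, each vector has norm 2, so 3 / (2 * 2)
print(cosine_counts("automation will cost jobs", "automation will create jobs"))  # 0.75
```

Unlike the Jaccard score, this measure keeps repeated words and works per sentence, and taking the maximum over main-point sentences lets each speech sentence be credited to whichever main point it addresses most closely.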

In [337]:
# Most basic prediction
fcoscv = cosine_countvectorizer_avg(fcslst, mfcsents)
acoscv = cosine_countvectorizer_avg(acslst, macsents)
if fcoscv > acoscv:
    print('PREDICT: for', fcoscv - acoscv)
elif fcoscv == acoscv:
    print('PREDICT: undecided')
else:
    print('PREDICT: against', fcoscv - acoscv)
print('ACTUAL: ', result.loc[debate_id]['winner'])

PREDICT: for 0.04698983646117311
ACTUAL:  for


**Note**: after manually trying several debates, this method appeared to work much better than the naive Jaccard similarity.

Todo: quantitatively compute the accuracy across all debates.