## Basic Text Similarities

This notebook focuses on a single debate and computes the similarity between each side's speeches and that side's for/against main points. The aim is to see whether the similarity scores correlate with the win/lose result.

Steps:
1. Choose a single debate to be analyzed
2. Fetch main points and debate speeches from both sides
3. Compare each side's similarity to its main points

This approach can be extended to all debates.

### 1. Fetch Single Debate

The debate is selected by its unique debate ID (`d` + date), which serves as the index of the metadata table.
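
The ID used in this notebook, `d20171003`, appears to be the letter `d` followed by the debate date as `YYYYMMDD`. Assuming that pattern holds for every row, a small helper can recover the date. The name `debate_date` is hypothetical and not part of the project:

```python
from datetime import date, datetime

def debate_date(debate_id: str) -> date:
    """Parse an ID assumed to look like 'd' + YYYYMMDD."""
    if not debate_id.startswith('d'):
        raise ValueError(f'unexpected debate id: {debate_id!r}')
    return datetime.strptime(debate_id[1:], '%Y%m%d').date()

print(debate_date('d20171003'))  # 2017-10-03
```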

In [241]:
import pandas as pd
import ast
meta = pd.read_csv('../Meta Data/metadata_appended_main_points.csv', index_col='id')
result = pd.read_csv('../results_data/final_live.csv', index_col='id')

In [242]:
# Default debate ID (also the index value); change it to analyze a different debate
debate_id = 'd20171003'
debate = meta.loc[debate_id]

In [243]:
# This prints a nicely formatted single debate info from meta
print('Debate ID: ', debate_id, '\nTopic: ', debate['title'])

# Also constructs a list of main points for both sides
fmains = []
print('\nFOR main points:')
for i, fmain in enumerate(ast.literal_eval(debate['For_Main_Points']), start=1):
    fmains.append(fmain.lower())
    print(i, ')', fmain.lower())
    
amains = []
print('\nAGAINST main points:')
for i, amain in enumerate(ast.literal_eval(debate['against_Main_Points']), start=1):
    amains.append(amain.lower())
    print(i, ')', amain.lower())

Debate ID:  d20171003 
Topic:  Western Democracy Is Threatening Suicide

FOR main points:
1 ) xenophobia, racism, and nationalism are on the rise. from support of far-right candidates in france and brexit in europe to the rise of donald trump in the u.s., people around the world are embracing policies and attitudes that are inconsistent with liberal democracy.
2 ) the liberal world order is losing ground. long a beacon of democracy around the world, the united states is turning its back on global institutions and leaving room for alternative powers, such as china and russia, to seize influence.
3 ) fed up with the economic challenges of globalism and dismayed by the power of the political elite, westerners are embracing social change over political stability and – increasingly – considering alternatives to elected democratic leadership.
4 ) with his executive orders on immigration, attacks on the free press, condemnation of court decisions, and firing of james comey, president donald t
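
The loop above relies on `ast.literal_eval` because each main-points cell is read from the CSV as a string. Assuming the cell holds a Python list literal, the parsing step can be sketched as follows. The sample string is made up for illustration and is not taken from the metadata file:

```python
import ast

# Hypothetical cell value, shaped like a stringified list of main points
cell = "['First point.', 'Second point.']"

points = ast.literal_eval(cell)        # parses literals only, never runs code
lowered = [p.lower() for p in points]  # same lowercasing as in the loop

print(type(points).__name__, len(points))
print(lowered)
```

Unlike `eval`, `ast.literal_eval` rejects anything that is not a plain literal, which makes it a reasonable choice for text loaded from a file.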

In [244]:
print('WINNER: ', result.loc[debate_id]['winner'])

WINNER:  for


### 2. Speech Text Cleaning on Single Debate

Fetch the speeches from both sides of the selected debate.

In [245]:
# Fetch the debate scripts
scripts = pd.read_csv('../For Against Scripts/for_against_scripts_' + debate_id + '.csv')

# Construct lists of the speeches by both sides.
flist = scripts.loc[scripts['side'] == 'for', 'script'].tolist()
print('FOR side speeches: ', len(flist))
alist = scripts.loc[scripts['side'] == 'against', 'script'].tolist()
print('AGAINST side speeches: ', len(alist))

FOR side speeches:  30
AGAINST side speeches:  30


**Cleaning:** Tokenize all words and sentences, then remove punctuation and stopwords. Refer to the cleaning scripts.

In [246]:
from nltk import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
import string

# This returns a deduplicated list of cleaned words from the given speech string
def cleanwords(speech, debug=False): 
    # strip punctuation, then tokenize into words
    words = word_tokenize(speech.translate(str.maketrans('', '', string.punctuation)))
    # remove stopwords (common words that carry little meaning)
    stop_words = set(stopwords.words('english'))
    cleaned_words = [w.lower() for w in words if w.lower() not in stop_words]
    # remove duplicate words (the original word order is not preserved)
    cleaned_words = list(set(cleaned_words))
    if debug:
        print('Number of cleaned words:', len(cleaned_words), ', removed words:', len(words) - len(cleaned_words))
    return cleaned_words

# This returns a list of lowercased, punctuation-free sentences from the given speech string
def cleansents(speech, debug=False):  
    # split into sentences first, then strip punctuation and lowercase each one
    sentences = sent_tokenize(speech)
    sentences = [s.translate(str.maketrans('', '', string.punctuation)).lower() for s in sentences]
    if debug:
        print('Number of cleaned sentences:', len(sentences))
    return sentences

# This returns the cleaned words and cleaned sentences for each speech in a list of speeches
def clean(speeches):
    cwlst = []
    cslst = []
    for s in speeches:
        cwlst.append(cleanwords(s))
        cslst.append(cleansents(s))
    return cwlst, cslst

# This returns cleanwords and cleansents for the main points, treated as one speech
def mclean(mains):
    # join all main points into a single string
    speech = ' '.join(mains)
    cwords = cleanwords(speech)
    csents = cleansents(speech)
    return cwords, csents
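
To make the word-cleaning steps concrete, here is a minimal standalone sketch that avoids the NLTK downloads. It is an approximation, not the notebook's implementation: `simple_cleanwords` and the tiny `STOP_WORDS` set are hypothetical stand-ins, and whitespace splitting replaces `word_tokenize`.

```python
import string

# Tiny stand-in stopword set (assumption); the notebook uses NLTK's English list
STOP_WORDS = {'the', 'is', 'a', 'of', 'and'}

def simple_cleanwords(speech):
    """Strip punctuation, lowercase, drop stopwords, and deduplicate."""
    no_punct = speech.translate(str.maketrans('', '', string.punctuation))
    words = no_punct.lower().split()  # whitespace split instead of word_tokenize
    # sorted() only makes the output order reproducible for this demo
    return sorted({w for w in words if w not in STOP_WORDS})

print(simple_cleanwords("The order is losing ground, and the order is fragile."))
# ['fragile', 'ground', 'losing', 'order']
```

Because duplicates are dropped, word frequencies are lost. That suits a set-based measure such as Jaccard, but a frequency-based measure would need the undeduplicated tokens.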

In [247]:
# The matrix of cleaned words/sentences of each speech from both sides
fcwlst, fcslst = clean(flist)
acwlst, acslst = clean(alist)

In [248]:
# The cleaned sentences and words from main points as one single speech
mfcwords, mfcsents = mclean(fmains)
macwords, macsents = mclean(amains)

### 3. Similarities

This section tries multiple similarity measures to compare the speeches with the main points.

#### 3.1 Jaccard Similarity
A naive implementation that compares word sets only: the size of the intersection divided by the size of the union.

In [249]:
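# This computes the Jaccard similarity (intersection over union); inputs are two lists of words
def jaccard(query, document):
    intersection = set(query).intersection(set(document))
    union = set(query).union(set(document))
    # two empty word lists give an empty union; return 0 instead of dividing by zero
    if not union:
        return 0.0
    return len(intersection)/len(union)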

# This averages the jaccard similarities between each speech and the main point
#    - cwlst: clean word matrix
#    - mcwords: cleaned word list of main point of a side
def jaccard_avg(cwlst, mcwords):
    s = 0
    for cw in cwlst:
        jac = jaccard(cw, mcwords)
        s += jac
    return s / len(cwlst)
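
As a worked example of the measure, the sketch below restates `jaccard` in standalone form and applies it to two short word lists. The lists are invented for illustration and are not drawn from the debate data.

```python
def jaccard(query, document):
    """Return |A ∩ B| / |A ∪ B| for two word lists (0.0 if both are empty)."""
    a, b = set(query), set(document)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

speech_words = ['democracy', 'liberal', 'order', 'trade']
main_words = ['liberal', 'order', 'china']

# Shared words: {'liberal', 'order'} -> 2; the union has 5 distinct words
print(jaccard(speech_words, main_words))  # 0.4
```

The score ranges from 0 (no shared words) to 1 (identical word sets). A long speech compared with a short list of main points tends to score low because the union is dominated by the speech's vocabulary.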

In [250]:
# Most basic prediction: the side whose speeches are more similar to its own main points wins
fsim = jaccard_avg(fcwlst, mfcwords)
asim = jaccard_avg(acwlst, macwords)
if fsim > asim:
    print('PREDICT: for', fsim - asim)
else:
    print('PREDICT: against', fsim - asim)
print('ACTUAL: ', result.loc[debate_id]['winner'])

PREDICT: for 0.006642677729184979
ACTUAL:  for
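
One way the single-debate comparison could be extended to many debates is sketched below. This is only an outline under stated assumptions: `predict_winner` and `accuracy` are hypothetical helpers, and the scores and winners are made-up numbers, not results from the dataset.

```python
def predict_winner(fsim, asim):
    """Predict the side whose speeches are closer to its own main points."""
    return 'for' if fsim > asim else 'against'

def accuracy(scores, winners):
    """scores: {debate_id: (fsim, asim)}; winners: {debate_id: 'for' or 'against'}."""
    hits = sum(predict_winner(*scores[d]) == winners[d] for d in scores)
    return hits / len(scores)

# Made-up numbers for illustration only, not results from the dataset
toy_scores = {'d1': (0.05, 0.04), 'd2': (0.03, 0.06), 'd3': (0.07, 0.02)}
toy_winners = {'d1': 'for', 'd2': 'for', 'd3': 'for'}

print(accuracy(toy_scores, toy_winners))
```

In practice the `(fsim, asim)` pairs would come from running `jaccard_avg` on each debate, and a single correct prediction with a margin as small as the one above says little until it is checked across many debates.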
