## Basic Text Similarities

This document focuses on single debate and computes similarities between debate text and for/against main points. We aim to find the correlation of similarity scores and win/lose results. 

Steps:
1. Choose a single debate to be analyzed
2. Fetch main points and debate speeches from both sides
3. Compare both sides's similarity towards the main points

This approach can be extended to all debates.

### 1. Fetch Single Debate

Debate is selected with its unique debate ID (d+date) and index in the meta data. 

In [95]:
import pandas as pd
import ast
meta = pd.read_csv('../Meta Data/metadata_appended_main_points.csv', index_col='id')

In [98]:
#default debate id and index, can be a different one
debate_id = 'd20191112'
debate = meta.loc[debate_id]

In [170]:
# This prints a nicely formatted single debate info from meta
print('Debate ID: ', debate_id, '\nTopic: ', debate['title'])

# Also constructs a list of main points for both sides
fmains = []
print('\nFOR main points:')
for i, fmain in enumerate(ast.literal_eval(debate['For_Main_Points']), start=1):
    fmains.append(fmain.lower())
    print(i, ')', fmain.lower())
    
amains = []
print('\nAGAINST main points:')
for i, amain in enumerate(ast.literal_eval(debate['For_Main_Points']), start=1):
    amains.append(amain.lower())
    print(i, ')', amain.lower())

Debate ID:  d20191112 
Topic:  Capitalism Is a Blessing

FOR main points:
1 ) by promoting market competition and rewarding innovation, capitalism has lifted billions out of poverty and ensured consumers have access to cutting-edge ideas and products while hard-working and industrious workers reap the rewards of their efforts.
2 ) under the capitalist system, workers from every social and economic backgroundhave the opportunity tocompete in the free market and to have their talents recognized. this promotes social equality and provides for upward social mobility.
3 ) capitalism provides workers and consumers alike the freedom to choose where to earn and spend their money, contributing to an overall society where the free exchange of goods and ideas is paramount.

AGAINST main points:
1 ) by promoting market competition and rewarding innovation, capitalism has lifted billions out of poverty and ensured consumers have access to cutting-edge ideas and products while hard-working and indus

### 2. Speech Text Cleaning on Single Debate

Fetch the speeches from both sides of a single debate

In [152]:
# Fetch the debate scripts
scripts = pd.read_csv('../For Against Scripts/for_against_scripts_' + debate_id + '.csv')

# Construct lists of the speeches by both sides. 
flist = [s for s in scripts.loc[scripts['side'] == 'for']['script']]
print('FOR side speeches: ', len(flist))
alist = [s for s in scripts.loc[scripts['side'] == 'against']['script']]
print('AGAINST side speeches: ', len(alist))

FOR side speeches:  20
AGAINST side speeches:  20


**Cleaning:** I'd tokenzie all words and sentences, remove puntuations and stopwords. Refer to the cleaning scripts.

In [160]:
from nltk import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
import string

# This returns cleaned list of words from given speech string
def cleanwords(speech, debug=False): 
    # tokenize words without punctuations
    words = word_tokenize(speech.translate(str.maketrans('', '', string.punctuation)))
    # remove meaninglesss words
    stop_words = set(stopwords.words('english'))
    cleaned_words = [w.lower() for w in words if w.lower() not in stop_words]
    # remove duplicate words
    cleaned_words = list(set(cleaned_words))
    if debug:
        print('Number of cleaned words:', len(cleaned_words), ', removed words:', len(words) - len(cleaned_words))
    return cleaned_words

# This returns cleaned list of sentences from given speech string
def cleansents(speech, debug=False):  
    # tokenized sentences without punctuations
    sentences = sent_tokenize(speech)
    sentences = [s.translate(str.maketrans('', '', string.punctuation)).lower() for s in sentences]
    if debug:
        print('Number of cleaned sentences:', len(sentences))
    return sentences

# This returns list of cleanwords and cleansents from list of speeches
def clean(speeches):
    cwlst = []
    cslst = []
    for s in speeches:
        cwlst.append(cleanwords(s))
        cslst.append(cleansents(s))
    return cwlst, cslst

In [166]:
fcwlst, fcslst = clean(flist)
acwlst, acslst = clean(alist)

### 3. Similarities

This section tries multiple similarity measures to compare speeches with main points