### Building the SVM
Here we represent the conversations using a Bag-of-Words (BOW) model with a TF-IDF weighting scheme, and then build our SVM Suspicious Conversations Identifier (SCI).

First we read in the training data and labels.

In [5]:
import xml.etree.ElementTree as ET
import csv

train_data_path = '../../data/svm_training_data/'
training_xml = ET.parse(train_data_path + 'training_data.xml')
root = training_xml.getroot()

# Map each conversation id to its label ('0' or '1', kept as strings here)
labels_dict = {}
with open(train_data_path + 'sci_labels.csv', 'r', newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        labels_dict[row[0]] = row[1]
print(labels_dict)

{'e621da5de598c9321a1d505ea95e6a2d': '0', 'a7d90f72e1260762785b92a0c54fc4bb': '0', '2c8125f8376aa2515c19222ba4213c28': '0', '7b8bd13557382d5aa86cf3e3b90acaa5': '0', 'bbf1d909f2a37738a9549ff301144475': '0', 'dd6f7c7ef644abca600d28b5afd8c191': '0', 'b073155c5f4c846b2757543a60826d81': '0', 'fc77b9ed8d13697e8deea79de0b4df23': '1', 'dd853d3f28d51df2fceb0a18bdee3ef5': '0', 'b0772b27b7024430b87d4c7ae0a155b9': '0', '8c3f1c715570a7af14a69cce09a045d6': '0', 'e0fabdc12cce220079d5bc02b0631fea': '1', 'ee225c151ed567db879bcdeae1c6e6a6': '0', 'b31360580980663eff6e622b55d661ce': '0', 'cf06faa793596b21b634412b8e2d7fa7': '0', '1413c070f28ef858f0f4db48d0c3f788': '0', 'a59b522e7dd7f3efe98194bb466ca604': '0', '69bb42f3b299f4a8113ce69b88c569ff': '0', '80401299c6762891c8512bf70afb5c1b': '0', '0a16b8ff4d618adf332cde64b93494b1': '0', 'ab70f34cae07b5e8fcb4e15776547d16': '0', '6617c122ebe945c6cd2e9a4cde1b1695': '0', '0af080e9433ebd6802551f2e6a4c1bee': '0', '853f5a49a905f67c11d36fac9011a181': '0', 'a356cc71d95082
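
The printed dictionary is too long to inspect by eye, so a quick tally of the label values can be a useful sanity check. The sketch below shows one possible way to do this; `example_labels` is a hypothetical stand-in for `labels_dict`, and its ids and counts are made up for illustration.

```python
from collections import Counter

# Hypothetical stand-in for labels_dict (conversation id -> label string).
example_labels = {
    'conv_a': '0',
    'conv_b': '0',
    'conv_c': '1',
    'conv_d': '0',
}

# Tally how many conversations carry each label value.
label_counts = Counter(example_labels.values())
for label, count in sorted(label_counts.items()):
    print(label, count)
```

Running the same tally on the real `labels_dict` would show how unbalanced the two classes are, which matters when choosing an evaluation metric for the classifier.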

Now we can transform the data to represent each conversation as a string.

In [10]:
corpus = [] # each row is a string formed from all messages in a conversation
labels = [] # each row is 0 or 1, corresponds to label for same row in corpus

for conversation in root:
    string = " "
    for message in conversation:
        text = message.find('text').text
        if text is not None:
            string = string + " " + text 
    corpus.append(string)
    labels.append(int(labels_dict[conversation.get('id')]))

print(corpus[0])
print(labels[0])
print(len(corpus))
print(len(labels))
    

  Hola. hi. whats up? not a ton. you? same.  being lazy.  M or f? F. Ditto, I&apos;ve done absolutely nothing with my day besides watching stuff on Hulu. M here.  Just got home from weekend trip.  Tired. Oh, cool. Family thing? yeah. a &amp; l? Gotta love those. 17, Hawaii. and yourself? Uh oh.  older. 30 Been to Hawaii. whoops xD It&apos;s nice, isn&apos;t it? Yeah.  Always enjoy visiting. Which Island you on Oahu? married? i&apos;m assuming since you went on a &apos;family&apos; trip :p yeah. Just found this site a few days ago. Yeah, Oahu. Curious to the whole &quot;random thing&quot; Pretty crazy the individuals you meet, isn&apos;t it? It&apos;s been eye opening for sure. Yeah, I hear you. I&apos;m pretty open to meet/talk to anyone. But pretty clear what most are looking for on this site, it seems. Half the people that strike up a conversation only seem to be interested in more. . risque topics. What more do you expect though, ya know? Yep.  It&apos;s the internet. You seem to ta
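
The sample above still contains escaped character references such as `&apos;`, `&amp;` and `&quot;`. With a default word tokenizer, fragments like `apos` and `quot` would likely end up as vocabulary terms. One optional clean-up, sketched below as an assumption rather than as part of the original pipeline, is to decode them with the standard library's `html.unescape` before vectorizing. The `sample` string is a hypothetical excerpt imitating the output above.

```python
import html

# Hypothetical sample imitating the escaped text seen in the output above.
sample = "I&apos;ve done nothing. a &amp; l? the &quot;random thing&quot;"

# html.unescape converts named and numeric character references
# back into the characters they stand for.
cleaned = html.unescape(sample)
print(cleaned)
```

In the loop that builds `corpus`, the same call could be applied to each message's text before it is appended to the conversation string.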

We will now represent all conversations using BOW with a TF-IDF weighting scheme.

In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print(X.shape)

(14703, 121394)
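
The shape above has one row per conversation and one column per distinct term. To make the weighting itself concrete, here is a minimal pure-Python sketch of TF-IDF on a hypothetical toy corpus. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation, which is what `TfidfVectorizer` uses with its default settings as far as we know. It also uses plain whitespace tokenization, which is simpler than the vectorizer's own tokenizer.

```python
import math
from collections import Counter

# Hypothetical toy corpus; the real corpus is built in the cells above.
toy_corpus = [
    "hello how are you",
    "hello hello again",
    "how was the trip",
]

# Whitespace tokenization (a simplification of TfidfVectorizer's tokenizer).
docs = [doc.split() for doc in toy_corpus]
vocab = sorted({term for doc in docs for term in doc})
n_docs = len(docs)

# Document frequency: number of documents that contain each term.
df = Counter(term for doc in docs for term in set(doc))

# Smoothed inverse document frequency (assumed scikit-learn default).
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}

def tfidf_row(doc):
    """Return the L2-normalised TF-IDF vector for one tokenized document."""
    counts = Counter(doc)
    raw = [counts[term] * idf[term] for term in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

toy_X = [tfidf_row(doc) for doc in docs]
print(len(toy_X), len(vocab))  # rows = documents, columns = distinct terms
```

Terms that occur in fewer documents (such as `again`) receive a larger IDF than terms shared across documents (such as `hello`), so they contribute more to a conversation's vector.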
