### Approach
The task is modelled as binary sentence classification, where the two classes are:
0: Ubuntu content sentence
1: Phatic sentence

#### Dataset modeling for Classification
To create the dataset for class 0 (Ubuntu content sentence), all the messages are split into sentences, and only the sentences 
longer than 7 tokens are kept (the filter in the code uses a strict comparison). Keeping only long sentences increases the probability of having content 
words in the sentence, since phatic sentences generally tend to be shorter.
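
As a minimal sketch of the length filter described above, the snippet below splits toy sentences on whitespace and keeps those with more than 7 tokens, mirroring the strict comparison in `contains_min_tokens`. The example sentences and the helper name `keep_long_sentences` are illustrative assumptions, not part of the dataset or the notebook code.

```python
def keep_long_sentences(sentences, num_tokens=7):
    """Keep sentences with strictly more than num_tokens whitespace tokens."""
    return [s for s in sentences if len(s.split()) > num_tokens]


# Toy sentences (made up for illustration)
sentences = [
    "thanks",
    "hi there",
    "I tried to install the driver but the screen stays black",
]

long_sentences = keep_long_sentences(sentences)
print(long_sentences)
```

Short greetings and acknowledgements fall below the threshold, which is the intended effect of the filter for class 0.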

For class 1 (Phatic sentence), any conversation / dialogue dataset on a topic unrelated to Ubuntu can be used. The assumption is
that any other dialogue data would also contain phatic sentences in addition to its topic-related sentences.

However, a classifier trained to discriminate between these two classes bases its decision on two aspects: 1. the topic of the sentence, i.e. Ubuntu or other, and 2. phatic sentences, which it has only seen
in the class 1 dataset, i.e. in non-Ubuntu conversations. Topic and phatic-ness are therefore confounded in its predictions.

A publicly available dataset of travel-related customer support (RSiCS - https://s3-us-west-2.amazonaws.com/nextit-public/rsics.html) is used to model the class 1 (Phatic sentence) data. For this exercise, "tagged_selections_by_sentence.csv" from the RSiCS dataset is used. It contains manual annotations that segment each dialogue into two kinds of pieces, those that convey the intent and those that do not, in addition to further annotations like Greetings, Rant etc. 

#### Evaluation
The classifier is evaluated on:
1. Validation set: A portion of the dataset created above.
2. Test set: A small mock test set made of the phatic examples provided in the instructions mixed with some non-phatic sentences. 

Accuracy is used as the evaluation metric.
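
Because the two classes are not the same size, it can help to read accuracy next to a majority-class baseline and per-class recall. The sketch below computes all three in plain Python. The helper names and the toy labels are assumptions for illustration; the baseline uses the class sizes printed by the data-loading cell in this notebook (1928 Ubuntu sentences and 6868 RSiCS sentences).

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def per_class_recall(y_true, y_pred):
    """Recall per label: correct predictions / true examples of that label."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}


# Toy labels (made up): 0 = Ubuntu content, 1 = phatic
y_true = [0, 0, 1, 1, 1, 1]
y_pred = [1, 0, 1, 1, 1, 1]

print(accuracy(y_true, y_pred))
print(per_class_recall(y_true, y_pred))

# Majority-class baseline from the class sizes printed in this notebook
n_ubuntu, n_phatic = 1928, 6868
baseline = max(n_ubuntu, n_phatic) / (n_ubuntu + n_phatic)
print(round(baseline * 100, 2))
```

With that class balance, always predicting label 1 would already score roughly 78% on the combined corpus, which gives context for the validation accuracies.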

In [None]:
import numpy as np
import pandas as pd
import os
import pickle


from collections import defaultdict
from nltk import sent_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation as lda
from sklearn import model_selection, naive_bayes, svm
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score


In [2]:
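def get_sentences(x):
    # Split a message into sentences with NLTK's sentence tokenizer
    return sent_tokenize(x)

def contains_min_tokens(x, num_tokens):
    # True if x has strictly more than num_tokens whitespace-separated tokens
    return len(x.split()) > num_tokens

def contains_max_tokens(x, num_tokens):
    # True if x has strictly fewer than num_tokens whitespace-separated tokens
    return len(x.split()) < num_tokens

def expand(series):
    # Flatten a series of lists into a single series of items
    return pd.Series([x for _list in series for x in _list])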

In [3]:
def get_data(data, col_name, init_sample_size, num_tokens=0, is_min=None):
    # Sample init_sample_size non-null messages from col_name
    data = data.dropna(subset=[col_name])
    data_sample = data[col_name]
    data_sample = data_sample.sample(n=init_sample_size).to_frame()

    # Split every sampled message into a list of sentences
    z = data_sample[col_name].apply(get_sentences)
    z = z.to_frame()
    print(z.shape)

    # One row per sentence
    texts = pd.DataFrame({col_name: expand(z[col_name])})
    print(texts.shape)
    # is_min=None: return all sentences without a length filter
    if is_min is None:
        return texts

    # is_min=True keeps long sentences, is_min=False keeps short ones
    if is_min:
        func = contains_min_tokens
    else:
        func = contains_max_tokens

    texts = texts[texts[col_name].apply(lambda x: func(x, num_tokens))]
    print(texts.shape)
    return texts

In [4]:
num_dialogues = 3500

ubuntu_data_df = pd.read_csv("./data/ubuntu_support_extract.csv")    
# Filtering Ubuntu data with the assumption that long sentences tend to 
# contain less phatic and more topical content
ubuntu_data = get_data(ubuntu_data_df, 'text', num_dialogues, num_tokens=7, is_min=True)
ubuntu_data = ubuntu_data['text'].to_list()
    
# Assigning label 0 to Ubuntu related sentences
ubuntu_labels = [0] * len(ubuntu_data)
    
# Phatic labeled dataset obtained from https://s3-us-west-2.amazonaws.com/nextit-public/rsics.html
phatic_data = pd.read_csv("./data/rsics_dataset/RSiCS/tagged_selections_by_sentence.csv")
phatic_data = get_data(phatic_data, 'Selected', num_dialogues)
phatic_data = phatic_data['Selected'].to_list()
    
# Assigning label 1 to the sentences of this dataset
phatic_labels = [1] * len(phatic_data)
        
# Creating a binary classification labeled dataset by combining the two datasets
Corpus = pd.DataFrame({'text': ubuntu_data + phatic_data, 'label': ubuntu_labels + phatic_labels})
print(Corpus.head())
print(Corpus.shape)

(3500, 1)
(4251, 1)
(1928, 1)
(3500, 1)
(6868, 1)
                                                text  label
0  I've been trying to install Virtualbox, but ev...      0
1  i booted from the CD, i changed the bootsequen...      0
2          I don't know how to make this more clear.      0
3  but do you think a mismatched driver could cau...      0
4   is that why you asked me for the chipset number?      0
(8796, 2)


In [5]:
# Preparing feature and label vectors
X = Corpus['text'].tolist()
y = Corpus['label'].tolist()
max_feats = 20000

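# Note: the vectoriser is fitted on the full corpus before the split, so the
# validation sentences also contribute to the vocabulary and IDF weights
vectoriser = TfidfVectorizer(max_features=max_feats).fit(X)
X_train, X_valid, y_train, y_valid = model_selection.train_test_split(
    X, y, test_size=0.1)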

X_train_vec = vectoriser.transform(X_train)
X_valid_vec = vectoriser.transform(X_valid)

test_dataset = pd.read_csv("./data/test_dataset.csv")
X_test = test_dataset['text'].tolist()
y_test = test_dataset['label'].tolist()
X_test_vec = vectoriser.transform(X_test)
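
One way to keep the validation sentences out of the TF-IDF statistics is to wrap the vectoriser and the classifier in a scikit-learn `Pipeline` and fit it on the training split only. This is a sketch of an alternative, not what the notebook does; the toy sentences and variable names are assumptions made for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy training data (made up): 0 = Ubuntu content, 1 = phatic
train_texts = [
    "sudo apt install fails with a dependency error",
    "the kernel driver does not load after the update",
    "hello there",
    "thanks a lot, bye",
]
train_labels = [0, 0, 1, 1]

# fit() learns the vocabulary and IDF weights from the training texts only
model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("nb", MultinomialNB()),
])
model.fit(train_texts, train_labels)

# Held-out texts are only transformed with the already fitted vocabulary
predictions = model.predict(["thanks", "the driver install fails"])
print(predictions)
```

The same pipeline object can then be scored on the validation and test sentences directly, without a separate `transform` step.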

In [6]:
# Training NB classifier
Naive = naive_bayes.MultinomialNB()
Naive.fit(X_train_vec, y_train)

# Predict the labels on validation dataset
valid_predictions_NB = Naive.predict(X_valid_vec)

# Predict the labels on test dataset
test_predictions_NB = Naive.predict(X_test_vec)

print("Naive Bayes Accuracy Score on validation set-> ", accuracy_score(y_valid, valid_predictions_NB) * 100)
print("Naive Bayes Accuracy Score on test set-> ", accuracy_score(y_test, test_predictions_NB) * 100)

Naive Bayes Accuracy Score on validation set->  86.25
Naive Bayes Accuracy Score on test set->  100.0


In [7]:
# Printing some sample predictions on validation dataset
for (sent, pred, real) in zip(X_valid[:20], valid_predictions_NB[:20], y_valid[:20]):
    print(sent, pred, real)

Hi, 1 1
listed on the website is that 1 1
I would start w/ a dual boot setup,  rather than immediately nuking Windows 0 0
i am not given this option. 1 1
In total, there was an increase/unearned profit of almost 30% . 1 1
As for PAL, their airfares are way up, although I have had no bad experience when flying w/ them. 1 1
 I am not rebooting, I am looking for the log 1 0
am a Bresnan customer and, your verify does not match 1 1
 I'd say pastebin the results of attemping to use your password like Jordan asked before we go any further 1 0
I am now landing in McCook at 9:00 p.m. 1 1
( presumably given by them to the site that booked for me) I cannot find a FAQ or email enquiry form to ask anyone directly.. has anyone any suggestions of what I should do next? 1 1
------------------------? 1 1
Thanks 1 1
 s--how can I find it? 1 1
yeah when i run VMs i get to almost 20 out of 24gb used at times, and the host os still runs decently, so that's good 1 0
besides Nautilus there is also Thunar, P

In [8]:
# Printing some sample predictions on test dataset
for (sent, pred, real) in zip(X_test, test_predictions_NB, y_test):
    print(sent, pred, real)

Hi 1 1
Hello 1 1
hi there 1 1
good morning 1 1
good evening 1 1
bye 1 1
adios 1 1
see you later 1 1
good bye 1 1
thanks 1 1
cheers 1 1
thanks you 1 1
please 1 1
ubuntu 0 0
sudo apt get install  0 0


In [9]:
# SVM classifier training
# Note: degree and gamma are ignored by SVC when kernel='linear'
SVM = svm.SVC(C=1.0, kernel='linear', degree=3, gamma='auto')
SVM.fit(X_train_vec, y_train)

# Predict the labels on validation dataset
valid_predictions_SVM = SVM.predict(X_valid_vec)
# Predict the labels on manual test dataset
test_predictions_SVM = SVM.predict(X_test_vec)

print("SVM accuracy on validation dataset-> ", accuracy_score(valid_predictions_SVM, y_valid) * 100)
print("SVM accuracy on test dataset -> ", accuracy_score(test_predictions_SVM, y_test) * 100)

SVM accuracy on validation dataset->  91.93181818181819
SVM accuracy on test dataset ->  100.0


In [20]:
# Pickling and saving the vectoriser and classifier
with open('./models/vectoriser.pkl', 'wb') as f:
    pickle.dump(vectoriser, f)
    
with open('./models/classifier.pkl', 'wb') as f:
    pickle.dump(SVM, f)
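
To reuse the saved artefacts, both pickles can be loaded back and applied in the same order as during training: vectorise, then predict. The sketch below shows this with two hypothetical helpers, `load_pickle` and `predict_phatic`. The round trip at the end uses a plain dictionary and a temporary file only to show that the loader works; it stands in for the real model files. Pickle files should only be loaded from trusted sources.

```python
import os
import pickle
import tempfile


def load_pickle(path):
    """Load and return the object stored in a pickle file."""
    with open(path, "rb") as f:
        return pickle.load(f)


def predict_phatic(sentences, vectoriser, classifier):
    """Vectorise sentences with the fitted vectoriser, then classify them."""
    return classifier.predict(vectoriser.transform(sentences))


# Intended use with the files written above (paths taken from the notebook):
#   vectoriser = load_pickle("./models/vectoriser.pkl")
#   classifier = load_pickle("./models/classifier.pkl")
#   predict_phatic(["thanks"], vectoriser, classifier)

# Stand-in round trip with a temporary file and a plain dictionary
with tempfile.TemporaryDirectory() as tmp_dir:
    tmp_path = os.path.join(tmp_dir, "example.pkl")
    with open(tmp_path, "wb") as f:
        pickle.dump({"label_names": ["ubuntu", "phatic"]}, f)
    restored = load_pickle(tmp_path)

print(restored)
```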

In [10]:
# Printing some sample predictions on validation dataset
for (sent, pred, real) in zip(X_valid[:20], valid_predictions_SVM[:20], y_valid[:20]):
    print(sent, pred, real)

Hi, 1 1
listed on the website is that 1 1
I would start w/ a dual boot setup,  rather than immediately nuking Windows 0 0
i am not given this option. 1 1
In total, there was an increase/unearned profit of almost 30% . 1 1
As for PAL, their airfares are way up, although I have had no bad experience when flying w/ them. 1 1
 I am not rebooting, I am looking for the log 1 0
am a Bresnan customer and, your verify does not match 1 1
 I'd say pastebin the results of attemping to use your password like Jordan asked before we go any further 1 0
I am now landing in McCook at 9:00 p.m. 1 1
( presumably given by them to the site that booked for me) I cannot find a FAQ or email enquiry form to ask anyone directly.. has anyone any suggestions of what I should do next? 1 1
------------------------? 1 1
Thanks 1 1
 s--how can I find it? 1 1
yeah when i run VMs i get to almost 20 out of 24gb used at times, and the host os still runs decently, so that's good 0 0
besides Nautilus there is also Thunar, P

In [11]:
# Printing some sample predictions on test dataset
for (sent, pred, real) in zip(X_test, test_predictions_SVM, y_test):
    print(sent, pred, real)

Hi 1 1
Hello 1 1
hi there 1 1
good morning 1 1
good evening 1 1
bye 1 1
adios 1 1
see you later 1 1
good bye 1 1
thanks 1 1
cheers 1 1
thanks you 1 1
please 1 1
ubuntu 0 0
sudo apt get install  0 0


#### Classifier predictions on the data filtered out from Ubuntu, i.e. sentences shorter than 7 tokens

In [12]:
ubuntu_small_sent_data = get_data(ubuntu_data_df, 'text', 1000, num_tokens=7, is_min=False)

(1000, 1)
(1255, 1)
(596, 1)


In [13]:
# Printing phatic classifier labels on sample from Ubuntu dataset not included in training.
X_small_sent = ubuntu_small_sent_data['text'].tolist()
X_small_sent_vec = vectoriser.transform(X_small_sent)
predictions_SVM = SVM.predict(X_small_sent_vec)
for (sent, pred) in zip(X_small_sent, predictions_SVM):
    print(sent, pred)

no but i could guess :) 1
nvidia? 0
 http://www.webupd8.org/2010/04/best-linux-bittorrent-client.html 0
okay, just figured something out. 1
:D 1
Deleting and creating, yes. 1
Moving, no. 1
Have you considered gparted? 1
http://alauda.sourceforge.net/wikka.php?wakka=HomePage 0
yes 1
ok good. 1
so just type the message verbatim 0
hi 1
 your users name is roo0t ? 0
thanks 1
crontab -e 1
!info mc' 1
hehe 1
yes i can see its 00 1
:P 1
Nope. 1
http://dri.freedesktop.org/wiki/ATIRadeon 1
a hardware driver app? 0
sorry, updated the name 1
Do you get paid to op? 1
please help 1
read the docs. 1
http://a.courreges.free.fr/projets/m...icopier-en.php 0
thats what im using 0
I think you misread 0
alright then 0
he seemed willing to help me 1
press scan button 1
how far did you get? 1
it is not hardware based 0
Do you have an install disk? 0
but... it didn't work 1
hello again all 1
when I forget what it is 1
none. 1
That's not usual where I'm from. 0
it errors invalid parameter 1
Any ideas? 1
yes 1
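
A long list of predictions is easier to read when summarised. As a small sketch, the share of short sentences assigned to each label can be counted as below. `label_shares` is a hypothetical helper, and the input list is a made-up stand-in for `predictions_SVM`.

```python
from collections import Counter


def label_shares(predictions):
    """Return the fraction of predictions assigned to each label."""
    counts = Counter(predictions)
    total = len(predictions)
    return {label: count / total for label, count in sorted(counts.items())}


# Made-up predictions standing in for predictions_SVM
example_predictions = [1, 0, 0, 1, 1, 1, 1, 1]

shares = label_shares(example_predictions)
print(shares)
```

A high share of label 1 among short Ubuntu sentences would be consistent with the assumption that short sentences are mostly phatic, although it does not replace manual inspection of the individual predictions.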