### Approach
The task is modelled as binary classification of sentences, where the two classes are:
- 0: Ubuntu content sentence
- 1: Phatic sentence

#### Dataset modeling for Classification
To create the dataset for class 0 (Ubuntu content sentence), all the messages are split into sentences, and only the
sentences with more than 7 tokens are kept. Keeping only the long sentences increases the probability of having content
words in the sentence, since phatic sentences generally tend to be shorter.

For class 1 (Phatic sentence), any conversation / dialogue dataset on a topic unrelated to Ubuntu can be used. The assumption is
that any other dialogue data would also contain phatic sentences in addition to its own topic-related sentences.

However, a classifier trained to discriminate between these two classes has two limitations:
1. Its decision is partly based on the topic of the sentence (Ubuntu or other) rather than only on whether the sentence is phatic.
2. The only phatic sentences it has seen are those in the class 1 dataset, i.e. from non-Ubuntu conversations.

A publicly available dataset of travel-related customer support (RSiCS, https://s3-us-west-2.amazonaws.com/nextit-public/rsics.html) is used to model the class 1 (Phatic sentence) data. For this exercise, "tagged_selections_by_sentence.csv" from the RSiCS dataset is used. It contains manual annotations that segment each dialogue into two kinds of pieces (those that convey the intent and those that do not), along with further annotations such as Greetings, Rant, etc.

#### Evaluation
The classifier is evaluated on:
1. Validation set: a held-out portion of the dataset created above.
2. Test set: a small mock test set made of the phatic examples provided in the instructions, mixed with some non-phatic sentences.

Accuracy is used as the evaluation metric.
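
Accuracy alone can hide how each class behaves, so a confusion matrix and per-class precision/recall can be a useful complement. The following is a minimal sketch on hypothetical toy labels (not the notebook's data), using scikit-learn's `confusion_matrix` and `classification_report`:

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical toy labels for illustration only: 0 = Ubuntu content, 1 = phatic
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 1, 1, 1, 1, 1, 1, 1]

acc = accuracy_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes, in the order given by `labels`
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])

print("accuracy:", acc)
print(cm)
print(classification_report(y_true, y_pred, target_names=["ubuntu_content", "phatic"]))
```

In this toy case the accuracy is 0.75 even though two of the three class 0 sentences are misclassified, which the confusion matrix makes visible.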

In [56]:
import numpy as np
import pandas as pd
import os

from collections import defaultdict
from nltk import sent_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation as lda
from sklearn import model_selection, naive_bayes, svm
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score


In [57]:
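def get_sentences(x):
    # Split a message into sentences with NLTK's sentence tokenizer
    return sent_tokenize(x)

def contains_min_tokens(x, num_tokens):
    # True if x has strictly more than num_tokens whitespace-separated tokens
    return len(x.split()) > num_tokens

def contains_max_tokens(x, num_tokens):
    # True if x has strictly fewer than num_tokens whitespace-separated tokens
    return len(x.split()) < num_tokens

def expand(series):
    # Flatten a Series of lists into a single Series of items
    return pd.Series([x for _list in series for x in _list])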

In [58]:
def get_data(data, col_name, init_sample_size, num_tokens=0, is_min=None):
    data = data.dropna(subset=[col_name])
    data_sample = data[col_name]
    data_sample = data_sample.sample(n=init_sample_size).to_frame()

    z = data_sample[col_name].apply(lambda x: get_sentences(x))
    z = z.to_frame()
    print(z.shape)

    texts = pd.DataFrame({col_name: expand(z[col_name])})
    print(texts.shape)
    if is_min is None:
        return texts

    if is_min:
        func = contains_min_tokens
    else:
        func = contains_max_tokens

    texts = texts[texts[col_name].apply(lambda x: func(x, num_tokens))]
    print(texts.shape)
    return texts

In [59]:
num_dialogues = 3500

ubuntu_data_df = pd.read_csv("../data/ubuntu_support_extract.csv")    
# Filtering Ubuntu data with the assumption that long sentences tend to
# contain less phatic and more topical content
ubuntu_data = get_data(ubuntu_data_df, 'text', num_dialogues, num_tokens=7, is_min=True)
ubuntu_data = ubuntu_data['text'].to_list()
    
# Assigning label 0 to Ubuntu related sentences
ubuntu_labels = [0] * len(ubuntu_data)
    
# Phatic labeled dataset obtained from https://s3-us-west-2.amazonaws.com/nextit-public/rsics.html
phatic_data = pd.read_csv("../data/rsics_dataset/RSiCS/tagged_selections_by_sentence.csv")
phatic_data = get_data(phatic_data, 'Selected', num_dialogues)
phatic_data = phatic_data['Selected'].to_list()
    
# Assigning label 1 to the sentences of this dataset
phatic_labels = [1] * len(phatic_data)
        
# Creating a binary classification labeled dataset by combining the two datasets
Corpus = pd.DataFrame({'text': ubuntu_data + phatic_data, 'label': ubuntu_labels + phatic_labels})
print(Corpus.head())
print(Corpus.shape)

(3500, 1)
(4270, 1)
(2002, 1)
(3500, 1)
(6825, 1)
                                                text  label
0  i just want to watch a dvd, how do install lib...      0
1  at the very last page, before installing, ther...      0
2      PS the windows data partition is a good idea.      0
3  there is no *need* to update, if you don't wan...      0
4  just do 'logout', then 'logout' again, and rel...      0
(8827, 2)
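
The shapes printed above show that the two classes are not balanced in this run: 2002 Ubuntu sentences against 6825 RSiCS sentences. As a reference point for reading accuracy scores, the sketch below works out what a classifier that always predicts the larger class would score on the full corpus. The counts are copied from the output above and will vary between runs because the dialogues are sampled randomly.

```python
# Sentence counts copied from the printed shapes of this particular run
n_ubuntu = 2002   # class 0 after the length filter
n_phatic = 6825   # class 1 (RSiCS sentences)

total = n_ubuntu + n_phatic

# Accuracy of always predicting the larger class
majority_baseline = max(n_ubuntu, n_phatic) / total

print(total)
print(f"{majority_baseline:.3f}")
```

This gives a baseline of roughly 77%, so validation accuracies are best judged relative to that figure rather than to 50%.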


In [60]:
# Preparing feature and label vectors
X = Corpus['text'].tolist()
y = Corpus['label'].tolist()
max_feats = 20000

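# Note: the vectoriser is fit on the full corpus before the split, so its
# vocabulary and IDF weights also reflect the validation sentences
vectoriser = TfidfVectorizer(max_features=max_feats).fit(X)
X_train, X_valid, y_train, y_valid = model_selection.train_test_split(X, y, test_size=0.1)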

X_train_vec = vectoriser.transform(X_train)
X_valid_vec = vectoriser.transform(X_valid)

test_dataset = pd.read_csv("../data/test_dataset.csv")
X_test = test_dataset['text'].tolist()
y_test = test_dataset['label'].tolist()
X_test_vec = vectoriser.transform(X_test)
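
An alternative arrangement, shown here only as a sketch, is to split first and wrap the vectoriser and classifier in a scikit-learn `Pipeline`, so that the TF-IDF vocabulary is learned from the training split alone. The toy corpus, the `stratify` and `random_state` arguments, and the variable names below are assumptions for illustration and are not part of the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: 0 = Ubuntu content, 1 = phatic
texts = [
    "how do i install the nvidia driver on ubuntu",
    "run sudo apt update in a terminal first",
    "the grub bootloader is missing after the upgrade",
    "mount the partition and edit the fstab file",
    "hi there",
    "thanks a lot",
    "good morning everyone",
    "see you later",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# Split before fitting anything; stratify keeps both classes in each split
X_tr, X_va, y_tr, y_va = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

# The pipeline fits TF-IDF on the training texts only, then the classifier
pipe = make_pipeline(TfidfVectorizer(max_features=20000), MultinomialNB())
pipe.fit(X_tr, y_tr)

preds = pipe.predict(X_va)
vocab = set(pipe.named_steps["tfidfvectorizer"].vocabulary_)
print(len(vocab), list(preds))
```

The same pipeline object can then be applied directly to raw test sentences with `pipe.predict(...)`, without a separate `transform` step.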

In [61]:
# Training NB classifier
Naive = naive_bayes.MultinomialNB()
Naive.fit(X_train_vec, y_train)

# Predict the labels on validation dataset
valid_predictions_NB = Naive.predict(X_valid_vec)

# Predict the labels on test dataset
test_predictions_NB = Naive.predict(X_test_vec)

print("Naive Bayes Accuracy Score on validation set-> ", accuracy_score(valid_predictions_NB, y_valid) * 100)
print("Naive Bayes Accuracy Score on test set-> ", accuracy_score(test_predictions_NB, y_test) * 100)

Naive Bayes Accuracy Score on validation set->  87.08946772366932
Naive Bayes Accuracy Score on test set->  100.0


In [62]:
# Printing some sample predictions on validation dataset
for (sent, pred, real) in zip(X_valid[:20], valid_predictions_NB[:20], y_valid[:20]):
    print(sent, pred, real)

Hi allComments would be welcome please 1 1
I just purchased a ticket on line and didn't see anything about 10% discount..I did however,get a senior discount. 1 1
I'm taking my first trip to Europe in March, and about a month ago flights were priced at $ 817 from to TPA to MAD. 1 1
posted 10/30 and why is this? 1 1
 (hopefully!) 1 1
Any ideas why? 1 1
It cost about $145 altogether and I'll get 1000 miles out of it - not such a great deal in terms of cost per mile, but for a first effort I guess it's not too bad (plus it's a small financial investment if I have to back out at the last minute out for some reason). 1 1
what is the link for the other chat? 1 0
i can get 1 1
When I previously looked to and there was not enough  I don't see this option. 1 1
downgrading is probably not a good idea for this reason. 1 0
I want to have only 1 disk and i gues it will be /dev/sda so system will not work properly after copying all files? 0 0
Trying to decide which airline. 1 1
Since the flight, I ha

In [63]:
# Printing some sample predictions on test dataset
for (sent, pred, real) in zip(X_test, test_predictions_NB, y_test):
    print(sent, pred, real)

Hi 1 1
Hello 1 1
hi there 1 1
good morning 1 1
good evening 1 1
bye 1 1
adios 1 1
see you later 1 1
good bye 1 1
thanks 1 1
cheers 1 1
thanks you 1 1
please 1 1
ubuntu 0 0
sudo apt get install  0 0


In [64]:
# Training SVM classifier with a linear kernel
# Note: degree and gamma are ignored by the linear kernel
SVM = svm.SVC(C=1.0, kernel='linear', degree=3, gamma='auto')
SVM.fit(X_train_vec, y_train)

# Predict the labels on validation dataset
valid_predictions_SVM = SVM.predict(X_valid_vec)
# Predict the labels on manual test dataset
test_predictions_SVM = SVM.predict(X_test_vec)

print("SVM accuracy on validation dataset-> ", accuracy_score(valid_predictions_SVM, y_valid) * 100)
print("SVM accuracy on test dataset -> ", accuracy_score(test_predictions_SVM, y_test) * 100)

SVM accuracy on validation dataset->  92.41223103057757
SVM accuracy on test dataset ->  100.0


In [65]:
# Printing some sample predictions on validation dataset
for (sent, pred, real) in zip(X_valid[:20], valid_predictions_SVM[:20], y_valid[:20]):
    print(sent, pred, real)

Hi allComments would be welcome please 1 1
I just purchased a ticket on line and didn't see anything about 10% discount..I did however,get a senior discount. 1 1
I'm taking my first trip to Europe in March, and about a month ago flights were priced at $ 817 from to TPA to MAD. 1 1
posted 10/30 and why is this? 1 1
 (hopefully!) 1 1
Any ideas why? 1 1
It cost about $145 altogether and I'll get 1000 miles out of it - not such a great deal in terms of cost per mile, but for a first effort I guess it's not too bad (plus it's a small financial investment if I have to back out at the last minute out for some reason). 1 1
what is the link for the other chat? 1 0
i can get 1 1
When I previously looked to and there was not enough  I don't see this option. 1 1
downgrading is probably not a good idea for this reason. 1 0
I want to have only 1 disk and i gues it will be /dev/sda so system will not work properly after copying all files? 0 0
Trying to decide which airline. 1 1
Since the flight, I ha

In [66]:
# Printing some sample predictions on test dataset
for (sent, pred, real) in zip(X_test, test_predictions_SVM, y_test):
    print(sent, pred, real)

Hi 1 1
Hello 1 1
hi there 1 1
good morning 1 1
good evening 1 1
bye 1 1
adios 1 1
see you later 1 1
good bye 1 1
thanks 1 1
cheers 1 1
thanks you 1 1
please 1 1
ubuntu 0 0
sudo apt get install  0 0


#### Classifier predictions on the data filtered out from Ubuntu, i.e. sentences with fewer than 7 tokens

In [67]:
ubuntu_small_sent_data = get_data(ubuntu_data_df, 'text', 1000, num_tokens=7, is_min=False)

(1000, 1)
(1252, 1)
(577, 1)


In [69]:
# Printing phatic classifier labels on sample from Ubuntu dataset not included in training.
X_small_sent = ubuntu_small_sent_data['text'].tolist()
X_small_sent_vec = vectoriser.transform(X_small_sent)
predictions_SVM = SVM.predict(X_small_sent_vec)
for (sent,pred) in zip(X_small_sent,predictions_SVM):
    print(sent,pred)

anybody here use eclipse? 0
I need a little help 1
ctrl+alt+t is rasy too 0
minutes of cpu time? 1
since most distributions have xterm installed 1
nw 1
'latex file.tex' creates file.pdf now 0
how do i change that ? 1
? 1
alright 1
I'm no longer using sshd 1
hows it going? 1
i know right. 0
ok thx for help 1
yes, I'll second that 0
I installed Lucid Lynx from win7. 0
Where did ubuntu get installed? 0
please explain 1
in a terminal type cd /media 0
kaffeine is a media player 0
hi 1
9.10. 0
!help | Eventyret 1
just not common to do 1
or your computer is broken 0
i need some serious help 1
wooo october 1
no p[roblem 1
 what chipset? 1
by wep or wpa 1
Wow - that fixed it 0
how? 1
? 1
I hope you're joking? 0
add the numbers as it is 1
its only another desktop environment 0
do not paste here. 0
use the pastebin 0
I use `cp -a /path/to/source /path/to/dest` 0
what sound? 0
ssh perhaps 1
will multiboot 1
Just choose it in login screen. 0
;) 1
are awesome 1
you talking to ...? 0
or to yourself 1
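
A long list of per-sentence labels like the one above is hard to read at a glance. One possible way to summarise it, sketched below on hypothetical toy data, is to report the fraction of short sentences predicted as phatic together with the signed margin from `decision_function`, which indicates how far each sentence lies from the decision boundary. The training sentences and variable names here are assumptions for illustration only.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Hypothetical toy training data: 0 = Ubuntu content, 1 = phatic
train_texts = [
    "how do i install the nvidia driver on ubuntu",
    "use the pastebin for long terminal output",
    "mount the partition and edit the fstab file",
    "hi there",
    "ok thx for the help",
    "good morning everyone",
]
train_labels = [0, 0, 0, 1, 1, 1]

vec = TfidfVectorizer().fit(train_texts)
clf = SVC(kernel="linear", C=1.0).fit(vec.transform(train_texts), train_labels)

# Short sentences that were not part of training
short_sents = ["hi", "use the pastebin", "ok thx"]
X_short = vec.transform(short_sents)

preds = clf.predict(X_short)
# Signed distance-like score: positive values favour class 1 (phatic)
margins = clf.decision_function(X_short)

phatic_rate = float(np.mean(preds == 1))
print(f"fraction predicted phatic: {phatic_rate:.2f}")
for sent, pred, margin in zip(short_sents, preds, margins):
    print(f"{sent!r}: label={pred}, margin={margin:+.3f}")
```

Sentences with a margin close to zero are the ones where the classifier is least certain, which makes them natural candidates for manual inspection.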