# Basic observation and data summary

### Load data

Each record contains an itemID, itemTitle, condition and price. The itemTitle is the key feature for telling CSA items apart from JWL items, so we load only the titles from the "CSA5k.txt" and "JWL35k.txt" files.

In [5]:
data_CSA = []
data_JWL = []
## Each line is tab-separated; the second field holds the item title.
## No filtering is applied: every line with a title field is kept.
with open("CSA5k.txt") as file_CSA:
    for line in file_CSA:
        title = line.strip("\n").split("\t")[1].decode('utf-8').lower()
        data_CSA.append((title, "CSA"))
with open("JWL35k.txt") as file_JWL:
    for line in file_JWL:
        title = line.strip("\n").split("\t")[1].decode('utf-8').lower()
        data_JWL.append((title, "JWL"))
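
The parsing step above can be sketched as a small helper that also guards against malformed lines. This is only an illustration: the name parse_title and the sample lines are hypothetical, the field order (itemID, itemTitle, condition, price) is assumed from the description, and decoding is left out because the sample lines are already text strings.

```python
def parse_title(line):
    """Return the lower-cased title from one tab-separated line, or None."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 2:
        return None  # malformed line: there is no title column
    title = fields[1].strip().lower()
    return title if title else None  # treat an empty title as missing


# Hypothetical sample lines in the assumed itemID/itemTitle/condition/price order.
sample_lines = [
    "1001\tVintage Key Chain\tUsed\t4.99\n",
    "1002\t\tNew\t12.00\n",  # empty title
    "1003\n",                # missing columns
]
titles = [parse_title(line) for line in sample_lines]
print(titles)
```

Returning None for missing titles lets the caller decide whether to skip such records or count them.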

### Classification model selection 

#### Package used here

In [17]:
from textblob.classifiers import NaiveBayesClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn import linear_model
from sklearn.cross_validation import KFold
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier
import numpy as np


To estimate how the model will perform on unseen data, we split the data into two parts: a training set and a test set. Here I choose a 7:3 ratio.

In [10]:
TRAIN_PROP = 0.7
train = data_CSA[:int(len(data_CSA)*TRAIN_PROP)] + data_JWL[:int(len(data_JWL)*TRAIN_PROP)]
test = data_CSA[int(len(data_CSA)*TRAIN_PROP):] + data_JWL[int(len(data_JWL)*TRAIN_PROP):]
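
As a rough check of what this split produces, the sketch below repeats it on stand-in data. The class sizes of 5,000 and 35,000 are an assumption taken from the file names, and split_by_class is a hypothetical helper, not part of the notebook.

```python
TRAIN_PROP = 0.7


def split_by_class(items, train_prop):
    """Cut one class's list into a leading train part and a trailing test part."""
    cut = int(len(items) * train_prop)
    return items[:cut], items[cut:]


# Stand-in data; the sizes are assumed from the file names CSA5k and JWL35k.
csa = [("csa title %d" % i, "CSA") for i in range(5000)]
jwl = [("jwl title %d" % i, "JWL") for i in range(35000)]

csa_train, csa_test = split_by_class(csa, TRAIN_PROP)
jwl_train, jwl_test = split_by_class(jwl, TRAIN_PROP)
train = csa_train + jwl_train
test = csa_test + jwl_test

print("train: %d, test: %d" % (len(train), len(test)))
print("test CSA: %d, test JWL: %d" % (len(csa_test), len(jwl_test)))
```

Because each class is cut separately, both parts keep the same 1:7 class ratio, and the test list holds all CSA items first, followed by all JWL items.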

The idea is to tokenize each title into words, count how often each token occurs in the two categories, and build a classifier on that information. Tokens that occur very frequently across the corpus are empirically less informative than tokens that occur in only a small fraction of the training titles, so we convert the raw counts to tf-idf weights to reduce their impact.

We can choose between an SVM and a Naive Bayes classifier. The SVM here is a linear one, fitted with SGDClassifier, whose default hinge loss corresponds to a linear SVM. We use 10-fold cross-validation to see which model is better: the training data is split into 10 subsets, and in each round one subset is held out for validation while the other nine are used to fit the model. We compare the prediction error of the two models and choose the better one.
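
To make the tf-idf weighting concrete, here is a minimal pure-Python sketch on three made-up titles. It uses the smoothed form idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization, which is the default behaviour of scikit-learn's TfidfTransformer, but it splits on whitespace only, so it is an approximation of the pipeline rather than a copy of it. The function name tfidf and the titles are hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {token: weight} dict per document, L2-normalized."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of titles that contain each token.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency.
    idf = dict((tok, math.log((1.0 + n_docs) / (1.0 + d)) + 1.0)
               for tok, d in df.items())
    weighted = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = dict((tok, c * idf[tok]) for tok, c in counts.items())
        # Guard against an empty title, whose norm would be zero.
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        weighted.append(dict((tok, w / norm) for tok, w in raw.items()))
    return weighted


titles = ["new silver key chain", "new gold ring", "new silver ring"]  # made up
weights = tfidf(titles)
for title, w in zip(titles, weights):
    ranked = sorted(w.items(), key=lambda kv: -kv[1])
    print("%s -> %s" % (title, ranked))
```

In this toy corpus "new" appears in every title, so it receives the lowest weight in each one, while a token such as "key" that appears in a single title receives the highest.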

In [19]:
np.random.shuffle(train)
kf = KFold(len(train), n_folds=10)
## Despite their names, these lists collect the per-fold accuracy;
## the error rate is computed afterwards as 1 - mean.
SVM_error = []
NB_error = []
train_arr = np.array(train)  ## shape (n, 2): column 0 = title, column 1 = label
for train_index, test_index in kf:
    train_val, test_val = train_arr[train_index], train_arr[test_index]
    X_fit, y_fit = train_val[:, 0], train_val[:, 1]
    X_val, y_val = test_val[:, 0], test_val[:, 1]
    ##NaiveBayes classifier
    text_nb = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('clf', MultinomialNB())])
    text_nb = text_nb.fit(X_fit, y_fit)
    predicted_nb = text_nb.predict(X_val)
    NB_error.append(np.mean(predicted_nb == y_val))
    ##SVM classifier (linear SVM fitted by stochastic gradient descent)
    text_svm = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('clf', SGDClassifier())])
    text_svm = text_svm.fit(X_fit, y_fit)
    predicted_svm = text_svm.predict(X_val)
    SVM_error.append(np.mean(predicted_svm == y_val))
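
The fold construction behind this loop can be sketched without any library. The generator below is a simplified stand-in and not scikit-learn's implementation; the name k_fold_indices and the 25-item toy data are hypothetical.

```python
import random


def k_fold_indices(n_samples, n_folds):
    """Yield (train_idx, val_idx) pairs with contiguous, near-equal folds."""
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        # The first `extra` folds take one additional sample each.
        size = base + (1 if fold < extra else 0)
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


data = list(range(25))  # toy stand-in for the shuffled training titles
random.seed(0)
random.shuffle(data)    # shuffle first so each fold mixes both classes

fold_sizes = []
seen = []
for train_idx, val_idx in k_fold_indices(len(data), 10):
    fold_sizes.append(len(val_idx))
    seen.extend(val_idx)

print(fold_sizes)
```

Every sample lands in exactly one validation fold, so averaging the per-fold scores uses each training title once for validation.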

In [31]:
print 1-np.mean(NB_error)
print 1-np.mean(SVM_error)

0.0303928571429
0.00649675324675


Averaged over the 10 folds, the cross-validation error is about 3.0% for Naive Bayes and about 0.65% for the SVM. The SVM has the lower misclassification error, so it is the better model to choose.

### Classify and test the performance

In [26]:
text_svm = Pipeline([('vect',CountVectorizer()),('tfidf',TfidfTransformer()),('clf',SGDClassifier()),])
text_svm = text_svm.fit([x[0] for x in train],[x[1] for x in train])  ##final model
predicted_svm = text_svm.predict([x[0] for x in test])
test_error = np.mean(predicted_svm == [x[1] for x in test])  ##this is the accuracy; the error rate is 1 - test_error

In [29]:
print 1-test_error

0.00808333333333


We fit the model on the training data and then use the test data as new input to measure how the SVM model performs. The resulting test prediction error is 0.008. Whether this rate is tolerable depends on the application: with 12,000 test items, it means 97 of them are misclassified. If the daily listings are large in scale, a 0.8% misclassification rate might still cause problems. We therefore need to agree on how much error this classifier is allowed to make, and inspect the misclassified items to understand why, and in what pattern, this method misclassifies them.
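
The arithmetic behind that count, and how it would scale, can be sketched as follows. The test size and error rate come from the results above; the daily volume of one million listings is purely hypothetical and only serves to illustrate the scale.

```python
test_size = 12000
error_rate = 0.00808333333333  # 1 - accuracy, as printed above

misclassified = int(round(error_rate * test_size))
print(misclassified)

# Hypothetical daily volume, chosen only to illustrate the scale.
daily_listings = 1000000
expected_daily_errors = int(round(error_rate * daily_listings))
print(expected_daily_errors)
```

At that assumed volume the same error rate would translate into roughly eight thousand misclassified listings per day.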

In [37]:
index = np.where(predicted_svm != [x[1] for x in test])
index

(array([   56,    71,    97,   113,   126,   128,   140,   147,   150,
          162,   163,   192,   218,   226,   239,   246,   252,   270,
          317,   320,   347,   453,   502,   505,   530,   533,   572,
          578,   599,   624,   629,   630,   632,   635,   640,   663,
          694,   768,   810,   825,   855,   866,   869,   880,   898,
          908,   959,   963,   987,  1019,  1046,  1053,  1065,  1067,
         1073,  1082,  1126,  1132,  1173,  1198,  1222,  1226,  1245,
         1272,  1273,  1280,  1325,  1345,  1351,  1359,  1371,  1452,
         1470,  1484,  3007,  3339,  3909,  4584,  4850,  4906,  5452,
         5833,  6486,  6916,  6922,  6993,  7975,  8157,  8546,  8711,
         9720,  9742, 10863, 10991, 11290, 11323, 11399]),)
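
Because the test list is built as the CSA items followed by the JWL items, the indices above can be split by class. The sketch below does that under the assumption that the test set holds 1,500 CSA and 10,500 JWL items (30% of the 5,000 and 35,000 suggested by the file names); the constant names are hypothetical.

```python
# Misclassified test indices, copied from the output above.
misclassified = [
    56, 71, 97, 113, 126, 128, 140, 147, 150,
    162, 163, 192, 218, 226, 239, 246, 252, 270,
    317, 320, 347, 453, 502, 505, 530, 533, 572,
    578, 599, 624, 629, 630, 632, 635, 640, 663,
    694, 768, 810, 825, 855, 866, 869, 880, 898,
    908, 959, 963, 987, 1019, 1046, 1053, 1065, 1067,
    1073, 1082, 1126, 1132, 1173, 1198, 1222, 1226, 1245,
    1272, 1273, 1280, 1325, 1345, 1351, 1359, 1371, 1452,
    1470, 1484, 3007, 3339, 3909, 4584, 4850, 4906, 5452,
    5833, 6486, 6916, 6922, 6993, 7975, 8157, 8546, 8711,
    9720, 9742, 10863, 10991, 11290, 11323, 11399,
]

N_CSA_TEST = 1500    # assumed: 30% of 5,000 CSA titles
N_JWL_TEST = 10500   # assumed: 30% of 35,000 JWL titles

# Indices below N_CSA_TEST point at CSA items, the rest at JWL items.
csa_errors = sum(1 for i in misclassified if i < N_CSA_TEST)
jwl_errors = len(misclassified) - csa_errors

print("CSA: %d errors (%.2f%%)" % (csa_errors, 100.0 * csa_errors / N_CSA_TEST))
print("JWL: %d errors (%.2f%%)" % (jwl_errors, 100.0 * jwl_errors / N_JWL_TEST))
```

Under these assumptions most of the errors fall on the smaller CSA class, which suggests looking at per-class error rates and not only at the overall figure.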

In [38]:
np.array(test)[index]

array([[u'vintage englert trucking co silvertone & goldtone #1 key chain',
        u'CSA'],
       [ u'shoe sneaker shoelace charm decoration i love heart names female k kena',
        u'CSA'],
       [ u'peace hippie boho fair trade ethnic hill tribe nepal handbag pom poms bells (29)',
        u'CSA'],
       [u'thanksgiving hair bow on an alligator clip', u'CSA'],
       [ u'skidlid original motorcycle half helmet in bomber pin up silver / blue xs-2xl',
        u'CSA'],
       [ u'corduroy breton cap with brass buttons ; john lennon fisherman style beatles',
        u'CSA'],
       [u'barely breezies s/2 seamless modern teardrop bras a211857',
        u'CSA'],
       [u'steve madden beasst bronze snake us 7', u'CSA'],
       [ u'big cheetah cheer bow glitter white purple ribbon girls uniform accessories ties',
        u'CSA'],
       [u'authentic vans grey / black', u'CSA'],
       [ u'ganz women\u2019s scarf with silver beads & jewel accents various colors er24452',
        u'CSA'],

Even a quick look shows that some titles are genuinely ambiguous. For example, "new bridal waist sash satin belt bridesmaid wedding evening party dress" contains "dress" but belongs to the JWL category.
Cases like this need further investigation in the future.