## Kaggle Challenge - Yelp Category Classification
    
Team:

* Surendran Subbiah
* Xuancheng Fan

In [13]:
import pandas as pd
import numpy as np
from sklearn import svm
from sklearn import linear_model, model_selection
from sklearn.linear_model import SGDClassifier
from sklearn.decomposition import PCA
from sklearn.model_selection import KFold, cross_val_score
import nltk 
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.corpus import state_union
from nltk.tokenize import  PunktSentenceTokenizer
from string import punctuation
import re
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from sklearn.feature_extraction.text import CountVectorizer



### Load the data ###

In [14]:
data= pd.read_csv("yelp_data_official_training.csv", delimiter="|").dropna()

In [15]:
print (data.shape)
data= data.reindex(np.random.permutation(data.index))

(47999, 3)


### Calculate the word frequency in tf.idf ###

In [16]:
x_train=data["Review Text"][:42000]    # set train data 

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(min_df=1, ngram_range=(1,2))
x_data=vectorizer.fit_transform(x_train)



In [17]:
x_data[0].toarray().shape

(1, 967990)
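
To make the weighting concrete, here is a minimal pure-Python sketch of TF-IDF over unigrams and bigrams. It assumes whitespace tokenisation, the smoothed idf `ln((1 + n) / (1 + df)) + 1`, and L2 row normalisation; the toy reviews and the helper names `ngrams` and `tfidf` are illustrative only and are not part of the notebook.

```python
import math
from collections import Counter

def ngrams(tokens, n_min=1, n_max=2):
    """Return all word n-grams with n_min <= n <= n_max, joined by spaces."""
    out = []
    for n in range(n_min, n_max + 1):
        out.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return out

def tfidf(docs):
    """Toy TF-IDF: raw counts times smoothed idf, L2-normalised per document."""
    counts = [Counter(ngrams(d.lower().split())) for d in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for c in counts for term in c)
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in df}
    rows = []
    for c in counts:
        weights = {t: tf * idf[t] for t, tf in c.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

# Hypothetical toy reviews, not taken from the Yelp data.
toy_reviews = ["great hair salon", "great pizza place", "hair salon downtown"]
rows = tfidf(toy_reviews)
```

In this toy corpus the bigram "great hair" occurs in one review only, so it receives a larger weight than the unigram "great", which occurs in two. Adding bigrams also multiplies the vocabulary size, which is consistent with the 967,990 columns reported above.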

### Get the train, dev data ###

In [26]:
Y_train=data["Category"][:42000]
X_tr=x_data
Y_tr=Y_train
X_test=vectorizer.transform(data["Review Text"][42001:])   # set held-out dev data; starting at 42001 leaves row 42000 in neither split
Y_test=data["Category"][42001:]



### Set parameters space and train the data ###

In [21]:
from sklearn.model_selection import GridSearchCV
clf=SGDClassifier()
parameters={"alpha":np.linspace(0.000001,2)}
gs_clf = GridSearchCV(clf, parameters, n_jobs=-1)
gs_clf=gs_clf.fit(X_tr, Y_tr)
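
A note on the search grid: `np.linspace(0.000001, 2)` produces 50 evenly spaced values, so nearly all candidates for `alpha` are large. The sketch below reimplements the spacing in plain Python to show this and compares it with a log-spaced grid. The log grid (exponents -6 to 0) and the 0.005 threshold are hypothetical choices for illustration, not settings the team reports having tried.

```python
def linspace(start, stop, num=50):
    """Evenly spaced values from start to stop inclusive (num=50 mirrors NumPy's default)."""
    step = (stop - start) / (num - 1)
    return [start + i * step for i in range(num)]

def logspace(exp_start, exp_stop, num):
    """Values 10**e for evenly spaced exponents e from exp_start to exp_stop inclusive."""
    return [10.0 ** e for e in linspace(exp_start, exp_stop, num)]

linear_grid = linspace(0.000001, 2)   # same endpoints as the notebook cell
log_grid = logspace(-6, 0, 7)         # hypothetical alternative: 1e-6, 1e-5, ..., 1

# Count how many candidates probe the weak-regularisation region.
small_linear = [a for a in linear_grid if a < 0.005]
small_log = [a for a in log_grid if a < 0.005]
print(len(linear_grid), len(small_linear), len(small_log))
```

Only the first value of the linear grid falls below 0.005, whereas the log grid places four of its seven candidates there, so a log-spaced search may explore small `alpha` values more thoroughly at a lower cost.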


In [27]:
predicted=gs_clf.predict(X_test)
accuracy= np.mean(predicted==Y_test)

5998


In [28]:
gs_clf.best_score_

0.89561904761904765
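
Note that `best_score_` is the mean cross-validated score on the training rows, not the accuracy on the held-out dev rows. The following is a minimal sketch of how a k-fold mean is formed, using a majority-class baseline in place of the real classifier. The labels, the baseline, and the helper names are hypothetical, and the number of folds is left as a parameter because the default depends on the scikit-learn version.

```python
from collections import Counter

def kfold_indices(n_samples, k):
    """Yield (train_idx, val_idx) for k contiguous folds of near-equal size."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, val
        start += size

def accuracy(predicted, actual):
    """Fraction of positions where the prediction equals the true label."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

def cross_val_accuracy(labels, k):
    """Mean validation accuracy of a majority-class baseline over k folds."""
    scores = []
    for train_idx, val_idx in kfold_indices(len(labels), k):
        # "Fit" on the training fold: pick its most common label.
        guess = Counter(labels[i] for i in train_idx).most_common(1)[0][0]
        val = [labels[i] for i in val_idx]
        scores.append(accuracy([guess] * len(val), val))
    return sum(scores) / k

# Hypothetical toy labels, not taken from the Yelp data.
toy_labels = ["food", "food", "beauty"] * 3
folds = list(kfold_indices(len(toy_labels), 3))
mean_score = cross_val_accuracy(toy_labels, 3)
```

Every row is used for validation exactly once, and the reported score is the average over the folds.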

### Output the file ###

In [234]:
data_kaggle=pd.read_csv("yelp_data_official_test_nocategories.csv",delimiter="|")

In [235]:
data_kaggle.head()
kaggle_text=data_kaggle["Review Text"]
kaggle_processed= vectorizer.transform(kaggle_text)


In [236]:
kaggle_predict=gs_clf.predict(kaggle_processed)

In [237]:
submission=pd.DataFrame({"ID": data_kaggle["ID"]})
submission["Category"]=kaggle_predict
submission["ID"].shape

(12000,)

In [None]:
submission.to_csv("submission.csv",index=False)




### Brief Team Work ###

1. Xuancheng built a test version with a basic structure and a simple algorithm for the first submission.
2. Surendran and Xuancheng each tried different methods.
3. Surendran's version initially gave the better result.
4. Building on Surendran's version, Xuancheng tested different features and settings and obtained the final result.


### Members' efforts (beyond the team work) ###

#### Surendran Subbiah ####

I originally tried preprocessing the review text to extract key noun phrases, lemmatize them, and generate keyphrase bigrams, which I passed into the TfidfVectorizer. However, this method yielded accuracy 5 percentage points lower than passing the raw text to the TF-IDF vectorizer to generate weighted unigram and bigram frequencies, which is the approach behind the final Kaggle submissions.

In our approach we used a generalized linear model trained with a stochastic gradient descent classifier (SGDClassifier), because hyperparameter search and cross-validation via grid search were faster than with other methods such as SVMs with polynomial and RBF kernels.
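
As a rough illustration of why this kind of model is cheap to train on sparse text features, the sketch below runs stochastic gradient descent on the hinge loss with an L2 penalty over sparse feature dictionaries. It is a simplified stand-in and not scikit-learn's implementation: it assumes a fixed learning rate, omits the intercept, and handles only two classes, and the toy samples and function names are hypothetical.

```python
import random

def score(weights, features):
    """Linear score w . x for a sparse feature dict {term: value}."""
    return sum(weights.get(t, 0.0) * v for t, v in features.items())

def sgd_hinge(samples, labels, alpha=1e-4, lr=0.1, epochs=20, seed=0):
    """Train a linear classifier (no intercept) with hinge loss and L2 penalty.

    labels must be +1 or -1; each sample is a sparse dict of feature values.
    """
    rng = random.Random(seed)
    weights = {}
    order = list(range(len(samples)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x, y = samples[i], labels[i]
            margin = y * score(weights, x)   # computed before this step's update
            for t in weights:                # L2 penalty: shrink every weight
                weights[t] *= 1.0 - lr * alpha
            if margin < 1.0:                 # hinge loss is active: step towards y * x
                for t, v in x.items():
                    weights[t] = weights.get(t, 0.0) + lr * y * v
    return weights

def predict(weights, features):
    """Return +1 or -1 according to the sign of the linear score."""
    return 1 if score(weights, features) >= 0 else -1

# Hypothetical toy data: +1 for a beauty-like review, -1 for a food-like one.
toy_samples = [
    {"hair": 1.0, "salon": 1.0},
    {"pizza": 1.0, "crust": 1.0},
    {"salon": 1.0, "nails": 1.0},
    {"pizza": 1.0, "pasta": 1.0},
]
toy_labels = [1, -1, 1, -1]
w = sgd_hinge(toy_samples, toy_labels)
```

Each update touches only the non-zero features of one review, which is what keeps a single pass over a very wide sparse matrix inexpensive.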

I also tried lifting the features generated by the TF-IDF vectorizer, but the resulting matrix was dense, so hyperparameter search proved computationally intensive and infeasible. The lifting technique I tried was

$$\phi(x)=\cos(G^Tx+b)$$

It is my view that a random non-linear feature map such as the one above, combined with a very high regularization parameter, could outperform all the other techniques we used.
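
The lifting map above can be sketched in plain Python as follows. The write-up does not say how $G$ and $b$ were drawn, so this sketch assumes standard-normal entries for $G$ and a uniform phase $b$ on $[0, 2\pi]$, as in the usual random Fourier feature construction; the dimensions and names are hypothetical.

```python
import math
import random

def make_lifting(n_features, n_components, seed=0):
    """Draw G (n_features x n_components, standard normal) and b (uniform on [0, 2*pi])."""
    rng = random.Random(seed)
    G = [[rng.gauss(0.0, 1.0) for _ in range(n_components)] for _ in range(n_features)]
    b = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_components)]
    return G, b

def lift(x, G, b):
    """Return phi(x) = cos(G^T x + b) for a sparse input {feature_index: value}."""
    out = []
    for j in range(len(b)):
        # Only the non-zero inputs contribute to the projection.
        projection = sum(value * G[i][j] for i, value in x.items())
        out.append(math.cos(projection + b[j]))
    return out

G, b = make_lifting(n_features=1000, n_components=8)
sparse_review = {3: 0.7, 412: 0.5, 998: 0.5}   # 3 non-zeros out of 1000 features
lifted = lift(sparse_review, G, b)
```

Even though the input has only three non-zero entries, every lifted component is generally non-zero, which illustrates why the lifted matrix becomes dense and why the grid search over it was so costly.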

We reason that a simple linear classifier reaches accuracy near 90% because the documents are close to linearly separable, especially after TF-IDF preprocessing. TF-IDF generates a sparse matrix whose non-zero values correspond to the terms and words that are distinctive of a particular document, which results in effective linear separability.

#### Xuancheng Fan ####

1. My original thought was to use wn.path_similarity to measure how similar each word is to the keywords of the 6 categories. For example, "hair" and "salon" would be close to "body", so a review that includes these 2 words is likely to be classified as "Beauty & Bath".
    Below is the function I used:
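```python
from nltk.corpus import wordnet as wn

def distance_feature(x, keyword):
    """Return the highest WordNet path similarity between the top unigrams of x and keyword."""
    try:
        termlist = freq_normed_unigrams(x)   # helper not shown here: get top unigrams
        target = wn.synset(keyword)          # keyword must be a synset name, e.g. "body.n.01"
        similarities = []
        for term in termlist:                # for each term
            for syn in wn.synsets(term):     # for each synset of the term
                c = wn.path_similarity(syn, target)
                if c is not None:            # None means no connecting path was found
                    similarities.append(c)
        if not similarities:
            return 0
        return max(similarities)
    except TypeError:
        return 0
```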
2. However, the algorithm is extremely expensive. It usually needed a whole night to produce an accuracy result. Moreover, with naive Bayes the result was not good enough.
3. I then tried KNN, but the cost became even higher, so I had to give up the similarity idea.
4. Then I tried the frequency idea. Based on Surendran's algorithm, I tried different features and parameter settings to get the highest score. The highest dev accuracy I got was 0.90072709848121835, and the highest Kaggle score was 0.88444.
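
The path-similarity idea from item 1 can be illustrated without WordNet. The sketch below scores two concepts as 1 / (1 + shortest path length) in a small hypernym graph, which is how path similarity is defined. The taxonomy here is a hypothetical toy, far smaller than WordNet, and the function names are illustrative.

```python
from collections import deque

# Hypothetical hypernym links (child -> parent); real WordNet is far larger.
TOY_HYPERNYMS = {
    "hair": "body_part",
    "skin": "body_part",
    "body_part": "entity",
    "pizza": "food",
    "food": "entity",
}

def neighbours(node):
    """Nodes one hypernym or hyponym link away from node."""
    linked = {child for child, parent in TOY_HYPERNYMS.items() if parent == node}
    if node in TOY_HYPERNYMS:
        linked.add(TOY_HYPERNYMS[node])
    return linked

def path_similarity(a, b):
    """Return 1 / (1 + shortest path length) between a and b, or None if unconnected."""
    seen, queue = {a}, deque([(a, 0)])
    while queue:                             # breadth-first search from a
        node, dist = queue.popleft()
        if node == b:
            return 1.0 / (1 + dist)
        for nxt in neighbours(node):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

near = path_similarity("hair", "body_part")  # one link apart
far = path_similarity("hair", "pizza")       # four links apart
```

A graph search of this kind has to run for every synset of every top term in every review, against every category keyword, which helps explain the overnight running times described in item 2.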
