In [1]:
%matplotlib inline

%load_ext autoreload
%autoreload 2

In [2]:
import json, gzip
myfile = gzip.open(r'../data/all.json.gz')
items = json.load(myfile)
myfile.close()

In [3]:
import pandas
favs_table = pandas.DataFrame(items)

favs_table.head(5)

| | author | date_string | entry_no | favcount | id | page_no | text |
|---|---|---|---|---|---|---|---|
| 0 | sitki siyril | 27.08.2004 17:33 ~ 28.05.2014 21:16 | 0 | 490 | 5526036 | 0 | en popülerlerinden bir sözlük celebrity’si ile... |
| 1 | sitki siyril | 27.08.2004 17:35 | 1 | 38 | 5526061 | 0 | sözlükte çok popüler, entryleri sevilen bir üç... |
| 2 | sitki siyril | 27.08.2004 17:43 | 2 | 5 | 5526178 | 0 | geç kalmadan...(bkz: ekşi itirafçıların demek ... |
| 3 | sitki siyril | 27.08.2004 18:21 | 3 | 6 | 5526647 | 0 | geçen ssg ile zirvede karşılaştım. arası soğum... |
| 4 | sitki siyril | 27.08.2004 18:27 | 4 | 10 | 5526710 | 0 | amcıkgülün itirafını okuyunca aklıma geldi. be... |


In [4]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

## Preprocessing

(Stemming and vectorizing)

In [5]:
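import snowballstemmer
stemmer = snowballstemmer.stemmer('turkish')

# Default scikit-learn analyzer: lowercases the text and splits it into word tokens.
analyzer = CountVectorizer().build_analyzer()

# Thanks to Stack Overflow: stem every token produced by the default analyzer.
def stemmed_words(doc):
    return (stemmer.stemWord(w) for w in analyzer(doc))

# Note: token_pattern is ignored when `analyzer` is a callable, so tokenization
# is determined entirely by `analyzer` above; min_df and max_df still apply.
cv = CountVectorizer(analyzer=stemmed_words, token_pattern=ur'((?:\w|ö|ü|ı|ğ|ş|ç)+)', min_df=2, max_df=0.95)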


In [6]:
dataset_vectorized = cv.fit_transform(favs_table['text'])
dataset = dataset_vectorized

In [7]:
from sklearn.decomposition import LatentDirichletAllocation

In [8]:
lda = LatentDirichletAllocation(n_topics=20)

dataset_lda = lda.fit_transform(dataset_vectorized)

In [9]:
##From: http://scikit-learn.org/stable/auto_examples/applications/topics_extraction_with_nmf_lda.html

def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % topic_idx)
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
    print()

### Top words in topics

In [10]:
print_top_words(lda, cv.get_feature_names(), 10)

Topic #0:
oy pazar tepki dede it sure hah işyer cidi kis
Topic #1:
bir cok bu ve de iç sey be am da
Topic #2:
saç korkuyor hav ışık kar oh mümk diş video dene
Topic #3:
bkz ek eksi dusunuyor 2015 ism karma dus raz edit
Topic #4:
istiyor çok kendi baze olmak istemiyor iç gip olmuyor zor
Topic #5:
şarkı film müzik ses dan com dinliyor fena günlük ilaç
Topic #6:
acayip su şok aci değişiklik resmi yavru evet veriç kapatıp
Topic #7:
kullanma hissetmiyor yalniz evle tez kanal daim zannediyor süt hâlâ
Topic #8:
bir ne bu da de var ya yok ve mi
Topic #9:
facebook ta gizli mavi profil metro bölüm gidel you hu
Topic #10:
bir çok bu ve iç be am şey de da
Topic #11:
sözlük itiraf entry yazar başlık yaz ediyor bura yazdık ekşi
Topic #12:
ardı of the yağmur sahne ankar hissi dans edil kırıl
Topic #13:
fotoğraf yük doktor beyaz verecek top niyet düze siyah karışık
Topic #14:
bir ve ev gün iç geç sonra sabah saat g
Topic #15:
allah keşke mutsuz kedi hissettik millet mutsuzluk ver u herşey
Topic #16:
b

Split the data into training and test sets, ordered by time of entry: the first 80% of the entries are used for training and the rest for testing, so we are trying to predict the popularity of "future" entries.

In [12]:
y = favs_table['favcount'].as_matrix()

dataset = dataset_lda
train_index = int(0.8*len(dataset))
training_X = dataset[0:train_index]
training_y = y[0:train_index]
test_X = dataset[train_index:]
test_y = y[train_index:]

Three classes: 0 favs, 1-4 favs (somewhat popular), and 5 or more favs (popular). The cut-offs are admittedly a bit arbitrary, but I wanted to avoid a regression problem here, since I don't really care whether an entry gets 50 favs or 1000. An alternative is to split the entries into classes by quantiles of the fav count.
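The quantile-based alternative could look roughly like the sketch below. It is not part of the original experiment: the function name `discretize_by_quantiles`, the toy counts, and the 50th/90th percentile cut points are assumptions made for illustration.

```python
import numpy as np

def discretize_by_quantiles(counts, percentiles=(50, 90)):
    """Assign each fav count to a class delimited by percentile cut points."""
    counts = np.asarray(counts)
    # Cut points are the fav counts found at the requested percentiles.
    edges = np.percentile(counts, percentiles)
    # With right=True, class k holds values with edges[k-1] < value <= edges[k].
    labels = np.digitize(counts, edges, right=True)
    return labels, edges

# Made-up fav counts, only to show the behaviour.
toy_counts = [0, 0, 0, 1, 2, 3, 4, 10, 50, 490]
toy_labels, toy_edges = discretize_by_quantiles(toy_counts)
```

Two caveats apply. The cut points should be computed on the training portion only and then reused for the test portion. And because so many entries share the same count (0 favs in particular), neighbouring cut points can coincide, which leaves some classes empty.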

In [13]:
def discretize_y(in_y):
    # Map raw fav counts to classes: 0 = no favs, 1 = 1-4 favs, 2 = 5 or more favs.
    ret = []
    for yi in in_y:
        if yi == 0:
            ret.append(0)
        elif yi < 5:
            ret.append(1)
        else:
            ret.append(2)
    return ret

In [14]:
training_y_discrete = discretize_y(training_y)
test_y_discrete = discretize_y(test_y)

## Random forests

### Create Model

In [15]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=100, n_jobs=4, max_features=None)#, class_weight={0:3,1:1,2:50})
clf.fit(training_X, training_y_discrete)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features=None, max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=4,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [16]:
from sklearn.metrics import confusion_matrix, accuracy_score

### Test Model

In [18]:
print 'Results on training data:'

results_y = clf.predict(training_X[:])

print 'Accuracy:'
print accuracy_score(training_y_discrete[:], results_y)

print 'Confusion matrix:'
confusion_matrix(training_y_discrete[:], results_y)

Results on training data:
Accuracy:
0.987002810203
Confusion matrix:


array([[17152,   830,     0],
       [   75, 54278,     2],
       [    0,    55,  1624]])

In [20]:
from sklearn import cross_validation

print 'Cross validation accuracy (with shuffling):'

cross_validation.cross_val_score(clf, training_X, training_y_discrete, 
                                 cv=cross_validation.StratifiedKFold(training_y_discrete, n_folds=5, shuffle=True ))


Cross validation accuracy (with shuffling):


array([ 0.72257498,  0.72149419,  0.72194825,  0.72552861,  0.72199703])

In [21]:
print 'On test data:'
results_y = clf.predict(test_X)

print 'Accuracy:'
print accuracy_score(test_y_discrete, results_y)

print 'Confusion matrix:'
confusion_matrix(test_y_discrete, results_y)

On test data:
Accuracy:
0.51934716818
Confusion matrix:


array([[ 171, 6254,    4],
       [ 285, 9431,    4],
       [  41, 2306,    8]])
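To put the test accuracy in context, it can be compared with a majority-class baseline, that is, a classifier that always predicts the most frequent class. The sketch below recomputes both numbers from the test confusion matrix printed above. The helper names are made up for illustration, and it relies on scikit-learn's convention that rows are true classes and columns are predicted classes.

```python
import numpy as np

# Test-set confusion matrix copied from the output above
# (rows: true class, columns: predicted class).
test_cm = np.array([[171, 6254, 4],
                    [285, 9431, 4],
                    [41, 2306, 8]])

def accuracy_from_cm(cm):
    # Correct predictions lie on the diagonal.
    return np.trace(cm) / float(cm.sum())

def majority_baseline(cm):
    # Row sums are the true class counts; always predicting the largest
    # class gets exactly that share of the entries right.
    return cm.sum(axis=1).max() / float(cm.sum())

rf_test_accuracy = accuracy_from_cm(test_cm)
baseline_accuracy = majority_baseline(test_cm)
```

The baseline comes out at 9720 / 18504, roughly 0.525, slightly above the 0.519 reported for the random forest.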

## Naive Bayes
(A simpler model than random forests, so it should be less prone to overfitting.)

### Create Model

In [22]:
from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB()

clf.fit(training_X, training_y_discrete)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

### Test Model

In [24]:
print 'Results on training data:'

results_y = clf.predict(training_X[:])

print 'Accuracy:'
print accuracy_score(training_y_discrete[:], results_y)

print 'Confusion matrix:'
confusion_matrix(training_y_discrete[:], results_y)

Results on training data:
Accuracy:
0.645319930826
Confusion matrix:


array([[ 1382, 14983,  1617],
       [ 3355, 45963,  5037],
       [  133,  1127,   419]])

In [23]:
from sklearn import cross_validation

print 'Cross validation accuracy (with shuffling):'

cross_validation.cross_val_score(clf, training_X, training_y_discrete, 
                                 cv=cross_validation.StratifiedKFold(training_y_discrete, n_folds=5, shuffle=True ))


Cross validation accuracy (with shuffling):


array([ 0.64462307,  0.64516347,  0.66601365,  0.63352023,  0.64004864])

In [25]:
results_y = clf.predict(test_X)

print accuracy_score(test_y_discrete, results_y)

confusion_matrix(test_y_discrete, results_y)

0.46687202767


array([[ 535, 5340,  554],
       [ 739, 7556, 1425],
       [ 177, 1630,  548]])

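*Summary of the results*: Vectorized text alone does not help predict popularity. On the time-ordered test set, the random forest (accuracy 0.519) and naive Bayes (0.467) both score below the roughly 0.525 that always predicting the 1-4 favs class would achieve (9720 of the 18504 test entries), and the gap between the random forest's 0.987 training accuracy and its test accuracy shows heavy overfitting.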

I also tried different feature spaces, such as tf-idf weighted bag-of-words instead of topics, and other dimensionality reduction methods such as LSA, as well as other models (SVM, logistic regression). None of them provided a significant improvement over the results above.
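For reference, a tf-idf plus LSA feature space of the kind mentioned here could be built roughly as follows. This is only a sketch on a made-up toy corpus, not the code used for these experiments; the documents, `n_components=2` and `random_state=0` are placeholder assumptions.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Toy stand-in for favs_table['text'] (made-up documents).
toy_docs = [
    "music film song",
    "film song video",
    "cat dog cat",
    "dog cat kitten",
    "song music video",
    "kitten dog puppy",
]

# LSA is a truncated SVD applied to the tf-idf weighted bag-of-words matrix.
lsa = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=2, random_state=0),
)

# Dense array with one row per document and one column per latent component.
toy_lsa = lsa.fit_transform(toy_docs)
```

On the real data the stemming analyzer would presumably be passed to `TfidfVectorizer` in the same way as to `CountVectorizer`, and the number of components would be far larger than 2.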

*Things to try:* Combine the text-related features with others: time of day, entry length, and perhaps the author or author-related metadata. Use phrase detection to get better features than single words.
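As a starting point for the non-text features, the time of day and entry length could be derived from the existing `date_string` and `text` columns along these lines. This is a hedged sketch: the helper names are hypothetical, and it assumes that the part of `date_string` before an optional `~` is the creation time and the part after it is the time of the last edit.

```python
from datetime import datetime

def parse_created(date_string):
    # Assumption: anything after "~" is an edit timestamp and is dropped.
    created = date_string.split('~')[0].strip()
    return datetime.strptime(created, '%d.%m.%Y %H:%M')

def metadata_features(date_string, text):
    created = parse_created(date_string)
    return [
        created.hour,       # time of day (0-23)
        created.weekday(),  # day of week (0 = Monday)
        len(text),          # entry length in characters
        len(text.split()),  # entry length in whitespace-separated words
    ]

# Date string taken from the first row of the table above; the text is a placeholder.
example_features = metadata_features('27.08.2004 17:33 ~ 28.05.2014 21:16',
                                     'placeholder entry text')
```

The resulting columns could then be stacked next to the topic features (for example with `numpy.hstack`) before fitting a classifier.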

Also try unsupervised methods on the popular entries to see what features they have in common.


