<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 15px; height: 80px">

# Project 6


## NLP and Machine Learning on [travel.stackexchange.com](http://travel.stackexchange.com/) data

---

In Project 6 you'll be doing NLP and machine learning on post data from Stack Exchange's travel subdomain.

This project is set up like a mini Kaggle competition. You are given the training data, and when projects are submitted your model will be tested on the held-out testing data. There will be prizes for the people who build the models that perform best on the held-out test set!

---

## Notes on the data

The data is again compressed into the `.7z` file format to save space. There are 6 .csv files and one readme file that contains some information on the fields.

    posts_train.csv
    comments_train.csv
    users.csv
    badges.csv
    votes_train.csv
    tags.csv
    readme.txt
    
The data is located in your datasets folder:

    DSI-SF-2/datasets/stack_exchange_travel.7z
    
If you're interested in where this data came from and where to get more data from other stackexchange subdomains, see here:

https://ia800500.us.archive.org/22/items/stackexchange/readme.txt


### Recommended Utilities for .7z

- For OSX, use [Keka](http://www.kekaosx.com/en/) or [The Unarchiver](http://wakaba.c3.cx/s/apps/unarchiver.html).
- For Windows, [7-zip](http://www.7-zip.org/) is the standard.
- For Linux, try the `p7zip` utility: `sudo apt-get install p7zip`.



In [25]:
import pandas as pd
from bs4 import BeautifulSoup as bs
import numpy as np

In [4]:
path = "/Users/ga/Desktop/DSI-SF-3_repo/DSI-SF-3/datasets/stack_exchange_travel/"
posts_train = pd.read_csv(path + "posts_train.csv")
comments_train = pd.read_csv(path + "comments_train.csv")
users = pd.read_csv(path + "users.csv")
badges = pd.read_csv(path + "badges.csv")
votes_train = pd.read_csv(path + "votes_train.csv")
tags = pd.read_csv(path + "tags.csv")

In [47]:
posts1 = posts_train[["Id", "Body", "Tags", "Title"]]

In [9]:
comments1 = comments_train[["PostId", "Text"]]

In [11]:
comments1_agg = pd.DataFrame(comments1.groupby("PostId")["Text"].sum())

In [14]:
comments1_agg["PostId"] = comments1_agg.index
comments1_agg.index.name = None  # keep "PostId" from being both index name and column in the merge

In [15]:
# Check for nulls and drop
comments1_agg.isnull().sum()

Text      0
PostId    0
dtype: int64

In [16]:
posts1.isnull().sum()

Id           0
Body       813
Tags     27301
Title    27301
dtype: int64

In [17]:
# Tags and Title are null for 27,301 rows each, so we'll ignore these columns
# (chaining dropna avoids an in-place change on a slice of posts_train)
posts1 = posts1[["Id", "Body"]].dropna(axis=0)
print posts1.shape, posts1.isnull().sum()

(40476, 2) Id      0
Body    0
dtype: int64


In [19]:
# Merge (left) posts1 and comments1_agg on Id, PostId resp.
posts_comments = pd.merge(posts1, comments1_agg, how = "left", left_on = "Id", right_on = "PostId")

In [20]:
posts_comments.head()

| | Id | Body | Text | PostId |
|---|---|---|---|---|
| 0 | 1 | `<p>My fiancée and I are looking for a good Car...` | To help with the cruise line question: Where a... | 1.0 |
| 1 | 4 | `<p>Singapore Airlines has an all-business clas...` | This route (as well as LAX-SIN) is being cance... | 4.0 |
| 2 | 5 | `<p>Another definition question that interested...` | NaN | NaN |
| 3 | 8 | `<p>Can anyone suggest the best way to get from...` | To those voting down, please explain why with ... | 8.0 |
| 4 | 9 | `<p>We are considering visiting Argentina for u...` | I agree with @user27478, you need to specify w... | 9.0 |


In [32]:
# Deal with the NaNs and combine body/text
tres = zip(posts_comments.Body, posts_comments.Text, posts_comments.PostId)
comb = [bod if np.isnan(pid) else bod + txt for bod, txt, pid in tres]

In [31]:
len(comb)

40476

In [33]:
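def remove_html(x):
    # Parse the markup and keep only the visible text
    soup = bs(x, "lxml")  # BeautifulSoup was imported under the alias `bs`
    return soup.get_text()

# A list comprehension returns a list on both Python 2 and 3, so slicing works
cleaned_comb = [remove_html(text) for text in comb]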

In [35]:
cleaned_comb[:1]

[u'My fianc\xe9e and I are looking for a good Caribbean cruise in October and were wondering which islands are best to see and which Cruise line to take?\nIt seems like a lot of the cruises don\'t run in this month due to Hurricane season so I\'m looking for other good options.\nEDIT We\'ll be travelling in 2012.\nTo help with the cruise line question: Where are you located? My wife and I live in New Orleans, so we sail out of the port here. It limits us mainly to Carnival (though we are getting some more cruise lines in here), but saves us money on travel expenses getting *to* the port. If you\'re closer to a specific port, and like the cruises offered out of it, then it would make more sense to choose a cruise line from there.Toronto, Ontario. We can fly out of anywhere though."Best" for what?  Please read [this page](http://travel.stackexchange.com/questions/how-to-ask-beta) and particularly the blog posts linked on the right.What do you want out of a cruise? To relax on a boat? To 

In [38]:
back_df = {"PostId": posts1["Id"], "Text": cleaned_comb}

df = pd.DataFrame(back_df)

In [39]:
df.head()

| | PostId | Text |
|---|---|---|
| 0 | 1 | My fiancée and I are looking for a good Caribb... |
| 1 | 4 | Singapore Airlines has an all-business class f... |
| 2 | 5 | Another definition question that interested me... |
| 3 | 8 | Can anyone suggest the best way to get from Se... |
| 4 | 9 | We are considering visiting Argentina for up t... |


<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

### 1. Use LDA to find what topics are discussed on travel.stackexchange.com.

---

Text can be found in the posts and comments datasets. For answer posts, the `ParentId` column in the posts dataset gives the `Id` of the question being answered. Comment text can be merged onto the post it belongs to with the `PostId` field.

The text may have some HTML tags. BeautifulSoup has convenient ways to get rid of markup or extract text if you need to. You can also parse the strings yourself if you like.

The tags dataset has the "tags" that the users have officially given the post.
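The `PostId` and `ParentId` relationships described above can be sketched with two pandas merges. This is a minimal illustration on made-up toy frames, not the project data; the column names `CommentText`, `QuestionId` and `QuestionBody` are hypothetical names introduced here. Joining comments with a space (rather than summing the strings) keeps words from adjacent comments from running together.

```python
import pandas as pd

# Toy stand-ins for posts_train.csv and comments_train.csv (values are made up).
posts = pd.DataFrame({
    "Id": [1, 2, 3],
    "PostTypeId": [1, 2, 2],       # 1 = question, 2 = answer
    "ParentId": [None, 1, 1],      # answers point at their question
    "Body": ["<p>Which cruise?</p>", "<p>Try line A.</p>", "<p>Try line B.</p>"],
})
comments = pd.DataFrame({
    "PostId": [1, 1, 3],
    "Text": ["Where are you located?", "Toronto.", "Why line B?"],
})

# Collapse all comments on a post into one space-separated string.
comment_text = (
    comments.groupby("PostId")["Text"]
    .agg(" ".join)
    .rename("CommentText")
    .reset_index()
)

# Attach comment text to every post; posts without comments get "".
posts_with_comments = posts.merge(
    comment_text, how="left", left_on="Id", right_on="PostId"
).drop(columns="PostId")
posts_with_comments["CommentText"] = posts_with_comments["CommentText"].fillna("")

# Pair each answer with the body of the question it answers.
questions = posts.loc[posts["PostTypeId"] == 1, ["Id", "Body"]].rename(
    columns={"Id": "QuestionId", "Body": "QuestionBody"}
)
answers = posts_with_comments[posts_with_comments["PostTypeId"] == 2]
answers_with_question = answers.merge(
    questions, how="left", left_on="ParentId", right_on="QuestionId"
)
print(answers_with_question[["Id", "QuestionId", "CommentText"]])
```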

**1.1 Implement LDA against the text features of the dataset(s).**

- This can be posts or a combination of posts and comments if you want more power.
- Find optimal **K/num_topics**.

**1.2 Compare your topics to the tags. Do the LDA topics make sense? How do they compare to the tags?**
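One possible way to approach "find optimal K" is to fit LDA for several candidate values and compare perplexity on held-out documents (lower is better). The sketch below assumes a recent scikit-learn, where the parameter is called `n_components`, and uses a tiny made-up corpus; the candidate values are arbitrary. Perplexity is only one criterion, and reading the top words per topic is still worthwhile.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Tiny made-up corpus; in the project this would be the cleaned post text.
docs = [
    "visa application embassy passport schengen",
    "passport visa entry immigration border",
    "train station ticket bus route",
    "bus train ticket timetable station",
    "airport terminal luggage security layover",
    "luggage airport flight security gate",
] * 5

# LDA works on raw term counts, not tf-idf weights.
counts = CountVectorizer(stop_words="english").fit_transform(docs)
train_counts, heldout_counts = train_test_split(
    counts, test_size=0.3, random_state=0
)

perplexity_by_k = {}
for k in (2, 3, 4, 6):  # candidate topic counts (arbitrary choices)
    lda = LatentDirichletAllocation(n_components=k, random_state=0)
    lda.fit(train_counts)
    # Perplexity of documents the model has not seen during fitting.
    perplexity_by_k[k] = lda.perplexity(heldout_counts)

best_k = min(perplexity_by_k, key=perplexity_by_k.get)
print(perplexity_by_k)
print("lowest held-out perplexity at K =", best_k)
```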


In [41]:
from sklearn.feature_extraction.text import TfidfVectorizer
tvec = TfidfVectorizer(stop_words='english', 
                        ngram_range=(1,3), 
                        min_df = 0.025, max_df = 0.9)
tvec.fit(df["Text"])


TfidfVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=0.9, max_features=None, min_df=0.025,
        ngram_range=(1, 3), norm=u'l2', preprocessor=None, smooth_idf=True,
        stop_words='english', strip_accents=None, sublinear_tf=False,
        token_pattern=u'(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

In [43]:
from __future__ import print_function
from time import time

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation

n_samples = 2000
n_features = 1000
n_topics = 10
n_top_words = 20

def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % topic_idx)
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
    print()


# Use tf-idf features for NMF.
print("Extracting tf-idf features for NMF...")
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2,
                                   max_features=n_features,
                                   stop_words='english')
t0 = time()
tfidf = tfidf_vectorizer.fit_transform(df["Text"])
print("done in %0.3fs." % (time() - t0))

# Use tf (raw term count) features for LDA.
print("Extracting tf features for LDA...")
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                                max_features=n_features,
                                stop_words='english')
t0 = time()
tf = tf_vectorizer.fit_transform(df["Text"])
print("done in %0.3fs." % (time() - t0))

# Fit the NMF model
print("Fitting the NMF model with tf-idf features, "
      "n_samples=%d and n_features=%d..."
      % (n_samples, n_features))
t0 = time()
nmf = NMF(n_components=n_topics, random_state=1,
          alpha=.1, l1_ratio=.5).fit(tfidf)
print("done in %0.3fs." % (time() - t0))

print("\nTopics in NMF model:")
tfidf_feature_names = tfidf_vectorizer.get_feature_names()
print_top_words(nmf, tfidf_feature_names, n_top_words)

print("Fitting LDA models with tf features, "
      "n_samples=%d and n_features=%d..."
      % (n_samples, n_features))
# Note: scikit-learn 0.19+ names this parameter n_components instead of n_topics.
lda = LatentDirichletAllocation(n_topics=n_topics, max_iter=5,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0)
t0 = time()
lda.fit(tf)
print("done in %0.3fs." % (time() - t0))

print("\nTopics in LDA model:")
tf_feature_names = tf_vectorizer.get_feature_names()
print_top_words(lda, tf_feature_names, n_top_words)

Extracting tf-idf features for NMF...
done in 81.908s.
Extracting tf features for LDA...
done in 16.725s.
Fitting the NMF model with tf-idf features, n_samples=2000 and n_features=1000...
done in 10.435s.

Topics in NMF model:
Topic #0:
like just people don time know ve good want think really place city question places ll going way answer make
Topic #1:
visa transit uk need apply application visas entry valid visit tourist embassy visitor canada citizen indian immigration enter stay india
Topic #2:
airport international transit city terminal airports hours located security code taxi layover area hour need luggage london main domestic minutes
Topic #3:
passport passports country new enter id valid uk citizen travel stamp old countries border immigration expired citizenship eu document british
Topic #4:
train bus station trains ticket tickets buses day hours journey buy time route rail minutes way london line transport taxi
Topic #5:
schengen days area 90 country stay entry eu visa count

In [44]:
# Rough reading of the topics in the LDA model:
    # Topic 0: travel documents / travel
    # Topic 1: financials
    # Topic 2: auto-related / car rent
    # Topic 3: airports / flights
    # Topic 4: countries / countries to visit
    # Topic 5: oriental countries / law / food
    # Topic 6: public transportation
    # Topic 7: Ambiguous...
    # Topic 8: security / luggage / airlines
    # Topic 9: airline tickets

In [50]:
# posts1 was reduced to Id/Body above, so read Tags from posts_train
tags = posts_train["Tags"].dropna(axis=0)

In [89]:
def parse_tags(tags_series):
    """Split strings like "<usa><visas>" into a flat list of tag names."""
    tag_list = []  # local, so repeated calls do not accumulate tags
    for tags in tags_series:
        for tag in tags.split("><"):
            tag_list.append(tag.replace("<", "").replace(">", ""))
    return tag_list

In [92]:
parsed_tags = parse_tags(tags)

In [109]:
from collections import Counter
Counter(parsed_tags).most_common(30)

[('visas', 5551),
 ('air-travel', 3304),
 ('usa', 3123),
 ('schengen', 2216),
 ('uk', 2128),
 ('customs-and-immigration', 1522),
 ('transit', 1458),
 ('trains', 1281),
 ('public-transport', 1061),
 ('passports', 1020),
 ('tickets', 986),
 ('luggage', 978),
 ('budget', 977),
 ('legal', 922),
 ('europe', 881),
 ('canada', 878),
 ('indian-citizens', 840),
 ('online-resources', 814),
 ('india', 802),
 ('france', 794),
 ('germany', 770),
 ('japan', 716),
 ('international-travel', 714),
 ('airports', 708),
 ('safety', 628),
 ('airlines', 590),
 ('money', 568),
 ('health', 546),
 ('planning', 531),
 ('food-and-drink', 524)]

In [None]:
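# The most common tags line up well with the topics found above.
# "schengen" refers to the Schengen Area: European countries that share a
# common short-stay visa policy, which fits the "90 days" terms in the topics.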

<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

### 2. What makes an answer likely to be "accepted"?

---

**2.1 Build a model to predict whether a post will be marked as the answer.**

- This is a classification problem.
- You're free to use any of the machine learning algorithms or techniques we have learned in class to build the best model you can.
- NLP will be very useful here for pulling out useful and relevant features from the data. 
- Though not required, using bagging and boosting models like Random Forests and Gradient Boosted Trees will _probably_ get you the highest performance on the test data (but who knows!).


**2.2 Evaluate the performance of your classifier with a confusion matrix and accuracy. Explain how your model is performing.**

**2.3 Plot either a ROC curve or precision-recall curve (or both!) and explain what they tell you about your model.**

NOTE: You should only be predicting this for posts with `PostTypeId == 2`, which are the "answer" posts. This doesn't mean, however, that you can't or shouldn't use the parent questions as predictors!
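To make the evaluation steps in 2.2 and 2.3 concrete, here is a small plain-Python sketch (Python 3 assumed) of a confusion matrix, accuracy, and the area under a ROC curve. The labels and scores are made up for illustration. In practice `sklearn.metrics` offers `confusion_matrix`, `accuracy_score`, `roc_curve` and `roc_auc_score` for the same quantities; the hand-written versions are only meant to show what those numbers measure.

```python
# 1 = accepted answer, 0 = not accepted (labels and scores are made up).
y_true = [1, 0, 0, 1, 0, 1, 0, 0]
y_score = [0.9, 0.4, 0.6, 0.7, 0.2, 0.3, 0.1, 0.5]  # predicted P(accepted)
y_pred = [1 if s >= 0.5 else 0 for s in y_score]     # threshold at 0.5


def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    return tn, fp, fn, tp


def roc_points(y_true, y_score):
    """(fpr, tpr) pairs from sweeping the threshold over every distinct score."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(y_score), reverse=True):
        tp = sum(t == 1 and s >= threshold for t, s in zip(y_true, y_score))
        fp = sum(t == 0 and s >= threshold for t, s in zip(y_true, y_score))
        points.append((fp / n_neg, tp / n_pos))
    return points


def area_under(points):
    """Trapezoid-rule area under a piecewise-linear curve."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


tn, fp, fn, tp = confusion_counts(y_true, y_pred)
accuracy = (tn + tp) / len(y_true)
# Accuracy of always predicting the majority class, as a reference point.
baseline = max(sum(y_true), len(y_true) - sum(y_true)) / len(y_true)
auc = area_under(roc_points(y_true, y_score))
print(tn, fp, fn, tp, accuracy, baseline, auc)
```

Comparing accuracy against the majority-class baseline matters here because accepted answers are likely a minority of all answers, so a model can look accurate while rarely predicting "accepted".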


In [121]:
posts2 = posts_train.copy()
acc_posts = list(posts_train["AcceptedAnswerId"].unique())
acc_posts = [post for post in acc_posts if not np.isnan(post)]
posts2["Accepted"] = np.zeros(posts2.shape[0], dtype = np.int8)
posts2.loc[posts_train["Id"].isin(acc_posts), "Accepted"] = 1
posts2.head()

Unnamed: 0,AcceptedAnswerId,AnswerCount,Body,ClosedDate,CommentCount,CommunityOwnedDate,CreationDate,FavoriteCount,Id,LastActivityDate,...,LastEditorUserId,OwnerDisplayName,OwnerUserId,ParentId,PostTypeId,Score,Tags,Title,ViewCount,Accepted
0,393.0,4.0,<p>My fiancée and I are looking for a good Car...,2013-02-25T23:52:47.953,4,,2011-06-21T20:19:34.730,,1,2012-05-24T14:52:14.760,...,101.0,,9.0,,1,8,<caribbean><cruising><vacations>,What are some Caribbean cruises for October?,361.0,0
1,,1.0,<p>Singapore Airlines has an all-business clas...,,1,,2011-06-21T20:24:57.160,,4,2013-01-09T09:55:22.743,...,693.0,,24.0,,1,8,<loyalty-programs><routes><ewr><singapore-airl...,Does Singapore Airlines offer any reward seats...,219.0,0
2,770.0,5.0,<p>Another definition question that interested...,,0,,2011-06-21T20:25:56.787,2.0,5,2012-10-12T20:49:08.110,...,101.0,,13.0,,1,11,<romania><transportation>,What is the easiest transportation to use thro...,340.0,0
3,62.0,3.0,<p>Can anyone suggest the best way to get from...,,2,,2011-06-21T20:30:38.687,1.0,8,2016-03-28T03:41:28.130,...,,,26.0,,1,11,<usa><airport-transfer><taxis><seattle>,Best way to get from SeaTac airport to Redmond?,9219.0,0
4,178.0,4.0,<p>We are considering visiting Argentina for u...,2016-01-02T10:26:48.277,1,,2011-06-21T20:31:21.800,8.0,9,2016-01-01T21:58:02.303,...,101.0,,23.0,,1,12,<sightseeing><public-transport><transportation...,What are must-visit destinations for the first...,1503.0,0


In [123]:
y = posts2.loc[posts2.PostTypeId == 2, "Accepted"]

In [124]:
X = posts2.loc[posts2.PostTypeId == 2, ["Score", "ViewCount", "AnswerCount", "CommentCount", "FavoriteCount"]]

In [127]:
X = X.fillna(0)  # avoid an in-place fillna on a slice of posts2

In [128]:
X.head()

| | Score | ViewCount | AnswerCount | CommentCount | FavoriteCount |
|---|---|---|---|---|---|
| 9 | 10 | 0.0 | 0.0 | 3 | 0.0 |
| 10 | 51 | 0.0 | 0.0 | 3 | 0.0 |
| 11 | 10 | 0.0 | 0.0 | 1 | 0.0 |
| 16 | 10 | 0.0 | 0.0 | 4 | 0.0 |
| 18 | 26 | 0.0 | 0.0 | 5 | 0.0 |


In [130]:
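try:
    from sklearn.model_selection import cross_val_score   # scikit-learn >= 0.18
except ImportError:
    from sklearn.cross_validation import cross_val_score  # older releases
from sklearn.ensemble import AdaBoostClassifier

clf = AdaBoostClassifier(n_estimators=100)
# Mean accuracy over the default cross-validation folds
scores = cross_val_score(clf, X, y)
scores.mean()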


0.72891893019568565

<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

### 3. What is the score of a post?

---

**3.1 Build a model that predicts the score of a post.**

- This is a regression problem now. 
- You can and should be predicting score for both "question" and "answer" posts, so keep them both in your dataset.
- Again, use any techniques that you think will get you the best model.

**3.2 Evaluate the performance of your model with cross-validation and report the results.**

**3.3 What is important for determining the score of a post, if anything?**
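As a rough sketch of 3.2 and 3.3, the snippet below cross-validates a regressor and then inspects its coefficients. It assumes a recent scikit-learn (`sklearn.model_selection`) and uses synthetic data in place of real post features; `Ridge` is just a stand-in for whichever model you choose, and the three columns are hypothetical features.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for post features: only the first two columns carry signal.
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

# Shuffled 5-fold cross-validation, reported with two different metrics.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
model = Ridge(alpha=1.0)
r2_scores = cross_val_score(model, X, y, cv=cv, scoring="r2")
# scikit-learn reports errors as negative scores, so flip the sign.
mae_scores = -cross_val_score(model, X, y, cv=cv, scoring="neg_mean_absolute_error")
print("R^2 per fold:", r2_scores.round(3), "mean:", r2_scores.mean().round(3))
print("MAE per fold:", mae_scores.round(3))

# For a linear model on comparably scaled features, coefficient size is a
# first hint at which inputs matter.
model.fit(X, y)
print("coefficients:", model.coef_.round(3))
```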


<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

### 4. How many views does a post have?

---

**4.1 Build a model that predicts the number of views a post has.**

- This is another regression problem. 
- Predict the views for all posts, not just the "answer" posts.

**4.2 Evaluate the performance of your model with cross-validation and report the results.**

**4.3 What is important for the number of views a post has, if anything?**
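The `ViewCount` values in the `posts2.head()` output earlier range from a few hundred to several thousand, so one option (an assumption here, not a requirement of the project) is to model `log1p(views)` and score predictions with a root mean squared log error. A minimal standard-library sketch of the transform and the metric:

```python
import math


def rmsle(y_true, y_pred):
    """Root mean squared error between log1p-transformed values."""
    squared = [
        (math.log1p(pred) - math.log1p(true)) ** 2
        for true, pred in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared) / len(squared))


# ViewCount values copied from the head() output of posts2.
views = [361.0, 219.0, 340.0, 9219.0, 1503.0]

# Train on the log scale, then map predictions back with expm1.
log_views = [math.log1p(v) for v in views]
recovered = [math.expm1(z) for z in log_views]

# A hypothetical constant prediction, just to exercise the metric.
constant_guess = [1000.0] * len(views)
print(rmsle(views, constant_guess))
```

On the log scale, being off by a factor of two costs the same for a post with 200 views as for one with 9,000, which is often closer to what "a good prediction" means for counts like these.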

<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

### 5. Build a pipeline or other code to automate evaluation of your models on the test data.

---

Now that you've constructed your three predictive models, build a pipeline or code that can easily load up the raw testing data and evaluate your models on it.

The testing data that is held out is in the same raw format as the training data you have. _Any cleaning and preprocessing that you did on the training data will need to be done on the testing data as well!_

This is a good opportunity to practice building pipelines, but you're not required to. Custom functions and classes are fine as long as they are able to process and test the new data.
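One way to guarantee that test data receives the same preprocessing as training data is to put the cleaning step inside a scikit-learn `Pipeline`. The sketch below assumes a recent scikit-learn; `clean_text` and `evaluate` are hypothetical helpers, the texts and labels are made up, and `LogisticRegression` stands in for whatever classifier you settled on.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline


def clean_text(raw):
    """Hypothetical cleaning step: lowercase and collapse whitespace."""
    return " ".join(raw.lower().split())


def evaluate(model, texts, labels):
    """Accuracy of a fitted pipeline on raw, uncleaned texts."""
    return accuracy_score(labels, model.predict(texts))


# Made-up training and held-out examples (1 = accepted, 0 = not accepted).
train_text = [
    "You need a transit visa for this airport",
    "Thanks, great question",
    "Apply at the embassy with your passport",
    "Me too, same problem here",
    "The train leaves from the main station every hour",
    "No idea, sorry",
]
train_y = [1, 0, 1, 0, 1, 0]
test_text = ["Take the train from the   station", "sorry, no idea"]
test_y = [1, 0]

# The vectorizer calls clean_text on every document it sees, so fit() and
# predict() both go through exactly the same preprocessing.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(preprocessor=clean_text)),
    ("clf", LogisticRegression()),
])
pipeline.fit(train_text, train_y)

predictions = pipeline.predict(test_text)
test_accuracy = evaluate(pipeline, test_text, test_y)
print(predictions, test_accuracy)
```

With this structure, scoring the real held-out file reduces to loading it, selecting the same columns, and calling `evaluate` on the fitted pipeline.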


<img src="http://imgur.com/xDpSobf.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

## 6. Let's Model - Tournament for stock market predictions

>Start this section of the project by downloading the train and test datasets from the following site: https://numer.ai/rules

> - The data set is clean; your goal is to develop one or more classification models.
> - Report all the results, including log loss and any other coefficients you consider interesting.
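For reference, binary log loss is the mean negative log-likelihood of the true labels under the predicted probabilities. The plain-Python sketch below (made-up labels and probabilities) shows the computation; `sklearn.metrics.log_loss` computes the same quantity, and the clipping constant `eps` here is an arbitrary choice.

```python
import math


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of binary labels given predicted P(y=1)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip so log(0) never occurs
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


labels = [1, 0, 1, 0]                  # made-up targets
confident = [0.9, 0.1, 0.8, 0.2]       # mostly right, and sure of it
hesitant = [0.6, 0.4, 0.6, 0.4]        # right, but barely
uninformed = [0.5, 0.5, 0.5, 0.5]      # no information at all

print(log_loss(labels, confident))
print(log_loss(labels, hesitant))
print(log_loss(labels, uninformed))    # equals ln(2) for a 50/50 guess
```

Because the metric rewards well-calibrated probabilities rather than hard class labels, submit `predict_proba` output rather than 0/1 predictions when log loss is the score.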