<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 15px; height: 80px">

# Project 7

## NLP and Machine Learning on [travel.stackexchange.com](http://travel.stackexchange.com/) data

---

In Project 7 you'll be doing NLP and machine learning on post data from Stack Exchange's travel subdomain.

This project is set up like a mini Kaggle competition. You are given the training data, and when projects are submitted your model will be tested on the held-out test data. There will be prizes for the people whose models perform best on the held-out test set!

---

## Notes on the data

The data is again compressed into the `.7z` file format to save space. There are 6 .csv files and one readme file that contains some information on the fields.

    posts_train.csv
    comments_train.csv
    users.csv
    badges.csv
    votes_train.csv
    tags.csv
    readme.txt
    
The data is located in your datasets folder:

    DSI-SF-2/datasets/stack_exchange_travel.7z
    
If you're interested in where this data came from and where to get more data from other stackexchange subdomains, see here:

https://ia800500.us.archive.org/22/items/stackexchange/readme.txt


### Recommended Utilities for .7z

- For OS X, try [Keka](http://www.kekaosx.com/en/) or [The Unarchiver](http://wakaba.c3.cx/s/apps/unarchiver.html).
- For Windows, [7-zip](http://www.7-zip.org/) is the standard.
- For Linux, try the `p7zip` utility: `sudo apt-get install p7zip`.



In [1]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
from gensim import corpora, models, matutils
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression, Lasso, Ridge
from collections import defaultdict
from sklearn.feature_extraction import text
from sklearn.ensemble import BaggingClassifier, BaggingRegressor
from sklearn.cross_validation import cross_val_score, train_test_split

In [2]:
# Folder holding the extracted .csv files; adjust to your own machine.
data_dir = '/home/llevin/Desktop/DSI-SF-2-llevin16/Projects/Project 7/Data/stack_exchange_travel/'
posts = pd.read_csv(data_dir + 'posts_train.csv')
comments = pd.read_csv(data_dir + 'comments_train.csv')
tags = pd.read_csv(data_dir + 'tags.csv')

In [3]:
posts.head(3).T

Unnamed: 0,0,1,2
AcceptedAnswerId,393,,770
AnswerCount,4,1,5
Body,<p>My fiancée and I are looking for a good Car...,<p>Singapore Airlines has an all-business clas...,<p>Another definition question that interested...
ClosedDate,2013-02-25T23:52:47.953,,
CommentCount,4,1,0
CommunityOwnedDate,,,
CreationDate,2011-06-21T20:19:34.730,2011-06-21T20:24:57.160,2011-06-21T20:25:56.787
FavoriteCount,,,2
Id,1,4,5
LastActivityDate,2012-05-24T14:52:14.760,2013-01-09T09:55:22.743,2012-10-12T20:49:08.110


<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

### 1. Use LDA to find what topics are discussed on travel.stackexchange.com.

---

Text can be found in the posts and the comments datasets. For an answer post, the `ParentId` column in the posts dataset gives the `Id` of the question it answers. Comment text can be merged onto the post it belongs to using the `PostId` field.

The text may have some HTML tags. BeautifulSoup has convenient ways to get rid of markup or extract text if you need to. You can also parse the strings yourself if you like.
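If you want to parse the strings yourself, the following is a minimal sketch that uses only the standard library's `html.parser` instead of BeautifulSoup. The `TextExtractor` class and `strip_html` function are hypothetical helper names, not part of the project code, and the sketch is written for Python 3.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment (hypothetical helper)."""

    def __init__(self):
        HTMLParser.__init__(self, convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text between tags.
        self.chunks.append(data)


def strip_html(raw):
    """Return the visible text of an HTML string with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(raw)
    parser.close()
    # Join chunks with spaces so adjacent paragraphs do not run together,
    # then collapse any repeated whitespace (including newlines).
    return " ".join(" ".join(parser.chunks).split())


sample = "<p>Is a <strong>visa</strong> needed?</p>\n<p>I hold a UK passport.</p>"
clean = strip_html(sample)
```

Joining on a space is a deliberate choice here: deleting newlines outright can glue the last word of one paragraph to the first word of the next, which changes the tokens a vectorizer sees.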

The tags dataset has the "tags" that the users have officially given the post.

**1.1 Implement LDA against the text features of the dataset(s).**

- This can be posts or a combination of posts and comments if you want more power.
- Find optimal **K/num_topics**.
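One possible way to compare candidate values of K is to fit a model for each and compare perplexity. The sketch below is an illustration on a toy corpus, and it swaps in scikit-learn's `LatentDirichletAllocation` (assuming a scikit-learn version that has the `n_components` parameter) rather than gensim, purely to stay short. Perplexity is only one criterion; topic coherence or simply reading the topics are reasonable alternatives.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus standing in for the cleaned post and comment text.
docs = [
    "visa passport schengen embassy visa",
    "passport visa border country entry",
    "luggage bag carry security liquid",
    "bag luggage baggage airline carry",
    "train ticket station buy ticket",
    "ticket card pay buy train",
] * 5

X = CountVectorizer().fit_transform(docs)

perplexity_by_k = {}
for k in (2, 3, 4):
    model = LatentDirichletAllocation(n_components=k, random_state=0)
    model.fit(X)
    # Lower perplexity is better. Scoring held-out documents rather than
    # the training matrix would give a fairer comparison.
    perplexity_by_k[k] = model.perplexity(X)

best_k = min(perplexity_by_k, key=perplexity_by_k.get)
```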

**1.2 Compare your topics to the tags. Do the LDA topics make sense? How do they compare to the tags?**
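To compare topics with tags you first need tag frequencies. A minimal sketch, assuming the `Tags` column stores tags wrapped in angle brackets such as `<visas><uk>` (check a few rows of your data before relying on this); `split_tags` and the example strings are hypothetical.

```python
import re
from collections import Counter

# Hypothetical examples of the Tags column; the "<tag>" wrapping is an assumption.
tag_strings = ["<visas><schengen>", "<air-travel><luggage>", "<visas><uk>", None]


def split_tags(raw):
    """Return the tag names inside a string such as '<visas><uk>'."""
    if not isinstance(raw, str):
        return []  # answer posts have no tags (NaN / None)
    return re.findall(r"<([^<>]+)>", raw)


tag_counts = Counter(tag for raw in tag_strings for tag in split_tags(raw))
top_tags = tag_counts.most_common(3)
```

The most common tags can then be read side by side with the top words that `print_topics` reports for each topic.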


In [4]:
# Work on an explicit copy so the assignments below do not trigger SettingWithCopyWarning.
post_lda = posts[['Body','Id','ParentId','Title']].copy()
# Strip HTML markup from each post body, then drop newlines.
post_lda['Body'] = post_lda['Body'].map(lambda x: BeautifulSoup(str(x),'html.parser').get_text())
post_lda['Body'] = post_lda['Body'].map(lambda x: x.replace('\n',''))


In [5]:
comments_lda = comments[['PostId','Text']]

In [6]:
sequence = [post_lda['Body'],post_lda['Title'],comments_lda['Text']]
lda = pd.concat(sequence)
lda.dropna(inplace=True)

In [7]:
lda.isnull().sum()

0

In [8]:
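# Drop every document that contains a digit anywhere in its text.
# This removes whole documents, not just the numeric tokens.
lda = lda[~lda.map(lambda x: any(i.isdigit() for i in x))]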

In [9]:
lda

1        Singapore Airlines has an all-business class f...
2        Another definition question that interested me...
4        We are considering visiting Argentina for up t...
6        I need to travel from Cusco, Peru to La Paz, B...
7        I am aware of travel agencies catering to US c...
8        My wife and I have decided to move across Euro...
9        EURail should be a good place to plan the trip...
12       I'm looking for data plans I can use while tou...
13       I have heard rumours of such things, and have ...
14       When traveling, one of my favorite things to d...
15       Possible Duplicate:What are the best ways to a...
17       The next time family from overseas comes to vi...
18       Round-the-world fares do exist.Most airline al...
19       I'm dreaming of visiting remote islands such a...
20       Tropical destinations have to be one of the mo...
22       This listing should get you started: http://ww...
23       Google's your friend. I would definitely searc.

In [10]:
stop_words = text.ENGLISH_STOP_WORDS.union(['http','www','don','ve','com','yeah','just','like','think','en','org',
                                           'https','wiki','phoog'])
len(stop_words)

332

In [11]:
vectorizer = CountVectorizer(stop_words=stop_words)
X = vectorizer.fit_transform(lda)

In [12]:
# Invert the vectorizer's term -> column mapping into column -> term for gensim's id2word.
vocab = {v: k for k, v in vectorizer.vocabulary_.iteritems()}

In [13]:
lda_model = models.LdaModel(
    matutils.Sparse2Corpus(X, documents_columns=False),
    num_topics  =  8,
    passes      =  10,
    id2word     =  vocab
)

In [14]:
lda_model.print_topics(num_words=6)

[(0,
  u'0.038*visa + 0.021*passport + 0.017*uk + 0.016*country + 0.013*need + 0.011*schengen'),
 (1,
  u'0.016*luggage + 0.010*security + 0.010*bag + 0.010*water + 0.008*baggage + 0.008*carry'),
 (2,
  u'0.023*time + 0.014*day + 0.014*flight + 0.013*airport + 0.010*train + 0.009*going'),
 (3,
  u'0.013*people + 0.010*op + 0.009*know + 0.007*say + 0.006*make + 0.005*airline'),
 (4,
  u'0.029*car + 0.011*drive + 0.010*driving + 0.009*road + 0.009*voting + 0.009*wikipedia'),
 (5,
  u'0.008*good + 0.008*people + 0.008*english + 0.008*know + 0.007*travel + 0.006*hotel'),
 (6,
  u'0.023*ticket + 0.016*card + 0.013*tickets + 0.012*pay + 0.011*use + 0.010*buy'),
 (7,
  u'0.056*question + 0.046*answer + 0.022*thanks + 0.012*travel + 0.010*information + 0.010*questions')]

<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

### 2. What makes an answer likely to be "accepted"?

---

**2.1 Build a model to predict whether a post will be marked as the answer.**

- This is a classification problem.
- You're free to use any of the machine learning algorithms or techniques we have learned in class to build the best model you can.
- NLP will be very useful here for pulling out useful and relevant features from the data. 
- Though not required, using bagging and boosting models like Random Forests and Gradient Boosted Trees will _probably_ get you the highest performance on the test data (but who knows!).


**2.2 Evaluate the performance of your classifier with a confusion matrix and accuracy. Explain how your model is performing.**

**2.3 Plot either a ROC curve or precision-recall curve (or both!) and explain what they tell you about your model.**
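The evaluation steps in 2.2 and 2.3 can be sketched as follows. This is an illustration on synthetic data from `make_classification`, not the project data, and it assumes a scikit-learn version that provides `sklearn.model_selection`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary problem standing in for accepted / not accepted.
X, y = make_classification(n_samples=500, n_features=10, weights=[0.7, 0.3], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
pred = clf.predict(X_test)
proba = clf.predict_proba(X_test)[:, 1]  # predicted probability of class 1

cm = confusion_matrix(y_test, pred)  # rows: true class, columns: predicted class
acc = accuracy_score(y_test, pred)
# Accuracy of always guessing the majority class, for comparison with acc.
baseline = max(y_test.mean(), 1 - y_test.mean())

fpr, tpr, thresholds = roc_curve(y_test, proba)
auc = roc_auc_score(y_test, proba)
```

Plotting `fpr` against `tpr` (for example with matplotlib) gives the ROC curve. Comparing `acc` with `baseline` matters here because most answers are not accepted, so a model can look accurate while predicting 0 for everything.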

NOTE: You should only be predicting this for posts with `PostTypeId == 2`, which are the "answer" posts. This doesn't mean, however, that you can't or shouldn't use the parent questions as predictors!
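One way to bring the parent question onto each answer and build the target is a merge on `ParentId`. The sketch below uses a tiny made-up frame with only the columns it needs; the values are invented for illustration.

```python
import pandas as pd

# Tiny stand-in for posts_train.csv; values are made up.
posts = pd.DataFrame({
    "Id": [1, 2, 3, 4, 5],
    "PostTypeId": [1, 2, 2, 1, 2],
    "ParentId": [None, 1, 1, None, 4],
    "AcceptedAnswerId": [3, None, None, None, None],
    "Title": ["Cruise?", None, None, "Visa?", None],
})

questions = posts.loc[posts["PostTypeId"] == 1, ["Id", "AcceptedAnswerId", "Title"]]
answers = posts.loc[posts["PostTypeId"] == 2, ["Id", "ParentId"]]
# Every answer has a parent, so the key can be cast to match the integer Id.
answers = answers.astype({"ParentId": "int64"})

# One row per answer, with its parent question's columns alongside.
qa = answers.merge(
    questions,
    left_on="ParentId",
    right_on="Id",
    suffixes=("_answer", "_question"),
)

# 1 when this answer is the one the question marked as accepted.
# Questions without an accepted answer hold NaN, which never compares equal.
qa["target"] = (qa["AcceptedAnswerId"] == qa["Id_answer"]).astype(int)
```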


In [15]:
answers = posts[posts['PostTypeId']==2]
questions = posts[posts['PostTypeId']==1]

In [16]:
len(answers)

23967

In [17]:
answers.isnull().sum()

AcceptedAnswerId         23967
AnswerCount              23967
Body                         0
ClosedDate               23967
CommentCount                 0
CommunityOwnedDate       23818
CreationDate                 0
FavoriteCount            23967
Id                           0
LastActivityDate             0
LastEditDate             15015
LastEditorDisplayName    23650
LastEditorUserId         15089
OwnerDisplayName         23262
OwnerUserId                403
ParentId                     0
PostTypeId                   0
Score                        0
Tags                     23967
Title                    23967
ViewCount                23967
dtype: int64

In [18]:
answers = answers[['Body','CommentCount','CreationDate','Id','LastActivityDate','OwnerUserId','ParentId','PostTypeId',
                  'Score']]

In [19]:
len(questions)

13988

In [20]:
questions.isnull().sum()

AcceptedAnswerId          7472
AnswerCount                  0
Body                         0
ClosedDate               11361
CommentCount                 0
CommunityOwnedDate       13975
CreationDate                 0
FavoriteCount            10466
Id                           0
LastActivityDate             0
LastEditDate              2911
LastEditorDisplayName    13621
LastEditorUserId          3231
OwnerDisplayName         13613
OwnerUserId                307
ParentId                 13988
PostTypeId                   0
Score                        0
Tags                         0
Title                        0
ViewCount                    0
dtype: int64

In [21]:
questions = questions[['AcceptedAnswerId','AnswerCount','Body','CommentCount','CreationDate','Id','LastActivityDate',
                      'LastEditDate','LastEditorUserId','OwnerUserId','PostTypeId','Score','Tags','Title','ViewCount']]

In [22]:
merged_posts = questions.merge(answers,how='inner',left_on='Id',right_on='ParentId')

In [23]:
merged_posts.rename(columns={'Body_x':'question_body','CommentCount_x':'question_comment_count','Score_x':'question_score',
                            'Body_y':'answer_body','CommentCount_y':'answer_comment_count','Score_y':'answer_score'},
                   inplace=True)

In [24]:
merged_posts['CreationDate_y'] = pd.to_datetime(merged_posts['CreationDate_y'])
merged_posts['CreationDate_x'] = pd.to_datetime(merged_posts['CreationDate_x'])

In [25]:
merged_posts['answer_time'] = merged_posts['CreationDate_y']-merged_posts['CreationDate_x']
merged_posts['answer_time'] = (merged_posts['answer_time']/np.timedelta64(1,'D')).astype(float)

In [26]:
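# target = 1 when this answer's Id equals the question's AcceptedAnswerId.
# A NaN difference (no accepted answer) is not equal to 0, so it maps to 0.
merged_posts['target'] = merged_posts['AcceptedAnswerId'] - merged_posts['Id_y']
merged_posts['target'] = merged_posts['target'].map(lambda x: 1 if x==0 else 0)

In [27]:
# last_editor = 1 when the question's last editor is the author of this answer.
merged_posts['last_editor'] = merged_posts['LastEditorUserId']-merged_posts['OwnerUserId_y']
merged_posts['last_editor'] = merged_posts['last_editor'].map(lambda x: 1 if x==0 else 0)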

In [28]:
merged_posts.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 23967 entries, 0 to 23966
Data columns (total 27 columns):
AcceptedAnswerId          13700 non-null float64
AnswerCount               23967 non-null float64
question_body             23967 non-null object
question_comment_count    23967 non-null int64
CreationDate_x            23967 non-null datetime64[ns]
Id_x                      23967 non-null int64
LastActivityDate_x        23967 non-null object
LastEditDate              19729 non-null object
LastEditorUserId          18965 non-null float64
OwnerUserId_x             23242 non-null float64
PostTypeId_x              23967 non-null int64
question_score            23967 non-null int64
Tags                      23967 non-null object
Title                     23967 non-null object
ViewCount                 23967 non-null float64
answer_body               23967 non-null object
answer_comment_count      23967 non-null int64
CreationDate_y            23967 non-null datetime64[ns]
Id_y       

In [29]:
merged_posts['Text'] = merged_posts['question_body']+merged_posts['Tags']+merged_posts['answer_body']+merged_posts['Title']
merged_posts['Text'] = merged_posts['Text'].map(lambda x: BeautifulSoup(str(x),'html.parser').get_text())
merged_posts['Text'] = merged_posts['Text'].map(lambda x: x.replace('\n',''))

In [30]:
merged_posts.head(3).T

Unnamed: 0,0,1,2
AcceptedAnswerId,393,393,393
AnswerCount,4,4,4
question_body,<p>My fiancée and I are looking for a good Car...,<p>My fiancée and I are looking for a good Car...,<p>My fiancée and I are looking for a good Car...
question_comment_count,4,4,4
CreationDate_x,2011-06-21 20:19:34.730000,2011-06-21 20:19:34.730000,2011-06-21 20:19:34.730000
Id_x,1,1,1
LastActivityDate_x,2012-05-24T14:52:14.760,2012-05-24T14:52:14.760,2012-05-24T14:52:14.760
LastEditDate,2011-12-28T21:36:43.910,2011-12-28T21:36:43.910,2011-12-28T21:36:43.910
LastEditorUserId,101,101,101
OwnerUserId_x,9,9,9


In [31]:
# Keep terms that appear in at least 5% and at most 90% of documents.
vectorizer = CountVectorizer(stop_words=stop_words,max_df=.9,min_df=.05)
t_vectorizer = TfidfVectorizer(max_features=100, stop_words=stop_words,max_df=.9,min_df=.05)
X = vectorizer.fit_transform(merged_posts['Text'])
X_t = t_vectorizer.fit_transform(merged_posts['Text'])
# Only the TF-IDF features are used below; swap in the next line to use raw counts instead.
# docs = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names())
docs = pd.DataFrame(X_t.toarray(), columns=t_vectorizer.get_feature_names())

In [32]:
# docs2 = pd.DataFrame(X_t.toarray(), columns=t_vectorizer.get_feature_names())
# print docs2.shape
# merged_posts.shape

In [33]:
qa_final = merged_posts[['AnswerCount','question_comment_count','question_score','ViewCount','answer_comment_count',
                            'answer_score','answer_time','target','last_editor']]

In [34]:
# Rebind to an explicit copy before adding the TF-IDF columns, which avoids SettingWithCopyWarning.
qa_final = qa_final.copy()
qa_final[[col for col in docs.columns]] = docs[[col for col in docs.columns]]


In [35]:
qa_final.head(3).T

Unnamed: 0,0,1,2
AnswerCount,4.000000,4.000000,4.000000
question_comment_count,4.000000,4.000000,4.000000
question_score,8.000000,8.000000,8.000000
ViewCount,361.000000,361.000000,361.000000
answer_comment_count,1.000000,0.000000,0.000000
answer_score,7.000000,2.000000,3.000000
answer_time,2.369750,66.159911,81.784922
target,1.000000,0.000000,0.000000
last_editor,0.000000,1.000000,0.000000
10,0.000000,0.000000,0.000000


In [36]:
y = qa_final['target']
X = qa_final[[col for col in qa_final.columns if col not in ['target']]]

print X.shape, y.shape

(23967, 108) (23967,)


In [37]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [38]:
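bag_log = BaggingClassifier(LogisticRegression())

# NOTE: cross_val_score clones and refits the estimator on every fold, so the
# fit on X_train is not reused; these are 5-fold accuracies within X_test only.
log_scores = cross_val_score(bag_log.fit(X_train,y_train),X_test,y_test,cv=5)

print 'Fold Scores:', log_scores
print 'Cross Val Score:', np.mean(log_scores)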

Fold Scores: [ 0.76302988  0.75399583  0.76286509  0.76634214  0.78079332]
Cross Val Score: 0.765405252797
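A common alternative is to cross-validate on the training split and keep the test split for a single final score. The following is a minimal sketch on synthetic data, assuming a scikit-learn version that provides `sklearn.model_selection`; it is not a rerun of the model above, so the numbers it produces are not comparable to the fold scores shown.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the feature matrix and accepted-answer target.
X, y = make_classification(n_samples=600, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

bag_log = BaggingClassifier(LogisticRegression(max_iter=1000), random_state=0)

# Model selection: each fold fits a fresh clone on part of the training split.
cv_scores = cross_val_score(bag_log, X_train, y_train, cv=5)

# Final check: fit once on all training rows, score once on untouched rows.
test_accuracy = bag_log.fit(X_train, y_train).score(X_test, y_test)
```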


<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

### 3. What is the score of a post?

---

**3.1 Build a model that predicts the score of a post.**

- This is a regression problem now. 
- You can and should be predicting score for both "question" and "answer" posts, so keep them both in your dataset.
- Again, use any techniques that you think will get you the best model.

**3.2 Evaluate the performance of your model with cross-validation and report the results.**

**3.3 What is important for determining the score of a post, if anything?**
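For 3.3, one simple option with a linear model is to standardize the predictors and rank the fitted coefficients by absolute size. The sketch below uses synthetic data in which the target is constructed from two of three columns; the column names echo the project's features but the values are invented, and `alpha=0.1` is an arbitrary choice.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = pd.DataFrame({
    "CommentCount": rng.poisson(2, 300),
    "post_length": rng.normal(800, 200, 300),
    "noise": rng.normal(0, 1, 300),
})
# Synthetic target: depends on the first two columns only.
y = 3.0 * X["CommentCount"] + 0.01 * X["post_length"] + rng.normal(0, 1, 300)

# Standardizing puts coefficients on a comparable scale across predictors.
X_std = StandardScaler().fit_transform(X)
model = Lasso(alpha=0.1).fit(X_std, y)

importance = pd.Series(model.coef_, index=X.columns)
# Order predictors from largest to smallest absolute coefficient.
ranked = importance.reindex(importance.abs().sort_values(ascending=False).index)
```

Without standardizing, a predictor measured in large units (such as character counts) gets a small coefficient regardless of how much it matters, so raw coefficients are hard to compare.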


Plan:

- Use posts with `PostTypeId` 1 and 2.
- Target: `Score`.
- Model: regularized regression with bagging.

In [39]:
scores = posts[posts['PostTypeId']<=2]

In [40]:
scores['PostTypeId'].value_counts()

2    23967
1    13988
Name: PostTypeId, dtype: int64

In [41]:
scores.isnull().sum()

AcceptedAnswerId         31439
AnswerCount              23967
Body                         0
ClosedDate               35328
CommentCount                 0
CommunityOwnedDate       37793
CreationDate                 0
FavoriteCount            34433
Id                           0
LastActivityDate             0
LastEditDate             17926
LastEditorDisplayName    37271
LastEditorUserId         18320
OwnerDisplayName         36875
OwnerUserId                710
ParentId                 13988
PostTypeId                   0
Score                        0
Tags                     23967
Title                    23967
ViewCount                23967
dtype: int64

In [42]:
# Rebind to an explicit copy so the column assignments below do not raise SettingWithCopyWarning.
scores = scores.copy()
# 1 if the post is a question with an accepted answer, else 0.
scores['accepted_answer'] = scores['AcceptedAnswerId'].map(lambda x: 0 if np.isnan(x) else 1)
# 1 if the last editor is the post's owner, else 0 (a NaN difference maps to 0).
scores['last_editor'] = scores['LastEditorUserId']-scores['OwnerUserId']
scores['last_editor'] = scores['last_editor'].map(lambda x: 1 if x==0 else 0)
# 1 if the post has been closed (x != x is True only for NaN).
scores['status'] = scores['ClosedDate'].map(lambda x: 0 if x!=x else 1)
scores['LastActivityDate'] = pd.to_datetime(scores['LastActivityDate'])
scores['CreationDate'] = pd.to_datetime(scores['CreationDate'])
# Days between creation and last activity.
scores['active_time'] = scores['LastActivityDate'] - scores['CreationDate']
scores['active_time'] = (scores['active_time']/np.timedelta64(1,'D')).astype(float)
# Length of the raw HTML body in characters.
scores['post_length'] = scores['Body'].map(len)
In [43]:
scores['Text'] = scores['Body'] + scores['Title'] + scores['Tags']
scores['Text'] = scores['Text'].map(lambda x: BeautifulSoup(str(x),'html.parser').get_text())
scores['Text'] = scores['Text'].map(lambda x: x.replace('\n',''))


In [44]:
scores['Text'][0]

u"My fianc\xe9e and I are looking for a good Caribbean cruise in October and were wondering which islands are best to see and which Cruise line to take?It seems like a lot of the cruises don't run in this month due to Hurricane season so I'm looking for other good options.EDIT We'll be travelling in 2012.What are some Caribbean cruises for October?"
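Row 0 is a question, so `Body`, `Title` and `Tags` are all present. Answer posts have no `Title` or `Tags` (see the null counts above), and in pandas a string added to NaN is NaN, which `str()` then turns into the literal text `nan`. That appears to be why `nan` later shows up as a feature. A minimal sketch with made-up rows, including one hedged way around it:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Body": ["<p>Which islands?</p>", "<p>Try Aruba.</p>"],
    "Title": ["Caribbean cruise", np.nan],  # answers have no title
    "Tags": ["<cruising>", np.nan],         # ...and no tags
})

# Naive concatenation: any NaN in the row makes the whole result NaN.
naive = df["Body"] + df["Title"] + df["Tags"]

# Fill missing pieces with empty strings first, and separate the parts
# with spaces so words from different columns do not run together.
filled = df[["Body", "Title", "Tags"]].fillna("")
joined = filled["Body"] + " " + filled["Title"] + " " + filled["Tags"]
```

If the tags are stored wrapped in angle brackets, as in the made-up `<cruising>` above, an HTML parser will likely treat them as markup and drop them, so you may want to handle the `Tags` column separately from the HTML stripping.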

In [54]:
vectorizer = CountVectorizer(max_features=100,stop_words=stop_words,max_df=.9,min_df=.05)
t_vectorizer = TfidfVectorizer(max_features=100, stop_words=stop_words,max_df=.9,min_df=.05)
X = vectorizer.fit_transform(scores['Text'])
X_t = t_vectorizer.fit_transform(scores['Text'])
# docs = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names())
docs = pd.DataFrame(X_t.toarray(), columns=t_vectorizer.get_feature_names())

In [57]:
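scores = scores[['AnswerCount','CommentCount','FavoriteCount','Score','PostTypeId','ViewCount','accepted_answer',
                'last_editor','status','active_time','post_length']]
# NOTE: docs has a fresh 0..N-1 index while scores keeps its row labels from posts.
# This assignment aligns on labels, so rows without a matching label get NaN.
scores[[col for col in docs.columns]] = docs[[col for col in docs.columns]]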

In [66]:
scores.isnull().sum()

AnswerCount        23967
CommentCount           0
FavoriteCount      34433
Score                  0
PostTypeId             0
ViewCount          23967
accepted_answer        0
last_editor            0
status                 0
active_time            0
post_length            0
does                3253
flight              3253
know                3253
nan                 3253
need                3253
time                3253
travel              3253
trip                3253
visa                3253
want                3253
dtype: int64
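The identical null count in every word column is consistent with an index mismatch rather than genuinely missing text: pandas aligns column assignment on row labels, not on position. A minimal sketch of the effect with a toy frame (all names and values are made up):

```python
import pandas as pd

posts = pd.DataFrame({"Score": [5, 1, 7, 2]}, index=[0, 1, 2, 3])
subset = posts[posts["Score"] > 1].copy()  # keeps row labels 0, 2, 3

# Features built from the subset's text get fresh labels 0, 1, 2.
features = pd.DataFrame({"visa": [0.1, 0.2, 0.3]})

misaligned = subset.copy()
misaligned["visa"] = features["visa"]  # aligns on labels, not position

aligned = subset.copy()
aligned["visa"] = features["visa"].values  # positional: row i gets feature i
```

In `misaligned`, the row labelled 2 receives the third feature value instead of the second, and the row labelled 3 receives NaN. Passing `index=subset.index` when building the feature frame is another way to keep the two in step.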

In [69]:
def impute_nulls(col):
    # Replace NaN with 0.0; NaN is the only value for which x != x.
    return col.map(lambda x: 0. if x!=x else x)

scores = scores.apply(impute_nulls)

In [70]:
scores.isnull().sum()

AnswerCount        0
CommentCount       0
FavoriteCount      0
Score              0
PostTypeId         0
ViewCount          0
accepted_answer    0
last_editor        0
status             0
active_time        0
post_length        0
does               0
flight             0
know               0
nan                0
need               0
time               0
travel             0
trip               0
visa               0
want               0
dtype: int64

In [71]:
y = scores['Score'].values
X = scores[[col for col in scores.columns if col not in ['Score']]]

print X.shape, y.shape

(37955, 20) (37955,)


In [72]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [73]:
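bag_lasso = BaggingRegressor(Lasso())

# As in section 2, cross_val_score refits a clone on each fold, so this is
# 5-fold CV within X_test only. For regressors the default score is R^2.
lasso_scores = cross_val_score(bag_lasso.fit(X_train,y_train),X_test,y_test,cv=5)

print 'Fold Scores:', lasso_scores
print 'Cross Val Score:', np.mean(lasso_scores)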

Fold Scores: [ 0.40835707  0.35097866  0.20671942  0.25514116  0.46865318]
Cross Val Score: 0.337969895882


<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

### 4. How many views does a post have?

---

**4.1 Build a model that predicts the number of views a post has.**

- This is another regression problem. 
- Predict the views for all posts, not just the "answer" posts.

**4.2 Evaluate the performance of your model with cross-validation and report the results.**

**4.3 What is important for the number of views a post has, if anything?**

In [77]:
views = posts[~posts['ViewCount'].isnull()]

In [78]:
views.isnull().sum()

AcceptedAnswerId          7472
AnswerCount                  0
Body                         0
ClosedDate               11361
CommentCount                 0
CommunityOwnedDate       13975
CreationDate                 0
FavoriteCount            10466
Id                           0
LastActivityDate             0
LastEditDate              2911
LastEditorDisplayName    13621
LastEditorUserId          3231
OwnerDisplayName         13613
OwnerUserId                307
ParentId                 13988
PostTypeId                   0
Score                        0
Tags                         0
Title                        0
ViewCount                    0
dtype: int64

In [79]:
# Rebind to an explicit copy so the column assignments below do not raise SettingWithCopyWarning.
views = views.copy()
views['accepted_answer'] = views['AcceptedAnswerId'].map(lambda x: 0 if np.isnan(x) else 1)
views['last_editor'] = views['LastEditorUserId']-views['OwnerUserId']
views['last_editor'] = views['last_editor'].map(lambda x: 1 if x==0 else 0)
views['status'] = views['ClosedDate'].map(lambda x: 0 if x!=x else 1)
views['LastActivityDate'] = pd.to_datetime(views['LastActivityDate'])
views['CreationDate'] = pd.to_datetime(views['CreationDate'])
views['active_time'] = views['LastActivityDate'] - views['CreationDate']
views['active_time'] = (views['active_time']/np.timedelta64(1,'D')).astype(float)
views['post_length'] = views['Body'].map(len)
In [80]:
views['Text'] = views['Body'] + views['Title'] + views['Tags']
views['Text'] = views['Text'].map(lambda x: BeautifulSoup(str(x),'html.parser').get_text())
views['Text'] = views['Text'].map(lambda x: x.replace('\n',''))


In [82]:
vectorizer = CountVectorizer(max_features=100,stop_words=stop_words,max_df=.9,min_df=.05)
t_vectorizer = TfidfVectorizer(max_features=100, stop_words=stop_words,max_df=.9,min_df=.05)
X = vectorizer.fit_transform(views['Text'])
X_t = t_vectorizer.fit_transform(views['Text'])
# docs = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names())
docs = pd.DataFrame(X_t.toarray(), columns=t_vectorizer.get_feature_names())

In [83]:
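views = views[['AnswerCount','CommentCount','Score','PostTypeId','ViewCount','accepted_answer','last_editor','status',
              'active_time','post_length']]
# NOTE: docs has a fresh 0..N-1 index while views keeps its row labels from posts.
# This assignment aligns on labels, so rows without a matching label get NaN.
views[[col for col in docs.columns]] = docs[[col for col in docs.columns]]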

In [85]:
views.isnull().sum()

AnswerCount            0
CommentCount           0
Score                  0
PostTypeId             0
ViewCount              0
accepted_answer        0
last_editor            0
status                 0
active_time            0
post_length            0
able               10050
airport            10050
apply              10050
best               10050
buy                10050
canada             10050
car                10050
card               10050
case               10050
check              10050
citizen            10050
city               10050
countries          10050
country            10050
day                10050
days               10050
different          10050
does               10050
enter              10050
entry              10050
                   ...  
return             10050
schengen           10050
stay               10050
sure               10050
ticket             10050
tickets            10050
time               10050
tourist            10050
train              10050


In [86]:
views = views.apply(impute_nulls)

In [87]:
views.isnull().sum()

AnswerCount        0
CommentCount       0
Score              0
PostTypeId         0
ViewCount          0
accepted_answer    0
last_editor        0
status             0
active_time        0
post_length        0
able               0
airport            0
apply              0
best               0
buy                0
canada             0
car                0
card               0
case               0
check              0
citizen            0
city               0
countries          0
country            0
day                0
days               0
different          0
does               0
enter              0
entry              0
                  ..
return             0
schengen           0
stay               0
sure               0
ticket             0
tickets            0
time               0
tourist            0
train              0
transit            0
travel             0
traveling          0
travelling         0
trip               0
uk                 0
usa                0
use          

In [88]:
views.reset_index(inplace=True,drop=True)

In [89]:
y = views['ViewCount'].values
X = views[[col for col in views.columns if col not in ['index','ViewCount']]]

print X.shape, y.shape

(13988, 91) (13988,)


In [96]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [97]:
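bag_lasso = BaggingRegressor(Lasso())

# cross_val_score refits a clone on each fold, so this is 5-fold CV within
# X_test only. The scores reported are R^2 values.
lasso_scores = cross_val_score(bag_lasso.fit(X_train,y_train),X_test,y_test,cv=5)

print 'Fold Scores:', lasso_scores
print 'Cross Val Score:', np.mean(lasso_scores)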

Fold Scores: [ 0.12411385  0.14358584  0.14505438  0.12839379  0.22943835]
Cross Val Score: 0.154117241272


<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

### 5. Build a pipeline or other code to automate evaluation of your models on the test data.

---

Now that you've constructed your three predictive models, build a pipeline or code that can easily load up the raw testing data and evaluate your models on it.

The testing data that is held out is in the same raw format as the training data you have. _Any cleaning and preprocessing that you did on the training data will need to be done on the testing data as well!_

This is a good opportunity to practice building pipelines, but you're not required to. Custom functions and classes are fine as long as they are able to process and test the new data.
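As a starting point, here is a minimal sketch of a scikit-learn `Pipeline` that learns its text preprocessing from the training data and replays it on held-out data. The texts and labels are toy stand-ins, and the two steps are only an example; your own cleaning functions would need to run on both sets before this point or be wrapped as pipeline steps.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline

# Toy stand-ins for cleaned training and held-out text; labels are made up.
train_text = [
    "visa required for schengen entry",
    "you need a passport and visa",
    "cheap train ticket to paris",
    "buy the train ticket online",
    "visa on arrival is possible",
    "the ticket office sells train passes",
]
train_y = [1, 1, 0, 0, 1, 0]
test_text = ["do i need a visa", "where to buy a train ticket"]
test_y = [1, 0]

pipe = Pipeline([
    # The vocabulary and IDF weights are learned from the training text only.
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(train_text, train_y)

# The same fitted steps are replayed on unseen text; nothing is refit here.
test_pred = pipe.predict(test_text)
test_accuracy = accuracy_score(test_y, test_pred)
```

Keeping the vectorizer inside the pipeline means the test data can never influence the vocabulary, and scoring new data takes a single `predict` call.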
