# Feature Engineering Homework 
***
**Name**: $<$**Juan Lin**$>$ 

**Kaggle Username**: $<$**mzcolor**$>$
***

This assignment is due on Moodle by **5pm on Friday February 23rd**. Additionally, you must make at least one submission to the **Kaggle** competition before it closes at **4:59pm on Friday February 23rd**. Submit only this Jupyter notebook to Moodle. Do not compress it using tar, rar, zip, etc. Your solutions to analysis questions should be done in Markdown directly below the associated question.  Remember that you are encouraged to discuss the problems with your instructors and classmates, but **you must write all code and solutions on your own**.  For a refresher on the course **Collaboration Policy** click [here](https://github.com/chrisketelsen/CSCI5622-Machine-Learning/blob/master/resources/syllabus.md#collaboration-policy)



## Overview 
***

When people are discussing popular media, there’s a concept of spoilers. That is, critical information about the plot of a TV show, book, or movie that “ruins” the experience for people who haven’t read / seen it yet.

The goal of this assignment is to do text classification on forum posts from the website [tvtropes.org](http://tvtropes.org/), to predict whether a post is a spoiler or not. We'll be using the logistic regression classifier provided by sklearn.

Unlike previous assignments, the code provided with this assignment has all of the functionality required. Your job is to make the functionality better by improving the features the code uses for text classification.

**NOTE**: Because the goal of this assignment is feature engineering, not classification algorithms, you may not change the underlying algorithm or its parameters.

This assignment is structured in a way that approximates how classification works in the real world: Features are typically underspecified (or not specified at all). You, the data digger, have to articulate the features you need. You then compete against others to provide useful predictions.

It may seem straightforward, but do not start this at the last minute. There are often many things that go wrong in testing out features, and you'll want to make sure your features work well once you've found them.


In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline 

## Kaggle In-Class Competition 
***

In addition to turning in this notebook on Moodle, you'll also need to submit your predictions on Kaggle, an online tournament site for machine learning competitions. The competition page can be found here:  

[https://www.kaggle.com/c/feature-engineering-csci-5622-spring-2018](https://www.kaggle.com/c/feature-engineering-csci-5622-spring-2018)

Additionally, a private invite link for the competition has been posted to Piazza. 

The starter code below has a `model_predict` method which produces a two column CSV file that is correctly formatted for Kaggle (predictions.csv). It should have the example Id as the first column and the prediction (`True` or `False`) as the second column. If you change this format your submissions will be scored as zero accuracy on Kaggle. 
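
As a quick sanity check on the submission format described above, the following sketch writes a few made-up predictions in the same two-column layout and reads them back. The `Id` and `spoiler` column names follow the starter code; the prediction values and the in-memory buffer are assumptions used only for illustration.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical predictions; real ones would come from logreg.predict(X_test).
pred = np.array([1, 0, 1])

# Same layout as the starter code: the row index becomes "Id",
# and the boolean label goes in the "spoiler" column.
buffer = io.StringIO()
pd.DataFrame({"spoiler": pred.astype(bool)}).to_csv(
    buffer, index=True, index_label="Id"
)

# Read the CSV back and confirm the header and the shape.
buffer.seek(0)
check = pd.read_csv(buffer)
print(list(check.columns))
print(check.shape)
```

Running a check like this before uploading is cheap, given that only three submissions are allowed per day.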

**Note**: You may only submit **THREE** predictions to Kaggle per day.  Instead of using the public leaderboard as your sole evaluation process, **it is highly recommended that you perform local evaluation using a validation set or cross-validation.**

### [25 points] Problem 1: Feature Engineering 
***

The `FeatEngr` class is where the magic happens.  In its current form it will read in the training data and vectorize it using simple Bag-of-Words.  It then trains a model and makes predictions.  

25 points of your grade will be generated from your performance on the classification competition on Kaggle. The performance will be evaluated on accuracy on the held-out test set. Half of the test set is used to evaluate accuracy on the public leaderboard.  The other half of the test set is used to evaluate accuracy on the private leaderboard (which you will not be able to see until the close of the competition). 

You should be able to significantly improve on the baseline system (i.e. the predictions made by the starter code we've provided) as reported by the Kaggle system.  Additionally, the top **THREE** students from the **PRIVATE** leaderboard at the end of the contest will receive 5 extra credit points towards their Problem 1 score.


**Problem 1 Answer**

In [1]:
from numpy import array
import random
import re
import nltk
from nltk import ne_chunk, pos_tag, word_tokenize
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn import preprocessing
from sklearn.preprocessing import FunctionTransformer, Normalizer
from sklearn.pipeline import FeatureUnion, Pipeline

In [2]:
class FeatEngr:
    def __init__(self):
        
        self.vectorizer = FeatureUnion([       
#                 Feature 1 (disabled): sentence words using CountVectorizer
#                 ('sentence words count', 
#                   Pipeline([('extract_sentence_column', FunctionTransformer(lambda x: x[0], validate = False)),
#                             ('count_sentence', CountVectorizer(ngram_range=(1,2), token_pattern=r'\b\w+\b', min_df=1))])),
#                 Feature 2: sentence words using TF-IDF
                ('sentence words tfidf', 
                  Pipeline([('extract_sentence_column', FunctionTransformer(lambda x: x[0], validate = False)),
                            ('count_sentence', TfidfVectorizer(analyzer='word', ngram_range=(1,2), lowercase=True, norm='l2', stop_words='english'))])),            
#                 Feature 3: trope column using TfidfVectorizer
                ('trope type', 
                  Pipeline([('extract_trope_column', FunctionTransformer(lambda x: x[2], validate = False)),
                            ('count_trope', TfidfVectorizer(analyzer='word', ngram_range=(1,2), lowercase=True, norm='l2', stop_words='english'))])), 
#                Feature 4: page words in sentence column, customized Transformer
#                 ('page words in sentence column',             
#                   Pipeline([('extract_page_sentence_columns', FunctionTransformer(lambda x: [x[0], x[1]], validate = False)), 
#                             ('count_page_sentence', PageWordsTransformer())])),
            ])

    def build_train_features(self, examples):
        """
        Method to take in training text features and do further feature engineering 
        Most of the work in this homework will go here, or in similar functions  
        :param examples: three parallel lists: sentences, pages, and tropes
        Fits the FeatureUnion vectorizer on these columns and returns the training feature matrix.
        """
        return self.vectorizer.fit_transform(examples)

    def get_test_features(self, examples):
        """
        Method to take in test text features and transform the same way as train features 
        :param examples: three parallel sequences (sentences, pages, tropes), same layout as the training input  
        """
        return self.vectorizer.transform(examples)
    
    def show_top10(self):
        """
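        Prints the 7 highest-weighted features for the positive class and the 
        7 lowest-weighted features for the negative class (the slices below use 7, 
        despite the method name). Call only after train_predict_model, which sets self.logreg. 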
        """
        feature_names = np.asarray(self.vectorizer.get_feature_names())
        top10 = np.argsort(self.logreg.coef_[0])[-7:]
        bottom10 = np.argsort(self.logreg.coef_[0])[:7]
        print("Pos: %s" % " ".join(feature_names[top10]))
        print("Neg: %s" % " ".join(feature_names[bottom10]))

    
    def train_predict_model(self, random_state=1234):
        """
        Method to read in training data from file, and 
        train Logistic Regression classifier. 
        
        :param random_state: seed for random number generator 
        """
        
        from sklearn.linear_model import LogisticRegression 

        # get training features and labels 
        dfTrain = pd.read_csv("../data/spoilers/train.csv")  

        temp = [list(dfTrain['sentence']), list(dfTrain['page']), list(dfTrain['trope'])]

        self.X_train = self.build_train_features(temp)
                                                 
        self.y_train = np.array(dfTrain["spoiler"], dtype=int)
        
        # train logistic regression model.  !!You MAY NOT CHANGE THIS!! 
        self.logreg = LogisticRegression(random_state=random_state)
        self.logreg.fit(self.X_train, self.y_train)
        
        # get test features
        dfTest = pd.read_csv("../data/spoilers/test.csv")
       
        temp1 = (list(dfTest['sentence']), list(dfTest['page']), list(dfTest['trope']))

        self.X_test = self.get_test_features(temp1)
                 
        pred = self.logreg.predict(self.X_test)
        
        
        
        # dump predictions to file for submission to Kaggle  
        pd.DataFrame({"spoiler": np.array(pred, dtype=bool)}).to_csv("prediction.csv", index=True, index_label="Id")

        # There are two approaches for validation: cross-validation with 5 folds (used below),
        # or the score method on a held-out split from sklearn's train_test_split (commented out further down).

        # calculate accuracy using cross-validation
        from sklearn.model_selection import cross_val_score
        scores = cross_val_score(self.logreg, self.X_train, self.y_train, cv=5)
        print("The score for each fold:", scores)
        print("The average accuracy is: %0.3f (+/- %0.3f, two standard deviations)" % (scores.mean(), scores.std() * 2))
        
#         from sklearn.model_selection import train_test_split        
#         X_train_crova, X_test_crova, y_train_crova, y_test_crova = train_test_split(self.X_train, self.y_train, test_size=0.2, random_state=None, shuffle=True)
#         clf = self.logreg.fit(X_train_crova, y_train_crova)
#         accuracy_score = clf.score(X_test_crova, y_test_crova)
#         print("The accuracy score is: %0.3f" % accuracy_score)
        
        
        

In [3]:
# Count the words of the page title that also appear in the sentence column
class PageWordsTransformer(BaseEstimator, TransformerMixin):
    
    def __init__(self):
        pass
    
    def fit(self, example, y=None):
        # stateless transformer; y is accepted so it can be used inside a Pipeline
        return self
    
    def transform(self, example):
        
        import numpy as np
        from scipy.sparse import csr_matrix
       
        # example[0] holds the sentences, example[1] the page titles
        X = np.zeros((len(example[0]), 1))

        for item, value in enumerate(example[0]):
            sentence_words_row = nltk.word_tokenize(value)
            # insert a space before each capital letter to split the CamelCase page title
            page_separate = re.sub(r'([A-Z])', r' \1', example[1][item])
            
            page_words = nltk.word_tokenize(page_separate)
            
            # page-title words that also appear in the sentence
            same_words = [i for i in page_words if i in sentence_words_row]
            
            X[item, :] = len(same_words)
            
        # normalize and return once every row has been filled in;
        # row-wise L2 normalization of a single column maps each nonzero count to 1
        X = preprocessing.normalize(X, norm='l2')
        return csr_matrix(X)
     

In [None]:
# Instantiate the FeatEngr class 
feat = FeatEngr()
# Train the Logistic Regression classifier, write the predictions, and print the cross-validated accuracy
feat.train_predict_model(random_state=1230)
# feat.show_top10()  # only valid after training, since it reads feat.logreg

### [25 points] Problem 2: Motivation and Analysis 
***

The job of the written portion of the homework is to convince the grader that:

- Your new features work
- You understand what the new features are doing
- You had a clear methodology for incorporating the new features

Make sure that you have examples and quantitative evidence that your features are working well. Be sure to explain how you used the data (e.g., did you have a validation set? did you do cross-validation?) and how you inspected the results. In addition, it is very important that you show some kind of an **error analysis** throughout your process.  That is, you should demonstrate that you've looked at misclassified examples and put thought into how you can craft new features to improve your model. 

A sure way of getting a low grade is simply listing what you tried and reporting the Kaggle score for each. You are expected to pay more attention to what is going on with the data and take a data-driven approach to feature engineering.

**Problem 2 answer:**

For the feature engineering homework, I have tried the following so far:

* **Case1**: sentence column, using CountVectorizer with unigram and bigram features (`ngram_range=(1,2)`), here called Feature 1.
* **Case2**: sentence column, replacing CountVectorizer with TF-IDF to down-weight uninformative words and normalize the vectors, here called Feature 2.
* **Case3**: trope column, to see which tropes are associated with spoilers, here called Feature 3.
* **Case4**: page and sentence columns combined, counting the page-title words that appear in the sentence to test whether they help separate spoilers from non-spoilers, here called Feature 4.
* **Case5**: FeatureUnion of Feature 2 and Feature 3.
* **Case6**: FeatureUnion of Feature 1, Feature 2, and Feature 3.
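
The idea behind Feature 4 can be sketched without nltk as a plain function. This is a simplified variant under stated assumptions, not the transformer used above: the helper names `split_camel_case` and `page_word_overlap` are hypothetical, matching is case-insensitive, and distinct words are counted rather than every occurrence.

```python
import re


def split_camel_case(page):
    """Split a CamelCase page title such as 'BreakingBad' into lowercase words."""
    return [w.lower() for w in re.findall(r"[A-Z][a-z0-9]*|[a-z0-9]+", page)]


def page_word_overlap(sentence, page):
    """Count the distinct page-title words that also occur in the sentence."""
    sentence_words = set(re.findall(r"\w+", sentence.lower()))
    return len(set(split_camel_case(page)) & sentence_words)


print(split_camel_case("BreakingBad"))
print(page_word_overlap("Nothing bad happens in Breaking Bad.", "BreakingBad"))
print(page_word_overlap("He dies at the end.", "BreakingBad"))
```

A count like this could be wrapped in a transformer and added to the FeatureUnion as one extra column.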

| $\texttt{}$ | $\texttt{Prediction Accuracy}$ | $\texttt{Test Accuracy}$ | $\texttt{Difference}$ | $\texttt{Overfitting}$ |
|:----:|:----:|:----:|:----:|:----:|
| **Case1** | 0.666 | 0.659 | 0.007 | No |
| **Case2** | 0.683 | 0.661 | 0.022 | No |
| **Case3** | 0.721 | 0.637 | 0.084 | No |
| **Case4** | 0.529 | 0.521 | 0.008 | No |
| **Case5** | 0.729 | 0.716 | 0.019 | No |
| **Case6** | 0.740 | 0.701 | 0.039 | Yes |

My approach was to first predict spoilers from the sentence column alone, and then to examine how the trope and page columns relate to the sentence column. I found that the page column is the weakest predictor of spoilers. In Case5 and Case6 the model improved considerably. In Case6, however, the gap between the prediction accuracy and the test accuracy suggests some overfitting. Five-fold cross-validation was performed in this study.
Possible future improvements include gathering additional training data and adding more features (for example, media genres), so that the model has a larger pool of information to learn from.
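
One caveat about the cross-validation used here: the features are vectorized on the full training set before `cross_val_score` is called, so the vocabulary and IDF weights have already seen every validation fold. A minimal sketch of an alternative, assuming the vectorizer is placed inside a `Pipeline` so that it is refit on each training split, is shown below. The sentences and labels are made up for illustration, and the classifier keeps its default parameters as the assignment requires.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Made-up sentences and labels (1 = spoiler), only to make the sketch runnable.
sentences = [
    "he dies at the end", "she is killed in the finale",
    "it turns out he is the killer", "the villain dies in season two",
    "the hero is killed by his brother", "it turns out she was dead all along",
    "the show is set in london", "the theme song is catchy",
    "the cast is very large", "the pilot aired in the fall",
    "the series has five seasons", "the costumes are colorful",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# The vectorizer is a pipeline step, so each fold fits it on its own training split.
model = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("logreg", LogisticRegression(random_state=1234)),
])

scores = cross_val_score(model, sentences, labels, cv=3)
print("Fold accuracies:", scores)
print("Mean accuracy: %0.3f" % scores.mean())
```

With the real data, the same pattern would wrap the whole FeatureUnion rather than a single vectorizer.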

In general, there are some baseline features. The most straightforward are the words present in the sentence. Unigram and bigram features both contribute to the prediction; generally speaking, bigrams perform better than unigrams alone, which makes sense because bigrams use consecutive pairs of tokens and result in a much bigger feature space. Below are the top unigram and bigram features ranked by coefficient value.

| $\texttt{Unigram}$ | $\texttt{Bigram}$ |
|:----:|:----:|
| killed | the end |
| freya | with the |
| dies | the season |
| Harvey | turns out |
| turns | to kill |
| morgana | in the |
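
A ranking like the one in the table can be produced by sorting the logistic regression coefficients and splitting the feature names into unigrams and bigrams. The following is a minimal sketch on made-up sentences; the helper `top_k` and the toy data are assumptions, and the printed features will differ from the table.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up sentences and labels (1 = spoiler), only to make the sketch runnable.
sentences = [
    "he dies at the end", "she is killed in the finale",
    "it turns out he is the killer", "the villain dies in season two",
    "the show is set in london", "the theme song is catchy",
    "the cast is very large", "the pilot aired in the fall",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

vectorizer = CountVectorizer(ngram_range=(1, 2), token_pattern=r"\b\w+\b")
X = vectorizer.fit_transform(sentences)
logreg = LogisticRegression(random_state=1234).fit(X, labels)

# get_feature_names_out() is the newer scikit-learn name; older releases
# (such as the one this notebook targets) use get_feature_names().
if hasattr(vectorizer, "get_feature_names_out"):
    names = np.asarray(vectorizer.get_feature_names_out())
else:
    names = np.asarray(vectorizer.get_feature_names())

coef = logreg.coef_[0]
is_bigram = np.array([" " in name for name in names])


def top_k(mask, k=3):
    """Return the k features selected by mask with the largest coefficients."""
    idx = np.where(mask)[0]
    ranked = idx[np.argsort(coef[idx])[::-1]]
    return list(names[ranked[:k]])


top_unigrams = top_k(~is_bigram)
top_bigrams = top_k(is_bigram)
print("Unigram:", top_unigrams)
print("Bigram:", top_bigrams)
```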

                                                                                                                                                                                                                                                                 
| $\texttt{}$ | $\texttt{# of features}$ | $\texttt{Accuracy}$ |
|:----:|:----:|:----:|
| Unigram | 19080 | 0.657 |
| Bigram | 139022 | 0.670 |

Based on the number of features and the accuracy shown above, bigram features are better at capturing contextual information, with a larger feature space. 
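
The growth of the feature space can be checked directly by comparing vocabulary sizes. The sketch below uses two made-up sentences and assumes that the "Bigram" setting means `ngram_range=(1, 2)`, as in the code above, so it includes the unigrams as well.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up sentences, only to make the sketch runnable.
sentences = ["he dies at the end", "it turns out he dies"]

unigram = CountVectorizer(ngram_range=(1, 1), token_pattern=r"\b\w+\b").fit(sentences)
uni_bi = CountVectorizer(ngram_range=(1, 2), token_pattern=r"\b\w+\b").fit(sentences)

n_unigram = len(unigram.vocabulary_)  # 8 distinct words
n_uni_bi = len(uni_bi.vocabulary_)    # 8 words + 7 distinct word pairs = 15

print("Unigram features:", n_unigram)
print("Unigram + bigram features:", n_uni_bi)
```

Even on two short sentences the vocabulary nearly doubles, which is consistent with the jump from 19080 to 139022 features on the real data.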

Adding a couple of features is better than using only the baseline model; however, there are still errors. For example:
* *The first episode deals with one of the season 1 bad guys getting killed by an Eldritch Abomination with telekinesis!* Our model incorrectly predicted this as a non-spoiler. The sentence contains the features *killed* and *the season*; however, across the whole training set, the corresponding trope is mostly labeled as not a spoiler. That is, our classifier is not able to combine the trope information with the content of the sentence. In this case, we need more features, for instance media genre, to improve the model.

* *Even after she returns in the middle of the charade.* After combining the features, our classifier incorrectly predicted this sentence as a spoiler. The reason is that the trope under which the sentence is categorized leans toward spoilers. In this case, filtering the original data is necessary.

* *Detective Paxson's partner, although not killed, was the one to take the fall for a political trap Michael set up for her.* Our classifier predicted this as a spoiler because it was distracted by kill-related words. However, the sentence states that the partner was *not* killed, so its emphasis is on the trap rather than on a killing.
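
Misclassified examples like the ones above can be collected systematically from out-of-fold predictions. The following is a minimal sketch on made-up data, assuming `cross_val_predict` with the vectorizer inside a `Pipeline`; it is not the procedure that produced the examples above.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import Pipeline

# Made-up sentences and labels (1 = spoiler), only to make the sketch runnable.
sentences = [
    "he dies at the end", "she is killed in the finale",
    "it turns out he is the killer", "the villain dies in season two",
    "his partner, although not killed, takes the fall", "the hero is killed",
    "the show is set in london", "the theme song is catchy",
    "the cast is very large", "the pilot aired in the fall",
    "the series has five seasons", "nobody is killed in the pilot",
]
labels = np.array([1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0])

model = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("logreg", LogisticRegression(random_state=1234)),
])

# Each prediction comes from a model that did not see that example in training.
pred = cross_val_predict(model, sentences, labels, cv=3)

wrong = np.where(pred != labels)[0]
for i in wrong:
    kind = "false positive" if pred[i] == 1 else "false negative"
    print("%s: %s" % (kind, sentences[i]))
```

Grouping the false positives and false negatives by trope or by shared words is one way to decide which feature to try next.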

Conclusion: Many sentences were miscategorized because of the lack of effective training data. Data cleaning should be performed at the very beginning. Adding features definitely helps improve the model accuracy, but the features to add need to be carefully filtered to avoid overfitting. 