<center>
<img src="../../img/ods_stickers.jpg" />
    
## [mlcourse.ai](https://mlcourse.ai) – Open Machine Learning Course 
Author: [Yury Kashnitskiy](https://yorko.github.io) (@yorko). Edited by Sergey Kolchenko (@KolchenkoSergey). This material is subject to the terms and conditions of the [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license. Free use is permitted for any non-commercial purpose.

## <center>Assignment #6
### <center> Beating baselines in "How good is your Medium article?"
    
<img src='../../img/medium_claps.jpg' width=40% />


[Competition](https://www.kaggle.com/c/how-good-is-your-medium-article). The task is to beat "A6 baseline" (~1.45 Public LB score). Do not forget about our shared ["primitive" baseline](https://www.kaggle.com/kashnitsky/ridge-countvectorizer-baseline) - you'll find something valuable there.

**Your task:**
 1. "Freeride". Come up with good features to beat the baseline "A6 baseline" (for now, public LB is only considered)
 2. You need to name your [team](https://www.kaggle.com/c/how-good-is-your-medium-article/team) (out of 1 person) in full accordance with the [course rating](https://drive.google.com/open?id=19AGEhUQUol6_kNLKSzBsjcGUU3qWy3BNUg8x8IFkO3Q). You can think of it as a part of the assignment. 16 credits for beating the mentioned baseline and correct team naming.
 
*For discussions, please stick to [ODS Slack](https://opendatascience.slack.com/), channel #mlcourse_ai, pinned thread __#a6__*

In [1]:
import os
import json
from tqdm import tqdm_notebook
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import mean_absolute_error
from scipy.sparse import csr_matrix, hstack
from sklearn.linear_model import Ridge

In [2]:
from nltk.corpus import stopwords
from nltk.stem.lancaster import *
from nltk.stem.snowball import RussianStemmer, EnglishStemmer
from scipy import sparse

In [3]:
stop_words_ru = stopwords.words('russian')

stop_words_eng = stopwords.words('english')

In [4]:
def replace_numeric_with_literal(string, literal='<num> '):
    return re.sub(r'([0-9]+ ?)+', literal, string)


def compact_whitespace(string):
    return re.sub(r'\s+', ' ', string)


def stem(string, stemmer, stop_words):
    return ' '.join([stemmer.stem(word) for word in re.split(' ', string) if not word in stop_words])


def lemmatize(string, lemmatizer, stop_words):
    return ' '.join([lemmatizer.lemmatize(word) for word in re.split(' ', string) if not word in stop_words])

def lower_case(string):
    return string.lower()


def fix_lt(string):
    # fix the HTML-escaped less-than sign
    return re.sub(r'&lt;', '<', string)


def replace_non_alphanumeric_with_space(string):
    # replace punctuation and different whitespace with space character
    return re.sub(r'[^\w0-9\s]', ' ', string)


def strip_punctuation(string):
    # remove punctuation
    return re.sub(r'[^\w0-9\s]', ' ', string)


def remove_stop_words(string, stop_words):
    return ' '.join([word for word in re.split(' ', string) if not word in stop_words])

def pre_process(string):
    s = lower_case(string)
    s = fix_lt(s)
    s = strip_punctuation(s)
    s = remove_stop_words(s, stop_words_ru)
    s = remove_stop_words(s, stop_words_eng)
    s = compact_whitespace(s)
    s = replace_numeric_with_literal(s)
    stemmer = RussianStemmer()
    s = stem(s, stemmer, stop_words_ru)
    stemmer = EnglishStemmer()
    s = stem(s, stemmer, stop_words_eng)
    return s.strip()

The following code will help to throw away all HTML tags from an article content.

In [5]:
from html.parser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.strict = False
        self.convert_charrefs= True
        self.fed = []
    def handle_data(self, d):
        self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

Supplementary function to read a JSON line without crashing on escape characters.

In [6]:
def read_json_line(line=None):
    result = None
    try:        
        result = json.loads(line)
    except Exception as e:      
        # Find the offending character index:
        idx_to_replace = int(str(e).split(' ')[-1].replace(')',''))      
        # Remove the offending character:
        new_line = list(line)
        new_line[idx_to_replace] = ' '
        new_line = ''.join(new_line)     
        return read_json_line(line=new_line)
    return result

Extract features `content`, `published`, `title` and `author`, write them to separate files for train and test sets.

In [7]:
def extract_features_and_write(path_to_data,
                               inp_filename, is_train=True):
    
    features = ['content', 'published', 'title', 'author']
    prefix = 'train' if is_train else 'test'
    feature_files = [open(os.path.join(path_to_data,
                                       '{}_{}.txt'.format(prefix, feat)),
                          'w', encoding='utf-8')
                     for feat in features]
    
    with open(os.path.join(path_to_data, inp_filename), 
              encoding='utf-8') as inp_json_file:

        for line in tqdm_notebook(inp_json_file):
            json_data = read_json_line(line)
            content = json_data['content'].replace('\n', ' ').replace('\r', ' ')
            content = strip_tags(content)
            content = pre_process(content)
            feature_files[0].write(content + '\n')
            
            published = json_data['published']['$date']
            feature_files[1].write(published + '\n')
            
            title = json_data['title'].replace('\n', ' ').replace('\r', ' ')
            title = pre_process(title)
            feature_files[2].write(title + '\n')
            
            author = json_data['author']['twitter']
            feature_files[3].write(str(author) + '\n')

In [8]:
PATH_TO_DATA = 'data' # modify this if you need to

In [9]:
extract_features_and_write(PATH_TO_DATA, 'train.json', is_train=True)

HBox(children=(IntProgress(value=1, bar_style='info', max=1), HTML(value='')))




In [10]:
extract_features_and_write(PATH_TO_DATA, 'test.json', is_train=False)

HBox(children=(IntProgress(value=1, bar_style='info', max=1), HTML(value='')))




**Add the following groups of features:**
    - Tf-Idf with article content (ngram_range=(1, 2), max_features=100000 but you can try adding more)
    - Tf-Idf with article titles (ngram_range=(1, 2), max_features=100000 but you can try adding more)
    - Time features: publication hour, whether it's morning, day, night, whether it's a weekend
    - Bag of authors (i.e. One-Hot-Encoded author names)

In [14]:
def read_file(filename):
    f = open(os.path.join('data', filename), 'r', encoding="utf-8")
    x = f.readlines()
    return x

In [12]:
tfidf_content =TfidfVectorizer(ngram_range=(1, 2), max_features=100000)

In [13]:
X_train_content_sparse = tfidf_content.fit_transform(read_file('train_content.txt'))

In [14]:
tfidf_title =TfidfVectorizer(ngram_range=(1, 2), max_features=100000)
X_train_title_sparse = tfidf_title.fit_transform(read_file('train_title.txt'))

In [12]:
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

In [16]:
le = LabelEncoder()
enc = OneHotEncoder(handle_unknown='ignore')
le.fit(read_file('train_author.txt') + read_file('test_author.txt'))
X_l = le.transform(read_file('train_author.txt')).reshape(-1, 1)
X_train_author_sparse = enc.fit_transform(X_l)

In [11]:
from datetime import datetime

In [10]:
#Time features: publication hour, whether it's morning, day, night, whether it's a weekend
def extract_time_features(date):
    date = date.replace('\n', '').replace('\r', '')
    date = datetime.strptime(date, "%Y-%m-%dT%H:%M:%S.%fZ")
    hour = date.hour
    morning = hour > 5 & hour <= 11
    day = hour > 11 & hour <= 22
    night = hour > 22 | hour <= 5
    weekend = date.weekday() >= 5
    weekday = date.weekday()
    month = date.month
    year = date.month
    year_month = (100 * date.year + date.month) / 1e5
    return np.array([hour, morning, day, night, weekend, weekday, month, year, year_month])

In [53]:
X_train_time_features_sparse = np.array([extract_time_features(str(f)) for f in read_file('train_published.txt')])

In [54]:
X_train_time_features_sparse.shape

(62313, 9)

In [55]:
X_train_content_sparse.shape

(62313, 100000)

In [56]:
X_train_title_sparse.shape

(62313, 100000)

In [57]:
X_train_author_sparse.shape

(62313, 23588)

In [58]:
X_test_content_sparse = tfidf_content.transform(read_file('test_content.txt'))
X_test_title_sparse = tfidf_title.transform(read_file('test_title.txt'))

In [59]:
X_t_l = le.transform(read_file('test_author.txt')).reshape(-1, 1)
X_test_author_sparse = enc.transform(X_t_l)
X_test_time_features_sparse = np.array([extract_time_features(str(f)) for f in read_file('test_published.txt')])

**Join all sparse matrices.**

In [62]:
X_train_sparse = hstack([X_train_content_sparse, X_train_title_sparse,
                         X_train_author_sparse, 
                         X_train_time_features_sparse]).tocsr()

In [61]:
X_test_sparse = hstack([X_test_content_sparse, X_test_title_sparse,
                        X_test_author_sparse, 
                        X_test_time_features_sparse]).tocsr()

**Read train target and split data for validation.**

In [26]:
train_target = pd.read_csv(os.path.join(PATH_TO_DATA, 'train_log1p_recommends.csv'), 
                           index_col='id')
y_train = train_target['log_recommends'].values

In [64]:
train_part_size = int(0.7 * train_target.shape[0])
X_train_part_sparse = X_train_sparse[:train_part_size, :]
y_train_part = y_train[:train_part_size]
X_valid_sparse =  X_train_sparse[train_part_size:, :]
y_valid = y_train[train_part_size:]

**Train a simple Ridge model and check MAE on the validation set.**

In [13]:
from sklearn.linear_model import Ridge

In [66]:
ridge = Ridge(random_state=17)

In [67]:
%%time
ridge.fit(X_train_part_sparse, y_train_part)

Wall time: 2min 26s


Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=17, solver='auto', tol=0.001)

In [68]:
ridge_pred = ridge.predict(X_valid_sparse)

In [69]:
valid_mae = mean_absolute_error(y_valid, ridge_pred)
valid_mae, np.expm1(valid_mae)

(1.0861725987936222, 1.9629120896932477)

**Train the same Ridge with all available data, make predictions for the test set and form a submission file.**

In [70]:
%%time
ridge.fit(X_train_sparse, y_train)

Wall time: 3min 26s


Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=17, solver='auto', tol=0.001)

In [71]:
%%time
ridge_test_pred = ridge.predict(X_test_sparse)

Wall time: 131 ms


In [34]:
def write_submission_file(prediction, filename,
                          path_to_sample=os.path.join(PATH_TO_DATA, 
                                                      'sample_submission.csv')):
    submission = pd.read_csv(path_to_sample, index_col='id')
    
    submission['log_recommends'] = prediction
    submission.to_csv(filename)

In [38]:
write_submission_file(ridge_test_pred, os.path.join(PATH_TO_DATA,
                                                    'assignment6_medium_submission.csv'))

**Now's the time for dirty Kaggle hacks. Form a submission file with all zeros. Make a submission. What do you get if you think about it? How is it going to help you with modifying your predictions?**

In [39]:
write_submission_file(np.zeros_like(ridge_test_pred), 
                      os.path.join(PATH_TO_DATA,
                                   'medium_all_zeros_submission.csv'))

**Modify predictions in an appropriate way (based on your all-zero submission) and make a new submission.**

In [72]:
ridge_test_pred_modif = (ridge_test_pred - ridge_test_pred.mean()) + 4.33328 # You code here

In [73]:
ridge_test_pred_modif.mean()

4.333279999999999

In [75]:
write_submission_file(ridge_test_pred_modif, 
                      os.path.join(PATH_TO_DATA,
                                   'assignment6_medium_submission_with_hack.csv'))

That's it for the assignment. Much more credits will be given to the winners in this competition, check [course roadmap](https://mlcourse.ai/roadmap). Do not spoil the assignment and the competition - don't share high-performing kernels (with MAE < 1.5).

Some ideas for improvement:

- Engineer good features, this is the key to success. Some simple features will be based on publication time, authors, content length and so on
- You may not ignore HTML and extract some features from there
- You'd better experiment with your validation scheme. You should see a correlation between your local improvements and LB score
- Try TF-IDF, ngrams, Word2Vec and GloVe embeddings
- Try various NLP techniques like stemming and lemmatization
- Tune hyperparameters. In our example, we've left only 50k features and used C=1 as a regularization parameter, this can be changed
- SGD and Vowpal Wabbit will learn much faster
- Play around with blending and/or stacking. An intro is given in [this Kernel](https://www.kaggle.com/kashnitsky/ridge-and-lightgbm-simple-blending) by @yorko 
- In our course, we don't cover neural nets. But it's not obliged to use GRUs/LSTMs/whatever in this competition.

Good luck!

<img src='../../img/kaggle_shakeup.png' width=50%>

In [15]:
def extract_time_features(date):
    date = date.replace('\n', '').replace('\r', '')
    date = datetime.strptime(date, "%Y-%m-%dT%H:%M:%S.%fZ")
    hour = date.hour
    morning = hour > 5 & hour <= 11
    day = hour > 11 & hour <= 22
    night = hour > 22 | hour <= 5
    weekend = date.weekday() >= 5
    return np.array([hour, morning, day, night, weekend])

In [16]:
le = LabelEncoder()
enc = OneHotEncoder(handle_unknown='ignore')
le.fit(read_file('train_author.txt') + read_file('test_author.txt'))
X_l = le.transform(read_file('train_author.txt')).reshape(-1, 1)
X_train_author_sparse = enc.fit_transform(X_l)

X_t_l = le.transform(read_file('test_author.txt')).reshape(-1, 1)
X_test_author_sparse = enc.transform(X_t_l)

In [17]:
tfidf_content =TfidfVectorizer(ngram_range=(1, 2), max_features=100000)
tfidf_title =TfidfVectorizer(ngram_range=(1, 2), max_features=100000)

In [18]:
X_train_content_sparse = tfidf_content.fit_transform(read_file('train_content.txt'))
X_train_title_sparse = tfidf_title.fit_transform(read_file('train_title.txt'))


In [19]:
X_test_content_sparse = tfidf_content.transform(read_file('test_content.txt'))
X_test_title_sparse = tfidf_title.transform(read_file('test_title.txt'))


In [20]:
X_train_time_features_sparse = np.array([extract_time_features(str(f)) for f in read_file('train_published.txt')])
X_test_time_features_sparse = np.array([extract_time_features(str(f)) for f in read_file('test_published.txt')])


In [21]:
X_train_sparse = hstack([X_train_content_sparse, X_train_title_sparse,
                         X_train_author_sparse, 
                         X_train_time_features_sparse]).tocsr()
#%%
X_test_sparse = hstack([X_test_content_sparse, X_test_title_sparse,
                        X_test_author_sparse, 
                        X_test_time_features_sparse]).tocsr()

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
grid={"alpha":np.logspace(-3,3,7)}#, "penalty":["l1","l2"]}# l1 lasso l2 ridge
logreg=Ridge()
logreg_cv=GridSearchCV(logreg,grid,cv=5,verbose = 5)
logreg_cv.fit(X_train_sparse,y_train)


In [37]:
score(Ridge(random_state=17, alpha=0.001), X_train_sparse,y_train, 'a_0001')
score(Ridge(random_state=17, alpha=0.01), X_train_sparse,y_train, 'a_001')
score(Ridge(random_state=17, alpha=0.1), X_train_sparse,y_train, 'a_001')
score(Ridge(random_state=17, alpha=10), X_train_sparse,y_train, 'a_10')

Score a_0001: 0.6427888902157468
Score a_001: 0.6438818486408597
Score a_001: 0.6605722770055993
Score a_10: 1.4658332842985802


1.4658332842985802

In [None]:
ridge = Ridge(random_state=17)
ridge.fit(X_train_part_sparse, y_train_part)
ridge_pred = ridge.predict(X_valid_sparse)

In [36]:
def score(est, X, y, prefix):
    train_part_size = int(0.7 * y.shape[0])
    X_train_part = X[:train_part_size, :]
    y_train_part = y[:train_part_size]
    X_valid =  X[train_part_size:, :]
    y_valid = y[train_part_size:]
    est.fit(X, y)
    pred = est.predict(X_valid)
    valid_mae = mean_absolute_error(y_valid, pred)
    score = np.expm1(valid_mae)
    print('Score ' + prefix + ':', score) 
    est.fit(X_train_sparse, y_train)
    ridge_test_pred = est.predict(X_test_sparse)
    ridge_test_pred_modif = (ridge_test_pred - ridge_test_pred.mean()) + 4.33328 # 
    write_submission_file(ridge_test_pred_modif, 
                          os.path.join(PATH_TO_DATA, prefix + '.csv'))
    return np.expm1(valid_mae)

In [None]:
#(1.0861725987936222, 1.9629120896932477)
valid_mae = mean_absolute_error(y_valid, ridge_pred)
valid_mae, np.expm1(valid_mae)