<center>
<img src="../../img/ods_stickers.jpg">
## Open Machine Learning Course
<center>
Author: Yury Kashnitsky, Data Scientist at Mail.Ru Group

This material is subject to the terms and conditions of the [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license. Free use is permitted for any non-commercial purpose, provided that the names of the authors and the source are indicated.

## <center>Assignment #6
### <center> Beating benchmarks in "How good is your Medium article?"
    
[Competition](https://www.kaggle.com/c/how-good-is-your-medium-article). The task is to beat "Assignment 6 baseline". Do not forget about our shared ["primitive" baseline](https://github.com/Yorko/mlcourse_open/blob/master/jupyter_english/topic04_linear_models/kaggle_medium_ridge_baseline.ipynb) - you'll find something valuable there.

In [1]:
import os
import json
from tqdm import tqdm_notebook
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
from sklearn.metrics import mean_absolute_error
from scipy.sparse import csr_matrix, hstack
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

The following code strips all HTML tags from an article's content.

In [2]:
from html.parser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        # The base-class constructor resets the parser state;
        # convert_charrefs=True decodes entities such as &amp; into text.
        super().__init__(convert_charrefs=True)
        self.fed = []
    def handle_data(self, d):
        self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()
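
As a quick sanity check, the same idea can be tried on a short snippet. This is only a minimal sketch: `TagStripper` and `strip_tags_demo` are illustrative names that mirror the cell above, and the sample markup is made up.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collects only the text nodes of an HTML document (illustrative helper)."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into plain characters.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_tags_demo(html):
    parser = TagStripper()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.chunks)


sample = '<p>Hello <b>world</b> &amp; co</p>'
print(strip_tags_demo(sample))
```

Only the text between tags survives, and the `&amp;` entity is decoded to a plain ampersand.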

Supplementary function to read a JSON line without crashing on escape characters.

In [3]:
def read_json_line(line=None):
    result = None
    try:        
        result = json.loads(line)
    except Exception as e:      
        # Find the offending character index:
        idx_to_replace = int(str(e).split(' ')[-1].replace(')',''))      
        # Remove the offending character:
        new_line = list(line)
        new_line[idx_to_replace] = ' '
        new_line = ''.join(new_line)     
        return read_json_line(line=new_line)
    return result
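
The repair step relies on the fact that a `json.JSONDecodeError` message ends with `(char N)`. A minimal sketch of one repair pass on a made-up line is shown below; `repair_once` and `bad_line` are hypothetical names, and the exception's `pos` attribute is used as an alternative way to obtain the same index.

```python
import json


def repair_once(line):
    """Parse a JSON line, blanking out one offending character if needed."""
    try:
        return json.loads(line)
    except json.JSONDecodeError as err:
        # The message ends with "(char N)"; err.pos holds the same index.
        idx_from_message = int(str(err).split(' ')[-1].replace(')', ''))
        assert idx_from_message == err.pos
        repaired = line[:err.pos] + ' ' + line[err.pos + 1:]
        return json.loads(repaired)


# A raw tab inside a string is an invalid control character in strict JSON.
bad_line = '{"a": "x\ty"}'
print(repair_once(bad_line))
```

Unlike the recursive function above, this sketch fixes at most one character, which keeps the example short.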

Extract the features `content`, `published`, `title` and `author` (this solution also extracts `description`, `readtime` and `domain`) and write them to separate files for the train and test sets.

In [4]:
def extract_features_and_write(path_to_data,
                               inp_filename, is_train=True):
    
    features = ['content', 'published', 'title', 'author', 'description', 'readtime', 'domain']
    prefix = 'train' if is_train else 'test'
    feature_files = [open(os.path.join(path_to_data,
                                       '{}_{}.txt'.format(prefix, feat)),
                          'w', encoding='utf-8')
                     for feat in features]
    
    with open(os.path.join(path_to_data, inp_filename), 
              encoding='utf-8') as inp_json_file:

        for line in tqdm_notebook(inp_json_file):
            json_data = read_json_line(line)
            
            # Your code here
            content = json_data['content'].replace('\n', ' ').replace('\r', ' ')
            content_no_html_tags = strip_tags(content)
            feature_files[0].write(content_no_html_tags + '\n')
            
            published = json_data['published']['$date']
            feature_files[1].write(published + '\n')
            
            title = json_data['title'].replace('\n', ' ').replace('\r', ' ')
            feature_files[2].write(title + '\n')
            
            author = json_data['author']
            author2 = json_data['meta_tags']['author'].replace('\n', ' ').replace('\r', ' ')
            feature_files[3].write('{},{},{},{}\n'.format(author['name'],
                                                          author['url'],
                                                          author['twitter'],
                                                          author2))
            
            description = json_data['meta_tags']['description'].replace('\n', ' ').replace('\r', ' ')
            feature_files[4].write(description + '\n')
            
            read_time_txt = json_data['meta_tags']['twitter:data1']
            read_time = read_time_txt.split()[0] if read_time_txt.split()[0].isdigit() else '0'
            feature_files[5].write(read_time + '\n')
            
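            domain = json_data['domain'].replace('\n', ' ').replace('\r', ' ')
            feature_files[6].write(domain + '\n')

    # Close the output files so that all buffered lines are flushed to disk.
    for feature_file in feature_files:
        feature_file.close()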
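
The read-time parsing above takes the first token of the `twitter:data1` tag when it is numeric. A minimal sketch of that rule as a standalone helper follows; `parse_read_time` is a hypothetical name, and the `'5 min read'` format is an assumption about what the tag contains. The sketch also guards against an empty string, which the inline version does not.

```python
def parse_read_time(read_time_txt):
    """Return the leading integer of a string like '5 min read', or 0."""
    tokens = read_time_txt.split()
    # Fall back to 0 when the string is empty or does not start with digits.
    return int(tokens[0]) if tokens and tokens[0].isdigit() else 0


for text in ['5 min read', 'n/a', '']:
    print(repr(text), parse_read_time(text))
```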

In [5]:
PATH_TO_DATA = '../../raw_data' # modify this if you need to

In [6]:
#extract_features_and_write(PATH_TO_DATA, 'train.json', is_train=True)

In [7]:
#extract_features_and_write(PATH_TO_DATA, 'test.json', is_train=False)

**Add the following groups of features:**

- Tf-Idf with article content (ngram_range=(1, 2), max_features=100000 but you can try adding more)
- Tf-Idf with article titles (ngram_range=(1, 2), max_features=100000 but you can try adding more)
- Time features: publication hour, whether it's morning, day, night, whether it's a weekend
- Bag of authors (i.e. One-Hot-Encoded author names)
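
For the Tf-Idf groups, the key point is that the vectorizer is fitted on the training texts only and then reused for the test texts, so both matrices share the same columns. A minimal sketch on a made-up toy corpus (the documents and variable names are illustrative only):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ['how to learn machine learning', 'machine learning in production']
test_docs = ['learn deep learning']

# Unigrams and bigrams, capped at the most frequent 100000 terms.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=100000)
X_train_toy = vectorizer.fit_transform(train_docs)  # learns the vocabulary
X_test_toy = vectorizer.transform(test_docs)        # reuses the same vocabulary
print(X_train_toy.shape, X_test_toy.shape)
```

Terms that occur only in the test texts (here `deep`) are simply ignored, because they are not part of the fitted vocabulary.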

In [8]:
# Your code here
def contentFeature():
    tfidf = TfidfVectorizer(ngram_range = (1,3), max_features=50000)
    with open(os.path.join(PATH_TO_DATA, 'train_content.txt'), encoding='utf-8') as input_file:
        X_train = tfidf.fit_transform(input_file)
    with open(os.path.join(PATH_TO_DATA, 'test_content.txt'), encoding='utf-8') as input_file:
        X_test = tfidf.transform(input_file)
    return (X_train, X_test)

In [9]:
def titleFeature():
    tfidf = TfidfVectorizer(ngram_range = (1,3), max_features=50000)
    with open(os.path.join(PATH_TO_DATA, 'train_title.txt'), encoding='utf-8') as input_file:
        X_train = tfidf.fit_transform(input_file)
    with open(os.path.join(PATH_TO_DATA, 'test_title.txt'), encoding='utf-8') as input_file:
        X_test = tfidf.transform(input_file)
    return (X_train, X_test)

In [10]:
def dateFeature():
    train_df = pd.read_csv(os.path.join(PATH_TO_DATA,'train_published.txt'), header = None, names=['timestamp'])
    test_df = pd.read_csv(os.path.join(PATH_TO_DATA,'test_published.txt'), header = None, names=['timestamp'])
    
    train_idx = train_df.shape[0]
    print(train_df.shape)
    
    times_df = pd.concat([train_df, test_df], ignore_index=True)        
    
    times_df['timestamp'] = pd.to_datetime(times_df['timestamp'])
    times_df['start_month'] = times_df['timestamp'].apply(lambda ts: 100 * ts.year + ts.month)
    times_df['hour'] = times_df['timestamp'].apply(lambda ts: ts.hour).astype(int)
    times_df['morning'] = ((times_df['hour'] >= 7) & (times_df['hour'] < 10)).astype(np.int32)
    times_df['day'] = ((times_df['hour'] >= 10) & (times_df['hour'] < 19)).astype(np.int32)
    times_df['evening'] = ((times_df['hour'] >= 19) & (times_df['hour'] < 22)).astype(np.int32)
    times_df['night'] = ((times_df['hour'] >= 22) | (times_df['hour'] < 7)).astype(np.int32)
    times_df['is_weekend'] = times_df['timestamp'].apply(lambda x: 1 if x.date().weekday() in (5, 6) else 0)
    ohe_weekday_df = pd.get_dummies(times_df['timestamp'].apply(lambda ts: ts.dayofweek), prefix='dayofweek')
    ohe_hour_df = pd.get_dummies(times_df['timestamp'].apply(lambda ts: ts.hour), prefix='hour')
    ohe_daymonth_df = pd.get_dummies(times_df['timestamp'].apply(lambda ts: ts.day), prefix='day')
    times_df = pd.concat([times_df, ohe_weekday_df, ohe_hour_df, ohe_daymonth_df], axis=1)
    times_df.drop(['timestamp'], axis=1, inplace=True)
    
    res_df = StandardScaler().fit_transform(times_df)
    return (res_df[:train_idx], res_df[train_idx:])
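
The time-of-day and weekend flags can also be computed with the vectorized `.dt` accessor instead of `apply`. A minimal sketch on two made-up timestamps, using the same hour boundaries as the function above (`stamps` and `time_toy` are illustrative names):

```python
import pandas as pd

stamps = pd.to_datetime(pd.Series([
    '2017-03-04 08:15:00',   # a Saturday morning
    '2017-03-06 23:40:00',   # a Monday night
]))

time_toy = pd.DataFrame({
    'hour': stamps.dt.hour,
    # between() is inclusive on both ends, so this means 7 <= hour < 10.
    'morning': stamps.dt.hour.between(7, 9).astype(int),
    'night': ((stamps.dt.hour >= 22) | (stamps.dt.hour < 7)).astype(int),
    # dayofweek counts Monday as 0, so 5 and 6 are Saturday and Sunday.
    'is_weekend': (stamps.dt.dayofweek >= 5).astype(int),
})
print(time_toy)
```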

In [11]:
def authorFeature():
    train_df = pd.read_csv(os.path.join(PATH_TO_DATA,'train_author.txt'), header = None, names=['name', 'site', 'twitter', 'author'])
    test_df = pd.read_csv(os.path.join(PATH_TO_DATA,'test_author.txt'), header = None, names=['name', 'site', 'twitter', 'author'])
    
    train_idx = train_df.shape[0]
    
    author_df = pd.concat([train_df, test_df], ignore_index=True)        
    enc = OneHotEncoder()
    labeler = LabelEncoder()
    
    #res_df = pd.get_dummies(author_df['author'], prefix='author')
    res_df = enc.fit_transform(labeler.fit_transform(author_df['author'].ravel()).reshape((author_df.shape[0], 1)))
    
    return (res_df[:train_idx], res_df[train_idx:])
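
An alternative to label-encoding the concatenated train and test sets is to let `OneHotEncoder` work on the strings directly and fit it on the training set only. This is a sketch of a different technique, not the approach used above; the author names are made up.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train_authors = np.array([['alice'], ['bob'], ['alice']])
test_authors = np.array([['bob'], ['carol']])  # 'carol' never appears in train

# handle_unknown='ignore' encodes unseen categories as an all-zero row.
encoder = OneHotEncoder(handle_unknown='ignore')
A_train = encoder.fit_transform(train_authors)
A_test = encoder.transform(test_authors)
print(A_train.shape, A_test.shape)
```

With this variant an author who appears only in the test set gets no column of their own, which is harmless because the model could not have learned a weight for that column anyway.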

In [12]:
def descriptionFeature():
    tfidf = TfidfVectorizer(ngram_range = (1,3), max_features=50000)
    with open(os.path.join(PATH_TO_DATA, 'train_description.txt'), encoding='utf-8') as input_file:
        X_train = tfidf.fit_transform(input_file)
    with open(os.path.join(PATH_TO_DATA, 'test_description.txt'), encoding='utf-8') as input_file:
        X_test = tfidf.transform(input_file)
    return (X_train, X_test)

In [13]:
def domainFeature():
    train_df = pd.read_csv(os.path.join(PATH_TO_DATA,'train_domain.txt'), header = None, names=['domain'])
    test_df = pd.read_csv(os.path.join(PATH_TO_DATA,'test_domain.txt'), header = None, names=['domain'])
    train_idx = train_df.shape[0]
    print(train_df.shape, test_df.shape)
    
    domain_df = pd.concat([train_df, test_df], ignore_index=True)        
    enc = OneHotEncoder()
    labeler = LabelEncoder()
    
    #res_df = pd.get_dummies(domain_df['domain'], prefix='domain')
    label_df = labeler.fit_transform(domain_df['domain'].fillna('empty').ravel()).reshape((domain_df.shape[0], 1))
    res_df = enc.fit_transform(label_df)
    print(res_df.shape)
    print(res_df[:train_idx].shape)
    print(res_df[train_idx:].shape)
    return (res_df[:train_idx], res_df[train_idx:])

In [14]:
def readTimeFeature():
    train_df = pd.read_csv(os.path.join(PATH_TO_DATA,'train_readtime.txt'), header = None, names=['readtime'])
    test_df = pd.read_csv(os.path.join(PATH_TO_DATA,'test_readtime.txt'), header = None, names=['readtime'])
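    # Fit the scaler on the training set only and reuse it for the test set,
    # so that both sets are scaled with the same mean and variance.
    scaler = StandardScaler().fit(train_df)
    return (scaler.transform(train_df), scaler.transform(test_df))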

In [15]:
%%time
X_train_author_sparse, X_test_author_sparse = authorFeature()

CPU times: user 269 ms, sys: 35.8 ms, total: 305 ms
Wall time: 304 ms


In [16]:
%%time
X_train_time_features_sparse, X_test_time_features_sparse = dateFeature()

(62313, 1)
CPU times: user 2.51 s, sys: 242 ms, total: 2.75 s
Wall time: 2.74 s


In [17]:
%%time
X_train_content_sparse, X_test_content_sparse = contentFeature()

CPU times: user 26min 7s, sys: 40.3 s, total: 26min 47s
Wall time: 26min 44s


In [18]:
%%time
X_train_title_sparse, X_test_title_sparse = titleFeature()

CPU times: user 7.07 s, sys: 192 ms, total: 7.26 s
Wall time: 7.25 s


In [19]:
%%time
X_train_domain, X_test_domain = domainFeature()
print(X_train_domain.shape, X_test_domain.shape)

(62313, 1) (34645, 1)
(96958, 247)
(62313, 247)
(34645, 247)
(62313, 247) (34645, 247)
CPU times: user 80.6 ms, sys: 16 µs, total: 80.7 ms
Wall time: 77.9 ms


In [20]:
X_train_readtime, X_test_readtime = readTimeFeature()
print(X_train_readtime.shape, X_test_readtime.shape)

(62313, 1) (34645, 1)


**Join all sparse matrices.**

In [21]:
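# Note: X_train_title_sparse is stacked twice below, so the title features are
# duplicated; the shapes printed further down reflect this.
X_train_sparse = csr_matrix(hstack([X_train_content_sparse, X_train_title_sparse,
                                    X_train_author_sparse, X_train_time_features_sparse, 
                                    X_train_title_sparse, X_train_domain, X_train_readtime]))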

In [22]:
# Note: X_test_title_sparse is stacked twice here as well, matching the train matrix.
X_test_sparse = csr_matrix(hstack([X_test_content_sparse, X_test_title_sparse,
                                   X_test_author_sparse, X_test_time_features_sparse,
                                   X_test_title_sparse, X_test_domain, X_test_readtime]))
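
`scipy.sparse.hstack` accepts a mix of sparse and dense blocks, as long as they have the same number of rows, and can return CSR directly through its `format` argument. A minimal sketch with made-up 2-row blocks:

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

sparse_block = csr_matrix(np.array([[1.0, 0.0], [0.0, 2.0]]))
dense_block = np.array([[0.5], [1.5]])  # e.g. the output of StandardScaler

# format='csr' avoids wrapping the result in a separate csr_matrix() call.
combined = hstack([sparse_block, dense_block], format='csr')
print(combined.shape)
```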

**Read train target and split data for validation.**

In [23]:
train_target = pd.read_csv(os.path.join(PATH_TO_DATA,'train_log1p_recommends.csv'), 
                           index_col='id')
y_train = train_target['log_recommends'].values

In [24]:
train_part_size = int(0.7 * train_target.shape[0])

X_train_part_sparse = X_train_sparse[:train_part_size, :]
y_train_part = y_train[:train_part_size]

X_valid_sparse =  X_train_sparse[train_part_size:, :]
y_valid = y_train[train_part_size:]

In [25]:
print(X_train_part_sparse.shape, X_valid_sparse.shape)

(43619, 194201) (18694, 194201)


**Train a simple Ridge model and check MAE on the validation set.**

In [26]:
# Your code here
from sklearn.linear_model import Ridge,RidgeCV,SGDRegressor

In [31]:
%%time
ridge = RidgeCV(alphas=np.logspace(-5,-1,5), scoring='neg_mean_absolute_error', cv=5)
ridge.fit(X_train_part_sparse, y_train_part);
print(ridge.alpha_)

KeyboardInterrupt: 

In [None]:
%%time
sgd = SGDRegressor(loss='huber', max_iter=1000)
parameters = {'alpha': np.logspace(-4, 4, 10)}

cv = GridSearchCV(estimator=sgd, cv=5, scoring='neg_mean_absolute_error', param_grid = parameters)
cv.fit(X_train_part_sparse, y_train_part)

In [None]:
#print(cv.cv_results_)
print('Best params:', cv.best_params_)
print('Min MAE:', -cv.best_score_)

In [33]:
ridge_pred = ridge.predict(X_valid_sparse)

NotFittedError: This RidgeCV instance is not fitted yet. Call 'fit' with appropriate arguments before using this method.

In [None]:
sgd_pred = cv.predict(X_valid_sparse)

In [None]:
valid_mae = mean_absolute_error(y_valid, ridge_pred)
#valid_mae = mean_absolute_error(y_valid, sgd_pred)
valid_mae, np.expm1(valid_mae)

In [None]:
sgd_mae = mean_absolute_error(y_valid, sgd_pred)
sgd_mae, np.expm1(sgd_mae)
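
The target is stored on a `log1p` scale, so the MAE computed above is an error in log space, and `np.expm1` is the inverse of `np.log1p`. A minimal sketch with made-up recommendation counts (all values are illustrative):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

recommends = np.array([0, 9, 99])              # made-up raw counts
y_true = np.log1p(recommends)                  # target on the log1p scale
y_pred = y_true + np.array([0.5, -0.5, 0.5])   # predictions off by 0.5 each

mae_log = mean_absolute_error(y_true, y_pred)
print(mae_log, np.expm1(mae_log))
```

Applying `np.expm1` to the MAE gives a rough feel for the size of the error on the original scale, although it is not the MAE of the raw counts.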

**Train the same Ridge with all available data, make predictions for the test set and form a submission file.**

In [None]:
%%time
ridge = RidgeCV(alphas=np.logspace(3,7,5), 
                scoring='neg_mean_absolute_error', cv=5)
ridge.fit(X_train_sparse, y_train);
print(ridge.alpha_)
#sgd.fit(X_train_scaled, y_train)

In [None]:
# The model was fitted on the unscaled X_train_sparse, so predict on the
# unscaled test matrix as well.
ridge_test_pred = ridge.predict(X_test_sparse)
#sgd_test_pred = sgd.predict(X_test_sparse)

In [None]:
def write_submission_file(prediction, filename,
                          path_to_sample=os.path.join(PATH_TO_DATA,'sample_submission.csv')):
    submission = pd.read_csv(path_to_sample, index_col='id')
    
    submission['log_recommends'] = prediction
    submission.to_csv(filename)

In [None]:
write_submission_file(ridge_test_pred, 'assignment6_medium_submission_ridge.csv')

**Now's the time for dirty Kaggle hacks. Form a submission file with all zeroes. Make a submission. What do you get if you think about it? How is it going to help you with modifying your predictions?**

In [None]:
write_submission_file(np.zeros_like(ridge_test_pred), 
                      'medium_all_zeros_submission.csv')

**Modify predictions in an appropriate way (based on your all-zero submission) and make a new submission.**

In [None]:
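# Leaderboard MAE of the all-zero submission. The target is non-negative,
# so this equals the mean target value on the scored part of the test set.
mae_zero = 4.33328

mean_submission = ridge_test_pred.mean()
# Shift the predictions so that their mean matches that target mean.
ridge_test_pred_modif = ridge_test_pred + mae_zero - mean_submission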

In [None]:
write_submission_file(ridge_test_pred_modif, 
                      'assignment6_medium_submission_with_hack.csv')
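
Why the hack works can be checked on synthetic data: for a non-negative target, the MAE of an all-zero prediction is exactly the target mean, and shifting biased predictions to that mean reduces the error. The sketch below uses made-up data; `y_test_hidden` is a hypothetical stand-in for the hidden test target.

```python
import numpy as np

rng = np.random.default_rng(0)
y_test_hidden = rng.exponential(scale=3.0, size=1000)   # non-negative stand-in target
pred = y_test_hidden + rng.normal(loc=-1.0, scale=0.5, size=1000)  # biased predictions

# MAE of an all-zero submission equals the mean of a non-negative target.
mae_zero_toy = np.mean(np.abs(y_test_hidden - 0.0))
assert np.isclose(mae_zero_toy, y_test_hidden.mean())

# Shift predictions so that their mean matches the recovered target mean.
pred_shifted = pred + mae_zero_toy - pred.mean()

mae_before = np.abs(pred - y_test_hidden).mean()
mae_after = np.abs(pred_shifted - y_test_hidden).mean()
print(mae_before, mae_after)
```

The shift only helps when the predictions are systematically biased, as they are in this toy setup.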