### Setup

- python: 3.7.0
- environment: recommendations

### Courtesy

- linear kernel: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.linear_kernel.html#sklearn.metrics.pairwise.linear_kernel

- cosine_similarity: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html#sklearn.metrics.pairwise.cosine_similarity

- Contents from Machine Learning Mastery: https://machinelearningmastery.com/

- Cosine Similarity concepts:
  - http://blog.christianperone.com/2013/09/machine-learning-cosine-similarity-for-vector-space-models-part-iii/
  - http://www.site.uottawa.ca/~diana/csi4107/cosine_tf_idf_example.pdf

### Installation

In [61]:
# !pip3 install -U scikit-learn

### Initialization

In [62]:
# Import packages

import numpy as np
import pandas as pd
from time import time
import json
import csv
import math

In [15]:
# "Scrubbed MLM articles.json" holds the whole dataset as JSON.
# We will convert this JSON data into a rows-and-columns format, i.e., CSV.

# Open the file and load its JSON contents; the `with` block closes the file afterwards.
with open("Scrubbed MLM articles.json") as json_file:
    json_contents = json.load(json_file)

In [26]:
# Prepare an array that will hold the list of arrays (that will correspond to rows).
# This has:
#   Column 1: counter/row number.
#   Column 2: title of the document
metadata = []

for index, entry in enumerate(json_contents):
    data = []
    data.append(index + 1)
    data.append(entry['title'])
    metadata.append(data)

In [43]:
# Write the metadata to "dataset.csv".
# newline="" keeps the csv module from emitting blank lines on Windows.

with open("dataset.csv", "w", newline="") as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(['counter', 'title'])
    writer.writerows(metadata)
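
The two steps above (building `metadata` by hand and writing it with `csv.writer`) could also be done with pandas alone. This is a minimal sketch, assuming each JSON entry is a dict with a `title` key, as the loop above implies. `toy_entries` is made-up stand-in data, and the CSV goes to an in-memory buffer instead of `dataset.csv`.

```python
import io

import pandas as pd

# Hypothetical stand-in for json_contents (not the real dataset).
toy_entries = [
    {"title": "First hypothetical article"},
    {"title": "Second hypothetical article"},
]

# One row per entry: a 1-based counter plus the title.
toy_df = pd.DataFrame({
    "counter": range(1, len(toy_entries) + 1),
    "title": [entry["title"] for entry in toy_entries],
})

# index=False keeps the DataFrame index out of the CSV.
buffer = io.StringIO()
toy_df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```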

### Data Exploration

In [44]:
dataset = pd.read_csv("dataset.csv")

In [45]:
display(dataset.head(n=1))

Unnamed: 0,counter,title
0,1,A Gentle Introduction to Early Stopping to Avo...


In [37]:
dataset.shape

(683, 2)

In [38]:
display(dataset.describe())

Unnamed: 0,1
count,683.0
mean,343.0
std,197.309402
min,2.0
25%,172.5
50%,343.0
75%,513.5
max,684.0


##### Data fields

- `counter`: represents the row number.
- `title`: title of the document.

In [48]:
print("Total number of the entries:", len(dataset))

Total number of the entries: 684


##### Quick statistics on title length (in characters)

In [49]:
dataset["title_length"] = dataset['title'].str.len()

In [50]:
display(dataset.head(n=5))

Unnamed: 0,counter,title,title_length
0,1,A Gentle Introduction to Early Stopping to Avo...,97
1,2,How to Reduce Overfitting With Dropout Regular...,62
2,3,A Gentle Introduction to Dropout for Regulariz...,70
3,4,How to Reduce Generalization Error in Deep Neu...,96
4,5,Activation Regularization for Reducing General...,92
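
Once a `title_length` column exists, summary statistics are one call away. The sketch below uses two made-up rows in `toy`, not the real `dataset`, and only illustrates the `str.len()` plus `describe()` pattern.

```python
import pandas as pd

# Hypothetical titles used only for illustration.
toy_titles = [
    "What Is Time Series Forecasting?",
    "Practical Machine Learning Problems",
]
toy = pd.DataFrame({"title": toy_titles})

# Character count per title, as in the cell above.
toy["title_length"] = toy["title"].str.len()

# count, mean, std, min, quartiles and max of the lengths.
summary = toy["title_length"].describe()
print(summary)
```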


### Data Preprocessing

### Shuffle and Split Data

### Model Training

In [55]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [108]:
# stop_words='english' drops common English stop words during vectorization,
# which covers that part of the preprocessing.
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(dataset['title'])

In [109]:
print(X.shape)

(684, 863)
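
The shape above reads as (number of titles, number of distinct terms kept after stop-word removal). The same pattern is easier to inspect on a tiny corpus. This is a sketch only: `toy_titles`, `toy_vectorizer` and `toy_X` are illustrative names, and the three titles are stand-ins for the dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Small stand-in corpus (illustrative only).
toy_titles = [
    "How to Develop LSTM Models for Time Series Forecasting",
    "What Is Time Series Forecasting?",
    "A Gentle Introduction to Dropout",
]

toy_vectorizer = TfidfVectorizer(stop_words="english")
toy_X = toy_vectorizer.fit_transform(toy_titles)

# Rows are documents, columns are vocabulary terms.
print(toy_X.shape)

# The learned vocabulary; stop words such as "to" and "is" are absent.
print(sorted(toy_vectorizer.vocabulary_))
```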


In [118]:
# sklearn's cosine_similarity
from sklearn.metrics.pairwise import cosine_similarity

cosine_sim = cosine_similarity(X, X)
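
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths, so it measures direction and ignores magnitude. The sketch below computes it by hand on three small made-up vectors and compares with scikit-learn. `cosine_by_hand` is a hypothetical helper, not part of any library.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity


def cosine_by_hand(a, b):
    """cos(a, b) = (a . b) / (||a|| * ||b||)"""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])  # same direction as a, twice the length
c = np.array([0.0, 0.0, 3.0])  # orthogonal to a

sim_ab = cosine_by_hand(a, b)  # parallel vectors
sim_ac = cosine_by_hand(a, c)  # orthogonal vectors

# scikit-learn computes the full pairwise matrix in one call.
sk = cosine_similarity(np.vstack([a, b, c]))
print(sim_ab, sim_ac)
print(sk)
```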

In [119]:
# sklearn's linear_kernel
# from sklearn.metrics.pairwise import linear_kernel

# cosine_sim = linear_kernel(X, X)
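
`linear_kernel` is a plain dot product. `TfidfVectorizer` L2-normalizes each row by default (`norm='l2'`), so on its output the dot product already equals the cosine similarity, and the two functions should agree up to floating-point error. The sketch below checks this on a made-up corpus; `toy_titles` and the other names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Small stand-in corpus (illustrative only).
toy_titles = [
    "Time Series Forecasting as Supervised Learning",
    "Python Environment for Time Series Forecasting",
    "Practical Machine Learning Problems",
]

toy_X = TfidfVectorizer(stop_words="english").fit_transform(toy_titles)

# Each TF-IDF row has unit L2 norm under the default settings.
row_norms = np.sqrt(np.asarray(toy_X.multiply(toy_X).sum(axis=1)).ravel())
print(row_norms)

# With unit-norm rows, dot product and cosine similarity coincide.
same = np.allclose(linear_kernel(toy_X, toy_X), cosine_similarity(toy_X, toy_X))
print(same)
```

If the vectorizer were created with `norm=None`, the two results would generally differ, and only `cosine_similarity` would stay within [0, 1] for these non-negative vectors.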

In [143]:
# Construct the reverse map from article title to row index.

# This acts like a hash map: indices[title] returns the row number.
indices = pd.Series(dataset.index, index=dataset['title']).drop_duplicates()
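
One caveat: `Series.drop_duplicates()` compares values, and here the values are row numbers, which are already unique. If a title appeared twice in the data (not verified for this dataset), the lookup would return several rows instead of one. The sketch below shows the effect on a made-up frame and one possible way to keep only the first row per title; `toy` and `unique_indices` are hypothetical names.

```python
import pandas as pd

# Made-up frame with a repeated title.
toy = pd.DataFrame({"title": ["A", "B", "A"]})
toy_indices = pd.Series(toy.index, index=toy["title"])

# drop_duplicates() looks at the values (0, 1, 2), so nothing is removed.
after_drop = toy_indices.drop_duplicates()
print(len(after_drop))
print(after_drop["A"])  # a Series with two rows, not a single number

# Deduplicate on the index (the titles) instead, keeping the first row.
unique_indices = toy_indices[~toy_indices.index.duplicated(keep="first")]
print(unique_indices["A"])
```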

In [141]:
def get_recommendations(title, cosine_sim=cosine_sim):
    # Get the row index of the article that matches the title.
    idx = indices[title]

    # Get the pairwise similarity scores of all articles with that article.
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the articles by similarity score, highest first.
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Keep the 10 best matches, skipping position 0 (normally the article itself).
    sim_scores = sim_scores[1:11]

    # Get the article indices.
    article_indices = [i[0] for i in sim_scores]

    # Return the titles of the top 10 articles.
    return dataset['title'].iloc[article_indices]

In [142]:
get_recommendations("Taxonomy of Time Series Forecasting Problems")

53
[(0, 0.0), (1, 0.0), (2, 0.0), (3, 0.0), (4, 0.0), (5, 0.0), (6, 0.0), (7, 0.0), (8, 0.0), (9, 0.2678122476129378), (10, 0.3253198639664405), (11, 0.26023524042473434), (12, 0.2497830097280152), (13, 0.28396228458908446), (14, 0.0), (15, 0.20752597072896306), (16, 0.22808763288444972), (17, 0.27484752391510603), (18, 0.21225630805799003), (19, 0.21137707280073853), (20, 0.19379089122528415), (21, 0.19732144610677832), (22, 0.2488972281492887), (23, 0.171362588026551), (24, 0.1704360280198961), (25, 0.19011209540388657), (26, 0.23711033941125104), (27, 0.21253294274212248), (28, 0.0), (29, 0.07449993136019066), (30, 0.0), (31, 0.0), (32, 0.12920257767492832), (33, 0.0), (34, 0.0), (35, 0.0), (36, 0.0), (37, 0.0), (38, 0.1457012037262691), (39, 0.07020873553767154), (40, 0.0), (41, 0.24197843450613388), (42, 0.0), (43, 0.0), (44, 0.0), (45, 0.0), (46, 0.20902910816181647), (47, 0.25483447799071896), (48, 0.24669113652980013), (49, 0.29009317907914983), (50, 0.0), (51, 0.0), (52, 0.271

322                     What Is Time Series Forecasting?
284    10 Challenging Machine Learning Time Series Fo...
327          Top Books on Time Series Forecasting With R
321       Time Series Forecasting as Supervised Learning
283       Python Environment for Time Series Forecasting
10     How to Develop LSTM Models for Time Series For...
682                  Practical Machine Learning Problems
298    How to Make Predictions for Time Series Foreca...
272    Feature Selection for Time Series Forecasting ...
309    Autoregression Models for Time Series Forecast...
Name: title, dtype: object
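
The `[1:11]` slice in `get_recommendations` assumes the queried article sorts first. That usually holds, since its similarity with itself is 1, but a tie with an identical title could break it. A slightly more defensive variant removes the query index explicitly. This is a sketch on a made-up similarity row; `top_k_similar` is a hypothetical helper, not something from the notebook or scikit-learn.

```python
import numpy as np


def top_k_similar(sim_row, query_idx, k=10):
    """Return indices of the k highest scores, excluding query_idx."""
    # Stable sort on the negated scores: highest first, ties keep input order.
    order = np.argsort(-sim_row, kind="stable")
    # Drop the query itself wherever it landed.
    order = order[order != query_idx]
    return order[:k]


# Made-up similarity row for a query at position 1.
sim_row = np.array([0.1, 1.0, 0.7, 0.0, 0.7])
top = top_k_similar(sim_row, query_idx=1, k=3)
print(top)
```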

In [106]:
# linear_kernel returns:

# 53          Taxonomy of Time Series Forecasting Problems
# 322                     What Is Time Series Forecasting?
# 682                  Practical Machine Learning Problems
# 327          Top Books on Time Series Forecasting With R
# 321       Time Series Forecasting as Supervised Learning
# 315    How To Backtest Machine Learning Models for Ti...
# 283       Python Environment for Time Series Forecasting
# 323          7 Time Series Datasets for Machine Learning
# 52     How to Develop a Skillful Machine Learning Tim...
# 10     How to Develop LSTM Models for Time Series For...

In [None]:
# cosine_similarity

# 53          Taxonomy of Time Series Forecasting Problems
# 322                     What Is Time Series Forecasting?
# 682                  Practical Machine Learning Problems
# 327          Top Books on Time Series Forecasting With R
# 321       Time Series Forecasting as Supervised Learning
# 315    How To Backtest Machine Learning Models for Ti...
# 283       Python Environment for Time Series Forecasting
# 323          7 Time Series Datasets for Machine Learning
# 52     How to Develop a Skillful Machine Learning Tim...
# 10     How to Develop LSTM Models for Time Series For...