## Recommendation System
Collaborative filtering with implicit feedback, based on latent factors. Prepare the user-item data (each user is a company, identified by its assignee ID) in a format that ALS can use.
Each unique assignee ID should occupy a row of the matrix, and each unique item (patent) ID a column.
Matrix values are, tentatively, the binary user-item preference multiplied by a confidence weight.
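
One common reading of "preference * confidence" follows the implicit-feedback ALS formulation, where a binary preference `p_ui` (1 if an interaction was observed, else 0) is weighted by a confidence `c_ui = 1 + alpha * r_ui`. The sketch below builds both matrices for a small made-up interaction table. The IDs, the `alpha` value, and all variable names are illustrative assumptions, not values taken from this notebook's data.

```python
import numpy as np
import pandas as pd

# hypothetical interactions: one row per (assignee, patent) pair
interactions = pd.DataFrame({
    "assignee": ["org_a", "org_a", "org_b", "org_c"],
    "patent":   ["p1",    "p2",    "p2",    "p3"],
    "rating":   [1,       1,       1,       1],
})

# rows = unique assignees, columns = unique patents, 0 where nothing was observed
r = interactions.pivot_table(index="assignee", columns="patent",
                             values="rating", aggfunc="sum", fill_value=0)

alpha = 30  # assumed confidence scaling factor
preference = (r.to_numpy() > 0).astype(int)  # p_ui: binary preference
confidence = 1 + alpha * r.to_numpy()        # c_ui: confidence in that preference

print(r.shape)
print(preference)
print(confidence)
```

Spark's `ALS` with `implicitPrefs=True` takes the long-format (user, item, rating) rows directly, so the dense matrices here are only meant to make the idea concrete.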

In [60]:
import pyspark
import pyspark.sql.functions as F
from pyspark.sql import Row
from pyspark.sql.types import ArrayType, IntegerType
from pyspark.ml.recommendation import ALS
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

import pandas as pd
import numpy as np

from test_model import (get_patent_fields_list, get_ml_patents,
                        create_title_abstract_col, trim_data,
                        structure_dataframe, partition_dataframe,
                        build_pipeline, process_docs, pat_inv_map, get_topics)

from rec_system import alphanum_to_int, int_to_alphanum

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

import gensim
import gensim.corpora as corpora
from gensim.corpora import Dictionary, mmcorpus
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel
from gensim.models.phrases import Phrases, Phraser
from gensim.models.ldamodel import LdaModel
from gensim.models import AuthorTopicModel
from gensim.test.utils import common_dictionary, datapath, temporary_file
from smart_open import smart_open

import spacy
from spacy.lemmatizer import Lemmatizer
from spacy.lang.en import LEMMA_INDEX, LEMMA_EXC, LEMMA_RULES

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, punkt, RegexpTokenizer, wordpunct_tokenize
from nltk.stem import PorterStemmer, LancasterStemmer, WordNetLemmatizer

import json
from pandas.io.json import json_normalize
import requests
import re
import os
import calendar
from bs4 import BeautifulSoup
import pickle
import math

import matplotlib.pyplot as plt
import pyLDAvis
import pyLDAvis.gensim

from pprint import pprint

%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [3]:
spark = pyspark.sql.SparkSession.builder.getOrCreate()
spark

In [4]:
sc = spark.sparkContext
sc

### Data understanding - Acquire data

#### Data understanding - Acquire data for text workflows

In [5]:
# load pickled dataset
with open('/Users/lee/Documents/techniche/techniche/data/raw_data_1000', 'rb') as f:
    raw_data_1000 = pickle.load(f)

In [6]:
# define keys as criteria to subset dataset #1 for non-text workflows
retained_keys = ['patent_number', 'patent_firstnamed_assignee_id']

# subset raw dataset by desired keys/columns
data_1000 = trim_data(data=raw_data_1000, keys=retained_keys)

In [7]:
# define keys as criteria to subset dataset #2, for text workflows
retained_keys_2 = ['patent_number', 'patent_firstnamed_assignee_id',
                   'patent_title', 'patent_abstract']

# subset raw dataset by desired keys/columns for text analysis workflows
data_1000_2 = trim_data(data=raw_data_1000, keys=retained_keys_2)

#### Data preparation

In [8]:
# create new item in dataset #2 by concat of patent_title and patent_abstract
data_1000_2 = create_title_abstract_col(data=data_1000_2)

In [9]:
# create Pandas dataframe from dataset #1
df_1000 = pd.DataFrame(data_1000)

In [57]:
# create Pandas dataframe from dataset #2
df_1000_2 = pd.DataFrame(data_1000_2)
df_1000_2.head(3)

Unnamed: 0,patent_abstract,patent_firstnamed_assignee_id,patent_number,patent_title,patent_title_abstract
0,"Tools and techniques for the rapid, continuous...",org_VU2IXnxgxGIK8A8oQrwm,10226194,"Statistical, noninvasive measurement of a pati...","Statistical, noninvasive measurement of a pati..."
1,The disclosure relates to structural health mo...,org_9cmRc2rH8nbl8O9VuxYL,10228278,Determining a health condition of a structure,Determining a health condition of a structure....
2,A scenario is defined that including models of...,org_8O8xQifxyiW5pZB2KuDx,10228693,Generating simulated sensor data for training ...,Generating simulated sensor data for training ...


In [11]:
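# for dataset #1: locate rows whose patent_number is not purely numeric
# (reissue patents carry an 'RE' prefix); the pattern '[RE]' is a regex
# character class matching 'R' or 'E', so match the literal string instead
df_1000[df_1000.patent_number.str.contains('RE', regex=False)]
# drop the row that contains invalid data (position 717 in this dataset)
df_1000 = df_1000.drop(df_1000.index[[717]])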

# drop NaNs in patent_firstnamed_assignee_id column
df_1000 = df_1000.dropna()

In [58]:
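# for dataset #2: locate rows whose patent_number is not purely numeric
# (reissue patents carry an 'RE' prefix); the pattern '[RE]' is a regex
# character class matching 'R' or 'E', so match the literal string instead
df_1000_2[df_1000_2.patent_number.str.contains('RE', regex=False)]
# drop the row that contains invalid data (position 717 in this dataset)
df_1000_2 = df_1000_2.drop(df_1000_2.index[[717]])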

# drop NaNs in patent_firstnamed_assignee_id column
df_1000_2 = df_1000_2.dropna()

#### Data preparation - model #1
Prepare the user-item data (each user is a company, identified by its assignee ID) in a format that ALS can use.
Each unique assignee ID should occupy a row of the matrix, and each unique item (patent) ID a column.
Matrix values are, tentatively, the binary user-item preference multiplied by a confidence weight.

In [13]:
# create new rating column and assign value of 1
df_1000['rating'] = 1

In [14]:
# convert patent_number column from string to int
df_1000 = df_1000.astype({'patent_number': 'int64'})
# uncomment to confirm
# df_1000.info()

In [15]:
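# patent_number is already int64 from the previous cell, so this repeat is a no-op;
# the alphanumeric patent_firstnamed_assignee_id is converted to int in the next cell
df_1000 = df_1000.astype({'patent_number': 'int64'})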

In [16]:
# map the alphanumeric assignee ID to an int in [0, 65535]
# caveat: str hashes differ between Python processes unless PYTHONHASHSEED is fixed,
# and the modulo can map distinct IDs to the same value
df_1000['patent_firstnamed_assignee_id'] = df_1000['patent_firstnamed_assignee_id'].apply(hash).apply(abs) % 65536  # 2**16

In [17]:
# reduce patent_number modulo 2**16 (distinct patents can collide after this step)
df_1000['patent_number'] = df_1000['patent_number'] % 65536  # 2**16

In [18]:
df_1000 = df_1000.astype({'patent_firstnamed_assignee_id': 'int'})
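
The hash-and-modulo conversion above is not reproducible across Python sessions and can assign the same integer to different IDs. One alternative, sketched here on a hypothetical frame with the same two column names, is `pd.factorize`, which hands out consecutive integer codes and returns a lookup for mapping them back. The new column names `assignee_int` and `patent_int` are assumptions made for illustration.

```python
import pandas as pd

# hypothetical frame with the same two ID columns as df_1000
df = pd.DataFrame({
    "patent_firstnamed_assignee_id": ["org_b", "org_a", "org_b", "org_c"],
    "patent_number": ["10226194", "10228278", "10228693", "10192549"],
})

# factorize assigns codes 0..n-1 in order of first appearance and returns the lookup
codes, assignee_lookup = pd.factorize(df["patent_firstnamed_assignee_id"])
df["assignee_int"] = codes.astype("int32")

# patent numbers of this size already fit in a 32-bit int, so no modulo is needed
df["patent_int"] = df["patent_number"].astype("int32")

# recover the original alphanumeric ID from an integer code
original_id = assignee_lookup[df.loc[2, "assignee_int"]]
print(df)
print(original_id)
```

Because the codes are unique per ID, recommendations expressed as integers can be translated back to assignee IDs without ambiguity.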

#### Data preparation - model #1 - create Spark dataframe from pandas dataframe

In [19]:
sp_df_1000 = spark.createDataFrame(df_1000)

In [20]:
# cast columns from bigint to int
sp_df_1000_2 = (sp_df_1000
                .withColumn("patent_firstnamed_assignee_id",
                            sp_df_1000["patent_firstnamed_assignee_id"].cast(IntegerType()))
                .withColumn("patent_number",
                            sp_df_1000["patent_number"].cast(IntegerType()))
                .withColumn("rating", sp_df_1000["rating"].cast(IntegerType())))

In [21]:
# partition the int-cast dataframe into training (80%) and test (20%) sets
(training, test) = sp_df_1000_2.randomSplit([0.8, 0.2])

### Model # 1
Build the recommendation model using ALS on the training data

In [22]:
# build ALS recommendation model
als = ALS(maxIter=5,
          regParam=0.01,
          rank=10,  # number of latent factors
          alpha=30,  # confidence scaling for implicit feedback
          implicitPrefs=True,  # ratings are implicit
          userCol="patent_firstnamed_assignee_id",
          itemCol="patent_number",
          ratingCol="rating",
          coldStartStrategy="nan")  # keep NaN predictions for users/items unseen in training

In [23]:
# fit ALS model to the training set
model = als.fit(training)

#### Model #1 - Evaluation - Compare to naive baseline
Compare the model's evaluation result with a naive baseline model. For explicit feedback, the baseline outputs only the average rating (alternatively, try one that outputs the average rating per item).
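
As a rough illustration of the explicit-feedback baselines mentioned here, the sketch below computes RMSE for a global-mean predictor and a per-item-mean predictor. The ratings are made-up toy values and the column names are assumptions. In this notebook every observed rating equals 1, so a mean-rating baseline would be trivial for the implicit data and would need replacing with something like a popularity ranking.

```python
import numpy as np
import pandas as pd

# hypothetical explicit ratings, split into train and test
train = pd.DataFrame({"item": [1, 1, 2, 2, 3],
                      "rating": [4.0, 2.0, 5.0, 3.0, 1.0]})
test = pd.DataFrame({"item": [1, 2, 4],
                     "rating": [3.0, 4.0, 2.0]})

# baseline 1: always predict the global average rating
global_mean = train["rating"].mean()
pred_global = np.full(len(test), global_mean)

# baseline 2: predict the item's average rating, falling back to the
# global mean for items that never appear in the training set
item_means = train.groupby("item")["rating"].mean()
pred_item = test["item"].map(item_means).fillna(global_mean).to_numpy()

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    diff = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.sqrt(np.mean(diff ** 2)))

rmse_global = rmse(test["rating"], pred_global)
rmse_item = rmse(test["rating"], pred_item)
print(rmse_global, rmse_item)
```

A trained model is only worth keeping if it beats both numbers on the same test split.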

#### Model #1 - Optimize model

In [None]:
# optimize model

#### Getting Predictions

In [30]:
# get predictions for test set
predictions_test = model.transform(test)
predictions_test_df = predictions_test.toPandas()

In [29]:
# get predictions for training set
predictions_train = model.transform(training)
predictions_train_df = predictions_train.toPandas()
predictions_train_df

Unnamed: 0,patent_firstnamed_assignee_id,patent_number,rating,prediction
0,59878,3997,1,0.977067
1,25713,9900,1,0.363925
2,59878,26087,1,0.977067
3,59878,32304,1,0.977067
4,26407,43256,1,0.364367
5,6623,54258,1,0.878919
6,30640,34602,1,0.364231
7,11061,2525,1,0.875135
8,30542,3986,1,0.364142
9,59749,4042,1,0.364071


In [32]:
predictions_train_df.dropna()

Unnamed: 0,patent_firstnamed_assignee_id,patent_number,rating,prediction
0,59878,3997,1,0.977067
1,25713,9900,1,0.363925
2,59878,26087,1,0.977067
3,59878,32304,1,0.977067
4,26407,43256,1,0.364367
5,6623,54258,1,0.878919
6,30640,34602,1,0.364231
7,11061,2525,1,0.875135
8,30542,3986,1,0.364142
9,59749,4042,1,0.364071


In [33]:
predictions_test_df.dropna()

Unnamed: 0,patent_firstnamed_assignee_id,patent_number,rating,prediction
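
An empty result here means every test prediction was NaN. With `coldStartStrategy="nan"`, ALS returns NaN whenever the user or the item of a row was not present in the training data, and since a patent typically appears in only one row of this dataset, test patents are likely all unseen. A quick way to check this, sketched on hypothetical frames with assumed column names `user` and `item`:

```python
import pandas as pd

# hypothetical train/test splits of (user, item) interactions
train = pd.DataFrame({"user": [1, 1, 2], "item": [10, 11, 12]})
test = pd.DataFrame({"user": [1, 3], "item": [13, 10]})

# a test row is "cold" if its user or its item never occurs in training
unseen_item = ~test["item"].isin(train["item"])
unseen_user = ~test["user"].isin(train["user"])
cold = test[unseen_item | unseen_user]

cold_fraction = len(cold) / len(test)
print(cold)
print(cold_fraction)
```

If the cold fraction is close to 1, a collaborative-filtering model cannot be evaluated on that split, which motivates the content-based approach in the next section.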


### Model #2 - Data preparation

In [64]:
train_text, test_text = train_test_split(df_1000_2, test_size = 0.2)

#### Model #2 - Data preparation - text data

- TF-IDF vectorization of patents; metric: average distance between individual patents, with ranking
- take each TF-IDF vector and argsort by absolute value to see which features are most important to the patent
- get the top 20 features. Normally we would compute cosine distance between all vectors, but for cold-start patents, compute cosine distance over these top 20 features only
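
The argsort-and-restrict idea in the list above can be sketched with plain NumPy. The function names, the toy vectors, and the choice to take the top features from the query vector only are assumptions made for illustration; the dense vectors stand in for rows of a TF-IDF matrix.

```python
import numpy as np

def top_k_features(vec, k=20):
    """Return indices of the k largest-magnitude entries, largest first."""
    order = np.argsort(np.abs(vec))[::-1]
    return order[:k]

def restricted_cosine(query_vec, other_vec, k=20):
    """Cosine similarity computed only over the query's top-k features."""
    idx = top_k_features(query_vec, k)
    q, o = query_vec[idx], other_vec[idx]
    denom = np.linalg.norm(q) * np.linalg.norm(o)
    return float(q @ o / denom) if denom > 0 else 0.0

# hypothetical dense TF-IDF rows
query = np.array([0.0, 0.7, 0.1, 0.0, 0.5])
other = np.array([0.3, 0.7, 0.0, 0.9, 0.5])

top2 = top_k_features(query, k=2)
sim_top2 = restricted_cosine(query, other, k=2)
sim_full = restricted_cosine(query, other, k=len(query))
print(top2, sim_top2, sim_full)
```

Restricting to the query's strongest terms ignores features the query barely uses, so the two toy vectors look identical over the top 2 features even though their full-vector similarity is lower.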

In [36]:
# instantiate TF-IDF Vectorizer using standard English stopwords
tfidf = TfidfVectorizer(stop_words='english')

In [65]:
# fit TF-IDF matrix on text column
tfidf_matrix = tfidf.fit_transform(train_text['patent_title_abstract'])

In [66]:
# output matrix shape: (number of docs, number of terms)
tfidf_matrix.shape

(777, 4924)

### Model #2 - Compute distance metric

In [67]:
# compute cosine similarity matrix between docs using linear_kernel
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

In [77]:
# construct reverse map from patent_number to train_text index label
# (these are DataFrame index labels, not row positions in cosine_sim)
indices = pd.Series(train_text.index, index = train_text['patent_number']).drop_duplicates()

In [74]:
# treat the test split as unseen data; tfidf.transform expects an iterable of strings, not a single string
unseen_data = test_text
unseen_data

Unnamed: 0,patent_abstract,patent_firstnamed_assignee_id,patent_number,patent_title,patent_title_abstract
140,An electronic device can receive user input vi...,org_EilEWQcC6UiqHcSGx9mb,10192549,Extending digital personal assistant action pr...,Extending digital personal assistant action pr...
844,A graymail detection and filtering system pred...,org_hZHBoHvjQMoGbVbMF740,9954805,Graymail filtering-based on user preferences,Graymail filtering-based on user preferences. ...
843,A computer system automatically generates serv...,org_EilEWQcC6UiqHcSGx9mb,9954746,Automatically generating service documentation...,Automatically generating service documentation...
305,A system may include multiple personal data so...,org_oBgJHolxfEg0kgVOfKYg,10140322,Tools and techniques for extracting knowledge ...,Tools and techniques for extracting knowledge ...
173,The present disclosure relates to a system and...,org_U6feekQVzPuPglpgKSBc,10175979,Defect ownership assignment system and predict...,Defect ownership assignment system and predict...
715,Methods and a system are provided that is perf...,org_q9Bn28RHhpYrQjKvraAH,10003923,Location context inference based on user mobil...,Location context inference based on user mobil...
566,A machine learning model is trained by definin...,org_8O8xQifxyiW5pZB2KuDx,10055675,Training algorithm for collision avoidance usi...,Training algorithm for collision avoidance usi...
152,A data analysis system stores in-memory repres...,org_uSkGGmX0kIBgxQmYxLGK,10185930,Collaboration using shared documents for proce...,Collaboration using shared documents for proce...
851,"A computer-implemented method, a processing pi...",org_FMQQGwWD4see8cTUvBeX,9946924,System and method for automating information a...,System and method for automating information a...
503,The invention provides an autonomous vehicle c...,org_UfP75xYv5uGvK5xNfkLe,10073462,Autonomous vehicle with improved visual detect...,Autonomous vehicle with improved visual detect...


In [76]:
unseen_tfidf = tfidf.transform(unseen_data['patent_title_abstract'])
unseen_tfidf.shape

(195, 4924)

In [92]:
# pass a patent number from the training set to get_pat_recs
# takes a patent number string as input and outputs the most similar documents
get_pat_recs('10019674')

165    10180667
600    10042880
753     9984679
379    10115055
521    10067965
745     9984058
247    10165230
235    10162794
157    10186329
1      10228278
Name: patent_number, dtype: object
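
The definition of `get_pat_recs` is not shown in this section. A function of this kind might look roughly like the sketch below, which is self-contained and runs on a toy corpus. The name `get_pat_recs_sketch`, the `top_n` parameter, the toy documents, and the `positions` lookup are all assumptions rather than the notebook's actual implementation.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# hypothetical stand-in for train_text, with shuffled index labels
docs = pd.DataFrame({
    "patent_number": ["111", "222", "333", "444"],
    "patent_title_abstract": [
        "neural network training for image classification",
        "training a neural network for speech recognition",
        "battery storage system for renewable energy",
        "image classification with a convolutional neural network",
    ],
}, index=[664, 955, 826, 393])

vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(docs["patent_title_abstract"])
# TF-IDF rows are L2-normalized by default, so the dot product is cosine similarity
sim = linear_kernel(matrix, matrix)

# map patent_number -> row position in `sim` (not the DataFrame index label)
positions = pd.Series(range(len(docs)), index=docs["patent_number"])

def get_pat_recs_sketch(patent_number, top_n=10):
    """Return the patent numbers of the top_n most similar training documents."""
    row = positions[patent_number]
    scores = pd.Series(sim[row], index=docs.index)
    scores = scores.drop(docs.index[row])  # exclude the query patent itself
    best = scores.sort_values(ascending=False).head(top_n)
    return docs.loc[best.index, "patent_number"]

recs = get_pat_recs_sketch("111", top_n=2)
print(recs)
```

The lookup uses row positions because a shuffled train split keeps its original index labels, which do not line up with the rows of the similarity matrix.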

In [86]:
train_text.head()

Unnamed: 0,patent_abstract,patent_firstnamed_assignee_id,patent_number,patent_title,patent_title_abstract
664,A machine learning apparatus includes a state ...,org_e0RPpc3Ny8K6dCs7eFlk,10019674,Machine learning apparatus and coil producing ...,Machine learning apparatus and coil producing ...
955,Systems and techniques for indexing and/or que...,org_rGrTOVRIsnln8BjhEMKM,9898528,Concept indexing among database of documents u...,Concept indexing among database of documents u...
826,A software platform in communication with netw...,org_hkqUNFIP82JJl0DUsMON,9960637,Renewable energy integrated storage and genera...,Renewable energy integrated storage and genera...
393,A mechanism is described for facilitating reco...,org_BhFWbZ5cX0tSnPE1cE4T,10108850,"Recognition, reidentification and security enh...","Recognition, reidentification and security enh..."
655,Smart reminders are generated from input accor...,org_EilEWQcC6UiqHcSGx9mb,10027796,Smart reminder generation from input,Smart reminder generation from input. Smart re...
948,Provided are techniques for receiving a scanne...,org_q9Bn28RHhpYrQjKvraAH,9898452,Annotation data generation and overlay for enh...,Annotation data generation and overlay for enh...
828,Techniques to identify applications based on n...,org_iwO2oOJ6VIBd9fAuP7G6,9961574,Techniques to identify applications based on n...,Techniques to identify applications based on n...
659,A method includes receiving one or more natura...,org_q9Bn28RHhpYrQjKvraAH,10019437,Facilitating information extraction via semant...,Facilitating information extraction via semant...
514,Described herein are technologies related to t...,org_BhFWbZ5cX0tSnPE1cE4T,10074151,Dense optical flow acceleration,Dense optical flow acceleration. Described her...
205,"Methods, systems, and computer program product...",org_q9Bn28RHhpYrQjKvraAH,10169336,Translating structured languages to natural la...,Translating structured languages to natural la...


#### Model #2 - Apply K-means clustering to TF-IDF matrix

In [None]:
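# K-means with 20 clusters
km = KMeans(n_clusters=20)

In [None]:
# fit on the training TF-IDF matrix, then assign unseen patents to clusters
kmresult = km.fit(tfidf_matrix).predict(unseen_tfidf)

In [None]:
# km is already fitted, so this repeats the cluster assignment above
kmresult_p = km.predict(unseen_tfidf)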

In [None]:
kmresult_p