## Recommendation System
Collaborative filtering with implicit feedback, based on latent factors. Prepare the user-item data, where each user is a company (patent assignee) and each item is a patent, in a format that ALS can use.
We require each unique assignee ID in the rows of the matrix and each unique item ID (patent number) in its columns.
Values of the matrix should presumably (?) be the binary user-item preference multiplied by a confidence weight.
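
One common reading of "binary preference * confidence" is the implicit-feedback formulation in which the preference is 1 whenever an interaction is observed and the confidence grows with the interaction count, for example `1 + alpha * count`. The sketch below builds both matrices for a toy table. The IDs, the value of `alpha`, and the confidence formula are illustrative assumptions, not values confirmed by this notebook.

```python
import pandas as pd

# toy interaction table (hypothetical IDs, not taken from the dataset)
interactions = pd.DataFrame({
    "patent_firstnamed_assignee_id": ["org_a", "org_a", "org_b", "org_c"],
    "patent_number": ["100", "101", "101", "102"],
})

# rows = unique assignee IDs, columns = unique patent numbers,
# values = number of observed interactions
counts = pd.crosstab(interactions["patent_firstnamed_assignee_id"],
                     interactions["patent_number"])

alpha = 30  # assumed confidence scaling factor
preference = (counts > 0).astype(int)  # binary preference
confidence = 1 + alpha * counts        # confidence weight per cell

print(preference)
print(confidence)
```

Unobserved cells keep a preference of 0 with the minimum confidence of 1, which is what lets an implicit-feedback model treat them as weak negatives rather than missing values.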

In [45]:
import pyspark
import pyspark.sql.functions as F
from pyspark.sql import Row
from pyspark.sql.types import ArrayType, IntegerType
from pyspark.ml.recommendation import ALS
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

from lightfm import LightFM
from lightfm.datasets import fetch_movielens
from lightfm.evaluation import precision_at_k

import pandas as pd
import numpy as np

from test_model import (get_patent_fields_list, get_ml_patents, 
                        create_title_abstract_col,trim_data, 
                        structure_dataframe, partition_dataframe, 
                        build_pipeline, process_docs, pat_inv_map, get_topics)

from rec_system import alphanum_to_int, int_to_alphanum

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
from sklearn.cluster import KMeans

import gensim
import gensim.corpora as corpora
from gensim.corpora import Dictionary, mmcorpus
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel
from gensim.models.phrases import Phrases, Phraser
from gensim.models.ldamodel import LdaModel
from gensim.models import AuthorTopicModel
from gensim.test.utils import common_dictionary, datapath, temporary_file
from smart_open import smart_open

import spacy
from spacy.lemmatizer import Lemmatizer
from spacy.lang.en import LEMMA_INDEX, LEMMA_EXC, LEMMA_RULES

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, punkt, RegexpTokenizer, wordpunct_tokenize
from nltk.stem import PorterStemmer, LancasterStemmer, WordNetLemmatizer

import json
from pandas.io.json import json_normalize
import requests
import re
import os
import calendar
from bs4 import BeautifulSoup
import pickle
import math

import matplotlib.pyplot as plt
import pyLDAvis
import pyLDAvis.gensim

from pprint import pprint

%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [2]:
spark = pyspark.sql.SparkSession.builder.getOrCreate()
spark

In [3]:
sc = spark.sparkContext
sc

### Data understanding - Acquire data

#### Data understanding - Acquire data for collaborative filtering workflow

In [4]:
# load pickled dataset
with open('/Users/lee/Documents/techniche/techniche/data/raw_data_1000', 'rb') as f:
    raw_data_1000 = pickle.load(f)

In [5]:
# define desired keys/columns as criteria to subset dataset
retained_keys = ['patent_number', 'patent_firstnamed_assignee_id']

In [6]:
# subset raw dataset by desired keys/columns
data_1000 = trim_data(data=raw_data_1000, keys=retained_keys)

In [7]:
# create Pandas dataframe
df_1000 = pd.DataFrame(data_1000)

#### Data understanding - Acquire data for text workflows

In [8]:
# define desired keys/columns as criteria to subset dataset for text analysis workflows
retained_keys_2 = ['patent_number', 'patent_firstnamed_assignee_id','patent_title',
                 'patent_abstract']

In [9]:
# subset raw dataset by desired keys/columns for text analysis workflows
data_1000_2 = trim_data(data=raw_data_1000, keys=retained_keys_2)

In [10]:
# TODO (Lee): review naming convention duplication
# create a new item in each dict by concatenating patent_title and patent_abstract
data_1000_2 = create_title_abstract_col(data=data_1000_2)

In [11]:
# create Pandas dataframe
df_1000_2 = pd.DataFrame(data_1000_2)

In [12]:
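# drop rows that contain invalid data in patent_number column
# note: '[RE]' is a regex character class, so it matches any value containing 'R' or 'E'
invalid_rows = df_1000_2[df_1000_2.patent_number.str.contains('[RE]')]
# drop by the matched index labels instead of a hard-coded row position (717)
df_1000_2 = df_1000_2.drop(invalid_rows.index)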

In [23]:
# inspect the dataframe: patent_firstnamed_assignee_id contains NaNs (972 of 999 rows are non-null)
df_1000_2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 999 entries, 0 to 999
Data columns (total 3 columns):
patent_firstnamed_assignee_id    972 non-null object
patent_number                    999 non-null object
patent_title_abstract            999 non-null object
dtypes: object(3)
memory usage: 31.2+ KB


In [24]:
# drop rows with NaNs in patent_firstnamed_assignee_id column
df_1000_2 = df_1000_2.dropna()

In [25]:
df_1000_2.head(3)

|   | patent_firstnamed_assignee_id | patent_number | patent_title_abstract |
|---|---|---|---|
| 0 | org_VU2IXnxgxGIK8A8oQrwm | 10226194 | Statistical, noninvasive measurement of a pati... |
| 1 | org_9cmRc2rH8nbl8O9VuxYL | 10228278 | Determining a health condition of a structure.... |
| 2 | org_8O8xQifxyiW5pZB2KuDx | 10228693 | Generating simulated sensor data for training ... |


### Model #1 - Data preparation

In [None]:
# create new rating column and assign value of 1
df_1000['rating'] = 1

In [None]:
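# drop rows that contain invalid data in patent_number column
# note: '[RE]' is a regex character class, so it matches any value containing 'R' or 'E'
invalid_rows = df_1000[df_1000.patent_number.str.contains('[RE]')]
# drop by the matched index labels instead of a hard-coded row position (717)
df_1000 = df_1000.drop(invalid_rows.index)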

In [None]:
# drop NaNs in patent_firstnamed_assignee_id column
df_1000.info()
df_1000 = df_1000.dropna()

In [None]:
# convert patent_number column from string to int
df_1000 = df_1000.astype({'patent_number': 'int64'})
# uncomment to confirm
# df_1000.info()

In [None]:
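# scratch: explore ways to convert the alphanumeric patent_firstnamed_assignee_id col to int
# (patent_number was already cast to int64 in the previous cell;
#  the assignee ID conversion itself is done with hash() in the cells below)
# s = 'org_VU2IXnxgxGIK8A8oQrwm'

# code = [ord(c) for c in s]
# code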

In [None]:
df_1000.head(3)

In [None]:
hash('org_VU2IXnxgxGIK8A8oQrwm')

In [None]:
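# map each assignee ID string to an int in [0, 65536)
# NOTE: Python salts str hashes per session unless PYTHONHASHSEED is set, so these IDs are not
# reproducible across runs, and the modulo can map distinct assignees to the same ID
# df_1000['patent_firstnamed_assignee_id'] = df_1000['patent_firstnamed_assignee_id'].apply(hash).apply(abs)
df_1000['patent_firstnamed_assignee_id'] = df_1000['patent_firstnamed_assignee_id'].apply(hash).apply(abs) % 65536 # 2^16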

In [None]:
# reduce patent_number modulo 2^16 (note: distinct patent numbers can collide after the modulo)
df_1000['patent_number'] = df_1000['patent_number'] % 65536 # 2^16
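
Because string hashes can differ between interpreter sessions and the modulo can merge distinct IDs, a collision-free and reproducible alternative is to number the IDs in order of first appearance. The following is a minimal sketch using `pd.factorize`. The frame `df_demo` and the `*_idx` column names are hypothetical and only illustrate the idea; this is not the notebook's `alphanum_to_int` helper.

```python
import pandas as pd

# hypothetical toy frame standing in for df_1000
df_demo = pd.DataFrame({
    "patent_firstnamed_assignee_id": ["org_x", "org_y", "org_x", "org_z"],
    "patent_number": ["10226194", "10228278", "10228693", "10226194"],
})

# factorize assigns 0, 1, 2, ... in order of first appearance,
# so equal IDs always share a code and distinct IDs never collide
assignee_codes, assignee_labels = pd.factorize(df_demo["patent_firstnamed_assignee_id"])
patent_codes, patent_labels = pd.factorize(df_demo["patent_number"])

df_demo["assignee_idx"] = assignee_codes
df_demo["patent_idx"] = patent_codes

# the label index maps integer codes back to the original IDs
original_assignee = assignee_labels[assignee_codes]
```

Keeping `assignee_labels` and `patent_labels` around makes it possible to translate recommendations back to the original assignee IDs and patent numbers.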

In [None]:
df_1000 = df_1000.astype({'patent_firstnamed_assignee_id': 'int'})

In [None]:
df_1000.info()

#### Data preparation - create Spark dataframe from pandas dataframe

In [None]:
sp_df_1000 = spark.createDataFrame(df_1000)

In [None]:
sp_df_1000.show()

In [None]:
sp_df_1000

In [None]:
# cast columns from bigint to int
sp_df_1000_2 = (
    sp_df_1000
    .withColumn("patent_firstnamed_assignee_id", sp_df_1000["patent_firstnamed_assignee_id"].cast(IntegerType()))
    .withColumn("patent_number", sp_df_1000["patent_number"].cast(IntegerType()))
    .withColumn("rating", sp_df_1000["rating"].cast(IntegerType()))
)

In [None]:
sp_df_1000_2.dtypes

In [None]:
# split into training (80%) and test (20%) sets, using the dataframe with int-cast columns
(training, test) = sp_df_1000_2.randomSplit([0.8, 0.2])

In [None]:
training.show()

### Model #1

In [None]:
# Build the recommendation model using ALS on the training data
# coldStartStrategy="nan" keeps NaN predictions for users/items unseen in training;
# switch it to "drop" to ensure we don't get NaN evaluation metrics
# implicitPrefs=True treats the ratings as implicit feedback rather than explicit scores
# rank - number of latent factors; ME suggests 10
# alpha - confidence scaling for implicit feedback; ME suggests beginning with alpha=30,
# then trying other alphas for the domain and checking whether the recs make sense
als = ALS(maxIter=5,
          regParam=0.01, 
          rank=10,
          alpha=30,
          implicitPrefs=True,
          userCol="patent_firstnamed_assignee_id", 
          itemCol="patent_number", 
          ratingCol="rating",
          coldStartStrategy="nan")

In [None]:
# fit the ALS model to the training set
model = als.fit(training)

#### Model #1 - Evaluation - Compare to naive baseline
Compare the model's evaluation result with a naive baseline model that, for explicit feedback, only outputs the average rating (or try one that outputs the average rating per item).
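
The two baselines named above can be sketched with pandas. The toy ratings below are made up for illustration (in this notebook every rating is 1, so an average-rating baseline only becomes informative with explicit or count-based ratings), and RMSE is an assumed choice of metric.

```python
import numpy as np
import pandas as pd

# hypothetical explicit ratings, not taken from the patent dataset
train = pd.DataFrame({"item": [1, 1, 2, 2, 3], "rating": [4.0, 2.0, 5.0, 3.0, 1.0]})
test = pd.DataFrame({"item": [1, 2, 4], "rating": [3.0, 5.0, 2.0]})

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

# baseline 1: always predict the global mean rating of the training set
global_mean = train["rating"].mean()
rmse_global = rmse(test["rating"], np.full(len(test), global_mean))

# baseline 2: predict the per-item mean, falling back to the
# global mean for items that never appear in the training set
item_means = train.groupby("item")["rating"].mean()
per_item_pred = test["item"].map(item_means).fillna(global_mean)
rmse_item = rmse(test["rating"], per_item_pred)

print(rmse_global, rmse_item)
```

A fitted model is only worth keeping if its error on the same test rows is clearly below both of these numbers.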

#### Model #1 - Optimize model

In [None]:
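from pyspark.ml.evaluation import RegressionEvaluator

# grid of regularization strengths and latent-factor ranks to search over
params = ParamGridBuilder().addGrid(als.regParam, [0.01, 0.001, 0.1]).addGrid(als.rank, [4, 10, 50]).build()

# assumption: no evaluator was defined earlier in this notebook, so RMSE is used here;
# with coldStartStrategy="nan" the metric can be NaN, so "drop" may be needed on the ALS estimator
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating", predictionCol="prediction")

# instantiate cross-validator estimator and fit it on the training set
cv = CrossValidator(estimator=als, estimatorParamMaps=params, evaluator=evaluator, parallelism=4)
best_model = cv.fit(training)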

In [None]:
# Getting Predictions for a New User

In [None]:
predictions = model.transform(test)

In [None]:
predictions_df = predictions.toPandas()

In [None]:
predictions_df

In [None]:
predictions_df.dropna()

In [None]:
predictions = model.transform(training)

In [None]:
predictions_train_df = predictions.toPandas()

In [None]:
predictions_train_df

In [None]:
- content similarity
- limits of patent space
- TF-IDF vectorization of patents - metrics - avg distance between
- distance between individual patents, with ranking
- Sherry - ascent - TF-IDF vectorization: take the TF-IDF vector and argsort by absolute value, so you can see which features are most important to this patent. Get the top 20 features. Normally we would compute cosine distance between all vectors, BUT for cold-start patents, only compute cosine distance between these top 20 features.
- TF-IDF vectorization
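
The cold-start idea in the notes above (rank a new patent's TF-IDF features, keep the top ones, and compute cosine similarity on those features only) can be sketched as follows. The documents are toy strings, `top_k` is set to 3 instead of 20 because the toy vocabulary is tiny, and all variable names are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# toy corpus standing in for the patent title/abstract column
docs = [
    "neural network image classification camera sensor",
    "image sensor camera lens optical device",
    "battery electrode lithium cell charging",
]
new_doc = ["camera image classification with neural network"]

vec = TfidfVectorizer(stop_words="english")
doc_matrix = vec.fit_transform(docs)
new_vec = vec.transform(new_doc)

top_k = 3  # the notes suggest 20 for real patents
weights = np.abs(new_vec.toarray().ravel())
# indices of the top_k highest-weighted features of the new document
top_features = np.argsort(weights)[::-1][:top_k]

# cosine similarity restricted to the selected feature columns
sims = cosine_similarity(new_vec[:, top_features],
                         doc_matrix[:, top_features]).ravel()
ranking = np.argsort(sims)[::-1]  # most similar document first
print(ranking, sims)
```

Restricting the comparison to a few high-weight terms ignores everything else the documents have in common, so the ranking can differ from the full cosine ranking; whether that helps for cold-start patents would need to be checked on the real data.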

### Model #2 - Data preparation

In [28]:
df_1000_2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 972 entries, 0 to 999
Data columns (total 3 columns):
patent_firstnamed_assignee_id    972 non-null object
patent_number                    972 non-null object
patent_title_abstract            972 non-null object
dtypes: object(3)
memory usage: 30.4+ KB


#### Model #2 - Data preparation - text data

In [30]:
# instantiate TF-IDF Vectorizer using standard English stopwords
tfidf = TfidfVectorizer(stop_words='english')

In [31]:
# fit TF-IDF matrix on text column
tfidf_matrix = tfidf.fit_transform(df_1000_2['patent_title_abstract'])

In [32]:
# output matrix, 972 docs, 5364 terms
tfidf_matrix.shape

(972, 5364)

#### Model #2 - Compute distance metric

In [34]:
# compute cosine similarity matrix between docs using linear_kernel
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
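
`linear_kernel` yields cosine similarity here because `TfidfVectorizer` L2-normalizes each row by default (`norm='l2'`), so the dot product of two rows equals their cosine. The sketch below checks this on a toy corpus and shows one way to rank the most similar documents; the helper `most_similar` and the sample texts are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# toy corpus standing in for the patent title/abstract column
docs = [
    "wireless sensor network for structural health monitoring",
    "sensor network monitoring bridge structural vibration",
    "pharmaceutical composition for treating inflammation",
]
matrix = TfidfVectorizer(stop_words="english").fit_transform(docs)

sim = linear_kernel(matrix, matrix)
# rows are unit length, so the plain dot product matches cosine similarity
same = np.allclose(sim, cosine_similarity(matrix))

def most_similar(doc_idx, sim_matrix, top_n=2):
    """Return indices of the top_n most similar documents, excluding doc_idx itself."""
    order = np.argsort(sim_matrix[doc_idx])[::-1]
    return [int(i) for i in order if i != doc_idx][:top_n]

print(same, most_similar(0, sim))
```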

In [37]:
# construct reverse map from patent title/abstract text to dataframe index
indices = pd.Series(df_1000_2.index, index=df_1000_2['patent_title_abstract']).drop_duplicates()

In [38]:
indices

patent_title_abstract
Statistical, noninvasive measurement of a patient's physiological state. Tools and techniques for the rapid, continuous, invasive and/or noninvasive measurement, estimation, and/or prediction of a patient's physiological state. In an aspect, some tools and techniques can estimate predict the onset of conditions intracranial pressure, an amount of blood volume loss, cardiovascular collapse, and/or dehydration. Some tools can recommend (and, in some cases, administer) a therapeutic treatment for the patient's condition. In another aspect, some techniques employ high speed software technology that enables active, long term learning from extremely large, continually changing datasets. In some cases, this technology utilizes feature extraction, state-of-the-art machine learning and/or statistical methods to autonomously build and apply relevant models in real-time.                                                                                                          

In [41]:
# tfidf vec requires list, not just string
unseen_data = 'computer vision natural language processor'
unseen_data=[unseen_data]

In [54]:
unseen_tfidf = tfidf.transform(unseen_data)

#### Model #2 - Apply K-means clustering to TF-IDF matrix

In [55]:
km = KMeans(20)

In [60]:
# fit K-means on the TF-IDF matrix, then assign the unseen document to a cluster
kmresult = km.fit(tfidf_matrix).predict(unseen_tfidf)

In [61]:
kmresult_p = km.predict(unseen_tfidf)

In [62]:
kmresult_p

array([15], dtype=int32)
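
A cluster label on its own is not yet a recommendation. One possible next step, sketched here on a toy corpus, is to list the documents that share the predicted cluster with the unseen text. The documents, the choice of two clusters, and the names `km_demo` and `members` are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# toy corpus with two clearly separated topics
docs = [
    "camera image sensor lens",
    "image sensor camera pixel",
    "camera lens image optical",
    "battery lithium electrode cell",
    "lithium battery cell charging",
    "electrode battery lithium anode",
]
vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(docs)

# fixed random_state and explicit n_init keep the toy example reproducible
km_demo = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# assign the unseen text to a cluster, then collect that cluster's members
query = vec.transform(["image camera"])
cluster_id = int(km_demo.predict(query)[0])
members = np.where(km_demo.labels_ == cluster_id)[0]
print(cluster_id, members)
```

The member indices can then be ranked by cosine similarity to the query so that the closest patents in the cluster come first.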