## Recommendation System
Collaborative filtering on implicit feedback, using a latent-factor model (ALS). The data must first be prepared as user-item relationships, one per assignee (company) and patent, in a format that ALS can use.
Each unique assignee ID forms a row of the matrix, and each unique item (patent) ID forms a column.
Each matrix value should (tentatively) be the binary user-item preference multiplied by a confidence weight.

In [1]:
import pyspark
import pyspark.sql.functions as F
from pyspark.sql import Row
from pyspark.sql.types import ArrayType, IntegerType
from pyspark.ml.recommendation import ALS
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

from lightfm import LightFM
from lightfm.datasets import fetch_movielens
from lightfm.evaluation import precision_at_k

import pandas as pd
import numpy as np

from test_model import (get_patent_fields_list, get_ml_patents, 
                        create_title_abstract_col,trim_data, 
                        structure_dataframe, partition_dataframe, 
                        build_pipeline, process_docs, pat_inv_map, get_topics)

from rec_system import alphanum_to_int, int_to_alphanum

import gensim
import gensim.corpora as corpora
from gensim.corpora import Dictionary, mmcorpus
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel
from gensim.models.phrases import Phrases, Phraser
from gensim.models.ldamodel import LdaModel
from gensim.models import AuthorTopicModel
from gensim.test.utils import common_dictionary, datapath, temporary_file
from smart_open import smart_open

import spacy
from spacy.lemmatizer import Lemmatizer
from spacy.lang.en import LEMMA_INDEX, LEMMA_EXC, LEMMA_RULES

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, punkt, RegexpTokenizer, wordpunct_tokenize
from nltk.stem import PorterStemmer, LancasterStemmer, WordNetLemmatizer

import json
from pandas.io.json import json_normalize
import requests
import re
import os
import calendar
from bs4 import BeautifulSoup
import pickle
import math

import matplotlib.pyplot as plt
import pyLDAvis
import pyLDAvis.gensim

from pprint import pprint

%load_ext autoreload
%autoreload 2



In [2]:
spark = pyspark.sql.SparkSession.builder.getOrCreate()
spark

In [3]:
sc = spark.sparkContext
sc

### Data understanding - Acquire data

In [4]:
# load pickled dataset
with open('/Users/lee/Documents/techniche/techniche/data/raw_data_1000', 'rb') as f:
    raw_data_1000 = pickle.load(f)

In [5]:
# define desired keys/columns as criteria to subset dataset
retained_keys = ['patent_number', 'patent_firstnamed_assignee_id']

In [6]:
# subset raw dataset by desired keys/columns
data_1000 = trim_data(data=raw_data_1000, keys=retained_keys)

In [7]:
# create Pandas dataframe
df_1000 = pd.DataFrame(data_1000)

### Data preparation
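A minimal sketch of the matrix described above, assuming the implicit-feedback formulation that Spark's ALS documentation cites (Hu, Koren and Volinsky): the preference is 1 where an interaction exists and 0 otherwise, and the confidence is `1 + alpha * rating`. The helper name `build_matrices` and the toy IDs are hypothetical, and dense lists are used only to keep the example small.

```python
def build_matrices(interactions, alpha=30.0):
    """Return (users, items, preference, confidence) as dense row-major lists.

    interactions: list of (user_id, item_id, rating) tuples.
    """
    users = sorted({u for u, _, _ in interactions})
    items = sorted({i for _, i, _ in interactions})
    row = {u: k for k, u in enumerate(users)}
    col = {i: k for k, i in enumerate(items)}

    # unobserved pairs: preference 0 with the minimal confidence of 1
    preference = [[0] * len(items) for _ in users]
    confidence = [[1.0] * len(items) for _ in users]
    for u, i, r in interactions:
        if r > 0:
            preference[row[u]][col[i]] = 1
            confidence[row[u]][col[i]] = 1.0 + alpha * r
    return users, items, preference, confidence


# toy data: two assignees, three patents (hypothetical IDs)
toy = [("org_A", 101, 1), ("org_A", 102, 1), ("org_B", 103, 1)]
users, items, P, C = build_matrices(toy, alpha=30.0)
```

With every rating fixed at 1, each observed pair carries the same confidence, so `alpha` only controls how strongly observed pairs outweigh unobserved ones.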

In [8]:
# create new rating column and assign value of 1
df_1000['rating'] = 1

In [9]:
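# drop rows with invalid (non-numeric) patent numbers, e.g. those containing the literal 'RE'
# note: the pattern '[RE]' is a regex character class that matches any 'R' or 'E', so match the literal instead
invalid_mask = df_1000.patent_number.str.contains('RE', regex=False, na=False)
df_1000 = df_1000[~invalid_mask]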

In [10]:
# drop NaNs in patent_firstnamed_assignee_id column
df_1000.info()
df_1000 = df_1000.dropna()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 999 entries, 0 to 999
Data columns (total 3 columns):
patent_firstnamed_assignee_id    972 non-null object
patent_number                    999 non-null object
rating                           999 non-null int64
dtypes: int64(1), object(2)
memory usage: 31.2+ KB


In [11]:
# convert patent_number column from string to int
df_1000 = df_1000.astype({'patent_number': 'int64'})
# uncomment to confirm
# df_1000.info()

In [12]:
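# patent_number is already int64 (this repeats the cast in the previous cell);
# the alphanumeric patent_firstnamed_assignee_id column is converted by hashing in a later cell
df_1000 = df_1000.astype({'patent_number': 'int64'})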
# s = 'org_VU2IXnxgxGIK8A8oQrwm'

# code = [ord(c) for c in s]
# code

In [249]:
df_1000.head(3)

index,patent_firstnamed_assignee_id,patent_number,rating
0,org_VU2IXnxgxGIK8A8oQrwm,10226194,1
1,org_9cmRc2rH8nbl8O9VuxYL,10228278,1
2,org_8O8xQifxyiW5pZB2KuDx,10228693,1


In [250]:
hash('org_VU2IXnxgxGIK8A8oQrwm')

8569133740707573506

In [13]:
# map each alphanumeric assignee ID to an integer in [0, 2^16)
# caution: hash() of a str varies between Python processes unless PYTHONHASHSEED is fixed,
# and the modulo can map distinct assignees to the same integer
df_1000['patent_firstnamed_assignee_id'] = df_1000['patent_firstnamed_assignee_id'].apply(hash).apply(abs) % 65536 # 2^16

In [18]:
# reduce patent_number modulo 2^16; caution: distinct patents can collide on the same item ID
df_1000['patent_number'] = df_1000['patent_number'] % 65536 # 2^16

In [19]:
df_1000 = df_1000.astype({'patent_firstnamed_assignee_id': 'int'})

In [20]:
df_1000.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 972 entries, 0 to 999
Data columns (total 3 columns):
patent_firstnamed_assignee_id    972 non-null int64
patent_number                    972 non-null int64
rating                           972 non-null int64
dtypes: int64(3)
memory usage: 30.4 KB
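Because `hash()` on strings is salted per interpreter process and the modulo can merge distinct IDs, the integer codes above are neither reproducible nor guaranteed unique. One alternative, sketched here as an assumption rather than as the project's method (the imported `alphanum_to_int` helper is not shown, so its behavior is unknown), is to enumerate the distinct IDs. The name `encode_ids` is hypothetical.

```python
def encode_ids(values):
    """Map each distinct value to a consecutive integer, in order of first appearance."""
    mapping = {}
    for v in values:
        if v not in mapping:
            mapping[v] = len(mapping)
    codes = [mapping[v] for v in values]
    return codes, mapping


assignees = [
    "org_VU2IXnxgxGIK8A8oQrwm",
    "org_9cmRc2rH8nbl8O9VuxYL",
    "org_VU2IXnxgxGIK8A8oQrwm",
]
codes, mapping = encode_ids(assignees)

# reverse lookup, to turn recommended integer IDs back into the original strings
decode = {code: original for original, code in mapping.items()}
```

On a pandas column, `pd.factorize` returns comparable codes together with the array of unique values, which serves as the reverse lookup.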


#### Data preparation - create Spark dataframe from pandas dataframe

In [21]:
sp_df_1000 = spark.createDataFrame(df_1000)

In [22]:
sp_df_1000.show()

+-----------------------------+-------------+------+
|patent_firstnamed_assignee_id|patent_number|rating|
+-----------------------------+-------------+------+
|                        21260|         2578|     1|
|                        47046|         4662|     1|
|                        31216|         5077|     1|
|                        16139|         5306|     1|
|                        52705|         5315|     1|
|                        19881|         5490|     1|
|                          625|         5493|     1|
|                         5010|         5497|     1|
|                        54478|         5532|     1|
|                        16139|         5540|     1|
|                          625|         5557|     1|
|                        16139|         5571|     1|
|                        16139|         5573|     1|
|                        55229|         5741|     1|
|                        16139|         5752|     1|
|                        45935|         6057| 

In [23]:
sp_df_1000

DataFrame[patent_firstnamed_assignee_id: bigint, patent_number: bigint, rating: bigint]

In [24]:
# cast columns from bigint to int
sp_df_1000_2 = (
    sp_df_1000
    .withColumn("patent_firstnamed_assignee_id", sp_df_1000["patent_firstnamed_assignee_id"].cast(IntegerType()))
    .withColumn("patent_number", sp_df_1000["patent_number"].cast(IntegerType()))
    .withColumn("rating", sp_df_1000["rating"].cast(IntegerType()))
)

In [25]:
sp_df_1000_2.dtypes

[('patent_firstnamed_assignee_id', 'int'),
 ('patent_number', 'int'),
 ('rating', 'int')]

In [26]:
# split the int-cast dataframe into training (80%) and test (20%) sets
(training, test) = sp_df_1000_2.randomSplit([0.8, 0.2])

In [27]:
training.show()

+-----------------------------+-------------+------+
|patent_firstnamed_assignee_id|patent_number|rating|
+-----------------------------+-------------+------+
|                          351|        52169|     1|
|                          625|          126|     1|
|                          625|         5493|     1|
|                          625|         5557|     1|
|                          625|        11582|     1|
|                          625|        49322|     1|
|                          625|        54258|     1|
|                          753|        13539|     1|
|                          753|        46066|     1|
|                         1096|        27931|     1|
|                         1435|        40319|     1|
|                         3566|        52671|     1|
|                         3680|        60651|     1|
|                         4120|        24017|     1|
|                         4120|        33809|     1|
|                         4673|        58425| 

### Model #1

In [29]:
# Build the recommendation model using ALS on the training data.
# Note: coldStartStrategy is "nan" here, so unseen users/items get NaN predictions;
# set it to "drop" to ensure we don't get NaN evaluation metrics.
# implicitPrefs=True: the ratings are implicit feedback, which should give better results here.
# rank: number of latent factors (ME: 10?); alpha: confidence scaling (0.01?).
# ME suggests beginning with alpha=30, then trying other alphas for this domain to see if the recs make sense.
als = ALS(maxIter=5,
          regParam=0.01, 
          rank=10,
          alpha=30,
          implicitPrefs=True,
          userCol="patent_firstnamed_assignee_id", 
          itemCol="patent_number", 
          ratingCol="rating",
          coldStartStrategy="nan")

In [30]:
# fit the ALS model to the training set
model = als.fit(training)

#### Model #1 - Evaluation - Compare to naive baseline
Compare the model's evaluation result with a naive baseline model. For explicit feedback, such a baseline outputs only the average rating (or you may try one that outputs the average rating per item).
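When every rating is 1, an average-rating baseline always predicts 1 and says little. A common substitute for implicit feedback, offered here as an assumption and not something the notebook implements, is a popularity baseline: recommend the k items with the most training interactions and measure how often a held-out item is among them. The name `popularity_hit_rate` and the toy pairs are hypothetical.

```python
from collections import Counter


def popularity_hit_rate(train, test, k=10):
    """Fraction of test (user, item) pairs whose item is among the k most frequent training items."""
    counts = Counter(item for _, item in train)
    top_k = {item for item, _ in counts.most_common(k)}
    if not test:
        return 0.0
    hits = sum(1 for _, item in test if item in top_k)
    return hits / len(test)


# toy (user, item) pairs with hypothetical IDs
train_pairs = [(1, "a"), (2, "a"), (3, "b"), (4, "c"), (5, "a"), (6, "b")]
test_pairs = [(7, "a"), (8, "c"), (9, "d"), (1, "b")]
rate = popularity_hit_rate(train_pairs, test_pairs, k=2)
```

Such a baseline is only informative when items recur across users. If each patent appears in a single row, as the data preparation suggests, held-out items are rarely present in the training set.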

#### Model #1 - Optimize model

In [None]:
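from pyspark.ml.evaluation import RegressionEvaluator

# grid of regularization strengths and ranks to search over
params = (ParamGridBuilder()
          .addGrid(als.regParam, [0.01, 0.001, 0.1])
          .addGrid(als.rank, [4, 10, 50])
          .build())

# assumption: RMSE evaluator, since this cell never defined `evaluator`;
# with coldStartStrategy="nan", RMSE is NaN for any fold containing unseen users/items
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating", predictionCol="prediction")

## instantiate crossvalidator estimator and fit it on the training set
cv = CrossValidator(estimator=als, estimatorParamMaps=params, evaluator=evaluator, parallelism=4)
best_model = cv.fit(training)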

In [None]:
# Getting Predictions for a New User

In [31]:
predictions = model.transform(test)

In [32]:
predictions_df = predictions.toPandas()

In [34]:
predictions_df

index,patent_firstnamed_assignee_id,patent_number,rating,prediction
0,42061,34602,1,
1,6441,47655,1,
2,36099,47744,1,
3,21549,16027,1,
4,16139,36649,1,
5,5010,30219,1,
6,17666,46149,1,
7,16139,5030,1,
8,16139,14760,1,
9,52705,41185,1,


In [35]:
predictions_df.dropna()

index,patent_firstnamed_assignee_id,patent_number,rating,prediction
88,5010,10300,1,0.119203
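With `coldStartStrategy="nan"`, ALS returns NaN for any row whose user or item did not appear in the training data, which would explain why almost every test prediction above is empty. A quick diagnostic, sketched with the hypothetical name `scorable_fraction` and toy pairs, counts how many test rows can be scored at all.

```python
def scorable_fraction(train, test):
    """Share of test (user, item) pairs whose user AND item both occur in train."""
    seen_users = {u for u, _ in train}
    seen_items = {i for _, i in train}
    if not test:
        return 0.0
    scorable = sum(1 for u, i in test if u in seen_users and i in seen_items)
    return scorable / len(test)


# toy pairs with hypothetical IDs; the last test pair reuses a training item,
# as an ID collision from the modulo step could
train_pairs = [(625, 5493), (625, 5557), (16139, 5306), (5010, 5497)]
test_pairs = [(625, 126), (16139, 5540), (5010, 5497)]
fraction = scorable_fraction(train_pairs, test_pairs)
```

The same check could be run on the `(patent_firstnamed_assignee_id, patent_number)` pairs of `training.toPandas()` and `test.toPandas()`. A value near zero would indicate that a random row split is not a usable evaluation setup for this data.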


In [36]:
predictions = model.transform(training)

In [37]:
predictions_train_df = predictions.toPandas()

In [38]:
predictions_train_df

index,patent_firstnamed_assignee_id,patent_number,rating,prediction
0,16139,3997,1,0.979337
1,6360,9900,1,0.359585
2,16139,26087,1,0.979337
3,16139,32304,1,0.979337
4,116,43256,1,0.358919
5,625,54258,1,0.881942
6,55229,2525,1,0.868484
7,2765,3986,1,0.358887
8,37262,4042,1,0.359014
9,47241,27977,1,0.643224


In [None]:
- content-similarity
- limits of patent space
- TF-IDF vectorization of patents - metrics - avg distance between 
- distance between individual patents, with ranking
- Sherry - ascent - TF-IDF vectorization: take the TF-IDF vector and argsort by absolute value to see which features are most important to this patent, then keep the top 20 features. Normally one would compute cosine distance between all vectors, but for cold-start patents compute it only over these top 20 features.
- TF-IDF vectorization
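The cold-start idea in these notes can be sketched as follows: compute a TF-IDF vector per patent, keep the query patent's top-k features by absolute weight, and rank the other patents by cosine similarity restricted to those features. This is a plain-Python illustration with hypothetical function names and toy token lists. The weighting (raw term frequency times `log(N / df)`) is one common TF-IDF variant, chosen as an assumption.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        term_freq = Counter(doc)
        vectors.append({t: tf * math.log(n_docs / doc_freq[t]) for t, tf in term_freq.items()})
    return vectors


def top_k_features(vector, k=20):
    """Terms with the k largest absolute weights."""
    return sorted(vector, key=lambda t: abs(vector[t]), reverse=True)[:k]


def restricted_cosine(query, other, features):
    """Cosine similarity computed only over the given features."""
    q = [query.get(t, 0.0) for t in features]
    o = [other.get(t, 0.0) for t in features]
    norm = math.sqrt(sum(x * x for x in q)) * math.sqrt(sum(x * x for x in o))
    return sum(a * b for a, b in zip(q, o)) / norm if norm else 0.0


# toy "patents" as token lists (hypothetical content)
docs = [
    ["neural", "network", "training", "image"],
    ["neural", "network", "speech", "audio"],
    ["engine", "piston", "valve", "fuel"],
]
vecs = tfidf_vectors(docs)
features = top_k_features(vecs[0], k=20)

# rank every other patent by similarity to patent 0, most similar first
ranking = sorted(
    ((restricted_cosine(vecs[0], v, features), idx) for idx, v in enumerate(vecs) if idx != 0),
    reverse=True,
)
```

Restricting both vectors to the query's top features means the other patent is normalized only over those features, so scores are not directly comparable with a full cosine similarity.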