# U.S. Patent Phrase to Phrase Matching
* [Data](https://www.kaggle.com/competitions/us-patent-phrase-to-phrase-matching/data?select=train.csv)
   - Don Cenkci, Grigor Aslanyan, Ian Wetherbee, jm, Kiran Gunda, Maggie, Scott Beliveau, Will Cukierski. (2022). U.S. Patent Phrase to Phrase Matching. Kaggle. 
 
## Problem Statement and Objective
The objective is to predict the Pearson similarity score, a number between 0 and 1, between phrases extracted from US Patents.
This is a regression problem.
The data also includes the context, basically the first two levels of the [CPC Classification](https://en.wikipedia.org/wiki/Cooperative_Patent_Classification).

## Methodology
Clean up the data, then convert to numerical values, then train a suitable ML model.
Not too much cleaning seems to be required on the phrases themselves, they are already converted to lower case,
don't seem to have contractions, etc.  Will apply some cleanup just in case, but does not seem to be strictly necessary.
We will however remove "stop" words.

Next, we need to convert the phrases into numerical form.   We will convert each phrase into a vector representing
the meanings in the sentences.  The phrases are quite short generally, just a few words, so need to pick an approach suitable 
for that kind of data.

The other issue is the CPC category.   There are a lot of CPC categories, but we decided to use a one-hot encoding regardless.
We did not try to break the categories into primary and secondary categories however to avoid spurious matches based on similar
secondary categories with different primary categories.  That is, we don't want A45 and B45 to be treated as similar because the
meaning of "45" is completely different in the A and B primary categories.

For the ML model training, we will try a set of different models and use hyperparameter training to try and find the best
approach.   The best models on the leaderboard seem to have accuracies in the 80% to 90% range.

As a note the training data seems to have scores drawn from a small set of possible values: 0, 0.25, 0.5, 0.75, and 1.

## Libraries

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.impute import SimpleImputer
from sklearn.svm import SVC
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV

In [16]:
!python -m spacy download en_core_web_lg

Collecting en-core-web-lg==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.7.1/en_core_web_lg-3.7.1-py3-none-any.whl (587.7 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m587.7/587.7 MB[0m [31m8.8 MB/s[0m eta [36m0:00:00[0m00:01[0m00:01[0m
[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_lg')


In [41]:
import spacy
nlp = spacy.load('en_core_web_lg')
# define a function to find vector for a phrase
def get_vec(x):
    doc = nlp(x)
    vec = doc.vector
    return np.array(vec)

## Data

| Label       | Meaning                       | Values                                                                                  |
| ----------- | ----------------------------- | ----------------------------------------------------------------------------------------|
| id          | unique phrase pair identifier | irrelevant for training                                                                 |
| anchor      |	first phrase                  | string                                                                                  |
| target	  | second phrase                 | string                                                                                  |
| context     | CPC Category                  | just primary and secondary, eg. A45                                                     |
| score	      | training target value         | real number between 0 and 1                                                             |

In [42]:
dataset = pd.read_csv('train.csv')
dataset

Unnamed: 0,id,anchor,target,context,score
0,37d61fd2272659b1,abatement,abatement of pollution,A47,0.50
1,7b9652b17b68b7a4,abatement,act of abating,A47,0.75
2,36d72442aefd8232,abatement,active catalyst,A47,0.25
3,5296b0c19e1ce60e,abatement,eliminating process,A47,0.50
4,54c1e3b9184cb5b6,abatement,forest region,A47,0.00
...,...,...,...,...,...
36468,8e1386cbefd7f245,wood article,wooden article,B44,1.00
36469,42d9e032d1cd3242,wood article,wooden box,B44,0.50
36470,208654ccb9e14fa3,wood article,wooden handle,B44,0.50
36471,756ec035e694722b,wood article,wooden material,B44,0.75


Clean up data and convert to numerical form.

In [75]:
def clean_data(dataset):
    # one-hot encode context
    ct = ColumnTransformer(transformers=[('encodeContext', OneHotEncoder(), ['context'], remainder='passthrough')]) # context
    dt = ct.fit_transform(dataset)
    Pp = dt.iloc[:, -3:-1].values  # pairs of phrases
    print(Pp)
    # encode strings using word2vec, replace in frame
    for i in range(0,Pp.shape[0]):
        print(Pp[i,0],Pp[i,1])
        vec_a = get_vec(V[i,0])
        vec_b = get_vec(V[i,1])
        X[i].append(np.concatenate([vec_a,vec_b]))
    X = np.array(X)
    print(X)
    y = dataset.iloc[:, -1].values # score
    return (X,y)

In [76]:
(X,y) = clean_data(dataset)

  (0, 10)	1.0
  (1, 10)	1.0
  (2, 10)	1.0
  (3, 10)	1.0
  (4, 10)	1.0
  (5, 10)	1.0
  (6, 10)	1.0
  (7, 10)	1.0
  (8, 10)	1.0
  (9, 10)	1.0
  (10, 10)	1.0
  (11, 10)	1.0
  (12, 10)	1.0
  (13, 10)	1.0
  (14, 10)	1.0
  (15, 10)	1.0
  (16, 10)	1.0
  (17, 10)	1.0
  (18, 10)	1.0
  (19, 10)	1.0
  (20, 10)	1.0
  (21, 11)	1.0
  (22, 11)	1.0
  (23, 11)	1.0
  (24, 12)	1.0
  :	:
  (36448, 31)	1.0
  (36449, 31)	1.0
  (36450, 31)	1.0
  (36451, 31)	1.0
  (36452, 31)	1.0
  (36453, 31)	1.0
  (36454, 31)	1.0
  (36455, 31)	1.0
  (36456, 31)	1.0
  (36457, 31)	1.0
  (36458, 31)	1.0
  (36459, 31)	1.0
  (36460, 31)	1.0
  (36461, 31)	1.0
  (36462, 31)	1.0
  (36463, 31)	1.0
  (36464, 31)	1.0
  (36465, 31)	1.0
  (36466, 31)	1.0
  (36467, 31)	1.0
  (36468, 31)	1.0
  (36469, 31)	1.0
  (36470, 31)	1.0
  (36471, 31)	1.0
  (36472, 31)	1.0
[]
abatement abatement of pollution


AttributeError: 'numpy.ndarray' object has no attribute 'append'

In [29]:
X

array(<36473x33268 sparse matrix of type '<class 'numpy.float64'>'
	with 10919465 stored elements in Compressed Sparse Row format>,
      dtype=object)

In [30]:
X[0]

IndexError: too many indices for array: array is 0-dimensional, but 1 were indexed

In [7]:
X[1]

array([1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 38.0, 1, 0, 71.2833],
      dtype=object)

Impute missing values.

In [8]:
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
X = imputer.fit_transform(X)
print(X)

[[ 0.      0.      1.     ...  1.      0.      7.25  ]
 [ 1.      0.      0.     ...  1.      0.     71.2833]
 [ 0.      0.      1.     ...  0.      0.      7.925 ]
 ...
 [ 0.      0.      1.     ...  1.      2.     23.45  ]
 [ 1.      0.      0.     ...  0.      0.     30.    ]
 [ 0.      0.      1.     ...  0.      0.      7.75  ]]


In [9]:
print(y)

[0 1 1 1 0 0 0 0 1 1 1 1 0 0 0 1 0 1 0 1 0 1 1 1 0 1 0 0 1 0 0 1 1 0 0 0 1
 0 0 1 0 0 0 1 1 0 0 1 0 0 0 0 1 1 0 1 1 0 1 0 0 1 0 0 0 1 1 0 1 0 0 0 0 0
 1 0 0 0 1 1 0 1 1 0 1 1 0 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 1 0 1 0
 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 1 0 0 0 0 1 0 0 1 0 0 0 0 1 1 0 0 0 1 0
 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 1 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 1
 0 1 1 0 0 1 0 1 1 1 1 0 0 1 0 0 0 0 0 1 0 0 1 1 1 0 1 0 0 0 1 1 0 1 0 1 0
 0 0 1 0 1 0 0 0 1 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0 1 1 1 1
 1 0 1 0 0 0 0 0 1 1 1 0 1 1 0 1 1 0 0 0 1 0 0 0 1 0 0 1 0 1 1 1 1 0 0 0 0
 0 0 1 1 1 1 0 1 0 1 1 1 0 1 1 1 0 0 0 1 1 0 1 1 0 0 1 1 0 1 0 1 1 1 1 0 0
 0 1 0 0 1 1 0 1 1 0 0 0 1 1 1 1 0 0 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 1 1 1 1
 1 0 0 0 0 1 1 0 0 0 1 1 0 1 0 0 0 1 0 1 1 1 0 1 1 0 0 0 0 1 1 0 0 0 0 0 0
 1 0 0 0 0 1 0 1 0 1 1 0 0 0 0 0 0 0 0 1 1 0 1 1 1 1 0 0 1 0 1 0 0 1 0 0 1
 1 1 1 1 1 1 0 0 0 1 0 1 0 1 1 0 1 0 0 0 0 0 0 0 0 1 0 0 1 1 0 0 0 0 0 1 0
 0 0 1 1 0 1 0 0 1 0 0 0 

## Train/Test Split

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0, stratify = y)

## Feature Scaling

In [11]:
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

## Kernel SVC

In [12]:
svc_classifier = SVC(C = 1.0, gamma='scale', kernel = 'rbf', random_state = 0)
svc_classifier.fit(X_train, y_train)

### Confusion Matrix, Accuracy, and Cross-Validation
For default values of hyperparameters for SVC.

In [13]:
y_pred = svc_classifier.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print(cm)

[[100  10]
 [ 26  43]]


Accuracy against held-out test data

In [14]:
accuracy_score(y_test, y_pred)

0.7988826815642458

Cross-validation

In [15]:
accuracies = cross_val_score(estimator = svc_classifier, X = X_train, y = y_train, cv = 5)
print("Accuracy: {:.2f} %".format(accuracies.mean()*100))
print("Standard Deviation: {:.2f} %".format(accuracies.std()*100))

Accuracy: 81.74 %
Standard Deviation: 1.78 %


### Grid Search

In [16]:
cstep = 0.05
gstep = 0.05
parameters = [{'C': np.arange(cstep,1+cstep,cstep), 'kernel': ['linear']},
              {'C': np.arange(cstep,1+cstep,cstep), 'kernel': ['poly'], 'degree': [2, 3, 4]},
              {'C': np.arange(cstep,1+cstep,cstep), 'kernel': ['sigmoid'], 'gamma': ['scale']},
              {'C': np.arange(cstep,1+cstep,cstep), 'kernel': ['sigmoid'], 'gamma': np.arange(gstep,1+gstep,gstep)},
              {'C': np.arange(cstep,1+cstep,cstep), 'kernel': ['rbf'], 'gamma': ['scale']},
              {'C': np.arange(cstep,1+cstep,cstep), 'kernel': ['rbf'], 'gamma': np.arange(gstep,1+gstep,gstep)}]
svc_grid_search = GridSearchCV(estimator = svc_classifier,
                           param_grid = parameters,
                           scoring = 'accuracy',
                           cv = 5,
                           n_jobs = -1)
svc_grid_search.fit(X_train, y_train)
best_accuracy = svc_grid_search.best_score_
best_parameters = svc_grid_search.best_params_
cv_results = pd.DataFrame(svc_grid_search.cv_results_)
print("Best Accuracy: {:.2f} %".format(best_accuracy*100))
print("Best Parameters:", best_parameters)
print("Cross-Validation Results:\n", cv_results)

Best Accuracy: 82.44 %
Best Parameters: {'C': 0.45, 'degree': 3, 'kernel': 'poly'}
Cross-Validation Results:
      mean_fit_time  std_fit_time  mean_score_time  std_score_time  param_C   
0         0.004099      0.000465         0.001153        0.000063     0.05  \
1         0.004258      0.000807         0.001190        0.000065     0.10   
2         0.004563      0.000924         0.001168        0.000106     0.15   
3         0.004141      0.001269         0.001111        0.000196     0.20   
4         0.004723      0.001048         0.001139        0.000201     0.25   
..             ...           ...              ...             ...      ...   
915       0.005585      0.001349         0.001840        0.000311     1.00   
916       0.006084      0.000847         0.002056        0.000133     1.00   
917       0.005323      0.000694         0.001813        0.000271     1.00   
918       0.005689      0.000907         0.001639        0.000381     1.00   
919       0.005702      0.001005

### Confusion Matrix
For the SVC model found via grid search.

In [17]:
y_pred = svc_grid_search.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print(cm)

[[106   4]
 [ 28  41]]


### Accuracy
Against held-out test data.

In [18]:
accuracy_score(y_test, y_pred)

0.8212290502793296

## XGBoost

Train XGBoost classifier.

In [19]:
xgb_classifier = XGBClassifier()
xgb_classifier.fit(X_train, y_train)

### Confusion Matrix

In [20]:
y_pred = xgb_classifier.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print(cm)

[[97 13]
 [22 47]]


### Accuracy

In [21]:
accuracy_score(y_test, y_pred)

0.8044692737430168

### Cross-Validation

In [22]:
accuracies = cross_val_score(estimator = xgb_classifier, X = X_train, y = y_train, cv = 5)
print("Accuracy: {:.2f} %".format(accuracies.mean()*100))
print("Standard Deviation: {:.2f} %".format(accuracies.std()*100))

Accuracy: 81.31 %
Standard Deviation: 2.77 %


## Random Forest Classifier

In [23]:
rf_classifier = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0)
rf_classifier.fit(X_train, y_train)

### Confusion Matrix

In [24]:
y_pred = rf_classifier.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print(cm)

[[96 14]
 [25 44]]


### Accuracy

In [25]:
accuracy_score(y_test, y_pred)

0.7821229050279329

### Cross-Validation

In [26]:
accuracies = cross_val_score(estimator = rf_classifier, X = X_train, y = y_train, cv = 5)
print("Accuracy: {:.2f} %".format(accuracies.mean()*100))
print("Standard Deviation: {:.2f} %".format(accuracies.std()*100))

Accuracy: 79.77 %
Standard Deviation: 1.56 %


### Grid Search

In [27]:
parameters = [{'n_estimators': range(10,500,5), 'criterion': ['entropy']}]
rf_grid_search = GridSearchCV(estimator = rf_classifier,
                           param_grid = parameters,
                           scoring = 'accuracy',
                           cv = 5,
                           n_jobs = -1)
rf_grid_search.fit(X_train, y_train)
best_accuracy = rf_grid_search.best_score_
best_parameters = rf_grid_search.best_params_
cv_results = pd.DataFrame(rf_grid_search.cv_results_)
print("Best Accuracy: {:.2f} %".format(best_accuracy*100))
print("Best Parameters:", best_parameters)
print("Cross-Validation Results:\n", cv_results)

Best Accuracy: 81.17 %
Best Parameters: {'criterion': 'entropy', 'n_estimators': 435}
Cross-Validation Results:
     mean_fit_time  std_fit_time  mean_score_time  std_score_time   
0        0.011068      0.000151         0.001146        0.000125  \
1        0.016197      0.000147         0.001694        0.000013   
2        0.019246      0.003356         0.001701        0.000354   
3        0.024976      0.001439         0.001957        0.000246   
4        0.029938      0.001057         0.002329        0.000266   
..            ...           ...              ...             ...   
93       0.677404      0.010794         0.024239        0.002555   
94       0.638109      0.026583         0.023748        0.005168   
95       0.554648      0.043767         0.018761        0.006192   
96       0.508622      0.057114         0.014016        0.000162   
97       0.393082      0.039395         0.014072        0.000168   

   param_criterion  param_n_estimators   
0          entropy          

### Confusion Matrix

In [28]:
y_pred = rf_grid_search.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print(cm)

[[99 11]
 [25 44]]


### Accuracy

In [29]:
accuracy_score(y_test, y_pred)

0.7988826815642458