# Spark-Matcher advanced example 

This notebook shows how to use `spark_matcher` with more customized settings. First, we create a Spark session:

In [None]:
%config Completer.use_jedi = False  # for proper autocompletion
from pyspark.sql import SparkSession

In [None]:
spark = (
    SparkSession.builder
    .master("local")
    .enableHiveSupport()
    .getOrCreate()
)

Load the example data:

In [None]:
from spark_matcher.data import load_data

We use the 'library' data and remove the (numeric) 'year' column:

In [None]:
a, b = load_data(spark, kind='library')
a, b = a.drop('year'), b.drop('year')

In [None]:
a.limit(3).toPandas()

Unnamed: 0,title,authors,venue
0,The WASA2 object-oriented workflow management ...,"Gottfried Vossen, Mathias Weske",International Conference on Management of Data
1,A user-centered interface for querying distrib...,"Isabel F. Cruz, Kimberly M. James",International Conference on Management of Data
2,"World Wide Database-integrating the Web, CORBA...","Athman Bouguettaya, Boualem Benatallah, Lily H...",International Conference on Management of Data


`spark_matcher` ships with a utility function that returns the most frequently occurring words in a Spark dataframe column. We apply it to the `venue` column of both dataframes combined:

In [None]:
from spark_matcher.utils import get_most_frequent_words

In [None]:
frequent_words = get_most_frequent_words(a.unionByName(b), col_name='venue')
frequent_words.head(10)

Unnamed: 0,words,count,df
0,SIGMOD,1917,0.390428
1,Data,1640,0.334012
2,Conference,1603,0.326477
3,VLDB,1289,0.262525
4,on,1135,0.231161
5,Record,1111,0.226273
6,International,1001,0.20387
7,,858,0.174745
8,Large,843,0.17169
9,Very,843,0.17169
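
To make the `count` and `df` columns more concrete, here is a minimal plain-Python sketch of a word-frequency table. It is not the implementation of `get_most_frequent_words`; the helper name `most_frequent_words` is hypothetical, and it assumes that `count` is the total number of occurrences of a word and that `df` is that count divided by the number of rows. The values shown above are at least consistent with a division by a fixed row count (1917 / 0.390428 ≈ 4910 and 1640 / 0.334012 ≈ 4910).

```python
from collections import Counter


def most_frequent_words(values, top=10):
    """Hypothetical helper: count space-separated words over all values.

    Returns (word, count, df) tuples, where df = count / number of rows.
    This is an assumption about what the columns mean, not the library's code.
    """
    n_rows = len(values)
    if n_rows == 0:
        return []
    # Splitting on a single space keeps empty tokens, e.g. for empty values.
    counts = Counter(word for value in values for word in value.split(" "))
    return [(word, count, count / n_rows) for word, count in counts.most_common(top)]


venues = [
    "International Conference on Management of Data",
    "SIGMOD Record",
    "SIGMOD Conference",
]
for word, count, df in most_frequent_words(venues, top=3):
    print(word, count, round(df, 3))
```

Splitting on a single space is a deliberate choice in this sketch: it produces an empty token for empty values or repeated spaces, which is one possible explanation for an empty entry in a frequency table.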


Based on this list, we decide to treat the words 'conference' and 'international' as stopwords. The utility function `remove_stopwords` removes them from a column:

In [None]:
from spark_matcher.utils import remove_stopwords

In [None]:
stopwords = ['conference', 'international']

In [None]:
a = remove_stopwords(a, col_name='venue', stopwords=stopwords).drop('venue')
b = remove_stopwords(b, col_name='venue', stopwords=stopwords).drop('venue')

This creates a new column `venue_wo_stopwords` in which the stopwords are removed. Because we drop the original `venue` column, only the cleaned column remains:

In [None]:
a.limit(3).toPandas()

Unnamed: 0,title,authors,venue_wo_stopwords
0,The WASA2 object-oriented workflow management ...,"Gottfried Vossen, Mathias Weske",on Management of Data
1,A user-centered interface for querying distrib...,"Isabel F. Cruz, Kimberly M. James",on Management of Data
2,"World Wide Database-integrating the Web, CORBA...","Athman Bouguettaya, Boualem Benatallah, Lily H...",on Management of Data
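
The stopwords were given in lower case, yet 'International' and 'Conference' are gone from the output above, which suggests that the comparison ignores case. A minimal plain-Python sketch of that behaviour follows; the function `strip_stopwords` is hypothetical and only illustrates the idea, it is not how `remove_stopwords` is implemented.

```python
def strip_stopwords(text, stopwords):
    """Hypothetical helper: drop stopwords from a string, ignoring case."""
    stop = {word.lower() for word in stopwords}
    # Keep the original casing of the words that survive.
    return " ".join(word for word in text.split() if word.lower() not in stop)


stopwords = ["conference", "international"]
print(strip_stopwords("International Conference on Management of Data", stopwords))
```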


We use `spark_matcher` to link the records in dataframe `a` to the records in dataframe `b`. Instead of the `venue` column, we now use the newly created `venue_wo_stopwords` column:

In [None]:
from spark_matcher.matcher import Matcher

In [None]:
myMatcher = Matcher(spark, col_names=['title', 'authors', 'venue_wo_stopwords'], checkpoint_dir='path_to_checkpoints')

Now we are ready to fit the `Matcher` object:

In [None]:
myMatcher.fit(a, b)

The `Matcher` is now trained and can be used to predict on all data as usual:

In [None]:
result = myMatcher.predict(a, b, threshold=0.5, top_n=3)
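
The `threshold` and `top_n` arguments restrict which candidate pairs are returned. The sketch below shows one plausible reading in plain Python: keep pairs whose score is at least `threshold`, then keep the `top_n` best-scoring pairs per record of `a`. The function `filter_matches`, the tuple layout, and the exact semantics (for instance `>=` versus `>`) are assumptions made for illustration, not the library's documented behaviour.

```python
from collections import defaultdict


def filter_matches(scored_pairs, threshold=0.5, top_n=3):
    """Hypothetical helper: apply a score threshold, then keep top_n per left record.

    scored_pairs is an iterable of (left_id, right_id, score) tuples.
    """
    by_left = defaultdict(list)
    for left_id, right_id, score in scored_pairs:
        if score >= threshold:  # assumption: the threshold is inclusive
            by_left[left_id].append((left_id, right_id, score))

    kept = []
    for pairs in by_left.values():
        # Highest score first, then truncate to the best top_n candidates.
        kept.extend(sorted(pairs, key=lambda pair: pair[2], reverse=True)[:top_n])
    return kept


pairs = [
    ("a1", "b1", 0.9),
    ("a1", "b2", 0.4),
    ("a1", "b3", 0.7),
    ("a2", "b1", 0.2),
]
print(filter_matches(pairs, threshold=0.5, top_n=1))
```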

Now let's have a look at the results:

In [None]:
result_pdf = result.toPandas()

In [None]:
result_pdf.sort_values('score')