# Deepmatcher Test

### Setup and Data Download

First, import the required packages.

In [1]:
import deepmatcher as dm
import torch

Next, run the following *commands* to download the itunes-amazon dataset:

In [None]:
!mkdir -p sample_data/itunes-amazon
!wget -qnc -P sample_data/itunes-amazon https://raw.githubusercontent.com/anhaidgroup/deepmatcher/master/examples/sample_data/itunes-amazon/train.csv
!wget -qnc -P sample_data/itunes-amazon https://raw.githubusercontent.com/anhaidgroup/deepmatcher/master/examples/sample_data/itunes-amazon/validation.csv
!wget -qnc -P sample_data/itunes-amazon https://raw.githubusercontent.com/anhaidgroup/deepmatcher/master/examples/sample_data/itunes-amazon/test.csv
!wget -qnc -P sample_data/itunes-amazon https://raw.githubusercontent.com/anhaidgroup/deepmatcher/master/examples/sample_data/itunes-amazon/unlabeled.csv
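Because `wget` runs with `-q` (quiet), a failed download produces no message. The sketch below is one way to confirm that the four files listed above are present and non-empty before continuing. The helper `missing_files` is a hypothetical name introduced here for illustration; it is not part of deepmatcher.

```python
from pathlib import Path

# File names taken from the wget commands above.
EXPECTED_FILES = ("train.csv", "validation.csv", "test.csv", "unlabeled.csv")


def missing_files(data_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are absent or empty in data_dir."""
    root = Path(data_dir)
    return [
        name
        for name in expected
        if not (root / name).is_file() or (root / name).stat().st_size == 0
    ]


# Hypothetical check, not a deepmatcher call: list anything that failed to download.
missing = missing_files("sample_data/itunes-amazon")
print("Missing files:", missing if missing else "none")
```

An empty result means all four CSV files are in place. Since `-nc` skips files that already exist, an empty or truncated file should be deleted before the download commands are rerun.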

### Data Processing

Let's view the **training data**:

In [3]:
import pandas as pd
pd.read_csv('sample_data/itunes-amazon/train.csv').head()

Unnamed: 0,id,label,left_Song_Name,left_Artist_Name,left_Album_Name,left_Genre,left_Price,left_CopyRight,left_Time,left_Released,right_Song_Name,right_Artist_Name,right_Album_Name,right_Genre,right_Price,right_CopyRight,right_Time,right_Released
0,448,0,Baby When the Light ( David Guetta & Fred Rist...,David Guetta,Pop Life ( Extended Version ) [ Bonus Version ],"Dance , Music , Rock , Pop , House , Electroni...",$ 1.29,‰ ãÑ 2007 Gum Records,6:17,18-Sep-07,Revolver ( Madonna Vs. David Guetta Feat . Lil...,David Guetta,One Love ( Deluxe Version ),Dance & Electronic,$ 1.29,( C ) 2014 Swedish House Mafia Holdings Ltd ( ...,3:18,"August 21 , 2009"
1,287,1,Outversion,Mark Ronson,Version,"Pop , Music , R&B / Soul,Soul,Dance,Rock,Jazz,...",$ 0.99,2007 Mark Ronson under exclusive license to SO...,1:50,10-Jul-07,Outversion,Mark Ronson,Version [ Explicit ],Pop,$ 0.99,( c ) 2011 J'adore Records,1:50,"July 10 , 2007"
2,534,0,Peer Pressure ( feat . Traci Nelson ),Snoop Dogg,Doggumentary,"Hip-Hop/Rap , Music , Rock , Gangsta Rap , Wes...",$ 1.29,"‰ ãÑ 2011 Capitol Records , LLC . All rights r...",4:07,29-Mar-11,Boom ( ( Feat . T-Pain ) [ Edited ] ),Snoop Dogg,Doggumentary [ Edited ],"Rap & Hip-Hop , West Coast",$ 1.29,"( C ) 2011 Capitol Records , LLC",3:50,"March 29 , 2011"
3,181,1,Stars Come Out ( Tim Mason Remix ),Zedd,Stars Come Out ( Remixes ) - EP,"Dance , Music , Electronic , House",$ 1.29,2012 Dim Mak Inc.,5:49,20-May-14,Stars Come Out ( Dillon Francis Remix ),Zedd,Stars Come Out [ Dillon Francis Remix ],Dance & Electronic,$ 1.29,2012 Dim Mak Inc.,4:08,"May 20 , 2014"
4,485,0,Jump ( feat . Nelly Furtado ),Flo Rida,R.O.O.T.S. ( Deluxe Version ),"Hip-Hop/Rap , Music",$ 1.29,‰ ãÑ 2009 Atlantic Recording Corporation for t...,3:28,30-Mar-09,"Yayo [ Feat . Brisco , Billy Blue , Ball Greez...",Flo Rida,R.O.O.T.S. ( Route Of Overcoming The Struggle ...,Rap & Hip-Hop,$ 1.29,"( C ) 2012 Motown Records , a Division of UMG ...",7:53,"March 30 , 2009"


This **training dataset**, along with the **validation and test datasets**, needs to be processed and tokenized before we proceed with training and testing.

In [4]:
train, validation, test = dm.data.process(
    path='sample_data/itunes-amazon',
    train='train.csv',
    validation='validation.csv',
    test='test.csv')
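The CSV header shown earlier stores every attribute twice, once with a `left_` prefix and once with a `right_` prefix, next to the `id` and `label` columns. As a rough illustration of that layout, the sketch below lists the attributes that appear on both sides. `paired_attributes` is a hypothetical helper, not a deepmatcher function, and the two prefixes are assumed from the column names in the table.

```python
def paired_attributes(columns, left_prefix="left_", right_prefix="right_"):
    """Return attribute names that occur with both the left and right prefix."""
    left = [c[len(left_prefix):] for c in columns if c.startswith(left_prefix)]
    right = {c[len(right_prefix):] for c in columns if c.startswith(right_prefix)}
    # Keep the order in which the left-side columns appear.
    return [name for name in left if name in right]


# Column names copied from the train.csv header displayed above.
columns = [
    "id", "label",
    "left_Song_Name", "left_Artist_Name", "left_Album_Name", "left_Genre",
    "left_Price", "left_CopyRight", "left_Time", "left_Released",
    "right_Song_Name", "right_Artist_Name", "right_Album_Name", "right_Genre",
    "right_Price", "right_CopyRight", "right_Time", "right_Released",
]
print(paired_attributes(columns))
```

Each name in the result is one attribute compared across the two records of a candidate pair.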

Let's view the tokenized data:

In [5]:
train_table = train.get_raw_table()
train_table.head()

Unnamed: 0,id,label,left_Song_Name,left_Artist_Name,left_Album_Name,left_Genre,left_Price,left_CopyRight,left_Time,left_Released,right_Song_Name,right_Artist_Name,right_Album_Name,right_Genre,right_Price,right_CopyRight,right_Time,right_Released
0,448,0,baby when the light ( david guetta & fred rist...,david guetta,pop life ( extended version ) [ bonus version ],"dance , music , rock , pop , house , electroni...",$ 1.29,‰ ãñ 2007 gum records,6:17,18-sep-07,revolver ( madonna vs. david guetta feat . lil...,david guetta,one love ( deluxe version ),dance & electronic,$ 1.29,( c ) 2014 swedish house mafia holdings ltd ( ...,3:18,"august 21 , 2009"
1,287,1,outversion,mark ronson,version,"pop , music , r & b / soul , soul , dance , ro...",$ 0.99,2007 mark ronson under exclusive license to so...,1:50,10-jul-07,outversion,mark ronson,version [ explicit ],pop,$ 0.99,( c ) 2011 j'adore records,1:50,"july 10 , 2007"
2,534,0,peer pressure ( feat . traci nelson ),snoop dogg,doggumentary,"hip-hop/rap , music , rock , gangsta rap , wes...",$ 1.29,"‰ ãñ 2011 capitol records , llc . all rights r...",4:07,29-mar-11,boom ( ( feat . t-pain ) [ edited ] ),snoop dogg,doggumentary [ edited ],"rap & hip-hop , west coast",$ 1.29,"( c ) 2011 capitol records , llc",3:50,"march 29 , 2011"
3,181,1,stars come out ( tim mason remix ),zedd,stars come out ( remixes ) - ep,"dance , music , electronic , house",$ 1.29,2012 dim mak inc .,5:49,20-may-14,stars come out ( dillon francis remix ),zedd,stars come out [ dillon francis remix ],dance & electronic,$ 1.29,2012 dim mak inc .,4:08,"may 20 , 2014"
4,485,0,jump ( feat . nelly furtado ),flo rida,r.o.o.t.s . ( deluxe version ),"hip-hop/rap , music",$ 1.29,‰ ãñ 2009 atlantic recording corporation for t...,3:28,30-mar-09,"yayo [ feat . brisco , billy blue , ball greez...",flo rida,r.o.o.t.s . ( route of overcoming the struggle...,rap & hip-hop,$ 1.29,"( c ) 2012 motown records , a division of umg ...",7:53,"march 30 , 2009"


### Training the model

In [6]:
model = dm.MatchingModel(attr_summarizer='hybrid')

In [7]:
model.run_train(
    train,
    validation,
    epochs=10,
    batch_size=16,
    best_save_path='hybrid_model.pth',
    pos_neg_ratio=3)

* Number of trainable parameters: 17757810
===>  TRAIN Epoch 1


0% [████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:30


Finished Epoch 1 || Run Time:   33.0 | Load Time:    0.1 || F1:  35.75 | Prec:  31.37 | Rec:  41.56 || Ex/s:   9.78

===>  EVAL Epoch 1


0% [█] 100% | ETA: 00:00:00
Total time elapsed: 00:00:02


Finished Epoch 1 || Run Time:    3.4 | Load Time:    0.0 || F1:  47.50 | Prec:  33.93 | Rec:  79.17 || Ex/s:  31.92

* Best F1: 47.5
Saving best model...
Done.
---------------------

===>  TRAIN Epoch 2


0% [████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:28


Finished Epoch 2 || Run Time:   31.3 | Load Time:    0.1 || F1:  56.17 | Prec:  41.77 | Rec:  85.71 || Ex/s:  10.29

===>  EVAL Epoch 2


0% [█] 100% | ETA: 00:00:00
Total time elapsed: 00:00:02


Finished Epoch 2 || Run Time:    3.3 | Load Time:    0.0 || F1:  51.35 | Prec:  38.00 | Rec:  79.17 || Ex/s:  32.47

* Best F1: 51.351351351351354
Saving best model...
Done.
---------------------

===>  TRAIN Epoch 3


0% [████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:29


Finished Epoch 3 || Run Time:   31.9 | Load Time:    0.1 || F1:  58.04 | Prec:  44.22 | Rec:  84.42 || Ex/s:  10.09

===>  EVAL Epoch 3


0% [█] 100% | ETA: 00:00:00
Total time elapsed: 00:00:02


Finished Epoch 3 || Run Time:    3.2 | Load Time:    0.0 || F1:  59.38 | Prec:  47.50 | Rec:  79.17 || Ex/s:  33.19

* Best F1: 59.375
Saving best model...
Done.
---------------------

===>  TRAIN Epoch 4


0% [████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:36


Finished Epoch 4 || Run Time:   39.6 | Load Time:    0.1 || F1:  60.47 | Prec:  47.10 | Rec:  84.42 || Ex/s:   8.13

===>  EVAL Epoch 4


0% [█] 100% | ETA: 00:00:00
Total time elapsed: 00:00:02


Finished Epoch 4 || Run Time:    3.8 | Load Time:    0.0 || F1:  59.38 | Prec:  47.50 | Rec:  79.17 || Ex/s:  28.13

---------------------

===>  TRAIN Epoch 5


0% [████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:35


Finished Epoch 5 || Run Time:   37.6 | Load Time:    0.1 || F1:  61.61 | Prec:  48.51 | Rec:  84.42 || Ex/s:   8.56

===>  EVAL Epoch 5


0% [█] 100% | ETA: 00:00:00
Total time elapsed: 00:00:02


Finished Epoch 5 || Run Time:    3.4 | Load Time:    0.0 || F1:  60.32 | Prec:  48.72 | Rec:  79.17 || Ex/s:  31.84

* Best F1: 60.317460317460316
Saving best model...
Done.
---------------------

===>  TRAIN Epoch 6


0% [████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:29


Finished Epoch 6 || Run Time:   32.2 | Load Time:    0.1 || F1:  62.00 | Prec:  50.41 | Rec:  80.52 || Ex/s:  10.00

===>  EVAL Epoch 6


0% [█] 100% | ETA: 00:00:00
Total time elapsed: 00:00:02


Finished Epoch 6 || Run Time:    3.3 | Load Time:    0.0 || F1:  61.29 | Prec:  50.00 | Rec:  79.17 || Ex/s:  32.74

* Best F1: 61.29032258064515
Saving best model...
Done.
---------------------

===>  TRAIN Epoch 7


0% [████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:29


Finished Epoch 7 || Run Time:   32.0 | Load Time:    0.1 || F1:  64.29 | Prec:  52.94 | Rec:  81.82 || Ex/s:  10.08

===>  EVAL Epoch 7


0% [█] 100% | ETA: 00:00:00
Total time elapsed: 00:00:02


Finished Epoch 7 || Run Time:    3.2 | Load Time:    0.0 || F1:  61.29 | Prec:  50.00 | Rec:  79.17 || Ex/s:  33.34

---------------------

===>  TRAIN Epoch 8


0% [████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:30


Finished Epoch 8 || Run Time:   34.2 | Load Time:    0.1 || F1:  66.33 | Prec:  54.62 | Rec:  84.42 || Ex/s:   9.42

===>  EVAL Epoch 8


0% [█] 100% | ETA: 00:00:00
Total time elapsed: 00:00:03


Finished Epoch 8 || Run Time:    4.3 | Load Time:    0.0 || F1:  62.50 | Prec:  50.00 | Rec:  83.33 || Ex/s:  24.73

* Best F1: 62.5
Saving best model...
Done.
---------------------

===>  TRAIN Epoch 9


0% [████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:33


Finished Epoch 9 || Run Time:   36.2 | Load Time:    0.1 || F1:  66.33 | Prec:  54.10 | Rec:  85.71 || Ex/s:   8.90

===>  EVAL Epoch 9


0% [█] 100% | ETA: 00:00:00
Total time elapsed: 00:00:02


Finished Epoch 9 || Run Time:    3.8 | Load Time:    0.0 || F1:  62.50 | Prec:  50.00 | Rec:  83.33 || Ex/s:  27.97

---------------------

===>  TRAIN Epoch 10


0% [████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:33


Finished Epoch 10 || Run Time:   35.6 | Load Time:    0.1 || F1:  68.00 | Prec:  55.28 | Rec:  88.31 || Ex/s:   9.04

===>  EVAL Epoch 10


0% [█] 100% | ETA: 00:00:00
Total time elapsed: 00:00:02


Finished Epoch 10 || Run Time:    3.5 | Load Time:    0.0 || F1:  62.30 | Prec:  51.35 | Rec:  79.17 || Ex/s:  30.77

---------------------

Loading best model...
Training done.


62.5

### Evaluating on Test Data

In [8]:
# Compute F1 on test set
model.run_eval(test)

===>  EVAL Epoch 8
Finished Epoch 8 || Run Time:    3.0 | Load Time:    0.0 || F1:  60.00 | Prec:  53.85 | Rec:  67.74 || Ex/s:  36.09



59.99999999999999
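The reported F1 is the harmonic mean of precision and recall, so it can be checked against the other two numbers in the log line. The sketch below recomputes it from the rounded percentages printed above. `f1_score` is a small helper defined here, not a deepmatcher function, and the result can differ slightly from the returned value because the inputs are already rounded to two decimals.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both on the same scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as printed in the test-set log line (percentages).
test_f1 = f1_score(53.85, 67.74)
print(round(test_f1, 2))

# Same check for the best validation epoch (Prec 50.00, Rec 83.33).
val_f1 = f1_score(50.00, 83.33)
print(round(val_f1, 2))
```

Both values agree with the logged F1 scores to within rounding, and the test-set result falls a little below the best validation F1 of 62.5.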