# Team Big HIT Stage 3 Notebook

## Introduction

This Jupyter notebook describes stage 3 of our project, in which we performed entity matching between two tables of movie data collected from www.imdb.com and www.themoviedb.org. Our goal was to match as many movies as possible while keeping precision and recall as high as we could.

To begin, we installed the *py_entitymatching* package and imported it along with the other packages we needed.

In [1]:
# Import py_entitymatching package
import py_entitymatching as em
import os
import pandas as pd

## 1) Read input tables A and B

The first step towards entity matching was to read the input tables from IMDb and TMDb into Python. These tables were generated in project stage 2.

## 2) Apply blockers to generate candidate set C

Our blocker chain includes:

## 3) Reading in labeled sample G

After generating candidate set C, we sampled a smaller set of tuple pairs, S, and labeled each pair as a match or a non-match to produce the labeled sample G.

Some examples of tuples in the labeled sample G include:

```
,Unnamed: 0,_id,l_id,r_id,l_title,l_cast,l_directors,l_writers,l_genres,l_keywords,l_content_rating,l_run_time,l_release_year,l_languages,l_rating,l_budget,l_revenue,l_opening_weekend_revenue,l_production_companies,l_production_countries,l_alternative_titles,r_title,r_cast,r_directors,r_writers,r_genres,r_keywords,r_content_rating,r_run_time,r_release_year,r_languages,r_rating,r_budget,r_revenue,r_opening_weekend_revenue,r_production_companies,r_production_countries,r_alternative_titles,,label

41,2653,2653,a13,b1148,Baby Driver,Ansel Elgort;Jon Bernthal;Jon Hamm;Eiza González;Micah Howard,Edgar Wright,Edgar Wright,Action;Crime;Music;Thriller,getaway driver;singing in a car;heist;ipod;mini dress,R,112,2017,English;American Sign Language,7.7,"$34,000,000 ","$175,089,495 ","$20,553,320 ",TriStar Pictures;Media Rights Capital (MRC);Double Negative,UK;USA,Baby Driver,Wind River,Jeremy Renner;Elizabeth Olsen;Gil Birmingham;Kelsey Asbille;Jon Bernthal,Taylor Sheridan,Taylor Sheridan,Crime;Drama;Mystery;Thriller,usa;rape;mountain;gun;investigation;forest;murder;native american;shootout;photograph;violence;fbi agent;binoculars;snowmobile;indian reservation,R,107,2017,English,74,"$11,000,000.00 ","$184,770,205.00 ",,,,,,0

129,43448,43448,a113,b3209,The Prestige,Hugh Jackman;Christian Bale;Michael Caine;Piper Perabo;Rebecca Hall,Christopher Nolan,Jonathan Nolan;Christopher Nolan,Drama;Mystery;Sci-Fi;Thriller,rivalry;illusion;obsession;magician;secret,PG-13,130,2006,English,8.5,"$40,000,000 ","$109,676,311 ","$14,801,808 ",Touchstone Pictures;Warner Bros.;Newmarket Productions,USA;UK,El gran truco,Scoop,Scarlett Johansson;Hugh Jackman;Woody Allen;Ian McShane;Kevin McNally,Woody Allen,Woody Allen,Comedy;Mystery,upper class;prostitute;journalist;drowning;newspaper;magic;tarot cards;magic show;lordship;suspicion of murder;headline;funeral;investigation;daughter;music instrument;afterlife;swimming pool;murder;wealth;american;united kingdom,PG-13,96,2006,English,64,"$4,000,000.00 ","$31,584,901.00 ",,,,,,0
```


In [None]:
G = em.read_csv_metadata(FOLDER+'G.csv', key = '_id', ltable = A, rtable = B,
                             fk_ltable = 'l_id', fk_rtable = 'r_id') # labeled data

## 4) Splitting the labeled set G into development set I and evaluation set J

The labeled sample set G must now be split into two different sets, I and J, the development and evaluation sets, respectively. We decided to split the sample set 50/50 so that half of the tuples are in set I and the other half are in set J.

In [None]:
# Split G into I and J for CV
IJ = em.split_train_test(G, train_proportion = 0.5, random_state = 0)
I = IJ['train']
J = IJ['test']
# Save I and J to files
I.to_csv(FOLDER+'I.csv')
J.to_csv(FOLDER+'J.csv')
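
To make the effect of `train_proportion = 0.5` and a fixed `random_state` concrete, here is a minimal plain-Python sketch of a seeded 50/50 split. It is not how `em.split_train_test` is implemented; `split_half` and `pair_ids` are hypothetical names used only for illustration.

```python
import random


def split_half(rows, train_proportion=0.5, seed=0):
    """Shuffle a copy of rows with a fixed seed, then cut it in two."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # same seed gives the same order
    cut = int(len(shuffled) * train_proportion)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical stand-ins for the _id values of the labeled pairs in G
pair_ids = list(range(10))
dev_ids, eval_ids = split_half(pair_ids)
```

Because the seed is fixed, rerunning the split gives the same two sets, so results computed on I and J stay comparable between runs.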

## 5) Create a set of machine learning matchers

We first generated a set of features F from the input tables A and B.

In [None]:
# Generate features set F
F = em.get_features_for_matching(A, B, validate_inferred_attr_types = False)
print(F.feature_name)
print(type(F))

In [None]:
# put output from previous step here

Then, we extracted the feature vectors H from I using F, filled in missing feature values with the column mean, and created a set of learning-based matchers to train on H.

In [None]:
# Convert I to a set of feature vectors using F
H = em.extract_feature_vecs(I, feature_table = F, attrs_after = 'label', show_progress = False)
#print(H.head)
excluded_attributes = ['_id', 'l_id', 'r_id', 'label']
# Fill in missing values with column's average
H = em.impute_table(H, exclude_attrs = excluded_attributes,
            strategy='mean')
# Create a set of matchers
dt = em.DTMatcher(name='DecisionTree', random_state=0)
svm = em.SVMMatcher(name='SVM', random_state=0)
rf = em.RFMatcher(name='RF', random_state=0)
lg = em.LogRegMatcher(name='LogReg', random_state=0)
ln = em.LinRegMatcher(name='LinReg')
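
As an illustration of what mean imputation does to a table of feature vectors, the following plain-Python sketch fills each missing value with the mean of its column while leaving excluded columns untouched. It is only a conceptual stand-in for `em.impute_table`; the function name and the feature names `title_jac` and `year_exm` are hypothetical.

```python
import math


def impute_column_means(rows, exclude):
    """Return copies of rows with NaN entries replaced by the column mean."""
    columns = [c for c in rows[0] if c not in exclude]
    means = {}
    for c in columns:
        present = [r[c] for r in rows if not math.isnan(r[c])]
        # Assumption: fall back to 0.0 when a column has no observed values
        means[c] = sum(present) / len(present) if present else 0.0
    return [
        {k: (means[k] if k in means and math.isnan(v) else v) for k, v in r.items()}
        for r in rows
    ]


# Hypothetical feature vectors; NaN marks a feature that could not be computed
rows = [
    {'_id': 1, 'title_jac': 1.0, 'year_exm': float('nan'), 'label': 1},
    {'_id': 2, 'title_jac': 0.2, 'year_exm': 0.0, 'label': 0},
    {'_id': 3, 'title_jac': float('nan'), 'year_exm': 1.0, 'label': 0},
]
filled = impute_column_means(rows, exclude={'_id', 'label'})
```

Excluding the key columns and `label` matters: averaging identifiers or the target would be meaningless and could distort training.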

## 6) Select the best matcher

In [None]:
# Select the best matcher with 10-fold CV, using F1 score as the criterion
CV_result = em.select_matcher([dt, rf, svm, ln, lg], table = H,
                              exclude_attrs = excluded_attributes,
                              k = 10, target_attr = 'label',
                              metric_to_select_matcher = 'f1',
                              random_state = 0)
print(CV_result['cv_stats'])
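
For readers unfamiliar with k-fold cross-validation, the sketch below shows how a table of `n` rows can be partitioned into `k` folds so that every row is used for testing exactly once. It is a generic illustration, not the fold assignment that `em.select_matcher` performs internally (which may shuffle or stratify); `k_fold_indices` is a hypothetical helper.

```python
def k_fold_indices(n, k):
    """Yield (train, test) index lists for k contiguous, near-equal folds."""
    # The first n % k folds get one extra row so the sizes differ by at most 1
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n) if i < start or i >= stop]
        yield train, test
        start = stop


# Hypothetical example: 25 feature vectors split into 10 folds
folds = list(k_fold_indices(25, 10))
```

Each matcher is trained on the `train` rows and scored on the `test` rows of every fold, and the per-fold scores are averaged; with `metric_to_select_matcher = 'f1'`, the matcher with the best average F1 is the one selected.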

In [None]:
# put output of best matcher here, should be RF

## 7) Evaluate the best matcher Y using evaluation set J

In [None]:
# Best matcher found is RF, train RF on H
rf.fit(table = H, exclude_attrs = excluded_attributes, target_attr = 'label')
# Convert J into a set of features using F
L = em.extract_feature_vecs(J, feature_table = F, attrs_after = 'label', show_progress = False)
# Fill in missing values with column's average
L = em.impute_table(L, exclude_attrs = excluded_attributes,
            strategy='mean')
# Predict on L
predictions = rf.predict(table = L, exclude_attrs = excluded_attributes,
                         append = True, target_attr = 'predicted', inplace = False,
                         return_probs = True, probs_attr = 'proba')
# Evaluate predictions
eval_result = em.eval_matches(predictions, 'label', 'predicted')
em.print_eval_summary(eval_result)
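
The evaluation compares the `label` column with the `predicted` column. As a reminder of how precision, recall, and F1 follow from those two columns, here is a small self-contained sketch. The label lists are made up for illustration and are not results from our data.

```python
def precision_recall_f1(gold, predicted):
    """Compute precision, recall, and F1 for binary match labels (1 = match)."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, predicted) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Made-up labels: 3 true positives, 1 false positive, 1 false negative
gold = [1, 1, 1, 0, 0, 0, 1, 0]
pred = [1, 1, 0, 0, 1, 0, 1, 0]
precision, recall, f1 = precision_recall_f1(gold, pred)
```

Precision is the fraction of predicted matches that are correct, recall is the fraction of true matches that were found, and F1 is their harmonic mean.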