# Entity Resolution Prototype (Probabilistic) 

We're going to walk slowly through an attempt to replicate the approach in [this paper](http://crpit.com/confpapers/CRPITV159Moir.pdf). The plan is to start simple
and then work our way up to more complex algorithms. If all goes well, this notebook
will demonstrate an implementation that can then be ported to Java. The prototyping steps
are as follows:

1. Getting Data (including labeled training and CV data)
1. Cleaning Data
1. Metrics or Algorithms for Comparing String Values
1. Train SVM
1. Tune SVM with genetic algorithm
1. Figure out bad values with Q-learning algorithm
1. Construct Bayesian network of generic feature resolvers
1. Use genetic algorithm to find optimal Bayes Network

Note that steps 3-6 are required to construct a 'generic feature resolver', and steps
7-8 are required to construct a Bayesian network of those feature resolvers. A
second notebook provides a deterministic approach rather than the
probabilistic one you see here.

## Step 1: Getting Data (including labeled training and CV data)

In [7]:
# Data was taken from the Kaggle Titanic competition (https://www.kaggle.com/c/titanic)

import pandas as pd

pd.read_csv("../data/train.csv")[0:5]


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35,0,0,373450,8.05,,S


We know that no person appears more than once in this dataset. However, members of the
same family do appear as separate rows. So our goal is to resolve entities to the **family** level
rather than the individual level: the family is our unit of analysis.

In [12]:
train = pd.read_csv("../data/train.csv")
test = pd.read_csv("../data/test.csv")

print("Training set rows: {}".format(len(train)))
print("Test set rows: {}".format(len(test)))

# Note that there isn't a cross validation set available unless we make one.

Training set rows: 891
Test set rows: 418
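
Since no cross-validation set ships with the data, one has to be carved out of the training set. The cell below is a minimal sketch of one way to do that with `DataFrame.sample`. The helper name `split_train_cv`, the 20% fraction, and the seed are assumptions made for illustration, and the demo frame is made up so the cell runs without the CSV files. Note that a purely random row split can put members of one family on both sides, which may matter once we evaluate at the family level.

```python
import pandas as pd


def split_train_cv(frame, cv_fraction=0.2, seed=0):
    """Hold out a random fraction of rows as a cross-validation set.

    Hypothetical helper: the name, fraction, and seed are illustrative choices.
    """
    cv = frame.sample(frac=cv_fraction, random_state=seed)
    # Everything that was not sampled stays in the training part.
    train_part = frame.drop(cv.index)
    return train_part, cv


# Made-up stand-in for the real training frame, just to show the shapes.
demo = pd.DataFrame({"PassengerId": range(1, 11)})
demo_train, demo_cv = split_train_cv(demo)

print("Training rows: {}".format(len(demo_train)))
print("CV rows: {}".format(len(demo_cv)))
```

With the real data this would be called as `split_train_cv(train)`.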


Unfortunately, this dataset doesn't include labels for the purpose of entity
resolution. It's small enough that I'll go in and label it by hand. Remember, our goal is to use
entity resolution to aggregate from the *individual* to the *family* unit of analysis.
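
Labeling every possible pair of passengers by hand would be a lot of work, so it may help to narrow the list first. The sketch below is one possible way to do that, not the method from the paper: it pairs up passengers who share a surname and leaves a `same_family` column empty for a human to fill in. The helper names, the output column names, and the third demo row are all hypothetical. Sharing a surname is only a filter for the labeling list; unrelated passengers can share a surname, and relatives can have different ones.

```python
import itertools

import pandas as pd


def surname(name):
    """Return the text before the first comma, e.g. 'BRAUND' from 'Braund, Mr. Owen Harris'."""
    return name.split(",")[0].strip().upper()


def candidate_pairs(frame):
    """Pair up passengers that share a surname; the label is left empty for hand labeling."""
    keyed = frame.assign(Surname=frame["Name"].apply(surname))
    rows = []
    for _, group in keyed.groupby("Surname"):
        # Every unordered pair inside one surname group is a candidate.
        for left, right in itertools.combinations(group["PassengerId"], 2):
            rows.append({"left_id": left, "right_id": right, "same_family": None})
    return pd.DataFrame(rows, columns=["left_id", "right_id", "same_family"])


# Two rows from the sample above plus one made-up row (PassengerId 9001) for illustration.
demo_names = pd.DataFrame({
    "PassengerId": [1, 5, 9001],
    "Name": ["Braund, Mr. Owen Harris", "Allen, Mr. William Henry", "Allen, Miss. Example"],
})
pairs = candidate_pairs(demo_names)
print(pairs)
```

The resulting frame could be written out with `pairs.to_csv(...)`, labeled in a spreadsheet, and read back in as the labeled set.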

## Step 2: Cleaning Data

In [29]:
import string
from nltk.corpus import stopwords



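def cleanNames(name):
    """ Clean names by removing punctuation and stop words, also uppercase everything"""

    # NLTK stop words are lowercase, so words are lowercased before the lookup.
    stop = set(stopwords.words('english'))
    # Translation table that deletes every punctuation character (Python 3 form).
    strip_punctuation = str.maketrans("", "", string.punctuation)
    clean_words = []

    # split() with no argument splits on whitespace; str.split does not treat "\s" as a regex.
    for word in name.split():
        word = word.translate(strip_punctuation)
        if word and word.lower() not in stop:
            clean_words.append(word)

    return " ".join(clean_words).upper()
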
            

In [30]:
train["Name"] = train["Name"].apply(cleanNames)

train[0:5]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,BRAUND MR OWEN HARRIS,male,22,1,0,A/5 21171,7.25,,S
1,2,1,1,CUMINGS MRS JOHN BRADLEY FLORENCE BRIGGS THAYER,female,38,1,0,PC 17599,71.2833,C85,C
2,3,1,3,HEIKKINEN MISS LAINA,female,26,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,FUTRELLE MRS JACQUES HEATH LILY MAY PEEL,female,35,1,0,113803,53.1,C123,S
4,5,0,3,ALLEN MR WILLIAM HENRY,male,35,0,0,373450,8.05,,S


## Step 3: Metrics or Algorithms for Comparing String Values
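
As a starting point for this step, here is a minimal sketch of one common string comparison metric: Levenshtein edit distance, normalized to a similarity between 0 and 1. This is an assumption about what could go here rather than the metric used in the paper, and the function names `levenshtein` and `similarity` are just illustrative. It works on the cleaned, uppercased names from Step 2.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute each cost 1)."""
    if len(a) < len(b):
        a, b = b, a
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def similarity(a, b):
    """Scale edit distance to [0, 1], where 1.0 means the strings are identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


print(similarity("CUMINGS", "CUMMINGS"))  # one insertion over 8 characters: 0.875
```

A score like this could become one input feature for the classifier in Step 4, alongside other comparisons.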

## Step 4: Train SVM

## Step 5: Tune SVM with genetic algorithm

## Step 6: Figure out bad values with Q-learning algorithm

## Step 7: Construct Bayesian network of generic feature resolvers

## Step 8: Use genetic algorithm to find optimal Bayes Network