# PySpark Propensity Matching

### Welcome 
Welcome to the PySpark Propensity Matching tutorial notebook.  
The goal of this library is to enable the user to perform [propensity matching](https://en.wikipedia.org/wiki/Propensity_score_matching) on Spark-sized datasets. Propensity matching is typically performed to assess the impact of a treatment when the treatment is not randomly assigned. It constructs a control population that is comparable to the treated population on the other covariates, and then assesses the impact of the treatment on the response.
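
As a rough illustration of the idea (not this library's API), the sketch below assumes each unit already has a propensity score, for example from a logistic regression of treatment on the covariates. It pairs each treated unit with the closest unused control by score, then averages the response differences. The function name `match_and_estimate` and the toy numbers are hypothetical.

```python
def match_and_estimate(treated, controls):
    """Greedy 1:1 nearest-neighbour matching on propensity score.

    treated, controls: lists of (propensity_score, response) tuples.
    Returns the mean response difference over the matched pairs.
    """
    available = list(controls)  # controls not yet matched
    diffs = []
    for score, response in treated:
        if not available:
            break  # ran out of controls to match against
        # index of the remaining control whose score is closest
        idx = min(range(len(available)),
                  key=lambda i: abs(available[i][0] - score))
        _, control_response = available.pop(idx)
        diffs.append(response - control_response)
    if not diffs:
        raise ValueError("no matched pairs")
    return sum(diffs) / len(diffs)


# toy data: (propensity score, observed response)
treated = [(0.80, 1), (0.60, 1), (0.30, 0)]
controls = [(0.82, 1), (0.58, 0), (0.31, 0), (0.10, 0), (0.05, 1)]
effect = match_and_estimate(treated, controls)
```

Controls with scores far from any treated unit (here 0.10 and 0.05) are simply left unmatched, which is how matching discards units that are not comparable to the treated population.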


### Tutorial
The tutorial consists of the following parts:

1. Data Generation
2. Fitting the estimator
3. Transforming the dataset
4. Assessing the impact of our treatment
5. Evaluating our performance


### Help
For help with your specific use case, or with any other questions, please feel free to post to the [GitHub repo](https://github.com/Microsoft/pyspark_propensity_matching).


# Tutorial

In [46]:
import logging
import random
from math import exp

import numpy as np
import pandas as pd
from sklearn.datasets import make_classification

import pyspark
from pyspark.sql import SparkSession
import pyspark.ml.feature as mlf

In [6]:
import sklearn

In [8]:
sklearn.__version__

'0.18.1'

In [5]:
spark = SparkSession.builder.master('local[2]').getOrCreate()
logging.getLogger('py4j').setLevel(logging.CRITICAL)

In [10]:
random.seed(a=42)

## Data Generation

We'll make a dataset using sklearn's *make_classification* function. This lets us give the features a different distribution in each treatment class, which simulates the effect of self-selection.  
Then we will model our response variable as a logistic function of a linear combination of the features and the treatment class. We will compare our observed impact with the true impact dictated here.

In [12]:
# the two classes play the role of treatment (1) and control (0)
args = {
    "n_samples": 10000,
    "n_features": 20,
    "n_informative": 10,
    #"n_redundant": 
    #"n_repeated": 
    "n_classes": 2,
    "weights": [.9, .1],  # roughly 90% control, 10% treated
    "flip_y": 0,          # no label noise
    #"class_sep":  ,
    #"hypercube":  ,
    #"shift":  ,
    #"scale":  ,
    #"shuffle":  ,
    "random_state": random.randint(0,100)
}

data, labels = make_classification(**args)

In [37]:
# create pandas dataframe: feature columns followed by the treatment label
np_df = np.concatenate([data, labels.reshape(-1, 1)], axis=1)

cols = ["f{0}".format(x+1) for x in range(args["n_features"])]
cols.append('label')

pd_df = pd.DataFrame(data=np_df, columns=cols)

In [55]:
# create col w/ prob of being in positive response class
# one coefficient per feature, plus a final one for the treatment label
coeffs = [random.random()/10 for _ in range(args['n_features']+1)]
def get_prob(x):
    return 1/(1 + exp(-1*np.dot(x,coeffs)))
pd_df['response_prob'] = pd_df[cols].apply(get_prob, axis=1)

# assign the positive class with probability response_prob
def get_class(x):
    return int(random.random() < x)
pd_df['response'] = pd_df['response_prob'].apply(get_class)
pd_df = pd_df.drop('response_prob', axis=1)
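
Because `cols` ends with `'label'`, the last entry of `coeffs` is the one that multiplies the treatment indicator. One way to pin down a "true impact" on `response_prob` is to average, over all rows, the difference between the probability with the treatment switched on and with it switched off. The sketch below is a stdlib-only illustration of that calculation on made-up numbers; `true_average_effect` is a hypothetical helper, not part of the library.

```python
from math import exp


def sigmoid(z):
    """Logistic function, as used by get_prob in the notebook."""
    return 1.0 / (1.0 + exp(-z))


def true_average_effect(rows, coeffs):
    """Average change in response probability caused by treatment.

    rows: feature vectors WITHOUT the treatment column.
    coeffs: one coefficient per feature, then one for the treatment.
    """
    feature_coeffs, treatment_coeff = coeffs[:-1], coeffs[-1]
    total = 0.0
    for row in rows:
        # linear score from the features alone (treatment = 0)
        base = sum(x * c for x, c in zip(row, feature_coeffs))
        # probability with treatment = 1 minus probability with treatment = 0
        total += sigmoid(base + treatment_coeff) - sigmoid(base)
    return total / len(rows)


# made-up example: two features, treatment coefficient 0.08
rows = [[0.5, -1.0], [1.5, 2.0], [-0.3, 0.4]]
coeffs = [0.05, 0.02, 0.08]
effect = true_average_effect(rows, coeffs)
```

Since the slope of the logistic function never exceeds 0.25, a treatment coefficient of 0.08 can shift the probability by at most 0.02, so small coefficients like those drawn above imply a small true effect.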

In [56]:
# create spark dataframe
df = spark.createDataFrame(pd_df)

## Fitting the estimator

## Transforming the dataset

## Assessing the impact of our treatment

## Evaluating our performance