## Problem: Maximize Profit from these members. 
### Where to go from here? - questions? observations? directions?
#### • emails.tsv: a tab-separated file of all emails sent to a subset of our members from the JobsRadar brand in September
Column1: email_id
Column2: email_send_time (EST)
Column3: email_type
	T plus 1 is a campaign sent 1 day after a member joins
	T plus N are campaigns sent N days after a member joins
	Transactional Forgot Password Email
	Transactional JR Welcome Email
Column4: email_variant
	First part:
		fixed_keyword_cloud = keyword cloud email (see samples)
		job_alert = custom job listings (see samples)
	Second part:
		tplusN = sent to members who joined N days ago
		age22+ = sent to members aged 22 or older (everyone except those under 22)
		age35+ = sent to members aged 35 or older
		1opened = sent to members who have opened at least one earlier email
Column5: member_id

job_alert_s1_v1 is a specific “version” of a “job alert” type of email we send.
tplus2 means this message was sent to members that joined our site 2 days ago
tplus201 means this message was sent to members that joined our site 201 days ago
age22+ means it was sent to people that were at least 22 years old.
1opened means it was sent to people that had opened at least 1 message from us in the past (“Openers”)
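
The naming scheme above can be unpacked programmatically. The following is a minimal sketch, not the pipeline's actual parser: the `:` separator matches how the notebook later splits the column, but the underscore-joined layout of the second part and the example string are assumptions made for illustration, and `parse_variant` is a hypothetical helper.

```python
import re

def parse_variant(variant):
    """Split an email_variant string into its template and targeting flags."""
    # Everything before the first ':' is treated as the template name.
    # The layout of the second part is an assumption for illustration.
    first, _, second = variant.partition(':')
    tplus = re.search(r'tplus(\d+)', second)
    age = re.search(r'age(\d+)\+', second)
    return {
        'template': first,
        'days_since_join': int(tplus.group(1)) if tplus else None,
        'min_age': int(age.group(1)) if age else None,
        'openers_only': '1opened' in second,
    }

# Hypothetical variant string built from the pieces described above.
example = parse_variant('job_alert_s1_v1:tplus2_age22+_1opened')
```

Turning the second part into separate columns like these would let the targeting rules (tenure, age band, prior opener) be used as model features directly rather than as one opaque string.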

#### • email_responses.tsv: a tab-separated file of open, click, and unsubscribe events that resulted from those emails
#### • members.tsv: a tab-separated file of information about the members to whom those emails were sent
#### • *.png: sample images of each of the email variants sent in this set (as identified by either the "variant" or "campaign" column in emails.tsv)
#### • *.eml: Outlook export files of same

#### A) We make \$0.12 per click event.
#### B) It costs us \$0.40 for every 1,000 emails sent.
   

In [1]:
# Import some useful python modules
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import numpy as np
import pandas as pd
import os
import datetime
import sqlite3
from sqlalchemy import create_engine, text
import nltk

def sqldf(query, sql_engine):
    query_text = text(query)
    results = pd.read_sql_query(query_text, con=sql_engine)
    return results

def sqlraw():
    conn = sqlite3.connect('jobcase.db')
    return conn

In [48]:
nltk.download('punkt')  # tokenizer models required by nltk.word_tokenize

In [2]:
# change working directory into folder with data, view what is available
os.chdir(os.path.join(os.getcwd(), 'data'))

In [3]:
conn = sqlite3.connect('jobcase.db')
engine = create_engine('sqlite:///C:\\Users\\Dan\\1) Python Notebooks\\Jobcase\\data\\jobcase.db')
engine.raw_connection().connection.text_factory = str
conn.close()
connection = engine.connect()

In [3]:
# Parse data into Pandas DataFrames
# emails
emails = pd.read_table('emails.tsv', header=None)
emails.columns = ['email_id','timestamp','email_type','email_variant','member_id']
emails.email_variant = emails.email_variant.apply(lambda x: str(x).replace('\\N', 'NONE')) 
emails['email_variant_first_part'] = emails['email_variant'].apply(lambda x: x.split(':')[0] if not pd.isnull(x) else np.nan)
emails['email_variant_second_part'] = emails['email_variant'].apply(lambda x: x.split(':')[1] if not pd.isnull(x) and len(x.split(':')) > 1 else np.nan)
emails.member_id = emails.member_id.astype(str)

# email responses
email_responses = pd.read_table('email_responses.tsv', header=0)

In [4]:
email_responses['action'].value_counts()

open     1099755
click     436201
unsub      22653
dtype: int64

In [75]:
# Calculating the profit based on the amounts identified ($0.40 per 1,000 emails sent and $0.12 per click)
total_cost = len(emails.email_id) * 0.4 / 1000
gross_revenue = email_responses['action'].value_counts().click * 0.12
profit = gross_revenue - total_cost
print gross_revenue
print total_cost
print profit 

52344.12
3843.1236
48500.9964


Based on this dataset, we sent 9,607,809 emails. At \$0.40 per 1,000 emails, sending them cost \$3,843.12. The email responses contain 436,201 'click' events, and each click earns Jobcase \$0.12, for \$52,344.12 in revenue. Profit on this dataset is therefore about \$48,501.
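
The arithmetic in the paragraph above can be wrapped in a small reusable function so the same calculation can be applied to any subset of emails. This is only a sketch: `campaign_profit` is a hypothetical helper, and it assumes every click event is paid at the same rate.

```python
def campaign_profit(n_sent, n_clicks, revenue_per_click=0.12, cost_per_thousand=0.40):
    """Return (revenue, cost, profit) in dollars for a batch of emails."""
    cost = n_sent * cost_per_thousand / 1000.0
    revenue = n_clicks * revenue_per_click
    return revenue, cost, revenue - cost

# Totals taken from the dataset counts quoted above.
revenue, cost, profit = campaign_profit(9607809, 436201)
```

Calling it per email type or per variant would show which campaigns contribute most of the profit.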

In [4]:
# There are some issues with the members file (varying apparent row lengths, likely due to a separator issue),
# so we read it line by line and try to identify where the issues are.
# Once the data is in a usable form, we parse it into a DataFrame manually.
members_file = open('members.tsv','r')
members_list = []
members_list_raw = []
for line in members_file.readlines():
    members_list_raw.append(line)
    members_list.append(line.replace('\n','').split('\t'))

#print members_list[0]
#members_list_raw[205009]

# From investigating the rows above, we can see that the issue stems from additional tabs in the value fields.
# We therefore need to remove the invalid additional tabs so the data can be parsed correctly.
clean_members_list = [
    [unicode(col.decode("ascii","ignore")) 
        for col in member.replace('\n','').replace('\\\t','').replace('\N','').replace("'","").replace('+','').split('\t')
    ] for member in members_list_raw
]

members = pd.DataFrame(data=clean_members_list[1:], 
                       columns=clean_members_list[0])
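
One quick way to confirm a separator problem of this kind is to count how many fields each line splits into. The sketch below uses a tiny made-up sample in place of `members.tsv`; `field_count_histogram` is a hypothetical helper, and the column names in the sample are illustrative only.

```python
import io
from collections import Counter

def field_count_histogram(lines, sep='\t'):
    """Count how many lines have each number of separated fields."""
    return Counter(len(line.rstrip('\n').split(sep)) for line in lines)

# Made-up sample: the last row contains a stray tab inside a value.
sample = io.StringIO(
    u"member_id\tcity\tkeyword\n"
    u"1\tSPARTA\tkroger jobs\n"
    u"2\tChula\tvista\tCostco job\n"
)
histogram = field_count_histogram(sample)
```

Any field count other than the header's points at rows that need cleaning before they can be loaded into a DataFrame.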

In [6]:
emails_and_clicks = pd.merge(emails, email_responses[email_responses.action=='click'], how='left', on='email_id', suffixes=['_1', '_2'])
emails_and_resp = pd.merge(emails_and_clicks
                           , pd.DataFrame(data=email_responses.email_id.unique(),columns=['email_id'])
                           , how='inner', on='email_id', suffixes=['_1', '_2'])
emails_and_resp['action'] = emails_and_resp['action'].fillna('na')

In [63]:
emails_and_resp.email_id.count()

963131

In [7]:
emails_and_members = pd.merge(emails_and_resp, members, how='inner', on='member_id', suffixes=['_1', '_2'])
emails_and_members.keyword = emails_and_members.keyword.fillna('').apply(lambda x: x.lower().replace('jobs','').replace('job','').replace('www','').replace('.com',''))
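
The keyword clean-up above can be read as the following standalone function. It is a sketch rather than a drop-in replacement: `normalize_keyword` is a hypothetical name, and the final whitespace collapse is an addition that the notebook cell does not perform.

```python
def normalize_keyword(raw):
    """Lowercase a search keyword and strip generic tokens."""
    cleaned = raw.lower()
    # 'jobs' is removed before 'job' so that no stray 's' is left behind.
    for token in ('jobs', 'job', 'www', '.com'):
        cleaned = cleaned.replace(token, '')
    # Collapsing whitespace is an extra step not present in the notebook cell.
    return ' '.join(cleaned.split())

examples = [normalize_keyword(k) for k in ('kroger jobs', 'Costco job')]
```

Because these are plain substring replacements, they also alter longer words that happen to contain `job` or `www`; a token-based filter would be stricter if that matters.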

In [75]:
#emails.to_sql('emails', engine)
#email_responses.to_sql('email_responses', engine)
#members.to_sql('members', engine, if_exists='replace')
#emails_and_members.to_sql('emails_and_members', engine)

In [None]:
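# Responses whose email_id has no matching row in emails (LEFT JOIN keeps the unmatched responses)
matching_email = sqldf("""
SELECT count(*)
FROM email_responses er
LEFT JOIN emails e ON e.email_id = er.email_id
WHERE e.email_id IS NULL
""", engine)

# Emails whose member_id has no matching row in members
matching_member = sqldf("""
SELECT count(*)
FROM emails e 
LEFT JOIN members m ON m.member_id = e.member_id
WHERE m.member_id IS NULL
""", engine)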

In [76]:
pd.read_sql_query("select count(*) from members;", con=engine)

Unnamed: 0,count(*)
0,1607520
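
A check for rows that have no match in another table is commonly written as an anti-join: a LEFT JOIN followed by an IS NULL filter on the right-hand key. An inner JOIN drops the unmatched rows before the filter runs, so it always counts zero. A minimal sketch on an in-memory SQLite database, with made-up rows standing in for the real tables:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript("""
CREATE TABLE emails (email_id INTEGER, member_id INTEGER);
CREATE TABLE email_responses (email_id INTEGER, action TEXT);
INSERT INTO emails VALUES (1, 10), (2, 11);
INSERT INTO email_responses VALUES (1, 'open'), (1, 'click'), (3, 'open');
""")

# Responses that point at an email_id missing from the emails table.
orphan_responses = conn.execute("""
SELECT count(*)
FROM email_responses er
LEFT JOIN emails e ON e.email_id = er.email_id
WHERE e.email_id IS NULL
""").fetchone()[0]
conn.close()
```

Here only the response for `email_id` 3 has no matching email, so it is the single row the anti-join counts.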


In [73]:
con = sqlite3.connect('jobcase.db')
cursor = con.cursor()
cursor.execute("SELECT name FROM sqlite_master WHERE type='table';")
print(cursor.fetchall())

[(u'emails',), (u'email_responses',), (u'members',)]


In [None]:
cursor.execute("select * from members")

In [79]:
print(cursor.fetchall())

[(0,)]


In [None]:
#state zip degree_level hs_or_ged_year keyword

# Normalizing the Features

In [80]:
emails_and_members.describe()

Unnamed: 0,email_id
count,1465371.0
mean,212516300.0
std,3922908.0
min,205570100.0
25%,209022500.0
50%,212711900.0
75%,215916300.0
max,219114800.0


In [8]:
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk import word_tokenize

stemmer = PorterStemmer()

def stem_tokens(tokens, stemmer):
    stemmed = []
    for item in tokens:
        stemmed.append(stemmer.stem(item))
    return stemmed

def tokenize(text):
    tokens = nltk.word_tokenize(text)
    stems = stem_tokens(tokens, stemmer)
    return stems

vectorizer = TfidfVectorizer(max_df=0.5, max_features=10, min_df=2, stop_words='english',tokenizer=tokenize)
vX = vectorizer.fit_transform(emails_and_members.keyword)

In [9]:
emails_and_members['variant_vals'] = emails_and_members.email_variant.fillna('none').apply(lambda x: x.lower().replace('_',' ').replace(':',' ').replace('+',' '))

In [10]:
vectorizer_variant_vals = TfidfVectorizer(max_df=0.5, max_features=10, min_df=2, stop_words='english', tokenizer=tokenize)
vX_variants = vectorizer_variant_vals.fit_transform(emails_and_members.variant_vals)

In [11]:
# creating dummy values for the categorical variables
features = pd.concat(objs=[pd.get_dummies(emails_and_members.email_type), 
                           pd.get_dummies(emails_and_members.email_variant_first_part),
                           pd.DataFrame(data=vX.toarray(), columns = vectorizer.get_feature_names())
                          ], axis=1)
features.head()
target = pd.get_dummies(emails_and_members.action).click



# Now building the model


In [12]:
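# NOTE: astype(int) truncates the fractional TF-IDF weights to 0 before the cast back to float
X = features.values.astype(int).astype(float)
y = target
# Standardize each feature to zero mean and unit variance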
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X = scaler.fit_transform(X)

print "Feature space holds %d observations and %d features" % X.shape
print "Unique target labels:", np.unique(y)

Feature space holds 932706 observations and 21 features
Unique target labels: [ 0.  1.]


In [13]:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier, RidgeClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.naive_bayes import GaussianNB

from sklearn.cross_validation import train_test_split
from sklearn.metrics import classification_report, confusion_matrix


X_train, X_test, y_train, y_test = train_test_split(X, y)

In [None]:
#clf = GradientBoostingClassifier()
#clf = RidgeClassifier()
clf = SVC()
#clf = GaussianNB()
clf.fit(X_train, y_train.ravel())
predictions = clf.predict(X_test)
print 'Support Vector Machines Classifier'
print classification_report(y_test, predictions)
print confusion_matrix(y_test, predictions)

In [89]:
total_pred = clf.predict(X)
p = pd.DataFrame(data=total_pred, columns=['predicted'])
p['actual'] = y
p.head()

p[p.predicted == 1].actual.value_counts()

Series([], dtype: int64)

In [83]:
p.actual.value_counts()

0    896954
1     35752
dtype: int64
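
With classes this unbalanced, plain accuracy is a weak yardstick. The sketch below computes the accuracy of a trivial model that always predicts the majority class, using the counts shown above; `majority_baseline_accuracy` is a hypothetical helper.

```python
def majority_baseline_accuracy(class_counts):
    """Accuracy obtained by always predicting the most frequent class."""
    total = sum(class_counts.values())
    return max(class_counts.values()) / float(total)

# Counts of non-click (0) and click (1) rows from the output above.
baseline = majority_baseline_accuracy({0: 896954, 1: 35752})
```

The baseline comes out at roughly 96%, so a classifier has to beat that figure, or be judged on recall for the click class, before it says anything useful about who will click.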

In [85]:
p.tail()

Unnamed: 0,predicted,actual
932701,1,0
932702,1,0
932703,1,0
932704,1,0
932705,1,0


In [39]:
emails_and_members.head()

Unnamed: 0,email_id,timestamp_1,email_type,email_variant,member_id,email_variant_first_part,email_variant_second_part,timestamp_2,action,date,email_domain,first_name,city,state,zip,degree_level,hs_or_ged_year,pcp_score,keyword
0,205570076,2012-09-01 00:10:08,Transactional JR Welcome Email,account_login_info_s2_v1,14802260,account_login_info_s2_v1,,2012-09-01 00:23:49,click,2012-09-01 01:02:03,yahoo.com,michael,SPARTA,TN,38583,Some HS,2013,0.10344,kroger jobs
1,205570123,2012-09-01 00:11:07,Transactional Forgot Password Email,,8450299,,,2012-09-01 00:12:36,open,2011-07-13 21:57:47,hotmail.com,Maria,Chula vista,CA,91911,Associate,1979,0.320591,Costco job
2,205570123,2012-09-01 00:11:07,Transactional Forgot Password Email,,8450299,,,2012-09-01 00:12:57,click,2011-07-13 21:57:47,hotmail.com,Maria,Chula vista,CA,91911,Associate,1979,0.320591,Costco job
3,205570320,2012-09-01 00:30:08,Transactional JR Welcome Email,account_login_info_s2_v1,14802278,account_login_info_s2_v1,,2012-09-01 00:30:52,open,2012-09-01 01:24:50,yahoo.com,Frank,FORT LAUDERDALE,FL,33301,Some College,1981,0.174869,FedEx Job
4,205570320,2012-09-01 00:30:08,Transactional JR Welcome Email,account_login_info_s2_v1,14802278,account_login_info_s2_v1,,2012-09-01 00:31:09,click,2012-09-01 01:24:50,yahoo.com,Frank,FORT LAUDERDALE,FL,33301,Some College,1981,0.174869,FedEx Job


In [37]:
emails_and_members.action.value_counts()

open     1016152
click     428866
unsub      20353
dtype: int64

In [38]:
428866 * .12

51463.92

In [64]:
(1016152 + 428866 + 20353) * 0.4/1000

1465371

In [None]:
from sklearn.cross_validation import KFold

def run_cv(X,y,clf_class,**kwargs):
    # Construct a kfolds object
    kf = KFold(len(y),n_folds=5,shuffle=True)
    y_pred = y.copy()
    # Iterate through folds
    for train_index, test_index in kf:
        X_train, X_test = X[train_index], X[test_index]
        y_train = y[train_index]
        # Initialize a classifier with key word arguments
        clf = clf_class(**kwargs)
        clf.fit(X_train,y_train)
        y_pred[test_index] = clf.predict(X_test)
    return y_pred

def accuracy(y_true,y_pred):
    # NumPy interprets True and False as 1. and 0.
    return np.mean(y_true == y_pred)

print "Gradient boosting:"
print "%.3f" % accuracy(y, run_cv(X,y,GradientBoostingClassifier))


In [14]:
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
print 'Decision Tree Classifier'
print classification_report(y_test, predictions)

Decision Tree Classifier
             precision    recall  f1-score   support

        0.0       0.71      1.00      0.83    258449
        1.0       1.00      0.00      0.00    107894

avg / total       0.79      0.71      0.58    366343



In [15]:
clf = RandomForestClassifier(n_estimators=10)
clf.fit(X_train, y_train.ravel())
predictions = clf.predict(X_test)
print 'RandomForest Classifier'
print classification_report(y_test, predictions)

RandomForest Classifier
             precision    recall  f1-score   support

        0.0       0.71      1.00      0.83    258449
        1.0       1.00      0.00      0.00    107894

avg / total       0.79      0.71      0.58    366343



In [21]:
clf = GradientBoostingClassifier()
clf.fit(X_train, y_train.ravel())
predictions = clf.predict(X_test)
print 'Gradient Boosting Classifier'
print classification_report(y_test, predictions)

Gradient Boosting Classifier
             precision    recall  f1-score   support

        0.0       0.71      1.00      0.83    259232
        1.0       1.00      0.00      0.00    107111

avg / total       0.79      0.71      0.59    366343



In [22]:
from sklearn.metrics import classification_report, confusion_matrix
confusion_matrix(y_test, predictions)

array([[259232,      0],
       [107108,      3]])
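
The confusion matrix explains the odd-looking report of 1.00 precision and 0.00 recall for the click class. A small sketch that derives both numbers from the matrix, assuming scikit-learn's layout of true classes in rows and predicted classes in columns; `precision_recall` is a hypothetical helper.

```python
def precision_recall(confusion, positive=1):
    """confusion[i][j]: number of samples with true class i predicted as class j."""
    n = len(confusion)
    tp = confusion[positive][positive]
    fp = sum(confusion[i][positive] for i in range(n) if i != positive)
    fn = sum(confusion[positive][j] for j in range(n) if j != positive)
    precision = tp / float(tp + fp) if (tp + fp) else 0.0
    recall = tp / float(tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Matrix copied from the output above.
precision, recall = precision_recall([[259232, 0], [107108, 3]])
```

Only 3 rows were predicted as clicks and all 3 were correct, which gives perfect precision, while the other 107,108 clicks were missed, which rounds recall down to 0.00.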

In [16]:
len(y_test)

366343

In [17]:
259196 + 107144

366340

In [15]:
cost = 259196 * 0.4/1000
revenue = 259196 * 0.12
profit = revenue - cost
print cost, revenue, profit

103.6784 31103.52 30999.8416


In [16]:
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
clf.fit(X_train, y_train.ravel())
predictions = clf.predict(X_test)
print 'Gaussian Naive Bayes Classifier'
print classification_report(y_test, predictions)

Gaussian Naive Bayes Classifier
             precision    recall  f1-score   support

        0.0       0.91      0.00      0.00    258449
        1.0       0.29      1.00      0.46    107894

avg / total       0.73      0.29      0.13    366343



In [18]:
clf = ExtraTreesClassifier()
clf.fit(X_train, y_train.ravel())
predictions = clf.predict(X_test)
print 'Extra Trees Classifier'
print classification_report(y_test, predictions)


Extra Trees Classifier
             precision    recall  f1-score   support

        0.0       0.71      1.00      0.83    258449
        1.0       1.00      0.00      0.00    107894

avg / total       0.79      0.71      0.58    366343



In [17]:
clf = LogisticRegression()
clf.fit(X_train, y_train.ravel())
predictions = clf.predict(X_test)
print 'Logistic Regression Classifier'
print classification_report(y_test, predictions)

Logistic Regression Classifier
             precision    recall  f1-score   support

        0.0       0.71      1.00      0.83    258449
        1.0       1.00      0.00      0.00    107894

avg / total       0.79      0.71      0.58    366343



In [None]:
# instantiate the model (using the value K=2)
knn = KNeighborsClassifier(n_neighbors=2)
knn.fit(X_train, y_train.ravel())
predictions = knn.predict(X_test)
print 'K Nearest Neighbors Classifier'
print classification_report(y_test, predictions)

In [None]:
clf = SVC()
clf.fit(X_train, y_train.ravel())
predictions = clf.predict(X_test)
print 'Support Vector Machines Classifier'
print classification_report(y_test, predictions)

In [14]:
clf = LinearSVC()
clf.fit(X_train, y_train.ravel())
predictions = clf.predict(X_test)
print 'Support Vector Machines Classifier'
print classification_report(y_test, predictions)

Support Vector Machines Classifier
             precision    recall  f1-score   support

        0.0       0.71      1.00      0.83    258949
        1.0       0.67      0.00      0.00    107394

avg / total       0.70      0.71      0.59    366343



In [91]:
clf = SGDClassifier()
clf.fit(X_train, y_train.ravel())
predictions = clf.predict(X_test)
print 'SGD Classifier'
print classification_report(y_test, predictions)

SGD Classifier
             precision    recall  f1-score   support

        0.0       0.96      1.00      0.98    224300
        1.0       0.00      0.00      0.00      8877

avg / total       0.93      0.96      0.94    233177



In [10]:
from sklearn.cross_validation import KFold

def run_cv(X,y,clf_class,**kwargs):
    # Construct a kfolds object
    kf = KFold(len(y),n_folds=5,shuffle=True)
    y_pred = y.copy()
    
    # Iterate through folds
    for train_index, test_index in kf:
        X_train, X_test = X[train_index], X[test_index]
        y_train = y[train_index]
        # Initialize a classifier with key word arguments
        clf = clf_class(**kwargs)
        clf.fit(X_train,y_train)
        y_pred[test_index] = clf.predict(X_test)
    return y_pred
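
The idea behind `run_cv` is that every observation is predicted by a model that never saw it during training. The dependency-free sketch below illustrates that mechanism only: `kfold_indices` and `out_of_fold_predictions` are hypothetical helpers, and a majority-label rule stands in for a real classifier.

```python
import random

def kfold_indices(n_samples, n_folds=5, seed=0):
    """Yield (train_idx, test_idx); each sample lands in exactly one test fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    for k in range(n_folds):
        test_idx = indices[k::n_folds]
        test_set = set(test_idx)
        train_idx = [i for i in indices if i not in test_set]
        yield train_idx, test_idx

def out_of_fold_predictions(labels, n_folds=5):
    """Predict each sample with the majority label of the other folds."""
    preds = [None] * len(labels)
    for train_idx, test_idx in kfold_indices(len(labels), n_folds):
        train_labels = [labels[i] for i in train_idx]
        majority = max(set(train_labels), key=train_labels.count)
        for i in test_idx:
            preds[i] = majority
    return preds

# Toy labels: nine non-clicks and one click.
preds = out_of_fold_predictions([0] * 9 + [1])
```

Because each prediction comes from a model fitted on the other folds, comparing `preds` with the true labels gives an estimate that is not inflated by training on the test rows.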

In [None]:
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier as RF
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.linear_model import LogisticRegression as LR

def accuracy(y_true,y_pred):
    # NumPy interprets True and False as 1. and 0.
    return np.mean(y_true == y_pred)

#print "Support vector machines:"
#print "%.3f" % accuracy(y, run_cv(X,y,SVC))
print "Random forest:"
print "%.3f" % accuracy(y, run_cv(X,y,RF))
print "K-nearest-neighbors:"
print "%.3f" % accuracy(y, run_cv(X,y,KNN))
print "Logistic Regression:"
print "%.3f" % accuracy(y, run_cv(X,y,LR))

In [13]:
def run_prob_cv(X, y, clf_class, **kwargs):
    kf = KFold(len(y), n_folds=5, shuffle=True)
    y_prob = np.zeros((len(y),2))
    for train_index, test_index in kf:
        X_train, X_test = X[train_index], X[test_index]
        y_train = y[train_index]
        clf = clf_class(**kwargs)
        clf.fit(X_train,y_train)
        # Predict probabilities, not classes
        y_prob[test_index] = clf.predict_proba(X_test)
    return y_prob
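
Predicted click probabilities can be tied back to the profit question by emailing only members whose expected revenue exceeds the sending cost. The sketch below is one possible decision rule, not the notebook's method: it assumes at most one paid click per email and ignores any cost of unsubscribes, and `expected_profit` and `select_recipients` are hypothetical helpers.

```python
REVENUE_PER_CLICK = 0.12       # dollars per click event
COST_PER_EMAIL = 0.40 / 1000   # dollars per email sent

def expected_profit(p_click):
    """Expected profit of one email given its predicted click probability."""
    return p_click * REVENUE_PER_CLICK - COST_PER_EMAIL

def select_recipients(click_probabilities):
    """Return indices with positive expected profit and their total expected profit."""
    chosen = [i for i, p in enumerate(click_probabilities) if expected_profit(p) > 0]
    total = sum(expected_profit(click_probabilities[i]) for i in chosen)
    return chosen, total

# Made-up probabilities for four members.
chosen, total = select_recipients([0.0, 0.002, 0.01, 0.3])
```

Under these assumptions the break-even click probability is 0.0004 / 0.12, about 0.33%, so a hard 0/1 prediction is less useful here than a well-ranked probability.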

In [None]:
# Use 10 estimators so predictions are all multiples of 0.1
pred_prob = run_prob_cv(X, y, RF, n_estimators=10)
pred_click = pred_prob[:,1]
is_click = y == 1

# Number of times a predicted probability is assigned to an observation
counts = pd.value_counts(pred_click)

# Calculate the observed click rate for each predicted probability
true_prob = {}
for prob in counts.index:
    true_prob[prob] = np.mean(is_click[pred_click == prob])
# Convert to a Series once, after the loop has filled the dict
true_prob = pd.Series(true_prob)

# Combine the counts and observed rates into one table
counts = pd.concat([counts,true_prob], axis=1).reset_index()
counts.columns = ['pred_prob', 'count', 'true_prob']
counts
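
The table built in this cell is a calibration check: for each predicted probability, how often did a click actually occur? The same grouping can be sketched without pandas on made-up data; `calibration_table` is a hypothetical helper.

```python
from collections import defaultdict

def calibration_table(pred_probs, clicked):
    """Map each predicted probability (rounded to 0.1) to (count, observed click rate)."""
    buckets = defaultdict(list)
    for p, y in zip(pred_probs, clicked):
        buckets[round(p, 1)].append(y)
    return {p: (len(ys), sum(ys) / float(len(ys))) for p, ys in sorted(buckets.items())}

# Made-up predictions and outcomes.
table = calibration_table([0.0, 0.0, 0.1, 0.1, 0.1, 0.9], [0, 0, 0, 1, 0, 1])
```

A well-calibrated model would show observed rates close to the predicted probabilities in every bucket with a reasonable count.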

In [None]:
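# Now running the predictions based on the model we created
# NOTE: the original cell left the model unspecified; LogisticRegression is only a placeholder
model = LogisticRegression()
model.fit(X, y)
predictions = pd.Series(model.predict(X))
predictions.value_counts()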
