# W241 Final Project - Craigslist Ads: Machine Learning Approach

An attempt to find a connection between treatment and outcomes using machine learning.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

## Read and process data

In [210]:
df = pd.read_csv("~/MIDS/W241/final project/data.csv")

In [211]:
# Make dummy variables out of our categorical variables
for col in ['author', 'city', 'day']:
    df = pd.concat([df, pd.get_dummies(df[col], prefix=col)], axis=1)
    df = df.drop(col, axis=1)

# Coerce columns to numeric; unparseable values (e.g. missing avgoffer) become NaN, then 0
df = df.apply(pd.to_numeric, errors='coerce')
df = df.fillna(0)
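
As a quick illustration of the dummy-variable step, here is a minimal sketch on a made-up three-row frame. The `toy` data and the `encoded` name are invented for this example and are not part of the project data.

```python
import pandas as pd

# Hypothetical toy data: "city" is a categorical code, "rtotal" an outcome.
toy = pd.DataFrame({"city": [1, 2, 1], "rtotal": [4, 0, 7]})

# One indicator column per distinct city value, named city_<value>.
dummies = pd.get_dummies(toy["city"], prefix="city")

# Replace the original categorical column with its indicator columns.
encoded = pd.concat([toy.drop("city", axis=1), dummies], axis=1)
print(encoded.columns.tolist())
```

Each distinct value of the categorical column becomes its own indicator column, which is why the feature lists later in the notebook refer to names like `city_1` and `day_2`.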

## Random Split/Score function

In [212]:
# score function based on a random split into train/test data
def score(X, Y, clf, test_size=0.33, random_state=42):
    train_X, test_X, train_Y, test_Y = train_test_split(X, Y, test_size=test_size, random_state=random_state)
    clf.fit(train_X, train_Y)
    return clf.score(test_X, test_Y)

## Analysis

Framed as a classification problem, the target has to be categorical, so a continuous outcome such as `rtotal` can't serve as the class label. That rules out the obvious setup:

```python
X = np.array(df['treatment'])
Y = np.array(df['rtotal'])
```

But we can do the reverse, i.e. try to predict the treatment status from the outcome:

In [213]:
X = np.array(df['rtotal']).reshape(len(df), 1)
Y = np.array(df['treatment'])

# Now we'll try to train a model and see how the predictions go. Let's start with logistic regression.
print(score(X, Y, LogisticRegression()))

0.454545454545


Not great. In fact, it's worse than random guessing. One problem is that the matched-pairs design isn't being taken into account: the model sees offer counts in absolute terms, not relative to the other ad in the same pair. To address this, create a new column containing the difference between each ad's number of offers and the mean for its pair.

In [214]:
df = df.join(df.groupby('pairid')['rtotal'].mean(), on='pairid', rsuffix='_pairmean')
df['rtotal_diff'] = df.rtotal - df.rtotal_pairmean
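
To make the effect of pair-centering concrete, here is a small sketch on invented data. The `toy` frame and its values are hypothetical, and it uses `groupby(...).transform("mean")` as an alternative to the `join` above; both should give the same per-row pair mean.

```python
import pandas as pd

# Hypothetical toy data: two matched pairs, one treated ad in each.
toy = pd.DataFrame({
    "pairid":    [1, 1, 2, 2],
    "treatment": [1, 0, 1, 0],
    "rtotal":    [10, 6, 3, 1],
})

# transform("mean") returns the pair mean aligned to every row of the pair.
toy["rtotal_pairmean"] = toy.groupby("pairid")["rtotal"].transform("mean")
toy["rtotal_diff"] = toy["rtotal"] - toy["rtotal_pairmean"]

print(toy)
```

In this toy frame the control ad in pair 1 has a higher raw count (6) than the treated ad in pair 2 (3), so raw counts alone would mislead a classifier. After centering, both treated ads have a positive `rtotal_diff` and both control ads a negative one.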

Now let's try using that as our feature.

In [215]:
X = np.array(df['rtotal_diff']).reshape(len(df), 1)
print(score(X, Y, LogisticRegression()))

0.606060606061


Better. Let's try combining that with all our factor covariates to see if it helps.

In [228]:
cols = [
    'rtotal_diff', 
    'city_1', 'city_2', 'city_3', 'city_4', 'city_5',
    'author_Daniel', 'author_Jonathan', 'author_Kyle', 'author_Raja', 'author_Umber',
    'day_1', 'day_2', 'day_3', 'day_4',
]
X = np.array([df[col] for col in cols]).T

print(score(X, Y, LogisticRegression()))

0.484848484848


Nope. Let's try the same approach with the other outcome measurements: using the difference between their values and the mean for the pair as a feature.

In [217]:
df = df.join(df.groupby('pairid')['avgoffer'].mean(), on='pairid', rsuffix='_pairmean')
df['avgoffer_diff'] = df.avgoffer - df.avgoffer_pairmean

In [218]:
X = np.array(df['avgoffer_diff']).reshape(len(df), 1)
print(score(X, Y, LogisticRegression()))

0.515151515152


In [219]:
X = np.array([df[col] for col in ['rtotal_diff', 'avgoffer_diff']]).T
print(score(X, Y, LogisticRegression()))

0.636363636364


In [222]:
df = df.join(df.groupby('pairid')['roffer'].mean(), on='pairid', rsuffix='_pairmean')
df['roffer_diff'] = df.roffer - df.roffer_pairmean

In [229]:
X = np.array([df[col] for col in ['rtotal_diff', 'avgoffer_diff', 'roffer_diff']]).T
print(score(X, Y, LogisticRegression()))

0.666666666667


Okay, so we've correctly predicted the treatment status for 2/3 of the held-out test ads based on the outcomes. Could be worse.
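
One way to put an accuracy figure like this in context is to compare it with a trivial baseline on the same kind of split. The sketch below is not part of the original analysis: it uses synthetic stand-in data (`X_toy`, `Y_toy`), assumes treatment labels are balanced as a matched-pairs design would suggest, and uses scikit-learn's `DummyClassifier`, which ignores the features and always predicts the most frequent training class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption): 100 ads, 3 uninformative features,
# and balanced treatment labels.
rng = np.random.RandomState(0)
X_toy = rng.normal(size=(100, 3))
Y_toy = np.tile([0, 1], 50)

# Same split parameters as the notebook's score() function.
train_X, test_X, train_Y, test_Y = train_test_split(
    X_toy, Y_toy, test_size=0.33, random_state=42)

# Baseline that always predicts the majority class seen in training.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(train_X, train_Y)
baseline_acc = baseline.score(test_X, test_Y)
print(baseline_acc)
```

On the real data, passing `DummyClassifier(strategy="most_frequent")` to `score` would give the number that a classifier has to beat on that particular split.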

Let's try it with a couple other classifier types just for fun...

In [230]:
print(score(X, Y, DecisionTreeClassifier()))

0.575757575758


In [231]:
print(score(X, Y, RandomForestClassifier()))

0.454545454545


## Take-aways

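This approach weakly suggests a relationship between the outcomes and treatment: the best model (logistic regression on the three pair-centered outcomes) reached about 67% accuracy on the test split, while the decision tree and random forest scored lower. We could try tuning the regression or trying different algorithms, but this is a very small dataset for machine learning, so the risk of overfitting is high and the validity of any result is questionable.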
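
With so few rows, a single random split gives a noisy accuracy estimate. A possible follow-up, sketched here on synthetic data rather than the project data, is k-fold cross-validation that keeps both ads of a pair in the same fold. The `pair_ids`, `X_toy`, and `Y_toy` arrays are invented for illustration, and the choice of `GroupKFold` with 5 folds is an assumption, not something the analysis above used.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score

# Synthetic stand-in data (an assumption): 50 matched pairs, one treated and
# one control ad per pair, with 3 uninformative features.
rng = np.random.RandomState(0)
n_pairs = 50
pair_ids = np.repeat(np.arange(n_pairs), 2)
Y_toy = np.tile([1, 0], n_pairs)
X_toy = rng.normal(size=(2 * n_pairs, 3))

# GroupKFold never splits a group (here, a pair) across train and test.
cv = GroupKFold(n_splits=5)
scores = cross_val_score(LogisticRegression(), X_toy, Y_toy,
                         groups=pair_ids, cv=cv)

# Mean accuracy across folds, with the spread as a rough stability check.
print(scores.mean(), scores.std())
```

The spread across folds gives a rough sense of how much a single-split score could move by chance.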