# Initial Analysis of Two Sigma Competition

## ML Society
### Courtesy of Jason C.

### Import libraries and read data

In [None]:
import pandas as pd
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in newer scikit-learn releases
from sklearn import metrics

In [None]:
train = pd.read_json('../input/train.json')

### What is the majority class?

In [None]:
train.interest_level.value_counts(normalize=True)

So the baseline to beat is about 0.695 accuracy, the score from always predicting the most common class.
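
As a minimal sketch of how that baseline falls out of the class shares, the snippet below uses a small made-up label series in place of `train.interest_level`; the toy counts are an assumption for illustration and are not the competition data.

```python
import pandas as pd

# Hypothetical labels standing in for train.interest_level
interest_level = pd.Series(["low"] * 7 + ["medium"] * 2 + ["high"] * 1)

# Share of each class; the largest share is the accuracy of always guessing that class
shares = interest_level.value_counts(normalize=True)
majority_class = shares.idxmax()
baseline_accuracy = shares.max()

print(majority_class, baseline_accuracy)
```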

### How predictive is the property description alone?
Note: Going forward, it will make sense to first apply some text cleaning.

In [None]:
pipe = Pipeline([('tfidf', TfidfVectorizer()), ('clf', SGDClassifier())])

In [None]:
pipe.fit(train.description, train.interest_level)

In [None]:
# Accuracy on the same rows the pipeline was fit on, so this is an optimistic estimate
pipe.score(train.description, train.interest_level)
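
Scoring on the rows used for fitting tends to overstate how well the model generalizes. One possible way to get a held-out estimate is sketched below with `train_test_split`; the toy descriptions, the labels, and the 25% test size are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical stand-ins for train.description and train.interest_level
descriptions = ["spacious renovated apartment with doorman",
                "small dark studio no elevator"] * 6
labels = ["high", "low"] * 6

# Hold out a quarter of the rows, keeping the class proportions the same in both splits
X_train, X_test, y_train, y_test = train_test_split(
    descriptions, labels, test_size=0.25, random_state=0, stratify=labels)

pipe = Pipeline([("tfidf", TfidfVectorizer()),
                 ("clf", SGDClassifier(random_state=0))])
pipe.fit(X_train, y_train)

train_acc = pipe.score(X_train, y_train)  # accuracy on rows seen during fitting
test_acc = pipe.score(X_test, y_test)     # accuracy on held-out rows
print(train_acc, test_acc)
```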

### How predictive are the numeric features alone?

In [None]:
train_numer_df = train.select_dtypes(include=['float64', 'int64'])
train_target = train.interest_level

In [None]:
model = SGDClassifier()
model.fit(train_numer_df, train_target)

In [None]:
model.score(train_numer_df, train_target)
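
SGD-based linear models are sensitive to feature scale, so raw numeric columns with very different ranges can hurt the fit. The sketch below standardizes the inputs first; the toy columns (`price`, `bedrooms`) and the labeling rule are hypothetical and only serve to make the example runnable.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric features on very different scales
rng = np.random.RandomState(0)
toy = pd.DataFrame({
    "price": rng.randint(1000, 8000, size=200).astype(float),
    "bedrooms": rng.randint(0, 5, size=200),
})
# Made-up target rule, used only so the example has labels to fit
target = np.where(toy["price"] < 3500, "high", "low")

# Standardize each column to zero mean and unit variance before the SGD step
scaled_model = make_pipeline(StandardScaler(), SGDClassifier(random_state=0))
scaled_model.fit(toy, target)
scaled_acc = scaled_model.score(toy, target)
print(scaled_acc)
```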

### Build list of unique features included in property listings
Note: Other competition participants have interesting takes on building this feature list, such as deduplicating features or excluding ones that occur fewer than 5 times in the dataset: https://www.kaggle.com/jxnlco/two-sigma-connect-rental-listing-inquiries/deduplicating-features

In [None]:
features = []
for i in train.features:
    for j in i:
        if j not in features:
            features.append(j)
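
The rare-feature filter mentioned in the note could look roughly like the sketch below, which counts each tag with `collections.Counter` and keeps those at or above a threshold. The toy lists and the `min_count` value are assumptions; this is not the linked kernel's implementation.

```python
from collections import Counter

# Hypothetical stand-in for the per-listing lists in train.features
feature_lists = [
    ["Doorman", "Elevator"],
    ["Elevator"],
    ["Doorman", "Elevator", "Dogs Allowed"],
    [],
]

# Count how many times each feature tag occurs across all listings
counts = Counter(tag for tags in feature_lists for tag in tags)

# Keep tags that occur at least min_count times (2 here only because the toy data is tiny)
min_count = 2
features = [tag for tag, n in counts.items() if n >= min_count]
print(features)
```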

### Initialize feature ndarray, iterate over properties dataframe, updating feature ndarray appropriately.

In [None]:
# np.ndarray() returns uninitialized memory, so start from an explicit all-zeros matrix
feat_array = np.zeros((len(train), len(features)))

In [None]:
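# Map each feature name to its column once; list.index() would rescan the list on every lookup
feat_index = {name: col for col, name in enumerate(features)}
for i, words in enumerate(train.features):
    for word in words:
        feat_array[i, feat_index[word]] = 1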

In [None]:
target_array = np.array(train_target)
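
The same kind of 0/1 feature matrix can also be built without explicit loops. The sketch below uses scikit-learn's `MultiLabelBinarizer` on hypothetical toy lists; note that it orders columns alphabetically by tag, not by first appearance as the loop above does.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical stand-in for the per-listing lists in train.features
feature_lists = [
    ["Doorman", "Elevator"],
    ["Elevator"],
    [],
    ["Dogs Allowed", "Doorman"],
]

# One row per listing, one column per distinct tag, 1 where the listing has the tag
mlb = MultiLabelBinarizer()
feat_array = mlb.fit_transform(feature_lists)

print(list(mlb.classes_))
print(feat_array)
```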

### Train a linear SVM (SGDClassifier with its default hinge loss) on the feature array and check training accuracy.

In [None]:
svm_model = SGDClassifier()
svm_model.fit(feat_array, train_target)

In [None]:
svm_model.score(feat_array, train_target)

In [None]:
predictions = svm_model.predict(feat_array)
predictions = pd.Series(predictions)
predictions.value_counts()

In [None]:
train_target.value_counts()
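
Comparing the two sets of counts shows how the predicted class mix differs from the true one, but not which classes get confused with each other. A confusion matrix makes that visible; the sketch below uses short made-up label lists in place of `train_target` and `predictions`.

```python
from sklearn import metrics

# Hypothetical true and predicted labels
y_true = ["low", "low", "low", "medium", "high", "low"]
y_pred = ["low", "low", "medium", "low", "low", "low"]

# Rows are true classes, columns are predicted classes, both in this label order
labels = ["low", "medium", "high"]
cm = metrics.confusion_matrix(y_true, y_pred, labels=labels)
print(cm)
```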

### Thoughts going forward:
The class imbalance of the training data is obvious. It may be worthwhile to undersample the 'low' and 'medium' interest groups and see if better performance is achieved. In the end, however, I suspect training a deep learning model on the provided image data will be necessary to be competitive.
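
One simple way to try the undersampling idea is to sample every class down to the size of the smallest one. The sketch below does this with pandas on a toy frame; the column values and the choice to match the smallest class exactly are assumptions, and other ratios may work better in practice.

```python
import pandas as pd

# Hypothetical imbalanced training frame
toy = pd.DataFrame({
    "interest_level": ["low"] * 7 + ["medium"] * 3 + ["high"] * 2,
    "price": range(12),
})

# Size of the smallest class
n_min = toy["interest_level"].value_counts().min()

# Draw n_min rows at random from every class so all classes end up the same size
balanced = toy.groupby("interest_level").sample(n=n_min, random_state=0)
print(balanced["interest_level"].value_counts())
```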