In [None]:
import pandas as pd
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

In [None]:
df = pd.read_csv('ign.csv')

In [None]:
df = df[['title','score_phrase','platform','score','genre','editors_choice','release_year','release_month','release_day']]

Converting the editor's choice flag from 'Y'/'N' to 1/0.

In [None]:
df['editors_choice'] = (df['editors_choice'] == 'Y').astype(int)

Generating a list of platforms that have fewer than 10 data points in our dataset.

In [None]:
sparse_platform = [k for k,v in dict(df.platform.value_counts()).items() if v < 10]
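
As a rough illustration of the filter above, here is a minimal sketch on made-up data. The frame `toy` and its platform names are hypothetical and are not taken from ign.csv; the sketch only shows that masking `value_counts()` gives the same kind of list as the comprehension.

```python
import pandas as pd

# Hypothetical toy data (not from ign.csv): one frequent and one rare platform.
toy = pd.DataFrame({"platform": ["PC"] * 12 + ["Ouya"] * 3})

# Count rows per platform, then keep the labels whose count is below the threshold.
counts = toy["platform"].value_counts()
toy_sparse = counts[counts < 10].index.tolist()

print(toy_sparse)
```

Only the platform with fewer than 10 rows ends up in the list.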

Randomly splitting df into training and testing data with a 70:30 ratio.

In [None]:
train, test = train_test_split(df, test_size = 0.3)

Removing all the sparse-platform entries from the test set and appending them to the training set. This avoids the error that would have occurred had the test set contained all the data points for a sparse platform, leaving the encoders fitted on the training set with no knowledge of it. Moreover, these sparse platforms have so few data points that the predicted target wouldn't have been accurate anyway. The rows with release_year 1970 are moved in the same way.

In [None]:
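# Test rows that belong to a sparse platform
append_in_train = test[test.platform.isin(sparse_platform)]
# Test rows with release_year 1970, moved to train as well
append_in_train2 = test[test.release_year==1970]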

In [None]:
test = test[~test.platform.isin(sparse_platform)] 
test = test[test.release_year!=1970] 

In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
train = pd.concat([train, append_in_train, append_in_train2], ignore_index=True)

In [None]:
train = train.reset_index(drop=True)
test = test.reset_index(drop=True)
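
The move from test to train can be sketched end to end on made-up data. Everything here (`toy`, `toy_sparse`, the platform names, `random_state=0`) is a hypothetical stand-in rather than the notebook's actual data; the point is only that, after the move, every platform left in the test set also occurs in the training set.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 rows for a common platform, 3 for a rare one.
toy = pd.DataFrame({
    "platform": ["PC"] * 20 + ["Ouya"] * 3,
    "score": range(23),
})
toy_sparse = ["Ouya"]

# random_state is fixed only to make this sketch repeatable.
toy_train, toy_test = train_test_split(toy, test_size=0.3, random_state=0)

# Pull sparse-platform rows out of the test set and add them to the training set.
moved = toy_test[toy_test["platform"].isin(toy_sparse)]
toy_test = toy_test[~toy_test["platform"].isin(toy_sparse)].reset_index(drop=True)
toy_train = pd.concat([toy_train, moved], ignore_index=True)

print(sorted(toy_train["platform"].unique()), sorted(toy_test["platform"].unique()))
```

One side effect worth keeping in mind: moving rows this way shifts the split slightly away from the nominal 70:30.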

In [None]:
trainlabels = train.editors_choice
testlabels = test.editors_choice

I'm using both score and score_phrase (even though they encode the same thing!) because together they carry a huge amount of information about the target variable (see Visualization notebook for details).

In [None]:
feature_cols = ['score_phrase','platform','score','release_year','release_month','release_day']
# .copy() so the encoded columns can be assigned later without a SettingWithCopyWarning
train = train[feature_cols].copy()
test = test[feature_cols].copy()

Initializing label encoders to encode the categorical columns ('score_phrase', 'platform' and 'release_year') as categorical integers.

In [None]:
sple = LabelEncoder()
ple = LabelEncoder()
yle = LabelEncoder()

In [None]:
# LabelEncoder expects a 1-D input, so pass a Series rather than a one-column DataFrame
train['score_phrase'] = sple.fit_transform(train['score_phrase'])
train['platform'] = ple.fit_transform(train['platform'])
train['release_year'] = yle.fit_transform(train['release_year'])

In [None]:
# Reuse the encoders fitted on train; transform only, never refit on test
test['score_phrase'] = sple.transform(test['score_phrase'])
test['platform'] = ple.transform(test['platform'])
test['release_year'] = yle.transform(test['release_year'])
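
To see why the test set must not contain values the encoders never saw during fitting, here is a small sketch with a hypothetical encoder `enc` and made-up platform names. `LabelEncoder.transform` raises a `ValueError` for a label that was absent at fit time.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels; the encoder sorts classes, so PC=0, Wii=1, Xbox 360=2.
enc = LabelEncoder()
enc.fit(["PC", "Wii", "Xbox 360"])

codes = enc.transform(["Wii", "PC"])
print(codes)  # [1 0]

# A label that was not present at fit time cannot be transformed.
try:
    enc.transform(["Ouya"])
    unseen_ok = True
except ValueError as err:
    unseen_ok = False
    print("unseen label rejected:", err)
```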

Large-margin classifiers seem like a good choice, since attributes like score phrase and score contribute so much information (see Visualizations notebook). The data appears to be well separated, hence a large-margin classifier.
Also, I've set C a little lower than its default of 1 (in LinearSVC a smaller C means stronger regularization) to keep the model from overfitting, given the large separation along the score phrase and score attributes and because score phrase and score are essentially the same.

In [None]:
model = LinearSVC(C = 0.7)
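
A quick way to get a feel for what C does is to fit the same classifier at different C values and compare the size of the learned weights. The sketch below uses synthetic data from `make_classification`, not the IGN features, and the C values other than 0.7 are arbitrary picks for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Synthetic stand-in data; random_state is fixed only for repeatability.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Smaller C penalizes the weights more heavily relative to the hinge loss.
norms = {}
for C in (0.01, 0.7, 1.0):
    clf = LinearSVC(C=C, max_iter=10000, random_state=0).fit(X, y)
    norms[C] = float(np.linalg.norm(clf.coef_))

for C, norm in norms.items():
    print(f"C={C}: ||w|| = {norm:.3f}")
```

With a gap as small as 0.7 versus 1.0 the difference in weights may be modest, so the effect on held-out accuracy is best checked empirically.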

In [None]:
model.fit(train,trainlabels)

A leaderboard of scores would help me improve on this basic solution, but 92% on a binary prediction task with a 1:5 class skew seems fair, if not especially impressive.

In [None]:
model.score(test,testlabels)
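
For context on an accuracy figure under class skew, it helps to compare it against the accuracy of always predicting the majority class. The labels below are hypothetical and assume that 1:5 means one positive for every five negatives; they are not the actual test labels.

```python
import numpy as np

# Hypothetical labels with a 1:5 positive-to-negative ratio (assumption).
toy_labels = np.array([1] * 100 + [0] * 500)

# Accuracy obtained by always predicting the most frequent class.
majority_class = int(np.bincount(toy_labels).argmax())
baseline_acc = float(np.mean(toy_labels == majority_class))

print(majority_class, round(baseline_acc, 3))
```

Under that assumption the trivial baseline already sits at about 83%, which is the number a score on this kind of split should be judged against.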

In [None]:
# Coefficients of the learned model.
model.coef_