# MLP classifier for the Sloan data set
We apply the [MLP classifier](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html#sklearn.neural_network.MLPClassifier) packaged with sklearn to the Sloan dataset almost _out of the box_ and find it to be very accurate! We give a basic introduction to using a multilayer perceptron for classification tasks.

First of all, we import the packages we need: NumPy and pandas for data manipulation, plus the relevant modules from sklearn.

In [1]:
import numpy as np
import pandas as pd

from sklearn.metrics import confusion_matrix

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

import os
print(os.listdir("../input"))

['Skyserver_SQL2_27_2018 6_51_39 PM.csv']


We load the dataset from the CSV file and take a peek at what is inside.

In [2]:
dataset = pd.read_csv('../input/Skyserver_SQL2_27_2018 6_51_39 PM.csv')

In [3]:
dataset.head()

Unnamed: 0,objid,ra,dec,u,g,r,i,z,run,rerun,camcol,field,specobjid,class,redshift,plate,mjd,fiberid
0,1.23765e+18,183.531326,0.089693,19.47406,17.0424,15.94699,15.50342,15.22531,752,301,4,267,3.72236e+18,STAR,-9e-06,3306,54922,491
1,1.23765e+18,183.598371,0.135285,18.6628,17.21449,16.67637,16.48922,16.3915,752,301,4,267,3.63814e+17,STAR,-5.5e-05,323,51615,541
2,1.23765e+18,183.680207,0.126185,19.38298,18.19169,17.47428,17.08732,16.80125,752,301,4,268,3.23274e+17,GALAXY,0.123111,287,52023,513
3,1.23765e+18,183.870529,0.049911,17.76536,16.60272,16.16116,15.98233,15.90438,752,301,4,269,3.72237e+18,STAR,-0.000111,3306,54922,510
4,1.23765e+18,183.883288,0.102557,17.55025,16.26342,16.43869,16.55492,16.61326,752,301,4,269,3.72237e+18,STAR,0.00059,3306,54922,512


Most of the columns are just catalogue labels (identifiers) rather than measurements. We can therefore restrict our exploration to the columns with physical properties: the _redshift_ and the response of the telescope in the electromagnetic bands `u`, `g`, `r`, `i` and `z`. We also need the class, of course :)

In [4]:
columns = ['redshift', 'u', 'g', 'r', 'i', 'z', 'class']

The class column contains strings, so we use a label encoder to convert them to numerical values.

In [5]:
dataset = dataset.loc[:, columns]

le = LabelEncoder().fit(dataset['class'])
dataset['class'] = le.transform(dataset['class'])

In [6]:
dataset.head()

Unnamed: 0,redshift,u,g,r,i,z,class
0,-9e-06,19.47406,17.0424,15.94699,15.50342,15.22531,2
1,-5.5e-05,18.6628,17.21449,16.67637,16.48922,16.3915,2
2,0.123111,19.38298,18.19169,17.47428,17.08732,16.80125,0
3,-0.000111,17.76536,16.60272,16.16116,15.98233,15.90438,2
4,0.00059,17.55025,16.26342,16.43869,16.55492,16.61326,2
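
To see which integer stands for which class, here is a minimal sketch on a hand-written list (the names `demo_classes`, `le_demo` and `mapping` are illustrative assumptions, not part of the original notebook). `LabelEncoder` sorts the distinct labels, which matches the output above, where `STAR` became 2 and `GALAXY` became 0.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for the 'class' column of the dataset.
demo_classes = ["STAR", "STAR", "GALAXY", "QSO"]

le_demo = LabelEncoder().fit(demo_classes)

# classes_ holds the sorted distinct labels; transform gives their integer codes.
codes = le_demo.transform(le_demo.classes_)
mapping = {str(label): int(code) for label, code in zip(le_demo.classes_, codes)}
print(mapping)  # {'GALAXY': 0, 'QSO': 1, 'STAR': 2}
```

The same mapping can be read back at any time with `le.inverse_transform`, which is what we rely on later to label the confusion matrix.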


Now we split the dataset into a training set and a test set. We also perform a simple grid search for the best activation function of our network, which by default has a single hidden layer with 100 neurons.

In [7]:
X_train, X_test, y_train, y_test = train_test_split(dataset.drop(labels = 'class', axis = 'columns'), dataset['class'], test_size = 0.3)
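
To make the split concrete, here is a minimal sketch on a synthetic stand-in frame. The `demo` frame, its random values and `random_state=0` are assumptions for illustration and are not part of the original notebook. With `test_size = 0.3`, 30% of the rows go to the test set. Fixing `random_state` is optional and only makes the split reproducible.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same column names as the reduced dataset.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(1000, 6)),
                    columns=["redshift", "u", "g", "r", "i", "z"])
demo["class"] = rng.integers(0, 3, size=1000)

# Same call as in the notebook, plus a fixed seed (an assumption).
X_tr, X_te, y_tr, y_te = train_test_split(
    demo.drop(labels="class", axis="columns"),
    demo["class"],
    test_size=0.3,
    random_state=0,
)
print(X_tr.shape, X_te.shape)
```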

In [8]:
dici_param = {"activation": ["tanh", "logistic", "relu"]}
clf = GridSearchCV(estimator = MLPClassifier(max_iter=400), param_grid = dici_param, cv = 5, n_jobs = -1)

In [9]:
clf.fit(X_train, y_train)

GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(100,), learning_rate='constant',
       learning_rate_init=0.001, max_iter=400, momentum=0.9,
       n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
       random_state=None, shuffle=True, solver='adam', tol=0.0001,
       validation_fraction=0.1, verbose=False, warm_start=False),
       fit_params=None, iid='warn', n_jobs=-1,
       param_grid={'activation': ['tanh', 'logistic', 'relu']},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)
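
Once fitted, a `GridSearchCV` object records which parameter value won and how each candidate scored in cross-validation. The following is a minimal sketch on synthetic data from `make_classification`. The data, the smaller network (`hidden_layer_sizes=(20,)`), `cv=3` and the fixed seeds are assumptions chosen to keep the example fast, so the scores it prints say nothing about the Sloan data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Synthetic three-class problem standing in for the real features.
X_demo, y_demo = make_classification(n_samples=300, n_features=6,
                                     n_informative=4, n_classes=3,
                                     random_state=0)

param_grid = {"activation": ["tanh", "logistic", "relu"]}
search = GridSearchCV(
    estimator=MLPClassifier(hidden_layer_sizes=(20,), max_iter=300,
                            random_state=0),
    param_grid=param_grid,
    cv=3,
)
search.fit(X_demo, y_demo)

# One row per candidate activation, best mean cross-validation score first.
summary = pd.DataFrame(search.cv_results_)[
    ["param_activation", "mean_test_score", "std_test_score"]
].sort_values("mean_test_score", ascending=False)

print(search.best_params_)
print(summary)
```

In the notebook itself, the same attributes are available as `clf.best_params_` and `clf.cv_results_` after `clf.fit(X_train, y_train)`.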

After training, let's see our score!

In [10]:
clf.score(X_test, y_test)

0.983

The neural network correctly classifies about 98% of the test sample! Let's make a confusion matrix to see where the errors are distributed.

In [11]:
y_pred = clf.predict(X_test)

In [12]:
class_labels = le.inverse_transform([0,1,2])
confusion_df = pd.DataFrame(confusion_matrix(y_test, y_pred),
                            columns = class_labels,
                            index = class_labels)

In [13]:
confusion_df

Unnamed: 0,GALAXY,QSO,STAR
GALAXY,1471,7,19
QSO,23,245,0
STAR,2,0,1233
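
The matrix above can be turned into per-class numbers. Here is a minimal sketch that re-enters the counts by hand (the names `cm`, `labels` and `report` are illustrative assumptions). In sklearn's `confusion_matrix`, rows are the true classes and columns are the predicted ones, so recall is the diagonal divided by the row sums and precision is the diagonal divided by the column sums.

```python
import numpy as np
import pandas as pd

labels = ["GALAXY", "QSO", "STAR"]

# Counts copied from the confusion matrix above (rows: true, columns: predicted).
cm = np.array([[1471,   7,   19],
               [  23, 245,    0],
               [   2,   0, 1233]])

accuracy = np.trace(cm) / cm.sum()          # correct predictions / all predictions
recall = np.diag(cm) / cm.sum(axis=1)       # per true class
precision = np.diag(cm) / cm.sum(axis=0)    # per predicted class

report = pd.DataFrame({"recall": recall, "precision": precision}, index=labels)
print(f"accuracy = {accuracy:.3f}")  # accuracy = 0.983
print(report.round(3))
```

The accuracy reproduces the score reported earlier. The quasars have the lowest recall of the three classes, because 23 of the 268 true `QSO` objects were predicted as `GALAXY`.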
