# Drill: Playing with layers

Now it's your turn. Using the space below, experiment with different hidden layer structures. You can try this on a subset of the data to improve runtime. See how things vary. See what seems to matter the most. Feel free to manipulate other parameters as well.

#### Goal: Let's see if we can build a model that classifies which department a piece should go into, using an MLP

## Approach
1. Clean up the data.
2. Use SelectKBest to reduce the number of features.
3. Split the data.
4. Grid search over several parameters and select the best ones.
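
The steps above can be sketched end to end on a small synthetic dataset. This is a minimal illustration, not the notebook's actual run: `make_classification` stands in for the cleaned artworks data, the `_demo` names and grid values are made up for the example, and, as a variation on the approach listed, feature selection is placed inside a `Pipeline` so it is refit on each cross-validation fold instead of on the full dataset.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the cleaned feature matrix (assumption, not the real data).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=5, n_classes=3, random_state=0
)

# Split first, so the test set plays no part in feature selection or tuning.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

# Feature selection and the MLP are chained, so both are fit per CV fold.
pipe = Pipeline([
    ("select", SelectKBest(f_classif)),
    ("mlp", MLPClassifier(max_iter=300, random_state=0)),
])

# Illustrative grid; the values are placeholders, kept tiny for speed.
param_grid = {
    "select__k": [5, 10],
    "mlp__hidden_layer_sizes": [(20,), (10, 4)],
}

search = GridSearchCV(pipe, param_grid, scoring="accuracy", cv=3)
search.fit(X_tr, y_tr)

test_accuracy = search.score(X_te, y_te)
print(search.best_params_)
print(test_accuracy)
```

The small networks may emit convergence warnings at `max_iter=300`; that is expected for a quick sketch like this one.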


In [42]:
############ Imports #################
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# Model Infrastructure
import time
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
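# GridSearchCV lives in model_selection; the old sklearn.grid_search module was removed in scikit-learn 0.20.
from sklearn.model_selection import GridSearchCV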
from sklearn.metrics import classification_report

#For selecting features
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2, f_classif


# Import the model.
from sklearn.neural_network import MLPClassifier



In [2]:
######## Bring In Data ################
start_time = time.time()
artworks = pd.read_csv('https://media.githubusercontent.com/media/MuseumofModernArt/collection/master/Artworks.csv')

print("-- Execution time: %s seconds ---" % (time.time() - start_time))

-- Execution time: 2.727679491043091 seconds ---


In [20]:
######### Clean the Data ##############
start_time = time.time()
# Select Columns.
artworks = artworks[['Artist', 'Nationality', 'Gender', 'Date', 'Department',
                    'DateAcquired', 'URL', 'ThumbnailURL', 'Height (cm)', 'Width (cm)']]

# Convert URLs to booleans.
artworks['URL'] = artworks['URL'].notnull()
artworks['ThumbnailURL'] = artworks['ThumbnailURL'].notnull()

# Drop films and some other tricky rows.
artworks = artworks[artworks['Department']!='Film']
artworks = artworks[artworks['Department']!='Media and Performance Art']
artworks = artworks[artworks['Department']!='Fluxus Collection']

# Drop missing data.
artworks = artworks.dropna()

# Convert timestamps and dates
artworks['DateAcquired'] = pd.to_datetime(artworks.DateAcquired)
artworks['YearAcquired'] = artworks.DateAcquired.dt.year

# Remove multiple nationalities, genders, and artists.
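# Raw strings keep the regex escapes (and the stored labels) intact without invalid-escape warnings.
artworks.loc[artworks['Gender'].str.contains(r'\) \('), 'Gender'] = r'\(multiple_persons\)'
artworks.loc[artworks['Nationality'].str.contains(r'\) \('), 'Nationality'] = r'\(multiple_nationalities\)'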
artworks.loc[artworks['Artist'].str.contains(','), 'Artist'] = 'Multiple_Artists'

# Convert dates to start date, cutting down number of distinct examples.
artworks['Date'] = pd.Series(artworks.Date.str.extract(
    '([0-9]{4})', expand=False))[:-1]

# Final column drops (axis passed by keyword; the positional form is deprecated in pandas).
X = artworks.drop(['Department', 'DateAcquired', 'Artist', 'Nationality', 'Date'], axis=1)

# Create dummies separately.
artists = pd.get_dummies(artworks.Artist)
nationalities = pd.get_dummies(artworks.Nationality)
dates = pd.get_dummies(artworks.Date)

# Concatenate with the other variables. The artist dummies slow this down considerably, so we leave them out for now.
# Removed sparse
X = pd.get_dummies(X)
X = pd.concat([X, nationalities, dates], axis=1)

Y = artworks.Department


print("-- Execution time: %s seconds ---" % (time.time() - start_time))

-- Execution time: 1.0871126651763916 seconds ---


In [22]:
X.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 103881 entries, 0 to 133538
Columns: 311 entries, URL to 2017
dtypes: bool(2), float64(2), int64(1), uint8(306)
memory usage: 33.7 MB


In [23]:
# Do some feature reduction
start_time = time.time()
selector = SelectKBest(f_classif, k=50)
selector.fit(X,Y)

idxs_selected = selector.get_support(indices=True)
best_features = X[X.columns[idxs_selected]]
print("--- %s seconds ---" % (time.time() - start_time))

--- 2.476583480834961 seconds ---
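
To see which columns `SelectKBest` actually keeps, the boolean mask from `get_support()` can be mapped back to column names, and `scores_` gives the per-feature F statistic. The sketch below uses a tiny made-up DataFrame (the column names and data are hypothetical, not the artworks features) with one column that tracks the label and two that are pure noise.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)

# Three classes, 40 rows each (toy labels, not the Department column).
y_demo = np.repeat([0, 1, 2], 40)

# One column that follows the label closely, two that are unrelated noise.
X_demo = pd.DataFrame({
    "informative": y_demo + rng.normal(scale=0.1, size=y_demo.size),
    "noise_a": rng.normal(size=y_demo.size),
    "noise_b": rng.normal(size=y_demo.size),
})

selector_demo = SelectKBest(f_classif, k=1).fit(X_demo, y_demo)

# Boolean mask -> names of the kept columns.
selected = X_demo.columns[selector_demo.get_support()].tolist()

# F statistic per feature, highest first.
scores = pd.Series(selector_demo.scores_, index=X_demo.columns)
scores = scores.sort_values(ascending=False)

print(selected)
print(scores)
```

Sorting the scores this way is a quick check on whether `k=50` is cutting in a sensible place or discarding features that score nearly as well as the ones kept.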




In [25]:
# Split into train and test sets (70/30).
X_train, X_test, y_train, y_test = train_test_split(best_features,Y,test_size=0.3)

In [33]:
# The classes are very imbalanced, so we need an L2 penalty (alpha).
y_train.value_counts()

Prints & Illustrated Books    38197
Photography                   16569
Architecture & Design          8059
Drawings                       7405
Painting & Sculpture           2486
Name: Department, dtype: int64
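
These counts also fix a useful reference point: a classifier that always predicts the largest class. The sketch below hard-codes the training counts printed above and computes each department's share; the variable names are illustrative only.

```python
# Training-set class counts copied from the value_counts() output above.
train_counts = {
    "Prints & Illustrated Books": 38197,
    "Photography": 16569,
    "Architecture & Design": 8059,
    "Drawings": 7405,
    "Painting & Sculpture": 2486,
}

total = sum(train_counts.values())

# The majority class and the accuracy of always predicting it.
majority_class, majority_count = max(train_counts.items(), key=lambda kv: kv[1])
baseline = majority_count / total

for dept, n in train_counts.items():
    print(f"{dept:<28} {n / total:.3f}")
print(f"Majority-class baseline: {baseline:.3f} ({majority_class})")
```

Any model accuracy on this data is best read against that baseline of roughly 0.525, not against 0.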

In [38]:
########### Build the Model ################
start_time = time.time()
# Grid search over two hidden-layer structures: a single 1000-unit layer, and a 100-unit layer followed by a 4-unit layer.

parameters = {'hidden_layer_sizes':[(1000,),(100,4)],
             'activation':['logistic'],
             'solver':['adam'],
             'alpha':[0.0001]}

mlp = MLPClassifier()

grid = GridSearchCV(mlp, parameters, scoring='accuracy', cv=5, verbose=0)

grid.fit(X_train, y_train)
print("-- Execution time: %s seconds ---" % (time.time() - start_time))

-- Execution time: 486.2940151691437 seconds ---


In [39]:
# Tried activations logistic and relu: logistic was best.
# Tried solvers adam and sgd: adam was best.
# Tried hidden layer sizes 1000 and 100: 1000 was best.
# Tried alpha values 0.0001, 0.01, and 0.1 with hidden_layer_sizes=(1000,): 0.0001 was best.
# Tried hidden layer structures (1000,) and (100, 4): (1000,) was best.
grid.best_params_

{'activation': 'logistic',
 'alpha': 0.0001,
 'hidden_layer_sizes': (1000,),
 'solver': 'adam'}
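
Instead of rerunning the search and noting results in comments, every combination's score can be read from `cv_results_`. This is a minimal sketch on synthetic data, assuming the `GridSearchCV` from `sklearn.model_selection` (the older `sklearn.grid_search` version does not expose `cv_results_`); the dataset, grid values, and `demo_` names are placeholders.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Small synthetic problem so the search finishes in seconds (not the artworks data).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, n_informative=4, n_classes=3, random_state=0
)

demo_grid = GridSearchCV(
    MLPClassifier(max_iter=200, random_state=0),
    {"hidden_layer_sizes": [(20,), (10, 4)], "alpha": [0.0001, 0.01]},
    scoring="accuracy",
    cv=3,
)
demo_grid.fit(X_demo, y_demo)

# One row per parameter combination, with fold-averaged score and fit time.
results = pd.DataFrame(demo_grid.cv_results_)
summary = results[
    ["param_hidden_layer_sizes", "param_alpha",
     "mean_test_score", "std_test_score", "mean_fit_time"]
].sort_values("mean_test_score", ascending=False)

print(summary.to_string(index=False))
```

Comparing `mean_test_score` with `std_test_score` shows whether the gap between two layer structures is larger than the fold-to-fold noise.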

In [40]:
# 0.58 with cross-validation and a hidden layer size of 1000
# 0.62 with activation: logistic, hidden_layer_sizes: (1000,), solver: adam (708 seconds)
# 0.617 with hidden_layer_sizes: (1000,), activation: logistic, solver: adam, alpha: 0.0001 (1079 seconds)
grid.score(X_test, y_test)

0.6188031445531846

In [41]:
# Create Cross tab
y_pred = grid.predict(X_test)
print(pd.crosstab(y_pred, y_test))

Department                  Architecture & Design  Drawings  \
row_0                                                         
Architecture & Design                        1143       110   
Drawings                                       51       361   
Painting & Sculpture                          170       155   
Photography                                   725       657   
Prints & Illustrated Books                   1312      1843   

Department                  Painting & Sculpture  Photography  \
row_0                                                           
Architecture & Design                        111           92   
Drawings                                      14           47   
Painting & Sculpture                         487          114   
Photography                                   90         4925   
Prints & Illustrated Books                   384         2018   

Department                  Prints & Illustrated Books  
row_0                                        

In [43]:
print(classification_report(y_test, y_pred))

                            precision    recall  f1-score   support

     Architecture & Design       0.61      0.34      0.43      3401
                  Drawings       0.59      0.12      0.19      3126
      Painting & Sculpture       0.43      0.45      0.44      1086
               Photography       0.51      0.68      0.59      7196
Prints & Illustrated Books       0.69      0.76      0.72     16356

               avg / total       0.62      0.62      0.60     31165
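
The recall column can be cross-checked against the crosstab: columns there are the true departments, so recall is the diagonal count divided by the column total (the support). The sketch below hard-codes the counts for the four columns visible in the crosstab output and the supports from the report; the names are illustrative.

```python
# (correctly predicted, support) per true department, copied from the outputs above.
visible = {
    "Architecture & Design": (1143, 3401),
    "Drawings": (361, 3126),
    "Painting & Sculpture": (487, 1086),
    "Photography": (4925, 7196),
}

# Recall = correct predictions for a class / all true members of that class.
recall = {dept: correct / support for dept, (correct, support) in visible.items()}

for dept, value in recall.items():
    print(f"{dept:<24} recall = {value:.2f}")

# Share of the test set held by the largest class, for comparison with accuracy.
n_test = 31165
prints_support = 16356
test_majority_share = prints_support / n_test
print(f"Always predicting 'Prints & Illustrated Books' scores {test_majority_share:.3f}")
```

The recomputed recalls match the report (0.34, 0.12, 0.45, 0.68), and the test accuracy of about 0.62 sits roughly nine points above the 0.525 that always predicting the largest class would give.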

