In [112]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

from sklearn import ensemble

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import VarianceThreshold
from sklearn.metrics import accuracy_score

from sklearn.model_selection import GridSearchCV

from sklearn.feature_selection import SelectKBest

This data comes from a public dataset published by the Museum of Modern Art (MoMA) in NYC. It describes their art collection, including the titles of their pieces, the artists, dates of acquisition, identification information such as AccessionNumber and ObjectID, and physical attributes such as height and width. In this experiment, I will use a multi-layer perceptron (MLP) to see if I can classify the department a piece belongs to using everything but the department name.

Import data and look at columns.

In [25]:
artworks = pd.read_csv('https://media.githubusercontent.com/media/MuseumofModernArt/collection/master/Artworks.csv')

In [26]:
artworks.columns

Index(['Title', 'Artist', 'ConstituentID', 'ArtistBio', 'Nationality',
       'BeginDate', 'EndDate', 'Gender', 'Date', 'Medium', 'Dimensions',
       'CreditLine', 'AccessionNumber', 'Classification', 'Department',
       'DateAcquired', 'Cataloged', 'ObjectID', 'URL', 'ThumbnailURL',
       'Circumference (cm)', 'Depth (cm)', 'Diameter (cm)', 'Height (cm)',
       'Length (cm)', 'Weight (kg)', 'Width (cm)', 'Seat Height (cm)',
       'Duration (sec.)'],
      dtype='object')

Select columns. <br>
Convert URLs to booleans. <br>
Drop 'films' and other tricky rows. <br>
Drop missing data. <br>

In [27]:
artworks = artworks[['Artist', 'Nationality', 'Gender', 'Date', 'Department',
                    'DateAcquired', 'URL', 'ThumbnailURL', 'Height (cm)', 'Width (cm)']]

artworks['URL'] = artworks['URL'].notnull()
artworks['ThumbnailURL'] = artworks['ThumbnailURL'].notnull()

artworks = artworks[artworks['Department']!='Film']
artworks = artworks[artworks['Department']!='Media and Performance Art']
artworks = artworks[artworks['Department']!='Fluxus Collection']

artworks = artworks.dropna()

Now, look at data and check out what kind of features we have.

In [28]:
artworks.head()

| | Artist | Nationality | Gender | Date | Department | DateAcquired | URL | ThumbnailURL | Height (cm) | Width (cm) |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Otto Wagner | (Austrian) | (Male) | 1896 | Architecture & Design | 1996-04-09 | True | True | 48.6 | 168.9 |
| 1 | Christian de Portzamparc | (French) | (Male) | 1987 | Architecture & Design | 1995-01-17 | True | True | 40.6401 | 29.8451 |
| 2 | Emil Hoppe | (Austrian) | (Male) | 1903 | Architecture & Design | 1997-01-15 | True | True | 34.3 | 31.8 |
| 3 | Bernard Tschumi | () | (Male) | 1980 | Architecture & Design | 1995-01-17 | True | True | 50.8 | 50.8 |
| 4 | Emil Hoppe | (Austrian) | (Male) | 1903 | Architecture & Design | 1997-01-15 | True | True | 38.4 | 19.1 |


In [29]:
artworks.dtypes

Artist           object
Nationality      object
Gender           object
Date             object
Department       object
DateAcquired     object
URL                bool
ThumbnailURL       bool
Height (cm)     float64
Width (cm)      float64
dtype: object

Convert the acquisition date to a datetime and create a new numeric feature for the year the piece was acquired.

In [30]:
artworks['DateAcquired'] = pd.to_datetime(artworks.DateAcquired)
artworks['YearAcquired'] = artworks.DateAcquired.dt.year
artworks['YearAcquired'].dtype

dtype('int64')

Collapse entries with multiple nationalities, genders, or artists into a single category each. <br>
Reduce each date to its first four-digit year, cutting down the number of distinct values. <br>
Define the experimental X dataframe and perform the final column drops. <br>
Create dummy categories separately, then concatenate them together. <br>
Define Y.

In [31]:
# Raw strings keep the regex escapes intact; the replacement labels are plain text.
artworks.loc[artworks['Gender'].str.contains(r'\) \('), 'Gender'] = '(multiple_persons)'
artworks.loc[artworks['Nationality'].str.contains(r'\) \('), 'Nationality'] = '(multiple_nationalities)'
artworks.loc[artworks['Artist'].str.contains(','), 'Artist'] = 'Multiple_Artists'

artworks['Date'] = pd.Series(artworks.Date.str.extract(
    '([0-9]{4})', expand=False))[:-1]

X = artworks.drop(['Department', 'DateAcquired', 'Artist', 'Nationality', 'Date'], axis=1)

artists = pd.get_dummies(artworks.Artist)
nationalities = pd.get_dummies(artworks.Nationality)
dates = pd.get_dummies(artworks.Date)

X = pd.get_dummies(X, sparse=True)
X = pd.concat([X, nationalities, dates], axis=1)

Y = artworks.Department
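To make the cleaning step above easier to follow, here is a small sketch on made-up rows (the `toy` frame and its values are assumptions, not MoMA data). It shows how the `) (` pattern flags entries that list more than one person and how the four-digit regex pulls a year out of a free-form date string.

```python
import pandas as pd

# Toy rows mimicking the parenthesised MoMA fields (values are made up).
toy = pd.DataFrame({
    'Gender': ['(Male)', '(Male) (Female)', '(Female)'],
    'Date': ['1896', 'c. 1903-05', 'n.d.'],
})

# A ") (" inside the string means more than one person is listed.
multiple = toy['Gender'].str.contains(r'\) \(')
toy.loc[multiple, 'Gender'] = '(multiple_persons)'

# Keep the first four-digit run as the year; rows without one become NaN.
toy['Year'] = toy['Date'].str.extract(r'([0-9]{4})', expand=False)

print(toy)
```

Rows whose date has no four-digit year end up as NaN, so `pd.get_dummies` gives them all zeros in the date columns.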

Alright! We've done our prep, let's build the model. Neural networks are hugely computationally intensive. This may take several minutes to run.<br>

Establish and fit the model with a single hidden layer of 1,000 units.

In [9]:
mlp = MLPClassifier(hidden_layer_sizes=(1000,))
mlp.fit(X, Y)

MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(1000,), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False)

In [10]:
mlp.score(X, Y)

0.71537805516252206

In [11]:
Y.value_counts()/len(Y)

Prints & Illustrated Books    0.523811
Photography                   0.225079
Architecture & Design         0.112399
Drawings                      0.103997
Painting & Sculpture          0.034714
Name: Department, dtype: float64

In [12]:
cross_val_score(mlp, X, Y, cv=5)

array([ 0.61088231,  0.67525923,  0.38750787,  0.57477224,  0.51463462])

# Drill: Playing with layers
Now it's your turn. Using the space below, experiment with different hidden layer structures. You can try this on a subset of the data to improve runtime. See how things vary. See what seems to matter the most. Feel free to manipulate other parameters as well. It may also be beneficial to do some real feature selection work...

First, I will create subsets to reduce the runtime and compare the full dataset to the reduced dataset for several configurations of the neural network.

In [None]:
art_50 = artworks.sample(frac=0.5)

X_50 = art_50.drop(['Department', 'DateAcquired', 'Artist', 'Nationality', 'Date'], axis=1)

artists = pd.get_dummies(art_50.Artist)
nationalities = pd.get_dummies(art_50.Nationality)
dates = pd.get_dummies(art_50.Date)

X_50 = pd.get_dummies(X_50, sparse=True)
X_50 = pd.concat([X_50, nationalities, dates], axis=1)

Y_50 = art_50.Department

In [34]:
art_10 = artworks.sample(frac=0.1)

X_10 = art_10.drop(['Department', 'DateAcquired', 'Artist', 'Nationality', 'Date'], axis=1)

artists = pd.get_dummies(art_10.Artist)
nationalities = pd.get_dummies(art_10.Nationality)
dates = pd.get_dummies(art_10.Date)

X_10 = pd.get_dummies(X_10, sparse=True)
X_10 = pd.concat([X_10, nationalities, dates], axis=1)

Y_10 = art_10.Department
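The subset cells repeat the same preprocessing, and because `sample` is called without a seed, each run draws different rows. Encoding each subset separately can also produce different dummy columns per subset. One possible way to tidy this up is sketched below on a made-up frame. `make_xy` and its `columns` argument are hypothetical names introduced here, not part of the notebook.

```python
import pandas as pd

# Toy stand-in for the cleaned `artworks` frame (made-up values).
toy = pd.DataFrame({
    'Department': ['Drawings', 'Photography', 'Drawings', 'Photography'] * 5,
    'Nationality': ['(French)', '(American)', '(Austrian)', '(American)'] * 5,
    'Gender': ['(Male)', '(Female)'] * 10,
    'Height (cm)': [float(i) for i in range(20)],
})

def make_xy(frame, columns=None):
    """Hypothetical helper: one-hot encode a frame and optionally align its columns."""
    X = frame.drop(columns=['Department', 'Nationality'])
    X = pd.get_dummies(X)  # encodes the remaining object column (Gender)
    X = pd.concat([X, pd.get_dummies(frame['Nationality'])], axis=1)
    if columns is not None:
        # Add any dummy columns missing from this subset, filled with 0.
        X = X.reindex(columns=columns, fill_value=0)
    return X, frame['Department']

X_full, Y_full = make_xy(toy)

# random_state makes the subset reproducible between runs.
subset = toy.sample(frac=0.5, random_state=0)
X_half, Y_half = make_xy(subset, columns=X_full.columns)

print(X_full.shape, X_half.shape)
```

Aligning the subset to the full column list keeps the feature space identical across dataset sizes, which makes the comparisons a little fairer.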

In [65]:
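def run_mlp(X, Y, sizes):
    """Fit an MLP with the given hidden layer sizes; print training and 5-fold CV accuracy."""
    mlp = MLPClassifier(hidden_layer_sizes=sizes)
    mlp.fit(X, Y)
    print(f'Hidden Layer Sizes: {sizes}')
    # 'Accuracy' here is measured on the same data the model was fit on.
    print(f'Accuracy: {mlp.score(X, Y)}')
    # cross_val_score refits a fresh clone of mlp on each of the 5 folds.
    scores = cross_val_score(mlp, X, Y, cv=5)
    print(f'Cross Val Scores: {scores}')
    print(f'Cross Val Mean: {scores.mean()}')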
    
run_mlp(X, Y, [100,4])

Hidden Layer Sizes: [100, 4]
Accuracy: 0.5238113697594635
Cross Val Scores: [ 0.52376569  0.52379107  0.52381645  0.52384183  0.52384183]
Cross Val Mean: 0.5238113714321069


In [56]:
run_mlp(X_50, Y_50, [100,4])

Hidden Layer Sizes: [100, 4]
Accuracy: 0.5587385885682166
Cross Val Scores: [ 0.5253876   0.52543851  0.52543851  0.52543851  0.52549438]
Cross Val Mean: 0.5254395018031999


In [55]:
run_mlp(X_10, Y_10, [100,4])

Hidden Layer Sizes: [100, 4]
Accuracy: 0.5198178118034693
Cross Val Scores: [ 0.53775411  0.51961259  0.51986434  0.52011634  0.52013586]
Cross Val Mean: 0.5227264274297059


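With a network structure of [100,4], the accuracy stays almost the same across datasets: 0.52 to 0.56 on the training data and about 0.52 in cross-validation. This is not what I expected, since neural nets tend to need a lot of data. None of the networks are overfitting by much either. However, the full-data accuracy of 0.5238 equals the share of the largest class (Prints & Illustrated Books, 0.5238), which suggests the network is simply predicting the most common department every time. Perhaps the lack of data can be made up for by changing the hidden layer structure, so the network can extract more from the features despite having less data.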
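A quick way to check whether a score reflects real learning is to compare it with a baseline that always predicts the most common class. The sketch below uses scikit-learn's `DummyClassifier` on synthetic labels drawn with roughly the class shares printed earlier. The data and short class names are assumptions for illustration only.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

rng = np.random.default_rng(0)

# Synthetic labels with roughly the class shares printed earlier (toy data only).
classes = ['Prints', 'Photography', 'Architecture', 'Drawings', 'Painting']
y = rng.choice(classes, size=2000, p=[0.52, 0.23, 0.11, 0.10, 0.04])
X = rng.normal(size=(2000, 3))  # features are ignored by the dummy model

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy='most_frequent').fit(X, y)
baseline_acc = baseline.score(X, y)

# Share of the largest class, computed directly from the labels.
majority_share = np.unique(y, return_counts=True)[1].max() / len(y)

print(baseline_acc, majority_share)
```

The two printed numbers agree by construction, so any model scoring close to the majority share has learned little beyond the class frequencies.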

In [38]:
run_mlp(X, Y, [10,10,10])

Hidden Layer Sizes: [10, 10, 10]
Accuracy: 0.6611555831217413
Cross Val Scores: [ 0.6008043   0.63475143  0.56626448  0.56590425  0.50702656]
Cross Val Mean: 0.5438327899489892


In [39]:
run_mlp(X_50, Y_50, [10,10,10])

Hidden Layer Sizes: [10, 10, 10]
Accuracy: 0.6543717170934041
Cross Val Scores: [ 0.62839147  0.64550829  0.59404981  0.60819847  0.64530826]
Cross Val Mean: 0.6092878068959712


In [40]:
run_mlp(X_10, Y_10, [10,10,10])

Hidden Layer Sizes: [10, 10, 10]
Accuracy: 0.600445779629809
Cross Val Scores: [ 0.5285576   0.54915254  0.58187984  0.5681047   0.59679767]
Cross Val Mean: 0.5660266526736795


With a minimal network structure of [10, 10, 10], the training accuracy drops from 0.66 to 0.65 to 0.60 as we reduce the dataset. The drop from the full dataset to the 50% dataset is much smaller than the drop to the 10% dataset. The cross-validation means (0.54, 0.61, and 0.57) do not follow the same order, so the picture is noisier than the training accuracy alone suggests. Still, although a smaller dataset reduces runtime, the training accuracy falls off non-linearly as data is removed, so it may not be worth our while to reduce datasets in the future. <br>
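Since the discussion depends on the gap between training accuracy and cross-validated accuracy, it can help to measure both on the same folds. The sketch below does this with `cross_validate(..., return_train_score=True)` on synthetic data. `X_toy` and `y_toy` stand in for the notebook's X and Y and are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data; the notebook's X and Y would be used instead.
X_toy, y_toy = make_classification(n_samples=300, n_features=20, n_informative=5,
                                   n_classes=3, random_state=0)

mlp = MLPClassifier(hidden_layer_sizes=(10, 10, 10), max_iter=300, random_state=0)

# return_train_score=True reports accuracy on each fold's training part as well.
result = cross_validate(mlp, X_toy, y_toy, cv=5, return_train_score=True)

train_mean = result['train_score'].mean()
test_mean = result['test_score'].mean()
print(f'Train mean: {train_mean:.3f}  CV mean: {test_mean:.3f}  '
      f'gap: {train_mean - test_mean:.3f}')
```

A large positive gap points to overfitting, while two similar low numbers point to underfitting.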

I have read advice online that says beyond 2-3 hidden layers, there tends not to be a big increase in performance. Let's test that theory by running 5 layers with 10 perceptrons each.

In [57]:
run_mlp(X, Y, [10,10,10,10,10])

Hidden Layer Sizes: [10, 10, 10, 10, 10]
Accuracy: 0.5979396429748222
Cross Val Scores: [ 0.59038713  0.61488516  0.53772351  0.55194805  0.5086257 ]
Cross Val Mean: 0.5747460568866721


In [58]:
run_mlp(X_50, Y_50, [10,10,10,10,10])

Hidden Layer Sizes: [10, 10, 10, 10, 10]
Accuracy: 0.6595080728005738
Cross Val Scores: [ 0.63653101  0.57466809  0.62176567  0.57554027  0.65354789]
Cross Val Mean: 0.6170057564137159


In [59]:
run_mlp(X_10, Y_10, [10,10,10,10,10])

Hidden Layer Sizes: [10, 10, 10, 10, 10]
Accuracy: 0.5893012888845819
Cross Val Scores: [ 0.56389158  0.58014528  0.52374031  0.52544838  0.51868025]
Cross Val Mean: 0.5407504068716287


Comparing cross-validation means, 3 layers of [10,10,10] gave 0.54, 0.61, and 0.57, while 5 layers gave 0.57, 0.62, and 0.54. (The training accuracies were 0.66, 0.65, and 0.60 for 3 layers and 0.60, 0.66, and 0.59 for 5 layers.) This shows that adding more layers doesn't necessarily improve results. It may be that the size of the layers is more important. Here the 50% reduced dataset actually performed best, despite some overfitting. This would contradict the earlier finding that more data produces better results, but this run also uses 5 small hidden layers, when standard practice is to use fewer, larger layers. <br>
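Because the weights of an MLP are initialized randomly, two runs of the same structure can give noticeably different scores, which makes small differences between structures hard to interpret. A rough sketch of one way to gauge that noise is to repeat the cross-validation over a few seeds. The helper name `cv_mean_over_seeds` and the synthetic data are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data; the notebook's X and Y would be used instead.
X_toy, y_toy = make_classification(n_samples=300, n_features=20, n_informative=5,
                                   n_classes=3, random_state=0)

def cv_mean_over_seeds(sizes, seeds=(0, 1, 2)):
    """Hypothetical helper: mean and std of the CV mean across random seeds."""
    means = []
    for seed in seeds:
        mlp = MLPClassifier(hidden_layer_sizes=sizes, max_iter=200, random_state=seed)
        means.append(cross_val_score(mlp, X_toy, y_toy, cv=3).mean())
    return np.mean(means), np.std(means)

for sizes in [(10, 10, 10), (10, 10, 10, 10, 10)]:
    mean, std = cv_mean_over_seeds(sizes)
    print(sizes, round(mean, 3), round(std, 3))
```

If the spread across seeds is as large as the difference between two structures, the comparison is probably not telling us much.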

Let's expand the size of the hidden layers, but decrease the number to the minimum (1).

In [60]:
run_mlp(X, Y, [150])

Hidden Layer Sizes: [150]
Accuracy: 0.6175934719826333
Cross Val Scores: [ 0.60797519  0.66547146  0.48122305  0.56503198  0.48245784]
Cross Val Mean: 0.5385386661528683


In [61]:
run_mlp(X_50, Y_50, [150])

Hidden Layer Sizes: [150]
Accuracy: 0.5870757660922994
Cross Val Scores: [ 0.62577519  0.64919081  0.55751526  0.59686016  0.6240791 ]
Cross Val Mean: 0.5644374284034559


In [43]:
run_mlp(X_10, Y_10, [150])

Hidden Layer Sizes: [150]
Accuracy: 0.561197790483574
Cross Val Scores: [ 0.53436592  0.5598063   0.53246124  0.57198255  0.55409995]
Cross Val Mean: 0.473447241477135


Now we have cross-validation means of 0.54, 0.56, and 0.47, and the 0.47 on the 10% dataset is the lowest so far. Perhaps more than one layer is needed, or if there is only one layer, it must be much larger than this. <br>

Let's see what happens when we continue with the minimum of 1 layer, but this time with a minimal size of 2. Will the score get even worse?

In [44]:
run_mlp(X, Y, [2])

Hidden Layer Sizes: [2]
Accuracy: 0.5238113697594635
Cross Val Scores: [ 0.52376569  0.52379107  0.52381645  0.52384183  0.52384183]
Cross Val Mean: 0.5148167741133588


In [45]:
run_mlp(X_50, Y_50, [2])

Hidden Layer Sizes: [2]
Accuracy: 0.5254394976062644
Cross Val Scores: [ 0.5253876   0.5512162   0.55858126  0.52543851  0.52549438]
Cross Val Mean: 0.5254395018031999


In [46]:
run_mlp(X_10, Y_10, [2])

Hidden Layer Sizes: [2]
Accuracy: 0.5198178118034693
Cross Val Scores: [ 0.51936108  0.51961259  0.51986434  0.52011634  0.52013586]
Cross Val Mean: 0.5286316009058842


With cross-validation means of 0.51, 0.53, and 0.53, hidden layers=[2] is comparable to hidden layers=[150]. Let's see what happens when we keep the minimum of 1 layer but expand its size drastically.

In [62]:
run_mlp(X, Y, [1000])

Hidden Layer Sizes: [1000]
Accuracy: 0.7021979725931813
Cross Val Scores: [ 0.58205339  0.6389185   0.58782769  0.57753441  0.45798604]
Cross Val Mean: 0.5718116178346173


In [63]:
run_mlp(X_50, Y_50, [1000])

Hidden Layer Sizes: [1000]
Accuracy: 0.533405694570969
Cross Val Scores: [ 0.56560078  0.64550829  0.53638918  0.24915205  0.64521132]
Cross Val Mean: 0.5423349798904916


In [66]:
run_mlp(X_10, Y_10, [1000])

Hidden Layer Sizes: [1000]
Accuracy: 0.2944083729043512
Cross Val Scores: [ 0.53194579  0.29539952  0.53682171  0.53756665  0.5346919 ]
Cross Val Mean: 0.4872851115550634


When we expanded the layer, the full dataset saw the training accuracy increase to 0.70, which is the best so far, but the network is also overfitting, as we can see from the cross-validation mean of 0.57. Also, it took forever. The 50% reduced dataset was not overfitting, but its accuracy fell to 0.53. The 10% dataset was dramatically underfitting (0.29 training accuracy against a cross-validation mean of 0.49), which is interesting. This shows us that although large datasets are preferred, a network this large can still overfit them.

Let's perform some basic feature selection to see if we can improve this algorithm. Now that we are trying to make a robust algorithm (instead of just tinkering with variables) we will split our data into training and testing datasets to see the accuracy.

In [101]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y,
                                                    test_size=0.2,
                                                    random_state=0)

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier

# Scratch cell: candidate feature-selection pipelines.

# Idea 1: scale, keep the top-scoring features, then fit a logistic regression.
# Defined for reference only; not fit here.
top_feat = SelectKBest()
logit_pipe = Pipeline([('scaler', StandardScaler()),
                       ('feat', top_feat),
                       ('clf', LogisticRegression())])

# Idea 2: an L1-penalized linear SVM selects features for a random forest.
# penalty="l1" requires dual=False in LinearSVC.
svc_pipe = Pipeline([
  ('feature_selection', SelectFromModel(LinearSVC(penalty="l1", dual=False))),
  ('classification', RandomForestClassifier())
])
svc_pipe.fit(X_train, Y_train)

# Idea 3 (not pursued): PCA() followed by LinearSVC().

# Idea 4: drop low-variance features, then fit the MLP.
clf = Pipeline([
    ('feat', VarianceThreshold(threshold=(.8 * (1 - .8)))),
    ('mlp', MLPClassifier(hidden_layer_sizes=[10,10,10]))
])

clf.fit(X_train, Y_train)
predicted = clf.predict(X_test)

# get the accuracy
print(accuracy_score(Y_test, predicted))

In [110]:
pipe = Pipeline([
    ('feat', VarianceThreshold()),
    ('mlp', MLPClassifier())
])

param_grid = [
    {
        'feat__threshold': [(.8 * (1 - .8)),(.6 * (1 - .6)),(.4 * (1 - .4))],
        'mlp__hidden_layer_sizes': [[10,10,10],[1000]]
    }
]

grid = GridSearchCV(pipe, cv=3, n_jobs=1, param_grid=param_grid)
grid.fit(X_train, Y_train)
print(grid.best_params_)
prediction = grid.predict(X_test)
print(accuracy_score(Y_test, prediction))

{'feat__threshold': 0.15999999999999998, 'mlp__hidden_layer_sizes': [10, 10, 10]}
0.54520786898


With a test accuracy of 0.55, feature selection via variance threshold did not drastically improve our results. I used the two hidden layer structures that produced the best scores earlier in the notebook to test this. With the selected hidden layer structure of [10,10,10] on the full dataset, the score hardly changed at all, remaining around 0.54. We will try one more method of feature reduction: SelectKBest.
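The thresholds in the grid come from the variance of a 0/1 column: if a share p of the rows are 1, the variance is p(1 - p). One consequence is that `.6 * (1 - .6)` and `.4 * (1 - .4)` are the same number, so the grid above effectively tries only two distinct thresholds. The sketch below checks the arithmetic and shows which toy columns each threshold keeps. The `X_toy` columns are made up for illustration.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Variance of a 0/1 column with share p of ones is p * (1 - p).
thresholds = [p * (1 - p) for p in (0.8, 0.6, 0.4)]
print(thresholds)  # the 0.6 and 0.4 entries give the same threshold

# Toy 0/1 columns with 10%, 30% and 50% ones (made-up data).
n = 1000
X_toy = np.column_stack([
    np.arange(n) < 100,   # variance 0.09
    np.arange(n) < 300,   # variance 0.21
    np.arange(n) < 500,   # variance 0.25
]).astype(float)

# Only features with variance strictly above the threshold are kept.
for t in sorted(set(round(t, 10) for t in thresholds)):
    kept = VarianceThreshold(threshold=t).fit(X_toy).get_support()
    print(t, kept)
```

With one-hot columns as sparse as single years or nationalities, even the lowest threshold of 0.16 would likely remove most of them, which may be part of why the grid search changed so little.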

In [114]:
pipe = Pipeline([
    ('feat', SelectKBest(k=100)),
    ('mlp', MLPClassifier(hidden_layer_sizes = [10,10,10]))
])

pipe.fit(X_train, Y_train)
prediction = pipe.predict(X_test)
print(accuracy_score(Y_test, prediction))


0.631504990794


Alright! Using SelectKBest with 100 features, we were able to obtain our best score yet -- 0.63. This is still not great, but a significant improvement from the values around 0.54 that all the other models were obtaining.
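By default, `SelectKBest` ranks features with the ANOVA F-test (`f_classif`), which compares between-class and within-class variation for each feature. Constant columns have no variation to compare, so one option is to drop them with a default `VarianceThreshold()` before scoring. This is a minimal sketch on synthetic data, with step names chosen here for illustration, not the pipeline used above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif
from sklearn.pipeline import Pipeline

# Synthetic stand-in data; the notebook's X_train and Y_train would be used instead.
X_toy, y_toy = make_classification(n_samples=200, n_features=10, n_informative=4,
                                   n_redundant=0, random_state=0)

# Append a constant column, as an all-zero dummy variable would be.
X_toy = np.hstack([X_toy, np.zeros((200, 1))])

selector = Pipeline([
    ('drop_constant', VarianceThreshold()),              # default threshold=0 removes constant columns
    ('kbest', SelectKBest(score_func=f_classif, k=4)),   # keep the 4 highest F-scores
])

X_sel = selector.fit_transform(X_toy, y_toy)
print(X_sel.shape)
```

The value of `k` is a tuning choice. It could be searched with `GridSearchCV` using the `kbest__k` parameter name in this sketch.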

# Conclusion

I have experimented with hidden layer structures, dataset sizes, and feature selection to classify this set using neural nets. With the full feature dataset, I generally found that including all datapoints led to greater accuracy scores, but this was not always the case. Sometimes the larger networks overfit even with all the datapoints included, so I will make a note to be cautious about this in the future. When the structure was very small, I saw more consistency in results between the differently-sized datasets. <br>

Given that there were so many categorical features conveying little information per feature, I thought that setting a variance threshold to reduce the feature set would be beneficial. However, even after iterating through multiple threshold values and hidden layer structures using GridSearchCV, I did not see an increase in performance. I did, however, see a clear increase in test accuracy (from about 0.54 to 0.63) when using SelectKBest, a method I had never used before. I will continue using these methods as I learn more about data science.