# Supervised neural nets

We'll build our first neural network.  Several input features are passed through a layer of perceptrons, and the network is trained so that its output matches the labeled response.
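To make the idea concrete, here is a minimal NumPy sketch of a forward pass through one hidden layer. The weights are random, and the layer sizes (4 inputs, 5 hidden units, 3 classes) and all names are illustrative assumptions, not values used later in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    # Rectified linear unit: negative values become zero
    return np.maximum(z, 0.0)

def softmax(z):
    # Subtract the row maximum for numerical stability
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def forward(X, W1, b1, W2, b2):
    # Each hidden unit computes a weighted sum of the inputs plus a bias
    hidden = relu(X @ W1 + b1)
    # The output layer turns hidden activations into class probabilities
    return softmax(hidden @ W2 + b2)

# Hypothetical sizes: 2 samples, 4 features, 5 hidden units, 3 classes
X_demo = rng.normal(size=(2, 4))
W1, b1 = rng.normal(size=(4, 5)), np.zeros(5)
W2, b2 = rng.normal(size=(5, 3)), np.zeros(3)

probs = forward(X_demo, W1, b1, W2, b2)
print(probs.shape)
print(probs.sum(axis=1))
```

Training adjusts `W1`, `b1`, `W2`, and `b2` so that these probabilities agree with the known labels. `MLPClassifier` handles that step for us below.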

The data set we'll use comes from the public repository of the [collection of the Museum of Modern Art in NYC](https://media.githubusercontent.com/media/MuseumofModernArt/collection/master/Artworks.csv).

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score, train_test_split

%matplotlib inline

In [2]:
artworks = pd.read_csv('https://media.githubusercontent.com/media/MuseumofModernArt/collection/master/Artworks.csv')

In [3]:
artworks.head()

Unnamed: 0,Title,Artist,ConstituentID,ArtistBio,Nationality,BeginDate,EndDate,Gender,Date,Medium,...,ThumbnailURL,Circumference (cm),Depth (cm),Diameter (cm),Height (cm),Length (cm),Weight (kg),Width (cm),Seat Height (cm),Duration (sec.)
0,"Ferdinandsbrücke Project, Vienna, Austria, Ele...",Otto Wagner,6210,"(Austrian, 1841–1918)",(Austrian),(1841),(1918),(Male),1896,Ink and cut-and-pasted painted pages on paper,...,http://www.moma.org/media/W1siZiIsIjU5NDA1Il0s...,,,,48.6,,,168.9,,
1,"City of Music, National Superior Conservatory ...",Christian de Portzamparc,7470,"(French, born 1944)",(French),(1944),(0),(Male),1987,Paint and colored pencil on print,...,http://www.moma.org/media/W1siZiIsIjk3Il0sWyJw...,,,,40.6401,,,29.8451,,
2,"Villa near Vienna Project, Outside Vienna, Aus...",Emil Hoppe,7605,"(Austrian, 1876–1957)",(Austrian),(1876),(1957),(Male),1903,"Graphite, pen, color pencil, ink, and gouache ...",...,http://www.moma.org/media/W1siZiIsIjk4Il0sWyJw...,,,,34.3,,,31.8,,
3,"The Manhattan Transcripts Project, New York, N...",Bernard Tschumi,7056,"(French and Swiss, born Switzerland 1944)",(),(1944),(0),(Male),1980,Photographic reproduction with colored synthet...,...,http://www.moma.org/media/W1siZiIsIjEyNCJdLFsi...,,,,50.8,,,50.8,,
4,"Villa, project, outside Vienna, Austria, Exter...",Emil Hoppe,7605,"(Austrian, 1876–1957)",(Austrian),(1876),(1957),(Male),1903,"Graphite, color pencil, ink, and gouache on tr...",...,http://www.moma.org/media/W1siZiIsIjEyNiJdLFsi...,,,,38.4,,,19.1,,


Let's do some data processing and cleaning to focus only on relevant data.

In [4]:
artworks = artworks[['Artist', 
                    'Nationality', 
                    'Gender', 
                    'Date', 
                    'Department', 
                    'DateAcquired', 
                    'URL', 
                    'ThumbnailURL', 
                    'Height (cm)', 
                    'Width (cm)']].copy()

# Convert URLs to booleans (True when a URL is present)
artworks['URL'] = artworks['URL'].notnull()
artworks['ThumbnailURL'] = artworks['ThumbnailURL'].notnull()

# Dropping some tricky departments
artworks = artworks[~artworks['Department'].isin(['Film', 
                                                  'Media and Performance Art', 
                                                  'Fluxus Collection'])]

artworks.dropna(inplace=True)

In [5]:
artworks.dtypes

Artist           object
Nationality      object
Gender           object
Date             object
Department       object
DateAcquired     object
URL                bool
ThumbnailURL       bool
Height (cm)     float64
Width (cm)      float64
dtype: object

In [6]:
artworks['DateAcquired'] = pd.to_datetime(artworks['DateAcquired'])
artworks['YearAcquired'] = artworks['DateAcquired'].dt.year

In [7]:
artworks['Gender'].value_counts()

(Male)                  83159
(Female)                14351
()                       4957
(Male) (Male)            1394
(Male) (Male) (Male)

There are pieces with multiple artists.  This affects `Gender`, `Nationality`, and `Artist`.  We'll re-classify those as multiple artists instead.

In [8]:
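# Multiple artists show up as repeated parenthesized values, e.g. "(Male) (Male)"
artworks.loc[artworks['Gender'].str.contains(r'\) \('), 'Gender'] = '(Multiple Artists)'
# Empty parentheses mean the gender was not recorded
artworks.loc[artworks['Gender'].str.contains(r'\(\)'), 'Gender'] = '(Unknown)'
# Normalize the lowercase variant "(male)"
artworks.loc[artworks['Gender'].str.contains(r'\(male\)'), 'Gender'] = '(Male)'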


In [9]:
artworks['Gender'].value_counts()

(Male)                83183
(Female)              14351
(Multiple Artists)     5410
(Unknown)              4957
Name: Gender, dtype: int64
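The patterns used for this re-classification can be checked on a few hand-made values. The series below is hypothetical and only mimics the formats seen in the `value_counts()` output.

```python
import pandas as pd

# Hypothetical values mimicking the formats in the Gender column
gender_demo = pd.Series(['(Male)', '(Female)', '()', '(Male) (Female)', '(male)'])

# ") (" only occurs between two parenthesized entries, i.e. multiple artists
multiple = gender_demo.str.contains(r'\) \(')
# "()" is an empty entry, i.e. an unknown value
unknown = gender_demo.str.contains(r'\(\)')

print(multiple.tolist())  # [False, False, False, True, False]
print(unknown.tolist())   # [False, False, True, False, False]
```

The order of the two replacements matters: a value such as `() ()` matches both patterns, so it is labeled as multiple artists before the unknown check runs.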

In [10]:
artworks.loc[artworks['Nationality'].str.contains(r'\) \('), 'Nationality'] = '(Multiple Artists)'
artworks.loc[artworks['Nationality'].str.contains(r'\(\)'), 'Nationality'] = '(Nationality Unknown)'
artworks['Nationality'].value_counts()

(American)               46448
(French)                 16350
(German)                  6897
(Multiple Artists)        5410
(British)                 4926
(Nationality Unknown)     3876
(Spanish)                 2815
(Italian)                 2387
(Japanese)                1951
(Russian)                 1839
(Swiss)                   1693
(Dutch)                   1430
(Belgian)                 1335
(Mexican)                 1099
(Austrian)                 723
(Brazilian)                718
(Czech)                    695
(Colombian)                677
(Argentine)                521
(Canadian)                 507
(Polish)                   441
(Venezuelan)               432
(Chilean)                  389
(Danish)                   343
(South African)            340
(Israeli)                  312
(Nationality unknown)      303
(Australian)               215
(Swedish)                  198
(Cuban)                    177
                         ...  
(Georgian)                   5
(Malaysi

In [11]:
artworks.loc[artworks['Artist'].str.contains(','), 'Artist'] = 'Multiple Artists'
artworks['Artist'].value_counts()

Multiple Artists             5706
Louise Bourgeois             3230
Unknown photographer         2127
Lee Friedlander              1316
Pablo Picasso                1295
Marc Chagall                 1151
Henri Matisse                1041
Jean Dubuffet                1021
Ludwig Mies van der Rohe      936
Pierre Bonnard                890
Émile Bernard                 631
Georges Rouault               599
Aristide Maillol              576
André Derain                  567
Sol LeWitt                    556
Raoul Dufy                    534
Maurice Denis                 500
Dorothea Lange                475
Joan Miró                     454
Jan Dibbets                   427
Pierre Alechinsky             416
Jasper Johns                  400
Jim Dine                      380
Unknown Artist                375
George Maciunas               369
Walker Evans                  358
Jules Pascin                  342
Garry Winogrand               342
Unknown Designer              340
Thomas Bewick 

In [12]:
artworks['Artist'].nunique()

9347

In [13]:
# Dates come in many formats; pull out the first four-digit year in each string
artworks['Date'] = artworks['Date'].str.extract(r'([0-9]{4})', expand=False)
artworks['Date'].value_counts()

1967    2225
1969    2187
1968    2056
1965    2039
1966    2016
1971    1918
1970    1811
1964    1673
1930    1617
1962    1581
1963    1575
1973    1552
1972    1375
2003    1358
1948    1277
1928    1250
1931    1243
1938    1219
2001    1153
2002    1145
1926    1142
1980    1142
1976    1124
1947    1113
1920    1112
1961    1094
1974    1092
1927    1079
1950    1073
1975    1069
        ... 
1840      22
1879      20
1886      20
1816      18
1825      18
1878      15
1851      14
1844      12
2018      10
1841       9
1845       9
1882       9
1837       9
1847       4
1768       2
1839       2
1832       2
1846       2
1800       1
1808       1
1828       1
1805       1
1811       1
1799       1
1809       1
1838       1
1786       1
1848       1
1501       1
1842       1
Name: Date, Length: 198, dtype: int64
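As a quick check of how the year extraction behaves, here is a small sketch on made-up date strings. The formats are assumptions about what the `Date` column may contain, not values copied from the data.

```python
import pandas as pd

# Hypothetical date strings in a few plausible formats
date_demo = pd.Series(['1896', 'c. 1917', '1976-77', '(1950s)', 'Unknown', 'n.d.'])

# Capture the first run of four digits; strings without one become NaN
years = date_demo.str.extract(r'([0-9]{4})', expand=False)

print(years.tolist())
```

Strings without a four-digit year come back as missing values, and the result is still a string column, so it behaves as a categorical feature unless it is converted to a number.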

Data cleaning is now complete.  Let's one-hot encode (OHE) the categorical columns and separate the data into input features and the target.

In [14]:
X = artworks.drop(['Department', 
                   'DateAcquired', 
                   'Artist', 
                   'Nationality', 
                   'Date'], axis=1)

artists = pd.get_dummies(artworks['Artist'])
nationalities = pd.get_dummies(artworks['Nationality'])
dates = pd.get_dummies(artworks['Date'])

X = pd.get_dummies(X, sparse=True)

# Not adding Artist here because there are 9k unique artists
X = pd.concat([X, nationalities, dates], axis=1)

Y = artworks['Department']
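For readers new to one-hot encoding, the sketch below shows what `pd.get_dummies` does on a tiny, hypothetical frame: each category becomes its own indicator column, while numeric columns pass through unchanged.

```python
import pandas as pd

# Hypothetical three-row frame with one categorical and one numeric column
toy = pd.DataFrame({
    'Nationality': ['(American)', '(French)', '(American)'],
    'Height (cm)': [48.6, 40.6, 34.3],
})

# Object columns are expanded into one indicator column per category
encoded = pd.get_dummies(toy)

print(encoded.columns.tolist())
```

With thousands of distinct values, as with `Artist`, this produces thousands of columns, which is why that feature is left out of `X`.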

In [31]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, stratify=Y, test_size=.50)

In [32]:
print(Y_train.shape)
print(Y_test.shape)

(53950,)
(53951,)
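The `stratify=Y` argument keeps the class proportions the same in both halves. A minimal sketch on a made-up, imbalanced label array (80 of class `A`, 20 of class `B`) illustrates this; `random_state=0` is added here only to make the toy example repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80% class A, 20% class B
y_demo = np.array(['A'] * 80 + ['B'] * 20)
X_demo = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, test_size=0.5, random_state=0)

# Both halves keep the 80/20 class balance
print((y_tr == 'B').mean(), (y_te == 'B').mean())
```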


## Neural Net Model

In [35]:
# Instantiate the model with a single hidden layer of 1000 perceptrons
mlp = MLPClassifier(hidden_layer_sizes=(1000,), verbose=True)
mlp.fit(X_train, Y_train)

Iteration 1, loss = 5.21029558
Iteration 2, loss = 3.29413642
Iteration 3, loss = 2.60434505
Iteration 4, loss = 2.60983145
Iteration 5, loss = 2.11426327
Iteration 6, loss = 2.29090387
Iteration 7, loss = 2.23118019
Iteration 8, loss = 2.21611951
Iteration 9, loss = 2.49341577
Iteration 10, loss = 1.65128125
Iteration 11, loss = 2.05433725
Iteration 12, loss = 2.03813061
Iteration 13, loss = 1.72030210
Iteration 14, loss = 2.43130212
Iteration 15, loss = 1.16329732
Iteration 16, loss = 2.03982809
Iteration 17, loss = 1.42534065
Iteration 18, loss = 1.42100615
Iteration 19, loss = 1.33031004
Iteration 20, loss = 1.44883339
Iteration 21, loss = 1.54183091
Iteration 22, loss = 1.25713940
Iteration 23, loss = 1.12468507
Iteration 24, loss = 1.30595306
Iteration 25, loss = 1.17065725
Iteration 26, loss = 1.40046077
Iteration 27, loss = 1.16262974
Iteration 28, loss = 1.04594656
Iteration 29, loss = 1.02353810
Iteration 30, loss = 1.42040568
Iteration 31, loss = 1.00882371
Iteration 32, los

MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(1000,), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
       random_state=None, shuffle=True, solver='adam', tol=0.0001,
       validation_fraction=0.1, verbose=True, warm_start=False)

In [36]:
mlp.score(X_test, Y_test)

0.7855832143982503

In [21]:
# cross_val_score(mlp, X, Y, cv=5, n_jobs=3, verbose=.75)
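The commented-out call above would cross-validate the full model, which is slow on data of this size. As a sketch of what `cross_val_score` returns, the example below runs on a small synthetic dataset; the data, layer size, and seeds are assumptions chosen to keep it fast.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

# Small synthetic stand-in for the real feature matrix
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

mlp_demo = MLPClassifier(hidden_layer_sizes=(20,), max_iter=300, random_state=0)

# One accuracy score per fold; the spread hints at how stable the model is
scores = cross_val_score(mlp_demo, X_demo, y_demo, cv=5)
print(scores.mean(), scores.std())
```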

## Playing with the hyperparameters

In [38]:
# Two hidden layers, each 100 perceptrons wide
mlp2 = MLPClassifier(hidden_layer_sizes=(100,100,), verbose=True)
mlp2.fit(X_train, Y_train)
mlp2.score(X_test, Y_test)

Iteration 1, loss = 2.19083779
Iteration 2, loss = 1.47985521
Iteration 3, loss = 0.98018946
Iteration 4, loss = 1.15257461
Iteration 5, loss = 0.98149436
Iteration 6, loss = 0.90199035
Iteration 7, loss = 0.91432714
Iteration 8, loss = 0.89189145
Iteration 9, loss = 0.88229101
Iteration 10, loss = 0.84363198
Iteration 11, loss = 0.86399837
Iteration 12, loss = 0.77369669
Iteration 13, loss = 0.86835279
Iteration 14, loss = 0.76757839
Iteration 15, loss = 0.77675827
Iteration 16, loss = 0.76368167
Iteration 17, loss = 0.73585330
Iteration 18, loss = 0.72345855
Iteration 19, loss = 0.75649690
Iteration 20, loss = 0.72223515
Iteration 21, loss = 0.71757324
Iteration 22, loss = 0.70782829
Iteration 23, loss = 0.71746409
Iteration 24, loss = 0.70257956
Iteration 25, loss = 0.71397433
Iteration 26, loss = 0.70527682
Iteration 27, loss = 0.70242397
Iteration 28, loss = 0.70727829
Iteration 29, loss = 0.70206412
Iteration 30, loss = 0.68582806
Iteration 31, loss = 0.67977431
Iteration 32, los



0.7884561917295324

In [39]:
# Narrower second hidden layer: 100 perceptrons, then 20
mlp3 = MLPClassifier(hidden_layer_sizes=(100,20,), verbose=True)
mlp3.fit(X_train, Y_train)
mlp3.score(X_test, Y_test)

Iteration 1, loss = 2.50425802
Iteration 2, loss = 0.91023832
Iteration 3, loss = 0.86244801
Iteration 4, loss = 0.91933278
Iteration 5, loss = 0.86517759
Iteration 6, loss = 0.83528177
Iteration 7, loss = 0.79149036
Iteration 8, loss = 0.81062701
Iteration 9, loss = 0.80777538
Iteration 10, loss = 0.77051371
Iteration 11, loss = 0.81787431
Iteration 12, loss = 0.78098892
Iteration 13, loss = 0.74505838
Iteration 14, loss = 0.91825187
Iteration 15, loss = 0.75807064
Iteration 16, loss = 0.78225827
Iteration 17, loss = 0.78459287
Iteration 18, loss = 0.77811771
Iteration 19, loss = 0.74649484
Iteration 20, loss = 0.71761953
Iteration 21, loss = 0.71494637
Iteration 22, loss = 0.73163273
Iteration 23, loss = 0.71524221
Iteration 24, loss = 0.71594282
Iteration 25, loss = 0.73640966
Iteration 26, loss = 0.70448757
Iteration 27, loss = 0.73289066
Iteration 28, loss = 0.76421225
Iteration 29, loss = 0.71969406
Iteration 30, loss = 0.70836056
Iteration 31, loss = 0.71485135
Iteration 32, los

0.7577246019536246

In [40]:
# Same layers as mlp3, with logistic (sigmoid) activation instead of the default ReLU
mlp4 = MLPClassifier(hidden_layer_sizes=(100,20,), activation='logistic', verbose=True)
mlp4.fit(X_train, Y_train)
mlp4.score(X_test, Y_test)

Iteration 1, loss = 1.03871786
Iteration 2, loss = 0.94568690
Iteration 3, loss = 0.93178332
Iteration 4, loss = 0.92805863
Iteration 5, loss = 0.92578189
Iteration 6, loss = 0.92257115
Iteration 7, loss = 0.90105299
Iteration 8, loss = 0.85830835
Iteration 9, loss = 0.82487368
Iteration 10, loss = 0.79204104
Iteration 11, loss = 0.76770876
Iteration 12, loss = 0.75509075
Iteration 13, loss = 0.74139896
Iteration 14, loss = 0.73334691
Iteration 15, loss = 0.72840481
Iteration 16, loss = 0.72019701
Iteration 17, loss = 0.71811396
Iteration 18, loss = 0.71345719
Iteration 19, loss = 0.70613964
Iteration 20, loss = 0.69903952
Iteration 21, loss = 0.70040248
Iteration 22, loss = 0.69985314
Iteration 23, loss = 0.69630587
Iteration 24, loss = 0.69091253
Iteration 25, loss = 0.68958479
Iteration 26, loss = 0.68853296
Iteration 27, loss = 0.68299277
Iteration 28, loss = 0.68320957
Iteration 29, loss = 0.67885596
Iteration 30, loss = 0.67793523
Iteration 31, loss = 0.67578235
Iteration 32, los

0.7486793571944913

In [42]:
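# Three hidden layers and up to 500 iterations; warm_start=True only matters
# when fit() is called again on the same object
mlp4 = MLPClassifier(hidden_layer_sizes=(100,20,20), activation='relu', max_iter=500, warm_start=True, verbose=True)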
mlp4.fit(X_train, Y_train)
mlp4.score(X_test, Y_test)

Iteration 1, loss = 1.51511806
Iteration 2, loss = 0.92924470
Iteration 3, loss = 0.88439461
Iteration 4, loss = 0.86241623
Iteration 5, loss = 0.83618552
Iteration 6, loss = 0.81951842
Iteration 7, loss = 0.80038036
Iteration 8, loss = 0.77963468
Iteration 9, loss = 0.76919119
Iteration 10, loss = 0.75418194
Iteration 11, loss = 0.75713811
Iteration 12, loss = 0.73868582
Iteration 13, loss = 0.73274726
Iteration 14, loss = 0.72674465
Iteration 15, loss = 0.72518768
Iteration 16, loss = 0.71639451
Iteration 17, loss = 0.70482896
Iteration 18, loss = 0.69924410
Iteration 19, loss = 0.69196235
Iteration 20, loss = 0.70177283
Iteration 21, loss = 0.70108348
Iteration 22, loss = 0.70367706
Iteration 23, loss = 0.69717987
Iteration 24, loss = 0.69180219
Iteration 25, loss = 0.69508726
Iteration 26, loss = 0.68884554
Iteration 27, loss = 0.68540294
Iteration 28, loss = 0.68785472
Iteration 29, loss = 0.67758477
Iteration 30, loss = 0.67431960
Iteration 31, loss = 0.67330209
Iteration 32, los

0.7827843784174529

In [43]:
mlp4 = MLPClassifier(hidden_layer_sizes=(100,20,10), activation='relu', max_iter=500, warm_start=True, verbose=True)
mlp4.fit(X_train, Y_train)
mlp4.score(X_test, Y_test)

Iteration 1, loss = 2.37749828
Iteration 2, loss = 0.93735247
Iteration 3, loss = 0.86316144
Iteration 4, loss = 0.83404903
Iteration 5, loss = 0.80637034
Iteration 6, loss = 0.79259403
Iteration 7, loss = 0.78341846
Iteration 8, loss = 0.76610720
Iteration 9, loss = 0.77001650
Iteration 10, loss = 0.75794297
Iteration 11, loss = 0.74650035
Iteration 12, loss = 0.74270952
Iteration 13, loss = 0.72820496
Iteration 14, loss = 0.71900135
Iteration 15, loss = 0.72365285
Iteration 16, loss = 0.72436632
Iteration 17, loss = 0.71244866
Iteration 18, loss = 0.71334710
Iteration 19, loss = 0.70112908
Iteration 20, loss = 0.69286206
Iteration 21, loss = 0.70061688
Iteration 22, loss = 0.69621443
Iteration 23, loss = 0.69965515
Iteration 24, loss = 0.67945127
Iteration 25, loss = 0.68145979
Iteration 26, loss = 0.68056340
Iteration 27, loss = 0.67812008
Iteration 28, loss = 0.68353932
Iteration 29, loss = 0.67617157
Iteration 30, loss = 0.68163892
Iteration 31, loss = 0.66862649
Iteration 32, los

0.7622657596708124

In [44]:
mlp5 = MLPClassifier(hidden_layer_sizes=(100,100,100), activation='relu', max_iter=500, warm_start=True, verbose=True)
mlp5.fit(X_train, Y_train)
mlp5.score(X_test, Y_test)

Iteration 1, loss = 1.46651658
Iteration 2, loss = 1.09923255
Iteration 3, loss = 0.92137815
Iteration 4, loss = 0.87830405
Iteration 5, loss = 0.86054496
Iteration 6, loss = 0.81714119
Iteration 7, loss = 0.83157858
Iteration 8, loss = 0.77643813
Iteration 9, loss = 0.77482946
Iteration 10, loss = 0.75386008
Iteration 11, loss = 0.76078128
Iteration 12, loss = 0.75158547
Iteration 13, loss = 0.74029044
Iteration 14, loss = 0.73224358
Iteration 15, loss = 0.72200915
Iteration 16, loss = 0.72337616
Iteration 17, loss = 0.71601561
Iteration 18, loss = 0.72076995
Iteration 19, loss = 0.70597466
Iteration 20, loss = 0.69501451
Iteration 21, loss = 0.68744905
Iteration 22, loss = 0.68911139
Iteration 23, loss = 0.69193745
Iteration 24, loss = 0.68529862
Iteration 25, loss = 0.67790057
Iteration 26, loss = 0.68710174
Iteration 27, loss = 0.69350272
Iteration 28, loss = 0.67012718
Iteration 29, loss = 0.67845678
Iteration 30, loss = 0.66899239
Iteration 31, loss = 0.67495785
Iteration 32, los

0.7988915868102537

In [45]:
mlp5.score(X_train, Y_train)

0.8140871177015755
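Comparing training and test accuracy, as in the last two cells, is a quick overfitting check. The loop below is one way to repeat that comparison across several architectures. It is a sketch on synthetic data, with the dataset, seeds, and candidate layer sizes all chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic three-class problem standing in for the artworks data
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, n_informative=5, n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, test_size=0.5, random_state=0)

results = {}
for sizes in [(100,), (100, 20), (100, 100, 100)]:
    model = MLPClassifier(hidden_layer_sizes=sizes, max_iter=500, random_state=0)
    model.fit(X_tr, y_tr)
    # Store (train accuracy, test accuracy) for each architecture
    results[sizes] = (model.score(X_tr, y_tr), model.score(X_te, y_te))

for sizes, (train_acc, test_acc) in results.items():
    print(sizes, round(train_acc, 3), round(test_acc, 3))
```

A training score far above the test score suggests the network is memorizing the training half. In the run above, the two scores (about 0.81 and 0.80) are close.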