# CPSC 585 - Artificial Neural Network
## Spring 2021

## Project 6 (Hopfield Network)

## Malyaj Sirothia
<br><br><br>


**1. Use the contents of ratings.csv to create a dataset for your network. There should be a feature vector for each user, with each feature corresponding to a movie. Encode movies that the user has rated 3.0 or above as +1, and other movies as -1.**

In [1]:
import pandas as pd
import numpy as np

ratings = pd.read_csv('ratings.csv', header=0,)
userId = ratings['userId'][0]
movieId = ratings['movieId'][0]
rating = ratings['rating'][0]
timestamp = ratings['timestamp'][0]

print('%-10s %-10s %-10s %-12s' % ('userId', 'movieId', 'rating', 'timestamp'))
print('%-10s %-10s %-10s %-12s' % (userId, movieId, rating, timestamp))

userId     movieId    rating     timestamp   
1          1          4.0        964982703   


In [2]:
# getting unique arrays and inverse-indices from ratings.csv
movieIds = ratings['movieId']#.to_numpy()
u_movie_ids, movie_rindices = np.unique(movieIds, return_inverse=True,)

print('movie id in unique movie ids: ', u_movie_ids[2])
print('movie rindices: ', movie_rindices) # dimension matches movieIds
print('length of movie_rindices: ', len(movie_rindices))

userIds = ratings['userId']
u_user_ids, user_indices = np.unique(userIds, return_index=True,)
i = 2

# creating dataset
dataset = np.ones((610, 9742,), dtype=int) * -1
print('dataset: ', dataset[1,2])
curr_user_id = ratings['userId'][0] # curr_user + 1 = userId

movieIds = ratings['movieId']
u_movie_ids, movie_rindices, occurrence_count = np.unique(movieIds, 
                                        return_counts=True, 
                                        return_inverse=True,)
# u_movie_id has all the unique movieIds in the set, it is a mirror of feature vector
# use movie_rindices and u_movie_ids to figure out which index in train_instance to flip to 1

user_ids = ratings['userId']
u_user_ids, user_indices = np.unique(userIds, return_index=True,)
# user_indices - indices of first appearance of their userId
# the difference between (user_indices[u], user_indices[u+1]) is how many movies they rated
# print(user_indices)

user_ratings = ratings['rating']
for i in range(100836): # 100836 ratings
    # use user_ids to check if we need to make a new train_instance
    new_user_id = user_ids[i]
    if new_user_id != curr_user_id:
        curr_user_id = user_ids[i] 
    # use ratings to flip the feature bit to +1
    if user_ratings[i] >= 3.0:
        u_movie_rindex = movie_rindices[i]
        dataset[curr_user_id-1][u_movie_rindex] = 1

movie id in unique movie ids:  3
movie rindices:  [   0    2    5 ... 9444 9445 9485]
length of movie_rindices:  100836
dataset:  -1


**2. Set aside 10% of your dataset for testing. How many users are there in the training set? What is the storage capacity of the network? Is the network likely to be able to store the dataset?**

In [3]:
test_set_indices = np.random.randint(0, 610, 64)
test_set_indices = np.unique(test_set_indices)
dataset_indices = set([i for i in range(610)])
train_set_indices = np.array(list(dataset_indices - set(test_set_indices)))
print('number of test instances:',len(test_set_indices))
print('number of training instances:', len(train_set_indices))

number of test instances: 59
number of training instances: 551


**3.Construct a Hopfield network for your dataset and train it on your training set.
Hint: You may want to take a look at the pseudocode on pp. 66-67 of Artificial Intelligence Engines.
What accuracy does your trained network achieve?**

The network reached an accuracy of **0.986** for the training set.

In [4]:

# check accuracy on the training set, RECALL
# once finished training, plug in different instances of training set
# accuracy - what percentage of recall and original vectors are similar

W = np.zeros((9742, 9742), dtype=int)
learning_rate = 1
for i in train_set_indices:
    x = dataset[i].reshape((9742, 1))
    W_t = learning_rate * np.dot(x, x.T)
    W = W + W_t

# set diagonals to 0
for i in range(9742):
    W[i][i] = 0

recall_correct = np.ones((610,1), dtype=int)


incorrect_counts = []
for v in train_set_indices:
    num_incorrect = 0
    x_prime = dataset[v]    # corrupted vector
    y = np.array([i for i in x_prime])

    stable = False
    J = np.arange(0, 9742)
    times = 0
    while stable == False:
        stable = True
        np.random.shuffle(J)        
        for k in range(9742):
            j = J[k]
            u_j = np.dot(W[j],y)    
            y_last = y[j]
            y[j] = 1 if u_j >= 0 else -1
            if y[j] != y_last:
                stable = False

                        
        times += 1
        # end while

    for i in range(9742):
        if x_prime[i] != y[i]:
            num_incorrect += 1
            recall_correct[v] = False
    incorrect_counts.append(num_incorrect)
print('Accuracy based on number of features: ', 1 - np.sum(incorrect_counts) / 9742 / len(train_set_indices))
print(f'All or nothing accuracy: {np.sum(recall_correct) / len(train_set_indices)}')

Accuracy based on number of features:  0.9863280253032783
All or nothing accuracy: 0.10707803992740472


**4. Determine your network’s accuracy on the test set. How well is the network performing?**

In [5]:
testset_incorrect_counts = []
network_recalls = []   
for v in test_set_indices:
    x_prime = dataset[v]
    y = np.array([i for i in x_prime])

    stable = False
    J = np.arange(0, 9742)

    times = 0
    while stable == False:
        stable = True
        np.random.shuffle(J)
        for k in range(9742):
            j = J[k]
            u_j = np.dot(W[j],y)
            y_last = y[j]
            y[j] = 1 if u_j >= 0 else -1
            if y[j] != y_last:
                stable = False
            
        times += 1

    num_incorrect = 0
    for i in range(9742):
        if x_prime[i] != y[i]:
            num_incorrect += 1
    testset_incorrect_counts.append(num_incorrect)
    network_recalls.append((num_incorrect, v, y))

print('accuracy: ', 1 - np.sum(testset_incorrect_counts) / 9742 / len(test_set_indices))

accuracy:  0.9854935296758053


**5. Choose a few higher-performing examples from the test set, then use movies.csv to determine which movies those users liked.
Which other movies did the network predict that those users might like? Do the recommendations seem reasonable?**

In [6]:
dataset_observed = np.zeros((610, 9742,), dtype=int)
dataset_random = np.random.randint(0,2, (610,9742)) * 2 - 1
print('dataset: ', dataset_observed[1,2])
curr_user_id = ratings['userId'][0] # curr_user + 1 = userId

movieIds = ratings['movieId']
u_movie_ids, movie_rindices, occurrence_count = np.unique(movieIds, 
                                        return_counts=True, 
                                        return_inverse=True,)

user_ids = ratings['userId']
u_user_ids, user_indices = np.unique(userIds, return_index=True,)

user_ratings = ratings['rating']
for i in range(100836): # 100836 ratings
    new_user_id = user_ids[i]
    if new_user_id != curr_user_id:
        curr_user_id = user_ids[i] 
    if user_ratings[i] >= 3.0:
        u_movie_rindex = movie_rindices[i]
        dataset_observed[curr_user_id-1][u_movie_rindex] = 1
        dataset_random[curr_user_id-1][u_movie_rindex] = 1
    else: 
        dataset_observed[curr_user_id-1][u_movie_rindex] = -1
        dataset_random[curr_user_id-1][u_movie_rindex] = -1

print('dataset: ', dataset_random[:5, 50:70])


dataset:  0
dataset:  [[ 1  1  1  1  1  1 -1  1 -1  1  1 -1  1  1 -1 -1  1  1 -1 -1]
 [-1  1  1  1 -1 -1  1 -1  1  1 -1 -1 -1 -1  1  1  1  1 -1  1]
 [-1 -1 -1 -1  1  1 -1 -1  1  1 -1 -1 -1  1  1  1  1 -1  1 -1]
 [-1  1  1 -1  1 -1 -1  1 -1  1 -1  1  1  1  1  1 -1 -1  1 -1]
 [-1 -1  1  1 -1 -1 -1  1  1  1 -1 -1  1 -1 -1 -1 -1  1  1 -1]]


In [7]:
testset_incorrect_counts = []
network_recalls = []
testset_correct_counts = []
for v in test_set_indices:
    x_prime = dataset_random[v]
    y = np.array([i for i in x_prime])

    stable = False
    J = np.arange(0, 9742)

    times = 0
    while stable == False:
        stable = True
        np.random.shuffle(J)
        for k in range(9742):
            j = J[k]
            u_j = np.dot(W[j],y)
            y_last = y[j]
            if not bool(dataset_observed[v][j]):    
                y[j] = 1 if u_j >= 0 else -1
            if y[j] != y_last:
                stable = False
            
        times += 1

    num_incorrect = 0
    num_correct = 0
    x = dataset[v]
    x_observed = dataset_observed[v]
    for i in range(9742):
        if (x_observed[i] == 1 and y[i] == 1) or (x_observed[i] == -1 and y[i] == -1):
            num_correct += 1
    testset_incorrect_counts.append(num_incorrect)
    network_recalls.append((num_incorrect, v, y))
    testset_correct_counts.append(num_correct)

print('Done testing!')
print(f'Testset incorrect count: {testset_correct_counts}')

Done testing!
Testset incorrect count: [40, 130, 22, 354, 44, 173, 496, 94, 62, 182, 35, 350, 403, 24, 40, 15, 134, 137, 27, 261, 21, 32, 37, 90, 232, 366, 22, 30, 42, 15, 65, 57, 156, 223, 113, 20, 76, 49, 57, 81, 15, 12, 603, 248, 20, 158, 19, 49, 138, 486, 398, 163, 276, 69, 77, 45, 19, 180, 576]


In [8]:
best_indices = []
for val in np.unique(testset_incorrect_counts)[0:5]:
    best_indices += list(np.where(testset_incorrect_counts == val)[0])
print('user_ids in test_set where network has good performance: ', test_set_indices[best_indices])
print()
for i in best_indices[:1]:
    print('user id - 1: ', test_set_indices[i], '| incorrect_count:', testset_incorrect_counts[i])
    print(f'Index of user in network_recalls: {network_recalls[i][1]}')
    user_rated = np.where(dataset[i] == 1)
    print(f'Length of user_rated: {len(user_rated[0])}')
        
    print('indices of movies where user rated good:', user_rated[0])
    network_rec = network_recalls[i][2]
    network_rated = np.where(network_rec == 1)
    print('Number of movies network recommends: ', len(network_rated[0]))
    print('indices of movies where network rated good:', network_rated[0])

    
    user_index = network_recalls[i][1] 
    user_rated = dataset[user_index]
    network_rated = network_recalls[i][2]

    count = 0
    for i in range(9742):
        if user_rated[i] == network_rated[i]:
            count += 1
    print(f'Number of similarities: {count}')
    print()

user_ids in test_set where network has good performance:  [  4  51  59  61  68  73 110 114 164 165 175 218 225 235 237 244 246 264
 276 289 298 299 315 321 324 356 362 378 397 405 416 426 427 431 436 438
 446 449 457 459 466 477 482 483 517 526 530 531 551 554 560 569 572 578
 583 587 594 604 607]

user id - 1:  4 | incorrect_count: 0
Index of user in network_recalls: 4
Length of user_rated: 226
indices of movies where user rated good: [   0    2    5   43   46   62   89   97  124  130  136  184  190  197
  201  224  257  275  291  307  314  320  325  367  384  398  418  436
  461  476  484  485  508  509  510  513  520  546  551  559  592  594
  615  632  701  705  720  723  734  767  781  782  783  786  787  788
  789  797  801  810  815  819  827  828  830  835  855  862  897  898
  899  906  908  910  913  914  920  922  924  926  938  954  956  963
  968  973  976  980  989  995 1035 1059 1075 1083 1109 1125 1145 1153
 1170 1180 1182 1189 1217 1219 1223 1234 1260 1297 1318 1325 13