<img src="https://www.sturgischarterschool.com/wp-content/uploads/2019/06/sturgisheader_logo.png" alt="sturgis" width="250" align="right"/>

## Computer Science 'May I Recommend PART ONE'
### Sturgis Charter Public School 



Student: [your name here]

Collaborators: [N/A]

Notes to the teacher: [N/A]

<img src="rp.jpeg" alt="sturgis" width="2500" align="center"/>

### Learning Objectives for notebook 14 & 15 
Part I
* Pandas-Data Visualization
* Normalization
* Feature Selection

Part II
* Matrix Operations
* Mean Square Error
* Gradient Descent
* Matrix Factorization

![pandafail](pandafail.jpeg)

### Narrative

This notebook is so big that it's being broken into two notebooks. We're going to do something pretty cool here, but it's got a bunch of moving parts. One of the key aspects of this notebook is that we need to be able to visualize our data. Our long-term goal is to build a recommender system, and don't worry: I'll guide you through this. So long as you pay attention in class, you should be able to follow along. 

#### Pandas & Data Visualization

Some of the tools that we are going to need from pandas include the following. Here is the holistic [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.gt.html).

* Get slices of columns and/or rows: [df.loc[VARIOUS FORMATS]](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html)
* Join two or more tables: [df.join()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.join.html) OR [df.merge()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html)
* Sort the table by the values within a particular column: [df.sort_values()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html)
* Transform a dataframe into a dictionary or list: [df.to_dict()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_dict.html)
* Bring our output into string format: [df.to_string()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_string.html)
* See the shape of a df: [df.shape](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shape.html)
* Modify the value at a particular index and column: [df.at[INDEX, 'COLUMN']](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.at.html)
* Transpose the data from row/column to column/row: [df.T or df.transpose()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.transpose.html)
* Drop rows that have a Not a Number (NaN) value: [df.dropna()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html)
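
To see how these tools fit together, here is a minimal sketch on two tiny tables. The tables, column names, and values are invented for illustration and are not the notebook's CSV files.

```python
import pandas as pd

# Toy tables invented for illustration; they are not the notebook's data.
ratings = pd.DataFrame({
    "user": ["a", "b", "c", "d"],
    "isbn": ["111", "222", "111", "333"],
    "rating": [7, 10, None, 4],
})
books = pd.DataFrame({
    "isbn": ["111", "222", "333"],
    "title": ["Alpha", "Beta", "Gamma"],
})

print(ratings.shape)                                         # (rows, columns)
clean = ratings.dropna()                                     # drop rows containing NaN
high = clean.loc[clean["rating"] >= 7, ["user", "rating"]]   # slice rows and columns
merged = clean.merge(books, on="isbn", how="inner")          # join on a shared key
ranked = merged.sort_values("rating", ascending=False)       # sort by one column
ranked.at[0, "rating"] = 8                                   # change one cell by index label and column
records = ranked.to_dict(orient="records")                   # list of row dictionaries
transposed = ranked.T                                        # rows become columns
print(ranked.to_string(index=False))                         # plain-text rendering
```

Each call returns a new object except `df.at`, which changes the dataframe in place.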

#### Feature Selection

Next we need to consider what features we are going to use. Remember that we want a system that will be able to compute a recommendation. Here we should pause and consider: do we want qualitative or quantitative data? How would we compute with either? To answer that, we need to keep in mind two different kinds of values: discrete values and continuous values. Consider this [article](https://articles.outlier.org/discrete-vs-continuous-variables), which has some very helpful examples.

A discrete value is something that can be counted. 

A continuous value is something that must be measured. 

Considering this, we need to identify some features, and while you are doing this, I want you to consider the following: Why might `Age` be an especially unhelpful 'feature' for a recommender system?

If we can't use `Age`, then what can we use? 

Consider the following: is a rating a discrete or a continuous value? Is there a way that we can measure the distance between two users? What if we didn't treat all users as equal? What do we do with missing values, e.g. `NaN`?

#### Normalization

Normalization is well explained in the following article: [Why Data Normalization is Necessary for Machine Learning Models](https://medium.com/@urvashilluniya/why-data-normalization-is-necessary-for-machine-learning-models-681b65a05029). In the introduction it states, "Normalization is a technique often applied as part of data preparation for machine learning. The goal of normalization is to change the values of numeric columns in the dataset to a common scale, without distorting differences in the ranges of values." (Jaitley, 2018). But let's think of a simple example. Imagine that we have the following data:

| usr  | hours  | rating 0-10 |
|---|---|---|
| a  | 27  | 7  |
| b  | 3  |  10 |
| c  | 500  | 9  |
| d  | 43  |  7 |
| e  |  127 |  10 |

Now imagine that we wanted to find a relationship between the hours a person plays a game and the rating. What you might notice is that the range for hours goes from 3 to 500 (and could perhaps go even further). You'll also note that the rating is locked in at 0-10. What will happen if we try to relate these two values? Well, the scales of the two are so radically different that it's impossible to get reasonable ratios. If, however, we normalize, we can end up with a table that looks like this. For the moment we will assume that both minimums are `0`, that the maximum for hours is `500`, and that the maximum for rating is `10`. Let's transform this data with just a bit of simple math. 

| usr  | hours 0-1 | rating 0-1 |
|---|---|---|
| a  | .054  | .7  |
| b  | .006  |  1 |
| c  | 1  | .9  |
| d  | .086  |  .7 |
| e  |  .254 |  1 |

Now, of course, this is a simple example, but it actually can be quite necessary in order for the numbers to be able to play together in an appropriate way. 0 to 1 is a common convention. What our normalized data reveals here is that there is in fact NOT a relationship between play time and rating. Can you explain why?
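
The "simple math" can be sketched in a few lines of pandas. This is a minimal sketch that assumes the same bounds as the example (minimum `0`, maximum `500` for hours and `10` for rating); the helper name `min_max` is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "usr": ["a", "b", "c", "d", "e"],
    "hours": [27, 3, 500, 43, 127],
    "rating": [7, 10, 9, 7, 10],
})

def min_max(series, lo, hi):
    """Rescale values from the range [lo, hi] onto [0, 1]."""
    return (series - lo) / (hi - lo)

# Assumed bounds: hours run 0-500, ratings run 0-10.
normalized = df.assign(
    hours=min_max(df["hours"], 0, 500),
    rating=min_max(df["rating"], 0, 10),
)
print(normalized.round(3))
```

Passing the bounds in explicitly keeps the scale fixed even if a new user shows up with more hours than anyone in the current table, although such a value would then land above 1.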

In [1]:
import pandas as pd
import numpy as np
import warnings
# https://docs.python.org/3/library/warnings.html

In [22]:
def fxn():
    warnings.warn("deprecated", DeprecationWarning)

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    bk = pd.read_csv('data/Books.csv')
    us = pd.read_csv('data/Users.csv')
    rt = pd.read_csv('data/Ratings.csv')

In [11]:
rt

Unnamed: 0,User-ID,ISBN,Book-Rating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6
...,...,...,...
1149775,276704,1563526298,9
1149776,276706,0679447156,0
1149777,276709,0515107662,10
1149778,276721,0590442449,10


### Question 1: Manipulating Dataframes

So, for our first step we are going to use the loaded dataframes, to manipulate the data. You need to take the above dataframes and end up with two new dataframes. 

Dataframe 1 is going to be an Age table. What I want is 5 to 10 columns of age brackets, and in each of those age brackets, I want a count of how many users inhabit that bracket. This should be a small table of just one row, and nothing more.

Dataframe 2 is going to be a Review table, in which we have the User-ID, the Book-Rating, and the Book Title. 
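
One possible way to count users per bracket is `pd.cut`. This is only a sketch on invented ages, and the bin edges are an assumption: here each bracket includes both of its listed endpoint ages (so 14 lands in "0-14" and 15 lands in "15-29"). Counts from a different bracketing rule will differ.

```python
import pandas as pd

# Invented ages for illustration; the real values live in the Users table.
ages = pd.Series([5, 14, 15, 22, 29, 30, 41, 67, None])

# Assumed edges: with right=False each bin includes its left edge and excludes its right edge.
edges = [0, 15, 30, 45, 60, 80, 121]
labels = ["0-14", "15-29", "30-44", "45-59", "60-79", "80-120"]

brackets = pd.cut(ages.dropna(), bins=edges, labels=labels, right=False)

# Count users per bracket, keeping bracket order and showing empty brackets as 0.
counts = brackets.astype(str).value_counts().reindex(labels, fill_value=0)

# One row, one column per bracket.
age_table = counts.to_frame().T
print(age_table)
```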


In [7]:
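# Table 1
# Make sure that your final df is called 'df1' 
# Each tuple is an age bracket (low, high).
ranges = [(0,14), (15,29), (30,44),(45,59),(60,79),(80,120)]
df1 = us.dropna()  # drops every row with a NaN in any column, not only in 'Age'

ages = {}
for r in ranges:
    # Note: both comparisons are strict, so the endpoint ages themselves (e.g. 14 and 15) are not counted.
    # The assert check further down expects the counts that these strict bounds produce.
    ages[str(r)] = [len(df1.loc[(df1['Age'] > r[0]) & (df1['Age'] < r[1])].index)]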

    print(ages)

df1 = pd.DataFrame.from_dict(ages)
df1

{'(0, 14)': [1935]}
{'(0, 14)': [1935], '(15, 29)': [59513]}
{'(0, 14)': [1935], '(15, 29)': [59513], '(30, 44)': [48746]}
{'(0, 14)': [1935], '(15, 29)': [59513], '(30, 44)': [48746], '(45, 59)': [27408]}
{'(0, 14)': [1935], '(15, 29)': [59513], '(30, 44)': [48746], '(45, 59)': [27408], '(60, 79)': [7373]}
{'(0, 14)': [1935], '(15, 29)': [59513], '(30, 44)': [48746], '(45, 59)': [27408], '(60, 79)': [7373], '(80, 120)': [508]}


Unnamed: 0,"(0, 14)","(15, 29)","(30, 44)","(45, 59)","(60, 79)","(80, 120)"
0,1935,59513,48746,27408,7373,508


In [16]:
df2 = rt.join(bk.set_index('ISBN'), on='ISBN')
df2 = df2.drop(columns='ISBN')
df2 = df2.iloc[:,[2,0,1]]
df2

Unnamed: 0,Book-Title,User-ID,Book-Rating
0,Flesh Tones: A Novel,276725,0
1,Rites of Passage,276726,5
2,The Notebook,276727,0
3,Help!: Level 1,276729,3
4,The Amsterdam Connection : Level 4 (Cambridge ...,276729,6
...,...,...,...
1149775,Get Clark Smart : The Ultimate Guide for the S...,276704,9
1149776,Eight Weeks to Optimum Health: A Proven Progra...,276706,0
1149777,The Sherbrooke Bride (Bride Trilogy (Paperback)),276709,10
1149778,Fourth Grade Rats,276721,10


In [18]:
# Table 2
# Make sure that your final df is called 'df2'
#bk = bk.drop(columns='Book-Title')
bk = bk.iloc[:,0:2]
df2 = rt.merge(bk, how='inner', on='ISBN')
df2 = df2.drop(columns='ISBN')
df2

Unnamed: 0,User-ID,Book-Rating,Book-Title
0,276725,0,Flesh Tones: A Novel
1,2313,5,Flesh Tones: A Novel
2,6543,0,Flesh Tones: A Novel
3,8680,5,Flesh Tones: A Novel
4,10314,9,Flesh Tones: A Novel
...,...,...,...
1031131,276688,0,Mostly Harmless
1031132,276688,7,Gray Matter
1031133,276690,0,Triplet Trouble and the Class Trip (Triplet Tr...
1031134,276704,0,A Desert of Pure Feeling (Vintage Contemporaries)


In [19]:
bk2 = bk.iloc[:,0:2]
bk3 = bk2.join(rt.set_index('ISBN'), on='ISBN')
#df.join(other.set_index('key'), on='key')
df2 = bk3.drop(columns='ISBN')
df2

Unnamed: 0,Book-Title,User-ID,Book-Rating
0,Classical Mythology,2.0,0.0
1,Clara Callan,8.0,5.0
1,Clara Callan,11400.0,0.0
1,Clara Callan,11676.0,8.0
1,Clara Callan,41385.0,0.0
...,...,...,...
271355,There's a Bat in Bunk Five,276463.0,7.0
271356,From One to One Hundred,276579.0,4.0
271357,Lily Dale : The True Story of the Town that Ta...,276680.0,0.0
271358,Republic (World's Classics),276680.0,0.0


In [20]:
# Check Table 1
assert df1.iloc[0, 3] == 27408 # Checking, does your count of 45 to 59 year olds match 27408?
# Check Table 2
assert df2.iloc[527, 0] == 'Beloved (Plume Contemporary Fiction)' #Is the book title in the row at position 527 this?

#It's possible that you might get the correct answer, but somehow shuffle the order. 
#In such a case you won't pass the assert check, but still have completed the question. Check with the teacher.

### Question 2: Feature Selection

Create features that can be related to **both** the users and the items. There is more than one way that this can be done. You may choose to either show your answer in a table format or in a dictionary format. However, we should be able to take this and apply the features to any user and to any item. 

This is an open-ended question. We will discuss in class, but you might find another approach.
Just make sure you're prepared to explain your selection. 
This is a fairly large question, one of those cases where it takes a fair amount of thinking but not too much coding. :D

In [27]:
bk.head()

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...


In [25]:
rt_final = rt[rt['Book-Rating'] != 0]

In [26]:
rt_final

Unnamed: 0,User-ID,ISBN,Book-Rating
1,276726,0155061224,5
3,276729,052165615X,3
4,276729,0521795028,6
6,276736,3257224281,8
7,276737,0600570967,6
...,...,...,...
1149773,276704,0806917695,5
1149775,276704,1563526298,9
1149777,276709,0515107662,10
1149778,276721,0590442449,10


In [29]:
#step 1 - Identify super users
#Create a list of (user, review count) tuples, sorted from fewest to most reviews. This is similar to the age df above.
#We are interested in the most prolific users, which end up at the end of the list. 

def getreviewcount(df):
    dfd = df.to_dict('split')
    unique_user = set()
    for review in dfd['data']:
        unique_user.add(review[0])
    unique_user = list(unique_user)
    
    u_counts = {}
    for user in unique_user:
        u_counts[user] = len(df.loc[df['User-ID'] == user].index)
    
    sortable = [(k,v) for k, v in u_counts.items()]
    sortable.sort(key = lambda x:x[1])
    return sortable

In [30]:
allusers = getreviewcount(rt_final)

In [33]:
print(allusers[-100:-50])

[(6575, 237), (200226, 237), (129716, 239), (160541, 241), (174304, 241), (217740, 244), (157247, 246), (190925, 252), (179978, 254), (75591, 254), (241198, 254), (254899, 256), (37950, 258), (81560, 261), (30511, 264), (156150, 265), (89602, 267), (95902, 268), (43246, 269), (88677, 270), (236283, 270), (69697, 275), (270713, 279), (229329, 280), (94853, 281), (209516, 282), (168245, 284), (264321, 285), (112001, 286), (147847, 289), (79441, 290), (177432, 291), (225087, 293), (110973, 293), (31556, 300), (38273, 312), (39467, 312), (30276, 318), (46398, 320), (162639, 323), (7346, 324), (225232, 324), (242006, 324), (16634, 327), (230522, 328), (94347, 330), (31315, 333), (107951, 333), (25981, 337), (135265, 345)]


### Making a Comparison Metric

In order to do anything, we need a way to find out how similar or different users are from each other. In the code below we have a simple function which first finds how many reviews overlap, and then how similar those overlapping reviews are. It doesn't actually matter if they are similar and positive, or similar and negative, because ultimately we are interested in finding similarity between individual users and groups of users. This will come out of the data either way. 
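
Before looking at the dataframe version, the scoring idea can be tried on plain lists. This sketch uses the same formula as the `comparison` function in this notebook (number of shared books divided by half of the summed distances plus 2), but the name `similarity` and the inputs are made up for illustration.

```python
import math

def similarity(ratings_a, ratings_b):
    """Score two users from their ratings of the same books, given in matching order."""
    # Per-book distance: square root of the absolute rating difference.
    distances = [math.sqrt(abs(a - b)) for a, b in zip(ratings_a, ratings_b)]
    overlap = len(distances)
    # More shared books raise the score; larger disagreements lower it.
    return overlap / ((sum(distances) + 2) * 0.5)

print(similarity([9], [10]))             # one shared book, ratings one point apart
print(similarity([8, 8, 8], [8, 8, 8]))  # three shared books, full agreement
print(similarity([], []))                # no shared books
```

The `+ 2` in the denominator keeps the score finite when two users agree perfectly, and an empty overlap scores 0.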

In [34]:
#Step 2. Find a way to compare two users

# User comparison Function
#select a user, select another user, compare the two users
def simpledist(x):
    return np.sqrt(abs(x.iloc[0] - x.iloc[1]))

def comparison(df, us1, us2):
    dfa = df.loc[df['User-ID'] == us1]
    dfb = df.loc[df['User-ID'] == us2]
    final = dfa.merge(dfb, left_on=["ISBN"], right_on=["ISBN"], how='inner')
    final = final.iloc[:,[2,4]]
    final = final.apply(simpledist, axis=1, result_type='expand')
    final = final.values.tolist()
    return len(final) * (1/((sum(final)+2)*.5))


In [35]:
check = comparison(rt_final, 11676, 98391)
print(check)

1.8397374090046978


### For Demonstration Purposes

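Let's take a look at what's happening in that distance function. Note that it is not a textbook Euclidean distance: `simpledist` takes the square root of the absolute difference between two ratings.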

In [36]:
rt_final.loc[rt_final['User-ID'] == 171118]

Unnamed: 0,User-ID,ISBN,Book-Rating
703628,171118,0000913154,8
703641,171118,0006479502,10
703645,171118,0006547486,7
703650,171118,0020248717,8
703651,171118,0020259913,7
...,...,...,...
706031,171118,3892680132,8
706032,171118,5552552660,8
706040,171118,B00005VBCP,7
706041,171118,B00005W4U4,8


In [41]:
check = comparison(rt_final, 171118, 177458)
print(check)
dfta = rt_final.loc[rt_final['User-ID'] == 171118]
dftb = rt_final.loc[rt_final['User-ID'] == 177458]
rt1 = dfta.merge(dftb, left_on=["ISBN"], right_on=["ISBN"], how='inner')
rt1

2.496587941497462


Unnamed: 0,User-ID_x,ISBN,Book-Rating_x,User-ID_y,Book-Rating_y
0,171118,312876939,7,177458,8
1,171118,316107549,8,177458,8
2,171118,441003745,8,177458,6
3,171118,441005241,8,177458,7
4,171118,756401364,7,177458,8
5,171118,836218353,7,177458,8
6,171118,836218620,8,177458,8
7,171118,836218663,8,177458,8
8,171118,836218787,8,177458,8
9,171118,836218833,8,177458,8


In [38]:
check = comparison(rt_final, 171118, 63714)
print(check)
dfta = rt_final.loc[rt_final['User-ID'] == 171118]
dftb = rt_final.loc[rt_final['User-ID'] == 63714]
rt1 = dfta.merge(dftb, left_on=["ISBN"], right_on=["ISBN"], how='inner')
rt1

0.6666666666666666


Unnamed: 0,User-ID_x,ISBN,Book-Rating_x,User-ID_y,Book-Rating_y
0,171118,563208449,9,63714,10


In [39]:
check = comparison(rt_final, 171118, 114368)
print(check)
dfta = rt_final.loc[rt_final['User-ID'] == 171118]
dftb = rt_final.loc[rt_final['User-ID'] == 114368]
rt1 = dfta.merge(dftb, left_on=["ISBN"], right_on=["ISBN"], how='inner')
rt1

0.0


Unnamed: 0,User-ID_x,ISBN,Book-Rating_x,User-ID_y,Book-Rating_y


In [42]:
#step 2 - Find similar users to superusers
#Now let's imagine we created a feature called superuser 1
#How do we find the relationship between any given user and that superuser1?
#Well the answer is in the same comparison we created above.


#Note that the difference between 8 and 6 yields 1.414 rather than 2.
#The reason is that simpledist takes the square root: sqrt(|8 - 6|) = sqrt(2), about 1.414.
#The final score in comparison() is separately sensitive to how many reviews overlap. Pretty neat!

tester = rt1
tester = tester.iloc[:,[2,4]]
t2 = tester.apply(simpledist, axis=1, result_type='expand')
tester = pd.concat([tester, t2], axis=1, join='inner')
tester

Unnamed: 0,Book-Rating_x,Book-Rating_y,0
0,7,8,1.0
1,8,8,0.0
2,8,6,1.414214
3,8,7,1.0
4,7,8,1.0
5,7,8,1.0
6,8,8,0.0
7,8,8,0.0
8,8,8,0.0
9,8,8,0.0


### Which users are the best to use?

Ultimately what we want are users that have a lot of reviews that are distinct. These users will act as a 'seed' from which we can attach similar users. In this way we can end up with groups of users that represent distinct elements from each other. 
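
Picking seeds could also be automated. Here is a rough sketch of one greedy approach, assuming a dictionary of pairwise scores shaped like the `allcomp` dictionary built in this notebook (user mapped to a list of `(other_user, score)` tuples). The function name `pick_seeds`, the threshold `max_score`, and the toy scores are all hypothetical.

```python
def pick_seeds(pairwise, k, max_score=1.0):
    """Greedily pick up to k users whose scores against every chosen seed stay at or below max_score.

    pairwise maps user -> list of (other_user, score) tuples.
    """
    lookup = {user: dict(pairs) for user, pairs in pairwise.items()}
    seeds = []
    for user in lookup:
        # Keep the user only if it is dissimilar to every seed chosen so far.
        if all(lookup[user].get(seed, 0.0) <= max_score for seed in seeds):
            seeds.append(user)
        if len(seeds) == k:
            break
    return seeds

# Toy scores invented for illustration.
toy = {
    1: [(2, 2.5), (3, 0.0), (4, 0.4)],
    2: [(1, 2.5), (3, 0.2), (4, 1.9)],
    3: [(1, 0.0), (2, 0.2), (4, 0.0)],
    4: [(1, 0.4), (2, 1.9), (3, 0.0)],
}
print(pick_seeds(toy, k=3))
```

Keep in mind that a low score can also mean two users simply share no books, so a result like this is a starting point to inspect by hand rather than a final answer.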

In [43]:
#We can use this code to only select users that have a certain amount of reviews. 
s2 = [x for x in allusers if x[1] > 300]
s2 = s2[-20:]
allcomp = {}
for user in s2:
    allcomp[user[0]] = []
    for u2 in s2:
        if user == u2: continue
        allcomp[user[0]].append((u2[0], comparison(rt_final, int(user[0]), int(u2[0]))))

In [45]:
find = allcomp[171118]
find.sort(key=lambda x:x[1], reverse = True)
print(find)

[(177458, 2.496587941497462), (235105, 2.0624458405139237), (11676, 1.6977526853445328), (16795, 1.6148908192542457), (248718, 1.5300968740935361), (35859, 1.4838216699867535), (197659, 1.4597505813542533), (101851, 1.3744005332413054), (95359, 1.3565733101699384), (98391, 1.2426406871192852), (158295, 1.17157287525381), (56399, 1.1681590769892485), (23902, 1.0717967697244908), (76499, 1.0618406966509362), (185233, 1.0490617222276717), (153662, 0.9543160631771913), (189835, 0.9291126548932547), (63714, 0.6666666666666666), (114368, 0.0)]


In [46]:
find = allcomp[114368]
find.sort(key=lambda x:x[1], reverse = True)
print(find)

[(185233, 10.0), (189835, 1.6091640342100229), (197659, 1.3861848258221114), (11676, 1.383326612861554), (153662, 1.3748787149949815), (16795, 1.2597650904070532), (35859, 1.2543159677553026), (235105, 1.2466415899920753), (158295, 1.1933666617461582), (98391, 1.16591818322487), (95359, 1.0580318924156866), (101851, 0.9708207136426578), (177458, 0.585786437626905), (23902, 0.585786437626905), (248718, 0.5358983848622454), (63714, 0.0), (56399, 0.0), (171118, 0.0), (76499, 0.0)]


In [None]:
ml = list(allcomp.keys())
print(ml)

#For the time being I'm just going to find users based on a manual search. I'd love to program this up. 
SUbase = [171118, 114368, 76499, 56399, 98391, 158295, 63714,23902]

### Now let's group users together and create: A SUPERUSER!!!

We will need to loop this function for each of our SUbase from above. 

In [57]:
def createSU(df, allusers, target):
    superuser = []
    for user in allusers[-3000:]:
        score = comparison(df, int(user[0]), target)
        #print("The current score is", score, " for user ", user)
        if score > 1.2: #This is our similarity threshold. We can play with this for concentration or impact.
            print(user)
            superuser.append(int(user[0]))
    return superuser

In [None]:
#Let's code this, so it goes through each of our SU's
myfeatures = {}
for SU in SUbase:
    myfeatures[SU] = createSU(rt_final, allusers, SU)

In [None]:
print(myfeatures)

In [58]:
#This is to exemplify. If we want to just create a single user. 
mySU = createSU(rt_final, allusers, 171118)

(187747, 25)
(128325, 26)
(264634, 28)
(38464, 29)
(62172, 29)
(70065, 30)
(157163, 31)
(7958, 32)
(249407, 33)
(26240, 34)
(62558, 34)
(113904, 34)
(273344, 35)
(246759, 35)
(131154, 37)
(60427, 37)
(61910, 37)
(127203, 37)
(182053, 38)
(68128, 38)
(259057, 38)
(123744, 39)
(125736, 39)
(38660, 41)
(43006, 41)
(264862, 42)
(41831, 42)
(3923, 43)
(113334, 44)
(271176, 45)
(11724, 45)
(105221, 45)
(274808, 48)
(42093, 49)
(150663, 51)
(21404, 51)
(277203, 52)
(2891, 53)
(195694, 53)
(201526, 53)
(108799, 53)
(81977, 54)
(28594, 55)
(196108, 55)
(70052, 55)
(100578, 55)
(159858, 56)
(224997, 56)
(264152, 58)
(266753, 58)
(267033, 58)
(87938, 58)
(136491, 59)
(53729, 59)
(138198, 60)
(247129, 60)
(150896, 61)
(190885, 61)
(196148, 62)
(257198, 62)
(20201, 63)
(241204, 63)
(92498, 64)
(228727, 67)
(235392, 68)
(3827, 70)
(55548, 70)
(124597, 70)
(82407, 71)
(136205, 72)
(28204, 72)
(48355, 72)
(68436, 72)
(211344, 72)
(14456, 73)
(207494, 73)
(104665, 74)
(126814, 75)
(174848, 76)
(106225,

In [59]:
print(mySU)

[187747, 128325, 264634, 38464, 62172, 70065, 157163, 7958, 249407, 26240, 62558, 113904, 273344, 246759, 131154, 60427, 61910, 127203, 182053, 68128, 259057, 123744, 125736, 38660, 43006, 264862, 41831, 3923, 113334, 271176, 11724, 105221, 274808, 42093, 150663, 21404, 277203, 2891, 195694, 201526, 108799, 81977, 28594, 196108, 70052, 100578, 159858, 224997, 264152, 266753, 267033, 87938, 136491, 53729, 138198, 247129, 150896, 190885, 196148, 257198, 20201, 241204, 92498, 228727, 235392, 3827, 55548, 124597, 82407, 136205, 28204, 48355, 68436, 211344, 14456, 207494, 104665, 126814, 174848, 106225, 19711, 123056, 24194, 183046, 193499, 133571, 271195, 91184, 102154, 209160, 21576, 69971, 124876, 10819, 257028, 263163, 231237, 44893, 234721, 39616, 178199, 36299, 169682, 211919, 147451, 27647, 174892, 229741, 15418, 102967, 211426, 265313, 203240, 252820, 136382, 37644, 189139, 91113, 16966, 73681, 100459, 258152, 268110, 166123, 247447, 157273, 226965, 32721, 240567, 102647, 184299, 15

In [60]:
SU1 = rt_final[rt_final['User-ID'].isin(mySU)]

In [61]:
SU1

Unnamed: 0,User-ID,ISBN,Book-Rating
1202,277203,030700645X,8
1203,277203,0307127923,8
1204,277203,0307302016,8
1205,277203,0307302636,8
1206,277203,0307987655,8
...,...,...,...
1143552,274808,0689806574,7
1143554,274808,0689831285,6
1143555,274808,0689835906,8
1143565,274808,0860681297,8


In [62]:
#Step 3 - Create a relationship between every superuser category and every user, and every item. 
#Here we need to create a similarity score between a single user and a superuser group.

#Just need to build this last piece. 
def usefeature(df, us1, su):
    dfa = df.loc[df['User-ID'] == us1]
    dfb = df[df['User-ID'].isin(su)]
    final = dfa.merge(dfb, left_on=["ISBN"], right_on=["ISBN"], how='inner')
    final = final.iloc[:,[2,4]]
    final = final.apply(simpledist, axis=1, result_type='expand')
    final = final.values.tolist()
    return len(final) * (1/((sum(final)+2)*.5))

In [63]:
test = usefeature(rt_final, 55548, mySU)
test

2.971926027460331

In [64]:
feat1scores = {}
for user in allusers[-1000:-200]:
    feat1scores[user[0]] = usefeature(rt_final, user[0], mySU)

In [65]:
feat1scores

{246823: 0.7772628284242425,
 250300: 1.4721398186935946,
 16106: 1.8040989673141687,
 19664: 1.804901103578031,
 196148: 3.785661298769321,
 206016: 1.8144290748799827,
 206202: 1.7541728844293,
 80945: 2.0658178451008693,
 82893: 1.8171195459631706,
 215988: 1.359779345144293,
 85426: 1.7977137421800835,
 93179: 1.5949847220915805,
 95903: 1.7742013890005592,
 108352: 1.9547684424970444,
 243929: 1.471950213044654,
 246156: 1.6979336813805121,
 257198: 2.695071106940101,
 259260: 1.716088263420847,
 128696: 2.183850882886253,
 135045: 2.2251199222898204,
 16161: 1.6979133029699438,
 20201: 2.6695996706511087,
 155219: 1.5858402662941105,
 28177: 1.765006020874992,
 43021: 1.563700914138001,
 51992: 2.1094868362095465,
 185308: 2.027093510072339,
 63854: 1.737619176641894,
 203910: 1.5657949688528754,
 76352: 1.8340583312621204,
 107645: 0.6821627548042178,
 238781: 1.4697532844220458,
 241204: 3.254206308612984,
 111578: 1.635582656007532,
 249958: 2.113676405610607,
 256167: 1.99622

### Question 3: Normalization

Now that we have features, I want you to analyze whether or not the features are normalized. If they are normalized, then please explain why they are normalized values. Additionally, explain whether you are capturing discrete values or continuous values. 

If they are not normalized, then please apply some process to normalize them. Keep in mind that there might be a nifty pandas method that will do just that for you. 
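
As a hint, here is a minimal sketch of min-max scaling on a pandas Series when the bounds are taken from the data itself. The scores below are invented stand-ins for a feature dictionary of the form `{user_id: score}`; they are not results from this notebook.

```python
import pandas as pd

# Invented scores standing in for a feature dictionary such as {user_id: score}.
scores = pd.Series({101: 0.5, 102: 1.5, 103: 3.5, 104: 2.0})

# Min-max scaling: the smallest score maps to 0 and the largest maps to 1.
normalized = (scores - scores.min()) / (scores.max() - scores.min())
print(normalized)
```

This only works when the largest and smallest scores differ; if every score is the same, the denominator is zero.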

In [None]:
#Code if necessary


This is a markdown cell; you can type into it and it won't be confused for code.