<img src="https://www.sturgischarterschool.com/wp-content/uploads/2019/06/sturgisheader_logo.png" alt="sturgis" width="250" align="right"/>

## Computer Science 'May I Recommend PART ONE'
### Sturgis Charter Public School 



Student: [your name here]

Collaborators: [N/A]

Notes to the teacher: [N/A]

### Learning Objectives for notebook 
Part I
* Pandas-Data Visualization
* Normalization
* Feature Selection

Part II
* Matrix Operations
* Mean Square Error
* Gradient Descent
* Matrix Factorization

### Narrative

This notebook is so big that it's being broken into two notebooks. We're going to do something pretty cool here, but it's got a bunch of moving parts. One of the key aspects of this notebook is that we need to be able to visualize our data. Our long-term goal is to build a recommender system, and don't worry: I'll guide you through this. So long as you pay attention in class, you should be able to follow along.

#### Pandas & Data Visualization

Some of the tools that we are going to need from pandas are listed below. Here is the pandas [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.gt.html).

* Get slices of columns and/or rows: [df.loc[VARIOUS FORMATS]](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html)
* Join two or more tables: [df.join()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.join.html) or [df.merge()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html)
* Sort the table by the values within a particular column: [df.sort_values()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html)
* Transform a dataframe into a dictionary or list: [df.to_dict()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_dict.html)
* Bring our output into string format: [df.to_string()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_string.html)
* See the shape of a df: [df.shape](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shape.html)
* Modify a value at a particular index: [df.at[INDEX, 'COLUMN']](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.at.html)
* Transpose the data from row/column to column/row: [df.T or df.transpose()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.transpose.html)
* Drop rows that have a Not a Number (NaN) value: [df.dropna()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html)
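
To see how these tools fit together, here is a minimal sketch on two tiny made-up tables. The column names (`user`, `item`, `rating`, `title`) and the values are assumptions for illustration only; they are not the columns of the CSV files used later in this notebook.

```python
import pandas as pd

# Two tiny, made-up tables (hypothetical data, not the notebook's CSV files)
ratings = pd.DataFrame({
    "user": ["a", "b", "c", "d"],
    "item": [10, 11, 10, 12],
    "rating": [7.0, 10.0, None, 9.0],
})
items = pd.DataFrame({"item": [10, 11, 12], "title": ["Alpha", "Beta", "Gamma"]})

# .loc: keep rows that match a condition and pick columns by name
high = ratings.loc[ratings["rating"] >= 9, ["user", "rating"]]

# .merge: join the two tables on their shared 'item' column
merged = ratings.merge(items, on="item", how="left")

# .dropna, then .sort_values, then a fresh 0..n-1 index
clean = (
    merged.dropna(subset=["rating"])
    .sort_values("rating", ascending=False)
    .reset_index(drop=True)
)

# .at: change a single cell by index label and column name
clean.at[2, "rating"] = 8.0

print(clean.shape)                       # (rows, columns)
print(clean.T)                           # rows and columns swapped
print(clean.to_dict(orient="records"))   # list of one dict per row
print(clean.to_string(index=False))      # plain-text version of the table
```

Each step returns a new dataframe, so you can inspect the intermediate results (`high`, `merged`, `clean`) one at a time.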

#### Feature Selection

Next we need to consider what features we are going to use. Remember that we want a system that will be able to compute a recommendation. Here we should pause and consider: do we want qualitative or quantitative data? How would we compute with either? To answer that, we need to keep in mind two different kinds of values: discrete values and continuous values. Consider this [article](https://articles.outlier.org/discrete-vs-continuous-variables), which has some very helpful examples.

A discrete value is something that can be counted. 

A continuous value is something that must be measured. 

Considering this, we need to identify some features, and while you are doing this, I want you to consider the following: Why might `Age` be an especially unhelpful 'feature' for a recommender system?

If we can't use `Age`, then what can we use?

Consider the following: is a rating a discrete or a continuous value? Is there a way that we can measure the distance between two users? What if we didn't treat all users as equal? What do we do with missing values e.g. `NaN`?
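
As one possible way to think about the distance question, here is a hedged sketch that measures the Euclidean distance between two users using only the items both of them have rated. The users, items, and ratings are made up, and `user_distance` is a hypothetical helper written for this illustration; it is one option among several, not the required answer.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings on a 0-10 scale; NaN means "has not rated this item"
ratings = pd.DataFrame(
    {
        "book_1": [8.0, 6.0, np.nan],
        "book_2": [np.nan, 9.0, 2.0],
        "book_3": [5.0, 5.0, 10.0],
    },
    index=["user_a", "user_b", "user_c"],
)

def user_distance(u, v):
    """Euclidean distance over the items that both users have rated."""
    shared = u.notna() & v.notna()   # True where neither rating is NaN
    if not shared.any():
        return np.nan                # no items in common: distance is undefined
    diff = u[shared] - v[shared]
    return float(np.sqrt((diff ** 2).sum()))

d_ab = user_distance(ratings.loc["user_a"], ratings.loc["user_b"])
d_bc = user_distance(ratings.loc["user_b"], ratings.loc["user_c"])
print(d_ab, d_bc)
```

Skipping the missing ratings is only one way to handle `NaN`; filling them with a default value is another, and the two choices can give different distances.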

#### Normalization

Normalization is well explained in the following article: [Why Data Normalization is Necessary for Machine Learning Models](https://medium.com/@urvashilluniya/why-data-normalization-is-necessary-for-machine-learning-models-681b65a05029). In the introduction it states, "Normalization is a technique often applied as part of data preparation for machine learning. The goal of normalization is to change the values of numeric columns in the dataset to a common scale, without distorting differences in the ranges of values." (Jaitley, 2018). But let's think of a simple example. Imagine that we have the following data:

| usr  | hours  | rating 0-10 |
|---|---|---|
| a  | 27  | 7  |
| b  | 3  |  10 |
| c  | 500  | 9  |
| d  | 43  |  7 |
| e  |  127 |  10 |

Now imagine that we wanted to find a relationship between the hours a person plays a game and the rating they give it. Notice that hours range from 3 to 500 (and could perhaps go even further), while the rating is locked in at 0-10. What will happen if we try to relate these two values? The scales are so radically different that it is hard to get reasonable ratios. If, however, we normalize, we can end up with a table that looks like the one below. For the moment we will assume that both minimums are `0`, that the max for hours is `500`, and that the max for rating is `10`. Let's transform this data with a bit of simple math: divide each value by the max of its column.

| usr  | hours 0-1 | rating 0-1 |
|---|---|---|
| a  | .054  | .7  |
| b  | .006  |  1 |
| c  | 1  | .9  |
| d  | .086  |  .7 |
| e  |  .254 |  1 |

Now, of course, this is a simple example, but it actually can be quite necessary in order for the numbers to be able to play together in an appropriate way. 0 to 1 is a common convention. What our normalized data reveals here is that there is in fact NOT a relationship between play time and rating. Can you explain why?
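
The transformation in the table above can be sketched in a few lines of pandas. This is a minimal illustration using the same assumed bounds (minimum `0`, max `500` for hours, max `10` for rating); the helper `min_max` and the column names `hours_01` and `rating_01` are made up for this example.

```python
import pandas as pd

# The example data from the first table
df = pd.DataFrame({
    "usr": ["a", "b", "c", "d", "e"],
    "hours": [27, 3, 500, 43, 127],
    "rating": [7, 10, 9, 7, 10],
})

def min_max(col, lo, hi):
    """Rescale a column so that lo maps to 0 and hi maps to 1."""
    return (col - lo) / (hi - lo)

norm = pd.DataFrame({
    "usr": df["usr"],
    "hours_01": min_max(df["hours"], 0, 500),    # assumed bounds for hours
    "rating_01": min_max(df["rating"], 0, 10),   # ratings are on a 0-10 scale
})

print(norm.round(3))
```

If the true bounds are not known ahead of time, a common alternative is to use the column's own `min()` and `max()` in place of the assumed values.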

In [1]:
import pandas as pd
import numpy as np
import warnings
# https://docs.python.org/3/library/warnings.html

In [46]:
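# Example function from the warnings docs; it is not called below
def fxn():
    warnings.warn("deprecated", DeprecationWarning)

# Silence any warnings that pandas raises while reading the CSV files
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    bk = pd.read_csv('data/Books.csv')    # books
    us = pd.read_csv('data/Users.csv')    # users
    rt = pd.read_csv('data/Ratings.csv')  # ratings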

In [4]:
bk.head()

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...


In [5]:
us.head()

Unnamed: 0,User-ID,Location,Age
0,1,"nyc, new york, usa",
1,2,"stockton, california, usa",18.0
2,3,"moscow, yukon territory, russia",
3,4,"porto, v.n.gaia, portugal",17.0
4,5,"farnborough, hants, united kingdom",


In [6]:
rt.head()

Unnamed: 0,User-ID,ISBN,Book-Rating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


### Question 1: Manipulating Dataframes

For our first step, we are going to manipulate the data in the loaded dataframes. You need to take the above dataframes and end up with two new dataframes.

Dataframe 1 is going to be an Age table. What I want is 5 to 10 columns of age brackets, and in each of those age brackets, I want a count of how many users fall into that bracket. This should be a small table of just one row.

Dataframe 2 is going to be a Review table, in which we have the User-ID, the Book-Rating, and the Book Title. 
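
For the age-bracket table, one possible approach is `pd.cut`, which sorts numbers into labeled bins. The sketch below uses a handful of made-up ages and hypothetical bracket edges; the brackets you choose for the real `Age` column are up to you, and other approaches work too.

```python
import pandas as pd

# Made-up ages (None stands in for a missing value); not the real Users data
ages = pd.Series([17, 18, 25, 33, 47, 62, None, 51], name="Age")

# Hypothetical bracket edges and labels, chosen only for illustration
bins = [0, 18, 30, 45, 60, 120]
labels = ["0-17", "18-29", "30-44", "45-59", "60+"]

# right=False makes each bin include its left edge: [0, 18), [18, 30), ...
brackets = pd.cut(ages, bins=bins, labels=labels, right=False)

# Count users per bracket (missing ages are skipped), keep the label order
counts = brackets.value_counts().reindex(labels)

# Turn the counts into a table with a single row
age_table = counts.to_frame().T
print(age_table)
```

Missing ages fall into no bracket, so they drop out of the counts.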


In [58]:
# Table 1
# Make sure that your final df is called 'df1' 


In [59]:
# Table 2
# Make sure that your final df is called 'df2'

In [37]:
# Check Table 1
assert df1.iloc[0, 3] == 27408 # Checking, does your count of 45 to 59 year olds match 27408?
# Check Table 2
assert df2.iloc[527, 0] == 'Beloved (Plume Contemporary Fiction)' # Is the book title in the row at position 527 this?

# It's possible that you might get the correct answer, but somehow shuffle the order.
# In such a case you won't pass the assert check, but you have still completed the question. Check with the teacher.

### Question 2: Feature Selection

Create features that can be related to **both** the users and the items. There is more than one way that this can be done. You may choose to either show your answer in a table format or in a dictionary format. However, we should be able to take this and apply the features to any user and to any item. 

In [None]:
#This is an open ended question. We will discuss in class, but you might find another approach.
#Just make sure you're prepared to explain your selection. 
#This is a fairly large question, that might be one of those cases where it takes a fair amount of thinking, but not too much coding. :D

### Question 3: Normalization

Now that we have features, I want you to analyze whether or not the features are normalized. If they are normalized, then please explain why they are normalized values. Additionally, explain whether you are capturing discrete values or continuous values.

If they are not normalized, then please apply some process to normalize them. Keep in mind that there might be a nifty pandas method that will do just that for you.

In [None]:
#Code if necessary

This is a markdown cell; you can type into it and it won't be mistaken for code.