# Client Overview: Reviewly

Reviewly is a startup company that has retained our consulting services to build a <b>"User-Item Collaborative Filtering" recommender system (RS)</b>.  

User-Item Collaborative Filtering: “<i>Customers who are similar to you also liked …</i>”

_(Ideally, I think implicit cues should be used instead of explicit ratings, but... that's not this project!  --Jess)_

## Basic Logic

Person A as input --> identify people who are similar to Person A based on similar ratings --> make recommendations for Person A based on predicted ratings of similar people

The data can be conceptualized as a large, sparsely filled matrix (? we need more info on this) with raters (users) across the top and items (being rated) down the side. Each cell either contains an observed rating (scale TBD) for an item (row) by a specific user (column) or is blank. We assume that only the most current rating should be considered, so we don't need to track ratings that the user changes or updates over time. Accuracy can be judged by root mean squared error (RMSE) between predicted and observed ratings.
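As a rough illustration of the matrix described above, the sketch below builds it with pandas from a long-format ratings table, keeping only the most recent rating per user and item. The column names (`User_Id`, `Item_Id`, `Rating`, `Timestamp`), the toy data, and the 1-5 scale are all assumptions, since the real data format is still unknown.

```python
import pandas as pd

# Toy long-format ratings (hypothetical data; the real schema is still unknown)
ratings = pd.DataFrame({
    "User_Id":   ["A", "A", "B", "B", "C", "A"],
    "Item_Id":   [1, 2, 1, 3, 2, 1],
    "Rating":    [4, 5, 2, 3, 4, 1],
    "Timestamp": pd.to_datetime([
        "2020-01-01", "2020-01-02", "2020-01-03",
        "2020-01-04", "2020-01-05", "2020-02-01",
    ]),
})

# Keep only the most recent rating for each (user, item) pair
latest = (
    ratings.sort_values("Timestamp")
           .drop_duplicates(subset=["User_Id", "Item_Id"], keep="last")
)

# Items down the side (rows), users across the top (columns); NaN = not rated
matrix = latest.pivot(index="Item_Id", columns="User_Id", values="Rating")

# Fraction of cells that are empty
sparsity = matrix.isna().to_numpy().mean()
print(matrix)
print(f"sparsity = {sparsity:.2f}")
```

Here user A rated item 1 twice, and only the later rating survives. Blank cells are stored as NaN rather than 0 so that a missing rating is never mistaken for a low one.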

A predicted rating combines three components:

1) Average rating for an item
2) User bias
3) User similarity to raters

```
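# Sample logic for average ratings and user bias.
# Assumed inputs, not defined here: by_rating_mean (per-item means in a 'stars'
# column); df_sparse and data (users as rows, items as columns, 0 = not rated);
# user_similarity(), user_rating_matrix and item_list.
import numpy as np

rating_means = by_rating_mean['stars'].tolist()
# Each user's mean non-zero rating, centered on the overall mean so it can be
# added to an item mean as an offset
global_mean = df_sparse.values[df_sparse.values != 0].mean()
user_bias = (df_sparse.T.apply(lambda k: k.sum() / (k != 0).sum()) - global_mean).tolist()

# sample logic for predicting user similarity to other raters
# "fans" rate the item above the mean, "haters" rate it below the mean

def user_fan_similarity(user, item):
    fans = np.where(data.iloc[:, item] > data.iloc[:, item].mean())[0]
    similarity_vals_list = [user_similarity(user, fan) for fan in fans]
    if not similarity_vals_list:
        return 0.0  # no fans for this item
    # average similarity across all fans
    return sum(similarity_vals_list) / len(similarity_vals_list)

user = 0  # row index of Person A (placeholder)
rating_predictions = []
for item in range(len(item_list)):
    item_mean_rating = rating_means[item]
    user_correlation_with_fan_adjustment = user_rating_matrix[item]
    predicted_rating = item_mean_rating + user_bias[user] + user_correlation_with_fan_adjustment
    print("Predicted rating for item %d = %.2f" % (item, predicted_rating))
    rating_predictions.append(predicted_rating)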
```
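To make the first two components and the RMSE check concrete, here is a small self-contained sketch on toy data. The ratings, the 1-5 scale, and the names `baseline` and `rmse` are assumptions for illustration only. The similarity adjustment (component 3) is deliberately left out.

```python
import numpy as np
import pandas as pd

# Toy ratings: users as rows, items as columns, NaN = not rated (hypothetical)
ratings = pd.DataFrame(
    [[5, 4, np.nan, 1],
     [4, np.nan, 2, 1],
     [np.nan, 5, 3, 2]],
    index=["u0", "u1", "u2"],
    columns=["i0", "i1", "i2", "i3"],
)

global_mean = np.nanmean(ratings.to_numpy())    # mean of all observed ratings
item_mean = ratings.mean(axis=0)                # component 1: per-item average
user_bias = ratings.mean(axis=1) - global_mean  # component 2: per-user offset

# Baseline prediction for every cell: item mean + user bias
baseline = pd.DataFrame(
    item_mean.to_numpy()[None, :] + user_bias.to_numpy()[:, None],
    index=ratings.index, columns=ratings.columns,
)

def rmse(actual, predicted):
    """Root mean squared error over the cells where `actual` is observed."""
    mask = actual.notna().to_numpy()
    diff = actual.to_numpy()[mask] - predicted.to_numpy()[mask]
    return float(np.sqrt(np.mean(diff ** 2)))

print(baseline.round(2))
print(f"RMSE on observed ratings = {rmse(ratings, baseline):.3f}")
```

This RMSE is computed on the same ratings used to build the baseline, so it is optimistic; a held-out split would give a fairer number. Predictions are not clipped to the rating scale, which may be worth adding once the scale is known.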

Need to determine the appropriate categories/clusters for the items once we know what is being rated.  Then, continue with appropriate modeling or ML technique.

```
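# SVD (singular value decomposition) via the Surprise library
# Assumes df has columns User_Id, Item_Id, Rating; df_item is indexed by
# Item_Id with a 'Name' column; PersonA holds Person A's User_Id.
import pandas as pd
from surprise import SVD, Dataset, Reader
from surprise.model_selection import cross_validate

reader = Reader()  # default rating scale is 1-5; adjust once the scale is known

# example with 100 rows, 3-fold cross-validation
data = Dataset.load_from_df(df[['User_Id', 'Item_Id', 'Rating']][:100], reader)

svd = SVD()
cross_validate(svd, data, measures=['RMSE', 'MAE'], cv=3, verbose=True)

# What Person A liked most (assuming "5" as highest rating)

df_PersonA = df[(df['User_Id'] == PersonA) & (df['Rating'] == 5)]
df_PersonA = df_PersonA.set_index('Item_Id')
df_PersonA = df_PersonA.join(df_item)['Name']
print(df_PersonA)

# PREDICT what Person A would like based on what they liked before using SVD

data = Dataset.load_from_df(df[['User_Id', 'Item_Id', 'Rating']], reader)

trainset = data.build_full_trainset()
svd.fit(trainset)

# One row per item, scored with Person A's estimated rating
PersonA_preds = df_item.reset_index()
PersonA_preds['Estimate_Rating'] = PersonA_preds['Item_Id'].apply(lambda x: svd.predict(PersonA, x).est)

PersonA_preds = PersonA_preds.drop('Item_Id', axis=1)

PersonA_preds = PersonA_preds.sort_values('Estimate_Rating', ascending=False)
print(PersonA_preds.head(10))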

# RECOMMEND an item for Person A based on the predictions of what they would like

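# Use Pearson's r to measure the linear correlation between the ratings of
# the target item and every other item, then recommend the 10 items with the
# highest correlations.
# Assumes df_p is a User_Id x Item_Id pivot table of ratings and
# df_item_summary holds per-item 'count' and 'mean' (neither is built here).

def recommend(item_name, min_count):
    print("For item ({})".format(item_name))
    print("- Top 10 items recommended based on Pearson's r correlation - ")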
    i = int(df_item.index[df_item['Name'] == item_name][0])
    target = df_p[i]
    similar_to_target = df_p.corrwith(target)
    corr_target = pd.DataFrame(similar_to_target, columns = ['PearsonR'])
    corr_target.dropna(inplace = True)
    corr_target = corr_target.sort_values('PearsonR', ascending = False)
    corr_target.index = corr_target.index.map(int)
    corr_target = corr_target.join(df_item).join(df_item_summary)[['PearsonR', 'Name', 'count', 'mean']]
    print(corr_target[corr_target['count']>min_count][:10].to_string(index=False))
    
# Based on some input (e.g. an item name), recommend "top 10" most likely to be liked by Person A
recommend("Some item", 0)
```

Need to extend this logic to take into account what OTHER raters similar to Person A (as per the first block of pseudocode) liked.
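One way this extension might look, sketched under assumptions: score Person A's unrated items by a similarity-weighted average of how far other users rated them above or below their own means. The toy data, the helper names (`pearson_sim`, `predict_for`), and the choice to use only positively correlated neighbors are illustrative, not a settled design.

```python
import numpy as np
import pandas as pd

# Toy ratings: users as rows, items as columns, NaN = not rated (hypothetical)
ratings = pd.DataFrame(
    [[5, 4, 1, np.nan, np.nan],
     [4, 5, 2, 1, 5],
     [1, 2, 5, 4, 1],
     [5, 5, 1, 2, 4]],
    index=["PersonA", "u1", "u2", "u3"],
    columns=["i0", "i1", "i2", "i3", "i4"],
)

def pearson_sim(a, b, min_overlap=2):
    """Pearson's r over co-rated items; 0.0 if too few items or no variance."""
    both = a.notna() & b.notna()
    if both.sum() < min_overlap:
        return 0.0
    x, y = a[both].to_numpy(), b[both].to_numpy()
    if x.std() == 0 or y.std() == 0:
        return 0.0
    return float(np.corrcoef(x, y)[0, 1])

def predict_for(user, item):
    """Predicted rating = user's mean + similarity-weighted neighbor offsets."""
    user_mean = ratings.loc[user].mean()
    num, den = 0.0, 0.0
    for other in ratings.index.drop([user]):
        r = ratings.loc[other, item]
        sim = pearson_sim(ratings.loc[user], ratings.loc[other])
        if pd.isna(r) or sim <= 0:  # skip non-raters and dissimilar users
            continue
        num += sim * (r - ratings.loc[other].mean())
        den += sim
    return user_mean if den == 0 else user_mean + num / den

# Rank the items Person A has not rated yet by predicted rating
unrated = ratings.columns[ratings.loc["PersonA"].isna()]
predictions = {item: predict_for("PersonA", item) for item in unrated}
for item, est in sorted(predictions.items(), key=lambda kv: -kv[1]):
    print(f"{item}: {est:.2f}")
```

With very few co-rated items Pearson's r is unstable (two shared items always give exactly +1 or -1), so `min_overlap` would need tuning against real data.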

## Requirements

* Data source with user ratings
    * where ratings are not available, "implicit" proxies for ratings can be used (e.g. other purchases, website behavior); see Important Unknowns
* Appropriate categories/clusters for the items being rated
* Decision about whether the algorithm should consider items and users with few ratings (unless new), which will affect the size of the "matrix"
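
The last requirement could be prototyped as a simple threshold filter. The thresholds (`min_item_ratings`, `min_user_ratings`), the function name, and the column names below are placeholders to be tuned against the client's data; how to exempt new users and items is left open.

```python
import pandas as pd

def filter_sparse(ratings, min_item_ratings=2, min_user_ratings=2):
    """Drop items, then users, that have fewer ratings than the thresholds.

    `ratings` is a long-format DataFrame with User_Id and Item_Id columns
    (assumed names). Only a single pass is made: dropping users can push an
    item back under its threshold, so repeat if a strict guarantee is needed.
    """
    item_counts = ratings["Item_Id"].map(ratings["Item_Id"].value_counts())
    kept = ratings[item_counts >= min_item_ratings]
    user_counts = kept["User_Id"].map(kept["User_Id"].value_counts())
    return kept[user_counts >= min_user_ratings]

# Toy data (hypothetical)
ratings = pd.DataFrame({
    "User_Id": ["A", "A", "B", "B", "C", "C", "D"],
    "Item_Id": [1, 2, 1, 2, 1, 3, 2],
    "Rating":  [4, 5, 2, 3, 4, 1, 5],
})

filtered = filter_sparse(ratings)
print(f"{len(ratings)} ratings before, {len(filtered)} after")
print(filtered)
```

Comparing the matrix size before and after filtering at a few thresholds should show how much this decision affects the matrix.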

## Challenges

* Users don't like to rate things
* What is the "psychology of rating"?  Do users tend to rate things only at the extremes (very good or very bad)?
* Privacy regulations

## Important Unknowns

* Number of existing user ratings
* Source of user ratings
* Quality of user ratings
* Type of ratings 
* Format of ratings
* Context in which the RS will be used
* Target market
* Client technology systems (how will this be deployed/operationalized?)
* Timeline
* Resources
* Existing systems/algorithms
* Company expertise
* Market competition
* Constraints/limitations

## Resources

* http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
* https://en.wikipedia.org/wiki/Collaborative_filtering