# Recommendations with IBM

In this notebook, you will be putting your recommendation skills to use on real data from the IBM Watson Studio platform. 


You may either submit your notebook through the workspace here, or you may work from your local machine and submit through the next page.  Either way, ensure that your code passes the project [RUBRIC](https://review.udacity.com/#!/rubrics/3325/view).  **Please save regularly.**

By following the table of contents, you will build out a number of different methods for making recommendations that can be used for different situations. 


## Table of Contents

I. [Exploratory Data Analysis](#Exploratory-Data-Analysis)<br>
II. [Rank Based Recommendations](#Rank)<br>
III. [User-User Based Collaborative Filtering](#User-User)<br>
IV. [Content Based Recommendations (EXTRA - NOT REQUIRED)](#Content-Recs)<br>
V. [Matrix Factorization](#Matrix-Fact)<br>
VI. [Extras & Concluding](#conclusions)

At the end of the notebook, you will find directions for how to submit your work.  Let's get started by importing the necessary libraries and reading in the data.

In [198]:
import pandas as pd
# set max display options for analysis
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 5)

import numpy as np
import matplotlib.pyplot as plt
import plotly.graph_objects as go
import plotly.express as px
import plotly.io as pio
import project_tests as t
import pickle

%matplotlib inline

df = pd.read_csv('data/user-item-interactions.csv')
df_content = pd.read_csv('data/articles_community.csv')
del df['Unnamed: 0']
del df_content['Unnamed: 0']


In [199]:
df.shape

(45993, 3)

In [200]:
df.dtypes

article_id    float64
title          object
email          object
dtype: object

In [201]:
# Show df to get an idea of the data
df.head()

Unnamed: 0,article_id,title,email
0,1430.0,"using pixiedust for fast, flexible, and easier...",ef5f11f77ba020cd36e1105a00ab868bbdbf7fe7
1,1314.0,healthcare python streaming application demo,083cbdfa93c8444beaa4c5f5e0f5f9198e4f9e0b
2,1429.0,use deep learning for image classification,b96a4f2e92d8572034b1e9b28f9ac673765cd074
3,1338.0,ml optimization using cognitive assistant,06485706b34a5c9bf2a0ecdac41daf7e7654ceb7
4,1276.0,deploy your python model as a restful api,f01220c46fc92c6e6b161b1849de11faacd7ccb2


In [202]:
df['title'][0]

'using pixiedust for fast, flexible, and easier data analysis and experimentation'

In [203]:
# check if users appear more than once in user-item-interactions table
df.email.value_counts(ascending=False)

email
2b6c0f514c2f2b04ad3c4583407dccd0810469ee    364
77959baaa9895a7e2bdc9297f8b27c1b6f2cb52a    363
2f5c7feae533ce046f2cb16fb3a29fe00528ed66    170
a37adec71b667b297ed2440a9ff7dad427c7ac85    169
8510a5010a5d4c89f5b07baac6de80cd12cfaf93    160
                                           ... 
f5035acf16af3e79700393838fa1023ad38da668      1
81335c2e5917100a5cbdcc2bc0285fed6d685f6d      1
98d4864a24bc8f9915c8c8b5ebd3aa1eaa71cbaf      1
c87e297a1a99ae042be2015ff9056cf13195eefd      1
1f18e8aaccd6c8720180c3fe264c8aef5b00697f      1
Name: count, Length: 5148, dtype: int64

In [204]:
# count interactions per (email, article_id) pair
# to see whether a user has interacted with an article more than once
df_pivot = df.pivot_table(index=['email', 'article_id'], values=['title'], aggfunc='count')
df_pivot.head(10)

Unnamed: 0_level_0,Unnamed: 1_level_0,title
email,article_id,Unnamed: 2_level_1
0000b6387a0366322d7fbfc6434af145adf7fed1,43.0,2
0000b6387a0366322d7fbfc6434af145adf7fed1,124.0,1
0000b6387a0366322d7fbfc6434af145adf7fed1,173.0,1
0000b6387a0366322d7fbfc6434af145adf7fed1,288.0,1
0000b6387a0366322d7fbfc6434af145adf7fed1,349.0,1
0000b6387a0366322d7fbfc6434af145adf7fed1,618.0,1
0000b6387a0366322d7fbfc6434af145adf7fed1,732.0,1
0000b6387a0366322d7fbfc6434af145adf7fed1,1162.0,1
0000b6387a0366322d7fbfc6434af145adf7fed1,1232.0,1
0000b6387a0366322d7fbfc6434af145adf7fed1,1314.0,1


In [205]:
# filter df_pivot for user-article pairs with more than 5 interactions and show the top 5
df_pivot[df_pivot['title'] > 5].sort_values(by='title', ascending=False).head(5)

Unnamed: 0_level_0,Unnamed: 1_level_0,title
email,article_id,Unnamed: 2_level_1
1588af175b283915f597fc4719cbb2c8621c4fc2,1170.0,42
363cb98a087e4a3eb6890fd1af2d418116f85ff8,1170.0,41
2b6c0f514c2f2b04ad3c4583407dccd0810469ee,1429.0,35
77959baaa9895a7e2bdc9297f8b27c1b6f2cb52a,1429.0,35
b96a4f2e92d8572034b1e9b28f9ac673765cd074,1429.0,25


In [206]:
# filter df_pivot for email with most interactions (2b6c0f514c2f2b04ad3c4583407dccd0810469ee)
df_pivot[df_pivot.index.get_level_values('email') == '2b6c0f514c2f2b04ad3c4583407dccd0810469ee'].sort_values(by='title', ascending=False).head()

Unnamed: 0_level_0,Unnamed: 1_level_0,title
email,article_id,Unnamed: 2_level_1
2b6c0f514c2f2b04ad3c4583407dccd0810469ee,1429.0,35
2b6c0f514c2f2b04ad3c4583407dccd0810469ee,1293.0,16
2b6c0f514c2f2b04ad3c4583407dccd0810469ee,29.0,15
2b6c0f514c2f2b04ad3c4583407dccd0810469ee,43.0,15
2b6c0f514c2f2b04ad3c4583407dccd0810469ee,1172.0,12


In [207]:
df_content.shape

(1056, 5)

In [208]:
df_content.dtypes

doc_body           object
doc_description    object
doc_full_name      object
doc_status         object
article_id          int64
dtype: object

In [209]:
# Show df_content to get an idea of the data
df_content.head()

Unnamed: 0,doc_body,doc_description,doc_full_name,doc_status,article_id
0,Skip navigation Sign in SearchLoading...\r\n\r...,Detect bad readings in real time using Python ...,Detect Malfunctioning IoT Sensors with Streami...,Live,0
1,No Free Hunch Navigation * kaggle.com\r\n\r\n ...,"See the forest, see the trees. Here lies the c...",Communicating data science: A guide to present...,Live,1
2,☰ * Login\r\n * Sign Up\r\n\r\n * Learning Pat...,Here’s this week’s news in Data Science and Bi...,"This Week in Data Science (April 18, 2017)",Live,2
3,"DATALAYER: HIGH THROUGHPUT, LOW LATENCY AT SCA...",Learn how distributed DBs solve the problem of...,DataLayer Conference: Boost the performance of...,Live,3
4,Skip navigation Sign in SearchLoading...\r\n\r...,This video demonstrates the power of IBM DataS...,Analyze NY Restaurant data using Spark in DSX,Live,4


In [210]:
# check if articles appear more than once
df_content.article_id.value_counts(ascending=False).head()

article_id
221    2
232    2
50     2
398    2
577    2
Name: count, dtype: int64

In [211]:
df_content[df_content.article_id == 221]

Unnamed: 0,doc_body,doc_description,doc_full_name,doc_status,article_id
221,* United States\r\n\r\nIBM® * Site map\r\n\r\n...,When used to make sense of huge amounts of con...,How smart catalogs can turn the big data flood...,Live,221
692,Homepage Follow Sign in / Sign up Homepage * H...,One of the earliest documented catalogs was co...,How smart catalogs can turn the big data flood...,Live,221


In [212]:
df_content['doc_body'][0]

"Skip navigation Sign in SearchLoading...\r\n\r\nClose Yeah, keep it Undo CloseTHIS VIDEO IS UNAVAILABLE.\r\nWATCH QUEUE\r\nQUEUE\r\nWatch Queue Queue * Remove all\r\n * Disconnect\r\n\r\nThe next video is starting stop 1. Loading...\r\n\r\nWatch Queue Queue __count__/__total__ Find out why CloseDEMO: DETECT MALFUNCTIONING IOT SENSORS WITH STREAMING ANALYTICS\r\nIBM AnalyticsLoading...\r\n\r\nUnsubscribe from IBM Analytics? Cancel UnsubscribeWorking...\r\n\r\nSubscribe Subscribed Unsubscribe 26KLoading...\r\n\r\nLoading...\r\n\r\nWorking...\r\n\r\nAdd toWANT TO WATCH THIS AGAIN LATER?\r\nSign in to add this video to a playlist. Sign in Share More * ReportNEED TO REPORT THE VIDEO?\r\n   Sign in to report inappropriate content. Sign in\r\n * Transcript\r\n * Statistics\r\n * Add translations\r\n\r\n175 views 6LIKE THIS VIDEO?\r\nSign in to make your opinion count. Sign in 7 0DON'T LIKE THIS VIDEO?\r\nSign in to make your opinion count. Sign in 1Loading...\r\n\r\nLoading...\r\n\r\nTRANSCR

In [213]:
# show interaction rows in df whose article_id does not appear in df_content
df[~df.article_id.isin(df_content.article_id)].drop_duplicates().head()

Unnamed: 0,article_id,title,email
0,1430.0,"using pixiedust for fast, flexible, and easier...",ef5f11f77ba020cd36e1105a00ab868bbdbf7fe7
1,1314.0,healthcare python streaming application demo,083cbdfa93c8444beaa4c5f5e0f5f9198e4f9e0b
2,1429.0,use deep learning for image classification,b96a4f2e92d8572034b1e9b28f9ac673765cd074
3,1338.0,ml optimization using cognitive assistant,06485706b34a5c9bf2a0ecdac41daf7e7654ceb7
4,1276.0,deploy your python model as a restful api,f01220c46fc92c6e6b161b1849de11faacd7ccb2


In [214]:
# show rows in df_content whose article_id never appears in df
df_content[~df_content.article_id.isin(df.article_id)].drop_duplicates().head()

Unnamed: 0,doc_body,doc_description,doc_full_name,doc_status,article_id
1,No Free Hunch Navigation * kaggle.com\r\n\r\n ...,"See the forest, see the trees. Here lies the c...",Communicating data science: A guide to present...,Live,1
3,"DATALAYER: HIGH THROUGHPUT, LOW LATENCY AT SCA...",Learn how distributed DBs solve the problem of...,DataLayer Conference: Boost the performance of...,Live,3
5,Compose is all about immediacy. You want a new...,Using Compose's PostgreSQL data browser.,Browsing PostgreSQL Data with Compose,Live,5
6,UPGRADING YOUR POSTGRESQL TO 9.5Share on Twitt...,Upgrading your PostgreSQL deployment to versio...,Upgrading your PostgreSQL to 9.5,Live,6
7,Follow Sign in / Sign up 135 8 * Share\r\n * 1...,For a company like Slack that strives to be as...,Data Wrangling at Slack,Live,7


In [215]:
# find all unique article ids in df
unique_ids = df.article_id.unique()

# convert values to int
unique_ids = [int(x) for x in unique_ids]

# sort, then show the 2nd through 10th smallest article ids
unique_ids.sort()
unique_ids[1:10]

[2, 4, 8, 9, 12, 14, 15, 16, 18]

### <a class="anchor" id="Exploratory-Data-Analysis">Part I : Exploratory Data Analysis</a>

Use the dictionary and cells below to provide some insight into the descriptive statistics of the data.

`1.` What is the distribution of how many articles a user interacts with in the dataset?  Provide a visual and descriptive statistics that show the number of times each user interacts with an article.  

### Total user-article-interaction count

In [216]:
# get the count of the number of articles per email
# convert to dataframe
df_user_article_count = df.groupby('email').count()['article_id'].sort_values(ascending=False).to_frame()
# rename column article_id to interaction_count
df_user_article_count.rename(columns={'article_id':'interaction_count'}, inplace=True)
df_user_article_count

Unnamed: 0_level_0,interaction_count
email,Unnamed: 1_level_1
2b6c0f514c2f2b04ad3c4583407dccd0810469ee,364
77959baaa9895a7e2bdc9297f8b27c1b6f2cb52a,363
2f5c7feae533ce046f2cb16fb3a29fe00528ed66,170
a37adec71b667b297ed2440a9ff7dad427c7ac85,169
8510a5010a5d4c89f5b07baac6de80cd12cfaf93,160
...,...
1b520f0f65c0aee52d4235f92fb2de58fa966635,1
7a67e4a2902a20062e1f2a6835b6e099b34b4f6c,1
c4b7e639e91b1d18e5b9c000f0ad3354888fcdde,1
7a7fb282789944665ffc1cddee5ddbdbd7ca9f64,1


In [217]:
df_user_article_count.describe()

Unnamed: 0,interaction_count
count,5148.0
mean,8.930847
std,16.802267
min,1.0
25%,1.0
50%,3.0
75%,9.0
max,364.0


In [218]:
print(f"The average number of user-article interactions is {df_user_article_count['interaction_count'].mean()}.")
print(f"50% of individuals interacted with {df_user_article_count['interaction_count'].quantile(0.5)} articles or fewer.")

The average number of user-article interactions is 8.930846930846931.
50% of individuals interacted with 3.0 articles or fewer.


In [219]:
# box plot of user-article interactions
fig = px.box(df_user_article_count, y='interaction_count')
# restrict y axis to 0-200
fig.update_yaxes(range=[0, 200])
fig.show()

In [220]:
# plot the distribution of how many articles a user interacts with in the dataset (plotly)
fig = px.histogram(df_user_article_count, x='interaction_count', nbins=100, title='Distribution of Total User-Article-Interactions')
fig.update_yaxes(title_text='Number of users per Interaction Count')
fig.update_xaxes(title_text='User-Article-Interactions')
fig.show()

In [221]:
print(f"It is clear that most users have very few interactions with articles. In fact, {df_user_article_count[df_user_article_count['interaction_count'] == 1].shape[0]} out of {df.email.nunique()} users have only 1 interaction with an article.")

It is clear that most users have very few interactions with articles. In fact, 1416 out of 5148 users have only 1 interaction with an article.


In [222]:
# histnorm – Specifies the type of normalization used for this histogram trace.  
# If “”, the span of each bar corresponds to the number of occurrences (i.e. the number of data points lying inside the bins).   
# If “percent” / “probability”, the span of each bar corresponds to the percentage / fraction of occurrences with respect to the total number of sample points (here, the sum of all bin HEIGHTS equals 100% / 1).  
# If “density”, the span of each bar corresponds to the number of occurrences in a bin divided by the size of the bin interval (here, the sum of all bin AREAS equals the total number of sample points).  
# If probability density, the area of each bar corresponds to the probability that an event will fall into the corresponding bin (here, the sum of all bin AREAS equals 1).
# fig = go.Figure()
# fig.add_trace(go.Histogram(x=df_user_article_count['interaction_count'], nbinsx=100, histnorm='probability density', name='interaction_count'))
# fig.show()
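The `histnorm` notes above can be checked by hand on a tiny sample. The sketch below is plain Python and does not call Plotly; the helper `normalized_histograms`, the sample values, and the fixed `bin_width` are all hypothetical and only illustrate how counts, probabilities, and probability densities relate.

```python
def normalized_histograms(values, bin_width):
    """Bin values into fixed-width bins and return three normalizations.

    Hypothetical helper for illustration; it mimics the definitions in the
    notes above rather than Plotly's own binning logic.
    """
    counts = {}
    for v in values:
        left_edge = (v // bin_width) * bin_width  # left edge of the bin holding v
        counts[left_edge] = counts.get(left_edge, 0) + 1

    n = len(values)
    # "probability": bar heights sum to 1
    probability = {edge: c / n for edge, c in counts.items()}
    # "probability density": bar areas (height * bin_width) sum to 1
    prob_density = {edge: c / (n * bin_width) for edge, c in counts.items()}
    return counts, probability, prob_density


counts, probability, prob_density = normalized_histograms(
    [1, 1, 2, 3, 3, 3, 8, 9], bin_width=2
)
print(counts)
print(sum(probability.values()))
print(sum(height * 2 for height in prob_density.values()))
```

With equal-width bins the two normalizations differ only by the constant factor `bin_width`, so the shape of the histogram is the same in both.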

### Unique user-article-interaction count

The same question as above, but counting each article at most once per user: what is the distribution of how many distinct articles a user interacts with?  Provide a visual and descriptive statistics for the number of unique articles per user.  
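To make the difference between total and unique interaction counts concrete, here is a small sketch on made-up data. The frame `toy` and its values are hypothetical; only the `groupby` pattern mirrors what the cells in this section do.

```python
import pandas as pd

# Hypothetical interactions: user 'a' opens article 1.0 twice
toy = pd.DataFrame({
    'email': ['a', 'a', 'a', 'b', 'b'],
    'article_id': [1.0, 1.0, 3.0, 4.0, 5.0],
})

# Total interactions per user: every row counts
total_per_user = toy.groupby('email')['article_id'].count()

# Unique interactions per user: repeated views of one article count once
unique_per_user = toy.groupby('email')['article_id'].nunique()

print(total_per_user.to_dict())
print(unique_per_user.to_dict())
```

User `a` has three interactions but only two distinct articles, which is why the unique counts can never exceed the total counts.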

In [223]:
df.groupby('email').nunique().head(10)

Unnamed: 0_level_0,article_id,title
email,Unnamed: 1_level_1,Unnamed: 2_level_1
0000b6387a0366322d7fbfc6434af145adf7fed1,12,12
001055fc0bb67f71e8fa17002342b256a30254cd,4,4
00148e4911c7e04eeff8def7bbbdaf1c59c2c621,3,3
001a852ecbd6cc12ab77a785efa137b2646505fe,5,5
001fc95b90da5c3cb12c501d201a915e4f093290,2,2
0042719415c4fca7d30bd2d4e9d17c5fc570de13,2,2
00772abe2d0b269b2336fc27f0f4d7cb1d2b65d7,2,2
008ba1d5b4ebf54babf516a2d5aa43e184865da5,10,10
008ca24b82c41d513b3799d09ae276d37f92ce72,1,1
008dfc7a327b5186244caec48e0ab61610a0c660,10,10


In [224]:
val1 = df.groupby('email')['article_id'].nunique().iloc[0]

In [225]:
print(f'The first user interacted with {val1} articles.')

The first user interacted with 12 articles.


In [226]:
# test logic
# test_df = pd.DataFrame({'email':['a','a','a','b', 'b'], 'article_id':[1,1,3,4,5], 'title' : ['a1','a1','a3','b4', 'b5'] })
# test_df.groupby('email').nunique()

In [227]:
# group by email and article_id and count the number of interactions
# df.groupby(['email', 'article_id']).count()['title'].sort_values(ascending=False).head()

In [228]:
group_by_user_obj = df.groupby('email')

In [229]:
group_by_user_obj.get_group('008ba1d5b4ebf54babf516a2d5aa43e184865da5').sort_values(by='article_id')

Unnamed: 0,article_id,title,email
10439,315.0,neurally embedded emojis,008ba1d5b4ebf54babf516a2d5aa43e184865da5
23360,583.0,the million dollar question: where is my data?,008ba1d5b4ebf54babf516a2d5aa43e184865da5
12349,749.0,hurricane how-to,008ba1d5b4ebf54babf516a2d5aa43e184865da5
23785,1062.0,airbnb data for analytics: antwerp calendar,008ba1d5b4ebf54babf516a2d5aa43e184865da5
9834,1186.0,connect to db2 warehouse on cloud and db2 usin...,008ba1d5b4ebf54babf516a2d5aa43e184865da5
12292,1296.0,fortune 100 companies,008ba1d5b4ebf54babf516a2d5aa43e184865da5
18875,1328.0,income (2015): united states demographic measures,008ba1d5b4ebf54babf516a2d5aa43e184865da5
18885,1409.0,uci: red wine quality,008ba1d5b4ebf54babf516a2d5aa43e184865da5
18883,1411.0,uci: white wine quality,008ba1d5b4ebf54babf516a2d5aa43e184865da5
15238,1431.0,visualize car data with brunel,008ba1d5b4ebf54babf516a2d5aa43e184865da5


In [230]:
# get the count of the number of unique articles per email
# convert to dataframe
df_user_article_unique_count = df.groupby('email').nunique()['article_id'].sort_values(ascending=False).to_frame()
df_user_article_unique_count.rename(columns={'article_id':'unique_article_interaction_count'}, inplace=True)
df_user_article_unique_count.head(10)


Unnamed: 0_level_0,unique_article_interaction_count
email,Unnamed: 1_level_1
2b6c0f514c2f2b04ad3c4583407dccd0810469ee,135
77959baaa9895a7e2bdc9297f8b27c1b6f2cb52a,135
d9032ff68d0fd45dfd18c0c5f7324619bb55362c,101
c60bb0a50c324dad0bffd8809d121246baef372b,100
a37adec71b667b297ed2440a9ff7dad427c7ac85,97
2f5c7feae533ce046f2cb16fb3a29fe00528ed66,97
8510a5010a5d4c89f5b07baac6de80cd12cfaf93,96
f8c978bcf2ae2fb8885814a9b85ffef2f54c3c76,96
276d9d8ca0bf52c780b5a3fc554fa69e74f934a3,75
56832a697cb6dbce14700fca18cffcced367057f,75


In [231]:
df_user_article_unique_count.describe()

Unnamed: 0,unique_article_interaction_count
count,5148.0
mean,6.54021
std,9.990676
min,1.0
25%,1.0
50%,3.0
75%,7.0
max,135.0


In [232]:
print(f"The average number of unique articles interacted with is {df_user_article_unique_count['unique_article_interaction_count'].mean()}.")
print(f"50% of individuals interacted with {df_user_article_unique_count['unique_article_interaction_count'].quantile(0.5)} unique articles or fewer.")

The average number of unique articles interacted with is 6.54020979020979.
50% of individuals interacted with 3.0 unique articles or fewer.


In [233]:
# plot the distribution of how many unique articles a user interacts with in the dataset (plotly)
fig = px.histogram(df_user_article_unique_count, x='unique_article_interaction_count', nbins=100, title='Distribution of Total Unique User-Article-Interactions')
fig.update_yaxes(title_text='Number of users per Unique Interaction Count')
fig.update_xaxes(title_text='Unique User-Article-Interactions')
fig.show()

In [234]:
print(f"It is clear that most users have very few unique interactions with articles. In fact, {df_user_article_unique_count[df_user_article_unique_count['unique_article_interaction_count'] == 1].shape[0]} out of {df.email.nunique()} users have only 1 unique interaction with an article.")

It is clear that most users have very few unique interactions with articles. In fact, 1557 out of 5148 users have only 1 unique interaction with an article.


In [235]:
# Fill in the median and maximum number of user-article interactions below

# median_val = # 50% of individuals interact with ____ number of articles or fewer.
# max_views_by_user = # The maximum number of user-article interactions by any 1 user is ______.
median_val = df_user_article_count['interaction_count'].quantile(0.5)
max_views_by_user = df_user_article_count['interaction_count'].max()

# these values count every interaction, including repeat views of the same article
print(f'50% of individuals interacted with {median_val} articles or fewer.')
print(f'The maximum number of user-article interactions by any 1 user is {max_views_by_user}.')

50% of individuals interacted with 3.0 articles or fewer.
The maximum number of user-article interactions by any 1 user is 364.


In [236]:
# Fill in the median and maximum number of user-article interactions below (unique)

# median_val = # 50% of individuals interact with ____ number of articles or fewer.
# max_views_by_user = # The maximum number of user-article interactions by any 1 user is ______.
median_val_un = df_user_article_unique_count['unique_article_interaction_count'].quantile(0.5)
max_views_by_user_un = df_user_article_unique_count['unique_article_interaction_count'].max()

print(f'50% of individuals interacted with {median_val_un} unique articles or fewer.')
print(f'The maximum number of unique user-article interactions by any 1 user is {max_views_by_user_un}.')

50% of individuals interacted with 3.0 unique articles or fewer.
The maximum number of unique user-article interactions by any 1 user is 135.


`2.` Explore and remove duplicate articles from the **df_content** dataframe.  
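As a sketch of the two steps involved (inspect, then drop), the example below uses a made-up frame `toy_content`; the titles and ids are hypothetical. `duplicated(keep=False)` flags every member of a duplicate group so the copies can be compared, while `drop_duplicates(keep='first')` keeps one row per `article_id`.

```python
import pandas as pd

# Hypothetical content table in which article_id 50 appears twice
toy_content = pd.DataFrame({
    'doc_full_name': ['Graph ML', 'Intro to Spark', 'Graph ML (repost)'],
    'article_id': [50, 51, 50],
})

# Explore: keep=False marks all rows that share an article_id
dupes = toy_content[toy_content.duplicated(subset=['article_id'], keep=False)]

# Remove: keep only the first row seen for each article_id
deduped = toy_content.drop_duplicates(subset=['article_id'], keep='first')

print(dupes)
print(deduped)
```

Keeping the first occurrence is a choice, not a requirement; when the duplicate rows differ in `doc_body` or `doc_description`, it may be worth checking which copy is the better one before dropping.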

In [237]:
df_content.head()

Unnamed: 0,doc_body,doc_description,doc_full_name,doc_status,article_id
0,Skip navigation Sign in SearchLoading...\r\n\r...,Detect bad readings in real time using Python ...,Detect Malfunctioning IoT Sensors with Streami...,Live,0
1,No Free Hunch Navigation * kaggle.com\r\n\r\n ...,"See the forest, see the trees. Here lies the c...",Communicating data science: A guide to present...,Live,1
2,☰ * Login\r\n * Sign Up\r\n\r\n * Learning Pat...,Here’s this week’s news in Data Science and Bi...,"This Week in Data Science (April 18, 2017)",Live,2
3,"DATALAYER: HIGH THROUGHPUT, LOW LATENCY AT SCA...",Learn how distributed DBs solve the problem of...,DataLayer Conference: Boost the performance of...,Live,3
4,Skip navigation Sign in SearchLoading...\r\n\r...,This video demonstrates the power of IBM DataS...,Analyze NY Restaurant data using Spark in DSX,Live,4


In [238]:
# Find and explore duplicate articles in df_content
# df_content[df_content.duplicated(subset=['article_id'])].sort_values(by='article_id')
df_content[df_content.duplicated(subset=['article_id'], keep=False)].sort_values(by='article_id')

Unnamed: 0,doc_body,doc_description,doc_full_name,doc_status,article_id
50,Follow Sign in / Sign up Home About Insight Da...,Community Detection at Scale,Graph-based machine learning,Live,50
365,Follow Sign in / Sign up Home About Insight Da...,During the seven-week Insight Data Engineering...,Graph-based machine learning,Live,50
221,* United States\r\n\r\nIBM® * Site map\r\n\r\n...,When used to make sense of huge amounts of con...,How smart catalogs can turn the big data flood...,Live,221
692,Homepage Follow Sign in / Sign up Homepage * H...,One of the earliest documented catalogs was co...,How smart catalogs can turn the big data flood...,Live,221
232,Homepage Follow Sign in Get started Homepage *...,"If you are like most data scientists, you are ...",Self-service data preparation with IBM Data Re...,Live,232
971,Homepage Follow Sign in Get started * Home\r\n...,"If you are like most data scientists, you are ...",Self-service data preparation with IBM Data Re...,Live,232
399,Homepage Follow Sign in Get started * Home\r\n...,Today’s world of data science leverages data f...,Using Apache Spark as a parallel processing fr...,Live,398
761,Homepage Follow Sign in Get started Homepage *...,Today’s world of data science leverages data f...,Using Apache Spark as a parallel processing fr...,Live,398
578,This video shows you how to construct queries ...,This video shows you how to construct queries ...,Use the Primary Index,Live,577
970,This video shows you how to construct queries ...,This video shows you how to construct queries ...,Use the Primary Index,Live,577


In [239]:
df_content[df_content.article_id == 50]

Unnamed: 0,doc_body,doc_description,doc_full_name,doc_status,article_id
50,Follow Sign in / Sign up Home About Insight Da...,Community Detection at Scale,Graph-based machine learning,Live,50
365,Follow Sign in / Sign up Home About Insight Da...,During the seven-week Insight Data Engineering...,Graph-based machine learning,Live,50


In [240]:
# Remove any rows that have the same article_id - only keep the first
df_content.drop_duplicates(subset=['article_id'], keep='first', inplace=True)

In [241]:
df_content[df_content.article_id == 50]

Unnamed: 0,doc_body,doc_description,doc_full_name,doc_status,article_id
50,Follow Sign in / Sign up Home About Insight Da...,Community Detection at Scale,Graph-based machine learning,Live,50


`3.` Use the cells below to find:

**a.** The number of unique articles that have an interaction with a user.  
**b.** The number of unique articles in the dataset (whether they have any interactions or not).<br>
**c.** The number of unique users in the dataset. (excluding null values) <br>
**d.** The number of user-article interactions in the dataset.
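The four quantities differ mainly in how they treat repeats and missing values. The sketch below uses a hypothetical frame `toy` with one missing email to show that `nunique()` skips nulls by default, whereas the row count does not.

```python
import pandas as pd

# Hypothetical interactions; the third row has no email
toy = pd.DataFrame({
    'article_id': [10.0, 10.0, 20.0, 30.0],
    'email': ['a', 'b', None, 'a'],
})

unique_articles_toy = toy['article_id'].nunique()            # distinct articles with an interaction
unique_users_toy = toy['email'].nunique()                    # nulls excluded by default
unique_users_incl_null = toy['email'].nunique(dropna=False)  # null counted as one extra value
interactions_toy = toy.shape[0]                              # every row, including the null email

print(unique_articles_toy, unique_users_toy, unique_users_incl_null, interactions_toy)
```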

In [242]:
df[df.email.isnull()]

Unnamed: 0,article_id,title,email
25131,1016.0,why you should master r (even if it might even...,
29758,1393.0,the nurse assignment problem,
29759,20.0,working interactively with rstudio and noteboo...,
29760,1174.0,breast cancer wisconsin (diagnostic) data set,
29761,62.0,data visualization: the importance of excludin...,
35264,224.0,"using apply, sapply, lapply in r",
35276,961.0,beyond parallelize and collect,
35277,268.0,sector correlations shiny app,
35278,268.0,sector correlations shiny app,
35279,268.0,sector correlations shiny app,


In [243]:
# article ids of the interactions whose email is null
article_id_test = df[df.email.isnull()]['article_id']
article_id_test

25131    1016.0
29758    1393.0
29759      20.0
29760    1174.0
29761      62.0
35264     224.0
35276     961.0
35277     268.0
35278     268.0
35279     268.0
35280     268.0
35281     415.0
35282     846.0
35283     268.0
35284     162.0
42749     647.0
42750     965.0
Name: article_id, dtype: float64

In [244]:
# unique_articles =  df[df.email.notnull()].article_id.nunique() # The number of unique articles that have at least one interaction
unique_articles =  df.article_id.nunique() # The number of unique articles that have at least one interaction
total_articles = df_content.shape[0] # The number of unique articles on the IBM platform (duplicates are removed)
unique_users = df.email.nunique() # The number of unique users
# user_article_interactions = df[df.email.notnull()].shape[0] # The number of user-article interactions
user_article_interactions = df.shape[0] # The number of user-article interactions

In [245]:
print(f'The number of unique articles that have at least one interaction is {unique_articles}.')
print(f'The number of unique articles on the IBM platform is {total_articles}.')
print(f'The number of unique users is {unique_users}.')
print(f'The number of user-article interactions is {user_article_interactions}.')

The number of unique articles that have at least one interaction is 714.
The number of unique articles on the IBM platform is 1051.
The number of unique users is 5148.
The number of user-article interactions is 45993.


`4.` Use the cells below to find the most viewed **article_id**, as well as how often it was viewed.  After talking to the company leaders, the `email_mapper` function was deemed a reasonable way to map users to ids.  There were a small number of null values, and it was found that all of these null values likely belonged to a single user (which is how they are stored using the function below).
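The encoding idea behind `email_mapper` can be sketched in isolation. The function `map_to_ids` below is a hypothetical standalone version that takes a list instead of reading `df`, and it uses `None` as a stand-in for a missing email; it is meant to show why all missing values end up sharing one id.

```python
def map_to_ids(emails):
    """Assign 1-based ids in order of first appearance (hypothetical helper)."""
    coded = {}
    ids = []
    for email in emails:
        if email not in coded:
            # first time this value is seen: give it the next free id
            coded[email] = len(coded) + 1
        ids.append(coded[email])
    return ids


ids = map_to_ids(['x@a', 'y@b', None, 'x@a', None])
print(ids)
```

Because every missing value is looked up under the same dictionary key, the nulls collapse into a single extra user. This is consistent with `user_id` values reaching 5149 in later output even though there are 5148 unique non-null emails.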

In [246]:
# The most viewed article in the dataset was viewed how many times?
# df.groupby('article_id').count().sort_values(by='email', ascending=False).head(1)
# df.groupby('article_id').count()['email'].sort_values(ascending=False).head(1)
#get value
df.groupby('article_id').count()['email'].sort_values(ascending=False).head(1).values[0]

937

In [247]:
most_viewed_article_id = str(df.article_id.value_counts().index[0]) # The most viewed article in the dataset as a string with one value following the decimal 
max_views = df.groupby('article_id').count()['email'].sort_values(ascending=False).head(1).values[0] # The most viewed article in the dataset was viewed how many times?

In [248]:
## No need to change the code here - this will be helpful for later parts of the notebook
# Run this cell to map the user email to a user_id column and remove the email column

def email_mapper():
    coded_dict = dict()
    cter = 1
    email_encoded = []
    
    for val in df['email']:
        if val not in coded_dict:
            coded_dict[val] = cter
            cter+=1
        
        email_encoded.append(coded_dict[val])
    return email_encoded

email_encoded = email_mapper()
del df['email']
df['user_id'] = email_encoded

# show header
df.head()

Unnamed: 0,article_id,title,user_id
0,1430.0,"using pixiedust for fast, flexible, and easier...",1
1,1314.0,healthcare python streaming application demo,2
2,1429.0,use deep learning for image classification,3
3,1338.0,ml optimization using cognitive assistant,4
4,1276.0,deploy your python model as a restful api,5


In [249]:
df.dtypes

article_id    float64
title          object
user_id         int64
dtype: object

In [250]:
# make article_id a string
df['article_id'] = df['article_id'].astype(str)

In [251]:
## If you stored all your results in the variable names above, 
## you shouldn't need to change anything in this cell

sol_1_dict = {
    '`50% of individuals have _____ or fewer interactions.`': median_val,
    '`The total number of user-article interactions in the dataset is ______.`': user_article_interactions,
    '`The maximum number of user-article interactions by any 1 user is ______.`': max_views_by_user,
    '`The most viewed article in the dataset was viewed _____ times.`': max_views,
    '`The article_id of the most viewed article is ______.`': most_viewed_article_id,
    '`The number of unique articles that have at least 1 rating ______.`': unique_articles,
    '`The number of unique users in the dataset is ______`': unique_users,
    '`The number of unique articles on the IBM platform`': total_articles
}

# Test your dictionary against the solution
t.sol_1_test(sol_1_dict)

It looks like you have everything right here! Nice job!


### <a class="anchor" id="Rank">Part II: Rank-Based Recommendations</a>

Unlike in the earlier lessons, we don't actually have ratings for whether a user liked an article or not.  We only know that a user has interacted with an article.  In these cases, the popularity of an article can really only be based on how often an article was interacted with.

`1.` Fill in the function below to return the **n** top articles, ordered from most to fewest interactions. Test your function using the tests below.
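Ranking by popularity amounts to counting occurrences and keeping the largest counts. A minimal sketch with the standard library, assuming interactions arrive as a flat list of titles; `top_n` and the sample titles are hypothetical and are not the notebook's `get_top_articles`.

```python
from collections import Counter


def top_n(interactions, n):
    """Return the n most frequent items, most frequent first (hypothetical helper)."""
    return [item for item, _count in Counter(interactions).most_common(n)]


views = ['spark lab', 'deep learning', 'spark lab', 'brunel', 'deep learning', 'spark lab']
print(top_n(views, 2))
```

Note that a ranking like this is the same for every user, which makes it a reasonable fallback for new users with no history.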

In [252]:
def get_top_articles(n, df=df):
    '''
    INPUT:
    n - (int) the number of top articles to return
    df - (pandas dataframe) df as defined at the top of the notebook 
    
    OUTPUT:
    top_articles - (list) A list of the top 'n' article titles 
    
    '''
    # get top n article ids from df
    # top_n_articles_ids = df.article_id.value_counts().head(n).index
    # get top n article titles from df using top_articles_ids 
    # top_articles = df[df.article_id.isin(top_n_articles_ids)].title.unique().tolist()
    # df.groupby('title').count()['user_id'].sort_values(ascending=False).head(10).index
    top_articles = df.title.value_counts().head(n).index.to_list()


    return top_articles # Return the top article titles from df (not df_content)

def get_top_article_ids(n, df=df):
    '''
    INPUT:
    n - (int) the number of top articles to return
    df - (pandas dataframe) df as defined at the top of the notebook 
    
    OUTPUT:
    top_articles - (list) A list of the top 'n' article ids 
    
    '''
    # get top n article ids from df
    top_articles = df.article_id.value_counts().head(n).index.to_list()
 
    return top_articles # Return the top article ids

In [253]:
print(get_top_articles(10))
print(get_top_article_ids(10))

['use deep learning for image classification', 'insights from new york car accident reports', 'visualize car data with brunel', 'use xgboost, scikit-learn & ibm watson machine learning apis', 'predicting churn with the spss random tree algorithm', 'healthcare python streaming application demo', 'finding optimal locations of new store using decision optimization', 'apache spark lab, part 1: basic concepts', 'analyze energy consumption in buildings', 'gosales transactions for logistic regression model']
['1429.0', '1330.0', '1431.0', '1427.0', '1364.0', '1314.0', '1293.0', '1170.0', '1162.0', '1304.0']


In [254]:
# Test your function by returning the top 5, 10, and 20 articles
top_5 = get_top_articles(5)
top_10 = get_top_articles(10)
top_20 = get_top_articles(20)

# Test each of your three lists from above
t.sol_2_test(get_top_articles)

Your top_5 looks like the solution list! Nice job.
Your top_10 looks like the solution list! Nice job.
Your top_20 looks like the solution list! Nice job.


### <a class="anchor" id="User-User">Part III: User-User Based Collaborative Filtering</a>


`1.` Use the function below to reformat the **df** dataframe to be shaped with users as the rows and articles as the columns.  

* Each **user** should only appear in each **row** once.


* Each **article** should only show up in one **column**.  


* **If a user has interacted with an article, place a 1 where that user's row meets that article's column**.  It does not matter how many times the user has interacted with the article: every such entry should be a 1.  


* **If a user has not interacted with an article, place a 0 where that user's row meets that article's column**. 

Use the tests to make sure the basic structure of your matrix matches what is expected by the solution.
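
A minimal sketch of the reshaping described above, on a toy interactions table. The `toy` frame and its values are made up for illustration; only the `user_id`, `article_id`, and `title` column names come from this notebook.

```python
import pandas as pd

# Hypothetical toy interactions; user 1 viewed article '10.0' twice.
toy = pd.DataFrame({
    'user_id':    [1, 1, 1, 2, 3],
    'article_id': ['10.0', '10.0', '20.0', '20.0', '30.0'],
    'title':      ['a', 'a', 'b', 'b', 'c'],
})

# One row per user, one column per article.
# Repeated views collapse into a single cell, missing pairs become NaN,
# and notnull() turns the result into 1's and 0's.
toy_matrix = (
    toy.groupby(['user_id', 'article_id'])['title']
       .max()
       .unstack()
       .notnull()
       .astype(int)
)
print(toy_matrix)
```

The repeated view by user 1 still yields a single 1, which is the behavior the bullets above ask for.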

In [255]:
df.head()

Unnamed: 0,article_id,title,user_id
0,1430.0,"using pixiedust for fast, flexible, and easier...",1
1,1314.0,healthcare python streaming application demo,2
2,1429.0,use deep learning for image classification,3
3,1338.0,ml optimization using cognitive assistant,4
4,1276.0,deploy your python model as a restful api,5


In [256]:
df.groupby(['user_id', 'article_id'])['title'].max()

user_id  article_id
1        1052.0        access db2 warehouse on cloud and db2 with python
         109.0                                     tensorflow quick tips
         1170.0                 apache spark lab, part 1: basic concepts
         1183.0                                 categorize urban density
         1185.0                    classify tumors with machine learning
                                             ...                        
5146     1416.0        united states demographic measures: population...
         142.0         neural networks for beginners: popular types a...
5147     233.0            bayesian nonparametric models – stats and bots
5148     1160.0             analyze accident reports on amazon emr spark
5149     16.0          higher-order logistic regression for large dat...
Name: title, Length: 33682, dtype: object

In [257]:
df_pivot = df.pivot_table(index=['user_id', 'article_id'], values='title', aggfunc='count')
df_pivot

Unnamed: 0_level_0,Unnamed: 1_level_0,title
user_id,article_id,Unnamed: 2_level_1
1,1052.0,2
1,109.0,1
1,1170.0,2
1,1183.0,2
1,1185.0,2
...,...,...
5146,1416.0,1
5146,142.0,1
5147,233.0,1
5148,1160.0,1


In [258]:
# transform the pivot table to a matrix with user_id as rows and article_id as columns
df_matrix = df_pivot.unstack()
df_matrix.head(25)


Unnamed: 0_level_0,title,title,title,title,title
article_id,0.0,100.0,...,996.0,997.0
user_id,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
1,,,...,,
2,,,...,,
3,,,...,,
4,,,...,,
5,,,...,,
6,,,...,,
7,,,...,,
8,,,...,,
9,,,...,,
10,,,...,,


In [259]:
# replace all nan values with 0 and all other values with 1
df_matrix = df_matrix.notnull().astype('int')
df_matrix.head(25)

Unnamed: 0_level_0,title,title,title,title,title
article_id,0.0,100.0,...,996.0,997.0
user_id,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
1,0,0,...,0,0
2,0,0,...,0,0
3,0,0,...,0,0
4,0,0,...,0,0
5,0,0,...,0,0
6,0,0,...,0,0
7,0,0,...,0,0
8,0,0,...,0,0
9,0,0,...,0,0
10,0,0,...,0,0


In [260]:
# Transform df to a matrix with user_id as rows and article_id on the columns with 1 values where a user interacted with an article and a 0 otherwise
df_matrix = df.groupby(['user_id', 'article_id'])['title'].max().unstack()
df_matrix

article_id,0.0,100.0,...,996.0,997.0
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1,,,...,,
2,,,...,,
3,,,...,,
4,,,...,,
5,,,...,,
...,...,...,...,...,...
5145,,,...,,
5146,,,...,,
5147,,,...,,
5148,,,...,,


In [261]:
# replace all nan values with 0 and all other values with 1
df_matrix = df_matrix.notnull().astype('int')

In [262]:
# create the user-article matrix with 1's and 0's

def create_user_item_matrix(df):
    '''
    INPUT:
    df - pandas dataframe with article_id, title, user_id columns
    
    OUTPUT:
    user_item - user item matrix 
    
    Description:
    Return a matrix with user ids as rows and article ids on the columns with 1 values where a user interacted with 
    an article and a 0 otherwise
    '''
    df_matrix = df.groupby(['user_id', 'article_id'])['title'].max().unstack()
    user_item = df_matrix.notnull().astype('int')
    
    return user_item # return the user_item matrix 

# user_item = create_user_item_matrix(df)

In [263]:
user_item = create_user_item_matrix(df)

In [264]:
## Tests: You should just need to run this cell.  Don't change the code.
assert user_item.shape[0] == 5149, "Oops!  The number of users in the user-article matrix doesn't look right."
assert user_item.shape[1] == 714, "Oops!  The number of articles in the user-article matrix doesn't look right."
assert user_item.sum(axis=1)[1] == 36, "Oops!  The number of articles seen by user 1 doesn't look right."
print("You have passed our quick tests!  Please proceed!")

You have passed our quick tests!  Please proceed!


`2.` Complete the function below, which should take a user_id and return an ordered list of the most similar users to that user (from most similar to least similar).  The returned list should not contain the provided user_id, since every user is trivially most similar to themselves. Because each entry of the user-item matrix is binary, it (perhaps) makes sense to compute similarity as the dot product of two users' rows, which counts the articles both users have interacted with. 

Use the tests to test your function.
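
To see why the dot product works as a similarity here, the sketch below uses a small made-up binary matrix (the rows and values are assumptions for illustration, not data from this notebook). For 0/1 rows, the dot product of two users is the number of articles they share.

```python
import numpy as np

# Hypothetical binary user-item rows (columns are four articles).
users = np.array([
    [1, 1, 0, 1],   # user 0
    [1, 0, 0, 1],   # user 1
    [0, 0, 1, 0],   # user 2
])

# Dot product of every user with user 0 = number of shared articles.
similarity_to_0 = users @ users[0]

# Rank the other users from most to least similar, dropping user 0 itself.
order = [int(i) for i in np.argsort(-similarity_to_0, kind='stable') if i != 0]
print(similarity_to_0, order)
```

User 0's similarity to itself is simply the number of articles it has seen, which is why the user's own id always lands at the top before it is removed.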

In [265]:
df.head()


Unnamed: 0,article_id,title,user_id
0,1430.0,"using pixiedust for fast, flexible, and easier...",1
1,1314.0,healthcare python streaming application demo,2
2,1429.0,use deep learning for image classification,3
3,1338.0,ml optimization using cognitive assistant,4
4,1276.0,deploy your python model as a restful api,5


In [266]:
user_item.dot(user_item.loc[1]).sort_values(ascending=False)

user_id
1       36
3933    35
23      17
3782    17
203     15
        ..
2326     0
2327     0
2328     0
2329     0
5149     0
Length: 5149, dtype: int64

In [267]:
type(user_item.dot(user_item.loc[1]))

pandas.core.series.Series

In [268]:
# sort the values in descending order
user_item.dot(user_item.loc[1]).sort_values(ascending=False).index

Index([   1, 3933,   23, 3782,  203, 4459, 3870,  131, 4201,   46,
       ...
       2317, 2319, 2321, 2323, 2325, 2326, 2327, 2328, 2329, 5149],
      dtype='int64', name='user_id', length=5149)

In [269]:
def find_similar_users(user_id, user_item=user_item):
    '''
    INPUT:
    user_id - (int) a user_id
    user_item - (pandas dataframe) matrix of users by articles: 
                1's when a user has interacted with an article, 0 otherwise
    
    OUTPUT:
    similar_users - (list) an ordered list where the closest users (largest dot product users)
                    are listed first
    
    Description:
    Computes the similarity of the provided user to every other user based on the dot product
    Returns an ordered list of user ids, most similar first, excluding the provided user_id
    
    '''
    # compute similarity of each user to the provided user
    similarity_to_user_id = user_item.dot(user_item.loc[user_id])

    # sort by similarity
    similarity_to_user_id_sorted = similarity_to_user_id.sort_values(ascending=False)

    # create list of just the ids
    most_similar_users = similarity_to_user_id_sorted.index.tolist()
   
    # remove the own user's id
    most_similar_users.remove(user_id)
       
    return most_similar_users # return a list of the users in order from most to least similar
        

In [270]:
# Do a spot check of your function
print("The 10 most similar users to user 1 are: {}".format(find_similar_users(1)[:10]))
print("The 5 most similar users to user 3933 are: {}".format(find_similar_users(3933)[:5]))
print("The 3 most similar users to user 46 are: {}".format(find_similar_users(46)[:3]))

The 10 most similar users to user 1 are: [3933, 23, 3782, 203, 4459, 3870, 131, 4201, 46, 5041]
The 5 most similar users to user 3933 are: [1, 23, 3782, 203, 4459]
The 3 most similar users to user 46 are: [4201, 3782, 23]


`3.` Now that you have a function that provides the most similar users to each user, you will want to use these users to find articles you can recommend.  Complete the functions below to return the articles you would recommend to each user. 

In [271]:
df.head()


Unnamed: 0,article_id,title,user_id
0,1430.0,"using pixiedust for fast, flexible, and easier...",1
1,1314.0,healthcare python streaming application demo,2
2,1429.0,use deep learning for image classification,3
3,1338.0,ml optimization using cognitive assistant,4
4,1276.0,deploy your python model as a restful api,5


In [272]:
article_ids = [1430.0,1314.0] # floats: article_id is stored as strings (see df.dtypes), so these match nothing below

In [273]:
df[df['article_id'].isin(article_ids)]['title'].unique().tolist()

[]

In [274]:
def get_article_names(article_ids, df=df):
    '''
    INPUT:
    article_ids - (list) a list of article ids
    df - (pandas dataframe) df as defined at the top of the notebook
    
    OUTPUT:
    article_names - (list) a list of article names associated with the list of article ids 
                    (this is identified by the title column)
    '''
    # get article names from df using article ids
    article_names = df[df['article_id'].isin(article_ids)]['title'].unique().tolist()
    
    return article_names # Return the article names associated with list of article ids


def get_user_articles(user_id, user_item=user_item):
    '''
    INPUT:
    user_id - (int) a user id
    user_item - (pandas dataframe) matrix of users by articles: 
                1's when a user has interacted with an article, 0 otherwise
    
    OUTPUT:
    article_ids - (list) a list of the article ids seen by the user
    article_names - (list) a list of article names associated with the list of article ids 
                    (this is identified by the title column in df)
    
    Description:
    Provides a list of the article_ids and article titles that have been seen by a user
    '''
    # get list of article ids seen by user
    article_ids = user_item.loc[user_id][user_item.loc[user_id] == 1].index.tolist()
    # get article names based on article ids
    article_names = get_article_names(article_ids)
    
    return article_ids, article_names # return the ids and names


def user_user_recs(user_id, m=10):
    '''
    INPUT:
    user_id - (int) a user id
    m - (int) the number of recommendations you want for the user
    
    OUTPUT:
    recs - (list) a list of recommendations for the user
    
    Description:
    Loops through the users based on closeness to the input user_id
    For each user - finds articles the user hasn't seen before and provides them as recs
    Does this until m recommendations are found
    
    Notes:
    Users who are the same closeness are chosen arbitrarily as the 'next' user
    
    For the user where the number of recommended articles starts below m 
    and ends exceeding m, the last items are chosen arbitrarily
    
    '''
    # find similar users
    similar_users = find_similar_users(user_id)
    # get articles seen by user
    user_article_ids = get_user_articles(user_id)[0]
    
    # set up recommendations
    recs = []
    # loop through similar users
    for user in similar_users:
        # stop looking at further users once m recommendations are found
        if len(recs) >= m:
            break
        # get articles seen by similar user
        similar_user_article_ids = get_user_articles(user)[0]
        # keep articles the user hasn't seen and that aren't already recommended
        for article in similar_user_article_ids:
            if len(recs) >= m:
                break
            if article not in user_article_ids and article not in recs:
                recs.append(article)
    
    return recs # return your recommendations for this user_id    

In [275]:
# find_similar_users(1)[1]
# user_article_ids = get_user_articles(1)[0]
# similar_user_article_ids = get_user_articles(23)[0]
# [article for article in similar_user_article_ids if article not in user_article_ids]

In [276]:
# Check Results
get_article_names(user_user_recs(1, 10)) # Return 10 recommendations for user 1

['analyze energy consumption in buildings',
 'analyze accident reports on amazon emr spark',
 '520    using notebooks with pixiedust for fast, flexi...\nName: title, dtype: object',
 '1448    i ranked every intro to data science course on...\nName: title, dtype: object',
 'data tidying in data science experience',
 'airbnb data for analytics: vancouver listings',
 'recommender systems: approaches & algorithms',
 'airbnb data for analytics: mallorca reviews',
 'analyze facebook data using ibm watson and watson studio',
 'a tensorflow regression model to predict house values']

In [277]:
df.dtypes

article_id    object
title         object
user_id        int64
dtype: object

In [278]:
get_article_names([1024.0, 1176.0, 1305.0, 1314.0, 1422.0, 1427.0])

[]

In [279]:
get_user_articles(20)[0]

['1320.0', '232.0', '844.0']

In [280]:
# Test your functions here - No need to change this code - just run this cell
# assert set(get_article_names(['1024.0', '1176.0', '1305.0', '1314.0', '1422.0', '1427.0'])) == set(['using deep learning to reconstruct high-resolution audio', 'build a python app on the streaming analytics service', 'gosales transactions for naive bayes model', 'healthcare python streaming application demo', 'use r dataframes & ibm watson natural language understanding', 'use xgboost, scikit-learn & ibm watson machine learning apis']), "Oops! Your the get_article_names function doesn't work quite how we expect."
# assert set(get_article_names(['1320.0', '232.0', '844.0'])) == set(['housing (2015): united states demographic measures','self-service data preparation with ibm data refinery','use the cloudant-spark connector in python notebook']), "Oops! Your the get_article_names function doesn't work quite how we expect."
assert set(get_user_articles(20)[0]) == set(['1320.0', '232.0', '844.0'])
assert set(get_user_articles(20)[1]) == set(['housing (2015): united states demographic measures', 'self-service data preparation with ibm data refinery','use the cloudant-spark connector in python notebook'])
assert set(get_user_articles(2)[0]) == set(['1024.0', '1176.0', '1305.0', '1314.0', '1422.0', '1427.0'])
assert set(get_user_articles(2)[1]) == set(['using deep learning to reconstruct high-resolution audio', 'build a python app on the streaming analytics service', 'gosales transactions for naive bayes model', 'healthcare python streaming application demo', 'use r dataframes & ibm watson natural language understanding', 'use xgboost, scikit-learn & ibm watson machine learning apis'])
print("If this is all you see, you passed all of our tests!  Nice job!")

If this is all you see, you passed all of our tests!  Nice job!


`4.` Now we are going to improve the consistency of the **user_user_recs** function from above.  

* When several users are equally close to a given user, do not choose among them arbitrarily: choose the users with the most total article interactions before those with fewer article interactions.


* When the last user considered would push the number of recommended articles from below m to above m, do not choose among that user's articles arbitrarily: choose the articles with the most total interactions before those with fewer total interactions. This ranking should match what the **get_top_articles** function you wrote earlier would return.
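
Both tie-breaking rules amount to a sort on two keys. The sketch below is one way to express them, assuming a small neighbors table with the column names used later in this notebook; the article counts in `article_counts` are invented for illustration.

```python
import pandas as pd

# Neighbors listed in arbitrary order; users 23 and 3782 tie on similarity.
neighbors = pd.DataFrame({
    'neighbor_id':      [203, 3782, 23, 3933],
    'similarity':       [15, 17, 17, 35],
    'num_interactions': [160, 363, 364, 45],
})

# Sort by similarity first, then break ties with the interaction count.
ranked = neighbors.sort_values(['similarity', 'num_interactions'], ascending=False)
ranked_ids = ranked['neighbor_id'].tolist()

# Hypothetical total interactions per article, used to order candidate articles.
article_counts = pd.Series({'a': 5, 'b': 9, 'c': 2})
candidates = ['a', 'c', 'b']
ordered = sorted(candidates, key=lambda art: article_counts[art], reverse=True)

print(ranked_ids, ordered)
```

User 23 now comes before user 3782 because it has one more interaction, even though the two are equally similar.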

In [282]:
test_df = user_item.dot(user_item.loc[1]).sort_values(ascending=False).to_frame().reset_index().rename(columns={0: 'similarity', 'user_id': 'neighbor_id'})

In [284]:
test_df

Unnamed: 0,neighbor_id,similarity
0,1,36
1,3933,35
2,23,17
3,3782,17
4,203,15
...,...,...
5144,4211,0
5145,4210,0
5146,4208,0
5147,4207,0


In [285]:
# drop own user's id from test_df
test_df = test_df[test_df['neighbor_id'] != 1]

In [290]:
test_df.head()

Unnamed: 0,neighbor_id,similarity,num_interactions
1,3933,35,364
2,23,17,363
3,3782,17,170
4,203,15,169
5,4459,15,160


In [288]:
# get all neighbor_ids from test_df and check original df for number of interactions
df[df['user_id'].isin(test_df['neighbor_id'])].groupby('user_id')['article_id'].count().sort_values(ascending=False).head()

user_id
23      364
3782    363
98      170
3764    169
203     160
Name: article_id, dtype: int64

In [292]:
# create a dictionary of user_id and number of interactions for all neighbor_id in test_df
user_interactions = df[df['user_id'].isin(test_df['neighbor_id'])].groupby('user_id')['article_id'].count().sort_values(ascending=False).to_dict()
user_interactions


{23: 364,
 3782: 363,
 98: 170,
 3764: 169,
 203: 160,
 4459: 158,
 242: 148,
 49: 147,
 3910: 147,
 131: 145,
 3697: 145,
 3870: 144,
 58: 142,
 3740: 140,
 21: 137,
 4785: 136,
 52: 132,
 3596: 131,
 170: 116,
 3169: 114,
 184: 104,
 60: 103,
 4892: 102,
 912: 102,
 5140: 101,
 3540: 101,
 651: 98,
 204: 97,
 3072: 96,
 5138: 95,
 371: 95,
 249: 94,
 3784: 94,
 3483: 92,
 288: 91,
 295: 91,
 4706: 89,
 3006: 89,
 322: 85,
 591: 84,
 273: 84,
 619: 84,
 3622: 83,
 2926: 83,
 135: 82,
 3: 82,
 4134: 82,
 4277: 82,
 8: 82,
 2975: 81,
 290: 80,
 3353: 80,
 3621: 80,
 696: 79,
 186: 79,
 2982: 79,
 223: 79,
 40: 78,
 3358: 78,
 665: 77,
 3532: 76,
 4932: 76,
 330: 76,
 4484: 75,
 4293: 75,
 3197: 74,
 45: 73,
 3500: 72,
 195: 72,
 3578: 70,
 87: 69,
 395: 69,
 113: 68,
 38: 68,
 765: 68,
 669: 67,
 3818: 67,
 4883: 67,
 5041: 67,
 3208: 66,
 4755: 66,
 3684: 65,
 750: 64,
 726: 64,
 187: 63,
 46: 63,
 214: 62,
 5057: 62,
 3967: 62,
 4201: 61,
 4934: 61,
 418: 60,
 754: 60,
 3141: 60,
 256

In [293]:
# create a column for number of interactions in test_df called 'num_interactions' based on user_interactions dictionary
test_df['num_interactions'] = test_df['neighbor_id'].map(user_interactions)



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



In [294]:
test_df

Unnamed: 0,neighbor_id,similarity,num_interactions
1,3933,35,45
2,23,17,364
3,3782,17,363
4,203,15,160
5,4459,15,158
...,...,...,...
5144,4211,0,2
5145,4210,0,10
5146,4208,0,1
5147,4207,0,3


In [None]:
num_interactions = df.groupby('user_id')['article_id'].count().to_frame().reset_index().rename(columns={'article_id': 'num_interactions'})

In [295]:
def get_top_sorted_users(user_id, df=df, user_item=user_item):
    '''
    INPUT:
    user_id - (int)
    df - (pandas dataframe) df as defined at the top of the notebook 
    user_item - (pandas dataframe) matrix of users by articles: 
            1's when a user has interacted with an article, 0 otherwise
    
            
    OUTPUT:
    neighbors_df - (pandas dataframe) a dataframe with:
                    neighbor_id - is a neighbor user_id
                    similarity - measure of the similarity of each user to the provided user_id
                    num_interactions - the number of article interactions recorded for the user
                    
    Other Details - sort the neighbors_df by the similarity and then by number of interactions where 
                    highest of each is higher in the dataframe
     
    '''
    # compute the similarity of every user to user_id and name the columns
    neighbors_df = user_item.dot(user_item.loc[user_id]).to_frame(name='similarity').reset_index().rename(columns={'user_id': 'neighbor_id'})
    # drop the user's own row; copy so the column assignment below does not act on a slice
    neighbors_df = neighbors_df[neighbors_df['neighbor_id'] != user_id].copy()

    # count the interactions of every user in df (not only the neighbors of user 1 held in test_df)
    user_interactions = df.groupby('user_id')['article_id'].count()
    
    # add the 'num_interactions' column by looking up each neighbor_id
    neighbors_df['num_interactions'] = neighbors_df['neighbor_id'].map(user_interactions)

    # sort by similarity, then by number of interactions, highest first
    neighbors_df = neighbors_df.sort_values(['similarity', 'num_interactions'], ascending=False)
    
    return neighbors_df # Return the dataframe specified in the doc_string


def user_user_recs_part2(user_id, m=10):
    '''
    INPUT:
    user_id - (int) a user id
    m - (int) the number of recommendations you want for the user
    
    OUTPUT:
    recs - (list) a list of recommendations for the user by article id
    rec_names - (list) a list of recommendations for the user by article title
    
    Description:
    Loops through the users based on closeness to the input user_id
    For each user - finds articles the user hasn't seen before and provides them as recs
    Does this until m recommendations are found
    
    Notes:
    * Choose the users that have the most total article interactions 
    before choosing those with fewer article interactions.

    * Choose the articles with the most total interactions 
    before choosing those with fewer total interactions. 
   
    '''
    # Your code here
    
    return recs, rec_names
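
The body above is still a placeholder. One possible shape for it is sketched below as a standalone helper, so it can run without the rest of the notebook. The helper name `recs_from_neighbors`, its arguments, and the toy data are all hypothetical; in the notebook the inputs would presumably come from `get_top_sorted_users`, `get_user_articles`, and the article interaction counts in `df`.

```python
def recs_from_neighbors(seen, neighbor_articles, article_counts, m=10):
    """
    seen              - set of article ids the target user has already seen
    neighbor_articles - list of article-id lists, one per neighbor,
                        with neighbors already sorted most similar first
    article_counts    - dict mapping article id -> total interactions
    """
    recs = []
    for articles in neighbor_articles:
        # Unseen, not-yet-recommended articles from this neighbor,
        # most popular first (article id breaks any remaining tie).
        new = sorted(
            (a for a in set(articles) if a not in seen and a not in recs),
            key=lambda a: (-article_counts.get(a, 0), a),
        )
        # Take only as many as are still needed to reach m.
        recs.extend(new[: m - len(recs)])
        if len(recs) >= m:
            break
    return recs


# Toy data, invented for illustration.
toy_seen = {'1'}
toy_neighbors = [['1', '2', '3'], ['3', '4']]
toy_counts = {'2': 5, '3': 9, '4': 1}
toy_recs = recs_from_neighbors(toy_seen, toy_neighbors, toy_counts, m=3)
print(toy_recs)
```

Sorting inside the loop means the cut-off at m always drops the least popular articles of the last neighbor considered.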

In [None]:
# WIP

In [None]:
# Quick spot check - don't change this code - just use it to test your functions
rec_ids, rec_names = user_user_recs_part2(20, 10)
print("The top 10 recommendations for user 20 are the following article ids:")
print(rec_ids)
print()
print("The top 10 recommendations for user 20 are the following article names:")
print(rec_names)

`5.` Use your functions from above to correctly fill in the solutions to the dictionary below.  Then test your dictionary against the solution.  Provide the code you need to answer each following the comments below.

In [None]:
### Tests with a dictionary of results

user1_most_sim = find_similar_users(1)[0] # Find the user that is most similar to user 1 
user131_10th_sim = get_top_sorted_users(131)['neighbor_id'].iloc[9] # Find the 10th most similar user to user 131

In [None]:
## Dictionary Test Here
sol_5_dict = {
    'The user that is most similar to user 1.': user1_most_sim, 
    'The user that is the 10th most similar to user 131': user131_10th_sim,
}

t.sol_5_test(sol_5_dict)

`6.` If we were given a new user, which of the above functions would you be able to use to make recommendations?  Explain.  Can you think of a better way we might make recommendations?  Use the cell below to explain a better method for new users.

**Provide your response here.**

`7.` Using your existing functions, provide the top 10 recommended articles you would provide for a new user below.  You can test your function against our thoughts to make sure we are all on the same page with how we might make a recommendation.
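
A new user has no row in the user-item matrix, so there are no neighbors to borrow from. A minimal sketch of a popularity fallback is shown below, assuming an interaction log with the same column names as **df**. The `toy` data and the `popular_ids` / `recs_for` helpers are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Hypothetical interaction log; only the column names mirror the notebook's df.
toy = pd.DataFrame({
    'user_id':    [1, 1, 2, 2, 3, 3],
    'article_id': ['a', 'b', 'a', 'b', 'a', 'c'],
})

def popular_ids(n, df):
    # Same idea as get_top_article_ids: most-interacted article ids first.
    return df['article_id'].value_counts().head(n).index.tolist()

def recs_for(user_id, df, n=2):
    # A user with no rows in df gets the most popular articles.
    if user_id not in set(df['user_id']):
        return popular_ids(n, df)
    # Known users get the most popular articles they have not seen yet
    # (a stand-in for the collaborative-filtering path).
    seen = set(df.loc[df['user_id'] == user_id, 'article_id'])
    return [a for a in popular_ids(len(df), df) if a not in seen][:n]

toy_new_user_recs = recs_for('0.0', toy)
print(toy_new_user_recs)
```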

In [None]:
new_user = '0.0'

# What would your recommendations be for this new user '0.0'?  As a new user, they have no observed articles.
# Provide a list of the top 10 article ids you would give to this new user
new_user_recs = get_top_article_ids(10) # with no history, fall back to the most popular articles



In [None]:
assert set(new_user_recs) == set(['1314.0','1429.0','1293.0','1427.0','1162.0','1364.0','1304.0','1170.0','1431.0','1330.0']), "Oops!  It makes sense that in this case we would want to recommend the most popular articles, because we don't know anything about these users."

print("That's right!  Nice job!")

### <a class="anchor" id="Content-Recs">Part IV: Content Based Recommendations (EXTRA - NOT REQUIRED)</a>

Another method we might use to make recommendations is to rank the articles associated with some term.  You might consider content to be the **doc_body**, **doc_description**, or **doc_full_name**.  There isn't one way to create a content based recommendation, especially considering that each of these columns holds content-related information.  

`1.` Use the function body below to create a content based recommender.  Since there isn't one right answer for this recommendation tactic, no test functions are provided.  Feel free to change the function inputs if you decide you want to try a method that requires more input values.  The input values are currently set with one idea in mind that you may use to make content based recommendations.  One additional idea is that you might want to choose the most popular recommendations that meet your 'content criteria', but again, there is a lot of flexibility in how you might make these recommendations.
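
As one illustration of the "most popular articles that meet a content criterion" idea, the sketch below matches a search term against titles and orders the matches by interaction count. The `content_recs` helper is hypothetical and the counts are invented; a real version might draw titles from **doc_full_name** and counts from **df**.

```python
# Hypothetical (title, interaction count) pairs; the counts are invented.
articles = [
    ('use deep learning for image classification', 30),
    ('visualize car data with brunel', 20),
    ('using deep learning to reconstruct high-resolution audio', 10),
]

def content_recs(term, articles, n=5):
    # Keep the titles that contain the term, ignoring case.
    term = term.lower()
    matches = [(title, count) for title, count in articles if term in title.lower()]
    # Most popular matches first.
    matches.sort(key=lambda pair: pair[1], reverse=True)
    return [title for title, _ in matches[:n]]

deep_recs = content_recs('deep learning', articles)
print(deep_recs)
```

Plain substring matching is the simplest possible criterion; it is only meant to show where a richer text similarity could plug in.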

### This part is NOT REQUIRED to pass this project.  However, you may choose to take this on as an extra way to show off your skills.

In [None]:
def make_content_recs():
    '''
    INPUT:
    
    OUTPUT:
    
    '''

`2.` Now that you have put together your content-based recommendation system, use the cell below to write a summary explaining how your content based recommender works.  Do you see any possible improvements that could be made to your function?  Is there anything novel about your content based recommender?

### This part is NOT REQUIRED to pass this project.  However, you may choose to take this on as an extra way to show off your skills.

**Write an explanation of your content based recommendation system here.**

`3.` Use your content-recommendation system to make recommendations for the below scenarios based on the comments.  Again no tests are provided here, because there isn't one right answer that could be used to find these content based recommendations.

### This part is NOT REQUIRED to pass this project.  However, you may choose to take this on as an extra way to show off your skills.

In [None]:
# make recommendations for a brand new user


# make recommendations for a user who only has interacted with article id '1427.0'



### <a class="anchor" id="Matrix-Fact">Part V: Matrix Factorization</a>

In this part of the notebook, you will use matrix factorization to make article recommendations to the users on the IBM Watson Studio platform.

`1.` You should have already created a **user_item** matrix in **question 1** of **Part III** above.  This first question here will just require that you run the cells to get things set up for the rest of **Part V** of the notebook. 

In [None]:
# Load the matrix here
user_item_matrix = pd.read_pickle('user_item_matrix.p')

In [None]:
# quick look at the matrix
user_item_matrix.head()

`2.` In this situation, you can use Singular Value Decomposition from [numpy](https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.linalg.svd.html) on the user-item matrix.  Use the cell to perform SVD, and explain why this is different than in the lesson.
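
A minimal sketch of `np.linalg.svd` on a small made-up 0/1 matrix (the `toy` values are assumptions, not notebook data). The call needs a fully populated numeric matrix, which a user-item matrix of 1's and 0's satisfies. `full_matrices=False` is a choice made here to get the compact factors; the default returns a square `u`.

```python
import numpy as np

# Hypothetical 3-user x 4-article binary matrix with no missing values.
toy = np.array([
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 1, 0, 0],
], dtype=float)

# Compact SVD: u is users x k, s holds k singular values, vt is k x articles.
u, s, vt = np.linalg.svd(toy, full_matrices=False)

# With every singular value kept, the product reproduces the matrix.
reconstructed = u @ np.diag(s) @ vt

print(u.shape, s.shape, vt.shape)
print(np.allclose(reconstructed, toy))
```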

In [None]:
# Perform SVD on the User-Item Matrix Here

u, s, vt = np.linalg.svd(user_item_matrix) # use the built in to get the three matrices

**Provide your response here.**

`3.` Now for the tricky part, how do we choose the number of latent features to use?  Running the below cell, you can see that as the number of latent features increases, we obtain a lower error rate on making predictions for the 1 and 0 values in the user-item matrix.  Run the cell below to get an idea of how the accuracy improves as we increase the number of latent features.

In [None]:
num_latent_feats = np.arange(10,700+10,20)
sum_errs = []

for k in num_latent_feats:
    # restructure with k latent features
    s_new, u_new, vt_new = np.diag(s[:k]), u[:, :k], vt[:k, :]
    
    # take dot product
    user_item_est = np.around(np.dot(np.dot(u_new, s_new), vt_new))
    
    # compute error for each prediction to actual value
    diffs = np.subtract(user_item_matrix, user_item_est)
    
    # total errors and keep track of them
    err = np.sum(np.sum(np.abs(diffs)))
    sum_errs.append(err)
    
    
plt.plot(num_latent_feats, 1 - np.array(sum_errs)/df.shape[0]);
plt.xlabel('Number of Latent Features');
plt.ylabel('Accuracy');
plt.title('Accuracy vs. Number of Latent Features');
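
The same loop can be tried on a toy matrix to see the mechanics without the full dataset. This is a sketch under stated assumptions: the random 0/1 matrix is invented, and accuracy here is normalized by the number of matrix cells (`toy.size`), whereas the cell above divides by `df.shape[0]`.

```python
import numpy as np

# Hypothetical 8-user x 6-article binary matrix.
rng = np.random.default_rng(0)
toy = (rng.random((8, 6)) < 0.4).astype(float)

u, s, vt = np.linalg.svd(toy, full_matrices=False)

errors = []
for k in range(1, len(s) + 1):
    # Keep only the first k latent features and round back to 0/1 predictions.
    est = np.around(u[:, :k] @ np.diag(s[:k]) @ vt[:k, :])
    # Count the cells where the prediction differs from the actual value.
    errors.append(int(np.abs(toy - est).sum()))

# Fraction of cells predicted correctly for each k.
accuracy = 1 - np.array(errors) / toy.size
print(errors)
```

Keeping every latent feature reproduces the matrix exactly, so the last error is zero. That is also why accuracy on the training matrix alone cannot tell you how many features to keep.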

`4.` From the above, we can't really be sure how many features to use, because being better at predicting the 1's and 0's of the matrix doesn't tell us whether we are able to make good recommendations.  Instead, we might split our dataset into a training and test set of data, as shown in the cell below.  

Use the code from question 3 to understand the impact on accuracy of the training and test sets of data with different numbers of latent features. Using the split below: 

* How many users can we make predictions for in the test set?  
* How many users are we not able to make predictions for because of the cold start problem?
* How many articles can we make predictions for in the test set?  
* How many articles are we not able to make predictions for because of the cold start problem?
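
The four questions above reduce to set operations between the ids in the training and test data. A minimal sketch on made-up frames (`toy_train` and `toy_test` are hypothetical stand-ins for `df_train` and `df_test`):

```python
import pandas as pd

# Hypothetical splits; only the column names mirror the notebook's df.
toy_train = pd.DataFrame({'user_id': [1, 2, 3], 'article_id': ['a', 'b', 'c']})
toy_test = pd.DataFrame({'user_id': [3, 4], 'article_id': ['a', 'c']})

train_users, test_users = set(toy_train['user_id']), set(toy_test['user_id'])
train_arts, test_article_ids = set(toy_train['article_id']), set(toy_test['article_id'])

# Test users/articles also present in training can be predicted.
predictable_users = test_users & train_users
predictable_arts = test_article_ids & train_arts

# Test users/articles never seen in training are the cold start cases.
cold_start_users = test_users - train_users
cold_start_arts = test_article_ids - train_arts

print(len(predictable_users), len(cold_start_users),
      len(predictable_arts), len(cold_start_arts))
```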

In [None]:
df_train = df.head(40000)
df_test = df.tail(5993)

def create_test_and_train_user_item(df_train, df_test):
    '''
    INPUT:
    df_train - training dataframe
    df_test - test dataframe
    
    OUTPUT:
    user_item_train - a user-item matrix of the training dataframe 
                      (unique users for each row and unique articles for each column)
    user_item_test - a user-item matrix of the testing dataframe 
                    (unique users for each row and unique articles for each column)
    test_idx - all of the test user ids
    test_arts - all of the test article ids
    
    '''
    # Your code here
    
    return user_item_train, user_item_test, test_idx, test_arts

user_item_train, user_item_test, test_idx, test_arts = create_test_and_train_user_item(df_train, df_test)

In [None]:
# Replace the values in the dictionary below
a = 662 
b = 574 
c = 20 
d = 0 


sol_4_dict = {
    'How many users can we make predictions for in the test set?': # letter here, 
    'How many users in the test set are we not able to make predictions for because of the cold start problem?': # letter here, 
    'How many articles can we make predictions for in the test set?': # letter here,
    'How many articles in the test set are we not able to make predictions for because of the cold start problem?': # letter here
}

t.sol_4_test(sol_4_dict)

`5.` Now use the **user_item_train** dataset from above to find U, S, and V transpose using SVD. Then find the subset of rows in the **user_item_test** dataset that you can predict using this matrix decomposition with different numbers of latent features to see how many features it makes sense to keep based on the accuracy on the test data. This will require combining what was done in questions `2` - `4`.

Use the cells below to explore how well SVD works towards making predictions for recommendations on the test data.  
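
One way to line the training decomposition up with the test data is to select the rows of U and the columns of V transpose that belong to users and articles present in both sets. The sketch below uses tiny invented matrices (`toy_train_ui`, `toy_test_ui`) and is only an outline of that alignment, not a finished solution.

```python
import numpy as np
import pandas as pd

# Hypothetical user-item matrices; user 4 appears only in the test set.
toy_train_ui = pd.DataFrame([[1, 0, 1], [0, 1, 0], [1, 1, 0]],
                            index=[1, 2, 3], columns=['a', 'b', 'c'])
toy_test_ui = pd.DataFrame([[1, 0], [0, 1]], index=[3, 4], columns=['a', 'c'])

u_tr, s_tr, vt_tr = np.linalg.svd(toy_train_ui.values.astype(float), full_matrices=False)

# Users and articles that exist in both train and test.
common_users = toy_train_ui.index.intersection(toy_test_ui.index)
common_arts = toy_train_ui.columns.intersection(toy_test_ui.columns)

# Positions of those users/articles inside the training matrix.
row_pos = toy_train_ui.index.get_indexer(common_users)
col_pos = toy_train_ui.columns.get_indexer(common_arts)

k = 3  # number of latent features to keep
est = np.around(u_tr[row_pos, :k] @ np.diag(s_tr[:k]) @ vt_tr[:k, :][:, col_pos])
actual = toy_test_ui.loc[common_users, common_arts].values

# Fraction of the comparable test cells predicted correctly.
test_accuracy = 1 - np.abs(actual - est).sum() / actual.size
print(common_users.tolist(), common_arts.tolist(), test_accuracy)
```

Only user 3 can be scored here, which mirrors the cold start limitation from question 4.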

In [None]:
# fit SVD on the user_item_train matrix
u_train, s_train, vt_train = np.linalg.svd(user_item_train) # fit svd similar to above then use the cells below

In [None]:
# Use these cells to see how well you can use the training 
# decomposition to predict on test data

`6.` Use the cell below to comment on the results you found in the previous question. Given the circumstances of your results, discuss what you might do to determine if the recommendations you make with any of the above recommendation systems are an improvement to how users currently find articles? 

**Your response here.**

<a id='conclusions'></a>
### Extras
Using your workbook, you could now save your recommendations for each user, develop a class to make new predictions and update your results, and make a Flask app to deploy your results.  These tasks are beyond what is required for this project.  However, from what you learned in the lessons, you are certainly capable of taking these tasks on to improve upon your work here!


## Conclusion

> Congratulations!  You have reached the end of the Recommendations with IBM project! 

> **Tip**: Once you are satisfied with your work here, check over your report to make sure that it satisfies all the areas of the [rubric](https://review.udacity.com/#!/rubrics/2322/view). You should also probably remove all of the "Tips" like this one so that the presentation is as polished as possible.


## Directions to Submit

> Before you submit your project, you need to create a .html or .pdf version of this notebook in the workspace here. To do that, run the code cell below. If it worked correctly, you should get a return code of 0, and you should see the generated .html file in the workspace directory (click on the orange Jupyter icon in the upper left).

> Alternatively, you can download this report as .html via the **File** > **Download as** submenu, and then manually upload it into the workspace directory by clicking on the orange Jupyter icon in the upper left, then using the Upload button.

> Once you've done this, you can submit your project by clicking on the "Submit Project" button in the lower right here. This will create and submit a zip file with this .ipynb doc and the .html or .pdf version you created. Congratulations! 

In [None]:
from subprocess import call
call(['python', '-m', 'nbconvert', 'Recommendations_with_IBM.ipynb'])