# Part 1: Collaborative filtering ##

* Based on the idea that there is a taste doppelganger for a user - another user that has similar tastes
* Two types: User-to-user filtering, Item-to-item filtering

## User to user filtering ##

* Utility matrix: matrix of users and products.


In [3]:
import pandas as pd
import numpy as np
import requests
import json
from sklearn.metrics.pairwise import cosine_similarity

Let's say we have two users A, B and we have recorded ratings(0-5) for movies for each of them. For movies where they did give a rating we give the movie a rating of 0.

* User A: 4,0,5,3,5,0,0
* User B: 0,4,0,4,0,5,0
* User C: 2,0,2,0,1,0,0

In [2]:
'''
Checking how similar the users A,B are.
'''
cosine_similarity(np.array([4,0,5,3,5,0,0]).reshape(1,-1),\
                  np.array([0,4,0,4,0,5,0]).reshape(1,-1))

array([[ 0.18353259]])

In [3]:
'''
Checking how similar the users A,C are.
'''
cosine_similarity(np.array([4,0,5,3,5,0,0]).reshape(1,-1),\
                  np.array([2,0,2,0,1,0,0]).reshape(1,-1))

array([[ 0.88527041]])

** Overcoming the missing rating problem: alternatives to putting in 0 values **

* Center each users ratings.
* We take each users rating and subtract the mean for all ratings of that user. 
* For example if the mean rating of user A is 4.25, we subtract 4.25 from every rating that user A ever gave. 
* Once this is done we continue finding the mean for every other user and subtracting that each of their own ratings. 
* This formula is equivalent to the ** Pearson correlation coefficient** where are the values will lie between -1,1.

In [4]:
'''
Checking how similar the users A,B are AFTER centering the values
'''
cosine_similarity(np.array([-.25,0,.75,-1.25,.75,0,0])\
                  .reshape(1,-1),\
                  np.array([0,-.33,0,-.33,0,.66,0])\
                  .reshape(1,-1))

array([[ 0.30772873]])

In [5]:
'''
Checking how similar the users A,C are AFTER centering the values
'''
cosine_similarity(np.array([-.25,0,.75,-1.25,.75,0,0])\
                  .reshape(1,-1),\
                  np.array([.33,0,.33,0,-.66,0,0])\
                  .reshape(1,-1))

array([[-0.24618298]])

Example: We have 3 users, X,Y,Z. We will try to predict the ratingo of user X using the ratings of users Y,Z who are similar to X.

* We will take the weighted average using ratings from Y,Z, according to their centered cosine similarity.

** Have to predict X's rating for the last movie in the user_x list.**

In [8]:
'''
Get cosine similarity between user X and Y
'''
user_x = [0, .33, 0, -.66, 0, .33, 0]
user_y = [0, 0, 0, -1, 0, .5, .5]

cosine_similarity(np.array(user_x).reshape(1,-1),\
                  np.array(user_y).reshape(1,-1))

array([[ 0.83333333]])

In [9]:
'''
Get cosine similarity between user X and Z
'''
user_x = [0,.33, 0, -.66, 0, .33,0]
user_z = [0, -.125, 0, -.625, 0, .375, .375]

cosine_similarity(np.array(user_x).reshape(1,-1),\
                  np.array(user_z).reshape(1,-1))

array([[ 0.73854895]])

Weigh each users rating be their similarity to X and then divide by total similarity.

In [10]:
x_rating_movie = ((0.83333333 * 4) + (0.73854895 * 4.5)) / (0.8333333 + 0.73854895)
x_rating_movie

4.234925100146655

X's predicted rating for the last movie in the list is 4.23.

## Item to item filtering ##

* Rather than finding similar users based on their ratings history, each rated item is compared against all other items to find the most similar ones using centered cosine similarity.
* We have a utility matrix and for this example - users are column values and songs are the rows. 

In [13]:
'''
Creating utility matrix
'''
df = pd.DataFrame({'User1' : [2, None, 4, None, 5], \
                   'User2' : [None, 3, None, 3, None], \
                   'User3' : [1, None, 5, None, 4], \
                   'User4' : [None, 4, 4, 4, None], \
                   'User5' : [3, None, None, None, 5],})
df.index = ['Song1', 'Song2', 'Song3', 'Song4', 'Song5']
df

Unnamed: 0,User1,User2,User3,User4,User5
Song1,2.0,,1.0,,3.0
Song2,,3.0,,4.0,
Song3,4.0,,5.0,4.0,
Song4,,3.0,,4.0,
Song5,5.0,,4.0,,5.0


In [23]:
'''
Function that takes in rating matrix with user and song and return an expected rating of that item for that user 
based on collaborative filtering.

k specifies the number of neighbors to consider
'''
def get_rating(ratings, target_user, target_item, k = 2):
    # Center ratings
    centered_ratings = ratings.apply(lambda x: x - x.mean(), axis = 1)
    csim_list = []
    for i in centered_ratings.index:
        csim_list.append(cosine_similarity(np.nan_to_num(centered_ratings.loc[i, :].values).reshape(1, -1),
                                          np.nan_to_num(centered_ratings.loc[target_item, :]).reshape(1, -1)).item())
        new_ratings = pd.DataFrame({'similarity': csim_list, 'rating': ratings[target_user]},
                                   index = ratings.index)
        top = new_ratings.dropna().sort_values('similarity', 
                                              ascending = False)[:k].copy()
        top['multiple'] =  top['rating'] * top['similarity']
        result = top['multiple'].sum() / top['similarity'].sum()
        return result

In [24]:
'''
Get rating for User3 on Song5 using 2 nearest neighbors and collaborative filtering
'''
get_rating(df, 'User3', 'Song5', 2)

2.9999999999999996

# Part 2: Content based filtering # 

* Collaborative filtering looks at users and items as a single entity when making comparisons. Content based filtering decomposes users and items into ** feature baskets **. 
* Rather than treating each song as a single indivisible unit, we convert the song to a feature vector that can be compared using cosine similarity.
* The listeners are decomposed into feature vectors as well. 
* If a user liked song X but never heard song Y, but X and Y are genetically almost identical(similar feature vectors) then that user should like song Y as well.

## Hybrid systems ##

** Features of Collaborative(doppelganger) filtering **

* Pro: no need to hand craft features
* Con: doesnt work well with a large number of items and users
* Con: there is sparsity when the number of items(songs/movies) far exceeds the number of users that can listen/watch them.

** Features of Content based filtering **

* Pro: doesn't need a large number of users
* Con: defining the right features is hard
* Con: lack of serendipity.

Hybrid systems use both.

# Part 3: Recommender system for Github. # 

Using the Github API to create a recommendation system based on collaborative(doppelganger) filtering.

Steps:

* Get all repos you have starred and their creators.
* Get list of all repos that each of these users has starred.
* Compare your starred repos with the starred repos of the users whos repos YOU starred to find the most similar users to you.
* Once we have the list of similar users, we use their starred repos to give recommendations to you.

## Step 1: Get starred repos ##

In [4]:
myun = 'adarsh0806'
mypw = 'a766dd64f4798041198ff30560a4aef375efde30'

In [5]:
'''
Function to get all starred repos
'''
my_starred_repos = []
def get_starred_by_me():
    resp_list = []
    last_resp = ''
    first_url_to_get = 'https://api.github.com/user/starred'
    first_url_resp = requests.get(first_url_to_get, auth=(myun,mypw))
    last_resp = first_url_resp
    resp_list.append(json.loads(first_url_resp.text))
    
    while last_resp.links.get('next'):
        next_url_to_get = last_resp.links['next']['url']
        next_url_resp = requests.get(next_url_to_get, auth=(myun,mypw))
        last_resp = next_url_resp
        resp_list.append(json.loads(next_url_resp.text))
        
    for i in resp_list:
        for j in i:
            msr = j['html_url']
            my_starred_repos.append(msr)
get_starred_by_me()

In [46]:
'''
Get all starred repos
'''
my_starred_repos

[u'https://github.com/NathanEpstein/Dora',
 u'https://github.com/dataproofer/Dataproofer',
 u'https://github.com/iamaziz/PyDataset',
 u'https://github.com/stitchfix/d3-jupyter-tutorial',
 u'https://github.com/hangtwenty/dive-into-machine-learning',
 u'https://github.com/vinta/awesome-python',
 u'https://github.com/josephmisiti/awesome-machine-learning',
 u'https://github.com/Quartz/bad-data-guide',
 u'https://github.com/fivethirtyeight/data',
 u'https://github.com/mikemull/Notebooks',
 u'https://github.com/rhiever/Data-Analysis-and-Machine-Learning-Projects',
 u'https://github.com/simple-statistics/simple-statistics',
 u'https://github.com/cchi/viral-markov',
 u'https://github.com/caesar0301/awesome-public-datasets',
 u'https://github.com/summanlp/textrank',
 u'https://github.com/brandomr/document_cluster',
 u'https://github.com/derek73/python-nameparser',
 u'https://github.com/paulgb/sklearn-pandas',
 u'https://github.com/donnemartin/data-science-ipython-notebooks',
 u'https://github.

In [6]:
len(my_starred_repos)

44

## Step 2.a: Get usernames of each starred repo so that we can retrieve the repos they starred ##

This method uses the URLs themselves to scrape the value of the username.

In [7]:
'''
Stripping username from starred repo link
'''
my_starred_users = []
for ln in my_starred_repos:
    right_split = ln.split('.com/')[1]
    starred_usr = right_split.split('/')[0]
    my_starred_users.append(starred_usr)
    
'''
Get usernames
'''
my_starred_users

[u'prakhar1989',
 u'rasbt',
 u'DrSkippy',
 u'kjw0612',
 u'ogrisel',
 u'ChristosChristofidis',
 u'mwaskom',
 u'bulutyazilim',
 u'NathanEpstein',
 u'dataproofer',
 u'iamaziz',
 u'stitchfix',
 u'hangtwenty',
 u'vinta',
 u'josephmisiti',
 u'Quartz',
 u'fivethirtyeight',
 u'mikemull',
 u'rhiever',
 u'simple-statistics',
 u'cchi',
 u'caesar0301',
 u'summanlp',
 u'brandomr',
 u'derek73',
 u'paulgb',
 u'donnemartin',
 u'rushter',
 u'fasouto',
 u'szilard',
 u'bloomberg',
 u'scala',
 u'apache',
 u'tensorflow',
 u'Theano',
 u'fchollet',
 u'turi-code',
 u'bokeh',
 u'matplotlib',
 u'scipy',
 u'numpy',
 u'scikit-learn',
 u'pandas-dev',
 u'turi-code']

## Step 2.b: Get repos of these users(ones we starred) ##

In [8]:
'''
Similar to get_starred_by_me() except that this function calls a different end point.
'''
starred_repos = {k:[] for k in set(my_starred_users)}

def get_starred_by_user(user_name):
    starred_resp_list = []
    last_resp = ''
    first_url_to_get = 'https://api.github.com/users/'+ user_name +'/starred'
    first_url_resp = requests.get(first_url_to_get, auth=(myun,mypw))
    last_resp = first_url_resp
    starred_resp_list.append(json.loads(first_url_resp.text))
    
    while last_resp.links.get('next'):
        next_url_to_get = last_resp.links['next']['url']
        next_url_resp = requests.get(next_url_to_get, auth=(myun,mypw))
        last_resp = next_url_resp
        starred_resp_list.append(json.loads(next_url_resp.text))
        
    for i in starred_resp_list:
        for j in i:
            sr = j['html_url']
            starred_repos.get(user_name).append(sr)

In [9]:
'''
Print all the distinct repo names of the users whos repos we starred
'''
for usr in list(set(my_starred_users)):
    print(usr)
    try:
        get_starred_by_user(usr)
    except:
        print('failed for user', usr)

donnemartin
fivethirtyeight
NathanEpstein
dataproofer
rasbt
josephmisiti
summanlp
prakhar1989
derek73
simple-statistics
ogrisel
Quartz
caesar0301
kjw0612
scala
rushter
bokeh
fchollet
DrSkippy
szilard
mikemull
tensorflow
bulutyazilim
numpy
stitchfix
mwaskom
turi-code
pandas-dev
matplotlib
scipy
bloomberg
apache
Theano
cchi
hangtwenty
iamaziz
brandomr
scikit-learn
ChristosChristofidis
vinta
paulgb
fasouto
rhiever


In [10]:
'''
Number of users whose we starred
'''
len(list(starred_repos))

43

## Step 4: Build feature set/vocabulary that includes all the repos starred by the users you starred ##

In [11]:
'''
Get URLs of all the repos starred by the users whom you starred
'''

repo_vocab = [item for sl in list(starred_repos.values()) for item in sl]

# Convert to list and take only distinct repos in the form of a set
repo_set = list(set(repo_vocab))
repo_set

[u'https://github.com/thumbor/remotecv',
 u'https://github.com/rewardz/django_model_helpers',
 u'https://github.com/ceteri/exelixi',
 u'https://github.com/notmatthancock/outline_to_dot',
 u'https://github.com/scalyr/cloud-costs',
 u'https://github.com/auduno/clmtools',
 u'https://github.com/arkency/reactjs_koans',
 u'https://github.com/liuliu/klaus',
 u'https://github.com/AllThingsSmitty/jquery-tips-everyone-should-know',
 u'https://github.com/adriank/ObjectPath',
 u'https://github.com/fivethirtyeight/data',
 u'https://github.com/zimbatm/socketmaster',
 u'https://github.com/java8/Java8InAction',
 u'https://github.com/NahimNasser/django-unchained',
 u'https://github.com/ajtulloch/dnngraph',
 u'https://github.com/thoughtbot/Argo',
 u'https://github.com/jsonpickle/jsonpickle',
 u'https://github.com/jflesch/paperwork',
 u'https://github.com/syrusakbary/gdom',
 u'https://github.com/emre/conqueue',
 u'https://github.com/gpoore/minted',
 u'https://github.com/gjreda/pydata2014nyc',
 u'https://

In [12]:
'''
Number of repos starred by the users
'''
len(repo_set)

7071

** I have starred 44 repos. The authors of those repos have starred a total of 7071 repos.**

## Step 5: Creation of binary vector for each user ##

* We have the full feature set/repo vocabulary.
* Run every user to create a binary vector that contains a 1 for every repo they have starred and a 0 for every repo they have not.

In [13]:
'''
Using the starred user - starred user repo vocabulary to create binary vector.

Check every user if they had starred every repo in our repo vocabulary. If they did, they get a 1, else 0.
'''
all_usr_vector = []

for k,v in starred_repos.items():
    usr_vector = []
    for url in repo_set:
        if url in v:
            usr_vector.extend([1])
        else:
            usr_vector.extend([0])
    all_usr_vector.append(usr_vector)

'''
all_usr_vector is a 5,915 item binary vector for each of the 35 users whose repos we starred
'''

'\nall_usr_vector is a 5,915 item binary vector for each of the 35 users whose repos we starred\n'

In [14]:
'''
Create item vector dataframe with usernames as rows and repos as columns
'''
df = pd.DataFrame(all_usr_vector, 
                  columns=repo_set, 
                  index=starred_repos.keys())
df

Unnamed: 0,https://github.com/thumbor/remotecv,https://github.com/rewardz/django_model_helpers,https://github.com/ceteri/exelixi,https://github.com/notmatthancock/outline_to_dot,https://github.com/scalyr/cloud-costs,https://github.com/auduno/clmtools,https://github.com/arkency/reactjs_koans,https://github.com/liuliu/klaus,https://github.com/AllThingsSmitty/jquery-tips-everyone-should-know,https://github.com/adriank/ObjectPath,...,https://github.com/OpenDDRdotORG/OpenDDR-Java,https://github.com/maxpumperla/hyperas,https://github.com/cloudpipe/cloudpipe,https://github.com/Homebrew/legacy-homebrew,https://github.com/dnlcrl/PyGraphArt,https://github.com/vinta/pangu.js,https://github.com/adobe-research/libkafka,https://github.com/this-is-ari/python-tesseract-3.02-training,https://github.com/alex-pirozhenko/sklearn-pmml,https://github.com/WebKit/webkit
donnemartin,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
fivethirtyeight,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
cchi,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
rushter,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,1,0
rasbt,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
josephmisiti,0,0,0,0,1,1,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
summanlp,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
prakhar1989,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
derek73,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
simple-statistics,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [15]:
'''
Sanity checking the number of columns/rows
'''
len(df.columns)

7071

In [16]:
'''
Adding myself to the dataframe so we can compare ourselves to other users
'''
my_repo_comp = []
for i in df.columns:
    if i in my_starred_repos:
        my_repo_comp.append(1)
    else:
        my_repo_comp.append(0)

In [17]:
'''
Create dataframe for my github repo with 5915 columns with 0 value if I have starred that repo and 0 if I have not.
'''
mrc = pd.Series(my_repo_comp).to_frame(myun).T
mrc

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,7061,7062,7063,7064,7065,7066,7067,7068,7069,7070
adarsh0806,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [18]:
'''
Add the appropriate column names and concatenate this to our dataframe.
'''
# Rename columns
mrc.columns = df.columns

# Add my github row to the dataframe
fdf = pd.concat([df, mrc])

fdf

Unnamed: 0,https://github.com/thumbor/remotecv,https://github.com/rewardz/django_model_helpers,https://github.com/ceteri/exelixi,https://github.com/notmatthancock/outline_to_dot,https://github.com/scalyr/cloud-costs,https://github.com/auduno/clmtools,https://github.com/arkency/reactjs_koans,https://github.com/liuliu/klaus,https://github.com/AllThingsSmitty/jquery-tips-everyone-should-know,https://github.com/adriank/ObjectPath,...,https://github.com/OpenDDRdotORG/OpenDDR-Java,https://github.com/maxpumperla/hyperas,https://github.com/cloudpipe/cloudpipe,https://github.com/Homebrew/legacy-homebrew,https://github.com/dnlcrl/PyGraphArt,https://github.com/vinta/pangu.js,https://github.com/adobe-research/libkafka,https://github.com/this-is-ari/python-tesseract-3.02-training,https://github.com/alex-pirozhenko/sklearn-pmml,https://github.com/WebKit/webkit
donnemartin,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
fivethirtyeight,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
cchi,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
rushter,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,1,0
rasbt,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
josephmisiti,0,0,0,0,1,1,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
summanlp,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
prakhar1989,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
derek73,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
simple-statistics,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## Step 6: Compare myself with all the other users ##

We will use the **pearsonr** function which is the cosine similarity measure.

In [19]:
from sklearn.metrics import jaccard_similarity_score
from scipy.stats import pearsonr

'''
Compare my vector(last row) with every other user(all other rows) to generate the centered cosine similarity which is 
Pearson Correlation Coefficient
'''
sim_score = {}
for i in range(len(fdf)):
    ss = pearsonr(fdf.iloc[-1,:], fdf.iloc[i,:])
    sim_score.update({i: ss[0]})

  r = r_num / r_den


In [20]:
'''
Create dataframe with pearsonr correlation coefficient values
'''
sf = pd.Series(sim_score).to_frame('similarity')
sf

Unnamed: 0,similarity
0,0.122572
1,
2,0.357161
3,0.173641
4,0.116812
5,0.051244
6,
7,0.030523
8,0.008801
9,


** We get NaN values because those users have no starred repos and therefore a division by 0 is taking place. **

In [21]:
'''
Sort correlation values
'''
sf.sort_values('similarity', ascending=False)

Unnamed: 0,similarity
43,1.0
2,0.357161
3,0.173641
38,0.160135
0,0.122572
4,0.116812
28,0.097993
42,0.082509
27,0.078369
16,0.072555


User 35 is myself, hence the value 1. I share the greatest similarity with the user in row 30. Now I will investigate that.

## Step 7: Explore similar users ## 

In [22]:
'''
Explore user in row 30 with highest similarity
'''
fdf.index[2]

u'cchi'

His bio says: 'Data scientist and software engineer living in NYC. Working at @Bloomberg. Studied applied math and code at @columbia & @recursecenter.'

Looks like a good match.

In [23]:
'''
Check all repos starred by NathanEpstein
'''
fdf.iloc[2,:][fdf.iloc[2,:]==1]

https://github.com/hangtwenty/dive-into-machine-learning              1
https://github.com/prakhar1989/awesome-courses                        1
https://github.com/paultuckey/s3_bucket_to_bucket_copy_py             1
https://github.com/johnmyleswhite/ML_for_Hackers                      1
https://github.com/numenta/nupic                                      1
https://github.com/dariusk/corpora                                    1
https://github.com/kjw0612/awesome-rnn                                1
https://github.com/bulutyazilim/awesome-datascience                   1
https://github.com/rasbt/python-machine-learning-book                 1
https://github.com/clips/pattern                                      1
https://github.com/donnemartin/data-science-ipython-notebooks         1
https://github.com/scikit-learn/scikit-learn                          1
https://github.com/apache/incubator-predictionio                      1
https://github.com/ogrisel/parallel_ml_tutorial                 

In [24]:
'''
Explore the second most similar user
'''
print fdf.index[3]
fdf.iloc[3,:][fdf.iloc[3,:]==1]

rushter


https://github.com/fivethirtyeight/data                        1
https://github.com/ajtulloch/dnngraph                          1
https://github.com/syrusakbary/gdom                            1
https://github.com/asciimoo/exrex                              1
https://github.com/Tetrachrome/subpixel                        1
https://github.com/dollabs/pamela                              1
https://github.com/channelcat/sanic                            1
https://github.com/apache/incubator-singa                      1
https://github.com/s16h/py-must-watch                          1
https://github.com/DEAP/deap                                   1
https://github.com/PrincetonVision/marvin                      1
https://github.com/jaberg/skdata                               1
https://github.com/benjaminwilson/word2vec-norm-experiments    1
https://github.com/Newmu/dcgan_code                            1
https://github.com/osh/kerlym                                  1
https://github.com/tmadl/

In [25]:
'''
Explore the third most similar user
'''
print fdf.index[38]
fdf.iloc[38,:][fdf.iloc[38,:]==1]

NathanEpstein


https://github.com/prakhar1989/awesome-courses                        1
https://github.com/defunkt/gist                                       1
https://github.com/yyuu/pyenv                                         1
https://github.com/exercism/xjavascript                               1
https://github.com/ervandew/supertab                                  1
https://github.com/codemirror/CodeMirror                              1
https://github.com/wealthbot-io/wealthbot                             1
https://github.com/cheeriojs/cheerio                                  1
https://github.com/kilimchoi/engineering-blogs                        1
https://github.com/olistic/warriorjs                                  1
https://github.com/thank-you-github/thank-you-github                  1
https://github.com/paulirish/git-open                                 1
https://github.com/maxogden/csv-write-stream                          1
https://github.com/nabilhassein/nyc-summons-precinct-visualizati

## Step 8: Generate recommendations using the 3 most similar users ##

In [27]:
'''
Gather links to repos the 3 users have starred, that I have not.
'''
all_recs = fdf.iloc[[2,3,38,43],:][fdf.iloc[[2,3,38,43],:]==1].fillna(0).T
all_recs

Unnamed: 0,cchi,rushter,NathanEpstein,adarsh0806
https://github.com/thumbor/remotecv,0.0,0.0,0.0,0.0
https://github.com/rewardz/django_model_helpers,0.0,0.0,0.0,0.0
https://github.com/ceteri/exelixi,0.0,0.0,0.0,0.0
https://github.com/notmatthancock/outline_to_dot,0.0,0.0,0.0,0.0
https://github.com/scalyr/cloud-costs,0.0,0.0,0.0,0.0
https://github.com/auduno/clmtools,0.0,0.0,0.0,0.0
https://github.com/arkency/reactjs_koans,0.0,0.0,0.0,0.0
https://github.com/liuliu/klaus,0.0,0.0,0.0,0.0
https://github.com/AllThingsSmitty/jquery-tips-everyone-should-know,0.0,0.0,0.0,0.0
https://github.com/adriank/ObjectPath,0.0,0.0,0.0,0.0


In [28]:
'''
Check for repos that all 4 of us have starred
'''
all_recs[(all_recs==1).all(axis=1)]

Unnamed: 0,cchi,rushter,NathanEpstein,adarsh0806
https://github.com/prakhar1989/awesome-courses,1.0,1.0,1.0,1.0
https://github.com/scikit-learn/scikit-learn,1.0,1.0,1.0,1.0
https://github.com/josephmisiti/awesome-machine-learning,1.0,1.0,1.0,1.0


In [29]:
'''
Get repos that those 3 have starred and I have not
'''

# Create new dataframe with repos I have not starred
str_recs_tmp = all_recs[all_recs['adarsh0806'] == 0].copy()

# All rows, all columns except the last one - adarsh0806
str_recs = str_recs_tmp.iloc[:, :-1].copy()

str_recs

Unnamed: 0,cchi,rushter,NathanEpstein
https://github.com/thumbor/remotecv,0.0,0.0,0.0
https://github.com/rewardz/django_model_helpers,0.0,0.0,0.0
https://github.com/ceteri/exelixi,0.0,0.0,0.0
https://github.com/notmatthancock/outline_to_dot,0.0,0.0,0.0
https://github.com/scalyr/cloud-costs,0.0,0.0,0.0
https://github.com/auduno/clmtools,0.0,0.0,0.0
https://github.com/arkency/reactjs_koans,0.0,0.0,0.0
https://github.com/liuliu/klaus,0.0,0.0,0.0
https://github.com/AllThingsSmitty/jquery-tips-everyone-should-know,0.0,0.0,0.0
https://github.com/adriank/ObjectPath,0.0,0.0,0.0


In [30]:
'''
Get repos that all 3 have starred
'''
str_recs[(str_recs == 1).all(axis=1)]

Unnamed: 0,cchi,rushter,NathanEpstein


In [31]:
'''
Get list of repos that at least 2/3 of the users have starred.
'''
str_recs[str_recs.sum(axis=1)>1]

Unnamed: 0,cchi,rushter,NathanEpstein
https://github.com/numenta/nupic,1.0,1.0,0.0
https://github.com/kilimchoi/engineering-blogs,0.0,1.0,1.0
https://github.com/rasbt/python-machine-learning-book,1.0,1.0,0.0
https://github.com/clips/pattern,1.0,1.0,0.0
https://github.com/apache/incubator-predictionio,1.0,1.0,0.0
https://github.com/tensorflow/skflow,1.0,1.0,0.0
https://github.com/airbnb/aerosolve,1.0,1.0,0.0
https://github.com/aosabook/500lines,0.0,1.0,1.0
https://github.com/JohnLangford/vowpal_wabbit,1.0,1.0,0.0
https://github.com/you-dont-need/You-Dont-Need-Lodash-Underscore,0.0,1.0,1.0


In [32]:
len(str_recs[str_recs.sum(axis=1)>1])

10

## Conclusion ##

We have a list of 10 repos that at least 2 of the 3 most similar users to me have starred. (2nd run)

In [39]:
import sys
sys.version
# sys.version_info

'2.7.12 |Anaconda custom (x86_64)| (default, Jul  2 2016, 17:43:17) \n[GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2336.11.00)]'