# Recommender systems - Project 1

## Recommend wiki pages based on the current topic of interest

### Abstract

The main idea of this project is to build a content-based recommender that suggests other documents a user might want to read, based on the document they are currently reading. We use the TFIDF (Term Frequency - Inverse Document Frequency) method to assign a numeric score to each word in each document, and the K nearest neighbors technique to identify related documents. Given the current document of interest, the system displays the set of documents nearest to it.

### TFIDF algorithm
TFIDF (Term Frequency - Inverse Document Frequency) is a widely used text processing technique that assigns an importance score to each word in a document. At a high level, the algorithm works as follows:

1. Let $D = \{d_1, d_2, \ldots, d_n\}$ be a set of documents.
2. For each document $d$ in $D$ perform the following:

    a. Get the frequencies of all the words in $d$. Call this the TF (Term Frequency) vector for document $d$.

3. Get the list of all unique words in all the documents, and for each unique word, get the number of documents containing the word. Let DF (Document Frequency) be the vector containing these counts. 

4. For each word in DF, get the following:

    $$IDF_w=\log\left(\frac{n}{1+\mbox{number of documents containing the word }w}\right)$$
    The log can have any valid base. IDF stands for Inverse Document Frequency, and $n$ represents the total number of documents.
    
5. For each document $d$, multiply the elements of $TF_d$ with the corresponding elements of IDF, to obtain TFIDF vector for document $d$. 

**Example:**

Suppose that we have 64 documents, and that two of them contain the following phrases:

$d_1$: "sekhar mekala is a data science evangelist"

$d_2$: "mekala sekhar"

The term frequency vectors of these two documents are given below:
$$TF_1 = ["sekhar":1,"mekala":1,"is":1,"a":1,"data":1,"science":1,"evangelist":1]$$

$$TF_2 = ["sekhar":1,"mekala":1]$$

The DF vector is given below:
$$DF = ["sekhar":2,"mekala":2,"is":61,"a":62,"data":5,"science":20,"evangelist":3 ....]$$

The "...." in the DF vector stands for the words that occur only in the remaining documents, together with the number of documents containing them. We do not need those details here, since this example computes the TFIDF of only the two documents above.

For "sekhar" the IDF score can be calculated as:
$$\log_{10} (64/(1+2)) \approx 1.33$$

Based on the same logic, we can find the IDF vector as:
$$IDF = ["sekhar":1.33,"mekala":1.33,"is":0.014,"a":0.007,"data":1.03,"science":0.48,"evangelist":1.2 ....]$$

Now we will get the product of corresponding elements of TF vectors (of each document) and the IDF vector:

$$TFIDF_1 = ["sekhar":1.33,"mekala":1.33,"is":0.014,"a":0.007,"data":1.03,"science":0.48,"evangelist":1.2, "otherwords":0 ....]$$

$$TFIDF_2 = ["sekhar":1.33,"mekala":1.33,"otherwords":0 ....]$$

The TFIDF vectors show that words which appear in a document but are rare across the other documents (such as "sekhar") score higher than words which are common to almost all the documents (such as "is" and "a").
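
The numbers in this example can be reproduced with a few lines of Python. The sketch below is illustrative only: the variable names (`df`, `idf`, `tf_1`, `tf_2`) are made up for this check, and only the seven words listed in the example are included.

```python
import math

n_docs = 64  # total number of documents in the example

# Document frequencies taken from the DF vector (only the listed words)
df = {"sekhar": 2, "mekala": 2, "is": 61, "a": 62,
      "data": 5, "science": 20, "evangelist": 3}

# IDF_w = log10(n / (1 + number of documents containing w))
idf = {w: math.log10(n_docs / (1.0 + count)) for w, count in df.items()}

# Term frequencies of the two example documents (every word occurs once)
tf_1 = {w: 1 for w in "sekhar mekala is a data science evangelist".split()}
tf_2 = {w: 1 for w in "mekala sekhar".split()}

# TFIDF = TF * IDF, element by element
tfidf_1 = {w: tf_1[w] * idf[w] for w in tf_1}
tfidf_2 = {w: tf_2[w] * idf[w] for w in tf_2}

# Print the words of d1 from most to least important
for word in sorted(tfidf_1, key=tfidf_1.get, reverse=True):
    print("{:<12}{:.3f}".format(word, tfidf_1[word]))
```

The names "sekhar" and "mekala" come out on top, while "is" and "a" end up close to zero.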

### Manual implementation of TFIDF in python

Import all the required packages first.


In [1]:
##Import all the required packages
import pandas as pd
import numpy as np
from collections import Counter
from IPython.display import display # Allows the use of display() for DataFrames

import itertools
import warnings
warnings.filterwarnings('ignore')
from time import time

The following code will implement the TFIDF algorithm manually. It creates a function named get_TFIDF(docs), where the input parameter "docs" will be a dictionary, in which all the keys represent the document IDs (unique identifiers), and the values of the keys will be the text of the respective document. The function will output a TFIDF data frame.

In [2]:
def get_TFIDF(docs):
        '''
        The function get_TFIDF(docs) accepts a dictionary as its input parameter.
        Each key in the dictionary represents a unique document ID, and the
        value holds the document text.
        '''
        ##Get the term frequencies in each document. 
        ##We will create a data frame in the format <doc ID>,<word>,<freq>. Each row can 
        ##belong to the same doc ID, but to a different word in the document. The 
        ##frequency of the words in the document are also listed

        #Declare empty list objects
        l1=[]
        l2=[]

        #Collect the key and values of the docs dict to two lists
        for (k,d) in docs.items():
            s = d.split()
            l1.append([k]*len(s))
            l2.append(s)

        #Flatten the data (nested lists to a flat list)
        l1= list(itertools.chain(*l1))
        l2= list(itertools.chain(*l2))
        #Create a data frame now
        df = pd.DataFrame(zip(l1,l2),columns=["doc_id","word"])

        #Group by the data and count the words in each document
        TF=pd.DataFrame(df.groupby(["doc_id","word"])["word"].count())

        #Rename the index levels: "word" is both an index level and the column name,
        #which causes a conflict when resetting the index
        TF.index.names=["a","b"]

        #Now reset the index
        TF=pd.DataFrame(TF).reset_index()

        #Name the columns
        TF.columns=["ID","word","freq"]
        


        ##Preparation of inverse document frequency (IDF) data frame.
        ##We will get a data frame of the form <word>,<IDF>. The IDF 
        ##will be computed using a log to base 10, although
        ##any log base can be used
        IDF=TF.groupby(["word"]).count().reset_index()
        IDF=IDF.drop("ID",axis=1)
        IDF["IDF"] = np.log10(len(docs)/(IDF["freq"]+1))
        IDF=IDF.drop("freq",axis=1)

        ##Getting the TFIDF scores for each word in the documents
        TFIDF=TF.join(IDF.set_index("word"),on="word",how="left")
        #print TFIDF
        TFIDF["TFIDF"] = TFIDF["freq"] * TFIDF["IDF"]

        ##Drop the "freq" and "IDF" columns, since they are no longer needed
        TFIDF= TFIDF.drop(["freq","IDF"],axis=1)

        ##Pivot the data frame
        TFIDF=TFIDF.pivot(index="ID",columns="word",values="TFIDF").reset_index()

        ##Clear the name of the column index (it was "word" after the pivot)
        TFIDF.columns.name=""

        ##Fill the data frame with 0 wherever we have NA
        TFIDF.fillna(value=0,inplace=True)
        return TFIDF


Let us run the above function on a set of documents.

In [3]:
##Define a dictionary of documents:

docs = {"d1":"Julia Roberts Actress", "d2":"George Lucas Filmmaker", 
"d3":"Oprah Winfrey Television personality",  "d4":"Tom Hanks Actor",  
"d5":"Michael Jordan Sportsperson Basketball", "d6":"The Rolling Stones Musicians",  
"d7":"Tiger Woods Sportsperson Golf",  "d8":"Backstreet Boys Musicians",  
"d9":"Cher Musician",  "d10":"Steven Spielberg Filmmaker",
       "d11":"Tom Cruise Actor","d12":"Bruce Willis Actor"}

#Start time capture
start = time()
TFIDF=get_TFIDF(docs)

#End time capture
end = time() 

print "Manual runtime is {}seconds".format(end-start)

print np.array(TFIDF)

Manual runtime is 0.28200006485seconds
[['d1' 0.0 0.7781512503836436 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
  0.7781512503836436 0.0 0.0 0.0 0.0 0.0 0.7781512503836436 0.0 0.0 0.0 0.0
  0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0]
 ['d10' 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.6020599913279624 0.0 0.0 0.0 0.0
  0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.7781512503836436 0.0 0.7781512503836436
  0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0]
 ['d11' 0.47712125471966244 0.0 0.0 0.0 0.0 0.0 0.0 0.7781512503836436 0.0
  0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
  0.0 0.6020599913279624 0.0 0.0 0.0 0.0]
 ['d12' 0.47712125471966244 0.0 0.0 0.0 0.0 0.7781512503836436 0.0 0.0 0.0
  0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
  0.0 0.0 0.7781512503836436 0.0 0.0 0.0]
 ['d2' 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.6020599913279624
  0.7781512503836436 0.0 0.0 0.0 0.0 0.7781512503836436 0.0 0.0 0.0 0.0 0.0
  0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0]
 ['d3' 0.0 

We can see that our function get_TFIDF has created a TFIDF data frame, and the run time of the code is displayed above. Most of the entries in the data frame are zeros, which suggests that a sparse matrix representation would be beneficial for text processing. The sklearn text feature extraction classes return sparse matrices. Let us use sklearn to compute a TFIDF matrix for the same documents.
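
Before turning to sklearn, the sparsity point can be made concrete. The sketch below is a simplified, dictionary-based variant of the manual computation (not the `get_TFIDF` function itself): it uses `collections.Counter` and stores only the non-zero scores. The helper name `sparse_tfidf` is hypothetical, and the `docs` dictionary repeats the one defined earlier so that the block runs on its own.

```python
import math
from collections import Counter

docs = {"d1": "Julia Roberts Actress", "d2": "George Lucas Filmmaker",
        "d3": "Oprah Winfrey Television personality", "d4": "Tom Hanks Actor",
        "d5": "Michael Jordan Sportsperson Basketball",
        "d6": "The Rolling Stones Musicians",
        "d7": "Tiger Woods Sportsperson Golf", "d8": "Backstreet Boys Musicians",
        "d9": "Cher Musician", "d10": "Steven Spielberg Filmmaker",
        "d11": "Tom Cruise Actor", "d12": "Bruce Willis Actor"}

def sparse_tfidf(docs):
    """Return {doc_id: {word: score}}, keeping only non-zero TF-IDF scores."""
    # Term frequencies per document
    tf = {doc_id: Counter(text.split()) for doc_id, text in docs.items()}
    # Document frequency: number of documents containing each word
    df = Counter(word for counts in tf.values() for word in counts)
    n = len(docs)
    # Same IDF formula as the manual implementation: log10(n / (1 + df))
    idf = {word: math.log10(n / (1.0 + count)) for word, count in df.items()}
    scores = {}
    for doc_id, counts in tf.items():
        row = {w: f * idf[w] for w, f in counts.items()}
        scores[doc_id] = {w: s for w, s in row.items() if s != 0}
    return scores

scores = sparse_tfidf(docs)
vocabulary = set(word for row in scores.values() for word in row)
stored = sum(len(row) for row in scores.values())
total_cells = len(docs) * len(vocabulary)
print("{} of {} cells are non-zero".format(stored, total_cells))
```

For these twelve documents only 39 of the 396 cells (12 documents by 33 distinct words) hold a non-zero score, so a dense table spends roughly 90% of its space on zeros.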

In [6]:
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
df = pd.DataFrame(docs.items(),columns=["Doc ID","Text"])
print "Data represented as data frame:\n\n{}".format(df)
start=time()
count_vect = CountVectorizer()
df_counts = count_vect.fit_transform(df["Text"])
#print df_counts
tfidf_transformer = TfidfTransformer()
df_tfidf = tfidf_transformer.fit_transform(df_counts)
end=time()
print "\n\nRun time using sklearn package is {}\n".format(end-start)
print "Some of the initial TFIDF rows:\n\n{}".format(df_tfidf[0:])

Data represented as data frame:

   Doc ID                                    Text
0      d8               Backstreet Boys Musicians
1      d9                           Cher Musician
2      d6            The Rolling Stones Musicians
3      d7           Tiger Woods Sportsperson Golf
4      d4                         Tom Hanks Actor
5      d5  Michael Jordan Sportsperson Basketball
6      d2                  George Lucas Filmmaker
7      d3    Oprah Winfrey Television personality
8      d1                   Julia Roberts Actress
9     d10              Steven Spielberg Filmmaker
10    d11                        Tom Cruise Actor
11    d12                      Bruce Willis Actor


Run time using sklearn package is 0.0

Some of the initial TFIDF rows:

  (0, 2)	0.604391549677
  (0, 4)	0.604391549677
  (0, 17)	0.519058483562
  (1, 6)	0.707106781187
  (1, 16)	0.707106781187
  (2, 17)	0.444226001257
  (2, 27)	0.517256628702
  (2, 21)	0.517256628702
  (2, 25)	0.517256628702
  (3, 28)	0.517256628

Note that sklearn returns the TFIDF values as a sparse matrix. The values do not match our implementation because sklearn computes them differently: by default, TfidfTransformer uses the natural logarithm with a smoothed IDF, $\ln((1+n)/(1+DF_w)) + 1$, and then scales each document vector to unit Euclidean length, whereas our function uses $\log_{10}(n/(1+DF_w))$ without any normalization (see http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer for more information). CountVectorizer also converts the text to lowercase, which our function does not. The relative importance of the words is nevertheless the same in both methods: words that occur in many documents receive a low TFIDF score, while rare words receive a high one.
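
The sklearn values can also be checked by hand. The sketch below assumes the defaults documented for `TfidfTransformer` (`smooth_idf=True`, `norm='l2'`), that is, an IDF of $\ln((1+n)/(1+DF_w)) + 1$ followed by scaling each row to unit Euclidean length. It approximates `CountVectorizer`'s tokenization with lowercase whitespace splitting, which is enough for these short documents. The function name `sklearn_style_tfidf` is hypothetical.

```python
import math
from collections import Counter

docs = {"d1": "Julia Roberts Actress", "d2": "George Lucas Filmmaker",
        "d3": "Oprah Winfrey Television personality", "d4": "Tom Hanks Actor",
        "d5": "Michael Jordan Sportsperson Basketball",
        "d6": "The Rolling Stones Musicians",
        "d7": "Tiger Woods Sportsperson Golf", "d8": "Backstreet Boys Musicians",
        "d9": "Cher Musician", "d10": "Steven Spielberg Filmmaker",
        "d11": "Tom Cruise Actor", "d12": "Bruce Willis Actor"}

def sklearn_style_tfidf(docs):
    """Return {doc_id: {word: weight}} using smoothed IDF and L2 normalization."""
    tf = {doc_id: Counter(text.lower().split()) for doc_id, text in docs.items()}
    df = Counter(word for counts in tf.values() for word in counts)
    n = len(docs)
    # Smoothed IDF with the natural log; the "+ 1" keeps every weight above zero
    idf = {w: math.log((1.0 + n) / (1.0 + c)) + 1.0 for w, c in df.items()}
    result = {}
    for doc_id, counts in tf.items():
        raw = {w: f * idf[w] for w, f in counts.items()}
        # Scale the row to unit Euclidean (L2) length
        norm = math.sqrt(sum(v * v for v in raw.values()))
        result[doc_id] = {w: v / norm for w, v in raw.items()}
    return result

weights = sklearn_style_tfidf(docs)
for word, value in sorted(weights["d8"].items()):
    print("{:<12}{:.6f}".format(word, value))
```

For document d8 this gives about 0.604 for "backstreet" and "boys" and about 0.519 for "musicians", which agrees with the first three entries of the sklearn output.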

## Implementing document retrieval system

We will use a data set called "people_wiki.csv" to implement a document retrieval system. This data set was provided as part of the Coursera Machine Learning course from the University of Washington, which used it to perform text analytics with the _GraphLab Create_ software. Here we implement the document retrieval system using sklearn packages instead. Other than the data set, I did _not_ use any code from the Coursera course.

In [7]:
##Reading the data set to a data frame
X = pd.read_csv("people_wiki.csv")

print "Some initial rows in the data set: \n"
display(X.head())

print "\nDataset summary: \n"
display(X.describe())

print "\nThe data set has {} rows and {} columns".format(X.shape[0],X.shape[1])

Some initial rows in the data set: 



| | URI | name | text |
|---|---|---|---|
| 0 | <http://dbpedia.org/resource/Digby_Morrell> | Digby Morrell | digby morrell born 10 october 1979 is a former... |
| 1 | <http://dbpedia.org/resource/Alfred_J._Lewy> | Alfred J. Lewy | alfred j lewy aka sandy lewy graduated from un... |
| 2 | <http://dbpedia.org/resource/Harpdog_Brown> | Harpdog Brown | harpdog brown is a singer and harmonica player... |
| 3 | <http://dbpedia.org/resource/Franz_Rottensteiner> | Franz Rottensteiner | franz rottensteiner born in waidmannsfeld lowe... |
| 4 | <http://dbpedia.org/resource/G-Enka> | G-Enka | henry krvits born 30 december 1974 in tallinn ... |



Dataset summary: 



| | URI | name | text |
|---|---|---|---|
| count | 59071 | 59071 | 59071 |
| unique | 59071 | 59070 | 59071 |
| top | <http://dbpedia.org/resource/Andrew_Wildman> | author) | beth denisch born augusta georgia feb 25 1958 ... |
| freq | 1 | 2 | 1 |



The data set has 59071 rows and 3 columns


The data set has 59071 rows and 3 columns. The _URI_ column holds the DBpedia URI of the person, the _name_ column has the person's name, and the _text_ column has the text describing that person.

### Project requirements
The user interacts with the system by entering one or more search words, and the system lists the topics matching the search. The search must be fuzzy: if no exact word match is found, the system should return the nearest matching topics. If the user is not satisfied with the results, they can search again with new words.

The user then selects a topic from the list. The system displays the topic's text and recommends another set of topics that the user might be interested in.

### Approach

1. We do not need the URI column in the data set, since the _text_ column has all the required data about the person.
2. We will separate the _text_ column and _name_ columns into two data frames.
3. We will get the TF-IDF for all the _text_ data using sklearn packages.
4. We will use the KNN (K Nearest Neighbors) method of the sklearn package to get the relevant documents, based on the _current document_ or the search words.


### Implementation
The python code to implement the above approach is listed below:

In [8]:
#X=pd.read_csv("people_wiki.csv")
#Get the name column to a different data frame
y=X.pop("name")

#Drop the URI column
X.drop(["URI"],axis=1,inplace=True)

The data frame X now holds all the text data, and y (a pandas Series) holds all the people's names. Sample data from both is shown below:

In [9]:
print "\nX data frame's data \n"
display(X.head())
print "\ny data frame's data \n"
display(y.head())


X data frame's data 



| | text |
|---|---|
| 0 | digby morrell born 10 october 1979 is a former... |
| 1 | alfred j lewy aka sandy lewy graduated from un... |
| 2 | harpdog brown is a singer and harmonica player... |
| 3 | franz rottensteiner born in waidmannsfeld lowe... |
| 4 | henry krvits born 30 december 1974 in tallinn ... |



y data frame's data 



0          Digby Morrell
1         Alfred J. Lewy
2          Harpdog Brown
3    Franz Rottensteiner
4                 G-Enka
Name: name, dtype: object

### Computing TF-IDF
The following code computes the Term Frequency - Inverse Document Frequency (TFIDF) of all the _text_ data in the X data frame.

In [10]:
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer

start=time()
count_vect = CountVectorizer()
X_counts = count_vect.fit_transform(X["text"])
#print X_counts
tfidf_transformer = TfidfTransformer()
X_tfidf = tfidf_transformer.fit_transform(X_counts)
end=time()
print "Run time using sklearn package is {} sec\n".format(end-start)
#print "Some of the initial TFIDF rows:\n\n{}".format(X_tfidf[0:2])
print "The TF-IDF matrix has {} rows and {} columns\n".format(X_tfidf.shape[0],X_tfidf.shape[1])

Run time using sklearn package is 26.3109998703 sec

The TF-IDF matrix has 59071 rows and 548429 columns



The computation time and the shape of the TF-IDF matrix are displayed above: the matrix has 59071 rows and 548429 columns. It is stored internally as a sparse matrix, which saves memory and speeds up computation. The matrix also has far more columns than rows, so a dimensionality reduction method such as PCA (Principal Component Analysis) could help reduce the number of dimensions. We will not perform PCA in this project.
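
A quick back-of-the-envelope calculation shows why the sparse representation matters at this size. The sketch assumes 8-byte floating point values (NumPy's default `float64`); the variable names are illustrative.

```python
n_rows, n_cols = 59071, 548429   # shape of the TF-IDF matrix reported above
bytes_per_value = 8              # assumption: float64 entries

dense_cells = n_rows * n_cols
dense_gb = dense_cells * bytes_per_value / 1e9

print("Dense cells: {:,}".format(dense_cells))
print("Dense size:  {:.0f} GB".format(dense_gb))
```

That is roughly 32 billion cells, or about 259 GB if every cell were stored, whereas the sparse format keeps only the non-zero entries.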


### Fitting a KNN model on TF-IDF data
A KNN (K Nearest Neighbors) model is fit on the TF-IDF data (the X_tfidf sparse matrix). This model lets us retrieve the 10 most relevant documents for a given document.


In [11]:
from sklearn.neighbors import NearestNeighbors

nbrs = NearestNeighbors(n_neighbors=10,algorithm='brute' ,metric='cosine',leaf_size=30,p=2).fit(X_tfidf)
nbrs

NearestNeighbors(algorithm='brute', leaf_size=30, metric='cosine',
         metric_params=None, n_jobs=1, n_neighbors=10, p=2, radius=1.0)
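
To make the `metric='cosine'` and `algorithm='brute'` settings concrete: the cosine distance between two vectors is one minus the cosine of the angle between them, and a brute-force search computes that distance from the query to every row and keeps the k smallest. The pure-Python sketch below illustrates the idea on tiny hand-made vectors. It is not sklearn's implementation, and the names `cosine_distance`, `k_nearest` and the toy `rows` data are made up for illustration.

```python
import math

def cosine_distance(u, v):
    """1 - cos(angle) between two sparse vectors stored as {term: weight}."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # treat an empty vector as unrelated to everything
    return 1.0 - dot / (norm_u * norm_v)

def k_nearest(query, rows, k):
    """Brute force: score every row, return the k closest (distance, id) pairs."""
    scored = sorted((cosine_distance(query, vec), row_id)
                    for row_id, vec in rows.items())
    return scored[:k]

# Toy TF-IDF rows (hypothetical weights, not taken from the data set)
rows = {
    "doc_a": {"president": 0.8, "senate": 0.6},
    "doc_b": {"president": 0.6, "election": 0.8},
    "doc_c": {"guitar": 1.0},
}
query = {"president": 1.0}

for distance, row_id in k_nearest(query, rows, k=2):
    print("{}  distance={:.2f}".format(row_id, distance))
```

A row that shares no term with the query has distance 1, and rows are ranked purely by distance. This is why a search string still returns the closest available documents when nothing matches it exactly.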

### User interface

Let us create a simple user interface to accept a list of words and perform the document retrieval. This interface will help us to test our document retrieval system.

### Example of relevant topics retrieval

In [18]:
##### Function to search for a string: performs a fuzzy search of the string in X and returns 10 matches
##### (10 matches, since we created the KNN model with n_neighbors=10)
def search_a_person(search_string,count_vect,tfidf_transformer,nbrs):
    y_counts=count_vect.transform([search_string])
    y_tfidf=tfidf_transformer.transform(y_counts)
    i=nbrs.kneighbors(y_tfidf,return_distance=True)
    display_df = pd.DataFrame(y.iloc[i[1][0]]).reset_index(drop=True)
    display_df.columns=["Topic"]
    return display_df

###Function to retrieve the index to get the text associated with the index
###Will search for an exact match in y (doc_ids parm), and returns a matching index in y (or doc_ids)
def retrieve_topic(display_df,chosen_id,doc_ids):
            i = doc_ids[doc_ids==display_df.iloc[int(chosen_id)]["Topic"]].index.tolist()
            #return X["text"][i[0]]
            return i

##Initialize the choice to 0        
choice=0

#Code to display the UI choices and to process the retrieval requests
while int(choice) < 3:
    #Display the initial menu
    print "1. Search a person"
    print "2. Quit"
    
    #Read the input from the interface
    choice = raw_input()
    
    #validate the choice and take the action
    try:
        if int(choice) == 2:
            break
        if int(choice) == 1:
            print "\n\nEnter a search string in the below text box:"
            search_string=raw_input()
            display_df=search_a_person(search_string,count_vect,tfidf_transformer,nbrs)
            display(display_df)
            
            print "Choose a topic using the number listed beside the topic"
            chosen_id = raw_input()
            #try:
            while int(chosen_id) < 10:
                i=retrieve_topic(display_df,chosen_id,y)
                print X["text"][i[0]]
                print "\n\nYou may be interested in these people..."
                ip=X_tfidf[y[y==display_df.iloc[int(chosen_id)]["Topic"]].index.tolist()]
                l = nbrs.kneighbors(ip,return_distance=True)
                a=l[1].tolist()

                #display_df=pd.DataFrame(zip(list(y.ix[a[0]]),l[0][0]),columns=["Topic","Distance"])
                #display(display_df.iloc[1:])
                display_df=pd.DataFrame(list(y.iloc[a[0]]),columns=["Topic"])
                display(display_df.iloc[1:])

                print "Enter your choice. Enter Q to quit"
                chosen_id = raw_input()
                if chosen_id in ('Q', 'q'):
                    break
            #except:
                #print "Incorrect choice made. Quitting the program"
            break
        else:
            print "\nIncorrect choice. Enter 1 (for search) or 2 (for exit)"
            choice=0
    except:
        print "\nIncorrect choice. Enter 1 (for search) or 2 (for exit)"
        choice = 0



1. Search a person
2. Quit
1


Enter a search string in the below text box:
obama


| | Topic |
|---|---|
| 0 | Barack Obama |
| 1 | Kenneth D. Thompson |
| 2 | Jonathan Alter |
| 3 | Samantha Power |
| 4 | John D. McCormick |
| 5 | Robert Gibbs |
| 6 | Eric Holder |
| 7 | Joe the Plumber |
| 8 | Steve Schale |
| 9 | Batton Lash |


Choose a topic using the number listed beside the topic
0
barack hussein obama ii brk husen bm born august 4 1961 is the 44th and current president of the united states and the first african american to hold the office born in honolulu hawaii obama is a graduate of columbia university and harvard law school where he served as president of the harvard law review he was a community organizer in chicago before earning his law degree he worked as a civil rights attorney and taught constitutional law at the university of chicago law school from 1992 to 2004 he served three terms representing the 13th district in the illinois senate from 1997 to 2004 running unsuccessfully for the united states house of representatives in 2000in 2004 obama received national attention during his campaign to represent illinois in the united states senate with his victory in the march democratic party primary his keynote address at the democratic national convention in july and his election to the senate in nov

| | Topic |
|---|---|
| 1 | Joe Biden |
| 2 | Hillary Rodham Clinton |
| 3 | Samantha Power |
| 4 | Eric Stern (politician) |
| 5 | George W. Bush |
| 6 | John McCain |
| 7 | Artur Davis |
| 8 | Henry Waxman |
| 9 | Jeff Sessions |


Enter your choice. Enter Q to quit
2
hillary diane rodham clinton hlri dan rdm klntn born october 26 1947 is a former united states secretary of state us senator and first lady of the united states from 2009 to 2013 she was the 67th secretary of state serving under president barack obama she previously represented new york in the us senate 2001 to 2009 before that as the wife of the 42nd president of the united states bill clinton she was first lady from 1993 to 2001 in the 2008 election clinton was a leading candidate for the democratic presidential nominationa native of illinois hillary rodham was the first student commencement speaker at wellesley college in 1969 and earned a jd from yale law school in 1973 after a brief stint as a congressional legal counsel she moved to arkansas and married bill clinton in 1975 rodham cofounded arkansas advocates for children and families in 1977 in 1978 she became the first female chair of the legal services corporation and in 1979 the first fema

| | Topic |
|---|---|
| 1 | Bill Clinton |
| 2 | Ann Lewis |
| 3 | Barack Obama |
| 4 | Melanne Verveer |
| 5 | Jill Alper |
| 6 | Vanessa Gilmore |
| 7 | Sheila Widnall |
| 8 | L. Jean Lewis |
| 9 | Samantha Power |


Enter your choice. Enter Q to quit
q


In the run above, we first searched the documents for the word "obama", and the system displayed the related documents. We chose option 0 (Barack Obama); the system displayed the text for Barack Obama along with further topics related to him. We then chose the topic for Hillary Rodham Clinton, which displayed her text along with some related topics. Finally, we quit the system by entering "q".

With two simple algorithms (TF-IDF and KNN), we are able to construct a decent document retrieval system. As a part of future work, we would like to work on the following:

1. Use Python's NLTK (Natural Language Toolkit) to match the text, instead of the TF-IDF algorithm, and compare the performance.
2. Create a better GUI to perform testing.
3. Cluster the data using K Means clustering, with and without transforming the data using PCA (Principal Component Analysis), and understand whether PCA improves the clustering.
4. Fit a supervised learning algorithm to accurately predict the cluster to which a document belongs.
5. Improve the quality (and run time) of the suggested topics by searching for related topics within the relevant cluster.

## Conclusion

In this project we performed the following:

1. Built a manual function to compute the TF-IDF for a set of documents.
2. Compared the run times of the manual implementation and the built-in sklearn packages for computing the TF-IDF values of a set of documents.
3. Built a document retrieval system on a data set of people (from Wikipedia), based on the TF-IDF concept and the KNN method.