# Document retrieval project in Sklearn

# Fire up packages

In [1]:
import sklearn
import matplotlib.pyplot as plt
%matplotlib inline
import pandas
from sklearn.model_selection import train_test_split
import numpy

# Load data

In [2]:
people = pandas.read_csv('people_wiki.csv')

In [3]:
people.head()

| | URI | name | text |
|---|---|---|---|
| 0 | <http://dbpedia.org/resource/Digby_Morrell> | Digby Morrell | digby morrell born 10 october 1979 is a former... |
| 1 | <http://dbpedia.org/resource/Alfred_J._Lewy> | Alfred J. Lewy | alfred j lewy aka sandy lewy graduated from un... |
| 2 | <http://dbpedia.org/resource/Harpdog_Brown> | Harpdog Brown | harpdog brown is a singer and harmonica player... |
| 3 | <http://dbpedia.org/resource/Franz_Rottensteiner> | Franz Rottensteiner | franz rottensteiner born in waidmannsfeld lowe... |
| 4 | <http://dbpedia.org/resource/G-Enka> | G-Enka | henry krvits born 30 december 1974 in tallinn ... |


# NLP for clustering: create TF-IDF features from the text

**Unlike in the classification case, clustering analysis does not strictly require removing stopwords, because TF-IDF weighting already downweights words that appear in most documents. However, I still believe that removing stopwords can help improve the model's performance.**
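
To illustrate the downweighting claim, here is a minimal sketch on a made-up three-document corpus (the sentences are hypothetical and not taken from `people_wiki.csv`). It assumes scikit-learn's default `TfidfVectorizer` settings and compares the learned inverse document frequency of a word that occurs in every document with that of a word that occurs in only one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: "the" occurs in every document
docs = [
    "the striker scored the goal",
    "the senator signed the bill",
    "the singer released the album",
]

vectorizer = TfidfVectorizer()
vectorizer.fit(docs)

# vocabulary_ maps each term to its column; idf_ holds the per-column IDF weight
idf = {term: vectorizer.idf_[col] for term, col in vectorizer.vocabulary_.items()}

# The ubiquitous word gets a lower IDF weight than the rare one
print("idf(the)     =", idf["the"])
print("idf(striker) =", idf["striker"])
```

The common word is not removed, only given a smaller weight, which is why explicit stopword removal can still change the result.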

In [8]:
import re
import nltk
from nltk.corpus import stopwords # Import the stop word list (needs a one-time nltk.download('stopwords'))

In [9]:
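def to_words(raw_review):
    # Keep letters only; digits and punctuation become spaces
    letters_only = re.sub("[^a-zA-Z]", " ", raw_review)
    # Lowercase and split into individual words
    words = letters_only.lower().split()
    # A set makes the stopword membership test fast
    stops = set(stopwords.words("english"))
    meaningful_words = [w for w in words if w not in stops]
    # Rejoin the remaining words into one space-separated string
    return " ".join(meaningful_words)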

**The next step is to create the feature matrix using TF-IDF.**

In [10]:
clean_text = [to_words(each) for each in people['text']]

In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf=TfidfVectorizer()
features=tfidf.fit_transform(clean_text)

**The feature matrix for clustering is now ready, and we can use it to fit a machine learning model.**

# Implementing clustering techniques

## Examining cosine distance

In [33]:
from sklearn.metrics.pairwise import cosine_distances as CD

**Let us check the cosine distance between some pairs of people.**

In [28]:
obama=people[people['name']=='Barack Obama'].index.tolist()[0]
beckham=people[people['name']=='David Beckham'].index.tolist()[0]
clinton=people[people['name']=='Bill Clinton'].index.tolist()[0]
swift=people[people['name']=='Taylor Swift'].index.tolist()[0]

In [37]:
print('Cosine distance between Obama and Beckham is ' + ' ' + str(CD(features[obama], features[beckham])))
print('Cosine distance between Obama and Clinton is ' + ' ' + str(CD(features[obama], features[clinton])))
print('Cosine distance between Obama and Swift is ' + ' ' + str(CD(features[obama], features[swift])))

Cosine distance between Obama and Beckham is  [[ 0.97828055]]
Cosine distance between Obama and Clinton is  [[ 0.8090512]]
Cosine distance between Obama and Swift is  [[ 0.96103638]]


**A smaller cosine distance means greater similarity. In the example above, Obama is closer to Clinton (0.81) than to Beckham (0.98) or Swift (0.96), so the features appear to capture meaningful similarity.**
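
As a small worked example of what the metric measures, the sketch below computes cosine distance by hand as one minus the cosine of the angle between two vectors, and compares it with `cosine_distances` from scikit-learn. The three vectors are made up for illustration, and `manual_cosine_distance` is a hypothetical helper that is not part of the original notebook.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_distances

# Made-up vectors: b points the same way as a, c is orthogonal to a
a = np.array([[1.0, 2.0, 0.0]])
b = np.array([[2.0, 4.0, 0.0]])
c = np.array([[0.0, 0.0, 3.0]])

def manual_cosine_distance(u, v):
    """Cosine distance = 1 - (u . v) / (|u| * |v|)."""
    u, v = u.ravel(), v.ravel()
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

d_ab = manual_cosine_distance(a, b)  # same direction, so close to 0
d_ac = manual_cosine_distance(a, c)  # orthogonal, so close to 1

# scikit-learn returns a matrix of pairwise distances; take the single entry
print(d_ab, cosine_distances(a, b)[0, 0])
print(d_ac, cosine_distances(a, c)[0, 0])
```

Because only the direction of the vectors matters, a long and a short article about the same topic can still be close to each other.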

## Searching for nearest neighbours

**Before fitting the model, I write a helper function that makes it easier to look up a person's row index when querying results.**

In [96]:
def person(name):
    return int(people[people['name']==name].index.tolist()[0])

### K-Nearest-Neighbours

In [41]:
from sklearn.neighbors import NearestNeighbors
knn=NearestNeighbors(n_neighbors=20,algorithm='brute',metric='cosine')

In [44]:
knn_fit=knn.fit(features)

In [100]:
query_name=int(person('Barack Obama'))

In [114]:
Obama_Neighbours=knn_fit.kneighbors(features[query_name])

In [113]:
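Result= pandas.DataFrame({'Index':Obama_Neighbours[1].tolist()[0]})
Result['Name']=Result['Index'].apply(lambda x: people['name'][x])
# kneighbors returns cosine distances, so despite its label this column holds distances, not similarities
Result['Cosine Similariry']=Obama_Neighbours[0].tolist()[0]
# Recompute each distance to the query row as a cross-check (each entry is a 1x1 array)
Result['Cosine Distance']=Result['Index'].apply(lambda x: CD(features[Result['Index'][0]],features[x]))
Result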

| | Index | Name | Cosine Similariry | Cosine Distance |
|---|---|---|---|---|
| 0 | 35817 | Barack Obama | -2.220446e-16 | [[-2.22044604925e-16]] |
| 1 | 24478 | Joe Biden | 0.6618722 | [[0.661872189888]] |
| 2 | 38376 | Samantha Power | 0.7146266 | [[0.714626642722]] |
| 3 | 57108 | Hillary Rodham Clinton | 0.7257467 | [[0.725746708995]] |
| 4 | 38714 | Eric Stern (politician) | 0.738494 | [[0.738493956427]] |
| 5 | 6796 | Eric Holder | 0.7542794 | [[0.754279418869]] |
| 6 | 46140 | Robert Gibbs | 0.7608155 | [[0.76081549297]] |
| 7 | 18827 | Henry Waxman | 0.7687019 | [[0.768701913587]] |
| 8 | 2412 | Joe the Plumber | 0.7696896 | [[0.769689577443]] |
| 9 | 44681 | Jesse Lee (politician) | 0.770397 | [[0.770396987202]] |


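**Because there are no labels, the result of a clustering analysis has to be judged against human background knowledge. In this case, Barack Obama's nearest neighbours are mostly fellow US political figures, so the result does make sense.**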
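
To make the shape of the `kneighbors` output concrete, here is a minimal sketch on four made-up 2-D points standing in for TF-IDF rows (the array `X` is hypothetical). It shows that the call returns a pair of arrays, distances first and row indices second, and that with the cosine metric the first array holds distances. That is why the column labelled `Cosine Similariry` in the table matches the separately computed `Cosine Distance` column.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical points: rows 0 and 1 point in similar directions, as do rows 2 and 3
X = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [0.1, 0.9],
])

nn = NearestNeighbors(n_neighbors=2, algorithm="brute", metric="cosine").fit(X)

# kneighbors returns (distances, indices), one row per query point
distances, indices = nn.kneighbors(X[[0]])

print(indices[0])    # row positions of the neighbours, nearest first
print(distances[0])  # the matching cosine distances (not similarities)
```

The query point is its own nearest neighbour at a distance of essentially zero; tiny values such as `-2.220446e-16` in the table are floating-point rounding around zero.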

**Since we are building a text retrieval system, I will combine the steps above into a single function so that searching for similar articles is more convenient.**

In [117]:
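def knn_query(name,neighbours=20):
    # Row index of the person to query
    name_index=int(people[people['name']==name].index.tolist()[0])
    # Note: the model is refitted on every call
    knn=NearestNeighbors(n_neighbors=neighbours,algorithm='brute',metric='cosine')
    knn_fit=knn.fit(features)
    # kneighbors returns (distances, indices)
    knn_result=knn_fit.kneighbors(features[name_index])
    Result= pandas.DataFrame({'Index':knn_result[1].tolist()[0]})
    Result['Name']=Result['Index'].apply(lambda x: people['name'][x])
    # Despite its label, this column holds cosine distances, not similarities
    Result['Cosine Similariry']=knn_result[0].tolist()[0]
    # Recompute each distance to the query row as a cross-check (each entry is a 1x1 array)
    Result['Cosine Distance']=Result['Index'].apply(lambda x: CD(features[Result['Index'][0]],features[x]))
    return Result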

In [120]:
knn_query('David Beckham',10)

| | Index | Name | Cosine Similariry | Cosine Distance |
|---|---|---|---|---|
| 0 | 23386 | David Beckham | -2.220446e-16 | [[-2.22044604925e-16]] |
| 1 | 50411 | Victoria Beckham | 0.5591475 | [[0.559147546617]] |
| 2 | 24913 | Bobby Charlton | 0.7056406 | [[0.705640553633]] |
| 3 | 53393 | Steven Gerrard | 0.7352798 | [[0.735279811321]] |
| 4 | 43981 | Fernando Torres | 0.7483658 | [[0.748365827569]] |
| 5 | 26762 | Wayne Rooney | 0.749405 | [[0.749405036692]] |
| 6 | 43098 | Kim Milton Nielsen | 0.7649533 | [[0.764953307576]] |
| 7 | 24258 | Sol Campbell | 0.766911 | [[0.766910976189]] |
| 8 | 14068 | Rio Ferdinand | 0.7743649 | [[0.774364871086]] |
| 9 | 38672 | Shay Given | 0.7751108 | [[0.775110836483]] |


**The KNN model gives sensible results: David Beckham's nearest neighbours are Victoria Beckham and people from the football world. Next, I will try two more queries and compare the results.**

### Try more cases to check that the method is reasonable

In [129]:
knn_query('Barack Obama')

| | Index | Name | Cosine Similariry | Cosine Distance |
|---|---|---|---|---|
| 0 | 35817 | Barack Obama | -2.220446e-16 | [[-2.22044604925e-16]] |
| 1 | 24478 | Joe Biden | 0.6618722 | [[0.661872189888]] |
| 2 | 38376 | Samantha Power | 0.7146266 | [[0.714626642722]] |
| 3 | 57108 | Hillary Rodham Clinton | 0.7257467 | [[0.725746708995]] |
| 4 | 38714 | Eric Stern (politician) | 0.738494 | [[0.738493956427]] |
| 5 | 6796 | Eric Holder | 0.7542794 | [[0.754279418869]] |
| 6 | 46140 | Robert Gibbs | 0.7608155 | [[0.76081549297]] |
| 7 | 18827 | Henry Waxman | 0.7687019 | [[0.768701913587]] |
| 8 | 2412 | Joe the Plumber | 0.7696896 | [[0.769689577443]] |
| 9 | 44681 | Jesse Lee (politician) | 0.770397 | [[0.770396987202]] |


In [130]:
knn_query('Taylor Swift')

| | Index | Name | Cosine Similariry | Cosine Distance |
|---|---|---|---|---|
| 0 | 54264 | Taylor Swift | 1.110223e-16 | [[1.11022302463e-16]] |
| 1 | 317 | Carrie Underwood | 0.696475 | [[0.696475012159]] |
| 2 | 9379 | Al Swift | 0.702899 | [[0.702898984607]] |
| 3 | 29297 | Kelly Clarkson | 0.7047266 | [[0.704726570068]] |
| 4 | 25403 | Ed Sheeran | 0.7070672 | [[0.707067161949]] |
| 5 | 52794 | Bill Swift | 0.7117538 | [[0.71175375138]] |
| 6 | 19943 | Tim McGraw | 0.7160471 | [[0.71604708008]] |
| 7 | 27793 | Adele | 0.7165144 | [[0.716514435738]] |
| 8 | 35807 | Joss Stone | 0.7200933 | [[0.720093325558]] |
| 9 | 1341 | Dolly Parton | 0.7225681 | [[0.722568149167]] |


**The KNN method gives sensible results in these cases as well. The whole procedure is also fast enough to be practical.**
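
On the speed point, `knn_query` refits `NearestNeighbors` on every call. One possible variation, sketched here as an assumption and not as the notebook's own method, is to fit once and pass `n_neighbors` to `kneighbors` at query time. The data frame `toy_people`, the names in it, and the helper `query_fitted` are all hypothetical stand-ins so the example runs without `people_wiki.csv`.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Hypothetical stand-in for the people data frame
toy_people = pd.DataFrame({
    "name": ["A Footballer", "B Footballer", "C Senator", "D Senator"],
    "text": [
        "striker scored goals for the football club",
        "midfielder played football for the club and scored",
        "senator passed a law in congress",
        "senator voted on the law in congress",
    ],
})

toy_features = TfidfVectorizer().fit_transform(toy_people["text"])

# Fit once, outside the query function
toy_index = NearestNeighbors(algorithm="brute", metric="cosine").fit(toy_features)

def query_fitted(name, neighbours=2):
    """Return the nearest neighbours of `name` using the already-fitted index."""
    row = toy_people.index[toy_people["name"] == name][0]
    # n_neighbors can be chosen per query without refitting
    distances, indices = toy_index.kneighbors(toy_features[[row]], n_neighbors=neighbours)
    return pd.DataFrame({
        "Name": toy_people["name"].iloc[indices[0]].to_numpy(),
        "Cosine Distance": distances[0],
    })

result = query_fitted("C Senator")
print(result)
```

With the brute-force algorithm, fitting mostly just stores the feature matrix, so the saving is modest, but it keeps model setup separate from querying.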

## Conclusion

### 1: NLP, or natural language processing, is the key to the subsequent steps. Luckily, the whole procedure is not that complicated.
### 2: KNN is a powerful method for finding similar items. However, this project is not clustering-oriented, as it only requires returning similar items. K-means would be the corresponding algorithm for an actual clustering analysis. The result cannot be evaluated against a test set; instead, our background knowledge is the key to judging whether the method makes sense.
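
For comparison, here is a minimal sketch of what a K-means clustering step on TF-IDF features could look like. It is not part of the original analysis: the four documents are made up, and `n_clusters=2`, `n_init=10` and `random_state=0` are illustrative assumptions that would need tuning on the real data.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus with two obvious topics
toy_docs = [
    "striker scored goals football club",
    "midfielder football club scored",
    "senator law congress vote",
    "senator congress law bill",
]

X = TfidfVectorizer().fit_transform(toy_docs)

# Assign every document to one of two clusters
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X)

for doc, label in zip(toy_docs, labels):
    print(label, doc)
```

Unlike the nearest-neighbour query, this assigns every document to a group up front, and the grouping still has to be judged with background knowledge.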