## Text summarization

Following this gensim post: http://rare-technologies.com/text-summarization-with-gensim/

In [1]:
from __future__ import division
import numpy as np
import pandas as pd
import wikipedia
from gensim.summarization import summarize, keywords  # module exists only in gensim < 4.0

In [2]:
df = pd.DataFrame([['Albert Einstein'],
                    ['Richard Feynman'],
                    ['Leonhard Euler'],
                    ['Stephen Hawking']])
df.columns = ['name']
df.head()

Unnamed: 0,name
0,Albert Einstein
1,Richard Feynman
2,Leonhard Euler
3,Stephen Hawking


Get the first 100 sentences about each person from Wikipedia, then extract keywords and a summary.

In [3]:
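# Fetch the article text for each name
df['wiki_content'] = df['name'].map(lambda x: wikipedia.summary(x, sentences=100))
# Keywords come back as one newline-separated string per article
df['content_keywords'] = df['wiki_content'].map(lambda x: keywords(x, ratio=0.9, lemmatize=True))
# Extractive summary with gensim's default settings
df['content_summary'] = df['wiki_content'].map(summarize)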
df.head()

Unnamed: 0,name,wiki_content,content_keywords,content_summary
0,Albert Einstein,Albert Einstein (/ˈaɪnstaɪn/; German: [ˈalbɛɐ̯...,einstein\nmechanics\ngeneral theory\nfield\nth...,"He developed the general theory of relativity,..."
1,Richard Feynman,"Richard Phillips Feynman (/ˈfaɪnmən/; May 11, ...",challenger\nshuttle\npublication\npath integra...,"Richard Phillips Feynman (/ˈfaɪnmən/; May 11, ..."
2,Leonhard Euler,Leonhard Euler (/ˈɔɪlər/ OY-lər; Swiss Standar...,theory\nmathematicians\neuler\nswiss\ngerman\n...,Leonhard Euler (/ˈɔɪlər/ OY-lər; Swiss Standar...
3,Stephen Hawking,"Stephen William Hawking, CH, CBE, FRS, FRSA (/...",fellow\nhonorary\npresidential\nmedal\ngradual...,Hawking was the Lucasian Professor of Mathemat...


### Do something more with the keywords

In [4]:
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import pairwise_distances

Stem the keywords for easier comparison.

In [5]:
stemmer = PorterStemmer()
df['stem_keywords'] = df['content_keywords'].map(lambda x: ' '.join(stemmer.stem(word) for word in x.split()))
df.head()

Unnamed: 0,name,wiki_content,content_keywords,content_summary,stem_keywords
0,Albert Einstein,Albert Einstein (/ˈaɪnstaɪn/; German: [ˈalbɛɐ̯...,einstein\nmechanics\ngeneral theory\nfield\nth...,"He developed the general theory of relativity,...",einstein mechan gener theori field theoret ger...
1,Richard Feynman,"Richard Phillips Feynman (/ˈfaɪnmən/; May 11, ...",challenger\nshuttle\npublication\npath integra...,"Richard Phillips Feynman (/ˈfaɪnmən/; May 11, ...",challeng shuttl public path integr formul inst...
2,Leonhard Euler,Leonhard Euler (/ˈɔɪlər/ OY-lər; Swiss Standar...,theory\nmathematicians\neuler\nswiss\ngerman\n...,Leonhard Euler (/ˈɔɪlər/ OY-lər; Swiss Standar...,theori mathematician euler swiss german mathem...
3,Stephen Hawking,"Stephen William Hawking, CH, CBE, FRS, FRSA (/...",fellow\nhonorary\npresidential\nmedal\ngradual...,Hawking was the Lucasian Professor of Mathemat...,fellow honorari presidenti medal gradual paral...


Compute cosine distance on the TF-IDF matrix to see who is similar to whom.

In [6]:
tfidf_vect = TfidfVectorizer(input='content', lowercase=False, tokenizer=None, stop_words='english', use_idf=True)
tfidf = tfidf_vect.fit_transform(df['stem_keywords'])

In [7]:
np.set_printoptions(suppress=True) # turn off scientific notation when printing numbers
print(pairwise_distances(tfidf, metric='cosine'))

[[-0.          0.89185467  0.9207151   0.88890348]
 [ 0.89185467  0.          0.96882654  0.93670002]
 [ 0.9207151   0.96882654 -0.          0.96913659]
 [ 0.88890348  0.93670002  0.96913659  0.        ]]
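
For reference, cosine distance is one minus the cosine of the angle between two vectors, so vectors pointing the same way score 0 and vectors with no shared terms score 1. A minimal pure-Python sketch follows; the function name `cosine_distance` and the toy vectors are illustrative and not part of the notebook.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v) for two equal-length sequences."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Toy term-weight vectors, made up for illustration
doc_a = [1.0, 0.0, 2.0]
doc_b = [2.0, 0.0, 4.0]  # same direction as doc_a, so distance is about 0
doc_c = [0.0, 3.0, 0.0]  # shares no terms with doc_a, so distance is 1

print(cosine_distance(doc_a, doc_b))
print(cosine_distance(doc_a, doc_c))
```

Because the measure depends only on direction, scaling a document's weights does not change its distance to the others.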


Try Jaccard distance on a bag-of-words matrix to see who is similar to whom.

In [8]:
cnt_vect = CountVectorizer()
cnt = cnt_vect.fit_transform(df['stem_keywords'])

In [9]:
print(pairwise_distances(cnt.toarray(), metric='jaccard'))

[[ 0.          0.8989899   0.92682927  0.89565217]
 [ 0.8989899   0.          0.97222222  0.92857143]
 [ 0.92682927  0.97222222  0.          0.968     ]
 [ 0.89565217  0.92857143  0.968       0.        ]]
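
As a rough illustration of the set-based definition of Jaccard distance (one minus the size of the intersection over the size of the union), here is a small pure-Python sketch. The helper `jaccard_distance` and the token lists are made up for this example, and a library may treat counts greater than one differently from this set-based version.

```python
def jaccard_distance(a, b):
    """Return one minus (intersection size / union size) for two token iterables."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0  # convention chosen here: two empty sets are identical
    return 1.0 - len(set_a & set_b) / float(len(union))

# Stemmed tokens, made up for illustration
x = "theori field gener physic".split()
y = "theori physic integr path shuttl".split()

# 2 shared tokens out of 7 distinct ones, so the distance is 1 - 2/7
print(jaccard_distance(x, y))
```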


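Based on stemmed keywords, the closest pair is Einstein and Hawking (cosine distance 0.889), followed closely by Einstein and Feynman (0.892); Euler is the most distant from the other three. Cosine distance on TF-IDF and Jaccard distance on bag of words yield nearly the same ranking.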
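
One way to read such a matrix is to pick, for each row, the smallest off-diagonal entry. The sketch below does this for the cosine distances printed by In [7]; the helper `nearest` is illustrative and not part of the notebook.

```python
names = ['Albert Einstein', 'Richard Feynman', 'Leonhard Euler', 'Stephen Hawking']

# Cosine distances copied from the output of In [7]
dist = [
    [0.0,        0.89185467, 0.9207151,  0.88890348],
    [0.89185467, 0.0,        0.96882654, 0.93670002],
    [0.9207151,  0.96882654, 0.0,        0.96913659],
    [0.88890348, 0.93670002, 0.96913659, 0.0],
]

def nearest(i):
    """Index of the row with the smallest distance to row i, excluding i itself."""
    others = [j for j in range(len(dist)) if j != i]
    return min(others, key=lambda j: dist[i][j])

for i, name in enumerate(names):
    j = nearest(i)
    print('%s -> %s (%.3f)' % (name, names[j], dist[i][j]))
```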