# Let's start
To compare texts, we first have to turn them into feature vectors. Before that, the text needs to be cleaned up and prepared.

# Cleaning

Given a corpus of sentences, let's prepare them for vectorization.

In [26]:
corpus = [
    'This is my life, beautiful life',
    'Life is hard and full of problems',
    'Red fox ran after blue cat'
]


## Stopwords
Stopwords are very common words (such as "is", "my", or "and") that carry little meaning on their own, so they make poor features. Let's remove them.

In [27]:
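# Requires the NLTK stopword list; run nltk.download('stopwords') once if it is missing.
from nltk.corpus import stopwords
stop = set(stopwords.words('english'))

def text_without_stopwords(text):
    # Lowercase, split on whitespace, and keep only the words that are not stopwords.
    without = [word for word in text.lower().split() if word not in stop]
    return ' '.join(without)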

corpus = [text_without_stopwords(sentence) for sentence in corpus]
print(corpus)

['life, beautiful life', 'life hard full problems', 'red fox ran blue cat']
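
One detail worth noticing in the output above: the comma stays attached to `life,` because the text is split on whitespace only. The sketch below illustrates this with a tiny hand-written stopword set (`DEMO_STOPWORDS` and `drop_stopwords` are hypothetical stand-ins, not NLTK's list or API), so it runs without any downloads.

```python
# Hypothetical mini stopword list for illustration only; NLTK's English list is much longer.
DEMO_STOPWORDS = {'this', 'is', 'my', 'and', 'of', 'after'}

def drop_stopwords(text, stopwords=DEMO_STOPWORDS):
    """Lowercase, split on whitespace, and drop tokens found in `stopwords`."""
    return [token for token in text.lower().split() if token not in stopwords]

tokens = drop_stopwords('This is my life, beautiful life')
print(tokens)  # ['life,', 'beautiful', 'life']

# A stopword glued to punctuation is not an exact match, so it survives the filter.
print(drop_stopwords('Hard is, hard was'))  # ['hard', 'is,', 'hard', 'was']
```

Tokens such as `is,` slip through because the match is exact, which is one reason to strip punctuation as well.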


## Tokenize - punctuation and stemming
Some words share the same meaning but have different endings. We also don't want commas and periods to end up in the vectors. So let's stem the words with `nltk` and remove punctuation.

In [28]:
from nltk.stem.porter import PorterStemmer
from nltk import word_tokenize  # needs NLTK's Punkt tokenizer data to be downloaded
from string import punctuation
stemmer = PorterStemmer()

# Translation table that deletes every punctuation character.
remove_punctuation_map = {ord(char): None for char in punctuation}

def stem_tokens(tokens):
    return [stemmer.stem(item) for item in tokens]

def normalize(text):
    # Strip punctuation, split into tokens, then stem each token.
    return stem_tokens(word_tokenize(text.translate(remove_punctuation_map)))

corpus = [' '.join(normalize(sentence)) for sentence in corpus]
print(corpus)

['life beauti life', 'life hard full problem', 'red fox ran blue cat']
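
The punctuation step relies on `str.translate`, where mapping a character code to `None` deletes that character. Here is a minimal standard-library sketch of that mechanism on its own (the names `table_from_dict`, `table_from_maketrans`, and `sample` are illustrative, not part of the notebook):

```python
from string import punctuation

# Mapping a code point to None tells str.translate to delete that character.
table_from_dict = {ord(char): None for char in punctuation}
# str.maketrans with a third argument builds the same kind of deletion table.
table_from_maketrans = str.maketrans('', '', punctuation)

sample = 'life, beautiful life'
cleaned = sample.translate(table_from_dict)
print(cleaned)  # life beautiful life
print(cleaned == sample.translate(table_from_maketrans))  # True
```

Either way of building the table should behave the same; `str.maketrans` is just the shorter spelling.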


# Vectorizing
Now the words are cleaned and prepared. Let's map each word to a number; the resulting vectors can later be processed like any other numeric vectors. One way to do this is TF-IDF: it combines term frequency (how often a word occurs in a document) with inverse document frequency (how rare the word is across all documents) to produce the final vectors.

In [31]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(tokenizer=normalize)
docs_fit = vectorizer.fit(corpus)
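# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
features = docs_fit.get_feature_names_out().tolist()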
tdifd = docs_fit.transform(corpus)
print(features)
print(tdifd)

['beauti', 'blue', 'cat', 'fox', 'full', 'hard', 'life', 'problem', 'ran', 'red']
  (0, 6)	0.835591541945
  (0, 0)	0.549351231026
  (1, 7)	0.52863460666
  (1, 6)	0.402040244161
  (1, 5)	0.52863460666
  (1, 4)	0.52863460666
  (2, 9)	0.4472135955
  (2, 8)	0.4472135955
  (2, 3)	0.4472135955
  (2, 2)	0.4472135955
  (2, 1)	0.4472135955


As we can see, the word `life` has the highest score (0.84) in the first document, where it occurs twice. But in the second document it has a lower score (0.40) compared to other once-occui