# Mystery Friend

You've received an anonymous postcard from a friend who you haven't seen in years. Your friend did not leave a name, but the card is definitely addressed to you. So far, you've narrowed your search down to three friends, based on handwriting:
- Emma Goldman
- Matthew Henson
- TingFang Wu

But which one sent you the card?

Just as a spam filter classifies a message as spam or not spam, you can attribute a piece of writing to one friend or another by building a classifier trained on your friends' writing. You have past writing from all three friends stored in the variable `friends_docs`, which means you can use scikit-learn's bag-of-words vectorizer and Naive Bayes classifier to determine who the mystery friend is!

Ready?

## Feature Vectors Are in the Bag with Scikit-Learn

1. In the code block below, import `CountVectorizer` from `sklearn.feature_extraction.text`. Below it, import `MultinomialNB` from `sklearn.naive_bayes`.

In [5]:
# import sklearn modules here:
from sklearn.naive_bayes import MultinomialNB
# Import CountVectorizer from sklearn:
from sklearn.feature_extraction.text import CountVectorizer

2. Define `bow_vectorizer` as an instance of `CountVectorizer`.

In [6]:
# Create bow_vectorizer:
bow_vectorizer = CountVectorizer()

3. Use your newly minted `bow_vectorizer` to both `fit` (train) and `transform` (vectorize) all your friends' writing (stored in the variable `friends_docs`). Save the resulting vector object as `friends_vectors`.

In [8]:
# import_ipynb lets Python import the .ipynb notebooks below as modules:
import import_ipynb

from goldman_emma_raw import goldman_docs
from henson_matthew_raw import henson_docs
from wu_tingfang_raw import wu_docs

friends_docs = goldman_docs + henson_docs + wu_docs

# Define friends_vectors:
friends_vectors = bow_vectorizer.fit_transform(friends_docs)
print("Friend_vectors ->", friends_vectors)
test_vectors = bow_vectorizer.transform(friends_docs)
print("Test vectors ->", test_vectors)

Friend_vectors ->   (0, 3004)	5
  (0, 1469)	2
  (0, 2103)	4
  (0, 1500)	1
  (0, 1385)	1
  (0, 204)	1
  (0, 862)	1
  (0, 1653)	1
  (0, 285)	1
  (0, 2619)	1
  (0, 3048)	1
  (0, 2996)	1
  (0, 2898)	1
  (0, 1095)	1
  (0, 2046)	1
  (0, 1516)	1
  (0, 1453)	1
  (0, 234)	1
  (0, 441)	1
  (0, 783)	1
  (1, 3004)	5
  (1, 2103)	2
  (1, 204)	1
  (1, 2046)	1
  (1, 1542)	2
  :	:
  (460, 2406)	1
  (460, 546)	1
  (460, 641)	1
  (460, 2746)	1
  (460, 1673)	1
  (460, 3157)	1
  (460, 1912)	1
  (460, 662)	1
  (460, 230)	1
  (460, 614)	2
  (460, 2514)	1
  (460, 1859)	1
  (460, 3094)	1
  (460, 2056)	1
  (460, 341)	1
  (460, 904)	1
  (460, 1193)	1
  (460, 485)	1
  (460, 1110)	1
  (460, 3209)	1
  (460, 616)	1
  (460, 257)	1
  (460, 561)	1
  (460, 2697)	1
  (460, 2623)	1
Test vectors ->   (0, 204)	1
  (0, 234)	1
  (0, 285)	1
  (0, 441)	1
  (0, 783)	1
  (0, 862)	1
  (0, 1095)	1
  (0, 1385)	1
  (0, 1453)	1
  (0, 1469)	2
  (0, 1500)	1
  (0, 1516)	1
  (0, 1653)	1
  (0, 2046)	1
  (0, 2103)	4
  (0, 2619)	1
  (0, 2898
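
Each line of the printed output above has the form `(row, column) count`: the row is the document's position in `friends_docs`, the column is the index of a word in the vectorizer's vocabulary, and the count is how many times that word occurs in that document. The sketch below shows the same idea on a tiny made-up corpus, converting the sparse result to a dense array so it is easier to read. The `toy_` names and sentences are illustrative assumptions and are not part of the project.

```python
from sklearn.feature_extraction.text import CountVectorizer

# A tiny made-up corpus (illustrative only, not the friends' writing).
toy_docs = ["the cat sat", "the cat saw the dog"]

toy_vectorizer = CountVectorizer()
# fit_transform learns the vocabulary and counts words in one step.
toy_vectors = toy_vectorizer.fit_transform(toy_docs)

# vocabulary_ maps each word to its column index; sort words by that index.
vocabulary = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
# toarray() turns the sparse matrix into one dense row per document.
dense_counts = toy_vectors.toarray()

print(vocabulary)    # ['cat', 'dog', 'sat', 'saw', 'the']
print(dense_counts)  # [[1 0 1 0 1]
                     #  [1 1 0 1 2]]
```

Converting to a dense array is only practical for small examples like this one; with thousands of vocabulary words, the sparse form printed above is far more compact.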

4. Create a new variable `mystery_vector`. Assign to it the vectorized form of `[mystery_postcard]` using the vectorizer's `.transform()` method.

   (`mystery_postcard` is a string, while the vectorizer expects a list as an argument.)

In [15]:
mystery_postcard = """
Henrik Ibsen, the hater of all social shams, was probably the first to realize this great truth. 
Nora leaves her husband, not—as the stupid critic would have it—because she is tired of her responsibilities or 
feels the need of woman's rights, but because she has come to know that for eight years she had lived with a 
stranger and borne him children. Can there be anything more humiliating, more degrading than a life-long 
proximity between two strangers? No need for the woman to know anything of the man, save his income. 
As to the knowledge of the woman—what is there to know except that she has a pleasing appearance? 
We have not yet outgrown the theologic myth that woman has no soul, that she is a mere appendix to man, 
made out of his rib just for the convenience of the gentleman who was so strong that he was afraid of his own shadow.
"""
# Define mystery_vector:
mystery_vector = bow_vectorizer.transform([mystery_postcard])
print("Mystery vector ->", mystery_vector)

Mystery vector ->   (0, 144)	1
  (0, 171)	1
  (0, 204)	1
  (0, 218)	2
  (0, 221)	1
  (0, 225)	1
  (0, 265)	2
  (0, 335)	1
  (0, 350)	2
  (0, 378)	1
  (0, 416)	1
  (0, 469)	1
  (0, 481)	1
  (0, 543)	1
  (0, 601)	1
  (0, 704)	1
  (0, 748)	1
  (0, 810)	1
  (0, 992)	1
  (0, 1109)	1
  (0, 1196)	1
  (0, 1221)	1
  (0, 1238)	3
  (0, 1313)	1
  (0, 1369)	1
  :	:
  (0, 2782)	1
  (0, 2805)	1
  (0, 2885)	1
  (0, 2886)	1
  (0, 2894)	1
  (0, 2903)	1
  (0, 3002)	1
  (0, 3003)	5
  (0, 3004)	11
  (0, 3010)	1
  (0, 3012)	2
  (0, 3029)	1
  (0, 3050)	1
  (0, 3052)	6
  (0, 3107)	1
  (0, 3122)	1
  (0, 3251)	3
  (0, 3263)	1
  (0, 3280)	1
  (0, 3295)	1
  (0, 3323)	1
  (0, 3328)	4
  (0, 3349)	1
  (0, 3364)	1
  (0, 3366)	1
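
Two details of `.transform()` are easy to miss: it reuses the vocabulary learned during fitting, so words the vectorizer has never seen are simply dropped, and it expects an iterable of documents rather than a single string. The sketch below illustrates both on a made-up example. The `toy_` names and sentences are assumptions for illustration, and the `ValueError` on a bare string reflects the behavior of `CountVectorizer` as far as I know it.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Fit on a tiny made-up corpus (illustrative only).
toy_vectorizer = CountVectorizer()
toy_vectorizer.fit(["the cat sat", "the dog ran"])
# Learned vocabulary, in column order: cat, dog, ran, sat, the

new_text = "the cat chased the zebra"

# Wrap the string in a list: one document in, one row out.
new_vector = toy_vectorizer.transform([new_text])
new_counts = new_vector.toarray().tolist()
print(new_counts)  # [[1, 0, 0, 0, 2]]  ("chased" and "zebra" are ignored)

# Passing the bare string instead of a list is rejected.
try:
    toy_vectorizer.transform(new_text)
    bare_string_accepted = True
except ValueError:
    bare_string_accepted = False
print(bare_string_accepted)  # False
```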


## This Mystery Friend Gets Classified

5. You've vectorized and prepared all the documents. Let's take a look at your friends' writing samples to get a sense of how they write.

   Print one document from each friend's writing; try any index between `0` and `140`. (Your friends' documents are stored in `goldman_docs`, `henson_docs`, and `wu_docs`.)

In [10]:
# Print out a document from each friend:
print("This is a postcard from Goldman \n▼\n", goldman_docs[0])
print("▲")
print("\nThis is a postcard from Henson \n▼\n", henson_docs[0])
print("▲")
print("\nThis is a postcard from Wu \n▼\n", wu_docs[0])
print("▲")

This is a postcard from Goldman 
▼
 
The history of human growth and development is at the same time the
history of the terrible struggle of every new idea heralding the
approach of a brighter dawn
▲

This is a postcard from Henson 
▼
 
When the news of the discovery of the North Pole, by Commander Peary,
was first sent to the world, a distinguished citizen of New York City,
well versed in the affairs of the Peary Arctic Club, made the statement,
that he was sure that Matt Henson had been with Commander Peary on the
day of the discovery
▲

This is a postcard from Wu 
▼
 
The Importance of Names

  "What's in a name?  That which we call a rose
  By any other name would smell as sweet."


Notwithstanding these lines, I maintain that the selection of names is
important
▲


6. Have an inkling about which friend wrote the mystery card? We can use a classifier to confirm those suspicions...

   Implement a Naive Bayes classifier using `MultinomialNB`. Save the result to `friends_classifier`.

In [11]:
# Define friends_classifier:
friends_classifier = MultinomialNB()


7. Train `friends_classifier` on `friends_vectors` and `friends_labels` using the classifier's `.fit()` method.

In [12]:
# One label per document, in the same order as friends_docs:
friends_labels = ["Emma"] * len(goldman_docs) + ["Matthew"] * len(henson_docs) + ["Tingfang"] * len(wu_docs)

# Train the classifier:
friends_classifier.fit(friends_vectors, friends_labels)


8. Change the value of `predictions` from `["None Yet"]` to the classifier's prediction about which friend wrote the postcard. You can do this by calling the classifier's `.predict()` method on `mystery_vector`.

In [16]:
# Change predictions:
predictions = friends_classifier.predict(mystery_vector)
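
To see the whole vectorize, train, and predict flow in one place, here is a minimal sketch on a made-up four-document corpus. The documents, the labels `"explorer"` and `"activist"`, and the `toy_` names are all assumptions for illustration; the real project uses `friends_docs` and `friends_labels` instead.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training documents and one label per document, in the same order.
toy_docs = [
    "ice snow sledge arctic",
    "snow ice expedition",
    "liberty freedom society",
    "freedom society justice",
]
toy_labels = ["explorer"] * 2 + ["activist"] * 2

# Learn the vocabulary and build the training count vectors.
toy_vectorizer = CountVectorizer()
toy_vectors = toy_vectorizer.fit_transform(toy_docs)

# Train the classifier on the vectors and their labels.
toy_classifier = MultinomialNB()
toy_classifier.fit(toy_vectors, toy_labels)

# Vectorize an unseen text with the SAME vectorizer, then predict.
unseen_vector = toy_vectorizer.transform(["the sledge crossed the ice and snow"])
toy_predictions = toy_classifier.predict(unseen_vector)
print(toy_predictions[0])  # explorer
```

Note that `.predict()` returns an array with one label per input row, which is why the project code reads `predictions[0]`.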

## Mystery Revealed!

9. Uncomment the final print statement and run the code block below to see who your mystery friend was all along!

In [17]:
mystery_friend = predictions[0] if predictions[0] else "someone else"

# Uncomment the print statement:
print("The postcard was from {}!".format(mystery_friend))

The postcard was from Emma!


10. But does it really work? Find some lines by Emma Goldman, Matthew Henson, and TingFang Wu on <a href="http://www.gutenberg.org" target="_blank">gutenberg.org</a> and save them to `mystery_postcard` to see how the classifier holds up!

    Try using the `.predict_proba()` method instead of `.predict()` and print out `predictions` to see the estimated probabilities that the `mystery_postcard` was written by each person.
   
    What happens when you add in a recent email or text instead?
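
As a rough illustration of what `.predict_proba()` returns, the sketch below pairs each probability with its class name using the classifier's `classes_` attribute. It also hints at what may happen with unrelated text: the classifier can only choose among the authors it was trained on, and when none of the words are in its vocabulary, the estimate falls back to the class priors. The corpus, labels, and `toy_` names are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up, balanced training data (two documents per class).
toy_docs = [
    "ice snow sledge arctic",
    "snow ice expedition",
    "liberty freedom society",
    "freedom society justice",
]
toy_labels = ["explorer"] * 2 + ["activist"] * 2

toy_vectorizer = CountVectorizer()
toy_classifier = MultinomialNB()
toy_classifier.fit(toy_vectorizer.fit_transform(toy_docs), toy_labels)

def class_probabilities(text):
    """Return a {class name: estimated probability} dict for one text."""
    row = toy_classifier.predict_proba(toy_vectorizer.transform([text]))[0]
    # Columns of predict_proba follow the order of classifier.classes_.
    return dict(zip(toy_classifier.classes_, row))

familiar = class_probabilities("ice and snow")
unrelated = class_probabilities("see you at lunch tomorrow")

print(familiar)   # "explorer" gets the larger share
print(unrelated)  # no known words, so each class gets its prior (0.5 here)
```

Because the probabilities always sum to 1 across the known classes, a high value means "most likely among these authors", not "definitely written by this author".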