In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

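To work with the documents we want to load, you will want to study the documentation for:

* [`from sklearn.feature_extraction.text import TfidfVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)
* [`from sklearn.naive_bayes import MultinomialNB`](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html)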

Our test data will be given by the following dictionary:

In [2]:
test_map = {
    'a': # Plato
        [
        'data/plato/test/pg1726.txt', # Title: Cratylus
        'data/plato/test/pg1616.txt', # Title: Ion
        'data/plato/test/pg1735.txt', # Title: Theaetetus 
        'data/plato/test/pg1635.txt'  # Title: Sophist
        ],
    'b': 
        [ # Hume
        'data/hume/test/pg59792-0.txt', # Title: Hume's Political Discourses
        'data/hume/test/pg62856-0.txt', # Title: A Treatise of Human Nature Being an Attempt to Introduce the Experimental Method into Moral Subjects
        'data/hume/test/pg9662.txt',    # Title: An Enquiry Concerning Human Understanding
        ],
    'c':
        [ # Aristotle
        'data/aristotle/test/pg59058.txt', # Title: Aristotle's History of Animals In Ten Books
        'data/aristotle/test/pg2412.txt',  # Title: The Categories
        'data/aristotle/test/pg6762.txt',  # Title: Politics A Treatise on Government
        'data/aristotle/test/pg1974.txt',  # Title: Poetics
        ]
}

In [3]:
training_map = {
    'a': 
        [ # Plato
        'data/plato/train/pg1750.txt', # Laws
        'data/plato/train/pg1497.txt', # The Republic
        'data/plato/train/pg1600.txt', # Symposium
        ],
    'b':
        [ # Hume
        'data/hume/train/pg10574.txt', # The History of England, Volume I
        'data/hume/train/pg4705.txt',  # A Treatise of Human Nature
        'data/hume/train/pg36120.txt', # Essays
        ],
    'c':
        [ # Aristotle
        'data/aristotle/train/pg8438.txt', # Ethics
        'data/aristotle/train/pg26095.txt',# The Athenian Constitution
        'data/aristotle/train/pg6763.txt'  # The Poetics
        ]
}

In [4]:
files_train = []
y_train = []

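for label, paths in training_map.items():
    files_train.extend(paths)
    # one label per file; [label] * n stays correct for multi-character labels,
    # whereas label * n builds a single string that extend() splits into characters
    y_train.extend([label] * len(paths))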

In [5]:
files_train

['data/plato/train/pg1750.txt',
 'data/plato/train/pg1497.txt',
 'data/plato/train/pg1600.txt',
 'data/hume/train/pg10574.txt',
 'data/hume/train/pg4705.txt',
 'data/hume/train/pg36120.txt',
 'data/aristotle/train/pg8438.txt',
 'data/aristotle/train/pg26095.txt',
 'data/aristotle/train/pg6763.txt']

In [6]:
y_train

['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c']

# PART I

Use the dictionary in the variable `training_map`. Your function will take the files (in the order they appear in `training_map`) and pass their data into the [`TfidfVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) vectorizer. You will need to set two constructor parameters: `input` to `'file'` and `stop_words` to `'english'` (i.e. initialize the vectorizer as `TfidfVectorizer(input='file', stop_words='english')`).

* **You will just need to show the new function and the initialization of the vectorizer in this step.**  This will be one or two cells at most.
* You will call `fit_transform()`, passing it a list of the training file objects.
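
To make concrete what `fit_transform()` computes, here is a minimal pure-Python sketch of TF-IDF weighting. It assumes scikit-learn's default settings (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, L2-normalized rows) and skips tokenization and stop-word removal. The function name `tfidf_rows` and the toy token lists are made up for illustration.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    # document frequency: in how many documents does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    # smoothed inverse document frequency (assumed to match scikit-learn's default)
    idf = {term: math.log((1 + n) / (1 + d)) + 1 for term, d in df.items()}
    rows = []
    for doc in docs:
        counts = Counter(doc)
        raw = {term: count * idf[term] for term, count in counts.items()}
        # L2-normalize so every document vector has length 1 (guard for empty docs)
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        rows.append({term: w / norm for term, w in raw.items()})
    return rows

# Toy, already-tokenized documents (hypothetical data)
docs = [["soul", "virtue", "soul"], ["virtue", "reason"], ["reason", "impression"]]
weights = tfidf_rows(docs)
```

Terms that are frequent within a document but rare across documents receive the largest weights, which is what makes them useful for telling authors apart.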


In [7]:
def your_pt1_function(files_train):
    vectorizer = TfidfVectorizer(input='file', stop_words='english')
    
    # call vectorizer.fit_transform on the list of FILE OBJECTS  
    
    return vectorizer

Here is an example to get a list of file objects from our `files_train`:

In [8]:
[open(f) for f in files_train]

[<_io.TextIOWrapper name='data/plato/train/pg1750.txt' mode='r' encoding='UTF-8'>,
 <_io.TextIOWrapper name='data/plato/train/pg1497.txt' mode='r' encoding='UTF-8'>,
 <_io.TextIOWrapper name='data/plato/train/pg1600.txt' mode='r' encoding='UTF-8'>,
 <_io.TextIOWrapper name='data/hume/train/pg10574.txt' mode='r' encoding='UTF-8'>,
 <_io.TextIOWrapper name='data/hume/train/pg4705.txt' mode='r' encoding='UTF-8'>,
 <_io.TextIOWrapper name='data/hume/train/pg36120.txt' mode='r' encoding='UTF-8'>,
 <_io.TextIOWrapper name='data/aristotle/train/pg8438.txt' mode='r' encoding='UTF-8'>,
 <_io.TextIOWrapper name='data/aristotle/train/pg26095.txt' mode='r' encoding='UTF-8'>,
 <_io.TextIOWrapper name='data/aristotle/train/pg6763.txt' mode='r' encoding='UTF-8'>]

Now you only need to pass such a list to the `fit_transform()` call inside your function.
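
The list comprehension above leaves every file open after it has been read. One possible way to avoid that is sketched below, using `contextlib.ExitStack` from the standard library. The helper name `with_open_files` is hypothetical, and the `utf-8` encoding is an assumption about the text files.

```python
from contextlib import ExitStack

def with_open_files(paths, consume, encoding="utf-8"):
    """Open every path, hand the list of file objects to `consume`, then close them all."""
    with ExitStack() as stack:
        handles = [stack.enter_context(open(p, encoding=encoding)) for p in paths]
        # all files are still open here; they are closed when the with-block exits
        return consume(handles)

# Hypothetical usage inside your Part I function:
#   X_train = with_open_files(files_train, vectorizer.fit_transform)
```

Because `consume` runs inside the `with` block, the vectorizer can read each file before any of them is closed.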

# PART II

Now that you have a vectorizer that builds the TF-IDF matrix (one row per document, one column per word),
you can move on to the training phase for the naive Bayes classifier. Look in the sample notebook for guidance.
You will take as input the vectorizer output (the documents vectorized by TF-IDF) and the corresponding
classes (in the same order as the files, as in `y_train`) and pass both into the
[`MultinomialNB.fit()`](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB.fit) method.

* **Show the initialization of your `MultinomialNB()` classifier and the application of the `fit()` method.**
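
As a rough illustration of the vectorize-then-fit flow, here is a sketch on a made-up toy corpus. It passes strings directly (the default `input='content'`) instead of file objects, and the names `toy_vectorizer`, `toy_clf`, and the labels `'x'` and `'y'` are placeholders, not part of the assignment.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy corpus (made up); labels follow the same order as the documents.
docs = ["cat cat dog", "dog cat cat", "stock market price", "price market stock trade"]
labels = ["x", "x", "y", "y"]

toy_vectorizer = TfidfVectorizer(stop_words="english")  # input='content' is the default
X = toy_vectorizer.fit_transform(docs)  # sparse matrix, one row per document

toy_clf = MultinomialNB()
toy_clf.fit(X, labels)  # rows of X and entries of labels must line up

# New text goes through transform(), not fit_transform(), so the learned vocabulary is reused.
prediction = toy_clf.predict(toy_vectorizer.transform(["dog cat"]))[0]
```

The same pattern should carry over to the assignment once the matrix comes from your Part I vectorizer and the labels from `y_train`.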


In [9]:
# initialize MultinomialNB (one line) (e.g. clf = ???)


# e.g. clf.fit( with_the_appropriate_parameters )



# PART III

Once you have the classifier, you will need to convert a test file using
the vectorizer from part I.  Then you will execute the `predict()` 
method of your classifier.

Assuming `vectorizer` is your TF-IDF vectorizer from Part I and `clf` is your
classifier from Part II, your code could be modeled after this:

```python
x_test = vectorizer.transform([open("data/aristotle/test/pg2412.txt")])

# should be class 'c' (Aristotle)!
clf.predict(x_test)
```

In [10]:
def your_pt3_function(a_vectorized_test_document, a_classifier):
    # pred = a_classifier.predict(a_vectorized_test_document)
    
    return pred[0]  # the class

To test all the documents, your code might look like this:

In [11]:
files_test = [
   ("data/philosopher_name/test/filename.txt", 'a'), # the class should match the file (e.g. Hume is 'b') 
   # add all the remaining files
]
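
Rather than typing every pair by hand, the list can be derived from `test_map`. The sketch below uses a shortened excerpt of the map so that it runs on its own; the helper name `labelled_pairs` and the `_excerpt` variables are hypothetical.

```python
# Shortened excerpt of test_map (paths copied from the dictionary above)
test_map_excerpt = {
    'a': ['data/plato/test/pg1726.txt', 'data/plato/test/pg1616.txt'],
    'b': ['data/hume/test/pg9662.txt'],
}

def labelled_pairs(class_map):
    """Flatten {label: [paths]} into [(path, label), ...], keeping dictionary order."""
    return [(path, label) for label, paths in class_map.items() for path in paths]

files_test_excerpt = labelled_pairs(test_map_excerpt)
```

Calling `labelled_pairs(test_map)` on the full dictionary would produce a list in the shape `files_test` expects.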

In [12]:
for f, cls_expected in files_test:
    # x_test = vectorizer.transform([open(f)])
    
    # pred = your_pt3_function(x_test, clf)
    
    # print(f"{f}: {cls_expected == pred}")
    
    pass # remove
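
To summarize the results instead of only printing one line per file, the loop can also count correct predictions. This is a sketch under assumptions: `score_predictions` and `stub_predict` are hypothetical names, and the stub simply guesses from the path so that the example runs without a trained classifier.

```python
def score_predictions(pairs, predict_one):
    """Print one line per item and return (n_correct, n_total)."""
    results = [(item, expected, predict_one(item)) for item, expected in pairs]
    for item, expected, predicted in results:
        status = "ok" if predicted == expected else f"expected {expected!r}, got {predicted!r}"
        print(f"{item}: {status}")
    n_correct = sum(expected == predicted for _, expected, predicted in results)
    return n_correct, len(results)

# Stand-in predictor for demonstration only; it guesses the label from the path.
def stub_predict(path):
    return 'a' if 'plato' in path else 'b'

correct, total = score_predictions(
    [("data/plato/test/pg1616.txt", 'a'), ("data/aristotle/test/pg2412.txt", 'c')],
    stub_predict,
)
```

In the real notebook, `predict_one` would open the file, run it through `vectorizer.transform()`, and call your Part III function.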